# Implement Baseline with Most Popular Items and Local Validation template

This notebook implements a very naive baseline with a solid local validation strategy. My goal was not to achieve high accuracy, but to provide a template to have a good starting point with best practices for your further experiments. So all you need to do is to copy & edit this notebook and start kaggling!

The main idea of the baseline is to recommend the most frequently bought items over a fixed window prior to the test period. I show how to choose the best window length using time-series cross-validation. This serves as an example of a local validation strategy that averages the competition's evaluation metric over several folds, which makes the results more reliable than a single train/validation split.

It also implements:
* [trick to reduce transactions dataframe's overall memory](https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/discussion/308635)
* [solid local validation strategy](https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/discussion/308919)
* [evaluation metric MAP@K](https://raw.githubusercontent.com/benhamner/Metrics/master/Python/ml_metrics/average_precision.py)
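
To give a feel for why the memory trick matters, here is a minimal sketch on a toy dataframe. The toy values are made up, and the `uint64` view is my own way of keeping the conversion well-defined in the sketch; it is not necessarily identical to what the linked discussion does.

```python
import hashlib

import numpy as np
import pandas as pd

# Toy stand-in for the transactions file (hypothetical values):
# 64-character hex customer ids and zero-padded article ids, both as strings.
customer_ids = [hashlib.sha256(str(i).encode()).hexdigest() for i in range(1000)]
toy = pd.DataFrame({
    "customer_id": np.repeat(customer_ids, 5),
    "article_id": ["0108775015", "0706016001", "0372860001",
                   "0610776002", "0759871002"] * 1000,
})
bytes_before = toy.memory_usage(deep=True).sum()

# Keep only the last 16 hex digits (64 bits) of each customer id and store
# them as integers; store article ids as int32 instead of strings.
last_64_bits = np.array(
    [int(c[-16:], 16) for c in toy["customer_id"]], dtype=np.uint64
)
compact = pd.DataFrame({
    "customer_id": last_64_bits.view(np.int64),  # reinterpret the bits as int64
    "article_id": toy["article_id"].astype("int32"),
})
bytes_after = compact.memory_usage(deep=True).sum()

print(f"before: {bytes_before:,} bytes, after: {bytes_after:,} bytes")
```

The integer columns take 8 and 4 bytes per row, while the string columns cost far more per row. The leading `0` of the article id is lost in the process, so it has to be added back when writing the submission.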

In [None]:
import numpy as np 
import pandas as pd 
import gc
import random

from pathlib import Path
data_path = Path('/kaggle/input/h-and-m-personalized-fashion-recommendations/')

## Read input data

In [None]:
transactions = pd.read_csv(
    data_path / 'transactions_train.csv',
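    # read article_id as str so pandas does not silently parse it; it is cast to int32
    # below to save memory, and the leading '0' is restored when building the submission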
    dtype={'article_id': str} 
)

transactions = transactions[['t_dat','customer_id','article_id']]
transactions['t_dat'] = pd.to_datetime(transactions['t_dat'])
transactions['customer_id'] = transactions['customer_id'].apply(lambda x: int(x[-16:],16) ).astype('int64')
transactions['article_id'] = transactions['article_id'].astype('int32')
transactions.sort_values(by=['t_dat', 'customer_id'], inplace=True)

_ = gc.collect()

print(transactions['t_dat'].min(), transactions['t_dat'].max())
print("Num of rows in training data:", transactions.shape[0])
transactions.head()

## Evaluation Metric MAP@k

From https://raw.githubusercontent.com/benhamner/Metrics/master/Python/ml_metrics/average_precision.py
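
As a quick illustration of what the metric rewards, here is a small worked example. `average_precision_at_k` is a hypothetical compact restatement written only for this example, and the item labels are made up.

```python
def average_precision_at_k(actual, predicted, k=12):
    """Compact AP@k for illustration: precision is accumulated at each hit."""
    hits, score = 0, 0.0
    for rank, item in enumerate(predicted[:k], start=1):
        # Count an item only the first time it appears in the ranking.
        if item in actual and item not in predicted[:rank - 1]:
            hits += 1
            score += hits / rank
    return score / min(len(actual), k) if actual else 0.0


bought = ["A", "B", "C"]

# Hits at ranks 1 and 3: (1/1 + 2/3) / 3
early_hits = average_precision_at_k(bought, ["A", "X", "B", "Y"])

# Same hits, but at ranks 2 and 4: (1/2 + 2/4) / 3
late_hits = average_precision_at_k(bought, ["X", "A", "Y", "B"])

print(round(early_hits, 4), round(late_hits, 4))
```

Both rankings contain the same two correct items, but the one that places them earlier scores higher, so the order of the 12 recommendations matters.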

In [None]:
def apk(actual, predicted, k=10):
    """
    Computes the average precision at k.

    This function computes the average precision at k between two lists of
    items.

    Parameters
    ----------
    actual : list
             A list of elements that are to be predicted (order doesn't matter)
    predicted : list
                A list of predicted elements (order does matter)
    k : int, optional
        The maximum number of predicted elements

    Returns
    -------
    score : double
            The average precision at k over the input lists

    """
    if len(predicted)>k:
        predicted = predicted[:k]

    score = 0.0
    num_hits = 0.0

    for i,p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i+1.0)

    if not actual:
        return 0.0

    return score / min(len(actual), k)

def mapk(actual, predicted, k=10):
    """
    Computes the mean average precision at k.

    This function computes the mean average precision at k between two lists
    of lists of items.

    Parameters
    ----------
    actual : list
             A list of lists of elements that are to be predicted 
             (order doesn't matter in the lists)
    predicted : list
                A list of lists of predicted elements
                (order matters in the lists)
    k : int, optional
        The maximum number of predicted elements

    Returns
    -------
    score : double
            The mean average precision at k over the input lists

    """
    return np.mean([apk(a,p,k) for a,p in zip(actual, predicted)])

## Baseline - Recommend 12 most popular articles of last X weeks

Recommend the most popular articles (i.e. the most frequently bought) over the last X weeks before the start of the test period. I'll try window lengths of 1, 2, 3, 4, 5, and 10 weeks to find the best-performing one. The validation period is the 7-day window that follows the last day of training data.

In [None]:
class BaselineMostPopularArticles():
    
    def __init__(self, k=12):
        self._k_most_popular_items = []
        self._k = k
    
    def fit(self, X: pd.DataFrame):
        """
            X is a DataFrame with purchase granularity (date, user, item) i.e. columns 
                ['t_dat', 'customer_id', 'article_id', ...].
            To fit the baseline model, it counts the total amount of times an article has been purchased
            over all the dataframe, and stores the k most popular items in memory.
        """
        # value_counts() sorts by frequency (descending), so head(k) gives the k most popular
        self._k_most_popular_items = X['article_id'].value_counts().head(self._k).index.tolist()
    
    def predict(self, for_submission=False):
        if not for_submission: 
            return self._k_most_popular_items
        else: 
            return " ".join(['0' + str(item) for item in self._k_most_popular_items])

### Local Validation strategy

Depending on the number of weeks used as training data, the cross-validation strategy should follow a traditional time-series train-test split over 5 different periods. For example, for a 2-week training window:

1. Fold 1 -> Weeks 1, 2 for training and Week 3 for validation
2. Fold 2 -> Weeks 2, 3 for training and Week 4 for validation
3. Fold 3 -> Weeks 3, 4 for training and Week 5 for validation
4. ...

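and so on. You might want the folds to fall in different seasons (e.g. 4 folds, one in each season: winter, spring, summer and autumn), or to leave a considerable gap between folds, to cover as many different test distributions as possible. Note that the code below numbers weeks backward from the end of the dataset (week 0 is the last week), so each training window has larger week numbers than its validation week.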
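
The fold layout can be sketched as follows. `make_folds` is a hypothetical helper written only for this illustration, and it assumes weeks are counted backward from the end of the dataset (week 0 is the most recent week), so the validation week is always the one just after the training window in time.

```python
def make_folds(week_hist, start_weeks):
    """Return (train_weeks, valid_week) pairs; week 0 is the most recent week."""
    folds = []
    for start_week in start_weeks:
        valid_week = start_week - 1  # the week right after the training window
        train_weeks = list(range(start_week, start_week + week_hist))
        folds.append((train_weeks, valid_week))
    return folds


for train_weeks, valid_week in make_folds(week_hist=2, start_weeks=[1, 5, 10]):
    print(f"train on weeks {train_weeks}, validate on week {valid_week}")
```

Spacing the start weeks apart, as in this example, keeps the validation weeks of different folds from overlapping with each other.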

In [None]:
# extract week number without using repeating weeks, week 0 is last week of dataset
transactions["week"] = (transactions["t_dat"].max() - transactions["t_dat"]).dt.days // 7

### Run the experiments!

In [None]:
WEEK_HISTS = [1, 2, 3, 4, 5, 10] # the training window length for each experiment
N_FOLDS = 5 # number of folds for cross-validation
WEEK_NUMBERS = sorted(transactions['week'].unique().tolist())[1:] # 1: to avoid taking week 0
START_WEEK_FOLD = [1, 5, 10, 15, 20] # valid week for fold i will be START_WEEK_FOLD[i] - 1

VERBOSE = 0 # 0 will show only aggregated results from cross-validation, 1 will show results in each fold

print(f"nfolds={N_FOLDS} start_week_folds={START_WEEK_FOLD} \n")

for week_hist in WEEK_HISTS:

    print(f"Experiment: week_hist={week_hist}")
    metrics = []
    for fold, start_week in enumerate(START_WEEK_FOLD):
        
        week_valid = [start_week - 1]
        week_train = list(range(start_week, start_week + week_hist))
        
        train = transactions[transactions['week'].isin(week_train)].copy()
        valid = transactions[transactions['week'].isin(week_valid)].copy()
        
        ## TEST YOUR MODEL HERE
        
        baseline = BaselineMostPopularArticles(k=12)
        baseline.fit(train)

        valid_grouped = valid.groupby(['customer_id'])['article_id'].apply(list).reset_index()
        valid_grouped['article_preds'] = valid_grouped.apply(lambda x: baseline.predict(), axis=1)
        
        loss_at_fold = mapk(valid_grouped['article_id'], valid_grouped['article_preds'], k=12)
        metrics.append(loss_at_fold)
        
        if VERBOSE == 1:
            print(f"\tFold={fold}, week_train={week_train}, week_valid={week_valid}")
            print("\tTrain window:", train['t_dat'].min(), train['t_dat'].max())
            print("\tValid window:", valid['t_dat'].min(), valid['t_dat'].max())
            print("\tMAP@12 = ", loss_at_fold)
            print()
    
    print("Results MAP@12 => ", np.mean(metrics))
    print()

The best result for our naive baseline comes from using only last week's most popular items. Let's create a submission with this model:

## Generate Submission (0.0071 LB score)

In [None]:
train = transactions[transactions['week'] == 0].copy()
baseline = BaselineMostPopularArticles(k=12)
baseline.fit(train)

submission = pd.read_csv(data_path / 'sample_submission.csv')
submission['prediction'] = baseline.predict(for_submission=True)
submission.head()

In [None]:
submission.to_csv("submission.csv", index=False)

# Next steps 

In this notebook, I have shown how to get started in this competition with a simple baseline and a thorough cross-validation strategy to test your experiments. My recommendations for using this notebook and getting started with your next experiments are:

* Run this notebook and get familiar with the process, data, and prediction style.
* Since reading the input data is one of the execution-time bottlenecks of this notebook, I recommend saving the transactions dataframe (after reducing memory) as a parquet file to speed up loading later on. You can use the pandas `to_parquet()` method for that.
* The goal should be to have an end-to-end pipeline with which you can test experiments as fast as possible.

**How to improve from here?**

Check out these notebooks implementing baselines from other kagglers. They helped me get a better picture of the data format and how to deal with the dataset:

* [Recommend Items Purchased Together LB Score 0.021](https://www.kaggle.com/code/cdeotte/recommend-items-purchased-together-0-021/notebook)
* [Time is our Best Friend v2](https://www.kaggle.com/code/hengzheng/time-is-our-best-friend-v2/notebook)
* [This discussion](https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/discussion/307288) is worth reading to get an understanding on how to approach this competition.

My ideas on how to improve from here are:

* Implement classic Item-based Collaborative Filtering approaches. Can we model similar users and similar articles solely based on purchase history?
* Integrate customer and item metadata into the model, assuming it might help with the cold start problem (customers who bought zero or few articles in the training dataset).
* How can we adapt the model to include time sensitive data? 
* [Advanced] use NLP and Computer Vision pretrained models to map article description and images into latent vector spaces to enhance item metadata.
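
As a starting point for the first idea, here is a minimal sketch of item-based similarity computed from purchase history alone. The toy purchases and the `similar_articles` helper are made up for illustration, and a dense matrix like this would not scale to the full dataset without sparse matrices or candidate filtering.

```python
import numpy as np
import pandas as pd

# Hypothetical toy purchases: one row per (customer, article) purchase.
purchases = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3, 3],
    "article_id":  [10, 20, 10, 20, 20, 30],
})

# Binary customer x article matrix: 1 if the customer ever bought the article.
matrix = pd.crosstab(purchases["customer_id"], purchases["article_id"]).clip(upper=1)

# Cosine similarity between article columns.
values = matrix.to_numpy(dtype=float)
norms = np.linalg.norm(values, axis=0)
similarity = (values.T @ values) / np.outer(norms, norms)
item_sim = pd.DataFrame(similarity, index=matrix.columns, columns=matrix.columns)


def similar_articles(article_id, n=2):
    """Return the n articles most similar to article_id, excluding itself."""
    scores = item_sim[article_id].drop(article_id)
    return scores.sort_values(ascending=False).head(n).index.tolist()


print(similar_articles(10, n=1))
```

Articles bought by the same customers end up with a high similarity, which can then be used to recommend articles similar to a customer's recent purchases.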

# Thanks for reading!

If you've gotten this far, congrats! I hope you enjoyed this notebook. If so, please upvote and comment your thoughts. Good luck and have fun!