### Abstract

In this notebook, we aim to solve the H&M Recommendations Challenge. We approach this problem by building an item-based collaborative filtering system and combining it with a time-decaying popularity benchmark model, which serves as a fallback when we do not have any purchase history for a user.

In [None]:
import numpy as np
import pandas as pd
from tqdm import tqdm

### Import data

First, we will import the data. We use only the transactions data for training, and the sample submission for the list of customers we need to predict for.

In [None]:
# Load data from different CSV files
transactions_df = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv")
submission_df = pd.read_csv('../input/h-and-m-personalized-fashion-recommendations/sample_submission.csv')

### Train dataset

We use this function to filter out articles that have not been bought frequently since they provide little to no value to our model.

In [None]:
def get_most_bought_articles(data, num_articles=5):
    # Create dataframe that contains the number of times each article has been bought
    articles_counts = data[['article_id', 't_dat']].groupby('article_id').count().reset_index().rename(columns={'t_dat': 'count'})
    articles_counts = articles_counts.sort_values(by='count', ascending=False)
        
    most_bought_articles = articles_counts.loc[articles_counts['count'] >= num_articles]['article_id'].values
    
    return most_bought_articles
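
As a quick sanity check of the frequency filter described above, the same idea can be tried on a toy frame. This is a minimal sketch, not the function used in the notebook: `frequent_articles` is a hypothetical helper built on `value_counts`, and the data is made up.

```python
import pandas as pd

# Toy transactions (made-up data, for illustration only)
toy_df = pd.DataFrame({
    "article_id": [101, 101, 101, 202, 202, 303],
    "t_dat": ["2020-09-01"] * 6,
})

def frequent_articles(data, min_count):
    """Return ids of articles bought at least `min_count` times (hypothetical helper)."""
    counts = data["article_id"].value_counts()
    return counts[counts >= min_count].index.to_numpy()

# Article 303 was bought only once, so it is dropped with min_count=2
kept = sorted(int(a) for a in frequent_articles(toy_df, min_count=2))
print(kept)
```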

In this part, we will filter the transactions by date, since we only want to use recent items for our prediction. We have fixed the range to around 2.5 months.

In [None]:
# Create training dataset with positive examples.
# The training data contains all transactions from 2020-07-08 (inclusive)
# to 2020-09-23 (exclusive).
# Only items that have been bought at least 10 times are kept. Also, we
# only compute the information for the customers that appear
# in these transactions.
start_date = pd.to_datetime('2020-07-08')
end_date = pd.to_datetime('2020-09-23')

filtered_transactions_df = transactions_df.copy()
filtered_transactions_df.t_dat = pd.to_datetime(filtered_transactions_df.t_dat)
filtered_transactions_df = filtered_transactions_df.loc[filtered_transactions_df.t_dat >= start_date]
filtered_transactions_df = filtered_transactions_df.loc[filtered_transactions_df.t_dat < end_date]

train_df = filtered_transactions_df.copy()

most_bought_articles = get_most_bought_articles(train_df, num_articles=10)
most_bought_articles = np.sort(most_bought_articles)

train_df = train_df.drop(train_df.loc[~train_df.article_id.isin(most_bought_articles)].index)
filtered_transactions_df = filtered_transactions_df.drop(filtered_transactions_df.loc[~filtered_transactions_df.article_id.isin(most_bought_articles)].index)

recent_customers = train_df.loc[train_df.article_id.isin(most_bought_articles)].customer_id.unique()
recent_customers = np.sort(recent_customers)

num_articles = len(most_bought_articles)
num_customers = len(recent_customers)

# Create dictionaries with mapping keys
articles_id_to_idx = dict(zip(most_bought_articles, range(num_articles)))
customers_id_to_idx = dict(zip(recent_customers, range(num_customers)))

train_df = train_df.loc[train_df['article_id'].isin(most_bought_articles)]
train_df = train_df[['customer_id', 'article_id']]

train_df['article_id'] = train_df['article_id'].apply(lambda x: articles_id_to_idx[x])
train_df['customer_id'] = train_df['customer_id'].apply(lambda x: customers_id_to_idx[x])
train_df['bought'] = np.ones(train_df.shape[0])

train_df

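Here, we generate negative examples by pairing random customers with random articles, and add them to our existing dataset. We draw as many negative examples as there are positive ones.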

In [None]:
# Generate negative examples
np.random.seed(47)

num_transactions = train_df.shape[0]
negative_data = pd.DataFrame(
    {
        'article_id': np.random.choice(num_articles, num_transactions),
        'customer_id': np.random.choice(num_customers, num_transactions),
        'bought': np.zeros(num_transactions)
    }
)

train_df = pd.concat([train_df, negative_data])
train_df = train_df.sample(frac=1).reset_index(drop=True)

train_df
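
Uniformly sampled pairs can occasionally coincide with a real purchase, which would label the same pair both 1 and 0. The cell above does not filter such collisions. A minimal sketch of one way to do so, assuming an anti-join with `merge(..., indicator=True)`, is shown below on made-up data; the names `positives`, `candidates` and `negatives` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(47)

# Toy positives: (customer, article) index pairs (made-up data)
positives = pd.DataFrame({
    "customer_id": [0, 0, 1, 2],
    "article_id": [0, 1, 1, 2],
    "bought": 1.0,
})
n_customers, n_articles = 3, 3

# Draw as many random candidate pairs as there are positives
candidates = pd.DataFrame({
    "customer_id": rng.integers(0, n_customers, size=len(positives)),
    "article_id": rng.integers(0, n_articles, size=len(positives)),
    "bought": 0.0,
})

# Anti-join: keep only candidates that do not match a real purchase
keys = ["customer_id", "article_id"]
merged = candidates.merge(positives[keys].drop_duplicates(), on=keys, how="left", indicator=True)
negatives = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")

print(len(candidates), "candidates,", len(negatives), "kept as negatives")
```

With the real data the interaction matrix is very sparse, so collisions should be rare and the filter mostly matters for small or dense datasets.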

### Validation dataset

In this part, transactions dated on or after `end_date`, which have not been used for training, are held out as our validation dataset. Since this is our final notebook, which is trained with all of the most recent data, we do not have any validation data here.

In [None]:
import datetime
val = transactions_df.copy()
val.t_dat = pd.to_datetime(val.t_dat)
val = val.loc[val["t_dat"] >= end_date]

In [None]:
val.head()

In [None]:
positive_items_val = val.groupby(['customer_id'])['article_id'].apply(list)
# creating validation set for metrics use case
val_users = positive_items_val.keys()
val_items = []

for i,user in tqdm(enumerate(val_users)):
    val_items.append(positive_items_val[user])
    
print("Total users in validation:", len(val_users))

### Time decaying baseline model for default recommendations

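As mentioned above, we will use a time-decaying popularity benchmark model as a fallback for users that we do not have any transactions data on. The model is based on this: https://www.kaggle.com/mayukh18/time-decaying-popularity-benchmark-0-0216.

It splits the most recent transactions into four roughly weekly windows, which are used to look up each user's own recent purchases. The global popularity ranking is computed from the two most recent windows, weighting each purchase by the inverse of its age in days.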
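
To make the decay concrete, here is a small worked sketch on made-up data. It applies the same weight as the code below, one over the number of days between the purchase and the cutoff date 2020-09-23; the toy frame and the article names `A` and `B` are assumptions for illustration.

```python
import pandas as pd

cutoff = pd.Timestamp("2020-09-23")

# Made-up purchases: article A was bought twice, article B once
toy = pd.DataFrame({
    "article_id": ["A", "A", "B"],
    "t_dat": pd.to_datetime(["2020-09-22", "2020-09-19", "2020-09-21"]),
})

# Weight each purchase by 1 / age in days, then sum the weights per article
toy["pop_factor"] = 1 / (cutoff - toy["t_dat"]).dt.days
scores = toy.groupby("article_id")["pop_factor"].sum().sort_values(ascending=False)
print(scores)
```

Here A scores 1/1 + 1/4 = 1.25 and B scores 1/2 = 0.5, so a purchase from yesterday counts four times as much as one from four days ago.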

In [None]:
# The popularity model needs the transactions (with t_dat), not the sample submission
data = pd.read_csv('../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv', dtype={'article_id':str})

In [None]:
data["t_dat"] = pd.to_datetime(data["t_dat"])
train1 = data.loc[(data["t_dat"] >= datetime.datetime(2020,9,16)) & (data['t_dat'] < datetime.datetime(2020,9,23))]
train2 = data.loc[(data["t_dat"] >= datetime.datetime(2020,9,8)) & (data['t_dat'] < datetime.datetime(2020,9,16))]
train3 = data.loc[(data["t_dat"] >= datetime.datetime(2020,8,31)) & (data['t_dat'] < datetime.datetime(2020,9,8))]
train4 = data.loc[(data["t_dat"] >= datetime.datetime(2020,8,23)) & (data['t_dat'] < datetime.datetime(2020,8,31))]

# List of all purchases per user (has repetitions)
positive_items_per_user1 = train1.groupby(['customer_id'])['article_id'].apply(list)
positive_items_per_user2 = train2.groupby(['customer_id'])['article_id'].apply(list)
positive_items_per_user3 = train3.groupby(['customer_id'])['article_id'].apply(list)
positive_items_per_user4 = train4.groupby(['customer_id'])['article_id'].apply(list)

train = pd.concat([train1, train2], axis=0)
train['pop_factor'] = train['t_dat'].apply(lambda x: 1/(datetime.datetime(2020,9,23) - x).days)
popular_items_group = train.groupby(['article_id'])['pop_factor'].sum()

_, popular_items = zip(*sorted(zip(popular_items_group, popular_items_group.keys()))[::-1])

In [None]:
from collections import Counter

popular_items = list(popular_items)

def get_popularity_based_prediction(user):
    user_output = []
    # Walk the weekly windows from most recent to oldest and collect the
    # user's own purchases, most frequent first within each window.
    for items_per_user in (positive_items_per_user1, positive_items_per_user2,
                           positive_items_per_user3, positive_items_per_user4):
        if user in items_per_user.keys():
            most_common = [item for item, _ in Counter(items_per_user[user]).most_common()]
            user_output += most_common[:12]

    # Keep at most 12 items, then pad with globally popular items.
    # Truncating first avoids a negative slice bound, which would otherwise
    # append almost the entire popularity list.
    user_output = user_output[:12]
    user_output += list(popular_items[:12 - len(user_output)])
    return user_output

### Our model

Our main model is an Item-Based Collaborative Filtering (IBCF) model.

We have stored the transactions into a dataframe that contains the index of the user that has bought an item, the index of the item that has been bought, and a label $1$ or $0$ that tells whether the item has been bought or not, respectively. We decompose this pseudo-matrix $T$ into the product of two matrices $P$, which contains the information of the users, and $Q$, which contains information of the articles. Both of these matrices are in a latent space of dimension $d$.

We use SGD in order to compute the matrices, minimizing the squared error of the actual label and the dot product of the user $p_i$ and the item $q_j$, which is an estimation of whether the user is going to buy the product or not.
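
A single SGD step for one (user, item, label) triple can be sketched as follows. This is a minimal illustration under stated assumptions rather than the exact code of the class below: `sgd_step` is a hypothetical helper, both vectors are updated from their old values, and the L2 regularization weight `lmbda` and the clipping of the prediction to $[0, 1]$ mirror what the class does.

```python
import numpy as np

def sgd_step(p_i, q_j, label, lr=0.001, lmbda=0.1):
    """One regularized SGD update for user vector p_i and item vector q_j (hypothetical helper)."""
    prediction = np.clip(p_i @ q_j, 0.0, 1.0)
    error = label - prediction
    # Move each vector along the other, scaled by the error, and shrink it slightly
    new_p = p_i + lr * (error * q_j - lmbda * p_i)
    new_q = q_j + lr * (error * p_i - lmbda * q_j)
    return new_p, new_q, error

# Toy vectors in a 2-dimensional latent space (made-up values)
p = np.array([0.5, 0.5])
q = np.array([0.5, 0.5])
new_p, new_q, err = sgd_step(p, q, label=1.0, lr=0.1, lmbda=0.0)
print(err, 1.0 - new_p @ new_q)
```

In this toy case the error for a positive example shrinks after the step, which is the behaviour the update is meant to produce.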

Since a similarity prediction is based on a single item, we have implemented an approach that bases the final prediction on several products. For each of the last few items bought by a user, we compute a relevance score from the purchase date, $1 / (\text{end\_date} - \text{date\_bought})$ with the difference measured in days, and normalize the scores per user with the softmax function. We multiply each item's similarity vector by its relevance score and sum the results. We then take the 12 items with the highest cumulative score.

In case we do not have any information about the customer's purchase history, we use the fallback model to get the 12 most popular items.
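
The relevance-weighted aggregation can be sketched on a toy example. The item factors, the purchase history and the variable names below are made up for illustration; only the recipe (cosine similarity between item vectors, softmax over inverse purchase age, weighted sum of similarity rows) follows the description above.

```python
import numpy as np
from scipy.special import softmax

# Made-up latent factors for 4 items in a 2-dimensional latent space
item_factors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])

# Cosine similarity between all pairs of items
unit = item_factors / np.linalg.norm(item_factors, axis=1, keepdims=True)
similarity = unit @ unit.T

# A user who bought item 0 one day ago and item 2 ten days ago
recent_items = np.array([0, 2])
days_ago = np.array([1, 10])
relevance = softmax(1.0 / days_ago)  # more recent purchases get more weight

# Weighted sum of the similarity rows of the recently bought items
scores = relevance @ similarity[recent_items]
top = np.argsort(scores)[::-1][:2]
print(relevance, top)
```

Because the inverse ages all lie in $(0, 1]$, the softmax produces fairly flat weights, so older purchases still contribute noticeably to the final score.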

In [None]:
import scipy
from scipy.special import softmax

def apk(actual, predicted, k=12):
    if len(predicted)>k:
        predicted = predicted[:k]

    score = 0.0
    num_hits = 0.0

    for i,p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i+1.0)

    if not actual:
        return 0.0

    return score / min(len(actual), k)

def mapk(actual, predicted, k=12):
    return np.mean([apk(a,p,k) for a,p in zip(actual, predicted)])
    
    
# Recommender
class ItemBasedRecommender:
    """ Item-based collaborative filtering using similarities between learned item factors. """
    def __init__(self, df, num_articles, num_customers, num_components=10, num_last_items=5):
        """ Constructor """
        self.num_components = num_components
        self.num_articles = num_articles
        self.num_customers = num_customers
        self.num_last_items = num_last_items
        
        self.train = df
        
        self.articles = self.train.article_id.values
        self.customers = self.train.customer_id.values
        self.bought = self.train.bought.values


    def __sdg__(self):
        for idx in tqdm(self.training_indices):
            customer_idx = self.customers[idx]
            article_idx = self.articles[idx]
            real_bought = self.bought[idx]
            
            prediction = self.__predict_train(customer_idx, article_idx)
            error = (real_bought - prediction) # error
            
            #Update latent factors
            self.customers_lat_mat[customer_idx] += self.learning_rate * \
                                    (error * self.articles_lat_mat[article_idx] - \
                                     self.lmbda * self.customers_lat_mat[customer_idx])
            self.articles_lat_mat[article_idx] += self.learning_rate * \
                                    (error * self.customers_lat_mat[customer_idx] - \
                                     self.lmbda * self.articles_lat_mat[article_idx])
                
                
    def fit(self, n_epochs=10, learning_rate=0.001, lmbda=0.1):
        r"""Compute the matrix factorization T \approx P \times Q"""
        self.learning_rate = learning_rate
        self.lmbda = lmbda
        self.n_samples = self.train.shape[0]
        
        self.train_rmse =[]
        self.test_rmse = []
        
        # Initialize latent matrices
        self.customers_lat_mat = np.random.normal(scale=1., size=(self.num_customers, self.num_components))
        self.articles_lat_mat = np.random.normal(scale=1., size=(self.num_articles, self.num_components))

        for epoch in range(n_epochs):
            print('Epoch: {}'.format(epoch))
            
            self.training_indices = np.random.permutation(self.n_samples)
            self.__sdg__()
            # self.evaluate(num_samples=10000)
        
        del self.customers_lat_mat
            
        
    def __predict_train(self, customer_idx, article_idx):
        """ Single user and item prediction."""
        prediction = np.dot(self.customers_lat_mat[customer_idx], self.articles_lat_mat[article_idx])
        prediction = np.clip(prediction, 0, 1)
        
        return prediction
    
    def predict(self, transactions_df, customers, most_bought_articles, articles_id_to_idx, customers_id_to_idx):
        recommendations = []

        # Compute similarity matrix
        similarity_matrix = np.dot(self.articles_lat_mat, self.articles_lat_mat.T)
        norms = np.sqrt(np.sum(self.articles_lat_mat ** 2, axis=1)).reshape(-1, 1)
        similarity_matrix = similarity_matrix / norms
        similarity_matrix = similarity_matrix / norms.T

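        # Take each customer's 5 most recent transactions from the frame passed in.
        # Note: the window is hard-coded to 5; self.num_last_items is not used here.
        last_transactions = transactions_df.groupby('customer_id').tail(5).sort_values(by=['customer_id', 't_dat'])
        # Relevance decays with the age of the purchase in days (assumes t_dat < end_date)
        last_transactions['relevance'] = 1 / (end_date - last_transactions.t_dat).dt.days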
        last_transactions['relevance'] = last_transactions.groupby('customer_id').relevance.transform(softmax).values
        last_transactions['article_idx'] = last_transactions.article_id.apply(lambda x: articles_id_to_idx[x])


        # Create dictionary of transactions
        transactions_dict = last_transactions.groupby('customer_id').apply(lambda x: [(article_idx, relevance) for article_idx, relevance in zip(x.article_idx, x.relevance)])
        transactions_dict = {user: trans for user, trans in zip(transactions_dict.index, transactions_dict.values)}

        for customer in tqdm(customers):
            try:
                customer_idx = customers_id_to_idx[customer]
                customer_transactions = transactions_dict[customer]
                similarity_cum = np.zeros(similarity_matrix.shape[1])

                for article_idx, relevance in customer_transactions:
                    similarity_cum += relevance * similarity_matrix[article_idx]

                similar_articles_idx = np.argsort(similarity_cum)[::-1][:12]
                recommended_articles = most_bought_articles[similar_articles_idx]

                recommendation = "0" + " 0".join([str(article) for article in recommended_articles])
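            except KeyError:
                # No usable purchase history: fall back to the popularity model.
                # zfill(10) pads to the 10-digit article id format and leaves ids that
                # are already zero-padded strings unchanged.
                popular_articles_customer = get_popularity_based_prediction(customer)
                recommendation = " ".join([str(article).zfill(10) for article in popular_articles_customer])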

            recommendations.append(recommendation)

        predictions_df = pd.DataFrame({'customer_id': customers, 'prediction': recommendations})

        return predictions_df
    
    def evaluate(self, num_samples=None):
        if num_samples is None:
            train_items = self.predict(filtered_transactions_df, val_users.values, most_bought_articles, articles_id_to_idx, customers_id_to_idx)
        else:
            train_items = self.predict(filtered_transactions_df, val_users.values[:num_samples], most_bought_articles, articles_id_to_idx, customers_id_to_idx)
        train_items = train_items.prediction.values
        train_items = [prediction.split(" ") for prediction in train_items]
        train_items = [list(map(int, item)) for item in train_items]
        score = mapk(val_items, train_items)
        print(f"Mapk score of {score}")
                
    

### Training

Note: The final model that has been submitted to Kaggle has been trained with 1000 components for 10 epochs with a learning rate of 0.001.

In [None]:
recommender = ItemBasedRecommender(train_df, num_articles, num_customers, num_components=1000, num_last_items=6)
recommender.fit(n_epochs=10, learning_rate=0.001)

### Submission

In [None]:
customers = submission_df.customer_id.values
submission = recommender.predict(filtered_transactions_df, customers, most_bought_articles, articles_id_to_idx, customers_id_to_idx)

In [None]:
submission.to_csv('submission.csv', index=False)