# Collaborative-based recommenders
---
_Project by Qijun Jin, Johhny Nuñez and Marcos Plaza._

### The goal

This first practice of the course is based on the data from this [Kaggle](https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/overview) competition. The aim is to obtain **product recommendations** from customer transaction data, as well as other metadata.

To achieve this goal, we have relied on **collaborative methods**. The hypothesis is: _**"Similar users tend to like similar items"**_. More precisely, this is the premise of neighborhood-based methods, which were among the earliest algorithms developed for collaborative filtering. As stated, the hypothesis describes a user-based system. Since this problem involves a very large number of users, we have focused our efforts on an **item-based model** instead.

First of all, let's check the data to gain some insights:

In [None]:
import numpy as np
import pandas as pd 
import os
from tqdm.notebook import tqdm

# load data
original_df_customers=pd.read_csv('../input/h-and-m-personalized-fashion-recommendations/customers.csv')
original_df_items=pd.read_csv('../input/h-and-m-personalized-fashion-recommendations/articles.csv')
original_df_customers_items=pd.read_csv('../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv')

### Information of the customers

In [None]:
original_df_customers.head(10)

### Information of the items

In [None]:
original_df_items.head(10)

### Information of the transactions: Which customers have purchased certain items?

In [None]:
original_df_customers_items.head(10)

This is the dataset on which we will focus our implementation. As we can see, each row records which customer bought which item, together with some metadata (date, price and sales channel). However, the number of transactions is huge, so using the whole dataset is not computationally feasible.

Therefore, we must restrict the data while keeping enough of it to give good recommendations. We decided to **take into account only the most recent transactions**, since products such as clothing are renewed and change over time and with the seasons. We will discard all transactions dated on or before ``2020-08-31``.
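
As a minimal sketch of this recency filter, the toy example below parses the dates first so that the comparison is made on dates rather than on strings. The toy rows and the names `toy`, `cutoff` and `recent` are illustrative assumptions; only the column names mirror the real data.

```python
import pandas as pd

# Hypothetical toy transactions; only the column names mirror the real data.
toy = pd.DataFrame({
    "t_dat": ["2020-08-30", "2020-08-31", "2020-09-01", "2020-09-15"],
    "customer_id": ["a", "b", "a", "c"],
    "article_id": [1, 2, 3, 1],
})

# Parse once so the comparison is done on dates, not on strings.
toy["t_dat"] = pd.to_datetime(toy["t_dat"])
cutoff = pd.Timestamp("2020-08-31")

# Strictly after the cutoff: rows dated on the cutoff day itself are dropped.
recent = toy[toy["t_dat"] > cutoff]
print(len(recent))
```

Comparing ISO-formatted date strings directly also works, because their lexicographic order matches the chronological order.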


In [None]:
d = original_df_customers_items.copy()

original_df_customers_items = d[d['t_dat'] > '2020-08-31']

counts_df = original_df_customers_items.groupby(['t_dat', 'customer_id', 'article_id', 'price', 'sales_channel_id']).size()
counts_df = counts_df.to_frame()
counts_df.reset_index(inplace=True)

small_counts = counts_df.rename(columns={0: 'count'})

In [None]:
small_counts

### Implementation

The first thing we have to do is build a matrix that captures each customer's preferences. For this purpose, we will construct an $m \times n$ matrix, where $m$ is the number of customers and $n$ is the number of items (articles). Entry $(i,j)$ of this matrix holds the number of times customer $i$ has bought article $j$.
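
To make the shape of this matrix concrete, here is a small sketch on made-up purchases. The toy data and the use of `pd.factorize` to obtain row and column codes are assumptions for illustration, not necessarily how the full pipeline builds its matrix.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical purchases: one row per transaction.
toy = pd.DataFrame({
    "customer_id": ["u1", "u1", "u2", "u3", "u3"],
    "article_id": [10, 20, 10, 30, 30],
})

# Count how many times each customer bought each article.
purchases = (toy.groupby(["customer_id", "article_id"])
                .size()
                .reset_index(name="count"))

# Map ids to consecutive integer codes (sorted), one per row / column.
row_codes, customers = pd.factorize(purchases["customer_id"], sort=True)
col_codes, articles = pd.factorize(purchases["article_id"], sort=True)

# Rows are customers, columns are articles, values are purchase counts.
toy_counts = csr_matrix(
    (purchases["count"].to_numpy(), (row_codes, col_codes)),
    shape=(len(customers), len(articles)),
)
print(toy_counts.toarray())
```

Customer `u3` bought article `30` twice, so the last row holds a 2 in the last column.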

In [None]:
from scipy.sparse import csr_matrix, dok_matrix
from pandas.api.types import CategoricalDtype

def to_dense(array):
    """
    Accepts a csr_matrix, dok_matrix or matrix and converts it into a
    regular dense np.array.
    
    :param array: Array to convert
    :return: Dense np.array, with every dimension of size 1 removed
    """
    try:
        array = array.todense()
    except AttributeError:
        # Not a sparse matrix: nothing to densify
        pass
    
    return np.array(array).squeeze()
    
def build_counts_table(df):
    """
    Returns a csr_matrix whose columns are the items (`article_id`), whose rows are the
    customers (`customer_id`) and whose values are the number of times a customer has bought an item.
    
    :param df: Original DataFrame after merging it
    :return: A tuple consisting of:
        * The csr_matrix described above
        * The index of the rows (the customer_id of row `i` is element `i` of its categories)
        * The index of the columns (the article_id of column `j` is element `j` of its categories)
    """
    # Ids, sorted and without repetitions
    customer_ids = CategoricalDtype(sorted(df.customer_id.unique()), ordered=True)
    item_ids = CategoricalDtype(sorted(df.article_id.unique()), ordered=True)

    # Conversion to csr (repeated (row, col) pairs are summed)
    row = df.customer_id.astype(customer_ids).cat.codes
    col = df.article_id.astype(item_ids).cat.codes
    sparse_matrix = csr_matrix((df["count"], (row, col)),
                           shape=(customer_ids.categories.size, item_ids.categories.size))

    return sparse_matrix, customer_ids, item_ids

### Top 10 customers & Top 10 items

In [None]:
def top_active_customers(counts, indexes, columns, n):
    """
    Returns the ids of the n customers who have accumulated the most purchases
    
    :param counts, indexes, columns: Tuple returned by `build_counts_table`
    :param n: Number of customers
    :return: Index with the customer_id of the n customers, in ascending order of purchases
    """
    # Operate with the sparse matrix, convert to dense the result (as it has much fewer entries)
    sums = to_dense(counts.sum(axis=1))
    # Get indices
    indices = sums.argsort()
    return indexes.categories[indices[-n:]]

def top_bought_articles(counts, indexes, columns, n):
    """
    Returns the ids of the n most purchased items
    
    :param counts, indexes, columns: Tuple returned by `build_counts_table`
    :param n: Number of items
    :return: Index with the article_id of the n items, in ascending order of purchases
    """
    # Operate with the sparse matrix, convert to dense the result (as it has much fewer entries)
    sums = to_dense(counts.sum(axis=0))
    # Get indices
    indices = sums.argsort()
    return columns.categories[indices[-n:]]

In [None]:
counts, indexes, columns = build_counts_table(small_counts)

These are the 10 customers who have made the most purchases, followed by the 10 most purchased products (within the restrictions that have been applied).

In [None]:
top_customers = top_active_customers(counts, indexes, columns, 10)
top_customers

In [None]:
top_items = top_bought_articles(counts, indexes, columns, 10)
top_items
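
The ranking step used above can be sketched on a tiny matrix: sum the sparse matrix along one axis, sort the totals, and keep the last positions. The toy matrix and variable names are assumptions for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical customers x articles count matrix.
toy_counts = csr_matrix(np.array([[1, 1, 0],
                                  [1, 0, 0],
                                  [0, 0, 2]]))

# Total purchases per article (sum over customers), flattened to 1-D.
totals = np.asarray(toy_counts.sum(axis=0)).ravel()

# Column positions ordered from most to least purchased.
order = np.argsort(totals, kind="stable")[::-1]
top2 = order[:2]
print(totals, top2)
```

The positions returned are column indices; mapping them back to article ids requires the category index that was built together with the matrix.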

At the same time, we have set another limit to further reduce the dimensionality of our data: we will **keep only the 5000 most active customers and the 5000 most purchased items**.

In [None]:
counts, indexes, columns = build_counts_table(small_counts)
print(counts.shape)
top_customers = top_active_customers(counts, indexes, columns, 5000)
top_items = top_bought_articles(counts, indexes, columns, 5000)


s = small_counts.copy()
print("Total: ", len(s))
s = s[s.article_id.isin(top_items)]
print("Filter top articles: ", len(s))
s = s[s.customer_id.isin(top_customers)]
print("Filter top customers: ", len(s))
# s = s[:10000]
print("Total: ", len(s))
s

In [None]:
# Keep only customer_id, article_id and count
s = s.drop(columns=['t_dat', 'price', 'sales_channel_id'])
s

In [None]:
counts, indexes, columns = build_counts_table(s) # build the counts matrix

### Compute similarities

To make a collaborative recommendation there are two options: a user-based recommender or an item-based recommender. As we said in the introduction, we chose an **item-based system**:
+ In an item-based approach, we consider the $N \times M$ matrix (items × users) and base each recommendation on the similarities between items.

To compute the item-based similarity matrix, we only need to transpose the ``counts`` matrix above (``counts.T``). This gives us an $N \times M$ matrix whose rows are items; comparing its rows pairwise yields a symmetric $N \times N$ item-to-item similarity matrix.
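
A minimal sketch of an item-to-item similarity matrix on toy data follows. It uses cosine similarity as an illustrative choice, which is an assumption: other measures, such as correlation, follow the same pattern of comparing the rows of the transposed matrix.

```python
import numpy as np

# Hypothetical customers x articles count matrix.
toy_counts = np.array([[1, 1, 0],
                       [1, 0, 0],
                       [0, 0, 2]], dtype=float)

# Transpose so that each row is one article described by its customers.
item_vectors = toy_counts.T

# Normalise every article vector to unit length (guarding against empty rows).
norms = np.linalg.norm(item_vectors, axis=1)
norms[norms == 0] = 1.0
unit = item_vectors / norms[:, None]

# Cosine similarity between every pair of articles: an N x N symmetric matrix.
item_sim = unit @ unit.T

# An article should not count as its own neighbour.
np.fill_diagonal(item_sim, 0.0)
print(item_sim.shape)
```

Articles 0 and 1 share one customer, so their similarity is positive, while article 2 shares no customer with the others and gets a similarity of 0.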

In [None]:
from sklearn.metrics.pairwise import pairwise_distances

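def similarity_matrix(similarity_function, counts):
    """
    Returns the pairwise similarity between the rows of `counts`.

    :param similarity_function: Metric name understood by `pairwise_distances`
        ("correlation" or "cosine"), or None to use the hand-written cosine similarity
    :param counts: Sparse matrix whose rows are the vectors to compare
    :return: Square matrix with one similarity value per pair of rows
    """
    if similarity_function is None:
        x = to_dense(counts)
        y = to_dense(counts.T)
        
        # Dot product of every pair of rows; the diagonal holds the squared norms
        matrix_prod = np.dot(x,y)
        diagonal = np.diag(matrix_prod)
        # 1 / ||row||, set to 0 for all-zero rows
        inversa = 1 / diagonal
        inversa[np.isinf(inversa)] = 0
        inv_mag = np.sqrt(inversa)
        
        # Divide each dot product by the norms of both rows
        cosine = matrix_prod * inv_mag
        cosine = cosine.T * inv_mag
        np.fill_diagonal(cosine, 0)
        
        return cosine
    
    else: 
        x = pd.DataFrame.sparse.from_spmatrix(counts)
        # pairwise_distances returns distances; for "correlation" and "cosine"
        # the distance is 1 - similarity, so convert back to similarities
        matrix = 1 - pairwise_distances(X = x, metric = similarity_function, n_jobs = -1)
        del x
        # Constant rows produce NaN correlations; treat them as no similarity
        matrix = csr_matrix(np.nan_to_num(matrix))
                
        return matrix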

In [None]:
import pickle

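try:
    # Reuse the cached similarity matrix if it exists
    with open('similarities.pkl', 'rb') as fp:
        similarities = pickle.load(fp)
except (OSError, EOFError, pickle.UnpicklingError):
    # No usable cache: compute the item-to-item similarities and store them
    similarities = similarity_matrix(similarity_function="correlation", counts=counts.T)
        
    with open('similarities.pkl', 'wb') as fp:
        pickle.dump(similarities, fp, pickle.HIGHEST_PROTOCOL)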

### Make a prediction

To make a collaborative recommendation, we need a function that tells us how good a recommendation would be. In our case, as the system is item-based, **the score** for a user $u$ and an item $i$ is:
$$pred(u, i) = \hat{r}_{u,i} = \frac{\sum_{j\neq i,r_{u,j}>0} sim(i, j)\cdot r_{u,j}}{\sum_{j\neq i,r_{u,j}>0} sim(i, j)}$$
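
The formula can be checked by hand on a small example. The sketch below is an illustration under assumed toy values; `weighted_score` is a hypothetical helper, not one of the functions used in the rest of the notebook.

```python
import numpy as np

def weighted_score(user_counts, item_sims, target):
    """Similarity-weighted average of the counts of the items the user bought."""
    sims = np.array(item_sims, dtype=float)
    sims[target] = 0.0                 # enforce j != i
    bought = user_counts > 0           # enforce r_uj > 0
    denominator = sims[bought].sum()
    if denominator == 0:
        return 0.0
    numerator = (sims[bought] * user_counts[bought]).sum()
    return float(numerator / denominator)

# Hypothetical user who bought item 0 twice and item 2 once.
user_counts = np.array([2.0, 0.0, 1.0, 0.0])
# Hypothetical similarities between the target item (index 1) and every item.
item_sims = np.array([0.5, 1.0, 0.25, 0.1])

result = weighted_score(user_counts, item_sims, target=1)
print(result)
```

Here the numerator is $0.5 \cdot 2 + 0.25 \cdot 1 = 1.25$ and the denominator is $0.5 + 0.25 = 0.75$, so the score is $1.25 / 0.75 \approx 1.67$.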

In [None]:
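def score(counts, indexes, columns, similarities, customer, item):
    """
    Item-based score of `item` for `customer`.

    :param counts: Counts matrix with items as rows and customers as columns (`counts.T`)
    :param indexes, columns: Customer and item indexes returned by `build_counts_table`
    :param similarities: Item-to-item similarity matrix
    :return: Similarity-weighted average of the customer's counts, or 0 if undefined
    """
    customer_ = indexes.categories.get_loc(customer)
    item_ = columns.categories.get_loc(item)
    
    # Purchase counts of this customer for every item
    rui = to_dense(counts[:,customer_])
    
    # Similarities between the target item and every item, excluding itself (j != i)
    sim = to_dense(similarities[item_]).astype(float)
    sim[item_] = 0
    
    numerador = np.sum(rui* sim)
    denominador = np.sum(sim[rui > 0])
    
    if denominador != 0 and numerador != 0:
        return numerador/denominador
    else:
        return 0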

In [None]:
print(score(counts.T, indexes, columns, similarities, '01959be607170cc2f092ee8fd13eda251b13cde70ef38dd37e37dcdadfde3b9e',781758057)) #111,123
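
A second, mean-centred variant of the score is computed next. One plausible reading of it, stated here as an assumption rather than as a definitive formula, adds to the mean count $\bar{r}_i$ of the target item the similarity-weighted deviation of the customer's counts from each item's mean count $\bar{r}_j$:

$$pred(u, i) = \hat{r}_{u,i} = \bar{r}_i + \frac{\sum_{j\neq i,r_{u,j}>0} sim(i, j)\cdot (r_{u,j} - \bar{r}_j)}{\sum_{j\neq i,r_{u,j}>0} sim(i, j)}$$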

In [None]:
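def score_mean(counts, indexes, columns, similarities, customer, item):
    customer_ = indexes.categories.get_loc(customer)
    item_ = columns.categories.get_loc(item)
    # Mean count of every item (the rows of `counts` are items)
    ri = to_dense(counts.mean(axis=1))
    # Baseline: mean count of the target item
    baseline = ri[item_]
    # Purchase counts of this customer for every item
    rpi = to_dense(counts[:,customer_])
    # Similarities between the target item and every item, excluding itself (j != i)
    sim = to_dense(similarities[item_]).astype(float)
    sim[item_] = 0
    # Deviation of the customer's counts from each item's mean, for bought items only
    diferencia = np.subtract(rpi,ri)[rpi>0]
    numerador = np.sum(diferencia *sim[rpi > 0])
    denominador = np.sum(sim[rpi > 0])
    if numerador !=0 and denominador !=0:
        return baseline + (numerador/denominador)
    else:
        return baseline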

In [None]:
print(score_mean(counts.T, indexes, columns, similarities, '01959be607170cc2f092ee8fd13eda251b13cde70ef38dd37e37dcdadfde3b9e',781758057)) 


In [None]:
from tqdm.notebook import trange, tqdm
import heapq as hq

def recommend_n_items(counts, indexes, columns, similarities, customer, N):
    minheap = []
    customer_ = indexes.categories.get_loc(customer)
    for item_, item in enumerate(tqdm(columns.categories.tolist())):
        # `counts` has items as rows and customers as columns;
        # only score the items the customer has not bought yet
        if counts[item_, customer_] == 0:
            valor = score(counts, indexes, columns, similarities, customer, item)
            # Negate the score so that the min-heap yields the highest scores first
            hq.heappush(minheap, (-valor, item))

    res = []

    # Pop at most N items, best score first
    for i in range(min(N, len(minheap))):
        res.append(hq.heappop(minheap)[1])

    return res

In [None]:
print(recommend_n_items(counts.T, indexes, columns, similarities, '01959be607170cc2f092ee8fd13eda251b13cde70ef38dd37e37dcdadfde3b9e',12)) 

In [None]:
import heapq as hq

def recommend_n_items_mean(counts, indexes, columns, similarities, customer, N):
    minheap = []
    customer_ = indexes.categories.get_loc(customer)
    for item_, item in enumerate(tqdm(columns.categories.tolist())):
        # `counts` has items as rows and customers as columns;
        # only score the items the customer has not bought yet
        if counts[item_, customer_] == 0:
            valor = score_mean(counts, indexes, columns, similarities, customer, item)
            # Negate the score so that the min-heap yields the highest scores first
            hq.heappush(minheap, (-valor, item))

    res = []

    # Pop at most N items, best score first
    for i in range(min(N, len(minheap))):
        res.append(hq.heappop(minheap)[1])

    return res

In [None]:
print(recommend_n_items_mean(counts.T, indexes, columns, similarities, '01959be607170cc2f092ee8fd13eda251b13cde70ef38dd37e37dcdadfde3b9e',12))
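
The selection of the best N candidates can also be sketched with `heapq.nlargest`, which avoids negating the scores and copes with fewer than N candidates. The helper `top_n` and the toy scores are assumptions for illustration.

```python
import heapq

def top_n(scores, n):
    """Return the n keys with the highest score, best first."""
    best = heapq.nlargest(n, scores.items(), key=lambda pair: pair[1])
    return [item for item, _ in best]

# Hypothetical scores for three candidate items.
toy_scores = {"a": 0.2, "b": 0.9, "c": 0.5}

print(top_n(toy_scores, 2))
# Asking for more items than there are candidates simply returns them all.
print(top_n(toy_scores, 10))
```

`heapq.nlargest` only tracks the n best candidates while it scans, which matters when N is small compared with the number of items.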

In [None]:
def Convert(string):
    li = list(string.split(" "))
    return li

### Generate the ``submission.csv`` file

In [None]:
kaggle_df_customer_articles = pd.read_csv('./data/sample_submission.csv')

In [None]:
kaggle_df_customer_articles.customer_id.unique()

In [None]:
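try:
    from tqdm.notebook import trange, tqdm
except ImportError:
    # Fallback when tqdm is not installed: return the iterable unchanged
    def tqdm(iterable, *args, **kwargs): return iterable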
    
# Fetch kaggle public data and merge
kaggle_df_customer_articles = pd.read_csv('../input/h-and-m-personalized-fashion-recommendations/sample_submission.csv')
    
# Obtain counts
pred_counts, pred_indexes, pred_columns = build_counts_table(s)
    
# Similarity
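try:
    # Reuse the cached similarity matrix if it exists
    with open('pred_similarities.pkl', 'rb') as fp:
        pred_similarities = pickle.load(fp)
except (OSError, EOFError, pickle.UnpicklingError):
    # No usable cache: compute the item-to-item similarities and store them
    pred_similarities = similarity_matrix(similarity_function="correlation", counts=pred_counts.T)
        
    with open('pred_similarities.pkl', 'wb') as fp:
        pickle.dump(pred_similarities, fp, pickle.HIGHEST_PROTOCOL)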
    
# Fallback prediction: the 12 most purchased articles, restoring the leading zero of each id
top_n_items = top_bought_articles(pred_counts, pred_indexes, pred_columns, 12)
top_n_items = ' '.join('0'+str(i) for i in top_n_items)

rows = []
for customer in tqdm(kaggle_df_customer_articles.customer_id.unique()):
    try:
        # recommend_n_items_mean already returns a list of article ids
        article_ids = recommend_n_items_mean(pred_counts.T, pred_indexes, pred_columns, pred_similarities, customer,12)
        article_ids = ' '.join('0'+str(i) for i in article_ids)
    except (KeyError, IndexError):
        # Customer outside the filtered data, or too few candidates: use the most purchased articles
        article_ids = top_n_items
    rows.append((customer, article_ids))

# Build the DataFrame once instead of appending row by row
results = pd.DataFrame(rows, columns=['customer_id', 'prediction'])
        
results.to_csv('submission.csv', index=False)
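
Each prediction is a single string of space-separated article ids. As a hedged sketch, the hypothetical helper `format_prediction` below pads every id with `zfill` instead of prepending a single `'0'`; it assumes that article ids are 10 characters long once the leading zero lost in the integer conversion is restored.

```python
def format_prediction(article_ids, width=10):
    """Join article ids into one string, zero-padded to `width` characters.

    `width=10` is an assumption about the id format, not something
    checked against the competition files here.
    """
    return " ".join(str(int(article_id)).zfill(width) for article_id in article_ids)

# Ids as they look after being read as integers (leading zero lost).
example = format_prediction([781758057, 108775015])
print(example)
```

Padding to a fixed width keeps the output well-formed even if some id has lost more than one leading zero.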