# Personalised recommendations to increase AOV of Instacart loyalists

---

## Part 5: Implementation of Content-Based and Collaborative Filtering systems

In this notebook, I will be testing both content-based and collaborative filtering methods for generating product recommendations.

In total, there will be 4 variations of recommender systems implemented:

1. Content-based filtering
2. Collaborative filtering: User-based 
3. Collaborative filtering: Item-based
4. Collaborative filtering: Matrix factorization model leveraging SVD  

For each method, I will generate 20 recommendations per user for a sub-sample of 3,000 users, and calculate the precision, recall, and F1 scores of the recommendations given. 

- **Precision**: what proportion of the recommended items did the user purchase?
- **Recall**: what proportion of the user's actual purchases were in the recommendations?
- **F1 score**: harmonic mean of precision and recall. The highest possible value of F1 is 1, indicating perfect precision and recall, and the lowest possible value is 0, if either the precision or the recall is zero.

NOTE: For the sake of evaluation, I will not be filtering out previously purchased items from the recommendations. 

After calculating the evaluation metrics, I will wrap up with a qualitative discussion of the limitations of the recommender systems and include further points for consideration from external research.

---

## Concept introduction: Content-Based and Collaborative Filtering in layman terms

Recommender systems fall into 2 broad categories:

**1. Content-based systems**

Content-based filtering focus on the attributes of the items and give recommendations based on the similarity between them. The basic idea is that if you like an item, then you will also like a “similar” item. 


**2. Collaborative filtering (CF)**

Collaborative filtering produces recommendations based on user-item interactions, such as ratings or buying behavior. It aggregates feedback for items from different users and uses similarities between users or items to provide recommendations.

- **User-based**: Measures similarity of users by their item preferences. 
- **Item-based**: Measure similarity of items by the users who like them. We are making the assumption that customers are likely to accept product recommendations that are similar to what they have bought before.
- **Matrix Factorization models**: In MF models, the user-item utility matrix is thought of as the product of two long, thin user and item matrices. The idea behind such models is that attitudes or preferences of a user can be determined by a small number of hidden latent factors. These factors represent different characteristics for users and items. Matrix factorization is done by leveraging various dimensionality reduction methods including Support Vector Decomposition (SVD), Probabilistic Matrix Factorization (PMF), and Non-Negative Matrix Factorization (NMF). In my case, I will be using SVD for matrix factorization.

<a id='sections'></a>
## Key sections

- [Coding the 4 recommender systems](#recsys_code)
- [Calculating RecSys evaluation metrics](#metric_calc)
- [Reviewing RecSys performance](#performance_eval)
- [Conclusion and additional discussion points](#discussion)
- [Case study using one randomly selected user](#case_study)

### Load libraries and datasets

In [43]:
from scipy.sparse import coo_matrix, csr_matrix
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
import scipy.sparse as sparse
import pandas as pd
import numpy as np
import seaborn as sns
import pickle
import time
import heapq
import scipy.sparse.linalg as linalg
import random
from IPython.display import Markdown, display
pd.set_option('max_colwidth', None)

In [2]:
# Read data files
df_train = pd.read_pickle('../datasets/df_train.pkl')
# df_test = pd.read_pickle('../datasets/df_test.pkl')
products_df = pd.read_pickle('../datasets/products_df_reduced.pkl')
products_df.set_index('product_id', inplace=True)
user_products_train = pd.read_pickle('../datasets/user_products_train.pkl')
user_products_test = pd.read_pickle('../datasets/user_products_test.pkl')
product_frequency_train = pd.read_pickle('../datasets/product_frequency_train.pkl')
product_frequency_train.index.name = 'product_id'
user_product_matrix = sparse.load_npz('../datasets/user_product_normed_sparse_matrix.npz').tocsr()
product_user_matrix = sparse.load_npz('../datasets/product_user_normed_sparse_matrix.npz').tocsr()
product_product_matrix = sparse.load_npz('../datasets/product_product_tfidf_matrix.npz')

### Helper functions for recommender systems

In [3]:
def get_train_pids(target_user_index):
    """
    Returns list of product IDs of purchases by target user in df_train (first 15 of last 20 orders)
    """
    # this creates a list of list
    pid_train = user_products_train.loc[user_products_train.index == target_user_index].product_id.values.tolist()
    # flatten the list
    pid_train = [item for sublist in pid_train for item in sublist]
    # keep only the unique items in list
    pid_train = list(set(pid_train))

    return pid_train
    
def get_test_pids(target_user_index):
    """
    Returns list of product IDs of purchases in df_test (last 5 orders)
    """
    # get products bought by the target user in test df
    # this creates a list of list
    pid_test = user_products_test.loc[user_products_test.index == target_user_index].product_id.values.tolist()
    # flatten the list
    pid_test = [item for sublist in pid_test for item in sublist]
    # keep only the unique items in list
    pid_test = list(set(pid_test))

    return pid_test

def get_train_pdetails(pid_train):
    """
    Returns dataframe containing product details of purchases by target user in df_train (first 15 of last 20 orders)
    Columns: product_name, department, and organic
    """
    pdetails_train = products_df.loc[pid_train][['product_name', 'department', 'organic']]
    
    return pdetails_train

def get_test_pdetails(pid_test):
    """
    Returns dataframe containing product details of purchases by target user in df_test (last 5 orders)
    Columns: product_name, department, and organic
    """
    pdetails_test = products_df.loc[pid_test][['product_name', 'department', 'organic']]
    return pdetails_test

def get_train_pnames(pid_train):
    """
    Returns product names of target user's purchases in df_train (first 15 of last 20 orders)
    """
    pnames_train = products_df.loc[pid_train]['product_name'].values.tolist()
    return pnames_train

def get_test_pnames(pid_test):
    """
    Returns product names of target user's purchases in test set (df_test: last 5 orders)
    """
    pnames_test = products_df.loc[pid_test]['product_name'].values.tolist()
    return pnames_test
    
def get_sim_recs_train(pid_final_recs, pid_train):
    """
    Returns list of recommended items (product names) found in user's purchases in train set (df_train)
    """
    sim_recs_train = list(set(pid_final_recs).intersection(set(pid_train)))
    sim_recs_train = products_df.loc[sim_recs_train]['product_name'].values.tolist()
    
    if not sim_recs_train:
        sim_recs_train = []
    else:
        sim_recs_train

    return sim_recs_train

def get_sim_recs_test(pid_final_recs, pid_test): 
    """
    Returns list of recommended items (product names) found in user's purchases in test set (df_test)
    """
    # list of items in final recs that are found in user's purchase history (df_test)
    sim_recs_test = list(set(pid_final_recs).intersection(set(pid_test)))
    sim_recs_test = products_df.loc[sim_recs_test]['product_name'].values.tolist()
    
    if not sim_recs_test:
        sim_recs_test = []
    else:
        sim_recs_test

    return sim_recs_test

def get_pdetails_final_recs(pid_final_recs):
    """
    Returns product details of the recommended products
    Columns: product_name, department, and organic
    """
    pdetails_final_recs = products_df.loc[pid_final_recs][['product_name', 'department', 'organic']]
    return pdetails_final_recs

def get_user_topN_prods(N=20):
    """
    Returns list of product IDs of the top N most frequently purchased products for each user
    """
    user_product_agg = df_train.groupby(['user_id', 'product_id']).agg(count=('product_id','count'))
    user_top_N = user_product_agg['count'].groupby(level=0, group_keys=False).nlargest(N).reset_index()
    user_top_N = user_top_N.groupby('user_id')['product_id'].apply(list)    
    return user_top_N

def get_top_N_prod_list(target_user_index, user_top_N):
    """
    Returns list of product IDs of the top N most frequently purchased item for the target user
    """
    user_id = user_products_train.loc[target_user_index]['user_id'] # user id corresponding to user index
    # this is a list of list of the user's top 20 products
    top_N_prod_list = user_top_N[user_top_N.index == user_id].values.tolist()
    # flatten the list of list
    top_N_prod_list = [item for lst in top_N_prod_list for item in lst]
    return top_N_prod_list

In [112]:
def printmd(string):
    """Pretty prints output in markdown style"""
    display(Markdown(string))

def print_results(target_user_index, pdetails_train, pnames_test, pdetails_final_recs, sim_recs_train, sim_recs_test, model_name, N=20):    
    """
    For a given user, this prints:
    - the top N recommendations with product details 
    - list of product names of recommended items found in train set
    - list of product names of recommended items found in test set
    - list of product names of purchases in test set
    - full details of purchases in train set
    - user's top 5 departments based on train set's purchase history
    """

    target_user_id = user_products_train.loc[target_user_index]['user_id']    
    printmd(f'_Model used: {model_name}_')
    printmd(f'### Top {N} recommendations for user {target_user_id}')
    printmd(f'Organic products: **{round(pdetails_final_recs.organic.mean() * 100, 2)}%**')
    display(pdetails_final_recs)
    print()
    printmd(f'**Recommended items that were in user\'s past purchases (first 15 of last 20 orders):**')
    printmd(f'{sim_recs_train}')
    printmd(f'**Recommended items that were in user\'s last 5 orders:**')
    printmd(f'{sim_recs_test}')
    printmd(f'**All items in user\'s last 5 orders:**')
    printmd(f'{pnames_test}')
    print()
    printmd(f'### User\'s purchase history (first 15 of last 20 orders)')
    print()
    printmd(f'Total unique items bought: **{len(pdetails_train)}**')
    printmd(f'Organic products: **{round(pdetails_train.organic.mean() * 100, 2)}%**')
    print()
    printmd('#### Top 5 departments')
    display(pd.DataFrame(pdetails_train.department.value_counts()[pdetails_train.department.value_counts() > 0][:5]))
    print()
    printmd(f'#### Products previously purchased')
    display(pdetails_train.style.hide_index())

<a id='recsys_code'></a>
## Coding the 4 recommender systems
[Back to top navigation](#sections)

### 1. Collaborative Filtering: User-based

In building the user-item utility matrix, since there are no explicit ratings for the products, I used the purchase count of an item as an approximation of the user's rating of the product. 

Cosine similarity will be used to measure similarity among users.

In [5]:
def cf_user(target_user_index, K=20, N=20):
    """
    For each user:
    
    1. Find his/her 20 nearest neighbours.
    2. Retrieve all the items these neighbours have purchased. 
    3. These will be potential items for recommendations.
    4. Rank all the potential item recommendations based on total sales volume and 
       select the top 20 as our final set of recommendations.
    
    """
    target_user_vec = user_product_matrix[target_user_index] # target user's vector from sparse matrix   
    target_user_id = user_products_train.loc[target_user_index]['user_id'] # actual id of user

    # cosine similarity scores of target user with all other users
    cos_sim = cosine_similarity(user_product_matrix, target_user_vec)

    # select K users with the highest cosine similarity scores (most similar to target user).
    # returns a list of user indices of the K users
    K_similar = heapq.nlargest(K+1, range(len(cos_sim)), cos_sim.take)[1:]
    # need to specify K+1 because the first value is the target user's cosine similarity score with himself
    
    pid_train = get_train_pids(target_user_index) # list of pids in train set
    pdetails_train = get_train_pdetails(pid_train) # list of product names in train set   
    pid_test = get_test_pids(target_user_index) # list of pids in test set
    pnames_test = get_test_pnames(pid_test) # list of product names in test set
    
    # Get recommendations from K similar users
    pid_recommendations = []
   
    for similar_user in K_similar:
        pid_similar_user = user_products_train.loc[user_products_train.index == similar_user].product_id.values.tolist()
        pid_similar_user = [item for sublist in pid_similar_user for item in sublist]
        pid_similar_user = list(set(pid_similar_user))
            
        # Add recommended items to total recommendation list
        pid_recommendations.extend(pid_similar_user)

    # From all the products in recommendations, pick the top N popularity (overall sales) to recommend
    heap = []
    for product in list(set(pid_recommendations)):
        # heapq.heappush(heap, item): Push the value item onto the heap. 
        # the order is adjusted so that heap structure is maintained.
        heapq.heappush(heap, (product_frequency_train.loc[product]['frequency'], product))
        if len(heap) > N:
            heapq.heappop(heap) 
            # heapq.heappop(heap): Pop and return the smallest item from the heap, maintaining the heap invariant.
    pid_final_recs = [item[1] for item in heap]
    
    # list of items in final recs found in train set
    sim_recs_train = get_sim_recs_train(pid_final_recs, pid_train) 
    # list of items in final recs found in test set
    sim_recs_test = get_sim_recs_test(pid_final_recs, pid_test) 
    
    # product details of final recs 
    pdetails_final_recs = get_pdetails_final_recs(pid_final_recs)  

    return  pdetails_train,\
            pnames_test, \
            pdetails_final_recs, \
            sim_recs_train, \
            sim_recs_test

### 2. Collaborative Filtering: Item-based
[Back to top navigation](#sections)

For each user:
1. Find his/her top 20 most frequently purchased items
2. Recommend the most similar item to each of those 20 items (based on cosine similarity score)

In [8]:
# retrieve the top 20 products for each user
user_top_20 = get_user_topN_prods(N=20)

In [9]:
# create a dictionary where KEY=product_id and VALUE=index in user_product_matrix
prod_id_dict = {}
for i, id_ in enumerate(sorted(df_train.product_id.unique())):
    prod_id_dict[id_] = i

prod_id_dict_key_list = list(prod_id_dict.keys())
prod_id_dict_val_list = list(prod_id_dict.values())

In [12]:
def cf_item(target_user_index):
    """
    For each of the top 20 most frequently purchased items of the target user, 
        select 10 items with the highest cosine similarity score to it. 
    Then, out of the 200 products, select the 20 with the highest sales to recommend to the target user
    """           
    # get user's top 20 most frequently purchased items
    top_20_prod_list = get_top_N_prod_list(target_user_index, user_top_20)
    
    # get the closest
    pid_recommendations = []
    
    for prod_id in top_20_prod_list:
        prod_index = prod_id_dict[prod_id]
        product_vec = product_user_matrix[prod_index] # target product's vector from sparse matrix
        product_name = products_df.at[prod_id, 'product_name'] # product name

        # cosine similarity vector of target product with matrix of all other product
        cos_sim = cosine_similarity(product_user_matrix, product_vec)

        # for each product, select the 10 items (index) with the highest cosine similarity score
        closest10_pindex = heapq.nlargest(11, range(len(cos_sim)), cos_sim.take)[1:]

        closest10_pids = []
        for pindex in closest10_pindex:
            # get the product id based on product index 
            pid = prod_id_dict_key_list[prod_id_dict_val_list.index(pindex)]
            closest10_pids.append(pid)
        pid_recommendations.extend(closest10_pids)
                                                    
    # From all the products in recommendations, pick the top N popularity (overall sales) to recommend
    heap = []
    for product in list(set(pid_recommendations)):
        heapq.heappush(heap, (product_frequency_train.loc[product]['frequency'], product))
        if len(heap) > 20:
            heapq.heappop(heap) 
    pid_final_recs = [item[1] for item in heap]      
 

    pid_train = get_train_pids(target_user_index)
    pdetails_train = get_train_pdetails(pid_train)
    pid_test = get_test_pids(target_user_index)
    pnames_test = get_test_pnames(pid_test) # list of product names in test set

    sim_recs_train = get_sim_recs_train(pid_final_recs, pid_train) # list of product names
    sim_recs_test = get_sim_recs_test(pid_final_recs, pid_test) # list of product names
    
    pdetails_final_recs = get_pdetails_final_recs(pid_final_recs) 
    
    return pdetails_train, \
            pnames_test, \
            pdetails_final_recs, \
            sim_recs_train, \
            sim_recs_test

### 3. Collaborative Filtering: Matrix factorization model (SVD)
[Back to top navigation](#sections)

In [52]:
def cf_svd(target_user_index, N=20, F=5):
    """
    Finds top N Recommendations (without filtering out previous purchases)
    """
    user_factors, sigma, product_factors = linalg.svds(user_product_matrix, F)
    product_factors = product_factors.T * sigma
    
    # scores of all 16k products for the target user
    scores = np.dot(user_factors[target_user_index], product_factors.T) 

    # get the top N recommendations for target user (unsorted)
    # argpartition explanation: https://stackoverflow.com/a/52465229/13405131
    pindices = np.argpartition(scores, -N)[-N:]

    pid_final_recs = [prod_id_dict_key_list[prod_id_dict_val_list.index(x)] for x in pindices]

    # get product names
    pname = products_df.loc[pid_final_recs]['product_name']

    # sort by scores[best], in descending order
    # returns list of tuples
    sorted(zip(pname, scores[pindices]), key=lambda x: x[1], reverse=True)

    pdetails_final_recs = get_pdetails_final_recs(pid_final_recs) 

    pid_train = get_train_pids(target_user_index)
    pdetails_train = get_train_pdetails(pid_train)
    pid_test = get_test_pids(target_user_index)
    pnames_test = get_test_pnames(pid_test) # list of product names in test set

    # list of items in final recs that are found in user's purchase history (df_train)
    sim_recs_train = get_sim_recs_train(pid_final_recs, pid_train)

    # list of items in final recs that are found in user's purchase history (df_test)
    sim_recs_test = get_sim_recs_test(pid_final_recs, pid_test)

    return pdetails_train, \
        pnames_test, \
        pdetails_final_recs, \
        sim_recs_train, \
        sim_recs_test

### 4. Content-based filtering

[Back to top navigation](#sections)

In [48]:
def content_based(target_user_index, N=20):
    """
    For each of the top 20 most frequently purchased items of the target user, 
        select the 10 items with the highest cosine similarity score (from product-product matrix). 
    Then, out of the resulting 200 products, select the top 20 products with the best sales as recommendations.
    """          
    user_id = user_products_train.loc[target_user_index]['user_id'] # user id corresponding to user index
    # this is a list of list of the user's top 20 products
    top_20_prod_list = user_top_20[user_top_20.index == user_id].values.tolist()
    # flatten the list of list
    top_20_prod_list = [item for lst in top_20_prod_list for item in lst]

    pid_recommendations = []
    
    for prod_id in top_20_prod_list:
        prod_index = prod_id_dict[prod_id]
        product_vec = product_product_matrix[prod_index] # target product's vector from sparse matrix
        product_name = products_df.at[prod_id, 'product_name'] # product name

        # cosine similarity vector of target product with matrix of all other product
        cos_sim = cosine_similarity(product_product_matrix, product_vec)

        # for each product, select 10 items (index) with the highest cosine similarity score
        closest_pindex = heapq.nlargest(11, range(len(cos_sim)), cos_sim.take)[1:]

        closest_pids = []
        for pindex in closest_pindex:
            # get the product id based on product index 
            pid = prod_id_dict_key_list[prod_id_dict_val_list.index(pindex)]
            closest_pids.append(pid)
        pid_recommendations.extend(closest_pids)
                   
    # amongst all the products in recommendations, pick the top N products in overall sales to recommend
    heap = []
    for product in list(set(pid_recommendations)):
        heapq.heappush(heap, (product_frequency_train.loc[product]['frequency'], product))
        if len(heap) > N:
            heapq.heappop(heap) 
    pid_final_recs = [item[1] for item in heap]      
 
    pid_train = get_train_pids(target_user_index)
    pdetails_train = get_train_pdetails(pid_train)
    pid_test = get_test_pids(target_user_index)
    pnames_test = get_test_pnames(pid_test) # list of product names in test set

    sim_recs_train = get_sim_recs_train(pid_final_recs, pid_train) # list of product names
    sim_recs_test = get_sim_recs_test(pid_final_recs, pid_test) # list of product names
    
    pdetails_final_recs = get_pdetails_final_recs(pid_final_recs) 
    
    return pdetails_train, \
            pnames_test, \
            pdetails_final_recs, \
            sim_recs_train, \
            sim_recs_test

### Popularity model (baseline)

[Back to top navigation](#sections)

The popularity model is a commonly used baseline – it simply recommends the most popular items that the user has not previously consumed. As the popularity accounts for the "wisdom of the crowds", it usually provides good recommendations that are generally interesting for most people.

#### Retrieve the top 20 items based on overall sales

In [21]:
# list of product IDs of the top 20 most frequently bought items
top20_popular = product_frequency_train.head(20).index.tolist()

In [22]:
products_df.loc[top20_popular]

Unnamed: 0_level_0,product_name,aisle_id,department_id,department,aisle,organic
product_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
24852,Banana,24,4,produce,fresh fruits,0
13176,Bag of Organic Bananas,24,4,produce,fresh fruits,1
21903,Organic Baby Spinach,123,4,produce,packaged vegetables fruits,1
47209,Organic Hass Avocado,24,4,produce,fresh fruits,1
21137,Organic Strawberries,24,4,produce,fresh fruits,1
47766,Organic Avocado,24,4,produce,fresh fruits,1
47626,Large Lemon,24,4,produce,fresh fruits,0
27966,Organic Raspberries,123,4,produce,packaged vegetables fruits,1
16797,Strawberries,24,4,produce,fresh fruits,0
26209,Limes,24,4,produce,fresh fruits,0


In [23]:
def popularity_model(target_user_index):   
    pid_train = get_train_pids(target_user_index)
    pid_test = get_test_pids(target_user_index)
    pnames_test = get_test_pnames(pid_test) 
    pdetails_train = get_train_pdetails(pid_train)
    pid_final_recs = top20_popular
    pdetails_final_recs = get_pdetails_final_recs(top20_popular) 
    sim_recs_train = get_sim_recs_train(pid_final_recs, pid_train)
    sim_recs_test = get_sim_recs_test(pid_final_recs, pid_test)
       
    return  pdetails_train,\
            pnames_test, \
            pdetails_final_recs, \
            sim_recs_train, \
            sim_recs_test

<a id='metric_calc'></a>
## Calculating RecSys evaluation metrics

[Back to top navigation](#sections)

I will evaluate the models using 3 metrics:

- **Precision@K**: what proportion of the recommended items in the top-K set are relevant (purchased by the user)?
- **Recall@K**: what proportion of the user's actual purchases were in the recommendations?
- **F1 score**: harmonic mean of precision and recall. The highest possible value of F1 is 1, indicating perfect precision and recall, and the lowest possible value is 0, if either the precision or the recall is zero.

To reduce computation time, I will randomly sample only 3,000 users to evaluate my models with.

Precision, recall, and F1 scores will be calculated for the recommendations generated for each of the 3,000 users, after which the averages will be taken.

In [26]:
def precision_at_k(recommended, intersection):
    """
    Inputs:
        - recommended: top-k items returned by recommender
        - intersection: items from recommendation set that are relevant    
    Output:
        - Precision@k = intersection/recommended
    """
    if len(intersection) == 0:
        return 0
    precision = len(intersection) / len(recommended)
    return precision

In [27]:
def recall_at_k(actual, intersection):
    """
    Recall@k is the proportion of relevant items found in the top-k recommendations
    
        - actual: items in target user's purchasing history
        - recommended: top-k items returned by recommender
        - relevant: items from {recommended} that are in {actual}

    Recall@k = (# of recommended items @k that are relevant) / (total # of relevant items)
    """
    if len(intersection) == 0:
        return 0
    
    recall = len(intersection) / len(actual)
    
    return recall

In [37]:
eval_df = pd.read_pickle('../datasets/eval_df.pkl')

In [30]:
# create a new dataframe to store our evaluation metrics
eval_df = user_products_train[['user_id']].sample(n=3000)

In [None]:
# eval_df_original = eval_df.copy()

In [None]:
# eval_df = eval_df_original.copy() # if i ever need to restart my evaluations

In [32]:
# create columns to store our evaluation scores
eval_df['CF_user_based_precision_train'] = 0
eval_df['CF_user_based_precision_test'] = 0
eval_df['CF_user_based_recall_train'] = 0
eval_df['CF_user_based_recall_test'] = 0
eval_df['CF_user_based_F1_train'] = 0
eval_df['CF_user_based_F1_test'] = 0

eval_df['CF_item_based_precision_train'] = 0
eval_df['CF_item_based_precision_test'] = 0
eval_df['CF_item_based_recall_train'] = 0
eval_df['CF_item_based_recall_test'] = 0
eval_df['CF_item_based_F1_train'] = 0
eval_df['CF_item_based_F1_test'] = 0

eval_df['CF_SVD_precision_train'] = 0
eval_df['CF_SVD_precision_test'] = 0
eval_df['CF_SVD_recall_train'] = 0
eval_df['CF_SVD_recall_test'] = 0
eval_df['CF_SVD_F1_train'] = 0
eval_df['CF_SVD_F1_test'] = 0

eval_df['baseline_precision_train'] = 0
eval_df['baseline_precision_test'] = 0
eval_df['baseline_recall_train'] = 0
eval_df['baseline_recall_test'] = 0
eval_df['baseline_F1_train'] = 0
eval_df['baseline_F1_test'] = 0

eval_df['content_based_precision_train'] = 0
eval_df['content_based_precision_test'] = 0
eval_df['content_based_recall_train'] = 0
eval_df['content_based_recall_test'] = 0
eval_df['content_based_F1_train'] = 0
eval_df['content_based_F1_test'] = 0

In [34]:
eval_df.head()

Unnamed: 0,user_id,CF_user_based_precision_train,CF_user_based_precision_test,CF_user_based_recall_train,CF_user_based_recall_test,CF_user_based_F1_train,CF_user_based_F1_test,CF_item_based_precision_train,CF_item_based_precision_test,CF_item_based_recall_train,...,baseline_recall_train,baseline_recall_test,baseline_F1_train,baseline_F1_test,content_based_precision_train,content_based_precision_test,content_based_recall_train,content_based_recall_test,content_based_F1_train,content_based_F1_test
20066,98859,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4928,24580,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020,10301,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
14446,71435,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
24865,122893,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [35]:
# Function to evaluate model
def model_eval(target_user_index, model):
    """
    For a given model and target user, calculate Precision@K, Recall@K, F1 score
    """
    # create inputs to feed into the model
    if model == cf_svd:
        pdetails_train, pnames_test, pdetails_final_recs, sim_recs_train, sim_recs_test \
        = model(target_user_index)
    else:
        target_user_vec = user_product_matrix[target_user_index]
        cos_sim = cosine_similarity(user_product_matrix, target_user_vec)
        # get outputs from model
        pdetails_train, pnames_test, pdetails_final_recs, sim_recs_train, sim_recs_test \
        = model(target_user_index)    
    
    # compute evaluation scores
    precision_train = precision_at_k(pdetails_final_recs, sim_recs_train)
    precision_test = precision_at_k(pdetails_final_recs, sim_recs_test)
    recall_train = recall_at_k(pdetails_train, sim_recs_train)
    recall_test = recall_at_k(pnames_test, sim_recs_test)    
    # f1 score
    try:
        f1_train = 2 * precision_train * recall_train / (precision_train + recall_train)
        f1_test = 2 * precision_test * recall_test / (precision_test + recall_test)
    except ZeroDivisionError:
        f1_train = 0
        f1_test = 0
    
    return precision_train, precision_test, recall_train, recall_test, f1_train, f1_test

In [None]:
# evaluation for USER-BASED CF
for i in eval_df.index:
    eval_df.loc[eval_df.index == i, 'CF_user_based_precision_train'], \
    eval_df.loc[eval_df.index == i, 'CF_user_based_precision_test'],\
    eval_df.loc[eval_df.index == i, 'CF_user_based_recall_train'],\
    eval_df.loc[eval_df.index == i, 'CF_user_based_recall_test'],\
    eval_df.loc[eval_df.index == i, 'CF_user_based_F1_train'],\
    eval_df.loc[eval_df.index == i, 'CF_user_based_F1_test'],\
    = model_eval(i, cf_user)

In [None]:
# evaluation for ITEM-BASED CF
for i in eval_df.index:
    eval_df.loc[eval_df.index == i, 'CF_item_based_precision_train'], \
    eval_df.loc[eval_df.index == i, 'CF_item_based_precision_test'],\
    eval_df.loc[eval_df.index == i, 'CF_item_based_recall_train'],\
    eval_df.loc[eval_df.index == i, 'CF_item_based_recall_test'],\
    eval_df.loc[eval_df.index == i, 'CF_item_based_F1_train'],\
    eval_df.loc[eval_df.index == i, 'CF_item_based_F1_test'],\
    = model_eval(i, cf_item)

In [None]:
# evaluation for SVD CF
for i in eval_df.index:
    eval_df.loc[eval_df.index == i, 'CF_SVD_precision_train'], \
    eval_df.loc[eval_df.index == i, 'CF_SVD_precision_test'],\
    eval_df.loc[eval_df.index == i, 'CF_SVD_recall_train'],\
    eval_df.loc[eval_df.index == i, 'CF_SVD_recall_test'],\
    eval_df.loc[eval_df.index == i, 'CF_SVD_F1_train'],\
    eval_df.loc[eval_df.index == i, 'CF_SVD_F1_test'],\
    = model_eval(i, cf_svd)

In [None]:
# evaluation for POPULARITY MODEL (baseline)
for i in eval_df.index:
    eval_df.loc[eval_df.index == i, 'baseline_precision_train'], \
    eval_df.loc[eval_df.index == i, 'baseline_precision_test'],\
    eval_df.loc[eval_df.index == i, 'baseline_recall_train'],\
    eval_df.loc[eval_df.index == i, 'baseline_recall_test'],\
    eval_df.loc[eval_df.index == i, 'baseline_F1_train'],\
    eval_df.loc[eval_df.index == i, 'baseline_F1_test'],\
    = model_eval(i, popularity_model)

In [None]:
# evaluation for CONTENT-BASED
for i in eval_df.index:
    eval_df.loc[eval_df.index == i, 'content_based_precision_train'], \
    eval_df.loc[eval_df.index == i, 'content_based_precision_test'],\
    eval_df.loc[eval_df.index == i, 'content_based_recall_train'],\
    eval_df.loc[eval_df.index == i, 'content_based_recall_test'],\
    eval_df.loc[eval_df.index == i, 'content_based_F1_train'],\
    eval_df.loc[eval_df.index == i, 'content_based_F1_test'],\
    = model_eval(i, content_based)

In [None]:
eval_df.to_pickle('../datasets/eval_df.pkl')

In [None]:
eval_df

In [38]:
# reshaping the dataframe for easier metric comparison
eval_df_temp = pd.DataFrame(np.mean(eval_df.drop(['user_id'], axis=1)))
eval_df_temp.rename(columns={0: 'score'}, inplace=True)

eval_df_temp['metric'] = 0
eval_df_temp['model'] = 0
eval_df_temp['eval_set'] = 0

eval_df_temp['metric'] = eval_df_temp['metric'].where(~eval_df_temp.index.str.contains('precision'), 'precision')
eval_df_temp['metric'] = eval_df_temp['metric'].where(~eval_df_temp.index.str.contains('recall'), 'recall')
eval_df_temp['metric'] = eval_df_temp['metric'].where(~eval_df_temp.index.str.contains('F1'), 'F1')
eval_df_temp['model'] = eval_df_temp['model'].where(~eval_df_temp.index.str.contains('CF_user_based'), 'CF user-based')
eval_df_temp['model'] = eval_df_temp['model'].where(~eval_df_temp.index.str.contains('CF_item_based'), 'CF item-based')
eval_df_temp['model'] = eval_df_temp['model'].where(~eval_df_temp.index.str.contains('SVD'), 'CF SVD')
eval_df_temp['model'] = eval_df_temp['model'].where(~eval_df_temp.index.str.contains('baseline'), 'baseline')
eval_df_temp['model'] = eval_df_temp['model'].where(~eval_df_temp.index.str.contains('content'), 'content-based')
eval_df_temp['eval_set'] = eval_df_temp['eval_set'].where(~eval_df_temp.index.str.contains('train'), 'train')
eval_df_temp['eval_set'] = eval_df_temp['eval_set'].where(~eval_df_temp.index.str.contains('test'), 'test')

In [129]:
cm = sns.light_palette("orange", as_cmap=True)
eval_df_pivot = pd.pivot_table(eval_df_temp, values='score', index=['model', 'eval_set'], columns=['metric'], aggfunc=np.sum).reset_index()
# eval_df_pivot = eval_df_pivot[eval_df_pivot['model'] != 'content-based']
eval_df_pivot.sort_values(['model', 'eval_set'], ascending=False, inplace=True)
eval_df_pivot.to_pickle('../datasets/eval_df_pivot.pkl') # pickle for external usage
eval_df_pivot = eval_df_pivot.style.background_gradient(cmap=cm)

<a id='performance_eval'></a>
### Reviewing RecSys performance 
[Back to top navigation](#sections)

In [127]:
eval_df_pivot

metric,model,eval_set,F1,precision,recall
9,content-based,train,0.045174,0.116083,0.035937
8,content-based,test,0.046903,0.068783,0.03888
7,baseline,train,0.1081,0.264833,0.074087
6,baseline,test,0.114044,0.173317,0.09157
5,CF user-based,train,0.111239,0.270417,0.076457
4,CF user-based,test,0.117076,0.176683,0.094254
3,CF item-based,train,0.12249,0.289817,0.084993
2,CF item-based,test,0.129298,0.1922,0.10591
1,CF SVD,train,0.09125,0.22485,0.062312
0,CF SVD,test,0.104394,0.158917,0.084537


Looking at all 3 scores, we can see that only the user-based and item-based collaborative filtering methods outperformed the baseline popularity model. And between the two, the item-based CF model fared slightly better. However, we should note the high score discrepancies between the train and test sets, which suggest that the models are heavily overfitting.

On the flip side, the content-based filtering method performed much worse than the baseline. This is likely due to the limited item features available. I had only included the aisle and department name of the product as item features, and those keywords alone are insufficient to characterise our items. An attempt was made to extract additional item features from the product name using CountVectorizer, but it did not produce any noticeable difference in the results. 

As for the SVD model, its poor performance could be due to a data sparsity problem – our user-item utility matrix was 99.5% sparse, and that could have negatively affected the SVD model's ability to extract latent features from it. 

<a id='discussion'></a>

## Concluding remarks
[Back to top navigation](#sections)

In this project, I tested out 3 iterations of the collaborative filtering recommender system and 1 iteration of the content-based recommender system. It was interesting to see how different systems generate different sets of recommendations and how well they perform on average, but I would like to emphasise that it is not prudent to make a judgement on the recommender systems' performance based on these metrics alone. The more pragmatic way to test the performance of recommender systems would be to actually track the actual click-through and conversion rates of the recommendations, which is not possible in this instance.

Furthermore, the way I had generated and re-ranked potential recommendations for each model, e.g. selecting 20 nearest neighbours for user-based collaborative filtering, was highly arbitrary. Selecting different values for the k-nearest neighbours and number of similar items to form my potential recommendation set could greatly alter the performance of the models. Similarly, re-ranking the recommendations (for user-based and item-based CF) in a different way, e.g. by the number of purchases from similar customers, instead of using total sales volume, could change the final recommendation set significantly.

Another factor that could impact the performance of the non-MF models is the way we normalise the utility matrix. I had de-meaned the matrix by doing a row-wise subtraction of the average, but another way to do it is to use the TfidfVectorizer from scikit-learn to normalize the matrix values. We could also represent the matrix as a binary matrix, i.e. 1 if the user has purchased the item before, else 0.

Moving beyond the algorithms tested, I would also like to highlight these additional qualitative points of consideration since product recommendation strategies require a multi-pronged approach.

### Recommendations need to be made more contextual

The personalised recommendations generated from my implemented recommender systems only form the general "Recommended For You" layer of recommendations. In reality, e-commerce retailers employ different product recommendation strategies for different pages on the site, such as:

- **Homepage**: "Recently viewed" / "Buy it again" / "Recommended For You"


- **Category pages**: "Most Popular in Category" / "Recommended For You" (category-specific)


- **Product detail pages (PDPs)**: "Similar Products" / "Often Bought Together" 


- **Cart pages**: "Often Bought Together" (showcasing products that are slightly cheaper than those in a user’s cart can lead to quick purchase decisions)


- **Search results page**: Results returned from search queries can also be considered a form of recommendation. These items are ranked by probability of purchase.


### More granularity needed to truly personalise the recommendations

- We wouldn't want to recommend non-vegan items to a vegan customer.


- Consumers who have high average order values can be recommended more highly profitable items in order to maximise revenue.


- We don't want to recommend items from the user's most recent basket, especially for items that are not weekly purchase items.

### Consider diversity of recommendations as well

- Because our tested recommendation systems are biased towards recommending items that have relatively high sales volume, they are unable to surface truly novel items that have not been discovered by many other people.


- We can improve the diversity of recommendations by recommending these long-tail items to increase the novelty factor for the user.


- Include recommendations from stores customers may have never shopped from previously. This is appropriate for customers who have a high unique-items-to-total-items ratio.


<a id='case_study'></a>
## Case study using one randomly selected user
[Back to top navigation](#sections)

In [101]:
# randomly select a number from total number of users available
target_user_index = random.choice(user_products_train.index)
print(target_user_index)

11204


### Test run of user-based CF model

In [102]:
pdetails_train, pnames_test, pdetails_final_recs, sim_recs_train, sim_recs_test \
= cf_user(target_user_index)

In [113]:
print_results(target_user_index, pdetails_train, pnames_test, pdetails_final_recs, sim_recs_train, sim_recs_test, 'User-based CF')

_Model used: User-based CF_

### Top 20 recommendations for user 55466

Organic products: **70.0%**

Unnamed: 0_level_0,product_name,department,organic
product_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
24852,Banana,produce,0
13176,Bag of Organic Bananas,produce,1
21903,Organic Baby Spinach,produce,1
47209,Organic Hass Avocado,produce,1
21137,Organic Strawberries,produce,1
47766,Organic Avocado,produce,1
47626,Large Lemon,produce,0
27966,Organic Raspberries,produce,1
16797,Strawberries,produce,0
26209,Limes,produce,0





**Recommended items that were in user's past purchases (first 15 of last 20 orders):**

['Limes', 'Organic Garlic', 'Organic Whole Milk', 'Organic Hass Avocado', 'Large Lemon', 'Organic Zucchini', 'Organic Strawberries', 'Banana', 'Apple Honeycrisp Organic', 'Organic Yellow Onion']

**Recommended items that were in user's last 5 orders:**

['Limes', 'Organic Garlic', 'Organic Whole Milk', 'Organic Hass Avocado', 'Large Lemon', 'Organic Fuji Apple', 'Organic Strawberries', 'Banana', 'Bag of Organic Bananas']

**All items in user's last 5 orders:**

['Large Alfresco Eggs', 'Organic Green Butter Lettuce', 'Organic Grape Tomatoes', 'Organic Thyme', 'Organic Garlic', 'Large Lemon', 'Chopped Onions', 'Organic Strawberries', 'Taco Seasoning', 'Crescent Rolls', 'Banana', 'Feta Cheese Crumbles', 'Citrus Mandarins Organic', 'Organic Fuji Red Apple Chips', 'Organic Strawberry Smoothie', 'Organic Fuji Apple', 'Chopped Walnuts', '1% Milkfat Low Fat Buttermilk', 'Organic Chicken Bone Broth', 'Green Bell Pepper', 'Original Orange Juice', 'Oyster Crackers', 'Organic Whole Milk', 'Pizza Poppers, Three Cheese', 'Lemonade', 'Smoked Ham', 'Thin Crust Pepperoni Pizza', 'Backyard Barbeque Potato Chips', 'Watermelon Chunks', 'Organic Tomato Cluster', 'Organic Russet Potato', 'Organic Red Potato', 'Limes', 'Organic Miso Broth', 'Shaved Parmesan', 'Organic Hass Avocado', 'Organic Unsweetened Almond Milk', 'Fresh Mozzarella', 'Organic Baby Arugula', 'Red Wax Gouda', 'Blue Cheese Crumbles', 'Honey Nut Cheerios', 'Bag of Organic Bananas', 'Organic Carrot Bunch', 'Red Peppers', 'Purity Farms Ghee Clarified Butter', 'French Loaf']




### User's purchase history (first 15 of last 20 orders)




Total unique items bought: **109**

Organic products: **35.78%**




#### Top 5 departments

Unnamed: 0,department
produce,31
pantry,21
dairy eggs,12
snacks,7
bakery,7





#### Products previously purchased

product_name,department,organic
Large Lemon,produce,0
"Cookies, Chocolate Chip Walnut",snacks,0
Organic Spinach And Cheese Ravioli,dry goods pasta,1
Organic Black Beans,canned goods,1
Coconut Almond Minis Frozen Dessert Bars,frozen,0
Organic Garnet Sweet Potato (Yam),produce,1
Mild Diced Green Chiles,canned goods,0
Sliced Pepperoncini,pantry,0
Dijon Mustard,pantry,0
Lightly Salted Baked Snap Pea Crisps,snacks,0


### Test run of item-based CF

In [104]:
pdetails_train, pnames_test, pdetails_final_recs, sim_recs_train, sim_recs_test \
= cf_item(target_user_index)

In [105]:
print_results(target_user_index, pdetails_train, pnames_test, pdetails_final_recs, sim_recs_train, sim_recs_test, 'Item-based CF')

_Model used: Item-based CF_

### Top 20 recommendations for user 55466

Organic products: **75.0%**

Unnamed: 0_level_0,product_name,department,organic
product_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
22825,Organic D'Anjou Pears,produce,1
22035,Organic Whole String Cheese,dairy eggs,1
30391,Organic Cucumber,produce,1
45066,Honeycrisp Apple,produce,0
8277,Apple Honeycrisp Organic,produce,1
19057,Organic Large Extra Fancy Fuji Apple,produce,1
5876,Organic Lemon,produce,1
28204,Organic Fuji Apple,produce,1
16797,Strawberries,produce,0
22935,Organic Yellow Onion,produce,1





**Recommended items that were in user's past purchases (last 20 orders excluding last 5):**

['Organic Whole Milk', 'Organic Hass Avocado', 'Large Lemon', 'Organic Strawberries', 'Banana', 'Apple Honeycrisp Organic', 'Organic Yellow Onion']

**Recommended items that were in user's last 5 orders:**

['Organic Whole Milk', 'Organic Hass Avocado', 'Large Lemon', 'Organic Fuji Apple', 'Organic Strawberries', 'Banana', 'Bag of Organic Bananas']

**All items in user's last 5 orders:**

['Large Alfresco Eggs', 'Organic Green Butter Lettuce', 'Organic Grape Tomatoes', 'Organic Thyme', 'Organic Garlic', 'Large Lemon', 'Chopped Onions', 'Organic Strawberries', 'Taco Seasoning', 'Crescent Rolls', 'Banana', 'Feta Cheese Crumbles', 'Citrus Mandarins Organic', 'Organic Fuji Red Apple Chips', 'Organic Strawberry Smoothie', 'Organic Fuji Apple', 'Chopped Walnuts', '1% Milkfat Low Fat Buttermilk', 'Organic Chicken Bone Broth', 'Green Bell Pepper', 'Original Orange Juice', 'Oyster Crackers', 'Organic Whole Milk', 'Pizza Poppers, Three Cheese', 'Lemonade', 'Smoked Ham', 'Thin Crust Pepperoni Pizza', 'Backyard Barbeque Potato Chips', 'Watermelon Chunks', 'Organic Tomato Cluster', 'Organic Russet Potato', 'Organic Red Potato', 'Limes', 'Organic Miso Broth', 'Shaved Parmesan', 'Organic Hass Avocado', 'Organic Unsweetened Almond Milk', 'Fresh Mozzarella', 'Organic Baby Arugula', 'Red Wax Gouda', 'Blue Cheese Crumbles', 'Honey Nut Cheerios', 'Bag of Organic Bananas', 'Organic Carrot Bunch', 'Red Peppers', 'Purity Farms Ghee Clarified Butter', 'French Loaf']




### User's purchase history (last 20 orders excluding last 5)




Total unique items bought: **109**

Organic products: **35.78%**




#### Top 5 departments

Unnamed: 0,department
produce,31
pantry,21
dairy eggs,12
snacks,7
bakery,7





#### Products previously purchased

product_name,department,organic
Large Lemon,produce,0
"Cookies, Chocolate Chip Walnut",snacks,0
Organic Spinach And Cheese Ravioli,dry goods pasta,1
Organic Black Beans,canned goods,1
Coconut Almond Minis Frozen Dessert Bars,frozen,0
Organic Garnet Sweet Potato (Yam),produce,1
Mild Diced Green Chiles,canned goods,0
Sliced Pepperoncini,pantry,0
Dijon Mustard,pantry,0
Lightly Salted Baked Snap Pea Crisps,snacks,0


### Test run of CF SVD model

In [106]:
pdetails_train, pnames_test, pdetails_final_recs, sim_recs_train, sim_recs_test \
= cf_svd(target_user_index)

In [107]:
print_results(target_user_index, pdetails_train, pnames_test, pdetails_final_recs, sim_recs_train, sim_recs_test, 'SVD')

_Model used: SVD_

### Top 20 recommendations for user 55466

Organic products: **75.0%**

Unnamed: 0_level_0,product_name,department,organic
product_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
30489,Original Hummus,deli,0
45066,Honeycrisp Apple,produce,0
45007,Organic Zucchini,produce,1
5785,Organic Reduced Fat 2% Milk,dairy eggs,1
44632,Sparkling Water Grapefruit,beverages,0
5876,Organic Lemon,produce,1
28204,Organic Fuji Apple,produce,1
21903,Organic Baby Spinach,produce,1
27086,Half & Half,dairy eggs,0
19057,Organic Large Extra Fancy Fuji Apple,produce,1





**Recommended items that were in user's past purchases (last 20 orders excluding last 5):**

['Organic Whole Milk', 'Organic Hass Avocado', 'Organic Zucchini', 'Organic Strawberries', 'Banana', 'Apple Honeycrisp Organic']

**Recommended items that were in user's last 5 orders:**

['Organic Whole Milk', 'Organic Hass Avocado', 'Organic Fuji Apple', 'Organic Strawberries', 'Banana']

**All items in user's last 5 orders:**

['Large Alfresco Eggs', 'Organic Green Butter Lettuce', 'Organic Grape Tomatoes', 'Organic Thyme', 'Organic Garlic', 'Large Lemon', 'Chopped Onions', 'Organic Strawberries', 'Taco Seasoning', 'Crescent Rolls', 'Banana', 'Feta Cheese Crumbles', 'Citrus Mandarins Organic', 'Organic Fuji Red Apple Chips', 'Organic Strawberry Smoothie', 'Organic Fuji Apple', 'Chopped Walnuts', '1% Milkfat Low Fat Buttermilk', 'Organic Chicken Bone Broth', 'Green Bell Pepper', 'Original Orange Juice', 'Oyster Crackers', 'Organic Whole Milk', 'Pizza Poppers, Three Cheese', 'Lemonade', 'Smoked Ham', 'Thin Crust Pepperoni Pizza', 'Backyard Barbeque Potato Chips', 'Watermelon Chunks', 'Organic Tomato Cluster', 'Organic Russet Potato', 'Organic Red Potato', 'Limes', 'Organic Miso Broth', 'Shaved Parmesan', 'Organic Hass Avocado', 'Organic Unsweetened Almond Milk', 'Fresh Mozzarella', 'Organic Baby Arugula', 'Red Wax Gouda', 'Blue Cheese Crumbles', 'Honey Nut Cheerios', 'Bag of Organic Bananas', 'Organic Carrot Bunch', 'Red Peppers', 'Purity Farms Ghee Clarified Butter', 'French Loaf']




### User's purchase history (last 20 orders excluding last 5)




Total unique items bought: **109**

Organic products: **35.78%**




#### Top 5 departments

Unnamed: 0,department
produce,31
pantry,21
dairy eggs,12
snacks,7
bakery,7





#### Products previously purchased

product_name,department,organic
Large Lemon,produce,0
"Cookies, Chocolate Chip Walnut",snacks,0
Organic Spinach And Cheese Ravioli,dry goods pasta,1
Organic Black Beans,canned goods,1
Coconut Almond Minis Frozen Dessert Bars,frozen,0
Organic Garnet Sweet Potato (Yam),produce,1
Mild Diced Green Chiles,canned goods,0
Sliced Pepperoncini,pantry,0
Dijon Mustard,pantry,0
Lightly Salted Baked Snap Pea Crisps,snacks,0


### Test run of content-based filtering method

In [108]:
pdetails_train, pnames_test, pdetails_final_recs, sim_recs_train, sim_recs_test \
= content_based(target_user_index)

In [109]:
print_results(target_user_index, pdetails_train, pnames_test, pdetails_final_recs, sim_recs_train, sim_recs_test, 'Content-based recommender')

_Model used: Content-based recommender_

### Top 20 recommendations for user 55466

Organic products: **35.0%**

Unnamed: 0_level_0,product_name,department,organic
product_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
283,Organic Whole Wheat Elbows,dry goods pasta,1
343,Organic Whole Peeled Baby Carrots,produce,1
79,Wild Albacore Tuna No Salt Added,canned goods,0
210,Homemade Hot Arrabbiata Fra Diavolo Sauce,dry goods pasta,0
2537,Tomato Basil Marinara Sauce,dry goods pasta,0
1117,Turkey Bacon,meat seafood,0
2361,Mint Chip,frozen,0
397,Pure Granulated Cane Sugar,pantry,0
148,Nectarines,produce,0
3397,Organic Lightly Salted Sea Salt Thin & Crispy Restaurant Style Tortilla Chips,snacks,1





**Recommended items that were in user's past purchases (last 20 orders excluding last 5):**

['Watermelon Chunks', 'Organic Lightly Salted Sea Salt Thin & Crispy Restaurant Style Tortilla Chips']

**Recommended items that were in user's last 5 orders:**

['Watermelon Chunks']

**All items in user's last 5 orders:**

['Large Alfresco Eggs', 'Organic Green Butter Lettuce', 'Organic Grape Tomatoes', 'Organic Thyme', 'Organic Garlic', 'Large Lemon', 'Chopped Onions', 'Organic Strawberries', 'Taco Seasoning', 'Crescent Rolls', 'Banana', 'Feta Cheese Crumbles', 'Citrus Mandarins Organic', 'Organic Fuji Red Apple Chips', 'Organic Strawberry Smoothie', 'Organic Fuji Apple', 'Chopped Walnuts', '1% Milkfat Low Fat Buttermilk', 'Organic Chicken Bone Broth', 'Green Bell Pepper', 'Original Orange Juice', 'Oyster Crackers', 'Organic Whole Milk', 'Pizza Poppers, Three Cheese', 'Lemonade', 'Smoked Ham', 'Thin Crust Pepperoni Pizza', 'Backyard Barbeque Potato Chips', 'Watermelon Chunks', 'Organic Tomato Cluster', 'Organic Russet Potato', 'Organic Red Potato', 'Limes', 'Organic Miso Broth', 'Shaved Parmesan', 'Organic Hass Avocado', 'Organic Unsweetened Almond Milk', 'Fresh Mozzarella', 'Organic Baby Arugula', 'Red Wax Gouda', 'Blue Cheese Crumbles', 'Honey Nut Cheerios', 'Bag of Organic Bananas', 'Organic Carrot Bunch', 'Red Peppers', 'Purity Farms Ghee Clarified Butter', 'French Loaf']




### User's purchase history (last 20 orders excluding last 5)




Total unique items bought: **109**

Organic products: **35.78%**




#### Top 5 departments

Unnamed: 0,department
produce,31
pantry,21
dairy eggs,12
snacks,7
bakery,7





#### Products previously purchased

product_name,department,organic
Large Lemon,produce,0
"Cookies, Chocolate Chip Walnut",snacks,0
Organic Spinach And Cheese Ravioli,dry goods pasta,1
Organic Black Beans,canned goods,1
Coconut Almond Minis Frozen Dessert Bars,frozen,0
Organic Garnet Sweet Potato (Yam),produce,1
Mild Diced Green Chiles,canned goods,0
Sliced Pepperoncini,pantry,0
Dijon Mustard,pantry,0
Lightly Salted Baked Snap Pea Crisps,snacks,0


### Baseline: Popularity model

In [110]:
pdetails_target_user_train, pnames_target_user_test, pdetails_final_recs, sim_recs_train, sim_recs_test \
= popularity_model(target_user_index)

In [111]:
print_results(target_user_index, pdetails_target_user_train, pnames_target_user_test, pdetails_final_recs, sim_recs_train, sim_recs_test, 'Popularity model')

_Model used: Popularity model_

### Top 20 recommendations for user 55466

Organic products: **70.0%**

Unnamed: 0_level_0,product_name,department,organic
product_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
24852,Banana,produce,0
13176,Bag of Organic Bananas,produce,1
21903,Organic Baby Spinach,produce,1
47209,Organic Hass Avocado,produce,1
21137,Organic Strawberries,produce,1
47766,Organic Avocado,produce,1
47626,Large Lemon,produce,0
27966,Organic Raspberries,produce,1
16797,Strawberries,produce,0
26209,Limes,produce,0





**Recommended items that were in user's past purchases (last 20 orders excluding last 5):**

['Limes', 'Organic Garlic', 'Organic Whole Milk', 'Organic Hass Avocado', 'Large Lemon', 'Organic Zucchini', 'Organic Strawberries', 'Banana', 'Apple Honeycrisp Organic', 'Organic Yellow Onion']

**Recommended items that were in user's last 5 orders:**

['Limes', 'Organic Garlic', 'Organic Whole Milk', 'Organic Hass Avocado', 'Large Lemon', 'Organic Fuji Apple', 'Organic Strawberries', 'Banana', 'Bag of Organic Bananas']

**All items in user's last 5 orders:**

['Large Alfresco Eggs', 'Organic Green Butter Lettuce', 'Organic Grape Tomatoes', 'Organic Thyme', 'Organic Garlic', 'Large Lemon', 'Chopped Onions', 'Organic Strawberries', 'Taco Seasoning', 'Crescent Rolls', 'Banana', 'Feta Cheese Crumbles', 'Citrus Mandarins Organic', 'Organic Fuji Red Apple Chips', 'Organic Strawberry Smoothie', 'Organic Fuji Apple', 'Chopped Walnuts', '1% Milkfat Low Fat Buttermilk', 'Organic Chicken Bone Broth', 'Green Bell Pepper', 'Original Orange Juice', 'Oyster Crackers', 'Organic Whole Milk', 'Pizza Poppers, Three Cheese', 'Lemonade', 'Smoked Ham', 'Thin Crust Pepperoni Pizza', 'Backyard Barbeque Potato Chips', 'Watermelon Chunks', 'Organic Tomato Cluster', 'Organic Russet Potato', 'Organic Red Potato', 'Limes', 'Organic Miso Broth', 'Shaved Parmesan', 'Organic Hass Avocado', 'Organic Unsweetened Almond Milk', 'Fresh Mozzarella', 'Organic Baby Arugula', 'Red Wax Gouda', 'Blue Cheese Crumbles', 'Honey Nut Cheerios', 'Bag of Organic Bananas', 'Organic Carrot Bunch', 'Red Peppers', 'Purity Farms Ghee Clarified Butter', 'French Loaf']




### User's purchase history (last 20 orders excluding last 5)




Total unique items bought: **109**

Organic products: **35.78%**




#### Top 5 departments

Unnamed: 0,department
produce,31
pantry,21
dairy eggs,12
snacks,7
bakery,7





#### Products previously purchased

product_name,department,organic
Large Lemon,produce,0
"Cookies, Chocolate Chip Walnut",snacks,0
Organic Spinach And Cheese Ravioli,dry goods pasta,1
Organic Black Beans,canned goods,1
Coconut Almond Minis Frozen Dessert Bars,frozen,0
Organic Garnet Sweet Potato (Yam),produce,1
Mild Diced Green Chiles,canned goods,0
Sliced Pepperoncini,pantry,0
Dijon Mustard,pantry,0
Lightly Salted Baked Snap Pea Crisps,snacks,0
