### Recommending items with item-based collaborative filtering


#### Import the Python modules we will use in our recommender system

In [1]:
import numpy as np
import pandas as pd

#### Import data

Read the HealthyFoodStore dataset into a pandas dataframe

In [2]:
filename = 'HealthyFoodStore2017.xlsx'
df = pd.read_excel(filename, sheet_name='Data')

Aggregate sales per item for each customer

In [3]:
# Sum only the Sales column so non-numeric columns do not interfere with the aggregation
df_grouped = df.groupby(['Customer_ID', 'Item'], as_index=False)['Sales'].sum()

We pivot the dataset so that each row represents one unique customer and each column one item, which gives us a sparse customer-by-item matrix. Where a customer has not purchased an item the pivot produces NaN, so we use the fillna method to replace those values with 0.

In [4]:
df_pivoted = df_grouped.pivot_table(index='Customer_ID', columns='Item', values='Sales').fillna(0)

Let's look at the first five rows of the pivoted dataframe. A value of 0 indicates the customer has not purchased the item. We see that Customer ID AA-1 has purchased several different items in our store.

In [6]:
df_pivoted.head(5)

| Customer_ID | Aloe Vera | Broccoli Powder | Detox Green Tea | Energy bar White Chocolate and Macadamia Nut | Fusion Spice Red Tea | Ginger Lemon Tea | Grounded Garlic & Ginger | HealthSmart Foods Chocolite Protein, French Vanilla | Muscle Combat crunch (Chocolate chip) | Oh Yeah!, Nutritional Shake, Chocolate Milkshake | Power bar - Banana Strawberry | Sprirulina | Wheat Grass |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AA-1 | 65.0 | 55.0 | 20.0 | 5.0 | 10.0 | 15.0 | 35.0 | 5.0 | 5.0 | 0.0 | 0.0 | 35.0 | 35.0 |
| AA-10 | 40.0 | 50.0 | 10.0 | 5.0 | 20.0 | 15.0 | 45.0 | 0.0 | 5.0 | 0.0 | 0.0 | 30.0 | 60.0 |
| AA-11 | 55.0 | 50.0 | 15.0 | 5.0 | 25.0 | 15.0 | 50.0 | 5.0 | 0.0 | 5.0 | 0.0 | 45.0 | 55.0 |
| AA-12 | 55.0 | 85.0 | 15.0 | 5.0 | 15.0 | 10.0 | 35.0 | 5.0 | 0.0 | 0.0 | 0.0 | 60.0 | 30.0 |
| AA-13 | 10.0 | 10.0 | 35.0 | 5.0 | 45.0 | 65.0 | 20.0 | 15.0 | 10.0 | 0.0 | 10.0 | 15.0 | 15.0 |
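
To make the pivot step concrete, here is a minimal sketch on made-up data. The `toy` dataframe and its values are hypothetical and only mirror the column names used above (`Customer_ID`, `Item`, `Sales`); they are not taken from the HealthyFoodStore dataset.

```python
import pandas as pd

# Hypothetical toy data: three purchases by two customers (not from the real dataset)
toy = pd.DataFrame({
    'Customer_ID': ['C1', 'C1', 'C2'],
    'Item': ['Aloe Vera', 'Wheat Grass', 'Aloe Vera'],
    'Sales': [10.0, 5.0, 20.0],
})

# One row per customer, one column per item; items a customer never bought become 0
toy_pivoted = toy.pivot_table(index='Customer_ID', columns='Item', values='Sales').fillna(0)
print(toy_pivoted)
```

Customer C2 never bought Wheat Grass, so that cell is NaN after the pivot and 0 after `fillna(0)`.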


Our recommender system is based on item-based collaborative filtering. In this method we find the correlation score between every pair of items (columns in the dataframe). With the pandas corr() method we compute a Pearson correlation score for every column pair (pair of items). To avoid spurious results we pass min_periods=10, so a pair is scored only when at least 10 customers have a non-missing value for both items; otherwise the value is set to NaN. Note that because we filled NaN with 0 in the pivot step, every customer counts as an observation for every pair, so this threshold filters on co-purchases only if the NaN values are kept.


In [5]:
df_corr_matrix = df_pivoted.corr(method='pearson', min_periods=10)

Let's take a look at the correlation matrix. <br>As we see below, the linear relationship between Aloe Vera and Broccoli Powder is very strong (a correlation of about 0.90).

In [8]:
df_corr_matrix.head(3)

| Item | Aloe Vera | Broccoli Powder | Detox Green Tea | Energy bar White Chocolate and Macadamia Nut | Fusion Spice Red Tea | Ginger Lemon Tea | Grounded Garlic & Ginger | HealthSmart Foods Chocolite Protein, French Vanilla | Muscle Combat crunch (Chocolate chip) | Oh Yeah!, Nutritional Shake, Chocolate Milkshake | Power bar - Banana Strawberry | Sprirulina | Wheat Grass |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Aloe Vera | 1.0 | 0.900395 | 0.263176 | -0.694006 | 0.331562 | 0.289351 | 0.899789 | -0.644166 | -0.74749 | -0.673344 | -0.674688 | 0.901041 | 0.881851 |
| Broccoli Powder | 0.900395 | 1.0 | 0.262013 | -0.670141 | 0.30891 | 0.237135 | 0.792959 | -0.621646 | -0.721022 | -0.650309 | -0.641117 | 0.899838 | 0.881498 |
| Detox Green Tea | 0.263176 | 0.262013 | 1.0 | -0.683327 | 0.708099 | 0.804922 | 0.29901 | -0.556013 | -0.608551 | -0.577195 | -0.647869 | 0.360842 | 0.294642 |
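
As a rough illustration of how `min_periods` behaves in `corr()`, the sketch below uses a made-up two-item matrix (the `toy` dataframe, its values, and the threshold of 3 are assumptions for illustration only). It compares the result when NaN values are kept with the result after `fillna(0)`, which is what this notebook does before calling `corr()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy matrix: 4 customers, 2 items (not from the real dataset)
toy = pd.DataFrame({
    'A': [10.0, 20.0, np.nan, 40.0],
    'B': [1.0, 2.0, 3.0, np.nan],
})

# Only the first two customers have values for both items, so with
# min_periods=3 the A-B pair is not scored and comes back as NaN
corr_nan = toy.corr(method='pearson', min_periods=3)

# After fillna(0) every customer counts as an observation for every pair,
# so the same threshold no longer filters anything
corr_filled = toy.fillna(0).corr(method='pearson', min_periods=3)

print(corr_nan)
print(corr_filled)
```

If the goal is to score only item pairs that enough customers have actually bought together, one option is to compute the correlation on the pivoted dataframe before filling NaN with 0.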


This matrix of correlation scores between all items in our store will be used to recommend products to our customers, so we need to save it. We'll use the pandas to_pickle method to save the dataframe as a serialized object.

In [6]:
df_corr_matrix.to_pickle('textfiles/savedmodels/corr_matrix4.pkl')
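
A saved matrix can later be restored with `pd.read_pickle`. The following is a small round-trip sketch, assuming a made-up two-item matrix (`toy_corr`) and a temporary file path rather than the path used above.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the correlation matrix (values are made up)
items = ['Aloe Vera', 'Broccoli Powder']
toy_corr = pd.DataFrame([[1.0, 0.9], [0.9, 1.0]], index=items, columns=items)

# Save to a temporary location, then load it back
path = os.path.join(tempfile.mkdtemp(), 'corr_matrix_demo.pkl')
toy_corr.to_pickle(path)
restored = pd.read_pickle(path)

# The restored dataframe has the same labels and values as the original
print(restored.equals(toy_corr))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.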

Let's test our recommender system by creating a dict holding spendings on items from our store. Our fictitious customer (Tony Romo) has purchased 33€ of Sprirulina and 5€ of Ginger Lemon Tea.

In [9]:
tony_romo_dict = {'Sprirulina': 33, 'Ginger Lemon Tea': 5}

# Create a pandas Series from the dict
tony_romo_spendings = pd.Series(tony_romo_dict)

If you follow the print statements in the for loop below, you can see how the top 3 items are recommended to our customer. Tony loves Spirulina (a blue-green algae) and seems slightly interested in Lemon Tea. Can we use this information to recommend anything to Tony? <br>
1. Get the correlation scores for every item previously purchased by the customer. <br>
2. Weight the correlation scores by multiplying them by the customer's spending on each previously purchased item. <br>
3. Sum the weighted scores per item. <br>
4. Since (in this case) we don't want to recommend items Tony has already purchased, remove those items from his recommendations. <br>
5. Recommend the items with the highest scores to Tony. <br>

In [11]:
weighted_scores = []
for item, spend in tony_romo_spendings.items():

    # Get correlation scores for all items purchased by Tony
    print("1. Getting all correlation score for " + item + "...")
    correlation = df_corr_matrix[item]

    print("2. Multiply correlation score for each item with spendings for " + item + " which are: " + str(spend) + '€')
    correlation = correlation * spend
    print(correlation)

    # Collect the spend-weighted correlation scores for this item
    weighted_scores.append(correlation)

# Stack the weighted scores of all purchased items into one Series
all_corr_scores = pd.concat(weighted_scores)

print("3. Sum all_corr_scores and sort descending...")
sum_all_corr_scores = all_corr_scores.groupby(all_corr_scores.index).sum()
sum_all_corr_scores = sum_all_corr_scores.sort_values(ascending=False)
print(sum_all_corr_scores.head(10))

print("4. Remove products that the customer previously has purchased...")
filtered_scores = sum_all_corr_scores.drop(tony_romo_spendings.index)

print("5. Recommend top 3 products to Tony.... ")
print(filtered_scores.head(3))


1. Getting all correlation score for Ginger Lemon Tea...
2. Multiply correlation score for each item with spendings for Ginger Lemon Tea which are: 5€
Item
Aloe Vera                                              1.446754
Broccoli Powder                                        1.185676
Detox Green Tea                                        4.024610
Energy bar White Chocolate and Macadamia Nut          -3.540275
Fusion Spice Red Tea                                   4.169016
Ginger Lemon Tea                                       5.000000
Grounded Garlic & Ginger                               1.660310
HealthSmart Foods Chocolite Protein, French Vanilla   -2.912536
Muscle Combat crunch (Chocolate chip)                 -3.175674
Oh Yeah!, Nutritional Shake, Chocolate Milkshake      -3.230677
Power bar - Banana Strawberry                         -3.199928
Sprirulina                                             1.763212
Wheat Grass                                            1.515628
Name: Ginger

The three items recommended to Tony Romo are: <br>
Aloe Vera, Broccoli Powder, Grounded Garlic & Ginger
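
The five steps can also be written more compactly without the loop and print statements. The sketch below is one possible vectorized version, not the notebook's own implementation; the helper name `recommend`, its `top_n` parameter, and the toy correlation values are all hypothetical.

```python
import pandas as pd


def recommend(corr_matrix, spendings, top_n=3):
    """Hypothetical helper: rank items by spend-weighted correlation."""
    # Steps 1-3: take the columns of the purchased items, scale each column
    # by the amount spent on that item, and sum the weighted scores per row
    scores = corr_matrix[spendings.index].mul(spendings, axis=1).sum(axis=1)
    # Steps 4-5: drop already purchased items and keep the highest scores
    return scores.drop(spendings.index).sort_values(ascending=False).head(top_n)


# Made-up correlation matrix for four placeholder items
items = ['A', 'B', 'C', 'D']
toy_corr = pd.DataFrame(
    [[1.0, 0.8, -0.5, 0.2],
     [0.8, 1.0, -0.4, 0.1],
     [-0.5, -0.4, 1.0, 0.6],
     [0.2, 0.1, 0.6, 1.0]],
    index=items, columns=items)

# The customer spent 10 on item A and 2 on item C
spend = pd.Series({'A': 10, 'C': 2})
top = recommend(toy_corr, spend, top_n=2)
print(top)
```

Here item B scores 0.8 * 10 + (-0.4) * 2 = 7.2 and item D scores 0.2 * 10 + 0.6 * 2 = 3.2, so B is recommended ahead of D.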