Enhanced Recommendation System

Author: Zhuang Tang

In [None]:
####Import Package and read data
###I am using the built-in apriori from mlxtend, so the mlxtend needs to be installed on conda 

In [1]:
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [2]:
df = pd.read_csv("Groceries data train.csv")
df.head(50)

Unnamed: 0,Member_number,Date,itemDescription,year,month,day,day_of_week
0,3021,30/01/2015,frankfurter,2015,1,30,4
1,1292,24/10/2015,pork,2015,10,24,5
2,4206,4/04/2014,root vegetables,2014,4,4,4
3,4369,25/08/2015,onions,2015,8,25,1
4,1522,1/07/2014,waffles,2014,7,1,1
5,2053,17/09/2015,cereals,2015,9,17,3
6,2914,10/09/2014,yogurt,2014,9,10,2
7,4089,10/04/2015,sausage,2015,4,10,4
8,2460,27/10/2015,rolls/buns,2015,10,27,1
9,2738,2/03/2015,root vegetables,2015,3,2,0


In [3]:
df_test = pd.read_csv("Groceries data test.csv")
df_test.shape

(11765, 7)

In [4]:
df.iloc[0]

Member_number             3021
Date                30/01/2015
itemDescription    frankfurter
year                      2015
month                        1
day                         30
day_of_week                  4
Name: 0, dtype: object

## First Way of Recommendation-->Always the most popular ones

***We use a rather intuitive and not-so-bad way to do the recommendation system, sort the original data and get the top N item-sets***

In [5]:
result1 = df['itemDescription'].value_counts()
result1

itemDescription
whole milk               1709
other vegetables         1320
rolls/buns               1197
soda                     1060
yogurt                    928
                         ... 
bags                        3
rubbing alcohol             3
frozen chicken              2
preservation products       1
kitchen utensil             1
Name: count, Length: 167, dtype: int64

***From here we can extract the top N as the most popular item/itemsets, here we take 5 as an example,then no matter what we got as input, we will output the top 5 as the recommendation.***

In [6]:
recommendation1 = result1.head(5)
recommendation1

itemDescription
whole milk          1709
other vegetables    1320
rolls/buns          1197
soda                1060
yogurt               928
Name: count, dtype: int64

### Second Way of Recommendation -- > Finding the frequent pattern

***Unlike the first way of doing recommendation, we use apriori and FP-growth to calculate the frequent item-set. and then use the support, confidence and lift to evaluate the frequency of the item-set. then use the result to do the recommendation***

***The idea here is that we need to build a matrix, the rows are all the unique items(all the categories), the colunms are all the users. Each cell matrix(i, j) represents whether the jth customer bought the item or not***

***So, firstly, we extract the unique members first from the data***

In [7]:
users = df['Member_number'].unique()
users

array([3021, 1292, 4206, ..., 1459, 4386, 1942])

In [8]:
users.shape

(3872,)

***There are 3872 unique members in the dataset***

In [None]:
######combine it with integer index and create a new DataFrame

In [9]:
users_df = pd.DataFrame(users, columns = ['Member_number'], index=range(0,3872))
users_df

Unnamed: 0,Member_number
0,3021
1,1292
2,4206
3,4369
4,1522
...,...
3867,4630
3868,3605
3869,1459
3870,4386


***Above we create a member index with index from 0 to 3871, we need to method to find the specific member in the dataframe***

In [10]:
##Get the index of user
###According to the input integer, find the corresponding user
def getIndex(i):
    for j in range(3872):
        if(i == users[j]):
            break
    return j

In [11]:
##Create 3872 lists to store the purchage history
####Each backet stores the corresponding shopping history of the user
obj = {}
for i in range (3872):
    obj['user'+str(i)] = []
#obj
    

In [12]:
### Loop through the data and collect items and create purchase history
for i in range(27000):
    obj['user' + str(getIndex(df.iloc[i][0]))].append(df.iloc[i][2] + '/')
obj

{'user0': ['frankfurter/',
  'root vegetables/',
  'soups/',
  'onions/',
  'bottled water/',
  'sliced cheese/',
  'fruit/vegetable juice/'],
 'user1': ['pork/', 'whole milk/', 'beverages/', 'oil/'],
 'user2': ['root vegetables/',
  'rolls/buns/',
  'frozen vegetables/',
  'fruit/vegetable juice/',
  'margarine/',
  'bottled beer/',
  'rolls/buns/',
  'domestic eggs/',
  'chocolate/',
  'mustard/',
  'whipped/sour cream/',
  'whole milk/',
  'rolls/buns/'],
 'user3': ['onions/',
  'red/blush wine/',
  'yogurt/',
  'whole milk/',
  'butter milk/'],
 'user4': ['waffles/',
  'newspapers/',
  'yogurt/',
  'sausage/',
  'ketchup/',
  'chocolate/',
  'frozen meals/',
  'yogurt/',
  'whole milk/',
  'shopping bags/',
  'long life bakery product/',
  'newspapers/',
  'domestic eggs/'],
 'user5': ['cereals/',
  'cream cheese /',
  'rolls/buns/',
  'citrus fruit/',
  'rolls/buns/'],
 'user6': ['yogurt/',
  'long life bakery product/',
  'grapes/',
  'bottled water/',
  'other vegetables/',
  'r

***I treated the 'item/item' as two different items in the dataset, hence when I try to append item to my result list, I add a '/' to seperate them. (I used to treat the 'item/item' as one item, but when I try to apply the frequent pattern finding algorithm, I did not get an acceptable result. So here I treat them as different items.)***

In [13]:
###Merge all the purchase history with user list
###Add index to the purchase history
purchase_df = pd.DataFrame(obj.items(), columns = ['user', 'Purchase'], index=range(0,3872))
purchase_df

Unnamed: 0,user,Purchase
0,user0,"[frankfurter/, root vegetables/, soups/, onion..."
1,user1,"[pork/, whole milk/, beverages/, oil/]"
2,user2,"[root vegetables/, rolls/buns/, frozen vegetab..."
3,user3,"[onions/, red/blush wine/, yogurt/, whole milk..."
4,user4,"[waffles/, newspapers/, yogurt/, sausage/, ket..."
...,...,...
3867,user3867,[grapes/]
3868,user3868,[sausage/]
3869,user3869,"[citrus fruit/, whole milk/]"
3870,user3870,[citrus fruit/]


***Join the previous result with corresponding Member_number colunm***

In [14]:
result = pd.concat([users_df, purchase_df], axis=1, join="inner")
result

Unnamed: 0,Member_number,user,Purchase
0,3021,user0,"[frankfurter/, root vegetables/, soups/, onion..."
1,1292,user1,"[pork/, whole milk/, beverages/, oil/]"
2,4206,user2,"[root vegetables/, rolls/buns/, frozen vegetab..."
3,4369,user3,"[onions/, red/blush wine/, yogurt/, whole milk..."
4,1522,user4,"[waffles/, newspapers/, yogurt/, sausage/, ket..."
...,...,...,...
3867,4630,user3867,[grapes/]
3868,3605,user3868,[sausage/]
3869,1459,user3869,"[citrus fruit/, whole milk/]"
3870,4386,user3870,[citrus fruit/]


***From here, we get the desired dataset which is 'Member_number --> Purchase' , basically, we only need to deal with the purchase items using some string techniques***

In [15]:
result['Purchase'] = result['Purchase'].str.join(',')
result

Unnamed: 0,Member_number,user,Purchase
0,3021,user0,"frankfurter/,root vegetables/,soups/,onions/,b..."
1,1292,user1,"pork/,whole milk/,beverages/,oil/"
2,4206,user2,"root vegetables/,rolls/buns/,frozen vegetables..."
3,4369,user3,"onions/,red/blush wine/,yogurt/,whole milk/,bu..."
4,1522,user4,"waffles/,newspapers/,yogurt/,sausage/,ketchup/..."
...,...,...,...
3867,4630,user3867,grapes/
3868,3605,user3868,sausage/
3869,1459,user3869,"citrus fruit/,whole milk/"
3870,4386,user3870,citrus fruit/


In [16]:
result['Purchase'] = result['Purchase'].str.replace(',','')
result

Unnamed: 0,Member_number,user,Purchase
0,3021,user0,frankfurter/root vegetables/soups/onions/bottl...
1,1292,user1,pork/whole milk/beverages/oil/
2,4206,user2,root vegetables/rolls/buns/frozen vegetables/f...
3,4369,user3,onions/red/blush wine/yogurt/whole milk/butter...
4,1522,user4,waffles/newspapers/yogurt/sausage/ketchup/choc...
...,...,...,...
3867,4630,user3867,grapes/
3868,3605,user3868,sausage/
3869,1459,user3869,citrus fruit/whole milk/
3870,4386,user3870,citrus fruit/


***After using built-in string method, we got a rather clean and clear data set, user -> all items purchased***

In [17]:
###extract all the user's purchase history records from above dataset
purchase_his = result['Purchase']
purchase_his

0       frankfurter/root vegetables/soups/onions/bottl...
1                          pork/whole milk/beverages/oil/
2       root vegetables/rolls/buns/frozen vegetables/f...
3       onions/red/blush wine/yogurt/whole milk/butter...
4       waffles/newspapers/yogurt/sausage/ketchup/choc...
                              ...                        
3867                                              grapes/
3868                                             sausage/
3869                             citrus fruit/whole milk/
3870                                        citrus fruit/
3871                                              onions/
Name: Purchase, Length: 3872, dtype: object

In [18]:
res = purchase_his.str.split('/')
res 

0       [frankfurter, root vegetables, soups, onions, ...
1                    [pork, whole milk, beverages, oil, ]
2       [root vegetables, rolls, buns, frozen vegetabl...
3       [onions, red, blush wine, yogurt, whole milk, ...
4       [waffles, newspapers, yogurt, sausage, ketchup...
                              ...                        
3867                                           [grapes, ]
3868                                          [sausage, ]
3869                         [citrus fruit, whole milk, ]
3870                                     [citrus fruit, ]
3871                                           [onions, ]
Name: Purchase, Length: 3872, dtype: object

***We store the Purchase History in res, in order to fit it into the frequent pattern algrithm, we still need to use some tricks to process it***

In [19]:
###Using one-hot decoding to code the purchase history
purchase_his = purchase_his.str.get_dummies('/')
purchase_his

Unnamed: 0,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,bags,baking powder,bathroom cleaner,beef,berries,...,vegetables,vinegar,waffles,whipped,whisky,white bread,white wine,whole milk,yogurt,zwieback
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,1,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3867,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3868,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3869,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3870,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


***After we got the desired matrix we specified at the very beginning, we feed it into the algorithm***

In [21]:
###We first use apriori to train the data we have processed above
frequent_itemsets1 = apriori(purchase_his,min_support=0.1, use_colnames=True)
frequent_itemsets1



Unnamed: 0,support,itemsets
0,0.118285,(bottled beer)
1,0.159866,(bottled water)
2,0.102014,(brown bread)
3,0.263171,(buns)
4,0.115961,(canned beer)
5,0.133781,(citrus fruit)
6,0.101498,(frankfurter)
7,0.284866,(other vegetables)
8,0.130424,(pastry)
9,0.120351,(pip fruit)


In [22]:
##Then we use lift here as the metic and get the detailed measures of previous trained rules
rules1 = association_rules(frequent_itemsets1, metric='lift')
rules1

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(buns),(rolls),0.263171,0.263171,0.263171,1.0,3.799804,0.193912,inf,1.0
1,(rolls),(buns),0.263171,0.263171,0.263171,1.0,3.799804,0.193912,inf,1.0
2,(whole milk),(buns),0.347366,0.263171,0.106147,0.305576,1.16113,0.01473,1.061065,0.21263
3,(buns),(whole milk),0.263171,0.347366,0.106147,0.403337,1.16113,0.01473,1.093806,0.188334
4,(whole milk),(other vegetables),0.347366,0.284866,0.112345,0.32342,1.135342,0.013392,1.056984,0.182657
5,(other vegetables),(whole milk),0.284866,0.347366,0.112345,0.394379,1.135342,0.013392,1.077628,0.166694
6,(whole milk),(rolls),0.347366,0.263171,0.106147,0.305576,1.16113,0.01473,1.061065,0.21263
7,(rolls),(whole milk),0.263171,0.347366,0.106147,0.403337,1.16113,0.01473,1.093806,0.188334
8,(whipped),(sour cream),0.113636,0.113636,0.113636,1.0,8.8,0.100723,inf,1.0
9,(sour cream),(whipped),0.113636,0.113636,0.113636,1.0,8.8,0.100723,inf,1.0


***From the result above, we can do some filtering to get our desired result***

In [23]:
result2 = rules1[(rules1['lift'] > 1.125) & (rules1['confidence'] > 0.8)]
result2

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(buns),(rolls),0.263171,0.263171,0.263171,1.0,3.799804,0.193912,inf,1.0
1,(rolls),(buns),0.263171,0.263171,0.263171,1.0,3.799804,0.193912,inf,1.0
8,(whipped),(sour cream),0.113636,0.113636,0.113636,1.0,8.8,0.100723,inf,1.0
9,(sour cream),(whipped),0.113636,0.113636,0.113636,1.0,8.8,0.100723,inf,1.0
10,"(whole milk, buns)",(rolls),0.106147,0.263171,0.106147,1.0,3.799804,0.078212,inf,0.824328
11,"(whole milk, rolls)",(buns),0.106147,0.263171,0.106147,1.0,3.799804,0.078212,inf,0.824328


***We got a really interesting result here when we try to get a confidence higher than 0.8 and lift higher than 1.125, We got several confidence == 1, which means these items always come together =.=, so I try to lower the threshold (the confidence) to get the result***

In [24]:
result2 = rules1[(rules1['lift'] > 1.125) & (rules1['confidence'] > 0.3)]
result2

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(buns),(rolls),0.263171,0.263171,0.263171,1.0,3.799804,0.193912,inf,1.0
1,(rolls),(buns),0.263171,0.263171,0.263171,1.0,3.799804,0.193912,inf,1.0
2,(whole milk),(buns),0.347366,0.263171,0.106147,0.305576,1.16113,0.01473,1.061065,0.21263
3,(buns),(whole milk),0.263171,0.347366,0.106147,0.403337,1.16113,0.01473,1.093806,0.188334
4,(whole milk),(other vegetables),0.347366,0.284866,0.112345,0.32342,1.135342,0.013392,1.056984,0.182657
5,(other vegetables),(whole milk),0.284866,0.347366,0.112345,0.394379,1.135342,0.013392,1.077628,0.166694
6,(whole milk),(rolls),0.347366,0.263171,0.106147,0.305576,1.16113,0.01473,1.061065,0.21263
7,(rolls),(whole milk),0.263171,0.347366,0.106147,0.403337,1.16113,0.01473,1.093806,0.188334
8,(whipped),(sour cream),0.113636,0.113636,0.113636,1.0,8.8,0.100723,inf,1.0
9,(sour cream),(whipped),0.113636,0.113636,0.113636,1.0,8.8,0.100723,inf,1.0


***Here We use the generated frequent item-set to make recommendations***

In [26]:
###If the input lies in any one of the frequent item-set, generate the recommendation set. 
###If the input does not lie in the frequent item-set we use previous 'rank top N' to generate the recommendation set.
def recommend1():
    item = input("Please enter a item:\n")
    if item == "rolls":
        print("buns, whole milk")
    elif item == "buns":
        print("rolls, whole milk")
    elif item == "whole milk":
        print("buns, rolls, other vegetables")
    elif item == "other vegetables":
        print("sour cream")
    elif item == "whipped cream":
        print("sour cream")
    else:
        print("whole milk, rolls, buns, other vegetables, soda, yogurt")
        
    ##print(item)

In [27]:
###Run it to recommend
recommend1()

Please enter a item:
buns
rolls, whole milk


************************

***From here We use FP-growth to train the dataset again***

***We use previously processed matrix using one-hot decoding, the purchase_his to feed into the FP growth algorithm***

In [28]:
purchase_his

Unnamed: 0,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,bags,baking powder,bathroom cleaner,beef,berries,...,vegetables,vinegar,waffles,whipped,whisky,white bread,white wine,whole milk,yogurt,zwieback
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,1,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3867,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3868,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3869,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3870,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [29]:
from mlxtend.frequent_patterns import fpgrowth
frequent_itemsets2 = fpgrowth(purchase_his, min_support=0.1, use_colnames=True)
frequent_itemsets2



Unnamed: 0,support,itemsets
0,0.169421,(root vegetables)
1,0.159866,(bottled water)
2,0.101498,(frankfurter)
3,0.347366,(whole milk)
4,0.263171,(buns)
5,0.263171,(rolls)
6,0.118285,(bottled beer)
7,0.113636,(sour cream)
8,0.113636,(whipped)
9,0.205837,(yogurt)


In [31]:
rules2 = association_rules(frequent_itemsets2, metric='lift')
rules2

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(whole milk),(buns),0.347366,0.263171,0.106147,0.305576,1.16113,0.01473,1.061065,0.21263
1,(buns),(whole milk),0.263171,0.347366,0.106147,0.403337,1.16113,0.01473,1.093806,0.188334
2,(buns),(rolls),0.263171,0.263171,0.263171,1.0,3.799804,0.193912,inf,1.0
3,(rolls),(buns),0.263171,0.263171,0.263171,1.0,3.799804,0.193912,inf,1.0
4,(whole milk),(rolls),0.347366,0.263171,0.106147,0.305576,1.16113,0.01473,1.061065,0.21263
5,(rolls),(whole milk),0.263171,0.347366,0.106147,0.403337,1.16113,0.01473,1.093806,0.188334
6,"(whole milk, buns)",(rolls),0.106147,0.263171,0.106147,1.0,3.799804,0.078212,inf,0.824328
7,"(whole milk, rolls)",(buns),0.106147,0.263171,0.106147,1.0,3.799804,0.078212,inf,0.824328
8,"(buns, rolls)",(whole milk),0.263171,0.347366,0.106147,0.403337,1.16113,0.01473,1.093806,0.188334
9,(whole milk),"(buns, rolls)",0.347366,0.263171,0.106147,0.305576,1.16113,0.01473,1.061065,0.21263


In [32]:
rules2[(rules2['lift'] > 1.125) & (rules2['confidence'] > 0.8)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
2,(buns),(rolls),0.263171,0.263171,0.263171,1.0,3.799804,0.193912,inf,1.0
3,(rolls),(buns),0.263171,0.263171,0.263171,1.0,3.799804,0.193912,inf,1.0
6,"(whole milk, buns)",(rolls),0.106147,0.263171,0.106147,1.0,3.799804,0.078212,inf,0.824328
7,"(whole milk, rolls)",(buns),0.106147,0.263171,0.106147,1.0,3.799804,0.078212,inf,0.824328
12,(whipped),(sour cream),0.113636,0.113636,0.113636,1.0,8.8,0.100723,inf,1.0
13,(sour cream),(whipped),0.113636,0.113636,0.113636,1.0,8.8,0.100723,inf,1.0


***We got identical results as we were using apriori algorithm to train the data, so we also lower the threshold to give it a shot*** 

In [33]:
rules2[(rules2['lift'] > 1.125) & (rules2['confidence'] > 0.3)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(whole milk),(buns),0.347366,0.263171,0.106147,0.305576,1.16113,0.01473,1.061065,0.21263
1,(buns),(whole milk),0.263171,0.347366,0.106147,0.403337,1.16113,0.01473,1.093806,0.188334
2,(buns),(rolls),0.263171,0.263171,0.263171,1.0,3.799804,0.193912,inf,1.0
3,(rolls),(buns),0.263171,0.263171,0.263171,1.0,3.799804,0.193912,inf,1.0
4,(whole milk),(rolls),0.347366,0.263171,0.106147,0.305576,1.16113,0.01473,1.061065,0.21263
5,(rolls),(whole milk),0.263171,0.347366,0.106147,0.403337,1.16113,0.01473,1.093806,0.188334
6,"(whole milk, buns)",(rolls),0.106147,0.263171,0.106147,1.0,3.799804,0.078212,inf,0.824328
7,"(whole milk, rolls)",(buns),0.106147,0.263171,0.106147,1.0,3.799804,0.078212,inf,0.824328
8,"(buns, rolls)",(whole milk),0.263171,0.347366,0.106147,0.403337,1.16113,0.01473,1.093806,0.188334
9,(whole milk),"(buns, rolls)",0.347366,0.263171,0.106147,0.305576,1.16113,0.01473,1.061065,0.21263


***Based on the above new rules, we generate a recommender system***

In [34]:
###If the input lies in any one of the frequent item-set, generate the recommendation set. 
###If the input does not lie in the frequent item-set we use previous 'rank top N' to generate the recommendation set.
###Run it to recommend
def recommend2():
    item = input("Please enter a item:\n")
    if item == "rolls":
        print("buns, whole milk")
    elif item == "buns":
        print("rolls, whole milk")
    elif item == "whole milk":
        print("buns, rolls, other vegetables")
    elif item == "other vegetables":
        print("sour cream")
    elif item == "whipped cream":
        print("sour cream")
    else:
        print("whole milk, rolls, buns, other vegetables, soda, yogurt")
        
    ##print(item)

In [35]:
recommend2()

Please enter a item:
apple
whole milk, rolls, buns, other vegetables, soda, yogurt


***************

### Third way of implementing the recommending system --> CF

***Usually, there are two popular ways of doing collaborative filtering, User based CF and Item based CF. Normally, we tend to use Item based CF because, in a typical recomending system, there are far more users than the category of items. To clarify, for a user based CF, we have to generate a matrix with the shape of user * user which is gonna be huge, while for a Item based CF, the shape of target matrix is item * item,  In this dataset, we got more than 3800 unique users which would make the shape of the similarity matrix as 3800 * 3800, while there are only 175 different item categories, which make the shape of the similarity matrix as 175 * 175. That's way we choose to use Item based CF in this assignment.***

***General procedure for implementing Item based CF***

***My system receive an Member ID as input, first I try to find all the items this input user has bought. Then I loop through all these items and find all items that are similar to this item, and generate the similarity matrix. Then I calcualte the average on a column basis, then generate the most similar N items as my recommended List***

In [40]:
##First we set out to get all the unique item categories
#df1['itemDescription'][0]
res = ""
for i in range (27000):
    res = res + df['itemDescription'][i] + "/"
res

'frankfurter/pork/root vegetables/onions/waffles/cereals/yogurt/sausage/rolls/buns/root vegetables/chocolate/whipped/sour cream/butter/curd/frozen meals/whole milk/tropical fruit/rolls/buns/whole milk/whole milk/sugar/specialty cheese/whole milk/frozen meals/other vegetables/fruit/vegetable juice/misc. beverages/frozen vegetables/rolls/buns/spices/coffee/cream cheese /red/blush wine/other vegetables/curd/whole milk/whole milk/rolls/buns/red/blush wine/dessert/whole milk/long life bakery product/domestic eggs/abrasive cleaner/berries/liquor (appetizer)/semi-finished bread/whole milk/whipped/sour cream/whipped/sour cream/yogurt/other vegetables/newspapers/beverages/tropical fruit/beef/long life bakery product/female sanitary products/salt/beef/soda/other vegetables/sauces/oil/oil/whipped/sour cream/snack products/berries/long life bakery product/brown bread/brown bread/hygiene articles/pork/artif. sweetener/misc. beverages/bottled beer/sausage/liquor (appetizer)/dessert/cream cheese /bro

In [42]:
all_items = res.split('/')
all_items

['frankfurter',
 'pork',
 'root vegetables',
 'onions',
 'waffles',
 'cereals',
 'yogurt',
 'sausage',
 'rolls',
 'buns',
 'root vegetables',
 'chocolate',
 'whipped',
 'sour cream',
 'butter',
 'curd',
 'frozen meals',
 'whole milk',
 'tropical fruit',
 'rolls',
 'buns',
 'whole milk',
 'whole milk',
 'sugar',
 'specialty cheese',
 'whole milk',
 'frozen meals',
 'other vegetables',
 'fruit',
 'vegetable juice',
 'misc. beverages',
 'frozen vegetables',
 'rolls',
 'buns',
 'spices',
 'coffee',
 'cream cheese ',
 'red',
 'blush wine',
 'other vegetables',
 'curd',
 'whole milk',
 'whole milk',
 'rolls',
 'buns',
 'red',
 'blush wine',
 'dessert',
 'whole milk',
 'long life bakery product',
 'domestic eggs',
 'abrasive cleaner',
 'berries',
 'liquor (appetizer)',
 'semi-finished bread',
 'whole milk',
 'whipped',
 'sour cream',
 'whipped',
 'sour cream',
 'yogurt',
 'other vegetables',
 'newspapers',
 'beverages',
 'tropical fruit',
 'beef',
 'long life bakery product',
 'female sanitar

In [43]:
###Get all items' categories
item_set = pd.unique(all_items)
item_set

array(['frankfurter', 'pork', 'root vegetables', 'onions', 'waffles',
       'cereals', 'yogurt', 'sausage', 'rolls', 'buns', 'chocolate',
       'whipped', 'sour cream', 'butter', 'curd', 'frozen meals',
       'whole milk', 'tropical fruit', 'sugar', 'specialty cheese',
       'other vegetables', 'fruit', 'vegetable juice', 'misc. beverages',
       'frozen vegetables', 'spices', 'coffee', 'cream cheese ', 'red',
       'blush wine', 'dessert', 'long life bakery product',
       'domestic eggs', 'abrasive cleaner', 'berries',
       'liquor (appetizer)', 'semi-finished bread', 'newspapers',
       'beverages', 'beef', 'female sanitary products', 'salt', 'soda',
       'sauces', 'oil', 'snack products', 'brown bread',
       'hygiene articles', 'artif. sweetener', 'bottled beer',
       'canned beer', 'hamburger meat', 'liver loaf', 'soups',
       'pip fruit', 'hard cheese', 'shopping bags', 'canned vegetables',
       'napkins', 'citrus fruit', 'margarine', 'pasta', 'salty snack',
 

***We need to deal with the trailing white space***

In [45]:
item_set = np.delete(item_set, len(item_set) - 1)
item_set
#len(item_set)

array(['frankfurter', 'pork', 'root vegetables', 'onions', 'waffles',
       'cereals', 'yogurt', 'sausage', 'rolls', 'buns', 'chocolate',
       'whipped', 'sour cream', 'butter', 'curd', 'frozen meals',
       'whole milk', 'tropical fruit', 'sugar', 'specialty cheese',
       'other vegetables', 'fruit', 'vegetable juice', 'misc. beverages',
       'frozen vegetables', 'spices', 'coffee', 'cream cheese ', 'red',
       'blush wine', 'dessert', 'long life bakery product',
       'domestic eggs', 'abrasive cleaner', 'berries',
       'liquor (appetizer)', 'semi-finished bread', 'newspapers',
       'beverages', 'beef', 'female sanitary products', 'salt', 'soda',
       'sauces', 'oil', 'snack products', 'brown bread',
       'hygiene articles', 'artif. sweetener', 'bottled beer',
       'canned beer', 'hamburger meat', 'liver loaf', 'soups',
       'pip fruit', 'hard cheese', 'shopping bags', 'canned vegetables',
       'napkins', 'citrus fruit', 'margarine', 'pasta', 'salty snack',
 

In [137]:
df_test.head()

Unnamed: 0,Member_number,Date,itemDescription,year,month,day,day_of_week
0,3481,8/03/2015,candy,2015,3,8,6
1,1254,19/04/2015,white wine,2015,4,19,6
2,2835,28/01/2014,domestic eggs,2014,1,28,1
3,2854,2/08/2015,coffee,2015,8,2,6
4,4637,12/08/2014,bottled water,2014,8,12,1


In [90]:
# df_test['Member_number'][0] == 3481

True

In [95]:
##Get the index of user
###According to the input integer, find the corresponding user
def getIndex_test(i):
    for j in range(df_test.shape[0]):
        if(i == df_test['Member_number'][j]):
            break
    return j

In [96]:
##Find the user and return the purchased item-set
##The input here is the list we got from test set in one transaction
def get_User_Items(userId, df):
    for i in range(df_test.shape[0]):
        if df_test['Member_number'][i] == userId:
            return df_test['itemDescription'][i].split('/')
    return ""

In [97]:
###We first find out all the other users who bought the same item
###find out who has bought the specific item
def get_item_users(item, df):
    user_bought_same = []
    for i in range(27000):
        #temp = df['itemDescription'][i:i+1].str.contains(item)
        if(df['itemDescription'][i:i+1].str.contains(item).bool()):
            #print("True")
            user_bought_same.append(df['Member_number'][i])
    return user_bought_same 

***There are many ways to calculate the similarity between two items, Jaccard, cosine similarity, pearson similarity, we use Jaccard here to do the calculation, Jaccard similarity = intersection / union***

In [98]:
##Get the intersection of two lists
def intersection(lst1, lst2):
    lst3 = [value for value in lst1 if value in lst2]
    return lst3

In [99]:
##Get the Union of two lists
def Union(lst1, lst2):
    final_list = list(set(lst1) | set(lst2))
    return final_list

In [135]:
def recommend3():
    userID = input("Please type in the userID(integer):")
    user_items = get_User_Items(userID, df_test)
    user_items_user = []
    for i in range(0, len(user_items)):
        user_items_user.append(get_item_users(user_items[i], df))
    cooccurrence_matrix = np.matrix(np.zeros(shape=(len(user_items),len(item_set))),float)
    for i in range(0, len(item_set)):
        users_i = get_item_users(item_set[i], df)
        for j in range(0, len(user_items)):
            users_j = get_item_users(user_items[j])
            ##users_intersection = user_i.intersection(users_j)
            users_intersection = intersection(users_i, users_j)
            if len(users_intersection) != 0:
                ##users_union = users_i.union(users_j)
                users_union = Union(users_i, users_j)
                cooccurrence_matrix[j,i] = float(len(users_intersection))/float(len(users_union))
            else:
                cooccurrence_matrix[j,i] = 0
    mean_value = np.mean(cooccurrence_matrix, axis=0)
    mean_value = np.array(mean_value[0])
    top_5 = np.argsort(mean_value[0])[::-1][:5]
    recommended_list = []
    for i in range(0, len(top_5)):
        recommended_list.append(item_set[top_5[i]])
    return recommended_list

In [136]:
###Run it to recommend
result = recommend3()
result

Please type in the userID(integer):3481


  if(df['itemDescription'][i:i+1].str.contains(item).bool()):
  return N.ndarray.mean(self, axis, dtype, out, keepdims=True)._collapse(axis)
  ret = um.true_divide(


['kitchen utensil', 'pip fruit', 'salty snack', 'pasta', 'margarine']

**************

***Step by Step explanation and calculation for the above item based CF***

In [111]:
###Here we receive an input from the keyboard, just take 3481 as an example
##first we need to find all items the current user bought in the test set
test_user_items = get_User_Items(3481, df_test)
test_user_items

['candy']

In [113]:
##Then we should get all other users who also bought this item
test_bought_users = get_item_users(test_user_items[0], df)
test_bought_users

[1866,
 4608,
 2344,
 4610,
 2421,
 2964,
 3185,
 3393,
 4338,
 1026,
 2132,
 2839,
 4544,
 1063,
 3830,
 1572,
 2091,
 3439,
 3774,
 4776,
 1949,
 3985,
 3178,
 4508,
 3667,
 4854,
 4432,
 1618,
 2537,
 3401,
 1735,
 2854,
 4117,
 3518,
 1049,
 2573,
 1461,
 3026,
 4395,
 3145,
 1599,
 4258,
 2743,
 3222,
 3552,
 3495,
 4959,
 3486,
 2266,
 3985,
 3493,
 2093,
 2853,
 2310,
 3847,
 4320,
 4507,
 4145,
 1567,
 4108,
 4731,
 2415,
 2074,
 2157,
 3963,
 3484,
 3660,
 4661,
 3831,
 4704,
 4961,
 4334,
 3247,
 4449,
 2259,
 4173,
 2019,
 3822,
 2196,
 2550,
 3691,
 1515,
 1244,
 3563,
 1808,
 3704,
 4606,
 4283,
 3531,
 2259,
 1098,
 2362,
 2546,
 2236,
 2022,
 4744,
 3279,
 1013,
 3972,
 2996,
 4410,
 1466,
 2328,
 3397,
 2958,
 3527,
 3736,
 4416,
 3617,
 1572,
 2293,
 4589,
 4001,
 2314,
 4823,
 4092,
 4251,
 2652,
 3067,
 2528,
 3632,
 1182,
 2040,
 3430,
 4751,
 1146,
 2462,
 3315,
 1427,
 1380,
 2568,
 1853,
 4467,
 2345,
 3467,
 3496,
 1011,
 1878,
 4310,
 1929,
 1946,
 3645,
 1387]

In [115]:
##row is all items-set, column is current user purchased items
##calculate the similarity
##The cell [j,i] means the similarity between these two items
cooccurrence_matrix = np.matrix(np.zeros(shape=(len(input_list),len(item_set))),float)
cooccurrence_matrix

matrix([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

In [118]:
##The unique item categories we got earliear
item_set

array(['frankfurter', 'pork', 'root vegetables', 'onions', 'waffles',
       'cereals', 'yogurt', 'sausage', 'rolls', 'buns', 'chocolate',
       'whipped', 'sour cream', 'butter', 'curd', 'frozen meals',
       'whole milk', 'tropical fruit', 'sugar', 'specialty cheese',
       'other vegetables', 'fruit', 'vegetable juice', 'misc. beverages',
       'frozen vegetables', 'spices', 'coffee', 'cream cheese ', 'red',
       'blush wine', 'dessert', 'long life bakery product',
       'domestic eggs', 'abrasive cleaner', 'berries',
       'liquor (appetizer)', 'semi-finished bread', 'newspapers',
       'beverages', 'beef', 'female sanitary products', 'salt', 'soda',
       'sauces', 'oil', 'snack products', 'brown bread',
       'hygiene articles', 'artif. sweetener', 'bottled beer',
       'canned beer', 'hamburger meat', 'liver loaf', 'soups',
       'pip fruit', 'hard cheese', 'shopping bags', 'canned vegetables',
       'napkins', 'citrus fruit', 'margarine', 'pasta', 'salty snack',
 

In [121]:
##Loop through the matrix, and calculate the similarity between every two items
###using Jaccard similarity
for i in range(0, len(item_set)):
        users_i = get_item_users(item_set[i], df)
        for j in range(0, len(test_user_items)):
            users_j = get_item_users(test_user_items[j], df)
            #users_j = test_bought_users[j]
            ##users_intersection = user_i.intersection(users_j)
            users_intersection = intersection(users_i, users_j)
            if len(users_intersection) != 0:
                ##users_union = users_i.union(users_j)
                users_union = Union(users_i, users_j)
                cooccurrence_matrix[j,i] = float(len(users_intersection))/float(len(users_union))
            else:
                cooccurrence_matrix[j,i] = 0
cooccurrence_matrix

  if(df['itemDescription'][i:i+1].str.contains(item).bool()):


matrix([[0.04093567, 0.02626263, 0.04421326, 0.04117647, 0.04113924,
         0.01176471, 0.04899777, 0.03693182, 0.05215827, 0.05215827,
         0.04339964, 0.03197158, 0.03197158, 0.02799378, 0.02589641,
         0.02564103, 0.05977496, 0.04194858, 0.03548387, 0.01111111,
         0.03399668, 0.06257242, 0.04016913, 0.02657807, 0.02830189,
         0.0060241 , 0.01964637, 0.0156658 , 0.02016129, 0.02016129,
         0.03140097, 0.02990033, 0.02165354, 0.        , 0.03703704,
         0.        , 0.03508772, 0.03769841, 0.02386117, 0.05150215,
         0.01863354, 0.02452316, 0.03838583, 0.        , 0.02      ,
         0.00636943, 0.03076923, 0.02135231, 0.00625   , 0.04728546,
         0.05123675, 0.03116147, 0.00578035, 0.00546448, 0.03918228,
         0.02919708, 0.0321489 , 0.03076923, 0.03542234, 0.04423381,
         0.02801724, 0.00921659, 0.02229299, 0.02247191, 0.02745098,
         0.03538175, 1.02142857, 0.01005025, 0.03614458, 0.02371542,
         0.03170732, 0.00833333, 0

***Now with the similarity matrix, we can compute the average similarity on a column's basis***

In [131]:
###calculate the mean value on columns, 
###The result means that given current user purchased items 
###we need to see the average similarity between purchased items with the ith column item
mean_value = np.mean(cooccurrence_matrix, axis=0)
mean_value

matrix([[0.04093567, 0.02626263, 0.04421326, 0.04117647, 0.04113924,
         0.01176471, 0.04899777, 0.03693182, 0.05215827, 0.05215827,
         0.04339964, 0.03197158, 0.03197158, 0.02799378, 0.02589641,
         0.02564103, 0.05977496, 0.04194858, 0.03548387, 0.01111111,
         0.03399668, 0.06257242, 0.04016913, 0.02657807, 0.02830189,
         0.0060241 , 0.01964637, 0.0156658 , 0.02016129, 0.02016129,
         0.03140097, 0.02990033, 0.02165354, 0.        , 0.03703704,
         0.        , 0.03508772, 0.03769841, 0.02386117, 0.05150215,
         0.01863354, 0.02452316, 0.03838583, 0.        , 0.02      ,
         0.00636943, 0.03076923, 0.02135231, 0.00625   , 0.04728546,
         0.05123675, 0.03116147, 0.00578035, 0.00546448, 0.03918228,
         0.02919708, 0.0321489 , 0.03076923, 0.03542234, 0.04423381,
         0.02801724, 0.00921659, 0.02229299, 0.02247191, 0.02745098,
         0.03538175, 1.02142857, 0.01005025, 0.03614458, 0.02371542,
         0.03170732, 0.00833333, 0

In [132]:
##transfer the matrix to ndarray
mean_value = np.array(mean_value[0])
mean_value

array([[0.04093567, 0.02626263, 0.04421326, 0.04117647, 0.04113924,
        0.01176471, 0.04899777, 0.03693182, 0.05215827, 0.05215827,
        0.04339964, 0.03197158, 0.03197158, 0.02799378, 0.02589641,
        0.02564103, 0.05977496, 0.04194858, 0.03548387, 0.01111111,
        0.03399668, 0.06257242, 0.04016913, 0.02657807, 0.02830189,
        0.0060241 , 0.01964637, 0.0156658 , 0.02016129, 0.02016129,
        0.03140097, 0.02990033, 0.02165354, 0.        , 0.03703704,
        0.        , 0.03508772, 0.03769841, 0.02386117, 0.05150215,
        0.01863354, 0.02452316, 0.03838583, 0.        , 0.02      ,
        0.00636943, 0.03076923, 0.02135231, 0.00625   , 0.04728546,
        0.05123675, 0.03116147, 0.00578035, 0.00546448, 0.03918228,
        0.02919708, 0.0321489 , 0.03076923, 0.03542234, 0.04423381,
        0.02801724, 0.00921659, 0.02229299, 0.02247191, 0.02745098,
        0.03538175, 1.02142857, 0.01005025, 0.03614458, 0.02371542,
        0.03170732, 0.00833333, 0.04333868, 0.01

In [133]:
##Choose top N similar from the ndarray
top_5 = np.argsort(mean_value[0])[::-1][:5]
top_5

array([66, 21, 16, 75,  8])

In [134]:
###produce corresponding recommend list
recommended_list = []
for i in range(0, len(top_5)):
    recommended_list.append(item_set[top_5[i]])
recommended_list

['candy', 'fruit', 'whole milk', 'vegetables', 'rolls']

********************

******

***Some comparisons for all above algorithms doing a recommendation system***

***The very first one intuitive way of doing recommendation system is simply sort the original data on the times purchased or frequency of buying. It makes some sense because popularity means something. Also it's straight forward and simple to implement. But the downsides of the algrithm are also obvious, we cannot provide a more accurate and customized recommendation. When it comes to complex system, we should create a user profile and try to do a highly cusomized recommendation.***

***The second way we use to implement the recommendation system is apriori and FP-growth. They are designed to find frequent patterns, or more precisely frequent item-sets. The strength of this algrithm is also obvious, it's very simple and easy to implement, and we can get a slightly better recommendation results because the frequent item-set we generated is accurate. But the weakness of these 'Association Rule' related algorithms are that they only care about the occurrence of each item, (We only need to make up such matrix that whether the user bought it or not), they don't care about how many times an item was bought, and all the features that come with the item. And using such algorithm did not take the user's attitude and behaviour into consideration. For example, One user's favourite food is egg, but egg is not so popular that it barely comes with other products and it's not selling well. In this case, when we are trying to recommend items based on frequent sets, there is a high chance we won't recommend eggs to the user.***

***The third way we use is Item-Based CF. We have explained before why we are using Item-Based CF instead of User-Based CF. The way we implement Item-Based CF in this assignment is that we receive userID as input, then we set out to find all the items current user bought. then we try to make up a matrix, which is the similarity between the items current user bought against all the items. Then we sort the result and get the most similar top N products as the recommendation. It's strength is that it's much more accurate and highly cusomized. Also, we take everything into account. But the down side is also obvious, it's much more complex than the previous two, and the computation is huge. Even when we are doing this assignment(The shape of matrix is way smaller than real world commercial dataset), I still spent long to train the similarity matrix. So we can only use it for offline model training, but when we are supposed to recommend items online, when efficiency becomes the dominant factor, we have to use some other tricks to do it.***

***Some interesting research and thoughts:***

***1. We can use SVD to implement the item-based CF. The trick here is like our previous Assignemnt 1, when we have too many features to train the model, we tend to use PCA to get the dominant factors or features. We can use the same trick here. Apply SVD to extract the dominant similarity factors. It would greatly reduce the time used to train the model.*** 

***2. We can also use RNN or other deep learning algorithms to train our recommendation system. So far, the most popular ways we use to do recommendation system is still User-Based CF, with which we try to create the user profile, and with more purchase and interactive behavours, the model tends to become more accurate. Or the Item-Based CF, we try to recommend the most similar items to the user. But there will always be some factors we can't take into consideration. Such as Seasonality. We have done such analysis(Time series) in Assignment 2, but I think it's still relevant in this recommendation systems. For example, we will probabliy not recommend ice-cream to customer in Winter!! So maybe deep learning is a better appoach. When we are using RNN, with layers getting more and network getting deeper, more hidden factors will be taken into accout to produce more accurate results.***