# Collaborative Filtering Recommender System

In this notebook I use the e-commerce dataset from Kaggle to build a user-based collaborative filtering recommender system.

The main idea is to represent each user as a vector of purchased items and to find similar users by computing the cosine similarity between those vectors.

You can download the data from https://www.kaggle.com/carrie1/ecommerce-data. Extract the zip file and save data.csv in the data directory.

### Load libraries

In [1]:
import pandas as pd
from urllib.request import urlopen
from zipfile import ZipFile
from sklearn.metrics.pairwise import cosine_similarity

### Download and extract dataset

In [2]:
# Uncomment these lines to extract the archive if it has not been unzipped yet
#zf = ZipFile("./data/kaggle_ecommerce_data.zip") 
#zf.extractall(path = './data/') 
#zf.close()

df = pd.read_csv("./data/data.csv", encoding = 'ISO-8859-1')
df

,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,12/9/2011 12:50,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,12/9/2011 12:50,2.10,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,12/9/2011 12:50,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,12/9/2011 12:50,4.15,12680.0,France


### Check data shape and columns

In [3]:
print("Rows     : ", df.shape[0])
print("Columns  : ", df.shape[1])
print("")
print("Features : \n", df.columns.tolist())
print("")
print("Missing values :  ", df.isnull().sum().values.sum())
print("")
print("Unique values :  \n", df.nunique())

Rows     :  541909
Columns  :  8

Features : 
 ['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate', 'UnitPrice', 'CustomerID', 'Country']

Missing values :   136534

Unique values :  
 InvoiceNo      25900
StockCode       4070
Description     4223
Quantity         722
InvoiceDate    23260
UnitPrice       1630
CustomerID      4372
Country           38
dtype: int64


In [4]:
df.describe()

,Quantity,UnitPrice,CustomerID
count,541909.0,541909.0,406829.0
mean,9.55225,4.611114,15287.69057
std,218.081158,96.759853,1713.600303
min,-80995.0,-11062.06,12346.0
25%,1.0,1.25,13953.0
50%,3.0,2.08,15152.0
75%,10.0,4.13,16791.0
max,80995.0,38970.0,18287.0


Drop rows with a non-positive quantity or unit price, and rows with a missing customer ID.

In [5]:
df = df.loc[df['Quantity'] > 0]
df = df.loc[df['UnitPrice'] > 0]

In [6]:
df.loc[df['CustomerID'].isna()].head()

,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
1443,536544,21773,DECORATIVE ROSE BATHROOM BOTTLE,1,12/1/2010 14:32,2.51,,United Kingdom
1444,536544,21774,DECORATIVE CATS BATHROOM BOTTLE,2,12/1/2010 14:32,2.51,,United Kingdom
1445,536544,21786,POLKADOT RAIN HAT,4,12/1/2010 14:32,0.85,,United Kingdom
1446,536544,21787,RAIN PONCHO RETROSPOT,2,12/1/2010 14:32,1.66,,United Kingdom
1447,536544,21790,VINTAGE SNAP CARDS,9,12/1/2010 14:32,1.66,,United Kingdom


In [7]:
df.shape

(530104, 8)

In [8]:
df = df.dropna(subset=['CustomerID'])

In [9]:
df.shape

(397884, 8)

The data should now contain no missing values.

In [10]:
df.isnull().sum()

InvoiceNo      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
dtype: int64
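
The cleaning steps above can also be bundled into a single helper. The following is a minimal sketch, assuming the column names of this dataset; `clean_transactions` is a hypothetical name and the toy rows are made up for illustration.

```python
import numpy as np
import pandas as pd

def clean_transactions(frame):
    """Keep rows with positive Quantity and UnitPrice and a known CustomerID."""
    mask = (
        (frame["Quantity"] > 0)
        & (frame["UnitPrice"] > 0)
        & frame["CustomerID"].notna()
    )
    return frame.loc[mask].copy()

# Toy data (made up for illustration, not taken from the dataset)
toy = pd.DataFrame({
    "Quantity": [6, -2, 3, 1],
    "UnitPrice": [2.55, 1.25, 0.0, 4.15],
    "CustomerID": [17850.0, 17850.0, 12680.0, np.nan],
})
cleaned = clean_transactions(toy)
print(cleaned)
```

Only the first toy row passes all three conditions. Combining the conditions into one boolean mask avoids reassigning `df` several times, but the step-by-step version makes it easier to inspect the shape after each filter.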

## Create user item matrix

Our goal is to create a user-item matrix in which each row tells us whether that particular CustomerID has purchased each item before. We first sum the purchased quantities per customer and item, then convert the totals to 0/1 flags.

In [11]:
user_item_matrix = df.pivot_table(index='CustomerID', columns='StockCode', values='Quantity', aggfunc='sum')
user_item_matrix.head(10)

CustomerID,10002,10080,10120,10123C,10124A,10124G,10125,10133,10135,11001,...,90214V,90214W,90214Y,90214Z,BANK CHARGES,C2,DOT,M,PADS,POST
12346.0,,,,,,,,,,,...,,,,,,,,,,
12347.0,,,,,,,,,,,...,,,,,,,,,,
12348.0,,,,,,,,,,,...,,,,,,,,,,9.0
12349.0,,,,,,,,,,,...,,,,,,,,,,1.0
12350.0,,,,,,,,,,,...,,,,,,,,,,1.0
12352.0,,,,,,,,,,,...,,,,,,,,3.0,,7.0
12353.0,,,,,,,,,,,...,,,,,,,,,,
12354.0,,,,,,,,,,,...,,,,,,,,,,
12355.0,,,,,,,,,,,...,,,,,,,,,,
12356.0,,,,,,,,,,,...,,,,,,,,,,18.0


In [12]:
# A positive total quantity becomes 1; NaN (never bought) compares as False and becomes 0
user_item_matrix = (user_item_matrix > 0).astype(int)
user_item_matrix.head()

CustomerID,10002,10080,10120,10123C,10124A,10124G,10125,10133,10135,11001,...,90214V,90214W,90214Y,90214Z,BANK CHARGES,C2,DOT,M,PADS,POST
12346.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12347.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12348.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
12349.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
12350.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


Each row is now a binary vector marking the items that the customer has previously bought.
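
To make the construction concrete, here is a small sketch of the same pivot-then-binarize idea on a toy transaction table. The data is made up for illustration, and comparing with `> 0` followed by `astype(int)` is just one way to produce the 0/1 flags.

```python
import pandas as pd

# Toy transactions (made up for illustration)
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 3],
    "StockCode": ["A", "B", "A", "C", "B"],
    "Quantity": [2, 1, 5, 1, 3],
})

# Sum quantities per (customer, item); pairs that never occur become NaN
toy_matrix = toy.pivot_table(
    index="CustomerID", columns="StockCode", values="Quantity", aggfunc="sum"
)

# NaN > 0 evaluates to False, so missing pairs become 0
toy_binary = (toy_matrix > 0).astype(int)
print(toy_binary)
```

Customer 1 ends up with flags for items A and B, customer 2 for A and C, and customer 3 for B only.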

In [13]:
user_item_matrix.shape

(4338, 3665)

Next we calculate the cosine similarity between every pair of row vectors. Since each row represents a particular user, the cosine similarity between two rows serves as the similarity between that pair of users.
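
For binary purchase vectors, the cosine similarity reduces to the number of shared items divided by the square root of the product of the two users' item counts. The sketch below computes it by hand with NumPy on made-up vectors; the helper name `cosine` is hypothetical and is only meant to illustrate the quantity being calculated.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of two non-zero 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy purchase vectors over four items (made up for illustration)
user_a = np.array([1, 1, 0, 0])  # bought items 0 and 1
user_b = np.array([1, 0, 1, 0])  # bought items 0 and 2
user_c = np.array([0, 0, 0, 1])  # bought item 3 only

sim_ab = cosine(user_a, user_b)  # one shared item, two items each
sim_ac = cosine(user_a, user_c)  # no shared items
print(sim_ab, sim_ac)
```

Here `sim_ab` is approximately 0.5 (one shared item over the square root of 2 × 2) and `sim_ac` is 0, which matches the intuition that users with no items in common are not similar at all.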

In [14]:
user_user_matrix = pd.DataFrame(cosine_similarity(user_item_matrix))
user_user_matrix

,0,1,2,3,4,5,6,7,8,9,...,4328,4329,4330,4331,4332,4333,4334,4335,4336,4337
0,1.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
1,0.0,1.000000,0.063022,0.046130,0.047795,0.038484,0.0,0.025876,0.136641,0.094742,...,0.0,0.029709,0.052668,0.000000,0.032844,0.062318,0.000000,0.113776,0.109364,0.012828
2,0.0,0.063022,1.000000,0.024953,0.051709,0.027756,0.0,0.027995,0.118262,0.146427,...,0.0,0.064282,0.113961,0.000000,0.000000,0.000000,0.000000,0.000000,0.170905,0.083269
3,0.0,0.046130,0.024953,1.000000,0.056773,0.137137,0.0,0.030737,0.032461,0.144692,...,0.0,0.105868,0.000000,0.000000,0.039014,0.000000,0.000000,0.067574,0.137124,0.030475
4,0.0,0.047795,0.051709,0.056773,1.000000,0.031575,0.0,0.000000,0.000000,0.033315,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.044866,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4333,0.0,0.062318,0.000000,0.000000,0.000000,0.000000,0.0,0.041523,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.105409,1.000000,0.119523,0.000000,0.000000,0.000000
4334,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.049629,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.119523,1.000000,0.000000,0.046613,0.000000
4335,0.0,0.113776,0.000000,0.067574,0.000000,0.037582,0.0,0.000000,0.160128,0.079305,...,0.0,0.174078,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000,0.017800,0.000000
4336,0.0,0.109364,0.170905,0.137124,0.044866,0.080278,0.0,0.113354,0.034204,0.093170,...,0.0,0.037184,0.016480,0.043602,0.000000,0.000000,0.046613,0.017800,1.000000,0.096334


The matrix should now be square, with one row and one column per customer. We then relabel its index and columns with the corresponding customer IDs.

In [15]:
user_user_matrix.shape

(4338, 4338)

In [16]:
user_user_matrix.columns = user_item_matrix.index

user_user_matrix['CustomerID'] = user_item_matrix.index

user_user_matrix = user_user_matrix.set_index('CustomerID')
user_user_matrix.head()

CustomerID,12346.0,12347.0,12348.0,12349.0,12350.0,12352.0,12353.0,12354.0,12355.0,12356.0,...,18273.0,18274.0,18276.0,18277.0,18278.0,18280.0,18281.0,18282.0,18283.0,18287.0
12346.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12347.0,0.0,1.0,0.063022,0.04613,0.047795,0.038484,0.0,0.025876,0.136641,0.094742,...,0.0,0.029709,0.052668,0.0,0.032844,0.062318,0.0,0.113776,0.109364,0.012828
12348.0,0.0,0.063022,1.0,0.024953,0.051709,0.027756,0.0,0.027995,0.118262,0.146427,...,0.0,0.064282,0.113961,0.0,0.0,0.0,0.0,0.0,0.170905,0.083269
12349.0,0.0,0.04613,0.024953,1.0,0.056773,0.137137,0.0,0.030737,0.032461,0.144692,...,0.0,0.105868,0.0,0.0,0.039014,0.0,0.0,0.067574,0.137124,0.030475
12350.0,0.0,0.047795,0.051709,0.056773,1.0,0.031575,0.0,0.0,0.0,0.033315,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.044866,0.0
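
The labels can also be attached at construction time by passing `index` and `columns` to the `pd.DataFrame` constructor. The sketch below does this on a made-up 3 × 3 example and, to stay self-contained, computes the similarities with plain NumPy (row-normalize, then take a matrix product) rather than with `cosine_similarity`; the result of `cosine_similarity` could be passed to the constructor in the same way.

```python
import numpy as np
import pandas as pd

# Toy binary user-item matrix (made up for illustration)
toy_items = pd.DataFrame(
    [[1, 1, 0], [1, 0, 1], [0, 1, 0]],
    index=pd.Index([101, 102, 103], name="CustomerID"),
    columns=["A", "B", "C"],
)

# Scale each row to unit length; the matrix product then holds all pairwise cosines
values = toy_items.to_numpy(dtype=float)
unit_rows = values / np.linalg.norm(values, axis=1, keepdims=True)

# Label rows and columns with the customer IDs in one step
toy_sim = pd.DataFrame(
    unit_rows @ unit_rows.T,
    index=toy_items.index,
    columns=toy_items.index,
)
print(toy_sim.round(3))
```

This assumes every row has at least one non-zero entry, which holds here because every remaining customer has at least one purchase.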


## Get the list of customer IDs sorted by similarity

We can get a list of similar users in descending order of similarity by selecting the row for a given CustomerID and sorting its values. The first entry is the customer itself, with a similarity of 1.

In [17]:
user_user_matrix.loc[12350].sort_values(ascending=False)

CustomerID
12350.0    1.000000
17935.0    0.183340
12414.0    0.181902
12652.0    0.175035
16754.0    0.171499
             ...   
14886.0    0.000000
14887.0    0.000000
14888.0    0.000000
14889.0    0.000000
18287.0    0.000000
Name: 12350.0, Length: 4338, dtype: float64
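
Because the customer always appears in their own ranking, it can be convenient to drop that entry and keep only the top few neighbours. A minimal sketch, assuming a labelled similarity matrix like the one above; `top_similar_users` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd

def top_similar_users(user_user_m, customer_id, n=5):
    """Return the n most similar users to customer_id, excluding the user itself."""
    scores = user_user_m.loc[customer_id].drop(customer_id)
    return scores.nlargest(n)

# Toy similarity matrix (values made up for illustration)
ids = [101, 102, 103, 104]
toy_sim = pd.DataFrame(
    [
        [1.0, 0.5, 0.0, 0.2],
        [0.5, 1.0, 0.3, 0.0],
        [0.0, 0.3, 1.0, 0.1],
        [0.2, 0.0, 0.1, 1.0],
    ],
    index=ids,
    columns=ids,
)
neighbours = top_similar_users(toy_sim, 101, n=2)
print(neighbours)
```

For customer 101 this returns customers 102 and 104, in that order. Dropping the customer by label does not rely on the self-similarity landing in the first position after sorting.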

Get the set of items bought by this customer.

In [18]:
def get_bought_items(user_item_m, customer_id):
    # Select the customer's row and keep the stock codes with a nonzero entry
    purchases = user_item_m.loc[customer_id]
    return set(purchases[purchases != 0].index)

In [19]:
items_bought = get_bought_items(user_item_matrix, 12350)
items_bought

{'20615',
 '20652',
 '21171',
 '21832',
 '21864',
 '21866',
 '21908',
 '21915',
 '22348',
 '22412',
 '22551',
 '22557',
 '22620',
 '79066K',
 '79191C',
 '84086C',
 'POST'}

## Get the list of items to recommend to the user

First we find the user who is most similar to the one we are recommending for. Then we compare the set of items our user has bought with the set of items the similar user has bought and take the difference, i.e. the items the similar user bought that our user has not. Lastly we look up the description of each of those items and return it along with the stock code.

In [20]:
def get_items_to_recommend_user(main_df, user_user_m, user_item_m, user_id):
    # Position 0 of the sorted row is expected to be the user itself, so take position 1
    most_similar_user = user_user_m.loc[user_id].sort_values(ascending=False).reset_index().iloc[1, 0]
    items_bought_by_user_a = get_bought_items(user_item_m, user_id)
    items_bought_by_user_b = get_bought_items(user_item_m, most_similar_user)
    # Items the similar user bought that our user has not
    items_to_recommend_to_a = items_bought_by_user_b - items_bought_by_user_a
    # Look up the descriptions of the recommended stock codes
    items_description = main_df.loc[main_df['StockCode'].isin(items_to_recommend_to_a), ['StockCode', 'Description']].drop_duplicates().set_index('StockCode')
    return items_description

In [21]:
get_items_to_recommend_user(df, user_user_matrix, user_item_matrix, 12358.0)

StockCode,Description
85015,SET OF 12 VINTAGE POSTCARD SET
16008,SMALL FOLDING SCISSOR(POINTED EDGE)


In [22]:
most_similar_user = user_user_matrix.loc[12358.0].sort_values(ascending=False).reset_index().iloc[1, 0]
most_similar_user

18240.0

In [23]:
a = get_bought_items(user_item_matrix, 12358.0)
a

{'15056BL',
 '15056N',
 '15056P',
 '15060B',
 '20679',
 '21232',
 '22059',
 '22063',
 '22646',
 '37447',
 '37449',
 '48185',
 'POST'}

In [24]:
b = get_bought_items(user_item_matrix, 18240.0)
b

{'15056BL', '15056N', '15056P', '16008', '20679', '85015'}

In [25]:
b - a

{'16008', '85015'}
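
The function above relies on the single most similar user. As a sketch of how the same idea might be extended, the example below pools the `k` most similar users and ranks each unseen item by the summed similarity of the neighbours who bought it. This is an illustrative variant, not part of the notebook's method; `recommend_from_neighbours` is a hypothetical name and all toy values are made up.

```python
import pandas as pd

def recommend_from_neighbours(user_user_m, user_item_m, user_id, k=2):
    """Score unseen items by the summed similarity of the k nearest users who bought them."""
    neighbours = user_user_m.loc[user_id].drop(user_id).nlargest(k)
    # Weight each neighbour's purchase row by that neighbour's similarity
    weighted = user_item_m.loc[neighbours.index].mul(neighbours, axis=0)
    scores = weighted.sum(axis=0)
    # Remove items the user already bought and items no neighbour bought
    already_bought = user_item_m.loc[user_id] > 0
    scores = scores[~already_bought]
    return scores[scores > 0].sort_values(ascending=False)

# Toy binary user-item matrix and similarity matrix (made up for illustration)
ids = [101, 102, 103]
toy_items = pd.DataFrame(
    [[1, 1, 0, 0], [1, 0, 1, 1], [0, 1, 1, 0]],
    index=ids,
    columns=["A", "B", "C", "D"],
)
toy_sim = pd.DataFrame(
    [[1.0, 0.4, 0.5], [0.4, 1.0, 0.3], [0.5, 0.3, 1.0]],
    index=ids,
    columns=ids,
)
recs = recommend_from_neighbours(toy_sim, toy_items, 101, k=2)
print(recs)
```

For customer 101, item C is ranked first because both neighbours bought it, followed by item D, which only one neighbour bought. Pooling several neighbours tends to give more candidates than a single neighbour, at the cost of choosing a value for `k`.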