# Building Product Recommendation

This module builds a product recommendation model: an algorithm that recommends products to users. In this module, I use recommendations based on two kinds of similarity: **similarities between users** and **similarities between products**. The data is a publicly available dataset from the UCI Machine Learning Repository, **Online Retail II.xlsx** (*https://archive.ics.uci.edu/ml/datasets/Online+Retail+II*). The Online Retail II dataset contains all the transactions of a UK-based and registered, non-store online retailer between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion giftware. Many of its customers are wholesalers.

As a quick overview, the steps are:

1. Data Preparation
    - Handling the range of values in the records
    - Handling NaN values in the dataset
    - Build a customer-to-item matrix
2. Data Analysis
    - Collaborative filtering using cosine similarity
    - User-to-user similarity
    - Product-to-product similarity

In [1]:
# import libraries for this notebook

import pandas as pd
import numpy as np
import itertools

from sklearn.metrics.pairwise import cosine_similarity  # package to calculate cosine similarity -> similarity between users/items

## Data Collection

In [2]:
df = pd.read_excel('online_retail_II.xlsx')

In [3]:
df.head(5)

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,2009-12-01 07:45:00,6.95,13085.0,United Kingdom
1,489434,79323P,PINK CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
2,489434,79323W,WHITE CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,2009-12-01 07:45:00,2.1,13085.0,United Kingdom
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,2009-12-01 07:45:00,1.25,13085.0,United Kingdom


In [4]:
# df.info() shows the number of records, the columns, their dtypes, and the non-null counts
# Description and Customer ID contain many null values

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 525461 entries, 0 to 525460
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   Invoice      525461 non-null  object        
 1   StockCode    525461 non-null  object        
 2   Description  522533 non-null  object        
 3   Quantity     525461 non-null  int64         
 4   InvoiceDate  525461 non-null  datetime64[ns]
 5   Price        525461 non-null  float64       
 6   Customer ID  417534 non-null  float64       
 7   Country      525461 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 32.1+ MB


In [5]:
# renaming customer id
# removing the space between "Customer" and "ID"

df = df.rename(columns={'Customer ID': 'CustomerID'})
df.head(5)

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,CustomerID,Country
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,2009-12-01 07:45:00,6.95,13085.0,United Kingdom
1,489434,79323P,PINK CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
2,489434,79323W,WHITE CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,2009-12-01 07:45:00,2.1,13085.0,United Kingdom
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,2009-12-01 07:45:00,1.25,13085.0,United Kingdom


## Data Preparation

### Handling Values Range

In [6]:
# print the minimum and maximum values of Quantity
# a negative minimum indicates canceled orders

print('minimum quantity:', df['Quantity'].min())
print('maximum quantity:', df['Quantity'].max())

minimum quantity: -9600
maximum quantity: 19152


In [7]:
# remove rows with non-positive quantities
# according to the dataset description, negative quantities are canceled orders

df = df[df['Quantity'] > 0]
print('minimum quantity:', df['Quantity'].min())

minimum quantity: 1
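
The filter above treats negative quantities as cancellations. As far as the UCI dataset description goes, invoice numbers that start with the letter `C` also mark cancellations, so the two signals can be compared before dropping rows. The snippet below is a minimal sketch on a made-up toy table; the rows and the names `toy`, `is_cancelled`, `mismatches`, and `kept` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy transactions; all values are made up for illustration.
toy = pd.DataFrame({
    "Invoice": ["489434", "C489449", "489435", "C489450"],
    "Quantity": [12, -12, 6, -1],
})

# Assumption: an invoice number starting with "C" denotes a cancellation.
is_cancelled = toy["Invoice"].astype(str).str.startswith("C")
is_negative = toy["Quantity"] < 0

# Number of rows where the two signals disagree.
mismatches = int((is_cancelled != is_negative).sum())

# Keep only regular invoices with a positive quantity.
kept = toy[~is_cancelled & (toy["Quantity"] > 0)]
print("rows where the flags disagree:", mismatches)
print(kept)
```

If the two flags disagree on the real data, those rows may deserve a closer look before they are filtered out.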


### Handling nan Values

In [8]:
# count the NaN values in each column
# around 100,000 rows have no CustomerID

df.isna().sum()

Invoice             0
StockCode           0
Description      1101
Quantity            0
InvoiceDate         0
Price               0
CustomerID     105440
Country             0
dtype: int64

In [9]:
df = df.dropna(subset=['CustomerID'])
print('number of nan values on customer ID row:', df['CustomerID'].isna().sum())
print('number of records after deleting nan values:', df.shape[0])

number of nan values on customer ID row: 0
number of records after deleting nan values: 407695


### Building a Customer-item Matrix

This is one of the most important steps in product recommendation. The customer-item matrix is the input used later to calculate the cosine similarity between products and between users.

In [10]:
df.head()

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,CustomerID,Country
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,2009-12-01 07:45:00,6.95,13085.0,United Kingdom
1,489434,79323P,PINK CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
2,489434,79323W,WHITE CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,2009-12-01 07:45:00,2.1,13085.0,United Kingdom
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,2009-12-01 07:45:00,1.25,13085.0,United Kingdom


In [11]:
# build the customer-item matrix using a pivot table
# df.pivot_table -> creates a spreadsheet-style pivot table as a DataFrame

customer_item_matrix = df.pivot_table(
    index='CustomerID',
    columns='StockCode',
    values='Quantity',
    aggfunc='sum'
    )

customer_item_matrix.head(5)

CustomerID,10002,10080,10109,10120,10125,10133,10134,10135,10138,11001,...,ADJUST2,BANK CHARGES,C2,D,M,PADS,POST,SP1002,TEST001,TEST002
12346.0,,,,,,,,,,,...,,,,,,,,,45.0,1.0
12347.0,,,,,,,,,,,...,,,,,,,,,,
12348.0,,,,,,,,,,,...,,,,,,,1.0,,,
12349.0,,,,,,,,,,,...,,,,,,,2.0,,,
12351.0,,,,,,,,,,,...,,,,,,,,,,
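
The column labels above include codes such as `POST`, `BANK CHARGES`, and `TEST001`, which look like postage, fees, and test entries rather than products. If these should not influence the similarities, they could be filtered out before pivoting. The sketch below assumes, as a guess and not as a rule from the dataset documentation, that a real product code starts with five digits; the toy quantities and the names `looks_like_product` and `products_only` are hypothetical.

```python
import pandas as pd

# Toy rows; quantities are made up. The stock codes appear in the outputs above.
toy = pd.DataFrame({
    "StockCode": [85048, "79323P", "POST", "TEST001", "BANK CHARGES", 10002],
    "Quantity": [12, 12, 1, 45, 1, 3],
})

# Assumption: product codes start with five digits, optionally followed by letters.
codes = toy["StockCode"].astype(str)
looks_like_product = codes.str.match(r"\d{5}")

products_only = toy[looks_like_product]
dropped_codes = sorted(codes[~looks_like_product])
print("dropped:", dropped_codes)
```

This rule would also drop codes such as `C2` or `PADS`, so the list of dropped codes is worth reviewing before committing to it.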


In [12]:
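# encode the data as 0 / 1
# value = 1 if the given product was purchased by the given customer, 0 otherwise (including NaN)
# the vectorized comparison gives the same result as an element-wise applymap,
# which is deprecated in recent pandas versions

customer_item_matrix = (customer_item_matrix > 0).astype(int)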
customer_item_matrix.head(5)

CustomerID,10002,10080,10109,10120,10125,10133,10134,10135,10138,11001,...,ADJUST2,BANK CHARGES,C2,D,M,PADS,POST,SP1002,TEST001,TEST002
12346.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1
12347.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12348.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
12349.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
12351.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
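
The two steps above (pivot, then 0/1 encoding) are easier to follow on a table small enough to check by hand. This is a minimal sketch with made-up data; `toy`, `toy_matrix`, and `toy_binary` are illustrative names.

```python
import pandas as pd

# Made-up transactions: customer 1 buys item A twice, so the quantities are summed.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2, 3],
    "StockCode": ["A", "A", "B", "B", "C", "A"],
    "Quantity": [2, 3, 1, 4, 1, 5],
})

# One row per customer, one column per item, total quantity in each cell.
toy_matrix = toy.pivot_table(
    index="CustomerID", columns="StockCode", values="Quantity", aggfunc="sum"
)

# NaN (never bought) compares as False, so it becomes 0.
toy_binary = (toy_matrix > 0).astype(int)
print(toy_binary)
```

Customer 1 ends up with `A=1, B=1, C=0`: the repeated purchase of A is collapsed into a single "bought" flag, so purchase volume no longer affects the similarity.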


## Data Analysis: User-based Collaborative Filtering

This approach is based on ***similarities between users***, measured from their item purchase histories.
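
For 0/1 purchase vectors, cosine similarity reduces to the number of shared items divided by the square root of the product of the two basket sizes. The sketch below works this out in plain Python for two made-up customers; the function name `cosine_sim_binary` and the sample baskets are illustrative assumptions.

```python
import math

def cosine_sim_binary(items_a, items_b):
    """Cosine similarity between two customers given as sets of purchased items."""
    if not items_a or not items_b:
        return 0.0  # a customer without purchases has no direction to compare
    shared = len(items_a & items_b)
    return shared / math.sqrt(len(items_a) * len(items_b))

alice = {"A", "B", "C", "D"}
bob = {"A", "B"}

# 2 shared items / sqrt(4 * 2)
sim = cosine_sim_binary(alice, bob)
print(round(sim, 6))
```

A value of 1.0 means identical baskets and 0.0 means no item in common, which is why every customer has similarity 1.0 with themselves.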

    1) Calculating cosine similarity between users from customer item matrix

In [13]:
# calculate cosine similarity on the customer-item matrix
# i.e. similarities between users based on their item purchase history

user_user_sim_matrix = pd.DataFrame(
    cosine_similarity(customer_item_matrix)
)

user_user_sim_matrix.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4304,4305,4306,4307,4308,4309,4310,4311,4312,4313
0,1.0,0.0,0.0,0.144707,0.0,0.0,0.0,0.0,0.0,0.183211,...,0.226455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071878,0.0
1,0.0,1.0,0.053452,0.025198,0.052164,0.028172,0.026726,0.025482,0.10146,0.083744,...,0.069007,0.0,0.0,0.053452,0.075593,0.047544,0.090351,0.0,0.087612,0.027242
2,0.0,0.053452,1.0,0.02357,0.0,0.0,0.0,0.0,0.189814,0.017408,...,0.032275,0.0,0.0,0.0,0.0,0.053368,0.0,0.0,0.0,0.0
3,0.144707,0.025198,0.02357,1.0,0.046004,0.04969,0.094281,0.044947,0.14061,0.196946,...,0.076073,0.054433,0.0,0.0,0.033333,0.109017,0.01992,0.0,0.064389,0.060062
4,0.0,0.052164,0.0,0.046004,1.0,0.051434,0.048795,0.046524,0.026463,0.033976,...,0.0,0.0,0.0,0.0,0.0,0.034721,0.041239,0.0,0.02666,0.024868


In [14]:
# restore the original labels on the columns and index
# cosine_similarity returns a plain array, so the DataFrame was labelled 0 - 4313

user_user_sim_matrix.columns = customer_item_matrix.index
user_user_sim_matrix.index = customer_item_matrix.index

user_user_sim_matrix.head(5)

CustomerID,12346.0,12347.0,12348.0,12349.0,12351.0,12352.0,12353.0,12355.0,12356.0,12357.0,...,18277.0,18278.0,18279.0,18280.0,18281.0,18283.0,18284.0,18285.0,18286.0,18287.0
12346.0,1.0,0.0,0.0,0.144707,0.0,0.0,0.0,0.0,0.0,0.183211,...,0.226455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071878,0.0
12347.0,0.0,1.0,0.053452,0.025198,0.052164,0.028172,0.026726,0.025482,0.10146,0.083744,...,0.069007,0.0,0.0,0.053452,0.075593,0.047544,0.090351,0.0,0.087612,0.027242
12348.0,0.0,0.053452,1.0,0.02357,0.0,0.0,0.0,0.0,0.189814,0.017408,...,0.032275,0.0,0.0,0.0,0.0,0.053368,0.0,0.0,0.0,0.0
12349.0,0.144707,0.025198,0.02357,1.0,0.046004,0.04969,0.094281,0.044947,0.14061,0.196946,...,0.076073,0.054433,0.0,0.0,0.033333,0.109017,0.01992,0.0,0.064389,0.060062
12351.0,0.0,0.052164,0.0,0.046004,1.0,0.051434,0.048795,0.046524,0.026463,0.033976,...,0.0,0.0,0.0,0.0,0.0,0.034721,0.041239,0.0,0.02666,0.024868


    2) Taking a look at each user and their similarities to the others

In [15]:
# check the top 5 most similar customers

cust_id = 12347.0  # customer to inspect
# head(6) because the first row is the customer itself (similarity 1.0)
top_5_similar_cust = user_user_sim_matrix.loc[cust_id].sort_values(ascending=False).head(6)
top_5_similar_cust

CustomerID
12347.0    1.000000
17396.0    0.216225
15764.0    0.187055
13640.0    0.179284
13102.0    0.179109
13418.0    0.175730
Name: 12347.0, dtype: float64
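
The first row of the result is always the customer themselves. One way to avoid asking for one extra row is to drop the customer's own entry before ranking. This is a minimal sketch on a made-up 3x3 similarity matrix; `toy_sim` and `neighbours` are illustrative names.

```python
import pandas as pd

# Made-up symmetric similarity matrix for three customers.
toy_sim = pd.DataFrame(
    [[1.0, 0.2, 0.7],
     [0.2, 1.0, 0.1],
     [0.7, 0.1, 1.0]],
    index=[101, 102, 103],
    columns=[101, 102, 103],
)

cust = 101
# Remove the self-similarity, then take the n most similar customers.
neighbours = toy_sim.loc[cust].drop(cust).nlargest(2)
print(neighbours)
```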

    3) How to calculate recommendation for products

In [16]:
# Retrieve all the products that each customer has ordered before
# (the column labels of the non-zero entries in that customer's row)

cust_id_A = 12347.0
cust_id_B = 17396.0

items_bought_by_A = set(customer_item_matrix.loc[cust_id_A].iloc[customer_item_matrix.loc[cust_id_A].to_numpy().nonzero()].index)
items_bought_by_B = set(customer_item_matrix.loc[cust_id_B].iloc[customer_item_matrix.loc[cust_id_B].to_numpy().nonzero()].index)

In [17]:
print('number of items bought by A:', len(items_bought_by_A))
print('number of items bought by B:', len(items_bought_by_B))

number of items bought by A: 70
number of items bought by B: 11
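
The purchased items can also be read from a row with a boolean mask, which some readers may find easier to follow than positional indexing with `nonzero()`. The sketch below is an alternative under that assumption, shown on a made-up 0/1 matrix; the helper `items_bought` is hypothetical and not part of the notebook.

```python
import pandas as pd

# Made-up 0/1 customer-item matrix.
toy_binary = pd.DataFrame(
    [[1, 1, 0, 1],
     [0, 1, 0, 0]],
    index=[1, 2],
    columns=["A", "B", "C", "D"],
)

def items_bought(matrix, customer):
    """Return the set of column labels where the customer's row is non-zero."""
    row = matrix.loc[customer]
    return set(row[row > 0].index)

bought_a = items_bought(toy_binary, 1)
bought_b = items_bought(toy_binary, 2)

# Items A bought that B has not bought yet.
to_recommend_to_b = bought_a - bought_b
print(sorted(to_recommend_to_b))
```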


In [18]:
# build the list of recommendations for customer B: items that A bought and B has not

items_to_recommend_to_B = items_bought_by_A - items_bought_by_B

df_recommend_user = df[df['StockCode'].isin(items_to_recommend_to_B)][['StockCode', 'Description']].drop_duplicates(subset='StockCode', keep='first').set_index('StockCode')
df_recommend_user

StockCode,Description
22195,HEART MEASURING SPOONS LARGE
21731,RED TOADSTOOL LED NIGHT LIGHT
84991,60 TEATIME FAIRY CAKE CASES
84997D,PINK 3 PIECE MINI DOTS CUTLERY SET
84997C,BLUE 3 PIECE MINI DOTS CUTLERY SET
...,...
22771,CLEAR DRAWER KNOB ACRYLIC EDWARDIAN
22772,PINK DRAWER KNOB ACRYLIC EDWARDIAN
22773,GREEN DRAWER KNOB ACRYLIC EDWARDIAN
22805,BLUE DRAWER KNOB ACRYLIC EDWARDIAN


## Data Analysis: Item-based Collaborative Filtering

This approach is based on ***similarities between items/products***. It addresses a weakness of user-based collaborative filtering, which cannot give recommendations to a new user who has no purchase history; looking at the similarities between products is a good alternative in that case.

    1) Compute cosine similarity between items from customer item matrix

In [19]:
customer_item_matrix.head(5)  # original customer item matrix

CustomerID,10002,10080,10109,10120,10125,10133,10134,10135,10138,11001,...,ADJUST2,BANK CHARGES,C2,D,M,PADS,POST,SP1002,TEST001,TEST002
12346.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1
12347.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12348.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
12349.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
12351.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [20]:
customer_item_matrix.T.head(5)  # transposed customer item matrix

StockCode,12346.0,12347.0,12348.0,12349.0,12351.0,12352.0,12353.0,12355.0,12356.0,12357.0,...,18277.0,18278.0,18279.0,18280.0,18281.0,18283.0,18284.0,18285.0,18286.0,18287.0
10002,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
10080,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10109,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10120,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10125,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
# to calculate cosine similarity between items, we first have to transpose the customer-item matrix
# so that the index holds the stock codes and the columns hold the customer IDs

item_item_sim_matrix = pd.DataFrame(cosine_similarity(customer_item_matrix.T))
item_item_sim_matrix.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4007,4008,4009,4010,4011,4012,4013,4014,4015,4016
0,1.0,0.040423,0.0,0.070367,0.057166,0.029412,0.036564,0.046676,0.06482,0.058683,...,0.046676,0.023338,0.015559,0.0,0.148716,0.0,0.095954,0.114332,0.0,0.0
1,0.040423,1.0,0.0,0.087039,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.02967,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.096225,0.267261,0.0,...,0.0,0.0,0.0,0.0,0.059339,0.0,0.0,0.0,0.0,0.0
3,0.070367,0.087039,0.0,1.0,0.024618,0.04222,0.0,0.050252,0.046524,0.090255,...,0.0,0.0,0.0,0.0,0.082637,0.0,0.022957,0.123091,0.0,0.0
4,0.057166,0.0,0.0,0.024618,1.0,0.068599,0.04264,0.068041,0.0,0.087988,...,0.0,0.040825,0.027217,0.0,0.083918,0.08165,0.027975,0.1,0.0,0.0


In [22]:
# again, replace the index and column labels with the stock codes from the customer-item matrix

item_item_sim_matrix.index = customer_item_matrix.T.index
item_item_sim_matrix.columns = customer_item_matrix.T.index

item_item_sim_matrix.head(5)

StockCode,10002,10080,10109,10120,10125,10133,10134,10135,10138,11001,...,ADJUST2,BANK CHARGES,C2,D,M,PADS,POST,SP1002,TEST001,TEST002
10002,1.0,0.040423,0.0,0.070367,0.057166,0.029412,0.036564,0.046676,0.06482,0.058683,...,0.046676,0.023338,0.015559,0.0,0.148716,0.0,0.095954,0.114332,0.0,0.0
10080,0.040423,1.0,0.0,0.087039,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.02967,0.0,0.0,0.0,0.0,0.0
10109,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.096225,0.267261,0.0,...,0.0,0.0,0.0,0.0,0.059339,0.0,0.0,0.0,0.0,0.0
10120,0.070367,0.087039,0.0,1.0,0.024618,0.04222,0.0,0.050252,0.046524,0.090255,...,0.0,0.0,0.0,0.0,0.082637,0.0,0.022957,0.123091,0.0,0.0
10125,0.057166,0.0,0.0,0.024618,1.0,0.068599,0.04264,0.068041,0.0,0.087988,...,0.0,0.040825,0.027217,0.0,0.083918,0.08165,0.027975,0.1,0.0,0.0
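
To see what the transpose does, the item-to-item similarity can be reproduced by hand: the product of the transposed matrix with the original counts, for every pair of items, how many customers bought both, and dividing by the item norms turns those counts into cosines. This is a minimal NumPy sketch on a made-up matrix; the variable names are illustrative, and it assumes every item has at least one buyer (otherwise the division would hit a zero norm).

```python
import numpy as np

# Made-up 0/1 matrix: rows are customers, columns are items A, B, C.
X = np.array([
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
], dtype=float)

# Entry (i, j) is the number of customers who bought both item i and item j.
co_purchases = X.T @ X

# The diagonal holds each item's buyer count, i.e. its squared norm.
norms = np.sqrt(np.diag(co_purchases))
item_sim = co_purchases / np.outer(norms, norms)
print(np.round(item_sim, 3))
```

Items A and B share one buyer out of two each, so their similarity is 0.5, while A and C share none and score 0.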


    2) Finding the top 10 items most similar to a given item

In [23]:
item_id = 10002  # item to inspect
# head(11) because the first row is the item itself (similarity 1.0)
top_10_similar_item = item_item_sim_matrix.loc[item_id].sort_values(ascending=False).head(11)
top_10_similar_item

StockCode
10002    1.000000
21914    0.203011
20726    0.184680
20725    0.184313
21915    0.182040
21827    0.179961
21544    0.179779
22629    0.177894
21878    0.176390
22326    0.174871
22561    0.173350
Name: 10002, dtype: float64

In [24]:
df_recommend_item = df[df['StockCode'].isin(top_10_similar_item.index)][['StockCode', 'Description']].drop_duplicates(subset=['StockCode'], keep='first').reset_index(drop=True)
df_recommend_item = df_recommend_item.merge(pd.DataFrame(top_10_similar_item), on='StockCode', how='inner')
df_recommend_item.columns = ['StockCode', 'Description', 'CosineSim']
df_recommend_item

Unnamed: 0,StockCode,Description,CosineSim
0,10002,INFLATABLE POLITICAL GLOBE,1.0
1,20725,LUNCH BAG RED SPOTTY,0.184313
2,20726,LUNCH BAG WOODLAND,0.18468
3,21827,EIGHT PIECE CREEPY CRAWLIE SET,0.179961
4,21544,SKULLS WATER TRANSFER TATTOOS,0.179779
5,21914,BLUE HARMONICA IN BOX,0.203011
6,21915,RED HARMONICA IN BOX,0.18204
7,22326,"ROUND SNACK BOXES ,SET4, WOODLAND",0.174871
8,22561,WOODEN SCHOOL COLOURING SET,0.17335
9,21878,PACK OF 6 SANDCASTLE FLAGS ASSORTED,0.17639
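
Note that the merged table above follows the row order of `df`, not the similarity ranking, and still contains the query item itself. One possible tidy-up is sketched below on three of the rows shown above; the names `catalog`, `sims`, and `ranked` are illustrative, and the stock codes are written as strings here for simplicity.

```python
import pandas as pd

query_item = "10002"

# Similarities for the query item (values copied from the output above).
sims = pd.Series(
    {"10002": 1.0, "21914": 0.203011, "20726": 0.18468},
    name="CosineSim",
)
sims.index.name = "StockCode"

# Descriptions in an arbitrary order, as they might come out of the transactions.
catalog = pd.DataFrame({
    "StockCode": ["20726", "10002", "21914"],
    "Description": ["LUNCH BAG WOODLAND", "INFLATABLE POLITICAL GLOBE", "BLUE HARMONICA IN BOX"],
})

ranked = catalog.merge(sims.reset_index(), on="StockCode", how="inner")
ranked = ranked[ranked["StockCode"] != query_item]  # drop the item itself
ranked = ranked.sort_values("CosineSim", ascending=False).reset_index(drop=True)
print(ranked)
```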


# Compiling into Functions

## User-Based Collaborative Filtering

In [25]:
'''
build_user_user_matrix() builds the customer-item matrix and the user-to-user cosine similarity matrix for user-based collaborative filtering
df      : dataframe that contains your data
index   : name of the customer identifier column
columns : name of the item/product identifier column
values  : name of the column that holds how many items the customer bought
'''

def build_user_user_matrix(df, index, columns, values):

    # total quantity bought per (customer, item) pair
    customer_item_matrix = df.pivot_table(
        index=index,
        columns=columns,
        values=values,
        aggfunc='sum'
    )

    # encode as 0 / 1: 1 if the customer bought the item at least once
    customer_item_matrix = (customer_item_matrix > 0).astype(int)

    # cosine similarity between the rows, i.e. between customers
    user_user_sim_matrix = pd.DataFrame(
        cosine_similarity(customer_item_matrix)
    )

    # restore the customer identifiers as labels
    user_user_sim_matrix.columns = customer_item_matrix.index
    user_user_sim_matrix.index = customer_item_matrix.index

    return customer_item_matrix, user_user_sim_matrix

In [26]:
'''
top_similar_cust() returns the customers most similar to the given customer
user_user_sim_matrix    : pass user_user_sim_matrix from build_user_user_matrix()
index                   : name of the customer identifier column
cust_main               : identifier of the customer to compare against
num_sim                 : number of rows to return (the first row is cust_main itself)
'''

def top_similar_cust(user_user_sim_matrix, index, cust_main, num_sim):
    df_top_similar_cust = pd.DataFrame(user_user_sim_matrix.loc[cust_main].sort_values(ascending=False).reset_index(drop=False).head(num_sim))
    df_top_similar_cust.columns = [index, 'CosineSim']
    return df_top_similar_cust

In [27]:
'''
items_to_recommend() returns the items bought by cust_main that cust_comp has not bought yet, i.e. the items to recommend to cust_comp
customer_item_matrix    : pass customer_item_matrix from build_user_user_matrix()
cust_main               : identifier of the customer whose purchases are the source of recommendations
cust_comp               : identifier of the similar customer who receives the recommendations
'''

def items_to_recommend(customer_item_matrix, cust_main, cust_comp):

    items_bought_by_A = set(customer_item_matrix.loc[cust_main].iloc[customer_item_matrix.loc[cust_main].to_numpy().nonzero()].index)
    items_bought_by_B = set(customer_item_matrix.loc[cust_comp].iloc[customer_item_matrix.loc[cust_comp].to_numpy().nonzero()].index)

    # items that A bought and B has not bought yet
    items_to_recommend_to_B = items_bought_by_A - items_bought_by_B

    return list(items_to_recommend_to_B)

In [28]:
index = 'CustomerID'
columns = 'StockCode'
values = 'Quantity'

customer_item_matrix, user_user_sim_matrix = build_user_user_matrix(df, index, columns, values)

In [29]:
display('Customer-Item Matrix:', customer_item_matrix.head(5))
display('User-to-user Similarity Matrix:', user_user_sim_matrix.head(5))

'Customer-Item Matrix:'

CustomerID,10002,10080,10109,10120,10125,10133,10134,10135,10138,11001,...,ADJUST2,BANK CHARGES,C2,D,M,PADS,POST,SP1002,TEST001,TEST002
12346.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1
12347.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12348.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
12349.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
12351.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


'User-to-user Similarity Matrix:'

CustomerID,12346.0,12347.0,12348.0,12349.0,12351.0,12352.0,12353.0,12355.0,12356.0,12357.0,...,18277.0,18278.0,18279.0,18280.0,18281.0,18283.0,18284.0,18285.0,18286.0,18287.0
12346.0,1.0,0.0,0.0,0.144707,0.0,0.0,0.0,0.0,0.0,0.183211,...,0.226455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071878,0.0
12347.0,0.0,1.0,0.053452,0.025198,0.052164,0.028172,0.026726,0.025482,0.10146,0.083744,...,0.069007,0.0,0.0,0.053452,0.075593,0.047544,0.090351,0.0,0.087612,0.027242
12348.0,0.0,0.053452,1.0,0.02357,0.0,0.0,0.0,0.0,0.189814,0.017408,...,0.032275,0.0,0.0,0.0,0.0,0.053368,0.0,0.0,0.0,0.0
12349.0,0.144707,0.025198,0.02357,1.0,0.046004,0.04969,0.094281,0.044947,0.14061,0.196946,...,0.076073,0.054433,0.0,0.0,0.033333,0.109017,0.01992,0.0,0.064389,0.060062
12351.0,0.0,0.052164,0.0,0.046004,1.0,0.051434,0.048795,0.046524,0.026463,0.033976,...,0.0,0.0,0.0,0.0,0.0,0.034721,0.041239,0.0,0.02666,0.024868


In [30]:
cust_main = 18283.0

top_similar_cust(user_user_sim_matrix, index, cust_main, 10)

Unnamed: 0,CustomerID,CosineSim
0,18283.0,1.0
1,17160.0,0.316825
2,13069.0,0.306307
3,14680.0,0.305603
4,15719.0,0.293204
5,16549.0,0.287967
6,18145.0,0.286077
7,15005.0,0.283232
8,18168.0,0.283186
9,16797.0,0.280785


In [31]:
cust_comp = 17160.0

items_to_recommend(customer_item_matrix, cust_main, cust_comp)[0:5]  # [0:5] prints only five of the products to recommend; the list comes from a set, so its order is arbitrary

[22021, 21527, 22551, 21531, 21535]
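
The function returns bare stock codes. To make the result readable, the codes can be mapped back to descriptions, much like the earlier `isin` lookup. The sketch below uses placeholder descriptions (the real descriptions of these codes are not shown in this notebook) and a hypothetical code `99999` to show what happens when a code is missing from the lookup.

```python
import pandas as pd

# Placeholder transactions; the descriptions are invented for illustration.
toy = pd.DataFrame({
    "StockCode": [22021, 21527, 22021, 22551],
    "Description": ["ITEM A", "ITEM B", "ITEM A", "ITEM C"],
})

# One description per stock code.
lookup = toy.drop_duplicates(subset="StockCode").set_index("StockCode")["Description"]

# 99999 is a hypothetical code that does not exist in the toy data.
recommended = {22021, 21527, 99999}

# Sorting gives a stable order; reindex leaves NaN for unknown codes.
described = lookup.reindex(sorted(recommended))
print(described)
```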

## Item-Based Collaborative Filtering

In [32]:
'''
build_item_item_matrix() builds the customer-item matrix and the item-to-item cosine similarity matrix for item-based collaborative filtering
df      : dataframe containing your data
index   : name of the customer identifier column
columns : name of the item/product identifier column
values  : name of the column that holds how many items the customer bought
'''

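def build_item_item_matrix(df, index, columns, values):

    # total quantity bought per (customer, item) pair
    customer_item_matrix = df.pivot_table(
        index=index,
        columns=columns,
        values=values,
        aggfunc='sum'
    )

    # encode as 0 / 1: 1 if the customer bought the item at least once
    customer_item_matrix = (customer_item_matrix > 0).astype(int)

    # cosine similarity between the rows of the transpose, i.e. between items
    # (assigned to a local item_item_sim_matrix so the function does not depend on a global)
    item_item_sim_matrix = pd.DataFrame(
        cosine_similarity(customer_item_matrix.T)
    )

    # restore the item identifiers as labels
    item_item_sim_matrix.columns = customer_item_matrix.T.index
    item_item_sim_matrix.index = customer_item_matrix.T.index

    return customer_item_matrix, item_item_sim_matrix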

In [33]:
'''
top_similar_item_fun() returns the items most similar to the given item, for item-based collaborative filtering
item_item_sim_matrix    : pass item_item_sim_matrix from build_item_item_matrix()
columns                 : name of the item/product identifier column
item_id                 : identifier of the item to compare against
num_sim                 : number of rows to return, or 'all' (the first row is item_id itself)
'''

def top_similar_item_fun(item_item_sim_matrix, columns, item_id, num_sim='all'):

    # similarities of every item to item_id, most similar first
    top_similar_item = item_item_sim_matrix.loc[item_id].sort_values(ascending=False)

    if isinstance(num_sim, int):
        top_similar_item = top_similar_item.head(num_sim)
    elif num_sim != 'all':
        raise ValueError("num_sim must be 'all' or an integer")

    df_recommend_item = pd.DataFrame(top_similar_item).reset_index()
    df_recommend_item.columns = [columns, 'CosineSim']

    return df_recommend_item

In [34]:
index = 'CustomerID'
columns = 'StockCode'
values = 'Quantity'

customer_item_matrix, item_item_sim_matrix = build_item_item_matrix(df, index, columns, values)

In [35]:
display('Customer Item matrix', customer_item_matrix)
display('item-to-item Similarity matrix', item_item_sim_matrix)

'Customer Item matrix'

CustomerID,10002,10080,10109,10120,10125,10133,10134,10135,10138,11001,...,ADJUST2,BANK CHARGES,C2,D,M,PADS,POST,SP1002,TEST001,TEST002
12346.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1
12347.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12348.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
12349.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
12351.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18283.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
18284.0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
18285.0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
18286.0,0,0,0,0,0,0,0,1,0,0,...,0,0,1,0,0,0,0,0,0,0


'item-to-item Similarity matrix'

StockCode,10002,10080,10109,10120,10125,10133,10134,10135,10138,11001,...,ADJUST2,BANK CHARGES,C2,D,M,PADS,POST,SP1002,TEST001,TEST002
10002,1.000000,0.040423,0.0,0.070367,0.057166,0.029412,0.036564,0.046676,0.064820,0.058683,...,0.046676,0.023338,0.015559,0.000000,0.148716,0.00000,0.095954,0.114332,0.0,0.0
10080,0.040423,1.000000,0.0,0.087039,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.029670,0.00000,0.000000,0.000000,0.0,0.0
10109,0.000000,0.000000,1.0,0.000000,0.000000,0.000000,0.000000,0.096225,0.267261,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.059339,0.00000,0.000000,0.000000,0.0,0.0
10120,0.070367,0.087039,0.0,1.000000,0.024618,0.042220,0.000000,0.050252,0.046524,0.090255,...,0.000000,0.000000,0.000000,0.000000,0.082637,0.00000,0.022957,0.123091,0.0,0.0
10125,0.057166,0.000000,0.0,0.024618,1.000000,0.068599,0.042640,0.068041,0.000000,0.087988,...,0.000000,0.040825,0.027217,0.000000,0.083918,0.08165,0.027975,0.100000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
PADS,0.000000,0.000000,0.0,0.000000,0.081650,0.000000,0.000000,0.027778,0.000000,0.029934,...,0.000000,0.000000,0.000000,0.000000,0.000000,1.00000,0.000000,0.000000,0.0,0.0
POST,0.095954,0.000000,0.0,0.022957,0.027975,0.015992,0.000000,0.012690,0.000000,0.041025,...,0.000000,0.019035,0.012690,0.029488,0.058691,0.00000,1.000000,0.046625,0.0,0.0
SP1002,0.114332,0.000000,0.0,0.123091,0.100000,0.085749,0.000000,0.000000,0.188982,0.073324,...,0.000000,0.000000,0.000000,0.000000,0.083918,0.00000,0.046625,1.000000,0.0,0.0
TEST001,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,1.0,0.5


In [36]:
item_id = 10109
num_sim = 10

top_similar_item_fun(item_item_sim_matrix, columns, item_id, num_sim)

Unnamed: 0,StockCode,CosineSim
0,10109,1.0
1,20673,0.707107
2,37471,0.57735
3,16215,0.57735
4,37345,0.447214
5,84340,0.447214
6,16245A,0.377964
7,90088,0.353553
8,84596I,0.333333
9,84925C,0.316228
