# Simple Product Recommender

Studies have shown that personalized product recommendations improve conversion rates and customer retention. They also help subscription services such as Spotify and Netflix increase user engagement.

There are generally two ways to produce a product recommendation:
 - Collaborative Filtering
 - Content-based filtering
 
Collaborative filtering profiles you based on your behaviour and recommends products that other users with a similar profile have purchased or viewed. We will use collaborative filtering in this case study.

Content-based filtering looks at your behaviour and recommends products similar to the ones you have already purchased or viewed.

In [1]:
%load_ext autoreload
%autoreload 2

from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv('Online Retail.csv', index_col=0)  # Large file; the outputs below come from the full dataset.
df.index = df.index.map(str)
df = df[df['Quantity'] > 0]
df = df[~df['CustomerID'].isnull()]

In [4]:
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


This dataset contains the purchases of customers, together with the prices and quantities of the items they purchased. We're going to create something called a customer-item matrix, which will give us insight into which customers are buying which items.

In [5]:
customer_item_matrix = df.pivot_table(
    index='CustomerID',
    columns='StockCode',
    values='Quantity',
    aggfunc='sum'  # total quantity per (customer, item) pair
)
print(customer_item_matrix.shape)
customer_item_matrix.head()

(4339, 3665)


CustomerID,10002,10080,10120,10123C,10124A,10124G,10125,10133,10135,11001,...,90214V,90214W,90214Y,90214Z,BANK CHARGES,C2,DOT,M,PADS,POST
12346.0,,,,,,,,,,,...,,,,,,,,,,
12347.0,,,,,,,,,,,...,,,,,,,,,,
12348.0,,,,,,,,,,,...,,,,,,,,,,9.0
12349.0,,,,,,,,,,,...,,,,,,,,,,1.0
12350.0,,,,,,,,,,,...,,,,,,,,,,1.0


For each customer (row), this table shows how many units of each item (column) they bought. The preview mostly shows NaNs, which means that these customers did not buy these items. Note that there are 4339 unique customers and 3665 unique items.

Let's clean this matrix up so that it shows zeros (not bought) and ones (bought).

In [6]:
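# NaN > 0 evaluates to False, so items that were never bought become 0.
# (DataFrame.applymap is deprecated in recent pandas versions.)
cim = (customer_item_matrix > 0).astype(int)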
cim.head()

CustomerID,10002,10080,10120,10123C,10124A,10124G,10125,10133,10135,11001,...,90214V,90214W,90214Y,90214Z,BANK CHARGES,C2,DOT,M,PADS,POST
12346.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12347.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12348.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
12349.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
12350.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


Much cleaner looking!
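To make the two steps above concrete, here is a minimal sketch on a hypothetical toy table. The data and the names `toy`, `toy_matrix`, and `toy_binary` are made up for illustration; only the column names mirror the real dataset.

```python
import pandas as pd

# Hypothetical toy transactions; the column names mirror the dataset above.
toy = pd.DataFrame({
    'CustomerID': [1, 1, 2, 2, 3],
    'StockCode': ['A', 'B', 'A', 'A', 'C'],
    'Quantity': [2, 1, 3, 4, 5],
})

# Sum quantities per (customer, item); pairs with no purchase become NaN.
toy_matrix = toy.pivot_table(
    index='CustomerID',
    columns='StockCode',
    values='Quantity',
    aggfunc='sum',
)

# NaN > 0 evaluates to False, so missing purchases become 0.
toy_binary = (toy_matrix > 0).astype(int)

print(toy_matrix)
print(toy_binary)
```

Customer 2 bought item A twice (3 + 4 units), so the pivot table sums those rows into a single cell before the matrix is reduced to zeros and ones.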

## User-based Collaborative Filtering

Let's compute the cosine similarities between the users.

In [7]:
# Label both axes with the customer IDs so rows and columns can be looked up by ID.
user2_sim_matrix = pd.DataFrame(
    cosine_similarity(cim),
    index=cim.index,
    columns=cim.index,
)
user2_sim_matrix.head()

CustomerID,12346.0,12347.0,12348.0,12349.0,12350.0,12352.0,12353.0,12354.0,12355.0,12356.0,...,18273.0,18274.0,18276.0,18277.0,18278.0,18280.0,18281.0,18282.0,18283.0,18287.0
12346.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12347.0,0.0,1.0,0.063022,0.04613,0.047795,0.038484,0.0,0.025876,0.136641,0.094742,...,0.0,0.029709,0.052668,0.0,0.032844,0.062318,0.0,0.113776,0.109364,0.012828
12348.0,0.0,0.063022,1.0,0.024953,0.051709,0.027756,0.0,0.027995,0.118262,0.146427,...,0.0,0.064282,0.113961,0.0,0.0,0.0,0.0,0.0,0.170905,0.083269
12349.0,0.0,0.04613,0.024953,1.0,0.056773,0.137137,0.0,0.030737,0.032461,0.144692,...,0.0,0.105868,0.0,0.0,0.039014,0.0,0.0,0.067574,0.137124,0.030475
12350.0,0.0,0.047795,0.051709,0.056773,1.0,0.031575,0.0,0.0,0.0,0.033315,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.044866,0.0


Naturally, the diagonal of that matrix contains ones (each customer compared with themselves has a cosine similarity of 1). We're going to work off of this matrix to create recommendations.
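As a rough illustration of what the cosine similarity measures on zero/one purchase vectors, the sketch below computes it by hand with NumPy. The vectors `u` and `v` are hypothetical customers, not rows from the dataset.

```python
import numpy as np

# Hypothetical binary purchase vectors over five items.
u = np.array([1, 1, 0, 1, 0])
v = np.array([1, 0, 0, 1, 1])

# Cosine similarity: dot product divided by the product of the norms.
cos_uv = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# For 0/1 vectors this equals
# (items bought by both) / sqrt(items bought by u * items bought by v).
shared = np.sum((u == 1) & (v == 1))
cos_sets = shared / np.sqrt(u.sum() * v.sum())

# A vector compared with itself always gives 1, hence the diagonal of ones.
cos_uu = u @ u / (np.linalg.norm(u) * np.linalg.norm(u))

print(cos_uv, cos_sets, cos_uu)
```

Here the two customers share two items and each bought three, so the similarity is 2 / 3.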

Now we are going to pick a customer (12350) and come up with product recommendations for them based on their cosine similarity to other customers.

In [8]:
CA = 12350.0
user2_sim_matrix.loc[CA].sort_values(
    ascending=False).head()

CustomerID
12350.0    1.000000
17935.0    0.183340
12414.0    0.181902
12652.0    0.175035
16692.0    0.171499
Name: 12350.0, dtype: float64

The customer with the highest cosine similarity to Customer 12350 (Customer A) is Customer 17935 (Customer B). Let's use Customer B to recommend products to Customer A. We will do this by finding the items that Customer B has bought that Customer A hasn't. We assume that Customer A is likely to buy them since the two have a high cosine similarity (relative to the rest of the group).

In [9]:
def get_similar_users(sim_matrix, userid):
    """Get similar users."""
    simusers = sim_matrix.loc[userid].sort_values(ascending=False)
    return simusers.index


CB = get_similar_users(user2_sim_matrix, CA)[1]

CA_bought = set(cim.loc[CA][cim.loc[CA] > 0].index)
CB_bought = set(cim.loc[CB][cim.loc[CB] > 0].index)

CA_recommend = CB_bought - CA_bought
print('We have ' + str(len(CA_recommend)) +
      ' products to recommend')

We have 24 products to recommend
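The same logic can be wrapped in a small function. The sketch below is one possible variant on a hypothetical toy matrix: `recommend_from_nearest`, `toy_cim`, and `toy_sim` are illustrative names, and the similarity is computed with plain NumPy under the assumption that every customer has bought at least one item (otherwise the norm would be zero).

```python
import numpy as np
import pandas as pd

# Hypothetical binary customer-item matrix (rows: customers, columns: items).
toy_cim = pd.DataFrame(
    [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 0, 0, 1]],
    index=['c1', 'c2', 'c3'],
    columns=['A', 'B', 'C', 'D'],
)

# Cosine similarity between every pair of rows, computed with NumPy.
values = toy_cim.to_numpy(dtype=float)
unit = values / np.linalg.norm(values, axis=1, keepdims=True)
toy_sim = pd.DataFrame(unit @ unit.T,
                       index=toy_cim.index, columns=toy_cim.index)


def recommend_from_nearest(cim, sim_matrix, userid):
    """Return the nearest other user and the items they bought that userid has not."""
    # Drop the user by label instead of relying on them sorting first.
    neighbour = sim_matrix.loc[userid].drop(userid).idxmax()
    bought = set(cim.columns[cim.loc[userid] > 0])
    neighbour_bought = set(cim.columns[cim.loc[neighbour] > 0])
    return neighbour, neighbour_bought - bought


print(recommend_from_nearest(toy_cim, toy_sim, 'c1'))
```

Dropping the customer by label is a small design choice: taking position `[1]` of the sorted index assumes the customer always sorts first, which may not hold if another customer has an identical purchase history and therefore also a similarity of 1.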


Now that we have the set of product IDs to recommend, let's make this actually useful to the customer by listing the product descriptions.

In [10]:
CA_rec2 = (df.loc[df['StockCode'].isin(CA_recommend),
                ['StockCode', 'Description']]
           .drop_duplicates().set_index('StockCode'))

CA_rec2.head()

StockCode,Description
22752,SET 7 BABUSHKA NESTING BOXES
22749,FELTCRAFT PRINCESS CHARLOTTE DOLL
22659,LUNCH BOX I LOVE LONDON
85099B,JUMBO BAG RED RETROSPOT
22449,SILK PURSE BABUSHKA PINK


In summary, we applied a user-based collaborative filtering method to recommend products to a customer based on the purchase history of the most similar customer. However, for customers without much of a purchase history, user-based collaborative filtering is not going to work well, because there is little to base the similarity on.

## Item-based Collaborative Filtering
Let's compute the cosine similarity between the items.

In [11]:
# Transpose so that each row is an item; label both axes with the stock codes.
item2_sim_matrix = pd.DataFrame(
    cosine_similarity(cim.T),
    index=cim.columns,
    columns=cim.columns,
)
item2_sim_matrix.head()

StockCode,10002,10080,10120,10123C,10124A,10124G,10125,10133,10135,11001,...,90214V,90214W,90214Y,90214Z,BANK CHARGES,C2,DOT,M,PADS,POST
10002,1.0,0.0,0.094868,0.091287,0.0,0.0,0.090351,0.062932,0.098907,0.095346,...,0.0,0.0,0.0,0.0,0.0,0.029361,0.0,0.066915,0.0,0.078217
10080,0.0,1.0,0.0,0.0,0.0,0.0,0.032774,0.045655,0.047836,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016182,0.0,0.0
10120,0.094868,0.0,1.0,0.11547,0.0,0.0,0.057143,0.059702,0.041703,0.060302,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.070535,0.0,0.010993
10123C,0.091287,0.0,0.11547,1.0,0.0,0.0,0.164957,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10124A,0.0,0.0,0.0,0.0,1.0,0.447214,0.063888,0.044499,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


How do we use this to recommend products?

Two items have a high cosine similarity when many of the same customers bought both. In other words, the higher the cosine similarity between two items, the more likely they are to have been purchased by the same customer. Let's list the top ten items by similarity to Item 23166.

In [12]:
itemA = '23166'

# I will be re-using a function defined during the user-based
# collaborative filtering exercise. It still works well
# for our purposes!
item_recs = list(get_similar_users(item2_sim_matrix, itemA)[:10])
item_recs

['23166',
 '23165',
 '23167',
 '22993',
 '23307',
 '22722',
 '22720',
 '22666',
 '23243',
 '22961']

In [13]:
item_rec2 = (df.loc[df['StockCode'].isin(item_recs),
                ['StockCode', 'Description']]
           .drop_duplicates().set_index('StockCode')).loc[item_recs]

item_rec2

StockCode,Description
23166,MEDIUM CERAMIC TOP STORAGE JAR
23165,LARGE CERAMIC TOP STORAGE JAR
23167,SMALL CERAMIC TOP STORAGE JAR
22993,SET OF 4 PANTRY JELLY MOULDS
23307,SET OF 60 PANTRY DESIGN CAKE CASES
22722,SET OF 6 SPICE TINS PANTRY DESIGN
22720,SET OF 3 CAKE TINS PANTRY DESIGN
22666,RECIPE BOX PANTRY YELLOW DESIGN
23243,SET OF TEA COFFEE SUGAR TINS PANTRY
22961,JAM MAKING SET PRINTED


It appears we have a duplicate StockCode.

Anyway, based on an item that the customer just bought (23166), we are able to recommend a list of other items in order of decreasing cosine similarity. Note that the first entry is the purchased item itself, since every item has a cosine similarity of 1 with itself. The list makes sense given that the recently purchased item was a storage jar: the next recommendations are the same jar in different sizes.
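A minimal sketch of the item-based lookup on hypothetical toy data is shown below. The helper `top_similar_items` and the item names are made up for illustration; unlike the cell above, it drops the query item so that it never recommends the item the customer just bought.

```python
import numpy as np
import pandas as pd

# Hypothetical binary customer-item matrix (rows: customers, columns: items).
toy_cim = pd.DataFrame(
    [[1, 1, 0],
     [1, 1, 0],
     [1, 0, 1],
     [0, 0, 1]],
    index=['c1', 'c2', 'c3', 'c4'],
    columns=['jar_small', 'jar_large', 'tin'],
)

# Item vectors are the columns, so transpose before computing similarities.
items = toy_cim.T.to_numpy(dtype=float)
unit = items / np.linalg.norm(items, axis=1, keepdims=True)
toy_item_sim = pd.DataFrame(unit @ unit.T,
                            index=toy_cim.columns, columns=toy_cim.columns)


def top_similar_items(sim_matrix, item, n=10):
    """Return up to n items most similar to item, excluding item itself."""
    scores = sim_matrix.loc[item].drop(item)
    return list(scores.sort_values(ascending=False).head(n).index)


print(top_similar_items(toy_item_sim, 'jar_small', n=2))
```

In this toy data, two of the three customers who bought `jar_small` also bought `jar_large`, so it ranks ahead of `tin`.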

## Next Steps
Split the data into train and test sets to see if you can predict what customers are buying!
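One possible way to set this up, sketched on hypothetical toy data, is to split the transactions by date and check whether recommendations built from the earlier period show up in the later one. The cutoff date, the `hit_rate` helper, and the stand-in `recs` dictionary are all assumptions made for illustration; in practice the recommendations would come from a recommender fitted on `train` only.

```python
import pandas as pd

# Hypothetical toy transactions with the same column names as the dataset.
toy = pd.DataFrame({
    'CustomerID': [1, 1, 2, 2, 2, 1],
    'StockCode': ['A', 'B', 'A', 'B', 'C', 'C'],
    'InvoiceDate': pd.to_datetime([
        '2011-01-01', '2011-01-02', '2011-01-03',
        '2011-01-04', '2011-01-05', '2011-02-01',
    ]),
})

# Everything before the cutoff is training data; the rest is held out.
cutoff = pd.Timestamp('2011-02-01')
train = toy[toy['InvoiceDate'] < cutoff]
test = toy[toy['InvoiceDate'] >= cutoff]


def hit_rate(recommendations, test_df):
    """Share of test customers who bought at least one recommended item."""
    bought = test_df.groupby('CustomerID')['StockCode'].apply(set)
    hits = [bool(recommendations.get(customer, set()) & items)
            for customer, items in bought.items()]
    return sum(hits) / len(hits)


# Stand-in recommendations keyed by customer ID.
recs = {1: {'C'}}
print(hit_rate(recs, test))
```

Splitting by date rather than at random keeps the evaluation closer to how a recommender would be used: it only ever sees purchases made before the ones it is asked to predict.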