# Intro to Recommender Systems Lab

Complete the exercises below to solidify your knowledge and understanding of recommender systems.

In this lab, we will build a user-similarity-based recommender system step by step. Our data set contains customer purchase records, and we will use similarities in purchase behavior to inform the recommendations. The system will generate 5 recommendations for each customer based on the purchases they have made.

In [2]:
import pandas as pd
from scipy.spatial.distance import pdist, squareform

In [32]:
df = pd.read_excel('online_fashion.xlsx')

In [33]:
df.isna().sum()

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

In [34]:
df.shape

(541909, 8)

In [35]:
df.Quantity.unique()

array([     6,      8,      2,     32,      3,      4,     24,     12,
           48,     18,     20,     36,     80,     64,     10,    120,
           96,     23,      5,      1,     -1,     50,     40,    100,
          192,    432,    144,    288,    -12,    -24,     16,      9,
          128,     25,     30,     28,      7,     56,     72,    200,
          600,    480,     -6,     14,     -2,     11,     33,     13,
           -4,     -5,     -7,     -3,     70,    252,     60,    216,
          384,    -10,     27,     15,     22,     19,     17,     21,
           34,     47,    108,     52,  -9360,    -38,     75,    270,
           42,    240,     90,    320,   1824,    204,     69,    -36,
         -192,   -144,    160,   2880,   1400,     39,    -48,    -50,
           26,   1440,     31,     82,     78,     97,     98,     35,
           57,    -20,    110,    -22,    -30,    -70,   -130,    -80,
         -120,    -40,    -25,    -14,    -15,    -69,   -140,   -320,
      

In [95]:
df.sort_values(by='Quantity')

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Rev,day_of_week,Weekday
540422,C581484,23843,"PAPER CRAFT , LITTLE BIRDIE",-80995,2011-12-09 09:27:00,2.08,16446.0,United Kingdom,-168469.60,Friday,Friday
61624,C541433,23166,MEDIUM CERAMIC TOP STORAGE JAR,-74215,2011-01-18 10:17:00,1.04,12346.0,United Kingdom,-77183.60,Tuesday,Tuesday
225529,556690,23005,printing smudges/thrown away,-9600,2011-06-14 10:37:00,0.00,,United Kingdom,-0.00,Tuesday,Tuesday
225530,556691,23005,printing smudges/thrown away,-9600,2011-06-14 10:37:00,0.00,,United Kingdom,-0.00,Tuesday,Tuesday
4287,C536757,84347,ROTATING SILVER ANGELS T-LIGHT HLDR,-9360,2010-12-02 14:23:00,0.03,15838.0,United Kingdom,-280.80,Thursday,Thursday
225528,556687,23003,Printing smudges/thrown away,-9058,2011-06-14 10:36:00,0.00,,United Kingdom,-0.00,Tuesday,Tuesday
115818,546152,72140F,throw away,-5368,2011-03-09 17:25:00,0.00,,United Kingdom,-0.00,Wednesday,Wednesday
431381,573596,79323W,"Unsaleable, destroyed.",-4830,2011-10-31 15:17:00,0.00,,United Kingdom,-0.00,Monday,Monday
341601,566768,16045,,-3667,2011-09-14 17:53:00,0.00,,United Kingdom,-0.00,Wednesday,Wednesday
323458,565304,16259,,-3167,2011-09-02 12:18:00,0.00,,United Kingdom,-0.00,Friday,Friday


In [36]:
df.isna().mean() * 100

InvoiceNo       0.000000
StockCode       0.000000
Description     0.268311
Quantity        0.000000
InvoiceDate     0.000000
UnitPrice       0.000000
CustomerID     24.926694
Country         0.000000
dtype: float64

In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo      541909 non-null object
StockCode      541909 non-null object
Description    540455 non-null object
Quantity       541909 non-null int64
InvoiceDate    541909 non-null datetime64[ns]
UnitPrice      541909 non-null float64
CustomerID     406829 non-null float64
Country        541909 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB


In [39]:
df.InvoiceDate.min()

Timestamp('2010-12-01 08:26:00')

In [40]:
df.InvoiceDate.max()

Timestamp('2011-12-09 12:50:00')

## Decide what I want to drop

In [None]:
#Country = Unspecified 
#CustomerID
#Description

#####Price
#Unit price of 0 (zero)

#####Quantity
#Massive negative values

#####Items
#POSTAGE
#DOTCOM POSTAGE
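
The notes above could translate into a filter along the following lines. This is only a sketch on a tiny made-up frame: the column names come from the data set, but the `clean_transactions` helper, the `min_quantity` cutoff, and the sample rows are assumptions. It also reads "CustomerID" and "Description" as "drop rows where these are missing" and treats every non-positive quantity as a return or write-off, which is stricter than "massive negative values".

```python
import pandas as pd

def clean_transactions(df, min_quantity=1):
    """Hypothetical helper: apply the drop rules listed in the notes."""
    shipping = ["POSTAGE", "DOTCOM POSTAGE"]
    keep = (
        df["CustomerID"].notna()
        & df["Description"].notna()
        & (df["Country"] != "Unspecified")
        & (df["UnitPrice"] > 0)
        & (df["Quantity"] >= min_quantity)      # assumption: drop all returns/write-offs
        & ~df["Description"].isin(shipping)     # shipping fees are not products
    )
    return df.loc[keep].reset_index(drop=True)

# Made-up rows, one per rule, plus one clean row.
sample = pd.DataFrame({
    "Description": ["WHITE METAL LANTERN", "POSTAGE", None,
                    "throw away", "PARTY BUNTING", "CHILLI LIGHTS"],
    "Quantity": [6, 2, 5, -5368, 3, 4],
    "UnitPrice": [3.39, 18.0, 1.0, 0.0, 4.95, 0.0],
    "CustomerID": [17850.0, 12395.0, 13000.0, None, 14000.0, 15000.0],
    "Country": ["United Kingdom", "France", "EIRE",
                "United Kingdom", "Unspecified", "Germany"],
})

cleaned = clean_transactions(sample)
print(cleaned)  # only the WHITE METAL LANTERN row survives
```

Building one boolean mask keeps every rule visible in a single place, so individual rules can be relaxed (for example, by lowering `min_quantity`) without touching the rest.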

In [41]:
df['Rev'] = df['Quantity']*df['UnitPrice']

In [42]:
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Rev
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom,15.3
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,22.0
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34


In [98]:
#Top 10 Countries
df_top_countries_stg = df.groupby(['Country'])['Rev'].agg('sum')
df_top_countries = df_top_countries_stg.sort_values(ascending=False).head(10).to_frame()
df_top_countries

Unnamed: 0_level_0,Rev
Country,Unnamed: 1_level_1
United Kingdom,8187806.0
Netherlands,284661.5
EIRE,263276.8
Germany,221698.2
France,197403.9
Australia,137077.3
Switzerland,56385.35
Spain,54774.58
Belgium,40910.96
Sweden,36595.91


In [78]:
df_top_countries = df_top_countries.reset_index()
df_top_countries


Unnamed: 0,Country,Rev
0,United Kingdom,8187806.0
1,Netherlands,284661.5
2,EIRE,263276.8
3,Germany,221698.2
4,France,197403.9
5,Australia,137077.3
6,Switzerland,56385.35
7,Spain,54774.58
8,Belgium,40910.96
9,Sweden,36595.91


In [83]:
top_country_rev = df_top_countries['Country'].head(3).tolist()
top_country_rev

['United Kingdom', 'Netherlands', 'EIRE']

In [51]:
#Top 10 Description by Revenue - All Countries
df_top_items_rev = df.groupby(['Description'])['Rev'].agg('sum')
df_top_items_rev.sort_values(ascending=False).head(10)

Description
DOTCOM POSTAGE                        206245.48
REGENCY CAKESTAND 3 TIER              164762.19
WHITE HANGING HEART T-LIGHT HOLDER     99668.47
PARTY BUNTING                          98302.98
JUMBO BAG RED RETROSPOT                92356.03
RABBIT NIGHT LIGHT                     66756.59
POSTAGE                                66230.64
PAPER CHAIN KIT 50'S CHRISTMAS         63791.94
ASSORTED COLOUR BIRD ORNAMENT          58959.73
CHILLI LIGHTS                          53768.06
Name: Rev, dtype: float64

In [67]:
#Top 10 Description by Quantity - All Countries
df_top_items_vol = df.groupby(['Description'])['Quantity'].agg('sum')
df_top_items_vol.sort_values(ascending=False).head(10)

Description
WORLD WAR 2 GLIDERS ASSTD DESIGNS     53847
JUMBO BAG RED RETROSPOT               47363
ASSORTED COLOUR BIRD ORNAMENT         36381
POPCORN HOLDER                        36334
PACK OF 72 RETROSPOT CAKE CASES       36039
WHITE HANGING HEART T-LIGHT HOLDER    35317
RABBIT NIGHT LIGHT                    30680
MINI PAINT SET VINTAGE                26437
PACK OF 12 LONDON TISSUES             26315
PACK OF 60 PINK PAISLEY CAKE CASES    24753
Name: Quantity, dtype: int64

In [86]:



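# Top 3 products by quantity for each of the top 3 countries by revenue
for country in top_country_rev:
    by_product = df.loc[df['Country'] == country].groupby('Description')['Quantity'].sum()
    print(by_product.sort_values(ascending=False).head(3))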

In [87]:
df_days_week = df

In [89]:
df_days_week.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 9 columns):
InvoiceNo      541909 non-null object
StockCode      541909 non-null object
Description    540455 non-null object
Quantity       541909 non-null int64
InvoiceDate    541909 non-null datetime64[ns]
UnitPrice      541909 non-null float64
CustomerID     406829 non-null float64
Country        541909 non-null object
Rev            541909 non-null float64
dtypes: datetime64[ns](1), float64(3), int64(1), object(4)
memory usage: 37.2+ MB


In [92]:
df_days_week['Weekday'] = df_days_week['InvoiceDate'].dt.day_name()
df_days_week.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Rev,day_of_week,Weekday
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom,15.3,Wednesday,Wednesday
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34,Wednesday,Wednesday
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,22.0,Wednesday,Wednesday
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34,Wednesday,Wednesday
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34,Wednesday,Wednesday


In [94]:
#Revenue by Day of Week - All Countries
df_day_of_week_rev = df_days_week.groupby(['Weekday'])['Rev'].agg('sum')
df_day_of_week_rev.sort_values(ascending=False)

Weekday
Thursday     2112519.000
Tuesday      1966182.791
Wednesday    1734147.010
Monday       1588609.431
Friday       1540610.811
Sunday        805678.891
Name: Rev, dtype: float64

In [7]:
#data = pd.ExcelFile('online_fashion.xlsx')
data = pd.read_csv('online_fashion_sm.csv')

In [8]:
data.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,1/12/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,1/12/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,1/12/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,1/12/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,1/12/2010 8:26,3.39,17850.0,United Kingdom


In [12]:
data.rename(columns={'Description':'ProductName'}, inplace=True)

## Step 1: Create a data frame that contains the total quantity of each product purchased by each customer.

You will need to group by CustomerID and ProductName and then sum the Quantity field.

In [13]:
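# numeric_only=True sums only the numeric columns and skips text columns such as Country
grouped = data.groupby(['CustomerID', 'ProductName']).sum(numeric_only=True)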
grouped

Unnamed: 0_level_0,Unnamed: 1_level_0,Quantity,UnitPrice
CustomerID,ProductName,Unnamed: 2_level_1,Unnamed: 3_level_1
12395.0,72 SWEETHEART FAIRY CAKE CASES,24,0.55
12395.0,CHARLOTTE BAG SUKI DESIGN,10,0.85
12395.0,JUMBO BAG RED RETROSPOT,10,1.95
12395.0,PACK OF 60 DINOSAUR CAKE CASES,48,0.55
12395.0,PACK OF 60 MUSHROOM CAKE CASES,48,0.55
12395.0,PACK OF 60 PINK PAISLEY CAKE CASES,120,0.42
12395.0,PACK OF 60 SPACEBOY CAKE CASES,120,0.42
12395.0,PACK OF 72 RETROSPOT CAKE CASES,120,0.42
12395.0,PLASTERS IN TIN SPACEBOY,12,1.65
12395.0,POSTAGE,2,18.00


## Step 2: Use the `pivot_table` method to create a product by customer matrix.

The rows of the matrix should represent the products, the columns should represent the customers, and the values should be the quantities of each product purchased by each customer. You will also need to replace nulls with zeros, which you can do with the `fillna` method or with the `fill_value` argument of `pivot_table`.

In [14]:
matrix = grouped.pivot_table(values='Quantity', index='ProductName', columns='CustomerID', aggfunc='sum', fill_value=0)
matrix

CustomerID,12395.0,12427.0,12431.0,12433.0,12471.0,12472.0,12557.0,12567.0,12583.0,12586.0,...,18085.0,18109.0,18118.0,18144.0,18156.0,18168.0,18219.0,18225.0,18229.0,18239.0
ProductName,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
4 PURPLE FLOCK DINNER CANDLES,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
SET 2 TEA TOWELS I LOVE LONDON,0,0,0,0,0,0,0,0,24,0,...,0,0,0,0,0,0,0,0,0,0
10 COLOUR SPACEBOY PEN,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12 COLOURED PARTY BALLOONS,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12 DAISY PEGS IN WOOD BOX,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12 IVORY ROSE PEG PLACE SETTINGS,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12 MESSAGE CARDS WITH ENVELOPES,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
12 PENCIL SMALL TUBE WOODLAND,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
12 PENCILS SMALL TUBE RED RETROSPOT,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
12 PENCILS SMALL TUBE SKULL,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0


## Step 3: Create a customer similarity matrix using `squareform` and `pdist`. For the distance metric, choose "euclidean."

In [15]:
# Transpose the matrix; otherwise pdist returns distances between products, not customers.
# pdist returns a condensed 1-D array of pairwise distances,
# squareform expands it into a square matrix,
# and the result is wrapped in a DataFrame labeled by CustomerID.

dist_matrix = pd.DataFrame(squareform(pdist(matrix.T, metric='euclidean')), index=matrix.columns, columns=matrix.columns)
dist_matrix

CustomerID,12395.0,12427.0,12431.0,12433.0,12471.0,12472.0,12557.0,12567.0,12583.0,12586.0,...,18085.0,18109.0,18118.0,18144.0,18156.0,18168.0,18219.0,18225.0,18229.0,18239.0
CustomerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
12395.0,0.000000,222.667914,224.537302,320.574484,221.097264,255.052936,286.844906,241.588493,246.058936,221.097264,...,222.827287,222.110333,220.472220,235.442137,221.379764,225.931406,222.137345,221.124399,221.774660,225.490576
12427.0,222.667914,0.000000,47.434165,261.581727,26.776856,129.092990,184.697049,102.000000,111.229492,26.776856,...,38.535698,34.146742,35.846897,85.240835,28.809721,53.646994,34.322005,27.000000,31.890437,51.759057
12431.0,224.537302,47.434165,0.000000,263.152047,39.255573,134.242318,186.914419,106.216760,111.130554,39.255573,...,48.052055,44.609416,46.872167,89.944427,40.816663,60.811183,44.743715,39.408121,42.154478,58.881236
12433.0,320.574484,261.581727,263.152047,0.000000,260.222981,284.147849,317.981132,251.811437,277.641856,260.222981,...,261.143639,260.854365,259.961536,272.516055,259.061769,263.433103,261.107258,260.246037,260.798773,263.965907
12471.0,221.097264,26.776856,39.255573,260.222981,0.000000,128.405607,182.767612,99.060588,108.078675,2.828427,...,27.856777,21.377558,25.768197,80.975305,11.532563,46.572524,21.656408,4.472136,17.549929,44.384682
12472.0,255.052936,129.092990,134.242318,284.147849,128.405607,0.000000,223.347263,160.452485,157.216411,128.405607,...,131.118267,130.142230,129.491312,151.779445,128.790528,136.605271,130.219046,128.452326,129.568515,130.422391
12557.0,286.844906,184.697049,186.914419,317.981132,182.767612,223.347263,0.000000,199.521929,212.313448,182.767612,...,184.856701,183.991848,184.553515,199.882465,183.109257,188.586850,184.024455,182.800438,183.586492,188.058502
12567.0,241.588493,102.000000,106.216760,251.811437,99.060588,160.452485,199.521929,0.000000,143.749783,99.060588,...,100.662803,101.301530,102.024507,127.914034,99.569072,109.306907,101.360742,99.121138,99.684502,108.475804
12583.0,246.058936,111.229492,111.130554,277.641856,108.078675,157.216411,212.313448,143.749783,0.000000,108.078675,...,111.575087,110.136279,110.584809,135.018517,108.323589,117.498936,110.190744,108.134176,106.794195,116.803253
12586.0,221.097264,26.776856,39.255573,260.222981,2.828427,128.405607,182.767612,99.060588,108.078675,0.000000,...,27.856777,21.377558,25.768197,80.975305,11.532563,46.572524,21.656408,4.472136,17.549929,44.384682


In [16]:
# The raw distances are hard to interpret, so I convert them to a similarity score
# in (0, 1] with 1 / (1 + distance): the closer to 1, the more similar two customers are.

dist_norm = 1 / (1 + dist_matrix)
dist_norm

CustomerID,12395.0,12427.0,12431.0,12433.0,12471.0,12472.0,12557.0,12567.0,12583.0,12586.0,...,18085.0,18109.0,18118.0,18144.0,18156.0,18168.0,18219.0,18225.0,18229.0,18239.0
CustomerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
12395.0,1.000000,0.004471,0.004434,0.003110,0.004503,0.003905,0.003474,0.004122,0.004048,0.004503,...,0.004468,0.004482,0.004515,0.004229,0.004497,0.004407,0.004482,0.004502,0.004489,0.004415
12427.0,0.004471,1.000000,0.020647,0.003808,0.036001,0.007687,0.005385,0.009709,0.008910,0.036001,...,0.025294,0.028452,0.027139,0.011595,0.033546,0.018299,0.028311,0.035714,0.030404,0.018954
12431.0,0.004434,0.020647,1.000000,0.003786,0.024841,0.007394,0.005322,0.009327,0.008918,0.024841,...,0.020387,0.021925,0.020889,0.010996,0.023914,0.016178,0.021861,0.024748,0.023173,0.016700
12433.0,0.003110,0.003808,0.003786,1.000000,0.003828,0.003507,0.003135,0.003956,0.003589,0.003828,...,0.003815,0.003819,0.003832,0.003656,0.003845,0.003782,0.003815,0.003828,0.003820,0.003774
12471.0,0.004503,0.036001,0.024841,0.003828,1.000000,0.007728,0.005442,0.009994,0.009168,0.261204,...,0.034654,0.044688,0.037358,0.012199,0.079792,0.021021,0.044138,0.182744,0.053909,0.022034
12472.0,0.003905,0.007687,0.007394,0.003507,0.007728,1.000000,0.004457,0.006194,0.006320,0.007728,...,0.007569,0.007625,0.007663,0.006545,0.007705,0.007267,0.007621,0.007725,0.007659,0.007609
12557.0,0.003474,0.005385,0.005322,0.003135,0.005442,0.004457,1.000000,0.004987,0.004688,0.005442,...,0.005380,0.005406,0.005389,0.004978,0.005432,0.005275,0.005405,0.005441,0.005418,0.005289
12567.0,0.004122,0.009709,0.009327,0.003956,0.009994,0.006194,0.004987,1.000000,0.006908,0.009994,...,0.009836,0.009775,0.009706,0.007757,0.009943,0.009066,0.009769,0.009988,0.009932,0.009134
12583.0,0.004048,0.008910,0.008918,0.003589,0.009168,0.006320,0.004688,0.006908,1.000000,0.009168,...,0.008883,0.008998,0.008962,0.007352,0.009147,0.008439,0.008994,0.009163,0.009277,0.008489
12586.0,0.004503,0.036001,0.024841,0.003828,0.261204,0.007728,0.005442,0.009994,0.009168,1.000000,...,0.034654,0.044688,0.037358,0.012199,0.079792,0.021021,0.044138,0.182744,0.053909,0.022034
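
To make the distance-to-similarity conversion easier to follow, here is a small worked example. The three customers (A, B, C) and their quantities are made up; only `pdist`, `squareform`, and the `1 / (1 + distance)` formula come from the cells above.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Toy product-by-customer matrix (made-up quantities).
toy = pd.DataFrame(
    {"A": [3, 0, 0], "B": [0, 4, 0], "C": [3, 0, 0]},
    index=["pen", "bag", "mug"],
)

# pdist works row-wise, so transpose to compare customers.
# The condensed result lists the pairs (A,B), (A,C), (B,C): 5, 0, 5.
condensed = pdist(toy.T.to_numpy(), metric="euclidean")

# squareform expands the condensed vector into a symmetric matrix.
square = pd.DataFrame(squareform(condensed), index=toy.columns, columns=toy.columns)

# Distance 0 maps to similarity 1; larger distances approach 0.
similarity = 1 / (1 + square)
print(similarity.round(3))
```

Here A and C have identical purchase vectors, so their distance is 0 and their similarity is 1, the same score a customer gets with itself. With ties like this, sorting by similarity does not guarantee that the customer itself lands first, which is worth keeping in mind for the `head(6)` and `index[1:]` pattern used in Steps 4 and 5.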


## Step 4: Check your results by generating a list of the top 5 most similar customers for a specific CustomerID.

In [17]:
# head(6): the customer itself (similarity 1.0) plus its 5 most similar customers
Top5_cust200 = dist_norm[17976].sort_values(ascending = False).head(6)
Top5_cust200

CustomerID
17976.0    1.000000
13295.0    0.052101
14679.0    0.052101
13145.0    0.052101
15823.0    0.052101
16995.0    0.052101
Name: 17976.0, dtype: float64

## Step 5: From the data frame you created in Step 1, select the records for the list of similar CustomerIDs you obtained in Step 4.

In [18]:
# Slice the index from position 1 to skip the first entry,
# which is the customer itself

similar = grouped.loc[(Top5_cust200.index[1:],)]
similar

Unnamed: 0_level_0,Unnamed: 1_level_0,Quantity,UnitPrice
CustomerID,ProductName,Unnamed: 2_level_1,Unnamed: 3_level_1
13145.0,VINTAGE RED KITCHEN CABINET,1,295.0
13295.0,COLOUR GLASS. STAR T-LIGHT HOLDER,-1,3.25
14679.0,CLASSICAL ROSE SMALL VASE,-1,2.55
15823.0,Bank Charges,1,15.0
16995.0,ANTIQUE SILVER TEA GLASS ENGRAVED,-1,1.25


## Step 6: Aggregate those customer purchase records by ProductName, sum the Quantity field, and then rank them in descending order by quantity.

This will give you the total quantity of each product purchased by the 5 customers most similar to the one you selected, ordered from most purchased to least.

In [19]:
agg_similar = similar.groupby('ProductName')[['Quantity']].sum()\
                .sort_values(by = 'Quantity', ascending = False)
agg_similar

Unnamed: 0_level_0,Quantity
ProductName,Unnamed: 1_level_1
Bank Charges,1
VINTAGE RED KITCHEN CABINET,1
ANTIQUE SILVER TEA GLASS ENGRAVED,-1
CLASSICAL ROSE SMALL VASE,-1
COLOUR GLASS. STAR T-LIGHT HOLDER,-1


## Step 7: Filter the list for products that the chosen customer has not yet purchased and then recommend the top 5 products with the highest quantities that are left.

- Merge the ranked products data frame with the customer product matrix on the ProductName field.
- Filter for records where the chosen customer has not purchased the product.
- Show the top 5 results.

In [23]:
products = pd.concat([agg_similar, matrix[17976]], axis=1, sort=False)
products.rename(columns = {17976:'Cust_200'}, inplace = True)
products

Unnamed: 0,Quantity,Cust_200
Bank Charges,1.0,0
VINTAGE RED KITCHEN CABINET,1.0,0
ANTIQUE SILVER TEA GLASS ENGRAVED,-1.0,0
CLASSICAL ROSE SMALL VASE,-1.0,0
COLOUR GLASS. STAR T-LIGHT HOLDER,-1.0,0
4 PURPLE FLOCK DINNER CANDLES,,0
SET 2 TEA TOWELS I LOVE LONDON,,1
10 COLOUR SPACEBOY PEN,,0
12 COLOURED PARTY BALLOONS,,0
12 DAISY PEGS IN WOOD BOX,,0


In [24]:
Top5rec = products.query('Quantity > 0 and Cust_200 == 0').head(5)
Top5rec

Unnamed: 0,Quantity,Cust_200
Bank Charges,1.0,0
VINTAGE RED KITCHEN CABINET,1.0,0
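
The bullets for this step ask for a merge on ProductName, while the cells above use `pd.concat`. A merge-based variant might look like the sketch below. The frames `ranked` and `customer_col` are made-up stand-ins for `agg_similar` and one customer's column of `matrix`, and the column name `purchased` is an assumption.

```python
import pandas as pd

# Made-up stand-in for the ranked products of the similar customers.
ranked = pd.DataFrame(
    {"Quantity": [12, 7, 4, -1]},
    index=pd.Index(
        ["LUNCH BAG", "PARTY BUNTING", "CAKE CASES", "SMALL VASE"],
        name="ProductName",
    ),
)

# Made-up stand-in for the chosen customer's column of the product matrix.
customer_col = pd.Series(
    [0, 3, 0, 0, 5],
    index=pd.Index(
        ["LUNCH BAG", "PARTY BUNTING", "CAKE CASES", "SMALL VASE", "TEA TOWEL"],
        name="ProductName",
    ),
    name="purchased",
)

# An inner merge keeps only products bought by the similar customers.
merged = ranked.reset_index().merge(
    customer_col.reset_index(), on="ProductName", how="inner"
)

# Keep products the customer has not bought and that have a positive net quantity.
top5 = (
    merged[(merged["purchased"] == 0) & (merged["Quantity"] > 0)]
    .sort_values("Quantity", ascending=False)
    .head(5)
)
print(top5["ProductName"].tolist())
```

Compared with the concat approach, the inner merge never produces rows with a missing Quantity, so the filter only has to deal with products that the similar customers actually purchased.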


## Step 8: Now that we have generated product recommendations for a single user, put the pieces together and iterate over a list of all CustomerIDs.

- Create an empty dictionary that will hold the recommendations for all customers.
- Create a list of unique CustomerIDs to iterate over.
- Iterate over the customer list performing steps 4 through 7 for each and appending the results of each iteration to the dictionary you created.

In [25]:
recommendations = {}
unique_ID = dist_norm.columns.unique()

In [26]:
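for customer in unique_ID:
    # Step 4: the customer itself plus its 5 most similar customers
    head = dist_norm[customer].sort_values(ascending = False).head(6)
    # Step 5: purchase records of the similar customers (skip the customer itself)
    similar = grouped.loc[(head.index[1:],)]
    # Step 6: total quantity per product, highest first
    agg_similar = similar.groupby('ProductName')[['Quantity']].sum()\
                .sort_values(by = 'Quantity', ascending = False)
    # Step 7: keep products the similar customers bought that this customer has not
    products = pd.concat([agg_similar, matrix[customer]], axis=1, sort=False)
    products.rename(columns = {customer:'customer'}, inplace = True)
    recommendations[customer] = list(products.query('Quantity > 0 and customer == 0').head(5).index)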

In [27]:
recommendations

{12395.0: ['RED CHARLIE+LOLA PERSONAL DOORSIGN',
  'CHARLIE & LOLA WASTEPAPER BIN FLORA',
  '60 TEATIME FAIRY CAKE CASES',
  'SMALL POPCORN HOLDER',
  'PACK OF 72 SKULL CAKE CASES'],
 12427.0: ['Bank Charges', 'VINTAGE RED KITCHEN CABINET'],
 12431.0: ['Bank Charges', 'VINTAGE RED KITCHEN CABINET'],
 12433.0: ['DINOSAUR KEYRINGS ASSORTED',
  'LUNCH BAG WOODLAND',
  'PARTY TIME PENCIL ERASERS',
  'COLOUR GLASS. STAR T-LIGHT HOLDER',
  'RAIN PONCHO RETROSPOT'],
 12471.0: ['Bank Charges', 'VINTAGE RED KITCHEN CABINET'],
 12472.0: ['HOME BUILDING BLOCK WORD',
  'HAND WARMER SCOTTY DOG DESIGN',
  'LUNCH BAG WOODLAND',
  'LUNCH BAG SUKI  DESIGN ',
  'LUNCH BAG RED RETROSPOT'],
 12557.0: ['LUNCH BAG WOODLAND',
  'HOME BUILDING BLOCK WORD',
  'CARD MOTORBIKE SANTA',
  '10 COLOUR SPACEBOY PEN',
  'RED RETROSPOT LUGGAGE TAG'],
 12567.0: ['DISCO BALL CHRISTMAS DECORATION',
  'BLACK LOVE BIRD CANDLE',
  'AGED GLASS SILVER T-LIGHT HOLDER',
  'LUNCH BAG  BLACK SKULL.',
  'HAND WARMER SCOTTY DOG DESI

## Step 9: Store the results in a Pandas data frame. The data frame should have a column for Customer ID and then a column for each of the 5 product recommendations for each customer.

In [28]:
recommendations_df = pd.DataFrame.from_dict(recommendations, orient='index', 
                                columns=['rec1', 'rec2', 'rec3', 'rec4', 'rec5'])
recommendations_df

Unnamed: 0,rec1,rec2,rec3,rec4,rec5
12395.0,RED CHARLIE+LOLA PERSONAL DOORSIGN,CHARLIE & LOLA WASTEPAPER BIN FLORA,60 TEATIME FAIRY CAKE CASES,SMALL POPCORN HOLDER,PACK OF 72 SKULL CAKE CASES
12427.0,Bank Charges,VINTAGE RED KITCHEN CABINET,,,
12431.0,Bank Charges,VINTAGE RED KITCHEN CABINET,,,
12433.0,DINOSAUR KEYRINGS ASSORTED,LUNCH BAG WOODLAND,PARTY TIME PENCIL ERASERS,COLOUR GLASS. STAR T-LIGHT HOLDER,RAIN PONCHO RETROSPOT
12471.0,Bank Charges,VINTAGE RED KITCHEN CABINET,,,
12472.0,HOME BUILDING BLOCK WORD,HAND WARMER SCOTTY DOG DESIGN,LUNCH BAG WOODLAND,LUNCH BAG SUKI DESIGN,LUNCH BAG RED RETROSPOT
12557.0,LUNCH BAG WOODLAND,HOME BUILDING BLOCK WORD,CARD MOTORBIKE SANTA,10 COLOUR SPACEBOY PEN,RED RETROSPOT LUGGAGE TAG
12567.0,DISCO BALL CHRISTMAS DECORATION,BLACK LOVE BIRD CANDLE,AGED GLASS SILVER T-LIGHT HOLDER,LUNCH BAG BLACK SKULL.,HAND WARMER SCOTTY DOG DESIGN
12583.0,SET OF 20 VINTAGE CHRISTMAS NAPKINS,COWBOYS AND INDIANS BIRTHDAY CARD,MONEY BOX BISCUITS DESIGN,RIBBON REEL SOCKS AND MITTENS,WOODEN OWLS LIGHT GARLAND
12586.0,Bank Charges,VINTAGE RED KITCHEN CABINET,,,


## Step 10: Change the distance metric used in Step 3 to something other than euclidean (correlation, cityblock, cosine, jaccard, etc.). Regenerate the recommendations for all customers and note the differences.

In [30]:
data.Quantity.unique()

array([    6,     8,     2,    32,     3,     4,    24,    12,    48,
          18,    20,    36,    80,    64,    10,   120,    96,    23,
           5,     1,    -1,    50,    40,   100,   192,   432,   144,
         288,   -12,   -24,    16,     9,   128,    25,    30,    28,
           7,    56,    72,   200,   600,   480,    -6,    14,    -2,
          11,    33,    13,    -4,    -5,    -7,    -3,    70,   252,
          60,   216,   384,   -10,    27,    15,    22,    19,    17,
          21,    34,    47,   108,    52, -9360,   -38,    75,   270,
          42,   240,    90,   320,  1824,   204,    69,   -36,  -192,
        -144,   160,  2880,  1400,    39,   -48,   -50,    26,  1440,
          31,    82,    78,    97,    98,    35,    57,   -20,   110,
         -22,   -30,   -70,  -130,   -80,  -120,   -40,   -25,   -14,
         -15,   -69,  -140,  -320,    -8,   720,   156,   324,    38,
          37,    49], dtype=int64)
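
The cell above only inspects the quantities again. The metric swap that Step 10 asks for could be sketched as follows, on made-up data. The helpers `similarity_matrix` and `nearest` and the toy customers are assumptions for illustration; only `pdist`, `squareform`, and the `1 / (1 + distance)` conversion come from the earlier steps.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

def similarity_matrix(product_by_customer, metric="euclidean"):
    """Hypothetical helper: customer similarity as 1 / (1 + distance)."""
    values = product_by_customer.T.to_numpy(dtype=float)
    dist = squareform(pdist(values, metric=metric))
    cols = product_by_customer.columns
    return pd.DataFrame(1 / (1 + dist), index=cols, columns=cols)

def nearest(sim, customer):
    """Hypothetical helper: most similar other customer."""
    return sim[customer].drop(customer).idxmax()

# Made-up data: B buys the same mix as A but in 10x the volume,
# C buys a similar volume to A but a different mix.
toy = pd.DataFrame(
    {"A": [1, 1, 0], "B": [10, 10, 0], "C": [1, 0, 1]},
    index=["pen", "bag", "mug"],
)

for metric in ("euclidean", "cosine"):
    sim = similarity_matrix(toy, metric=metric)
    print(metric, "->", nearest(sim, "A"))
```

Euclidean distance is sensitive to purchase volume, so A's nearest neighbor is C. Cosine distance compares only the direction of the purchase vectors, so A's nearest neighbor becomes B. On the real matrix, the same swap would mean passing a different `metric` to `pdist` in Step 3 and rerunning Steps 8 and 9. Customers whose columns are all zeros may need special handling under cosine or correlation, since those metrics divide by the vector norm.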