## __Association Rules and Recommendation Systems__

### Almost every major tech company has applied them in one form or another:
* Amazon uses them to suggest products to customers.
* YouTube uses them to decide which video to play next on autoplay.
* Facebook uses them to recommend pages to like and people to follow.
* For some businesses, such as Netflix and Amazon Prime, success revolves around the potency of their recommendations.
* Netflix even offered a million-dollar prize (launched in 2006 and awarded in 2009) to anyone who could improve its system by 10%.

### __Association Rules: Granularity at the transaction level__

<img src ='A-C.PNG' align ='left' >

__If a customer buys A, then they also buy C__

The IF part of the rule (the {A} above) is known as the antecedent, and the THEN part (the {C} above) is known as
the consequent. The antecedent is the condition and the consequent is the result.

__Three measures express the strength of an association rule:__
* __Support__ 
* __Confidence__
* __Lift__

<img src ='support.PNG' align='left' >

Support measures how frequently an item (or itemset) appears in the dataset: the fraction of all transactions that contain it.

__Assume there are 100 customers: 12 of them bought milk, 8 bought butter, and 6 bought both.__

__Association: bought milk => bought butter__
* support (Milk) = Milk /Total = 12/100 = 0.12
* support (Butter) = Butter /Total = 8/100 = 0.08
* support(Milk → Butter)= 6 /100 = 0.06

<img src = 'confidence.PNG' align='left' >

* confidence (Milk→Butter) = support (Milk→Butter)/ support(Milk) = 0.06/0.12 = 0.5
* Confidence can be misleading: what if butter is popular on its own and appears alongside milk purely by chance?

<img src='lift.PNG' align='left'>

* lift (M→B) = confidence(Milk→Butter) / support(Butter) = 0.5 / 0.08 = 6.25

Or 
* lift (M→B) = support (Milk→Butter) /(support(Milk)* support(Butter)) = 0.06 /(0.12*0.08) = 6.25

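* Lift is the most important of the three measures: a lift above 1 means the items occur together more often than expected if they were independent, a lift of 1 means no association, and a lift below 1 means they occur together less often than expected.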
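
The milk and butter arithmetic above can be reproduced in a few lines of Python. This is only a minimal sketch; the variable names are illustrative and not part of any library.

```python
# Counts from the worked example: 100 customers, 12 bought milk,
# 8 bought butter, 6 bought both.
total, milk, butter, both = 100, 12, 8, 6

support_milk = milk / total        # share of customers who bought milk
support_butter = butter / total    # share of customers who bought butter
support_rule = both / total        # support(Milk -> Butter)

# Confidence: of the customers who bought milk, what share also bought butter?
confidence = both / milk

# Lift: confidence relative to how popular butter is on its own.
lift = confidence / support_butter

print(f"support = {support_rule:.2f}, confidence = {confidence:.2f}, lift = {lift:.2f}")
```

Both lift formulas give the same number because confidence is itself support(Milk→Butter) divided by support(Milk).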

__Let's try apriori on a dataset__

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
%%time
df=pd.read_excel('Online_Retail.xlsx')

Wall time: 1min 31s


In [3]:
df.head()

,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [4]:
df.tail()

,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,2011-12-09 12:50:00,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011-12-09 12:50:00,2.1,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011-12-09 12:50:00,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011-12-09 12:50:00,4.15,12680.0,France
541908,581587,22138,BAKING SET 9 PIECE RETROSPOT,3,2011-12-09 12:50:00,4.95,12680.0,France


In [5]:
df.shape

(541909, 8)

In [22]:
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

__The apriori function expects data in a one-hot encoded pandas DataFrame: one row per transaction, one column per item__
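
To make "one-hot encoded" concrete, here is a small sketch using pandas only. The three invoices and their items are made up for illustration; the column names simply mirror the retail dataset.

```python
import pandas as pd

# Hypothetical invoice lines in "long" format: one row per item on an invoice.
lines = pd.DataFrame({
    "InvoiceNo": ["1001", "1001", "1002", "1003", "1003"],
    "Description": ["TEA", "MUG", "TEA", "MUG", "SPOON"],
    "Quantity": [2, 1, 4, 6, 3],
})

# One row per invoice, one column per item, True where the item was bought.
one_hot = (
    lines.groupby(["InvoiceNo", "Description"])["Quantity"]
    .sum()
    .unstack(fill_value=0)   # items missing from an invoice become 0
    .gt(0)                   # any positive quantity counts as "in the basket"
)
print(one_hot)
```

The cells that follow build the same kind of table step by step for the real data.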

__Data Cleaning__

In [6]:
# Remove leading and trailing whitespace from the product descriptions
df.Description=df.Description.str.strip()

In [7]:
# Since we are looking at transactions, invoice No. has to be there for each row
df.dropna(subset=['InvoiceNo'], axis=0, inplace=True)

In [8]:
df.InvoiceNo = df.InvoiceNo.astype('str')

In [9]:
df[df['InvoiceNo'].str.contains('C')].head()

,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
141,C536379,D,Discount,-1,2010-12-01 09:41:00,27.5,14527.0,United Kingdom
154,C536383,35004C,SET OF 3 COLOURED FLYING DUCKS,-1,2010-12-01 09:49:00,4.65,15311.0,United Kingdom
235,C536391,22556,PLASTERS IN TIN CIRCUS PARADE,-12,2010-12-01 10:24:00,1.65,17548.0,United Kingdom
236,C536391,21984,PACK OF 12 PINK PAISLEY TISSUES,-24,2010-12-01 10:24:00,0.29,17548.0,United Kingdom
237,C536391,21983,PACK OF 12 BLUE PAISLEY TISSUES,-24,2010-12-01 10:24:00,0.29,17548.0,United Kingdom


In [10]:
df[df['InvoiceNo'].str.contains('C')].shape

(9288, 8)

In [11]:
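# Invoice numbers containing 'C' are cancellations (note the negative quantities above), so drop them
df=df[~df['InvoiceNo'].str.contains('C')]
df.head()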

,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [12]:
basket = df[df.Country=='France']

In [13]:
basket.head()

,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
26,536370,22728,ALARM CLOCK BAKELIKE PINK,24,2010-12-01 08:45:00,3.75,12583.0,France
27,536370,22727,ALARM CLOCK BAKELIKE RED,24,2010-12-01 08:45:00,3.75,12583.0,France
28,536370,22726,ALARM CLOCK BAKELIKE GREEN,12,2010-12-01 08:45:00,3.75,12583.0,France
29,536370,21724,PANDA AND BUNNIES STICKER SHEET,12,2010-12-01 08:45:00,0.85,12583.0,France
30,536370,21883,STARS GIFT TAPE,24,2010-12-01 08:45:00,0.65,12583.0,France


In [14]:
basket=basket.groupby(['InvoiceNo', 'Description'])['Quantity'].sum()
basket.head(20)

InvoiceNo  Description                       
536370     ALARM CLOCK BAKELIKE GREEN            12
           ALARM CLOCK BAKELIKE PINK             24
           ALARM CLOCK BAKELIKE RED              24
           CHARLOTTE BAG DOLLY GIRL DESIGN       20
           CIRCUS PARADE LUNCH BOX               24
           INFLATABLE POLITICAL GLOBE            48
           LUNCH BOX I LOVE LONDON               24
           MINI JIGSAW CIRCUS PARADE             24
           MINI JIGSAW SPACEBOY                  24
           MINI PAINT SET VINTAGE                36
           PANDA AND BUNNIES STICKER SHEET       12
           POSTAGE                                3
           RED TOADSTOOL LED NIGHT LIGHT         24
           ROUND SNACK BOXES SET OF4 WOODLAND    24
           SET 2 TEA TOWELS I LOVE LONDON        24
           SET/2 RED RETROSPOT TEA TOWELS        18
           SPACEBOY LUNCH BOX                    24
           STARS GIFT TAPE                       24
           VINTAGE

In [15]:
type(basket.index)

pandas.core.indexes.multi.MultiIndex

In [16]:
basket=basket.unstack()
basket.head()

Description,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 EGG HOUSE PAINTED WOOD,12 MESSAGE CARDS WITH ENVELOPES,12 PENCIL SMALL TUBE WOODLAND,12 PENCILS SMALL TUBE RED RETROSPOT,12 PENCILS SMALL TUBE SKULL,12 PENCILS TALL TUBE POSY,12 PENCILS TALL TUBE RED RETROSPOT,12 PENCILS TALL TUBE WOODLAND,...,WRAP VINTAGE PETALS DESIGN,YELLOW COAT RACK PARIS FASHION,YELLOW GIANT GARDEN THERMOMETER,YELLOW SHARK HELICOPTER,ZINC STAR T-LIGHT HOLDER,ZINC FOLKART SLEIGH BELLS,ZINC HERB GARDEN CONTAINER,ZINC METAL HEART DECORATION,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS SMALL
InvoiceNo
536370,,,,,,,,,,,...,,,,,,,,,,
536852,,,,,,,,,,,...,,,,,,,,,,
536974,,,,,,,,,,,...,,,,,,,,,,
537065,,,,,,,,,,,...,,,,,,,,,,
537463,,,,,,,,,,,...,,,,,,,,,,


In [17]:
basket.fillna(0, inplace=True)
basket.head(10)

Description,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 EGG HOUSE PAINTED WOOD,12 MESSAGE CARDS WITH ENVELOPES,12 PENCIL SMALL TUBE WOODLAND,12 PENCILS SMALL TUBE RED RETROSPOT,12 PENCILS SMALL TUBE SKULL,12 PENCILS TALL TUBE POSY,12 PENCILS TALL TUBE RED RETROSPOT,12 PENCILS TALL TUBE WOODLAND,...,WRAP VINTAGE PETALS DESIGN,YELLOW COAT RACK PARIS FASHION,YELLOW GIANT GARDEN THERMOMETER,YELLOW SHARK HELICOPTER,ZINC STAR T-LIGHT HOLDER,ZINC FOLKART SLEIGH BELLS,ZINC HERB GARDEN CONTAINER,ZINC METAL HEART DECORATION,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS SMALL
InvoiceNo
536370,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536852,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536974,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
537065,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
537463,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
537468,24.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
537693,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
537897,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
537967,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
538008,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [18]:
# POSTAGE is a shipping charge rather than a product, so exclude it from the baskets
basket.drop('POSTAGE', axis=1, inplace=True)
basket.head(6)

Description,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 EGG HOUSE PAINTED WOOD,12 MESSAGE CARDS WITH ENVELOPES,12 PENCIL SMALL TUBE WOODLAND,12 PENCILS SMALL TUBE RED RETROSPOT,12 PENCILS SMALL TUBE SKULL,12 PENCILS TALL TUBE POSY,12 PENCILS TALL TUBE RED RETROSPOT,12 PENCILS TALL TUBE WOODLAND,...,WRAP VINTAGE PETALS DESIGN,YELLOW COAT RACK PARIS FASHION,YELLOW GIANT GARDEN THERMOMETER,YELLOW SHARK HELICOPTER,ZINC STAR T-LIGHT HOLDER,ZINC FOLKART SLEIGH BELLS,ZINC HERB GARDEN CONTAINER,ZINC METAL HEART DECORATION,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS SMALL
InvoiceNo
536370,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536852,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536974,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
537065,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
537463,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
537468,24.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [19]:
# Any positive quantity means the item was part of the invoice
basket=basket>0
basket.head(6)

Description,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 EGG HOUSE PAINTED WOOD,12 MESSAGE CARDS WITH ENVELOPES,12 PENCIL SMALL TUBE WOODLAND,12 PENCILS SMALL TUBE RED RETROSPOT,12 PENCILS SMALL TUBE SKULL,12 PENCILS TALL TUBE POSY,12 PENCILS TALL TUBE RED RETROSPOT,12 PENCILS TALL TUBE WOODLAND,...,WRAP VINTAGE PETALS DESIGN,YELLOW COAT RACK PARIS FASHION,YELLOW GIANT GARDEN THERMOMETER,YELLOW SHARK HELICOPTER,ZINC STAR T-LIGHT HOLDER,ZINC FOLKART SLEIGH BELLS,ZINC HERB GARDEN CONTAINER,ZINC METAL HEART DECORATION,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS SMALL
InvoiceNo
536370,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
536852,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
536974,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
537065,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
537463,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
537468,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [20]:
# Convert True/False to 1/0
basket=basket*1
basket.head(6)

Description,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 EGG HOUSE PAINTED WOOD,12 MESSAGE CARDS WITH ENVELOPES,12 PENCIL SMALL TUBE WOODLAND,12 PENCILS SMALL TUBE RED RETROSPOT,12 PENCILS SMALL TUBE SKULL,12 PENCILS TALL TUBE POSY,12 PENCILS TALL TUBE RED RETROSPOT,12 PENCILS TALL TUBE WOODLAND,...,WRAP VINTAGE PETALS DESIGN,YELLOW COAT RACK PARIS FASHION,YELLOW GIANT GARDEN THERMOMETER,YELLOW SHARK HELICOPTER,ZINC STAR T-LIGHT HOLDER,ZINC FOLKART SLEIGH BELLS,ZINC HERB GARDEN CONTAINER,ZINC METAL HEART DECORATION,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS SMALL
InvoiceNo
536370,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536852,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536974,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
537065,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
537463,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
537468,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [35]:
# Support threshold: keep only itemsets that appear in at least 7% of invoices
frequent_items = apriori(basket, min_support = 0.07, use_colnames=True)

In [37]:
frequent_items.tail()

,support,itemsets
46,0.104592,"(PLASTERS IN TIN WOODLAND ANIMALS, PLASTERS IN..."
47,0.102041,"(SET/20 RED RETROSPOT PAPER NAPKINS, SET/6 RED..."
48,0.102041,"(SET/20 RED RETROSPOT PAPER NAPKINS, SET/6 RED..."
49,0.122449,"(SET/6 RED SPOTTY PAPER CUPS, SET/6 RED SPOTTY..."
50,0.0994898,"(SET/20 RED RETROSPOT PAPER NAPKINS, SET/6 RED..."
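
What `min_support` does is easier to see on a toy example. The following is a minimal pure-Python sketch of a level-wise frequent-itemset search in the spirit of Apriori; it is not mlxtend's implementation, and the function name `frequent_itemsets` and the four toy baskets are assumptions made for illustration.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset.
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items.
    items = {item for b in baskets for item in b}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}

    result = {}
    k = 1
    while current:
        for s in current:
            result[s] = support(s)
        k += 1
        # Candidates of size k are unions of frequent itemsets of size k - 1.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Apriori pruning: every (k - 1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return result

toy_baskets = [
    {"cups", "plates"},
    {"cups", "plates", "napkins"},
    {"cups"},
    {"napkins"},
]
found = frequent_itemsets(toy_baskets, min_support=0.5)
for itemset, sup in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

Raising `min_support` shrinks the result; lowering it lets rarer combinations through at the cost of more candidates to check.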


In [38]:
# Keep only rules whose lift is at least 3
rules=association_rules(frequent_items, metric='lift', min_threshold=3)
rules.head()

,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(PLASTERS IN TIN CIRCUS PARADE),(PLASTERS IN TIN SPACEBOY),0.168367,0.137755,0.089286,0.530303,3.849607,0.066092,1.835747
1,(PLASTERS IN TIN SPACEBOY),(PLASTERS IN TIN CIRCUS PARADE),0.137755,0.168367,0.089286,0.648148,3.849607,0.066092,2.363588
2,(ALARM CLOCK BAKELIKE PINK),(ALARM CLOCK BAKELIKE RED),0.102041,0.094388,0.07398,0.725,7.681081,0.064348,3.293135
3,(ALARM CLOCK BAKELIKE RED),(ALARM CLOCK BAKELIKE PINK),0.094388,0.102041,0.07398,0.783784,7.681081,0.064348,4.153061
4,(DOLLY GIRL LUNCH BOX),(SPACEBOY LUNCH BOX),0.09949,0.125,0.071429,0.717949,5.74359,0.058992,3.102273
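
As a sanity check, the metric columns in the first row of this table can be recomputed from its three support values. This is a sketch with the numbers copied by hand from rule 0, so small differences from the table are rounding.

```python
# Values copied from rule 0 in the table above:
# (PLASTERS IN TIN CIRCUS PARADE) -> (PLASTERS IN TIN SPACEBOY)
antecedent_support = 0.168367
consequent_support = 0.137755
rule_support = 0.089286

# Same definitions as in the milk and butter example.
confidence = rule_support / antecedent_support
lift = confidence / consequent_support

# Leverage: observed joint support minus the support expected under independence.
leverage = rule_support - antecedent_support * consequent_support

# Conviction: how much more often the rule would be wrong if the two sides were independent.
conviction = (1 - consequent_support) / (1 - confidence)

print(round(confidence, 4), round(lift, 4), round(leverage, 4), round(conviction, 4))
```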


In [40]:
rules.sort_values(by='confidence', ascending=False).head(20)

,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
13,"(SET/20 RED RETROSPOT PAPER NAPKINS, SET/6 RED...",(SET/6 RED SPOTTY PAPER CUPS),0.102041,0.137755,0.09949,0.975,7.077778,0.085433,34.489796
12,"(SET/20 RED RETROSPOT PAPER NAPKINS, SET/6 RED...",(SET/6 RED SPOTTY PAPER PLATES),0.102041,0.127551,0.09949,0.975,7.644,0.086474,34.897959
11,(SET/6 RED SPOTTY PAPER PLATES),(SET/6 RED SPOTTY PAPER CUPS),0.127551,0.137755,0.122449,0.96,6.968889,0.104878,21.556122
10,(SET/6 RED SPOTTY PAPER CUPS),(SET/6 RED SPOTTY PAPER PLATES),0.137755,0.127551,0.122449,0.888889,6.968889,0.104878,7.852041
24,(ALARM CLOCK BAKELIKE RED),(ALARM CLOCK BAKELIKE GREEN),0.094388,0.096939,0.079082,0.837838,8.642959,0.069932,5.568878
25,(ALARM CLOCK BAKELIKE GREEN),(ALARM CLOCK BAKELIKE RED),0.096939,0.094388,0.079082,0.815789,8.642959,0.069932,4.916181
14,"(SET/6 RED SPOTTY PAPER CUPS, SET/6 RED SPOTTY...",(SET/20 RED RETROSPOT PAPER NAPKINS),0.122449,0.132653,0.09949,0.8125,6.125,0.083247,4.62585
21,(SET/6 RED SPOTTY PAPER PLATES),(SET/20 RED RETROSPOT PAPER NAPKINS),0.127551,0.132653,0.102041,0.8,6.030769,0.085121,4.336735
3,(ALARM CLOCK BAKELIKE RED),(ALARM CLOCK BAKELIKE PINK),0.094388,0.102041,0.07398,0.783784,7.681081,0.064348,4.153061
17,(SET/6 RED SPOTTY PAPER PLATES),"(SET/20 RED RETROSPOT PAPER NAPKINS, SET/6 RED...",0.127551,0.102041,0.09949,0.78,7.644,0.086474,4.081633


__Let's try another country__

In [41]:
basket2=df[df.Country=='Australia'].groupby(['InvoiceNo', 'Description'])['Quantity'].sum().unstack().fillna(0)
basket2=basket2>0
basket2.head()

Description,10 COLOUR SPACEBOY PEN,12 PENCIL SMALL TUBE WOODLAND,12 PENCILS TALL TUBE POSY,12 PENCILS TALL TUBE RED RETROSPOT,16 PIECE CUTLERY SET PANTRY DESIGN,20 DOLLY PEGS RETROSPOT,3 HOOK HANGER MAGIC GARDEN,3 STRIPEY MICE FELTCRAFT,3 TIER CAKE TIN GREEN AND CREAM,3 TIER CAKE TIN RED AND CREAM,...,WRAP DOILEY DESIGN,WRAP DOLLY GIRL,WRAP ENGLISH ROSE,WRAP I LOVE LONDON,WRAP POPPIES DESIGN,WRAP RED APPLES,WRAP RED VINTAGE DOILY,WRAP VINTAGE LEAF DESIGN,WRAP WEDDING DAY,YELLOW GIANT GARDEN THERMOMETER
InvoiceNo
536389,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
537676,False,False,False,False,False,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
539419,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
540267,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
540280,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [43]:
freq_items2=apriori(basket2, min_support=0.07, use_colnames=True)
# With no extra arguments, association_rules filters on confidence >= 0.8 (the library defaults)
rules2=association_rules(freq_items2)
rules2.sort_values(by=['lift'], ascending=False).head()

,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
305,"(ROSES REGENCY TEACUP AND SAUCER, CIRCUS PARAD...","(REGENCY CAKESTAND 3 TIER, HOMEMADE JAM SCENTE...",0.070175,0.070175,0.070175,1.0,14.25,0.065251,inf
128,(SET OF 12 MINI LOAF BAKING CASES),"(SET OF 6 TEA TIME BAKING CASES, SET OF 12 FAI...",0.070175,0.070175,0.070175,1.0,14.25,0.065251,inf
120,"(SET OF 6 TEA TIME BAKING CASES, SET OF 12 FAI...","(SET OF 12 MINI LOAF BAKING CASES, SET OF 6 SN...",0.070175,0.070175,0.070175,1.0,14.25,0.065251,inf
121,"(SET OF 6 TEA TIME BAKING CASES, SET OF 12 MIN...","(SET OF 12 FAIRY CAKE BAKING CASES, SET OF 6 S...",0.070175,0.070175,0.070175,1.0,14.25,0.065251,inf
122,"(SET OF 6 TEA TIME BAKING CASES, SET OF 6 SNAC...","(SET OF 12 FAIRY CAKE BAKING CASES, SET OF 12 ...",0.070175,0.070175,0.070175,1.0,14.25,0.065251,inf


In [42]:
association_rules?

In [44]:
df.Country.unique()

array(['United Kingdom', 'France', 'Australia', 'Netherlands', 'Germany',
       'Norway', 'EIRE', 'Switzerland', 'Spain', 'Poland', 'Portugal',
       'Italy', 'Belgium', 'Lithuania', 'Japan', 'Iceland',
       'Channel Islands', 'Denmark', 'Cyprus', 'Sweden', 'Finland',
       'Austria', 'Bahrain', 'Israel', 'Greece', 'Hong Kong', 'Singapore',
       'Lebanon', 'United Arab Emirates', 'Saudi Arabia',
       'Czech Republic', 'Canada', 'Unspecified', 'Brazil', 'USA',
       'European Community', 'Malta', 'RSA'], dtype=object)

### __Recommendation Systems: Granularity at the user level__

__Data Cleaning__

In [45]:
df=pd.read_csv('Recommend.csv')
df.head()

,196,242,3,881250949
0,186,302,3,891717742
1,22,377,1,878887116
2,244,51,2,880606923
3,166,346,1,886397596
4,298,474,4,884182806


In [46]:
# The file has no header row, so stop pandas from using the first record as column names
df=pd.read_csv('Recommend.csv', header=None)
df.head()

,0,1,2,3
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [47]:
df.columns=['user_id', 'movie_id', 'rating', 'timestamp']
df = df.drop('timestamp', axis=1)
df.head()

,user_id,movie_id,rating
0,196,242,3
1,186,302,3
2,22,377,1
3,244,51,2
4,166,346,1


__Each (user_id, movie_id) combination is unique__
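
This matters because `DataFrame.pivot`, used further down, raises an error when an index/column pair occurs more than once. A small sketch of how the claim could be checked; the helper name `has_unique_pairs` and the toy frames are made up for illustration.

```python
import pandas as pd

def has_unique_pairs(ratings, keys=("user_id", "movie_id")):
    """True if no (user_id, movie_id) pair appears more than once."""
    return not ratings.duplicated(subset=list(keys)).any()

# Toy data: the second frame rates movie 10 twice for user 1.
clean = pd.DataFrame({"user_id": [1, 1, 2], "movie_id": [10, 20, 10], "rating": [5, 3, 4]})
dirty = pd.DataFrame({"user_id": [1, 1, 2], "movie_id": [10, 10, 10], "rating": [5, 3, 4]})

print(has_unique_pairs(clean), has_unique_pairs(dirty))
```

On the real data the same call would be `has_unique_pairs(df)`.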

In [48]:
df.shape

(100000, 3)

In [51]:
df.movie_id.nunique()

1682

__The predicted rating of an item i for a user u1 is the similarity-weighted average of the ratings that the other users u
gave to item i.__

### P(i,u1) = Sum(Similarity(u, u1) * Rating(u,i) ) / Sum(Similarity(u, u1))

You may also restrict the sum to the few most similar users instead of using all of them.
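
For a single item, the formula above can be sketched with NumPy as follows. The function name `predict_rating`, its `top_k` parameter and the toy numbers are illustrative assumptions, not part of the notebook's code.

```python
import numpy as np

def predict_rating(similarities, ratings, top_k=None):
    """Similarity-weighted average of other users' ratings for one item.

    similarities: similarity of each other user u to the target user u1
    ratings:      the rating each of those users gave to item i
    top_k:        if given, use only the top_k most similar users
    """
    sims = np.asarray(similarities, dtype=float)
    r = np.asarray(ratings, dtype=float)
    if top_k is not None:
        keep = np.argsort(sims)[-top_k:]   # indices of the largest similarities
        sims, r = sims[keep], r[keep]
    return float(sims @ r / sims.sum())

# Three neighbours with similarities 0.9, 0.5, 0.1 who rated the item 5, 3, 1.
pred_all = predict_rating([0.9, 0.5, 0.1], [5, 3, 1])
pred_top2 = predict_rating([0.9, 0.5, 0.1], [5, 3, 1], top_k=2)
print(round(pred_all, 3), round(pred_top2, 3))
```

Dropping the least similar neighbour pulls the prediction towards the ratings of the users who resemble the target most.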

In [52]:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(df, test_size=0.25, random_state=21)

In [53]:
test_data.user_id.nunique()

943

In [54]:
train_data.user_id.nunique()

943

In [55]:
train_data.head()

,user_id,movie_id,rating
53349,804,204,4
30857,85,610,3
41671,662,291,2
44288,28,288,5
81,299,229,3


In [56]:
train_data_df = train_data.pivot(index='user_id', columns='movie_id', values='rating').fillna(0)
train_data_df.head()

movie_id,1,2,3,4,5,6,7,8,9,10,...,1671,1672,1673,1674,1675,1676,1678,1679,1680,1681
user_id
1,5.0,3.0,0.0,0.0,3.0,5.0,4.0,1.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [57]:
test_data_df = test_data.pivot(index='user_id', columns='movie_id', values='rating').fillna(0)

### P(i,u1) = Sum(Similarity(u, u1) * Rating(u,i) ) / Sum(Similarity(u, u1))

<img src="Cosine_similarity.PNG" align ='left'>

__Calculating similarities with cosine similarity.__
* _The closer the value is to 1, the higher the similarity_
* _We may also use correlation or Euclidean distance, but they tend to be less effective here_
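
Cosine similarity is the dot product of two rating vectors divided by the product of their lengths. A minimal NumPy sketch, with a hypothetical helper name and made-up vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same_taste = cosine_similarity([1, 2, 3], [2, 4, 6])    # same direction
no_overlap = cosine_similarity([5, 0, 0], [0, 4, 0])    # no movie in common
partial = cosine_similarity([5, 3, 0, 1], [4, 0, 0, 1])

print(round(same_taste, 3), round(no_overlap, 3), round(partial, 3))
```

Because unrated movies are stored as 0, two users with no rated movie in common get a similarity of exactly 0.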

In [58]:
from sklearn.metrics import pairwise_distances

### USER-Based Collaborative Filtering

In [59]:
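# Note: with metric='cosine', pairwise_distances returns cosine *distance* (1 - cosine similarity),
# so identical users score 0 and the diagonal of the output is all zeros
user_similarity = pairwise_distances(train_data_df, metric='cosine')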
print(train_data_df.shape)
print(user_similarity.shape)
user_similarity

(943, 1646)
(943, 943)


array([[0.        , 0.90614474, 0.98842298, ..., 0.87000189, 0.83068025,
        0.66785476],
       [0.90614474, 0.        , 0.92819162, ..., 0.85285856, 0.84996655,
        0.88508919],
       [0.98842298, 0.92819162, 0.        , ..., 0.90053882, 0.88911366,
        0.9640395 ],
       ...,
       [0.87000189, 0.85285856, 0.90053882, ..., 0.        , 0.91746551,
        0.93052454],
       [0.83068025, 0.84996655, 0.88911366, ..., 0.91746551, 0.        ,
        0.83831266],
       [0.66785476, 0.88508919, 0.9640395 , ..., 0.93052454, 0.83831266,
        0.        ]])

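_This generates a square user-by-user matrix, laid out like a correlation matrix. Because the values are cosine distances, 0 means identical rating patterns and values near 1 mean dissimilar ones; subtract the matrix from 1 to get similarities._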
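
The relationship between cosine distance and cosine similarity can be shown on a toy rating matrix. This sketch uses NumPy only; the helper `cosine_similarity_matrix` and the three toy users are assumptions for illustration, and on the real data `1 - user_similarity` would play the role of `sim`.

```python
import numpy as np

def cosine_similarity_matrix(ratings):
    """Pairwise cosine similarity between the rows (users) of a rating matrix."""
    X = np.asarray(ratings, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0            # users with no ratings: avoid dividing by zero
    unit = X / norms                   # scale every row to length 1
    return unit @ unit.T

toy_ratings = np.array([
    [5, 3, 0],
    [4, 0, 1],
    [0, 0, 2],
])
sim = cosine_similarity_matrix(toy_ratings)
dist = 1.0 - sim                       # cosine distance, the quantity shown in the array above

print(np.round(sim, 3))
print(np.round(dist, 3))
```

The similarity matrix has ones on its diagonal, while the distance matrix has zeros there, which matches the output above.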

In [60]:
user_sim = pd.DataFrame(user_similarity, columns=list(train_data_df.index), index=list(train_data_df.index))
user_sim.head()

,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
1,0.0,0.906145,0.988423,0.989132,0.69918,0.624405,0.660751,0.787718,0.925734,0.689861,...,0.708121,0.908158,0.841674,0.895235,0.865044,0.942773,0.775506,0.870002,0.83068,0.667855
2,0.906145,0.0,0.928192,0.818374,0.95718,0.843768,0.923104,0.938122,0.797962,0.898419,...,0.891176,0.791294,0.728929,0.631232,0.739572,0.783317,0.788201,0.852859,0.849967,0.885089
3,0.988423,0.928192,0.0,0.683579,0.969849,0.949311,0.945124,0.929072,0.977239,0.941886,...,0.972136,1.0,0.838853,0.962186,0.928397,0.979925,0.847695,0.900539,0.889114,0.96404
4,0.989132,0.818374,0.683579,0.0,0.958961,0.954442,0.93475,0.838304,0.879813,0.967478,...,0.965125,1.0,0.853689,0.787019,0.934425,0.96231,0.806393,0.834016,0.87059,0.942131
5,0.69918,0.95718,0.969849,0.958961,0.0,0.836198,0.750258,0.798797,0.938922,0.863849,...,0.766969,0.949577,0.907845,0.913412,0.844415,0.988149,0.810978,0.866553,0.86972,0.724985


In [62]:
# Largest cosine distances from user 7, i.e. the users least similar to user 7
user_sim.loc[:,7].sort_values(ascending=False).head(10)

726    1.000000
88     1.000000
656    1.000000
729    0.997355
531    0.996859
720    0.993523
40     0.993347
258    0.993303
317    0.992923
34     0.992382
Name: 7, dtype: float64

In [63]:
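# Row mean over all movies; unrated movies were filled with 0, so this is far below the user's average given rating
mean_user_rating = train_data_df.mean(axis=1)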
mean_user_rating.head()

user_id
1    0.452005
2    0.105711
3    0.066829
4    0.053463
5    0.213852
dtype: float64
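
A mean of 0.45 on a 1 to 5 scale shows how strongly the zero-filled cells pull the average down. If the average over rated movies only is wanted instead, one option is to mask the zeros first. A small sketch on made-up data:

```python
import pandas as pd

# Toy user-by-movie matrix where 0 stands for "not rated".
toy = pd.DataFrame(
    [[5, 3, 0, 0],
     [0, 0, 4, 0]],
    index=[1, 2],
    columns=[10, 20, 30, 40],
    dtype=float,
)

mean_all = toy.mean(axis=1)                   # zeros count as ratings
mean_rated = toy.where(toy > 0).mean(axis=1)  # zeros become NaN and are skipped

print(mean_all.tolist(), mean_rated.tolist())
```

On the notebook's data the equivalent would be `train_data_df.where(train_data_df > 0).mean(axis=1)`; the cells below keep the original definition.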

In [64]:
train_data_df.head()

movie_id,1,2,3,4,5,6,7,8,9,10,...,1671,1672,1673,1674,1675,1676,1678,1679,1680,1681
user_id
1,5.0,3.0,0.0,0.0,3.0,5.0,4.0,1.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [65]:
# Subtract each user's mean so that ratings become deviations from that user's baseline
user_ratings_diff = train_data_df.sub(mean_user_rating, axis=0)
user_ratings_diff.head()

movie_id,1,2,3,4,5,6,7,8,9,10,...,1671,1672,1673,1674,1675,1676,1678,1679,1680,1681
user_id
1,4.547995,2.547995,-0.452005,-0.452005,2.547995,4.547995,3.547995,0.547995,-0.452005,2.547995,...,-0.452005,-0.452005,-0.452005,-0.452005,-0.452005,-0.452005,-0.452005,-0.452005,-0.452005,-0.452005
2,-0.105711,-0.105711,-0.105711,-0.105711,-0.105711,-0.105711,-0.105711,-0.105711,-0.105711,-0.105711,...,-0.105711,-0.105711,-0.105711,-0.105711,-0.105711,-0.105711,-0.105711,-0.105711,-0.105711,-0.105711
3,-0.066829,-0.066829,-0.066829,-0.066829,-0.066829,-0.066829,-0.066829,-0.066829,-0.066829,-0.066829,...,-0.066829,-0.066829,-0.066829,-0.066829,-0.066829,-0.066829,-0.066829,-0.066829,-0.066829,-0.066829
4,-0.053463,-0.053463,-0.053463,-0.053463,-0.053463,-0.053463,-0.053463,-0.053463,-0.053463,-0.053463,...,-0.053463,-0.053463,-0.053463,-0.053463,-0.053463,-0.053463,-0.053463,-0.053463,-0.053463,-0.053463
5,3.786148,2.786148,-0.213852,-0.213852,-0.213852,-0.213852,-0.213852,-0.213852,-0.213852,-0.213852,...,-0.213852,-0.213852,-0.213852,-0.213852,-0.213852,-0.213852,-0.213852,-0.213852,-0.213852,-0.213852


In [66]:
user_sim.head()

,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
1,0.0,0.906145,0.988423,0.989132,0.69918,0.624405,0.660751,0.787718,0.925734,0.689861,...,0.708121,0.908158,0.841674,0.895235,0.865044,0.942773,0.775506,0.870002,0.83068,0.667855
2,0.906145,0.0,0.928192,0.818374,0.95718,0.843768,0.923104,0.938122,0.797962,0.898419,...,0.891176,0.791294,0.728929,0.631232,0.739572,0.783317,0.788201,0.852859,0.849967,0.885089
3,0.988423,0.928192,0.0,0.683579,0.969849,0.949311,0.945124,0.929072,0.977239,0.941886,...,0.972136,1.0,0.838853,0.962186,0.928397,0.979925,0.847695,0.900539,0.889114,0.96404
4,0.989132,0.818374,0.683579,0.0,0.958961,0.954442,0.93475,0.838304,0.879813,0.967478,...,0.965125,1.0,0.853689,0.787019,0.934425,0.96231,0.806393,0.834016,0.87059,0.942131
5,0.69918,0.95718,0.969849,0.958961,0.0,0.836198,0.750258,0.798797,0.938922,0.863849,...,0.766969,0.949577,0.907845,0.913412,0.844415,0.988149,0.810978,0.866553,0.86972,0.724985


In [68]:
# Denominator of the formula: the sum of the weights for each user
sum_sim_user = user_sim.sum(axis=1)
sum_sim_user.head()

user_ID
1    769.026697
2    804.037019
3    846.781917
4    843.064737
5    820.063717
dtype: float64

### P(i,u1) = Sum(Similarity(u, u1) * Rating(u,i) ) / Sum(Similarity(u, u1))

In [67]:
# Numerator of the formula: weight every user's mean-centered ratings by user_sim and sum them
user_P = user_sim.dot(user_ratings_diff)
user_P.index.name= 'user_ID'
user_P.columns.name= 'movie_ID'
user_P.head()

movie_ID,1,2,3,4,5,6,7,8,9,10,...,1671,1672,1673,1674,1675,1676,1678,1679,1680,1681
user_ID
1,884.395018,84.097044,26.246321,255.758954,25.402974,-69.379499,732.03551,343.042327,523.420369,66.251431,...,-116.925618,-115.061553,-115.467351,-115.038007,-115.524005,-116.318757,-116.92709,-114.964748,-115.945919,-116.120752
2,1007.201574,133.526704,26.938692,340.854483,43.695659,-80.914369,841.790404,427.733176,558.561379,74.351055,...,-138.548919,-135.873055,-136.780137,-135.483361,-136.888276,-137.703382,-138.520188,-136.893379,-137.706783,-136.692921
3,1090.077223,136.83636,32.08887,357.140556,44.824756,-82.389955,902.192808,453.000688,619.169418,86.217837,...,-146.405076,-143.517019,-144.237341,-143.11896,-144.600466,-145.431852,-146.485946,-145.26859,-145.877268,-144.227788
4,1070.34588,134.122299,30.359694,352.117395,42.588752,-82.280114,886.848908,450.309975,613.474794,86.451782,...,-145.719453,-142.854376,-143.682143,-142.61827,-143.85602,-144.717121,-145.705511,-144.237888,-144.971699,-143.667105
5,949.23784,92.262742,27.364995,287.970549,31.902585,-70.669083,803.396631,383.859769,580.489957,80.624214,...,-129.181427,-126.826404,-127.65056,-126.821395,-127.52862,-128.412889,-129.182442,-127.184471,-128.183457,-128.094608


In [69]:
# Normalise each user's row by that user's summed weights
user_P = user_P.div(sum_sim_user, axis=0)
user_P.head()

user_ID \ movie_ID,1,2,3,4,5,6,7,8,9,10,...,1671,1672,1673,1674,1675,1676,1678,1679,1680,1681
1,1.150019,0.109355,0.034129,0.332575,0.033033,-0.090217,0.951899,0.446073,0.680627,0.08615,...,-0.152044,-0.14962,-0.150147,-0.149589,-0.150221,-0.151255,-0.152046,-0.149494,-0.15077,-0.150997
2,1.252681,0.16607,0.033504,0.423929,0.054345,-0.100635,1.046955,0.531982,0.694696,0.092472,...,-0.172317,-0.168989,-0.170117,-0.168504,-0.170251,-0.171265,-0.172281,-0.170258,-0.171269,-0.170008
3,1.287318,0.161596,0.037895,0.421762,0.052935,-0.097298,1.065437,0.534967,0.731203,0.101818,...,-0.172896,-0.169485,-0.170336,-0.169015,-0.170765,-0.171747,-0.172991,-0.171554,-0.172273,-0.170325
4,1.269589,0.159089,0.036011,0.417664,0.050517,-0.097596,1.051935,0.534135,0.727672,0.102545,...,-0.172845,-0.169447,-0.170428,-0.169166,-0.170635,-0.171656,-0.172828,-0.171088,-0.171958,-0.170411
5,1.157517,0.112507,0.033369,0.351156,0.038903,-0.086175,0.979676,0.468085,0.70786,0.098315,...,-0.157526,-0.154654,-0.155659,-0.154648,-0.155511,-0.156589,-0.157527,-0.155091,-0.156309,-0.156201


In [70]:
# Add each user's mean rating back to undo the mean-centering
user_P = user_P.add(mean_user_rating, axis=0)
user_P.head()

user_ID \ movie_ID,1,2,3,4,5,6,7,8,9,10,...,1671,1672,1673,1674,1675,1676,1678,1679,1680,1681
1,1.602023,0.56136,0.486134,0.78458,0.485037,0.361788,1.403904,0.898078,1.132632,0.538155,...,0.299961,0.302385,0.301857,0.302416,0.301784,0.30075,0.299959,0.302511,0.301235,0.301008
2,1.358391,0.271781,0.139215,0.52964,0.160056,0.005076,1.152666,0.637693,0.800407,0.198183,...,-0.066606,-0.063278,-0.064406,-0.062793,-0.06454,-0.065554,-0.06657,-0.064547,-0.065558,-0.064297
3,1.354146,0.228424,0.104724,0.488591,0.119764,-0.030469,1.132266,0.601796,0.798032,0.168647,...,-0.106067,-0.102657,-0.103507,-0.102186,-0.103936,-0.104918,-0.106163,-0.104725,-0.105444,-0.103496
4,1.323052,0.212552,0.089474,0.471126,0.10398,-0.044133,1.105397,0.587597,0.781135,0.156008,...,-0.119382,-0.115984,-0.116965,-0.115704,-0.117172,-0.118193,-0.119365,-0.117625,-0.118495,-0.116948
5,1.371369,0.326359,0.247221,0.565008,0.252754,0.127677,1.193528,0.681937,0.921711,0.312166,...,0.056326,0.059197,0.058192,0.059204,0.058341,0.057263,0.056324,0.058761,0.057543,0.057651


In [71]:
user_P.shape

(943, 1646)

In [72]:
# Convert both tables to NumPy arrays so they can be indexed by position
test_data = test_data_df.values
user_pred = user_P.values

In [73]:
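from sklearn.metrics import mean_squared_error
# Score only the entries that were actually rated in the test set (non-zero)
test = test_data[test_data.nonzero()]
pred = user_pred[test_data.nonzero()]
# RMSE between the true and the predicted ratings
np.sqrt(mean_squared_error(test, pred))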

3.2640451403486606
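The evaluation step above can be illustrated on a toy pair of matrices. This is a minimal sketch with made-up arrays (`toy_test` and `toy_pred` are hypothetical). It uses plain NumPy in place of `mean_squared_error`, which should give the same number.

```python
import numpy as np

# Hypothetical test ratings: 0 marks "no rating in the test set"
toy_test = np.array([[4.0, 0.0],
                     [0.0, 2.0]])
# Hypothetical predictions for every (user, movie) cell
toy_pred = np.array([[3.0, 1.5],
                     [2.5, 4.0]])

# Positions of the cells that hold a real test rating
mask = toy_test.nonzero()

# RMSE over the rated cells only: errors are (4 - 3) and (2 - 4)
toy_rmse = np.sqrt(np.mean((toy_test[mask] - toy_pred[mask]) ** 2))
print(toy_rmse)
```

Only the two cells where `toy_test` is non-zero contribute, so unrated entries never enter the error. On the real data, an RMSE above 3 is large for ratings that top out at 5.0. The predictions in `user_P.head()` sit mostly between 0 and 1.6, and one plausible reason is that unrated entries are stored as 0 and pull the means down.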

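### Item-Based Collaborative Filtering

The movie-to-movie weights below are computed from the ratings themselves rather than from movie attributes such as genre, so this is item-based collaborative filtering rather than content-based filtering.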

In [74]:
train_data_df.T.head()

movie_ID \ user_id,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
1,5.0,0.0,0.0,0.0,4.0,4.0,0.0,0.0,0.0,0.0,...,2.0,3.0,4.0,0.0,4.0,0.0,0.0,5.0,0.0,0.0
2,3.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,...,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [75]:
# Transpose so that rows are movies and columns are users
train_data_df_T = train_data_df.T

In [76]:
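# Note: pairwise_distances returns cosine *distance* (1 - cosine similarity), not similarity
movie_similarity = pairwise_distances(train_data_df_T, metric='cosine')
mov_sim=pd.DataFrame(movie_similarity, columns=list(train_data_df_T.index), index=list(train_data_df_T.index))
# Per-movie mean over all users (unrated entries count as 0)
mean_movie_rating = train_data_df_T.mean(axis=1)
movie_ratings_diff = train_data_df_T.sub(mean_movie_rating, axis=0)
# Denominator of the prediction formula: each movie's summed weights
sum_sim_movies = mov_sim.sum(axis=1)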

In [77]:
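# Same three steps as the user-based prediction, applied per movie:
# weighted sum of mean-centered ratings, normalised by the summed weights, plus the movie mean
movie_P = mov_sim.dot(movie_ratings_diff)
movie_P = movie_P.div(sum_sim_movies, axis=0)
movie_P = movie_P.add(mean_movie_rating, axis=0)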

In [78]:
test_data = test_data_df.values
movie_pred = movie_P.values

In [79]:
# movie_pred is movies x users, so transpose it back to users x movies before indexing
test = test_data[test_data.nonzero()]
pred = movie_pred.T[test_data.nonzero()]
np.sqrt(mean_squared_error(test, pred))

3.261496088742623
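Because `pairwise_distances(..., metric='cosine')` returns a distance, it may be worth checking how it relates to cosine similarity before using the result as a weight. The sketch below uses a made-up 3x3 matrix (`toy_items` is hypothetical) and scikit-learn's `cosine_similarity` for comparison. It is an illustration only and does not reproduce the numbers reported above.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical movie x user ratings
toy_items = np.array([[5.0, 0.0, 3.0],
                      [5.0, 0.0, 3.0],   # identical to row 0
                      [0.0, 4.0, 0.0]])  # shares no rated users with row 0

toy_dist = pairwise_distances(toy_items, metric='cosine')  # cosine distance
toy_cos = cosine_similarity(toy_items)                     # cosine similarity

# Identical rows: distance near 0, similarity near 1
print(toy_dist[0, 1], toy_cos[0, 1])
# Rows with no overlap: distance near 1, similarity near 0
print(toy_dist[0, 2], toy_cos[0, 2])
```

If a similarity weight is intended, `1 - pairwise_distances(...)` or `cosine_similarity(...)` would be a natural substitute. Making that change would alter the predictions and the RMSE, so the values shown in this section would need to be recomputed.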