# Recommendation Systems Project

India is the second-largest smartphone market in the world after China. About 134 million smartphones were sold across India
in 2017, and sales are estimated to rise to about 442 million in 2022. India also ranked second in Asia Pacific for the average time
smartphone users spend on the mobile web. The combination of very high sales volumes and this consumer behaviour has
made India a very attractive market for foreign vendors. In terms of consumer behaviour, 97% of consumers turn to a search engine when they
are buying a product, versus 15% who turn to social media. If a seller succeeds in presenting smartphones that match a user’s behaviour or choices in the
right place, there is a 90% chance that the user will enquire about them. This case study aims to build a recommendation system
based on individual consumers’ behaviour or choices.

**Importing Libraries for the Project**

In [57]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

In [58]:
from surprise import SVD
from surprise import KNNWithMeans
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split
from surprise import Reader
from surprise.model_selection import cross_validate
from surprise.model_selection import GridSearchCV


In [59]:
df1=pd.read_csv('Data Set/phone_user_review_file_1.csv', encoding='latin1')
df2=pd.read_csv('Data Set/phone_user_review_file_2.csv', encoding='latin1')
df3=pd.read_csv('Data Set/phone_user_review_file_3.csv', encoding='latin1')
df4=pd.read_csv('Data Set/phone_user_review_file_4.csv', encoding='latin1')
df5=pd.read_csv('Data Set/phone_user_review_file_5.csv', encoding='latin1')
df6=pd.read_csv('Data Set/phone_user_review_file_6.csv', encoding='latin1')

In [60]:
frames=[df1,df2,df3,df4,df5,df6]

In [61]:
df=pd.concat(frames)
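By default, `pd.concat` keeps each input frame's own row labels, so the combined frame can contain repeated index values. A minimal sketch on two made-up frames (stand-ins for the six review files) shows the difference that `ignore_index=True` makes:

```python
import pandas as pd

# Two tiny stand-in frames (hypothetical data) in place of the six review files.
part_a = pd.DataFrame({"product": ["A", "B"], "score": [10.0, 8.0]})
part_b = pd.DataFrame({"product": ["C", "D"], "score": [6.0, 9.0]})

# Default behaviour: each frame keeps its own labels, so labels repeat.
kept = pd.concat([part_a, part_b])

# ignore_index=True renumbers the rows of the combined frame from 0.
renumbered = pd.concat([part_a, part_b], ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```

Unique row labels are not required for the steps that follow, but they make later label-based operations easier to reason about.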

In [62]:
df.head(5)

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
0,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Verizon Wireless,verizonwireless.com,10.0,10.0,As a diehard Samsung fan who has had every Sam...,CarolAnn35,Samsung Galaxy S8
1,/cellphones/samsung-galaxy-s8/,4/28/2017,en,us,Phone Arena,phonearena.com,10.0,10.0,Love the phone. the phone is sleek and smooth ...,james0923,Samsung Galaxy S8
2,/cellphones/samsung-galaxy-s8/,5/4/2017,en,us,Amazon,amazon.com,6.0,10.0,Adequate feel. Nice heft. Processor's still sl...,R. Craig,"Samsung Galaxy S8 (64GB) G950U 5.8"" 4G LTE Unl..."
3,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Samsung,samsung.com,9.2,10.0,Never disappointed. One of the reasons I've be...,Buster2020,Samsung Galaxy S8 64GB (AT&T)
4,/cellphones/samsung-galaxy-s8/,5/11/2017,en,us,Verizon Wireless,verizonwireless.com,4.0,10.0,I've now found that i'm in a group of people t...,S Ate Mine,Samsung Galaxy S8


**Explore, understand the Data and share at least 2 observations.**

In [63]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1415133 entries, 0 to 163836
Data columns (total 11 columns):
 #   Column     Non-Null Count    Dtype  
---  ------     --------------    -----  
 0   phone_url  1415133 non-null  object 
 1   date       1415133 non-null  object 
 2   lang       1415133 non-null  object 
 3   country    1415133 non-null  object 
 4   source     1415133 non-null  object 
 5   domain     1415133 non-null  object 
 6   score      1351644 non-null  float64
 7   score_max  1351644 non-null  float64
 8   extract    1395772 non-null  object 
 9   author     1351931 non-null  object 
 10  product    1415132 non-null  object 
dtypes: float64(2), object(9)
memory usage: 129.6+ MB


In [64]:
df.shape

(1415133, 11)

In [65]:
df.describe(include="all")

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
count,1415133,1415133,1415133,1415133,1415133,1415133,1351644.0,1351644.0,1395772,1351931,1415132
unique,5556,7728,22,42,331,384,,,1321353,801103,61313
top,/cellphones/samsung-galaxy-s-iii/,7/18/2016,en,us,Amazon,amazon.com,,,#NAME?,Amazon Customer,"Lenovo Vibe K4 Note (White,16GB)"
freq,17093,3244,554746,318435,728471,214776,,,667,76978,5226
mean,,,,,,,8.00706,10.0,,,
std,,,,,,,2.616121,0.0,,,
min,,,,,,,0.2,10.0,,,
25%,,,,,,,7.2,10.0,,,
50%,,,,,,,9.2,10.0,,,
75%,,,,,,,10.0,10.0,,,


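The dataset has 1,415,133 rows and 11 columns.

Null values are present in the score, score_max, extract, author and product columns.

score_max is 10 for every rated row, so all scores share the same scale. Scores are skewed towards the top of that scale: the mean is 8.0 and the median is 9.2.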


**Round off scores to the nearest integers.**

In [66]:
df[['score']]=df[['score']].round()
df[['score_max']]=df[['score_max']].round()

In [67]:
df.head(5)

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
0,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Verizon Wireless,verizonwireless.com,10.0,10.0,As a diehard Samsung fan who has had every Sam...,CarolAnn35,Samsung Galaxy S8
1,/cellphones/samsung-galaxy-s8/,4/28/2017,en,us,Phone Arena,phonearena.com,10.0,10.0,Love the phone. the phone is sleek and smooth ...,james0923,Samsung Galaxy S8
2,/cellphones/samsung-galaxy-s8/,5/4/2017,en,us,Amazon,amazon.com,6.0,10.0,Adequate feel. Nice heft. Processor's still sl...,R. Craig,"Samsung Galaxy S8 (64GB) G950U 5.8"" 4G LTE Unl..."
3,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Samsung,samsung.com,9.0,10.0,Never disappointed. One of the reasons I've be...,Buster2020,Samsung Galaxy S8 64GB (AT&T)
4,/cellphones/samsung-galaxy-s8/,5/11/2017,en,us,Verizon Wireless,verizonwireless.com,4.0,10.0,I've now found that i'm in a group of people t...,S Ate Mine,Samsung Galaxy S8


**Check for missing values. Impute the missing values, if any.**

In [68]:
df.isnull().sum()

phone_url        0
date             0
lang             0
country          0
source           0
domain           0
score        63489
score_max    63489
extract      19361
author       63202
product          1
dtype: int64

In [69]:
df=df.dropna()

In [70]:
df.isnull().sum()

phone_url    0
date         0
lang         0
country      0
source       0
domain       0
score        0
score_max    0
extract      0
author       0
product      0
dtype: int64
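The cell above removes every row that has a missing value in any column. Since the models later use only author, product and score, a narrower filter would keep rows whose only gap is in a column such as extract. A minimal sketch on hypothetical data with the same column names:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample with the same column names as the review data.
reviews = pd.DataFrame({
    "author":  ["ann", None, "bob", "cy"],
    "product": ["P1", "P2", "P1", "P3"],
    "score":   [8.0, 9.0, np.nan, 10.0],
    "extract": ["good", "fine", "ok", None],
})

# Drops a row if any column is missing.
all_cols = reviews.dropna()

# Drops a row only if one of the modelling columns is missing.
model_cols = reviews.dropna(subset=["author", "product", "score"])

print(len(all_cols), len(model_cols))
```

Whether the extra rows are worth keeping depends on whether the review text is needed later; this notebook does not use it for modelling.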

**Check for duplicate values and remove them, if any**

In [71]:
df.duplicated().sum()

4480

In [72]:
df.drop_duplicates(inplace=True)

In [73]:
df.duplicated().sum()

0

**Keep only 1 Million data samples. Use random state=612**

In [74]:
df1=df.sample(n=1000000,random_state=612)

In [75]:
df1.shape

(1000000, 11)

In [76]:
df1.head(5)

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
20276,/cellphones/lenovo-vibe-k5/,7/25/2016,en,in,Amazon,amazon.in,10.0,10.0,Good product in this price...,KHILESH KUMAR VERMA,"Lenovo Vibe K5 (Gold, VoLTE update)"
104794,/cellphones/samsung-galaxy-s6/,11/10/2016,es,es,Samsung,samsung.com,8.0,10.0,"En general me gusta mucho mi nuevo S6, el reco...",Evyta,Samsung Galaxy S6
321393,/cellphones/sony-ericsson-k810i/,1/3/2010,ru,ru,Yandex,market.yandex.ru,8.0,10.0,Ð½ÐµÑÐ¼Ð¾ÑÑÑ Ð½Ð° Ð½ÐµÐ´Ð¾ÑÑÐ°ÑÐºÐ¸ Ð² ...,VanRaZor,Sony Ericsson K810i
78000,/cellphones/sony-xperia-z2/,7/19/2014,ru,ua,Hotline.ua,hotline.ua,6.0,10.0,ÐÑÑÑ ÑÐ¶Ðµ ÑÐ°Ð·Ð²ÐµÑÐ½ÑÑÑÐ¹ Ð¾ÑÐ·Ñ...,ruga,Sony Xperia Z2 (Black)
16933,/cellphones/samsung-galaxy-s7-edge/,10/21/2016,de,de,Otto.de,otto.de,10.0,10.0,Ein Wahnsinns Handy! Macht richtig schÃ¶ne Bil...,einer Kundin,"Samsung Galaxy S7 edge Smartphone, 13,9 cm (5,..."


**Drop irrelevant features. Keep features like Author, Product, and Score.**

In [77]:
dfn=df1[['score','score_max','extract','author','product']]

In [78]:
dfn.head(5)

Unnamed: 0,score,score_max,extract,author,product
20276,10.0,10.0,Good product in this price...,KHILESH KUMAR VERMA,"Lenovo Vibe K5 (Gold, VoLTE update)"
104794,8.0,10.0,"En general me gusta mucho mi nuevo S6, el reco...",Evyta,Samsung Galaxy S6
321393,8.0,10.0,Ð½ÐµÑÐ¼Ð¾ÑÑÑ Ð½Ð° Ð½ÐµÐ´Ð¾ÑÑÐ°ÑÐºÐ¸ Ð² ...,VanRaZor,Sony Ericsson K810i
78000,6.0,10.0,ÐÑÑÑ ÑÐ¶Ðµ ÑÐ°Ð·Ð²ÐµÑÐ½ÑÑÑÐ¹ Ð¾ÑÐ·Ñ...,ruga,Sony Xperia Z2 (Black)
16933,10.0,10.0,Ein Wahnsinns Handy! Macht richtig schÃ¶ne Bil...,einer Kundin,"Samsung Galaxy S7 edge Smartphone, 13,9 cm (5,..."


**Identify the most rated products**

In [79]:
df2 = pd.DataFrame(df1.groupby('product')['score'].mean()) 

In [80]:
df2['rating_counts'] = pd.DataFrame(df1.groupby('product')['score'].count())  

In [81]:
mrpo=df2['rating_counts'].sort_values(ascending=False)

In [82]:
mrpo.head()

product
Lenovo Vibe K4 Note (White,16GB)     4109
Lenovo Vibe K4 Note (Black, 16GB)    3451
OnePlus 3 (Graphite, 64 GB)          3212
OnePlus 3 (Soft Gold, 64 GB)         2798
Huawei P8lite zwart / 16 GB          2121
Name: rating_counts, dtype: int64

These are the top 5 most-rated products.

**Identify the users with most number of reviews**

In [83]:
df2 = pd.DataFrame(df1.groupby('author')['score'].count().sort_values(ascending=False))

In [84]:
print('The users with highest number of reviews')
df2.head(5)

The users with highest number of reviews


author,score
Amazon Customer,60408
Cliente Amazon,15051
e-bit,6651
Client d'Amazon,6087
Amazon Kunde,3683


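These are the five authors with the most reviews. Most of them appear to be generic placeholder names (for example, Amazon Customer and its Spanish, French and German equivalents) rather than individual users.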

**Select the data with products having more than 50 ratings and users who have given more than 50 ratings. Report the shape of the final dataset.**

In [85]:
df3=df1.copy(deep = True )
df3=df3.drop('score_max',axis=1)
df3=df3.drop('extract',axis=1)
df3=df3.drop('score',axis=1)
df3.shape

(1000000, 8)

In [86]:
df3=df3.reset_index(drop=True)  # renumber the rows 0..999999
df3['userid'] = df3.groupby(['author']).ngroup()
df3['productid'] = df3.groupby(['product']).ngroup()
df3['user_count']=df3['userid'].value_counts()
df3['product_count']=df3['productid'].value_counts()
df3.head()

Unnamed: 0,phone_url,date,lang,country,source,domain,author,product,userid,productid,user_count,product_count
0,/cellphones/lenovo-vibe-k5/,7/25/2016,en,in,Amazon,amazon.in,KHILESH KUMAR VERMA,"Lenovo Vibe K5 (Gold, VoLTE update)",196083,20587,1.0,1.0
1,/cellphones/samsung-galaxy-s6/,11/10/2016,es,es,Samsung,samsung.com,Evyta,Samsung Galaxy S6,118118,37781,1.0,1.0
2,/cellphones/sony-ericsson-k810i/,1/3/2010,ru,ru,Yandex,market.yandex.ru,VanRaZor,Sony Ericsson K810i,388052,44103,1.0,1.0
3,/cellphones/sony-xperia-z2/,7/19/2014,ru,ua,Hotline.ua,hotline.ua,ruga,Sony Xperia Z2 (Black),557947,46907,1.0,1.0
4,/cellphones/samsung-galaxy-s7-edge/,10/21/2016,de,de,Otto.de,otto.de,einer Kundin,"Samsung Galaxy S7 edge Smartphone, 13,9 cm (5,...",455976,38657,1.0,37.0


In [87]:
df4=df3[ (df3['user_count']> 50) & (df3['product_count'] > 50)]

In [88]:
df4

Unnamed: 0,phone_url,date,lang,country,source,domain,author,product,userid,productid,user_count,product_count
2968,/cellphones/samsung-galaxy-s5/,3/3/2015,en,us,Rakuten,rakuten.com,Myron A Schwab,Samsung Galaxy S5 G900A 4G LTE 16GB Unlocked G...,269857,37540,246.0,77.0
3115,/cellphones/wiko-rainbow/,6/10/2015,de,de,Amazon,amazon.de,PC-DIDI,"Wiko Rainbow Smartphone (12,7 cm (5 Zoll) Disp...",289145,48960,72.0,477.0
13594,/cellphones/samsung-galaxy-s6/,1/7/2016,es,es,Amazon,amazon.es,Isaac,Samsung Galaxy S6 - Smartphone libre Android (...,163742,37808,103.0,52.0
24176,/cellphones/lg-lg420g/,12/19/2006,en,us,Amazon,amazon.com,Rob Oppen,LG 420G Pre-Paid Cell Phone for TracFone with ...,322089,15847,55.0,55.0
24287,/cellphones/lg-arena-km900/,7/25/2011,es,ar,MercadoLibre,opinion.mercadolibre.com.ar,JOSEEDUARDO.BUSTOS,LG KM900,170666,17944,412.0,71.0
28449,/cellphones/samsung-galaxy-a5-2016/,6/15/2016,it,it,Amazon,amazon.it,Flora,"Samsung Galaxy A5 2016 Smartphone LTE, 16GB, Nero",126297,34318,171.0,61.0
29072,/cellphones/motorola-moto-g4/,12/11/2016,de,de,Amazon,amazon.de,Hr. Sieb,"Lenovo Moto G4 Smartphone (14 cm (5,5 Zoll), 1...",157903,20196,66.0,240.0
48485,/cellphones/lg-vs750/,6/9/2011,en,us,Amazon,amazon.com,"G. Yau ""TY""",LG Fathom VS750 Verizon phone Unlocked GSM Wor...,131969,16794,63.0,106.0


In [89]:
print('Shape of dataframe consisting of products that have number of ratings greater than 50 and author who have reviewed more than 50 items:', df4.shape)

Shape of dataframe consisting of products that have number of ratings greater than 50 and author who have reviewed more than 50 items: (8, 12)
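In the cells above, user_count and product_count are filled from `value_counts()`, whose result is indexed by user or product id rather than by row. pandas aligns such an assignment on index labels, so each row appears to receive the count belonging to whichever id equals its row number. That may explain why only 8 rows survive the filter. A minimal sketch on hypothetical toy data contrasts this with `groupby(...).transform('count')`, which returns one count per row:

```python
import pandas as pd

# Hypothetical toy data: user 7 wrote three reviews, user 2 wrote one.
toy = pd.DataFrame({"userid": [7, 7, 2, 7], "productid": [0, 1, 1, 1]})

# value_counts() is indexed by user id. On assignment, row i receives the
# count of the user whose id equals i, not the count of the user in row i.
toy["user_count_misaligned"] = toy["userid"].value_counts()

# transform('count') returns one value per row, already in row order.
toy["user_count"] = toy.groupby("userid")["userid"].transform("count")
toy["product_count"] = toy.groupby("productid")["productid"].transform("count")

print(toy)
```

With per-row counts in place, the same `> 50` filter would select reviews by frequent reviewers of frequently reviewed products. The resulting shape on the full data is not shown here because it has not been run.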


**Build a popularity based model and recommend top 5 mobile phones.**

In [90]:
df1=df1[['score','score_max','extract','author','product']]
df2 = pd.DataFrame(df1.groupby('product')['score'].mean()) 
df2['rating_counts'] = pd.DataFrame(df1.groupby('product')['score'].count())  


**Top 5 products with highest ratings based on Popularity based recommender systems**

In [91]:
df2.head()

product,score,rating_counts
"'Sony Xperia X (F5122) â White â Dual Sim (Google Android 6.0.1, 5 Display, 2 x CORTEX A72 1.8 GHz + 4 x cortex-a53...",10.0,1
"'Sony Xperia X (F5122) â rosa â Dual Sim (Google Android 6.0.1, 5 Display, 2 x CORTEX A72 1.8 GHz + 4 x cortex-a53...",10.0,1
"(7.62 cm (3 )Afficheur/Ã©cran, 2 MPixCamÃ©ra;blanc)-Smartphone",6.0,1
"(CUBOT) GT88 5.5"" qHD 1.3GHz MTK6572 2-Core Android 4.2.2 3G Phone 8MP CAM 512MB RAM 4GB ROM",8.0,1
"(DG300 Versione Aggiornata)5'' DOOGEE VOYAGER2 DG310 Dual Flashlights IPS Screen 3G Smartphone Android 4.4 MTK6582 1.3GHz Quad Core Telefono Cellulare Dual SIM 8G ROM OTG OTA GPS WIFI, BIANCO",7.513514,37

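The `df2.head()` call above lists the first five products in alphabetical order of name, not a ranking. A popularity-based recommender would normally sort by mean score and require a minimum number of ratings, so that a single 10/10 review does not top the list. A minimal sketch on hypothetical ratings; the threshold `min_ratings` is an assumption to tune to the data:

```python
import pandas as pd

# Hypothetical ratings; column names follow the notebook.
ratings = pd.DataFrame({
    "product": ["A", "A", "A", "B", "C", "C"],
    "score":   [9.0, 8.0, 10.0, 10.0, 6.0, 7.0],
})

# Mean score and number of ratings per product.
stats = ratings.groupby("product")["score"].agg(
    mean_score="mean", rating_counts="count"
)

min_ratings = 2  # assumed threshold; tune to the data

# Keep sufficiently rated products, then rank by mean score (ties by count).
popular = (
    stats[stats["rating_counts"] >= min_ratings]
    .sort_values(["mean_score", "rating_counts"], ascending=False)
    .head(5)
)
print(popular)
```

Product B has the highest mean but only one rating, so it is excluded here. That is the kind of case the threshold is meant to handle.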

**Build a collaborative filtering model using SVD. You can use SVD from surprise or build it from scratch (Note: in case you’re building it from scratch, you
can limit your data points to 5000 samples if you face memory issues). Build a collaborative filtering model using KNNWithMeans from surprise. You
can try both user-based and item-based models.**

In [92]:
df2=df1.copy(deep=True)

In [93]:
#converting dataframe to a format which can be read by surprise library
df2=df2[['author','product','score']]
df2.head()

Unnamed: 0,author,product,score
20276,KHILESH KUMAR VERMA,"Lenovo Vibe K5 (Gold, VoLTE update)",10.0
104794,Evyta,Samsung Galaxy S6,8.0
321393,VanRaZor,Sony Ericsson K810i,8.0
78000,ruga,Sony Xperia Z2 (Black),6.0
16933,einer Kundin,"Samsung Galaxy S7 edge Smartphone, 13,9 cm (5,...",10.0


In [94]:
# Limit the dataset to 5000 samples to avoid running out of memory
df2=df2.sample(n=5000,random_state=612)
df2=df2.reset_index(drop=True)
df2[['score']]=df2[['score']].astype(int)
df2.head()

Unnamed: 0,author,product,score
0,amit kumar gupta,InFocus M810 (Gold),10
1,oscar,Sony Xperia L - Smartphone libre Android (pant...,10
2,kookie,"Samsung Galaxy S4, Brown 16GB (Verizon Wireless)",8
3,e-bit,Smartphone Asus ZenFone 3 ZE520KL,10
4,Sarah Schanz,"5,0 Zoll CUBOT S208 IPS OGS Screen 3G Android ...",8


In [95]:
df2['userid'] = df2.groupby(['author']).ngroup()
df2['productid'] = df2.groupby(['product']).ngroup()
df2.head()

Unnamed: 0,author,product,score,userid,productid
0,amit kumar gupta,InFocus M810 (Gold),10,2957,1046
1,oscar,Sony Xperia L - Smartphone libre Android (pant...,10,3783,3452
2,kookie,"Samsung Galaxy S4, Brown 16GB (Verizon Wireless)",8,3532,2778
3,e-bit,Smartphone Asus ZenFone 3 ZE520KL,10,3204,3214
4,Sarah Schanz,"5,0 Zoll CUBOT S208 IPS OGS Screen 3G Android ...",8,2446,3


SVD based recommender system

In [96]:
reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(df2[['author', 'product', 'score']], reader)
trainset, testset = train_test_split(data, test_size=.30)
algo = SVD()
algo.fit(trainset)
test_predsvd = algo.test(testset)

Format conversion and creating a collaborative filtering model

Item-based recommender system

In [97]:
reader = Reader(rating_scale=(1, 10))
data1 = Dataset.load_from_df(df2[['userid', 'productid', 'score']], reader)

In [98]:
trainset1, testset1 = train_test_split(data1, test_size=.30)

In [99]:
algoib = KNNWithMeans(k=50, sim_options={'name': 'pearson_baseline', 'user_based': False})
algoib.fit(trainset1)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x1f4a84a97f0>

In [100]:
test_predib = algoib.test(testset1)

User-based recommender system

In [101]:
algoub = KNNWithMeans(k=50, sim_options={'name': 'pearson_baseline', 'user_based': True})
algoub.fit(trainset1)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x1f4a84a9f70>

In [102]:
test_predub = algoub.test(testset1)
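As a rough illustration of what KNNWithMeans computes in the user-based case: the prediction starts from the target user's mean rating and adds a similarity-weighted average of how far each neighbour's rating of the item sits from that neighbour's own mean. The sketch below uses made-up numbers and ignores details such as the minimum number of neighbours, so it is a simplification and not Surprise's implementation.

```python
def knn_with_means(target_mean, neighbours):
    """Mean-centred KNN estimate.

    neighbours: list of (similarity, neighbour_rating, neighbour_mean) tuples.
    """
    weight = sum(sim for sim, _, _ in neighbours)
    if weight == 0:
        # No usable neighbours: fall back to the user's own mean.
        return target_mean
    offset = sum(sim * (rating - mean) for sim, rating, mean in neighbours)
    return target_mean + offset / weight


# Hypothetical neighbours: one rated the item above their mean, one below.
estimate = knn_with_means(8.0, [(0.5, 10.0, 9.0), (0.5, 6.0, 8.0)])
print(estimate)
```

In the item-based variant the roles swap: the means and similarities belong to items instead of users.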

**Evaluate the collaborative model. Print RMSE value.**

In [103]:
print("SVD-based Model : Test Set")
accuracy.rmse(test_predsvd, verbose=True)

SVD-based Model : Test Set
RMSE: 2.5311


2.5310861505994287

In [104]:
print("Item-based Model : Test Set")
accuracy.rmse(test_predib, verbose=True)

Item-based Model : Test Set
RMSE: 2.6557


2.655653046833841

In [105]:
print("User-based Model : Test Set")
accuracy.rmse(test_predub, verbose=True)

User-based Model : Test Set
RMSE: 2.6464


2.646445148317341
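RMSE is the square root of the mean squared difference between each actual rating and its estimate, so values of about 2.5 to 2.7 mean the models are typically off by roughly two and a half points on the 1 to 10 scale. A minimal sketch of the calculation on made-up pairs:

```python
import math


def rmse(pairs):
    """Root mean squared error over (actual, estimated) rating pairs."""
    squared_errors = [(actual - estimate) ** 2 for actual, estimate in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative pairs only: a predictor that always answers 8 against made-up ratings.
pairs = [(4.0, 8.0), (10.0, 8.0), (8.0, 8.0), (6.0, 8.0)]
print(rmse(pairs))
```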

**Predict score (average rating) for test users.**

In [106]:
test_predsvd #svdbased system

[Prediction(uid='philipp19 ', iid='LG P880 Optimus 4X HD', r_ui=4.0, est=8.016857142857143, details={'was_impossible': False}),
 Prediction(uid='Haydenko', iid='Samsung C3222', r_ui=6.0, est=8.016857142857143, details={'was_impossible': False}),
 Prediction(uid='Matthew Jinhyuk Kim', iid='BLU Star 4.0 S410a Unlocked GSM Android 4.2 Smartphone with 4.0" Touchscreen - Pink', r_ui=6.0, est=8.176755907393074, details={'was_impossible': False}),
 Prediction(uid='Antigone', iid='LG - G3 - Smartphone DÃ©bloquÃ© 4G (Ecran 5,5 Pouces - 16 Go - Android 4.4.2 KitKat) - Titane', r_ui=8.0, est=8.17770916242619, details={'was_impossible': False}),
 Prediction(uid='S.Punkt.H.Punkt', iid='Sony Ericsson C905 Handy (8MP, GPS, WLAN) Night Black', r_ui=10.0, est=8.016857142857143, details={'was_impossible': False}),
 Prediction(uid='lorenzo craft', iid='Samsung Galaxy S7 goud, roze / 32 GB', r_ui=9.0, est=8.115650691008415, details={'was_impossible': False}),
 Prediction(uid='Scott A. Wells', iid='BLU Lif
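The repeated estimate of about 8.0169 in the output above is worth a closer look. As documented for Surprise's SVD, a rating is predicted as the global mean plus a user bias, an item bias and the dot product of the user and item factors, with the bias and factors of an unknown user or item treated as zero. A pair unseen in training would then receive exactly the global mean, which would be consistent with the repeated value. A minimal sketch of that rule with made-up parameters:

```python
# Minimal sketch of the biased matrix-factorisation prediction rule.
# All numbers below are made up for illustration.
GLOBAL_MEAN = 8.0

user_bias = {"ann": 0.5}
item_bias = {"phone_x": -0.2}
user_factors = {"ann": [0.1, -0.3]}
item_factors = {"phone_x": [0.4, 0.2]}


def predict(user, item, low=1.0, high=10.0):
    """mu + b_u + b_i + q_i . p_u, with unknown users/items contributing zero."""
    p_u = user_factors.get(user, [0.0, 0.0])
    q_i = item_factors.get(item, [0.0, 0.0])
    dot = sum(p * q for p, q in zip(p_u, q_i))
    estimate = GLOBAL_MEAN + user_bias.get(user, 0.0) + item_bias.get(item, 0.0) + dot
    return min(high, max(low, estimate))  # clip to the rating scale


print(predict("ann", "phone_x"))
print(predict("new_user", "new_phone"))
```

So the absence of a was_impossible flag does not by itself mean a prediction is personalised.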

In [107]:
test_predib #item based system

[Prediction(uid=3812, iid=1316, r_ui=4.0, est=8.03057142857143, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid=3634, iid=1512, r_ui=10.0, est=8.03057142857143, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid=161, iid=216, r_ui=10.0, est=1.8582387930188338, details={'actual_k': 50, 'was_impossible': False}),
 Prediction(uid=672, iid=3275, r_ui=4.0, est=8.03057142857143, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid=2890, iid=2274, r_ui=10.0, est=8.03057142857143, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid=2558, iid=3209, r_ui=10.0, est=8.03057142857143, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid=4163, iid=3711, r_ui=2.0, est=8.03057142857143, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid=475, iid=1742, r_ui

In [108]:
test_predub #predictions for user based system

[Prediction(uid=3812, iid=1316, r_ui=4.0, est=8.03057142857143, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid=3634, iid=1512, r_ui=10.0, est=8.03057142857143, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid=161, iid=216, r_ui=10.0, est=2.0, details={'actual_k': 1, 'was_impossible': False}),
 Prediction(uid=672, iid=3275, r_ui=4.0, est=8.03057142857143, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid=2890, iid=2274, r_ui=10.0, est=8.03057142857143, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid=2558, iid=3209, r_ui=10.0, est=8.03057142857143, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid=4163, iid=3711, r_ui=2.0, est=8.03057142857143, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid=475, iid=1742, r_ui=2.0, est=8.0305
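To quantify how often the KNN models fall back to the default estimate, one can count the predictions whose details carry was_impossible set to True. The sketch below uses a namedtuple stand-in with the same field names as the Prediction objects printed above and the first four user-based predictions from that output (the reason field is omitted for brevity):

```python
from collections import namedtuple

# Stand-in with the same field names as the Prediction tuples printed above.
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])

# The first four user-based predictions from the output above.
sample = [
    Prediction(3812, 1316, 4.0, 8.03057142857143, {"was_impossible": True}),
    Prediction(3634, 1512, 10.0, 8.03057142857143, {"was_impossible": True}),
    Prediction(161, 216, 10.0, 2.0, {"actual_k": 1, "was_impossible": False}),
    Prediction(672, 3275, 4.0, 8.03057142857143, {"was_impossible": True}),
]


def impossible_share(predictions):
    """Fraction of predictions that fell back to the default estimate."""
    flagged = sum(1 for p in predictions if p.details["was_impossible"])
    return flagged / len(predictions)


print(impossible_share(sample))
```

Applied to the full test_predib and test_predub lists, the same function would give the share for each model.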

**Report your findings and inferences**

Of the three models, SVD has the lowest RMSE on the test set (2.53), ahead of the user-based (2.65) and item-based (2.66) KNNWithMeans models. All three errors are large for a 1 to 10 rating scale.

Although the original dataset is big, we had to select only 5000 data points to avoid running out of memory, and this directly hurt the results. Most authors in the sample have rated only one item, and only a few have reviewed several. The item-based and user-based models therefore suffer from the cold start and grey sheep problems: many users and items in the test set are unknown to the model, and the few known users have hardly any comparable neighbours. This makes it much harder for these models to recommend correctly, hence the high error. Most of their test predictions have was_impossible set to True and return a default estimate of about 8.03; only a few are False.

SVD also has a large RMSE, but it never sets was_impossible to True on the test predictions. Many of its estimates are nevertheless identical (about 8.017), which suggests that it also falls back to a baseline value for unknown users and items. In cases like these, a popularity-based recommendation system is the better choice.

**Try and recommend top 5 products for test users**

In [109]:
from collections import defaultdict


def get_top_n(predictions, n=5):
    """Return the n highest-estimated items for each user in predictions."""
    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and keep the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

In [110]:
top_n = get_top_n(test_predsvd, n=5)

In [111]:
top_n

defaultdict(list,
            {'philipp19 ': [('LG P880 Optimus 4X HD', 8.016857142857143)],
             'Haydenko': [('Samsung C3222', 8.016857142857143)],
             'Matthew Jinhyuk Kim': [('BLU Star 4.0 S410a Unlocked GSM Android 4.2 Smartphone with 4.0" Touchscreen - Pink',
               8.176755907393074)],
             'Antigone': [('LG - G3 - Smartphone DÃ©bloquÃ© 4G (Ecran 5,5 Pouces - 16 Go - Android 4.4.2 KitKat) - Titane',
               8.17770916242619)],
             'S.Punkt.H.Punkt': [('Sony Ericsson C905 Handy (8MP, GPS, WLAN) Night Black',
               8.016857142857143)],
             'lorenzo craft': [('Samsung Galaxy S7 goud, roze / 32 GB',
               8.115650691008415)],
             'Scott A. Wells': [('BLU Life One M Quad Band Unlocked (Dark Blue)',
               8.016857142857143)],
             'MARK': [('Samsung Galaxy Pocket S5300 Smartphone (7,1 cm (2,8 Zoll) Touchscreen, 2 Megapixel Kamera, Android 2.3) white',
               8.016857142857143)

In [112]:
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])

philipp19  ['LG P880 Optimus 4X HD']
Haydenko ['Samsung C3222']
Matthew Jinhyuk Kim ['BLU Star 4.0 S410a Unlocked GSM Android 4.2 Smartphone with 4.0" Touchscreen - Pink']
Antigone ['LG - G3 - Smartphone DÃ©bloquÃ© 4G (Ecran 5,5 Pouces - 16 Go - Android 4.4.2 KitKat) - Titane']
S.Punkt.H.Punkt ['Sony Ericsson C905 Handy (8MP, GPS, WLAN) Night Black']
lorenzo craft ['Samsung Galaxy S7 goud, roze / 32 GB']
Scott A. Wells ['BLU Life One M Quad Band Unlocked (Dark Blue)']
MARK ['Samsung Galaxy Pocket S5300 Smartphone (7,1 cm (2,8 Zoll) Touchscreen, 2 Megapixel Kamera, Android 2.3) white']
rebeccaandersson ['Motorola V360']
Cliente Amazon ['Huawei P8 Lite Smartphone, Display 5" IPS, Processore Octa-Core 1.5 GHz, Memoria Interna da 16 GB, 2 GB RAM, Fotocamera 13 MP, monoSIM, Android 5.0, Bianco [Italia]', 'Asus ZenFone Go 5" Smartphone, 8 GB, Dual SIM, Bianco [Italia]', 'Samsung G920F Galaxy S6 Smartphone, 32 GB, Nero [Europa]', 'Lenovo Motorola Moto G 4G (2 Generazione) Smartphone, Display 

Karina Schmidt ['Huawei Ascend Y300 Smartphone (10,2 cm (4,0 Zoll) Touchscreen, 5 Megapixel, 4 GB Interner Speicher, Android 4.1.1 (Jelly Bean)) weiÃ\x9f']
Carlyn Irwin ['Huawei Nexus 6P unlocked smartphone, 32GB Gold (US Warranty)']
ghali_baba ['Samsung GT-i5801 Galaxy Naos - TÃ©lÃ©phone Mobile - Android']
Sergio Sergio ['Sony Xperia Z1 Compact']
Sateesh M ['Motorola Moto G 3rd Generation (Black, 16GB)']
??????????? ??????????????? ['Sony Xperia E (?????\x80??????)']
Rock72 ['Sony Ericsson T300']
Christophe ['SONY Xperia M4 Aqua blanc']
Gaubert Vincent ['Samsung Galaxy S4 Mini Duos GT-i9192 Blanc - Smartphone 3G+ avec Ã©cran tactile Super AMOLED 4.3`` sous Android...']
Matthew Keeton ['Samsung Galaxy S4 SGH-I337 Unlocked GSM Smartphone with 13 MP Camera, Touchscreen and 16 GB Storage, Black']
petra ['Archos 502495 45 Titanium Dual-SIM Smartphone (11,4 cm (4,5 Zoll) Touchscreen, 5 Megapixel Kamera, micro-SD Kartenslot, Android 4.2)']
NiBa ['Samsung Galaxy Core Plus Smartphone (10,9 c
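Most users above receive a single product because the test set only contains products they have already reviewed. To recommend something new, one would instead score the products a user has not yet rated; Surprise offers `trainset.build_anti_testset()` for generating such user-item pairs. The candidate-generation step can be sketched in plain Python on hypothetical data:

```python
def unseen_items(user, rated_by_user, all_items):
    """Products the user has not rated yet, i.e. candidates to score and rank."""
    return sorted(all_items - rated_by_user.get(user, set()))


# Hypothetical data
all_items = {"phone_a", "phone_b", "phone_c"}
rated_by_user = {"ann": {"phone_a"}}

print(unseen_items("ann", rated_by_user, all_items))
print(unseen_items("new_user", rated_by_user, all_items))
```

Each candidate would then be passed to the fitted model's predict method and the results ranked with get_top_n.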

**Try other techniques (Example: cross validation) to get better results**

In [113]:
algo2=SVD()
cross_validate(algo2, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    2.6999  2.5955  2.5627  2.4733  2.6056  2.5874  0.0730  
MAE (testset)     2.0865  2.0438  1.9680  1.9154  2.0179  2.0063  0.0595  
Fit time          0.31    0.25    0.27    0.26    0.27    0.27    0.02    
Test time         0.02    0.00    0.02    0.00    0.02    0.01    0.01    


{'test_rmse': array([2.69986252, 2.59547337, 2.56266287, 2.47330656, 2.60563425]),
 'test_mae': array([2.08650286, 2.04375616, 1.96796718, 1.91535698, 2.01787467]),
 'fit_time': (0.3094651699066162,
  0.2499833106994629,
  0.2656266689300537,
  0.2631862163543701,
  0.2650439739227295),
 'test_time': (0.015621185302734375,
  0.0,
  0.015621662139892578,
  0.0,
  0.015620231628417969)}

In [114]:
param_grid = {'n_epochs': [5, 10, 15, 20], 'lr_all': [0.002, 0.005, 0.007, 0.009, 0.01],
              'reg_all': [0.2, 0.4, 0.6, 0.8, 1.0]}

In [115]:
gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=5)

In [116]:
gs.fit(data)

In [117]:
print(gs.best_score['rmse']) #best score

2.557718673070286


In [118]:
print(gs.best_params['rmse']) #best parameter

{'n_epochs': 15, 'lr_all': 0.009, 'reg_all': 0.6}


In [119]:
algo_final=SVD(n_epochs=10, lr_all=0.009, reg_all=0.6)

In [120]:
algo_final.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1f42ceddac0>

In [121]:
test_pred = algo_final.test(testset)

In [122]:
print("SVD-based Model : Test Set")
accuracy.rmse(test_pred, verbose=True)

SVD-based Model : Test Set
RMSE: 2.5014


2.5013932845624502

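After hyperparameter tuning, the best cross-validated mean RMSE is 2.56, and the final model reaches an RMSE of 2.50 on the test set, slightly better than the 2.53 of the default SVD. Note that the final model above is fitted with n_epochs=10, whereas the grid search selected n_epochs=15.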

In [123]:
test_pred

[Prediction(uid='philipp19 ', iid='LG P880 Optimus 4X HD', r_ui=4.0, est=8.016857142857143, details={'was_impossible': False}),
 Prediction(uid='Haydenko', iid='Samsung C3222', r_ui=6.0, est=8.016857142857143, details={'was_impossible': False}),
 Prediction(uid='Matthew Jinhyuk Kim', iid='BLU Star 4.0 S410a Unlocked GSM Android 4.2 Smartphone with 4.0" Touchscreen - Pink', r_ui=6.0, est=8.160922948709194, details={'was_impossible': False}),
 Prediction(uid='Antigone', iid='LG - G3 - Smartphone DÃ©bloquÃ© 4G (Ecran 5,5 Pouces - 16 Go - Android 4.4.2 KitKat) - Titane', r_ui=8.0, est=8.165443264196393, details={'was_impossible': False}),
 Prediction(uid='S.Punkt.H.Punkt', iid='Sony Ericsson C905 Handy (8MP, GPS, WLAN) Night Black', r_ui=10.0, est=8.016857142857143, details={'was_impossible': False}),
 Prediction(uid='lorenzo craft', iid='Samsung Galaxy S7 goud, roze / 32 GB', r_ui=9.0, est=8.080493502948796, details={'was_impossible': False}),
 Prediction(uid='Scott A. Wells', iid='BLU Li

**In what business scenario should you use popularity-based recommendation systems?**

When a new customer subscribes to our service, we can use a popularity-based recommendation system. Personalised recommendations depend on previous data about the person; that history is the core input of a recommendation system. Since we have no such information about a new customer, a popularity-based system can instead recommend the products that are trending at that time. This works well initially, until we have gained some insight into the customer's preferences and tastes. A popularity-based recommendation system does not need any information about the individual user.

**In what business scenario should you use CF-based recommendation systems?**

When we already have information on a set of customers, we can use a collaborative filtering recommendation system. In its user-based form, it finds similarities between customers based on the products they have rated; in its item-based form, it finds similarities between products based on the customers who rated them.

**What other possible methods can you think of that could further improve the recommendations for different users?**

Collaborative filtering methods tend to suffer from the grey sheep and cold start problems. To mitigate these, it is better to use a hybrid recommendation system to improve the recommendations.
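One simple form of hybrid is a switching strategy: serve collaborative-filtering results when the user has any, and fall back to the popularity list otherwise. A minimal sketch, assuming the collaborative-filtering scores and the popularity ranking have already been computed (all names and numbers are hypothetical):

```python
def recommend(user, cf_scores, popular_items, n=5):
    """Use collaborative-filtering scores when available, else popularity."""
    personal = cf_scores.get(user)
    if personal:
        # Rank this user's candidate products by estimated rating.
        ranked = sorted(personal, key=personal.get, reverse=True)
        return ranked[:n]
    # Unknown user (cold start): recommend the most popular products.
    return popular_items[:n]


# Hypothetical inputs
cf_scores = {"ann": {"phone_a": 9.1, "phone_b": 7.4, "phone_c": 8.2}}
popular_items = ["phone_c", "phone_a", "phone_d"]

print(recommend("ann", cf_scores, popular_items, n=2))
print(recommend("new_user", cf_scores, popular_items, n=2))
```

Other hybrids blend the two scores or add content features such as brand and price; which works best would need to be checked against held-out data.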