### Recommendation Systems


#### Shrawani Singh - Project solution


### _Steps and tasks:_


##### 1. Import the necessary libraries and read the provided CSVs as a data frame and perform the below steps


In [72]:
# Importing all the necessary libs.
# Importing the libraries
import warnings
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split
from surprise import Reader
from surprise import accuracy
from surprise import Dataset
from surprise import KNNWithMeans
from surprise import SVD
from collections import defaultdict
from sklearn import preprocessing
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Suppressing Warnings
warnings.filterwarnings('ignore')


**A. Merge all the provided CSVs into one data-frame.**


In [2]:
# Importing data frames before merging
review_df1 = pd.read_csv(
    './phone_user_review_file_1.csv', encoding='iso-8859-1')
review_df2 = pd.read_csv(
    './phone_user_review_file_2.csv', encoding='iso-8859-1')
review_df3 = pd.read_csv(
    './phone_user_review_file_3.csv', encoding='iso-8859-1')
review_df4 = pd.read_csv(
    './phone_user_review_file_4.csv', encoding='iso-8859-1')
review_df5 = pd.read_csv(
    './phone_user_review_file_5.csv', encoding='iso-8859-1')
review_df6 = pd.read_csv(
    './phone_user_review_file_6.csv', encoding='iso-8859-1')


In [3]:
# All the files have been imported as dataset.
# Verifying if all the files have been imported
# Also checking the total count of datasets
print(
    f'Review Dataset 1: Rows: {review_df1.shape[0]} and Columns: {review_df1.shape[1]}\n')
print(
    f'Review Dataset 2: Rows: {review_df2.shape[0]} and Columns: {review_df2.shape[1]}\n')
print(
    f'Review Dataset 3: Rows: {review_df3.shape[0]} and Columns: {review_df3.shape[1]}\n')
print(
    f'Review Dataset 4: Rows: {review_df4.shape[0]} and Columns: {review_df4.shape[1]}\n')
print(
    f'Review Dataset 5: Rows: {review_df5.shape[0]} and Columns: {review_df5.shape[1]}\n')
print(
    f'Review Dataset 6: Rows: {review_df6.shape[0]} and Columns: {review_df6.shape[1]}\n')
print(
    f'Total rows: {review_df1.shape[0]+review_df2.shape[0]+review_df3.shape[0]+review_df4.shape[0]+review_df5.shape[0]+review_df6.shape[0]}')


Review Dataset 1: Rows: 374910 and Columns: 11

Review Dataset 2: Rows: 114925 and Columns: 11

Review Dataset 3: Rows: 312961 and Columns: 11

Review Dataset 4: Rows: 98284 and Columns: 11

Review Dataset 5: Rows: 350216 and Columns: 11

Review Dataset 6: Rows: 163837 and Columns: 11

Total rows: 1415133


In [4]:

# Check whether the column names are same in all the dataframes:

all(np.unique(review_df1.columns.tolist()) == np.unique(review_df1.columns.tolist() + review_df2.columns.tolist() +
    review_df3.columns.tolist() +
    review_df4.columns.tolist() +
    review_df5.columns.tolist() +
    review_df6.columns.tolist()))


True

In [5]:
# Merge the data into a single dataframe
reviews = pd.concat([review_df1, review_df2, review_df3, review_df4, review_df5, review_df6], ignore_index=True)

# Deleting the old datasets since we are not going to use them any further. To save memeory space we must delete them.

del review_df1, review_df2, review_df3, review_df4, review_df5, review_df6


**B. Explore, understand the Data and share at least 2 observations**

In [6]:
print(f'reviews: Rows: {reviews.shape[0]} and Columns: {reviews.shape[1]}\n')
print('Top 5 rows of the data: ')
display(reviews.head())
print('Bottom 5 rows of the data: ')
display(reviews.tail())


reviews: Rows: 1415133 and Columns: 11

Top 5 rows of the data: 


Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
0,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Verizon Wireless,verizonwireless.com,10.0,10.0,As a diehard Samsung fan who has had every Sam...,CarolAnn35,Samsung Galaxy S8
1,/cellphones/samsung-galaxy-s8/,4/28/2017,en,us,Phone Arena,phonearena.com,10.0,10.0,Love the phone. the phone is sleek and smooth ...,james0923,Samsung Galaxy S8
2,/cellphones/samsung-galaxy-s8/,5/4/2017,en,us,Amazon,amazon.com,6.0,10.0,Adequate feel. Nice heft. Processor's still sl...,R. Craig,"Samsung Galaxy S8 (64GB) G950U 5.8"" 4G LTE Unl..."
3,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Samsung,samsung.com,9.2,10.0,Never disappointed. One of the reasons I've be...,Buster2020,Samsung Galaxy S8 64GB (AT&T)
4,/cellphones/samsung-galaxy-s8/,5/11/2017,en,us,Verizon Wireless,verizonwireless.com,4.0,10.0,I've now found that i'm in a group of people t...,S Ate Mine,Samsung Galaxy S8


Bottom 5 rows of the data: 


Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
1415128,/cellphones/alcatel-ot-club_1187/,5/12/2000,de,de,Ciao,ciao.de,2.0,10.0,Weil mein Onkel bei ALcatel arbeitet habe ich ...,david.paul,Alcatel Club Plus Handy
1415129,/cellphones/alcatel-ot-club_1187/,5/11/2000,de,de,Ciao,ciao.de,10.0,10.0,Hy Liebe Leserinnen und Leser!! Ich habe seit ...,Christiane14,Alcatel Club Plus Handy
1415130,/cellphones/alcatel-ot-club_1187/,5/4/2000,de,de,Ciao,ciao.de,2.0,10.0,"Jetzt hat wohl Alcatell gedacht ,sie machen wa...",michaelawr,Alcatel Club Plus Handy
1415131,/cellphones/alcatel-ot-club_1187/,5/1/2000,de,de,Ciao,ciao.de,8.0,10.0,Ich bin seit 2 Jahren (stolzer) Besitzer eines...,claudia0815,Alcatel Club Plus Handy
1415132,/cellphones/alcatel-ot-club_1187/,4/25/2000,de,de,Ciao,ciao.de,2.0,10.0,"Was sich Alkatel hier wieder ausgedacht hat,sc...",michaelawr,Alcatel Club Plus Handy


In [7]:
# Getting infos of dataset
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1415133 entries, 0 to 1415132
Data columns (total 11 columns):
 #   Column     Non-Null Count    Dtype  
---  ------     --------------    -----  
 0   phone_url  1415133 non-null  object 
 1   date       1415133 non-null  object 
 2   lang       1415133 non-null  object 
 3   country    1415133 non-null  object 
 4   source     1415133 non-null  object 
 5   domain     1415133 non-null  object 
 6   score      1351644 non-null  float64
 7   score_max  1351644 non-null  float64
 8   extract    1395772 non-null  object 
 9   author     1351931 non-null  object 
 10  product    1415132 non-null  object 
dtypes: float64(2), object(9)
memory usage: 118.8+ MB


#Observation

#Except score and score_max (which are of float type) all other features are of object type

#feature date should be of datetype

#Also, score, score_max, extract and author: columns seems to have Null values

**C. Round off scores to the nearest integers**

In [81]:
# Alternate method, Using apply method to round the scores to the nearest integers
#Round oﬀ scores to the nearest integers.
reviews['score'] = reviews['score'].astype(int)
reviews['score_max'] = reviews['score_max'].astype(int)


**D. Check for missing values. Impute the missing values, if any.**

In [8]:
#check for missing values

reviews.isnull().values.any() 

 # If there are any null values in data set


True

In [9]:
# This prints the columns with the number of null values they have
reviews.isnull().sum()


phone_url        0
date             0
lang             0
country          0
source           0
domain           0
score        63489
score_max    63489
extract      19361
author       63202
product          1
dtype: int64

In [10]:
#Check for missing values. Impute the missing values if there is any.
# filling the null values in column 'score' and 'score_max'
reviews = reviews.fillna(reviews.median())

# dropping the null values in columns 'extract' ,'author' and 'product'
reviews = reviews.dropna()


**Round off scores to the nearest integers: Repeating 1.C question (after removing NA values)**

In [11]:
reviews['score'] = reviews['score'].astype(int)
reviews['score_max'] = reviews['score_max'].astype(int)


In [12]:
reviews['score']


0          10
1          10
2           6
3           9
4           4
           ..
1415128     2
1415129    10
1415130     2
1415131     8
1415132     2
Name: score, Length: 1336416, dtype: int32

In [13]:
reviews['score_max']


0          10
1          10
2          10
3          10
4          10
           ..
1415128    10
1415129    10
1415130    10
1415131    10
1415132    10
Name: score_max, Length: 1336416, dtype: int32

**E. Check for duplicate values and remove them, if any**

In [14]:
# 1e. Check for duplicate values and remove them if there is any.
reviews = reviews.drop_duplicates()


In [16]:
#After removing the dublicatevalues
reviews.head(3)

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
0,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Verizon Wireless,verizonwireless.com,10,10,As a diehard Samsung fan who has had every Sam...,CarolAnn35,Samsung Galaxy S8
1,/cellphones/samsung-galaxy-s8/,4/28/2017,en,us,Phone Arena,phonearena.com,10,10,Love the phone. the phone is sleek and smooth ...,james0923,Samsung Galaxy S8
2,/cellphones/samsung-galaxy-s8/,5/4/2017,en,us,Amazon,amazon.com,6,10,Adequate feel. Nice heft. Processor's still sl...,R. Craig,"Samsung Galaxy S8 (64GB) G950U 5.8"" 4G LTE Unl..."


**F. Keep only 1 Million data samples. Use random state=612.**

In [17]:
rev_backup_df = reviews.copy()
df = reviews.sample(n=1000000, random_state=612)


In [18]:
# Verifying by checking the shape of the data frame
df.shape


(1000000, 11)

**G. Drop irrelevant features. Keep features like Author, Product, and Score**

In [19]:
# Drop irrelevant features. Keep features like Author, Product, and Score. 
# we can drop phone_url,date,lang,country,source,domain and extract since they do not contribute in deciding popularity.  
df.drop(['phone_url','date','lang','country','source','domain','score_max','extract'], axis = 1, inplace = True)
# same for the backup dataframe
rev_backup_df.drop(['phone_url', 'date', 'lang', 'country', 'source',
                   'domain', 'score_max', 'extract'], axis=1, inplace=True)


In [20]:
# Verifying if it really worked
df.head(2)

Unnamed: 0,score,author,product
1005326,10,Paul B,Samsung i897 Captivate Android Smartphone Gala...
453603,10,Yuvraj,"Blu Win JR LTE (Grey, 4GB)"


### 2. Answer the following questions.

**A. Identify the most rated features**

In [21]:
# Identify the most rated features.
#sorting on products that got highest mean score
df.groupby('product')['score'].mean().sort_values(ascending=False).head()


product
Smartphone Sony Xperia E1 Desbloqueado Vivo Android 4.3 Tela 4 4GB 3G Wi-Fi CÃ¢mera 3MP - Branco                     10.0
Samsung Smartphone Samsung Galaxy S5 Desbloqueado Branco Android 4.4.2 4G CÃ¢mera 16 MP MemÃ³ria Interna 16 GB       10.0
Samsung Smartphone Samsung Galaxy S5 Duos Desbloqueado/ Dual Chip / Branco / 4G / 16 MP / Android 4.4                10.0
Samsung Smartphone Samsung Galaxy S5 Desbloqueado/ Branco / 4G / 16 MP / Android 4.4.2 / 16 GB / USB 3.0             10.0
Samsung Smartphone Samsung Galaxy S5 Desbloqueado Vivo Preto Android 4.4.2 4G CÃ¢mera 16 MP MemÃ³ria Interna 16GB    10.0
Name: score, dtype: float64

**B. Identify the users with most number of reviews**

In [22]:
#Identify the users with most number of reviews. 
(df['author'].value_counts()).head()

Amazon Customer    57765
Cliente Amazon     14564
e-bit               6309
Client d'Amazon     5720
Amazon Kunde        3624
Name: author, dtype: int64

In [23]:
# The product that got most number of reviews.
df['product'].value_counts().head()


Lenovo Vibe K4 Note (White,16GB)     3908
Lenovo Vibe K4 Note (Black, 16GB)    3234
OnePlus 3 (Graphite, 64 GB)          3128
OnePlus 3 (Soft Gold, 64 GB)         2643
Huawei P8lite zwart / 16 GB          1994
Name: product, dtype: int64

**C. Select the data with products having more than 50 ratings and users who have given more than 50 ratings. Report the shape of the final
dataset**

In [24]:
# extracting authors who gave greater than 50 ratings
df1 = pd.DataFrame(columns=['author', 'a_count'])
df1['author'] = df['author'].value_counts().index.tolist()
df1['a_count'] = list(df['author'].value_counts() > 50)


In [25]:
# get names of indexes for which count column value is False
index_names = df1[df1['a_count'] == False].index
# drop these row indexes from dataFrame
df1.drop(index_names, inplace=True)
df1


Unnamed: 0,author,a_count
0,Amazon Customer,True
1,Cliente Amazon,True
2,e-bit,True
3,Client d'Amazon,True
4,Amazon Kunde,True
...,...,...
674,Rohit,True
675,mircan,True
676,Rose,True
677,Dominik,True


In [26]:
# extracting product that got more than 50 ratings
df2 = pd.DataFrame(columns=['product', 'p_count'])
df2['product'] = df['product'].value_counts().index.tolist()
df2['p_count'] = list(df['product'].value_counts() > 50)


In [27]:
# get names of indexes for which count column value is False
index_names = df2[df2['p_count'] == False].index
# drop these row indexes from dataFrame
df2.drop(index_names, inplace=True)


In [28]:
df2


Unnamed: 0,product,p_count
0,"Lenovo Vibe K4 Note (White,16GB)",True
1,"Lenovo Vibe K4 Note (Black, 16GB)",True
2,"OnePlus 3 (Graphite, 64 GB)",True
3,"OnePlus 3 (Soft Gold, 64 GB)",True
4,Huawei P8lite zwart / 16 GB,True
...,...,...
4341,Microsoft Nokia Lumia 1320 Smartphone (6 Zoll ...,True
4342,Sony Ericsson W995 Walkman,True
4343,Sim Free Apple iPhone SE 16GB Mobile Phone - R...,True
4344,SAMSUNG S5830 GALAXY ACE CEP TELEFONU,True


In [29]:
# selecting data rows where product is having more than 50 ratings.
df3 = df[df['product'].isin(df2['product'])]
df3


Unnamed: 0,score,author,product
1005326,10,Paul B,Samsung i897 Captivate Android Smartphone Gala...
453603,10,Yuvraj,"Blu Win JR LTE (Grey, 4GB)"
498651,2,Joyce D. Pratt,"BLU Vivo XL Smartphone - 5.5"" 4G LTE - GSM Unl..."
1017703,10,David B,Samsung S3350 Chat 335 Sim Free Mobile Phone
936413,10,Sebastian,"Samsung E1190 Handy (3,6 cm (1,43 Zoll) Displa..."
...,...,...,...
577008,8,Javier,Huawei Ascend Y330 - Smartphone libre Android ...
771460,8,Patrix,"Huawei Ascend G510 Smartphone Touch, Fotocamer..."
600716,2,Amazon Customer,"Apple iPhone 5C Factory Unlocked Cellphone, 8G..."
838993,10,majere1975,"Samsung Smartphone Galaxy S Advance, Display 4..."


In [30]:
# selecting data rows from df3 where author has given more than 50 ratings.
#so that we get the data with products having more than 50 ratings and users who have given more than 50 ratings
df4 = df3[df3['author'].isin(df1['author'])]
df4


Unnamed: 0,score,author,product
936413,10,Sebastian,"Samsung E1190 Handy (3,6 cm (1,43 Zoll) Displa..."
290678,8,sara,"Samsung SM-N910F Galaxy Note 4 Smartphone, 32 ..."
476314,10,ÐÐ²Ð³ÐµÐ½Ð¸Ð¹,Sony Xperia Z1 Compact (Ð»Ð°Ð¹Ð¼)
223332,8,Amazon Customer,Motorola Moto G 3rd Generation SIM-Free Smartp...
361379,10,e-bit,Smartphone Motorola Moto G 4 Play XT1603
...,...,...,...
396020,2,Amazon customer,Tracfone Motorola Moto E Android Prepaid Phone...
1222820,8,Qantas,Sony Ericsson K810i Cyber-shot
1170633,9,Capyto,Samsung M150 Cep Telefonu
577008,8,Javier,Huawei Ascend Y330 - Smartphone libre Android ...


In [31]:
# Report the shape of the final dataset.
df4.shape


(108983, 3)

### 3. Build a popularity based model and recommend top 5 mobile phones.

In [32]:
#calculating the mean score for a product by grouping it.
ratings_mean_count = pd.DataFrame(df.groupby('product')['score'].mean())


In [33]:
# calculating the number of ratings a product got
ratings_mean_count['rating_counts'] = pd.DataFrame(
    df.groupby('product')['score'].count())


In [34]:
# 3. Recommending the 5 mobile phones based in highest mean score and highest number of ratings the product got.
ratings_mean_count.sort_values(
    by=['score', 'rating_counts'], ascending=[False, False]).head()


Unnamed: 0_level_0,score,rating_counts
product,Unnamed: 1_level_1,Unnamed: 2_level_1
Samsung Galaxy Note5,10.0,144
Nokia Smartphone Nokia Lumia 520 Desbloqueado Oi Preto Windows Phone 8 CÃ¢mera 5MP 3G Wi-Fi MemÃ³ria Interna 8G GPS,10.0,132
Motorola Smartphone Motorola Moto X Desbloqueado Preto Android 4.2.2 CÃ¢mera 10MP e Frontal 2MP MemÃ³ria Interna de 16GB GSM,10.0,131
Samsung Smartphone Galaxy Win Duos Branco Desbloqueado Dual Chip CÃ¢mera 5MP Processador Quad Core 1.2 Ghz Android 4.1 3G Wi- Fi e MemÃ³ria 8GB,10.0,127
Motorola Smartphone Motorola Moto G Dual Chip Desbloqueado TIM Android 4.3 Tela 4.5 8GB 3G Wi-Fi CÃ¢mera 5MP - Preto,10.0,126


In [35]:
# Keeping the reference of data frame to another variable
data_pb = df
df
# Printing the final data frame

Unnamed: 0,score,author,product
1005326,10,Paul B,Samsung i897 Captivate Android Smartphone Gala...
453603,10,Yuvraj,"Blu Win JR LTE (Grey, 4GB)"
1010409,10,Pankaj Bhalla,"Lenovo P780 (Deep Black, 4GB)"
866960,6,Bgrazina,Samsung Galaxy XCover 2
498651,2,Joyce D. Pratt,"BLU Vivo XL Smartphone - 5.5"" 4G LTE - GSM Unl..."
...,...,...,...
873202,4,Dudls,Nokia 301 Dual
1267485,8,Cintaaa__,LG Viewty KU990
588916,10,ALBERT M. MASSILLON,BLU Dash JR K Smartphone - Unlocked - Black
102484,2,Amazon Customer,Samsung Galaxy S6 SM-G920F 32GB (FACTORY UNLOC...


### 4. Build a collaborative filtering model using SVD. You can use SVD from surprise or build it from scratch(Note: Incase you’re building it from scratch you can limit your data points to 5000 samples if you face memory issues). Build a collaborative filtering model using kNNWithMeans from surprise. You can try both user-based and item-based model

In [36]:
# arranging columns in the order of user id,item id and rating to be fed in the svd
columns_titles = ['author', 'product', 'score']
rev_backup_df = rev_backup_df.reindex(columns=columns_titles)


In [37]:
# Keep only 5000 data samples. Use random state=612
vs_data = rev_backup_df.sample(n=5000, random_state=612)


In [38]:
# 4. Build a collaborative filtering model using SVD.
reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(vs_data, reader=reader)


In [39]:
trainset = data.build_full_trainset()


In [40]:
trainset.ur


defaultdict(list,
            {0: [(0, 10.0)],
             1: [(1, 10.0)],
             2: [(2, 10.0)],
             3: [(3, 6.0)],
             4: [(4, 2.0)],
             5: [(5, 10.0)],
             6: [(6, 10.0), (1363, 10.0)],
             7: [(7, 10.0)],
             8: [(8, 8.0), (465, 9.0)],
             9: [(9, 8.0)],
             10: [(10, 10.0)],
             11: [(11, 2.0)],
             12: [(12, 8.0)],
             13: [(13, 8.0)],
             14: [(14, 10.0)],
             15: [(15, 10.0)],
             16: [(16, 2.0)],
             17: [(17, 8.0)],
             18: [(18, 10.0)],
             19: [(19, 9.0)],
             20: [(20, 8.0)],
             21: [(21, 10.0),
              (909, 9.0),
              (2202, 6.0),
              (2551, 10.0),
              (3378, 9.0),
              (3614, 10.0)],
             22: [(22, 2.0)],
             23: [(23, 10.0)],
             24: [(24, 8.0)],
             25: [(25, 10.0)],
             26: [(26, 10.0)],
             27:

In [41]:
algo = SVD()
algo.fit(trainset)


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x23448a53eb0>

In [42]:
# Than predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset()


In [43]:
predictions = algo.test(testset)


In [44]:
predictions


[Prediction(uid='Paul B', iid='Blu Win JR LTE (Grey, 4GB)', r_ui=8.0086, est=8.459770908500815, details={'was_impossible': False}),
 Prediction(uid='Paul B', iid='Lenovo P780 (Deep Black, 4GB)', r_ui=8.0086, est=8.260490064475466, details={'was_impossible': False}),
 Prediction(uid='Paul B', iid='Samsung Galaxy XCover 2', r_ui=8.0086, est=8.117405640289123, details={'was_impossible': False}),
 Prediction(uid='Paul B', iid='BLU Vivo XL Smartphone - 5.5" 4G LTE - GSM Unlocked - Solid Gold', r_ui=8.0086, est=8.180260136712052, details={'was_impossible': False}),
 Prediction(uid='Paul B', iid='Samsung S3350 Chat 335 Sim Free Mobile Phone', r_ui=8.0086, est=8.329222885604253, details={'was_impossible': False}),
 Prediction(uid='Paul B', iid='Samsung E1190 Handy (3,6 cm (1,43 Zoll) Display, Dual-Band) titan gray', r_ui=8.0086, est=8.39830062134441, details={'was_impossible': False}),
 Prediction(uid='Paul B', iid='LG Nexus 4 Smartphone, Nero [Italia]', r_ui=8.0086, est=8.731832190500851, det

In [54]:
# Above are the predicted items and their estimated ratings for test user.
def get_top_n(predictions, n=5):
    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n


In [55]:
top_n = get_top_n(predictions, n=5)


In [56]:
top_n


defaultdict(list,
            {'Paul B': [('OnePlus 3 (Graphite, 64 GB)', 9.196839311146277),
              ('OnePlus 3T (Gunmetal, 6GB RAM + 64GB memory)',
               9.175938664749326),
              ('Huawei P8 lite Smartphone, Display 5.0" IPS, Dual Sim, Processore Octa-Core, Memoria 16 GB, Fotocamera 13 MP, Android 5.0, Bianco',
               9.158013367096373),
              ('Samsung Galaxy S6 32GB (Verizon)', 9.069462372017101),
              ('Samsung Galaxy S7 edge Smartphone, 13,9 cm (5,5 Zoll) Display, LTE (4G)',
               9.05990666940936)],
             'Yuvraj': [('Huawei P8 lite Smartphone, Display 5.0" IPS, Dual Sim, Processore Octa-Core, Memoria 16 GB, Fotocamera 13 MP, Android 5.0, Bianco',
               9.065949033792043),
              ('Lenovo Motorola Moto G 4G (2 Generazione) Smartphone, Display 5 Pollici, LTE, Fotocamera 8 MP, Memoria 8 GB, Android 5 Lollipop, Nero [Italia]',
               8.95998676045737),
              ('OnePlus 3 (Graphite, 64 G

In [57]:
# Print the recommended items for each user
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])


Paul B ['OnePlus 3 (Graphite, 64 GB)', 'OnePlus 3T (Gunmetal, 6GB RAM + 64GB memory)', 'Huawei P8 lite Smartphone, Display 5.0" IPS, Dual Sim, Processore Octa-Core, Memoria 16 GB, Fotocamera 13 MP, Android 5.0, Bianco', 'Samsung Galaxy S6 32GB (Verizon)', 'Samsung Galaxy S7 edge Smartphone, 13,9 cm (5,5 Zoll) Display, LTE (4G)']
Yuvraj ['Huawei P8 lite Smartphone, Display 5.0" IPS, Dual Sim, Processore Octa-Core, Memoria 16 GB, Fotocamera 13 MP, Android 5.0, Bianco', 'Lenovo Motorola Moto G 4G (2 Generazione) Smartphone, Display 5 Pollici, LTE, Fotocamera 8 MP, Memoria 8 GB, Android 5 Lollipop, Nero [Italia]', 'OnePlus 3 (Graphite, 64 GB)', 'OnePlus 3T (Gunmetal, 6GB RAM + 64GB memory)', 'OnePlus One (Sandstone Black, 64GB)']
Pankaj Bhalla ['Huawei P8 lite Smartphone, Display 5.0" IPS, Dual Sim, Processore Octa-Core, Memoria 16 GB, Fotocamera 13 MP, Android 5.0, Bianco', 'OnePlus 3T (Gunmetal, 6GB RAM + 64GB memory)', 'Samsung Galaxy S7 edge Smartphone, 13,9 cm (5,5 Zoll) Display, LTE 

**Build a collaborative filtering model using kNNWithMeans from surprise using Item based model**

In [70]:
# Read dataset.
reader = Reader(rating_scale=(1, 10))
data_I = Dataset.load_from_df(vs_data, reader=reader)


In [74]:
trainset_I, testset_I = train_test_split(data_I, test_size=.15)


In [75]:
# Use user_based true/false to switch between user-based or item-based collaborative filtering
algo = KNNWithMeans(k=50, sim_options={
                    'name': 'pearson_baseline', 'user_based': False})
algo.fit(trainset_I)


Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x23655426940>

In [76]:
# run the  model against the testset
test_pred_I = algo.test(testset_I)


In [77]:
test_pred_I


[Prediction(uid='Giovanni', iid='Huawei Ascend G630 Smartphone, 4 GB, Bianco', r_ui=8.0, est=7.995058823529412, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid='copy2775', iid='Sony Xperia Z1', r_ui=8.0, est=7.995058823529412, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid='Karen Howells', iid='Nokia X6 16GB Sim Free Mobile Phone - Black', r_ui=10.0, est=7.995058823529412, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid='Amazon Customer', iid='Apple iPhone 5 - 16GB Black - SIM Free', r_ui=10.0, est=7.995058823529412, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid='Amazon Customer', iid='YU Yuphoria YU5010A (Black+Silver)', r_ui=8.0, est=5.866052619008674, details={'actual_k': 50, 'was_impossible': False}),
 Prediction(uid='Kellynha', iid='Samsung Galaxy Ace', r_ui=10.0, est=7.995058823529412, details={'was_

In [78]:
# get RMSE
print("Item-based Model : Test Set")
accuracy.rmse(test_pred_I, verbose=True)


Item-based Model : Test Set
RMSE: 2.6062


2.606225679810151

**Build a collaborative filtering model using kNNWithMeans from surprise using User based model**

In [79]:
reader = Reader(rating_scale=(1, 10))
data_U = Dataset.load_from_df(vs_data, reader=reader)


In [80]:
trainset_U, testset_U = train_test_split(data_U, test_size=.15)


In [81]:
# Use user_based true/false to switch between user-based or item-based collaborative filtering
algo = KNNWithMeans(k=50, sim_options={
                    'name': 'pearson_baseline', 'user_based': True})
algo.fit(trainset_U)


Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x23655426c40>

In [82]:
# we can now query for specific predicions
uid = 'Frances DeSimone'  # raw user id
iid = 'Samsung Galaxy Star Pro DUOS S7262 Unlocked Ce.'  # raw item id


In [83]:
# get a prediction for specific users and items.
pred = algo.predict(uid, iid, verbose=True)


user: Frances DeSimone item: Samsung Galaxy Star Pro DUOS S7262 Unlocked Ce. r_ui = None   est = 8.02   {'was_impossible': True, 'reason': 'User and/or item is unknown.'}


In [84]:
# run the trained model against the testset
test_pred_U = algo.test(testset_U)


In [85]:
#6. Predict score (average rating) for test users
test_pred_U


[Prediction(uid='Computer In Due', iid='Honor 7 Smartphone 4G, Display Full HD 5.2 Pollici, Processore Kirin 935 Octa Core 2.2 GHz, 16 GB Memoria Interna, 3 GB RAM, Fotocamera 20 MP, Grigio', r_ui=10.0, est=8.018823529411765, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid='kogster', iid='LEAGOO Lead 3 MTK6582 Cell Phones 1.3GHz Quad Core 3G Android 4.4 Smartphone WCDMA Mobile 4.5" QHD IPS 4GB ROM...', r_ui=10.0, est=8.018823529411765, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid='B. Kollmeier', iid='LG Electronics KF510 (Touchpad, 3MP Kamera)', r_ui=8.0, est=8.018823529411765, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid='6546', iid='Sony Xperia M5 Smartphone (guld)', r_ui=8.0, est=8.018823529411765, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid='tetegi', iid='SÃ\x83Â\xad Siemens C45', r_ui=8.0, est=

**5. Evaluate the collaborative model. Print RMSE value.**

In [86]:
print("User-based Model : Test Set")
accuracy.rmse(test_pred_U, verbose=True)


User-based Model : Test Set
RMSE: 2.6043


2.6042974985016016

In [73]:
cross_validate(algo, data, measures=['RMSE'], cv=3, verbose=False)
#RMSE of SVD model is lower than for cross validation.


{'test_rmse': array([2.539945  , 2.61332294, 2.58545407]),
 'fit_time': (0.5536513328552246, 0.6861636638641357, 0.3926737308502197),
 'test_time': (0.019943952560424805, 0.05049943923950195, 0.01822662353515625)}

In [87]:
d_df = df
df.shape


(1000000, 3)

In [65]:
def get_Iu(uid):
    """ return the number of items rated by given user
    args: 
      uid: the id of the user
    returns: 
      the number of items rated by the user
    """
    try:
        return len(trainset.ur[trainset.to_inner_uid(uid)])
    except ValueError:  # user was not part of the trainset
        return 0


def get_Ui(iid):
    """ return number of users that have rated given item
    args:
      iid: the raw id of the item
    returns:
      the number of users that have rated the item.
    """
    try:
        return len(trainset.ir[trainset.to_inner_iid(iid)])
    except ValueError:
        return 0


bf = pd.DataFrame(predictions, columns=['uid', 'iid', 'rui', 'est', 'details'])
bf['Iu'] = bf.uid.apply(get_Iu)
bf['Ui'] = bf.iid.apply(get_Ui)
bf['err'] = abs(bf.est - bf.rui)
best_predictions = bf.sort_values(by='err')[:10]
worst_predictions = bf.sort_values(by='err')[-10:]


In [66]:
best_predictions


Unnamed: 0,uid,iid,rui,est,details,Iu,Ui,err
10920947,hobbesie,Blackberry Storm 9530 Cep Telefonu,8.0086,8.0086,{'was_impossible': False},1,1,3.211797e-08
9560655,Jim B.,Sony Xperia V,8.0086,8.0086,{'was_impossible': False},1,2,5.270443e-08
7406312,Injamamul Golder,Samsung U700,8.0086,8.0086,{'was_impossible': False},1,1,9.594255e-08
5698960,H. Thies,Nokia 7200,8.0086,8.0086,{'was_impossible': False},1,1,1.419834e-07
4758522,GARNEROVICH,LG KP500,8.0086,8.0086,{'was_impossible': False},1,2,2.222486e-07
7534359,Dino B.,Samsung Galaxy Pocket Neo GT-S5310,8.0086,8.0086,{'was_impossible': False},1,1,2.473721e-07
12859650,GS3USER,"Samsung Galaxy Next Turbo 3.14 pollici, Colore...",8.0086,8.0086,{'was_impossible': False},1,1,2.671331e-07
8243147,christiand,"Asus Zenfone Max ZC550KL (White, 2GB, 16GB)",8.0086,8.0086,{'was_impossible': False},1,1,2.761097e-07
700796,marki,"Microsoft Nokia N95 8 GB black (UMTS, MP3, GPS...",8.0086,8.0086,{'was_impossible': False},1,1,3.025085e-07
240967,Sukhitha,Nokia Asha 308,8.0086,8.0086,{'was_impossible': False},1,1,3.361926e-07


**6. Predict score (average rating) for test users**

In [89]:
#Predict score(average rating) for test users
# get a prediction for specific users and items.
pred = algo.predict(uid, iid, verbose=True)
# we can now query for specific predicions
uid = 'Frances DeSimone'  # raw user id
iid = 'Samsung Galaxy Star Pro DUOS S7262 Unlocked Ce.'  # raw item id
# get a prediction for specific users and items.
pred = algo.predict(uid, iid, verbose=True)
test_pred_U


user: Frances DeSimone item: Samsung Galaxy Star Pro DUOS S7262 Unlocked Ce. r_ui = None   est = 8.02   {'was_impossible': True, 'reason': 'User and/or item is unknown.'}
user: Frances DeSimone item: Samsung Galaxy Star Pro DUOS S7262 Unlocked Ce. r_ui = None   est = 8.02   {'was_impossible': True, 'reason': 'User and/or item is unknown.'}


[Prediction(uid='Computer In Due', iid='Honor 7 Smartphone 4G, Display Full HD 5.2 Pollici, Processore Kirin 935 Octa Core 2.2 GHz, 16 GB Memoria Interna, 3 GB RAM, Fotocamera 20 MP, Grigio', r_ui=10.0, est=8.018823529411765, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid='kogster', iid='LEAGOO Lead 3 MTK6582 Cell Phones 1.3GHz Quad Core 3G Android 4.4 Smartphone WCDMA Mobile 4.5" QHD IPS 4GB ROM...', r_ui=10.0, est=8.018823529411765, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid='B. Kollmeier', iid='LG Electronics KF510 (Touchpad, 3MP Kamera)', r_ui=8.0, est=8.018823529411765, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid='6546', iid='Sony Xperia M5 Smartphone (guld)', r_ui=8.0, est=8.018823529411765, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid='tetegi', iid='SÃ\x83Â\xad Siemens C45', r_ui=8.0, est=

**7. Report your findings and inferences**

In [90]:
"""
RMSE of SVD model is lower than for cross validation.
when, author = Frances DeSimone , item: Samsung Galaxy Star Pro DUOS S7262 Unlocked Ce. estimated rating is 8.03
"""


'\nRMSE of SVD model is lower than for cross validation.\nwhen, author = Frances DeSimone , item: Samsung Galaxy Star Pro DUOS S7262 Unlocked Ce. estimated rating is 8.03\n'

**8. Try and recommend top 5 products for test users.**

In [92]:
## Created a utility function for the same 
def get_top_n(predictions, n=5):
    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

top_n = get_top_n(predictions, n=5)


In [93]:
top_n


defaultdict(list,
            {'Paul B': [('OnePlus 3 (Graphite, 64 GB)', 9.196839311146277),
              ('OnePlus 3T (Gunmetal, 6GB RAM + 64GB memory)',
               9.175938664749326),
              ('Huawei P8 lite Smartphone, Display 5.0" IPS, Dual Sim, Processore Octa-Core, Memoria 16 GB, Fotocamera 13 MP, Android 5.0, Bianco',
               9.158013367096373),
              ('Samsung Galaxy S6 32GB (Verizon)', 9.069462372017101),
              ('Samsung Galaxy S7 edge Smartphone, 13,9 cm (5,5 Zoll) Display, LTE (4G)',
               9.05990666940936)],
             'Yuvraj': [('Huawei P8 lite Smartphone, Display 5.0" IPS, Dual Sim, Processore Octa-Core, Memoria 16 GB, Fotocamera 13 MP, Android 5.0, Bianco',
               9.065949033792043),
              ('Lenovo Motorola Moto G 4G (2 Generazione) Smartphone, Display 5 Pollici, LTE, Fotocamera 8 MP, Memoria 8 GB, Android 5 Lollipop, Nero [Italia]',
               8.95998676045737),
              ('OnePlus 3 (Graphite, 64 G

**9. Try other techniques (Example: cross validation) to get better results**

In [94]:
## Using Cross validate method techniques to get better results.
# by default using measures as RMSE
cross_validate(algo, data_U, measures=['RMSE'], cv=3, verbose=False)


Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


{'test_rmse': array([2.62731721, 2.58991519, 2.61119776]),
 'fit_time': (0.32715487480163574, 0.2982027530670166, 0.44780993461608887),
 'test_time': (0.012933731079101562,
  0.012970447540283203,
  0.01893925666809082)}

**10. In what business scenario you should use popularity based Recommendation Systems ?**

Answer: Let us take an example of a website that streams movies. The website is in its nascent stage and has listed all the movies for the users to search and watch. What the website misses here is a recommendation system. This results in users browsing through a long list of movies, with no suggestions about what to watch. This, in turn, reduces the propensity of a user to engage with the website and use its services. Therefore, the simplest way to fix this issue is to use a popularity based recommendation system.

We can assume a scenario where if any product which is usually bought by every new user then there are chances that it may suggest that item to the user who just signed up.They are not sensitive to the interests and tastes of a particular user.It  works on the principle of popularity and or anything which is in trend. These systems check about the product or movie which are in trend or are most popular among the users and directly recommend those.

**11. In what business scenario you should use CF based Recommendation Systems ?**

Answer: Most of business like e-commerce Amazon, or media content business like YouTube, and Netflix can use collaborative filtering as a part of their sophisticated recommendation systems. Basically this technique to build recommenders that give suggestions to a user on the basis of the likes and dislikes of similar users. This kind of business may invest on CF based RS as it simply works by searching a large group of people and finding a smaller set of users with tastes similar to a particular user. It looks at the items they like and combines them to create a ranked list of suggestions.

**12. What other possible methods can you think of which can further improve the recommendation for different users ?**

Answers: Bottlenecks in user-based collaborative filtering models largely arise in the search for neighbours, which are other users who have historically shown similar preferences to a given user, among large user populations

Another can be
Standard Similarity Computation Technique,
Algorithm Using Model Size,
Model-based techniques,
Matrix completion techniques,
Hybrid filtering,
Memory based techniques,
Content-based filtering
