## Contents

[Background](#Background)

[Data Import and Cleaning](#Data_Import)

[Popularity Based Model](#Pop)

[Collaborative Modelling - SVD](#SVD)

[Collaborative Modelling - knnwithmeans](#knn)

[Recommendation](#Rec)

[Inferences](#FnI)

[Cross-Validation](#CV)

[Usage of Popularity Model](#U_Pop)

[Usage of CF Model](#U_CF)

[Improvement](#Imp)


<a id='the_destination'></a>



<a id='Background'></a>
## Background

**DOMAIN:**  Smartphone, Electronics
    
    
**CONTEXT:** India is the second-largest smartphone market in the world after China. About 134 million smartphones were sold across India in 2017, and sales are estimated to rise to about 442 million in 2022. India also ranked second across Asia Pacific in the average time smartphone users spend on the mobile web. The combination of very high sales volumes and this consumer behaviour has made India a very attractive market for foreign vendors. In terms of consumer behaviour, 97% of consumers turn to a search engine when buying a product, versus 15% who turn to social media. If a seller succeeds in showing smartphones that match a user's behaviour or choices in the right place, there is a 90% chance that the user will enquire about them. This case study aims to build a recommendation system based on an individual consumer's behaviour or choices.



**DATA DESCRIPTION:**


• author : name of the person who gave the rating

• country : country the person who gave the rating belongs to

• date : date of the rating

• domain: website from which the rating was taken

• extract: rating content

• language: language in which the rating was given

• product: name of the product/mobile phone for which the rating was given

• score: average rating for the phone

• score_max: highest rating given for the phone

• source: source from where the rating was taken

<a id='Data_Import'></a>
## Data Import and Cleaning

In [1]:
# Import Libraries

import numpy as np
import pandas as pd
import math
import time
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

import scipy.sparse
from scipy.sparse import csr_matrix
import warnings; warnings.simplefilter('ignore')
%matplotlib inline


from surprise import Dataset,Reader
from surprise.model_selection import cross_validate
from surprise import SVD, KNNWithMeans
from surprise import accuracy
from surprise.model_selection import KFold
from surprise.model_selection import GridSearchCV


reader = Reader(rating_scale=(1, 10))

pd.set_option('display.max_columns', 100)
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
# Data Import

df1 = pd.read_csv('phone_user_review_file_1.csv',encoding='ISO-8859-1')
df2 = pd.read_csv('phone_user_review_file_2.csv',encoding='ISO-8859-1')
df3 = pd.read_csv('phone_user_review_file_3.csv',encoding='ISO-8859-1')
df4 = pd.read_csv('phone_user_review_file_4.csv',encoding='ISO-8859-1')
df5 = pd.read_csv('phone_user_review_file_5.csv',encoding='ISO-8859-1')
df6 = pd.read_csv('phone_user_review_file_6.csv',encoding='ISO-8859-1')

In [3]:
# Check a few observations

df1.head(2)
df2.head(2)
df3.head(2)
df4.head(2)
df5.head(2)
df6.head(2)

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
0,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Verizon Wireless,verizonwireless.com,10.0,10.0,As a diehard Samsung fan who has had every Sam...,CarolAnn35,Samsung Galaxy S8
1,/cellphones/samsung-galaxy-s8/,4/28/2017,en,us,Phone Arena,phonearena.com,10.0,10.0,Love the phone. the phone is sleek and smooth ...,james0923,Samsung Galaxy S8


Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
0,/cellphones/leagoo-lead-7/,4/15/2015,en,us,Amazon,amazon.com,2.0,10.0,"The telephone headset is of poor quality , not...",luis,Leagoo Lead7 5.0 Inch HD JDI LTPS Screen 3G Sm...
1,/cellphones/leagoo-lead-7/,5/23/2015,en,gb,Amazon,amazon.co.uk,10.0,10.0,This is my first smartphone so I have nothing ...,Mark Lavin,Leagoo Lead 7 Lead7 MTK6582 Quad core 1GB RAM ...


Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
0,/cellphones/samsung-galaxy-s-iii-slim-sm-g3812/,11/7/2015,pt,br,Submarino,submarino.com.br,6.0,10.0,"recomendo, eu comprei um, a um ano, e agora co...",herlington tesch,Samsung Smartphone Samsung Galaxy S3 Slim G381...
1,/cellphones/samsung-galaxy-s-iii-slim-sm-g3812/,10/2/2015,pt,br,Submarino,submarino.com.br,10.0,10.0,Comprei um pouco desconfiada do site e do celu...,Luisa Silva Marieta,Samsung Smartphone Samsung Galaxy S3 Slim G381...


Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
0,/cellphones/samsung-s7262-duos-galaxy-ace/,3/11/2015,en,us,Amazon,amazon.com,2.0,10.0,was not conpatable with my phone as stated. I ...,Frances DeSimone,Samsung Galaxy Star Pro DUOS S7262 Unlocked Ce...
1,/cellphones/samsung-s7262-duos-galaxy-ace/,17/11/2015,en,in,Zopper,zopper.com,10.0,10.0,Decent Functions and Easy to Operate Pros:- Th...,Expert Review,Samsung Galaxy Star Pro S7262 Black


Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
0,/cellphones/karbonn-k1616/,7/13/2016,en,in,91 Mobiles,91mobiles.com,2.0,10.0,I bought 1 month before. currently speaker is ...,venkatesh,Karbonn K1616
1,/cellphones/karbonn-k1616/,7/13/2016,en,in,91 Mobiles,91mobiles.com,6.0,10.0,"I just bought one week back, I have Airtel con...",Venkat,Karbonn K1616


Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
0,/cellphones/samsung-instinct-sph-m800/,9/16/2011,en,us,Phone Arena,phonearena.com,8.0,10.0,I've had the phone for awhile and it's a prett...,ajabrams95,Samsung Instinct HD
1,/cellphones/samsung-instinct-sph-m800/,2/13/2014,en,us,Amazon,amazon.com,6.0,10.0,to be clear it is not the sellers fault that t...,Stephanie,Samsung SPH M800 Instinct


In [4]:
# Checking column names and data types

df1.info()
df2.info()
df3.info()
df4.info()
df5.info()
df6.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374910 entries, 0 to 374909
Data columns (total 11 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   phone_url  374910 non-null  object 
 1   date       374910 non-null  object 
 2   lang       374910 non-null  object 
 3   country    374910 non-null  object 
 4   source     374910 non-null  object 
 5   domain     374910 non-null  object 
 6   score      366691 non-null  float64
 7   score_max  366691 non-null  float64
 8   extract    371934 non-null  object 
 9   author     371641 non-null  object 
 10  product    374910 non-null  object 
dtypes: float64(2), object(9)
memory usage: 31.5+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114925 entries, 0 to 114924
Data columns (total 11 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   phone_url  114925 non-null  object 
 1   date       114925 non-null  object 
 2   lang       114925 non-n

#### Data merge and basic checks

In [5]:
# Merge the individual data frames into a single dataframe and check a few observations

df = pd.concat([df1,df2,df3,df4,df5,df6],axis=0,sort=False)

df.head()

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
0,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Verizon Wireless,verizonwireless.com,10.0,10.0,As a diehard Samsung fan who has had every Sam...,CarolAnn35,Samsung Galaxy S8
1,/cellphones/samsung-galaxy-s8/,4/28/2017,en,us,Phone Arena,phonearena.com,10.0,10.0,Love the phone. the phone is sleek and smooth ...,james0923,Samsung Galaxy S8
2,/cellphones/samsung-galaxy-s8/,5/4/2017,en,us,Amazon,amazon.com,6.0,10.0,Adequate feel. Nice heft. Processor's still sl...,R. Craig,"Samsung Galaxy S8 (64GB) G950U 5.8"" 4G LTE Unl..."
3,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Samsung,samsung.com,9.2,10.0,Never disappointed. One of the reasons I've be...,Buster2020,Samsung Galaxy S8 64GB (AT&T)
4,/cellphones/samsung-galaxy-s8/,5/11/2017,en,us,Verizon Wireless,verizonwireless.com,4.0,10.0,I've now found that i'm in a group of people t...,S Ate Mine,Samsung Galaxy S8


In [6]:
# Shape of the data

print("The data frame has {} rows / observations and {} columns / features".format(df.shape[0],df.shape[1]))

The data frame has 1415133 rows / observations and 11 columns / features


In [7]:
# Checking datatypes

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1415133 entries, 0 to 163836
Data columns (total 11 columns):
 #   Column     Non-Null Count    Dtype  
---  ------     --------------    -----  
 0   phone_url  1415133 non-null  object 
 1   date       1415133 non-null  object 
 2   lang       1415133 non-null  object 
 3   country    1415133 non-null  object 
 4   source     1415133 non-null  object 
 5   domain     1415133 non-null  object 
 6   score      1351644 non-null  float64
 7   score_max  1351644 non-null  float64
 8   extract    1395772 non-null  object 
 9   author     1351931 non-null  object 
 10  product    1415132 non-null  object 
dtypes: float64(2), object(9)
memory usage: 129.6+ MB


* The data has two float variables (score and score_max); the rest are of object type. This is fine.

#### Data Cleaning and Statistics

In [8]:
# Round off scores

df['score'] = df['score'].round()
df.head()

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
0,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Verizon Wireless,verizonwireless.com,10.0,10.0,As a diehard Samsung fan who has had every Sam...,CarolAnn35,Samsung Galaxy S8
1,/cellphones/samsung-galaxy-s8/,4/28/2017,en,us,Phone Arena,phonearena.com,10.0,10.0,Love the phone. the phone is sleek and smooth ...,james0923,Samsung Galaxy S8
2,/cellphones/samsung-galaxy-s8/,5/4/2017,en,us,Amazon,amazon.com,6.0,10.0,Adequate feel. Nice heft. Processor's still sl...,R. Craig,"Samsung Galaxy S8 (64GB) G950U 5.8"" 4G LTE Unl..."
3,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Samsung,samsung.com,9.0,10.0,Never disappointed. One of the reasons I've be...,Buster2020,Samsung Galaxy S8 64GB (AT&T)
4,/cellphones/samsung-galaxy-s8/,5/11/2017,en,us,Verizon Wireless,verizonwireless.com,4.0,10.0,I've now found that i'm in a group of people t...,S Ate Mine,Samsung Galaxy S8


* Before cleaning the data, we first check a few basic statistics.

In [9]:
# Summary statistics of 'score' variable

df['score'].describe().transpose()

count    1.351644e+06
mean     8.008083e+00
std      2.617634e+00
min      0.000000e+00
25%      7.000000e+00
50%      9.000000e+00
75%      1.000000e+01
max      1.000000e+01
Name: score, dtype: float64

In [10]:
# Find minimum and maximum ratings (score variable)

print('The minimum rating is: %d' %(df['score'].min()))
print('The maximum rating is: %d' %(df['score'].max()))

The minimum rating is: 0
The maximum rating is: 10


In [11]:
# Number of unique reviewers and products in the data

print('Number of unique REVIEWERS in Raw data = ', df['author'].nunique())
print('Number of unique PRODUCTS in Raw data = ', df['product'].nunique())

Number of unique REVIEWERS in Raw data =  801103
Number of unique PRODUCTS in Raw data =  61313


In [12]:
# Most Rated Products

dfprod=df.groupby('product')['product'].count().sort_values(ascending = False ).head(10)
print('Top 10 most rated products are :\n',dfprod)

Top 10 most rated products are :
 product
Lenovo Vibe K4 Note (White,16GB)       5226
Lenovo Vibe K4 Note (Black, 16GB)      4390
OnePlus 3 (Graphite, 64 GB)            4103
OnePlus 3 (Soft Gold, 64 GB)           3563
Huawei P8lite zwart / 16 GB            2707
Samsung Galaxy Express I8730           2686
Lenovo Vibe K5 (Gold, VoLTE update)    2534
Samsung Galaxy S6 zwart / 32 GB        2345
Nokia 5800 XpressMusic                 2125
Lenovo Vibe K5 (Grey, VoLTE update)    2108
Name: product, dtype: int64


In [13]:
# Most Reviewers

dfauth=df.groupby('author')['author'].count().sort_values(ascending = False ).head(10)
print('Top 10 users who rated most are :\n',dfauth)

Top 10 users who rated most are :
 author
Amazon Customer    76978
Cliente Amazon     19304
e-bit               8663
Client d'Amazon     7716
Amazon Kunde        4750
Anonymous           2750
einer Kundin        2610
einem Kunden        1898
unknown             1738
Anonymous           1461
Name: author, dtype: int64


* In the original data the most rated product is "Lenovo Vibe K4 Note (White,16GB)" while the reviewer with most reviews is "Amazon Customer".

In [14]:
# Top 5 sources

dfsource=df.groupby('source')['source'].count().sort_values(ascending = False ).head(5)
print('Most of the data is from the sources :\n',dfsource)

Most of the data is from the sources :
 source
Amazon          728471
Yandex          123066
Ciao             59425
Samsung          45585
MercadoLibre     33531
Name: source, dtype: int64


Now we move on to cleaning the data.

In [15]:
# Identify duplicate records in the data and remove them, if any

dupes = df.duplicated()
if dupes.sum() == 0:
    print("There are no duplicates in the data")
else:
    df = df.drop_duplicates()
    dupes_check = df.duplicated()
    dupes_check.sum()

0

In [16]:
# Check shape of data after removal of duplicates

print("After removal of duplicates the data frame has {} rows / observations and {} columns / features".format(df.shape[0],df.shape[1]))

After removal of duplicates the data frame has 1408697 rows / observations and 11 columns / features


In [17]:
# Check for missing values

pd.DataFrame( df.isnull().sum(), columns= ['Number of missing values'])

Unnamed: 0,Number of missing values
phone_url,0
date,0
lang,0
country,0
source,0
domain,0
score,63093
score_max,63093
extract,19004
author,61815
product,1


* We will impute the missing numeric values, i.e. score and score_max, with the column mean.
* We will not impute the missing author values.

In [18]:
# Substitute with mean rating

df['score'] = df['score'].fillna(df['score'].mean())
df['score_max'] = df['score_max'].fillna(df['score_max'].mean())

In [19]:
# Drop irrelevant columns

df.drop(['phone_url','date','lang','country','source','domain','extract'],axis=1, inplace=True)

In [20]:
# Re-check missing values

pd.DataFrame( df.isnull().sum(), columns= ['Number of missing values'])

Unnamed: 0,Number of missing values
score,0
score_max,0
author,61815
product,1


In [21]:
# Sampling only 1000000 observations

df_sample=df.sample(n = 1000000,random_state=612)

In [22]:
# Check shape of the sampled data

print("The sampled data frame has {} rows / observations and {} columns / features".format(df_sample.shape[0],df_sample.shape[1]))

The sampled data frame has 1000000 rows / observations and 4 columns / features


In [23]:
# Verify the missing values

pd.DataFrame( df_sample.isnull().sum(), columns= ['Number of missing values'])

Unnamed: 0,Number of missing values
score,0
score_max,0
author,43809
product,1


In [24]:
# Most Rated Products

dfprod=df_sample.groupby('product')['product'].count().sort_values(ascending = False ).head(10)
print('Top 10 most rated products are :\n',dfprod)

print()
print()

# Reviewer with most reviews

dfauth=df_sample.groupby('author')['author'].count().sort_values(ascending = False ).head(10)
print('Top 10 users who rated most are :\n',dfauth)

Top 10 most rated products are :
 product
Lenovo Vibe K4 Note (White,16GB)       3709
Lenovo Vibe K4 Note (Black, 16GB)      3083
OnePlus 3 (Graphite, 64 GB)            2890
OnePlus 3 (Soft Gold, 64 GB)           2522
Samsung Galaxy Express I8730           1898
Huawei P8lite zwart / 16 GB            1895
Lenovo Vibe K5 (Gold, VoLTE update)    1801
Samsung Galaxy S6 zwart / 32 GB        1669
Nokia 5800 XpressMusic                 1503
Lenovo Vibe K5 (Grey, VoLTE update)    1488
Name: product, dtype: int64


Top 10 users who rated most are :
 author
Amazon Customer    54542
Cliente Amazon     13661
e-bit               5959
Client d'Amazon     5495
Amazon Kunde        3283
Anonymous           1970
einer Kundin        1890
einem Kunden        1350
unknown             1206
Anonymous           1014
Name: author, dtype: int64


* In line with the original data, the sampled data also gives "Lenovo Vibe K4 Note (White,16GB)" as the most rated product and "Amazon Customer" as the reviewer with the most reviews.

In [25]:
# Further subset the data: keep only rows whose product and author each have more than 50 ratings in the sample

df_final1=df_sample.copy()

final=df_final1[(df_final1.groupby('product')['product'].transform('count')>50) & (df_final1.groupby('author')['author'].transform('count')>50)]

final.head()

Unnamed: 0,score,score_max,author,product
230845,8.0,10.0,Cliente Amazon,"Microsoft Telefonia Lumia 950 XL Smartphone, 3..."
15508,2.0,10.0,Amazon Customer,Nokia Lumia 635 8GB Unlocked GSM 4G LTE Window...
276796,2.0,10.0,Amazon Customer,"Lenovo Used Lenovo Zuk Z1 (Space Grey, 64GB)"
343009,10.0,10.0,Anonymous,Samsung Rant
119096,10.0,10.0,Amazon Customer,"OnePlus 3T (Gunmetal, 6GB RAM + 64GB memory)"


In [26]:
# Checking shape after the subsetting

print("The final data frame has {} rows / observations and {} columns / features".format(final.shape[0],final.shape[1]))

The final data frame has 101504 rows / observations and 4 columns / features
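
The subsetting condition in cell 25 keeps a row only when both its product and its author occur more than 50 times in the sample. Below is a minimal, dependency-free sketch of that logic on toy data. The function name `filter_by_min_count`, the toy rows and the threshold of 1 are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

def filter_by_min_count(rows, min_count):
    """Keep (author, product, score) rows whose author AND product
    each appear more than `min_count` times in `rows`."""
    author_counts = Counter(author for author, _, _ in rows)
    product_counts = Counter(product for _, product, _ in rows)
    return [
        (author, product, score)
        for author, product, score in rows
        if author_counts[author] > min_count and product_counts[product] > min_count
    ]

# Toy data with a threshold of 1 instead of the notebook's 50.
toy_rows = [
    ("ann", "phone A", 8.0),
    ("ann", "phone B", 6.0),
    ("bob", "phone A", 10.0),
    ("cat", "phone C", 2.0),
]
kept = filter_by_min_count(toy_rows, min_count=1)
print(kept)
```

As in the notebook, the counts are computed once, before filtering. A product that passes the threshold can therefore end up with far fewer rows afterwards, because many of its reviewers were removed by the author condition.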


<a id='Pop'></a>
## Popularity Based Model

In [27]:
# Split the data randomly into train and test datasets
# in the ratio 70:30

train_data, test_data = train_test_split(final, test_size = 0.3, random_state=2)
train_data.head()

Unnamed: 0,score,score_max,author,product
35372,10.0,10.0,Stefano,"Samsung Galaxy S7 Smartphone, 32 GB, Nero"
121126,10.0,10.0,Amazon Customer,Samsung Galaxy Note 5 N920C 32GB Factory Unloc...
50819,10.0,10.0,Amazon Customer,"Lenovo Vibe K4 Note (Black, 16GB)"
196632,10.0,10.0,Кирилл,Sony Xperia L (красный)
241091,10.0,10.0,Amazon Customer,"Samsung Galaxy J7 SM-J700F (Black, 16GB)"


In [28]:
# Count of author for each unique product as recommendation score

train_data_grouped = train_data.groupby('product').agg({'author': 'count'}).reset_index()
train_data_grouped.rename(columns = {'author': 'rec_score'},inplace=True)
train_data_grouped.head()

Unnamed: 0,product,rec_score
0,5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9...,5
1,"AICEK Coque ASUS ZenFone 3 Max ZC520TL, AICEK ...",13
2,"AICEK Coque ASUS ZenFone 3 ZE520KL, AICEK Etui...",16
3,"AICEK Coque Samsung Galaxy A3 2016, AICEK Etui...",43
4,"AICEK Coque Samsung Galaxy J3 2016, AICEK Etui...",17


In [29]:
# Sort the products on recommendation score (ties broken by product name)
train_data_sort = train_data_grouped.sort_values(['rec_score', 'product'], ascending=[False, True])

# Generate a recommendation rank based upon score
train_data_sort['Rank'] = train_data_sort['rec_score'].rank(ascending=False, method='first')

# Get the top 5 recommendations
popularity_recommendations = train_data_sort.head(5)
popularity_recommendations

Unnamed: 0,product,rec_score,Rank
1416,"Lenovo Vibe K4 Note (White,16GB)",1539,1.0
1415,"Lenovo Vibe K4 Note (Black, 16GB)",1243,2.0
2222,"OnePlus 3 (Graphite, 64 GB)",921,3.0
2223,"OnePlus 3 (Soft Gold, 64 GB)",864,4.0
1417,"Lenovo Vibe K5 (Gold, VoLTE update)",798,5.0


Thus the top-5 recommendations are :
    
* Lenovo Vibe K4 Note (White,16GB)
* Lenovo Vibe K4 Note (Black, 16GB)
* OnePlus 3 (Graphite, 64 GB)
* OnePlus 3 (Soft Gold, 64 GB)
* Lenovo Vibe K5 (Gold, VoLTE update)

The popularity-based model can also be used for prediction. However, because it is not personalised, it returns the same recommendations for every user.

In [30]:
# Use popularity based recommender model to make predictions
def recommend(user_id):
    # Work on a copy so the shared popularity table is not modified
    user_recommendations = popularity_recommendations.copy()

    # Add the user_id as the first column of the recommendations
    user_recommendations.insert(0, 'author', user_id)

    return user_recommendations

In [31]:
find_recom = ['Laura','Joshua']   # This list is user choice.
for names in find_recom:
    print("Here is the recommendation for the author: \n", names)
    print(recommend(names))    
    print("\n") 

Here is the recommendation for the author: 
 Laura
     author                              product  rec_score  Rank
1416  Laura     Lenovo Vibe K4 Note (White,16GB)       1539   1.0
1415  Laura    Lenovo Vibe K4 Note (Black, 16GB)       1243   2.0
2222  Laura          OnePlus 3 (Graphite, 64 GB)        921   3.0
2223  Laura         OnePlus 3 (Soft Gold, 64 GB)        864   4.0
1417  Laura  Lenovo Vibe K5 (Gold, VoLTE update)        798   5.0


Here is the recommendation for the author: 
 Joshua
      author                              product  rec_score  Rank
1416  Joshua     Lenovo Vibe K4 Note (White,16GB)       1539   1.0
1415  Joshua    Lenovo Vibe K4 Note (Black, 16GB)       1243   2.0
2222  Joshua          OnePlus 3 (Graphite, 64 GB)        921   3.0
2223  Joshua         OnePlus 3 (Soft Gold, 64 GB)        864   4.0
1417  Joshua  Lenovo Vibe K5 (Gold, VoLTE update)        798   5.0




* Thus, both users receive the same recommendations: the top products from the popularity model.
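
The popularity model reduces to counting ratings per product and returning the same top-N list to everyone. A small standalone sketch of that idea follows; the names `top_n_popular` and `recommend_popular` and the toy ratings are hypothetical and only mirror the pandas cells in this section.

```python
from collections import Counter

def top_n_popular(ratings, n=5):
    """Rank products by number of ratings, breaking ties alphabetically."""
    counts = Counter(product for _, product, _ in ratings)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]

def recommend_popular(user_id, ratings, n=5):
    """Non-personalised: the user id is attached but never used for ranking."""
    return [(user_id, product, count) for product, count in top_n_popular(ratings, n)]

toy = [
    ("u1", "phone B", 9.0), ("u2", "phone B", 7.0),
    ("u1", "phone A", 8.0), ("u3", "phone A", 10.0),
    ("u2", "phone C", 4.0),
]
top2 = top_n_popular(toy, n=2)
print(top2)
print(recommend_popular("Laura", toy, n=2))
print(recommend_popular("Joshua", toy, n=2))
```

The ranking depends only on the rating counts, so the lists for any two users differ only in the attached user id.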

<a id='SVD'></a>
## Collaborative Modelling (SVD)

In [32]:
# Read data in Surprise library

data = Dataset.load_from_df(final[['author','product','score']], reader)

data.df.head(2)

Unnamed: 0,author,product,score
230845,Cliente Amazon,"Microsoft Telefonia Lumia 950 XL Smartphone, 3...",8.0
15508,Amazon Customer,Nokia Lumia 635 8GB Unlocked GSM 4G LTE Window...,2.0


In [33]:
# Split data to train and test (this import shadows the sklearn train_test_split imported earlier)

from surprise.model_selection import train_test_split
trainset, testset = train_test_split(data, test_size=.3,random_state=123)


In [34]:
# Train the algorithm

algo_svd = SVD(n_factors=5,biased=False,random_state=88)
algo_svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x200ad9b72e0>

In [35]:
# Run on test set

test_pred_svd = algo_svd.test(testset)

In [36]:
# Getting the scores for test users

pred_svd = pd.DataFrame(test_pred_svd)

pred_svd.head()

Unnamed: 0,uid,iid,r_ui,est,details
0,DJ,Samsung U600,2.0,4.200405,{'was_impossible': False}
1,Amazon Customer,"Lenovo Vibe K4 Note (Black, 16GB)",8.0,6.724184,{'was_impossible': False}
2,Amazon Customer,"Apple iPhone 6 Unlocked Cellphone, 16GB, Gold",8.0,8.272434,{'was_impossible': False}
3,Sarah,LG Electronics GM360 Viewty Plus Smartphone (7...,4.0,5.593367,{'was_impossible': False}
4,Amazon Customer,Lenovo A1000 (White),10.0,4.839986,{'was_impossible': False}


In [37]:
# Calculating the RMSE

print("SVD Model : Test Set")
accuracy.rmse(test_pred_svd, verbose=True)

SVD Model : Test Set
RMSE: 2.6750


2.6750148472520907
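
RMSE is the square root of the mean squared difference between the actual rating `r_ui` and the estimate `est`. The sketch below recomputes it by hand for the five predictions displayed by `pred_svd.head()`. It is only an illustration of the formula on those five rows (the helper name `rmse` is an assumption), so the value is not expected to match the full test-set RMSE exactly.

```python
import math

def rmse(pairs):
    """Root mean squared error over (actual, estimated) rating pairs."""
    squared_errors = [(actual - est) ** 2 for actual, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# (r_ui, est) for the five SVD predictions shown in the table above
head_pairs = [
    (2.0, 4.200405),
    (8.0, 6.724184),
    (8.0, 8.272434),
    (4.0, 5.593367),
    (10.0, 4.839986),
]
head_rmse = rmse(head_pairs)
print(round(head_rmse, 4))
```

Because errors are squared before averaging, the single large miss in the last row (10.0 actual versus about 4.84 estimated) contributes most of the total.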

<a id='knn'></a>
## Collaborative Recommendation (KNNwithMeans)

#### User-user collaboration

In [38]:
# Training the algorithm

algo_knn_u = KNNWithMeans(k=50, sim_options={'name': 'pearson_baseline', 'user_based': True})
algo_knn_u.fit(trainset)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x2009516b610>

In [39]:
# Run the trained model against the testset

test_pred_knn_u = algo_knn_u.test(testset)

In [40]:
# Getting the scores for test users

pred_knn_u = pd.DataFrame(test_pred_knn_u)

pred_knn_u.head()

Unnamed: 0,uid,iid,r_ui,est,details
0,DJ,Samsung U600,2.0,6.875,"{'actual_k': 0, 'was_impossible': False}"
1,Amazon Customer,"Lenovo Vibe K4 Note (Black, 16GB)",8.0,6.68,"{'actual_k': 50, 'was_impossible': False}"
2,Amazon Customer,"Apple iPhone 6 Unlocked Cellphone, 16GB, Gold",8.0,7.07567,"{'actual_k': 18, 'was_impossible': False}"
3,Sarah,LG Electronics GM360 Viewty Plus Smartphone (7...,4.0,4.57973,"{'actual_k': 1, 'was_impossible': False}"
4,Amazon Customer,Lenovo A1000 (White),10.0,6.195122,"{'actual_k': 41, 'was_impossible': False}"


In [41]:
# Calculating the RMSE

print("User-based Model : Test Set")
accuracy.rmse(test_pred_knn_u, verbose=True)

User-based Model : Test Set
RMSE: 2.7252


2.725154297616089
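
`KNNWithMeans` estimates a rating as the target's own mean rating plus a similarity-weighted average of how far each neighbour's rating deviates from that neighbour's mean. The sketch below is a simplified, standalone illustration of that formula, not the library's code: the function name and the toy numbers are assumptions, and clipping to the rating scale is omitted.

```python
def knn_with_means_estimate(target_mean, neighbours, k=50):
    """Mean-centred kNN estimate for one (user, item) pair.

    neighbours: list of (similarity, neighbour_rating, neighbour_mean)
    for the other users who rated the item. Only the k most similar
    neighbours with positive similarity contribute.
    Returns (estimate, actual_k).
    """
    usable = sorted((n for n in neighbours if n[0] > 0), reverse=True)[:k]
    if not usable:
        # No usable neighbour: fall back to the target's mean rating.
        return target_mean, 0
    weighted_dev = sum(sim * (rating - mean) for sim, rating, mean in usable)
    total_sim = sum(sim for sim, _, _ in usable)
    return target_mean + weighted_dev / total_sim, len(usable)

# Toy example: two positively correlated neighbours, one negative (ignored).
est, actual_k = knn_with_means_estimate(
    target_mean=6.0,
    neighbours=[(0.5, 9.0, 7.0), (0.25, 4.0, 6.0), (-0.3, 10.0, 5.0)],
)
print(round(est, 4), actual_k)
print(knn_with_means_estimate(6.0, []))
```

When no neighbour is usable, the estimate is just the mean. That would be consistent with the rows in the prediction tables that report `actual_k: 0`.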

#### Item based collaboration

In [42]:
# Training the algorithm

algo_knn_i = KNNWithMeans(k=50, sim_options={'name': 'pearson_baseline', 'user_based': False})
algo_knn_i.fit(trainset)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x20095181dc0>

In [43]:
# Run the trained model against the testset

test_pred_knn_i = algo_knn_i.test(testset)

In [44]:
# Getting the scores for test users

pred_knn_i = pd.DataFrame(test_pred_knn_i)

pred_knn_i.head()

Unnamed: 0,uid,iid,r_ui,est,details
0,DJ,Samsung U600,2.0,5.66788,"{'actual_k': 0, 'was_impossible': False}"
1,Amazon Customer,"Lenovo Vibe K4 Note (Black, 16GB)",8.0,6.68,"{'actual_k': 50, 'was_impossible': False}"
2,Amazon Customer,"Apple iPhone 6 Unlocked Cellphone, 16GB, Gold",8.0,7.024801,"{'actual_k': 50, 'was_impossible': False}"
3,Sarah,LG Electronics GM360 Viewty Plus Smartphone (7...,4.0,3.957831,"{'actual_k': 21, 'was_impossible': False}"
4,Amazon Customer,Lenovo A1000 (White),10.0,6.183125,"{'actual_k': 50, 'was_impossible': False}"


In [45]:
# Calculating the RMSE

print("Item-based Model : Test Set")
accuracy.rmse(test_pred_knn_i, verbose=True)

Item-based Model : Test Set
RMSE: 2.6885


2.688461856391547

<a id='Rec'></a>
## Recommending for Test users

For recommendation we use the SVD model since it has the lowest RMSE. Using that model, we look at the predictions for test users Laura and Joshua.

In [46]:
# Test-set items of users Laura and Joshua, top 5 by actual rating (r_ui), shown with the SVD estimate (est)

pred = pd.DataFrame(test_pred_svd)
pred[pred['uid'] == 'Laura'][['iid', 'r_ui','est']].sort_values(by = 'r_ui',ascending = False).head(5)
pred[pred['uid'] == 'Joshua'][['iid', 'r_ui','est']].sort_values(by = 'r_ui',ascending = False).head(5)

Unnamed: 0,iid,r_ui,est
9,"Huawei Y6 Smartphone, Display 5.0"" HD, IPS, 2 ...",10.0,7.738639
12053,Motorola Moto E (2nd Gen.) - Smartphone libre ...,10.0,7.730575
29767,Asus ZE551ML-2A760WW Smartphone ZenFone 2 Delu...,10.0,7.283121
29125,Huawei Ascend P1 - Smartphone libre Android (p...,10.0,7.926389
28431,"Nokia C2-01, Nero [Italia]",10.0,6.602836


Unnamed: 0,iid,r_ui,est
13,Sim Free Samsung Galaxy S7 Edge Mobile Phone -...,10.0,10.0
122,Microsoft Nokia 5800 XpressMusic Comes with Mu...,10.0,5.67407
164,Tech Armor Samsung Galaxy S3 S III Premium Hig...,10.0,9.253299
10441,"Apple iPhone 5 Unlocked Cellphone, 32GB, Black",10.0,5.083313
18529,"BLU Vivo XL Smartphone - 5.5"" 4G LTE - GSM Unl...",10.0,6.191583
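
The two tables above are ordered by the actual rating `r_ui`, as specified in the `sort_values(by = 'r_ui')` call. To rank items by what the model predicts, the sort key would be `est` instead. A small sketch using Joshua's five rows copied from the table (the helper name `top_n_by_estimate` is hypothetical):

```python
# (iid, r_ui, est) rows copied from the Joshua table above
joshua_rows = [
    ("Sim Free Samsung Galaxy S7 Edge Mobile Phone -...", 10.0, 10.0),
    ("Microsoft Nokia 5800 XpressMusic Comes with Mu...", 10.0, 5.67407),
    ("Tech Armor Samsung Galaxy S3 S III Premium Hig...", 10.0, 9.253299),
    ("Apple iPhone 5 Unlocked Cellphone, 32GB, Black", 10.0, 5.083313),
    ('BLU Vivo XL Smartphone - 5.5" 4G LTE - GSM Unl...', 10.0, 6.191583),
]

def top_n_by_estimate(rows, n=5):
    """Return the n rows with the highest estimated rating."""
    return sorted(rows, key=lambda row: row[2], reverse=True)[:n]

top3 = top_n_by_estimate(joshua_rows, n=3)
for iid, r_ui, est in top3:
    print(f"{est:6.2f}  {iid}")
```

In the notebook's pandas code, the equivalent change would be `sort_values(by='est', ascending=False)`.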


<a id='FnI'></a>
## Findings and Inferences

* Based on test-set RMSE, the SVD model performs best (RMSE = 2.675), ahead of item-based KNNWithMeans (2.688) and user-based KNNWithMeans (2.725). The differences are small, so the three models are roughly comparable in accuracy.
* The collaborative model is better than the popularity-based model because it bases its recommendations on the past behaviour of the user and is therefore personalised.
* Users Laura and Joshua receive the same recommendations from the popularity model. The recommendations for them from the collaborative model, however, differ from each other and also from the popularity-based list.

<a id='CV'></a>
## Cross Validation

Several cross-validation techniques can be applied. Here we will use three methods:

* Basic Cross-Validation
* Cross-Validation Iterator k-folds
* Grid search method

All three use SVD on the full filtered data set (`data`, built from `final`).

#### Basic Cross Validation

In [47]:
algo = SVD(random_state=88)

# Run 3-fold cross-validation and print results
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    2.6449  2.6552  2.6688  2.6563  0.0098  
MAE (testset)     2.0369  2.0536  2.0688  2.0531  0.0130  
Fit time          7.38    7.12    8.06    7.52    0.40    
Test time         0.41    0.61    0.76    0.59    0.14    


{'test_rmse': array([2.64486823, 2.65521073, 2.66877019]),
 'test_mae': array([2.03686369, 2.05355903, 2.06879346]),
 'fit_time': (7.384751796722412, 7.115077972412109, 8.064022302627563),
 'test_time': (0.4073822498321533, 0.6099729537963867, 0.7594940662384033)}

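* 3-fold cross-validation of SVD with default hyperparameters gives a mean RMSE of 2.656, about 0.7% lower than the 2.675 of the hold-out SVD model above. The difference comes from the different model settings and data splits; cross-validation itself only provides a more reliable error estimate.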

#### Cross Validation Iterator k-folds

In [48]:

# define a cross-validation iterator
kf = KFold(n_splits=3)

algo = SVD(random_state=88)

for trainset, testset in kf.split(data):

    # train and test algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)

    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions, verbose=True)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x20095156a60>

RMSE: 2.6497


2.649700267833995

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x20095156a60>

RMSE: 2.6485


2.648531891583128

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x20095156a60>

RMSE: 2.6687


2.6687192974876246

* The three folds give RMSE values between 2.65 and 2.67, in line with the basic cross-validation result.
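
A k-fold iterator partitions the ratings into k folds and uses each fold once as the test set while training on the rest. The sketch below shows that splitting logic on plain indices. It is a simplified illustration with a hypothetical function name `k_fold_indices`, not Surprise's `KFold` implementation.

```python
import random

def k_fold_indices(n_samples, n_splits=3, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_splits folds of near-equal size.
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(10, n_splits=3))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Every rating is used for testing exactly once, which is why the per-fold RMSE values can be averaged into a single estimate.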

#### Grid Search

In [49]:

# Define parameter grid

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}

# Apply grid search


gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)

# Best RMSE score
print(gs.best_score['rmse'])

# Combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

2.5970420647000747
{'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}


* The best grid-search configuration gives a mean cross-validated RMSE of 2.597, close to 3% lower than the 2.675 of the hold-out SVD model.

Thus, SVD with default or tuned hyperparameters, evaluated with cross-validation, lowers the RMSE by about 1-3%, which improves the model only marginally.
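
The parameter grid above has 2 x 2 x 2 = 8 combinations, and grid search scores each one and keeps the best. The sketch below shows the enumeration and selection step. The helpers `expand_grid` and `grid_search` are hypothetical, and `toy_score` is an arbitrary stand-in for the mean cross-validated RMSE, so its numbers are not results from the data.

```python
from itertools import product

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005], 'reg_all': [0.4, 0.6]}

def expand_grid(grid):
    """List every combination of the grid as a dict of parameters."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

def grid_search(grid, score_fn):
    """Return (best_score, best_params) minimising score_fn(params)."""
    scored = ((score_fn(params), params) for params in expand_grid(grid))
    return min(scored, key=lambda pair: pair[0])

def toy_score(params):
    # Arbitrary stand-in for a cross-validated RMSE; NOT a real result.
    return abs(params['reg_all'] - 0.4) + abs(params['lr_all'] - 0.005) + 1.0 / params['n_epochs']

combos = expand_grid(param_grid)
best_score, best_params = grid_search(param_grid, toy_score)
print(len(combos), best_params)
```

With `cv=3`, each of the 8 combinations is fitted three times, so the grid search in cell 49 trains 24 models in total.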

<a id='U_Pop'></a>
## Usage of Popularity Based Recommendation System

* A popularity-based recommendation system is preferred when a user is new to the website or app. A new user has made few or no purchases, so there is no history to personalise from, yet it is still important to give the user some direction rather than leaving the choice entirely to him or her. In such cases a popularity-based system works best: the website can recommend to the new user the products that other users have purchased or rated most.

<a id='U_CF'></a>
## Usage of CF Based Recommendation System

* Collaborative-filtering-based recommendation is used when the past purchase behaviour of users is available. It is usually better than a popularity-based system because it gives personalised recommendations. It may, however, run into two types of problems:

1. Cold Start - a user has not rated or purchased any product, or a product has not been purchased or rated by anybody. In these situations it is difficult to provide recommendations with a collaborative algorithm. We can fall back on the popularity-based system, or use the user's profile to generate recommendations.

2. Grey Sheep - a user has rated or purchased only a few products highly, and no other user has rated those products highly. Since there is no overlap between this user and the others, the number of neighbours is 0 and recommendation becomes difficult. We can fall back on the popularity-based system, or on a content-based system that uses the user's profile information.
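
The fallback described for both problems can be sketched as a small routing function: use personalised recommendations when the user has enough history, otherwise return the popularity list. All names, the `min_ratings` threshold and the toy data below are illustrative assumptions.

```python
def recommend_with_fallback(user, user_history, personalised, popular, min_ratings=1, n=5):
    """Use personalised recommendations when the user has enough history,
    otherwise fall back to the popularity list. Already-rated items are skipped."""
    history = user_history.get(user, [])
    if len(history) >= min_ratings and personalised.get(user):
        candidates = personalised[user]
    else:
        candidates = popular
    seen = set(history)
    return [item for item in candidates if item not in seen][:n]

popular = ["phone A", "phone B", "phone C"]
user_history = {"ann": ["phone A"]}
personalised = {"ann": ["phone C", "phone A", "phone D"]}

print(recommend_with_fallback("ann", user_history, personalised, popular))
print(recommend_with_fallback("new_user", user_history, personalised, popular))
```

A known user gets the personalised list minus items already rated, while an unknown user gets the popularity list unchanged.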

<a id='Imp'></a>
## Improvements

Before discussing improvements to the recommendation system itself, we discuss improvements to the data.

The main drawback of the data is that it has no identifier such as a customer id. Many author names are very generic, like Amazon Customer. Even names that look specific, like Laura or Sarah, may refer to different people who share the same name. The recommendation system suffers as a result: a customer recorded as Amazon Customer, or under any other shared name, may end up getting the wrong recommendations.
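
One partial mitigation, offered here only as a sketch and not something the notebook does, is to drop reviews whose author is a generic placeholder before building the user-item data. The placeholder set below is taken from the top-10 reviewer list earlier in the notebook. The function name is hypothetical, and treating whitespace and case variants as the same label is an assumption.

```python
PLACEHOLDER_AUTHORS = {
    "amazon customer", "cliente amazon", "client d'amazon", "amazon kunde",
    "anonymous", "einer kundin", "einem kunden", "unknown", "e-bit",
}

def is_placeholder(author):
    """True for missing names or generic labels shared by many people."""
    if author is None:
        return True
    return author.strip().lower() in PLACEHOLDER_AUTHORS

reviews = [
    ("Amazon Customer", "phone A", 10.0),
    (" Anonymous ", "phone B", 6.0),
    ("Laura", "phone A", 8.0),
    (None, "phone C", 4.0),
]
identified = [row for row in reviews if not is_placeholder(row[0])]
print(identified)
```

This does not solve the problem of different people sharing a real-looking name, which still needs a proper customer id.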

This data appears to contain only feedback / reviews. Combining it with the purchase behaviour or browsing / search behaviour of customers, along with a unique customer id, would result in a better recommendation system.

Example: based on purchase behaviour we can use market basket analysis, based on browsing history we can use collaborative filtering, and based on search we can use content-based recommendation. The individual models can be combined into a hybrid model by assigning weights to the different models. This may result in a better recommendation system.
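
A weighted hybrid of this kind can be sketched as a weighted average of per-model item scores. The model names, weights and scores below are made up for illustration, and the sketch assumes each model's scores have already been scaled to a common range such as [0, 1].

```python
def hybrid_scores(model_scores, weights):
    """Weighted average of per-model item scores.

    model_scores: {model_name: {item: score in [0, 1]}}
    weights: {model_name: non-negative weight}
    An item missing from a model contributes 0 for that model.
    """
    total_weight = sum(weights.values())
    items = set().union(*(scores.keys() for scores in model_scores.values()))
    return {
        item: sum(weights[m] * model_scores[m].get(item, 0.0) for m in weights) / total_weight
        for item in items
    }

model_scores = {
    "collaborative": {"phone A": 0.8, "phone B": 0.4},
    "content": {"phone A": 0.2, "phone C": 1.0},
}
weights = {"collaborative": 3, "content": 1}

combined = hybrid_scores(model_scores, weights)
ranked = sorted(combined, key=lambda item: -combined[item])
print(ranked)
```

The weights themselves would need to be chosen by validation, for example with the same cross-validation approach used for the SVD hyperparameters.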

### Appendix - Installing surprise package

In [57]:
conda install -c conda-forge scikit-surprise

Collecting package metadata (current_repodata.json): ...working... done
Note: you may need to restart the kernel to use updated packages.
Solving environment: ...working... done

## Package Plan ##

  environment location: C:\Users\User\anaconda3

  added / updated specs:
    - scikit-surprise


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    conda-4.10.1               |   py38haa244fe_0         3.1 MB  conda-forge
    scikit-surprise-1.1.1      |   py38h1e00858_1         567 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         3.6 MB

The following NEW packages will be INSTALLED:

  scikit-surprise    conda-forge/win-64::scikit-surprise-1.1.1-py38h1e00858_1

The following packages will be UPDATED:

  conda                                4.9.2-py38haa244fe_0 --> 4.10.1-py38haa244fe_0



Downloading and 