***DOMAIN:*** Smartphone, Electronics<br>
***CONTEXT:*** India is the second largest market globally for smartphones after China. About 134 million smartphones were sold across India
in the year 2017 and is estimated to increase to about 442 million in 2022. India ranked second in the average time spent on mobile web by
smartphone users across Asia Pacific. The combination of very high sales volumes and the average smartphone consumer behaviour has
made India a very attractive market for foreign vendors. As per Consumer behaviour, 97% of consumers turn to a search engine when they
are buying a product vs. 15% who turn to social media. If a seller succeeds to publish smartphones based on user’s behaviour/choice at the
right place, there are 90% chances that user will enquire for the same. This Case Study is targeted to build a recommendation system
based on individual consumer’s behaviour or choice.

***DATA DESCRIPTION:***<br>
***• author :*** name of the person who gave the rating<br>
***• country :*** country the person who gave the rating belongs to<br>
***• data :*** date of the rating<br>
***• domain:*** website from which the rating was taken from<br>
***• extract:*** rating content<br>
***• language:*** language in which the rating was given<br>
***• product:*** name of the product/mobile phone for which the rating was given<br>
***• score:*** average rating for the phone<br>
***• score_max:*** highest rating given for the phone<br>
***• source:*** source from where the rating was taken<br>

**PROJECT OBJECTIVE:** We will build a recommendation system using popularity based and collaborative filtering methods to recommend mobile phones to a user which are most popular and personalised respectively..


**1. Import the necessary libraries and read the provided CSVs as a data frame**

In [52]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split

from collections import defaultdict

from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise import KNNWithMeans

from sklearn.model_selection import KFold
from surprise.model_selection import cross_validate

***Merge the provided CSVs into one data-frame.***

In [2]:
csv_file_list = ["phone_user_review_file_1.csv", "phone_user_review_file_2.csv","phone_user_review_file_3.csv","phone_user_review_file_4.csv","phone_user_review_file_5.csv","phone_user_review_file_6.csv"]

list_of_dataframes = []
for filename in csv_file_list:
    print(filename)
    list_of_dataframes.append(pd.read_csv(filename,encoding='latin1'))

phones_df = pd.concat(list_of_dataframes)


phone_user_review_file_1.csv
phone_user_review_file_2.csv
phone_user_review_file_3.csv
phone_user_review_file_4.csv
phone_user_review_file_5.csv
phone_user_review_file_6.csv


***Check a few observations and shape of the data-frame***

In [3]:
phones_df.head()

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
0,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Verizon Wireless,verizonwireless.com,10.0,10.0,As a diehard Samsung fan who has had every Sam...,CarolAnn35,Samsung Galaxy S8
1,/cellphones/samsung-galaxy-s8/,4/28/2017,en,us,Phone Arena,phonearena.com,10.0,10.0,Love the phone. the phone is sleek and smooth ...,james0923,Samsung Galaxy S8
2,/cellphones/samsung-galaxy-s8/,5/4/2017,en,us,Amazon,amazon.com,6.0,10.0,Adequate feel. Nice heft. Processor's still sl...,R. Craig,"Samsung Galaxy S8 (64GB) G950U 5.8"" 4G LTE Unl..."
3,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Samsung,samsung.com,9.2,10.0,Never disappointed. One of the reasons I've be...,Buster2020,Samsung Galaxy S8 64GB (AT&T)
4,/cellphones/samsung-galaxy-s8/,5/11/2017,en,us,Verizon Wireless,verizonwireless.com,4.0,10.0,I've now found that i'm in a group of people t...,S Ate Mine,Samsung Galaxy S8


In [4]:
row,col = phones_df.shape
print("Number of rows: {}".format(row))
print("Number of columns: {}".format(col))

Number of rows: 1415133
Number of columns: 11


In [5]:
phones_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1415133 entries, 0 to 163836
Data columns (total 11 columns):
 #   Column     Non-Null Count    Dtype  
---  ------     --------------    -----  
 0   phone_url  1415133 non-null  object 
 1   date       1415133 non-null  object 
 2   lang       1415133 non-null  object 
 3   country    1415133 non-null  object 
 4   source     1415133 non-null  object 
 5   domain     1415133 non-null  object 
 6   score      1351644 non-null  float64
 7   score_max  1351644 non-null  float64
 8   extract    1395772 non-null  object 
 9   author     1351931 non-null  object 
 10  product    1415132 non-null  object 
dtypes: float64(2), object(9)
memory usage: 129.6+ MB


In [6]:
phones_df.isna().sum()

phone_url        0
date             0
lang             0
country          0
source           0
domain           0
score        63489
score_max    63489
extract      19361
author       63202
product          1
dtype: int64

There are few NA's in score, score_max, extract and author columns. Dropping off the records having NA's

In [7]:
phones_cl = phones_df.dropna()

***Round off scores to the nearest integers.***

In [8]:
phones_cl.loc[:, ('score', 'score_max')]

Unnamed: 0,score,score_max
0,10.0,10.0
1,10.0,10.0
2,6.0,10.0
3,9.2,10.0
4,4.0,10.0
...,...,...
163832,2.0,10.0
163833,10.0,10.0
163834,2.0,10.0
163835,8.0,10.0


In [9]:
phones_cl.loc[:, ('score', 'score_max')] = phones_cl.loc[:, ('score', 'score_max')].round()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(loc, v)


***Check for missing values. Impute the missing values if there is any***<br>
MIssing values are dropped as the number of records are 63000+ out of 1400000 records.

***Check for duplicate values and remove them if there is any.***

In [10]:
phones_cl.drop_duplicates()

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
0,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Verizon Wireless,verizonwireless.com,10.0,10.0,As a diehard Samsung fan who has had every Sam...,CarolAnn35,Samsung Galaxy S8
1,/cellphones/samsung-galaxy-s8/,4/28/2017,en,us,Phone Arena,phonearena.com,10.0,10.0,Love the phone. the phone is sleek and smooth ...,james0923,Samsung Galaxy S8
2,/cellphones/samsung-galaxy-s8/,5/4/2017,en,us,Amazon,amazon.com,6.0,10.0,Adequate feel. Nice heft. Processor's still sl...,R. Craig,"Samsung Galaxy S8 (64GB) G950U 5.8"" 4G LTE Unl..."
3,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Samsung,samsung.com,9.0,10.0,Never disappointed. One of the reasons I've be...,Buster2020,Samsung Galaxy S8 64GB (AT&T)
4,/cellphones/samsung-galaxy-s8/,5/11/2017,en,us,Verizon Wireless,verizonwireless.com,4.0,10.0,I've now found that i'm in a group of people t...,S Ate Mine,Samsung Galaxy S8
...,...,...,...,...,...,...,...,...,...,...,...
163832,/cellphones/alcatel-ot-club_1187/,5/12/2000,de,de,Ciao,ciao.de,2.0,10.0,Weil mein Onkel bei ALcatel arbeitet habe ich ...,david.paul,Alcatel Club Plus Handy
163833,/cellphones/alcatel-ot-club_1187/,5/11/2000,de,de,Ciao,ciao.de,10.0,10.0,Hy Liebe Leserinnen und Leser!! Ich habe seit ...,Christiane14,Alcatel Club Plus Handy
163834,/cellphones/alcatel-ot-club_1187/,5/4/2000,de,de,Ciao,ciao.de,2.0,10.0,"Jetzt hat wohl Alcatell gedacht ,sie machen wa...",michaelawr,Alcatel Club Plus Handy
163835,/cellphones/alcatel-ot-club_1187/,5/1/2000,de,de,Ciao,ciao.de,8.0,10.0,Ich bin seit 2 Jahren (stolzer) Besitzer eines...,claudia0815,Alcatel Club Plus Handy


4500+ records has been removed as duplicates.

***Keep only 1000000 data samples. Use random state=612.***

# Reduced the sample size to 5000 as we are getting memory issue.

In [11]:
phones_sampled = phones_cl.sample(n=5000,random_state=612)

***Drop irrelevant features. Keep features like Author, Product, and Score***

In [12]:
phones_fl = phones_sampled[['author','product','score']]

In [13]:
phones_fl.head(5)

Unnamed: 0,author,product,score
292711,Giuseppe Calavaro,"Alcatel One Touch 20-04G Telefono Cellulare, Nero",6.0
78482,Buraian22,Huawei M750,2.0
126183,badamyan.karen,Nokia C7-00,10.0
32139,Amazon Customer,Binatone SM800 Touch Screen Big Button Sim Fre...,10.0
17325,unknown,Samsung Samsung Galaxy A5 2016 - Wit,6.0


***Identify the most rated features***

In [14]:
phones_fl[phones_fl['score'].values == phones_fl['score'].median()]

Unnamed: 0,author,product,score
86389,Petras,Samsung SGH-X700,9.0
149640,BruceDude,S46,9.0
160977,pklat,Samsung GALAXY A3 (2016) A310F white Android S...,9.0
85798,J.R90,Samsung Galaxy S6 edge zwart / 32 GB,9.0
107910,Maris123654,Samsung Galaxy S6 zwart / 32 GB,9.0
...,...,...,...
131789,Anoniem,Samsung Galaxy A3 (2017) blauw / 16 GB,9.0
55772,Mr Muskel,Huawei Mate 9 Pro,9.0
181825,Mr Perfect,Huawei P8lite zwart / 16 GB,9.0
138259,LEELEE74,6800,9.0


***Identify the users with most number of reviews.***

In [15]:
phones_fl[['author','score']].groupby(by='author',axis=0,sort=False).count().sort_values('score',ascending=False).head(10)

Unnamed: 0_level_0,score
author,Unnamed: 1_level_1
Amazon Customer,304
Cliente Amazon,78
e-bit,33
Client d'Amazon,28
Amazon Kunde,16
Anonymous,14
einem Kunden,11
einer Kundin,11
Anonymous,8
Ð¡ÐµÑÐ³ÐµÐ¹,6


***Select the data with products having more than 50 ratings and users who have given more than 50 ratings. Report the shape of the final
dataset.***

In [16]:
phones_fl.groupby(by=['product'],axis=0).count()>50

Unnamed: 0_level_0,author,score
product,Unnamed: 1_level_1,Unnamed: 2_level_1
2270,False,False
3220,False,False
3390,False,False
5-Zoll- Android 4.2 Cubot P9 3G Smart Phone MTK6572 Dual Core 1.3GHz QHD IPS Schirm 512MB RAM 4GB ROM GPS 8MP...,False,False
5165,False,False
...,...,...
×××¤×× ×¡××××¨× Apple iPhone 5s 16GB SimFree ××××¦×¨×,False,False
×××¤×× ×¡××××¨× Apple iPhone 6 128GB Sim Free,False,False
×××¤×× ×¡××××¨× LG Nexus 5X 32GB,False,False
×××¤×× ×¡××××¨× OnePlus One 64GB,False,False


***51961 products are having rating more than 50****

In [17]:
phones_fl.groupby(by=['author'],axis=0).count()>50

Unnamed: 0_level_0,product,score
author,Unnamed: 1_level_1,Unnamed: 2_level_1
"""gerard_lyons""",False,False
#,False,False
-,False,False
-SH-,False,False
-nic-,False,False
...,...,...
ÑÑÑÐ¿Ð¸Ð½ Ð°Ð½Ð´ÑÐµÐ¹,False,False
×× ×××,False,False
×××× ×©××,False,False
××©×,False,False


***625104 users has given more than 50 rating*** 

***Build a popularity based model and recommend top 5 mobile phones.***

In [18]:
phones_fl.groupby('product')['score'].mean().sort_values(ascending=False).head(5)

product
×××¤×× ×¡××××¨× Samsung Galaxy S7 SM-G930F 32GB                                                                                                          10.0
SAMSUNG Galaxy S6 Edge Plus - noir - 32 Go - 4G+ - Smartphone                                                                                                    10.0
Samsung Brand New Samsung Galaxy Grand 2 Duos G7102 Dual SIM 8GB White Unlocked Smartphone                                                                       10.0
Samsung Brightside, Sapphire Blue (Verizon Wireless)                                                                                                             10.0
Honor 6 4G UK Smartphone (5 inch, Touchscreen, Octa-Core, 3GB RAM, 16GB ROM, 13MP rear camera, 5MP front camera, LTE CAT6, Android 4.4, Emotion UI 2.3) White    10.0
Name: score, dtype: float64

***Build a collaborative filtering model using SVD. You can use SVD from surprise or build it from scratch(Note: Incase you’re building it from scratch you
can limit your data points to 5000 samples if you face memory issues). Build a collaborative filtering model using kNNWithMeans from surprise. You
can try both user-based and item-based model.***

In [19]:
reader = Reader(rating_scale=(1, 10))

In [20]:
data = Dataset.load_from_df(phones_fl,reader)

In [21]:
trainset = data.build_full_trainset()

In [22]:
algo = SVD()
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x206030603a0>

In [23]:
testset = trainset.build_anti_testset()

In [24]:
predictions = algo.test(testset)

***Evaluate the collaborative model. Print RMSE value***

In [25]:
accuracy.rmse(predictions, verbose=True)

RMSE: 0.3431


0.3431398905343248

***Try and recommend top 5 products for test users.***

In [26]:
def get_top_n(predictions, n=5):
    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

In [27]:
top_n = get_top_n(predictions, n=5)

In [28]:
top_n

defaultdict(list,
            {'Giuseppe Calavaro': [('Samsung Galaxy S7 edge 32GB (Verizon)',
               8.988629684563655),
              ('Samsung Galaxy S7 G930 Black', 8.54162960794981),
              ('Nokia 5800', 8.524177830059461),
              ('Samsung Galaxy S Duos II GT-S7582 Factory Unlocked Cellphone, International Version, White',
               8.504247068067952),
              ('Samsung Guru GT-E1200 (Indigo Blue)', 8.498034668671787)],
             'Buraian22': [('Samsung Galaxy S7 edge 32GB (Verizon)',
               8.55473838322269),
              ('Motorola Moto X Pure Edition Unlocked Smartphone, 64 GB Black XT1575, 5.7" Quad HD display, 21 MP Camera, Quad-core 1.8GHz',
               8.39442619416507),
              ('Apple iPhone 5s 16GB (Ñ\x81ÐµÑ\x80ÐµÐ±Ñ\x80Ð¸Ñ\x81Ñ\x82Ñ\x8bÐ¹)',
               8.313059422133042),
              ('Honor 7 Smartphone dÃ©bloquÃ© 4G (Ecran: 5,2 pouces - 16 Go - Double Nano SIM - Android 5.0 Lollipop) Gris/Noir',
           

***KNN user-user based***

In [30]:
KNNalgo = KNNWithMeans(k=50, sim_options={'name': 'pearson_baseline', 'user_based': True})
KNNalgo.fit(trainset)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x208603124c0>

In [38]:
uid = str('Giuseppe Calavaro') 
iid = str('Huawei M750')

In [39]:
pred = KNNalgo.predict(uid, iid, verbose=True)

user: Giuseppe Calavaro item: Huawei M750 r_ui = None   est = 6.00   {'actual_k': 0, 'was_impossible': False}


In [40]:
test_pred = KNNalgo.test(testset)

In [41]:
test_pred

[Prediction(uid='Giuseppe Calavaro', iid='Huawei M750', r_ui=8.0084, est=6.0, details={'actual_k': 0, 'was_impossible': False}),
 Prediction(uid='Giuseppe Calavaro', iid='Nokia C7-00', r_ui=8.0084, est=6.0, details={'actual_k': 0, 'was_impossible': False}),
 Prediction(uid='Giuseppe Calavaro', iid='Binatone SM800 Touch Screen Big Button Sim Free Mobile Phone', r_ui=8.0084, est=6.0, details={'actual_k': 0, 'was_impossible': False}),
 Prediction(uid='Giuseppe Calavaro', iid='Samsung Samsung Galaxy A5 2016 - Wit', r_ui=8.0084, est=6.0, details={'actual_k': 0, 'was_impossible': False}),
 Prediction(uid='Giuseppe Calavaro', iid='Samsung SGH-X700', r_ui=8.0084, est=6.0, details={'actual_k': 0, 'was_impossible': False}),
 Prediction(uid='Giuseppe Calavaro', iid='Samsung Galaxy S Plus I9001 Smartphone (10,16 cm (4 Zoll) Display, Touchscreen, 5 Megapixel Kamera, Android Betriebssystem) schwarz', r_ui=8.0084, est=6.0, details={'actual_k': 0, 'was_impossible': False}),
 Prediction(uid='Giuseppe C

***Evaluate the collaborative model. Print RMSE value.***

In [42]:
print("User-based Model : Test Set")
accuracy.rmse(test_pred, verbose=True)

User-based Model : Test Set
RMSE: 2.5598


2.5598478147409125

***KNN With item-item based***

In [43]:
KNNItemalgo = KNNWithMeans(k=50, sim_options={'name': 'pearson_baseline', 'user_based': False})
KNNItemalgo.fit(trainset)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x20860312b20>

In [None]:
uid = str('Giuseppe Calavaro') 
iid = str('Huawei M750')

In [44]:
preditem = KNNItemalgo.predict(uid, iid, verbose=True)

user: Giuseppe Calavaro item: Huawei M750 r_ui = None   est = 2.00   {'actual_k': 0, 'was_impossible': False}


In [45]:
test_pred_item = KNNItemalgo.test(testset)

In [46]:
print("Item-based Model : Test Set")
accuracy.rmse(test_pred_item, verbose=True)

Item-based Model : Test Set
RMSE: 2.5222


2.522165812002203

***Predict score (average rating) for test users***

In [47]:
test_pred_item

[Prediction(uid='Giuseppe Calavaro', iid='Huawei M750', r_ui=8.0084, est=2.0, details={'actual_k': 0, 'was_impossible': False}),
 Prediction(uid='Giuseppe Calavaro', iid='Nokia C7-00', r_ui=8.0084, est=7.333333333333333, details={'actual_k': 0, 'was_impossible': False}),
 Prediction(uid='Giuseppe Calavaro', iid='Binatone SM800 Touch Screen Big Button Sim Free Mobile Phone', r_ui=8.0084, est=10, details={'actual_k': 0, 'was_impossible': False}),
 Prediction(uid='Giuseppe Calavaro', iid='Samsung Samsung Galaxy A5 2016 - Wit', r_ui=8.0084, est=6.0, details={'actual_k': 0, 'was_impossible': False}),
 Prediction(uid='Giuseppe Calavaro', iid='Samsung SGH-X700', r_ui=8.0084, est=9.0, details={'actual_k': 0, 'was_impossible': False}),
 Prediction(uid='Giuseppe Calavaro', iid='Samsung Galaxy S Plus I9001 Smartphone (10,16 cm (4 Zoll) Display, Touchscreen, 5 Megapixel Kamera, Android Betriebssystem) schwarz', r_ui=8.0084, est=7.5, details={'actual_k': 0, 'was_impossible': False}),
 Prediction(ui

***Report your findings and inferences***

1. Lot of data is not prasent in given data set, very less comaprision between products.<br>
2. RMSE is very less in SVD and fits proper for this data comapred to KNNwithmean.<br>


***Try cross validation techniques to get better results***

In [54]:
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    2.7416  2.7066  2.6154  2.5477  2.4932  2.6209  0.0934  
MAE (testset)     2.1039  2.1227  2.0399  2.0082  1.9800  2.0509  0.0547  
Fit time          0.24    0.24    0.22    0.26    0.26    0.25    0.01    
Test time         0.00    0.01    0.01    0.00    0.00    0.00    0.00    


{'test_rmse': array([2.74161017, 2.70656001, 2.61537511, 2.54770898, 2.49315275]),
 'test_mae': array([2.10391454, 2.12268124, 2.03992214, 2.00824365, 1.97997023]),
 'fit_time': (0.24236297607421875,
  0.24400067329406738,
  0.22405362129211426,
  0.256483793258667,
  0.2634396553039551),
 'test_time': (0.0020368099212646484,
  0.007961034774780273,
  0.010065317153930664,
  0.0,
  0.0030853748321533203)}

***In what business scenario you should use popularity based Recommendation Systems ?***<br>
For cold start problem scenarios popularity based recommendation systems is good.