# Recommendation Systems Project

![Recommendation System Question](RS_qn.PNG "Recommendation System Question")

In [293]:

#Loading libraries
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.metrics as metrics
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import numpy as np
from surprise import SVD
import seaborn as sns
import pathlib

In [294]:
#Read and merge CSVs

ph1 = pd.read_csv("Phone reviews Dataset/phone_user_review_file_1.csv",encoding="latin-1")
ph2 = pd.read_csv("Phone reviews Dataset/phone_user_review_file_2.csv",encoding="latin-1")
ph3 = pd.read_csv("Phone reviews Dataset/phone_user_review_file_3.csv",encoding="latin-1")
ph4 = pd.read_csv("Phone reviews Dataset/phone_user_review_file_4.csv",encoding="latin-1")
ph5 = pd.read_csv("Phone reviews Dataset/phone_user_review_file_5.csv",encoding="latin-1")
ph6 = pd.read_csv("Phone reviews Dataset/phone_user_review_file_6.csv",encoding="latin-1")

In [295]:
li = [ph1, ph2,ph3, ph4, ph5, ph6]
rvws = pd.concat(li, axis=0, ignore_index=True)

In [296]:
rvws.shape

(1415133, 11)

In [297]:
rvws.sample(10)

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
1411542,/cellphones/bosch-com-509/,7/8/2000,en,gb,Dooyoo,dooyoo.co.uk,10.0,10.0,i have had this phone for about half a year no...,anonym,Bosch 509e
761148,/cellphones/lg-optimus-l9-p760/,5/23/2013,de,de,Amazon,amazon.de,10.0,10.0,Nach ziemlich schlechten Erfahrungen mit einem...,DAnt,LG Electronics P760 Optimus L9 Smartphone (Dua...
110140,/cellphones/samsung-galaxy-s6/,4/16/2015,nl,be,KIESKEURIG,kieskeurig.be,9.0,10.0,Prachtig toestel met een nog mooier scherm. Ac...,Oosie78,Samsung Galaxy S6 zwart / 32 GB
1047858,/cellphones/samsung-sgh-c414/,2/20/2012,en,ca,Samsung,samsung.com,6.0,10.0,"When charging symbol on front stops flashing, ...",Old fellow,Samsung SGH-c414 | Black
606664,/cellphones/blu-life-pure-xl/,5/8/2014,en,us,Amazon,amazon.com,10.0,10.0,"Price is great, camera quality and resolution ...",Carlos Mejia,"BLU Life Pure XL Full HD, 16MP, (32 GB+3GB RAM..."
1262158,/cellphones/sony-ericsson-s500i-44861/,5/12/2010,de,de,Amazon,amazon.de,10.0,10.0,Ich habe mein Sony Ericsson S500i zu Weihnacht...,tweety2be,Sony Ericsson S500i lila Handy
836616,/cellphones/sony-xperia-v/,1/4/2013,ru,ru,??????????????,svyaznoy.ru,8.0,10.0,"???????????? ?????????, ???? ????????????????...",???????????,Sony Xperia V (?????????????)
823814,/cellphones/nokia-lumia-820/,3/8/2013,pt,br,Cissa Magazine,cissamagazine.com.br,,,Um bom aparelho leve e bonito Loja respons??ve...,Elisiane de Paula da Costa Mendes,Smartphone Nokia Lumia 820 Desbloqueado Preto/...
434154,/cellphones/nokia-215-dual-sim/,7/8/2015,en,in,Amazon,amazon.in,10.0,10.0,nice budget phone,Dhaval S.,"Nokia 215 (Dual SIM, Black)"
352095,/cellphones/samsung-galaxy-s5/,10/15/2014,ru,ru,Yandex,market.yandex.ru,4.0,10.0,ÐÑÐµ Ð¿Ð°ÑÐ° Ð²Ð°Ð¶Ð½ÑÑ Ð½ÐµÐ´Ð¾ÑÑÐ°ÑÐ...,,Samsung Galaxy S5 SM-G900F 16Gb


### Check for missing values and impute them

In [298]:
rvws.isna().sum()

phone_url        0
date             0
lang             0
country          0
source           0
domain           0
score        63489
score_max    63489
extract      19361
author       63202
product          1
dtype: int64

There are many missing values, we can impute the values before round off the numbers

In [299]:
rvws['score'].fillna(rvws['score'].mean(), inplace = True)
rvws['score_max'].fillna(rvws['score_max'].mean(), inplace = True)
rvws['extract'].fillna("No review available", inplace = True)
rvws['author'].fillna("default", inplace = True)

#if product name is na then we could extract product name from phone_url, for simplicity I will just drop that row
rvws = rvws[rvws['product'].notna()]

### 1.3 Round off scores to the nearest integers.

In [300]:
rvws['score'].round(0).astype(int)

0          10
1          10
2           6
3           9
4           4
           ..
1415128     2
1415129    10
1415130     2
1415131     8
1415132     2
Name: score, Length: 1415132, dtype: int32

### Check for duplicate rows and remove them

In [301]:
rvws.duplicated().sum()

6412

There are many duplicates lets drop them

In [302]:
rvws = rvws.drop_duplicates();

### Keep only 1000000 data samples. Use random state=612.

In [303]:
rvws_sample = rvws.sample(1000000, random_state=612)

In [304]:
rvws_sample.shape

(1000000, 11)

### Drop irrelevant features. Keep features like Author, Product, and Score.

In [305]:
rvws_sample = rvws_sample.drop(columns=['phone_url','date','lang','country','source','domain'])

In [306]:
rvws_sample.columns

Index(['score', 'score_max', 'extract', 'author', 'product'], dtype='object')

### Identify the most rated features

In [308]:
products_rvw = rvws_sample.groupby('product')['score'].count().reset_index().sort_values(by='score',ascending=False)

In [309]:
products_rvw[products_rvw['score'] > 50]

Unnamed: 0,product,score
21748,"Lenovo Vibe K4 Note (White,16GB)",3700
21747,"Lenovo Vibe K4 Note (Black, 16GB)",3093
31898,"OnePlus 3 (Graphite, 64 GB)",2889
31899,"OnePlus 3 (Soft Gold, 64 GB)",2522
37638,Samsung Galaxy Express I8730,1902
...,...,...
15588,Huawei P8 Lite - Smartphone libre Android (pan...,51
51918,ThL W100,51
16069,"Huawei Y6 (2GB) Smartphone Dual SIM, Display 5...",51
38413,"Samsung Galaxy K zoom C115 Smartphone (12,2 cm...",51


In [310]:
users_rvw = rvws_sample.groupby('author')['score'].count().reset_index().sort_values(by='score',ascending=False)

### Identify the users with most number of reviews.

In [312]:
users_rvw[users_rvw['score'] > 50]

Unnamed: 0,author,score
22787,Amazon Customer,54543
422625,default,43911
72560,Cliente Amazon,13630
429341,e-bit,5965
72538,Client d'Amazon,5501
...,...,...
316816,Sabine,51
38917,B,51
175203,Johnny,51
441459,gabberino93,51


### Select the data with products having more than 50 ratings and users who have given more than 50 ratings. 

In [313]:
product_df = rvws_sample[rvws_sample['product'].isin(products_rvw['product'])]
user_df = rvws_sample[rvws_sample['author'].isin(users_rvw['author'])]

In [314]:
reviews = pd.merge(product_df, user_df)

In [315]:
reviews.head()

Unnamed: 0,score,score_max,extract,author,product
0,7.0,10.0,We got this phone unlocked as we had plans to ...,drprashams,Samsung SGH-T139
1,2.0,10.0,It was advertised as a Verizon phone but in re...,Nickolas Gudmundson,"Apple iPhone 4S Verizon Cellphone, 16GB, White"
2,10.0,10.0,Habe mir das Samsung Galaxy Note 3 Neo als Nac...,BulldoZer,"Samsung Galaxy Note 3 Neo Smartphone (13,94 cm..."
3,6.0,10.0,nokia 5030 me game kesa dalay,vinod kumar,Nokia 5030
4,10.0,10.0,"Beh, dopo Iphone e Lumia sono passato al mondo...",Giorgio,"Huawei P9 Plus Smartphone, LTE, Display 5.5'' ..."


###  Report the shape of the final dataset

In [316]:
reviews.shape

(1029990, 5)

## Build a popularity based model and recommend top 5 mobile phones

In [317]:
## since we have already sorted products by score, we can list the top five products here
products_rvw['Rank'] = products_rvw['score'].rank(ascending=0, method='first') 
products_rvw.head(5)

Unnamed: 0,product,score,Rank
21748,"Lenovo Vibe K4 Note (White,16GB)",3700,1.0
21747,"Lenovo Vibe K4 Note (Black, 16GB)",3093,2.0
31898,"OnePlus 3 (Graphite, 64 GB)",2889,3.0
31899,"OnePlus 3 (Soft Gold, 64 GB)",2522,4.0
37638,Samsung Galaxy Express I8730,1902,5.0


In [318]:
from surprise import SVD
from surprise.model_selection import cross_validate

In [319]:
reviews.columns

Index(['score', 'score_max', 'extract', 'author', 'product'], dtype='object')

In [320]:
reviews = reviews.drop(columns=['score_max','extract'])

##  Build a collaborative filtering model using SVD.

In [337]:
small_dataset = reviews[["author", "product", "score"]].head(5000)

In [338]:
small_dataset.head(5)

Unnamed: 0,author,product,score
0,drprashams,Samsung SGH-T139,7.0
1,Nickolas Gudmundson,"Apple iPhone 4S Verizon Cellphone, 16GB, White",2.0
2,BulldoZer,"Samsung Galaxy Note 3 Neo Smartphone (13,94 cm...",10.0
3,vinod kumar,Nokia 5030,6.0
4,Giorgio,"Huawei P9 Plus Smartphone, LTE, Display 5.5'' ...",10.0


In [339]:
from surprise import Reader
from surprise import Dataset

## reducing sample size to 5000 since my hardware has insufficient memory to run the entire dataset
reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(small_dataset, reader)

In [340]:
from surprise import KNNWithMeans

# To use item-based cosine similarity
sim_options = {
    "name": "cosine",
    "user_based": False,  # Compute  similarities between items
}
algo_item_based = KNNWithMeans(sim_options=sim_options)

In [341]:
# To use item-based cosine similarity
sim_options = {
    "name": "cosine",
    "user_based": True,  # Compute  similarities between users
}
algo_user_based = KNNWithMeans(sim_options=sim_options)

In [342]:
trainset = data.build_full_trainset()

In [343]:
algo_user_based.fit(trainset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x1e146d1bd30>

In [344]:
algo_item_based.fit(trainset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x1e1002a1700>

## Evaluate the collaborative model. Print RMSE value

In [345]:
from surprise import accuracy
from surprise.model_selection import KFold

In [346]:

# define a cross-validation iterator
kf = KFold(n_splits=8)

for trainset, testset in kf.split(data):

    # train and test algorithm.
    algo_user_based.fit(trainset)
    predictions = algo_user_based.test(testset)

    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 2.4895
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 2.4652
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 2.5444
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 2.4467
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 2.5490
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 2.3682
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 2.3299
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 2.3204


In [347]:

# define a cross-validation iterator
kf = KFold(n_splits=3)

for trainset, testset in kf.split(data):

    # train and test algorithm.
    algo_item_based.fit(trainset)
    predictions = algo_item_based.test(testset)

    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 2.4527
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 2.4325
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 2.4457


## Predict score (average rating) for test users. 

In [348]:
## get a sample prediction for a user of a product
prediction = algo.predict('Giorgio', 'Nokia 5030')
prediction.est

2.6995650217489127

## Report your findings and inferences.

1. Popularity based scoring was applied on the dataset. The following are the top mobile products
Lenovo Vibe K4 Note (White,16GB)	
Lenovo Vibe K4 Note (Black, 16GB)	
OnePlus 3 (Graphite, 64 GB)	
OnePlus 3 (Soft Gold, 64 GB)	
Samsung Galaxy Express I873

2. Item based and User based collaborative filtering was applied with KNN algorithm. Using the default parameters on a dataset size of 5000, we obtained an RMSE of around 2 for both.

## Try and recommend top 5 popularity based products for test users.

In [356]:
products_rvw.head(5)

Unnamed: 0,product,score,Rank
21748,"Lenovo Vibe K4 Note (White,16GB)",3700,1.0
21747,"Lenovo Vibe K4 Note (Black, 16GB)",3093,2.0
31898,"OnePlus 3 (Graphite, 64 GB)",2889,3.0
31899,"OnePlus 3 (Soft Gold, 64 GB)",2522,4.0
37638,Samsung Galaxy Express I8730,1902,5.0


## Get top 5 recommendations for a test user

In [350]:
rcmd = pd.DataFrame(columns=['product','score_recommended'])
testP = products_rvw.head(10);
for index, row in testP.iterrows():
    prediction = algo_item_based.predict('Giorgio', row['product'])
    print(row['product']+" "+str(prediction))
    #rcmd.append({"product":row['product'],"score_recommended":prediction}, ignore_index=True)
    rcmd.loc[index]= [row['product'], float(prediction.est)]

Lenovo Vibe K4 Note (White,16GB) user: Giorgio    item: Lenovo Vibe K4 Note (White,16GB) r_ui = None   est = 7.64   {'actual_k': 0, 'was_impossible': False}
Lenovo Vibe K4 Note (Black, 16GB) user: Giorgio    item: Lenovo Vibe K4 Note (Black, 16GB) r_ui = None   est = 5.67   {'actual_k': 0, 'was_impossible': False}
OnePlus 3 (Graphite, 64 GB) user: Giorgio    item: OnePlus 3 (Graphite, 64 GB) r_ui = None   est = 9.11   {'actual_k': 0, 'was_impossible': False}
OnePlus 3 (Soft Gold, 64 GB) user: Giorgio    item: OnePlus 3 (Soft Gold, 64 GB) r_ui = None   est = 9.50   {'actual_k': 0, 'was_impossible': False}
Samsung Galaxy Express I8730 user: Giorgio    item: Samsung Galaxy Express I8730 r_ui = None   est = 8.41   {'actual_k': 0, 'was_impossible': False}
Huawei P8lite zwart / 16 GB user: Giorgio    item: Huawei P8lite zwart / 16 GB r_ui = None   est = 8.88   {'actual_k': 0, 'was_impossible': False}
Lenovo Vibe K5 (Gold, VoLTE update) user: Giorgio    item: Lenovo Vibe K5 (Gold, VoLTE updat

In [354]:
rcmd['Rank'] = rcmd['score_recommended'].rank(ascending=0, method='first')
rcmd.sort_values(by='Rank',ascending=True)

Unnamed: 0,product,score_recommended,Rank
28114,Nokia 5800 XpressMusic,10.0,1.0
31899,"OnePlus 3 (Soft Gold, 64 GB)",9.5,2.0
31898,"OnePlus 3 (Graphite, 64 GB)",9.111111,3.0
41041,Samsung Galaxy S6 zwart / 32 GB,8.933333,4.0
15709,Huawei P8lite zwart / 16 GB,8.884211,5.0
37638,Samsung Galaxy Express I8730,8.405648,6.0
21748,"Lenovo Vibe K4 Note (White,16GB)",7.636364,7.0
21747,"Lenovo Vibe K4 Note (Black, 16GB)",5.666667,8.0
21753,"Lenovo Vibe K5 (Grey, VoLTE update)",5.333333,9.0
21751,"Lenovo Vibe K5 (Gold, VoLTE update)",5.0,10.0


In [355]:
cross_validate(algo_user_based, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    2.5236  2.4275  2.3727  2.4777  2.4229  2.4449  0.0515  
MAE (testset)     1.8922  1.7943  1.7765  1.8244  1.8478  1.8270  0.0407  
Fit time          0.98    1.07    0.96    0.93    0.97    0.98    0.05    
Test time         0.06    0.06    0.06    0.09    0.05    0.07    0.02    


{'test_rmse': array([2.52364445, 2.42751268, 2.37268793, 2.47771503, 2.4228879 ]),
 'test_mae': array([1.89215972, 1.79430566, 1.77653492, 1.82442593, 1.84780086]),
 'fit_time': (0.9751038551330566,
  1.06736159324646,
  0.9595808982849121,
  0.9265761375427246,
  0.9698042869567871),
 'test_time': (0.06249499320983887,
  0.06252169609069824,
  0.06249117851257324,
  0.09375739097595215,
  0.04688620567321777)}

## In what business scenario you should use popularity based Recommendation Systems ?

Popularity based RS should be used when item based or user data is not available for prediction. It can be used to find trends without user-specific data over a large number of users.

##  In what business scenario you should use CF based Recommendation Systems ?

CF based RS can be used when we have contextual information about the user.

## What other possible methods can you think of which can further improve the recommendation for different users ?

1. Collecting user more contextutal user data to aid in better recommendations.
2. Increasing model size
3. Running a real time recommendation system 