***DOMAIN:*** Smartphone, Electronics<br>
***CONTEXT:*** India is the second largest market globally for smartphones after China. About 134 million smartphones were sold across India
in the year 2017 and is estimated to increase to about 442 million in 2022. India ranked second in the average time spent on mobile web by
smartphone users across Asia Pacific. The combination of very high sales volumes and the average smartphone consumer behaviour has
made India a very attractive market for foreign vendors. As per Consumer behaviour, 97% of consumers turn to a search engine when they
are buying a product vs. 15% who turn to social media. If a seller succeeds to publish smartphones based on user’s behaviour/choice at the
right place, there are 90% chances that user will enquire for the same. This Case Study is targeted to build a recommendation system
based on individual consumer’s behaviour or choice.

***DATA DESCRIPTION:***<br>
***• author :*** name of the person who gave the rating<br>
***• country :*** country the person who gave the rating belongs to<br>
***• data :*** date of the rating<br>
***• domain:*** website from which the rating was taken from<br>
***• extract:*** rating content<br>
***• language:*** language in which the rating was given<br>
***• product:*** name of the product/mobile phone for which the rating was given<br>
***• score:*** average rating for the phone<br>
***• score_max:*** highest rating given for the phone<br>
***• source:*** source from where the rating was taken<br>

**PROJECT OBJECTIVE:** We will build a recommendation system using popularity based and collaborative filtering methods to recommend mobile phones to a user which are most popular and personalised respectively..


**1. Import the necessary libraries and read the provided CSVs as a data frame**

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split

***Merge the provided CSVs into one data-frame.***

In [7]:
csv_file_list = ["phone_user_review_file_1.csv", "phone_user_review_file_2.csv","phone_user_review_file_3.csv","phone_user_review_file_4.csv","phone_user_review_file_5.csv","phone_user_review_file_6.csv"]

list_of_dataframes = []
for filename in csv_file_list:
    print(filename)
    list_of_dataframes.append(pd.read_csv(filename,encoding='latin1'))

phones_df = pd.concat(list_of_dataframes)


phone_user_review_file_1.csv
phone_user_review_file_2.csv
phone_user_review_file_3.csv
phone_user_review_file_4.csv
phone_user_review_file_5.csv
phone_user_review_file_6.csv


***Check a few observations and shape of the data-frame***

In [8]:
phones_df.head()

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
0,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Verizon Wireless,verizonwireless.com,10.0,10.0,As a diehard Samsung fan who has had every Sam...,CarolAnn35,Samsung Galaxy S8
1,/cellphones/samsung-galaxy-s8/,4/28/2017,en,us,Phone Arena,phonearena.com,10.0,10.0,Love the phone. the phone is sleek and smooth ...,james0923,Samsung Galaxy S8
2,/cellphones/samsung-galaxy-s8/,5/4/2017,en,us,Amazon,amazon.com,6.0,10.0,Adequate feel. Nice heft. Processor's still sl...,R. Craig,"Samsung Galaxy S8 (64GB) G950U 5.8"" 4G LTE Unl..."
3,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Samsung,samsung.com,9.2,10.0,Never disappointed. One of the reasons I've be...,Buster2020,Samsung Galaxy S8 64GB (AT&T)
4,/cellphones/samsung-galaxy-s8/,5/11/2017,en,us,Verizon Wireless,verizonwireless.com,4.0,10.0,I've now found that i'm in a group of people t...,S Ate Mine,Samsung Galaxy S8


In [15]:
row,col = phones_df.shape
print("Number of rows: {}".format(row))
print("Number of columns: {}".format(col))

Number of rows: 1415133
Number of columns: 11


In [16]:
phones_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1415133 entries, 0 to 163836
Data columns (total 11 columns):
 #   Column     Non-Null Count    Dtype  
---  ------     --------------    -----  
 0   phone_url  1415133 non-null  object 
 1   date       1415133 non-null  object 
 2   lang       1415133 non-null  object 
 3   country    1415133 non-null  object 
 4   source     1415133 non-null  object 
 5   domain     1415133 non-null  object 
 6   score      1351644 non-null  float64
 7   score_max  1351644 non-null  float64
 8   extract    1395772 non-null  object 
 9   author     1351931 non-null  object 
 10  product    1415132 non-null  object 
dtypes: float64(2), object(9)
memory usage: 129.6+ MB


In [17]:
phones_df.isna().sum()

phone_url        0
date             0
lang             0
country          0
source           0
domain           0
score        63489
score_max    63489
extract      19361
author       63202
product          1
dtype: int64

There are few NA's in score, score_max, extract and author columns. Dropping off the records having NA's

In [20]:
phones_cl = phones_df.dropna()

***Round off scores to the nearest integers.***

In [30]:
phones_cl.loc[:, ('score', 'score_max')]

Unnamed: 0,score,score_max
0,10.0,10.0
1,10.0,10.0
2,6.0,10.0
3,9.0,10.0
4,4.0,10.0
...,...,...
163832,2.0,10.0
163833,10.0,10.0
163834,2.0,10.0
163835,8.0,10.0


In [32]:
phones_cl.loc[:, ('score', 'score_max')] = phones_cl.loc[:, ('score', 'score_max')].round()

***Check for missing values. Impute the missing values if there is any***<br>
MIssing values are dropped as the number of records are 63000+ out of 1400000 records.

***Check for duplicate values and remove them if there is any.***

In [36]:
phones_cl.drop_duplicates()

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
0,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Verizon Wireless,verizonwireless.com,10.0,10.0,As a diehard Samsung fan who has had every Sam...,CarolAnn35,Samsung Galaxy S8
1,/cellphones/samsung-galaxy-s8/,4/28/2017,en,us,Phone Arena,phonearena.com,10.0,10.0,Love the phone. the phone is sleek and smooth ...,james0923,Samsung Galaxy S8
2,/cellphones/samsung-galaxy-s8/,5/4/2017,en,us,Amazon,amazon.com,6.0,10.0,Adequate feel. Nice heft. Processor's still sl...,R. Craig,"Samsung Galaxy S8 (64GB) G950U 5.8"" 4G LTE Unl..."
3,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Samsung,samsung.com,9.0,10.0,Never disappointed. One of the reasons I've be...,Buster2020,Samsung Galaxy S8 64GB (AT&T)
4,/cellphones/samsung-galaxy-s8/,5/11/2017,en,us,Verizon Wireless,verizonwireless.com,4.0,10.0,I've now found that i'm in a group of people t...,S Ate Mine,Samsung Galaxy S8
...,...,...,...,...,...,...,...,...,...,...,...
163832,/cellphones/alcatel-ot-club_1187/,5/12/2000,de,de,Ciao,ciao.de,2.0,10.0,Weil mein Onkel bei ALcatel arbeitet habe ich ...,david.paul,Alcatel Club Plus Handy
163833,/cellphones/alcatel-ot-club_1187/,5/11/2000,de,de,Ciao,ciao.de,10.0,10.0,Hy Liebe Leserinnen und Leser!! Ich habe seit ...,Christiane14,Alcatel Club Plus Handy
163834,/cellphones/alcatel-ot-club_1187/,5/4/2000,de,de,Ciao,ciao.de,2.0,10.0,"Jetzt hat wohl Alcatell gedacht ,sie machen wa...",michaelawr,Alcatel Club Plus Handy
163835,/cellphones/alcatel-ot-club_1187/,5/1/2000,de,de,Ciao,ciao.de,8.0,10.0,Ich bin seit 2 Jahren (stolzer) Besitzer eines...,claudia0815,Alcatel Club Plus Handy


4500+ records has been removed as duplicates.

***Keep only 1000000 data samples. Use random state=612.***<br>
80% of the data has to be taken after clean up to keep 1000000 records. 

In [38]:
phones_sampled = phones_cl.sample(n=1000000,random_state=612)

***Drop irrelevant features. Keep features like Author, Product, and Score***

In [None]:
phones_

In [22]:
phones_cl["lang"].value_counts()

en    543199
de    167841
ru    139573
it    112362
es     98232
fr     83202
pt     57329
nl     36771
sv     17104
fi      6795
tr      6477
cs      2280
no      1900
he      1361
pl       468
da       407
hu       330
id       271
ar        12
zh         3
Name: lang, dtype: int64