# Case Study: Recommendation on Smartphones

CONTEXT:
India is the second-largest smartphone market in the world after China. About 134 million smartphones were sold across India in 2017, and sales are estimated to increase to about 442 million in 2022. India ranked second in Asia Pacific for the average time smartphone users spend on the mobile web. The combination of very high sales volumes and typical smartphone consumer behaviour has made India a very attractive market for foreign vendors. In terms of consumer behaviour, 97% of consumers turn to a search engine when they are buying a product, versus 15% who turn to social media. If a seller succeeds in showing smartphones that match a user's behaviour or choice at the right place, there is a 90% chance that the user will enquire about them. This case study aims to build a recommendation system based on an individual consumer's behaviour or choice.
DATA DESCRIPTION:
• author: name of the person who gave the rating
• country: country of the person who gave the rating
• date: date of the rating
• domain: website from which the rating was taken
• extract: rating content
• lang: language in which the rating was given
• product: name of the product/mobile phone for which the rating was given
• score: average rating for the phone
• score_max: highest rating given for the phone
• source: source from which the rating was taken

PROJECT OBJECTIVE: We will build a recommendation system using popularity-based and collaborative filtering methods. The popularity-based model recommends the most popular mobile phones to a user, and the collaborative filtering model recommends personalised ones.

In [None]:
# Importing the libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import preprocessing
from collections import defaultdict
from surprise import SVD
from surprise import KNNWithMeans
from surprise import Dataset
from surprise import accuracy
from surprise import Reader
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split

# Suppressing Warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
#Loading Data files
rev1 = pd.read_csv('../input/recommendation-system/phone_user_review_file_1.csv', encoding='iso-8859-1')
rev2 = pd.read_csv('../input/recommendation-system/phone_user_review_file_2.csv', encoding='iso-8859-1')
rev3 = pd.read_csv('../input/recommendation-system/phone_user_review_file_3.csv', encoding='iso-8859-1')
rev4 = pd.read_csv('../input/recommendation-system/phone_user_review_file_4.csv', encoding='iso-8859-1')
rev5 = pd.read_csv('../input/recommendation-system/phone_user_review_file_5.csv', encoding='iso-8859-1')
rev6 = pd.read_csv('../input/recommendation-system/phone_user_review_file_6.csv', encoding='iso-8859-1')   

In [None]:
rev1.head().T

In [None]:
rev2.head().T

In [None]:
rev3.head().T

In [None]:
rev4.head().T

In [None]:
rev5.head().T

In [None]:
rev6.head().T

In [None]:
rev1.shape

In [None]:
rev2.shape

In [None]:
rev3.shape

In [None]:
rev4.shape

In [None]:
rev5.shape

In [None]:
rev6.shape

In [None]:
# 1a. Merge the provided CSVs into one data-frame. 
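# ignore_index=True gives the merged frame a fresh 0..n-1 index instead of repeating each file's own index
rev_f = pd.concat([rev1,rev2,rev3,rev4,rev5,rev6], axis=0, ignore_index=True)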

In [None]:
rev_copy = rev_f.copy()

In [None]:
#Checking training dataset attributes datatypes 
rev_f.info()

All columns are objects except score and score_max which are floating point.

In [None]:
# 1b. Check a few observations and shape of the data-frame.
rev_f.shape

In [None]:
rev_f.describe()

The mean score is about 8, with a standard deviation of about 2.62.

In [None]:
#check for missing values
rev_f.isnull().values.any() # If there are any null values in data set

In [None]:
null_counts = rev_f.isnull().sum()  # This prints the columns with the number of null values they have
print (null_counts)

In [None]:
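# 1d. Check for missing values. Impute the missing values if there are any.
# Fill the null values in the numeric columns 'score' and 'score_max' with the column median.
# numeric_only=True skips the text columns, which have no median.
rev_f = rev_f.fillna(rev_f.median(numeric_only=True))

# Drop the rows that still contain nulls (text columns such as 'extract', 'author' and 'product').
rev_f = rev_f.dropna()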
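
The imputation step can be tried on a toy frame first. This is a minimal sketch, assuming a recent pandas version in which `DataFrame.median` needs `numeric_only=True` when text columns are present. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: one text column and one numeric column, each with a gap.
toy = pd.DataFrame({
    "author": ["a", None, "c"],
    "score": [8.0, None, 10.0],
})

# Median of the numeric columns only; the text column is skipped.
medians = toy.median(numeric_only=True)

# fillna with a Series fills each column that is named in the Series index.
filled = toy.fillna(medians)

# Rows that still contain a null (here: the missing author) are dropped.
cleaned = filled.dropna()

print(filled)
print(cleaned.shape)
```

The missing score is replaced by the median of the observed scores, while the row with the missing author is removed by `dropna`.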

In [None]:
# 1c. Round off scores to the nearest integers.
# round() comes before the cast because astype(int) alone truncates toward zero.
rev_f['score'] = rev_f['score'].round().astype(int)
rev_f['score_max'] = rev_f['score_max'].round().astype(int)
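
Truncating and rounding to the nearest integer give different results, which a toy Series makes visible. This is a small sketch with made-up values, not data from the review files.

```python
import pandas as pd

# Made-up scores with fractional parts.
scores = pd.Series([7.2, 7.8, 9.6])

# astype(int) simply drops the fractional part.
truncated = scores.astype(int)

# round() moves each value to the nearest integer before the cast.
rounded = scores.round().astype(int)

print(truncated.tolist())
print(rounded.tolist())
```

For values that sit exactly on .5, pandas follows NumPy's round-half-to-even rule as far as I know, so it is worth checking such cases separately if they matter.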

In [None]:
rev_f.shape

In [None]:
# 1e. Check for duplicate values and remove them if there is any. 
rev_d = rev_f.drop_duplicates()

In [None]:
# 1g. Drop irrelevant features. Keep features like Author, Product, and Score.
# phone_url, date, lang, country, source, domain, score_max and extract do not help in deciding popularity, so we drop them.
rev_d = rev_d.drop(columns=['phone_url','date','lang','country','source','domain','score_max','extract'])

In [None]:
rev_vs = rev_d.copy()

In [None]:
rev_d.shape

In [None]:
# 1f. Keep only 1000000 data samples. Use random state=612
df = rev_d.sample(n=1000000, random_state=612)

In [None]:
# 2a. Identify the most rated features.
#sorting on products that got highest mean score
df.groupby('product')['score'].mean().sort_values(ascending=False).head()  

In [None]:
# 2b. Identify the users with the most reviews.
(df['author'].value_counts()).head()

In [None]:
# The products that got the most reviews.
df['product'].value_counts().head()

In [None]:
# extracting authors who gave greater than 50 ratings
df1 = pd.DataFrame(columns=['author', 'a_count'])
df1['author']=df['author'].value_counts().index.tolist() 
df1['a_count'] = list(df['author'].value_counts() > 50)

In [None]:
# get names of indexes for which count column value is False
index_names = df1[ df1['a_count'] == False ].index 
# drop these row indexes from dataFrame 
df1.drop(index_names, inplace = True) 
df1

In [None]:
# extracting product that got more than 50 ratings
df2 = pd.DataFrame(columns=['product', 'p_count'])
df2['product']=df['product'].value_counts().index.tolist() 
df2['p_count'] = list(df['product'].value_counts() > 50)

In [None]:
# get names of indexes for which count column value is False
index_names = df2[ df2['p_count'] == False ].index 
# drop these row indexes from dataFrame 
df2.drop(index_names, inplace = True)

In [None]:
df2

In [None]:
# selecting data rows where product is having more than 50 ratings.  
df3 = df[df['product'].isin(df2['product'])] 
df3

In [None]:
# selecting data rows from df3 where author has given more than 50 ratings.
# 2c. so that we get the data with products having more than 50 ratings and users who have given more than 50 ratings
df4 = df3[df3['author'].isin(df1['author'])]
df4

In [None]:
# 2c. Report the shape of the final dataset.
df4.shape
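
The same filter can be written more compactly with `value_counts` and `isin`. The sketch below uses a made-up toy frame and a hypothetical helper named `frequent_values`; the threshold of 1 stands in for the 50 used above.

```python
import pandas as pd

def frequent_values(frame, column, min_count):
    """Return the values of `column` that occur more than `min_count` times."""
    counts = frame[column].value_counts()
    return counts[counts > min_count].index

# Hypothetical toy data with the same three columns as the notebook.
toy = pd.DataFrame({
    "author": ["a", "a", "a", "b", "b", "c"],
    "product": ["p1", "p2", "p1", "p1", "p2", "p3"],
    "score": [8, 9, 7, 10, 6, 5],
})

# Authors and products with more than one rating in the toy data.
active_authors = frequent_values(toy, "author", min_count=1)
rated_products = frequent_values(toy, "product", min_count=1)

# Keep rows where both the author and the product pass the threshold.
filtered = toy[toy["author"].isin(active_authors) & toy["product"].isin(rated_products)]
print(filtered)
```

As in the cells above, both counts are taken from the unfiltered frame, so a product can end up with fewer ratings than the threshold once inactive authors are removed.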

# Build a popularity based model and recommend top 5 mobile phones. 

In [None]:
#calculating the mean score for a product by grouping it.
ratings_mean_count = pd.DataFrame(df.groupby('product')['score'].mean()) 

In [None]:
# calculating the number of ratings a product got
ratings_mean_count['rating_counts'] = df.groupby('product')['score'].count()

In [None]:
# 3. Recommending the 5 mobile phones based on the highest mean score and the highest number of ratings the product got.
ratings_mean_count.sort_values(by=['score','rating_counts'], ascending=[False,False]).head()
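
Sorting by mean score first can put a product with a single perfect rating at the top. One possible refinement, sketched here on made-up data, is to require a minimum number of ratings before a product is ranked. The threshold `MIN_RATINGS` is a hypothetical value, not something taken from the case study.

```python
import pandas as pd

# Hypothetical toy ratings: product A has a single rating of 10.
toy = pd.DataFrame({
    "product": ["A", "B", "B", "B", "C", "C"],
    "score": [10, 9, 8, 9, 7, 8],
})

MIN_RATINGS = 2  # assumed threshold, chosen only for this illustration

# Mean score and number of ratings per product.
stats = toy.groupby("product").agg(
    mean_score=("score", "mean"),
    rating_counts=("score", "count"),
)

# Rank only the products that have enough ratings.
popular = stats[stats["rating_counts"] >= MIN_RATINGS]
top = popular.sort_values(
    by=["mean_score", "rating_counts"], ascending=[False, False]
).head(5)
print(top)
```

Product A is left out of the ranking even though its mean is the highest, because one rating says little about popularity.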

In [None]:
data_pb = df
df

# Build a collaborative filtering model using SVD. 

In [None]:
# arranging columns in the order of user id,item id and rating to be fed in the svd
columns_titles = ['author','product','score']
vs_rev = rev_vs.reindex(columns=columns_titles)

In [None]:
# Keep only 5000 data samples. Use random state=612
vs_data = vs_rev.sample(n=5000, random_state=612)

In [None]:
# 4. Build a collaborative filtering model using SVD. 
reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(vs_data,reader = reader)

In [None]:
trainset = data.build_full_trainset()

In [None]:
trainset.ur

In [None]:
algo = SVD()
algo.fit(trainset)

In [None]:
# Then predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset()

In [None]:
predictions = algo.test(testset)

In [None]:
predictions

Above are the user-item pairs that are not in the training set, with their estimated ratings.

In [None]:
def get_top_n(predictions, n=5):
    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

In [None]:
# 8. Try and recommend top 5 products for test users
top_n = get_top_n(predictions, n=5)

In [None]:
top_n 

Above are the top 5 predicted items and their ratings for test users.

In [None]:
# Print the recommended items for each user
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])

In [None]:
# 5. Evaluate the collaborative model. Print RMSE value for SVD
print("SVD Model : Test Set")
accuracy.rmse(predictions, verbose=True)

In [None]:
cross_validate(algo, data, measures=['RMSE'], cv=3, verbose=False)

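The RMSE of the SVD model on the anti-testset is lower than the cross-validated RMSE, but the two are not comparable. The anti-testset contains no real ratings: by default Surprise fills in the global mean as the "true" rating, so that RMSE only measures how far the estimates are from the global mean. The cross-validated RMSE is the more reliable measure of error on unseen ratings.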
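
It helps to look at what an anti-testset contains before comparing these two numbers. The sketch below mimics the idea in plain Python and is not Surprise's implementation: it lists every user-item pair that has no rating in the training data and attaches the global mean as a placeholder rating, which is my understanding of the default `fill` behaviour of `build_anti_testset`. The ratings are made up.

```python
# Hypothetical training ratings as (user, item, rating) triples.
ratings = [("u1", "p1", 8), ("u1", "p2", 6), ("u2", "p1", 10)]

# Global mean of the known ratings, used as the placeholder "true" rating.
global_mean = sum(r for _, _, r in ratings) / len(ratings)

users = sorted({u for u, _, _ in ratings})
items = sorted({i for _, i, _ in ratings})
seen = {(u, i) for u, i, _ in ratings}

# Every (user, item) pair without a rating, paired with the placeholder.
anti_testset = [
    (u, i, global_mean) for u in users for i in items if (u, i) not in seen
]
print(anti_testset)
```

Because the third element is a placeholder rather than an observed rating, an error computed against it says nothing about how well the model predicts real ratings.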

In [None]:
def get_Iu(uid):
    """ return the number of items rated by given user
    args: 
      uid: the id of the user
    returns: 
      the number of items rated by the user
    """
    try:
        return len(trainset.ur[trainset.to_inner_uid(uid)])
    except ValueError: # user was not part of the trainset
        return 0
    
def get_Ui(iid):
    """ return number of users that have rated given item
    args:
      iid: the raw id of the item
    returns:
      the number of users that have rated the item.
    """
    try: 
        return len(trainset.ir[trainset.to_inner_iid(iid)])
    except ValueError:
        return 0
    
bf = pd.DataFrame(predictions, columns=['uid', 'iid', 'rui', 'est', 'details'])
bf['Iu'] = bf.uid.apply(get_Iu)
bf['Ui'] = bf.iid.apply(get_Ui)
bf['err'] = abs(bf.est - bf.rui)
best_predictions = bf.sort_values(by='err')[:10]
worst_predictions = bf.sort_values(by='err')[-10:]


In [None]:
best_predictions

# Build a collaborative filtering model using kNNWithMeans from surprise using Item based model

In [None]:
# Read dataset.
reader = Reader(rating_scale=(1, 10))
data_I = Dataset.load_from_df(vs_data,reader = reader)

In [None]:
trainset_I, testset_I = train_test_split(data_I, test_size=.15)

In [None]:
# Use user_based true/false to switch between user-based or item-based collaborative filtering
algo = KNNWithMeans(k=50, sim_options={'name': 'pearson_baseline', 'user_based': False})
algo.fit(trainset_I)

In [None]:
# run the  model against the testset
test_pred_I = algo.test(testset_I)

In [None]:
test_pred_I

In [None]:
# get RMSE
print("Item-based Model : Test Set")
accuracy.rmse(test_pred_I, verbose=True)
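
The idea behind a mean-centred neighbourhood model such as `KNNWithMeans` can be sketched in a few lines. This is a simplified illustration, not Surprise's code: the estimate starts from the mean of the target (an item in the item-based case, a user in the user-based case) and adds the similarity-weighted average of how far each neighbour's rating lies from that neighbour's own mean. The function name `predict_with_means` and all numbers are hypothetical, and similarities are assumed to be positive.

```python
def predict_with_means(target_mean, neighbours):
    """Estimate a rating from mean-centred neighbour ratings.

    neighbours: list of (similarity, rating, neighbour_mean) tuples,
    with positive similarities assumed.
    """
    weighted_sum = sum(sim * (rating - mean) for sim, rating, mean in neighbours)
    sim_total = sum(sim for sim, _, _ in neighbours)
    if sim_total == 0:
        # No usable neighbours: fall back to the target's own mean.
        return target_mean
    return target_mean + weighted_sum / sim_total

# Two made-up neighbours: one rated above its mean, one below.
estimate = predict_with_means(
    target_mean=7.0,
    neighbours=[(0.8, 9.0, 8.0), (0.2, 6.0, 7.0)],
)
print(round(estimate, 2))
```

The more similar neighbour rated one point above its own mean, so the estimate moves above the target mean of 7.0.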

# Build a collaborative filtering model using kNNWithMeans from surprise using User based model

In [None]:
reader = Reader(rating_scale=(1, 10))
data_U = Dataset.load_from_df(vs_data,reader = reader)

In [None]:
trainset_U, testset_U = train_test_split(data_U, test_size=.15)

In [None]:
# Use user_based true/false to switch between user-based or item-based collaborative filtering
algo = KNNWithMeans(k=50, sim_options={'name': 'pearson_baseline', 'user_based': True})
algo.fit(trainset_U)

In [None]:
# we can now query for specific predictions
uid = 'Frances DeSimone'  # raw user id
iid = 'Samsung Galaxy Star Pro DUOS S7262 Unlocked Ce.'  # raw item id

In [None]:
# get a prediction for specific users and items.
pred = algo.predict(uid, iid, verbose=True)

For author Frances DeSimone and item Samsung Galaxy Star Pro DUOS S7262 Unlocked Ce., the estimated rating is 8.03.

In [None]:
# run the trained model against the testset
test_pred_U = algo.test(testset_U)

In [None]:
#6. Predict score (average rating) for test users
test_pred_U

Above are the predictions for the user-item combinations in the test set, with their estimated ratings.

In [None]:
# 5. Evaluate the collaborative model. Print RMSE value for User Based CF
print("User-based Model : Test Set")
accuracy.rmse(test_pred_U, verbose=True)
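
For reference, RMSE is the square root of the mean squared difference between true and estimated ratings. The following is a minimal plain-Python sketch of that formula on made-up pairs, with a hypothetical helper name `rmse`. It is meant to show the arithmetic rather than replace `accuracy.rmse`.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up (true, estimated) ratings: errors of 1, 2 and 0.
pairs = [(8, 7), (6, 8), (10, 10)]
print(rmse(pairs))
```

Large errors dominate the result because they are squared before averaging.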

In [None]:
d_df = df
df.shape

In [None]:
# 9. Check for outliers and impute them as required. 
# only score is the column which is numeric so we check it for outliers.
#Checking for outliers in the sample of 1000000
sns.boxplot(x= d_df['score'], color='cyan')
plt.show()
print('Boxplot of score')
# calculating the outlier bounds for the attribute
Q1 = d_df['score'].quantile(0.25)
Q2 = d_df['score'].quantile(0.50)
Q3 = d_df['score'].quantile(0.75) 
IQR = Q3 - Q1
L_W = (Q1 - 1.5 *IQR)
U_W = (Q3 + 1.5 *IQR)    
print('Q1 is : ',Q1)
print('Q2 is : ',Q2)
print('Q3 is : ',Q3)
print('IQR is:',IQR)
print('Lower Whisker, Upper Whisker : ',L_W,',',U_W)
bools = (d_df['score'] < (Q1 - 1.5 *IQR)) |(d_df['score'] > (Q3 + 1.5 * IQR))
print('number of outliers are:',bools.sum())   #calculating the number of outliers

There are 147884 outliers in the column score
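
The whisker rule used above can be checked by hand on a small example. This sketch applies the same 1.5 × IQR rule to a made-up Series of ten scores.

```python
import pandas as pd

# Made-up scores on a 1-10 scale, with one very low rating.
scores = pd.Series([1, 6, 7, 8, 8, 9, 9, 10, 10, 10])

q1 = scores.quantile(0.25)
q3 = scores.quantile(0.75)
iqr = q3 - q1

# Whiskers at 1.5 * IQR below Q1 and above Q3.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the whiskers count as outliers.
outliers = scores[(scores < lower) | (scores > upper)]
print(lower, upper, outliers.tolist())
```

In this toy data the upper whisker lies above the maximum possible score of 10, so only low scores can be flagged.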

In [None]:
# Treat outliers
# Remove outliers by keeping only the rows between the lower and upper whiskers (inclusive)
Q1 = d_df['score'].quantile(0.25)
Q3 = d_df['score'].quantile(0.75)
IQR = Q3 - Q1
d_df = d_df[(d_df['score'] >= (Q1 - 1.5 *IQR)) & (d_df['score'] <= (Q3 + 1.5 *IQR))]
bools = (d_df['score'] < (Q1 - 1.5 *IQR)) |(d_df['score'] > (Q3 + 1.5 * IQR))
print('number of outliers are:',bools.sum())   #calculating the number of outliers
d_df.shape

In [None]:
# 10. Try cross validation techniques to get better results.
cross_validate(algo,data_U, measures=['RMSE'], cv=3, verbose=False)

7. Report your findings and inferences.
Samsung Galaxy Note5 is the most popular product.
Amazon Customer is the most active author, that is, the one who wrote the most reviews.
Lenovo Vibe K4 Note (White,16GB) was rated by the largest number of authors.
The cross-validation RMSE was 2.5.

11. In which business scenarios should you use popularity-based recommendation systems?
Ans. A popularity-based recommendation system relies on popularity, trends and frequency counts of the most purchased items. It is used by travel companies selling holiday packages in a season, and by Google News and other news websites to show top stories with images.


12. In which business scenarios should you use CF-based recommendation systems?
Ans. Collaborative filtering is used to build intelligent recommender systems that learn to give better recommendations as more information about users is collected. It is a personalised recommender system: recommendations are made based on the past behaviour of the user. Websites such as Amazon, YouTube, and Netflix use collaborative filtering as part of their sophisticated recommendation systems.

13. What other possible methods can you think of that could further improve the recommendations for different users?
Ans. Apart from popularity-based and collaborative filtering, content-based, demographic, utility-based, knowledge-based and hybrid recommendation systems can be used, depending on the users' needs.
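
As one illustration of the hybrid idea, a popularity score and a collaborative filtering estimate can be blended so that new users lean on popularity and active users lean on personalisation. This is only a sketch of one possible weighting scheme. The function `hybrid_score`, its parameter `k` and the numbers are all hypothetical.

```python
def hybrid_score(cf_estimate, popularity_score, n_user_ratings, k=10):
    """Blend a CF estimate with a popularity score.

    The weight on the CF estimate grows with the number of ratings the
    user has given; k (assumed here) controls how quickly it grows.
    """
    weight = n_user_ratings / (n_user_ratings + k)
    return weight * cf_estimate + (1 - weight) * popularity_score

# A brand-new user gets the popularity score unchanged.
print(hybrid_score(cf_estimate=9.0, popularity_score=7.0, n_user_ratings=0))

# A user with 10 ratings gets an even blend of the two scores.
print(hybrid_score(cf_estimate=9.0, popularity_score=7.0, n_user_ratings=10))
```

A scheme like this also gives a fallback for users who have no rating history, where collaborative filtering alone has nothing to work with.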