<b><u>Data Description</u>:</b>  Amazon Reviews data (Electronics dataset). The repository has several 
datasets; for this case study, we use the Electronics 
dataset. 

<b><u>Domain</u>:</b>  E-commerce 

<b><u>Context</u>:</b> Online e-commerce websites such as Amazon and Flipkart use 
different recommendation models to provide different 
suggestions to different users. Amazon currently uses 
item-to-item collaborative filtering, which scales to massive 
data sets and produces high-quality recommendations in 
real time.   
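
To make the idea of item-to-item collaborative filtering concrete, here is a minimal, dependency-free sketch on made-up ratings. It is only an illustration of the general technique (cosine similarity between item rating vectors), not Amazon's production method; the toy data and the helper names `cosine` and `most_similar` are assumptions introduced for this example.

```python
from collections import defaultdict
from math import sqrt

# Toy ratings as (user, item, rating); all values are made up for illustration.
ratings = [
    ("u1", "A", 5), ("u1", "B", 4),
    ("u2", "A", 4), ("u2", "B", 5), ("u2", "C", 1),
    ("u3", "B", 2), ("u3", "C", 5),
]

# One sparse rating vector per item: item -> {user: rating}.
item_vectors = defaultdict(dict)
for user, item, rating in ratings:
    item_vectors[item][user] = rating


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    common = set(a) & set(b)
    dot = sum(a[u] * b[u] for u in common)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(item, n=1):
    """Rank all other items by their similarity to `item`."""
    scores = [(other, cosine(item_vectors[item], item_vectors[other]))
              for other in item_vectors if other != item]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n]


print(most_similar("A"))
```

Items rated similarly by the same users end up close to each other, so a user who liked "A" would be shown its nearest neighbours first.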

<b><u>Objective</u>:</b>  Build a recommendation system to recommend products to 
customers based on their previous ratings for other 
products.

<b><u>Attribute Information </u> :</b>

Input variables:
##### Amazon Electronics Rating data:
1. UserId : Every user identified with a unique id 
2. ProductId : Every product identified with a unique id 
3. Rating : Rating of the corresponding product by the corresponding user
4. Timestamp : Time of the rating ( ignore this column for this exercise)
    
<b> <u>Learning Outcomes </u> :</b>
- Exploratory Data Analysis
- Creating a Recommendation system using real data 
- Collaborative filtering

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

# :::::::::::::::::::::::::::::::::::::Steps and tasks::::::::::::::::::::::::::::::::::::::::::::

##  Read and explore the given dataset. (Rename column/add headers, plot histograms, find data characteristics)

### => Import the necessary libraries :

In [None]:
import numpy as np  
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from collections import defaultdict
from surprise import KNNWithMeans
from surprise import SVD, SVDpp
from surprise import KNNBaseline
from surprise import KNNBasic
from surprise import KNNWithZScore
from surprise import BaselineOnly
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import train_test_split
from surprise.model_selection import cross_validate
from surprise.model_selection import KFold
from surprise.model_selection import GridSearchCV

import time

<b>Comment :</b> 
- Here I have used numpy, pandas, matplotlib and seaborn for EDA and data visualization. 
- I have also used the Surprise library for data splitting, model building and accuracy measurement.
- GridSearchCV is used to find the best parameters.

# ::--------------------------- Exploratory Data Analysis -------------------------------- ::

### => Read the data as a dataframe :- 

In [None]:
start_time = time.time()

df = pd.read_csv("/kaggle/input/ratings-electronics/ratings_Electronics.csv", names=["userId", "productId", "rating", "timestamp"])  
df.head() 

computational_time = time.time() - start_time
print('Done in %0.3fs' %(computational_time))

<b>Comment:</b> Here I have read the ratings data using the read_csv() function of pandas. df is a dataframe. I have used the head() function to display the first 5 records of the dataset.

### => Shape of the data :- 

In [None]:
rows_count, columns_count = df.shape
print('Total Number of rows :', rows_count)
print('Total Number of columns :', columns_count)

<b>Comment:</b> Shape of the dataframe is (7824482, 4).
There are 7824482 rows and 4 columns in the dataset.

### => Data type of each attribute :-

In [None]:
df.dtypes

<b>Comment:</b> By displaying the datatypes of each variable we can see the following:

   -  numeric type       :  rating, timestamp
   -  object type(string):  userId, productId



### => Unique UserId and ProductID :-

In [None]:
unique_userId = df['userId'].nunique()
unique_productId = df['productId'].nunique()
print('Total number of unique Users    : ', unique_userId)
print('Total number of unique Products : ', unique_productId)

### => Checking the presence of missing values :-

In [None]:
sns.heatmap(df.isna(), yticklabels=False, cbar=False, cmap='viridis')

In [None]:
df.apply(lambda x : sum(x.isnull()))

In [None]:
df.isnull().sum()

In [None]:
df.isna().any()

<b>Comment:</b> From the missing value graph above we can see that there are no missing values, which I have also checked by using the isnull() and isna() functions of the dataframe.

### =>  Data Characteristics :-

In [None]:
df_transpose = df.describe().T
df_transpose

<b>Comment : </b>  From above we can see that 
- The mean of rating is less than the median, which indicates that the distribution is negatively skewed.
- The mean of timestamp is close to the median, which indicates that the distribution is roughly symmetric.
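
The mean-versus-median rule of thumb can be checked against the sample skewness directly. The sketch below uses a made-up series, not the real ratings; `toy_ratings` is a hypothetical name introduced for the example.

```python
import pandas as pd

# Made-up ratings concentrated at the high end, with a tail towards low values.
toy_ratings = pd.Series([5, 5, 5, 5, 4, 4, 3, 1])

mean = toy_ratings.mean()
median = toy_ratings.median()
skewness = toy_ratings.skew()  # sample skewness; negative means a longer left tail

print(f"mean={mean:.2f}, median={median:.2f}, skewness={skewness:.2f}")
```

On the full dataframe the same check would be `df['rating'].skew()`.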

### =>  Five point summary of  numerical attributes  :-

In [None]:
df_transpose[['min', '25%', '50%', '75%', 'max']]

### => Checking the presence of outliers :-

In [None]:
plt.figure(figsize=(20,5))
sns.boxplot(data=df, orient='h', palette='Set2', dodge=False)

<b>Observation : </b> From the above boxplot we can see that there are outliers in the timestamp column. I will not fix these outliers, because the timestamp column will be dropped, as stated in the problem statement.
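
A boxplot with default settings flags points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The same rule can be sketched numerically; the values below are made up and the variable names are only illustrative.

```python
import pandas as pd

# Made-up values with one obvious outlier.
values = pd.Series([1, 2, 3, 4, 5, 100])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Whisker limits used by the 1.5 * IQR rule.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(outliers.tolist())
```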

# ::-------------------------------------- Data Visualization ------------------------------------::

### =>  Pair plot that includes all the columns of the data frame :-

In [None]:
start_time = time.time()

sns.pairplot(df, diag_kind= 'kde')

computational_time = time.time() - start_time
print('Done in %0.3fs' %(computational_time))

<b> Comment : </b> From above we can see a tall tower at rating 5, which indicates that most of the customers have given a 5 rating. Very few customers have given a 2 rating. From this we can infer that most of the electronics products are liked by the customers.

### => Checking the share of each of the 5 rating values

In [None]:
df['rating'].value_counts()

In [None]:
rating_counts = pd.DataFrame(df['rating'].value_counts()).reset_index()
rating_counts.columns = ['Labels', 'Ratings']
rating_counts

<b>Comment:</b> The number of 5 ratings in our dataset is higher than that of any other rating, so most people have given a 5 rating to the products. 
The code below shows the ratio among the ratings.

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(15,7))
sns.countplot(x='rating', data=df, ax=ax1)  # pass the column by keyword; recent seaborn versions reject it as a positional argument
ax1.set_xlabel('Rating Distribution', fontsize=10)
ax1.set_ylabel('Count', fontsize=10)


explode = (0.1, 0, 0.1, 0, 0)
ax2.pie(rating_counts["Ratings"], explode=explode, labels=rating_counts.Labels, autopct='%1.2f%%',
        shadow=True, startangle=70)
ax2.axis('equal')
plt.title("Rating Ratio")
plt.legend(rating_counts.Labels, loc=3)
plt.show()

<b>Observation :</b> From the bar plot and pie chart we can clearly see that approximately 55% of the ratings are 5, followed by 4 (approximately 19%). The fewest people have given a 2 rating. One important insight from this is that most of the products are liked by the customers.   
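
The percentages shown in the pie chart can also be computed directly with `value_counts(normalize=True)`. This is a small sketch on made-up data; `toy` is a hypothetical dataframe, and on the real data the call would be applied to `df['rating']`.

```python
import pandas as pd

# Made-up ratings; the real notebook would use df['rating'] instead.
toy = pd.DataFrame({"rating": [5, 5, 5, 4, 4, 1, 3, 2, 5, 5]})

# Share of each rating value in percent, ordered by rating.
share = toy["rating"].value_counts(normalize=True).sort_index() * 100

print(share.round(2))
```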

### => Creating and view the correlation matrix :-

In [None]:
df[['rating', 'timestamp']].corr()  # restrict to numeric columns; recent pandas raises on string columns

In [None]:
corr = df[['rating', 'timestamp']].corr()  # numeric columns only
mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed from NumPy; use the built-in bool
mask[np.triu_indices_from(mask)] = True
plt.figure(figsize=(5,2))
plt.title('Correlation of Attributes', y=1.05, size=10)
sns.heatmap(corr, vmin=-1, cmap='plasma', annot=True, mask=mask, fmt='.2f')

<b>Comment : </b> We have only two numeric attributes, rating and timestamp, and there is no high correlation between them.

#### Dropping timestamp :-

In [None]:
df = df.drop(['timestamp'], axis=1)

Before manipulating the dataset, I copy df to df1 so that the original data does not have to be loaded again.

In [None]:
df1 = df.copy()

In [None]:
df1.head()

# Taking a subset of the dataset to make it less sparse/denser (for example, keep only the users who have given 50 or more ratings) ::-

#### => Taking a subset of users who have given 50 or more ratings :-

In [None]:
users_counts = df1['userId'].value_counts().rename('users_counts')
users_data   = df1.merge(users_counts.to_frame(),
                                left_on='userId',
                                right_index=True)

In [None]:
subset_df = users_data[users_data.users_counts >= 50]
subset_df.head()

<b>Comment : </b> Above we can see only those users who have given 50 or more ratings.

#### => Taking a subset of products which have received 10 or more ratings to overcome the grey sheep problem :-

In [None]:
product_rating_counts = subset_df['productId'].value_counts().rename('product_rating_counts')
product_rating_data   = subset_df.merge(product_rating_counts.to_frame(),
                                left_on='productId',
                                right_index=True)

In [None]:
product_rating_data = product_rating_data[product_rating_data.product_rating_counts >= 10]
product_rating_data.head()

<b>Important : </b> Here I am considering only those products which received at least 10 ratings. A product may have only 1 or 2 ratings, all with the value 5; such products would then appear at the top of the recommendations on very little evidence, which would not be a good recommendation technique.    
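
The two filtering steps above can also be written without the intermediate merges by using `groupby(...).transform('size')`. The following is a sketch of that alternative on made-up data, not the code this notebook runs; the function name `filter_by_activity` and the small thresholds are assumptions chosen for illustration.

```python
import pandas as pd


def filter_by_activity(ratings, min_user_ratings, min_product_ratings):
    """Keep active users first, then products that are rated often enough by them."""
    # Number of ratings per user, repeated on every row of that user.
    user_n = ratings.groupby("userId")["rating"].transform("size")
    active = ratings[user_n >= min_user_ratings]

    # Number of ratings per product, counted within the active-user subset.
    product_n = active.groupby("productId")["rating"].transform("size")
    return active[product_n >= min_product_ratings]


# Made-up ratings; the notebook itself uses thresholds of 50 and 10.
toy = pd.DataFrame({
    "userId":    ["u1", "u1", "u1", "u2", "u2", "u3"],
    "productId": ["A",  "B",  "C",  "A",  "B",  "A"],
    "rating":    [5,    4,    3,    5,    2,    1],
})

dense = filter_by_activity(toy, min_user_ratings=2, min_product_ratings=2)
print(dense)
```

Note that after the product filter some of the retained users can end up with fewer ratings than the user threshold, because some of their rated products were removed.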

In [None]:
amazon_df = product_rating_data.copy()

In [None]:
panda_data = amazon_df.drop(['users_counts', 'product_rating_counts'], axis=1)

In [None]:
panda_data.head()

## Splitting the data randomly into train and test datasets (70/30 ratio) ::-

To load the dataset from a pandas dataframe, we need the load_from_df() method. We also need a Reader object, which is defined below. 

To get the top-K (K = 5) recommendations, I initialize k below.

In [None]:
k = 5

We need to define a Reader object for Surprise to be able to parse the dataframe.

In [None]:
reader = Reader(rating_scale=(1, 5))

In [None]:
surprise_data = Dataset.load_from_df(panda_data[['userId', 'productId', 'rating']], reader)

In [None]:
trainset, testset = train_test_split(surprise_data, test_size=.30, random_state=7)

# Building Popularity Recommender model ::- 

I will create the recommender model in two ways:
- Using mean of product rating
- Using Ranking Based Algorithm

## =>  Using mean of products rating :-

In [None]:
panda_data.groupby('productId')['rating'].mean().head()

In [None]:
panda_data.groupby('productId')['rating'].mean().sort_values(ascending=False).head()

In [None]:
# Mean rating per product, highest first
prod_rating_count = panda_data.groupby('productId')['rating'].mean().sort_values(ascending=False).to_frame()
# Number of ratings per product, aligned on the productId index
prod_rating_count['prod_rating_count'] = panda_data.groupby('productId')['rating'].count()
prod_rating_count.head(k)

In [None]:
basic_poplurity_model = prod_rating_count.sort_values(by=['prod_rating_count'], ascending=False)
basic_poplurity_model.head(k)

<b>Comment : </b> Above is the list of top 5 popular products for the recommendation.

## => Ranking-Based Algorithms  :-

Creating a Product recommender :-

Building Popularity Recommender model(Non-personalised) :-

In [None]:
# Count of userId for each unique product, used as the recommendation score 
panda_data_grouped = panda_data.groupby('productId').agg({'userId': 'count'}).reset_index()
panda_data_grouped.rename(columns = {'userId': 'score'},inplace=True)
panda_data_grouped.head()


In [None]:
# Sort the products by recommendation score (descending), then by productId (ascending) 
panda_data_sort = panda_data_grouped.sort_values(['score', 'productId'], ascending=[False, True]) 
      
# Generate a recommendation rank based upon the score 
panda_data_sort['Rank'] = panda_data_sort['score'].rank(ascending=False, method='first') 
          
#Get the top 5 recommendations 
popularity_recommendations = panda_data_sort.head(k) 
popularity_recommendations 

#### Using popularity based recommender model to make predictions and find recommendations for random list of users with inferences

In [None]:
# Using the popularity-based recommender model to make predictions
import warnings
warnings.filterwarnings('ignore')
def recommend(userId):     
    # Work on a copy so the shared popularity_recommendations table is not modified
    user_recommendations = popularity_recommendations.copy() 
          
    # Add the userID column, for which the recommendations are generated, as the first column 
    user_recommendations.insert(0, 'userID', userId) 
          
    return user_recommendations 

In [None]:
find_recom = [15,121,55,230,344]   # This list is user choice.
for i in find_recom:
    print("Here is the recommendation for the userId: %d\n" %(i))
    print(recommend(i))    
    print("\n") 

<b>Comment : </b> The top 5 popular products are B0088CJT4U, B003ES5ZUU, B000N99BBC, B007WTAJTO and B00829TIEK.
- Since this is a popularity-based recommender model, the recommendations remain the same for all users. We predict the products based on popularity alone, so the result is not personalized to a particular user.

## Building Collaborative Filtering model ::-

For the collaborative filtering model I am going to use <b>SVD, KNNWithMeans </b> and I will also test other algorithms.

In [None]:
cv_results = []  # to store cross validation result 

# ------------------------Matrix Factorization Based Algorithms------------------------------

### => Grid Search :-

Here I am using grid search to find the best hyperparameters for the SVD and SVDpp algorithms.
- n_epochs values : [20, 25] 
- lr_all          : [0.007, 0.009, 0.01]
- reg_all         : [0.4, 0.6]
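
To get a feeling for the cost of this search, the grid can be enumerated by hand. The sketch below only counts the combinations; it assumes 5-fold cross-validation, as used in the search, and does not reproduce how the library runs the fits internally.

```python
from itertools import product

svd_param_grid = {
    "n_epochs": [20, 25],
    "lr_all": [0.007, 0.009, 0.01],
    "reg_all": [0.4, 0.6],
}

# Every combination of one value per hyperparameter.
names = list(svd_param_grid)
combinations = [dict(zip(names, values))
                for values in product(*svd_param_grid.values())]

n_folds = 5  # each combination is fitted once per fold
total_fits = len(combinations) * n_folds

print(len(combinations), "combinations,", total_fits, "fits per algorithm")
```

So each algorithm searched with this grid is trained 60 times, which is worth keeping in mind for slower algorithms such as SVD++.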

In [None]:
svd_param_grid = {'n_epochs': [20, 25], 'lr_all': [0.007, 0.009, 0.01], 'reg_all': [0.4, 0.6]}

svd_gs = GridSearchCV(SVD, svd_param_grid, measures=['rmse', 'mae'], cv=5, n_jobs=5)
svdpp_gs = GridSearchCV(SVDpp, svd_param_grid, measures=['rmse', 'mae'], cv=5, n_jobs=5)

svd_gs.fit(surprise_data)
svdpp_gs.fit(surprise_data)

# best RMSE score
print(svd_gs.best_score['rmse'])
print(svdpp_gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(svd_gs.best_params['rmse'])
print(svdpp_gs.best_params['rmse'])

<b>Comment : </b> Above we have found the best parameters for the SVD and SVDpp algorithms, and we will pass these parameters when creating the models.
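
To make the roles of `n_epochs`, `lr_all` and `reg_all` concrete, here is a minimal pure-Python sketch of biased matrix factorization trained with stochastic gradient descent, in the spirit of the update rule described in the Surprise documentation for SVD. It is a simplified illustration under stated assumptions, not the library's implementation: the function name `train_biased_mf`, the toy data, the default values and the number of factors are all chosen for this example.

```python
import random


def train_biased_mf(ratings, n_factors=2, n_epochs=20, lr_all=0.005,
                    reg_all=0.02, seed=0):
    """Fit r_hat(u, i) = mu + b_u + b_i + q_i . p_u by SGD and return a predictor."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    bu = {u: 0.0 for u in users}  # user biases
    bi = {i: 0.0 for i in items}  # item biases
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        return mu + bu[u] + bi[i] + sum(pf * qf for pf, qf in zip(p[u], q[i]))

    for _ in range(n_epochs):            # n_epochs: passes over the ratings
        for u, i, r in ratings:
            err = r - predict(u, i)
            # lr_all scales every step; reg_all shrinks parameters towards zero.
            bu[u] += lr_all * (err - reg_all * bu[u])
            bi[i] += lr_all * (err - reg_all * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr_all * (err * qif - reg_all * puf)
                q[i][f] += lr_all * (err * puf - reg_all * qif)
    return predict


# Made-up ratings as (user, item, rating).
toy = [("u1", "A", 5), ("u1", "B", 3), ("u2", "A", 4),
       ("u2", "C", 1), ("u3", "B", 2), ("u3", "C", 5)]
model = train_biased_mf(toy, n_epochs=200, lr_all=0.05)
print(round(model("u1", "A"), 2))
```

A larger `reg_all` keeps biases and factors small, which reduces overfitting at the price of a worse fit to the training ratings; a larger `lr_all` takes bigger steps per update.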

### =>  SVD :-

In [None]:
start_time = time.time()

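# Creating the model. Note: lr_all=0.005 and reg_all=0.2 are not part of the grid searched above,
# so these are manually chosen values rather than svd_gs.best_params['rmse'].
svd_model = SVD(n_epochs=20, lr_all=0.005, reg_all=0.2)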

# Training the algorithm on the trainset
svd_model.fit(trainset)


# Predicting for test set
predictions_svd = svd_model.test(testset)

# Evaluating RMSE, MAE of algorithm SVD on 5 split(s) by cross validation
svd_cv = cross_validate(svd_model, surprise_data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

# Storing Crossvalidation Results in dataframe
svd_df = pd.DataFrame.from_dict(svd_cv)
svd_described = svd_df.describe()
cv_results = pd.DataFrame([['SVD', svd_described['test_rmse']['mean'], svd_described['test_mae']['mean'], 
                           svd_described['fit_time']['mean'], svd_described['test_time']['mean']]],
                            columns = ['Model', 'RMSE', 'MAE', 'Fit Time', 'Test Time'])


# get RMSE
print("\n\n==================== Model Evaluation ===============================")
accuracy.rmse(predictions_svd, verbose=True)
print("=====================================================================")
computational_time = time.time() - start_time
print('\n Computational Time : %0.3fs' %(computational_time))
cv_results

<b> Comment :</b> Here we can see that the RMSE on the test set and the mean RMSE from cross-validation on the complete dataset are almost the same, which suggests that the model performs consistently across different splits of the data.
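
For reference, the two error measures reported above can be written out in a few lines. This is a plain-Python sketch on made-up (true rating, estimated rating) pairs; the helper names `rmse` and `mae` are defined here and are not the Surprise functions.

```python
from math import sqrt


def rmse(pairs):
    """Root mean squared error over (true, estimated) rating pairs."""
    return sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true, estimated) rating pairs."""
    return sum(abs(true - est) for true, est in pairs) / len(pairs)


# Made-up (true, estimated) pairs.
toy_pairs = [(5, 4), (3, 3), (1, 3), (4, 5)]

print(round(rmse(toy_pairs), 4), mae(toy_pairs))
```

RMSE penalizes large errors more strongly than MAE, so it is never smaller than MAE on the same predictions.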

### => SVD++ :-

In [None]:
start_time = time.time()

# Creating Model using best parameters
svdpp_model = SVDpp(n_epochs=25, lr_all=0.01, reg_all=0.4)

# Training the algorithm on the trainset
svdpp_model.fit(trainset)


# Predicting for test set
predictions_svdpp = svdpp_model.test(testset)

# Evaluating RMSE, MAE of algorithm SVDpp on 5 split(s) by cross validation
svdpp_cv = cross_validate(svdpp_model, surprise_data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

# Storing Crossvalidation Results in dataframe
svdpp_df = pd.DataFrame.from_dict(svdpp_cv)
svdpp_described = svdpp_df.describe()
svdpp_cv_results = pd.DataFrame([['SVDpp', svdpp_described['test_rmse']['mean'], svdpp_described['test_mae']['mean'], 
                           svdpp_described['fit_time']['mean'], svdpp_described['test_time']['mean']]],
                            columns = ['Model', 'RMSE', 'MAE', 'Fit Time', 'Test Time'])

cv_results = pd.concat([cv_results, svdpp_cv_results], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

# get RMSE
print("\n\n==================== Model Evaluation ===============================")
accuracy.rmse(predictions_svdpp, verbose=True)
print("=====================================================================")
computational_time = time.time() - start_time
print('\n Computational Time : %0.3fs' %(computational_time))
cv_results

# ---------------------------------- k-NN Based Algorithms ----------------------------------------

### => Grid Search :-

Here I am using grid search to find the best hyperparameters for the <b>KNNBasic</b>, <b>KNNWithMeans</b> and <b>KNNWithZScore</b> algorithms.
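
As a reminder of what the k-NN models compute, the sketch below shows the KNNWithMeans-style prediction for one user-item pair: the user's own mean rating plus a similarity-weighted average of how far each neighbour's rating deviates from that neighbour's mean. The function name `knn_with_means_predict` and the numbers are made up for illustration; neighbour selection and similarity computation are left out.

```python
def knn_with_means_predict(user_mean, neighbours):
    """Predict a rating from (similarity, neighbour_rating, neighbour_mean) triples."""
    numerator = sum(sim * (rating - mean) for sim, rating, mean in neighbours)
    denominator = sum(sim for sim, _, _ in neighbours)
    if denominator == 0:
        # No usable neighbours: fall back to the user's own mean rating.
        return user_mean
    return user_mean + numerator / denominator


# Made-up example: the target user rates 4.0 on average and has two neighbours.
prediction = knn_with_means_predict(
    user_mean=4.0,
    neighbours=[(0.9, 5, 4.5), (0.3, 2, 3.0)],
)
print(round(prediction, 3))
```

The parameter `k` in the grid limits how many neighbours enter this sum, and `sim_options` controls how the similarities are computed.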

In [None]:
start_time = time.time()

knn_param_grid = {'bsl_options': {'method': ['als', 'sgd'],
                              'reg': [1, 2]},
              'k': [15, 20, 25, 30, 40, 50, 60],
              'sim_options': {'name': ['msd', 'cosine', 'pearson_baseline']}
              }

knnbasic_gs = GridSearchCV(KNNBasic, knn_param_grid, measures=['rmse', 'mae'], cv=5, n_jobs=5)
knnmeans_gs = GridSearchCV(KNNWithMeans, knn_param_grid, measures=['rmse', 'mae'], cv=5, n_jobs=5)
knnz_gs     = GridSearchCV(KNNWithZScore, knn_param_grid, measures=['rmse', 'mae'], cv=5, n_jobs=5)


knnbasic_gs.fit(surprise_data)
knnmeans_gs.fit(surprise_data)
knnz_gs.fit(surprise_data)

# best RMSE score
print(knnbasic_gs.best_score['rmse'])
print(knnmeans_gs.best_score['rmse'])
print(knnz_gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(knnbasic_gs.best_params['rmse'])
print(knnmeans_gs.best_params['rmse'])
print(knnz_gs.best_params['rmse'])

computational_time = time.time() - start_time
print('\nComputational Time : %0.3fs' %(computational_time))

### => KNNBasic :-

In [None]:
start_time = time.time()

# Creating Model using best parameters
knnBasic_model = KNNBasic(k=50, sim_options={'name': 'cosine', 'user_based': False})

# Training the algorithm on the trainset
knnBasic_model.fit(trainset)

# Predicting for test set
prediction_knnBasic = knnBasic_model.test(testset)

# Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s)
knnBasic_cv = cross_validate(knnBasic_model, surprise_data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

# Storing Crossvalidation Results in dataframe
knnBasic_df = pd.DataFrame.from_dict(knnBasic_cv)
knnBasic_described = knnBasic_df.describe()
knnBasic_cv_results = pd.DataFrame([['KNNBasic', knnBasic_described['test_rmse']['mean'], knnBasic_described['test_mae']['mean'], 
                           knnBasic_described['fit_time']['mean'], knnBasic_described['test_time']['mean']]],
                            columns = ['Model', 'RMSE', 'MAE', 'Fit Time', 'Test Time'])

cv_results = pd.concat([cv_results, knnBasic_cv_results], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

# get RMSE
print("\n\n==================== Model Evaluation ===============================")
accuracy.rmse(prediction_knnBasic, verbose=True)
print("=====================================================================")

computational_time = time.time() - start_time
print('\n Computational Time : %0.3fs' %(computational_time))
cv_results


### => KNNWithZScore :-

In [None]:
start_time = time.time()

# Creating Model using best parameters
knnZscore_model = KNNWithZScore(k=60, sim_options={'name': 'cosine', 'user_based': False})

# Training the algorithm on the trainset
knnZscore_model.fit(trainset)

# Predicting for testset
prediction_knnZscore = knnZscore_model.test(testset)

# Evaluating RMSE, MAE of algorithm KNNWithZScore on 5 split(s)
knnZscore_cv = cross_validate(knnZscore_model, surprise_data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

# Storing Crossvalidation Results in dataframe
knnZscore_df = pd.DataFrame.from_dict(knnZscore_cv)
knnZscore_described = knnZscore_df.describe()
knnZscore_cv_results = pd.DataFrame([['KNNWithZScore', knnZscore_described['test_rmse']['mean'], knnZscore_described['test_mae']['mean'], 
                           knnZscore_described['fit_time']['mean'], knnZscore_described['test_time']['mean']]],
                            columns = ['Model', 'RMSE', 'MAE', 'Fit Time', 'Test Time'])

cv_results = pd.concat([cv_results, knnZscore_cv_results], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

# get RMSE
print("\n\n==================== Model Evaluation ===============================")
accuracy.rmse(prediction_knnZscore, verbose=True)
print("=====================================================================")

computational_time = time.time() - start_time
print('\n Computational Time : %0.3fs' %(computational_time))
cv_results


### => KNNWithMeans User-User 

In [None]:
start_time = time.time()

# Creating Model using best parameters
knnMeansUU_model = KNNWithMeans(k=60, sim_options={'name': 'cosine', 'user_based': True})

# Training the algorithm on the trainset
knnMeansUU_model.fit(trainset)

# Predicting for testset
prediction_knnMeansUU = knnMeansUU_model.test(testset)

# Evaluating RMSE, MAE of algorithm KNNWithMeans User-User on 5 split(s)
knnMeansUU_cv = cross_validate(knnMeansUU_model, surprise_data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

# Storing Crossvalidation Results in dataframe
knnMeansUU_df = pd.DataFrame.from_dict(knnMeansUU_cv)
knnMeansUU_described = knnMeansUU_df.describe()
knnMeansUU_cv_results = pd.DataFrame([['KNNWithMeans User-User', knnMeansUU_described['test_rmse']['mean'], knnMeansUU_described['test_mae']['mean'], 
                           knnMeansUU_described['fit_time']['mean'], knnMeansUU_described['test_time']['mean']]],
                            columns = ['Model', 'RMSE', 'MAE', 'Fit Time', 'Test Time'])

cv_results = pd.concat([cv_results, knnMeansUU_cv_results], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

# get RMSE
print("\n\n==================== Model Evaluation ===============================")
accuracy.rmse(prediction_knnMeansUU, verbose=True)
print("=====================================================================")

computational_time = time.time() - start_time
print('\n Computational Time : %0.3fs' %(computational_time))
cv_results


### => KNNWithMeans Item-Item 

In [None]:
start_time = time.time()

# Creating Model using best parameters
knnMeansII_model = KNNWithMeans(k=60, sim_options={'name': 'cosine', 'user_based': False})

# Training the algorithm on the trainset
knnMeansII_model.fit(trainset)

# Predicting for testset
prediction_knnMeansII = knnMeansII_model.test(testset)

# Evaluating RMSE, MAE of algorithm KNNWithMeans Item-Item on 5 split(s)
knnMeansII_cv = cross_validate(knnMeansII_model, surprise_data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

# Storing Crossvalidation Results in dataframe
knnMeansII_df = pd.DataFrame.from_dict(knnMeansII_cv)
knnMeansII_described = knnMeansII_df.describe()
knnMeansII_cv_results = pd.DataFrame([['KNNWithMeans Item-Item', knnMeansII_described['test_rmse']['mean'], knnMeansII_described['test_mae']['mean'], 
                           knnMeansII_described['fit_time']['mean'], knnMeansII_described['test_time']['mean']]],
                            columns = ['Model', 'RMSE', 'MAE', 'Fit Time', 'Test Time'])

cv_results = pd.concat([cv_results, knnMeansII_cv_results], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

# get RMSE
print("\n\n==================== Model Evaluation ===============================")
accuracy.rmse(prediction_knnMeansII, verbose=True)
print("=====================================================================")

computational_time = time.time() - start_time
print('\n Computational Time : %0.3fs' %(computational_time))
cv_results


## ----------------------------- Comparison of all algorithms on RMSE and MAE ------------------------

In [None]:
x_algo = ['KNN Basic', 'KNNWithMeans-User-User', 'KNNWithMeans-Item-Item', 'KNN ZScore', 'SVD', 'SVDpp']
all_algos_cv = [knnBasic_cv, knnMeansUU_cv, knnMeansII_cv, knnZscore_cv, svd_cv, svdpp_cv]

rmse_cv = [round(res['test_rmse'].mean(), 4) for res in all_algos_cv]
mae_cv  = [round(res['test_mae'].mean(), 4) for res in all_algos_cv]

plt.figure(figsize=(20,15))

plt.subplot(2, 1, 1)
plt.title('Comparison of Algorithms on RMSE', loc='center', fontsize=15)
plt.plot(x_algo, rmse_cv, label='RMSE', color='darkgreen', marker='o')
plt.xlabel('Algorithms', fontsize=15)
plt.ylabel('RMSE Value', fontsize=15)
plt.legend()
plt.grid(ls='dashed')

plt.subplot(2, 1, 2)
plt.title('Comparison of Algorithms on MAE', loc='center', fontsize=15)
plt.plot(x_algo, mae_cv, label='MAE', color='navy', marker='o')
plt.xlabel('Algorithms', fontsize=15)
plt.ylabel('MAE Value', fontsize=15)
plt.legend()
plt.grid(ls='dashed')

plt.show()

cv_results

## Evaluation Results :-

From the algorithm comparison plots above we can infer the following:
- RMSE : SVD++ gives the best RMSE value with parameters {'n_epochs': 25, 'lr_all': 0.01, 'reg_all': 0.4} and SVD gives the second best RMSE with parameters {'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.2}.
- MAE : Here SVD++ and KNNWithMeans both give the best MAE value.
- SVD++ has the best RMSE among the matrix factorization based algorithms.
- KNNWithMeans gives the best RMSE among the k-NN based algorithms.
- <b> Important : </b> If we compare SVD and SVD++, we notice that the RMSE and MAE values of SVD differ only slightly from those of SVD++, but the fit time and test time taken by SVD are significantly lower (about 12 times) than for SVD++. So we will proceed with SVD to get the top-k recommendations.


## top - K ( K = 5) recommendations ::-

- Here I am using SVD algorithm to get the top 5 recommendations of new products for each user.

In [None]:
def get_top_n(predictions, n=k):
    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

top_n = get_top_n(predictions_svd, n=k)
top_n

<b> Comment : </b> From the above list we can see that the model recommends up to 5 products to each user. For some users it recommends fewer than 5 products. This happens because the list is built from the test-set predictions, so a user with fewer than 5 ratings in the test set gets fewer than 5 entries.
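
Because the list above is built from test-set predictions, it only ranks products that a user actually rated in the test split. To recommend products a user has not rated yet, one would score every unseen (user, product) pair instead; in Surprise the usual starting point for this is `trainset.build_anti_testset()`. The following dependency-free sketch shows the idea on made-up data. The helper names `unseen_pairs` and `top_n_unseen`, and the stand-in `estimate` function, are assumptions for illustration and would be replaced by a fitted model's predictions.

```python
def unseen_pairs(train_ratings):
    """All (user, item) pairs that do not appear in the training ratings."""
    users = sorted({u for u, _, _ in train_ratings})
    items = sorted({i for _, i, _ in train_ratings})
    seen = {(u, i) for u, i, _ in train_ratings}
    return [(u, i) for u in users for i in items if (u, i) not in seen]


def top_n_unseen(train_ratings, estimate, n=5):
    """Rank each user's unseen items by estimated rating and keep the best n."""
    top = {}
    for u, i in unseen_pairs(train_ratings):
        top.setdefault(u, []).append((i, estimate(u, i)))
    for u in top:
        top[u].sort(key=lambda pair: pair[1], reverse=True)
        top[u] = top[u][:n]
    return top


# Made-up training ratings and a stand-in for a fitted model's estimate.
train = [("u1", "A", 5), ("u1", "B", 3), ("u2", "A", 4), ("u2", "C", 2)]
fake_scores = {"B": 4.2, "C": 3.1}
recommendations = top_n_unseen(train, lambda u, i: fake_scores[i], n=5)
print(recommendations)
```

The number of unseen pairs grows with users times products, so on a large dataset this is usually done for a selected set of users only.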

### => Precision and recall at k=5

In [None]:
def precision_recall_at_k(predictions, k=5, threshold=3.5):
    '''Return precision and recall at k metrics for each user.'''

    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[:k])

        # Precision@K: Proportion of recommended items that are relevant
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 1

        # Recall@K: Proportion of relevant items that are recommended
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 1

    return precisions, recalls


kf = KFold(n_splits=5)
svd_model = SVD(n_epochs=20, lr_all=0.005, reg_all=0.2)

# Fold-specific names keep the 70/30 trainset/testset defined earlier intact
for fold_trainset, fold_testset in kf.split(surprise_data):
    svd_model.fit(fold_trainset)
    predictions = svd_model.test(fold_testset)
    precisions, recalls = precision_recall_at_k(predictions, k=5, threshold=3.5)

    # Precision and recall can then be averaged over all users
    print('Precision : ', sum(precisions.values()) / len(precisions))
    print('Recall    : ', sum(recalls.values()) / len(recalls))


<b>Comment : </b> I have calculated precision and recall at k=5. Precision and recall are binary metrics used to evaluate models with binary output. Thus we need a way to translate our numerical problem (ratings usually from 1 to 5) into a binary problem (relevant and not relevant items). To do the translation I have assumed that any true rating of 3.5 or above corresponds to a <b>relevant</b> item and any true rating below 3.5 is <b>irrelevant</b>. 
- My precision at 5 in a top-5 recommendation problem is almost 87%. This means that 87% of the recommendations are relevant to the users.
- My recall at 5 in a top-5 recommendation problem is almost 83%. This means that 83% of the total number of relevant products appear in the top-k result.
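
A small worked example may help to read these two numbers. The sketch below applies the same per-user logic as `precision_recall_at_k` to one made-up user; the function name `precision_recall_for_user` and the (estimated, true) pairs are assumptions introduced for illustration.

```python
def precision_recall_for_user(user_ratings, k=5, threshold=3.5):
    """Precision@k and recall@k for one user's (estimated, true) rating pairs."""
    ranked = sorted(user_ratings, key=lambda pair: pair[0], reverse=True)

    # Relevant items: true rating at or above the threshold.
    n_rel = sum(true_r >= threshold for _, true_r in ranked)
    # Recommended items: estimated rating at or above the threshold, within the top k.
    n_rec_k = sum(est >= threshold for est, _ in ranked[:k])
    # Items in the top k that are both recommended and relevant.
    n_rel_and_rec_k = sum(true_r >= threshold and est >= threshold
                          for est, true_r in ranked[:k])

    precision = n_rel_and_rec_k / n_rec_k if n_rec_k else 1
    recall = n_rel_and_rec_k / n_rel if n_rel else 1
    return precision, recall


# Made-up (estimated, true) ratings for a single user.
toy_user = [(4.8, 5), (4.5, 3), (4.1, 4), (3.9, 2), (3.6, 5), (3.2, 4), (2.0, 1)]
precision, recall = precision_recall_for_user(toy_user, k=5, threshold=3.5)
print(precision, recall)
```

Here 3 of the 5 recommended items are relevant (precision 0.6), and 3 of the user's 4 relevant items made it into the top 5 (recall 0.75).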

## Summary ::-

<b>Insight : </b>
- I have done EDA to understand the data and found that most of the customers have given a 5 rating. This suggests that customers are largely satisfied with the electronics products sold on Amazon.  
- I have taken a subset of the data based on users who have given 50 or more ratings and products which received 10 or more ratings to overcome the grey sheep problem.
- In the <b>Popularity Model</b> I have shown the top 5 recommended products irrespective of users. This means the same top 5 products will be recommended to each user.
- I have used 'Matrix Factorization Based Algorithms' & 'k-NN Based Algorithms' to build the <b>Collaborative Filtering model</b>.
- We have also seen that KNNWithMeans performs well compared to the other k-NN based algorithms.
- I found that SVD++ gave the lowest RMSE, which is slightly better than SVD, but the computational time of SVD++ is 12 times greater than that of SVD. Hence I have chosen SVD to get the recommended products.
    - SVD with parameters   => Number of Epochs = 20, Learning Rate= 0.005, Regularization Term = 0.2
    - SVD++ with parameters => Number of Epochs = 25, Learning Rate= 0.01, Regularization Term = 0.4
- I have computed the precision, which is almost 87%. We can interpret this as 87% of my recommendations being actually relevant to the user.
- For recall, we can interpret that 83% of the relevant items were recommended in the top-k items.