#### Pre-steps 1: Import the necessary libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
from surprise import KNNWithMeans
from surprise import Dataset, Reader
from surprise import accuracy
from surprise.model_selection import train_test_split
from surprise.model_selection import cross_validate
from surprise.model_selection import GridSearchCV
from collections import defaultdict

%matplotlib inline
sns.set(style="darkgrid",color_codes=True)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session




#### Step 1: Read and explore the given dataset.

In [None]:
# load the dataset
rdata = pd.read_csv('/kaggle/input/amazon-product-reviews/ratings_Electronics (1).csv',names=['userid','productid','rating','timestamp'])

In [None]:
# lets make a copy of the data so that all the transformation is done on the copy and not on the main dataset
t1data=rdata.copy()

In [None]:
# browse through the first and last few rows
t1data

E-commerce websites such as Amazon and Flipkart use recommendation models to tailor suggestions to each user. Amazon uses item-to-item collaborative filtering, which scales to massive datasets and produces high-quality recommendations in real time.

Dataset:<br>
- The dataset comprises 7,824,482 rows, each recording a rating that a user gave to a product at a particular time.

Objective of the project:<br>
- Build a recommendation system that recommends products to customers based on their previous ratings of other products.

In [None]:
## identifying the range of the ratings
np.sort(t1data['rating'].unique())

The ratings are in the range of [1,5]

In [None]:
### analysing the first few records 
t1data.head()

In [None]:
# dropping the timestamp column since it adds little value for this analysis
t1data=t1data.drop("timestamp",axis=1)

In [None]:
t1data.head()

In [None]:
t1data.info()

- There are 3 columns. 
  - user id and product id are of type object while rating is of type float

In [None]:
## count of each attribute in the dataset

unique_users = t1data['userid'].nunique()
unique_pdts = t1data['productid'].nunique()
print('Total number of users is: ',unique_users,'\n')
print('Total number of products is: ',unique_pdts,'\n')

In [None]:
### lets analyse the spread of data
t1data.describe().T

- The ratings range over [1,5] with a median of 5. The mean is lower than the median, which suggests that the distribution is skewed to the left (a long tail toward the low ratings).
There might be a few outliers on the low end; we will plot the distribution of ratings to confirm this.
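
As a quick illustration of the mean-versus-median reasoning, the sketch below uses a small made-up list of ratings (not values from the dataset) that is dominated by 5s. The names `toy_ratings`, `toy_mean` and `toy_median` are hypothetical and exist only for this example.

```python
from statistics import mean, median

# Hypothetical toy ratings, dominated by 5s (illustrative values, not real data)
toy_ratings = [5, 5, 5, 5, 5, 4, 4, 3, 1, 1]

toy_mean = mean(toy_ratings)      # pulled down by the few low ratings
toy_median = median(toy_ratings)  # depends only on the middle of the sorted list

# A mean below the median is a common sign of left (negative) skew
left_skewed = toy_mean < toy_median
print(toy_mean, toy_median, left_skewed)
```

The few 1-star ratings drag the mean down while the median stays near the top of the scale, which is the pattern described above.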

In [None]:
# Identify duplicate records in the data.
# It is important to check for and remove duplicates;
# otherwise the model may report overly optimistic / pessimistic performance results.
dupes = t1data.duplicated()
n_dupes = int(dupes.sum())
print('The number of duplicates in the dataset is:', n_dupes)
print('There are no duplicates in the dataset' if n_dupes == 0 else 'Duplicates found; consider dropping them')

In [None]:
# checking if there are any null values
t1data.isnull().any()

Clearly there are no null values

In [None]:
a=t1data.groupby('rating')['rating'].count()

In [None]:
# Attributes in the Group
Atr1g1='userid'
Atr2g1='productid'
Atr3g1='rating'
data=t1data

In [None]:
##EDA: Spread
# fig, ax = plt.subplots(1,2,figsize=(16,8)) 
plt.figure(figsize=(8,6))
sns.histplot(data[Atr3g1], kde=True);  # distplot is deprecated since seaborn 0.11; histplot is its replacement

In [None]:
# EDA: count of ratings:
plt.figure(figsize=(8,6))
sns.countplot(x=data[Atr3g1]);  # pass the column explicitly as x

The countplot above shows that a large share of the ratings are 5.
This alone can be misleading: some products may have a high mean rating that rests on only a handful of ratings.
We will try to clear out this noise in the next sections.

### Step 2: Take a subset of the dataset to make it less sparse (denser). (For example, keep only the users who have given 50 or more ratings.)

A popularity-based recommender that ranks by rating alone does not consider how many people rated a product: a product that 1 person rated 5 looks just as good as a product that 500 people rated 5.
Hence, a single individual's rating can sway the recommendations made to every new user.<br>
To limit this effect and to make the data denser, we keep only the users who have given 50 or more ratings.
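
The filtering step can be sketched on a toy frame first. This is a minimal illustration, assuming a made-up DataFrame `toy` and a hypothetical threshold `min_ratings = 2` in place of the 50 used in this notebook; the column names mirror the ones used here.

```python
import pandas as pd

# Hypothetical toy ratings (illustrative values, not real data)
toy = pd.DataFrame({
    "userid": ["u1", "u1", "u1", "u2", "u3", "u3"],
    "productid": ["p1", "p2", "p3", "p1", "p2", "p3"],
    "rating": [5.0, 4.0, 3.0, 5.0, 2.0, 4.0],
})

min_ratings = 2  # hypothetical threshold; the notebook uses 50

# For every row, look up how many ratings that row's user has given
ratings_per_row_user = toy.groupby("userid")["userid"].transform("size")

# Keep only the rows that belong to sufficiently active users
dense = toy[ratings_per_row_user >= min_ratings].reset_index(drop=True)
print(dense)
```

User `u2` has a single rating and is dropped, while `u1` and `u3` are kept. With a threshold of 50, `>= 50` selects the same rows as `> 49`.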

In [None]:
t2data=t1data.copy()
t2data = t2data[t2data.groupby('userid')['userid'].transform('size') > 49]
t2data=pd.DataFrame(t2data)

In [None]:
t2data=t2data.reset_index(drop=True)

In [None]:
t2data.head()

In [None]:
shape_t2data=t2data.shape
print('The shape of the new dataframe is',shape_t2data,'which means there are',shape_t2data[0],'rows of ratings and',shape_t2data[1],'attributes of userid, productid and rating.')

In [None]:
## lets check the count of ratings given by the users
ratings_per_user = t2data.groupby(by='userid')['rating'].count().sort_values(ascending=False)
ratings_per_user

There are around 1,540 users, and the number of ratings per user ranges from 50 to 520.

### Step 3: Split the data randomly into train and test dataset. ( For example, split it in 70/30 ratio)

In [None]:
reader = Reader(rating_scale=(1, 5))

In [None]:
t3data=Dataset.load_from_df(t2data[['userid','productid','rating']],reader)
t3data

In [None]:
trainset, testset = train_test_split(t3data, test_size=.30, random_state=1)
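
To make the 70/30 split concrete, here is a minimal pure-Python sketch of a seeded random split. It is not Surprise's implementation; `split_ratings` and the toy data are hypothetical names used only for illustration.

```python
import random

# Hypothetical toy ratings: (userid, productid, rating)
toy_ratings = [(f"u{i % 4}", f"p{i}", float(1 + i % 5)) for i in range(10)]

def split_ratings(ratings, test_size=0.30, seed=1):
    """Shuffle a copy of the ratings and cut it into train and test parts."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

toy_train, toy_test = split_ratings(toy_ratings)
print(len(toy_train), len(toy_test))
```

Every rating lands in exactly one of the two parts, and fixing the seed plays the same role as `random_state=1` above.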

In [None]:
print(type(testset))
print(type(trainset))

### Step 4: Build Popularity Recommender model.

A popularity-based recommendation system follows the trend: it recommends the items that are most popular right now. For example, if a product is bought by almost every new user, the system is likely to suggest that product to a user who has just signed up.

As discussed in the previous step, popularity-based recommendations that rely on ratings alone don't consider how many people rated a product, so they can easily be swayed by a small number of customers. To help with that, we reduced our dataset to include only those customers who have given 50 or more ratings.
In this section, we build a popularity-based recommendation system and arrive at the products that can be recommended to new customers.

Let's build the model.

In [None]:
# First we will group by product ids and then display mean ratings for the products. For better visualization we will display first 5 records.
t2data.groupby('productid')['rating'].mean().head()

- The first product id, ending in 647, has a mean rating of 5; the second, ending in 813, has a mean rating of 3.
- The third product id, ending in 998, has a mean rating of 2.5.
- All of these ratings are inconclusive, since we don't know how many users gave them.
- It could be that only one user rated the first product (ending in 647). Recommending it on that basis alone might not be correct.

In [None]:
## Next we want to look at which products have the highest rating. To do so, we sort the product ids by mean rating
## and display the top 10 products.
# This analysis is also inconclusive, since top ratings add little value without the count.
t2data.groupby('productid')['rating'].mean().sort_values(ascending=False).head(10)

Many products have a mean rating of 5; however, the analysis is not conclusive, since we don't know how many users rated these products.

In [None]:
## Next lets try and analyse the products which have been rated the most
t2data.groupby('productid')['rating'].count().sort_values(ascending=False).head()

As seen above, the product id ending in T4U has been rated the most, with 206 ratings; the product id ending in ZUU is the second most rated product.
This alone does not give us recommendations either, since it doesn't tell us how these products were rated. So next we create a dataframe with 2 columns:
- the mean rating
- the count of ratings

That will help us build recommendations for any new user who logs into our website.

In [None]:
t2data_product_ratings =pd.DataFrame(t2data.groupby('productid')['rating'].mean())
t2data_product_ratings['ratings_count'] = pd.DataFrame(t2data.groupby('productid')['rating'].count())
t2data_product_ratings.head()

As seen above, even though the product ending in 647 has a high rating, there is only 1 rating against it. This product might therefore not be popular and can't safely be recommended to other users.
Next, we sort the products by rating count in descending order to see the most-rated products alongside their mean ratings.

In [None]:
t2data_product_ratings.sort_values(by='ratings_count',ascending=False)

The dataframe above is still inconclusive: it doesn't tell us whether, say, product 1 with a lower rating but many votes should rank ahead of product 2 with a higher rating but fewer votes.
Hence, we add another column to this dataframe and call it score. The score is the product of the mean rating and the rating count, which is the same as the sum of all the ratings a product received. Sorting by score in descending order gives us the top recommendations.

In [None]:
t2data_product_ratings['score'] = t2data_product_ratings['rating']*t2data_product_ratings['ratings_count']
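
To see how this score behaves, the sketch below applies the same idea to a made-up frame. The data and the names `toy`, `stats` and `ranked` are hypothetical; only the column names mirror this notebook.

```python
import pandas as pd

# Hypothetical toy ratings (illustrative values, not real data)
toy = pd.DataFrame({
    "productid": ["A", "B", "B", "B", "C", "C"],
    "rating": [5.0, 4.0, 4.0, 5.0, 2.0, 3.0],
})

# Mean rating and number of ratings per product
stats = toy.groupby("productid")["rating"].agg(rating="mean", ratings_count="count")

# Score = mean rating x rating count, i.e. the sum of the product's ratings
stats["score"] = stats["rating"] * stats["ratings_count"]

ranked = stats.sort_values(by="score", ascending=False)
print(ranked)
```

Product `A` has a perfect mean from a single rating, yet product `B`, with three ratings, ranks first. The design choice to note is that this score favours heavily rated products, so a widely rated product with a middling mean can outrank a well-liked one with fewer ratings.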

In [None]:
# jointplot creates its own figure, so the size is set through `height` rather than plt.figure
sns.jointplot(x='rating', y='ratings_count', data=t2data_product_ratings, alpha=0.4, height=8)

In [None]:
t2data_product_ratings.sort_values(by='score',ascending=False)

In [None]:
print('the top 5 recommendations are:') 
t2data_product_ratings.sort_values(by='score',ascending=False).head()

As seen in the dataframe above, the top 5 popular recommendations are:
1. B003ES5ZUU
2. B0088CJT4U
3. B000N99BBC
4. B007WTAJTO
5. B00829TIEK

However, the problem with a popularity-based recommendation system is that it offers no personalization: even when we know a user's behaviour, we cannot tailor the recommendations accordingly.

### Step 5: Build Collaborative Filtering model.

Collaborative filtering addresses the limitations of popularity-based recommendation systems.
It uses similarities between users and items to provide recommendations.
For example, a collaborative filtering model can recommend a product to user X based on the interests of a similar user Y.

In [None]:
### Lets build the model

In [None]:
data = t3data

We will use a KNN algorithm for prediction. First we fit it with the default parameters and check the RMSE. After that, we tune the hyperparameters to arrive at the best parameters for our model.
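
For intuition, the mean-centred prediction behind a k-NN model of this kind can be sketched as follows: start from the target user's mean rating and add the similarity-weighted average of how far each neighbour's rating of the item deviates from that neighbour's own mean. The function `predict_with_means` and all numbers below are hypothetical; this is a simplified sketch, not the library's code.

```python
def predict_with_means(user_mean, neighbours):
    """Mean-centred k-NN estimate.

    neighbours: list of (similarity, neighbour_rating, neighbour_mean) tuples
    for the neighbours that have rated the item.
    """
    weighted_dev = sum(sim * (r - mu) for sim, r, mu in neighbours)
    total_sim = sum(sim for sim, _, _ in neighbours)
    if total_sim == 0:
        return user_mean  # no usable neighbours: fall back to the user's mean
    return user_mean + weighted_dev / total_sim

# Hypothetical neighbours: (similarity, rating of the item, neighbour's mean rating)
toy_neighbours = [(0.9, 5.0, 4.5), (0.3, 2.0, 3.0)]
toy_pred = predict_with_means(4.0, toy_neighbours)
print(round(toy_pred, 3))
```

The close neighbour rated the item above their own average and the distant one rated it below, so the estimate ends up slightly above the target user's mean of 4.0.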

In [None]:
algo_knn = KNNWithMeans()
algo_knn.fit(trainset)

In [None]:
predictions_knn = algo_knn.test(testset)

In [None]:
# get RMSE
print("User-based Model : Test Set")
accuracy.rmse(predictions_knn, verbose=True)

We will use grid-search to arrive at the best hyper parameters.

In [None]:
## We could also use item-item collaborative filtering. However, every time we tried it, Google Colab crashed with out-of-memory errors.
# We tried executing it on a local machine as well, but with no luck. Another option would be truncating the data to reduce the memory requirements,
# but that approach did not seem appropriate here.

In [None]:

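sim_options = {
    "name": ["msd", "cosine", "pearson_baseline"],
    "min_support": [3, 4, 5],
    "user_based": [True],
}

In [None]:
# k is an argument of KNNWithMeans itself, not a similarity option, so it belongs at the top level of the grid.
# verbose only controls logging, so there is no need to search over it.
param_grid = {"k": [5, 10, 20, 30, 40, 50, 100], "sim_options": sim_options, "verbose": [False]}
gs = GridSearchCV(KNNWithMeans, param_grid, measures=["rmse", "mae"], cv=3)
gs.fit(data)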

In [None]:
print(gs.best_score["rmse"])
print(gs.best_params["rmse"])
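
To get a feel for the cost of such a search, the sketch below enumerates the combinations a grid over these values would cover, treating `k` as an algorithm-level parameter. It only counts combinations and fits nothing; `combos`, `n_folds` and `total_fits` are hypothetical names.

```python
from itertools import product

# Values mirroring the ones searched in this notebook
k_values = [5, 10, 20, 30, 40, 50, 100]
sim_names = ["msd", "cosine", "pearson_baseline"]
min_supports = [3, 4, 5]

# One dictionary per parameter combination
combos = [
    {"k": k, "sim_options": {"name": name, "min_support": ms, "user_based": True}}
    for k, name, ms in product(k_values, sim_names, min_supports)
]

n_folds = 3  # cv=3: every combination is fitted once per fold
total_fits = len(combos) * n_folds
print(len(combos), total_fits)
```

Each extra value in any list multiplies the number of fits, which is worth keeping in mind given the memory and runtime limits mentioned in this notebook.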

In [None]:
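# k is passed as an argument of the algorithm rather than inside sim_options; the unsupported c=3 argument is dropped
algo = KNNWithMeans(k=5, sim_options={'name': 'pearson_baseline', 'min_support': 5, 'user_based': True}, verbose=True)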
algo.fit(trainset)

#### Note: We are unable to run an item-item model (with 'user_based': False), since it needs far more RAM than is available. Every time we try, the session crashes with errors. We also tried a local machine with 8 GB of RAM, but with no luck.

In [None]:
# run the trained model against the testset
predictions = algo.test(testset)
predictions

### Step 6: Evaluate both the models.

##### Step 6.1 evaluate Popularity based model

As seen while building the popularity-based model, the 5 products that will be recommended to all users on the basis of popularity are:

In [None]:
print('the top 5 recommendations are:') 
t2data_product_ratings.sort_values(by='score',ascending=False).head()

As discussed earlier, these 5 products will be recommended to all users irrespective of their personal likes and dislikes. We give further explanation in the last section, where we summarise the models.

##### Step 6.2: Evaluate collaborative filtering model

In [None]:
# get RMSE
print('For the user-based model, the RMSE on the test set is:')
accuracy.rmse(predictions, verbose=True)
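
For reference, RMSE is the square root of the mean squared difference between the true and the estimated ratings, so lower is better. The sketch below computes it by hand on made-up pairs; the `rmse` helper and `toy_pairs` are hypothetical and separate from the `accuracy.rmse` call above.

```python
from math import sqrt

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    pairs = list(pairs)
    squared_errors = [(true_r - est) ** 2 for true_r, est in pairs]
    return sqrt(sum(squared_errors) / len(pairs))

# Hypothetical (true, estimated) rating pairs
toy_pairs = [(5.0, 4.0), (3.0, 3.0), (4.0, 2.0)]
toy_rmse = rmse(toy_pairs)
print(round(toy_rmse, 4))
```

Because the errors are squared before averaging, the single estimate that is off by 2 contributes four times as much as the one that is off by 1.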

In [None]:
cross_validate(algo, t3data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

### Step 7: Get top-K (K = 5) recommendations.
Since our goal is to recommend new products to each user based on their habits, we will recommend 5 new products.

##### Step 7.1: Popularity based recommendation systems:<br>
As seen in the previous section, the top 5 popular recommendations are products

- B003ES5ZUU
- B0088CJT4U
- B000N99BBC
- B007WTAJTO
- B00829TIEK

Challenges:
- It is not personalised. We are not catering to an individual's preferences, since every new user on the site receives the same recommendations.
- It does increase the probability of a purchase: before this we had no recommendation system because we had no information, whereas now we recommend products using the information we have.
- However, that increase in the probability of a sale is only marginal.
- Hence, in the next section, we look at other recommendation systems, beginning with the collaborative filtering model.

##### Step 7.2: Collaborative Filtering model:<br>
Let's identify the top 5 recommendations for each user:

In [None]:
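def get_top_n(predictions,n=5):
  """Return the n items with the highest estimated rating for each user."""
  top_n=defaultdict(list)
  # collect (item id, estimated rating) pairs for each user
  for uid,iid,true_r,est,_ in predictions:
    top_n[uid].append((iid,est))
  # sort each user's items by estimated rating and keep the top n
  for uid,user_ratings in top_n.items():
    user_ratings.sort(key=lambda x: x[1],reverse=True)
    top_n[uid]=user_ratings[:n]
  return top_n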

In [None]:
top_n=get_top_n(predictions)

In [None]:
print('top 5 recommended products for each user are:')
top_n

The output above lists product recommendations for each user. They are based on each user's preferences, as inferred from the ratings they gave to other products. Because the candidates come from the held-out test set, they are products the user did rate but that the model did not see during training. The model can help users discover new interests: even when it has no direct knowledge of a user's interest in a product, it may still recommend that product because similar users are interested in it.
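
The per-user ranking can be checked by hand on a few made-up predictions. In this sketch, `ToyPrediction` is a hypothetical stand-in for the prediction tuples (assumed to share the same five-field order), and `top_n_per_user` is an illustrative variant of the function above.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for the prediction tuples (assumed field order)
ToyPrediction = namedtuple("ToyPrediction", ["uid", "iid", "r_ui", "est", "details"])

toy_predictions = [
    ToyPrediction("u1", "pA", 5.0, 4.6, {}),
    ToyPrediction("u1", "pB", 3.0, 3.1, {}),
    ToyPrediction("u1", "pC", 4.0, 4.9, {}),
    ToyPrediction("u2", "pA", 2.0, 2.5, {}),
    ToyPrediction("u2", "pC", 5.0, 4.2, {}),
]

def top_n_per_user(predictions, n=2):
    """Group estimates by user, then keep the n highest estimates per user."""
    grouped = defaultdict(list)
    for uid, iid, _true_r, est, _details in predictions:
        grouped[uid].append((iid, est))
    return {
        uid: sorted(items, key=lambda pair: pair[1], reverse=True)[:n]
        for uid, items in grouped.items()
    }

toy_top = top_n_per_user(toy_predictions)
print(toy_top)
```

The ranking uses the estimated rating (`est`), not the true rating, so an item the user actually rated lower can still come first if the model expects a higher rating for it.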

### Step 8: Summarising the insights

##### Step 8.1: Popularity based recommender system: <br>
- The popularity-based recommender provides the same general list of recommended products to all users. It is not sensitive to the interests and tastes of a particular user. <br>
- A popularity-based recommender might be a good starting point for a new business that has no user reviews yet, where customization based on user preferences is not possible. It may increase the probability of a purchase, but that increase would most likely be marginal.
- Since it doesn't consider an individual's interests, likes, or dislikes, it is not a solution that suits everyone.
- Also, consider a scenario in which a user has already bought a highly rated product. A popularity-based recommender does not take the earlier purchase into account and will continue to recommend the same product.
- While we obtained 5 products that could be recommended to users, for the reasons mentioned above popularity-based recommender systems are not an ideal way of making recommendations.
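
One of the limitations above, recommending a product the user has already bought, could in principle be softened by filtering the popular list against each user's purchase history. The sketch below is a hypothetical illustration of that idea and is not part of the model built in this notebook; the product ids, users, and `recommend_popular` are all made up.

```python
# Hypothetical popularity ranking (most popular first) and purchase history
popular_products = ["P1", "P2", "P3", "P4", "P5", "P6", "P7"]
purchased_by_user = {"alice": {"P1", "P3"}, "bob": set()}

def recommend_popular(user, n=5):
    """Return the n most popular products the user has not bought yet."""
    already_bought = purchased_by_user.get(user, set())
    unseen = [p for p in popular_products if p not in already_bought]
    return unseen[:n]

print(recommend_popular("alice"))
print(recommend_popular("bob"))
```

Even with this filter the list is still the same for every user apart from what they already own, so the lack of personalization remains.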

##### Step 8.2: Collaborative filtering model:<br>
- Leveraging the collaborative filtering model, we were able to recommend the top 5 products for each user.
- The distinctive feature of collaborative filtering is that the recommendations are not generic; they are customized for each user based on their preferences.
- Hence, collaborative filtering is the preferred technique over a popularity-based recommendation system.