

# Recommendation System Project
### <u>Data Description and Context:</u>
Amazon Reviews data (data source). The repository has several datasets; for this case study, we use the Electronics dataset.
### <u>Domain:</u>
E-commerce 

### <u>Context</u>
Online e-commerce websites like Amazon and Flipkart use different recommendation models to provide different suggestions to different users. Amazon currently uses item-to-item collaborative filtering, which scales to massive data sets and produces high-quality recommendations in real time.

### <u>Attribute Information:</u>
* userId : Every user is identified by a unique ID
* productId : Every product is identified by a unique ID
* Rating : Rating given to the corresponding product by the corresponding user
* timestamp : Time of the rating (ignore this column for this exercise)

### <u>Objective:</u>
Build a recommendation system that recommends products to customers based on their previous ratings of other products.


# Types of recommendations

There are mainly 6 types of recommendation systems:

1. Popularity-based systems: These recommend the items that most people have viewed, purchased, and rated highly. The recommendations are not personalized.
2. Classification-model-based systems: These use the features of the user and apply a classification algorithm to decide whether the user is interested in the product or not.
3. Content-based recommendations: These rely on information about the contents of the item rather than on user opinions. The main idea is that if a user likes an item, he or she will also like other, similar items.
4. Collaborative filtering: This is based on the assumption that people like things similar to other things they like, and things that are liked by other people with similar taste. It is mainly of two types:
 a) User-User
 b) Item-Item

5. Hybrid approaches: These combine collaborative filtering, content-based filtering, and other approaches.
6. Association rule mining: Association rules capture the relationships between items based on their patterns of co-occurrence across transactions.
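
To make the item-item idea from the list above concrete, here is a minimal sketch that computes cosine similarity between item columns of a tiny rating matrix. The matrix values and the helper name `item_cosine_similarity` are made up for illustration and are not part of this notebook's pipeline. Treating unrated cells as 0 is a simplifying assumption.

```python
import numpy as np

# Toy user-item rating matrix (rows = users, columns = items).
# 0 means "not rated"; all values are invented for illustration.
ratings = np.array([
    [5, 4, 0],
    [4, 5, 1],
    [1, 0, 5],
], dtype=float)

def item_cosine_similarity(matrix):
    """Return the item-item cosine similarity matrix of a user-item matrix."""
    norms = np.linalg.norm(matrix, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for items nobody rated
    normalized = matrix / norms
    return normalized.T @ normalized

item_sim = item_cosine_similarity(ratings)
print(np.round(item_sim, 2))
```

In this toy matrix, items 0 and 1 are rated alike by the same users, so their similarity is much higher than the similarity between items 0 and 2.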



# Import Libraries 

In [None]:

#importing necessary Libraries 

#working with data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import model_selection


import sklearn 
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import KNeighborsClassifier

from collections import defaultdict
from surprise import SVD
from surprise import KNNWithMeans
from surprise import Dataset
from surprise import accuracy
from surprise import Reader
from surprise.model_selection import train_test_split
import os


import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

### <font color='red'>Step 1 </font> Read and explore the given dataset

In [None]:
Data=pd.read_csv("/kaggle/input/amazon-product-reviews/ratings_Electronics (1).csv",names=['UserId', 'ProductId','Rating','timestamp'])


In [None]:
# Display the data

Data.head()


In [None]:
#checking datatypes of each column
Data.dtypes

In [None]:
#shape of data 
shape_Data = Data.shape
print('Data set contains "{x}" number of rows and "{y}" number of columns' .format(x=shape_Data[0],y=shape_Data[1]))

In [None]:
#null check
sns.heatmap(Data.isnull(),yticklabels=False,cbar=False,cmap='viridis')

In [None]:
#Overview of Data
Data.describe().T

In [None]:
print("Total data ")
print("-"*50)
print("\nTotal no of ratings :",Data.shape[0])
print("Total No of Users   :", len(np.unique(Data['UserId'])))
print("Total No of products  :", len(np.unique(Data['ProductId'])))


#### Data Understanding
1. There is no missing data <br>
2. There are 4 attributes: *UserId* and *ProductId* are objects, *Rating* is a float, and *timestamp* is an integer <br>
3. Ratings lie between 1 and 5
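
The points above can also be checked numerically rather than read off a heatmap. The sketch below uses a tiny made-up frame named `sample` with the same column names as `Data`; the rows are invented for illustration, and the same calls could be applied to `Data` itself.

```python
import pandas as pd

# Tiny made-up sample with the same column names as the notebook's Data frame.
sample = pd.DataFrame({
    'UserId': ['U1', 'U2', 'U3'],
    'ProductId': ['P1', 'P1', 'P2'],
    'Rating': [5.0, 3.0, 4.0],
    'timestamp': [1365811200, 1341100800, 1367193600],
})

missing_per_column = sample.isnull().sum()  # number of NaN values per column
rating_range = (sample['Rating'].min(), sample['Rating'].max())

print(missing_per_column)
print(sample.dtypes)
print('Rating range:', rating_range)
```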

In [None]:
# Rating frequency

sns.set(rc={'figure.figsize': (12, 6)})
sns.set_style('whitegrid')
ax = sns.countplot(x='Rating', data=Data)
ax.set(xlabel='Rating', ylabel='Count')

* Most users gave a rating of 5

In [None]:
# let's check the average rating of each product
Rating_prod = Data.groupby('ProductId')['Rating'].mean()
Rating_prod.head()

In [None]:
sns.distplot(Rating_prod, color="green", kde=True)

#### We can notice a large peak at rating "5". This may be because many products have been rated by only a single user, or because of some other kind of skewness.

In [None]:
# let's check how many ratings each product has

product_rating_count = Data.groupby('ProductId')['Rating'].count()
product_rating_count.head()

In [None]:
sns.distplot(product_rating_count, color="red", kde=True, bins=40)

#### This shows that most items have around 0-100 ratings, with some outliers such as products having more than 2000 ratings.

In [None]:
#Analysis of rating given by the user 

no_of_rated_products_per_user = Data.groupby(by='UserId')['Rating'].count().sort_values(ascending=False)
no_of_rated_products_per_user.head()

In [None]:
sns.distplot(no_of_rated_products_per_user, color="Orange", kde=True, bins=40)

#### This shows that most users have rated just 1 item, with some outliers such as users who rated more than 100 items.

### <font color='red'>Step 2 </font> Take a subset of the dataset to make it less sparse/ denser.

In [None]:
# checking the number of users who gave only 1 rating
user_1=no_of_rated_products_per_user[no_of_rated_products_per_user==1].count()
# percentage of users who gave a rating only once
per = user_1/no_of_rated_products_per_user.count()
print('{} percent of users have given a rating only once'.format(per*100))

In [None]:
print('\n Number of users who have rated 50 or more products : {}\n'.format(sum(no_of_rated_products_per_user >= 50)) )

In [None]:
#Getting the new dataframe which contains only products that have received 50 or more ratings
#(note: the filter groups by ProductId, not by UserId)

new_Data=Data.groupby("ProductId").filter(lambda x:x['Rating'].count() >=50)

In [None]:
new_Data.head()

In [None]:
new_Data.shape

In [None]:
#percentage of data taken
print('we are taking {} percent of data from Raw data for analysis'.format(new_Data['UserId'].count()/Data['UserId'].count()*100))
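
One way to quantify "less sparse" is the density of the user-item matrix: the number of ratings divided by the number of possible user-product pairs. The sketch below is illustrative only. The helper `rating_density` and the frame `toy` are hypothetical, and it assumes each user rates a product at most once.

```python
import pandas as pd

def rating_density(df, user_col='UserId', item_col='ProductId'):
    """Fraction of the user-item matrix that actually holds a rating."""
    n_users = df[user_col].nunique()
    n_items = df[item_col].nunique()
    return len(df) / (n_users * n_items)

# Made-up ratings: P1 has three ratings, P2 has one.
toy = pd.DataFrame({
    'UserId': ['U1', 'U1', 'U2', 'U3'],
    'ProductId': ['P1', 'P2', 'P1', 'P1'],
    'Rating': [5.0, 4.0, 3.0, 4.0],
})

density_before = rating_density(toy)

# Same kind of filter as in the notebook, with a threshold of 3 instead of 50.
toy_dense = toy.groupby('ProductId').filter(lambda x: x['Rating'].count() >= 3)
density_after = rating_density(toy_dense)

print(density_before, density_after)
```

Calling `rating_density(Data)` and `rating_density(new_Data)` would show how much the filtering step changes the density of the real data.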

In [None]:
#Dropping Unwanted Columns
new_Data.drop('timestamp',inplace=True,axis=1)

### <font color='red'>Step 3 </font> Build Popularity Recommender model.

In [None]:
#group by product and corresponding mean rating
ratings_mean_count = pd.DataFrame(new_Data.groupby('ProductId')['Rating'].mean())
ratings_mean_count['rating_counts'] = pd.DataFrame(new_Data.groupby('ProductId')['Rating'].count())

In [None]:
#let's check for highest rating count
ratings_mean_count['rating_counts'].max()

In [None]:
#let's check for lowest rating count
ratings_mean_count['rating_counts'].min()

In [None]:
#checking distribution of rating_counts
sns.distplot(ratings_mean_count['rating_counts'],kde=False, bins=40)

In [None]:
#checking distribution of rating
sns.distplot(ratings_mean_count['Rating'],kde=False, bins=40)

In [None]:
#Top 10 Product that would be recommended.
popular=ratings_mean_count.sort_values(['rating_counts','Rating'], ascending=False)
popular.head(10)

In [None]:
#Top 30 Product that would be recommended.
popular.head(30).plot(kind='bar')

#### The above graph shows the 30 most popular products, arranged in descending order of rating count.
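
The popularity ranking used in this step can be wrapped in a small reusable function. This is a sketch under assumptions: the function name `top_popular`, its `min_ratings` parameter, and the toy frame are hypothetical, but the ordering (rating count first, then mean rating) mirrors the sort used above.

```python
import pandas as pd

def top_popular(df, n=10, min_ratings=1):
    """Rank products by number of ratings, then by mean rating."""
    stats = df.groupby('ProductId')['Rating'].agg(
        rating_counts='count', Rating='mean'
    )
    stats = stats[stats['rating_counts'] >= min_ratings]
    stats = stats.sort_values(['rating_counts', 'Rating'], ascending=False)
    return stats.head(n)

# Made-up ratings for three products.
toy_ratings = pd.DataFrame({
    'ProductId': ['P1', 'P1', 'P1', 'P2', 'P2', 'P3'],
    'Rating': [5.0, 4.0, 5.0, 3.0, 4.0, 5.0],
})

top2 = top_popular(toy_ratings, n=2)
print(top2)
```

Because the count is the primary sort key, P3 is ranked last even though its single rating is a 5.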

### <font color='red'>Step 4 </font>Split the data randomly into train and test dataset.

In [None]:
#Reading the dataset using Surprise package for Model Based Collaborative Filtering
reader = Reader(rating_scale=(1, 5))
data_reader_SVD = Dataset.load_from_df(new_Data,reader)
#Splitting the dataset with 70% training and 30% testing using Surprise train_test_split
trainset_SVD, testset_SVD = train_test_split(data_reader_SVD, test_size=.30)

In [None]:
#Data Split for Memory Based Collaborative Filtering
# we were running out of memory, so the data cannot be passed to the collaborative filtering process in one piece (about 10 lakh records at a time is manageable).
# so we split the data into different parts to train them separately
# splitting data into 5 roughly equal parts of about 1074862 records each
reader = Reader(rating_scale=(1, 5))
data_reader_1 = Dataset.load_from_df(new_Data.iloc[:1074862,0:],reader)
data_reader_2 = Dataset.load_from_df(new_Data.iloc[1074862:2149725,0:],reader)
data_reader_3 = Dataset.load_from_df(new_Data.iloc[2149725:3224586,0:],reader)
data_reader_4 = Dataset.load_from_df(new_Data.iloc[3224586:4299448,0:],reader)
data_reader_5 = Dataset.load_from_df(new_Data.iloc[4299448:,0:],reader)

#Splitting the dataset with 70% training and 30% testing using Surprise train_test_split
trainset_1, testset_1 = train_test_split(data_reader_1, test_size=.30)
trainset_2, testset_2 = train_test_split(data_reader_2, test_size=.30)
trainset_3, testset_3 = train_test_split(data_reader_3, test_size=.30)
trainset_4, testset_4 = train_test_split(data_reader_4, test_size=.30)
trainset_5, testset_5 = train_test_split(data_reader_5, test_size=.30)

#holding all training set
trainset=[trainset_1,trainset_2,trainset_3,trainset_4,trainset_5]
#holding all testing set
testset=[testset_1,testset_2,testset_3,testset_4,testset_5]

### <font color='red'>Step 5 </font>Build Collaborative Filtering model 
#### Memory Based Collaborative Filtering

* Collaborative filtering techniques aim to fill in the missing entries of a user-item association matrix.
* We are going to use a collaborative filtering approach. It is based on the idea that the best recommendations come from people who have similar tastes.

In [None]:
# Use user_based true/false to switch between user-based or item-based collaborative filtering
algo = KNNWithMeans(k=50, sim_options={'name': 'pearson_baseline', 'user_based': False})
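
To show roughly what a mean-centred item-based kNN prediction does, here is a simplified sketch in plain Python. It is not the Surprise implementation: the function `predict_item_based`, the dictionaries, and all numbers are made up, and details such as `min_k` and clipping to the rating scale are left out.

```python
def predict_item_based(user_ratings, item_means, sims, target, k=2):
    """Mean-centred item-based kNN prediction for one user and one item.

    user_ratings: dict item -> rating given by the user
    item_means:   dict item -> mean rating of that item
    sims:         dict (item_a, item_b) -> similarity between the items
    """
    # Keep only items the user rated that have a positive similarity to target.
    neighbours = [
        (sims[(target, j)], j)
        for j in user_ratings
        if sims.get((target, j), 0) > 0
    ]
    neighbours = sorted(neighbours, reverse=True)[:k]
    if not neighbours:
        return item_means[target]  # fall back to the item's mean rating
    weighted = sum(s * (user_ratings[j] - item_means[j]) for s, j in neighbours)
    total_sim = sum(s for s, _ in neighbours)
    return item_means[target] + weighted / total_sim

# Made-up numbers: the user rated B and C, and we predict item A.
item_means = {'A': 4.0, 'B': 3.0, 'C': 4.5}
user_ratings = {'B': 4.0, 'C': 4.0}
sims = {('A', 'B'): 0.8, ('A', 'C'): 0.4}

pred = predict_item_based(user_ratings, item_means, sims, target='A')
print(pred)
```

The user rated B one point above its mean and C half a point below its mean; since B is more similar to A, the prediction ends up above A's mean.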

In [None]:
#fitting all training set and storing testing results
test=[]
for item in range(5):
    algo.fit(trainset[item])
    test.append(algo.test(testset[item]))

In [None]:
#checking prediction
test[0][0:5]

#### Model-based collaborative filtering system

In [None]:
algo_SVD = SVD()
algo_SVD.fit(trainset_SVD)

In [None]:
# use the fitted SVD model (algo_SVD), not the KNN model (algo)
predictions_SVD = algo_SVD.test(testset_SVD)

In [None]:
RMSE_SVD=accuracy.rmse(predictions_SVD, verbose=True)
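
For reference, RMSE is the square root of the mean squared difference between true and estimated ratings. The sketch below computes it by hand with NumPy on made-up numbers; the helper name `rmse` is hypothetical and separate from `accuracy.rmse`.

```python
import numpy as np

def rmse(true_ratings, estimated_ratings):
    """Root mean squared error between true and estimated ratings."""
    true_ratings = np.asarray(true_ratings, dtype=float)
    estimated_ratings = np.asarray(estimated_ratings, dtype=float)
    return float(np.sqrt(np.mean((true_ratings - estimated_ratings) ** 2)))

# Errors are 1, 0, -1 and -2, so the mean squared error is 6 / 4 = 1.5.
value = rmse([5, 3, 4, 1], [4, 3, 5, 3])
print(value)
```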

### <font color='red'>Step 6 </font>Evaluate both the models.

#### Evaluating Popularity based model

In [None]:
popular=ratings_mean_count.sort_values(['rating_counts','Rating'], ascending=False)
popular.head(10)

#### We can see that the top product, B0074BW614, has a mean rating of 4.49 and was rated by 18244 users, which seems plausible. We can therefore conclude that we are getting the expected result.

#### Evaluating Collaborative filtering (memory based model)

In [None]:
# evaluating Collaborative filtering (memory based model)
print("Item-based Model : Test Set")
RMSE = []
Total_RMSE = 0
for i in range(5):
    RMSE.append(accuracy.rmse(test[i], verbose=True))
    Total_RMSE = Total_RMSE + RMSE[i]

In [None]:
#average RMSE
print ('Average RMSE for Memory Based Collaborative Filtering over all TEST data is = {}'.format(Total_RMSE/5))

#### Evaluating Collaborative filtering (Model based model)

In [None]:
# evaluating Collaborative filtering (Model based model)
print ('RMSE for Model Based Collaborative Filtering on the TEST data is = {}'.format(RMSE_SVD))


### <font color='red'>Step 7 </font> Get top - K ( K = 5) recommendations.

In [None]:
#creating function to get top 5 Product Recommendation for each user.
def get_top_n(predictions, n=5):
    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

In [None]:
top_n = get_top_n(predictions_SVD, n=5)
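
To see what the grouping and sorting produce, the same logic can be run on a handful of hand-written predictions. This sketch repeats the function so that it runs on its own; the tuples are made up and only imitate the `(uid, iid, true_r, est, details)` layout of Surprise predictions.

```python
from collections import defaultdict

def get_top_n(predictions, n=5):
    """Group predictions by user and keep the n highest estimates per user."""
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]
    return top_n

# Made-up predictions: (user, item, true rating, estimated rating, details).
toy_predictions = [
    ('u1', 'A', 5.0, 4.2, {}),
    ('u1', 'B', 3.0, 4.8, {}),
    ('u1', 'C', 4.0, 3.1, {}),
    ('u2', 'A', 2.0, 2.5, {}),
]

toy_top = get_top_n(toy_predictions, n=2)
print(dict(toy_top))
```

User `u2` appears only once in the toy predictions, so that user gets a single recommendation even though `n=2`.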

In [None]:
# Print the recommended items for the first 50 users
for count, (uid, user_ratings) in enumerate(top_n.items()):
    if count >= 50:
        break
    print(uid, [iid for (iid, _) in user_ratings])

#### Thus we can notice
* For user: A359EJJXW154UD
* Recommendation: ['B000M2TAN4', 'B000FGI970', 'B0041OSQ9I', 'B00005LEN4', 'B000ZMCILW']

* Many users have fewer than 5 recommendations. This happens because predictions are made only for the user-product pairs in the test set, and those users have fewer than 5 rated products there.
* For users with fewer than 5 recommendations, we will fill the remaining slots with products from the popularity-based recommendation.
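
The fallback described above could look roughly like the sketch below. The function `fill_with_popular` and the product IDs are hypothetical. In this notebook the popular list could come from `popular.index`, which is an assumption about how the pieces would be wired together.

```python
def fill_with_popular(recommended, popular_items, n=5):
    """Pad a user's recommendation list with popular items up to n entries."""
    filled = list(recommended)
    for item in popular_items:
        if len(filled) >= n:
            break
        if item not in filled:  # do not recommend the same product twice
            filled.append(item)
    return filled

# Made-up example: the user has only two personalised recommendations.
personal = ['B1', 'B2']
popular_ids = ['B2', 'P1', 'P2', 'P3', 'P4']

final_list = fill_with_popular(personal, popular_ids, n=5)
print(final_list)
```

Personalised items keep their position at the front of the list, and popular items are only used to fill the remaining slots.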