# Recommender Systems 

##### A recommender system, or recommendation system, is a subclass of information filtering system that seeks to predict the "rating" or "preference" a user would give to an item. Such systems are primarily used in commercial applications.

Every day, e-commerce websites recommend a large number of products to users based on popularity and other metrics.

Let's build our own recommendation system that recommends 5 new products based on the user's habits.

Data source: Amazon Reviews data (http://jmcauley.ucsd.edu/data/amazon/). The repository has several datasets; for this case study, we are using the Electronics dataset.

In [None]:
#import the basic libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')


In [None]:
Data=pd.read_csv('../input/amazon-product-reviews/ratings_Electronics (1).csv',names=('userId','productId','ratings','timestamp'))

In [None]:
Data.head()

In [None]:
print("Number of rows: {} & Columns : {} in our Dataset".format(Data.shape[0],Data.shape[1]))

In [None]:
#Let's drop the timestamp column as it is not relevant for our analysis
Data=Data.drop(['timestamp'],axis=1)

In [None]:
Data.head()

In [None]:
Data.info()

After removing the timestamp column, our dataset has three columns: two of object type (userId and productId) and a numeric (float) ratings column.

In [None]:
#Check for duplicates
dp=Data.duplicated().sum()
print("Number of Duplicates in our dataset :{}".format(dp))

##### Check for Null Values

In [None]:
Data.isna().sum()

In [None]:
Data.isnull().sum()

As you can see, there are no null values in our dataset either.

In [None]:
uq=len(Data['userId'].unique())
pq=len(Data['productId'].unique())
print("The number of Unique Users:{} and number of unique products:{} in our ecommerce site".format(uq,pq))

In [None]:
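#np.object was removed in NumPy 1.24; the string 'object' selects the same columns
print(Data.describe(exclude=['object']).T)
q1=Data['ratings'].quantile(.25)
#The third quartile is the 75th percentile (.50 would give the median)
q3=Data['ratings'].quantile(.75)
IQR=q3-q1
print("#################################################################")
print("IQR for ratings in our data is :{}".format(IQR))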

##### Statistical analysis of numeric column:

Since the userId and productId columns are objects, we shall do statistical analysis of the ratings column alone.

1. Every user in the dataset has rated at least one product.
2. The minimum rating that a product has received is 1.0 and the maximum rating is 5.0.
3. The ratings range from 1 to 5.
4. The average (mean) rating given by all users to our products is 4.01, with a standard deviation of 1.3. Our data points are quite widely spread around the mean.
5. Our first quartile (25%) is 3, which means 25% of data points fall at or below it.
6. Our median, the second quartile (50%), is 5.
7. Our third quartile (75%) is 5.
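To make the quartile figures concrete, here is a minimal sketch on a handful of made-up ratings (not the Amazon data). It uses the standard library's `statistics.quantiles` with `method="inclusive"`, which interpolates linearly between data points; the interquartile range is the distance between the third and the first quartile.

```python
import statistics

# Made-up ratings, purely for illustration.
toy_ratings = [1, 2, 3, 3, 4, 5, 5, 5, 5]

# n=4 asks for the three cut points that split the data into quarters.
q1, median, q3 = statistics.quantiles(toy_ratings, n=4, method="inclusive")

# The interquartile range covers the middle 50% of the data.
iqr = q3 - q1

print("Q1:", q1, "median:", median, "Q3:", q3, "IQR:", iqr)
```

A small IQR next to a wide overall range, as in our ratings, is a hint that most values are bunched near the top of the scale.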

In [None]:
rt_gp=Data.groupby('ratings')['ratings'].count()
print(rt_gp)
plt.figure(figsize=(15,5))
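#distplot is deprecated in recent seaborn releases; histplot with a KDE overlay draws a comparable plot
sns.histplot(Data['ratings'],stat='density',kde=True);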

In [None]:
plt.figure(figsize=(12,5))
#Recent seaborn releases expect the column to be passed by keyword
sns.countplot(x='ratings',data=Data,palette="Set3");

##### Let's explore the rating groups:
1. From the histogram, we can see the five rating groups.
2. Our users appear to be generous: the top rating, 5, is the most common one.
3. Ratings 1, 2 and 3 occur with similar frequency, whereas rating 4 is slightly more frequent.
##### Distribution Analysis:
1. In the density curve, the peaks for ratings 1, 2 and 3 are low and smooth (flat-topped, or platykurtic, in appearance), whereas the peaks for ratings 4 and 5 are sharp (leptokurtic in appearance). Because ratings are discrete, these peaks come from the smoothing of the density estimate rather than from separate normal distributions.

Tip: kurtosis values are compared with that of the normal distribution, which is 3. Values less than 3 are said to be platykurtic, or "flat-topped." Values higher than 3 are said to be leptokurtic and usually appear sharp at their peak.
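As a rough illustration of the tip, the sketch below computes kurtosis as the fourth central moment divided by the squared variance, so that a normal distribution scores 3. The two samples are made up. Note that pandas' `Series.kurt()` reports excess kurtosis (normal = 0), so its values should be compared with 0 rather than 3.

```python
def pearson_kurtosis(values):
    """Fourth central moment divided by the squared variance (normal = 3)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2)


# Made-up samples: one evenly spread, one concentrated around its centre.
flat_sample = [1, 2, 3, 4, 5]
peaked_sample = [3, 3, 3, 3, 3, 3, 1, 5]

flat_kurtosis = pearson_kurtosis(flat_sample)      # below 3: platykurtic
peaked_kurtosis = pearson_kurtosis(peaked_sample)  # above 3: leptokurtic

print(flat_kurtosis, peaked_kurtosis)
```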


In [None]:
Data['ratings'].skew()

The ratings in our data are negatively skewed: the tail extends to the left of the median.
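To see where the sign of the skewness comes from, here is a small sketch that computes the moment-based skewness (third central moment over the variance to the power 1.5) on made-up ratings. pandas applies a sample-size correction in `skew()`, so the magnitude differs slightly from this formula, but the sign is the same.

```python
def skewness(values):
    """Third central moment divided by the variance raised to 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / (m2 ** 1.5)


# Made-up ratings: mostly 5s with a few low values forming a left tail.
left_tailed = [1, 3, 4, 5, 5, 5, 5, 5]
symmetric = [1, 2, 3, 4, 5]

left_skew = skewness(left_tailed)  # negative: the tail is on the left
sym_skew = skewness(symmetric)     # zero: no tail on either side

print(left_skew, sym_skew)
```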

In [None]:
rt_gp_user=Data.groupby('userId')['ratings'].count()
rt_gp_product=Data.groupby('productId')['ratings'].count()
Most_occured_procuct=Data['productId'].value_counts().idxmax()
Most_freq_user=Data['userId'].value_counts().idxmax()

In [None]:
print("###########################################################################################################################")
print("The Max number of ratings we have received for a single product is :{} & the product ID that has received is :{}".format(rt_gp_product.max(),Most_occured_procuct))
print("The User :{} has given max number of ratings across products with Number of ratings being:{}".format(Most_freq_user,rt_gp_user.max()))

### Take a subset of the dataset to make it denser (less sparse)

#### Identifying the number of ratings provided by each user
1. I am using pandas join and groupby to get the number of ratings given by each user.

The column ratings_user_count gives the number of ratings provided by the user.

In [None]:
df=Data.join(Data.groupby('userId')['ratings'].count(),on='userId',rsuffix='_user_count')
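To show what the join does, here is a small sketch on a made-up three-user dataframe (the names `toy`, `per_user` and `via_transform` are mine). It also shows an alternative, `groupby(...).transform("count")`, which should give the same per-row counts without a join.

```python
import pandas as pd

# Made-up ratings, for illustration only.
toy = pd.DataFrame({
    "userId": ["u1", "u1", "u2", "u1", "u3"],
    "productId": ["p1", "p2", "p1", "p3", "p2"],
    "ratings": [5.0, 4.0, 3.0, 5.0, 1.0],
})

# Number of ratings per user, indexed by userId.
per_user = toy.groupby("userId")["ratings"].count()

# Attach the count to every row; rsuffix renames the clashing 'ratings' column.
joined = toy.join(per_user, on="userId", rsuffix="_user_count")

# Alternative: broadcast the group count back to the original rows.
via_transform = toy.groupby("userId")["ratings"].transform("count")

print(joined)
```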

In [None]:
df.head()

#### Now, let's make a subset of the data. Although we have a large dataset, let's consider only the ratings provided by users who have rated more than 50 products. The reasons are:
1. A user with very few ratings tells us little about their preferences. For example, a product that has received only a single rating, or a user who has rated only one product, gives us no variety to learn from.
2. Memory considerations: techniques such as matrix factorization (SVD) and other collaborative filtering methods become computationally expensive on a local machine with a high-volume dataset.

I am creating a function, subset, which classifies each user by whether they have given more than 50 ratings or not, and stores the result in a column called Group.

In [None]:
def subset(row):
    if row['ratings_user_count']> 50:
        return "Rated more than 50"
    else:
        return "Rated Less than 50"
df['Group']=df.apply(subset,axis=1)    
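Calling a Python function once per row with `apply` can be slow on several million rows. As a possible alternative, the sketch below labels the rows with one vectorised comparison using `numpy.where`. It runs on a made-up dataframe and keeps the same labels and the same "more than 50" threshold.

```python
import numpy as np
import pandas as pd

# Made-up per-user rating counts, including the boundary value 50.
toy = pd.DataFrame({"ratings_user_count": [10, 50, 51, 120]})

# One comparison over the whole column instead of a Python call per row.
toy["Group"] = np.where(
    toy["ratings_user_count"] > 50,
    "Rated more than 50",
    "Rated Less than 50",
)

print(toy)
```

Note that a user with exactly 50 ratings falls into the "Rated Less than 50" group, because the condition is strictly greater than 50.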

In [None]:
df.tail(10)

In [None]:
user_more_than_50=df[df['Group']=='Rated more than 50']
user_less_than_50=df[df['Group']=='Rated Less than 50']
A=user_more_than_50['Group'].count()
B=user_less_than_50['Group'].count()
#A and B count rows (ratings), not distinct users
print("Number of ratings from users with more than 50 ratings:{}".format(A))
print("Number of ratings from users with 50 or fewer ratings:{}".format(B))

sub_per=(A/(A+B)) * 100
print("Our subset is just :{} % of our total data, However it gives us the data density required with total number of records:{}".format(sub_per,A))
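To make "data density" concrete: if we picture the ratings as a user-by-product matrix, density is the share of cells that are filled. The sketch below, on made-up ratings, computes it before and after keeping only the more active users. It assumes each user rates a product at most once, and the cut-off of more than one rating is chosen only to suit the tiny example.

```python
from collections import Counter

# Made-up (user, product, rating) rows.
ratings = [
    ("u1", "p1", 5.0), ("u1", "p2", 4.0), ("u1", "p3", 5.0),
    ("u2", "p1", 3.0),
    ("u3", "p2", 1.0),
]


def density(rows):
    """Filled cells divided by all cells of the user-by-product matrix."""
    users = {user for user, _, _ in rows}
    products = {product for _, product, _ in rows}
    return len(rows) / (len(users) * len(products))


# Keep only ratings from users with more than one rating.
counts = Counter(user for user, _, _ in ratings)
active_rows = [row for row in ratings if counts[row[0]] > 1]

full_density = density(ratings)        # 5 ratings over a 3 x 3 matrix
active_density = density(active_rows)  # 3 ratings over a 1 x 3 matrix

print(full_density, active_density)
```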

In [None]:
df['Group'].value_counts().plot.pie(shadow=True, startangle=120,autopct='%.2f')

#### Keep only the users who have given more than 50 ratings

Now, let's extract the ratings of users who have given more than 50 ratings and store them in a separate dataframe.

In [None]:
subset=df[df['Group']=='Rated more than 50']

In [None]:
subset.shape

Now the subset dataset contains 122171 rows, each of them a rating from a user who has given more than 50 ratings.

The plot below shows the distribution of our subset.

If we compare the distribution of the original data with that of the subset, the distributions look similar for ratings 5 and 4, whereas for ratings 1, 2 and 3 they have changed slightly and the peaks look flatter and smoother.

This is due to the change in both the volume and the values of the data.

In [None]:
plt.figure(figsize=(15,5))
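#distplot is deprecated in recent seaborn releases; histplot with a KDE overlay draws a comparable plot
sns.histplot(subset['ratings'],stat='density',kde=True);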

In [None]:
#exporting the subset to csv 
subset.to_csv('subset_morethan_50ratings.csv')

In [None]:
subset_with_number_of_ratings=subset.join(subset.groupby('productId')['ratings'].count(),on='productId',rsuffix='_product_count')

In [None]:
subset_with_number_of_ratings.head()

Now, I am adding one more column, "ratings_product_count", which gives the number of ratings each product ID has received.

In popularity-based recommendation, we will use this detail to give the user more insight into the product.

How to read the above dataframe?

"The product 0594481813 has a rating of 3.0 and has received only one rating by a user AT09WGFUM934H who has rated 110 other products in our subset".

### Popularity Recommender model

#### Having built the dataset with our required columns, we can now build our popularity based recommender system.


### What is a popularity-based recommender system?

It is the simplest recommendation model. It works on the principle of popularity: it identifies the products that are popular among users and recommends products that are trending, in high demand and frequently bought.
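The idea can be sketched in a few lines of plain Python: count the ratings per product, compute the average rating, and rank by count. The data and the function name `most_popular` are made up for illustration; ties on count are broken by the higher average, which is my own choice.

```python
from collections import defaultdict


def most_popular(ratings, top_k=5):
    """Rank products by number of ratings, then by average rating."""
    counts = defaultdict(int)
    sums = defaultdict(float)
    for _user, product, rating in ratings:
        counts[product] += 1
        sums[product] += rating
    table = [(p, counts[p], sums[p] / counts[p]) for p in counts]
    table.sort(key=lambda row: (row[1], row[2]), reverse=True)
    return table[:top_k]


# Made-up (user, product, rating) rows.
ratings = [
    ("u1", "p1", 5.0), ("u2", "p1", 3.0),
    ("u1", "p2", 4.0), ("u3", "p2", 1.0),
    ("u3", "p3", 5.0),
]

top = most_popular(ratings, top_k=2)
print(top)
```

Every user gets the same list, which is exactly why this model needs no information about the individual user.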

In [None]:
#Lets drop the columns userId and group 
Popularity_based_Recommendadtion=subset_with_number_of_ratings.drop(['userId','Group'],axis=1)

In [None]:
Popularity_based_Recommendadtion.shape

#### Popularity based on user rating and number of ratings received by each product.

In [None]:
#Selecting several columns from a groupby needs a list (double brackets) in current pandas
PRS_Product_rated_count=pd.DataFrame(Popularity_based_Recommendadtion.groupby('productId')[['ratings','ratings_product_count']].mean().sort_values(by='ratings_product_count',ascending=False))

I have now created a dataframe that lists each product, its average rating and the number of ratings it has received, sorted by the number of ratings received.

In [None]:
#The Top 10 products based on number of ratings the product has received 
PRS_Product_rated_count.head(10)

#### Popularity based on user rating

In [None]:
#Selecting several columns from a groupby needs a list (double brackets) in current pandas
PRS_by_Rating=pd.DataFrame(Popularity_based_Recommendadtion.groupby('productId')[['ratings','ratings_product_count']].mean().sort_values(by='ratings',ascending=False))

I have now created a dataframe that lists each product, its average rating and the number of ratings it has received, sorted by average rating.

In [None]:
#The Top 10 products based on top rating received by the product  
PRS_by_Rating.head(10)

#### Both of the above tables give the user top-rated products. Table 2 shows the products with the best ratings, but these products have received just 1 rating each.
#### Table 1 is more credible, as it recommends products based on the number of users who have rated them and not just on a high rating.
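One simple way to combine the two views, which is not what this notebook does, is to rank by average rating only among products with a minimum number of ratings. The sketch below uses made-up data; `min_count` is a hypothetical parameter and its value would need tuning on real data.

```python
from collections import defaultdict


def product_stats(ratings):
    """Return {product: (rating_count, mean_rating)}."""
    counts = defaultdict(int)
    sums = defaultdict(float)
    for _user, product, rating in ratings:
        counts[product] += 1
        sums[product] += rating
    return {p: (counts[p], sums[p] / counts[p]) for p in counts}


def rank_by_mean(stats, min_count=1):
    """Sort by mean rating, keeping products with at least min_count ratings."""
    eligible = [(p, c, m) for p, (c, m) in stats.items() if c >= min_count]
    return sorted(eligible, key=lambda row: row[2], reverse=True)


# Made-up data: A has one perfect rating, B has ten very good ones,
# C has ten poor ones.
ratings = [("u0", "A", 5.0)]
ratings += [("u%d" % n, "B", 5.0) for n in range(1, 10)] + [("u10", "B", 4.0)]
ratings += [("u%d" % n, "C", 2.0) for n in range(11, 21)]

stats = product_stats(ratings)
plain = rank_by_mean(stats)                  # A comes first on a single rating
filtered = rank_by_mean(stats, min_count=5)  # A is dropped, B comes first

print(plain[0], filtered[0])
```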

Now, we will count the ratings at each value from 1 to 5 for each product and store the result as a dataframe.

In [None]:
S=pd.DataFrame(subset.groupby('productId')['ratings'].value_counts().unstack().fillna(0))

In [None]:
S
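To make the shape of this table easier to read, here is the same chain of calls on a made-up dataframe with two products: `value_counts` counts each rating value per product, `unstack` turns the rating values into columns, and `fillna(0)` fills the combinations that never occur.

```python
import pandas as pd

# Made-up ratings: product A has three ratings, product B has one.
toy = pd.DataFrame({
    "productId": ["A", "A", "A", "B"],
    "ratings": [5.0, 5.0, 4.0, 1.0],
})

# Rows are products, columns are rating values, cells are counts.
counts_per_value = (
    toy.groupby("productId")["ratings"]
    .value_counts()
    .unstack()
    .fillna(0)
)

print(counts_per_value)
```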

#### Rating count for each scale 
Now let's enhance Table 1 by adding, for each product, how many ratings it has received at each value from 1 to 5, alongside its average rating and its rating count.

In [None]:
Popularity_Final=pd.merge(PRS_Product_rated_count,S,on='productId')

In [None]:
Popularity_Final.nlargest(15,'ratings_product_count')

#### So, above is the popularity-based recommendation that we have created. I am highlighting the top 15 products that are popular among users, along with each product's average rating, the number of ratings it has received, and the split of those ratings across each value from 1 to 5.

#### Advantages of Popularity based recommendation:
1. Computationally easy and less complex.
2. No user characteristics are required, hence it does not suffer from the cold start problem.
3. Can be made available to the user from day 1 of starting the business.

#### Disadvantages:
1. There is no variety in the recommendations.
2. There is no personalization in the recommendations either. Irrespective of what the user might be interested in, the model recommends the same set of products to every user.

### Collaborative Filtering model

##### This method of recommendation overcomes the shortfalls of popularity-based recommendation systems.

Collaborative filtering works on the similarity between different users or between different items. The similarity can be used to recommend products based on user or item behaviour.

A couple of methods to implement collaborative filtering:

1. Matrix Factorization using Singular value decomposition
2. Nearest Neighbour collaborative filtering (User based and Item based)

### Matrix Factorization using Singular value decomposition

##### Matrix factorization is used to predict what rating a specific user will give a product, based on the ratings they have previously given to other items.

#### How is this done?

1. The user and item data is first arranged as a matrix. A set of characteristics (latent factors) is computed for each user and each item.
2. The dot product of a user's and an item's characteristics gives the predicted rating for that user and item in the matrix.
3. Since we now have both the actual and the predicted ratings, the algorithm uses gradient descent to minimise the error between them, so that the predictions come as close as possible to the user's ratings.
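The three steps above can be sketched in plain Python. This is a simplified illustration on made-up ratings, not Surprise's implementation: it predicts the global mean plus the dot product of a user vector and an item vector, and leaves out the per-user and per-item bias terms that Surprise's SVD also learns. The function names and hyperparameter values are my own assumptions.

```python
import math
import random


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def train_mf(ratings, n_factors=2, n_epochs=500, lr=0.05, reg=0.02, seed=0):
    """Learn user and item factor vectors with stochastic gradient descent."""
    rng = random.Random(seed)
    mean = sum(r for _, _, r in ratings) / len(ratings)
    # Step 1: small random characteristics for every user and item.
    P = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u, _, _ in ratings}
    Q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for _, i, _ in ratings}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            # Step 2: the prediction is the mean plus a dot product.
            err = r - (mean + dot(P[u], Q[i]))
            # Step 3: move both vectors a little to shrink the error.
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return mean, P, Q


def training_rmse(ratings, mean, P, Q):
    """Root mean squared error of the model on the given ratings."""
    squared = [(r - (mean + dot(P[u], Q[i]))) ** 2 for u, i, r in ratings]
    return math.sqrt(sum(squared) / len(squared))


# Made-up ratings: u1 and u2 like p1 and p2, while u3 and u4 prefer p3.
toy = [
    ("u1", "p1", 5.0), ("u1", "p2", 4.0), ("u1", "p3", 1.0),
    ("u2", "p1", 4.0), ("u2", "p2", 5.0), ("u2", "p3", 2.0),
    ("u3", "p1", 1.0), ("u3", "p2", 2.0), ("u3", "p3", 5.0),
    ("u4", "p1", 2.0), ("u4", "p3", 4.0),
]

rmse_before = training_rmse(toy, *train_mf(toy, n_epochs=0))
mean, P, Q = train_mf(toy)
rmse_after = training_rmse(toy, mean, P, Q)

# Estimate for a pair that is missing from the data: u4 on p2.
estimate_u4_p2 = mean + dot(P["u4"], Q["p2"])

print(rmse_before, rmse_after, estimate_u4_p2)
```

The error measured here is on the training ratings themselves, so it only shows that the updates reduce the error; it says nothing about accuracy on unseen ratings.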

#### We will use singular value decomposition from Surprise Library to implement this in python.

The name Surprise stands for "Simple Python RecommendatIon System Engine".

Let's import SVD, Dataset and Reader from surprise to read the dataset in Surprise format.

In [None]:
from surprise import SVD
from surprise import Dataset
from surprise import Reader

In [None]:
subset

In [None]:
reader=Reader(rating_scale=(1.0, 5.0))

In [None]:
#This line builds a Surprise Dataset of (user, item, rating) triples from our pandas dataframe
data_for_mf=Dataset.load_from_df(subset[['userId','productId','ratings']],reader)

In [None]:
data_for_mf

#### Split the data randomly into a train and test dataset :

Note that I am using train_test_split from the surprise package and not from sklearn, so there is a slight change in syntax.

In [None]:
from surprise.model_selection import train_test_split
#X is the training set and Y is the test set
X,Y=train_test_split(data_for_mf,test_size=0.3)

In [None]:
#Let's name the model M1 and instantiate the SVD algorithm with its default settings
M1=SVD()

In [None]:
#Let's train our model M1 with the training data
M1.fit(X)

In [None]:
#Lets do the prediction for the testset
predictions=M1.test(Y)

In [None]:
predictions

For each user (uid) and product (iid), a prediction holds the actual rating (r_ui) and the rating our model estimates the user would give the product (est).

was_impossible = False denotes that the algorithm was able to predict the user's rating. If was_impossible = True, the algorithm could not produce a prediction. This can happen when it has too little information about the user or the product, a situation also known as the cold start problem.

Now, using the function below, we will map the product IDs and estimated ratings (iid, est) to each user and keep the 5 products with the highest estimated rating for each user.

In [None]:
from collections import defaultdict
def get_top_n(predictions, n=5):
    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n
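To see what such a function returns without training a model, the sketch below feeds a compact variant a few hand-written tuples laid out like Surprise predictions (uid, iid, actual rating, estimate, details). The variant uses `heapq.nlargest` instead of sorting the full list; the function name and the data are made up.

```python
import heapq
from collections import defaultdict


def top_n_per_user(predictions, n=5):
    """Keep the n items with the highest estimated rating for each user."""
    by_user = defaultdict(list)
    for uid, iid, _true_r, est, _details in predictions:
        by_user[uid].append((iid, est))
    return {
        uid: heapq.nlargest(n, items, key=lambda pair: pair[1])
        for uid, items in by_user.items()
    }


# Hand-written stand-ins for model predictions.
fake_predictions = [
    ("u1", "p1", 5.0, 4.6, {}),
    ("u1", "p2", 3.0, 3.1, {}),
    ("u1", "p3", 4.0, 4.9, {}),
    ("u2", "p1", 2.0, 2.4, {}),
]

top_two = top_n_per_user(fake_predictions, n=2)
print(top_two)
```

One thing to keep in mind: when the predictions come from a test set, the items being ranked are ones the user has already rated, so the list is a ranking of held-out ratings rather than of unseen products.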

In [None]:
top_n = get_top_n(predictions,n=5)

In [None]:
top_n

In [None]:
from surprise import accuracy

In [None]:
print("The Root Mean Square Error for the Matrix Factorization using SVD:{}".format(accuracy.rmse(predictions,verbose=False)))
print("The Mean Absolute Error for the Matrix Factorization using SVD:{}".format(accuracy.mae(predictions,verbose=False)))
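For reference, both error measures are easy to compute by hand. The sketch below works on made-up (actual, estimated) pairs: RMSE squares the errors before averaging, so it penalises large misses more heavily than MAE does.

```python
import math


def rmse(pairs):
    """Root of the mean squared difference between actual and estimate."""
    return math.sqrt(sum((a - e) ** 2 for a, e in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute difference between actual and estimate."""
    return sum(abs(a - e) for a, e in pairs) / len(pairs)


# Made-up (actual rating, estimated rating) pairs.
pairs = [(5.0, 4.0), (3.0, 3.0), (1.0, 3.0), (4.0, 4.0)]

toy_rmse = rmse(pairs)  # sqrt((1 + 0 + 4 + 0) / 4)
toy_mae = mae(pairs)    # (1 + 0 + 2 + 0) / 4

print(toy_rmse, toy_mae)
```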

### Collaborative Filtering Using KNNwithMeans 

KNNWithMeans is a nearest-neighbour method that recommends products to a user based on what the user's closest neighbours have bought or liked.

More often than not, we see "People who liked this also liked this" or "People who bought this also bought these" when we shop online or watch movies. These are classic examples of recommendation systems that work on the concept of nearest-neighbour collaborative filtering.

How is this done?

The similarity between users (or between items) is calculated using either cosine similarity or the Pearson correlation coefficient. Products are then suggested to the user based on these similarity values.

Nearest-neighbour recommendation is not limited to users; it can also be applied to the similarity between items.
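Here is a small sketch of the two similarity measures on made-up users. Restricting both measures to the products that both users have rated is an assumption of this sketch. The example is chosen to show a practical difference: because ratings are all positive, cosine similarity stays positive even for users with opposite tastes, while Pearson correlation subtracts each user's mean first and turns negative.

```python
import math


def co_rated(a, b):
    """Products rated by both users."""
    return sorted(set(a) & set(b))


def cosine_sim(a, b):
    """Cosine of the angle between the two co-rated rating vectors."""
    items = co_rated(a, b)
    num = sum(a[i] * b[i] for i in items)
    den = math.sqrt(sum(a[i] ** 2 for i in items)) * math.sqrt(
        sum(b[i] ** 2 for i in items)
    )
    return num / den if den else 0.0


def pearson_sim(a, b):
    """Correlation of the two users' ratings on co-rated products."""
    items = co_rated(a, b)
    if len(items) < 2:
        return 0.0
    mean_a = sum(a[i] for i in items) / len(items)
    mean_b = sum(b[i] for i in items) / len(items)
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in items)
    den = math.sqrt(sum((a[i] - mean_a) ** 2 for i in items)) * math.sqrt(
        sum((b[i] - mean_b) ** 2 for i in items)
    )
    return num / den if den else 0.0


# Made-up users: alice and bob agree, carol rates the opposite way.
alice = {"p1": 5.0, "p2": 4.0, "p3": 1.0}
bob = {"p1": 4.0, "p2": 5.0, "p3": 2.0}
carol = {"p1": 1.0, "p2": 2.0, "p3": 5.0}

cos_alice_bob = cosine_sim(alice, bob)
cos_alice_carol = cosine_sim(alice, carol)
pear_alice_bob = pearson_sim(alice, bob)
pear_alice_carol = pearson_sim(alice, carol)

print(cos_alice_bob, cos_alice_carol, pear_alice_bob, pear_alice_carol)
```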

In [None]:
#Again from the surprise library, I am importing KNNWithMeans to implement this
from surprise import KNNWithMeans

In [None]:
#Due to memory issue I am using just a small subset of data , 
#I tried with higher values but I ran out of memory allocation and 50000 seems to be working fine
subset2=subset.head(50000)

In [None]:
subset2.shape

In [None]:
#Let's load the dataset in surprise format from the pandas dataframe and split the data into test and train
data_for_collab=Dataset.load_from_df(subset2[['userId','productId','ratings']],reader)
trainset,testset=train_test_split(data_for_collab,test_size=.15)

###### I am using GridSearchCV from surprise to tune KNN for both user-user collaboration and item-item collaboration. We will also try both cosine and Pearson (baseline) similarity and find the best parameters.

In [None]:
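#Similarity settings have to be nested under 'sim_options'; as top-level keys they would not reach the similarity measure
parm_grid={'k':[50,60,70],'sim_options':{'name':["cosine","pearson_baseline"],'user_based':[True,False]}}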

In [None]:
from surprise.model_selection import GridSearchCV
from surprise.model_selection import cross_validate
Grid_1=GridSearchCV(KNNWithMeans,parm_grid,measures=["rmse", "mae"],cv=3)

In [None]:
Grid_1.fit(data_for_collab)
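To get a feel for how much work a grid search does, the sketch below enumerates the combinations for three values of k, two similarity measures and both user_based settings. It uses only itertools and trains nothing; the nested sim_options layout follows the format shown in Surprise's GridSearchCV documentation, and the variable names are mine.

```python
from itertools import product

k_values = [50, 60, 70]
similarity_names = ["cosine", "pearson_baseline"]
user_based_values = [True, False]

# Every combination of the three settings, in the nested layout.
combinations = [
    {"k": k, "sim_options": {"name": name, "user_based": user_based}}
    for k, name, user_based in product(k_values, similarity_names, user_based_values)
]

n_folds = 3
n_model_fits = len(combinations) * n_folds  # one fit per combination per fold

print(len(combinations), n_model_fits)
```

With 12 combinations and 3 folds, the search fits 36 models, which is worth knowing before running it on a large dataset.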

In [None]:
Grid_1.best_params['rmse']

In [None]:
Grid_1.best_estimator

In [None]:
Grid_1.best_score

#### KNNwithMeans (user - user similarity)
Now that we have obtained the best parameters for KNNWithMeans using grid search, we shall train the model M2 with these parameters.

Though the best results in the grid search are obtained through user-user similarity, we shall also run KNN with item-item similarity.

#### What is user-user similarity?

Recommendations are provided to a user based on how similar that user is to their nearest neighbours. The similarity is calculated using either cosine similarity or Pearson correlation.

#### What is item-item similarity?

Recommendations are provided to a user based on how similar a given item is to its nearest neighbours. The similarity is calculated using either cosine similarity or Pearson correlation.
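Once the neighbours are known, the "with means" part works roughly as follows, going by the description in Surprise's documentation for the user-based case: start from the target user's own mean rating and add the similarity-weighted average of how far each neighbour's rating of the item sits from that neighbour's mean. The sketch below is a simplified illustration with made-up numbers; it leaves out details such as the minimum number of neighbours and clipping to the rating scale.

```python
def predict_with_means(target_mean, neighbours):
    """Mean-centred nearest-neighbour estimate.

    neighbours holds (similarity, neighbour_rating, neighbour_mean) tuples.
    """
    weighted_offsets = sum(sim * (rating - mean) for sim, rating, mean in neighbours)
    total_similarity = sum(sim for sim, _, _ in neighbours)
    if total_similarity == 0:
        # No usable neighbours: fall back to the user's own mean.
        return target_mean
    return target_mean + weighted_offsets / total_similarity


# Made-up numbers: the target user averages 3.5 stars, and both
# neighbours rated the product one star above their own averages.
neighbours = [
    (0.9, 5.0, 4.0),
    (0.6, 3.0, 2.0),
]

estimate = predict_with_means(3.5, neighbours)
fallback = predict_with_means(3.5, [])

print(estimate, fallback)
```

Centring on the means lets a strict rater and a generous rater agree on a product even though their raw ratings differ.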


In [None]:
#'k' inside sim_options and c=3 are not documented KNNWithMeans options, so they are dropped here
M2=KNNWithMeans(k=50,sim_options={'name': 'cosine', 'min_support': 5, 'user_based': True},verbose= True)

In [None]:
#Lets fit our training data to our model
M2.fit(trainset)

In [None]:
pred2=M2.test(testset)

In [None]:
pred2

In [None]:
print("The Root Mean Square Error for KNNwithMeans using user user similarity:{}".format(accuracy.rmse(pred2,verbose=False)))
print("The Mean Absolute Error for KNNwithMeans using user user similarity:{}".format(accuracy.mae(pred2,verbose=False)))

#### KNNwithMeans Item-Item similarity 

When we implement collaborative filtering with the surprise package, the algorithm we invoke stays the same; only the boolean value of the user_based option decides whether we use user-user or item-item similarity.

user_based = False performs item-item similarity
user_based = True performs user-user similarity

In [None]:
#'k' inside sim_options and c=3 are not documented KNNWithMeans options, so they are dropped here
M3=KNNWithMeans(k=50,sim_options={'name': 'cosine', 'min_support': 5, 'user_based': False},verbose= True)

In [None]:
M3

In [None]:
M3.fit(trainset)

In [None]:
pred3=M3.test(testset)

In [None]:
pred3

In [None]:
print("The Root Mean Square Error for KNNwithMeans using item item similarity:{}".format(accuracy.rmse(pred3,verbose=False)))
print("The Mean Absolute Error for KNNwithMeans using item item similarity:{}".format(accuracy.mae(pred3,verbose=False)))

### TOP 5 Recommendations from each model

### Popularity Based Recommendation

In [None]:
print("The top 5 products that we recommend using popularity based recommendation:")
print("##########################################################################")
print(Popularity_Final.nlargest(5,'ratings_product_count'))

### Matrix Factorization using Singular Value Decomposition 

In [None]:
top_n

### KNNWithMeans user - user similarity

In [None]:
#get_top_n (defined earlier) works on any list of predictions, so we reuse it for the user-user model
Top_CF_user=get_top_n(pred2,n=5)

# For each user Print the recommended items
for uid, user_ratings in Top_CF_user.items():
    print(uid, [iid for (iid, _) in user_ratings])


### KNNWithMeans item-item similarity

In [None]:
#Reuse get_top_n (defined earlier) for the item-item predictions
Top_CF_item=get_top_n(pred3,n=5)

# For each user Print the recommended items
for uid, user_ratings in Top_CF_item.items():
    print(uid, [iid for (iid, _) in user_ratings])

#### User-User VS Item-Item similarity
Though the user-based and item-based approaches look similar in implementation, how they work and the results they produce are entirely different.

Below I am highlighting the predictions of our models M2 (user-based) and M3 (item-based).

Product recommendations from user-user similarity for user A250AXLRBVYKB4

A250AXLRBVYKB4 ['B00004Z5M1', 'B000WGR3VG', 'B001ISK6FW', 'B00154MCKQ', 'B001GPVGZ6']

Product recommendations from item-item similarity for user A250AXLRBVYKB4

A250AXLRBVYKB4 ['B001G04VJO', 'B000P0CTSQ', 'B00194101O', 'B001TH7GSW', 'B00081A2KY']

As you can see, the two lists of products are entirely different.

In [None]:
print("The Root Mean Square Error for the Matrix Factorization using SVD:{}".format(accuracy.rmse(predictions,verbose=False)))
print("The Mean Absolute Error for the Matrix Factorization using SVD:{}".format(accuracy.mae(predictions,verbose=False)))
print("The Root Mean Square Error for KNNwithMeans using user user similarity:{}".format(accuracy.rmse(pred2,verbose=False)))
print("The Mean Absolute Error for KNNwithMeans using user user similarity:{}".format(accuracy.mae(pred2,verbose=False)))
print("The Root Mean Square Error for KNNwithMeans using item item similarity:{}".format(accuracy.rmse(pred3,verbose=False)))
print("The Mean Absolute Error for KNNwithMeans using item item similarity:{}".format(accuracy.mae(pred3,verbose=False)))

### Summary

#### Recommendation systems built:
1. Popularity based 
2. Collaborative filtering using Singular value decomposition
3. Collaborative filtering using KNNWithmeans (user-user similarity)
4. Collaborative filtering using KNNWithmeans (item-item similarity)

The popularity-based system was useful in providing users with products that have received good ratings and are in high demand. However, it offered no variety and no personalization.

Collaborative filtering with KNNWithMeans, in contrast, was good at personalizing recommendations by selecting products that were purchased by similar users, or products that are similar to other products the user has bought or liked.

With SVD we were able to predict what rating a user would give a product, based on the other ratings they have provided. Recommendations can then be made by setting a threshold value and recommending the products whose predicted rating is above it.

Comparing all the recommenders we have built, each has its own advantages and disadvantages.

Initially, when we do not have any user details, we can prefer the popularity-based recommendation. As the business grows and we learn more about users' behaviour, we should definitely move to collaborative filtering techniques to make precise, personalised recommendations that are useful to the user.

##### Let's do 5-fold cross-validation for our best model, M2 (user_based = True), and look at the RMSE and MAE

In [None]:
cross_validate(M2,data_for_mf , measures=['RMSE', 'MAE'], cv=5, verbose=True)
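For readers new to cross-validation, this sketch shows the splitting idea behind a 5-fold run: shuffle the row indices, deal them into five folds, and let each fold serve once as the test set while the other four form the training set. It is an illustration in plain Python with a made-up function name, not Surprise's own splitting code.

```python
import random


def k_fold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_indices, test_indices) once per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_splits folds, like dealing cards.
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i in range(n_splits):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(10, n_splits=5))

for fold_number, (train, test) in enumerate(splits, start=1):
    print(fold_number, sorted(train), sorted(test))
```

Every row is used for testing exactly once, so the averaged RMSE and MAE depend less on one lucky or unlucky split.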

To conclude, for this problem statement I suggest going with collaborative filtering using KNNWithMeans (user-user similarity), as its RMSE and MAE are the lowest among the models compared.

This means that the recommendations made with our model will be precise and customized for each user, providing personalization.