# Description

## Context
E-commerce websites such as Amazon and Flipkart use different recommendation models to tailor suggestions to each user. Amazon currently uses item-to-item collaborative filtering, which scales to massive data sets and produces high-quality recommendations in real time.

## Objective
Build a recommendation system to recommend products to customers based on their previous ratings for other products. Apply the concepts and techniques you have learned in the previous weeks and summarise your insights at the end.

 

## Dataset
We use the Electronics ratings dataset, one of several datasets in the Amazon Reviews data repository.

## Attribute Information

**userId:** A unique identifier for each user

**productId:** A unique identifier for each product

**rating:** The rating the user gave to the product

**timestamp:** Time of the rating (ignore this column for this exercise)

In [2]:
%matplotlib inline

import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
import time
import joblib

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Suppress warnings
import warnings

warnings.filterwarnings("ignore")

## Load Electronics dataset

In [3]:
df = pd.read_csv('ratings_Electronics.csv', names=['userId', 'productId', 'rating', 'timestamp'])

## Explore Data

In [4]:
df.head()

| | userId | productId | rating | timestamp |
|---|---|---|---|---|
| 0 | AKM1MP6P0OYPR | 132793040 | 5.0 | 1365811200 |
| 1 | A2CX7LUOHB2NDG | 321732944 | 5.0 | 1341100800 |
| 2 | A2NWSAGRHCP8N5 | 439886341 | 1.0 | 1367193600 |
| 3 | A2WNBOD3WNDNKT | 439886341 | 3.0 | 1374451200 |
| 4 | A1GI0U4ZRJA8WN | 439886341 | 1.0 | 1334707200 |


In [5]:
# There are 7824482 rows and 4 columns
df.shape

(7824482, 4)

In [6]:
len(df['productId'].unique())

476002

In [7]:
len(df['userId'].unique())

4201696

Observations:
- There are `476002` unique products.
- There are `4201696` unique users.

In [8]:
df.dtypes

userId        object
productId     object
rating       float64
timestamp      int64
dtype: object

In [9]:
# drop the timestamp
df.drop(['timestamp'], axis=1,inplace=True)

In [10]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| rating | 7824482.0 | 4.012337 | 1.38091 | 1.0 | 3.0 | 5.0 | 5.0 | 5.0 |


In [11]:
#Check for missing values
print('Number of missing values across columns: \n',df.isnull().sum())

Number of missing values across columns: 
 userId       0
productId    0
rating       0
dtype: int64


## Take a subset of the dataset to make it denser (for example, keep only the users who have given 50 or more ratings)

In [12]:
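# Keep users with more than 50 ratings (note: `> 50` is strict; use `>= 50` to also keep users with exactly 50)
df_review_50 = df.groupby('userId').filter(lambda x: len(x) > 50)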

In [13]:
df_review_50.head()

| | userId | productId | rating |
|---|---|---|---|
| 118 | AT09WGFUM934H | 594481813 | 3.0 |
| 177 | A32HSNCNPRUMTR | 970407998 | 1.0 |
| 178 | A17HMM1M7T9PJ1 | 970407998 | 4.0 |
| 492 | A3CLWR1UUZT6TG | 972683275 | 5.0 |
| 631 | A3TAS1AG6FMBQW | 972683275 | 5.0 |


In [14]:
df_review_50.shape

(122171, 3)
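
`groupby(...).filter` with a Python lambda calls the function once per user, which can be slow with roughly 4.2 million users. The following is a minimal sketch of an equivalent vectorized filter, assuming the same column names as above. The helper name `filter_active_users`, the `min_ratings` parameter, and the toy frame are illustrative and not part of the original notebook.

```python
import pandas as pd

def filter_active_users(ratings: pd.DataFrame, min_ratings: int) -> pd.DataFrame:
    """Keep rows belonging to users with strictly more than `min_ratings` ratings."""
    # Number of ratings per user, mapped back onto every row
    per_user = ratings['userId'].value_counts()
    counts = ratings['userId'].map(per_user)
    return ratings[counts > min_ratings]

# Toy data (hypothetical): u1 has 3 ratings, u2 has 1, u3 has 2
toy = pd.DataFrame({
    'userId': ['u1', 'u1', 'u1', 'u2', 'u3', 'u3'],
    'productId': ['p1', 'p2', 'p3', 'p1', 'p2', 'p3'],
    'rating': [5.0, 4.0, 3.0, 1.0, 2.0, 5.0],
})
dense = filter_active_users(toy, min_ratings=2)
print(dense)
```

On the real data the call would be `filter_active_users(df, 50)`, which applies the same strict `> 50` rule as the cell above.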

In [15]:
print("Total data ")
print("-"*50)
print("\nTotal no of ratings :",df_review_50.shape[0])
print("Total No of Users   :", len(np.unique(df_review_50.userId)))
print("Total No of products  :", len(np.unique(df_review_50.productId)))

Total data 
--------------------------------------------------

Total no of ratings : 122171
Total No of Users   : 1466
Total No of products  : 47155
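
To check that the subset really is denser, one can compare the fraction of user-product pairs that have a rating. The sketch below plugs in the counts printed above; the function name `rating_density` is hypothetical. By this measure about 0.18% of the cells are filled in the subset, against roughly 0.0004% in the full data.

```python
def rating_density(n_ratings: int, n_users: int, n_items: int) -> float:
    """Fraction of the user x item matrix that contains a rating."""
    return n_ratings / (n_users * n_items)

# Counts taken from the outputs above
full_density = rating_density(7_824_482, 4_201_696, 476_002)
subset_density = rating_density(122_171, 1_466, 47_155)

print(f"full dataset density: {full_density:.6%}")
print(f"subset density:       {subset_density:.4%}")
```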


## Split the data randomly into train and test datasets

In [16]:
train_data, test_data = train_test_split(df_review_50, test_size = 0.20, random_state=0)


In [17]:
train_data.shape

(97736, 3)

In [18]:
test_data.shape

(24435, 3)

## Build Popularity Recommender model

In [19]:
# Use the number of users who rated each product as its recommendation score
train_data_grouped = train_data.groupby('productId').agg({'userId': 'count'}).reset_index()
train_data_grouped.rename(columns = {'userId': 'score'},inplace=True)
train_data_grouped.head()

| | productId | score |
|---|---|---|
| 0 | 594481813 | 1 |
| 1 | 970407998 | 2 |
| 2 | 972683275 | 2 |
| 3 | 1400501466 | 4 |
| 4 | 1400501520 | 1 |


In [20]:
# Sort the products by recommendation score (ties broken by productId)
train_data_sort = train_data_grouped.sort_values(['score', 'productId'], ascending=[False, True])

# Generate a recommendation rank based on the score
train_data_sort['Rank'] = train_data_sort['score'].rank(ascending=False, method='first')

# Get the top 10 recommendations
popularity_recommendations = train_data_sort.head(10)
popularity_recommendations

| | productId | score | Rank |
|---|---|---|---|
| 32948 | B0088CJT4U | 169 | 1.0 |
| 20944 | B003ES5ZUU | 148 | 2.0 |
| 9278 | B000N99BBC | 127 | 3.0 |
| 32319 | B007WTAJTO | 118 | 4.0 |
| 32626 | B00829TIEK | 116 | 5.0 |
| 32622 | B00829THK0 | 108 | 6.0 |
| 33233 | B008DWCRQW | 108 | 7.0 |
| 18544 | B002R5AM7C | 101 | 8.0 |
| 24270 | B004CLYEDC | 90 | 9.0 |
| 32663 | B00834SJNA | 84 | 10.0 |
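
A popularity model gives every user the same ranked list, so it is not personalised. The sketch below wraps the scoring above in a function and, as one possible refinement that the original notebook does not include, skips products the user has already rated. The function name `recommend_popular` and the toy data are assumptions for illustration.

```python
import pandas as pd

def recommend_popular(train: pd.DataFrame, user_id: str, n: int = 5) -> pd.DataFrame:
    """Return the n most-rated products that `user_id` has not rated yet."""
    # Score = number of ratings each product received in the training data
    scores = (train.groupby('productId')['userId'].count()
                   .rename('score').reset_index())
    # Drop products this user has already rated (assumed refinement)
    seen = set(train.loc[train['userId'] == user_id, 'productId'])
    scores = scores[~scores['productId'].isin(seen)]
    top = (scores.sort_values(['score', 'productId'], ascending=[False, True])
                 .head(n).reset_index(drop=True))
    top['Rank'] = range(1, len(top) + 1)
    top.insert(0, 'userId', user_id)
    return top

# Toy training data (hypothetical): p1 has 3 ratings, p2 has 2, p3 has 1
toy_train = pd.DataFrame({
    'userId': ['u1', 'u2', 'u3', 'u1', 'u2', 'u3'],
    'productId': ['p1', 'p1', 'p1', 'p2', 'p2', 'p3'],
    'rating': [5.0, 4.0, 4.0, 3.0, 5.0, 2.0],
})
print(recommend_popular(toy_train, 'u3', n=2))
print(recommend_popular(toy_train, 'new_user', n=2))
```

A user with no history, such as `new_user` here, simply receives the overall most-rated products, which is why popularity models are a common fallback for cold-start users.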


## Build Collaborative Filtering model

In [21]:
from surprise import KNNWithMeans
from surprise import Dataset
from surprise import accuracy
from surprise import Reader
import os
# Note: this shadows the sklearn train_test_split imported earlier
from surprise.model_selection import train_test_split

In [22]:
#Reading the dataset
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df_review_50,reader)

In [23]:
#Splitting the dataset
trainset, testset = train_test_split(data, test_size=0.3,random_state=10)

In [24]:
# user-based collaborative filtering
algo_user_based = KNNWithMeans(k=5, sim_options={'name': 'pearson_baseline', 'user_based': True})
algo_user_based.fit(trainset)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x7f5b57f57f90>

In [25]:
# item-based collaborative filtering
algo_item_based = KNNWithMeans(k=5, sim_options={'name': 'pearson_baseline', 'user_based': False})
algo_item_based.fit(trainset)
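
`KNNWithMeans` estimates a rating as the target user's mean rating (or the item's mean, when `user_based` is `False`) plus a similarity-weighted average of the neighbours' mean-centred ratings. The NumPy sketch below illustrates that formula for the user-based case with made-up numbers. It is not Surprise's implementation: the function name is hypothetical, and details such as `min_k` and clipping to the rating scale are left out.

```python
import numpy as np

def knn_with_means_estimate(user_mean, neighbor_sims, neighbor_ratings,
                            neighbor_means, k=5):
    """Mean-centred KNN estimate of one user's rating for one item."""
    sims = np.asarray(neighbor_sims, dtype=float)
    ratings = np.asarray(neighbor_ratings, dtype=float)
    means = np.asarray(neighbor_means, dtype=float)

    # Take the k most similar neighbours, then keep only positive similarities
    order = np.argsort(-sims)[:k]
    order = order[sims[order] > 0]
    if order.size == 0:
        # No usable neighbour: fall back to the user's own mean
        return float(user_mean)

    weights = sims[order]
    deviations = ratings[order] - means[order]
    return float(user_mean + np.sum(weights * deviations) / np.sum(weights))

# Three neighbours who rated the item; the third has negative similarity
est = knn_with_means_estimate(
    user_mean=4.0,
    neighbor_sims=[0.5, 0.5, -0.2],
    neighbor_ratings=[5.0, 3.0, 1.0],
    neighbor_means=[4.0, 3.0, 3.0],
)
print(est)
```

Centring on the means compensates for users who rate systematically high or low, which matters here because the overall mean rating is about 4.0.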

In [None]:
# run the trained model against the testset
test_pred_user_based = algo_user_based.test(testset)
test_pred_item_based = algo_item_based.test(testset)

## Evaluate both models (Once the model is trained on the training data, it can be used to compute the error (RMSE) on predictions made on the test data.)

In [None]:
# get RMSE of the user-based model
print("User-based Model : Test Set")
accuracy.rmse(test_pred_user_based, verbose=True)

User-based Model : Test Set
RMSE: 1.0476


1.0476044141023797

In [None]:
# get RMSE
print("Item-based Model : Test Set")
accuracy.rmse(test_pred_item_based, verbose=True)
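
RMSE is the square root of the mean squared difference between true and estimated ratings. The sketch below computes it by hand on toy numbers; the standalone `rmse` function is an illustration, not the Surprise API.

```python
import numpy as np

def rmse(true_ratings, estimated_ratings):
    """Root mean squared error between two equally long sequences."""
    true_ratings = np.asarray(true_ratings, dtype=float)
    estimated_ratings = np.asarray(estimated_ratings, dtype=float)
    return float(np.sqrt(np.mean((true_ratings - estimated_ratings) ** 2)))

# Errors are 1, 0 and -2, so the mean squared error is 5/3
score = rmse([5, 3, 1], [4, 3, 3])
print(round(score, 4))
```

Each Surprise prediction carries the true rating as `r_ui` and the estimate as `est`, so `rmse([p.r_ui for p in preds], [p.est for p in preds])` should agree with `accuracy.rmse(preds)` on the same list of predictions.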

## Get top K (K = 5) recommendations (Since our goal is to recommend new products to each user based on his/her habits, we will recommend 5 new products.)
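
One common way to obtain the top K items is to group predictions by user and keep the K highest estimates. The sketch below uses only the standard library; the `Prediction` namedtuple imitates the fields of Surprise's prediction objects (`uid`, `iid`, `r_ui`, `est`, `details`), and the toy predictions are invented. To recommend products a user has not rated yet, the predictions would need to come from unseen user-item pairs, for example `algo.test(trainset.build_anti_testset())`, which may be expensive for a matrix of this size.

```python
from collections import defaultdict, namedtuple

# Stand-in for Surprise's prediction tuples (assumed field order)
Prediction = namedtuple('Prediction', ['uid', 'iid', 'r_ui', 'est', 'details'])

def get_top_n(predictions, n=5):
    """Map each user id to their n highest-estimated (item, estimate) pairs."""
    top_n = defaultdict(list)
    for uid, iid, _true_rating, est, _details in predictions:
        top_n[uid].append((iid, est))
    # Sort each user's candidates by estimated rating, best first
    return {
        uid: sorted(items, key=lambda pair: pair[1], reverse=True)[:n]
        for uid, items in top_n.items()
    }

toy_predictions = [
    Prediction('u1', 'p1', None, 4.2, {}),
    Prediction('u1', 'p2', None, 4.8, {}),
    Prediction('u1', 'p3', None, 3.1, {}),
    Prediction('u2', 'p1', None, 2.5, {}),
]
top = get_top_n(toy_predictions, n=2)
for uid, items in top.items():
    print(uid, [iid for iid, _ in items])
```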

## Summarize your insights