Recommendation Systems

Problem Statement:

Build a recommendation system for products on an e-commerce website such as Amazon.com, using the dataset from the source below:
Amazon Reviews data (http://jmcauley.ucsd.edu/data/amazon/). The repository has several datasets.
For this case study, we use the Electronics dataset.


Background:

Amazon currently uses item-to-item collaborative filtering, which scales to massive data sets and produces high-quality recommendations in real time. This type of filtering matches each of the user's purchased and rated items to similar items, then combines those similar items into a recommendation list for the user.
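
The item-to-item idea described above can be sketched on a toy matrix. This is only an illustration under stated assumptions, not Amazon's implementation: the matrix `R`, the helper names `item_similarity` and `recommend`, and the choice of cosine similarity with a rating-weighted sum are all hypothetical.

```python
import numpy as np

# Toy user-by-item rating matrix (hypothetical values); 0 means "not rated".
R = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 1.0],
    [0.0, 0.0, 5.0, 4.0],
])

def item_similarity(ratings):
    """Cosine similarity between item columns, with self-similarity zeroed."""
    norms = np.linalg.norm(ratings, axis=0)
    norms[norms == 0] = 1.0              # avoid division by zero for unrated items
    unit = ratings / norms
    sim = unit.T @ unit
    np.fill_diagonal(sim, 0.0)
    return sim

def recommend(user_row, sim, n=2):
    """Score each item by its similarity to the items the user has rated."""
    scores = sim @ user_row              # similar items weighted by the user's ratings
    scores[user_row > 0] = -np.inf       # never recommend already-rated items
    order = np.argsort(scores)[::-1]
    return [int(i) for i in order[:n] if np.isfinite(scores[i])]

sim = item_similarity(R)
print(recommend(R[0].copy(), sim, n=2))  # item indices for the first user
```

Each rated item contributes its nearest neighbours, and the contributions are combined into a single ranked list, which mirrors the "match, then combine" description above.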

Dataset columns:

userId, productId, ratings and timestamp (the fourth column).

In [1]:
# importing required libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
import Recommenders as Recommenders
import Evaluation as Evaluation

1. Exploratory Analysis:

In [2]:
# read the given dataset

df = pd.read_csv('./ratings_Electronics.csv')

In [3]:
# glance of first few records

df.head()

Unnamed: 0,AKM1MP6P0OYPR,0132793040,5.0,1365811200
0,A2CX7LUOHB2NDG,321732944,5.0,1341100800
1,A2NWSAGRHCP8N5,439886341,1.0,1367193600
2,A2WNBOD3WNDNKT,439886341,3.0,1374451200
3,A1GI0U4ZRJA8WN,439886341,1.0,1334707200
4,A1QGNMC6O1VW39,511189877,5.0,1397433600


Observations:
i) The file has no header row, so pandas has used the first record as the column names
ii) The last column, 'timestamp', is of little use for our analysis, so we can drop it
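
Because the file has no header row, a default `read_csv` call promotes the first record to column names, and that record is lost once the columns are renamed. A possible alternative, sketched here on a two-record in-memory stand-in for the file, is to pass `header=None` together with explicit names. The `dtype` argument is an extra assumption on my part, added so that product IDs keep their leading zeros.

```python
import io
import pandas as pd

# Two-record stand-in for ratings_Electronics.csv (values taken from the output above).
sample_csv = io.StringIO(
    "AKM1MP6P0OYPR,0132793040,5.0,1365811200\n"
    "A2CX7LUOHB2NDG,0321732944,5.0,1341100800\n"
)

columns = ["userId", "productId", "ratings", "timestamp"]
df_sample = pd.read_csv(
    sample_csv,
    header=None,                 # the file has no header row
    names=columns,               # supply the column names explicitly
    dtype={"productId": str},    # keep product IDs as text (assumption)
)
df_sample = df_sample.drop(columns="timestamp")
print(df_sample)
```

With this approach the first record stays in the data instead of becoming the header.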

In [5]:
df = df.iloc[:, :-1]
df.columns = ["userId", "productId", "ratings"]
df

Unnamed: 0,userId,productId,ratings
0,A2CX7LUOHB2NDG,0321732944,5.0
1,A2NWSAGRHCP8N5,0439886341,1.0
2,A2WNBOD3WNDNKT,0439886341,3.0
3,A1GI0U4ZRJA8WN,0439886341,1.0
4,A1QGNMC6O1VW39,0511189877,5.0
5,A3J3BRHTDRFJ2G,0511189877,2.0
6,A2TY0BTJOTENPG,0511189877,5.0
7,A34ATBPOK6HCHY,0511189877,5.0
8,A89DO69P0XZ27,0511189877,5.0
9,AZYNQZ94U6VDB,0511189877,5.0


In [7]:
df.shape

(7824481, 3)

In [9]:
# Dataset has 7824481 rows/records with 3 columns
# Checking column data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7824481 entries, 0 to 7824480
Data columns (total 3 columns):
userId       object
productId    object
ratings      float64
dtypes: float64(1), object(2)
memory usage: 179.1+ MB


Observations:
i) There are 7.8 million records; many of them are ratings by the same user for different products
ii) Taking a subset of the data is a practical choice for exploration and further analysis

2. Getting a subset of the given dataset

In [10]:
# find the top 10 users by number of products rated

most_rated=df.groupby('userId').size().sort_values(ascending=False)[:10]
most_rated

userId
A5JLAU2ARJ0BO     520
ADLVFFE4VBT8      501
A3OXHLG6DIBRW8    498
A6FIAB28IS79      431
A680RUE1FDO8B     406
A1ODOGXEYECQQ8    380
A36K2N527TXXJN    314
A2AY4YUOX2N1BQ    311
AWPODHOB4GFWL     308
A25C2M3QF9G7OQ    296
dtype: int64

In [11]:
# take a subset of the original dataset: keep only the ratings from users (userId) who have given 100 or more ratings
counts=df['userId'].value_counts()
df_final=df[df['userId'].isin(counts[counts>=100].index)]
df_final.head()

Unnamed: 0,userId,productId,ratings
117,AT09WGFUM934H,594481813,3.0
177,A17HMM1M7T9PJ1,970407998,4.0
630,A3TAS1AG6FMBQW,972683275,5.0
1776,A18S2VGUH9SCV5,1400501776,4.0
2161,A5JLAU2ARJ0BO,1400532655,1.0


In [12]:
df_final.shape

(44209, 3)

In [13]:
# the new dataframe df_final has 44209 records, compared to 7.8 million in the original dataset

# the code below builds a pivot table with one row per user (userId) and one column per product (productId); each cell holds that user's rating for that product, or 0 if the user has not rated it

pivot_df=df_final.pivot(index='userId', columns='productId', values='ratings').fillna(0)
pivot_df.shape
pivot_df.head()

productId,0594481813,0970407998,0972683275,1400501776,1400532655,1400599997,1400699169,1685560148,7562434166,787988002X,...,B00L2P3TRS,B00L3YHF6O,B00L403O94,B00L43HAY6,B00L8I6SFY,B00LA6T0LS,B00LBZ1Z7K,B00LGQ6HL8,B00LI4ZZO8,B00LKG1MC8
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
A100UD67AHFODS,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A100WO06OQR8BQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A10PEXB6XAQ5XF,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A10Y058K7B96C6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A10ZFE6YE0UHW8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
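
The rows shown above are almost entirely zeros, which is typical of a user-by-product matrix. A quick way to quantify this is to compute the share of cells that hold a real rating. The sketch below uses a small hypothetical frame in the same long format as `df_final`; the names `toy`, `toy_pivot` and `density` are illustrative only.

```python
import pandas as pd

# Toy ratings in the same long format as df_final (hypothetical values).
toy = pd.DataFrame({
    "userId": ["u1", "u1", "u2", "u3"],
    "productId": ["p1", "p2", "p2", "p3"],
    "ratings": [5.0, 3.0, 4.0, 1.0],
})

toy_pivot = toy.pivot(index="userId", columns="productId", values="ratings").fillna(0)

n_rated = int((toy_pivot != 0).sum().sum())          # cells holding a real rating
n_cells = toy_pivot.shape[0] * toy_pivot.shape[1]    # all user/product pairs
density = n_rated / n_cells
print(f"{n_rated} of {n_cells} cells are rated (density = {density:.2%})")
```

Running the same three lines on `pivot_df` would show how sparse the real matrix is, which matters because the zeros are treated as ratings of 0 by the SVD step.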


In [15]:
# getting a visualization of the number of ratings

import matplotlib.pyplot as plt
fig = plt.figure(figsize=(10,5))
ax = fig.add_subplot(131)
ax.hist(df_final['ratings'], bins=[0.9, 1.1, 1.9, 2.1, 2.9, 3.1, 3.9, 4.1, 4.9, 5.1])
ax.set_xlabel('product rating', fontsize=12)

<matplotlib.text.Text at 0x12e085f90b8>

3. Split into train and test dataset

In [16]:
# importing required libraries and splitting the data into train and test sets of 70/30 ratio
# getting a glance of test set:

from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity

train_data, test_data=train_test_split(df_final,test_size=0.3,random_state=0)
test_data.head()

Unnamed: 0,userId,productId,ratings
32465,A36K2N527TXXJN,B00003006E,5.0
1494827,A2UOHALGF2X77Q,B000WOIG8C,5.0
2152310,A2J8T58Z4X15IO,B001MT8J4W,5.0
4168342,A2L42QEWR77PKZ,B004I45QBW,5.0
563139,AT6CZDCP4TRGA,B0006G0RSI,3.0


In [17]:
# getting a glance of train set
train_data.head()

Unnamed: 0,userId,productId,ratings
5537141,AN81JUYW2SL24,B006ZW4H4C,5.0
3066467,AEWYUPCNDV7HY,B0035FZ124,5.0
1535040,A1U5IJHJK84S54,B000YNVSMW,3.0
3310086,A1ZU55TM45Y2R8,B003ES4NIA,5.0
1188837,AGOH8N902URMW,B000N29KOW,5.0


4. Popularity Recommender Model:

In [18]:
#importing required libraries and create popularity model

import Recommenders as Recommenders
import Evaluation as Evaluation
users = df_final['userId'].unique()
pm = Recommenders.popularity_recommender_py()
pm.create(train_data, 'userId', 'productId')

In [19]:
# the popularity recommendations are the same for every user, because they are based on overall popularity rather than on the individual user
# below we check the popularity recommendations for the user at index 50 of the users array

user_id = users[50]
pm.recommend(user_id)

Unnamed: 0,userId,productId,score,Rank
14029,A1V3TRGWOMA8LC,B0088CJT4U,57,1.0
13912,A1V3TRGWOMA8LC,B00829TIEK,39,2.0
8342,A1V3TRGWOMA8LC,B002R5AM7C,36,3.0
9322,A1V3TRGWOMA8LC,B003ES5ZUU,36,4.0
14131,A1V3TRGWOMA8LC,B008DWCRQW,35,5.0
10619,A1V3TRGWOMA8LC,B004CLYEFK,33,6.0
13908,A1V3TRGWOMA8LC,B00829THK0,33,7.0
4248,A1V3TRGWOMA8LC,B000N99BBC,30,8.0
13799,A1V3TRGWOMA8LC,B007WTAJTO,30,9.0
17144,A1V3TRGWOMA8LC,B00HFRWWAM,29,10.0


In [20]:
# below we check the popularity recommendations for another user

user_id = users[20]
pm.recommend(user_id)

Unnamed: 0,userId,productId,score,Rank
14029,ADOR3TR7GDF68,B0088CJT4U,57,1.0
13912,ADOR3TR7GDF68,B00829TIEK,39,2.0
8342,ADOR3TR7GDF68,B002R5AM7C,36,3.0
9322,ADOR3TR7GDF68,B003ES5ZUU,36,4.0
14131,ADOR3TR7GDF68,B008DWCRQW,35,5.0
10619,ADOR3TR7GDF68,B004CLYEFK,33,6.0
13908,ADOR3TR7GDF68,B00829THK0,33,7.0
4248,ADOR3TR7GDF68,B000N99BBC,30,8.0
13799,ADOR3TR7GDF68,B007WTAJTO,30,9.0
17144,ADOR3TR7GDF68,B00HFRWWAM,29,10.0


i)  We find that the recommendations are the same for both users
ii) This model recommends the most popular products across all users; a personalised, item-based model is a better approach for
    recommendations
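
The `Recommenders` module is not shown in this notebook, so its internals are unknown. A minimal sketch of what a popularity model might do, assuming the score is simply the number of ratings a product received in the training data, is given below. The function name `popularity_top_n` and the toy frame are hypothetical.

```python
import pandas as pd

def popularity_top_n(train, item_col, n=10):
    """Rank items by how many ratings they received in `train`."""
    scores = (
        train.groupby(item_col)
        .size()
        .reset_index(name="score")
        # break ties on the item id so the ranking is deterministic
        .sort_values(["score", item_col], ascending=[False, True])
    )
    scores["Rank"] = range(1, len(scores) + 1)
    return scores.head(n).reset_index(drop=True)

# Toy training data (hypothetical values).
toy_train = pd.DataFrame({
    "userId": ["u1", "u2", "u3", "u1", "u2", "u3"],
    "productId": ["p1", "p1", "p1", "p2", "p2", "p3"],
})
top = popularity_top_n(toy_train, "productId", n=2)
print(top)
```

The ranking never looks at which user is asking, which is why two different users receive identical lists.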


5. Collaborative filtering Model:

In [21]:
# importing required libraries; we use SVD so that each user gets recommendations based on that user's past behavior
# Singular Value Decomposition (SVD) is a matrix decomposition method that reduces a matrix to its constituent parts in order to make certain subsequent matrix calculations simpler.
# A = U . Sigma . V^T :- A is the real m x n matrix that we wish to decompose. In the full SVD, U is an m x m matrix, Sigma is an m x n diagonal matrix, and V^T is the transpose of an n x n matrix.
# svds computes a truncated SVD that keeps only k singular values, so here U is m x k, sigma has k entries, and Vt is k x n.

import numpy as np
from scipy.sparse.linalg import svds
# Truncated Singular Value Decomposition with k = 10 latent features
U, sigma, Vt = svds(pivot_df.values, k = 10)
# Turn the 1-D array of singular values into a k x k diagonal matrix
sigma = np.diag(sigma)
# np.diag on a 2-D array extracts its main diagonal; U2 is printed for inspection only and is not used later
U2 = np.diag(U)

#sigma
print(U)
print(U2)

[[-0.01654602  0.02512749  0.01479837 ..., -0.00499753 -0.02084606
  -0.03004781]
 [ 0.00199923  0.01093674  0.00381616 ..., -0.00649364  0.03417711
  -0.03801345]
 [-0.01787443  0.00407825 -0.00117616 ..., -0.00876928  0.0233249
  -0.03377197]
 ..., 
 [-0.00923847  0.01072083 -0.00932572 ..., -0.01694248  0.04036    -0.02578002]
 [-0.01783128  0.01843143  0.00269933 ..., -0.01813256  0.02270296
  -0.03087051]
 [-0.01298711  0.06549062 -0.018101   ..., -0.00267206 -0.14561345
  -0.10065359]]
[-0.01654602  0.01093674 -0.00117616  0.03487586 -0.00159369 -0.00730938
  0.03745791 -0.00130176 -0.06549579 -0.00503429]
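
To make the shapes returned by `svds` concrete, the following sketch runs a truncated SVD on a small random matrix and compares its singular values with those of a full SVD. The matrix `A` and the value `k = 3` are arbitrary stand-ins. The values are sorted before comparison because, as far as I know, the SciPy documentation does not guarantee the order in which `svds` returns them.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
A = rng.random((6, 8))                 # stand-in for the user-by-product matrix (m=6, n=8)

k = 3
U_k, s_k, Vt_k = svds(A, k=k)          # truncated SVD: keep k singular values

print(U_k.shape, s_k.shape, Vt_k.shape)

# Rank-k approximation of A, built the same way as the predicted ratings
A_k = U_k @ np.diag(s_k) @ Vt_k

# The k values from svds should match the k largest from a full SVD
s_full = np.linalg.svd(A, compute_uv=False)
print(np.sort(s_k)[::-1])
print(s_full[:k])
```

The rank-k product has the same shape as the original matrix, which is what allows it to be read as a table of predicted ratings.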


In [22]:
# To estimate the rating for each pair of User/Item we can simply take the dot product of User-Feature and Feature-Item Matrix

all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) 

# Predicted ratings
preds_df = pd.DataFrame(all_user_predicted_ratings, columns = pivot_df.columns)
preds_df.head()

productId,0594481813,0970407998,0972683275,1400501776,1400532655,1400599997,1400699169,1685560148,7562434166,787988002X,...,B00L2P3TRS,B00L3YHF6O,B00L403O94,B00L43HAY6,B00L8I6SFY,B00LA6T0LS,B00LBZ1Z7K,B00LGQ6HL8,B00LI4ZZO8,B00LKG1MC8
0,0.004376,0.001057,0.004161,0.001726,0.004825,0.000194,0.017578,0.009128,0.001966,0.002814,...,0.044277,0.333528,-0.00039,-0.00804,0.059227,0.086071,-0.006778,0.140664,0.072311,0.022799
1,0.003667,0.008384,0.029385,0.015703,-0.004602,0.000366,0.024532,0.009325,0.001859,0.015246,...,-0.01924,0.172144,0.015216,0.038608,0.03406,-0.00877,0.01824,-0.025528,-0.017451,0.006688
2,0.003137,0.003995,0.024677,0.011827,-0.001617,0.000196,0.025942,0.008179,0.001592,0.011904,...,-0.009984,0.237399,0.01295,0.028121,0.034888,-7.9e-05,0.016353,-0.012854,-0.00331,0.008004
3,0.000404,0.014348,0.054795,0.028561,0.001552,0.000493,0.023471,0.004463,0.000555,0.024806,...,-0.056389,0.229455,0.037794,0.083599,0.025563,-0.071602,0.059349,-0.041721,-0.055347,-0.009855
4,0.005866,0.00915,0.022547,0.010896,0.003353,0.000406,0.02553,0.012655,0.002587,0.012838,...,0.031246,0.317201,0.011743,0.024374,0.081484,0.081634,0.009393,0.143562,0.058537,0.022167


In [23]:
# the function below takes a 1-based row position in pivot_df (not the userId string) and the number of recommendations, and prints the personalised recommendations:

def recommend_items(userID, pivot_df, preds_df, num_recommendations):

    user_idx = userID - 1  # convert the 1-based position to a 0-based row index

    # Get the user's actual ratings, sorted from highest to lowest
    sorted_user_ratings = pivot_df.iloc[user_idx].sort_values(ascending=False)
    print(sorted_user_ratings.head())

    # Get the user's predicted ratings, sorted from highest to lowest
    sorted_user_predictions = preds_df.iloc[user_idx].sort_values(ascending=False)
    print(sorted_user_predictions.head())

    # Align actual and predicted ratings side by side on productId
    temp = pd.concat([sorted_user_ratings, sorted_user_predictions], axis=1)
    print(temp.head())

    temp.index.name = 'Recommended Items'
    temp.columns = ['user_ratings', 'user_predictions']
    print(temp.head())

    # Keep only products the user has not rated, then rank them by predicted rating
    temp = temp.loc[temp.user_ratings == 0]
    temp = temp.sort_values('user_predictions', ascending=False)
    print('\nBelow are the recommended items for user(user_id = {}):\n'.format(userID))
    print(temp.head(num_recommendations))
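
The function above addresses users by row position. A variant that accepts the actual userId label is sketched below. It assumes the predictions frame carries the same index as the pivot table (for example by passing `index=pivot_df.index` when building it), which the notebook does not do. The name `recommend_for_user` and the toy frames are hypothetical.

```python
import pandas as pd

def recommend_for_user(user_id, ratings, predictions, n=5):
    """Top-n products the user has not rated, ranked by predicted rating."""
    actual = ratings.loc[user_id]
    predicted = predictions.loc[user_id]
    unrated = actual[actual == 0].index          # 0 marks "not rated"
    return predicted.loc[unrated].sort_values(ascending=False).head(n)

# Toy actual and predicted ratings sharing the same labels (hypothetical values).
toy_ratings = pd.DataFrame(
    [[5.0, 0.0, 0.0], [0.0, 4.0, 0.0]],
    index=["userA", "userB"], columns=["p1", "p2", "p3"],
)
toy_preds = pd.DataFrame(
    [[4.8, 0.3, 0.9], [0.2, 3.9, 0.1]],
    index=toy_ratings.index, columns=toy_ratings.columns,
)
recs = recommend_for_user("userA", toy_ratings, toy_preds, n=2)
print(recs)
```

Returning the result instead of printing it makes the function easier to reuse and to check.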

6. Suggest top 5 product recommendations for any user:

In [24]:
# For example, we request 5 recommendations for the first user in pivot_df (userID = 1) using the above function
userID = 1
num_recommendations = 5
recommend_items(userID, pivot_df, preds_df, num_recommendations)

productId
B00483WRZ6    5.0
B009LL9VDG    5.0
B0000AR0I4    5.0
B004R6J2KW    5.0
B00998P7UW    5.0
Name: A100UD67AHFODS, dtype: float64
productId
B003ES5ZUU    0.723593
B0082E9K7U    0.622621
B007WTAJTO    0.610299
B00G4UQ6U8    0.590279
B007OY5V68    0.587605
Name: 0, dtype: float64
            A100UD67AHFODS         0
0594481813             0.0  0.004376
0970407998             0.0  0.001057
0972683275             0.0  0.004161
1400501776             0.0  0.001726
1400532655             0.0  0.004825
                   user_ratings  user_predictions
Recommended Items                                
0594481813                  0.0          0.004376
0970407998                  0.0          0.001057
0972683275                  0.0          0.004161
1400501776                  0.0          0.001726
1400532655                  0.0          0.004825

Below are the recommended items for user(user_id = 1):

                   user_ratings  user_predictions
Recommended Items                  

7. Compute RMSE value

In [25]:
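# The SVD-based collaborative filter built above was fitted on the training and test sets combined

# Below we compute an RMSE for the SVD model by comparing, for each product, the average actual rating with the average predicted rating
# Note: unrated entries count as 0 in these averages and the model has already seen this data, so this is an in-sample figure rather than a held-out error

# A lower RMSE indicates a better fit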

rmse_df = pd.concat([pivot_df.mean(), preds_df.mean()], axis=1)
rmse_df.columns = ['Avg_actual_ratings', 'Avg_predicted_ratings']
rmse_df['item_index'] = np.arange(0, rmse_df.shape[0], 1)
rmse_df.head()

Unnamed: 0_level_0,Avg_actual_ratings,Avg_predicted_ratings,item_index
productId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
594481813,0.010381,0.003668,0
970407998,0.013841,0.007692,1
972683275,0.017301,0.02244,2
1400501776,0.013841,0.012313,3
1400532655,0.013841,0.007726,4


In [26]:
RMSE = round((((rmse_df.Avg_actual_ratings - rmse_df.Avg_predicted_ratings) ** 2).mean() ** 0.5), 5)
print('\nRMSE SVD Model = {} \n'.format(RMSE))


RMSE SVD Model = 0.01085 
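
The figure above compares per-product averages over a matrix in which unrated cells count as 0. A per-rating RMSE restricted to the cells that were actually rated is sketched below as an alternative. The helper name `rmse_on_observed` and the toy arrays are hypothetical, and the sketch assumes that 0 always means "not rated".

```python
import numpy as np

def rmse_on_observed(actual, predicted):
    """RMSE over the cells where an actual rating exists (non-zero)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mask = actual != 0                      # 0 marks "not rated" in the pivot table
    errors = actual[mask] - predicted[mask]
    return float(np.sqrt(np.mean(errors ** 2)))

# Toy actual and predicted matrices (hypothetical values).
toy_actual = np.array([[5.0, 0.0], [0.0, 3.0]])
toy_predicted = np.array([[4.0, 0.5], [0.2, 5.0]])
toy_rmse = rmse_on_observed(toy_actual, toy_predicted)
print(round(toy_rmse, 5))
```

For an out-of-sample estimate, the predictions would need to come from a model fitted on the training ratings only and be scored against the test ratings.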



8. Summary:

i)   The given dataset had close to 7.8 million ratings (records). These did not come from 7.8 million distinct users, as many users rated several different products.

ii)  We took a subset of the dataset by keeping only the users who have given 100 or more ratings.

iii) After this step, we had 44209 ratings (records) to work with.

iv)  The dataframe was split into train and test sets; however, we applied SVD to the combined data containing both sets.

v)   A popularity recommendation model was built; it gives the same recommendations to all users, based on overall popularity.

vi)  We used an SVD model to provide new product recommendations to a user based on that user's past ratings in the dataset.

vii) The function 'recommend_items' takes the user for whom we provide recommendations (given as a 1-based row position in the pivot table) and the number of products that we want to recommend.

viii) A glance at the dataset tells us that most of the ratings given were '5'.
