Domain: E-Commerce

Context: Every day, a million products are recommended to users on e-commerce websites based on popularity and other metrics. On the most popular e-commerce websites, product recommendations boost average order value by 50%, increase revenues by 300%, and improve conversion. Beyond being a powerful tool for increasing revenue, product recommendations have become so essential that customers now expect to see similar features on all other e-commerce sites.

Objective: To build a recommendation system that recommends at least five (5) new products based on the user's habits.

In [None]:
#1. Read and explore the given dataset (rename columns/add headers, plot histograms, find data characteristics)

In [None]:
# Import necessary libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")

In [None]:
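# Load the data and name it erData (i.e. Electronics Item Rating Data)
# Note: the CSV has no header row, so with the default settings pandas uses the
# first record as column labels; pass header=None to keep that record as data
erData = pd.read_csv("../input/electronics-item-ratings/ratings_Electronics.csv")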
erData.shape

- The dataset has 4 columns, and 7824481 rows are loaded into memory

In [None]:
# Display the data set info
erData.info()

- From the dataset information above, the first 2 columns are of object type
- The 1st column holds the User ID
- The 2nd column holds the Item ID
- The 3rd column holds float values representing the rating a user gave to an item
- The 4th column holds the timestamp, which is not needed for the computations that follow

In [None]:
# Look at the first few rows of the dataset
erData.head()

In [None]:
# The dataset has no column names. Add them as given in the problem statement
erData.columns = ['user id','item id', 'rating','timestamp']

# Look at the first few rows again
erData.head()

In [None]:
# The timestamp column is not needed for further processing, so drop it
erDataRS = erData.drop(['timestamp'], axis=1) # erDataRS = Electronics Rating dataset for Recommendation Systems

In [None]:
# Look at the first few rows of the reduced dataset
erDataRS.head()

In [None]:
# Check whether there are any missing values in the dataset
erDataRS.isnull().values.any()

- The output above indicates that there are no missing values in the dataset

In [None]:
# Check how many zero values are present in each column
(erDataRS == 0).sum(axis=0)

- The output above shows no zero values in any column, i.e. no user has given a zero rating or left a rating blank

In [None]:
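# Display the histogram of the rating column
# (sns.distplot is deprecated in recent seaborn releases; histplot is its replacement)
sns.histplot(erDataRS['rating'], discrete=True)
plt.show()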

- The plot above shows that the ratings take 5 distinct values
- The histogram indicates that the majority of ratings are 5
- The least frequent rating is 2
- User ID and Item ID are of object type, so no histogram can be drawn for them
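The bar heights in the histogram can also be read off as exact counts and shares. The following is a minimal sketch of that idea using only the standard library; the `ratings` list is made up for illustration and stands in for the real `erDataRS['rating']` column.

```python
from collections import Counter

# Hypothetical ratings, standing in for erDataRS['rating']
ratings = [5.0, 5.0, 4.0, 1.0, 5.0, 3.0, 2.0, 4.0, 5.0, 1.0, 3.0]

counts = Counter(ratings)
total = len(ratings)

# Share of each rating value, ordered by rating
rating_share = {r: counts[r] / total for r in sorted(counts)}

# Most and least frequent rating values
most_common_rating = counts.most_common(1)[0][0]
least_common_rating = min(counts, key=counts.get)

print(rating_share)
print("Most frequent:", most_common_rating, "Least frequent:", least_common_rating)
```

On the real data, `erDataRS['rating'].value_counts(normalize=True).sort_index()` should give the same kind of table directly.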

In [None]:
# Review the rating column and check its five-number summary
erDataRS['rating'].describe()

# Note: User ID and Item ID are of object type, so no five-number summary is computed for them

- The minimum rating is 1 and the maximum is 5
- The first quartile (25%) is 3 and the third quartile (75%) is 5
- The mean rating is about 4

In [None]:
# Count the records and the number of unique values in each column

print("Number of records in the dataset: ", erDataRS.shape[0])
print("Number of unique User ID: ", erDataRS['user id'].nunique())
print("Number of unique Item ID: ", erDataRS['item id'].nunique())
print("Number of unique Rating: ", erDataRS['rating'].nunique())

In [None]:
#2. Take a subset of the dataset to make it less sparse (denser).
# (For example, keep only the users who have given 50 or more ratings)

In [None]:
# Count the ratings given by each user and sort in descending order
userListGroupbyRating = erDataRS.groupby('user id')['rating'].count().sort_values(ascending=False)

# Display the users with the most ratings
userListGroupbyRating.head()

In [None]:
# Display the number of users who have rated 50 or more items
print("Count of unique users who have given 50 or more ratings: ", (userListGroupbyRating >= 50).sum())

In [None]:
# Build a new dataframe with only the users who have given 50 or more ratings
erDataRS50_byUserID = erDataRS.groupby("user id").filter(lambda x:x['rating'].count() >= 50)

# View the dataset of users who have given 50 or more ratings
erDataRS50_byUserID.head()

In [None]:
erDataRS50_byUserID.shape # Display the shape of the dataset

- The dataset restricted to users with 50 or more ratings has 125871 rows
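Whether the filter really makes the data denser can be checked by comparing the fraction of filled cells in the user-item matrix before and after filtering. The sketch below illustrates this on made-up `(user, item, rating)` triples; the helper names `density` and `keep_active_users` are hypothetical, and each user-item pair is assumed to appear at most once.

```python
from collections import Counter

# Hypothetical (user, item, rating) triples; each pair is assumed to appear once
ratings = [
    ("u1", "a", 5.0), ("u1", "b", 4.0), ("u1", "c", 3.0),
    ("u2", "d", 1.0),
    ("u3", "a", 2.0), ("u3", "b", 5.0),
]

def density(rows):
    """Fraction of the user-item matrix that holds a rating."""
    users = {u for u, _, _ in rows}
    items = {i for _, i, _ in rows}
    return len(rows) / (len(users) * len(items))

def keep_active_users(rows, min_ratings):
    """Keep only the rows from users with at least `min_ratings` ratings."""
    per_user = Counter(u for u, _, _ in rows)
    return [row for row in rows if per_user[row[0]] >= min_ratings]

dense_rows = keep_active_users(ratings, min_ratings=2)
density_all = density(ratings)
density_dense = density(dense_rows)

print("Density before filtering:", density_all)
print("Density after filtering:", density_dense)
```

In the notebook the threshold is 50 rather than 2; the toy threshold is only there to keep the example small.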

In [None]:
#3. Build Popularity Recommender model

In [None]:
# Identify the popular items based on the number of ratings and sort them

# Count the ratings received by each item and sort in descending order
itemListGroupbyRating = erDataRS.groupby('item id')['rating'].count().sort_values(ascending=False)

# Display the top 5 items with the highest number of ratings
itemListGroupbyRating.head(5)

In [None]:
# Create a dataset with the mean rating of each item
itemListMeanRating = pd.DataFrame(erDataRS.groupby('item id')['rating'].mean())

itemListMeanRating.head() #Display the dataset sample

In [None]:
# Add a column with the number of ratings each item received
itemListMeanRating['rating_count'] = erDataRS.groupby('item id')['rating'].count()

itemListMeanRating.head() #Display the dataset sample

In [None]:
# Display the 10 most-rated items together with their mean rating
itemListMeanRating.sort_values(by='rating_count', ascending=False).head(10)

In [None]:
# Display a joint plot of mean rating vs. rating count per item
sns.jointplot(x='rating', y='rating_count', data=itemListMeanRating)

- From the plot and table above we can pick the 5 items with the highest number of ratings, along with their mean rating
- Each of these 5 items received more than 12k ratings, which matches the earlier computation of the most popular items
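The popularity ranking above can be turned into a per-user recommendation by skipping items the user has already rated. The sketch below shows one way to do that on made-up triples; `popular_items` is a hypothetical helper, and ranking by rating count first and mean rating second is an assumption chosen to mirror the sort by `rating_count` used above.

```python
from collections import defaultdict

# Hypothetical (user, item, rating) triples
ratings = [
    ("u1", "a", 5.0), ("u2", "a", 4.0), ("u3", "a", 5.0),
    ("u1", "b", 3.0), ("u2", "b", 4.0),
    ("u3", "c", 5.0),
    ("u2", "d", 2.0), ("u3", "d", 1.0),
]

def popular_items(rows, user, n=5):
    """Top-n items by rating count (ties broken by mean rating), unseen by `user`."""
    stats = defaultdict(lambda: [0, 0.0])  # item -> [rating count, rating sum]
    seen = set()
    for u, i, r in rows:
        stats[i][0] += 1
        stats[i][1] += r
        if u == user:
            seen.add(i)
    ranked = sorted(
        ((i, c, s / c) for i, (c, s) in stats.items() if i not in seen),
        key=lambda t: (t[1], t[2]),
        reverse=True,
    )
    return [i for i, _, _ in ranked[:n]]

recs_u1 = popular_items(ratings, "u1", n=2)         # u1 already rated a and b
recs_new = popular_items(ratings, "new_user", n=3)  # a new user has rated nothing

print(recs_u1)
print(recs_new)
```

Because the ranking ignores the user's tastes, every new user gets the same list; this is what distinguishes the popularity model from the collaborative filtering models.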

In [None]:
#4. Split the data randomly into a train and test dataset (for example, in a 70/30 ratio)
# Load the required libraries
from surprise.model_selection import train_test_split
from surprise import Dataset
from surprise import Reader

# Reduce the size of the dataset used for collaborative filtering to avoid memory issues:
# take a 10% sample of the ratings from users with 50 or more ratings
# (random_state is fixed here so that the sample is reproducible)
erData_reduced = erDataRS50_byUserID.sample(frac=0.1, random_state=1)

# Initialize the reader and build the dataset on which the train-test split will be applied
reader = Reader(rating_scale=(1, 5))
erDataset = Dataset.load_from_df(erData_reduced, reader)

# Split the data randomly into train and test sets
trainset, testset = train_test_split(erDataset, test_size=.30, random_state=1)

In [None]:
erDataRS50_byUserID.shape # Print the shape of the dataset of users with 50 or more ratings

In [None]:
erData_reduced.shape # Print the shape of the 10% sample of that dataset

- The output above shows how far the number of rows drops after taking a 10% sample of the filtered dataset
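Conceptually, the random 70/30 split applied to the reduced dataset amounts to shuffling the rating rows and cutting the shuffled list in two. The sketch below illustrates that idea with the standard library only; `random_split` is a hypothetical helper and is not how the Surprise library implements `train_test_split` internally.

```python
import random

def random_split(rows, test_size=0.30, seed=1):
    """Shuffle a copy of `rows` and cut it into (train, test) parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical (user, item, rating) rows
rows = [(f"u{k}", f"i{k}", 5.0) for k in range(10)]
train, test = random_split(rows)

print(len(train), "training rows,", len(test), "test rows")
```

Fixing the seed plays the same role as `random_state=1` in the notebook cell: repeated runs evaluate the models on the same test rows.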

In [None]:
#5. Build Collaborative Filtering model

In [None]:
# Implement item-item collaborative filtering

# Load the library
from surprise import KNNWithMeans
from surprise import accuracy

# Use user_based true/false to switch between user-based or item-based collaborative filtering
# In this case user_based = False means item-item collaborative filtering
item_item_model = KNNWithMeans(k=10, sim_options={'name': 'pearson_baseline', 'user_based': False})
item_item_model.fit(trainset)

In [None]:
# run the trained model against the testset
item_item_prediction = item_item_model.test(testset)

In [None]:
# Have a look into the prediction
item_item_prediction[:10]

- The item-item collaborative filtering model above was trained on the reduced dataset, a random sample of the ratings from users with 50 or more ratings
- Up to k=10 nearest neighbours are used when comparing an item against other items
- When the user and the item are both found in the training set, the prediction is marked 'was_impossible: False'
- For every test row, the actual and predicted ratings are displayed
- When the user and/or the item is not found in the training set, the prediction is marked 'was_impossible: True'
- The RMSE (root mean square error) for this model is computed and compared in the next section
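To make the neighbour-based prediction concrete, the sketch below works through a mean-centred item-based estimate: the item's own mean rating plus a similarity-weighted average of how far the user's ratings of neighbouring items deviate from those items' means. This is an illustration of the general technique, not the Surprise implementation; the function name `item_knn_with_means`, the similarity values, and the fallback to the item mean when no neighbour qualifies are all assumptions.

```python
def item_knn_with_means(user_ratings, item_means, sims, target, k=10):
    """Mean-centred item-based estimate of one user's rating for `target`.

    user_ratings: {item: rating} for a single user
    item_means:   {item: mean rating of that item}
    sims:         {(target, other_item): similarity}, hypothetical values
    """
    candidates = [
        (sims.get((target, j), 0.0), j) for j in user_ratings if j != target
    ]
    # Keep the k most similar neighbours with positive similarity
    neighbours = sorted((c for c in candidates if c[0] > 0), reverse=True)[:k]
    if not neighbours:
        return item_means[target]  # assumed fallback: the item's mean rating
    weighted = sum(s * (user_ratings[j] - item_means[j]) for s, j in neighbours)
    total_sim = sum(s for s, _ in neighbours)
    return item_means[target] + weighted / total_sim

# Made-up data: the user rated items b and c, and we estimate item a
user_ratings = {"b": 5.0, "c": 2.0}
item_means = {"a": 4.0, "b": 4.5, "c": 3.0}
sims = {("a", "b"): 0.8, ("a", "c"): 0.2}

estimate = item_knn_with_means(user_ratings, item_means, sims, target="a")
fallback = item_knn_with_means({}, item_means, sims, target="a")
print(round(estimate, 2), fallback)
```

Here the user rated item b above its mean and item c below its mean; because b is much more similar to a, the estimate ends up slightly above a's mean rating.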

In [None]:
# Implement user-user collaborative filtering

In [None]:
# Use user_based true/false to switch between user-based or item-based collaborative filtering
# In this case user_based = True means user-user collaborative filtering
user_user_model = KNNWithMeans(k=10, sim_options={'name': 'pearson_baseline', 'user_based': True})
user_user_model.fit(trainset)

In [None]:
# We can now query for specific predictions
uid = 'A1UQBFCERIP7VJ'  # raw user id
iid = 'B0046HAO40'  # raw item id

In [None]:
# get a prediction for specific users and items.
specific_user_pred = user_user_model.predict(uid, iid, verbose=True)

- The output above shows the predicted rating for the given user and item IDs, a pair for which the user has not rated the item before

In [None]:
# run the trained model against the testset
user_user_prediction = user_user_model.test(testset)
# Display the list of prediction
user_user_prediction[:10]

- The user-user collaborative filtering model above was trained on the reduced dataset, a random sample of the ratings from users with 50 or more ratings
- Up to k=10 nearest neighbours are used when comparing a user against other users
- When the user and the item are both found in the training set, the prediction is marked 'was_impossible: False'
- For every test row, the actual and predicted ratings are displayed
- When the user and/or the item is not found in the training set, the prediction is marked 'was_impossible: True'
- The RMSE (root mean square error) for this model is computed and compared in the next section

In [None]:
#6. Evaluate the above model. ( Once the model is trained on the training data, it can be used to compute the error 
# (like RMSE) on predictions made on the test data.) You can also use a different method to evaluate the models

In [None]:
# RMSE - Item_Item collaboration model:
print("Item-based Model : Test Set")
accuracy.rmse(item_item_prediction, verbose=True)

In [None]:
# RMSE - User_User collaboration model:
print("User-based Model : Test Set")
accuracy.rmse(user_user_prediction, verbose=True)

- The RMSE (root mean square error) measures the difference between the actual and predicted ratings for the item-item and user-user collaborative filtering models
- The RMSE is lower for the user-user model (1.08) than for the item-item model (1.12)
- This indicates that the user-user model is slightly better than the item-item model at predicting ratings, and hence at recommending items to users who have not yet rated them
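For reference, the RMSE reported above is the square root of the average squared difference between actual and predicted ratings. A minimal sketch of that computation on made-up `(actual, predicted)` pairs, with a hypothetical helper named `rmse`:

```python
import math

def rmse(pairs):
    """Root mean square error over (actual, predicted) rating pairs."""
    squared_errors = [(actual - predicted) ** 2 for actual, predicted in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical (actual, predicted) ratings
pairs = [(5.0, 4.0), (3.0, 3.0), (1.0, 3.0), (4.0, 5.0)]
score = rmse(pairs)
print(score)
```

Because the errors are squared before averaging, one prediction that is off by 2 stars contributes as much as four predictions that are each off by 1 star.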

In [None]:
#7. Get top - K ( K = 5) recommendations. Since our goal is to recommend new products to each user based on 
# his/her habits, we will recommend 5 new products.

In [None]:
# Import libraries to implement SVD model
from collections import defaultdict
from surprise import SVD

# Build the training set for the SVD model
# erDataset was built earlier from the reduced dataset by initializing the Reader
trainset_svd = erDataset.build_full_trainset() 

In [None]:
# Display a slice of the dictionary of user ratings: the keys are user inner ids and
# the values are lists of tuples of the form (item_inner_id, rating)

import itertools # Used to take a slice of the dictionary items

dict(itertools.islice(trainset_svd.ur.items(), 5))

In [None]:
svd_model = SVD() # Initialize the SVD model
svd_model.fit(trainset_svd) # Fit the trainset dataset to svd model

# Predict ratings for all pairs (u, i) that are NOT in the training set.
testset_svd = trainset_svd.build_anti_testset()

In [None]:
# Have a look into the test set
testset_svd[:10]

In [None]:
# Get the svd prediction
svd_predictions = svd_model.test(testset_svd)

In [None]:
# Have a look into the SVD prediction
svd_predictions[:10]

- The SVD model above was trained on the reduced dataset, a random sample of the ratings from users with 50 or more ratings
- Because every user and item in the anti-testset is known to the model, each prediction is marked 'was_impossible: False'
- For each user id and item, r_ui and the predicted rating (est) are displayed; since these pairs have no actual rating, r_ui holds the global mean rating that build_anti_testset fills in by default
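The anti-testset used above is simply every user-item pair that does not occur in the training data, with a placeholder in the rating slot. The sketch below reproduces that idea on made-up triples; `anti_testset_sketch` is a hypothetical helper written for illustration and is not the library's own code.

```python
def anti_testset_sketch(rows):
    """All (user, item) pairs absent from `rows`, filled with the global mean."""
    users = sorted({u for u, _, _ in rows})
    items = sorted({i for _, i, _ in rows})
    rated = {(u, i) for u, i, _ in rows}
    global_mean = sum(r for _, _, r in rows) / len(rows)
    return [
        (u, i, global_mean)
        for u in users
        for i in items
        if (u, i) not in rated
    ]

# Hypothetical (user, item, rating) triples
rows = [("u1", "a", 5.0), ("u1", "b", 3.0), ("u2", "a", 4.0)]
anti = anti_testset_sketch(rows)
print(anti)
```

The number of such pairs grows with users times items, which is one reason the notebook works on a reduced sample before predicting over the full anti-testset.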

In [None]:
# Build a function that stores the top items to recommend to each user, based on the SVD model

def get_top_n(svd_predictions, n=5):
    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in svd_predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

In [None]:
top_n = get_top_n(svd_predictions, n=5) #Store the top 5 recommended items by user

In [None]:
# Display the top 5 recommended items per user, with their predicted ratings

# Take a slice of the dictionary to display 5 users and the 5 items recommended to each of them
dict(itertools.islice(top_n.items(), 5)) 

- Along with the top 5 recommended items, the corresponding predicted ratings are displayed above

In [None]:
# Print the top 5 recommended items for the first 5 users

for uid, user_ratings in itertools.islice(top_n.items(), 5):
    print(uid, [iid for (iid, _) in user_ratings])

In [None]:
#8. Summarise insights of the analysis

- The computations above lead to the following inferences:
    - The dataset contains a large number of users, items, the corresponding user ratings, and the timestamp at which each rating was given
    - The timestamp field was not required, so we dropped it
    - In the current Python environment, a total of 7.8 million records were loaded
    - Using a simple group-by filter, we kept only the ratings (about 0.1 million) from users who gave 50 or more ratings, i.e. the most active users
    - Built a popularity-based recommendation system from the items that received the highest number of ratings
    - The most popular items would be recommended to new users who have never bought them
    - To build the train and test sets with an even smaller dataset, a 10% sample of this data was used, i.e. ~12k records
    - Built the following 2 collaborative filtering models:
        - Item-Item model:
            - This model measures how close items are to each other based on prior buying history, i.e. how users rated them
            - Ratings are predicted for the test set, for which the actual ratings are known
            - The prediction error is summarised as the root mean square error (RMSE) of the model
        - User-User model:
            - This model measures how close users are to each other based on prior buying history, i.e. which items they rated and how
            - Ratings are predicted for the test set, for which the actual ratings are known
            - The prediction error is summarised as the root mean square error (RMSE) of the model
    - Comparing the RMSE of the item-item and user-user models shows that the prediction error is lower for the user-user model, which makes it the better choice for recommending new items to users and increasing item sales
    - As a next step, matrix factorization was performed with an SVD model, and the top 5 items were recommended to each user to increase item sales
    - Overall, through the various steps of the recommendation system, we can achieve the following:
        - Predict the rating of any item and recommend highly rated products to users
        - Sales of the highest-rated items can be influenced by the recommendation system
        - Users with similar buying habits can be identified
        - Popular items that are similar to each other can be identified
        - Less popular items can be isolated, and a specific plan can be made to increase their sales