**DISCLAIMER:** I have done this project as part of my enrollment in MIT's Applied Data Science Program. The notebook is templated and some credit goes to the program's faculty/mentors.

# Recommendation Systems: Amazon product reviews

We will work with the Amazon product reviews dataset for this project. The dataset contains ratings of different electronic products. Product information and review text are excluded to avoid bias while building the model.

--------------
### Context: 
--------------

E-commerce websites such as Amazon and Flipkart use recommendation models to provide personalized suggestions to their users. Amazon currently uses item-to-item collaborative filtering, which scales to massive datasets and produces high-quality recommendations in real time.

----------------
### Objective:
----------------

Build a recommendation system to recommend products to customers based on their previous ratings for other products.

--------------
### Dataset:
--------------

The Amazon dataset contains the following attributes:

- **userId:** Unique identifier of each user
- **productId:** Unique identifier of each product
- **rating:** Rating given to the corresponding product by the corresponding user
- **timestamp:** Time of the rating (ignored in this exercise)

### Importing Libraries

In [1]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics.pairwise import cosine_similarity

from sklearn.metrics import mean_squared_error

### Loading data

In [1]:
df = pd.read_csv('../input/amazon-product-reviews/ratings_Electronics (1).csv', header=None) #There are no headers in the data file

df.columns = ['user_id', 'prod_id', 'rating', 'timestamp'] #Adding column names

df = df.drop('timestamp', axis=1) #Dropping timestamp

df_copy = df.copy(deep=True) #Copying the data to another dataframe

In [1]:
# see few rows of the imported dataset
df.head()

### Exploratory Data Analysis

#### Shape of the data

In [1]:
# Check the number of rows and columns
rows, columns = df.shape[0], df.shape[1]
print("No of rows: ", rows) 
print("No of columns: ", columns) 

#### Data types

In [1]:
#Check Data types
df.dtypes

#### Checking for missing values

In [1]:
# Find number of missing values in each column
df.isna().sum()

#### Summary Statistics

In [1]:
# Summary statistics of 'rating' variable
df.describe()

#### Checking the rating distribution

Check the distribution of ratings and **provide observations** from the plot 

In [1]:
#Create the plot and provide observations

plt.figure(figsize = (12,6))
df['rating'].value_counts(1).plot(kind='bar')
plt.show()

## Observations

- Around 75% of the ratings are positive (~55% are 5.0 and ~19% are 4.0). This shows that most users were happy with the products they rated.
- ~12% of the ratings are 1.0 and only ~5% are 2.0, so low ratings are comparatively rare. This also gives a positive picture of the products.



#### Checking the number of unique users and items in the dataset

In [1]:
# Number of unique user id and product id in the data
print('Number of unique USERS in Raw data = ', df['user_id'].nunique())
print('Number of unique ITEMS in Raw data = ', df['prod_id'].nunique())

- There are **4,201,696 users and 476,002 products** in the dataset

#### Users with most number of ratings

In [1]:
# Top 10 users based on the number of ratings given
most_rated = df.groupby('user_id').size().sort_values(ascending=False)[:10]
most_rated

- The most active user has given 520 ratings, which is far fewer than the number of products in the data. Every user therefore has many products they have not interacted with, and we can build a recommendation system to suggest those products.

### Data preparation

**Let's take a subset of the dataset (by only keeping the users who have given 50 or more ratings) to make the dataset less sparse and easy to work with.**

In [1]:
counts = df['user_id'].value_counts()
df_final = df[df['user_id'].isin(counts[counts >= 50].index)]

In [1]:
print('The number of observations in the final data =', len(df_final))
print('Number of unique USERS in the final data = ', df_final['user_id'].nunique())
print('Number of unique PRODUCTS in the final data = ', df_final['prod_id'].nunique())

- The dataframe **df_final has users who have rated 50 or more items**
- **We will use df_final to build recommendation systems**

#### Checking the density of the rating matrix

In [1]:
#Creating the interaction matrix of products and users based on ratings and replacing NaN value with 0
final_ratings_matrix = df_final.pivot(index = 'user_id', columns ='prod_id', values = 'rating').fillna(0)
print('Shape of final_ratings_matrix: ', final_ratings_matrix.shape)

#Finding the number of non-zero entries in the interaction matrix 
given_num_of_ratings = np.count_nonzero(final_ratings_matrix)
print('given_num_of_ratings = ', given_num_of_ratings)

#Finding the possible number of ratings as per the number of users and products
possible_num_of_ratings = final_ratings_matrix.shape[0] * final_ratings_matrix.shape[1]
print('possible_num_of_ratings = ', possible_num_of_ratings)

#Density of ratings
density = (given_num_of_ratings/possible_num_of_ratings)
density *= 100
print ('density: {:4.2f}%'.format(density))

final_ratings_matrix.head()

- Even after restricting the data to users with 50 or more ratings, the current number of ratings is just **0.17%** of the possible number of ratings. This implies that the data is **highly sparse**.
- We will build recommendation systems to recommend products to users with which they have not interacted.
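
The density calculation above can also be sketched without building the pivoted matrix, by counting rows in the long-format data. This is a minimal sketch on a made-up toy dataframe (`toy`, `density_pct` and the values are illustrative, not taken from the dataset), and it assumes each user rates a product at most once and that no rating equals 0.

```python
import pandas as pd

# Toy ratings in the same long format as df_final (values are made up)
toy = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2', 'u3'],
    'prod_id': ['A', 'B', 'A', 'C'],
    'rating':  [5.0, 3.0, 4.0, 1.0],
})

# Each row is one observed rating, so no pivot is needed
n_users = toy['user_id'].nunique()
n_products = toy['prod_id'].nunique()
n_ratings = len(toy)

# Share of the user x product grid that is filled, in percent
density_pct = 100 * n_ratings / (n_users * n_products)
print('density: {:4.2f}%'.format(density_pct))
```

Here 4 ratings fill a 3 x 3 grid, so roughly 44% of the cells are observed. The same arithmetic applied to `df_final` should agree with the pivot-based figure under the assumptions stated above.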

Now that we have explored and preprocessed the data, let's build the first recommendation system

### Rank Based Recommendation System

In [1]:
#Calculate the average rating for each product
#Select the 'rating' column before aggregating so that the non-numeric user_id column is not averaged
average_rating = df_final.groupby('prod_id')['rating'].mean()
print(average_rating.head())
#Calculate the count of ratings for each product
count_rating = df_final.groupby('prod_id')['rating'].count()

#Create a dataframe with calculated average and count of ratings
final_rating = pd.concat([average_rating, count_rating], axis=1)
final_rating.columns = ["Average Rating", "Ratings Count"]

#Sort the dataframe by average of ratings
final_rating = final_rating.sort_values(by='Average Rating', ascending=False)

final_rating.head()

In [1]:
#defining a function to get the top n products based on highest average rating and minimum interactions
def top_n_products(final_rating, n, min_interaction):
    
    #Finding products with minimum number of interactions
    recommendations = final_rating[final_rating['Ratings Count'] >= min_interaction]
    
    #Sorting values w.r.t average rating 
    recommendations = recommendations.sort_values(by='Average Rating', ascending=False)
    
    return recommendations.index[:n]

#### Recommending top 5 products with 50 minimum interactions based on popularity

In [1]:
list(top_n_products(final_rating, 5, 50))

#### Recommending top 5 products with 100 minimum interactions based on popularity

In [1]:
list(top_n_products(final_rating, 5, 100))
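
To see why the minimum-interaction threshold matters, here is a small sketch of the same ranking logic on a made-up toy dataframe. The names `toy`, `stats` and `top_products` and all values are illustrative assumptions, not part of the notebook's data.

```python
import pandas as pd

# Toy ratings in the same long format as df_final (values are made up)
toy = pd.DataFrame({
    'user_id': ['u1', 'u2', 'u3', 'u1', 'u2', 'u1'],
    'prod_id': ['A', 'A', 'A', 'B', 'B', 'C'],
    'rating':  [4.0, 5.0, 3.0, 5.0, 4.0, 5.0],
})

# Average rating and number of ratings per product in one pass
stats = toy.groupby('prod_id')['rating'].agg(['mean', 'count'])

# Keep products with at least 2 ratings, then rank by average rating
top = stats[stats['count'] >= 2].sort_values('mean', ascending=False)
top_products = list(top.index)
print(top_products)
```

Product C has the highest average (5.0) but only a single rating, so the threshold filters it out and B ranks ahead of A.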

We have recommended the top 5 products using a popularity-based recommendation system. Now, let's build a recommendation system using collaborative filtering.

### Collaborative Filtering based Recommendation System

In [1]:
final_ratings_matrix.head()

**Here, user_id (index) is of the object data type. We will replace the user_id by numbers starting from 0 to 1539 (for all user ids) so that the index is of integer type and represents a user id in the same format**

In [1]:
final_ratings_matrix['user_index'] = np.arange(0, final_ratings_matrix.shape[0])
final_ratings_matrix.set_index(['user_index'], inplace=True)

# Actual ratings given by users
final_ratings_matrix.head()
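
Replacing the index discards the original user ids, so it can help to save a lookup first. The following is a minimal sketch on a toy matrix; `toy_matrix`, `index_to_user` and `user_to_index` are hypothetical names, and the ids are made up.

```python
import pandas as pd

# Toy interaction matrix indexed by made-up user ids
toy_matrix = pd.DataFrame(
    [[5.0, 0.0], [0.0, 3.0], [4.0, 4.0]],
    index=pd.Index(['A1X', 'B2Y', 'C3Z'], name='user_id'),
    columns=['P1', 'P2'],
)

# Save the lookup in both directions before the ids are replaced
index_to_user = dict(enumerate(toy_matrix.index))
user_to_index = {uid: i for i, uid in index_to_user.items()}

# Switch to a positional integer index, as in the cell above
toy_matrix = toy_matrix.reset_index(drop=True)
toy_matrix.index.name = 'user_index'

print(index_to_user[1], user_to_index['C3Z'])
```

With such a mapping, a recommendation produced for user index 1 can be traced back to the original id, and vice versa.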

Now, let's define a **function to get similar users** for a particular user

In [1]:
# defining a function to get similar users
def similar_users(user_index, interactions_matrix):
    similarity = []
    for user in range(0, interactions_matrix.shape[0]):

        #Skipping the target user so that only other users are returned
        if user == user_index:
            continue

        #finding cosine similarity between the user_index and each user
        #cosine_similarity returns a 1x1 array here, so [0][0] extracts the scalar score
        sim = cosine_similarity([interactions_matrix.loc[user_index]], [interactions_matrix.loc[user]])[0][0]

        #Appending the user and the corresponding similarity score as a tuple
        similarity.append((user, sim))

    #Sorting by similarity score in descending order
    similarity.sort(key=lambda x: x[1], reverse=True)
    most_similar_users = [tup[0] for tup in similarity] #Extract the user from each tuple in the sorted list
    similarity_score = [tup[1] for tup in similarity]   #Extract the similarity score from each tuple in the sorted list

    return most_similar_users, similarity_score

#### Finding out top 10 similar users to the user index 3 and their similarity score

In [1]:
similar = similar_users(3, final_ratings_matrix)[0][0:10]
similar

In [1]:
#Print the similarity score
similar_users(3,final_ratings_matrix)[1][0:10]

#### Finding out top 10 similar users to the user index 1521 and their similarity score

In [1]:
similar = similar_users(1521, final_ratings_matrix)[0][0:10]
similar

In [1]:
#Print the similarity score
similar_users(1521, final_ratings_matrix)[1][0:10]
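
The loop above calls `cosine_similarity` once per user. As a possible alternative, the same scores can be computed in a single call by comparing the target row against the whole matrix. This is a sketch only: `similar_users_vectorized` and `toy_matrix` are hypothetical names, the ratings are made up, and it assumes the matrix uses the integer index 0..n-1 set earlier.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy interaction matrix with an integer user index (values are made up)
toy_matrix = pd.DataFrame(
    [[5.0, 4.0, 0.0],
     [4.0, 5.0, 0.0],
     [0.0, 0.0, 3.0],
     [5.0, 0.0, 1.0]],
    columns=['P1', 'P2', 'P3'],
)

def similar_users_vectorized(user_index, interactions_matrix):
    # Compare the target row against every row at once: result has shape (1, n_users)
    scores = cosine_similarity(interactions_matrix.loc[[user_index]], interactions_matrix)[0]
    # Positions sorted by descending similarity
    order = np.argsort(-scores, kind='stable')
    # Drop the target user (assumes index labels equal row positions)
    order = order[order != user_index]
    return order.tolist(), scores[order].tolist()

users, scores = similar_users_vectorized(0, toy_matrix)
print(users, scores)
```

For user 0, user 1 is the closest match (cosine similarity 40/41), followed by user 3, while user 2 shares no rated products and scores 0.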

We have found similar users for a given user. Now, let's create **a function to recommend products** to the user using the ratings given by similar users.

In [1]:
# defining the recommendations function to get recommendations by using the similar users' preferences
def recommendations(user_index, num_of_products, interactions_matrix):
    
    #Saving similar users using the function similar_users defined above
    most_similar_users = similar_users(user_index, interactions_matrix)[0]
    
    #Finding product IDs with which the user_id has interacted
    prod_ids = set(list(interactions_matrix.columns[np.where(interactions_matrix.loc[user_index] > 0)]))
    recommendations = []
    
    observed_interactions = prod_ids.copy()
    for similar_user in most_similar_users:
        if len(recommendations) < num_of_products:
            
            #Finding 'n' products which have been rated by similar users but not by the user_id
            similar_user_prod_ids = set(list(interactions_matrix.columns[np.where(interactions_matrix.loc[similar_user] > 0)]))
            recommendations.extend(list(similar_user_prod_ids.difference(observed_interactions)))
            observed_interactions = observed_interactions.union(similar_user_prod_ids)
        else:
            break
    
    return recommendations[:num_of_products]

#### Recommend 5 products to user index 3 based on similarity based collaborative filtering

In [1]:
recommendations(3, 5, final_ratings_matrix)

#### Recommend 5 products to user index 1521 based on similarity based collaborative filtering

In [1]:
recommendations(1521, 5, final_ratings_matrix)

We have applied two techniques to recommend products to users. Now, let's build one more recommendation system using matrix factorization (SVD).

### Model based Collaborative Filtering: Singular Value Decomposition

**We have seen above that the interaction matrix is highly sparse. A truncated SVD suits this setting because it summarizes the matrix with a small number of latent features. Note that for sparse matrices, we can use the scipy.sparse.linalg.svds() function, which computes only the k largest singular values and vectors**

Also, we will use **k=50 latent features** to predict rating of products

In [1]:
from scipy.sparse.linalg import svds # for sparse matrices

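# Singular Value Decomposition
# Pass a float NumPy array, since svds expects an ndarray, sparse matrix, or LinearOperator
U, s, Vt = svds(final_ratings_matrix.to_numpy(dtype=float), k = 50) # here k is the number of latent features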

# Construct diagonal array in SVD
sigma = np.diag(s)

In [1]:
U.shape #checking the shape of the U matrix

In [1]:
sigma.shape #checking the shape of the sigma matrix

In [1]:
Vt.shape #checking the shape of the Vt matrix

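Now, let's reconstruct an approximation of the original matrix using the U, Sigma, and Vt matrices. Because only 50 latent features are kept, the result is not identical to the original matrix; its entries serve as the predicted ratings for all users and products.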

In [1]:
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) 

# Predicted ratings
preds_df = pd.DataFrame(abs(all_user_predicted_ratings), columns = final_ratings_matrix.columns)
preds_df.head()
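
To make the effect of truncation concrete, here is a small sketch on a random toy matrix (not the ratings data). It keeps k = 2 singular values and checks that the reconstruction error, measured in the Frobenius norm, matches the singular values that were left out. The names `toy`, `approx` and `expected_error` are illustrative.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Random dense toy matrix standing in for the interaction matrix
rng = np.random.default_rng(0)
toy = rng.random((6, 5))

# Truncated SVD with k latent features (k must be smaller than min(toy.shape))
k = 2
U, s, Vt = svds(toy, k=k)
print(U.shape, s.shape, Vt.shape)

# Rank-k reconstruction, analogous to all_user_predicted_ratings
approx = U @ np.diag(s) @ Vt

# The Frobenius error of the rank-k approximation equals the
# root of the sum of squares of the discarded singular values
full_s = np.linalg.svd(toy, compute_uv=False)
error = np.linalg.norm(toy - approx)
expected_error = np.sqrt(np.sum(full_s[k:] ** 2))
print(error, expected_error)
```

A larger k lowers this error but keeps more of the noise in the observed ratings, so k is a tuning choice rather than a fixed value.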

We have the prediction of ratings but we need to create a **function to recommend products** to the users on the basis of predicted ratings for each product

In [1]:
# Recommend the items with the highest predicted ratings

def recommend_items(user_index, interactions_matrix, preds_df, num_recommendations):
    
    # Get and sort the user's ratings from the actual and predicted interaction matrix
    sorted_user_ratings = interactions_matrix.loc[user_index].sort_values(ascending=False)
    sorted_user_predictions = preds_df.loc[user_index].sort_values(ascending=False)

    #Creating a dataframe with actual and predicted ratings columns
    temp = pd.concat([sorted_user_ratings, sorted_user_predictions], axis=1)
    temp.index.name = 'Recommended Products'
    temp.columns = ['user_ratings', 'user_predictions']
    
    #Filtering the dataframe where actual ratings are 0 which implies that the user has not interacted with that product
    temp = temp.loc[temp['user_ratings'] == 0]   
    
    #Recommending products with top predicted ratings
    temp = temp.sort_values(by='user_predictions', ascending=False) #Sort the dataframe by user_predictions in descending order
    print('\nBelow are the recommended products for user(user_id = {}):\n'.format(user_index))
    print(temp['user_predictions'].head(num_recommendations))

**Recommending top 5 products to user index 121**

In [1]:
#Enter 'user index' and 'num_recommendations' for the user
recommend_items(121, final_ratings_matrix, preds_df, 5)

**Recommending top 5 products to user index 465**

In [1]:
#Enter 'user_index' and 'num_recommendations' for the user #
recommend_items(465, final_ratings_matrix, preds_df, 5)
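
`recommend_items` prints its result, which makes it awkward to reuse. One possible variant returns the top predictions as a pandas Series instead. This is a sketch under assumptions: `top_unrated`, `toy_actual` and `toy_preds` are hypothetical names with made-up values, and a rating of 0 is taken to mean "not rated", as in the notebook.

```python
import pandas as pd

def top_unrated(user_index, interactions_matrix, preds_df, num_recommendations):
    # Actual and predicted ratings for the chosen user
    user_ratings = interactions_matrix.loc[user_index]
    user_predictions = preds_df.loc[user_index]
    # Products the user has not rated (0 marks a missing rating)
    unrated = user_ratings[user_ratings == 0].index
    # Highest predicted ratings among the unrated products
    return user_predictions.loc[unrated].sort_values(ascending=False).head(num_recommendations)

# Toy data for a single user (values are made up)
toy_actual = pd.DataFrame([[5.0, 0.0, 0.0, 2.0]], columns=['P1', 'P2', 'P3', 'P4'])
toy_preds = pd.DataFrame([[4.8, 1.2, 3.5, 2.1]], columns=['P1', 'P2', 'P3', 'P4'])

result = top_unrated(0, toy_actual, toy_preds, 2)
print(result)
```

P1 has the highest prediction but is already rated, so only P3 and P2 are returned, in that order.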

### Evaluate the model

#### Evaluation of the Model based Collaborative Filtering (SVD)

In [1]:
# Actual ratings given by the users
final_ratings_matrix.head()

In [1]:
# Find average actual rating for each item
final_ratings_matrix.mean()

In [1]:
# Predicted ratings 
preds_df.head()

In [1]:
# Find average predicted rating for each item
preds_df.mean()

In [1]:
#create a dataframe containing average actual ratings and average predicted ratings for each product
rmse_df = pd.concat([final_ratings_matrix.mean(), preds_df.mean()], axis=1)

rmse_df.columns = ['Avg_actual_ratings', 'Avg_predicted_ratings']

rmse_df.head()

In [1]:
#Calculate and print RMSE as the square root of mean_squared_error
RMSE = np.sqrt(mean_squared_error(rmse_df['Avg_actual_ratings'], rmse_df['Avg_predicted_ratings']))
print('\nRMSE SVD Model = {} \n'.format(RMSE))
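
The RMSE above compares per-product averages, and those averages include the zeros that stand for missing ratings. A possible complementary check is to compute the error only on entries that were actually rated. The following is a sketch of that idea on made-up arrays; `actual`, `predicted` and `rmse_observed` are illustrative names, and this is not the evaluation used in the notebook.

```python
import numpy as np

# Toy actual and predicted rating matrices (0 marks a missing rating)
actual = np.array([[5.0, 0.0],
                   [0.0, 3.0]])
predicted = np.array([[4.0, 1.0],
                      [2.0, 3.0]])

# Boolean mask of the entries that were actually rated
observed = actual > 0

# RMSE restricted to the observed entries
rmse_observed = np.sqrt(np.mean((actual[observed] - predicted[observed]) ** 2))
print(rmse_observed)
```

Only the two observed entries contribute here (errors of 1 and 0), so the predictions for unrated cells do not affect the score.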

### Recommendation

#### Recommend top 10 products to the user index 100

In [1]:
# Enter 'user_index' and 'num_recommendations' for the user #
recommend_items(100, final_ratings_matrix, preds_df, 10)

### Conclusion

- Explored the data, which is highly sparse, and prepared it for the recommendation task.
- Implemented a rank-based recommendation system on the subset of users who have given 50 or more ratings.
- Built a collaborative filtering based recommendation system by finding similar users and using their ratings.
- Extended collaborative filtering with a model-based approach using Singular Value Decomposition, evaluated it with RMSE, and generated recommendations.