# **Project: Amazon Product Recommendation System**

# **Marks: 40**


Welcome to the project on Recommendation Systems. We will work with the Amazon product reviews dataset for this project. The dataset contains ratings of different electronic products. It does not include information about the products or reviews to avoid bias while building the model.

--------------
## **Context:**
--------------

Today, information is growing exponentially with volume, velocity and variety throughout the globe. This has lead to information overload, and too many choices for the consumer of any business. It represents a real dilemma for these consumers and they often turn to denial. Recommender Systems are one of the best tools that help recommending products to consumers while they are browsing online. Providing personalized recommendations which is most relevant for the user is what's most likely to keep them engaged and help business.

E-commerce websites like Amazon, Walmart, Target and Etsy use different recommendation models to provide personalized suggestions to different users. These companies spend millions of dollars to come up with algorithmic techniques that can provide personalized recommendations to their users.

Amazon, for example, is well-known for its accurate selection of recommendations in its online site. Amazon's recommendation system is capable of intelligently analyzing and predicting customers' shopping preferences in order to offer them a list of recommended products. Amazon's recommendation algorithm is therefore a key element in using AI to improve the personalization of its website. For example, one of the baseline recommendation models that Amazon uses is item-to-item collaborative filtering, which scales to massive data sets and produces high-quality recommendations in real-time.

----------------
## **Objective:**
----------------

You are a Data Science Manager at Amazon, and have been given the task of building a recommendation system to recommend products to customers based on their previous ratings for other products. You have a collection of labeled data of Amazon reviews of products. The goal is to extract meaningful insights from the data and build a recommendation system that helps in recommending products to online consumers.

-----------------------------
## **Dataset:**
-----------------------------

The Amazon dataset contains the following attributes:

- **userId:** Every user identified with a unique id
- **productId:** Every product identified with a unique id
- **Rating:** The rating of the corresponding product by the corresponding user
- **timestamp:** Time of the rating. We **will not use this column** to solve the current problem

**Note:** The code has some user defined functions that will be usefull while making recommendations and measure model performance, you can use these functions or can create your own functions.

Sometimes, the installation of the surprise library, which is used to build recommendation systems, faces issues in Jupyter. To avoid any issues, it is advised to use **Google Colab** for this project.

Let's start by mounting the Google drive on Colab.

In [4]:
from google.colab import drive
drive.mount('/content/drive')

ValueError: mount failed

**Installing surprise library**

In [None]:
#______________________
!pip install scikit-surprise

## **Importing the necessary libraries and overview of the dataset**

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from collections import defaultdict
#from surprise import Dataset, SVD
#from surprise.model_selection import GridSearchCV

### **Loading the data**
- Import the Dataset
- Add column names ['user_id', 'prod_id', 'rating', 'timestamp']
- Drop the column timestamp
- Copy the data to another DataFrame called **df**

In [None]:
dataFilePath = '/content/ratings_Electronics.csv'
df = pd.read_csv(dataFilePath)

df.columns = ['user_id', 'prod_id', 'rating', 'timestamp']
df = df.drop(columns=['timestamp'])
df2 = df.copy()

df.head()

**As this dataset is very large and has 7,824,482 observations, it is not computationally possible to build a model using this. Moreover, many users have only rated a few products and also some products are rated by very few users. Hence, we can reduce the dataset by considering certain logical assumptions.**

Here, we will be taking users who have given at least 50 ratings, and the products that have at least 5 ratings, as when we shop online we prefer to have some number of ratings of a product.

In [None]:
# Get the column containing the users
users = df.user_id

# Create a dictionary from users to their number of ratings
ratings_count = dict()

for user in users:

    # If we already have the user, just add 1 to their rating count
    if user in ratings_count:
        ratings_count[user] += 1

    # Otherwise, set their rating count to 1
    else:
        ratings_count[user] = 1

In [None]:
# We want our users to have at least 50 ratings to be considered
RATINGS_CUTOFF = 50

remove_users = []

for user, num_ratings in ratings_count.items():
    if num_ratings < RATINGS_CUTOFF:
        remove_users.append(user)

df = df.loc[ ~ df.user_id.isin(remove_users)]

In [None]:
# Get the column containing the products
prods = df.prod_id

# Create a dictionary from products to their number of ratings
ratings_count = dict()

for prod in prods:

    # If we already have the product, just add 1 to its rating count
    if prod in ratings_count:
        ratings_count[prod] += 1

    # Otherwise, set their rating count to 1
    else:
        ratings_count[prod] = 1

In [None]:
# We want our item to have at least 5 ratings to be considered
RATINGS_CUTOFF = 5

remove_users = []

for user, num_ratings in ratings_count.items():
    if num_ratings < RATINGS_CUTOFF:
        remove_users.append(user)

df_final = df.loc[~ df.prod_id.isin(remove_users)]

In [None]:
# Print a few rows of the imported dataset
df_final.head()

## **Exploratory Data Analysis**

### **Shape of the data**

### **Check the number of rows and columns and provide observations.**

In [None]:
# Check the number of rows and columns and provide observations
df_final.shape

**Write your observations here: There are 65,290 rows representing ratings and 3 columns of information for each one including the user_id, prod_id, and the rating.**

### **Data types**

In [None]:
# Check Data types and provide observations
#__________
df_final.dtypes

**Write your observations here: The user_id and prod_id hold objects that are alphanumeric strings and the rating holds float64 type data that are numerals.**

### **Checking for missing values**

In [None]:
# Check for missing values present and provide observations
#___________
df_final.isnull().sum()

**Write your observations here: There are no null values for the user_id, prod_id, and rating so all data is filled.**

### **Summary Statistics**

In [None]:
# Summary statistics of 'rating' variable and provide observations
df_final['rating'].describe()

**Write your observations here: The average of the ratings are about 4.3, which is quite high and there is not a significant amount of variation within the data. **

### **Checking the rating distribution**

In [3]:
# Create the bar plot and provide observations
df_final = df_final[df_final['rating'].between(1,5)]

plt.figure(figsize=(10,10))
sns.countplot(df_final['rating'])
plt.title("Distribution of Ratings")
plt.xlabel("Rating")
plt.ylabel("Count")
plt.show()

NameError: name 'df_final' is not defined

**Write your observations here:________**

### **Checking the number of unique users and items in the dataset**

In [15]:
# Number of total rows in the data and number of unique user id and product id in the data
rowTotals = df_final.shape[0]
uniqueUsersData = df_final["user_id"].nunique()
uniqueProductsData = df_final["prod_id"].nunique()

print(f"Total Number of Rows: {rowTotals}")
print(f"Number of Unique Users: {uniqueUsersData}")
print(f"Number of Unique Products {uniqueProductsData}")


Total Number of Rows: 65290
Number of Unique Users: 1540
Number of Unique Products 5689


**Write your observations here: There are reasonable numbers of unique users and unique products in the dataset that can help in developing the recommendations for users.**

### **Users with the most number of ratings**

In [16]:
# Top 10 users based on the number of ratings
topTenUsers = df_final["user_id"].value_counts().head(10)
print(topTenUsers)

user_id
ADLVFFE4VBT8      295
A3OXHLG6DIBRW8    230
A1ODOGXEYECQQ8    217
A36K2N527TXXJN    212
A25C2M3QF9G7OQ    203
A680RUE1FDO8B     196
A1UQBFCERIP7VJ    193
A22CW0ZHY3NJH8    193
AWPODHOB4GFWL     184
AGVWTYW0ULXHT     179
Name: count, dtype: int64


**Write your observations here: The top users in the dataset have hundreds of ratings which can be helpful in training the recommendation model.**

**Now that we have explored and prepared the data, let's build the first recommendation system.**

## **Model 1: Rank Based Recommendation System**

In [17]:
# Calculate the average rating for each product
averageProductRating = df_final.groupby("prod_id")["rating"].mean()
# Calculate the count of ratings for each product
countRatingPerProduct = df_final.groupby("prod_id")["rating"].count()
# Create a dataframe with calculated average and count of ratings
dfRating = pd.DataFrame({"average_rating": averageProductRating, "rating_count": countRatingPerProduct})
# Sort the dataframe by average of ratings in the descending order
dfRating = dfRating.sort_values(by = "average_rating", ascending = False)

# See the first five records of the "final_rating" dataset
dfRating.head()

Unnamed: 0_level_0,average_rating,rating_count
prod_id,Unnamed: 1_level_1,Unnamed: 2_level_1
B00LGQ6HL8,5.0,5
B003DZJQQI,5.0,14
B005FDXF2C,5.0,7
B00I6CVPVC,5.0,7
B00B9KOCYA,5.0,8


In [18]:
# Defining a function to get the top n products based on the highest average rating and minimum interactions
def getTopNProducts(topN, minInteraction):
  # Finding products with minimum number of interactions
  topProducts = dfRating[dfRating["rating_count"] >= minInteraction]

  # Sorting values with respect to average rating
  topProducts = topProducts.sort_values(by = "average_rating", ascending = False)

  return topProducts.head(topN)


### **Recommending top 5 products with 50 minimum interactions based on popularity**

In [19]:
topFiveWithFiftyInteractions = getTopNProducts(5, 50)
print(topFiveWithFiftyInteractions)

            average_rating  rating_count
prod_id                                 
B001TH7GUU        4.871795            78
B003ES5ZUU        4.864130           184
B0019EHU8G        4.855556            90
B006W8U2MU        4.824561            57
B000QUUFRW        4.809524            84


### **Recommending top 5 products with 100 minimum interactions based on popularity**

In [20]:
topFiveWithHundredInteractions = getTopNProducts(5, 100)
print(topFiveWithHundredInteractions)

            average_rating  rating_count
prod_id                                 
B003ES5ZUU        4.864130           184
B000N99BBC        4.772455           167
B002WE6D44        4.770000           100
B007WTAJTO        4.701220           164
B002V88HFE        4.698113           106


We have recommended the **top 5** products by using the popularity recommendation system. Now, let's build a recommendation system using **collaborative filtering.**

## **Model 2: Collaborative Filtering Recommendation System**

### **Building a baseline user-user similarity based recommendation system**

- Below, we are building **similarity-based recommendation systems** using `cosine` similarity and using **KNN to find similar users** which are the nearest neighbor to the given user.  
- We will be using a new library, called `surprise`, to build the remaining models. Let's first import the necessary classes and functions from this library.

In [21]:
# To compute the accuracy of models
from surprise import accuracy

# Class is used to parse a file containing ratings, data should be in structure - user ; item ; rating
from surprise.reader import Reader

# Class for loading datasets
from surprise.dataset import Dataset

# For tuning model hyperparameters
from surprise.model_selection import GridSearchCV

# For splitting the rating data in train and test datasets
from surprise.model_selection import train_test_split

# For implementing similarity-based recommendation system
from surprise.prediction_algorithms.knns import KNNBasic

# For implementing matrix factorization based recommendation system
from surprise.prediction_algorithms.matrix_factorization import SVD

# for implementing K-Fold cross-validation
from surprise.model_selection import KFold

# For implementing clustering-based recommendation system
from surprise import CoClustering

**Before building the recommendation systems, let's  go over some basic terminologies we are going to use:**

**Relevant item:** An item (product in this case) that is actually **rated higher than the threshold rating** is relevant, if the **actual rating is below the threshold then it is a non-relevant item**.  

**Recommended item:** An item that's **predicted rating is higher than the threshold is a recommended item**, if the **predicted rating is below the threshold then that product will not be recommended to the user**.  


**False Negative (FN):** It is the **frequency of relevant items that are not recommended to the user**. If the relevant items are not recommended to the user, then the user might not buy the product/item. This would result in the **loss of opportunity for the service provider**, which they would like to minimize.

**False Positive (FP):** It is the **frequency of recommended items that are actually not relevant**. In this case, the recommendation system is not doing a good job of finding and recommending the relevant items to the user. This would result in **loss of resources for the service provider**, which they would also like to minimize.

**Recall:** It is the **fraction of actually relevant items that are recommended to the user**, i.e., if out of 10 relevant products, 6 are recommended to the user then recall is 0.60. Higher the value of recall better is the model. It is one of the metrics to do the performance assessment of classification models.

**Precision:** It is the **fraction of recommended items that are relevant actually**, i.e., if out of 10 recommended items, 6 are found relevant by the user then precision is 0.60. The higher the value of precision better is the model. It is one of the metrics to do the performance assessment of classification models.

**While making a recommendation system, it becomes customary to look at the performance of the model. In terms of how many recommendations are relevant and vice-versa, below are some most used performance metrics used in the assessment of recommendation systems.**

### **Precision@k, Recall@ k, and F1-score@k**

**Precision@k** - It is the **fraction of recommended items that are relevant in `top k` predictions**. The value of k is the number of recommendations to be provided to the user. One can choose a variable number of recommendations to be given to a unique user.  


**Recall@k** - It is the **fraction of relevant items that are recommended to the user in `top k` predictions**.

**F1-score@k** - It is the **harmonic mean of Precision@k and Recall@k**. When **precision@k and recall@k both seem to be important** then it is useful to use this metric because it is representative of both of them.

### **Some useful functions**

- Below function takes the **recommendation model** as input and gives the **precision@k, recall@k, and F1-score@k** for that model.  
- To compute **precision and recall**, **top k** predictions are taken under consideration for each user.
- We will use the precision and recall to compute the F1-score.

In [22]:
def precision_recall_at_k(model, k = 10, threshold = 3.5):
    """Return precision and recall at k metrics for each user"""

    # First map the predictions to each user
    user_est_true = defaultdict(list)

    # Making predictions on the test data
    predictions = model.test(testset)

    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key = lambda x: x[0], reverse = True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[:k])

        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. Therefore, we are setting Precision to 0 when n_rec_k is 0

        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0

        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. Therefore, we are setting Recall to 0 when n_rel is 0

        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0

    # Mean of all the predicted precisions are calculated.
    precision = round((sum(prec for prec in precisions.values()) / len(precisions)), 3)

    # Mean of all the predicted recalls are calculated.
    recall = round((sum(rec for rec in recalls.values()) / len(recalls)), 3)

    accuracy.rmse(predictions)

    print('Precision: ', precision) # Command to print the overall precision

    print('Recall: ', recall) # Command to print the overall recall

    print('F_1 score: ', round((2*precision*recall)/(precision+recall), 3)) # Formula to compute the F-1 score

**Hints:**

- To compute **precision and recall**, a **threshold of 3.5 and k value of 10 can be considered for the recommended and relevant ratings**.
- Think about the performance metric to choose.

Below we are loading the **`rating` dataset**, which is a **pandas DataFrame**, into a **different format called `surprise.dataset.DatasetAutoFolds`**, which is required by this library. To do this, we will be **using the classes `Reader` and `Dataset`.**

In [23]:
# Instantiating Reader scale with expected rating scale
reader = Reader(rating_scale = (1, 5))
# Loading the rating dataset
ratingData = Dataset.load_from_df(df_final[["user_id", "prod_id", "rating"]], reader)
# Splitting the data into train and test datasets
trainset, testset = train_test_split(ratingData, test_size = 0.2, random_state = 1)

Now, we are **ready to build the first baseline similarity-based recommendation system** using the cosine similarity.

### **Building the user-user Similarity-based Recommendation System**

In [24]:
# Declaring the similarity options
similarityOptions = {
    'name': 'cosine',
    'user_based': True
}

# Initialize the KNNBasic model using sim_options declared, Verbose = False, and setting random_state = 1
KNNModel = KNNBasic(sim_options = similarityOptions, verbose = False, random_state = 1)

# Fit the model on the training data
KNNModel.fit(trainset)

# Let us compute precision@k, recall@k, and f_1 score using the precision_recall_at_k function defined above
precision_recall_at_k(KNNModel, k = 10, threshold = 3.5)

RMSE: 1.0260
Precision:  0.844
Recall:  0.862
F_1 score:  0.853


**Write your observations here: According to the results, the model has a high precision and recall so it performs well. Nonetheless there is room for improvement as the RSME value represents some error.**

Let's now **predict rating for a user with `userId=A3LDPF5FMB782Z` and `productId=1400501466`** as shown below. Here the user has already interacted or watched the product with productId '1400501466' and given a rating of 5.

In [25]:
# Predicting rating for a sample user with an interacted product
userID = 'A3LDPF5FMB782Z'
productID = '1400501466'

predictUserRating = KNNModel.predict(userID, productID)

print(f'{userID} is predicted to give {productID} a rating of {predictUserRating.est}')

A3LDPF5FMB782Z is predicted to give 1400501466 a rating of 3.3333333333333335


**Write your observations here: According to the model, the user appears to somewhat have medicore satisfaction with the product.**

Below is the **list of users who have not seen the product with product id "1400501466"**.

In [26]:
# Find unique user_id where prod_id is not equal to "1400501466"
nonInteracted = df_final[df_final["prod_id"] != "1400501466"]["user_id"].unique()
dfNonInteracted = pd.DataFrame(nonInteracted, columns = ["user_id"])

print(dfNonInteracted)

             user_id
0     A2ZR3YTMEEIIZ4
1     A3CLWR1UUZT6TG
2      A5JLAU2ARJ0BO
3     A1P4XD7IORSEFN
4     A341HCMGNZCBIT
...              ...
1535  A1X3ESYZ79H59E
1536  A328S9RN3U5M68
1537  A215WH6RUDUCMP
1538  A38C12950IM24P
1539  A2J4XMWKR8PPD0

[1540 rows x 1 columns]


* It can be observed from the above list that **user "A34BZM6S9L7QI4" has not seen the product with productId "1400501466"** as this userId is a part of the above list.

**Below we are predicting rating for `userId=A34BZM6S9L7QI4` and `prod_id=1400501466`.**

In [27]:
# Predicting rating for a sample user with a non interacted product
userID = 'A34BZM6S9L7QI4'
productID = '1400501466'

predictUserRating = KNNModel.predict(userID, productID)

print(f'{userID} is predicted to give {productID} a rating of {predictUserRating.est}')

A34BZM6S9L7QI4 is predicted to give 1400501466 a rating of 1.991150442477876


**Write your observations here: According to the model, the user is predicted to be particularly unsatisfied with the product. **

### **Improving similarity-based recommendation system by tuning its hyperparameters**

Below, we will be tuning hyperparameters for the `KNNBasic` algorithm. Let's try to understand some of the hyperparameters of the KNNBasic algorithm:

- **k** (int) – The (max) number of neighbors to take into account for aggregation. Default is 40.
- **min_k** (int) – The minimum number of neighbors to take into account for aggregation. If there are not enough neighbors, the prediction is set to the global mean of all ratings. Default is 1.
- **sim_options** (dict) – A dictionary of options for the similarity measure. And there are four similarity measures available in surprise -
    - cosine
    - msd (default)
    - Pearson
    - Pearson baseline

In [28]:
# Setting up parameter grid to tune the hyperparameters
parameterGrid = {
    'k': [10, 20, 30, 40, 50],
    'min_k': [1, 5, 10],
    'similarityOptions': {
        'name': ['cosine', 'msd', 'pearson', ' pearson_baseline'],
        'user_based': [True],
    }

}

# Performing 3-fold cross-validation to tune the hyperparameters
searchGrid = GridSearchCV(KNNBasic, parameterGrid, measures = ['rmse', 'mae'], cv = 3, n_jobs=-1)

# Fitting the data
searchGrid.fit(ratingData)

# Best RMSE score
bestRSMEScore = searchGrid.best_score['rmse']

# Combination of parameters that gave the best RMSE score
bestParameters = searchGrid.best_params['rmse']

print(f"Best RSME Score: {bestRSMEScore}")
print(f"Best Parameter Combination: {bestParameters}")


Best RSME Score: 0.9701555299862622
Best Parameter Combination: {'k': 50, 'min_k': 5, 'similarityOptions': 'name'}


Once the grid search is **complete**, we can get the **optimal values for each of those hyperparameters**.

Now, let's build the **final model by using tuned values of the hyperparameters**, which we received by using **grid search cross-validation**.

In [29]:
# Using the optimal similarity measure for user-user based collaborative filtering
#bestParam = searchGrid.best_params['knn']

# Creating an instance of KNNBasic with optimal hyperparameter values
optKNNBasic = KNNBasic(k = bestParameters['k'], min_k = bestParameters['min_k'], sim_options = similarityOptions, random_state = 1)

# Training the algorithm on the trainset
optKNNBasic.fit(trainset)

# Let us compute precision@k and recall@k also with k =10
precision_recall_at_k(optKNNBasic, k = 10)

Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.9740
Precision:  0.836
Recall:  0.895
F_1 score:  0.864


**Write your observations here: The results show that the model has a high precision and recall and the F1 score tells us that there is a desirable balaence in determining false positives/negatives.**

### **Steps:**
- **Predict rating for the user with `userId="A3LDPF5FMB782Z"`, and `prod_id= "1400501466"` using the optimized model**
- **Predict rating for `userId="A34BZM6S9L7QI4"` who has not interacted with `prod_id ="1400501466"`, by using the optimized model**
- **Compare the output with the output from the baseline model**

In [30]:
# Use sim_user_user_optimized model to recommend for userId "A3LDPF5FMB782Z" and productId 1400501466
optPred = optKNNBasic.predict("A3LDPF5FMB782Z", "1400501466")

print(f'User A3LDPF5FMB782Z is predicted to give product 1400501466 a rating of {optPred.est}')

User A3LDPF5FMB782Z is predicted to give product 1400501466 a rating of 3.3333333333333335


In [31]:
# Use sim_user_user_optimized model to recommend for userId "A34BZM6S9L7QI4" and productId "1400501466"
optPred2 = optKNNBasic.predict("A34BZM6S9L7QI4", "1400501466")

print(f'User A34BZM6S9L7QI4 is predicted to give product 1400501466 a rating of {optPred2.est}')

User A34BZM6S9L7QI4 is predicted to give product 1400501466 a rating of 4.296427477408486


**Write your observations here:** User A34BZM6S9L7QI4 is expected to be more satisfied with the product according to the model's predicted rating.

### **Identifying similar users to a given user (nearest neighbors)**

We can also find out **similar users to a given user** or its **nearest neighbors** based on this KNNBasic algorithm. Below, we are finding the 5 most similar users to the first user in the list with internal id 0, based on the `msd` distance metric.

In [32]:
# 0 is the inner id of the above user
similarUsers = optKNNBasic.get_neighbors(0, k = 5)

similarUsersRaw = [trainset.to_raw_uid(innerUserId) for innerUserId in similarUsers]

print(f"Five most similar users:")

for userIDRaw in similarUsersRaw:
  print(userIDRaw)

Five most similar users:
A3NEAETOSXDBOM
A225G2TFM76GYX
AOWF9T81XMX2S
AR18DH5SL9F73
A39137LW12KK7B


### **Implementing the recommendation algorithm based on optimized KNNBasic model**

Below we will be implementing a function where the input parameters are:

- data: A **rating** dataset
- user_id: A user id **against which we want the recommendations**
- top_n: The **number of products we want to recommend**
- algo: the algorithm we want to use **for predicting the ratings**
- The output of the function is a **set of top_n items** recommended for the given user_id based on the given algorithm

In [33]:
def get_recommendations(data, user_id, top_n, algo):

    # Creating an empty list to store the recommended product ids
    recommendations = []

    # Creating an user item interactions matrix
    user_item_interactions_matrix = data.pivot(index = 'user_id', columns = 'prod_id', values = 'rating')

    # Extracting those product ids which the user_id has not interacted yet
    non_interacted_products = user_item_interactions_matrix.loc[user_id][user_item_interactions_matrix.loc[user_id].isnull()].index.tolist()

    # Looping through each of the product ids which user_id has not interacted yet
    for item_id in non_interacted_products:

        # Predicting the ratings for those non interacted product ids by this user
        est = algo.predict(user_id, item_id).est

        # Appending the predicted ratings
        recommendations.append((item_id, est))

    # Sorting the predicted ratings in descending order
    recommendations.sort(key = lambda x: x[1], reverse = True)

    return recommendations[:top_n] # Returing top n highest predicted rating products for this user

**Predicting top 5 products for userId = "A3LDPF5FMB782Z" with similarity based recommendation system**

In [34]:
# Making top 5 recommendations for user_id "A3LDPF5FMB782Z" with a similarity-based recommendation engine
topFiveRec = get_recommendations(df_final, "A3LDPF5FMB782Z", top_n = 5, algo = optKNNBasic)

In [35]:
# Building the dataframe for above recommendations with columns "prod_id" and "predicted_ratings"
dfTopFiveRec = pd.DataFrame(topFiveRec, columns = ["prod_id", "predicted_ratings"])

print(dfTopFiveRec)

      prod_id  predicted_ratings
0  B00005LEN4                  5
1  B000067RT6                  5
2  B0000X0VCY                  5
3  B000ENUCR4                  5
4  B000TXEE14                  5


### **Item-Item Similarity-based Collaborative Filtering Recommendation System**

* Above we have seen **similarity-based collaborative filtering** where similarity is calculated **between users**. Now let us look into similarity-based collaborative filtering where similarity is seen **between items**.

In [36]:
# Declaring the similarity options
similarityOptions2 = {
    'name': 'cosine',
    'user_based': False

}
# KNN algorithm is used to find desired similar items. Use random_state=1
itemRec = KNNBasic(sim_options = similarityOptions2, random_state = 1)

# Train the algorithm on the trainset, and predict ratings for the test set
itemRec.fit(trainset)
predRatings = itemRec.test(testset)

# Let us compute precision@k, recall@k, and f_1 score with k = 10
precision_recall_at_k(itemRec, k =10, threshold = 3.5)


Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 1.0147
Precision:  0.826
Recall:  0.853
F_1 score:  0.839


**Write your observations here:**The model likely makes good recommendations as it has high precision and recall, but it could still use matrix factorization for further improvements.

Let's now **predict a rating for a user with `userId = A3LDPF5FMB782Z` and `prod_Id = 1400501466`** as shown below. Here the user has already interacted or watched the product with productId "1400501466".

In [37]:
# Predicting rating for a sample user with an interacted product
predictRating = optKNNBasic.predict("A3LDPF5FMB782Z", "1400501466")
print(f'User A3LDPF5FMB782Z is predicted to give product 1400501466 a rating of {predictRating.est}')

User A3LDPF5FMB782Z is predicted to give product 1400501466 a rating of 3.3333333333333335


**Write your observations here:** The user is predicted to be somewhat satisfied with the product according to the model.

Below we are **predicting rating for the `userId = A34BZM6S9L7QI4` and `prod_id = 1400501466`**.

In [38]:
# Predicting rating for a sample user with a non interacted product
predictRating2 = optKNNBasic.predict("A34BZM6S9L7QI4", "1400501466")
print(f'User A34BZM6S9L7QI4 is predicted to give product 1400501466 a rating of {predictRating2.est}')

User A34BZM6S9L7QI4 is predicted to give product 1400501466 a rating of 4.296427477408486


**Write your observations here:** The user is predicted to be satisfied with the product according to the model.

### **Hyperparameter tuning the item-item similarity-based model**
- Use the following values for the param_grid and tune the model.
  - 'k': [10, 20, 30]
  - 'min_k': [3, 6, 9]
  - 'sim_options': {'name': ['msd', 'cosine']
  - 'user_based': [False]
- Use GridSearchCV() to tune the model using the 'rmse' measure
- Print the best score and best parameters

In [39]:
# Setting up parameter grid to tune the hyperparameters
pGrid = {
    'k': [10, 20],
    'min_k': [2, 4],
    'similarityOptions': {
        'name': ['msd', 'cosine'],
        'user_based': [False]
    }
}
# Performing 3-fold cross validation to tune the hyperparameters
search = GridSearchCV(KNNBasic, pGrid, measures = ['rmse'], cv = 3)

# Fitting the data
search.fit(ratingData)

# Find the best RMSE score
topRSME = search.best_score['rmse']

# Find the combination of parameters that gave the best RMSE score
topParameters = search.best_params['rmse']

print(f"Best RSME Score: {topRSME}")
print(f"Best Parameter Combination: {topParameters}")

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computi

Once the **grid search** is complete, we can get the **optimal values for each of those hyperparameters as shown above.**

Now let's build the **final model** by using **tuned values of the hyperparameters** which we received by using grid search cross-validation.

### **Use the best parameters from GridSearchCV to build the optimized item-item similarity-based model. Compare the performance of the optimized model with the baseline model.**

In [40]:
# Using the optimal similarity measure for item-item based collaborative filtering
best_params = search.best_params['rmse']
#print(search.best_params)

# Creating an instance of KNNBasic with optimal hyperparameter values
optVals = KNNBasic(
    random_state = 1,
    k = best_params['k'],
    min_k = best_params['min_k'],
    similarityOptions = best_params['similarityOptions'],
)

# Training the algorithm on the trainset
trainset, testset = train_test_split(ratingData, test_size = 0.25)
optVals.fit(trainset)

# Let us compute precision@k and recall@k, f1_score and RMSE
pred = optVals.test(testset)

precision_recall_at_k(optVals, k = 10, threshold = 3.5)




Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9631
Precision:  0.85
Recall:  0.852
F_1 score:  0.851


**Write your observations here: This optimization proves to be more effective as the precision, recall, and balence is high and the RMSE is low.**

### **Steps:**
- **Predict rating for the user with `userId="A3LDPF5FMB782Z"`, and `prod_id= "1400501466"` using the optimized model**
- **Predict rating for `userId="A34BZM6S9L7QI4"` who has not interacted with `prod_id ="1400501466"`, by using the optimized model**
- **Compare the output with the output from the baseline model**

In [41]:
# Use sim_item_item_optimized model to recommend for userId "A3LDPF5FMB782Z" and productId "1400501466"
optPred = optVals.predict(trainset.to_inner_uid("A3LDPF5FMB782Z"), trainset.to_inner_iid("1400501466"))
print(f'User A3LDPF5FMB782Z is predicted to give product 1400501466 a rating of {optPred.est}')

User A3LDPF5FMB782Z is predicted to give product 1400501466 a rating of 4.293585475932771


In [42]:
# Use sim_item_item_optimized model to recommend for userId "A34BZM6S9L7QI4" and productId "1400501466"
optPred2 = optVals.predict(trainset.to_inner_uid("A34BZM6S9L7QI4"), trainset.to_inner_iid("1400501466"))
print(f'User A34BZM6S9L7QI4 is predicted to give product 1400501466 a rating of {optPred2.est}')

User A34BZM6S9L7QI4 is predicted to give product 1400501466 a rating of 4.293585475932771


**Write your observations here: These results show a correlation between item-item recommendations as both users are expected to rate the product the same. The model can still however be improved with further calibration.**

### **Identifying similar items to a given item (nearest neighbors)**

We can also find out **similar items** to a given item or its nearest neighbors based on this **KNNBasic algorithm**. Below we are finding the 5 most similar items to the item with internal id 0 based on the `msd` distance metric.

In [43]:
alikeItems = optVals.get_neighbors(0, k = 5)
alikeItemsID = [trainset.to_raw_iid(x) for x in alikeItems]

print(f"Five most alike items:")

for x in alikeItemsID:
  print(x)

Five most alike items:
B003WUBIZQ
B00IODWU6W
B002NO7PWC
B005DIH1IS
B00275XTSQ


**Predicting top 5 products for userId = "A1A5KUIIIHFF4U" with similarity based recommendation system.**

**Hint:** Use the get_recommendations() function.

In [44]:
# Making top 5 recommendations for user_id A1A5KUIIIHFF4U with similarity-based recommendation engine.
rawRatingData = pd.DataFrame(ratingData.raw_ratings, columns = ["user_id", "prod_id", "rating", "timestamp"])

optTop5 = get_recommendations(rawRatingData, "A1A5KUIIIHFF4U", top_n = 5, algo = optVals)

for x, y in optTop5:
  print(f"Product: {x} => Predicted Rating: ")

Product: B000067RT6 => Predicted Rating: 
Product: B0000BZL1P => Predicted Rating: 
Product: B000BQ7GW8 => Predicted Rating: 
Product: B001TH7GSW => Predicted Rating: 
Product: B001TH7GUU => Predicted Rating: 


In [45]:
# Building the dataframe for above recommendations with columns "prod_id" and "predicted_ratings"
dfOptTop5 = pd.DataFrame(optTop5, columns = ['prod_id', 'predicted_ratings'])

print(dfOptTop5)

      prod_id  predicted_ratings
0  B000067RT6                  5
1  B0000BZL1P                  5
2  B000BQ7GW8                  5
3  B001TH7GSW                  5
4  B001TH7GUU                  5


Now as we have seen **similarity-based collaborative filtering algorithms**, let us now get into **model-based collaborative filtering algorithms**.

### **Model 3: Model-Based Collaborative Filtering - Matrix Factorization**

Model-based Collaborative Filtering is a **personalized recommendation system**, the recommendations are based on the past behavior of the user and it is not dependent on any additional information. We use **latent features** to find recommendations for each user.

### Singular Value Decomposition (SVD)

SVD is used to **compute the latent features** from the **user-item matrix**. But SVD does not work when we **miss values** in the **user-item matrix**.

In [46]:
# Using SVD matrix factorization. Use random_state = 1
svd = SVD(random_state = 1)

# Training the algorithm on the trainset
svd.fit(trainset)
pred = svd.test(testset)

# Use the function precision_recall_at_k to compute precision@k, recall@k, F1-Score, and RMSE
precision_recall_at_k(svd, k = 10, threshold = 3.5)

RMSE: 0.9032
Precision:  0.85
Recall:  0.843
F_1 score:  0.846


**Write your observations here: This model shows improvement as the precision and recall are still high, but the RMSE is reduced making the results of the recommendation more accurate.**

**Let's now predict the rating for a user with `userId = "A3LDPF5FMB782Z"` and `prod_id = "1400501466`.**

In [47]:
# Making prediction
optPred = svd.predict(trainset.to_inner_uid("A3LDPF5FMB782Z"), trainset.to_inner_iid("1400501466"))
print(f'User A3LDPF5FMB782Z is predicted to give product 1400501466 a rating of {optPred.est}')

User A3LDPF5FMB782Z is predicted to give product 1400501466 a rating of 4.293585475932771


**Write your observations here: This optimized model appears to be reaching an asymptotic expected rating of about 4.3 **

**Below we are predicting rating for the `userId = "A34BZM6S9L7QI4"` and `productId = "1400501466"`.**

In [48]:
# Making prediction
optPred2 = svd.predict(trainset.to_inner_uid("A34BZM6S9L7QI4"), trainset.to_inner_iid("1400501466"))
print(f'User A34BZM6S9L7QI4 is predicted to give product 1400501466 a rating of {optPred2.est}')

User A34BZM6S9L7QI4 is predicted to give product 1400501466 a rating of 4.293585475932771


**Write your observations here: The model appears to be producing the same result for the other user aswell, which is reinforcing the item-item recommendation algorithm.**

### **Improving Matrix Factorization based recommendation system by tuning its hyperparameters**

Below we will be tuning only three hyperparameters:
- **n_epochs**: The number of iterations of the SGD algorithm.
- **lr_all**: The learning rate for all parameters.
- **reg_all**: The regularization term for all parameters.

In [49]:
# Set the parameter space to tune
pGrid2 = {
    'n_epochs': [10, 20],
    'lr_all': [0.002, 0.005],
    'reg_all': [0.4, 0.8]
}
# Performing 3-fold gridsearch cross-validation
search = GridSearchCV(SVD, pGrid2, measures = ['rmse'], cv = 3)

# Fitting data
search.fit(ratingData)

# Best RMSE score
topRSME = search.best_score['rmse']

# Combination of parameters that gave the best RMSE score
topPar = search.best_params['rmse']

print(f"Best RSME Score: {topRSME}")
print(f"Best Parameter Combination: {topPar}")


Best RSME Score: 0.9016577566799294
Best Parameter Combination: {'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.4}


Now, we will **the build final model** by using **tuned values** of the hyperparameters, which we received using grid search cross-validation above.

In [50]:
# Build the optimized SVD model using optimal hyperparameter search. Use random_state=1
optPar = search.best_params['rmse']

optSVD = SVD(
    random_state = 1,
    lr_all = optPar['lr_all'],
    n_epochs = optPar['n_epochs'],
    reg_all = optPar['reg_all']
)

# Train the algorithm on the trainset
optSVD.fit(trainset)

pred = optSVD.test(testset)

# Use the function precision_recall_at_k to compute precision@k, recall@k, F1-Score, and RMSE
precision_recall_at_k(optSVD, k = 10, threshold = 3.5)


RMSE: 0.8978
Precision:  0.852
Recall:  0.857
F_1 score:  0.854


**Write your observations here: The precision, recall, F1 score, and RMSE didn't change significantly in this optimization, so the fine tuning is reaching a relative optimum.**

### **Steps:**
- **Predict rating for the user with `userId="A3LDPF5FMB782Z"`, and `prod_id= "1400501466"` using the optimized model**
- **Predict rating for `userId="A34BZM6S9L7QI4"` who has not interacted with `prod_id ="1400501466"`, by using the optimized model**
- **Compare the output with the output from the baseline model**

In [51]:
# Use svd_algo_optimized model to recommend for userId "A3LDPF5FMB782Z" and productId "1400501466"
optPred = optSVD.predict(trainset.to_inner_uid("A3LDPF5FMB782Z"), trainset.to_inner_iid("1400501466"))
print(f'User A3LDPF5FMB782Z is predicted to give product 1400501466 a rating of {optPred.est}')

User A3LDPF5FMB782Z is predicted to give product 1400501466 a rating of 4.293585475932771


In [52]:
# Use svd_algo_optimized model to recommend for userId "A34BZM6S9L7QI4" and productId "1400501466"
optPred2 = optSVD.predict(trainset.to_inner_uid("A34BZM6S9L7QI4"), trainset.to_inner_iid("1400501466"))
print(f'User A34BZM6S9L7QI4 is predicted to give product 1400501466 a rating of {optPred2.est}')

User A34BZM6S9L7QI4 is predicted to give product 1400501466 a rating of 4.293585475932771


### **Conclusion and Recommendations**

**Write your conclusion and recommendations here**

In conclusion, after testing the different recommendation algorithms, we found that after fine tuning the SVD model it showed noteable accuracy in generating user recommendations through user-item interaction as represented by the higher precision, recall, and F1 scores and the lower RMSE score.

I would recommend that in an effort to detect underlying factors that can help make model recommendations more accurate and tailroed to the user, it would be a good idea to experiment with specific product attributes, user demographics, product quality, pricing, and more. This way, we can further train and calibrate the model to consider more aspects around predicitng a user's product rating.

