# **Project: Amazon Product Recommendation System**

# **Marks: 40**


Welcome to the project on Recommendation Systems. We will work with the Amazon product reviews dataset for this project. The dataset contains ratings of different electronic products. It does not include information about the products or reviews to avoid bias while building the model.

--------------
## **Context:**
--------------

Today, information is growing exponentially in volume, velocity, and variety across the globe. This has led to information overload and too many choices for the consumers of any business. It represents a real dilemma for these consumers, and they often turn to denial. Recommender systems are one of the best tools for recommending products to consumers while they browse online. Providing personalized recommendations that are most relevant to the user is what is most likely to keep them engaged and help the business.

E-commerce websites like Amazon, Walmart, Target and Etsy use different recommendation models to provide personalized suggestions to different users. These companies spend millions of dollars to come up with algorithmic techniques that can provide personalized recommendations to their users.

Amazon, for example, is well known for the accuracy of the recommendations on its website. Amazon's recommendation system is capable of intelligently analyzing and predicting customers' shopping preferences in order to offer them a list of recommended products. Amazon's recommendation algorithm is therefore a key element in using AI to improve the personalization of its website. For example, one of the baseline recommendation models that Amazon uses is item-to-item collaborative filtering, which scales to massive data sets and produces high-quality recommendations in real time.

----------------
## **Objective:**
----------------

You are a Data Science Manager at Amazon, and have been given the task of building a recommendation system to recommend products to customers based on their previous ratings for other products. You have a collection of labeled data of Amazon reviews of products. The goal is to extract meaningful insights from the data and build a recommendation system that helps in recommending products to online consumers.

-----------------------------
## **Dataset:**
-----------------------------

The Amazon dataset contains the following attributes:

- **userId:** Every user identified with a unique id
- **productId:** Every product identified with a unique id
- **Rating:** The rating of the corresponding product by the corresponding user
- **timestamp:** Time of the rating. We **will not use this column** to solve the current problem

**Note:** The code has some user-defined functions that will be useful for making recommendations and measuring model performance. You can use these functions or create your own.

Sometimes, the installation of the surprise library, which is used to build recommendation systems, faces issues in Jupyter. To avoid any issues, it is advised to use **Google Colab** for this project.

Let's start by mounting the Google drive on Colab.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**Installing surprise library**

In [4]:
!pip install surprise

Collecting surprise
  Downloading surprise-0.1-py2.py3-none-any.whl (1.8 kB)
Collecting scikit-surprise (from surprise)
  Downloading scikit-surprise-1.1.3.tar.gz (771 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.3-cp310-cp310-linux_x86_64.whl size=3163001 sha256=cc7cb5062b73e08b6437207d312e855c37e4dd2a26a554880275dada32ace3f6
  Stored in directory: /root/.cache/pip/wheels/a5/ca/a8/4e28def53797fdc4363ca4af740db15a9c2f1595ebc51fb445
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.3 surprise-0.1


In [5]:
!pip install scikit-surprise



## **Importing the necessary libraries and overview of the dataset**

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')
from surprise.prediction_algorithms.matrix_factorization import SVD
from surprise.dataset import DatasetAutoFolds
from surprise import KNNBasic,Reader
from collections import defaultdict
from surprise.model_selection import KFold

### **Loading the data**
- Import the Dataset
- Add column names ['user_id', 'prod_id', 'rating', 'timestamp']
- Drop the column timestamp
- Copy the data to another DataFrame called **df**

In [7]:
colnames = ['user_id','prod_id','rating','timestamp']
original_data = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/ratings_Electronics.csv",names=colnames)

In [8]:
df = original_data.copy()

In [9]:
df.head()

Unnamed: 0,user_id,prod_id,rating,timestamp
0,AKM1MP6P0OYPR,132793040,5.0,1365811200
1,A2CX7LUOHB2NDG,321732944,5.0,1341100800
2,A2NWSAGRHCP8N5,439886341,1.0,1367193600
3,A2WNBOD3WNDNKT,439886341,3.0,1374451200
4,A1GI0U4ZRJA8WN,439886341,1.0,1334707200


In [10]:
df.shape

(7824482, 4)

**This dataset is very large, with 7,824,482 observations, so building a model on all of it would be computationally expensive. Moreover, many users have rated only a few products, and some products have been rated by very few users. Hence, we can reduce the dataset using some reasonable assumptions.**

Here, we keep only users who have given at least 50 ratings, and then only products that have at least 5 ratings, since when shopping online we prefer products that already have some number of ratings.

In [11]:
# Get the column containing the users
users = df.user_id

# Create a dictionary from users to their number of ratings
ratings_count = dict()

for user in users:

    # If we already have the user, just add 1 to their rating count
    if user in ratings_count:
        ratings_count[user] += 1

    # Otherwise, set their rating count to 1
    else:
        ratings_count[user] = 1

In [13]:
# We want our users to have at least 50 ratings to be considered
RATINGS_CUTOFF = 50

remove_users = []

for user, num_ratings in ratings_count.items():
    if num_ratings < RATINGS_CUTOFF:
        remove_users.append(user)

df = df.loc[ ~ df.user_id.isin(remove_users)]

In [14]:
# Get the column containing the products
prods = df.prod_id

# Create a dictionary from products to their number of ratings
ratings_count = dict()

for prod in prods:

    # If we already have the product, just add 1 to its rating count
    if prod in ratings_count:
        ratings_count[prod] += 1

    # Otherwise, set their rating count to 1
    else:
        ratings_count[prod] = 1

In [15]:
# We want our item to have at least 5 ratings to be considered
RATINGS_CUTOFF = 5

# Collect the products that fall below the cutoff
remove_prods = []

for prod, num_ratings in ratings_count.items():
    if num_ratings < RATINGS_CUTOFF:
        remove_prods.append(prod)

df_final = df.loc[~ df.prod_id.isin(remove_prods)]
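
The two filtering steps above can also be sketched compactly with `collections.Counter`. This is a minimal, dependency-free illustration on toy data; the function name `filter_ratings`, the toy rows, and the lowered cutoffs are all assumptions made for the example and are not part of the notebook.

```python
from collections import Counter

def filter_ratings(rows, min_user_ratings, min_item_ratings):
    """Keep rows from active users first, then rows for sufficiently rated items."""
    # rows: list of (user_id, prod_id, rating) tuples
    user_counts = Counter(user for user, _, _ in rows)
    rows = [r for r in rows if user_counts[r[0]] >= min_user_ratings]

    # Product counts are computed AFTER the user filter, as in the cells above
    item_counts = Counter(prod for _, prod, _ in rows)
    return [r for r in rows if item_counts[r[1]] >= min_item_ratings]

# Hypothetical toy data; cutoffs are lowered so the example stays small
toy = [("u1", "p1", 5.0), ("u1", "p2", 4.0), ("u1", "p3", 2.0),
       ("u2", "p1", 3.0), ("u2", "p2", 1.0), ("u3", "p1", 5.0)]
filtered = filter_ratings(toy, min_user_ratings=2, min_item_ratings=2)
print(filtered)
```

Because the product filter runs after the user filter, a user who passed the first cutoff can end up with fewer ratings in the final data. In the toy data, `u1` starts with 3 ratings and keeps only 2. The same effect explains why some users in `df_final` have fewer than 50 ratings.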

In [16]:
# Print a few rows of the imported dataset
df_final.head()

Unnamed: 0,user_id,prod_id,rating,timestamp
1310,A3LDPF5FMB782Z,1400501466,5.0,1336003200
1322,A1A5KUIIIHFF4U,1400501466,1.0,1332547200
1335,A2XIOXRRYX0KZY,1400501466,3.0,1371686400
1451,AW3LX47IHPFRL,1400501466,5.0,1339804800
1456,A1E3OB6QMBKRYZ,1400501466,1.0,1350086400


## **Exploratory Data Analysis**

### **Shape of the data**

### **Check the number of rows and columns and provide observations.**

In [17]:
# Check the number of rows and columns and provide observations
df_final.shape

(65290, 4)

**Write your observations here:______**

The dataset has **65290 rows** and **4 columns**.

### **Data types**

In [18]:
# Check Data types and provide observations
df_final.dtypes

user_id       object
prod_id       object
rating       float64
timestamp      int64
dtype: object

**Write your observations here:______**

There are **2 object** columns (**user_id, prod_id**), **1 float** column (**rating**), and **1 integer** column (**timestamp**).

### **Checking for missing values**

In [19]:
# Check for missing values present and provide observations
df_final.isnull().sum()

user_id      0
prod_id      0
rating       0
timestamp    0
dtype: int64

**Write your observations here:______**

There are **no null** values in the dataset.

### **Summary Statistics**

In [20]:
# Summary statistics of 'rating' variable and provide observations
df_final['rating'].describe()

count    65290.000000
mean         4.294808
std          0.988915
min          1.000000
25%          4.000000
50%          5.000000
75%          5.000000
max          5.000000
Name: rating, dtype: float64

**Write your observations here:______**

There are a total of **65290 ratings**.

The **standard deviation** is **0.99**.

**Mean** rating is **4.29**.

**Minimum** rating is **1.00**.

**Maximum** rating is **5.00**.

**25%** of the ratings are **at most 4.00** (first quartile).

The **median** rating is **5.00**, so **at least half** of the ratings are **equal to 5.00**, the maximum.

The **third quartile** is also **5.00**; since **5.00** is the maximum, **no rating is greater than 5.00**.

### **Checking the rating distribution**

In [21]:
# Create the bar plot and provide observations
rat_con = df_final['rating'].value_counts()
rating_dist = pd.DataFrame({'count' : rat_con}).reset_index()
rating_dist.columns = ['rating','count']
fig = px.bar(rating_dist,x="rating",y="count",title="Rating distribution")
fig.show()

**Write your observations here:________**

The **most popular** rating is **5.00** (36.32k) and the **least popular** rating is **1.00** (1852).

### **Checking the number of unique users and items in the dataset**

In [22]:
# Number of total rows in the data and number of unique user id and product id in the data
df_final.shape[0],df_final['user_id'].nunique(),df_final['prod_id'].nunique()

(65290, 1540, 5689)

**Write your observations here:_______**

The total **number of rows** in the data is **65290**. The total **number of unique user id** is **1540** and the total **number of unique product id** is **5689**.

### **Users with the most number of ratings**

In [23]:
# Top 10 users based on the number of ratings
df_final_counts = df_final.groupby(['user_id'])['rating'].count()
df_final_counts.sort_values(ascending=False).head(10)

user_id
ADLVFFE4VBT8      295
A3OXHLG6DIBRW8    230
A1ODOGXEYECQQ8    217
A36K2N527TXXJN    212
A25C2M3QF9G7OQ    203
A680RUE1FDO8B     196
A22CW0ZHY3NJH8    193
A1UQBFCERIP7VJ    193
AWPODHOB4GFWL     184
A3LGT6UZL99IW1    179
Name: rating, dtype: int64

**Write your observations here:_______**

The **most ratings** given by a user is **295** by **user_id** **'ADLVFFE4VBT8'**.

## Univariate Data Analysis:

### Categorical Columns:

In [77]:
cate_col = df_final.select_dtypes(include=['object']).columns
cate_col

Index(['user_id', 'prod_id'], dtype='object')

For **user_id**:

In [108]:
frequency_userid = df_final.groupby(['user_id']).size().reset_index(name="Count").rename(columns={'user_id' : 'User ID'})
frequency_userid['Count%'] = np.round(frequency_userid['Count']/sum(frequency_userid['Count'])*100,2)
frequency_userid

Unnamed: 0,User ID,Count,Count%
0,A100UD67AHFODS,53,0.08
1,A100WO06OQR8BQ,77,0.12
2,A105S56ODHGJEK,58,0.09
3,A105TOJ6LTVMBG,32,0.05
4,A10AFVU66A79Y1,47,0.07
...,...,...,...
1535,AZBXKUH4AIW3X,22,0.03
1536,AZCE11PSTCH1L,23,0.04
1537,AZMY6E8B52L2T,105,0.16
1538,AZNUHQSHZHSUE,30,0.05


**Top 5 users** who gave the **most number** of **ratings**:

In [109]:
freq_max = frequency_userid.sort_values(by="Count",ascending=False).head()
freq_max

Unnamed: 0,User ID,Count,Count%
1287,ADLVFFE4VBT8,295,0.45
1086,A3OXHLG6DIBRW8,230,0.35
264,A1ODOGXEYECQQ8,217,0.33
903,A36K2N527TXXJN,212,0.32
462,A25C2M3QF9G7OQ,203,0.31


In [110]:
hist = px.histogram(freq_max,x="User ID",y="Count",title="Top 5 users who gave the most number of ratings")
hist.show()

**Top 5 users** who gave the **least number** of **ratings**:

In [111]:
freq_min = frequency_userid.sort_values(by="Count",ascending=True).head()
freq_min

Unnamed: 0,User ID,Count,Count%
963,A3DL29NLZ7SXXG,1,0.0
1414,AP2NZAALUQKF5,1,0.0
1066,A3MV1KKHX51FYT,1,0.0
524,A2BGZ52M908MJY,2,0.0
72,A16CVJUQOB6GIB,2,0.0


In [94]:
hist = px.histogram(freq_min,x="User ID",y="Count",title="Top 5 users who gave the least number of ratings:")
hist.show()

For **prod_id**:

In [112]:
frequency_prodid = df_final.groupby(['prod_id']).size().reset_index(name="Count").rename(columns={'prod_id' : 'Product ID'})
frequency_prodid['Count%'] = np.round(frequency_prodid['Count']/sum(frequency_prodid['Count'])*100,2)
frequency_prodid

Unnamed: 0,Product ID,Count,Count%
0,1400501466,6,0.01
1,1400532655,6,0.01
2,1400599997,5,0.01
3,9983891212,8,0.01
4,B00000DM9W,5,0.01
...,...,...,...
5684,B00L21HC7A,16,0.02
5685,B00L2442H0,12,0.02
5686,B00L26YDA4,13,0.02
5687,B00L3YHF6O,14,0.02


**Top 5 products** with the **most number** of **ratings**:

In [113]:
freq_prod_max = frequency_prodid.sort_values(by="Count",ascending=False).head()
freq_prod_max

Unnamed: 0,Product ID,Count,Count%
4218,B0088CJT4U,206,0.32
2316,B003ES5ZUU,184,0.28
781,B000N99BBC,167,0.26
4126,B007WTAJTO,164,0.25
4180,B00829TIEK,149,0.23


In [114]:
fig = px.histogram(freq_prod_max,x="Product ID",y="Count",title="Top 5 Products that got the most number of ratings:")
fig.show()

**Top 5 products** with the **least number** of **ratings**:

In [115]:
freq_prod_min = frequency_prodid.sort_values(by="Count",ascending=True).head()
freq_prod_min

Unnamed: 0,Product ID,Count,Count%
5688,B00LGQ6HL8,5,0.01
993,B000WXSO76,5,0.01
3176,B004WMGT1G,5,0.01
991,B000WOT6O0,5,0.01
3181,B004WR125O,5,0.01


In [116]:
fig = px.histogram(freq_prod_min,x="Product ID",y="Count",title="Top 5 Products that got the least amount of ratings:")
fig.show()

## For Numerical variables:

In [119]:
num_cols = df_final.select_dtypes(include=['int64','float64']).columns
num_cols

Index(['rating', 'timestamp'], dtype='object')

For rating:

In [122]:
fig = px.box(df_final,x="rating",title="Boxplot for ratings")
fig.show()

According to the above **boxplot** we get to know that:

1. There are **2 outlier values** (**1.00** and **2.00**), both below the lower fence of **Q1 - 1.5 × IQR = 2.50**.

2. The **lower whisker** ends at **3.00**.
*(**Note**: The minimum is **1.00** in the **summary**, but the whisker stops at **3.00** in the **boxplot** since **1.00** and **2.00** are treated as **outliers** here.)*

3. **Maximum** rating is **5.00**.

4. The **first quartile (Q1)** is **4.00**. It means that **25%** of the ratings are **at most 4.00**. It is the **median of the lower half** of the dataset.

5. The **median (Q2)** is **5.00**. Since **5.00** is also the maximum, **at least half** of the ratings are **exactly 5.00**.

6. The **third quartile (Q3)** is **5.00**. It means that **75%** of the ratings are **at most 5.00**; no rating is greater than **5.00**. It is the **median of the upper half** of the dataset.

7. The **Interquartile Range (IQR)** is **1.00 (Q3-Q1)**.

8. Since the **left whisker** is **longer** than the **right whisker**, the **boxplot** is **negatively skewed (i.e., left-skewed)**.
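
The boxplot readings can be checked with a short calculation. The sketch below assumes the plot uses the conventional 1.5 × IQR rule for its whiskers and that ratings take the integer values 1 to 5; the quartiles are the ones reported in the summary statistics.

```python
# Quartiles taken from the summary statistics of `rating`
q1, q3 = 4.0, 5.0
iqr = q3 - q1

# Conventional rule (assumed here): points beyond 1.5 * IQR from the
# quartiles are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Assumption: every integer rating from 1 to 5 occurs in the data
possible_ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
outlier_values = [r for r in possible_ratings if r < lower_fence or r > upper_fence]

# The lower whisker ends at the smallest value that is still inside the fence
lower_whisker = min(r for r in possible_ratings if r >= lower_fence)

print(lower_fence, upper_fence)
print(outlier_values)
print(lower_whisker)
```

Under these assumptions the lower fence is 2.5, so ratings of 1 and 2 fall outside it and the whisker stops at 3, which matches the plot.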

**Now that we have explored and prepared the data, let's build the first recommendation system.**

## **Model 1: Rank Based Recommendation System**

In [24]:
# Calculate the average rating for each product
avg_rating = df_final.groupby('prod_id')['rating'].mean()

# Calculate the count of ratings for each product
rating_count = df_final.groupby('prod_id')['rating'].count()

# Create a dataframe with calculated average and count of ratings
final_rating = pd.DataFrame({'Average_Rating' : avg_rating, 'Rating_Count' : rating_count})
final_rating.reset_index(inplace=True)
final_rating.rename(columns={'index' : 'prod_id'},inplace=True)

# Sort the average ratings in descending order
# (this returns a new Series; final_rating itself is left unsorted)
final_rating['Average_Rating'].sort_values(ascending=False)

# Display the "final_rating" dataframe
final_rating

Unnamed: 0,prod_id,Average_Rating,Rating_Count
0,1400501466,3.333333,6
1,1400532655,3.833333,6
2,1400599997,4.000000,5
3,9983891212,4.875000,8
4,B00000DM9W,5.000000,5
...,...,...,...
5684,B00L21HC7A,4.625000,16
5685,B00L2442H0,4.916667,12
5686,B00L26YDA4,4.384615,13
5687,B00L3YHF6O,5.000000,14


In [25]:
# Defining a function to get the top n products based on the highest average rating and minimum interactions
def top_n_products(final_rating, n, min_interaction=100):

    # Finding products with minimum number of interactions
    recommendations = final_rating[final_rating['Rating_Count'] >= min_interaction]

    # Sorting values with respect to average rating
    recommendations = recommendations.sort_values(by='Average_Rating', ascending=False)

    return recommendations.index[:n]

### **Recommending top 5 products with 50 minimum interactions based on popularity**

In [26]:
list(top_n_products(final_rating,5,50))

[1594, 2316, 1227, 3877, 850]

### **Recommending top 5 products with 100 minimum interactions based on popularity**

In [27]:
list(top_n_products(final_rating,5,100))

[2316, 781, 2073, 4126, 2041]
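
The lists above contain row indices of `final_rating`, not product ids. As a rough sketch of the same ranking logic that returns ids directly, the plain-Python variant below works on a dictionary of statistics. The function name `top_n_product_ids` and the dictionary layout are assumptions for illustration; the sample values are a few rows copied from the `final_rating` table.

```python
def top_n_product_ids(stats, n, min_interaction=100):
    """Return the ids of the n best-rated products with enough ratings."""
    # stats maps prod_id -> (average_rating, rating_count)
    eligible = [(pid, avg) for pid, (avg, count) in stats.items()
                if count >= min_interaction]

    # Highest average first; ties keep their original order (sort is stable)
    eligible.sort(key=lambda item: item[1], reverse=True)
    return [pid for pid, _ in eligible[:n]]

# A few rows copied from the final_rating table
stats = {
    "1400501466": (3.333333, 6),
    "9983891212": (4.875000, 8),
    "B00000DM9W": (5.000000, 5),
    "B00L2442H0": (4.916667, 12),
    "B00L3YHF6O": (5.000000, 14),
}
top_ids = top_n_product_ids(stats, n=3, min_interaction=6)
print(top_ids)
```

Here `B00000DM9W` is excluded despite its average of 5.0 because it has only 5 ratings, which is the purpose of the minimum-interaction cutoff.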

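We have recommended the **top 5** products by using the popularity recommendation system. Note that the function returns **row indices** of `final_rating` rather than product ids; the corresponding ids can be looked up in the `prod_id` column of `final_rating` (for example, index **2316** is product **B003ES5ZUU**). Now, let's build a recommendation system using **collaborative filtering.**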

## **Model 2: Collaborative Filtering Recommendation System**

### **Building a baseline user-user similarity based recommendation system**

- Below, we are building **similarity-based recommendation systems** using `cosine` similarity and using **KNN to find similar users**, i.e., the nearest neighbors of the given user.  
- We will be using a new library, called `surprise`, to build the remaining models. Let's first import the necessary classes and functions from this library.

In [28]:
# To compute the accuracy of models
from surprise import accuracy

# Class is used to parse a file containing ratings, data should be in structure - user ; item ; rating
from surprise.reader import Reader

# Class for loading datasets
from surprise.dataset import Dataset

# For tuning model hyperparameters
from surprise.model_selection import GridSearchCV

# For splitting the rating data in train and test datasets
from surprise.model_selection import train_test_split

# For implementing similarity-based recommendation system
from surprise.prediction_algorithms.knns import KNNBasic

# For implementing matrix factorization based recommendation system
from surprise.prediction_algorithms.matrix_factorization import SVD

# for implementing K-Fold cross-validation
from surprise.model_selection import KFold

# For implementing clustering-based recommendation system
from surprise import CoClustering

**Before building the recommendation systems, let's  go over some basic terminologies we are going to use:**

**Relevant item:** An item (product in this case) whose actual rating is **at or above the threshold rating** is relevant; if the **actual rating is below the threshold, then it is a non-relevant item**.  

**Recommended item:** An item whose **predicted rating is at or above the threshold is a recommended item**; if the **predicted rating is below the threshold, then that product will not be recommended to the user**.  


**False Negative (FN):** It is the **frequency of relevant items that are not recommended to the user**. If the relevant items are not recommended to the user, then the user might not buy the product/item. This would result in the **loss of opportunity for the service provider**, which they would like to minimize.

**False Positive (FP):** It is the **frequency of recommended items that are actually not relevant**. In this case, the recommendation system is not doing a good job of finding and recommending the relevant items to the user. This would result in **loss of resources for the service provider**, which they would also like to minimize.

**Recall:** It is the **fraction of actually relevant items that are recommended to the user**, i.e., if out of 10 relevant products, 6 are recommended to the user, then recall is 0.60. The higher the recall, the better the model. It is one of the metrics used to assess the performance of classification models.

**Precision:** It is the **fraction of recommended items that are actually relevant**, i.e., if out of 10 recommended items, 6 are found relevant by the user, then precision is 0.60. The higher the precision, the better the model. It is one of the metrics used to assess the performance of classification models.

**When building a recommendation system, it is customary to evaluate the model in terms of how many recommendations are relevant and how many relevant items are recommended. Below are some of the most commonly used performance metrics for recommendation systems.**

### **Precision@k, Recall@ k, and F1-score@k**

**Precision@k** - It is the **fraction of recommended items that are relevant in `top k` predictions**. The value of k is the number of recommendations to be provided to the user. One can choose a variable number of recommendations to be given to a unique user.  


**Recall@k** - It is the **fraction of relevant items that are recommended to the user in `top k` predictions**.

**F1-score@k** - It is the **harmonic mean of Precision@k and Recall@k**. When **precision@k and recall@k both seem to be important** then it is useful to use this metric because it is representative of both of them.
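
To make these three definitions concrete, here is a small worked example for a single user. It is a sketch on hypothetical `(estimated, true)` rating pairs; the function name `precision_recall_f1_at_k` and the toy values are assumptions for illustration only.

```python
def precision_recall_f1_at_k(user_ratings, k=10, threshold=3.5):
    """user_ratings: list of (estimated, true) rating pairs for one user."""
    # Rank the user's items by estimated rating, best first
    ranked = sorted(user_ratings, key=lambda pair: pair[0], reverse=True)
    top_k = ranked[:k]

    # Relevant items overall, recommended items in the top k, and their overlap
    n_rel = sum(true >= threshold for _, true in ranked)
    n_rec_k = sum(est >= threshold for est, _ in top_k)
    n_rel_and_rec_k = sum(est >= threshold and true >= threshold
                          for est, true in top_k)

    # Undefined ratios are reported as 0
    precision = n_rel_and_rec_k / n_rec_k if n_rec_k else 0.0
    recall = n_rel_and_rec_k / n_rel if n_rel else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical (estimated, true) ratings for one user
toy = [(4.8, 5.0), (4.5, 3.0), (4.1, 4.0), (3.9, 5.0), (3.0, 4.0), (2.0, 1.0)]
precision, recall, f1 = precision_recall_f1_at_k(toy, k=3)
print(precision, recall, f1)
```

With `k=3`, all three top items are recommended but only two of them are truly relevant, so precision is 2/3. The user has four relevant items in total, so recall is 2/4 = 0.5, and the F1-score is 4/7.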

### **Some useful functions**

- The function below takes the **recommendation model** as input and prints the **RMSE, precision@k, recall@k, and F1-score@k** for that model.  
- To compute **precision and recall**, **top k** predictions are taken under consideration for each user.
- We will use the precision and recall to compute the F1-score.

In [29]:
def precision_recall_at_k(model, k = 10, threshold = 3.5):
    """Return precision and recall at k metrics for each user"""

    # First map the predictions to each user
    user_est_true = defaultdict(list)

    # Making predictions on the test data (relies on the global `testset` created by the train-test split)
    predictions = model.test(testset)

    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key = lambda x: x[0], reverse = True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[:k])

        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. Therefore, we are setting Precision to 0 when n_rec_k is 0

        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0

        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. Therefore, we are setting Recall to 0 when n_rel is 0

        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0

    # Mean of all the predicted precisions are calculated.
    precision = round((sum(prec for prec in precisions.values()) / len(precisions)), 3)

    # Mean of all the predicted recalls are calculated.
    recall = round((sum(rec for rec in recalls.values()) / len(recalls)), 3)

    accuracy.rmse(predictions)

    print('Precision: ', precision) # Command to print the overall precision

    print('Recall: ', recall) # Command to print the overall recall

    # F1 score: harmonic mean of precision and recall (set to 0 when both are 0)
    f1 = round((2 * precision * recall) / (precision + recall), 3) if (precision + recall) != 0 else 0

    print('F_1 score: ', f1)

**Hints:**

- To compute **precision and recall**, a **threshold of 3.5 and k value of 10 can be considered for the recommended and relevant ratings**.
- Think about the performance metric to choose.

Below we are loading the **`rating` dataset**, which is a **pandas DataFrame**, into a **different format called `surprise.dataset.DatasetAutoFolds`**, which is required by this library. To do this, we will be **using the classes `Reader` and `Dataset`.**

In [30]:
# Instantiating Reader scale with expected rating scale
reader = Reader(rating_scale=(1,5))

# Loading the rating dataset
data = Dataset.load_from_df(df_final[['user_id','prod_id','rating']], reader)

# Splitting the data into train and test datasets
trainset, testset = train_test_split(data, test_size = 0.3, random_state=42)

Now, we are **ready to build the first baseline similarity-based recommendation system** using the cosine similarity.

### **Building the user-user Similarity-based Recommendation System**

In [31]:
# Declaring the similarity options
sim_options = {'name' : 'cosine',
               'user_based' : True}


# Initialize the KNNBasic model using the sim_options declared above, with verbose = False
algo_knn_user = KNNBasic(sim_options=sim_options,verbose=False)

# Fit the model on the training data
algo_knn_user.fit(trainset)

# Let us compute precision@k, recall@k, and f_1 score using the precision_recall_at_k function defined above
precision_recall_at_k(algo_knn_user)

RMSE: 1.0250
Precision:  0.86
Recall:  0.783
F_1 score:  0.82


**Write your observations here:__________**

**RMSE** is **1.0250**.

**Precision** is **0.86**.

**Recall** is **0.783**.

**F1-Score** is **0.82**.

Let's now **predict rating for a user with `userId=A3LDPF5FMB782Z` and `productId=1400501466`** as shown below. Here the user has already interacted with the product with productId '1400501466' and given it a rating of 5.

In [32]:
# Predicting rating for a sample user with an interacted product
algo_knn_user.predict('A3LDPF5FMB782Z','1400501466',r_ui=5,verbose=True)

user: A3LDPF5FMB782Z item: 1400501466 r_ui = 5.00   est = 3.00   {'actual_k': 4, 'was_impossible': False}


Prediction(uid='A3LDPF5FMB782Z', iid='1400501466', r_ui=5, est=3.0, details={'actual_k': 4, 'was_impossible': False})

**Write your observations here:__________**

We observe that the **actual** rating for this **user-item pair** is **5.00** and the **predicted** rating is **3.00** by this **similarity based baseline model**, an **underestimate of 2 points**. The prediction was based on only **4 neighbors** (`actual_k = 4`).

Below is the **list of users who have not seen the product with product id "1400501466"**.

In [36]:
# Initialize an empty list to store users who have not seen the product
users_not_seen_product = []

# Iterate through all users in the dataset
for user_id in df_final['user_id'].unique():
    # Check if the user has not interacted with the specified product
    if '1400501466' not in df_final[df_final['user_id'] == user_id]['prod_id'].values:
        # Add the user to the list
        users_not_seen_product.append(user_id)

# Print the list of users who have not seen the product
print("Users who have not seen the product with product ID '1400501466':")
print(users_not_seen_product)

Users who have not seen the product with product ID '1400501466':
['A2ZR3YTMEEIIZ4', 'A3CLWR1UUZT6TG', 'A5JLAU2ARJ0BO', 'A1P4XD7IORSEFN', 'A341HCMGNZCBIT', 'A3HPCRD9RX351S', 'A1DQHS7MOVYYYA', 'ALUNVOQRXOZIA', 'A3G7BEJJCPD6DS', 'A2JXS1JII6SAUD', 'A1C82BC5GNABOA', 'A1VHCO8RQFIGQJ', 'A2Z9S2RQD542CP', 'A2QIC4G483SQQA', 'A3L6L5O89JTX2T', 'A1OGCPMSIVK7G4', 'A18HE80910BTZI', 'A3F9CBHV4OHFBS', 'A1T1YSCDW0PD25', 'ABVYGB2TKBO8F', 'A11ED8O95W2103', 'A3NCIN6TNL0MGA', 'ASHJAZC9OA9NS', 'A105TOJ6LTVMBG', 'A14JBDSWKPKTZA', 'A3QX0ERX4D03TF', 'A13WREJ05GMRA6', 'A3N8O68DOEQ2FE', 'A3J8A5L5AF5TX9', 'A2HRHF83I3NDGT', 'A1R3GN9MEJFXM3', 'A3963R7EPE3A7E', 'A2JOPUWVV0XQJ3', 'AAW7X3GRD8GY9', 'A3V8P0O224OBDB', 'AY6A8KPYCE6B0', 'A212MDP6K4VJS5', 'A28X0LT2100RL1', 'A1V3TRGWOMA8LC', 'A1NZLRAZJGD99W', 'A1522TN5FVJL0Y', 'A3UXW18DP4WSD6', 'A3CW0ZLUO5X2B1', 'A3TBMGNSEQBWIL', 'AEZJTA4KDIWY8', 'A22CW0ZHY3NJH8', 'A2V7EO331SFUF6', 'A3977M5S0GIG5H', 'A1F1A0QQP2XVH5', 'A231WM2Z2JL0U3', 'A2JWF9IG8PJAOA', 'A3LWC833HQIG7J', 'A38

In [39]:
def search_user(user_id, users_not_seen_product):
    if user_id in users_not_seen_product:
        print(f"User ID {user_id} found in the list.")
    else:
        print(f"User ID {user_id} not found in the list.")

search_user("A34BZM6S9L7QI4", users_not_seen_product)

User ID A34BZM6S9L7QI4 found in the list.


* It can be observed from the above list that **user "A34BZM6S9L7QI4" has not seen the product with productId "1400501466"** as this userId is a part of the above list.
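
The same membership check can be expressed with set operations, which avoids filtering the whole table once per user. This is a minimal sketch on hypothetical `(user_id, prod_id)` pairs; the function name `users_without_product` is an assumption for illustration.

```python
def users_without_product(rows, prod_id):
    """rows: iterable of (user_id, prod_id) pairs.

    Returns the set of users who never rated prod_id.
    """
    all_users = {user for user, _ in rows}
    raters = {user for user, prod in rows if prod == prod_id}
    return all_users - raters

# Hypothetical toy interactions
toy = [("u1", "p1"), ("u1", "p2"), ("u2", "p2"), ("u3", "p3")]
not_seen = users_without_product(toy, "p1")
print(sorted(not_seen))
print("u2" in not_seen)
```

Looking up a user in a set is also a constant-time operation on average, whereas searching a list scans it element by element.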

**Below we are predicting rating for `userId=A34BZM6S9L7QI4` and `prod_id=1400501466`.**

In [41]:
# Predicting rating for a sample user with a non interacted product
algo_knn_user.predict('A34BZM6S9L7QI4','1400501466',verbose=True)

user: A34BZM6S9L7QI4 item: 1400501466 r_ui = None   est = 4.29   {'was_impossible': True, 'reason': 'Not enough neighbors.'}


Prediction(uid='A34BZM6S9L7QI4', iid='1400501466', r_ui=None, est=4.291403190162572, details={'was_impossible': True, 'reason': 'Not enough neighbors.'})

**Write your observations here:__________**

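The **estimated** rating comes out to be **4.29**. However, the output shows `was_impossible: True` with the reason **'Not enough neighbors.'**, so this is not a neighborhood-based estimate: the model fell back to its **default prediction**, the **global mean** rating of the training set.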
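
The `was_impossible` flag in the output above can be understood with a simplified sketch of a similarity-weighted KNN prediction. This is not the library's implementation; the function name `knn_predict`, its arguments, and the toy neighbors are assumptions that only mimic the idea of averaging neighbor ratings and falling back to the global mean when there are too few neighbors.

```python
def knn_predict(neighbors, k=40, min_k=1, global_mean=0.0):
    """neighbors: list of (similarity, rating) pairs from users who rated the item."""
    # Keep the k most similar neighbors, ignoring non-positive similarities
    nearest = sorted(neighbors, key=lambda pair: pair[0], reverse=True)[:k]
    used = [(sim, r) for sim, r in nearest if sim > 0]

    if len(used) < min_k:
        # Not enough neighbors: fall back to the global mean rating
        return global_mean, {"was_impossible": True}

    # Similarity-weighted average of the neighbors' ratings
    est = sum(sim * r for sim, r in used) / sum(sim for sim, _ in used)
    return est, {"actual_k": len(used), "was_impossible": False}

# Two hypothetical neighbors who both rated the item 3.0
est_ok, details_ok = knn_predict([(0.9, 3.0), (0.8, 3.0)], global_mean=4.29)

# No neighbors at all: the estimate is just the global mean
est_fallback, details_fallback = knn_predict([], global_mean=4.29)

print(est_ok, details_ok)
print(est_fallback, details_fallback)
```

In the second call the estimate carries no information about the specific user or product, which is why a fallback value should not be read as a personalized prediction.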

### **Improving similarity-based recommendation system by tuning its hyperparameters**

Below, we will be tuning hyperparameters for the `KNNBasic` algorithm. Let's try to understand some of the hyperparameters of the KNNBasic algorithm:

- **k** (int) – The (max) number of neighbors to take into account for aggregation. Default is 40.
- **min_k** (int) – The minimum number of neighbors to take into account for aggregation. If there are not enough neighbors, the prediction is set to the global mean of all ratings. Default is 1.
- **sim_options** (dict) – A dictionary of options for the similarity measure. There are four similarity measures available in surprise:
    - `cosine`
    - `msd` (default)
    - `pearson`
    - `pearson_baseline`
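
To give a feel for how two of these measures differ, the sketch below computes a cosine similarity and an MSD-based similarity between two users over the products they both rated. It is a simplified illustration, not the library's code: the function names `cosine_sim` and `msd_sim`, the dictionary layout, and the toy ratings are assumptions, and the MSD similarity is taken as `1 / (msd + 1)`.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity over the items both users rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def msd_sim(a, b):
    """Similarity derived from the mean squared difference: 1 / (msd + 1)."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    msd = sum((a[i] - b[i]) ** 2 for i in common) / len(common)
    return 1.0 / (msd + 1.0)

# Hypothetical ratings: the users share products p1 and p2
u = {"p1": 5.0, "p2": 3.0, "p3": 4.0}
v = {"p1": 4.0, "p2": 1.0, "p4": 5.0}

cos_uv = cosine_sim(u, v)
msd_uv = msd_sim(u, v)
print(round(cos_uv, 4), round(msd_uv, 4))
```

Since all ratings are positive, cosine similarity tends to be high even when two users disagree by a point or two, while the MSD-based similarity drops quickly as the rating differences grow. This is one reason the two measures can select different neighbors.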

In [45]:
# Setting up parameter grid to tune the hyperparameters
param_grid = {'k' : [20,30,40], 'min_k' : [4,5,11],
              'sim_options' : {'name' : ['msd','cosine'],
                               'user_based' : [True]}}
# Performing 3-fold cross-validation to tune the hyperparameters
gs = GridSearchCV(KNNBasic, param_grid, measures=['rmse','mae'], cv=3, n_jobs=-1)

# Fitting the data
gs.fit(data)

# Best RMSE score
print(gs.best_score['rmse'])

# Combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

0.9681808002753934
{'k': 40, 'min_k': 4, 'sim_options': {'name': 'cosine', 'user_based': True}}


Once the grid search is **complete**, we can get the **optimal values for each of those hyperparameters**.

Now, let's build the **final model by using tuned values of the hyperparameters**, which we received by using **grid search cross-validation**.
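
For a sense of how much work the grid search above performs, the sketch below enumerates the hyperparameter combinations with `itertools.product`. The key `sim_name` is a flattened stand-in (an assumption for illustration) for the nested `sim_options` dictionary, whose `user_based` option has a single value and therefore does not add combinations.

```python
from itertools import product

# Flattened view of the grid used above; "sim_name" is a hypothetical key
param_grid = {"k": [20, 30, 40],
              "min_k": [4, 5, 11],
              "sim_name": ["msd", "cosine"]}

# Every combination of one value per hyperparameter
keys = list(param_grid)
combinations = [dict(zip(keys, values))
                for values in product(*param_grid.values())]

n_folds = 3
n_fits = len(combinations) * n_folds
print(len(combinations), n_fits)
print(combinations[0])
```

The grid has 3 × 3 × 2 = 18 combinations, and with 3-fold cross-validation each one is trained and evaluated three times, for 54 model fits in total. Adding more values to the grid multiplies this cost.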

In [46]:
# Using the optimal similarity measure for user-user based collaborative filtering
sim_options = {'name' : 'cosine',
               'user_based' : True}
# Creating an instance of KNNBasic with optimal hyperparameter values
similarity_algo_optimized = KNNBasic(sim_options=sim_options,k=40,min_k=4,verbose=False)

# Training the algorithm on the trainset
similarity_algo_optimized.fit(trainset)

# Let us compute precision@k and recall@k also with k =10
precision_recall_at_k(similarity_algo_optimized)

RMSE: 0.9614
Precision:  0.854
Recall:  0.808
F_1 score:  0.83


**Write your observations here:__________**

After **tuning hyperparameters**,

**RMSE** for the **test set** has **decreased** from **1.0250** to **0.9614**.

**Precision** for the **test set** has **decreased slightly** from **0.86** to **0.854**.

**Recall** for the **test set** has **increased** from **0.783** to **0.808**.

**F1-Score** for the **test set** has **increased slightly** from **0.82** to **0.83**.
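
The F1-Score is the harmonic mean of precision and recall. As a quick check, the sketch below recomputes it from the rounded precision and recall printed for the tuned model; `f1_score` is an illustrative helper, not the notebook's `precision_recall_at_k` function.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall reported for the tuned user-user model above
print(round(f1_score(0.854, 0.808), 2))  # 0.83
```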

### **Steps:**
- **Predict rating for the user with `userId="A3LDPF5FMB782Z"`, and `prod_id= "1400501466"` using the optimized model**
- **Predict rating for `userId="A34BZM6S9L7QI4"` who has not interacted with `prod_id ="1400501466"`, by using the optimized model**
- **Compare the output with the output from the baseline model**

In [47]:
# Use sim_user_user_optimized model to recommend for userId "A3LDPF5FMB782Z" and productId 1400501466
similarity_algo_optimized.predict('A3LDPF5FMB782Z','1400501466',r_ui=5,verbose=True)

user: A3LDPF5FMB782Z item: 1400501466 r_ui = 5.00   est = 3.00   {'actual_k': 4, 'was_impossible': False}


Prediction(uid='A3LDPF5FMB782Z', iid='1400501466', r_ui=5, est=3.0, details={'actual_k': 4, 'was_impossible': False})

In [48]:
# Use sim_user_user_optimized model to recommend for userId "A34BZM6S9L7QI4" and productId "1400501466"
similarity_algo_optimized.predict('A34BZM6S9L7QI4','1400501466',verbose=True)

user: A34BZM6S9L7QI4 item: 1400501466 r_ui = None   est = 4.29   {'was_impossible': True, 'reason': 'Not enough neighbors.'}


Prediction(uid='A34BZM6S9L7QI4', iid='1400501466', r_ui=None, est=4.291403190162572, details={'was_impossible': True, 'reason': 'Not enough neighbors.'})

**Write your observations here:**____________

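For user **A3LDPF5FMB782Z**, the **tuned model** estimates a rating of **3.00** against an **actual** rating of **5.00**, using only 4 neighbors (`actual_k = 4`). For user **A34BZM6S9L7QI4**, the prediction is flagged `was_impossible = True` ("Not enough neighbors"), so the **estimated** rating of **4.29** is the model's default fallback (the global mean rating of the training set) rather than a neighbor-based estimate.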
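
The `actual_k` and `was_impossible` fields in the outputs above can be understood with a simplified sketch of a KNNBasic-style estimate: a similarity-weighted mean over the most similar neighbors, with a fallback to the global mean when fewer than `min_k` usable neighbors exist. The function below is a hypothetical illustration with made-up neighbor data, not surprise's code.

```python
import heapq

def knn_basic_estimate(neighbors, k, min_k, global_mean):
    """neighbors: list of (similarity, rating) pairs from users who rated the item.

    Returns (estimate, details): a similarity-weighted mean over the k most
    similar neighbors, or the global mean when too few neighbors are usable.
    """
    top_k = heapq.nlargest(k, neighbors, key=lambda pair: pair[0])
    used = [(sim, rating) for sim, rating in top_k if sim > 0]
    if len(used) < min_k:
        return global_mean, {"was_impossible": True, "reason": "Not enough neighbors."}
    total_sim = sum(sim for sim, _ in used)
    est = sum(sim * rating for sim, rating in used) / total_sim
    return est, {"actual_k": len(used), "was_impossible": False}

# Hypothetical neighbors; 4.29 stands in for the fallback value seen above
print(knn_basic_estimate([(0.9, 3), (0.8, 3), (0.7, 3), (0.6, 3)],
                         k=40, min_k=4, global_mean=4.29))
print(knn_basic_estimate([(0.9, 5), (0.8, 4)], k=40, min_k=4, global_mean=4.29))
```

A larger `min_k` makes the neighbor-based estimates more reliable but causes more predictions to fall back to the global mean.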

### **Identifying similar users to a given user (nearest neighbors)**

We can also find the **users most similar to a given user** (its **nearest neighbors**) with this KNNBasic model. Below, we find the 5 most similar users to the user with inner id 0, based on the `cosine` similarity measure used by the optimized model.

In [49]:
# 0 is the inner id of the above user
similarity_algo_optimized.get_neighbors(0,k=5)

[7, 12, 16, 17, 26]
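
The ids returned above are inner ids of the training set, not raw user ids; surprise's `trainset.to_raw_uid` can map them back. Conceptually, a nearest-neighbor lookup picks the k largest entries in one row of the similarity matrix. The sketch below illustrates this on a small hypothetical matrix and is not surprise's implementation.

```python
import heapq

def nearest_neighbors(sim_matrix, inner_id, k):
    """Return the inner ids of the k entries most similar to inner_id."""
    others = [(other, sim) for other, sim in enumerate(sim_matrix[inner_id])
              if other != inner_id]
    top = heapq.nlargest(k, others, key=lambda pair: pair[1])
    return [other for other, _ in top]

# Hypothetical 4x4 symmetric similarity matrix
sim = [
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.4, 0.1],
    [0.9, 0.4, 1.0, 0.3],
    [0.5, 0.1, 0.3, 1.0],
]
print(nearest_neighbors(sim, 0, k=2))  # [2, 3]
```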

### **Implementing the recommendation algorithm based on optimized KNNBasic model**

Below we will be implementing a function where the input parameters are:

- data: A **rating** dataset
- user_id: A user id **against which we want the recommendations**
- top_n: The **number of products we want to recommend**
- algo: the algorithm we want to use **for predicting the ratings**
- The output of the function is a **list of top_n (prod_id, predicted rating) pairs** recommended for the given user_id based on the given algorithm

In [50]:
def get_recommendations(data, user_id, top_n, algo):

    # Creating an empty list to store the recommended product ids
    recommendations = []

    # Creating a user-item interactions matrix
    user_item_interactions_matrix = data.pivot(index = 'user_id', columns = 'prod_id', values = 'rating')

    # Extracting the product ids that this user_id has not interacted with yet
    non_interacted_products = user_item_interactions_matrix.loc[user_id][user_item_interactions_matrix.loc[user_id].isnull()].index.tolist()

    # Looping through each of the product ids which user_id has not interacted yet
    for item_id in non_interacted_products:

        # Predicting the ratings for those non interacted product ids by this user
        est = algo.predict(user_id, item_id).est

        # Appending the predicted ratings
        recommendations.append((item_id, est))

    # Sorting the predicted ratings in descending order
    recommendations.sort(key = lambda x: x[1], reverse = True)

    return recommendations[:top_n] # Returning the top n products with the highest predicted ratings for this user
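
The ranking logic of this function can be tried without pandas or a trained model. The sketch below is a simplified, hypothetical variant (`top_n_unrated` and the fake estimates are assumptions for illustration) that also shows how ties are handled: Python's sort is stable, so products with equal predicted ratings keep their original order.

```python
def top_n_unrated(user_ratings, all_items, predict, top_n):
    """Rank the items the user has not rated by predicted rating."""
    candidates = [item for item in all_items if item not in user_ratings]
    scored = [(item, predict(item)) for item in candidates]
    # The sort is stable, so tied predictions keep their original item order
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical predictor standing in for algo.predict(user_id, item_id).est
fake_estimates = {"A": 5.0, "B": 4.2, "C": 5.0, "D": 3.1}
result = top_n_unrated({"D": 4}, ["A", "B", "C", "D"], fake_estimates.get, top_n=2)
print(result)  # [('A', 5.0), ('C', 5.0)]
```

When many products share the same predicted rating, the top of the list is therefore decided by column order rather than by any real preference signal.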

**Predicting top 5 products for userId = "A3LDPF5FMB782Z" with similarity based recommendation system**

In [55]:
# Making top 5 recommendations for user_id "A3LDPF5FMB782Z" with a similarity-based recommendation engine
recommendations = get_recommendations(df_final,'A3LDPF5FMB782Z',5,similarity_algo_optimized)
recommendations

[('B000067RT6', 5),
 ('B000M17AVO', 5),
 ('B001CCAISE', 5),
 ('B001TH7GUU', 5),
 ('B002SGATH8', 5)]

In [56]:
# Building the dataframe for above recommendations with columns "prod_id" and "predicted_ratings"
pd.DataFrame(recommendations, columns=['prod_id','predicted_ratings'])

| | prod_id | predicted_ratings |
|---|---|---|
| 0 | B000067RT6 | 5 |
| 1 | B000M17AVO | 5 |
| 2 | B001CCAISE | 5 |
| 3 | B001TH7GUU | 5 |
| 4 | B002SGATH8 | 5 |


### **Item-Item Similarity-based Collaborative Filtering Recommendation System**

* Above we have seen **similarity-based collaborative filtering** where similarity is calculated **between users**. Now let us look into similarity-based collaborative filtering where similarity is seen **between items**.
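
Switching from user-user to item-item similarity amounts to viewing the same ratings by product instead of by user: two products are then compared over the users who rated both. A minimal sketch with hypothetical toy data:

```python
from collections import defaultdict

def invert_ratings(user_ratings):
    """Turn {user: {item: rating}} into {item: {user: rating}}."""
    item_ratings = defaultdict(dict)
    for user, ratings in user_ratings.items():
        for item, rating in ratings.items():
            item_ratings[item][user] = rating
    return dict(item_ratings)

# Hypothetical toy data (not taken from the Amazon dataset)
by_user = {"u1": {"p1": 5, "p2": 4}, "u2": {"p1": 3, "p3": 2}}
by_item = invert_ratings(by_user)
print(by_item["p1"])  # {'u1': 5, 'u2': 3}
```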

In [57]:
# Declaring the similarity options
sim_options = {'name' : 'cosine',
               'user_based' : False}
# Using the KNN algorithm to find similar items
algo_knn_item = KNNBasic(sim_options=sim_options,verbose=False)

# Train the algorithm on the trainset, and predict ratings for the test set
algo_knn_item.fit(trainset)

# Let us compute precision@k, recall@k, and f_1 score with k = 10
precision_recall_at_k(algo_knn_item)

RMSE: 1.0232
Precision:  0.835
Recall:  0.758
F_1 score:  0.795


**Write your observations here:**____________

We can see that the **baseline model** has

**RMSE = 1.0232**

**Precision = 0.835**

**Recall = 0.758**

**F_1 Score = 0.795** on the **test set**.
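
For reference, precision@k and recall@k for a single user can be sketched as follows. The relevance threshold of 3.5 and the function name are assumptions made for illustration; the notebook's own `precision_recall_at_k` function may differ in its details.

```python
def precision_recall_for_user(predictions, k=10, threshold=3.5):
    """predictions: list of (estimated_rating, true_rating) pairs for one user."""
    ranked = sorted(predictions, key=lambda pair: pair[0], reverse=True)
    top_k = ranked[:k]
    # Relevant: true rating at or above the threshold
    n_relevant = sum(true >= threshold for _, true in ranked)
    # Recommended: estimated rating at or above the threshold, within the top k
    n_recommended = sum(est >= threshold for est, _ in top_k)
    n_hits = sum(est >= threshold and true >= threshold for est, true in top_k)
    precision = n_hits / n_recommended if n_recommended else 0.0
    recall = n_hits / n_relevant if n_relevant else 0.0
    return precision, recall

# Hypothetical test-set predictions for a single user
preds = [(4.8, 5), (4.1, 3), (3.9, 4), (3.0, 5), (2.5, 2)]
print(precision_recall_for_user(preds, k=3))
```

Precision is the share of recommended products that are relevant, while recall is the share of relevant products that get recommended.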

We can try to **improve the performance number** by **using GridSearchCV** to **tune different hyperparameters** of this **algorithm**.

Let's now **predict a rating for a user with `userId = A3LDPF5FMB782Z` and `prod_id = 1400501466`** as shown below. Here the user has already rated the product with prod_id "1400501466".

In [58]:
# Predicting rating for a sample user with an interacted product
algo_knn_item.predict('A3LDPF5FMB782Z','1400501466',r_ui=5,verbose=True)

user: A3LDPF5FMB782Z item: 1400501466 r_ui = 5.00   est = 4.32   {'actual_k': 19, 'was_impossible': False}


Prediction(uid='A3LDPF5FMB782Z', iid='1400501466', r_ui=5, est=4.315789473684211, details={'actual_k': 19, 'was_impossible': False})

**Write your observations here:**____________

The **actual** rating for this **user-item pair** is **5.00**, while the **estimated** rating is **4.32**.

Below we are **predicting rating for the `userId = A34BZM6S9L7QI4` and `prod_id = 1400501466`**.

In [59]:
# Predicting rating for a sample user with a non interacted product
algo_knn_item.predict('A34BZM6S9L7QI4','1400501466',verbose=True)

user: A34BZM6S9L7QI4 item: 1400501466 r_ui = None   est = 4.29   {'was_impossible': True, 'reason': 'Not enough neighbors.'}


Prediction(uid='A34BZM6S9L7QI4', iid='1400501466', r_ui=None, est=4.291403190162572, details={'was_impossible': True, 'reason': 'Not enough neighbors.'})

**Write your observations here:**____________

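The prediction is flagged `was_impossible = True` ("Not enough neighbors"), so the **estimated** rating of **4.29** is the default fallback (the global mean rating of the training set), not an estimate based on similar items. This is why the value is identical to the one returned by the user-user model for the same pair.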

### **Hyperparameter tuning the item-item similarity-based model**
- Use the following values for the param_grid and tune the model.
  - 'k': [10, 20, 30]
  - 'min_k': [3, 6, 9]
  - 'sim_options': {'name': ['msd', 'cosine'], 'user_based': [False]}
- Use GridSearchCV() to tune the model using the 'rmse' measure
- Print the best score and best parameters
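
The 'rmse' and 'mae' measures passed to the grid search can be sketched in a few lines. The helper names and the sample ratings below are illustrative assumptions, not surprise's `accuracy` module.

```python
import math

def rmse(pairs):
    """pairs: iterable of (true_rating, estimated_rating)."""
    pairs = list(pairs)
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true_rating, estimated_rating) pairs."""
    pairs = list(pairs)
    return sum(abs(r - est) for r, est in pairs) / len(pairs)

# Hypothetical (true, estimated) ratings
sample = [(5, 4.0), (3, 3.5), (4, 4.5)]
print(rmse(sample), mae(sample))
```

RMSE squares each error before averaging, so it penalizes large misses more heavily than MAE does.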

In [61]:
# Setting up parameter grid to tune the hyperparameters
param_grid = {'k' : [10,20,30], 'min_k' : [3,6,9],
              'sim_options' : {'name' : ['msd','cosine'],
                               'user_based' : [False]}}

# Performing 3-fold cross validation to tune the hyperparameters
grid_obj = GridSearchCV(KNNBasic, param_grid, measures = ['rmse','mae'],cv=3)

# Fitting the data
grid_obj.fit(data)

# Find the best RMSE score
print(grid_obj.best_score['rmse'])

# Find the combination of parameters that gave the best RMSE score
print(grid_obj.best_params['rmse'])

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
...

Once the **grid search** is complete, we can get the **optimal values for each of those hyperparameters** from `best_params`.

Now let's build the **final model** by using **tuned values of the hyperparameters** which we received by using grid search cross-validation.

### **Use the best parameters from GridSearchCV to build the optimized item-item similarity-based model. Compare the performance of the optimized model with the baseline model.**

In [62]:
# Using the optimal similarity measure for item-item based collaborative filtering
sim_options = {'name' : 'msd',
               'user_based' : False}

# Creating an instance of KNNBasic with optimal hyperparameter values
similarity_algo_optimized_item = KNNBasic(sim_options=sim_options,k=30,min_k=6,verbose=False)

# Training the algorithm on the trainset
similarity_algo_optimized_item.fit(trainset)

# Let us compute precision@k and recall@k, f1_score and RMSE
precision_recall_at_k(similarity_algo_optimized_item)

RMSE: 0.9694
Precision:  0.836
Recall:  0.797
F_1 score:  0.816


**Write your observations here:__________**

After **tuning the hyperparameters**,

**RMSE decreased** from **1.0232** to **0.9694**.

**Precision increased slightly** from **0.835** to **0.836**.

**Recall increased** from **0.758** to **0.797**.

**F1-Score increased** from **0.795** to **0.816**.

### **Steps:**
- **Predict rating for the user with `userId="A3LDPF5FMB782Z"`, and `prod_id= "1400501466"` using the optimized model**
- **Predict rating for `userId="A34BZM6S9L7QI4"` who has not interacted with `prod_id ="1400501466"`, by using the optimized model**
- **Compare the output with the output from the baseline model**

In [63]:
# Use sim_item_item_optimized model to recommend for userId "A3LDPF5FMB782Z" and productId "1400501466"
similarity_algo_optimized_item.predict('A3LDPF5FMB782Z','1400501466',r_ui=5,verbose=True)

user: A3LDPF5FMB782Z item: 1400501466 r_ui = 5.00   est = 4.70   {'actual_k': 19, 'was_impossible': False}


Prediction(uid='A3LDPF5FMB782Z', iid='1400501466', r_ui=5, est=4.699444206926037, details={'actual_k': 19, 'was_impossible': False})

In [64]:
# Use sim_item_item_optimized model to recommend for userId "A34BZM6S9L7QI4" and productId "1400501466"
similarity_algo_optimized_item.predict('A34BZM6S9L7QI4','1400501466',verbose=True)

user: A34BZM6S9L7QI4 item: 1400501466 r_ui = None   est = 4.29   {'was_impossible': True, 'reason': 'Not enough neighbors.'}


Prediction(uid='A34BZM6S9L7QI4', iid='1400501466', r_ui=None, est=4.291403190162572, details={'was_impossible': True, 'reason': 'Not enough neighbors.'})

**Write your observations here:__________**

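For user **A3LDPF5FMB782Z**, the **estimated** rating **increased** from **4.32** (baseline) to **4.70** (tuned), which is closer to the **actual** rating of **5.00**. For user **A34BZM6S9L7QI4**, both models return the same fallback estimate of **4.29** because there are not enough neighbors.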

### **Identifying similar items to a given item (nearest neighbors)**

We can also find out **similar items** to a given item or its nearest neighbors based on this **KNNBasic algorithm**. Below we are finding the 5 most similar items to the item with internal id 0 based on the `msd` distance metric.

In [65]:
similarity_algo_optimized_item.get_neighbors(0,k=5)

[53, 67, 106, 151, 156]

**Predicting top 5 products for userId = "A1A5KUIIIHFF4U" with similarity based recommendation system.**

**Hint:** Use the get_recommendations() function.

In [67]:
#Making top 5 recommendations for user_id A1A5KUIIIHFF4U with similarity-based recommendation engine.
recommendations = get_recommendations(df_final,'A1A5KUIIIHFF4U',5,similarity_algo_optimized_item)

In [68]:
#Building the dataframe for above recommendations with columns "prod_id" and "predicted_ratings"
pd.DataFrame(recommendations,columns=['prod_id','predicted_ratings'])

| | prod_id | predicted_ratings |
|---|---|---|
| 0 | 1400532655 | 4.291403 |
| 1 | 1400599997 | 4.291403 |
| 2 | 9983891212 | 4.291403 |
| 3 | B00000DM9W | 4.291403 |
| 4 | B00000J1V5 | 4.291403 |


Now that we have seen **similarity-based collaborative filtering algorithms**, let us move on to **model-based collaborative filtering algorithms**.

### **Model 3: Model-Based Collaborative Filtering - Matrix Factorization**

Model-based Collaborative Filtering is a **personalized recommendation system**: the recommendations are based on the past behavior of the user and do not depend on any additional information. We use **latent features** to find recommendations for each user.

### Singular Value Decomposition (SVD)

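SVD is used to **compute the latent features** from the **user-item matrix**. Classical SVD is not defined when the **user-item matrix** has **missing values**, so surprise's `SVD` algorithm instead learns the latent features from the observed ratings only, using stochastic gradient descent (SGD).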
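
Once the latent features are learned, a rating is estimated as the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors. The sketch below illustrates that rule; the parameter values are made up and the 1 to 5 rating scale is an assumption.

```python
def svd_estimate(global_mean, user_bias, item_bias, user_factors, item_factors,
                 low=1.0, high=5.0):
    """Estimate = mu + b_u + b_i + q_i . p_u, clipped to the rating scale."""
    dot = sum(p * q for p, q in zip(user_factors, item_factors))
    est = global_mean + user_bias + item_bias + dot
    return min(high, max(low, est))

# Hypothetical learned parameters with two latent factors
print(svd_estimate(4.29, 0.10, -0.20, [0.3, -0.1], [0.5, 0.2]))
```

Because every term exists for any user and product seen in training, this rule yields an estimate even for pairs with no rating, without needing neighbors.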

In [69]:
# Using SVD matrix factorization. Use random_state = 1
svd = SVD(random_state=1)

# Training the algorithm on the trainset
svd.fit(trainset)

# Use the function precision_recall_at_k to compute precision@k, recall@k, F1-Score, and RMSE
precision_recall_at_k(svd)

RMSE: 0.8989
Precision:  0.86
Recall:  0.797
F_1 score:  0.827


**Write your observations here:___________**

**RMSE** is **0.8989**.

**Precision** is **0.86**.

**Recall** is **0.797**.

**F1-Score** is **0.827**.

**Let's now predict the rating for a user with `userId = "A3LDPF5FMB782Z"` and `prod_id = "1400501466"`.**

In [70]:
# Making prediction
svd.predict('A3LDPF5FMB782Z','1400501466',r_ui=5,verbose=True)

user: A3LDPF5FMB782Z item: 1400501466 r_ui = 5.00   est = 4.07   {'was_impossible': False}


Prediction(uid='A3LDPF5FMB782Z', iid='1400501466', r_ui=5, est=4.070652912318144, details={'was_impossible': False})

**Write your observations here:___________**

The **actual** rating of the **user-item pair** is **5.00** and the **estimated** rating is **4.07**.

**Below we are predicting the rating for `userId = "A34BZM6S9L7QI4"` and `prod_id = "1400501466"`.**

In [71]:
# Making prediction
svd.predict('A34BZM6S9L7QI4','1400501466',verbose=True)

user: A34BZM6S9L7QI4 item: 1400501466 r_ui = None   est = 4.39   {'was_impossible': False}


Prediction(uid='A34BZM6S9L7QI4', iid='1400501466', r_ui=None, est=4.3949263041205775, details={'was_impossible': False})

**Write your observations here:___________**

The model produces an **estimated** rating of **4.39** even though this user has **not rated** this product. Unlike the KNN models above, SVD did not flag the prediction as impossible.

### **Improving Matrix Factorization based recommendation system by tuning its hyperparameters**

Below we will be tuning only three hyperparameters:
- **n_epochs**: The number of iterations of the SGD algorithm.
- **lr_all**: The learning rate for all parameters.
- **reg_all**: The regularization term for all parameters.
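
To see how these three hyperparameters interact, here is a sketch of regularized SGD updates for a single observed rating, with `lr` playing the role of `lr_all`, `reg` the role of `reg_all`, and the loop count the role of `n_epochs`. The starting values and helper names are illustrative assumptions, not surprise's code.

```python
def sgd_step(r, mu, b_u, b_i, p_u, q_i, lr=0.005, reg=0.2):
    """One SGD update for a single observed rating r; returns updated parameters."""
    pred = mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))
    err = r - pred
    # Each parameter moves toward reducing the error, shrunk by regularization
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i

def run_epochs(r, mu, n_epochs, lr=0.005, reg=0.2):
    """Repeat the update n_epochs times and return the final estimate."""
    b_u, b_i, p_u, q_i = 0.0, 0.0, [0.1, 0.1], [0.1, 0.1]
    for _ in range(n_epochs):
        b_u, b_i, p_u, q_i = sgd_step(r, mu, b_u, b_i, p_u, q_i, lr, reg)
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))

# Hypothetical single rating of 5 with a global mean of 4.29
print(run_epochs(5, 4.29, 10), run_epochs(5, 4.29, 30))
```

More epochs or a higher learning rate move the estimate closer to the observed rating, while a larger regularization term pulls the parameters back toward zero to limit overfitting.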

In [73]:
# Set the parameter space to tune
param_grid = {'n_epochs' : [10,20,30], 'lr_all' : [0.001,0.005,0.01],
              'reg_all' : [0.2,0.4,0.6]}

# Performing 3-fold gridsearch cross-validation
gs_ = GridSearchCV(SVD,param_grid,measures=['rmse'],cv=3,n_jobs=-1)

# Fitting data
gs_.fit(data)

# Best RMSE score
print(gs_.best_score['rmse'])

# Combination of parameters that gave the best RMSE score
print(gs_.best_params['rmse'])

0.8999173103910411
{'n_epochs': 30, 'lr_all': 0.005, 'reg_all': 0.2}


Now, we will **build the final model** by using **tuned values** of the hyperparameters, which we received using grid search cross-validation above.

In [74]:
# Build the optimized SVD model using optimal hyperparameter search. Use random_state=1
svd_optimized = SVD(n_epochs=30,lr_all=0.005,reg_all=0.2,random_state=1)

# Train the algorithm on the trainset
svd_optimized.fit(trainset)

# Use the function precision_recall_at_k to compute precision@k, recall@k, F1-Score, and RMSE
precision_recall_at_k(svd_optimized)

RMSE: 0.8906
Precision:  0.861
Recall:  0.799
F_1 score:  0.829


**Write your observations here:_____________**

After **hyperparameter tuning**,

**RMSE decreased** from **0.8989** to **0.8906**.

**Precision increased slightly** from **0.86** to **0.861**.

**Recall increased** from **0.797** to **0.799**.

**F1-Score increased** from **0.827** to **0.829**.

### **Steps:**
- **Predict rating for the user with `userId="A3LDPF5FMB782Z"`, and `prod_id= "1400501466"` using the optimized model**
- **Predict rating for `userId="A34BZM6S9L7QI4"` who has not interacted with `prod_id ="1400501466"`, by using the optimized model**
- **Compare the output with the output from the baseline model**

In [75]:
# Use svd_algo_optimized model to recommend for userId "A3LDPF5FMB782Z" and productId "1400501466"
svd_optimized.predict('A3LDPF5FMB782Z','1400501466',r_ui=5,verbose=True)

user: A3LDPF5FMB782Z item: 1400501466 r_ui = 5.00   est = 4.07   {'was_impossible': False}


Prediction(uid='A3LDPF5FMB782Z', iid='1400501466', r_ui=5, est=4.07003575619942, details={'was_impossible': False})

In [76]:
# Use svd_algo_optimized model to recommend for userId "A34BZM6S9L7QI4" and productId "1400501466"
svd_optimized.predict('A34BZM6S9L7QI4','1400501466',verbose=True)

user: A34BZM6S9L7QI4 item: 1400501466 r_ui = None   est = 4.20   {'was_impossible': False}


Prediction(uid='A34BZM6S9L7QI4', iid='1400501466', r_ui=None, est=4.198648921636727, details={'was_impossible': False})

For user **A3LDPF5FMB782Z**, the **estimated** rating is essentially **unchanged** at **4.07** (actual rating **5.00**). For user **A34BZM6S9L7QI4**, the **estimated** rating **decreased** from **4.39** to **4.20** after tuning.

### **Conclusion and Recommendations**

**Write your conclusion and recommendations here**

# Conclusions:

### 1. Rank Based Recommendation System (Model 1):

*   Serves as a baseline recommendation system but lacks personalization, as it relies on popularity rather than individual user choices.
*   Achieves moderate performance metrics but may not satisfy individual users' preferences.

### 2. Collaborative Filtering Recommendation System:

#### a. User-User Similarity-Based Model:

Uses KNNBasic.

*   RMSE = 1.0250 --> 0.9614
*   Precision = 0.86 --> 0.85
*   Recall = 0.783 --> 0.808
*   F1-Score = 0.82 --> 0.83

For the sample user-item pairs, the tuned model estimated 3.00 for user A3LDPF5FMB782Z (actual rating 5.00) and fell back to the default estimate of 4.29 for user A34BZM6S9L7QI4 because there were not enough neighbors.

*   After hyperparameter tuning, RMSE, recall, and F1-Score improved, while precision dipped slightly.
*   The higher recall means that a larger share of the relevant products is now recommended.





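#### b. Item-Item Similarity-Based Model:

Uses KNNBasic.

*   RMSE = 1.0232 --> 0.9694
*   Precision = 0.835 --> 0.836
*   Recall = 0.758 --> 0.797
*   F1-Score = 0.795 --> 0.816

For user A3LDPF5FMB782Z (actual rating 5.00), the estimated rating changed from 4.32 to 4.70 after tuning. For user A34BZM6S9L7QI4, both models returned the default estimate of 4.29 because there were not enough neighbors.

*   Performance is close to that of the user-user model; after tuning, the item-item model is slightly worse on every metric.
*   The comparable gains from tuning suggest that the KNN hyperparameters (k, min_k, and the similarity measure) matter for both variants.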





### 3. Model-Based Collaborative Filtering - Matrix Factorization:

Uses SVD.

*   RMSE = 0.8989 --> 0.8906
*   Precision = 0.86 --> 0.861
*   Recall = 0.797 --> 0.799
*   F1-Score = 0.827 --> 0.829

For user A3LDPF5FMB782Z (actual rating 5.00), the estimated rating stayed at 4.07 after tuning. For user A34BZM6S9L7QI4, it changed from 4.39 to 4.20.

*   The SVD-based approach has the lowest RMSE of all the models (0.8906 after tuning), indicating more accurate user-item rating predictions.
*   The small but consistent improvements after tuning suggest that matrix factorization captures user preferences and item characteristics well.



# Recommendations:

1. Personalization:

   Keep enhancing the user-user and item-item collaborative filtering models to improve personalization and recommendation relevance. Add extra information, such as website analytics and purchase behaviour, to better capture individual preferences.


2. Continuous Hyperparameter Tuning:

   Keep fine-tuning to optimize the model performance metrics. Explore more advanced tuning techniques to search for a better combination of hyperparameters.


3. Ensemble Approaches:

   Try ensemble methods that combine multiple recommendation algorithms to exploit their complementary strengths. This can further improve recommendation accuracy and robustness.


4. Real-World Impact:

   Conduct A/B testing to validate performance improvements of the recommendation systems in real-time. Monitor business metrics like revenue, user engagement, etc. to assess the practical impact of the recommendation system enhancements.
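
As one possible starting point for the ensemble idea in recommendation 3, the sketch below blends the estimates of the three tuned models with a weighted average. The weights are hypothetical and would need to be tuned on a validation set; `blend_estimates` is an illustrative helper, not an existing library function.

```python
def blend_estimates(estimates, weights):
    """Weighted average of per-model rating estimates for one user-item pair."""
    total = sum(weights[name] for name in estimates)
    return sum(weights[name] * est for name, est in estimates.items()) / total

# Estimates reported above for user A3LDPF5FMB782Z and product 1400501466
estimates = {"user_user": 3.00, "item_item": 4.70, "svd": 4.07}
# Hypothetical weights; in practice they would be tuned on a validation set
weights = {"user_user": 1.0, "item_item": 1.0, "svd": 2.0}
print(round(blend_estimates(estimates, weights), 2))
```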
  
