<a href="https://colab.research.google.com/github/saketvaibhav7114/Book-Recommendation-System/blob/main/Book_Recommendation_System_(Unsupervised_Learning_Project).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Book Recommendation System



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual
##### **Team Member 1 -** Saket Vaibhav

# **Project Summary -**

Over the last few decades, with the rise of YouTube, Amazon, Netflix, and many other web services, recommender systems have taken an ever larger place in our lives. From e-commerce (suggesting articles that could interest buyers) to online advertising (suggesting content that matches users' preferences), recommender systems are now an unavoidable part of our daily online journeys.

A recommendation system helps an organization create loyal customers and build trust with them by surfacing the products and services they want. Modern recommendation systems can even handle a new customer who is visiting the site for the first time: they can recommend products that are currently trending or highly rated, or products that bring the most profit to the company.
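
The first-visit case described above can be sketched as a simple popularity ranking. This is only an illustration under assumptions: the function name `popular_books`, the `min_count` threshold, and the toy ratings are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict

def popular_books(ratings, min_count=2, top_n=3):
    """Rank books by mean rating, keeping only books with at least min_count ratings."""
    totals = defaultdict(lambda: [0.0, 0])  # book -> [sum of ratings, number of ratings]
    for book, rating in ratings:
        totals[book][0] += rating
        totals[book][1] += 1
    scored = [
        (book, total / count, count)
        for book, (total, count) in totals.items()
        if count >= min_count
    ]
    # Highest mean rating first; ties are broken by the number of ratings
    scored.sort(key=lambda row: (row[1], row[2]), reverse=True)
    return [book for book, _, _ in scored[:top_n]]

# Hypothetical toy ratings: (book title, rating)
toy_ratings = [
    ("Book A", 9), ("Book A", 8),
    ("Book B", 10),                      # a single rating, filtered out by min_count
    ("Book C", 7), ("Book C", 6), ("Book C", 8),
]
print(popular_books(toy_ratings))
```

Requiring a minimum number of ratings keeps a book with one enthusiastic rating from outranking books that many users have rated well.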

### **Data Collection:**
The foundation of any recommendation system is data. In this project, data collection involved gathering information about books, authors, user preferences, and historical reading patterns. The dataset for Book Recommendation System comprises three files:

**Users**

Contains the user IDs, Location, and Age.


**Books**

Books are identified by their respective ISBN. Some content-based information is also given (Book-Title, Book-Author, Year-Of-Publication, Publisher), obtained from Amazon Web Services.


**Ratings**

Contains the book rating information. Ratings are expressed on a scale from 1 to 10, with higher values denoting higher appreciation. Taken together, the three files provide details such as book titles, authors, publication years, publishers, and user ratings.


### **Data Preprocessing:**
Data preprocessing involved tasks such as cleaning the data, handling missing values, and transforming textual descriptions into numerical representations through techniques like TF-IDF (Term Frequency-Inverse Document Frequency).
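
To make the TF-IDF step concrete, here is a from-scratch sketch using the textbook weighting `tf * log(N / df)`. The function name `tf_idf` and the toy descriptions are hypothetical. scikit-learn's `TfidfVectorizer`, which the notebook imports, applies a smoothed IDF and L2 normalisation by default, so its numbers will differ from these.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy descriptions
toy_descriptions = [
    "a wizard school story",
    "a detective story",
    "a space opera",
]
vectors = tf_idf(toy_descriptions)
print(vectors[0])
```

A word that occurs in every description ("a") gets weight 0, while a word unique to one description ("wizard") gets the largest weight in that description.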


### **Clustering Algorithms:**
One of the key components of the recommendation system is the use of clustering algorithms. Unsupervised clustering methods, such as K-Means, are applied to group books with similar characteristics. The clusters are built from factors like genre, author, and book content, with the goal of identifying patterns and associations among books that can aid recommendations.
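
The K-Means idea can be illustrated with a few lines of plain Python. This is a minimal sketch of Lloyd's algorithm on hypothetical two-dimensional book features with hand-picked starting centroids; a real run would use a library implementation and a proper initialisation.

```python
def k_means(points, centroids, n_iter=10):
    """Plain Lloyd's algorithm on 2-D points, starting from the given centroids."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        labels = [
            min(range(len(centroids)),
                key=lambda k: (p[0] - centroids[k][0]) ** 2 + (p[1] - centroids[k][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append((sum(m[0] for m in members) / len(members),
                                      sum(m[1] for m in members) / len(members)))
            else:
                new_centroids.append(centroids[k])  # leave an empty cluster where it was
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids

# Hypothetical numeric features for six books
toy_points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, final_centroids = k_means(toy_points, [(1.0, 1.0), (9.0, 9.5)])
print(labels)
print(final_centroids)
```

The first three books end up in one cluster and the last three in the other, which is the grouping a recommender could use to suggest books from the same cluster.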

### **Matrix Factorization:**
Matrix factorization techniques, including Singular Value Decomposition (SVD), are employed to uncover latent factors that influence user preferences. By decomposing the user-item interaction matrix, these algorithms reveal hidden relationships between users and books, which are then used to make personalized recommendations.
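
As a rough illustration of what a decomposition of the user-item matrix yields, the sketch below extracts only the leading singular value and vectors by power iteration. This is a swapped-in, dependency-free technique, not the `svds` routine the notebook imports. The function name, the toy matrix, and the choice to store unrated books as 0 are all assumptions.

```python
import math

def top_singular_triplet(matrix, n_iter=100):
    """Leading singular value and vectors of a small dense matrix via power iteration."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # start from a uniform unit vector
    for _ in range(n_iter):
        # u = A v, then w = A^T u, and renormalise to get the next v
        u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return u, sigma, v

# Hypothetical user x book matrix (rows are users, columns are books, 0 = not rated)
toy_matrix = [
    [5, 4, 0],
    [4, 5, 0],
    [0, 0, 3],
]
u, sigma, v = top_singular_triplet(toy_matrix)
# Rank-1 reconstruction: the strongest "latent factor" shared by users 0 and 1
rank_one = [[sigma * ui * vj for vj in v] for ui in u]
print(round(sigma, 4))
```

The leading factor captures the shared taste of the first two users for the first two books; the third user and third book contribute almost nothing to it.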

### **Collaborative Filtering:**
Collaborative filtering relies on the idea that users with similar reading preferences are likely to enjoy similar books. Collaborative filtering algorithms, both user-based and item-based, are implemented to generate recommendations from user behavior and item similarity. This approach helps fine-tune the suggestions.
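
Item-based collaborative filtering can be sketched as a cosine similarity between the rating vectors of two books. The helper names and the toy vectors below are hypothetical, and treating a missing rating as 0 is an assumption made for the illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_similar_items(item_vectors, target, top_n=2):
    """Rank the other items by cosine similarity to the target item's rating vector."""
    scores = [
        (other, cosine(item_vectors[target], vector))
        for other, vector in item_vectors.items()
        if other != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

# Hypothetical item vectors: one entry per user, 0 meaning "not rated"
toy_items = {
    "Book A": [9, 8, 0, 0],
    "Book B": [8, 9, 0, 1],
    "Book C": [0, 0, 7, 9],
}
print(most_similar_items(toy_items, "Book A"))
```

"Book B" ranks first because the same users rated it highly; "Book C" shares no raters with "Book A" and scores 0.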

### **Content-Based Filtering:**
In addition to collaborative filtering, content-based filtering is used to improve the recommendation system's accuracy. This approach analyzes the textual descriptions of books and matches them with user preferences. Natural Language Processing (NLP) techniques are employed to extract meaningful features from the book descriptions and align them with user profiles.
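
A content-based match between a user profile and book descriptions might look like the sketch below. It uses raw word counts instead of TF-IDF weights to stay short, and every name and description in it is hypothetical.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lower-case whitespace tokenisation into word counts."""
    return Counter(text.lower().split())

def cosine_sparse(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend_by_content(descriptions, liked, top_n=1):
    """Build a profile from the liked books and rank the remaining books against it."""
    profile = Counter()
    for title in liked:
        profile.update(bag_of_words(descriptions[title]))
    candidates = [
        (title, cosine_sparse(profile, bag_of_words(text)))
        for title, text in descriptions.items()
        if title not in liked
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:top_n]

# Hypothetical descriptions
toy_catalog = {
    "Book A": "young wizard attends magic school",
    "Book B": "wizard apprentice learns magic",
    "Book C": "detective solves murder in london",
}
print(recommend_by_content(toy_catalog, liked=["Book A"]))
```

Because the profile is built only from what the user liked, this approach can recommend a book that nobody else has rated yet, which collaborative filtering cannot do.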

### **Evaluation Metrics:**
To assess the performance of the Book Recommendation System, several evaluation metrics are employed, including Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and precision-recall metrics. These measures gauge the accuracy and effectiveness of the recommendations provided to users.
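
For reference, MAE and RMSE can be computed directly from paired actual and predicted ratings. The ratings below are hypothetical toy values, not results from this project.

```python
import math

def mae(actual, predicted):
    """Mean Absolute Error: the average size of the prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error: penalises large errors more heavily than MAE."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical toy ratings
actual_ratings = [8, 5, 10, 7]
predicted_ratings = [7, 5, 8, 8]
print(mae(actual_ratings, predicted_ratings))
print(rmse(actual_ratings, predicted_ratings))
```

RMSE is never smaller than MAE on the same data, and the gap between them grows when a few predictions are far off.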

# **GitHub Link -**

https://github.com/saketvaibhav7114/Book-Recommendation-System

# **Problem Statement**


The world of literature is vast, with millions of books spanning various genres and subjects. Navigating this extensive library can be overwhelming for readers looking for their next captivating read. To address this challenge, a Book Recommendation System was developed. This system leverages the power of unsupervised learning algorithms to provide personalized book recommendations to users, enhancing their reading experience.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import sklearn
import warnings
import random
import scipy
import math
import nltk
import string

# Tokenization
from nltk.tokenize import word_tokenize
nltk.download('punkt')

# Stopwords
from nltk.corpus import stopwords
nltk.download('stopwords')

# Stemming & Lemmatization
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

# ML Model
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse.linalg import svds


warnings.simplefilter('ignore')
pd.set_option('display.max_colwidth', None)  # None disables truncation (the older -1 is deprecated)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Dataset Loading

In [None]:
# Book Data
book_data = pd.read_csv("/content/drive/MyDrive/Books.csv")

# Users Data
users_data= pd.read_csv('/content/drive/MyDrive/Users.csv')

# Ratings Data
ratings_data = pd.read_csv("/content/drive/MyDrive/Ratings.csv")


### Dataset First View

In [None]:
# Dataset First Look

# Book Data
book_data.head()

In [None]:
# Users Data
users_data.head()

In [None]:
# Ratings Data
ratings_data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

# Book Data
book_data.shape

In [None]:
# User Data
users_data.shape

In [None]:
# ratings_data
ratings_data.shape

### Dataset Information

In [None]:
# Dataset Info
# Book Data
book_data.info()

In [None]:
# User Data
users_data.info()

In [None]:
# ratings_data
ratings_data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# Book Data
book_data.duplicated().sum()

In [None]:
# User Data
users_data.duplicated().sum()

In [None]:
# ratings_data
ratings_data.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
# Book Data
book_data.isnull().sum()

In [None]:
# User Data
users_data.isnull().sum()

In [None]:
# ratings_data
ratings_data.isnull().sum()

In [None]:
# Visualizing the missing values
# Book Data
book_missing_value=book_data.isnull().sum()
columns_with_missing_values = book_missing_value[book_missing_value > 0]      #  Filter columns with missing values

# Calculate the percentage of missing values in each column
total_rows = len(book_data)
percentage_missing = (columns_with_missing_values / total_rows) * 100

# Create a bar chart
plt.figure(figsize=(10, 6))
bar_plot = columns_with_missing_values.plot(kind='bar', color='lightcoral')
plt.xlabel('Columns with Missing Value',fontsize=14)
plt.ylabel('Number of Missing Values',fontsize=14)
plt.title('Number of Missing Values in Book Dataset',fontsize=14)
plt.xticks(rotation=0, ha='center',fontsize=10)
plt.yticks(fontsize=10)
plt.grid(axis='y', linestyle='--', alpha=0.7)

In [None]:
users_missing_value=users_data.isnull().sum()
columns_with_missing_values = users_missing_value[users_missing_value > 0]      #  Filter columns with missing values

# Calculate the percentage of missing values in each column of the users dataset
total_rows = len(users_data)
percentage_missing = (columns_with_missing_values / total_rows) * 100

# Create a bar chart
plt.figure(figsize=(10, 6))
bar_plot = columns_with_missing_values.plot(kind='bar', color='lightcoral')
plt.xlabel('Columns with Missing Value',fontsize=14)
plt.ylabel('Number of Missing Values',fontsize=14)
plt.title('Number of Missing Values in User Dataset',fontsize=14)
plt.xticks(rotation=0, ha='center',fontsize=14)
plt.yticks(fontsize=10)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Display the percentage of missing values on top of each bar
for index, value in enumerate(columns_with_missing_values):
    plt.text(index, value, f'{percentage_missing.iloc[index]:.2f}%', ha='center', va='bottom',fontsize=10)

plt.show()

### What did you know about your dataset?

**Ans:** The datasets contain no duplicated rows, but they do contain some missing values, which need to be handled either with the `fillna` method or by dropping the affected rows so that the analysis runs on a clean dataset. Most of the missing values are in the age column of the users dataset. Most of the features are stored as objects or floats and should be converted to the required datatype where necessary. After this cleaning, the dataset will be ready for the preprocessing steps, so the focus can move to feature engineering and model development.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
# Book Data
book_data.columns

In [None]:
# User Data
users_data.columns

In [None]:
# ratings_data
ratings_data.columns

In [None]:
# Dataset Describe
# Book Data
book_data.describe().T

In [None]:
# User Data
users_data.describe().T

In [None]:
# ratings_data
ratings_data.describe().T

### Variables Description

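**Users**

* `User-ID`: unique identifier of the user.
* `Location`: location of the user, stored as a comma-separated string whose last part is the country.
* `Age`: age of the user.

**Books**

* `ISBN`: identifier of the book.
* `Book-Title`, `Book-Author`, `Year-Of-Publication`, `Publisher`: content-based information about the book.
* `Image-URL-S`, `Image-URL-M`, `Image-URL-L`: image URL columns, dropped during data wrangling.

**Ratings**

* `User-ID`: identifier of the user who gave the rating.
* `ISBN`: identifier of the rated book.
* `Book-Rating`: the rating given by the user.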

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
# Book Data
book_data.nunique()

In [None]:
# User Data
users_data.nunique()

In [None]:
# ratings_data
ratings_data.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

* Data Cleaning of book dataset

In [None]:
# Rename the book columns to short, consistent names
book_data.rename(columns = {'Book-Title':'title', 'Book-Author':'author', 'Year-Of-Publication':'year', 'Publisher':'publisher'}, inplace=True)

# Drop the image URL columns, which are not needed for the analysis
book_data.drop(['Image-URL-S', 'Image-URL-M', 'Image-URL-L'], axis= 1, inplace= True)

In [None]:
book_data.info()

In [None]:
book_data.isnull().sum()

In [None]:
# Rows with missing values in the 'author' column
book_data.loc[(book_data['author'].isnull()),: ]

In [None]:
# nan values in publisher column
book_data.loc[(book_data['publisher'].isnull()),: ]

In [None]:
# Unique values of the 'year' column (formerly 'Year-Of-Publication')
book_data['year'].unique()

In [None]:
# Extracting rows with year column="DK Publishing Inc"
book_data[book_data['year'] == 'DK Publishing Inc']

In [None]:
# Extracting rows with year column="Gallimard"
book_data[book_data['year'] == 'Gallimard']

In [None]:
book_data.loc[187689]

In [None]:
book_data.loc[221678]

In [None]:
book_data.loc[209538]

In [None]:
book_data.loc[220731]

* Let's fix the column and make it in correct format as per our dataset.

In [None]:
# Helper to fix mismatched values in the 'title', 'author', 'year' and 'publisher' columns
def replace_df_value(df, idx, col_name, val):
  """Set df.loc[idx, col_name] to val in place and return the dataframe."""
  df.loc[idx, col_name] = val
  return df

In [None]:
replace_df_value(book_data, 209538, 'title', 'DK Readers: Creating the X-Men, How It All Began (Level 4: Proficient Readers)')
replace_df_value(book_data, 209538, 'author', 'Michael Teitelbaum')
replace_df_value(book_data, 209538, 'year', 2000)
replace_df_value(book_data, 209538, 'publisher', 'DK Publishing Inc')

replace_df_value(book_data, 221678, 'title', 'DK Readers: Creating the X-Men, How Comic Books Come to Life (Level 4: Proficient Readers)')
replace_df_value(book_data, 221678, 'author', 'James Buckley')
replace_df_value(book_data, 221678, 'year', 2000)
replace_df_value(book_data, 221678, 'publisher', 'DK Publishing Inc')

replace_df_value(book_data, 220731, 'title', "Peuple du ciel, suivi de 'Les Bergers'")
replace_df_value(book_data, 220731, 'author', 'Jean-Marie Gustave Le Clézio')
replace_df_value(book_data, 220731, 'year', 2003)
replace_df_value(book_data, 220731, 'publisher', 'Gallimard')

In [None]:
book_data.loc[209538]

In [None]:
book_data.loc[221678]

In [None]:
book_data.loc[220731]

In [None]:
book_data['year'].unique()

In [None]:
# Change the datatype of year column from object to int
book_data['year'] = book_data['year'].astype(int)

In [None]:
book_data.info()

* Data Cleaning of user dataset

In [None]:
# Renaming the columns
users_data.rename(columns = {'User-ID':'user_id', 'Location':'location', 'Age':'age'}, inplace=True)

In [None]:
users_data.info()

* Data Cleaning of ratings dataset

In [None]:
# Renaming the columns
ratings_data.rename(columns = {'User-ID':'user_id', 'Book-Rating':'rating'}, inplace=True)

In [None]:
ratings_data.info()

In [None]:
ratings_data['rating'].unique()

In [None]:
ratings_data['user_id'].value_counts()

**A limitation of using the full ratings dataset**

If we model all books and all users, the results will be unreliable: a user who has only registered on the website, or who has rated only one or two books, gives us too little information to base recommendations for others on. We therefore restrict the data to users who have rated more than 200 books, and to books that have received at least 50 ratings from those users.
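
The two thresholds can be sketched in plain Python on toy data. The function name and the tiny thresholds are hypothetical; the comparisons (`>` for users, `>=` for books, with book counts taken after the user filter) mirror the ones used in the notebook's code cells.

```python
from collections import Counter

def filter_ratings(ratings, min_user_ratings, min_book_ratings):
    """Keep ratings from active users, then keep books still rated often enough."""
    user_counts = Counter(user for user, _, _ in ratings)
    active = [r for r in ratings if user_counts[r[0]] > min_user_ratings]
    # Book counts are computed after the user filter, as in the notebook
    book_counts = Counter(isbn for _, isbn, _ in active)
    return [r for r in active if book_counts[r[1]] >= min_book_ratings]

# Hypothetical toy ratings: (user_id, ISBN, rating)
toy_ratings_data = [
    ("u1", "b1", 8), ("u1", "b2", 7), ("u1", "b3", 9),
    ("u2", "b1", 6), ("u2", "b2", 9), ("u2", "b4", 5),
    ("u3", "b1", 10),                     # u3 has too few ratings and is dropped
]
kept = filter_ratings(toy_ratings_data, min_user_ratings=2, min_book_ratings=2)
print(kept)
```

With these toy thresholds only the ratings of `u1` and `u2` for `b1` and `b2` survive, since `b3` and `b4` are each rated once among the active users.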

**Extract users who have rated more than 200 books**

In [None]:
x = ratings_data['user_id'].value_counts() > 200

In [None]:
y = x[x].index  # user_ids
print(y.shape)

In [None]:
ratings_data = ratings_data[ratings_data['user_id'].isin(y)]

In [None]:
ratings_data.shape

So there are about 900 such users, and together they have given roughly 5.2 lakh (520,000) ratings.

Count the number of ratings each book has received

In [None]:
number_rating = ratings_data.groupby('ISBN').count()['rating'].reset_index()
number_rating.rename(columns= {'rating':'number_of_ratings'}, inplace=True)
number_rating

In [None]:
final_rating=ratings_data.merge(number_rating,on='ISBN')
final_rating

Keep books that have received at least 50 ratings.

In [None]:
final_rating = final_rating[final_rating['number_of_ratings'] >= 50]

In [None]:
# Drop duplicated (user, book) pairs; reassign instead of using inplace=True on a filtered slice
final_rating = final_rating.drop_duplicates(['user_id','ISBN'])

**Merge ratings with books_data & user_data**

> Merge the ratings with the books on `ISBN` so that every rating row also carries the title, author, year, and publisher of the rated book.

> Merge the result with the users on `user_id` so that every rating row also carries the age and location of the user who gave it. Both merges only attach columns; user-book pairs without a rating are not filled with zeros at this stage.



In [None]:
rating_with_books = final_rating.merge(book_data, on='ISBN')

* **Merge Final Rating Dataset with the Users Dataset**

In [None]:
rating_book_users=rating_with_books.merge(users_data,on='user_id')

In [None]:
rating_book_users.head()

In [None]:
# Extract 'country' values from 'location' column
rating_book_users['country'] = rating_book_users['location'].str.split(',').str[-1].str.strip()

# Drop the 'location' column
rating_book_users.drop(columns=['location'], inplace=True)

In [None]:
rating_book_users['country'].unique()

In [None]:
# Replace 'usa' with 'us' (substring match), and turn 'n/a' and empty strings into NaN in the 'country' column
rating_book_users['country'] = rating_book_users['country'].str.replace('usa','us').replace('n/a',np.nan).replace('', np.nan)

# Display the modified 'country' column
rating_book_users['country'].unique()

In [None]:
rating_book_users.tail()

In [None]:
rating_book_users.shape

Finally, we have a dataset restricted to users who have rated more than 200 books and to books that have received at least 50 ratings. The final dataframe has 59850 rows and 10 columns.

In [None]:
rating_book_users.info()

In [None]:
rating_book_users.loc[41803]

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(14,6))
ax=sns.countplot(x="rating",palette = 'Paired',data= rating_book_users)
plt.title('Count of Each Ratings',fontsize=15)
plt.xlabel('Rating',fontsize=15)
plt.ylabel('Count',fontsize=15)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Add value annotations to the bars
for p in ax.patches:
    ax.annotate(format(p.get_height(), '.0f'),
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', fontsize=10, color='black', xytext=(0, 5),
                textcoords='offset points')

plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(10,6))
# histplot with a KDE overlay replaces the deprecated sns.distplot
sns.histplot(x=rating_book_users['age'], kde=True)
plt.title('Age Distribution\n',fontsize=15)
plt.xlabel('Age',fontsize=15)
plt.ylabel('Count',fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Chart - 3 visualization code
sns.boxplot(x=rating_book_users['age'])  # pass the column by keyword so the plot does not depend on the seaborn version
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code
sns.boxplot(x=rating_book_users['year'])  # pass the column by keyword so the plot does not depend on the seaborn version
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code
sns.boxplot(x=rating_book_users['number_of_ratings'])  # pass the column by keyword so the plot does not depend on the seaborn version
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code
sns.boxplot(x=rating_book_users['rating'])  # pass the column by keyword so the plot does not depend on the seaborn version
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(15,6))
ax=sns.countplot(data=rating_book_users, y="author", palette = 'Paired', order=rating_book_users['author'].value_counts().index[0:20])
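# Each row of rating_book_users is one rating, so the bars count ratings rather than distinct books
plt.title("Top 20 Authors by Number of Ratings",fontsize=15)
plt.xlabel("Number of Ratings",fontsize=15)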
plt.ylabel("Author Name",fontsize=15)

# Add values on top of each bar
for p in ax.patches:
    width = p.get_width()
    plt.text(width + 40, p.get_y() + p.get_height() / 2, f'{int(width)}',
             ha='center', va='center', fontsize=10, color='black')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(15, 6))
ax = sns.countplot(data=rating_book_users, y="publisher", palette='Paired',
                   order=rating_book_users['publisher'].value_counts().index[0:20])

# Set the title
# Each row of rating_book_users is one rating, so the bars count ratings rather than distinct books
plt.title("Top 20 Publishers by Number of Ratings", fontsize=15)
plt.xlabel("Number of Ratings", fontsize=15)
plt.ylabel("Publishers Name", fontsize=15)

# Adding values of each bar
for p in ax.patches:
    width = p.get_width()
    plt.text(width + 80, p.get_y() + p.get_height() / 2, f'{int(width)}',
             ha='center', va='center', fontsize=10, color='black')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(12, 6))

# Create a countplot for the top 15 books based on the number of ratings
ax = sns.countplot(y="title", palette='Paired', data=rating_book_users, order=rating_book_users['title'].value_counts().index[0:15])
plt.title("Top 15 Books by Number of Ratings", fontsize=15)
plt.xlabel("Total Number of Ratings Given", fontsize=15)
plt.ylabel("Book Title", fontsize=15)
plt.xticks(fontsize=10)
plt.yticks(fontsize=8)

# Adding values on top of each bar
for p in ax.patches:
    width = p.get_width()
    plt.text(width + 7, p.get_y() + p.get_height() / 2, f'{int(width)}',
             ha='center', va='center', fontsize=10, color='black')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(16, 10))

# Create a countplot of ratings grouped by the publication year of the rated book
ax=sns.countplot(data=rating_book_users, x="year", palette='Paired', order=sorted(rating_book_users['year'].unique()))

# Set the title and labels
plt.title("Number of Ratings by Year of Publication")
plt.xlabel("Year",fontsize=15)
plt.ylabel("Number of Ratings",fontsize=15)
plt.xticks(rotation=90)

# Add values on top of each bar
for p in ax.patches:
  ax.annotate(f'{int(p.get_height())}', (p.get_x() + p.get_width() / 2, p.get_height()+30),
                ha='center', va='bottom', fontsize=10, color='black')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

plt.figure(figsize=(15, 6))

# Keep only rows with a valid publication year (drop rows where 'year' is 0)
filtered_final_data = rating_book_users[rating_book_users['year'] != 0]

sns.lineplot(x='year', y='number_of_ratings', data=filtered_final_data)

plt.title("Number of Ratings Over the Years", fontsize=15)
plt.xlabel("Year", fontsize=15)
plt.ylabel("Number of Ratings", fontsize=15)

plt.xticks(range(min(filtered_final_data['year']), max(filtered_final_data['year'])+1, 5))

plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Group the data by 'country' and count the unique 'author' values in each group
country_author_counts = rating_book_users.groupby('country')['author'].nunique()

# Create a bar plot
plt.figure(figsize=(14, 7))
ax=country_author_counts.plot(kind='bar',color='skyblue')
plt.title('Number of Unique Authors by Country',fontsize=15)
plt.xlabel('Country',fontsize=15)
plt.ylabel('Number of Unique Authors',fontsize=15)
plt.xticks(rotation=90,fontsize=11)
plt.tight_layout()

# Add value annotations to the bars
for i, v in enumerate(country_author_counts):
  ax.text(i, v + 1, str(v), ha='center', va='bottom', fontsize=11)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

# Group the data by 'country' and count the unique 'publisher' values in each group
country_publisher_counts = rating_book_users.groupby('country')['publisher'].nunique()

# Create a bar plot
plt.figure(figsize=(14, 7))
ax = country_publisher_counts.plot(kind='bar', color='skyblue')
plt.title('Number of Unique Publishers by Country', fontsize=15)
plt.xlabel('Country', fontsize=15)
plt.ylabel('Number of Unique Publishers', fontsize=15)
plt.xticks(rotation=90,fontsize=11)

# Add value annotations to the bars
for i, v in enumerate(country_publisher_counts):
    ax.text(i, v + 0.1, str(v), ha='center', va='bottom', fontsize=11)

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

# Calculate the correlation matrix
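# Restrict to numeric columns; text columns such as title or author cannot be correlated
correlation_matrix = rating_book_users.corr(numeric_only=True)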

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap of rating_book_users", fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(rating_book_users)

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

**Hypothetical Statement 1:** The average age of users from the United States ('us') is different from the average age of users from Canada ('canada').

**Hypothetical Statement 2:** The average age of users who rated books with a rating of 5 is different from the average age of users who rated books with a rating less than 5.

**Hypothetical Statement 3:** The average number of ratings for books with a rating of 5 is significantly higher than the average number of ratings for books with a rating less than 5.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0):** The average age of users from the USA is equal to the average age of users from Canada.

**Alternative Hypothesis (H1):** The average age of users from the USA is not equal to the average age of users from Canada.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

# Separate the data for users from the USA and Canada
age_usa = rating_book_users[rating_book_users['country'] == 'us']['age'].dropna()
age_canada = rating_book_users[rating_book_users['country'] == 'canada']['age'].dropna()

# Perform a t-test
t_statistic, p_value = stats.ttest_ind(age_usa, age_canada, alternative='two-sided')

# Significance level
alpha = 0.05

# Print results
print("t-statistic:", t_statistic)
print("p-value:", p_value)

if p_value < alpha:
  print("Reject the null hypothesis: The average age of users from the USA is not equal to the average age of users from Canada.")
else:
  print("Fail to reject the null hypothesis: There is not enough evidence that the average age of users from the USA differs from that of users from Canada.")

##### Which statistical test have you done to obtain P-Value?

**Ans:** An independent two-sample t-test (`stats.ttest_ind`) was performed to obtain the p-value.

##### Why did you choose the specific statistical test?

**Ans:** T-test is commonly used to compare the means of two samples or groups to assess whether the observed difference is statistically significant or if it could have occurred by chance.
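
To show what the test statistic measures, the sketch below computes the pooled-variance t-statistic by hand. The function name and the toy ages are hypothetical. With its default `equal_var=True`, `stats.ttest_ind` uses this same pooled-variance form; it additionally converts the statistic to a p-value, which this sketch does not do.

```python
import math
import statistics

def pooled_t_statistic(sample_a, sample_b):
    """Student's two-sample t-statistic with pooled variance, plus degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    degrees_of_freedom = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / degrees_of_freedom
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / standard_error, degrees_of_freedom

# Hypothetical toy ages for two groups of users
toy_ages_a = [30, 32, 34, 36, 38]
toy_ages_b = [28, 30, 32, 34, 36]
t_value, dof = pooled_t_statistic(toy_ages_a, toy_ages_b)
print(t_value, dof)
```

The statistic is the difference in means divided by its standard error, so a two-year gap between these small, spread-out groups gives a modest value of about 1.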

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0):** The average age of users who rated books with a rating of 5 is equal to the average age of users who rated books with a rating less than 5.

**Alternative Hypothesis (H1):** The average age of users who rated books with a rating of 5 is not equal to the average age of users who rated books with a rating less than 5.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

# Separate the data for users who rated books with a rating of 5 and users who rated books with a rating less than 5
age_rating_5 = rating_book_users[rating_book_users['rating'] == 5]['age'].dropna()
age_rating_less_than_5 = rating_book_users[rating_book_users['rating'] < 5]['age'].dropna()

# Perform a two-sided t-test, matching the "not equal" alternative hypothesis
t_stat, p_value = stats.ttest_ind(age_rating_5, age_rating_less_than_5, alternative='two-sided')

# Significance level
alpha = 0.05

# Print results
print("t-statistic:", t_stat)
print("p-value:", p_value)

if p_value < alpha:
  print("Reject the null hypothesis: The average age of users who rated books with a rating of 5 is not equal\n"
        "to the average age of users who rated books with a rating less than 5.")

else:
  print("Fail to reject the null hypothesis: There is not enough evidence that the average age of users who rated\n"
        "books with a rating of 5 differs from that of users who rated books with a rating less than 5.")


##### Which statistical test have you done to obtain P-Value?

**Ans:** An independent two-sample t-test (`stats.ttest_ind`) was performed to obtain the p-value.

##### Why did you choose the specific statistical test?

**Ans:** T-test is commonly used to compare the means of two samples or groups to assess whether the observed difference is statistically significant or if it could have occurred by chance.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0):** The average number of ratings for books with a rating of 5 is equal to the average number of ratings for books with a rating less than 5.

**Alternative Hypothesis (H1):** The average number of ratings for books with a rating of 5 is higher than the average number of ratings for books with a rating less than 5.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

# Separate the data for books with a rating of 5 and books with a rating less than 5
num_ratings_rating_5 = rating_book_users[rating_book_users['rating'] == 5]['number_of_ratings']
num_ratings_rating_less_than_5 = rating_book_users[rating_book_users['rating'] < 5]['number_of_ratings']

# Perform a one-tailed t-test (greater)
t_stat, p_value = stats.ttest_ind(num_ratings_rating_5, num_ratings_rating_less_than_5, alternative='greater')

# Significance level
alpha = 0.05

# Print results
print("t-statistic:", t_stat)
print("p-value:", p_value)

if p_value < alpha:
    print("Reject the null hypothesis: The average number of ratings for books with a rating of 5\
          \nis higher than the average number of ratings for books with a rating less than 5.")
else:
    print("Fail to reject the null hypothesis: The average number of ratings for books with a rating of 5\
            \nis equal to the average number of ratings for books with a rating less than 5.")


##### Which statistical test have you done to obtain P-Value?

**Ans:** An independent two-sample t-test (one-tailed, `alternative='greater'`) was performed to obtain the p-value.

##### Why did you choose the specific statistical test?

**Ans:** T-test is commonly used to compare the means of two samples or groups to assess whether the observed difference is statistically significant or if it could have occurred by chance.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Making copy of original dataframe
df=final_rating.copy()

In [None]:
df.columns

In [None]:
df.head()

In [None]:
df.info()

In [None]:
# Handling Missing Values & Missing Value Imputation
df.isnull().sum()

In [None]:
book_data.isnull().sum()

In [None]:
mode_publisher = book_data['publisher'].mode()[0]
mode_author = book_data['author'].mode()[0]

# Fill missing values in 'publisher' and 'author' columns with their respective modes
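# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to the original frame
book_data['publisher'] = book_data['publisher'].fillna(mode_publisher)
book_data['author'] = book_data['author'].fillna(mode_author)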

In [None]:
book_data.isnull().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

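**Ans:** Missing values in the `publisher` and `author` columns of `book_data` were filled with the most frequent value (mode) of each column. Mode imputation was used because both columns are categorical, so a mean or median is not defined for them.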
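The mode imputation applied to `book_data` above can be illustrated on a tiny, made-up frame. The `books` DataFrame and its values are hypothetical; only the `mode()` and `fillna()` calls mirror the notebook.

```python
import pandas as pd

# Hypothetical mini catalogue with gaps in two categorical columns
books = pd.DataFrame({
    "ISBN": ["111", "222", "333", "444"],
    "author": ["Rowling", None, "Rowling", "Tolkien"],
    "publisher": ["Scholastic", "Scholastic", None, "Houghton"],
})

for column in ["author", "publisher"]:
    most_common = books[column].mode()[0]              # most frequent non-missing value
    books[column] = books[column].fillna(most_common)  # assign back rather than editing in place

print(books)
```

After the loop, the missing author becomes "Rowling" and the missing publisher becomes "Scholastic", the most frequent value in each column.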

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
sns.boxplot(df)

##### What all outlier treatment techniques have you used and why did you use those techniques?

**Ans:** There is no outlier in the dataset.

### 3. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
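# Average rating per book; select 'rating' before aggregating so non-numeric columns are not averaged
avg_rating=df.groupby('ISBN')['rating'].mean().reset_index()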
avg_rating.rename(columns= {'rating':'avg_ratings'}, inplace=True)
avg_rating.head()

Merging avg_rating dataset with original df dataset on 'ISBN'

In [None]:
avg_rating_df=df.merge(avg_rating,on='ISBN')
avg_rating_df.head()

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

In [None]:
df=book_data.copy()

#### 1. Expand Contraction

In [None]:
df.head()

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing
def string_lower(word):
  return word.lower()
df['title']=df['title'].apply(string_lower)
df['author']=df['author'].apply(string_lower)
df['publisher']=df['publisher'].apply(string_lower)

In [None]:
df.head()

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
[punc for punc in string.punctuation]

In [None]:
def remove_punc(text):
  nopunc =[char for char in text if char not in string.punctuation]
  nopunc=''.join(nopunc)
  return nopunc
df['title']=df['title'].apply(remove_punc)
df['author']=df['author'].apply(remove_punc)
df['publisher']=df['publisher'].apply(remove_punc)

#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words containing digits

# Function to remove digit characters from a text string
def remove_digits(text):
  return ''.join([char for char in text if not char.isdigit()])

# Apply the remove_digits function to the 'title', 'author' and 'publisher' columns
df['title']=df['title'].apply(remove_digits)
df['author']=df['author'].apply(remove_digits)
df['publisher']=df['publisher'].apply(remove_digits)

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopword

def remove_stopwords(sentence, language='english'):
  # Get the list of stopwords for the specified language
  stop_words = set(stopwords.words(language))
  words = sentence.split()

  # Remove stopwords from the list of words
  filtered_words = [word for word in words if word not in stop_words]

  # Join the filtered words to form a sentence without stopwords
  filtered_sentence = ' '.join(filtered_words)
  return filtered_sentence

In [None]:
df['title']=df['title'].apply(remove_stopwords)
df['author']=df['author'].apply(remove_stopwords)
df['publisher']=df['publisher'].apply(remove_stopwords)

In [None]:
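# Remove White spaces
# Series.replace(" ", "") only matches cells that are exactly " ", so use the .str accessor
# to collapse repeated whitespace and strip leading/trailing spaces instead
df['title']=df['title'].str.replace(r'\s+', ' ', regex=True).str.strip()
df['author']=df['author'].str.replace(r'\s+', ' ', regex=True).str.strip()
df['publisher']=df['publisher'].str.replace(r'\s+', ' ', regex=True).str.strip()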
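The cleaning steps above (lower-casing, punctuation removal, digit removal, stopword removal, whitespace cleanup) can be sketched as a single function. This is a minimal illustration, not the notebook's pipeline: the function name `clean_text` and the tiny `DEMO_STOPWORDS` set are assumptions standing in for NLTK's English stopword list.

```python
import re
import string

# Hypothetical, tiny stopword set used only for this demo
DEMO_STOPWORDS = {"the", "of", "a", "and", "in"}

def clean_text(text, stop_words=DEMO_STOPWORDS):
    text = text.lower()                                               # 1. lower-case
    text = text.translate(str.maketrans("", "", string.punctuation))  # 2. drop punctuation
    text = re.sub(r"\d", "", text)                                    # 3. drop digits
    words = [w for w in text.split() if w not in stop_words]          # 4. drop stopwords
    return " ".join(words)                                            # 5. single spaces, no padding

print(clean_text("  The Lord of the Rings:  Part 2 (1954) "))
```

The call prints `lord rings part`. Splitting on whitespace and re-joining with a single space handles the whitespace cleanup without fusing neighbouring words.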

#### 6. Rephrase Text

In [None]:
# Rephrase Text
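# Create a new column and concatenate the text columns into it, separated by spaces
# (without a separator the last word of one column fuses with the first word of the next)
df['tags']=df['title']+' '+df['author']+' '+df['publisher']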
df.head()

#### 7. Tokenization

In [None]:
# Tokenize the 'tags' column using nltk
df['tokenized_tags'] = df['tags'].apply(word_tokenize)

# Display the result
df

In [None]:
# Creating a new dataframe
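# .copy() makes this an independent frame, so later column assignments do not trigger SettingWithCopyWarning
book_data_new=df[['ISBN','tokenized_tags']].copy()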
book_data_new

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

# Create lemmatizer objects
lemmatizer = WordNetLemmatizer()



# Define a function to lemmatize and join tokens
def lemmatize_and_join(tokens):
  # Lemmatize each token and join them back into a single string
  lemmatized_text = " ".join([lemmatizer.lemmatize(token) for token in tokens])

  return lemmatized_text

book_data_new['tokenized_tags']=book_data_new['tokenized_tags'].apply(lemmatize_and_join)
book_data_new.head()


##### Which text normalization technique have you used and why?

**Ans:** Lemmatization is used for text normalization because it produces more linguistically correct and readable words than stemming.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text
stopwords_list = stopwords.words('french') + stopwords.words('portuguese') + stopwords.words('spanish') + stopwords.words('german')+ stopwords.words('finnish')+ stopwords.words('swedish')

# Fit a TF-IDF model with at most 5000 features, built from the unigrams and bigrams found in the corpus, ignoring stopwords
vectorizer = TfidfVectorizer(analyzer='word',
                     ngram_range=(1, 2),
                     min_df=0.03,
                     max_df=0.6,
                     max_features=5000,
                     stop_words=stopwords_list)
tfidf_matrix = vectorizer.fit_transform(book_data_new['tokenized_tags']).toarray()
tfidf_matrix

##### Which text vectorization technique have you used and why?

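**Ans:** TF-IDF vectorization (`TfidfVectorizer`) over unigrams and bigrams is used. TF-IDF gives a higher weight to tokens that are frequent in one book's tags but rare across the corpus, which makes the resulting vectors suitable for comparing books by content.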

In [None]:
print(vectorizer.get_feature_names_out())

In [None]:
tfidf_matrix.shape

### 7. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(avg_rating_df, test_size=0.2,random_state=42)

In [None]:
print(f'Training set lengths: {len(train_data)}')
print(f'Testing set lengths: {len(test_data)}')
print(f'Test set is {(len(test_data)/(len(train_data)+len(test_data))*100):.0f}% of the full dataset.')

##### What data splitting ratio have you used and why?

**Ans:** The test size is set to 20% because it is common practice to keep 80% of the data for training and 20% for testing.

In [None]:
#Indexing by user_id to speed up the searches during evaluation
interactions_full_indexed_df = avg_rating_df.set_index('user_id')
interactions_train_indexed_df =train_data.set_index('user_id')
interactions_test_indexed_df = test_data.set_index('user_id')

In [None]:
train_data.head()

## ***7. ML Model Implementation***

### ML Model - 1-Collaborative Filtering Method

In [None]:
#Creating a pivot table with user_id in rows and ISBN in columns
users_items_pivot_matrix_df = train_data.pivot_table(columns='ISBN', index='user_id', values="avg_ratings")

In [None]:
users_items_pivot_matrix_df.shape

In [None]:
users_items_pivot_matrix_df.head()

In [None]:
users_items_pivot_matrix_df.fillna(0, inplace=True)

In [None]:
users_items_pivot_matrix_df.head()

In [None]:
users_items_pivot_matrix=users_items_pivot_matrix_df.values
users_items_pivot_matrix[:10]

In [None]:
user_id = list(users_items_pivot_matrix_df.index)
user_id[:10]

In [None]:
# The number of factors to factor the user-item matrix.
NUMBER_OF_FACTORS_MF = 15

#Performs matrix factorization of the original user item matrix
U, sigma, Vt = svds(users_items_pivot_matrix, k = NUMBER_OF_FACTORS_MF)

In [None]:
users_items_pivot_matrix.shape

In [None]:
U.shape

In [None]:
sigma = np.diag(sigma)
sigma.shape

In [None]:
Vt.shape

In [None]:
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt)
all_user_predicted_ratings

In [None]:
all_user_predicted_ratings.shape

In [None]:
#Converting the reconstructed matrix back to a Pandas dataframe
cf_preds_df = pd.DataFrame(all_user_predicted_ratings, columns = users_items_pivot_matrix_df.columns, index=user_id).transpose()
cf_preds_df.head()
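As a rough illustration of what the factorization does, the sketch below builds a low-rank approximation of a tiny, made-up user-item matrix. It uses `numpy.linalg.svd` and keeps the top `k` singular values by hand instead of calling `svds`, purely to keep the example small; the matrix values and `k = 2` are assumptions.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are books, 0 means "not rated"
ratings = np.array([
    [8.0, 7.0, 0.0, 0.0],
    [7.0, 8.0, 0.0, 1.0],
    [0.0, 0.0, 9.0, 8.0],
    [0.0, 1.0, 8.0, 9.0],
])

k = 2  # number of latent factors to keep

# Full SVD returns the singular values in descending order
U_full, s_full, Vt_full = np.linalg.svd(ratings, full_matrices=False)

# Keep only the k strongest factors and rebuild the matrix
approx = U_full[:, :k] @ np.diag(s_full[:k]) @ Vt_full[:k, :]

print(np.round(approx, 2))
```

Cells that were 0 in the input can receive non-zero scores in `approx`; reconstructed values of this kind are what the recommender ranks for each user.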

In [None]:
len(cf_preds_df.columns)

In [None]:
class CFRecommender:

    MODEL_NAME = 'Collaborative Filtering'

    def __init__(self, cf_predictions_df, items_df=None):
        self.cf_predictions_df = cf_predictions_df
        self.items_df = items_df

    def get_model_name(self):
        return self.MODEL_NAME

    def recommend_items(self, user_id, items_to_ignore=[], topn=10, verbose=False):
        # Get and sort the user's predictions
        sorted_user_predictions = self.cf_predictions_df[user_id].sort_values(ascending=False).reset_index().rename(columns={user_id: 'recStrength'})

        # Recommend the highest predicted rating content that the user hasn't seen yet.
        recommendations_df = sorted_user_predictions[~sorted_user_predictions['ISBN'].isin(items_to_ignore)].sort_values('recStrength', ascending = False).head(topn)

        if verbose:
            if self.items_df is None:
                raise Exception('"items_df" is required in verbose mode')

            recommendations_df = recommendations_df.merge(self.items_df, how = 'left',
                                                          left_on = 'ISBN',
                                                          right_on = 'ISBN')[['recStrength', 'ISBN','title']]


        return recommendations_df

cf_recommender_model = CFRecommender(cf_preds_df,book_data)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

The Top-N accuracy metric chosen was Recall@N, which evaluates whether the interacted item is among the top N items (a hit) in a ranked list of 101 candidates for a user: the interacted item plus 100 randomly sampled items the user has not interacted with.
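The protocol can be sketched in a few lines: for one held-out book, sample 100 books the user has not interacted with, rank those 101 candidates by the model's score, and count a hit if the held-out book lands in the top N. Everything here (the function name `hit_at_n`, the toy catalogue, and the scores) is hypothetical and only illustrates the metric.

```python
import random

def hit_at_n(held_out_item, scores, non_interacted_items, n, sample_size=100, seed=42):
    """Return 1 if held_out_item ranks in the top n among itself plus sampled negatives."""
    rng = random.Random(seed)
    # Sorting first makes the sample reproducible regardless of set ordering
    negatives = rng.sample(sorted(non_interacted_items), sample_size)
    candidates = negatives + [held_out_item]
    # Highest predicted score first
    ranked = sorted(candidates, key=lambda item: scores[item], reverse=True)
    return int(held_out_item in ranked[:n])

# Toy catalogue of 500 books whose score simply equals the numeric id
catalogue = [f"book_{i:03d}" for i in range(500)]
scores = {item: float(i) for i, item in enumerate(catalogue)}

held_out = ["book_499", "book_000"]       # books the user rated in the test set
negatives_pool = set(catalogue) - set(held_out)

hits = sum(hit_at_n(item, scores, negatives_pool, n=10) for item in held_out)
recall_at_10 = hits / len(held_out)
print(recall_at_10)
```

Here `recall_at_10` is 0.5: `book_499` has the highest score in the catalogue and is always a hit, while `book_000` has the lowest score and never reaches the top 10.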

In [None]:
def get_items_interacted(user_id, ratings_data):
    interacted_items = ratings_data.loc[user_id]['ISBN']
    return set(interacted_items if type(interacted_items) == pd.Series else [interacted_items])

In [None]:
#Top-N accuracy metrics consts
EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS = 100

class ModelEvaluator:

    # Function for getting the set of items which a user has not interacted with
    def get_not_interacted_items_sample(self, user_id, sample_size, seed=42):
        interacted_items = get_items_interacted(user_id, interactions_full_indexed_df)
        all_items = set(book_data['ISBN'])
        non_interacted_items = all_items - interacted_items

        random.seed(seed)
        # random.sample() no longer accepts a set (Python 3.11+), so convert to a list first
        non_interacted_items_sample = random.sample(list(non_interacted_items), sample_size)
        return set(non_interacted_items_sample)

    # Function to verify whether a particular item_id was present in the set of top N recommended items
    def _verify_hit_top_n(self, item_id, recommended_items, topn):
            try:
                index = next(i for i, c in enumerate(recommended_items) if c == item_id)
            except StopIteration:
                index = -1
            hit = int(index in range(0, topn))
            return hit, index

    # Function to evaluate the performance of model for each user
    def evaluate_model_for_user(self, model, user_id):
      try:

        # Getting the items in test set
        interacted_values_testset = interactions_test_indexed_df.loc[user_id]

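        if isinstance(interacted_values_testset['ISBN'], pd.Series):
            person_interacted_items_testset = set(interacted_values_testset['ISBN'])
        else:
            # A single interaction yields a scalar ISBN; wrap it so set() does not split the string into characters
            person_interacted_items_testset = set([interacted_values_testset['ISBN']])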

        interacted_items_count_testset = len(person_interacted_items_testset)

        # Getting a ranked recommendation list from the model for a given user
        person_recs_df = model.recommend_items(user_id, items_to_ignore=get_items_interacted(user_id, interactions_train_indexed_df),topn=10000000000)

        hits_at_5_count = 0
        hits_at_10_count = 0

        # For each item the user has interacted in test set
        for item_id in person_interacted_items_testset:

            # Getting a random sample of 100 items the user has not interacted with
            non_interacted_items_sample = self.get_not_interacted_items_sample(user_id, sample_size=EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS)

            # Combining the current interacted item with the 100 random items
            items_to_filter_recs = non_interacted_items_sample.union(set([item_id]))

            # Filtering only recommendations that are either the interacted item or from a random sample of 100 non-interacted items
            valid_recs_df = person_recs_df[person_recs_df['ISBN'].isin(items_to_filter_recs)]
            valid_recs = valid_recs_df['ISBN'].values

            # Verifying if the current interacted item is among the Top-N recommended items
            hit_at_5, index_at_5 = self._verify_hit_top_n(item_id, valid_recs, 5)
            hits_at_5_count += hit_at_5
            hit_at_10, index_at_10 = self._verify_hit_top_n(item_id, valid_recs, 10)
            hits_at_10_count += hit_at_10

        # Recall is the rate of the interacted items that are ranked among the Top-N recommended items
        recall_at_5 = hits_at_5_count / float(interacted_items_count_testset)
        recall_at_10 = hits_at_10_count / float(interacted_items_count_testset)

        user_metrics = {'hits@5_count':hits_at_5_count,
                          'hits@10_count':hits_at_10_count,
                          'interacted_count': interacted_items_count_testset,
                          'recall@5': recall_at_5,
                          'recall@10': recall_at_10}
        return user_metrics
      except KeyError:
        # A KeyError here typically means the user has no training interactions or no column in the model's predictions
        print(f"User with user_id {user_id} could not be evaluated (missing from the training data or the model's predictions).")
        return {'hits@5_count': 0, 'hits@10_count': 0, 'interacted_count': 0, 'recall@5': 0, 'recall@10': 0}


    # Function to evaluate the performance of model at overall level
    def evaluate_model(self, model):

        people_metrics = []

        for idx, user_id in enumerate(list(interactions_test_indexed_df.index.unique().values)):
            person_metrics = self.evaluate_model_for_user(model, user_id)
            person_metrics['_user_id'] = user_id
            people_metrics.append(person_metrics)

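        # idx is the last zero-based index, so report the number of collected results instead
        print('{0} users processed' .format(len(people_metrics)))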

        detailed_results_df = pd.DataFrame(people_metrics).sort_values('interacted_count', ascending=False)

        global_recall_at_5 = detailed_results_df['hits@5_count'].sum() / float(detailed_results_df['interacted_count'].sum())
        global_recall_at_10 = detailed_results_df['hits@10_count'].sum() / float(detailed_results_df['interacted_count'].sum())

        global_metrics = {'modelName': model.get_model_name(),
                          'recall@5': global_recall_at_5,
                          'recall@10': global_recall_at_10}
        return global_metrics, detailed_results_df

model_evaluator = ModelEvaluator()

In [None]:
print('Evaluating Collaborative Filtering (SVD Matrix Factorization) model...')
cf_global_metrics, cf_detailed_results_df = model_evaluator.evaluate_model(cf_recommender_model)

# Move the user_id column to the first position
user_id_column = cf_detailed_results_df['_user_id']  # Extract the user_id column
cf_detailed_results_df = cf_detailed_results_df.drop(columns=['_user_id'])
cf_detailed_results_df.insert(0, '_user_id', user_id_column)

print('\nGlobal metrics:\n{}'.format(cf_global_metrics))
cf_detailed_results_df.head(10)

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### ML Model - 2 Content Based Filtering

Represent each book as a TF-IDF vector computed from its title, author, and publisher.
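To show what the TF-IDF weights represent, here is a small pure-Python sketch on three made-up book strings. It uses the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation, which corresponds to the documented defaults of scikit-learn's `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`). The helper `tfidf_vectors` and the corpus are illustrative assumptions, and n-grams, `min_df`, and `max_df` are left out.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, list of L2-normalised TF-IDF vectors) for whitespace-tokenised docs."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for doc in tokenised for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears
    doc_freq = Counter(tok for doc in tokenised for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}

    vectors = []
    for doc in tokenised:
        counts = Counter(doc)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0  # avoid division by zero
        vectors.append([x / norm for x in raw])
    return vocab, vectors

# Hypothetical "title author publisher" strings
corpus = ["harry potter rowling scholastic",
          "harry potter chamber rowling scholastic",
          "hobbit tolkien houghton"]
vocab, vectors = tfidf_vectors(corpus)

# Tokens shared by several books get a lower weight than distinctive ones
first = dict(zip(vocab, vectors[0]))
print(round(first["harry"], 3), round(vectors[1][vocab.index("chamber")], 3))
```

In the second string, "chamber" appears in only one book and therefore receives a larger weight than "harry", which appears in two.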

In [None]:
!pip install langdetect

In [None]:
#Ignoring stopwords (words with no semantics) from English
from langdetect import detect
stopwords_list = stopwords.words('english')

# Fit a TF-IDF model with at most 5000 features, built from the unigrams and bigrams found in the corpus, ignoring stopwords
vectorizer = TfidfVectorizer(analyzer='word',ngram_range=(1, 2),min_df=0.003,max_df=0.5,max_features=5000,stop_words=stopwords_list)

item_ids = book_data['ISBN'].tolist()
# Join the text columns with spaces so that adjacent words are not fused into a single token
tfidf_matrix = vectorizer.fit_transform(book_data['title'] + " " + book_data['author'] + " " + book_data['publisher'])
tfidf_feature_names = vectorizer.get_feature_names_out()


In [None]:
tfidf_matrix.shape

In [None]:
# stopwords_list

In [None]:
# tfidf_feature_names

To model the user profile, we take the profiles of all items the user has interacted with and average them. The average is weighted by the interaction strength; in other words, the books with the highest interaction strength (`avg_ratings` here) contribute most to the final user profile.
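A minimal sketch of this weighted average, using plain Python lists in place of sparse TF-IDF rows. The three-token item profiles, the strengths, and the helper names `build_profile` and `cosine` are all hypothetical.

```python
import math

def build_profile(item_vectors, strengths):
    """Weighted average of item vectors, rescaled to unit length."""
    total = sum(strengths)
    dims = len(item_vectors[0])
    avg = [sum(w * vec[d] for vec, w in zip(item_vectors, strengths)) / total
           for d in range(dims)]
    norm = math.sqrt(sum(x * x for x in avg))
    return [x / norm for x in avg]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical 3-token item profiles: [fantasy, wizard, cooking]
book_a = [1.0, 0.0, 0.0]
book_b = [0.0, 1.0, 0.0]
book_c = [0.0, 0.0, 1.0]

# The user interacted with book_a (strength 8) and book_b (strength 2)
profile = build_profile([book_a, book_b], strengths=[8, 2])
print([round(x, 3) for x in profile])
print(cosine(profile, book_a) > cosine(profile, book_b) > cosine(profile, book_c))
```

The profile leans towards `book_a` because of its larger weight, so its cosine similarity to the profile is highest; `book_c`, which shares no tokens with the user's books, scores zero.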

In [None]:
def get_item_profile(item_id):
  try:
    idx = item_ids.index(item_id)
    item_profile = tfidf_matrix[idx:idx+1]
    return item_profile

  except ValueError:
    # Handle the case where the item_id is not found
    print(f"Item with ISBN '{item_id}' not found in the list.")
    return None

def get_item_profiles(ids):
  item_profiles_list = [get_item_profile(x) for x in ids]
  item_profiles = scipy.sparse.vstack(item_profiles_list)
  return item_profiles


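from sklearn.preprocessing import normalize

def build_users_profile(user_id, avg_indexed_df):
  # .loc[[user_id]] always returns a DataFrame, even when the user has a single interaction
  interactions_person_df = avg_indexed_df.loc[[user_id]]
  user_item_profiles = get_item_profiles(interactions_person_df['ISBN'])
  user_item_strengths = np.array(interactions_person_df['avg_ratings']).reshape(-1, 1)
  # Strength-weighted sum of the item profiles; dividing by the total weight is
  # unnecessary because the result is scaled to unit length on the next line
  weighted_sum = np.asarray(user_item_profiles.multiply(user_item_strengths).sum(axis=0)).reshape(1, -1)
  return normalize(weighted_sum)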


def build_users_profiles():
  avg_indexed_df = avg_rating_df[avg_rating_df['ISBN'].isin(book_data['ISBN'])].set_index('user_id')
  user_profiles = {}
  for user_id in avg_indexed_df.index.unique():
      user_profiles[user_id] = build_users_profile(user_id, avg_indexed_df)
  return user_profiles


In [None]:
user_profiles = build_users_profiles()
len(user_profiles)

In [None]:
user_profiles

Let's take a look at a particular user profile. It is a unit vector with one entry per TF-IDF feature (at most 5000). The value in each position represents how relevant a token (unigram or bigram) is for the selected user.

In [None]:
user_profile = user_profiles[222488]
print(user_profile.shape)

pd.DataFrame(sorted(zip(tfidf_feature_names,
                        user_profiles[222488].flatten().tolist()), key=lambda x: -x[1])[:20],
             columns=['token', 'relevance'])

### Class for Content-Based Filtering

In [None]:
class ContentBasedRecommender:

    MODEL_NAME = 'Content-Based'

    def __init__(self, items_df=None):
        self.item_ids = item_ids
        self.items_df = items_df

    def get_model_name(self):
        return self.MODEL_NAME

    def _get_similar_items_to_user_profile(self, user_id, topn=1000):

        # Compute the cosine similarity between the user profile and all item profiles
        cosine_similarities = cosine_similarity(user_profiles[user_id], tfidf_matrix)

        # Get the top similar items
        similar_indices = cosine_similarities.argsort().flatten()[-topn:]

        # Sort the similar items by similarity
        similar_items = sorted([(item_ids[i], cosine_similarities[0,i]) for i in similar_indices], key=lambda x: -x[1])
        return similar_items

    def recommend_items(self, user_id, items_to_ignore=[], topn=10, verbose=False):
        similar_items = self._get_similar_items_to_user_profile(user_id)

        #Ignores items the user has already interacted
        similar_items_filtered = list(filter(lambda x: x[0] not in items_to_ignore, similar_items))

        recommendations_df = pd.DataFrame(similar_items_filtered, columns=['ISBN', 'recStrength']).head(topn)

        if verbose:
            if self.items_df is None:
                raise Exception('"items_df" is required in verbose mode')

            recommendations_df = recommendations_df.merge(self.items_df, how = 'left',
                                                          left_on = 'ISBN',
                                                          right_on = 'ISBN')[['recStrength', 'ISBN','title']]


        return recommendations_df

content_based_recommender_model = ContentBasedRecommender(book_data)

## Evaluation

In [None]:
#Top-N accuracy metrics consts
EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS = 100

class ModelEvaluator:

    # Function for getting the set of items which a user has not interacted with
    def get_not_interacted_items_sample(self, user_id, sample_size, seed=42):
      interacted_items = get_items_interacted(user_id, interactions_full_indexed_df)
      all_items = set(book_data['ISBN'])
      non_interacted_items = all_items - interacted_items

      random.seed(seed)
      # random.sample() no longer accepts a set (Python 3.11+), so convert to a list first
      non_interacted_items_sample = random.sample(list(non_interacted_items), sample_size)
      return set(non_interacted_items_sample)

    # Function to verify whether a particular item_id was present in the set of top N recommended items
    def _verify_hit_top_n(self, item_id, recommended_items, topn):
      try:
        index = next(i for i, c in enumerate(recommended_items) if c == item_id)
      except StopIteration:
        index = -1
      hit = int(index in range(0, topn))
      return hit, index


    # Function to evaluate the performance of model for each user
    def evaluate_model_for_user(self, model, user_id):
      try:

        # Getting the items in test set
        interacted_values_testset = interactions_test_indexed_df.loc[user_id]

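        if isinstance(interacted_values_testset['ISBN'], pd.Series):
          person_interacted_items_testset = set(interacted_values_testset['ISBN'])
        else:
          # A single interaction yields a scalar ISBN; wrap it so set() does not split the string into characters
          person_interacted_items_testset = set([interacted_values_testset['ISBN']])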

        interacted_items_count_testset = len(person_interacted_items_testset)

        # Getting a ranked recommendation list from the model for a given user
        person_recs_df = model.recommend_items(user_id, items_to_ignore=get_items_interacted(user_id, interactions_train_indexed_df),topn=10000000000)

        hits_at_5_count = 0
        hits_at_10_count = 0

        # For each item the user has interacted in test set
        for item_id in person_interacted_items_testset:

            # Getting a random sample of 100 items the user has not interacted with
            non_interacted_items_sample = self.get_not_interacted_items_sample(user_id, sample_size=EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS)

            # Combining the current interacted item with the 100 random items
            items_to_filter_recs = non_interacted_items_sample.union(set([item_id]))

            # Filtering only recommendations that are either the interacted item or from a random sample of 100 non-interacted items
            valid_recs_df = person_recs_df[person_recs_df['ISBN'].isin(items_to_filter_recs)]
            valid_recs = valid_recs_df['ISBN'].values

            # Verifying if the current interacted item is among the Top-N recommended items
            hit_at_5, index_at_5 = self._verify_hit_top_n(item_id, valid_recs, 5)
            hits_at_5_count += hit_at_5
            hit_at_10, index_at_10 = self._verify_hit_top_n(item_id, valid_recs, 10)
            hits_at_10_count += hit_at_10

        # Recall is the rate of the interacted items that are ranked among the Top-N recommended items
        recall_at_5 = hits_at_5_count / float(interacted_items_count_testset)
        recall_at_10 = hits_at_10_count / float(interacted_items_count_testset)

        user_metrics = {'hits@5_count':hits_at_5_count,
                          'hits@10_count':hits_at_10_count,
                          'interacted_count': interacted_items_count_testset,
                          'recall@5': recall_at_5,
                          'recall@10': recall_at_10}
        return user_metrics
      except KeyError:
        # A KeyError here typically means the user has no training interactions or no profile in the model
        print(f"User with user_id {user_id} could not be evaluated (missing from the training data or the model's profiles).")
        return {'hits@5_count': 0, 'hits@10_count': 0, 'interacted_count': 0, 'recall@5': 0, 'recall@10': 0}


    # Function to evaluate the performance of model at overall level
    def evaluate_model(self, model):

        people_metrics = []

        for idx, user_id in enumerate(list(interactions_test_indexed_df.index.unique().values)):
            person_metrics = self.evaluate_model_for_user(model,user_id)
            person_metrics['_person_id'] = user_id
            people_metrics.append(person_metrics)

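        # idx is the last zero-based index, so report the number of collected results instead
        print('{0} users processed' .format(len(people_metrics)))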

        detailed_results_df = pd.DataFrame(people_metrics).sort_values('interacted_count', ascending=False)

        global_recall_at_5 = detailed_results_df['hits@5_count'].sum() / float(detailed_results_df['interacted_count'].sum())
        global_recall_at_10 = detailed_results_df['hits@10_count'].sum() / float(detailed_results_df['interacted_count'].sum())

        global_metrics = {'modelName': model.get_model_name(),
                          'recall@5': global_recall_at_5,
                          'recall@10': global_recall_at_10}
        return global_metrics, detailed_results_df

model_evaluator = ModelEvaluator()

In [None]:
print('Evaluating Content-Based Filtering model...')
cb_global_metrics, cb_detailed_results_df = model_evaluator.evaluate_model(content_based_recommender_model)

# # Move the user_id column to the first position
# user_id_column = cb_detailed_results_df['_user_id']  # Extract the user_id column
# cb_detailed_results_df = cb_detailed_results_df.drop(columns=['_user_id'])
# cb_detailed_results_df.insert(0, '_user_id', user_id_column)

print('\nGlobal metrics:\n{}' .format(cb_global_metrics))
cb_detailed_results_df.head(10)

In [None]:
# user_profiles is a dict keyed by user_id, so report its size with len() rather than .shape
print(len(user_profiles))
print(tfidf_matrix.shape)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***