<a href="https://colab.research.google.com/github/saketvaibhav7114/Book-Recommendation-System/blob/main/Book_Recommendation_System_(Unsupervised_Learning_Project).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Book Recommendation System



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual
##### **Team Member 1 -** Saket Vaibhav

# **Project Summary -**

Over the last few decades, with the rise of YouTube, Amazon, Netflix, and many other web services, recommender systems have taken an increasingly prominent place in our lives. From e-commerce (suggesting articles that could interest buyers) to online advertising (suggesting content that matches users' preferences), recommender systems are now an unavoidable part of our daily online journeys.

A recommendation system helps an organization build loyal customers and earn their trust by offering the products and services they want. Modern recommendation systems can even handle a new customer visiting the site for the first time: they can recommend products that are currently trending or highly rated, as well as products that bring the most profit to the company.

### **Data Collection:**
The foundation of any recommendation system is data. In this project, data collection involved gathering information about books, authors, user preferences, and historical reading patterns. The dataset for Book Recommendation System comprises three files:

**Users**

Contains the user IDs, location, and age.


**Books**

Books are identified by their respective ISBN. Some content-based information is also given (Book-Title, Book-Author, Year-Of-Publication, Publisher), obtained from Amazon Web Services.


**Ratings**

Contains the book rating information expressed on a scale from 1-10 (higher values denoting higher appreciation). Taken together, the three files include details such as book titles, authors, publication years, publishers, and user ratings.


### **Data Preprocessing:**
Data preprocessing involved tasks such as cleaning the data, handling missing values, and transforming textual descriptions into numerical representations through techniques like TF-IDF (Term Frequency-Inverse Document Frequency).
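
The TF-IDF weighting mentioned above can be sketched in a few lines. This is only a minimal illustration, assuming whitespace tokenization and the unsmoothed formula tf × log(N / df). The `tfidf` helper and the example titles are made up for this sketch, and library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed, normalized variant, so their numbers will differ.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return one {term: weight} dict per document using plain TF-IDF."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical titles, used only for illustration
titles = ["the hobbit", "the two towers", "pride and prejudice"]
tfidf_weights = tfidf(titles)
```

Terms that occur in many documents (here "the") receive a lower weight than terms that are specific to one document (here "hobbit").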


### **Clustering Algorithms:**
One of the key components of the recommendation system is the use of clustering algorithms. Unsupervised clustering methods, such as K-Means, are applied to group books with similar characteristics. These clusters are based on factors like genre, author, and book content. The goal is to identify patterns and associations among books that can aid recommendations.
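
As a rough illustration of the clustering idea, the sketch below implements the basic K-Means loop with NumPy. It assumes each book has already been turned into a numeric feature vector. The `kmeans` function and the toy `features` array are hypothetical and are not taken from the project code.

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    """Basic K-Means: returns (labels, centroids) for an (n, d) array."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# Hypothetical 2-D book features forming two well-separated groups
features = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels, centroids = kmeans(features, k=2)
```

Books that end up with the same label belong to the same cluster and could be offered as candidates for one another.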

### **Matrix Factorization:**
Matrix factorization techniques, including Singular Value Decomposition (SVD), are employed to uncover latent factors that influence user preferences. By decomposing the user-item interaction matrix, these algorithms reveal hidden relationships between users and books. This information is then used to make personalized recommendations.
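
The decomposition described above can be sketched with NumPy on a tiny, made-up rating matrix. The matrix values and the choice of `k = 2` latent factors are assumptions for illustration only.

```python
import numpy as np

# Hypothetical 4-user x 4-book rating matrix (0 = not rated), for illustration only
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

# Full decomposition: ratings = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)

# Keep the k strongest latent factors and rebuild a low-rank approximation
k = 2
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

The entries of `approx` at positions that were 0 in `ratings` can be read as rough preference scores for books the user has not rated yet.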

### **Collaborative Filtering:**
Collaborative filtering relies on the idea that users who have similar reading preferences will likely enjoy similar books. Collaborative filtering algorithms, such as user-based and item-based collaborative filtering, are implemented to generate recommendations based on user behavior and item similarity. This approach helps fine-tune the suggestions.
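
A minimal sketch of the item-based variant is shown below, assuming cosine similarity as the similarity measure. The `cosine_similarity_matrix` helper and the small `book_user` matrix are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(matrix):
    """Cosine similarity between the rows of `matrix`."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = matrix / norms
    return unit @ unit.T

# Hypothetical book x user rating matrix (rows = books, columns = users, 0 = not rated)
book_user = np.array([
    [5, 4, 0, 0],   # book 0
    [4, 5, 0, 1],   # book 1
    [0, 0, 5, 4],   # book 2
], dtype=float)

similarity = cosine_similarity_matrix(book_user)

# Most similar other book to book 0: mask the diagonal, then take the argmax
np.fill_diagonal(similarity, -1.0)
nearest_to_book_0 = int(similarity[0].argmax())
```

Books 0 and 1 are rated highly by the same users, so book 1 comes out as the nearest neighbour of book 0.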

### **Content-Based Filtering:**
In addition to collaborative filtering, content-based filtering is used to improve the recommendation system's accuracy. This approach analyzes the textual descriptions of books and matches them with user preferences. Natural Language Processing (NLP) techniques are employed to extract meaningful features from the book descriptions and align them with user profiles.

### **Evaluation Metrics:**
To assess the performance of the Book Recommendation System, several evaluation metrics are employed. Common metrics include Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and precision-recall metrics. These measures help gauge the accuracy and effectiveness of the recommendations provided to users.
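
The metrics named above can be written out directly. The helper names and example inputs below are assumptions made for this sketch, not part of the project code.

```python
import math

def mae(actual, predicted):
    """Mean Absolute Error between two equal-length rating lists."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error between two equal-length rating lists."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommended items."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    return hits / k, hits / len(relevant)

# Hypothetical true vs. predicted ratings
example_mae = mae([8, 6, 10], [7, 6, 8])
example_rmse = rmse([8, 6, 10], [7, 6, 8])

# Hypothetical ranked recommendations and the set of books the user actually liked
precision, recall = precision_recall_at_k(["a", "b", "c", "d"], {"b", "d", "e"}, k=2)
```

MAE and RMSE apply when the system predicts a rating value, while precision and recall apply when it returns a ranked list of books.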

# **GitHub Link -**

https://github.com/saketvaibhav7114/Book-Recommendation-System

# **Problem Statement**


The world of literature is vast, with millions of books spanning various genres and subjects. Navigating this extensive library can be overwhelming for readers looking for their next captivating read. To address this challenge, a Book Recommendation System was developed. This system leverages the power of unsupervised learning algorithms to provide personalized book recommendations to users, enhancing their reading experience.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np # linear algebra
import pandas as pd # data processing
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import sklearn
import warnings
warnings.simplefilter('ignore')
pd.set_option('display.max_colwidth', None)  # None = do not truncate (-1 is rejected by recent pandas)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Dataset Loading

In [None]:
# Book Data
book_data = pd.read_csv("/content/drive/MyDrive/Books.csv")

# Users Data
users_data= pd.read_csv('/content/drive/MyDrive/Users.csv')

# Ratings Data
ratings_data = pd.read_csv("/content/drive/MyDrive/Ratings.csv")


### Dataset First View

In [None]:
# Dataset First Look

# Book Data
book_data.head()

In [None]:
# Users Data
users_data.head()

In [None]:
# Ratings Data
ratings_data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

# Book Data
book_data.shape

In [None]:
# User Data
users_data.shape

In [None]:
# ratings_data
ratings_data.shape

### Dataset Information

In [None]:
# Dataset Info
# Book Data
book_data.info()

In [None]:
# User Data
users_data.info()

In [None]:
# ratings_data
ratings_data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# Book Data
book_data.duplicated().sum()

In [None]:
# User Data
users_data.duplicated().sum()

In [None]:
# ratings_data
ratings_data.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
# Book Data
book_data.isnull().sum()

In [None]:
# User Data
users_data.isnull().sum()

In [None]:
# ratings_data
ratings_data.isnull().sum()

In [None]:
# Visualizing the missing values
# Book Data
book_missing_value=book_data.isnull().sum()
columns_with_missing_values = book_missing_value[book_missing_value > 0]      #  Filter columns with missing values

# Calculate the percentage of missing values in each column
total_rows = len(book_data)
percentage_missing = (columns_with_missing_values / total_rows) * 100

# Create a bar chart
plt.figure(figsize=(10, 6))
bar_plot = columns_with_missing_values.plot(kind='bar', color='lightcoral')
plt.xlabel('Columns with Missing Value',fontsize=14)
plt.ylabel('Number of Missing Values',fontsize=14)
plt.title('Number of Missing Values in Book Dataset',fontsize=14)
plt.xticks(rotation=0, ha='center',fontsize=10)
plt.yticks(fontsize=10)
plt.grid(axis='y', linestyle='--', alpha=0.7)

In [None]:
users_missing_value=users_data.isnull().sum()
columns_with_missing_values = users_missing_value[users_missing_value > 0]      #  Filter columns with missing values

# Calculate the percentage of missing values in each column
total_rows = len(users_data)  # percentages must be relative to the users table
percentage_missing = (columns_with_missing_values / total_rows) * 100

# Create a bar chart
plt.figure(figsize=(10, 6))
bar_plot = columns_with_missing_values.plot(kind='bar', color='lightcoral')
plt.xlabel('Columns with Missing Value',fontsize=14)
plt.ylabel('Number of Missing Values',fontsize=14)
plt.title('Number of Missing Values in User Dataset',fontsize=14)
plt.xticks(rotation=0, ha='center',fontsize=14)
plt.yticks(fontsize=10)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Display the percentage of missing values on top of bar
for index, value in enumerate(columns_with_missing_values):
    # .iloc gives positional access; the Series itself is indexed by column name
    plt.text(index, value, f'{percentage_missing.iloc[index]:.2f}%', ha='center', va='bottom',fontsize=10)

plt.show()

### What did you know about your dataset?

**Ans:** The dataset contains no duplicated rows, but it does have some missing values, which need to be handled either by filling them (for example with `fillna`) or by dropping the affected rows. Most of the missing values are in the age column of the users dataset. Most features are stored as objects or floats and should be converted to a more suitable datatype where necessary. Once cleaned, the dataset will be ready for the preprocessing steps, so that the focus can shift to feature engineering and model development.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
# Book Data
book_data.columns

In [None]:
# User Data
users_data.columns

In [None]:
# ratings_data
ratings_data.columns

In [None]:
# Dataset Describe
# Book Data
book_data.describe().T

In [None]:
# User Data
users_data.describe().T

In [None]:
# ratings_data
ratings_data.describe().T

### Variables Description

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
# Book Data
book_data.nunique()

In [None]:
# User Data
users_data.nunique()

In [None]:
# ratings_data
ratings_data.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

* Data Cleaning of book dataset

In [None]:
# Write your code to make your dataset analysis ready.
book_data.rename(columns = {'Book-Title':'title', 'Book-Author':'author', 'Year-Of-Publication':'year', 'Publisher':'publisher'}, inplace=True)

# Drop the image URL columns
book_data.drop(['Image-URL-S', 'Image-URL-M', 'Image-URL-L'], axis= 1, inplace= True)

In [None]:
book_data.info()

In [None]:
book_data.isnull().sum()

In [None]:
# nan values in book_author column
book_data.loc[(book_data['author'].isnull()),: ]

In [None]:
# nan values in publisher column
book_data.loc[(book_data['publisher'].isnull()),: ]

In [None]:
# getting unique value from 'year_of_publication' feature
book_data['year'].unique()

In [None]:
# Extracting rows with year column="DK Publishing Inc"
book_data[book_data['year'] == 'DK Publishing Inc']

In [None]:
# Extracting rows with year column="Gallimard"
book_data[book_data['year'] == 'Gallimard']

In [None]:
book_data.loc[221678]

In [None]:
book_data.loc[209538]

In [None]:
book_data.loc[220731]

* Let's fix the column and make it in correct format as per our dataset.

In [None]:
# function to fix mismatch data in feature 'book_title', 'book_author', ' year_of_publication', 'publisher'
def replace_df_value(df, idx, col_name, val):
  df.loc[idx, col_name] = val
  return df

In [None]:
replace_df_value(book_data, 209538, 'title', 'DK Readers: Creating the X-Men, How It All Began (Level 4: Proficient Readers)')
replace_df_value(book_data, 209538, 'author', 'Michael Teitelbaum')
replace_df_value(book_data, 209538, 'year', 2000)
replace_df_value(book_data, 209538, 'publisher', 'DK Publishing Inc')

replace_df_value(book_data, 221678, 'title', 'DK Readers: Creating the X-Men, How Comic Books Come to Life (Level 4: Proficient Readers)')
replace_df_value(book_data, 221678, 'author', 'James Buckley')
replace_df_value(book_data, 221678, 'year', 2000)
replace_df_value(book_data, 221678, 'publisher', 'DK Publishing Inc')

replace_df_value(book_data, 220731,'title', "Peuple du ciel, suivi de 'Les Bergers")
replace_df_value(book_data, 220731, 'author', 'Jean-Marie Gustave Le Clézio')
replace_df_value(book_data, 220731, 'year', 2003)
replace_df_value(book_data, 220731, 'publisher', 'Gallimard')

In [None]:
book_data.loc[209538]

In [None]:
book_data.loc[221678]

In [None]:
book_data.loc[220731]

In [None]:
book_data['year'].unique()

In [None]:
# Change the datatype of year column from object to int
book_data['year'] = book_data['year'].astype(int)

In [None]:
book_data.info()

* Data Cleaning of user dataset

In [None]:
# Rename the columns
users_data.rename(columns = {'User-ID':'user_id', 'Location':'location', 'Age':'age'}, inplace=True)

In [None]:
users_data.info()

* Data Cleaning of ratings dataset

In [None]:
# Rename the columns
ratings_data.rename(columns = {'User-ID':'user_id', 'Book-Rating':'rating'}, inplace=True)

In [None]:
ratings_data.info()

In [None]:
ratings_data['rating'].unique()

In [None]:
ratings_data['user_id'].value_counts()

**A limitation of using the full ratings dataset**

If we model on all books and all users, we run into a problem: users who have merely registered on the website, or who have rated only one or two books, do not provide enough information to support recommendations to others. We therefore keep only users who have rated more than 200 books, and only books that have received at least 50 ratings from those users.

**Extract users who have rated more than 200 books**

In [None]:
x = ratings_data['user_id'].value_counts() > 200

In [None]:
y = x[x].index  # user_ids
print(y.shape)

In [None]:
ratings_data = ratings_data[ratings_data['user_id'].isin(y)]

In [None]:
ratings_data.shape

So about 900 users remain, and together they have given roughly 520,000 ratings (5.2 lakh).

**Merge ratings with books**

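Merge the ratings with the books on `ISBN`, so that each rating row also carries the book's title, author, year, and publisher. This is an inner join: ratings whose ISBN is missing from the books table are dropped, and user-book pairs without a rating simply do not appear. They only become zeros if the data is later pivoted into a user-item matrix with a fill value of 0.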
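
To make the effect of the merge concrete, here is a small sketch on made-up data (the three-row tables and the `user_item` name are assumptions for illustration). It suggests that the default inner merge only keeps books that were actually rated, and that zeros for unrated pairs appear only once the merged data is pivoted with `fill_value=0`.

```python
import pandas as pd

# Hypothetical miniature versions of the ratings and books tables
ratings = pd.DataFrame({
    "user_id": [1, 1, 2],
    "ISBN": ["A", "B", "A"],
    "rating": [8, 5, 10],
})
books = pd.DataFrame({
    "ISBN": ["A", "B", "C"],
    "title": ["Book A", "Book B", "Book C"],
})

# Inner join on ISBN: "Book C" has no ratings, so it drops out
merged = ratings.merge(books, on="ISBN")

# Pivot into a book x user matrix; unrated pairs are filled with 0
user_item = merged.pivot_table(index="title", columns="user_id",
                               values="rating", fill_value=0)
```

In this toy example user 2 never rated "Book B", so that cell of `user_item` is 0 even though no such row exists in `merged`.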


In [None]:
rating_with_books = ratings_data.merge(book_data, on='ISBN')
rating_with_books.head()

In [None]:
rating_with_books.shape

Count the Number of Ratings for Each Book

In [None]:
number_rating = rating_with_books.groupby('title').count()['rating'].reset_index()
number_rating.rename(columns= {'rating':'number_of_ratings'}, inplace=True)
number_rating

In [None]:
final_rating=rating_with_books.merge(number_rating,on='title')
final_rating

Extract books that have received at least 50 ratings.

In [None]:
final_rating = final_rating[final_rating['number_of_ratings'] >= 50]

In [None]:
# Drop duplicate (user, title) rows; reassign instead of using inplace on a filtered slice
final_rating = final_rating.drop_duplicates(['user_id','title'])

* **Merge Final Rating Dataset with the Users Dataset**

In [None]:
final_data=final_rating.merge(users_data, on="user_id")

In [None]:
final_data.head()

In [None]:
# Extract 'country' values from 'location' column
final_data['country'] = final_data['location'].str.split(',').str[-1].str.strip()

# Drop the 'location' column
final_data.drop(columns=['location'], inplace=True)

In [None]:
final_data['country'].unique()

In [None]:
# Replace 'usa' with 'us', and 'n/a' or empty strings with NaN in the 'country' column
final_data['country'] = final_data['country'].str.replace('usa','us').replace('n/a',np.nan).replace('', np.nan)

# Display the modified 'country' column
final_data['country'].unique()

In [None]:
final_data.tail()

In [None]:
final_data.shape

Finally, we have a dataset of users who have rated more than 200 books and books that have received at least 50 ratings. The final dataframe has 59850 rows and 10 columns.

In [None]:
final_data.info()

In [None]:
final_data.loc[59845]

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(14,6))
ax=sns.countplot(x="rating",palette = 'Paired',data= final_data)
plt.title('Count of Each Ratings',fontsize=15)
plt.xlabel('Rating',fontsize=15)
plt.ylabel('Count',fontsize=15)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Add value annotations to the bars
for p in ax.patches:
    ax.annotate(format(p.get_height(), '.0f'),
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', fontsize=10, color='black', xytext=(0, 5),
                textcoords='offset points')

plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(10,6))
sns.histplot(final_data['age'].dropna(), kde=True)  # histplot replaces the deprecated distplot
plt.title('Age Distribution\n',fontsize=15)
plt.xlabel('Age',fontsize=15)
plt.ylabel('Count',fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Chart - 3 visualization code
sns.boxplot(x=final_data['age'])
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code
sns.boxplot(x=final_data['year'])

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code
sns.boxplot(x=final_data['number_of_ratings'])

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code
sns.boxplot(x=final_data['rating'])

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(15,6))
ax=sns.countplot(data=final_data, y="author", palette = 'Paired', order=final_data['author'].value_counts().index[0:20])
plt.title("Top 20 Authors by Number of Ratings",fontsize=15)
plt.xlabel("Count of Ratings",fontsize=15)
plt.ylabel("Author Name",fontsize=15)

# Add values on top of each bar
for p in ax.patches:
    width = p.get_width()
    plt.text(width + 40, p.get_y() + p.get_height() / 2, f'{int(width)}',
             ha='center', va='center', fontsize=10, color='black')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(15, 6))
ax = sns.countplot(data=final_data, y="publisher", palette='Paired',
                   order=final_data['publisher'].value_counts().index[0:20])

# Set the title
plt.title("Top 20 Publishers with the number of books published", fontsize=15)
plt.xlabel("Number of Books", fontsize=15)
plt.ylabel("Publishers Name", fontsize=15)

# Adding values of each bar
for p in ax.patches:
    width = p.get_width()
    plt.text(width + 80, p.get_y() + p.get_height() / 2, f'{int(width)}',
             ha='center', va='center', fontsize=10, color='black')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
final_data.columns

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(12, 6))

# Create a countplot for the top 15 books based on the number of ratings
ax = sns.countplot(y="title", palette='Paired', data=final_data, order=final_data['title'].value_counts().index[0:15])
plt.title("Top 15 Books by Number of Ratings", fontsize=15)
plt.xlabel("Total Number of Ratings Given", fontsize=15)
plt.ylabel("Book Title", fontsize=15)
plt.xticks(fontsize=10)
plt.yticks(fontsize=8)

# Adding values on top of each bar
for p in ax.patches:
    width = p.get_width()
    plt.text(width + 10, p.get_y() + p.get_height() / 2, f'{int(width)}',
             ha='center', va='center', fontsize=10, color='black')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(16, 10))

# Create a countplot for the number of books published each year
ax=sns.countplot(data=final_data, x="year", palette='Paired', order=sorted(final_data['year'].unique()))

# Set the title and labels
plt.title("Number of Books Published Each Year")
plt.xlabel("Year",fontsize=15)
plt.ylabel("Number of Books",fontsize=15)
plt.xticks(rotation=90)

# Add values on top of each bar
for p in ax.patches:
  ax.annotate(f'{int(p.get_height())}', (p.get_x() + p.get_width() / 2, p.get_height()+30),
                ha='center', va='bottom', fontsize=8, color='black', rotation=90)
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

plt.figure(figsize=(12, 6))

# Keep only rows with a valid publication year (drop rows where 'year' is 0)
filtered_final_data = final_data[final_data['year'] != 0]

sns.lineplot(x='year', y='number_of_ratings', data=filtered_final_data)

plt.title("Number of Ratings Over the Years", fontsize=15)
plt.xlabel("Year", fontsize=15)
plt.ylabel("Number of Ratings", fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Group the data by 'country' and count the unique 'author' values in each group
country_author_counts = final_data.groupby('country')['author'].nunique()

# Create a bar plot
plt.figure(figsize=(14, 7))
ax=country_author_counts.plot(kind='bar',color='skyblue')
plt.title('Number of Unique Authors by Country',fontsize=15)
plt.xlabel('Country',fontsize=15)
plt.ylabel('Number of Unique Authors',fontsize=15)
plt.xticks(rotation=90,fontsize=11)
plt.tight_layout()

# Add value annotations to the bars
for i, v in enumerate(country_author_counts):
  ax.text(i, v + 1, str(v), ha='center', va='bottom', fontsize=11)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

# Group the data by 'country' and count the unique 'publisher' values in each group
country_publisher_counts = final_data.groupby('country')['publisher'].nunique()

# Create a bar plot
plt.figure(figsize=(14, 7))
ax = country_publisher_counts.plot(kind='bar', color='skyblue')
plt.title('Number of Unique Publishers by Country', fontsize=15)
plt.xlabel('Country', fontsize=15)
plt.ylabel('Number of Unique Publishers', fontsize=15)
plt.xticks(rotation=90,fontsize=11)

# Add value annotations to the bars
for i, v in enumerate(country_publisher_counts):
    ax.text(i, v + 0.1, str(v), ha='center', va='bottom', fontsize=11)

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

# Calculate the correlation matrix
correlation_matrix = final_data.corr(numeric_only=True)  # skip text columns such as title and author

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap of final_data", fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(final_data)

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

**Hypothetical Statement 1:** The average age of users from the United States ('us') differs from the average age of users from Canada ('canada').

**Hypothetical Statement 2:** The average age of users who rated books with a rating of 5 differs from the average age of users who rated books with a rating less than 5.

**Hypothetical Statement 3:** The average number of ratings for books with a rating of 5 is significantly higher than the average number of ratings for books with a rating less than 5.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0):** The average age of users from the USA is equal to the average age of users from Canada.

**Alternative Hypothesis (H1):** The average age of users from the USA is not equal to the average age of users from Canada.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

# Separate the data for users from the USA and Canada
age_usa = final_data[final_data['country'] == 'us']['age'].dropna()
age_canada = final_data[final_data['country'] == 'canada']['age'].dropna()

# Perform a t-test
t_statistic, p_value = stats.ttest_ind(age_usa, age_canada, alternative='two-sided')

# Significance level
alpha = 0.05

# Print results
print("t-statistic:", t_statistic)
print("p-value:", p_value)

if p_value < alpha:
  print("Reject the null hypothesis: The average age of users from the USA is not equal to the average age of users from Canada.")
else:
  print("Fail to reject the null hypothesis: The average age of users from the USA is equal to the average age of users from Canada.")

##### Which statistical test have you done to obtain P-Value?

**Ans:** An independent two-sample t-test (`scipy.stats.ttest_ind`, two-sided) was performed to obtain the p-value.

##### Why did you choose the specific statistical test?

**Ans:** T-test is commonly used to compare the means of two samples or groups to assess whether the observed difference is statistically significant or if it could have occurred by chance.
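
For reference, the statistic behind a two-sample t-test with equal assumed variances (the default behaviour of `scipy.stats.ttest_ind`) can be computed by hand. The `pooled_t_statistic` helper and the two small age samples below are made up for illustration.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(sample_a, sample_b):
    """t-statistic of the two-sample t-test with a pooled variance estimate."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Pooled variance assumes both groups share the same population variance
    pooled_var = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / (n_a + n_b - 2)
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error

# Hypothetical ages for two groups of users
ages_group_a = [30, 32, 34]
ages_group_b = [26, 28, 30]
t_value = pooled_t_statistic(ages_group_a, ages_group_b)
```

The larger the absolute value of the statistic relative to the t-distribution with `n_a + n_b - 2` degrees of freedom, the smaller the p-value.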

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0):** The average age of users who rated books with a rating of 5 is equal to the average age of users who rated books with a rating less than 5.

**Alternative Hypothesis (H1):** The average age of users who rated books with a rating of 5 is not equal to the average age of users who rated books with a rating less than 5.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

# Separate the data for users who rated books with a rating of 5 and users who rated books with a rating less than 5
age_rating_5 = final_data[final_data['rating'] == 5]['age'].dropna()
age_rating_less_than_5 = final_data[final_data['rating'] < 5]['age'].dropna()

# Perform a two-sided t-test
t_stat, p_value = stats.ttest_ind(age_rating_5, age_rating_less_than_5, alternative='two-sided')

# Significance level
alpha = 0.05

# Print results
print("t-statistic:", t_stat)
print("p-value:", p_value)

if p_value < alpha:
  print("Reject the null hypothesis: The average age of users who rated books with a rating of 5 is not equal\
        \nto average age of users who rated books with a rating less than 5.")

else:
  print("Fail to reject the null hypothesis: The average age of users who rated books with a rating of 5 is equal\
        \nto the average age of users who rated books with a rating less than 5.")


##### Which statistical test have you done to obtain P-Value?

**Ans:** An independent two-sample t-test (`scipy.stats.ttest_ind`, two-sided) was performed to obtain the p-value.

##### Why did you choose the specific statistical test?

**Ans:** T-test is commonly used to compare the means of two samples or groups to assess whether the observed difference is statistically significant or if it could have occurred by chance.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0):** The average number of ratings for books with a rating of 5 is equal to the average number of ratings for books with a rating less than 5.

**Alternative Hypothesis (H1):** The average number of ratings for books with a rating of 5 is higher than the average number of ratings for books with a rating less than 5.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

# Separate the data for books with a rating of 5 and books with a rating less than 5
num_ratings_rating_5 = final_data[final_data['rating'] == 5]['number_of_ratings']
num_ratings_rating_less_than_5 = final_data[final_data['rating'] < 5]['number_of_ratings']

# Perform a one-tailed t-test (greater)
t_stat, p_value = stats.ttest_ind(num_ratings_rating_5, num_ratings_rating_less_than_5, alternative='greater')

# Significance level
alpha = 0.05

# Print results
print("t-statistic:", t_stat)
print("p-value:", p_value)

if p_value < alpha:
    print("Reject the null hypothesis: The average number of ratings for books with a rating of 5\
          \nis higher than the average number of ratings for books with a rating less than 5.")
else:
    print("Fail to reject the null hypothesis: The average number of ratings for books with a rating of 5\
            \nis equal to the average number of ratings for books with a rating less than 5.")


##### Which statistical test have you done to obtain P-Value?

**Ans:** An independent two-sample t-test (`scipy.stats.ttest_ind`, one-sided with `alternative='greater'`) was performed to obtain the p-value.

##### Why did you choose the specific statistical test?

**Ans:** T-test is commonly used to compare the means of two samples or groups to assess whether the observed difference is statistically significant or if it could have occurred by chance.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Making copy of original dataframe
df=final_data.copy()

In [None]:
df.columns

In [None]:
# Columns to Keep
df=df[['user_id','ISBN','rating', 'title', 'author', 'year', 'publisher','number_of_ratings']]

In [None]:
df.head()

In [None]:
df.info()

In [None]:
# Handling Missing Values & Missing Value Imputation
df.isnull().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

**Ans:** The dataset contains no missing values, so no imputation is needed.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
df['year'].unique()

In [None]:
# Calculate the median of the 'year' column (excluding 0, which marks an invalid year)
median_year = df.loc[df['year'] != 0, 'year'].median()

# Replace 0 with the median year
df['year'] = df['year'].replace(0, median_year)

In [None]:
median_year

In [None]:
df['year'].unique()

##### What all outlier treatment techniques have you used and why did you use those techniques?

**Ans:** A value of 0 in the year column is not a valid publication year, so it is treated as an outlier and replaced with the median year, which preserves the central tendency of the data.
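
A toy illustration of this treatment is sketched below. The data frame and its values are made up for the example; the point is that the median is computed over the valid years only, so the placeholder zeros do not pull it down before they are replaced.

```python
import pandas as pd

# Toy 'year' column; 0 marks an unknown publication year (illustrative values)
toy = pd.DataFrame({'year': [1999, 0, 2001, 2003, 0, 1995]})

# Median over valid years only
median_valid = int(toy.loc[toy['year'] != 0, 'year'].median())

# Replace the placeholder zeros with that median
toy['year'] = toy['year'].replace(0, median_valid)

print(median_valid)
print(toy['year'].tolist())
```

The median is preferred over the mean here because it is less sensitive to any remaining extreme values in the column.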

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### Which feature selection methods have you used and why?

Answer Here.

##### Which features did you find important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(df, test_size=0.2,random_state=42)

In [None]:
print(f'Training set lengths: {len(train_data)}')
print(f'Testing set lengths: {len(test_data)}')
print(f'Test set is {(len(test_data)/(len(train_data)+len(test_data))*100):.0f}% of the full dataset.')

In [None]:
train_data.head()

##### What data splitting ratio have you used and why?

**Ans:** The test size is set to 20% because it is common practice to keep 80% of the data for training and 20% for testing.
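
As a quick check of what `test_size=0.2` does, the sketch below applies the same call to a small made-up frame of ten rows. The column names and values are illustrative and are not taken from `df`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten made-up rating rows (illustrative only)
toy = pd.DataFrame({
    'user_id': range(10),
    'rating': [5, 7, 8, 6, 9, 10, 4, 7, 8, 6],
})

# Same arguments as in the notebook: 20% held out, fixed seed for reproducibility
train_toy, test_toy = train_test_split(toy, test_size=0.2, random_state=42)

print(len(train_toy), len(test_toy))
```

Fixing `random_state` makes the split repeatable, so later results can be compared across runs.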

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1-Collaborative Filtering Method

In [None]:
book_pivot = df.pivot_table(columns='user_id', index='title', values="rating")

In [None]:
book_pivot.shape

In [None]:
book_pivot

In [None]:
book_pivot.fillna(0, inplace=True)

In [None]:
book_pivot

In [None]:
from scipy.sparse import csr_matrix
book_sparse = csr_matrix(book_pivot)

In [None]:
book_sparse

In [None]:
from sklearn.neighbors import NearestNeighbors
model = NearestNeighbors(n_neighbors=6, metric='cosine')
model.fit(book_sparse)

In [None]:
def recommend(book_name):
    try:
        # Check if the book_name exists in the index
        if book_name not in book_pivot.index:
            raise KeyError(f"'{book_name}' not found in the index.")

        # Fetch index of book_name
        index = np.where(book_pivot.index == book_name)[0][0]

        # Find the nearest neighbors
        distances, neighbor_indices = model.kneighbors(book_sparse[index].reshape(1, -1))

        # Exclude the first neighbor, which is the book itself
        similar_items = neighbor_indices[0][1:]

        for i in similar_items:
            print(book_pivot.index[i])
    except KeyError as e:
        print(e)

In [None]:
recommend('Zoya')

In [None]:
recommend('1984')

In [None]:
recommend('Harry Potter and the Prisoner of Azkaban (Book 3)')

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.
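
The model above is an item-based collaborative filter: each book is represented by its vector of user ratings, and `NearestNeighbors` with the cosine metric returns the books whose rating vectors point in the most similar direction. The sketch below reproduces that idea on a tiny made-up matrix. The book titles, user labels, ratings, and the helper `toy_recommend` are all hypothetical and exist only for this illustration.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Toy ratings (rows: books, columns: users); 0 means "not rated". Values are made up.
toy_pivot = pd.DataFrame(
    [[9, 8, 0, 0],
     [8, 9, 0, 0],
     [0, 0, 7, 9],
     [0, 0, 8, 9],
     [5, 0, 5, 0]],
    index=['Book A', 'Book B', 'Book C', 'Book D', 'Book E'],
    columns=['u1', 'u2', 'u3', 'u4'],
)
toy_sparse = csr_matrix(toy_pivot.values)

# Two neighbours: the query book itself plus its closest match
toy_model = NearestNeighbors(n_neighbors=2, metric='cosine')
toy_model.fit(toy_sparse)

def toy_recommend(title):
    """Return the single most similar book to `title` (hypothetical helper)."""
    idx = toy_pivot.index.get_loc(title)
    distances, neighbors = toy_model.kneighbors(toy_sparse[idx])
    # Drop the query book itself, wherever it appears in the result
    return [toy_pivot.index[i] for i in neighbors[0] if i != idx][:1]

print(toy_recommend('Book A'))
print(toy_recommend('Book C'))
```

Books rated similarly by the same users end up close together, so 'Book A' pairs with 'Book B' and 'Book C' with 'Book D'. Filling unrated cells with 0 is a simplification: it treats "not rated" the same as the lowest possible score, which is worth keeping in mind when interpreting the neighbours.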

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for the deployment process.


In [None]:
# Save the File

### 2. Load the saved model file again and try to predict unseen data for a sanity check.
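
A minimal save-and-reload round trip with the standard-library `pickle` module might look like the sketch below. It is an assumption-laden illustration: the small `knn` model and matrix `X` stand in for the notebook's fitted `model` and `book_sparse`, and the file name `book_recommender.pkl` is hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Small stand-in model; in the notebook this would be the fitted `model`
X = np.array([[9, 8, 0], [8, 9, 0], [0, 1, 9], [0, 2, 8]], dtype=float)
knn = NearestNeighbors(n_neighbors=2, metric='cosine').fit(X)

# Hypothetical file name, written to a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'book_recommender.pkl')

# Save the fitted model
with open(path, 'wb') as f:
    pickle.dump(knn, f)

# Load it back
with open(path, 'rb') as f:
    restored = pickle.load(f)

# Sanity check: the restored model returns the same neighbours as the original
_, before = knn.kneighbors(X[:1])
_, after = restored.kneighbors(X[:1])
print(np.array_equal(before, after))
```

Note that the `recommend` function also relies on `book_pivot` and `book_sparse`, so a deployment would need to persist those alongside the model. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.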


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***