# Introduction

Readers today face far more books than they can realistically evaluate. A book recommender system reduces this information overload by suggesting books that a particular user is likely to find interesting. By analyzing user preferences, reading history, and book characteristics, these systems improve book discovery and user engagement on online book platforms, which in turn can increase sales and readership. This project develops a book recommender system from three datasets: books_df (book information), ratings_df (user ratings), and user_df (user demographics).

## Business Understanding

The development of an effective book recommender system holds significant value for various stakeholders:

* E-commerce Platforms and Online Bookstores: Recommender systems can drive sales by suggesting relevant books to customers, increasing the likelihood of purchases. They also enhance user experience, leading to greater customer satisfaction and loyalty.

* Libraries and Educational Institutions: These systems can help patrons discover new titles aligned with their interests or academic needs, fostering a more engaging and enriching experience.

* Publishers and Authors: Understanding user preferences can provide valuable insights into market trends, potentially influencing publishing decisions and marketing strategies.

* Individual Users: The primary benefit for users is the ability to effortlessly discover books they are likely to enjoy, saving time and expanding their literary horizons.

By leveraging the data within the provided datasets, this project aims to build a system that accurately predicts user preferences and delivers personalized book recommendations, thereby addressing a crucial need in the book discovery ecosystem.

## Problem Statement

The core problem this project seeks to address is the challenge users face in discovering books that align with their individual tastes from a vast and ever-expanding catalog. Without an effective recommendation system, users may struggle to find new authors, genres, or titles they would enjoy, potentially leading to a less engaging and fulfilling reading experience. This project aims to develop a data-driven solution that can predict user preferences based on their past interactions and the characteristics of books, thereby providing personalized and relevant recommendations.

## Objectives

The primary objectives of this project are to:

* Develop a book recommender system: Implement one or more recommendation algorithms (e.g., collaborative filtering, content-based filtering, or hybrid approaches).

* Evaluate the performance of the recommender system: Use appropriate metrics to assess the accuracy and effectiveness of the developed model(s).

* Provide actionable insights: Based on the analysis and model results, offer insights into user preferences and potential strategies for book recommendations.

## Data Limitations

Before embarking on this project, it's important to consider potential limitations of the data:

* Data Sparsity: The ratings_df might suffer from sparsity, meaning that most users have only rated a small fraction of the available books. This can pose challenges for collaborative filtering techniques.

* Cold Start Problem: New users or new books with no or very few ratings will be difficult to recommend using collaborative filtering.

* Data Bias: The data might reflect biases present in the user base or the book catalog. For example, certain genres or authors might be over-represented.

* Data Quality: The datasets might contain inconsistencies, errors, or missing values that need to be addressed during preprocessing.

* Implicit vs. Explicit Feedback: The ratings_df likely contains explicit feedback (numerical ratings). However, other forms of implicit feedback (e.g., browsing history, purchase history) might not be available, potentially limiting the richness of user preference data.

* Evolution of Preferences: User preferences can change over time, which might not be captured effectively by static datasets.
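
To make the sparsity concern concrete, the following is a minimal sketch of how the density of a ratings table could be measured. The helper name `rating_density` and the toy data are illustrative assumptions, not part of the project code; only the column names follow ratings_df.

```python
import pandas as pd

def rating_density(ratings: pd.DataFrame,
                   user_col: str = "User-ID",
                   item_col: str = "ISBN") -> float:
    """Fraction of the user x item matrix that holds a rating (hypothetical helper)."""
    n_users = ratings[user_col].nunique()
    n_items = ratings[item_col].nunique()
    # Count distinct (user, item) pairs so repeated ratings are not double counted
    n_pairs = len(ratings[[user_col, item_col]].drop_duplicates())
    return n_pairs / (n_users * n_items)

# Toy data, made up for illustration: 3 users, 4 books, 5 ratings
toy = pd.DataFrame({
    "User-ID": [1, 1, 2, 3, 3],
    "ISBN": ["A", "B", "B", "C", "D"],
    "Book-Rating": [5, 0, 8, 7, 10],
})

density = rating_density(toy)
print(f"density={density:.3f}, sparsity={1 - density:.3f}")
```

A density close to zero means that most user and book pairs have no rating, which is the situation in which neighborhood-based collaborative filtering tends to struggle.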


# Data Understanding

## Loading Data

In [217]:
import pandas as pd
import numpy as np

# Load the three datasets; all share the same CSV dialect
books_df = pd.read_csv(
    '../books_df.csv',
    sep=';',               # Semicolon separator
    quotechar='"',         # Handles quoted text properly
    encoding='latin1',     # Supports special characters
    on_bad_lines='skip',   # Skips problematic lines
    engine='python'        # Use python engine for more flexibility
)

ratings_df = pd.read_csv(
    '../ratings_df.csv',
    sep=';',               # Semicolon separator
    quotechar='"',         # Handles quoted text properly
    encoding='latin1',     # Supports special characters
    on_bad_lines='skip',   # Skips problematic lines
    engine='python'        # Use python engine for more flexibility
)

user_df = pd.read_csv(
    '../user_df.csv',
    sep=';',               # Semicolon separator
    quotechar='"',         # Handles quoted text properly
    encoding='latin1',     # Supports special characters
    on_bad_lines='skip',   # Skips problematic lines
    engine='python'        # Use python engine for more flexibility
)
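
Because on_bad_lines='skip' drops malformed rows silently, it can be useful to know how many rows were lost. The sketch below assumes pandas 1.4 or later, where on_bad_lines also accepts a callable when engine='python'. The in-memory CSV text and the name `record_bad_line` are made up for illustration.

```python
import io
import pandas as pd

# Made-up CSV text: the second data row has one field too many
raw = (
    'ISBN;Book-Title;Book-Author\n'
    '"111";"Title A";"Author A"\n'
    '"222";"Title; B";"Author B";"extra"\n'
    '"333";"Title C";"Author C"\n'
)

skipped = []  # collects the fields of every rejected line

def record_bad_line(fields):
    """Remember the bad line, then return None so pandas skips it."""
    skipped.append(fields)
    return None

df = pd.read_csv(
    io.StringIO(raw),
    sep=';',
    quotechar='"',
    on_bad_lines=record_bad_line,  # callable form requires engine='python'
    engine='python',
)

print(f"rows kept: {len(df)}, rows skipped: {len(skipped)}")
```

Logging the skipped lines this way makes it possible to decide whether they are negligible or point to a quoting problem in the source file.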

In [218]:
books_df.head(2)

|   | ISBN | Book-Title | Book-Author | Year-Of-Publication | Publisher | Image-URL-S | Image-URL-M | Image-URL-L |
|---|------|------------|-------------|---------------------|-----------|-------------|-------------|-------------|
| 0 | 195153448 | Classical Mythology | Mark P. O. Morford | 2002 | Oxford University Press | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... |
| 1 | 2005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... |


In [219]:
#summary of dataframe
books_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 270491 entries, 0 to 270490
Data columns (total 8 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   ISBN                 270491 non-null  object
 1   Book-Title           270491 non-null  object
 2   Book-Author          270489 non-null  object
 3   Year-Of-Publication  270491 non-null  int64 
 4   Publisher            270489 non-null  object
 5   Image-URL-S          270491 non-null  object
 6   Image-URL-M          270491 non-null  object
 7   Image-URL-L          270491 non-null  object
dtypes: int64(1), object(7)
memory usage: 16.5+ MB


In [220]:
# check dataframe dimension
books_df.shape

(270491, 8)
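
Year-Of-Publication is stored as an integer with no missing values, but a non-null value is not necessarily a plausible one. The following is a small sketch of a range check. The bounds 1400 and 2025 and the toy values are assumptions chosen for illustration, not findings about books_df.

```python
import pandas as pd

def implausible_years(years: pd.Series,
                      earliest: int = 1400,
                      latest: int = 2025) -> pd.Series:
    """Boolean mask of years outside [earliest, latest]; the bounds are assumptions."""
    return ~years.between(earliest, latest)

# Toy values, made up for illustration
toy_years = pd.Series([2002, 2001, 0, 1999, 2050], name="Year-Of-Publication")

mask = implausible_years(toy_years)
print(toy_years[mask].tolist())
```

Applied to books_df['Year-Of-Publication'], the same mask would list the rows whose year needs review before any year-based feature is built.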

In [221]:
ratings_df.head(2)

|   | User-ID | ISBN | Book-Rating |
|---|---------|------|-------------|
| 0 | 276725 | 034545104X | 0 |
| 1 | 276726 | 0155061224 | 5 |


In [222]:
#summary of dataframe
ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149766 entries, 0 to 1149765
Data columns (total 3 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   User-ID      1149766 non-null  int64 
 1   ISBN         1149766 non-null  object
 2   Book-Rating  1149766 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 26.3+ MB


In [223]:
# check dataframe dimension
ratings_df.shape

(1149766, 3)
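
The preview of ratings_df includes a Book-Rating of 0. In Book-Crossing-style data a 0 usually marks an implicit interaction rather than a score of zero. Whether that holds for this file is an assumption that should be verified against the data source. Under that assumption, a minimal sketch of separating the two kinds of feedback looks like this (toy data, made up for illustration):

```python
import pandas as pd

# Toy ratings, made up for illustration; 0 is assumed to mean "implicit"
toy_ratings = pd.DataFrame({
    "User-ID": [10, 11, 12, 13, 13],
    "ISBN": ["A", "B", "C", "D", "E"],
    "Book-Rating": [0, 5, 0, 3, 6],
})

is_explicit = toy_ratings["Book-Rating"] > 0
explicit_df = toy_ratings[is_explicit]    # rows with an actual score
implicit_df = toy_ratings[~is_explicit]   # rows assumed to be implicit feedback
share_implicit = (~is_explicit).mean()

print(f"explicit: {len(explicit_df)}, implicit: {len(implicit_df)}, "
      f"implicit share: {share_implicit:.0%}")
```

Keeping the two subsets apart prevents implicit zeros from being averaged in as very low ratings.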

In [224]:
user_df.head(2)

|   | User-ID | Location | Age |
|---|---------|----------|-----|
| 0 | 1 | nyc, new york, usa | NaN |
| 1 | 2 | stockton, california, usa | 18.0 |


In [225]:
#summary of dataframe
user_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278177 entries, 0 to 278176
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   User-ID   278177 non-null  int64  
 1   Location  278177 non-null  object 
 2   Age       167669 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB


In [226]:
# check dataframe dimension
user_df.shape

(278177, 3)

## Data Cleaning

In [227]:
# List of dataframes with names for clarity
dfs = [("Books", books_df), ("Ratings", ratings_df), ("Users", user_df)]

# Loop through and check missing values
for name, df in dfs:
    print(f"\nMissing values in {name} dataset:")
    missing = df.isnull().sum()
    missing = missing[missing > 0]
    if not missing.empty:
        print(missing)
    else:
        print("No missing values.")



Missing values in Books dataset:
Book-Author    2
Publisher      2
dtype: int64

Missing values in Ratings dataset:
No missing values.

Missing values in Users dataset:
Age    110508
dtype: int64


* The Books dataset has 2 missing values each in Book-Author and Publisher.
* The Users dataset has 110,508 missing values in Age, about 40% of its 278,177 rows.
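
Raw counts are easier to judge next to the share of rows they represent. The sketch below shows one way to report both. The helper name `missing_report` and the toy frame are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count and share of missing values per column (hypothetical helper)."""
    counts = df.isna().sum()
    report = pd.DataFrame({"missing": counts, "share": counts / len(df)})
    # Keep only columns that actually have gaps, largest share first
    return report[report["missing"] > 0].sort_values("share", ascending=False)

# Toy users, made up for illustration
toy_users = pd.DataFrame({
    "User-ID": [1, 2, 3, 4, 5],
    "Location": ["a, b, c"] * 5,
    "Age": [np.nan, 18.0, np.nan, 17.0, 61.0],
})

report = missing_report(toy_users)
print(report)
```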



### Handling missing values

We drop the rows with missing values from the books dataset, because only 4 of its 270,491 rows are affected.

In [228]:
# Drop rows with missing values from the books data, as they are negligible
books_df.dropna(inplace=True)

In [229]:
books_df.isna().sum()

ISBN                   0
Book-Title             0
Book-Author            0
Year-Of-Publication    0
Publisher              0
Image-URL-S            0
Image-URL-M            0
Image-URL-L            0
dtype: int64

In [230]:
# Fill missing values in the Age column with the median
user_df['Age'] = user_df['Age'].fillna(user_df['Age'].median())

In [231]:
user_df.isna().sum()

User-ID     0
Location    0
Age         0
dtype: int64
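
Since 110,508 of 278,177 ages are imputed, a large share of users end up with exactly the median age. One hedged option is to record which values were imputed before filling them, so that later analysis can tell real ages from filled ones. The column name `Age_missing` and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy users, made up for illustration
toy_users = pd.DataFrame({
    "User-ID": [1, 2, 3, 4],
    "Age": [np.nan, 18.0, 40.0, np.nan],
})

# Flag the gaps first, because the information is lost after fillna
toy_users["Age_missing"] = toy_users["Age"].isna()

median_age = toy_users["Age"].median()  # computed from the observed ages only
toy_users["Age"] = toy_users["Age"].fillna(median_age)

print(toy_users)
```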

In [232]:
#check summary statistics
user_df['Age'].describe()

count    278177.000000
mean         33.658624
std          11.284321
min           0.000000
25%          29.000000
50%          32.000000
75%          35.000000
max         244.000000
Name: Age, dtype: float64

A minimum age of 0 and a maximum of 244 are implausible, which indicates outliers in the Age column.

In [233]:
# Cap outliers at the 5th and 95th percentiles
upper_lim = user_df['Age'].quantile(0.95)
lower_lim = user_df['Age'].quantile(0.05)
user_df.loc[user_df['Age'] > upper_lim, 'Age'] = upper_lim
user_df.loc[user_df['Age'] < lower_lim, 'Age'] = lower_lim

In [234]:
#recheck age summary
user_df['Age'].describe()

count    278177.000000
mean         33.402449
std           9.522502
min          18.000000
25%          29.000000
50%          32.000000
75%          35.000000
max          56.000000
Name: Age, dtype: float64
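
The same percentile capping can be written more compactly with Series.clip. The sketch below uses made-up ages and checks that clip gives the same result as the two boolean-mask assignments. It illustrates the technique and is not a replacement for the cell above.

```python
import pandas as pd

# Toy ages, made up for illustration: one 0, the values 18..56, and one 244
ages = pd.Series([0] + list(range(18, 57)) + [244], dtype=float)

lower = ages.quantile(0.05)
upper = ages.quantile(0.95)

# clip replaces values below `lower` with `lower` and above `upper` with `upper`
capped = ages.clip(lower=lower, upper=upper)

# Equivalent mask-based version, mirroring the approach used in the notebook
manual = ages.copy()
manual[manual > upper] = upper
manual[manual < lower] = lower

print(capped.min(), capped.max(), capped.equals(manual))
```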

### Checking for duplicates

In [235]:

# Loop through and check for duplicate rows
for name, df in dfs:
    duplicates = df.duplicated().sum()
    print(f"{name} dataset has {duplicates} duplicate rows.")


Books dataset has 0 duplicate rows.
Ratings dataset has 0 duplicate rows.
Users dataset has 0 duplicate rows.
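
df.duplicated() only flags rows that are identical in every column. For a ratings table it can also be worth checking the key columns alone, since the same user could have rated the same book twice with different scores. The following sketch uses made-up data; whether such pairs exist in ratings_df is not known from the output above.

```python
import pandas as pd

# Toy ratings, made up for illustration: user 1 rated book "A" twice
toy = pd.DataFrame({
    "User-ID": [1, 1, 2, 2],
    "ISBN": ["A", "A", "B", "C"],
    "Book-Rating": [5, 7, 0, 8],
})

full_row_dupes = toy.duplicated().sum()

# keep=False marks every row that belongs to a repeated (User-ID, ISBN) pair
key_dupes = toy.duplicated(subset=["User-ID", "ISBN"], keep=False)
n_key_dupes = key_dupes.sum()

print(f"full-row duplicates: {full_row_dupes}, rows in repeated pairs: {n_key_dupes}")
```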


Drop the unnecessary image URL columns from the books dataset.

In [236]:
# Drop image URLs columns
books_df = books_df.drop(columns=['Image-URL-S', 'Image-URL-M', 'Image-URL-L'])


In [237]:
books_df.head(2)

|   | ISBN | Book-Title | Book-Author | Year-Of-Publication | Publisher |
|---|------|------------|-------------|---------------------|-----------|
| 0 | 195153448 | Classical Mythology | Mark P. O. Morford | 2002 | Oxford University Press |
| 1 | 2005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada |


In [238]:
books_df.shape

(270487, 5)

## Merging the datasets

First, merge the users and ratings datasets on the User-ID column.

In [239]:
# Merge the users and ratings dataframes
user_ratings_df = pd.merge(user_df, ratings_df, on='User-ID')

In [240]:
user_ratings_df.shape

(1149558, 5)

Next, merge the combined user and ratings data with the books data on the ISBN column.

In [241]:
# Merge the books data with the combined user and ratings data
merged_df = pd.merge(books_df, user_ratings_df, on='ISBN')

In [242]:
# First rows of the dataframe merged from all three datasets
merged_df.head(2)

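|   | ISBN | Book-Title | Book-Author | Year-Of-Publication | Publisher | User-ID | Location | Age | Book-Rating |
|---|------|------------|-------------|---------------------|-----------|---------|----------|-----|-------------|
| 0 | 195153448 | Classical Mythology | Mark P. O. Morford | 2002 | Oxford University Press | 2 | stockton, california, usa | 18.0 | 0 |
| 1 | 2005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | 8 | timmins, ontario, canada | 32.0 | 5 |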


In [243]:
# check summary
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1028622 entries, 0 to 1028621
Data columns (total 9 columns):
 #   Column               Non-Null Count    Dtype  
---  ------               --------------    -----  
 0   ISBN                 1028622 non-null  object 
 1   Book-Title           1028622 non-null  object 
 2   Book-Author          1028622 non-null  object 
 3   Year-Of-Publication  1028622 non-null  int64  
 4   Publisher            1028622 non-null  object 
 5   User-ID              1028622 non-null  int64  
 6   Location             1028622 non-null  object 
 7   Age                  1028622 non-null  float64
 8   Book-Rating          1028622 non-null  int64  
dtypes: float64(1), int64(3), object(5)
memory usage: 70.6+ MB


In [244]:
merged_df.shape

(1028622, 9)
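
Both merges use the default inner join, so rows without a match are dropped: the row count falls from 1,149,766 ratings to 1,149,558 after joining users and to 1,028,622 after joining books. A hedged way to see which rows are lost is a left join with indicator=True, as in the sketch below. The toy frames are made up for illustration, and validate="many_to_one" additionally assumes that each User-ID and each ISBN appears only once on the right-hand side.

```python
import pandas as pd

# Toy frames, made up for illustration
users = pd.DataFrame({"User-ID": [1, 2, 3], "Age": [32.0, 18.0, 45.0]})
ratings = pd.DataFrame({
    "User-ID": [1, 2, 2, 9],          # user 9 has no row in `users`
    "ISBN": ["A", "A", "B", "C"],     # ISBN "C" has no row in `books`
    "Book-Rating": [5, 0, 8, 7],
})
books = pd.DataFrame({"ISBN": ["A", "B"], "Book-Title": ["Title A", "Title B"]})

# Left joins keep every rating; the _merge column says whether a match was found
user_audit = ratings.merge(users, on="User-ID", how="left",
                           indicator=True, validate="many_to_one")
book_audit = ratings.merge(books, on="ISBN", how="left",
                           indicator=True, validate="many_to_one")

unmatched_users = (user_audit["_merge"] == "left_only").sum()
unmatched_books = (book_audit["_merge"] == "left_only").sum()

print(f"ratings without a user: {unmatched_users}, without a book: {unmatched_books}")
```

Inspecting the left_only rows shows whether the lost ratings come from malformed keys, which may be fixable, or from users and books that are truly absent.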

In [245]:
# Location value counts
merged_df.Location.value_counts()

Location
toronto, ontario, canada            14738
n/a, n/a, n/a                       11135
chicago, illinois, usa               8481
seattle, washington, usa             8377
ottawa, ontario, canada              8096
                                    ...  
mitcham, ,                              1
norwood, north carolina, usa            1
vitoria, país vasco, spain              1
shenley, england, united kingdom        1
linclon, nebraska, usa                  1
Name: count, Length: 22403, dtype: int64

Split Location into three components: City, State, and Country. This normalizes the field and allows country-based filtering.

In [246]:
# Split 'Location' column into three new columns
merged_df[['City', 'State', 'Country']] = merged_df['Location'].str.split(',', n=2, expand=True)

# Clean up whitespace
merged_df['City'] = merged_df['City'].str.strip()
merged_df['State'] = merged_df['State'].str.strip()
merged_df['Country'] = merged_df['Country'].str.strip()

# Replace 'n/a' variants and missing components with 'Unknown'
merged_df[['City', 'State', 'Country']] = merged_df[['City', 'State', 'Country']].replace(['n/a', 'N/A', 'na', 'NA'], 'Unknown')
merged_df[['City', 'State', 'Country']] = merged_df[['City', 'State', 'Country']].fillna('Unknown')

# Optional: Check result
merged_df[['Location', 'City', 'State', 'Country']].head()


|   | Location | City | State | Country |
|---|----------|------|-------|---------|
| 0 | stockton, california, usa | stockton | california | usa |
| 1 | timmins, ontario, canada | timmins | ontario | canada |
| 2 | ottawa, ontario, canada | ottawa | ontario | canada |
| 3 | n/a, n/a, n/a | Unknown | Unknown | Unknown |
| 4 | sudbury, ontario, canada | sudbury | ontario | canada |

In [247]:
# Drop the 'Location' column
merged_df.drop(columns=['Location'], inplace=True)




In [248]:
# display dataframe
merged_df.head(3)

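|   | ISBN | Book-Title | Book-Author | Year-Of-Publication | Publisher | User-ID | Age | Book-Rating | City | State | Country |
|---|------|------------|-------------|---------------------|-----------|---------|-----|-------------|------|-------|---------|
| 0 | 195153448 | Classical Mythology | Mark P. O. Morford | 2002 | Oxford University Press | 2 | 18.0 | 0 | stockton | california | usa |
| 1 | 2005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | 8 | 32.0 | 5 | timmins | ontario | canada |
| 2 | 2005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | 11400 | 49.0 | 0 | ottawa | ontario | canada |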
