# Introduction

Readers on online book platforms face far more titles than they can realistically evaluate. A book recommender system reduces this information overload by suggesting books that are likely to interest a particular user. By analyzing user preferences, reading history, and book characteristics, these systems improve book discovery, increase user engagement with online book platforms, and can drive sales and readership. This project develops a book recommender system from three datasets: books_df (book information), ratings_df (user ratings), and user_df (user demographics).

## Business Understanding

The development of an effective book recommender system holds significant value for various stakeholders:

* E-commerce Platforms and Online Bookstores: Recommender systems can drive sales by suggesting relevant books to customers, increasing the likelihood of purchases. They also enhance user experience, leading to greater customer satisfaction and loyalty.

* Libraries and Educational Institutions: These systems can help patrons discover new titles aligned with their interests or academic needs, fostering a more engaging and enriching experience.

* Publishers and Authors: Understanding user preferences can provide valuable insights into market trends, potentially influencing publishing decisions and marketing strategies.

* Individual Users: The primary benefit for users is the ability to effortlessly discover books they are likely to enjoy, saving time and expanding their literary horizons.

Using the three datasets, this project aims to build a system that predicts user preferences accurately and delivers personalized book recommendations that serve these stakeholders.

## Problem Statement

The core problem this project seeks to address is the challenge users face in discovering books that align with their individual tastes from a vast and ever-expanding catalog. Without an effective recommendation system, users may struggle to find new authors, genres, or titles they would enjoy, potentially leading to a less engaging and fulfilling reading experience. This project aims to develop a data-driven solution that can predict user preferences based on their past interactions and the characteristics of books, thereby providing personalized and relevant recommendations.

## Objectives

The primary objectives of this project are to:

* Develop a book recommender system: Implement one or more recommendation algorithms (e.g., collaborative filtering, content-based filtering, or hybrid approaches).

* Evaluate the performance of the recommender system: Use appropriate metrics to assess the accuracy and effectiveness of the developed model(s).

* Provide actionable insights: Based on the analysis and model results, offer insights into user preferences and potential strategies for book recommendations.
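As a rough illustration of the first two objectives, the sketch below runs item-based collaborative filtering with cosine similarity on a tiny hand-made ratings table and scores predictions with RMSE. The toy data, the function names (`item_similarity`, `predict_rating`, `rmse`), and the choice of cosine similarity are assumptions made for illustration; they are not the approach this project commits to.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings using the same column names as ratings_df
toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 1, 2, 2, 3, 3, 3],
    "ISBN": ["A", "B", "C", "A", "B", "A", "B", "C"],
    "Book-Rating": [8, 6, 9, 7, 5, 3, 9, 4],
})

# Rows are users, columns are books, NaN marks "not rated"
matrix = toy_ratings.pivot(index="User-ID", columns="ISBN", values="Book-Rating")


def item_similarity(matrix):
    """Cosine similarity between book columns (missing ratings treated as 0, a simplification)."""
    filled = matrix.fillna(0).to_numpy(dtype=float)
    norms = np.linalg.norm(filled, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for unrated books
    sim = (filled.T @ filled) / np.outer(norms, norms)
    return pd.DataFrame(sim, index=matrix.columns, columns=matrix.columns)


def predict_rating(matrix, sim, user, item):
    """Similarity-weighted average of the user's other ratings."""
    rated = matrix.loc[user].dropna().drop(item, errors="ignore")
    weights = sim.loc[item, rated.index]
    if weights.sum() == 0:
        # Fall back to the global mean when no similar rated book exists
        return float(np.nanmean(matrix.to_numpy(dtype=float)))
    return float((weights * rated).sum() / weights.sum())


def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted ratings."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


sim = item_similarity(matrix)
pred_2_c = predict_rating(matrix, sim, user=2, item="C")  # user 2 has not rated book C
print(f"Predicted rating of user 2 for book C: {pred_2_c:.2f}")
```

In a real evaluation, `rmse` would be computed on ratings held out from training rather than on hand-picked values.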

## Data Limitations

Before embarking on this project, it's important to consider potential limitations of the data:

* Data Sparsity: The ratings_df might suffer from sparsity, meaning that most users have rated only a small fraction of the available books. Sparse data leaves collaborative filtering with few overlapping ratings from which to compute similarities.

* Cold Start Problem: Collaborative filtering has little to work with for new users, or for new books, that have few or no ratings, so recommendations for them will be unreliable.

* Data Bias: The data might reflect biases present in the user base or the book catalog. For example, certain genres or authors might be over-represented.

* Data Quality: The datasets might contain inconsistencies, errors, or missing values that need to be addressed during preprocessing.

* Implicit vs. Explicit Feedback: The ratings_df likely contains explicit feedback (numerical ratings). However, other forms of implicit feedback (e.g., browsing history, purchase history) might not be available, potentially limiting the richness of user preference data.

* Evolution of Preferences: User preferences can change over time, which might not be captured effectively by static datasets.
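Sparsity and cold-start exposure can be quantified before any modeling. The sketch below is one way to do it, shown on a hypothetical toy frame; the helper name `sparsity_report` and the `min_ratings` threshold are assumptions, and users who never rated anything do not appear in a ratings table at all, so they are not counted here.

```python
import pandas as pd


def sparsity_report(ratings, user_col="User-ID", item_col="ISBN", min_ratings=2):
    """Summarize how empty the user-item matrix is and how many users/books are 'cold'."""
    n_users = ratings[user_col].nunique()
    n_items = ratings[item_col].nunique()
    # Count each (user, book) pair once
    observed = len(ratings.drop_duplicates([user_col, item_col]))
    density = observed / (n_users * n_items)
    per_user = ratings.groupby(user_col).size()
    per_item = ratings.groupby(item_col).size()
    return {
        "n_users": n_users,
        "n_items": n_items,
        "sparsity": 1 - density,  # share of the matrix with no rating
        "cold_users": int((per_user < min_ratings).sum()),
        "cold_items": int((per_item < min_ratings).sum()),
    }


# Hypothetical toy data with the same column names as ratings_df
toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "ISBN": ["A", "B", "A", "C"],
    "Book-Rating": [8, 6, 7, 9],
})

report = sparsity_report(toy_ratings)
print(report)
```

Calling `sparsity_report(ratings_df)` on the real data would give the corresponding figures for this project.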


# Data Understanding

## Loading Data

In [None]:
import pandas as pd

# Parsing options shared by the three semicolon-delimited CSV files
read_opts = dict(
    sep=';',              # fields are separated by semicolons
    quotechar='"',        # text fields are wrapped in double quotes
    encoding='latin1',    # the files contain non-UTF-8 characters
    on_bad_lines='skip',  # malformed rows are dropped silently
    engine='python',      # the python engine is more tolerant of irregular rows
)

# Load the three datasets
books_df = pd.read_csv('../books_df.csv', **read_opts)
ratings_df = pd.read_csv('../ratings_df.csv', **read_opts)
user_df = pd.read_csv('../user_df.csv', **read_opts)
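Because `on_bad_lines='skip'` drops malformed rows without reporting them, it can help to know how many rows were lost. A minimal sketch, assuming pandas 1.4 or later, where `on_bad_lines` also accepts a callable when `engine='python'`; the in-memory CSV and the `record_bad_line` helper are hypothetical:

```python
import io

import pandas as pd

# Hypothetical in-memory file; the second data row has one field too many
raw_csv = '''ISBN;Book-Title;Book-Author
"0001";"Good Book";"A. Author"
"0002";"Broken Row";"B. Author";"unexpected extra field"
"0003";"Another Book";"C. Author"
'''

skipped_rows = []


def record_bad_line(fields):
    """Remember the malformed row, then drop it by returning None."""
    skipped_rows.append(fields)
    return None


toy_books = pd.read_csv(
    io.StringIO(raw_csv),
    sep=';',
    quotechar='"',
    dtype=str,                     # keep ISBNs as text so leading zeros survive
    on_bad_lines=record_bad_line,  # callable form requires engine='python'
    engine='python',
)

print(f"Rows kept: {len(toy_books)}, rows skipped: {len(skipped_rows)}")
```

The same callable could be passed when loading the real files to log how many rows each one loses.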

In [14]:
books_df.head()

| | ISBN | Book-Title | Book-Author | Year-Of-Publication | Publisher | Image-URL-S | Image-URL-M | Image-URL-L |
|---|---|---|---|---|---|---|---|---|
| 0 | 0195153448 | Classical Mythology | Mark P. O. Morford | 2002 | Oxford University Press | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... |
| 1 | 0002005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... |
| 2 | 0060973129 | Decision in Normandy | Carlo D'Este | 1991 | HarperPerennial | http://images.amazon.com/images/P/0060973129.0... | http://images.amazon.com/images/P/0060973129.0... | http://images.amazon.com/images/P/0060973129.0... |
| 3 | 0374157065 | Flu: The Story of the Great Influenza Pandemic... | Gina Bari Kolata | 1999 | Farrar Straus Giroux | http://images.amazon.com/images/P/0374157065.0... | http://images.amazon.com/images/P/0374157065.0... | http://images.amazon.com/images/P/0374157065.0... |
| 4 | 0393045218 | The Mummies of Urumchi | E. J. W. Barber | 1999 | W. W. Norton & Company | http://images.amazon.com/images/P/0393045218.0... | http://images.amazon.com/images/P/0393045218.0... | http://images.amazon.com/images/P/0393045218.0... |


In [15]:
# Print the frame's text repr (first and last rows); books_df.info() would instead list dtypes and non-null counts
print(books_df)

              ISBN                                         Book-Title  \
0       0195153448                                Classical Mythology   
1       0002005018                                       Clara Callan   
2       0060973129                               Decision in Normandy   
3       0374157065  Flu: The Story of the Great Influenza Pandemic...   
4       0393045218                             The Mummies of Urumchi   
...            ...                                                ...   
270486  0440400988                         There's a Bat in Bunk Five   
270487  0525447644                            From One to One Hundred   
270488  006008667X  Lily Dale : The True Story of the Town that Ta...   
270489  0192126040                        Republic (World's Classics)   
270490  0767409752  A Guided Tour of Rene Descartes' Meditations o...   

                 Book-Author  Year-Of-Publication  \
0         Mark P. O. Morford          

In [12]:
ratings_df.head()

| | User-ID | ISBN | Book-Rating |
|---|---|---|---|
| 0 | 276725 | 034545104X | 0 |
| 1 | 276726 | 0155061224 | 5 |
| 2 | 276727 | 0446520802 | 0 |
| 3 | 276729 | 052165615X | 3 |
| 4 | 276729 | 0521795028 | 6 |


In [16]:
# Print the frame's text repr (first and last rows plus shape); ratings_df.info() would instead list dtypes and non-null counts
print(ratings_df)

         User-ID         ISBN  Book-Rating
0         276725   034545104X            0
1         276726   0155061224            5
2         276727   0446520802            0
3         276729   052165615X            3
4         276729   0521795028            6
...          ...          ...          ...
1149761   276704   1563526298            9
1149762   276706   0679447156            0
1149763   276709   0515107662           10
1149764   276721   0590442449           10
1149765   276723  05162443314            8

[1149766 rows x 3 columns]

In [13]:
user_df.head()

| | User-ID | Location | Age |
|---|---|---|---|
| 0 | 1 | nyc, new york, usa | NaN |
| 1 | 2 | stockton, california, usa | 18.0 |
| 2 | 3 | moscow, yukon territory, russia | NaN |
| 3 | 4 | porto, v.n.gaia, portugal | 17.0 |
| 4 | 5 | farnborough, hants, united kingdom | NaN |


In [17]:
# Print the frame's text repr (first and last rows plus shape); user_df.info() would instead list dtypes and non-null counts
print(user_df)

        User-ID                            Location   Age
0             1                  nyc, new york, usa   NaN
1             2           stockton, california, usa  18.0
2             3     moscow, yukon territory, russia   NaN
3             4           porto, v.n.gaia, portugal  17.0
4             5  farnborough, hants, united kingdom   NaN
...         ...                                 ...   ...
278172   278854               portland, oregon, usa   NaN
278173   278855  tacoma, washington, united kingdom  50.0
278174   278856           brampton, ontario, canada   NaN
278175   278857           knoxville, tennessee, usa   NaN
278176   278858                dublin, n/a, ireland   NaN

[278177 rows x 3 columns]
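For column dtypes and non-null counts, `DataFrame.info()` has to be called with parentheses; referencing the attribute alone only returns the bound method. A small sketch on a toy frame whose three rows are copied from the first rows of the user data shown above (the frame name `toy_users` is hypothetical):

```python
import io

import pandas as pd

# First three user rows from the output above
toy_users = pd.DataFrame({
    "User-ID": [1, 2, 3],
    "Location": [
        "nyc, new york, usa",
        "stockton, california, usa",
        "moscow, yukon territory, russia",
    ],
    "Age": [None, 18.0, None],
})

# Without parentheses this is just a method object, not a summary
is_method_only = callable(toy_users.info)

# Calling the method writes the summary; buf= captures it as text
buffer = io.StringIO()
toy_users.info(buf=buffer)
summary = buffer.getvalue()
print(summary)
```

Running `books_df.info()`, `ratings_df.info()`, and `user_df.info()` in the notebook would print the equivalent summaries for the real data.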

In [18]:
# Checking for duplicates
print("\n--- Duplicates ---")
print("Duplicate rows in books_df:", books_df.duplicated().sum())
print("Duplicate rows in ratings_df:", ratings_df.duplicated().sum())
print("Duplicate rows in user_df:", user_df.duplicated().sum())


--- Duplicates ---
Duplicate rows in books_df: 0
Duplicate rows in ratings_df: 0
Duplicate rows in user_df: 0
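`duplicated()` with no arguments flags only rows that match in every column, so zero full-row duplicates does not rule out repeated keys, such as the same user rating the same ISBN twice with different scores. A minimal sketch of a key-level check on hypothetical toy data; which columns count as keys is an assumption:

```python
import pandas as pd

# Hypothetical toy ratings: user 10 rated ISBN "0001" twice with different scores
toy_ratings = pd.DataFrame({
    "User-ID": [10, 10, 11],
    "ISBN": ["0001", "0001", "0001"],
    "Book-Rating": [5, 8, 7],
})

# Rows identical in every column
full_row_dupes = int(toy_ratings.duplicated().sum())

# Rows that repeat an earlier (User-ID, ISBN) pair, whatever the rating
key_dupes = int(toy_ratings.duplicated(subset=["User-ID", "ISBN"]).sum())

print("Full-row duplicates:", full_row_dupes)
print("Duplicate (User-ID, ISBN) pairs:", key_dupes)
```

The same pattern would apply to `books_df` with `subset=["ISBN"]` and to `user_df` with `subset=["User-ID"]`.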


In [19]:
# Checking for missing values
print("\n--- Missing Values ---")
print("Missing values in books_df:\n", books_df.isnull().sum())
print("Missing values in ratings_df:\n", ratings_df.isnull().sum())
print("Missing values in user_df:\n", user_df.isnull().sum())


--- Missing Values ---
Missing values in books_df:
 ISBN                   0
Book-Title             0
Book-Author            2
Year-Of-Publication    0
Publisher              2
Image-URL-S            0
Image-URL-M            0
Image-URL-L            0
dtype: int64
Missing values in ratings_df:
 User-ID        0
ISBN           0
Book-Rating    0
dtype: int64
Missing values in user_df:
 User-ID          0
Location         0
Age         110508
dtype: int64
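Raw counts are easier to judge as shares of the table. The sketch below converts the Age count reported above into a fraction of the 278177 user rows and shows a small helper for doing the same across all columns; the helper name `missing_share` and the toy frame are hypothetical.

```python
import pandas as pd

# Counts copied from the outputs above
age_missing = 110_508
user_rows = 278_177
age_missing_share = age_missing / user_rows
print(f"Share of users without an age: {age_missing_share:.1%}")


def missing_share(df):
    """Fraction of missing values per column, largest first."""
    return df.isnull().mean().sort_values(ascending=False)


# Hypothetical toy frame to show the helper
toy_users = pd.DataFrame({
    "User-ID": [1, 2, 3, 4],
    "Age": [None, 18.0, None, 17.0],
})
toy_share = missing_share(toy_users)
print(toy_share)
```

With roughly two in five ages missing, dropping those rows would discard a large part of the user table, so imputation or leaving Age out of the model may be worth weighing during preprocessing.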
