## Business Understanding

### 1.1 Overview
As data volumes grow, systems that exploit large datasets have become increasingly common. Among them, recommendation systems act as information filters: they surface items that are relevant to a user based on search queries or browsing history. Major technology companies use them across many applications: YouTube uses one to choose the next autoplay video, while Spotify uses one to curate personalized "Made for You" daily mixes.

This project applies data analysis to recommend books that match each user's interests. By examining user behavior, both individual and collective, we can derive insights that support tailored book recommendations.

The underlying principle is to use data-driven techniques to understand user preferences and behaviors. Analyzing user interactions, historical data, and recurring patterns lets us present each user with a curated list of books that are likely to match their tastes.

### 1.2 Problem Statement
Book-Crossing wants to optimize its recommendation system so that it suggests new and varied books while keeping the emphasis on relevance to each user's tastes.

We have been appointed as Junior Data Scientists by Book-Crossing to optimize its book recommendation system, with the aim of increasing customer engagement and improving the relevance of recommendations.
The data set was obtained from here

### 1.3 The Data
The Book-Crossing dataset comprises 3 files:

`Users`: Contains the users. Note that user IDs (User-ID) have been anonymized and map to integers. Demographic data is provided (Location, Age) if available. Otherwise, these fields contain NULL values.

`Books`: Books are identified by their respective ISBN. Invalid ISBNs have already been removed from the dataset. Moreover, some content-based information is given (Book-Title, Book-Author, Year-Of-Publication, Publisher), obtained from Amazon Web Services. Note that in the case of several authors, only the first is provided. URLs linking to cover images are also given, appearing in three different flavors (Image-URL-S, Image-URL-M, Image-URL-L), i.e., small, medium, large. These URLs point to the Amazon website.

`Ratings`: Contains the book rating information. Ratings (Book-Rating) are either explicit, expressed on a scale from 1-10 (higher values denoting higher appreciation), or implicit, expressed by 0.
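
Because a 0 marks an implicit interaction rather than a low score, explicit and implicit ratings usually need to be separated before averaging or training a model. The following is a minimal sketch of that split, assuming a ratings DataFrame with the `Book-Rating` column described above. The helper name `split_ratings` and the sample rows are illustrative and not part of the project code.

```python
import pandas as pd

def split_ratings(ratings: pd.DataFrame):
    """Return (explicit, implicit) subsets of a ratings DataFrame."""
    # Explicit ratings are on the 1-10 scale; 0 denotes an implicit rating.
    explicit = ratings[ratings["Book-Rating"].between(1, 10)]
    implicit = ratings[ratings["Book-Rating"] == 0]
    return explicit, implicit

# Tiny made-up sample using the same column names as the dataset
sample = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "ISBN": ["034545104X", "0155061224", "0446520802", "052165615X"],
    "Book-Rating": [0, 5, 0, 3],
})

explicit, implicit = split_ratings(sample)
print(len(explicit), "explicit;", len(implicit), "implicit")
print("Mean explicit rating:", explicit["Book-Rating"].mean())
```

Averaging over all rows instead would pull the mean toward 0, which is why the split is usually done first.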


### 1.5 Project Goals
1. Develop a Personalized Book Recommendation Model
Build a robust recommendation engine using collaborative filtering (user-user or item-item) and/or matrix factorization techniques that predicts book preferences by analyzing the behavior of similar users. The model should rank books based on predicted ratings or affinity scores.

2. Address the Cold Start Problem
Design strategies to recommend books to new users who have little or no interaction data.


3. Model Evaluation and Optimization
Continuously optimize the recommendation model using performance metrics such as Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Precision@K/Recall@K. Apply techniques like cross-validation and hyperparameter tuning (e.g., for matrix factorization or neural collaborative filtering) to improve model accuracy and relevance.

4. Implement a Scalable Recommendation Function
Develop a reusable function (e.g., get_top_n_recommendations(user_id, n)) that returns the top N most relevant book recommendations for a given user, considering both model predictions and business logic (e.g., genre diversity, recency).


5. System Deployment and User Interface Integration
Deploy the recommendation engine using a backend framework (like Flask or FastAPI), connect it to a frontend or chatbot interface, and expose endpoints for fetching personalized book suggestions. Ensure API scalability, reliability, and response time optimization.
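
To make goals 1, 2 and 4 more concrete, here is a minimal sketch of item-item collaborative filtering with a popularity fallback for cold-start users. It is an illustration under stated assumptions, not the project's final model: the `RATINGS` dictionary is made-up data, the helper names `item_vectors`, `cosine` and `popular_books` are hypothetical, and the `ratings` parameter is added here only so the function can be tried on toy data.

```python
from collections import defaultdict
from math import sqrt

# Made-up explicit ratings: {user_id: {book_id: rating}}
RATINGS = {
    "u1": {"A": 9, "B": 8, "C": 2},
    "u2": {"A": 8, "B": 9, "D": 7},
    "u3": {"B": 7, "C": 3, "D": 8},
}

def item_vectors(ratings):
    """Invert user -> book ratings into book -> user ratings."""
    items = defaultdict(dict)
    for user, rated in ratings.items():
        for book, score in rated.items():
            items[book][user] = score
    return items

def cosine(a, b):
    """Cosine similarity between two sparse rating vectors stored as dicts."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def popular_books(ratings, n):
    """Cold-start fallback: rank by number of ratings, then by mean rating."""
    items = item_vectors(ratings)
    def popularity(book):
        scores = items[book].values()
        return (len(scores), sum(scores) / len(scores))
    return sorted(items, key=popularity, reverse=True)[:n]

def get_top_n_recommendations(user_id, n, ratings=RATINGS):
    """Return up to n unseen books, ranked by similarity-weighted rating."""
    seen = ratings.get(user_id, {})
    if not seen:  # new user with no interaction data
        return popular_books(ratings, n)
    items = item_vectors(ratings)
    predicted = {}
    for candidate, vector in items.items():
        if candidate in seen:
            continue
        sims = [(cosine(vector, items[book]), score) for book, score in seen.items()]
        weight = sum(sim for sim, _ in sims)
        if weight > 0:
            predicted[candidate] = sum(sim * score for sim, score in sims) / weight
    return sorted(predicted, key=predicted.get, reverse=True)[:n]

print(get_top_n_recommendations("u1", 2))        # ['D']
print(get_top_n_recommendations("new_user", 2))  # ['B', 'A']
```

The fallback ranks by rating count before mean rating so that a book with a single high rating does not outrank widely rated titles; other tie-breaking rules would be equally defensible.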

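
The evaluation metrics named in goal 3 can be sketched in plain Python as follows. The function names and the sample values are assumptions made for illustration; in practice a library implementation may be preferable.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root Mean Square Error between two equal-length rating lists."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean Absolute Error between two equal-length rating lists."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def precision_recall_at_k(recommended, relevant, k):
    """Precision@K and Recall@K for one user's ranked recommendation list."""
    hits = sum(1 for book in recommended[:k] if book in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Made-up ratings and predictions for four (user, book) pairs
actual = [8, 6, 9, 4]
predicted = [7, 6, 7, 5]
print("RMSE:", round(rmse(actual, predicted), 4))
print("MAE:", mae(actual, predicted))

# Made-up ranking: two of the top three recommendations are relevant
p, r = precision_recall_at_k(["A", "B", "C", "D"], {"A", "C", "E"}, k=3)
print("Precision@3:", round(p, 4), "Recall@3:", round(r, 4))
```

RMSE and MAE measure rating-prediction error, while Precision@K and Recall@K measure ranking quality, so the two groups can disagree about which model is better.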

## Data Understanding

### `Data Loading`

In [18]:
#Importing necessary libraries required to load and inspect the datasets.

import pandas as pd
import numpy as np


In [None]:
#importing the books data

books_df = pd.read_csv(
    r'C:\Users\AHB\Desktop\my_quick_acess\3.Projects\projects pipeline\2025\4. Library recomender system\books_df.csv',
    sep=';',
    encoding='latin1',
    quotechar='"',
    on_bad_lines='skip',
    engine='python'  # Use python engine for more flexibility
)

books_df.head()


| | ISBN | Book-Title | Book-Author | Year-Of-Publication | Publisher | Image-URL-S | Image-URL-M | Image-URL-L |
|---|---|---|---|---|---|---|---|---|
| 0 | 195153448 | Classical Mythology | Mark P. O. Morford | 2002 | Oxford University Press | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... |
| 1 | 2005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... |
| 2 | 60973129 | Decision in Normandy | Carlo D'Este | 1991 | HarperPerennial | http://images.amazon.com/images/P/0060973129.0... | http://images.amazon.com/images/P/0060973129.0... | http://images.amazon.com/images/P/0060973129.0... |
| 3 | 374157065 | Flu: The Story of the Great Influenza Pandemic... | Gina Bari Kolata | 1999 | Farrar Straus Giroux | http://images.amazon.com/images/P/0374157065.0... | http://images.amazon.com/images/P/0374157065.0... | http://images.amazon.com/images/P/0374157065.0... |
| 4 | 393045218 | The Mummies of Urumchi | E. J. W. Barber | 1999 | W. W. Norton &amp; Company | http://images.amazon.com/images/P/0393045218.0... | http://images.amazon.com/images/P/0393045218.0... | http://images.amazon.com/images/P/0393045218.0... |
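
In the preview above, the ISBN values appear without their leading zeros (for example `195153448`, while the matching image URL contains `0195153448`), which suggests they were parsed as numbers at some point. One way to guard against this, sketched here on a made-up in-memory sample rather than the real file, is to pass `dtype={'ISBN': str}` to `pd.read_csv`. The sample also contains one malformed line to show what `on_bad_lines='skip'` does.

```python
import io
import pandas as pd

# Made-up, semicolon-separated sample mimicking the layout of the books file
raw = (
    '"ISBN";"Book-Title";"Year-Of-Publication"\n'
    '"0195153448";"Classical Mythology";"2002"\n'
    '"0002005018";"Clara Callan";"2001"\n'
    '"0060973129";"Decision in Normandy";"1991";"unexpected extra field"\n'
)

sample_books = pd.read_csv(
    io.StringIO(raw),
    sep=';',
    quotechar='"',
    dtype={'ISBN': str},   # keep ISBNs as text so leading zeros survive
    on_bad_lines='skip',   # the last line has too many fields and is dropped
    engine='python',
)

print(sample_books)
print(len(sample_books), "rows loaded")
```

Since skipped lines are dropped without a message, comparing the loaded row count with the expected size of the file is a cheap sanity check.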


In [None]:
#importing the rating dataset

rating_df = pd.read_csv(
    r'C:\Users\AHB\Desktop\my_quick_acess\3.Projects\projects pipeline\2025\4. Library recomender system\ratings_df.csv',
    sep=';',
    encoding='latin1',
    quotechar='"',
    on_bad_lines='skip',
    engine='python'  
)

rating_df.head()


| | User-ID | ISBN | Book-Rating |
|---|---|---|---|
| 0 | 276725 | 034545104X | 0 |
| 1 | 276726 | 0155061224 | 5 |
| 2 | 276727 | 0446520802 | 0 |
| 3 | 276729 | 052165615X | 3 |
| 4 | 276729 | 0521795028 | 6 |


In [7]:
#importing the users dataset

user_df = pd.read_csv(
    r'C:\Users\AHB\Desktop\my_quick_acess\3.Projects\projects pipeline\2025\4. Library recomender system\user_df.csv',
    sep=';',
    encoding='latin1',
    quotechar='"',
    on_bad_lines='skip',
    engine='python'  # Use python engine for more flexibility
)

user_df.head()


| | User-ID | Location | Age |
|---|---|---|---|
| 0 | 1 | nyc, new york, usa | |
| 1 | 2 | stockton, california, usa | 18.0 |
| 2 | 3 | moscow, yukon territory, russia | |
| 3 | 4 | porto, v.n.gaia, portugal | 17.0 |
| 4 | 5 | farnborough, hants, united kingdom | |


### `Data Inspection`

In [12]:
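# Function to check the shape, column info and data types of a dataset
# (despite its name, it does not print descriptive statistics)

def get_info_shape_stats(dataset, dataset_name):
    """Print the shape, `info()` summary and dtypes of a DataFrame."""
    print('The Dataset:', dataset_name)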

    print(f"has {dataset.shape[0]} rows and {dataset.shape[1]} columns")

    print('-----------------------------------------------------------------------------------------------------------------------------')
    print('-----------------------------------------------------------------------------------------------------------------------------')

    print(dataset.info())
    print('-----------------------------------------------------------------------------------------------------------------------------')
    print('-----------------------------------------------------------------------------------------------------------------------------')
    print(dataset.dtypes)
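
The function above reports shape, `info()` and dtypes, but no descriptive statistics. If those are wanted as well, one possible companion is sketched below. The helper name `get_descriptive_stats` is hypothetical, and the sample reuses the first four rows of the ratings preview shown earlier.

```python
import pandas as pd

def get_descriptive_stats(dataset: pd.DataFrame) -> pd.DataFrame:
    """Return summary statistics for numeric and non-numeric columns alike."""
    return dataset.describe(include='all')

sample = pd.DataFrame({
    "User-ID": [276725, 276726, 276727, 276729],
    "ISBN": ["034545104X", "0155061224", "0446520802", "052165615X"],
    "Book-Rating": [0, 5, 0, 3],
})

stats = get_descriptive_stats(sample)
print(stats.loc[["count", "mean", "min", "max"], "Book-Rating"])
print("Distinct ISBNs:", stats.loc["unique", "ISBN"])
```

With `include='all'`, numeric-only rows such as `mean` are `NaN` for text columns, and text-only rows such as `unique` are `NaN` for numeric columns.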

In [13]:
#calling on the function get_info_shape_stats

get_info_shape_stats(rating_df, 'Book Ratings')

The Dataset: Book Ratings
has 1149766 rows and 3 columns
-----------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149766 entries, 0 to 1149765
Data columns (total 3 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   User-ID      1149766 non-null  int64 
 1   ISBN         1149766 non-null  object
 2   Book-Rating  1149766 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 26.3+ MB
None
-----------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------
User-ID         int64
ISBN           object
Book-Rating     int64
dtype: object

1. `Overview of the rating dataset`
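    - The 'Book Ratings' dataset, as loaded, contains a total of 1,149,766 rows and 3 columns (see the output above). Rows that failed to parse were skipped at load time (`on_bad_lines='skip'`), so this count can be lower than that of the raw file. Here are some key observations about the dataset:

    - The dataset consists of the following columns:

        1. User-ID: An anonymized identifier for the users.
        2. ISBN: The unique identifier for the books.
        3. Book-Rating: The rating given by the users for the books. Values range from 0 to 10: 1 to 10 are explicit ratings, with higher values indicating higher appreciation, and 0 denotes an implicit rating.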

    - The dataset has no missing values as indicated by the 'Non-Null Count' column.

    - Data types:

        1. User-ID and Book-Rating columns are of integer type (int64).
        2. ISBN column is of object type (string).

In [16]:
get_info_shape_stats(user_df, 'users')

The Dataset: users
has 278177 rows and 3 columns
-----------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278177 entries, 0 to 278176
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   User-ID   278177 non-null  int64  
 1   Location  278177 non-null  object 
 2   Age       167669 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB
None
-----------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------
User-ID       int64
Location     object
Age         float64
dtype: object

`Overview of the user dataset`

1. The 'users' dataset, as loaded, contains a total of `278,177 rows` and `3 columns` (see the output above). Here are some key observations about the dataset:

2. The dataset consists of the following columns:

    - User-ID: An anonymized unique identifier for the users.
    - Location: The location of the users.
    - Age: The age of the users.

3. The dataset `has some missing values` in the 'Age' column: only 167,669 of the 278,177 rows are non-null, so 110,508 ages (roughly 40%) are missing. 'User-ID' and 'Location' have no missing values.

4. Data types:

    - The 'User-ID' column is of integer type (int64).
    - The 'Location' column is of object type (string).
    - The 'Age' column is of float type (float64).
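
The gap between 'Non-Null Count' and the total number of rows can also be computed directly instead of being read off the `info()` output. A minimal sketch, assuming a hypothetical helper `missing_summary` and using the five rows of the users preview shown earlier as sample data:

```python
import pandas as pd

def missing_summary(dataset: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    missing = dataset.isna().sum()
    return pd.DataFrame({
        "missing": missing,
        "percent": (missing / len(dataset) * 100).round(2),
    })

sample_users = pd.DataFrame({
    "User-ID": [1, 2, 3, 4, 5],
    "Location": [
        "nyc, new york, usa",
        "stockton, california, usa",
        "moscow, yukon territory, russia",
        "porto, v.n.gaia, portugal",
        "farnborough, hants, united kingdom",
    ],
    "Age": [None, 18.0, None, 17.0, None],
})

summary = missing_summary(sample_users)
print(summary)
```

Applied to the full `user_df`, the same helper would be expected to report the missing 'Age' values noted in the overview.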

In [17]:
get_info_shape_stats(books_df, 'books')

The Dataset: books
has 270491 rows and 8 columns
-----------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 270491 entries, 0 to 270490
Data columns (total 8 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   ISBN                 270491 non-null  object
 1   Book-Title           270491 non-null  object
 2   Book-Author          270489 non-null  object
 3   Year-Of-Publication  270491 non-null  int64 
 4   Publisher            270489 non-null  object
 5   Image-URL-S          270491 non-null  object
 6   Image-URL-M          270491 non-null  object
 7   Image-URL-L          270491 non-null  object
dtypes: int64(1), object(7)
memory usage: 16.5+ MB
None
---------------------------------

`Overview of the books dataset`

1. The 'books' dataset, as loaded, contains a total of 270,491 rows and 8 columns (see the output above). Here are some key observations about the dataset:

2. The dataset consists of the following columns:

    - ISBN: The unique identifier for the books.
    - Book-Title: The title of the books.
    - Book-Author: The author of the books.
    - Year-Of-Publication: The year when the books were published.
    - Publisher: The publisher of the books.
    - Image-URL-S: The URL of the small-sized cover image of the books.
    - Image-URL-M: The URL of the medium-sized cover image of the books.
    - Image-URL-L: The URL of the large-sized cover image of the books.

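3. The dataset has `some missing values` in the 'Book-Author' and 'Publisher' columns (2 each: 270,489 non-null out of 270,491), as indicated by the 'Non-Null Count' column. All other columns, including 'Image-URL-L', are complete.

4. Data types:

    - 'Year-Of-Publication' is of integer type (int64).
    - The remaining seven columns are of object type (string).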

************************************************************************************************************************************************