# Research notebook: book recommendation using collaborative filtering

In [194]:
import pandas as pd
import numpy as np


In [195]:
# Loading the books dataset
books = pd.read_csv('BX-Books.csv', sep=';', on_bad_lines='skip', encoding='latin-1', low_memory=False)
# Encoding is latin-1 to handle special characters in book titles and authors
# low_memory=False reads the entire file at once to avoid dtype inference warnings

In [196]:
books.head(5)

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


## Books Dataset Overview

The `BX-Books.csv` file contains information about books in the Book-Crossing dataset. The dataset has the following columns:

### Dataset Structure:
- **ISBN**: Unique International Standard Book Number - the primary identifier for each book
- **Book-Title**: The title of the book
- **Book-Author**: The author(s) of the book
- **Year-Of-Publication**: The year the book was published
- **Publisher**: The publisher of the book
- **Image-URL-S**: Small image URL from Amazon (book cover thumbnail)
- **Image-URL-M**: Medium image URL from Amazon (book cover)
- **Image-URL-L**: Large image URL from Amazon (book cover high resolution)

### Observations:
- The five sample rows shown span publication years from 1991 to 2002
- Books are from various publishers like Oxford University Press, HarperFlamingo, etc.
- Image URLs are provided for visualization purposes (though some may be broken)
- This is a structured dataset suitable for recommendation system analysis

In [197]:
books.shape

(271360, 8)

In [198]:
books.columns

Index(['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher',
       'Image-URL-S', 'Image-URL-M', 'Image-URL-L'],
      dtype='object')

In [199]:
# Selecting only relevant columns (ISBN, book info, and large image URL)
books = books[['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher', 'Image-URL-L']]

In [200]:
# Renaming columns to more concise names ('Publisher' is already concise and stays as-is)
# Reassigning avoids modifying a column-subset frame in place
books = books.rename(columns={
    'Book-Title': 'Title',
    'Book-Author': 'Author',
    'Year-Of-Publication': 'Year',
    'Image-URL-L': 'Image-URL'
})

In [201]:
books.head()

Unnamed: 0,ISBN,Title,Author,Year,Publisher,Image-URL
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...


In [202]:
# Loading the users dataset
users = pd.read_csv('BX-Users.csv', sep=';', on_bad_lines='skip', encoding='latin-1', low_memory=False)
# Encoding is latin-1 to handle special characters in user location

In [203]:
users.head()

Unnamed: 0,User-ID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [204]:
# Loading the book ratings dataset
ratings = pd.read_csv('BX-Book-Ratings.csv', sep=';', on_bad_lines='skip', encoding='latin-1', low_memory=False)
# Encoding is latin-1 to handle special characters

In [205]:
ratings.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


## Data Relationships Overview

The Book-Crossing dataset consists of three interconnected tables that enable collaborative filtering:

### Dataset Relationships:

1. **Books Dataset** (`BX-Books.csv`)
   - Contains book metadata identified by **ISBN**
   - Includes title, author, year, publisher, and image URLs
   - Acts as the catalog of all available books

2. **Users Dataset** (`BX-Users.csv`)
   - Contains user records identified by **User-ID**
   - Includes location and age (with some missing values)
   - Represents the user base of the Book-Crossing community

3. **Ratings Dataset** (`BX-Book-Ratings.csv`)
   - Acts as the **junction table** connecting users and books
   - Contains three key columns:
     - **User-ID**: Reference to a user
     - **ISBN**: Reference to a book
     - **Book-Rating**: User's rating for the book (explicit scores run from 1 to 10; 0 marks an implicit interaction)

### Key Insights:
- Each rating record represents a **User-Book interaction**
- Multiple ratings can exist per user (one user, many books)
- Multiple ratings can exist per book (one book, many users)
- This creates a **user-item interaction matrix**, the input that collaborative filtering works from
- A rating of 0 is an implicit rating: the user interacted with the book but did not give it a score
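
In the Book-Crossing data a rating of 0 records an implicit interaction rather than a score, so it can be useful to look at the two kinds of rows separately. The sketch below does this on a toy frame. The rows and the names `toy_ratings`, `explicit`, and `implicit` are made up for illustration and are not part of this notebook's pipeline.

```python
import pandas as pd

# Toy stand-in for BX-Book-Ratings.csv (values are invented for illustration)
toy_ratings = pd.DataFrame({
    'User-ID': [1, 1, 2, 3, 3],
    'ISBN': ['A', 'B', 'A', 'B', 'C'],
    'Book-Rating': [0, 8, 5, 0, 10],
})

# Explicit ratings carry a score from 1 to 10; 0 marks an implicit interaction
explicit = toy_ratings[toy_ratings['Book-Rating'] > 0]
implicit = toy_ratings[toy_ratings['Book-Rating'] == 0]

print(f"explicit rows: {len(explicit)}, implicit rows: {len(implicit)}")
print(f"mean explicit score: {explicit['Book-Rating'].mean():.2f}")
```

Whether implicit rows should be kept, dropped, or treated as a weak positive signal is a modelling choice; the rest of this notebook keeps them as 0.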

In [206]:
print(books.shape, users.shape, ratings.shape,sep='\n')

(271360, 6)
(278858, 3)
(1149780, 3)


In [207]:
ratings['User-ID'].value_counts()

User-ID
11676     13602
198711     7550
153662     6109
98391      5891
35859      5850
          ...  
116180        1
116166        1
116154        1
116137        1
276723        1
Name: count, Length: 105283, dtype: int64

# Feature Engineering

In [208]:
# Boolean mask: True for users who have rated more than 200 books
user_rating_counts = ratings['User-ID'].value_counts() > 200
user_rating_counts

User-ID
11676      True
198711     True
153662     True
98391      True
35859      True
          ...  
116180    False
116166    False
116154    False
116137    False
276723    False
Name: count, Length: 105283, dtype: bool

In [209]:
# Count of users who rated more than 200 books
user_rating_counts[user_rating_counts].shape

(899,)

In [210]:
# Filtering the ratings dataframe to include only active users (>200 ratings)
active_users = user_rating_counts[user_rating_counts].index
filtered_ratings = ratings[ratings['User-ID'].isin(active_users)]
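
The same filter can be written with the counts and the threshold kept as separate, named values, which makes the strict "more than" comparison easier to see. This is a minimal sketch on toy data. `MIN_RATINGS`, `toy_ratings`, and `toy_filtered` are hypothetical names, and the threshold of 2 stands in for the 200 used above.

```python
import pandas as pd

# Toy ratings: user 1 has 3 ratings, user 2 has 1, user 3 has 2
toy_ratings = pd.DataFrame({
    'User-ID': [1, 1, 1, 2, 3, 3],
    'ISBN': ['A', 'B', 'C', 'A', 'B', 'C'],
    'Book-Rating': [0, 8, 5, 7, 0, 10],
})

MIN_RATINGS = 2  # hypothetical threshold; the notebook uses 200

# Ratings per user, then the IDs strictly above the threshold
counts = toy_ratings['User-ID'].value_counts()
active_ids = counts[counts > MIN_RATINGS].index

# Keep only the rows that belong to active users
toy_filtered = toy_ratings[toy_ratings['User-ID'].isin(active_ids)]
print(sorted(toy_filtered['User-ID'].unique()))
```

User 3 has exactly 2 ratings and is excluded, because the comparison is `>` rather than `>=`.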

In [211]:
filtered_ratings.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
1456,277427,002542730X,10
1457,277427,0026217457,0
1458,277427,003008685X,8
1459,277427,0030615321,0
1460,277427,0060002050,0


In [212]:
# Merging filtered ratings with books dataframe on ISBN
ratings_with_books = pd.merge(filtered_ratings, books, on='ISBN')

In [213]:
ratings_with_books.head()

Unnamed: 0,User-ID,ISBN,Book-Rating,Title,Author,Year,Publisher,Image-URL
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,http://images.amazon.com/images/P/002542730X.0...
1,3363,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,http://images.amazon.com/images/P/002542730X.0...
2,11676,002542730X,6,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,http://images.amazon.com/images/P/002542730X.0...
3,12538,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,http://images.amazon.com/images/P/002542730X.0...
4,13552,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,http://images.amazon.com/images/P/002542730X.0...


In [214]:
# Grouping by title to count the number of ratings per book
book_rating_counts = ratings_with_books.groupby('Title')['Book-Rating'].count().reset_index()

In [215]:
book_rating_counts.head()

Unnamed: 0,Title,Book-Rating
0,A Light in the Storm: The Civil War Diary of ...,2
1,Always Have Popsicles,1
2,Apple Magic (The Collector's series),1
3,Beyond IBM: Leadership Marketing and Finance ...,1
4,Clifford Visita El Hospital (Clifford El Gran...,1


In [216]:
# Renaming the Book-Rating column to Num_Ratings
book_rating_counts.rename(columns={'Book-Rating': 'Num_Ratings'}, inplace=True)

In [217]:
# Merging the rating counts back to the main dataframe
ratings_with_books = pd.merge(ratings_with_books, book_rating_counts, on='Title')

In [218]:
# Verify the new Num_Ratings column has been added
ratings_with_books.head()

Unnamed: 0,User-ID,ISBN,Book-Rating,Title,Author,Year,Publisher,Image-URL,Num_Ratings
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,http://images.amazon.com/images/P/002542730X.0...,82
1,3363,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,http://images.amazon.com/images/P/002542730X.0...,82
2,11676,002542730X,6,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,http://images.amazon.com/images/P/002542730X.0...,82
3,12538,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,http://images.amazon.com/images/P/002542730X.0...,82
4,13552,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,http://images.amazon.com/images/P/002542730X.0...,82


In [219]:
# Filtering out books with fewer than 50 ratings to ensure quality recommendations
final_ratings = ratings_with_books[ratings_with_books['Num_Ratings'] >= 50]
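
The count, rename, and merge steps that produce `Num_Ratings` can also be collapsed into one `groupby(...).transform('count')` call. The sketch below shows that alternative on toy data. It is not what the cells above do, and `toy`, `popular`, and `MIN_BOOK_RATINGS` are hypothetical names, with a threshold of 2 standing in for 50.

```python
import pandas as pd

# Toy merged frame: title X has 3 ratings, Y has 2, Z has 1
toy = pd.DataFrame({
    'User-ID': [1, 2, 3, 1, 2, 1],
    'Title': ['X', 'X', 'X', 'Y', 'Y', 'Z'],
    'Book-Rating': [5, 0, 9, 7, 0, 3],
})

MIN_BOOK_RATINGS = 2  # hypothetical threshold; the notebook uses 50

# transform('count') broadcasts each title's rating count back onto its rows
toy['Num_Ratings'] = toy.groupby('Title')['Book-Rating'].transform('count')

# Keep titles that reach the threshold; .copy() gives an independent frame
popular = toy[toy['Num_Ratings'] >= MIN_BOOK_RATINGS].copy()
print(sorted(popular['Title'].unique()))
```

Taking a `.copy()` after the boolean filter means later in-place operations on the result do not trigger pandas' chained-assignment warning.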

In [220]:
# Removing duplicate user-book rating combinations
# (reassigning instead of using inplace=True avoids pandas' SettingWithCopyWarning,
# because final_ratings is a filtered slice of ratings_with_books)
final_ratings = final_ratings.drop_duplicates(['User-ID', 'Title'])


In [221]:
# Checking for null values in the final dataset
final_ratings.isnull().sum()

User-ID        0
ISBN           0
Book-Rating    0
Title          0
Author         0
Year           0
Publisher      0
Image-URL      0
Num_Ratings    0
dtype: int64

In [222]:
# Creating a user-item matrix (books as rows, users as columns, ratings as values)
user_book_matrix = final_ratings.pivot_table(columns='User-ID', index='Title', values='Book-Rating')

In [223]:
user_book_matrix.head()

Title \ User-ID,254,2276,2766,2977,3363,3757,4017,4385,6242,6251,...,274004,274061,274301,274308,274808,275970,277427,277478,277639,278418
1984,9.0,,,,,,,,,,...,,,,,,0.0,,,,
1st to Die: A Novel,,,,,,,,,,,...,,,,,,,,,,
2nd Chance,,10.0,,,,,,,,,...,,,,0.0,,,,,0.0,
4 Blondes,,,,,,,,,,0.0,...,,,,,,,,,,
84 Charing Cross Road,,,,,,,,,,,...,,,,,,10.0,,,,


In [224]:
# Replacing NaN with 0 so that unrated pairs count as "no interaction"
# (note: this makes them indistinguishable from implicit 0 ratings)
user_book_matrix.fillna(0, inplace=True)

In [225]:
user_book_matrix.head()

Title \ User-ID,254,2276,2766,2977,3363,3757,4017,4385,6242,6251,...,274004,274061,274301,274308,274808,275970,277427,277478,277639,278418
1984,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1st to Die: A Novel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2nd Chance,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4 Blondes,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
84 Charing Cross Road,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,10.0,0.0,0.0,0.0,0.0


In [226]:
user_book_matrix.shape

(742, 888)

In [227]:
# Converting to CSR (Compressed Sparse Row) matrix for efficient computation
from scipy.sparse import csr_matrix 
sparse_user_book_matrix = csr_matrix(user_book_matrix.values)

In [228]:
sparse_user_book_matrix

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 14942 stored elements and shape (742, 888)>
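
The repr above reports 14942 stored elements in a 742 × 888 matrix, so about 14942 / 658896 ≈ 2.3% of the cells hold a nonzero rating. The sketch below shows how that density figure can be computed from a CSR matrix. The toy array and the names `toy_dense`, `toy_sparse`, and `density` are illustrative assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy book-by-user matrix with two nonzero ratings out of twelve cells
toy_dense = np.array([
    [9.0, 0.0, 0.0, 0.0],
    [0.0, 10.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])

# Building a CSR matrix from a dense array stores only the nonzero entries
toy_sparse = csr_matrix(toy_dense)

n_rows, n_cols = toy_sparse.shape
density = toy_sparse.nnz / (n_rows * n_cols)
print(f"{toy_sparse.nnz} stored values, density {density:.3f}")
```

A low density like this is the reason a sparse representation pays off: only the stored entries take memory and take part in the distance computations.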

In [229]:
# Fitting a K-Nearest Neighbors index with brute-force search
# (the default metric is Minkowski with p=2, i.e. Euclidean distance)
from sklearn.neighbors import NearestNeighbors
knn_model = NearestNeighbors(algorithm='brute')
knn_model.fit(sparse_user_book_matrix)

In [230]:
# Testing the recommendation system with a sample book at index 250
user_book_matrix.iloc[250, :]

User-ID
254       0.0
2276      0.0
2766      0.0
2977      0.0
3363      0.0
         ... 
275970    0.0
277427    0.0
277478    0.0
277639    0.0
278418    0.0
Name: Hemlock Bay, Length: 888, dtype: float64

In [231]:
# Finding 4 nearest neighbors for the book at index 250
distances, similar_book_indices = knn_model.kneighbors(user_book_matrix.iloc[250, :].values.reshape(1, -1), n_neighbors=4)

In [232]:
# Distance values (lower means more similar)
distances

array([[ 0.        , 21.77154106, 23.91652149, 24.65765601]])

In [233]:
# Indices of similar books
similar_book_indices

array([[250, 184, 291, 372]], dtype=int64)

In [234]:
# Displaying recommended books with their distances
for i, (book_idx, distance) in enumerate(zip(similar_book_indices.flatten(), distances.flatten())):
    book_title = user_book_matrix.index[book_idx]
    print(f"{i}: {book_title}, with distance of {distance}")

0: Hemlock Bay, with distance of 0.0
1: Exclusive, with distance of 21.77154105707724
2: Jacob Have I Loved, with distance of 23.916521486202797
3: No Safe Place, with distance of 24.657656011875904


In [235]:
# Storing all book titles for easy access
book_titles = user_book_matrix.index

In [236]:
book_titles

Index(['1984', '1st to Die: A Novel', '2nd Chance', '4 Blondes',
       '84 Charing Cross Road', 'A Bend in the Road', 'A Case of Need',
       'A Child Called \It\": One Child's Courage to Survive"',
       'A Civil Action', 'A Cry In The Night',
       ...
       'Winter Solstice', 'Wish You Well', 'Without Remorse',
       'Wizard and Glass (The Dark Tower, Book 4)', 'Wuthering Heights',
       'Year of Wonders', 'You Belong To Me',
       'Zen and the Art of Motorcycle Maintenance: An Inquiry into Values',
       'Zoya', '\O\" Is for Outlaw"'],
      dtype='object', name='Title', length=742)

In [237]:
# Fetching and displaying recommended books with image URLs
for i, (book_idx, distance) in enumerate(zip(similar_book_indices.flatten(), distances.flatten())):
    book_title = user_book_matrix.index[book_idx]
    # Take the first image URL recorded for this title
    image_url = final_ratings.loc[final_ratings['Title'] == book_title, 'Image-URL'].iloc[0]

    print(f"{i}: {book_title}")
    print(f"   Distance: {distance}")
    print(f"   Image URL: {image_url}\n")

0: Hemlock Bay
   Distance: 0.0
   Image URL: http://images.amazon.com/images/P/0515133302.01.LZZZZZZZ.jpg

1: Exclusive
   Distance: 21.77154105707724
   Image URL: http://images.amazon.com/images/P/0446604232.01.LZZZZZZZ.jpg

2: Jacob Have I Loved
   Distance: 23.916521486202797
   Image URL: http://images.amazon.com/images/P/0064403688.01.LZZZZZZZ.jpg

3: No Safe Place
   Distance: 24.657656011875904
   Image URL: http://images.amazon.com/images/P/0345404777.01.LZZZZZZZ.jpg



In [238]:
# Saving the trained model and processed data for deployment
import pickle
import os

# Create artifacts folder if it doesn't exist
os.makedirs('artifacts', exist_ok=True)

# Save the trained KNN model (the with-block closes the file handle)
with open('artifacts/model.pkl', 'wb') as f:
    pickle.dump(knn_model, f)

# Save the book titles index
with open('artifacts/book_name.pkl', 'wb') as f:
    pickle.dump(book_titles, f)

# Save the final ratings dataframe
with open('artifacts/final_ratings.pkl', 'wb') as f:
    pickle.dump(final_ratings, f)

# Save the user-book matrix
with open('artifacts/book_matrix.pkl', 'wb') as f:
    pickle.dump(user_book_matrix, f)

print("All files saved successfully in artifacts folder!")

All files saved successfully in artifacts folder!
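
On the deployment side the artifacts have to be read back with `pickle.load`. The sketch below shows a write-then-read round trip on a toy object in a temporary directory, so it does not touch the real `artifacts` folder. `toy_titles` and `loaded_titles` are illustrative names.

```python
import os
import pickle
import tempfile

# Toy object standing in for the saved book titles (illustrative only)
toy_titles = ['1984', '2nd Chance', '4 Blondes']

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'book_name.pkl')

    # Write the object, then read it back; each with-block closes its file
    with open(path, 'wb') as f:
        pickle.dump(toy_titles, f)
    with open(path, 'rb') as f:
        loaded_titles = pickle.load(f)

print(loaded_titles == toy_titles)
```

Two cautions apply when loading the real files: only unpickle files from a trusted source, and load the scikit-learn model with the same library version that saved it.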


# Model Testing

In [240]:
def recommend_books(book_title, n_recommendations=5):
    """
    Recommend books based on a given book title using collaborative filtering
    
    Parameters:
    -----------
    book_title : str
        Title of the book to base recommendations on
    n_recommendations : int, optional (default=5)
        Number of book recommendations to return
    
    Returns:
    --------
    None
        Prints recommended book titles with distances and image URLs
    """
    try:
        # Find the book's position in the matrix
        book_index = book_titles.get_loc(book_title)
        
        # Get similar books using KNN model
        distances, similar_indices = knn_model.kneighbors(
            user_book_matrix.iloc[book_index, :].values.reshape(1, -1),
            n_neighbors=n_recommendations + 1  # +1 to include the input book itself
        )
        
        # Display recommendations (skip first one as it's the input book)
        print(f"Books similar to '{book_title}':\n")
        for i in range(1, len(similar_indices.flatten())):
            idx = similar_indices.flatten()[i]
            recommended_title = user_book_matrix.index[idx]
            similarity_distance = distances.flatten()[i]
            book_image_url = final_ratings[final_ratings['Title'] == recommended_title]['Image-URL'].iloc[0]
            
            print(f"{i}. {recommended_title}")
            print(f"   Distance: {similarity_distance:.4f}")
            print(f"   Image: {book_image_url}\n")
    
    except (IndexError, KeyError):
        print(f"❌ Book '{book_title}' not found in the dataset.")
        print("\n📚 Sample available books:")
        for idx, title in enumerate(book_titles[:5], 1):
            print(f"  {idx}. {title}")
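
When a title is not found, listing the first five books gives little help if the cause is a typo. One possible refinement, sketched below with the standard-library `difflib` module, is to suggest the closest known titles instead. `suggest_titles` and `toy_titles` are hypothetical names, this is not part of `recommend_books` above, and the cutoff of 0.6 is simply `difflib`'s default.

```python
import difflib

# Toy list standing in for book_titles (illustrative only)
toy_titles = ['1984', '2nd Chance', 'Wuthering Heights', 'Year of Wonders']

def suggest_titles(query, titles, limit=3):
    """Return up to `limit` titles that closely match `query`."""
    return difflib.get_close_matches(query, titles, n=limit, cutoff=0.6)

# A misspelled query still finds the intended title
suggestions = suggest_titles('Wuthering Hieghts', toy_titles)
print(suggestions)
```

With the real data, `list(book_titles)` could be passed as `titles` in the `except` branch.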

In [241]:
# Test the recommendation function
recommend_books("Harry Potter and the Sorcerer's Stone (Harry Potter (Paperback))", n_recommendations=5)

Books similar to 'Harry Potter and the Sorcerer's Stone (Harry Potter (Paperback))':

1. Exclusive
   Distance: 71.9861
   Image: http://images.amazon.com/images/P/0446604232.01.LZZZZZZZ.jpg

2. The Cradle Will Fall
   Distance: 72.1526
   Image: http://images.amazon.com/images/P/0440115450.01.LZZZZZZZ.jpg

3. Jacob Have I Loved
   Distance: 72.5259
   Image: http://images.amazon.com/images/P/0064403688.01.LZZZZZZZ.jpg

4. Tough Cookie
   Distance: 72.7118
   Image: http://images.amazon.com/images/P/0553578308.01.LZZZZZZZ.jpg

5. Secrets
   Distance: 73.0890
   Image: http://images.amazon.com/images/P/0440176484.01.LZZZZZZZ.jpg
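
The distances above are Euclidean, because `NearestNeighbors` defaults to the Minkowski metric with p=2. On zero-filled rating vectors, a book with very few ratings sits close to the origin. It can therefore end up nearer to a heavily rated book than a title that shares the same readers. This is one plausible reason why 'Exclusive' and 'Jacob Have I Loved' appear in both result lists. The sketch below compares Euclidean and cosine distance on toy vectors. The vectors and helper names are made up for illustration, and cosine distance is a common alternative that this notebook does not use.

```python
import numpy as np

# Toy rating vectors over six users (values are invented for illustration)
query = np.array([10.0, 9.0, 8.0, 0.0, 0.0, 0.0])       # heavily rated book
overlap = np.array([10.0, 9.0, 8.0, 10.0, 10.0, 10.0])  # same readers plus more
sparse = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 1.0])       # one unrelated reader

def euclidean(a, b):
    """Straight-line distance between two rating vectors."""
    return float(np.linalg.norm(a - b))

def cosine_distance(a, b):
    """1 minus cosine similarity; depends on direction, not on vector length."""
    return 1.0 - float(a @ b) / float(np.linalg.norm(a) * np.linalg.norm(b))

for name, vec in [('overlap', overlap), ('sparse', sparse)]:
    print(f"{name}: euclidean={euclidean(query, vec):.3f}, "
          f"cosine={cosine_distance(query, vec):.3f}")
```

In this toy case Euclidean distance ranks the nearly empty `sparse` vector closer to `query`, while cosine distance ranks `overlap` closer. Passing `metric='cosine'` to `NearestNeighbors` would be one way to try this on the real matrix, but the outputs recorded in this notebook come from the Euclidean default.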

