# **Book Recommendation System — Content Based Filtering Approach**

**Author:** Milos Saric [https://saricmilos.com/]  
**Date:** November 4, 2025 - November 18, 2025  
**Dataset:** Kaggle — *Book Recommendation Dataset*  

---

### Required Libraries Import

In [1]:
import sys
import os
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))

In [2]:
%load_ext autoreload
%autoreload 2

from src.dataloader import load_all_csvs_from_folder
from src.preprocess_user_books_ratings import preprocess_books_ratings_users
from pathlib import Path

In [3]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_similarity

In [4]:
dataset_folder = Path(r"C:\Users\Milos\Desktop\ESCAPE_9-5\PYTHON\GitHub_Kaggle_Projects\what-else-should-I-read\datasets")

In [5]:
datasets = load_all_csvs_from_folder(dataset_folder,low_memory=False)

In [6]:
merged_df = preprocess_books_ratings_users(
    datasets["Books"],
    datasets["Ratings"],
    datasets["Users"]
)

In [7]:
merged_df.shape

(383839, 21)

# **1. Content-Based Filtering**

For **content-based recommendation**, one-hot encoding works well for columns with **low cardinality**.  

However, for high-cardinality columns like `isbn` (149,833 unique values) or `book_title` (135,564 unique values), traditional one-hot encoding is **impractical**:

- It creates **very large, sparse matrices**  
- Consumes **excessive memory**  
- Slows down computations  

Alternative encoding methods (embeddings, hashing, or TF-IDF for text) are better suited for these cases.
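
As a rough illustration of the hashing alternative, the sketch below maps each string to one of a fixed number of buckets, so the encoded width stays constant no matter how many distinct values appear. The helper names (`hash_bucket`, `hashed_one_hot`), the bucket count of 16, and the author names are all hypothetical and not part of this notebook. `hashlib.md5` is used only because Python's built-in `hash()` for strings is randomized per process.

```python
import hashlib

def hash_bucket(value, n_buckets=16):
    """Map a string to a stable bucket index in [0, n_buckets)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def hashed_one_hot(value, n_buckets=16):
    """One-hot vector over hash buckets instead of over unique values."""
    vec = [0] * n_buckets
    vec[hash_bucket(value, n_buckets)] = 1
    return vec

# Placeholder names for illustration, not rows from the dataset
authors = ["Author A", "Author B", "Author C"]
encoded = [hashed_one_hot(a) for a in authors]

# Width is fixed by n_buckets, not by the number of distinct authors
print(len(encoded[0]))
```

The trade-off is that distinct values can collide in the same bucket, and fewer buckets means more collisions.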

For content-based filtering, we focus on attributes that describe the item, not the user.

In [8]:
merged_df.columns

Index(['user_id', 'age', 'country_clean', 'region', 'city_clean',
       'state_clean', 'isbn', 'book_rating', 'book_title', 'book_author',
       'year_of_publication', 'publisher', 'user_avg_rating',
       'user_num_ratings', 'book_avg_rating', 'book_num_ratings',
       'book_popularity_score', 'author_avg_rating', 'publisher_avg_rating',
       'book_age', 'User_age_Group'],
      dtype='object')

In [9]:
book_identifiers = merged_df[['isbn', 'book_title']].drop_duplicates(subset='book_title').reset_index(drop=True)

In [10]:
book_features = merged_df[['book_title', 'book_author', 'year_of_publication', 'publisher', 'book_avg_rating']].copy()
book_features['is_high_rating'] = (book_features['book_avg_rating'] >= 8).astype(int)

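Frequency encoding: replace each author or publisher with a count of how often it appears in the dataset. Because the counts here are taken before duplicate titles are dropped, they reflect the number of rating rows per author or publisher rather than the number of distinct books. An average rating per author or publisher is an alternative encoding.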

In [11]:
# Frequency encoding
author_freq = book_features['book_author'].value_counts().to_dict()
publisher_freq = book_features['publisher'].value_counts().to_dict()
book_features['author_freq'] = book_features['book_author'].map(author_freq)
book_features['publisher_freq'] = book_features['publisher'].map(publisher_freq)
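
The difference between counting rating rows and counting distinct books can be seen in a small toy sketch. It uses only the standard library and placeholder names, and it assumes, as in `merged_df`, that there is one row per rating.

```python
from collections import Counter

# Toy (author, title) pairs, one per rating row; names are placeholders
rating_rows = [
    ("Author A", "Book 1"),
    ("Author A", "Book 1"),
    ("Author A", "Book 1"),
    ("Author A", "Book 2"),
    ("Author B", "Book 3"),
]

# Counting every row measures how often an author's books were rated
ratings_per_author = Counter(author for author, _ in rating_rows)

# Counting distinct (author, title) pairs measures how many books they have
books_per_author = Counter(author for author, _ in set(rating_rows))

print(ratings_per_author["Author A"], books_per_author["Author A"])  # 4 2
```

In pandas terms, the second count would roughly correspond to dropping duplicate titles before calling `value_counts()`. Which notion is preferable depends on whether popularity or catalogue size is the signal of interest.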

In [12]:
# Drop duplicates based on book_title
book_features = book_features.drop_duplicates(subset='book_title').reset_index(drop=True)

In [13]:
# Scale numeric columns
cols_to_scale = ['author_freq', 'publisher_freq', 'year_of_publication', 'book_avg_rating']
scaler = MinMaxScaler()
book_features[cols_to_scale] = scaler.fit_transform(book_features[cols_to_scale])

In [14]:
book_features[cols_to_scale].describe().T[['min', 'max']]

| | min | max |
|---|---|---|
| author_freq | 0.0 | 1.0 |
| publisher_freq | 0.0 | 1.0 |
| year_of_publication | 0.0 | 1.0 |
| book_avg_rating | 0.0 | 1.0 |


In [15]:
# Final book vectors
book_vectors = book_features[['author_freq', 'publisher_freq', 'year_of_publication', 'book_avg_rating', 'is_high_rating']].values

In [16]:
# book_identifiers now aligns with book_vectors
book_identifiers = book_identifiers.reset_index(drop=True)

In [17]:
# create a mapping from book_title to vector index
title_to_index = {title: idx for idx, title in enumerate(book_identifiers['book_title'])}

Build user profiles

In [18]:
user_profiles = {}

for user_id, group in merged_df.groupby('user_id'):
    # map books to indices safely
    book_indices = [title_to_index[title] for title in group['book_title'] if title in title_to_index]
    
    if not book_indices:  # skip users with no valid books
        continue
    
    # average their vectors
    user_vector = book_vectors[book_indices].mean(axis=0)
    user_profiles[user_id] = user_vector


For a content-based recommender built on user profiles, we don’t need to precompute cosine similarity between all pairs of books. We compute a user vector by averaging the vectors of the books the user has rated. To get recommendations, we compute cosine similarity between this single user vector and all book vectors.
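
A minimal sketch of that idea on toy data, assuming NumPy is available. The vectors and the `rated` list are invented for illustration and are unrelated to `book_vectors`.

```python
import numpy as np

# Toy item vectors (rows = books, columns = features); values are made up
item_vectors = np.array([
    [1.0, 0.0, 0.5],
    [0.9, 0.1, 0.4],
    [0.0, 1.0, 0.2],
    [0.1, 0.9, 0.3],
])

rated = [0, 1]  # indices of books a hypothetical user has rated

# User profile = mean of the rated books' vectors
profile = item_vectors[rated].mean(axis=0)

# Cosine similarity between the single profile vector and every book
norms = np.linalg.norm(item_vectors, axis=1) * np.linalg.norm(profile)
sims = item_vectors @ profile / norms

# Rank by descending similarity, skipping books already rated
ranked = [int(i) for i in np.argsort(-sims) if i not in rated]
print(ranked)  # [3, 2]
```

Only one similarity row is computed per request, so the cost grows linearly with the number of books rather than quadratically.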

In [19]:
def recommend_books(user_id, top_n=5):
    if user_id not in user_profiles:
        return []
    
    user_vector = user_profiles[user_id].reshape(1, -1)
    sims = cosine_similarity(user_vector, book_vectors).flatten()
    
    # exclude books the user has already rated (a set gives fast membership checks)
    user_books = set(merged_df.loc[merged_df['user_id'] == user_id, 'book_title'])
    
    recommended_indices = [
        i for i in sims.argsort()[::-1] if book_identifiers.iloc[i]['book_title'] not in user_books
    ][:top_n]
    
    return book_identifiers.iloc[recommended_indices]


In [20]:
# Example: recommend top 5 books for user with ID 243
recommended_books = recommend_books(user_id=243, top_n=5)

# Show the titles and ISBNs
print(recommended_books)

             isbn                                         book_title
13079  068480297X      SHOCK WAVE (Dirk Pitt Adventures (Hardcover))
4490   0449213765                                        Lady Oracle
55668  0590934899  Escape from Camp Run-for-Your-Life (Give Yours...
21398  0061093998                                 Homebody : A Novel
7844   0590568817      Say Cheese and Die-Again! (Goosebumps, No 44)


Evaluation

Item Based

We compute similarity between books directly. Given a book title, we return the top-k most similar books based on their feature vectors.
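
The lookup can be sketched with plain NumPy on toy vectors. The function name `top_k_similar` and the data are hypothetical, and the sketch assumes no row is all zeros. A nearest-neighbors model with `metric='cosine'` works with cosine distance, which is one minus the similarity used here.

```python
import numpy as np

def top_k_similar(vectors, query_idx, k=2):
    """Return indices of the k rows most cosine-similar to row query_idx."""
    # Normalise rows so a dot product equals cosine similarity
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[query_idx]
    order = np.argsort(-sims)
    # The query is maximally similar to itself, so drop it explicitly
    return [int(i) for i in order if i != query_idx][:k]

# Toy item vectors; values are made up for illustration
vectors = np.array([
    [1.0, 0.0, 0.5],
    [0.9, 0.1, 0.4],
    [0.0, 1.0, 0.2],
    [0.1, 0.9, 0.3],
])

print(top_k_similar(vectors, query_idx=0, k=2))  # [1, 3]
```

Filtering by index rather than assuming the query comes first keeps the result correct even when several items share an identical vector.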

In [21]:
title_to_index = {title: idx for idx, title in enumerate(book_identifiers['book_title'])}

In [22]:
# Build nearest neighbors model
nn_model = NearestNeighbors(n_neighbors=10, metric='cosine', algorithm='brute')
nn_model.fit(book_vectors)

NearestNeighbors(algorithm='brute', metric='cosine', n_neighbors=10)


In [23]:
def recommend_similar_books(book_title, top_n=5):
    if book_title not in title_to_index:
        return []
    
    idx = title_to_index[book_title]
    vector = book_vectors[idx].reshape(1, -1)
    
    distances, indices = nn_model.kneighbors(vector, n_neighbors=top_n+1)
    indices = indices.flatten()
    distances = distances.flatten()
    
    # exclude the book itself
    indices = [i for i in indices if i != idx][:top_n]
    
    return book_identifiers.iloc[indices][['book_title', 'isbn']]


In [24]:
recommend_similar_books("1984", top_n=5)

| | book_title | isbn |
|---|---|---|
| 56649 | A Christmas Carol in Prose Being a Ghost Story... | 140071202 |
| 17890 | Moon Palace (Contemporary American Fiction) | 140115854 |
| 44707 | The Hand of Chaos: A Death Gate Novel (The Dea... | 553093770 |
| 23841 | Adventures of Tom Sawyer (Classics S.) | 140390480 |
| 57077 | Virtue of Selfishness: A New Concept of Egoism | 451163931 |
