# **Book Recommendation System — Content Based Filtering Approach**

**Author:** Milos Saric [https://saricmilos.com/]  
**Date:** November 4, 2025 - November 18, 2025  
**Dataset:** Kaggle — *Book Recommendation Dataset*  

---

### Required Libraries Import

In [1]:
import sys
import os
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))

In [2]:
%load_ext autoreload
%autoreload 2

from src.dataloader import load_all_csvs_from_folder
from src.preprocess_user_books_ratings import preprocess_books_ratings_users
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns

In [36]:
from sklearn.preprocessing import MinMaxScaler

In [3]:
dataset_folder = Path(r"C:\Users\Milos\Desktop\ESCAPE_9-5\PYTHON\GitHub_Kaggle_Projects\what-else-should-I-read\datasets")

In [4]:
datasets = load_all_csvs_from_folder(dataset_folder)



In [5]:
merged_df = preprocess_books_ratings_users(
    datasets["Books"],
    datasets["Ratings"],
    datasets["Users"]
)

In [6]:
merged_df.shape

(383839, 21)

# **1. Content Based Filtering**

For **content-based recommendation**, one-hot encoding works well for columns with **low cardinality**.  

However, for high-cardinality columns like `isbn` (149,833 unique values) or `book_title` (135,564 unique values), traditional one-hot encoding is **impractical**:

- It creates **very large, sparse matrices**  
- Consumes **excessive memory**  
- Slows down computations  

Alternative encoding methods (embeddings, hashing, or TF-IDF for text) are better suited for these cases.
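
To give a sense of scale, the back-of-the-envelope sketch below estimates the size of a dense one-hot matrix for `isbn` from the row count and cardinality reported in this notebook, and contrasts it with the hashing trick. The 8-byte cell size, the `hash_bucket` helper, the 1,024-bucket width, and the sample ISBN string are illustrative assumptions, not choices made in this notebook.

```python
import hashlib

N_ROWS = 383_839         # rows in merged_df (from merged_df.shape)
N_UNIQUE_ISBN = 149_833  # unique isbn values quoted in the text
BYTES_PER_CELL = 8       # assumption: dense float64 storage


def dense_matrix_gb(n_rows: int, n_columns: int, bytes_per_cell: int = BYTES_PER_CELL) -> float:
    """Approximate size in GB (10**9 bytes) of a dense n_rows x n_columns matrix."""
    return n_rows * n_columns * bytes_per_cell / 1e9


def hash_bucket(value: str, n_buckets: int = 1024) -> int:
    """Hashing trick: map a category to one of n_buckets columns.

    md5 is used instead of the built-in hash() so the result is stable
    across runs. Distinct values can collide in the same bucket.
    """
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


one_hot_size = dense_matrix_gb(N_ROWS, N_UNIQUE_ISBN)
hashed_size = dense_matrix_gb(N_ROWS, 1024)
print(f"dense one-hot: ~{one_hot_size:.0f} GB, hashed to 1024 columns: ~{hashed_size:.1f} GB")

# Illustrative ISBN string; any value maps to a column index below 1024.
print(hash_bucket("0000000000"))
```

A sparse matrix format avoids most of the memory cost of one-hot encoding, but the column count stays at the number of unique values, which is why lower-dimensional encodings remain attractive here.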

For content-based filtering, we focus on attributes that describe the item, not the user.

In [21]:
merged_df.columns

Index(['user_id', 'age', 'country_clean', 'region', 'city_clean',
       'state_clean', 'isbn', 'book_rating', 'book_title', 'book_author',
       'year_of_publication', 'publisher', 'user_avg_rating',
       'user_num_ratings', 'book_avg_rating', 'book_num_ratings',
       'book_popularity_score', 'author_avg_rating', 'publisher_avg_rating',
       'book_age', 'User_age_Group'],
      dtype='object')
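
One way to make the item/user separation explicit is to partition the column list above by what each column describes. The grouping below is an assumption based on the column names alone, and the variable names are illustrative.

```python
# Columns of merged_df, copied from the output above.
all_columns = [
    "user_id", "age", "country_clean", "region", "city_clean",
    "state_clean", "isbn", "book_rating", "book_title", "book_author",
    "year_of_publication", "publisher", "user_avg_rating",
    "user_num_ratings", "book_avg_rating", "book_num_ratings",
    "book_popularity_score", "author_avg_rating", "publisher_avg_rating",
    "book_age", "User_age_Group",
]

# Assumed grouping, inferred from the names only.
user_columns = {
    "user_id", "age", "country_clean", "region", "city_clean",
    "state_clean", "user_avg_rating", "user_num_ratings", "User_age_Group",
}
interaction_columns = {"book_rating"}  # one user's rating of one book

# Everything left over describes the book itself.
item_columns = [
    c for c in all_columns
    if c not in user_columns and c not in interaction_columns
]
print(item_columns)
```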

In [29]:
book_identifiers = merged_df[["isbn", "book_title"]].copy()

In [27]:
book_features = merged_df[['book_author', 'year_of_publication', 'publisher', 'book_avg_rating']].copy()
book_features['is_high_rating'] = (book_features['book_avg_rating'] >= 8).astype(int)

In [28]:
author_freq = book_features['book_author'].value_counts().to_dict()
publisher_freq = book_features['publisher'].value_counts().to_dict()

book_features['author_freq'] = book_features['book_author'].map(author_freq)
book_features['publisher_freq'] = book_features['publisher'].map(publisher_freq)

In [33]:
book_features = book_features.drop(columns=["book_author","publisher"])

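Frequency encoding replaces each author and publisher with the number of rows in which it appears. Because `merged_df` holds one row per rating, these counts reflect how many ratings an author or publisher received, not how many distinct books they have. An alternative is to encode each author or publisher by its average book rating.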
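
The gap between row counts and distinct-book counts can be seen on a toy frame. The data and the `author_book_count` column below are invented for illustration; only the `value_counts()` and `map` pattern comes from the cells above.

```python
import pandas as pd

# Toy ratings table: one row per (user, book) rating. Values are made up.
ratings = pd.DataFrame({
    "isbn":        ["A1", "A1", "A1", "B2", "C3"],
    "book_author": ["Ann", "Ann", "Ann", "Ann", "Bob"],
})

# Frequency encoding as in the cells above: counts rows, i.e. ratings.
row_counts = ratings["book_author"].value_counts()
ratings["author_freq"] = ratings["book_author"].map(row_counts)

# Alternative: count distinct books per author by deduplicating on isbn first.
book_counts = ratings.drop_duplicates(subset="isbn")["book_author"].value_counts()
ratings["author_book_count"] = ratings["book_author"].map(book_counts)

print(ratings)
```

In this toy data the first author has four rating rows but only two distinct books, so the two encodings diverge.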

In [34]:
book_features.head()

| | year_of_publication | book_avg_rating | is_high_rating | author_freq | publisher_freq |
|---|---|---|---|---|---|
| 1 | 2001 | 7.666667 | 0 | 9 | 23 |
| 9 | 2002 | 5.0 | 0 | 21 | 3095 |
| 12 | 2004 | 5.0 | 0 | 1 | 12 |
| 13 | 1999 | 6.5 | 0 | 5 | 397 |
| 15 | 1998 | 6.0 | 0 | 32 | 21 |


In [35]:
book_features.nunique()

year_of_publication      96
book_avg_rating        1515
is_high_rating            2
author_freq             386
publisher_freq          394
dtype: int64

In [37]:
cols_to_scale = ['author_freq', 'publisher_freq', 'year_of_publication', 'book_avg_rating']

In [38]:
scaler = MinMaxScaler()
book_features[cols_to_scale] = scaler.fit_transform(book_features[cols_to_scale])

In [39]:
book_features

| | year_of_publication | book_avg_rating | is_high_rating | author_freq | publisher_freq |
|---|---|---|---|---|---|
| 1 | 0.834711 | 0.740741 | 0 | 0.001725 | 0.001721 |
| 9 | 0.842975 | 0.444444 | 0 | 0.004312 | 0.242097 |
| 12 | 0.859504 | 0.444444 | 0 | 0.000000 | 0.000861 |
| 13 | 0.818182 | 0.611111 | 0 | 0.000862 | 0.030986 |
| 15 | 0.809917 | 0.555556 | 0 | 0.006684 | 0.001565 |
| ... | ... | ... | ... | ... | ... |
| 1031125 | 0.801653 | 0.678788 | 0 | 0.050884 | 0.045618 |
| 1031126 | 0.809917 | 0.621266 | 0 | 0.203105 | 0.729264 |
| 1031127 | 0.809917 | 0.694444 | 0 | 0.009056 | 0.729264 |
| 1031129 | 0.727273 | 0.577778 | 0 | 0.041182 | 0.571440 |


In [40]:
book_features[cols_to_scale].describe().T[['min', 'max']]

| | min | max |
|---|---|---|
| author_freq | 0.0 | 1.0 |
| publisher_freq | 0.0 | 1.0 |
| year_of_publication | 0.0 | 1.0 |
| book_avg_rating | 0.0 | 1.0 |
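
For reference, min-max scaling maps each value to (x - min) / (max - min). The sketch below applies that formula by hand to the unscaled `book_avg_rating` values shown in `book_features.head()`. The bounds of 1 and 10 are an assumption inferred from the scaled output, not read from the data, and `min_max_scale` is a hypothetical helper rather than part of scikit-learn.

```python
def min_max_scale(values, lo=None, hi=None):
    """Scale values to [0, 1] with (x - lo) / (hi - lo).

    lo and hi default to the observed min and max, which is how
    MinMaxScaler picks its bounds for the default range (0, 1).
    """
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    if hi == lo:
        raise ValueError("cannot scale a constant column")
    return [(v - lo) / (hi - lo) for v in values]


# Unscaled book_avg_rating values from the head() output.
avg_ratings = [7.666667, 5.0, 5.0, 6.5, 6.0]

# Assumed column-wide bounds (1 and 10), inferred from the scaled table.
scaled = min_max_scale(avg_ratings, lo=1.0, hi=10.0)
print([round(s, 6) for s in scaled])
```

Rounded to six decimals, the results agree with the first five `book_avg_rating` entries of the scaled table, which supports the assumed bounds.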


In [42]:
# Reset to a 0..n-1 index so that row labels match row positions in book_vectors
book_features = book_features.reset_index(drop=True)
book_identifiers = book_identifiers.reset_index(drop=True)
book_vectors = book_features.values

Build user profiles

In [43]:
user_profiles = {}
# merged_df keeps its original index labels, which run past its 383,839 rows,
# so indexing book_vectors with group.index.values raised
# "IndexError: index 383847 is out of bounds for axis 0 with size 383839".
# GroupBy.indices returns positional row numbers, which line up with book_vectors.
for user_id, book_positions in merged_df.groupby('user_id').indices.items():
    # A user's profile is the mean feature vector of the books they rated
    user_profiles[user_id] = book_vectors[book_positions].mean(axis=0)
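
The label-versus-position mismatch behind this kind of IndexError can be reproduced on a small frame. In this sketch the data and variable names are invented; it shows that index labels left over after filtering no longer match row positions in a NumPy array, while `groupby(...).indices` returns positions.

```python
import numpy as np
import pandas as pd

# Toy frame whose index labels have gaps, as happens after rows are filtered out.
df = pd.DataFrame(
    {"user_id": [10, 10, 20], "feature": [0.2, 0.4, 0.9]},
    index=[1, 9, 12],
)
vectors = df[["feature"]].to_numpy()  # shape (3, 1), addressed by position 0..2

# Labels 9 and 12 are not valid positions in a 3-row array.
try:
    vectors[df.index.values]
except IndexError as exc:
    print("label indexing failed:", exc)

# GroupBy.indices maps each group key to positional row numbers.
profiles = {
    user: vectors[pos].mean(axis=0)
    for user, pos in df.groupby("user_id").indices.items()
}
print(profiles)
```

Resetting the index of the source frame before grouping would be an equivalent fix, since labels and positions then coincide.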

Train/Test split