# Book recommender system
This project is a book recommendation system built with item-based collaborative filtering. I chose this technique because it is more efficient than the user-based approach. Given a reference book, the system recommends n similar books. The only selection criterion is the ratings given by users.

Source: https://www.kaggle.com/datasets/rxsraghavagrawal/book-recommender-system
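
To make the idea of item-based similarity concrete, here is a minimal sketch on a hypothetical toy matrix. The matrix values, the book labels and the helper name `cosine_similarity_matrix` are illustrative assumptions and do not come from the dataset.

```python
import numpy as np

# Hypothetical toy data: rows are books, columns are users, 0 means "not rated".
ratings_matrix = np.array([
    [5.0, 4.0, 0.0, 1.0],   # book A
    [4.0, 5.0, 0.0, 2.0],   # book B
    [1.0, 0.0, 5.0, 4.0],   # book C
])

def cosine_similarity_matrix(matrix):
    """Return the pairwise cosine similarity between the rows of `matrix`."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit_rows = matrix / norms
    return unit_rows @ unit_rows.T

similarity = cosine_similarity_matrix(ratings_matrix)

# For book A (row 0), rank the other books from most to least similar.
ranking = [int(i) for i in np.argsort(-similarity[0]) if i != 0]
print(ranking)
```

Book B is rated much like book A by the same users, so it ranks ahead of book C. Only the ratings are compared, never the title, author or genre.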

# Initial imports

In [None]:
%pip install ipython-autotime --upgrade

In [None]:
from google.colab import drive, files
import pandas as pd
import warnings
drive.mount('/content/drive', force_remount=True)
warnings.filterwarnings("ignore")
%load_ext autotime

Mounted at /content/drive
time: 536 µs (started: 2023-04-16 19:49:48 +00:00)


In [None]:
books   = pd.read_csv("/content/drive/MyDrive/datasets/book-recommender-system/BX-Books.csv", sep=';', encoding='latin-1', on_bad_lines='skip')
ratings = pd.read_csv("/content/drive/MyDrive/datasets/book-recommender-system/BX-Book-Ratings.csv", sep=';', encoding='latin-1', on_bad_lines='skip')

time: 1.68 s (started: 2023-04-16 19:53:56 +00:00)


# EDA

In [None]:
books.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


time: 14.9 ms (started: 2023-04-14 00:27:35 +00:00)


In [None]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 8 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   ISBN                 271360 non-null  object
 1   Book-Title           271360 non-null  object
 2   Book-Author          271359 non-null  object
 3   Year-Of-Publication  271360 non-null  object
 4   Publisher            271358 non-null  object
 5   Image-URL-S          271360 non-null  object
 6   Image-URL-M          271360 non-null  object
 7   Image-URL-L          271357 non-null  object
dtypes: object(8)
memory usage: 16.6+ MB
time: 309 ms (started: 2023-04-14 00:27:35 +00:00)


In [None]:
ratings.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


time: 7.31 ms (started: 2023-04-14 00:27:37 +00:00)


In [None]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   User-ID      1149780 non-null  int64 
 1   ISBN         1149780 non-null  object
 2   Book-Rating  1149780 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 26.3+ MB
time: 394 ms (started: 2023-04-14 00:27:39 +00:00)


# Feature engineering

In [None]:
from scipy.sparse      import csr_matrix
from sklearn.base      import BaseEstimator, TransformerMixin
from sklearn.compose   import ColumnTransformer
from sklearn.pipeline  import Pipeline

import numpy as np

time: 593 µs (started: 2023-04-16 19:51:08 +00:00)


In [None]:
books   = books[['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher']]

books   = books.rename(columns={'ISBN':'isbn', 'Book-Title':'title', 'Book-Author':'author', 'Year-Of-Publication':'year', 'Publisher':'publisher'})
ratings = ratings.rename(columns={'User-ID':'user_id', 'ISBN':'isbn', 'Book-Rating':'rating'})

print(f'books:   {list(books.columns)}')
print(f'ratings: {list(ratings.columns)}')

books:   ['isbn', 'title', 'author', 'year', 'publisher']
ratings: ['user_id', 'isbn', 'rating']
time: 39.1 ms (started: 2023-04-16 19:54:03 +00:00)


**Step ##:** Identify and select users who have made more than 50 reviews and books that have received more than 50 reviews. These thresholds were chosen following personal criteria.


In [None]:
def filter_rows(dataset, col_name, min_ratings):
  valid_rows = dataset[col_name].value_counts() > min_ratings
  valid_rows_ids = valid_rows[valid_rows].index
  mask = dataset[col_name].isin(valid_rows_ids)
  return dataset[mask]

ratings = filter_rows(ratings, 'user_id', 50)
ratings = filter_rows(ratings, 'isbn', 50)
print(ratings.shape)

(99223, 3)
time: 293 ms (started: 2023-04-16 19:54:07 +00:00)
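
One property of applying the two filters in sequence is worth keeping in mind: removing rarely rated books can push some users back to or below the user threshold. The sketch below shows this on hypothetical toy data with hypothetical thresholds (2 for users, 1 for books). The `filter_rows` helper is restated so that the example is self-contained.

```python
import pandas as pd

def filter_rows(dataset, col_name, min_ratings):
    # Keep only the rows whose value in `col_name` occurs more than `min_ratings` times.
    counts = dataset[col_name].value_counts()
    keep = counts[counts > min_ratings].index
    return dataset[dataset[col_name].isin(keep)]

# Hypothetical toy ratings: user 1 and user 2 have three ratings each, user 3 has one.
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 2, 3],
    'isbn':    ['a', 'b', 'c', 'a', 'b', 'd', 'a'],
})

step1 = filter_rows(toy, 'user_id', 2)   # users with more than 2 ratings
step2 = filter_rows(step1, 'isbn', 1)    # books with more than 1 rating

# After the second filter, the remaining users have only 2 ratings each.
print(step2['user_id'].value_counts().to_dict())
```

If strict thresholds matter, one option is to repeat the two filters until the shape of the table stops changing.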


**Step ##:** Create a new table (isbn, user) that will be filled with the ratings that each book received from each user.

In [None]:
ratings_books = ratings.merge(books, on='isbn').drop(['title', 'author', 'year', 'publisher'], axis=1)
ratings_books = ratings_books.reset_index().drop('index', axis=1)
X = ratings_books.pivot(index='isbn', columns='user_id', values='rating')

time: 221 ms (started: 2023-04-16 19:54:10 +00:00)
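
As an illustration of what the pivot produces, here is a minimal sketch on hypothetical toy ratings (the values and the `toy_` names are assumptions made for the example). Pairs that have no rating become NaN, and `pivot` refuses to reshape when the same (isbn, user_id) pair appears twice.

```python
import pandas as pd

# Hypothetical toy ratings in long format.
toy_ratings = pd.DataFrame({
    'user_id': [10, 10, 20, 30],
    'isbn':    ['a', 'b', 'a', 'b'],
    'rating':  [8, 5, 7, 9],
})

# One row per book, one column per user; missing pairs become NaN.
toy_matrix = toy_ratings.pivot(index='isbn', columns='user_id', values='rating')
print(toy_matrix)

# A repeated (isbn, user_id) pair makes pivot raise a ValueError.
duplicated = pd.concat([toy_ratings, toy_ratings.iloc[[0]]], ignore_index=True)
try:
    duplicated.pivot(index='isbn', columns='user_id', values='rating')
    pivot_failed = False
except ValueError:
    pivot_failed = True
print(pivot_failed)
```

When duplicates are possible, `pivot_table` with an aggregation function such as the mean is one alternative.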


**Step ##:** Compute the arithmetic mean of each row and subtract it from every rating in that row. This avoids the problem of filling NaN fields with a raw score of 0. After centering, 0 represents the book's average rating, so the model will not interpret it as a sign that the user disliked the book.

In [None]:
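# Mean rating of each book, ignoring the users who did not rate it
row_means = np.nanmean(X, axis=1)
# Center each row: rating minus the book's mean
X = np.subtract(X, row_means.reshape(-1, 1))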

time: 46.2 ms (started: 2023-04-16 19:54:13 +00:00)


**Step ##:** Replace NaN values with zero.

In [None]:
mask = np.isnan(X)
X[mask] = 0

time: 25.2 ms (started: 2023-04-16 19:54:15 +00:00)
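
The centering and zero-filling steps can be checked by hand on a small matrix. This is a minimal sketch with hypothetical values, where each row is a book and each column is a user.

```python
import numpy as np

# Hypothetical toy matrix; NaN marks a user who did not rate the book.
toy = np.array([
    [8.0, np.nan, 6.0],
    [np.nan, 3.0, 5.0],
])

# Mean of the known ratings in each row: 7.0 and 4.0.
row_means = np.nanmean(toy, axis=1)

# Subtract each row's mean from its ratings; NaN entries stay NaN.
centered = toy - row_means.reshape(-1, 1)

# Fill the missing ratings with 0, which now means "average for this book".
centered[np.isnan(centered)] = 0
print(centered)
```

A rating above the book's mean becomes positive, a rating below it becomes negative, and a missing rating sits exactly at the mean.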


**Step ##:** Transform the dataframe into a numpy array.

In [None]:
X_numpy = X.values
print(f'X shape: {X_numpy.shape}')

X shape: (1059, 3150)
time: 1.2 ms (started: 2023-04-16 19:54:19 +00:00)


# Model training

As the dataset does not have labels, I will use NearestNeighbors, an unsupervised nearest-neighbor search model. Cosine similarity will decide which vectors (rows) are closest.

In [None]:
from sklearn.metrics   import mean_absolute_error
from sklearn.neighbors import NearestNeighbors

time: 329 µs (started: 2023-04-16 19:54:19 +00:00)
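
For reference, `NearestNeighbors` with `metric='cosine'` reports cosine distance, which is 1 minus the cosine similarity. The sketch below uses hypothetical toy vectors to show the three reference points: same direction (distance 0), orthogonal (distance 1) and opposite direction (distance 2).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy vectors, already mean-centered.
toy = np.array([
    [1.0, 0.0, -1.0],    # reference row
    [2.0, 0.0, -2.0],    # same direction as the reference, different scale
    [-1.0, 0.0, 1.0],    # opposite direction
    [0.0, 1.0, 0.0],     # orthogonal
])

model = NearestNeighbors(n_neighbors=4, metric='cosine')
model.fit(toy)

# Query with the reference row; results are sorted by increasing distance.
distances, indices = model.kneighbors(toy[[0]])
print(np.round(distances[0], 3))
print(indices[0])
```

Rows 0 and 1 are both at distance 0 from the query because cosine distance ignores the length of the vectors, so their relative order in the result is not guaranteed.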


Below, the mean_mae() function will evaluate the model. First, for each book in the dataset, the model finds a number of other books whose rating pattern is similar to its own. This similarity is calculated by the NearestNeighbors class. To evaluate these groupings, for each book, I compare the ratings it received from each user with the ratings predicted by the system. The error between the real ratings and the predicted ones is stored in an error list. Lastly, I calculate the mean of these errors.

In [None]:
def mean_mae(model, dataset):
  # For every book, compare its known ratings with the ratings predicted from its neighbors
  errors = list()
  for row in dataset:
    distances, indices = model.kneighbors(row.reshape(1, -1))
    distances = distances.flatten()
    indices   = indices.flatten()
    y_true, y_pred = predict_ratings(dataset.copy(), indices, distances)
    error = np.sqrt(mean_absolute_error(y_true, y_pred))
    errors.append(error)
  return errors

def predict_ratings(dataset, indices, distances):
  # indices[0] is the reference book; indices[1:] are its neighbors
  # Weight each neighbor row by its distance and take the weighted average
  dataset[indices[1:]] = dataset[indices[1:]] * distances[1:].reshape(-1, 1)
  predictions  = dataset[indices[1:]].sum(axis=0) / distances[1:].sum()
  # Evaluate only on the users with a non-zero (centered) rating for the reference book
  idx_nonzeros = dataset[indices[0]].nonzero()
  y_true = dataset[indices[0]][idx_nonzeros]
  y_pred = np.around(predictions[idx_nonzeros])
  return y_true, y_pred

time: 932 µs (started: 2023-04-16 19:54:21 +00:00)
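
The weighted-average idea can be worked through by hand on a tiny case. Note that this sketch is a variation and not the notebook's function: it weights each neighbor by its similarity (for example, 1 minus the cosine distance), so that closer books count more. The helper name `weighted_prediction` and all values are hypothetical.

```python
import numpy as np

def weighted_prediction(neighbor_rows, similarities):
    """Similarity-weighted average of the neighbors' mean-centered ratings."""
    weights = np.asarray(similarities, dtype=float).reshape(-1, 1)
    return (neighbor_rows * weights).sum(axis=0) / weights.sum()

# Hypothetical centered ratings of two neighbor books for three users.
neighbor_rows = np.array([
    [1.0, -1.0, 0.0],
    [3.0, 1.0, 0.0],
])
similarities = [0.75, 0.25]   # assumed similarities to the reference book

prediction = weighted_prediction(neighbor_rows, similarities)
print(prediction)

# Hypothetical centered ratings of the reference book; 0 means "not rated".
reference = np.array([2.0, -1.0, 0.0])
rated = reference.nonzero()

# Mean absolute error on the users who rated the reference book.
mae = np.abs(reference[rated] - prediction[rated]).mean()
print(mae)
```

For the first user the prediction is 0.75 * 1 + 0.25 * 3 = 1.5, and for the second it is 0.75 * (-1) + 0.25 * 1 = -0.5, which gives a mean absolute error of 0.5 against the reference ratings.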


## NearestNeighbors

The model below will determine the rating of each book from the weighted average of its most similar books. With n_neighbors=50, each query returns 50 rows, the first of which is normally the book itself.

In [None]:
nn = NearestNeighbors(n_neighbors=50, metric='cosine')
nn.fit(X_numpy)

time: 7.73 ms (started: 2023-04-16 19:54:26 +00:00)


In [None]:
result = mean_mae(nn, X_numpy)
print(np.mean(result))

0.5401533060210272
time: 18.6 s (started: 2023-04-16 19:54:31 +00:00)


# Query

Now I draw a random sample of ten books from those that were analyzed by the model.

In [None]:
isbns = ratings_books['isbn'].unique()
mask = books['isbn'].isin(isbns)
books[mask]['title'].sample(10)

9294                                       A Painted House
10569                               In Her Shoes : A Novel
20584    H Is for Homicide (Kinsey Millhone Mysteries (...
7997                                     Last Man Standing
3459      Harry Potter and the Chamber of Secrets (Book 2)
136                                  Before I Say Good-Bye
5962                                      Violets Are Blue
8143                                            Open House
4544       Heaven and Earth (Three Sisters Island Trilogy)
3022                              I Know This Much Is True
Name: title, dtype: object

time: 32 ms (started: 2023-04-16 20:04:47 +00:00)


I will choose the title of a book so that the model recommends books with similar rating patterns. Because the model was fitted with n_neighbors=50, it returns 49 recommendations.

In [None]:
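def get_recommendation(model, dataset, neighbors, title):
  # Find the ISBNs of every edition whose title starts with the query
  mask = dataset['title'].str.startswith(title)
  book_isbn = dataset[mask]['isbn']
  # Keep the first matching edition that is present in the rating matrix
  mask = neighbors.index.isin(book_isbn)
  book_reference = neighbors[mask].head(1)
  distances, indices = model.kneighbors(book_reference.values)
  # Neighbor indices are row positions in the rating matrix, so map them to ISBNs before looking up titles
  titles = dataset.drop_duplicates('isbn').set_index('isbn')['title']
  for i in indices[0][1:]:
    print(f"Title: {titles[neighbors.index[i]]}")

get_recommendation(nn, books, X, 'Violets Are Blue')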

Title: Slow Waltz in Cedar Bend
Title: Coyote Waits (Joe Leaphorn/Jim Chee Novels)
Title: Life of Pi
Title: The Community in America
Title: The Firm
Title: The War in Heaven (Eternal Warriors)
Title: The Best Canadian Animal Stories: Classic Tales by Master Storytellers
Title: More Cunning Than Man: A Social History of Rats and Man
Title: If Love Were Oil, I'd Be About a Quart Low
Title: Chronique d'une mort annoncÃ?Â©e
Title: Twin Blessings (Love Inspired (Numbered))
Title: Die Mechanismen der Freude. ErzÃ?Â¤hlungen.
Title: The Gospel of Judas: A Novel
Title: Emma (Signet Classics (Paperback))
Title: PLEADING GUILTY
Title: The Bear and the Dragon
Title: Awakening
Title: The Hunted
Title: Tess of the D'Urbervilles (Wordsworth Classics)
Title: Night Sins
Title: The Robber Bride
Title: Deception Point
Title: Sturmzeit. Roman.
Title: Rule of the Bone : Novel, A
Title: Horus's Horrible Day (First Graders from Mars)
Title: Not a Day Goes By : A Novel
Title: Amy and Isabelle
Title: Postmorte