# User-Based Recommender

Using the dataset [arashnic/book-recommendation-dataset](https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset), the function `user_based_recommender` returns the `n` books with the highest predicted rating for the user `user_id`.

The recommender is implemented using the Surprise library.

Initial data cleaning removes malformed rows and implicit ratings.
Only explicit ratings in the range from 1 to 10 are considered.

In [1]:
import zipfile
from urllib.request import urlretrieve

import pandas as pd
from surprise import SVD, Dataset, KNNBaseline, KNNBasic, KNNWithZScore, Reader, SVDpp
from surprise.model_selection import cross_validate

## Download the dataset

In [2]:
url = (
    "https://www.kaggle.com/api/v1/datasets/download/"
    "arashnic/book-recommendation-dataset?datasetVersionNumber=3"
)
zip_path = urlretrieve(url)[0]
with zipfile.ZipFile(zip_path, "r") as zf:

    with zf.open("Books.csv") as f:
        books = pd.read_csv(f)

    with zf.open("Ratings.csv") as f:
        ratings = pd.read_csv(f)



## Data Cleaning

In [3]:
# ISBNs of books to remove from both "books" and "ratings"
removed_isbn = []

# Three book entries are malformed, and some authors and publishers are missing
removed_isbn += books[books["Year-Of-Publication"].str.isnumeric() == False].ISBN.to_list()
removed_isbn += books[books["Book-Author"].isna()].ISBN.to_list()
removed_isbn += books[books["Publisher"].isna()].ISBN.to_list()

# Remove the selected books in both "books" and "ratings"
books = books[~books["ISBN"].isin(removed_isbn)]
ratings = ratings[~ratings["ISBN"].isin(removed_isbn)]

# Implicit ratings are marked as zero. We are not using implicit ratings here for now
ratings = ratings.drop(ratings[ratings["Book-Rating"] == 0].index)

# Remove any books that received no ratings
books = books[books["ISBN"].isin(ratings["ISBN"])]

# Remove duplicate editions, keeping the first one listed (TODO: Keep the most popular edition or merge the ratings)
books = books.drop_duplicates(subset=["Book-Title", "Book-Author"])

# Remove any ratings to non-existing books
ratings = ratings[ratings["ISBN"].isin(books["ISBN"])]

# Convert year to integer
books["Year-Of-Publication"] = pd.to_numeric(books["Year-Of-Publication"])

## Reduce the dataset

With nearly 200k books, the dataset is too large: a full user-item matrix is not feasible for the lightweight Streamlit app this model is meant for.
Many users gave only a handful of ratings and many books received only a handful, so users and books with fewer than four ratings are removed.

In [4]:
users_to_drop = [1]
books_to_drop = [1]

# Because dropping users influences the number of ratings per book,
# we need to iterate until no more users or books are dropped
while len(users_to_drop) != 0 and len(books_to_drop) != 0:

    # Find users that gave fewer than four ratings
    few_rating_users = ratings["User-ID"].value_counts() < 4
    users_to_drop = few_rating_users[few_rating_users].index

    # Find books that received fewer than four ratings
    few_rated_books = ratings["ISBN"].value_counts() < 4
    books_to_drop = few_rated_books[few_rated_books].index

    # Remove them from the ratings
    ratings = ratings[~ratings["User-ID"].isin(users_to_drop)]
    ratings = ratings[~ratings["ISBN"].isin(books_to_drop)]
    books = books[books["ISBN"].isin(ratings["ISBN"])]

# Report remaining rating size
print(f"Remaining ratings: {ratings.shape[0]}")
print(f"Remaining users: {ratings['User-ID'].nunique()}")
print(f"Remaining books: {ratings['ISBN'].nunique()}, {books['ISBN'].nunique()}")

Remaining ratings: 144055
Remaining users: 9689
Remaining books: 13743, 13743
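The iterative filtering above can be sketched on plain tuples to make the cascade visible. This is a minimal illustration, not the notebook's code: `prune_sparse` and the `toy` data are hypothetical, and unlike the loop above it stops only when a full pass removes nothing.

```python
from collections import Counter


def prune_sparse(ratings, min_count=4):
    """Repeatedly drop users and items that have fewer than min_count ratings.

    ratings is a list of (user_id, item_id, rating) tuples, a toy stand-in
    for the "ratings" DataFrame.
    """
    while True:
        user_counts = Counter(u for u, _, _ in ratings)
        item_counts = Counter(i for _, i, _ in ratings)
        kept = [
            (u, i, r)
            for u, i, r in ratings
            if user_counts[u] >= min_count and item_counts[i] >= min_count
        ]
        # Fixed point reached: nothing was removed in this pass
        if len(kept) == len(ratings):
            return kept
        ratings = kept


# Hypothetical toy data: item "z" has one rating, so it is dropped first;
# that leaves user "c" with a single rating, so "c" is dropped in the next pass.
toy = [
    ("a", "x", 5), ("a", "y", 7),
    ("b", "x", 6), ("b", "y", 8),
    ("c", "y", 9), ("c", "z", 4),
]
pruned = prune_sparse(toy, min_count=2)
```

With `min_count=2`, only the four ratings of users `a` and `b` survive, which shows why a single pass is not enough.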


## Create user-based recommender

In [5]:
reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(ratings, reader)

### Cross-validate different models

In [None]:
cv_results = dict()

KNNBasic

In [None]:
options = dict(
    k=40,
    min_k=1,
    sim_options = dict(
        name='cosine',
        user_based=True,
    )
)
algo = KNNBasic(**options)

result = cross_validate(algo, data, measures=["RMSE", "MAE", "FCP"], cv=5, n_jobs=-1)
cv_results["KNNBasic"] = (
    result["test_rmse"].mean(),
    result["test_mae"].mean(),
    result["test_fcp"].mean(),
)
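To make the `name='cosine'`, `user_based=True` setting concrete, here is a rough pure-Python sketch of a user-based k-NN prediction: the cosine similarity is taken over the items two users have both rated, and the prediction is the similarity-weighted mean of the neighbours' ratings. The function names and toy data are hypothetical, and details such as `min_k` and the fallback for users without neighbours are simplified away.

```python
from math import sqrt


def cosine_sim(ratings_u, ratings_v):
    """Cosine similarity over co-rated items (dicts mapping item -> rating)."""
    common = ratings_u.keys() & ratings_v.keys()
    if not common:
        return 0.0
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = sqrt(sum(ratings_v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict_basic(target, item, all_ratings, k=40):
    """Similarity-weighted mean rating of the k most similar users who rated item."""
    neighbours = [
        (cosine_sim(all_ratings[target], other), other[item])
        for user, other in all_ratings.items()
        if user != target and item in other
    ]
    top_k = sorted(neighbours, key=lambda pair: pair[0], reverse=True)[:k]
    weight = sum(sim for sim, _ in top_k)
    if weight == 0:
        return None  # simplification: no usable neighbours
    return sum(sim * rating for sim, rating in top_k) / weight


# Hypothetical toy ratings: user -> {item: rating}
toy_ratings = {
    "u": {"x": 2, "y": 4},
    "v": {"x": 1, "y": 2, "z": 8},
    "w": {"x": 4, "y": 2, "z": 2},
}
estimate = predict_basic("u", "z", toy_ratings)
```

User `v` rates in the same proportions as `u`, so it gets a similarity close to 1 and dominates the estimate for item `z`.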

KNNWithZScore

In [None]:
options = dict(
    k=40,
    min_k=1,
    sim_options = dict(
        name='cosine',
        user_based=True,
    )
)
algo = KNNWithZScore(**options)

result = cross_validate(algo, data, measures=["RMSE", "MAE", "FCP"], cv=5, n_jobs=-1)
cv_results["KNNWithZScore"] = (
    result["test_rmse"].mean(),
    result["test_mae"].mean(),
    result["test_fcp"].mean(),
)
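For reference, `KNNWithZScore` differs from `KNNBasic` by normalising each neighbour's rating with that neighbour's mean and standard deviation before averaging. As documented for Surprise, with $\mu_u$ and $\sigma_u$ the mean and standard deviation of user $u$'s ratings and $N_i^k(u)$ the $k$ nearest neighbours of $u$ that rated item $i$, the prediction is:

```latex
\hat{r}_{ui} = \mu_u + \sigma_u \,
\frac{\sum_{v \in N_i^k(u)} \operatorname{sim}(u, v) \cdot \dfrac{r_{vi} - \mu_v}{\sigma_v}}
     {\sum_{v \in N_i^k(u)} \operatorname{sim}(u, v)}
```

This compensates for users who rate systematically high or low, or who use only a narrow part of the 1 to 10 scale.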

KNNBaseline

In [None]:
options = dict(
    k=40,
    min_k=1,
    sim_options = dict(
        name='cosine',
        user_based=True,
    ),
    bsl_options = dict(
        method="als",
        reg_i=10,
        reg_u=15,
        n_epochs=10,
    )
)
algo = KNNBaseline(**options)

result = cross_validate(algo, data, measures=["RMSE", "MAE", "FCP"], cv=5, n_jobs=-1)
cv_results["KNNBaseline"] = (
    result["test_rmse"].mean(),
    result["test_mae"].mean(),
    result["test_fcp"].mean(),
)

SVD

In [None]:
options = dict(
    n_factors=100,  # Number of factors
    n_epochs=20,  # Number of iteration of the SGD procedure
    biased=True,  # Whether to use baselines (or biases)
    init_mean=0,  # Mean of normal distribution for factor vectors initialization
    init_std_dev=0.1,  # S.D. of normal distribution for factor vectors initialization
    lr_all=0.005,  # Learning rate for all parameters
    reg_all=0.02,  # Regularization term for all parameters
    lr_bu=None,  # Learning rate. Takes precedence over lr_all if set
    lr_bi=None,  # Learning rate. Takes precedence over lr_all if set
    lr_pu=None,  # Learning rate. Takes precedence over lr_all if set
    lr_qi=None,  # Learning rate. Takes precedence over lr_all if set
    reg_bu=None,  # Regularization term. Takes precedence over reg_all if set
    reg_bi=None,  # Regularization term. Takes precedence over reg_all if set
    reg_pu=None,  # Regularization term. Takes precedence over reg_all if set
    reg_qi=None,  # Regularization term. Takes precedence over reg_all if set
)
algo = SVD(**options, random_state=123)

result = cross_validate(algo, data, measures=["RMSE", "MAE", "FCP"], cv=5, n_jobs=-1)
cv_results["SVD"] = (
    result["test_rmse"].mean(),
    result["test_mae"].mean(),
    result["test_fcp"].mean(),
)

SVD++

In [None]:
options = dict(
    n_factors=100,  # Number of factors
    n_epochs=20,  # Number of iteration of the SGD procedure
    cache_ratings=True,  # Whether or not to cache ratings
    init_mean=0,  # Mean of normal distribution for factor vectors initialization
    init_std_dev=0.1,  # S.D. of normal distribution for factor vectors initialization
    lr_all=0.005,  # Learning rate for all parameters
    reg_all=0.02,  # Regularization term for all parameters
    lr_bu=None,  # Learning rate. Takes precedence over lr_all if set
    lr_bi=None,  # Learning rate. Takes precedence over lr_all if set
    lr_pu=None,  # Learning rate. Takes precedence over lr_all if set
    lr_qi=None,  # Learning rate. Takes precedence over lr_all if set
    lr_yj=None,  # Learning rate. Takes precedence over lr_all if set
    reg_bu=None,  # Regularization term. Takes precedence over reg_all if set
    reg_bi=None,  # Regularization term. Takes precedence over reg_all if set
    reg_pu=None,  # Regularization term. Takes precedence over reg_all if set
    reg_qi=None,  # Regularization term. Takes precedence over reg_all if set
    reg_yj=None,  # Regularization term. Takes precedence over reg_all if set
)
algo = SVDpp(**options, random_state=123)

result = cross_validate(algo, data, measures=["RMSE", "MAE", "FCP"], cv=5, n_jobs=-1)
cv_results["SVDpp"] = (
    result["test_rmse"].mean(),
    result["test_mae"].mean(),
    result["test_fcp"].mean(),
)

Evaluation

In [None]:
pd.DataFrame(cv_results, index=['rmse', 'mae', 'fcp']).T

| | rmse | mae | fcp |
|---|---|---|---|
| KNNBasic | 1.963572 | 1.521722 | 0.580354 |
| KNNWithZScore | 1.733412 | 1.289334 | 0.581157 |
| KNNBaseline | 1.755636 | 1.340416 | 0.545017 |
| SVD | 1.57106 | 1.210091 | 0.517932 |
| SVDpp | 1.584473 | 1.22166 | 0.516004 |
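As a reminder of what the first two columns measure, RMSE and MAE can be sketched in a few lines. Both are errors on the 1 to 10 rating scale, so lower is better, whereas FCP (fraction of concordant pairs) is a ranking measure where higher is better. In the table, SVD has the lowest RMSE and MAE while KNNWithZScore has the highest FCP. The helper names and `toy_pairs` below are hypothetical.

```python
from math import sqrt


def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    return sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true_rating, estimated_rating) pairs."""
    return sum(abs(true - est) for true, est in pairs) / len(pairs)


# Hypothetical toy predictions: errors of -1, +2 and 0 rating points
toy_pairs = [(8, 7.0), (5, 7.0), (10, 10.0)]
toy_rmse = rmse(toy_pairs)
toy_mae = mae(toy_pairs)
```

RMSE penalises the single two-point miss more heavily than MAE does, which is why the two measures can rank models differently.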


## Create final model

In [6]:
options = dict(
    k=40,
    min_k=1,
    sim_options = dict(
        name='cosine',
        user_based=True,
    )
)
algo = KNNWithZScore(**options)

full_train = data.build_full_trainset()
algo.fit(full_train)

testset = full_train.build_anti_testset()

Computing the cosine similarity matrix...
Done computing similarity matrix.
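`build_anti_testset()` returns every (user, item) pair that is absent from the training set, with the global mean rating as a placeholder. A minimal pure-Python sketch of that idea follows; `build_anti_pairs` and the toy data are hypothetical names used only for illustration.

```python
def build_anti_pairs(ratings):
    """List all (user, item, fill) triples for pairs the user has NOT rated.

    ratings is a list of (user_id, item_id, rating) tuples; fill is the
    global mean rating, used as a placeholder "true" rating.
    """
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    rated = {(u, i) for u, i, _ in ratings}
    global_mean = sum(r for _, _, r in ratings) / len(ratings)
    return [
        (u, i, global_mean)
        for u in users
        for i in items
        if (u, i) not in rated
    ]


# Hypothetical toy data: only user "b" and item "y" form an unrated pair
toy = [("a", "x", 6), ("a", "y", 8), ("b", "x", 4)]
anti = build_anti_pairs(toy)
```

With the 9689 users and 13743 books reported above, the anti-testset is on the order of 133 million triples, so building it and scanning it for one user is expensive.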


The main function is below.

In [7]:
def user_based_recommender(user_id, n):
    """
    Recommends the n best matching books for a given user

    Parameters
    ----------
    user_id : int
        User ID for which to get recommendations
    n : int
        Number of books to recommend

    Returns
    -------
    pd.DataFrame
        DataFrame containing the top n book recommendations for the specified user_id
    """
    # Filter the testset to include only rows with the specified user_id
    filtered_testset = [row for row in testset if row[0] == user_id]

    # Make predictions on the filtered testset
    predictions = algo.test(filtered_testset)

    # Get the top n predictions based on the estimated ratings ('est')
    top_n_predictions_df = pd.DataFrame(predictions).nlargest(n, 'est')

    # Creating a DataFrame from the top_n with columns 'ISBN' and 'estimated_rating'
    reduced_top_n_df = top_n_predictions_df.loc[:, ["iid", "est"]].rename(
        columns=dict(iid="ISBN", est="estimated_rating")
    )

    # Left-merge with the book metadata on the shared 'ISBN' column, keeping every recommendation
    merged_df = reduced_top_n_df.merge(books, how="left")

    # Selecting specific columns from the merged DataFrame to include in the final result
    final_df = merged_df[[
        "ISBN",
        "Book-Title",
        "Book-Author",
        "Year-Of-Publication",
    ]]
    
    return final_df
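Since the function only needs candidates for one user, an alternative is to score just that user's unseen items instead of filtering a global list. The sketch below is an assumption-laden illustration, not the notebook's method: `recommend_unseen` is a hypothetical helper, and `predict` is a hypothetical callable standing in for something like `lambda uid, iid: algo.predict(uid, iid).est`.

```python
import heapq


def recommend_unseen(user_id, ratings, predict, n):
    """Return the n unseen items with the highest predicted rating.

    ratings is a list of (user_id, item_id, rating) tuples and
    predict(user_id, item_id) returns an estimated rating.
    """
    seen = {i for u, i, _ in ratings if u == user_id}
    all_items = {i for _, i, _ in ratings}
    # Sort the candidates so the result is deterministic
    candidates = sorted(all_items - seen)
    scored = ((item, predict(user_id, item)) for item in candidates)
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


# Hypothetical toy data and a stand-in predictor with fixed scores
toy = [("a", "x", 6), ("a", "y", 8), ("a", "z", 9), ("a", "w", 2), ("b", "x", 4)]
fake_scores = {"y": 7.5, "z": 9.0, "w": 3.0}
top_two = recommend_unseen("b", toy, lambda uid, iid: fake_scores[iid], n=2)
```

The resulting (item, estimate) pairs could then be merged with `books` on `ISBN` in the same way as in the function above.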


Select a user with many ratings (here, the fifth most active user).

In [8]:
rating_count = ratings.drop(columns="ISBN")
rating_count = rating_count.groupby("User-ID")["Book-Rating"].agg(["count"]).reset_index()
rating_count.nlargest(10, "count").iloc[4, 0]

114368

Example usage: obtain the 10 books with the highest predicted rating for a user in the dataset.

In [9]:
user_id = 114368

user_based_recommender(user_id, 10)

| | ISBN | Book-Title | Book-Author | Year-Of-Publication |
|---|---|---|---|---|
| 0 | 843760494X | Cien Años de Soledad | Gabriel García Márquez | 1994 |
| 1 | 0439176824 | The Fall (The Seventh Tower, Book 1) | Garth Nix | 2000 |
| 2 | 0440228352 | Whirligig (Laurel Leaf Books) | Paul Fleischman | 1999 |
| 3 | 0140283641 | Kits Law | Donna Morrissey | 0 |
| 4 | 0689831404 | The Wind in the Willows (Aladdin Classics) | Kenneth Grahame | 1999 |
| 5 | 0330328743 | Butcher Boy | Patrick Mccabe | 0 |
| 6 | 0345419081 | The Eight | KATHERINE NEVILLE | 1997 |
| 7 | 060961004X | Eat Cake : A Novel | JEANNE RAY | 2003 |
| 8 | 0152099905 | The Borrowers | Mary Norton | 1989 |
| 9 | 0553213113 | Moby-Dick | HERMAN MELVILLE | 1981 |
