# Cleaning

Using the dataset [arashnic/book-recommendation-dataset](https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset), initial data cleaning involves removing mal-formatted rows and removing the implicit ratings. Only explicit ratings in the range from 1 to 10 are considered.

In [1]:
import zipfile
from urllib.request import urlretrieve

import pandas as pd
from sklearn import preprocessing

## Download the dataset

In [None]:
url = (
    "https://www.kaggle.com/api/v1/datasets/download/"
    "arashnic/book-recommendation-dataset?datasetVersionNumber=3"
)
zip_path = urlretrieve(url)[0]
with zipfile.ZipFile(zip_path, "r") as zf:

    with zf.open("Books.csv") as f:
        books = pd.read_csv(f)

    with zf.open("Ratings.csv") as f:
        ratings = pd.read_csv(f)

## Data Cleaning

In [3]:
# Any books to removem, remove from "books" and "ratings"
removed_isbn = []

# Three book entries are mal-formatted, and some authors and publishers are missing
removed_isbn += books[books["Year-Of-Publication"].str.isnumeric() == False].ISBN.to_list()
removed_isbn += books[books["Book-Author"].isna()].ISBN.to_list()
removed_isbn += books[books["Publisher"].isna()].ISBN.to_list()

# Remove the selected books in both "books" and "ratings"
books = books[~books["ISBN"].isin(removed_isbn)]
ratings = ratings[~ratings["ISBN"].isin(removed_isbn)]

# Implicit ratings are marked as zero. We are not using implicit ratings here for now
ratings = ratings.drop(ratings[ratings["Book-Rating"] == 0].index)

# Remove any books that received no ratings
books = books[books["ISBN"].isin(ratings["ISBN"])]

# Remove second editions (TODO: Keep the most popular edition or merge the ratings)
books = books.drop_duplicates(subset=["Book-Title", "Book-Author"])

# Remove any ratings to non-existing books
ratings = ratings[ratings["ISBN"].isin(books["ISBN"])]

# Convert year to integer
books["Year-Of-Publication"] = pd.to_numeric(books["Year-Of-Publication"])

## Reduce the dataset

The dataset is too large with nearly 200k books.
Creating a user-item-matrix is not feasible for a lightweight Streamlit app in the end.
Since there are a lot of users who gave only one rating and books that only received one rating, I will remove these.

In [4]:
users_to_drop = [1]
books_to_drop = [1]

# Because dropping users influces the number of ratings per book,
# we need to iterate until no more users or books are dropped
while len(users_to_drop) != 0 and len(books_to_drop) != 0:

    # Find users that gave less than five ratings
    few_rating_users = ratings["User-ID"].value_counts() < 10
    users_to_drop = few_rating_users[few_rating_users].index

    # Find books that received less than five ratings
    few_rated_books = ratings["ISBN"].value_counts() < 10
    books_to_drop = few_rated_books[few_rated_books].index

    # Remove them from the ratings
    ratings = ratings[~ratings["User-ID"].isin(users_to_drop)]
    ratings = ratings[~ratings["ISBN"].isin(books_to_drop)]
    books = books[books["ISBN"].isin(ratings["ISBN"])]

# Report remaining rating size
print(f"Remaining ratings: {ratings.shape[0]}")
print(f"Remaining users: {ratings['User-ID'].nunique()}")
print(f"Remaining books: {ratings['ISBN'].nunique()}, {books['ISBN'].nunique()}")

Remaining ratings: 29557
Remaining users: 1383
Remaining books: 1505, 1505


## Scale the ratings

The Streamlit feedback rating widget spans a range from 1 to 5 stars. So we will scale the rating to the same range.

Because that range is compressed and may lose information, we will use power transform to diversify the ratings. Note that this results in ratings that are no longer accurate, but it is a good demonstration.

In [22]:
ratings["Book-Rating"] = preprocessing.power_transform(ratings[["Book-Rating"]])
ratings["Book-Rating"] = preprocessing.minmax_scale(ratings[["Book-Rating"]], (1, 5))
ratings["Book-Rating"] = ratings["Book-Rating"].round().astype(int)

## Write to disk

In [None]:
books.to_csv("../data/books/clean/books.csv", index=False)
ratings.to_csv("../data/books/clean/ratings.csv", index=False)