# Popularity Recommender

Using the dataset [arashnic/book-recommendation-dataset](https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset), the function `popularity_recommender` returns the `n` most popular books.

The popularity is determined by a minimum number of user ratings (50 for now).
To provide diverse results, only one recommendation per author is provided. (This is mostly due to the Harry Potter and Lord of the Rings franchises dominating the popularity ratings.)

Initial data cleaning involves removing mal-formatted rows and removing the implicit ratings.
Only explicit ratings in the range from 1 to 10 are considered.

In [1]:
import zipfile
from urllib.request import urlretrieve

import pandas as pd

## Download the dataset

In [2]:
url = (
    "https://www.kaggle.com/api/v1/datasets/download/"
    "arashnic/book-recommendation-dataset?datasetVersionNumber=3"
)
zip_path = urlretrieve(url)[0]
with zipfile.ZipFile(zip_path, "r") as zf:

    with zf.open("Books.csv") as f:
        books = pd.read_csv(f)

    with zf.open("Ratings.csv") as f:
        ratings = pd.read_csv(f)

  books = pd.read_csv(f)


## Data Cleaning

In [3]:
# Any books to removem, remove from "books" and "ratings"
removed_isbn = []

# Three book entries are mal-formatted, and some authors and publishers are missing
removed_isbn += books[books["Year-Of-Publication"].str.isnumeric() == False].ISBN.to_list()
removed_isbn += books[books["Book-Author"].isna()].ISBN.to_list()
removed_isbn += books[books["Publisher"].isna()].ISBN.to_list()

# Remove the selected books in both "books" and "ratings"
books = books[~books["ISBN"].isin(removed_isbn)]
ratings = ratings[~ratings["ISBN"].isin(removed_isbn)]

# Implicit ratings are marked as zero. We are not using implicit ratings here for now
ratings = ratings.drop(ratings[ratings["Book-Rating"] == 0].index)

# Remove any books that received no ratings
books = books[books["ISBN"].isin(ratings["ISBN"])]

# Remove second editions (TODO: Keep the most popular edition or merge the ratings)
books = books.drop_duplicates(subset=["Book-Title", "Book-Author"])

# Remove any ratings to non-existing books
ratings = ratings[ratings["ISBN"].isin(books["ISBN"])]

# Convert year to integer
books["Year-Of-Publication"] = pd.to_numeric(books["Year-Of-Publication"])

## Reduce the dataset

The dataset is too large with nearly 200k books.
Creating a user-item-matrix is not feasible for a lightweight Streamlit app in the end.
Since there are a lot of users who gave only one rating and books that only received one rating, I will remove these.

In [4]:
users_to_drop = [1]
books_to_drop = [1]

# Because dropping users influces the number of ratings per book,
# we need to iterate until no more users or books are dropped
while len(users_to_drop) != 0 and len(books_to_drop) != 0:

    # Find users that gave less than five ratings
    few_rating_users = ratings["User-ID"].value_counts() < 4
    users_to_drop = few_rating_users[few_rating_users].index

    # Find books that received less than five ratings
    few_rated_books = ratings["ISBN"].value_counts() < 4
    books_to_drop = few_rated_books[few_rated_books].index

    # Remove them from the ratings
    ratings = ratings[~ratings["User-ID"].isin(users_to_drop)]
    ratings = ratings[~ratings["ISBN"].isin(books_to_drop)]
    books = books[books["ISBN"].isin(ratings["ISBN"])]

# Report remaining rating size
print(f"Remaining ratings: {ratings.shape[0]}")
print(f"Remaining users: {ratings['User-ID'].nunique()}")
print(f"Remaining books: {ratings['ISBN'].nunique()}, {books['ISBN'].nunique()}")

Remaining ratings: 119805
Remaining users: 8437
Remaining books: 11684, 11684


## Create popularity recommender

In [5]:
# Create a minimalistic DataFrame containing the mean and count of ratings
rating_count = ratings.drop(columns="User-ID")
rating_count = rating_count.groupby('ISBN')['Book-Rating'].agg(['mean', 'count']).reset_index()

The main function is below.

In [6]:
def popularity_recommender(n):
    """
    Recommends the n most popular books.

    Parameters
    ----------
    n : integer
        Number of books to recommend.

    Returns
    -------
    pd.DataFrame
        DataFrame with the top n most popular books.
    """
    count_threshold = 50

    # Get the most rated books above a rating count threshold 
    mask = rating_count["count"] > count_threshold

    # Get the best rated books sorted in descending order of their mean rating
    top_rated = rating_count[mask].sort_values("mean", ascending=False)

    # Combine rating and book list
    top_rated_books = top_rated.merge(books).drop(columns=["mean", "count"])

    # Ensure diverse results by only taking one book per author
    top_rated_books = top_rated_books.drop_duplicates(subset=["Book-Author"])

    # Grab the top n books
    top_rated_books = top_rated_books.head(n).reset_index()

    # Selecting specific columns from the merged DataFrame to include in the final result
    top_rated_books = top_rated_books[[
        "ISBN",
        "Book-Title",
        "Book-Author",
        "Year-Of-Publication",
    ]]

    return top_rated_books

Example usage to obtain the top 10 most popular books in the dataset.

In [7]:
popularity_recommender(10)

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication
0,345339738,"The Return of the King (The Lord of the Rings,...",J.R.R. TOLKIEN,1986
1,439139597,Harry Potter and the Goblet of Fire (Book 4),J. K. Rowling,2000
2,446310786,To Kill a Mockingbird,Harper Lee,1988
3,441172717,Dune (Remembering Tomorrow),Frank Herbert,1996
4,451524934,1984,George Orwell,1990
5,812550706,Ender's Game (Ender Wiggins Saga (Paperback)),Orson Scott Card,1994
6,440498058,A Wrinkle In Time,MADELEINE L'ENGLE,1998
7,553296981,Anne Frank: The Diary of a Young Girl,ANNE FRANK,1993
8,345348036,The Princess Bride: S Morgenstern's Classic Ta...,WILLIAM GOLDMAN,1987
9,345342968,Fahrenheit 451,RAY BRADBURY,1987
