# Collaborative Filtering with FastAI

In [None]:
# General Data Science
import numpy as np
import pandas as pd
import plotly.express as px

# Collaborative Filtering
from fastai.collab import CollabDataBunch, collab_learner

# Miscellaneous
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Let's perform a simple collaborative filtering exercise with fast.ai on the anime recommendations database. This database consists of two tables: a ratings table, which records the rating each user gave to each anime they viewed, and an anime table, which holds some further information about each anime.

## Data Preparation

Let's begin by joining these two tables together and taking a peek at their content.

In [None]:
rating = pd.read_csv("/kaggle/input/anime-recommendations-database/rating.csv")
anime = pd.read_csv("/kaggle/input/anime-recommendations-database/anime.csv")
provided_data = rating.merge(anime, how="left", on="anime_id", suffixes=("", "_overall"))
provided_data
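
To make the effect of the `suffixes` argument concrete, here is a minimal sketch on two made-up tables (the `_toy` frames and their values are hypothetical stand-ins, not the real dataset). Both tables carry a `rating` column, so the merge keeps the user's rating under its original name and renames the anime's overall rating to `rating_overall`.

```python
import pandas as pd

# Toy stand-ins for rating.csv and anime.csv (all values are made up).
rating_toy = pd.DataFrame({
    "user_id": [1, 1, 2],
    "anime_id": [10, 20, 10],
    "rating": [8, -1, 9],        # the rating a user gave an anime
})
anime_toy = pd.DataFrame({
    "anime_id": [10, 20],
    "name": ["Anime A", "Anime B"],
    "rating": [7.5, 6.9],        # the anime's overall rating
})

# A left join keeps every row of the ratings table; the clashing
# right-hand "rating" column receives the "_overall" suffix.
merged_toy = rating_toy.merge(anime_toy, how="left", on="anime_id",
                              suffixes=("", "_overall"))
print(merged_toy.columns.tolist())
```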

Let's view some statistical summary information about this data.

In [None]:
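# Treat the ID columns as labels rather than numbers so that
# describe() reports counts and unique values for them
for field in ["user_id", "anime_id"]:
    provided_data[field] = provided_data[field].astype(str)
provided_data.describe(include="all")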

It would be nice to refer to anime by their name instead of their assigned ID key going forward. Let's check to see if the mapping between *anime_id* and *name* is 1-to-1.

In [None]:
provided_data.groupby(["anime_id", "name"]).size().reset_index().describe(include="all")
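
As a more direct check of the same question, we could count the distinct IDs per name and the distinct names per ID. The sketch below does this on a small made-up frame (`toy` and its values are hypothetical); applying the same two `groupby` calls to `provided_data` should surface any name that maps to more than one ID.

```python
import pandas as pd

# Hypothetical data: name "C" is attached to two different IDs.
toy = pd.DataFrame({
    "anime_id": ["1", "1", "2", "3", "4"],
    "name": ["A", "A", "B", "C", "C"],
})

# Number of distinct IDs behind each name, and of names behind each ID.
ids_per_name = toy.groupby("name")["anime_id"].nunique()
names_per_id = toy.groupby("anime_id")["name"].nunique()

# The mapping is 1-to-1 only if both lists come back empty.
duplicated_names = ids_per_name[ids_per_name > 1].index.tolist()
ids_with_many_names = names_per_id[names_per_id > 1].index.tolist()
print(duplicated_names, ids_with_many_names)
```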

It seems that this is the case except for one film, 'Saru Kani Gassen', which seems to have two different ID codes. Let's view all data records with this name.

In [None]:
provided_data.loc[provided_data["name"] == "Saru Kani Gassen"]

As we can see, there are only three records with this name. Since this is the only instance where there is a disparity between the *anime_id* and *name* for an anime, and since it occurs for an anime that doesn't seem to be particularly popular, we can safely drop the *anime_id* field and use *name* in its place.

In [None]:
provided_data.drop("anime_id", axis=1, inplace=True)

## Exploratory Data Analysis (EDA)

Let's now spend some time exploring the data we've been provided with. As we saw above, we have just under 8 million records in this table. Let's check the number of unique users and anime that we have.

In [None]:
print("Number of unique users:", len(provided_data["user_id"].unique()))
print("Number of unique anime:", len(provided_data["name"].unique()))

As we can see, we have just over 70,000 users in the dataset. Let's see the distribution of anime viewed by these users.

In [None]:
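# rename_axis + reset_index(name=...) yields the same column names
# regardless of how the installed pandas version labels value_counts()
num_user_views = (provided_data["user_id"].value_counts()
                  .rename_axis("User")
                  .reset_index(name="Number of Views by User"))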
fig = px.histogram(num_user_views, x="Number of Views by User")
fig.show()

It seems that most users have only seen a small number of anime, though some users have viewed thousands. Let's now check the typical number of views garnered by each anime production in this dataset.

In [None]:
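# rename_axis + reset_index(name=...) yields the same column names
# regardless of how the installed pandas version labels value_counts()
num_anime_views = (provided_data["name"].value_counts()
                   .rename_axis("Anime")
                   .reset_index(name="Number of Views of Anime"))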
fig = px.histogram(num_anime_views, x="Number of Views of Anime")
fig.show()

As we can see, the majority of anime productions garner only a small number of views. Let's check out the ratings distribution in this dataset.

In [None]:
# Count each rating value, ordered by rating rather than by frequency
rating_distro = (provided_data["rating"].value_counts()
                 .sort_index()
                 .rename_axis("Rating")
                 .reset_index(name="Frequency"))
fig = px.bar(rating_distro, x="Rating", y="Frequency")
fig.show()

Viewing the above, we can see that the most common rating is 8, and comparatively few ratings fall below 6. There is also a high number of records, recorded with a rating of -1, where a user provided no rating for the anime they viewed. Let's finish this exploration by viewing the most popular productions in this dataset, with popular in this circumstance being defined as anime with the most views.

In [None]:
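# Take the ten most-viewed anime, then reverse the order so that the
# most-viewed one is drawn at the top of the horizontal bar chart
most_popular_animes = (provided_data["name"].value_counts().head(10)
                       .rename_axis("Anime")
                       .reset_index(name="View Frequency")[::-1])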
fig = px.bar(most_popular_animes, x="View Frequency", y="Anime", orientation="h")
fig.show()

## Collaborative Filtering

Let's now build a collaborative filtering model for user anime ratings. For this exercise, we will only consider the user ID, anime name, and rating in each record, not the supplementary data fields provided to us. Since a number of records have no rating (these carry a rating of -1), we will hold them out as our final 'test set' and later fill in, for each such user-anime combination, the rating predicted by the model. Let's therefore drop any unnecessary fields and partition our dataset into the data that will be used for model development and the 'test' data that will be used for final model predictions.

In [None]:
data_filter = provided_data["rating"] != -1
provided_data.drop(["genre", "type", "episodes", "rating_overall", "members"], axis=1, inplace=True)
train_val = provided_data[data_filter]
test = provided_data[~data_filter]

Let's now aggregate our data into a fast.ai 'CollabDataBunch' object, randomly partitioning 20% of the model development data for model validation, and setting the batch size to 256.

In [None]:
# valid_pct=0.2 makes the 20% validation split explicit
data = CollabDataBunch.from_df(train_val, valid_pct=0.2, user_name="user_id", item_name="name",
                               rating_name="rating", seed=0, test=test, bs=256)
data
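
For intuition about what a seeded random 20% hold-out does, here is a minimal NumPy sketch. It only illustrates the idea and is not fast.ai's actual implementation; `random_split_indices` is a hypothetical helper introduced for this example.

```python
import numpy as np

def random_split_indices(n_rows, valid_pct=0.2, seed=0):
    """Return (train_idx, valid_idx) drawn from a seeded shuffle of row positions."""
    rng = np.random.RandomState(seed)   # fixed seed -> reproducible split
    shuffled = rng.permutation(n_rows)
    cut = int(valid_pct * n_rows)       # number of rows held out for validation
    return shuffled[cut:], shuffled[:cut]

train_idx, valid_idx = random_split_indices(1000, valid_pct=0.2, seed=0)
print(len(train_idx), len(valid_idx))
```

Calling the helper twice with the same seed returns the same partition, which is what makes validation results comparable between runs.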

Let's check out a sample of this aggregated data.

In [None]:
data.show_batch()

Let's now proceed to train a recommendation system for this task. We will use a 'collab_learner' object from fast.ai, setting the number of latent factors to 100 and indicating that ratings range from 1 to 10 (inclusive).

In [None]:
learn = collab_learner(data, n_factors=100, y_range=[1, 10])
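
For intuition, a learner of this kind can be pictured as a dot product between a user's and an anime's latent-factor vectors, plus a bias term for each, squashed into the ratings range by a scaled sigmoid. The NumPy sketch below is written under that assumption, with toy sizes and random, untrained weights; the names are illustrative and this is not fast.ai's own code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_factors = 4, 5, 3   # toy sizes; the notebook uses n_factors=100
y_low, y_high = 1.0, 10.0               # plays the role of y_range

# Randomly initialised parameters; training would adjust all four arrays.
user_factors = rng.normal(scale=0.1, size=(n_users, n_factors))
item_factors = rng.normal(scale=0.1, size=(n_items, n_factors))
user_bias = np.zeros(n_users)
item_bias = np.zeros(n_items)

def predict(user_idx, item_idx):
    """Predicted rating for each (user, item) index pair."""
    raw = (user_factors[user_idx] * item_factors[item_idx]).sum(axis=-1)
    raw = raw + user_bias[user_idx] + item_bias[item_idx]
    # Scaled sigmoid maps any real number into the open interval (y_low, y_high).
    return y_low + (y_high - y_low) / (1.0 + np.exp(-raw))

preds = predict(np.array([0, 1, 3]), np.array([2, 2, 4]))
print(preds)
```

Because the sigmoid only approaches its limits asymptotically, predictions under this formulation can get close to 1 or 10 but never reach them exactly, which is one reason a slightly wider range is sometimes chosen.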

Let's perform an 'lr_find' operation to get an idea of what learning rate we should use.

In [None]:
learn.lr_find()
learn.recorder.plot()

Judging by the above graph, it seems that a learning rate of 3e-1 will be suitable for this task. Let's now train this model for 3 epochs using the 1-cycle policy.

In [None]:
learn.fit_one_cycle(3, 3e-1)
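
To give a feel for the 1-cycle policy, the sketch below computes a learning-rate schedule that warms up to the maximum rate and then anneals far below it. The function `one_cycle_lr` and the values of `pct_start`, `div_factor`, and `final_div` are assumptions chosen for illustration; fast.ai's own defaults may differ, and the accompanying momentum schedule is not shown.

```python
import math

def one_cycle_lr(step, total_steps, max_lr, pct_start=0.3, div_factor=25.0, final_div=1e4):
    """Cosine warm-up from max_lr/div_factor to max_lr, then cosine decay to max_lr/final_div."""
    def cos_anneal(start, end, pct):
        # Equals start at pct=0 and end at pct=1, following half a cosine wave.
        return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)

    warm_steps = int(total_steps * pct_start)
    if step < warm_steps:
        return cos_anneal(max_lr / div_factor, max_lr, step / warm_steps)
    decay_steps = max(1, total_steps - warm_steps - 1)
    return cos_anneal(max_lr, max_lr / final_div, (step - warm_steps) / decay_steps)

# Schedule for 100 training steps with the peak rate used above.
schedule = [one_cycle_lr(s, 100, max_lr=3e-1) for s in range(100)]
print(schedule[0], max(schedule), schedule[-1])
```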

As we can see, we've reduced our validation loss by a fair bit after only a few epochs. Let's run our 'lr_find' operation again to see whether we should modify our learning rate before training for a few more epochs.

In [None]:
learn.lr_find()
learn.recorder.plot()

To prevent our loss from increasing, let's lower our learning rate to 3e-3 and train for a few more epochs.

In [None]:
learn.fit_one_cycle(3, 3e-3)

Using this model, let's check to see the calculated bias of the most popular animes we observed above.

In [None]:
# Pair each of the ten most-viewed anime with its learned bias
most_popular_animes_bias = pd.DataFrame({
    "Anime": most_popular_animes["Anime"].tolist(),
    "Bias": learn.bias(most_popular_animes["Anime"]).tolist(),
})
fig = px.bar(most_popular_animes_bias, x="Bias", y="Anime", orientation="h")
fig.show()

As we can see, all of the 10 most popular animes have a bias of at least 1, with 'Fullmetal Alchemist: Brotherhood' having a bias of 2.5.

## To be continued...

I intended to end this analysis by generating predictions from the model on the test set and analyzing the results, but due to issues I experienced trying to achieve this, I will hold off for now and return to it at a later date.