## Intro
Recommendation systems are an application of machine learning so embedded in daily life that it can be surprising to think of them as machine learning at all. Nonetheless, looking at their internals provides good scaffolding for more advanced topics.

Netflix, Amazon, and Spotify all suggest shows, products, and songs to their users. Though the recommended items differ, they are all generated by a process called 'collaborative filtering'. Its essence boils down to three steps: identify the things you used or liked, find other users who used or liked the same things, and suggest other things that those users used or liked. Notably, the process doesn't rely on user data entry or any manual assignment of categories for recommendations. Instead, what happens is the attribution of **latent factors** to users and items. For users, these are numerical representations of the strength of the many and varied motivations behind their ratings and selections. Similarly, the items have corresponding numbers that represent how well they fit those criteria.
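The three steps can be sketched in plain Python with sets. This is only a toy illustration with made-up users and games (every name below is hypothetical), not how a production recommender or the model later in this post works; real systems learn latent factors rather than comparing raw overlaps.

```python
# Toy data: which games each (hypothetical) user liked.
liked = {
    "ana":   {"Risk", "Catan", "Carcassonne"},
    "ben":   {"Risk", "Catan", "Pandemic"},
    "chloe": {"Chess", "Go"},
}

def recommend(user, liked):
    """Suggest games liked by users who share at least one liked game."""
    mine = liked[user]                        # step 1: what this user liked
    suggestions = set()
    for other, theirs in liked.items():
        if other != user and mine & theirs:   # step 2: users with overlapping likes
            suggestions |= theirs - mine      # step 3: their other likes
    return sorted(suggestions)

print(recommend("ana", liked))    # ['Pandemic']
print(recommend("chloe", liked))  # []
```

Nobody had to label Pandemic as "similar to Catan"; the suggestion falls out of the overlap between users alone.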

## Example: boardgame ratings
To make things less abstract, let's use an example based on [a dataset of boardgames](https://www.kaggle.com/datasets/threnjen/board-games-database-from-boardgamegeek) from BoardGameGeek that's been uploaded to Kaggle. Once downloaded, it provides 9 files in total, but let's use just two to start with, giving us some basic data about boardgames and users.

In [None]:
from fastai.collab import *
from fastai.tabular.all import *
from pathlib import Path
# import zipfile

# zipdata = zipfile.ZipFile('boardgamegeek.zip')
# zipdata.extractall(path='./data')
# zipdata.close()
gamepath = Path('./data/games.csv')
games = pd.read_csv(gamepath)
subset = ['BGGId', 'Name', 'YearPublished', 'Kickstarted', 'NumUserRatings']
games[subset].head()

The `games.csv` file contains much more metadata than we need right now, so this is just a subset. As for the users, the file is nothing more than a pairing of game and user ids along with a rating. To keep things fast, we take a random sample of 75,000 ratings.

In [None]:
users = pd.read_csv('./data/user_ratings.csv').sample(75000)
users.head()

## An intuitive overview of the process
How does a machine understand if you like something and by how much? First by converting the terms of the discussion into numbers. Suppose we are considering Risk, the classic game of war and conquest, with 31510 ratings in the dataset. 

In [None]:
games.query('Name == "Risk"')[subset]

Users who rated it may have been considering any number of aspects they encountered while playing Risk, such as theme, playtime, complexity, and newness. Risk delivers fairly well on the theme of war, its playtime can be short as well as long, its rules have low complexity, and it is an old title. We could assign numbers between -1 and 1 to each of these like so:

In [None]:
risk = np.array([0.7,0.5,0.3,-0.6])

Similarly, a user might have a low interest in war games, be short on free time, prefer simplicity, and enjoy newer games. They could be assigned these numbers:

In [None]:
user1 = np.array([-.8,0.2,-0.5,0.6])

Collaborative filtering recommends items to users if the match between them is high, and it determines this by multiplying the arrays and adding up the result:

In [None]:
(user1 * risk).sum()

The operation is referred to as a **dot product** and the arrays of numbers are the latent factors. In this case the -0.97 indicates a poor match. Someone with the opposite preferences would yield a higher number and thus be recommended Risk.
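To see where the -0.97 comes from, the dot product can be broken into its per-factor contributions. The sketch below reuses the numbers from above in plain Python; `user2` is a hypothetical user with exactly the opposite preferences, added only for illustration.

```python
factors = ["theme", "playtime", "complexity", "newness"]
risk  = [0.7, 0.5, 0.3, -0.6]
user1 = [-0.8, 0.2, -0.5, 0.6]
user2 = [-u for u in user1]   # hypothetical user with opposite tastes

def dot(a, b):
    """Multiply element by element, then add up the products."""
    return sum(x * y for x, y in zip(a, b))

# Contribution of each factor to user1's match with Risk.
for name, u, r in zip(factors, user1, risk):
    print(f"{name:>10}: {u * r:+.2f}")

print("user1 match:", round(dot(user1, risk), 2))  # -0.97
print("user2 match:", round(dot(user2, risk), 2))  # 0.97
```

Most of the mismatch for `user1` comes from theme and newness, the two factors where the user's preference and the game's profile point in opposite directions.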

The latent factors in our example were arbitrarily selected (both the array lengths and array values), but in practice machine learning doesn't start any differently--the process initializes from random weights and refines them as the model learns.
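A minimal sketch of "start random, then refine", assuming a single user, a single game, and one observed match score of 0.8. The target, the learning rate, and the number of steps are all made up for the demo, and the loop is plain gradient descent on the squared error rather than fastai's training loop.

```python
import random

random.seed(0)
n_factors = 4
target = 0.8   # made-up "true" match score for this user/game pair
lr = 0.1       # learning rate, chosen arbitrarily for the demo

# Start from small random latent factors.
user = [random.uniform(-0.1, 0.1) for _ in range(n_factors)]
game = [random.uniform(-0.1, 0.1) for _ in range(n_factors)]

def predict(u, g):
    """Dot product of the two factor lists."""
    return sum(a * b for a, b in zip(u, g))

initial_loss = (predict(user, game) - target) ** 2

for step in range(200):
    err = predict(user, game) - target
    # Gradient of err**2 with respect to each factor.
    grad_user = [2 * err * g for g in game]
    grad_game = [2 * err * u for u in user]
    user = [u - lr * gu for u, gu in zip(user, grad_user)]
    game = [g - lr * gg for g, gg in zip(game, grad_game)]

final_loss = (predict(user, game) - target) ** 2
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.6f}")
```

The factors that come out have no predefined meaning such as "theme" or "playtime"; they are simply whatever numbers make the predictions match the ratings.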

## Setting up
To start, we need to put the data into `DataLoaders`, a fastai object that handles the creation of mini-batches for iteration during training, and identify some constants: the number of users and games in the dataset. We arbitrarily pick 5 as the number of factors to train for.

In [None]:
ratings = users.merge(games)
dls = CollabDataLoaders.from_df(ratings, item_name='BGGId', rating_name='Rating', user_name='Username')
n_users  = len(dls.classes['Username'])
n_games = len(dls.classes['BGGId'])
n_factors = 5

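The next step is to create the tensors that will hold the latent factors: one row of numbers per user and one per boardgame. Looking up a row by index gives the same result as multiplying the whole tensor by a **one-hot encoding**, a vector that is all zeros except at one index, so the model can simply index into these tensors. They also need to be wrapped in `nn.Parameter` so that PyTorch treats them as trainable weights, and they start out as small random numbers.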

In [None]:
def create_params(size):
    return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))
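One-hot encodings connect to the model through indexing: multiplying a one-hot vector by a matrix of factors returns the same row as looking that row up directly. The sketch below checks this by hand with plain Python lists and a made-up 3x2 factor matrix (not values from the dataset).

```python
# Made-up factor matrix: 3 games, 2 latent factors each.
game_factors = [
    [0.1, 0.9],
    [0.4, -0.3],
    [-0.7, 0.2],
]

def one_hot(index, size):
    """Vector of zeros with a single 1 at `index`."""
    return [1 if i == index else 0 for i in range(size)]

def vec_mat(vec, mat):
    """Multiply a row vector by a matrix."""
    n_cols = len(mat[0])
    return [sum(v * row[c] for v, row in zip(vec, mat)) for c in range(n_cols)]

idx = 1
via_one_hot = vec_mat(one_hot(idx, len(game_factors)), game_factors)
via_lookup = game_factors[idx]
print(via_one_hot, via_lookup)  # both are [0.4, -0.3]
```

Indexing skips all the multiplications by zero, which matters when there are many thousands of users and games.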

## The main components of the model
The model will be trained by using dot products on the users and boardgames but there are additional pieces that improve its performance.

### Bias
One of the most relatable is bias. We all know people who think everything they try is the most amazing thing ever. Conversely, some people find flaws in everything. Likewise, some games are rated well or poorly by nearly everyone. Adding one extra number per user and one per game, the bias, lets the model absorb these tendencies so that they don't distort the latent factors.

In [None]:
class BoardGameRecs(Module):
    def __init__(self, n_users, n_games, n_factors, y_range=(0,10.5)):
        self.user_factors = create_params([n_users, n_factors])
        self.user_bias = create_params([n_users])
        self.game_factors = create_params([n_games, n_factors])
        self.game_bias = create_params([n_games])
        self.y_range = y_range

    def forward(self, x):
        users = self.user_factors[x[:,0]]
        games = self.game_factors[x[:,1]]
        res = (users * games).sum(dim=1)
        res += self.user_bias[x[:,0]] + self.game_bias[x[:,1]]
        return sigmoid_range(res, *self.y_range)
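A minimal sketch of what a single prediction looks like, written in plain Python. The `sigmoid_range` here mirrors what the fastai helper of the same name computes (a sigmoid rescaled to lie between `low` and `high`); the factor and bias values are hypothetical.

```python
import math

def sigmoid_range(x, low, high):
    """Squash any real number into the open interval (low, high)."""
    return 1 / (1 + math.exp(-x)) * (high - low) + low

def predict(user_f, game_f, user_b, game_b, y_range=(0, 10.5)):
    """Dot product of the factors, plus both biases, squashed into y_range."""
    raw = sum(u * g for u, g in zip(user_f, game_f)) + user_b + game_b
    return sigmoid_range(raw, *y_range)

user_f, game_f = [0.5, -0.2], [0.8, 0.1]   # made-up latent factors
neutral  = predict(user_f, game_f, user_b=0.0, game_b=0.0)
generous = predict(user_f, game_f, user_b=1.0, game_b=0.0)  # rates everything highly
print(round(neutral, 2), round(generous, 2))
```

With identical factors, the user with a positive bias gets a higher predicted rating, and the sigmoid keeps every prediction inside the rating scale. The upper bound sits slightly above 10 because a sigmoid only approaches its limits without ever reaching them.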

### Weight decay
Weight decay (also called L2 regularization) modifies the loss every time it is calculated by adding the sum of all the squared weights, multiplied by a small constant (`wd`). The intent is to counteract overfitting. Making the loss bigger is counterproductive in the short run, since it makes the training data harder to fit, but the tradeoff is worth it.

But how does adding a penalty prevent overfitting? Consider a plot of an overfit function:

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.relplot(games, x='AvgRating', y='NumUserRatings', alpha=.5)
plt.plot([1, 10], [0, 77000], label='under fit', color='navy')
plt.plot(range(1,11), [13, 300, 4000, 100, 8000, 2000, 51000, 110000, 700, 123], label='over fit', color='red')
plt.legend(loc="upper left")
plt.show()

When models are underfit, the fitted function can take a wildly inaccurate and straight path. Overfit models, on the other hand, tend to zigzag as they try to adhere to every data point. The scatterplot above uses some of the omitted columns in the boardgame dataset, overlaid with cherry-picked numbers for illustration, but this example also shows the zigzag pattern:
![](att_00000.png)

Next, consider that quadratics such as

{{< katex >}}
$$ax^2+bx+c$$

get steeper and narrower as *a* grows larger. Sharp zigzags like the ones above require large coefficients, so penalizing large weights pushes the model toward smoother functions that are less prone to overfitting.
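In its simplest form, weight decay adds the sum of squared weights, scaled by a factor `wd`, to the loss. The sketch below shows this in plain Python with made-up weights and a made-up base loss. It is an illustration of the idea only; in fastai the decay is typically handled inside the optimizer step rather than by an explicit function like this one.

```python
def loss_with_wd(base_loss, weights, wd):
    """Base loss plus wd times the sum of squared weights."""
    return base_loss + wd * sum(w ** 2 for w in weights)

def wd_gradient(weights, wd):
    """Extra gradient contributed by the penalty: 2 * wd * w per weight."""
    return [2 * wd * w for w in weights]

small = [0.1, -0.2, 0.1]    # made-up, modest weights
large = [3.0, -4.0, 2.5]    # made-up, large weights
base_loss = 0.5             # pretend both sets fit the data equally well

print(loss_with_wd(base_loss, small, wd=0.1))
print(loss_with_wd(base_loss, large, wd=0.1))
print(wd_gradient(large, wd=0.1))
```

Because the penalty's gradient is proportional to each weight, every update nudges large weights back toward zero unless the data gives a strong reason to keep them.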

### What is `forward` and `x`?
`forward` is the method PyTorch runs whenever the model is called on a batch, as in `model(x)`. The model input `x` is a tensor of shape `[batch_size, 2]`, where the first column holds user ids (`x[:,0]`) and the second column holds game ids (`x[:,1]`).
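When a model object is called like a function, PyTorch passes the arguments on to its `forward` method. The sketch below is a simplified stand-in in plain Python: `TinyModule` is a hypothetical class that imitates only this one behaviour of `nn.Module`, and the batch is a list of `[user_id, game_id]` pairs standing in for the `[batch_size, 2]` tensor.

```python
class TinyModule:
    """Hypothetical, stripped-down imitation of nn.Module."""
    def __call__(self, x):
        # Calling model(x) hands the batch to forward(x).
        return self.forward(x)

class ColumnSplitter(TinyModule):
    def forward(self, x):
        user_ids = [row[0] for row in x]   # plays the role of x[:,0]
        game_ids = [row[1] for row in x]   # plays the role of x[:,1]
        return user_ids, game_ids

# A batch of 3 ratings: each row is [user_id, game_id].
batch = [[12, 7], [3, 7], [12, 41]]
model = ColumnSplitter()
user_ids, game_ids = model(batch)
print(user_ids, game_ids)  # [12, 3, 12] [7, 7, 41]
```

In the real model, those two columns are then used to index into the factor and bias parameters.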

## Results
Now we can look at some training results after 5 epochs.

### With weight decay

In [None]:
model = BoardGameRecs(n_users, n_games, n_factors)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)

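### Without weight decay

In [None]:
learn = Learner(dls, BoardGameRecs(n_users, n_games, n_factors), loss_func=MSELossFlat())  # fresh model for a fair comparison
learn.fit_one_cycle(5, 5e-3, wd=0.0)  # wd=0 turns weight decay off explicitly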

## Deep learning version
Converting this model to use deep learning is a simple modification to the class: instead of taking dot products, the deep learning model will concatenate the user and item embeddings and pass the result through a small neural network.

In [None]:
class CollabNN(Module):
    def __init__(self, user_sz, item_sz, y_range=(0,10.5), n_act=100):
        self.user_factors = Embedding(*user_sz)
        self.item_factors = Embedding(*item_sz)
        self.layers = nn.Sequential(
            nn.Linear(user_sz[1]+item_sz[1], n_act),
            nn.ReLU(),
            nn.Linear(n_act, 1))
        self.y_range = y_range
        
    def forward(self, x):
        embs = self.user_factors(x[:,0]),self.item_factors(x[:,1])
        x = self.layers(torch.cat(embs, dim=1))
        return sigmoid_range(x, *self.y_range)

embs = get_emb_sz(dls)
model = CollabNN(*embs)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.01)
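The difference from the dot product version can be sketched in plain Python. All numbers below are made up, and the hidden layer has only two units instead of the `n_act=100` used above; the point is only to show concatenation followed by linear layers and a ReLU.

```python
def linear(inputs, weights, biases):
    """One output per row of weights: weighted sum of inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    """Replace negative values with zero."""
    return [max(0.0, v) for v in values]

user_emb = [0.2, -0.5, 0.1]   # made-up user embedding (3 factors)
game_emb = [0.7, 0.3]         # made-up game embedding (2 factors)

# Concatenate instead of taking a dot product, so the sizes may differ.
x = user_emb + game_emb       # length 3 + 2 = 5

hidden = relu(linear(x, [[0.1, 0.2, 0.3, 0.4, 0.5],
                         [-0.5, 0.4, -0.3, 0.2, -0.1]], [0.0, 0.1]))
output = linear(hidden, [[1.0, -1.0]], [0.0])
print(len(x), hidden, output)
```

A dot product requires both vectors to have the same length, whereas concatenation does not, which is why the user and item embeddings in `CollabNN` can be given different sizes.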

## Further reading
A great post on the broader context of fitting categorical data into machine learning models, which the one-hot encoding used here is a part of, can be found here: https://www.featureform.com/post/the-definitive-guide-to-embeddings
