# Collaborative filtering

### Data

We'll use MovieLens, a dataset of movie ratings by users.

In [56]:
from fastai.collab import *
from fastai.tabular.all import *

set_seed(42)

path = untar_data(URLs.ML_100k)

In [57]:
# The main ratings are in u.data; it's tab-separated and doesn't include a header row
ratings = pd.read_csv(path / "u.data", delimiter="\t", header=None, names=["user", "movie", "rating", "timestamp"])

# To make it easier to see what is going on, we'll use the u.item file to get the movie titles as well
movies = pd.read_csv(
    path / "u.item", delimiter="|", encoding="latin-1", header=None, usecols=(0, 1), names=["movie", "title"]
)

ratings = ratings.merge(movies)
ratings.head()

Unnamed: 0,user,movie,rating,timestamp,title
0,196,242,3,881250949,Kolya (1996)
1,63,242,3,875747190,Kolya (1996)
2,226,242,5,883888671,Kolya (1996)
3,154,242,3,879138235,Kolya (1996)
4,306,242,5,876503793,Kolya (1996)


What if we want to predict how a user would rate a movie they haven't watched, so that we can decide whether to recommend it to them?
 
If we knew which categories a movie falls into, we could give it a score for each category. Then for each user we could score their likes and dislikes in the same categories. Multiply the two vectors element-wise and sum the results (a dot product) and we have a prediction.
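As a minimal sketch of that idea, assume three made-up categories (sci-fi, action, old movie) and hand-picked scores. Neither the categories nor the numbers come from MovieLens; they are purely illustrative.

```python
import torch

# Hypothetical categories: [sci-fi, action, old movie]
# How strongly the movie fits each category (made-up values)
movie = torch.tensor([0.9, 0.8, -0.6])
# How much the user likes each category (made-up values)
user = torch.tensor([0.8, 0.6, -0.9])

# Multiply element-wise and sum: this is the dot product
prediction = (user * movie).sum()

# torch.dot computes the same thing in one call
same_prediction = torch.dot(user, movie)

print(prediction, same_prediction)
```

Here the user and the movie agree on all three categories, so the score is positive (roughly 1.74). Flipping the sign of one user preference would pull the score down.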

But we don't know what these categories are, and we don't know how each user scores them.

## Learning latent factors

Let's see if we can learn some factors.

Imagine we decided that there were 5 of these factors. To start, we give each movie a random score for each factor, and we do the same for each user. Now we can predict a rating for every user and movie pair, although with random factors the predictions will be poor at first.

Now we can use SGD to improve our predictions: compare each prediction with the actual rating and adjust the factors to reduce the error.
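The loop below is a rough sketch of that process in plain PyTorch, not the fastai training loop used later. It assumes a tiny made-up set of five ratings, and for simplicity it uses the whole toy dataset as a single batch on every step, so strictly speaking it is full-batch gradient descent rather than SGD. The learning rate and step count are arbitrary choices.

```python
import torch

torch.manual_seed(0)

# Toy data, made up for illustration: one entry per (user, movie, rating)
users = torch.tensor([0, 0, 1, 2, 2])
movies = torch.tensor([0, 1, 0, 1, 2])
targets = torch.tensor([5.0, 1.0, 4.0, 5.0, 2.0])

n_factors = 5
# Start from random factors; requires_grad lets autograd track them
user_factors = torch.randn(3, n_factors, requires_grad=True)
movie_factors = torch.randn(3, n_factors, requires_grad=True)

def mse():
    # Dot product of each user's factors with the matching movie's factors
    preds = (user_factors[users] * movie_factors[movies]).sum(dim=1)
    return ((preds - targets) ** 2).mean()

initial_loss = mse().item()

lr = 0.05  # arbitrary learning rate for this toy problem
for step in range(200):
    loss = mse()
    loss.backward()
    with torch.no_grad():
        for p in (user_factors, movie_factors):
            p -= lr * p.grad  # step against the gradient
            p.grad.zero_()    # reset for the next step

final_loss = mse().item()
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

Nothing tells the model what the 5 factors mean; they are simply whatever values reduce the error on the ratings it has seen.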

In [58]:
# First make the dataloaders
dls = CollabDataLoaders.from_df(ratings, item_name="title", bs=64)
dls.show_batch()

Unnamed: 0,user,title,rating
0,542,My Left Foot (1989),4
1,422,Event Horizon (1997),3
2,311,"African Queen, The (1951)",4
3,595,Face/Off (1997),4
4,617,Evil Dead II (1987),1
5,158,Jurassic Park (1993),5
6,836,Chasing Amy (1997),3
7,474,Emma (1996),3
8,466,Jackie Chan's First Strike (1996),3
9,554,Scream (1996),3


In [59]:
# Now we need our factors
n_factors = 10
n_users = len(dls.classes["user"])
n_movies = len(dls.classes["title"])

user_factors = torch.randn(n_users, n_factors)
movie_factors = torch.randn(n_movies, n_factors)

In [60]:
# We need to be able to look up the factors for a given index, but that isn't something our models know how to do
# They just know matrix multiplication and activation functions
# We can do the lookup with a matrix multiply by using a one-hot encoded vector

# For example to get user factors at index 3
one_hot_3 = one_hot(3, n_users).float()
user_factors.t() @ one_hot_3, user_factors[3]

(tensor([-1.2604, -1.3016, -0.3323, -0.1222,  0.7545,  0.5075, -0.9962,  0.5073,
         -1.1468, -0.6767]),
 tensor([-1.2604, -1.3016, -0.3323, -0.1222,  0.7545,  0.5075, -0.9962,  0.5073,
         -1.1468, -0.6767]))

Most deep learning libraries (including PyTorch) don't actually do this. They have a layer that indexes directly into the matrix, which gives the same result as the matrix multiply with a one-hot vector would have. This is called an embedding.
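A quick way to check this equivalence is to compare the two approaches on a small embedding. The sketch below uses plain PyTorch (`nn.Embedding` and `torch.nn.functional.one_hot`) rather than the fastai wrappers, and the sizes and indices are made up.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)

# Small made-up sizes, just for the comparison
n_users, n_factors = 6, 4
emb = nn.Embedding(n_users, n_factors)

# A batch of user indices (repeats are fine)
idx = torch.tensor([3, 0, 3])

# Direct lookup through the embedding layer
looked_up = emb(idx)

# The same rows selected by multiplying one-hot vectors with the weight matrix
one_hots = F.one_hot(idx, num_classes=n_users).float()
via_matmul = one_hots @ emb.weight

print(torch.allclose(looked_up, via_matmul))
```

The lookup avoids building the one-hot vectors at all, which matters once there are thousands of users or movies.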

## Model from scratch

We are going to create a model from scratch rather than fine-tuning a pre-trained model.

In [65]:
# Make the model
class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)

    def forward(self, x):
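        # Column 0 holds the user indices
        users = self.user_factors(x[:, 0])

        # Column 1 holds the movie indices
        movies = self.movie_factors(x[:, 1])

        # Dot product per row: multiply element-wise, then sum over the factors
        return (users * movies).sum(dim=1)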

In [68]:
# Instantiate it with a learner
model = DotProduct(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())

model, learn

(DotProduct(
   (user_factors): Embedding(944, 50)
   (movie_factors): Embedding(1665, 50)
 ),
 <fastai.learner.Learner at 0x7f83ae6c4760>)

In [73]:
# Train it
learn.fit_one_cycle(5, 5e-3)

epoch,train_loss,valid_loss,time
0,0.194885,1.30107,00:06
1,0.275665,1.326712,00:06
2,0.227505,1.287911,00:05
3,0.144962,1.274299,00:05
4,0.085033,1.275448,00:06


It's a start, but we can hopefully do better.