In [None]:
!pip install -U fastbook torchtext==0.8.1


# Collaborative Filtering Deep Dive

Recommendation systems use *collaborative filtering* to recommend products to a user that other, similar users have liked. Collaborative filtering does not require the model to know the exact properties of an item in order to recommend it. For example, Netflix does not need to know which genres a user tends to watch; it only needs to know that other users who watched the same movies also liked certain other movies. Those other movies are then recommended to the user, since the watch histories are similar.

These systems rely on *latent factors*: underlying properties that characterize the movies (such as genre or age) but that never appear as an explicit column in the data table.

## A First Look at the Data

Import the fastai collaborative filtering and tabular modules, then download the MovieLens 100k dataset.

In [None]:
from fastai.collab import *
from fastai.tabular.all import *
path = untar_data(URLs.ML_100k)

The ratings table is stored in the file *u.data*. The data is tab-separated and the columns are *user*, *movie*, *rating*, and *timestamp*. We use Pandas to read in the data and keep only the first 100 rows.

In [None]:
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user', 'movie', 'rating', 'timestamp'])

ratings = ratings[:100]
ratings.head()

Unnamed: 0,user,movie,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


Below is a small example of how numbers between -1 and 1 can represent how strongly a movie has a given property. Each movie array contains three numbers: the first represents how sci-fi the movie is (closer to 1 means more sci-fi); the second represents how action-based it is (closer to 1 means more action); and the third represents how old it is (closer to 1 means older).

We can describe users with the same three properties, based on how much they like each one. `user1` really likes sci-fi, action, and newer movies, so we can estimate how much this user would like *The Last Skywalker* using the *dot product*: multiply the corresponding elements of the two vectors together, then sum the results.

In [None]:
last_skywalker = np.array([0.98,0.9,-0.9])
user1 = np.array([0.9,0.8,-0.6])
(user1 * last_skywalker).sum()

2.1420000000000003

Applying the same calculation to *Casablanca*, which is old and has little sci-fi or action, gives a negative score, so `user1` is predicted to like this movie much less:

In [None]:
casablanca = np.array([-0.99,-0.3,0.8])
(user1 * casablanca).sum()

-1.611

## Learning the Latent Factors

We will use the latent factors as the parameters of our model. Latent factors are the underlying properties: in the movie example, how much a user likes action and how much action a movie contains. Training proceeds in three steps: (1) randomly initialize the factors; (2) predict how much a user will like a movie by taking the dot product of the user's factors and the movie's factors; and (3) calculate the loss.

The loss measures how far the predicted rating is from the rating the user actually gave the movie. The optimizer then adjusts the factors to reduce that loss.
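
The three steps above can be sketched in plain Python, without fastai or PyTorch. This is only an illustration under assumptions: the toy ratings, the names `toy_ratings`, `u_fac`, and `m_fac`, the learning rate, and the hand-written gradient step are all made up for this example and are not taken from the notebook.

```python
import random

random.seed(0)

# Hypothetical toy data: (user index, movie index, observed rating).
toy_ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 1, 2.0)]
n_users, n_movies, n_factors = 2, 2, 3

def init(rows, cols):
    # Step 1: randomly initialize the latent factors with small values.
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

u_fac = init(n_users, n_factors)
m_fac = init(n_movies, n_factors)

def predict(u, m):
    # Step 2: the prediction is the dot product of the two factor vectors.
    return sum(a * b for a, b in zip(u_fac[u], m_fac[m]))

def mse():
    # Step 3: mean squared error between predictions and actual ratings.
    return sum((predict(u, m) - r) ** 2 for u, m, r in toy_ratings) / len(toy_ratings)

loss_before = mse()

lr = 0.02  # assumed learning rate for this toy problem
for epoch in range(1000):
    for u, m, r in toy_ratings:
        err = predict(u, m) - r
        for k in range(n_factors):
            # Gradients of the squared error with respect to each factor,
            # both computed before either factor is updated.
            grad_u = 2 * err * m_fac[m][k]
            grad_m = 2 * err * u_fac[u][k]
            u_fac[u][k] -= lr * grad_u
            m_fac[m][k] -= lr * grad_m

loss_after = mse()
print(f"loss before: {loss_before:.3f}, loss after: {loss_after:.3f}")
```

In the rest of the notebook, PyTorch computes the gradients and fastai's `Learner` runs the training loop, so none of this has to be written by hand.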

## Creating the DataLoaders

`u.item` contains the table linking movies to their IDs; we want to see the movie title instead, so we grab that information:

In [None]:
movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1',
                     usecols=(0,1), names=('movie', 'title'), header=None)
movies.head()

Unnamed: 0,movie,title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


Merge `movies` and `ratings` together to get the user ratings by title:

In [None]:
ratings = ratings.merge(movies)
ratings.head()

Unnamed: 0,user,movie,rating,timestamp,title
0,196,242,3,881250949,Kolya (1996)
1,186,302,3,891717742,L.A. Confidential (1997)
2,22,377,1,878887116,Heavyweights (1994)
3,244,51,2,880606923,Legends of the Fall (1994)
4,166,346,1,886397596,Jackie Brown (1997)


Create a `DataLoaders` object from the ratings data with a batch size of 64. By default, `CollabDataLoaders` takes the first three columns as the user, the item (here, the movie), and the rating. We want the movie title rather than the ID as the item, so we set `item_name` to `'title'`.

In [None]:
dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
dls.show_batch()

Unnamed: 0,user,title,rating
0,62,"African Queen, The (1951)",4
1,305,Grease (1978),3
2,243,Mr. Holland's Opus (1995),3
3,20,Scream (1996),1
4,290,"Sound of Music, The (1965)",5
5,138,"Brothers McMullen, The (1995)",5
6,157,Sabrina (1995),4
7,194,Sabrina (1995),2
8,99,Get Shorty (1995),5
9,10,French Twist (Gazon maudit) (1995),4


The dictionary `dls.classes` maps each categorical column (`title` and `user`) to the list of its distinct values:

In [None]:
dls.classes

{'title': ['#na#', 'Adventures of Priscilla, Queen of the Desert, The (1994)', 'African Queen, The (1951)', 'Age of Innocence, The (1993)', 'Aladdin (1992)', 'Angels and Insects (1995)', 'Backbeat (1993)', 'Batman (1989)', 'Batman Forever (1995)', 'Bean (1997)', 'Beautiful Thing (1996)', 'Ben-Hur (1959)', 'Birdcage, The (1996)', 'Boot, Das (1981)', 'Broken Arrow (1996)', 'Brothers McMullen, The (1995)', 'Casper (1995)', 'Chasing Amy (1997)', 'City of Lost Children, The (1995)', 'Con Air (1997)', 'Conan the Barbarian (1981)', 'Cop Land (1997)', 'Copycat (1995)', 'Crumb (1994)', 'Curdled (1996)', 'Dangerous Minds (1995)', 'Dead Poets Society (1989)', 'Die Hard (1988)', 'Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)', 'E.T. the Extra-Terrestrial (1982)', 'Endless Summer 2, The (1994)', 'Evil Dead II (1987)', 'Fantasia (1940)', 'Fargo (1996)', 'Fly Away Home (1996)', 'French Twist (Gazon maudit) (1995)', 'Get Shorty (1995)', 'Grease (1978)', 'Grosse Pointe Bla
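
Judging from the output above, the title list starts with a `'#na#'` placeholder followed by the titles in sorted order, and each title is then referred to by its position in that list. A rough sketch of that idea in plain Python follows; it is not fastai's actual implementation, and the sample titles and the names `vocab`, `title_to_idx`, and `encoded` are assumptions made for illustration.

```python
# Hypothetical raw column values (a title may appear more than once).
titles = ["Grease (1978)", "Fargo (1996)", "Grease (1978)", "Aladdin (1992)"]

# Reserve index 0 for the placeholder, then list each distinct title once, sorted.
vocab = ["#na#"] + sorted(set(titles))

# Map each title to its position in the vocabulary.
title_to_idx = {title: idx for idx, title in enumerate(vocab)}

# Replace every title with its integer index; these indices are what an
# embedding matrix is later indexed with.
encoded = [title_to_idx[title] for title in titles]
print(vocab)
print(encoded)
```

Because of the placeholder entry, the length of such a list is one more than the number of distinct values, which matters when the list length is used to size an embedding matrix.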

Get the number of users and the number of movies, and choose the number of latent factors. We then create two randomly initialized matrices: `user_factors`, with one row of `n_factors` values per user, and `movie_factors`, with one row of `n_factors` values per movie.

In [None]:
n_users  = len(dls.classes['user'])
n_movies = len(dls.classes['title'])
n_factors = 5

user_factors = torch.randn(n_users, n_factors)
movie_factors = torch.randn(n_movies, n_factors)

user_factors

tensor([[ 2.1402e+00, -8.4710e-01,  7.8952e-01,  3.5393e-01,  2.7922e-01],
        [ 4.8255e-01,  1.2134e+00,  6.5220e-01,  8.9973e-01,  4.6080e-02],
        [-4.9669e-01, -8.2709e-01, -1.5043e+00, -1.4431e+00, -4.7567e-01],
        [-9.1052e-01,  1.5512e+00, -2.0908e+00,  2.4692e-01,  7.7302e-01],
        [ 8.3420e-02, -7.9846e-02, -1.8727e-01, -3.6270e-01, -5.7868e-01],
        [ 1.9277e+00,  1.4729e+00, -5.2919e-01,  6.3958e-02,  4.1789e-01],
        [-8.3687e-01,  7.0918e-02,  2.6778e-02, -1.6565e-01,  6.2842e-01],
        [-5.3191e-01,  5.2061e-01, -1.3808e+00, -1.9204e-01,  2.4967e-02],
        [ 1.5520e-01, -7.5036e-01, -4.0835e-01, -1.3841e+00,  5.5969e-01],
        [-1.7786e+00, -2.1356e-01, -8.1852e-01,  4.6667e-01, -1.7879e+00],
        [ 8.0190e-01,  1.1741e-01,  4.7339e-01, -9.8263e-01,  9.5627e-01],
        [ 5.8329e-01,  1.7201e+00,  1.1283e+00, -1.2653e+00,  7.5348e-02],
        [-1.0841e+00,  3.2637e-01, -4.6441e-01,  9.7867e-01, -8.3603e-01],
        [ 1.2445e-01,  2.

Create a one-hot-encoded vector representing the index 3 (where all values in the vector are 0 except for index 3). The size of the vector is `n_users`.

In [None]:
one_hot_3 = one_hot(3, n_users).float()

Take the transpose of the `user_factors` matrix and multiply it by `one_hot_3`. The transpose has shape 5x`n_users` and `one_hot_3` has `n_users` elements, so the result is a vector of 5 values.

In [None]:
user_factors.t() @ one_hot_3

tensor([-0.9105,  1.5512, -2.0908,  0.2469,  0.7730])

Doing this matrix multiplication is the same as indexing into `user_factors` at index 3.

In [None]:
user_factors[3]

tensor([-0.9105,  1.5512, -2.0908,  0.2469,  0.7730])

Multiplying the embedding matrix by a one-hot-encoded vector returns the desired row of factors. This is the same as indexing directly into the embedding matrix.

## Collaborative Filtering from Scratch

In [None]:
class Example:
    def __init__(self, a): self.a = a
    def say(self,x): return f'Hello {self.a}, {x}.'

In [None]:
class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        
    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        return (users * movies).sum(dim=1)

In [None]:
model = DotProduct(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())

In [None]:
class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.y_range = y_range
        
    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        return sigmoid_range((users * movies).sum(dim=1), *self.y_range)
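
The second `DotProduct` version passes the dot product through `sigmoid_range` so that predictions stay inside `y_range`. The plain-Python sketch below shows the idea: squash the raw score with a sigmoid, then scale and shift it into the target interval. The function name `sigmoid_range_sketch` is hypothetical, and the sketch works on single floats rather than tensors.

```python
import math

def sigmoid(x):
    # Standard logistic function: maps any real number into (0, 1).
    return 1 / (1 + math.exp(-x))

def sigmoid_range_sketch(x, low, high):
    # Squash x into (0, 1), then stretch and shift it into (low, high).
    return sigmoid(x) * (high - low) + low

# Very negative, zero, and very positive raw scores.
for raw in (-10.0, 0.0, 10.0):
    print(raw, sigmoid_range_sketch(raw, 0, 5.5))
```

Since the sigmoid never quite reaches its upper bound, an upper limit of 5.5 rather than 5 lets the model actually predict a rating of 5.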

In [None]:
class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.user_bias = Embedding(n_users, 1)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.movie_bias = Embedding(n_movies, 1)
        self.y_range = y_range
        
    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        res = (users * movies).sum(dim=1, keepdim=True)
        res += self.user_bias(x[:,0]) + self.movie_bias(x[:,1])
        return sigmoid_range(res, *self.y_range)

### Weight Decay

### Creating Our Own Embedding Module

In [None]:
def create_params(size):
    return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))

In [None]:
class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = create_params([n_users, n_factors])
        self.user_bias = create_params([n_users])
        self.movie_factors = create_params([n_movies, n_factors])
        self.movie_bias = create_params([n_movies])
        self.y_range = y_range
        
    def forward(self, x):
        users = self.user_factors[x[:,0]]
        movies = self.movie_factors[x[:,1]]
        res = (users*movies).sum(dim=1)
        res += self.user_bias[x[:,0]] + self.movie_bias[x[:,1]]
        return sigmoid_range(res, *self.y_range)

## Interpreting Embeddings and Biases

### Using fastai.collab

In [None]:
learn = collab_learner(dls, n_factors=50, y_range=(0, 5.5))

### Embedding Distance

## Bootstrapping a Collaborative Filtering Model

## Deep Learning for Collaborative Filtering

In [None]:
class CollabNN(Module):
    def __init__(self, user_sz, item_sz, y_range=(0,5.5), n_act=100):
        self.user_factors = Embedding(*user_sz)
        self.item_factors = Embedding(*item_sz)
        self.layers = nn.Sequential(
            nn.Linear(user_sz[1]+item_sz[1], n_act),
            nn.ReLU(),
            nn.Linear(n_act, 1))
        self.y_range = y_range
        
    def forward(self, x):
        embs = self.user_factors(x[:,0]),self.item_factors(x[:,1])
        x = self.layers(torch.cat(embs, dim=1))
        return sigmoid_range(x, *self.y_range)

In [None]:
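# Get (number of categories, embedding size) pairs for user and item from the data
embs = get_emb_sz(dls)
model = CollabNN(*embs)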

In [None]:
@delegates(TabularModel)
class EmbeddingNN(TabularModel):
    def __init__(self, emb_szs, layers, **kwargs):
        super().__init__(emb_szs, layers=layers, n_cont=0, out_sz=1, **kwargs)

### Sidebar: kwargs and Delegates

### End sidebar

## Conclusion

## Questionnaire

1. What problem does collaborative filtering solve?
1. How does it solve it?
1. Why might a collaborative filtering predictive model fail to be a very useful recommendation system?
1. What does a crosstab representation of collaborative filtering data look like?
1. Write the code to create a crosstab representation of the MovieLens data (you might need to do some web searching!).
1. What is a latent factor? Why is it "latent"?
1. What is a dot product? Calculate a dot product manually using pure Python with lists.
1. What does `pandas.DataFrame.merge` do?
1. What is an embedding matrix?
1. What is the relationship between an embedding and a matrix of one-hot-encoded vectors?
1. Why do we need `Embedding` if we could use one-hot-encoded vectors for the same thing?
1. What does an embedding contain before we start training (assuming we're not using a pretrained model)?
1. Create a class (without peeking, if possible!) and use it.
1. What does `x[:,0]` return?
1. Rewrite the `DotProduct` class (without peeking, if possible!) and train a model with it.
1. What is a good loss function to use for MovieLens? Why? 
1. What would happen if we used cross-entropy loss with MovieLens? How would we need to change the model?
1. What is the use of bias in a dot product model?
1. What is another name for weight decay?
1. Write the equation for weight decay (without peeking!).
1. Write the equation for the gradient of weight decay. Why does it help reduce weights?
1. Why does reducing weights lead to better generalization?
1. What does `argsort` do in PyTorch?
1. Does sorting the movie biases give the same result as averaging overall movie ratings by movie? Why/why not?
1. How do you print the names and details of the layers in a model?
1. What is the "bootstrapping problem" in collaborative filtering?
1. How could you deal with the bootstrapping problem for new users? For new movies?
1. How can feedback loops impact collaborative filtering systems?
1. When using a neural network in collaborative filtering, why can we have different numbers of factors for movies and users?
1. Why is there an `nn.Sequential` in the `CollabNN` model?
1. What kind of model should we use if we want to add metadata about users and items, or information such as date and time, to a collaborative filtering model?

### Further Research

1. Take a look at all the differences between the `Embedding` version of `DotProductBias` and the `create_params` version, and try to understand why each of those changes is required. If you're not sure, try reverting each change to see what happens. (NB: even the type of brackets used in `forward` has changed!)
1. Find three other areas where collaborative filtering is being used, and find out what the pros and cons of this approach are in those areas.
1. Complete this notebook using the full MovieLens dataset, and compare your results to online benchmarks. See if you can improve your accuracy. Look on the book's website and the fast.ai forum for ideas. Note that there are more columns in the full dataset—see if you can use those too (the next chapter might give you ideas).
1. Create a model for MovieLens that works with cross-entropy loss, and compare it to the model in this chapter.