# 07_Lecture_CollaborativeFiltering

Collaborative filtering looks at which products the current user has used or liked, finds other users who have used or liked similar products, and then recommends other products that those users have used or liked.
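As a rough illustration of that idea, here is a minimal pure-Python sketch. The `likes` dictionary and the `recommend` helper are made up for this example; the rest of these notes learn latent factors instead of counting overlaps directly.

```python
# Toy data (hypothetical): which items each user liked.
likes = {
    "ann":  {"Star Wars", "Alien", "Blade Runner"},
    "bob":  {"Star Wars", "Alien", "The Matrix"},
    "cara": {"Casablanca", "Vertigo"},
}

def recommend(user, likes):
    """Score unseen items by the overlap between `user` and each other user."""
    mine = likes[user]
    scores = {}
    for other, theirs in likes.items():
        if other == user:
            continue
        overlap = len(mine & theirs)   # number of items both users liked
        if overlap == 0:
            continue                   # no shared taste, ignore this user
        for item in theirs - mine:     # items the current user has not liked yet
            scores[item] = scores.get(item, 0) + overlap
    # highest score first; ties broken alphabetically for a stable result
    return sorted(scores, key=lambda item: (-scores[item], item))

print(recommend("ann", likes))
```

Here "ann" shares two liked items with "bob" and none with "cara", so the only candidate is the one item "bob" liked that "ann" has not.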

This lecture builds progressively on the following lectures:
- Lecture 07 (last 40 min)
- Lecture 08

**Comments:**
- Computational Linear Algebra for Coders (for PCA and others) (https://github.com/fastai/numerical-linear-algebra)  
- To look at the source code: collab_learner??

In [3]:
from fastai.collab import *
from fastai.tabular.all import *
set_seed(42)

### Data
The data is the MovieLens 100k dataset, downloaded with fastai's untar_data function.

The main table is in the file u.data. It is tab-separated and the columns are, respectively, user, movie, rating, and timestamp. Since those names are not stored in the file, we need to supply them when reading the file with Pandas. The table u.item contains the mapping from movie IDs to titles. Open both tables, take a look, and merge them:

In [4]:
path = untar_data(URLs.ML_100k)
ratings = pd.read_csv(path/'u.data', delimiter="\t", header=None, names = ['user','movie','rating','timestamp'])
movies = pd.read_csv(path/'u.item',  delimiter='|', encoding='latin-1', usecols=(0,1), names=('movie','title'), header=None)
print(ratings.head())
print(movies.head())
ratings = ratings.merge(movies)
print(ratings.head())

   user  movie  rating  timestamp
0   196    242       3  881250949
1   186    302       3  891717742
2    22    377       1  878887116
3   244     51       2  880606923
4   166    346       1  886397596
   movie              title
0      1   Toy Story (1995)
1      2   GoldenEye (1995)
2      3  Four Rooms (1995)
3      4  Get Shorty (1995)
4      5     Copycat (1995)
   user  movie  rating  timestamp                       title
0   196    242       3  881250949                Kolya (1996)
1   186    302       3  891717742    L.A. Confidential (1997)
2    22    377       1  878887116         Heavyweights (1994)
3   244     51       2  880606923  Legends of the Fall (1994)
4   166    346       1  886397596         Jackie Brown (1997)


### If we knew a latent factor
If we knew, for each user, to what degree they liked each movie category, and had the same information for each movie, we could match the two. E.g., with three categories (science fiction, action, and old movies), we could score each film by how well it matches each category (from -1 to +1) and also build a category-preference "profile" for each user (from -1 to +1). Then we could compare them with a dot product:

In [5]:
last_skywalker = np.array([0.98, 0.9, -0.9])  # categories matching for movie1
casablanca = np.array([-0.99, -0.3, 0.8])  # categories matching for movie2
user1 = np.array([0.9, 0.8, -0.6])  # category-preferences profile for user1
print((user1 * last_skywalker).sum())  # dot product: 2.142 (high preference: user1 will probably like movie1)
print((user1 * casablanca).sum())  # dot product: -1.611 (low preference: user1 will probably not like movie2)

2.1420000000000003
-1.611


## Method 1: Build own model 1
### Learning the Latent Factors
Since in our case we don't know what the latent factors actually are, and we don't know how to score them for each user and movie, we should learn them.  
The main worked example is the Excel spreadsheet in Lecture 07 (starting at 01:09).  

**Creating the DataLoaders**  
By default, CollabDataLoaders takes the first column for the user, the second column for the item (here our movies), and the third column for the ratings. In our case we need to change the value of item_name to use the titles instead of the IDs:

In [6]:
dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
dls.show_batch()

Unnamed: 0,user,title,rating
0,782,Starship Troopers (1997),2
1,943,Judge Dredd (1995),3
2,758,Mission: Impossible (1996),4
3,94,Farewell My Concubine (1993),5
4,23,Psycho (1960),4
5,296,Secrets & Lies (1996),5
6,940,"American President, The (1995)",4
7,334,Star Trek VI: The Undiscovered Country (1991),1
8,380,Braveheart (1995),4
9,690,So I Married an Axe Murderer (1993),1


We can represent our movie and user latent factor tables as simple matrices:

In [7]:
n_users = len(dls.classes["user"])  # n of users is how many users there are
n_movies = len(dls.classes["title"])  # n of items is how many items there are
n_factors = 5  # this is just Jeremy's intuition that works well

user_factors = torch.randn(n_users, n_factors)
movie_factors = torch.randn(n_movies, n_factors)
print(user_factors)
print(movie_factors)

tensor([[-1.0827,  0.2138,  0.9310, -0.2739, -0.4359],
        [-0.5195,  0.7613, -0.4365,  0.1365,  1.3300],
        [-1.2804,  0.0705,  0.6489, -1.2110,  1.8266],
        ...,
        [ 0.8009, -0.4734, -0.8962, -0.7348, -0.0246],
        [ 0.3354, -0.8262, -0.1541,  0.4699,  0.4873],
        [ 2.4054, -0.2156, -1.4126, -0.2467,  1.0571]])
tensor([[-0.3978,  0.4563,  1.2301,  0.3745,  0.9689],
        [-1.1836, -0.5818, -0.5587, -0.4316,  0.2128],
        [ 0.0420,  1.3201, -0.7999,  1.1123, -0.7585],
        ...,
        [ 2.4743,  1.3068,  0.4540,  0.6958,  0.5228],
        [ 2.3970, -0.2559, -1.7196,  1.0440, -0.2662],
        [ 0.2786, -0.6593,  0.5260, -0.3416, -1.3938]])


### Creating a new PyTorch module
The input of the model is a tensor of shape batch_size x 2, where the first column (x[:, 0]) contains the user IDs and the second column (x[:, 1]) contains the movie IDs. As explained before, we use the embedding layers to represent our matrices of user and movie latent factors:

In [8]:
class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        
    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        return (users * movies).sum(dim=1)
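To make the role of the embedding layers more concrete, the sketch below checks that an embedding lookup returns the same rows as multiplying a one-hot matrix by the weight matrix. It uses plain PyTorch (`torch.nn.Embedding`) rather than fastai's `Embedding`, and the sizes are arbitrary toy values chosen for this example.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_users, n_factors = 4, 3                       # toy sizes, for illustration only
emb = torch.nn.Embedding(n_users, n_factors)

ids = torch.tensor([2, 0, 2])                   # a batch of user IDs
looked_up = emb(ids)                            # direct row lookup
one_hot = F.one_hot(ids, num_classes=n_users).float()
via_matmul = one_hot @ emb.weight               # same rows, via a matrix product

print(looked_up.shape)
print(torch.allclose(looked_up, via_matmul))
```

The lookup is just a faster way of doing the one-hot matrix product, which is why the embedding weights can be trained like any other linear layer.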

Now that we have defined our architecture, and created our parameter matrices, we need to create a Learner to optimize our model. Since we are doing things from scratch here, we will use the plain Learner class. After this, we are now ready to fit our model:

In [9]:
model = DotProduct(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)

epoch,train_loss,valid_loss,time
0,1.355056,1.308388,00:05
1,1.032619,1.099086,00:06
2,0.881544,0.985998,00:06
3,0.760961,0.90101,00:06
4,0.746256,0.880025,00:06


### Improving the model: adding sigmoid and biases
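The y_range is (0, 5.5) rather than (0, 5) because a sigmoid only approaches its upper bound and never reaches it. With an upper bound of 5 the model could never predict a rating of exactly 5, so the bound is set slightly higher, at 5.5.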

In [10]:
class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.user_bias = Embedding(n_users, 1)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.movie_bias = Embedding(n_movies, 1)
        self.y_range = y_range  # this is the range the sigmoid will squash the results by.
        
    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        res = (users * movies).sum(dim=1, keepdim=True)
        res += self.user_bias(x[:,0]) + self.movie_bias(x[:,1])  # add biases for user and for movie
        return sigmoid_range(res, *self.y_range)  # add sigmoid to final answer
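The effect of the upper bound can be checked with a small stand-alone version of the squashing function. This is a sketch in pure Python that assumes `sigmoid_range` scales a standard sigmoid to the interval (low, high); it is not fastai's own implementation.

```python
import math

def sigmoid_range(x, low, high):
    """Squash x into the open interval (low, high) with a sigmoid."""
    return 1 / (1 + math.exp(-x)) * (high - low) + low

for x in (0.0, 2.0, 4.0, 20.0):
    # compare an upper bound of 5 with an upper bound of 5.5
    print(x, sigmoid_range(x, 0, 5), sigmoid_range(x, 0, 5.5))

# With high=5.5 a prediction of exactly 5 is reachable at x = ln(10);
# with high=5 it is only approached for very large x.
print(sigmoid_range(math.log(10), 0, 5.5))
```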

In [11]:
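model = DotProductBias(n_users, n_movies, 50)  # the model with biases defined above (the losses recorded below look like a plain DotProduct run)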
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)

epoch,train_loss,valid_loss,time
0,1.313619,1.311592,00:06
1,1.059954,1.122256,00:06
2,0.92376,1.001304,00:06
3,0.820529,0.920403,00:06
4,0.79636,0.89308,00:06


### Weight decay
We think we might be overfitting. Let's add L2 regularization (weight decay):

In [70]:
model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)  # wd is the weight-decay coefficient (lambda); try different values, e.g. 0.1, 0.01

epoch,train_loss,valid_loss,time
0,0.918345,0.949725,00:06
1,0.675732,0.895881,00:05
2,0.508346,0.871886,00:06
3,0.465301,0.864229,00:06
4,0.442191,0.85974,00:06
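As a reminder of what L2 regularization does, here is a minimal sketch in pure Python: the loss gets an extra term wd * sum(p**2), which adds 2 * wd * p to the gradient of each parameter. The helper names and numbers are invented for this example, and how a particular optimizer applies `wd` internally may differ from this textbook form.

```python
def loss_with_wd(preds, targets, params, wd):
    """Mean squared error plus an L2 penalty wd * sum(p**2)."""
    mse = sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)
    return mse + wd * sum(p ** 2 for p in params)

def grad_penalty(params, wd):
    """Gradient of the penalty term: d/dp (wd * p**2) = 2 * wd * p."""
    return [2 * wd * p for p in params]

params = [3.0, -0.5]                     # toy parameter values
print(loss_with_wd([4.0, 3.0], [5.0, 3.0], params, wd=0.1))
print(grad_penalty(params, wd=0.1))      # larger weights are pulled harder towards zero
```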


## Method 2: Build own model 2
So far, we've used Embedding without thinking about how it really works. Let's re-create DotProductBias without using this class. We'll need a randomly initialized weight matrix for each of the embeddings. 

We have to be careful: optimizers require that they can get all the parameters of a module from the module's parameters method. However, this does not happen fully automatically. If we just add a tensor as an attribute to a Module, it will not be included in parameters. To tell Module that we want to treat a tensor as a parameter, we have to wrap it in the nn.Parameter class. This class doesn't actually add any functionality (other than automatically calling requires_grad_ for us). It's only used as a "marker" to show what to include in parameters. 
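The difference is easy to check. The sketch below uses plain `torch.nn.Module` (which, unlike fastai's `Module`, needs an explicit `super().__init__()` call); the class names `Plain` and `Marked` are made up for this example.

```python
import torch
from torch import nn

class Plain(nn.Module):
    def __init__(self):
        super().__init__()
        self.w = torch.ones(3)                 # plain tensor: not registered

class Marked(nn.Module):
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.ones(3))   # wrapped: registered as a parameter

print(len(list(Plain().parameters())))    # 0
print(len(list(Marked().parameters())))   # 1
print(Marked().w.requires_grad)           # True
```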

We can create a tensor as a parameter, with random initialization. Then we will use this to create DotProductBias again, but without Embedding:

In [21]:
def create_params(size):
    return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))

In [22]:
class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = create_params([n_users, n_factors])
        self.user_bias = create_params([n_users])
        self.movie_factors = create_params([n_movies, n_factors])
        self.movie_bias = create_params([n_movies])
        self.y_range = y_range
        
    def forward(self, x):
        users = self.user_factors[x[:,0]]
        movies = self.movie_factors[x[:,1]]
        res = (users*movies).sum(dim=1)
        res += self.user_bias[x[:,0]] + self.movie_bias[x[:,1]]
        return sigmoid_range(res, *self.y_range)

In [23]:
model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)

epoch,train_loss,valid_loss,time
0,0.844252,0.937136,00:07
1,0.688366,0.895349,00:06
2,0.519066,0.876162,00:06
3,0.453932,0.855043,00:06
4,0.437529,0.851891,00:06


## Method 3: Use fastai.collab
We can create and train a collaborative filtering model using the exact structure shown earlier by using fastai's collab_learner:

In [35]:
learn = collab_learner(dls, n_factors=50, y_range=(0, 5.5))
learn.fit_one_cycle(5, 5e-3, wd=0.1)

epoch,train_loss,valid_loss,time
0,0.902238,0.950396,00:06
1,0.687878,0.897314,00:06
2,0.533113,0.869954,00:06
3,0.459932,0.856269,00:06
4,0.433415,0.852636,00:06


In [38]:
# print names of the layers:
learn.model

EmbeddingDotBias(
  (u_weight): Embedding(944, 50)
  (i_weight): Embedding(1665, 50)
  (u_bias): Embedding(944, 1)
  (i_bias): Embedding(1665, 1)
)

## Interpreting
### Visualising the bias (interpreting the biases)
It is interesting to see what parameters the model has discovered. Biases are like intercepts in a linear mixed model (LMM): they capture the effect that remains independently of the independent variables (here, the latent factors).

In [40]:
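# for method 2 (only works while learn still holds the Method 2 model; here learn is the collab_learner model, hence the error below):

movie_bias = learn.model.movie_bias.squeeze()

# the 10 movies with the lowest bias
idxs = movie_bias.argsort()[:10]
print([dls.classes['title'][i] for i in idxs])

# the 10 movies with the highest bias
idxs = movie_bias.argsort(descending=True)[:10]
print([dls.classes['title'][i] for i in idxs])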

AttributeError: 'EmbeddingDotBias' object has no attribute 'movie_bias'

In [39]:
# for method 3:

movie_bias = learn.model.i_bias.weight.squeeze()

# the 10 movies with the lowest bias
idxs = movie_bias.argsort()[:10]
print([dls.classes['title'][i] for i in idxs])

# the 10 movies with the highest bias
idxs = movie_bias.argsort(descending=True)[:10]
print([dls.classes['title'][i] for i in idxs])

['Showgirls (1995)', 'Children of the Corn: The Gathering (1996)', 'Grease 2 (1982)', 'Spice World (1997)', 'Cable Guy, The (1996)', 'Bio-Dome (1996)', "Amityville 1992: It's About Time (1992)", 'Lawnmower Man 2: Beyond Cyberspace (1996)', 'Amityville II: The Possession (1982)', 'Free Willy 2: The Adventure Home (1995)']
['Shawshank Redemption, The (1994)', 'Titanic (1997)', "Schindler's List (1993)", 'Good Will Hunting (1997)', 'L.A. Confidential (1997)', 'Rear Window (1954)', 'Vertigo (1958)', 'Star Wars (1977)', 'To Kill a Mockingbird (1962)', 'Usual Suspects, The (1995)']


### Embedding Distance
On a two-dimensional map we can calculate the distance between two coordinates using the formula of Pythagoras (assuming that x and y are the distances between the coordinates on each axis). For a 50-dimensional embedding we can do exactly the same thing, except that we add up the squares of all 50 of the coordinate distances.
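A small sketch of that generalisation, in pure Python with made-up vectors: the same function handles 2 and 50 dimensions, because it simply sums the squared differences over however many coordinates there are.

```python
import math

def euclidean(a, b):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean([0, 0], [3, 4]))        # 2-D: the classic 3-4-5 triangle, 5.0
print(euclidean([0] * 50, [1] * 50))    # 50-D: square root of 50
```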

If there were two movies that were nearly identical, then their embedding vectors would also have to be nearly identical, because the users that would like them would be nearly exactly the same. There is a more general idea here: movie similarity can be defined by the similarity of users that like those movies. And that directly means that the distance between two movies' embedding vectors can define that similarity. We can use this to find the most similar movie to Silence of the Lambs:

In [44]:
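# nn.CosineSimilarity returns a similarity (higher = closer), so sort in descending order;
# position 0 is the movie itself, position 1 its nearest neighbour.
# (The recorded output below came from an ascending sort, i.e. one of the least similar titles.)
movie_factors = learn.model.i_weight.weight
idx = dls.classes['title'].o2i['Silence of the Lambs, The (1991)']
distances = nn.CosineSimilarity(dim=1)(movie_factors, movie_factors[idx][None])
idx = distances.argsort(descending=True)[1]
dls.classes['title'][idx]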

'Ready to Wear (Pret-A-Porter) (1994)'
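Cosine similarity is a similarity, not a distance: larger values mean the vectors point in more similar directions, so the nearest neighbour has the highest score and the least related item the lowest. The following pure-Python sketch shows this with hypothetical 2-factor embeddings; the titles and numbers are invented for illustration.

```python
import math

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 2-factor embeddings, for illustration only.
emb = {
    "thriller A": [0.9, 0.1],
    "thriller B": [0.8, 0.2],
    "musical":    [-0.7, 0.6],
}

def rank_by_similarity(title, emb):
    """Other titles ordered from most to least similar to `title`."""
    others = [t for t in emb if t != title]
    return sorted(others, key=lambda t: cosine(emb[title], emb[t]), reverse=True)

ranked = rank_by_similarity("thriller A", emb)
print(ranked[0])    # most similar
print(ranked[-1])   # least similar
```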

## Method 4: DL model 1
Our dot product model works quite well, and it is the basis of many successful real-world recommendation systems. This approach to collaborative filtering is known as probabilistic matrix factorization (PMF). Another approach, which generally works similarly well given the same data, is deep learning.

To turn our architecture into a deep learning model, the first step is to take the results of the embedding lookup and concatenate those activations together. This gives us a matrix which we can then pass through linear layers and nonlinearities in the usual way.
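The concatenate-then-linear step can be sketched without any library. In this toy version the embedding vectors, weights, and layer sizes are all invented; it only shows the data flow (concatenate, linear layer, ReLU, linear layer), not a trained model.

```python
def relu(v):
    """Element-wise max(0, x)."""
    return [max(0.0, x) for x in v]

def linear(v, weights, bias):
    """Dense layer; `weights` has one row per output unit."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]

user_vec = [0.5, -1.0]        # hypothetical 2-factor user embedding
item_vec = [1.0, 0.0, 2.0]    # hypothetical 3-factor item embedding
x = user_vec + item_vec       # concatenation: length 2 + 3 = 5

hidden = relu(linear(x, [[1, 1, 1, 1, 1], [-1, 0, 0, 0, 0]], [0.0, 0.0]))
out = linear(hidden, [[0.5, 2.0]], [0.1])   # single output unit: the raw prediction
print(len(x), hidden, out)
```

Because the two vectors are joined end to end rather than multiplied element by element, nothing requires them to have the same length.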

Since we'll be concatenating the embeddings, rather than taking their dot product, the two embedding matrices can have different sizes (i.e., different numbers of latent factors). fastai has a function get_emb_sz that returns recommended sizes for embedding matrices for your data, based on a heuristic that fast.ai has found tends to work well in practice:

In [49]:
# get sizes for embedding matrices
embs = get_emb_sz(dls)
embs

[(944, 74), (1665, 102)]
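The sizes above are consistent with a simple rule of thumb based on the number of categories. The sketch below states the formula as an assumption (it is written from memory of fastai's heuristic, so check the library source before relying on it) and confirms that it reproduces the two widths shown above.

```python
def emb_sz_rule(n_cat):
    """Assumed heuristic: embedding width grows slowly with the number of categories."""
    return min(600, round(1.6 * n_cat ** 0.56))

for n_cat in (944, 1665):
    print((n_cat, emb_sz_rule(n_cat)))
```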

In [50]:
# implement the class
class CollabNN(Module):
    def __init__(self, user_sz, item_sz, y_range=(0,5.5), n_act=100):
        self.user_factors = Embedding(*user_sz)
        self.item_factors = Embedding(*item_sz)
        self.layers = nn.Sequential(
            nn.Linear(user_sz[1]+item_sz[1], n_act),
            nn.ReLU(),
            nn.Linear(n_act, 1))
        self.y_range = y_range

    def forward(self, x):
        embs = self.user_factors(x[:,0]),self.item_factors(x[:,1])
        x = self.layers(torch.cat(embs, dim=1))
        return sigmoid_range(x, *self.y_range)

In [53]:
# use the class to create the model
model = CollabNN(*embs)

In [54]:
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.01)

epoch,train_loss,valid_loss,time
0,0.929937,0.946122,00:07
1,0.851822,0.912206,00:06
2,0.838512,0.881456,00:06
3,0.779626,0.870814,00:07
4,0.731197,0.862495,00:06


## Method 5: DL model 2
fastai provides this model in fastai.collab if you pass use_nn=True in your call to collab_learner (including calling get_emb_sz for you), and it lets you easily create more layers. For instance, here we're creating two hidden layers, of size 100 and 50, respectively:

In [57]:
learn = collab_learner(dls, use_nn=True, y_range=(0, 5.5), layers=[100,50])
learn.fit_one_cycle(5, 5e-3, wd=0.1)

epoch,train_loss,valid_loss,time
0,0.981596,0.978487,00:09
1,0.904838,0.918161,00:08
2,0.829316,0.875605,00:07
3,0.774758,0.859824,00:07
4,0.752902,0.861667,00:07
