# Collaborative Filtering with Neural Networks

In this notebook we write a matrix factorization model in PyTorch to solve a recommendation problem. Then we write a more general neural model for the same problem.

Collaborative filtering systems recommend items based on similarity measures between users and/or items: the items recommended to a user are those preferred by similar users.
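As a toy illustration of "similar users", the sketch below compares made-up rating vectors with cosine similarity. The ratings matrix and the `cosine_similarity` helper are assumptions for illustration only; treating unrated movies as 0 is a simplification, and the models later in this notebook learn embeddings instead of comparing raw rating vectors.

```python
import numpy as np

# Toy ratings matrix (rows: users, columns: movies); 0 means "not rated".
# The numbers are made up for illustration.
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],   # user 0
    [4.0, 5.0, 0.0, 2.0],   # user 1, tastes close to user 0
    [1.0, 0.0, 5.0, 4.0],   # user 2, roughly opposite tastes
])

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_01 = cosine_similarity(ratings[0], ratings[1])
sim_02 = cosine_similarity(ratings[0], ratings[2])
print(f"user 0 vs user 1: {sim_01:.3f}")
print(f"user 0 vs user 2: {sim_02:.3f}")
```

User 0 comes out much closer to user 1 than to user 2, so a similarity-based recommender would lean on user 1's preferences when recommending to user 0.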

The MovieLens dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100004 ratings and 1296 tag applications across 9125 movies. See https://grouplens.org/datasets/movielens/ for details. To get the data:

`wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip`

## MovieLens dataset

In [1]:
from pathlib import Path
import pandas as pd
import numpy as np

In [2]:
PATH = Path("/Users/yinterian/teaching/deeplearning/data/ml-latest-small/")
list(PATH.iterdir())

[PosixPath('/Users/yinterian/teaching/deeplearning/data/ml-latest-small/links.csv'),
 PosixPath('/Users/yinterian/teaching/deeplearning/data/ml-latest-small/movies.csv'),
 PosixPath('/Users/yinterian/teaching/deeplearning/data/ml-latest-small/ratings.csv'),
 PosixPath('/Users/yinterian/teaching/deeplearning/data/ml-latest-small/README.txt'),
 PosixPath('/Users/yinterian/teaching/deeplearning/data/ml-latest-small/tags.csv'),
 PosixPath('/Users/yinterian/teaching/deeplearning/data/ml-latest-small/tiny_training2.csv'),
 PosixPath('/Users/yinterian/teaching/deeplearning/data/ml-latest-small/tiny_val2.csv')]

In [3]:
! head /Users/yinterian/teaching/deeplearning/data/ml-latest-small/ratings.csv

userId,movieId,rating,timestamp
1,31,2.5,1260759144
1,1029,3.0,1260759179
1,1061,3.0,1260759182
1,1129,2.0,1260759185
1,1172,4.0,1260759205
1,1263,2.0,1260759151
1,1287,2.0,1260759187
1,1293,2.0,1260759148
1,1339,3.5,1260759125


In [4]:
# reading a csv into pandas
data = pd.read_csv(PATH/"ratings.csv")

In [5]:
data.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


### Encoding data
We encode the data so that users and movies have contiguous ids. You can think of this as a categorical encoding of our two categorical variables, userId and movieId.
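For comparison, the same idea can be sketched with pandas built-ins. This is an alternative to the helper functions defined in the cells that follow, not what the notebook uses; the id values are made up.

```python
import pandas as pd

train_ids = pd.Series([31, 1029, 31, 1061])   # raw ids seen in training
val_ids = pd.Series([1029, 9999, 1061])       # 9999 never appears in training

# factorize assigns 0, 1, 2, ... in order of first appearance
train_codes, uniques = pd.factorize(train_ids)

# reusing the training categories maps unseen ids to -1
val_codes = pd.Categorical(val_ids, categories=uniques).codes

print(train_codes.tolist())  # [0, 1, 0, 2]
print(val_codes.tolist())    # [1, -1, 2]
```

Rows coded -1 have no embedding to look up, which is why validation rows with unseen users or movies have to be dropped.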

In [6]:
# split train and validation before encoding
np.random.seed(3)
msk = np.random.rand(len(data)) < 0.8
train = data[msk].copy()
val = data[~msk].copy()

In [7]:
# here is a handy function modified from fast.ai
def proc_col(col, train_col=None):
    """Encodes a pandas column with contiguous ids.

    If train_col is given, ids come from train_col and values of col
    not seen there are mapped to -1.
    """
    if train_col is not None:
        uniq = train_col.unique()
    else:
        uniq = col.unique()
    name2idx = {o:i for i,o in enumerate(uniq)}
    return name2idx, np.array([name2idx.get(x, -1) for x in col]), len(uniq)

In [8]:
def encode_data(df, train=None):
    """Encodes rating data with contiguous user and movie ids.
    If train is provided, encodes df with the same encoding as train
    and drops rows whose user or movie does not appear in train.
    """
    df = df.copy()
    for col_name in ["userId", "movieId"]:
        train_col = None
        if train is not None:
            train_col = train[col_name]
        _,col,_ = proc_col(df[col_name], train_col)
        df[col_name] = col
        df = df[df[col_name] >= 0]
    return df

In [11]:
# to check my implementation
df_t = pd.read_csv("test_data/tiny_training2.csv")
df_v = pd.read_csv("test_data/tiny_val2.csv")
df_t_e = encode_data(df_t)
df_v_e = encode_data(df_v, df_t)
df_v_e
df_t_e

Unnamed: 0,userId,movieId,rating
0,0,0,4
1,0,1,5
2,1,1,5
3,1,2,3
4,2,0,4
5,2,1,4
6,3,0,5
7,3,3,2
8,4,0,1
9,4,3,4


In [12]:
df_train = encode_data(train)
df_val = encode_data(val, train)

## Embedding layer

In [13]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [14]:
# an Embedding module holding 10 embeddings (users or items), each of size 3
# the embeddings are initialized at random
embed = nn.Embedding(10, 3)
embed.weight

Parameter containing:
tensor([[ 0.6350, -0.0253,  0.4673],
        [-0.4085, -0.3153,  1.3539],
        [ 0.2604,  0.0159,  0.1949],
        [-1.0947,  0.5370,  0.0827],
        [-1.1583,  0.0707,  0.2396],
        [ 0.8532, -0.5157,  1.4439],
        [-0.1900,  0.6031,  0.5594],
        [ 0.6184,  1.4199, -0.0626],
        [-2.0728, -1.7474,  0.3763],
        [-0.1594, -0.3822, -0.8410]])

In [15]:
# given a list of ids we can "look up" the embedding corresponding to each id
# can you see that some vectors are the same? (id 1 appears three times)
a = torch.LongTensor([[1,0,1,4,5,1]])
embed(a)

tensor([[[-0.4085, -0.3153,  1.3539],
         [ 0.6350, -0.0253,  0.4673],
         [-0.4085, -0.3153,  1.3539],
         [-1.1583,  0.0707,  0.2396],
         [ 0.8532, -0.5157,  1.4439],
         [-0.4085, -0.3153,  1.3539]]])
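The lookup above is just row indexing into the weight matrix, which in turn equals multiplying a one-hot matrix by the weights. A minimal sketch that checks both equivalences (the seed and variable names are arbitrary):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
embed = nn.Embedding(10, 3)
ids = torch.LongTensor([1, 0, 1, 4, 5, 1])

looked_up = embed(ids)                             # shape (6, 3)
indexed = embed.weight[ids]                        # plain row indexing
one_hot = F.one_hot(ids, num_classes=10).float()   # shape (6, 10)
multiplied = one_hot @ embed.weight                # (6, 10) @ (10, 3)

print(torch.equal(looked_up, indexed))
print(torch.allclose(looked_up, multiplied))
```

The embedding layer avoids building the one-hot matrix, which matters once there are thousands of users or items.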

## Matrix factorization model

In [16]:
class MF(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100):
        super(MF, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.item_emb = nn.Embedding(num_items, emb_size)
        # initializing weights
        self.user_emb.weight.data.uniform_(0,0.05)
        self.item_emb.weight.data.uniform_(0,0.05)
        
    def forward(self, u, v):
        u = self.user_emb(u)
        v = self.item_emb(v)
        return (u*v).sum(1)   

## Debugging MF model

In [17]:
df_t_e

Unnamed: 0,userId,movieId,rating
0,0,0,4
1,0,1,5
2,1,1,5
3,1,2,3
4,2,0,4
5,2,1,4
6,3,0,5
7,3,3,2
8,4,0,1
9,4,3,4


In [18]:
num_users = 7
num_items = 4
emb_size = 3

user_emb = nn.Embedding(num_users, emb_size)
item_emb = nn.Embedding(num_items, emb_size)
users = torch.LongTensor(df_t_e.userId.values)
items = torch.LongTensor(df_t_e.movieId.values)

In [19]:
U = user_emb(users)
V = item_emb(items)

In [20]:
U

tensor([[ 0.2887, -0.1039, -0.6517],
        [ 0.2887, -0.1039, -0.6517],
        [-0.7562,  0.7185, -2.2700],
        [-0.7562,  0.7185, -2.2700],
        [ 1.6527, -0.2885,  0.0281],
        [ 1.6527, -0.2885,  0.0281],
        [-1.0987, -1.5382,  0.3912],
        [-1.0987, -1.5382,  0.3912],
        [-2.2866, -0.6564, -0.5094],
        [-2.2866, -0.6564, -0.5094],
        [ 0.1742, -1.2741,  0.6683],
        [-0.1845, -1.2902, -0.1542],
        [-0.1845, -1.2902, -0.1542]])

In [21]:
# element wise multiplication
U*V 

tensor([[-0.4151,  0.0906,  0.1340],
        [-0.2531,  0.0403, -0.3552],
        [ 0.6629, -0.2785, -1.2372],
        [ 0.1777, -0.8371, -1.0000],
        [-2.3761,  0.2516, -0.0058],
        [-1.4488,  0.1118,  0.0153],
        [ 1.5797,  1.3414, -0.0805],
        [ 0.7645, -0.0590, -0.1178],
        [ 3.2875,  0.5724,  0.1048],
        [ 1.5910, -0.0252,  0.1533],
        [-0.1212, -0.0489, -0.2012],
        [ 0.1617,  0.5000, -0.0840],
        [ 0.1283, -0.0495,  0.0464]])

In [22]:
# what we want is a dot product per row
(U*V).sum(1) 

tensor([-0.1905, -0.5680, -0.8528, -1.6593, -2.1303, -1.3217,  2.8406,
         0.5877,  3.9646,  1.7191, -0.3713,  0.5777,  0.1252])
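The row-wise dot product can be written in a few equivalent ways. The sketch below uses random tensors (not the notebook's data) to check that `(U*V).sum(1)` matches an `einsum` formulation and the diagonal of the full product `U @ V.T`:

```python
import torch

torch.manual_seed(0)
U = torch.randn(5, 3)   # one user embedding per row
V = torch.randn(5, 3)   # one item embedding per row

row_dots = (U * V).sum(1)                     # what the MF model computes
via_einsum = torch.einsum("ij,ij->i", U, V)   # same thing, spelled as an einsum
via_matmul = torch.diagonal(U @ V.T)          # computes all 25 pairs, keeps 5

print(torch.allclose(row_dots, via_einsum))
print(torch.allclose(row_dots, via_matmul))
```

The full matrix product does quadratically more work than needed, so the element-wise form is the one to prefer for training.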

## Training MF model

In [23]:
num_users = len(df_train.userId.unique())
num_items = len(df_train.movieId.unique())
print(num_users, num_items) 

671 8442


In [24]:
model = MF(num_users, num_items, emb_size=100)  # if you have a GPU .cuda()

In [25]:
# here we are not using data loaders because our data fits well in memory
def train_epocs(model, epochs=10, lr=0.01, wd=0.0, unsqueeze=False):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=wd)
    # the tensors do not change between epochs, so build them once
    users = torch.LongTensor(df_train.userId.values)  #.cuda()
    items = torch.LongTensor(df_train.movieId.values) #.cuda()
    ratings = torch.FloatTensor(df_train.rating.values)  #.cuda()
    if unsqueeze:
        ratings = ratings.unsqueeze(1)
    model.train()
    for i in range(epochs):
        # each epoch is a single full-batch gradient step
        y_hat = model(users, items)
        loss = F.mse_loss(y_hat, ratings)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print(loss.item()) # used to be loss.data[0]
    test_loss(model, unsqueeze)

In [26]:
# Here is what unsqueeze does
ratings = torch.FloatTensor(df_train.rating.values)
print(ratings.shape)
ratings = ratings.unsqueeze(1) #.cuda()
ratings.shape

torch.Size([79799])


torch.Size([79799, 1])
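The reason `unsqueeze` matters: when a model returns predictions of shape `(N, 1)` and the targets have shape `(N,)`, the subtraction inside the loss broadcasts to `(N, N)` and the loss is silently wrong. A small sketch with made-up values:

```python
import torch

y_hat = torch.tensor([[1.0], [2.0], [3.0], [4.0]])   # shape (4, 1), perfect predictions
ratings = torch.tensor([1.0, 2.0, 3.0, 4.0])         # shape (4,)

diff_wrong = y_hat - ratings                 # broadcasts to (4, 4)
diff_right = y_hat - ratings.unsqueeze(1)    # stays (4, 1)

mse_wrong = (diff_wrong ** 2).mean().item()
mse_right = (diff_right ** 2).mean().item()
print(diff_wrong.shape, mse_wrong)   # torch.Size([4, 4]) 2.5
print(diff_right.shape, mse_right)   # torch.Size([4, 1]) 0.0
```

Even with perfect predictions the broadcast version reports a non-zero error, because it compares every prediction with every target.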

In [27]:
def test_loss(model, unsqueeze=False):
    # evaluates on the validation set (df_val)
    model.eval()
    users = torch.LongTensor(df_val.userId.values) # .cuda()
    items = torch.LongTensor(df_val.movieId.values) #.cuda()
    ratings = torch.FloatTensor(df_val.rating.values) #.cuda()
    if unsqueeze:
        ratings = ratings.unsqueeze(1)
    with torch.no_grad():  # no gradients are needed for evaluation
        y_hat = model(users, items)
        loss = F.mse_loss(y_hat, ratings)
    print("test loss %.3f " % loss.item())

In [28]:
train_epocs(model, epochs=10, lr=0.1)

13.231466293334961
5.121332168579102
2.379080057144165
3.449378728866577
0.9091715216636658
1.808618187904358
2.7496719360351562
2.2799339294433594
1.158699631690979
0.9229275584220886
test loss 1.947 


In [29]:
train_epocs(model, epochs=15, lr=0.01)

1.7031512260437012
1.0512332916259766
0.7494320869445801
0.6944719552993774
0.75920170545578
0.8395007252693176
0.8817469477653503
0.8754511475563049
0.8335105776786804
0.7767617106437683
0.7246007919311523
0.6899178624153137
0.6766810417175293
0.680435836315155
0.6915622353553772
test loss 0.894 


In [30]:
train_epocs(model, epochs=15, lr=0.01)

0.700132429599762
0.6617763638496399
0.6678289175033569
0.6448845267295837
0.6372687816619873
0.644047200679779
0.6397343873977661
0.624600887298584
0.6133706569671631
0.6119531393051147
0.6125348806381226
0.6067298054695129
0.5952818393707275
0.5843498110771179
0.5774412751197815
test loss 0.823 
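The training loop above takes one full-batch gradient step per epoch. If the data did not fit in memory, a mini-batch variant could be sketched as below. Everything here is an assumption for illustration: the data is synthetic, `TinyMF` is a stand-in with the same structure as `MF`, and the batch size and learning rate are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
# synthetic stand-ins for df_train.userId, df_train.movieId, df_train.rating
n_rows, n_users, n_items = 1000, 50, 80
users = torch.randint(0, n_users, (n_rows,))
items = torch.randint(0, n_items, (n_rows,))
ratings = torch.randint(1, 6, (n_rows,)).float()

class TinyMF(nn.Module):
    """Same structure as MF, redefined so the sketch is self-contained."""
    def __init__(self, num_users, num_items, emb_size=8):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.item_emb = nn.Embedding(num_items, emb_size)

    def forward(self, u, v):
        return (self.user_emb(u) * self.item_emb(v)).sum(1)

model = TinyMF(n_users, n_items)
loader = DataLoader(TensorDataset(users, items, ratings),
                    batch_size=128, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

epoch_losses = []
for epoch in range(3):
    model.train()
    total = 0.0
    for u, v, r in loader:               # one gradient step per mini-batch
        loss = F.mse_loss(model(u, v), r)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total += loss.item() * len(r)    # weight by batch size for the average
    epoch_losses.append(total / n_rows)
    print(f"epoch {epoch}: train mse {epoch_losses[-1]:.3f}")
```

With mini-batches an epoch consists of many optimizer steps instead of one, so learning rates tuned for the full-batch loop would likely need to be revisited.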


## MF with bias
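The model in this section adds a scalar bias per user and per item to the dot product of the embeddings. In symbols (the notation is ours, not from the original lesson), with $u_i$ the embedding of user $i$, $v_j$ the embedding of movie $j$, and $b_i$, $c_j$ the user and item biases:

```latex
\hat{r}_{ij} = u_i \cdot v_j + b_i + c_j,
\qquad
\mathcal{L} = \frac{1}{|D|} \sum_{(i,j) \in D} \left( \hat{r}_{ij} - r_{ij} \right)^2
```

Here $D$ is the set of observed (user, movie) pairs and $r_{ij}$ the observed rating. The biases let the model absorb effects such as a user who rates everything high or a movie that is broadly liked, leaving the dot product to capture user-movie interactions.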

In [31]:
class MF_bias(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100):
        super(MF_bias, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.user_bias = nn.Embedding(num_users, 1)
        self.item_emb = nn.Embedding(num_items, emb_size)
        self.item_bias = nn.Embedding(num_items, 1)
        # init 
        self.user_emb.weight.data.uniform_(0,0.05)
        self.item_emb.weight.data.uniform_(0,0.05)
        self.user_bias.weight.data.uniform_(-0.01,0.01)
        self.item_bias.weight.data.uniform_(-0.01,0.01)
        
    def forward(self, u, v):
        U = self.user_emb(u)
        V = self.item_emb(v)
        # squeeze(1) drops only the embedding dimension, so a batch of size 1 keeps its shape
        b_u = self.user_bias(u).squeeze(1)
        b_v = self.item_bias(v).squeeze(1)
        return (U*V).sum(1) + b_u + b_v

In [32]:
model = MF_bias(num_users, num_items, emb_size=100) #.cuda()

In [33]:
train_epocs(model, epochs=10, lr=0.1, wd=1e-5)

13.233013153076172
4.373607635498047
3.4791698455810547
2.467517852783203
0.7869764566421509
1.812579870223999
2.5184104442596436
2.1367201805114746
1.2716211080551147
0.9042772650718689
test loss 1.537 


In [34]:
train_epocs(model, epochs=10, lr=0.01, wd=1e-5)

1.2838945388793945
0.8581913709640503
0.6942571401596069
0.6954460740089417
0.7550364136695862
0.8006176352500916
0.8073124289512634
0.7806740403175354
0.7376977801322937
0.6958044767379761
test loss 0.825 


In [35]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-5)

0.6671925187110901
0.6591733694076538
0.6524665951728821
0.6470252871513367
0.642689049243927
0.6392977237701416
0.6366432309150696
0.6345278024673462
0.6328005790710449
0.6313199996948242
test loss 0.810 


In [47]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-5)

0.6308879256248474
0.6289032697677612
0.6273546814918518
0.6259253025054932
0.6244949102401733
0.623039186000824
0.6215739846229553
0.6200994253158569
0.6186307668685913
0.6171799302101135
test loss 0.810 


Note that these models are sensitive to weight initialization, the choice of optimization algorithm, and regularization.
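Because of this sensitivity, it can help to fix the random seed when comparing hyperparameters, so that differences between runs are not due to initialization alone. A minimal sketch (the `make_embedding` helper is hypothetical):

```python
import torch
import torch.nn as nn

def make_embedding(seed):
    torch.manual_seed(seed)          # fixes PyTorch's random number generator
    return nn.Embedding(5, 3)

a = make_embedding(0)
b = make_embedding(0)
c = make_embedding(1)

print(torch.equal(a.weight, b.weight))   # same seed: identical initial weights
print(torch.equal(a.weight, c.weight))   # different seed: different weights
```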

## Neural Network Model
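The network in this section concatenates the user and item embeddings instead of taking their dot product, so the two embeddings do not need the same size. A shape-only sketch with hypothetical sizes (50 for users, 20 for items):

```python
import torch
import torch.nn as nn

user_emb = nn.Embedding(7, 50)    # hypothetical: 7 users, size 50
item_emb = nn.Embedding(4, 20)    # hypothetical: 4 items, size 20
lin1 = nn.Linear(50 + 20, 10)     # input width is the sum of both sizes

u = torch.LongTensor([0, 1, 2])
v = torch.LongTensor([3, 0, 1])
x = torch.cat([user_emb(u), item_emb(v)], dim=1)
h = lin1(x)

print(x.shape)   # torch.Size([3, 70])
print(h.shape)   # torch.Size([3, 10])
```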

In [36]:
# Note that here the user and item embeddings are concatenated rather than multiplied,
# so they could potentially have different sizes.
# We could probably get better results by tuning the regularization further.
    
class CollabFNet(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100, n_hidden=10):
        super(CollabFNet, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.item_emb = nn.Embedding(num_items, emb_size)
        self.lin1 = nn.Linear(emb_size*2, n_hidden)
        self.lin2 = nn.Linear(n_hidden, 1)
        self.drop1 = nn.Dropout(0.1)
        self.drop2 = nn.Dropout(0.0)
        
    def forward(self, u, v):
        U = self.user_emb(u)
        V = self.item_emb(v)
        x = torch.cat([U, V], dim=1)
        x = self.drop1(x)
        x = F.relu(self.lin1(x))
        x = self.drop2(x)
        x = self.lin2(x)
        return x

In [48]:
model = CollabFNet(num_users, num_items, emb_size=100) #.cuda()

In [49]:
train_epocs(model, epochs=20, lr=0.1, wd=1e-6, unsqueeze=True) 

16.079273223876953
12.428119659423828
4.739638805389404
10.597055435180664
2.3113698959350586
1.7216858863830566
2.528798818588257
2.6324732303619385
2.051955461502075
1.3755041360855103
1.3581899404525757
1.79594087600708
1.528417706489563
1.001665711402893
0.9564343690872192
1.1813850402832031
1.2449344396591187
1.062353491783142
0.8449162244796753
0.8801705837249756
test loss 1.141 


In [50]:
train_epocs(model, epochs=20, lr=0.01, wd=1e-6, unsqueeze=True)

1.0608270168304443
0.820761501789093
0.803485631942749
0.8454993367195129
0.8469330668449402
0.8072155714035034
0.7594216465950012
0.7247920036315918
0.71588534116745
0.7272763252258301
0.738793134689331
0.7351263165473938
0.7192756533622742
0.702589213848114
0.6896664500236511
0.6917107701301575
0.6998754143714905
0.702173113822937
0.7014030814170837
0.695222795009613
test loss 0.814 


In [51]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-6, unsqueeze=True)

0.6847543120384216
0.6789888143539429
0.6788713335990906
0.6815334558486938
0.6770483255386353
0.6780238151550293
0.6797886490821838
0.6779595017433167
0.6767784357070923
0.6768439412117004
test loss 0.808 


In [52]:
train_epocs(model, epochs=20, lr=0.001, wd=1e-6, unsqueeze=True)

0.6753724217414856
0.6748806238174438
0.67445969581604
0.6726959943771362
0.6738818287849426
0.6735833287239075
0.6708208918571472
0.6716728806495667
0.6723615527153015
0.6711774468421936
0.6723656058311462
0.6721900701522827
0.6705502271652222
0.6711705327033997
0.6707764267921448
0.6702867746353149
0.6691016554832458
0.6694663763046265
0.6703580617904663
0.6676943898200989
test loss 0.810 


## TODO
* use t-sne to visualize embeddings

# Lab
* Can you use `tags.csv` and `timestamp` to improve your predictions?
* Play with the hyperparameters
* Look at the fastai version of this network and try its output transformation https://github.com/fastai/fastai/blob/master/courses/dl1/lesson5-movielens.ipynb
* You may need a dataloader if your data is larger. Can you construct a dataset? Here is an example:
https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel.html
* Work with the largest dataset http://files.grouplens.org/datasets/movielens/ml-latest.zip
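As a starting point for the output transformation exercise, one common trick is to squash the network output with a sigmoid and rescale it to the rating range, so predictions can never fall outside it. The sketch below is an assumed variant, not a copy of the linked notebook (check it for the exact form); `scale_to_rating_range` and its default bounds of 0.5 and 5.0 are our own choices based on the MovieLens rating scale.

```python
import torch

def scale_to_rating_range(x, min_rating=0.5, max_rating=5.0):
    """Maps unbounded scores to the interval [min_rating, max_rating]."""
    return torch.sigmoid(x) * (max_rating - min_rating) + min_rating

raw = torch.tensor([-100.0, -1.0, 0.0, 1.0, 100.0])   # unbounded network outputs
scaled = scale_to_rating_range(raw)
print(scaled)
```

In `CollabFNet` this would be applied to the output of `lin2` before it is returned from `forward`.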

# References
* This notebook is based on [lesson 5 of Jeremy Howard's Deep Learning Course](https://github.com/fastai/fastai/blob/master/courses/dl1/lesson5-movielens.ipynb)