# Collaborative Filtering with Neural Networks

In this notebook we will write a matrix factorization model in PyTorch to solve a recommendation problem. Then we will write a more general neural network model for the same problem.

The MovieLens dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100004 ratings and 1296 tag applications across 9125 movies. https://grouplens.org/datasets/movielens/. To get the data:

`wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip`

## MovieLens dataset

In [1]:
from pathlib import Path
import pandas as pd
import numpy as np

In [2]:
PATH = Path("/data2/yinterian/ml-latest-small/")
list(PATH.iterdir())

[PosixPath('/data2/yinterian/ml-latest-small/ratings.csv'),
 PosixPath('/data2/yinterian/ml-latest-small/tags.csv'),
 PosixPath('/data2/yinterian/ml-latest-small/tiny_training2.csv'),
 PosixPath('/data2/yinterian/ml-latest-small/links.csv'),
 PosixPath('/data2/yinterian/ml-latest-small/tiny_val2.csv'),
 PosixPath('/data2/yinterian/ml-latest-small/README.txt'),
 PosixPath('/data2/yinterian/ml-latest-small/movies.csv')]

In [3]:
! head /data2/yinterian/ml-latest-small/ratings.csv

userId,movieId,rating,timestamp
1,31,2.5,1260759144
1,1029,3.0,1260759179
1,1061,3.0,1260759182
1,1129,2.0,1260759185
1,1172,4.0,1260759205
1,1263,2.0,1260759151
1,1287,2.0,1260759187
1,1293,2.0,1260759148
1,1339,3.5,1260759125


In [4]:
data = pd.read_csv(PATH/"ratings.csv")

In [5]:
data.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


### Encoding data
This is similar to what you did for your hw1 in ML-2. We encode the data so that users and movies have contiguous ids. You can think of this as a categorical encoding of our two categorical variables, userId and movieId.

In [6]:
# split train and validation before encoding
np.random.seed(3)
msk = np.random.rand(len(data)) < 0.8
train = data[msk].copy()
val = data[~msk].copy()

In [7]:
# here is a handy function modified from fast.ai
def proc_col(col, train_col=None):
    """Encodes a pandas column with contiguous ids.
    If train_col is provided, ids are taken from train_col and unseen values map to -1.
    """
    if train_col is not None:
        uniq = train_col.unique()
    else:
        uniq = col.unique()
    name2idx = {o:i for i,o in enumerate(uniq)}
    return name2idx, np.array([name2idx.get(x, -1) for x in col]), len(uniq)

In [8]:
def encode_data(df, train=None):
    """Encodes rating data with contiguous user and movie ids.
    If train is provided, encodes df with the same encoding as train
    and drops rows whose user or movie does not appear in train.
    """
    df = df.copy()
    for col_name in ["userId", "movieId"]:
        train_col = None
        if train is not None:
            train_col = train[col_name]
        _,col,_ = proc_col(df[col_name], train_col)
        df[col_name] = col
        df = df[df[col_name] >= 0]
    return df

In [9]:
# to check my new implementation
df_t = pd.read_csv(PATH/"tiny_training2.csv")
df_v = pd.read_csv(PATH/"tiny_val2.csv")
df_t_e = encode_data(df_t)
df_v_e = encode_data(df_v, df_t)
#df_v_e
#df_t_e

In [10]:
df_train = encode_data(train)
df_val = encode_data(val, train)
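
To make the encoding concrete, here is a minimal sketch on toy ids (the id values below are made up for illustration). It repeats the dictionary logic of `proc_col`: ids seen in training get contiguous indices in order of first appearance, and ids that only occur in validation map to -1, which `encode_data` then filters out.

```python
import numpy as np
import pandas as pd

# Toy columns; the id values are hypothetical.
train_ids = pd.Series([31, 1029, 31, 1061])
val_ids = pd.Series([1061, 9999, 31])

# Same logic as proc_col: index ids by order of first appearance in training.
name2idx = {o: i for i, o in enumerate(train_ids.unique())}
train_enc = np.array([name2idx.get(x, -1) for x in train_ids])
val_enc = np.array([name2idx.get(x, -1) for x in val_ids])

print(train_enc.tolist())  # [0, 1, 0, 2]
print(val_enc.tolist())    # [2, -1, 0]  (9999 was never seen in training)
```

Rows encoded as -1 cannot be looked up in an embedding table, which is why they are dropped from the validation set.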

## Embedding layer

In [11]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

In [12]:
# an Embedding module holding 10 user (or item) embeddings, each of size 3
# the embeddings are initialized at random
embed = nn.Embedding(10, 3)

In [13]:
# given a list of ids we can "look up" the embedding corresponding to each id
a = Variable(torch.LongTensor([[1,2,0,4,5,1]]))
embed(a)

Variable containing:
(0 ,.,.) = 
 -1.3876 -1.2333 -1.6822
 -1.1038  0.4362  0.9202
  1.7541 -0.3677 -0.2281
  0.2216  0.5096  0.0526
  0.8723 -0.9660 -0.7170
 -1.3876 -1.2333 -1.6822
[torch.FloatTensor of size 1x6x3]
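
One way to read the output above: the lookup simply selects rows of the embedding's weight matrix, so a repeated id (here id 1) returns the same row twice. The sketch below checks this on a freshly initialized embedding. The seed is arbitrary, and the snippet assumes a PyTorch version where tensors can be passed to modules directly (0.4 or later).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only so the sketch is repeatable
embed = nn.Embedding(10, 3)          # 10 ids, each mapped to a vector of size 3
ids = torch.LongTensor([[1, 2, 0, 4, 5, 1]])

looked_up = embed(ids)               # shape (1, 6, 3)
indexed = embed.weight[ids]          # plain row indexing into the weight matrix

print(tuple(looked_up.shape))                          # (1, 6, 3)
print(torch.equal(looked_up, indexed))                 # True
print(torch.equal(looked_up[0, 0], looked_up[0, 5]))   # True: both positions hold id 1
```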

## Matrix factorization model

In [107]:
class MF(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100):
        super(MF, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.item_emb = nn.Embedding(num_items, emb_size)
        self.user_emb.weight.data.uniform_(0,0.05)
        self.item_emb.weight.data.uniform_(0,0.05)
        
    def forward(self, u, v):
        u = self.user_emb(u)
        v = self.item_emb(v)
        return (u*v).sum(1)   
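
In equation form, this is a sketch of what `MF.forward` computes. The symbols are notation introduced here, not taken from the notebook: $u_i$ is row $i$ of `user_emb`, $v_j$ is row $j$ of `item_emb`, and $K$ is `emb_size`.

$$\hat{y}_{ij} = u_i \cdot v_j = \sum_{k=1}^{K} u_{ik}\, v_{jk}$$

The predicted rating for user $i$ and movie $j$ is the dot product of their embeddings.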

## Debugging MF model

In [15]:
df_t_e

Unnamed: 0,userId,movieId,rating
0,0,0,4
1,0,1,5
2,1,1,5
3,1,2,3
4,2,0,4
5,2,1,4
6,3,0,5
7,3,3,2
8,4,0,1
9,4,3,4


In [16]:
num_users = 7
num_items = 4
emb_size = 3

user_emb = nn.Embedding(num_users, emb_size)
item_emb = nn.Embedding(num_items, emb_size)
users = Variable(torch.LongTensor(df_t_e.userId.values))
items = Variable(torch.LongTensor(df_t_e.movieId.values))

In [17]:
U = user_emb(users)
V = item_emb(items)

In [18]:
U

Variable containing:
 0.0631 -0.1049 -1.1792
 0.0631 -0.1049 -1.1792
-0.3294 -0.6571  1.0094
-0.3294 -0.6571  1.0094
 1.6597 -0.3185  0.4174
 1.6597 -0.3185  0.4174
 0.0679 -1.5368 -0.0632
 0.0679 -1.5368 -0.0632
 0.7946  0.3988  2.1184
 0.7946  0.3988  2.1184
 0.2863  1.4572  0.2051
 1.0307  1.4180  0.7040
 1.0307  1.4180  0.7040
[torch.FloatTensor of size 13x3]

In [19]:
# element-wise multiplication
U*V 

Variable containing:
 0.0539  0.0394  0.3661
 0.0085  0.1380 -1.2082
-0.0446  0.8646  1.0342
-0.3252 -0.5799  2.0386
 1.4189  0.1198 -0.1296
 0.2246  0.4191  0.4277
 0.0581  0.5781  0.0196
-0.0394 -1.4677 -0.0378
 0.6793 -0.1500 -0.6578
-0.4614  0.3809  1.2660
-0.1662  1.3916  0.1226
 0.1395 -1.8658  0.7212
-0.5985  1.3542  0.4207
[torch.FloatTensor of size 13x3]

In [20]:
# what we want is a dot product per row
(U*V).sum(1) 

Variable containing:
 0.4595
-1.0616
 1.8542
 1.1334
 1.4091
 1.0714
 0.6557
-1.5449
-0.1285
 1.1855
 1.3480
-1.0051
 1.1764
[torch.FloatTensor of size 13]
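
As a cross-check of the per-row dot product, the sketch below compares `(U*V).sum(1)` with the diagonal of the full matrix product `U @ V.t()` on small random tensors (the shapes and seed are arbitrary). The full product computes every user-item pair, whereas the element-wise version only computes the pairs we actually need.

```python
import torch

torch.manual_seed(0)  # arbitrary seed
U = torch.randn(5, 3)  # 5 rows of user embeddings, size 3
V = torch.randn(5, 3)  # 5 rows of item embeddings, size 3

rowwise = (U * V).sum(1)          # one dot product per row, shape (5,)
full = torch.diag(U @ V.t())      # diagonal of the 5x5 product, shape (5,)

print(tuple(rowwise.shape))                        # (5,)
print(torch.allclose(rowwise, full, atol=1e-6))    # True
```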

## Training MF model

In [108]:
num_users = len(df_train.userId.unique())
num_items = len(df_train.movieId.unique())
print(num_users, num_items) 

671 8442


In [135]:
model = MF(num_users, num_items, emb_size=100).cuda()

In [136]:
def train_epocs(model, epochs=10, lr=0.01, wd=0.0):
    parameters = filter(lambda p: p.requires_grad, model.parameters())
    optimizer = torch.optim.Adam(parameters, lr=lr, weight_decay=wd)
    model.train()
    for i in range(epochs):
        users = Variable(torch.LongTensor(df_train.userId.values)).cuda()
        items = Variable(torch.LongTensor(df_train.movieId.values)).cuda()
        ratings = Variable(torch.FloatTensor(df_train.rating.values)).cuda()
        y_hat = model(users, items)
        loss = F.mse_loss(y_hat, ratings)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print(loss.item())  # loss.data[0] raises an error on 0-dim tensors in recent PyTorch
    test_loss(model)

In [137]:
def test_loss(model):
    model.eval()
    users = Variable(torch.LongTensor(df_val.userId.values)).cuda()
    items = Variable(torch.LongTensor(df_val.movieId.values)).cuda()
    ratings = Variable(torch.FloatTensor(df_val.rating.values)).cuda()
    y_hat = model(users, items)
    loss = F.mse_loss(y_hat, ratings)
    print("test loss %.3f " % loss.item())  # .item() replaces the older loss.data[0]

In [138]:
train_epocs(model, epochs=10, lr=0.1)

13.236610412597656
5.137965202331543
2.354017496109009
3.4741756916046143
0.9128307700157166
1.8048418760299683
2.7530932426452637
2.288565158843994
1.1635662317276
0.9178478121757507
test loss 1.944 


In [139]:
train_epocs(model, epochs=15, lr=0.01)

1.7026231288909912
1.0509860515594482
0.7494574189186096
0.6945649981498718
0.7591621279716492
0.8392602801322937
0.881389319896698
0.8751192688941956
0.8333470821380615
0.7767939567565918
0.724796712398529
0.69017493724823
0.6768792271614075
0.6804761290550232
0.6914284825325012
test loss 0.894 


In [140]:
train_epocs(model, epochs=15, lr=0.01)

0.6998854875564575
0.6620432734489441
0.6676947474479675
0.6447718143463135
0.6374446153640747
0.6442126035690308
0.6396975517272949
0.6244640350341797
0.6132749915122986
0.611960232257843
0.6125966310501099
0.6067469120025635
0.5951989889144897
0.5841732621192932
0.5772355794906616
test loss 0.822 


## MF with bias

In [141]:
class MF_bias(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100):
        super(MF_bias, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.user_bias = nn.Embedding(num_users, 1)
        self.item_emb = nn.Embedding(num_items, emb_size)
        self.item_bias = nn.Embedding(num_items, 1)
        self.user_emb.weight.data.uniform_(0,0.05)
        self.item_emb.weight.data.uniform_(0,0.05)
        self.user_bias.weight.data.uniform_(-0.01,0.01)
        self.item_bias.weight.data.uniform_(-0.01,0.01)
        
    def forward(self, u, v):
        U = self.user_emb(u)
        V = self.item_emb(v)
        b_u = self.user_bias(u).squeeze()
        b_v = self.item_bias(v).squeeze()
        return (U*V).sum(1) +  b_u  + b_v
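
The `squeeze` calls in `forward` matter because a bias embedding of size 1 returns a tensor of shape (N, 1), while the dot products have shape (N,). The following minimal sketch, with made-up sizes, shows what broadcasting would do without the squeeze.

```python
import torch
import torch.nn as nn

bias = nn.Embedding(4, 1)            # hypothetical: 4 ids, one bias value each
ids = torch.LongTensor([0, 2, 3])
dots = torch.zeros(3)                # stand-in for (U*V).sum(1), shape (3,)

b = bias(ids)                        # shape (3, 1)
no_squeeze = dots + b                # broadcasts to shape (3, 3)
with_squeeze = dots + b.squeeze(1)   # shape (3,), one prediction per row

print(tuple(no_squeeze.shape))    # (3, 3)
print(tuple(with_squeeze.shape))  # (3,)
```

Passing an explicit dimension, as in `squeeze(1)`, is a slightly safer variant than a bare `squeeze()`, since it leaves the batch dimension intact even when the batch contains a single row.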

In [149]:
model = MF_bias(num_users, num_items, emb_size=100).cuda()

In [150]:
train_epocs(model, epochs=10, lr=0.1, wd=1e-5)

13.230323791503906
4.380349636077881
3.4689183235168457
2.4748196601867676
0.7888643741607666
1.8115053176879883
2.5194005966186523
2.1395790576934814
1.273136854171753
0.9016141295433044
test loss 1.536 


In [151]:
train_epocs(model, epochs=10, lr=0.01, wd=1e-5)

1.2817140817642212
0.8586933016777039
0.6953701972961426
0.6957386136054993
0.7542874813079834
0.7994323968887329
0.8065431714057922
0.7809088826179504
0.7390813827514648
0.6980960369110107
test loss 0.826 


In [152]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-5)

0.6699392795562744
0.6618765592575073
0.6551422476768494
0.6496731042861938
0.6453258991241455
0.6419147253036499
0.6392385363578796
0.6371069550514221
0.6353580951690674
0.6338663101196289
test loss 0.811 


In [153]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-5)

0.63254314661026
0.630549967288971
0.6290085315704346
0.6276098489761353
0.6262186765670776
0.6248037815093994
0.6233746409416199
0.6219477653503418
0.6205318570137024
0.6191259622573853
test loss 0.811 


Note that these models are sensitive to weight initialization, the choice of optimization algorithm, and regularization.

## Neural Network Model

In [173]:
# Note that here there is no dot product between the embeddings, so the user and item embeddings could have different sizes.
# We could probably get better results by tuning the regularization further.
    
class CollabFNet(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100, n_hidden=10):
        super(CollabFNet, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.item_emb = nn.Embedding(num_items, emb_size)
        self.lin1 = nn.Linear(emb_size*2, n_hidden)
        self.lin2 = nn.Linear(n_hidden, 1)
        self.drop1 = nn.Dropout(0.1)
        self.drop2 = nn.Dropout(0.0)
        
    def forward(self, u, v):
        U = self.user_emb(u)
        V = self.item_emb(v)
        x = F.relu(torch.cat([U, V], dim=1))
        x = self.drop1(x)
        x = F.relu(self.lin1(x))
        x = self.drop2(x)
        x = self.lin2(x)
        return x.squeeze(1)  # shape (N,), so it matches the ratings tensor in F.mse_loss
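
Because the user and item embeddings are concatenated rather than multiplied, they do not need to share a size. The sketch below traces the tensor shapes through the first layers with deliberately different embedding sizes. All sizes are hypothetical and chosen only for illustration.

```python
import torch
import torch.nn as nn

user_emb = nn.Embedding(5, 8)     # hypothetical: 5 users, embeddings of size 8
item_emb = nn.Embedding(7, 16)    # hypothetical: 7 items, embeddings of size 16
lin1 = nn.Linear(8 + 16, 10)      # input size is the sum of both embedding sizes
lin2 = nn.Linear(10, 1)

u = torch.LongTensor([0, 1, 4])
v = torch.LongTensor([6, 2, 2])

x = torch.cat([user_emb(u), item_emb(v)], dim=1)   # shape (3, 24)
h = torch.relu(lin1(x))                            # shape (3, 10)
out = lin2(h)                                      # shape (3, 1)

print(tuple(x.shape), tuple(h.shape), tuple(out.shape))  # (3, 24) (3, 10) (3, 1)
```

The final layer produces one column per row, so the output has shape (N, 1) and needs to be aligned with the shape of the ratings tensor before computing the loss.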

In [181]:
model = CollabFNet(num_users, num_items, emb_size=100).cuda()

In [182]:
train_epocs(model, epochs=20, lr=0.01, wd=1e-5)

14.666293144226074
9.48507308959961
5.082248210906982
2.1838996410369873
1.3149502277374268
2.3689379692077637
3.670354127883911
3.9370362758636475
3.295283794403076
2.364993095397949
1.6142159700393677
1.2404167652130127
1.2286133766174316
1.4336869716644287
1.7053074836730957
1.9117978811264038
1.9993972778320312
1.9518417119979858
1.794342041015625
1.576772689819336
test loss 1.336 


In [183]:
train_epocs(model, epochs=20, lr=0.01, wd=1e-6)

1.3451672792434692
1.2867872714996338
1.1990998983383179
1.0111206769943237
1.0731201171875
1.1048247814178467
1.007625937461853
0.9261730909347534
0.9513388872146606
0.9726068377494812
0.925067126750946
0.8661768436431885
0.8609121441841125
0.8794094324111938
0.8693878650665283
0.8317432999610901
0.8063588738441467
0.8150901794433594
0.8166025280952454
0.8022860288619995
test loss 0.822 


In [184]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-6)

0.7776695489883423
0.7752051949501038
0.7757472395896912
0.7720879316329956
0.7715520858764648
0.7729965448379517
0.7709333300590515
0.768379807472229
0.7679958343505859
0.7693462371826172
test loss 0.815 


In [185]:
train_epocs(model, epochs=20, lr=0.001, wd=1e-6)

0.7651445865631104
0.7693643569946289
0.7657420635223389
0.7633621096611023
0.7632601261138916
0.7648181319236755
0.7619305849075317
0.7608171105384827
0.7615530490875244
0.7589362263679504
0.7570609450340271
0.7563826441764832
0.756084680557251
0.7564724087715149
0.756054162979126
0.7550194263458252
0.751815676689148
0.7511407732963562
0.7509831786155701
0.7511487603187561
test loss 0.806 


## TODO
* use t-SNE to visualize embeddings

# Lab
* Work with the largest dataset http://files.grouplens.org/datasets/movielens/ml-latest.zip
* Can you use `tags.csv` and `timestamp` to improve your predictions?