# Collaborative Filtering with Neural Networks

In this notebook we write a matrix factorization model in PyTorch to solve a recommendation problem. We then write a more general neural model for the same problem.

Collaborative filtering systems recommend items based on similarity measures between users and/or items: the items recommended to a user are those preferred by similar users.
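
To make the idea of "similar users" concrete, here is a minimal sketch on a made-up ratings matrix. The data, the `cosine_similarity` helper, and the choice of cosine similarity are assumptions for illustration only; the models later in this notebook learn similarity implicitly through embeddings instead.

```python
import numpy as np

# Hypothetical toy ratings: rows are users, columns are items, 0 means "not rated".
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],   # user 0
    [4.0, 5.0, 1.0, 0.0],   # user 1
    [1.0, 0.0, 5.0, 4.0],   # user 2
])

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Similarity of user 0 to every other user.
sims = {other: cosine_similarity(ratings[0], ratings[other]) for other in (1, 2)}
most_similar = max(sims, key=sims.get)
print(sims, most_similar)
```

User 0 and user 1 rate the first two items similarly, so user 1 comes out as the closest neighbour; a neighbourhood-based recommender would then look at what user 1 liked that user 0 has not rated.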

The MovieLens dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100004 ratings and 1296 tag applications across 9125 movies. https://grouplens.org/datasets/movielens/. To get the data:

`wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip`

## MovieLens dataset

In [1]:
from pathlib import Path
import pandas as pd
import numpy as np

In [2]:
PATH = Path("/Users/yinterian/teaching/ML-2/data/ml-latest-small/")
list(PATH.iterdir())

[PosixPath('/Users/yinterian/teaching/ML-2/data/ml-latest-small/links.csv'),
 PosixPath('/Users/yinterian/teaching/ML-2/data/ml-latest-small/tags.csv'),
 PosixPath('/Users/yinterian/teaching/ML-2/data/ml-latest-small/ratings.csv'),
 PosixPath('/Users/yinterian/teaching/ML-2/data/ml-latest-small/README.txt'),
 PosixPath('/Users/yinterian/teaching/ML-2/data/ml-latest-small/movies.csv')]

In [3]:
! head /Users/yinterian/teaching/deeplearning/data/ml-latest-small/ratings.csv

userId,movieId,rating,timestamp
1,31,2.5,1260759144
1,1029,3.0,1260759179
1,1061,3.0,1260759182
1,1129,2.0,1260759185
1,1172,4.0,1260759205
1,1263,2.0,1260759151
1,1287,2.0,1260759187
1,1293,2.0,1260759148
1,1339,3.5,1260759125


In [4]:
# reading a csv into pandas
data = pd.read_csv(PATH/"ratings.csv")

In [5]:
data.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


### Encoding data
We encode the data to have contiguous ids for users and movies. You can think of this as a categorical encoding of our two categorical variables, userId and movieId.
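
As a small illustration of what this encoding does, the sketch below maps a few made-up raw ids to contiguous indices. The ids and variable names are hypothetical. The mapping is built from the training ids only, so an id seen only in validation is marked with -1.

```python
import numpy as np
import pandas as pd

# Hypothetical raw ids, as they might appear in a training and a validation column.
train_ids = pd.Series([31, 1029, 31, 1061])
val_ids = pd.Series([1061, 31, 9999])  # 9999 never appears in training

# Build the mapping from training ids only, in order of first appearance.
name2idx = {raw: idx for idx, raw in enumerate(train_ids.unique())}

train_encoded = np.array([name2idx[x] for x in train_ids])
# Ids unseen during training get -1 so they can be filtered out afterwards.
val_encoded = np.array([name2idx.get(x, -1) for x in val_ids])

print(train_encoded, val_encoded)
```

Rows encoded as -1 have no embedding to look up, which is why they are dropped from the validation set.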

In [6]:
# split train and validation before encoding
np.random.seed(3)
msk = np.random.rand(len(data)) < 0.8
train = data[msk].copy()
val = data[~msk].copy()

In [7]:
# here is a handy function modified from fast.ai
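def proc_col(col, train_col=None):
    """Encodes a pandas column with contiguous ids.

    If train_col is given, the mapping is built from it and ids in col
    that do not appear in train_col are encoded as -1.
    """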
    if train_col is not None:
        uniq = train_col.unique()
    else:
        uniq = col.unique()
    name2idx = {o:i for i,o in enumerate(uniq)}
    return name2idx, np.array([name2idx.get(x, -1) for x in col]), len(uniq)

In [8]:
def encode_data(df, train=None):
    """Encodes rating data with contiguous user and movie ids.

    If train is provided, encodes df with the same encoding as train and
    drops rows whose user or movie does not appear in train.
    """
    df = df.copy()
    for col_name in ["userId", "movieId"]:
        train_col = None
        if train is not None:
            train_col = train[col_name]
        _,col,_ = proc_col(df[col_name], train_col)
        df[col_name] = col
        df = df[df[col_name] >= 0]
    return df

In [9]:
# to check my implementation
df_t = pd.read_csv("test_data/tiny_training2.csv")
df_v = pd.read_csv("test_data/tiny_val2.csv")
df_t_e = encode_data(df_t)
df_v_e = encode_data(df_v, df_t)
df_v_e
df_t_e

Unnamed: 0,userId,movieId,rating
0,0,0,4
1,0,1,5
2,1,1,5
3,1,2,3
4,2,0,4
5,2,1,4
6,3,0,5
7,3,3,2
8,4,0,1
9,4,3,4


In [10]:
df_train = encode_data(train)
df_val = encode_data(val, train)

## Embedding layer

In [11]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [12]:
# an Embedding module for 10 users or items, each with an embedding of size 3
# the embedding weights are initialized at random
embed = nn.Embedding(10, 3)
embed.weight

Parameter containing:
tensor([[-2.8846, -0.6033, -1.7295],
        [-0.0655, -0.3334,  0.7959],
        [-1.6476,  0.5635,  0.3031],
        [-0.6678,  2.0692, -0.1898],
        [ 0.8253, -0.2421,  0.5325],
        [-0.5016,  0.8493,  1.2077],
        [ 0.0506,  0.7678, -0.2365],
        [ 0.8921, -0.2085, -0.7218],
        [ 0.5329,  0.8161, -2.1785],
        [ 1.2591, -0.7965, -1.1599]])

In [13]:
# given a list of ids we can "look up" the embedding corresponding to each id
# can you see that some vectors are the same? (id 1 appears three times)
a = torch.LongTensor([[1,0,1,4,5,1]])
embed(a)

tensor([[[-0.0655, -0.3334,  0.7959],
         [-2.8846, -0.6033, -1.7295],
         [-0.0655, -0.3334,  0.7959],
         [ 0.8253, -0.2421,  0.5325],
         [-0.5016,  0.8493,  1.2077],
         [-0.0655, -0.3334,  0.7959]]])
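
One way to think about the lookup: it gives the same result as multiplying a one-hot encoding of the ids by the embedding weight matrix, without ever building the one-hot matrix. The sketch below checks this on a small example; the seed and ids are arbitrary choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
embed = nn.Embedding(10, 3)
ids = torch.LongTensor([1, 0, 1, 4])

looked_up = embed(ids)                              # shape (4, 3)
one_hot = F.one_hot(ids, num_classes=10).float()    # shape (4, 10)
via_matmul = one_hot @ embed.weight                 # shape (4, 3)

# Both routes select the same rows of the weight matrix.
same = torch.allclose(looked_up, via_matmul)
print(same)
```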

## Matrix factorization model

In [62]:
class MF(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100):
        super(MF, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.item_emb = nn.Embedding(num_items, emb_size)
        # initializing weights
        self.user_emb.weight.data.uniform_(0,0.05)
        self.item_emb.weight.data.uniform_(0,0.05)
        
    def forward(self, u, v):
        u = self.user_emb(u)
        v = self.item_emb(v)
        return (u*v).sum(1)   

## Debugging MF model

In [63]:
df_t_e

Unnamed: 0,userId,movieId,rating
0,0,0,4
1,0,1,5
2,1,1,5
3,1,2,3
4,2,0,4
5,2,1,4
6,3,0,5
7,3,3,2
8,4,0,1
9,4,3,4


In [64]:
num_users = 7
num_items = 4
emb_size = 3

user_emb = nn.Embedding(num_users, emb_size)
item_emb = nn.Embedding(num_items, emb_size)
users = torch.LongTensor(df_t_e.userId.values)
items = torch.LongTensor(df_t_e.movieId.values)

In [65]:
U = user_emb(users)
V = item_emb(items)

In [66]:
U

tensor([[ 0.5304, -0.6472,  0.4554],
        [ 0.5304, -0.6472,  0.4554],
        [ 0.1543, -0.2089, -0.3614],
        [ 0.1543, -0.2089, -0.3614],
        [ 0.0763, -0.8429,  0.0911],
        [ 0.0763, -0.8429,  0.0911],
        [-1.1417, -1.1717, -0.4392],
        [-1.1417, -1.1717, -0.4392],
        [ 1.9817, -1.9040,  0.4578],
        [ 1.9817, -1.9040,  0.4578],
        [ 1.3078,  0.2057, -0.2978],
        [-0.3635, -0.2366, -1.4741],
        [-0.3635, -0.2366, -1.4741]])

In [67]:
# element wise multiplication
U*V 

tensor([[-0.0763, -0.3313,  0.3695],
        [-0.0621,  0.4944,  0.6128],
        [-0.0181,  0.1596, -0.4864],
        [-0.1408, -0.1944, -0.1817],
        [-0.0110, -0.4314,  0.0739],
        [-0.0089,  0.6438,  0.1226],
        [ 0.1643, -0.5997, -0.3564],
        [-0.2089, -0.2127,  0.3451],
        [-0.2851, -0.9745,  0.3715],
        [ 0.3627, -0.3456, -0.3597],
        [ 0.2393,  0.0373,  0.2340],
        [ 0.0425,  0.1808, -1.9838],
        [-0.0665, -0.0430,  1.1580]])

In [68]:
# what we want is a dot product per row
(U*V).sum(1) 

tensor([-0.0381,  1.0451, -0.3449, -0.5169, -0.3685,  0.7575, -0.7919,
        -0.0766, -0.8881, -0.3426,  0.5107, -1.7605,  1.0486])
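
The row-wise dot product can be written in a few equivalent ways. This sketch uses random stand-in tensors (not the embeddings above) to compare `(U*V).sum(1)` with `torch.einsum` and with the diagonal of the full matrix product. The last option computes every user-item pair and is only shown for comparison.

```python
import torch

torch.manual_seed(0)
U = torch.randn(5, 3)   # stand-in for user embeddings, one row per rating
V = torch.randn(5, 3)   # stand-in for item embeddings, one row per rating

rowwise = (U * V).sum(1)                      # what the MF model uses
via_einsum = torch.einsum("ij,ij->i", U, V)   # same computation, spelled with einsum
via_diag = torch.diagonal(U @ V.t())          # wasteful: builds a 5 x 5 matrix first

print(rowwise.shape)
```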

## Training MF model

In [69]:
num_users = len(df_train.userId.unique())
num_items = len(df_train.movieId.unique())
print(num_users, num_items) 

671 8442


In [299]:
# here we are not using data loaders because our data fits well in memory
def train_epocs(model, epochs=10, lr=0.01, wd=0.0, unsqueeze=False):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=wd)
    for i in range(epochs):
        model.train()
        users = torch.LongTensor(df_train.userId.values)  #.cuda()
        items = torch.LongTensor(df_train.movieId.values) #.cuda()
        ratings = torch.FloatTensor(df_train.rating.values)  #.cuda()
        if unsqueeze:
            ratings = ratings.unsqueeze(1)
        y_hat = model(users, items)
        loss = F.mse_loss(y_hat, ratings)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        testloss = test_loss(model, unsqueeze)
        print("train loss %.3f test loss %.3f" % (loss.item(), testloss)) 

In [300]:
# Here is what unsqueeze does
ratings = torch.FloatTensor(df_train.rating.values)
print(ratings.shape)
ratings = ratings.unsqueeze(1) #.cuda()
ratings.shape

torch.Size([79799])


torch.Size([79799, 1])
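
Why this matters: a model that ends in `nn.Linear(..., 1)` returns predictions of shape `(N, 1)`, while the ratings have shape `(N,)`. Subtracting the two broadcasts to an `(N, N)` matrix instead of raising an error, so the loss is silently wrong. The sketch below shows the effect with made-up numbers and a hand-written mean squared error. As far as I know, `F.mse_loss` broadcasts in the same way and recent PyTorch versions emit a warning about the size mismatch.

```python
import torch

y_hat = torch.tensor([[1.0], [2.0], [3.0]])   # shape (3, 1), like the output of nn.Linear(..., 1)
ratings = torch.tensor([1.0, 2.0, 3.0])       # shape (3,)

# Mismatched shapes broadcast to a (3, 3) matrix of all pairwise differences.
diff_bad = y_hat - ratings
mse_bad = (diff_bad ** 2).mean()

# After unsqueeze(1) both tensors have shape (3, 1) and the difference is element-wise.
diff_ok = y_hat - ratings.unsqueeze(1)
mse_ok = (diff_ok ** 2).mean()

print(diff_bad.shape, mse_bad.item(), diff_ok.shape, mse_ok.item())
```

The predictions here match the ratings exactly, yet the broadcast version reports a non-zero loss.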

In [301]:
def test_loss(model, unsqueeze=False):
    model.eval()
    users = torch.LongTensor(df_val.userId.values) # .cuda()
    items = torch.LongTensor(df_val.movieId.values) #.cuda()
    ratings = torch.FloatTensor(df_val.rating.values) #.cuda()
    if unsqueeze:
        ratings = ratings.unsqueeze(1)
    # no gradients are needed for evaluation
    with torch.no_grad():
        y_hat = model(users, items)
        loss = F.mse_loss(y_hat, ratings)
    return loss.item()

In [190]:
model = MF(num_users, num_items, emb_size=100)  # if you have a GPU .cuda()

In [191]:
train_epocs(model, epochs=20, lr=0.1, wd=1e-5)

13.22977066040039
5.127340793609619
2.357693672180176
3.313488721847534
0.899741530418396
1.8573846817016602
2.7810661792755127
2.2830049991607666
1.1453200578689575
0.9064176082611084
1.6611082553863525
1.4776304960250854
0.8247122168540955
0.909933865070343
1.3347585201263428
1.392578125
1.0245131254196167
0.6999652981758118
0.8274641633033752
1.0505080223083496
test loss 1.131 


In [192]:
train_epocs(model, epochs=15, lr=0.01, wd=1e-5)

0.8765497803688049
0.6484661102294922
0.6463428735733032
0.6898732781410217
0.6919195055961609
0.6609521508216858
0.6276931166648865
0.6107308864593506
0.6090696454048157
0.6101611256599426
0.6040680408477783
0.5906494855880737
0.576405942440033
0.5673468708992004
0.5642893314361572
test loss 0.847 


In [193]:
train_epocs(model, epochs=15, lr=0.001, wd=1e-5)

0.563148021697998
0.5561341047286987
0.5504226684570312
0.5457829236984253
0.5419900417327881
0.5387914180755615
0.536023736000061
0.5335185527801514
0.531192421913147
0.5289515256881714
0.5267682671546936
0.5246036648750305
0.5224275588989258
0.5202339291572571
0.5180172920227051
test loss 0.816 


In [194]:
train_epocs(model, epochs=15, lr=0.001, wd=1e-5)

0.5157772302627563
0.5129480361938477
0.5102515816688538
0.5076241493225098
0.5050358176231384
0.5024455785751343
0.49983420968055725
0.49720141291618347
0.494540274143219
0.49184277653694153
0.4891185164451599
0.4863676428794861
0.48359766602516174
0.48080265522003174
0.47798213362693787
test loss 0.818 


## MF with bias

In [302]:
class MF_bias(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100):
        super(MF_bias, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.user_bias = nn.Embedding(num_users, 1)
        self.item_emb = nn.Embedding(num_items, emb_size)
        self.item_bias = nn.Embedding(num_items, 1)
        # init 
        self.user_emb.weight.data.uniform_(0,0.05)
        self.item_emb.weight.data.uniform_(0,0.05)
        self.user_bias.weight.data.uniform_(-0.01,0.01)
        self.item_bias.weight.data.uniform_(-0.01,0.01)
        
    def forward(self, u, v):
        U = self.user_emb(u)
        V = self.item_emb(v)
        b_u = self.user_bias(u).squeeze()
        b_v = self.item_bias(v).squeeze()
        return (U*V).sum(1) +  b_u  + b_v
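
The bias embeddings have size 1, so a lookup returns shape `(N, 1)` and has to be squeezed to `(N,)` before it is added to the dot product. The sketch below traces the shapes on toy sizes. All sizes and ids are hypothetical, and it uses `squeeze(1)`, which only removes the embedding dimension.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_users, num_items, emb_size = 5, 4, 3   # hypothetical toy sizes
user_emb, item_emb = nn.Embedding(num_users, emb_size), nn.Embedding(num_items, emb_size)
user_bias, item_bias = nn.Embedding(num_users, 1), nn.Embedding(num_items, 1)

u = torch.LongTensor([0, 1, 4])
v = torch.LongTensor([2, 2, 3])

dot = (user_emb(u) * item_emb(v)).sum(1)   # shape (3,)
b_u = user_bias(u).squeeze(1)              # (3, 1) -> (3,)
b_v = item_bias(v).squeeze(1)              # (3, 1) -> (3,)
pred = dot + b_u + b_v                     # shape (3,)

# Without the squeeze, (3,) + (3, 1) would broadcast to (3, 3).
unsqueezed_shape = (dot + user_bias(u)).shape
print(pred.shape, unsqueezed_shape)
```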

In [315]:
model = MF_bias(num_users, num_items, emb_size=100) #.cuda()

In [316]:
train_epocs(model, epochs=15, lr=0.1, wd=1e-5)

train loss 13.235 test loss 9.547
train loss 9.456 test loss 4.670
train loss 4.613 test loss 1.256
train loss 1.227 test loss 2.546
train loss 2.460 test loss 4.070
train loss 3.887 test loss 2.838
train loss 2.613 test loss 1.393
train loss 1.157 test loss 1.057
train loss 0.822 test loss 1.538
train loss 1.312 test loss 2.128
train loss 1.915 test loss 2.413
train loss 2.211 test loss 2.300
train loss 2.108 test loss 1.880
train loss 1.694 test loss 1.347
train loss 1.164 test loss 0.945


In [317]:
train_epocs(model, epochs=10, lr=0.01, wd=1e-5)

train loss 0.766 test loss 0.860
train loss 0.673 test loss 0.865
train loss 0.683 test loss 0.844
train loss 0.671 test loss 0.822
train loss 0.657 test loss 0.823
train loss 0.660 test loss 0.833
train loss 0.668 test loss 0.834
train loss 0.665 test loss 0.827
train loss 0.653 test loss 0.822
train loss 0.641 test loss 0.825


In [318]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-5)

train loss 0.636 test loss 0.822
train loss 0.633 test loss 0.821
train loss 0.631 test loss 0.819
train loss 0.630 test loss 0.818
train loss 0.628 test loss 0.818
train loss 0.627 test loss 0.817
train loss 0.626 test loss 0.817
train loss 0.624 test loss 0.816
train loss 0.623 test loss 0.816
train loss 0.622 test loss 0.815


In [319]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-5)

train loss 0.621 test loss 0.815
train loss 0.619 test loss 0.814
train loss 0.617 test loss 0.814
train loss 0.616 test loss 0.814
train loss 0.614 test loss 0.813
train loss 0.613 test loss 0.813
train loss 0.611 test loss 0.813
train loss 0.609 test loss 0.813
train loss 0.608 test loss 0.812
train loss 0.606 test loss 0.812


In [320]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-5)

train loss 0.604 test loss 0.812
train loss 0.602 test loss 0.812
train loss 0.600 test loss 0.812
train loss 0.598 test loss 0.812
train loss 0.596 test loss 0.812
train loss 0.594 test loss 0.812
train loss 0.592 test loss 0.812
train loss 0.590 test loss 0.811
train loss 0.588 test loss 0.811
train loss 0.586 test loss 0.811


Note that these models are sensitive to weight initialization, the choice of optimization algorithm, and regularization.
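
As one illustration of the role of initialization, the sketch below estimates the average prediction of an untrained MF model for two ranges of `uniform_(0, high)`. With `emb_size=100`, the expected dot product is `100 * (high/2)**2`, so `high=0.05` starts near 0.06. If ratings average somewhere around 3.5 (an assumption, not computed here), that is roughly consistent with the first training loss near 13 reported above. The helper name and the alternative range 0.4 are hypothetical choices for the comparison, not a recommendation.

```python
import torch

torch.manual_seed(0)
emb_size, n = 100, 10_000

def mean_initial_prediction(high):
    """Average dot product of two embeddings drawn from uniform(0, high)."""
    u = torch.empty(n, emb_size).uniform_(0, high)
    v = torch.empty(n, emb_size).uniform_(0, high)
    return (u * v).sum(1).mean().item()

small = mean_initial_prediction(0.05)   # range used in MF above, about 100 * 0.025**2
large = mean_initial_prediction(0.4)    # hypothetical wider range, about 100 * 0.2**2
print(small, large)
```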

## Neural Network Model

In [451]:
# Note that here there is no dot product between user and item embeddings, so the
# user and item embeddings could have different sizes.
# We could probably get better results by tuning the regularization further.
    
class CollabFNet(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100, n_hidden=20):
        super(CollabFNet, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.item_emb = nn.Embedding(num_items, emb_size)
        self.lin1 = nn.Linear(emb_size*2, n_hidden)
        self.lin2 = nn.Linear(n_hidden, 1)
        self.drop1 = nn.Dropout(0.3)
        self.drop2 = nn.Dropout(0.0)
        self.dense_bn = nn.BatchNorm1d(n_hidden)
        
    def forward(self, u, v):
        U = self.user_emb(u)
        V = self.item_emb(v)
        x = torch.cat([U, V], dim=1)
        x = self.drop1(x)
        x = F.relu(self.dense_bn(self.lin1(x)))
        x = self.drop2(x)
        x = self.lin2(x)
        return x
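
Because the embeddings are concatenated rather than multiplied, they do not need to share a size. Here is a minimal sketch of that variant. The class name `CollabFNetTwoSizes` and its parameters are hypothetical, and dropout and batch norm are left out to keep it short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CollabFNetTwoSizes(nn.Module):
    """Hypothetical variant of CollabFNet with separate user and item embedding sizes."""
    def __init__(self, num_users, num_items, user_emb_size=50, item_emb_size=120, n_hidden=20):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, user_emb_size)
        self.item_emb = nn.Embedding(num_items, item_emb_size)
        # the first linear layer takes the concatenation of both embeddings
        self.lin1 = nn.Linear(user_emb_size + item_emb_size, n_hidden)
        self.lin2 = nn.Linear(n_hidden, 1)

    def forward(self, u, v):
        x = torch.cat([self.user_emb(u), self.item_emb(v)], dim=1)
        x = F.relu(self.lin1(x))
        return self.lin2(x)

net = CollabFNetTwoSizes(num_users=5, num_items=7)
out = net(torch.LongTensor([0, 1, 4]), torch.LongTensor([6, 2, 2]))
print(out.shape)
```

The output has shape `(N, 1)`, which is why the training calls for this kind of model pass `unsqueeze=True`.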

In [452]:
model = CollabFNet(num_users, num_items, emb_size=120, n_hidden=40) #.cuda()

In [445]:
train_epocs(model, epochs=15, lr=0.1, wd=1e-5, unsqueeze=True) 

train loss 13.927 test loss 5.797
train loss 7.231 test loss 4.354
train loss 2.362 test loss 20.705
train loss 1.667 test loss 33.821
train loss 3.590 test loss 25.512
train loss 2.669 test loss 13.091
train loss 1.299 test loss 5.303
train loss 0.888 test loss 2.151
train loss 1.183 test loss 1.264
train loss 1.540 test loss 1.105
train loss 1.641 test loss 1.121
train loss 1.467 test loss 1.229
train loss 1.152 test loss 1.476
train loss 0.870 test loss 1.885
train loss 0.750 test loss 2.372


In [446]:
train_epocs(model, epochs=10, lr=0.01, wd=1e-5, unsqueeze=True)

train loss 0.826 test loss 1.577
train loss 0.712 test loss 1.161
train loss 0.677 test loss 0.972
train loss 0.689 test loss 0.903
train loss 0.701 test loss 0.882
train loss 0.700 test loss 0.875
train loss 0.687 test loss 0.871
train loss 0.667 test loss 0.867
train loss 0.654 test loss 0.865
train loss 0.645 test loss 0.863


In [447]:
train_epocs(model, epochs=10, lr=0.01, wd=1e-5, unsqueeze=True)

train loss 0.641 test loss 0.865
train loss 0.648 test loss 0.853
train loss 0.631 test loss 0.845
train loss 0.626 test loss 0.844
train loss 0.627 test loss 0.844
train loss 0.621 test loss 0.851
train loss 0.612 test loss 0.866
train loss 0.607 test loss 0.881
train loss 0.605 test loss 0.887
train loss 0.601 test loss 0.882


In [448]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-5, unsqueeze=True)

train loss 0.598 test loss 0.873
train loss 0.593 test loss 0.866
train loss 0.594 test loss 0.860
train loss 0.595 test loss 0.856
train loss 0.596 test loss 0.854
train loss 0.593 test loss 0.853
train loss 0.592 test loss 0.853
train loss 0.591 test loss 0.853
train loss 0.595 test loss 0.853
train loss 0.590 test loss 0.854


In [449]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-5, unsqueeze=True)

train loss 0.591 test loss 0.854
train loss 0.589 test loss 0.854
train loss 0.590 test loss 0.854
train loss 0.586 test loss 0.853
train loss 0.589 test loss 0.853
train loss 0.587 test loss 0.852
train loss 0.585 test loss 0.852
train loss 0.586 test loss 0.852
train loss 0.584 test loss 0.852
train loss 0.586 test loss 0.852


In [450]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-5, unsqueeze=True)

train loss 0.584 test loss 0.851
train loss 0.586 test loss 0.853
train loss 0.584 test loss 0.852
train loss 0.583 test loss 0.851
train loss 0.581 test loss 0.851
train loss 0.583 test loss 0.851
train loss 0.582 test loss 0.851
train loss 0.582 test loss 0.852
train loss 0.582 test loss 0.852
train loss 0.581 test loss 0.851


In [442]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-5, unsqueeze=True)

train loss 0.719 test loss 0.829
train loss 0.714 test loss 0.828
train loss 0.711 test loss 0.827
train loss 0.710 test loss 0.827
train loss 0.711 test loss 0.827
train loss 0.710 test loss 0.827
train loss 0.717 test loss 0.827
train loss 0.710 test loss 0.827
train loss 0.712 test loss 0.827
train loss 0.711 test loss 0.826


## TODO
* use t-SNE to visualize embeddings
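
A possible starting point for this item, assuming scikit-learn is installed: project the item embedding weights to 2D with `sklearn.manifold.TSNE`. The embedding below is a randomly initialized stand-in, not a trained `model.item_emb`, and the perplexity value is an arbitrary choice that must stay below the number of rows.

```python
import torch
import torch.nn as nn
from sklearn.manifold import TSNE

torch.manual_seed(0)
item_emb = nn.Embedding(50, 16)   # stand-in for a trained model.item_emb

# detach from the graph and move to NumPy before handing the weights to scikit-learn
weights = item_emb.weight.detach().cpu().numpy()
tsne = TSNE(n_components=2, perplexity=10, init="random", random_state=0)
coords = tsne.fit_transform(weights)   # one 2D point per item
print(coords.shape)
```

The resulting coordinates can be passed to a scatter plot, for example labelled with movie titles from `movies.csv`.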

# Lab
* Can you use `tags.csv` and `timestamp` to improve your predictions?
* Play with the hyperparameters
* Look at the fastai version of this network and try the transformation it uses https://github.com/fastai/fastai/blob/master/courses/dl1/lesson5-movielens.ipynb
* You may need a data loader if your data is larger. Can you construct a dataset? Here is an example:
https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel.html
* Work with the largest dataset http://files.grouplens.org/datasets/movielens/ml-latest.zip
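
For the data loader item above, one possible shape of the solution is sketched here with `torch.utils.data.Dataset` and `DataLoader`. The class name `RatingsDataset` and the toy frame are hypothetical; in the notebook the frame would be something like `df_train`, and the training loop would iterate over batches instead of the full tensors.

```python
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

class RatingsDataset(Dataset):
    """Hypothetical dataset over an encoded ratings frame (userId, movieId, rating)."""
    def __init__(self, df):
        self.users = torch.tensor(df.userId.values, dtype=torch.long)
        self.items = torch.tensor(df.movieId.values, dtype=torch.long)
        self.ratings = torch.tensor(df.rating.values, dtype=torch.float32)

    def __len__(self):
        return len(self.ratings)

    def __getitem__(self, idx):
        return self.users[idx], self.items[idx], self.ratings[idx]

# Toy frame standing in for df_train.
toy = pd.DataFrame({
    "userId": [0, 0, 1, 2, 2],
    "movieId": [0, 1, 1, 0, 2],
    "rating": [4.0, 5.0, 3.0, 2.0, 4.5],
})
loader = DataLoader(RatingsDataset(toy), batch_size=2, shuffle=True)

# Each batch is a (users, items, ratings) triple of tensors.
batch_sizes = [len(r) for _, _, r in loader]
print(batch_sizes)
```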

# References
* This notebook is based on [lesson 5 of Jeremy Howard's Deep Learning Course](https://github.com/fastai/fastai/blob/master/courses/dl1/lesson5-movielens.ipynb)