# Collaborative Filtering with Neural Networks

In this notebook we will write a matrix factorization model in PyTorch to solve a recommendation problem. Then we will write a more general neural network model for the same problem.

Collaborative filtering systems recommend items based on similarity measures between users and/or items: the items recommended to a user are those preferred by similar users.
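
To make the idea of "similar users" concrete, here is a minimal neighborhood-based sketch. It is not the model built in this notebook, which learns latent factors instead. The toy `ratings` matrix, the `cosine_similarity` helper, and the convention that 0 means "not rated" are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy ratings: rows are users, columns are items, 0 means "not rated".
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],   # user 0
    [4.0, 5.0, 5.0, 1.0],   # user 1
    [1.0, 1.0, 2.0, 5.0],   # user 2
])

def cosine_similarity(a, b):
    """Cosine similarity between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = 0
# similarity of the target user to every other user
sims = {u: cosine_similarity(ratings[target], ratings[u])
        for u in range(len(ratings)) if u != target}
most_similar = max(sims, key=sims.get)

# recommend items that the most similar user rated but the target user has not
unseen = np.where(ratings[target] == 0)[0]
recommended = [int(i) for i in unseen if ratings[most_similar, i] > 0]
print(most_similar, recommended)
```

In this toy example user 1 has the most similar taste to user 0, so the item that only user 1 has rated is the one suggested to user 0.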

The MovieLens dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100004 ratings and 1296 tag applications across 9125 movies (https://grouplens.org/datasets/movielens/). To get the data:

`wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip`

## MovieLens dataset

In [1]:
from pathlib import Path
import pandas as pd
import numpy as np

In [2]:
PATH = Path("/Users/yinterian/teaching/ML-2/data/ml-latest-small/")
list(PATH.iterdir())

[PosixPath('/Users/yinterian/teaching/ML-2/data/ml-latest-small/links.csv'),
 PosixPath('/Users/yinterian/teaching/ML-2/data/ml-latest-small/tags.csv'),
 PosixPath('/Users/yinterian/teaching/ML-2/data/ml-latest-small/ratings.csv'),
 PosixPath('/Users/yinterian/teaching/ML-2/data/ml-latest-small/README.txt'),
 PosixPath('/Users/yinterian/teaching/ML-2/data/ml-latest-small/movies.csv')]

In [3]:
! head {PATH}/ratings.csv

userId,movieId,rating,timestamp
1,31,2.5,1260759144
1,1029,3.0,1260759179
1,1061,3.0,1260759182
1,1129,2.0,1260759185
1,1172,4.0,1260759205
1,1263,2.0,1260759151
1,1287,2.0,1260759187
1,1293,2.0,1260759148
1,1339,3.5,1260759125


In [4]:
# reading a csv into pandas
data = pd.read_csv(PATH/"ratings.csv")

In [5]:
data.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


### Encoding data
We encode the data to have contiguous ids for users and movies. You can think of this as a categorical encoding of our two categorical variables, userId and movieId.

In [6]:
# split train and validation before encoding
np.random.seed(3)
msk = np.random.rand(len(data)) < 0.8
train = data[msk].copy()
val = data[~msk].copy()

In [7]:
# here is a handy function modified from fast.ai
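def proc_col(col, train_col=None):
    """Encodes a pandas column with contiguous ids.

    If train_col is given, the mapping is built from its unique values and
    values of col not seen there are encoded as -1.
    Returns the mapping, the encoded array and the number of unique values.
    """
    if train_col is not None:
        uniq = train_col.unique()
    else:
        uniq = col.unique()
    name2idx = {o:i for i,o in enumerate(uniq)}
    return name2idx, np.array([name2idx.get(x, -1) for x in col]), len(uniq)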

In [8]:
def encode_data(df, train=None):
    """Encodes rating data with contiguous user and movie ids.
    If train is provided, encodes df with the same encoding as train
    and drops rows whose user or movie does not appear in train.
    """
    df = df.copy()
    for col_name in ["userId", "movieId"]:
        train_col = None
        if train is not None:
            train_col = train[col_name]
        _,col,_ = proc_col(df[col_name], train_col)
        df[col_name] = col
        df = df[df[col_name] >= 0]
    return df
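
The effect of building the mapping from the training data only can be seen in a minimal sketch that uses plain Python lists instead of pandas columns. The raw ids below are hypothetical; the logic mirrors `proc_col` (ids in order of first appearance, `-1` for unseen values) and the filtering step in `encode_data`.

```python
# Hypothetical raw movie ids, for illustration only
train_ids = [31, 1029, 31, 1061]
val_ids = [1029, 9999, 31]

# build the mapping from the training ids only, in order of first appearance
name2idx = {}
for raw in train_ids:
    name2idx.setdefault(raw, len(name2idx))

train_enc = [name2idx[x] for x in train_ids]
# ids never seen in training map to -1 so they can be filtered out
val_enc = [name2idx.get(x, -1) for x in val_ids]
val_kept = [i for i in val_enc if i >= 0]
print(name2idx, train_enc, val_enc, val_kept)
```

Dropping the `-1` rows matters because an embedding layer has no row for a user or movie that was never seen during training.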

In [9]:
# to check my implementation
df_t = pd.read_csv("test_data/tiny_training2.csv")
df_v = pd.read_csv("test_data/tiny_val2.csv")
df_t_e = encode_data(df_t)
df_v_e = encode_data(df_v, df_t)
df_v_e
df_t_e

Unnamed: 0,userId,movieId,rating
0,0,0,4
1,0,1,5
2,1,1,5
3,1,2,3
4,2,0,4
5,2,1,4
6,3,0,5
7,3,3,2
8,4,0,1
9,4,3,4


In [10]:
df_train = encode_data(train)
df_val = encode_data(val, train)

## Embedding layer

In [11]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [12]:
# an Embedding module for 10 users or items, with embedding size 3
# embedding will be initialized at random
embed = nn.Embedding(10, 3)
embed.weight

Parameter containing:
tensor([[-1.1881, -0.8265,  0.0338],
        [-0.5068, -1.1261, -0.5361],
        [-0.5703, -0.0012, -0.1918],
        [-0.2228,  0.4780, -0.5000],
        [ 0.0603, -1.2193, -0.3168],
        [ 0.6763,  0.7838, -0.7186],
        [-1.3731, -0.1078, -0.3492],
        [-0.2062, -0.5206,  1.0072],
        [ 1.7192, -0.7729, -0.2444],
        [-0.7064,  0.2915,  0.5784]])

In [13]:
# given a list of ids we can "look up" the embedding corresponding to each id
# can you see that some vectors are the same? (id 1 appears three times)
a = torch.LongTensor([[1,0,1,4,5,1]])
embed(a)

tensor([[[-0.5068, -1.1261, -0.5361],
         [-1.1881, -0.8265,  0.0338],
         [-0.5068, -1.1261, -0.5361],
         [ 0.0603, -1.2193, -0.3168],
         [ 0.6763,  0.7838, -0.7186],
         [-0.5068, -1.1261, -0.5361]]])
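
One way to read the lookup above: an embedding layer is a trainable matrix, and calling it with ids selects rows of that matrix. The sketch below checks this against plain indexing and against a one-hot matrix product. The seed and the sizes are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only for reproducibility
embed = nn.Embedding(10, 3)
ids = torch.LongTensor([1, 0, 1, 4, 5, 1])

looked_up = embed(ids)          # shape (6, 3)
indexed = embed.weight[ids]     # plain row indexing into the weight matrix

# one-hot vectors times the weight matrix select the same rows
one_hot = F.one_hot(ids, num_classes=10).float()
via_matmul = one_hot @ embed.weight

print(looked_up.shape)
print(torch.equal(looked_up, indexed))
print(torch.allclose(looked_up, via_matmul))
```

The lookup form avoids materializing the one-hot matrix, which would be wasteful with thousands of users or movies.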

## Matrix factorization model

In [14]:
class MF(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100):
        super(MF, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.item_emb = nn.Embedding(num_items, emb_size)
        # initializing weights
        self.user_emb.weight.data.uniform_(0,0.05)
        self.item_emb.weight.data.uniform_(0,0.05)
        
    def forward(self, u, v):
        u = self.user_emb(u)
        v = self.item_emb(v)
        return (u*v).sum(1)   
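
In equation form, the `forward` method above predicts a rating as the dot product of a user vector and an item vector. The notation is an assumption introduced here: $u_i$ is row $i$ of `user_emb`, $v_j$ is row $j$ of `item_emb`, $K$ is `emb_size`, $r_{ij}$ is an observed rating, and $D$ is the set of observed (user, item) pairs. The loss shown is the mean squared error that the training code in this notebook computes with `F.mse_loss`; the L2 penalty added through `weight_decay` is left out.

```latex
\hat{r}_{ij} = u_i \cdot v_j = \sum_{k=1}^{K} u_{ik}\, v_{jk},
\qquad
L = \frac{1}{|D|} \sum_{(i,j) \in D} \left( r_{ij} - \hat{r}_{ij} \right)^2
```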

## Debugging MF model

In [15]:
df_t_e

Unnamed: 0,userId,movieId,rating
0,0,0,4
1,0,1,5
2,1,1,5
3,1,2,3
4,2,0,4
5,2,1,4
6,3,0,5
7,3,3,2
8,4,0,1
9,4,3,4


In [16]:
num_users = 7
num_items = 4
emb_size = 3

user_emb = nn.Embedding(num_users, emb_size)
item_emb = nn.Embedding(num_items, emb_size)
users = torch.LongTensor(df_t_e.userId.values)
items = torch.LongTensor(df_t_e.movieId.values)

In [17]:
U = user_emb(users)
V = item_emb(items)

In [18]:
U

tensor([[ 1.2874, -0.4427, -0.7411],
        [ 1.2874, -0.4427, -0.7411],
        [-0.9783,  0.5827,  1.0137],
        [-0.9783,  0.5827,  1.0137],
        [ 1.0557, -0.4477, -0.4696],
        [ 1.0557, -0.4477, -0.4696],
        [-1.2368,  0.6748, -0.6885],
        [-1.2368,  0.6748, -0.6885],
        [-0.9106,  0.6204, -0.1956],
        [-0.9106,  0.6204, -0.1956],
        [ 0.2215,  0.4014, -0.1028],
        [ 0.0061,  0.4803,  0.0214],
        [ 0.0061,  0.4803,  0.0214]])

In [19]:
# element-wise multiplication
U*V 

tensor([[-0.9865, -0.4910,  0.7336],
        [ 0.2174, -0.4591, -0.5186],
        [-0.1652,  0.6042,  0.7093],
        [-1.0072, -0.7783, -0.2334],
        [-0.8090, -0.4965,  0.4648],
        [ 0.1783, -0.4642, -0.3286],
        [ 0.9478,  0.7484,  0.6815],
        [ 1.3673, -0.3748, -0.7869],
        [ 0.6978,  0.6880,  0.1936],
        [ 1.0066, -0.3446, -0.2236],
        [-0.2449, -0.2230, -0.1175],
        [ 0.0010,  0.4980,  0.0150],
        [-0.0067, -0.2667,  0.0244]])

In [20]:
# what we want is a dot product per row
(U*V).sum(1) 

tensor([-0.7440, -0.7602,  1.1483, -2.0188, -0.8407, -0.6145,  2.3776,
         0.2056,  1.5795,  0.4385, -0.5853,  0.5139, -0.2490])
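
As a sanity check on the row-wise dot product, the sketch below compares `(U*V).sum(1)` with two equivalent formulations on random tensors. The shapes and seed are arbitrary; `torch.einsum` and `torch.diagonal` are standard PyTorch functions.

```python
import torch

torch.manual_seed(0)  # arbitrary seed
U = torch.randn(5, 3)
V = torch.randn(5, 3)

rowwise = (U * V).sum(1)                      # shape (5,)
via_einsum = torch.einsum("ij,ij->i", U, V)   # same computation, spelled as an einsum
via_matmul = torch.diagonal(U @ V.T)          # computes 25 products, keeps only 5

print(rowwise.shape)
print(torch.allclose(rowwise, via_einsum, atol=1e-5))
print(torch.allclose(rowwise, via_matmul, atol=1e-5))
```

The full matrix product `U @ V.T` scores every user row against every item row, which is not needed during training, where each row is one observed (user, item) pair.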

## Training MF model

In [21]:
num_users = len(df_train.userId.unique())
num_items = len(df_train.movieId.unique())
print(num_users, num_items) 

671 8442


In [25]:
# here we are not using data loaders because our data fits well in memory
def train_epocs(model, epochs=10, lr=0.01, wd=0.0, unsqueeze=False):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=wd)
    for i in range(epochs):
        model.train()
        users = torch.LongTensor(df_train.userId.values)  #.cuda()
        items = torch.LongTensor(df_train.movieId.values) #.cuda()
        ratings = torch.FloatTensor(df_train.rating.values)  #.cuda()
        if unsqueeze:
            ratings = ratings.unsqueeze(1)
        y_hat = model(users, items)
        loss = F.mse_loss(y_hat, ratings)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        testloss = valid_loss(model, unsqueeze)
        print("train loss %.3f valid loss %.3f" % (loss.item(), testloss)) 

In [26]:
# Here is what unsqueeze does
ratings = torch.FloatTensor(df_train.rating.values)
print(ratings.shape)
ratings = ratings.unsqueeze(1) #.cuda()
ratings.shape

torch.Size([79799])


torch.Size([79799, 1])
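
The reason the shape matters: a model whose last layer is `nn.Linear(..., 1)`, like the neural network later in this notebook, returns predictions of shape `(N, 1)`. Subtracting a target of shape `(N,)` from that silently broadcasts to an `(N, N)` matrix. A small sketch with made-up values:

```python
import torch

y_hat = torch.tensor([[1.0], [2.0], [3.0]])   # shape (3, 1), like the output of a final nn.Linear
ratings = torch.tensor([1.0, 2.0, 3.0])       # shape (3,)

# mismatched shapes broadcast to a (3, 3) matrix of all pairwise differences
bad_diff = y_hat - ratings
bad_mse = (bad_diff ** 2).mean()

# after unsqueeze(1) the shapes agree and the differences are element-wise
good_diff = y_hat - ratings.unsqueeze(1)
good_mse = (good_diff ** 2).mean()

print(bad_diff.shape, bad_mse.item())
print(good_diff.shape, good_mse.item())
```

Here the predictions are perfect, so the correct loss is zero, while the broadcast version reports a nonzero value.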

In [27]:
def valid_loss(model, unsqueeze=False):
    model.eval()
    users = torch.LongTensor(df_val.userId.values) # .cuda()
    items = torch.LongTensor(df_val.movieId.values) #.cuda()
    ratings = torch.FloatTensor(df_val.rating.values) #.cuda()
    if unsqueeze:
        ratings = ratings.unsqueeze(1)
    # gradients are not needed for evaluation
    with torch.no_grad():
        y_hat = model(users, items)
        loss = F.mse_loss(y_hat, ratings)
    return loss.item()

In [28]:
model = MF(num_users, num_items, emb_size=100)  # if you have a GPU .cuda()

In [29]:
train_epocs(model, epochs=20, lr=0.1, wd=1e-5)

train loss 13.235 valid loss 5.204
train loss 5.145 valid loss 2.340
train loss 2.338 valid loss 3.533
train loss 3.336 valid loss 1.137
train loss 0.902 valid loss 2.098
train loss 1.854 valid loss 3.024
train loss 2.786 valid loss 2.531
train loss 2.293 valid loss 1.393
train loss 1.152 valid loss 1.133
train loss 0.903 valid loss 1.858
train loss 1.661 valid loss 1.633
train loss 1.486 valid loss 0.934
train loss 0.828 valid loss 1.005
train loss 0.906 valid loss 1.451
train loss 1.333 valid loss 1.542
train loss 1.396 valid loss 1.205
train loss 1.030 valid loss 0.907
train loss 0.700 valid loss 1.058
train loss 0.823 valid loss 1.302
train loss 1.050 valid loss 1.134


In [30]:
train_epocs(model, epochs=15, lr=0.01, wd=1e-5)

train loss 0.880 valid loss 0.894
train loss 0.648 valid loss 0.880
train loss 0.645 valid loss 0.915
train loss 0.690 valid loss 0.909
train loss 0.693 valid loss 0.872
train loss 0.662 valid loss 0.836
train loss 0.628 valid loss 0.819
train loss 0.610 valid loss 0.823
train loss 0.608 valid loss 0.834
train loss 0.609 valid loss 0.839
train loss 0.604 valid loss 0.836
train loss 0.591 valid loss 0.833
train loss 0.576 valid loss 0.833
train loss 0.567 valid loss 0.840
train loss 0.563 valid loss 0.848


In [31]:
train_epocs(model, epochs=15, lr=0.001, wd=1e-5)

train loss 0.562 valid loss 0.841
train loss 0.555 valid loss 0.836
train loss 0.550 valid loss 0.831
train loss 0.545 valid loss 0.828
train loss 0.541 valid loss 0.826
train loss 0.538 valid loss 0.824
train loss 0.535 valid loss 0.823
train loss 0.533 valid loss 0.821
train loss 0.530 valid loss 0.820
train loss 0.528 valid loss 0.820
train loss 0.526 valid loss 0.819
train loss 0.524 valid loss 0.818
train loss 0.522 valid loss 0.818
train loss 0.519 valid loss 0.817
train loss 0.517 valid loss 0.817


In [32]:
train_epocs(model, epochs=15, lr=0.001, wd=1e-5)

train loss 0.515 valid loss 0.817
train loss 0.512 valid loss 0.817
train loss 0.509 valid loss 0.818
train loss 0.507 valid loss 0.818
train loss 0.504 valid loss 0.818
train loss 0.502 valid loss 0.818
train loss 0.499 valid loss 0.819
train loss 0.496 valid loss 0.819
train loss 0.494 valid loss 0.819
train loss 0.491 valid loss 0.819
train loss 0.488 valid loss 0.819
train loss 0.486 valid loss 0.819
train loss 0.483 valid loss 0.819
train loss 0.480 valid loss 0.820
train loss 0.477 valid loss 0.820


## MF with bias

In [33]:
class MF_bias(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100):
        super(MF_bias, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.user_bias = nn.Embedding(num_users, 1)
        self.item_emb = nn.Embedding(num_items, emb_size)
        self.item_bias = nn.Embedding(num_items, 1)
        # init 
        self.user_emb.weight.data.uniform_(0,0.05)
        self.item_emb.weight.data.uniform_(0,0.05)
        self.user_bias.weight.data.uniform_(-0.01,0.01)
        self.item_bias.weight.data.uniform_(-0.01,0.01)
        
    def forward(self, u, v):
        U = self.user_emb(u)
        V = self.item_emb(v)
        # squeeze only the embedding dimension: (N, 1) -> (N,)
        b_u = self.user_bias(u).squeeze(1)
        b_v = self.item_bias(v).squeeze(1)
        return (U*V).sum(1) +  b_u  + b_v
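
Written as an equation, the bias model adds one learned scalar per user and one per item to the dot product. The symbols are notation introduced here, not taken from the original: $u_i$ and $v_j$ are the rows of `user_emb` and `item_emb`, $b_i$ is the entry of `user_bias` for user $i$, and $c_j$ is the entry of `item_bias` for item $j$.

```latex
\hat{r}_{ij} = u_i \cdot v_j + b_i + c_j
```

The bias terms let the model account for users who rate generously or harshly overall, and for movies that are rated high or low by most users, without using up embedding dimensions.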

In [34]:
model = MF_bias(num_users, num_items, emb_size=100) #.cuda()

In [35]:
train_epocs(model, epochs=15, lr=0.1, wd=1e-5)

train loss 13.232 valid loss 4.428
train loss 4.375 valid loss 3.504
train loss 3.483 valid loss 2.645
train loss 2.469 valid loss 0.994
train loss 0.787 valid loss 2.035
train loss 1.813 valid loss 2.752
train loss 2.520 valid loss 2.389
train loss 2.139 valid loss 1.547
train loss 1.274 valid loss 1.187
train loss 0.904 valid loss 1.537
train loss 1.283 valid loss 1.534
train loss 1.344 valid loss 1.059
train loss 0.932 valid loss 0.909
train loss 0.815 valid loss 1.135
train loss 1.044 valid loss 1.315


In [36]:
train_epocs(model, epochs=10, lr=0.01, wd=1e-5)

train loss 1.209 valid loss 1.013
train loss 0.894 valid loss 0.859
train loss 0.721 valid loss 0.835
train loss 0.675 valid loss 0.880
train loss 0.698 valid loss 0.924
train loss 0.722 valid loss 0.936
train loss 0.718 valid loss 0.922
train loss 0.692 valid loss 0.902
train loss 0.664 valid loss 0.887
train loss 0.646 valid loss 0.881


In [37]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-5)

train loss 0.640 valid loss 0.873
train loss 0.633 valid loss 0.865
train loss 0.627 valid loss 0.858
train loss 0.622 valid loss 0.852
train loss 0.617 valid loss 0.846
train loss 0.613 valid loss 0.841
train loss 0.609 valid loss 0.837
train loss 0.606 valid loss 0.833
train loss 0.603 valid loss 0.829
train loss 0.600 valid loss 0.826


In [38]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-5)

train loss 0.597 valid loss 0.824
train loss 0.595 valid loss 0.822
train loss 0.592 valid loss 0.820
train loss 0.590 valid loss 0.818
train loss 0.587 valid loss 0.817
train loss 0.585 valid loss 0.816
train loss 0.583 valid loss 0.815
train loss 0.580 valid loss 0.814
train loss 0.578 valid loss 0.813
train loss 0.576 valid loss 0.812


In [39]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-5)

train loss 0.574 valid loss 0.812
train loss 0.571 valid loss 0.811
train loss 0.569 valid loss 0.811
train loss 0.567 valid loss 0.811
train loss 0.564 valid loss 0.811
train loss 0.562 valid loss 0.811
train loss 0.560 valid loss 0.810
train loss 0.557 valid loss 0.810
train loss 0.555 valid loss 0.810
train loss 0.552 valid loss 0.810


Note that these models are sensitive to the weight initialization, the optimization algorithm, and the regularization.

## Neural Network Model

In [40]:
# Note that here there is no dot product between the embeddings: they are concatenated,
# so the user and item embeddings could potentially have different sizes.
# We could probably get better results by tuning the regularization further.

class CollabFNet(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100, n_hidden=20):
        super(CollabFNet, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.item_emb = nn.Embedding(num_items, emb_size)
        self.lin1 = nn.Linear(emb_size*2, n_hidden)
        self.lin2 = nn.Linear(n_hidden, 1)
        self.drop1 = nn.Dropout(0.3)
        self.drop2 = nn.Dropout(0.0)
        self.dense_bn = nn.BatchNorm1d(n_hidden)
        
    def forward(self, u, v):
        U = self.user_emb(u)
        V = self.item_emb(v)
        x = torch.cat([U, V], dim=1)
        x = self.drop1(x)
        x = F.relu(self.dense_bn(self.lin1(x)))
        x = self.drop2(x)
        x = self.lin2(x)
        return x
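
To illustrate the remark about embedding sizes, the sketch below concatenates user and item embeddings of different widths and feeds them to a linear layer. All sizes are hypothetical; the only constraint is that the input size of the first linear layer equals the sum of the two embedding sizes.

```python
import torch
import torch.nn as nn

user_emb = nn.Embedding(5, 8)    # hypothetical: 5 users, embedding size 8
item_emb = nn.Embedding(7, 4)    # hypothetical: 7 items, embedding size 4
lin1 = nn.Linear(8 + 4, 10)      # input size is the sum of both embedding sizes

u = torch.LongTensor([0, 1, 2])
v = torch.LongTensor([3, 4, 6])
x = torch.cat([user_emb(u), item_emb(v)], dim=1)   # shape (3, 12)
out = lin1(x)                                       # shape (3, 10)
print(x.shape, out.shape)
```

This would not work with the dot-product models, where both embeddings must have the same size.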

In [41]:
model = CollabFNet(num_users, num_items, emb_size=120, n_hidden=40) #.cuda()

In [42]:
train_epocs(model, epochs=15, lr=0.1, wd=1e-5, unsqueeze=True) 

train loss 12.418 valid loss 5.096
train loss 5.841 valid loss 7.370
train loss 1.697 valid loss 26.980
train loss 2.557 valid loss 30.591
train loss 3.289 valid loss 19.496
train loss 1.930 valid loss 9.190
train loss 1.028 valid loss 3.707
train loss 1.077 valid loss 1.729
train loss 1.468 valid loss 1.230
train loss 1.689 valid loss 1.168
train loss 1.604 valid loss 1.256
train loss 1.303 valid loss 1.509
train loss 0.989 valid loss 1.959
train loss 0.831 valid loss 2.498
train loss 0.889 valid loss 2.838


In [43]:
train_epocs(model, epochs=10, lr=0.01, wd=1e-5, unsqueeze=True)

train loss 1.044 valid loss 1.842
train loss 0.861 valid loss 1.287
train loss 0.770 valid loss 1.005
train loss 0.736 valid loss 0.889
train loss 0.743 valid loss 0.863
train loss 0.761 valid loss 0.874
train loss 0.778 valid loss 0.891
train loss 0.780 valid loss 0.902
train loss 0.770 valid loss 0.902
train loss 0.754 valid loss 0.893


In [44]:
train_epocs(model, epochs=10, lr=0.01, wd=1e-5, unsqueeze=True)

train loss 0.736 valid loss 0.848
train loss 0.699 valid loss 0.841
train loss 0.704 valid loss 0.837
train loss 0.703 valid loss 0.837
train loss 0.692 valid loss 0.851
train loss 0.680 valid loss 0.875
train loss 0.679 valid loss 0.897
train loss 0.678 valid loss 0.907
train loss 0.675 valid loss 0.904
train loss 0.668 valid loss 0.892


In [45]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-5, unsqueeze=True)

train loss 0.663 valid loss 0.881
train loss 0.659 valid loss 0.873
train loss 0.660 valid loss 0.867
train loss 0.661 valid loss 0.864
train loss 0.658 valid loss 0.863
train loss 0.654 valid loss 0.863
train loss 0.656 valid loss 0.863
train loss 0.657 valid loss 0.863
train loss 0.657 valid loss 0.863
train loss 0.654 valid loss 0.863


In [None]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-5, unsqueeze=True)

In [None]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-5, unsqueeze=True)

In [None]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-5, unsqueeze=True)

## TODO
* Use t-SNE to visualize the embeddings

# Lab
* Can you use `tags.csv` and `timestamp` to improve your predictions?
* Play with the hyperparameters
* Look at the fastai version of this network and try the transformation used there: https://github.com/fastai/fastai/blob/master/courses/dl1/lesson5-movielens.ipynb
* You may need a dataloader if your data is larger. Can you construct a dataset? Here is an example:
https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel.html
* Work with the larger dataset http://files.grouplens.org/datasets/movielens/ml-latest.zip
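
As a starting point for the dataloader exercise, here is one possible sketch of a dataset for encoded ratings. The class name `RatingsDataset` and the toy arrays are hypothetical; `Dataset` and `DataLoader` come from `torch.utils.data`. Treat it as one option under these assumptions, not the expected solution.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class RatingsDataset(Dataset):
    """Hypothetical dataset over encoded user ids, item ids and ratings."""
    def __init__(self, users, items, ratings):
        self.users = torch.as_tensor(users, dtype=torch.long)
        self.items = torch.as_tensor(items, dtype=torch.long)
        self.ratings = torch.as_tensor(ratings, dtype=torch.float32)

    def __len__(self):
        return len(self.ratings)

    def __getitem__(self, idx):
        return self.users[idx], self.items[idx], self.ratings[idx]

# toy arrays standing in for the columns of an encoded DataFrame
ds = RatingsDataset([0, 0, 1, 2, 2], [0, 1, 1, 0, 2], [4.0, 5.0, 3.0, 2.5, 4.5])
loader = DataLoader(ds, batch_size=2, shuffle=False)

# each batch is a (users, items, ratings) triple of 1-D tensors
batches = [(u.shape[0], r.dtype) for u, i, r in loader]
print(batches)
```

With the data in this notebook, the constructor arguments would be `df_train.userId.values`, `df_train.movieId.values` and `df_train.rating.values`, and the training loop would iterate over the loader, taking one optimizer step per batch.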

# References
* This notebook is based on [lesson 5 of Jeremy Howard's Deep Learning Course](https://github.com/fastai/fastai/blob/master/courses/dl1/lesson5-movielens.ipynb)