# Collaborative Filtering with Neural Networks

In this notebook we write a matrix factorization model in PyTorch to solve a recommendation problem. Then we write a more general neural model for the same problem.

Collaborative filtering systems recommend items based on similarity measures between users and/or items: the items recommended to a user are those preferred by similar users.
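
As a rough illustration of "similar users", here is a minimal sketch with a made-up ratings matrix and cosine similarity. The matrix `R`, the helper `cosine`, and the choice of cosine similarity are all assumptions for the example; the models later in this notebook learn embeddings instead of computing similarities directly.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are items, 0 means "not rated".
R = np.array([
    [5.0, 4.0, 0.0, 1.0],   # user 0
    [4.0, 5.0, 5.0, 1.0],   # user 1: tastes close to user 0
    [1.0, 1.0, 2.0, 5.0],   # user 2: tastes far from user 0
])

def cosine(a, b):
    """Cosine similarity between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = 0
others = [u for u in range(len(R)) if u != target]
sims = {u: cosine(R[target], R[u]) for u in others}
nearest = max(sims, key=sims.get)

# Recommend items the nearest neighbour rated that the target has not rated.
unseen = np.where((R[target] == 0) & (R[nearest] > 0))[0]
print(nearest, unseen)
```

Here user 1 is the nearest neighbour of user 0, so item 2 (rated by user 1 but not by user 0) is the candidate recommendation.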

The MovieLens dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100004 ratings and 1296 tag applications across 9125 movies; see https://grouplens.org/datasets/movielens/ for details. To get the data:

`wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip`

## MovieLens dataset

In [1]:
from pathlib import Path
import pandas as pd
import numpy as np

In [2]:
PATH = Path("/Users/yinterian/teaching/ML-2/data/ml-latest-small/")
list(PATH.iterdir())

[PosixPath('/Users/yinterian/teaching/ML-2/data/ml-latest-small/links.csv'),
 PosixPath('/Users/yinterian/teaching/ML-2/data/ml-latest-small/tags.csv'),
 PosixPath('/Users/yinterian/teaching/ML-2/data/ml-latest-small/ratings.csv'),
 PosixPath('/Users/yinterian/teaching/ML-2/data/ml-latest-small/README.txt'),
 PosixPath('/Users/yinterian/teaching/ML-2/data/ml-latest-small/movies.csv')]

In [3]:
! head {PATH}/ratings.csv

userId,movieId,rating,timestamp
1,31,2.5,1260759144
1,1029,3.0,1260759179
1,1061,3.0,1260759182
1,1129,2.0,1260759185
1,1172,4.0,1260759205
1,1263,2.0,1260759151
1,1287,2.0,1260759187
1,1293,2.0,1260759148
1,1339,3.5,1260759125


In [4]:
# reading a csv into pandas
data = pd.read_csv(PATH/"ratings.csv")

In [5]:
data.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


### Encoding data
We encode the data so that users and movies have contiguous ids. You can think of this as a categorical encoding of our two categorical variables, userId and movieId.

In [6]:
# split train and validation before encoding
np.random.seed(3)
msk = np.random.rand(len(data)) < 0.8
train = data[msk].copy()
val = data[~msk].copy()

In [7]:
# here is a handy function modified from fast.ai
def proc_col(col, train_col=None):
    """Encodes a pandas column with contiguous ids.

    If train_col is given, ids are built from it and values unseen there map to -1.
    """
    if train_col is not None:
        uniq = train_col.unique()
    else:
        uniq = col.unique()
    name2idx = {o:i for i,o in enumerate(uniq)}
    return name2idx, np.array([name2idx.get(x, -1) for x in col]), len(uniq)

In [8]:
def encode_data(df, train=None):
    """Encodes rating data with contiguous user and movie ids.

    If train is provided, encodes df with the same encoding as train
    and drops rows whose user or movie does not appear in train.
    """
    df = df.copy()
    for col_name in ["userId", "movieId"]:
        train_col = None
        if train is not None:
            train_col = train[col_name]
        _,col,_ = proc_col(df[col_name], train_col)
        df[col_name] = col
        df = df[df[col_name] >= 0]
    return df
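
The `-1` that `proc_col` returns for ids absent from training is what lets `encode_data` drop validation rows for unseen users or movies. The same idea can be sketched with `pd.Categorical`. This is a swapped-in pandas alternative, not the notebook's method, and the ids below are made up.

```python
import pandas as pd

# Hypothetical raw ids; 99 appears only in validation.
train_ids = pd.Series([31, 1029, 31, 1061])
val_ids = pd.Series([1061, 99, 31])

# Build the vocabulary from training data only, in order of first appearance.
vocab = train_ids.unique()

train_codes = pd.Categorical(train_ids, categories=vocab).codes
val_codes = pd.Categorical(val_ids, categories=vocab).codes

print(train_codes.tolist())  # [0, 1, 0, 2]
print(val_codes.tolist())    # [2, -1, 0]

# Rows with code -1 have no embedding row and would be filtered out.
kept = val_ids[val_codes >= 0]
```

Filtering is needed because an embedding table built from the training set has no row for an id it never saw.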

In [9]:
# to check my implementation
df_t = pd.read_csv("test_data/tiny_training2.csv")
df_v = pd.read_csv("test_data/tiny_val2.csv")
df_t_e = encode_data(df_t)
df_v_e = encode_data(df_v, df_t)
df_v_e
df_t_e

Unnamed: 0,userId,movieId,rating
0,0,0,4
1,0,1,5
2,1,1,5
3,1,2,3
4,2,0,4
5,2,1,4
6,3,0,5
7,3,3,2
8,4,0,1
9,4,3,4


In [10]:
df_train = encode_data(train)
df_val = encode_data(val, train)

## Embedding layer

In [11]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [12]:
# an Embedding module for 10 users (or items), each with an embedding of size 3
# the embedding weights are initialized at random
embed = nn.Embedding(10, 3)
embed.weight

Parameter containing:
tensor([[-2.8846, -0.6033, -1.7295],
        [-0.0655, -0.3334,  0.7959],
        [-1.6476,  0.5635,  0.3031],
        [-0.6678,  2.0692, -0.1898],
        [ 0.8253, -0.2421,  0.5325],
        [-0.5016,  0.8493,  1.2077],
        [ 0.0506,  0.7678, -0.2365],
        [ 0.8921, -0.2085, -0.7218],
        [ 0.5329,  0.8161, -2.1785],
        [ 1.2591, -0.7965, -1.1599]])

In [13]:
# given a list of ids we can "look up" the embedding corresponding to each id
# can you see that some vectors are the same?
a = torch.LongTensor([[1,0,1,4,5,1]])
embed(a)

tensor([[[-0.0655, -0.3334,  0.7959],
         [-2.8846, -0.6033, -1.7295],
         [-0.0655, -0.3334,  0.7959],
         [ 0.8253, -0.2421,  0.5325],
         [-0.5016,  0.8493,  1.2077],
         [-0.0655, -0.3334,  0.7959]]])
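
To make the "look up" concrete, the sketch below checks that calling an embedding layer gives the same result as indexing rows of its weight matrix, and the same result as multiplying a one-hot encoding by that matrix. The sizes and ids are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
embed = nn.Embedding(10, 3)
ids = torch.LongTensor([1, 0, 1, 4])

looked_up = embed(ids)                            # shape (4, 3)
indexed = embed.weight[ids]                       # plain row indexing
one_hot = F.one_hot(ids, num_classes=10).float()  # shape (4, 10)
via_matmul = one_hot @ embed.weight               # (4, 10) @ (10, 3)

print(torch.equal(looked_up, indexed))
print(torch.allclose(looked_up, via_matmul))
```

The lookup form avoids ever building the one-hot matrix, which matters when there are thousands of users or movies.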

## Matrix factorization model

In [62]:
class MF(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100):
        super(MF, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.item_emb = nn.Embedding(num_items, emb_size)
        # initializing weights
        self.user_emb.weight.data.uniform_(0,0.05)
        self.item_emb.weight.data.uniform_(0,0.05)
        
    def forward(self, u, v):
        u = self.user_emb(u)
        v = self.item_emb(v)
        return (u*v).sum(1)   
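
In symbols (the notation is introduced here for illustration and is not from the original notebook), let $u_i$ be the embedding of user $i$, $v_j$ the embedding of item $j$, and $K$ the embedding size. The `forward` method computes one dot product per (user, item) pair, and the training cells minimize the mean squared error over the $N$ observed ratings $r_{ij}$:

```latex
\hat{r}_{ij} = u_i \cdot v_j = \sum_{k=1}^{K} u_{ik}\, v_{jk},
\qquad
\mathcal{L} = \frac{1}{N} \sum_{(i,j)\ \text{observed}} \left( r_{ij} - \hat{r}_{ij} \right)^2
```

The `weight_decay` argument passed to Adam during training adds an L2 penalty on the embeddings on top of this loss.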

## Debugging MF model

In [63]:
df_t_e

Unnamed: 0,userId,movieId,rating
0,0,0,4
1,0,1,5
2,1,1,5
3,1,2,3
4,2,0,4
5,2,1,4
6,3,0,5
7,3,3,2
8,4,0,1
9,4,3,4


In [64]:
num_users = 7
num_items = 4
emb_size = 3

user_emb = nn.Embedding(num_users, emb_size)
item_emb = nn.Embedding(num_items, emb_size)
users = torch.LongTensor(df_t_e.userId.values)
items = torch.LongTensor(df_t_e.movieId.values)

In [65]:
U = user_emb(users)
V = item_emb(items)

In [66]:
U

tensor([[ 0.5304, -0.6472,  0.4554],
        [ 0.5304, -0.6472,  0.4554],
        [ 0.1543, -0.2089, -0.3614],
        [ 0.1543, -0.2089, -0.3614],
        [ 0.0763, -0.8429,  0.0911],
        [ 0.0763, -0.8429,  0.0911],
        [-1.1417, -1.1717, -0.4392],
        [-1.1417, -1.1717, -0.4392],
        [ 1.9817, -1.9040,  0.4578],
        [ 1.9817, -1.9040,  0.4578],
        [ 1.3078,  0.2057, -0.2978],
        [-0.3635, -0.2366, -1.4741],
        [-0.3635, -0.2366, -1.4741]])

In [67]:
# element-wise multiplication
U*V 

tensor([[-0.0763, -0.3313,  0.3695],
        [-0.0621,  0.4944,  0.6128],
        [-0.0181,  0.1596, -0.4864],
        [-0.1408, -0.1944, -0.1817],
        [-0.0110, -0.4314,  0.0739],
        [-0.0089,  0.6438,  0.1226],
        [ 0.1643, -0.5997, -0.3564],
        [-0.2089, -0.2127,  0.3451],
        [-0.2851, -0.9745,  0.3715],
        [ 0.3627, -0.3456, -0.3597],
        [ 0.2393,  0.0373,  0.2340],
        [ 0.0425,  0.1808, -1.9838],
        [-0.0665, -0.0430,  1.1580]])

In [68]:
# what we want is a dot product per row
(U*V).sum(1) 

tensor([-0.0381,  1.0451, -0.3449, -0.5169, -0.3685,  0.7575, -0.7919,
        -0.0766, -0.8881, -0.3426,  0.5107, -1.7605,  1.0486])
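
As a sanity check on the row-wise dot product, the sketch below compares `(U*V).sum(1)` with two equivalent formulations. The tensors are random stand-ins, not the notebook's embeddings.

```python
import torch

torch.manual_seed(0)
U = torch.randn(5, 3)   # hypothetical user vectors, one row per rating
V = torch.randn(5, 3)   # hypothetical item vectors, one row per rating

row_dot = (U * V).sum(1)
via_einsum = torch.einsum("ij,ij->i", U, V)
via_diag = torch.diag(U @ V.t())   # computes all 25 pairs, keeps the 5 on the diagonal

print(torch.allclose(row_dot, via_einsum))
print(torch.allclose(row_dot, via_diag))
```

The full product `U @ V.t()` does far more work than needed, which is why the element-wise form is used in the model.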

## Training MF model

In [69]:
num_users = len(df_train.userId.unique())
num_items = len(df_train.movieId.unique())
print(num_users, num_items) 

671 8442


In [70]:
# here we are not using data loaders because our data fits well in memory
def train_epocs(model, epochs=10, lr=0.01, wd=0.0, unsqueeze=False):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=wd)
    model.train()
    for i in range(epochs):
        users = torch.LongTensor(df_train.userId.values)  #.cuda()
        items = torch.LongTensor(df_train.movieId.values) #.cuda()
        ratings = torch.FloatTensor(df_train.rating.values)  #.cuda()
        if unsqueeze:
            ratings = ratings.unsqueeze(1)
        y_hat = model(users, items)
        loss = F.mse_loss(y_hat, ratings)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print(loss.item()) # used to be loss.data[0]
    test_loss(model, unsqueeze)

In [71]:
# Here is what unsqueeze does
ratings = torch.FloatTensor(df_train.rating.values)
print(ratings.shape)
ratings = ratings.unsqueeze(1) #.cuda()
ratings.shape

torch.Size([79799])


torch.Size([79799, 1])
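
Why the shapes matter: a model that returns predictions of shape `(N, 1)` (as `CollabFNet` does) cannot be compared directly with ratings of shape `(N,)`, because broadcasting silently expands the difference to `(N, N)`. The toy tensors below are made up, and the loss is written out by hand to show the effect.

```python
import torch

y_hat = torch.tensor([[1.0], [2.0], [3.0]])   # shape (3, 1), perfect predictions
ratings = torch.tensor([1.0, 2.0, 3.0])       # shape (3,)

# Mismatched shapes broadcast to (3, 3): every prediction is compared
# with every rating, so the "loss" is nonzero despite perfect predictions.
wrong = ((y_hat - ratings) ** 2).mean()

# After unsqueeze(1) the shapes agree and the loss is the intended one.
right = ((y_hat - ratings.unsqueeze(1)) ** 2).mean()

print((y_hat - ratings).shape)     # torch.Size([3, 3])
print(wrong.item(), right.item())
```

This is the reason `train_epocs` and `test_loss` take an `unsqueeze` flag.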

In [72]:
def test_loss(model, unsqueeze=False):
    model.eval()
    users = torch.LongTensor(df_val.userId.values) # .cuda()
    items = torch.LongTensor(df_val.movieId.values) #.cuda()
    ratings = torch.FloatTensor(df_val.rating.values) #.cuda()
    if unsqueeze:
        ratings = ratings.unsqueeze(1)
    y_hat = model(users, items)
    loss = F.mse_loss(y_hat, ratings)
    print("test loss %.3f " % loss.item())

In [190]:
model = MF(num_users, num_items, emb_size=100)  # if you have a GPU .cuda()

In [191]:
train_epocs(model, epochs=20, lr=0.1, wd=1e-5)

13.22977066040039
5.127340793609619
2.357693672180176
3.313488721847534
0.899741530418396
1.8573846817016602
2.7810661792755127
2.2830049991607666
1.1453200578689575
0.9064176082611084
1.6611082553863525
1.4776304960250854
0.8247122168540955
0.909933865070343
1.3347585201263428
1.392578125
1.0245131254196167
0.6999652981758118
0.8274641633033752
1.0505080223083496
test loss 1.131 


In [192]:
train_epocs(model, epochs=15, lr=0.01, wd=1e-5)

0.8765497803688049
0.6484661102294922
0.6463428735733032
0.6898732781410217
0.6919195055961609
0.6609521508216858
0.6276931166648865
0.6107308864593506
0.6090696454048157
0.6101611256599426
0.6040680408477783
0.5906494855880737
0.576405942440033
0.5673468708992004
0.5642893314361572
test loss 0.847 


In [119]:
train_epocs(model, epochs=15, lr=0.001, wd=1e-5)

0.6992694735527039
0.6840101480484009
0.6716464757919312
0.6620489954948425
0.6549171805381775
0.6498640179634094
0.6464440822601318
0.6442126035690308
0.6427420377731323
0.6417128443717957
0.6408921480178833
0.6401099562644958
0.6392812132835388
0.6383832097053528
0.6373997926712036
test loss 0.819 


In [120]:
train_epocs(model, epochs=15, lr=0.001, wd=1e-5)

0.6363869309425354
0.6331124305725098
0.6313668489456177
0.6303348541259766
0.629484236240387
0.6286053657531738
0.6276991367340088
0.6267718076705933
0.6258683800697327
0.6249338388442993
0.623957097530365
0.6228951811790466
0.6217635273933411
0.62052983045578
0.6192501783370972
test loss 0.814 


## MF with bias

In [110]:
class MF_bias(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100):
        super(MF_bias, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.user_bias = nn.Embedding(num_users, 1)
        self.item_emb = nn.Embedding(num_items, emb_size)
        self.item_bias = nn.Embedding(num_items, 1)
        # init 
        self.user_emb.weight.data.uniform_(0,0.05)
        self.item_emb.weight.data.uniform_(0,0.05)
        self.user_bias.weight.data.uniform_(-0.01,0.01)
        self.item_bias.weight.data.uniform_(-0.01,0.01)
        
    def forward(self, u, v):
        U = self.user_emb(u)
        V = self.item_emb(v)
        # bias lookups have shape (N, 1); squeeze(1) drops only that last dimension
        b_u = self.user_bias(u).squeeze(1)
        b_v = self.item_bias(v).squeeze(1)
        return (U*V).sum(1) +  b_u  + b_v
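
A short shape walk-through of the bias term, using a made-up bias table: an `nn.Embedding(num_users, 1)` lookup returns shape `(N, 1)`, so it has to be squeezed before it is added to the `(N,)` vector of dot products. The sketch uses `squeeze(1)` so that a batch of size one would keep its batch dimension.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
user_bias = nn.Embedding(5, 1)        # one scalar bias per user
u = torch.LongTensor([0, 3, 3])
dots = torch.zeros(3)                 # stand-in for (U*V).sum(1), shape (3,)

raw = user_bias(u)                    # shape (3, 1)
b_u = raw.squeeze(1)                  # shape (3,)

print((dots + b_u).shape)             # torch.Size([3])
print((dots + raw).shape)             # torch.Size([3, 3]) via broadcasting
```

Without the squeeze, the sum would broadcast to an `(N, N)` matrix instead of one prediction per rating.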

In [177]:
model = MF_bias(num_users, num_items, emb_size=100) #.cuda()

In [178]:
train_epocs(model, epochs=10, lr=0.1, wd=1e-5)

13.233558654785156
4.632496356964111
3.3016626834869385
2.066195249557495
0.9793775081634521
2.2044050693511963
2.879669189453125
2.432570219039917
1.4973409175872803
1.0138121843338013
test loss 1.412 


In [179]:
train_epocs(model, epochs=10, lr=0.01, wd=1e-5)

1.2231932878494263
0.8970196843147278
0.802173376083374
0.8413064479827881
0.9154934287071228
0.9634409546852112
0.9676942229270935
0.9368249773979187
0.8884711265563965
0.8396408557891846
test loss 0.932 


In [180]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-5)

0.801849901676178
0.7882671356201172
0.7760322093963623
0.7650706768035889
0.7552807331085205
0.7464788556098938
0.7385380268096924
0.7313017845153809
0.7246699333190918
0.7185471653938293
test loss 0.858 


In [181]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-5)

0.7128633856773376
0.7066500782966614
0.7009379267692566
0.6956583261489868
0.6907535195350647
0.686170756816864
0.6818504929542542
0.6777631044387817
0.6738976240158081
0.6702176332473755
test loss 0.828 


Note that these models are sensitive to the weight initialization, the optimization algorithm and the regularization.
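
One way to see the effect of initialization is to look at the predictions before any training. The sketch below samples embeddings with the same `uniform_(0, 0.05)` range used in `MF`; the sample size `n` is an arbitrary choice for the illustration.

```python
import torch

torch.manual_seed(0)
emb_size = 100
n = 10_000   # hypothetical number of (user, item) pairs

u = torch.empty(n, emb_size).uniform_(0, 0.05)
v = torch.empty(n, emb_size).uniform_(0, 0.05)
pred = (u * v).sum(1)

# Each coordinate has mean 0.025, so the expected dot product is
# emb_size * 0.025 * 0.025 = 0.0625.
print(pred.mean().item())
```

Initial predictions therefore sit near 0.06 while the ratings are on a 5-star scale, so the first-epoch loss is roughly the mean squared rating. That is consistent with the first printed losses of about 13 above, and a different initialization range would give a different starting point.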

## Neural Network Model

In [182]:
# Note that here the user and item embeddings are concatenated, not multiplied
# together, so they could potentially have different sizes.
# We could probably get better results by tuning the regularization further.
    
class CollabFNet(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100, n_hidden=30):
        super(CollabFNet, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.item_emb = nn.Embedding(num_items, emb_size)
        self.lin1 = nn.Linear(emb_size*2, n_hidden)
        self.lin2 = nn.Linear(n_hidden, 1)
        self.drop1 = nn.Dropout(0.1)
        self.drop2 = nn.Dropout(0.0)
        
    def forward(self, u, v):
        U = self.user_emb(u)
        V = self.item_emb(v)
        x = torch.cat([U, V], dim=1)
        x = self.drop1(x)
        x = F.relu(self.lin1(x))
        x = self.drop2(x)
        x = self.lin2(x)
        return x
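
Because the embeddings are concatenated, the user and item embeddings do not have to be the same size. Here is a minimal sketch of such a variant; the class name `CollabFNetAsym`, its parameter names and the default sizes are hypothetical, and dropout is omitted to keep it short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CollabFNetAsym(nn.Module):
    """Hypothetical variant of CollabFNet with separate embedding sizes."""
    def __init__(self, num_users, num_items,
                 user_emb_size=50, item_emb_size=20, n_hidden=30):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, user_emb_size)
        self.item_emb = nn.Embedding(num_items, item_emb_size)
        # the first linear layer sees both embeddings side by side
        self.lin1 = nn.Linear(user_emb_size + item_emb_size, n_hidden)
        self.lin2 = nn.Linear(n_hidden, 1)

    def forward(self, u, v):
        x = torch.cat([self.user_emb(u), self.item_emb(v)], dim=1)
        x = F.relu(self.lin1(x))
        return self.lin2(x)

model = CollabFNetAsym(num_users=7, num_items=4)
u = torch.LongTensor([0, 1, 6])
v = torch.LongTensor([3, 0, 2])
out = model(u, v)
print(out.shape)  # torch.Size([3, 1])
```

The output has shape `(N, 1)`, so it would be trained with `unsqueeze=True` just like `CollabFNet`.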

In [183]:
model = CollabFNet(num_users, num_items, emb_size=100) #.cuda()

In [184]:
train_epocs(model, epochs=10, lr=0.1, wd=1e-5, unsqueeze=True) 

14.124415397644043
3.733110189437866
5.349889755249023
2.7663002014160156
3.6660704612731934
2.705371379852295
2.598076581954956
1.3190714120864868
1.2373065948486328
1.274068832397461
test loss 1.324 


In [185]:
train_epocs(model, epochs=10, lr=0.01, wd=1e-5, unsqueeze=True)

1.237608790397644
1.0828303098678589
0.9421415328979492
0.916451096534729
0.8738793134689331
0.8391649723052979
0.8309617042541504
0.8109890818595886
0.7898802161216736
0.7823166847229004
test loss 0.916 


In [186]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-5, unsqueeze=True)

0.7763959169387817
0.7684773206710815
0.7606607675552368
0.7575645446777344
0.7530223727226257
0.7507917881011963
0.7472320199012756
0.747545063495636
0.743499755859375
0.7400220036506653
test loss 0.900 


In [187]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-5, unsqueeze=True)

0.7428808212280273
0.7386583685874939
0.7383617758750916
0.7336496114730835
0.7367436289787292
0.7332612872123718
0.7299482226371765
0.7304689884185791
0.7287819385528564
0.7290533185005188
test loss 0.901 


In [188]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-5, unsqueeze=True)

0.7253053784370422
0.7274237275123596
0.7232051491737366
0.7221526503562927
0.7207226157188416
0.7186670303344727
0.7201763391494751
0.716482400894165
0.7174072265625
0.7140310406684875
test loss 0.890 


In [189]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-5, unsqueeze=True)

0.7129399180412292
0.7138351798057556
0.7113379240036011
0.7109148502349854
0.7072175145149231
0.7107471227645874
0.7082176804542542
0.7069698572158813
0.7074231505393982
0.7022867798805237
test loss 0.885 


## TODO
* Use t-SNE to visualize the embeddings

# Lab
* Can you use `tags.csv` and `timestamp` to improve your predictions?
* Play with the hyperparameters
* Look at the fastai version of this network and try the transformation it uses: https://github.com/fastai/fastai/blob/master/courses/dl1/lesson5-movielens.ipynb
* You may need a data loader if your data is larger. Can you construct a dataset? Here is an example:
https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel.html
* Work with the largest dataset http://files.grouplens.org/datasets/movielens/ml-latest.zip
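
As a starting point for the data loader exercise, here is a minimal sketch using `torch.utils.data`. The class name `RatingsDataset` and the tiny frame are made up; in the notebook the frame would be an encoded one such as `df_train`.

```python
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

class RatingsDataset(Dataset):
    """Hypothetical dataset over an encoded ratings frame."""
    def __init__(self, df):
        self.users = torch.tensor(df.userId.values, dtype=torch.long)
        self.items = torch.tensor(df.movieId.values, dtype=torch.long)
        self.ratings = torch.tensor(df.rating.values, dtype=torch.float32)

    def __len__(self):
        return len(self.ratings)

    def __getitem__(self, idx):
        return self.users[idx], self.items[idx], self.ratings[idx]

# Tiny made-up frame standing in for df_train.
df = pd.DataFrame({"userId": [0, 0, 1, 2, 2],
                   "movieId": [0, 1, 1, 0, 2],
                   "rating": [4.0, 5.0, 3.0, 2.0, 4.5]})

loader = DataLoader(RatingsDataset(df), batch_size=2, shuffle=True)
sizes = [r.shape[0] for _, _, r in loader]
print(sizes)  # [2, 2, 1]
```

A training loop would then iterate over `loader` and take one optimizer step per batch instead of one step per epoch on the full data.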

# References
* This notebook is based on [lesson 5 of Jeremy Howard's Deep Learning Course](https://github.com/fastai/fastai/blob/master/courses/dl1/lesson5-movielens.ipynb)