TODO:
* Try using SGD or Adagrad
* Try different similarity measures (Pearson, cosine, Euclidean, etc.)
* See if increasing dataloader workers improves training speed
* Try removing users with few ratings (like I did with movies since vanilla CF doesn't address the cold-start problem)
* Try replacing all ratings with a 1, indicating it has been watched (but then I also have to generate negative examples)
* Try to balance the dataset (to avoid a bias towards popular movies)
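
As a starting point for the item about users with few ratings, the sketch below applies the same count-and-filter idea that this notebook uses for movies, but to users. The toy frame `ratings_demo` and the threshold `min_user_ratings` are illustrative assumptions, not values from the notebook.

```python
import pandas as pd

# Toy ratings table (hypothetical data, same columns as ratings.csv)
ratings_demo = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 3, 3],
    'movieId': [10, 20, 30, 10, 20, 30],
    'rating':  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})

# Hypothetical threshold: keep users with at least this many ratings
min_user_ratings = 2

# Count ratings per user, then keep only rows from sufficiently active users
user_counts = ratings_demo.groupby('userId').size()
active_users = user_counts[user_counts >= min_user_ratings].index
ratings_filtered = ratings_demo[ratings_demo.userId.isin(active_users)].copy()

print(ratings_filtered.userId.unique())
```

Dropping users can push some movies back under the movie threshold, so the two filters may need to be applied repeatedly until both conditions hold.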

### Initialization for running in Colab

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Basic imports

In [2]:
import numpy as np
import pandas as pd
import torch
import matplotlib.pyplot as plt

### Data loading

In [3]:
datasets_dir = '/content/drive/My Drive/datasets/'
dataset_zipped = datasets_dir + 'ml-25m.zip'
dataset_unzipped = 'ml-25m/'

In [4]:
import os

if os.path.exists(dataset_unzipped):
    print('Dataset found.')
elif os.path.exists(dataset_zipped):
    !unzip "$dataset_zipped"
    print('Dataset extracted.')
else:
    raise Exception('Dataset not found.')

Archive:  /content/drive/My Drive/datasets/ml-25m.zip
   creating: ml-25m/
  inflating: ml-25m/tags.csv         
  inflating: ml-25m/links.csv        
  inflating: ml-25m/README.txt       
  inflating: ml-25m/ratings.csv      
  inflating: ml-25m/genome-tags.csv  
  inflating: ml-25m/genome-scores.csv  
  inflating: ml-25m/movies.csv       
Dataset extracted.


In [5]:
ratings = pd.read_csv(dataset_unzipped + 'ratings.csv', usecols=['userId', 'movieId', 'rating'])
ratings

Unnamed: 0,userId,movieId,rating
0,1,296,5.0
1,1,306,3.5
2,1,307,5.0
3,1,665,5.0
4,1,899,3.5
...,...,...,...
25000090,162541,50872,4.5
25000091,162541,55768,2.5
25000092,162541,56176,2.0
25000093,162541,58559,4.0


In [6]:
movies = pd.read_csv(dataset_unzipped + 'movies.csv')
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
62418,209157,We (2018),Drama
62419,209159,Window of the Soul (2001),Documentary
62420,209163,Bad Poems (2018),Comedy|Drama
62421,209169,A Girl Thing (2001),(no genres listed)


### Preprocessing

Remove movies with few ratings to sidestep the cold-start problem: with only a handful of ratings, a movie's embedding cannot be learned reliably. Such movies would need a different treatment, for example a content-based approach, which is outside the scope of this notebook.

In [7]:
rating_counts = ratings.movieId.astype('category').value_counts().rename('num_ratings')
movies = movies.merge(rating_counts, left_on='movieId', right_index=True)

In [8]:
movies.sort_values('num_ratings', ascending=False).head(10)

Unnamed: 0,movieId,title,genres,num_ratings
351,356,Forrest Gump (1994),Comedy|Drama|Romance|War,81491
314,318,"Shawshank Redemption, The (1994)",Crime|Drama,81482
292,296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller,79672
585,593,"Silence of the Lambs, The (1991)",Crime|Horror|Thriller,74127
2480,2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller,72674
257,260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi,68717
475,480,Jurassic Park (1993),Action|Adventure|Sci-Fi|Thriller,64144
522,527,Schindler's List (1993),Drama|War,60411
108,110,Braveheart (1995),Action|Drama|War,59184
2867,2959,Fight Club (1999),Action|Crime|Drama|Thriller,58773


In [9]:
# A large portion of the movies are rated by only a handful of people
movies[movies.num_ratings < 10].num_ratings.count()

34717

In [10]:
# Only keep movies with a reasonable number of ratings
num_ratings_threshold = 1000

movies_to_keep = movies[movies.num_ratings >= num_ratings_threshold].movieId
# .copy() so the ID columns can be reassigned later without a SettingWithCopyWarning
ratings = ratings[ratings.movieId.isin(movies_to_keep)].copy()
movies = movies[movies.movieId.isin(movies_to_keep)].copy()

Reset IDs

In [11]:
id2ml = dict(enumerate(ratings.movieId.unique()))
ml2id = dict((ml, id) for id, ml in id2ml.items())

In [None]:
ratings.movieId = ratings.movieId.map(ml2id)
ratings.userId = ratings.userId.astype('category').cat.codes.values
movies.movieId = movies.movieId.map(ml2id)
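
To see what the two dictionaries do, here is a small pure-Python sketch on made-up MovieLens IDs. `dict.fromkeys` stands in for `Series.unique()`, since both keep values in order of first appearance; the list `ml_ids` is hypothetical.

```python
# Hypothetical column of original MovieLens IDs, with repeats
ml_ids = [296, 306, 296, 899, 306]

# Unique IDs in order of first appearance (what Series.unique() returns)
unique_in_order = list(dict.fromkeys(ml_ids))

# Dense ID -> MovieLens ID, and the inverse mapping
id2ml = dict(enumerate(unique_in_order))
ml2id = {ml: idx for idx, ml in id2ml.items()}

# Re-indexed column: contiguous IDs starting at 0, usable as embedding row indices
dense_ids = [ml2id[ml] for ml in ml_ids]
print(dense_ids)
```

This is also why Pulp Fiction, whose `movieId` 296 is the first one to appear in `ratings.csv`, ends up with ID 0.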

In [13]:
ratings

Unnamed: 0,userId,movieId,rating
0,0,0,5.0
1,0,1,3.5
2,0,2,5.0
3,0,3,5.0
4,0,4,3.5
...,...,...,...
25000090,162538,508,4.5
25000091,162538,3169,2.5
25000092,162538,3717,2.0
25000093,162538,541,4.0


In [14]:
movies

Unnamed: 0,movieId,title,genres,num_ratings
0,47,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,57309
1,1032,Jumanji (1995),Adventure|Children|Fantasy,24228
2,952,Grumpier Old Men (1995),Comedy|Romance,11804
3,3041,Waiting to Exhale (1995),Comedy|Drama|Romance,2523
4,1664,Father of the Bride Part II (1995),Comedy,11714
...,...,...,...,...
55643,3072,Venom (2018),Action|Horror|Sci-Fi|Thriller,1098
55821,2889,Bohemian Rhapsody (2018),Drama,2151
56570,2921,Green Book (2018),Comedy|Drama,1971
56890,859,Spider-Man: Into the Spider-Verse (2018),Action|Adventure|Animation|Sci-Fi,3085


In [15]:
n_movies = len(ratings.movieId.unique())
n_users = len(ratings.userId.unique())
n_ratings = len(ratings)
print('n_movies:', n_movies)
print('n_users:', n_users)
print('n_ratings:', n_ratings)

n_movies: 3794
n_users: 162539
n_ratings: 22141815


Make data torch compatible

In [16]:
class MovieLensDataset(torch.utils.data.Dataset):
    def __init__(self, ratings):
        self.users = ratings.userId.values.astype(np.int64)
        self.movies = ratings.movieId.values.astype(np.int64)
        self.ratings = ratings.rating.values.astype(np.float32)

    def __len__(self):
        return len(self.ratings)

    def __getitem__(self, idx):
        return self.users[idx], self.movies[idx], self.ratings[idx]

In [17]:
dataset = MovieLensDataset(ratings)

test_size = 100000
val_size = 100000
train_val_size = len(dataset) - test_size
train_size = train_val_size - val_size

# Deterministic test split
torch.manual_seed(42)
trainvalset, testset = torch.utils.data.random_split(dataset, [train_val_size, test_size])
# Randomize seed again for train and validation split
torch.manual_seed(np.random.randint(100000))
trainset, valset = torch.utils.data.random_split(trainvalset, [train_size, val_size])

In [18]:
traingen = torch.utils.data.DataLoader(trainset, batch_size=256, shuffle=True, num_workers=0)
valgen = torch.utils.data.DataLoader(valset, batch_size=256, num_workers=0)
testgen = torch.utils.data.DataLoader(testset, batch_size=256, num_workers=0)

### Matrix factorization

In [19]:
import torch.nn.functional as F

class MatrixFactorization(torch.nn.Module):
    def __init__(self, embedding_dim):
        super().__init__()
        self.U = torch.nn.Parameter(torch.randn((n_users, embedding_dim)) * 0.01, requires_grad=True)
        self.I = torch.nn.Parameter(torch.randn((n_movies, embedding_dim)) * 0.01, requires_grad=True)

    def forward(self, user_id, item_id):
        # Predicted rating = dot product of the user and item embeddings
        return torch.sum(self.U[user_id] * self.I[item_id], dim=1)

    def evaluate(self, gen):
        # Mean of the per-batch MSE losses over the whole loader
        self.eval()
        with torch.no_grad():
            total_loss = 0
            for user_ids, movie_ids, ratings in gen:
                user_ids, movie_ids, ratings = user_ids.to(device), movie_ids.to(device), ratings.to(device)
                # Call self rather than the global `model`
                p = self(user_ids, movie_ids)
                total_loss += F.mse_loss(p, ratings).item()
            loss = total_loss / len(gen)
            return loss

    def fit(self, optimizer, traingen, valgen, max_epochs=None, max_steps=None, print_interval=1000, evaluate=True):
        e, i = 0, 0
        stop = False

        while True:
            self.train()
            running_loss = 0.0
            # Note: partial_loss restarts every epoch, so the first report after an
            # epoch boundary sums fewer than print_interval steps
            partial_loss = 0.0

            for user_ids, movie_ids, ratings in traingen:
                user_ids, movie_ids, ratings = user_ids.to(device), movie_ids.to(device), ratings.to(device)

                # Call self rather than the global `model`
                p = self(user_ids, movie_ids)
                loss = F.mse_loss(p, ratings)

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                # Exponential moving average of the batch loss
                running_loss = 0.99 * running_loss + 0.01 * loss.item()
                partial_loss += loss.item()

                i += 1
                if i % print_interval == 0:
                    print(f'Epoch {e}, Step {i}, Loss: {partial_loss / print_interval}')
                    partial_loss = 0
                # max_steps defaults to None, so guard the comparison
                if max_steps is not None and i >= max_steps:
                    stop = True
                    break

            print('Running training loss:', running_loss)
            if evaluate:
                val_loss = self.evaluate(valgen)
                print('Batch validation loss:', val_loss)

            e += 1
            if e == max_epochs or stop:
                break
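
The `forward` pass above scores a batch of (user, movie) pairs by taking the dot product of the matching embedding rows. The NumPy sketch below reproduces that computation on tiny, made-up matrices (`U_demo` and `I_demo` are hypothetical) and checks it against the full matrix product.

```python
import numpy as np

# Hypothetical embeddings: 2 users and 3 items, embedding_dim = 2
U_demo = np.array([[1.0, 2.0],
                   [0.0, 1.0]])
I_demo = np.array([[1.0, 0.0],
                   [2.0, 1.0],
                   [0.0, 3.0]])

# A batch of (user, item) index pairs, as a DataLoader batch would provide
user_ids = np.array([0, 1, 0])
item_ids = np.array([1, 2, 0])

# Row-wise dot product: one predicted rating per pair
preds = np.sum(U_demo[user_ids] * I_demo[item_ids], axis=1)

# The same numbers read out of the full user-by-item prediction matrix
full = U_demo @ I_demo.T
assert np.allclose(preds, full[user_ids, item_ids])

print(preds)
```

Gathering only the rows a batch needs avoids materializing the full prediction matrix, which would have `n_users * n_movies` entries (162539 by 3794 in this notebook).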

In [20]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [21]:
model = MatrixFactorization(embedding_dim=100).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0)

In [22]:
model.fit(optimizer, traingen, valgen, max_steps=100000, print_interval=5000)

Epoch 0, Step 5000, Loss: 6.316866959500313
Epoch 0, Step 10000, Loss: 1.4178538162112235
Epoch 0, Step 15000, Loss: 1.1432121218919753
Epoch 0, Step 20000, Loss: 1.0429588246941566
Epoch 0, Step 25000, Loss: 0.9838444051027297
Epoch 0, Step 30000, Loss: 0.9424167881131172
Epoch 0, Step 35000, Loss: 0.9103831383109092
Epoch 0, Step 40000, Loss: 0.8809279952168465
Epoch 0, Step 45000, Loss: 0.8580609582901001
Epoch 0, Step 50000, Loss: 0.8389864018440246
Epoch 0, Step 55000, Loss: 0.8177299563765525
Epoch 0, Step 60000, Loss: 0.8017386973500252
Epoch 0, Step 65000, Loss: 0.789416436290741
Epoch 0, Step 70000, Loss: 0.7737824049711227
Epoch 0, Step 75000, Loss: 0.761042411005497
Epoch 0, Step 80000, Loss: 0.7507939923048019
Epoch 0, Step 85000, Loss: 0.7430362506628037
Running training loss: 0.7440797622373445
Batch validation loss: 0.7437538577772468
Epoch 1, Step 90000, Loss: 0.5718674710154533
Epoch 1, Step 95000, Loss: 0.6643115600049496
Epoch 1, Step 100000, Loss: 0.6599719904124737
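
The dip to about 0.57 at step 90000 looks like a reporting artifact rather than a real improvement. `fit` sets `partial_loss` back to zero at the start of every epoch but still divides by `print_interval`, so the first report of epoch 1 sums fewer than 5000 steps. The arithmetic below rescales that entry using only numbers printed in this notebook; it assumes the last, smaller batch of the epoch is kept, which is the `DataLoader` default.

```python
import math

n_ratings = 22141815           # printed by the notebook
test_size = val_size = 100000  # split sizes used above
batch_size = 256
print_interval = 5000

# Steps in one pass over the training set (last partial batch included)
train_size = n_ratings - test_size - val_size
steps_per_epoch = math.ceil(train_size / batch_size)

# Steps of epoch 1 that had been summed when step 90000 was reported
steps_in_window = 90000 - steps_per_epoch

reported = 0.5718674710154533
corrected = reported * print_interval / steps_in_window

print(steps_per_epoch, steps_in_window, round(corrected, 4))
```

The rescaled value, about 0.667, is in line with the 0.664 and 0.660 reported at the next two intervals.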

### Recommendation

The item-centric way of creating recommendations is to compare the item embeddings and recommend items similar to ones the user likes. It's typically more robust than user-centric recommendations, since it doesn't rely on each user having rated many movies.
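
`np.corrcoef`, used in the next cell, returns the Pearson correlation between rows, which is the cosine similarity of the rows after each has been mean-centered. The sketch below checks that identity on random stand-in embeddings (`emb` is hypothetical, not the trained `model.I`) and computes plain cosine similarity as one alternative measure.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 8))  # stand-in for 5 movie embeddings of dimension 8

def cosine_similarity_matrix(x):
    # Normalize each row to unit length, then take all pairwise dot products
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return unit @ unit.T

pearson = np.corrcoef(emb)
cosine = cosine_similarity_matrix(emb)
cosine_centered = cosine_similarity_matrix(emb - emb.mean(axis=1, keepdims=True))

# Pearson correlation equals cosine similarity of mean-centered rows
assert np.allclose(pearson, cosine_centered)

# Largest gap between the two measures on this toy data
print(np.abs(pearson - cosine).max())
```

The two measures differ only to the extent that the embedding rows have a non-zero mean across dimensions, so swapping one for the other is a cheap experiment.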

In [23]:
movie_embeddings = model.I.cpu().detach().numpy()
# Pearson correlation between every pair of movie embeddings (one row per movie)
similarity_matrix = np.corrcoef(movie_embeddings)

In [24]:
def top_similar_movies(movie_id):
    similar = np.argpartition(similarity_matrix[movie_id], np.arange(-10, 0))[-10:][::-1]
    similarities = similarity_matrix[movie_id, similar]
    top_similar = pd.DataFrame({'movieId': similar, 'similarity': similarities})
    return top_similar.merge(movies, on='movieId')

In [25]:
def top_dissimilar_movies(movie_id):
    # Indices of the 10 lowest similarities, most dissimilar first
    dissimilar = np.argpartition(similarity_matrix[movie_id], np.arange(10))[:10]
    similarities = similarity_matrix[movie_id, dissimilar]
    top_dissimilar = pd.DataFrame({'movieId': dissimilar, 'similarity': similarities})
    return top_dissimilar.merge(movies, on='movieId')
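
Both helper functions rely on `np.argpartition` with a range of `kth` positions, which places just those positions in sorted order without sorting the whole row. A small sketch on made-up scores (`scores` and `k` are hypothetical) shows that this matches a full `argsort`:

```python
import numpy as np

scores = np.array([0.1, 0.9, 0.3, 0.7, 0.5])  # made-up similarity row
k = 3

# Last k positions hold the k largest values in ascending order; reverse for descending
top_k = np.argpartition(scores, np.arange(-k, 0))[-k:][::-1]
# First k positions hold the k smallest values in ascending order
bottom_k = np.argpartition(scores, np.arange(k))[:k]

# Same result as sorting everything and slicing
assert top_k.tolist() == np.argsort(scores)[::-1][:k].tolist()
assert bottom_k.tolist() == np.argsort(scores)[:k].tolist()

print(top_k, bottom_k)
```

With only 3794 movies per row the saving over a full sort is small, but the pattern scales better to larger catalogs.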

Example recommendations

In [26]:
top_similar_movies(47) # Toy Story

Unnamed: 0,movieId,similarity,title,genres,num_ratings
0,47,1.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,57309
1,150,0.929232,Toy Story 2 (1999),Adventure|Animation|Children|Comedy|Fantasy,26536
2,612,0.869688,Toy Story 3 (2010),Adventure|Animation|Children|Comedy|Fantasy|IMAX,14426
3,129,0.868671,"Bug's Life, A (1998)",Adventure|Animation|Children|Comedy,22471
4,175,0.837668,"Monsters, Inc. (2001)",Adventure|Animation|Children|Comedy|Fantasy,34572
5,31,0.798437,Finding Nemo (2003),Adventure|Animation|Children|Comedy,34712
6,823,0.797611,Up (2009),Adventure|Animation|Children|Drama,25127
7,208,0.784296,"Incredibles, The (2004)",Action|Adventure|Animation|Children|Comedy,30562
8,508,0.756532,Ratatouille (2007),Animation|Children|Drama,19157
9,69,0.749148,Aladdin (1992),Adventure|Animation|Children|Comedy|Musical,43373


In [27]:
top_similar_movies(651) # The Avengers

Unnamed: 0,movieId,similarity,title,genres,num_ratings
0,651,1.0,"Avengers, The (2012)",Action|Adventure|Sci-Fi|IMAX,17572
1,647,0.915642,Captain America: The First Avenger (2011),Action|Adventure|Sci-Fi|Thriller|War,8774
2,746,0.908917,Avengers: Age of Ultron (2015),Action|Adventure|Sci-Fi,7721
3,753,0.903804,Captain America: Civil War (2016),Action|Sci-Fi|Thriller,6302
4,542,0.901523,Iron Man (2008),Action|Adventure|Sci-Fi,25671
5,717,0.900436,Captain America: The Winter Soldier (2014),Action|Adventure|Sci-Fi|IMAX,8755
6,683,0.897058,Iron Man 3 (2013),Action|Sci-Fi|Thriller|IMAX,8265
7,637,0.885799,Thor (2011),Action|Adventure|Drama|Fantasy|IMAX,9535
8,748,0.884795,Ant-Man (2015),Action|Adventure|Sci-Fi,6406
9,610,0.878106,Iron Man 2 (2010),Action|Adventure|Sci-Fi|Thriller|IMAX,11745


In [28]:
top_similar_movies(807) # Spirited Away

Unnamed: 0,movieId,similarity,title,genres,num_ratings
0,807,1.0,Spirited Away (Sen to Chihiro no kamikakushi) ...,Adventure|Animation|Fantasy,22719
1,1658,0.917243,Howl's Moving Castle (Hauru no ugoku shiro) (2...,Adventure|Animation|Fantasy|Romance,10512
2,1653,0.916978,My Neighbor Totoro (Tonari no Totoro) (1988),Animation|Children|Drama|Fantasy,9340
3,1545,0.91269,Princess Mononoke (Mononoke-hime) (1997),Action|Adventure|Animation|Drama|Fantasy,13136
4,1721,0.894728,Nausicaä of the Valley of the Wind (Kaze no ta...,Adventure|Animation|Drama|Fantasy|Sci-Fi,5300
5,1655,0.871055,Laputa: Castle in the Sky (Tenkû no shiro Rapy...,Action|Adventure|Animation|Children|Fantasy|Sc...,5448
6,1718,0.863706,Grave of the Fireflies (Hotaru no haka) (1988),Animation|Drama|War,4835
7,1740,0.851741,Ponyo (Gake no ue no Ponyo) (2008),Adventure|Animation|Children|Fantasy,3348
8,3704,0.841205,Song of the Sea (2014),Animation|Children|Fantasy,1108
9,1737,0.839638,Paprika (Papurika) (2006),Animation|Mystery|Sci-Fi,2523


In [29]:
top_similar_movies(0) # Pulp Fiction

Unnamed: 0,movieId,similarity,title,genres,num_ratings
0,0,1.0,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller,79672
1,237,0.915222,Reservoir Dogs (1992),Crime|Mystery|Thriller,34634
2,243,0.805842,Goodfellas (1990),Crime|Drama,32663
3,1203,0.791997,True Romance (1993),Crime|Thriller,15314
4,904,0.790113,Fargo (1996),Comedy|Crime|Drama|Thriller,47823
5,232,0.781922,Trainspotting (1996),Comedy|Crime|Drama,28702
6,265,0.781907,Fight Club (1999),Action|Crime|Drama|Thriller,58773
7,1594,0.781406,Grindhouse (2007),Action|Crime|Horror|Sci-Fi|Thriller,5324
8,1028,0.77568,Jackie Brown (1997),Crime|Drama|Thriller,12436
9,577,0.774762,Inglourious Basterds (2009),Action|Drama|War,23077


Explore more movies

In [None]:
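def movie_search(movie_title):
    # Case-insensitive literal match; regex=False keeps '(' and ')' in titles from being parsed as a pattern
    return movies[movies.title.str.contains(movie_title, case=False, regex=False)]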

In [None]:
movie_search("<insert movie title>")

In [None]:
top_similar_movies(<insert movieId>)

### Save/Load model

In [30]:
models_dir = '/content/drive/My Drive/ML/models/'

In [31]:
torch.save(model.state_dict(), models_dir + 'CollaborativeFilteringMovieLens.pt')

In [None]:
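model = MatrixFactorization(100).to(device)
# map_location lets weights saved on a GPU load on a CPU-only runtime as well
model.load_state_dict(torch.load(models_dir + 'CollaborativeFilteringMovieLens.pt', map_location=device))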