<i>Copyright (c) Recommenders contributors.</i>

<i>Licensed under the MIT License.</i>

## EmbeddingDotBias Recommender

This notebook shows how to use `EmbeddingDotBias`, a collaborative filtering model similar to FastAI's [EmbeddingDotBias](https://docs.fast.ai/collab.html#EmbeddingDotBias) but implemented directly in PyTorch. The model learns an embedding for each user and each item.

In [1]:
# Suppress all warnings
import warnings
warnings.filterwarnings("ignore")

import os
import sys
import numpy as np
import pandas as pd
import torch
from tempfile import TemporaryDirectory

from recommenders.utils.constants import (
    DEFAULT_USER_COL as USER, 
    DEFAULT_ITEM_COL as ITEM, 
    DEFAULT_RATING_COL as RATING, 
    DEFAULT_TIMESTAMP_COL as TIMESTAMP, 
    DEFAULT_PREDICTION_COL as PREDICTION
)
from recommenders.datasets import movielens
from recommenders.datasets.python_splitters import python_stratified_split
from recommenders.models.embdotbias.utils import cartesian_product, score
from recommenders.evaluation.python_evaluation import map, ndcg_at_k, precision_at_k, recall_at_k
from recommenders.evaluation.python_evaluation import rmse, mae, rsquared, exp_var
from recommenders.utils.notebook_utils import store_metadata
from recommenders.utils.timer import Timer  # used below to time training
from recommenders.models.embdotbias.training_utils import Trainer, predict_rating
from recommenders.models.embdotbias.model import EmbeddingDotBias
from recommenders.models.embdotbias.data_loader import CollabDataLoaders


print("System version: {}".format(sys.version))
print("Pandas version: {}".format(pd.__version__))
print("Torch version: {}".format(torch.__version__))
print("CUDA Available: {}".format(torch.cuda.is_available()))
print("CuDNN Enabled: {}".format(torch.backends.cudnn.enabled))

System version: 3.10.10 (main, Mar 21 2023, 18:45:11) [GCC 11.2.0]
Pandas version: 2.2.3
Torch version: 2.6.0+cu124
CUDA Available: False
CuDNN Enabled: True


The column names of our dataset (`USER`, `ITEM`, `RATING`, `TIMESTAMP`, `PREDICTION`) were imported above as constants. Next we define the remaining constants for this run.

In [2]:
# top k items to recommend
TOP_K = 10

# Select MovieLens data size: 100k, 1m, 10m, or 20m
MOVIELENS_DATA_SIZE = '100k'

# Model parameters
N_FACTORS = 40
EPOCHS = 7

In [3]:
ratings_df = movielens.load_pandas_df(
    size=MOVIELENS_DATA_SIZE,
    header=[USER,ITEM,RATING,TIMESTAMP]
)

# Make sure the IDs are loaded as strings to better prevent confusion with embedding ids
ratings_df[USER] = ratings_df[USER].astype('str')
ratings_df[ITEM] = ratings_df[ITEM].astype('str')

ratings_df.head()

100%|██████████| 4.81k/4.81k [00:00<00:00, 12.7kKB/s]


| | userID | itemID | rating | timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3.0 | 881250949 |
| 1 | 186 | 302 | 3.0 | 891717742 |
| 2 | 22 | 377 | 1.0 | 878887116 |
| 3 | 244 | 51 | 2.0 | 880606923 |
| 4 | 166 | 346 | 1.0 | 886397596 |


In [4]:
# Split the dataset
train_valid_df, test_df = python_stratified_split(
    ratings_df,
    ratio=0.75, 
    min_rating=1, 
    filter_by="item", 
    col_user=USER, 
    col_item=ITEM
)
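
As a rough illustration of what an item-stratified split does, the sketch below samples 75% of each item's ratings into a training set and leaves the rest for testing. It is a simplified stand-in written with plain pandas, not the implementation of `python_stratified_split`; the toy data and the names `toy_df`, `toy_train`, and `toy_test` are made up for the example.

```python
import pandas as pd

# Toy ratings: two items with four ratings each (made-up data).
toy_df = pd.DataFrame({
    "userID": ["1", "2", "3", "4", "1", "2", "3", "4"],
    "itemID": ["10", "10", "10", "10", "20", "20", "20", "20"],
    "rating": [4.0, 3.0, 5.0, 2.0, 1.0, 4.0, 3.0, 5.0],
})

# Sample 75% of the rows of every item for training...
toy_train = toy_df.groupby("itemID", group_keys=False).sample(frac=0.75, random_state=42)
# ...and keep the remaining rows for testing.
toy_test = toy_df.drop(toy_train.index)

print(toy_train["itemID"].value_counts().to_dict())
print(len(toy_test))
```

Because the sampling happens per item, every item that has enough ratings appears in both parts, which a plain random split does not guarantee.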

In [5]:
train_valid_df

| | userID | itemID | rating | timestamp |
|---|---|---|---|---|
| 10047 | 94 | 1 | 4.0 | 885870323 |
| 44185 | 620 | 1 | 5.0 | 889987954 |
| 82784 | 779 | 1 | 4.0 | 875501555 |
| 83281 | 399 | 1 | 4.0 | 882340657 |
| 69124 | 864 | 1 | 5.0 | 877214125 |
| ... | ... | ... | ... | ... |
| 77891 | 429 | 999 | 2.0 | 882387163 |
| 31448 | 393 | 999 | 4.0 | 889730187 |
| 7847 | 125 | 999 | 4.0 | 892838288 |
| 42623 | 476 | 999 | 2.0 | 883365385 |


In [6]:
# Remove "cold" users from test set 
test_df = test_df[test_df.userID.isin(train_valid_df.userID)]

## Training

In [7]:
# Fix random seeds to make sure the runs are reproducible
np.random.seed(101)
torch.manual_seed(101)
torch.cuda.manual_seed_all(101)

In [8]:
data = CollabDataLoaders.from_df(
    train_valid_df,
    user_name=USER,
    item_name=ITEM,
    rating_name=RATING,
    valid_pct=0.001,
)

In [9]:
data.show_batch()

Showing a sample batch:
Showing 5 examples from a batch:
  userID itemID  rating
0    505    161     3.0
1    500   1010     4.0
2    172    430     3.0
3    880    380     3.0
4    158      4     4.0


We will be using 40 latent factors. The model creates an embedding for each user and each item, mapping it to a vector of 40 floats. Note that the embedding parameters are not predefined, but are learned by the model.
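
To make the role of the embeddings concrete, here is a minimal NumPy sketch of the dot-product-plus-bias score that gives this family of models its name. The toy sizes, the random values, and the helper `raw_score` are illustrative assumptions; the real model learns these parameters with PyTorch.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_factors = 3, 4, 40  # toy sizes; only n_factors matches the notebook

# One row of latent factors per user and per item, plus one scalar bias each.
user_factors = rng.normal(scale=0.1, size=(n_users, n_factors))
item_factors = rng.normal(scale=0.1, size=(n_items, n_factors))
user_bias = np.zeros(n_users)
item_bias = np.zeros(n_items)


def raw_score(user_idx: int, item_idx: int) -> float:
    """Dot product of the two embeddings plus both biases (before any range scaling)."""
    dot = float(user_factors[user_idx] @ item_factors[item_idx])
    return dot + float(user_bias[user_idx]) + float(item_bias[item_idx])


print(raw_score(0, 1))
```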

Although ratings only range from 1 to 5, we set the range of possible predictions to 0 to 5.5 -- this lets the model predict values close to 1 and 5, which improves accuracy. No weight-decay value is set explicitly in the cell below.
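
The benefit of the wider range is easiest to see if the raw score is squashed into `y_range` with a scaled sigmoid, which is how FastAI's `sigmoid_range` works; whether this implementation does exactly the same is an assumption here. Under that assumption, a sigmoid bounded by [1, 5] can only approach 5 asymptotically, while one bounded by [0, 5.5] reaches 5 at a moderate raw score. The helper names below are hypothetical.

```python
import math


def sigmoid_range(x: float, low: float, high: float) -> float:
    """Map a raw score x into the open interval (low, high) with a sigmoid."""
    return low + (high - low) / (1.0 + math.exp(-x))


def raw_score_needed(target: float, low: float, high: float) -> float:
    """Invert sigmoid_range: the raw score that yields `target` (low < target < high)."""
    p = (target - low) / (high - low)
    return math.log(p / (1.0 - p))


# With y_range = [0, 5.5], a rating of 5 needs a raw score of ln(10), about 2.3.
needed = raw_score_needed(5.0, 0.0, 5.5)
print(round(needed, 4))

# With a tight range of [1, 5], even a large raw score stays below 5.
print(sigmoid_range(6.0, 1.0, 5.0) < 5.0)
```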

In [10]:
model = EmbeddingDotBias.from_classes(
    n_factors=N_FACTORS,
    classes=data.classes,
    user=USER,
    item=ITEM,
    y_range=[0, 5.5]
)


Now train the model for 7 epochs. No learning rate is passed here, so the `Trainer` defaults apply. The trainer reduces the learning rate with each epoch using cosine annealing.
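
For readers unfamiliar with cosine annealing, the sketch below computes the textbook schedule that decays a learning rate from a maximum to a minimum along half a cosine wave. It illustrates the general formula only; the value of `lr_max`, the default `lr_min`, and the per-epoch granularity are assumptions, not the `Trainer`'s actual settings.

```python
import math


def cosine_annealed_lr(step: int, total_steps: int, lr_max: float, lr_min: float = 0.0) -> float:
    """Learning rate at `step` when annealing from lr_max to lr_min over `total_steps` steps."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Hypothetical maximum learning rate, annealed over 7 epochs
# (step 0 is the start of training, step 7 the end).
schedule = [cosine_annealed_lr(step, 7, lr_max=5e-3) for step in range(8)]
for step, lr in enumerate(schedule):
    print(step, f"{lr:.6f}")
```

The rate changes slowly at the start and end of training and fastest in the middle, which is the characteristic shape of this schedule.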

In [11]:
trainer = Trainer(model=model)

with Timer() as train_time:
    n_epochs = 7
    trainer.fit(data.train, data.valid, n_epochs)

print("Took {} seconds for training.".format(train_time))

Epoch 1/7:
Train Loss: 1.3408423032064893
Valid Loss: 0.8520179986953735
Epoch 2/7:
Train Loss: 0.9016596601939039
Valid Loss: 0.7767820954322815
Epoch 3/7:
Train Loss: 0.8278594535772305
Valid Loss: 0.7384382486343384
Epoch 4/7:
Train Loss: 0.7706746463838697
Valid Loss: 0.7467287182807922
Epoch 5/7:
Train Loss: 0.7190423823460784
Valid Loss: 0.7333086133003235
Epoch 6/7:
Train Loss: 0.6650302821306239
Valid Loss: 0.7224665284156799
Epoch 7/7:
Train Loss: 0.6063857537792812
Valid Loss: 0.7398031949996948


Save the model so it can be loaded back later for inference and for generating recommendations.

In [None]:
### TODO 
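
One common way to fill in this step with plain PyTorch is to save the model's `state_dict` and load it into a freshly constructed model of the same shape. The sketch below uses a small `nn.Embedding` as a stand-in for the trained model and a made-up file name; it is not taken from the `recommenders` library, and the notebook's `model` would need to be rebuilt with the same arguments before loading.

```python
import os
from tempfile import TemporaryDirectory

import torch
from torch import nn

# Stand-in for the trained model; in the notebook this would be `model`.
stand_in = nn.Embedding(num_embeddings=5, embedding_dim=3)

with TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "embdotbias_state.pt")  # hypothetical file name

    # Save only the learned parameters, not the whole Python object.
    torch.save(stand_in.state_dict(), path)

    # Rebuild a model with the same shape, then load the parameters into it.
    restored = nn.Embedding(num_embeddings=5, embedding_dim=3)
    restored.load_state_dict(torch.load(path))
    restored.eval()  # switch to inference mode

weights_match = torch.equal(stand_in.weight, restored.weight)
print(weights_match)
```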

## Generating Recommendations

Load the model from disk.

In [None]:
# Load the model from disk

Get all users and items that the model knows

In [19]:
# Total items & users
total_items = model.classes[ITEM][1:]
total_users = model.classes[USER][1:]

Get all users from the test set and keep only those that are known to the model, i.e. that also appeared in the training data.

In [20]:
test_users = test_df[USER].unique()
test_users = np.intersect1d(test_users, total_users)
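
A small sketch of what `np.intersect1d` returns may help here: the sorted, unique values present in both arrays. The IDs below are made up. Because the IDs are strings, the sort order is lexicographic, so `"10"` would come before `"2"`.

```python
import numpy as np

test_ids = np.array(["7", "3", "42", "3"])   # toy test-set users, with a duplicate
known_ids = np.array(["3", "7", "9"])        # toy users known to the model

# Keep only IDs that occur in both arrays; the result is sorted and de-duplicated.
kept = np.intersect1d(test_ids, known_ids)
print(kept)
```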

In [22]:
user_embeddings = model.weight(['1', '10'], is_item=False)

Example prediction


In [23]:
user_id = "1"
item_id = "10"
predicted_rating = predict_rating(model, user_id, item_id)
print(f'\nPredicted rating for user {user_id} and item {item_id}: {predicted_rating}')


Predicted rating for user 1 and item 10: 4.050769329071045


Build the cartesian product of test set users and all items known to the model

In [24]:
users_items = cartesian_product(np.array(test_users),np.array(total_items))

In [25]:
users_items = pd.DataFrame(users_items, columns=[USER,ITEM])
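
To show what the cartesian product looks like on a small scale, the sketch below pairs every user with every item using plain NumPy. It is an illustrative equivalent with toy IDs, not the implementation of the `cartesian_product` helper.

```python
import numpy as np
import pandas as pd

users = np.array(["1", "2"])
items = np.array(["10", "20", "30"])

# Repeat each user once per item, and tile the item list once per user.
pairs = np.column_stack([np.repeat(users, len(items)), np.tile(items, len(users))])
pairs_df = pd.DataFrame(pairs, columns=["userID", "itemID"])
print(pairs_df)
```

The result has `len(users) * len(items)` rows, which is why the full table for the test users and all known items runs to more than a million rows.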

In [41]:
users_items

| | userID | itemID |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 1 | 10 |
| 2 | 1 | 100 |
| 3 | 1 | 1000 |
| 4 | 1 | 1001 |
| ... | ... | ... |
| 1586121 | 99 | 995 |
| 1586122 | 99 | 996 |
| 1586123 | 99 | 997 |
| 1586124 | 99 | 998 |



Lastly, remove the user/item combinations that are in the training set -- we don't want to propose a movie that the user has already watched.

In [42]:
training_removed = pd.merge(users_items, train_valid_df.astype(str), on=[USER, ITEM], how='left')
training_removed = training_removed[training_removed[RATING].isna()][[USER, ITEM]]
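
This pattern is an anti-join: a left merge followed by keeping only the rows that found no match. The sketch below shows the same idea on toy data using the `indicator` option of `DataFrame.merge`, an alternative to checking the rating column for missing values. The frames `candidates` and `seen` are made up for the example.

```python
import pandas as pd

candidates = pd.DataFrame({"userID": ["1", "1", "2", "2"], "itemID": ["10", "20", "10", "20"]})
seen = pd.DataFrame({"userID": ["1", "2"], "itemID": ["10", "20"], "rating": [4.0, 5.0]})

# Left merge with an indicator column, then keep rows found only on the left side.
merged = candidates.merge(
    seen[["userID", "itemID"]], on=["userID", "itemID"], how="left", indicator=True
)
unseen = merged.loc[merged["_merge"] == "left_only", ["userID", "itemID"]].reset_index(drop=True)
print(unseen)
```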

In [43]:
training_removed

| | userID | itemID |
|---|---|---|
| 3 | 1 | 1000 |
| 4 | 1 | 1001 |
| 5 | 1 | 1002 |
| 6 | 1 | 1003 |
| 7 | 1 | 1004 |
| ... | ... | ... |
| 1586121 | 99 | 995 |
| 1586122 | 99 | 996 |
| 1586123 | 99 | 997 |
| 1586124 | 99 | 998 |


### Score the model to find the top K recommendations

In [44]:
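# Time the scoring step; `test_time` is recorded at the end of the notebook
with Timer() as test_time:
    top_k_scores = score(
        model,
        data,
        test_df=training_removed,
        user_col=USER,
        item_col=ITEM,
        prediction_col=PREDICTION,
    )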

In [45]:
top_k_scores

| | userID | itemID | prediction |
|---|---|---|---|
| 760 | 1 | 169 | 4.997737 |
| 1026 | 1 | 408 | 4.976365 |
| 926 | 1 | 318 | 4.887697 |
| 1343 | 1 | 694 | 4.844806 |
| 1109 | 1 | 483 | 4.833915 |
| ... | ... | ... | ... |
| 1585433 | 99 | 375 | 1.772301 |
| 1585502 | 99 | 437 | 1.758064 |
| 1585504 | 99 | 439 | 1.729182 |
| 1584542 | 99 | 1087 | 1.655640 |
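
The table above holds a score for every remaining user/item pair, ordered by prediction within each user. If only the K best items per user are needed, one way to extract them with pandas is sketched below; the toy frame `scored` and the value of `k` are assumptions for illustration.

```python
import pandas as pd

scored = pd.DataFrame({
    "userID": ["1", "1", "1", "2", "2", "2"],
    "itemID": ["10", "20", "30", "10", "20", "30"],
    "prediction": [4.2, 4.9, 3.1, 2.5, 3.7, 4.4],
})
k = 2

# Sort by user, then by descending prediction, and keep the first k rows of each user.
top_k = (
    scored.sort_values(["userID", "prediction"], ascending=[True, False])
    .groupby("userID")
    .head(k)
    .reset_index(drop=True)
)
print(top_k)
```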


Calculate some metrics for our model

In [30]:
eval_map = map(test_df, top_k_scores, col_user=USER, col_item=ITEM, 
               col_rating=RATING, col_prediction=PREDICTION, 
               relevancy_method="top_k", k=TOP_K)

In [31]:
eval_ndcg = ndcg_at_k(test_df, top_k_scores, col_user=USER, col_item=ITEM, 
                      col_rating=RATING, col_prediction=PREDICTION, 
                      relevancy_method="top_k", k=TOP_K)

In [32]:
eval_precision = precision_at_k(test_df, top_k_scores, col_user=USER, col_item=ITEM, 
                                col_rating=RATING, col_prediction=PREDICTION, 
                                relevancy_method="top_k", k=TOP_K)

In [33]:
eval_recall = recall_at_k(test_df, top_k_scores, col_user=USER, col_item=ITEM, 
                          col_rating=RATING, col_prediction=PREDICTION, 
                          relevancy_method="top_k", k=TOP_K)
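
As a reminder of what precision and recall at K measure, the sketch below computes both for a single user in plain Python. It is a simplified illustration with made-up item IDs; the library functions aggregate over all users and handle more cases than this helper, which is hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for one user.

    recommended: item IDs ranked best first.
    relevant: set of item IDs the user actually interacted with in the test set.
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


recommended = ["169", "408", "318", "694", "483"]  # toy ranking
relevant = {"408", "483", "50"}                    # toy test-set items
precision, recall = precision_recall_at_k(recommended, relevant, k=5)
print(precision, recall)
```

Two of the five recommendations are relevant, so precision is 2/5, and two of the three relevant items were found, so recall is 2/3.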

In [40]:
print("Model:\t\t" + model.__class__.__name__,
      "Top K:\t\t%d" % TOP_K,
      "MAP:\t\t%f" % eval_map,
      "NDCG:\t\t%f" % eval_ndcg,
      "Precision@K:\t%f" % eval_precision,
      "Recall@K:\t%f" % eval_recall, sep='\n')

Model:		EmbeddingDotBias
Top K:		10
MAP:		0.021975
NDCG:		0.136488
Precision@K:	0.123754
Recall@K:	0.051028


The above numbers are lower than those of [SAR](../sar_single_node_movielens.ipynb), which is expected, since the model is explicitly trying to generalize the users and items to the latent factors. Next, look at how well the model predicts the rating a user would give a movie. For this, only the user-item pairs in `test_df` need to be scored.

In [35]:
scores = score(
    model,
    data,
    test_df=test_df.copy(),
    user_col=USER,
    item_col=ITEM,
    prediction_col=PREDICTION,
)

Now calculate some regression metrics

In [36]:
eval_r2 = rsquared(test_df, scores, col_user=USER, col_item=ITEM, col_rating=RATING, col_prediction=PREDICTION)
eval_rmse = rmse(test_df, scores, col_user=USER, col_item=ITEM, col_rating=RATING, col_prediction=PREDICTION)
eval_mae = mae(test_df, scores, col_user=USER, col_item=ITEM, col_rating=RATING, col_prediction=PREDICTION)
eval_exp_var = exp_var(test_df, scores, col_user=USER, col_item=ITEM, col_rating=RATING, col_prediction=PREDICTION)

print("Model:\t\t\t" + model.__class__.__name__,
      "RMSE:\t\t\t%f" % eval_rmse,
      "MAE:\t\t\t%f" % eval_mae,
      "Explained variance:\t%f" % eval_exp_var,
      "R squared:\t\t%f" % eval_r2, sep='\n')

Model:			EmbeddingDotBias
RMSE:			0.910992
MAE:			0.714782
Explained variance:	0.346044
R squared:		0.346013
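
For reference, the regression metrics above follow the usual definitions. The sketch below computes RMSE, MAE, and R squared by hand on a few made-up ratings; it is meant to illustrate the formulas, not to reproduce the library code.

```python
import math

actual = [4.0, 3.0, 5.0, 2.0]      # toy true ratings
predicted = [3.5, 3.0, 4.0, 3.0]   # toy predictions

errors = [p - a for p, a in zip(predicted, actual)]

# Root mean squared error and mean absolute error.
rmse_value = math.sqrt(sum(e * e for e in errors) / len(errors))
mae_value = sum(abs(e) for e in errors) / len(errors)

# R squared: 1 minus residual sum of squares over total sum of squares.
mean_actual = sum(actual) / len(actual)
ss_res = sum(e * e for e in errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2_value = 1.0 - ss_res / ss_tot

print(rmse_value, mae_value, r2_value)
```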


An RMSE of about 0.91 is competitive in comparison with other models.

In [None]:
# Record results for tests - ignore this cell
store_metadata("map", eval_map)
store_metadata("ndcg", eval_ndcg)
store_metadata("precision", eval_precision)
store_metadata("recall", eval_recall)
store_metadata("rmse", eval_rmse)
store_metadata("mae", eval_mae)
store_metadata("exp_var", eval_exp_var)
store_metadata("rsquared", eval_r2)
store_metadata("train_time", train_time.interval)
store_metadata("test_time", test_time.interval)