<i>Copyright (c) Recommenders contributors.</i>

<i>Licensed under the MIT License.</i>

## EmbeddingDotBias Recommender

This notebook shows how to use `EmbeddingDotBias`, a model similar to FastAI's [EmbeddingDotBias](https://docs.fast.ai/collab.html#embeddingdotbias) but implemented directly in PyTorch. The model learns an embedding for each user and each item.

In [1]:
# Suppress all warnings
import warnings
warnings.filterwarnings("ignore")

import os
import sys
import logging
import numpy as np
import pandas as pd
import torch
from tempfile import TemporaryDirectory

from recommenders.utils.constants import (
    DEFAULT_USER_COL as USER, 
    DEFAULT_ITEM_COL as ITEM, 
    DEFAULT_RATING_COL as RATING, 
    DEFAULT_TIMESTAMP_COL as TIMESTAMP, 
    DEFAULT_PREDICTION_COL as PREDICTION
)

from recommenders.datasets import movielens
from recommenders.datasets.python_splitters import python_stratified_split
from recommenders.evaluation.python_evaluation import (exp_var, mae, map,
                                                       ndcg_at_k,
                                                       precision_at_k,
                                                       recall_at_k, rmse,
                                                       rsquared)
from recommenders.models.embdotbias.data_loader import RecoDataLoader
from recommenders.models.embdotbias.model import EmbeddingDotBias
from recommenders.models.embdotbias.training_utils import (Trainer,
                                                           predict_rating)
from recommenders.models.embdotbias.utils import cartesian_product, score
from recommenders.utils.notebook_utils import store_metadata
from recommenders.utils.timer import Timer

logging.basicConfig(level=logging.INFO, format="%(levelname)s - %(message)s")

print(f"System version: {sys.version}")
print(f"Pandas version: {pd.__version__}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA Available: {torch.cuda.is_available()}")
print(f"CuDNN Enabled: {torch.backends.cudnn.enabled}")

System version: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0]
Pandas version: 2.2.2
PyTorch version: 2.3.1+cu121
CUDA Available: True
CuDNN Enabled: True


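Define the constants used in the rest of the notebook: the number of items to recommend, the MovieLens dataset size, and the model parameters. The column names (`USER`, `ITEM`, `RATING`, `TIMESTAMP`, `PREDICTION`) were already imported above.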

In [2]:
# top k items to recommend
TOP_K = 10

# Select MovieLens data size: 100k, 1m, 10m, or 20m
MOVIELENS_DATA_SIZE = "100k"

# Model parameters
N_FACTORS = 40
EPOCHS = 7
SEED = 101

In [3]:
ratings_df = movielens.load_pandas_df(
    size=MOVIELENS_DATA_SIZE,
    header=[USER,ITEM,RATING,TIMESTAMP]
)

# Make sure the IDs are loaded as strings to better prevent confusion with embedding ids
ratings_df[USER] = ratings_df[USER].astype("str")
ratings_df[ITEM] = ratings_df[ITEM].astype("str")

ratings_df.head()

INFO - Downloading https://files.grouplens.org/datasets/movielens/ml-100k.zip
100%|██████████| 4.81k/4.81k [00:01<00:00, 3.56kKB/s]



Unnamed: 0,userID,itemID,rating,timestamp
0,196,242,3.0,881250949
1,186,302,3.0,891717742
2,22,377,1.0,878887116
3,244,51,2.0,880606923
4,166,346,1.0,886397596


In [4]:
# Split the dataset
train_valid_df, test_df = python_stratified_split(
    ratings_df,
    ratio=0.75, 
    min_rating=1, 
    filter_by="item", 
    col_user=USER, 
    col_item=ITEM,
    seed=SEED
)

In [5]:
train_valid_df

Unnamed: 0,userID,itemID,rating,timestamp
99941,593,1,3.0,875659150
63031,879,1,4.0,887761865
66516,216,1,4.0,880232615
21048,200,1,5.0,876042340
78925,933,1,3.0,874854294
...,...,...,...,...
10413,336,999,2.0,877757516
7847,125,999,4.0,892838288
34637,417,999,3.0,880952434
42623,476,999,2.0,883365385


In [6]:
# Remove "cold" users from test set 
test_df = test_df[test_df[USER].isin(train_valid_df[USER])]

## Training

In [7]:
# Fix random seeds to make sure the runs are reproducible
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

In [8]:
data = RecoDataLoader.from_df(
    train_valid_df,
    user_name=USER,
    item_name=ITEM,
    rating_name=RATING,
    valid_pct=0.1
)

In [9]:
data.show_batch()

Showing a sample batch:
Showing 5 examples from a batch:
  userID itemID  rating
0    710    302     4.0
1    588    554     3.0
2     92    452     2.0
3    727     56     3.0
4    535    212     4.0


We will be using 40 latent factors. This will create an embedding for the users and the items that will map each of these to 40 floats as can be seen below. Note that the embedding parameters are not predefined, but are learned by the model.

Although ratings only range from 1 to 5, we set the range of possible predictions to 0 to 5.5. The wider range allows the model to predict values close to 1 and 5, which improves accuracy.

In [10]:
model = EmbeddingDotBias.from_classes(
    n_factors=N_FACTORS,
    classes=data.classes,
    user=USER,
    item=ITEM,
    y_range=[0,5.5]
)
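To make the role of the latent factors and `y_range` concrete, here is a minimal, dependency-free sketch of how a dot-product-with-bias model can turn one user vector and one item vector into a rating. It follows the FastAI formulation (dot product plus biases, squashed by a scaled sigmoid). The helper names `sigmoid_range` and `predict` and all numbers are illustrative assumptions, not the `recommenders` implementation.

```python
import math


def sigmoid_range(x, low, high):
    """Squash a raw score into the open interval (low, high)."""
    return low + (high - low) / (1.0 + math.exp(-x))


def predict(user_vec, item_vec, user_bias, item_bias, y_range=(0.0, 5.5)):
    """Dot product of the latent factors plus both biases, mapped into y_range."""
    raw = sum(u * i for u, i in zip(user_vec, item_vec)) + user_bias + item_bias
    return sigmoid_range(raw, *y_range)


# Toy example with 3 latent factors instead of 40 (all values are made up).
user_vec = [0.5, -0.2, 0.1]
item_vec = [0.4, 0.3, -0.6]
rating = predict(user_vec, item_vec, user_bias=0.3, item_bias=0.2)

# A raw score of 0 lands on the midpoint of the range.
midpoint = sigmoid_range(0.0, 0.0, 5.5)

# With the range (0, 5.5), a finite raw score of ln(10) maps to a rating of 5.
# With the range (1, 5), the same raw score stays clearly below 5, because a
# sigmoid only approaches its upper bound and never reaches it.
five = sigmoid_range(math.log(10), 0.0, 5.5)
narrow = sigmoid_range(math.log(10), 1.0, 5.0)
```

Under these assumptions, the margin above 5 and below 1 is what lets moderate raw scores produce predictions at the ends of the rating scale.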

Now train the model for 7 epochs setting the maximal learning rate. The learner will reduce the learning rate with each epoch using cosine annealing.

In [11]:
trainer = Trainer(model=model)

with Timer() as train_time:
    trainer.fit(data.train, data.valid, EPOCHS)

print(f"Took {train_time} seconds for training.")

INFO - Epoch 1/7:
INFO - Train Loss: 1.3875741174613887
INFO - Valid Loss: 1.0270110111115343
INFO - Epoch 2/7:
INFO - Train Loss: 0.908381488456419
INFO - Valid Loss: 0.9222675213369272
INFO - Epoch 3/7:
INFO - Train Loss: 0.821684703202636
INFO - Valid Loss: 0.8861896385580806
INFO - Epoch 4/7:
INFO - Train Loss: 0.7628276834941723
INFO - Valid Loss: 0.8663221562312822
INFO - Epoch 5/7:
INFO - Train Loss: 0.7107005440488909
INFO - Valid Loss: 0.8576887482303684
INFO - Epoch 6/7:
INFO - Train Loss: 0.6560591047234607
...

Took 63.6546 seconds for training.
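The training description mentions cosine annealing. As a rough illustration, the standard cosine schedule fits in a few lines. Whether `Trainer` uses exactly this form is an assumption, and `lr_max` and `lr_min` are hypothetical values chosen only for this sketch.

```python
import math


def cosine_annealing_lr(step, total_steps, lr_max=5e-3, lr_min=0.0):
    """Learning rate at `step`, decaying from lr_max to lr_min along a half cosine."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# One value per epoch boundary for a 7-epoch run (hypothetical lr_max of 5e-3).
epochs = 7
schedule = [cosine_annealing_lr(epoch, epochs) for epoch in range(epochs + 1)]
```

The schedule starts at `lr_max`, falls slowly at first, fastest in the middle of training, and flattens out again as it approaches `lr_min`.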


Save the model's state dictionary so it can be loaded back later for inference and for generating recommendations.

In [12]:
tmp = TemporaryDirectory()
model_path = os.path.join(tmp.name, "embdotbias_model.pth")

torch.save(model.state_dict(), model_path)
print(f"Model saved to: {model_path}")

Model saved to: /tmp/tmp3det9ii6/embdotbias_model.pth


## Generating Recommendations

Load the model from disk: rebuild the architecture with the same parameters, then restore the saved weights.

In [13]:
loaded_model = EmbeddingDotBias.from_classes(
    n_factors=N_FACTORS, 
    classes=data.classes, 
    user=USER,
    item=ITEM,
    y_range=[0,5.5] 
)

# Load the state dictionary
loaded_model.load_state_dict(torch.load(model_path))

# Set the model to evaluation mode
loaded_model.eval()

print("Model loaded successfully.")

Model loaded successfully.


Get all users and items that the model knows

In [14]:
# Total items & users
total_items = loaded_model.classes[ITEM][1:]
total_users = loaded_model.classes[USER][1:]

Get all users from the test set and remove any users that were not known in the training set

In [15]:
test_users = test_df[USER].unique()
test_users = np.intersect1d(test_users, total_users)

Example prediction


In [16]:
first_batch = next(iter(data.train))
user_idx = first_batch[0][0, 0].item()  
user_id = data.classes[USER][user_idx]  
item_idx = first_batch[0][0, 1].item() 
item_id = data.classes[ITEM][item_idx]  
print(f"User ID: {user_id}, Item ID: {item_id}")

User ID: 864, Item ID: 232


In [17]:

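try:
    # Look up the learned embedding of this user only; the item ID does not
    # belong in a user lookup (is_item=False).
    user_embedding = loaded_model.weight([user_id], is_item=False)
    predicted_rating = predict_rating(loaded_model, user_id, item_id)
    print(f"Predicted rating for user {user_id} and item {item_id}: {predicted_rating}")
except KeyError as e:
    # Raised when the user or item ID is unknown to the model
    print(f"Error: {e}")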

Predicted rating for user 864 and item 232: 3.881427526473999


Build the cartesian product of test set users and all items known to the model

In [18]:
users_items = cartesian_product(np.array(test_users),np.array(total_items))

In [19]:
users_items = pd.DataFrame(users_items, columns=[USER,ITEM])

In [20]:
users_items

Unnamed: 0,userID,itemID
0,1,1
1,1,10
2,1,100
3,1,1000
4,1,1001
...,...,...
1586121,99,995
1586122,99,996
1586123,99,997
1586124,99,998
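The cartesian product pairs every test user with every item the model knows. A minimal sketch with the standard library shows the idea. The function name `cartesian_pairs` and the toy IDs are assumptions for illustration, and the library's `cartesian_product` helper may be implemented differently.

```python
from itertools import product


def cartesian_pairs(users, items):
    """Return every (user, item) combination, users varying slowest."""
    return [(user, item) for user, item in product(users, items)]


# Toy IDs, kept as strings like the IDs in the notebook.
users = ["1", "2"]
items = ["10", "20", "30"]
pairs = cartesian_pairs(users, items)
```

The result has `len(users) * len(items)` rows, which is why the table above grows to more than 1.5 million rows on MovieLens 100k.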



Lastly, remove the user/item combinations that are in the training set, since we don't want to recommend a movie that the user has already rated.

In [21]:
users_items_candidates = pd.merge(users_items, train_valid_df.astype(str), on=[USER, ITEM], how="left")
users_items_candidates = users_items_candidates[users_items_candidates[RATING].isna()][[USER, ITEM]]

In [22]:
users_items_candidates

Unnamed: 0,userID,itemID
3,1,1000
4,1,1001
5,1,1002
6,1,1003
7,1,1004
...,...,...
1586121,99,995
1586122,99,996
1586123,99,997
1586124,99,998
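The left merge followed by keeping only the rows with a missing rating is an anti-join: it keeps the candidate pairs that have no match in the training data. The same logic can be sketched without pandas. The function name `remove_seen` and the toy pairs are assumptions for illustration.

```python
def remove_seen(candidates, seen):
    """Keep only the (user, item) pairs that do not appear in `seen`."""
    seen_set = set(seen)  # set membership keeps the filter linear in len(candidates)
    return [pair for pair in candidates if pair not in seen_set]


candidates = [("1", "10"), ("1", "20"), ("2", "10"), ("2", "20")]
train_pairs = [("1", "10"), ("2", "20")]
unseen = remove_seen(candidates, train_pairs)
```

The order of the remaining candidates is preserved, which matches the behavior visible in the filtered table above.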


### Score the model to find the top K recommendations

In [23]:
top_k_scores = score(
    loaded_model, 
    test_df=users_items_candidates,
    user_col=USER,
    item_col=ITEM,
    prediction_col=PREDICTION
)

In [24]:
top_k_scores

Unnamed: 0,userID,itemID,prediction
1642,1,963,5.101374
1109,1,483,5.003863
1026,1,408,4.969304
780,1,187,4.891338
1143,1,513,4.880493
...,...,...,...
1584764,99,1287,1.850188
1585974,99,862,1.774681
1585488,99,424,1.739392
1586100,99,976,1.690039
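Once every candidate pair has a predicted rating, the recommendations for a user are simply that user's highest-scoring items. The following sketch selects the top `k` items per user from a list of scored triples. The function name `top_k_per_user` and the scores are made up for illustration and do not reproduce the library's `score` function.

```python
from collections import defaultdict
from heapq import nlargest


def top_k_per_user(scored, k):
    """Map each user to the k items with the highest predicted score."""
    by_user = defaultdict(list)
    for user, item, prediction in scored:
        by_user[user].append((prediction, item))
    return {
        user: [item for _, item in nlargest(k, pairs)]
        for user, pairs in by_user.items()
    }


# (user, item, predicted rating) triples with made-up scores.
scored = [
    ("1", "963", 5.10),
    ("1", "483", 5.00),
    ("1", "408", 4.97),
    ("99", "976", 1.69),
    ("99", "862", 1.77),
]
top_2 = top_k_per_user(scored, k=2)
```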


Calculate some metrics for our model

In [25]:
eval_map = map(test_df, top_k_scores, col_user=USER, col_item=ITEM, 
               col_rating=RATING, col_prediction=PREDICTION, 
               relevancy_method="top_k", k=TOP_K)

In [26]:
eval_ndcg = ndcg_at_k(test_df, top_k_scores, col_user=USER, col_item=ITEM, 
                      col_rating=RATING, col_prediction=PREDICTION, 
                      relevancy_method="top_k", k=TOP_K)

In [27]:
eval_precision = precision_at_k(test_df, top_k_scores, col_user=USER, col_item=ITEM, 
                                col_rating=RATING, col_prediction=PREDICTION, 
                                relevancy_method="top_k", k=TOP_K)

In [28]:
eval_recall = recall_at_k(test_df, top_k_scores, col_user=USER, col_item=ITEM, 
                          col_rating=RATING, col_prediction=PREDICTION, 
                          relevancy_method="top_k", k=TOP_K)

In [29]:
print("Model:\t\t" + model.__class__.__name__,
      "Top K:\t\t%d" % TOP_K,
      "MAP:\t\t%f" % eval_map,
      "NDCG:\t\t%f" % eval_ndcg,
      "Precision@K:\t%f" % eval_precision,
      "Recall@K:\t%f" % eval_recall, sep='\n')

Model:		EmbeddingDotBias
Top K:		10
MAP:		0.020839
NDCG:		0.131409
Precision@K:	0.121633
Recall@K:	0.047912
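To help interpret these numbers, here is a small sketch of precision@k, recall@k, and NDCG@k for a single user, using common textbook definitions with binary relevance. The function names and toy data are assumptions. The `recommenders` evaluation functions aggregate over all users and may differ in details.

```python
import math


def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for one user's ranked list."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def ndcg_at_k(recommended, relevant, k):
    """NDCG@k with binary relevance: a hit at rank r contributes 1 / log2(r + 1)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(recommended[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Ranked recommendations for one user and the items that user actually liked.
recommended = ["a", "b", "c", "d", "e"]
relevant = {"b", "e", "z"}
precision, recall = precision_recall_at_k(recommended, relevant, k=5)
ndcg = ndcg_at_k(recommended, relevant, k=5)
```

Here two of the five recommendations are relevant, so precision@5 is 2/5 and recall@5 is 2/3. NDCG is lower than it would be for the same hits at ranks 1 and 2, because hits further down the list are discounted.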


The ranking metrics above are lower than those of [SAR](../sar_single_node_movielens.ipynb). This is expected, since the model is explicitly trying to generalize the users and items to the latent factors. Next, look at how well the model predicts the rating a user would give a movie. For this, only the user-item pairs in `test_df` need to be scored.

In [30]:
scores = score(
    model,
    test_df=test_df, 
    user_col=USER, 
    item_col=ITEM, 
    prediction_col=PREDICTION
)

Now calculate some regression metrics

In [31]:
eval_r2 = rsquared(test_df, scores, col_user=USER, col_item=ITEM, col_rating=RATING, col_prediction=PREDICTION)
eval_rmse = rmse(test_df, scores, col_user=USER, col_item=ITEM, col_rating=RATING, col_prediction=PREDICTION)
eval_mae = mae(test_df, scores, col_user=USER, col_item=ITEM, col_rating=RATING, col_prediction=PREDICTION)
eval_exp_var = exp_var(test_df, scores, col_user=USER, col_item=ITEM, col_rating=RATING, col_prediction=PREDICTION)

print("Model:\t\t\t" + model.__class__.__name__,
      "RMSE:\t\t\t%f" % eval_rmse,
      "MAE:\t\t\t%f" % eval_mae,
      "Explained variance:\t%f" % eval_exp_var,
      "R squared:\t\t%f" % eval_r2, sep='\n')

Model:			EmbeddingDotBias
RMSE:			0.910456
MAE:			0.713525
Explained variance:	0.339586
R squared:		0.339563
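For reference, the four regression metrics can be computed by hand on a handful of ratings. This is a minimal sketch using the usual definitions. The function name `regression_metrics` and the toy ratings are assumptions, and the library functions additionally align the true and predicted tables on the user and item columns before comparing.

```python
import math


def regression_metrics(y_true, y_pred):
    """RMSE, MAE, R squared, and explained variance for paired ratings."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    mean_error = sum(errors) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    ss_err_centered = sum((e - mean_error) ** 2 for e in errors)
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
        "exp_var": 1.0 - ss_err_centered / ss_tot,
    }


# Toy true and predicted ratings (made up).
y_true = [4.0, 3.0, 5.0, 2.0]
y_pred = [3.5, 3.0, 4.0, 3.0]
metrics = regression_metrics(y_true, y_pred)
```

Explained variance and R squared differ only by the mean error: when the predictions are nearly unbiased, as in the results above, the two values almost coincide.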


That RMSE is competitive in comparison with other models.

In [32]:
# Record results for tests - ignore this cell
store_metadata("map", eval_map)
store_metadata("ndcg", eval_ndcg)
store_metadata("precision", eval_precision)
store_metadata("recall", eval_recall)
store_metadata("rmse", eval_rmse)
store_metadata("mae", eval_mae)
store_metadata("exp_var", eval_exp_var)
store_metadata("rsquared", eval_r2)
store_metadata("train_time", train_time.interval)