<i>Copyright (c) Recommenders contributors.</i>

<i>Licensed under the MIT License.</i>

## FastAI Recommender

This notebook shows how to use the [FastAI](https://fast.ai) recommender, which uses [PyTorch](https://pytorch.org/) under the hood.

In [122]:
# Suppress all warnings
import warnings
warnings.filterwarnings("ignore")

import os
import sys
import numpy as np
import pandas as pd
import torch
import fastai
from tempfile import TemporaryDirectory

from fastai.collab import collab_learner, CollabDataLoaders, load_learner

from recommenders.utils.constants import (
    DEFAULT_USER_COL as USER, 
    DEFAULT_ITEM_COL as ITEM, 
    DEFAULT_RATING_COL as RATING, 
    DEFAULT_TIMESTAMP_COL as TIMESTAMP, 
    DEFAULT_PREDICTION_COL as PREDICTION
) 
from recommenders.utils.timer import Timer
from recommenders.datasets import movielens
from recommenders.datasets.python_splitters import python_stratified_split
from recommenders.models.fastai.fastai_utils import cartesian_product, score
from recommenders.evaluation.python_evaluation import map, ndcg_at_k, precision_at_k, recall_at_k
from recommenders.evaluation.python_evaluation import rmse, mae, rsquared, exp_var
from recommenders.utils.notebook_utils import store_metadata

print("System version: {}".format(sys.version))
print("Pandas version: {}".format(pd.__version__))
print("Fast AI version: {}".format(fastai.__version__))
print("Torch version: {}".format(torch.__version__))
print("CUDA Available: {}".format(torch.cuda.is_available()))
print("CuDNN Enabled: {}".format(torch.backends.cudnn.enabled))

System version: 3.10.10 (main, Mar 21 2023, 18:45:11) [GCC 11.2.0]
Pandas version: 2.2.3
Fast AI version: 2.8.1
Torch version: 2.6.0+cu124
CUDA Available: False
CuDNN Enabled: True


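Define the top-k cutoff, the MovieLens dataset size, and the model hyperparameters. The column-name constants (`USER`, `ITEM`, `RATING`, `TIMESTAMP`, `PREDICTION`) were imported above.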

In [123]:
# top k items to recommend
TOP_K = 10

# Select MovieLens data size: 100k, 1m, 10m, or 20m
MOVIELENS_DATA_SIZE = '100k'

# Model parameters
N_FACTORS = 40
EPOCHS = 5

In [124]:
ratings_df = movielens.load_pandas_df(
    size=MOVIELENS_DATA_SIZE,
    header=[USER,ITEM,RATING,TIMESTAMP]
)

# make sure the IDs are loaded as strings to better prevent confusion with embedding ids
ratings_df[USER] = ratings_df[USER].astype('str')
ratings_df[ITEM] = ratings_df[ITEM].astype('str')

ratings_df.head()

100%|██████████| 4.81k/4.81k [00:00<00:00, 14.1kKB/s]


Unnamed: 0,userID,itemID,rating,timestamp
0,196,242,3.0,881250949
1,186,302,3.0,891717742
2,22,377,1.0,878887116
3,244,51,2.0,880606923
4,166,346,1.0,886397596


In [125]:
# Split the dataset
train_valid_df, test_df = python_stratified_split(
    ratings_df, 
    ratio=0.75, 
    min_rating=1, 
    filter_by="item", 
    col_user=USER, 
    col_item=ITEM
)

In [126]:
# Remove "cold" users from test set  
test_df = test_df[test_df.userID.isin(train_valid_df.userID)]

## Training

In [127]:
# fix random seeds to make sure our runs are reproducible
np.random.seed(101)
torch.manual_seed(101)
torch.cuda.manual_seed_all(101)

In [128]:
with Timer() as preprocess_time:
    data = CollabDataLoaders.from_df(train_valid_df, 
                                     user_name=USER, 
                                     item_name=ITEM, 
                                     rating_name=RATING, 
                                     valid_pct=0)


In [9]:
#################### DRAFT ###################

In [136]:
from torch.utils.data import Dataset, DataLoader
import torch
import numpy as np
import pandas as pd
from pathlib import Path
import random

class CollabDataset(Dataset):
    def __init__(self, users, items, ratings):
        # Convert to numpy arrays first and ensure correct types
        users = np.array(users, dtype=np.int64)
        items = np.array(items, dtype=np.int64)
        ratings = np.array(ratings, dtype=np.float32)

        # Then convert to tensors
        self.users = torch.tensor(users, dtype=torch.long)
        self.items = torch.tensor(items, dtype=torch.long)
        self.ratings = torch.tensor(ratings, dtype=torch.float)
        
    def __len__(self):
        return len(self.ratings)
    
    def __getitem__(self, idx):
        user_item_tensor = torch.stack((self.users[idx], self.items[idx]))
        return user_item_tensor, self.ratings[idx]

class CollabDataLoaders:
    # Start with empty metadata; from_df fills it in
    def __init__(self):
        self.classes = {}
        
    @classmethod
    def from_df(cls, ratings, valid_pct=0.2, user_name=None, item_name=None, 
                rating_name=None, seed=42, batch_size=64, **kwargs):
        """Create DataLoaders from a pandas DataFrame for collaborative filtering."""
        # Set random seed
        torch.manual_seed(seed)
        random.seed(seed)
        np.random.seed(seed)

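        # Resolve column names first; the defaults assume (user, item, rating) column order
        user_name = user_name or ratings.columns[0]
        item_name = item_name or ratings.columns[1]
        rating_name = rating_name or ratings.columns[2]

        # Work on a copy so the caller's DataFrame is not modified, and treat IDs as strings
        ratings = ratings.copy()
        ratings[user_name] = ratings[user_name].astype(str)
        ratings[item_name] = ratings[item_name].astype(str)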
        
        # Drop any rows with NaN values
        ratings = ratings.dropna(subset=[user_name, item_name, rating_name])
        
        # Get unique users and items
        users = ratings[user_name].unique()
        items = ratings[item_name].unique()
        
        
        # Create mapping dictionaries
        user2idx = {u: i for i, u in enumerate(users)}
        item2idx = {item: idx for idx, item in enumerate(items)}
        
        # Convert to indices and handle any remaining NaN values
        ratings[user_name] = ratings[user_name].map(user2idx).fillna(-1).astype(np.int64)
        ratings[item_name] = ratings[item_name].map(item2idx).fillna(-1).astype(np.int64)
        ratings[rating_name] = ratings[rating_name].fillna(0).astype(np.float32)
        
        # Remove any rows where mapping failed (indices are -1)
        ratings = ratings[
            (ratings[user_name] >= 0) & 
            (ratings[item_name] >= 0)
        ]
        
        # Split into train and validation
        n = len(ratings)
        n_valid = int(n * valid_pct)
        indices = list(range(n))
        random.shuffle(indices)
        train_idx = indices[n_valid:]
        valid_idx = indices[:n_valid]
        
        # Create datasets with explicit type conversion
        train_ds = CollabDataset(
            ratings.iloc[train_idx][user_name].values,
            ratings.iloc[train_idx][item_name].values,
            ratings.iloc[train_idx][rating_name].values
        )
        
        valid_ds = CollabDataset(
            ratings.iloc[valid_idx][user_name].values,
            ratings.iloc[valid_idx][item_name].values,
            ratings.iloc[valid_idx][rating_name].values
        ) if n_valid > 0 else None
        
        # Create dataloaders
        train_dl = DataLoader(
            train_ds, 
            batch_size=batch_size,
            shuffle=True,
            **kwargs
        )
        
        valid_dl = DataLoader(
            valid_ds,
            batch_size=batch_size*2,
            shuffle=False,
            **kwargs
        ) if valid_ds is not None else None
        
        # Store metadata on a new instance; a classmethod has no `self` to write to
        dl = cls()
        dl.train = train_dl
        dl.valid = valid_dl
        # NOTE: user2idx/item2idx number IDs in order of first appearance, while these
        # lists are sorted with '#na#' at position 0, so the two orderings can differ.
        dl.classes = {
            user_name: ['#na#'] + sorted(users.astype(str).tolist(), key=lambda x: int(x) if x.isdigit() else float('inf')),
            item_name: ['#na#'] + sorted(items.astype(str).tolist(), key=lambda x: int(x) if x.isdigit() else float('inf'))
        }
        dl.user = user_name
        dl.item = item_name
        dl.n_users = len(users)
        dl.n_items = len(items)
        
        return dl


    def show_batch(self, n=5):
        """Show a batch of data."""
        print("Showing a sample batch:")
        # Get one batch from the training dataloader
        # Unpack the two elements from the batch: user_item_batch (tensor of shape [bs, 2]) and ratings_batch (tensor of shape [bs])
        for user_item_batch, ratings_batch in self.train:
            # Extract users and items from the user_item_batch tensor
            users = user_item_batch[:, 0] # Shape [bs]
            items = user_item_batch[:, 1] # Shape [bs]

            # Now take the first n elements as intended by the original code
            users = users[:n].numpy()
            items = items[:n].numpy()
            ratings = ratings_batch[:n].numpy() # ratings_batch is already the ratings tensor

            df = pd.DataFrame({
                self.user: [self.classes[self.user][u] for u in users],
                self.item: [self.classes[self.item][i] for i in items],
                'rating': ratings
            })

            print(f"Showing {n} examples from a batch:")
            print(df)  # This line prints the DataFrame
            break

In [137]:
self = CollabDataLoaders()

In [138]:
self.classes

{}

In [213]:
ratings = train_valid_df

In [214]:
user_name = ratings.columns[0]
item_name = ratings.columns[1]
rating_name = ratings.columns[2]

In [215]:
user_name

'userID'

In [216]:
ratings[user_name] = ratings[user_name].astype(str)
ratings[item_name] = ratings[item_name].astype(str)

In [217]:
ratings.dtypes

userID        object
itemID        object
rating       float32
timestamp      int64
dtype: object

In [218]:
# Drop any rows with NaN values
ratings = ratings.dropna(subset=[user_name, item_name, rating_name])
        

In [219]:
ratings.dtypes

userID        object
itemID        object
rating       float32
timestamp      int64
dtype: object

In [220]:
# Get unique users and items
users = ratings[user_name].unique()
items = ratings[item_name].unique()

In [221]:
users

array(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12',
       '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23',
       '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34',
       '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45',
       '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56',
       '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67',
       '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78',
       '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89',
       '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100',
       '101', '102', '103', '104', '105', '106', '107', '108', '109',
       '110', '111', '112', '113', '114', '115', '116', '117', '118',
       '119', '120', '121', '122', '123', '124', '125', '126', '127',
       '128', '129', '130', '131', '132', '133', '134', '135', '136',
       '137', '138', '139', '140', '141', '142', '143', '144

In [222]:
items

array(['0', '1', '2', ..., '1679', '1680', '1681'],
      shape=(1682,), dtype=object)

In [223]:
# Create mapping dictionaries
user2idx = {u: i for i, u in enumerate(users)}

In [224]:
item2idx = {i: idx for idx, i in enumerate(items)}

In [225]:
ratings[user_name] = ratings[user_name].map(user2idx).fillna(-1).astype(np.int64)

In [226]:
ratings[item_name] = ratings[item_name].map(item2idx).fillna(-1).astype(np.int64)

In [227]:
ratings[rating_name] = ratings[rating_name].fillna(0).astype(np.float32)

In [228]:
ratings

Unnamed: 0,userID,itemID,rating,timestamp
10047,0,0,4.0,885870323
44185,1,0,5.0,889987954
82784,2,0,4.0,875501555
83281,3,0,4.0,882340657
69124,4,0,5.0,877214125
...,...,...,...,...
77891,279,1681,2.0,882387163
31448,142,1681,4.0,889730187
7847,238,1681,4.0,892838288
42623,678,1681,2.0,883365385


In [229]:
ratings[item_name].iloc[0]

np.int64(0)

In [230]:
# Remove any rows where mapping failed (indices are -1)
ratings = ratings[
    (ratings[user_name] >= 0) & 
    (ratings[item_name] >= 0)
]

In [231]:
ratings

Unnamed: 0,userID,itemID,rating,timestamp
10047,0,0,4.0,885870323
44185,1,0,5.0,889987954
82784,2,0,4.0,875501555
83281,3,0,4.0,882340657
69124,4,0,5.0,877214125
...,...,...,...,...
77891,279,1681,2.0,882387163
31448,142,1681,4.0,889730187
7847,238,1681,4.0,892838288
42623,678,1681,2.0,883365385


In [250]:
valid_pct = 0.1

In [251]:
        
# Split into train and validation
n = len(ratings)
n_valid = int(n * valid_pct)
indices = list(range(n))
random.shuffle(indices)
train_idx = indices[n_valid:]
valid_idx = indices[:n_valid]

In [244]:
valid_idx

[]

In [254]:
# Create datasets with explicit type conversion
train_ds = CollabDataset(
    ratings.iloc[train_idx][user_name].values,
    ratings.iloc[train_idx][item_name].values,
    ratings.iloc[train_idx][rating_name].values
)

In [255]:
ratings.iloc[train_idx][user_name].values

array([566, 660, 209, ..., 180, 158, 564], shape=(67560,))

In [256]:
ratings.iloc[valid_idx][user_name].values

array([529, 283, 355, ..., 286, 621, 444], shape=(7506,))

In [257]:
valid_ds = CollabDataset(
    ratings.iloc[valid_idx][user_name].values,
    ratings.iloc[valid_idx][item_name].values,
    ratings.iloc[valid_idx][rating_name].values
) if n_valid > 0 else None

In [258]:
batch_size = 24

In [259]:
kwargs = {}

In [260]:
# Create dataloaders
train_dl = DataLoader(
    train_ds, 
    batch_size=batch_size,
    shuffle=True,
    **kwargs
)
        

In [210]:
valid_dl = DataLoader(
    valid_ds,
    batch_size=batch_size*2,
    shuffle=False,
    **kwargs
) if valid_ds is not None else None

In [211]:
ITEM

'itemID'

In [261]:
train_valid_df

Unnamed: 0,userID,itemID,rating,timestamp
10047,0,0,4.0,885870323
44185,1,0,5.0,889987954
82784,2,0,4.0,875501555
83281,3,0,4.0,882340657
69124,4,0,5.0,877214125
...,...,...,...,...
77891,279,1681,2.0,882387163
31448,142,1681,4.0,889730187
7847,238,1681,4.0,892838288
42623,678,1681,2.0,883365385


In [263]:
data = CollabDataLoaders.from_df(train_valid_df, 
                                user_name=USER, 
                                item_name=ITEM, 
                                rating_name=RATING, 
                                valid_pct=0)

In [264]:
data.show_batch()

Showing a sample batch:
Showing 5 examples from a batch:
  userID itemID  rating
0    624    769     5.0
1    487    968     5.0
2     58    556     3.0
3    263   1097     1.0
4    641   1477     4.0


In [265]:
data.classes.keys()

dict_keys(['userID', 'itemID'])

In [266]:
# Access the dataloaders
for user_item, ratings in train_dl:
    # Training loop
    break

In [267]:
ratings

tensor([3., 3., 2., 3., 5., 4., 5., 2., 5., 2., 5., 3., 4., 3., 4., 5., 3., 4.,
        2., 4., 4., 3., 2., 4.])

In [268]:
user_item

tensor([[ 279, 1436],
        [  80,  910],
        [ 618,  577],
        [ 433,   18],
        [ 269,   24],
        [ 289, 1360],
        [ 353, 1163],
        [  21,  259],
        [ 171,  774],
        [ 214, 1214],
        [ 362,  824],
        [ 339, 1229],
        [ 784, 1638],
        [ 418, 1182],
        [ 103, 1438],
        [ 103, 1383],
        [ 406, 1212],
        [ 571, 1048],
        [ 720, 1514],
        [  18, 1649],
        [ 799, 1194],
        [ 423, 1095],
        [ 150,  843],
        [ 213,  534]])

In [269]:
train_dl

<torch.utils.data.dataloader.DataLoader at 0x7f8a71f56170>

In [270]:
users,items = user_item[:,0],user_item[:,1]

In [271]:
users

tensor([279,  80, 618, 433, 269, 289, 353,  21, 171, 214, 362, 339, 784, 418,
        103, 103, 406, 571, 720,  18, 799, 423, 150, 213])

In [272]:
items

tensor([1436,  910,  577,   18,   24, 1360, 1163,  259,  774, 1214,  824, 1229,
        1638, 1182, 1438, 1383, 1212, 1048, 1514, 1649, 1194, 1095,  843,  534])

In [273]:
user_item[0]

tensor([ 279, 1436])

In [275]:
ratings[0]

tensor(3.)

In [278]:
data.valid

In [279]:
total_users, total_items = data.classes.values()

In [276]:
# data.valid is None when valid_pct=0, and each batch is a (user_item, ratings) pair
if data.valid is not None:
    for user_item, ratings in data.valid:
        # Validation loop
        pass

In [277]:
users

tensor([279,  80, 618, 433, 269, 289, 353,  21, 171, 214, 362, 339, 784, 418,
        103, 103, 406, 571, 720,  18, 799, 423, 150, 213])

In [134]:
items

tensor([ 268,  875, 1137, 1372,  773,  833,  323,  836, 1043, 1549,  874,  776,
        1097,   55,  925,  244, 1105,  648, 1369, 1031,  799, 1025, 1350,  775])

In [135]:
ratings

tensor([5., 2., 3., 3., 5., 5., 4., 1., 5., 5., 5., 4., 4., 4., 3., 3., 5., 1.,
        2., 2., 4., 3., 4., 4.])

In [None]:
##################

In [118]:
data.show_batch()

Showing a sample batch:
Showing 5 examples from a batch:
  userID itemID  rating
0    152   1150     4.0
1    538   1177     4.0
2    430   1478     3.0
3    235     20     5.0
4    189    881     5.0


Now we will create a `collab_learner` for the data, which by default uses the [EmbeddingDotBias](https://docs.fast.ai/collab.html#EmbeddingDotBias) model. We use 40 latent factors, so each user and each item is mapped to a vector of 40 floats, as the model summary below shows. The embedding weights are not predefined; they are learned during training.

Although ratings only range from 1 to 5, we set the model's output range to 0 to 5.5. The output passes through a scaled sigmoid, which only approaches its bounds asymptotically, so the wider range lets the model predict values close to 1 and 5, which improves accuracy. Lastly, we set a weight-decay value for regularization.
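To illustrate why the wider range helps, here is a minimal sketch, assuming the raw model score is squashed with a scaled sigmoid (as fastai's `sigmoid_range` does). The helper name `scaled_sigmoid` and the sample scores are made up for illustration.

```python
import torch

def scaled_sigmoid(x: torch.Tensor, low: float, high: float) -> torch.Tensor:
    """Squash raw scores into the open interval (low, high)."""
    return torch.sigmoid(x) * (high - low) + low

# Even very large raw scores stay strictly inside the bounds.
raw = torch.tensor([-10.0, 0.0, 10.0])
bounded = scaled_sigmoid(raw, 0.0, 5.5)

# Raw score needed to output exactly 5.0 when the upper bound is 5.5:
# invert the sigmoid with logit(p) = log(p / (1 - p)), p = (5.0 - 0) / (5.5 - 0).
p_wide = torch.tensor(5.0 / 5.5)
logit_wide = torch.log(p_wide / (1 - p_wide))
print(bounded, logit_wide)
```

With an upper bound of 5.5, a rating of 5 corresponds to a modest, finite raw score. With an upper bound of exactly 5, the same rating would require an infinitely large raw score.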

In [None]:
learn = collab_learner(data, n_factors=N_FACTORS, y_range=[0,5.5], wd=1e-1)
learn.model

EmbeddingDotBias(
  (u_weight): Embedding(944, 40)
  (i_weight): Embedding(1683, 40)
  (u_bias): Embedding(944, 1)
  (i_bias): Embedding(1683, 1)
)

Now train the model for 5 epochs with a maximum learning rate of 5e-3. `fit_one_cycle` follows the one-cycle policy: the learning rate first warms up to this maximum and is then reduced with cosine annealing.
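The shape of such a schedule can be sketched in plain Python. This is an illustration only, not fastai's code: the function names are hypothetical, and the values `div=25`, `div_final=1e5` and `pct_start=0.25` are assumed here to mirror fastai's defaults.

```python
import math

def cos_interp(start: float, end: float, pct: float) -> float:
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return start + (1 + math.cos(math.pi * (1 - pct))) * (end - start) / 2

def one_cycle_lr(pct, lr_max=5e-3, div=25.0, div_final=1e5, pct_start=0.25):
    """Learning rate at training progress pct in [0, 1] (assumed defaults)."""
    if pct < pct_start:
        # Warm-up phase: rise from lr_max / div to lr_max
        return cos_interp(lr_max / div, lr_max, pct / pct_start)
    # Annealing phase: decay from lr_max to lr_max / div_final
    return cos_interp(lr_max, lr_max / div_final, (pct - pct_start) / (1 - pct_start))

# Sample the schedule at 101 evenly spaced points of training progress
schedule = [one_cycle_lr(i / 100) for i in range(101)]
print(schedule[0], max(schedule), schedule[-1])
```

Under these assumptions the rate peaks a quarter of the way through training and ends several orders of magnitude below its maximum.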

In [10]:
with Timer() as train_time:
    learn.fit_one_cycle(EPOCHS, lr_max=5e-3)

print("Took {} seconds for training.".format(train_time))

epoch,train_loss,valid_loss,time


Took 33.9113 seconds for training.


Save the learner so it can be loaded back later for inference and for generating recommendations.

In [11]:
tmp = TemporaryDirectory()
model_path = os.path.join(tmp.name, "movielens_model.pkl")

In [12]:
learn.export(model_path)

## Generating Recommendations

Load the learner from disk.

In [13]:
learner = load_learner(model_path)

Get all users and items that the model knows

In [14]:
total_users, total_items = learner.dls.classes.values()
total_items = total_items[1:]
total_users = total_users[1:]

Get all users from the test set and keep only those that were known in the training set.

In [15]:
test_users = test_df[USER].unique()
test_users = np.intersect1d(test_users, total_users)
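As a small illustration of this filtering step, `np.intersect1d` returns the sorted, unique values present in both arrays. The user IDs below are made up.

```python
import numpy as np

# Hypothetical IDs: "42" and "99" appear only in the test set
test_user_ids = np.array(["3", "7", "42", "99"])
known_user_ids = np.array(["1", "3", "7", "8"])

# Keep only the test users the model has an embedding for
kept = np.intersect1d(test_user_ids, known_user_ids)
print(kept)
```

Users dropped here are the "cold" users, for whom the model has learned no embedding.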

Build the cartesian product of test set users and all items known to the model

In [16]:
users_items = cartesian_product(np.array(test_users),np.array(total_items))
users_items = pd.DataFrame(users_items, columns=[USER,ITEM])
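For intuition, a cartesian product of two ID arrays can be built with NumPy alone. The helper `cartesian_pairs` below is a hypothetical stand-in written for this illustration; it is not the implementation of the `cartesian_product` utility used above.

```python
import numpy as np
import pandas as pd

def cartesian_pairs(users, items) -> np.ndarray:
    """Return every (user, item) combination as a 2-column array."""
    users = np.asarray(users)
    items = np.asarray(items)
    # Repeat each user once per item, and cycle the items once per user
    return np.column_stack([np.repeat(users, len(items)),
                            np.tile(items, len(users))])

pairs = pd.DataFrame(cartesian_pairs(["u1", "u2"], ["a", "b", "c"]),
                     columns=["userID", "itemID"])
print(pairs)
```

The result has `len(users) * len(items)` rows, which is why the candidate set grows quickly on larger datasets.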


Lastly, remove the user/item combinations that appear in the training set, since we don't want to recommend a movie that the user has already watched.

In [17]:
training_removed = pd.merge(users_items, train_valid_df.astype(str), on=[USER, ITEM], how='left')
training_removed = training_removed[training_removed[RATING].isna()][[USER, ITEM]]
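The same filtering can be written as an explicit anti-join using the `indicator` option of `DataFrame.merge`. This is a sketch on made-up toy data, shown as an alternative to checking for missing ratings.

```python
import pandas as pd

# Toy candidate pairs and already-seen interactions (illustrative values)
candidates = pd.DataFrame({"userID": ["1", "1", "2", "2"],
                           "itemID": ["10", "20", "10", "20"]})
seen = pd.DataFrame({"userID": ["1", "2"],
                     "itemID": ["10", "20"],
                     "rating": [4.0, 5.0]})

# A left merge with indicator=True marks rows found only in `candidates`
merged = candidates.merge(seen[["userID", "itemID"]],
                          on=["userID", "itemID"],
                          how="left", indicator=True)
unseen = merged.loc[merged["_merge"] == "left_only",
                    ["userID", "itemID"]].reset_index(drop=True)
print(unseen)
```

The indicator column does not depend on the rating column containing no missing values of its own.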

### Score the model to find the top K recommendations

In [18]:
with Timer() as test_time:
    top_k_scores = score(learner, 
                         test_df=training_removed,
                         user_col=USER, 
                         item_col=ITEM, 
                         prediction_col=PREDICTION)

print("Took {} seconds for {} predictions.".format(test_time, len(training_removed)))

Took 5.1570 seconds for 1511060 predictions.


Calculate some metrics for our model

In [19]:
eval_map = map(test_df, top_k_scores, col_user=USER, col_item=ITEM, 
               col_rating=RATING, col_prediction=PREDICTION, 
               relevancy_method="top_k", k=TOP_K)

In [20]:
eval_ndcg = ndcg_at_k(test_df, top_k_scores, col_user=USER, col_item=ITEM, 
                      col_rating=RATING, col_prediction=PREDICTION, 
                      relevancy_method="top_k", k=TOP_K)

In [21]:
eval_precision = precision_at_k(test_df, top_k_scores, col_user=USER, col_item=ITEM, 
                                col_rating=RATING, col_prediction=PREDICTION, 
                                relevancy_method="top_k", k=TOP_K)

In [22]:
eval_recall = recall_at_k(test_df, top_k_scores, col_user=USER, col_item=ITEM, 
                          col_rating=RATING, col_prediction=PREDICTION, 
                          relevancy_method="top_k", k=TOP_K)
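To make the ranking metrics concrete, here is a hand-computed sketch of precision@k and recall@k for a single user, using the textbook definitions. The function name and data are illustrative, and the library functions above may differ in details such as how they average over users.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Textbook precision@k and recall@k for one user."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / k             # share of recommendations that are relevant
    recall = hits / len(relevant)    # share of relevant items that were recommended
    return precision, recall

# Four recommendations, five relevant items, two of them in common ("b" and "d")
p_at_k, r_at_k = precision_recall_at_k(["a", "b", "c", "d"],
                                       ["b", "d", "e", "f", "g"], k=4)
print(p_at_k, r_at_k)
```

Recall stays low when a user has many more relevant items than the k slots available.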

In [23]:
print("Model:\t\t" + learn.__class__.__name__,
      "Top K:\t\t%d" % TOP_K,
      "MAP:\t\t%f" % eval_map,
      "NDCG:\t\t%f" % eval_ndcg,
      "Precision@K:\t%f" % eval_precision,
      "Recall@K:\t%f" % eval_recall, sep='\n')

Model:		Learner
Top K:		10
MAP:		0.024119
NDCG:		0.152808
Precision@K:	0.139130
Recall@K:	0.054943


The above numbers are lower than those of [SAR](../sar_single_node_movielens.ipynb), which is expected, since the model explicitly tries to generalize users and items into latent factors. Next, we look at how well the model predicts the rating a user would give a movie. For this we only need to score the user-item pairs in `test_df`.

In [24]:
scores = score(learner, 
               test_df=test_df.copy(), 
               user_col=USER, 
               item_col=ITEM, 
               prediction_col=PREDICTION)

Now calculate some regression metrics
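As a reference for what these metrics measure, the sketch below computes RMSE, MAE and R squared by hand with NumPy on made-up ratings. It uses the standard formulas and is not the library's code.

```python
import numpy as np

# Illustrative true ratings and predictions for four user-item pairs
actual = np.array([4.0, 3.0, 5.0, 2.0])
predicted = np.array([3.5, 3.0, 4.0, 3.0])

errors = predicted - actual
rmse_value = np.sqrt(np.mean(errors ** 2))   # penalizes large errors more heavily
mae_value = np.mean(np.abs(errors))          # average absolute error
# R squared: 1 minus the ratio of residual to total sum of squares
r2_value = 1 - np.sum(errors ** 2) / np.sum((actual - actual.mean()) ** 2)
print(rmse_value, mae_value, r2_value)
```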

In [25]:
eval_r2 = rsquared(test_df, scores, col_user=USER, col_item=ITEM, col_rating=RATING, col_prediction=PREDICTION)
eval_rmse = rmse(test_df, scores, col_user=USER, col_item=ITEM, col_rating=RATING, col_prediction=PREDICTION)
eval_mae = mae(test_df, scores, col_user=USER, col_item=ITEM, col_rating=RATING, col_prediction=PREDICTION)
eval_exp_var = exp_var(test_df, scores, col_user=USER, col_item=ITEM, col_rating=RATING, col_prediction=PREDICTION)

print("Model:\t\t\t" + learn.__class__.__name__,
      "RMSE:\t\t\t%f" % eval_rmse,
      "MAE:\t\t\t%f" % eval_mae,
      "Explained variance:\t%f" % eval_exp_var,
      "R squared:\t\t%f" % eval_r2, sep='\n')

Model:			Learner
RMSE:			0.904589
MAE:			0.715827
Explained variance:	0.356082
R squared:		0.355173


That RMSE is competitive in comparison with other models.

In [26]:
# Record results for tests - ignore this cell
store_metadata("map", eval_map)
store_metadata("ndcg", eval_ndcg)
store_metadata("precision", eval_precision)
store_metadata("recall", eval_recall)
store_metadata("rmse", eval_rmse)
store_metadata("mae", eval_mae)
store_metadata("exp_var", eval_exp_var)
store_metadata("rsquared", eval_r2)
store_metadata("train_time", train_time.interval)
store_metadata("test_time", test_time.interval)

In [27]:
tmp.cleanup()