# torchrec Criteo Terabyte Tutorial

## Table of contents
1. Instantiating Criteo Terabyte dataset
2. Defining and applying batch data transformation function
3. Defining model
4. Training and evaluating model
5. Training and evaluating model on GPU

In [4]:
from typing import Dict, List, Tuple, Union

import torch
from torchrec.datasets.criteo import criteo_terabyte

torch.set_printoptions(threshold=20)

## 1. Instantiating Criteo Terabyte dataset
Let's begin by instantiating a datapipe representing the [Criteo 1TB Click Logs](https://ailab.criteo.com/download-criteo-1tb-click-logs-dataset/) dataset (we'll refer to it here as the Criteo Terabyte dataset).

In [5]:
datapipe = criteo_terabyte(
    ("/home/jeffhwang/local/datasets/criteo/day_11.tsv",),
)

By default, the datapipe returns each sample as a dictionary that maps each default feature name to a typecast feature value: `int` for the label and each of the 13 integer features, and `str` for each of the 26 categorical features.

In [6]:
next(iter(datapipe))

{'label': 0,
 'int_0': 0,
 'int_1': 0,
 'int_2': 0,
 'int_3': 0,
 'int_4': 0,
 'int_5': 1,
 'int_6': 0,
 'int_7': 124,
 'int_8': 0,
 'int_9': 1,
 'int_10': 0,
 'int_11': 1,
 'int_12': 0,
 'cat_0': '35b29d1c',
 'cat_1': '11b5bc17',
 'cat_2': '63f76c15',
 'cat_3': 'f2463ffb',
 'cat_4': '16420cce',
 'cat_5': '6fcd6dcb',
 'cat_6': '6e1739cb',
 'cat_7': '337bf7a5',
 'cat_8': '2e4e821f',
 'cat_9': '4dc5d654',
 'cat_10': '59e53f80',
 'cat_11': '12716184',
 'cat_12': '00c5ffb7',
 'cat_13': 'be4ee537',
 'cat_14': 'eb24f585',
 'cat_15': '4cdc3efa',
 'cat_16': 'd20856aa',
 'cat_17': '7232d217',
 'cat_18': '9512c20b',
 'cat_19': '6c8c076c',
 'cat_20': '174c2fe8',
 'cat_21': 'b32f71aa',
 'cat_22': '59f8acf3',
 'cat_23': 'f3a1835d',
 'cat_24': '30436bfc',
 'cat_25': 'b757e957'}

We can adjust the format of each sample via the `row_mapper` parameter. For instance, if we'd prefer to work with lists of feature values, we can define and provide a function that maps a raw split TSV line to a list of typecast values:

In [7]:
from torchrec.datasets.utils import safe_cast

def row_to_list(row):
    return [
        safe_cast(val, int, 0) for val in row[:14]
    ] + [
        safe_cast(val, str, "") for val in row[14:]
    ]

list_datapipe = criteo_terabyte(
    ("/home/jeffhwang/local/datasets/criteo/day_11.tsv",),
    row_mapper=row_to_list,
)
next(iter(list_datapipe))

[0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 124,
 0,
 1,
 0,
 1,
 0,
 '35b29d1c',
 '11b5bc17',
 '63f76c15',
 'f2463ffb',
 '16420cce',
 '6fcd6dcb',
 '6e1739cb',
 '337bf7a5',
 '2e4e821f',
 '4dc5d654',
 '59e53f80',
 '12716184',
 '00c5ffb7',
 'be4ee537',
 'eb24f585',
 '4cdc3efa',
 'd20856aa',
 '7232d217',
 '9512c20b',
 '6c8c076c',
 '174c2fe8',
 'b32f71aa',
 '59f8acf3',
 'f3a1835d',
 '30436bfc',
 'b757e957']

Or, if we'd prefer to operate directly on raw split TSV lines, we can pass `row_mapper=None`:

In [8]:
raw_datapipe = criteo_terabyte(
    ("/home/jeffhwang/local/datasets/criteo/day_11.tsv",),
    row_mapper=None,
)
next(iter(raw_datapipe))

['0',
 '',
 '',
 '0',
 '0',
 '',
 '1',
 '0',
 '124',
 '0',
 '1',
 '',
 '1',
 '0',
 '35b29d1c',
 '11b5bc17',
 '63f76c15',
 'f2463ffb',
 '16420cce',
 '6fcd6dcb',
 '6e1739cb',
 '337bf7a5',
 '2e4e821f',
 '4dc5d654',
 '59e53f80',
 '12716184',
 '00c5ffb7',
 'be4ee537',
 'eb24f585',
 '4cdc3efa',
 'd20856aa',
 '7232d217',
 '9512c20b',
 '6c8c076c',
 '174c2fe8',
 'b32f71aa',
 '59f8acf3',
 'f3a1835d',
 '30436bfc',
 'b757e957']
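
To connect the three formats shown above, here is a minimal pure-Python sketch of how a raw split row could be turned into the default dictionary. It is not torchrec's implementation: `cast_or_default` is a hypothetical stand-in for `safe_cast`, assumed to return the default when the cast fails (which is consistent with the empty strings above appearing as `0` in the list output), and the feature names are copied from the dictionary output.

```python
from typing import Callable, Dict, List, TypeVar, Union

T = TypeVar("T")


def cast_or_default(val: str, dest_type: Callable[[str], T], default: T) -> T:
    """Hypothetical stand-in for safe_cast: fall back to `default` if the cast fails."""
    try:
        return dest_type(val)
    except ValueError:
        return default


# Names as they appear in the default dictionary output.
NAMES = ["label"] + [f"int_{i}" for i in range(13)] + [f"cat_{i}" for i in range(26)]
# The label and the 13 integer fields cast to int; the 26 categorical fields stay str.
CASTS = [(int, 0)] * 14 + [(str, "")] * 26


def row_to_dict(row: List[str]) -> Dict[str, Union[int, str]]:
    # zip stops at the shortest input, so a shortened row yields a shortened dict.
    return {
        name: cast_or_default(val, dest_type, default)
        for name, val, (dest_type, default) in zip(NAMES, row, CASTS)
    }


# Shortened raw row: label, 13 integer fields, and only the first two categorical fields.
raw_row = ["0", "", "", "0", "0", "", "1", "0", "124", "0", "1", "", "1", "0",
           "35b29d1c", "11b5bc17"]
sample = row_to_dict(raw_row)
print(sample["int_0"], sample["int_7"], sample["cat_0"])
```

The missing integer fields (empty strings in the raw row) come out as `0`, while the hex strings pass through unchanged.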

Next, we move on to creating train and validation datapipes that represent complementary subsets of the dataset, and we apply a sample limit, batching, and collation to each:

In [9]:
from torchrec.datasets.utils import idx_split_train_val

datapipe = criteo_terabyte(
    ("/home/jeffhwang/local/datasets/criteo/day_11.tsv",),
)
train_datapipe, val_datapipe = idx_split_train_val(datapipe, 0.7)
train_datapipe = train_datapipe.limit(int(1e3)).batch(100).collate()
val_datapipe = val_datapipe.limit(int(1e3)).batch(100).collate()
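
As a rough mental model of the chain above, the following pure-Python sketch splits a toy stream by sample index, limits it, batches it, and collates each batch into a dictionary of per-feature lists. The split rule used here (position modulo 100) is an assumption made for illustration and may differ from what `idx_split_train_val` does internally; `index_split`, `batched`, and `collate` are hypothetical helpers, and the real collation produces tensors for the integer features rather than lists.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def index_split(samples: Iterable[dict], lo: int, hi: int) -> Iterator[dict]:
    """Yield samples whose position modulo 100 falls in [lo, hi) (assumed rule)."""
    return (s for idx, s in enumerate(samples) if lo <= idx % 100 < hi)


def batched(samples: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Group consecutive samples into lists of at most `size` items."""
    it = iter(samples)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def collate(batch: List[dict]) -> Dict[str, list]:
    """Turn a list of sample dicts into one dict of per-feature lists."""
    return {key: [s[key] for s in batch] for key in batch[0]}


# Toy stand-in for the dataset: 1000 samples with a label and one integer feature.
toy = [{"label": i % 2, "int_0": i} for i in range(1000)]

# 70/30 split, limit of 300 samples per subset, batches of 100.
train_batches = [collate(b) for b in batched(islice(index_split(toy, 0, 70), 300), 100)]
val_batches = [collate(b) for b in batched(islice(index_split(toy, 70, 100), 300), 100)]

print(len(train_batches), len(train_batches[0]["int_0"]))
```

Because both subsets are derived from the same stream by position, they never share a sample, which is what makes them complementary.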

## 2. Defining and applying batch data transformation function

At this point, each item that is read from `train_datapipe` and `val_datapipe` is a dictionary representing a batch of 100 Criteo Terabyte samples ("batch dictionary"). The dictionary maps each string feature name to 100 feature values, each corresponding to a sample in the batch.

Each of the 13 feature names corresponding to integer-valued features ("int_0" through "int_12") maps to a shape-(100,) tensor of integers; each of the 26 feature names corresponding to categorical features ("cat_0" through "cat_25") maps to a length-100 list of hex strings.

In [10]:
batch = next(iter(train_datapipe))
print("int_0:", batch["int_0"])
print("cat_0:", batch["cat_0"])

int_0: tensor([  0, 118,   3,  ...,  24,  12,   1])
cat_0: ['35b29d1c', '0ede8acc', '9a38fdbd', 'b7590909', 'f7f317e1', 'e5f3fd8d', '74a30cd8', 'a2309537', '0d2de9b7', 'd173a71b', '9bb030cc', 'd080dcdd', 'e5f3fd8d', '75bbaf08', 'fd2294fd', '6f88737d', '7f5629e3', '4ba9ec22', 'b1e51346', 'ae08ee40', '6dfe5365', 'b401509b', '', '288878ba', '', 'e5f3fd8d', '11440f4a', 'e5f3fd8d', '4a3130c4', '6f4012dc', 'a1c393aa', 'a5ba1c3d', '105fc022', 'e5f3fd8d', '', '5deaeb35', '8175c6fa', '265366bf', '', '8a2b1e43', 'ad98e872', 'ad98e872', '36ad0c3a', 'faec4515', 'ad98e872', '372034f9', '788a5d5b', 'e5f3fd8d', '240b1f33', 'ad98e872', 'a6367ddd', '84bff54b', '265366bf', 'cc1858ef', '03fd28c6', 'f6771153', '76d82355', 'ad98e872', '73de94cd', '265366bf', 'ad98e872', 'ad98e872', '32818e9b', '788a5d5b', 'b2d27a4e', '341cc7aa', 'ad98e872', '4d4b357f', '10a8c43d', '6a6402aa', 'ad98e872', '2edf58c3', '', 'ad98e872', 'b2d27a4e', 'b401509b', '2c4bc41a', '7592d348', 'ad98e872', '0d5c791d', 'ad98e872', 'ad98e87

There are a few data transformations we'd like to apply to each batch dictionary to produce the data we want to feed into our model:
- Normalize integer feature values, e.g. by applying a logarithmic function.
- Map each categorical feature hex string value to an integer that can be used to index into an embedding table.
- Separate integer features, categorical features, and labels into individual tensors reshaped appropriately.

To accomplish this, we define a function `_transform` that accepts a batch dictionary as input, applies the transformations listed above, and returns a tuple of three tensors corresponding to integer features, categorical features, and labels:

In [11]:
from torchrec.datasets.criteo import DEFAULT_CAT_NAMES, DEFAULT_INT_NAMES, DEFAULT_LABEL_NAME

NUM_EMBEDDINGS = int(1e5)

col_transforms = {
    **{name: lambda x: torch.log(x + 2) for name in DEFAULT_INT_NAMES},
    **{
        name: lambda x: x.fmod(NUM_EMBEDDINGS - 1) + 1
        for name in DEFAULT_CAT_NAMES
    },
}
    
def _transform(
    batch: Dict[str, List[Union[int, str]]]
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    int_x = torch.cat(
        [
            col_transforms[col_name](torch.tensor(batch[col_name]).unsqueeze(0).T)
            for col_name in DEFAULT_INT_NAMES
            if col_name in col_transforms
        ],
        dim=1,
    )
    cat_x = torch.cat(
        [
            col_transforms[col_name](
                torch.tensor([int(v, 16) if v else -1 for v in batch[col_name]])
                .unsqueeze(0)
                .T
            )
            for col_name in DEFAULT_CAT_NAMES
            if col_name in col_transforms
        ],
        dim=1,
    )
    y = torch.tensor(batch[DEFAULT_LABEL_NAME], dtype=torch.float32).unsqueeze(1)
    return int_x, cat_x, y
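
To make the two column transforms concrete, here is the same arithmetic applied to single values in plain Python. This is a sketch rather than part of the pipeline: `normalize_int` and `hex_to_index` are hypothetical names, and `math.fmod` is used because, like `Tensor.fmod`, it keeps the sign of the dividend.

```python
import math

NUM_EMBEDDINGS = int(1e5)


def normalize_int(value: int) -> float:
    # Same formula as the integer-column lambda: log(x + 2).
    return math.log(value + 2)


def hex_to_index(value: str) -> int:
    # Empty strings become -1, mirroring the list comprehension in _transform.
    raw = int(value, 16) if value else -1
    # fmod keeps the dividend's sign, so -1 stays -1 and ends up at index 0.
    return int(math.fmod(raw, NUM_EMBEDDINGS - 1)) + 1


print(round(normalize_int(0), 4))  # integer feature value 0
print(hex_to_index("35b29d1c"))    # first cat_0 value in the dataset
print(hex_to_index(""))            # missing categorical value
```

Non-empty hex strings land in the range 1 to `NUM_EMBEDDINGS - 1`, while missing values map to index 0, which the model in section 3 reserves through `padding_idx=0`.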

Then, using `map`, we produce a new pair of train and validation datapipes that applies `_transform` to each batch dictionary of data:

In [12]:
train_datapipe = train_datapipe.map(_transform)
val_datapipe = val_datapipe.map(_transform)

In [13]:
next(iter(train_datapipe))

(tensor([[0.6931, 0.6931, 0.6931,  ..., 0.6931, 1.0986, 0.6931],
         [4.7875, 0.6931, 2.5649,  ..., 0.6931, 6.0234, 2.8332],
         [1.6094, 5.4638, 2.6391,  ..., 1.7918, 7.9697, 2.6391],
         ...,
         [3.2581, 7.3343, 1.7918,  ..., 1.7918, 7.8921, 1.7918],
         [2.6391, 2.9957, 1.3863,  ..., 1.0986, 8.0064, 1.3863],
         [1.0986, 1.0986, 1.0986,  ..., 0.6931, 4.8675, 1.0986]]),
 tensor([[ 7086, 25811, 76217,  ..., 89288, 33022, 22656],
         [68043, 68258, 52745,  ..., 81118, 40776, 34095],
         [52112, 50486, 12400,  ...,  6322, 33022, 47765],
         ...,
         [ 8472, 85233, 86687,  ..., 68498, 33022, 87620],
         [ 8472, 94259, 77092,  ..., 77871, 70499, 87620],
         [43585, 52600,  2570,  ...,  3211, 51896, 67374]]),
 tensor([[0.],
         [0.],
         [0.],
         ...,
         [0.],
         [0.],
         [0.]]))

Now we've got datapipes that produce data that we can train and evaluate a model on!

## 3. Defining model
To utilize the integer (dense) and categorical (sparse) features present in the Criteo Terabyte dataset, we define `TestSparseNN`, which maps dense and sparse features to embeddings and interacts the embeddings to produce an output:

In [14]:
from torchrec.fb.modules.mlp import LazyMLP


class TestSparseNN(torch.nn.Module):
    def __init__(
        self,
        *,
        hidden_layer_size,
        output_dim,
        sparse_input_size,
        num_embeddings,
        embedding_dim,
    ):
        super(TestSparseNN, self).__init__()
        self.dense_arch = LazyMLP([hidden_layer_size, embedding_dim])
        self.embedding_layers = self._embedding_layers(
            sparse_input_size, num_embeddings, embedding_dim
        )
        self.over_arch = LazyMLP([output_dim])
        self.final = torch.nn.LazyLinear(1)

    def _embedding_layers(self, sparse_input_size, num_embeddings, embedding_dim):
        return torch.nn.ModuleList(
            [
                torch.nn.Embedding(num_embeddings, embedding_dim, padding_idx=0)
                for _ in range(sparse_input_size)
            ]
        )

    def _interact(self, embeddings):
        batch_size, embedding_dim = embeddings[0].shape
        stacked_embeddings = torch.cat(embeddings, dim=1).view(
            batch_size, -1, embedding_dim
        )
        interactions = torch.matmul(
            stacked_embeddings, torch.transpose(stacked_embeddings, 1, 2)
        )
        _, embedding_count, _ = interactions.shape
        rows, cols = torch.tril_indices(embedding_count, embedding_count)
        return interactions[:, rows, cols]

    def forward(self, dense_x, cat_x):
        embedded_dense = self.dense_arch(dense_x)
        embedded_sparse = [
            embedding_layer(cat_x[:, idx])
            for idx, embedding_layer in enumerate(self.embedding_layers)
        ]
        interactions = self._interact([embedded_dense] + embedded_sparse)
        return self.final(
            self.over_arch(torch.cat([embedded_dense, interactions], dim=1))
        )
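
The interaction step can be easier to follow on a single sample without tensors. The sketch below is a plain-Python illustration, not the module's code, and `interact` is a hypothetical helper: it computes the dot product of every pair of embeddings (i, j) with j <= i, which is the set of entries that `torch.tril_indices` selects from the pairwise product matrix, in row-major order.

```python
from typing import List


def interact(embeddings: List[List[float]]) -> List[float]:
    """Dot products for every pair (i, j) with j <= i, in row-major order."""
    out = []
    for i, row in enumerate(embeddings):
        for j in range(i + 1):
            out.append(sum(a * b for a, b in zip(row, embeddings[j])))
    return out


# Three toy embeddings of dimension 2 for one sample.
example = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
print(interact(example))

# In the tutorial's configuration there is 1 dense embedding plus 26 sparse ones.
embedding_count = 1 + 26
interaction_count = embedding_count * (embedding_count + 1) // 2
print(interaction_count)
```

With `embedding_dim=16`, as used in section 4, concatenating the dense embedding with these interactions gives `over_arch` an input of width 16 + 378 = 394 per sample.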

## 4. Training and evaluating model
We can now train an instance of `TestSparseNN` on data supplied by `train_datapipe`:

In [15]:
model = TestSparseNN(
    hidden_layer_size=20,
    output_dim=10,
    sparse_input_size=26,
    num_embeddings=NUM_EMBEDDINGS,
    embedding_dim=16,
)

# Initialize lazy modules.
int_x, cat_x, y = next(iter(train_datapipe))
model(int_x, cat_x)

loss_fn = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adagrad(model.parameters(), lr=1e-2, weight_decay=1e-6)

for batch_num, (int_x, cat_x, y) in enumerate(train_datapipe):
    res = model(int_x, cat_x)
    loss = loss_fn(res, y)
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    if batch_num % 1 == 0:
        loss, current = loss.item(), batch_num * len(y)
        print(f"loss: {loss:>7f}  {current}")

loss: 0.536839  0
loss: 0.127229  100
loss: 0.322483  200
loss: 0.155279  300
loss: 0.143317  400
loss: 0.244847  500
loss: 0.264295  600
loss: 0.136220  700
loss: 0.040959  800
loss: 0.080234  900
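
For reference, the quantity printed above is the mean binary cross-entropy computed directly from the model's raw outputs (logits). The following sketch reproduces that formula in plain Python for a handful of values; `bce_with_logits` is a hypothetical helper written for illustration, not the library's implementation.

```python
import math
from typing import Sequence


def bce_with_logits(logits: Sequence[float], targets: Sequence[float]) -> float:
    """Mean binary cross-entropy computed from raw scores (logits)."""
    total = 0.0
    for z, y in zip(logits, targets):
        # max(z, 0) - z * y + log(1 + exp(-|z|)) is a numerically stable form of
        # -[y * log(sigmoid(z)) + (1 - y) * log(1 - sigmoid(z))].
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# An uninformative logit of 0 costs log(2) regardless of the label.
print(round(bce_with_logits([0.0], [0.0]), 4))
# A confident, correct negative prediction costs much less.
print(round(bce_with_logits([-2.0], [0.0]), 4))
```

Working from logits rather than probabilities avoids taking the logarithm of values that have been rounded to exactly 0 or 1.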


We can then evaluate the trained model on data supplied by `val_datapipe`:

In [16]:
import sklearn.metrics


y_true = []
y_pred = []
with torch.no_grad():
    for int_x, cat_x, y in val_datapipe:
        pred = model(int_x, cat_x)
        y_pred.append(pred)
        y_true.append(y)

auroc = sklearn.metrics.roc_auc_score(
    torch.cat(y_true).view(-1),
    torch.sigmoid(torch.cat(y_pred).view(-1)),
)
val_loss = loss_fn(
    torch.cat(y_pred).view(-1),
    torch.cat(y_true).view(-1),
)

print("Test results:")
print(f"AUROC: {auroc:>8f} Avg loss: {val_loss:>8f}")

Test results:
AUROC: 0.530097 Avg loss: 0.126038
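
To help interpret the AUROC figure, recall that AUROC equals the probability that a randomly chosen positive sample is scored above a randomly chosen negative one, with ties counted as half. The sketch below computes it by brute-force pair counting on a toy input; `auroc_by_pairs` is a hypothetical helper, far slower than `sklearn.metrics.roc_auc_score`, and is meant only to illustrate the definition.

```python
from typing import Sequence


def auroc_by_pairs(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
print(auroc_by_pairs([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

A value of 0.5 corresponds to random ranking, so the score reported above indicates that the model, trained on only 1,000 samples, ranks clicks barely better than chance.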


## 5. Training and evaluating model on GPU

If we have access to a GPU device, we can leverage it as follows to accelerate model training and evaluation.

In [18]:
assert torch.cuda.is_available()

device = torch.device("cuda:0")

datapipe = criteo_terabyte(
    ("/home/jeffhwang/local/datasets/criteo/day_11.tsv",),
)
train_datapipe, val_datapipe = idx_split_train_val(datapipe, 0.7)
train_datapipe = train_datapipe.limit(int(1e6)).batch(1000).collate().map(_transform)
val_datapipe = val_datapipe.limit(int(1e5)).batch(1000).collate().map(_transform)

model.to(device)

int_x, cat_x, y = next(iter(train_datapipe))
int_x, cat_x, y = int_x.to(device), cat_x.to(device), y.to(device)
model(int_x, cat_x)

loss_fn = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adagrad(model.parameters(), lr=1e-2, weight_decay=1e-6)

for batch_num, (int_x, cat_x, y) in enumerate(train_datapipe):
    int_x, cat_x, y = int_x.to(device), cat_x.to(device), y.to(device)
    res = model(int_x, cat_x)
    loss = loss_fn(res, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if batch_num % 10 == 0:
        loss, current = loss.item(), batch_num * len(y)
        print(f"loss: {loss:>7f}  {current}")

y_true = []
y_pred = []
with torch.no_grad():
    for int_x, cat_x, y in val_datapipe:
        int_x, cat_x, y = int_x.to(device), cat_x.to(device), y.to(device)
        pred = model(int_x, cat_x)
        y_pred.append(pred)
        y_true.append(y)

auroc = sklearn.metrics.roc_auc_score(
    torch.cat(y_true).view(-1).cpu(),
    torch.sigmoid(torch.cat(y_pred).view(-1)).cpu(),
)
val_loss = loss_fn(
    torch.cat(y_pred).view(-1).cpu(),
    torch.cat(y_true).view(-1).cpu(),
)

print("Test results:")
print(f"AUROC: {auroc:>8f} Avg loss: {val_loss:>8f}")

loss: 0.120394  0
loss: 0.122464  10000
loss: 0.148007  20000
loss: 0.153441  30000
loss: 0.124577  40000
loss: 0.146918  50000
loss: 0.153290  60000
loss: 0.124814  70000
loss: 0.139988  80000
loss: 0.155026  90000
loss: 0.127663  100000
loss: 0.128998  110000
loss: 0.149047  120000
loss: 0.130756  130000
loss: 0.098698  140000
loss: 0.156221  150000
loss: 0.144935  160000
loss: 0.111010  170000
loss: 0.165140  180000
loss: 0.162504  190000
loss: 0.126021  200000
loss: 0.133386  210000
loss: 0.146822  220000
loss: 0.139569  230000
loss: 0.134351  240000
loss: 0.143748  250000
loss: 0.127452  260000
loss: 0.150848  270000
loss: 0.110147  280000
loss: 0.121761  290000
loss: 0.148827  300000
loss: 0.135799  310000
loss: 0.143518  320000
loss: 0.147040  330000
loss: 0.147874  340000
loss: 0.158020  350000
loss: 0.117170  360000
loss: 0.160049  370000
loss: 0.124524  380000
loss: 0.147427  390000
loss: 0.143686  400000
loss: 0.145967  410000
loss: 0.140429  420000
loss: 0.129113  430000
lo