# 04. Modeling - Nubank AI Core Transaction Dataset Interview Project

In this section we will go over the theory and practice of the models we will train to extract deep representations of our data.

## The Language Task == The Sequence Task

Modern transformer-based architectures use the self-attention mechanism to capture long-range dependencies and contextual information in sequence data, learning deep representations of those sequences. Although they are applied most heavily to text (leading to models such as GPT, LLaMA and BERT), their core idea applies to any type of sequence data, including our task of extracting deep features from transaction sequences.

These models learn such representations through the self-supervised task of token masking. Given an input sequence, some of its tokens are hidden from the model, whose objective is to reconstruct the missing tokens. How the tokens are masked varies from model to model.

The primary objective of GPT-style models ([1](https://arxiv.org/pdf/2005.14165)) is unidirectional sequence modeling: predicting the next token in a sequence given all the previous tokens. This works well for language modeling because language is mostly unidirectional, as most languages in the West are written from left to right. It also allows the model to be used as an auto-regressive generator: given an input sequence, the model outputs the most probable next token, appends it to the input, and repeats the process.
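The auto-regressive loop described above can be sketched with a toy stand-in for the model. The lookup table `NEXT_TOKEN` and the function `generate` are hypothetical names chosen only for illustration; a real GPT-style model would replace the lookup with a forward pass that scores every vocabulary token given the full prefix.

```python
# Toy "model": maps the last token to its most probable successor.
# Assumption for illustration only: a real model conditions on the whole
# prefix, not just the last token.
NEXT_TOKEN = {
    "<s>": "card",
    "card": "payment",
    "payment": "approved",
    "approved": "</s>",
}


def generate(prompt, max_new_tokens=10):
    """Greedy decoding: append the predicted token, then feed the result back in."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_token = NEXT_TOKEN.get(tokens[-1], "</s>")
        tokens.append(next_token)
        if next_token == "</s>":  # stop once the end-of-sequence token appears
            break
    return tokens


generated = generate(["<s>", "card"])
print(generated)
```

Each iteration only ever looks at tokens to its left, which is what makes the objective unidirectional.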

However, for language-understanding downstream tasks such as sequence classification, bidirectional models such as BERT ([2](https://arxiv.org/pdf/1810.04805)) have an edge. BERT employs the Masked Language Modeling (MLM) objective, which involves randomly masking 15% of the input tokens and training the model to predict these masked tokens based on both left and right context.
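A minimal sketch of the masking step, assuming token ids are plain integers. `MASK_ID`, `IGNORE_INDEX`, and `mask_tokens` are illustrative names, and the sketch always substitutes the mask token, whereas BERT replaces a selected token with `[MASK]` only 80% of the time (10% random token, 10% unchanged).

```python
import random

MASK_ID = 103         # assumed id of the [MASK] token, for illustration
IGNORE_INDEX = -100   # label value commonly used to exclude a position from the loss


def mask_tokens(input_ids, mask_prob=0.15, seed=0):
    """Select each position with probability `mask_prob` and hide it from the model."""
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, token_id in enumerate(input_ids):
        if rng.random() < mask_prob:
            labels[i] = token_id   # the model must reconstruct this token
            masked[i] = MASK_ID    # the model only sees the mask placeholder
    return masked, labels


ids = list(range(1000, 1020))
masked, labels = mask_tokens(ids)
print(masked)
print(labels)
```

The loss is computed only at positions whose label is not `IGNORE_INDEX`, so the model is trained purely on reconstructing the hidden tokens from the context on both sides.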

## Modeling Tabular Sequences with BERT

The TabFormer paper ([3](https://arxiv.org/pdf/2011.01843)) by IBM Research introduced the idea of modeling tabular time series data through a language modeling task. The paper introduces TabBERT, a model that can be pre-trained end-to-end for representation learning on tabular time series data and then fine-tuned for specific tasks such as classification and regression.

Their insight was that, through the *language metaphor*, continuous fields can be quantized to define a finite vocabulary for the features of a given tabular series. The resulting tokens can then be concatenated and trained on as a sequence, much like in an NLP task.
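As a rough illustration of quantization, the sketch below maps a continuous amount to one of a fixed set of tokens. Quantile binning, the helper names `fit_bin_edges` and `amount_to_token`, and the `[AMOUNT_k]` token format are all assumptions made for this example, not the scheme used by the paper or by this project.

```python
from bisect import bisect_right


def fit_bin_edges(values, num_bins):
    """Return num_bins - 1 interior edges at evenly spaced quantiles of `values`."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // num_bins] for k in range(1, num_bins)]


def amount_to_token(amount, edges):
    """Map a continuous amount to a discrete vocabulary token."""
    bin_index = bisect_right(edges, amount)  # number of edges <= amount
    return f"[AMOUNT_{bin_index}]"


amounts = [3.5, 12.0, 18.99, 25.0, 47.3, 80.0, 150.0, 999.0]
edges = fit_bin_edges(amounts, num_bins=4)
tokens = [amount_to_token(a, edges) for a in amounts]
print(edges)
print(tokens)
```

With the vocabulary fixed to `num_bins` amount tokens, every numeric value becomes a symbol the model can embed like any other word.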

However, their approach did not account for text data in the fields. They used only categorical and numerical features, adding new tokens based on the categorical and quantized values of the tables. While this achieves great results, it misses two opportunities:

1. Text data can be extremely valuable, allowing the model to learn representations from text that correlate with non-text features.
2. A whole new tokenizer makes the available models pre-trained on text obsolete. By including text data, we can leverage the representations already learned during pre-training.

In the dataset provided by Nubank, most of the fields are text, which can be used to extract deep representations of each transaction and of transaction sequences. Therefore, when training NuBERT, we include these text fields, as discussed in Section 3 - Tokenization.

### Training Pipeline

For training, we use the tokenized sequences built from the cleaned dataset described earlier. The training framework is Hugging Face's `transformers` and `datasets` libraries, whose high-level interfaces let us train these models without much boilerplate code. The model is `distilbert`, a distilled version of BERT with 40% fewer parameters. Training logs are tracked with WandB. The hyperparameters used for this sample training run are:

*Note: this training run is for demonstration purposes only. The experimental training runs were done either through the nubert scripts or in the next notebooks.*

- Number of transactions per sequence: 5
- Stride: 1
- Amount Bins: 20
- Number of training epochs: 1
- Train/val batch sizes: 128
- Max sequence length: 512
- Train/val/test split: [0.8, 0.1, 0.1]
- Optimizer: AdamW | Beta1 = 0.9, Beta2 = 0.999
- Initial learning rate: 5e-5
- Learning rate scheduler: Linear
- Warmup steps: 1000
- Gradient accumulation steps: 1
- Use bf16: True
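To make the first two hyperparameters concrete, here is a minimal sketch of how a window of 5 transactions with a stride of 1 could slide over a transaction history. The function `sliding_windows` is a hypothetical helper; it assumes the transactions are already in chronological order and does not claim to reproduce how `NuDataset` builds its samples.

```python
def sliding_windows(transactions, window_size=5, stride=1):
    """Group consecutive transactions into (possibly overlapping) windows."""
    last_start = len(transactions) - window_size
    return [
        transactions[start:start + window_size]
        for start in range(0, last_start + 1, stride)
    ]


transactions = [f"txn_{i}" for i in range(8)]
windows = sliding_windows(transactions)
for window in windows:
    print(window)
```

With a stride of 1, consecutive windows share four of their five transactions, which multiplies the number of training samples at the cost of heavy overlap between them.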

At the time of writing, this model was trained on an A6000-45GB GPU.

In [1]:
import os
import argparse
import logging

from transformers import (
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
    set_seed,
)
from datasets import Dataset
from sklearn.model_selection import train_test_split
from nubert import NuDataset

In [2]:
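def split_dataset(dataset, test_size=0.1, val_size=0.1, seed=42):
    """Split `dataset` into train/val/test; both sizes are fractions of the full dataset."""
    train_val, test = train_test_split(dataset, test_size=test_size, random_state=seed)
    # The second split only sees the remaining (1 - test_size) share, so rescale val_size.
    train, val = train_test_split(train_val, test_size=val_size / (1 - test_size), random_state=seed)

    return train, val, test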

def create_hf_dataset(data):
    return Dataset.from_dict({"input_ids": data})

def resize_model_embeddings(model, tokenizer):
    """Resize the model's embeddings to match the tokenizer's vocabulary size."""
    model.resize_token_embeddings(len(tokenizer))
    return model
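
The rescaling inside `split_dataset` can be checked with a little arithmetic. The sketch below uses a hypothetical helper, `split_sizes`, that approximates the sizes of two chained hold-out splits with plain rounding; scikit-learn's own rounding may differ by a sample.

```python
def split_sizes(n, test_size=0.1, val_size=0.1):
    """Approximate (train, val, test) sizes of two chained hold-out splits."""
    n_test = round(n * test_size)
    n_train_val = n - n_test
    # The second split works on the remaining samples only, so the validation
    # fraction is divided by (1 - test_size) to stay at val_size of the total.
    n_val = round(n_train_val * val_size / (1 - test_size))
    n_train = n_train_val - n_val
    return n_train, n_val, n_test


n_train, n_val, n_test = split_sizes(427646)
print(n_train, n_val, n_test)
```

With the default arguments, roughly 80% of the samples end up in the training set and 10% each in validation and test.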

In [3]:
model_name = "distilbert/distilbert-base-uncased"
dataset_path = "/notebooks/nubank/nugpt/analyses/nubank-2013-2014/"
num_transactions = 5
stride = 1
max_length = 512
num_amount_bins = 20

full_dataset = NuDataset.from_raw_data(
    root=dataset_path,
    fname="nubank_raw",
    num_bins=num_amount_bins,
    model_name=model_name,
    num_transaction_sequences=num_transactions,
    max_seq_len=max_length,
    stride=stride,
)
summary = full_dataset.get_summary(verbose=True)

100%|██████████| 111/111 [15:17<00:00,  8.27s/it]

--------------------------------------------------
Dataset Summary:
num_samples: 427646
num_tokens: 92343162
num_features: 9
features: Index(['Agency Name', 'Vendor', 'Merchant Category Code (MCC)', 'Timestamp',
       'Amount', 'Transaction Date', 'Original Amount', 'Amount Min',
       'Amount Max'],
      dtype='object')
num_transaction_sequences: 5
max_seq_len: 512
--------------------------------------------------





In [4]:
output_dir = "/notebooks/nuvank/nubert"
tokenizer = full_dataset.tokenizer.base_tokenizer

model = AutoModelForMaskedLM.from_pretrained(model_name)

tokenizer.save_pretrained(output_dir)
model = resize_model_embeddings(model, tokenizer)

train_data, val_data, test_data = split_dataset(full_dataset.data)

train_dataset = create_hf_dataset(train_data)
val_dataset = create_hf_dataset(val_data)
test_dataset = create_hf_dataset(test_data)

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)


In [5]:
import os

os.environ["WANDB_PROJECT"] = "nubert"
os.environ["WANDB_LOG_MODEL"] = "end"

run_name = f"nubert-distil-transactions-{num_transactions}-stride-{stride}"

training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    num_train_epochs=1.0,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    learning_rate=5e-5,
    bf16=True,
    save_total_limit=1,
    evaluation_strategy="epoch",
    remove_unused_columns=False,
    report_to="wandb",
    run_name=run_name,
    save_strategy="epoch",
    load_best_model_at_end=True,
    logging_steps=2,
)



In [6]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

trainer.train()

trainer.save_model()
tokenizer.save_pretrained(output_dir)

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.2225        | 0.215927        |


There were missing keys in the checkpoint model loaded: ['vocab_projector.weight'].


('/notebooks/nuvank/nubert/tokenizer_config.json',
 '/notebooks/nuvank/nubert/special_tokens_map.json',
 '/notebooks/nuvank/nubert/vocab.txt',
 '/notebooks/nuvank/nubert/added_tokens.json',
 '/notebooks/nuvank/nubert/tokenizer.json')