In [1]:
############################################################################
##
## Copyright (C) 2022 NVIDIA Corporation.  All rights reserved.
##
## NVIDIA Sample Code
##
## Please refer to the NVIDIA end user license agreement (EULA) associated
## with this source code for terms and conditions that govern your use of
## this software. Any use, reproduction, disclosure, or distribution of
## this software and related documentation outside the terms of the EULA
## is strictly prohibited.
##
############################################################################

# 1. Introduction to Megatron-LM Framework

<a href="https://github.com/NVIDIA/Megatron-LM">Megatron</a> [[1](#1_1),[2](#1_2)] is a large, powerful transformer developed by the [Applied Deep Learning Research team at NVIDIA](https://nv-adlr.github.io/). The repository is for ongoing research on training large transformer language models at scale. We developed efficient model-parallel (tensor and pipeline) and multi-node pre-training of transformer-based models such as GPT, BERT, and T5 using mixed precision.

Below are some of the projects where we have directly used Megatron:
    
   - FinMegatron: Large Financial Domain Language Models
   - BioMegatron: Larger Biomedical Domain Language Model
   - End-to-End Training of Neural Retrievers for Open-Domain Question Answering
   - Large Scale Multi-Actor Generative Dialog Modeling
   - Local Knowledge Powered Conversational Agents
   - MEGATRON-CNTRL: Controllable Story Generation with External Knowledge Using Large-Scale Language Models
   - RACE Reading Comprehension Dataset Leaderboard
   - Scaling Language Model Training to a Trillion Parameters Using Megatron
   - Training Question Answering Models From Synthetic Data

Megatron-LM implements <u>pipeline model parallelism</u> and <u>tensor model parallelism</u> to efficiently distribute workloads across multiple GPUs and multiple nodes. 

- Pipeline parallelism assigns different layers of the model to different GPUs
- Tensor parallelism splits the weights of an individual layer across multiple GPUs

Our Megatron-LM framework will handle this for you using `tensor-model-parallel-size` and `pipeline-model-parallel-size` parameters found in the `pretrain_step.sh` script, which we will use to train our model. Our model config file has `TENSOR_MP_SIZE` and `PIPELINE_MP_SIZE` variables, which are passed to the `pretrain_step.sh` script for convenience.
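To see how these two settings interact with the number of GPUs, the sketch below computes how many data-parallel model replicas remain once the tensor- and pipeline-parallel sizes are fixed. The function name and the 16-GPU example are assumptions for illustration; the underlying relationship is that the total GPU count is the product of the tensor-, pipeline-, and data-parallel sizes.

```python
def data_parallel_size(world_size: int, tensor_mp_size: int, pipeline_mp_size: int) -> int:
    """Return how many model replicas fit on `world_size` GPUs (hypothetical helper)."""
    gpus_per_replica = tensor_mp_size * pipeline_mp_size
    if world_size % gpus_per_replica != 0:
        raise ValueError("world_size must be divisible by tensor_mp_size * pipeline_mp_size")
    return world_size // gpus_per_replica


# Assumed example: 16 GPUs, each layer split over 2 GPUs, layers split into 4 stages.
replicas = data_parallel_size(world_size=16, tensor_mp_size=2, pipeline_mp_size=4)
print(replicas)  # 2
```

Each replica here occupies 8 GPUs, so 16 GPUs hold two replicas that train on different batches of data.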


If you are familiar with the <a href="https://pytorch.org/">PyTorch</a> Framework, you can move parts of a Neural Network model to different GPUs using `layer.to(device)`. This approach of moving different parts of a model to different GPUs lets the data scientist implement pipeline parallelism by hand. 


<br>
<center><img src=images/model_parallelism.png width="40%" height="40%" style="display=block; margin:auto" alt="model parallelism"/></center>
<br>

<div><font size="4">While the idea of pipeline parallelism (placing neural network layers on different devices) makes intuitive sense, let's explore the idea behind tensor parallelism further by example. Consider the following matrix multiplication taking place on a single GPU (GPU 0), using the input values and the first layer of a neural network:</font></div>

<center><img src=images/tensor_parallelism_whole.png width="40%" height="40%" style="display=block; margin:auto" alt="matrix multiplication for tensor parallelism"/></center>

<div><font size="4">Notice that we can split this matrix multiplication across multiple GPUs by splitting the matrices as follows:</font></div>


<center><img src=images/tensor_parallelism_split.png width="40%" height="40%" style="display=block; margin:auto" alt="matrix multiplication for tensor parallelism distributed across two GPUs"/></center>

<div><font size="4">The values are the same! Tensor parallelism is therefore especially useful for very large deep learning models where <u>it may be impossible to fit an entire layer on a single GPU, or where synchronization is important.</u></font></div>
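The same equivalence can be checked numerically. The sketch below uses plain Python lists and made-up matrix values, and it assumes the weight matrix is split by columns; the figures may use a different split. Each "GPU" multiplies the full input by its own half of the weights, and the partial results are concatenated.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


X = [[1, 2], [3, 4]]                    # inputs: 2 samples x 2 features
W = [[5, 6, 7, 8], [9, 10, 11, 12]]     # layer weights: 2 features x 4 units

# Single device: one large multiplication.
Y_full = matmul(X, W)

# Two devices: each one holds half of the weight columns.
W_gpu0 = [row[:2] for row in W]
W_gpu1 = [row[2:] for row in W]
Y_gpu0 = matmul(X, W_gpu0)
Y_gpu1 = matmul(X, W_gpu1)

# Gather step: concatenate the partial outputs column-wise.
Y_split = [r0 + r1 for r0, r1 in zip(Y_gpu0, Y_gpu1)]

print(Y_full)             # [[23, 26, 29, 32], [51, 58, 65, 72]]
print(Y_split == Y_full)  # True
```

Neither device ever stores the full weight matrix, which is what makes this split attractive when a layer is too large for one GPU.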


<div><font size="4">When we train a deep learning model, data must pass from the beginning of the network through to the end, and then gradients are backpropagated to the beginning. For very large deep learning models, communication across the multiple GPUs can become a bottleneck. The top image below shows a "bubble" effect (in grey) for 4 GPUs, where GPUs sit idle while they wait for data from the other stages. In the bottom image, Megatron handles this by creating multiple virtual stages per GPU, which shrinks the bubble and reduces training time.</font></div>

<center><img src=images/interleaving.png width="40%" height="40%" style="display=block; margin:auto" alt="reducing communication in multi-GPU training with interleaving"/></center>
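The size of the bubble can be estimated with a short calculation. The sketch below uses the estimate from the Megatron pipeline-parallelism analysis, where the ratio of idle time to ideal compute time is `(p - 1) / m` for `p` pipeline stages and `m` microbatches, and is divided by `v` when each GPU hosts `v` virtual stages. The function name and the choice of 8 microbatches are assumptions for illustration.

```python
def bubble_fraction(num_stages: int, num_microbatches: int, virtual_stages_per_gpu: int = 1) -> float:
    """Ratio of pipeline idle time to ideal compute time (hypothetical helper)."""
    return (num_stages - 1) / (virtual_stages_per_gpu * num_microbatches)


# 4 GPUs as in the figure; 8 microbatches per batch is an assumed value.
plain = bubble_fraction(num_stages=4, num_microbatches=8)
interleaved = bubble_fraction(num_stages=4, num_microbatches=8, virtual_stages_per_gpu=2)

print(plain)        # 0.375
print(interleaved)  # 0.1875
```

Doubling the number of virtual stages per GPU halves the estimated bubble, at the cost of more frequent transfers between stages.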

Megatron is also used in <a href="https://developer.nvidia.com/nvidia-nemo#nemo-megatron">NeMo Megatron</a>, a framework to help enterprises overcome the challenges of building and training sophisticated natural language processing models with billions and trillions of parameters.

Our codebase is capable of efficiently training very large (hundreds of billions of parameters) language models with both model and data parallelism. To demonstrate how the code scales with multiple GPUs and model sizes, we consider GPT models from 1 billion all the way to 1 trillion parameters. All models use a vocabulary size of 51,200 and a sequence length of 2048. 

We vary <u>hidden size, number of attention heads, and number of layers to arrive at a specific model size.</u> As the model size increases, we also modestly increase the batch size. 
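To make the link between these settings and model size concrete, the sketch below uses the parameter-count approximation from the Megatron scaling study, `P = 12 l h^2 (1 + 13/(12h) + (V + s)/(12 l h))`, with `l` layers, hidden size `h`, vocabulary size `V`, and sequence length `s`. The function name and the example configuration (128 layers, hidden size 25600) are assumptions for illustration. For a fixed hidden size, the number of attention heads does not appear in the formula.

```python
def gpt_parameter_count(num_layers: int, hidden_size: int,
                        vocab_size: int = 51200, seq_length: int = 2048) -> float:
    """Approximate number of parameters in a GPT model (hypothetical helper)."""
    l, h = num_layers, hidden_size
    return 12 * l * h**2 * (1 + 13 / (12 * h) + (vocab_size + seq_length) / (12 * l * h))


# Assumed configuration for a model on the order of one trillion parameters.
params = gpt_parameter_count(num_layers=128, hidden_size=25600)
print(f"{params:.3e}")  # on the order of 1e12
```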

We leverage NVIDIA's Selene supercomputer to perform scaling studies and use up to 3072 A100 GPUs for the largest model. The table below shows the model configurations along with the achieved FLOPs (both per GPU and aggregate over all GPUs). Note that these results are from benchmark runs and these models were not trained to convergence; however, the FLOPs are measured for end-to-end training, i.e., they include all operations, including data loading, optimization, and even logging.

# 2. Scaling Megatron up to a Supercomputer

The Megatron-LM framework has been used to train models ranging from small to very large, scaling up to NVIDIA Selene, which ranked <a href="https://www.top500.org/system/179842/">6th on the Top 500</a> list of supercomputers as of November 2021. We will go over the `pretrain_step.sh` script at the end of this notebook, which details how to adjust the model size.

Large supercomputers can have bottlenecks in compute, storage, or networking. If one of these components is slow, it can slow down the whole application. For very large NLP models trained on GPUs, a common bottleneck is the network and storage, because the training data needs to be read from disk and transferred over the network to feed the GPUs. In addition, large models may span multiple GPUs or nodes, which requires sending and receiving model parameters. <a href="https://www.nvidia.com/en-us/data-center/dgx-superpod/">NVIDIA DGX SuperPOD™</a> is an AI data center infrastructure platform that enables IT to deliver performance without compromise for every user and workload. DGX SuperPOD offers leadership-class accelerated infrastructure and agile, scalable performance for the most challenging AI and high-performance computing (HPC) workloads, with industry-proven results. DGX SuperPOD uses InfiniBand networking to achieve best-in-class performance and includes advantages such as in-network compute, which can speed up training times.

<center><img src="images/cases_april2021.png" width="50%" height="50%" style="display=block; margin:auto"/></center>

# 3. Train GPT on multi-node multiple GPUs

Training using the Megatron Framework involves 4 steps:

<center><img src="images/E2E_Megatron_steps.png" width="50%" height="50%" style="display=block; margin:auto"/></center>

<div><font size="4">In the rest of the notebook, we will focus on <strong>(1) data preprocessing of credit card transactions</strong>.</font></div><br>


Megatron expects training data in <em>JSON lines</em> format: one JSON object per line. By default, Megatron expects the `text` field to contain the document (in this case, transactions) of interest.

```json
{"text": "791,1,68.00,2018-01-02 09:10:00,2018,1,2,9,10,Swipe Transaction,12345536,New York,NY,10017,8005,<NA>,0\n"}
```
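As a small illustration of the format, the sketch below writes two documents as JSON lines and reads them back using only the standard library. The document contents are made-up placeholders, and an in-memory buffer stands in for a real file.

```python
import io
import json

# Hypothetical documents; each one may itself contain several newline-terminated rows.
docs = ["791|1|68.00\n791|1|-68.00\n", "1572|0|572.42\n"]

buffer = io.StringIO()  # stands in for an open file handle
for doc in docs:
    # json.dumps escapes the inner newlines, so every object stays on one physical line.
    buffer.write(json.dumps({"text": doc}) + "\n")

lines = buffer.getvalue().splitlines()
loaded = [json.loads(line)["text"] for line in lines]

print(len(lines))      # 2
print(loaded == docs)  # True
```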

We can model temporal information by placing continuous sets of credit card transactions in the `text` field. Megatron defines a special `<|endoftext|>` token to denote the beginning and end of a document. The attention mask stops at the boundary of the `<|endoftext|>` token, so no attention is applied across it. To model temporal information in the time series, we therefore want to make sure that `<|endoftext|>` is added between continuous sections. We'll come back to how to model this on the credit card dataset after we have tokenized our data.
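The effect of the boundary on attention can be sketched in a few lines. The code below is not Megatron's implementation; it is a minimal illustration, with a hypothetical function name and made-up token ids, of a causal mask in which tokens after an end-of-document token cannot attend to anything at or before it.

```python
def document_attention_mask(token_ids, eod_id):
    """Return mask[i][j] == True when position i may attend to position j."""
    n = len(token_ids)
    # Start from a causal mask: each position sees itself and earlier positions.
    mask = [[j <= i for j in range(n)] for i in range(n)]
    for pos, token in enumerate(token_ids):
        if token == eod_id:
            # Positions after the boundary lose access to everything up to and including it.
            for i in range(pos + 1, n):
                for j in range(pos + 1):
                    mask[i][j] = False
    return mask


EOD = 0  # assumed id of the end-of-document token
mask = document_attention_mask([7, 8, EOD, 9, 10], eod_id=EOD)
for row in mask:
    print(["x" if allowed else "." for allowed in row])
```

In the printed grid, the last two rows only have marks in the last two columns, which shows that the second document never looks back into the first.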

# 4. Tokenizing text

This link contains a <a href="https://huggingface.co/docs/transformers/tokenizer_summary">summary of common tokenizers</a> used in GPT, BERT, and other models.

BPE stands for Byte Pair Encoding; it is the tokenizer used in GPT2. The <a href="https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf">GPT2 paper</a> and the <a href="https://arxiv.org/pdf/1508.07909.pdf">BPE tokenizer paper</a> describe the tokenization process in further detail.

In the section below, we will explore the BPE tokenizer and explain its benefits and drawbacks for tabular data tokenization. Then we will implement a custom `TabularTokenizer` class, which will attempt to address some issues with BPE tokenization for tabular data.

## 4.1 Original BPE Tokenizer

Overview:
- Frequent pairs of characters are iteratively merged together to construct a model vocabulary
- Hyperparameter: number of merge iterations


The motivation for the BPE tokenizer is to create a word segmentation algorithm where frequent pairs of characters are merged together. For example, the pair `('A', 'B')` can be merged into the symbol `('AB')`. Merged symbols can themselves be merged again, eventually yielding whole words. The number of merge operations is the only hyperparameter of the algorithm.

From the paper, the code to implement BPE in Python is:

In [2]:
"""
BPE algorithm from Neural Machine Translation of Rare Words with Subword Units
https://arxiv.org/pdf/1508.07909.pdf
"""

import re, collections

def get_stats(vocab):
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols)-1):
            pairs[symbols[i],symbols[i+1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out

vocab = {'l o w </w>' : 5, 'l o w e r </w>' : 2,
         'n e w e s t ? </w>':6, 'w i d e s t ? </w>':3}

NUM_MERGES = 10  # HYPERPARAMETER

for i in range(1, NUM_MERGES+1):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
#     print(f'Iteration: {i}\nPairs: {pairs}\nMerging Pair: {best}\nVocab: {vocab}\n')
    print(f'Iteration: {i}\nMerging Pair: {best}\nVocab: {vocab}\n')


Iteration: 1
Merging Pair: ('e', 's')
Vocab: {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w es t ? </w>': 6, 'w i d es t ? </w>': 3}

Iteration: 2
Merging Pair: ('es', 't')
Vocab: {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w est ? </w>': 6, 'w i d est ? </w>': 3}

Iteration: 3
Merging Pair: ('est', '?')
Vocab: {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w est? </w>': 6, 'w i d est? </w>': 3}

Iteration: 4
Merging Pair: ('est?', '</w>')
Vocab: {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w est?</w>': 6, 'w i d est?</w>': 3}

Iteration: 5
Merging Pair: ('l', 'o')
Vocab: {'lo w </w>': 5, 'lo w e r </w>': 2, 'n e w est?</w>': 6, 'w i d est?</w>': 3}

Iteration: 6
Merging Pair: ('lo', 'w')
Vocab: {'low </w>': 5, 'low e r </w>': 2, 'n e w est?</w>': 6, 'w i d est?</w>': 3}

Iteration: 7
Merging Pair: ('n', 'e')
Vocab: {'low </w>': 5, 'low e r </w>': 2, 'ne w est?</w>': 6, 'w i d est?</w>': 3}

Iteration: 8
Merging Pair: ('ne', 'w')
Vocab: {'low </w>': 5, 'low e r </w>': 2, 'new est?</w>': 6, 'w i d est?</w>': 3}


Play around with the value of `NUM_MERGES`.
As `NUM_MERGES` increases, the `vocab` contains a greater number of full words (no spaces). Think about what happens if we set `NUM_MERGES` to a large value, for example `NUM_MERGES > 15`.

By adjusting `NUM_MERGES`, we can tune our vocabulary to have more individual letters when `NUM_MERGES` is small and more full words when `NUM_MERGES` is larger.

## 4.2 GPT2 BPE optimizations over original implementation

- No merging across punctuation, with an exception for spaces

The authors of GPT2 noted in their paper that the BPE tokenizer produced many versions of the same word with punctuation attached, like `dog.`, `dog?`, and `dog!`. We obtained an analogous result in the third merge iteration above, where `est` was merged with `?` to make the symbol `est?`. As a result, the original BPE tokenizer implementation did not use the model vocabulary optimally. The GPT2 authors prevented punctuation, except for spaces, from being used in merging. This approach keeps the generality of the original BPE tokenizer while freeing vocabulary slots for additional tokens within the same vocabulary size.

The full GPT2 BPE tokenizer implementation is located [here](./megatron/tokenizer/gpt2_tokenization.py).
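One way to picture this restriction is as a pre-tokenization step that cuts the text into pieces before any merging happens, so that merges can only occur inside a piece. The sketch below is a simplified, ASCII-only approximation written with the standard `re` module; the GPT2 pattern relies on Unicode character categories, so treat the pattern and the name `PRETOKENIZE` as assumptions for illustration.

```python
import re

# Simplified ASCII-only approximation: contractions, letter runs, digit runs,
# punctuation runs, and whitespace, each optionally preceded by one space.
PRETOKENIZE = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+"
)

pieces = PRETOKENIZE.findall("dog. dog? dog!")
print(pieces)  # ['dog', '.', ' dog', '?', ' dog', '!']
```

Because `dog` and the punctuation that follows it land in separate pieces, a merge such as `dog` + `?` can never be learned, while the leading space stays attached to the word.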

### Tokenization of Tabular Data

A tokenizer applied to a tabular corpus should incorporate the structural information in each column of the table. As we learned above, the GPT2 BPE tokenizer merges tokens together based on frequency of occurrence. Thus, for a given column, the tokenizer can yield a different number of tokens on any two rows.

For float columns, the structural information of the float numbers can get lost if an NLP tokenizer is used.

For example, we could break down the float `19.2` into the tokens `1`, `9.`, and `2` and merge these together to form `19.2` again, but the tokens have lost their meaning, which is that `1` is the tens digit, `9` is the ones digit, and `2` is in the first position after the decimal point.

Moreover, for columns with only a few distinct values, this is a wasteful use of the vocabulary, as we can simply `factorize` those columns. For example, take a look at the `Use Chip` column:

```python
df['Use Chip'].value_counts().to_pandas().to_dict()

Out:
{'Swipe Transaction': 15386082,
 'Chip Transaction': 6287598,
 'Online Transaction': 2713220}
```

Computing the merges as before, and showing only the last iteration:

In [3]:
vocab = {'S w i p e   T r a n s a c t i o n </w>': 15386082,
         'C h i p   T r a n s a c t i o n </w>': 6287598,
         'O n l i n e   T r a n s a c t i o n </w>': 2713220}

NUM_MERGES = 10

for i in range(1, NUM_MERGES+1):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
print(f'Iteration: {i}\nPairs: {pairs}\nMerging Pair: {best}\nVocab: {vocab}\n')

Iteration: 10
Pairs: defaultdict(<class 'int'>, {('S', 'w'): 15386082, ('w', 'i'): 15386082, ('i', 'p'): 21673680, ('p', 'e'): 15386082, ('e', 'Transactio'): 18099302, ('Transactio', 'n'): 24386900, ('n', '</w>'): 24386900, ('C', 'h'): 6287598, ('h', 'i'): 6287598, ('p', 'Transactio'): 6287598, ('O', 'n'): 2713220, ('n', 'l'): 2713220, ('l', 'i'): 2713220, ('i', 'n'): 2713220, ('n', 'e'): 2713220})
Merging Pair: ('Transactio', 'n')
Vocab: {'S w i p e   Transaction </w>': 15386082, 'C h i p   Transaction </w>': 6287598, 'O n l i n e   Transaction </w>': 2713220}



After 10 merge iterations, BPE has still not produced the 3 tokens that are all this column needs: `Swipe`, `Chip`, and `Online` remain split into single characters.
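By contrast, factorizing the column assigns one integer per distinct value directly. The sketch below shows the idea in plain Python on a handful of made-up rows; the variable names are illustrative and this is not the notebook's `CategoricalTokenizer`.

```python
# A few example rows of the `Use Chip` column.
values = ['Swipe Transaction', 'Chip Transaction', 'Online Transaction', 'Swipe Transaction']

# Assign ids in order of first appearance.
token_of = {}
for value in values:
    token_of.setdefault(value, len(token_of))

# Reverse lookup for decoding.
value_of = {token: value for value, token in token_of.items()}

encoded = [token_of[value] for value in values]
decoded = [value_of[token] for token in encoded]

print(encoded)            # [0, 1, 2, 0]
print(decoded == values)  # True
```

Every row of the column costs exactly one token, and the vocabulary holds exactly three entries for it.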

## 4.3 So what alternative tokenization options exist for Tabular data?

As shown in the [TabFormer paper](https://arxiv.org/abs/2011.01843), from which this dataset is derived, the authors constructed a tabular tokenizer that considers the structural information in the table and <strong>uses the same number of tokens (1) per column</strong>, akin to label encoding, or factorization. However, since TabFormer uses a single token for each column, it can cause either accuracy loss if the number of distinct tokens for the column is small, or weak generalization if the number of distinct tokens is too large. We improve on it by using multiple tokens to encode a column. For example, the floating-point number "134.09" can be tokenized into multiple integers.

Those familiar with `quantization` methods may know how real numbers can be "digitized" to discrete integers at the loss of a little accuracy. The most popular quantization method is rounding a number up or down to the nearest integer. Another, density-based approach is to calculate a histogram of the floating-point numbers and use the bin number as the representation of a number. Given the bin edges `[0,1), [1,4), [4,10), [10,12)`, we would represent the number `2.71828` as `1`, since it falls in bin 1 (bin 0 is the first bin).
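The bin lookup described above can be written with the standard `bisect` module. This is a minimal sketch using the bin edges from the example; the function name `bin_index` is hypothetical.

```python
from bisect import bisect_right

# Edges of the half-open bins [0,1), [1,4), [4,10), [10,12).
BIN_EDGES = [0, 1, 4, 10, 12]


def bin_index(x, edges=BIN_EDGES):
    """Return the zero-based index of the bin that contains x."""
    if not edges[0] <= x < edges[-1]:
        raise ValueError(f"{x} is outside the histogram range")
    return bisect_right(edges, x) - 1


print(bin_index(2.71828))  # 1
print(bin_index(4))        # 2
```

Decoding such a token can only return a representative value of the bin, for example its midpoint, which is where the small loss of accuracy comes from.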

We implemented our own custom encoder/decoder used in the tokenizer. It uses FloatTokenizer for the `Amount` column, and a categorical tokenizer for the rest of the columns. The trained encoder and decoder are saved to a file and can be used later. We choose to use 4 tokens to encode the floating numbers in this example. Using more tokens will yield a more accurate reverse-lookup.

Briefly, the method we implemented for tokenizing floating point numbers is the following:

- Shift the column so that its minimum value is 0 by subtracting the minimum
- Compress these values by taking `log1p`
- Compute the maximum number of digits of these log values using the base-10 logarithm, `log10`
- Allow for extra user-defined precision using `extra_digits = code_len - max_digits`, where `code_len` is user-defined and `max_digits` was calculated in the prior step
- Compute and save a lookup and reverse-lookup table for the digits up to the precision defined by `extra_digits`

    
So if the `extra_digits` is 4 and a logarithmed value is: `1.23456`, we would truncate after the `4`
    
Computing the reverse transform follows the reverse steps, namely compute
``` python 
reverse_transformed_value = np.exp(v / 10**self.extra_digits) + min_value - 1.0
```
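The forward and reverse transforms can be sketched end to end in plain Python. This is an illustrative sketch under stated assumptions, not the implementation in `coder/column_code.py`: the function names, the sample values, and `code_len=4` are made up, and the lookup tables that map each digit position to its own token id are omitted, so the "tokens" here are just the digits.

```python
import math


def fit(values, code_len=4):
    """Learn the shift and the number of extra digits from the data."""
    min_value = min(values)
    max_log = max(math.log1p(v - min_value) for v in values)
    # Number of digits before the decimal point of the largest log value.
    max_digits = max(1, math.floor(math.log10(max_log)) + 1) if max_log > 0 else 1
    return min_value, code_len - max_digits


def encode(x, min_value, extra_digits, code_len=4):
    """Shift, compress with log1p, truncate, and split into digits."""
    v = int(math.log1p(x - min_value) * 10**extra_digits)
    return [int(d) for d in str(v).zfill(code_len)]


def decode(digits, min_value, extra_digits):
    """Reverse transform: rebuild the integer, undo log1p, undo the shift."""
    v = int("".join(str(d) for d in digits))
    return math.exp(v / 10**extra_digits) + min_value - 1.0


amounts = [-68.0, 0.0, 68.0, 134.09, 251.71]  # made-up sample values
min_value, extra_digits = fit(amounts)
digits = encode(134.09, min_value, extra_digits)
recovered = decode(digits, min_value, extra_digits)
print(digits, round(recovered, 2))
```

The recovered value is close to, but not exactly, the original amount, because digits beyond `extra_digits` are discarded; a larger `code_len` keeps more digits and narrows the gap.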

<strong>This link will bring you to the [code for our custom tabular tokenizer](./coder/column_code.py) implementation.</strong>

Let's tokenize the credit card dataset that we saw in the previous notebook. We use the NVIDIA RAPIDS library to read in and parse much of the data. For this dataset (24M rows), we can achieve speedups using GPUs for parallelism in this preprocessing scheme. RAPIDS has GPU-enabled libraries that are analogous to their CPU counterparts. For example, the `pandas` equivalent is `cudf` and the `scikit-learn` equivalent is `cuml`.

We'll use `cudf` to read and preprocess the data, and numpy and sklearn where certain functionality is needed.

# 5. ETL with cuDF and Pandas

In [1]:
# input path of the raw data and output path for the preprocessed data
data_fp = './data/card_transaction.v1.parquet'
out_data_fp = './data/card_transactions_fixed.parquet'

## cuDF

In [3]:
import cudf
import time

start =time.time()

df = cudf.read_parquet(data_fp)

# remove the dollar string and convert amount to float
df['Amount'] = df['Amount'].str.replace('$', '', regex=False).astype('float').round(2)
# fix the zip column
df['Zip'] = df['Zip'].astype(int).astype(str).str.zfill(5)
# strip some leading spaces present in the Merchant City column
df['Merchant City'] = df['Merchant City'].str.strip()#.str.replace('/','') #Naval Air Station/ Jrb

# parse the time into hour and minute
df['Hour'] = df['Time'].str[0:2].astype(int)
df['Minute'] = df['Time'].str[-2:].astype(int)
# remove the 'Time' Column once it is parsed
del df['Time']

# rename and lowercase columns
df = df.rename(columns={'Errors?': 'errors',
                        'Is Fraud?': 'is_fraud'
                        })
df.columns = [i.lower() for i in df.columns]

df['is_fraud'], fraud_key = df['is_fraud'].factorize()
df[['user', 'card', 'year', 'month', 'day', 'hour', 'minute', 'merchant name', 'mcc', 'is_fraud']] = df[['user', 'card', 'year', 'month', 'day', 'hour', 'minute', 'merchant name', 'mcc', 'is_fraud']].astype('int64')

# sort rows by date and re-order the cols
df = df.sort_values(['year', 'month', 'day', 'hour', 'minute'])[['user', 'card', 'year', 'month', 'day', 'hour', 'minute', 'amount',
                             'use chip', 'merchant name', 'merchant city', 'merchant state', 'zip',
                             'mcc', 'errors', 'is_fraud']].reset_index(drop=True)

for col, typ in zip(df.columns, df.dtypes):
    # print(co, t, t=='object')
    if typ=='object':
        df[col] = df[col].str.strip()
        
        
print(f'Elapsed (s): {time.time() - start: 0.3f}')
df.to_parquet(out_data_fp)
df.sample(5, random_state=2718)

Elapsed (s):  1.289


Unnamed: 0,user,card,year,month,day,hour,minute,amount,use chip,merchant name,merchant city,merchant state,zip,mcc,errors,is_fraud
23080265,888,1,2019,6,9,15,11,59.15,Chip Transaction,-245178307025547046,Woodbury Heights,NJ,8097,5311,,0
7068011,1008,5,2009,9,16,16,5,16.58,Swipe Transaction,6135208568923449408,Matthews,NC,28105,9402,,0
14681442,98,4,2014,7,13,8,44,4.1,Swipe Transaction,5428932742579976001,Hollywood,FL,33024,5921,,0
1995083,1649,1,2004,9,21,10,28,204.56,Swipe Transaction,6299212792632776248,Blair,NE,68008,8931,,0
16432346,485,0,2015,7,26,16,46,83.3,Chip Transaction,-9172319837809732852,League City,TX,77573,7538,,0


In [3]:
df = cudf.read_parquet(out_data_fp)

In [6]:
df

Unnamed: 0,user,card,year,month,day,hour,minute,amount,use chip,merchant name,merchant city,merchant state,zip,mcc,errors,is_fraud
0,791,1,1991,1,2,7,10,68.00,Swipe Transaction,2027553650310142703,Burke,VA,22015,5541,,0
1,791,1,1991,1,2,7,17,-68.00,Swipe Transaction,2027553650310142703,Burke,VA,22015,5541,,0
2,791,1,1991,1,2,7,21,113.62,Swipe Transaction,2027553650310142703,Burke,VA,22015,5541,,0
3,791,1,1991,1,2,17,30,114.73,Swipe Transaction,-7269691894846892021,Burke,VA,22015,5411,,0
4,791,1,1991,1,3,9,3,251.71,Swipe Transaction,-3693650930986299431,Burke,VA,22015,4814,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24386895,1659,2,2020,2,28,23,51,7.67,Chip Transaction,7231700044622779845,Cosby,TN,37722,5300,,0
24386896,863,1,2020,2,28,23,53,49.06,Chip Transaction,7654254764356253071,North Brunswick,NJ,08902,5912,,0
24386897,1300,0,2020,2,28,23,56,51.29,Online Transaction,-6458444334611773637,ONLINE,,,4784,,0
24386898,1366,2,2020,2,28,23,56,132.73,Chip Transaction,-7398558035733466800,Rockville Centre,NY,11570,5812,,0


## Pandas

In [8]:
import pandas as pd
import time

# no need to run pandas, results already cached
if False:
    def preprocess_raw_data(fp: str, out_fp: str) -> pd.DataFrame:
        """
        preprocess the raw transaction data
        fp(str): path to the raw transaction data.
        """
        df = pd.read_parquet(fp)
        df = df.rename(columns={'Errors?': 'errors',
                                'Is Fraud?': 'is_fraud'
                                })

        # split time into hour and minute
        df[['hour', 'minute']] = df.Time.str.split(':', expand=True)
        df.hour = df.hour.astype(int)
        df.minute = df.minute.astype(int)

        # remove the 'Time' Column once it is parsed
        del df['Time']
        # rename all cols to lowercase
        df.columns = [i.lower() for i in df.columns]
        # add date col
        df['date'] = pd.to_datetime(df[['year', 'month', 'day', 'hour', 'minute']])

        # remove the dollar string and convert amount to float
        df['amount'] = df['amount'].str.replace('$', '', regex=False).astype('float')

        # sort rows by date and re-order the cols, exclude date column
        df = df.sort_values('date')[['user', 'card', 'year', 'month', 'day', 'hour', 'minute', 'amount',
                                     'use chip', 'merchant name', 'merchant city', 'merchant state', 'zip',
                                     'mcc', 'errors', 'is_fraud']].reset_index(drop=True)

        # fix the zip column
        df['zip'] = df['zip'].apply(lambda x: '' if pd.isna(x) else "{:05.0f}".format(x))
        # factorize is_fraud col into 1 and 0.
        df['is_fraud'], fraud_key = df['is_fraud'].factorize()
        df['use chip'] = df['use chip'].str.strip()
        df['merchant city'] = df['merchant city'].str.strip()

        df.to_parquet(out_fp)
        print('Complete')
        return df

    # input path of the raw data and output path for the preprocessed data
    data_fp = './data/card_transaction.v1.parquet'
    out_data_fp = './data/card_transactions_fixed.parquet'

    start =time.time()
    df = preprocess_raw_data(data_fp, out_data_fp)
    print(f'Elapsed (s): {time.time() - start: 0.3f}')

    df.sample(5, random_state=2718)

Complete
Elapsed (s):  236.001


Unnamed: 0,user,card,year,month,day,hour,minute,amount,use chip,merchant name,merchant city,merchant state,zip,mcc,errors,is_fraud
3505310,1761,1,2006,8,29,14,29,25.29,Online Transaction,31551052261259716,ONLINE,,,4784,,0
716646,1520,4,2001,11,4,14,17,2.38,Swipe Transaction,-6571010470072147219,Norwalk,CT,6854.0,5499,,0
5558369,1590,2,2008,7,23,6,49,27.82,Swipe Transaction,706166510299238548,Patterson,LA,70392.0,7538,,0
19519445,848,2,2017,5,16,12,42,679.31,Chip Transaction,-8436105799536235330,Yonkers,NY,10710.0,8111,,0
20362175,1897,0,2017,11,10,14,33,63.0,Chip Transaction,1799189980464955940,Boulder,CO,80301.0,5499,,0


Performance times will vary based on your CPU architecture:
```
Complete
Elapsed (s):  236.001
```

# 6. Preprocessing

<div><font size="4">Tokenize the DataFrame columns</font></div>

In [9]:
vocabulary_path = 'credit_card_coder.pickle'

In [10]:
# the ordering of these columns is VERY IMPORTANT when we begin to pass context to our model during inference
import cudf
import pickle
from coder.column_code import ColumnTokenizer, FloatTokenizer, CategoricalTokenizer

if isinstance(df, cudf.DataFrame):
    df = df.to_pandas()
# fill missing values with the string 'None'
df = df.fillna('None')

column_codes_gpu = ColumnTokenizer()

beg = 0
cc = None

#ACTION MAKE SURE THE COLUMNS ARE CORRECT HERE.
FLOAT_COLS = ['amount']
EXCLUDED_COLS = []
columns = [col for col in df.columns if col not in EXCLUDED_COLS]


for column in columns:
    start_id = beg if cc is None else cc.end_id
    print(column, start_id)
    if column in FLOAT_COLS:
        cc = FloatTokenizer(column, df[[column]], start_id, 'quantile')

    else:
        cc = CategoricalTokenizer(column, df[column], start_id)

    column_codes_gpu.register(column, cc)

# add 1 for newline char to separate each row
print('Each row uses', sum(column_codes_gpu.sizes) + 1, 'tokens')

# save the encoder and decoder
with open(vocabulary_path, 'wb') as handle:
    pickle.dump(column_codes_gpu, handle)


user 0
card 2000
year 2009
month 2039
day 2051
hour 2082
minute 2106
amount 2166
use chip 2237
merchant name 2240
merchant city 102583
merchant state 116012
zip 116236
mcc 143558
errors 143667
is_fraud 143691
Each row uses 24 tokens


In [11]:
for col, size in zip(column_codes_gpu.columns, column_codes_gpu.sizes):
    print(f'{col}:\t {size}')

user:	 1
card:	 1
year:	 1
month:	 1
day:	 1
hour:	 1
minute:	 1
amount:	 8
use chip:	 1
merchant name:	 1
merchant city:	 1
merchant state:	 1
zip:	 1
mcc:	 1
errors:	 1
is_fraud:	 1


In [12]:
import pickle
from coder.column_code import ColumnTokenizer, FloatTokenizer, CategoricalTokenizer

column_codes_gpu = ColumnTokenizer()
with open(vocabulary_path, 'rb') as handle:
    column_codes_gpu = pickle.load(handle)

In [13]:
import json
from coder.tabular_tokenizer import TabularTokenizer
ENDOFTEXT = '<|endoftext|>'
DELIMITER = '|'
tokenizer = TabularTokenizer(vocabulary_path,
                             special_tokens=['\n', ENDOFTEXT],
                             delimiter=DELIMITER)

In [14]:
tokenizer.eor

143693

<div><font size="4">Let's try out our tokenizer on the `Amount` and `Merchant City` columns of the TabFormer Dataset.</font></div>

In [15]:
float_str = '134.09'
token_ids = column_codes_gpu.encode('amount', float_str)
print('Token ids for {} is: {}'.format(float_str, token_ids))
amt_str = column_codes_gpu.decode('amount', token_ids)
print('Recovered Amount for {} is: {}\n'.format(float_str, amt_str))

city_str = 'Monterey Park'
token_ids = column_codes_gpu.encode('merchant city', city_str)
print('Token ids for {} is: {}'.format(city_str, token_ids))
decoded_city = column_codes_gpu.decode('merchant city', token_ids)
print('Recovered Merchant City for {} is: {}'.format(city_str, decoded_city))

Token ids for 134.09 is: [2166, 2167, 2186, 2190, 2203, 2209, 2217, 2232]
Recovered Amount for 134.09 is: 134.0897315256757

Token ids for Monterey Park is: [103152]
Recovered Merchant City for Monterey Park is: Monterey Park


We used 8 tokens for the float column `amount` and recovered the quantized amount with a small loss in accuracy.
The `merchant city` column only requires `1` token to encode `Monterey Park`.

Now that we have the encoder and decoder implemented for tabular data, we can apply it to encode each of the columns in the dataset. Since the number of tokens per column is fixed, decoding the data is easy as the structure of the tabular data is maintained.
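Because every column has a fixed width, a flat row of token ids can be cut back into columns with simple slicing. The sketch below uses the per-column sizes printed earlier in this notebook; the function name and the placeholder token ids are assumptions for illustration.

```python
# Tokens per column, in the same order as the DataFrame columns.
COLUMN_SIZES = {
    'user': 1, 'card': 1, 'year': 1, 'month': 1, 'day': 1, 'hour': 1, 'minute': 1,
    'amount': 8, 'use chip': 1, 'merchant name': 1, 'merchant city': 1,
    'merchant state': 1, 'zip': 1, 'mcc': 1, 'errors': 1, 'is_fraud': 1,
}


def split_row(row_tokens, sizes=COLUMN_SIZES):
    """Cut one row of token ids into per-column lists using the fixed sizes."""
    fields, position = {}, 0
    for column, size in sizes.items():
        fields[column] = row_tokens[position:position + size]
        position += size
    return fields


row_tokens = list(range(sum(COLUMN_SIZES.values())))  # placeholder ids 0..22
fields = split_row(row_tokens)
print(fields['amount'])    # [7, 8, 9, 10, 11, 12, 13, 14]
print(fields['is_fraud'])  # [22]
```

The 23 column tokens plus the newline token give the 24 tokens per row reported above.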

## 6.1 Modeling time series information with the `<|endoftext|>` token on the tokenized data

To model the temporal information in the time series, we want to make sure the `<|endoftext|>` token is added between the continuous sections. For example, in this credit card dataset, there are 2000 users.

```python
# There are 2000 users in this dataset.
set(df['user'].unique().to_numpy()) == set(range(2000))
Out: True
```

Each user's transactions are one long time series sequence. We split the long time series into smaller overlapping pieces with some stride. The `<|endoftext|>` token is only added at the beginning and end of the long sequences, not in between. In this way, Megatron applies attention to learn the temporal correlation in a user's transactions but not across users.

We have provided the Python code to convert the tabular data into the JSON lines format.

## 6.2 Collate data by User

In [16]:
# converting time back to ints for sorting grouped rows by time so transactions are in order
df[['year', 'month', 'day', 'hour', 'minute']] = df[['year', 'month', 'day', 'hour', 'minute']].astype(int)

In [17]:
import json
import time
from typing import List

import pandas as pd
from tqdm import tqdm


ENDOFTEXT = '<|endoftext|>'
DELIMITER = '|'


def get_docs_from_df(group) -> List:
    """
    Make documents of WINDOW_SIZE rows from a user's transactions. STRIDE is the number of rows the window moves between documents.
    
    group: a DataFrame.groupby group
    """
    WINDOW_SIZE = 500
    STRIDE = 250
    group = group.sort_values(['year', 'month', 'day', 'hour', 'minute'])
    group = group.astype(str) # convert to string for concatenation in loop below

    # join the columns of each row into one DELIMITER-separated string
    # (assumes 'user' is the first column)
    result = group[["user"]].copy()
    result.columns = ["Out"]
    cols = group.columns[1:]
    total = []
    for col in cols:
        result["Out"] = result["Out"].str.cat(group[col], sep=DELIMITER)
    
    for start_rows in range(0, len(result), STRIDE):
        window = result['Out'].iloc[start_rows: start_rows + WINDOW_SIZE]
        text = '\n'.join(window.to_json(orient='values')[2:-2].replace('\\', '').split('","'))
        # only the user's first document starts with the end-of-text token
        prefix = ENDOFTEXT if not start_rows else ''
        total.append(json.dumps({'text': prefix + text}) + '\n')

    return total


In [19]:
from multiprocessing import Pool, cpu_count
from tqdm import tqdm
tqdm.pandas()
# https://stackoverflow.com/questions/26187759/parallelize-apply-after-pandas-groupby?noredirect=1&lq=1
def applyParallel(dfGrouped, func, convert_to_series=False):
    with Pool(cpu_count()) as p:
        ret_list = p.map(func, [group for name, group in dfGrouped])
    if convert_to_series:
        return pd.concat(ret_list)
    return ret_list

In [20]:
start = time.time()
# out = df.groupby('user').progress_apply(get_docs_from_df)
out = applyParallel(df.groupby('user'), get_docs_from_df)
print(f"elapsed (s): {time.time() - start: 0.3f}")

elapsed (s):  82.131


In the code above, we `groupby user` and pass each group into `get_docs_from_df`. This function sorts the user's transactions by date and then makes `"documents"` of 500 rows each (`WINDOW_SIZE`), where each window starts 250 rows (`STRIDE`) after the previous one, so consecutive documents overlap in time by 250 rows. We add the special `<|endoftext|>` token to the beginning of each user's first document to mask the attention from another user's transactions. The variables `WINDOW_SIZE` and `STRIDE` are hyperparameters.
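To make the windowing concrete, the sketch below lists the row ranges that each document would cover for a user with a given number of transactions. The helper name `window_bounds` is hypothetical and not part of the notebook; it only mirrors the `range(..., STRIDE)` loop and the clamped `iloc` slice used in `get_docs_from_df`.

```python
from typing import List, Tuple


def window_bounds(n_rows: int, window_size: int = 500, stride: int = 250) -> List[Tuple[int, int]]:
    """Return the (start, end) row bounds of each document cut from n_rows rows.

    Hypothetical helper for illustration only; end is exclusive, like iloc slicing.
    """
    bounds = []
    for start in range(0, n_rows, stride):
        # Slicing past the last row is clamped, so trailing windows are shorter.
        end = min(start + window_size, n_rows)
        bounds.append((start, end))
    return bounds


# A user with 1000 transactions yields four documents.
print(window_bounds(1000))
# [(0, 500), (250, 750), (500, 1000), (750, 1000)]
```

Note that the last windows hold fewer than `WINDOW_SIZE` rows, and the final one is fully contained in the one before it.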

Recall that Megatron expects a json lines format, with a document per line and the `text` field to contain the document (in this case, transactions) of interest.
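As a minimal sketch of the JSON lines layout, the snippet below writes two made-up documents to an in-memory buffer and reads them back. The sample strings are shortened, invented rows rather than real output; the point is that `json.dumps` escapes the newlines inside a document, so every document stays on one physical line.

```python
import io
import json

# Two invented, shortened documents; rows inside a document are separated by '\n'.
docs = [
    "<|endoftext|>0|0|2002|9|1\n0|0|2002|9|2",
    "1|2|2003|7|1",
]

buffer = io.StringIO()
for doc in docs:
    # json.dumps escapes the embedded newline, so one document == one line.
    buffer.write(json.dumps({"text": doc}) + "\n")

lines = buffer.getvalue().splitlines()
restored = [json.loads(line)["text"] for line in lines]

print(len(lines))        # number of physical lines in the file
print(restored == docs)  # the round trip preserves the documents
```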

Thus we can take the DataFrame

| user | card | amount | year | month | day  | hour | minute | use chip              | merchant name | merchant city | merchant state | zip   | mcc   | errors | is fraud |
|------|------|------- |------|------ |------| -----| ------ | ----------------------| --------------| ------------- | -------------- | ----- | ---   | ------ | -------- |
| 791  | 1    | 68.00  | 2018 |  1    |  2   |  9   |     10 |    Swipe Transaction  | 12345536      |  New York     | NY             | 10017 |  8005 |  \<NA> | 0        |
| 1572 | 0    | 572.42 | 2018 |  4    |  12  |  7   |     11 |    Chip Transaction   | 49908535      |  Princeton    | NJ             | 19406 |  5634 |  \<NA> | 0        |
| 2718 | 7    | 123.10 | 2019 |  1    |  4   |  10  |     14 |    Chip Transaction   | 43211536      |  Beverly Hills| CA             | 90210 |  4800 |  \<NA> | 0        |
| 21   | 2    | 42.04  | 2020 |  6    |  23  |  11  |     18 |    Swipe Transaction  | 65423006      |  Burke        | VA             | 22015 |  5604 |  \<NA> | 0        |
| 1001 | 1    | 5000.00| 2020 |  11   |  3   |  1   |     22 |    Online Transaction | 75434546      |  \<NA>        | \<NA>          | \<NA> |  1234 |  \<NA> | 1        |

and transform it into documents:

```json
{'text': '<|endoftext|>0|0|2002|9|1|6|21|134.09|Swipe Transaction|3527213246127876953|La Verne|CA|91750|5300|None|No\n0|0|2002|9|1|6|42|38.48|Swipe Transaction|-727612092139916043|Monterey Park|CA|91754|5411|None|No\n0|0|2002|9|2|6|22|120.34|Swipe Transaction|-727612092139916043|Monterey Park|CA|91754|5411|None|No\n0|0|2002|9|2|17|45|128.95|Swipe Transaction|3414527459579106770|Monterey Park|CA|91754|5651|None|No\n
 ...},
{'text': '0|0|2002|12|2|12|54|50.0|Swipe Transaction|-1288082279022882052|La Verne|CA|91750|5499|None|No\n0|0|2002|12|2|19|8|71.03|Swipe Transaction|-4733023138943446282|Chicago|IL|60643|5812|None|No\n0|0|2002|12|2|20|19|78.32|Swipe Transaction|-2744911404133435018|Chicago|IL|60645|5812|None|No\n0|0|2002|12|2|23|22|127.0|Swipe Transaction|-6406662083475903219|Chicago|IL|60643|3390|None|No\n0|0|2002|12|2|23|48|211.0|Swipe Transaction|-7807051024009846392|Peoria|IL|61604|3684|None|No\n
 ...},
...
{'text': '<|endoftext|>1|2|2003|7|1|6|45|17.23|Swipe Transaction|-4791240532834014651|Little Neck|NY|11363|5921|None|No\n1|2|2003|7|3|6|57|15.42|Swipe Transaction|1799189980464955940|Little Neck|NY|11363|5499|None|No\n1|2|2003|7|4|5|22|22.23|Online Transaction|-521141999023077663|ONLINE|None|None|5815|Bad Expiration,|No\n1|2|2003|7|4|5|29|10.28|Online Transaction|-521141999023077663|ONLINE|None|None|5815|None|No\n
 ...},

```
Notice how the `<|endoftext|>` token separates the transactions of user `0` from those of user `1`.

In [21]:
json.loads(out[0][0])['text'].split('\n')[:5]

['<|endoftext|>0|0|2002|9|1|6|21|134.09|Swipe Transaction|3527213246127876953|La Verne|CA|91750|5300|None|0',
 '0|0|2002|9|1|6|42|38.48|Swipe Transaction|-727612092139916043|Monterey Park|CA|91754|5411|None|0',
 '0|0|2002|9|2|6|22|120.34|Swipe Transaction|-727612092139916043|Monterey Park|CA|91754|5411|None|0',
 '0|0|2002|9|2|17|45|128.95|Swipe Transaction|3414527459579106770|Monterey Park|CA|91754|5651|None|0',
 '0|0|2002|9|3|6|23|104.71|Swipe Transaction|5817218446178736267|La Verne|CA|91750|5912|None|0']

## 6.3 Save the document corpus to the `json lines` file

In [22]:
jsonlines_path = 'credit_card_pd.jl'
with open(jsonlines_path, 'w') as f:
    for row in tqdm(out):
        f.write(''.join(row))


100%|██████████| 2000/2000 [00:06<00:00, 293.92it/s]


Note that we use '|' as the delimiter because ',' already appears as a character in the 'Errors' column of the original file. With the loose JSON file ready, we can test the customized Tabular Tokenizer. You can use another delimiter as long as it does not occur anywhere in the data; otherwise the rows would be split incorrectly when the file is parsed, corrupting the output.
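One way to guard against a delimiter collision is to scan the values before writing the file. The sketch below is an assumption-laden illustration: `find_delimiter_collisions` is a hypothetical helper, and the sample rows are invented, shortened stand-ins for the real transactions.

```python
from typing import Iterable, List, Sequence, Tuple


def find_delimiter_collisions(rows: Iterable[Sequence[object]], delimiter: str = "|") -> List[Tuple[int, int]]:
    """Return (row index, column index) of every value that contains the delimiter.

    Hypothetical helper for illustration; an empty result means the delimiter is safe.
    """
    collisions = []
    for row_idx, row in enumerate(rows):
        for col_idx, value in enumerate(row):
            if delimiter in str(value):
                collisions.append((row_idx, col_idx))
    return collisions


# Invented sample rows: user, amount, errors.
sample_rows = [
    ["0", "134.09", "Bad Expiration,"],
    ["1", "38.48", "None"],
]

print(find_delimiter_collisions(sample_rows, "|"))  # [] -> '|' is safe here
print(find_delimiter_collisions(sample_rows, ","))  # [(0, 2)] -> ',' collides with the errors column
```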

In [23]:
import json
from coder.tabular_tokenizer import TabularTokenizer


tokenizer = TabularTokenizer(vocabulary_path,
                             special_tokens=['\n', ENDOFTEXT],
                             delimiter=DELIMITER)

with open(jsonlines_path, 'r') as f:
    for line in f:
        break

In [24]:
text = json.loads(line)['text']
r = tokenizer.tokenize_str(text)
ids = tokenizer.convert_tokens_to_ids(r)
tex = tokenizer.convert_ids_to_tokens(ids)
decoded_text = tokenizer.decode(ids)
show_num_lines = 5
print('raw input text\n', '\n'.join(text.split('\n')[:show_num_lines]))
print('tokens\n', r[:show_num_lines*len(tokenizer.code_column.columns)])
print('token ids\n', ids[:show_num_lines*(sum(tokenizer.code_column.sizes)+1)])
print('decoded tokens\n', tex[:show_num_lines*(sum(tokenizer.code_column.sizes)+1)])
print('decoded text\n', '\n'.join(decoded_text.split('\n')[:show_num_lines]))

raw input text
 <|endoftext|>0|0|2002|9|1|6|21|134.09|Swipe Transaction|3527213246127876953|La Verne|CA|91750|5300|None|0
0|0|2002|9|1|6|42|38.48|Swipe Transaction|-727612092139916043|Monterey Park|CA|91754|5411|None|0
0|0|2002|9|2|6|22|120.34|Swipe Transaction|-727612092139916043|Monterey Park|CA|91754|5411|None|0
0|0|2002|9|2|17|45|128.95|Swipe Transaction|3414527459579106770|Monterey Park|CA|91754|5651|None|0
0|0|2002|9|3|6|23|104.71|Swipe Transaction|5817218446178736267|La Verne|CA|91750|5912|None|0
tokens
 ['<|endoftext|>', '0', '0', '2002', '9', '1', '6', '21', '134.09', 'Swipe Transaction', '3527213246127876953', 'La Verne', 'CA', '91750', '5300', 'None', '0', '\n', '0', '0', '2002', '9', '1', '6', '42', '38.48', 'Swipe Transaction', '-727612092139916043', 'Monterey Park', 'CA', '91754', '5411', 'None', '0', '\n', '0', '0', '2002', '9', '2', '6', '22', '120.34', 'Swipe Transaction', '-727612092139916043', 'Monterey Park', 'CA', '91754', '5411', 'None', '0', '\n', '0', '0', '2002

The TabularTokenizer understands the table structure, so it can convert back and forth between the tabular text and the token ids.
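The toy sketch below illustrates the idea of a structure-aware round trip at the string level only. It is not the `TabularTokenizer` implementation: the real tokenizer uses a vocabulary and may encode a single column as several token ids, whereas `toy_tokenize` and `toy_detokenize` are hypothetical helpers that merely split on the delimiter and newline and join the pieces back.

```python
from typing import List

ENDOFTEXT = "<|endoftext|>"
DELIMITER = "|"


def toy_tokenize(text: str) -> List[str]:
    """Split tabular text into field tokens plus '\n' and ENDOFTEXT markers."""
    tokens = []
    # Strip the special token first: it contains '|' and must not be split.
    if text.startswith(ENDOFTEXT):
        tokens.append(ENDOFTEXT)
        text = text[len(ENDOFTEXT):]
    for i, row in enumerate(text.split("\n")):
        if i > 0:
            tokens.append("\n")
        if row:
            tokens.extend(row.split(DELIMITER))
    return tokens


def toy_detokenize(tokens: List[str]) -> str:
    """Rebuild the text, inserting the delimiter only between adjacent fields."""
    parts = []
    prev_is_field = False
    for tok in tokens:
        is_field = tok not in (ENDOFTEXT, "\n")
        if is_field and prev_is_field:
            parts.append(DELIMITER)
        parts.append(tok)
        prev_is_field = is_field
    return "".join(parts)


sample = "<|endoftext|>0|0|2002|Swipe Transaction\n0|0|2003|Chip Transaction"
toy_tokens = toy_tokenize(sample)
print(toy_tokens)
print(toy_detokenize(toy_tokens) == sample)
```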

# 7. Model configuration

The <a href="./model_config.sh">model_config.sh</a> file defines the model architecture and the file paths:

```bash
GPUS_PER_NODE=$(nvidia-smi -L | wc -l)
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"


OUTPUT_PATH=checkpoints
PROJECT_NAME=creditcard
INPUT_FILE=credit_card_pd.jl
DATA_PATH=${PROJECT_NAME}_text_document
CHECKPOINT_PATH=${OUTPUT_PATH}/gpt_${PROJECT_NAME}
LOADPATH=${OUTPUT_PATH}/gpt_${PROJECT_NAME}
TB_PATH=${OUTPUT_PATH}/checkpoints/tb

TENSOR_MP_SIZE=1
PIPELINE_MP_SIZE=1

#model architecture
NUM_LAYERS=48
HIDDEN_SIZE=1024
NUM_HEADS=16
SEQ_LEN=2048
MAX_POS_EMD=2048


VOCAB_FILE=credit_card_coder.pickle
TOKENIZER=TabularTokenizer
PREPROCESS_WORKERS=$(nproc)
```

The model size can be adjusted by altering the parameters under the `model architecture` section:
```bash
#model architecture
NUM_LAYERS=48
HIDDEN_SIZE=1024
NUM_HEADS=16
SEQ_LEN=2048
MAX_POS_EMD=2048
```
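As a rough sanity check on how these settings translate into model size, the sketch below applies the approximate parameter-count formula from reference [[2](#1_2)], $P = 12lh^2\left(1 + \frac{13}{12h} + \frac{V+s}{12lh}\right)$, where $l$ is the number of layers, $h$ the hidden size, $V$ the vocabulary size and $s$ the sequence length. The function name is hypothetical, the result is an estimate rather than an exact count, and the vocabulary size of 143744 is taken from the padded vocabulary reported by the preprocessing step in this notebook.

```python
def estimate_gpt_parameters(num_layers: int, hidden_size: int, vocab_size: int, seq_len: int) -> int:
    """Approximate GPT parameter count: 12*l*h^2 + 13*l*h + (V + s)*h.

    This is the expanded form of the formula quoted above; treat it as an estimate.
    """
    l, h = num_layers, hidden_size
    return 12 * l * h * h + 13 * l * h + (vocab_size + seq_len) * h


NUM_LAYERS = 48
HIDDEN_SIZE = 1024
NUM_HEADS = 16
SEQ_LEN = 2048
PADDED_VOCAB = 143744  # assumption: padded vocab size from the preprocessing log

# The hidden size must split evenly across the attention heads.
assert HIDDEN_SIZE % NUM_HEADS == 0

n_params = estimate_gpt_parameters(NUM_LAYERS, HIDDEN_SIZE, PADDED_VOCAB, SEQ_LEN)
print(f"approx. {n_params / 1e6:.1f}M parameters")
```

With this large tabular vocabulary, the embedding term `(V + s) * h` contributes a noticeable share of the total.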

The chart of model-size parameters is shown again below for your reference. Recall also that the `TENSOR_MP_SIZE` and `PIPELINE_MP_SIZE` variables are passed to the `pretrain_step.sh` training script, which makes it convenient to use tensor and pipeline parallelism.

<center><img src="images/cases_april2021.png" width="50%" height="50%"/></center>
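The relationship between `WORLD_SIZE`, `TENSOR_MP_SIZE` and `PIPELINE_MP_SIZE` can be sketched as follows: each model replica occupies `TENSOR_MP_SIZE * PIPELINE_MP_SIZE` GPUs, and the remaining factor of the world size is used for data parallelism. The function below is a hypothetical helper for illustration, not code from the training script.

```python
def data_parallel_size(world_size: int, tensor_mp_size: int, pipeline_mp_size: int) -> int:
    """Return how many model replicas (data-parallel ranks) fit in world_size GPUs."""
    model_parallel_size = tensor_mp_size * pipeline_mp_size
    if world_size % model_parallel_size != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by "
            f"tensor x pipeline parallel size {model_parallel_size}"
        )
    return world_size // model_parallel_size


# One node with 8 GPUs and no model parallelism: 8 data-parallel replicas.
print(data_parallel_size(8, tensor_mp_size=1, pipeline_mp_size=1))  # 8
# The same node with 2-way tensor and 2-way pipeline parallelism: 2 replicas.
print(data_parallel_size(8, tensor_mp_size=2, pipeline_mp_size=2))  # 2
```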

## 7.1 Preprocess

The loose JSON file above must be converted into a binary format for training. The `preprocess_data.py` tool converts it into a memory-mapped (mmap) binary file together with a cached index file.

The `preprocess.sh` script prepares the data for GPT training:
```bash
#!/bin/bash
source ./model_config.sh

python tools/preprocess_data.py --input=$INPUT_FILE \
                                --output-prefix=$PROJECT_NAME \
                                --vocab=$VOCAB_FILE \
                                --dataset-impl=mmap \
                                --tokenizer-type=$TOKENIZER \
                                --workers=$PREPROCESS_WORKERS

```
We defined `PREPROCESS_WORKERS=$(nproc)` in `model_config.sh`, so the preprocessing step runs in parallel with one worker per available processing unit.

## 7.2 Run the preprocessing script

In [26]:
!./preprocess.sh

Opening credit_card_pd.jl
> building TabularTokenizer tokenizer ...
 > padded vocab (size: 143695) with 49 dummy tokens (new size: 143744)
> building TabularTokenizer tokenizer ...
> building TabularTokenizer tokenizer ...
> building TabularTokenizer tokenizer ...
> building TabularTokenizer tokenizer ...
> building TabularTokenizer tokenizer ...
> building TabularTokenizer tokenizer ...
> building TabularTokenizer tokenizer ...
> building TabularTokenizer tokenizer ...
> building TabularTokenizer tokenizer ...
> building TabularTokenizer tokenizer ...
> building TabularTokenizer tokenizer ...
> building TabularTokenizer tokenizer ...
> building TabularTokenizer tokenizer ...
> building TabularTokenizer tokenizer ...
> building TabularTokenizer tokenizer ...
> building TabularTokenizer tokenizer ...
 > padded vocab (size: 143695) with 49 dummy tokens (new size: 143744)
 > padded vocab (size: 143695) with 49 dummy tokens (new size: 143744)
> building TabularTokenizer tokenizer ...
 > pa

After running the script, two new files are generated in the current directory: `creditcard_text_document.bin` and `creditcard_text_document.idx`. The `bin` file is the mmap binary file and the `idx` is the cached index file. They will be used for the following pre-training step.

In [27]:
!ls creditcard_text_document.*

creditcard_text_document.bin  creditcard_text_document.idx
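The `padded vocab` lines in the preprocessing log above show the vocabulary growing from 143695 to 143744 entries. The sketch below reproduces that arithmetic under the assumption that the vocabulary is padded up to the next multiple of 128 times the tensor-parallel size (128 being Megatron-LM's default for `--make-vocab-size-divisible-by`). The function name is hypothetical.

```python
from typing import Tuple


def pad_vocab_size(vocab_size: int, divisible_by: int = 128, tensor_mp_size: int = 1) -> Tuple[int, int]:
    """Return (padded size, number of dummy tokens) for a vocabulary.

    Assumes padding to the next multiple of divisible_by * tensor_mp_size.
    """
    multiple = divisible_by * tensor_mp_size
    # Ceiling division, then scale back up to the nearest multiple.
    padded = -(-vocab_size // multiple) * multiple
    return padded, padded - vocab_size


padded_size, dummy_tokens = pad_vocab_size(143695)
print(padded_size, dummy_tokens)  # 143744 49
```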


# 8. Please shut down the Kernel

For example, select `Kernel -> Shut down kernel`. Alternatively, in Jupyter Lab, navigate to the `Running Terminals and Kernels` tab on the left sidebar, hover the mouse over this notebook's name in the `KERNELS` section, and select the `X` that appears.

# References

<a id="1_1">[1]</a> 
<a href="https://arxiv.org/pdf/1909.08053.pdf">Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism</a>

<a id="1_2">[2]</a> 
<a href="https://arxiv.org/pdf/2104.04473.pdf">Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM</a>

<a id="1_3">[3]</a> 
<a href="https://pytorch.org/docs/stable/distributed.html">Distributed Training Overview</a>

<a id="1_4">[4]</a> 
<a href="https://pytorch.org/tutorials/intermediate/dist_tuto.html">Distributed Training Tutorial</a>

<a id="1_5">[5]</a> 
<a href="https://github.com/pytorch/examples/tree/master/distributed/ddp">Distributed Data Parallel Applications</a>
