In [None]:
############################################################################
##
## Copyright (C) 2022 NVIDIA Corporation.  All rights reserved.
##
## NVIDIA Sample Code
##
## Please refer to the NVIDIA end user license agreement (EULA) associated
## with this source code for terms and conditions that govern your use of
## this software. Any use, reproduction, disclosure, or distribution of
## this software and related documentation outside the terms of the EULA
## is strictly prohibited.
##
############################################################################

# 1. Generate synthetic credit card transactions

We now come to the next step of this workshop: generating synthetic credit card transactions!

First, let's load the trained model weights, put the Megatron GPT model in inference mode, and start the text generation server. All of this can be done by running the `run_data_gen_server.sh` script.

The text generation server accepts REST API requests and returns the generated text in the response. Let's use the following Python code to generate some transactions.

In [None]:
vocabulary_path = 'credit_card_coder.pickle'

In [3]:
import requests
import json
from coder.column_code import ColumnTokenizer
import pickle

with open(vocabulary_path, 'rb') as handle:
    cc: ColumnTokenizer = pickle.load(handle)

TOKENS_PER_ROW = sum(cc.sizes) + 1

BATCH_SIZE = 16
PORT_NUM = 5000
NUM_ROWS = 30
HEADERS = {"Content-Type": "application/json"}


def request_data(data):
    resp = requests.put('http://localhost:{}/generate'.format(PORT_NUM),
                        data=json.dumps(data), headers=HEADERS)
    sentences = resp.json()['sentences']
    return sentences


# generate the initial transactions unconditionally
data = {
    "sentences": [""] * BATCH_SIZE,
    "tokens_to_generate": NUM_ROWS * TOKENS_PER_ROW,
    "temperature": 1.0,
    "add_BOS": True
}

In [4]:
sentences = request_data(data)

for i in range(BATCH_SIZE):
    s = sentences[i]
    print(s)
    break

<|endoftext|>1399|0|2015|8|26|16|36|162.94210326270257|Online Transaction|8099188931555596779|ONLINE|None||7996|None|0
1399|0|2015|8|26|16|59|120.0|Chip Transaction|-4282466774399734331|Channelview|TX|77530|4829|None|0
1399|0|2015|8|27|8|13|139.99952433243243|Chip Transaction|-4282466774399734331|Channelview|TX|77530|4829|None|0
1399|0|2015|8|27|8|20|2.4024739|Chip Transaction|-5475680618560174533|Channelview|TX|77530|5942|None|0
1399|0|2015|8|29|7|19|45.85496951999999|Chip Transaction|-4534432520820813184|Alvin|TX|77511|7230|None|0
1399|0|2015|8|29|7|47|139.99952433243243|Chip Transaction|-4282466774399734331|Channelview|TX|77530|4829|None|0
1399|0|2015|8|29|8|23|139.99952433243243|Chip Transaction|-4282466774399734331|Channelview|TX|77530|4829|None|0
1399|0|2015|8|30|13|31|84.06869268378381|Chip Transaction|1799189980464955940|Alvin|TX|77511|5499|None|0
1399|0|2015|8|30|13|37|46.2293459|Chip Transaction|1799189980464955940|Alvin|TX|77511|5499|None|0
1399|0|2015|8|30|13|39|-84.0000209

# 2. What just happened?

The code above generated `NUM_ROWS` transactions per sequence, starting from an empty string. The `BATCH_SIZE` parameter sets how many sequences are generated in parallel, each with `NUM_ROWS` rows. Thus, the number of generated transactions in the single request above is `BATCH_SIZE * NUM_ROWS`. The server counts in tokens rather than rows, so setting `tokens_to_generate` to `NUM_ROWS * TOKENS_PER_ROW` is how we specify the number of rows to generate.
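
As a small sketch of this arithmetic, the helper below assembles the same request body and reports how many rows one request should yield. `build_payload` is a hypothetical name, not part of the server API, and the value of 24 tokens per row is an assumption taken from what `sum(cc.sizes) + 1` evaluates to for this dataset later in the notebook.

```python
def build_payload(batch_size, num_rows, tokens_per_row, temperature=1.0):
    """Assemble an unconditional-generation request body (hypothetical helper)."""
    return {
        "sentences": [""] * batch_size,  # one empty prompt per parallel sequence
        "tokens_to_generate": num_rows * tokens_per_row,  # rows expressed in tokens
        "temperature": temperature,
        "add_BOS": True,
    }


# Assumed values: 16 sequences, 30 rows each, 24 tokens per row.
payload = build_payload(batch_size=16, num_rows=30, tokens_per_row=24)
rows_per_request = len(payload["sentences"]) * 30

print(payload["tokens_to_generate"], rows_per_request)
```

The two printed numbers are the per-sequence token budget and the total number of rows requested across the batch.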

If generation fails with an out-of-memory error (see the output of the notebook where the server is running), consider decreasing `BATCH_SIZE` or `NUM_ROWS`. A safe value for `BATCH_SIZE` is the `MICRO_BATCH_SIZE` defined in `pretrain_step.sh`, since that value was used to train the model.
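
One way to stay within a safe batch size is to split a long list of prompts into smaller requests. The sketch below is only an illustration: `chunk_prompts` is a hypothetical helper, and the `MICRO_BATCH_SIZE` value of 4 is an assumed placeholder rather than the value from `pretrain_step.sh`.

```python
def chunk_prompts(prompts, max_batch_size):
    """Yield successive slices of `prompts`, each at most `max_batch_size` long."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(prompts), max_batch_size):
        yield prompts[start:start + max_batch_size]


MICRO_BATCH_SIZE = 4  # assumed placeholder; use the value from pretrain_step.sh
prompts = [""] * 10   # ten unconditional prompts

batches = list(chunk_prompts(prompts, MICRO_BATCH_SIZE))
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)
```

Each chunk could then be sent as its own `"sentences"` list in a separate request.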

If a longer sequence is needed, we can run the inference conditioned on the past transactions in a sliding window fashion. For example, first the model generates `[A, B, C, D, E]` transactions conditioned on an `<|endoftext|>` token. Then, it conditions on `[D, E]` and generates `[D, E, F, G, H]`. Once the long sequence comes to the end indicated by the special `<|endoftext|>` token, it will keep generating new transactions for a different user. For example, after generating `[X, Y, Z, <|endoftext|>]`, the model will generate `[Z, <|endoftext|>, A', B', C']` in the next iteration, where `A', B', C'` are transactions for a different user and not dependent on the former user's transaction `Z`.
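
The sliding window can be illustrated without a running server. In the sketch below, `toy_model` is a stand-in that simply appends numbered labels to the context; it is an assumption for illustration only and does not reflect how the GPT server behaves beyond returning the context followed by new rows.

```python
from itertools import count


def toy_model(context, n_new, counter):
    """Stand-in for the server: return the context plus n_new fresh labels."""
    return context + [f"T{next(counter)}" for _ in range(n_new)]


def sliding_window_generate(window, overlap, n_iters):
    """Generate rows with a rolling context of `overlap` rows per iteration."""
    if not 0 < overlap < window:
        raise ValueError("overlap must be between 1 and window - 1")
    counter = count()
    # The first pass is unconditional: empty context, fill the whole window.
    current = toy_model([], window, counter)
    generated = list(current)
    for _ in range(n_iters):
        context = current[-overlap:]           # keep the most recent rows
        current = toy_model(context, window - overlap, counter)
        generated.extend(current[overlap:])    # store only the new rows
    return generated


rows = sliding_window_generate(window=5, overlap=2, n_iters=2)
print(rows)
```

With a window of 5 and an overlap of 2, each iteration after the first contributes 3 new rows, mirroring the `[D, E]` to `[D, E, F, G, H]` step described above.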

In more practical terms, suppose we are interested in this past Christmas's transactions for Emily Smith, and we want to know how they will affect Emily's purchasing in January. We can pass the Christmas transactions to GPT, and it will "condition" its output on our input.

# 3. How to pass the most "context" to our model?

To obtain the maximum context, use the `SEQ_LEN` variable, which gives the maximum sequence length for our model. Since we already calculated the number of tokens per row, we can calculate how many previous transactions (rows) to pass into the model, leaving one row to generate.

For example, if our sequence length is `5` rows, then we can pass `[A, B, C, D]` as the context to conditionally generate `[A, B, C, D, E]`.

In [5]:
TOKENS_PER_ROW = sum(cc.sizes) + 1  # add 1 for newline char to separate each row
SEQ_LEN=6144

# subtract 1 to leave space for one extra generated row; the endoftext token, which is in the first row only, also uses part of this budget
NUM_ROWS = SEQ_LEN//TOKENS_PER_ROW
NUM_CONTEXT_ROWS = NUM_ROWS - 1

# we will need to remove any partial row that will be generated (the last row).
TOKENS_PER_ROW, NUM_CONTEXT_ROWS

(24, 255)
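
As a worked check of this budget, the sketch below recomputes the numbers and shows how much room is left for generation. It assumes that the beginning-of-sequence token costs one token of the window; that accounting is an assumption and was not verified against the server.

```python
SEQ_LEN = 6144        # maximum sequence length of the model
TOKENS_PER_ROW = 24   # value of sum(cc.sizes) + 1 for this dataset
BOS_TOKENS = 1        # assumption: the BOS marker occupies one token

num_rows = SEQ_LEN // TOKENS_PER_ROW
num_context_rows = num_rows - 1
context_tokens = num_context_rows * TOKENS_PER_ROW
room_left = SEQ_LEN - context_tokens - BOS_TOKENS

print(num_rows, num_context_rows, context_tokens, room_left)
```

If the BOS token does count against the window, slightly fewer than `TOKENS_PER_ROW` tokens remain, which would be consistent with a final generated row that is cut short.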

The following code implements the ideas above for conditional generation and can be used to generate a large number of transactions in long sequences.

Again, we start from unconditional generation to "seed" the context, and then keep feeding this historical context as a rolling window. 

Alternatively, we can provide real transactions as the context for the model to generate new synthetic transactions. A similar principle applies for time series forecasting.
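
Real transactions have to be serialized into the same delimiter-separated rows before they can serve as context. As a rough CPU-only sketch of that step, the hypothetical helper `rows_to_text` below works on plain dictionaries (which could come from something like `DataFrame.to_dict('records')`). It only treats Python `None` as missing, so NaN values would need to be filled beforehand, and the toy column names are assumptions for illustration.

```python
def rows_to_text(records, columns, delimiter="|", missing="None"):
    """Join the chosen fields of each record into one delimiter-separated row."""
    lines = []
    for record in records:
        fields = [
            missing if record.get(col) is None else str(record[col])
            for col in columns
        ]
        lines.append(delimiter.join(fields))
    return lines


# Toy records with a reduced, assumed set of columns.
toy_records = [
    {"user": 0, "card": 0, "amount": 134.09, "errors": None},
    {"user": 0, "card": 0, "amount": 38.48, "errors": None},
]
lines = rows_to_text(toy_records, ["user", "card", "amount", "errors"])
print("\n".join(lines))
```

Joining the resulting list with newlines gives a string that can be placed in the `"sentences"` field of a request.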

In [6]:
import pandas as pd
import cudf

out_data_fp = './data/card_transactions_fixed.parquet'
df = cudf.read_parquet(out_data_fp)

In [7]:
user0 = df.loc[df.user == 0]
user0 = user0.fillna('None')
# if you excluded columns in the first notebook, list the same EXCLUDED_COLS here
EXCLUDED_COLS = []
COLUMNS = [col for col in user0.columns if col not in EXCLUDED_COLS]

In [8]:
ENDOFTEXT = '<|endoftext|>'
DELIMITER = '|'


def make_context(dframe, columns, delimiter):
    dframe = dframe[columns].astype(str)
    result = dframe[[dframe.columns[0]]].copy()
    result.columns = ['Out']
    cols = dframe.columns[1:]
    for col in cols:
        result['Out'] = result['Out'].str.cat(dframe[col], sep=delimiter)

    return '\n'.join(result['Out'].to_json(orient='values')[2:-2].replace('\\', '').split('","'))

context = make_context(user0, COLUMNS, DELIMITER).split('\n')



In [9]:
len(context), len(context[:NUM_CONTEXT_ROWS])


(19963, 255)

## 3.1 Example - One conditional generation iteration: let's pass this context and generate new transactions

In [10]:
# conditional generation of the transactions.
BATCH_SIZE = 2
data = {
    "sentences": ['\n'.join(context[:NUM_CONTEXT_ROWS])] * BATCH_SIZE,
    "tokens_to_generate": NUM_ROWS * TOKENS_PER_ROW,
    "temperature": 1.0,
    "add_BOS": True
}

sentences = request_data(data)

In [11]:
# some properties of the sentences variable:
print("length of `sentences` variable: {}\n"
       "length of a returned sentence: {}".format(len(sentences), len(sentences[0].split('\n'))))

length of `sentences` variable: 2
length of a returned sentence: 256


### 3.1a Notice how we are able to recover the original passed transactions. We can easily `round` the amounts to two decimal points if desired.

In [12]:
def round_amount(row_list, col_index, delimiter):
    # Round the amounts to two decimal points. Repeat for each row, if desired.
    rows = []
    for i in row_list:
        split_row = i.split(delimiter)
        amount = str(round(float(split_row[col_index]), 2))
        split_row[col_index] = amount
        rows.append(delimiter.join(split_row))
    return rows

In [13]:
print('\n'.join(round_amount(context[:3], COLUMNS.index('amount'), DELIMITER)), '\n\n', 
      '\n'.join(round_amount(sentences[0].replace(ENDOFTEXT, '').split('\n')[:3], COLUMNS.index('amount'), DELIMITER)),
      sep='')

0|0|2002|9|1|6|21|134.09|Swipe Transaction|3527213246127876953|La Verne|CA|91750|5300|None|0
0|0|2002|9|1|6|42|38.48|Swipe Transaction|-727612092139916043|Monterey Park|CA|91754|5411|None|0
0|0|2002|9|2|6|22|120.34|Swipe Transaction|-727612092139916043|Monterey Park|CA|91754|5411|None|0

0|0|2002|9|1|6|21|134.09|Swipe Transaction|3527213246127876953|La Verne|CA|91750|5300|None|0
0|0|2002|9|1|6|42|38.48|Swipe Transaction|-727612092139916043|Monterey Park|CA|91754|5411|None|0
0|0|2002|9|2|6|22|120.34|Swipe Transaction|-727612092139916043|Monterey Park|CA|91754|5411|None|0


# 4. Let's check out our generated transactions, compare `context` with the `sentences` variable

In [14]:
# The last row of each sentence is the newly generated transaction; the rows before it repeat the passed-in context

print('Original context:\n',
      '\n'.join(round_amount(context[-2:], COLUMNS.index('amount'), DELIMITER)), '\n\n',
      'Sentences (this length is based on batch size):\n',
      '\n'.join(round_amount(sentences[0].rsplit('\n', 4)[1:], COLUMNS.index('amount'), DELIMITER)), '\n\n',
      '\n'.join(round_amount(sentences[1].rsplit('\n', 4)[1:], COLUMNS.index('amount'), DELIMITER)),
      sep='')

Original context:
0|3|2020|2|28|6|53|34.11|Swipe Transaction|-34551508091458520|La Verne|CA|91750|5912|None|0
0|2|2020|2|28|7|36|41.05|Chip Transaction|5817218446178736267|La Verne|CA|91750|5912|None|0

Sentences (this length is based on batch size):
0|0|2002|12|2|20|19|78.32|Swipe Transaction|-2744911404133435018|Chicago|IL|60645|5812|None|0
0|0|2002|12|2|23|22|127.0|Swipe Transaction|-6406662083475903219|Chicago|IL|60643|3390|None|0
0|0|2002|12|2|23|48|211.0|Swipe Transaction|-7807051024009846392|Peoria|IL|61604|3684|None|0
0|0|2002|12|3|13|33|10.6|Swipe Transaction|-4733023138943446282|Chicago|IL|60643|5812|None

0|0|2002|12|2|20|19|78.32|Swipe Transaction|-2744911404133435018|Chicago|IL|60645|5812|None|0
0|0|2002|12|2|23|22|127.0|Swipe Transaction|-6406662083475903219|Chicago|IL|60643|3390|None|0
0|0|2002|12|2|23|48|211.0|Swipe Transaction|-7807051024009846392|Peoria|IL|61604|3684|None|0
0|0|2002|12|4|13|25|12.04|Swipe Transaction|-4733023138943446282|Chicago|IL|60643|5812|None


In [15]:
# example criterion to check if there is a complete or partial final row
len(sentences[0].rsplit('\n', 1)[-1].split(DELIMITER)) == len(COLUMNS)

False

Notice that the last row in each `sentences` item is a partial row; remove it so that only complete rows are saved. Adjusting the `SEQ_LEN` variable prior to training the model can help reduce the amount of extra text. We discard the partial row because the TabularTokenizer was created to iterate over complete rows. A more recent implementation in <a href="https://github.com/NVIDIA/NeMo">NeMo-Megatron</a> addresses this. In our iteration example below, this is taken into account by reducing `NUM_CONTEXT_ROWS` by 1 to ensure a complete row is generated.
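
A minimal sketch of that cleanup, applied to every sentence independently, might look like the following. `drop_partial_last_row` is a hypothetical helper and the toy rows use a reduced set of five columns; note that a row cut off in the middle of its final field would still pass this field-count check.

```python
def drop_partial_last_row(sentence, n_columns, delimiter="|"):
    """Return `sentence` without its final row if that row has too few fields."""
    rows = sentence.split("\n")
    if rows and len(rows[-1].split(delimiter)) != n_columns:
        rows = rows[:-1]
    return "\n".join(rows)


toy_sentences = [
    "0|0|78.32|None|0\n0|0|10.6|None",     # last row lacks its final field
    "0|0|78.32|None|0\n0|0|12.04|None|0",  # last row is complete
]
cleaned = [drop_partial_last_row(s, n_columns=5) for s in toy_sentences]
print(cleaned)
```

Checking each sentence separately matters when `BATCH_SIZE` is greater than 1, since one sequence may end on a complete row while another does not.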

# 5. Iteratively Generating transactions

One strategy to iteratively generate transactions is as follows:
- Generate transactions unconditionally.
- Use the first generated transactions as the context on the second iteration to generate 1 additional row.
- Use a rolling window on the context in subsequent iterations to generate 1 new row per iteration.

If an `<|endoftext|>` token appears during generation, the model has finished generating transactions for that particular user and has begun generating transactions for another user. It is up to the developer whether to keep track of the different generated users and segment their transactions accordingly. In particular, the developer must decide whether two instances of `user A` count as one user or as two distinct users when
- `user A` was also generated in a different batch, assuming `BATCH_SIZE` > 1
- within a batch, `user A` was generated, then another `user B`, and later `user A` again

In the example below, we do not perform any user segmentation, although we did in practice. The synthetic dataset provided in the Evaluation notebook includes this segmentation, but we merged all user instances together for the analysis.
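
For readers who do want segmentation, one possible approach is sketched below. It is not the procedure used to build the provided dataset: `split_by_user` is a hypothetical helper, and it assumes that the user id is the first field of every row.

```python
from collections import defaultdict

ENDOFTEXT = "<|endoftext|>"


def split_by_user(text, delimiter="|", merge_repeats=True):
    """Split generated text at ENDOFTEXT and group rows by their first field."""
    users = defaultdict(list)
    for seg_idx, segment in enumerate(text.split(ENDOFTEXT)):
        for row in segment.split("\n"):
            if not row:
                continue  # skip empty lines around the separator
            user_id = row.split(delimiter)[0]
            # Either pool repeated ids, or keep each segment as its own user.
            key = user_id if merge_repeats else (user_id, seg_idx)
            users[key].append(row)
    return dict(users)


toy_text = "7|a\n7|b\n<|endoftext|>9|c\n<|endoftext|>7|d"
merged = split_by_user(toy_text, merge_repeats=True)
separate = split_by_user(toy_text, merge_repeats=False)
print(merged)
print(separate)
```

The `merge_repeats` flag captures the choice described above: whether a user id that reappears after an `<|endoftext|>` token is pooled with its earlier rows or treated as a separate user.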

In [16]:
import requests
import json
from coder.column_code import ColumnTokenizer
import pickle

from tqdm import tqdm

with open(vocabulary_path, 'rb') as handle:
    cc: ColumnTokenizer = pickle.load(handle)

NITERATIONS = 10 # MAKE THIS A LARGER NUMBER FOR LONGER GENERATION

BATCH_SIZE = 1  # Increasing batch size will consume more GPU memory. Adjust to fit on your GPU.
TOKENS_PER_ROW = sum(cc.sizes) + 1
SEQ_LEN=6144  # From the model architecture
# subtract 1 to leave space for one extra generated row; the endoftext token, which is in the first row only, also uses part of this budget
NUM_ROWS = SEQ_LEN//TOKENS_PER_ROW
NUM_CONTEXT_ROWS = NUM_ROWS - 1

# Request params
PORT_NUM = 5000
HEADERS = {"Content-Type": "application/json"}

#NAME OF THE FILES
WRITE_FILES = False
PREFIX_NAME = 'synthetic'
files = []


def request_data(data):
    resp = requests.put('http://localhost:{}/generate'.format(PORT_NUM),
                        data=json.dumps(data), headers=HEADERS)
    sentences = resp.json()['sentences']
    return sentences


def get_condition_text(sentences, NUM_CONTEXT_ROWS):
    condition_text = ['\n'.join([ss for ss in s.split('\n')[-NUM_CONTEXT_ROWS:]]) for s in sentences]
    return condition_text


def get_extra_text(sentences, NUM_CONTEXT_ROWS):
    extra_text = ['\n'.join([ss for ss in s.split('\n')[NUM_CONTEXT_ROWS:]]) for s in sentences]
    return extra_text


def check_last_full_row(sentences, DELIMITER, COLUMNS):
    """removes last row if it is a partial row"""
    if isinstance(sentences, list) and len(sentences[0].rsplit('\n', 1)[-1].split(DELIMITER)) == len(COLUMNS):
        return sentences, True
    s = []
    for sentence in sentences:
        s.append(sentence.rsplit('\n', 1)[0])
    return s, False

In [17]:
# generate the initial transactions
data = {
    "sentences": [""] * BATCH_SIZE,
    "tokens_to_generate": NUM_ROWS * TOKENS_PER_ROW,
    "temperature": 1.0,  # fixed key name (was misspelled "temperate")
    "add_BOS": True
}

sentences = request_data(data)

# round the amount data
sentences = ['\n'.join(round_amount(s.replace(ENDOFTEXT,'').split('\n'), 
                                    COLUMNS.index('amount'),
                                    DELIMITER)
                      ) 
             for s in sentences]

In [18]:
len(get_condition_text(sentences, NUM_CONTEXT_ROWS)[0].split('\n'))

255

In [19]:
if WRITE_FILES:
    for i in range(BATCH_SIZE):
        files.append(open("{}_{}.txt".format(PREFIX_NAME, i), 'w'))
        s = sentences[i]
        files[i].write(s.replace('<|endoftext|>', '\n'))
else:
    generated_rows = [i.split('\n') for i in sentences]

# generate the transactions conditioned on the previous ones
pbar = tqdm(range(NITERATIONS))

for iteration in pbar:
    # get conditional text and prepare payload
    condition_text = get_condition_text(sentences, NUM_CONTEXT_ROWS)

    data = {
        "sentences": condition_text,
        "tokens_to_generate": NUM_ROWS * TOKENS_PER_ROW,
        "temperature": 1.0,  # fixed key name (was misspelled "temperate")
        "add_BOS": False
    }

    # request new generated data
    sentences = request_data(data)
    
    # round the amount data
    sentences = ['\n'.join(round_amount(s.replace(ENDOFTEXT,'').split('\n'), 
                                        COLUMNS.index('amount'),
                                        DELIMITER)
                          ) 
                 for s in sentences]
    
    sentences, is_last_full = check_last_full_row(sentences, DELIMITER, COLUMNS)
    
    if not is_last_full:
        # reduce context by 1 row to provide room for generation
        print('adjusting NUM_CONTEXT_ROWS')
        NUM_CONTEXT_ROWS -= 1
        assert NUM_CONTEXT_ROWS > 0
    
    extra_text = get_extra_text(sentences, NUM_CONTEXT_ROWS)
    
    if WRITE_FILES:
        for i in range(BATCH_SIZE):
            s = extra_text[i]
            files[i].write(s.replace('<|endoftext|>', '\n'))
            files[i].flush()
    else:
        for i in range(BATCH_SIZE):
            # keep generated_rows[i] a flat list of rows, even if several rows are returned
            generated_rows[i].extend(extra_text[i].split('\n'))
    
if WRITE_FILES:
    for i in range(BATCH_SIZE):
        files[i].close()

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:18<00:00,  1.81s/it]


<div><font size="4">That's it! So far, we have learned how to preprocess our raw data and train a Megatron GPT model to generate synthetic data.</font></div>

<div><font size="4">In the next notebook, we will evaluate our synthetic data and compare it against our real data.</font></div>