# 03. Exploratory Analysis - Nubank AI Core Transaction Dataset Interview Project

In this section we will explore the tokenization and organization of our transactions to create sequence samples for our dataset.

In [4]:
import pandas as pd
import numpy as np
import math
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt

sns.set_theme()
mpl.rcParams['axes.prop_cycle'] = mpl.cycler(color=["#8A05BE", "#A5D936", "#191919"])


In [5]:
df = pd.read_csv('./nubank_checkpoint_02.csv')

In [6]:
df.head()

| Unnamed: 0 | Agency Name | Amount | Vendor | Transaction Date | Merchant Category Code (MCC) | Original Amount |
|------------|-------------|--------|--------|------------------|------------------------------|-----------------|
| 0 | OKLAHOMA STATE UNIVERSITY | 10 | NACAS | 2013-07-30 | CHARITABLE AND SOCIAL SERVICE ORGANIZATIONS | 890.0 |
| 1 | OKLAHOMA STATE UNIVERSITY | 9 | SHERATON HOTEL | 2013-07-30 | SHERATON | 368.96 |
| 2 | OKLAHOMA STATE UNIVERSITY | 8 | SEARS.COM 9300 | 2013-07-29 | DIRCT MARKETING/DIRCT MARKETERS--NOT ELSEWHERE... | 165.82 |
| 3 | OKLAHOMA STATE UNIVERSITY | 7 | WAL-MART #0137 | 2013-07-30 | GROCERY STORES,AND SUPERMARKETS | 96.39 |
| 4 | OKLAHOMA STATE UNIVERSITY | 8 | STAPLES DIRECT | 2013-07-30 | STATIONERY, OFFICE SUPPLIES, PRINTING AND WRIT... | 125.96 |


## From transactions to sequence of transactions

The idea here is to create a model that can extract deep representations from a sequence of transactions made by a given agency.

Given a long enough sequence of previous transactions, we may start to understand the spending habits of these agencies. The model will use both text relations and categorical data to learn deep features of the sequences.

### Encoding transactions

Let's imagine we have a small dataset of a single agency with only 5 transaction samples.

| Agency Name      | Vendor             | Merchant Category Code (MCC)                       | Timestamp | Amount |
|------------------|---------------------|---------------------------------------------------|-----|------------|
| ATTORNEY GENERAL | STAPLES             | STATIONERY, OFFICE SUPPLIES, PRINTING AND WRIT... | 0   | 7          |
| ATTORNEY GENERAL | DMI  DELL K-12/GOVT | COMPUTERS, COMPUTER PERIPHERAL EQUIPMENT, SOFT... | 0   | 10         |
| ATTORNEY GENERAL | STAPLES             | STATIONERY, OFFICE SUPPLIES, PRINTING AND WRIT... | 1   | 6          |
| ATTORNEY GENERAL | FIZZ-O WATER        | MISCELLANEOUS AND SPECIALTY RETAIL STORES         | 1   | 7          |
| ATTORNEY GENERAL | VERITEXT CORP       | PROFESSIONAL SERVICES NOT ELSEWHERE CLASSIFIED    | 1   | 7          |

Since language models work on tokenized sequence data, we have to find a way to encode these transactions into a text sequence that can then be tokenized by a known tokenizer. One way to do this is to encode each row as if we were writing it out as a sentence. For example, if we have the following transaction:

| Agency Name      | Vendor              | Merchant Category Code (MCC)                      | Timestamp | Amount |
|------------------|---------------------|---------------------------------------------------|-----------|--------|
| ATTORNEY GENERAL | STAPLES             | STATIONERY, OFFICE SUPPLIES, PRINTING AND WRIT... | 0   | 7          |

we can encode it as:

```
Agency Name: ATTORNEY GENERAL, Vendor: STAPLES, Merchant Category Code (MCC): STATIONERY, OFFICE SUPPLIES, PRINTING AND WRIT..., Timestamp: 0, Amount: 7
```
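The sentence-style encoding above can be sketched in a few lines of Python. This is only an illustration: the `encode_as_sentence` helper and the `FIELDS` list are hypothetical names introduced here, and the row is the toy transaction from the table, not a row read from the checkpoint CSV.

```python
# Field order used for the toy example (an assumption for this sketch).
FIELDS = [
    "Agency Name",
    "Vendor",
    "Merchant Category Code (MCC)",
    "Timestamp",
    "Amount",
]


def encode_as_sentence(row, fields=FIELDS):
    """Join 'field: value' pairs with commas, in a fixed field order."""
    return ", ".join(f"{name}: {row[name]}" for name in fields)


# Toy transaction copied from the table above.
row = {
    "Agency Name": "ATTORNEY GENERAL",
    "Vendor": "STAPLES",
    "Merchant Category Code (MCC)": "STATIONERY, OFFICE SUPPLIES, PRINTING AND WRIT...",
    "Timestamp": 0,
    "Amount": 7,
}

sentence = encode_as_sentence(row)
print(sentence)
```

Rows of a pandas DataFrame could be passed in the same dictionary form, for instance via `df.to_dict("records")`, with `fields` adjusted to the actual column names.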

This sentence can then be tokenized and fed into a language model. However, it doesn't fully capture both the tabular and the sequential nature of the transaction. To do that, we can use special tokens to organize our data into a sequence that, instead of representing a sentence, represents a *transaction*. BERT's vocabulary, for example, includes the special tokens `[CLS]`, which we can use to mark the start of a transaction, and `[SEP]`, which we can use to separate fields.

```
[CLS] Agency Name: ATTORNEY GENERAL [SEP] Vendor: STAPLES [SEP] Merchant Category Code (MCC): STATIONERY, OFFICE SUPPLIES, PRINTING AND WRIT... [SEP] Timestamp: 0 [SEP] Amount: 7 [SEP] [SEP]
```

This way we can encode the information more clearly, indicating explicitly where each transaction starts and where each field ends. It also lets us tell different transactions apart.
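A minimal sketch of this special-token layout is shown below. The `encode_transaction` helper is a hypothetical name, not code from the project. The example string above ends with a doubled `[SEP]`; the sketch reproduces that and reads the extra token as an end-of-transaction marker, which is an assumption.

```python
CLS, SEP = "[CLS]", "[SEP]"

# Field order used for the toy example (an assumption for this sketch).
FIELDS = [
    "Agency Name",
    "Vendor",
    "Merchant Category Code (MCC)",
    "Timestamp",
    "Amount",
]


def encode_transaction(row, fields=FIELDS):
    """Encode one row as: [CLS] field: value [SEP] ... [SEP] [SEP]."""
    parts = [CLS]
    for name in fields:
        parts.append(f"{name}: {row[name]}")
        parts.append(SEP)  # one [SEP] after every field
    parts.append(SEP)  # extra [SEP], assumed to close the transaction
    return " ".join(parts)


row = {
    "Agency Name": "ATTORNEY GENERAL",
    "Vendor": "STAPLES",
    "Merchant Category Code (MCC)": "STATIONERY, OFFICE SUPPLIES, PRINTING AND WRIT...",
    "Timestamp": 0,
    "Amount": 7,
}

encoded = encode_transaction(row)
print(encoded)
```

If a string like this is later passed to a Hugging Face BERT tokenizer, `add_special_tokens=False` is probably what we want, so that the tokenizer does not wrap the text in its own `[CLS]` and `[SEP]`.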

### Encoding transaction sequences


Now that we can encode a single transaction, we can do the same for a sequence of transactions. Suppose we have a time-series sequence of 2 transactions:

| Agency Name      | Vendor             | Merchant Category Code (MCC)                       | Timestamp | Amount |
|------------------|---------------------|---------------------------------------------------|-----|------------|
| ATTORNEY GENERAL | STAPLES             | STATIONERY, OFFICE SUPPLIES, PRINTING AND WRIT... | 0   | 7          |
| ATTORNEY GENERAL | DMI  DELL K-12/GOVT | COMPUTERS, COMPUTER PERIPHERAL EQUIPMENT, SOFT... | 0   | 10         |

We can encode this sequence by encoding each transaction individually and then concatenating the results into a single time-dependent sample sequence.


```
[CLS] Agency Name: ATTORNEY GENERAL [SEP] Vendor: STAPLES [SEP] Merchant Category Code (MCC): STATIONERY, OFFICE SUPPLIES, PRINTING AND WRIT... [SEP] Timestamp: 0 [SEP] Amount: 7 [SEP] [SEP] [CLS] Agency Name: ATTORNEY GENERAL [SEP] Vendor: DMI  DELL K-12/GOVT [SEP] Merchant Category Code (MCC): COMPUTERS, COMPUTER PERIPHERAL EQUIPMENT, SOFT... [SEP] Timestamp: 0 [SEP] Amount: 10 [SEP] [SEP]
```

This encoding allows us to express some interesting things:

1. The `[CLS]` and `[SEP]` tokens mark clearly where each transaction begins and where each of its fields is located.
2. The position of each transaction in the sequence encodes the time-dependent ordering of our data.
3. The text fields carry contextual representations of the values, and with pre-trained language models we can extract deep features from them more efficiently.
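The concatenation step can be sketched as follows. Both helpers are hypothetical names, and sorting by `Timestamp` before concatenating is an assumption about how the time ordering would be enforced; the rows are the two toy transactions from the table above.

```python
CLS, SEP = "[CLS]", "[SEP]"

# Field order used for the toy example (an assumption for this sketch).
FIELDS = [
    "Agency Name",
    "Vendor",
    "Merchant Category Code (MCC)",
    "Timestamp",
    "Amount",
]


def encode_transaction(row, fields=FIELDS):
    """Encode one row as: [CLS] field: value [SEP] ... [SEP] [SEP]."""
    parts = [CLS]
    for name in fields:
        parts.extend([f"{name}: {row[name]}", SEP])
    parts.append(SEP)  # extra [SEP], assumed to close the transaction
    return " ".join(parts)


def encode_sequence(rows, fields=FIELDS):
    """Encode each transaction and concatenate them in time order."""
    # sorted() is stable, so rows sharing a timestamp keep their input order.
    ordered = sorted(rows, key=lambda r: r["Timestamp"])
    return " ".join(encode_transaction(r, fields) for r in ordered)


rows = [
    {
        "Agency Name": "ATTORNEY GENERAL",
        "Vendor": "STAPLES",
        "Merchant Category Code (MCC)": "STATIONERY, OFFICE SUPPLIES, PRINTING AND WRIT...",
        "Timestamp": 0,
        "Amount": 7,
    },
    {
        "Agency Name": "ATTORNEY GENERAL",
        "Vendor": "DMI  DELL K-12/GOVT",
        "Merchant Category Code (MCC)": "COMPUTERS, COMPUTER PERIPHERAL EQUIPMENT, SOFT...",
        "Timestamp": 0,
        "Amount": 10,
    },
]

sequence = encode_sequence(rows)
print(sequence)
```

Because each transaction starts with its own `[CLS]`, the boundaries between transactions stay recoverable after concatenation.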


### Reducing token selection variability

Although we decided in the data cleaning process to quantize and categorize our numerical/time features, once these values are written out as text the tokenizer may represent them with a variety of tokens. That is, the quantized amount `7` may be represented, for example, by the tokens `7`, ` 7`, `7 `, `: 7` and so forth. To reduce this variability in the representation of our numerical values, we can add new tokens to the vocabulary that represent exactly the categorical values we want to model.

For example, if we know we have 20 categorical values for `Amount`, we can create 20 new special tokens `<Amount_bin_0>, <Amount_bin_1> ... <Amount_bin_19>` and add them to our vocabulary. Then, when we encode our dataset, we translate our `Amount` data into these categorical text representations, which are then tokenized as our new tokens. This way, our model learns to output only 20 possible tokens for `Amount`, instead of many more possible representations.
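As a rough sketch, the token names and the value-to-token mapping could be built like this. The names `NUM_AMOUNT_BINS`, `amount_tokens` and `amount_to_token` are hypothetical, and the sketch assumes the quantized `Amount` values are integers from 0 to 19, which may not match the real binning.

```python
NUM_AMOUNT_BINS = 20  # assumed number of Amount categories

# One dedicated token per bin: <Amount_bin_0> ... <Amount_bin_19>.
amount_tokens = [f"<Amount_bin_{i}>" for i in range(NUM_AMOUNT_BINS)]


def amount_to_token(amount_bin, num_bins=NUM_AMOUNT_BINS):
    """Map a quantized Amount value to its dedicated token string."""
    if not 0 <= amount_bin < num_bins:
        raise ValueError(f"Amount bin {amount_bin} is outside 0..{num_bins - 1}")
    return f"<Amount_bin_{amount_bin}>"


print(amount_tokens[:3])
print(amount_to_token(7))

# With a Hugging Face tokenizer and model (not loaded in this sketch), the
# new tokens would then be registered along these lines:
#   tokenizer.add_tokens(amount_tokens)
#   model.resize_token_embeddings(len(tokenizer))
```

The encoded field would then read `Amount: <Amount_bin_7>` rather than `Amount: 7`, so the tokenizer has a single token to choose for each bin.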