## Training an ORiGAMi model on Tabular Data from OpenML

This notebook demonstrates how to train an ORiGAMi model on tabular data from [OpenML](https://www.openml.org/).

While ORiGAMi is designed for semi-structured data, it also works on tabular data, which is a special case of semi-structured data: every object has only top-level fields, and no fields are missing.
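
To make the distinction concrete, the sketch below contrasts a flat, table-like set of records with a nested one and checks whether a collection qualifies as tabular in the sense described here. The helper `is_tabular` and the example records are hypothetical illustrations and not part of ORiGAMi.

```python
def is_tabular(docs):
    """Return True if all docs share the same top-level fields and hold only scalar values."""
    if not docs:
        return True
    fields = set(docs[0])
    for doc in docs:
        # a missing or extra field breaks the fixed-column assumption
        if set(doc) != fields:
            return False
        # nested objects or arrays make the data semi-structured
        if any(isinstance(value, (dict, list)) for value in doc.values()):
            return False
    return True


flat_docs = [
    {"top-left-square": "x", "Class": "positive"},
    {"top-left-square": "o", "Class": "negative"},
]
nested_docs = [
    {"name": "a", "address": {"city": "Berlin"}},
    {"name": "b"},  # "address" is missing here
]

print(is_tabular(flat_docs))    # True
print(is_tabular(nested_docs))  # False
```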

The notebook trains a model on the [Tic Tac Toe dataset](https://www.openml.org/search?type=data&status=active&id=50), which encodes all possible board configurations at the end of tic tac toe games. The target label indicates whether the player who plays "x" wins (`positive`) or not (`negative`). This is not a difficult dataset, and many classic algorithms can achieve 100% classification accuracy. We use this relatively small dataset to show how the data and model are prepared for training.

Note that if you choose a different OpenML dataset, or your own data, the hyperparameters and model configuration used here may not be ideal, and you may need some hyperparameter exploration to get the best results.


### Loading the dataset

OpenML datasets have unique IDs. The Tic Tac Toe dataset has ID 50. For other datasets, see the [Datasets](https://www.openml.org/search?type=data&status=active) page on OpenML's website.

First, we load the data into a list of dictionaries and read the target field name from the dataset metadata.


```python
from openml import datasets

# Fetch tic tac toe dataset by ID (50)
dataset = datasets.get_dataset(50)

# Get the data in pandas DataFrame format
X, _, _, _ = dataset.get_data()

display(X.head())

# Convert DataFrame to list of dictionaries
data = X.to_dict("records")

# get the name of the target label
target_field = dataset.default_target_attribute

print(f"Number of instances: {len(data)}")
print(f'Target field for classification: "{target_field}"')
print(f"Example instance:\n{data[0]}")
```

| | top-left-square | top-middle-square | top-right-square | middle-left-square | middle-middle-square | middle-right-square | bottom-left-square | bottom-middle-square | bottom-right-square | Class |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | x | x | x | x | o | o | x | o | o | positive |
| 1 | x | x | x | x | o | o | o | x | o | positive |
| 2 | x | x | x | x | o | o | o | o | x | positive |
| 3 | x | x | x | x | o | o | o | b | b | positive |
| 4 | x | x | x | x | o | o | b | o | b | positive |

```text
Number of instances: 958
Target field for classification: "Class"
Example instance:
{'top-left-square': 'x', 'top-middle-square': 'x', 'top-right-square': 'x', 'middle-left-square': 'x', 'middle-middle-square': 'o', 'middle-right-square': 'o', 'bottom-left-square': 'x', 'bottom-middle-square': 'o', 'bottom-right-square': 'o', 'Class': 'positive'}
```
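
Once the data is a list of dictionaries, it can be inspected with plain Python. The following sketch uses three made-up records (not the OpenML data) to count the target labels and to list the feature fields. The variable names are chosen for illustration only.

```python
from collections import Counter

target_field = "Class"

# made-up records with a subset of the Tic Tac Toe fields
sample = [
    {"top-left-square": "x", "middle-middle-square": "o", "Class": "positive"},
    {"top-left-square": "o", "middle-middle-square": "x", "Class": "negative"},
    {"top-left-square": "b", "middle-middle-square": "x", "Class": "positive"},
]

# how often each target label occurs
label_counts = Counter(doc[target_field] for doc in sample)

# every field except the target is a feature
feature_fields = [name for name in sample[0] if name != target_field]

print(label_counts)
print(feature_fields)
```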


### Data Preprocessing Pipeline

Next, we split the data into train and test sets and create prediction pipelines with the built-in `build_prediction_pipelines()` utility.

We also retrieve the `schema`, `encoder` and `block_size` objects, which we'll need for model creation.

Most of the datasets in the ORiGAMi paper use a _shuffled_ training approach with _data upscaling_. For Tic Tac Toe we don't need this, because the individual field positions are not causally related to each other.

For other datasets, you may get better results with shuffling and upscaling:

```python
# pipeline config
config.pipeline.upscale = 10 # you can experiment with different values here
config.pipeline.sequence_order = "SHUFFLED"
```
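
As a rough intuition for these two settings, the sketch below replicates each document several times and permutes the field order of every copy while keeping the target field last. This is an illustration only: `upscale_shuffled` is a hypothetical helper, and keeping the target last is an assumption. ORiGAMi's actual implementation may differ.

```python
import random


def upscale_shuffled(docs, target_field, factor, seed=0):
    """Return `factor` copies of each doc, each with a random order of the non-target fields."""
    rng = random.Random(seed)
    result = []
    for doc in docs:
        keys = [key for key in doc if key != target_field]
        for _ in range(factor):
            order = keys[:]
            rng.shuffle(order)
            # dicts preserve insertion order, so this fixes the field order of the copy
            copy = {key: doc[key] for key in order}
            copy[target_field] = doc[target_field]  # assumption: target stays last
            result.append(copy)
    return result


docs = [{"a": 1, "b": 2, "c": 3, "Class": "positive"}]
for doc in upscale_shuffled(docs, target_field="Class", factor=3):
    print(list(doc))
```

Each copy holds the same key-value pairs as the original, so only the order in which a sequence model would see the fields changes.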


```python
from sklearn.model_selection import train_test_split

from origami.preprocessing import build_prediction_pipelines, docs_to_df
from origami.utils import set_seed
from origami.utils.config import TopLevelConfig

# for reproducibility
set_seed(123)

# load data into "docs" column in dataframe and split into train/test
df = docs_to_df(data)
train_docs_df, test_docs_df = train_test_split(df, test_size=0.2, shuffle=True)

config = TopLevelConfig()

# pipeline config
config.pipeline.upscale = 1
config.pipeline.sequence_order = "ORDERED"

# create train and test pipelines
pipelines = build_prediction_pipelines(pipeline_config=config.pipeline, target_field=target_field, verbose=True)

# process train and test data
train_df = pipelines["train"].fit_transform(train_docs_df)
test_df = pipelines["test"].transform(test_docs_df)

# get stateful objects
schema = pipelines["train"]["schema"].schema
encoder = pipelines["train"]["encoder"].encoder
block_size = pipelines["train"]["padding"].length

# print data stats
print(f"len train: {len(train_df)}, len test: {len(test_df)}")
print(f"vocab size {encoder.vocab_size}")
print(f"block size {block_size}")
```

```text
train pipeline: Pipeline(steps=[('binning',
                 KBinsDiscretizerPipe(strategy='kmeans', threshold=100)),
                ('target', TargetFieldPipe(target_field='Class')),
                ('schema', SchemaParserPipe()),
                ('tokenizer', DocTokenizerPipe()),
                ('padding', PadTruncTokensPipe()),
                ('encoder', TokenEncoderPipe(max_tokens=0))],
         verbose=True)
test pipeline: Pipeline(steps=[('binning',
                 KBinsDiscretizerPipe(strategy='kmeans', threshold=100)),
                ('target', TargetFieldPipe(target_field='Class')),
                ('tokenizer', DocTokenizerPipe()),
                ('padding', PadTruncTokensPipe()),
                ('encoder', TokenEncoderPipe(max_tokens=0))],
         verbose=True)
[Pipeline] ........... (step 1 of 6) Processing binning, total=   0.0s
[Pipeline] ............ (step 2 of 6) Processing target, total=   0.0s
[Pipeline] ............ (step 3 of 6) Processing schema, total=   0
```
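
The training data goes through `fit_transform()` while the test data only goes through `transform()` because some steps are stateful: they learn something from the training set (for example a vocabulary) and must reuse it unchanged on the test set. The toy encoder below illustrates that pattern. It is a hypothetical stand-in and not ORiGAMi's `TokenEncoderPipe`.

```python
class ToyTokenEncoder:
    """Maps tokens to integer ids; id 0 is reserved for unknown tokens."""

    def __init__(self):
        self.vocab = {}

    def fit(self, sequences):
        for seq in sequences:
            for token in seq:
                # assign the next free id to every token not seen before
                self.vocab.setdefault(token, len(self.vocab) + 1)
        return self

    def transform(self, sequences):
        # tokens that were not seen during fit() map to the unknown id 0
        return [[self.vocab.get(token, 0) for token in seq] for seq in sequences]

    def fit_transform(self, sequences):
        return self.fit(sequences).transform(sequences)


train_tokens = [["x", "o", "positive"], ["o", "b", "negative"]]
test_tokens = [["b", "x", "draw"]]  # "draw" never appears in the training data

toy_encoder = ToyTokenEncoder()
train_ids = toy_encoder.fit_transform(train_tokens)
test_ids = toy_encoder.transform(test_tokens)

print(train_ids)  # [[1, 2, 3], [2, 4, 5]]
print(test_ids)   # [[4, 1, 0]]
```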

### Model Creation

Now we configure the model and create a model instance together with a pushdown automaton (VPDA, short for Vectorised Pushdown Automaton), which we pass to the model.

We pass the encoder and schema to the `ObjectVPDA` class, which automatically creates transition rules that only allow next tokens leading to valid objects.
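
The idea of constraining next tokens can be illustrated with a much simpler mechanism than a pushdown automaton: a lookup of allowed successors that is turned into a mask over the model's scores. The sketch below is a toy with a hand-written rule table and made-up scores. It is not how `ObjectVPDA` is implemented.

```python
import math

vocab = ["top-left-square", "Class", "x", "o", "b", "positive", "negative"]

# hypothetical rules: after a field-name token, only that field's values are valid
allowed_next = {
    "top-left-square": {"x", "o", "b"},
    "Class": {"positive", "negative"},
}


def mask_scores(scores, previous_token):
    """Set the score of every token that would lead to an invalid object to -inf."""
    allowed = allowed_next[previous_token]
    return [s if tok in allowed else -math.inf for tok, s in zip(vocab, scores)]


def pick_next(scores, previous_token):
    """Pick the highest-scoring token among those the rules allow."""
    masked = mask_scores(scores, previous_token)
    return vocab[masked.index(max(masked))]


# made-up scores in which the token "x" has the highest raw score
scores = [0.1, 0.2, 2.5, 0.3, 0.1, 1.0, 0.4]
print(pick_next(scores, "Class"))  # positive
```

Even though `"x"` has the highest raw score, it is not a valid value for `Class`, so the mask removes it from consideration.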


```python
from origami.model import ORIGAMI
from origami.model.vpda import ObjectVPDA
from origami.preprocessing import DFDataset
from origami.utils import count_parameters

# wrap dataframes in datasets
train_dataset = DFDataset(train_df)
test_dataset = DFDataset(test_df)

# model config
config.model.n_layer = 4
config.model.n_head = 4
config.model.n_embd = 64
config.model.vocab_size = encoder.vocab_size
config.model.block_size = block_size

# create PDA and pass it to the model
vpda = ObjectVPDA(encoder, schema)
model = ORIGAMI(config.model, config.train, vpda=vpda)

n_params = count_parameters(model)
print(f"Number of parameters: {n_params / 1e6:.2f}M")
```

```text
Number of parameters: 0.20M
```
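
The reported size can be sanity-checked with a back-of-the-envelope estimate. Assuming a standard GPT-style transformer block (four `n_embd` x `n_embd` attention matrices and an MLP with 4x hidden expansion), each layer has roughly `12 * n_embd**2` weights. This architecture is an assumption made for illustration. Biases, layer norms and the embedding tables are ignored, so the estimate is a lower bound under that assumption.

```python
def approx_block_params(n_layer, n_embd):
    """Rough weight count of the transformer blocks, ignoring biases, norms and embeddings."""
    attention = 4 * n_embd**2  # query, key, value and output projections
    mlp = 8 * n_embd**2        # two linear layers with a 4x hidden expansion
    return n_layer * (attention + mlp)


estimate = approx_block_params(n_layer=4, n_embd=64)
print(f"{estimate} weights, about {estimate / 1e6:.2f}M")  # 196608 weights, about 0.20M
```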


### Model Training

We configure the training parameters, create a `Predictor` instance for evaluation and a progress callback function, and then train the model for 2000 batches.


```python
from origami.inference import Predictor
from origami.utils import make_progress_callback

# train config
config.train.learning_rate = 1e-3
config.train.print_every = 10
config.train.eval_every = 100

# create a predictor
predictor = Predictor(model, encoder, target_field)

# create and register progress callback
progress_callback = make_progress_callback(
    config.train, train_dataset=train_dataset, test_dataset=test_dataset, predictor=predictor
)
model.set_callback("on_batch_end", progress_callback)

# train model
model.train_model(train_dataset, batches=2000)
```

```text
|  step: 0  |  epoch: 0  |  batch_num: 0  |  batch_dt: 0.00  |  batch_loss: 2.2953  |  lr: 1.01e-06  |  train_acc: 0.6300  |  test_loss: 2.2927  |  test_acc: 0.6500  |
|  step: 1  |  epoch: 1  |  batch_num: 10  |  batch_dt: 47.73  |  batch_loss: 2.2590  |  lr: 1.10e-05  |
|  step: 2  |  epoch: 2  |  batch_num: 20  |  batch_dt: 49.26  |  batch_loss: 2.2035  |  lr: 2.10e-05  |
|  step: 3  |  epoch: 3  |  batch_num: 30  |  batch_dt: 47.07  |  batch_loss: 2.1020  |  lr: 3.10e-05  |
|  step: 4  |  epoch: 5  |  batch_num: 40  |  batch_dt: 50.02  |  batch_loss: 1.9753  |  lr: 4.10e-05  |
|  step: 5  |  epoch: 6  |  batch_num: 50  |  batch_dt: 47.68  |  batch_loss: 1.8567  |  lr: 5.10e-05  |
|  step: 6  |  epoch: 7  |  batch_num: 60  |  batch_dt: 51.26  |  batch_loss: 1.7385  |  lr: 6.10e-05  |
|  step: 7  |  epoch: 8  |  batch_num: 70  |  batch_dt: 48.23  |  batch_loss: 1.6089  |  lr: 7.10e-05  |
|  step: 8  |  epoch: 10  |  batch_num: 80  |  batch_dt: 49.82  |  batch_loss: 1.4764  |  lr: 8.1
```
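
In the log above, the learning rate grows by roughly 1e-6 per batch, which is consistent with a linear warmup towards the configured `learning_rate` of 1e-3. A minimal sketch of such a schedule follows. The function `linear_warmup_lr` and the warmup length of 1000 batches are assumptions made for illustration, not values read from ORiGAMi's configuration, and whatever happens after the warmup (for example a decay) is not modelled.

```python
def linear_warmup_lr(batch_num, max_lr=1e-3, warmup_batches=1000):
    """Linearly increase the learning rate to max_lr over the first warmup_batches batches."""
    if batch_num >= warmup_batches:
        return max_lr
    return max_lr * (batch_num + 1) / warmup_batches


for batch_num in (0, 10, 20, 1000):
    print(f"batch {batch_num}: lr = {linear_warmup_lr(batch_num):.2e}")
```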

### Model Evaluation

Finally, we evaluate the model on the full test set and compare the first 10 predictions to the ground truth.


```python
# calculate test accuracy
acc = predictor.accuracy(test_dataset, show_progress=True)
print(f"Test accuracy: {acc:.4f}")

# we can also access the predictions with the `predict()` method
predictions = predictor.predict(test_dataset)
print("Model predictions (first 10): ", predictions[:10])
print("Correct labels (first 10): ", test_dataset.df["target"].to_list()[:10])
```

```text
Test accuracy: 1.0000
Model predictions (first 10):  ['positive', 'positive', 'positive', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'negative']
Correct labels (first 10):  ['positive', 'positive', 'positive', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'negative']
```
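
For reference, accuracy is the fraction of predictions that match the ground truth. The sketch below recomputes it by hand for the ten predictions and labels printed above. The function `accuracy` is a hypothetical helper, not the `Predictor.accuracy()` method.

```python
def accuracy(predictions, labels):
    """Fraction of positions where prediction and label agree."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)


predictions = ["positive", "positive", "positive", "positive", "negative",
               "positive", "negative", "positive", "negative", "negative"]
labels = ["positive", "positive", "positive", "positive", "negative",
          "positive", "negative", "positive", "negative", "negative"]

print(f"Accuracy on the first 10 examples: {accuracy(predictions, labels):.4f}")  # 1.0000
```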
