In this notebook we perform the training process and the evaluation.

<hr>

The process consists of the following stages:

1. Load the dataset produced in the `eda` notebook.
2. Generate n-char samples, then add noise to them to simulate a noisy process such as OCR detection.
3. Create the model and its config (in case you want to use `pipeline.py`), then train it.
4. Make predictions on the dataset and evaluate the results.

In [None]:
import os
import json

import pandas as pd
import numpy as np
import tensorflow as tf

from pprint import pprint
from pathlib import Path

In [None]:
# Remember to update PYTHONPATH before starting the notebook:
# export PYTHONPATH=`pwd`:`pwd`/conabio_ml_text/conabio_ml:`pwd`/conabio_ml_text

from conabio_ml_text.datasets.dataset import Dataset, Partitions
from conabio_ml_text.preprocessing.preprocessing import Tokens, PreProcessing
from conabio_ml_text.preprocessing.transform import Transform

from conabio_ml_text.trainers.bcknds.tfkeras import TFKerasTrainer, TFKerasTrainerConfig
from conabio_ml_text.trainers.bcknds.tfkeras import CHECKPOINT_CALLBACK, TENSORBOARD_CALLBACK

from conabio_ml_text.utils.constraints import TransformRepresentations as TR

from conabio_ml.evaluator.generic.evaluator import Evaluator, Metrics

from conabio_ml.utils.logger import get_logger, debugger

from utils import nchars, datagen
from model import LSTMModel

log = get_logger(__name__)
debug = debugger.debug

Tokens.UNK_TOKEN = Tokens.UNK_TOKEN * 3

In [None]:
# Enable this to see the `debug` messages
# debugger.create(True)

In [None]:
dataset_filepath = Path("dataset/dataset.csv")
results_path = Path("results")

We load the dataset from the `csv` file and split it into `[train, validation, test]` partitions.
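
To illustrate what an 80/10/10 split does, here is a minimal sketch in plain Python. It is not the implementation behind `Dataset.split`; the helper `split_indices`, the fixed seed, and the partition name `"validation"` are assumptions made for the example.

```python
import random

def split_indices(n_items, train_perc=0.8, test_perc=0.1, val_perc=0.1, seed=0):
    """Shuffle range(n_items) and cut it into train/test/validation index lists."""
    if abs(train_perc + test_perc + val_perc - 1.0) > 1e-9:
        raise ValueError("the three fractions must sum to 1")
    indices = list(range(n_items))
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(indices)
    n_train = int(n_items * train_perc)
    n_test = int(n_items * test_perc)
    return {
        "train": indices[:n_train],
        "test": indices[n_train:n_train + n_test],
        # Validation takes the remainder, so rounding never drops an item.
        "validation": indices[n_train + n_test:],
    }

partitions = split_indices(100)
print({name: len(idx) for name, idx in partitions.items()})
```

Because every index lands in exactly one list, the partitions do not share rows. Duplicate values in the `item` column could still end up in different partitions, which is one reason to compare the row count against the number of unique items.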

In [None]:
dataset = Dataset.from_csv(dataset_filepath)
dataset = Dataset.split(dataset,
                        train_perc=0.8,
                        test_perc=0.1,
                        val_perc=0.1)

In [None]:
len(dataset.get_partition("test")), len(dataset.get_partition("test")["item"].unique())

For the training process, which is based on a `char-rnn`, we split each item into n-char samples so the model can learn character-level patterns.

Note that we pass `build_vocab=True` to create the vocabulary according to the resulting dataset. Also, the `func_args` parameters are forwarded to the preprocessing function given in `preprocess_fn=nchars`.
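
As a rough illustration of n-char splitting and vocabulary building, consider the sketch below. The real `nchars` function lives in `utils` and may behave differently: the overlapping (sliding window) split, the helper names `split_into_nchars` and `build_vocab`, and the literal token strings `<PAD>` and `<UNK>` are all assumptions.

```python
def split_into_nchars(text, nchar_size=2):
    """Return the overlapping character n-grams of `text`, left to right."""
    if nchar_size < 1:
        raise ValueError("nchar_size must be positive")
    if len(text) < nchar_size:
        # Too short for a full window: keep the text as a single piece.
        return [text] if text else []
    return [text[i:i + nchar_size] for i in range(len(text) - nchar_size + 1)]

def build_vocab(samples, special_tokens=("<PAD>", "<UNK>")):
    """Collect every distinct n-char, keeping special tokens at the lowest indices."""
    seen = sorted({nchar for sample in samples for nchar in sample})
    return list(special_tokens) + [t for t in seen if t not in special_tokens]

samples = [split_into_nchars(word) for word in ["pinus", "abies"]]
vocab = build_vocab(samples)
print(samples[0])  # ['pi', 'in', 'nu', 'us']
print(len(vocab))
```

Building the vocabulary after the split, as `build_vocab=True` does, means its entries are n-chars rather than single characters or whole words.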

In [None]:
SUBCHAR_SIZE = 2
MAX_SAMPLE_SIZE = np.max(dataset.data["item"].apply(len))
SAMPLE_SIZE = MAX_SAMPLE_SIZE - (SUBCHAR_SIZE + 1)
BATCH_SIZE = 16

dataset = PreProcessing.preprocess(dataset,
                                   build_vocab=True,
                                   preprocess_args={
                                       "fields": ["item"],
                                       "func_args": {
                                           "pad_size": -1,
                                           "nchar_size": SUBCHAR_SIZE,
                                           "unk_token": Tokens.UNK_TOKEN
                                       }
                                   },
                                   preprocess_fn=nchars)
# At this point the dataset is already processed and the vocab representation exists.
# The key is taken from the TransformRepresentations constraints.
pprint(dataset.representations[TR.VOCAB][0:5])
dataset.data.head()

Then, we add a noise process.

From each real sample we produce 3 new noisy samples, 4 in total, with the following methodology:
1. The real sample: `[nchar_1, nchar_2, …, nchar_N, PAD_TOKEN]`
2. Two samples, each replacing one of `{nchar_1, nchar_N}` with `UNK_TOKEN`
3. One sample with both of the previously replaced n-chars set to `UNK_TOKEN`
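
The three steps above can be sketched as follows. This is one reading of the methodology, not the code of `datagen`: the helpers `make_noisy_variants` and `pad`, the right-side padding, and the literal token strings are assumptions.

```python
def make_noisy_variants(nchars, unk_token="<UNK>"):
    """Return [original, first replaced, last replaced, both replaced]."""
    if not nchars:
        raise ValueError("need at least one n-char")
    first_masked = [unk_token] + nchars[1:]
    last_masked = nchars[:-1] + [unk_token]
    if len(nchars) > 1:
        both_masked = [unk_token] + nchars[1:-1] + [unk_token]
    else:
        # With a single n-char, first and last are the same position.
        both_masked = [unk_token]
    return [list(nchars), first_masked, last_masked, both_masked]

def pad(sample, pad_length, pad_token="<PAD>"):
    """Truncate or right-pad `sample` to exactly `pad_length` tokens."""
    return (sample + [pad_token] * pad_length)[:pad_length]

variants = make_noisy_variants(["pi", "in", "nu", "us"])
for variant in variants:
    print(pad(variant, pad_length=6))
```

Each real sample therefore contributes four rows of equal length, which is why the effective number of rows per batch is four times the batch size.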

In [None]:
dataset = Transform.as_data_generator(dataset,
                                      vocab=None,
                                      transform_args={
                                          "pad_length": SAMPLE_SIZE,
                                          "unk_token": Tokens.UNK_TOKEN,
                                          "batch_size": BATCH_SIZE
                                      },
                                      data_generator=datagen)

Example
<hr>

In [None]:
## Be careful with the batch size during model training: we set it here, in
## `as_data_generator`, because the generator produces samples of shape
## `[batch_size X 4 X max_len] (3D)` instead of `[(batch_size X 4) X max_len] (2D)`
#vocab = dataset.representations[TR.VOCAB]
#gen = dataset.representations["data_generators"]["train"]
#sample_x, sample_y = next(gen())
#[print([vocab[char] for char in sample]) for sample in sample_x]

The `trainer_config` contains the parameters of the environment where the model will be trained, such as the callbacks and the strategy (if there is one).

In [None]:
TRAIN_REPRESENTATION = TR.DATA_GENERATORS
trainer_config = TFKerasTrainerConfig.create(config={
        "strategy": None,
        "callbacks": {
            CHECKPOINT_CALLBACK: {
                "filepath": os.path.join(results_path, "checkpoints"),
                "save_best_only": False
            },
            TENSORBOARD_CALLBACK: {
                "log_dir": os.path.join(results_path, "tb_logs")
            }}
    })
VOCAB_SIZE = len(dataset.representations[TR.VOCAB])

We create a template of the `config`, in case you want to use it with the `pipeline.py` script.

In [None]:
config = {
    "dataset": str("code" / dataset_filepath),
    "layers": {
        "input": {
            "T": int(SAMPLE_SIZE)
        },
        "embedding": {
            "V": VOCAB_SIZE,
            "D": 200
        },
        "lstm": {
            "M": 48,
            "dropout":0.6,
            "recurrent_dropout":0.6
        },
        "dense": {
            "K": 2
        }
    },
    "params": {
        "initial_learning_rate": 0.0002,
        "decay_steps": 5000,
        "batch_size": BATCH_SIZE,
        "epochs": 4
    }
}

with open("configs/config.json", mode="w") as _f:
    json.dump(config, _f, indent=4)
    
config

<hr>
Model definition.<br> 
<i>If you change the model (or create a new one), make sure the layer config matches your own definition in <code>LSTMModel.create_model</code>.</i>

In [None]:
layers = config["layers"]
model = LSTMModel.create(model_config={        
    "CLASSIFIER": {
        "layers": {
            "input": layers["input"],
            "embedding": layers["embedding"],
            "lstm": layers["lstm"],
            "dense": layers["dense"]
        }
    }})

We use all the key parts (`dataset`, `model`, and `trainer_config`) to train the model.

In [None]:
lr = config["params"]["initial_learning_rate"]
epochs = config["params"]["epochs"]
trained_model = TFKerasTrainer.train(dataset=dataset,
                                     model=model,
                                     execution_config=trainer_config,
                                     train_config={
                                         "CLASSIFIER": {
                                             "representation": TRAIN_REPRESENTATION,
                                             "optimizer": tf.keras.optimizers.Adam(learning_rate=lr),
                                             "loss": tf.keras.losses.CategoricalCrossentropy(),
                                             "batch_size": None,
                                             "epochs": epochs,
                                             "metrics": ["accuracy"]
                                         }
                                     })

In [None]:
trained_model.model.summary()

You can run this complete process as a sequential pipeline with the `pipeline.py` script:
```shell
python pipeline.py -c path/to/config_file.json
```
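
For orientation, a script that accepts `-c` could read and check the config roughly as sketched below. This is not the source of `pipeline.py`: the `--config` long option, the helpers `parse_args` and `load_config`, and the `REQUIRED_LAYER_KEYS` table (derived from the config template in this notebook) are assumptions.

```python
import argparse
import json
from pathlib import Path

# Keys each layer section is expected to define, mirroring the template above.
REQUIRED_LAYER_KEYS = {
    "input": {"T"},
    "embedding": {"V", "D"},
    "lstm": {"M"},
    "dense": {"K"},
}

def parse_args(argv=None):
    """Parse the command line; `argv=None` falls back to sys.argv."""
    parser = argparse.ArgumentParser(description="Run the pipeline from a config file.")
    parser.add_argument("-c", "--config", type=Path, required=True,
                        help="Path to the JSON config file.")
    return parser.parse_args(argv)

def load_config(path):
    """Load the JSON config and fail early if a layer key is missing."""
    with open(path, encoding="utf-8") as f:
        config = json.load(f)
    for layer, keys in REQUIRED_LAYER_KEYS.items():
        missing = keys - set(config.get("layers", {}).get(layer, {}))
        if missing:
            raise KeyError(f"layer '{layer}' is missing keys: {sorted(missing)}")
    return config
```

Calling `load_config(parse_args(["-c", "configs/config.json"]).config)` would return the dictionary written earlier in this notebook, provided that file exists.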