In this notebook we train the model and evaluate it.

<hr>

The process consists of the following stages:

1. Load the dataset produced in the `eda` notebook.
2. Generate n-char samples, then add noise to them to simulate a noisy process such as OCR detection.
3. Create the model and the model config (in case you want to use `pipeline.py`), then train it.
4. Make predictions on the dataset and evaluate the results.

In [1]:
import os
import json

import pandas as pd
import numpy as np
import tensorflow as tf

from pprint import pprint
from pathlib import Path

In [2]:
# Remember to update PYTHONPATH so the packages below can be imported:
# export PYTHONPATH=`pwd`:`pwd`/conabio_ml_text/conabio_ml:`pwd`/conabio_ml_text

from conabio_ml_text.datasets.dataset import Dataset, Partitions
from conabio_ml_text.preprocessing.preprocessing import Tokens, PreProcessing
from conabio_ml_text.preprocessing.transform import Transform

from conabio_ml_text.trainers.bcknds.tfkeras import TFKerasTrainer, TFKerasTrainerConfig
from conabio_ml_text.trainers.bcknds.tfkeras import CHECKPOINT_CALLBACK, TENSORBOARD_CALLBACK

from conabio_ml_text.utils.constraints import TransformRepresentations as TR

from conabio_ml.evaluator.generic.evaluator import Evaluator, Metrics

from conabio_ml.utils.logger import get_logger, debugger

from utils import nchars, datagen
from model import LSTMModel

log = get_logger(__name__)
debug = debugger.debug

# Repeat the UNK token three times so that it matches the n-char size (3) used below
Tokens.UNK_TOKEN = Tokens.UNK_TOKEN * 3

In [None]:
# Enable this to see the `debug` messages
# debugger.create(True)

In [3]:
dataset_filepath = Path("dataset/dataset.csv")
results_path = Path("results")

We load the dataset from the `csv` file and split it into `[train, validation, test]` partitions (80%, 10% and 10%, respectively).

In [4]:
dataset = Dataset.from_csv(dataset_filepath)
dataset = Dataset.split(dataset,
                        train_perc=0.8,
                        test_perc=0.1,
                        val_perc=0.1)

2021-02-15 16:58:05,000 [conabio_ml.datasets.dataset] [INFO ]  Creating dataset from CSV file: /Users/rrivera/Documents/m.ind/clsspec/code/dataset/dataset.csv
2021-02-15 16:58:05,106 [conabio_ml.datasets.dataset] [DEBUG]  Columns in dataset: Index(['Unnamed: 0', 'item', 'label'], dtype='object')
2021-02-15 16:58:05,108 [conabio_ml.datasets.dataset] [DEBUG]  Creating dataset with 81069 registers
2021-02-15 16:58:05,130 [conabio_ml.datasets.dataset] [DEBUG]  Dataset with ordinal labels
2021-02-15 16:58:05,138 [conabio_ml.datasets.dataset] [DEBUG]  2 categories in dataset
2021-02-15 16:58:05,139 [conabio_ml.datasets.dataset] [INFO ]  Assigning labelmap with [{0: 'non_species'}, {1: 'species'}]
2021-02-15 16:58:05,140 [conabio_ml.datasets.dataset] [WARNI]  Column partition is not present in the dataset, so it has not been partitioned


In [5]:
len(dataset.get_partition("test")), len(dataset.get_partition("test")["item"].unique())

(8108, 8107)
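
As a rough illustration of the split step above, the sketch below assigns an 80/10/10 partition label to each row with NumPy. `assign_partitions` and its `seed` argument are hypothetical and are not part of `conabio_ml`; the library may split differently (it reported 8108 test rows, so its rounding or sampling is not the same as here).

```python
from collections import Counter

import numpy as np


def assign_partitions(n_rows, train_perc=0.8, val_perc=0.1, test_perc=0.1, seed=0):
    """Return one partition name per row (hypothetical helper, not the library's code)."""
    assert abs(train_perc + val_perc + test_perc - 1.0) < 1e-9
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)  # shuffled row positions
    n_train = int(round(n_rows * train_perc))
    n_val = int(round(n_rows * val_perc))
    partitions = np.empty(n_rows, dtype=object)
    partitions[order[:n_train]] = "train"
    partitions[order[n_train:n_train + n_val]] = "validation"
    partitions[order[n_train + n_val:]] = "test"  # the remainder goes to test
    return partitions


# Same number of rows as the dataset loaded above.
partitions = assign_partitions(81069)
split_sizes = dict(Counter(partitions.tolist()))
print(split_sizes)
```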

For the training process, based on a `char-rnn`, we turn each item into a sequence of n-chars (overlapping groups of n characters).

Note that we pass `build_vocab=True` to create the vocabulary according to the resulting dataset. Also, the `func_args` parameters are passed to the preprocessing function given in `preprocess_fn=nchars`.

In [6]:
SAMPLE_SIZE = np.max(dataset.data["item"].apply(lambda x: len(x)))
BATCH_SIZE = 16

dataset = PreProcessing.preprocess(dataset,
                                   build_vocab=True,
                                   preprocess_args={
                                       "fields": ["item"],
                                       "func_args": {
                                           "pad_size": -1,
                                           "nchar_size": 3,
                                           "unk_token": Tokens.UNK_TOKEN
                                       }
                                   },
                                   preprocess_fn=nchars)
# At this point the dataset is already preprocessed and the vocab representation exists.
# The `TR.VOCAB` key is taken from the TransformRepresentations constraints.
pprint(dataset.representations[TR.VOCAB][0:5])
dataset.data.head()

['[PAD]', '[UNK][UNK][UNK]', 'sap', 'apo', 'pov']


Unnamed: 0.1,Unnamed: 0,item,label,partition
0,0,sap apo pov ovi vir iru rus us_ s_i _is is_ s_a,non_species,train
1,1,gen enu nus us_ s_o _of,non_species,train
2,2,cal ali lic ici civ ivi vir iru rus use ses,non_species,validation
3,3,tha hat,non_species,train
4,4,are,non_species,train
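
The `item` column above suggests how the n-chars are built: spaces become `_` and a window of 3 characters slides over the text one character at a time. The sketch below reproduces that behaviour under those assumptions. `nchars_sketch` and `build_vocab_sketch` are hypothetical names, not the actual `utils.nchars` or the library's vocabulary builder, and the input strings are reconstructed from the n-chars shown in the table.

```python
def nchars_sketch(text, nchar_size=3):
    """Split `text` into overlapping windows of `nchar_size` characters."""
    chars = text.replace(" ", "_")
    if len(chars) < nchar_size:
        # Assumption: texts shorter than the window are kept as a single token.
        return [chars]
    return [chars[i:i + nchar_size] for i in range(len(chars) - nchar_size + 1)]


def build_vocab_sketch(token_lists, pad_token="[PAD]", unk_token="[UNK][UNK][UNK]"):
    """Special tokens first, then every n-char in order of first appearance."""
    vocab = [pad_token, unk_token]
    seen = set(vocab)
    for tokens in token_lists:
        for token in tokens:
            if token not in seen:
                seen.add(token)
                vocab.append(token)
    return vocab


items = ["sapovirus is a", "genus of", "that", "are"]
samples = [nchars_sketch(item) for item in items]
vocab = build_vocab_sketch(samples)

print(" ".join(samples[1]))
print(vocab[:5])
```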


Then, we add a noise process. 

From each real sample we produce 3 new noisy samples, 4 in total, with the following methodology:
1. The real sample: [nchar_1, nchar_2, …, nchar_N, PAD_TOKEN]
2. Two samples, each replacing one n-char of {nchar_1, nchar_N} with UNK_TOKEN
3. One sample with both previously replaced n-chars replaced at once
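
The three steps above can be sketched as follows. This is only an illustration: `noisy_variants` is a hypothetical helper, not the `datagen` function from `utils`, and the two positions to replace are passed in explicitly because the notebook does not show how `datagen` selects them. Padding is left out for brevity.

```python
def noisy_variants(nchars, first, second, unk_token="[UNK][UNK][UNK]"):
    """Return the real sample plus three noisy copies (4 samples in total)."""

    def replace(positions):
        # Copy the sample, swapping the n-chars at `positions` for the UNK token.
        return [unk_token if i in positions else tok for i, tok in enumerate(nchars)]

    return [
        list(nchars),              # 1. real sample
        replace({first}),          # 2a. one n-char replaced
        replace({second}),         # 2b. another n-char replaced
        replace({first, second}),  # 3. both replaced
    ]


sample = ["gen", "enu", "nus", "us_", "s_o", "_of"]
variants = noisy_variants(sample, first=0, second=len(sample) - 1)
for variant in variants:
    print(variant)
```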

In [7]:
dataset = Transform.as_data_generator(dataset,
                                      vocab=None,
                                      transform_args={
                                          "pad_length": SAMPLE_SIZE,
                                          "unk_token": Tokens.UNK_TOKEN,
                                          "batch_size": BATCH_SIZE
                                      },
                                      data_generator=datagen)

Example
<hr>

In [8]:
## Be careful with the batch size during model training: the batch size is set in
## `as_data_generator` (and `batch_size=None` is passed to the trainer), because batching again
## at training time would produce samples of `[batch_size X 4 X max_len] (3D)`
## instead of `[(batch_size X 4) X max_len] (2D)`
#vocab = dataset.representations[TR.VOCAB]
#gen = dataset.representations["data_generators"]["train"]
#sample_x, sample_y = next(gen())
#[print([vocab[char] for char in sample]) for sample in sample_x]

The `trainer_config` contains the parameters of the environment in which the model will be trained, such as callbacks and strategies (if any).

In [9]:
TRAIN_REPRESENTATION = TR.DATA_GENERATORS
trainer_config = TFKerasTrainerConfig.create(config={
        "strategy": None,
        "callbacks": {
            CHECKPOINT_CALLBACK: {
                "filepath": os.path.join(results_path, "checkpoints"),
                "save_best_only": False
            },
            TENSORBOARD_CALLBACK: {
                "log_dir": os.path.join(results_path, "tb_logs")
            }}
    })
VOCAB_SIZE = len(dataset.representations[TR.VOCAB])

We create a template of the `config`, in case you want to use it with the `pipeline.py` script.

In [10]:
config = {
    "dataset": str("code" / dataset_filepath),
    "layers": {
        "input": {
            "T": int(SAMPLE_SIZE)
        },
        "embedding": {
            "V": VOCAB_SIZE,
            "D": 100
        },
        "lstm": {
            "M": 20
        },
        "dense": {
            "K": 2
        }
    },
    "params": {
        "initial_learning_rate": 0.0001,
        "decay_steps": 200,
        "batch_size": BATCH_SIZE,
        "epochs": 4
    }
}

with open("configs/config.json", mode="w") as _f:
    json.dump(config, _f, indent=4)
    
config

{'dataset': 'code/dataset/dataset.csv',
 'layers': {'input': {'T': 49},
  'embedding': {'V': 11660, 'D': 100},
  'lstm': {'M': 20},
  'dense': {'K': 2}},
 'params': {'initial_learning_rate': 0.0001,
  'decay_steps': 200,
  'batch_size': 16,
  'epochs': 1}}
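
To get a feel for what these layer sizes imply, the sketch below counts trainable parameters for a plain Embedding → LSTM → Dense stack. This is an assumption: the notebook does not show how `LSTMModel` wires its layers, so the real model may differ (for example, a bidirectional LSTM would double the LSTM term).

```python
# Layer sizes copied from the config printed above.
layers = {
    "input": {"T": 49},
    "embedding": {"V": 11660, "D": 100},
    "lstm": {"M": 20},
    "dense": {"K": 2},
}

V, D = layers["embedding"]["V"], layers["embedding"]["D"]
M = layers["lstm"]["M"]
K = layers["dense"]["K"]

# Embedding: one D-dimensional vector per vocabulary entry.
embedding_params = V * D
# LSTM: 4 gates, each with input weights (D x M), recurrent weights (M x M) and a bias (M).
lstm_params = 4 * (M * (D + M) + M)
# Dense: weights (M x K) plus a bias (K).
dense_params = M * K + K

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Under this assumption almost all parameters sit in the embedding table, so the vocabulary size `V` dominates the model size.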

<hr>
Model definition.<br> <i>If you change it, make sure the config matches your model (or create a new config).</i>

In [11]:
layers = config["layers"]
model = LSTMModel.create(model_config={        
    "CLASSIFIER": {
        "layers": {
            "input": layers["input"],
            "embedding": layers["embedding"],
            "lstm": layers["lstm"],
            "dense": layers["dense"]
        }
    }})

We use all the key parts (`dataset`, `model` and `trainer_config`) to train the model.

In [12]:
lr = config["params"]["initial_learning_rate"]
epochs = config["params"]["epochs"]
trained_model = TFKerasTrainer.train(dataset=dataset,
                                     model=model,
                                     execution_config=trainer_config,
                                     train_config={
                                         "CLASSIFIER": {
                                             "representation": TRAIN_REPRESENTATION,
                                             "optimizer": tf.keras.optimizers.Adam(learning_rate=lr),
                                             "loss": tf.keras.losses.CategoricalCrossentropy(),
                                             "batch_size": None,
                                             "epochs": epochs,
                                             "metrics": ["accuracy"]
                                         }})

Updated vocab
_________________
{'V': 11660, 'D': 100}


2021-02-15 16:58:12,486 [conabio_ml_text.trainers.bcknds.tfkeras_models] [INFO ]  Starting fitting process


  16213/Unknown - 437s 27ms/step - loss: 0.0834 - accuracy: 0.9747
Epoch 00001: saving model to results/checkpoints


2021-02-15 17:05:50,454 [tensorflow  ] [WARNI]  From /Users/rrivera/Documents/m.ind/clsspec/env/lib/python3.7/site-packages/tensorflow/python/ops/resource_variable_ops.py:1817: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.


INFO:tensorflow:Assets written to: results/checkpoints/assets


2021-02-15 17:05:51,684 [tensorflow  ] [INFO ]  Assets written to: results/checkpoints/assets




2021-02-15 17:05:51,908 [conabio_ml_text.trainers.bcknds.tfkeras_models] [INFO ]  Model fitting finished


Finally, we generate a new dataset with the predictions. 

We show the structure of the resulting dataset. We pass `sparse_predictions=False`; by default, the `np.max` function is computed over the samples.

<i>If another function is required, use the `pred_converter_fn` parameter.</i>

In [14]:
predicted_dataset = trained_model.predict(
    dataset=dataset,
    execution_config=None,
    prediction_config={
        "sparse_predictions": False,
        "ommit_uniques": True
    }
)
predicted_dataset.data.head(5)

2021-02-15 17:07:00,231 [conabio_ml.datasets.dataset] [DEBUG]  Creating dataset with 8108 registers
2021-02-15 17:07:00,236 [conabio_ml.datasets.dataset] [DEBUG]  Dataset with ordinal labels
2021-02-15 17:07:00,237 [conabio_ml.datasets.dataset] [DEBUG]  2 categories in dataset
2021-02-15 17:07:00,238 [conabio_ml.datasets.dataset] [INFO ]  Assigning labelmap with [{0: 'non_species'}, {1: 'species'}]


Unnamed: 0.1,Unnamed: 0,item,label,partition,score
15,16,dif iff ffe fer ere ren ent nt_ t_g _ge gen en...,non_species,test,0.999482
22,24,hum uma man ans ns_ s_p _pi pig igs gs_ s_m _m...,non_species,test,0.999488
26,29,sap apo pov ovi vir iru rus use ses,non_species,test,0.998485
37,41,res esu sul ult lts ts_ s_o _of of_ f_a,non_species,test,0.998491
48,52,hip ipp ppo pos osi sid ide der ero ros os_ s_...,non_species,test,0.998266


In [21]:
evaluation = Evaluator.eval(dataset_true=dataset,
                            dataset_pred=predicted_dataset,
                            eval_config={
                                "dataset_partition": Partitions.TEST,
                                "metrics_set": {
                                    Metrics.Sets.MULTICLASS: {
                                        'per_class': True,
                                        'average': 'macro',
                                        "zero_division": 1.0
                                    }
                                }      
                            })

In [24]:
pprint(evaluation.results)

{'MULTICLASS': {'one_class': {'confusion_matrix': [[7051, 6], [1045, 6]],
                              'f1_score': 0.4709648012349917,
                              'labels': ['non_species', 'species'],
                              'precision': 0.6854619565217391,
                              'recall': 0.5024293145377177},
                'per_class': {'non_species': {'f1_score': 0.9306407972018742,
                                              'precision': 0.8709239130434783,
                                              'recall': 0.9991497803599263},
                              'species': {'f1_score': 0.011288805268109126,
                                          'precision': 0.5,
                                          'recall': 0.005708848715509039}}}}
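
The figures above can be reproduced from the confusion matrix alone. The sketch below assumes that rows are true labels and columns are predicted labels, in the order `['non_species', 'species']`, and that the `one_class` values are macro averages of the per-class ones (as requested with `'average': 'macro'`). It is not the `Evaluator` implementation, only the arithmetic behind it.

```python
import numpy as np

# Confusion matrix copied from the evaluation output above.
# Assumed layout: rows = true labels, columns = predicted labels.
cm = np.array([[7051, 6],
               [1045, 6]])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # column sums: predicted per class
recall = true_positives / cm.sum(axis=1)     # row sums: true samples per class
f1 = 2 * precision * recall / (precision + recall)

# Macro average: unweighted mean over the two classes.
macro_precision = precision.mean()
macro_recall = recall.mean()
macro_f1 = f1.mean()

for name, p, r, f in zip(["non_species", "species"], precision, recall, f1):
    print(f"{name}: precision={p:.4f} recall={r:.4f} f1={f:.4f}")
print(f"macro: precision={macro_precision:.4f} recall={macro_recall:.4f} f1={macro_f1:.4f}")
```

Read this way, the matrix shows that only 12 of the 8108 test samples were predicted as `species`, and 6 of the 1051 true `species` samples were recovered, which is why the `species` recall is so low despite the high overall accuracy seen during training.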
