### DEEP Assisted Tagging Tool


In this notebook we present an example of the main model architecture for the assisted tagging tool that will soon be implemented in [**The DEEP**](https://thedeep.io/) platform.

For completeness, let's recap what The DEEP is and how Machine Learning models improve its use.

The DEEP is a collaborative platform for qualitative data analysis that supports humanitarian analytical teams in producing actionable insights. Since its inception in the aftermath of the 2015 Nepal Earthquake, DEEP has contributed significantly to improving the humanitarian data ecosystem. Today it is the largest repository of annotated humanitarian response documents: 50k+ sources/leads and 400k+ entries, used for 300+ projects by 3.5k+ registered users in 60+ countries.

During crises, rapidly identifying important information from available data (news, reports, research, etc.) is crucial to understanding the needs of affected populations and to improving evidence-based decision-making. To make information classification even faster, DEEP uses Natural Language Processing (NLP) and Deep Learning (DL) to support the manual tagging process, giving the humanitarian community more time to produce analyses and take rapid action to save more lives.

Up to now, all the information (of any kind: reports, news, articles, maps, infographics, etc.) uploaded to the platform has been annotated by hand by experts in the humanitarian sector. The tagging was done under several projects according to different predefined multi-label categories (analytical frameworks). Since the data is mostly textual, we internally developed NLP models that could improve and speed up the analysis of the texts. 

Information is also often buried in multi-page reports (PDF, docx, etc.), which makes tagging difficult and time-consuming. Speeding up this step matters most when the humanitarian response has to be optimized quickly, for example during an ongoing natural disaster.

### Data

Let's go into the details of the model now, starting from the data.

In The DEEP platform, each user or group can create a project, which is usually linked to a specific humanitarian crisis, such as a natural disaster, or to a geographic region or state where a rapid response is needed. Users can create custom labels and use them to annotate the information uploaded within each project. For example, a user can upload a document (of any format), select an excerpt of text (one that contains details relevant to the analysis) and annotate it using the project's own labels.

To combine entries from those projects and their various analytical frameworks (sets of labels), we defined a generic analytical framework and transformed our labels accordingly. Our generic analytical framework has 10 different multi-label categories, totalling 86 different labels, covering all areas of a detailed humanitarian analysis.

Our proposed dataset contains 8 categories overall:
- 3 **primary tags**: sectors, pillars/subpillars_2d, pillars/subpillars_1d
- 5 **secondary tags**: affected_groups, demographic_groups, specific_needs_groups, severity, geolocation

Each tag category is treated independently of the others: a separate model is trained for each one.

In this notebook we focus only on a subset of the above categories, the **primary tags**.
Primary tags contain 75 labels, grouped into the following subcategories:
- **Sectors** with 11 labels,
- **2D Pillars** with 6 labels,
- **2D Sub-pillars** with 18 labels,
- **1D Pillars** with 7 labels,
- **1D Sub-pillars** with 33 labels

Let's see how they are divided:

![image info](./plot.png)

We can see that, apart from Sectors, each subcategory has an annotation hierarchy, from Pillar to Sub-pillar (1D and 2D). Furthermore, each text excerpt can be annotated with multiple labels, which makes this a multi-label text classification problem.
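To make the multi-label setting concrete, here is a minimal sketch of how the labels of one excerpt can be turned into a 0/1 target vector. The helper `multi_hot` and the three-label vocabulary are illustrative assumptions, not the encoding code used in `src/`.

```python
from typing import Dict, List


def multi_hot(labels: List[str], label_to_id: Dict[str, int]) -> List[int]:
    """Return a 0/1 vector with a 1 at the index of every label present."""
    vector = [0] * len(label_to_id)
    for label in labels:
        vector[label_to_id[label]] = 1
    return vector


# Illustrative label subset; the names are taken from the data sample in this notebook.
all_labels = sorted(
    [
        "Capacities & Response->International Response",
        "Capacities & Response->National Response",
        "Humanitarian Conditions->Living Standards",
    ]
)
label_to_id = {name: i for i, name in enumerate(all_labels)}

# One excerpt annotated with two labels at once.
excerpt_labels = [
    "Capacities & Response->International Response",
    "Capacities & Response->National Response",
]
print(multi_hot(excerpt_labels, label_to_id))  # [1, 1, 0]
```

An excerpt with no label in a given category simply maps to an all-zero vector.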

The difference between 1D and 2D Pillars (and respective Sub-pillars), as we can see in the previous image, lies in the fact that the 2D subcategory presents an additional level of hierarchy, given by the Sectors. Example:

![image info](./ex1.png)

In [3]:
# let's import a data sample

import pandas as pd

dataset = pd.read_csv("./new_columns_train_val_sample.csv")



In [172]:
dataset.head()

Unnamed: 0,entry_id,excerpt,analysis_framework_id,lead_id,project_id,verified,sectors,subpillars_2d,subpillars_1d,geo_location,...,present_prim_tags,pillars_1d,pillars_2d,prim_tags_level1,subpillars_2d_part1,subpillars_2d_part2,subpillars_1d_part1,subpillars_1d_part2,subpillars_1d_part3,affected_groups
0,489435,"After past, partially implemented attempts, th...",1306.0,67488.0,2225.0,False,['Education'],['Capacities & Response->International Respons...,['Context->Socio Cultural'],['République démocratique du Congo'],...,"['sectors', 'subpillars_2d', 'subpillars_1d']",['Context'],['Capacities & Response'],"['sectors->Education', 'pillars_2d->Capacities...",['Capacities & Response->International Response'],['Capacities & Response->National Response'],[],['Context->Socio Cultural'],[],['All']
1,194719,"[10th November, NW Syria] Now with the lockdow...",1306.0,43771.0,2028.0,False,[],[],['Covid-19->Restriction Measures'],['Syrian Arab Republic'],...,['subpillars_1d'],['Covid-19'],[],['pillars_1d->Covid-19'],[],[],['Covid-19->Restriction Measures'],[],[],[]
2,489431,Extreme poverty and the government’s fiscal li...,1306.0,67488.0,2225.0,False,['Education'],['Humanitarian Conditions->Living Standards'],['Context->Socio Cultural'],['République démocratique du Congo'],...,"['sectors', 'subpillars_2d', 'subpillars_1d']",['Context'],['Humanitarian Conditions'],"['sectors->Education', 'pillars_2d->Humanitari...",['Humanitarian Conditions->Living Standards'],[],[],['Context->Socio Cultural'],[],['All']
3,179077,"Le Sénat congolais, la chambre haute du parlem...",1306.0,41963.0,2225.0,False,[],[],"['Covid-19->Deaths', 'Covid-19->Restriction Me...",['République démocratique du Congo'],...,['subpillars_1d'],['Covid-19'],[],['pillars_1d->Covid-19'],[],[],"['Covid-19->Deaths', 'Covid-19->Restriction Me...",[],[],[]
4,489439,The government is also tackling several import...,1306.0,67488.0,2225.0,False,['Education'],['Capacities & Response->National Response'],['Context->Socio Cultural'],['République démocratique du Congo'],...,"['sectors', 'subpillars_2d', 'subpillars_1d']",['Context'],['Capacities & Response'],"['sectors->Education', 'pillars_2d->Capacities...",[],['Capacities & Response->National Response'],[],['Context->Socio Cultural'],[],['All']


Textual entries are in 3 different languages: **English**, **Spanish** and **French**.

As often happens in sparse multi-label datasets, some labels are underrepresented compared to others, which may cause the model to overfit to the most frequent tags. As a first step, we therefore split the 1D and 2D sub-pillars into more balanced subsets of labels, train a separate model on each subset, and later join the resulting models.
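A quick way to see this imbalance is to count how often each label occurs. The sketch below assumes that the label columns are stored in the CSV as stringified Python lists (as the sample above suggests); the helper name `count_labels` and the three-row column are made up for illustration.

```python
import ast
from collections import Counter
from typing import Iterable


def count_labels(column: Iterable[str]) -> Counter:
    """Count label occurrences in a column of stringified Python lists."""
    counts: Counter = Counter()
    for cell in column:
        # "['a', 'b']" -> ['a', 'b']; literal_eval only parses Python literals.
        counts.update(ast.literal_eval(cell))
    return counts


# Tiny hand-written column, shaped like the `subpillars_1d` values shown above.
sample_column = [
    "['Covid-19->Restriction Measures']",
    "['Covid-19->Deaths', 'Covid-19->Restriction Measures']",
    "[]",
]
counts = count_labels(sample_column)
for label, n in counts.most_common():
    print(f"{label}: {n}")
```

On the real data, the same call could be applied to a column such as `dataset["subpillars_1d"]`, provided its values really are stored as strings.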

### Model

The model is based on a pre-trained transformer architecture. The transformer had to fulfill the following criteria:
- **multilingual** : it needs to work for different languages
- **good performance** : to be useful, the model needs to make accurate predictions
- **fast predictions** : the main goal of the modelling is to give live predictions to taggers while they are working on tagging. Speed is critical in this case: the faster the model, the better.
- **one endpoint only for deployment**: to optimize costs, we want a single endpoint for all models and predictions. To do this, we create one custom class containing all the models and deploy it.
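The single-endpoint idea can be sketched as a thin wrapper that holds one predictor per tag category and returns all predictions in one call. Everything here (`CombinedTagger`, the `Predictor` type, the dummy predictors) is a hypothetical illustration, not the deployed class.

```python
from typing import Callable, Dict, List

# Stand-in type: a "model" maps a batch of excerpts to one list of tags per excerpt.
Predictor = Callable[[List[str]], List[List[str]]]


class CombinedTagger:
    """Hypothetical wrapper exposing several per-category models behind one call."""

    def __init__(self, models: Dict[str, Predictor]) -> None:
        self.models = models

    def predict(self, excerpts: List[str]) -> List[Dict[str, List[str]]]:
        # One result dict per excerpt, filled category by category.
        results: List[Dict[str, List[str]]] = [{} for _ in excerpts]
        for category, model in self.models.items():
            for row, tags in zip(results, model(excerpts)):
                row[category] = tags
        return results


# Dummy predictors, used only to keep the sketch runnable without trained models.
def dummy_sectors(excerpts: List[str]) -> List[List[str]]:
    return [["Education"] for _ in excerpts]


def dummy_subpillars(excerpts: List[str]) -> List[List[str]]:
    return [["Context->Socio Cultural"] for _ in excerpts]


tagger = CombinedTagger({"sectors": dummy_sectors, "subpillars_1d": dummy_subpillars})
predictions = tagger.predict(["first excerpt", "second excerpt"])
print(predictions[0])
```

With this shape, the endpoint only has to call `predict` once per request, whatever the number of underlying models.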

We use the transformer [**microsoft/xtremedistil-l6-h256-uncased**](https://huggingface.co/microsoft/xtremedistil-l6-h256-uncased) as a backbone.

Overall, in this notebook we train three independent models: one for sectors, one for subpillars (1D and 2D) and one for secondary tags.
The sectors model is trained with a standard MLP-like architecture.

For the subpillars tags (and also for secondary tags), we use a tree-like multi-task learning model, fine-tuning the last hidden state of the transformer differently for each subtask. We have 13 different subtasks for the subpillars model (Humanitarian Conditions, At Risk, Displacement, Covid-19, Humanitarian Access, Impact, Information And Communication, Shock/Event, Capacities & Response, Context, Casualties, Priority Interventions, Priority Needs) each of which then has its own final labels, which we want to predict.
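The subtasks correspond to the first level of each `Pillar->Sub-pillar` label name. As a rough sketch of how label indices can be grouped into subtasks (the function name `group_ids_by_first_level` and the five-label list are illustrative assumptions), each group would then feed one output head:

```python
from typing import Dict, List


def group_ids_by_first_level(label_names: List[str]) -> Dict[str, List[int]]:
    """Group label indices by the part of the name before '->'."""
    groups: Dict[str, List[int]] = {}
    for idx, name in enumerate(label_names):
        first_level = name.split("->")[0]
        groups.setdefault(first_level, []).append(idx)
    return groups


label_names = [
    "Capacities & Response->International Response",
    "Capacities & Response->National Response",
    "Context->Socio Cultural",
    "Covid-19->Deaths",
    "Covid-19->Restriction Measures",
]
groups = group_ids_by_first_level(label_names)
print(groups)
# {'Capacities & Response': [0, 1], 'Context': [2], 'Covid-19': [3, 4]}
```

Each list of indices selects the columns of the target matrix that belong to one subtask, so every head only predicts its own sub-pillars.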

Now let's get started with the serious stuff ;)

In [22]:
# we use pytorch and pytorch-lightning as the main frameworks

import torch
import copy, os
import numpy as np
import torch.nn.functional as F
import pytorch_lightning as pl


from typing import Optional
from tqdm.auto import tqdm
from sklearn import metrics
from transformers import AutoModel, AutoTokenizer
from torch.utils.data import DataLoader
from pytorch_lightning.loggers import TensorBoardLogger
from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping
from torch.optim.lr_scheduler import ReduceLROnPlateau
# transformers.AdamW is deprecated; torch's AdamW takes the same lr / weight_decay / eps arguments
from torch.optim import AdamW

# importing some utilities methods

from src.utils import *
from src.loss import FocalLoss
from src.pooling import Pooling
from src.data import CustomDataset

In [17]:
# this is the classifier architecture

class Model(torch.nn.Module):
    
    def __init__(
        self,
        model_name_or_path: str,
        ids_each_level,
        dropout_rate=0.3,
        output_length=384,
        dim_hidden_layer: int = 256,
    ):
        super().__init__()
        
        self.ids_each_level = ids_each_level
        self.l0 = AutoModel.from_pretrained(model_name_or_path)
        self.pool = Pooling(word_embedding_dimension=output_length, pooling_mode="cls")

        self.LayerNorm_backbone = torch.nn.LayerNorm(output_length)
        self.LayerNorm_specific_hidden = torch.nn.LayerNorm(dim_hidden_layer)

        self.dropout = torch.nn.Dropout(dropout_rate)

        self.specific_hidden_layer = [
            torch.nn.Linear(output_length, dim_hidden_layer) for _ in ids_each_level
        ]
        self.specific_hidden_layer = torch.nn.ModuleList(self.specific_hidden_layer)

        self.output_layer = [
            torch.nn.Linear(dim_hidden_layer, len(id_one_level))
            for id_one_level in ids_each_level
        ]
        self.output_layer = torch.nn.ModuleList(self.output_layer)
    
    def forward(self, inputs):
        
        """
        Multi-task forward.
        
        """
        
        output = self.l0(
            inputs["ids"],
            attention_mask=inputs["mask"],
        )
        output = self.pool(
            {
                "token_embeddings": output.last_hidden_state,
                "attention_mask": inputs["mask"],
            }
        )["sentence_embedding"]

        last_hidden_states = [output.clone() for _ in self.ids_each_level]

        heads = []
        for i in range(len(self.ids_each_level)):
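            # task-specific branch: activation, dropout and layer norm
            # (note: self.specific_hidden_layer[i] is defined in __init__ but not applied here)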
            output_tmp = F.selu(last_hidden_states[i])
            output_tmp = self.dropout(output_tmp)
            output_tmp = self.LayerNorm_specific_hidden(output_tmp)

            # output layer
            output_tmp = self.output_layer[i](output_tmp)
            heads.append(output_tmp)

        return heads
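
The heads above output raw logits, which the training class scores with a BCE focal loss. The actual implementation lives in `./src/loss.py` and is not shown in this notebook; the sketch below is only the common textbook formulation for a single logit, and the default `gamma=2.0` is an assumption.

```python
import math


def binary_focal_loss(logit: float, target: int, gamma: float = 2.0) -> float:
    """Focal loss for one logit: -(1 - p_t)**gamma * log(p_t)."""
    p = 1.0 / (1.0 + math.exp(-logit))      # sigmoid probability of the positive class
    p_t = p if target == 1 else 1.0 - p     # probability assigned to the true class
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# With gamma=0 the modulating factor disappears and we recover plain BCE.
bce_uncertain = binary_focal_loss(0.0, 1, gamma=0.0)
focal_uncertain = binary_focal_loss(0.0, 1)
focal_confident = binary_focal_loss(4.0, 1)
print(bce_uncertain, focal_uncertain, focal_confident)
```

The factor `(1 - p_t)**gamma` shrinks the loss of examples that are already classified confidently, so rare, hard labels weigh more in the gradient.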

In [20]:
# pytorch-lightning model class
# as loss we use a BCE focal loss (details in ./src/loss.py)

class Transformer(pl.LightningModule):
    
    def __init__(
        self,
        model_name_or_path: str,
        train_dataset,
        val_dataset,
        train_params,
        val_params,
        tokenizer,
        column_name,
        multiclass,
        learning_rate: float = 1e-5,
        adam_epsilon: float = 1e-7,
        warmup_steps: int = 500,
        weight_decay: float = 0.1,
        train_batch_size: int = 32,
        eval_batch_size: int = 32,
        dropout_rate: float = 0.3,
        max_len: int = 512,
        output_length: int = 384,
        training_device: str = "cuda",
        keep_neg_examples: bool = False,
        dim_hidden_layer: int = 256,
        **kwargs,
    ):

        super().__init__()
        self.output_length = output_length
        self.column_name = column_name
        self.save_hyperparameters()
        self.targets = train_dataset["target"]
        self.tagname_to_tagid = tagname_to_id(train_dataset["target"])
        self.num_labels = len(self.tagname_to_tagid)
        self.get_first_level_ids()

        self.max_len = max_len
        self.model = Model(
            model_name_or_path,
            self.ids_each_level,
            dropout_rate,
            self.output_length,
            dim_hidden_layer,
        )
        self.tokenizer = tokenizer
        self.val_params = val_params

        self.training_device = training_device

        self.multiclass = multiclass
        self.keep_neg_examples = keep_neg_examples

        self.training_loader = self.get_loaders(
            train_dataset, train_params, self.tagname_to_tagid, self.tokenizer, max_len
        )
        self.val_loader = self.get_loaders(
            val_dataset, val_params, self.tagname_to_tagid, self.tokenizer, max_len
        )
        self.Focal_loss = FocalLoss()

    def forward(self, inputs):
        output = self.model(inputs)
        return output

    def training_step(self, batch, batch_idx):
        outputs = self(batch)
        train_loss = self.get_loss(outputs, batch["targets"])

        self.log(
            "train_loss", train_loss.item(), prog_bar=True, on_step=False, on_epoch=True
        )
        return train_loss

    def validation_step(self, batch, batch_idx):
        outputs = self(batch)
        val_loss = self.get_loss(outputs, batch["targets"])
        self.log(
            "val_loss",
            val_loss,
            on_step=False,
            on_epoch=True,
            prog_bar=True,
            logger=False,
        )

        return {"val_loss": val_loss}

    def total_steps(self) -> int:
        """The number of total training steps that will be run. Used for lr scheduler purposes."""
        self.dataset_size = len(self.train_dataloader().dataset)
        num_devices = max(1, self.hparams.gpus)
        effective_batch_size = (
            self.hparams.train_batch_size
            * self.hparams.accumulate_grad_batches
            * num_devices
        )
        return (self.dataset_size / effective_batch_size) * self.hparams.max_epochs

    def configure_optimizers(self):
        "Prepare optimizer and schedule (linear warmup and decay)"

        optimizer = AdamW(
            self.parameters(),
            lr=self.hparams.learning_rate,
            weight_decay=self.hparams.weight_decay,
            eps=self.hparams.adam_epsilon,
        )

        scheduler = ReduceLROnPlateau(
            optimizer, "min", 0.5, patience=self.hparams.max_epochs // 6
        )
        scheduler = {
            "scheduler": scheduler,
            "interval": "epoch",
            "frequency": 1,
            "monitor": "val_loss",
        }
        return [optimizer], [scheduler]

    def train_dataloader(self):
        return self.training_loader

    def val_dataloader(self):
        return self.val_loader

    def get_loaders(
        self, dataset, params, tagname_to_tagid, tokenizer, max_len: int = 128
    ):

        # avoid shadowing the built-in `set`
        torch_dataset = CustomDataset(dataset, tagname_to_tagid, tokenizer, max_len)
        loader = DataLoader(torch_dataset, **params, pin_memory=True)
        return loader

    def get_loss(self, outputs, targets, only_pos: bool = False):

        # keep the if because we want to take negative examples into account for the models that contain
        # no hierarchy (upper level models)
        
        if len(self.ids_each_level) == 1:
            return self.Focal_loss(outputs[0], targets)
        else:
            tot_loss = 0
            for i_th_level in range(len(self.ids_each_level)):
                ids_one_level = self.ids_each_level[i_th_level]
                outputs_i_th_level = outputs[i_th_level]
                targets_one_level = targets[:, ids_one_level]
                
                # main objective: for each level, if row contains only zeros, not to do backpropagation

                if only_pos:
                    mask_ids_neg_example = [
                        not bool(int(torch.sum(one_row)))
                        for one_row in targets_one_level
                    ]
                    outputs_i_th_level[mask_ids_neg_example, :] = 1e-8

                tot_loss += self.Focal_loss(outputs_i_th_level, targets_one_level)

            return tot_loss

    def get_first_level_ids(self):
        
        all_names = list(self.tagname_to_tagid.keys())
        if np.all(["->" in name for name in all_names]):
            first_level_names = list(
                np.unique([name.split("->")[0] for name in all_names])
            )
            self.ids_each_level = [
                # compare the first-level name exactly instead of a substring match
                [i for i in range(len(all_names)) if all_names[i].split("->")[0] == name]
                for name in first_level_names
            ]

        else:
            self.ids_each_level = [[i for i in range(len(all_names))]]

    def custom_predict(
        self, validation_dataset, testing=False, hypertuning_threshold: bool = False
    ):
        """
        1) get raw predictions
        2) postprocess them to output an output compatible with what we want in the inference
        """

        def sigmoid(x):
            return 1 / (1 + np.exp(-x))

        if testing:
            self.val_params["num_workers"] = 0

        validation_loader = self.get_loaders(
            validation_dataset,
            self.val_params,
            self.tagname_to_tagid,
            self.tokenizer,
            self.max_len,
        )

        if torch.cuda.is_available():
            testing_device = "cuda"
        else:
            testing_device = "cpu"

        self.to(testing_device)
        self.eval()
        self.freeze()
        y_true = []
        logit_predictions = []
        indexes = []

        with torch.no_grad():
            for batch in tqdm(
                validation_loader,
                total=len(validation_loader.dataset) // validation_loader.batch_size,
            ):

                if not testing:
                    y_true.append(batch["targets"].detach())
                    indexes.append(batch["entry_id"].detach())

                logits = self(
                    {
                        "ids": batch["ids"].to(testing_device),
                        "mask": batch["mask"].to(testing_device),
                        "token_type_ids": batch["token_type_ids"].to(testing_device),
                    }
                )
                logits = torch.cat(logits, dim=1)  # have a matrix like in the beginning
                logits_to_array = logits.cpu().numpy()
                logit_predictions.append(logits_to_array)

        logit_predictions = np.concatenate(logit_predictions)
        logit_predictions = sigmoid(logit_predictions)

        target_list = list(self.tagname_to_tagid.keys())
        probabilities_dict = []
        # postprocess predictions
        for i in range(logit_predictions.shape[0]):

            # Return predictions
            # row_pred = np.array([0] * self.num_labels)
            row_logits = logit_predictions[i, :]

            # Return probabilities
            probabilities_item_dict = {}
            for j in range(logit_predictions.shape[1]):
                if hypertuning_threshold:
                    probabilities_item_dict[target_list[j]] = row_logits[j]
                else:
                    probabilities_item_dict[target_list[j]] = (
                        row_logits[j] / self.optimal_thresholds[target_list[j]]
                    )

            probabilities_dict.append(probabilities_item_dict)

        if not testing:
            y_true = np.concatenate(y_true)
            indexes = np.concatenate(indexes)
            return indexes, logit_predictions, y_true, probabilities_dict

        else:
            return probabilities_dict

    def hypertune_threshold(self, beta_f1: float = 0.8):
        """
        having the probabilities, loop over a list of thresholds to see which one:
        1) yields the best results
        2) without being an aberrant value
        """

        data_for_threshold_tuning = self.val_loader.dataset.data
        indexes, logit_predictions, y_true, _ = self.custom_predict(
            data_for_threshold_tuning, hypertuning_threshold=True
        )

        thresholds_list = np.linspace(0.0, 1.0, 101)[::-1]
        optimal_thresholds_dict = {}
        optimal_scores = []
        for ids_one_level in self.ids_each_level:
            y_true_one_level = y_true[:, ids_one_level]
            logit_preds_one_level = logit_predictions[:, ids_one_level]

            """if len(self.ids_each_level) > 1: #multitask

                mask_at_least_one_pos = [bool(sum(row)) for row in y_true_one_level]
                threshold_tuning_gt = y_true_one_level[mask_at_least_one_pos]
                threshold_tuning_logit_preds = logit_preds_one_level[mask_at_least_one_pos]
            else: #no multitask
                threshold_tuning_gt = y_true_one_level
                threshold_tuning_logit_preds = logit_predictions

            assert(threshold_tuning_logit_preds.shape == threshold_tuning_gt.shape)"""

            for j in range(len(ids_one_level)):
                scores = []
                for thresh_tmp in thresholds_list:
                    metric = self.get_metric(
                        logit_preds_one_level,
                        y_true_one_level,
                        beta_f1,
                        j,
                        thresh_tmp,
                    )
                    scores.append(metric)

                max_threshold = 0
                max_score = 0
                for i in range(1, len(scores) - 1):
                    score = np.mean(scores[i - 1 : i + 2])
                    if score >= max_score:
                        max_score = score
                        max_threshold = thresholds_list[i]

                optimal_scores.append(max_score)

                optimal_thresholds_dict[
                    list(self.tagname_to_tagid.keys())[ids_one_level[j]]
                ] = max_threshold

        self.optimal_thresholds = optimal_thresholds_dict

        return np.mean(optimal_scores)

    def get_metric(self, preds, groundtruth, beta_f1, column_idx, threshold_tmp):
        columns_logits = np.array(preds[:, column_idx])

        column_pred = [
            1 if one_logit > threshold_tmp else 0 for one_logit in columns_logits
        ]

        if self.multiclass:
            metric = metrics.fbeta_score(
                groundtruth[:, column_idx],
                column_pred,
                beta=beta_f1,  # `beta` is keyword-only in recent scikit-learn versions
                average="macro",
            )
        else:
            metric = metrics.f1_score(
                groundtruth[:, column_idx],
                column_pred,
                average="macro",
            )
        return metric
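
The threshold search in `hypertune_threshold` does not take the single best score: it averages each score with its two neighbours and keeps the threshold whose window scores best, which avoids picking an isolated spike. A stand-alone sketch of that selection step, with made-up thresholds and scores (the function name `pick_smoothed_threshold` is an assumption):

```python
from statistics import mean
from typing import List, Tuple


def pick_smoothed_threshold(
    thresholds: List[float], scores: List[float]
) -> Tuple[float, float]:
    """Return the threshold whose 3-point averaged score is highest."""
    best_threshold, best_score = 0.0, 0.0
    # The first and last thresholds have no full window and are skipped.
    for i in range(1, len(scores) - 1):
        window_score = mean(scores[i - 1 : i + 2])
        if window_score >= best_score:
            best_score = window_score
            best_threshold = thresholds[i]
    return best_threshold, best_score


# Thresholds in decreasing order, as in the class above; scores are invented.
thresholds = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
scores = [0.2, 0.9, 0.2, 0.7, 0.75, 0.7, 0.3]
best_threshold, best_score = pick_smoothed_threshold(thresholds, scores)
print(best_threshold, round(best_score, 3))  # 0.5 0.717
```

Here the raw maximum (0.9 at threshold 0.8) is an isolated spike, while the scores around 0.5 are consistently high, so 0.5 is selected.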

In [24]:
# main class used to train model
    
class CustomTrainer:


    def __init__(
        self,
        train_dataset,
        val_dataset,
        training_column,
        MODEL_DIR: str,
        MODEL_NAME: str,
        TOKENIZER_NAME: str,
        dropout_rate: float,
        train_params,
        val_params,
        gpu_nb: int,
        MAX_EPOCHS: int,
        weight_decay=0.02,
        warmup_steps=500,
        output_length=384,
        max_len=150,
        multiclass_bool=True,
        keep_neg_examples_bool=False,
        learning_rate=3e-5,
        weighted_loss: str = "sqrt",
        training_device: str = "cuda",
        beta_f1: float = 0.8,
        dim_hidden_layer: int = 256
    ) -> None:
        
        self.train_dataset = train_dataset
        self.val_dataset = val_dataset
        self.training_column = training_column
        self.MODEL_DIR = MODEL_DIR
        self.MODEL_NAME = MODEL_NAME
        self.TOKENIZER_NAME = TOKENIZER_NAME
        self.dropout_rate = dropout_rate
        self.train_params = train_params
        self.val_params = val_params
        self.gpu_nb = gpu_nb
        self.MAX_EPOCHS = MAX_EPOCHS
        self.weight_decay = weight_decay
        self.warmup_steps = warmup_steps
        self.output_length = output_length
        self.max_len = max_len
        self.multiclass_bool = multiclass_bool
        self.keep_neg_examples_bool = keep_neg_examples_bool
        self.learning_rate = learning_rate
        self.weighted_loss = weighted_loss
        self.training_device = training_device
        self.beta_f1 = beta_f1
        self.dim_hidden_layer = dim_hidden_layer

    def train_model(self):
        PATH_NAME = self.MODEL_DIR
        if not os.path.exists(PATH_NAME):
            os.makedirs(PATH_NAME)

        early_stopping_callback = EarlyStopping(
            monitor="val_loss", patience=2, mode="min"
        )

        checkpoint_callback_params = {
            "save_top_k": 1,
            "verbose": True,
            "monitor": "val_loss",
            "mode": "min",
        }

        FILENAME = "model_" + self.training_column
        dirpath_pillars = str(PATH_NAME)
        checkpoint_callback = ModelCheckpoint(
            dirpath=dirpath_pillars, filename=FILENAME, **checkpoint_callback_params
        )

        logger = TensorBoardLogger("lightning_logs", name=FILENAME)

        trainer = pl.Trainer(
            logger=logger,
            callbacks=[early_stopping_callback, checkpoint_callback],
            progress_bar_refresh_rate=40,
            profiler="simple",
            log_gpu_memory=True,
            weights_summary=None,
            gpus=self.gpu_nb,
            precision=16,
            accumulate_grad_batches=1,
            max_epochs=self.MAX_EPOCHS,
            gradient_clip_val=1,
            gradient_clip_algorithm="norm"
        )
        
        tokenizer = AutoTokenizer.from_pretrained(self.TOKENIZER_NAME)
        
        model = Transformer(
            model_name_or_path=self.MODEL_NAME,
            train_dataset=self.train_dataset,
            val_dataset=self.val_dataset,
            train_params=self.train_params,
            val_params=self.val_params,
            tokenizer=tokenizer,
            column_name=self.training_column,
            gpus=self.gpu_nb,
            plugin="deepspeed_stage_3_offload",
            accumulate_grad_batches=1,
            max_epochs=self.MAX_EPOCHS,
            dropout_rate=self.dropout_rate,
            weight_decay=self.weight_decay,
            warmup_steps=self.warmup_steps,
            output_length=self.output_length,
            learning_rate=self.learning_rate,
            multiclass=self.multiclass_bool,
            weighted_loss=self.weighted_loss,
            training_device=self.training_device,
            keep_neg_examples=self.keep_neg_examples_bool,
            dim_hidden_layer=self.dim_hidden_layer
        )

        """lr_finder = trainer.tuner.lr_find(model)
        new_lr = lr_finder.suggestion()
        model.hparams.learning_rate = new_lr"""
        trainer.fit(model)
        model.train_f1_score = model.hypertune_threshold(self.beta_f1)

        del model.training_loader
        del model.val_loader

        return model

In [22]:
train_data, val_data = preprocess_df(dataset, column_name="subpillars_2d_part1", multiclass_bool=True)

In [39]:
batch_size = 32
MODEL_DIR = "./models"
tokenizer_name = "microsoft/xtremedistil-l6-h256-uncased"
model_name = "microsoft/xtremedistil-l6-h256-uncased"

In [40]:
train_params = {
    "batch_size": batch_size,
    "num_workers": 4,
}
val_params = {
    "batch_size": batch_size,
    "shuffle": False,
    "num_workers": 4,
}

In [41]:
trainer = CustomTrainer(train_dataset=train_data,
        val_dataset=val_data,
        training_column="subpillars_2d_part1",
        MODEL_DIR=MODEL_DIR,
        MODEL_NAME= model_name,
        TOKENIZER_NAME=tokenizer_name,
        dropout_rate=0.3,
        train_params=train_params,
        val_params=val_params,
        gpu_nb=1,
        MAX_EPOCHS=3)

In [None]:
model = trainer.train_model()

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Using native 16bit precision.



cuda


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
FIT Profiler Report

Action                             	|  Mean duration (s)	|Num calls      	|  Total time (s) 	|  Percentage %   	|
--------------------------------------------------------------------------------------------------------------------------------------
Total                              	|  -              	|_              	|  45.049         	|  100 %          	|
--------------------------------------------------------------------------------------------------------------------------------------
evaluation_step_and_end            	|  3.3706         	|2              	|  6.7413         	|  14.964         	|
validation_step                    	|  3.3702         	|2              	|  6.7404         	|  14.962         	|
cache_result                       	|  2.9364e-05     	|15             	|  0.00044046     	|  0.00097773     	|
on_validation_batch_start          	|  6.6459e-05     	|2              	|  0.00013292     	|  0.00029505 

Finding best initial lr:   0%|          | 0/100 [00:00<?, ?it/s]

FIT Profiler Report

Action                             	|  Mean duration (s)	|Num calls      	|  Total time (s) 	|  Percentage %   	|
--------------------------------------------------------------------------------------------------------------------------------------
Total                              	|  -              	|_              	|  191.05         	|  100 %          	|
--------------------------------------------------------------------------------------------------------------------------------------
run_training_epoch                 	|  145.92         	|1              	|  145.92         	|  76.378         	|
run_training_batch                 	|  1.6881         	|86             	|  145.18         	|  75.989         	|
optimizer_step_and_closure_0       	|  1.6862         	|86             	|  145.02         	|  75.904         	|
training_step_and_backward         	|  1.668          	|86             	|  143.45         	|  75.083         	|
backward                           

Validation sanity check: 0it [00:00, ?it/s]

Training: 85it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Epoch 0, global step 6385: val_loss reached 0.01959 (best 0.01959), saving model to "/home/ec2-user/SageMaker/models/model_subpillars_2d_part1.ckpt" as top 1


In [None]:
c_m = Transformer.load_from_checkpoint("/home/ec2-user/SageMaker/models/model_pillars_2d.ckpt")

In [None]:
c_m.hypertune_threshold(0.5)

In [None]:
res = c_m.custom_predict(test, testing=True)

In [None]:
c_m.optimal_thresholds

In [None]:
preds_final = []
for entry_nb in range (test.shape[0]):
    preds_column = res[entry_nb]
    preds_entry = [
        sub_tag for sub_tag in list(preds_column.keys()) if preds_column[sub_tag]>1
    ]
    preds_final.append(preds_entry)
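
The `> 1` test works because `custom_predict` returns each probability divided by the tuned threshold of its tag, so a ratio above 1 means the probability exceeds that threshold. A small self-contained illustration with invented numbers (the helper `select_tags` is hypothetical):

```python
from typing import Dict, List


def select_tags(probabilities: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Keep the tags whose probability / threshold ratio is above 1."""
    return [tag for tag, p in probabilities.items() if p / thresholds[tag] > 1]


# Same probability for both tags, but different per-tag thresholds.
probabilities = {"Covid-19->Deaths": 0.40, "Covid-19->Restriction Measures": 0.40}
thresholds = {"Covid-19->Deaths": 0.25, "Covid-19->Restriction Measures": 0.60}
selected = select_tags(probabilities, thresholds)
print(selected)  # ['Covid-19->Deaths']
```

Working with ratios lets every tag share the same decision rule even though each one has its own threshold.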

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer

multi = MultiLabelBinarizer()
multi.fit(test.target)
labels = multi.transform(test.target)

In [None]:
labels.shape

In [None]:
test_data = pd.read_csv("./test_v0.7.1.csv")

In [None]:
test = preprocess_df(test_data, column_name="subpillars_2d", multiclass_bool=True, test=True)

In [None]:
Y = multi.transform(preds_final)

In [None]:
print(classification_report(labels, Y, target_names=list(multi.classes_), zero_division=0))

In [None]:
import numpy as np
import pandas as pd
from sklearn import metrics

def get_matrix(column_of_columns, id_to_tag, nb_subtags):
    """Build a 0/1 matrix: one row per entry, one column per subtag index."""
    matrix = [[
        1 if id_to_tag[i] in column else 0 for i in range(nb_subtags)
    ] for column in column_of_columns]
    return np.array(matrix)

def assess_performance(preds, groundtruth, subtags):
    """
    INPUTS:
        preds: List[List[str]]: predicted tags for each entry
        groundtruth: List[List[str]]: true tags for each entry
        subtags: List[str]: subtags, sorted in alphabetical order
    OUTPUTS:
        pd.DataFrame: one row per subtag plus a final 'mean' row;
        columns: macro and per-class (1 / 0) precision, recall and f1_score,
        hamming_loss, zero_one_loss
    """
    results_dict = {}
    nb_subtags = len(subtags)
    # Maps column index -> subtag name.
    id_to_tag = {i: subtags[i] for i in range(nb_subtags)}
    groundtruth_col = get_matrix(groundtruth, id_to_tag, nb_subtags)
    preds_col = get_matrix(preds, id_to_tag, nb_subtags)
    for j in range(groundtruth_col.shape[1]):  
        preds_subtag = preds_col[:, j]
        groundtruth_subtag = groundtruth_col[:, j]
        results_subtag = {
            'macro_precision': np.round(
                metrics.precision_score(groundtruth_subtag, preds_subtag, average='macro'), 3
            ),
            'macro_recall': np.round(
                metrics.recall_score(groundtruth_subtag, preds_subtag, average='macro'), 3
            ),
            'macro_f1_score': np.round(
                metrics.f1_score(groundtruth_subtag, preds_subtag, average='macro'), 3
            ),
            '1_precision': np.round(
                metrics.precision_score(groundtruth_subtag, preds_subtag, average='binary', pos_label=1), 3
            ),
            '0_precision': np.round(
                metrics.precision_score(groundtruth_subtag, preds_subtag, average='binary', pos_label=0), 3
            ),
            '1_recall': np.round(
                metrics.recall_score(groundtruth_subtag, preds_subtag, average='binary', pos_label=1), 3
            ),
            '0_recall': np.round(
                metrics.recall_score(groundtruth_subtag, preds_subtag, average='binary', pos_label=0), 3
            ),
            '1_f1_score': np.round(
                metrics.f1_score(groundtruth_subtag, preds_subtag, average='binary', pos_label=1), 3
            ),
            '0_f1_score': np.round(
                metrics.f1_score(groundtruth_subtag, preds_subtag, average='binary', pos_label=0), 3
            ),
            'hamming_loss': np.round(
                metrics.hamming_loss(groundtruth_subtag, preds_subtag), 3
            ),
            'zero_one_loss': np.round(
                metrics.zero_one_loss(groundtruth_subtag, preds_subtag), 3
            )
            
        }
        results_dict[subtags[j]] = results_subtag
        
    df_results = pd.DataFrame.from_dict(results_dict, orient='index')
    df_results.loc['mean'] = df_results.mean()
        
    return df_results
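The per-subtag metrics above can be checked by hand on a small column. The sketch below is a pure-Python illustration, not the scikit-learn implementation: `binary_column_scores` and the toy vectors are hypothetical. It shows two properties of this setup. For a rare tag that is never predicted, the class-0 scores stay high while the class-1 scores drop to 0, so the macro averages sit well below the class-0 values. And for a single 0/1 column, `hamming_loss` and `zero_one_loss` both reduce to the fraction of mismatched entries, so the two always coincide.

```python
def binary_column_scores(y_true, y_pred):
    """Per-class precision/recall and error rate for one 0/1 tag column."""

    def precision_recall(label):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        actual = sum(1 for t in y_true if t == label)
        # Undefined ratios are reported as 0.0, the value scikit-learn
        # also falls back to by default (with a warning).
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        return precision, recall

    p1, r1 = precision_recall(1)
    p0, r0 = precision_recall(0)
    # Fraction of entries where prediction and ground truth differ.
    error_rate = sum(1 for t, p in zip(y_true, y_pred) if t != p) / len(y_true)
    return {
        "1_precision": p1, "1_recall": r1,
        "0_precision": p0, "0_recall": r0,
        "macro_precision": (p1 + p0) / 2,
        "macro_recall": (r1 + r0) / 2,
        "error_rate": error_rate,
    }


# One positive entry out of four, and a model that never predicts the tag.
toy_scores = binary_column_scores([1, 0, 0, 0], [0, 0, 0, 0])
```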

In [None]:
subtags = sorted(list(multi.classes_))

In [None]:
df = assess_performance(preds_final, test.target.tolist(), subtags)

In [3]:
test = pd.read_csv("./test_results.csv")

In [4]:
test

| Unnamed: 0 | macro_precision | macro_recall | macro_f1_score | 1_precision | 0_precision | 1_recall | 0_recall | 1_f1_score | 0_f1_score | hamming_loss | zero_one_loss |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.499 | 0.5 | 0.5 | 0.0 | 0.998 | 0.0 | 1.0 | 0.0 | 0.999 | 0.002 | 0.002 |
| 1 | 0.496 | 0.492 | 0.483 | 0.091 | 0.902 | 0.197 | 0.788 | 0.124 | 0.841 | 0.269 | 0.269 |
| 2 | 0.664 | 0.704 | 0.679 | 0.414 | 0.915 | 0.543 | 0.865 | 0.47 | 0.889 | 0.184 | 0.184 |
| 3 | 0.498 | 0.499 | 0.499 | 0.0 | 0.997 | 0.0 | 0.998 | 0.0 | 0.997 | 0.006 | 0.006 |
| 4 | 0.555 | 0.555 | 0.555 | 0.166 | 0.944 | 0.168 | 0.943 | 0.167 | 0.944 | 0.106 | 0.106 |
| 5 | 0.572 | 0.543 | 0.552 | 0.187 | 0.957 | 0.109 | 0.976 | 0.138 | 0.967 | 0.064 | 0.064 |
| 6 | 0.525 | 0.521 | 0.523 | 0.099 | 0.95 | 0.084 | 0.958 | 0.091 | 0.954 | 0.087 | 0.087 |
| 7 | 0.601 | 0.605 | 0.603 | 0.414 | 0.787 | 0.444 | 0.766 | 0.428 | 0.777 | 0.321 | 0.321 |
| 8 | 0.517 | 0.558 | 0.52 | 0.046 | 0.987 | 0.166 | 0.951 | 0.072 | 0.969 | 0.061 | 0.061 |
| 9 | 0.555 | 0.574 | 0.558 | 0.246 | 0.864 | 0.364 | 0.784 | 0.293 | 0.822 | 0.284 | 0.284 |
