# Train our Baseline Model (DeBERTa)

## Overview

In this notebook, we train a classifier based on the [DeBERTa small](https://huggingface.co/microsoft/deberta-v3-small/tree/main) model. It applies the material from Bootcamp day 2, `04ModelTraining.ipynb`.

## Key Steps and Objectives

1. **Classifier Training**: We fine-tune the pretrained DeBERTa small model to classify tweets as "raining" or "not raining".

2. **Results Visualization**: As fractions of all test samples, the model correctly predicts rain in 58% of cases and incorrectly predicts rain when it is not raining in 22%. It predicts no rain far less often: 12% of cases are true negatives and 7.9% are false negatives. We visualize the model's performance by generating and examining the following:

    - **Confusion Matrix**: A matrix that provides insights into the classifier's ability to correctly classify instances.
    
    - **ROC Curve**: A Receiver Operating Characteristic curve that illustrates the classifier's performance across different threshold values.
    
    - **Classifier Certainty**: We assess the certainty of the classifier's predictions, offering insights into its level of confidence in its decisions.


This notebook helps us understand how well the DeBERTa-based classifier performs on our task and provides valuable insights through visualizations.
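
To make the percentages in the results summary concrete, the following minimal sketch computes each confusion-matrix cell as a fraction of all samples. The helper `confusion_shares` and the toy labels are invented for illustration; they are not the notebook's data or its plotting code.

```python
from collections import Counter


def confusion_shares(truth, predictions):
    """Return each confusion-matrix cell as a fraction of all samples."""
    counts = Counter(zip(truth, predictions))
    total = len(truth)
    return {
        "true_positive": counts[(True, True)] / total,
        "false_positive": counts[(False, True)] / total,
        "true_negative": counts[(False, False)] / total,
        "false_negative": counts[(True, False)] / total,
    }


# Toy labels, invented for illustration only (not the notebook's results)
truth = [True, True, True, False, False, True, False, True, True, False]
predictions = [True, True, True, True, False, False, True, True, True, True]

shares = confusion_shares(truth, predictions)
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

Because the four shares are taken over all samples, they sum to one. Per-class rates such as recall divide by the number of samples in that class instead, so they generally differ from these shares.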


In [None]:
!jupyter kernelspec list

# Library Imports and Directory Setup

This cell imports the required libraries and adds the project's script directories to `sys.path` so that the custom plotting and training modules can be imported.

Replace the relative directory paths (`../scripts/plotting`, `../scripts`) with the paths used in your project.


In [None]:
import functools
import os
import sys

import datasets
import transformers
import xarray as xr
from sklearn.model_selection import train_test_split

sys.path.append("../scripts/plotting")
import test_training_distribution

sys.path.append("../scripts")
from transformer_trainer import get_trainer

# Dataset Loading and Label Definition

This cell loads the tweet dataset from a NetCDF file and defines the binary label `raining`: a tweet is labeled as raining if its precipitation value (`tp_h`) exceeds `1e-8`.

In [None]:
ds_raw = xr.open_dataset("/p/project/deepacf/maelstrom/haque1/dataset/tweets_2017_01_era5_normed_filtered.nc")
# Define the binary label: "raining" if precipitation exceeds a small threshold
key_tp = "tp_h"
ds_raw["raining"] = (["index"], ds_raw[key_tp].values > 1e-8)

# Label Definition and Data Splitting

In this section of the code, labels are extracted from the loaded dataset, and the dataset is split into training and testing sets while maintaining stratified sampling based on the labels.


In [None]:
labels = ds_raw["raining"]
indices_train, indices_test = train_test_split(ds_raw.index, test_size=0.20, stratify=labels)
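
The effect of `stratify=labels` can be illustrated with a small hand-rolled split. This is a simplified sketch of the idea, not scikit-learn's implementation; the function `stratified_split`, the seed, and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Split indices so each class keeps roughly its share in both subsets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy labels: 70% "raining", 30% "not raining" (invented for illustration)
labels = [True] * 70 + [False] * 30
train_idx, test_idx = stratified_split(labels)

share_train = sum(labels[i] for i in train_idx) / len(train_idx)
share_test = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"raining share: train={share_train:.2f}, test={share_test:.2f}")
```

Both subsets keep the 70/30 class ratio of the full toy set, which is the property the stratified split in the cell above is meant to provide for the `raining` label.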

# Pretrained Tokenizer and Dataset Preparation

This section of the code loads a pretrained tokenizer and sets up functions for tokenization. It also prepares the dataset for training and testing.


In [None]:
# Load the pretrained tokenizer and model configuration
model_nm = "/p/project/deepacf/maelstrom/haque1/deberta-v3-small"  # Path to model
tokenizer = transformers.AutoTokenizer.from_pretrained(model_nm)
db_config_base = transformers.AutoConfig.from_pretrained(model_nm, num_labels=2)


# Define function to tokenize the field 'inputs' stored in x
def tok_func(x, tokenizer):
    return tokenizer(x["inputs"], padding=True, truncation=True, max_length=512)


# Function to convert the dataset to a format used by Hugging Face
def get_dataset(ds, tok_func, tokenizer, indices_train, indices_test, train=True):
    df = ds[["text_normalized", "raining"]].to_pandas()
    df = df.rename(columns={"text_normalized": "inputs", "raining": "labels"})
    datasets_ds = datasets.Dataset.from_pandas(df)
    tok_function_partial = functools.partial(tok_func, tokenizer=tokenizer)
    tok_ds = datasets_ds.map(tok_function_partial, batched=True)
    if train:
        return datasets.DatasetDict({"train": tok_ds.select(indices_train), "test": tok_ds.select(indices_test)})
    else:
        return tok_ds

In [None]:
dataset = get_dataset(ds_raw, tok_func, tokenizer, indices_train, indices_test)
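
The `padding=True, truncation=True, max_length=512` arguments in `tok_func` can be pictured with the toy sketch below. It works on already-tokenized ID lists and ignores special tokens, so it is a simplification of what a real tokenizer does; `pad_and_truncate` and `pad_id=0` are hypothetical names and values chosen for illustration.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one."""
    truncated = [ids[:max_length] for ids in token_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks real tokens, 0 marks padding the model should ignore
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Three toy sequences of different lengths, with a small max_length
batch = pad_and_truncate([[5, 6], [7, 8, 9, 10, 11], [12]], max_length=4)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Since `map` is called with `batched=True`, padding to the longest sequence happens within each mapped batch rather than across the whole dataset.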

# Output Folder Definition

This cell defines the path to the folder where the model outputs are stored.

In [None]:
FOLDER_TO_OUTPUT = "./models"

# Project Parameters and Output Folder Creation

This cell defines the training hyperparameters and creates the output folder if it does not exist yet.


In [None]:
parameters = {
    "learning_rate": 8e-5,
    "batch_size": 16,
    "weight_decay": 0.01,
    "epochs": 1,
    "warmup_ratio": 0.1,
    "cls_dropout": 0.3,
    "lr_scheduler_type": "cosine",
}

os.makedirs(FOLDER_TO_OUTPUT, exist_ok=True)
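
To give a feel for `learning_rate`, `warmup_ratio`, and `lr_scheduler_type: "cosine"`, the sketch below computes a learning rate per step using the common linear-warmup-then-cosine-decay formula. How `get_trainer` actually applies these parameters is not shown in this notebook, so treat this as an assumption; `lr_at_step` and `total_steps = 1000` are hypothetical.

```python
import math


def lr_at_step(step, total_steps, base_lr=8e-5, warmup_ratio=0.1):
    """Linear warmup to base_lr, followed by cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical number of optimizer steps
schedule = [lr_at_step(step, total_steps) for step in range(total_steps + 1)]

for step in (0, 50, 100, 550, 1000):
    print(f"step {step:4d}: lr = {schedule[step]:.2e}")
```

Under these assumptions the learning rate rises linearly during the first 10% of steps, peaks at `8e-5`, and then decays smoothly to zero by the end of training.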

# Model Training Initialization and Execution

In this section of the code, a trainer for the machine learning model is initialized, and the training process is started.


In [None]:
trainer = get_trainer(dataset, db_config_base, model_nm, FOLDER_TO_OUTPUT, parameters)

# Start training
trainer.train()

# Test Dataset Preparation and Dataset Selection

This cell prepares the test dataset in the format expected by the Hugging Face Transformers library. It also selects the matching subset of the original xarray dataset, which is used for evaluation and analysis.


In [None]:
# this is the test dataset in the format expected by Hugging Face
test_ds = get_dataset(
    ds_raw.sel(index=indices_test),
    tok_func,
    tokenizer,
    indices_train,
    indices_test,
    train=False,  # not training anymore
)
# this is a selection of our xarray dataset that corresponds to the tweets that are part of the test set
ds_test = ds_raw.sel(index=indices_test)

# Model Prediction and Evaluation

This cell uses the trained model to make predictions on the test dataset and then generates a classification report, a ROC curve, and a prediction check.


In [None]:
import sys

import torch

sys.path.append("../bootcamp/AP2/scripts")
import plotting

# Convert the model's logits to class probabilities (softmax over the class axis)
logits = torch.tensor(trainer.predict(test_ds).predictions)
preds = torch.nn.functional.softmax(logits, dim=-1).numpy()
prediction_probability = preds[:, 1]  # probability of the "raining" class
predictions = preds.argmax(axis=-1)
truth = ds_test.raining.values
plotting.analysis.classification_report(labels=truth, predictions=predictions)
plotting.analysis.plot_roc(truth=truth, prediction_probability=prediction_probability)
# NOTE: assumes check_prediction lives in plotting.analysis, like the two calls above
plotting.analysis.check_prediction(truth=truth, prediction=predictions);
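
The step from raw model outputs to probabilities and class predictions in the cell above can be traced by hand with the pure-Python sketch below. The logits are invented for illustration, and the helper `softmax` is a plain reimplementation of the standard formula, not the PyTorch function.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Toy logits for three tweets, columns = [not raining, raining] (invented)
logits = [[2.0, 0.5], [-1.0, 1.0], [0.0, 0.0]]

probs = [softmax(row) for row in logits]
rain_probability = [row[1] for row in probs]
# Index of the most probable class; ties resolve to the first index
predicted_class = [max(range(len(row)), key=row.__getitem__) for row in probs]

print(rain_probability)
print(predicted_class)
```

The probability of the "raining" class is what a ROC curve is built from, while the argmax gives the hard predictions used in a classification report.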