## Fine-Tuning on Training Dataset for topic "Health"

This notebook is meant to be run on a Google Colab instance with a GPU runtime, not locally. A T4, V100 or A100 should all be sufficient.

Thanks to Moritz Laurer for excellent templates and starting points for fine-tuning! See his [BERT fine-tuning notebook](https://github.com/MoritzLaurer/summer-school-transformers-2023/blob/main/3_tune_bert.ipynb).

#### Install relevant packages

In [1]:
!nvidia-smi

Thu Mar 21 14:48:39 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla V100-SXM2-16GB           Off | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P0              53W / 300W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [2]:
!pip install accelerate -U
!pip install transformers[sentencepiece]
!pip install datasets



In [3]:
## Load general packages
# some more specialised packages are loaded in each sub section
import pandas as pd
import numpy as np
from google.colab.data_table import DataTable
from sklearn.model_selection import train_test_split

In [4]:
# set random seed for reproducibility
SEED_GLOBAL = 42
np.random.seed(SEED_GLOBAL)

## Download data

In [5]:
df =  pd.read_excel("https://nextcloud.swp-berlin.org/s/REDACTED/download").dropna(subset = ['text'])

df["label_text"] = df["label"].apply(lambda x: 'int_sec' if x == 1 else 'not int_sec')

In [6]:
df

Unnamed: 0,text,label,id,label_text
0,"The General Assembly,",0,477681_1,not int_sec
1,Emphasizes once again the importance of the im...,0,535326_74,not int_sec
2,"The General Assembly,",0,284790_1,not int_sec
3,Aware that development is difficult under occu...,1,588959_7,int_sec
4,Recognizes the desirability of making availabl...,0,587898_35,not int_sec
...,...,...,...,...
656,"Recalling the New Urban Agenda, adopted at the...",0,1483771_6,not int_sec
657,Requests the administering Powers to continue ...,0,816483_38,not int_sec
658,"Encourages Member States, relevant organizatio...",0,1642600_48,not int_sec
659,Wishing to promote cooperation between the Uni...,0,853780_2,not int_sec


In [7]:
df = df.drop_duplicates(subset='text', keep='first')
# drop rows without a label before casting, since NaN cannot be cast to int
df = df.dropna(subset=['label'])
df = df.astype({'label': 'int'})
df

Unnamed: 0,text,label,id,label_text
0,"The General Assembly,",0,477681_1,not int_sec
1,Emphasizes once again the importance of the im...,0,535326_74,not int_sec
3,Aware that development is difficult under occu...,1,588959_7,int_sec
4,Recognizes the desirability of making availabl...,0,587898_35,not int_sec
5,"Requests the Secretary-General, with the assis...",1,816528_28,int_sec
...,...,...,...,...
656,"Recalling the New Urban Agenda, adopted at the...",0,1483771_6,not int_sec
657,Requests the administering Powers to continue ...,0,816483_38,not int_sec
658,"Encourages Member States, relevant organizatio...",0,1642600_48,not int_sec
659,Wishing to promote cooperation between the Uni...,0,853780_2,not int_sec


In [8]:
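df_train, df_test = train_test_split(df, test_size=0.2, random_state=SEED_GLOBAL)

# keep only the columns needed downstream; copy so that later column assignments do not act on a view
df_train = df_train[["label", "label_text", "text", "id"]].copy()

df_test = df_test[["label", "label_text", "text", "id"]].copy()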

In [9]:
# sanity check: count test texts that also occur in the training split (expected: 0)
df_test['text_exists_in_train'] = df_test['text'].isin(df_train['text'])
df_test["text_exists_in_train"].sum()

0

In [10]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
import torch

## load a model and its tokenizer
model_name = "microsoft/deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True, model_max_length=512)


# link the numeric labels to the label texts
label_text = np.sort(df_test.label_text.unique()).tolist()
# Get unique label_text and corresponding label values
unique_labels = df[['label', 'label_text']].drop_duplicates()

label2id = dict(zip(unique_labels['label_text'], unique_labels['label']))
id2label = dict(zip(unique_labels['label'], unique_labels['label_text']))


config = AutoConfig.from_pretrained(model_name, label2id=label2id, id2label=id2label, num_labels=len(label2id))
#config.hidden_dropout_prob = 0.01
# load model with config
model = AutoModelForSequenceClassification.from_pretrained(model_name, config=config, ignore_mismatched_sizes=True)

# use GPU (cuda) if available, otherwise use CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")
model.to(device);

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Device: cuda
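
As a quick illustration of the label mapping built in the cell above, the sketch below reproduces it in plain Python. The two `(label, label_text)` pairs follow from the lambda in cell 5; the round-trip check is an addition for illustration and is not part of the original notebook.

```python
# (label, label_text) pairs, matching the lambda applied in cell 5:
# 1 -> 'int_sec', anything else -> 'not int_sec'
pairs = [(0, "not int_sec"), (1, "int_sec")]

label2id = {text: idx for idx, text in pairs}
id2label = {idx: text for idx, text in pairs}

# illustrative sanity check: the two mappings should be inverses of each other
round_trip_ok = all(id2label[label2id[text]] == text for text in label2id)

print(label2id)
print(id2label)
print(round_trip_ok)
```

If the round trip fails on real data, a label text is probably attached to more than one numeric label.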


## Setting training arguments / hyperparameters

In [11]:
### Function to calculate metrics
from sklearn.metrics import balanced_accuracy_score, precision_recall_fscore_support, accuracy_score, classification_report
from transformers import TrainingArguments, Trainer, logging

import warnings

def compute_metrics_standard(eval_pred):
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore")

        labels = eval_pred.label_ids
        pred_logits = eval_pred.predictions
        preds_max = np.argmax(pred_logits, axis=1)  # argmax on each row (axis=1) in the tensor

        # metrics
        precision_macro, recall_macro, f1_macro, _ = precision_recall_fscore_support(labels, preds_max, average='macro')  # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
        acc_balanced = balanced_accuracy_score(labels, preds_max)

        metrics = {
            'f1_macro': f1_macro,
            'accuracy_balanced': acc_balanced,
            'precision_macro': precision_macro,
            'recall_macro': recall_macro
        }

        return metrics
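
To make the macro-averaged metrics more concrete, here is a minimal dependency-free sketch of what they compute. The helper `macro_scores` is hypothetical and only meant to mirror the idea behind the scikit-learn calls above, not to replace them. The 64/48 class split is an assumption, chosen because it reproduces the values logged for the first three epochs further below; it is not a figure stated in the notebook.

```python
def macro_scores(labels, preds):
    """Unweighted (macro) mean of per-class precision, recall and F1."""
    classes = sorted(set(labels) | set(preds))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        # undefined ratios are counted as 0
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return {
        "f1_macro": sum(f1s) / n,
        "precision_macro": sum(precisions) / n,
        "recall_macro": sum(recalls) / n,
    }

# assumed evaluation set: 64 negative and 48 positive examples
labels = [0] * 64 + [1] * 48
# a model that always predicts the majority class
preds = [0] * 112

scores = macro_scores(labels, preds)
print({name: round(value, 6) for name, value in scores.items()})
# {'f1_macro': 0.363636, 'precision_macro': 0.285714, 'recall_macro': 0.5}
```

Balanced accuracy is the mean per-class recall, which is why `accuracy_balanced` and `recall_macro` coincide in the training log. A macro F1 of about 0.36 together with a balanced accuracy of 0.5 therefore suggests that the model is still predicting a single class.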


In [12]:
import datasets
dataset = datasets.DatasetDict({
    "train": datasets.Dataset.from_pandas(df_train),
    "test": datasets.Dataset.from_pandas(df_test)
})
print(df_train['text'].iloc[0])

# tokenize
def tokenize(examples):
  return tokenizer(examples["text"], truncation=True, max_length=512)

dataset["train"] = dataset["train"].map(tokenize, batched=True)
dataset["test"] = dataset["test"].map(tokenize, batched=True)
dataset = dataset.remove_columns(['label_text'])


Commends the Special Rapporteur for the activities undertaken so far, for the catalytic role that she plays in raising the level of awareness about the plight of internally displaced persons and for her ongoing efforts to address their development and other specific needs, including through the mainstreaming of the human rights of internally displaced persons into all relevant parts of the United Nations system;


Map:   0%|          | 0/448 [00:00<?, ? examples/s]

Map:   0%|          | 0/112 [00:00<?, ? examples/s]

## Fine-tuning and evaluation

Let's start fine-tuning the model!

If you get an 'out-of-memory' error, reduce 'per_device_train_batch_size' (set through 'BATCH_SIZE' in the training cell below), for example from 8 to 4, and restart the runtime (menu at the top left: 'Runtime' > 'Restart runtime'). If you rerun the script without restarting the runtime, the 'out-of-memory' error will probably not go away.
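
When lowering the batch size, one option is to raise 'gradient_accumulation_steps' by the same factor so that the effective batch size stays the same. This is a suggestion rather than something the original notebook prescribes, and the helper `accumulation_steps_for` is a hypothetical name used for illustration. The sketch assumes a single GPU; both configurations in the training cell below (8 x 4 for the base model, 2 x 16 for the large model) give an effective batch size of 32.

```python
def accumulation_steps_for(effective_batch_size, per_device_batch_size):
    """Gradient accumulation steps needed to keep the effective batch size fixed."""
    if effective_batch_size % per_device_batch_size != 0:
        raise ValueError("per-device batch size must divide the effective batch size")
    return effective_batch_size // per_device_batch_size

# base configuration from the notebook: 8 examples per step, 4 accumulation steps
effective = 8 * 4

for batch_size in (8, 4, 2):
    steps = accumulation_steps_for(effective, batch_size)
    print(f"per_device_train_batch_size={batch_size} -> gradient_accumulation_steps={steps}")
```

Keeping the effective batch size fixed also leaves the number of optimizer steps unchanged, so the scheduler settings would not need to be adjusted.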

In [13]:
import numpy as np
import json
import os
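# AdamW from transformers is deprecated (and no longer available in recent releases); use the PyTorch implementation
from torch.optim import AdamW
from transformers import get_cosine_with_hard_restarts_schedule_with_warmup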
from transformers import set_seed

set_seed(42)

## TODO: push to the Hugging Face Hub with the SWP account
## TODO: try a constant learning rate schedule instead of a linear one, and train for at least 30 epochs
## on the effect of batch size: https://medium.com/geekculture/how-does-batch-size-impact-your-model-learning-2dd34d9fb1fa
## DeBERTaV3 paper: https://arxiv.org/pdf/2111.09543.pdf

if model_name == "microsoft/deberta-v3-large":
  gradient_accumulation_steps = 16
  LEARNING_RATE = 3e-5  # can try: 9e-6
  EPOCHS = 30  # can try: 10
  BATCH_SIZE = 2  # can try: 10
  folder = "deberta_large"
else:
  LEARNING_RATE = 8e-5  # can try: 6e-5
  EPOCHS = 40  # can try: 10
  BATCH_SIZE = 8  # can try: 10
  folder = "deberta_base"
  gradient_accumulation_steps = 4


optimizer = AdamW(model.parameters(),
                  lr = LEARNING_RATE,
                  weight_decay=0.01)

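# NOTE: 560 is the total number of optimizer steps for the base configuration
# (448 training examples / batch size 8 / 4 accumulation steps = 14 steps per epoch, x 40 epochs);
# recompute num_training_steps if any of these settings change
scheduler = get_cosine_with_hard_restarts_schedule_with_warmup(optimizer,
                                                               num_warmup_steps = 50,
                                                               num_training_steps = 560,
                                                               num_cycles = 2)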

train_args = TrainingArguments(
    output_dir = f"./output2/",
    logging_dir=f'./logs/logs',
    gradient_accumulation_steps=gradient_accumulation_steps,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",
    evaluation_strategy="epoch", # options: "no"/"steps"/"epoch"
    save_strategy = "epoch",  # options: "no"/"steps"/"epoch",
    save_total_limit = 2,
    report_to="all",  # "all"  # logging
    )

# columns that the model's forward() does not accept (e.g. text, id) are dropped by the Trainer automatically
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=train_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    optimizers=(optimizer, scheduler),
    compute_metrics=compute_metrics_standard
)

trainer.train(resume_from_checkpoint = False)
eval_metrics = trainer.evaluate()



Epoch,Training Loss,Validation Loss,F1 Macro,Accuracy Balanced,Precision Macro,Recall Macro
1,No log,0.683742,0.363636,0.5,0.285714,0.5
2,No log,0.681269,0.363636,0.5,0.285714,0.5
3,No log,0.654997,0.363636,0.5,0.285714,0.5
4,No log,0.516507,0.72999,0.760417,0.784495,0.760417
5,No log,0.410697,0.85641,0.864583,0.857143,0.864583
6,No log,0.443999,0.838462,0.846354,0.839286,0.846354
7,No log,0.442458,0.836735,0.838542,0.835484,0.838542
8,No log,0.421745,0.864767,0.869792,0.863287,0.869792
9,No log,0.784515,0.803509,0.822917,0.826746,0.822917
10,No log,0.512857,0.872396,0.872396,0.872396,0.872396


Checkpoint destination directory ./output2/checkpoint-560 already exists and is non-empty. Saving will proceed but saved results may be invalid.
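
The checkpoint name `checkpoint-560` matches the hard-coded `num_training_steps = 560` in the scheduler. The sketch below shows where that number comes from and how the learning rate evolves over those steps. It assumes a single GPU and the base configuration; the 448 training examples are taken from the `Map` progress bar above. The function `lr_multiplier` is a reimplementation for illustration, based on my understanding of the cosine-with-hard-restarts schedule, and should not be read as the library's own code.

```python
import math

# values from the notebook's base configuration
n_train = 448          # training examples (train split size shown in the Map progress bar)
batch_size = 8
grad_accum = 4
epochs = 40
num_warmup_steps = 50
num_cycles = 2
peak_lr = 8e-5

batches_per_epoch = math.ceil(n_train / batch_size)   # 56 batches
updates_per_epoch = batches_per_epoch // grad_accum   # 14 optimizer steps
num_training_steps = updates_per_epoch * epochs       # 560 optimizer steps

def lr_multiplier(step):
    """Factor applied to the peak learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # linear warmup from 0 to 1
        return step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    if progress >= 1.0:
        return 0.0
    # cosine decay that jumps back to 1.0 at the start of each cycle
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * ((num_cycles * progress) % 1.0))))

print(num_training_steps)
for step in (0, 25, 50, 304, 305, 560):
    print(step, f"{peak_lr * lr_multiplier(step):.2e}")
```

With 14 optimizer steps per epoch, the 50 warmup steps span roughly the first three and a half epochs, which may be one reason the metrics stay flat during epochs 1 to 3. The hard restart happens at step 305, where the learning rate jumps from almost zero back to its peak.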


In [14]:
trainer.evaluate()

{'eval_loss': 0.4970652163028717,
 'eval_f1_macro': 0.8911564625850339,
 'eval_accuracy_balanced': 0.8932291666666667,
 'eval_precision_macro': 0.8896774193548387,
 'eval_recall_macro': 0.8932291666666667,
 'eval_runtime': 0.4242,
 'eval_samples_per_second': 264.048,
 'eval_steps_per_second': 33.006,
 'epoch': 40.0}

In [15]:
# guard cell: stops 'Run all' here so that the save steps below only run when intended
run_below = False
assert run_below

AssertionError: 

In [16]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [17]:
### save best model to google drive
model_name_custom = "deberta-base-int_sec_final_20240321_8911"
model_custom_path = "/content/drive/MyDrive/unga_int_sec/" + model_name_custom

trainer.save_model(output_dir=model_custom_path)

In [18]:
df_test.to_csv("/content/drive/MyDrive/unga_int_sec/"+model_name_custom+'_test_data.csv')