## Fine-Tuning on Training Dataset for topic "Health"

Meant to be run on a Google Colab instance, not locally. A T4, V100 or A100 GPU should be sufficient.

Thanks to Moritz Laurer for excellent templates and starting points for fine-tuning! See [here](https://github.com/MoritzLaurer/summer-school-transformers-2023/blob/main/3_tune_bert.ipynb)

#### Install relevant packages

In [None]:
!pip install accelerate -U
!pip install transformers[sentencepiece]
!pip install datasets




In [None]:
## Load general packages
# some more specialised packages are loaded in each sub section
import pandas as pd
import numpy as np
from google.colab.data_table import DataTable
from sklearn.model_selection import train_test_split

In [None]:
# set random seed for reproducibility
SEED_GLOBAL = 42
np.random.seed(SEED_GLOBAL)

## Download data

In [None]:
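# load the labelled data and drop rows without text
df = pd.read_excel("https://nextcloud.swp-berlin.org/s/redacted/download").dropna(subset=['text'])

# human-readable label: 1 -> 'health', anything else -> 'not health'
df["label_text"] = df["label"].apply(lambda x: 'health' if x == 1 else 'not health')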



In [None]:
df

Unnamed: 0,text,label,id,label_text
834,Commends the Secretary-General for supporting ...,0,284603_21,not health
592,Emphasizing that a favourable national and int...,0,615084_6,not health
1301,Requests the Secretary-General to submit a rep...,0,822138_70,not health
1002,Noting with concern the occurrence of nuclear ...,0,3895663_9,not health
816,"Also notes that the Committee on Fisheries, at...",1,3952733_182,health
...,...,...,...,...
365,Noting also that the globalization process has...,0,509303_10,not health
183,Stresses the need for close consultation betwe...,0,615084_38,not health
108,Calls upon States and/or the relevant funds an...,1,3896389_36,health
1156,Recognizes that microfinance has experienced t...,0,644748_17,not health


In [None]:
df = df.drop_duplicates(subset='text', keep='first')
df

Unnamed: 0,text,label,id,label_text
834,Commends the Secretary-General for supporting ...,0,284603_21,not health
592,Emphasizing that a favourable national and int...,0,615084_6,not health
1301,Requests the Secretary-General to submit a rep...,0,822138_70,not health
1002,Noting with concern the occurrence of nuclear ...,0,3895663_9,not health
816,"Also notes that the Committee on Fisheries, at...",1,3952733_182,health
...,...,...,...,...
365,Noting also that the globalization process has...,0,509303_10,not health
183,Stresses the need for close consultation betwe...,0,615084_38,not health
108,Calls upon States and/or the relevant funds an...,1,3896389_36,health
1156,Recognizes that microfinance has experienced t...,0,644748_17,not health


In [None]:


# 80/20 train/test split, seeded for reproducibility
df_train, df_test = train_test_split(df, test_size=0.2, random_state=SEED_GLOBAL)

df_train = df_train[["label", "label_text", "text","id"]]

df_test = df_test[["label", "label_text", "text","id"]]
df_test.head()

Unnamed: 0,label,label_text,text,id
918,0,not health,Calls once again for the implementation of a d...,3840121_70
705,1,health,(i) To take all measures necessary to ensure t...,1468215_81
382,0,not health,Also appreciates the initiatives of the Econom...,644435_24
510,1,health,Adopts the political declaration on HIV and AI...,833719_2
522,1,health,Urges health workers by promoting training in ...,673963_43


In [None]:
# sanity check: count test texts that also appear in the training set (should be 0)
df_test['text_exists_in_train'] = df_test['text'].isin(df_train['text'])
df_test["text_exists_in_train"].sum()

0

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
import torch

## load a model and its tokenizer
model_name = "microsoft/deberta-v3-base"  # alternative: "microsoft/deberta-v3-large"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True, model_max_length=512)


# link the numeric labels to the label texts
# (one row per distinct label / label_text pair)
unique_labels = df[['label', 'label_text']].drop_duplicates()

label2id = dict(zip(unique_labels['label_text'], unique_labels['label']))
id2label = dict(zip(unique_labels['label'], unique_labels['label_text']))


config = AutoConfig.from_pretrained(model_name, label2id=label2id, id2label=id2label, num_labels=len(label2id))
config.hidden_dropout_prob = 0.02
# load model with config
model = AutoModelForSequenceClassification.from_pretrained(model_name, config=config, ignore_mismatched_sizes=True)

# use GPU (cuda) if available, otherwise use CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")
model.to(device);

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Device: cuda
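
The classification head emits one logit per class, indexed from 0, so the two mappings passed to `AutoConfig` need to be exact inverses with consecutive integer ids. A minimal sketch of that check for the two classes used in this notebook; `check_label_maps` is a hypothetical helper written for illustration, not part of `transformers`.

```python
# Label scheme used in this notebook: 1 -> "health", 0 -> "not health"
label2id = {"not health": 0, "health": 1}
id2label = {idx: name for name, idx in label2id.items()}


def check_label_maps(label2id, id2label):
    """Hypothetical helper: verify the mappings are inverses with ids 0..n-1."""
    # ids must line up with the positions of the logits
    if sorted(id2label) != list(range(len(label2id))):
        raise ValueError("ids must be the consecutive integers 0..n-1")
    # every label must map to an id that maps back to the same label
    for name, idx in label2id.items():
        if id2label.get(idx) != name:
            raise ValueError(f"mismatch for label {name!r}")
    return True


print(check_label_maps(label2id, id2label))  # True
print(id2label[1])  # health
```

Running a check like this before building the config catches a swapped or missing label early, before any training time is spent.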


## Setting training arguments / hyperparameters

In [None]:
### Function to calculate metrics
from sklearn.metrics import balanced_accuracy_score, precision_recall_fscore_support, accuracy_score, classification_report
from transformers import TrainingArguments, Trainer, logging

import warnings

def compute_metrics_standard(eval_pred):
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore")

        labels = eval_pred.label_ids
        pred_logits = eval_pred.predictions
        preds_max = np.argmax(pred_logits, axis=1)  # argmax on each row (axis=1) in the tensor

        # metrics
        precision_macro, recall_macro, f1_macro, _ = precision_recall_fscore_support(labels, preds_max, average='macro')  # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
        acc_balanced = balanced_accuracy_score(labels, preds_max)

        metrics = {
            'f1_macro': f1_macro,
            'accuracy_balanced': acc_balanced,
            'precision_macro': precision_macro,
            'recall_macro': recall_macro
        }

        return metrics
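
To make the reported numbers easier to read, here is a dependency-free sketch of what `compute_metrics_standard` calculates, applied to six made-up examples. The logits and labels are invented for illustration, and the helper functions are hypothetical re-implementations rather than the scikit-learn code. Balanced accuracy is computed as the mean recall over the classes that occur in the labels, which is why it coincides with macro recall in the training log whenever both classes are present.

```python
def argmax_rows(logits):
    """Index of the largest value in each row (first index wins on ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def prf_for_class(labels, preds, cls):
    """Precision, recall and F1 for one class, with 0.0 for empty denominators."""
    tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
    fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
    fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_metrics(labels, preds):
    """Unweighted mean of the per-class scores, so each class counts equally."""
    classes = sorted(set(labels) | set(preds))
    scores = {c: prf_for_class(labels, preds, c) for c in classes}
    n = len(classes)
    present = sorted(set(labels))  # balanced accuracy only averages over true classes
    return {
        "f1_macro": sum(s[2] for s in scores.values()) / n,
        "accuracy_balanced": sum(scores[c][1] for c in present) / len(present),
        "precision_macro": sum(s[0] for s in scores.values()) / n,
        "recall_macro": sum(s[1] for s in scores.values()) / n,
    }


# Invented toy data: four "not health" (0) and two "health" (1) examples
toy_logits = [[2.0, -1.0], [1.5, 0.2], [0.9, 0.1], [3.0, -2.0], [0.6, 0.4], [-1.0, 2.5]]
toy_labels = [0, 0, 0, 0, 1, 1]

toy_preds = argmax_rows(toy_logits)
toy_metrics = macro_metrics(toy_labels, toy_preds)
print(toy_preds)
print({name: round(value, 3) for name, value in toy_metrics.items()})
```

Plain accuracy on this toy data would be 5/6, while balanced accuracy is lower because one of the two minority-class examples is missed. That gap is the reason macro-averaged metrics are the more informative choice for an imbalanced label such as "health".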


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import datasets
dataset = datasets.DatasetDict({
    "train": datasets.Dataset.from_pandas(df_train),
    "test": datasets.Dataset.from_pandas(df_test)
})
print(df_train['text'].iloc[0])

# tokenize
def tokenize(examples):
  return tokenizer(examples["text"], truncation=True, max_length=512)  # max_length can be reduced to e.g. 256 to increase speed, but long texts will be cut off

dataset["train"] = dataset["train"].map(tokenize, batched=True)
dataset["test"] = dataset["test"].map(tokenize, batched=True)
dataset = dataset.remove_columns(['label_text'])


Recognizes and calls for further broadening and strengthening of efforts by many African countries to enhance progress in areas covering economic reforms, including the putting in place of sound macroeconomic policies, promotion of the private sector, enhancement of the democratization process and strengthening of civil society and participatory, transparent and accountable governance and the rule of law, as well as increased attention to the human dimension, especially education, gender, population, health and south-south cooperation;



## Fine-tuning and evaluation

Let's start fine-tuning the model!

If you get an 'out-of-memory' error, reduce `BATCH_SIZE` (which sets 'per_device_train_batch_size' in the TrainingArguments below) to 8 or 4 and restart the runtime (menu at the top left: 'Runtime' > 'Restart runtime'), then rerun the entire notebook. Without a restart, the 'out-of-memory' error will probably not go away.
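
Lowering the per-device batch size also changes how many examples feed each optimizer step, unless `gradient_accumulation_steps` is raised to compensate. The sketch below shows the arithmetic for a single GPU; both helper functions are hypothetical and only illustrate the relationship, they are not part of `transformers`.

```python
def effective_batch_size(per_device_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Number of examples that contribute to one optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def accumulation_steps_for(target_effective: int, per_device_batch_size: int) -> int:
    """Accumulation steps needed to keep the effective batch size on one device."""
    if target_effective % per_device_batch_size != 0:
        raise ValueError("target must be a multiple of the per-device batch size")
    return target_effective // per_device_batch_size


# Settings used for deberta-v3-base in this notebook: 8 examples x 4 steps
base = effective_batch_size(per_device_batch_size=8, gradient_accumulation_steps=4)
print(base)  # 32

# After an out-of-memory error: halve the per-device batch size, double the steps
print(accumulation_steps_for(base, per_device_batch_size=4))  # 8
```

Keeping the effective batch size constant this way trades memory for training time, so a learning rate tuned for the original setting is more likely to remain appropriate.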

In [None]:
import numpy as np
import json
import os

## TODO: push to the Hugging Face Hub with the SWP account
## TODO: try a constant learning-rate scheduler instead of linear, and train for at least 30 epochs
## on the effect of batch size: https://medium.com/geekculture/how-does-batch-size-impact-your-model-learning-2dd34d9fb1fa
## DeBERTaV3 paper: https://arxiv.org/pdf/2111.09543.pdf

if model_name == "microsoft/deberta-v3-large":
  gradient_accumulation_steps = 16
  LEARNING_RATE = 8e-6  # can try: 9e-6
  EPOCHS = 30  # can try: 10
  BATCH_SIZE = 4  # can try: 10
  folder = "deberta_large"
else:
  LEARNING_RATE = 6e-5  # can try: 6e-5
  EPOCHS = 30  # can try: 10
  BATCH_SIZE = 8  # can try: 10
  folder = "deberta_base"
  gradient_accumulation_steps = 4

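  ## log of earlier runs, apparently: <model> <best score> with <learning rate>, <batch size>, <epochs>, <warm-up>
  ## base 0.947 with 6e-5, 4, 20, no warm-up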
  ## base 0.95 with 6e-5, 32, 20, no warm-up
  ## base 0.944 with 6e-5, 64, 20, no warm-up
  ## base 0.942 with 2e-5, 64, 20, no warm-up
  ## base 0.959 with 3e-5, 64, 20, no warm-up
  ## base 0.955952 with 3.5e-5, 64, 20, no warm-up

  ## base 0.955952 with 5e-5, 8, 20, no warm-up
  ## base 0.925 with 1e-5, 8, 20, no warm-up
  ## base 0.948 with 3e-5, 8, 20, no warm-up
  ## base 0.94 with 3e-5, 8, 10,  warm-up 0.06
  ## base 0.931 with 2.5e-5, 8, 10,  warm-up 0.06
  ## base 0.945 with 3.5e-5, 8, 10,  warm-up 0.06
  ## base 0.919 with 3.5e-5, 16, 10,  warm-up 0.06
  ## base 0.929 with 3.5e-5, 32, 10,  warm-up 0.06
  ## base 0.957 with 6e-5, 32, 10,  warm-up 0.06
  ################################################ base 0.973 with 6e-5, 32, 30,  warm-up 0.06
  ################################################ base 0.956 with 6e-5, 32, 30,  warm-up 0.06 (hidden dropout 0.15)



  ## large 0.966503 with 8e-6, 4, 30, no warm-up
  ## large 0.44 with 1e-5, 64, 30, no warm-up
  ## large 0.957 with 5e-5, 64, 30, no warm-up
  ## large 0.91 with 1e-5, 16, 30, no warm-up
  ## large 0.947483 with 5e-5, 4, 30 no warm-up
  ## large 0.955 with 3e-5, 4, 30 warm-up 0.06
  ## large 0.955 with 1e-5, 4, 30 warm-up 0.06

  ## large 0.940 with 5e-5, 64, 30 warm-up 0.06
  ## large 0.88 with 1e-5, 64, 30 warm-up 0.06
  ## large 0.886 with 9e-6, 64, 30 warm-up 0.06
  ## large 0.899 with 1e-5, 64, 30 no warm-up
  ## large 0.929977 with 5e-5, 64, 30 no warm-up
  ## large 0.935 with 3e-5, 64, 30 no warm-up
  ## large 0.955 with 3e-5, 4, 30, no warmup
  ## large 0.933 with 9e-6, 4, 30, no warmup




train_args = TrainingArguments(
    output_dir = f"./output2/",
    logging_dir=f'./logs/logs',
    gradient_accumulation_steps=gradient_accumulation_steps,
    num_train_epochs=EPOCHS,  # higher values increase training time; the best checkpoint is restored at the end (see load_best_model_at_end)
    learning_rate=LEARNING_RATE,  # 6e-5 for deberta-v3-base, 8e-6 for deberta-v3-large (set above)
    per_device_train_batch_size=BATCH_SIZE,  # if you get an out-of-memory error, reduce BATCH_SIZE (e.g. to 4) and restart the runtime. Higher values increase training speed, but also memory requirements.
    per_device_eval_batch_size=BATCH_SIZE,  # if you get an out-of-memory error, reduce this value and restart the runtime
    warmup_ratio=0.06,  # share of training steps used for learning-rate warm-up; 0.06 is a common default for BERT-base-sized models
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",
    evaluation_strategy="epoch", # options: "no"/"steps"/"epoch"
    save_strategy = "epoch",  # options: "no"/"steps"/"epoch",
    save_total_limit = 2,
    report_to="all",  # "all"  # logging
    )

# set up the Trainer; by default it drops dataset columns the model's forward() does not accept (e.g. text, id)
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=train_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    compute_metrics=compute_metrics_standard
)

trainer.train(resume_from_checkpoint = False)
eval_metrics = trainer.evaluate()



Epoch,Training Loss,Validation Loss,F1 Macro,Accuracy Balanced,Precision Macro,Recall Macro
0,No log,0.528817,0.424107,0.5,0.368217,0.5
1,No log,0.410104,0.842977,0.801471,0.937788,0.801471
2,No log,0.162392,0.948585,0.935913,0.963463,0.935913
4,No log,0.159133,0.954416,0.947988,0.961379,0.947988
5,No log,0.332874,0.914122,0.887074,0.953976,0.887074
6,No log,0.190496,0.945843,0.952167,0.940009,0.952167
8,No log,0.147212,0.954856,0.952709,0.95706,0.952709
9,No log,0.170038,0.959678,0.955341,0.964252,0.955341
10,No log,0.219173,0.954856,0.952709,0.95706,0.952709
12,No log,0.190935,0.959678,0.955341,0.964252,0.955341


In [None]:
trainer.evaluate()

{'eval_loss': 0.17003820836544037,
 'eval_f1_macro': 0.9596780495428616,
 'eval_accuracy_balanced': 0.9553405572755418,
 'eval_precision_macro': 0.964251893939394,
 'eval_recall_macro': 0.9553405572755418,
 'eval_runtime': 3.2652,
 'eval_samples_per_second': 79.016,
 'eval_steps_per_second': 10.107,
 'epoch': 29.77}
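
Because `load_best_model_at_end=True` is combined with `metric_for_best_model="f1_macro"`, the final `trainer.evaluate()` call scores the restored best checkpoint rather than the last epoch, which is why its F1 and loss match an earlier row of the training log. The sketch below replays the F1 values from the rows shown in that log, under the assumption that a later epoch only replaces the best checkpoint when it is strictly better. `best_epoch` is a hypothetical helper, not a `transformers` function.

```python
# (epoch, f1_macro) pairs copied from the rows shown in the training log
f1_log = [
    (0, 0.424107), (1, 0.842977), (2, 0.948585), (4, 0.954416),
    (5, 0.914122), (6, 0.945843), (8, 0.954856), (9, 0.959678),
    (10, 0.954856), (12, 0.959678),
]


def best_epoch(history, greater_is_better=True):
    """Return the first (epoch, value) pair that no later pair strictly beats."""
    best = None
    for epoch, value in history:
        if best is None:
            improved = True
        elif greater_is_better:
            improved = value > best[1]
        else:
            improved = value < best[1]
        if improved:
            best = (epoch, value)
    return best


print(best_epoch(f1_log))  # (9, 0.959678)
```

Epochs 9 and 12 tie on F1 in the log; under the strict-comparison assumption the earlier one is kept, which is consistent with the `eval_loss` of about 0.170 reported above.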

In [None]:
# guard: stops "Run all" here so the save cells below only run when triggered manually
run_below = False
assert run_below

AssertionError: ignored

In [None]:
### save best model to google drive
model_name_custom = "deberta-base-health_final_20240321_9596"
model_custom_path = "/content/drive/MyDrive/unga_health/" + model_name_custom

trainer.save_model(output_dir=model_custom_path)

In [None]:
df_test.to_csv("/content/drive/MyDrive/unga_health/"+model_name_custom+'_test_data.csv')