<a href="https://colab.research.google.com/drive/1sQuW2QY6MMI1R2cgIg7QFh0aXGNib0KD?usp=sharing" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##  🧑‍💻 Workshop Notebook 🧑‍💻

📅 ICPSR 2023 tutorial

These materials accompany the NLP Workshop, a supplement to the Machine Learning course at [ICPSR 2023](https://www.icpsr.umich.edu/web/pages/sumprog/). Some parts are adapted from Moritz Laurer's [GitHub repo](https://github.com/MoritzLaurer).

👨‍🏫 By [Selim Yaman](https://twitter.com/selimyaman_)

## Activate a GPU runtime

To run this notebook on a GPU, click "Runtime" > "Change runtime type" in the menu bar at the top left and select "GPU". Training a Transformer is much faster on a GPU. Given Google's usage limits for GPUs, I suggest first testing your non-training code on a CPU (Hardware accelerator "None" instead of GPU) and switching to the GPU only once you know that everything works.

## Install relevant packages

In [1]:
!pip install transformers[sentencepiece]==4.28
!pip install datasets==2.12
!pip install optuna==3.1

Collecting transformers[sentencepiece]==4.28
  Downloading transformers-4.28.0-py3-none-any.whl (7.0 MB)
Collecting huggingface-hub<1.0,>=0.11.0 (from transformers[sentencepiece]==4.28)
  Downloading huggingface_hub-0.15.1-py3-none-any.whl (236 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers[sentencepiece]==4.28)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting sentencepiece!=0.1.92,>=0.1.91 (from transformers[sentencepiece]==4.28)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

Collecting datasets==2.12
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)
Collecting dill<0.3.7,>=0.3.0 (from datasets==2.12)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
Collecting xxhash (from datasets==2.12)
  Downloading xxhash-3.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting multiprocess (from datasets==2.12)
  Downloading multiprocess-0.70.14-py310-none-any.whl (134 kB)
Collecting responses<0.19 (from datasets==2.12)
  Downloading responses-0.18.0-py

In [2]:
## Load general packages
# some more specialised packages are loaded in each sub section
import pandas as pd
import numpy as np
from google.colab.data_table import DataTable

In [3]:
# set random seed for reproducibility
SEED_GLOBAL = 123
np.random.seed(SEED_GLOBAL)

## Download data

In [4]:
## Download the cleaned train and test data from GitHub
df_train = pd.read_csv("https://raw.githubusercontent.com/selimyaman/ICPSR_WORKSHOP/master/political_tweets_example_clean_train.csv")
df_train.drop(df_train.columns[0], axis=1, inplace=True)

df_test = pd.read_csv("https://raw.githubusercontent.com/selimyaman/ICPSR_WORKSHOP/master/political_tweets_example_clean_test.csv")
df_test.drop(df_test.columns[0], axis=1, inplace=True)


print("Length of training and test sets: ", len(df_train), " (train) ", len(df_test), " (test).") # I split the dataset 80-20

Length of training and test sets:  3204  (train)  800  (test).


In [5]:
# make sure everything looks correct:
df_train.head()

Unnamed: 0,label,text,label_text
0,1,Global Voices Online » Alex Castro: A liberal...,political
1,1,Do the Conservatives Have a Death Wish? http:/...,political
2,1,RT @AllianceAlert: * House Dems ask for civili...,political
3,1,RT @AdamSmithInst Quote of the week: My politi...,political
4,1,@mystic23 I also think that most liberals don'...,political


**If you want to run the notebook on your own dataset:**

You can load your own training and test data above to fine-tune your own model. Your own dataframe only needs three columns to be compatible with the code below:
- **label** column with a numeric label;
- **label_text** column with the label name in plain language;
- **text** column with the texts for training (you might need to delete or adapt the text preparation code cell below for your dataset).

In [6]:
## alternatively, you can also load your own .csv files from Google Drive
"""
from google.colab import drive
import os
drive.mount('/content/drive', force_remount=False)

# set the path to your data
os.chdir("/content/drive/My Drive/PhD/other/chapter2/data")
print(os.getcwd())

df_train = pd.read_csv("./df_manifesto_morality_train.csv")
df_test = pd.read_csv("./df_manifesto_morality_test.csv")
print("Length of training and test sets: ", len(df_train), " (train) ", len(df_test), " (test).")
"""



In [7]:
# optional: use training data sample size of e.g. 1000 for faster testing
sample_size = 1000
df_train = df_train.sample(n=min(sample_size, len(df_train)), random_state=SEED_GLOBAL).copy(deep=True)
df_test = df_test.sample(n=min(sample_size*4, len(df_test)), random_state=SEED_GLOBAL).copy(deep=True)

print("Length of training and test sets after sampling: ", len(df_train), " (train) ", len(df_test), " (test).")


Length of training and test sets after sampling:  1000  (train)  800  (test).


In [8]:
## inspect the data
# label distribution train set
print("Train set label distribution:\n", df_train.label_text.value_counts(), "\n")
# label distribution test set
print("Test set label distribution:\n", df_test.label_text.value_counts())


Train set label distribution:
 non-political    584
political        416
Name: label_text, dtype: int64 

Test set label distribution:
 non-political    452
political        348
Name: label_text, dtype: int64
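Since the classes are not perfectly balanced, it helps to know how well a trivial classifier would do before looking at any model results. The following is a minimal sketch, with the label counts copied by hand from the output above; the helper name `majority_baseline` is my own and not part of any library.

```python
def majority_baseline(label_counts):
    """Return (majority label, accuracy of always predicting that label)."""
    total = sum(label_counts.values())
    majority_label = max(label_counts, key=label_counts.get)
    return majority_label, label_counts[majority_label] / total


# label counts copied from the value_counts() output above
train_counts = {"non-political": 584, "political": 416}
test_counts = {"non-political": 452, "political": 348}

label, acc = majority_baseline(test_counts)
print(f"Always predicting '{label}' gives {acc:.3f} accuracy on the test set")
# prints: Always predicting 'non-political' gives 0.565 accuracy on the test set
```

Any fine-tuned model should clearly beat this number; otherwise it has learned little beyond the label distribution.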


In [9]:
# full training data table
DataTable(df_train, num_rows_per_page=5)

Unnamed: 0,label,text,label_text
2573,0,"en with an extra hour of sleep, Monday is stil...",non-political
2853,0,"aux #Saints! <3 Psh, and the Falcons used to b...",non-political
1196,1,We are broadcasting a very special Request Sho...,political
602,1,RT @jackiewalorski: What a deal! Obama's going...,political
984,1,With the billions congress wastes spending on ...,political
...,...,...,...
2989,0,onorGarry angels with maky faces @ dancehouse ...,non-political
3154,0,@FoxxFiles: #LoseMyNumber if you just saw me ...,non-political
2787,0,want to get the flu.,non-political
1316,1,gbmiii [ff] - Andhra Congress leader urges Cen...,political


## Data preprocessing



In [10]:
# If your texts are long documents rather than short texts like tweets, you might want to include the preceding and following text as well
# df_train["text"] = df_train.text_preceding.fillna("") + " " + df_train.text_original.fillna("") + " " + df_train.text_following.fillna("")
# df_test["text"] = df_test.text_preceding.fillna("") + " " + df_test.text_original.fillna("") + " " + df_test.text_following.fillna("")


In [11]:
df_train = df_train[["label", "label_text", "text"]]
df_test = df_test[["label", "label_text", "text"]]

In [12]:
DataTable(df_train, num_rows_per_page=5)

Unnamed: 0,label,label_text,text
2573,0,non-political,"en with an extra hour of sleep, Monday is stil..."
2853,0,non-political,"aux #Saints! <3 Psh, and the Falcons used to b..."
1196,1,political,We are broadcasting a very special Request Sho...
602,1,political,RT @jackiewalorski: What a deal! Obama's going...
984,1,political,With the billions congress wastes spending on ...
...,...,...,...
2989,0,non-political,onorGarry angels with maky faces @ dancehouse ...
3154,0,non-political,@FoxxFiles: #LoseMyNumber if you just saw me ...
2787,0,non-political,want to get the flu.
1316,1,political,gbmiii [ff] - Andhra Congress leader urges Cen...


## Load a Transformer

We use [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) for loading and training our model. They provide great documentation and also a very good [course](https://huggingface.co/course/chapter1/1) on how to use Transformers.

**Choosing a Transformer model**

You can use any classification model on the [Hugging Face Hub](https://huggingface.co/models?sort=downloads). I suggest testing these models:



*   Original BERT: `bert-base-uncased`
*   Small efficient model: `distilbert-base-uncased`
*   Newer version of BERT: `microsoft/deberta-v3-base`
*   Large, high-performance model: `microsoft/deberta-v3-large`
*   Multilingual model: `microsoft/mdeberta-v3-base`





In [13]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
import torch

## load a model and its tokenizer
model_name = "distilbert-base-uncased"  # specify the model to be used
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True, model_max_length=512)  # load the tokenizer associated with the model

# create mappings between numeric labels and label texts
# note: IDs are assigned in alphabetical order of the label texts, which matches the numeric labels in this dataset (non-political=0, political=1)
label_text = np.sort(df_test.label_text.unique()).tolist()  # get unique label texts and sort them
label2id = {label: idx for idx, label in enumerate(label_text)}  # map sorted label texts to their numeric IDs
id2label = {idx: label for idx, label in enumerate(label_text)}  # map numeric IDs back to their label texts

# create a model configuration
config = AutoConfig.from_pretrained(model_name, label2id=label2id, id2label=id2label, num_labels=len(label2id));

# load model using the above configuration
model = AutoModelForSequenceClassification.from_pretrained(model_name, config=config, ignore_mismatched_sizes=True);

# specify the device to be used for computation
device = "cuda" if torch.cuda.is_available() else "cpu"  # use GPU if available, else use CPU
print(f"Device: {device}")
model.to(device);  # move the model to the specified device


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'pre_classi

Device: cuda
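The cell above assigns label IDs by sorting the label texts alphabetically. That happens to agree with the numeric `label` column in this dataset, but it may not for your own data. Below is a minimal sketch of a safer alternative that reads the mapping from the (label, label_text) pairs themselves. The function `build_label_maps` and the example rows are hypothetical; with the dataframes above you could pass `zip(df_train["label"], df_train["label_text"])` instead.

```python
def build_label_maps(pairs):
    """Build label2id / id2label from (numeric_label, label_text) pairs.

    Raises ValueError if a label text maps to several IDs, or an ID to several texts.
    """
    label2id, id2label = {}, {}
    for label_id, text in pairs:
        label_id = int(label_id)
        if label2id.setdefault(text, label_id) != label_id:
            raise ValueError(f"label text {text!r} is paired with more than one ID")
        if id2label.setdefault(label_id, text) != text:
            raise ValueError(f"label ID {label_id} is paired with more than one text")
    # return both mappings ordered by numeric ID
    ordered_ids = sorted(id2label)
    return ({id2label[i]: i for i in ordered_ids}, {i: id2label[i] for i in ordered_ids})


# hypothetical rows for illustration
rows = [(1, "political"), (0, "non-political"), (1, "political")]
label2id_checked, id2label_checked = build_label_maps(rows)
print(label2id_checked)
print(id2label_checked)
```

If the check raises an error, the label columns are inconsistent and should be cleaned before training.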


## Tokenize data

Tokenization is the process of breaking text down into smaller units called tokens. Tokens can be words, subwords, or even single characters; for models like DistilBERT they are typically words or subwords. The purpose of tokenization is to convert human-readable text into a format that an ML model can process.

This tokenization process is particularly important for models like DistilBERT, which operate on tokens rather than raw text. Each token corresponds to an entry in the model's vocabulary, and the model learns to associate these tokens with specific meanings and contexts during training.
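To make the idea of subword tokens concrete, here is a toy sketch of greedy longest-match splitting in the style of WordPiece, the scheme used by BERT-family tokenizers. It is not the real tokenizer: the vocabulary `toy_vocab` and the functions `wordpiece_tokenize` and `encode` are made up for illustration, and the actual DistilBERT vocabulary has roughly 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedily split one lowercase word into the longest matching vocabulary pieces."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # "##" marks a piece that continues a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:  # no piece matches: the whole word becomes unknown
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


def encode(text, vocab):
    """Lowercase, split on whitespace, tokenize each word, and add special tokens."""
    pieces = ["[CLS]"]
    for word in text.lower().split():
        pieces.extend(wordpiece_tokenize(word, vocab))
    pieces.append("[SEP]")
    return pieces, [vocab[p] for p in pieces]


# hypothetical toy vocabulary mapping tokens to integer IDs
toy_vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "token": 3, "##ization": 4, "##s": 5}

pieces, ids = encode("Tokenization tokens", toy_vocab)
print(pieces)
print(ids)
```

The model never sees the strings, only the integer IDs, which is why a tokenizer must always be paired with the model it was trained with.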

In [14]:
# convert pandas dataframes to Hugging Face dataset object to facilitate pre-processing
import datasets

dataset = datasets.DatasetDict({
    "train": datasets.Dataset.from_pandas(df_train),
    "test": datasets.Dataset.from_pandas(df_test)
})

# tokenize
def tokenize(examples):
  return tokenizer(examples["text"], truncation=True, max_length=512)  # max_length can be reduced to e.g. 256 to increase speed, but long texts will be cut off

dataset["train"] = dataset["train"].map(tokenize, batched=True)
dataset["test"] = dataset["test"].map(tokenize, batched=True)

# remove unnecessary columns for model training
dataset = dataset.remove_columns(['label_text'])


**Inspect processed data**

In [15]:
print("The overall structure of the pre-processed train and test sets:\n")
print(dataset)

The overall structure of the pre-processed train and test sets:

DatasetDict({
    train: Dataset({
        features: ['label', 'text', '__index_level_0__', 'input_ids', 'attention_mask'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['label', 'text', '__index_level_0__', 'input_ids', 'attention_mask'],
        num_rows: 800
    })
})


An example of a tokenized text:

{'label': 0, 'text': "yin' to enjoy this beautiful Halloween Night!!!", '__index_level_0__': 1619, 'input_ids': [101, 18208, 1005, 2000, 5959, 2023, 3376, 14414, 2305, 999, 999, 999, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
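In the example above the `attention_mask` is all ones because a single text needs no padding. The mask matters once texts of different lengths are batched together. The sketch below shows the idea in plain Python; `pad_batch` is a hypothetical helper, the second text's IDs are illustrative only, and a padding ID of 0 is an assumption. As far as I know, the Trainer handles this step automatically when it is given a tokenizer.

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    """Pad every sequence to the longest one in the batch and build attention masks."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([
    [101, 18208, 1005, 2000, 5959, 2023, 3376, 14414, 2305, 999, 999, 999, 102],  # the example above
    [101, 2215, 2000, 102],  # hypothetical shorter text, IDs are illustrative only
])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

The zeros in the mask tell the model to ignore the padded positions, so padding does not change the prediction for the shorter text.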


## Setting training arguments / hyperparameters

The following cells set several important hyperparameters. We chose parameters that work well in general to avoid the need for hyperparameter search. Further below, we also provide code for hyperparameter search, if researchers want to try to increase performance by a few percentage points.

In [16]:
# Set the directory to write the fine-tuned model and training logs to.
# With google colab, this will create a temporary folder, which will be deleted once you disconnect.
# You can connect to your personal google drive to save models and logs properly.
training_directory = "BERT-demo"

# FP16 (mixed-precision training) can increase training speed and reduce memory consumption, but only on a GPU and with a batch size > 8, see here: https://huggingface.co/transformers/performance.html?#fp16
# FP16 does not work on CPU or for multilingual mDeBERTa models
fp16_bool = True if torch.cuda.is_available() else False
if "mdeberta" in model_name.lower(): fp16_bool = False  # multilingual mDeBERTa does not support FP16 yet: https://github.com/microsoft/DeBERTa/issues/77
# if you run the hyperparameter search at the end of this notebook, FP16 has to be set to False. The integrated hyperparameter search with the Hugging Face Trainer can lead to errors otherwise.
fp16_bool = False

In [17]:
from transformers import TrainingArguments, Trainer, logging

LEARNING_RATE = 2e-5  # can try: 6e-5
EPOCHS = 5  # can try: 10

# Hugging Face tips to increase training speed and decrease out-of-memory (OOM) issues: https://huggingface.co/transformers/performance.html?
# Overview of all training arguments: https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments
train_args = TrainingArguments(
    output_dir=f'./results/{training_directory}',
    logging_dir=f'./logs/{training_directory}',
    num_train_epochs=EPOCHS,  # this can be increased, but higher values increase training time. Typical values are between 3 and 20.
    learning_rate=LEARNING_RATE,
    per_device_train_batch_size=8,  # if you get an out-of-memory error, reduce this value (e.g. to 4) and restart the runtime. Higher values increase training speed, but also increase memory requirements. Ideal values are multiples of 8.
    per_device_eval_batch_size=80,  # if you get an out-of-memory error, reduce this value, e.g. to 40 and restart the runtime
    gradient_accumulation_steps=2,  # accumulates gradients over X batches before each optimizer update, so the effective batch size is per_device_train_batch_size * X. Helps with memory problems (lower the per-device batch size and raise X accordingly), at a slight cost in speed.
    warmup_ratio=0.06,  # share of training steps used for learning-rate warmup; 0.06 is a good default for BERT-base-sized models
    weight_decay=0.1,
    seed=SEED_GLOBAL,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=fp16_bool,  # Can speed up training and reduce memory consumption, but only makes sense at batch-size > 8. loads two copies of model weights, which creates overhead. https://huggingface.co/transformers/performance.html?#fp16
    fp16_full_eval=fp16_bool,
    evaluation_strategy="epoch", # options: "no"/"steps"/"epoch"
    #eval_steps=10_000,  # evaluate every n steps if evaluation_strategy=='steps'. defaults to logging_steps
    save_strategy = "epoch",  # options: "no"/"steps"/"epoch"
    #save_steps=10_000,              # Number of updates steps before two checkpoint saves.
    #save_total_limit=10,             # If a value is passed, will limit the total amount of checkpoints. Deletes the older checkpoints in output_dir
    #logging_strategy="steps",
    report_to="all",  # "all"  # logging
    #push_to_hub=False,
    #push_to_hub_model_id=f"{model_name}-finetuned-{task}",
)
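The interaction between batch size, gradient accumulation, epochs and warmup is easiest to see with a bit of arithmetic. The sketch below reflects my understanding of how the step count comes about; the function `training_schedule` is hypothetical, and the rounding rules (floor division for accumulation, rounding up for warmup) are assumptions. With the settings used here it reproduces the `global_step=310` reported by `trainer.train()` further below.

```python
import math


def training_schedule(n_examples, per_device_batch_size, grad_accum_steps,
                      epochs, warmup_ratio, n_devices=1):
    """Estimate effective batch size and number of optimizer updates for a training run."""
    effective_batch_size = per_device_batch_size * grad_accum_steps * n_devices
    batches_per_epoch = math.ceil(n_examples / (per_device_batch_size * n_devices))
    # assumption: one optimizer update per full group of accumulated batches
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    total_updates = math.ceil(epochs * updates_per_epoch)
    warmup_steps = math.ceil(total_updates * warmup_ratio)
    return {
        "effective_batch_size": effective_batch_size,
        "updates_per_epoch": updates_per_epoch,
        "total_updates": total_updates,
        "warmup_steps": warmup_steps,
    }


# values used in this notebook: 1000 training texts, batch size 8, accumulation 2, 5 epochs
schedule = training_schedule(1000, 8, 2, 5, 0.06)
print(schedule)
```

So each optimizer update sees 16 examples, even though only 8 are held in GPU memory at a time.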


In [18]:
### Function to calculate metrics
from sklearn.metrics import balanced_accuracy_score, precision_recall_fscore_support, accuracy_score, classification_report
import warnings

def compute_metrics_standard(eval_pred):
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore")

        labels = eval_pred.label_ids
        pred_logits = eval_pred.predictions
        preds_max = np.argmax(pred_logits, axis=1)  # argmax on each row (axis=1) in the tensor

        # metrics
        precision_macro, recall_macro, f1_macro, _ = precision_recall_fscore_support(labels, preds_max, average='macro')  # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
        precision_micro, recall_micro, f1_micro, _ = precision_recall_fscore_support(labels, preds_max, average='micro')  # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
        acc_balanced = balanced_accuracy_score(labels, preds_max)
        acc_not_balanced = accuracy_score(labels, preds_max)

        metrics = {
            'accuracy': acc_not_balanced,
            'f1_macro': f1_macro,
            'accuracy_balanced': acc_balanced,
            'f1_micro': f1_micro,
            'precision_macro': precision_macro,
            'recall_macro': recall_macro,
            'precision_micro': precision_micro,
            'recall_micro': recall_micro,
        }

        return metrics
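To see what the macro, micro and balanced variants mean, here is a small from-scratch sketch in plain Python. It is not the scikit-learn implementation, and the labels and predictions are made up. It illustrates two facts that also show up in the training table below: for single-label classification, micro F1 equals plain accuracy, and balanced accuracy equals macro recall.

```python
def classification_metrics(labels, preds):
    """Compute accuracy plus macro/micro precision, recall and F1 by hand."""
    classes = sorted(set(labels) | set(preds))
    per_class, tp_all, fp_all, fn_all = [], 0, 0, 0
    for c in classes:
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_class.append((precision, recall, f1))
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
    n = len(classes)
    micro_p = tp_all / (tp_all + fp_all)
    micro_r = tp_all / (tp_all + fn_all)
    return {
        "accuracy": sum(1 for y, p in zip(labels, preds) if y == p) / len(labels),
        "precision_macro": sum(p for p, _, _ in per_class) / n,  # unweighted mean over classes
        "recall_macro": sum(r for _, r, _ in per_class) / n,
        "f1_macro": sum(f for _, _, f in per_class) / n,
        "f1_micro": 2 * micro_p * micro_r / (micro_p + micro_r),  # pooled over all examples
        "accuracy_balanced": sum(r for _, r, _ in per_class) / n,  # mean per-class recall
    }


# made-up example: 6 texts of class 0, 4 texts of class 1, two mistakes
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
preds = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
m = classification_metrics(labels, preds)
print(m)
```

Macro metrics weight every class equally, so they drop faster than accuracy when the smaller class is predicted poorly.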


## Fine-tuning and evaluation

Let's start fine-tuning the model!

If you get an 'out-of-memory' error, reduce **per_device_train_batch_size** (e.g. from 8 to 4) in the `TrainingArguments` above and **restart** the runtime ('Runtime' > 'Restart runtime').

In [19]:
# training
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=train_args,
    train_dataset=dataset["train"],
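    eval_dataset=dataset["test"],  # note: the test set is also used to select the best checkpoint here; for a rigorous final estimate, select on a separate validation split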
    compute_metrics=compute_metrics_standard
)

trainer.train()


You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy,F1 Macro,Accuracy Balanced,F1 Micro,Precision Macro,Recall Macro,Precision Micro,Recall Micro
0,No log,0.251266,0.9175,0.915401,0.912445,0.9175,0.919844,0.912445,0.9175,0.9175
2,No log,0.302962,0.91125,0.910696,0.916501,0.91125,0.909772,0.916501,0.91125,0.91125
2,No log,0.279826,0.92,0.918975,0.920939,0.92,0.917594,0.920939,0.92,0.92
4,No log,0.288575,0.9225,0.921551,0.923812,0.9225,0.920052,0.923812,0.9225,0.9225
4,No log,0.301137,0.92625,0.924855,0.924156,0.92625,0.92563,0.924156,0.92625,0.92625


TrainOutput(global_step=310, training_loss=0.15450115819131174, metrics={'train_runtime': 73.5744, 'train_samples_per_second': 67.958, 'train_steps_per_second': 4.213, 'total_flos': 70629961120896.0, 'train_loss': 0.15450115819131174, 'epoch': 4.96})

In [20]:
# Evaluate the fine-tuned model on the held-out test set
results = trainer.evaluate()
print(results)

{'eval_loss': 0.30113667249679565, 'eval_accuracy': 0.92625, 'eval_f1_macro': 0.924855003590084, 'eval_accuracy_balanced': 0.9241557318685789, 'eval_f1_micro': 0.92625, 'eval_precision_macro': 0.9256304584978724, 'eval_recall_macro': 0.9241557318685789, 'eval_precision_micro': 0.92625, 'eval_recall_micro': 0.92625, 'eval_runtime': 3.2785, 'eval_samples_per_second': 244.013, 'eval_steps_per_second': 3.05, 'epoch': 4.96}


## Inference with your fine-tuned model

In [21]:
from transformers import pipeline
import torch
device = "cuda:0" if torch.cuda.is_available() else "cpu"  # use GPU (cuda) if available, otherwise use CPU

# documentation: https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TextClassificationPipeline
pipe_classifier = pipeline(
    "text-classification",
    model=model,  # if you have trained a model above, load_best_model_at_end in the training arguments has automatically replaced model with the fine-tuned model
    tokenizer=tokenizer,
    framework="pt",
    device=device,
)

We now apply the pipeline to unseen texts. We re-use the df_test data-frame here for simplicity, but it could be any other dataset. It only needs a text column. Note that we do not need to tokenize the text ourselves here, as this is handled internally by the Hugging Face text-classification pipeline. If you want to better understand the arguments in the pipeline below, I'd recommend reading the [documentation here](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TextClassificationPipeline).

In [22]:
# create a dummy data frame for illustration
# Take a random sample of size 1000 or the entire dataframe if it contains fewer than 1000 rows
df_inference = df_test[["text", "label_text"]].sample(n=min(1000, df_test.shape[0]), random_state=123).copy(deep=True)


text_lst = df_inference["text"].tolist()

# use the pipeline with your chosen model for inference (prediction)
pipe_output = pipe_classifier(
    text_lst,  # input any list of texts here
    batch_size=32  # reduce this number to 8 or 16 if you get an out-of-memory error
)
print(pipe_output)

df_output = pd.DataFrame(pipe_output)

# add inference data to your original dataframe
df_inference["label_text_pred"] = df_output["label"].tolist()
df_inference["label_text_pred_probability"] = df_output["score"].round(2).tolist()


[{'label': 'non-political', 'score': 0.99785977602005}, {'label': 'non-political', 'score': 0.9977198243141174}, {'label': 'non-political', 'score': 0.9971897006034851}, {'label': 'non-political', 'score': 0.9952951073646545}, {'label': 'political', 'score': 0.9939175844192505}, {'label': 'non-political', 'score': 0.9974752068519592}, {'label': 'political', 'score': 0.9940033555030823}, {'label': 'non-political', 'score': 0.9979459643363953}, {'label': 'political', 'score': 0.9931082725524902}, {'label': 'non-political', 'score': 0.998141884803772}, {'label': 'political', 'score': 0.9829419851303101}, {'label': 'political', 'score': 0.9931381344795227}, {'label': 'non-political', 'score': 0.9979761242866516}, {'label': 'non-political', 'score': 0.9938741326332092}, {'label': 'non-political', 'score': 0.9974325299263}, {'label': 'political', 'score': 0.9933236241340637}, {'label': 'political', 'score': 0.9835572242736816}, {'label': 'non-political', 'score': 0.9974848031997681}, {'label

In [23]:
df_inference

Unnamed: 0,text,label_text,label_text_pred,label_text_pred_probability
669,"RETTYSHAQ u kno what ur right, but I thought a...",non-political,non-political,1.00
634,ni tutorial on Box2D http://bit.ly/lsZBz,non-political,non-political,1.00
415,am working on starting my on clothing line. Lo...,non-political,non-political,1.00
457,ixtape] Lil Wayne - No Ceilings (Official): He...,non-political,non-political,1.00
336,Tories plan to reduce Big Brother state: A FUT...,political,political,0.99
...,...,...,...,...
13,Jerry Falwell - The conservative elite and cor...,political,political,0.99
30,in us in 30 minutes for the Common Man Common ...,non-political,non-political,0.99
756,@BigBassFishing Phillips Rode Flipping Bite T...,non-political,non-political,1.00
745,@Neg10540: Hope y'all had a greatMon MJ fans....,non-political,non-political,1.00
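The `label_text_pred_probability` column comes from the pipeline's `score`. As I understand it, for a single-label classifier like this one the score is the softmax probability of the predicted class. The sketch below shows that conversion in plain Python; the logits are made up, and `logits_to_prediction` is a hypothetical helper, not part of the pipeline API.

```python
import math


def logits_to_prediction(logits, id2label):
    """Turn raw model logits into a predicted label and its softmax probability."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


id2label_demo = {0: "non-political", 1: "political"}
prediction = logits_to_prediction([3.2, -2.9], id2label_demo)  # hypothetical logits
print(prediction)
```

A score close to 1.00 therefore only means the two logits are far apart. It is not by itself a guarantee that the prediction is correct.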


## Workshop exercise

1. `Runtime > Run all`
2. Inspect the output; Google/ChatGPT some terms you don’t know
3. Choose a more powerful model and adapt `model_name`
4. `Runtime > Restart and Run All`
5. Inspect the output; (optionally rerun again with different hyperparameters)
6. Optional: Create a [Hugging Face account](https://huggingface.co/) and upload your model with the code below
7. Optional: Try hyperparameter search with the code below (takes very long)

In [24]:
run_code_below = False # If you want to run the whole script, including the hyper-parameter tuning below, change this into True
assert run_code_below, "Stopping code here to avoid accidental runs of the code below with Runtime > Run all"

AssertionError: ignored

## Bonus: Save and load your fine-tuned model

### Saving your model to Google Drive

In [25]:
## first you need to connect to your google drive with your google account
from google.colab import drive
import os
drive.mount('/content/drive', force_remount=False)
#drive.flush_and_unmount()

# insert the path where you want to save the model
os.chdir("/content/drive/My Drive/")
print(os.getcwd())


MessageError: ignored

In [None]:
### save best model to google drive
directory_save_model = f"{training_directory}/"
model_name_custom = f"{model_name.split('/')[-1]}-custom"
mode_custom_path = directory_save_model + model_name_custom

trainer.save_model(output_dir=mode_custom_path)

### Upload your model to the Hugging Face Hub

In [None]:
### Push to Hugging Face hub
# install necessary dependencies
# you need to create an account on https://huggingface.co/ for this
!sudo apt-get install git-lfs
!huggingface-cli login

In [None]:
# load your models and tokenizer saved before from disk
model = AutoModelForSequenceClassification.from_pretrained(mode_custom_path)
tokenizer = AutoTokenizer.from_pretrained(mode_custom_path, use_fast=True, model_max_length=512)  # load the tokenizer that was saved alongside the fine-tuned model

In [None]:
# https://huggingface.co/docs/transformers/main_classes/model#transformers.PreTrainedModel.push_to_hub
repo_id = '<your-user-name>/<your-model-name>'  # e.g. "JaneJones/DeBERTa-v3-nli-custom". note that the repo name is case-sensitive
model.push_to_hub(repo_id=repo_id, use_temp_dir=True, private=True, use_auth_token="<your-huggingface-token>")
tokenizer.push_to_hub(repo_id=repo_id, use_temp_dir=True, private=True, use_auth_token="<your-huggingface-token>")


## Bonus: Hyperparameter Search

To increase performance, you can also run a hyperparameter search (hp-search) to find the best hyperparameters for your specific task and dataset. The trade-off is that hp-search is very compute-intensive. Make sure to run the hp-search on a validation set split off from the training set, not on the final test set, to avoid leaking test data before final testing.

Note that for small datasets, running the hp-search on only one train-validation split is not ideal. For datasets with fewer than around 2000 training data points, we recommend running the hp-search on two different random train-validation splits. We implemented this for our paper, but not in this notebook, as this would make the code harder to understand.

Documentation with more information on hp-search with Hugging Face Transformers is available [here](https://huggingface.co/docs/transformers/main/hpo_train).

In [None]:
## train-validation split - test set should not be visible during hp-search
# https://huggingface.co/docs/datasets/v2.5.1/en/package_reference/main_classes#datasets.Dataset.train_test_split

# the ideal size of the validation set depends on the size of your training data. Each label should have at the very least a few dozen examples in the validation set (ideally several hundred)
validation_set_size = 0.4  # for a training data size of 1000 we use 40% of the training data for validating hyperparameters

# reformatting of label column to enable dataset stratification
from datasets import ClassLabel
new_features = dataset["train"].features.copy()
label_names = list(model.config.label2id.keys())

new_features['label'] = ClassLabel(names=label_names)
dataset = dataset.cast(new_features)

# train-validation split for hp-search
dataset_hp = dataset["train"].train_test_split(test_size=validation_set_size, seed=SEED_GLOBAL, shuffle=True, stratify_by_column="label")
print(dataset_hp)
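Conceptually, a stratified split keeps the label proportions the same in both parts. The sketch below illustrates this in plain Python. It is not how the `datasets` library implements `stratify_by_column`; the function `stratified_split` is hypothetical, and the label counts mimic the sampled training set above. Running it with two seeds also shows how one could obtain the two different random splits recommended for small datasets.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed):
    """Split example indices into train/validation parts, preserving label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    train_idx, val_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)  # same share of every label goes to validation
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# label counts as in the sampled training set: 584 non-political (0), 416 political (1)
demo_labels = [0] * 584 + [1] * 416
for seed in (123, 456):  # two different random splits
    train_idx, val_idx = stratified_split(demo_labels, test_size=0.4, seed=seed)
    n_political = sum(demo_labels[i] for i in val_idx)
    print(f"seed {seed}: {len(train_idx)} train, {len(val_idx)} validation, {n_political} political in validation")
```

Both splits contain the same number of examples per label, but different individual texts.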

In [None]:
# helper function to clean memory and reduce risk of out-of-memory error
import gc
def clean_memory():
  #del(model)
  if torch.cuda.is_available():
    torch.cuda.empty_cache()
    torch.cuda.ipc_collect()
  gc.collect()

clean_memory()

In [None]:
## Reinitialize trainer for hp-search
# https://discuss.huggingface.co/t/using-hyperparameter-search-in-trainer/785/10

def model_init():
    clean_memory()

    # link the numeric labels to the label texts
    label_text = np.sort(df_test.label_text.unique()).tolist()
    label2id = dict(zip(np.sort(label_text), np.sort(pd.factorize(label_text, sort=True)[0]).tolist()))
    id2label = dict(zip(np.sort(pd.factorize(label_text, sort=True)[0]).tolist(), np.sort(label_text)))
    config = AutoConfig.from_pretrained(model_name, label2id=label2id, id2label=id2label, num_labels=len(label2id));

    return AutoModelForSequenceClassification.from_pretrained(model_name, config=config, ignore_mismatched_sizes=True).to(device)

trainer = Trainer(
    model_init=model_init,
    tokenizer=tokenizer,
    args=train_args,
    train_dataset=dataset_hp["train"],
    eval_dataset=dataset_hp["test"],
    compute_metrics=compute_metrics_standard
);


**Define the hyperparameters you want to optimise**

For a detailed discussion of different hyperparameters, see the appendix of our paper.

In [None]:
# we use Optuna for hp-search: https://optuna.readthedocs.io/en/stable/
def my_hp_space(trial):
    return {
        "learning_rate": trial.suggest_categorical("learning_rate", [9e-6, 2e-5, 4e-5]),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 4, 24, log=False, step=4),   # increasing the maximum number of epochs here could increase performance but will take (much) longer to train
        #"warmup_ratio": trial.suggest_float("warmup_ratio", 0.1, 0.6, log=True),
        "per_device_train_batch_size": 16,  # lower this value in case of out-of-memory errors and restart the runtime
        #"per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 16, 32]),
        "evaluation_strategy": "no",
        "save_strategy": "no",
    }
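
To get a feeling for how large this search space is, the discrete values can simply be enumerated. The sketch below only covers the two tuned parameters from `my_hp_space`; the variable names are illustrative.

```python
from itertools import product

learning_rates = [9e-6, 2e-5, 4e-5]       # the categorical choices from my_hp_space
epoch_values = list(range(4, 24 + 1, 4))  # suggest_int(4, 24, step=4) includes both ends

# every combination of learning rate and number of epochs
search_space = list(product(learning_rates, epoch_values))
print(epoch_values)
print(len(search_space))
```

With 3 learning rates and 6 epoch values there are 18 possible configurations, so 10 trials can already visit more than half of them (a sampler may also repeat a configuration).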


**Run HP search!**

Choose the number of hyperparameter configurations you want to test. In our experiments we found that after 10 to 15 trials with around 4 hyperparameters, performance is unlikely to increase meaningfully. 15 trials seems to be a safe value, but can take a while to run.

In [None]:
import optuna

# number of different hp configurations to test
number_of_trials = 10  # increasing this value can lead to better hyperparameters, but will take longer
# choose the sampler for sampling hp configurations
optuna_sampler = optuna.samplers.TPESampler(
    seed=SEED_GLOBAL, consider_prior=True, prior_weight=1.0, consider_magic_clip=True,
    consider_endpoints=False, n_startup_trials=number_of_trials // 2, n_ei_candidates=24,  # integer division: n_startup_trials expects an int
    multivariate=False, group=False, warn_independent_sampling=True, constant_liar=False
)  # https://optuna.readthedocs.io/en/stable/reference/generated/optuna.samplers.TPESampler.html#optuna.samplers.TPESampler

# Hugging Face Documentation: https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.hyperparameter_search
best_run = trainer.hyperparameter_search(
    n_trials=number_of_trials,
    direction="maximize",
    hp_space=my_hp_space,
    backend='optuna',
    **{"sampler": optuna_sampler}
)
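
As far as I know, when no `compute_objective` is passed, `hyperparameter_search` uses the evaluation loss if no other metrics are available, and otherwise the sum of all metrics returned by `compute_metrics`. If you would rather optimise a single metric, you can pass a small objective function. The sketch below assumes that the metric function returns a key `f1_macro`, which the Trainer would report as `eval_f1_macro`; adjust the key to whatever your metric function actually returns.

```python
def compute_objective_f1_macro(metrics):
    """Return a single metric for Optuna to maximise.

    The key name "eval_f1_macro" is an assumption about the metric function.
    """
    return metrics["eval_f1_macro"]

# hypothetical usage with the trainer defined above:
# best_run = trainer.hyperparameter_search(
#     n_trials=10, direction="maximize", hp_space=my_hp_space,
#     backend="optuna", compute_objective=compute_objective_f1_macro,
# )
```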

In [None]:
# show best hyperparameters based on hp-search
print(best_run)

**Training Time with optimised hyperparameters!**

Here we can use the original train and test set again.

In [None]:
# update the training arguments with the best hyperparameters
for k,v in best_run.hyperparameters.items():
    setattr(train_args, k, v)
print("\n", train_args)

# hp-search with the Hugging Face Trainer can cause errors with FP16; if that happens, uncomment the two lines below
#setattr(train_args, "fp16", False)
#setattr(train_args, "fp16_full_eval", False)
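
Because `setattr` silently creates a new attribute when a name is misspelled, it can be worth checking that every hyperparameter key already exists on the arguments object. A minimal sketch, using a hypothetical stand-in class `ToyArgs` instead of the real `TrainingArguments`; the helper name `apply_hyperparameters` is also an assumption:

```python
from dataclasses import dataclass

@dataclass
class ToyArgs:
    # stand-in for TrainingArguments, only for illustration
    learning_rate: float = 2e-5
    num_train_epochs: int = 3

def apply_hyperparameters(args, hyperparameters):
    """Copy hyperparameters onto args, refusing unknown attribute names."""
    for key, value in hyperparameters.items():
        if not hasattr(args, key):
            raise AttributeError(f"unknown training argument: {key}")
        setattr(args, key, value)
    return args

args = apply_hyperparameters(ToyArgs(), {"learning_rate": 4e-5, "num_train_epochs": 12})
print(args)
```

In the notebook, the same helper could be called as `apply_hyperparameters(train_args, best_run.hyperparameters)`.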

In [None]:
# reinitialize the model to avoid re-using a trained model from a step further above
#model_name = "XXX"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True, model_max_length=512)

# link the numeric labels to the label texts
label_text = np.sort(df_test.label_text.unique()).tolist()
label2id = dict(zip(np.sort(label_text), np.sort(pd.factorize(label_text, sort=True)[0]).tolist()))
id2label = dict(zip(np.sort(pd.factorize(label_text, sort=True)[0]).tolist(), np.sort(label_text)))
config = AutoConfig.from_pretrained(model_name, label2id=label2id, id2label=id2label, num_labels=len(label2id));

# load model with config
model = AutoModelForSequenceClassification.from_pretrained(model_name, config=config, ignore_mismatched_sizes=True);
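
The label mapping above amounts to numbering the alphabetically sorted label texts from 0 upwards. A minimal equivalent sketch in plain Python, with made-up label texts (the real ones come from `df_test.label_text`) and illustrative variable names:

```python
# made-up label texts for illustration only
toy_label_texts = ["neutral", "negative", "positive", "negative"]

# unique labels in alphabetical order, then numbered from 0
label_text_sorted = sorted(set(toy_label_texts))
label2id_toy = {label: idx for idx, label in enumerate(label_text_sorted)}
id2label_toy = {idx: label for label, idx in label2id_toy.items()}

print(label2id_toy)
print(id2label_toy)
```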


In [None]:
# Training
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=train_args,
    train_dataset=dataset["train"],  #.shard(index=1, num_shards=100),  # https://huggingface.co/docs/datasets/processing.html#sharding-the-dataset-shard
    eval_dataset=dataset["test"],  #.shard(index=1, num_shards=100),
    compute_metrics=compute_metrics_standard  # same metric function as in the hp-search above
)

trainer.train()


In [None]:
## Evaluate the fine-tuned model on the held-out test set
results = trainer.evaluate()


In [None]:
print(results)

Note that a hyperparameter search does not necessarily lead to better results. The search has to be run on a validation split taken from the training set, so each trial trains on less data, which can hurt generalisation. Especially for small training sets, the search often ends up with values close to good default values.
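
One way to check whether the search paid off is to compare the tuned run against a run with default hyperparameters on the same test set. A minimal sketch, assuming you saved the metrics of a default run in a hypothetical dictionary `results_default`; the helper name `metric_deltas` is also illustrative.

```python
def metric_deltas(results_baseline, results_tuned):
    """Return tuned-minus-baseline differences for metrics present in both runs."""
    shared = sorted(set(results_baseline) & set(results_tuned))
    return {key: results_tuned[key] - results_baseline[key] for key in shared}

# hypothetical usage in the notebook:
# print(metric_deltas(results_default, results))
```

Positive values for scores such as F1 or accuracy would indicate that the tuned run did better; differences close to zero suggest that the defaults were already adequate.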