The following steps are adapted from the GitHub notebook [Training and Fine-Tuning BERT for Classification](https://github.com/uvacw/teaching-bdaca/blob/main/modules/machinelearning-text-exercises/transformers_bert_classification.ipynb).

In [1]:
# ! pip3 install transformers==4.30
# ! pip install transformers[torch]

In [2]:
import pandas as pd
from transformers import AutoTokenizer, RobertaForSequenceClassification, Trainer, TrainingArguments, AutoModelForSequenceClassification
from torch.utils.data import Dataset
import torch
from sklearn.metrics import accuracy_score, f1_score, classification_report
from sklearn.utils.class_weight import compute_sample_weight

In [3]:
data = pd.read_csv('Disability_videos_combined_shuffled_cleaned.csv')

In [88]:
X_train = pd.read_csv('X_train.csv')
X_val = pd.read_csv('X_val.csv')
X_test = pd.read_csv('X_test.csv')
X_train_val = pd.read_csv('X_train_val.csv')

y_train = pd.read_csv('y_train.csv')
y_val = pd.read_csv('y_val.csv')
y_test = pd.read_csv('y_test.csv')
y_train_val = pd.read_csv('y_train_val.csv')

In [5]:
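# Remove rows with empty descriptions. The labels are filtered with the same
# mask so that texts and labels stay aligned row for row.
y_train = y_train[X_train['Description'].notna()]
y_val = y_val[X_val['Description'].notna()]
y_test = y_test[X_test['Description'].notna()]
X_train = X_train.dropna(subset=['Description'])
X_val = X_val.dropna(subset=['Description'])
X_test = X_test.dropna(subset=['Description'])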

## Encode data for BERT

We use the `AutoTokenizer.from_pretrained()` method from the Hugging Face `transformers` library to encode our texts. Specifically, we chose the "distilbert-base-uncased" tokenizer, because the corresponding model is small and does not take up too much RAM (https://huggingface.co/distilbert/distilbert-base-uncased). The maximum input length is 512 tokens; the tokenizer truncates longer texts, pads shorter ones, and adds the special BERT tokens such as [CLS] and [SEP].

The pretrained model is loaded using `AutoModelForSequenceClassification` (https://huggingface.co/learn/nlp-course/chapter2/2?fw=pt).

In [6]:
# Load the BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

Show details of the input ids and the attention mask.
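
As a rough illustration of what those two fields contain, the following plain-Python sketch pads two already-tokenized sequences to a common length and builds the matching attention mask. It imitates the format of the tokenizer output and is not the Hugging Face implementation; the word-piece IDs in the middle of each sequence are made up, while 101, 102 and 0 are the `[CLS]`, `[SEP]` and `[PAD]` IDs in the BERT uncased vocabulary.

```python
# Toy imitation of padding and attention masks (not the Hugging Face code).
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs in the BERT uncased vocab

def pad_batch(sequences, max_length=512):
    """Add [CLS]/[SEP], truncate to max_length, pad to the longest sequence."""
    wrapped = []
    for ids in sequences:
        body = ids[: max_length - 2]          # leave room for [CLS] and [SEP]
        wrapped.append([CLS_ID] + body + [SEP_ID])
    longest = max(len(ids) for ids in wrapped)
    input_ids = [ids + [PAD_ID] * (longest - len(ids)) for ids in wrapped]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in wrapped]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical word-piece IDs for two short texts
batch = pad_batch([[2023, 2003, 1037], [7592]])
print(batch["input_ids"])       # [[101, 2023, 2003, 1037, 102], [101, 7592, 102, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```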

In [7]:
# Generate the label to ID mapping
unique_labels = set(y_train.columns)
label2id = {label: id for id, label in enumerate(unique_labels)}
id2label = {id: label for label, id in label2id.items()}

In [8]:
label2id.keys()

dict_keys(['Conflict', 'Morality', 'Human Interest', 'Economic Consequences'])
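
Because `unique_labels` is a `set`, the order of the keys above is not guaranteed to be the same in a new Python session, and the IDs learned by the fine-tuned model depend on that order. A minimal sketch of a reproducible alternative, assuming the four column names shown above, is to enumerate a sorted list instead:

```python
# Column names as they appear in the output above
label_columns = ['Conflict', 'Morality', 'Human Interest', 'Economic Consequences']

# Sorting fixes the order, so the same label always gets the same ID
label2id = {label: idx for idx, label in enumerate(sorted(label_columns))}
id2label = {idx: label for label, idx in label2id.items()}

print(label2id)
# {'Conflict': 0, 'Economic Consequences': 1, 'Human Interest': 2, 'Morality': 3}
```

Both dictionaries can also be passed to `from_pretrained` through its `id2label` and `label2id` arguments, so that the mapping is stored in the saved model config.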

In [9]:
def encode_labels(df, label2id, default_label=0):
    def get_label(row):
        labels = [label2id[col] for col in df.columns if row[col] == 1]
        return labels[0] if labels else default_label
    return df.apply(get_label, axis=1).tolist()
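
Note that this function keeps only one label per row. The sketch below repeats the same selection logic on plain dictionaries (the rows and the helper name `first_active_label` are hypothetical) to show the two edge cases: a row with several active frames is reduced to the first active column, and a row with no active frame silently receives `default_label`, which is also the ID of a real class.

```python
# Same selection logic as encode_labels, on plain dicts instead of a DataFrame
columns = ['Conflict', 'Economic Consequences', 'Human Interest', 'Morality']
label2id = {label: idx for idx, label in enumerate(columns)}

def first_active_label(row, default_label=0):
    active = [label2id[col] for col in columns if row[col] == 1]
    return active[0] if active else default_label

rows = [
    {'Conflict': 0, 'Economic Consequences': 0, 'Human Interest': 1, 'Morality': 0},
    {'Conflict': 0, 'Economic Consequences': 1, 'Human Interest': 0, 'Morality': 1},  # two frames
    {'Conflict': 0, 'Economic Consequences': 0, 'Human Interest': 0, 'Morality': 0},  # no frame
]
encoded = [first_active_label(row) for row in rows]
print(encoded)  # [2, 1, 0]
```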

In [10]:
train_labels_encoded = encode_labels(y_train, label2id)
val_labels_encoded = encode_labels(y_val, label2id)
test_labels_encoded = encode_labels(y_test, label2id)

In [11]:
device_name = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=len(id2label)).to(device_name)

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.we

In [12]:
# Encode the texts
max_length = 512  # Maximum length for BERT
train_encodings = tokenizer(X_train['Description'].tolist(), truncation=True, padding=True, max_length=max_length)
val_encodings = tokenizer(X_val['Description'].tolist(), truncation=True, padding=True, max_length=max_length)
test_encodings = tokenizer(X_test['Description'].tolist(), truncation=True, padding=True, max_length=max_length)

## Create a custom Torch dataset by following these steps:

In [13]:
# Define the custom dataset class
class MyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx], dtype=torch.long) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

    def __len__(self):
        return len(self.labels)

Create dataset objects

In [14]:
train_dataset = MyDataset(train_encodings, train_labels_encoded)
val_dataset = MyDataset(val_encodings, val_labels_encoded)
test_dataset = MyDataset(test_encodings, test_labels_encoded)

Examine Encoded Articles from Training and Test Datasets

Examine a news article in the Torch training_dataset after encoding

In [15]:
# Examine a news article in the Torch training dataset after encoding
train_tokens = tokenizer.convert_ids_to_tokens(train_dataset.encodings['input_ids'][0])
train_article = ' '.join(train_tokens[:100])
print("Training Article (first 100 tokens):", train_article)

Training Article (first 100 tokens): [CLS] blindness is an invisible disability . when she was 5 years old , ky ##m de ##key ##rel was diagnosed with re ##tin ##itis pigment ##osa , a de ##gen ##erative genetic condition that causes loss of vision . in her 20s , she was diagnosed with lu ##pus , carrying symptoms of r ##he ##uma ##to ##id arthritis . during surgery to correct liver failure caused by complications from lu ##pus , de ##key ##rel almost died from excessive blood loss . although she survived , any remaining vision she had was gone , leaving her feeling “


Examine a news article in the Torch test_dataset after encoding

In [16]:
# Examine a news article in the Torch test dataset after encoding
test_tokens = tokenizer.convert_ids_to_tokens(test_dataset.encodings['input_ids'][1])
test_article = ' '.join(test_tokens[:100])
print("Test Article (first 100 tokens):", test_article)

Test Article (first 100 tokens): [CLS] ava ##rd law offices is a florida based law firm that specializes in social security disability benefits ( ss ##d ) . https : / / ava ##rd ##law . com / social - security - disability / [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]


In [17]:
# Print the length of id2label
print("Length of id2label:", len(id2label))

Length of id2label: 4


## Define the training arguments

In [18]:
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # number of training epochs
    per_device_train_batch_size=8,   # batch size for training
    per_device_eval_batch_size=16,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
    gradient_accumulation_steps=4,
)
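
It can help to work out how these settings interact. With `per_device_train_batch_size=8` and `gradient_accumulation_steps=4`, one optimizer step covers 32 examples per device, so a small dataset yields very few steps. The sketch below assumes the default `learning_rate` of 5e-5 and a linear warmup, and uses a hypothetical helper `warmup_lr` to estimate the learning rate reached after a given number of optimizer steps.

```python
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
warmup_steps = 500
base_lr = 5e-5  # assumed: the TrainingArguments default, not set explicitly above

# Examples consumed per optimizer step on a single device
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

def warmup_lr(step):
    """Learning rate during a linear warmup from 0 to base_lr."""
    return base_lr * min(1.0, step / warmup_steps)

print(effective_batch_size)   # 32
print(warmup_lr(12))          # learning rate after 12 optimizer steps
```

If a run finishes in fewer optimizer steps than `warmup_steps` (the training output in this notebook reports `global_step=12`), the learning rate never reaches `base_lr`; in that case a smaller `warmup_steps` value is worth trying.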

## Fine-tune the BERT model

In [19]:
# Define the custom evaluation function
def compute_metrics(eval_pred):
    labels = eval_pred.label_ids
    preds = eval_pred.predictions.argmax(-1)
    acc = accuracy_score(labels, preds)
    macro_f1 = f1_score(labels, preds, average='macro', sample_weight=compute_sample_weight('balanced', labels))
    return {'accuracy': acc, 'macro_f1': macro_f1}
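
For readers unfamiliar with the metric, macro F1 is the unweighted mean of the per-class F1 scores, so a class that is never predicted pulls the average down regardless of its size. The following plain-Python sketch computes it without sample weights on hypothetical labels; it is meant to illustrate the definition, not to reproduce the weighted call above.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (no sample weights)."""
    scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical labels: the classifier always predicts class 0
y_true = [0, 0, 1, 2, 3, 0]
y_pred = [0, 0, 0, 0, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                   # 0.5
print(macro_f1(y_true, y_pred))   # about 0.167, far below the accuracy
```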

Create a Trainer object from the model, the training arguments, the datasets, and the evaluation function:

In [20]:
# Create the Trainer
trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset,            # evaluation dataset
    compute_metrics=compute_metrics      # evaluation metrics
)


In [21]:
# Fine-tune the model
trainer.train()



Step    Training Loss
10      1.379


TrainOutput(global_step=12, training_loss=1.3799358407656352, metrics={'train_runtime': 1327.4467, 'train_samples_per_second': 0.339, 'train_steps_per_second': 0.009, 'total_flos': 50339406888960.0, 'train_loss': 1.3799358407656352, 'epoch': 2.53})
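
A useful reference point for this loss: a classifier that assigns equal probability to each of the four classes has a cross-entropy of ln(4). The short calculation below compares that baseline with the training loss reported above.

```python
import math

num_classes = 4
uniform_loss = math.log(num_classes)   # cross-entropy of a uniform guess
reported_loss = 1.3799358407656352     # training_loss from the output above

print(round(uniform_loss, 4))          # 1.3863
print(round(uniform_loss - reported_loss, 4))
```

The gap is small, which suggests that after 12 optimizer steps the model had learned little beyond chance level.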

## Save fine-tuned model

In [22]:
save_directory = "./fine_tuned_model"
trainer.save_model(save_directory)

Zip the saved model directory with `shutil` and download it from Colab.

In [23]:
# Zip the model directory for download
import shutil
shutil.make_archive(save_directory, 'zip', save_directory)

# Download the zipped model directory
from google.colab import files
files.download(save_directory + '.zip')


## Evaluate fine-tuned model on the validation set

The following method of the Trainer object runs the built-in evaluation on the validation set, including our compute_metrics function.

In [24]:
trainer.evaluate()

{'eval_loss': 1.364883542060852,
 'eval_accuracy': 0.46,
 'eval_macro_f1': 0.1794688457609806,
 'eval_runtime': 57.2484,
 'eval_samples_per_second': 0.873,
 'eval_steps_per_second': 0.07,
 'epoch': 2.53}

In [38]:
device_name = "cuda" if torch.cuda.is_available() else "cpu"
device = torch.device(device_name)

In [40]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")




Prepare the data for classification:

In [54]:
# Combine all labels to ensure consistent encoding
combined_labels = pd.concat([y_train, y_val, y_test])

In [41]:
# Build the label-to-ID mapping from the columns of y_test
label2id = {label: id for id, label in enumerate(y_test.columns)}
id2label = {id: label for label, id in label2id.items()}

The label sets (y) must be converted to binary indicator arrays, with one row per text, before the analysis can be performed.

In [55]:
from sklearn.preprocessing import MultiLabelBinarizer

# Binarize labels using the combined dataset
mlb = MultiLabelBinarizer()
mlb.fit(combined_labels.values.tolist())

In [56]:
# Binarize the individual sets
y_train_binarized = mlb.transform(y_train.values.tolist())
y_val_binarized = mlb.transform(y_val.values.tolist())
y_test_binarized = mlb.transform(y_test.values.tolist())

In [64]:
# Check the number of classes
num_labels = y_train_binarized.shape[1]
print(f"Number of labels: {num_labels}")

Number of labels: 2
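
A count of 2 rather than 4 is what one would expect if the label frames already contain 0/1 indicator columns (as the `row[col] == 1` check in `encode_labels` suggests): `MultiLabelBinarizer` treats every value in a row as a label name, so the label set becomes {0, 1}. The sketch below imitates that fit/transform logic in plain Python on hypothetical rows; it is not the scikit-learn implementation, and `fit_classes` and `transform` are illustrative names.

```python
def fit_classes(rows):
    """Collect the sorted set of all values seen in any row."""
    return sorted({value for row in rows for value in row})

def transform(rows, classes):
    """One indicator column per class: 1 if the class occurs in the row."""
    return [[int(cls in row) for cls in classes] for row in rows]

# Hypothetical rows that are already one-hot over four frames
rows = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 1]]

classes = fit_classes(rows)
print(classes)                    # [0, 1]
print(transform(rows, classes))   # [[1, 1], [1, 1], [1, 1]]
```

Under that assumption the frames are already in multi-label format, and `y_train.values` could be used directly, giving one column per frame.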


Load the model again, and encode the descriptions in X.

In [65]:
# Load my tuned model
model = AutoModelForSequenceClassification.from_pretrained(save_directory).to(device)

In [66]:
# Encode the texts
max_length = 512  # Maximum length for BERT
train_encodings = tokenizer(X_train['Description'].tolist(), truncation=True, padding=True, max_length=max_length)
val_encodings = tokenizer(X_val['Description'].tolist(), truncation=True, padding=True, max_length=max_length)
test_encodings = tokenizer(X_test['Description'].tolist(), truncation=True, padding=True, max_length=max_length)

Define a custom dataset class.

In [67]:
# Define the custom dataset class
class MyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx], dtype=torch.long) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.float)  # Change to float for BCEWithLogitsLoss
        return item

    def __len__(self):
        return len(self.labels)

Create dataset objects. The shape of the target (the ground-truth labels) must be exactly the same as the shape of the model's output logits. https://discuss.pytorch.org/t/bcewithlogitsloss-with-bert-valueerror-target-size-torch-size-68-1-1-must-be-the-same-as-input-size/146037

In [None]:
train_dataset = MyDataset(train_encodings, y_train_binarized)
val_dataset = MyDataset(val_encodings, y_val_binarized)
test_dataset = MyDataset(test_encodings, y_test_binarized)

Define and configure the custom trainer. Change the loss function to BCEWithLogitsLoss, which applies a sigmoid to each logit independently, instead of a softmax, which makes the classes compete so that only the highest-scoring one is selected. https://szuyuchu.medium.com/multi-label-text-classification-with-bert-52fa78eddb9, https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html

In [69]:
# Custom Trainer to use BCEWithLogitsLoss
class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):  # **kwargs absorbs extra arguments passed by newer Trainer versions
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.get("logits")
        loss_fct = torch.nn.BCEWithLogitsLoss()
        loss = loss_fct(logits, labels)
        return (loss, outputs) if return_outputs else loss
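
The difference between the two activations can be seen on a single row of logits. The sketch below uses plain Python and made-up logit values: softmax forces the four scores to compete and sum to 1, whereas the sigmoid scores each label on its own, so any number of labels can pass a 0.5 threshold. The last function is the binary cross-entropy for one example, which `BCEWithLogitsLoss` averages over all elements by default.

```python
import math

def softmax(logits):
    exps = [math.exp(x - max(logits)) for x in logits]  # shift for stability
    return [e / sum(exps) for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over the labels of one example."""
    losses = [-(t * math.log(sigmoid(x)) + (1 - t) * math.log(1 - sigmoid(x)))
              for x, t in zip(logits, targets)]
    return sum(losses) / len(losses)

logits = [2.0, 1.5, -1.0, -3.0]    # made-up scores for four frames
targets = [1.0, 1.0, 0.0, 0.0]     # two frames present at once

probs = softmax(logits)
single_label = max(range(len(logits)), key=probs.__getitem__)
multi_label = [int(sigmoid(x) > 0.5) for x in logits]

print(single_label)                 # 0
print(multi_label)                  # [1, 1, 0, 0]
print(bce_with_logits(logits, targets))
```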

Tokenize the test data and make predictions

In [72]:
# Tokenize a list of texts into padded, truncated PyTorch tensors
def tokenize_data(text_list):
    inputs = tokenizer(text_list, return_tensors='pt', max_length=512, truncation=True, padding=True)
    return inputs

In [74]:
# Tokenize the test data
inputs = tokenize_data(X_test['Description'].tolist())

In [75]:
# Move model to GPU if available
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)
model.eval()

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [81]:
# Make predictions
predictions = predict(inputs)
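
The cell that defines `predict` is not included in this notebook export. As a sketch of what such a helper has to do (split the inputs into batches, run a forward pass, and keep the highest-scoring class per row), the following uses plain lists and a stand-in `forward` function. Both `predict_classes` and `fake_forward` are hypothetical names; in the notebook the forward pass would be a call such as `model(**batch).logits` inside `torch.no_grad()`.

```python
def predict_classes(rows, forward, batch_size=16):
    """Run `forward` batch by batch and return the argmax class per row."""
    predictions = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        logits = forward(batch)                      # one list of scores per row
        for scores in logits:
            predictions.append(max(range(len(scores)), key=scores.__getitem__))
    return predictions

def fake_forward(batch):
    """Stand-in for a model: scores class (len(text) % 4) highest."""
    return [[1.0 if cls == len(text) % 4 else 0.0 for cls in range(4)]
            for text in batch]

texts = ["a", "bb", "ccc", "dddd", "eeeee"]
print(predict_classes(texts, fake_forward, batch_size=2))   # [1, 2, 3, 0, 1]
```

Processing the texts in batches keeps memory use bounded, which matters when every test text is padded to up to 512 tokens.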

In [90]:
# Convert predictions into a DataFrame with binary columns
predictions_binary = pd.get_dummies(predictions, prefix='class')
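
One detail to check here: with `prefix='class'`, `pd.get_dummies` names its columns `class_<value>`, which will not match the frame names in `y_test.columns`. A minimal sketch in plain Python, assuming `predictions` holds integer class IDs and using a hypothetical `id2label` mapping, of a one-hot encoding keyed by label name:

```python
# Hypothetical mapping and predictions; the real ones come from the notebook
id2label = {0: 'Conflict', 1: 'Economic Consequences', 2: 'Human Interest', 3: 'Morality'}
predictions = [2, 0, 2, 3]

# One 0/1 column per frame name, including frames that were never predicted
one_hot = {
    label: [int(pred == idx) for pred in predictions]
    for idx, label in id2label.items()
}

for label, column in one_hot.items():
    print(label, column)
```

With pandas, the same effect could be obtained by mapping the IDs through `id2label` before calling `pd.get_dummies` without a prefix.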

In [94]:
# Ensure predictions_binary has the same columns as y_test
for column in y_test.columns:
    if column not in predictions_binary:
        predictions_binary[column] = 0

# Compute and print classification report for each class
for idx, class_label in enumerate(y_test.columns):
    y_true = y_test[class_label]
    y_pred = predictions_binary.get(class_label, pd.Series(0, index=y_test.index))

    print(f"Classification Report for {class_label}:")
    print(classification_report(y_true, y_pred, target_names=[f"not_{class_label}", class_label]))
    print("\n")

Classification Report for Conflict:
              precision    recall  f1-score   support

not_Conflict       0.66      1.00      0.80        33
    Conflict       0.00      0.00      0.00        17

    accuracy                           0.66        50
   macro avg       0.33      0.50      0.40        50
weighted avg       0.44      0.66      0.52        50



Classification Report for Economic Consequences:
                           precision    recall  f1-score   support

not_Economic Consequences       0.74      1.00      0.85        37
    Economic Consequences       0.00      0.00      0.00        13

                 accuracy                           0.74        50
                macro avg       0.37      0.50      0.43        50
             weighted avg       0.55      0.74      0.63        50



Classification Report for Human Interest:
                    precision    recall  f1-score   support

not_Human Interest       0.42      1.00      0.59        21
    Human Intere

  _warn_prf(average, modifier, msg_start, len(result))


The classification reports show that the transformer is acceptable at predicting the absence of a frame, especially for Economic Consequences (accuracy 0.74). However, precision and recall for the presence of a frame are 0.00 in the reports for Conflict and Economic Consequences, i.e. these frames were never predicted as present. This poor performance on the presence of frames makes the BERT model a poor candidate for the final classification task.