# Text Classification with Transformers (ALBERT)

This script helps you fine-tune a pre-trained encoder model (ALBERT) for text classification with a dataset from Hugging Face.

The use case uses binary classes to produce a model to identify clickbait versus factual content with the use of a synthetic dataset found [here](https://huggingface.co/datasets/ilsilfverskiold/clickbait_titles_synthetic_data). This script follows a tutorial that you can find here.

You may use any encoder model such as BERT, RoBERTa and DeBERTa instead.

In [47]:
!pip install -U datasets
!pip install -U accelerate
!pip install -U transformers
!pip install -U huggingface_hub



In [48]:
import pandas as pd

good = pd.read_csv('./good.csv')
bad = pd.read_csv('./bad.csv')

ds = pd.concat([good, bad], axis=0)
ds

Unnamed: 0,sentence,label
0,( ( ( variable ) - [ variable ] - number + var...,good
1,[ number number - variable number + variable v...,good
2,( [ variable number * ] - variable ) / ( numbe...,good
3,number * [ variable ] + ( number / ( number ) ...,good
4,( ( number ) ) + number + variable - variable,good
...,...,...
19995,variable * [ variable number variable variable...,bad
19996,[ variable number variable * * ] / [ ] number ...,bad
19997,( [ number number / variable number - - number...,bad
19998,variable * [ variable variable variable + vari...,bad


In [49]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(ds, test_size=0.2)

In [50]:
train

Unnamed: 0,sentence,label
14307,[ number variable + number number / number num...,good
17812,[ variable number variable - variable number -...,good
11020,( number * ( variable ) ) - variable,good
15158,number * ( number ) / variable - number / [ nu...,good
4990,( number + number - variable * / number ) + nu...,bad
...,...,...
6265,[ variable number variable variable * / variab...,good
11284,variable - [ variable number variable variable...,good
18158,( number * [ number number / variable + ] + / ...,bad
860,[ variable number number number number / + num...,good


In [51]:
#@title Save the distribution
train.to_csv('train.csv', sep=',', index=False, header=True, encoding='utf-8')
test.to_csv('test.csv', sep=',', index=False, header=True, encoding='utf-8')


In [52]:
from datasets import Dataset

dataset = Dataset.from_pandas(ds)

dataset = dataset.train_test_split(test_size=0.2)
dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', '__index_level_0__'],
        num_rows: 32000
    })
    test: Dataset({
        features: ['sentence', 'label', '__index_level_0__'],
        num_rows: 8000
    })
})

Import the dataset you'll be training on. This dataset has a 'text' field and a 'label' field. Be sure to tweak the script if your column names differ.

In [53]:
# from datasets import load_dataset, DatasetDict

# dataset = load_dataset("ilsilfverskiold/clickbait_titles_synthetic_data")
# dataset

Decide on your pre-trained model along with your new model's name.

In [54]:
model_name = "albert/albert-base-v2"
your_path = 'nomi'

Look over the distribution of your labels (optional).

In [55]:
from collections import Counter

train_label_distribution = Counter(dataset['train']['label'])
test_label_distribution = Counter(dataset['test']['label'])

print("Training Label Distribution:", train_label_distribution)
print("Test Label Distribution:", test_label_distribution)

Training Label Distribution: Counter({'bad': 16043, 'good': 15957})
Test Label Distribution: Counter({'good': 4043, 'bad': 3957})


Create a label encoder that converts categorical labels to a standardized numerical format. Labels in their original categorical form (e.g., 'clickbait', 'factual') need to be converted into numerical values so that they can be processed by the algorithms.

In [56]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

label_encoder.fit(dataset['train']['label'])

def encode_labels(example):
    return {'encoded_label': label_encoder.transform([example['label']])[0]}

for split in dataset:
    dataset[split] = dataset[split].map(encode_labels, batched=False)

Map:   0%|          | 0/32000 [00:00<?, ? examples/s]

Map:   0%|          | 0/8000 [00:00<?, ? examples/s]
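As a rough illustration of what the encoder does here, the sketch below reproduces the mapping in plain Python. It assumes, as scikit-learn's `LabelEncoder` does, that the classes are sorted before being numbered. The helper names `fit_classes` and `encode` are hypothetical and not part of any library.

```python
# Plain-Python sketch of label encoding (hypothetical helpers, not scikit-learn).
def fit_classes(labels):
    """Return the sorted unique labels; a label's position is its integer id."""
    return sorted(set(labels))


def encode(labels, classes):
    """Map each label to its index in the sorted class list."""
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels]


classes = fit_classes(["good", "bad", "good", "bad", "bad"])
encoded = encode(["good", "bad", "good"], classes)
print(classes)  # ['bad', 'good']
print(encoded)  # [1, 0, 1]
```

Because the classes are sorted alphabetically, 'bad' becomes 0 and 'good' becomes 1, which is the same order the id2label mapping uses further down.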

The id2label and label2id mappings in AutoConfig tell the model which label name belongs to which numeric ID, so that inference returns the actual label names rather than their numeric representations.

In [57]:
from transformers import AutoConfig

unique_labels = sorted(list(set(dataset['train']['label'])))
id2label = {i: label for i, label in enumerate(unique_labels)}
label2id = {label: i for i, label in enumerate(unique_labels)}

config = AutoConfig.from_pretrained(model_name)
config.id2label = id2label
config.label2id = label2id

# Verify the correct labels
print("ID to Label Mapping:", config.id2label)
print("Label to ID Mapping:", config.label2id)



ID to Label Mapping: {0: 'bad', 1: 'good'}
Label to ID Mapping: {'bad': 0, 'good': 1}
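To see why these mappings matter at inference time, here is a minimal sketch of turning raw model outputs into a label name. The logits are made-up numbers for illustration, and `predict_label` is a hypothetical helper, not a Transformers function.

```python
# Sketch: pick the highest-scoring class index and look up its name.
id2label = {0: "bad", 1: "good"}
label2id = {label: i for i, label in id2label.items()}


def predict_label(logits, id2label):
    """Return the label name that belongs to the largest logit (argmax)."""
    best_index = max(range(len(logits)), key=lambda i: logits[i])
    return id2label[best_index]


example_logits = [-1.3, 2.1]  # made-up values, one score per class
predicted = predict_label(example_logits, id2label)
print(predicted)            # good
print(label2id[predicted])  # 1
```

Without the mapping stored in the config, a saved model would typically report generic names such as LABEL_0 and LABEL_1 instead.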


This cell loads a tokenizer and a model from the Hugging Face Transformers library. Here we use ALBERT. If you want to use another model, you can use AutoTokenizer and AutoModelForSequenceClassification instead, which pick the tokenizer and model classes that match the checkpoint.

In [58]:
from transformers import AlbertForSequenceClassification, AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained(model_name)
model = AlbertForSequenceClassification.from_pretrained(model_name, config=config)

Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at albert/albert-base-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [59]:
#@title testing tokenizer
tokenizer("The quick brown fox jumped.")

{'input_ids': [2, 14, 2231, 886, 2385, 4298, 9, 3], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

This next function makes sure the text data is properly tokenized and labeled, preparing the dataset for efficient training of the transformer model.

In [60]:
def filter_invalid_content(example):
    return isinstance(example['sentence'], str)

dataset = dataset.filter(filter_invalid_content, batched=False)

def encode_data(batch):
    tokenized_inputs = tokenizer(batch["sentence"], padding=True, truncation=True, max_length=256)
    tokenized_inputs["labels"] = batch["encoded_label"]
    return tokenized_inputs

dataset_encoded = dataset.map(encode_data, batched=True)
dataset_encoded

Filter:   0%|          | 0/32000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/8000 [00:00<?, ? examples/s]

Map:   0%|          | 0/32000 [00:00<?, ? examples/s]

Map:   0%|          | 0/8000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', '__index_level_0__', 'encoded_label', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 32000
    })
    test: Dataset({
        features: ['sentence', 'label', '__index_level_0__', 'encoded_label', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 8000
    })
})

In [61]:
dataset_encoded.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

The DataCollatorWithPadding ensures that all input sequences in a batch are padded to the same length, using the padding logic defined by the tokenizer.

In [62]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer)
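The padding step can be pictured with a small plain-Python sketch. It is not the collator's actual implementation: `pad_batch` is a hypothetical helper, right-side padding is assumed, and `pad_id=0` is an illustrative value (the real one comes from the tokenizer's pad token).

```python
# Sketch of dynamic padding: pad every sequence to the longest one in the batch.
def pad_batch(batch_input_ids, pad_id=0):
    """Right-pad token id lists and build matching attention masks.

    pad_id=0 is an assumption for illustration only.
    """
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[2, 14, 2231, 3], [2, 14, 3]])
print(batch["input_ids"])       # [[2, 14, 2231, 3], [2, 14, 3, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Padding only to the longest sequence in each batch generally wastes less compute than padding every example to one fixed length.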

Next we set up a LabelEncoder to decode the label IDs and define a function that computes per-label accuracy from a confusion matrix. That way, when we train the model, we see the accuracy for each label as well as the averaged metrics. This is most relevant if you have more than two labels and one of them is underperforming.

In [63]:
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, confusion_matrix
import numpy as np

label_encoder = LabelEncoder()
label_encoder.fit(unique_labels)

def per_label_accuracy(y_true, y_pred, labels):
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    correct_predictions = cm.diagonal()
    label_totals = cm.sum(axis=1)
    per_label_acc = np.divide(correct_predictions, label_totals, out=np.zeros_like(correct_predictions, dtype=float), where=label_totals != 0)
    return dict(zip(labels, per_label_acc))
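A tiny worked example may help to show what this function measures. The sketch below recomputes the same quantity without NumPy on six made-up predictions. `per_label_accuracy_sketch` is a hypothetical name. Note that dividing the diagonal of the confusion matrix by the row totals gives, for each label, the share of its true examples that were predicted correctly, which is the same thing as per-class recall.

```python
# Sketch: per-label accuracy = correct predictions for a label / true count of that label.
def per_label_accuracy_sketch(y_true, y_pred, labels):
    totals = {label: 0 for label in labels}
    correct = {label: 0 for label in labels}
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            correct[truth] += 1
    # Guard against labels that never occur in y_true (avoid division by zero).
    return {
        label: (correct[label] / totals[label] if totals[label] else 0.0)
        for label in labels
    }


# Made-up data: one of the four 'bad' examples is misclassified as 'good'.
y_true = ["bad", "bad", "bad", "bad", "good", "good"]
y_pred = ["bad", "bad", "bad", "good", "good", "good"]
result = per_label_accuracy_sketch(y_true, y_pred, ["bad", "good"])
print(result)  # {'bad': 0.75, 'good': 1.0}
```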

Next we set up our compute_metrics function. I've included several metrics here, but you may drop some if you don't need them. You can read more about these metrics [here](https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9).

In [64]:
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, confusion_matrix

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)

    decoded_labels = label_encoder.inverse_transform(labels)
    decoded_preds = label_encoder.inverse_transform(preds)

    precision = precision_score(decoded_labels, decoded_preds, average='weighted')
    recall = recall_score(decoded_labels, decoded_preds, average='weighted')
    f1 = f1_score(decoded_labels, decoded_preds, average='weighted')
    acc = accuracy_score(decoded_labels, decoded_preds)
    # cf = confusion_matrix(decoded_labels, decoded_preds)

    labels_list = list(label_encoder.classes_)
    per_label_acc = per_label_accuracy(decoded_labels, decoded_preds, labels_list)

    per_label_acc_metrics = {}
    for label, accuracy in per_label_acc.items():
        label_key = f"accuracy_label_{label}"
        per_label_acc_metrics[label_key] = accuracy

    return {
        # 'confusion': cf,
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall,
        **per_label_acc_metrics
    }
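To make the `average='weighted'` setting more concrete, here is a plain-Python sketch of support-weighted precision and recall on six made-up predictions. `weighted_scores` is a hypothetical helper written for this illustration, not a scikit-learn function. It also suggests why the Recall and Accuracy columns in the training log are identical: weighting each class's recall by its share of the true labels adds up to the overall fraction of correct predictions.

```python
# Sketch: support-weighted precision and recall from raw label lists.
def weighted_scores(y_true, y_pred, labels):
    n = len(y_true)
    precision = recall = 0.0
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)  # times the label was predicted
        support = sum(t == label for t in y_true)    # times the label is the truth
        class_precision = tp / predicted if predicted else 0.0
        class_recall = tp / support if support else 0.0
        weight = support / n  # each class counts in proportion to its support
        precision += weight * class_precision
        recall += weight * class_recall
    return precision, recall


y_true = ["bad", "bad", "bad", "bad", "good", "good"]
y_pred = ["bad", "bad", "bad", "good", "good", "good"]
precision, recall = weighted_scores(y_true, y_pred, ["bad", "good"])
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(precision, 4), round(recall, 4), round(accuracy, 4))  # 0.8889 0.8333 0.8333
```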

Lastly, we set up the training arguments and train the model. I'm following the paper ["How to Fine-Tune BERT for Text Classification?"](https://arxiv.org/abs/1905.05583) for the number of epochs, batch size and learning rate, but do play around with them if you want to.

During training, keep an eye on the training loss and the validation loss. Both should decrease fairly consistently. If the validation loss keeps increasing while the training loss keeps falling, you may be overfitting your model, and you can try reducing the number of epochs.
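One simple way to turn the "validation loss keeps increasing" rule of thumb into a check is sketched below. It counts how many evaluations in a row the validation loss has gone up. The function name `rising_streak` and the patience of 3 are assumptions chosen for illustration. The loss values are the first ten validation losses reported in this notebook's training log.

```python
# Sketch: flag possible overfitting when validation loss rises several times in a row.
def rising_streak(losses):
    """Return how many consecutive increases the list of losses ends with."""
    streak = 0
    for prev, curr in zip(losses[:-1], losses[1:]):
        streak = streak + 1 if curr > prev else 0
    return streak


val_losses = [0.690064, 0.681943, 0.574049, 0.467941, 0.482924,
              0.398077, 0.455768, 0.395029, 0.350817, 0.351705]
patience = 3  # assumed threshold; tune it to how noisy your evaluations are
should_stop = rising_streak(val_losses) >= patience
print(rising_streak(val_losses), should_stop)  # 1 False
```

Single upticks like the ones in this run are usually just noise. A sustained streak is the pattern to worry about.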

In [65]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir=your_path,
    num_train_epochs=3,
    warmup_steps=500,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    logging_steps=10,
    evaluation_strategy='steps',
    eval_steps=100,
    learning_rate=2e-5,
    save_steps=1000,
    gradient_accumulation_steps=2
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_encoded['train'],
    eval_dataset=dataset_encoded['test'],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()



Step,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall,Accuracy Label Bad,Accuracy Label Good
100,0.7012,0.690064,0.51325,0.435023,0.539252,0.51325,0.890321,0.1442
200,0.6763,0.681943,0.542625,0.469067,0.585609,0.542625,0.167551,0.909721
300,0.5259,0.574049,0.70025,0.685996,0.74074,0.70025,0.487238,0.908731
400,0.5215,0.467941,0.7695,0.763313,0.798025,0.7695,0.609047,0.92654
500,0.4867,0.482924,0.773125,0.766098,0.807184,0.773125,0.60096,0.941628
600,0.4061,0.398077,0.81675,0.814554,0.83057,0.81675,0.710134,0.921098
700,0.3861,0.455768,0.803875,0.803867,0.804158,0.803875,0.815517,0.792481
800,0.3485,0.395029,0.817875,0.814633,0.839348,0.817875,0.687642,0.945338
900,0.3699,0.350817,0.844375,0.841976,0.864333,0.844375,0.723528,0.962651
1000,0.3952,0.351705,0.846,0.842048,0.881319,0.846,0.689917,0.998763


TrainOutput(global_step=3000, training_loss=0.3416156885226568, metrics={'train_runtime': 1112.1644, 'train_samples_per_second': 86.318, 'train_steps_per_second': 2.697, 'total_flos': 328821450272640.0, 'train_loss': 0.3416156885226568, 'epoch': 3.0})
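The step count in the output above can be checked with a little arithmetic. The sketch below assumes training on a single device and models only the linear warmup phase of the learning rate, not the decay that follows it. `warmup_lr` is a hypothetical helper, not a Transformers function.

```python
import math

num_train_examples = 32000       # size of the train split shown earlier
per_device_batch_size = 16
gradient_accumulation_steps = 2
num_epochs = 3
num_devices = 1                  # assumption: one GPU

# Gradients from two batches of 16 are accumulated before each optimizer step.
effective_batch = per_device_batch_size * gradient_accumulation_steps * num_devices
steps_per_epoch = math.ceil(num_train_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs


def warmup_lr(step, peak_lr=2e-5, warmup_steps=500):
    """Learning rate during linear warmup (the later decay is not modelled)."""
    return peak_lr * min(1.0, step / warmup_steps)


print(effective_batch, steps_per_epoch, total_steps)  # 32 1000 3000
print(warmup_lr(10))  # roughly 4e-07
```

Under these assumptions the result matches `global_step=3000`, and the warmup value at step 10 agrees with the first learning rate recorded in the trainer's log history.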

Once training is finished, you can evaluate the results and save your model and the trainer state.

In [66]:
trainer.evaluate()
trainer.save_model(your_path)
trainer.save_state()

In [67]:
trainer.evaluate()

{'eval_loss': 0.1572115123271942,
 'eval_accuracy': 0.954,
 'eval_f1': 0.9538877078161438,
 'eval_precision': 0.9575470183762034,
 'eval_recall': 0.954,
 'eval_accuracy_label_bad': 0.9087692696487237,
 'eval_accuracy_label_good': 0.9982686124165224,
 'eval_runtime': 16.3612,
 'eval_samples_per_second': 488.962,
 'eval_steps_per_second': 30.56,
 'epoch': 3.0}

If you want to test it out, you can run the pipeline directly with the model. I just used some new example titles to see how it did.

In [68]:
from transformers import pipeline
pipe = pipeline('text-classification', model='nomi')

In [69]:
example_titles = [
    "( [ variable variable variable - + ] + number / number - number )",
    "variable - variable * variable + variable / variable + [ variable ] * number + variable",
    "variable",
    "( [ number variable variable variable - / + variable variable number variable * * - number - variable * + variable / ] )",
    "number",
    "]",
    "number",
    "variable variable",
    "] variable",
    "variable / number - ( [ ) variable variable number / variable variable number + / number / / * ] )"
]

for title in example_titles:
    result = pipe(title)
    print(f"Title: {title}")
    print(f"Output: {result[0]['label']}")

Title: ( [ variable variable variable - + ] + number / number - number )
Output: good
Title: variable - variable * variable + variable / variable + [ variable ] * number + variable
Output: good
Title: variable
Output: bad
Title: ( [ number variable variable variable - / + variable variable number variable * * - number - variable * + variable / ] )
Output: good
Title: number
Output: bad
Title: ]
Output: bad
Title: number
Output: bad
Title: variable variable
Output: bad
Title: ] variable
Output: bad
Title: variable / number - ( [ ) variable variable number / variable variable number + / number / / * ] )
Output: bad
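Besides the label, each pipeline result also carries a score. As far as I understand it, for a single-label classifier like this one the score is the softmax probability of the winning class. The sketch below shows that calculation in plain Python with made-up logits. It is an illustration of the idea, not the pipeline's actual code.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "bad", 1: "good"}
logits = [0.2, 2.4]  # made-up values, one per class
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
result = {"label": id2label[best], "score": probs[best]}
print(result["label"], round(result["score"], 3))
```

A score close to 0.5 would suggest the model is unsure, which is worth checking for very short inputs such as "variable" or "]".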


If you're satisfied, you can log in to HuggingFace with a token (you'll find these in your account under Settings - make sure it has write access).

In [70]:
# !huggingface-cli login

Push the model with your new name for it. It usually just takes the name you set when you trained it so whatever you put here doesn't matter.

In [71]:
# tokenizer.push_to_hub("username/classify-clickbait")
# trainer.push_to_hub("username/classify-clickbait")

Now you're done. You've got your text classifier.

In [72]:
!zip -r /content/nomi_model.zip /content/nomi

  adding: content/nomi/ (stored 0%)
  adding: content/nomi/model.safetensors (deflated 7%)
  adding: content/nomi/training_args.bin (deflated 51%)
  adding: content/nomi/trainer_state.json (deflated 82%)
  adding: content/nomi/runs/ (stored 0%)
  adding: content/nomi/runs/Jun25_01-25-45_41e797489ec8/ (stored 0%)
  adding: content/nomi/runs/Jun25_01-25-45_41e797489ec8/events.out.tfevents.1719278745.41e797489ec8.2725.0 (deflated 61%)
  adding: content/nomi/runs/Jun25_01-26-58_41e797489ec8/ (stored 0%)
  adding: content/nomi/runs/Jun25_01-26-58_41e797489ec8/events.out.tfevents.1719278818.41e797489ec8.2725.1 (deflated 68%)
  adding: content/nomi/runs/Jun25_01-34-20_41e797489ec8/ (stored 0%)
  adding: content/nomi/runs/Jun25_01-34-20_41e797489ec8/events.out.tfevents.1719279261.41e797489ec8.2725.2 (deflated 69%)
  adding: content/nomi/runs/Jun25_01-34-20_41e797489ec8/events.out.tfevents.1719280389.41e797489ec8.2725.3 (deflated 53%)
  adding: content/nomi/config.json (deflated 53%)
  adding: 

In [73]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [76]:
import matplotlib.pyplot as plt
import numpy as np

# Assuming trainer.train() has been called and log history is available
log_history = trainer.state.log_history
log_history

[{'loss': 0.6975,
  'grad_norm': 2.845897912979126,
  'learning_rate': 4.0000000000000003e-07,
  'epoch': 0.01,
  'step': 10},
 {'loss': 0.7233,
  'grad_norm': 4.1713337898254395,
  'learning_rate': 8.000000000000001e-07,
  'epoch': 0.02,
  'step': 20},
 {'loss': 0.7085,
  'grad_norm': 3.223047971725464,
  'learning_rate': 1.2000000000000002e-06,
  'epoch': 0.03,
  'step': 30},
 {'loss': 0.7093,
  'grad_norm': 4.125215530395508,
  'learning_rate': 1.6000000000000001e-06,
  'epoch': 0.04,
  'step': 40},
 {'loss': 0.6812,
  'grad_norm': 3.508023738861084,
  'learning_rate': 2.0000000000000003e-06,
  'epoch': 0.05,
  'step': 50},
 {'loss': 0.6962,
  'grad_norm': 4.207041263580322,
  'learning_rate': 2.4000000000000003e-06,
  'epoch': 0.06,
  'step': 60},
 {'loss': 0.6891,
  'grad_norm': 6.273722171783447,
  'learning_rate': 2.8000000000000003e-06,
  'epoch': 0.07,
  'step': 70},
 {'loss': 0.7023,
  'grad_norm': 3.1793785095214844,
  'learning_rate': 3.2000000000000003e-06,
  'epoch': 0.08

In [75]:


# Extracting metrics. log_history mixes three kinds of entries:
# training logs ('loss'), evaluation logs ('eval_*') and a final
# summary ('train_loss'), so filter by key instead of indexing every entry.
train_logs = [metric for metric in log_history if 'loss' in metric]
eval_logs = [metric for metric in log_history if 'eval_loss' in metric]

train_steps = [metric['step'] for metric in train_logs]
train_loss = [metric['loss'] for metric in train_logs]
eval_steps = [metric['step'] for metric in eval_logs]
eval_loss = [metric['eval_loss'] for metric in eval_logs]
eval_accuracy = [metric['eval_accuracy'] for metric in eval_logs]
eval_f1 = [metric['eval_f1'] for metric in eval_logs]

# Plotting
plt.figure(figsize=(12, 6))

# Plot training and validation loss against the global step
plt.subplot(2, 1, 1)
plt.plot(train_steps, train_loss, label='Training Loss', marker='o')
plt.plot(eval_steps, eval_loss, label='Validation Loss', marker='x')
plt.xlabel('Steps')
plt.ylabel('Loss')
plt.title('Training and Validation Loss')
plt.legend()

# Plot evaluation metrics
plt.subplot(2, 1, 2)
plt.plot(eval_steps, eval_accuracy, label='Validation Accuracy', marker='o')
plt.plot(eval_steps, eval_f1, label='Validation F1 Score', marker='x')
plt.xlabel('Steps')
plt.ylabel('Score')
plt.title('Validation Metrics')
plt.legend()

plt.tight_layout()
plt.show()