# Text Classification with Transformers (ALBERT)

This script helps you fine-tune a pre-trained encoder model (ALBERT) for text classification with a dataset from the Hugging Face Hub.

The use case is binary classification: the model learns to tell clickbait from factual content, using a synthetic dataset found [here](https://huggingface.co/datasets/ilsilfverskiold/clickbait_titles_synthetic_data). This script follows a tutorial that you can find here.

You may use any encoder model such as BERT, RoBERTa and DeBERTa instead.

In [1]:
!pip install -U datasets
!pip install -U accelerate
!pip install -U transformers
!pip install -U huggingface_hub

Collecting datasets
  Downloading datasets-2.20.0-py3-none-any.whl (547 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-16.1.0-cp310-cp310-manylinux_2_28_x86_64.whl (40.8 MB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting requests>=2.32.2 (from datasets)
  Downloading requests-2.32.3-py3-none-any.whl (6

In [4]:
import pandas as pd

good = pd.read_csv('./good.csv')
bad = pd.read_csv('./bad.csv')

ds = pd.concat([good, bad], axis=0)
ds

| Unnamed: 0 | sentence | label |
|---|---|---|
| 0 | ( ( ( variable ) - [ variable ] - number + var... | good |
| 1 | [ number number - variable number + variable v... | good |
| 2 | ( [ variable number * ] - variable ) / ( numbe... | good |
| 3 | number * [ variable ] + ( number / ( number ) ... | good |
| 4 | ( ( number ) ) + number + variable - variable | good |
| ... | ... | ... |
| 19995 | variable * [ variable number variable variable... | bad |
| 19996 | [ variable number variable * * ] / [ ] number ... | bad |
| 19997 | ( [ number number / variable number - - number... | bad |
| 19998 | variable * [ variable variable variable + vari... | bad |


In [5]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(ds, test_size=0.2)

In [6]:
train

| Unnamed: 0 | sentence | label |
|---|---|---|
| 15956 | ( variable ) ] variable | bad |
| 1087 | ( variable - number variable / [ variable numb... | bad |
| 3735 | number / ( ( variable * variable - [ number ] ... | good |
| 12796 | number * ( variable ) / variable variable vari... | bad |
| 3031 | number / [ variable ] - [ number variable + nu... | good |
| ... | ... | ... |
| 1446 | ( variable + variable - - variable ) - number ... | bad |
| 16345 | [ number number + - ] | bad |
| 2709 | ( [ number ] * variable / variable ) * number ... | bad |
| 6559 | number * [ variable number - variable number v... | good |


In [7]:
from datasets import Dataset

dataset = Dataset.from_pandas(ds)

dataset = dataset.train_test_split(test_size=0.2)
dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', '__index_level_0__'],
        num_rows: 32000
    })
    test: Dataset({
        features: ['sentence', 'label', '__index_level_0__'],
        num_rows: 8000
    })
})

Import the dataset you'll be training on. The clickbait dataset has a 'text' field and a 'label' field, whereas the CSV data loaded above uses 'sentence' and 'label', so be sure to tweak the script to match your column names.

In [8]:
# from datasets import load_dataset, DatasetDict

# dataset = load_dataset("ilsilfverskiold/clickbait_titles_synthetic_data")
# dataset

Decide on your pre-trained model along with your new model's name.

In [9]:
model_name = "albert/albert-base-v2"
your_path = 'nomi'

Check the distribution of the labels (optional).

In [10]:
from collections import Counter

train_label_distribution = Counter(dataset['train']['label'])
test_label_distribution = Counter(dataset['test']['label'])

print("Training Label Distribution:", train_label_distribution)
print("Test Label Distribution:", test_label_distribution)

Training Label Distribution: Counter({'good': 16022, 'bad': 15978})
Test Label Distribution: Counter({'bad': 4022, 'good': 3978})
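
As a quick sanity check on the counts printed above, the class shares can be computed directly. This is a minimal sketch that hard-codes the printed training counts; `class_shares` and `balance_ratio` are hypothetical names, not part of the notebook.

```python
from collections import Counter

# Training counts copied from the output above.
train_label_distribution = Counter({"good": 16022, "bad": 15978})


def class_shares(counts):
    """Return each label's fraction of the total number of examples."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


shares = class_shares(train_label_distribution)

# Ratio of the rarest class to the most common one; 1.0 means perfectly balanced.
balance_ratio = min(train_label_distribution.values()) / max(train_label_distribution.values())

for label, share in shares.items():
    print(f"{label}: {share:.2%}")
print(f"balance ratio: {balance_ratio:.3f}")
```

A ratio this close to 1.0 suggests that plain accuracy is a reasonable headline metric here; with a strongly skewed split, the per-label metrics set up later would matter more.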


Create a label encoder that converts categorical labels to a standardized numerical format. Labels in their original categorical form (e.g., 'clickbait', 'factual') need to be converted into numerical values so that they can be processed by the algorithms.

In [11]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

label_encoder.fit(dataset['train']['label'])

def encode_labels(example):
    return {'encoded_label': label_encoder.transform([example['label']])[0]}

for split in dataset:
    dataset[split] = dataset[split].map(encode_labels, batched=False)

Map:   0%|          | 0/32000 [00:00<?, ? examples/s]

Map:   0%|          | 0/8000 [00:00<?, ? examples/s]
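
To make the encoding step concrete, here is a plain-Python sketch of what the label encoder does for string labels: it keeps the sorted unique classes and uses each class's position as its integer ID. The sample labels are hypothetical, and this is an illustration of the idea rather than scikit-learn's implementation.

```python
# Hypothetical sample of raw labels, in the same form as the 'label' column.
raw_labels = ["good", "bad", "good", "bad", "bad"]

# Sorted unique classes; a class's index in this list is its integer ID.
classes = sorted(set(raw_labels))
label_to_id = {label: idx for idx, label in enumerate(classes)}

# Forward transform: label string -> integer ID.
encoded = [label_to_id[label] for label in raw_labels]

# Inverse transform: integer ID -> label string.
decoded = [classes[idx] for idx in encoded]

print(classes)  # ['bad', 'good']
print(encoded)  # [1, 0, 1, 0, 0]
```

Because the classes are sorted alphabetically, 'bad' maps to 0 and 'good' to 1, which is the same ordering the `id2label` mapping uses.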

The id2label and label2id mappings in AutoConfig tell the model which label name belongs to which ID, so that inference returns the actual label names rather than their numerical representations.

In [12]:
from transformers import AutoConfig

unique_labels = sorted(list(set(dataset['train']['label'])))
id2label = {i: label for i, label in enumerate(unique_labels)}
label2id = {label: i for i, label in enumerate(unique_labels)}

config = AutoConfig.from_pretrained(model_name)
config.id2label = id2label
config.label2id = label2id

# Verify the correct labels
print("ID to Label Mapping:", config.id2label)
print("Label to ID Mapping:", config.label2id)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/684 [00:00<?, ?B/s]

ID to Label Mapping: {0: 'bad', 1: 'good'}
Label to ID Mapping: {'bad': 0, 'good': 1}
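
The sketch below shows, under simplifying assumptions, how such a mapping is used at inference time: the classifier emits one score (logit) per class, the highest-scoring index is chosen, and `id2label` turns that index back into a name. The logits are made-up values for illustration, not real model output.

```python
import math

# Mapping copied from the output above.
id2label = {0: "bad", 1: "good"}


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a single input, one value per class ID.
logits = [-1.2, 2.3]

probs = softmax(logits)
pred_id = max(range(len(probs)), key=probs.__getitem__)  # argmax over class IDs
pred_label = id2label[pred_id]

print(pred_label, round(probs[pred_id], 3))
```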


The next cell loads a tokenizer and a model from the Hugging Face Transformers library. Here we use ALBERT; if you want to use another model and its matching tokenizer, you can use AutoTokenizer and AutoModelForSequenceClassification instead.

In [13]:
from transformers import AlbertForSequenceClassification, AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained(model_name)
model = AlbertForSequenceClassification.from_pretrained(model_name, config=config)

tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/760k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.31M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/47.4M [00:00<?, ?B/s]

Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at albert/albert-base-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [14]:
#@title testing tokenizer
tokenizer("The quick brown fox jumped.")

{'input_ids': [2, 14, 2231, 886, 2385, 4298, 9, 3], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

The next cell first filters out rows whose text is not a string, then tokenizes the text and attaches the encoded labels, preparing the dataset for training the transformer model.

In [15]:
def filter_invalid_content(example):
    return isinstance(example['sentence'], str)

dataset = dataset.filter(filter_invalid_content, batched=False)

def encode_data(batch):
    tokenized_inputs = tokenizer(batch["sentence"], padding=True, truncation=True, max_length=256)
    tokenized_inputs["labels"] = batch["encoded_label"]
    return tokenized_inputs

dataset_encoded = dataset.map(encode_data, batched=True)
dataset_encoded

Filter:   0%|          | 0/32000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/8000 [00:00<?, ? examples/s]

Map:   0%|          | 0/32000 [00:00<?, ? examples/s]

Map:   0%|          | 0/8000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', '__index_level_0__', 'encoded_label', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 32000
    })
    test: Dataset({
        features: ['sentence', 'label', '__index_level_0__', 'encoded_label', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 8000
    })
})

In [16]:
dataset_encoded.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

The DataCollatorWithPadding ensures that all input sequences in a batch are padded to the same length, using the padding logic defined by the tokenizer.

In [17]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer)
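
The idea behind per-batch padding can be sketched in plain Python. This is an illustration only, not the collator's actual code: `pad_batch` is a hypothetical helper, padding on the right is assumed, and the pad ID of 0 is an assumption you should check against `tokenizer.pad_token_id`.

```python
PAD_TOKEN_ID = 0  # assumption: verify with tokenizer.pad_token_id


def pad_batch(sequences, pad_id=PAD_TOKEN_ID):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([
    [2, 14, 2231, 886, 2385, 4298, 9, 3],  # IDs from the tokenizer test above
    [2, 14, 2231, 3],                      # hypothetical shorter sequence
])

print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

Padding per batch rather than to a global maximum keeps short batches short, which saves computation when most inputs are well under `max_length`.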

Next we set up a LabelEncoder to decode the label IDs and define a function that computes per-label accuracy from a confusion matrix. When we train the model, we want to see the accuracy for each label as well as the averaged metrics. This is most relevant if you have more than two labels and one of them is underperforming.

In [18]:
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, confusion_matrix
import numpy as np

label_encoder = LabelEncoder()
label_encoder.fit(unique_labels)

def per_label_accuracy(y_true, y_pred, labels):
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    correct_predictions = cm.diagonal()
    label_totals = cm.sum(axis=1)
    per_label_acc = np.divide(correct_predictions, label_totals, out=np.zeros_like(correct_predictions, dtype=float), where=label_totals != 0)
    return dict(zip(labels, per_label_acc))
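
To see what this function returns, here is a plain-Python sketch that computes the same quantity without a confusion matrix: for each label, the share of examples with that true label that were predicted correctly (in other words, per-class recall). The helper name and the toy data are hypothetical.

```python
def per_label_accuracy_plain(y_true, y_pred, labels):
    """Fraction of each label's true examples that were predicted correctly."""
    result = {}
    for label in labels:
        # Row sum of the confusion matrix: how many examples truly have this label.
        total = sum(1 for t in y_true if t == label)
        # Diagonal entry: how many of those were predicted as this label.
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        result[label] = correct / total if total else 0.0
    return result


# Toy example: one 'bad' expression is misclassified as 'good'.
y_true = ["bad", "bad", "bad", "bad", "good", "good"]
y_pred = ["bad", "bad", "bad", "good", "good", "good"]

acc = per_label_accuracy_plain(y_true, y_pred, ["bad", "good"])
print(acc)  # {'bad': 0.75, 'good': 1.0}
```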

Next we set up our compute_metrics function. Here I've set up several metrics, but you may reduce them if need be. You can read more on these metrics [here](https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9).

In [19]:
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)

    decoded_labels = label_encoder.inverse_transform(labels)
    decoded_preds = label_encoder.inverse_transform(preds)

    precision = precision_score(decoded_labels, decoded_preds, average='weighted')
    recall = recall_score(decoded_labels, decoded_preds, average='weighted')
    f1 = f1_score(decoded_labels, decoded_preds, average='weighted')
    acc = accuracy_score(decoded_labels, decoded_preds)

    labels_list = list(label_encoder.classes_)
    per_label_acc = per_label_accuracy(decoded_labels, decoded_preds, labels_list)

    per_label_acc_metrics = {}
    for label, accuracy in per_label_acc.items():
        label_key = f"accuracy_label_{label}"
        per_label_acc_metrics[label_key] = accuracy

    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall,
        **per_label_acc_metrics
    }

Lastly, we set up our training arguments to train the model. I'm following the paper ["How to Fine-Tune BERT for Text Classification?"](https://arxiv.org/abs/1905.05583) on epochs, batch size and learning rate, but do play around with them if you want to.

While the model is training, keep an eye on the training loss and the validation loss. Both should decrease consistently. If the validation loss rises consistently while the training loss keeps falling, you may be overfitting your model, and you can try decreasing the number of epochs.
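
One way to make the "validation loss keeps rising" check concrete is a small patience rule, sketched below in plain Python. The function names, the patience value and the loss values are all hypothetical; the Transformers library also ships an `EarlyStoppingCallback` that serves a similar purpose inside the Trainer.

```python
def evals_since_best(val_losses):
    """Number of evaluations that have passed since the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return len(val_losses) - 1 - best_index


def should_stop(val_losses, patience=3):
    """True if the validation loss has not improved for `patience` evaluations."""
    return evals_since_best(val_losses) >= patience


# Hypothetical validation-loss histories, one value per evaluation.
improving = [0.67, 0.54, 0.51, 0.48]
overfitting = [0.67, 0.54, 0.51, 0.53, 0.56, 0.60]

print(should_stop(improving))    # False
print(should_stop(overfitting))  # True
```

A single uptick is usually noise, which is why the rule waits for several evaluations without improvement before flagging a problem.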

In [20]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir=your_path,
    num_train_epochs=3,
    warmup_steps=500,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    logging_steps=10,
    evaluation_strategy='steps',  # named eval_strategy in newer transformers releases
    eval_steps=100,
    learning_rate=2e-5,
    save_steps=1000,
    gradient_accumulation_steps=2
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_encoded['train'],
    eval_dataset=dataset_encoded['test'],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()



| Step | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall | Accuracy Label Bad | Accuracy Label Good |
|---|---|---|---|---|---|---|---|---|
| 100 | 0.6635 | 0.674897 | 0.565875 | 0.563717 | 0.566775 | 0.565875 | 0.635505 | 0.495475 |
| 200 | 0.5634 | 0.541414 | 0.708375 | 0.695777 | 0.751726 | 0.708375 | 0.504724 | 0.914279 |
| 300 | 0.5328 | 0.511726 | 0.733375 | 0.722121 | 0.780517 | 0.733375 | 0.531825 | 0.937154 |
| 400 | 0.5 | 0.484158 | 0.75225 | 0.747357 | 0.774726 | 0.75225 | 0.612382 | 0.893665 |
| 500 | 0.508 | 0.491237 | 0.752875 | 0.750899 | 0.761951 | 0.752875 | 0.662854 | 0.843891 |
| 600 | 0.4665 | 0.395003 | 0.821125 | 0.818871 | 0.839047 | 0.821125 | 0.708354 | 0.935143 |
| 700 | 0.4007 | 0.448386 | 0.815625 | 0.813655 | 0.830557 | 0.815625 | 0.711586 | 0.920814 |
| 800 | 0.4194 | 0.392513 | 0.827875 | 0.82295 | 0.870702 | 0.827875 | 0.660119 | 0.997486 |
| 900 | 0.3601 | 0.349403 | 0.843125 | 0.839952 | 0.874148 | 0.843125 | 0.701144 | 0.986677 |
| 1000 | 0.3148 | 0.30949 | 0.85925 | 0.856561 | 0.889827 | 0.85925 | 0.721034 | 0.998994 |


TrainOutput(global_step=3000, training_loss=0.3186349188089371, metrics={'train_runtime': 1125.3458, 'train_samples_per_second': 85.307, 'train_steps_per_second': 2.666, 'total_flos': 332382267434880.0, 'train_loss': 0.3186349188089371, 'epoch': 3.0})
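
The `global_step=3000` in the output above follows from the training arguments. The sketch below reproduces that arithmetic and also models the learning-rate curve, assuming a single device and a linear warmup followed by linear decay to zero; both are assumptions about this run, and `linear_schedule_lr` is a hypothetical helper rather than a library function.

```python
import math

# Values taken from the training arguments and dataset size above.
num_train_examples = 32000
per_device_train_batch_size = 16
gradient_accumulation_steps = 2
num_train_epochs = 3
warmup_steps = 500
learning_rate = 2e-5
num_devices = 1  # assumption: training on a single GPU

# Examples contributing to each optimizer update.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Forward/backward passes per epoch, then optimizer updates per epoch.
batches_per_epoch = math.ceil(num_train_examples / (per_device_train_batch_size * num_devices))
steps_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_steps = steps_per_epoch * num_train_epochs


def linear_schedule_lr(step):
    """Linear warmup to the peak rate, then linear decay to zero (assumed schedule)."""
    if step < warmup_steps:
        return learning_rate * step / warmup_steps
    return learning_rate * max(0.0, (total_steps - step) / (total_steps - warmup_steps))


print(effective_batch_size, steps_per_epoch, total_steps)  # 32 1000 3000
print(linear_schedule_lr(250), linear_schedule_lr(500), linear_schedule_lr(3000))
```

Under these assumptions the warmup covers the first sixth of training, and an evaluation every 100 steps gives 30 evaluation points over the whole run.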

Once you're finito, you can evaluate the results, save your model and the state.

In [21]:
trainer.evaluate()
trainer.save_model(your_path)
trainer.save_state()

In [25]:
trainer.evaluate()

{'eval_loss': 0.15486373007297516,
 'eval_accuracy': 0.9545,
 'eval_f1': 0.9544256067036531,
 'eval_precision': 0.957944931445212,
 'eval_recall': 0.9545,
 'eval_accuracy_label_bad': 0.9117354549975136,
 'eval_accuracy_label_good': 0.997737556561086,
 'eval_runtime': 17.2561,
 'eval_samples_per_second': 463.605,
 'eval_steps_per_second': 28.975,
 'epoch': 3.0}
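
Note that `eval_recall` equals `eval_accuracy` in the output above. That is expected with `average='weighted'`: weighting each label's recall by its share of the examples gives back the overall accuracy. The sketch below checks this using the per-label values from the evaluation output and the test label counts printed earlier.

```python
# Test-set counts per label, from the label distribution printed earlier.
support = {"bad": 4022, "good": 3978}

# Per-label accuracy (per-class recall), copied from the evaluation output above.
per_label_recall = {"bad": 0.9117354549975136, "good": 0.997737556561086}

total = sum(support.values())

# Weighted average: each label's recall times its share of the test set.
weighted_recall = sum(per_label_recall[label] * support[label] / total for label in support)

print(round(weighted_recall, 4))  # 0.9545
```

The weighted precision and F1 do not collapse in the same way, which is why they differ slightly from the accuracy.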

If you want to test it out, you can run the pipeline directly with the model. I just used some new example titles to see how it did.

In [23]:
from transformers import pipeline
pipe = pipeline('text-classification', model='nomi')

In [24]:
example_titles = [
    "( [ variable variable variable - + ] + number / number - number )",
    "variable - variable * variable + variable / variable + [ variable ] * number + variable",
    "variable",
    "( [ number variable variable variable - / + variable variable number variable * * - number - variable * + variable / ] )",
    "number",
    "]",
    "number",
    "variable variable",
    "] variable",
    "variable / number - ( [ ) variable variable number / variable variable number + / number / / * ] )"
]

for title in example_titles:
    result = pipe(title)
    print(f"Title: {title}")
    print(f"Output: {result[0]['label']}")

Title: ( [ variable variable variable - + ] + number / number - number )
Output: good
Title: variable - variable * variable + variable / variable + [ variable ] * number + variable
Output: good
Title: variable
Output: bad
Title: ( [ number variable variable variable - / + variable variable number variable * * - number - variable * + variable / ] )
Output: good
Title: number
Output: bad
Title: ]
Output: bad
Title: number
Output: bad
Title: variable variable
Output: bad
Title: ] variable
Output: bad
Title: variable / number - ( [ ) variable variable number / variable variable number + / number / / * ] )
Output: bad
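
Each pipeline result is a list of dictionaries with a `label` and a `score`, and the loop above only prints the label. If you also want to flag low-confidence predictions, a small wrapper like the one below is one option. The helper name, the threshold and the example results are hypothetical and hand-written in the pipeline's output format; they are not real model output.

```python
def label_with_threshold(result, threshold=0.8, fallback="uncertain"):
    """Return the top label, or a fallback when its score is below the threshold."""
    top = result[0]
    return top["label"] if top["score"] >= threshold else fallback


# Hypothetical results in the shape returned by a text-classification pipeline.
confident = [{"label": "good", "score": 0.97}]
borderline = [{"label": "bad", "score": 0.55}]

print(label_with_threshold(confident))   # good
print(label_with_threshold(borderline))  # uncertain
```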


If you're satisfied, you can log in to HuggingFace with a token (you'll find these in your account under Settings - make sure it has write access).

In [None]:
# !huggingface-cli login

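Push the model to the Hub. `trainer.push_to_hub` treats its first argument as the commit message and pushes to a repository named after the output directory you set when you trained the model, so whatever you put here doesn't change the model's name.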

In [None]:
# tokenizer.push_to_hub("username/classify-clickbait")
# trainer.push_to_hub("username/classify-clickbait")

Now you're done. You have your text classifier.

In [26]:
!zip -r /content/nomi_model.zip /content/nomi

  adding: content/nomi/ (stored 0%)
  adding: content/nomi/spiece.model (deflated 49%)
  adding: content/nomi/training_args.bin (deflated 51%)
  adding: content/nomi/model.safetensors (deflated 7%)
  adding: content/nomi/trainer_state.json (deflated 82%)
  adding: content/nomi/runs/ (stored 0%)
  adding: content/nomi/runs/Jun24_01-44-01_8cc35e069ebb/ (stored 0%)
  adding: content/nomi/runs/Jun24_01-44-01_8cc35e069ebb/events.out.tfevents.1719194584.8cc35e069ebb.1465.1 (deflated 52%)
  adding: content/nomi/runs/Jun24_01-44-01_8cc35e069ebb/events.out.tfevents.1719193442.8cc35e069ebb.1465.0 (deflated 69%)
  adding: content/nomi/config.json (deflated 53%)
  adding: content/nomi/checkpoint-3000/ (stored 0%)
  adding: content/nomi/checkpoint-3000/spiece.model (deflated 49%)
  adding: content/nomi/checkpoint-3000/training_args.bin (deflated 51%)
  adding: content/nomi/checkpoint-3000/scheduler.pt (deflated 56%)
  adding: content/nomi/checkpoint-3000/rng_state.pth (deflated 25%)
  adding: conte

In [27]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
