<a href="https://colab.research.google.com/github/saturnMars/FM_2025/blob/main/Lab1_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from os import path
import pandas as pd
import tarfile

# Getting the labelled datasets for:
- ***binary* classification**:
    1. **Truthfulness** (True/false claims)
        - *[The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets](https://github.com/saprmarks/geometry-of-truth/tree/main)*
    2. **Subjectivity** (subjective/objective sentences)
        - [CLEF 2025, Task 1 - Subjectivity](https://checkthat.gitlab.io/clef2025/task1/)
- ***multiclass* classification**:
    3. **Tense** (past/present/future)
        - [EnglishTense: A large scale English texts dataset categorized into three categories: Past, Present, Future tenses.](https://data.mendeley.com/datasets/jnb2xp9m4r/2)
    4. **Language** (utterances from multiple languages)
        - [MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages](https://github.com/alexa/massive)



In [None]:
!wget https://data.mendeley.com/public-files/datasets/jnb2xp9m4r/files/8148432a-a69a-473f-beb6-835d2a176f30/file_downloaded

In [None]:
# (1) TRUTHFULNESS (The Geometry of Truth; TRUE|FALSE)
truthfulness_df = pd.read_csv("https://raw.githubusercontent.com/saprmarks/geometry-of-truth/refs/heads/main/datasets/counterfact_true_false.csv")
truthfulness_df = truthfulness_df[['statement', 'label']].rename(columns = {'statement':'doc'})

# (2) SUBJECTIVITY (CLEF2025; SUB|OBJ)
subjectivity_df = pd.concat([
    pd.read_csv("https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/raw/main/task1/data/english/train_en.tsv", sep= '\t'),
    pd.read_csv("https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/raw/main/task1/data/english/dev_en.tsv", sep= '\t'),
    pd.read_csv("https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/raw/main/task1/data/english/dev_test_en.tsv", sep= '\t'),
    pd.read_csv("https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/raw/main/task1/data/english/test_en_labeled.tsv", sep= '\t'),
])
subjectivity_df = subjectivity_df[['sentence', 'label']].rename(columns = {'sentence':'doc'})

# (3) TENSE (EnglishTense; past|present|future)
tense_df = pd.read_excel("https://prod-dcd-datasets-public-files-eu-west-1.s3.eu-west-1.amazonaws.com/28304dc7-a47c-4d83-bdcc-2edc535236d8").rename(columns = {'Sentence':'doc', 'Labels':'label'})
tense_df['label'] = tense_df['label'].str.upper() # Workaround for a bug in the dataset labels

# (4) LANGUAGE (MASSIVE; EN/IT/DE/ES)
!wget https://amazon-massive-nlu-dataset.s3.amazonaws.com/amazon-massive-dataset-1.1.tar.gz
dfs = []
with tarfile.open("amazon-massive-dataset-1.1.tar.gz", "r:gz") as tar:
    for lang in ['en-US', 'it-IT', 'de-DE', 'es-ES']:
        # Tar member names always use '/', regardless of the operating system
        dfs.append(pd.read_json(tar.extractfile(f'1.1/data/{lang}.jsonl'), lines = True))
language_df = pd.concat(dfs)[['utt', 'locale']].rename(columns = {'utt':'doc', 'locale': 'label'})

--2025-09-26 10:49:42--  https://amazon-massive-nlu-dataset.s3.amazonaws.com/amazon-massive-dataset-1.1.tar.gz
Resolving amazon-massive-nlu-dataset.s3.amazonaws.com (amazon-massive-nlu-dataset.s3.amazonaws.com)... 52.217.174.153, 16.182.96.145, 3.5.25.184, ...
Connecting to amazon-massive-nlu-dataset.s3.amazonaws.com (amazon-massive-nlu-dataset.s3.amazonaws.com)|52.217.174.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 40251390 (38M) [application/x-gzip]
Saving to: ‘amazon-massive-dataset-1.1.tar.gz.30’


2025-09-26 10:49:43 (35.6 MB/s) - ‘amazon-massive-dataset-1.1.tar.gz.30’ saved [40251390/40251390]



# Data exploration

In [None]:
# TRUTHFULNESS dataset
print('-' * 30, 'TRUTHFULNESS', '-' * 30)
print(f"CLASSES ({truthfulness_df['label'].nunique()}):", '|'.join(truthfulness_df['label'].map(str).unique()), '\n')
print(truthfulness_df)

# SUBJECTIVITY dataset
print('-' * 30, 'SUBJECTIVITY', '-' * 30)
print(f"CLASSES ({subjectivity_df['label'].nunique()}):", '|'.join(subjectivity_df['label'].unique()), '\n')
print(subjectivity_df)

# TENSE dataset
print('-' * 30, 'TENSE', '-' * 30)
print(f"CLASSES ({tense_df['label'].nunique()}):", '|'.join(tense_df['label'].unique()), '\n')
print(tense_df)

# LANGUAGE dataset
print('-' * 30, 'LANGUAGE', '-' * 30)
print(f"CLASSES ({language_df['label'].nunique()}):", '|'.join(language_df['label'].unique()), '\n')
print(language_df)

------------------------------ TRUTHFULNESS ------------------------------
CLASSES (2): 1|0 

                                                     doc  label
0      The mother tongue of Danielle Darrieux is French.      1
1      The mother tongue of Danielle Darrieux is Engl...      0
2      The official religion of Edwin of Northumbria ...      1
3      The official religion of Edwin of Northumbria ...      0
4      The mother tongue of Thomas Joannes Stieltjes ...      1
...                                                  ...    ...
31959          Jerusalem of Gold was written in Finnish.      0
31960  The language used by Jean-Pierre Dionnet is Fr...      1
31961  The language used by Jean-Pierre Dionnet is Sp...      0
31962                             Subair works as actor.      1
31963                          Subair works as composer.      0

[31964 rows x 2 columns]
------------------------------ SUBJECTIVITY ------------------------------
CLASSES (2): SUBJ|OBJ 

             

## Create *Dataset*, *DataLoader* (PyTorch), and *DataModule* (PyTorch Lightning) for training
1. PyTorch: [Dataset/DataLoader](https://docs.pytorch.org/tutorials/beginner/basics/data_tutorial.html)
2. PyTorch Lightning: [DataModule](https://lightning.ai/docs/pytorch/stable/data/datamodule.html)


In [None]:
from torch.utils.data import Dataset, DataLoader, random_split
import torch

In [None]:
class MyDataset(Dataset):
    def __init__(self, df:pd.DataFrame):

        # Create our inputs (X)
        self.inputs = df['doc'].values

        # Convert the textual labels into numbers (CLASS A --> 0, CLASS B --> 1, ...)
        self.class_mapping = {label: i for i, label in enumerate(df['label'].unique())}

        # Create our outputs (y)
        self.targets = df['label'].map(self.class_mapping).values

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]
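
The label encoding in `MyDataset` can be illustrated without pandas. The following is a minimal sketch, assuming a small hypothetical list of labels; `build_class_mapping` is an illustrative helper and not part of the notebook. Like `df['label'].unique()`, it assigns ids in order of first appearance, so the resulting ids depend on the row order of the data.

```python
def build_class_mapping(labels):
    """Map each distinct label to an integer id, in order of first appearance."""
    # dict.fromkeys keeps insertion order and drops duplicates
    return {label: i for i, label in enumerate(dict.fromkeys(labels))}


# Hypothetical labels, standing in for df['label']
labels = ["PAST", "PRESENT", "PAST", "FUTURE", "PRESENT"]

class_mapping = build_class_mapping(labels)
targets = [class_mapping[label] for label in labels]

# Inverse mapping, used to turn predicted ids back into readable labels
label_mapping = {i: label for label, i in class_mapping.items()}

print(class_mapping)  # {'PAST': 0, 'PRESENT': 1, 'FUTURE': 2}
print(targets)        # [0, 1, 0, 2, 1]
```

Because the ids follow the data order, the same mapping should be stored and reused whenever predictions are decoded.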


In [None]:
!pip install lightning



In [None]:
from lightning import LightningDataModule

In [None]:
class MyDataModule(LightningDataModule):
    def __init__(self, data: Dataset, batch_size: int = 32, val_size:float = 0.1, test_size:float = 0.1):
        super().__init__()

        # Initialize the variables
        self.data = data
        self.batch_size = batch_size

        self.train_size = 1 - val_size - test_size
        self.val_size = val_size
        self.test_size = test_size

        # Set the seed for reproducibility
        self.random_seed = 101

    def setup(self, stage:str):

        # Create the splits
        train_set, val_set, test_set = random_split(
            dataset = self.data,
            generator = torch.Generator().manual_seed(self.random_seed),
            lengths = [self.train_size, self.val_size, self.test_size])

        self.train_set = train_set
        self.val_set = val_set
        self.test_set = test_set

        print('\nINPUTS:', len(self.data), '--> TRAIN:', round(((len(self.train_set) / len(self.data)) * 100), 1), '%',
              '|| VALIDATION:', round(((len(self.val_set) / len(self.data)) * 100), 1), '%',
              '|| TEST:', round(((len(self.test_set) / len(self.data)) * 100), 1), '%', '\n')

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size = self.batch_size, shuffle = True)

    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size = self.batch_size, shuffle = False)

    def test_dataloader(self):
        return DataLoader(self.test_set, batch_size = self.batch_size, shuffle = False)
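
When `random_split` receives fractions instead of absolute sizes, the PyTorch documentation describes the sizes as the floor of each share, with leftover items handed out one at a time in round-robin order. The arithmetic can be sketched in plain Python; `split_lengths` is a hypothetical helper that only mirrors that description, and 13316 is the size of the tense dataset reported in the output below.

```python
import math


def split_lengths(n_items, fractions):
    """Turn fractions into integer split sizes: floor each share, then
    distribute the leftover items in round-robin order."""
    lengths = [math.floor(n_items * frac) for frac in fractions]
    remainder = n_items - sum(lengths)
    for i in range(remainder):
        lengths[i % len(lengths)] += 1
    return lengths


sizes = split_lengths(13316, [0.8, 0.1, 0.1])
print(sizes)  # [10653, 1332, 1331]
```

The sizes always add up to the dataset length, so no example is dropped by the split.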

# Explore the dataloader

In [None]:
# Initialize the dataset and the dataloaders
dataset = MyDataset(tense_df)
dataloaders = MyDataModule(dataset, batch_size = 32, val_size = 0.1, test_size = 0.1)

# Create the splits and explore the train loader
dataloaders.setup('')
train_set = dataloaders.train_dataloader()

# Get the first batch (without iterating over the whole loader)
x, y = next(iter(train_set))
label_mapping = {v: k for k, v in dataset.class_mapping.items()}
for x_item, y_item in zip(x, y):
    print("DOC:", x_item, "--> CLASS:", y_item.item(), f'({label_mapping[y_item.item()]})')


INPUTS: 13316 --> TRAIN: 80.0 % || VALIDATION: 10.0 % || TEST: 10.0 % 

DOC: the scientists conducted experiments and made significant discoveries --> CLASS: 2 (PAST)
DOC: they attended a conference on islamic studies --> CLASS: 2 (PAST)
DOC: the playful kittens bat at a dangling feather toy mesmerized by its movement --> CLASS: 1 (PRESENT)
DOC: the baby was sleeping with a soft blanket wrapped around her --> CLASS: 2 (PAST)
DOC: progressive disclosure principles enhance user engagement by revealing information gradually --> CLASS: 1 (PRESENT)
DOC: problemsolving skills are honed through challenging academic projects --> CLASS: 1 (PRESENT)
DOC: creators have been sharing their journey and milestones with the audience --> CLASS: 1 (PRESENT)
DOC: i was sleeping with my arm draped over the edge of the bed --> CLASS: 2 (PAST)
DOC: continuous updates and patches address security vulnerabilities --> CLASS: 1 (PRESENT)
DOC: engineers will be developing advanced materials for space exploratio

# Create our custom model for classification: a frozen LLM with a Multi-Layer Perceptron (MLP)

In [None]:
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

In [None]:
class Network(nn.Module):
    def __init__(self, llm_name:str, num_classes: int):
        super().__init__()
        self.num_classes = num_classes

        # Set the latent dimension
        self.latent_dim = 512

        # Set the probability for the dropout layer
        self.drop_p = 0.3

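        # Load the foundation language model
        self.tokenizer = AutoTokenizer.from_pretrained(llm_name)
        self.llm = AutoModel.from_pretrained(llm_name)

        # Freeze the LLM: only the custom layers below are trained
        for param in self.llm.parameters():
            param.requires_grad = False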

        # Define the token for padding (reuse EOS only when the tokenizer has none) and its direction
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        self.tokenizer.padding_side = 'left'

        # Create our custom layers
        self.decoder_layer = nn.Sequential(

            # Layer 0
            nn.Linear(self.llm.config.hidden_size, self.latent_dim),
            nn.ReLU(),
            nn.LayerNorm(self.latent_dim),
            nn.Dropout(self.drop_p),

            # Layer 1
            nn.Linear(self.latent_dim, self.latent_dim),
            nn.ReLU(),
            nn.LayerNorm(self.latent_dim)
        )

        # Create our output layer
        self.output_layer = nn.Sequential(
            nn.LayerNorm(self.latent_dim),
            nn.Linear(self.latent_dim, self.num_classes) # Raw class scores (logits) [batch_size, num_classes]
        )

    def forward(self, x):

        # Tokenize the textual document (x)
        input_ids = self.tokenizer(x, padding = True, return_tensors = 'pt').to(self.llm.device)

        # Process the tokenized document using the frozen LLM, keeping the output of every layer
        llm_output = self.llm(**input_ids, output_hidden_states = True)

        # Get the embeddings from the median hidden layer.
        # hidden_states is a tuple with one tensor per layer (plus the embedding output),
        # each of shape [batch_size, seq_len, hidden_dim]; with left padding, position -1
        # is the last real token of every document
        median_layer = self.llm.config.num_hidden_layers // 2
        embeddings = llm_output.hidden_states[median_layer][:, -1, :]

        # Learn the latent features from the LLM embeddings
        out = self.decoder_layer(embeddings)

        # Output layer: class logits (nn.CrossEntropyLoss applies the softmax internally)
        out = self.output_layer(out)

        return out
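
The tokenizer above is configured to pad on the left. One reason this matters for decoder-only models such as GPT-2 is that a fixed position such as the last one then always holds a real token rather than padding. The following is a minimal sketch with hypothetical token ids; `pad_left` and `pad_right` are illustrative helpers, not tokenizer APIs.

```python
def pad_left(sequences, pad_id):
    """Pad every sequence on the left to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [[pad_id] * (max_len - len(seq)) + list(seq) for seq in sequences]


def pad_right(sequences, pad_id):
    """Pad every sequence on the right to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]


PAD = 0
batch = [[11, 12, 13, 14], [21, 22]]  # hypothetical token ids of two documents

left = pad_left(batch, PAD)
right = pad_right(batch, PAD)

# Token found at the last position of every row
last_left = [row[-1] for row in left]
last_right = [row[-1] for row in right]

print(left)        # [[11, 12, 13, 14], [0, 0, 21, 22]]
print(last_left)   # [14, 22]  -> always a real token
print(last_right)  # [14, 0]   -> padding for the shorter document
```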

### Why do we consider the median hidden layer?

# Define the loss function and the training process

In [None]:
from lightning import LightningModule
from torchmetrics.classification import F1Score, Precision, Recall
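
The classifier below tracks precision, recall and F1 with `average="macro"`: each metric is computed per class and then averaged with equal weight, so small classes count as much as large ones. The following plain-Python sketch shows that computation on hypothetical predictions; `macro_scores` is an illustrative helper, and edge cases (for example, classes that never occur) may be handled differently by torchmetrics.

```python
def macro_scores(preds, targets, num_classes):
    """Per-class precision, recall and F1, averaged with equal class weight."""
    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        pairs = list(zip(preds, targets))
        tp = sum(p == c and t == c for p, t in pairs)
        fp = sum(p == c and t != c for p, t in pairs)
        fn = sum(p != c and t == c for p, t in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return (sum(precisions) / num_classes,
            sum(recalls) / num_classes,
            sum(f1s) / num_classes)


# Hypothetical imbalanced example: class 1 is never predicted
targets = [0, 0, 0, 0, 1, 2]
preds = [0, 0, 0, 0, 0, 2]

accuracy = sum(p == t for p, t in zip(preds, targets)) / len(targets)
precision, recall, f1 = macro_scores(preds, targets, num_classes=3)

print(round(accuracy, 3))                                   # 0.833
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.6 0.667 0.63
```

Accuracy looks high here because the majority class dominates, while the macro scores drop because one class is missed entirely.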

In [None]:
class Classifier(LightningModule):
    def __init__(self, llm_name:str, num_classes: int, lr:float):
        super().__init__()

        # Unpack the configs
        self.lr = lr

        # Load our custom model
        self.model = Network(llm_name, num_classes)

        # Define the loss function
        self.loss_function = nn.CrossEntropyLoss()

        # Define the classification metrics for the train, validation and test sets
        self.train_f1 = F1Score(task="multiclass", num_classes = num_classes, average="macro")
        self.train_precision = Precision(task="multiclass", num_classes = num_classes, average="macro")
        self.train_recall = Recall(task="multiclass", num_classes = num_classes, average="macro")

        self.val_f1 = F1Score(task="multiclass", num_classes = num_classes, average="macro")
        self.val_precision = Precision(task="multiclass", num_classes = num_classes, average="macro")
        self.val_recall = Recall(task="multiclass", num_classes = num_classes, average="macro")

        self.test_f1 = F1Score(task="multiclass", num_classes = num_classes, average="macro")
        self.test_precision = Precision(task="multiclass", num_classes = num_classes, average="macro")
        self.test_recall = Recall(task="multiclass", num_classes = num_classes, average="macro")

        self.val_history = {'F1': [], 'precision': [], 'recall': []}

    # Define the optimizer
    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.parameters(), lr = self.lr)
        return optimizer

    def forward(self, x):
        return self.model(x)

    def _step(self, batch, batch_idx):

        # Unpack the batch
        docs, labels = batch

        # Forward pass
        outputs = self(docs)

        # Compute the loss
        loss = self.loss_function(outputs, labels.flatten())

        # Get the most likely class
        preds = torch.argmax(outputs, dim=1)

        return loss, preds, labels

    def training_step(self, batch, batch_idx):
        loss, preds, labels  = self._step(batch, batch_idx)

        # Compute the classification metrics
        self.train_precision.update(preds, labels)
        self.train_recall.update(preds, labels)
        self.train_f1.update(preds, labels)

        # Log metrics
        self.log('train_loss', loss, on_step=False, on_epoch=True, prog_bar=True)
        self.log('train_precision', self.train_precision, on_step=False, on_epoch=True, prog_bar=True)
        self.log('train_recall', self.train_recall, on_step=False, on_epoch=True, prog_bar=True)
        self.log('train_f1', self.train_f1, on_step=False, on_epoch=True, prog_bar=True)

        return loss

    def validation_step(self, batch, batch_idx):
        with torch.inference_mode():
            loss, preds, labels  = self._step(batch, batch_idx)

        # Compute the classification metrics
        self.val_precision.update(preds, labels)
        self.val_recall.update(preds, labels)
        self.val_f1.update(preds, labels)

        # Log metrics
        self.log('val_loss', loss, on_step=False, on_epoch=True, prog_bar=True)
        self.log('val_precision', self.val_precision, on_step=False, on_epoch=True, prog_bar=True)
        self.log('val_recall', self.val_recall, on_step=False, on_epoch=True, prog_bar=True)
        self.log('val_f1', self.val_f1, on_step=False, on_epoch=True, prog_bar=True)

    def on_validation_epoch_end(self):
        self.val_history['precision'].append(self.trainer.callback_metrics["val_precision"].item())
        self.val_history['recall'].append(self.trainer.callback_metrics["val_recall"].item())
        self.val_history['F1'].append(self.trainer.callback_metrics["val_f1"].item())

        # Visualize the values
        df = pd.DataFrame(self.val_history)
        if len(df) > 1:
            plot_values(df, epoch_number = self.current_epoch + 1)

    def test_step(self, batch, batch_idx):
        with torch.inference_mode():
            _, preds, labels  = self._step(batch, batch_idx)

        # Compute the classification metrics
        self.test_precision.update(preds, labels)
        self.test_recall.update(preds, labels)
        self.test_f1.update(preds, labels)

        # Log metrics
        self.log('test_precision', self.test_precision, on_epoch=True, prog_bar=True)
        self.log('test_recall', self.test_recall, on_epoch=True, prog_bar=True)
        self.log('test_f1', self.test_f1, on_epoch=True, prog_bar=True)
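
`nn.CrossEntropyLoss` applies a log-softmax to its input, so it expects unnormalized scores (logits) rather than probabilities. The following numeric sketch uses hypothetical scores and plain-Python helpers (`softmax`, `cross_entropy`) that only mimic the per-example computation; it shows what happens when probabilities are passed instead.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, target):
    """Per-example cross-entropy on raw scores: -log(softmax(scores)[target])."""
    return -math.log(softmax(scores)[target])


logits = [4.0, 0.0, 0.0]  # hypothetical confident and correct prediction for class 0
target = 0

loss_from_logits = cross_entropy(logits, target)
loss_from_probs = cross_entropy(softmax(logits), target)  # softmax applied twice

# Best case when probabilities are fed in: a perfect one-hot prediction
loss_floor = cross_entropy([1.0, 0.0, 0.0], target)

print(round(loss_from_logits, 3))  # 0.036
print(round(loss_from_probs, 3))   # 0.574
print(round(loss_floor, 3))        # 0.551
```

With three classes, the loss computed on probabilities cannot fall below roughly 0.55 even for a perfect prediction, which weakens the training signal; the predicted class (the argmax) is the same in both cases.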


In [None]:
from matplotlib.ticker import MaxNLocator
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
def plot_values(df, epoch_number):
    colors = {'F1': 'tab:blue', 'precision': 'tab:green', 'recall': 'tab:orange'}

    # Plot the metrics as lines
    sns.lineplot(data = df, palette = colors, marker = 'o')

    # Graphical settings
    ax = plt.gca()
    ax.xaxis.set_major_locator(MaxNLocator(integer=True))
    ax.set_title(f'VALIDATION (epoch {epoch_number})')
    ax.grid(True)
    ax.set_ylim(0, 1)
    ax.legend(title="Metric")

    plt.show()

# Train our custom neural models with PyTorch Lightning
Available datasets: `truthfulness_df` | `subjectivity_df` | `tense_df` | `language_df`

In [None]:
num_epochs = 5
data = tense_df

Candidate foundation models (Hugging Face Hub identifiers):
1. EleutherAI's Pythia
    - EleutherAI/pythia-160m
    - EleutherAI/pythia-1.4b
    - EleutherAI/pythia-6.9b
2. Meta AI's Llama
    - meta-llama/Llama-3.1-8B
    - meta-llama/Llama-3.2-1B
3. OpenAI's GPT-2
    - openai-community/gpt2-medium
    - openai-community/gpt2-xl
4. Google's BERT
    - google-bert/bert-base-uncased

In [None]:
model = Classifier(
    llm_name = 'openai-community/gpt2-medium',
    num_classes = data['label'].nunique(),
    lr = 1e-3)
dataloaders = MyDataModule(MyDataset(data), batch_size = 32, val_size = 0.1, test_size = 0.1)

In [None]:
from lightning import Trainer

In [None]:
trainer = Trainer(max_epochs = num_epochs)
trainer.fit(model, datamodule=dataloaders)

INFO: 💡 Tip: For seamless cloud uploads and versioning, try installing [litmodels](https://pypi.org/project/litmodels/) to enable LitModelCheckpoint, which syncs automatically with the Lightning model registry.
INFO: GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO: HPU available: False, using: 0 HPUs
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]



INPUTS: 13316 --> TRAIN: 80.0 % || VALIDATION: 10.0 % || TEST: 10.0 % 



INFO: 
   | Name            | Type                | Params | Mode 
-----------------------------------------------------------------
0  | model           | Network             | 355 M  | train
1  | loss_function   | CrossEntropyLoss    | 0      | train
2  | train_f1        | MulticlassF1Score   | 0      | train
3  | train_precision | MulticlassPrecision | 0      | train
4  | train_recall    | MulticlassRecall    | 0      | train
5  | val_f1          | MulticlassF1Score   | 0      | train
6  | val_precision   | MulticlassPrecision | 0      | train
7  | val_recall      | MulticlassRecall    | 0      | train
8  | test_f1         | MulticlassF1Score   | 0      | train
9  | test_precision  | MulticlassPrecision | 0      | train
10 | test_recall     | MulticlassRecall    | 0      | train
-----------------------------------------------------------------
355 M     Trainable params
0         Non-trainable params
355 M     Total params
1,422.461 Total estimated model params size (MB)
23        M

Sanity Checking: |          | 0/? [00:00<?, ?it/s]



Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

ValueError: The palette dictionary is missing keys: {'F1'}

# Compute metrics on the test set

In [None]:
test_metrics = trainer.test(model = model, datamodule=dataloaders)