# Multi-Label Text Classification Training Pipeline

This pipeline implements an efficient multi-label text classifier using a pre-trained language model (Gemma3-1b) with LoRA optimization. The goal is to predict multiple binary labels simultaneously for each text input, making it suitable for tasks like sentiment analysis with multiple aspects or document tagging.

In [1]:
import os
import numpy as np
import pandas as pd

import torch
import torch.nn as nn

from sklearn.metrics import precision_recall_fscore_support, classification_report

from datasets import Dataset, DatasetDict

from transformers import (
    AutoTokenizer,
    AutoModel,
    Trainer,
    TrainingArguments,
    EarlyStoppingCallback,
)

from transformers.modeling_outputs import SequenceClassifierOutput

from peft import LoraConfig, get_peft_model

In [2]:
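# Token redacted: never hard-code credentials in a notebook.
# Supply your own token here, or load it from a secret store before running.
os.environ["HF_TOKEN"] = "<REDACTED>"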

In [3]:
# Model & columns
MODEL_ID  = "google/gemma-3-1b-it"
TEXT_COL   = "comprehensive_review"
LABEL_COLS = ["is_ad", "is_rant", "is_relevant"]
num_labels = len(LABEL_COLS)

# Toggle LoRA (True to enable efficient fine-tuning adapters)
USE_LORA = True

# Device
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", device)

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, token=os.environ.get("HF_TOKEN"))
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

Device: cuda


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



## Stage 1: Data Preprocessing and Validation

**Text and Label Cleaning**

We start by loading the CSV data and ensuring data quality. The preprocessing converts all label values to consistent binary format (0 or 1), handling various input formats like strings ("true", "yes") or boolean values. Empty or missing text entries are removed to prevent training issues.

**Train-Validation Split**

The dataset is randomly shuffled and split into 90% training and 10% validation sets. This ensures the model can be evaluated on unseen data during training to monitor for overfitting and select the best checkpoint.

**Tokenization**

Text is converted into numerical tokens that the language model can process. We use a maximum sequence length of 256 tokens to balance between capturing sufficient context and maintaining training efficiency. The tokenizer also adds attention masks to handle variable-length sequences properly.
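
To make the truncation, padding and attention-mask steps concrete, here is a minimal sketch in plain Python. It does not call the real tokenizer: the token ids, the `pad_id` value and the helper name `pad_and_truncate` are hypothetical, and right-padding is an assumption (a real tokenizer may pad on the other side and will add special tokens).

```python
def pad_and_truncate(batch_ids, max_len, pad_id=0):
    """Truncate each sequence to max_len, then right-pad to the longest one in the batch."""
    truncated = [ids[:max_len] for ids in batch_ids]
    width = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = width - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = [[5, 8, 13], [7], [2, 4, 6, 8, 10, 12]]  # hypothetical token ids
enc = pad_and_truncate(toy_batch, max_len=4)
print(enc["input_ids"])
print(enc["attention_mask"])
```

The longest toy sequence is cut to four ids and the shorter ones are padded, so every row ends up the same width. This is what allows the rows to be stacked into a single tensor.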

In [4]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [5]:
# Load your CSV
CSV_PATH = "/content/drive/MyDrive/all_reviews_with_labels_and_features.csv"
df = pd.read_csv(CSV_PATH)

# Ensure required columns exist
missing = [c for c in [TEXT_COL] + LABEL_COLS if c not in df.columns]
if missing:
    raise KeyError(f"Missing columns in CSV: {missing}")

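# Coerce labels to {0,1}
def to01(x):
    """Map strings, booleans and numbers to 0/1; missing values become 0."""
    if pd.isna(x):
        return 0  # int(bool(float("nan"))) would otherwise evaluate to 1
    if isinstance(x, str):
        x = x.strip().lower()
        return 1 if x in {"1","true","t","yes","y"} else 0
    return int(bool(x))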

for c in LABEL_COLS:
    df[c] = df[c].apply(to01).astype(np.int32)

# Drop empty / NaN text
df = df[df[TEXT_COL].notna()]
df = df[df[TEXT_COL].astype(str).str.strip().ne("")].reset_index(drop=True)

print("Rows after cleaning:", len(df))
display(df[[TEXT_COL] + LABEL_COLS].head(3))

# Shuffle & split
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)
split = int(0.9 * len(df))
df_train = df.iloc[:split].reset_index(drop=True)
df_val   = df.iloc[split:].reset_index(drop=True)

# Convert to HF datasets
ds_train = Dataset.from_pandas(df_train, preserve_index=False)
ds_val   = Dataset.from_pandas(df_val,   preserve_index=False)
dataset  = DatasetDict(train=ds_train, validation=ds_val)

dataset

Rows after cleaning: 11920


|   | comprehensive_review | is_ad | is_rant | is_relevant |
|---|---|---|---|---|
| 0 | [Business] Auto Spa Speedy Wash - Harvester, M... | 0 | 0 | 1 |
| 1 | [Business] Kmart \| [Category] ['Discount store... | 1 | 1 | 1 |
| 2 | [Business] Papa’s Pizza \| [Category] ['Pizza r... | 0 | 0 | 1 |


DatasetDict({
    train: Dataset({
        features: ['review_text', 'rating', 'has_photo', 'author_name', 'user_review_count', 'business_name', 'category', 'source', 'review_id', 'comprehensive_review', 'is_ad', 'is_relevant', 'is_rant', 'is_legit', 'average_score', 'has_phone', 'has_link', 'has_email', 'average_rating', 'rating_discrepancy'],
        num_rows: 10728
    })
    validation: Dataset({
        features: ['review_text', 'rating', 'has_photo', 'author_name', 'user_review_count', 'business_name', 'category', 'source', 'review_id', 'comprehensive_review', 'is_ad', 'is_relevant', 'is_rant', 'is_legit', 'average_score', 'has_phone', 'has_link', 'has_email', 'average_rating', 'rating_discrepancy'],
        num_rows: 1192
    })
})

In [6]:
MAX_LEN = 256  # reduce memory use vs 512

def preprocess(batch):
    enc = tokenizer(batch[TEXT_COL], truncation=True, padding=True, max_length=MAX_LEN)
    labels = np.stack([batch[c] for c in LABEL_COLS], axis=1).astype("float32")
    enc["labels"] = labels
    return enc

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset["train"].column_names)

print("Tokenized columns (train):", tokenized["train"].column_names)
assert TEXT_COL not in tokenized["train"].column_names

# quick type check
sample = tokenized["train"][0]
for k, v in sample.items():
    if k != "labels":
        assert isinstance(v, list), f"{k} should be list[int]"
    else:
        assert isinstance(v, list) and len(v) == num_labels, f"labels should be list[float] len={num_labels}"
print("✅ Tokenized sample OK")

Tokenized columns (train): ['input_ids', 'attention_mask', 'labels']
✅ Tokenized sample OK


In [7]:
# Compute per-label positive rates on TRAIN split only
y_train = np.stack(tokenized["train"]["labels"])  # shape (N, num_labels)
pos_rate = y_train.mean(axis=0)                   # per-label positive rate
pos_weight = (1.0 - pos_rate) / (pos_rate + 1e-8)
pos_weight_t = torch.tensor(pos_weight, dtype=torch.float32, device=device)

print("Positive rates per label:", dict(zip(LABEL_COLS, np.round(pos_rate, 4))))
print("pos_weight per label:",     dict(zip(LABEL_COLS, np.round(pos_weight, 4))))

Positive rates per label: {'is_ad': np.float64(0.0327), 'is_rant': np.float64(0.0779), 'is_relevant': np.float64(0.9645)}
pos_weight per label: {'is_ad': np.float64(29.5641), 'is_rant': np.float64(11.8325), 'is_relevant': np.float64(0.0368)}


## Stage 2: Model Architecture Design

**Base Language Model Selection**

We use Gemma3-1B (`google/gemma-3-1b-it`) as the backbone for this version of model training. It is a relatively small but capable model that provides a good balance between performance and training speed. The model comes pre-trained on large text corpora, giving us strong linguistic representations.

**LoRA (Low-Rank Adaptation) Integration**

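Instead of fine-tuning all model parameters (which would be slow and memory-intensive), we apply LoRA adapters. This technique freezes the original model weights and trains only small low-rank "adapter" matrices added to selected projection layers (here the attention projections `q_proj`, `v_proj`, `o_proj` and the MLP projections `gate_proj`, `up_proj`, `down_proj`). In the run below, only about 1.23% of the parameters are trainable (12,463,491 of 1,012,349,443), with the aim of retaining most of the performance of full fine-tuning.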
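
The parameter saving follows from simple arithmetic: for a frozen weight of shape `d_out x d_in`, a rank-`r` LoRA adapter adds two matrices of shapes `r x d_in` and `d_out x r`. The sketch below works this out for one hypothetical layer (the 1024 x 1024 shape is an illustrative assumption, not the real Gemma dimensions) and then recomputes the trainable fraction from the counts printed by the training run.

```python
def lora_extra_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameters of the frozen dense weight the adapter is attached to."""
    return d_in * d_out


# Hypothetical layer shape, chosen only for illustration
d_in, d_out, r = 1024, 1024, 16
extra = lora_extra_params(d_in, d_out, r)
ratio = extra / full_params(d_in, d_out)
print(f"adapter params: {extra:,} ({100 * ratio:.3f}% of the dense weight)")

# Counts reported by the training run (these include the classifier head)
trainable, total = 12_463_491, 1_012_349_443
trainable_pct = 100 * trainable / total
print(f"trainable share of the whole model: {trainable_pct:.2f}%")
```

Raising `r` increases adapter capacity linearly, so the rank trades parameter count against expressiveness.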

**Classification Head Architecture**

The model outputs are processed through several layers:

* Mean Pooling: Converts variable-length token sequences into fixed-size vectors by averaging embeddings across the sequence length (respecting attention masks)
* Dropout Layer: Provides regularization to prevent overfitting
* Linear Classifier: Maps the pooled representation to logits for each label
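
As a small worked example of the mean pooling step, the sketch below averages the token vectors of a single sequence while skipping padded positions. It uses plain Python lists instead of tensors, and the vectors and the helper name `masked_mean_pool` are made up for illustration.

```python
def masked_mean_pool(hidden, mask):
    """Average token vectors of one sequence, counting only positions where mask == 1."""
    dim = len(hidden[0])
    summed = [0.0] * dim
    for vec, m in zip(hidden, mask):
        for j in range(dim):
            summed[j] += vec[j] * m  # padded tokens contribute zero
    count = max(sum(mask), 1e-6)  # guard against an all-padding sequence
    return [s / count for s in summed]


hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]  # last vector belongs to a padding token
mask = [1, 1, 0]
pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

Without the mask, the padding vector would pull the average towards `[9.0, 9.0]`. With it, only the two real tokens are averaged.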


In [12]:
class MeanPooler(nn.Module):
    def forward(self, last_hidden_state, attention_mask):
        mask = attention_mask.unsqueeze(-1)
        summed = (last_hidden_state * mask).sum(1)
        counts = mask.sum(1).clamp(min=1e-6)
        return summed / counts


# Multi-label Classifier With LoRA Optimisation
# We use LoRA adaptors for efficient fine-tuning of a pre-trained gemma-3-1b model
# and we include drop-out layers for regularisation to prevent overfitting
class OptimizedClassifier(nn.Module):
    def __init__(self, model_id: str, num_labels: int, token=None):
        super().__init__()

        # Load base model with optimizations
        self.base = AutoModel.from_pretrained(
            model_id,
            token=token,
            torch_dtype=torch.float16,  # Memory optimization
        )

        # Apply LoRA for efficiency as it freezes the base model and only trains
        # small adaptor layers, finetuning the model
        lora_config = LoraConfig(
            r=16,  # Slightly higher rank for better performance
            lora_alpha=32,
            lora_dropout=0.1,
            target_modules=["q_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
            bias="none",
            task_type="FEATURE_EXTRACTION",
        )
        self.base = get_peft_model(self.base, lora_config)

        self.config = self.base.config
        self.pool = MeanPooler()
        self.classifier = nn.Linear(self.config.hidden_size, num_labels)
        self.dropout = nn.Dropout(0.1)

    def forward(self, input_ids=None, attention_mask=None, labels=None, **kwargs):
        outputs = self.base(input_ids=input_ids, attention_mask=attention_mask, **kwargs)
        last_hidden_state = outputs.last_hidden_state

        pooled = self.pool(last_hidden_state, attention_mask)
        pooled = self.dropout(pooled)
        # Generating logits for each label
        logits = self.classifier(pooled)

        loss = None
        if labels is not None:
            # Multi-label loss: independent binary cross-entropy per label.
            # Note: no pos_weight is passed here, so the loss is not class-balanced.
            loss_fct = nn.BCEWithLogitsLoss()
            loss = loss_fct(logits, labels)

        return SequenceClassifierOutput(loss=loss, logits=logits)

## Stage 3: Multi-Label Loss Function

**Binary Cross-Entropy with Logits**

Unlike multi-class classification (where each sample has exactly one label), multi-label classification treats each label as an independent binary decision. We use `BCEWithLogitsLoss` which:

* Applies sigmoid activation internally to convert logits to probabilities
* Calculates binary cross-entropy loss for each label independently
* Allows each sample to have zero, one, or multiple positive labels

**Class Imbalance Considerations**

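The current implementation uses standard, unweighted BCE loss. The per-label `pos_weight` tensor computed from the training split (`pos_weight_t`) is not yet passed to the loss; passing it to `BCEWithLogitsLoss` would up-weight the positive examples of rare labels such as `is_ad` and `is_rant`.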
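
As a rough illustration of what a positive-class weight does, the sketch below re-implements the weighted binary cross-entropy for a single logit in plain Python, following the formula `-(w * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x)))`. The function name and the example logit are hypothetical; the weight 29.5641 is the `is_ad` value printed earlier in the notebook.

```python
import math


def bce_with_logits(logit, target, pos_weight=1.0):
    """Binary cross-entropy on a raw logit, with an optional weight on the positive term."""
    # Numerically stable log(sigmoid(x))
    if logit >= 0:
        log_sig = -math.log1p(math.exp(-logit))
    else:
        log_sig = logit - math.log1p(math.exp(logit))
    log_one_minus_sig = log_sig - logit  # log(1 - sigmoid(x))
    return -(pos_weight * target * log_sig + (1 - target) * log_one_minus_sig)


logit = -2.0  # the model is fairly sure the label is negative
plain_pos = bce_with_logits(logit, target=1)
weighted_pos = bce_with_logits(logit, target=1, pos_weight=29.5641)
plain_neg = bce_with_logits(logit, target=0)
weighted_neg = bce_with_logits(logit, target=0, pos_weight=29.5641)
print(plain_pos, weighted_pos)
print(plain_neg, weighted_neg)
```

Only the missed positive becomes more expensive; the loss on negative examples is unchanged. This pushes the model away from always predicting 0 for a rare label.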

In [9]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred

    # Convert logits into probabilities using sigmoid, then threshold at 0.5
    # which can be further tuned in future iterations
    probs = 1 / (1 + np.exp(-predictions))  # sigmoid
    preds = (probs >= 0.5).astype(int)

    metrics = {}

    # Per-label metrics
    for i in range(labels.shape[1]):
        precision, recall, f1, _ = precision_recall_fscore_support(
            labels[:, i], preds[:, i], average="binary", zero_division=0
        )
        metrics[f'eval_precision_label_{i}'] = precision
        metrics[f'eval_recall_label_{i}'] = recall
        metrics[f'eval_f1_label_{i}'] = f1

    # Macro averages across all labels
    precision_macro, recall_macro, f1_macro, _ = precision_recall_fscore_support(
        labels, preds, average="macro", zero_division=0
    )

    metrics.update({
        'eval_precision_macro': precision_macro,
        'eval_recall_macro': recall_macro,
        'eval_f1_macro': f1_macro,
    })

    return metrics

def print_final_metrics(trainer, test_dataset, class_names=None):
    """Print detailed per-label metrics after training for multi-label classification"""
    predictions = trainer.predict(test_dataset)

    # Multi-label: sigmoid + threshold
    probs = 1 / (1 + np.exp(-predictions.predictions))
    preds = (probs >= 0.5).astype(int)
    labels = predictions.label_ids

    print("\n" + "="*70)
    print("FINAL MULTI-LABEL CLASSIFICATION METRICS")
    print("="*70)

    if class_names is None:
        class_names = [f"Label_{i}" for i in range(labels.shape[1])]

    print(f"{'Label':<20} {'Precision':<10} {'Recall':<10} {'F1-Score':<10} {'Support':<10}")
    print("-" * 70)

    # Per-label metrics
    all_precision, all_recall, all_f1 = [], [], []

    for i, name in enumerate(class_names):
        precision, recall, f1, support = precision_recall_fscore_support(
            labels[:, i], preds[:, i], average="binary", zero_division=0
        )
        print(f"{name:<20} {precision:<10.3f} {recall:<10.3f} {f1:<10.3f} {np.sum(labels[:, i]):<10}")

        all_precision.append(precision)
        all_recall.append(recall)
        all_f1.append(f1)

    # Macro averages
    macro_precision = np.mean(all_precision)
    macro_recall = np.mean(all_recall)
    macro_f1 = np.mean(all_f1)

    print("-" * 70)
    print(f"{'Macro Average':<20} {macro_precision:<10.3f} {macro_recall:<10.3f} {macro_f1:<10.3f} {len(labels):<10}")
    print("="*70)

## Stage 4: Training Configuration and Optimization

**Hyperparameter Selection**

The training uses parameters optimised for smaller models with LoRA:
* Learning Rate (3e-4): Higher than typical full fine-tuning because LoRA adapters can handle more aggressive updates; a cosine schedule with 10% warmup is applied
* Batch Size (16 for training, 32 for evaluation): Larger batches are possible due to the smaller model size and LoRA efficiency
* Epochs (2): Kept low for speed while still giving the LoRA adapters time to adapt


**Performance Optimizations**

Several techniques ensure fast training:

* Mixed Precision (FP16): Uses 16-bit floating point to reduce memory usage and increase speed
* Efficient Attention: Relies on PyTorch's optimised scaled dot-product attention where the installed library versions enable it (it is not configured explicitly in the code)
* Parallel Data Loading: Multiple workers load batches in parallel
* Length Grouping: Batches similar-length sequences together to minimize padding

**Evaluation Strategy**

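The training arguments include settings to evaluate the model every 100 training steps, but these are commented out in the run shown here, so no intermediate evaluation took place. Once re-enabled, they allow us to: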

* Monitor training progress in real-time
* Detect overfitting early
* Save the best-performing checkpoint based on macro F1 score
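
The step-based evaluation and best-checkpoint settings described above could be collected as keyword arguments along these lines. This is a sketch, not the configuration used for the reported run: the dictionary name `eval_kwargs` is hypothetical, and the strategy key is spelled `eval_strategy` in recent `transformers` releases but `evaluation_strategy` in older ones, so check your installed version.

```python
EVAL_EVERY = 100  # evaluate and checkpoint every 100 optimizer steps

eval_kwargs = {
    # Older transformers releases name this key "evaluation_strategy"
    "eval_strategy": "steps",
    "eval_steps": EVAL_EVERY,
    "save_strategy": "steps",
    "save_steps": EVAL_EVERY,
    # Reload the checkpoint with the highest macro F1 once training finishes
    "load_best_model_at_end": True,
    "metric_for_best_model": "eval_f1_macro",
    "greater_is_better": True,
}

# Intended use (not executed here):
# training_args = TrainingArguments(output_dir="./results", **eval_kwargs)
print(eval_kwargs)
```

When `load_best_model_at_end` is enabled, the save interval should be a multiple of the evaluation interval so that every saved checkpoint has a metric to be compared on.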

In [15]:
# Main training pipeline

def train_classifier(
    model_id: str,
    train_dataset,
    eval_dataset,
    test_dataset,
    num_labels: int,
    class_names=None,
    output_dir="./results"
):

    # Initialize model with LoRA optimisation
    model = OptimizedClassifier(
        model_id=model_id,
        num_labels=num_labels,
        token=os.environ.get("HF_TOKEN")
    )

    # Print trainable parameters
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Trainable: {trainable_params:,} / {total_params:,} ({100*trainable_params/total_params:.2f}%)")

    # Optimized training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=2,  # Reduced for speed
        per_device_train_batch_size=16,  # Larger batch size
        per_device_eval_batch_size=32,
        learning_rate=3e-4,  # Higher LR for LoRA
        lr_scheduler_type="cosine",
        warmup_ratio=0.1,

        # # Evaluation
        # evaluation_strategy="steps",
        # eval_steps=100,
        # save_strategy="steps",
        # save_steps=100,

        # # Best model tracking
        # load_best_model_at_end=True,
        # metric_for_best_model="eval_f1_macro",
        # greater_is_better=True,

        # Optimizations
        fp16=torch.cuda.is_available(),
        dataloader_num_workers=4,
        dataloader_pin_memory=True,
        group_by_length=True,
        remove_unused_columns=False,

        # Reduce logging for speed
        logging_steps=100,
        save_total_limit=2,
        report_to=[],
    )

    # Initialize trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
    )

    # Train
    print("Starting training...")
    trainer.train()

    # Print final detailed metrics
    print_final_metrics(trainer, test_dataset, class_names)

    return trainer, model

In [16]:
trainer, model = train_classifier(
    model_id=MODEL_ID,
    train_dataset=tokenized["train"],           # Your tokenized train split
    eval_dataset=tokenized["validation"],       # Your tokenized validation split
    test_dataset=tokenized["validation"],       # Use validation as test (or create separate test split)
    num_labels=len(LABEL_COLS),                # Number of label columns you have
    class_names=LABEL_COLS,                    # Your actual label column names
    output_dir="./classifier_results"
)

Some weights of Gemma3TextModel were not initialized from the model checkpoint at google/gemma-3-1b-it and are newly initialized: ['embed_tokens.weight', 'layers.0.input_layernorm.weight', 'layers.0.mlp.down_proj.weight', 'layers.0.mlp.gate_proj.weight', 'layers.0.mlp.up_proj.weight', 'layers.0.post_attention_layernorm.weight', 'layers.0.post_feedforward_layernorm.weight', 'layers.0.pre_feedforward_layernorm.weight', 'layers.0.self_attn.k_norm.weight', 'layers.0.self_attn.k_proj.weight', 'layers.0.self_attn.o_proj.weight', 'layers.0.self_attn.q_norm.weight', 'layers.0.self_attn.q_proj.weight', 'layers.0.self_attn.v_proj.weight', 'layers.1.input_layernorm.weight', 'layers.1.mlp.down_proj.weight', 'layers.1.mlp.gate_proj.weight', 'layers.1.mlp.up_proj.weight', 'layers.1.post_attention_layernorm.weight', 'layers.1.post_feedforward_layernorm.weight', 'layers.1.pre_feedforward_layernorm.weight', 'layers.1.self_attn.k_norm.weight', 'layers.1.self_attn.k_proj.weight', 'layers.1.self_attn.o_pr

Trainable: 12,463,491 / 1,012,349,443 (1.23%)
Starting training...




| Step | Training Loss |
|---|---|
| 100 | 0.261 |
| 200 | 0.1952 |
| 300 | 0.2045 |
| 400 | 0.2041 |
| 500 | 0.1914 |
| 600 | 0.1673 |
| 700 | 0.1738 |
| 800 | 0.1711 |
| 900 | 0.1518 |
| 1000 | 0.1706 |





FINAL MULTI-LABEL CLASSIFICATION METRICS
Label                Precision  Recall     F1-Score   Support   
----------------------------------------------------------------------
is_ad                0.000      0.000      0.000      40.0      
is_rant              0.600      0.300      0.400      80.0      
is_relevant          0.960      1.000      0.979      1144.0    
----------------------------------------------------------------------
Macro Average        0.520      0.433      0.460      1192      


In [18]:
import shutil

# Copy the entire results folder to Google Drive
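# dirs_exist_ok=True lets the cell be re-run without failing if the target folder already exists
shutil.copytree("./classifier_results", "/content/drive/MyDrive/classifier_results", dirs_exist_ok=True)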

'/content/drive/MyDrive/classifier_results'

## Stage 5: Metrics Calculation and Evaluation

**Multi-Label Metrics Framework**

For each label, we calculate precision, recall and F1-score, treating it as an independent binary classification problem. This differs from multi-class metrics, where we would use argmax to select a single prediction.
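
The difference between the two decision rules can be seen on a single example. The sketch below applies a sigmoid and a 0.5 threshold to each logit independently, then contrasts that with an argmax over the same logits. The logit values are made up; the label names are the notebook's `LABEL_COLS`.

```python
import math


def sigmoid(x):
    """Logistic function mapping a raw logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


label_names = ["is_ad", "is_rant", "is_relevant"]
logits = [2.0, -1.0, 0.3]  # hypothetical model outputs for one review

# Multi-label: every label gets its own yes/no decision
probs = [sigmoid(z) for z in logits]
multi_label_preds = [int(p >= 0.5) for p in probs]

# Multi-class: exactly one label wins
multi_class_pred = max(range(len(logits)), key=lambda i: logits[i])

print(dict(zip(label_names, multi_label_preds)))
print(label_names[multi_class_pred])
```

With thresholding, this review is flagged as both an ad and relevant, whereas argmax would report only `is_ad`.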

**Macro vs Micro Averaging**

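We focus on macro averaging, which computes each metric per label and then takes the unweighted mean. Compared with micro averaging, which pools the counts of all labels before computing the metric, macro averaging:
* Treats each label equally regardless of frequency
* Keeps rare labels such as `is_ad` from being masked by common ones such as `is_relevant`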
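
A small numeric sketch shows how the two averages can diverge on imbalanced labels. The true-positive, false-positive and false-negative counts below are invented for illustration and are not taken from the training run.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score from raw counts; defined as 0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical per-label counts: (true positives, false positives, false negatives)
counts = {
    "rare_label": (1, 1, 9),
    "common_label": (95, 5, 5),
}

# Macro: score each label separately, then take the unweighted mean
per_label_f1 = {name: f1_from_counts(*c) for name, c in counts.items()}
macro_f1 = sum(per_label_f1.values()) / len(per_label_f1)

# Micro: pool the counts across labels, then score once
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro_f1 = f1_from_counts(tp, fp, fn)

print(per_label_f1)
print(f"macro F1 = {macro_f1:.3f}, micro F1 = {micro_f1:.3f}")
```

The micro average is dominated by the common label and looks healthy, while the macro average exposes the weak rare label.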

**Per-Label Performance Tracking**

The system tracks individual metrics for each label, allowing identification of:
* Labels the model learns easily versus labels it struggles with, which may need different treatment or more training data
* Potential data quality issues in specific labels

## Future Improvements

### 1. Model Scale and Architecture
* With more compute, we could use larger models (e.g. Gemma3 or Qwen3 variants with more parameters), which would likely give better language understanding and context modelling.

### 2. Enhanced Data Engineering
* Model performance might be improved by adding other engineered signals, e.g. sentiment scores, the disparity between average ratings and sentiment scores, or a rating × sentiment interaction that captures the internal consistency of a review. These can be appended to the comprehensive review text for greater nuance.

### 3. Advanced Label Modelling
* We could also implement hierarchical labels that model parent-child relationships between labels, e.g. treating `is_legit` as a child of `is_ad`, `is_rant` and `is_relevant`.

### 4. Model Interpretability
* Interpretability is a limitation of most machine learning models. Attention visualisation could be implemented to indicate which words drive the predictions.

### 5. Active Learning and Human-In-The-Loop
* With additional resources, we could implement an active learning system that identifies the most informative unlabeled examples for human annotation. This would maximise the value of a limited labeling budget by focusing human effort on the samples that most improve model performance, creating a continuous feedback loop between model predictions and expert knowledge.
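
One simple way to pick "informative" examples is uncertainty sampling, sketched below under the assumption that informativeness is measured by how close any label probability is to 0.5. The function name, the pool of probabilities and the selection rule are all illustrative choices, not part of the current pipeline.

```python
def most_uncertain(probs_per_example, k):
    """Return indices of the k examples with a label probability closest to 0.5."""

    def distance_to_boundary(probs):
        # Smaller distance means the model is less sure about at least one label
        return min(abs(p - 0.5) for p in probs)

    ranked = sorted(
        range(len(probs_per_example)),
        key=lambda i: distance_to_boundary(probs_per_example[i]),
    )
    return ranked[:k]


# Hypothetical predicted probabilities for (is_ad, is_rant, is_relevant)
pool = [
    [0.02, 0.10, 0.97],
    [0.48, 0.05, 0.90],
    [0.01, 0.55, 0.99],
    [0.30, 0.20, 0.80],
]
picked = most_uncertain(pool, k=2)
print(picked)
```

The selected examples would be sent to annotators, added to the training set, and the model retrained before the next selection round.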