<a href="https://colab.research.google.com/github/ywang1110/ML-LLM-System-Design/blob/main/meeting_notes_multilabel_bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Multi-Label Theme Classification for Meeting Notes (BERT + Hugging Face)

This notebook fine-tunes a BERT model for **multi-label classification** of meeting notes into themes such as *Decisions, Action Items, Risks, Timeline,* and *Budget*.  
It uses:
- `transformers` Trainer API
- `BCEWithLogitsLoss` (via `problem_type='multi_label_classification'`)
- Threshold tuning per label on the validation set
- Metrics: micro/macro F1 + per-label F1

> **How to use:**  
> 1. Upload your dataset (CSV or JSONL) containing a `text` field and a `labels` field.  
> 2. Update the `LABELS` list to your ontology.  
> 3. Run all cells top-to-bottom.  




1. **transformers**
   * Easy access to pretrained models such as BERT
   * Includes:
     * Model architectures
     * Tokenizers
     * Training and inference utilities

2. **datasets**
   * Loads, processes, and streams datasets efficiently
   * Features:
     * Handles large datasets (memory-mapped Arrow format)
     * Built-in dataset hub
     * Easy train/test split handling

3. **accelerate**
   * Hugging Face library that simplifies multi-GPU and mixed-precision training
   * Abstracts away the `torch.distributed` setup

4. **evaluate**
   * Hugging Face library for standardized evaluation metrics
   * Easily loads metrics such as `f1`, `accuracy`, `precision`, `recall`, `BLEU`, and `ROUGE`

5. **scikit-learn**
   * General-purpose ML toolkit
   * Used here for:
     * Train/validation splitting (`train_test_split`)
     * Multi-label metrics (`f1_score`, `classification_report`)
     * Per-label threshold tuning (via `f1_score`)


In [None]:
!nvidia-smi

Sat Aug  9 15:35:07 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   33C    P0             44W /  400W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

In [None]:
import torch

if torch.cuda.is_available():
  print("GPU: ", torch.cuda.get_device_name(0))
else:
  print("No GPU detected")

GPU:  NVIDIA A100-SXM4-40GB


In [None]:
# Install dependencies (needed in Colab; comment this out if they are already installed)
!pip install -q transformers datasets accelerate evaluate scikit-learn


In [None]:
from dataclasses import dataclass
from typing import List, Dict, Any
import os, json, math, random
import numpy as np
import pandas as pd

import torch
from torch import nn
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, classification_report

from transformers import (
    AutoTokenizer,  # automatically loads the correct tokenizer for a given pre-trained model or path
    AutoModelForSequenceClassification,  # loads a pretrained model with a sequence-classification head on top
    Trainer,  # high-level Hugging Face training loop that handles batching, eval, logging, saving, and distributed training
    TrainingArguments,  # configuration object for `Trainer` that stores all training settings (e.g., lr, batch size, eval strategy)
    EarlyStoppingCallback,  # `Trainer` callback that stops training early if a monitored metric (e.g., validation loss) doesn't improve for a set number of evals
)
import evaluate

SEED = 42
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)

# === TODO: Update to your theme list ===
LABELS: List[str] = [
    "Decisions","Action Items","Follow-ups","Risks","Requirements","Timeline","Budget","Owners","Scope","Dependencies"
]
NUM_LABELS = len(LABELS)

# Base model (you can switch to roberta-base, etc.)
MODEL_NAME = "bert-base-uncased"

# Paths (change as needed)
DATA_PATH = "data.csv"        # or set to your JSONL path
SAVE_DIR  = "meeting-bert-multilabel"

os.makedirs(SAVE_DIR, exist_ok=True)



## Load Data

Expected formats (**choose one and update the code accordingly**):

### CSV
- Columns: `text` and `labels`
- `labels` can be a comma-separated string like: `Decisions,Action Items`

### JSONL
- Each line is a JSON object with keys: `text`, `labels` (list of strings)
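As a rough illustration of the two formats, the sketch below writes the same two records as CSV and as JSONL using only the standard library, then parses the labels back. The example rows are invented, and the parsing lines mirror (but are not) the notebook's `load_data` function.

```python
import csv
import io
import json

rows = [
    {"text": "We approved vendor A and Bob owns the rollout.", "labels": ["Decisions", "Owners"]},
    {"text": "Budget is tight this quarter.", "labels": ["Budget"]},
]

# CSV: labels are joined into one comma-separated string.
# The csv module quotes that field automatically because it contains commas.
csv_buf = io.StringIO()
writer = csv.DictWriter(csv_buf, fieldnames=["text", "labels"])
writer.writeheader()
for r in rows:
    writer.writerow({"text": r["text"], "labels": ",".join(r["labels"])})
csv_text = csv_buf.getvalue()

# JSONL: one JSON object per line, labels stay a list of strings.
jsonl_text = "\n".join(json.dumps(r) for r in rows)

# Read both back into lists of label names.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
parsed_csv = [[s.strip() for s in r["labels"].split(",") if s.strip()] for r in csv_rows]
parsed_jsonl = [json.loads(line)["labels"] for line in jsonl_text.splitlines()]

print(csv_text)
print(jsonl_text)
```

One consequence of the CSV convention is that a label name cannot itself contain a comma; JSONL has no such restriction.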


In [None]:
def load_data(DATA_PATH: str, labels: List[str]) -> pd.DataFrame:
    ext = os.path.splitext(DATA_PATH)[1].lower()
    label2id = {l: i for i, l in enumerate(labels)}

    if ext == ".csv":
        df = pd.read_csv(DATA_PATH)
        # Expect columns: text, labels (comma-separated)
        def parse_labels(x):
            if isinstance(x, str):
                return [s.strip() for s in x.split(",") if s.strip()]
            if isinstance(x, list):
                return x
            return []
        df["labels_list"] = df["labels"].apply(parse_labels)
    elif ext in [".jsonl", ".json"]:
        rows = []
        with open(DATA_PATH, "r", encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue
                obj = json.loads(line)
                rows.append({
                    "text": obj["text"],
                    "labels_list": obj.get("labels", [])
                })
        df = pd.DataFrame(rows)
    else:
        raise ValueError(f"Unsupported file extension: {ext}")

    # binarize labels to multi-hot vectors
    def to_multihot(lst):
        vec = np.zeros(len(labels), dtype=np.float32)
        for l in lst:
            if l in label2id:
                vec[label2id[l]] = 1.0
        return vec

    df["y"] = df["labels_list"].apply(to_multihot)
    return df[["text", "y", "labels_list"]]

# If you don't have data yet, create a tiny toy set
if not os.path.exists(DATA_PATH):
    toy_base = {
        "Decisions,Owners": [
            "We decided to move forward with vendor A and Bob owns the integration.",
            "We approved vendor B and Sarah is responsible for implementation."
        ],
        "Action Items,Timeline": [
            "Alice will send the report next Tuesday; timeline pushed by a week.",
            "John will prepare the draft by Friday; delivery delayed to next Monday."
        ],
        "Budget,Risks,Dependencies": [
            "Budget is tight and there are risks with the API dependency.",
            "Costs are over budget and integration with the payment API is risky."
        ],
        "Follow-ups,Requirements,Scope": [
            "Follow up with legal on requirements and scope next week.",
            "Check with compliance on updated requirements and review project scope."
        ],
        "Risks,Timeline,Dependencies": [
            "Server outage impacted the release plan; backup systems need verification.",
            "Database migration may delay release; dependent services need updates."
        ],
    }

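    # Expand to 6 examples per label combination (a small suffix keeps each text unique)
    toy = []
    for labels, texts in toy_base.items():
        for i in range(6):
            text = texts[i % len(texts)] + f" (note {i+1})"
            toy.append({"text": text, "labels": labels})

    # Write the file once, after all rows have been collected
    pd.DataFrame(toy).to_csv(DATA_PATH, index=False)

# Load the data whether or not the toy file had to be created
df = load_data(DATA_PATH, LABELS)
print("Data size:", len(df))
df.head()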


Data size: 30


In [None]:
print(df.loc[0]['y'])
print(df.loc[1]['y'])

[1. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
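
The two vectors printed above are the multi-hot encoding of `Decisions,Owners`. Here is a minimal sketch of that mapping using the notebook's ten-label list. The helper names `encode` and `decode` are illustrative assumptions; the notebook's own version is `to_multihot` inside `load_data`.

```python
LABELS = ["Decisions", "Action Items", "Follow-ups", "Risks", "Requirements",
          "Timeline", "Budget", "Owners", "Scope", "Dependencies"]
label2id = {name: i for i, name in enumerate(LABELS)}

def encode(label_string):
    """Turn a comma-separated label string into a 0/1 vector in LABELS order."""
    vec = [0.0] * len(LABELS)
    for name in label_string.split(","):
        name = name.strip()
        if name in label2id:          # names outside LABELS are skipped silently
            vec[label2id[name]] = 1.0
    return vec

def decode(vec):
    """Map a 0/1 vector back to label names."""
    return [LABELS[i] for i, v in enumerate(vec) if v == 1.0]

vec = encode("Decisions,Owners")
print(vec)
print(decode(vec))
```

Because unknown names are skipped rather than rejected, a misspelled label in the data produces a row with fewer positives and no error, so it may be worth checking for all-zero rows after loading.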


In [None]:
train_df, val_df = train_test_split(df,
                      test_size=0.2,
                      random_state=SEED,
                      stratify=[
                          tuple(v) for v in df["y"] # df["y"] contains arrays (multi-label vectors), so each vector is converted to a tuple to make it hashable for stratification.
                          ]) # ensure label distribution is similar in both train and validation set

# Load the correct tokenizer for your pre-trained model (e.g., BERT).
# It handles text -> token IDs, adds special tokens ([CLS], [SEP]), and applies truncation/padding rules.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

MAX_LEN = 512  # for meeting notes, consider chunking if texts are long

class TextDataset(torch.utils.data.Dataset): # store text and label from a DataFrame; Converts text into tokenized tensors using tokenizer
    def __init__(self, df: pd.DataFrame, tokenizer, max_len: int = 512):
        self.texts = df["text"].tolist()
        self.labels = np.stack(df["y"].values)  # shape: (N, C)
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        enc = self.tokenizer(
            text,
            truncation=True,
            padding="max_length",
            max_length=self.max_len,
            return_tensors="pt",
        )
        item = {k: v.squeeze(0) for k, v in enc.items()}
        item["labels"] = torch.tensor(self.labels[idx], dtype=torch.float)
        return item

train_ds = TextDataset(train_df, tokenizer, MAX_LEN)
val_ds   = TextDataset(val_df, tokenizer, MAX_LEN)
len(train_ds), len(val_ds)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


(24, 6)

In [None]:
# AutoModelForSequenceClassification: a Hugging Face class that loads a model architecture and weights for text classification
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, # download and load the weights of a pre-trained model
    num_labels=NUM_LABELS, # set number of output neurons in the classification layer - one per label
    problem_type="multi_label_classification"  # enables BCEWithLogitsLoss internally; each label treated as an independent binary prediction
)

# Optional: freeze the embedding layer so its weights are not updated during training.
# This reduces the number of trainable parameters, which lowers memory use and speeds up each step.
# It is especially useful when the dataset is small and you only want to fine-tune the higher layers first.
for p in model.base_model.embeddings.parameters():
    p.requires_grad = False

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



## Metrics & Threshold Tuning

We'll compute micro/macro F1. During training we’ll log F1 with a **default threshold = 0.5**.  
After training, we'll **tune per-label thresholds** on the validation set to maximize macro F1.
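
Micro and macro F1 can diverge, so it helps to see how each is assembled. The notebook gets both numbers from scikit-learn's `f1_score`; the hand-rolled helpers `counts` and `f1_from_counts` and the toy matrices below are illustrative assumptions meant only to show the difference.

```python
def counts(y_true, y_pred, label):
    """True positives, false positives, false negatives for one label column."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        t, p = t_row[label], p_row[label]
        tp += int(t == 1 and p == 1)
        fp += int(t == 0 and p == 1)
        fn += int(t == 1 and p == 0)
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0   # 0 when there is nothing to score

# Four examples, three labels (toy values).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]

per_label = [counts(y_true, y_pred, j) for j in range(3)]

# Macro: compute F1 per label, then average, so every label weighs the same.
macro_f1 = sum(f1_from_counts(*c) for c in per_label) / len(per_label)

# Micro: pool the counts across labels first, so frequent labels weigh more.
micro_f1 = f1_from_counts(
    sum(c[0] for c in per_label),
    sum(c[1] for c in per_label),
    sum(c[2] for c in per_label),
)
print(per_label, round(macro_f1, 4), round(micro_f1, 4))
```

Selecting the best checkpoint by macro F1, as the training arguments in this notebook do, means a rare label that the model misses pulls the score down as much as a common one.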


In [None]:
metric_f1 = evaluate.load("f1")

# turn logits (raw model outputs) into probabilities between 0 and 1 for each label
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def compute_metrics(eval_pred):
    logits, labels = eval_pred # eval_pred: returned by Hugging Face Trainer during eval
    probs = sigmoid(logits)
    # default threshold 0.5 (we'll tune later)
    preds = (probs >= 0.5).astype(int)
    micro = f1_score(labels, preds, average="micro", zero_division=0)
    macro = f1_score(labels, preds, average="macro", zero_division=0)
    return {"micro_f1": micro, "macro_f1": macro}



In [None]:
training_args = TrainingArguments(
    output_dir=SAVE_DIR,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="steps",
    logging_steps=50,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=1,
    num_train_epochs=5,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="macro_f1",
    greater_is_better=True,
    fp16=torch.cuda.is_available(),
    report_to="none",
    seed=SEED,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)

trainer.train()



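| Epoch | Training Loss | Validation Loss | Micro F1 | Macro F1 |
| ----- | ------------- | --------------- | -------- | -------- |
| 1     | No log        | 0.687577        | 0.333333 | 0.21619  |
| 2     | No log        | 0.65745         | 0.363636 | 0.183333 |
| 3     | No log        | 0.637099        | 0.424242 | 0.325    |
| 4     | No log        | 0.62518         | 0.428571 | 0.275    |
| 5     | No log        | 0.620065        | 0.48     | 0.292381 |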


TrainOutput(global_step=15, training_loss=0.6685957590738932, metrics={'train_runtime': 16.1694, 'train_samples_per_second': 7.421, 'train_steps_per_second': 0.928, 'total_flos': 31575594516480.0, 'train_loss': 0.6685957590738932, 'epoch': 5.0})
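
The `No log` entries in the Training Loss column follow from the step count. Here is a quick check, assuming a single device and the settings shown above (24 training examples, batch size 8, no gradient accumulation, 5 epochs, `logging_steps=50`):

```python
import math

train_examples = 24               # len(train_ds) printed earlier
per_device_batch_size = 8
gradient_accumulation_steps = 1
num_epochs = 5
logging_steps = 50

# Optimizer steps per epoch: batches per epoch divided by the accumulation factor.
steps_per_epoch = math.ceil(
    train_examples / (per_device_batch_size * gradient_accumulation_steps)
)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)   # 3 steps per epoch, 15 in total
print(total_steps < logging_steps)    # True: training ends before the first log
```

The total of 15 matches `global_step=15` in the `TrainOutput` line. On a dataset this small, setting `logging_steps` to a value no larger than the 3 steps in an epoch should populate the column.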

## 1️⃣ Why thresholds matter in multi-label classification

* In multi-label classification, **each label is an independent binary classification problem**.
* The model outputs a probability (after sigmoid) for each label.
* To turn probabilities into final predictions, you choose a **threshold**:

  * If `prob >= threshold` → predict 1 (positive)
  * If `prob < threshold` → predict 0 (negative)

The **threshold** directly affects:

* **Precision** (how many predicted positives are correct)
* **Recall** (how many actual positives are found)
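
The effect of the threshold on precision and recall can be seen on a single label with a handful of scores. The probabilities and labels below are invented for illustration, and `precision_recall` is a hypothetical helper rather than part of the notebook.

```python
def precision_recall(probs, labels, threshold):
    """Precision and recall for one label at a given probability cutoff."""
    preds = [int(p >= threshold) for p in probs]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

probs = [0.9, 0.8, 0.6, 0.55, 0.4, 0.35, 0.2]   # toy sigmoid outputs
labels = [1, 1, 0, 1, 0, 1, 0]                  # toy ground truth

results = {t: precision_recall(probs, labels, t) for t in (0.3, 0.5, 0.7)}
for t, (p, r) in results.items():
    print(f"threshold={t}: precision={p:.3f} recall={r:.3f}")
```

In this toy data, raising the cutoff from 0.3 to 0.7 moves precision from 4/6 to 1.0 while recall falls from 1.0 to 0.5. Recall can never increase as the threshold rises; precision usually improves, but that is not guaranteed on every dataset.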

---

## 2️⃣ Why not just use 0.5 for all labels?

If you always use 0.5:

* Labels with **class imbalance** (e.g., very few positives) may suffer — you might need a lower threshold to improve recall.
* Labels with **noisy predictions** may need a higher threshold to improve precision.

Illustrative example (made-up values, not measurements):

| Label          | Positive Rate | Good Threshold |
| -------------- | ------------- | -------------- |
| “cat”          | 50%           | \~0.5          |
| “rare disease” | 2%            | \~0.2          |
| “common word”  | 90%           | \~0.7          |
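
A sketch of the rare-label case: when a label has few positives, a model may rank them correctly while still scoring them below 0.5, so the default cutoff predicts nothing. The scores below are invented, and `f1_at` is a hypothetical helper.

```python
def f1_at(probs, labels, threshold):
    """F1 for one label at a given cutoff (0.0 when there is nothing to score)."""
    tp = fp = fn = 0
    for p, y in zip(probs, labels):
        pred = p >= threshold
        if pred and y:
            tp += 1
        elif pred and not y:
            fp += 1
        elif y:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Ten examples, only two positives; the positives get the highest scores,
# but every score is below 0.5.
probs = [0.35, 0.30, 0.12, 0.10, 0.08, 0.06, 0.05, 0.04, 0.03, 0.02]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

f1_default = f1_at(probs, labels, 0.5)   # no positives predicted
f1_lowered = f1_at(probs, labels, 0.2)   # both positives recovered
print(f1_default, f1_lowered)
```

The ranking is identical in both cases; only the cutoff changes, which is why a per-label threshold can help without retraining.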

---

## 3️⃣ What the code does

For each label **independently**:

1. **Loop over candidate thresholds** from 0.1 to 0.9

   ```python
   for t in np.linspace(0.1, 0.9, 33):
   ```
2. **Convert probabilities to 0/1 predictions** at that threshold:

   ```python
   pred_i = (val_probs[:, i] >= t).astype(int)
   ```
3. **Compute the F1 score** for that label:

   ```python
   f1_i = f1_score(val_labels[:, i], pred_i, zero_division=0)
   ```
4. **Keep the threshold** that gives the highest F1:

   ```python
   if f1_i > best_f1:
       best_f1, best_t = f1_i, t
   ```
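
One detail of step 4 is easy to miss: because the comparison is a strict `>` and candidates are tried in ascending order, the lowest of several equally good thresholds is kept. The pure-Python sketch below reproduces that search on invented scores. The `prefer_center` option is a hypothetical alternative tie rule (keep the tied threshold closest to 0.5) that is not part of this notebook.

```python
def f1(labels, preds):
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def tune(probs, labels, candidates, prefer_center=False):
    """Return (best_threshold, best_f1) for one label."""
    best_f1, best_t = 0.0, 0.5
    for t in candidates:
        preds = [int(p >= t) for p in probs]
        score = f1(labels, preds)
        better = score > best_f1                      # the notebook's rule
        tie_closer = (prefer_center and score == best_f1 and score > 0
                      and abs(t - 0.5) < abs(best_t - 0.5))
        if better or tie_closer:
            best_f1, best_t = score, t
    return best_t, best_f1

# Same grid as np.linspace(0.1, 0.9, 33), rounded to avoid float noise.
candidates = [round(0.1 + 0.025 * i, 3) for i in range(33)]

probs = [0.62, 0.58, 0.31, 0.27]    # toy scores: any cutoff in (0.31, 0.58] is perfect
labels = [1, 1, 0, 0]

first_best = tune(probs, labels, candidates)
centered = tune(probs, labels, candidates, prefer_center=True)
print(first_best, centered)
```

Both rules reach the same F1 here, but they pick different thresholds from the tied range. On a small validation set many candidates tie, so the tie rule can end up deciding the threshold.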

---

## 4️⃣ Why F1 score is used here

* **F1 score** is the harmonic mean of precision and recall:

  $$
  F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
  $$
* It balances both metrics, which is important when:

  * Classes are **imbalanced**
  * Both **false positives** and **false negatives** matter

By **maximizing F1 per label**, you:

* Pick, for each label, the **precision/recall trade-off** that scores best on the validation set among the candidate thresholds
* Avoid optimizing for overall accuracy, which can be misleading for rare labels
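
Here is a small worked example of the formula above, using precision/recall pairs taken from the tuned-threshold report later in this notebook (Decisions, Risks, Timeline). The report values are rounded to two decimals, so the results are approximate. The arithmetic mean is printed alongside for contrast.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

pairs = {
    "Decisions": (0.20, 1.00),
    "Risks": (1.00, 0.50),
    "Timeline": (0.75, 1.00),
}

scores = {name: f1(p, r) for name, (p, r) in pairs.items()}
for name, (p, r) in pairs.items():
    print(f"{name:<10} F1={scores[name]:.4f}  arithmetic mean={(p + r) / 2:.4f}")
```

The Decisions row shows why the harmonic mean is used: perfect recall paired with 0.20 precision gives an F1 of about 0.33, whereas the arithmetic mean would report 0.60.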

---

## 5️⃣ After threshold tuning

You get one threshold per label, printed as a mapping like:

```text
Per-label thresholds: {
    "cat": 0.52,
    "dog": 0.47,
    "mouse": 0.31,
    ...
}
```

Then you use **these custom thresholds** to make predictions:

```python
preds_tuned = (val_probs >= ths[None, :]).astype(int)
```

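On the validation set, per-label and macro F1 can only **match or improve on** a fixed 0.5 threshold, because 0.5 is one of the candidates. Since the thresholds are chosen on that same data, the gain is an optimistic estimate; confirming it requires a separate held-out set.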


In [None]:
# Get raw logits on validation set
pred_out = trainer.predict(val_ds)
val_logits = pred_out.predictions
val_labels = np.stack(val_df["y"].values)

# In multi-label tasks, each output unit is an independent binary classification.
val_probs = sigmoid(val_logits)

# Search the best threshold for each label
ths = np.zeros(NUM_LABELS)
for i in range(NUM_LABELS):
    best_f1, best_t = 0.0, 0.5
    # Try thresholds from 0.1 to 0.9 in 33 evenly spaced steps.
    for t in np.linspace(0.1, 0.9, 33):
        # Convert probabilities to 0/1 predictions at that threshold:
        pred_i = (val_probs[:, i] >= t).astype(int)
        f1_i = f1_score(val_labels[:, i], pred_i, zero_division=0)
        if f1_i > best_f1:
            best_f1, best_t = f1_i, t
    ths[i] = best_t

print("Per-label thresholds:", dict(zip(LABELS, ths)))

# Evaluation after applying the per-label thresholds
preds_tuned = (val_probs >= ths[None, :]).astype(int)
print("\n== Tuned Thresholds Report ==")
print(classification_report(val_labels, preds_tuned, target_names=LABELS, zero_division=0))


Per-label thresholds: {'Decisions': np.float64(0.42500000000000004), 'Action Items': np.float64(0.1), 'Follow-ups': np.float64(0.1), 'Risks': np.float64(0.42500000000000004), 'Requirements': np.float64(0.475), 'Timeline': np.float64(0.55), 'Budget': np.float64(0.45000000000000007), 'Owners': np.float64(0.42500000000000004), 'Scope': np.float64(0.5), 'Dependencies': np.float64(0.1)}

== Tuned Thresholds Report ==
              precision    recall  f1-score   support

   Decisions       0.20      1.00      0.33         1
Action Items       0.33      1.00      0.50         2
  Follow-ups       0.17      1.00      0.29         1
       Risks       1.00      0.50      0.67         2
Requirements       0.20      1.00      0.33         1
    Timeline       0.75      1.00      0.86         3
      Budget       1.00      1.00      1.00         1
      Owners       0.50      1.00      0.67         1
       Scope       1.00      1.00      1.00         1
Dependencies       0.33      1.00      0.50

In [None]:
# Save model, tokenizer, and thresholds
trainer.save_model(SAVE_DIR)  # save the model weights, config, and other files needed to reload the model architecture
tokenizer.save_pretrained(SAVE_DIR)  # save the vocabulary, tokenization rules, and special tokens

# save the label list
# critical for mapping output indices back to human-readable label names when making predictions later
with open(os.path.join(SAVE_DIR, "label_list.json"), "w") as f:
    json.dump(LABELS, f, indent=2)

# save the tuned thresholds as a dict mapping each label name to its threshold,
# so label-specific thresholds are available when the model is reloaded for inference
with open(os.path.join(SAVE_DIR, "thresholds.json"), "w") as f:
    json.dump({l: float(t) for l, t in zip(LABELS, ths)}, f, indent=2)

print("Saved to:", SAVE_DIR)


Saved to: meeting-bert-multilabel



## Inference Helper

This cell loads the saved model and uses the tuned thresholds to produce labels.
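
The helper relies on two small JSON files written by the save cell: the ordered label list and the label-to-threshold mapping. The sketch below round-trips toy versions of both through a temporary directory. The three labels and their thresholds are illustrative values, and nothing here touches the real `SAVE_DIR`.

```python
import json
import os
import tempfile

labels = ["Decisions", "Risks", "Budget"]   # toy subset of LABELS
tuned = [0.425, 0.425, 0.45]                # toy thresholds, one per label

with tempfile.TemporaryDirectory() as save_dir:
    # Write the two files the same way the save cell does.
    with open(os.path.join(save_dir, "label_list.json"), "w") as f:
        json.dump(labels, f, indent=2)
    with open(os.path.join(save_dir, "thresholds.json"), "w") as f:
        json.dump(dict(zip(labels, tuned)), f, indent=2)

    # Read them back.
    with open(os.path.join(save_dir, "label_list.json")) as f:
        loaded_labels = json.load(f)
    with open(os.path.join(save_dir, "thresholds.json")) as f:
        ths_map = json.load(f)

# Rebuild the threshold vector by walking the label list, so position i
# always lines up with logit i regardless of key order in the JSON object.
loaded_ths = [ths_map[name] for name in loaded_labels]
print(loaded_labels, loaded_ths)
```

Indexing the mapping by label name, rather than trusting the order of the JSON object, keeps thresholds aligned with logits even if the thresholds file is edited by hand.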


In [None]:
def load_model_for_inference(save_dir: str, model_name: str = MODEL_NAME):
    # load the tokenizer (vocabulary and tokenization rules) saved after training
    tokenizer = AutoTokenizer.from_pretrained(save_dir)
    # load the trained model weights and config
    model = AutoModelForSequenceClassification.from_pretrained(save_dir)
    # load the ordered list of label names saved during training;
    # order matters because logits[i] corresponds to labels[i]
    with open(os.path.join(save_dir, "label_list.json")) as f:
        labels = json.load(f)
    # load the {label_name: threshold} mapping computed earlier by maximizing per-label F1
    with open(os.path.join(save_dir, "thresholds.json")) as f:
        ths_map = json.load(f)
    ths = np.array([ths_map[l] for l in labels], dtype=np.float32)
    return tokenizer, model, labels, ths

tokenizer_inf, model_inf, labels_inf, ths_inf  = load_model_for_inference(SAVE_DIR)

def predict_multilabel(texts: List[str]):
    enc = tokenizer_inf(
        texts,
        padding=True,  # pad to the longest text in the batch
        truncation=True,  # cut off texts longer than max_length
        max_length=512,
        return_tensors="pt"  # return PyTorch tensors
    )

    # put the model into eval mode (disables dropout)
    model_inf.eval()
    with torch.no_grad():
        logits = model_inf(**enc).logits
    probs = 1 / (1 + np.exp(-logits.detach().cpu().numpy()))
    pred = (probs >= ths_inf[None, :]).astype(int)
    results = []
    for i in range(len(texts)):
        on_labels = [labels_inf[j] for j in np.where(pred[i] == 1)[0].tolist()]
        scores = {labels_inf[j]: float(probs[i, j]) for j in range(len(labels_inf))}
        results.append({"labels": on_labels, "scores": scores})
    return results

# Demo
predict_multilabel([
    "We decided to move forward and Bob owns the integration work.",
    "Alice will send the report next Tuesday; timeline pushed by a week."
])


[{'labels': ['Decisions',
   'Action Items',
   'Follow-ups',
   'Risks',
   'Requirements',
   'Budget',
   'Owners',
   'Dependencies'],
  'scores': {'Decisions': 0.46805182099342346,
   'Action Items': 0.5080020427703857,
   'Follow-ups': 0.4553377330303192,
   'Risks': 0.4907306730747223,
   'Requirements': 0.5113127827644348,
   'Timeline': 0.3900125324726105,
   'Budget': 0.5220425724983215,
   'Owners': 0.48205286264419556,
   'Scope': 0.3777032494544983,
   'Dependencies': 0.5080409049987793}},
 {'labels': ['Decisions',
   'Action Items',
   'Follow-ups',
   'Timeline',
   'Owners',
   'Dependencies'],
  'scores': {'Decisions': 0.5114848017692566,
   'Action Items': 0.44036900997161865,
   'Follow-ups': 0.44937658309936523,
   'Risks': 0.40673694014549255,
   'Requirements': 0.4740058183670044,
   'Timeline': 0.5646368265151978,
   'Budget': 0.43413975834846497,
   'Owners': 0.4463684558868408,
   'Scope': 0.4838813543319702,
   'Dependencies': 0.531299889087677}}]
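
The first prediction above can be reproduced by hand from numbers already printed in this notebook: the tuned thresholds and the demo scores (rounded here to three decimals). The sketch compares the tuned cutoffs with a fixed 0.5 cutoff.

```python
# Tuned thresholds, copied from the "Per-label thresholds" output.
thresholds = {
    "Decisions": 0.425, "Action Items": 0.1, "Follow-ups": 0.1, "Risks": 0.425,
    "Requirements": 0.475, "Timeline": 0.55, "Budget": 0.45, "Owners": 0.425,
    "Scope": 0.5, "Dependencies": 0.1,
}

# Scores for the first demo sentence, rounded from the output above.
scores = {
    "Decisions": 0.468, "Action Items": 0.508, "Follow-ups": 0.455, "Risks": 0.491,
    "Requirements": 0.511, "Timeline": 0.390, "Budget": 0.522, "Owners": 0.482,
    "Scope": 0.378, "Dependencies": 0.508,
}

tuned_labels = [name for name, s in scores.items() if s >= thresholds[name]]
fixed_labels = [name for name, s in scores.items() if s >= 0.5]

print("tuned:", tuned_labels)
print("fixed 0.5:", fixed_labels)
```

The tuned cutoffs turn on eight labels, while a fixed 0.5 would turn on four. All ten scores sit between roughly 0.38 and 0.52, which suggests the predicted set here is driven mostly by the thresholds rather than by the model, as one might expect after five epochs on 24 toy examples.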

## Classification Head Demo

In [None]:
# Install dependencies (Colab)
!pip install -q torch transformers


In [None]:
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

In [None]:
# ============================================
# Define a BERT model with a classification head
# ============================================
class BertWithClassificationHead(nn.Module):
    def __init__(self, model_name: str, num_labels: int, dropout: float = 0.1):
        super().__init__()
        # Load pre-trained BERT (without classification head)
        self.bert = AutoModel.from_pretrained(model_name)
        # Hidden size of the [CLS] vector (e.g., 768 for bert-base)
        hidden_size = self.bert.config.hidden_size
        # Dropout layer to help prevent overfitting
        self.dropout = nn.Dropout(dropout)
        # Classification head: linear layer mapping [CLS] → num_labels logits
        self.classifier = nn.Linear(hidden_size, num_labels)  # classification head

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        # Pass inputs through BERT
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            return_dict=True,
        )

        # outputs.last_hidden_state → shape: (batch_size, seq_len, hidden_size)
        # Take the embedding of the [CLS] token (first token, index 0)
        cls_embedding = outputs.last_hidden_state[:, 0, :]  # shape: (batch_size, hidden_size)
        # Apply dropout for regularization
        cls_embedding = self.dropout(cls_embedding)
        # Pass through classification head → shape: (batch_size, num_labels)
        logits = self.classifier(cls_embedding)

        return logits
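
What the head computes can be sketched without BERT: a linear map from the `[CLS]` vector to one logit per label, followed at prediction time by a sigmoid and a cutoff. The sizes, weights, and input below are made-up illustrative values (a hidden size of 4 instead of 768), and dropout is left out because it is inactive in eval mode.

```python
import math

def linear_head(cls_vector, weight, bias):
    """logits[j] = sum_k weight[j][k] * cls_vector[k] + bias[j]"""
    return [sum(w * h for w, h in zip(row, cls_vector)) + b
            for row, b in zip(weight, bias)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

cls_vector = [0.5, -1.0, 0.25, 2.0]      # stand-in for the [CLS] embedding

# One row per label, one column per hidden unit: shape (num_labels, hidden_size),
# the same layout nn.Linear uses for its weight matrix.
weight = [
    [1.0, 0.0, 0.0, 0.5],                # label 0
    [0.0, 1.0, 0.0, 0.0],                # label 1
    [0.0, 0.0, 2.0, -0.25],              # label 2
]
bias = [0.0, 0.5, 0.0]

logits = linear_head(cls_vector, weight, bias)
probs = [sigmoid(z) for z in logits]
preds = [int(p >= 0.5) for p in probs]   # 0.5 on the probability == 0 on the logit

print(logits, [round(p, 3) for p in probs], preds)
```

Each label gets its own weight row and its own sigmoid, so the labels are scored independently. That is what separates this multi-label head from a softmax head, where the outputs compete.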

In [None]:
# ============================================
# Parameters
# ============================================
MODEL_NAME = "bert-base-uncased"
NUM_LABELS = 3  # Example: [Decision, Risk, Budget]
THRESHOLD = 0.5 # Probability cutoff for deciding label presence

In [None]:
# ============================================
# Load tokenizer & model
# ============================================
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = BertWithClassificationHead(MODEL_NAME, NUM_LABELS).to(DEVICE)
