
### Clinical Note Classification using PEFT (LoRA), RoLA, and OpenAI Fine-Tuning

This project compares three fine-tuning approaches for classifying clinical notes by condition (e.g., distinguishing Type 1 from Type 2 diabetes), all trained and evaluated on a shared dataset, `clinical_notes_large.csv`.

#### Goals:
- Compare **PEFT (LoRA)**, **RoLA**, and **OpenAI** fine-tuning pipelines
- Use the same dataset and metrics for a fair comparison
- Provide annotated code for reproducibility

---

#### Dataset: `clinical_notes_large.csv`

- Contains synthetic clinical note texts and associated disease labels
- Each method will load, preprocess, and use the same training and evaluation splits

---

In [1]:
## Env
import os
import json
from dotenv import load_dotenv

load_dotenv(override=True)
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')

In [2]:
# Install dependencies if needed:
# !pip install peft transformers datasets accelerate torch scikit-learn pandas joblib openai python-dotenv tqdm

# Import Libraries
import pandas as pd
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from peft import get_peft_model, LoraConfig, TaskType
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import torch

In [3]:
print(torch.cuda.is_available())

True


### Step 1: Load, Encode, and Split the Dataset

In [4]:
# Load dataset
df = pd.read_csv("clinical_notes_large.csv")

# Check for missing values
print("Missing values before drop:\n", df[["text", "label"]].isna().sum())

# Drop rows with missing clinical text or label if exists
df = df.dropna(subset=["text", "label"])

# Preview dataset
print("Sample data:\n", df.head())

# Optional: display label distribution
print("\nLabel distribution:\n", df["label"].value_counts())

Missing values before drop:
 text     0
label    0
dtype: int64
Sample data:
                                                 text              label
0  Patient experienced chest pain and underwent E...  cardiac condition
1  Adult-onset diabetes, family history positive,...    diabetes type 2
2  Patient on ACE inhibitors for essential hypert...       hypertension
3  Patient experienced chest pain and underwent E...  cardiac condition
4  Early onset diabetes, C-peptide levels extreme...    diabetes type 1

Label distribution:
 label
hypertension         293
diabetes type 1      290
cardiac condition    262
diabetes type 2      255
Name: count, dtype: int64
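
The preview shows rows 0 and 3 opening with the same text. If the synthetic notes contain exact duplicates, the same note can land on both sides of a train/evaluation split and inflate the scores. A minimal sketch of a check, assuming that case- and whitespace-insensitive matching is an adequate notion of "duplicate"; `normalize` and `overlap_fraction` are hypothetical helpers, not part of the notebook:

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def overlap_fraction(train_texts, eval_texts) -> float:
    """Fraction of evaluation texts that also occur in the training texts."""
    eval_texts = list(eval_texts)
    if not eval_texts:
        return 0.0
    seen = {normalize(t) for t in train_texts}
    hits = sum(normalize(t) in seen for t in eval_texts)
    return hits / len(eval_texts)


# Toy example: one of the two evaluation notes is a copy of a training note.
train_notes = ["Patient on ACE inhibitors.", "Early onset diabetes."]
eval_notes = ["patient on  ACE inhibitors.", "Chest pain, underwent ECG."]
print(overlap_fraction(train_notes, eval_notes))  # 0.5
```

Once the split in the following cells exists, `overlap_fraction(train_df["text"], eval_df["text"])` would give the share of evaluation notes that the model has already seen during training.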


In [5]:
# Encode labels
label_encoder = LabelEncoder()
df["label"] = label_encoder.fit_transform(df["label"])

In [6]:

# Train-test split
train_df, eval_df = train_test_split(df, test_size=0.2, stratify=df['label'], random_state=42)

# Save to disk for use in all pipelines
train_df.to_csv("train.csv", index=False)
eval_df.to_csv("eval.csv", index=False)
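
With `stratify`, each class contributes to the evaluation split roughly in proportion to its share of the data. A worked sketch using the label counts printed above, assuming the split takes close to `test_size` of each class (scikit-learn's exact allocation can differ by a row in some cases). `LabelEncoder` assigns integer codes in sorted label order, which is what the `codes` mapping reproduces:

```python
label_counts = {
    "hypertension": 293,
    "diabetes type 1": 290,
    "cardiac condition": 262,
    "diabetes type 2": 255,
}
test_size = 0.2

# LabelEncoder sorts class names, so codes follow alphabetical order.
codes = {name: code for code, name in enumerate(sorted(label_counts))}

# Approximate per-class evaluation rows under stratification.
expected_eval = {
    codes[name]: round(count * test_size) for name, count in label_counts.items()
}

for code in sorted(expected_eval):
    print(code, expected_eval[code])
print("total:", sum(expected_eval.values()))
```

The resulting counts (52, 58, 51, 59; 220 in total) match the per-class support shown in the classification report later in the notebook.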


### Step 2: Fine-Tuning with PEFT (LoRA)

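Use Hugging Face's `peft` and `transformers` libraries to fine-tune a base model with LoRA. LoRA freezes the pretrained weights and trains only small low-rank adapter matrices, which keeps memory and compute requirements low and makes it a good fit for resource-limited environments.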
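
To make "parameter-efficient" concrete: a LoRA adapter on a linear layer with `d_in` inputs and `d_out` outputs adds two matrices, A of shape (r, d_in) and B of shape (d_out, r), for a total of r * (d_in + d_out) parameters. A back-of-the-envelope sketch for the configuration used in this step (`r=8` on `q_lin` and `v_lin`), assuming DistilBERT-base has 6 transformer layers and a hidden size of 768; `lora_adapter_params` is a hypothetical helper:

```python
def lora_adapter_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


hidden_size = 768        # DistilBERT-base hidden size (assumed)
num_layers = 6           # DistilBERT-base transformer layers (assumed)
targets_per_layer = 2    # q_lin and v_lin
rank = 8

per_adapter = lora_adapter_params(hidden_size, hidden_size, rank)
total = per_adapter * targets_per_layer * num_layers

# Size of the frozen weight matrices that the adapters sit next to.
full_weights = hidden_size * hidden_size * targets_per_layer * num_layers

print(per_adapter)            # 12288
print(total)                  # 147456
print(total / full_weights)   # adapter size relative to the targeted matrices
```

This counts the adapters only. With `TaskType.SEQ_CLS`, PEFT typically keeps the classification head trainable as well, so `model.print_trainable_parameters()` can report a larger number.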


In [7]:
# Convert to Hugging Face Dataset
train_ds = Dataset.from_pandas(train_df)
eval_ds = Dataset.from_pandas(eval_df)

# 4. Tokenization
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

def tokenize(example):
    return tokenizer(example["text"], truncation=True, padding="max_length", max_length=512)

train_ds = train_ds.map(tokenize, batched=True)
eval_ds = eval_ds.map(tokenize, batched=True)

train_ds = train_ds.remove_columns(["text"])
eval_ds = eval_ds.remove_columns(["text"])

# 5. PEFT with LoRA
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_lin", "v_lin"],
    bias="none"
)

base_model = AutoModelForSequenceClassification.from_pretrained(
    model_ckpt,
    num_labels=len(df["label"].unique())
)
model = get_peft_model(base_model, lora_config)

# 6. Training Setup
training_args = TrainingArguments(
    output_dir="./lora_model",
    evaluation_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    logging_dir="./logs",
    logging_steps=10,
    save_total_limit=1,
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

# 7. Trainer API
peft_trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    tokenizer=tokenizer
)

# 8. Train
peft_trainer.train()

# 9. Save label encoder
import joblib
joblib.dump(label_encoder, "label_encoder.pkl")

Map:   0%|          | 0/880 [00:00<?, ? examples/s]

Map:   0%|          | 0/220 [00:00<?, ? examples/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
No label_names provided for model class `PeftModelForSequenceClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Epoch,Training Loss,Validation Loss
1,0.6235,0.542932
2,0.0833,0.052543
3,0.0317,0.022216


['label_encoder.pkl']

### Step 3: Fine-Tuning with RoLA (Representation-Oriented Learning Alignment)

This method aligns intermediate representations for robustness. It is useful when generalization and structure-aware learning are needed.
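
The representation term used in the trainer for this step can be illustrated without a model: it is the mean squared error between the batch's pairwise cosine-similarity matrix and the identity matrix. A pure-Python sketch on toy 2-D vectors standing in for the `[CLS]` embeddings; `similarity_penalty` is a hypothetical name, and this mirrors the placeholder loss in this notebook rather than a reference RoLA implementation:

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def similarity_penalty(embeddings):
    """MSE between the pairwise cosine-similarity matrix and the identity."""
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        for j in range(n):
            target = 1.0 if i == j else 0.0
            total += (cosine(embeddings[i], embeddings[j]) - target) ** 2
    return total / (n * n)


orthogonal = [[1.0, 0.0], [0.0, 1.0]]   # distinct samples, dissimilar embeddings
collapsed = [[1.0, 0.0], [2.0, 0.0]]    # distinct samples, same direction

print(similarity_penalty(orthogonal))  # 0.0
print(similarity_penalty(collapsed))   # 0.5
```

Because the target ignores labels, two notes from the same class are pushed apart just like notes from different classes; the 0.2 weight in the trainer limits how strongly this term competes with the cross-entropy loss.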


In [8]:

# Placeholder implementation: reuse the Hugging Face Trainer and mark where
# RoLA-specific logic would go.
# RoLA training for clinical note classification with clinical_notes_large.csv

import pandas as pd
from datasets import Dataset
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
import torch
from torch.nn import functional as F

# Data: reuse train_df and eval_df (stratified, label-encoded) from Step 1 so
# that both pipelines train and evaluate on identical splits.

# Convert to Hugging Face Dataset
train_ds = Dataset.from_pandas(train_df)
eval_ds = Dataset.from_pandas(eval_df)

# Tokenizer and model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=len(df['label'].unique()))

def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=512)

train_ds = train_ds.map(tokenize, batched=True)
eval_ds = eval_ds.map(tokenize, batched=True)


# Custom RoLA-style loss Trainer

class RoLALossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):  # Added **kwargs
        outputs = model(**inputs, output_hidden_states=True)
        logits = outputs.logits

        # Take the [CLS] token of the last hidden layer as the note embedding
        if hasattr(outputs, "hidden_states") and outputs.hidden_states:
            cls_embeddings = outputs.hidden_states[-1][:, 0, :]  # shape: (batch_size, hidden_size)
        else:
            raise ValueError("No hidden_states found in outputs. Enable `output_hidden_states=True`")

        # Pairwise cosine similarity between all embeddings in the batch
        cosine_sim = F.cosine_similarity(cls_embeddings.unsqueeze(1), cls_embeddings.unsqueeze(0), dim=-1)
        # Target is the identity matrix: every pair of distinct samples is pushed
        # toward zero similarity, regardless of their labels
        target_sim = torch.eye(cls_embeddings.size(0)).to(cls_embeddings.device)
        sim_loss = F.mse_loss(cosine_sim, target_sim)

        ce_loss = F.cross_entropy(logits, inputs["labels"])
        total_loss = ce_loss + 0.2 * sim_loss  # Cross-entropy plus weighted similarity penalty
        return (total_loss, outputs) if return_outputs else total_loss

training_args = TrainingArguments(
    output_dir="./rola_output",
    evaluation_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    logging_dir="./logs_rola",
    save_strategy="epoch"
)

# Train using RoLA-style loss
rola_trainer = RoLALossTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    tokenizer=tokenizer
)

rola_trainer.train()


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/880 [00:00<?, ? examples/s]

Map:   0%|          | 0/220 [00:00<?, ? examples/s]



Epoch,Training Loss,Validation Loss
1,No log,0.026407
2,No log,0.020353
3,No log,0.019489


TrainOutput(global_step=330, training_loss=0.07629402623032079, metrics={'train_runtime': 835.9185, 'train_samples_per_second': 3.158, 'train_steps_per_second': 0.395, 'total_flos': 694625659453440.0, 'train_loss': 0.07629402623032079, 'epoch': 3.0})

### Step 5: Evaluate All Models

In [9]:
from sklearn.metrics import accuracy_score, f1_score, classification_report
from transformers import AutoTokenizer
from datasets import Dataset

# Reuse eval_df and label_encoder from Step 1. Refitting LabelEncoder on the
# already-encoded integer column would discard the original class names.

# Convert eval set to Hugging Face Dataset
eval_ds_raw = Dataset.from_pandas(eval_df.reset_index(drop=True))

# One tokenized eval set serves both models: bert-base-uncased and
# distilbert-base-uncased share the same WordPiece vocabulary, and only
# input_ids, attention_mask, and label are kept below.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenization function
def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=512)

# Tokenize the evaluation dataset
eval_ds = eval_ds_raw.map(tokenize, batched=True)
eval_ds.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

def evaluate_model(trainer, eval_dataset, model_name=""):
    try:
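        # Skip hidden states so that only logits are collected (the RoLA
        # trainer's compute_loss requests output_hidden_states=True).
        preds_output = trainer.predict(eval_dataset, ignore_keys=["hidden_states"])
        predictions = preds_output.predictions
        labels = preds_output.label_ids

        # PredictionOutput is itself a named tuple, so test the predictions
        # field instead: it is a tuple when the model returns more than logits.
        if isinstance(predictions, tuple):
            predictions = predictions[0]

        preds = predictions.argmax(-1)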
        accuracy = accuracy_score(labels, preds)
        f1 = f1_score(labels, preds, average='weighted')
        report = classification_report(labels, preds)

        print(f"\n ===== {model_name} Evaluation =====")
        print(f" Accuracy: {accuracy:.4f}")
        print(f" F1 Score (weighted): {f1:.4f}")
        print(" Classification Report:\n", report)

        return accuracy, f1, report

    except Exception as e:
        print(f" Error during evaluation for {model_name}: {e}")
        return None, None, None

# Evaluate PEFT/LoRA
peft_acc, peft_f1, peft_report = evaluate_model(peft_trainer, eval_ds, model_name="PEFT/LoRA")

# Evaluate RoLA
rola_acc, rola_f1, rola_report = evaluate_model(rola_trainer, eval_ds, model_name="RoLA")

Map:   0%|          | 0/220 [00:00<?, ? examples/s]


 ===== PEFT/LoRA Evaluation =====
 Accuracy: 1.0000
 F1 Score (weighted): 1.0000
 Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        52
           1       1.00      1.00      1.00        58
           2       1.00      1.00      1.00        51
           3       1.00      1.00      1.00        59

    accuracy                           1.00       220
   macro avg       1.00      1.00      1.00       220
weighted avg       1.00      1.00      1.00       220



 Error during evaluation for RoLA: 'tuple' object has no attribute 'argmax'
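
The RoLA error above is consistent with how `Trainer.predict` packages outputs: when a model returns more than logits (here, the hidden states requested in `compute_loss`), the `predictions` field is a tuple whose first element is the logits, and a tuple has no `argmax`. A sketch of unpacking that shape, using nested lists in place of NumPy arrays; `extract_logits` and `argmax_rows` are hypothetical helpers:

```python
def extract_logits(predictions):
    """Return the logits whether predictions is bare logits or (logits, extras...)."""
    if isinstance(predictions, tuple):
        return predictions[0]
    return predictions


def argmax_rows(logits):
    """Index of the largest score in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


logits = [[0.1, 2.3, -1.0, 0.4], [1.9, 0.2, 0.3, 0.1]]
hidden_states = ("layer outputs would go here",)   # stand-in for extra model outputs

plain = logits                        # shape returned for a logits-only model
with_extras = (logits, hidden_states)  # shape returned when extras are included

print(argmax_rows(extract_logits(plain)))        # [1, 0]
print(argmax_rows(extract_logits(with_extras)))  # [1, 0]
```

With real outputs, the same check applies to `preds_output.predictions` before calling `.argmax(-1)`.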



### Step 4: Fine-Tuning with OpenAI GPT (External API)

OpenAI fine-tuning runs remotely: you upload a training file, then start a fine-tuning job that references it.

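First, convert the dataset to the `.jsonl` chat format used for chat-model fine-tuning: one JSON object per line of the form `{"messages": [...]}`, where the final message has the role `assistant` and holds the target label.

In [10]:
# 1. Prepare Training JSONL
import json

# Same system prompt as the inference cell, so training and inference match.
SYSTEM_PROMPT = "You are a clinical note classifier. Return only the label."

def to_chat_record(row):
    # The target goes in the final assistant message. Message content must be
    # a string, so the encoded integer label is cast with str().
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": row["text"]},
            {"role": "assistant", "content": str(row["label"])},
        ]
    }

openai_train = train_df.apply(to_chat_record, axis=1)

with open("openai_train.jsonl", "w") as f:
    for line in openai_train:
        f.write(json.dumps(line) + "\n")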
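
Several jobs in the listing further down show `failed`, and a training file that does not pass validation can be one cause. A minimal local check, assuming the chat fine-tuning layout (a `messages` list of role/content pairs ending with an `assistant` message). It covers only a subset of what the service validates, and `check_chat_line` is a hypothetical helper:

```python
import json

# Roles accepted by this sketch (tool/function roles are left out on purpose).
ALLOWED_ROLES = {"system", "user", "assistant"}


def check_chat_line(line: str) -> list:
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i} content is not a string")
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message is not from 'assistant'")
    return problems


good = json.dumps({"messages": [
    {"role": "user", "content": "Adult-onset diabetes, family history positive."},
    {"role": "assistant", "content": "2"},
]})
bad = json.dumps({"messages": [{"role": "user", "content": "note"}], "completion": 2})

print(check_chat_line(good))  # []
print(check_chat_line(bad))   # ["last message is not from 'assistant'"]
```

Running this over every line of `openai_train.jsonl` before the upload catches layout problems locally instead of waiting for the remote validation step.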

In [11]:
# 2. Initialize OpenAI Client
from openai import OpenAI
from tqdm import tqdm
# Load the API key from .env and create the client
load_dotenv(override=True)
api_key = os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key=api_key)

In [12]:
# 3. Upload and Fine-tune
with open("openai_train.jsonl", "rb") as f:  # close the file handle after upload
    train_file = client.files.create(file=f, purpose="fine-tune")

response = client.fine_tuning.jobs.create(
    training_file=train_file.id,
    model="gpt-3.5-turbo",
    # model = "gpt-4o-mini-2024-07-18",
    hyperparameters={"n_epochs": 3, "batch_size": 8}
)

job_id = response.id
fine_tuned_model_id = response.fine_tuned_model  # May be None if job not finished yet

print("\nFine-tune job submitted:", job_id)
print("Fine-tuned model ID (initial):", fine_tuned_model_id)



Fine-tune job submitted: ftjob-JSaCi3hM5eugXYrtwY0xlhA5
Fine-tuned model ID (initial): None


In [13]:
# 4. Monitor Job 
job = client.fine_tuning.jobs.retrieve(job_id)
print("Status:", job.status)

print("\n Training Log:")
for event in client.fine_tuning.jobs.list_events(fine_tuning_job_id=job_id):
    print(f"{event.created_at}: {event.message}")

print("\n Recent Jobs:")
jobs = client.fine_tuning.jobs.list(limit=10)
for j in jobs.data:
    print(j.id, j.status, j.fine_tuned_model)

Status: validating_files

 Training Log:
1753889843: Validating training file: file-53rkwHm9EtRyVuWje1aHWx
1753889843: Created fine-tuning job: ftjob-JSaCi3hM5eugXYrtwY0xlhA5

 Recent Jobs:
ftjob-JSaCi3hM5eugXYrtwY0xlhA5 validating_files None
ftjob-SBCME3bTn7KSgTsCnjQuTL14 failed None
ftjob-OCsEjBKSq5NiUCFK3GBiECbF validating_files None
ftjob-dJ13fHK43LQfbIhulKSpbwYG failed None
ftjob-K07fp2c2Q2BuZUp4Vpbl7o6v failed None
ftjob-qaq2PcDZ6u4z9EIHUWDkbIWR failed None
ftjob-bEJ661Qv5Xq5rEjg9DnNQBuE failed None
ftjob-celwDoTCMt9S1ZpHCLzLGJWE failed None
ftjob-naiX5niE9wSQhdgL1wRbHMHi failed None
ftjob-wA9xBTEqd7mvmMMUhQL3Z7RV failed None
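
The job above is still in `validating_files`, so the monitoring cell has to be re-run until the job finishes. A sketch of a polling loop, assuming `succeeded`, `failed`, and `cancelled` are the terminal statuses; `wait_for_job` is a hypothetical helper that takes the retrieval function as an argument so that it can be exercised without network access:

```python
import time
from types import SimpleNamespace

TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}  # assumed terminal states


def wait_for_job(retrieve, job_id, poll_seconds=30.0, max_polls=240, sleep=time.sleep):
    """Call retrieve(job_id) until the job reaches a terminal status."""
    for _ in range(max_polls):
        job = retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} still running after {max_polls} polls")


# Offline demonstration with a fake retrieval function.
_statuses = iter(["validating_files", "running", "succeeded"])


def fake_retrieve(job_id):
    status = next(_statuses)
    model = "ft:example-model" if status == "succeeded" else None
    return SimpleNamespace(id=job_id, status=status, fine_tuned_model=model)


done = wait_for_job(fake_retrieve, "ftjob-demo", poll_seconds=0, sleep=lambda s: None)
print(done.status, done.fine_tuned_model)  # succeeded ft:example-model
```

In the notebook this would be called as `job = wait_for_job(client.fine_tuning.jobs.retrieve, job_id)`. If the status is `succeeded`, `job.fine_tuned_model` holds the model ID needed for inference.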


In [14]:
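# 4. Save Test Set (Optional)
# Same chat layout as the training file: the label goes in the assistant message as a string.
openai_test = eval_df.apply(lambda row: {
    "messages": [
        {"role": "system", "content": "You are a clinical note classifier. Return only the label."},
        {"role": "user", "content": row["text"]},
        {"role": "assistant", "content": str(row["label"])}
    ]
}, axis=1)

with open("openai_test.jsonl", "w") as f:
    for item in openai_test:
        f.write(json.dumps(item) + "\n")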

In [None]:
# 5. Inference using Fine-Tuned Model 
# Replace this with your actual fine-tuned model ID
model_id = "ft:gpt-3.5-turbo-0613:<your-org>:<your-model-id>"  # <-- UPDATE THIS with final model ID after job completes

true_labels = eval_df["label"].astype(str).tolist()  # the API returns text, so compare labels as strings
predicted_labels = []

for prompt in tqdm(eval_df['text']):
    response = client.chat.completions.create(
        model=model_id,
        messages=[
            {"role": "system", "content": "You are a clinical note classifier. Return only the label."},
            {"role": "user", "content": prompt}
        ],
        temperature=0
    )
    predicted = response.choices[0].message.content.strip()
    predicted_labels.append(predicted)
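
A fine-tuned chat model can still return text that is not exactly one of the labels (extra words, punctuation, or an unseen string), and each such reply would appear as a spurious class in the report. A sketch of a guard, assuming the labels are the encoded integers rendered as strings; `normalize_prediction` and the `unknown` fallback are hypothetical choices:

```python
import re

VALID_LABELS = {"0", "1", "2", "3"}   # encoded classes as strings (assumed)


def normalize_prediction(raw: str, valid=VALID_LABELS, fallback="unknown") -> str:
    """Map raw model text to a known label, or to a fallback if none is found."""
    text = raw.strip()
    if text in valid:
        return text
    # Otherwise accept the reply only if it mentions exactly one valid label.
    tokens = [t for t in re.findall(r"\w+", text) if t in valid]
    if len(set(tokens)) == 1:
        return tokens[0]
    return fallback


print(normalize_prediction(" 2 "))            # 2
print(normalize_prediction("Label: 3."))      # 3
print(normalize_prediction("either 1 or 2"))  # unknown
print(normalize_prediction("hypertension"))   # unknown
```

Applying it as `predicted_labels.append(normalize_prediction(predicted))` keeps the label set fixed, and counting the `unknown` entries shows how often the model strayed from the expected format.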


In [None]:
# 6. Evaluation 
accuracy = accuracy_score(true_labels, predicted_labels)
f1 = f1_score(true_labels, predicted_labels, average="weighted")
report = classification_report(true_labels, predicted_labels)

print("\n===== OpenAI GPT Fine-Tuned Model Evaluation =====")
print(f"Accuracy: {accuracy:.4f}")
print(f" F1 Score (weighted): {f1:.4f}")
print("\n Classification Report:\n", report)