# Pain Level Classification Using GPT-2: Analysis of Patient Notes

**Introduction**

This project fine-tunes a GPT-2 sequence classification model to predict patient pain levels from free-text patient notes. The dataset consists of clinical records containing patient notes, length of hospital stay, treatment information, and pain level assessments.

**Dataset Overview**

The dataset (patient_pain_data.csv) contains several key fields:

- Patient Notes: Free-text descriptions of patient conditions and symptoms
- Days Admitted: Duration of hospital stay
- Treatment Type: Category of treatment (Surgery, Physiotherapy, Medication)
- Surgery Done: Whether surgery was performed (Yes/No)
- Pain Level: Numerical pain scale rating (target variable)

Sample entries range from "I feel a little better today" to "The pain is unbearable," with pain levels recorded on a numerical scale from 1 to 9. Pain level prediction is treated as a multi-class classification problem with 9 distinct classes.

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
from torch.utils.data import Dataset
from transformers import (
    GPT2Tokenizer,
    GPT2ForSequenceClassification,
    Trainer,
    TrainingArguments
)


In [2]:
# Mount Google Drive to access data
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
# Step 1: Load the dataset and display the first five rows
df = pd.read_csv('/content/drive/My Drive/patient_pain_data.csv')
print(df.head())

                  Patient Notes  Days Admitted Treatment Type Surgery Done  \
0  I feel a little better today              7        Surgery          Yes   
1        The pain is unbearable              4  Physiotherapy          Yes   
2       I can walk but it hurts             13     Medication          Yes   
3        I have a mild headache             11  Physiotherapy          Yes   
4     I can't sleep due to pain              8        Surgery          Yes   

   Pain Level  
0           8  
1           7  
2           9  
3           8  
4           5  


In [24]:
# Step 2: Extract texts and labels
texts = df['Patient Notes'].tolist()    # List of text notes
labels = df['Pain Level'].tolist()  # Corresponding labels

In [49]:
# Step 3: Determine the number of unique classes
# Remap labels to consecutive integers starting from 0, as the classifier expects
unique_labels = sorted(set(labels))  # Sorted unique pain levels
label_mapping = {label: i for i, label in enumerate(unique_labels)}  # Original label -> class index

# Map original labels to encoded labels
encoded_labels = [label_mapping[label] for label in labels]
num_labels = len(unique_labels)  # Number of distinct classes

print(f"Number of classes: {num_labels}")


Number of classes: 9
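
Because the model predicts encoded class indices rather than the original pain scores, it can help to keep an inverse mapping for reporting. The following is a minimal sketch using made-up label values; the names `inverse_mapping` and `decode` are illustrative and do not appear in the notebook.

```python
# Hypothetical raw labels standing in for df['Pain Level'].tolist()
labels = [8, 7, 9, 8, 5, 1, 3]

# Same encoding scheme as in Step 3: sorted unique values -> 0..K-1
unique_labels = sorted(set(labels))
label_mapping = {label: i for i, label in enumerate(unique_labels)}

# Inverse mapping: encoded class index -> original pain level
inverse_mapping = {i: label for label, i in label_mapping.items()}

def decode(class_index):
    """Translate a predicted class index back to the original pain level."""
    return inverse_mapping[class_index]

encoded = [label_mapping[label] for label in labels]
decoded = [decode(i) for i in encoded]
print(encoded)  # -> [4, 3, 5, 4, 2, 0, 1]
print(decoded)  # -> [8, 7, 9, 8, 5, 1, 3]
```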


In [58]:

# Step 4: Split the data into training and validation sets (80/20 split)
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, encoded_labels, test_size=0.2, random_state=42
)
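
With nine classes and a small dataset, a purely random split can leave some class out of the training set entirely. The sketch below checks for that situation; the helper `classes_missing_from_train` and the label values are hypothetical. Passing `stratify=encoded_labels` to `train_test_split` is one common way to keep class proportions similar across splits, though it requires at least two samples in every class.

```python
from collections import Counter

def classes_missing_from_train(train_labels, val_labels):
    """Return validation classes that never appear in the training labels."""
    train_counts = Counter(train_labels)
    return sorted(c for c in set(val_labels) if train_counts[c] == 0)

# Hypothetical encoded labels, not taken from the real dataset
train_labels = [0, 1, 1, 2, 4, 4, 5]
val_labels = [1, 3, 5]

print(classes_missing_from_train(train_labels, val_labels))  # -> [3]
```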


In [59]:
# Step 5: Create a custom PyTorch Dataset class
class PatientNotesDataset(Dataset):
    """
    Custom Dataset for patient/doctor notes.
    It tokenizes texts and returns input IDs, attention masks, and labels.
    """
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels  # Encoded labels (integers in 0..num_labels-1)
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]  # Already-encoded class index
        encoding = self.tokenizer(
            text,
            padding='max_length',  # Pad to max_length
            truncation=True,       # Truncate texts longer than max_length
            max_length=self.max_length,
            return_tensors='pt'
        )
        # Remove the batch dimension added by return_tensors='pt' and attach the label
        item = {key: val.squeeze(0) for key, val in encoding.items()}
        item['labels'] = torch.tensor(label, dtype=torch.long)
        return item

In [60]:
# Step 6: Initialize the GPT-2 tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# GPT-2 does not have a pad token by default; use the EOS token as the pad token.
tokenizer.pad_token = tokenizer.eos_token

model = GPT2ForSequenceClassification.from_pretrained(
    "gpt2",
    num_labels=num_labels  # Configure for the correct number of classes
)
# Tell the model which token id is padding so it can locate the last real token in each sequence
model.config.pad_token_id = tokenizer.pad_token_id

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
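
The padding token matters here because, according to the Hugging Face documentation, `GPT2ForSequenceClassification` classifies from the last non-padding token in each row. The sketch below is a simplified pure-Python illustration of that lookup, assuming right padding; it is not the library's implementation. The helper name `last_non_pad_index` is hypothetical, and every token id except 50256 (GPT-2's end-of-text id) is made up.

```python
PAD_ID = 50256  # GPT-2's <|endoftext|> id, reused here as the pad token

def last_non_pad_index(input_ids, pad_token_id=PAD_ID):
    """Index of the last token that is not padding (assumes right padding)."""
    for i in range(len(input_ids) - 1, -1, -1):
        if input_ids[i] != pad_token_id:
            return i
    return -1  # the row contains only padding

# Made-up token ids padded on the right to length 8
row = [40, 1254, 257, 1310, PAD_ID, PAD_ID, PAD_ID, PAD_ID]
print(last_non_pad_index(row))  # -> 3
```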


In [61]:
# Step 7: Create Dataset instances for training and validation
# Use encoded labels when creating the datasets
train_dataset = PatientNotesDataset(train_texts, train_labels, tokenizer, max_length=128)
val_dataset = PatientNotesDataset(val_texts, val_labels, tokenizer, max_length=128)


In [62]:
# Step 8: Define training arguments (wandb and other integrations disabled)
training_args = TrainingArguments(
    output_dir='./results',             # Directory for checkpoints and predictions
    num_train_epochs=3,                 # Number of training epochs
    per_device_train_batch_size=8,      # Batch size per device during training
    per_device_eval_batch_size=8,       # Batch size for evaluation
    evaluation_strategy="epoch",        # Evaluate at the end of each epoch (named eval_strategy in newer transformers releases)
    save_strategy="epoch",              # Save a checkpoint at the end of each epoch
    logging_steps=10,                   # Log every 10 steps
    load_best_model_at_end=True,        # Load the best model based on evaluation loss
    metric_for_best_model="eval_loss",  # Criterion for best model selection
    report_to=[],                       # Disable integrations like wandb
)


In [63]:
# Step 9: Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
)
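
As configured, the Trainer reports only the loss at evaluation time. The sketch below shows one way an accuracy metric could be supplied through the `compute_metrics` argument of `Trainer`. The function body is an assumption rather than part of the original notebook, and it assumes the predictions arrive as a single array of logits.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Accuracy from the (logits, labels) pair the Trainer passes at evaluation."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # Most likely class per example
    return {"accuracy": float((predictions == labels).mean())}

# Stand-alone check with made-up logits for 3 examples and 3 classes
logits = np.array([[2.0, 0.1, 0.3], [0.2, 0.1, 1.5], [0.9, 1.1, 0.4]])
labels = np.array([0, 2, 0])
print(compute_metrics((logits, labels)))
```

It would be wired in as `Trainer(..., compute_metrics=compute_metrics)`, after which `trainer.evaluate()` should report an `eval_accuracy` entry alongside `eval_loss`.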



In [64]:
# Step 10: Train the model
print("Starting training...")
trainer.train()

Starting training...


Epoch  Training Loss  Validation Loss
    1         No log         2.918393
    2       3.792800         2.304745
    3       3.792800         2.266632


TrainOutput(global_step=15, training_loss=3.2614095052083334, metrics={'train_runtime': 275.8821, 'train_samples_per_second': 0.435, 'train_steps_per_second': 0.054, 'total_flos': 7839397969920.0, 'train_loss': 3.2614095052083334, 'epoch': 3.0})
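
The "No log" entry for epoch 1 and the repeated training loss of 3.792800 follow from `logging_steps=10` combined with the small number of steps per epoch. The arithmetic can be sketched from the reported numbers, assuming a single device and no gradient accumulation. The training-set size is an estimate recovered from the rounded throughput figure, not a value stated in the notebook.

```python
import math

# Values reported in TrainOutput and TrainingArguments
global_step = 15
num_epochs = 3
batch_size = 8
logging_steps = 10
train_runtime = 275.8821      # seconds
samples_per_second = 0.435    # rounded in the log

steps_per_epoch = global_step // num_epochs

# Estimated training-set size (assumes one device, no gradient accumulation)
samples_seen = round(train_runtime * samples_per_second)
est_train_size = samples_seen // num_epochs
assert math.ceil(est_train_size / batch_size) == steps_per_epoch

# The first training-loss log lands on step 10, at the end of epoch 2
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)
# Step 20 is never reached, so only one log entry is written
logs_written = global_step // logging_steps

print(steps_per_epoch, est_train_size, first_logged_epoch, logs_written)  # -> 5 40 2 1
```

With only one logged value, the table shows the same training loss for epochs 2 and 3. A smaller `logging_steps` would likely give a separate training loss for each epoch.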

In [65]:
# Step 11: Evaluate the model on the validation set
eval_results = trainer.evaluate()
print("Evaluation results:")
print(eval_results)

Evaluation results:
{'eval_loss': 2.266631841659546, 'eval_runtime': 3.8019, 'eval_samples_per_second': 2.63, 'eval_steps_per_second': 0.526, 'epoch': 3.0}
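
A useful reference point for the evaluation loss is the cross-entropy of a uniform guess over the nine classes, which is ln 9. The sketch below compares the two, assuming the reported `eval_loss` is a mean cross-entropy measured in nats.

```python
import math

num_labels = 9
eval_loss = 2.266631841659546  # from the evaluation output

# Cross-entropy of predicting every class with probability 1/9
uniform_baseline = math.log(num_labels)

print(f"uniform baseline: {uniform_baseline:.4f}")
print(f"eval loss:        {eval_loss:.4f}")
print(f"gap:              {eval_loss - uniform_baseline:+.4f}")
```

Under that assumption, the validation loss after three epochs sits slightly above the uniform baseline. This suggests the model has not yet picked up a strong signal from so few examples, but an accuracy figure would be needed to confirm it.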
