# Patient Data Multiclass Classification

Problem Statement:
This program implements a multiclass classification model to predict a patient's pain level (0-10)
based on their clinical notes. The model uses BERT (Bidirectional Encoder Representations from
Transformers) to analyze the text descriptions and predict the corresponding pain level.

Dataset Description:
The dataset contains patient medical records with the following columns (the model below uses only Patient Notes as input and Pain Level as the label):
- Patient Notes: Text descriptions of patient condition and symptoms
- Days Admitted: Number of days the patient has been in the hospital
- Treatment Type: The type of treatment being administered (Surgery/Physiotherapy/Medication)
- Surgery Done: Whether surgery was performed (Yes/No)
- Pain Level: Numerical score indicating patient's pain level (0-10)

In [1]:
# Required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
from torch.utils.data import Dataset
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from google.colab import drive



In [2]:
# Mount Google Drive to access data
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
# Step 1: Load and display the dataset
df = pd.read_csv('/content/drive/My Drive/patient_pain_data.csv')
print(df.head())

                  Patient Notes  Days Admitted Treatment Type Surgery Done  \
0  I feel a little better today              7        Surgery          Yes   
1        The pain is unbearable              4  Physiotherapy          Yes   
2       I can walk but it hurts             13     Medication          Yes   
3        I have a mild headache             11  Physiotherapy          Yes   
4     I can't sleep due to pain              8        Surgery          Yes   

   Pain Level  
0           8  
1           7  
2           9  
3           8  
4           5  


In [33]:
# Step 2: Extract features and labels
texts = df['Patient Notes'].tolist()  # List of text notes
labels = df['Pain Level'].tolist()    # List of pain levels (0-10)

In [50]:
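# Step 3: Determine number of classes
# Pain levels run 0-10, so the model needs 11 classes (0 through 10).
# Note: max(labels) + 1 equals 11 only if level 10 appears in the data.
num_labels = max(labels) + 1  # Add 1 because class indices start at 0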
print(f"Number of classes: {num_labels}")


Number of classes: 11
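
Deriving the class count from `max(labels) + 1` only yields 11 when the highest pain level present in the data is 10. Below is a minimal sketch of a sanity check, assuming the 0-10 scale from the dataset description. The `labels` list is made-up toy data and `PAIN_SCALE` is a hypothetical name, neither of which is part of the notebook.

```python
from collections import Counter

# Hypothetical toy labels; in the notebook these come from df['Pain Level']
labels = [8, 7, 9, 8, 5, 3, 8]

# Full label space according to the dataset description (0 through 10)
PAIN_SCALE = range(0, 11)

counts = Counter(labels)
missing = [level for level in PAIN_SCALE if counts[level] == 0]

print(f"Classes from the scale: {len(PAIN_SCALE)}")
print(f"Classes implied by max(labels) + 1: {max(labels) + 1}")
print(f"Pain levels with no examples: {missing}")
```

With this toy list the two counts disagree (11 versus 10), which is the situation where passing a fixed `num_labels=11` would be the safer choice.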


In [51]:
# Step 4: Split the data into training and validation sets (80/20 split)
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

In [52]:
# Step 5: Create a custom PyTorch Dataset class
class PatientNotesDataset(Dataset):
    """
    A custom Dataset class for patient/doctor notes.
    It tokenizes the input texts and returns input IDs, attention masks, and labels.
    """
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        # Pain levels are already integers (0-10), so they can be used
        # directly as class indices without a label mapping.
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]
        # Tokenize the text with padding and truncation
        encoding = self.tokenizer(
            text,
            padding='max_length',
            truncation=True,
            max_length=self.max_length,
            return_tensors='pt'
        )
        # Remove only the batch dimension added by return_tensors='pt', then add the label
        item = {key: val.squeeze(0) for key, val in encoding.items()}
        item['labels'] = torch.tensor(label, dtype=torch.long)
        return item
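
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a toy illustration in plain Python. It is not the BERT tokenizer: `pad_or_truncate` is a hypothetical helper, the token IDs are made up, and the pad ID of 0 is an assumption chosen for the example. The real tokenizer also keeps special tokens such as `[CLS]` and `[SEP]` in place when truncating, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                   # truncation: drop the overflow
    n_pad = max_length - len(kept)                  # padding needed, if any
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# A sequence shorter than max_length gets padded
short_ids, short_mask = pad_or_truncate([11, 12, 13], max_length=5)

# A sequence longer than max_length gets cut
long_ids, long_mask = pad_or_truncate([11, 12, 13, 14, 15, 16, 17], max_length=5)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every item ends up with the same length, the default batching logic can stack items into a single tensor without a custom collate function.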

In [53]:
# Step 6: Initialize BERT model components
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=num_labels  # 11 classes for pain levels 0-10
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [54]:
# Step 7: Create Dataset instances for training and validation
train_dataset = PatientNotesDataset(train_texts, train_labels, tokenizer, max_length=128)
val_dataset = PatientNotesDataset(val_texts, val_labels, tokenizer, max_length=128)


In [55]:
# Step 8: Define training arguments with wandb disabled
training_args = TrainingArguments(
    output_dir='./results',             # Directory for model predictions and checkpoints
    num_train_epochs=3,                 # Number of training epochs
    per_device_train_batch_size=8,      # Batch size per device during training
    per_device_eval_batch_size=8,       # Batch size for evaluation
    evaluation_strategy="epoch",        # Evaluate at the end of each epoch
    save_strategy="epoch",              # Save a checkpoint at the end of each epoch
    logging_steps=10,                   # Log every 10 steps
    load_best_model_at_end=True,        # Load the best model at the end of training
    metric_for_best_model="eval_loss",  # Use evaluation loss to pick the best model
    report_to=[]                        # Disable integrations (including wandb)
)



In [56]:
# Step 9: Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
)
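
The `Trainer` above reports only the evaluation loss. One possible addition, sketched here under the assumption that NumPy is available, is a `compute_metrics` function that also reports accuracy. The function body and the toy logits are illustrative; the notebook itself does not define one.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Turn (logits, labels) from the Trainer into an accuracy score."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)   # most likely class per example
    accuracy = float((predictions == labels).mean())
    return {"accuracy": accuracy}

# Toy check with 3 examples and 4 classes (made-up numbers)
toy_logits = np.array([
    [0.1, 2.0, 0.3, 0.0],   # predicts class 1
    [1.5, 0.2, 0.1, 0.0],   # predicts class 0
    [0.0, 0.1, 0.2, 3.0],   # predicts class 3
])
toy_labels = np.array([1, 2, 3])
result = compute_metrics((toy_logits, toy_labels))
print(result)
```

To use it, pass `compute_metrics=compute_metrics` when constructing the `Trainer`; the accuracy should then appear alongside the loss in the evaluation results.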



In [57]:
# Step 10: Train the model
print("Starting training...")
trainer.train()

Starting training...


Epoch  Training Loss  Validation Loss
    1         No log         2.447322
    2       2.320700         2.336524
    3       2.320700         2.340604


TrainOutput(global_step=15, training_loss=2.246491305033366, metrics={'train_runtime': 178.6822, 'train_samples_per_second': 0.672, 'train_steps_per_second': 0.084, 'total_flos': 7893969500160.0, 'train_loss': 2.246491305033366, 'epoch': 3.0})
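
The `No log` entry for epoch 1 and `global_step=15` both follow from the step arithmetic. Here is a small sketch, assuming about 40 training examples (inferred from 15 steps over 3 epochs at batch size 8, not stated in the notebook), a single device, and no gradient accumulation:

```python
import math

n_train = 40          # assumption: inferred, not stated in the notebook
batch_size = 8        # per_device_train_batch_size, single device assumed
epochs = 3            # num_train_epochs
logging_steps = 10    # from TrainingArguments

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

# Steps at which the Trainer records a training loss
logged_at = list(range(logging_steps, total_steps + 1, logging_steps))

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total steps: {total_steps}")
print(f"Training loss logged at steps: {logged_at}")
```

Under these assumptions the only logged training loss arrives at step 10, after epoch 1 has already ended at step 5. That would explain why epoch 1 shows `No log` and why epochs 2 and 3 repeat the same training loss value. A smaller `logging_steps` (for example 1) would give a value for every epoch on a dataset this small.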

In [58]:
# Step 11: Evaluate the model on the validation set
eval_results = trainer.evaluate()
print("Evaluation results:")
print(eval_results)

Evaluation results:
{'eval_loss': 2.336524486541748, 'eval_runtime': 3.1441, 'eval_samples_per_second': 3.181, 'eval_steps_per_second': 0.636, 'epoch': 3.0}
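
One way to read the evaluation loss is to compare it with the cross-entropy of a model that assigns equal probability to all 11 classes, which is ln(11). A short sketch of that comparison follows; the variable names are illustrative.

```python
import math

num_labels = 11
eval_loss = 2.336524486541748    # value reported by trainer.evaluate() above

# Cross-entropy of a uniform prediction over num_labels classes
uniform_loss = math.log(num_labels)

print(f"Uniform-guess loss: {uniform_loss:.4f}")
print(f"Reported eval loss: {eval_loss:.4f}")
print(f"Difference: {uniform_loss - eval_loss:.4f}")
```

The reported loss sits only slightly below the uniform baseline, which suggests the model has learned little beyond guessing on this validation split.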
