In [1]:
import pandas as pd
from datasets import Dataset
from transformers import GPT2Tokenizer

ModuleNotFoundError: No module named 'datasets'
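
The `ModuleNotFoundError` above indicates that the `datasets` package is not installed in the notebook's environment; installing it (for example with `pip install datasets transformers`) is the usual fix. A minimal sketch for checking which required packages are missing before running the remaining cells follows. The helper name `missing_packages` is hypothetical and only uses the standard library.

```python
import importlib.util

def missing_packages(names):
    """Return the top-level package names that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Packages imported by this notebook
required = ["pandas", "datasets", "transformers"]
print("Missing packages:", missing_packages(required))
```

An empty list means the imports in the first cell should succeed.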

In [None]:
# Step 1: Load the dataset into a pandas DataFrame
df = pd.read_csv('St_Paul_hospital_train.csv')

df.head()

In [None]:
# Step 2: Format the data as prompt
def format_data(row):
    return f"Text: {row['medical_text']} Diagnosis: {row['diagnosis']}"

# Apply to DataFrame
df['formatted'] = df.apply(format_data, axis=1)

In [None]:

# Step 3: Convert DataFrame to Hugging Face Dataset
dataset = Dataset.from_pandas(df[['formatted']])


In [None]:

# Step 4: Load the GPT2 Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

In [None]:
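# GPT-2 ships without a padding token, so reuse the end-of-sequence token;
# otherwise padding="max_length" raises an error
tokenizer.pad_token = tokenizer.eos_token

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['formatted'], padding="max_length", truncation=True, max_length=512)

tokenized_dataset = dataset.map(tokenize_function, batched=True)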

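# Step 5: Split the dataset into train and validation sets
# Split once so the two sets are disjoint; calling train_test_split twice
# reshuffles each time and can leak validation rows into training
split = tokenized_dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = split["train"]
val_dataset = split["test"]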

# Save the tokenized datasets for fine-tuning
train_dataset.save_to_disk('train_dataset')
val_dataset.save_to_disk('val_dataset')
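
A validation score is only meaningful if no validation row was also seen during training, which holds when both sets come from a single shuffle. The sketch below shows that property on plain row indices. `split_indices` is a hypothetical helper for illustration and is not part of the `datasets` API.

```python
import random

def split_indices(n_rows, test_size=0.1, seed=42):
    """Shuffle row indices once and cut them into disjoint train/validation lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_val = max(1, int(round(n_rows * test_size)))
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = split_indices(100)
# No row appears in both sets, and together they cover every row
assert set(train_idx).isdisjoint(val_idx)
assert len(train_idx) + len(val_idx) == 100
print(len(train_idx), len(val_idx))
```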

In [None]:
from transformers import GPT2LMHeadModel, TrainingArguments, Trainer

# Load the pre-trained GPT-2 model
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Step 6: Set up training arguments
training_args = TrainingArguments(
    output_dir="./gpt2-medical-diagnosis",  # Directory to save the model
    num_train_epochs=3,  # Number of training epochs
    per_device_train_batch_size=2,  # Batch size for training
    per_device_eval_batch_size=2,  # Batch size for evaluation
    warmup_steps=500,  # Number of warmup steps for learning rate scheduler
    weight_decay=0.01,  # Strength of weight decay
    logging_dir="./logs",  # Directory for storing logs
    logging_steps=10,  # Log every 10 steps
    evaluation_strategy="epoch",  # Evaluate every epoch (named eval_strategy in newer transformers releases)
)
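
Whether `warmup_steps=500` is reasonable depends on the total number of optimizer steps, which follows from the training-set size, batch size, and epoch count. A rough sketch of that arithmetic, assuming a single device, is given below. The row count of 900 is a made-up placeholder, since the size of the CSV is not stated here, and `total_training_steps` is a hypothetical helper.

```python
import math

def total_training_steps(n_train, batch_size, epochs, grad_accum=1):
    """Approximate the number of optimizer steps for a single-device run."""
    batches_per_epoch = math.ceil(n_train / batch_size)
    updates_per_epoch = max(1, batches_per_epoch // grad_accum)
    return updates_per_epoch * epochs

# Hypothetical training-set size; the real value is len(train_dataset)
n_train = 900
steps = total_training_steps(n_train, batch_size=2, epochs=3)
warmup_steps = 500
print(f"total steps: {steps}, warmup fraction: {warmup_steps / steps:.0%}")
```

If the warmup fraction comes out large, the learning rate spends much of the run ramping up, and a smaller `warmup_steps` value may be worth trying.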

In [None]:
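# Step 7: Set up the Trainer
from transformers import DataCollatorForLanguageModeling

# The tokenized data has no "labels" column, so the Trainer cannot compute a loss.
# With mlm=False this collator copies input_ids into labels for causal LM training
# and masks padding positions out of the loss.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
)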

In [None]:
# Step 8: Fine-tune the model
trainer.train()



In [None]:
# Step 9: Save the fine-tuned model
model.save_pretrained("./fine-tuned-gpt2-medical-diagnosis")
tokenizer.save_pretrained("./fine-tuned-gpt2-medical-diagnosis")

In [None]:
# Inference with the fine-tuned model

In [None]:
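def predict_diagnosis(medical_text):
    # Format the prompt the same way as the training examples
    prompt = f"Text: {medical_text} Diagnosis:"

    # Tokenize the prompt and move it to the model's device
    inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True).to(model.device)

    # Generate a continuation; max_new_tokens counts only generated tokens,
    # whereas max_length=50 would also count the prompt and cut long inputs short
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id,
    )

    # Decode only the newly generated tokens, skipping the prompt
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    diagnosis = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

    return diagnosis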




In [None]:
# Example use case
medical_text = "Severe gastrointestinal dysmotility is a newly recognized paraneoplastic syndrome that occurs with small-cell lung carcinoma."
predicted_diagnosis = predict_diagnosis(medical_text)
print(predicted_diagnosis)
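
Generated text may still contain the prompt, and the model may run on past the diagnosis into a new "Text: ..." example. One possible post-processing step, sketched here with a hypothetical helper `extract_diagnosis` and a made-up output string, keeps only what follows the first `Diagnosis:` marker and stops at the next `Text:` marker or line break.

```python
def extract_diagnosis(generated_text, marker="Diagnosis:"):
    """Return the first line of text after the first marker, if present."""
    head, sep, tail = generated_text.partition(marker)
    # If the marker is absent, treat the whole string as the answer
    answer = tail if sep else head
    # The model may continue into a new "Text:" example; cut it off there
    answer = answer.split("Text:")[0]
    lines = answer.strip().splitlines()
    return lines[0].strip() if lines else ""

# Made-up model output, for illustration only
sample = "Text: Persistent cough. Diagnosis: neoplasms Text: Another note. Diagnosis: other"
print(extract_diagnosis(sample))
```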

In [None]:
Q3.

1. Traditional Models (e.g., Logistic Regression, Decision Trees):
   Strengths:
   - Simple to implement and interpret.
   - Fast inference, because they operate on manually engineered features rather than large neural networks.
   Weaknesses:
   - May not capture complex relationships between words, or medical-domain context, as well as more advanced models.
   - Require manual feature extraction (TF-IDF, Bag of Words), which may miss the nuances of medical language.
   - Accuracy is usually lower than that of neural models, especially on complex, unstructured data such as medical text.
2. Transfer Learning Model (BERT):
   Strengths:
   - BERT models context in both directions, so it captures relationships between words more effectively than traditional models, especially when fine-tuned on domain-specific (medical) text.
   - Fine-tuning a pre-trained model typically needs less training time than training from scratch and tends to generalize better.
   Weaknesses:
   - Computationally expensive in both training and inference.
   - Requires more memory and resources than traditional models.
   - Gains from fine-tuning may plateau, especially on smaller datasets.
3. Fine-tuned GPT Model:
   Strengths:
   - GPT-2 is a generative model with strong language-generation performance. It can be adapted for classification by prompting it so that the next tokens it predicts are the diagnosis.
   - GPT-2 can generate coherent and contextually relevant text, which might help with complex medical descriptions.
   - After fine-tuning, GPT-2 is flexible: depending on how it is fine-tuned, it can in some cases provide human-readable explanations alongside predictions.
   Weaknesses:
   - Like BERT, GPT-2 is computationally intensive and may have slower inference, particularly for long inputs.
   - Fine-tuning a large model on a relatively small dataset (like the one used in this task) can lead to overfitting or suboptimal performance if not handled carefully.
   - Because GPT models are generative, they are more prone to "hallucinated" results (i.e., plausible-sounding but incorrect predictions).
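
The limitation of manually engineered features noted under traditional models can be seen in a few lines. The sketch below builds a bag-of-words count by hand; the `bag_of_words` helper and the two phrases are illustrative and not taken from the dataset. Two phrases with different clinical meaning map to identical features because word order is discarded.

```python
from collections import Counter

def bag_of_words(text):
    """Count lowercase whitespace-separated tokens, ignoring their order."""
    return Counter(text.lower().split())

a = bag_of_words("pain without fever")
b = bag_of_words("fever without pain")
# Both phrases contain the same words, so their feature counts are equal
print(a == b)  # True
```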

## Conclusion:
Traditional models suit small-scale problems with limited data and are highly efficient in terms of computational resources, but they struggle to capture the complexity of medical text.
BERT typically improves accuracy over traditional models by using transfer learning, but it is computationally expensive.
GPT-2, while broadly comparable to BERT in accuracy, offers additional flexibility because it can generate explanations and responses. However, it can overfit on small datasets and is also computationally intensive.