<a href="https://colab.research.google.com/github/thedatasense/llm-healthcare/blob/main/.ipynb_checkpoints/App-C.1-Instruct-checkpoint.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetune with Instruct Dataset

In this section we are going to finetune a base model with instruct dataset to optimize it to be used in a chat framework

In [25]:
!pip install datasets



In [26]:
import json            # For parsing JSON data
import random          # For setting seeds and shuffling data
import requests        # For downloading dataset from URL
import torch           # Main PyTorch library
from torch.utils.data import Dataset, DataLoader  # For dataset handling
from transformers import AutoTokenizer, AutoModelForCausalLM, StoppingCriteria  # HuggingFace components
from tqdm import tqdm   # Progress bar utilities
import re               # For text normalization
from datasets import load_dataset


In [27]:
# Set the CUDA device to GPU 1
print(f"Using device: {torch.cuda.current_device()} - {torch.cuda.get_device_name(torch.cuda.current_device())}")

Using device: 0 - Tesla T4


In [28]:
def set_seed(seed):
    # Set Python's built-in random seed
    random.seed(seed)
    # Set PyTorch's CPU random seed
    torch.manual_seed(seed)
    # Set seed for all available GPUs
    torch.cuda.manual_seed_all(seed)
    # Request cuDNN to use deterministic algorithms
    torch.backends.cudnn.deterministic = True
    # Disable cuDNN's auto-tuner for consistent behavior
    torch.backends.cudnn.benchmark = False

# Dataset

The dataset we use is  [MedInstruct](https://arxiv.org/pdf/2310.14558) where the study curated utilized a small set of high-quality clinician-curated seed tasks with 167 instances to prompt GPT-4 in generating medical
tasks. Similar instructions are removed from the generated medical tasks, creating 52k instances
which are subsequently inputted into ChatGPT for response generation

In [29]:
dataset = load_dataset("lavita/AlpaCare-MedInstruct-52k")

In [30]:
dataset['train'][1:3]

{'output': ["Today's lecture on diabetes provided a comprehensive overview of the condition, its causes, risk factors, and management strategies. \n\nI learned that diabetes is a chronic disease characterized by high blood sugar levels as a result of either insufficient insulin production or impaired insulin function. There are two main types of diabetes: type 1, which is an autoimmune disorder where the body does not produce insulin, and type 2, which is caused by insulin resistance.\n\nOne important takeaway from today's lecture was the recognition that diabetes has become a global epidemic due to various factors such as sedentary lifestyle, unhealthy diet, and obesity. It is crucial to address these underlying causes to prevent and manage diabetes effectively.\n\nFurthermore, I gained insights into the risk factors associated with diabetes, including family history, age, ethnicity, and certain lifestyle choices. Understanding these risk factors can help in identifying individuals wh

In [31]:
def build_prompt(instruction, solution=None):
    # Add solution with end token if provided
    wrapped_solution = ""
    if solution:
        wrapped_solution = f"\n{solution}\n<|im_end|>"

    # Build chat format with system, user, and assistant messages
    return f"""<|im_start|>system
You are a helpful assistant.
<|im_end|>
<|im_start|>user
{instruction}
<|im_end|>
<|im_start|>assistant""" + wrapped_solution

In [32]:
def encode_text(tokenizer, text, return_tensor=False):
    # If tensor output is requested, encode with PyTorch tensors
    if return_tensor:
        return tokenizer.encode(
            text, add_special_tokens=False, return_tensors="pt"
        )
    # Otherwise return list of token IDs
    else:
        return tokenizer.encode(text, add_special_tokens=False)

In [33]:
class EndTokenStoppingCriteria(StoppingCriteria):
    """
    Custom stopping criteria for text generation.
    Stops when a specific end token sequence is generated.

    Args:
        end_tokens (list): Token IDs that signal generation should stop
        device: Device where the model is running
    """
    def __init__(self, end_tokens, device):
        self.end_tokens = torch.tensor(end_tokens).to(device)

    def __call__(self, input_ids, scores):
        """
        Checks if generation should stop for each sequence.

        Args:
            input_ids: Current generated token IDs
            scores: Token probabilities

        Returns:
            tensor: Boolean tensor indicating which sequences should stop
        """
        should_stop = []

        # Check each sequence for end tokens
        for sequence in input_ids:
            if len(sequence) >= len(self.end_tokens):
                # Compare last tokens with end tokens
                last_tokens = sequence[-len(self.end_tokens):]
                should_stop.append(torch.all(last_tokens == self.end_tokens))
            else:
                should_stop.append(False)

        return torch.tensor(should_stop, device=input_ids.device)

In [34]:
class PromptCompletionDataset(Dataset):
    """
    PyTorch Dataset for instruction-completion pairs.
    Handles the conversion of text data into model-ready format.

    Args:
        data (list): List of dictionaries containing instructions and solutions
        tokenizer: Hugging Face tokenizer
    """
    def __init__(self, data, tokenizer):
        self.data = data
        self.tokenizer = tokenizer

    def __len__(self):
        # Return total number of examples
        return len(self.data)

    def __getitem__(self, idx):
        """
        Returns a single training example.

        Args:
            idx (int): Index of the example to fetch

        Returns:
            dict: Contains input_ids, labels, prompt, and expected completion
        """
        # Get example from dataset
        item = self.data[idx]
        # Build full prompt with instruction
        prompt = build_prompt(item["instruction"])
        # Format completion with end token
        completion = f"""{item["solution"]}\n<|im_end|>"""

        # Convert text to token IDs
        encoded_prompt = encode_text(self.tokenizer, prompt)
        encoded_completion = encode_text(self.tokenizer, completion)
        eos_token = [self.tokenizer.eos_token_id]

        # Combine for full input sequence
        input_ids = encoded_prompt + encoded_completion + eos_token
        # Create labels: -100 for prompt (ignored in loss)
        labels = [-100] * len(encoded_prompt) + encoded_completion + eos_token

        return {
            "input_ids": input_ids,
            "labels": labels,
            "prompt": prompt,
            "expected_completion": completion
        }

In the below function we get the data ready

### Padding:


*   We compute max_length by taking the length of the longest input in the batch.
*   We then pad all shorter sequences up to max_length.

### Labels:

*   We mirror the same padding strategy using -100 for padding. By default, -100 is typically ignored in PyTorch’s cross-entropy loss.

### Attention Mask:

*   A mask of 1 corresponds to real tokens, and 0 corresponds to padding tokens.

###  Return Values:

*   We return (input_ids, attention_mask, labels, prompts, expected_completions) as a tuple {prompts and expected_completions remain strings, providing a human-readable reference for debugging or logging}

In [35]:
def collate_fn(batch):
    """
    Collates batch of examples into training-ready format.
    Handles padding and conversion to tensors.

    Args:
        batch: List of examples from Dataset

    Returns:
        tuple: (input_ids, attention_mask, labels, prompts, expected_completions)
    """
    # Find longest sequence for padding
    max_length = max(len(item["input_ids"]) for item in batch)

    # Pad input sequences
    input_ids = [
        item["input_ids"] +
        [tokenizer.pad_token_id] * (max_length - len(item["input_ids"]))
        for item in batch
    ]
    # Pad label sequences
    labels = [
        item["labels"] +
        [-100] * (max_length - len(item["labels"]))
        for item in batch
    ]
    # Create attention masks
    attention_mask = [
        [1] * len(item["input_ids"]) +
        [0] * (max_length - len(item["input_ids"]))
        for item in batch
    ]
    prompts = [item["prompt"] for item in batch]
    expected_completions = [item["expected_completion"] for item in batch]

    return (
        torch.tensor(input_ids),
        torch.tensor(attention_mask),
        torch.tensor(labels),
        prompts,
        expected_completions
    )

In [36]:
def normalize_text(text):
    """
    Normalizes text for consistent comparison.

    Args:
        text (str): Input text

    Returns:
        str: Normalized text
    """
    # Remove leading/trailing whitespace and convert to lowercase
    text = text.strip().lower()
    # Replace multiple whitespace characters with single space
    text = re.sub(r'\s+', ' ', text)
    return text

In [37]:
def generate_text(model, tokenizer, prompt, max_new_tokens=100):
    """
    Generates text completion for a given prompt.

    Args:
        model: Fine-tuned model
        tokenizer: Associated tokenizer
        prompt (str): Input prompt
        max_new_tokens (int): Maximum number of tokens to generate

    Returns:
        str: Generated completion
    """
    # Encode prompt and move to model's device
    input_ids = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Setup end token detection
    end_tokens = tokenizer.encode("<|im_end|>", add_special_tokens=False)
    stopping_criteria = [EndTokenStoppingCriteria(end_tokens, model.device)]

    # Generate completion
    output_ids = model.generate(
        input_ids=input_ids["input_ids"],
        attention_mask=input_ids["attention_mask"],
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.pad_token_id,
        stopping_criteria=stopping_criteria
    )[0]

    # Extract and decode only the generated part
    generated_ids = output_ids[input_ids["input_ids"].shape[1]:]
    generated_text = tokenizer.decode(generated_ids).strip()
    return generated_text


In [38]:
def test_model(model_path, test_input):
    """
    Tests a saved model on a single input.

    Args:
        model_path (str): Path to saved model
        test_input (str): Instruction to test
    """
    # Setup device and load model
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")

    # Load model and tokenizer
    model = AutoModelForCausalLM.from_pretrained(model_path).to(device)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    tokenizer.pad_token = tokenizer.eos_token

    # Generate and display prediction
    prompt = build_prompt(test_input)
    generated_text = generate_text(model, tokenizer, prompt)

    print(f"\nInput: {test_input}")
    print(f"Full generated text: {generated_text}")
    print(f"""Cleaned response: {generated_text.replace("<|im_end|>", "").strip()}""")


In [39]:
def download_and_prepare_data(dataset,tokenizer, batch_size, test_ratio=0.1):
    """
    Downloads and prepares dataset for training.

    Args:
        data_url (str): URL of the dataset
        tokenizer: Tokenizer for text processing
        batch_size (int): Batch size for DataLoader
        test_ratio (float): Proportion of data for testing

    Returns:
        tuple: (train_loader, test_loader)
    """
    dataset_list = []
    for example in dataset['train']:
      dataset_list.append({
        "instruction": example["instruction"],
        "solution": example["output"]
    })

    # Split into train dataset_list test sets
    random.shuffle(dataset)
    split_index = int(len(dataset_list) * (1 - test_ratio))
    train_data = dataset_list[:split_index]
    test_data = dataset_list[split_index:]

    # Print dataset statistics
    print(f"\nDataset size: {len(dataset_list)}")
    print(f"Training samples: {len(train_data)}")
    print(f"Test samples: {len(test_data)}")

    # Create datasets
    train_dataset = PromptCompletionDataset(train_data, tokenizer)
    test_dataset = PromptCompletionDataset(test_data, tokenizer)

    # Create dataloaders
    train_loader = DataLoader(
        train_dataset,
        batch_size=batch_size,
        shuffle=True,
        collate_fn=collate_fn
    )
    test_loader = DataLoader(
        test_dataset,
        batch_size=batch_size,
        shuffle=False,
        collate_fn=collate_fn
    )

    return train_loader, test_loader


In [40]:
def get_hyperparameters():
    """
    Returns training hyperparameters.

    Returns:
        tuple: (num_epochs, batch_size, learning_rate)
    """
    # Fewer epochs for instruction tuning as it's more data-efficient
    num_epochs = 4
    # Standard batch size that works well with most GPU memory
    batch_size = 16
    # Standard learning rate for fine-tuning transformers
    learning_rate = 5e-5

    return num_epochs, batch_size, learning_rate



In [41]:
model_name = "openai-community/gpt2"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


In [20]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

In [21]:
# Get hyperparameters and prepare data
num_epochs, batch_size, learning_rate = get_hyperparameters()
train_loader, test_loader = download_and_prepare_data(dataset, tokenizer, batch_size)

# Initialize optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)


Dataset size: 52002
Training samples: 46801
Test samples: 5201


NameError: name 'collate_fn' is not defined

In [20]:
# Training loop
for epoch in range(num_epochs):
    total_loss = 0  # Tracks cumulative loss for the epoch
    num_batches = 0  # Tracks number of batches processed
    progress_bar = tqdm(train_loader, desc=f"Epoch {epoch + 1}/{num_epochs}")

    for batch in progress_bar:
        # Unpack batch and move to device
        input_ids, attention_mask, labels, _, _ = batch
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        labels = labels.to(device)

        # Forward pass: compute model outputs and loss
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )
        loss = outputs.loss

        # Backward pass: compute gradients and update weights
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        # Update metrics
        total_loss += loss.item()
        num_batches += 1
        average_loss = total_loss / num_batches

        # Update progress bar
        progress_bar.set_postfix({"Loss": average_loss})


Epoch 1/4: 100%|██████████| 2926/2926 [19:13<00:00,  2.54it/s, Loss=1.66]
Epoch 2/4: 100%|██████████| 2926/2926 [19:11<00:00,  2.54it/s, Loss=1.41]
Epoch 3/4: 100%|██████████| 2926/2926 [19:10<00:00,  2.54it/s, Loss=1.28]
Epoch 4/4: 100%|██████████| 2926/2926 [19:09<00:00,  2.55it/s, Loss=1.17]


In [23]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [23]:
torch.save(model.state_dict(),"/content/drive/MyDrive/PhD- LLM Healthcare/models/instr_basic_llm_med.pth")

In [25]:
model.save_pretrained("/content/drive/MyDrive/PhD- LLM Healthcare/models/finetuned_model")
tokenizer.save_pretrained("/content/drive/MyDrive/PhD- LLM Healthcare/models/finetuned_model")


Testing finetuned model:
Using device: cuda

Input: diabetes is caused by
Full generated text: Diabetes is a chronic condition that affects the way your body processes glucose (sugar). When you have diabetes, your body either does not produce enough insulin or cannot effectively use the insulin it produces. Insulin is a hormone that helps regulate blood sugar levels.

When you have diabetes, your body either does not produce enough insulin or cannot effectively use the insulin it produces. This can lead to high blood sugar levels (hyperglycemia) or a condition called diabetes.

Managing
Cleaned response: Diabetes is a chronic condition that affects the way your body processes glucose (sugar). When you have diabetes, your body either does not produce enough insulin or cannot effectively use the insulin it produces. Insulin is a hormone that helps regulate blood sugar levels.

When you have diabetes, your body either does not produce enough insulin or cannot effectively use the insulin 

In [42]:
print("\nTesting finetuned model:")
test_input = "diabetes is caused by"
test_model("/content/drive/MyDrive/PhD- LLM Healthcare/models/finetuned_model", test_input)


Testing finetuned model:
Using device: cuda

Input: diabetes is caused by
Full generated text: Diabetes is a chronic condition that affects the way your body processes glucose (sugar). When you have diabetes, your body either does not produce enough insulin or cannot effectively use the insulin it produces. Insulin is a hormone that helps regulate blood sugar levels.

When you have diabetes, your body either does not produce enough insulin or cannot effectively use the insulin it produces. This can lead to high blood sugar levels (hyperglycemia) or a condition called diabetes.

Managing
Cleaned response: Diabetes is a chronic condition that affects the way your body processes glucose (sugar). When you have diabetes, your body either does not produce enough insulin or cannot effectively use the insulin it produces. Insulin is a hormone that helps regulate blood sugar levels.

When you have diabetes, your body either does not produce enough insulin or cannot effectively use the insulin 