## Step 1: Mounting Google Drive and Importing Libraries

In [None]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/multimodal-xray-agent
!ls

Mounted at /content/drive
/content/drive/MyDrive/multimodal-xray-agent
app	      deployment  models	  README.md	    src
chexpert.zip  LICENSE	  notebooks	  requirements.txt
data	      logs	  PROJECT_LOG.md  scripts


In [None]:
import torch
import json
from huggingface_hub import login
from datasets import load_dataset, DatasetDict, load_from_disk, Dataset
from transformers import AutoTokenizer

In [None]:
login()


## Step 2: Loading QA Dataset

In [None]:
# Copy file from GDrive to Colab local runtime
!cp /content/drive/MyDrive/multimodal-xray-agent/data/qapairs/top_700_qa_pairs.jsonl /content/top_700_qa_pairs.jsonl

In [None]:
# Load the data manually
with open("/content/top_700_qa_pairs.jsonl", "r") as f:
    data = [json.loads(line) for line in f]

In [None]:
# Convert to Hugging Face Dataset
dataset = Dataset.from_list(data)

In [None]:
len(dataset)

700
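Before converting to a `Dataset`, it can help to confirm that every record carries the fields used later in this notebook. The sketch below is illustrative and not part of the original pipeline: `load_jsonl` and `REQUIRED_KEYS` are hypothetical names, and the required keys are assumed from the columns the notebook later removes (`uuid`, `question`, `answer`).

```python
import json

# Assumption: these are the fields each JSONL record is expected to contain
REQUIRED_KEYS = {"uuid", "question", "answer"}

def load_jsonl(lines):
    """Parse JSONL lines, skipping blank lines and checking required keys."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. at the end of the file
        record = json.loads(line)
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        records.append(record)
    return records

# Toy input standing in for the real file contents
sample = [
    '{"uuid": "a1", "question": "Q1?", "answer": "A1."}',
    "",
    '{"uuid": "b2", "question": "Q2?", "answer": "A2."}',
]
records = load_jsonl(sample)
print(len(records))
```

With the real file, the same function could be called as `load_jsonl(open("/content/top_700_qa_pairs.jsonl"))`, since a file object iterates over its lines.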

## Step 3: Formatting Dataset for Supervised Fine-Tuning

In this step, we format each QA pair into the instruction-following layout the model will see during training. The `format_example()` function joins the question and answer into a single string, labels them with markdown-style headings (`### Question:` and `### Answer:`), and separates them with a blank line. Marking the boundary explicitly gives the model a consistent cue for where the prompt ends and the response begins.

We then use `.map()` to apply this transformation to the entire dataset, dropping the original `uuid`, `question`, and `answer` columns and keeping only the unified `text` field.

This `text` field is what gets tokenized and fed into the model for fine-tuning.

In [None]:
# Format each QA pair into an instruction-following prompt
def format_example(example):
    return {
        "text": f"### Question:\n{example['question']}\n\n### Answer:\n{example['answer']}"
    }

In [None]:
# Apply formatting to all samples in the dataset
formatted_dataset = dataset.map(format_example, remove_columns=["uuid", "question", "answer"])


In [None]:
formatted_dataset[0]

{'text': '### Question:\nIs there any evidence of disease in the X-ray?\n\n### Answer:\n1. Severe emphysema. 2. Irregular, pleural-parenchymal opacity in left upper lobe. This may irregular pleural-parenchymal scarring, however, recommend comparison with more remote outside imaging, if available to determine long-term stability. If none are available, recommend short-term [REDACTED] in 3 to 4 months. Evaluation of coronal and sagittal reformatted images from the outside study would also be helpful. These were not [REDACTED] available at the outside institution. Malignancy cannot be confidently excluded on the available images'}

In [None]:
formatted_dataset.features

{'text': Value(dtype='string', id=None)}
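At inference time the same template is typically reused with the answer left empty, so that the model continues from the `### Answer:` heading. The following is a minimal sketch under that assumption; `format_prompt` and `extract_answer` are hypothetical helpers, not functions defined elsewhere in this project.

```python
ANSWER_HEADER = "### Answer:\n"

def format_prompt(question):
    """Build the prompt half of the training template, ending where the answer begins."""
    return f"### Question:\n{question}\n\n{ANSWER_HEADER}"

def extract_answer(generated_text):
    """Return the text after the first answer header, or '' if the header is absent."""
    _, sep, answer = generated_text.partition(ANSWER_HEADER)
    return answer.strip() if sep else ""

prompt = format_prompt("Is there any evidence of disease in the X-ray?")
# Simulate a model completion appended to the prompt
completion = prompt + "Severe emphysema."
print(extract_answer(completion))
```

Keeping the prompt byte-for-byte identical to the training template matters because the model has only seen this exact heading layout during fine-tuning.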

## Step 4: Loading the Tokenizer

In [None]:
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

In [None]:
print(tokenizer.pad_token)

None


In [None]:
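# The tokenizer has no pad token by default, so register one for padding="max_length".
# This adds one new token to the vocabulary; the model's embeddings must be
# resized to len(tokenizer) before training.
tokenizer.add_special_tokens({'pad_token': '<pad>'})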

1

## Step 5: Tokenizing the Dataset for Causal Language Modeling

We now tokenize the dataset using the Llama tokenizer. Each example has the form:

```
### Question:
{question text}

### Answer:
{answer text}
```

During tokenization, each example is converted into three key fields:

- **input_ids**: Token IDs representing the full sequence (question and answer) to be fed into the model.
- **attention_mask**: Binary vector indicating which tokens are real (1) vs. padding (0).
- **labels**: Target tokens that the model should try to predict during training.

---

#### Why Label Masking?

In causal language modeling (CLM), the model learns by **predicting the next token**, one step at a time. To train the model to generate only the **answer** (not the question or the headings), we **mask the prompt portion** of the labels with `-100`. This tells the loss function to **ignore these tokens**, so they contribute nothing to the loss or its gradients.

The logic is:

```python
labels = [-100] * len(prompt_ids) + result["input_ids"][len(prompt_ids):]
```

- `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, so positions labeled `-100` are skipped.
- The answer portion (after the prompt) remains unmasked and is used for learning.
- We truncate or pad the label sequence to `max_length = 384` so it stays the same length as `input_ids`.
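The masking idea can be illustrated on a toy sequence without a tokenizer. This is a sketch, not the notebook's exact code: `build_labels` is a hypothetical helper, the token IDs are made up, and it additionally masks padding positions (where `attention_mask` is 0) so that padding is never treated as a training target.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_labels(input_ids, attention_mask, prompt_len):
    """Keep labels only for real answer tokens; mask prompt and padding."""
    # Index of the first real token: 0 with right padding, > 0 with left padding
    first_real = attention_mask.index(1) if 1 in attention_mask else len(attention_mask)
    answer_start = first_real + prompt_len
    return [
        tok if (i >= answer_start and m == 1) else IGNORE_INDEX
        for i, (tok, m) in enumerate(zip(input_ids, attention_mask))
    ]

# Toy example: 3 prompt tokens, 2 answer tokens, 2 padding tokens (ID 0), right-padded
input_ids = [11, 12, 13, 21, 22, 0, 0]
attention_mask = [1, 1, 1, 1, 1, 0, 0]
labels = build_labels(input_ids, attention_mask, prompt_len=3)
print(labels)  # [-100, -100, -100, 21, 22, -100, -100]
```

Using the attention mask to locate the first real token makes the same helper work whether the tokenizer pads on the left or on the right.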

---

#### Tokenization Config

```python
tokenizer(
    example["text"],
    truncation=True,         # Cut off long sequences safely
    padding="max_length",    # Pad all to uniform length
    max_length=384           # Max allowed length (safe for 3B models)
)
```

We chose `max_length = 384` to leave room for longer inference prompts and outputs (e.g., definitions, expanded context). The limit also keeps GPU memory usage manageable while aiming to avoid truncating informative answers.
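One way to check whether a given `max_length` actually avoids truncation is to measure token counts over the dataset. The sketch below is an assumption about how such a check could look: `truncation_report` is a hypothetical helper, and a whitespace split stands in for the real tokenizer so the example runs on its own.

```python
def truncation_report(texts, encode, max_length):
    """Summarize token counts and the share of texts longer than max_length."""
    lengths = sorted(len(encode(t)) for t in texts)
    truncated = sum(1 for n in lengths if n > max_length)
    return {
        "max_tokens": lengths[-1],
        "median_tokens": lengths[len(lengths) // 2],
        "truncated_fraction": truncated / len(lengths),
    }

# Stand-in tokenizer: split on whitespace (the real tokenizer produces more tokens)
texts = ["a b c", "a b c d e f", "a"]
report = truncation_report(texts, str.split, max_length=4)
print(report)
```

In the notebook, the `encode` argument could be `lambda t: tokenizer(t)["input_ids"]` applied to `formatted_dataset["text"]`, which reports how many examples exceed the chosen limit.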


In [None]:
MAX_LENGTH = 384  # Maximum sequence length in tokens

def tokenize(example):
    # Tokenize the prompt alone (up to and including the answer heading) to get its length
    prompt = example["text"].split("### Answer:\n")[0] + "### Answer:\n"
    prompt_len = len(tokenizer(prompt)["input_ids"])

    result = tokenizer(
        example["text"],
        truncation=True,           # Truncate to MAX_LENGTH
        padding="max_length",      # Pad to MAX_LENGTH
        max_length=MAX_LENGTH,
        return_tensors=None,       # Return as lists, not tensors
    )

    # Index of the first real token: 0 with right padding, > 0 with left padding
    mask = result["attention_mask"]
    first_real = mask.index(1) if 1 in mask else len(mask)
    answer_start = first_real + prompt_len

    # Keep labels only for real answer tokens; mask prompt and padding with -100
    result["labels"] = [
        tok if (i >= answer_start and m == 1) else -100
        for i, (tok, m) in enumerate(zip(result["input_ids"], mask))
    ]
    return result

In [None]:
tokenized_dataset = formatted_dataset.map(tokenize, batched=False, remove_columns=["text"])

## Step 6: Splitting the Tokenized Dataset into Train and Validation Sets

In [None]:
split_dataset = tokenized_dataset.train_test_split(test_size=0.1, seed=42)

In [None]:
dataset_dict = DatasetDict({
    "train": split_dataset["train"],
    "validation": split_dataset["test"]
})

In [None]:
print("Training examples:", len(dataset_dict["train"]))
print("Validation examples:", len(dataset_dict["validation"]))

Training examples: 630
Validation examples: 70
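A quick way to confirm that the split did not leak examples is to look for identical token sequences in both halves. This is an optional check sketched under assumptions: `split_overlap` is a hypothetical helper, and the toy rows below stand in for real dataset rows.

```python
def split_overlap(train_rows, val_rows, key="input_ids"):
    """Count validation rows whose `key` sequence also appears in the training rows."""
    seen = {tuple(row[key]) for row in train_rows}
    return sum(1 for row in val_rows if tuple(row[key]) in seen)

# Toy rows: one validation row duplicates a training row
train = [{"input_ids": [1, 2, 3]}, {"input_ids": [4, 5, 6]}]
val = [{"input_ids": [4, 5, 6]}, {"input_ids": [7, 8, 9]}]
print(split_overlap(train, val))  # 1
```

Applied to the notebook's data, `split_overlap(dataset_dict["train"], dataset_dict["validation"])` would return 0 unless the source file contains duplicate QA pairs.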


In [None]:
# Sanity check: decode the first example's token IDs back to text
print(tokenizer.decode(tokenized_dataset[0]["input_ids"]))

<|eot_id|><|eot_id|><|eot_id|>... (padding tokens repeated; output truncated)

In [None]:
# Decode the label sequence, replacing masked tokens (-100) with the pad token for readability
print(tokenizer.decode([t if t != -100 else tokenizer.pad_token_id for t in tokenized_dataset[0]["labels"]]))
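Alongside decoding, counting how many label positions are actually supervised gives a fast numeric check on the masking. The helper below is a hypothetical sketch (`label_stats` is not part of the notebook) and runs on a toy label list.

```python
def label_stats(labels, ignore_index=-100):
    """Count supervised vs. masked positions in a label sequence."""
    supervised = sum(1 for t in labels if t != ignore_index)
    return {
        "total": len(labels),
        "supervised": supervised,
        "masked": len(labels) - supervised,
    }

stats = label_stats([-100, -100, 21, 22, -100])
print(stats)  # {'total': 5, 'supervised': 2, 'masked': 3}
```

If `supervised` is 0 for an example, the answer was truncated away entirely and that example contributes nothing to training; if it is close to `total`, the prompt or padding is probably not being masked.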

## Step 7: Saving the Tokenized Dataset to Disk

In [None]:
save_path = "./data/tokenized_dataset"

dataset_dict.save_to_disk(save_path)

## Step 8: Verifying the Saved Dataset

In [None]:
# Path to the saved tokenized dataset
load_path = "./data/tokenized_dataset"  # Same location as save_path above

In [None]:
# Load the dataset from disk
loaded_dataset = load_from_disk(load_path)

In [None]:
# Sanity check: view one example
print(loaded_dataset["train"][0])

{'input_ids': [128009, 128009, 128009, ... (padding token ID 128009 repeated; output truncated)

In [None]:
print(len(loaded_dataset["train"]))
print(len(loaded_dataset["validation"]))

630
70


## Step 9: Fixing Notebook Metadata

In [1]:
!pip install nbformat --quiet

In [2]:
import nbformat
import os
from google.colab import drive, files

In [None]:
drive.mount('/content/drive', force_remount=True)

In [4]:
# List the notebook directory to confirm the file exists
os.listdir("/content/drive/MyDrive/multimodal-xray-agent/notebooks")

['10_tokenization.ipynb',
 '.gitkeep',
 '00_colab_setup.ipynb',
 '01_bootstrap.ipynb',
 '02_preprocessing.ipynb',
 '04_text_embedding_faiss_indexing.ipynb',
 '03_image_embedding_faiss_indexing.ipynb',
 '05_iu_xray_processing.ipynb',
 '06_generate_qa_pairs.ipynb',
 '08_finetune_biogpt_lora_run2.ipynb',
 '09_llama3_zero_shot_eval.ipynb',
 '07_finetune_biogpt_lora.ipynb',
 'Copy of 10_tokenization.ipynb',
 '12_llama3_finetuned_eval.ipynb',
 '11_finetune_llama3.2_lora.ipynb',
 '10_tokenization_fixed.ipynb']

In [None]:
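notebook_path = "/content/drive/MyDrive/multimodal-xray-agent/notebooks/10_tokenization.ipynb"

# Read the notebook as nbformat version 4
with open(notebook_path, "r", encoding="utf-8") as f:
    nb = nbformat.read(f, as_version=4)

# Drop saved widget state (e.g., from interactive outputs such as progress bars), if present
if "widgets" in nb.metadata:
    del nb.metadata["widgets"]

# Write the cleaned notebook back in place
with open(notebook_path, "w", encoding="utf-8") as f:
    nbformat.write(nb, f)

print("Notebook fixed and saved successfully!")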
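Because an `.ipynb` file is plain JSON, the same cleanup can be sketched with only the standard library. This is an illustrative alternative, not the method used above: `strip_widget_metadata` is a hypothetical helper and the notebook dict is a minimal stand-in.

```python
import json

def strip_widget_metadata(notebook):
    """Remove metadata.widgets from a notebook dict; return True if it was present."""
    metadata = notebook.get("metadata", {})
    return metadata.pop("widgets", None) is not None

# Minimal stand-in for a notebook file's JSON content
nb_json = json.loads(
    '{"cells": [], "nbformat": 4, "nbformat_minor": 5, "metadata": {'
    '"widgets": {"application/vnd.jupyter.widget-state+json": {}}, '
    '"kernelspec": {"name": "python3"}}}'
)
changed = strip_widget_metadata(nb_json)
print(changed, sorted(nb_json["metadata"]))
```

The `nbformat` route used in the cell above is still preferable when the notebook should also be parsed as a specific format version on read.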