## Step 1: Mounting Google Drive and Installing Dependencies

In [1]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Navigate to the repo folder
%cd /content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer

# List repo contents
!ls

Mounted at /content/drive
/content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer
data  deployment  LICENSE  notebooks  project_plan.md  qa_pairs  README.md  scripts


In [None]:
!pip install -q transformers datasets

In [10]:
from huggingface_hub import login
from datasets import DatasetDict, load_dataset, load_from_disk
from transformers import AutoTokenizer

In [8]:
login()


## Step 2: Loading the Tokenizer

In [11]:
# Load the tokenizer for Mistral-7B-Instruct
# This is a gated repo: request access to the model on the Hugging Face website before running this cell
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/141k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/587k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]
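
The notice above concerns Colab's `HF_TOKEN` secret, which this notebook does not set (it relies on the interactive `login()` prompt instead). For non-interactive runs, one possible alternative is to read the token from an environment variable. The sketch below is an assumption, not part of the original workflow: `get_hf_token` is a hypothetical helper, and reusing the `HF_TOKEN` name for an environment variable is an illustrative choice.

```python
import os


def get_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in the given environment variable, or None."""
    token = os.environ.get(env_var, "").strip()
    return token or None


token = get_hf_token()
if token is None:
    print("No token found; fall back to the interactive login() prompt.")
else:
    # huggingface_hub.login accepts the token directly:
    #   from huggingface_hub import login
    #   login(token=token)
    print("Token found; pass it to login(token=...).")
```

Either way, access to the gated model still has to be granted on the Hugging Face website first.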

## Step 3: Loading QA Dataset from JSONL

We load the `train.jsonl` file using Hugging Face's `load_dataset` function. Each line in the `.jsonl` file is a single QA pair stored as a JSON object with a `question` field and an `answer` field. This forms the base dataset we'll format and tokenize in the next steps.
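
To make the file layout concrete, here is a minimal standard-library sketch of how a JSONL file is read: one JSON object per line. The two records are made-up placeholders that only mirror the `question` and `answer` keys, and `read_jsonl` is a hypothetical helper, not something the notebook uses.

```python
import io
import json

# Two illustrative records (hypothetical content, same keys as the dataset)
sample = io.StringIO(
    '{"question": "What is LoRA?", "answer": "A low-rank adaptation method."}\n'
    '{"question": "What is a rank?", "answer": "The dimension of the update."}\n'
)


def read_jsonl(handle):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]


records = read_jsonl(sample)
print(len(records), sorted(records[0]))
```

`load_dataset("json", ...)` performs the same kind of parsing but returns a `Dataset` object rather than a list of dictionaries.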

In [12]:
dataset = load_dataset("json", data_files="./data/train.jsonl", split="train")

Generating train split: 0 examples [00:00, ? examples/s]

In [13]:
print(dataset[1])

{'question': 'How does AutoLoRA represent each update matrix in the fine-tuning process?', 'answer': 'AutoLoRA decomposes each update matrix into the product of two low-rank matrices, consistent with the LoRA methodology. This product is then expressed as a sum of rank-1 matrices, each associated with a trainable selection variable α ∈ [0, 1].'}


## Step 4: Formatting Dataset for Supervised Fine-Tuning

In this step, we format each QA pair into the instruction-following format the model will see during training. The `format_example()` function combines the question and answer into a single string, labeling each part with a markdown-style heading (`### Question:` and `### Answer:`) and separating the two parts with a blank line. A consistent template gives the model a clear cue for where the question ends and the answer begins.

We then use the `.map()` function to apply this transformation to the entire dataset, removing the original `question` and `answer` fields and keeping only the unified `text` field. This `text` field is what gets tokenized and fed into the model for fine-tuning.

In [14]:
# Format each QA pair into an instruction-following prompt
def format_example(example):
    return {
        "text": f"### Question:\n{example['question']}\n\n### Answer:\n{example['answer']}"
    }

In [15]:
# Apply formatting to all samples in the dataset
formatted_dataset = dataset.map(format_example, remove_columns=["question", "answer"])

Map:   0%|          | 0/234 [00:00<?, ? examples/s]

In [16]:
# Quick check: Show an example
formatted_dataset[0]

{'text': '### Question:\nWhat problem does AutoLoRA aim to solve in traditional LoRA-based fine-tuning?\n\n### Answer:\nAutoLoRA addresses two core limitations of traditional LoRA: (1) the uniform rank assignment across all layers, which neglects layer-specific importance, leading to suboptimal or inefficient fine-tuning; and (2) the need for exhaustive manual hyperparameter searches to determine optimal ranks.'}
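
At inference time, the prompt generally needs to match this template up to the answer heading, so that the fine-tuned model continues with the answer. The sketch below shows one way to share the template between training and inference. `format_prompt` is a hypothetical helper that does not appear in the notebook; `format_example` is rewritten on top of it and should produce the same `text` as the cell above.

```python
def format_prompt(question):
    """Build the question half of the template, leaving the answer open."""
    return f"### Question:\n{question}\n\n### Answer:\n"


def format_example(example):
    """Training-time formatter: the prompt followed by the reference answer."""
    return {"text": format_prompt(example["question"]) + example["answer"]}


example = {"question": "What is LoRA?", "answer": "A low-rank adaptation method."}
print(format_example(example)["text"])
```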

## Step 5: Tokenizing the Dataset for Causal Language Modeling

We now tokenize our dataset using the Mistral tokenizer. Each example is converted into a dictionary of input features required for training:

- **`input_ids`**: token IDs representing the input text.
- **`attention_mask`**: binary mask indicating which tokens should be attended to (1 for real tokens, 0 for padding).
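- **`labels`**: a copy of `input_ids`, used by the loss function as ground-truth targets. Because the copy includes padded positions, those positions also count toward the loss unless they are masked out.

Since we are performing **causal language modeling**, the model is trained to predict the *next token* in the sequence. The labels can therefore be a plain copy of the inputs: the one-position shift between inputs and targets is applied inside the model's loss computation.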
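
The next-token objective can be illustrated with plain Python. The sketch below pairs each position with the token it is scored against. The token IDs are arbitrary placeholders and `next_token_pairs` is a hypothetical helper; in Hugging Face causal language models this shift is done inside the loss, so no such code is needed in the notebook.

```python
def next_token_pairs(input_ids):
    """Pair each position with the token it is trained to predict."""
    # Position i sees tokens up to i and is scored against token i + 1.
    return list(zip(input_ids[:-1], input_ids[1:]))


ids = [1, 45, 67, 89, 2]
pairs = next_token_pairs(ids)
print(pairs)  # [(1, 45), (45, 67), (67, 89), (89, 2)]
```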

We also apply:
- `padding="max_length"` to pad every sequence to the same fixed length.
- `truncation=True` to cut off inputs that exceed `max_length`.
- `max_length=512`, which keeps memory use manageable when fine-tuning a 7B-parameter model.

Note: Mistral's tokenizer does not define a pad token by default, so we set `pad_token` to `eos_token` before running the tokenization; without a pad token, `padding="max_length"` raises an error.

In [17]:
def tokenize(example):
    result = tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=512,
        return_tensors=None,
    )
    # Labels are the same as input_ids for causal language modeling
    result["labels"] = result["input_ids"].copy()
    return result

In [18]:
tokenizer.pad_token = tokenizer.eos_token

In [19]:
# Applying the tokenizer to the entire dataset
tokenized_dataset = formatted_dataset.map(tokenize, batched=True, remove_columns=["text"])

Map:   0%|          | 0/234 [00:00<?, ? examples/s]
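
Because the labels are copied from `input_ids`, padded positions are included in the loss. A common refinement, which this notebook does not apply, is to replace the labels at padded positions with `-100`, the value that PyTorch's cross-entropy loss ignores by default. The sketch below is one way to do that under the assumption that the attention mask marks padding with 0. Using the mask rather than the token ID matters here, since the pad token and the EOS token share the same ID. `mask_padding` and `mask_batch` are hypothetical helpers.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_padding(input_ids, attention_mask):
    """Copy input_ids into labels, ignoring positions the mask marks as padding."""
    return [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(input_ids, attention_mask)
    ]


def mask_batch(batch):
    """Batched variant: each field holds one list per example."""
    batch["labels"] = [
        mask_padding(ids, mask)
        for ids, mask in zip(batch["input_ids"], batch["attention_mask"])
    ]
    return batch


# Two padded positions, followed by real tokens that end with a genuine EOS (ID 2)
print(mask_padding([2, 2, 1, 45, 2], [0, 0, 1, 1, 1]))  # [-100, -100, 1, 45, 2]
```

Since `mask_batch` takes and returns a dictionary of lists, it could in principle be applied with `.map(mask_batch, batched=True)` after tokenization.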

## Step 6: Splitting the Tokenized Dataset into Train and Validation Sets

We split the tokenized dataset into training and validation subsets using a 90/10 ratio. This allows us to monitor the model's generalization during fine-tuning and detect overfitting.

We use Hugging Face's `DatasetDict` to store both splits under named keys. This is the conventional layout, and it lets downstream training code select each split by name.

The result is:
- `dataset_dict["train"]` → for model training
- `dataset_dict["validation"]` → for evaluation during training

In [20]:
# Let's split 90% for training, 10% for validation
split_dataset = tokenized_dataset.train_test_split(test_size=0.1, seed=42)

In [21]:
# Wrap the result in a DatasetDict for clarity and usability
dataset_dict = DatasetDict({
    "train": split_dataset["train"],
    "validation": split_dataset["test"]
})

In [22]:
# Verify sizes
print("Training examples:", len(dataset_dict["train"]))
print("Validation examples:", len(dataset_dict["validation"]))

Training examples: 210
Validation examples: 24
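
The 210/24 split is not exactly 90/10 because 10% of 234 is 23.4, which is not a whole number. The printed sizes are consistent with the test count being rounded up and the remainder going to training. The sketch below reproduces that arithmetic; the rounding rule is an assumption inferred from the output above, and `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), assuming the test fraction is rounded up."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


print(split_sizes(234, 0.1))  # (210, 24)
```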


## Step 7: Saving the Tokenized Dataset to Disk

We save the `DatasetDict` using Hugging Face's `.save_to_disk()` method, which writes the tokenized training and validation splits in a format that can be reloaded directly.

This avoids repeating tokenization when we start fine-tuning. The dataset is stored in the `./data/tokenized_dataset` directory and can be reloaded later with `datasets.load_from_disk()`.

In [23]:
save_path = "./data/tokenized_dataset"

dataset_dict.save_to_disk(save_path)

Saving the dataset (0/1 shards):   0%|          | 0/210 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/24 [00:00<?, ? examples/s]

## Step 8: Verifying the Saved Dataset

In [24]:
# Path to the saved tokenized dataset
load_path = "./data/tokenized_dataset"

In [25]:
# Load the dataset from disk
loaded_dataset = load_from_disk(load_path)
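
A reloaded example can also be checked structurally rather than by eye. The sketch below assumes the three fields and the fixed length of 512 described in Step 5; `check_example` is a hypothetical helper, and the dummy example stands in for a real row such as `loaded_dataset["train"][0]`.

```python
def check_example(example, max_length=512):
    """Raise if a tokenized example lacks a field or has the wrong length."""
    expected = ("input_ids", "attention_mask", "labels")
    missing = [key for key in expected if key not in example]
    if missing:
        raise KeyError(f"missing fields: {missing}")
    for key in expected:
        if len(example[key]) != max_length:
            raise ValueError(f"{key} has length {len(example[key])}, expected {max_length}")
    return True


# Dummy example with the expected shape (placeholder values, not real token IDs)
dummy = {
    "input_ids": [2] * 512,
    "attention_mask": [0] * 512,
    "labels": [2] * 512,
}
print(check_example(dummy))  # True
```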

In [26]:
# Sanity check: view one example
print(loaded_dataset["train"][0])

{'input_ids': [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2