## Step 1: Mounting Google Drive

In [1]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Navigate to the repo folder
%cd /content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer

# List repo contents
!ls

Mounted at /content/drive
/content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer
data  deployment  LICENSE  notebooks  project_plan.md  qa_pairs  README.md  scripts


## Step 2: Importing Libraries and Generating the .jsonl File

This block of code performs the following steps:

1. **Define Input and Output Paths**:
   - `qa_folder` points to the directory containing multiple `.json` files, each with QA pairs under the key `"qa_pairs"`.
   - `output_path` specifies where the final `train.jsonl` file will be saved.

2. **Aggregate All QA Pairs**:
   - Iterate through every `.json` file in the QA folder.
   - For each file, load the content and extract the list of QA pairs.
   - Strip whitespace from both questions and answers.
   - Store each pair as a dictionary `{"question": ..., "answer": ...}` in the `all_qa_pairs` list.

3. **Preview a Sample**:
   - Print the first QA pair (`all_qa_pairs[0]`) to verify correct structure and content.

4. **Write to `.jsonl` File**:
   - Open `train.jsonl` in write mode.
   - For each QA pair, write a JSON object followed by a newline (`\n`), creating a valid `.jsonl` file.
   - Each line in the output is an independent JSON object, so the file can be read one record at a time, which suits streaming with Hugging Face Datasets.

This step prepares the dataset in a format suitable for LLM fine-tuning pipelines such as LoRA training with `transformers` and `peft`.

*Note*: `.jsonl` (JSON Lines) is used instead of `.json` because:
- It is easier to stream and to parse line by line.
- It's supported directly by `datasets.load_dataset()` with the `json` loader.
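
To illustrate the first point, here is a minimal sketch of reading a `.jsonl` file one record at a time using only the standard library. The helper name `iter_jsonl` and the two demo records are assumptions made for illustration; they are not part of this project.

```python
import json
import os
import tempfile


def iter_jsonl(path):
    """Yield one parsed record per non-empty line, without loading the whole file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate a trailing blank line
                yield json.loads(line)


# Demo on a throwaway file with two made-up records (not the project's data).
sample = [
    {"question": "q1", "answer": "a1"},
    {"question": "q2", "answer": "a2"},
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        for record in sample:
            f.write(json.dumps(record) + "\n")

    # Each line parses on its own, so records can be consumed lazily.
    records = list(iter_jsonl(demo_path))

print(records[0]["question"])
```

With Hugging Face Datasets, the equivalent load would typically be `load_dataset("json", data_files="data/train.jsonl")`, assuming the `datasets` package is installed.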

In [16]:
import os
import json
import pandas as pd

In [4]:
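# Folder of QA .json files and the output path for the merged corpus
qa_folder = './qa_pairs'
output_path = './data/train.jsonl'

all_qa_pairs = []

for filename in os.listdir(qa_folder):
    # Skip anything that is not a .json file
    if filename.endswith('.json'):
        with open(os.path.join(qa_folder, filename), 'r', encoding='utf-8') as f:
            data = json.load(f)
        # Each file keeps its list of pairs under the "qa_pairs" key
        for pair in data['qa_pairs']:
            all_qa_pairs.append({
                "question": pair['question'].strip(),
                "answer": pair['answer'].strip()
            })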

In [9]:
all_qa_pairs[0]

{'question': 'What problem does AutoLoRA aim to solve in traditional LoRA-based fine-tuning?',
 'answer': 'AutoLoRA addresses two core limitations of traditional LoRA: (1) the uniform rank assignment across all layers, which neglects layer-specific importance, leading to suboptimal or inefficient fine-tuning; and (2) the need for exhaustive manual hyperparameter searches to determine optimal ranks.'}

In [10]:
with open(output_path, 'w') as f:
    for pair in all_qa_pairs:
        json.dump(pair, f)
        f.write('\n')
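
A quick sanity check after writing can catch malformed lines before training. The sketch below is one possible check, not part of the original notebook: the helper name `validate_jsonl` and the demo records are hypothetical, and it assumes every record should carry non-empty string values for `question` and `answer`.

```python
import json
import os
import tempfile


def validate_jsonl(path, required_keys=("question", "answer")):
    """Return the record count; raise ValueError on the first malformed line."""
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"line {line_no}: invalid JSON ({exc.msg})") from exc
            for key in required_keys:
                value = record.get(key) if isinstance(record, dict) else None
                # Assumption: both fields must be non-empty strings.
                if not isinstance(value, str) or not value.strip():
                    raise ValueError(f"line {line_no}: missing or empty {key!r}")
            count += 1
    return count


# Demo on a throwaway file with two made-up records (not the project's data).
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.jsonl")
    with open(good, "w", encoding="utf-8") as f:
        f.write(json.dumps({"question": "q1", "answer": "a1"}) + "\n")
        f.write(json.dumps({"question": "q2", "answer": "a2"}) + "\n")
    count = validate_jsonl(good)

print(f"{count} valid records")
```

In this notebook the call would be `validate_jsonl(output_path)`, and the returned count should match `len(all_qa_pairs)`.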

In [17]:
# Reload the file just written to confirm it parses as JSON Lines
df = pd.read_json(output_path, lines=True)

In [19]:
df.head()

|   | question | answer |
|---|----------|--------|
| 0 | What problem does AutoLoRA aim to solve in tra... | AutoLoRA addresses two core limitations of tra... |
| 1 | How does AutoLoRA represent each update matrix... | AutoLoRA decomposes each update matrix into th... |
| 2 | What is the role of the selection variable α i... | The α variable controls whether a given rank-1... |
| 3 | How does AutoLoRA determine the optimal rank o... | AutoLoRA introduces selection variables associ... |
| 4 | Why is learning α directly on the training dat... | Directly learning α from training data can lea... |


## Step 3: Downloading the Notebook

In [20]:
from google.colab import files

%cd /content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer/notebooks
files.download("04_prepare_finetuning_corpus.ipynb")

/content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer/notebooks

