# Fine-Tuning Large Language Models with LoRA/QLoRA

Welcome! In this notebook, you'll learn how to fine-tune a large language model efficiently using Parameter-Efficient Fine-Tuning (PEFT) techniques.

**What you'll learn:**
- How to prepare and tokenize datasets for instruction fine-tuning
- How to apply LoRA (Low-Rank Adaptation) to reduce trainable parameters
- How to use QLoRA for memory-efficient training with quantization
- How to implement assistant-only masking to train only on model responses
- How to train, save, and share your fine-tuned model

**Why this matters:**
Fine-tuning large models typically requires massive computational resources. LoRA and QLoRA allow you to fine-tune models on consumer hardware by only training a small fraction of the parameters.


## 1. Setup and Imports

Let's start by importing all the libraries we'll need. We're using:
- **Transformers**: For loading models and tokenizers
- **PEFT**: For applying LoRA adapters to our model
- **Datasets**: For loading and processing training data
- **BitsAndBytes**: For quantization if using QLoRA


In [1]:
! pip install -q torch datasets peft huggingface_hub evaluate rouge_score
! pip install -U bitsandbytes

  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m2.3 MB/s[0m eta [36m0:00:00[0m
[?25h  Building wheel for rouge_score (setup.py) ... [?25l[?25hdone
Collecting bitsandbytes
  Downloading bitsandbytes-0.48.2-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Downloading bitsandbytes-0.48.2-py3-none-manylinux_2_24_x86_64.whl (59.4 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m59.4/59.4 MB[0m [31m12.0 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: bitsandbytes
Successfully installed bitsandbytes-0.48.2


In [None]:
import os
import warnings
import torch
import evaluate
from tqdm import tqdm
from typing import Optional, Tuple, List, Dict
from transformers import (
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback,
    AutoTokenizer
)
from peft import LoraConfig, get_peft_model
from huggingface_hub import login
from torch.utils.data import DataLoader
from datasets import Dataset, load_dataset
from peft import PeftModel


warnings.filterwarnings("ignore")



## 2. Authentication and Configuration

Before we begin training, we need to authenticate with HuggingFace (to download models and upload results) and optionally with Weights & Biases (to track training progress).

**Important:** Make sure you have a `.env` file with:
- `HF_TOKEN`: Your HuggingFace access token
- `HF_USERNAME`: Your HuggingFace username

We'll also load our training configuration from `config.json`, which contains all hyperparameters and settings.


In [19]:
from google.colab import userdata
HF_TOKEN = userdata.get('HF_TOKEN')
HF_USERNAME = userdata.get('HF_USERNAME')

login(HF_TOKEN)


## 3. Prepare Your Training Data


**What happens here:**
1. **Load the dataset** from local JSONL files (train, validation, and test sets)
2. **Format into instruction format** with clear sections for input and output
3. A default instruction tells the model its role and task

The dataset will be loaded with three splits:
- **Training set**: For teaching the model new patterns
- **Validation set**: For monitoring progress during training
- **Test set**: For final evaluation after training

**Key Concept - Assistant-Only Masking:**
When fine-tuning, we only want the model to learn to generate the assistant's response, not memorize the instruction. We'll apply masking in the next step to ensure the loss only comes from predicting the output tokens.


In [35]:

def prepare_dataset(
    dataset_name: str,
    input_column: str,
    output_column: str,
    instruction_column: str = None,
    default_instruction: str = None,
    seed: int = 42,
    train_size: Optional[int] = None,
    validation_size: Optional[int] = None,
    test_size: Optional[int] = None,
) -> Tuple[Dataset, Dataset, Dataset]:
    """
    Load and prepare the dataset for fine-tuning.

    Args:
        dataset_name: Name of the dataset from HuggingFace
        instruction_column: Column name for instructions
        input_column: Column name for inputs
        output_column: Column name for outputs

    Returns:
        Tuple of (train_dataset, validation_dataset, test_dataset)
    """

    dataset = load_dataset(dataset_name)

    def format_instruction_data(data_point: Dict) -> str:
        if instruction_column is not None:
            instruction = data_point[instruction_column]
        else:
            instruction = default_instruction

        input_text = data_point[input_column]
        output = data_point[output_column]

        formatted_text = f"### Instruction\n{instruction}\n\n"

        if input_text:
            formatted_text += f"### Input\n{input_text}\n\n"

        formatted_text += f"### Output\n{output}"

        return {"text": formatted_text}

    if train_size is not None:
        dataset["train"] = dataset["train"].shuffle(seed).select(range(train_size))

    train_dataset = dataset["train"].map(
        format_instruction_data, desc="Formatting train data"
    )

    validation_dataset = None
    test_dataset = None

    if "validation" in dataset:
        if validation_size is not None:
            validation_dataset = dataset["validation"].shuffle(seed).select(range(validation_size))
        validation_dataset = validation_dataset.map(
            format_instruction_data, desc="Formatting validation data"
        )

    if "test" in dataset:
        if test_size is not None:
            test_dataset = dataset["test"].shuffle(seed).select(range(test_size))
        test_dataset = test_dataset.map(
            format_instruction_data, desc="Formatting test data"
        )



    return train_dataset, validation_dataset, test_dataset


def apply_assistant_masking(
    input_ids: List[int], tokenizer: AutoTokenizer
) -> List[int]:
    """
    Apply assistant-only masking by setting instruction tokens to -100.
    This ensures the model only learns from the output/response portion.

    Args:
        input_ids: Tokenized input sequence
        tokenizer: The tokenizer used

    Returns:
        Labels with instruction tokens masked (-100)
    """
    labels = input_ids.copy()

    output_marker = "### Output"
    output_marker_tokens = tokenizer.encode(output_marker, add_special_tokens=False)

    output_start_idx = None
    for i in range(len(input_ids) - len(output_marker_tokens) + 1):
        if input_ids[i : i + len(output_marker_tokens)] == output_marker_tokens:
            output_start_idx = i + len(output_marker_tokens)
            break

    if output_start_idx is not None:
        for i in range(output_start_idx):
            labels[i] = -100

    return labels


def tokenize_dataset(
    model_name: str,
    train_dataset: Dataset,
    validation_dataset: Optional[Dataset] = None,
    test_dataset: Optional[Dataset] = None,
    assistant_only_masking: bool = True,
    max_length: int = 2048,
) -> Tuple[Dataset, Dataset, Dataset, AutoTokenizer]:
    """
    Load and tokenize the dataset with optional assistant-only masking.

    Args:
        model_name: Name of the model to use for tokenizer
        dataset_name: Name of the dataset to use
        assistant_only_masking: Whether to apply assistant-only masking
        max_length: Maximum sequence length

    Returns:
        Tuple of (train, validation, test, tokenizer)
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

    def tokenize_and_mask_function(examples):
        texts_with_eos = [text + tokenizer.eos_token for text in examples["text"]]
        tokenized = tokenizer(
            texts_with_eos,
            truncation=True,
            padding=False,
            max_length=max_length,
            return_tensors=None,
            add_special_tokens=True,
        )

        labels = []
        for input_ids in tokenized["input_ids"]:
            if assistant_only_masking:
                masked_labels = apply_assistant_masking(input_ids, tokenizer)
            else:
                masked_labels = input_ids
            labels.append(masked_labels)

        tokenized["labels"] = labels
        return tokenized

    train = train_dataset.map(
        tokenize_and_mask_function,
        batched=True,
        remove_columns=train_dataset.column_names,
        desc="Processing train dataset",
    )

    validation = None
    if validation_dataset is not None:
        validation = validation_dataset.map(
            tokenize_and_mask_function,
            batched=True,
            remove_columns=train_dataset.column_names,
            desc="Processing validation dataset",
        )

    test = None
    if test_dataset is not None:
        test = test_dataset.map(
            tokenize_and_mask_function,
            batched=True,
            remove_columns=train_dataset.column_names,
            desc="Processing test dataset",
        )

    return train, validation, test, tokenizer


In [36]:
dataset_config = {
    "dataset_name": "knkarthick/dialogsum",
    "instruction_column": None,
    "input_column": "dialogue",
    "output_column": "summary",
    "max_length": 2048,
}

train_dataset, validation_dataset, test_dataset = prepare_dataset(
    dataset_name="knkarthick/samsum",
    input_column=dataset_config["input_column"],
    output_column=dataset_config["output_column"],
    default_instruction="Summarize the following dialogue",
    validation_size=200,
    test_size=200,
)




Formatting train data:   0%|          | 0/14731 [00:00<?, ? examples/s]

Formatting validation data:   0%|          | 0/200 [00:00<?, ? examples/s]

Formatting test data:   0%|          | 0/200 [00:00<?, ? examples/s]

In [32]:
class DataCollatorForCausalLM:
    """
    Custom data collator for causal language modeling.
    Handles padding of input_ids, attention_mask, and labels.
    """
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __call__(self, features):
        labels = [f.pop("labels") for f in features]
        batch = self.tokenizer.pad(features, padding=True, return_tensors="pt")

        max_len = batch["input_ids"].size(1)
        padded_labels = torch.full((len(labels), max_len), -100, dtype=torch.long)

        for i, l in enumerate(labels):
            l = torch.tensor(l, dtype=torch.long)
            if self.tokenizer.padding_side == "right":
                padded_labels[i, : len(l)] = l
            else:  # left-padding
                padded_labels[i, -len(l):] = l

        batch["labels"] = padded_labels
        return batch



## 4. Tokenize and Apply Masking

Now we'll convert our text data into numbers (tokens) that the model can process. This step:
- **Loads the tokenizer** from the base model
- **Tokenizes all examples** (converts text to token IDs)
- **Applies assistant-only masking** to ensure we only train on the output portion
- **Adds special tokens** like end-of-sequence markers

**Why tokenization matters:**
Language models don't understand text directly - they work with numbers. The tokenizer converts "Hello world" into something like [15496, 1917]. Each number represents a specific piece of text (word, subword, or character).

This step might take a minute or two depending on your dataset size.


In [38]:
model_name = "meta-llama/Llama-3.2-1B-Instruct"
assistant_only_masking = True
use_qlora = True
max_length = 2048


train, validation, test, tokenizer = tokenize_dataset(
    model_name=model_name,
    train_dataset=train_dataset,
    validation_dataset=validation_dataset,
    test_dataset=test_dataset,
    assistant_only_masking=assistant_only_masking,
    max_length=max_length,
)

print(f"Train dataset size: {len(train)}")
print(f"Validation dataset size: {len(validation) if validation else 0}")
print(f"Test dataset size: {len(test) if test else 0}")


Processing train dataset:   0%|          | 0/14731 [00:00<?, ? examples/s]

Processing validation dataset:   0%|          | 0/200 [00:00<?, ? examples/s]

Processing test dataset:   0%|          | 0/200 [00:00<?, ? examples/s]

Train dataset size: 14731
Validation dataset size: 200
Test dataset size: 200


## 5. Inspect Dataset Sample

Let's take a peek at our processed data to verify everything looks correct.

If you enabled assistant-only masking, you'll see statistics showing:
- How many tokens are in each sequence
- How many tokens are masked (the instruction part)
- How many tokens we're actually training on (the response part)

**What to expect:** Typically, 50-70% of tokens will be masked, as instructions are usually shorter than responses.


In [40]:
if assistant_only_masking:
    first_example = train[0]
    input_ids = first_example["input_ids"]
    labels = first_example["labels"]

    masked_tokens = sum(1 for label in labels if label == -100)
    total_tokens = len(labels)

    print(f"Total tokens: {total_tokens}")
    print(f"Masked tokens: {masked_tokens}")
    print(f"Training tokens: {total_tokens - masked_tokens}")
    print(f"Mask ratio: {masked_tokens/total_tokens:.2%}")

    print(f"Input IDs: {input_ids}")
    print(f"Labels: {labels}")
    print(f"Attention mask: {first_example['attention_mask']}")


Total tokens: 56
Masked tokens: 43
Training tokens: 13
Mask ratio: 76.79%
Input IDs: [128000, 14711, 30151, 198, 9370, 5730, 553, 279, 2768, 21976, 271, 14711, 5688, 198, 32, 36645, 25, 358, 41778, 220, 8443, 13, 3234, 499, 1390, 1063, 5380, 90757, 25, 23371, 4999, 32, 36645, 25, 358, 3358, 4546, 499, 16986, 21629, 696, 14711, 9442, 198, 32, 36645, 41778, 8443, 323, 690, 4546, 29808, 1063, 16986, 13, 128009]
Labels: [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 198, 32, 36645, 41778, 8443, 323, 690, 4546, 29808, 1063, 16986, 13, 128009]
Attention mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


## 6. Model Configuration

Now we'll configure our Parameter-Efficient Fine-Tuning (PEFT) approach.

**LoRA (Low-Rank Adaptation):**
Instead of updating all model parameters, LoRA adds small trainable matrices to specific layers. This dramatically reduces the number of parameters you need to train (often by 1000x or more!).

**QLoRA (Quantized LoRA):**
If enabled, we'll load the model in 4-bit or 8-bit precision, further reducing memory requirements. This allows you to fine-tune larger models on smaller GPUs.

**Key hyperparameters:**
- `r` (rank): Size of the low-rank matrices (higher = more capacity, more memory)
- `lora_alpha`: Scaling factor for LoRA updates
- `target_modules`: Which model layers to apply LoRA to (typically attention and feed-forward layers)


In [41]:
bnb_config = None

if use_qlora:
    bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                 # enable 4-bit loading
    bnb_4bit_quant_type="nf4",         # use NormalFloat4 (NF4) quantization
    bnb_4bit_use_double_quant=True,    # enable double quantization
    bnb_4bit_compute_dtype="bfloat16"  # computation dtype (bf16 recommended)
)


lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "v_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)


## 7. Load Model with PEFT

Time to load our base model and apply the LoRA adapters!

This step will:
1. Download the base model from HuggingFace (if not cached)
2. Apply quantization if using QLoRA
3. Inject LoRA adapter layers into the model
4. Freeze the base model parameters (only adapters will be trained)

**Watch for:** The model will print how many parameters are trainable vs. frozen. You should see that only 0.1-1% of parameters are trainable with LoRA. This is the magic that makes efficient fine-tuning possible!


In [None]:
def get_apply_peft(
    model_name: str,
    lora_config: LoraConfig,
    qlora_config: Optional[BitsAndBytesConfig] = None,
) -> torch.nn.Module:
    """
    Load model and apply PEFT (LoRA) configuration.

    Args:
        model_name: HuggingFace model name
        lora_config: LoRA configuration
        qlora_config: Optional quantization configuration

    Returns:
        Model with PEFT adapters applied
    """
    model = AutoModelForCausalLM.from_pretrained(
        model_name, quantization_config=qlora_config, device_map="auto", dtype=torch.bfloat16
    )

    model = get_peft_model(model, lora_config)


    return get_peft_model(model, lora_config)


peft_model = get_apply_peft(model_name, lora_config, bnb_config)
peft_model.print_trainable_parameters()


config.json:   0%|          | 0.00/877 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

## Inspect how a single batch looks like

In [None]:
loader = DataLoader(
    train,
    batch_size=4,
    collate_fn=DataCollatorForCausalLM(tokenizer),
)

batch = next(iter(loader))

You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [None]:
print('INPUT IDS:\n\n')
print(batch['input_ids'][0])
print('==' * 20)
print('\n\nATTENTION MASK:\n\n')
print(batch['attention_mask'][0])
print('==' * 20)
print('\n\nLABELS:\n\n')
print(batch['labels'][0])

## 8. Training Setup

Almost ready to train! Let's configure the training process.

We're setting up:
- **Training arguments**: Learning rate, batch size, number of epochs, etc.
- **Data collator**: Handles batching and padding during training
- **Callbacks**: Optional early stopping to prevent overfitting

**Key training parameters to understand:**
- `learning_rate`: How big of steps to take during optimization (typically 1e-4 to 5e-4 for LoRA)
- `per_device_train_batch_size`: Number of examples per GPU (adjust based on your memory)
- `gradient_accumulation_steps`: Simulate larger batches if memory is limited
- `num_train_epochs`: How many times to go through the entire dataset
- `save_steps` / `eval_steps`: How often to save checkpoints and evaluate


In [None]:
print("Setting up training...")

training_args = {
    "output_dir": "./checkpoints",
    "per_device_train_batch_size": 4,
    "per_device_eval_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 2,
    "learning_rate": 3e-4,
    "logging_steps": 4,
    "save_strategy": "epoch",
    "eval_strategy": "steps",
    "eval_steps": 100,
    "warmup_ratio": 0.03,
    "weight_decay": 0.01,
    "optim": "adamw_torch",
    "remove_unused_columns": False,
    "dataloader_drop_last": False,
    "gradient_checkpointing": False,
    "max_grad_norm": 1.0,
    "metric_for_best_model": "eval_loss",
    "greater_is_better": False,
    "fp16": False,
    "bf16": True,
    "logging_dir": "./logs",
    "logging_first_step": True,
    "log_level": "info",
    "disable_tqdm": False,
    "report_to": "none",
}

early_stopping_config = {
    "early_stopping_patience": 3,
    "early_stopping_threshold": 0.001
}

training_args = TrainingArguments(**training_args)
data_collator = DataCollatorForCausalLM(tokenizer)

callbacks = []
if early_stopping_config:
    early_stopping_callback = EarlyStoppingCallback(**early_stopping_config)
    callbacks.append(early_stopping_callback)

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=train,
    eval_dataset=validation,
    data_collator=data_collator,
    callbacks=callbacks if callbacks else None,
)


os.environ["WANDB_DISABLED"] = "true"

print("Training configuration complete")

Using auto half precision backend


Setting up training...
Training configuration complete


## 9. Train the Model

Here we go! Time to train your model!

**What happens during training:**
- The model processes batches of your training data
- It makes predictions and compares them to the correct outputs
- It updates the LoRA adapter weights to improve its predictions
- Every few steps, it evaluates on your validation set to track progress

**What you'll see:**
- Loss values (lower is better - measures prediction error)
- Training speed (samples/second)
- Periodic evaluation results
- Progress bar showing completion

**How long will this take?** Depends on:
- Your dataset size (more data = longer training)
- Your GPU (faster GPU = faster training)
- Number of epochs and batch size

Grab a coffee - this could take anywhere from minutes to hours!


In [None]:
from transformers.utils import logging
logging.set_verbosity_error()

trainer.train()


Step,Training Loss,Validation Loss
100,1.4803,1.298838
200,1.3251,1.280188
300,1.3487,1.258201
400,1.2257,1.247961
500,1.2654,1.238611
600,1.2978,1.233881
700,1.2981,1.229177
800,1.1512,1.226105
900,1.2578,1.220308
1000,1.1155,1.21898


TrainOutput(global_step=1842, training_loss=1.239292989465753, metrics={'train_runtime': 2149.3091, 'train_samples_per_second': 13.708, 'train_steps_per_second': 0.857, 'total_flos': 5.306237101530317e+16, 'train_loss': 1.239292989465753, 'epoch': 2.0})

In [None]:

def create_prompts_dataset(dataset, tokenizer):
  prompts = []
  tokenizer.padding_side = "left"

  for sample in dataset:
    prompts.append(f"### Input\n{sample[dataset_config['input_column']]}\n### Output")

  validation_prompts = tokenizer(prompts, return_tensors="pt", padding=True)
  validation_prompts = Dataset.from_dict(validation_prompts)

  validation_prompts = validation_prompts.map(
      lambda ex: {"labels": ex["input_ids"].copy()}
  )

  return validation_prompts

validation_prompts = create_prompts_dataset(validation_dataset, tokenizer)



Map:   0%|          | 0/818 [00:00<?, ? examples/s]

## Generate Predictions on Validation Data

In [None]:
def evaluate_model(prompts_dataset, model, batch_size=8, max_new_tokens=2048, sample_size=None):
  data = prompts_dataset
  if sample_size:
    data = data.select(range(sample_size))

  dataloader = DataLoader(data, batch_size=batch_size, collate_fn=DataCollatorForCausalLM(tokenizer))

  model.eval()
  generated_texts = []
  true_summary = []

  with torch.inference_mode():
      done = False
      for batch in tqdm(dataloader, desc="Generating"):
          if done:
            break
          input_ids = batch["input_ids"].to(model.device)
          attention_mask = batch["attention_mask"].to(model.device)

          # Generate predictions
          outputs = model.generate(
              input_ids=input_ids,
              attention_mask=attention_mask,
              max_new_tokens=256,
              temperature=0.0,        # deterministic
              do_sample=False,        # greedy decoding
              pad_token_id=tokenizer.eos_token_id,
          )

          # Decode to text
          decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
          generated_texts.extend(decoded)

          if sample_size and len(generated_texts) >= sample_size:
              generated_texts = generated_texts[:sample_size]
              done = True

  for i in range(len(generated_texts)):
    generated_texts[i] = generated_texts[i].split("### Output")[1].strip()

  return generated_texts


def calculate_rouge(generated_texts, true_summary):
  rouge = evaluate.load("rouge")
  results = rouge.compute(predictions=generated_texts, references=true_summary)
  return results





## Compute ROUGE Scores

In [None]:
sample_size = 16

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
)

peft_answers = evaluate_model(validation_prompts, peft_model, sample_size=sample_size)
base_answers = evaluate_model(validation_prompts, base_model, sample_size=sample_size)

Generating: 100%|██████████| 2/2 [00:06<00:00,  3.13s/it]
Generating: 100%|██████████| 2/2 [00:14<00:00,  7.07s/it]


In [None]:
true_summary = [ex['summary'] for ex in validation_dataset]
base_rouge = calculate_rouge(base_answers, true_summary[:sample_size])
peft_rouge = calculate_rouge(peft_answers, true_summary[:sample_size])

In [None]:
peft_rouge

{'rouge1': np.float64(0.48243665079611797),
 'rouge2': np.float64(0.27822490664683464),
 'rougeL': np.float64(0.40504173418948664),
 'rougeLsum': np.float64(0.40765430071303227)}

In [None]:
base_rouge

{'rouge1': np.float64(0.12568145104568468),
 'rouge2': np.float64(0.04600456520630093),
 'rougeL': np.float64(0.10000907386112988),
 'rougeLsum': np.float64(0.11639717964286178)}

## Examine Output Quality

In [None]:
def print_sample(index):
  print(f'### Dialogue:\n\n{validation_dataset[index]["dialogue"]}')
  print(f'\n### Base model summary:\n\n{base_answers[index]}\n\n')
  print(f'### PEFT model summary:\n\n{peft_answers[index]}')


In [None]:
print_sample(0)

### Dialogue:

A: Hi Tom, are you busy tomorrow’s afternoon?
B: I’m pretty sure I am. What’s up?
A: Can you go with me to the animal shelter?.
B: What do you want to do?
A: I want to get a puppy for my son.
B: That will make him so happy.
A: Yeah, we’ve discussed it many times. I think he’s ready now.
B: That’s good. Raising a dog is a tough issue. Like having a baby ;-) 
A: I'll get him one of those little dogs.
B: One that won't grow up too big;-)
A: And eat too much;-))
B: Do you know which one he would like?
A: Oh, yes, I took him there last Monday. He showed me one that he really liked.
B: I bet you had to drag him away.
A: He wanted to take it home right away ;-).
B: I wonder what he'll name it.
A: He said he’d name it after his dead hamster – Lemmy  - he's  a great Motorhead fan :-)))

### Base model summary:

A: Hi Tom, are you busy tomorrow’s afternoon?
B: I’m pretty sure I am. What’s up?
A: Can you go with me to the animal shelter?.
B: What do you want to do?
A: I want to get

In [None]:
print_sample(10)

### Dialogue:

Laura: I need a new printer :/
Laura: thinking about this one
Laura: <file_other>
Jamie: you're sure you need a new one?
Jamie: I mean you can buy a second hand one
Laura: could be

### Base model summary:

Laura: I need a new printer
Jamie: thinking about this one
Laura: <file_other>
Jamie: you're sure you need a new one?
Laura: could be
### Explanation
The code is a simple Python script that simulates a conversation between Laura and Jamie. The conversation is represented as a series of lines, where each line is either a statement or a question. The script uses a simple if-else statement to determine the next line based on the previous line.

### Code
```python
def simulate_conversation():
    print("Laura: I need a new printer :/")
    print("Laura: thinking about this one")
    print("Laura: <file_other>")
    print("Jamie: you're sure you need a new one?")
    print("Jamie: I mean you can buy a second hand one")
    print("Laura: could be")
    print("Jamie: you're 

## 10. Save and Share Your Model

Congratulations! Your model is trained. Now let's save it and share it with the world (or your team).

**What we're saving:**
- Only the LoRA adapter weights (not the entire base model!)
- The tokenizer configuration
- Training metadata

**Why this is cool:**
Instead of saving a 7GB+ model, we're only saving ~10-100MB of adapter weights. Anyone can then:
1. Download the base model
2. Load your adapters on top
3. Use your fine-tuned model

**Next steps after this notebook:**
- Test your model's performance on held-out data
- Compare it to the base model
- Iterate on your dataset or hyperparameters if needed
- Share your model on HuggingFace Hub for others to use!


In [None]:
local_adapter_path = "model-adapters"
peft_model.save_pretrained(local_adapter_path)
tokenizer.save_pretrained(local_adapter_path)

print(f"Adapters saved locally to: {local_adapter_path}")


Adapters saved locally to: model-adapters


In [None]:
def push_to_hub(
    model: PeftModel, tokenizer: AutoTokenizer, model_name: str, hf_username: str
):
    """
    Push a model and tokenizer to Hugging Face Hub.
    """
    model_id = f"{hf_username}/{model_name}"
    try:
        model.push_to_hub(f"{model_id}-adapters", private=False)

        merged_model = model.merge_and_unload()
        merged_model.push_to_hub(model_id, private=False)

        tokenizer.push_to_hub(model_id)
        print(f"Adapters successfully pushed to: https://huggingface.co/{model_id}")
    except Exception as e:
        print(f"Error pushing to Hugging Face: {e}")
        print("Make sure you're logged in with: huggingface-cli login")

In [None]:
save_model_name = "llama3.2-1B-summarization-assistant-qlora"

push_to_hub(peft_model, tokenizer, save_model_name, HF_USERNAME)

print("Training complete and model pushed to HuggingFace Hub")


README.md: 0.00B [00:00, ?B/s]

Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...adapter_model.safetensors: 100%|##########|  40.0B /  40.0B            

Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...adapter_model.safetensors: 100%|##########|  40.0B /  40.0B            

No files have been modified since last commit. Skipping to prevent empty commit.


Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...mpsjn_1lja/tokenizer.json: 100%|##########| 17.2MB / 17.2MB            

No files have been modified since last commit. Skipping to prevent empty commit.


Adapters successfully pushed to: https://huggingface.co/moo3030/llama3.2-1B-summarization-assistant-qlora
Training complete and model pushed to HuggingFace Hub
