This notebook is based on the [Hugging Face blog post](https://huggingface.co/blog/dvgodoy/fine-tuning-llm-hugging-face) and shows how to fine-tune a quantized LLM with LoRA.

In [None]:
import os
import torch
from datasets import load_dataset
from peft import get_peft_model, LoraConfig, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

* peft: Parameter-Efficient Fine-Tuning
* trl: Transformer Reinforcement Learning

# Model Preparation

In [47]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# use qwen model
repo_id = "Qwen/Qwen2.5-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.43s/it]


In [48]:
print(f"{model.get_memory_footprint()/1e6} MB")

2010.088704 MB
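
As a rough cross-check, the footprint can be estimated by hand from the layer shapes in the model printout. This is a back-of-the-envelope sketch, not how `get_memory_footprint()` is implemented. It assumes that each `Linear4bit` weight costs 0.5 bytes, that the embedding is stored in 16-bit, and that the output head shares the embedding matrix (the printout is truncated, so that last point is an assumption).

```python
# Rough estimate of the 4-bit footprint from the printed layer shapes.
# Assumptions (not confirmed by the notebook): 0.5 bytes per Linear4bit
# weight, 2 bytes per embedding weight, output head tied to the embedding.
HIDDEN, KV_DIM, MLP_DIM = 2048, 256, 11008
VOCAB, NUM_LAYERS = 151936, 36

# (in_features, out_features) of the Linear4bit layers in one decoder layer
linear_shapes = [
    (HIDDEN, HIDDEN),   # q_proj
    (HIDDEN, KV_DIM),   # k_proj
    (HIDDEN, KV_DIM),   # v_proj
    (HIDDEN, HIDDEN),   # o_proj
    (HIDDEN, MLP_DIM),  # gate_proj
    (HIDDEN, MLP_DIM),  # up_proj
    (MLP_DIM, HIDDEN),  # down_proj
]
quantized_weights = NUM_LAYERS * sum(i * o for i, o in linear_shapes)
embedding_weights = VOCAB * HIDDEN

estimated_bytes = quantized_weights * 0.5 + embedding_weights * 2
estimated_mb = estimated_bytes / 1e6
print(f"{estimated_mb:.1f} MB")
```

The estimate comes out within about 1 MB of the reported value; the small remainder would be biases, norm weights, and buffers, which the sketch ignores.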


In [None]:
model

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 2048)
    (layers): ModuleList(
      (0-35): 36 x Qwen2DecoderLayer(
        (self_attn): Qwen2SdpaAttention(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=True)
          (k_proj): Linear4bit(in_features=2048, out_features=256, bias=True)
          (v_proj): Linear4bit(in_features=2048, out_features=256, bias=True)
          (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): Qwen2RotaryEmbedding()
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear4bit(in_features=2048, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=2048, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((2048,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((2048,), eps=1e-0

In [50]:
# prepare for lora training
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=8,  # rank
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Qwen2ForCausalLM(
      (model): Qwen2Model(
        (embed_tokens): Embedding(151936, 2048)
        (layers): ModuleList(
          (0-35): 36 x Qwen2DecoderLayer(
            (self_attn): Qwen2SdpaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=2048, out_features=2048, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2048, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=2048, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): Li

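prepare_model_for_kbit_training(model)

* Purpose: prepares a model that was loaded in quantized form (e.g. 4-bit/8-bit) for training.
* What it does:
    * Freezes all existing base-model weights so that backpropagation does not update them (the LoRA layers are added afterwards)
    * Casts the parameters of LayerNorm-style layers to FP32 for numerical stability
    * Makes sure gradient checkpointing and mixed-precision settings work correctly
    * Optionally enables gradient checkpointing to save GPU memory
* Why it is needed:
    * The quantized weights are stored as int4/int8 and cannot be updated by gradients
    * Fine-tuning trains only the LoRA parameters and leaves the quantized weights untouched
    * Some operations need higher precision; otherwise gradients can explode or become NaN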

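LoraConfig(...)
This is the LoRA configuration object. It defines the shape, location, and training hyperparameters of the LoRA layers to insert.

* r=8
    * The LoRA rank, i.e. the inner dimension of the low-rank factorization
    * A larger rank means more trainable parameters and more capacity, but also more memory and compute
* lora_alpha=16
    * The LoRA scaling factor, which controls the magnitude of the update (similar to a learning-rate multiplier)
    * Effective weight update: ΔW = (lora_alpha / r) · B · A
* target_modules=["q_proj", "v_proj"]
    * Adds LoRA only to the attention Query projection (q_proj) and Value projection (v_proj)
    * This is the common configuration from the original LoRA paper: a large effect on model quality at a small memory cost
* lora_dropout=0.05
    * Applies dropout to the input of the LoRA branch to reduce overfitting
* bias="none"
    * Bias parameters are not trained (less overhead)
* task_type="CAUSAL_LM"
    * Tells PEFT that this is a causal language modeling task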
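
The scaled low-rank update can be illustrated on a toy layer. This is a minimal pure-Python sketch with made-up sizes and weights, not PEFT's implementation; the helper names `matvec` and `lora_forward` are hypothetical. It follows the usual LoRA convention in which `A` projects the input down to rank `r`, `B` projects back up, and `B` starts at zero so that the adapted layer initially behaves exactly like the base layer. Dropout is omitted.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, lora_alpha, r):
    """Base output plus the scaled low-rank correction (alpha / r) * B(Ax)."""
    base = matvec(W, x)                  # frozen base layer
    low_rank = matvec(B, matvec(A, x))   # down-project to r, then up-project
    scale = lora_alpha / r
    return [b + scale * d for b, d in zip(base, low_rank)]

# Toy sizes: in_features=3, out_features=2, rank r=1
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]   # frozen weight, shape [out, in]
A = [[1.0, 1.0, 1.0]]                    # shape [r, in]
B_init = [[0.0], [0.0]]                  # shape [out, r], zero at start
B_trained = [[0.5], [-1.0]]              # pretend values after training
x = [1.0, 2.0, 3.0]

y_init = lora_forward(W, A, B_init, x, lora_alpha=2, r=1)
y_trained = lora_forward(W, A, B_trained, x, lora_alpha=2, r=1)
print(y_init, y_trained)
```

With `B` at zero the output equals the base layer's output; once `B` is non-zero, the correction is added on top while `W` itself never changes.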

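model = get_peft_model(model, lora_config)
* Purpose: inserts LoRA layers into the modules of model selected by lora_config
* Result:
    * The original Linear4bit layers of q_proj and v_proj are wrapped in LoRA versions
    * Training updates only these LoRA parameters (the A and B matrices); everything else stays frozen
* Advantages:
    * Most parameters stay quantized (low memory usage)
    * The trainable parameters are a tiny fraction of the model (well under 1% in this notebook), so training is fast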

In [51]:
model.print_trainable_parameters()  # confirm that only the LoRA parameters are trainable

trainable params: 1,843,200 || all params: 3,087,781,888 || trainable%: 0.0597
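
The trainable-parameter count can be reproduced by hand from the shapes in the model printouts. The sketch below assumes that each adapted layer adds one `r × in_features` matrix (`lora_A`) and one `out_features × r` matrix (`lora_B`), both without bias, which is what the printed `lora.Linear4bit` module shows for `q_proj`. The `v_proj` shapes are taken from the base-model printout, since the adapted printout is truncated.

```python
R = 8
NUM_LAYERS = 36
# (in_features, out_features) of the adapted projections in each decoder layer
target_shapes = {"q_proj": (2048, 2048), "v_proj": (2048, 256)}

def lora_param_count(in_features, out_features, r):
    """Parameters of lora_A (r x in) plus lora_B (out x r), without bias."""
    return r * in_features + out_features * r

per_layer = sum(lora_param_count(i, o, R) for i, o in target_shapes.values())
trainable = NUM_LAYERS * per_layer
print(f"{trainable:,}")
```

This gives 1,843,200, matching the number reported by `print_trainable_parameters()`.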


In [52]:
print(model.get_memory_footprint()/1e6)

2640.274688
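
The footprint grows from about 2010 MB to about 2640 MB even though LoRA adds fewer than two million parameters. One plausible accounting, sketched below, is that `prepare_model_for_kbit_training` upcasts every remaining 16-bit tensor (embedding, norms, biases) to FP32 and that the LoRA matrices are stored in FP32. These are assumptions about the library's behavior in the version used here; the sketch also assumes a final norm layer and an output head tied to the embedding, neither of which is visible in the truncated printout.

```python
HIDDEN, KV_DIM, VOCAB, NUM_LAYERS = 2048, 256, 151936, 36

# Non-quantized parameters, assumed to go from 2 bytes (BF16) to 4 bytes (FP32)
embedding = VOCAB * HIDDEN
attn_biases = NUM_LAYERS * (HIDDEN + KV_DIM + KV_DIM)   # q_proj, k_proj, v_proj
norm_weights = NUM_LAYERS * 2 * HIDDEN + HIDDEN         # two per layer + assumed final norm
upcast_extra_bytes = (embedding + attn_biases + norm_weights) * (4 - 2)

# LoRA parameters (count from print_trainable_parameters), assumed FP32
lora_bytes = 1_843_200 * 4

before_mb = 2010.088704  # footprint printed before preparing the model
after_mb = before_mb + (upcast_extra_bytes + lora_bytes) / 1e6
print(f"{after_mb:.2f} MB")
```

Under these assumptions the estimate lands on the 2640.27 MB printed above, with the FP32 embedding accounting for almost all of the increase.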


In [53]:
train_p, tot_p = model.get_nb_trainable_parameters()
print(f'Trainable parameters:      {train_p/1e6:.2f}M')
print(f'Total parameters:          {tot_p/1e6:.2f}M')
print(f'% of trainable parameters: {100*train_p/tot_p:.2f}%')


Trainable parameters:      1.84M
Total parameters:          3087.78M
% of trainable parameters: 0.06%


# Dataset Preparation

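## Why do these two results differ?

### 1. `model.get_memory_footprint()`
- **Purpose**: estimates the **memory occupied by the model's parameters and buffers** (in bytes)
- How it is computed:
  - Iterates over all parameter tensors (`model.parameters()`) and buffers
  - Computes the byte count as `numel()` × `element_size()` for each tensor
  - **Takes the data type into account** (FP32 = 4 B, BF16 = 2 B, int4 ≈ 0.5 B)
- Characteristics:
  - It measures storage size, so it is **directly tied to the quantization precision** rather than only to the parameter count
  - A 4-bit quantized model takes far less memory than an FP16/FP32 model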

---

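### 2. `model.get_nb_trainable_parameters()`
- **Purpose**: counts the **trainable parameters** and the **total parameters** (by count, not by memory)
- How it is computed:
  - Iterates over all `model.parameters()`  
    - Total parameters = sum of all `numel()`
    - Trainable parameters = sum of `numel()` where `requires_grad=True`
- Characteristics:
  - It is a **parameter count** and **ignores the data type**  
  - Whether int4, FP16, or FP32, one parameter counts as one parameter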

---

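### 3. Why are the results different?
- **They measure different things**
  - `get_memory_footprint()` → counts **bytes** (depends on dtype)
  - `get_nb_trainable_parameters()` → counts **parameters** (independent of dtype)
- **Effect of quantization**
  - Quantization does not change the number of parameters (so `total parameters` stays the same)
  - The storage precision drops (e.g. FP16 → int4), so memory usage falls sharply
- **Effect of LoRA**
  - Most base-model parameters are frozen (`requires_grad=False`), so `trainable parameters` covers only the LoRA layers
  - The frozen base parameters still occupy memory (so they are included in the footprint)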

---

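### 4. Example
Assume:
- Original model: 100M parameters in FP16 (2 bytes/parameter) → 200 MB
- LoRA: only 2M parameters are trained (FP32)
- Quantized base: 98M parameters in int4 (0.5 bytes/parameter)

Result:
- **`get_nb_trainable_parameters()`**  
  - Trainable = 2M  
  - Total = 100M  
  - Ratio = 2%
- **`get_memory_footprint()`**  
  - Trainable: 2M × 4 B ≈ 8 MB  
  - Frozen base: 98M × 0.5 B ≈ 49 MB  
  - Total ≈ 57 MB (far less than the 200 MB of the original FP16 model)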
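
The arithmetic of this hypothetical example can be written out as a few lines of Python. The parameter groups and byte sizes are the made-up numbers from the example, not measurements of a real model.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int4": 0.5}

# (parameter count, dtype, trainable?) for the hypothetical model above
groups = [
    (2_000_000, "fp32", True),    # LoRA parameters
    (98_000_000, "int4", False),  # frozen quantized base
]

# Counting parameters ignores the dtype ...
trainable = sum(n for n, _, is_trainable in groups if is_trainable)
total = sum(n for n, _, _ in groups)
# ... while the footprint weights each parameter by its storage size.
footprint_mb = sum(n * BYTES_PER_PARAM[dtype] for n, dtype, _ in groups) / 1e6

print(f"trainable={trainable:,} total={total:,} ratio={100 * trainable / total:.0f}%")
print(f"footprint={footprint_mb:.0f} MB")
```

The same list of parameter groups yields both views: one sum over counts and one sum over counts times bytes per parameter.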


In [54]:
dataset = load_dataset("dvgodoy/yoda_sentences", split="train")
dataset

Dataset({
    features: ['sentence', 'translation', 'translation_extra'],
    num_rows: 720
})

In [55]:
dataset[0]

{'sentence': 'The birch canoe slid on the smooth planks.',
 'translation': 'On the smooth planks, the birch canoe slid.',
 'translation_extra': 'On the smooth planks, the birch canoe slid. Yes, hrrrm.'}

In [72]:
dataset.data

MemoryMappedTable
messages: list<item: struct<content: string, role: string>>
  child 0, item: struct<content: string, role: string>
      child 0, content: string
      child 1, role: string
----
messages: [[    -- is_valid: all not null
    -- child 0 type: string
["The birch canoe slid on the smooth planks.","On the smooth planks, the birch canoe slid. Yes, hrrrm."]
    -- child 1 type: string
["user","assistant"],    -- is_valid: all not null
    -- child 0 type: string
["Glue the sheet to the dark blue background.","Glue the sheet to the dark blue background, you must."]
    -- child 1 type: string
["user","assistant"],...,    -- is_valid: all not null
    -- child 0 type: string
["She called his name many times.","Hrrmmm. His name many times, she called. Hrmmm."]
    -- child 1 type: string
["user","assistant"],    -- is_valid: all not null
    -- child 0 type: string
["When you hear the bell, come quickly.","Hrrmmm. When the bell you hear, come quickly, you must."]
    -- child 

SFTTrainer supports both the conversational format and the instruction format, e.g.:

```text
Conversational
{"messages":[
  {"role": "system", "content": "<general directives>"},
  {"role": "user", "content": "<prompt text>"},
  {"role": "assistant", "content": "<ideal generated text>"}
]}

Instruction
{"prompt": "<prompt text>",
"completion": "<ideal generated text>"}
```

For better compatibility, however, the conversational format is the recommended one.

In [56]:
dataset = dataset.rename_column("sentence", "prompt")
dataset = dataset.rename_column("translation_extra", "completion")
dataset = dataset.remove_columns(["translation"])
dataset[0]

{'prompt': 'The birch canoe slid on the smooth planks.',
 'completion': 'On the smooth planks, the birch canoe slid. Yes, hrrrm.'}

In [57]:
# Adapted from trl.extras.dataset_formatting.instructions_formatting_function
# Converts dataset from prompt/completion format (not supported anymore)
# to the conversational format
def format_dataset(examples):
    if isinstance(examples["prompt"], list):
        output_texts = []
        for i in range(len(examples["prompt"])):
            converted_sample = [
                {"role": "user", "content": examples["prompt"][i]},
                {"role": "assistant", "content": examples["completion"][i]},
            ]
            output_texts.append(converted_sample)
        return {'messages': output_texts}
    else:
        converted_sample = [
            {"role": "user", "content": examples["prompt"]},
            {"role": "assistant", "content": examples["completion"]},
        ]
        return {'messages': converted_sample}

dataset = dataset.map(format_dataset).remove_columns(['prompt', 'completion'])
dataset[0]['messages']


[{'content': 'The birch canoe slid on the smooth planks.', 'role': 'user'},
 {'content': 'On the smooth planks, the birch canoe slid. Yes, hrrrm.',
  'role': 'assistant'}]

# Tokenizer

In [58]:
print(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)
print(tokenizer)


Qwen/Qwen2.5-3B-Instruct
Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-3B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=

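The tokenizer translates text into token IDs, e.g. "Hello World" -> [10123, 12341, 12321] (illustrative IDs).

A typical tokenizer works in the following steps:

1. Preprocessing

    * Case normalization (or none)
    * Unicode normalization
    * Special-character handling (e.g. inserting spaces between Chinese and English text)
2. Special token insertion
    * `<bos>`: beginning of sentence
    * `<eos>`: end of sentence
    * `<pad>`: padding, used to equalize sequence lengths
    * `<unk>`: unknown token
    * Chat markers: `<|im_start|>` `<|im_end|>`
    * Note that these markers have token IDs too, e.g. `<|im_start|>` = 151644 in the Qwen tokenizer above
3. Tokenization
    * BPE (Byte Pair Encoding): repeatedly merges frequent pairs so that text is covered by token units that are as long as possible
    * SentencePiece / WordPiece
    * e.g., "playing" = ["play", "ing"] = [1234, 567]


The model's final output is also a sequence of token IDs, which are looked up in the vocabulary and joined back into a sentence.

The tokenizer outputs N tokens, which are passed to the embedding layer for lookup. The embedding is itself a mapping that turns each token ID into a feature vector.
* `<sentence>` -> Tokenizer -> `<tokens: [N,]>`
* `<tokens>` -> Embedding -> `<embeddings: [N, C]>`
* `<embeddings>` -> Transformer -> Linear -> `<logits: [N, V]>`: V is the vocabulary size; picking the highest-probability token is greedy decoding (like a classification task)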
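
The first two arrows of this pipeline can be mimicked with a toy vocabulary. The sketch below is only an illustration: the vocabulary, IDs, and embedding values are made up, and `encode` uses greedy longest-match lookup as a stand-in for real BPE merging.

```python
# Toy vocabulary; the IDs are made up for illustration.
VOCAB = ["<unk>", "play", "ing", "work", "ed"]
TOKEN_TO_ID = {token: idx for idx, token in enumerate(VOCAB)}

def encode(text):
    """Greedy longest-match tokenization (a simplification, not real BPE)."""
    ids, pos = [], 0
    while pos < len(text):
        for end in range(len(text), pos, -1):   # try the longest piece first
            if text[pos:end] in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[text[pos:end]])
                pos = end
                break
        else:                                   # nothing matched: emit <unk>
            ids.append(TOKEN_TO_ID["<unk>"])
            pos += 1
    return ids

def decode(ids):
    """Look each ID up in the vocabulary and join the pieces."""
    return "".join(VOCAB[i] for i in ids)

# Embedding table: one feature vector of size C=2 per vocabulary entry
EMBEDDING = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]

token_ids = encode("playing")                     # shape [N]
embeddings = [EMBEDDING[i] for i in token_ids]    # shape [N, C]
print(token_ids, embeddings, decode(token_ids))
```

Here "playing" becomes two token IDs, each ID selects one row of the embedding table, and decoding the IDs restores the original string.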

In [59]:
tokenizer.chat_template


'{%- if tools %}\n    {{- \'<|im_start|>system\\n\' }}\n    {%- if messages[0][\'role\'] == \'system\' %}\n        {{- messages[0][\'content\'] }}\n    {%- else %}\n        {{- \'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.\' }}\n    {%- endif %}\n    {{- "\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>" }}\n    {%- for tool in tools %}\n        {{- "\\n" }}\n        {{- tool | tojson }}\n    {%- endfor %}\n    {{- "\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\"name\\": <function-name>, \\"arguments\\": <args-json-object>}\\n</tool_call><|im_end|>\\n" }}\n{%- else %}\n    {%- if messages[0][\'role\'] == \'system\' %}\n        {{- \'<|im_start|>system\\n\' + messages[0][\'content\'] + \'<|im_end|>\\n\' }}\n    {%- else %}\n       

In [60]:
messages = dataset[0]['messages']
print(messages)
print(tokenizer.apply_chat_template(messages, tokenize=False))


[{'content': 'The birch canoe slid on the smooth planks.', 'role': 'user'}, {'content': 'On the smooth planks, the birch canoe slid. Yes, hrrrm.', 'role': 'assistant'}]
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
The birch canoe slid on the smooth planks.<|im_end|>
<|im_start|>assistant
On the smooth planks, the birch canoe slid. Yes, hrrrm.<|im_end|>
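
For plain conversations without tools, the rendered text above follows a simple pattern: each message is wrapped in `<|im_start|>role ... <|im_end|>`, and a default system message is prepended when none is given. The function below is a simplified re-creation of that pattern for illustration; `render_chatml` is a hypothetical helper, and the real Jinja template handles more cases (tool calls in particular).

```python
DEFAULT_SYSTEM = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."

def render_chatml(messages, add_generation_prompt=False):
    """Simplified re-creation of the rendered text above (no tool support)."""
    if not messages or messages[0]["role"] != "system":
        # the template inserts a default system message when none is given
        messages = [{"role": "system", "content": DEFAULT_SYSTEM}] + list(messages)
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        text += "<|im_start|>assistant\n"   # cue the model to answer next
    return text

messages = [
    {"role": "user", "content": "The birch canoe slid on the smooth planks."},
    {"role": "assistant", "content": "On the smooth planks, the birch canoe slid. Yes, hrrrm."},
]
rendered = render_chatml(messages)
print(rendered)
```

Passing `add_generation_prompt=True` with only the user message ends the text with an open assistant turn, which is the form used for inference.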



# Finetune

In [61]:
sft_config = SFTConfig(
    ## GROUP 1: Memory usage
    # These arguments will squeeze the most out of your GPU's RAM
    # Checkpointing
    gradient_checkpointing=True,    # this saves a LOT of memory
    # Set this to avoid exceptions in newer versions of PyTorch
    gradient_checkpointing_kwargs={'use_reentrant': False}, 
    # Gradient Accumulation / Batch size
    # Actual batch (for updating) is same (1x) as micro-batch size
    gradient_accumulation_steps=1,  
    # The initial (micro) batch size to start off with
    per_device_train_batch_size=16, 
    # If batch size would cause OOM, halves its size until it works
    auto_find_batch_size=True,

    ## GROUP 2: Dataset-related
    max_seq_length=64,
    # Dataset
    # packing a dataset means no padding is needed
    packing=True,

    ## GROUP 3: These are typical training parameters
    num_train_epochs=10,
    learning_rate=3e-4,
    # Optimizer
    # 8-bit Adam optimizer - doesn't help much if you're using LoRA!
    optim='paged_adamw_8bit',       
    
    ## GROUP 4: Logging parameters
    logging_steps=10,
    logging_dir='./logs',
    output_dir='./qwen-2_5-mini-yoda-adapter',
    report_to='none'
)


In [62]:
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    args=sft_config,
    train_dataset=dataset,
)


In [63]:
dl = trainer.get_train_dataloader()
batch = next(iter(dl))
batch['input_ids'][0], batch['labels'][0]


(tensor([   198,   2610,    525,   1207,  16948,     11,   3465,    553,  54364,
          14817,     13,   1446,    525,    264,  10950,  17847,     13, 151645,
            198, 151644,    872,    198,  58400,    383,    323,  11967,    304,
            279,   7010,   6176,  16359,     13, 151645,    198, 151644,  77091,
            198,    641,    279,   7010,   6176,  16359,     11,  15743,    383,
            323,  11967,     11,    498,   1969,     13, 151645,    198, 151645,
         151644,   8948,    198,   2610,    525,   1207,  16948,     11,   3465,
            553], device='cuda:0'),
 tensor([   198,   2610,    525,   1207,  16948,     11,   3465,    553,  54364,
          14817,     13,   1446,    525,    264,  10950,  17847,     13, 151645,
            198, 151644,    872,    198,  58400,    383,    323,  11967,    304,
            279,   7010,   6176,  16359,     13, 151645,    198, 151644,  77091,
            198,    641,    279,   7010,   6176,  16359,     11,  15743, 
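
With `packing=True`, the first sequence in the batch starts and ends in the middle of a conversation, which is what one would expect if tokenized examples are concatenated and then cut into fixed-length blocks. The sketch below illustrates that idea with made-up token IDs; `pack_sequences` is a hypothetical helper, not TRL's implementation, and details such as how the final partial block is handled may differ.

```python
def pack_sequences(tokenized_examples, block_size):
    """Concatenate examples and cut the stream into blocks of block_size tokens."""
    stream = [tok for example in tokenized_examples for tok in example]
    usable = len(stream) - len(stream) % block_size   # drop the trailing remainder
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]

# Three "tokenized" examples of different lengths (made-up IDs)
examples = [[1, 2, 3, 4, 5], [6, 7, 8], [9, 10, 11, 12]]
blocks = pack_sequences(examples, block_size=4)
print(blocks)
```

Every block is full, so no padding tokens are needed, but block boundaries no longer coincide with example boundaries.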

In [64]:
trainer.train()


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]


Step,Training Loss
10,4.804
20,2.9303
30,2.1904
40,1.7527
50,1.4539
60,1.293
70,1.1903
80,1.1072
90,1.0562
100,0.9726


TrainOutput(global_step=390, training_loss=1.1107213252629988, metrics={'train_runtime': 447.539, 'train_samples_per_second': 13.787, 'train_steps_per_second': 0.871, 'total_flos': 6578583030988800.0, 'train_loss': 1.1107213252629988, 'epoch': 10.0})

In [66]:
def gen_prompt(tokenizer, sentence):
    converted_sample = [{"role": "user", "content": sentence}]
    prompt = tokenizer.apply_chat_template(
        converted_sample, tokenize=False, add_generation_prompt=True
    )
    return prompt


In [67]:
sentence = 'The Force is strong in you!'
prompt = gen_prompt(tokenizer, sentence)
print(prompt)


<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
The Force is strong in you!<|im_end|>
<|im_start|>assistant



In [69]:
def generate(model, tokenizer, prompt, max_new_tokens=64, skip_special_tokens=False):
    tokenized_input = tokenizer(
        prompt, add_special_tokens=False, return_tensors="pt"
    ).to(model.device)

    model.eval()
    gen_output = model.generate(**tokenized_input,
                                eos_token_id=tokenizer.eos_token_id,
                                max_new_tokens=max_new_tokens)
    
    output = tokenizer.batch_decode(gen_output, skip_special_tokens=skip_special_tokens)
    return output[0]


In [70]:
print(generate(model, tokenizer, prompt))


<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
The Force is strong in you!<|im_end|>
<|im_start|>assistant
Strong in you! The Force is .<|im_end|>
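
The decoded output repeats the whole prompt. To keep only the model's reply, one option is to cut the text after the last assistant marker and before the end-of-turn token. This is a small string-based sketch that assumes the output follows the ChatML layout shown above; `extract_reply` is a hypothetical helper.

```python
def extract_reply(decoded, start="<|im_start|>assistant\n", end="<|im_end|>"):
    """Return the text of the last assistant turn in a ChatML-formatted string."""
    reply = decoded.rsplit(start, 1)[-1]      # text after the last assistant marker
    return reply.split(end, 1)[0].strip()     # stop at the end-of-turn token

decoded = (
    "<|im_start|>system\n"
    "You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "The Force is strong in you!<|im_end|>\n"
    "<|im_start|>assistant\n"
    "Strong in you! The Force is .<|im_end|>"
)
print(extract_reply(decoded))
```

An alternative is to slice the generated token IDs at the prompt length before decoding, which avoids relying on the marker strings.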
