## LLaMA 2 Instruction Fine-Tuning (Alpaca-Style on the Dolly-15K Dataset)

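Key training elements of the example code:
- Uses the Dolly-15K dataset, formatted in the Alpaca instruction style, to generate the training data
- Loads the `LLaMA 2-7B` model with 4-bit (NF4) quantization
- Trains the model with QLoRA in `bf16` mixed precision
- Uses `SFTTrainer` from `HuggingFace TRL` for supervised instruction fine-tuning
- Uses Flash Attention to speed up training (requires hardware support)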

### Download the databricks-dolly-15k dataset

In [1]:
from datasets import load_dataset
from random import randrange
 
# Load the dataset from the Hub
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")



In [2]:
# Total number of samples in the dataset: 15011
dataset

Dataset({
    features: ['instruction', 'context', 'response', 'category'],
    num_rows: 15011
})

In [3]:
# Print one randomly selected sample
print(dataset[randrange(len(dataset))])

{'instruction': 'What would you do to improve the rules of Tennis, to make it a better TV viewing experience?', 'context': '', 'response': 'I would recommend the following things be changed in the rules of tennis to make it more interesting. (1) Reduce the length of a \'set\' to be 4 games long, and the first person to 4 wins the set, with no requirement to lead by 2 clear games over their opponent. (2) I would only allow one serve - instead of two - per player when starting each point. (3) I would stop players from wasting time between points by limiting their towel breaks to 23 seconds long. (4) If a player\'s service hits the net and goes over, they win the point (this means no replaying of points due to hitting the netcord and flopping over the net). (5) I would declare a rally null and void if it goes over 20 shots; it would count for nothing and both players would have wasted their efforts without any positive outcome. (6) I would not allow players to take a break, between games,

### Format the instruction data in Alpaca style

`Alpaca-style` format: https://github.com/tatsu-lab/stanford_alpaca#data-release

In [4]:
def format_instruction(sample_data):
    """
    Formats the given data into a structured instruction format.

    Parameters:
    sample_data (dict): A dictionary containing 'response' and 'instruction' keys.

    Returns:
    str: A formatted string containing the instruction, input, and response.
    """
    # Check that the required keys exist in sample_data
    if 'response' not in sample_data or 'instruction' not in sample_data:
        # Raise instead of returning a message, so an error string never ends up in the training data
        raise KeyError("'response' or 'instruction' key missing in the input data.")

    return f"""### Instruction:
Use the Input below to create an instruction, which could have been used to generate the input using an LLM. 
 
### Input:
{sample_data['response']}
 
### Response:
{sample_data['instruction']}
"""

In [5]:
# Pick a random sample and print it in Alpaca format
print(format_instruction(dataset[randrange(len(dataset))]))

### Instruction:
Use the Input below to create an instruction, which could have been used to generate the input using an LLM. 
 
### Input:
KPMG International Limited
Ernst & Young
Deloitte
PricewaterhouseCoopers
 
### Response:
What are the big four accounting organizations as per the given passage? List the names in bulleted format.
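For comparison, the Stanford Alpaca data release formats prompts in the forward direction (instruction in, response out), whereas the cell above reverses the roles and trains the model to recover the instruction from a response. The sketch below reproduces the forward layout for a Dolly-style record. `build_alpaca_prompt` and the sample record are hypothetical names introduced only for illustration; they are not part of this notebook.

```python
# Prompt layout from the Stanford Alpaca data release (variant with an input field).
ALPACA_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

# Variant used when the record has no input/context.
ALPACA_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{output}"
)


def build_alpaca_prompt(record):
    """Format a Dolly-style record in the forward (instruction -> response) direction.

    Hypothetical helper: Dolly's 'context' field is mapped to Alpaca's 'input'.
    """
    context = record.get("context", "")
    if context:
        return ALPACA_WITH_INPUT.format(
            instruction=record["instruction"],
            input=context,
            output=record["response"],
        )
    return ALPACA_NO_INPUT.format(
        instruction=record["instruction"],
        output=record["response"],
    )


# Hand-written sample record, not taken from the dataset.
sample = {"instruction": "Name a primary color.", "context": "", "response": "Blue"}
print(build_alpaca_prompt(sample))
```

Which direction to use depends on the task: the reversed template in this notebook produces an instruction generator, not a general instruction follower.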



### Use Flash Attention to speed up training

Check whether your GPU supports `flash-attn` acceleration:

```shell
$ python -c "import torch; assert torch.cuda.get_device_capability()[0] >= 8, 'Hardware not supported for Flash Attention'"

Traceback (most recent call last):
  File "<string>", line 1, in <module>
AssertionError: Hardware not supported for Flash Attention
```
**Result: the NVIDIA T4 used in this demo does not support Flash Attention**
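The same check can be written so that it reports the result instead of failing with an assertion. This is a minimal sketch that reuses the threshold from the shell command above (major compute capability of at least 8); `supports_flash_attention` is a hypothetical helper name, and the `torch` import is guarded so the snippet also runs on a machine without PyTorch.

```python
def supports_flash_attention(capability):
    """Return True when the CUDA compute capability is 8.x or newer.

    `capability` is a (major, minor) tuple, as returned by
    torch.cuda.get_device_capability().
    """
    major, _minor = capability
    return major >= 8


# Known compute capabilities: T4 is 7.5 (Turing), A100 is 8.0 (Ampere).
print("T4  :", supports_flash_attention((7, 5)))
print("A100:", supports_flash_attention((8, 0)))

try:
    import torch
except ImportError:  # PyTorch not installed; skip the live check
    torch = None

if torch is not None and torch.cuda.is_available():
    capability = torch.cuda.get_device_capability()
    print(f"Detected capability {capability}: "
          f"flash-attn usable = {supports_flash_attention(capability)}")
else:
    print("No CUDA device detected; Flash Attention is unavailable.")
```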

#### Install the flash-attn package (requires GPU hardware support)

```shell
$ MAX_JOBS=4 pip install flash-attn --no-build-isolation
```

### Load the model

In [6]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# If the hardware supports it and flash-attn was installed successfully, set use_flash_attention to True
use_flash_attention = False
 
# Uncomment to use flash-attn
# if torch.cuda.get_device_capability()[0] >= 8:
#     from utils.llama_patch import replace_attn_with_flash_attn
#     print("Using flash attention")
#     replace_attn_with_flash_attn()
#     use_flash_attention = True
 
 
# Get the LLaMA 2-7B model weights
# Model weights that do not require approval from Meta AI
model_id = "NousResearch/Llama-2-7b-hf" 
# Once approved by Meta AI, this model ID can be used for the download instead
# model_id = "meta-llama/Llama-2-7b-hf" 
 
 
# Load the quantized model with bitsandbytes (BnB)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
 
# Load the model and the tokenizer
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, use_cache=False, device_map="auto")
model.config.pretraining_tp = 1 
 
# Verify that the model is using flash attention by comparing docstrings
# if use_flash_attention:
#     from utils.llama_patch import forward    
#     assert model.model.layers[0].self_attn.forward.__doc__ == forward.__doc__, "Model is not using flash attention"
 
 
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00,  1.69s/it]
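A rough estimate shows why 4-bit loading matters on a 16 GB card. The sketch below takes the parameter counts from the `print_trainable_parameters()` output further down in this notebook and converts them to weight storage at different precisions. It is a lower bound under simplifying assumptions: it ignores quantization constants, layers that bitsandbytes keeps in higher precision, activations, and optimizer state. `weight_gigabytes` is a hypothetical helper name.

```python
# Base-model parameter count: total reported by print_trainable_parameters()
# minus the LoRA adapter parameters.
BASE_PARAMS = 6_746_804_224 - 8_388_608


def weight_gigabytes(num_params, bits_per_param):
    """Storage for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for label, bits in [("fp16/bf16", 16), ("int8", 8), ("nf4", 4)]:
    print(f"{label:>9}: {weight_gigabytes(BASE_PARAMS, bits):6.2f} GB")
```

Under these assumptions the 16-bit weights alone come close to the T4's memory, while the NF4 weights leave room for activations and the LoRA optimizer state.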


### Load the PEFT model with the QLoRA configuration

In [7]:
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
 
# QLoRA configuration
peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.1,
        r=16,
        bias="none",
        task_type="CAUSAL_LM", 
)
 
 
# Load the PEFT model with the QLoRA configuration
model = prepare_model_for_kbit_training(model)
qlora_model = get_peft_model(model, peft_config)

In [8]:
qlora_model.print_trainable_parameters()

trainable params: 8,388,608 || all params: 6,746,804,224 || trainable%: 0.12433454005023165
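The trainable parameter count can be reproduced by hand from the LoRA settings and the module shapes in the printed model. One assumption is needed: `LoraConfig` above does not set `target_modules`, so the sketch assumes PEFT falls back to its default Llama targets, `q_proj` and `v_proj`. The resulting count matches the reported figure.

```python
hidden_size = 4096      # in_features / out_features of q_proj in the printed model
rank = 16               # r in LoraConfig
num_layers = 32         # "(0-31): 32 x LlamaDecoderLayer"
adapted_per_layer = 2   # assumption: default targets q_proj and v_proj

# Each adapted projection gets lora_A (hidden -> r) and lora_B (r -> hidden), no bias.
params_per_module = hidden_size * rank + rank * hidden_size
trainable = num_layers * adapted_per_layer * params_per_module

all_params = 6_746_804_224  # total reported by print_trainable_parameters()
print(f"trainable params: {trainable:,}")
print(f"trainable%: {100 * trainable / all_params:.6f}")
```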


In [9]:
qlora_model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 4096, padding_idx=0)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaSdpaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): Linear4bit(in_features=4096, out_feature

### Training hyperparameters

In [10]:
import datetime

timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")

# Demo training flag (set to False for actual training)
demo_train = False
output_dir = f"models/llama-7-int4-dolly-{timestamp}"

In [11]:
from transformers import TrainingArguments
 
args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=1 if demo_train else 3,
    max_steps=100,  # when set to a positive value, overrides num_train_epochs
    per_device_train_batch_size=3,  # largest batch size that fits in the 16 GB of an NVIDIA T4
    gradient_accumulation_steps=1 if demo_train else 4,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    logging_steps=10,
    save_strategy="steps" if demo_train else "epoch",
    save_steps=10,
    learning_rate=2e-4,
    bf16=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant"
)
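One interaction in these arguments is easy to miss: as far as the `transformers` schedulers are documented, the `constant` schedule applies no warmup, so `warmup_ratio=0.03` only takes effect with `lr_scheduler_type="constant_with_warmup"`. The pure-Python sketch below imitates the two schedules to show the difference. `lr_multiplier` is a hypothetical helper that mirrors the documented behavior; it is not the library's code.

```python
import math


def lr_multiplier(step, schedule, warmup_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if schedule == "constant":
        return 1.0  # warmup is ignored entirely
    if schedule == "constant_with_warmup":
        if step < warmup_steps:
            return step / max(1, warmup_steps)  # linear ramp from 0 to 1
        return 1.0
    raise ValueError(f"unsupported schedule: {schedule}")


max_steps = 100
warmup_ratio = 0.03
base_lr = 2e-4
warmup_steps = math.ceil(max_steps * warmup_ratio)

for step in (0, 1, 2, 3, 10):
    constant = base_lr * lr_multiplier(step, "constant", warmup_steps)
    warmed = base_lr * lr_multiplier(step, "constant_with_warmup", warmup_steps)
    print(f"step {step:>2}: constant={constant:.2e}  constant_with_warmup={warmed:.2e}")
```

This is consistent with the training log, where the learning rate is reported as 0.0002 at every logging step.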

### Instantiate the SFTTrainer

In [12]:
from trl import SFTTrainer
 
# Maximum sequence length (the training set contains 1158 samples after filtering)
max_seq_length = 2048 
 
trainer = SFTTrainer(
    model=qlora_model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    formatting_func=format_instruction, 
    args=args,
)


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
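With `packing=True`, short formatted samples are concatenated into fixed-length blocks of `max_seq_length` tokens rather than padded one by one, which is presumably why the 15011 raw samples shrink to the 1158 training sequences mentioned in the comment above. The following is a simplified illustration of the idea on toy token ids. `pack_sequences` is a hypothetical function written for this example and does not reproduce TRL's implementation.

```python
def pack_sequences(tokenized_examples, seq_length, eos_token_id):
    """Concatenate examples (each followed by EOS) and cut fixed-length blocks."""
    stream = []
    for token_ids in tokenized_examples:
        stream.extend(token_ids)
        stream.append(eos_token_id)  # separator between consecutive examples

    blocks = [stream[i:i + seq_length] for i in range(0, len(stream), seq_length)]
    # Drop a trailing block that is shorter than seq_length.
    return [block for block in blocks if len(block) == seq_length]


# Toy "tokenized" samples of different lengths (made-up ids).
examples = [[11, 12, 13], [21, 22], [31, 32, 33, 34], [41]]
packed = pack_sequences(examples, seq_length=4, eos_token_id=2)

for block in packed:
    print(block)
```

Note that one block may contain the end of one sample and the start of the next, which is the trade-off packing makes in exchange for not spending compute on padding tokens.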


### Train the model

In [13]:
trainer.train()

 10%|█         | 10/100 [07:43<1:09:10, 46.12s/it]

{'loss': 1.5968, 'learning_rate': 0.0002, 'epoch': 0.1}


 20%|██        | 20/100 [15:29<1:02:03, 46.55s/it]

{'loss': 1.3708, 'learning_rate': 0.0002, 'epoch': 0.21}


 30%|███       | 30/100 [23:15<54:25, 46.65s/it]  

{'loss': 1.2946, 'learning_rate': 0.0002, 'epoch': 0.31}


 40%|████      | 40/100 [31:00<46:23, 46.40s/it]

{'loss': 1.2677, 'learning_rate': 0.0002, 'epoch': 0.41}


 50%|█████     | 50/100 [38:45<38:50, 46.62s/it]

{'loss': 1.2474, 'learning_rate': 0.0002, 'epoch': 0.52}


 60%|██████    | 60/100 [46:28<30:38, 45.96s/it]

{'loss': 1.2176, 'learning_rate': 0.0002, 'epoch': 0.62}


 70%|███████   | 70/100 [54:12<23:12, 46.41s/it]

{'loss': 1.2077, 'learning_rate': 0.0002, 'epoch': 0.73}


 80%|████████  | 80/100 [1:01:55<15:24, 46.22s/it]

{'loss': 1.2135, 'learning_rate': 0.0002, 'epoch': 0.83}


 90%|█████████ | 90/100 [1:09:42<07:48, 46.88s/it]

{'loss': 1.2051, 'learning_rate': 0.0002, 'epoch': 0.93}


100%|██████████| 100/100 [1:17:28<00:00, 46.68s/it]

{'loss': 1.2299, 'learning_rate': 0.0002, 'epoch': 1.04}


100%|██████████| 100/100 [1:17:28<00:00, 46.48s/it]

{'train_runtime': 4648.4776, 'train_samples_per_second': 0.258, 'train_steps_per_second': 0.022, 'train_loss': 1.2851289558410643, 'epoch': 1.04}





TrainOutput(global_step=100, training_loss=1.2851289558410643, metrics={'train_runtime': 4648.4776, 'train_samples_per_second': 0.258, 'train_steps_per_second': 0.022, 'train_loss': 1.2851289558410643, 'epoch': 1.04})
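The numbers in the log can be cross-checked against the configuration. The sketch below assumes a single GPU (the notebook only mentions one T4) and uses the 1158 training sequences noted in the `SFTTrainer` cell; under those assumptions the epoch counter and throughput in the log follow from the batch settings.

```python
per_device_batch = 3      # per_device_train_batch_size
grad_accum = 4            # gradient_accumulation_steps (demo_train is False)
num_devices = 1           # assumption: a single T4
max_steps = 100
packed_samples = 1158     # training sequences noted in the SFTTrainer cell
train_runtime = 4648.4776  # seconds, from the log

sequences_per_step = per_device_batch * grad_accum * num_devices
total_sequences = sequences_per_step * max_steps

print("sequences per optimizer step:", sequences_per_step)
print("epochs after 100 steps:", round(total_sequences / packed_samples, 2))
print("samples per second:", round(total_sequences / train_runtime, 3))
print("steps per second:", round(max_steps / train_runtime, 3))
```

Because `max_steps=100` caps the run, training stops after roughly one pass over the data, not after the three epochs set in `num_train_epochs`.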

### Save the model

In [14]:
trainer.save_model()

### Model inference (testing)