## LLaMA 2 Instruction Fine-Tuning (Alpaca-Style on the Dolly-15K Dataset)

### Download the databricks-dolly-15k dataset

In [1]:
# List the installed packages and their versions, filtered by keyword
!pip list | grep -E "torch|transformers|accelerate|bitsandbytes|datasets|optimum|attn"

/bin/bash: /root/miniconda3/envs/aistudy10/lib/libtinfo.so.6: no version information available (required by /bin/bash)
accelerate                1.10.0
bitsandbytes              0.47.0
datasets                  4.0.0
torch                     2.8.0
transformers              4.55.2


In [2]:
import torch
print(torch.cuda.is_available())

True


In [3]:
import torch

cap = torch.cuda.get_device_capability()
print("CUDA device capability:", cap)

if cap[0] >= 8:
    print("GPU supports Flash Attention")
else:
    print("GPU does not support Flash Attention")


CUDA device capability: (8, 6)
GPU supports Flash Attention


In [4]:
import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

In [5]:
from datasets import load_dataset
from random import randrange
 
# Load the dataset from the Hub
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")



In [6]:
# Total number of samples in the dataset: 15011
dataset

Dataset({
    features: ['instruction', 'context', 'response', 'category'],
    num_rows: 15011
})

In [7]:
# Print one randomly selected sample
print(dataset[randrange(len(dataset))])

{'instruction': 'Which film won multiple Filmfare Awards?\nA. Mumbai Meri Jaan\nB. Govardhan\nC. C.I.D.\nD. The end titles are accompanied by the song Aye Dil Hain Mushkil.', 'context': 'Mumbai Meri Jaan (translation: Mumbai, My Life) is a 2008 Indian drama film directed by Nishikant Kamat and produced by Ronnie Screwvala. It stars R. Madhavan, Irrfan Khan, Soha Ali Khan, Paresh Rawal and Kay Kay Menon. It deals with the aftermath of the 11 July 2006 Mumbai train bombings, where 209 people lost their lives and over 700 were injured. It won multiple Filmfare Awards.Rupali Joshi (Soha Ali Khan) is a successful reporter who is getting married in two months. Nikhil Agrawal (Madhavan) is an environmentally conscious executive who rides the train to work every day and is expecting his first child. Suresh (Kay Kay Menon) is a struggling computer tech who spends his time loafing at a local cafe and criticizing Muslims. Meanwhile, Sunil Kadam (Vijay Maurya) struggles with the corruption and ine

### Format the instruction data in Alpaca style

`Alpaca-style` format: https://github.com/tatsu-lab/stanford_alpaca#data-release

In [8]:
def format_instruction(sample_data):
    """
    Formats the given data into a structured instruction format.

    Parameters:
    sample_data (dict): A dictionary containing 'response' and 'instruction' keys.

    Returns:
    str: A formatted string containing the instruction, input, and response.
    """
    # Fail loudly on malformed records so that an error message never ends up as training text
    if 'response' not in sample_data or 'instruction' not in sample_data:
        raise KeyError("'response' or 'instruction' key missing in the input data.")

    return f"""### Instruction:
Use the Input below to create an instruction, which could have been used to generate the input using an LLM. 
 
### Input:
{sample_data['response']}
 
### Response:
{sample_data['instruction']}
"""

In [9]:
# Pick a random sample and print it in Alpaca format
print(format_instruction(dataset[randrange(len(dataset))]))

### Instruction:
Use the Input below to create an instruction, which could have been used to generate the input using an LLM. 

### Input:
'prime', 'prime', 'composite', 'prime', 'composite', 'prime', 'composite', 'composite', 'composite', 'prime', 'composite', 'prime', 'composite', 'composite', 'composite'.

### Response:
Classify the following numbers as 'prime' or 'composite' - 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16.
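
Note that the template above reverses the usual roles: the dataset's `response` becomes the input, and the `instruction` becomes the target the model learns to produce. The sketch below illustrates this on a hand-written record. `toy_sample`, `PROMPT_HEADER`, and `build_prompt` are hypothetical names introduced only for this illustration, and the whitespace is slightly simplified compared with `format_instruction`.

```python
# Hypothetical record with the same keys as a Dolly sample (made up for illustration).
toy_sample = {
    "instruction": "Name three primary colors.",
    "context": "",
    "response": "Red, yellow, and blue.",
    "category": "open_qa",
}

# Prompt part of the template: everything up to and including "### Response:".
PROMPT_HEADER = (
    "### Instruction:\n"
    "Use the Input below to create an instruction, which could have been "
    "used to generate the input using an LLM.\n\n"
    "### Input:\n{response}\n\n"
    "### Response:\n"
)

def build_prompt(sample, with_target=True):
    """Return the full training text, or only the prompt part when with_target is False."""
    prompt = PROMPT_HEADER.format(response=sample["response"])
    if with_target:
        # During training the original instruction is appended as the target.
        prompt += sample["instruction"] + "\n"
    return prompt

train_text = build_prompt(toy_sample)
infer_text = build_prompt(toy_sample, with_target=False)
print(train_text)
```

At inference time only the prompt part (`infer_text`) would be fed to the model, which is then expected to generate an instruction that could have produced the given input.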



### Speed up training with Flash Attention

Check whether your GPU supports `flash-attn` acceleration:

```shell
$ python -c "import torch; assert torch.cuda.get_device_capability()[0] >= 8, 'Hardware not supported for Flash Attention'"

Traceback (most recent call last):
  File "<string>", line 1, in <module>
AssertionError: Hardware not supported for Flash Attention
```
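**Result: the NVIDIA T4 used in the original demo (compute capability 7.5) does not support Flash Attention. The cells below were run on a GPU with compute capability (8, 6), where the check passes.**

#### Install the flash-attn package (requires GPU hardware support)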

```shell
$ MAX_JOBS=4 pip install flash-attn --no-build-isolation
```

In [10]:
!python -c "import torch; assert torch.cuda.get_device_capability()[0] >= 8, 'Hardware not supported for Flash Attention'"

/bin/bash: /root/miniconda3/envs/aistudy10/lib/libtinfo.so.6: no version information available (required by /bin/bash)


In [11]:
!LD_PRELOAD=/lib/x86_64-linux-gnu/libtinfo.so.6 LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64:/root/miniconda3/envs/aistudy5/lib:/root/miniconda3/lib python -c "import torch; assert torch.cuda.get_device_capability()[0] >= 8, 'Hardware not supported for Flash Attention'; print('CUDA device capability:', torch.cuda.get_device_capability())"

/bin/bash: /root/miniconda3/envs/aistudy10/lib/libtinfo.so.6: no version information available (required by /bin/bash)
CUDA device capability: (8, 6)
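
Recent `transformers` releases also accept an `attn_implementation` argument in `from_pretrained`, which may be simpler than patching the attention module by hand. The sketch below only shows how one might choose the value. `pick_attn_implementation` is a hypothetical helper, not part of any library, and it assumes that a compute capability of 8.x or higher plus an installed `flash_attn` package is enough to use `"flash_attention_2"`.

```python
import importlib.util

def pick_attn_implementation(capability, flash_attn_installed=None):
    """Return a value for the `attn_implementation` argument of `from_pretrained`.

    capability: (major, minor) tuple, e.g. from torch.cuda.get_device_capability().
    flash_attn_installed: override for testing; detected automatically when None.
    """
    if flash_attn_installed is None:
        # find_spec returns None when the package cannot be imported.
        flash_attn_installed = importlib.util.find_spec("flash_attn") is not None
    major, _minor = capability
    if major >= 8 and flash_attn_installed:
        return "flash_attention_2"
    # Fall back to PyTorch's built-in scaled dot-product attention.
    return "sdpa"

print(pick_attn_implementation((7, 5), flash_attn_installed=True))   # T4-class GPU
print(pick_attn_implementation((8, 6), flash_attn_installed=True))   # Ampere-class GPU

# Usage with a real model (requires torch and transformers, so left as a comment):
# impl = pick_attn_implementation(torch.cuda.get_device_capability())
# model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation=impl)
```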


### Load the model

In [12]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# If the hardware supports it and flash-attn installed successfully, set use_flash_attention to True
use_flash_attention = False

# Uncomment to use flash-attn
# if torch.cuda.get_device_capability()[0] >= 8:
#     from utils.llama_patch import replace_attn_with_flash_attn
#     print("Using flash attention")
#     replace_attn_with_flash_attn()
#     use_flash_attention = True
 
 
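# Get the LLaMA 2-7B model weights
# Weights that do not require approval from Meta AI
model_id = "NousResearch/Llama-2-7b-hf"
# Once Meta AI has approved your access request, you can download with this model ID instead
# model_id = "meta-llama/Llama-2-7b-hf"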
 
 
# Load the quantized model with bitsandbytes (BnB)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
 
# Load the model and the tokenizer
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, use_cache=False, device_map="auto")
model.config.pretraining_tp = 1 
 
# Verify that the model is using flash attention by comparing docstrings
if use_flash_attention:
    from utils.llama_patch import forward    
    assert model.model.layers[0].self_attn.forward.__doc__ == forward.__doc__, "Model is not using flash attention"
 
 
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:09<00:00,  4.98s/it]
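
To get a feel for why 4-bit loading matters here, the following back-of-the-envelope sketch estimates the memory taken by the base model weights at different precisions. It assumes roughly 6.74 billion base parameters and ignores quantization constants, layers that stay unquantized, activations, and optimizer state, so the real footprint will be somewhat higher.

```python
# Rough weight-only memory estimate; all figures are approximations.
n_params = 6_738_415_616  # assumed base parameter count of Llama-2-7B

bytes_per_param = {
    "fp32": 4.0,
    "bf16": 2.0,
    "nf4 (4-bit)": 0.5,
}

GIB = 2 ** 30
estimates = {name: n_params * size / GIB for name, size in bytes_per_param.items()}

for name, gib in estimates.items():
    print(f"{name:>12}: {gib:6.2f} GiB")
```

Under these assumptions the 4-bit weights need roughly a quarter of the bf16 footprint, which is what leaves room on a 16 GB card for activations and the LoRA optimizer state.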


### Load the PEFT model with a QLoRA configuration

In [13]:
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
 
# QLoRA configuration
peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.1,
        r=16,
        bias="none",
        task_type="CAUSAL_LM", 
)
 
 
# Wrap the quantized model as a PEFT model using the QLoRA configuration
model = prepare_model_for_kbit_training(model)
qlora_model = get_peft_model(model, peft_config)

In [14]:
qlora_model.print_trainable_parameters()

trainable params: 8,388,608 || all params: 6,746,804,224 || trainable%: 0.1243
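
The trainable parameter count above can be reproduced by hand. The sketch assumes that, since `LoraConfig` does not set `target_modules`, PEFT falls back to adapting `q_proj` and `v_proj` for LLaMA models, and that Llama-2-7B has 32 decoder layers with a hidden size of 4096. Each adapted linear layer gets two low-rank matrices, `A` (r x in) and `B` (out x r).

```python
hidden_size = 4096                       # assumed Llama-2-7B hidden dimension
num_layers = 32                          # assumed number of decoder layers
r = 16                                   # LoRA rank from peft_config
lora_alpha = 16                          # LoRA alpha from peft_config
target_modules = ["q_proj", "v_proj"]    # assumed PEFT default for LLaMA

# A is (r x in_features), B is (out_features x r); both projections are 4096 x 4096.
params_per_module = r * (hidden_size + hidden_size)
trainable = params_per_module * len(target_modules) * num_layers

all_params = 6_746_804_224               # "all params" as reported in the output above
trainable_pct = 100 * trainable / all_params
scaling = lora_alpha / r                 # factor applied to the low-rank update

print(f"trainable params: {trainable:,}")
print(f"trainable%: {trainable_pct:.4f}")
print(f"LoRA scaling (alpha / r): {scaling}")
```

The result matches the reported 8,388,608 trainable parameters, which supports the assumption that only the query and value projections are adapted in this run.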


### Training hyperparameters

In [15]:
import datetime

timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")

# Demo training flag (set to False for a real training run)
demo_train = True
output_dir = f"models/llama-7-int4-dolly-{timestamp}"

In [16]:
from transformers import TrainingArguments
 
args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=1 if demo_train else 3,
    max_steps=100 if demo_train else -1,  # a positive max_steps overrides num_train_epochs
    per_device_train_batch_size=3,  # largest batch size that fits in the 16 GB of an NVIDIA T4
    gradient_accumulation_steps=1 if demo_train else 4,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    logging_steps=10,
    save_strategy="steps" if demo_train else "epoch",
    save_steps=10,
    learning_rate=2e-4,
    bf16=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant"  # note: "constant" ignores warmup_ratio; "constant_with_warmup" applies it
)
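
The interaction between batch size, gradient accumulation, and step count is easy to misread, so here is a small arithmetic sketch. It assumes a single GPU and no sequence packing, so one dataset row is one training sample. `plan` is a hypothetical helper used only for this illustration.

```python
import math

num_samples = 15011        # rows in databricks-dolly-15k
per_device_batch = 3       # per_device_train_batch_size
num_devices = 1            # assumption: single GPU

def plan(demo_train):
    """Return (effective batch size, approximate optimizer steps per epoch)."""
    accum = 1 if demo_train else 4   # gradient_accumulation_steps
    effective_batch = per_device_batch * accum * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return effective_batch, steps_per_epoch

demo_batch, demo_steps_per_epoch = plan(demo_train=True)
full_batch, full_steps_per_epoch = plan(demo_train=False)

# With max_steps=100 the demo run stops long before one epoch is complete.
demo_samples_seen = 100 * demo_batch
demo_fraction = demo_samples_seen / num_samples

print(demo_batch, demo_steps_per_epoch)
print(full_batch, full_steps_per_epoch)
print(f"demo run covers about {demo_fraction:.1%} of the dataset")
```

In other words, the 100-step demo only touches a small fraction of the data, so its loss values should be read as a smoke test rather than as a measure of final quality.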

### Instantiate the SFTTrainer

In [17]:
from trl import SFTTrainer
 
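# Maximum sequence length (after filtering, 1158 training samples remain);
# only takes effect if the commented-out max_seq_length argument below is enabled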
max_seq_length = 2048 
 
trainer = SFTTrainer(
    model=qlora_model,
    train_dataset=dataset,
    peft_config=peft_config,
    # max_seq_length=max_seq_length,
    # tokenizer=tokenizer,
    # packing=True,
    formatting_func=format_instruction, 
    args=args,
)
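
The commented-out `packing=True` option refers to sequence packing, where several short samples are concatenated into fixed-length blocks so that fewer tokens are wasted on padding. The sketch below illustrates the general idea on toy token IDs. It is not TRL's actual implementation, and `pack_sequences` is a hypothetical helper.

```python
def pack_sequences(tokenized_samples, max_len, eos_id):
    """Concatenate samples (each followed by an EOS token) and cut into full blocks.

    A trailing remainder shorter than max_len is dropped in this simplified version.
    """
    buffer = []
    for ids in tokenized_samples:
        buffer.extend(ids)
        buffer.append(eos_id)  # mark the boundary between samples
    n_full = len(buffer) // max_len
    return [buffer[i * max_len:(i + 1) * max_len] for i in range(n_full)]

# Toy token IDs, made up for illustration.
toy_samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = pack_sequences(toy_samples, max_len=4, eos_id=0)
print(packed)
```

Because packing merges many rows into fewer, longer sequences, the number of training samples after packing can be much smaller than the number of rows in the dataset.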



### Train the model

In [18]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 10   | 2.2009        |
| 20   | 1.4262        |
| 30   | 1.2001        |
| 40   | 1.3684        |
| 50   | 1.2008        |
| 60   | 1.0969        |
| 70   | 1.0123        |
| 80   | 1.2724        |
| 90   | 1.0692        |
| 100  | 1.27          |


TrainOutput(global_step=100, training_loss=1.3117273139953614, metrics={'train_runtime': 157.0714, 'train_samples_per_second': 1.91, 'train_steps_per_second': 0.637, 'total_flos': 3050077490159616.0, 'train_loss': 1.3117273139953614})
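
The numbers in this output can be cross-checked against the configuration. The sketch assumes that each logged loss is the average over the preceding 10 steps, and that throughput is measured in samples of batch size 3 per step on a single GPU.

```python
# Values copied from the training log and TrainOutput above.
logged_losses = [2.2009, 1.4262, 1.2001, 1.3684, 1.2008,
                 1.0969, 1.0123, 1.2724, 1.0692, 1.27]
train_runtime = 157.0714   # seconds
global_step = 100
batch_size = 3             # per_device_train_batch_size, single GPU assumed

# The mean of the logged window averages should approximate training_loss.
mean_loss = sum(logged_losses) / len(logged_losses)

samples_per_second = global_step * batch_size / train_runtime
steps_per_second = global_step / train_runtime

print(f"mean logged loss:   {mean_loss:.4f}")
print(f"samples per second: {samples_per_second:.2f}")
print(f"steps per second:   {steps_per_second:.3f}")
```

The recomputed values agree with the reported `training_loss`, `train_samples_per_second`, and `train_steps_per_second`, which confirms that the run processed 300 samples in total.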

### Save the model

In [19]:
trainer.save_model()

### Model inference (testing)