<a href="https://colab.research.google.com/github/shake/colab-Llama-2-ipynb/blob/main/01-Llama_2_Fine_Tune_With_Your_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Practical Introduction to Llama 2 Fine-Tuning**

## **How to Fine-Tune Llama 2 🦙**

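We will use Google Colab to fine-tune the 7B-parameter Llama 2 model on a T4 GPU with high RAM. Note that the T4 has only 16 GB of VRAM, which is barely enough to hold the Llama 2-7b weights (7B × 2 bytes = 14 GB in FP16). On top of that, we have to account for the overhead of optimizer states, gradients, and forward activations (see this excellent article for more information). Full fine-tuning is therefore not possible here: we need a parameter-efficient fine-tuning (PEFT) technique such as LoRA or QLoRA.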
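The memory arithmetic above can be reproduced with a few lines of Python. This is a rough sketch, not a measurement: the 16 bytes per parameter used for full fine-tuning is a common rule of thumb for mixed-precision AdamW and is an assumption that does not come from the original text, and activations are ignored entirely.

```python
def memory_gb(n_params, bytes_per_param):
    """Memory in decimal gigabytes for n_params values of the given width."""
    return n_params * bytes_per_param / 1e9

N_PARAMS = 7e9  # Llama 2-7b, rounded to 7B as in the text

# Weights only, FP16: 2 bytes per parameter.
fp16_weights = memory_gb(N_PARAMS, 2)

# Weights only, 4-bit: half a byte per parameter (quantization constants ignored).
int4_weights = memory_gb(N_PARAMS, 0.5)

# Assumed rule of thumb for full fine-tuning with mixed-precision AdamW:
# fp16 weights (2) + fp16 gradients (2) + fp32 master weights (4)
# + two fp32 optimizer moments (4 + 4) = 16 bytes per parameter.
full_finetune = memory_gb(N_PARAMS, 16)

print(f"FP16 weights:        {fp16_weights:.1f} GB")
print(f"4-bit weights:       {int4_weights:.1f} GB")
print(f"Full fine-tune est.: {full_finetune:.1f} GB")
```

Even under these simplified assumptions, the full fine-tuning estimate is far beyond the 16 GB available on a T4, while the 4-bit weights leave room for adapters and activations.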

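To cut VRAM usage drastically, we must fine-tune the model in 4-bit precision, which is why we use QLoRA here. The good news is that we can rely on the Hugging Face ecosystem and its transformers, accelerate, peft, trl, and bitsandbytes libraries. First, we install and load these libraries.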

Reference

[finetune_llama_v2.py](https://gist.github.com/younesbelkada/9f7f75c94bdc1981c8ca5cc937d4a4da?permalink_comment_id=4636954)

In [1]:
%%capture
# Install the required dependencies (%%capture must be the first line of the cell)
!pip install -q transformers==4.31.0
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 trl==0.4.7 datasets==2.13.0 "safetensors>=0.3.1"
!pip install -q huggingface_hub

In [2]:
%%capture
# Read the Hugging Face token from Colab secrets and log in
from google.colab import userdata
hf_token = userdata.get('huggingface')
!git config --global credential.helper store
!huggingface-cli login --token $hf_token --add-to-git-credential

In [3]:
import os
import huggingface_hub
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

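First, we load a llama-2-7b-chat-hf model (the chat model) and train it on mlabonne/guanaco-llama2-1k (1,000 samples). If you are curious how this dataset was created, take a look at this notebook. Feel free to change it: there are many good datasets on the Hugging Face Hub, such as databricks/databricks-dolly-15k.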

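QLoRA will use a rank of 64 with a scaling parameter of 16. We load the Llama 2 model directly in 4-bit precision using the NF4 type and train it for one epoch. For details on the other parameters, see the TrainingArguments, PeftModel, and SFTTrainer documentation.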
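To get a feel for what a rank of 64 and a scaling parameter of 16 mean, here is a small back-of-the-envelope sketch. The hidden size (4096), layer count (32), and the choice of two adapted projections per layer (`q_proj` and `v_proj`) are assumptions about Llama 2-7b and the PEFT defaults; they are not stated in this notebook, so check them against your own setup.

```python
def lora_scaling(lora_alpha, r):
    """LoRA multiplies the low-rank update by alpha / r."""
    return lora_alpha / r

def lora_trainable_params(r, d_in, d_out, n_modules):
    """Each adapted linear layer gets A (r x d_in) and B (d_out x r)."""
    return n_modules * r * (d_in + d_out)

# Values from the notebook.
lora_r = 64
lora_alpha = 16

# Assumptions (not from the notebook): Llama 2-7b dimensions and
# two adapted attention projections per layer.
hidden_size = 4096
num_layers = 32
adapted_per_layer = 2

scaling = lora_scaling(lora_alpha, lora_r)
trainable = lora_trainable_params(
    lora_r, hidden_size, hidden_size, num_layers * adapted_per_layer
)

print(f"scaling factor: {scaling}")
print(f"trainable LoRA parameters: {trainable:,}")
```

Under these assumptions the adapters add about 33.5 million trainable parameters, well under 1% of the 7B base model.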

In [4]:
# The model that you want to train from the Hugging Face hub
model_name = "chenshake/Llama-2-7b-chat-hf"

# The instruction dataset to use
dataset_name = "mlabonne/guanaco-llama2-1k"


# Fine-tuned model name
new_model = "Llama-2-7b-chat-hf-guanaco"

# Name of the adapters
adapters = "Llama-2-7b-chat-hf-guanaco-adapters"

In [None]:
from datasets import load_dataset

# Load dataset from the hub
dataset = load_dataset(dataset_name)

print(f"Train dataset size: {len(dataset['train'])}")
#print(f"Test dataset size: {len(dataset['test'])}")

In [6]:
################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

In [7]:
################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

In [8]:
################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 4

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule (constant a bit better than cosine)
lr_scheduler_type = "constant"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save a checkpoint every X update steps
save_steps = 25

# Log every X update steps
logging_steps = 25

In [9]:
################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = {"": 0}

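## **We can now start the fine-tuning process**

1. Load the preprocessed dataset.

2. Configure bitsandbytes for 4-bit quantization.

3. Load the Llama 2 model in 4-bit precision on the GPU, along with its tokenizer.

4. Load the QLoRA configuration and the regular training parameters for SFTTrainer.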

In [10]:
# Step 1: Load dataset (you can process it here)
dataset = load_dataset(dataset_name, split="train")



In [11]:
# Step 2: configure bitsandbytes for 4-bit quantization

# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)
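To illustrate what storing weights in 4 bits involves, the sketch below quantizes a small block of values with a simple absmax scheme. This is deliberately not NF4: NF4, as described in the QLoRA paper, uses non-uniform levels derived from a normal distribution, whereas this toy version uses evenly spaced integer levels. The function names are hypothetical and only meant to show the round trip.

```python
def quantize_absmax_4bit(block):
    """Map floats to signed integer codes in [-7, 7] using one scale per block."""
    absmax = max(abs(x) for x in block) or 1.0  # avoid division by zero
    codes = [round(x / absmax * 7) for x in block]
    return codes, absmax

def dequantize_absmax_4bit(codes, absmax):
    """Recover approximate floats from the integer codes and the block scale."""
    return [c / 7 * absmax for c in codes]

block = [0.82, -0.41, 0.05, 0.0, -0.77, 0.33]
codes, absmax = quantize_absmax_4bit(block)
restored = dequantize_absmax_4bit(codes, absmax)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print("codes:   ", codes)
print("restored:", [round(v, 3) for v in restored])
print("max abs error:", round(max_error, 4))
```

The rounding error is bounded by half a quantization step, which is the precision the base model gives up in exchange for the smaller memory footprint.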

In [13]:
# Step 3: Load Llama 2 model in 4-bit precision on a GPU with tokenizer

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

# Load Llama 2 base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)

model.config.use_cache = False
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training


Some weights of LlamaForCausalLM were not initialized from the model checkpoint at chenshake/Llama-2-7b-chat-hf and are newly initialized: ['model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.27.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.24.self_attn.rotary_emb.inv_freq', 'model.layers.31.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.25.self_attn.rotary_emb.inv_freq', 'model.layers.30.self_attn.rota


In [14]:
# Step 4: Load configurations for QLoRA

# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

## **Start training the model on the given dataset**

In [15]:
# Work around the Colab character-encoding (locale) error
import locale
def getpreferredencoding(do_setlocale=True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

In [16]:
# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

# Train model
trainer.train()




You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 25   | 1.3478        |
| 50   | 1.6152        |
| 75   | 1.2211        |
| 100  | 1.4522        |
| 125  | 1.1889        |
| 150  | 1.3752        |
| 175  | 1.1821        |
| 200  | 1.4705        |
| 225  | 1.1633        |
| 250  | 1.533         |




TrainOutput(global_step=250, training_loss=1.354943717956543, metrics={'train_runtime': 1479.2951, 'train_samples_per_second': 0.676, 'train_steps_per_second': 0.169, 'total_flos': 8773998173061120.0, 'train_loss': 1.354943717956543, 'epoch': 1.0})
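The `global_step=250` and `train_runtime` values in the output above follow from the dataset size and batch settings. The sketch below reproduces that arithmetic; the helper name is hypothetical, and the ceiling-based step count is an approximation of what the trainer does rather than its exact logic.

```python
import math

def steps_per_epoch(num_samples, per_device_batch_size,
                    grad_accum_steps=1, num_devices=1):
    """Approximate number of optimizer steps in one epoch."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return math.ceil(num_samples / effective_batch)

# Values taken from the notebook: 1,000 samples, batch size 4, no accumulation.
steps = steps_per_epoch(num_samples=1000, per_device_batch_size=4,
                        grad_accum_steps=1)
runtime_minutes = 1479.2951 / 60  # train_runtime reported above, in seconds

print(steps)                      # 250
print(round(runtime_minutes, 1))  # 24.7
```

With `logging_steps = 25`, 250 steps also explains why the loss table has exactly ten rows.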

In [17]:
# Save trained model
trainer.model.save_pretrained(adapters,safe_serialization=True)

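Training can take a long time, depending on the size of the dataset. Here it took under 30 minutes on a T4 GPU. We can inspect the plots in TensorBoard as follows: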

In [None]:
%load_ext tensorboard
%tensorboard --logdir results/runs

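Let's make sure the model behaves correctly. A proper check would need a more thorough evaluation, but we can use a text-generation pipeline to ask a question such as "What is a large language model?". Note that I format the input to match the Llama 2 prompt template.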
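The prompt formatting can be wrapped in a small helper so that every question is templated the same way. This is a minimal sketch: `build_llama2_prompt` is a hypothetical name, and it only covers the single-turn form used in this notebook, with no system prompt or multi-turn history.

```python
def build_llama2_prompt(user_message):
    """Wrap a single user message in the [INST] ... [/INST] template used in this notebook."""
    return f"<s>[INST] {user_message.strip()} [/INST]"

example = build_llama2_prompt("What is a large language model?")
print(example)
```

The resulting string can be passed directly to the text-generation pipeline in place of the inline f-string.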

In [28]:
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our new model
prompt = "Explain a large language model and its architecture?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])



<s>[INST] Explain a large language model and its architecture? [/INST] A large language model is a type of artificial intelligence model that is trained on a large dataset of text to generate language outputs that are coherent and natural-sounding.

The architecture of a large language model typically consists of several components, including:

1. Input layer: This layer takes in the input text that the model will generate.
2. Encoder: This layer processes the input text and generates a set of intermediate representations that will be used to generate the output text.
3. Decoder: This layer generates the output text based on the intermediate representations generated by the encoder.
4. Attention mechanism: This mechanism allows the model to focus on specific parts of the input text when generating the output text.
5. Output layer: This layer generates the final output text.

The large language model is trained using a large dataset of text, such as


In [18]:
# Empty VRAM
del model
#del pipe
del trainer
import gc
gc.collect()
gc.collect()

20933
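Deleting the Python references and running the garbage collector does not necessarily hand GPU memory back to the driver, because PyTorch caches allocations. A possible addition, sketched here as a hypothetical helper, is to call `torch.cuda.empty_cache()` after collection. The import is guarded so the snippet also runs on a machine without PyTorch or without a GPU.

```python
import gc

def free_memory():
    """Collect garbage and, if CUDA is available, release PyTorch's cached GPU memory.

    Returns True if the CUDA cache was emptied, False otherwise.
    """
    gc.collect()
    try:
        import torch
    except ImportError:
        return False
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        return True
    return False

released = free_memory()
print("CUDA cache emptied:", released)
```

Call it after the `del` statements; memory still referenced by live objects (for example a leftover `pipe`) will not be freed.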

In [19]:
# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)




In [20]:
model = PeftModel.from_pretrained(base_model, adapters)


In [21]:
model = model.merge_and_unload()
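Conceptually, merging folds the low-rank update into the base weight, W_merged = W + (alpha / r) · B · A, so that no adapter is needed at inference time. The toy example below uses tiny pure-Python matrices to show that the merged weight gives the same output as the base weight plus the adapter path; it is an illustration of the idea only, not the PEFT implementation, and all names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(w, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Toy 2x2 base weight with a rank-1 adapter.
W = [[1, 0], [0, 1]]
A = [[1, 2]]      # shape (r, d_in)  = (1, 2)
B = [[1], [3]]    # shape (d_out, r) = (2, 1)
alpha, r = 2, 1

merged = merge_lora(W, A, B, alpha, r)

# Compare the merged path with the unmerged path on an input column vector.
x = [[1], [1]]
y_merged = matmul(merged, x)
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
y_unmerged = [[b[0] + (alpha / r) * a[0]] for b, a in zip(base_out, adapter_out)]

print(merged)      # [[3.0, 4.0], [6.0, 13.0]]
print(y_merged)    # [[7.0], [19.0]]
print(y_unmerged)  # [[7.0], [19.0]]
```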

In [22]:
# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Our weights are merged and we have reloaded the tokenizer. You can now push everything to the Hugging Face Hub to save the model.

In [23]:
model.save_pretrained(new_model, safe_serialization=True)

In [24]:
tokenizer.save_pretrained(new_model)

('Llama-2-7b-chat-hf-guanaco/tokenizer_config.json',
 'Llama-2-7b-chat-hf-guanaco/special_tokens_map.json',
 'Llama-2-7b-chat-hf-guanaco/tokenizer.model',
 'Llama-2-7b-chat-hf-guanaco/added_tokens.json',
 'Llama-2-7b-chat-hf-guanaco/tokenizer.json')

In [27]:
model.push_to_hub(new_model, use_temp_dir=False,safe_serialization=True)
tokenizer.push_to_hub(new_model, use_temp_dir=False)


CommitInfo(commit_url='https://huggingface.co/chenshake/Llama-2-7b-chat-hf-guanaco/commit/3db1c61ea92f5faf2df1a551dae84baababb4001', commit_message='Upload tokenizer', commit_description='', oid='3db1c61ea92f5faf2df1a551dae84baababb4001', pr_url=None, pr_revision=None, pr_num=None)

## backup

In [None]:
model.push_to_hub(adapters, use_temp_dir=False,safe_serialization=True)
tokenizer.push_to_hub(adapters, use_temp_dir=False)

In [None]:
print(locals())