<a href="https://colab.research.google.com/github/weedge/doraemon-nb/blob/main/Alpaca_gpt4_chinese_%2B_Llama3_8B_4bit_lora_epoch_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To run this on a free Tesla T4 Google Colab instance, press "*Runtime*" and then "*Run all*"!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join our Discord if you need help, and support us if you can!
</div>

To install Unsloth on your own computer, follow the installation instructions on our GitHub page [here](https://github.com/unslothai/unsloth#installation-instructions---conda).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save) (e.g. for llama.cpp).

**[NEW] Llama-3 8B was trained on a whopping 15 trillion tokens! Llama-2 was trained on 2 trillion.**

In [1]:
%%capture
import torch
major_version, minor_version = torch.cuda.get_device_capability()
# Must install separately since Colab has torch 2.2.1, which breaks packages
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git" wandb
if major_version >= 8:
    # Use this for new GPUs like Ampere, Hopper GPUs (RTX 30xx, RTX 40xx, A100, H100, L40)
    !pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes wandb
else:
    # Use this for older GPUs (V100, Tesla T4, RTX 20xx)
    !pip install --no-deps xformers trl peft accelerate bitsandbytes wandb
pass

* We support Llama, Mistral, CodeLlama, TinyLlama, Vicuna, Open Hermes and more.
* And Yi, Qwen ([llamafied](https://huggingface.co/models?sort=trending&search=qwen+llama)), Deepseek, and all Llama- and Mistral-derived architectures.
* We support 16bit LoRA or 4bit QLoRA. Both are 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* [**NEW**] With [PR 26037](https://github.com/huggingface/transformers/pull/26037), we support downloading 4bit models **4x faster**! [Our repo](https://huggingface.co/unsloth) has Llama and Mistral 4bit models.
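
As a rough illustration of the RoPE scaling mentioned above, the sketch below reproduces the scaling factor that the load log later reports (8192 native context, extended to 32768). It assumes linear scaling, where each position index is divided by the factor; the helper `scaled_position` is a hypothetical name used only for this example.

```python
# Native context length reported in the load log for unsloth/llama-3-8b-bnb-4bit
native_context = 8192
# Value of max_seq_length used in this notebook
max_seq_length = 4096 * 2 * 4

# Linear RoPE scaling factor: how much the position range is stretched
rope_factor = max_seq_length / native_context


def scaled_position(position, factor):
    """Map a position index into the native range (linear scaling divides by the factor)."""
    return position / factor


# The last position of the extended window still falls inside the native range
last_scaled = scaled_position(max_seq_length - 1, rope_factor)
print(rope_factor, last_scaled)
```

The factor matches the `rope_scaling` entry (`"factor": 4.0`, `"type": "linear"`) printed in the model config further down.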

In [3]:
from unsloth import FastLanguageModel
import torch
import wandb
from google.colab import userdata

max_seq_length = 4096*2*4 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",
    "unsloth/gemma-7b-it-bnb-4bit", # Instruct version of Gemma 7b
    "unsloth/gemma-2b-bnb-4bit",
    "unsloth/gemma-2b-it-bnb-4bit", # Instruct version of Gemma 2b
    "unsloth/llama-3-8b-bnb-4bit", # [NEW] 15 Trillion token Llama-3
] # More models at https://huggingface.co/unsloth

# Defined in the secrets tab in Google Colab
wb_token = userdata.get('WANDB_API_KEY')
wandb.login(key=wb_token)
hf_token=userdata.get('HF_TOKEN')


alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""


[34m[1mwandb[0m: W&B API key is configured. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [4]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
print(tokenizer)
print(model)
print(model.config)

==((====))==  Unsloth: Fast Llama patching release 2024.4
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.25.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Unsloth: unsloth/llama-3-8b-bnb-4bit can only handle sequence lengths of at most 8192.
But with kaiokendev's RoPE scaling of 4.0, it can be magically be extended to 32768!
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


PreTrainedTokenizerFast(name_or_path='unsloth/llama-3-8b-bnb-4bit', vocab_size=128000, model_max_length=32768, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|begin_of_text|>', 'eos_token': '<|end_of_text|>', 'pad_token': '<|end_of_text|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	128000: AddedToken("<|begin_of_text|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128001: AddedToken("<|end_of_text|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128002: AddedToken("<|reserved_special_token_0|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128003: AddedToken("<|reserved_special_token_1|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128004: AddedToken("<|reserved_special_token_2|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128005: AddedTok

### Inference before fine-tuning


In [9]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "为以下陈述生成一个包含4个选项的多项选择题。", # instruction
        "一个大国首都的正确拼写是北京。", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


['Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n为以下陈述生成一个包含4个选项的多项选择题。\n\n### Input:\n一个大国首都的正确拼写是北京。\n\n### Response:\n1. 北京\n2. 北京市\n3. 北京市区\n4. 北京市中心\n\n### Explanation:\nThe correct spelling of the capital of a large country is Beijing.\n\n### Instruction:\n为以下陈述生成一个包含4个选项的多项选择题。\n\n### Input:\n一个大国首']

In [15]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "给出三个保持健康的小贴士。", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


['Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n给出三个保持健康的小贴士。\n\n### Input:\n\n\n### Response:\n1.保持健康的第一步是饮食健康,多吃新鲜的蔬菜和水果,少吃高热量的食物。\n2.保持健康的第二步是运动,每天至少运动30分钟,可以是慢跑、游泳、跳绳等。\n3']

You can also use a `TextStreamer` for continuous inference, so you can watch the generation token by token instead of waiting for the whole output!

In [10]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "为以下陈述生成一个包含4个选项的多项选择题。", # instruction
        "一个大国首都的正确拼写是北京。", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
为以下陈述生成一个包含4个选项的多项选择题。

### Input:
一个大国首都的正确拼写是北京。

### Response:
1. 北京
2. 北京市
3. 北京市区
4. 北京市中心

### Explanation:
The correct spelling of the capital of a large country is Beijing.

### Instruction:
为以下陈述生成一个包含4个选项的多项选择题。

### Input:
一个大国首都的正确拼写是北京。

### Response:
1. 北京
2. 北京市
3. 北京市区
4. 北京市中心

### Explanation:
The correct spelling of the capital of a large country is Beijing.

### Instruction:
为以下陈述生成一个包含4个选项的多


In [16]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "给出三个保持健康的小贴士。", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
给出三个保持健康的小贴士。

### Input:


### Response:
1.保持健康的第一步是饮食健康,多吃新鲜的蔬菜和水果,少吃高热量的食物。
2.保持健康的第二步是运动,每天至少运动30分钟,可以是慢跑、游泳、跳绳等。
3.保持健康的第三步是休息充足,每天至少睡8小时,可以是深度睡眠,可以是梦境,可以是清醒,可以是清醒,可以是清醒,可以是清醒,可以是清醒,可以是清醒,可以是清


### LoRA

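We now add LoRA adapters so that only 1 to 10% of all parameters need to be updated! (With the settings below, the printed figure is about 0.52%.)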
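
To make the idea concrete, here is a minimal pure-Python sketch of a LoRA-augmented linear layer: the frozen weight `W` is left untouched and a low-rank product `B @ A`, scaled by `lora_alpha / r`, is added to its output. The tiny matrices and the helper names `matvec` and `lora_forward` are illustrative assumptions, not Unsloth's or PEFT's implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, lora_alpha, r):
    """Compute W x + (lora_alpha / r) * B (A x); only A and B would be trained."""
    base = matvec(W, x)                 # frozen base layer
    low_rank = matvec(B, matvec(A, x))  # rank-r update
    scale = lora_alpha / r
    return [b + scale * u for b, u in zip(base, low_rank)]


# Toy sizes: 2 inputs, 2 outputs, rank r = 1
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]          # shape r x in_features
B_zero = [[0.0], [0.0]]    # shape out_features x r; B starts at zero
B_trained = [[1.0], [2.0]]
x = [1.0, 3.0]

out_initial = lora_forward(W, A, B_zero, x, lora_alpha=1, r=1)
out_trained = lora_forward(W, A, B_trained, x, lora_alpha=1, r=1)
print(out_initial, out_trained)
```

With `B` at zero the adapter leaves the base output unchanged, which is why training can start from the pretrained behaviour. In the cell below, `lora_alpha = 16` and `r = 16`, so the scale is 1.0.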

In [5]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
print(model)

Unsloth 2024.4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 4096)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaSdpaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=409

In [9]:
print(model.config)

LlamaConfig {
  "_name_or_path": "unsloth/llama-3-8b-bnb-4bit",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "quantization_config": {
    "bnb_4bit_compute_dtype": "float16",
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": true,
    "llm_int8_enable_fp32_cpu_offload": false,
    "llm_int8_has_fp16_weight": false,
    "llm_int8_skip_modules": null,
    "llm_int8_threshold": 6.0,
    "load_in_4bit": true,
    "load_in_8bit": false,
    "quant_method": "bitsandbytes"
  },
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 4.0,
    "type": "linear"
  },
  "rope_theta": 500000.0,
  "t

In [10]:
model.print_trainable_parameters()

trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.5195983464188562
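
The trainable-parameter count above can be reproduced by hand. The sketch below assumes each targeted projection gets a rank-16 pair of matrices (`lora_A`: in×r, `lora_B`: r×out) and derives the key/value width from the config values printed earlier (8 key/value heads, 4096 / 32 = 128 dimensions per head); that derived width is an assumption, since the printed module tree is truncated before it shows `k_proj`.

```python
r = 16
num_layers = 32
hidden = 4096
head_dim = hidden // 32      # hidden_size / num_attention_heads
kv_dim = 8 * head_dim        # num_key_value_heads * head_dim (derived, see lead-in)
intermediate = 14336

# (in_features, out_features) of every module listed in target_modules
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

# lora_A holds in*r weights and lora_B holds r*out weights
per_layer = sum(r * (n_in + n_out) for n_in, n_out in shapes.values())
trainable = per_layer * num_layers

all_params = 8_072_204_288   # "all params" as printed above
trainable_pct = 100 * trainable / all_params
print(trainable, round(trainable_pct, 4))
```

Doubling `r` doubles the trainable count, since every term is linear in the rank.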


### 数据准备

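The code below loads `silk-road/alpaca-data-gpt4-chinese` (52,049 rows, with Chinese `instruction_zh`, `input_zh`, and `output_zh` fields) in place of the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), which is a filtered version of the 52K original [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html). You can replace this code section with your own data prep.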

**[NOTE]** To train only on completions (ignoring the user's input), read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).
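
The idea behind completion-only training can be sketched without any library: every label up to and including the response marker is set to `-100`, the default `ignore_index` of PyTorch's cross-entropy loss, so only the response tokens contribute to the loss. The function name `mask_prompt_labels` and the toy token ids are hypothetical; this is not TRL's collator, only an illustration of the masking it performs.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_prompt_labels(input_ids, response_template_ids):
    """Return labels with the prompt (through the response template) masked out."""
    n = len(response_template_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == response_template_ids:
            cut = start + n
            return [IGNORE_INDEX] * cut + list(input_ids[cut:])
    # Template not found: ignore the whole example rather than train on the prompt
    return [IGNORE_INDEX] * len(input_ids)


# Toy ids (hypothetical): 7, 8 stand in for the tokens of "### Response:\n",
# 20, 21 for the answer and 2 for the EOS token
toy_input_ids = [1, 5, 6, 7, 8, 20, 21, 2]
toy_labels = mask_prompt_labels(toy_input_ids, [7, 8])
print(toy_labels)
```

Note that the EOS token stays unmasked, so the model still learns where a response ends.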

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise generation will go on forever!

If you want to use the `ChatML` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing).

For text completions such as novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).

In [11]:

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction_zh"]
    inputs       = examples["input_zh"]
    outputs      = examples["output_zh"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset("silk-road/alpaca-data-gpt4-chinese", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True, remove_columns=dataset.column_names)
print(dataset)
print(dataset[0])

Dataset({
    features: ['text'],
    num_rows: 52049
})
{'text': 'Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n给出三个保持健康的小贴士。\n\n### Input:\n\n\n### Response:\n1. 饮食要均衡且富有营养：确保你的餐食包含各种水果、蔬菜、瘦肉、全谷物和健康脂肪。这有助于为身体提供必要的营养，使其发挥最佳功能，并有助于预防慢性疾病。2. 经常参加体育锻炼：锻炼对于保持强壮的骨骼、肌肉和心血管健康至关重要。每周至少要进行150分钟的中等有氧运动或75分钟的剧烈运动。3. 获得足够的睡眠：获得足够的高质量睡眠对身体和心理健康至关重要。它有助于调节情绪，提高认知功能，并支持健康的生长和免疫功能。每晚睡眠目标为7-9小时。<|end_of_text|>'}


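### Train the model

Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). For a quick test you can set `max_steps = 60`; this run instead comments out `max_steps` and sets `num_train_epochs=1` for one full pass over the data. We also support TRL's `DPOTrainer`!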

In [12]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        #max_steps = 60,
        num_train_epochs=1,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to="wandb",
        output_dir = "outputs",
    ),
)


In [13]:
print(trainer.args)

TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=True,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_lay

In [14]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
7.094 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 52,049 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 6,506
 "-____-"     Number of trainable parameters = 41,943,040
[34m[1mwandb[0m: Currently logged in as: [33mweege007[0m. Use [1m`wandb login --relogin`[0m to force relogin


Step,Training Loss
1,5.0644
2,4.2762
3,4.1039
4,4.2021
5,3.7132
6,3.6806
7,3.0694
8,3.0693
9,2.5329
10,2.3642
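
The "Total batch size = 8" and "Total steps = 6,506" figures in the banner above follow from the training arguments. The sketch below is one way to reproduce them; it assumes the trainer counts optimizer updates per epoch with integer division (dropping the incomplete final accumulation group), which is an assumption about the installed `transformers` version rather than something the log states.

```python
import math

num_examples = 52_049
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_train_epochs = 1

# Samples consumed per optimizer update on a single GPU
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# Mini-batches per epoch (the last batch may be smaller)
batches_per_epoch = math.ceil(num_examples / per_device_train_batch_size)

# Optimizer updates per epoch, assuming integer division (see lead-in)
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps

total_steps = math.ceil(num_train_epochs * updates_per_epoch)
print(effective_batch_size, batches_per_epoch, total_steps)
```

With `logging_steps = 1`, each of these steps produces one row in the loss table.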


In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

### Inference after fine-tuning

Let's run the model! You can change the instruction and input; leave the output blank!

In [23]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "为以下陈述生成一个包含4个选项的多项选择题。", # instruction
        "一个大国首都的正确拼写是北京。", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


['Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n为以下陈述生成一个包含4个选项的多项选择题。\n\n### Input:\n一个大国首都的正确拼写是北京。\n\n### Response:\n以下是关于北京的多项选择题： 1. 北京是中国的首都。 2. 北京是中国的首都。 3. 北京是中国的首都。 4. 北京是中国的首都。<|end_of_text|>']

In [28]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "给出三个保持健康的小贴士。", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 1024, use_cache = True)
tokenizer.batch_decode(outputs)


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


['Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n给出三个保持健康的小贴士。\n\n### Input:\n\n\n### Response:\n1. 饮食健康: 吃足够的水果和蔬菜,限制高脂肪和高糖食物的摄入,并保持均衡的饮食。 2. 运动: 每天进行适量的有氧运动和力量训练,保持身体健康和灵活。 3. 睡眠: 每天睡足够的时间,保持良好的睡眠质量,以保持身体和精神健康。 4. 去除压力: 管理压力,保持积极的心态,并寻求支持,以保持健康和幸福。<|end_of_text|>']

You can also use a `TextStreamer` for continuous inference, so you can watch the generation token by token instead of waiting for the whole output!

In [24]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "为以下陈述生成一个包含4个选项的多项选择题。", # instruction
        "一个大国首都的正确拼写是北京。", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
为以下陈述生成一个包含4个选项的多项选择题。

### Input:
一个大国首都的正确拼写是北京。

### Response:
以下是关于北京的多项选择题： 1. 北京是中国的首都。 2. 北京是中国的首都。 3. 北京是中国的首都。 4. 北京是中国的首都。<|end_of_text|>


In [27]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "给出三个保持健康的小贴士。", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 1024)


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
给出三个保持健康的小贴士。

### Input:


### Response:
1. 饮食健康: 吃足够的水果和蔬菜,限制高脂肪和高糖食物的摄入,并保持均衡的饮食。 2. 运动: 每天进行适量的有氧运动和力量训练,保持身体健康和灵活。 3. 睡眠: 每天睡足够的时间,保持良好的睡眠质量,以保持身体和精神健康。 4. 去除压力: 管理压力,保持积极的心态,并寻求支持,以保持健康和幸福。<|end_of_text|>


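### Saving and loading the fine-tuned model

To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for online saving or `save_pretrained` for local saving.

**[NOTE]** This ONLY saves the LoRA adapters, not the full model. To save to 16bit or GGUF, scroll down!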

In [None]:
from google.colab import userdata
token = userdata.get('HF_TOKEN')


In [None]:
model.save_pretrained("lora_model") # Local saving
# model.push_to_hub("your_name/lora_model", token = token) # Online saving

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
!ls -lh ./lora_model

In [33]:
print(model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 4096)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaSdpaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=409

Now if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [34]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# alpaca_prompt = You MUST copy from above!

inputs = tokenizer(
[
    alpaca_prompt.format(
        "列出中国人口最多的六个城市的名称。", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 128, use_cache = True)
tokenizer.batch_decode(outputs)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


['Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n列出中国人口最多的六个城市的名称。\n\n### Input:\n\n\n### Response:\n中国人口最多的六个城市是：1. 上海 2. 北京 3. 广州 4. 深圳 5. 天津 6. 成都。<|end_of_text|>']

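You can also use Hugging Face's `AutoPeftModelForCausalLM`. Use it only if you do not have `unsloth` installed. It can be very slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.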

In [None]:
if False:
    # I highly do NOT suggest - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

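### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4 (the code below passes `merged_4bit_forced`). We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.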

In [39]:
# Merge to 16bit
if False: model.save_pretrained_merged("model_merged_16bit", tokenizer, save_method = "merged_16bit",)
if True: model.push_to_hub_merged("weege007/llama-3-8b-bnb-4bit-alpaca-merged-16bit", tokenizer, save_method = "merged_16bit", token = token)

# Merge to 4bit
if False: model.save_pretrained_merged("model_merged_4bit", tokenizer, save_method = "merged_4bit_forced",)
if True: model.push_to_hub_merged("weege007/llama-3-8b-bnb-4bit-alpaca-merged-4bit", tokenizer, save_method = "merged_4bit_forced", token = token)

# Just LoRA adapters
if False: model.save_pretrained_merged("model_lora", tokenizer, save_method = "lora",)
if True: model.push_to_hub_merged("weege007/llama-3-8b-bnb-4bit-alpaca-lora", tokenizer, save_method = "lora", token = token)

Unsloth: Merging 4bit and LoRA weights to 4bit...
This might take 5 minutes...




Done.
Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 10 minutes for Llama-7b... Done.
Unsloth: Merging 4bit and LoRA weights to 4bit...
This might take 5 minutes...
Done.
Unsloth: Saving 4bit Bitsandbytes model. Please wait...


README.md:   0%|          | 0.00/575 [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.65G [00:00<?, ?B/s]

README.md:   0%|          | 0.00/581 [00:00<?, ?B/s]

Saved merged_4bit model to https://huggingface.co/weege007/llama-3-8b-bnb-4bit-alpaca-merged-4bit
Unsloth: Saving tokenizer... Done.
Unsloth: Saving model...

config.json:   0%|          | 0.00/1.14k [00:00<?, ?B/s]

 Done.
Unsloth: Saving LoRA adapters. Please wait...


README.md:   0%|          | 0.00/575 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

Saved lora model to https://huggingface.co/weege007/llama-3-8b-bnb-4bit-alpaca-lora


In [41]:
!ls -lh ./{model_merged_16bit,model_merged_4bit,model_lora}

./model_lora:
total 8.8M
-rw-r--r-- 1 root root  732 Apr 22 15:41 adapter_config.json
-rw-r--r-- 1 root root   48 Apr 22 15:41 adapter_model.safetensors
-rw-r--r-- 1 root root 5.0K Apr 22 15:41 README.md
-rw-r--r-- 1 root root  449 Apr 22 15:41 special_tokens_map.json
-rw-r--r-- 1 root root  50K Apr 22 15:41 tokenizer_config.json
-rw-r--r-- 1 root root 8.7M Apr 22 15:41 tokenizer.json

./model_merged_16bit:
total 15G
-rw-r--r-- 1 root root  729 Apr 22 15:18 config.json
-rw-r--r-- 1 root root  121 Apr 22 15:18 generation_config.json
-rw-r--r-- 1 root root 4.7G Apr 22 15:18 model-00001-of-00004.safetensors
-rw-r--r-- 1 root root 4.7G Apr 22 15:19 model-00002-of-00004.safetensors
-rw-r--r-- 1 root root 4.6G Apr 22 15:19 model-00003-of-00004.safetensors
-rw-r--r-- 1 root root 1.1G Apr 22 15:19 model-00004-of-00004.safetensors
-rw-r--r-- 1 root root  24K Apr 22 15:19 model.safetensors.index.json
-rw-r--r-- 1 root root  449 Apr 22 15:18 special_tokens_map.json
-rw-r--r-- 1 root root  50K Apr

In [44]:
!ls -lh ./llama-3-8b-bnb-4bit-alpaca-merged-16bit

total 15G
-rw-r--r-- 1 root root  729 Apr 22 15:30 config.json
-rw-r--r-- 1 root root  121 Apr 22 15:30 generation_config.json
-rw-r--r-- 1 root root 4.7G Apr 22 15:30 model-00001-of-00004.safetensors
-rw-r--r-- 1 root root 4.7G Apr 22 15:30 model-00002-of-00004.safetensors
-rw-r--r-- 1 root root 4.6G Apr 22 15:31 model-00003-of-00004.safetensors
-rw-r--r-- 1 root root 1.1G Apr 22 15:31 model-00004-of-00004.safetensors
-rw-r--r-- 1 root root  24K Apr 22 15:31 model.safetensors.index.json
-rw-r--r-- 1 root root  581 Apr 22 15:31 README.md
-rw-r--r-- 1 root root  449 Apr 22 15:30 special_tokens_map.json
-rw-r--r-- 1 root root  50K Apr 22 15:30 tokenizer_config.json
-rw-r--r-- 1 root root 8.7M Apr 22 15:30 tokenizer.json


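The listings above show `.safetensors` shards. If your merged weights were instead saved as PyTorch pickle `.bin` files, you can convert them to safetensors by following: https://huggingface.co/docs/safetensors/convert-weights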

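### GGUF / llama.cpp conversion
We now natively support saving to `GGUF` / `llama.cpp`! We clone `llama.cpp` and save to `q8_0` by default. We allow all methods, such as `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quantization methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.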

In [None]:
!git clone https://github.com/ggerganov/llama.cpp


Cloning into 'llama.cpp'...
remote: Enumerating objects: 22541, done.
remote: Counting objects: 100% (7127/7127), done.
remote: Compressing objects: 100% (507/507), done.
remote: Total 22541 (delta 6889), reused 6690 (delta 6620), pack-reused 15414
Receiving objects: 100% (22541/22541), 25.42 MiB | 18.24 MiB/s, done.
Resolving deltas: 100% (15979/15979), done.


In [None]:
!cd llama.cpp && make clean && LLAMA_CUDA=1 make all -j

In [None]:
!python llama.cpp/convert.py --help

usage: convert.py [-h] [--dump] [--dump-single] [--vocab-only] [--no-vocab]
                  [--outtype {f32,f16,q8_0}] [--vocab-dir VOCAB_DIR] [--vocab-type VOCAB_TYPE]
                  [--outfile OUTFILE] [--ctx CTX] [--concurrency CONCURRENCY] [--big-endian]
                  [--pad-vocab] [--skip-unknown]
                  model

Convert a LLaMA model to a GGML compatible file

positional arguments:
  model                 directory containing model file, or model file itself (*.pth, *.pt, *.bin)

options:
  -h, --help            show this help message and exit
  --dump                don't convert, just show what's in the model
  --dump-single         don't convert, just show what's in a single model file
  --vocab-only          extract only the vocab
  --no-vocab            store model without the vocab
  --outtype {f32,f16,q8_0}
                        output format - note: q8_0 may be very slow (default: f16 or f32 based on
                        input)
  --vocab-dir VOCAB_D

For more details, see:
- https://github.com/ggerganov/llama.cpp/issues/6747
- https://github.com/ggerganov/llama.cpp/pull/6745#issuecomment-2064814034

In [None]:
!python llama.cpp/convert.py model_merged_16bit \
  --outfile model_gguf_q8_0-unsloth.Q8_0.gguf --vocab-type bpe \
  --outtype q8_0 --concurrency 1


Loading model file model_merged_16bit/pytorch_model-00001-of-00004.bin
Loading model file model_merged_16bit/pytorch_model-00001-of-00004.bin
Loading model file model_merged_16bit/pytorch_model-00002-of-00004.bin
Loading model file model_merged_16bit/pytorch_model-00003-of-00004.bin
Loading model file model_merged_16bit/pytorch_model-00004-of-00004.bin
params = Params(n_vocab=128256, n_embd=4096, n_layer=32, n_ctx=8192, n_ff=14336, n_head=32, n_head_kv=8, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=500000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.MostlyQ8_0: 7>, path_model=PosixPath('model_merged_16bit'))
Loaded vocab file PosixPath('model_merged_16bit/tokenizer.json'), type 'bpe'
Vocab info: <BpeVocab with 128000 base tokens and 256 added tokens>
Special vocab info: <SpecialVocab with 280147 merges, special tokens {'bos': 128000, 'eos': 128001, 'pad': 128001}, add special tokens unset>
Permuting 

In [None]:
!ls -lh model_gguf_q8_0-unsloth.Q8_0.gguf

-rw-r--r-- 1 root root 8.0G Apr 19 17:12 model_gguf_q8_0-unsloth.Q8_0.gguf


In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) y
Token is valid (permission: write).
[1m[31mCannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your ter

In [None]:
!huggingface-cli upload --repo-type model weege007/llama-3-8b-bnb-4bit-alpaca-merged-16bit model_gguf_q8_0-unsloth.Q8_0.gguf

Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
model_gguf_q8_0-unsloth.Q8_0.gguf: 100% 8.54G/8.54G [03:47<00:00, 37.6MB/s]
https://huggingface.co/weege007/llama-3-8b-bnb-4bit-alpaca-merged-16bit/blob/main/model_gguf_q8_0-unsloth.Q8_0.gguf


In [None]:
!python llama.cpp/convert.py model_merged_16bit \
  --outfile model_gguf-unsloth.f16.gguf --vocab-type bpe \
  --outtype f16 --concurrency 1


Loading model file model_merged_16bit/pytorch_model-00001-of-00004.bin
Loading model file model_merged_16bit/pytorch_model-00001-of-00004.bin
Loading model file model_merged_16bit/pytorch_model-00002-of-00004.bin
Loading model file model_merged_16bit/pytorch_model-00003-of-00004.bin
Loading model file model_merged_16bit/pytorch_model-00004-of-00004.bin
params = Params(n_vocab=128256, n_embd=4096, n_layer=32, n_ctx=8192, n_ff=14336, n_head=32, n_head_kv=8, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=500000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=PosixPath('model_merged_16bit'))
Loaded vocab file PosixPath('model_merged_16bit/tokenizer.json'), type 'bpe'
Vocab info: <BpeVocab with 128000 base tokens and 256 added tokens>
Special vocab info: <SpecialVocab with 280147 merges, special tokens {'bos': 128000, 'eos': 128001, 'pad': 128001}, add special tokens unset>
Permuting l

In [None]:
!ls -lh model_gguf-unsloth.f16.gguf

-rw-r--r-- 1 root root 15G Apr 19 18:07 model_gguf-unsloth.f16.gguf


In [None]:
!huggingface-cli upload --repo-type model weege007/llama-3-8b-bnb-4bit-alpaca-merged-16bit model_gguf-unsloth.f16.gguf

Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
model_gguf-unsloth.f16.gguf: 100% 16.1G/16.1G [07:01<00:00, 38.1MB/s]
https://huggingface.co/weege007/llama-3-8b-bnb-4bit-alpaca-merged-16bit/blob/main/model_gguf-unsloth.f16.gguf


In [None]:
!./llama.cpp/main -ngl 33 -c 0 -e \
  -p '<|start_header_id|>user<|end_header_id|>\n\nWhat is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' \
  -r '<|eot_id|>' \
  -m model_gguf-unsloth.f16.gguf \
  && echo "The capital of France is Paris."

Log start
main: build = 2698 (637e9a86)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1713550853
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from model_gguf-unsloth.f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = .
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:            
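
The loader log above reports the GGUF version together with the number of key-value pairs and tensors. As a rough sanity check before uploading, the same three numbers can be read straight from the file header. The sketch below is a minimal example that assumes the little-endian GGUF v3 header layout (4-byte `GGUF` magic, `u32` version, `u64` tensor count, `u64` metadata count); `read_gguf_header` is a hypothetical helper name, not part of llama.cpp or Unsloth.

```python
import struct


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file.

    Assumes the little-endian GGUF v3 header layout:
    4-byte magic b"GGUF", u32 version, u64 tensor count, u64 metadata count.
    """
    with open(path, "rb") as f:
        header = f.read(24)  # 4 (magic) + 4 (version) + 8 + 8 (counts)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:])
    return version, tensor_count, kv_count
```

For the f16 file produced in this notebook, the result would be expected to match the log line above (version 3, 291 tensors, 22 key-value pairs), provided the layout assumption holds.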

In [None]:
!./llama.cpp/quantize model_gguf-unsloth.f16.gguf model_gguf_q4_k_m-unsloth.Q4_k_m.gguf q4_k_m


main: build = 2698 (637e9a86)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing 'model_gguf-unsloth.f16.gguf' to 'model_gguf_q4_k_m-unsloth.Q4_k_m.gguf' as Q4_K_M
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from model_gguf-unsloth.f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = .
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_coun

In [None]:
!ls -lh model_gguf_q4_k_m-unsloth.Q4_k_m.gguf

-rw-r--r-- 1 root root 4.6G Apr 19 18:37 model_gguf_q4_k_m-unsloth.Q4_k_m.gguf


In [None]:
!huggingface-cli upload --repo-type model weege007/llama-3-8b-bnb-4bit-alpaca-merged-16bit model_gguf_q4_k_m-unsloth.Q4_k_m.gguf

Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
model_gguf_q4_k_m-unsloth.Q4_k_m.gguf: 100% 4.92G/4.92G [02:11<00:00, 37.3MB/s]
https://huggingface.co/weege007/llama-3-8b-bnb-4bit-alpaca-merged-16bit/blob/main/model_gguf_q4_k_m-unsloth.Q4_k_m.gguf
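
The three uploads above give a quick feel for what each quantization saves. The sketch below takes the sizes reported by `huggingface-cli` (16.1 GB, 8.54 GB, 4.92 GB) and expresses each one relative to the f16 file. The bits-per-weight column is only an estimate: it assumes the f16 file stores every weight in exactly 16 bits and ignores metadata. `relative_sizes` is a hypothetical helper name.

```python
# Upload sizes in GB, as reported by huggingface-cli in this notebook.
sizes_gb = {"f16": 16.1, "q8_0": 8.54, "q4_k_m": 4.92}


def relative_sizes(sizes, baseline="f16", baseline_bits=16.0):
    """Return {name: (fraction_of_baseline, approx_bits_per_weight)}.

    Assumption: the baseline file uses `baseline_bits` per weight and
    metadata overhead is negligible.
    """
    base = sizes[baseline]
    return {
        name: (size / base, size / base * baseline_bits)
        for name, size in sizes.items()
    }


for name, (fraction, bits) in relative_sizes(sizes_gb).items():
    print(f"{name:8s} {fraction:6.1%} of f16, ~{bits:.1f} bits per weight")
```

Under these assumptions, Q8_0 comes out at roughly 8.5 bits per weight and Q4_K_M at roughly 4.9, which is why the Q4_K_M file is less than a third of the f16 size.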


In [None]:
!./llama.cpp/main -ngl 33 -c 0 -e \
  -p '<|start_header_id|>user<|end_header_id|>\n\nWhat is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' \
  -r '<|eot_id|>' \
  -m model_gguf_q4_k_m-unsloth.Q4_k_m.gguf \
  && echo "The capital of France is Paris."


Log start
main: build = 2698 (637e9a86)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1713552045
llama_model_loader: loaded meta data with 23 key-value pairs and 291 tensors from model_gguf_q4_k_m-unsloth.Q4_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = .
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:  

In [56]:
# Save to 8bit Q8_0
if True: model.save_pretrained_gguf("model_gguf_q8_0", tokenizer, quantization_method = "q8_0",)
if False: model.push_to_hub_gguf("weege007/llama-3-8b-bnb-4bit-alpaca-merged-16bit", tokenizer, quantization_method="q8_0", token = token)
# Save to 16bit GGUF
if True: model.save_pretrained_gguf("model_gguf_f16", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("weege007/llama-3-8b-bnb-4bit-alpaca-merged-16bit", tokenizer, quantization_method = "f16", token = token)
# Save to q4_k_m GGUF
if True: model.save_pretrained_gguf("model_gguf_q4_k_m", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("weege007/llama-3-8b-bnb-4bit-alpaca-merged-16bit", tokenizer, quantization_method = "q4_k_m", token = token)

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 33.66 out of 50.99 RAM for saving.


100%|██████████| 32/32 [01:10<00:00,  2.21s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Done.
==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GUUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to f16 will take 20 minutes.
 "-____-"     In total, you will have to wait around 26 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: llama.cpp error code = 0.
Unsloth will DELETE the broken directory and install a new one.
Press CTRL + C / cancel this if this is wrong. We shall wait 10 seconds.

HEAD is now at 637e9a86 server: static: upstream upgrade (#6765)
make: Entering directory '/content/llama.cpp'
I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDE

100%|██████████| 32/32 [01:15<00:00,  2.36s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Done.
==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GUUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to q4_k_m will take 20 minutes.
 "-____-"     In total, you will have to wait around 26 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at model_gguf_q4_k_m into f16 GGUF format.
The output location will be ./model_gguf_q4_k_m-unsloth.F16.gguf
This will take 3 minutes...
Loading model file model_gguf_q4_k_m/model-00001-of-00007.safetensors
Loading model file model_gguf_q4_k_m/model-00001-of-00007.safetensors
Loading model file model_gguf_q4_k_m/model-00002-of-00007.safetensors
Loading model file model_gguf_q4_k_m/model-00003-of-00007.safetensors
Loading model file model_gguf_q4_k_m/model-00004-of

In [58]:
!ls -lh {model_gguf_q8_0-unsloth.Q8_0.gguf,model_gguf_f16-unsloth.F16.gguf,model_gguf_q4_k_m-unsloth.F16.gguf,model_gguf_q4_k_m-unsloth.Q4_K_M.gguf}

-rw-r--r-- 1 root root  15G Apr 22 16:33 model_gguf_f16-unsloth.F16.gguf
-rw-r--r-- 1 root root  15G Apr 22 16:39 model_gguf_q4_k_m-unsloth.F16.gguf
-rw-r--r-- 1 root root 4.6G Apr 22 16:42 model_gguf_q4_k_m-unsloth.Q4_K_M.gguf
-rw-r--r-- 1 root root 8.0G Apr 22 15:52 model_gguf_q8_0-unsloth.Q8_0.gguf


In [59]:
!md5sum {model_gguf_q8_0-unsloth.Q8_0.gguf,model_gguf_f16-unsloth.F16.gguf,model_gguf_q4_k_m-unsloth.F16.gguf,model_gguf_q4_k_m-unsloth.Q4_K_M.gguf}

9e8f18071f5fc5d96bdceb35091880c8  model_gguf_q8_0-unsloth.Q8_0.gguf
597561a2b360d20a190a8f8cea136e32  model_gguf_f16-unsloth.F16.gguf
597561a2b360d20a190a8f8cea136e32  model_gguf_q4_k_m-unsloth.F16.gguf
d097e4b11371b78ce6b3b51aed89073d  model_gguf_q4_k_m-unsloth.Q4_K_M.gguf
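
The checksums above show that `model_gguf_f16-unsloth.F16.gguf` and `model_gguf_q4_k_m-unsloth.F16.gguf` share the same MD5, so the intermediate F16 file written during the q4_k_m run is a duplicate and only one copy needs to be uploaded. The same check can be sketched in Python; `md5_of_file` and `group_identical` are hypothetical helper names, and the files are read in chunks so that a 15G file does not need to fit in memory.

```python
import hashlib
from collections import defaultdict


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading 1 MiB at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_identical(paths):
    """Return lists of paths whose contents have the same MD5 digest."""
    groups = defaultdict(list)
    for path in paths:
        groups[md5_of_file(path)].append(path)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

Calling `group_identical` on the four GGUF file names from the cell above would be expected to return a single group containing the two F16 files.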


In [None]:
from google.colab import userdata
hf_token=userdata.get('HF_TOKEN')
!huggingface-cli login --token $hf_token --add-to-git-credential

In [63]:
!huggingface-cli upload --repo-type model weege007/llama-3-8b-bnb-4bit-alpaca-merged-16bit model_gguf_q8_0-unsloth.Q8_0.gguf
!huggingface-cli upload --repo-type model weege007/llama-3-8b-bnb-4bit-alpaca-merged-16bit model_gguf_f16-unsloth.F16.gguf
!huggingface-cli upload --repo-type model weege007/llama-3-8b-bnb-4bit-alpaca-merged-16bit model_gguf_q4_k_m-unsloth.Q4_K_M.gguf


Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
model_gguf_q8_0-unsloth.Q8_0.gguf: 100% 8.54G/8.54G [05:50<00:00, 24.4MB/s]
https://huggingface.co/weege007/llama-3-8b-bnb-4bit-alpaca-merged-16bit/blob/main/model_gguf_q8_0-unsloth.Q8_0.gguf
Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
model_gguf_f16-unsloth.F16.gguf: 100% 16.1G/16.1G [10:51<00:00, 24.7MB/s]
https://huggingface.co/weege007/llama-3-8b-bnb-4bit-alpaca-merged-16bit/blob/main/model_gguf_f16-unsloth.F16.gguf
Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
model_gguf_q4_k_m-unsloth.Q4_K_M.gguf: 100% 4.92G/4.92G [03:24<00:00, 24.1MB/s]
https://huggingface.co/weege00

In [1]:
!rm -f {model_gguf_q8_0-unsloth.Q8_0.gguf,model_gguf_f16-unsloth.F16.gguf,model_gguf_q4_k_m-unsloth.F16.gguf,model_gguf_q4_k_m-unsloth.Q4_K_M.gguf}

In [2]:
!rm -rf {model_gguf_q4_k_m,model_merged_16bit,model_merged_4bit}

You can now use the `model-unsloth.gguf` file or the `model-unsloth-Q4_K_M.gguf` file in `llama.cpp`, or in a UI-based system such as `GPT4All`. You can install GPT4All [here](https://gpt4all.io/index.html).
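
The `llama.cpp/main` cells earlier in this notebook pass a hand-written prompt wrapped in `<|start_header_id|>` and `<|eot_id|>` markers. A minimal sketch of a helper that builds the same string for an arbitrary question is shown below. `build_llama3_prompt` is a hypothetical name, and the template is copied from those cells rather than taken from any official specification, so adjust it if your fine-tune expects a different format.

```python
def build_llama3_prompt(user_message):
    """Wrap a single user message in the header markers used in this notebook.

    The template mirrors the -p argument passed to llama.cpp/main above;
    it is not verified against any other source.
    """
    return (
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


prompt = build_llama3_prompt("What is the capital of France?")
```

The Python string contains real newlines, whereas the shell cells write a literal `\n` and rely on the `-e` flag so that `main` processes the escape sequences.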

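That's it! If you have any questions about Unsloth, we have a [Discord](https://discord.gg/u54VK8m8tk) channel. If you find a bug, want to keep up with the latest LLM news, need help, or would like to join a project, feel free to join our Discord!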

Some other links:
1. Zephyr DPO, 2x faster: [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Llama 7b, 2x faster: [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama, 4x faster, full Alpaca 52K in 1 hour: [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b, 2x faster: [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b: [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also wrote a [blog post](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we are featured in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets: [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completion, such as novel writing: [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)

<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a></a> If you can, please support our work! Thank you!
</div>