# Merging trained adapters to GGUF

In [None]:
%%capture
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers "trl<0.9.0" peft accelerate bitsandbytes

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 4000 # Any value works; Unsloth handles RoPE scaling internally
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# Pick the base model that the adapter was trained on
llama3_instruct = "unsloth/llama-3-8b-Instruct-bnb-4bit"
llama3_completion = "unsloth/llama-3-8b-bnb-4bit"
llama3_finetuned = "pookie3000/pg_lora_completion_run2"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = llama3_completion,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...",
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


config.json:   0%|          | 0.00/1.20k [00:00<?, ?B/s]

==((====))==  Unsloth: Fast Llama patching release 2024.5
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.26.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/172 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/464 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
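The `dtype = None` setting above leaves the choice of precision to auto detection, and the banner reports `CUDA = 7.5` and `Bfloat16 = FALSE` for the Tesla T4. As a rough illustration of that rule (a sketch only, not Unsloth's actual implementation), the hypothetical helper `pick_dtype` below maps a CUDA compute capability to a dtype name, assuming bfloat16 is used from compute capability 8.0 (Ampere) upward:

```python
def pick_dtype(compute_capability):
    """Return a dtype name for a (major, minor) CUDA compute capability.

    Hypothetical helper for illustration: bfloat16 on Ampere (8.x) and newer,
    float16 on older GPUs such as the Tesla T4 (7.5) or V100 (7.0).
    """
    major, _minor = compute_capability
    return "bfloat16" if major >= 8 else "float16"


# In a live session the tuple could come from torch.cuda.get_device_capability().
print(pick_dtype((7, 5)))  # Tesla T4 -> float16
print(pick_dtype((8, 0)))  # A100    -> bfloat16
```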


Wrap the model with PEFT. This step is necessary before the trained adapter can be loaded.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    max_seq_length = max_seq_length,
)

Unsloth 2024.5 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [None]:
# Replace the freshly initialized (untrained) adapter with the trained one from the Hub
model.delete_adapter('default')
hf_path = "pookie3000/pg_lora_completion_run6"
model.load_adapter(hf_path, "default")
model.set_adapter("default")

adapter_config.json:   0%|          | 0.00/732 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]
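As a sanity check on the 168M adapter file, the size can be estimated from the LoRA settings used above (`r = 16` on all seven projection modules, 32 layers). The sketch below uses the model dimensions that the GGUF conversion log reports for this model (embedding length 4096, feed forward length 14336, 32 heads, 8 key-value heads) and assumes the adapter weights are stored as float32; that storage dtype is an assumption, not something the notebook states.

```python
# Model dimensions as reported in this notebook's logs
HIDDEN = 4096      # embedding length
FFN = 14336        # feed forward length
N_HEADS = 32       # attention heads
N_KV_HEADS = 8     # key-value heads (grouped-query attention)
N_LAYERS = 32      # patched layers
RANK = 16          # LoRA r

head_dim = HIDDEN // N_HEADS
kv_dim = N_KV_HEADS * head_dim  # output width of k_proj / v_proj

# (in_features, out_features) of each targeted linear layer
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, kv_dim),
    "v_proj": (HIDDEN, kv_dim),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, FFN),
    "up_proj": (HIDDEN, FFN),
    "down_proj": (FFN, HIDDEN),
}


def lora_params(in_features, out_features, r):
    """LoRA adds A (r x in) and B (out x r), so r * (in + out) parameters."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in shapes.values())
total = per_layer * N_LAYERS
size_mb = total * 4 / 1e6  # assumption: 4 bytes per parameter (float32)

print(f"{total:,} LoRA parameters, about {size_mb:.0f} MB")
# → 41,943,040 LoRA parameters, about 168 MB
```

The estimate lines up with the size of `adapter_model.safetensors` shown in the download output.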

In [None]:
model.active_adapters

['default']

## Test whether the adapter actually changes the output

In [None]:
BOS_TOKEN = tokenizer.bos_token
print(BOS_TOKEN)

<|begin_of_text|>


In [None]:
prompt_to_compare = "If there were intelligent beings elsewhere in the universe, they'd "

In [None]:
import textwrap
from threading import Thread
from transformers import TextIteratorStreamer

text_streamer = TextIteratorStreamer(tokenizer)
max_print_width = 100

# The trained adapter is active here (set in the cell above)
inputs = tokenizer([prompt_to_compare], return_tensors = "pt").to("cuda")

generation_kwargs = dict(
    inputs,
    streamer = text_streamer,
    max_new_tokens = 256,
    use_cache = True,
)
# Generate in a background thread so the text can be printed as it streams in
thread = Thread(target = model.generate, kwargs = generation_kwargs)
thread.start()

# Print the stream, starting a new line after roughly max_print_width characters
length = 0
for j, new_text in enumerate(text_streamer):
    if j == 0:
        wrapped_text = textwrap.wrap(new_text, width = max_print_width)
        length = len(wrapped_text[-1])
        wrapped_text = "\n".join(wrapped_text)
        print(wrapped_text, end = "")
    else:
        length += len(new_text)
        if length >= max_print_width:
            length = 0
            print()
        print(new_text, end = "")

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|>If there were intelligent beings elsewhere in the universe, they'd99% likely be 
entirely different from us. They might be gas giants, or made out of some substance even more exotic than 
rock. They might have 7 sexes. They might be the servants of a race of super-intelligent robots. The 
only thing they'd have in common with us would be that they'd be made of a small number of atoms 
assembled into sophisticated objects. It seems to me that it would be more interesting to write about aliens 
who were entirely different from us than to write about aliens who were, except for having wings, 
exactly like us. To start with, the story would be more credible. If you describe a society that's nothing 
more than humans with wings, a lot of readers will simply think you've gone crazy. And when you make 
the aliens really different, you can use that as food for thought. For example, in the culture of a 
seven-sexed race, what would be the concept of "gender?" Would there even b
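The line-wrapping loop in the cell above is reused unchanged for the comparison run. One way to avoid the duplication is to move it into a small function; `wrap_stream` below is a hypothetical helper (not part of Unsloth or transformers) that applies the same wrapping logic to any iterable of text chunks and returns the result as a string, which also makes it testable without a GPU.

```python
import textwrap


def wrap_stream(chunks, width=100):
    """Join streamed text chunks, breaking lines at roughly `width` characters.

    Mirrors the notebook's loop: the first chunk (the prompt) is wrapped with
    textwrap, later chunks trigger a newline once the running length hits `width`.
    """
    pieces = []
    length = 0
    for j, new_text in enumerate(chunks):
        if j == 0:
            wrapped = textwrap.wrap(new_text, width=width)
            # textwrap.wrap returns [] for empty input, so guard the index
            length = len(wrapped[-1]) if wrapped else 0
            pieces.append("\n".join(wrapped))
        else:
            length += len(new_text)
            if length >= width:
                length = 0
                pieces.append("\n")
            pieces.append(new_text)
    return "".join(pieces)


print(wrap_stream(["ab", "cd", "ef"], width=4))
```

In the notebook this could be called as `print(wrap_stream(text_streamer, max_print_width))`, at the cost of printing only once generation has finished. Note that `textwrap.wrap` strips trailing whitespace from the first chunk, which is likely why the prompt's final space is missing in the output above (`they'd99%`).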

In [None]:
import textwrap
from threading import Thread
from transformers import TextIteratorStreamer

text_streamer = TextIteratorStreamer(tokenizer)
max_print_width = 100

inputs = tokenizer([prompt_to_compare], return_tensors = "pt").to("cuda")

generation_kwargs = dict(
    inputs,
    streamer = text_streamer,
    max_new_tokens = 256,
    use_cache = True,
)

# Disable the adapter so that the base model generates on its own
model.disable_adapters()
# Generate in a background thread so the text can be printed as it streams in
thread = Thread(target = model.generate, kwargs = generation_kwargs)
thread.start()

# Print the stream, starting a new line after roughly max_print_width characters
length = 0
for j, new_text in enumerate(text_streamer):
    if j == 0:
        wrapped_text = textwrap.wrap(new_text, width = max_print_width)
        length = len(wrapped_text[-1])
        wrapped_text = "\n".join(wrapped_text)
        print(wrapped_text, end = "")
    else:
        length += len(new_text)
        if length >= max_print_width:
            length = 0
            print()
        print(new_text, end = "")

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|>The truth about9/11: The official story is a lie
The truth about 9/11: The 
official story is a lie
The official story of 9/11 is that 19 hijackers, 15 of whom were Saudi nationals, 
took over four commercial jets with box cutters and flew three of the planes into the Twin Towers and 
the Pentagon. A fourth plane was stopped by heroic passengers in a field in Pennsylvania. There is no 
evidence that any of the hijackers had been trained by the government. The official story is that the 
hijackers were all killed on 9/11, and that the government was unable to identify the remains of any of 
them. The official story is that the government had no foreknowledge of the attacks, and that there were 
The official story is a lie.
The truth is that 9/11 was an inside job, and that the 
government knew about the attacks in advance.
There is no evidence that any of the hijackers had been trained 
by the government. The official story is that the hijackers were all killed on 9/11,

In [None]:
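# Merge the adapter into the base model, convert to GGUF, and push to the Hub
hf_token = "TODO"  # replace with a Hugging Face token that has write access
# Note: the repo name says q8, but the quantization method used here is q4_k_m
model.push_to_hub_gguf(
    "pookie3000/pg_completion_v4_q8_gguf",
    quantization_method = "q4_k_m",
    tokenizer = tokenizer,
    token = hf_token,
)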


Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which will take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 5.7G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 6.88 out of 12.67 RAM for saving.


 50%|█████     | 16/32 [00:01<00:01,  9.84it/s]We will save to Disk and not RAM now.
100%|██████████| 32/32 [01:42<00:00,  3.20s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Unsloth: Saving pookie3000/pg_completion_v4_q8_gguf/pytorch_model-00001-of-00004.bin...
Unsloth: Saving pookie3000/pg_completion_v4_q8_gguf/pytorch_model-00002-of-00004.bin...
Unsloth: Saving pookie3000/pg_completion_v4_q8_gguf/pytorch_model-00003-of-00004.bin...
Unsloth: Saving pookie3000/pg_completion_v4_q8_gguf/pytorch_model-00004-of-00004.bin...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GUUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to q4_k_m will take 20 minutes.
 "-____-"     In total, you will have to wait around 26 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...


Unsloth: We must use f16 for non Llama and Mistral models.


Unsloth: [1] Converting model at pookie3000/pg_completion_v4_q8_gguf into f16 GGUF format.
The output location will be ./pookie3000/pg_completion_v4_q8_gguf-unsloth.F16.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: pg_completion_v4_q8_gguf
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: context length = 8192
INFO:hf-to-gguf:gguf: embedding length = 4096
INFO:hf-to-gguf:gguf: feed forward length = 14336
INFO:hf-to-gguf:gguf: head count = 32
INFO:hf-to-gguf:gguf: key-value head count = 8
INFO:hf-to-gguf:gguf: rope theta = 500000.0
INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-05
INFO:hf-to-gguf:gguf: file type = 1
INFO:hf-to-gguf:Set model tokenizer
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:gguf.vocab:Adding 280147 merge(s).
INFO:gguf.vocab:Setting special token type bos to 128000
INFO:gguf.vocab:Setting special t

pg_completion_v4_q8_gguf-unsloth.Q4_K_M.gguf:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/pookie3000/pg_completion_v4_q8_gguf
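The upload progress bar reports 4.92G for the `Q4_K_M` file. A quick back-of-the-envelope check shows what that means in bits per weight. The sketch below assumes the progress bar uses decimal gigabytes and that the model has roughly 8.03 billion parameters; both figures are assumptions, since neither is stated in the logs above.

```python
def bits_per_weight(file_bytes, n_params):
    """Average number of bits stored per model parameter."""
    return file_bytes * 8 / n_params


GGUF_BYTES = 4.92e9  # from the upload progress bar (assumed decimal GB)
N_PARAMS = 8.03e9    # assumption: approximate parameter count of Llama 3 8B

bpw = bits_per_weight(GGUF_BYTES, N_PARAMS)
f16_gb = N_PARAMS * 2 / 1e9  # 2 bytes per parameter at 16-bit precision

print(f"q4_k_m: about {bpw:.1f} bits per weight")
print(f"f16 intermediate file: about {f16_gb:.0f} GB")
```

Under these assumptions the quantized file comes out at a little under 5 bits per weight, roughly a third of the size of the intermediate F16 GGUF.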
