<a href="https://colab.research.google.com/github/robberyguy1999/ai/blob/colab/Copy_of_Llama_3_1_8b_%2B_Unsloth_2x_faster_finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our GitHub page [here](https://github.com/unslothai/unsloth?tab=readme-ov-file#-installation-instructions).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save) (e.g. for llama.cpp).

[NEW] Llama-3.1 8b, 70b & 405b are trained on a crazy 15 trillion tokens with 128K long context lengths!

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

In [None]:
# Install Unsloth and its dependencies, then reinstall from the latest GitHub source.
!pip install unsloth
!pip uninstall torch
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"


* We support Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, Open Hermes, etc.
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* [**NEW**] We make Gemma-2 9b / 27b **2x faster**! See our [Gemma-2 9b notebook](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
* [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
* [**NEW**] We make Mistral NeMo 12B 2x faster and fit in under 12GB of VRAM! [Mistral NeMo notebook](https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing)
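
To make the RoPE scaling bullet concrete, here is a minimal sketch of linear position interpolation, which is how kaiokendev's approach is usually described. It is not Unsloth's internal code: the function name `rope_angles`, the tiny `head_dim`, and the `trained_len` of 8192 are assumptions chosen only for illustration.

```python
def rope_angles(position, head_dim=8, base=10000.0, scaling_factor=1.0):
    """Rotation angles for one token position.

    A scaling_factor above 1 compresses positions so that a longer
    context maps back onto the position range seen during training.
    """
    scaled_position = position / scaling_factor
    return [scaled_position / (base ** (2 * i / head_dim))
            for i in range(head_dim // 2)]

trained_len = 8192    # hypothetical context length the model was trained with
target_len = 32768    # the max_seq_length used in this notebook
factor = target_len / trained_len

# With scaling, a position near the end of the long context gets the same
# angles as a position near the end of the trained context.
print(factor)
print(rope_angles(32764, scaling_factor=factor) == rope_angles(8191))
```

The point of the sketch is only the division by `scaling_factor`: positions beyond the trained range are squeezed into it rather than extrapolated.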

In [None]:
from unsloth import FastLanguageModel # Unsloth's wrapper for loading and patching models.
import torch # PyTorch, used here for GPU access and memory statistics.
max_seq_length = 32768 # Can be set to any value, since Unsloth handles RoPE scaling internally. Longer sequences need more VRAM.
dtype = None # None for auto detection.
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit", # Model used in this project. The rest are other supported options.
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit",
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",
] # Find any models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained( # Downloads the model weights and tokenizer and loads them, ready for finetuning.
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit", # Model to download. Can be any from the list above.
    max_seq_length = max_seq_length, # This and the remaining arguments pass through the values set above.
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # LoRA rank. Any number above 0. Suggested 8, 16, 32, 64, 128. A higher rank adds more trainable parameters and capacity, but also more compute and memory.
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16, # Scaling factor for the adapter update. With use_rslora = True the update is scaled by lora_alpha / sqrt(r).
    lora_dropout = 0, # Dropout probability applied inside the LoRA layers. Supports any, but = 0 is optimized.
    bias = "none",    # Which bias terms to train. Supports any, but = "none" is optimized.
    use_gradient_checkpointing = "unsloth", # True or "unsloth". "unsloth" selects Unsloth's own checkpointing, which trades extra compute for lower VRAM use on long contexts.
    random_state = 3407,
    use_rslora = True,  # Rank stabilized LoRA: scales by lora_alpha / sqrt(r) instead of lora_alpha / r.
    loftq_config = None, # LoftQ is a quantization-aware initialization for LoRA adapters. None disables it.
)
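
As a rough check on how small the adapters are, the sketch below counts the LoRA parameters added by the seven target modules and compares the standard and rank stabilized scaling factors. The layer shapes are assumptions taken from the published Llama 3.1 8B configuration (hidden size 4096, key/value width 1024, MLP width 14336, 32 layers), and the helper names are illustrative, not part of Unsloth or PEFT.

```python
import math

# Assumed Llama 3.1 8B shapes (from the published model config)
HIDDEN = 4096
KV_DIM = 1024          # 8 key/value heads x 128 dimensions each
INTERMEDIATE = 14336
NUM_LAYERS = 32

# (in_features, out_features) for each module targeted above
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(rank):
    """Each adapted matrix gains A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (d_in + d_out)
                    for d_in, d_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS

def lora_scale(alpha, rank, rank_stabilized):
    """Multiplier applied to the low-rank update B @ A."""
    return alpha / math.sqrt(rank) if rank_stabilized else alpha / rank

trainable = lora_param_count(16)
print(f"{trainable:,} trainable LoRA parameters")
print("standard scale:", lora_scale(16, 16, rank_stabilized=False))
print("rsLoRA scale:  ", lora_scale(16, 16, rank_stabilized=True))
```

Doubling the rank doubles the adapter size, while the rank stabilized scale shrinks more slowly with rank than the standard one.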

<a name="Data"></a>
### Data Prep
We now use the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), which is a cleaned and filtered version of the original 52K-example [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html). You can replace this code section with your own data prep.

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

If you want to use the `llama-3` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).

In [None]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
# Builds one training string per example. The column names ("instruction", "input", "output") depend on which dataset is being used.
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise the generation will go on forever.
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

from datasets import load_dataset
dataset = load_dataset("yahma/alpaca-cleaned", split = "train") # Loads a dataset from the Hugging Face Hub, named in the format "Author/DatasetName".
dataset = dataset.map(formatting_prompts_func, batched = True,) # Applies the formatting function in batches, adding a "text" column for the trainer.
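
To see what the formatting step produces without loading a model, here is a self-contained sketch that applies the same template to a hand-written batch. The sample rows are made up, and the `EOS_TOKEN` string is a stand-in: the real value comes from `tokenizer.eos_token`.

```python
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = "<|eot_id|>"  # stand-in; use tokenizer.eos_token in the notebook

def format_batch(batch, eos_token=EOS_TOKEN):
    """Mirror of formatting_prompts_func for a plain dict of lists."""
    texts = [
        ALPACA_PROMPT.format(instruction, input_text, output) + eos_token
        for instruction, input_text, output in zip(
            batch["instruction"], batch["input"], batch["output"]
        )
    ]
    return {"text": texts}

# Made-up examples in the same column layout as the Alpaca dataset
sample_batch = {
    "instruction": ["Add the numbers.", "Name the capital."],
    "input": ["2 and 2", "France"],
    "output": ["4", "Paris"],
}

formatted = format_batch(sample_batch)
print(formatted["text"][0])
```

Every string ends with the EOS token, which is what tells the finetuned model where a response stops.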

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We cap training at 900 steps (`max_steps = 900`) to fit a limited runtime. For a full run, keep `num_train_epochs=1` and remove `max_steps`, since a positive `max_steps` overrides the epoch count. We also support TRL's `DPOTrainer`!
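
How much of the dataset a capped run covers follows from the effective batch size. The sketch below is a back-of-the-envelope estimate, assuming the values used in the training cell (batch size 2, gradient accumulation 4, one GPU, 900 steps) and a rounded dataset size of 52,000 examples. The helper name is illustrative, and the trainer's own step accounting may differ slightly.

```python
import math

def steps_per_epoch(num_examples, per_device_batch, grad_accum, num_gpus=1):
    """Optimizer steps needed to see every example once."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    return math.ceil(num_examples / effective_batch)

num_examples = 52_000   # rounded size of the Alpaca dataset (assumption)
max_steps = 900

per_epoch = steps_per_epoch(num_examples, per_device_batch=2, grad_accum=4)
fraction = max_steps / per_epoch

print(f"{per_epoch} steps for one epoch")
print(f"{max_steps} steps cover about {fraction:.0%} of one epoch")
```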

In [None]:
from trl import SFTTrainer # Import the Hugging Face TRL trainer.
from transformers import TrainingArguments # Import the arguments used for training, seen below.
from unsloth import is_bfloat16_supported # Checks whether the GPU supports bfloat16. The T4 does not, so training falls back to float16 there.

trainer = SFTTrainer( # Builds the trainer from the model, data and arguments below.
    model = model, # Model with LoRA adapters, set earlier in code.
    tokenizer = tokenizer, # Tokenizer set earlier in code.
    train_dataset = dataset, # Formatted dataset set earlier in code.
    dataset_text_field = "text", # Name of the dataset column that holds the formatted training text.
    max_seq_length = max_seq_length, # Set earlier in code.
    dataset_num_proc = 2, # Number of processes used to preprocess the dataset.
    packing = False, # Setting this to True can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2, # Examples processed per GPU in each forward/backward pass.
        gradient_accumulation_steps = 4, # Gradients are accumulated over 4 passes before each optimizer step, for an effective batch size of 2 x 4 = 8.
        warmup_steps = 5, # The learning rate ramps up linearly from 0 over the first 5 steps.
        num_train_epochs = 1, # Number of passes over the dataset. Ignored here because max_steps is set.
        max_steps = 900, # Stops training after 900 optimizer steps to fit a limited runtime, at the cost of not seeing the full dataset.
        learning_rate = 2e-4, # Peak learning rate. Higher values train faster but are less stable. Lower values are slower but more stable.
        fp16 = not is_bfloat16_supported(), # Use float16 mixed precision when bfloat16 is unavailable (e.g. on a T4).
        bf16 = is_bfloat16_supported(), # Use bfloat16 mixed precision when the GPU supports it.
        logging_steps = 1, # Log the training loss and other metrics after every step.
        optim = "adamw_8bit", # AdamW with 8-bit optimizer states, which reduces optimizer memory.
        weight_decay = 0.01, # Applies a penalty to the weights during training, discouraging overly large weights.
        lr_scheduler_type = "linear", # After warmup, the learning rate decays linearly to 0 by the final step.
        seed = 3407, # Random seed for reproducibility.
        output_dir = "outputs", # Directory where training checkpoints are written.
    ),
)
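
The interaction of `warmup_steps`, `learning_rate`, `max_steps` and the `"linear"` scheduler can be sketched as a small function. This is a plain-Python approximation of a linear warmup followed by linear decay, using the values from the cell above. The function name is hypothetical and it is not the scheduler object the trainer creates.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=900):
    """Learning rate at a given optimizer step: ramp up, then decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 1, 5, 450, 900):
    print(step, linear_schedule_lr(step))
```

The rate starts at 0, reaches its peak of `2e-4` once warmup ends, and returns to 0 at the final step.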

In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0) # Properties of GPU 0 (name, total memory). Should be a T4 on free Colab.
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3) # Peak VRAM reserved so far, in GB. This is measured before training starts.
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3) # Total VRAM of the GPU, in GB.
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.") # Prints the GPU name and total memory.
print(f"{start_gpu_memory} GB of memory reserved.") # Prints how much VRAM is reserved before training.

In [None]:
trainer_stats = trainer.train() # This command trains according to the arguments above. Can take up to 12 hours for this big Llama model.

In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

<a name="Inference"></a>
### Inference
Let's run the model to check that finetuning worked.

We use a `TextStreamer` to stream the output, so you can see the generation token by token, instead of waiting the whole time!

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format( # Testing the model by asking it a question.
        "Continue the Fibonacci sequence.", # instruction
        "1, 1, 2, 3, 5, 8", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer # Prints tokens as they are generated instead of waiting for the full output.
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128) # Generate up to 128 new tokens, streaming them to the console.
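
The streamed text includes the full Alpaca prompt followed by the model's answer. If only the answer is needed, a small helper like the one below can pull it out of a decoded string. This is a hypothetical convenience function, not part of Unsloth or Transformers. The sample string is hand-written rather than real model output, and the EOS string is a stand-in for `tokenizer.eos_token`.

```python
def extract_response(generated_text, eos_token="<|eot_id|>"):
    """Return the text after the last '### Response:' marker, minus EOS."""
    marker = "### Response:"
    # rpartition returns the whole string as the tail if the marker is absent
    _, _, tail = generated_text.rpartition(marker)
    return tail.replace(eos_token, "").strip()

# Hand-written stand-in for a decoded generation
sample_generation = (
    "### Instruction:\nContinue the Fibonacci sequence.\n\n"
    "### Input:\n1, 1, 2, 3, 5, 8\n\n"
    "### Response:\n13, 21, 34, 55<|eot_id|>"
)

print(extract_response(sample_generation))
```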

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save. We use local saves in this project.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to GGUF, which is the usable version of the product, scroll down!

In [None]:
model.save_pretrained("lora_model") # Saves the LoRA adapters locally.
tokenizer.save_pretrained("lora_model") # Saves the tokenizer alongside the adapters.

### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, Unsloth clones `llama.cpp` and saves to `q8_0` by default. Other methods such as `q4_k_m` are also allowed. This notebook passes `f16` and uses `save_pretrained_gguf` for a local save rather than a cloud save.

Some supported quant methods (full list on this [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
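
For a sense of the file sizes involved, the sketch below estimates GGUF size from bits per weight. It assumes roughly 8.03 billion parameters for Llama 3.1 8B, 16 bits per weight for `f16`, and 8.5 bits per weight for `q8_0` (32 one-byte weights plus a 2-byte scale per block). It ignores file metadata and any tensors stored in a different type, so treat the numbers as rough guides.

```python
# Approximate bits per weight for two formats (assumptions noted above)
BITS_PER_WEIGHT = {
    "f16": 16.0,
    "q8_0": 8.5,
}

NUM_PARAMS = 8.03e9  # approximate parameter count of Llama 3.1 8B

def estimate_gguf_gb(num_params, method):
    """Rough file size in gigabytes (10^9 bytes) for a quantization method."""
    bits = BITS_PER_WEIGHT[method]
    return num_params * bits / 8 / 1e9

for method in BITS_PER_WEIGHT:
    print(f"{method}: ~{estimate_gguf_gb(NUM_PARAMS, method):.1f} GB")
```

By this estimate the `f16` export is close to twice the size of `q8_0`, which matters when downloading the file from Colab.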


In [None]:
# quantization_method takes a list, so several GGUF formats can be exported in one call - much faster if you want multiple!
model.save_pretrained_gguf(
    "model", # Output folder name.
    tokenizer, # The tokenizer is exported with the model.
    quantization_method = ["f16"], # Formats to export. More aggressive quantization gives smaller, faster files at lower quality. f16 keeps 16-bit weights and is the largest.
)

Now use the exported `model/unsloth.F16.gguf` file in `llama.cpp` or a UI-based system like `GPT4All`. We use `LM Studio` for this project, as it is user friendly.


In [None]:
from google.colab import files
files.download('./model/unsloth.F16.gguf')