# Tutorial: How to Finetune Llama-3 and Use In Ollama
[Tutorial](https://docs.unsloth.ai/basics/tutorial-how-to-finetune-llama-3-and-use-in-ollama)

## 1. Create Miniconda environment
```shell
(base) $ conda create --name unsloth_env python=3.11.9 pytorch-cuda=12.1 pytorch=2.3.0 cudatoolkit -c pytorch -c nvidia -c xformers -y
(base) $ conda activate unsloth_env
```
## 2. Install and run `jupyter notebook`
```shell
(unsloth_env) $ conda install jupyter
(unsloth_env) $ jupyter notebook
```

## 3. Install Unsloth

In [None]:
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

## 4. Selecting a model to finetune

In [1]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

Unsloth: Patching Xformers to fix some performance issues.
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [2]:
# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit",
    "unsloth/DeepSeek-R1-Distill-Qwen-14B-unsloth-bnb-4bit",
    "unsloth/DeepSeek-R1-Distill-Qwen-7B-unsloth-bnb-4bit",
    "unsloth/phi-4-unsloth-bnb-4bit",
    "unsloth/Mistral-Small-24B-Instruct-2501-unsloth-bnb-4bit",
    "unsloth/Meta-Llama-3.1-8B-Instruct-unsloth-bnb-4bit",
    "unsloth/Llama-3.2-3B-Instruct-unsloth-bnb-4bit",
    "unsloth/Llama-3.2-1B-Instruct-unsloth-bnb-4bit",
    "unsloth/Mistral-Small-24B-Base-2501-unsloth-bnb-4bit",
    "unsloth/Llama-3.1-8B-Instruct-unsloth-bnb-4bit",
    "unsloth/Llama-3.1-8B-unsloth-bnb-4bit",
    "unsloth/Llama-3.2-11B-Vision-Instruct-unsloth-bnb-4bit",
    "unsloth/Qwen2.5-VL-7B-Instruct-unsloth-bnb-4bit",
    "unsloth/Qwen2.5-VL-72B-Instruct-unsloth-bnb-4bit",
    "unsloth/Qwen2.5-VL-3B-Instruct-unsloth-bnb-4bit",
    "unsloth/Qwen2-VL-7B-Instruct-unsloth-bnb-4bit",
    "unsloth/Qwen2-VL-2B-Instruct-unsloth-bnb-4bit",
    "unsloth/Pixtral-12B-2409-unsloth-bnb-4bit",
    "unsloth/Llama-3.2-11B-Vision-unsloth-bnb-4bit",
    "unsloth/QwQ-32B-Preview-unsloth-bnb-4bit",
    "unsloth/Qwen2.5-3B-Instruct-unsloth-bnb-4bit",
] # More models at https://huggingface.co/unsloth

There are 3 other settings which you can toggle:

1. `max_seq_length = 2048`
This sets the context length used for finetuning. Gemini, for example, has a context length of over 1 million tokens, whilst Llama-3 has 8192. You can select ANY number, but we recommend 2048 for testing purposes. Unsloth also supports very long context finetuning, and we show we can provide 4x longer context lengths than the best.

2. `dtype = None`
Keep this as `None` for auto detection. You can also select `torch.float16` (for Tesla T4 or V100) or `torch.bfloat16` (for Ampere and newer GPUs).

3. `load_in_4bit = True`
We do finetuning in 4 bit quantization. This reduces memory usage by 4x, allowing us to actually do finetuning on a free GPU with 16 GB of memory. 4 bit quantization essentially converts weights into a limited set of numbers to reduce memory usage. A drawback is a 1-2% accuracy degradation. Set this to False on larger GPUs like H100s if you want that tiny extra accuracy.
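
To make the 4x figure concrete, the sketch below estimates the memory taken by the weights alone at two precisions. It assumes roughly 8 billion parameters for an 8B model and ignores activations, optimizer state and quantization overhead such as scales, so treat the numbers as rough lower bounds. `weight_memory_gb` is a hypothetical helper written for this illustration, not part of Unsloth.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 8e9  # assumption: ~8 billion parameters for an "8B" model

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 16-bit weights
int4_gb = weight_memory_gb(NUM_PARAMS, 4)   # 4-bit quantized weights

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f" 4-bit weights: {int4_gb:.1f} GB")
print(f"Reduction: {fp16_gb / int4_gb:.0f}x")
```

Under these assumptions the 16-bit weights alone would already fill a 16 GB GPU, which is why the 4-bit path matters on smaller cards.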

In [6]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.1-8B-unsloth-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth 2025.2.15: Fast Llama patching. Transformers: 4.49.0.
   \\   /|    GPU: NVIDIA GeForce RTX 4070. Max memory: 11.994 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.3.0. CUDA: 8.9. CUDA Toolkit: 12.1. Triton: 2.3.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


## 5. Parameters for finetuning
The goal is to change these numbers to increase accuracy while counteracting over-fitting. Over-fitting is when the language model memorizes a dataset and cannot answer novel questions. We want the final model to answer unseen questions, not to memorize.

1. `r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128`
The rank of the finetuning process. A larger number uses more memory and will be slower, but can increase accuracy on harder tasks. We normally suggest numbers from 8 (for fast finetunes) up to 128. Numbers that are too large can cause over-fitting, damaging your model's quality.

2. `target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj",],`
We select all modules to finetune. You can remove some to reduce memory usage and make training faster, but we strongly advise against this. Just train on all modules!

3. `lora_alpha = 16,`
The scaling factor for finetuning. A larger number will make the finetune learn more about your dataset, but can promote over-fitting. We suggest setting this equal to the rank `r`, or double it.

4. `lora_dropout = 0, # Supports any, but = 0 is optimized`
Leave this as 0 for faster training! Dropout can reduce over-fitting, but not by much.

5. `bias = "none",    # Supports any, but = "none" is optimized`
Leave this as `"none"` for faster and less over-fit training!

6. `use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context`
Options include `True`, `False` and `"unsloth"`. We suggest `"unsloth"` since we reduce memory usage by an extra 30% and support extremely long context finetunes. You can read up here: [https://unsloth.ai/blog/long-context](https://unsloth.ai/blog/long-context) for more details.

7. `random_state = 3407,`
The number to determine deterministic runs. Training and finetuning needs random numbers, so setting this number makes experiments reproducible.

8. `use_rslora = False,  # We support rank stabilized LoRA`
Advanced feature that sets the effective LoRA scaling automatically: rank-stabilized LoRA divides `lora_alpha` by `sqrt(r)` instead of by `r`. You can use this if you want!

9. `loftq_config = None, # And LoftQ`
Advanced feature to initialize the LoRA matrices to the top r singular vectors of the weights. Can improve accuracy somewhat, but can make memory usage explode at the start.
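
To see how `r`, `target_modules` and `lora_alpha` interact, here is a rough sketch that counts the trainable LoRA parameters and computes the adapter scaling. The layer shapes are assumptions taken from the public Llama-3.1-8B configuration (hidden size 4096, MLP size 14336, key/value width 1024, 32 layers), not values read from this notebook, and the helper names are hypothetical.

```python
import math

# Assumed Llama-3.1-8B shapes (from the public model config, not from this notebook).
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 4096, 14336, 1024, 32

MODULE_SHAPES = {  # module name: (in_features, out_features)
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds two matrices per layer: A (r x in) and B (out x r)."""
    return r * (in_features + out_features)


def total_trainable(r: int, modules=tuple(MODULE_SHAPES)) -> int:
    """Trainable parameters when adapting the given modules in every layer."""
    per_layer = sum(lora_params(*MODULE_SHAPES[name], r) for name in modules)
    return per_layer * NUM_LAYERS


def lora_scale(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Standard LoRA scales by alpha / r; rank-stabilized LoRA by alpha / sqrt(r)."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r


for rank in (8, 16, 64):
    print(f"r = {rank:>3}: {total_trainable(rank):,} trainable parameters")

print("scale (LoRA):  ", lora_scale(16, 16))
print("scale (rsLoRA):", lora_scale(16, 16, use_rslora=True))
```

With these assumed shapes, `r = 16` gives 41,943,040 trainable parameters. The training log later in this notebook reports 40,370,176, so that run may have used a model with different layer shapes.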

In [7]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.2.15 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


## 6. Alpaca Dataset
We now use the Alpaca dataset from vicgalle, a 52K-example version of the original Alpaca dataset with outputs generated by GPT-4.

You can access the GPT4 version of the Alpaca dataset here: [https://huggingface.co/datasets/vicgalle/alpaca-gpt4](https://huggingface.co/datasets/vicgalle/alpaca-gpt4). An older first version of the dataset is here: [https://github.com/tatsu-lab/stanford_alpaca](https://github.com/tatsu-lab/stanford_alpaca).

In [22]:
from datasets import load_dataset
dataset = load_dataset("vicgalle/alpaca-gpt4", split = "train")
print(dataset.column_names)

['instruction', 'input', 'output', 'text']


You can see that each row has an instruction, an input and an output column (the listing above also shows a fourth `text` column). We essentially combine each row into 1 large prompt like below. We then use this to finetune the language model, which makes it behave much like ChatGPT. We call this process **supervised instruction finetuning**.

In [23]:
dataset

Dataset({
    features: ['instruction', 'input', 'output', 'text'],
    num_rows: 52002
})

## 7. Multiple columns for finetuning

Other finetuning libraries require you to manually prepare your dataset by merging all your columns into 1 prompt. In Unsloth, we simply provide a function called `to_sharegpt` which does this in 1 go!

To access the Titanic finetuning notebook or if you want to upload a CSV or Excel file, go here: [https://colab.research.google.com/drive/1VYkncZMfGFkeCEgN2IzbZIKEDkyQuJAS?usp=sharing](https://colab.research.google.com/drive/1VYkncZMfGFkeCEgN2IzbZIKEDkyQuJAS?usp=sharing)


In [24]:
from unsloth import to_sharegpt
dataset = to_sharegpt(
    dataset,
    merged_prompt = "{instruction}[[\nYour input is:\n{input}]]",
    output_column_name = "output",
    conversation_extension = 3, # Select more to handle longer conversations
)

For example, the very famous Titanic dataset has many columns. The task is to predict whether a passenger survived or died based on their age, passenger class, fare price etc. We can't simply pass the raw columns into ChatGPT; rather, we have to "merge" this information into 1 large prompt.

Now this is a bit more complicated, since we allow a lot of customization, but there are a few points:
* You must enclose all columns in curly braces `{}`. These are the column names in the actual CSV / Excel file.
* Optional text components must be enclosed in `[[]]`. For example, if the column "input" is empty, the merging function skips that text entirely. This is useful for datasets with missing values.
* Select the output or target / prediction column in `output_column_name`. For the Alpaca dataset, this will be `output`.
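
The rules above can be illustrated with a small pure-Python sketch. This is a simplified re-implementation written for this tutorial, not Unsloth's actual `to_sharegpt` code: `merge_prompt` is a hypothetical helper that fills `{column}` placeholders and drops any `[[...]]` section whose columns are empty or missing.

```python
import re

# Matches either an optional [[...]] section or a plain {column} placeholder.
TOKEN = re.compile(r"\[\[(.*?)\]\]|\{(\w+)\}", re.DOTALL)
COLUMN = re.compile(r"\{(\w+)\}")


def merge_prompt(template: str, row: dict) -> str:
    """Fill {column} placeholders; skip [[...]] sections with empty columns."""

    def fill(text: str) -> str:
        return COLUMN.sub(lambda m: str(row[m.group(1)]), text)

    def render(match: re.Match) -> str:
        optional_block, column = match.group(1), match.group(2)
        if column is not None:          # plain {column} outside [[...]]
            return str(row[column])
        columns = COLUMN.findall(optional_block)
        if any(row.get(name) in (None, "") for name in columns):
            return ""                   # drop the whole optional section
        return fill(optional_block)

    return TOKEN.sub(render, template)


template = "{instruction}[[\nYour input is:\n{input}]]"

with_input = merge_prompt(template, {"instruction": "Translate to French.", "input": "Hello"})
without_input = merge_prompt(template, {"instruction": "Tell a joke.", "input": ""})

print(repr(with_input))
print(repr(without_input))
```

The second row has an empty `input`, so the "Your input is:" text disappears instead of leaving a dangling label in the prompt.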

For example in the Titanic dataset, we can create a large merged prompt format like below, where each column / piece of text becomes optional.

In [None]:
# from unsloth import to_sharegpt
# dataset = to_sharegpt(
#     dataset,
#     # Adjacent string literals are concatenated into one merged prompt,
#     # so only the last piece is followed by a comma.
#     merged_prompt = \
#         "[[The passenger embarked from {Embarked}.]]"\
#         "[[\nThey are {Sex}.]]"\
#         "[[\nThey have {Parch} parents and children.]]"\
#         "[[\nThey have {SibSp} siblings and spouses.]]"\
#         "[[\nTheir passenger class is {Pclass}.]]"\
#         "[[\nTheir age is {Age}.]]"\
#         "[[\nThey paid ${Fare} for the trip.]]",
#     conversation_extension = 5, # Randomly combines conversations into 1! Good for long convos
#     output_column_name = "Survived",
# )

## 8. Multi turn conversations
We then use the `standardize_sharegpt` function to put the dataset into the correct format for finetuning! Always call this!

In [25]:
from unsloth import standardize_sharegpt
dataset = standardize_sharegpt(dataset)
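
As a rough picture of what standardizing means here, the sketch below converts ShareGPT-style turns (`from` / `value` keys) into the `role` / `content` layout that chat templates usually expect. The key names and the role mapping are assumptions about the two formats, and `standardize_conversation` is a hypothetical stand-in, not Unsloth's implementation.

```python
# Assumed mapping from ShareGPT speaker names to chat roles.
ROLE_MAP = {
    "system": "system",
    "human": "user",
    "user": "user",
    "gpt": "assistant",
    "assistant": "assistant",
}


def standardize_conversation(turns: list) -> list:
    """Convert [{'from': ..., 'value': ...}] into [{'role': ..., 'content': ...}]."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in turns
    ]


sharegpt_turns = [
    {"from": "human", "value": "Give three tips for staying healthy."},
    {"from": "gpt", "value": "Eat well, sleep enough and exercise regularly."},
]

standardized = standardize_conversation(sharegpt_turns)
for message in standardized:
    print(message)
```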

## 9. Customizable Chat Templates

In [28]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

In [29]:
alpaca_prompt

'Below is an instruction that describes a task, paired with an input that provides further context.\nWrite a response that appropriately completes the request.\n\n### Instruction:\n{}\n\n### Input:\n{}\n\n### Response:\n{}'

Or, for the Titanic prediction task, where you have to predict whether a passenger died or survived, see this Colab notebook, which includes CSV and Excel uploading: [https://colab.research.google.com/drive/1VYkncZMfGFkeCEgN2IzbZIKEDkyQuJAS?usp=sharing](https://colab.research.google.com/drive/1VYkncZMfGFkeCEgN2IzbZIKEDkyQuJAS?usp=sharing)

In [34]:
chat_template = """Below describes some details about some passengers who went on the Titanic.
Predict whether they survived or perished based on their characteristics.
Output 1 if they survived, and 0 if they died.
>>> Passenger Details:
{INPUT}
>>> Did they survive?
{OUTPUT}"""
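
To show how the `{INPUT}` and `{OUTPUT}` placeholders are meant to be used, here is a minimal sketch that fills the template with plain string replacement. It only illustrates the idea; it is not how Unsloth applies chat templates internally, and `render_chat` and the passenger text are made up for the example.

```python
chat_template = """Below describes some details about some passengers who went on the Titanic.
Predict whether they survived or perished based on their characteristics.
Output 1 if they survived, and 0 if they died.
>>> Passenger Details:
{INPUT}
>>> Did they survive?
{OUTPUT}"""


def render_chat(template: str, passenger_details: str, answer: str = "") -> str:
    """Fill the two placeholders; leave `answer` empty when prompting for a prediction."""
    return template.replace("{INPUT}", passenger_details).replace("{OUTPUT}", answer)


details = "They are male.\nTheir passenger class is 3.\nTheir age is 22."  # hypothetical row

training_text = render_chat(chat_template, details, answer="0")  # input and target together
inference_prompt = render_chat(chat_template, details)           # model completes the answer

print(training_text)
```

During training the filled-in answer is part of the text; at inference time the prompt stops right after the question so the model produces the 0 or 1 itself.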

In [37]:
del chat_template

## 10. Train the model
Let's train the model now! We normally suggest not editing the settings below, unless you want to finetune for more steps or train with larger batch sizes.

In [40]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 1,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps =  60,
        # num_train_epochs = 1, # Set this for 1 full training run.
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

Map:   0%|          | 0/52002 [00:00<?, ? examples/s]

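KeyError: 'text'

The `KeyError` most likely arises because `dataset_text_field = "text"` expects a `text` column, and the dataset no longer has one after the `to_sharegpt` and `standardize_sharegpt` steps above. The linked Unsloth tutorial applies a chat template to the dataset (via `apply_chat_template`) before training, which creates that column; the cells shown here skip that step.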

We do not normally suggest changing the parameters above, but to elaborate on some of them:

1. `per_device_train_batch_size = 2,`
Increase the batch size if you want to utilize more of your GPU's memory. A larger batch also makes training smoother and less prone to over-fitting. We normally do not suggest this, since it might actually make training slower due to padding issues. We normally instead ask you to increase `gradient_accumulation_steps`, which accumulates gradients over more batches before each optimizer update.

2. `gradient_accumulation_steps = 4,`
Equivalent to increasing the batch size above, but without the extra memory consumption! We normally suggest increasing this if you want smoother training loss curves.

3. `max_steps = 60, # num_train_epochs = 1,`
We set steps to 60 for faster training. For full training runs, which can take hours, comment out `max_steps` and replace it with `num_train_epochs = 1`. Setting it to 1 means 1 full pass over your dataset. We normally suggest 1 to 3 passes and no more, otherwise you will over-fit your finetune.

4. `learning_rate = 2e-4,`
Reduce the learning rate if you want the finetuning process to proceed more slowly but most likely converge to a higher-accuracy result. We normally suggest 2e-4, 1e-4, 5e-5, 2e-5 as numbers to try.
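
The arithmetic behind these settings can be sketched as follows. The values are copied from the `TrainingArguments` above and the row count from the dataset listing; the single-GPU assumption and the helper `linear_lr` (a linear warmup followed by linear decay to zero, matching `lr_scheduler_type = "linear"` with `warmup_steps = 5`) are illustrative, and the Trainer's exact rounding may differ slightly.

```python
import math

per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_devices = 1        # assumption: a single GPU
max_steps = 60
warmup_steps = 5
learning_rate = 2e-4
num_rows = 52002       # rows in the Alpaca dataset loaded earlier

# Examples contributing to each optimizer update.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# How much data the short 60-step run actually covers.
examples_seen = effective_batch * max_steps
fraction_of_dataset = examples_seen / num_rows

# Approximate optimizer steps needed for num_train_epochs = 1.
steps_per_epoch = math.ceil(num_rows / effective_batch)


def linear_lr(step: int) -> float:
    """Linear warmup to the peak rate, then linear decay to zero at max_steps."""
    if step < warmup_steps:
        return learning_rate * step / warmup_steps
    return learning_rate * max(0.0, (max_steps - step) / (max_steps - warmup_steps))


print(f"Effective batch size: {effective_batch}")
print(f"Examples seen in {max_steps} steps: {examples_seen} ({fraction_of_dataset:.2%} of the dataset)")
print(f"Approximate steps for one epoch: {steps_per_epoch}")
for step in (0, 5, 30, 60):
    print(f"step {step:>2}: lr = {linear_lr(step):.2e}")
```

The 60-step run touches only a small slice of the data, which is why switching to `num_train_epochs = 1` turns a run of minutes into one that can take hours.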

In [13]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA GeForce RTX 4070. Max memory = 11.994 GB.
8.143 GB of memory reserved.


In [14]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 100,000 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 2
\        /    Total batch size = 2 | Total steps = 60
 "-____-"     Number of trainable parameters = 40,370,176


Step,Training Loss
1,3.1512
2,3.0508
3,3.2737
4,2.7211
5,2.7546
6,2.6025
7,2.5288
8,1.6785
9,1.7801
10,1.6831


In [15]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

68.7814 seconds used for training.
1.15 minutes used for training.
Peak reserved memory = 8.764 GB.
Peak reserved memory for training = 0.621 GB.
Peak reserved memory % of max memory = 73.07 %.
Peak reserved memory for training % of max memory = 5.178 %.


In [16]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

In [17]:
import torch
torch.cuda.empty_cache()


In [None]:
# model.save_pretrained_gguf("model", tokenizer, quantization_method = "f32")

Follow these steps for local saving:

In [19]:
model.save_pretrained_merged("merged_model", tokenizer, save_method = "merged_16bit",)

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 28.16 out of 49.39 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


  4%|█████▊                                                                                                                                                           | 1/28 [00:00<00:09,  2.71it/s]
We will save to Disk and not RAM now.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:17<00:00,  1.62it/s]


Unsloth: Saving tokenizer... Done.
Done.


In [20]:
!git clone --recursive https://github.com/ggerganov/llama.cpp

fatal: destination path 'llama.cpp' already exists and is not an empty directory.


In [21]:
!make clean -C llama.cpp

make: Entering directory '/home/ryan/Documents/workspaces/workspace_ai/FineTune/llama.cpp'
Makefile:2: *** The Makefile build is deprecated. Use the CMake build instead. For more details, see https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md.  Stop.
make: Leaving directory '/home/ryan/Documents/workspaces/workspace_ai/FineTune/llama.cpp'


In [22]:
!make all -j -C llama.cpp

make: Entering directory '/home/ryan/Documents/workspaces/workspace_ai/FineTune/llama.cpp'
Makefile:2: *** The Makefile build is deprecated. Use the CMake build instead. For more details, see https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md.  Stop.
make: Leaving directory '/home/ryan/Documents/workspaces/workspace_ai/FineTune/llama.cpp'


In [23]:
!pip install gguf protobuf



In [24]:
!python llama.cpp/convert_hf_to_gguf.py merged_model --outfile text2sql_8.gguf --outtype q8_0

INFO:hf-to-gguf:Loading model: merged_model
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00004.safetensors'
INFO:hf-to-gguf:token_embd.weight,         torch.bfloat16 --> Q8_0, shape = {3584, 152064}
INFO:hf-to-gguf:blk.0.attn_norm.weight,    torch.bfloat16 --> F32, shape = {3584}
INFO:hf-to-gguf:blk.0.ffn_down.weight,     torch.bfloat16 --> Q8_0, shape = {18944, 3584}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,     torch.bfloat16 --> Q8_0, shape = {3584, 18944}
INFO:hf-to-gguf:blk.0.ffn_up.weight,       torch.bfloat16 --> Q8_0, shape = {3584, 18944}
INFO:hf-to-gguf:blk.0.ffn_norm.weight,     torch.bfloat16 --> F32, shape = {3584}
INFO:hf-to-gguf:blk.0.attn_k.bias,         torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.0.attn_k.weight,       torch.bfloat16 --> Q8_0, shape = {3584, 512}
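
To give a feel for what `--outtype q8_0` does to the weight tensors in the log above, here is a simplified sketch of 8-bit block quantization in the style of llama.cpp's Q8_0 format: weights are grouped into blocks of 32, and each block stores one scale plus 32 signed 8-bit integers. This is an illustration in plain Python, not the converter's code; in particular it keeps the scale as a Python float, whereas the real format stores it as a 16-bit float.

```python
BLOCK_SIZE = 32  # weights per Q8_0 block


def quantize_q8_0(block: list) -> tuple:
    """Return (scale, quants) with quants in [-127, 127]."""
    assert len(block) == BLOCK_SIZE
    amax = max(abs(x) for x in block)
    scale = amax / 127 if amax else 0.0
    quants = [round(x / scale) if scale else 0 for x in block]
    return scale, quants


def dequantize_q8_0(scale: float, quants: list) -> list:
    """Recover approximate weights from the scale and the 8-bit values."""
    return [scale * q for q in quants]


# 32 int8 values plus one 16-bit scale per block of 32 weights.
BITS_PER_WEIGHT = (BLOCK_SIZE * 8 + 16) / BLOCK_SIZE  # 8.5

weights = [(i - 16) / 10 for i in range(BLOCK_SIZE)]  # toy block of weights
scale, quants = quantize_q8_0(weights)
restored = dequantize_q8_0(scale, quants)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(f"scale = {scale:.6f}, max round-trip error = {max_error:.6f}")
print(f"storage cost: {BITS_PER_WEIGHT} bits per weight")
```

At about 8.5 bits per weight, the quantized tensors take roughly half the space of the 16-bit merged model, while the norm weights stay in F32 as the log shows.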

Install Ollama using the terminal:

In [None]:
!curl -fsSL https://ollama.com/install.sh | sh

>>> Installing ollama to /usr/local
[sudo] password for ryan: 

In [None]:
!ollama create unsloth_model_8 -f Modelfile
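
The `ollama create` command reads a `Modelfile`, which this notebook does not show. As a minimal sketch, the snippet below builds one that points at the GGUF file produced earlier (`text2sql_8.gguf`). The `FROM` and `PARAMETER` instructions are standard Modelfile syntax, but the temperature value and the `build_modelfile` helper are assumptions for illustration; a real Modelfile would usually also define a `TEMPLATE` matching the prompt format used in training.

```python
def build_modelfile(gguf_path: str, parameters: dict = None) -> str:
    """Build minimal Modelfile text: a FROM line plus optional PARAMETER lines."""
    lines = [f"FROM {gguf_path}"]
    for name, value in (parameters or {}).items():
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"


modelfile_text = build_modelfile(
    "./text2sql_8.gguf",              # GGUF file written by convert_hf_to_gguf.py above
    parameters={"temperature": 0.2},  # assumption: a low temperature for SQL generation
)

print(modelfile_text)
```

Saving it takes one more line, for example `open("Modelfile", "w").write(modelfile_text)`, after which the `ollama create` command above can pick it up.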

In [26]:
alpaca_prompt1= """Below is an instruction that give an sql prompt. Write a response that appropriately completes the request and gives you an sql and the corresponding explanation.

### sql_prompt:
{}

### sql:
{}

### Explanation:
{}"""

In [27]:
input = alpaca_prompt1.format(
    "What is the total volume of timber sold by each salesperson, sorted by salesperson?",
    "",
    "",)

In [28]:
input

'Below is an instruction that give an sql prompt. Write a response that appropriately completes the request and gives you an sql and the corresponding explanation.\n\n### sql_prompt:\nWhat is the total volume of timber sold by each salesperson, sorted by salesperson?\n\n### sql:\n\n\n### Explanation:\n'

In [29]:
query = "What is the average property size in inclusive housing areas?"

In [30]:
input = alpaca_prompt1.format(
    query,
    "",
    "",)

In [31]:
input

'Below is an instruction that give an sql prompt. Write a response that appropriately completes the request and gives you an sql and the corresponding explanation.\n\n### sql_prompt:\nWhat is the average property size in inclusive housing areas?\n\n### sql:\n\n\n### Explanation:\n'
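
Once the model has been created in Ollama, a prompt like the one above can be sent to it. The sketch below uses only the standard library and Ollama's local REST endpoint (`/api/generate` on port 11434, the default). The model name `unsloth_model_8` comes from the `ollama create` command earlier; the helper names and the shortened prompt are assumptions, and `generate` only works while the Ollama server is running.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Prepare a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> str:
    """Send the request and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(model, prompt)) as response:
        return json.loads(response.read())["response"]


# In the notebook you would pass the `input` string built above instead.
prompt = "### sql_prompt:\nWhat is the average property size in inclusive housing areas?\n\n### sql:\n"
request = build_request("unsloth_model_8", prompt)

print(request.get_method(), request.full_url)
# answer = generate("unsloth_model_8", prompt)  # uncomment once Ollama is running
```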