To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions in our docs [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to [prepare data](#Data), [train](#Train), [run the model](#Inference), and [save it](#Save).


### News

Unsloth now supports Text-to-Speech (TTS) models. Read our [guide here](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning).

Read our **[Gemma 3N Guide](https://docs.unsloth.ai/basics/gemma-3n-how-to-run-and-fine-tune)** and check out our new **[Dynamic 2.0](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs)** quants, which outperform other quantization methods!

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
    !pip install --no-deps unsloth

In [None]:
!pip install -qU llama-index llama-index-packs-raft-dataset

### Unsloth

### Retrieval Augmented Finetuning (RAFT) Cookbook Recipe!
This cookbook shows how to use Unsloth for retrieval augmented finetuning (RAFT). Supervised finetuning is like a closed-book examination: we encode knowledge from the training dataset into the LLM during finetuning, and then test it on unseen examples in the "exam".

RAFT differs from this in that it is an open-book exam format of finetuning! We allow the LLM to see not just the question and answer (in chain-of-thought format), but also the contexts. The hope is that the LLM will be able to acquire the domain knowledge, but also an improved ability to synthesize answers from context.
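To make the open-book format concrete, here is a minimal sketch of how one RAFT-style training prompt could be assembled: the oracle chunk is mixed with distractor chunks, each wrapped in `<DOCUMENT>` tags, and the question is appended at the end. The helper name `build_raft_instruction` and its parameters are hypothetical and are not the `RAFTDatasetPack` implementation used later. The `include_oracle=False` case mimics the distractor-only examples that the RAFT paper also describes.

```python
import random

def build_raft_instruction(question, oracle_chunk, distractor_chunks,
                           include_oracle=True, seed=None):
    """Assemble one RAFT-style prompt: shuffled documents, then the question.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    docs = list(distractor_chunks)
    if include_oracle:
        docs.append(oracle_chunk)
    rng.shuffle(docs)  # hide the oracle's position among the distractors
    wrapped = "\n".join(f"<DOCUMENT>{doc}</DOCUMENT>" for doc in docs)
    return f"{wrapped}\n{question}"

oracle = "We find that the intervention leads to an increase of 17 percentage points for credential sharing."
distractors = [
    "13 p.p.) and 36 p.p. (S.E.",
    "We estimate positive and statistically significant effects.",
]
question = "What percentage increase in credential sharing was observed after the intervention?"

with_oracle = build_raft_instruction(question, oracle, distractors, seed=0)
without_oracle = build_raft_instruction(question, oracle, distractors,
                                        include_oracle=False, seed=0)
print(with_oracle)
```

Shuffling matters here: if the oracle always appeared first, the model could learn its position instead of learning to read the documents.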

> Reference: [RAFT: Adapting Language Model to Domain Specific RAG](https://arxiv.org/abs/2403.10131)

### Code Setup 

First, let's set up the `OPENAI_API_KEY` so that we can use the OpenAI LLMs.

In [None]:
import os

os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
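If you would rather not paste the key into the notebook, one option is to read it from the environment and only prompt when it is missing. This is a minimal sketch; the helper `resolve_api_key` is a hypothetical name, and the demo uses a throwaway dictionary instead of the real environment.

```python
import os
from getpass import getpass

def resolve_api_key(env=os.environ, prompt=getpass, name="OPENAI_API_KEY"):
    """Return the key from `env`, prompting (without echo) only if it is missing."""
    key = env.get(name)
    if not key:
        key = prompt(f"Enter {name}: ")
        env[name] = key  # cache it so later cells and libraries can see it
    return key

# Demo with a stand-in mapping, so nothing is prompted and no real key is touched.
demo_env = {"OPENAI_API_KEY": "sk-demo"}
demo_key = resolve_api_key(env=demo_env)
```

In the notebook you would simply call `resolve_api_key()` once before configuring LlamaIndex.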

Next, we'll set up LlamaIndex. This involves configuring the language model (LLM) and embedding model that LlamaIndex will use. We'll be using OpenAI's `gpt-4o` as our LLM and `text-embedding-ada-002` as our embedding model.

In [None]:
from llama_index.core import (
    Settings,
    SimpleDirectoryReader,
)
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.llm = OpenAI(model="gpt-4o")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")

### Ingest documents 

We'll use the following code to download a research paper and then load it using `SimpleDirectoryReader`. This will be the data we use for our retrieval augmented finetuning.

In [None]:
!mkdir  -p ../data
!wget "https://arxiv.org/pdf/2405.00247.pdf" -O "../data/non_traditional_credentials.pdf"

docs = SimpleDirectoryReader("../data/").load_data(show_progress=True)

Loading files:   0%|          | 0/1 [00:00<?, ?file/s]

Loading files: 100%|██████████| 1/1 [00:00<00:00,  1.19file/s]


### Retrieval Augmented Finetuning

### Getting the RAFT dataset
LlamaIndex has very kindly adapted the source code of the RAFT repository and made it even easier to generate your own RAFT dataset. Just point it to your file path.
> Reference: [RAFTDatasetPack](https://github.com/run-llama/llama_index/blob/main/llama-index-packs/llama-index-packs-raft-dataset/examples/raft_dataset.ipynb)

In [14]:
from llama_index.packs.raft_dataset import RAFTDatasetPack

raft_dataset = RAFTDatasetPack(
    file_path = "../data/non_traditional_credentials.pdf",
    llm = Settings.llm,
    embed_model=Settings.embed_model
)

This cell takes quite a while to run! Go have a coffee ☕
> It took 19 minutes for the cell to finish running

In [15]:
dataset = raft_dataset.run()

Let's have a look!

In [18]:
import pandas as pd
df = pd.DataFrame(dataset)
df.head()

| | id | type | question | context | oracle_context | cot_answer | instruction |
|---|---|---|---|---|---|---|---|
| 0 | seed_task_0 | general | What percentage increase in credential sharing... | `{'sentences': [['The value of non-traditional ...` | The value of non-traditional credentials in th... | assistant: To determine the percentage increas... | `<DOCUMENT>The value of non-traditional credent...` |
| 1 | seed_task_1 | general | How much more likely were learners in the trea... | `{'sentences': [['The control group did not rec...` | The value of non-traditional credentials in th... | assistant: To answer the question "How much mo... | `<DOCUMENT>The control group did not receive th...` |
| 2 | seed_task_2 | general | What was the increase in jobs related to the c... | `{'sentences': [['The value of non-traditional ...` | The value of non-traditional credentials in th... | assistant: To determine the increase in jobs r... | `<DOCUMENT>The value of non-traditional credent...` |
| 3 | seed_task_3 | general | Which group of LinkedIn users showed a more pr... | `{'sentences': [['The value of non-traditional ...` | The value of non-traditional credentials in th... | assistant: To determine which group of LinkedI... | `<DOCUMENT>The value of non-traditional credent...` |
| 4 | seed_task_4 | general | What platform were the courses completed on fo... | `{'sentences': [['Analogously, Past Managerial ...` | The value of non-traditional credentials in th... | assistant: To answer the question "What platfo... | `<DOCUMENT>Analogously, Past\nManagerial Job fo...` |


In [24]:
from IPython.display import display, Markdown

display(Markdown(df.iloc[0]['instruction']))

<DOCUMENT>The value of non-traditional credentials in the labor market*
Susan Athey & Emil Palikot
May 2, 2024
Abstract
This study investigates the labor market value of credentials obtained from Massive Open On-
line Courses (MOOCs) and shared on business networking platforms. We conducted a random-
ized experiment involving more than 800,000 learners, primarily from developing countries and
without college degrees, who completed technology or business-related courses on the Coursera
platform between September 2022 and March 2023. The intervention targeted learners who had
recently completed their courses, encouraging them to share their credentials and simplifying the
sharing process. One year after the intervention, we collected data from LinkedIn profiles of ap-
proximately 40,000 experimental subjects. We find that the intervention leads to an increase of 17
percentage points for credential sharing. Further, learners in the treatment group were 6% more
likely to report new employment within a year, with an 8% increase in jobs related to their certifi-
cates. This effect was more pronounced among LinkedIn users with lower baseline employability.
Across the entire sample, the treated group received a higher number of certificate views, indicat-
ing an increased interest in their profiles. These results suggest that facilitating credential sharing
and reminding learners of the value of skill signaling can yield significant gains. When the ex-
periment is viewed as an encouragement design for credential sharing, we can estimate the local
average treatment effect (LATE) of credential sharing (that is, the impact of credential sharing on
the workers induced to share by the intervention) for the outcome of getting a job. The LATE esti-
mates are imprecise but large in magnitude; they suggest that credential sharing more than doubles
the baseline probability of getting a new job in scope for the credential.
*We thank Eric Karsten and his team in Coursera for collaborating on this project. </DOCUMENT>
<DOCUMENT>13 p.p.) and 36 p.p. (S.E. </DOCUMENT>
<DOCUMENT>), which corresponds to a
17% increase from baseline. The remaining columns present estimates from the instrumental variable
regression with New Job and New Job in Scope as outcomes. In Columns 6, 7, and 8, we restrict attention
to jobs reported with a starting date at least four months after treatment. We estimate positive and
statistically significant effects. Specifically, we estimate the local average treatment effect of 0.24 (S.E.
0.13) for any new job starting at least one month after treatment and 0.36 (S.E. 0.12) when restricting
14</DOCUMENT>
What percentage increase in credential sharing was observed after the intervention?

In [27]:
display(Markdown(df.iloc[0]['oracle_context']))

The value of non-traditional credentials in the labor market*
Susan Athey & Emil Palikot
May 2, 2024
Abstract
This study investigates the labor market value of credentials obtained from Massive Open On-
line Courses (MOOCs) and shared on business networking platforms. We conducted a random-
ized experiment involving more than 800,000 learners, primarily from developing countries and
without college degrees, who completed technology or business-related courses on the Coursera
platform between September 2022 and March 2023. The intervention targeted learners who had
recently completed their courses, encouraging them to share their credentials and simplifying the
sharing process. One year after the intervention, we collected data from LinkedIn profiles of ap-
proximately 40,000 experimental subjects. We find that the intervention leads to an increase of 17
percentage points for credential sharing. Further, learners in the treatment group were 6% more
likely to report new employment within a year, with an 8% increase in jobs related to their certifi-
cates. This effect was more pronounced among LinkedIn users with lower baseline employability.
Across the entire sample, the treated group received a higher number of certificate views, indicat-
ing an increased interest in their profiles. These results suggest that facilitating credential sharing
and reminding learners of the value of skill signaling can yield significant gains. When the ex-
periment is viewed as an encouragement design for credential sharing, we can estimate the local
average treatment effect (LATE) of credential sharing (that is, the impact of credential sharing on
the workers induced to share by the intervention) for the outcome of getting a job. The LATE esti-
mates are imprecise but large in magnitude; they suggest that credential sharing more than doubles
the baseline probability of getting a new job in scope for the credential.
*We thank Eric Karsten and his team in Coursera for collaborating on this project. 

In [16]:
# Save as .jsonl format
dataset.to_json("raft_train.jsonl")

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

2966201
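To check what was written, the file can be read back line by line with the standard library. This sketch assumes the export uses the JSON Lines layout (one JSON object per line), as the `.jsonl` name suggests. The helper `read_jsonl` is a hypothetical name, and the demo runs on a throwaway file rather than on `raft_train.jsonl`.

```python
import json
import os
import tempfile

def read_jsonl(path, limit=None):
    """Read a JSON Lines file into a list of dicts, optionally stopping early."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            records.append(json.loads(line))
            if limit is not None and len(records) >= limit:
                break
    return records

# Round-trip demo on a temporary file standing in for raft_train.jsonl.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"id": "seed_task_0", "question": "Q0"}) + "\n")
        f.write(json.dumps({"id": "seed_task_1", "question": "Q1"}) + "\n")
    first = read_jsonl(demo_path, limit=1)
print(first)
```

On the real file, `read_jsonl("raft_train.jsonl", limit=2)` would return the first two training records.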

### Training the LLM
Our dataset is a Hugging Face `Dataset` object, so we can use its built-in `train_test_split` method to create a train-test split.

In [19]:
splits = dataset.train_test_split(test_size=0.1)
train_ds = splits["train"]
eval_ds  = splits["test"]

In [20]:
train_ds, eval_ds

(Dataset({
     features: ['id', 'type', 'question', 'context', 'oracle_context', 'cot_answer', 'instruction'],
     num_rows: 301
 }),
 Dataset({
     features: ['id', 'type', 'question', 'context', 'oracle_context', 'cot_answer', 'instruction'],
     num_rows: 34
 }))
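The row counts above (301 train, 34 eval) add up to 335 examples, and they are consistent with rounding the 10% test share up. The sketch below reproduces that arithmetic. The rounding rule is an assumption inferred from the output, not taken from the `datasets` source, and `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_rows, test_size):
    """Row counts for a fractional test split, rounding the test side up."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test

n_train, n_test = split_sizes(335, 0.1)
print(n_train, n_test)  # 301 34
```

Because the split is random, passing a fixed `seed` to `train_test_split` is worth considering if you want the same rows in each split across runs.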

### Now let's get the model!

In [None]:
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-1B-Instruct",
    max_seq_length = 2048, # Choose any for long context!
    load_in_4bit = True,  # 4 bit quantization to reduce memory
    load_in_8bit = False, 
    full_finetuning = False, 
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 05-21 06:09:36 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-21 06:09:36 [__init__.py:239] Automatically detected platform cuda.
==((====))==  Unsloth 2025.4.7: Fast Llama patching. Transformers: 4.51.3. vLLM: 0.8.5.post1.
   \\   /|    NVIDIA A10G. Num GPUs = 1. Max memory: 22.184 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [22]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 2025,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.4.7 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.
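As a sanity check on this configuration, the number of trainable LoRA parameters can be worked out by hand: each adapted weight matrix of shape `(d_in, d_out)` gains two low-rank factors with `r * (d_in + d_out)` parameters in total. The layer dimensions below are assumptions taken from the public Llama 3.2 1B model config (hidden size 2048, MLP size 8192, key/value projection width 512, 16 layers); they are not printed anywhere in this notebook.

```python
def lora_param_count(shapes, rank, num_layers):
    """Each adapted (d_in, d_out) matrix adds A (rank x d_in) and B (d_out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

# Assumed Llama 3.2 1B dimensions (from the public model config).
hidden, intermediate, kv_dim, layers = 2048, 8192, 512, 16

shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

total = lora_param_count(shapes, rank=16, num_layers=layers)
print(f"{total:,}")  # 11,272,192
```

Doubling `r` doubles this count, which is a quick way to estimate the cost of the other suggested ranks.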


## Formatting the prompts
We need to put everything together into a single 'text' field for the LLM to be trained on. According to the [RAFT paper](https://arxiv.org/abs/2403.10131), we add the context along with the question and chain-of-thought answer in a bid to help our LLM learn how to use the context to answer the question. Let's do that!

In [25]:
def formatting_prompts_func(examples):
    """Define a formatter that injects the retrieved context:"""
    
    texts = []
    for qn, ctx, oracle, instr, ans in zip(
        examples['question'],
        examples["context"],
        examples["oracle_context"],
        examples["instruction"],
        examples["cot_answer"]
    ):
        # You can choose to use `oracle_context` (gold) vs. `context` (retrieved)
        # Here we show both, but you could just use `context`.
        prompt = (
            "### Question:\n"
            f"{qn}\n\n"
            "### Context:\n"
            f"{ctx}\n\n"
            "### (Oracle Passages):\n"
            f"{oracle}\n\n"
            "### Instruction:\n"
            f"{instr}\n\n"
            "### Answer:\n"
        )
        # Append the gold answer plus EOS
        texts.append(prompt + ans + tokenizer.eos_token)
    return {"text": texts}

# then:
train_ds = train_ds.map(formatting_prompts_func, batched=True)
eval_ds = eval_ds.map(formatting_prompts_func, batched=True)

Map:   0%|          | 0/301 [00:00<?, ? examples/s]

Map:   0%|          | 0/34 [00:00<?, ? examples/s]
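With `batched=True`, `map` passes the formatter a dictionary of column lists and expects a dictionary of equally long lists back. The toy function below illustrates only that contract. It is a leaner variant that uses just `instruction` and `cot_answer`, not the notebook's formatter, and `EOS` is a placeholder for `tokenizer.eos_token`.

```python
EOS = "<eos>"  # placeholder; the notebook appends tokenizer.eos_token instead

def format_batch(examples):
    """Batched map function: dict of column lists in, dict of new column lists out."""
    texts = [
        f"### Instruction:\n{instr}\n\n### Answer:\n{ans}{EOS}"
        for instr, ans in zip(examples["instruction"], examples["cot_answer"])
    ]
    return {"text": texts}

# A hand-built batch with the same column layout as the RAFT dataset.
batch = {
    "instruction": [
        "<DOCUMENT>doc A</DOCUMENT>\nQuestion 1?",
        "<DOCUMENT>doc B</DOCUMENT>\nQuestion 2?",
    ],
    "cot_answer": ["<ANSWER>: one", "<ANSWER>: two"],
}
out = format_batch(batch)
print(out["text"][0])
```

Because the `instruction` column already embeds the documents and the question, a variant like this produces shorter sequences, at the cost of dropping the separate oracle section that the notebook's formatter keeps.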

Let's take a look at what we just did!

In [26]:
from IPython.display import display, Markdown

display(Markdown(pd.DataFrame(train_ds).head()['text'].iloc[0]))

### Question:
What is the mean value for the 'Data Science' variable in the LinkedIn matched sample?

### Context:
{'sentences': [['Table 1: Summary statistics pretreatment and outcome variables\nCoursera Internal Data LinkedIn Matched Sample\nVariable name Mean S.E. Mean S.E.\nTreatment 0.499 0.001 0.500 0.003\nPanel A: Pre-treatment covariates\nProfessional Experience Years – – 3.040 0.028\nPast Tech Job – – 0.127 0.002\nPast Managerial Job – – 0.064 0.001\nMain Skill Absolute 0.099 0.001 2.074 0.010\nMain Skill Standardized 0.000 <0.001 0.000 0.001\nComputer Science 0.252 0.001 0.230 0.002\nData Science 0.236 0.001 0.300 0.002\nInformation Technology 0.140 0.001 0.138 0.002\nGuided Project 0.168 0.001 0.097 0.002\nProfessional Certificate 0.005 <0.001 0.005 <0.001\nSpecialization 0.009 <0.001 0.009 0.001\nDeveloping Country 0.896 0.001 0.850 0.002\nAssociate Degree 0.029 <0.001 0.062 0.001\nBachelor Degree 0.127 0.001 0.367 0.003\nSome College 0.072 0.001 0.130 0.002\nDoctorate Degree 0.004 <0.001 0.012 0.001\nHigh School Diploma 0.059 0.001 0.097 0.002\nLess than High School 0.009 <0.001 0.012 0.001\nMasters Degree 0.050 0.001 0.146 0.002\nNo Education Mentioned 0.645 0.002 0.164 0.002\nProfessional Degree 0.004 <0.001 0.010 0.001\nMale 0.302 0.002 0.674 0.002\nGender Not Mentioned 0.533 0.002 0.101 0.002\nPanel B: Outcome variables\nNew Job – – 0.177 0.002\nNew Job in Scope – – 0.133 0.002\nCredential Shared – – 0.181 0.002\nAll Views 0.191 0.001 0.429 0.003\nAll Views by Others 0.143 0.001 0.318 0.002\nViews LinkedIn 0.165 0.001 0.409 0.003\nViews LinkedIn by Others 0.124 0.001 0.296 0.002\nNote: Professional Experience Years is the number of years between the starting date of the first job and August 2023. Past Tech Job\ntakes the value of 1 when the learner had a job title related to technology before randomization and zero otherwise. ', 'effects between the bottom and top tertiles, the difference is 0.1 p.p. (S.E. ', 'For each learner, Coursera assesses skill mastery and assigns a score (Red-\ndick, 2019). 
Additionally, we compute a max-mean standardization of the learners’ skill level. We also\nobserve the country where the learner registered for the course. Following the OECD classification,\nwe use this information to group countries into developing and developed. Finally, we also observe\nthe information provided by the learners in their registration survey. ']], 'title': [['placeholder_title', 'placeholder_title', 'placeholder_title']]}

### (Oracle Passages):
Table 1: Summary statistics pretreatment and outcome variables
Coursera Internal Data LinkedIn Matched Sample
Variable name Mean S.E. Mean S.E.
Treatment 0.499 0.001 0.500 0.003
Panel A: Pre-treatment covariates
Professional Experience Years – – 3.040 0.028
Past Tech Job – – 0.127 0.002
Past Managerial Job – – 0.064 0.001
Main Skill Absolute 0.099 0.001 2.074 0.010
Main Skill Standardized 0.000 <0.001 0.000 0.001
Computer Science 0.252 0.001 0.230 0.002
Data Science 0.236 0.001 0.300 0.002
Information Technology 0.140 0.001 0.138 0.002
Guided Project 0.168 0.001 0.097 0.002
Professional Certificate 0.005 <0.001 0.005 <0.001
Specialization 0.009 <0.001 0.009 0.001
Developing Country 0.896 0.001 0.850 0.002
Associate Degree 0.029 <0.001 0.062 0.001
Bachelor Degree 0.127 0.001 0.367 0.003
Some College 0.072 0.001 0.130 0.002
Doctorate Degree 0.004 <0.001 0.012 0.001
High School Diploma 0.059 0.001 0.097 0.002
Less than High School 0.009 <0.001 0.012 0.001
Masters Degree 0.050 0.001 0.146 0.002
No Education Mentioned 0.645 0.002 0.164 0.002
Professional Degree 0.004 <0.001 0.010 0.001
Male 0.302 0.002 0.674 0.002
Gender Not Mentioned 0.533 0.002 0.101 0.002
Panel B: Outcome variables
New Job – – 0.177 0.002
New Job in Scope – – 0.133 0.002
Credential Shared – – 0.181 0.002
All Views 0.191 0.001 0.429 0.003
All Views by Others 0.143 0.001 0.318 0.002
Views LinkedIn 0.165 0.001 0.409 0.003
Views LinkedIn by Others 0.124 0.001 0.296 0.002
Note: Professional Experience Years is the number of years between the starting date of the first job and August 2023. Past Tech Job
takes the value of 1 when the learner had a job title related to technology before randomization and zero otherwise. 

### Instruction:
<DOCUMENT>Table 1: Summary statistics pretreatment and outcome variables
Coursera Internal Data LinkedIn Matched Sample
Variable name Mean S.E. Mean S.E.
Treatment 0.499 0.001 0.500 0.003
Panel A: Pre-treatment covariates
Professional Experience Years – – 3.040 0.028
Past Tech Job – – 0.127 0.002
Past Managerial Job – – 0.064 0.001
Main Skill Absolute 0.099 0.001 2.074 0.010
Main Skill Standardized 0.000 <0.001 0.000 0.001
Computer Science 0.252 0.001 0.230 0.002
Data Science 0.236 0.001 0.300 0.002
Information Technology 0.140 0.001 0.138 0.002
Guided Project 0.168 0.001 0.097 0.002
Professional Certificate 0.005 <0.001 0.005 <0.001
Specialization 0.009 <0.001 0.009 0.001
Developing Country 0.896 0.001 0.850 0.002
Associate Degree 0.029 <0.001 0.062 0.001
Bachelor Degree 0.127 0.001 0.367 0.003
Some College 0.072 0.001 0.130 0.002
Doctorate Degree 0.004 <0.001 0.012 0.001
High School Diploma 0.059 0.001 0.097 0.002
Less than High School 0.009 <0.001 0.012 0.001
Masters Degree 0.050 0.001 0.146 0.002
No Education Mentioned 0.645 0.002 0.164 0.002
Professional Degree 0.004 <0.001 0.010 0.001
Male 0.302 0.002 0.674 0.002
Gender Not Mentioned 0.533 0.002 0.101 0.002
Panel B: Outcome variables
New Job – – 0.177 0.002
New Job in Scope – – 0.133 0.002
Credential Shared – – 0.181 0.002
All Views 0.191 0.001 0.429 0.003
All Views by Others 0.143 0.001 0.318 0.002
Views LinkedIn 0.165 0.001 0.409 0.003
Views LinkedIn by Others 0.124 0.001 0.296 0.002
Note: Professional Experience Years is the number of years between the starting date of the first job and August 2023. Past Tech Job
takes the value of 1 when the learner had a job title related to technology before randomization and zero otherwise. </DOCUMENT>
<DOCUMENT>effects between the bottom and top tertiles, the difference is 0.1 p.p. (S.E. </DOCUMENT>
<DOCUMENT>For each learner, Coursera assesses skill mastery and assigns a score (Red-
dick, 2019). Additionally, we compute a max-mean standardization of the learners’ skill level. We also
observe the country where the learner registered for the course. Following the OECD classification,
we use this information to group countries into developing and developed. Finally, we also observe
the information provided by the learners in their registration survey. </DOCUMENT>
What is the mean value for the 'Data Science' variable in the LinkedIn matched sample?

### Answer:
assistant: To answer the question about the mean value for the 'Data Science' variable in the LinkedIn matched sample, we need to follow these steps:

1. **Identify the Relevant Section**: The context provides a table with summary statistics for various variables. We need to locate the section that pertains to the LinkedIn matched sample.

2. **Locate the 'Data Science' Variable**: Within the table, find the row labeled 'Data Science'. This row will contain the mean value for the 'Data Science' variable in both the Coursera Internal Data and the LinkedIn Matched Sample.

3. **Extract the Mean Value for LinkedIn Matched Sample**: Once the 'Data Science' row is located, extract the mean value specifically for the LinkedIn matched sample.

Let's proceed with these steps:

- The context provides a table with two columns of mean values: one for Coursera Internal Data and another for LinkedIn Matched Sample.

- ##begin_quote## Data Science 0.236 0.001 0.300 0.002 ##end_quote##: This line from the context shows the mean values for the 'Data Science' variable. The first mean value (0.236) corresponds to the Coursera Internal Data, and the second mean value (0.300) corresponds to the LinkedIn Matched Sample.

Therefore, the mean value for the 'Data Science' variable in the LinkedIn matched sample is 0.300.

<ANSWER>: 0.300<|eot_id|>
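The chain-of-thought answers shown above follow a regular layout: supporting evidence sits between `##begin_quote##` and `##end_quote##`, and the final answer follows an `<ANSWER>:` tag. A small parser like the sketch below can pull those pieces out, for example to compare model outputs against gold answers later. The helper `parse_cot_answer` is a hypothetical name, and the end-of-sequence string is assumed to be `<|eot_id|>` as in the sample.

```python
import re

QUOTE_RE = re.compile(r"##begin_quote##\s*(.*?)\s*##end_quote##", re.DOTALL)

def parse_cot_answer(text, eos="<|eot_id|>"):
    """Return (final_answer, quotes) from a RAFT chain-of-thought answer."""
    quotes = QUOTE_RE.findall(text)
    if "<ANSWER>:" not in text:
        return None, quotes
    final = text.rsplit("<ANSWER>:", 1)[1].strip()
    if final.endswith(eos):
        final = final[: -len(eos)].strip()  # drop the end-of-sequence marker
    return final, quotes

sample = (
    "- ##begin_quote## Data Science 0.236 0.001 0.300 0.002 ##end_quote##: "
    "This line from the context shows the mean values.\n\n"
    "<ANSWER>: 0.300<|eot_id|>"
)
final, quotes = parse_cot_answer(sample)
print(final, quotes)
```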

### And now we finally get to training!

In [28]:
from trl import SFTTrainer
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="llama32_1bn_raft_v2",  # Also used as the Hugging Face Hub model id if you push the model
    report_to="wandb",  # Set to "none" if you don't want to use wandb
    run_name="RAFT_SFT_Take7",
    eval_steps=5,
    eval_strategy="steps",
    per_device_train_batch_size=1,    # small batches if quantized
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    num_train_epochs=5,
    # max_steps=60,                    # or set num_train_epochs
    save_strategy="no",
    gradient_checkpointing=True,
    logging_strategy="steps",
    logging_steps=5,
    seed=42,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_ds,
    eval_dataset = eval_ds, 
    args=training_args,
    dataset_text_field="text",
    
)

Unsloth: Tokenizing ["text"] (num_proc=4):   0%|          | 0/301 [00:00<?, ? examples/s]

Unsloth: Tokenizing ["text"] (num_proc=4):   0%|          | 0/34 [00:00<?, ? examples/s]

Current memory statistics

In [29]:
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA A10G. Max memory = 22.184 GB.
1.457 GB of memory reserved.


In [30]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 301 | Num Epochs = 5 | Total steps = 185
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 8 x 1) = 8
 "-____-"     Trainable parameters = 11,272,192/1,000,000,000 (1.13% trained)
[34m[1mwandb[0m: Currently logged in as: [33mtituslhy[0m to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss | Validation Loss |
|---|---|---|
| 5 | 1.493 | 1.633143 |
| 10 | 1.4666 | 1.617843 |
| 15 | 1.5463 | 1.596143 |
| 20 | 1.4859 | 1.571562 |
| 25 | 1.4498 | 1.546785 |
| 30 | 1.4265 | 1.521693 |
| 35 | 1.4468 | 1.497457 |
| 40 | 1.3767 | 1.474485 |
| 45 | 1.3344 | 1.454567 |
| 50 | 1.3655 | 1.434021 |


Unsloth: Not an error, but LlamaForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient
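The totals in the training banner above follow from the batch settings: the effective batch size is the per-device batch times the accumulation steps times the number of GPUs, and the step count is the number of full accumulation windows per epoch times the number of epochs. The sketch below reproduces the banner's 185 steps. Flooring partial windows is an assumption inferred from that number, and other trainer versions may round differently.

```python
import math

def total_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Optimizer updates over a run, flooring partial accumulation windows per epoch."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs

effective_batch = 1 * 8 * 1   # per-device batch x accumulation steps x GPUs
steps = total_optimizer_steps(num_examples=301, per_device_batch=1, grad_accum=8, epochs=5)
evaluations = steps // 5      # eval_steps=5
print(effective_batch, steps, evaluations)  # 8 185 37
```

With `eval_steps=5`, the evaluation set is scored 37 times over the run, which is worth keeping in mind if evaluation becomes a noticeable share of the runtime.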


Used memory statistics

In [None]:
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

637.9309 seconds used for training.
10.63 minutes used for training.
Peak reserved memory = 2.156 GB.
Peak reserved memory for training = 0.699 GB.
Peak reserved memory % of max memory = 9.719 %.
Peak reserved memory for training % of max memory = 3.151 %.


<a name="Save"></a>
### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False:
    model.save_pretrained("model")
    tokenizer.save_pretrained("model")
if False:
    model.push_to_hub("hf/model", token = "")
    tokenizer.push_to_hub("hf/model", token = "")


### GGUF / llama.cpp Conversion
Saving to `GGUF` / `llama.cpp` is now supported natively! We clone `llama.cpp` and save to `q8_0` by default. All other methods, such as `q4_k_m`, are also allowed. Use `save_pretrained_gguf` to save locally and `push_to_hub_gguf` to upload to Hugging Face.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
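For a rough feel of what the default `q8_0` export costs on disk, note that this format stores weights in blocks of 32 int8 values plus one 16-bit scale, so 34 bytes per 32 weights (8.5 bits per weight). The sketch below turns that into a size estimate. The parameter count is the round figure from the training banner, used only for illustration, and real GGUF files also carry metadata and may keep some tensors in other types.

```python
def q8_0_size_bytes(num_weights, block_size=32, bytes_per_block=34):
    """Estimate q8_0 storage: 32 int8 values plus one fp16 scale per block."""
    blocks = -(-num_weights // block_size)  # ceiling division
    return blocks * bytes_per_block

params = 1_000_000_000  # illustrative round figure, not an exact parameter count
q8_bytes = q8_0_size_bytes(params)
f16_bytes = params * 2  # two bytes per weight in float16
print(f"q8_0: {q8_bytes / 1e9:.2f} GB, f16: {f16_bytes / 1e9:.2f} GB")
```

The K-quant methods listed above mix several block types, so they do not reduce to a single per-block formula like this one.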

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/vuhung16au/unslothai-notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "",
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI-based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui).

And we're done! If you have any questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM news and join projects, feel free to join our [Discord](https://discord.gg/unsloth)!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/vuhung16au/unslothai-notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/vuhung16au/unslothai-notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/vuhung16au/unslothai-notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

