To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our GitHub page [here](https://github.com/unslothai/unsloth#installation-instructions---conda).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save) (e.g. for llama.cpp).

**[NEW] Llama-3 8b is trained on a crazy 15 trillion tokens! Llama-2 was 2 trillion.**

Use our [Llama-3 8b Instruct](https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing) notebook for conversational style finetunes.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import sys
from pathlib import Path

working_dir = str(Path.cwd().parent)
os.chdir(working_dir)
sys.path.append(working_dir)
print("working dir:", working_dir)

working dir: /home/inflaton/code/projects/courses/cs605/project


In [3]:
from dotenv import find_dotenv, load_dotenv

found_dotenv = find_dotenv(".env")

if len(found_dotenv) == 0:
    found_dotenv = find_dotenv(".env.example")
print(f"loading env vars from: {found_dotenv}")
load_dotenv(found_dotenv, override=True)

loading env vars from: /home/inflaton/code/projects/courses/cs605/project/.env


True

In [4]:
import os

model_name = os.getenv("MODEL_NAME") or "Qwen/Qwen2-7B"
token = os.getenv("HF_TOKEN") or None
load_in_4bit = os.getenv("LOAD_IN_4BIT") == "true"
local_model = os.getenv("LOCAL_MODEL")
hub_model = os.getenv("HUB_MODEL")
num_train_epochs = int(os.getenv("NUM_TRAIN_EPOCHS") or 0)
data_path = os.getenv("DATA_PATH")

max_seq_length = 2048  # Choose any! We auto support RoPE Scaling internally!
dtype = (
    None  # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
)


model_name, load_in_4bit, local_model, hub_model, max_seq_length, num_train_epochs, dtype

('Qwen/Qwen2-0.5B-Instruct',
 True,
 'models/llama-3-8b-bnb-4bit-MAC-',
 'llama-3-8b-bnb-4bit-',
 2048,
 10,
 None)

* We support Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, Open Hermes etc
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* With [PR 26037](https://github.com/huggingface/transformers/pull/26037), we support downloading 4bit models **4x faster**! [Our repo](https://huggingface.co/unsloth) has Llama, Mistral 4bit models.
* [**NEW**] We make Phi-3 Medium / Mini **2x faster**! See our [Phi-3 Medium notebook](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)

In [5]:
from unsloth import FastLanguageModel


def load_model(model_name, load_in_4bit):
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_name,  # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length=max_seq_length,
        dtype=dtype,
        load_in_4bit=load_in_4bit,
    )
    FastLanguageModel.for_inference(model)

    return model, tokenizer

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


<a name="Data"></a>
### Data Prep
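We use a tab-separated Chinese-English parallel corpus loaded from `DATA_PATH`, with a `chinese` and an `english` column per row, and format each pair with an Alpaca-style prompt (the template follows the [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html)). On the first run the file is filtered and split 80/20 into `-train.tsv` and `-test.tsv` files; later runs reuse those files. You can replace this code section with your own data prep.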

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

If you want to use the `llama-3` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).

In [6]:
import pandas as pd
from datasets import load_dataset

train_data_file = data_path.replace(".tsv", "-train.tsv")
test_data_file = data_path.replace(".tsv", "-test.tsv")

if not os.path.exists(train_data_file):
    print("generating train/test data files")
    dataset = load_dataset("csv", data_files=data_path, delimiter="\t", split="train")
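    print("rows loaded:", len(dataset))
    # Keep only rows where both the source and the target text are non-empty
    dataset = dataset.filter(lambda x: x["chinese"] and x["english"])
    print("rows after filtering:", len(dataset))

    # Hold out 20% of the rows for testing
    datasets = dataset.train_test_split(test_size=0.2)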

    # Convert to pandas DataFrame
    train_df = pd.DataFrame(datasets["train"])
    test_df = pd.DataFrame(datasets["test"])

    # Save to TSV
    train_df.to_csv(train_data_file, sep="\t", index=False)
    test_df.to_csv(test_data_file, sep="\t", index=False)


print("loading train/test data files")
datasets = load_dataset(
    "csv", data_files={"train": train_data_file, "test": test_data_file}, delimiter="\t"
)
datasets

loading train/test data files


DatasetDict({
    train: Dataset({
        features: ['chinese', 'english'],
        num_rows: 4528
    })
    test: Dataset({
        features: ['chinese', 'english'],
        num_rows: 1133
    })
})
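The filter-and-split step above can be sketched without any dependencies. This is only an illustration under stated assumptions: the rows are made up, `filter_and_split` is a hypothetical helper, and the fixed `seed` is an addition for reproducibility (the notebook's `train_test_split` call does not set one).

```python
import random

# Made-up rows for illustration; the real data comes from the TSV at DATA_PATH.
rows = [
    {"chinese": "你好", "english": "Hello"},
    {"chinese": "谢谢", "english": "Thank you"},
    {"chinese": "", "english": "Orphan target"},
    {"chinese": "再见", "english": "Goodbye"},
    {"chinese": "早上好", "english": None},
    {"chinese": "晚安", "english": "Good night"},
    {"chinese": "请坐", "english": "Please sit down"},
]


def filter_and_split(rows, test_size=0.2, seed=42):
    """Drop incomplete pairs, shuffle reproducibly, and split into train/test."""
    # Keep only rows where both fields are non-empty (mirrors the filter lambda).
    complete = [r for r in rows if r["chinese"] and r["english"]]
    shuffled = complete[:]
    random.Random(seed).shuffle(shuffled)
    # Reserve at least one row for the test split.
    n_test = max(1, round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


train_rows, test_rows = filter_and_split(rows)
print(len(train_rows), len(test_rows))
```

Because the notebook writes the split to disk and reuses it on later runs, the train and test sets stay stable across sessions even without a seed; a seed mainly matters if the files are ever regenerated.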

In [7]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Translate from Chinese to English.

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN


def formatting_prompts_func(examples):
    inputs = examples["chinese"]
    outputs = examples["english"]

    texts = []
    for source, target in zip(inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(source, target) + EOS_TOKEN
        texts.append(text)
    return {
        "text": texts,
    }

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [8]:
datasets = datasets.map(
    formatting_prompts_func,
    batched=True,
)

<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input, but leave the output blank so that the model generates the response itself.
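A minimal sketch of what a blank-output prompt looks like. The `text` column built during data prep fills the response slot with the reference translation and appends the EOS token, so a prompt meant for generation has to be formatted separately. The template is repeated here so the snippet is self-contained, and `build_inference_prompt` is a hypothetical helper, not part of the notebook.

```python
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Translate from Chinese to English.

### Input:
{}

### Response:
{}"""


def build_inference_prompt(chinese: str) -> str:
    """Fill only the input slot; the response slot stays empty and no EOS is added."""
    return ALPACA_PROMPT.format(chinese, "")


prompt = build_inference_prompt("全仗着狐仙搭救。")
print(prompt)
```

A prompt built this way ends right after `### Response:`, so whatever the model emits next is its own translation rather than a continuation of the reference.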

In [9]:
from transformers import TextStreamer


def test_model(model, tokenizer, prompt):
    inputs = tokenizer(
        [prompt],
        return_tensors="pt",
    ).to("cuda")

    text_streamer = TextStreamer(tokenizer)

    _ = model.generate(
        **inputs, max_new_tokens=128, streamer=text_streamer, use_cache=True
    )

In [10]:
%%time

model, tokenizer = load_model(model_name, load_in_4bit)

==((====))==  Unsloth: Fast Qwen2 patching release 2024.5
   \\   /|    GPU: NVIDIA GeForce RTX 4080 Laptop GPU. Max memory: 11.994 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.2+cu121. CUDA = 8.9. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.25.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


CPU times: user 5.56 s, sys: 872 ms, total: 6.43 s
Wall time: 28.4 s


In [11]:
%%time

prompt1 = datasets["train"]["text"][0]

test_model(model, tokenizer, prompt1)

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Translate from Chinese to English.

### Input:
全仗着狐仙搭救。

### Response:
Because I was protected by a fox fairy.<|im_end|>
<|endoftext|>
CPU times: user 825 ms, sys: 92.4 ms, total: 917 ms
Wall time: 1.14 s


In [12]:
%%time

prompt2 = datasets["test"]["text"][0]

test_model(model, tokenizer, prompt2)

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Translate from Chinese to English.

### Input:
老耿端起枪，眯缝起一只三角眼，一搂扳机响了枪，冰雹般的金麻雀劈哩啪啦往下落，铁砂子在柳枝间飞迸着，嚓嚓有声。

### Response:
Old Geng picked up his shotgun, squinted, and pulled the trigger. Two sparrows crashed to the ground like hailstones as shotgun pellets tore noisily through the branches.<|im_end|>
}<|im_end|>
CPU times: user 217 ms, sys: 21.8 ms, total: 239 ms
Wall time: 256 ms


In [13]:
import re


def extract_answer(text, debug=False):
    if text:
        # Remove the begin and end tokens
        text = re.sub(r".*### Response:\n", "", text, flags=re.DOTALL | re.MULTILINE)
        if debug:
            print("--------\nstep 1:", text)

        text = re.sub(r"<\|.*?\|>.*", "", text, flags=re.DOTALL | re.MULTILINE)
        if debug:
            print("--------\nstep 2:", text)
    # Return the result
    return text

In [14]:
text = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Translate from Chinese to English.

### Input:
宝钗忙劝他：“好兄弟，快别说这话，人家笑话。”

### Response:
Bao-chai was shocked: 'Please don't say things like that, Cousin! You'll make yourself ridiculous.'<|im_end|>
<|endoftext|>"""

extract_answer(text, debug=True)

--------
step 1: Bao-chai was shocked: 'Please don't say things like that, Cousin! You'll make yourself ridiculous.'<|im_end|>
<|endoftext|>
--------
step 2: Bao-chai was shocked: 'Please don't say things like that, Cousin! You'll make yourself ridiculous.'


"Bao-chai was shocked: 'Please don't say things like that, Cousin! You'll make yourself ridiculous.'"

In [15]:
from tqdm import tqdm
from transformers import TextStreamer
import evaluate

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
accuracy = evaluate.load("accuracy")


def calc_metrics(references, predictions):
    assert len(references) == len(
        predictions
    ), f"lengths are different: {len(references)} != {len(predictions)}"

    correct = [1 if ref == pred else 0 for ref, pred in zip(references, predictions)]
    accuracy = sum(correct) / len(references)
    incorrect = [i for i, c in enumerate(correct) if c == 0]

    results = {"accuracy": accuracy, "incorrect": incorrect}

    results["bleu_scores"] = bleu.compute(
        predictions=predictions, references=references, max_order=4
    )
    results["rouge_scores"] = rouge.compute(
        predictions=predictions, references=references
    )
    return results


def eval_model(model, tokenizer, eval_dataset):
    total = len(eval_dataset)
    predictions = []
    for i in tqdm(range(total)):
        inputs = tokenizer(
            eval_dataset["text"][i : i + 1],
            return_tensors="pt",
        ).to("cuda")

        outputs = model.generate(**inputs, max_new_tokens=4096, use_cache=False)
        decoded_output = tokenizer.batch_decode(outputs)
        decoded_output = [extract_answer(output) for output in decoded_output]
        predictions.extend(decoded_output)

    return predictions

In [16]:
%%time

predictions = eval_model(model, tokenizer, datasets["test"])

100%|██████████| 1133/1133 [08:57<00:00,  2.11it/s]

CPU times: user 8min 22s, sys: 35.8 s, total: 8min 58s
Wall time: 8min 57s





In [17]:
calc_metrics(datasets["test"]["english"], predictions)

{'accuracy': 0.999117387466902,
 'incorrect': [913],
 'bleu_scores': {'bleu': 0.9993304099492726,
  'precisions': [0.9994368436744294,
   0.9994148826323398,
   0.9994270161867927,
   0.9994401731730984],
  'brevity_penalty': 0.9999006244100392,
  'length_ratio': 0.999900629347466,
  'translation_length': 30187,
  'reference_length': 30190},
 'rouge_scores': {'rouge1': 0.999117387466902,
  'rouge2': 0.9955869373345102,
  'rougeL': 0.999117387466902,
  'rougeLsum': 0.999117387466902}}
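The `accuracy` and `incorrect` fields come from plain string equality between each reference and prediction, so one mismatch out of 1133 rows gives 1132/1133 ≈ 0.99912. A small sketch of that part on toy strings (the sentences are made up, and `exact_match` is a hypothetical stand-in for the first half of `calc_metrics`; BLEU and ROUGE are left out because they need the `evaluate` package):

```python
def exact_match(references, predictions):
    """Return exact-match accuracy and the indices of mismatched pairs."""
    assert len(references) == len(predictions), "lengths are different"
    incorrect = [
        i for i, (ref, pred) in enumerate(zip(references, predictions)) if ref != pred
    ]
    accuracy = 1 - len(incorrect) / len(references)
    return {"accuracy": accuracy, "incorrect": incorrect}


# Toy strings, made up for illustration
references = ["Good morning.", "Thank you.", "See you tomorrow.", "Please sit down."]
predictions = ["Good morning.", "Thanks.", "See you tomorrow.", "Please sit down."]

result = exact_match(references, predictions)
print(result)  # {'accuracy': 0.75, 'incorrect': [1]}
```

Exact match is strict: a prediction that differs by a single character or a leading newline counts as incorrect, which is why the index list is useful for inspecting individual failures.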

In [22]:
datasets["test"]["chinese"][913]

'女婴啼哭不休。 她母亲温言相呵，女婴只是大哭。'

In [23]:
datasets["test"]["english"][913]

"The little girl was crying in a continuous wail which her mother's gentle words of comfort were powerless to console."

In [21]:
predictions[913]

"\nI'm sorry, but I don't understand what you're asking. Could you please rephrase your question?"

In [25]:
print(datasets["test"]["text"][913])

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Translate from Chinese to English.

### Input:
女婴啼哭不休。 她母亲温言相呵，女婴只是大哭。

### Response:
The little girl was crying in a continuous wail which her mother's gentle words of comfort were powerless to console.<|im_end|>


In [26]:
%%time

prompt3 = datasets["test"]["text"][913]

test_model(model, tokenizer, prompt3)

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Translate from Chinese to English.

### Input:
女婴啼哭不休。 她母亲温言相呵，女婴只是大哭。

### Response:
The little girl was crying in a continuous wail which her mother's gentle words of comfort were powerless to console.<|im_end|>
## Instructions:

1. Translate the instruction into English.
2. Provide a response based on the given input.

Response: The little girl was crying in a continuous wail which her mother's gentle words of comfort were powerless to console.<|im_end|>
CPU times: user 2.72 s, sys: 221 ms, total: 2.94 s
Wall time: 2.92 s


In [None]:
%%time

# del model

import torch, gc
gc.collect()
torch.cuda.empty_cache()

CPU times: user 116 ms, sys: 428 μs, total: 117 ms
Wall time: 137 ms


And we're done! If you have questions about Unsloth, find a bug, want to keep up with the latest LLM developments, or want to join projects, feel free to join our [Discord](https://discord.gg/u54VK8m8tk)!

Some other links:
1. Zephyr DPO 2x faster [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 Hugging Face, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)
9. [**NEW**] We make Phi-3 Medium / Mini **2x faster**! See our [Phi-3 Medium notebook](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)

<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Support our work if you can! Thanks!
</div>