# mistral-unslothify

#### NOTE
Everything in this notebook must run on a GPU: a local machine with an NVIDIA GPU and CUDA properly installed, Google Colab with a free GPU runtime (the free quota runs out quickly), or any other cloud machine where the `!nvcc --version` cell below checks out ✅.

## 1. Setup and installations

Check that the CUDA toolkit is installed and visible (`nvcc` is the CUDA compiler).

In [1]:
# check cuda version
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0
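
The same check can be scripted so that later cells fail early on a machine without the toolkit. This is a minimal sketch, assuming `nvcc` prints a `release X.Y` line as in the output above; the helper names `parse_nvcc_release` and `cuda_toolkit_release` are made up for illustration.

```python
import re
import shutil
import subprocess

def parse_nvcc_release(nvcc_output):
    """Return the CUDA toolkit release (e.g. "12.0") from `nvcc --version` text, or None."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None

def cuda_toolkit_release():
    """Run nvcc if it is on PATH; return its release string, or None if nvcc is missing."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)

# Line copied from the cell output above
sample = "Cuda compilation tools, release 12.0, V12.0.140"
print(parse_nvcc_release(sample))  # 12.0
print(cuda_toolkit_release())      # None when nvcc is not installed
```

Note that this only confirms the toolkit is present; it does not prove that a GPU is attached to the runtime.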


In [2]:
!pip install --upgrade pip -q

# install the PyTorch build for the closest available CUDA version (the cu118 wheels here)
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 -q

Download `unsloth` from GitHub and install it.

In [3]:
# clone the repo; on re-runs git reports that the folder already exists, which is harmless
!git clone https://github.com/unslothai/unsloth.git
!cd unsloth && pip install . -q
#!pip show unsloth

fatal: destination path 'unsloth' already exists and is not an empty directory.


Install remaining needed libraries.

In [4]:
!pip install numpy -q
!pip install bitsandbytes -q
!pip install unsloth-zoo -q
!pip install xformers -q

Imports

In [1]:
import torch
import json
from unsloth import FastLanguageModel
from unsloth import is_bfloat16_supported
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import TrainingArguments
from trl import SFTTrainer

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


## 2. Download the pre-trained model and prepare for fine-tuning

Download pre-trained model

In [11]:
model_name = "unsloth/mistral-7b-instruct-v0.3"

In [36]:
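model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = 2048,  # maximum context length the model is set up for
    dtype = None,           # None = auto-detect (bfloat16 where supported, otherwise float16)
    load_in_4bit = True     # load 4-bit quantized weights to reduce GPU memory use
)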

==((====))==  Unsloth 2024.11.9: Fast Mistral patching. Transformers = 4.46.3.
   \\   /|    GPU: NVIDIA A10. Max memory: 21.975 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.1+cu118. CUDA = 8.6. CUDA Toolkit = 11.8.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


#### Add LoRA adapters

In [37]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    random_state = 1337,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
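
To get a feel for what `r = 16` costs, the number of trainable LoRA weights can be worked out by hand: each adapted linear layer gets two small matrices with `r * (d_in + d_out)` weights in total. The sketch below assumes the layer sizes from the published Mistral-7B configuration (they are not stated in this notebook), and the helper name is made up. The result matches the trainable parameter count that the training log reports in section 4.

```python
# Layer sizes assumed from the published Mistral-7B config (not stated in this notebook):
# hidden size 4096, key/value projection size 1024, MLP size 14336, 32 decoder layers.
HIDDEN, KV, MLP, LAYERS = 4096, 1024, 14336, 32

# (input features, output features) of each targeted linear layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_trainable_params(r, shapes=TARGET_SHAPES, layers=LAYERS):
    """LoRA adds A (r x d_in) and B (d_out x r) per layer: r * (d_in + d_out) weights."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * layers

print(f"{lora_trainable_params(16):,}")  # 41,943,040
```

Because the count scales linearly with `r`, halving the rank to 8 halves the number of trainable weights.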

## 3. Prepare fine-tuning data



In [38]:
jsonl_file = "domain.jsonl"
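
The loading cell below reads a `text` field from every record, so `domain.jsonl` is expected to hold one JSON object per line with a `text` key. Here is a minimal sketch of writing and reading such a file; the example sentences, the helper names, and the file name `example_domain.jsonl` are made up so the real data is not overwritten.

```python
import json

# Made-up example records; the real domain data is not shown in this notebook.
records = [
    {"text": "A smash burger is pressed thin on a hot griddle."},
    {"text": "Brioche buns are a common choice for burgers."},
]

def write_jsonl(path, rows):
    """Write one JSON object per line (the JSON Lines layout)."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]

write_jsonl("example_domain.jsonl", records)
print(read_jsonl("example_domain.jsonl")[0]["text"])
```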

Load and reformat the domain data.

In [39]:
# load the JSONL file with the Hugging Face dataset loader
dataset = load_dataset("json", data_files=jsonl_file, split="train")

# reuse the end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token

def format_text(examples):
    # append the end-of-sequence token so the model learns where each text ends
    texts = [note + tokenizer.eos_token for note in examples["text"]]
    return {"text": texts}

dataset = dataset.map(format_text, batched=True)

Tokenize the domain data with the pre-trained model's tokenizer.

In [40]:
# Reuse the tokenizer that was loaded together with the model above

# Tokenize the text: cut off anything beyond 512 tokens and pad shorter texts up to 512
def tokenize_texts(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

tokenized_dataset = dataset.map(tokenize_texts, batched=True)
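
What `truncation=True` with `padding="max_length"` does to a single sequence can be sketched in plain Python. The token ids below are made up, the helper name is hypothetical, and padding on the right is an assumption; the real tokenizer's padding side may differ.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Mimic truncation plus padding to a fixed length for one sequence (right padding assumed)."""
    ids = token_ids[:max_length]          # truncation: drop everything past max_length
    mask = [1] * len(ids)                 # 1 marks real tokens
    missing = max_length - len(ids)       # number of padding positions to add
    return {
        "input_ids": ids + [pad_id] * missing,
        "attention_mask": mask + [0] * missing,  # 0 marks padding
    }

short = pad_or_truncate([5, 6, 7], max_length=5, pad_id=2)
long = pad_or_truncate([5, 6, 7, 8, 9, 10], max_length=5, pad_id=2)
print(short)  # {'input_ids': [5, 6, 7, 2, 2], 'attention_mask': [1, 1, 1, 0, 0]}
print(long)   # {'input_ids': [5, 6, 7, 8, 9], 'attention_mask': [1, 1, 1, 1, 1]}
```

Every row ends up with the same length, which is why texts much shorter than 512 tokens still occupy 512 positions each.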

In [41]:
# Remove unneeded columns and set format for PyTorch
tokenized_dataset = tokenized_dataset.remove_columns(["text"])  # Keep only tokenized columns
tokenized_dataset.set_format(type="torch")

## 4. Fine-tune the model

We want the *training loss* to decrease. A value around 2-3 is reasonable. If it gets close to 1.0 or drops below, the model predicts the training text with high confidence, but this also carries a risk of overfitting: the model has learned the training data too well and may not perform as effectively on unseen data.

*See "Parameters to adjust to counteract overfitting" further down for the parameters to tweak.*
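
One way to read these loss values, assuming the reported training loss is the mean cross-entropy per token in nats (the usual convention for causal language models): `exp(loss)` is the perplexity, roughly the number of equally likely next tokens the model is hesitating between. The helper name is made up for illustration.

```python
import math

def perplexity(loss):
    """Convert a mean per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(loss)

for loss in (3.0, 2.0, 1.0):
    print(f"loss {loss:.1f} -> perplexity {perplexity(loss):.1f}")
# loss 3.0 -> perplexity 20.1
# loss 2.0 -> perplexity 7.4
# loss 1.0 -> perplexity 2.7
```

A loss near 1.0 therefore means the model is choosing between fewer than three plausible tokens on average, which on a small domain dataset can be a sign of memorization.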

In [43]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=tokenized_dataset,
    max_seq_length=1024,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=0.00001,  # the higher the rate, the faster the model overfits
        warmup_steps=5,
        num_train_epochs=1,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.3,
        lr_scheduler_type="linear",
        seed=1337,
        output_dir="outputs",
        report_to="none",
    ),
)

trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 500 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 62
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
1,2.8891
2,3.352
3,3.2636
4,2.9592
5,3.214
6,3.0045
7,3.1668
8,2.8346
9,2.6556
10,2.8032
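
The numbers in the banner above can be reproduced with a little arithmetic. This is a sketch, not the trainer's own code: the helper name is made up, and the rounding of the last partial batch is chosen to reproduce the 62 steps in this log, so other library versions may round differently.

```python
import math

def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs=1, num_gpus=1):
    """Return (effective batch size, total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    # number of mini-batches the data loader yields per epoch
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
    # one optimizer step per grad_accum mini-batches (a trailing partial group is dropped here)
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return effective_batch, steps_per_epoch * epochs

print(optimizer_steps(500, per_device_batch=2, grad_accum=4))  # (8, 62)
```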


#### Parameters to adjust to counteract overfitting

- `learning_rate` -- a commonly used rate is around `0.0002`, but the higher the rate, the quicker the model may overfit. A lower rate (the training cell above uses `0.00001`) learns in a more stable way.
- `weight_decay` -- experiment with the value, for example around `0.1` to `0.5`, to penalise large weights and reduce overfitting.
- `warmup_steps` -- increase it (e.g., `10` or more) to help the model start learning in a more gradual and stable way.
- `per_device_train_batch_size` and `gradient_accumulation_steps` -- the product of these two is the effective batch size (here 2 × 4 = 8), and larger batch sizes can help with stability.
- `lr_scheduler_type` -- a different learning rate scheduler (such as `cosine` or `constant_with_warmup`) may improve the training dynamics.
- `num_train_epochs` -- the more epochs, the likelier the model is to overfit, but re-running the training cell above can be a way to gradually fine-tune the model to a desired level.
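
To see how `learning_rate`, `warmup_steps` and a linear scheduler interact, the schedule can be sketched by hand: the rate ramps up linearly during warmup and then decays linearly to zero. This is a sketch of the schedule's shape under the settings used above (base rate `0.00001`, 5 warmup steps, 62 total steps), not the library's own implementation; the function name is made up.

```python
def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given optimizer step: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Print the rate at the start, during warmup, at the peak, mid-run and at the end
for step in (0, 1, 5, 33, 62):
    print(step, f"{linear_warmup_lr(step, 1e-5, 5, 62):.2e}")
```

Raising `warmup_steps` stretches the ramp, so the first updates use a smaller rate and the peak is reached later.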

## 5. Save the fine-tuned model

In [2]:
name_for_fine_tuned_model = "Mistral_7BURGERZ"

In [46]:
# Save the fine-tuned model
trainer.model.save_pretrained(name_for_fine_tuned_model)

# Save the tokenizer
tokenizer.save_pretrained(name_for_fine_tuned_model)

('Mistral_7BURGERZ/tokenizer_config.json',
 'Mistral_7BURGERZ/special_tokens_map.json',
 'Mistral_7BURGERZ/tokenizer.model',
 'Mistral_7BURGERZ/added_tokens.json',
 'Mistral_7BURGERZ/tokenizer.json')

Just to be sure, let's check that we can read the saved model.

In [3]:
# Load the fine-tuned model; low_cpu_mem_usage=True keeps peak CPU memory down while loading (the base model is quantized)
model = AutoModelForCausalLM.from_pretrained(name_for_fine_tuned_model, low_cpu_mem_usage=True)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(name_for_fine_tuned_model)

In [19]:
# Test generation with a burger-related prompt
burger_prompt = "Tell me about hamburgers"
inputs = tokenizer(burger_prompt, return_tensors="pt")
outputs = model.generate(inputs["input_ids"], max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

IndexError: too many indices for tensor of dimension 1

In [18]:
# for_inference() expects a loaded model, not a model name, so load the original model first
original_model, original_tokenizer = FastLanguageModel.from_pretrained(model_name = model_name, max_seq_length = 2048, dtype = None, load_in_4bit = True)
FastLanguageModel.for_inference(original_model)

In [16]:
# Test generation with a burger-related prompt, using the original model and its own tokenizer
burger_prompt = "Tell me about hamburgers"
inputs = original_tokenizer(burger_prompt, return_tensors="pt").to("cuda")
outputs = original_model.generate(**inputs, max_length=100)
print(original_tokenizer.decode(outputs[0], skip_special_tokens=True))