# Initial Results of Training LLaMA 3.1 8B

## Set-up

### Python

Python is required to run this project. Although any version `> 3.11` should suffice, it's recommended to use `3.12`.
For information on installing Python for your system, take a look at the [official instructions](https://www.python.org/downloads/).

In [1]:
import sys
print(sys.version)

3.12.6 (main, Sep 14 2024, 15:48:12) [GCC 11.4.0]
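A minimal sketch of a guard that reports early when the interpreter is too old. `MIN_VERSION` and `check_python` are illustrative names, not part of this project, and the sketch assumes that 3.11 itself is acceptable.

```python
import sys

# Hypothetical minimum, taken from the version requirement stated above.
MIN_VERSION = (3, 11)


def check_python(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True when the (major, minor) pair meets the minimum."""
    return tuple(version_info[:2]) >= minimum


if check_python():
    print("Python version is recent enough")
else:
    print(f"Python {MIN_VERSION[0]}.{MIN_VERSION[1]} or newer is required")
```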


### CUDA

Several backends are available for training LLMs, depending on your hardware. Examples include CUDA (for Nvidia GPUs) and MPS (for Apple Silicon). These backends may have varying success with different model sizes. Since this project intends to use relatively large models, CUDA is recommended.

To install the CUDA backend, visit the [official instructions](https://developer.nvidia.com/cuda-downloads).

### PyTorch

Installing PyTorch depends on which backend you intend to use it with. For more information, visit the [official instructions](https://pytorch.org/get-started/locally/).

In [2]:
import torch

x = torch.rand(5, 3)
print(x)

torch_version = torch.__version__
torch.cuda.is_available()

tensor([[0.8691, 0.5805, 0.4101],
        [0.6245, 0.0024, 0.6505],
        [0.5809, 0.8023, 0.0379],
        [0.8142, 0.9312, 0.6318],
        [0.9965, 0.4418, 0.6478]])


True
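The backend choice described in the CUDA section can also be made explicit in code. The sketch below prefers CUDA, then MPS, then the CPU. `pick_device` is a hypothetical helper, and the `try`/`except` is only there so the snippet still runs where PyTorch is not installed.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then MPS, then fall back to the CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


try:
    import torch

    device = pick_device(
        torch.cuda.is_available(),
        torch.backends.mps.is_available(),
    )
except ImportError:
    # PyTorch is not installed, so there is no backend to probe.
    device = pick_device(False, False)

print(device)
```

The rest of this notebook hard-codes `"cuda"`, which is consistent with the recommendation above.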

### Hugging Face

Hugging Face is a popular AI and machine learning platform known for its open-source tools and libraries, which make it easier to develop, deploy, and share natural language processing (NLP) models. It hosts repositories of pre-trained models, and its libraries, such as `transformers`, `datasets`, and `tokenizers`, offer simple APIs for integrating state-of-the-art models into applications. The platform also supports fine-tuning models for custom use cases, and its model hub lets users collaborate by uploading and sharing models.

### Unsloth

Unsloth is a lightweight framework designed to optimize the training and fine-tuning of large language models, such as LLaMA, with a focus on memory efficiency and reduced resource consumption. It supports advanced techniques like LoRA (Low-Rank Adaptation) and 4-bit quantization, allowing users to train models with lower VRAM requirements. By integrating gradient checkpointing and memory-efficient features, Unsloth enables developers to fit larger models and handle longer context sequences.
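To give a feel for why 4-bit quantization lowers VRAM requirements, here is a rough, weights-only estimate. It assumes about 8 billion parameters and ignores quantization metadata, layers kept in higher precision, activations, LoRA adapters, and optimizer state, so real usage will be higher. `weight_gib` is a hypothetical helper.

```python
def weight_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate size of the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


N_PARAMS = 8e9  # assumption: roughly 8 billion weights

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_gib(N_PARAMS, bits):.1f} GiB")
```

Under these assumptions the weights shrink from roughly 15 GiB at 16 bits to under 4 GiB at 4 bits.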

In [3]:
# We use unsloth to fit LLaMA-3.1-8B on our system
from unsloth import FastLanguageModel

# We use the 4-bit quantized model from unsloth
pre_trained_model = "unsloth/Meta-Llama-3.1-8B-bnb-4bit"
max_seq_length = 2048 # Choose any
dtype = None # None for auto detection
load_in_4bit = True # Use 4bit quantization to reduce memory usage

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = pre_trained_model,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

# These parameters are recommended by unsloth for minimum VRAM.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)



🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: NVIDIA GeForce RTX 2060 SUPER. Max memory: 8.0 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu124. CUDA = 7.5. CUDA Toolkit = 12.4.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
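The LoRA settings in the `get_peft_model` call above determine how many weights are actually trained. Each adapted linear layer with input width `d_in` and output width `d_out` gains two low-rank matrices holding `r * (d_in + d_out)` parameters in total. The sketch below does that arithmetic. The layer shapes are assumptions based on the published Llama 3.1 8B configuration (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128), not values read from this notebook.

```python
R = 16           # LoRA rank, as passed to get_peft_model
NUM_LAYERS = 32  # matches "patched 32 layers" in the log above

HIDDEN = 4096     # assumed hidden size
KV_DIM = 8 * 128  # assumed: 8 key/value heads x head dimension 128
MLP = 14336       # assumed MLP (intermediate) size

# (d_in, d_out) for every module listed in target_modules.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(d_in: int, d_out: int, r: int = R) -> int:
    """Parameters in the two low-rank matrices: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out) for d_in, d_out in SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 41,943,040
```

Under these assumptions the result agrees with the `Number of trainable parameters = 41,943,040` line that Unsloth prints when training starts.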


### Alpaca prompt

An Alpaca prompt refers to the style of prompt used in Stanford's Alpaca model, which was fine-tuned from LLaMA 7B on instruction-following examples generated with OpenAI's `text-davinci-003`. The prompts follow a structured format in which a task instruction is paired with an optional input and the expected response. This format is designed to improve the model's performance in following instructions.

The prompt format used here, a shortened variant of the Alpaca template without its introductory sentence, looks like this:

In [4]:
alpaca_prompt = """
### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN

def formatting_prompts_func(examples):
    # Called with batched = True, so every field is a list of values.
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["result"]
    texts = []

    # "context" is used instead of "input" to avoid shadowing the built-in.
    for instruction, context, output in zip(instructions, inputs, outputs):
        text = alpaca_prompt.format(instruction, context, output) + EOS_TOKEN
        texts.append(text)

    return { "text" : texts, }
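To see what the formatting step produces, the sketch below applies the same template to one made-up record. The record contents are hypothetical, and the EOS string is hard-coded to `<|end_of_text|>` as an assumption (it is the token visible in the generation output later in this notebook) so that the example runs without loading the tokenizer.

```python
ALPACA_PROMPT = """
### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS = "<|end_of_text|>"  # assumption: stands in for tokenizer.eos_token


def format_batch(examples):
    """Build one training string per record from a batch of column lists."""
    texts = [
        ALPACA_PROMPT.format(instruction, context, result) + EOS
        for instruction, context, result in zip(
            examples["instruction"], examples["input"], examples["result"]
        )
    ]
    return {"text": texts}


# Hypothetical single-record batch, in the column layout used by batched map().
batch = {
    "instruction": ["Rename the variable."],
    "input": ["x = 1"],
    "result": ["count = 1"],
}
print(format_batch(batch)["text"][0])
```

The EOS token at the end of each example is what gives the model a training signal for where a response should stop.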

In [5]:
from datasets import load_dataset
dataset = load_dataset('json', data_files={'train': '../datasets/zephyr.json'}, split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

Generating train split: 0 examples [00:00, ? examples/s]

Generating train split: 1 examples [00:00, 55.77 examples/s]
Map: 100%|██████████| 1/1 [00:00<00:00, 139.11 examples/s]
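The contents of `zephyr.json` are not shown here. As a sketch of the shape the formatting function expects, the snippet below writes and re-reads a hypothetical one-record file with the three fields it accesses (`instruction`, `input`, `result`). The record text, the temporary path, and the `missing_fields` helper are made up for illustration, and the file uses JSON Lines (one object per line), a layout the `json` loader of `datasets` accepts.

```python
import json
import os
import tempfile

REQUIRED_FIELDS = ("instruction", "input", "result")

# Hypothetical record; the real dataset contents are not part of this notebook.
records = [
    {
        "instruction": "Rename the variable.",
        "input": "x = 1",
        "result": "count = 1",
    }
]


def missing_fields(path):
    """Return (line_number, field) pairs for every absent required field."""
    problems = []
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            record = json.loads(line)
            for field in REQUIRED_FIELDS:
                if field not in record:
                    problems.append((lineno, field))
    return problems


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.jsonl")
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")
    problems = missing_fields(path)

print(problems)  # an empty list when every record has all three fields
```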


### Training

In [6]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Setting this to True can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        num_train_epochs = 1, # Ignored here: max_steps takes precedence
        max_steps = 60,
        learning_rate = 1e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 5,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "pengullama",
    ),
)

trainer_stats = trainer.train()

num_proc must be <= 1. Reducing num_proc to 1 for dataset of size 1.
Map: 100%|██████████| 1/1 [00:00<00:00, 134.13 examples/s]
max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 1 | Num Epochs = 60
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 4
\        /    Total batch size = 4 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040


| Step | Training Loss |
|------|---------------|
| 5    | 0.2366        |
| 10   | 0.194         |
| 15   | 0.1171        |
| 20   | 0.0593        |
| 25   | 0.032         |
| 30   | 0.0215        |
| 35   | 0.0034        |
| 40   | 0.003         |
| 45   | 0.0028        |
| 50   | 0.0026        |
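The batch size, epoch count, and learning-rate shape reported for this run follow from the training arguments. The sketch below redoes that arithmetic in plain Python. It is an illustration of how these values relate, not the scheduler or trainer code from `transformers`, and `linear_lr` is a hypothetical helper.

```python
import math

# Values copied from the TrainingArguments and the training log above.
NUM_EXAMPLES = 1
PER_DEVICE_BATCH = 1
GRAD_ACCUM = 4
NUM_GPUS = 1
MAX_STEPS = 60
WARMUP_STEPS = 10
PEAK_LR = 1e-4

# Examples contributing to each optimizer step.
total_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_GPUS

# Optimizer steps per pass over the data, never fewer than one.
batches_per_epoch = math.ceil(NUM_EXAMPLES / (PER_DEVICE_BATCH * NUM_GPUS))
steps_per_epoch = max(batches_per_epoch // GRAD_ACCUM, 1)
num_epochs = math.ceil(MAX_STEPS / steps_per_epoch)


def linear_lr(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to zero at MAX_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    remaining = max(0, MAX_STEPS - step)
    return PEAK_LR * remaining / max(1, MAX_STEPS - WARMUP_STEPS)


print(total_batch, num_epochs)
for step in (0, 10, 35, 60):
    print(step, f"{linear_lr(step):.1e}")
```

The computed batch size of 4 and epoch count of 60 match the log. With a single example, every optimizer step sees the same record, which is consistent with the training loss collapsing toward zero.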


### Push the results to our Hugging Face repo

In [7]:
trainer.push_to_hub()

### Time to test our model!

In [8]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Update the function to handle authentication.",
        "```python\ndef login(user):\n    return user.is_not_authenticated\n```",
        ""
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

<|begin_of_text|>
### Instruction:
Update the function to handle authentication.

### Input:
```python
def login(user):
    return user.is_not_authenticated
```

### Response:
```python
diff --git a-tests/lib/python3.6/unittest/mock.py b-tests/lib/python3.6/unittest/mock.py
index 1135537467f..6b7f10c4749 100644
--- a/tests/lib/python3.6/unittest/mock.py
+++ b/tests/lib/python3.6/unittest/mock.py
@@ -1173,18 +1173,17 @@ class CallsMatch(matchers.Matchers):
 
     @staticmethod
     def match(user):  # pylint: disable=unused-argument
-        return True if isinstance(user, User) else False



In [9]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "How many states are in the USA?", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

<|begin_of_text|>
### Instruction:
How many states are in the USA?

### Input:


### Response:
    50
<|end_of_text|>


In [10]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Don't use match/case statements", # instruction
        "       match(status):\n           case TwisterStatus.PASS:\n               color = Fore.GREEN\n           case TwisterStatus.SKIP | TwisterStatus.FILTER | TwisterStatus.BLOCK:\n               color = Fore.YELLOW\n           case TwisterStatus.FAIL | TwisterStatus.ERROR:\n               color = Fore.RED\n           case TwisterStatus.STARTED | TwisterStatus.NONE:\n               color = Fore.MAGENTA\n           case _:\n               color = Fore.RESET\n       return color", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

<|begin_of_text|>
### Instruction:
Don't use match/case statements

### Input:
       match(status):
           case TwisterStatus.PASS:
               color = Fore.GREEN
           case TwisterStatus.SKIP | TwisterStatus.FILTER | TwisterStatus.BLOCK:
               color = Fore.YELLOW
           case TwisterStatus.FAIL | TwisterStatus.ERROR:
               color = Fore.RED
           case TwisterStatus.STARTED | TwisterStatus.NONE:
               color = Fore.MAGENTA
           case _:
               color = Fore.RESET
       return color

### Response:
diff --git a/twister/twisterlib/statuses.py b/twister/twisterlib/statuses.py
index 2983917467f..c5545b3fe28 100644
--- a/twister/twisterlib/statuses.py
+++ b/twister/twisterlib/statuses.py
@@ -23,18 +23,17 @@ class TwisterStatus(str, Enum):
 
     @staticmethod
     def get_color(status: TwisterStatus) -> str:
-        match(status):
-            case TwisterStatus.PASS:
-                color = Fore.GREEN
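Each generation above echoes the full prompt before the answer. When only the answer is needed, it can be cut out of the decoded text. The sketch below is one way to do that with plain string handling. `extract_response` is a hypothetical helper, and the two special-token strings are assumed from the outputs shown above.

```python
RESPONSE_MARKER = "### Response:"
SPECIAL_TOKENS = ("<|begin_of_text|>", "<|end_of_text|>")


def extract_response(generated: str) -> str:
    """Return the text after the first response marker, without special tokens.

    Returns an empty string when the marker is absent.
    """
    _, _, tail = generated.partition(RESPONSE_MARKER)
    for token in SPECIAL_TOKENS:
        tail = tail.replace(token, "")
    return tail.strip()


# Decoded text copied from the "How many states" example above.
sample = (
    "<|begin_of_text|>\n### Instruction:\nHow many states are in the USA?\n\n"
    "### Input:\n\n\n### Response:\n    50\n<|end_of_text|>"
)
print(extract_response(sample))  # 50
```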



### Results

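We can see that our custom-trained model has definitely changed the outputs of LLaMA, for better or for worse. Both code prompts now produce diff-style output rather than rewritten code, while the general-knowledge question is still answered correctly. Note that the training set contained a single example repeated for 60 epochs, so the model has likely overfit to it.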