# Clone Llama-Factory and Install Dependencies

In [None]:
!git clone https://github.com/hiyouga/LLaMA-Factory.git

In [1]:
%cd LLaMA-Factory

/home/ubuntu/quamer/Fine-Tuning-Mistral-7B-Using-Llama-Factory/LLaMA-Factory


In [None]:
%pip install -r requirements.txt

In [None]:
%pip install bitsandbytes

# Setup Llama-Factory

This will run on localhost only.

In [None]:
!CUDA_VISIBLE_DEVICES=0 python src/train_web.py

If you want an external URL, set share=True in train_web.py, as shown in the image below.

<img src="assets/train_web.png" alt="share=True">

```
# Entry to add to data/dataset_info.json so the dataset can be selected by name.
dataset = {
    "dockerNLCommands": {
        "hf_hub_url": "MattCoddity/dockerNLcommands",
        "columns": {
            "prompt": "instruction",
            "query": "input",
            "response": "output"
        }
    }
}

# Custom template to register in src/llmtuner/data/template.py.
_register_template(
    name="mistral-instruction-v02",
    format_user=StringFormatter(slots=["[INST] {{content}} [/INST]"]),
    format_system=StringFormatter(slots=[{"bos_token"}, "{{content}}"]),
    force_system=True,
)
```
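
The dataset entry above has to end up in the dataset registry file before the name `dockerNLCommands` can be used. Below is a minimal sketch of doing that programmatically. It assumes the registry is the `dataset_info.json` file inside the `data` directory; the helper name `register_dataset` is hypothetical, and the demo writes to a temporary file rather than the real registry.

```python
import json
import tempfile
from pathlib import Path


def register_dataset(info_path: Path, name: str, entry: dict) -> dict:
    """Add (or overwrite) one dataset entry in a dataset_info.json-style file."""
    # Start from the existing registry if the file is present, else from an empty one.
    if info_path.exists():
        registry = json.loads(info_path.read_text(encoding="utf-8"))
    else:
        registry = {}
    registry[name] = entry
    info_path.write_text(json.dumps(registry, indent=2, ensure_ascii=False), encoding="utf-8")
    return registry


docker_entry = {
    "hf_hub_url": "MattCoddity/dockerNLcommands",
    "columns": {"prompt": "instruction", "query": "input", "response": "output"},
}

# Demo on a temporary file. In the repository, the assumed target is data/dataset_info.json.
with tempfile.TemporaryDirectory() as tmp:
    info_path = Path(tmp) / "dataset_info.json"
    registry = register_dataset(info_path, "dockerNLCommands", docker_entry)
    saved = json.loads(info_path.read_text(encoding="utf-8"))

print(json.dumps(saved, indent=2))
```

Editing the file by hand works just as well; the point is only that the key (`dockerNLCommands`) must match the value later passed as the dataset name.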

In [4]:
!CUDA_VISIBLE_DEVICES=0 python src/train_web.py

Running on local URL:  http://0.0.0.0:7860
Running on public URL: https://f8916ac789600c46d7.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
03/13/2024 19:47:12 - INFO - llmtuner.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: False, compute dtype: torch.bfloat16
[INFO|tokenization_utils_base.py:2046] 2024-03-13 19:47:12,901 >> loading file tokenizer.model from cache at /home/ubuntu/.cache/huggingface/hub/models--mistralai--Mistral-7B-Instruct-v0.2/snapshots/cf47bb3e18fe41a5351bc36eef76e9c900847c89/tokenizer.model
[INFO|tokenization_utils_base.py:2046] 2024-03-13 19:47:12,901 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2046] 2024-03-13 19:47:12,901 >> loading file special_tokens_map.json from cache at /home/ubuntu/.cache/huggingface/hub/models--mistralai--Mistral-7B-Instruct-v0.2/

In [None]:
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --do_train True \
    --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 \
    --finetuning_type lora \
    --quantization_bit 4 \
    --template mistral-instruction-v02 \
    --flash_attn True \
    --dataset_dir data \
    --dataset dockerNLCommands \
    --cutoff_len 512 \
    --learning_rate 0.0002 \
    --num_train_epochs 10.0 \
    --max_samples 10000 \
    --per_device_train_batch_size 64 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --max_grad_norm 0.3 \
    --logging_steps 5 \
    --save_steps 50 \
    --warmup_steps 0 \
    --optim adamw_torch \
    --output_dir saves/Custom/lora/train_2024-03-13-10-46-24 \
    --bf16 True \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.1 \
    --lora_target q_proj,v_proj \
    --val_size 0.1 \
    --evaluation_strategy steps \
    --eval_steps 50 \
    --per_device_eval_batch_size 64 \
    --load_best_model_at_end True \
    --plot_loss True
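
To get a feel for what the batch settings in this command imply, here is a small back-of-the-envelope sketch. The batch size, accumulation steps, and validation split come from the command above; the dataset size is a hypothetical placeholder, and the step count is only an approximation of what the trainer will report.

```python
import math


def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Number of samples that contribute to one optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices


def approx_steps_per_epoch(num_train_samples: int, eff_batch: int) -> int:
    """Rough number of optimizer updates per epoch."""
    return math.ceil(num_train_samples / eff_batch)


# Values taken from the command: batch size 64, 8 accumulation steps, one visible GPU.
eff = effective_batch_size(per_device_batch=64, grad_accum_steps=8, num_devices=1)

# HYPOTHETICAL dataset size, used only to illustrate the arithmetic.
hypothetical_total = 2000
val_samples = round(hypothetical_total * 0.1)  # --val_size 0.1
train_samples = hypothetical_total - val_samples
steps = approx_steps_per_epoch(train_samples, eff)

print(f"effective batch size: {eff}")
print(f"approx. optimizer steps per epoch: {steps}")
```

With an effective batch size this large, a small dataset yields only a handful of optimizer steps per epoch, which is worth keeping in mind when choosing the logging, saving, and evaluation intervals.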

After we launch the web version of Llama-Factory, we can access the Gradio interface by visiting the URL shown in the terminal. We're given two URLs: a local one and an external (public) one, and either can be used to reach the interface. If you're using Google Colab, you can only use the public URL, whereas on a Lambda Cloud system you can use both the local and the public URL.

Let's now look at the llama-factory web interface.

<img src="assets/gradio-interface.png" alt="Llama-Factory Gradio interface">

Woah! That's a lot of options!

There is a lot to choose from: the model, the dataset, the hyperparameters, the training options, and more. Everything is customizable. The web interface also shows the training logs and the training progress. This makes it a powerful tool for training models in the cloud without having to worry about the coding part.

Let's first go through the configurations that Llama-Factory provides.

As of now, the Llama-Factory UI supports 3 languages: en (English), ru (Russian), and zh (Chinese). The default is English, and we can change it from the language dropdown menu.

## Model Configurations

Llama-Factory offers a wide range of models, which we can select from the dropdown menu. This is what makes Llama-Factory so powerful: all the popular models are available. In the backend, Llama-Factory uses the Hugging Face model hub to fetch the models.

<img src="assets/model_selection.png" alt="Model selection dropdown">

Although Llama-Factory ships with many models, it still gives us the flexibility to use custom ones. To use a custom model, select the "Custom" option from the dropdown menu and enter the model's path on the Hugging Face model hub in the Model Path input field.


In this blog, we'll be using a custom model: the instruction-fine-tuned Mistral model. Before we proceed, let's first understand what Mistral-7B-Instruct-v0.2 is. It is an instruction-fine-tuned version of Mistral-7B, a large language model with 7.3 billion parameters. Despite being comparatively small, Mistral-7B outperforms larger models such as Llama 2 (13 billion parameters) on various benchmarks.

Mistral-7B uses grouped-query attention and sliding window attention to handle sequences of arbitrary length efficiently. [Grouped-query attention](https://klu.ai/glossary/grouped-query-attention) speeds up attention by splitting the query heads into groups. Each group shares a single key head and value head, which reduces computation and memory compared to standard multi-head attention. [Sliding window attention](https://klu.ai/glossary/sliding-window-attention) handles long sequences by letting each token attend only to a fixed-size window of nearby tokens. The window slides along the sequence, so each section is processed efficiently. By combining these two techniques, Mistral-7B offers a good balance between speed and performance. It does well on tasks such as reasoning, math, and code generation. If you want to know more about Mistral-7B, you can read the paper by visiting [this link](https://arxiv.org/abs/2310.06825).
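
To make the two attention techniques more concrete, here is a minimal, illustrative sketch in plain Python. It is not Mistral's implementation: the function names are hypothetical, the tiny sequence length and window are chosen for readability, and the head counts (32 query heads, 8 key/value heads) are the values reported for Mistral-7B.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    A position sees itself and at most (window - 1) earlier positions, never later ones.
    """
    return [[(j <= i) and (i - j < window) for j in range(seq_len)] for i in range(seq_len)]


def kv_head_for_query_head(query_head: int, n_query_heads: int, n_kv_heads: int) -> int:
    """Index of the key/value head shared by the group that this query head belongs to."""
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size


# Toy example: 6 tokens, each attending to a window of 3 positions.
mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

# 32 query heads sharing 8 key/value heads: every 4 consecutive query heads form a group.
mapping = [kv_head_for_query_head(h, n_query_heads=32, n_kv_heads=8) for h in range(32)]
print(mapping)
```

The printed mask is a diagonal band: the cost per token depends on the window size rather than on the full sequence length. The mapping shows why grouped-query attention needs to store only 8 sets of keys and values instead of 32.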

Now let's get back to Llama-Factory. It does provide the Mistral instruct model under the name Mistral-7B-Chat, but that is v0.1. We'll be using v0.2, which is an improved version of v0.1. You can find the model on the Hugging Face model hub by visiting [this link](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2). This is the base model we'll fine-tune. Copy the model path from the model hub and paste it into the Model Path input field.

<img src="assets/model config.png" alt="Model configuration">

## Fine Tuning Configurations

Once we have selected the model, we can move on to the next configuration, i.e., the fine-tuning configurations. Llama-Factory provides 3 different options to fine-tune the model: full, freeze, and lora. By default, the fine-tuning option is set to lora. We can change it by selecting another option from the dropdown menu.

Let's understand what these options mean. With the full option, the entire model is fine-tuned, meaning that all of its parameters are updated. With the freeze option, most of the model's parameters are frozen and only a small subset of layers is fine-tuned. The default option, lora, stands for Low-Rank Adaptation. It freezes the model's original parameters and adds a pair of small low-rank matrices to selected layers of the model; only these added matrices are trained. This significantly reduces the training time, memory usage, and computational power needed. It's like tweaking a small dial on a large machine for precise adjustments. This is a very powerful technique and a go-to option when we want to fine-tune large models. If you want to know more about low-rank adaptation, you can read the paper by visiting [this link](https://arxiv.org/abs/2106.09685).
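
The savings from low-rank adaptation are easy to estimate. For a weight matrix of shape d_out × d_in, a rank-r adapter adds two matrices with r × (d_out + d_in) parameters in total. The sketch below applies this to the query and value projections with rank 8. The layer dimensions are assumptions taken from the published Mistral-7B configuration (hidden size 4096, 8 key/value heads of size 128, 32 layers), and the function name is hypothetical.

```python
def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter: B is (d_out x rank), A is (rank x d_in)."""
    return rank * (d_out + d_in)


# Assumed Mistral-7B dimensions (not stated in this post).
HIDDEN = 4096        # model hidden size, also the q_proj output size
KV_DIM = 8 * 128     # v_proj output size: 8 key/value heads of dimension 128
N_LAYERS = 32
RANK = 8

q_adapter = lora_param_count(HIDDEN, HIDDEN, RANK)
v_adapter = lora_param_count(KV_DIM, HIDDEN, RANK)
trainable = N_LAYERS * (q_adapter + v_adapter)

# Size of the frozen q_proj and v_proj weights that the adapters sit next to.
frozen_targets = N_LAYERS * (HIDDEN * HIDDEN + KV_DIM * HIDDEN)

print(f"trainable LoRA parameters: {trainable:,}")
print(f"share of the targeted weights: {trainable / frozen_targets:.2%}")
print(f"share of a 7.3B-parameter model: {trainable / 7.3e9:.3%}")
```

Under these assumptions, only a few million parameters are trained, a tiny fraction of the full model, which is where the memory and compute savings come from.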

<img src="assets/finetuning_method.png" alt="Fine-tuning method dropdown">

After we have selected the fine-tuning option, we get the option to pass something called Adaptors. Adaptors are nothing but model checkpoints. We can pass a checkpoint by providing its path in the Adaptors input field. This is an optional field. If you're working with Llama-Factory for the first time, you will not have any checkpoint that has been adapted to a specific task, so you can leave this field empty. Once you've trained a model using Llama-Factory, you will get options to pass the adaptors. You can use such a checkpoint to resume training, to fine-tune the model further, or to run inference. We will revisit this field once we have trained a model using Llama-Factory.

## Advanced Configurations

<img src="assets/advanced_config.png" alt="Advanced configurations">

Now that we know how to select the model and the fine-tuning options, we can now move to the next configuration, i.e., Advanced Configurations. Llama-factory provides us with a lot of advanced configurations that help us to fine-tune the model efficiently. There are mainly 4 different options available in the advanced configurations. They are: Quantization Bit, Prompt Template, RoPE Scaling, and Boosters. Let's understand what these options mean.


Quantization Bit: Quantization is a technique that reduces the precision of the model's parameters, which lowers the memory usage and computational power needed. Llama-Factory gives us the option to quantize the model's parameters. We can select the quantization bit from the dropdown menu. By default, the quantization bit is set to 4. The available options are 4 and 8. Llama-Factory uses the QLoRA technique, which combines quantization of the frozen base model with low-rank adaptation. This makes it possible to fine-tune very large models on a single GPU. If you want to know more about the QLoRA technique, you can read the paper by visiting [this link](https://arxiv.org/abs/2305.14314).
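
A quick calculation shows why the quantization bit matters. The sketch below estimates the memory needed just to hold the weights of a 7.3-billion-parameter model at different precisions. It is a rough lower bound under simple assumptions: it ignores activations, optimizer state, the LoRA weights, and the small overhead of quantization constants.

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits / 8 / 1e9


N_PARAMS = 7.3e9  # parameter count quoted above for Mistral-7B

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(N_PARAMS, bits):.2f} GB")
```

Going from 16-bit to 4-bit weights cuts the weight memory to a quarter, which is what brings a 7B model within reach of a single consumer or entry-level cloud GPU.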

<img src="assets/template path.png" alt="Location of template.py">
<img src="assets/template.png" alt="Registered templates in template.py">

Prompt Template: We also need to provide the prompt template. You can find the prompt template on the respective model's page on the Hugging Face model hub. For example, the prompt template for the Mistral-7B-Instruct-v0.2 model is [INST] {{content}} [/INST]. So we first need to create a prompt template, and for that we need to modify some code in the src/llmtuner/data/template.py file (see the images above). In this file, we need to register the template. Llama-Factory provides many pre-registered templates, and we can use any of them based on our requirements. If we want to use a custom template, we can add the following code to the template.py file.

```
_register_template(
    name="mistral-instruction-v02",
    format_user=StringFormatter(slots=["[INST] {{content}} [/INST]"]),
    format_system=StringFormatter(slots=[{"bos_token"}, "{{content}}"]),
    force_system=True,
)
```
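
To see what this template produces, here is a small standalone sketch that formats one training example the same way. It is an illustration, not Llama-Factory's code: the function name is hypothetical, joining the instruction and input with a newline is an assumption, `<s>` is assumed to be the tokenizer's BOS token, and the example row is made up.

```python
def build_mistral_prompt(instruction: str, query: str = "", bos_token: str = "<s>") -> str:
    """Format an instruction (plus optional input) in the [INST] ... [/INST] style."""
    # Assumption: instruction and input are joined with a newline when both are present.
    content = instruction if not query else f"{instruction}\n{query}"
    # The system part is empty here, so only the BOS token precedes the user turn.
    return f"{bos_token}[INST] {content} [/INST]"


# Hypothetical example row with the dataset's "instruction" and "input" columns.
prompt = build_mistral_prompt(
    instruction="translate this sentence in docker command",
    query="Show me all running containers.",
)
print(prompt)
```

During fine-tuning, the value of the "output" column would be the text the model learns to generate after the closing `[/INST]` tag.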



RoPE Scaling: RoPE stands for Rotary Position Embeddings. LLMs use RoPE to encode the relative position of tokens within a sequence. RoPE scaling modifies the RoPE calculations to improve the model's ability to handle longer sequences, for example by tweaking the base value used in those calculations. This value controls the rate at which the sine and cosine functions oscillate, which in turn affects the position-dependent embeddings. Increasing the base value can spread out the embeddings, making them more distinct for longer sequences, while decreasing it can introduce periodicity, allowing the model to handle longer sequences that wrap around this cycle.

Llama-Factory gives us the option to scale the RoPE. By default, the RoPE scaling is set to None. The available options are None, Linear, and Dynamic. Linear RoPE scaling scales the wavelength linearly by the ratio of the intended maximum sequence length to the model's original maximum sequence length. This adjustment ensures that the entire period window is fully utilized by all token positions when the wavelength is less than the context length. Dynamic RoPE scaling adjusts the base with a coefficient that increases with the sequence length at inference time. It is particularly useful for adapting RoPE to longer contexts without fine-tuning. We can select the RoPE scaling based on our requirements.

If you want to know more about RoPE scaling, you can read the paper by visiting [this link](https://arxiv.org/abs/2310.05209) or look at [blog 1](https://www.hopsworks.ai/dictionary/rope-scalingg) and [blog 2](https://blog.eleuther.ai/yarn/).
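
The two scaling modes can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Llama-Factory's code: the function names are hypothetical, the base of 10000 and head dimension of 128 are common defaults, and the dynamic formula is the NTK-style base adjustment used by some open-source implementations.

```python
import math


def inv_frequencies(dim: int, base: float = 10000.0) -> list[float]:
    """Per-pair rotation frequencies: base^(-2i/dim) for i = 0 .. dim/2 - 1."""
    return [base ** (-(2 * i) / dim) for i in range(dim // 2)]


def rope_angles(position: float, dim: int, base: float = 10000.0) -> list[float]:
    """Rotation angle of each dimension pair at a given position."""
    return [position * freq for freq in inv_frequencies(dim, base)]


def linear_scaled_angles(position: int, dim: int, factor: float, base: float = 10000.0) -> list[float]:
    """Linear scaling: squeeze positions by `factor` so they fit the original range."""
    return rope_angles(position / factor, dim, base)


def dynamic_ntk_base(seq_len: int, max_pos: int, dim: int, factor: float, base: float = 10000.0) -> float:
    """Dynamic scaling: enlarge the base only once the sequence exceeds the trained length."""
    if seq_len <= max_pos:
        return base
    return base * ((factor * seq_len / max_pos) - (factor - 1)) ** (dim / (dim - 2))


# With factor 2, position 8192 is rotated exactly like position 4096 was originally.
scaled = linear_scaled_angles(8192, dim=128, factor=2.0)
original = rope_angles(4096, dim=128)
print(all(math.isclose(a, b) for a, b in zip(scaled, original)))

# The dynamic base stays at 10000 within the trained length and grows beyond it.
print(dynamic_ntk_base(4096, max_pos=4096, dim=128, factor=2.0))
print(dynamic_ntk_base(8192, max_pos=4096, dim=128, factor=2.0))
```

Linear scaling applies the same compression to every sequence, whereas the dynamic variant leaves short sequences untouched and only stretches the base for inputs longer than the original context length.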

## How to train a model using Llama-Factory?

### Training Configurations

### Extra Configurations