Do you want to build an LLM for your custom task? Are you feeling overwhelmed by the amount of code and the complexity of the model? Do you want to build an LLM but don't know how to code? Then this blog is for you!

With the rise of large language models (LLMs) like GPT, many people are interested in building their own custom LLMs for their business or research. However, building an LLM is not an easy and straightforward task. It requires a lot of coding and a deep understanding of the model architecture and training process. This can be overwhelming for people who are not familiar with machine learning and natural language processing.

In this blog, I will introduce a new tool called Llama-Factory that allows you to build your own custom LLM without writing any code. Llama-Factory offers a user-friendly interface that guides you through the process of building an LLM step by step. You can customize the model configuration, train the model, and use it for your custom task. You can also publish your LLM and share it on the Hugging Face model hub.

Although Llama-Factory provides a no-code interface, a basic understanding of LLMs and their associated concepts and configurations is very helpful. In this blog, I will fine-tune an LLM using Llama-Factory and then use it to generate text for a custom task. I will also go into some detail about the configurations and concepts used in the process. Understanding these concepts is essential, as model performance depends heavily on these configurations.

So, let's get started!

# Clone Llama-Factory and Install Dependencies

We need to clone the Llama-Factory repository and install the required dependencies. We will also need to install the `bitsandbytes` package, which is used for quantization of the model.

In [None]:
# Clone the LLaMA-Factory repository
!git clone https://github.com/hiyouga/LLaMA-Factory.git

In [None]:
# Change directory to LLaMA-Factory
%cd LLaMA-Factory

In [None]:
# Install requirements
%pip install -r requirements.txt

In [None]:
# Install bitsandbytes
%pip install bitsandbytes

# Setup Llama-Factory

Now that we've installed all the dependencies, we can start setting up Llama-Factory. Llama-Factory supports more than one way to train an LLM: you can train a model from the command line interface (CLI) or use the web interface. In this blog, I will use the web interface to train the model.

To launch the web interface, run the following command:

```bash
CUDA_VISIBLE_DEVICES=0 python src/train_web.py
```

This will start the web interface on the local server. You can access the web interface by opening the provided URL in your web browser.

If you are using Colab or any other cloud environment, you cannot reach the web interface through the local server. You need a public/external URL to access it. To get one, we'll need to edit the `src/train_web.py` file and change the `share` parameter to `True`.

<img src="assets/train_web.png" alt="share=True">

Once this is done, you can now re-run the following command:

```bash
CUDA_VISIBLE_DEVICES=0 python src/train_web.py
```

This will start the web interface and provide you with a public URL that you can use to access the web interface.

Let's now look at the llama-factory web interface.

<img src="assets/gradio-interface.png" alt="share=True">

Woah! That's a lot of options!

We can see that we have a lot of options to choose from. We can select the model, the dataset, the hyperparameters, the training options, and more. Everything is customizable. We can also see the training logs and the training progress in the web interface. This is a very powerful tool and can be used to train models on the cloud without having to worry about the coding part.

Let's now first try to understand all the available configurations provided by the llama-factory.

As of now, Llama-factory supports 3 different languages in its UI: en (English), ru (Russian), and zh (Chinese). By default, the language is set to English. We can change the language by selecting it from the dropdown menu.

## Model Configurations

Llama-factory provides a lot of models to choose from. We can select the model from the dropdown menu. This is what makes llama-factory so powerful. All the popular models are available to choose from. In the backend, llama-factory uses the Hugging Face model hub to fetch the models.

<img src="assets/model_selection.png" alt="share=True">

Though llama-factory provides a lot of models, it still gives us the flexibility to use custom models. We can use the custom model by providing the model path from the Hugging Face model hub. In order to use the custom model, we need to select the "Custom" option from the dropdown menu and provide the model path in the Model Path input field.


In this blog, we'll be using a custom model: the instruction fine-tuned Mistral model. Before we proceed, let's first understand what the Mistral-7B-Instruct-v0.2 model is. It is an instruction fine-tuned version of the Mistral-7B model, a large language model with 7.3 billion parameters. Despite being a comparatively small model, Mistral-7B has outperformed larger models like Llama 2 (13 billion parameters) on various benchmarks. The architecture described in the Mistral-7B paper uses grouped-query attention and sliding window attention to handle long sequences efficiently. [Grouped-query attention](https://klu.ai/glossary/grouped-query-attention) is a technique that speeds up attention by dividing the query heads into groups. Each group shares a single key head and value head, which reduces computation and memory compared to standard multi-head attention. [Sliding window attention](https://klu.ai/glossary/sliding-window-attention) handles long sequences by letting each token attend only to a fixed-size window of the most recent tokens. The window slides along the sequence, so the cost per token stays bounded. Combining these two key techniques, Mistral-7B offers a good balance between speed and performance. It excels in tasks like reasoning, math, and code generation. If you want to know more about the Mistral-7B model, you can read the paper by visiting [this link](https://arxiv.org/abs/2310.06825).
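To make the sliding-window idea concrete, here is a minimal sketch in plain Python of the attention mask it implies. The function name `sliding_window_mask` and the tiny sizes are illustrative assumptions, not Mistral's actual implementation.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Return mask[i][j] == True when token i may attend to token j.

    Token i sees itself and at most `window - 1` preceding tokens (causal).
    """
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    # "x" marks a position the token can attend to, "." a masked position
    print("".join("x" if allowed else "." for allowed in row))
```

Each row contains at most `window` allowed positions, which is why the attention cost per token does not grow with the sequence length.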

Now let's get back to llama-factory. Llama-factory does provide the Mistral instruct model under the name Mistral-7B-Chat, but that is v0.1. We'll be using v0.2, which is an improved version of Mistral-7B-Instruct-v0.1. You can find the model on the Hugging Face model hub by visiting [this link](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2). We'll be using this model as the base for our fine-tuning. You can copy the model path from the model hub and paste it in the Model Path input field.

<img src="assets/model config.png" alt="share=True">

## Fine Tuning Configurations

Once we have selected the model, we can move to the next configuration, i.e., Fine Tuning Configurations. Llama-factory provides us with 3 different options to fine-tune the model: full, freeze, and lora. By default, the fine-tuning option is set to lora. We can change it by selecting another option from the dropdown menu. Let's understand what these options mean. When we select the full option, the entire model is fine-tuned, meaning that all the layers of the model are updated. When we select the freeze option, most of the model's parameters are frozen and only a small subset of layers is trained, which is much cheaper than full fine-tuning.

The default option, lora, stands for Low Rank Adaptation. When we select it, the model is fine-tuned using low-rank matrices. It essentially freezes the model's original parameters and adds a pair of small low-rank matrices to selected layers of the model; only these small matrices are trained. This significantly reduces the training time, memory usage, and computational power needed. It's like tweaking a small dial on a large machine for precise adjustments. This is a very powerful technique and is a go-to option when we want to fine-tune large models. If you want to know more about the low rank adaptation technique, you can read the paper by visiting [this link](https://arxiv.org/abs/2106.09685).
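As a rough illustration of the low-rank idea, the sketch below builds the effective weight `W0 + (alpha / rank) * B @ A` in plain Python. The helper names (`matmul`, `lora_weight`) and the toy dimensions are assumptions made for this example; real implementations such as the PEFT library work on tensors and differ in detail.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w0, a, b, alpha, rank):
    """Return W0 + (alpha / rank) * B @ A, the weight with the adapter applied."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]


random.seed(0)
d_out, d_in, rank, alpha = 6, 8, 2, 4

# Frozen pre-trained weight (d_out x d_in)
w0 = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# Trainable low-rank factors: A is rank x d_in, B is d_out x rank (zero-initialised)
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
b = [[0.0] * rank for _ in range(d_out)]

w_eff = lora_weight(w0, a, b, alpha, rank)

full_params = d_out * d_in            # parameters updated by full fine-tuning
lora_params = rank * (d_in + d_out)   # parameters updated by LoRA
print(full_params, lora_params)
```

Because `B` starts at zero, the adapted weight initially equals the frozen weight, so training starts from the base model's behaviour. The saving grows with layer size: for a 4096 x 4096 projection, rank 8 means 65,536 trainable values instead of 16,777,216.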

<img src="assets/finetuning_method.png" alt="share=True">

After we have selected the fine-tuning option, we get the option to pass something called Adaptors. Adaptors are simply checkpoints saved from an earlier fine-tuning run. We can pass a checkpoint by providing its path in the Adaptors input field. This is an optional field. If you're working with llama-factory for the first time, you will not have any checkpoint that has been adapted to a specific task, so you can leave this field empty. Once you've trained a model using llama-factory, you will get options to pass the adaptors. You can use such a checkpoint to resume training, to fine-tune the model further, or to run inference. We will revisit this field once we have trained a model using llama-factory.

## Advanced Configurations

<img src="assets/advanced_config.png" alt="share=True">

Now that we know how to select the model and the fine-tuning options, we can now move to the next configuration, i.e., Advanced Configurations. Llama-factory provides us with a lot of advanced configurations that help us to fine-tune the model efficiently. There are mainly 4 different options available in the advanced configurations. They are: Quantization Bit, Prompt Template, RoPE Scaling, and Boosters. Let's understand what these options mean.


Quantization Bit: Quantization is a technique that reduces the numerical precision of the model's parameters, which lowers the memory usage and the computational power needed. Llama-factory lets us quantize the model's parameters by selecting the quantization bit from the dropdown menu. By default, the quantization bit is set to 4. The available options are 4 and 8. Llama-factory uses the QLoRA technique here, which combines quantization with low-rank adaptation: the frozen base weights are stored in low precision while only the small LoRA matrices are trained. This method allows us to fine-tune massive models on a single GPU. If you want to know more about the QLoRA technique, you can read the paper by visiting [this link](https://arxiv.org/abs/2305.14314).
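The trade-off between bit width and precision can be seen in a small sketch. This uses simple absmax rounding to signed integers, which is a deliberate simplification: QLoRA itself uses a more elaborate 4-bit format, and the function names here are made up for the example.

```python
def quantize_absmax(values, bits=8):
    """Map floats to signed integers in [-(2**(bits-1) - 1), 2**(bits-1) - 1]."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 0.9, -0.07]

q8, s8 = quantize_absmax(weights, bits=8)
q4, s4 = quantize_absmax(weights, bits=4)

# Worst-case reconstruction error for each bit width
err8 = max(abs(w - r) for w, r in zip(weights, dequantize(q8, s8)))
err4 = max(abs(w - r) for w, r in zip(weights, dequantize(q4, s4)))
print("8-bit:", q8, err8)
print("4-bit:", q4, err4)
```

Fewer bits mean a smaller memory footprint but a coarser grid of representable values, so the 4-bit reconstruction error is larger than the 8-bit one.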

<img src="assets/template path.png" alt="share=True">
<img src="assets/template.png" alt="share=True">

Prompt Template: We also need to provide the prompt template. You can find the prompt template on the respective model's Hugging Face model hub page. For example, the prompt template for the Mistral-7B-Instruct-v0.2 model is [INST] {{content}} [/INST]. So we first need to create a prompt template, and for that, we need to modify some code in the src/llmtuner/data/template.py file (shown in the images above). In this file we need to register the template. Here we'll see that llama-factory provides us with many pre-registered templates. We can use any of them based on our requirement. If we want to use a custom template, we can do that by adding the following code to the template.py file.

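```python
# Register a custom template in src/llmtuner/data/template.py
_register_template(
    name="mistral-instruction-v02",
    # Wrap every user message in Mistral's [INST] ... [/INST] markers
    format_user=StringFormatter(slots=["[INST] {{content}} [/INST]"]),
    # A set such as {"bos_token"} refers to one of the tokenizer's special tokens
    format_system=StringFormatter(slots=[{"bos_token"}, "{{content}}"]),
    force_system=True,
)
```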
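To see what such a template produces, here is a tiny stand-alone sketch that renders a single user turn. The `render_prompt` helper is hypothetical and is not part of llama-factory, and the `<s>` beginning-of-sequence token is assumed from the format shown on the model card.

```python
def render_prompt(user_message: str, bos_token: str = "<s>") -> str:
    """Wrap a user message in the Mistral instruct markers."""
    return f"{bos_token}[INST] {user_message} [/INST]"


print(render_prompt("List all running containers."))
```

The model's answer is expected to follow the closing `[/INST]` marker, which is why the template ends there.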



RoPE Scaling: RoPE stands for Rotary Position Embeddings. RoPE is used in LLMs to encode the relative position of tokens within a sequence. RoPE scaling modifies the RoPE calculations to improve the model's ability to handle longer sequences. One way to do this is to tweak the base value used in the RoPE calculations. This value controls the rate at which the sine and cosine functions oscillate, which in turn affects the position-dependent embeddings. Increasing the base value can spread out the embeddings, making them more distinct for longer sequences, while decreasing it can introduce periodicity, allowing the model to handle longer sequences that wrap around this cycle. Llama-factory provides us with the option to scale the RoPE. The available options are None, Linear, and Dynamic, and the default is None. Linear RoPE scaling stretches the wavelengths linearly by the ratio of the intended maximum sequence length to the model's original maximum sequence length, which is equivalent to dividing each position index by that ratio. This adjustment ensures that the entire period window is fully utilized by all token positions when the wavelength is less than the context length. Dynamic RoPE scaling adjusts the base with a coefficient that increases with the length of the input at inference time. It is particularly useful for adapting RoPE to longer contexts without fine-tuning. We can select the RoPE scaling based on our requirement. If you want to know more about RoPE scaling, you can read the paper by visiting [this link](https://arxiv.org/abs/2310.05209) or look at this [blog 1](https://www.hopsworks.ai/dictionary/rope-scalingg) and [blog 2](https://blog.eleuther.ai/yarn/).
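The linear variant is the easiest to sketch. The snippet below computes the RoPE rotation angles for one position and shows that, with a scale factor of 2, position 8000 receives the same angles that position 4000 had in the original model. The function name, the head dimension of 8, and the context lengths are illustrative assumptions.

```python
import math


def rope_angles(position, dim, base=10000.0, scale=1.0):
    """Rotation angles for one token position.

    `scale` > 1 applies linear scaling: the position index is divided by
    `scale` before the angles are computed.
    """
    pos = position / scale
    return [pos / base ** (2 * i / dim) for i in range(dim // 2)]


original_ctx, target_ctx = 4096, 8192
scale = target_ctx / original_ctx  # 2.0

scaled = rope_angles(8000, dim=8, scale=scale)   # beyond the original context
in_range = rope_angles(4000, dim=8)              # inside the original context
print(all(math.isclose(s, r) for s, r in zip(scaled, in_range)))
```

In other words, linear scaling squeezes the longer sequence into the range of positions the model already saw during pre-training.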

Boosters: Boosters are a way to improve the model's efficiency and increase its speed. Llama-factory provides us with three different options for boosters: None, Flash-Attention, and UnSloth. By default, the booster is set to None. When we select the flash-attention option, llama-factory uses the flash-attention algorithm in the backend to increase the model's speed. Flash Attention is an attention algorithm used to reduce the memory bottleneck in the attention mechanism. It is particularly useful for scaling transformer-based models more efficiently, enabling faster training and inference. If you want to know more about the flash-attention algorithm, you can read the paper by visiting [this link](https://arxiv.org/abs/2205.14135) or follow this [blog](https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention). When we select the unsloth option, llama-factory uses the unsloth tool in the backend to increase the model's speed. UnSloth is a tool that speeds up training by manually deriving the matrix differentials for the backward pass instead of relying on the automatic differentiation done by the deep learning framework. According to its authors, this results in a significant speedup. If you want to know more about the UnSloth tool, you can look into their website by visiting [this link](https://unsloth.ai/). We can select the booster based on our requirement.

## How to train a model using Llama-Factory?

<img src="assets/train main config.png" alt="share=True">

In the above image we can see that llama-factory provides us with a lot of training configurations. Let's understand each of them.

Now that we've looked at how we can configure the model, let's look at how we can train a model using llama-factory. As you might know, LLM training is not a straightforward task. It requires a lot of computational power, memory, data, and time. We normally start by pre-training a model on a large dataset. Then, if using RLHF (reinforcement learning from human feedback), we first create a reward model and then optimize the LLM using the PPO (Proximal Policy Optimization) algorithm. This is done to align the LLM with human preferences. Since RLHF is tricky to get right, llama-factory also provides the DPO (Direct Preference Optimization) algorithm, which aligns the LLM with human preferences more simply, without a separate reward model. Once the model is aligned with human preferences, we also get the option of fine-tuning the model on a specific task. Llama-factory provides us with all these options.

<img src="assets/stage.png" alt="share=True">

For the purposes of this blog, we're only going to look into supervised fine-tuning. The use case is to fine-tune the Mistral-7B-Instruct-v0.2 model on the dockerNLCommands dataset. This dataset contains instructions and the corresponding commands. It is a very useful dataset and can be used to train a model that generates commands from instructions. Imagine creating an agent that can handle all the Docker-related tasks for you. This particular use case is the first step towards creating such an agent. So we'll select the supervised fine-tuning option from the stage dropdown menu. Once we have selected it, we get the option to select the dataset. We first need to provide the dataset path in the Data Dir input field. Once that's done, we need to select the dataset. Llama-factory provides us with a lot of pre-registered datasets, which we can select from the dropdown menu. If we want to use a custom dataset, then we'll have to make a few changes in the data/dataset_info.json file.

You can get this dataset from the Hugging Face Hub by visiting [this link](https://huggingface.co/datasets/MattCoddity/dockerNLcommands). This dataset is designed to translate natural language instructions into Docker commands. It contains mappings of textual phrases to corresponding Docker commands, aiding in the development of models capable of understanding and translating user requests into executable Docker instructions. This is a relatively small dataset with only about 2.42k rows.

<img src="assets/data info path.png" alt="share=True">
<img src="assets/dataset info.png" alt="share=True">

We basically need to add this code in the dataset_info.json file.

```
"dockerNLCommands": {
    "hf_hub_url": "MattCoddity/dockerNLcommands",
    "columns": {
        "prompt": "instruction",
        "query": "input",
        "response": "output"
    }
}
```

This tells llama-factory that we have a custom dataset that we want to load from Hugging Face. We also provide the columns that we want to use. The dockerNLCommands dataset has three columns: instruction, input, and output. We map them to prompt, query, and response respectively. Once this step is done, relaunch the web version of llama-factory. You can now select the dockerNLCommands dataset from the dropdown menu.
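If you prefer not to edit the JSON by hand, the same entry can be added with a few lines of Python. This is only a convenience sketch: `register_dataset` is a hypothetical helper written for this blog, not a llama-factory function, and it assumes the file is a single JSON object keyed by dataset name.

```python
import json
from pathlib import Path


def register_dataset(info_path, name, entry):
    """Add `entry` under `name` in a dataset_info.json-style file and save it."""
    path = Path(info_path)
    # Load the existing registry, or start a new one if the file is missing
    info = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    info[name] = entry
    path.write_text(json.dumps(info, indent=2), encoding="utf-8")
    return info


docker_entry = {
    "hf_hub_url": "MattCoddity/dockerNLcommands",
    "columns": {"prompt": "instruction", "query": "input", "response": "output"},
}

# Example call, run from the repository root:
# register_dataset("data/dataset_info.json", "dockerNLCommands", docker_entry)
```

Writing the file through `json.dumps` also guarantees valid JSON (double quotes, no trailing commas), which avoids a common source of errors when editing by hand.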

There is also a preview dataset option available in the llama-factory. We can use this option to preview the dataset. This functionality is only available for the datasets that are already present in the llama-factory. If you're using a custom dataset, you will not be able to preview the dataset. 

Once the dataset is selected, we need to set the hyperparameters. Here we have options for learning rate (we've set it to 2e-4; it controls the step size of the model optimization), batch size (we've set it to 16; it controls the number of samples that are propagated through the network in a single pass), epochs (we've set it to 10; it controls the number of times the learning algorithm works through the entire training dataset), Max Samples (we've set it to 10000; it caps the number of samples used for training), and val size (we've set it to 0.1; it controls the size of the validation set, which in turn helps in assessing the model's performance). Apart from these, there are other hyperparameters as well. We can select them based on our requirement. Let's discuss each of them.

Max Gradient Norm: This setting is used to prevent the exploding gradient problem. If the norm of the gradients exceeds the given maximum, the gradients are scaled down so that their norm equals that maximum. This is also known as gradient clipping. We can pass the value in the Max Gradient Norm input field. By default, the max gradient norm is set to 1.0. Here we've set it to 0.3. If you want to know more about gradient clipping, you can read this [blog](https://neptune.ai/blog/understanding-gradient-clipping-and-how-it-can-fix-exploding-gradients-problem).
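A minimal sketch of clipping by global norm, written in plain Python for clarity. The function name is made up for this example; in PyTorch the equivalent operation is provided by `torch.nn.utils.clip_grad_norm_`.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale all gradients down so their combined L2 norm is at most `max_norm`."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    factor = max_norm / total_norm
    return [g * factor for g in grads], total_norm


grads = [3.0, -4.0]  # L2 norm is 5.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=0.3)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)
```

Note that clipping rescales the whole gradient vector, so its direction is preserved and only its length changes.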

Gradient Accumulation: This is a technique used to increase the effective batch size without increasing memory usage. Since we only have 24 GB of GPU memory, we can't fit a larger batch size directly. Instead, the gradients are accumulated over multiple steps/batches before the weights are updated. We can pass the value in the Gradient Accumulation input field. By default, the gradient accumulation is set to 8, but we've set it to 4. If you want to know more about gradient accumulation, you can read this [blog](https://lightning.ai/blog/gradient-accumulation/).
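With a batch size of 16 and 4 accumulation steps, the effective batch size in this blog is 16 x 4 = 64. The toy sketch below (a one-parameter linear model with made-up data) checks that averaging the gradients of equally sized micro-batches gives the same result as one large batch.

```python
def mean_gradient(batch, w):
    """Gradient of the mean of 0.5 * (w * x - y)**2 with respect to w."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0)]
w = 0.5

# One pass over the full batch of 4 samples
full_batch_grad = mean_gradient(data, w)

# Two micro-batches of 2 samples, accumulated before the optimizer step
accumulation_steps = 2
micro_batches = [data[:2], data[2:]]
accumulated = 0.0
for micro in micro_batches:
    accumulated += mean_gradient(micro, w) / accumulation_steps

print(full_batch_grad, accumulated)
```

Only one micro-batch has to fit in GPU memory at a time, which is what makes the larger effective batch size affordable.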

LR Scheduler: This is a technique used to adjust the learning rate during training. It is used to improve the model's stability, convergence, and performance. Llama-factory provides us with the option to select the LR scheduler from the dropdown menu. By default, the LR scheduler is set to cosine. There are other options available as well. We can select the LR scheduler based on our requirement. Here we've set it as cosine. If you want to know more about the LR scheduler, you can read this [blog](https://towardsdatascience.com/a-visual-guide-to-learning-rate-schedulers-in-pytorch-24bbb262c863).
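For intuition, here is a small sketch of a cosine schedule with optional linear warmup. The function name and the step counts are assumptions for illustration; the exact formula used by the training backend may differ in details.

```python
import math


def cosine_lr(step, total_steps, peak_lr, warmup_steps=0):
    """Learning rate at `step`: linear warmup from 0, then cosine decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# No warmup, peak learning rate of 2e-4 (0.0002), 100 training steps
schedule = [cosine_lr(s, total_steps=100, peak_lr=2e-4) for s in range(101)]
print(schedule[0], schedule[50], schedule[100])
```

The learning rate starts at its peak, falls slowly at first, fastest around the midpoint, and flattens out near zero at the end of training.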

<img src="assets/lr_scheduler.png" alt="share=True">

Cutoff Length: This setting controls the maximum length of the input sequence in tokens; longer inputs are truncated. We can pass the cutoff length in the Cutoff Length input field. By default, the cutoff length is set to 1024. Here we've used 512 instead, again because of the memory constraints.

Compute Type: This tells llama-factory whether to use mixed precision and, if so, which format. We can select the compute type from the dropdown menu. Here we've set it to bf16. If you want to know more about mixed precision, you can read this [blog](https://moocaholic.medium.com/fp64-fp32-fp16-bfloat16-tf32-and-other-members-of-the-zoo-a1ca7897d407).

### Extra Configurations

<img src="assets/extra config.png" alt="share=True">

Once we've defined the training configs, we can now move to the next configuration, i.e., Extra Configurations. Llama-factory provides us with a lot of customizations that help us to fine-tune the model efficiently. There are mainly 5 different options available in the extra configurations. They are: Logging steps, Save steps, Warmup steps, NEFTune Alpha, and Optimizer. Let's understand what these options mean.

Logging steps: This tells llama-factory how often to log the training progress. We can pass the value in the Logging steps input field. By default, the logging steps is set to 5.

Save steps: This tells llama-factory how often to save a model checkpoint. We can pass the value in the Save steps input field. By default, the save steps is set to 100. We've set it to 50.

Warmup steps: Warmup techniques are used to adjust the learning rate during the initial phase of training. Warmup gradually increases the learning rate from 0 to the initial learning rate over a given number of steps. This helps in stabilizing the training process. In this blog we won't be using warmup, so we've set it to 0. If you want to know more about it you can read this [blog](https://medium.com/thedeephub/learning-rate-and-its-strategies-in-neural-network-training-270a91ea0e5c).

NEFTune Alpha: NEFTune is a technique used to add noise to the embedding vectors. This technique is used to improve the model's robustness and generalization. To know more about NEFTune, you can read the paper by visiting [this link](https://arxiv.org/abs/2310.05914). In this blog we won't be using the NEFTune Alpha, so we've set it as 0.
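The noise described in the NEFTune paper is uniform in [-1, 1] and scaled by `alpha / sqrt(seq_len * dim)`. The sketch below applies that rule to a toy embedding matrix; the function name and the sizes are assumptions for illustration.

```python
import math
import random


def neftune_noise(embeddings, alpha, rng):
    """Add uniform noise scaled by alpha / sqrt(seq_len * dim) to token embeddings."""
    seq_len, dim = len(embeddings), len(embeddings[0])
    magnitude = alpha / math.sqrt(seq_len * dim)
    return [[x + rng.uniform(-1.0, 1.0) * magnitude for x in row]
            for row in embeddings]


rng = random.Random(0)
embeddings = [[0.0] * 4 for _ in range(3)]  # 3 tokens, 4-dimensional embeddings

noisy = neftune_noise(embeddings, alpha=5.0, rng=rng)
unchanged = neftune_noise(embeddings, alpha=0.0, rng=rng)
print(noisy[0])
```

Setting alpha to 0, as we do in this blog, leaves the embeddings untouched. The noise is only applied during training, not at inference time.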

Optimizer: This tells llama-factory which optimizer to use. Here we've used a popular optimizer called AdamW. If you want to know more about the inner workings of optimizers, you can read this [blog](https://www.analyticsvidhya.com/blog/2021/10/a-comprehensive-guide-on-deep-learning-optimizers/).

Resize Token Embeddings: This technique is used to resize the token embeddings. If we want to increase the size, then newly initialized vectors will be added at the end. If we want to reduce the size, then vectors will be removed from the end. For this blog, we won't be using the Resize Token Embeddings, so we've set it as None. Read more about it [here](https://huggingface.co/docs/transformers/v4.22.2/en/main_classes/model#transformers.PreTrainedModel.resize_token_embeddings).

Pack Sequences: This technique is used to pack sequences into samples of fixed length in supervised fine-tuning. What this does is that it combines multiple examples into a single new example to fill the model's memory. This is done to make training more efficient and use the longer context of these LLMs. For this blog, we'll not be using the Pack Sequences flag, since we're using a relatively small dataset. If you want to know more about the packing sequences, you can read this [blog](https://wandb.ai/capecape/alpaca_ft/reports/How-to-Fine-Tune-an-LLM-Part-1-Preparing-a-Dataset-for-Instruction-Tuning--Vmlldzo1NTcxNzE2#packing:-combining-multiple-samples-into-a-longer-sequence).
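A simplified sketch of packing: tokenized examples are joined into one stream, separated by an end-of-sequence token, and the stream is cut into blocks of exactly `cutoff_len` tokens. The token ids are invented and the function is an illustration, not llama-factory's implementation.

```python
def pack_sequences(examples, cutoff_len, eos_id):
    """Concatenate tokenized examples (each followed by EOS) into fixed-length blocks."""
    stream = []
    for tokens in examples:
        stream.extend(tokens)
        stream.append(eos_id)
    # Drop the incomplete tail so every block has exactly `cutoff_len` tokens
    usable = len(stream) - len(stream) % cutoff_len
    return [stream[i:i + cutoff_len] for i in range(0, usable, cutoff_len)]


examples = [[11, 12, 13], [21, 22], [31, 32, 33, 34], [41]]
blocks = pack_sequences(examples, cutoff_len=4, eos_id=0)
for block in blocks:
    print(block)
```

Without packing, each of these four short examples would be padded to the cutoff length separately, so most of each sample would be padding.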

Upcast LayerNorm: This technique is used to upcast weights of layernorm from float16 to float32 when using QLoRA. This is done as fine-tuning with only float16 weights can be unstable. For this blog, we'll not be using the Upcast LayerNorm flag. 

Enable LLaMA Pro: This flag makes the parameters in the expanded blocks trainable. The technique was introduced in the LLaMA Pro paper: it expands the model with additional transformer blocks and trains only the parameters in those expanded blocks. This is done to improve the model's knowledge without catastrophic forgetting. For this blog, we'll not be using the Enable LLaMA Pro flag. If you want to know more about LLaMA Pro, you can read the paper by visiting [this link](https://arxiv.org/abs/2401.02415).

Enable S^2 Attention: This technique is used to enable the S^2 attention. This technique is used to effectively enable context extension, leading to non-trivial computation saving with similar performance to fine-tuning with vanilla attention. The way it is done is that it splits features along the head dimension into two chunks, then tokens in one of the chunks are shifted by half of the group size, and finally tokens are split into groups and reshaped into batch dimensions. You can read more about it in the LongLoRA paper by visiting [this link](https://arxiv.org/pdf/2309.12307.pdf). For this blog, we'll not be using the Enable S^2 Attention flag.

## LoRA Configurations

<img src="assets/lora config.png" alt="share=True">

LoRA is probably one of the most important pieces of LLM fine-tuning. As we discussed earlier, LoRA fine-tunes the model using low-rank adaptation. It helps in reducing the training time, memory usage, and computational power needed. Llama-factory provides us with a lot of LoRA configurations that help us fine-tune the model efficiently. Let's understand each of them briefly.

LoRA Rank: This is the inner dimension of the low-rank matrices. We can pass the LoRA rank in the LoRA Rank input field. A higher rank adds more trainable parameters and capacity, while a lower rank is cheaper to train. The rule of thumb followed in this blog is: the larger the model, the smaller the LoRA rank can be, and vice versa. By default, the LoRA rank is set to 8.

LoRA Alpha: This basically scales the learned weights, affecting how the adaptation layer's weights influence the base model. By default, the LoRA alpha is set to 16. To know more about the LoRA alpha, you can read this [blog](https://www.entrypointai.com/blog/lora-fine-tuning/) or follow the official paper.

LoRA Dropout: This basically is adding dropout to the low-rank adaptation layer. By default, the LoRA dropout is set to 0.1. Dropout is a technique used to prevent overfitting. It is used to randomly drop units from the neural network during training. If you want to know more about the dropout, you can read this [blog](https://www.analyticsvidhya.com/blog/2022/08/dropout-regularization-in-deep-learning/).

LoRA modules: This tells llama-factory which layers to apply the low-rank adaptation to. We can apply the low-rank adaptation to as many layers as we want, but this will increase the training time. By default, LoRA is applied only to the q_proj and v_proj layers of the attention modules. We can select the LoRA modules based on our requirement.

Use rslora: This flag enables the rank-stabilized LoRA (rsLoRA) method. This method proposes that the LoRA update should be scaled by dividing by the square root of the rank rather than by the rank itself. This enables a fine-tuning compute/performance trade-off, where larger ranks can be used to trade off increased computational resources during training for better fine-tuning performance, with no change in inference computing cost. For this blog, we'll not be using this configuration. If you want to know more about rsLoRA, you can read the paper by visiting [this link](https://arxiv.org/pdf/2312.03732.pdf).
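The difference between the two scaling rules is easy to see numerically. The sketch below compares `alpha / rank` (standard LoRA) with `alpha / sqrt(rank)` (rsLoRA) for a fixed alpha of 16; the function names are made up for this example.

```python
import math


def lora_scale(alpha, rank):
    """Standard LoRA scaling factor."""
    return alpha / rank


def rslora_scale(alpha, rank):
    """Rank-stabilized scaling factor."""
    return alpha / math.sqrt(rank)


for rank in (8, 64, 256):
    print(rank, lora_scale(16, rank), rslora_scale(16, rank))
```

With standard scaling the adapter's contribution shrinks quickly as the rank grows, whereas the rank-stabilized factor decays much more slowly, which is what lets higher ranks keep contributing.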

Use DoRA: This technique is used to use the Weight-Decomposed Low-Rank Adaptation (DoRA) method. This method decomposes the pre-trained weight into two components, magnitude and direction, for fine-tuning, specifically employing LoRA for directional updates to efficiently minimize the number of trainable parameters. By employing DoRA, we enhance both the learning capacity and training stability of LoRA while avoiding any additional inference overhead. DoRA consistently outperforms LoRA on fine-tuning LLaMA, LLaVA, and VL-BART on various downstream tasks, such as commonsense reasoning, visual instruction tuning, and image/video-text understanding. For this blog, we'll not be using this configuration. If you want to know more about the DoRA, you can read the paper by visiting [this link](https://arxiv.org/pdf/2402.09353.pdf).

Create new adapter: This technique is used to add new adapter layers on top of the existing fine-tuned model. For this blog, we'll not be using this configuration.

Additional modules: Normally in LoRA, we only train lora adapters while keeping the base model frozen. But Llama-factory provides us with the option to train additional modules as well. We can select the additional modules based on our requirement. For this blog, we'll not be using this configuration.

## GaLore Configurations

<img src="assets/galore config.png" alt="share=True">

GaLore (Gradient Low-Rank Projection) is a memory-efficient training strategy for large language models (LLMs) that leverages the low-rank nature of gradients. It is an alternative to common low-rank adaptation methods such as LoRA and allows for full-parameter learning while reducing memory usage. In GaLore, the projection of full-rank gradients to low-rank gradients is achieved by computing two projection matrices that transform the gradient matrix into a low-rank form. This process significantly reduces the memory cost of optimizer states, which rely on component-wise gradient statistics. To know more about GaLore and its associated configuration, you can read the paper by visiting [this link](https://arxiv.org/pdf/2403.03507.pdf) or follow this [blog](https://blog.stackademic.com/galore-memory-efficient-pre-training-for-llms-9f4c0427b1b7). For the purposes of this blog, we'll not be using the GaLore configurations.

There are other configurations as well. But we won't be using them for this blog. You can read more about them in the official documentation.

## Train the Model

<img src="assets/preview and train.png" alt="share=True">

Once we've set all the configurations, we can now go ahead and train the model. Much like setting up the configurations, training the model is also very easy. We just need to click on a button. That's it!

Before we start the training, we can also preview the command that was generated based on the configuration we've set.
```bash
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --do_train True \
    --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 \
    --adapter_name_or_path saves/Custom/lora/train_2024-03-13-19-46-19 \
    --finetuning_type lora \
    --template default \
    --dataset_dir data \
    --dataset dockerNLCommands \
    --cutoff_len 512 \
    --learning_rate 0.0002 \
    --num_train_epochs 10.0 \
    --max_samples 10000 \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --max_grad_norm 0.3 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 0 \
    --optim adamw_torch \
    --output_dir saves/Custom/lora/train_2024-03-14-11-23-26 \
    --bf16 True \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.1 \
    --lora_target q_proj,v_proj \
    --val_size 0.05 \
    --evaluation_strategy steps \
    --eval_steps 100 \
    --per_device_eval_batch_size 16 \
    --load_best_model_at_end True \
    --plot_loss True
```
This is a very useful feature: we can run the generated command from the command line to train the model. In this blog, though, we'll be using the web interface. To train the model, we click on the Start button, and the training process begins. If for some reason you want to stop the training process, you can click on the Abort button.
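Two of the values in this command are worth unpacking with a bit of arithmetic: the effective batch size and the cosine learning-rate schedule. The sketch below uses the flag values from the command; `total_steps` is a hypothetical number, since the real value depends on the dataset size, and the `cosine_lr` helper is my own illustration of a standard half-cosine decay rather than code taken from Llama-Factory.

```python
import math

per_device_batch_size = 16   # --per_device_train_batch_size
grad_accum_steps = 4         # --gradient_accumulation_steps
num_devices = 1              # CUDA_VISIBLE_DEVICES=0 exposes a single GPU

# Number of samples that contribute to each optimizer step.
effective_batch_size = per_device_batch_size * grad_accum_steps * num_devices
print("effective batch size:", effective_batch_size)


def cosine_lr(step, total_steps, base_lr=2e-4, warmup_steps=0):
    """Linear warmup followed by a half-cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 100  # hypothetical; depends on dataset size and epochs
for step in (0, 50, 100):
    print(step, cosine_lr(step, total_steps))
```

With `--warmup_steps 0`, the schedule starts at the full learning rate of 0.0002 and decays smoothly towards zero by the last step.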

Once the training process has started, we can see the training logs and the training progress in the web interface. The logs show all the information about the training configs, like batch size, gradient accumulation, etc. Once the model starts training, we can see the loss for each step. These losses are also reported in a plot, so we can watch the loss curve being drawn in the web interface. All these logs and plots are essential for debugging the training process.

<img src="assets/logs_1.png" alt="share=True">
<img src="assets/lora_logs.png" alt="share=True">
<img src="assets/loss_dec.png" alt="share=True">
<img src="assets/loss.png" alt="share=True">
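A quick back-of-the-envelope calculation shows why LoRA training is so light. The sketch below estimates the number of trainable parameters for `--lora_rank 8` and `--lora_target q_proj,v_proj`. The model dimensions are assumptions taken from the published Mistral-7B configuration (hidden size 4096, 32 layers, 8 key/value heads of dimension 128), so treat the result as an estimate rather than a number read from the logs above.

```python
# Assumed Mistral-7B dimensions (from the published model config).
hidden_size = 4096
num_layers = 32
kv_dim = 8 * 128          # 8 key/value heads x head dimension 128

lora_rank = 8             # --lora_rank
lora_alpha = 16           # --lora_alpha

# (in_features, out_features) of each targeted projection.
target_shapes = {
    "q_proj": (hidden_size, hidden_size),
    "v_proj": (hidden_size, kv_dim),
}


def lora_param_count(in_features, out_features, rank):
    # LoRA adds A with shape (rank, in) and B with shape (out, rank).
    return rank * (in_features + out_features)


per_layer = sum(lora_param_count(i, o, lora_rank) for i, o in target_shapes.values())
total_trainable = per_layer * num_layers

# The adapter output is scaled by alpha / rank before being added to the base output.
scaling = lora_alpha / lora_rank

print("trainable LoRA parameters:", total_trainable)
print("LoRA scaling factor:", scaling)
```

That is only a few million trainable parameters against roughly seven billion frozen ones.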

Whoa! That was a lot of information. But believe me, it's still very high-level compared to training LLMs using code. This is what makes Llama-Factory so powerful: it is a no-code tool that can be used to train LLMs in the cloud without having to worry about the coding part.

Now that we've trained our model, we can chat with it and see how it responds to instructions related to Docker. To use the model for chat, we can click on the Chat button available in the web interface. But before we can load the model, we first need to select our trained adapter from the dropdown menu, as shown in the image below.

<img src="assets/after-training-adaptors.png" alt="share=True">

Once this is done, we can load the model and pass in our query, as shown in the image below.

<img src="assets/query-2.png" alt="share=True">

Here we see multiple options in the web interface. The role stays as User. Next, there is the option of setting a System prompt, which tells the LLM how to behave. Then there is an option for selecting tools. Since we have no tools in this use case, we'll ignore it. Finally, there is the input field for our query. Once we've entered the query, we click on the Submit button, and the model generates a response that we can see in the web interface.
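If you're wondering how these fields relate to each other, many chat APIs represent a conversation as a list of role/content messages. The sketch below follows that common convention; the `build_messages` helper and the example strings are hypothetical, and how Llama-Factory actually formats the prompt depends on the selected template.

```python
def build_messages(query, system_prompt=None):
    """Assemble a conversation in the common role/content message format."""
    messages = []
    if system_prompt:
        # The system prompt goes first and sets the model's behaviour.
        messages.append({"role": "system", "content": system_prompt})
    # The query is sent with the User role.
    messages.append({"role": "user", "content": query})
    return messages


messages = build_messages(
    "List all running containers.",  # hypothetical query
    system_prompt="Translate the instruction into a single docker command.",  # hypothetical
)
for message in messages:
    print(message["role"], "->", message["content"])
```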

Before we move on to the response, let's understand three additional configs present in the web interface: Maximum new tokens, Top-p, and Temperature. Let's see what these options mean.

Maximum new tokens: This sets an upper limit on how many new tokens the model generates for a response. By default, it is set to 512, which is what we're using, but you can change it based on your requirement.

Top-p: Top-p (nucleus) sampling controls the diversity of the generated tokens by sampling only from the smallest set of most likely tokens whose cumulative probability reaches p. We can set it in the Top-p input field. By default, top-p is set to 0.9. We've set it to 0.7. If you want to know more about top-p, you can read this [blog](https://huggingface.co/blog/how-to-generate).

Temperature: This controls the randomness of the generated tokens. Lower values make the output more deterministic, while higher values make it more varied. We can set it in the Temperature input field. By default, the temperature is set to 1.0. We've set it to 0.95. If you want to know more about temperature, you can read this [blog](https://huggingface.co/blog/how-to-generate).
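Here is a minimal NumPy sketch of how these three settings interact during generation. It is an illustration of the general technique, not the code Llama-Factory runs: the `sample_next_token` and `generate` helpers and the toy logits are my own, and a real model would produce new logits at every step.

```python
import numpy as np


def sample_next_token(logits, temperature=0.95, top_p=0.7, rng=None):
    """Sample one token id using temperature scaling and top-p filtering."""
    if rng is None:
        rng = np.random.default_rng()
    # Temperature: divide the logits before the softmax.
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    # Top-p: keep the smallest set of tokens whose cumulative mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]
    nucleus = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=nucleus))


def generate(logits_fn, prompt_ids, max_new_tokens=512, eos_id=None, **sampling):
    """Generate until max_new_tokens is reached or the EOS token appears."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        token = sample_next_token(logits_fn(ids), **sampling)
        ids.append(token)
        if token == eos_id:
            break
    return ids[len(prompt_ids):]


def toy_logits(ids):
    # Toy "model" over a 4-token vocabulary that always returns the same logits.
    return [3.0, 1.0, 0.5, 0.1]


print(generate(toy_logits, [0], max_new_tokens=5, temperature=0.95, top_p=0.7))
```

With these toy logits, the most likely token alone already covers more than 70% of the probability mass, so a top-p of 0.7 leaves nothing else to sample from. Raising top-p or the temperature lets the less likely tokens back in.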

Now that we've understood all the configurations, let's now see how the model responds to the query.

<img src="assets/response-2.png" alt="share=True">

Great! The model has generated the correct command based on the instruction. 

Let's now see how the model responds to another query.

<img src="assets/response-1.png" alt="share=True">

Great! This response is also correct. Now let's try one last query.

<img src="assets/response 3.png" alt="share=True">

Finally, we've come to an end. In this tutorial, we saw how to train a model using llama-factory. We saw how to configure the model, the fine-tuning options, the advanced configurations, the training configurations, and the LoRA configurations. We also saw how to train the model and how to use the model for chat. 

As a bonus section, let's quickly see how we can export the model and also see how we can upload the model to the Hugging Face model hub.

## Export the Model

<img src="assets/export mode.png" alt="share=True">

You can use the Export tab to export the model to a new directory. Here you can select the maximum size of each checkpoint shard and the quantization bit for the exported model. Finally, you need to pass in the path where you want to save the model. Once you've set all the configurations, click on the Export button, and the model will be exported to the specified path.
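To get a feel for what the maximum shard size does, here is a small sketch of greedy sharding: tensors are added to the current file until the next one would exceed the limit, and then a new file is started. The tensor names and sizes are made up for illustration, and the `plan_shards` helper is a simplification rather than the exporter's actual code.

```python
def plan_shards(tensor_sizes_gb, max_shard_size_gb):
    """Greedily group tensors into shards no larger than the limit."""
    shards, current, current_size = [], [], 0.0
    for name, size in tensor_sizes_gb.items():
        # Start a new shard when the next tensor would exceed the limit.
        if current and current_size + size > max_shard_size_gb:
            shards.append(current)
            current, current_size = [], 0.0
        current.append(name)
        current_size += size
    if current:
        shards.append(current)
    return shards


# Hypothetical tensor sizes in GB.
sizes = {"embed": 0.5, "layer_0": 0.9, "layer_1": 0.9, "layer_2": 0.9, "lm_head": 0.5}
shards = plan_shards(sizes, max_shard_size_gb=2.0)
for index, shard in enumerate(shards, start=1):
    print(f"shard {index}: {shard}")
```

A smaller shard size gives you more, smaller files, which is handy when uploading or downloading over a slow connection.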

## Upload the Model to Hugging Face Model Hub

<img src="assets/huggingface-upload.png" alt="share=True">

To upload the model to the Hugging Face model hub, you need to use the Hugging Face CLI. You can install the Hugging Face CLI by running the following command.

```
pip install --upgrade huggingface_hub
```

Once the Hugging Face CLI is installed, you need to log in to the Hugging Face model hub by running the following command.

```
huggingface-cli login
```

Once you're logged in, you can now upload the model to the Hugging Face model hub by running the following command.

```
huggingface-cli upload {user}/{model-name} {model-path}
```
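If you'd rather stay in Python, the same upload can be sketched with the `huggingface_hub` library that we installed above. The `upload_model` wrapper is my own helper, and the arguments in the commented-out call are placeholders. It assumes you've already logged in.

```python
def upload_model(user, model_name, model_path):
    """Create the repo if needed and upload every file in `model_path`."""
    # Imported lazily so the function can be defined without the package installed.
    from huggingface_hub import HfApi

    repo_id = f"{user}/{model_name}"
    api = HfApi()
    # exist_ok=True avoids an error if the repository already exists.
    api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
    api.upload_folder(folder_path=model_path, repo_id=repo_id, repo_type="model")
    return repo_id


# Placeholder values; replace them with your own before running.
# upload_model("your-username", "your-model-name", "path/to/exported/model")
```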

Great! That's it. We've now come to an end. I hope you enjoyed this blog.