# **Fine-tune Llama2 model**


In this tutorial, I walk through fine-tuning Llama 2 models using [AutoTrain Advanced](https://github.com/huggingface/autotrain-advanced).

[Llama2](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/) is a family of language models released by Meta in base and chat variants. Whether you are building a chat application or something broader, your training data must follow the structure the model expects.

For further insights, don't hesitate to reach out via direct message: [LinkedIn](https://www.linkedin.com/in/sif-eddine-boudjellal/)

# Install autotrain-advanced & huggingface_hub

- Make sure to run on a GPU.
- After the installation, you must restart the runtime to use the newly installed versions.

In [None]:
!pip install autotrain-advanced
!pip install huggingface_hub

In [None]:
!autotrain setup --update-torch

# Hugging Face token

In [None]:
from huggingface_hub import notebook_login
notebook_login()

# Data Processing for LLaMA2



To effectively utilize LLaMA2, it's crucial to structure your data properly. Below are examples of how your data should be organized:

**First Example Format**

In this format, the plain-text markers "### Human:" and "### Assistant:" delineate the human question and the assistant's response. The [timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco) dataset uses this format:

```
"### Human: ¿Qué similitudes hay entre la música Barroca y Romántica?### Assistant: Aunque la música ..."

```
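
As a minimal sketch, a string in this format could be assembled from question/answer pairs as follows. The helper name `format_guanaco` is hypothetical, and joining the turns without a separator is an assumption based on the sample above.

```python
def format_guanaco(turns):
    """Join (human, assistant) pairs into one training string.

    Assumption: turns are concatenated without any separator,
    mirroring the sample row shown above.
    """
    parts = []
    for human, assistant in turns:
        parts.append(f"### Human: {human}### Assistant: {assistant}")
    return "".join(parts)


sample = format_guanaco(
    [("What is the capital of France?", "The capital of France is Paris.")]
)
print(sample)
```
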
**Second Example Format**

In this format, the markers "### Instruction:", "### Input:", and "### Response:" segment the instruction, input context, and response, respectively. The [vicgalle/alpaca-gpt4](https://huggingface.co/datasets/vicgalle/alpaca-gpt4) dataset uses this format:



```
"Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. ### Instruction: Identify the odd one out. ### Input: Twitter, Instagram, Telegram ### Response: The odd one out is Telegram. Twitter and Instagram are social media platforms mainly for sharing information, images and videos while Telegram is a cloud-based instant messaging and voice-over-IP service."
```
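
A minimal sketch of building such a string is shown here. The helper name `format_alpaca` is hypothetical, and the single-space separators simply mirror the one-line sample above; the dataset itself may lay the sections out differently (for example, with line breaks).

```python
PREAMBLE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request."
)


def format_alpaca(instruction, input_text, response):
    """Build one training string with Instruction / Input / Response markers.

    Assumption: sections are separated by single spaces, as in the sample above.
    """
    return (
        f"{PREAMBLE} ### Instruction: {instruction} "
        f"### Input: {input_text} "
        f"### Response: {response}"
    )


row = format_alpaca(
    "Identify the odd one out.",
    "Twitter, Instagram, Telegram",
    "The odd one out is Telegram.",
)
print(row)
```
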


## Important Notes
**Training the Llama 2 base model**

If you're training the Llama 2 base model, no particular prompt structure is required; either format above works as long as you use it consistently.

**Training the Llama 2 chat model**

However, if you intend to train the Llama 2 chat model, your data should follow the structure below (the template first, then a filled-in example):


```
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message }} [/INST]

```

```
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

There's a llama in my garden 😱 What should I do? [/INST]

```
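
The template above can be filled in programmatically. The following is a sketch only: the function name `build_llama2_chat_prompt` is hypothetical, and appending the answer followed by `</s>` for a training example is an assumption based on the template in the Hugging Face blog post linked in this section.

```python
def build_llama2_chat_prompt(system_prompt, user_message, answer=None):
    """Fill the Llama 2 chat template with a system prompt and a user message.

    If `answer` is given, it is appended after [/INST] and closed with </s>
    (assumption: this follows the blog post's template for training examples).
    """
    prompt = (
        "<s>[INST] <<SYS>>\n"
        f"{system_prompt}\n"
        "<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )
    if answer is not None:
        prompt += f" {answer} </s>"
    return prompt


print(
    build_llama2_chat_prompt(
        "You are a helpful, respectful and honest assistant.",
        "There's a llama in my garden 😱 What should I do?",
    )
)
```
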

For more in-depth information on LLaMA2 and its implementation, feel free to refer to the [official Hugging Face blog post](https://huggingface.co/blog/llama2).

By following these guidelines, you'll be well-equipped to structure your data effectively and train Llama2 models tailored to your specific use cases.

# Train Llama 2 model

First, let's break down the necessary parameters for training the Llama 2 model:

1. `--project_name`: This parameter specifies the name of your project, in this case, "**llama2_train**." It helps organize your training run and the resulting model files.

2. `--model meta-llama/Llama-2-7b-hf`: This parameter refers to the model repository you're using for training. In this tutorial, you're using the "**Llama-2-7b-hf**" base model. To access this model, you need to accept Meta's License. You can also provide the model from your local machine using the model path.

3. `--data_path`: This parameter points to the path of your dataset. You have two options for providing your dataset:
   - You can use a dataset from Hugging Face by specifying the repository ID (as demonstrated in this tutorial).
   - You can directly provide the path to your data.

4. `--text_column`: This parameter specifies the name of the column in your dataset that contains the text data you want to train on.

5. `--use_peft`: This parameter enables parameter-efficient fine-tuning through the [PEFT](https://github.com/huggingface/peft) library (Parameter-Efficient Fine-Tuning), which adapts a pre-trained language model by training a small number of additional parameters instead of all of its weights.

6. `--use_int4`: This parameter quantizes the model into 4-bit integer format, reducing the model's memory and computation requirements while maintaining reasonable performance.

7. `--learning_rate 2e-4`: This parameter sets the learning rate for the training process. It determines the step size taken in each iteration of gradient descent.

8. `--train_batch_size`: This parameter specifies the batch size for training. It determines how many examples are processed together before updating the model's parameters.

9. `--num_train_epochs`: This parameter defines the number of training epochs, which is the number of times the entire dataset is passed through the model during training.

**Choose hyperparameters to fit the hardware at your disposal. In this notebook, I've opted for a train_batch_size of 4 and num_train_epochs of 3 because it runs on a free Google Colab GPU (T4). In your own scenario you might need larger values, depending on the capacity of your GPU and the size of your dataset.**

10. `--trainer sft`: This parameter indicates the training approach. "sft" stands for Supervised Fine-Tuning, which trains the model on examples of the desired output and is commonly used for instruction and dialogue models.

11. `--model_max_length 2048`: This parameter limits the maximum length of input sequences, in tokens. Llama 2 has a context window of 4096 tokens, but I've set it to 2048 to speed up the training process.

12. `--push_to_hub` and `--repo_id <your_repo_id>`: These parameters allow you to push the trained model to your Hugging Face model repository. You need to provide your repository ID to use this feature.
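
To give a rough sense of why `--use_int4` matters on a small GPU, here is a back-of-the-envelope sketch of the memory needed just to hold the weights. It assumes a nominal 7 billion parameters and ignores activations, adapter weights, optimizer state, and quantization overhead, so treat the numbers as a lower bound rather than a measurement. The function name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 7_000_000_000  # assumption: nominal size of a 7B model

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
int4_gib = weight_memory_gib(NUM_PARAMS, 4)

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f" 4-bit weights: {int4_gib:.1f} GiB")
```

Under these assumptions the 16-bit weights take roughly 13 GiB, while the 4-bit weights take a little over 3 GiB.
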

By understanding and configuring these parameters, you'll be able to train the LLAMA2 model effectively based on your specific needs and hardware resources.
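
The relationship between `--train_batch_size`, `--num_train_epochs`, and the number of parameter updates can be sketched with simple arithmetic. This assumes a single GPU and no gradient accumulation; the dataset size of 10,000 examples is a made-up figure for illustration, and `optimizer_steps` is a hypothetical helper.

```python
import math


def optimizer_steps(num_examples, batch_size, epochs):
    """Number of parameter updates, assuming one update per batch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


# Hypothetical dataset size; batch size and epochs match the values in this notebook.
total_steps = optimizer_steps(num_examples=10_000, batch_size=4, epochs=3)
print(total_steps)  # 2500 steps per epoch * 3 epochs = 7500
```

A larger batch size lowers the step count but needs more GPU memory per step, which is the trade-off behind the values chosen here.
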

In [None]:
!autotrain llm --train \
  --project_name llama2_train \
  --model meta-llama/Llama-2-7b-hf \
  --data_path timdettmers/openassistant-guanaco \
  --text_column text \
  --use_peft \
  --use_int4 \
  --learning_rate 2e-4 \
  --train_batch_size 4 \
  --num_train_epochs 3 \
  --trainer sft
