# Marlin-Mistral-7b-v0.1 Finetune Tutorial

# A Comprehensive Guide to Finetuning mistralai/mistral-7b-v0.1


![marlin-mistral-7b.png](attachment:marlin-mistral-7b.png)



## Introduction

In this notebook, we will walk through the process of fine-tuning a pre-trained model using the Hugging Face libraries (transformers, peft, and accelerate). We will cover the essential steps, from setting up the necessary plugins and accelerator to preparing the data, loading the model, and training it. By the end of this guide, you should have a foundational understanding of how to fine-tune models for custom tasks.

First, begin by installing the necessary libraries.

### Prerequisites

This notebook requires a GPU with compute capability 7.0 or higher to use the acceleration. NVIDIA 20xx-series cards may serve as a baseline, but training on them will take a long time. My original run of this fine-tune took just under 1 hour on a single RTX 4090.
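
If you are unsure what your card supports, a small check along these lines may help. This is only a sketch: `meets_minimum` and `report_gpu` are hypothetical helper names, and the PyTorch query is guarded so the cell still runs before PyTorch is installed.

```python
MIN_CAPABILITY = (7, 0)  # minimum compute capability stated in this tutorial


def meets_minimum(capability, minimum=MIN_CAPABILITY):
    """Return True if a (major, minor) compute capability is at least `minimum`."""
    return tuple(capability) >= tuple(minimum)


def report_gpu():
    """Print the compute capability of the first visible GPU, if there is one."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed yet; run the install cell first.")
        return
    if not torch.cuda.is_available():
        print("No CUDA device is visible to PyTorch.")
        return
    capability = torch.cuda.get_device_capability(0)  # (major, minor) tuple
    name = torch.cuda.get_device_name(0)
    print(f"{name}: compute capability {capability}, "
          f"meets minimum: {meets_minimum(capability)}")


report_gpu()
```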

This tutorial may have some install difficulties on Windows machines. Further testing will be done.

In [None]:
!pip install einops
!pip install accelerate
!pip install datasets
!pip install peft
!pip install trl
!pip install bitsandbytes
!pip install huggingface_hub
!pip install wandb
!pip install git+https://github.com/huggingface/transformers
!pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121


### Import the libraries

Now that the required libraries are installed, import them for use in the rest of the notebook.

In [None]:
from accelerate import FullyShardedDataParallelPlugin, Accelerator
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig
import torch
from peft import prepare_model_for_kbit_training
from peft import LoraConfig, get_peft_model
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from datasets import load_dataset
from huggingface_hub import login


### Authentication

If you wish to upload the model to your Hugging Face Hub account at the end of this tutorial, you will need an access token from your Hugging Face account settings.

If you do not plan to upload the model, you can skip this cell along with the final cell of the notebook.

In [None]:
login()

## Loading Accelerator

For efficient training, we will leverage the Fully Sharded Data Parallel Plugin and Accelerator. This will allow for optimized memory usage and potentially faster training times. 

If you wish, you can also plug in different accelerators in this cell to see how they affect training.

In [None]:
## Load accelerator
fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=False),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=False),
)

accelerator = Accelerator(fsdp_plugin=fsdp_plugin)

## Data Preparation

Next, we will load our training dataset, shuffle it, and split it into training, validation, and test subsets.

The dataset chosen for this fine-tune is vicgalle/alpaca-gpt4, which contains English instruction-following data generated by GPT-4 from Alpaca prompts, intended for fine-tuning LLMs.

The dataset contains approximately 52k samples, of which we will only use 10k.

The following section is where you can import different datasets and perform some exploratory data analysis to decide what information you want to use to train.

In [None]:
# Load the full training dataset
full_train_dataset = load_dataset('vicgalle/alpaca-gpt4', split='train')

# Shuffle and select the first 10,000 samples
subset_dataset = full_train_dataset.shuffle(seed=42).select(range(10000))

# Split the subset into train, validation, and test sets.
# Each split is computed once with a fixed seed so the three subsets cannot overlap.
train_temp = subset_dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = train_temp['train']  # 8,000 samples for training
temp_dataset = train_temp['test']    # 2,000 samples left

val_test = temp_dataset.train_test_split(test_size=0.5, seed=42)
eval_dataset = val_test['train']  # 1,000 samples for validation
test_dataset = val_test['test']   # 1,000 samples for testing

# Print the datasets
print(train_dataset)
print(eval_dataset)
print(test_dataset)
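
To sanity-check the 80/10/10 arithmetic without downloading anything, the split can be mimicked on plain indices. This is a minimal sketch using only the standard library; `split_indices` is a hypothetical helper, and `n_total=50_000` is a placeholder rather than the real dataset size.

```python
import random


def split_indices(n_total, n_subset, seed=42, val_frac=0.1, test_frac=0.1):
    """Shuffle indices, keep the first n_subset, and cut them into three disjoint parts."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    subset = indices[:n_subset]

    n_val = int(n_subset * val_frac)
    n_test = int(n_subset * test_frac)
    n_train = n_subset - n_val - n_test

    train = subset[:n_train]
    val = subset[n_train:n_train + n_val]
    test = subset[n_train + n_val:]
    return train, val, test


# n_total is a placeholder; in the notebook it would be len(full_train_dataset)
train_idx, val_idx, test_idx = split_indices(n_total=50_000, n_subset=10_000)
print(len(train_idx), len(val_idx), len(test_idx))

# No sample should appear in more than one subset
assert not set(train_idx) & set(val_idx)
assert not set(train_idx) & set(test_idx)
assert not set(val_idx) & set(test_idx)
```

The key point is that every subset is cut from one fixed shuffle, so no sample can land in two subsets.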

## Model and Tokenizer Loading

Load the base model with specific configurations and the tokenizer that will be used to preprocess our data.



In [None]:
## Load Base Model
base_model_id = "mistralai/Mistral-7B-v0.1"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(base_model_id, quantization_config=bnb_config)
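
To get a feel for why 4-bit loading matters on a single consumer GPU, here is a rough back-of-the-envelope estimate of weight memory. It is a sketch under simplifying assumptions: the parameter count is rounded to about 7.24 billion, every weight is treated as stored at the same width, and quantization constants, activations, gradients, and optimizer state are all ignored.

```python
PARAM_COUNT = 7_240_000_000  # approximate parameter count for Mistral-7B (assumption: rounded)

# Bytes needed to store one weight at each precision
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "4-bit": 0.5}


def weight_memory_gib(n_params, bytes_per_param):
    """Memory needed just to hold the weights, in GiB."""
    return n_params * bytes_per_param / 1024**3


for name, bytes_per_param in BYTES_PER_PARAM.items():
    print(f"{name:>8}: {weight_memory_gib(PARAM_COUNT, bytes_per_param):.1f} GiB")
```

Real usage will be higher than these figures, but the ratio between the precisions is the useful takeaway.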

## Tokenization Setup

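In this section, we load the tokenizer, adjust its settings, and define a tokenization function to preprocess our dataset.

In [None]:
# Load the tokenizer that matches the base model
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
# The Mistral tokenizer ships without a pad token, so reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token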

def tokenize(prompt):
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=512,
        padding="max_length",
    )
    result["labels"] = result["input_ids"].copy()
    return result
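
To see what `truncation`, `max_length`, and `padding="max_length"` do to a sample, here is a toy illustration that needs no model download. It is only a sketch: `toy_tokenize` is a hypothetical whitespace tokenizer, not the real Mistral tokenizer, and it pads on the right for simplicity (the real tokenizer's padding side depends on its configuration).

```python
def toy_tokenize(text, vocab, max_length=8, pad_id=0):
    """Whitespace 'tokenizer' that truncates and pads every sample to max_length."""
    # Assign each new word the next free integer ID (0 is reserved for padding)
    ids = [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]
    ids = ids[:max_length]                      # truncation
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad
    ids = ids + [pad_id] * n_pad                # padding to a fixed length
    # For causal language modeling, the labels start as a copy of the inputs
    return {"input_ids": ids, "attention_mask": attention_mask, "labels": list(ids)}


short_sample = toy_tokenize("fine tune the model", {}, max_length=6)
long_sample = toy_tokenize("one two three four five six seven eight", {}, max_length=6)
print(short_sample)
print(long_sample)
```

Every sample comes out with the same length, which is what allows them to be stacked into a batch.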

## Data Formatting

Next, we define a function that formats each data sample into the Alpaca-style prompt template the model will be trained on.

In [None]:
def generate_and_tokenize_prompt(data_point):
    full_prompt = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{data_point["instruction"]}

### Input:
{data_point["input"]}

### Response:
{data_point["output"]}
"""
    return tokenize(full_prompt)
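
Many Alpaca-style samples have an empty `input` field, in which case the template above still emits an `### Input:` header with nothing under it. One optional variation, sketched here with a hypothetical `build_prompt` helper, is to drop that section when there is no input. If you train with a variant like this, the evaluation prompt should be built the same way.

```python
HEADER = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes the request."
)


def build_prompt(instruction, input_text="", output=""):
    """Build an Alpaca-style prompt, omitting the Input section when it is empty."""
    parts = [HEADER, "### Instruction:", instruction, ""]
    if input_text.strip():
        parts += ["### Input:", input_text, ""]
    parts += ["### Response:", output]
    return "\n".join(parts)


print(build_prompt("Translate to English.", "Bonjour", "Hello"))
print("---")
print(build_prompt("Name a primary color.", output="Red"))
```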

## Data Tokenization

For our model to understand and process the data, we need to convert the raw text into a format it can work with. Tokenization breaks the text into tokens (words, subwords, or characters) and maps each token to an integer ID.

In [None]:
# tokenize each sample based on the prompt format
tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
tokenized_val_dataset = eval_dataset.map(generate_and_tokenize_prompt)

print(tokenized_train_dataset[4]['input_ids'])

# check that the sample has the max length of 512
print(len(tokenized_train_dataset[4]['input_ids']))

## Evaluate the Base Model

Before fine-tuning, it's a good practice to evaluate the base model on our task to get a sense of its initial performance.

This gives you a baseline output for a fixed prompt, which you can compare against once the model has been trained.

In [None]:
print("Instruction: " + test_dataset[1]['instruction'])
print("Input: " + test_dataset[1]['input'])
print("Response: " + test_dataset[1]['output'] + "\n")

eval_prompt = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{test_dataset[1]['instruction']}

### Input:
{test_dataset[1]['input']}

### Response:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
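    # Use the tokenizer's EOS token id for padding instead of a hard-coded value
    output_ids = model.generate(**model_input, max_new_tokens=256, pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))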
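
The decoded text includes the prompt itself, which makes the before-and-after comparison harder to read. A small helper like the following can isolate the generated part. This is a sketch: `extract_response` is a hypothetical name, and it assumes the `### Response:` marker from the prompt template appears in the decoded text.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(generated_text, marker=RESPONSE_MARKER):
    """Return only the text after the first response marker, or the full text if absent."""
    _, found, tail = generated_text.partition(marker)
    return tail.strip() if found else generated_text.strip()


# Stand-in for a decoded model output
decoded = "### Instruction:\nSay hello.\n\n### Response:\nHello there!\n"
print(extract_response(decoded))
```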

## Begin Fine-tuning with PEFT

Next, we will start the fine-tuning process using PEFT. PEFT stands for "Parameter-Efficient Fine-Tuning", a family of methods that train only a small number of additional or selected parameters while most of the pre-trained weights stay frozen.

In [None]:
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

# print model to examine layers
print(model)

## Integrating LoRA

LoRA (Low-Rank Adaptation) is a technique for fine-tuning neural networks more efficiently: the pre-trained weights stay frozen, and small trainable low-rank matrices are added to selected layers.
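
To make the idea concrete, here is a tiny pure-Python sketch of a single adapted layer. The frozen weight `W` is left untouched, and the output is `W x + (alpha / r) * B (A x)`, where `A` and `B` are the small trainable matrices. The helper names and the toy numbers are illustrative assumptions, not part of any library.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Frozen projection plus the scaled low-rank update."""
    base = matvec(W, x)                 # frozen pre-trained path
    update = matvec(B, matvec(A, x))    # project down to rank r, then back up
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen weight, shape 2 x 3
A = [[0.5, -0.5, 1.0]]                   # trainable, shape r x 3 with r = 1
x = [1.0, 2.0, 3.0]

# B starts at zero, so the adapted layer initially matches the base layer
B_init = [[0.0], [0.0]]
out_init = lora_forward(W, A, B_init, x, alpha=2, r=1)

# Once B has been updated by training, the output shifts
B_trained = [[1.0], [0.0]]
out_trained = lora_forward(W, A, B_trained, x, alpha=2, r=1)

print(out_init, out_trained)
```

With `r=8` and `lora_alpha=16`, the scaling factor `alpha / r` is 2.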

In [None]:
# Define the LoRA config
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
print_trainable_parameters(model)
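
The trainable parameter count can be estimated by hand: a LoRA adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` parameters. The sketch below applies that to the target modules listed above. The layer dimensions are assumptions based on the published Mistral-7B-v0.1 configuration, so verify them against `model.config` before relying on the result.

```python
# Assumed Mistral-7B-v0.1 dimensions (check model.config to confirm)
HIDDEN = 4096         # hidden_size
KV_DIM = 1024         # 8 key/value heads * 128 head dim
INTERMEDIATE = 14336  # intermediate_size
VOCAB = 32000         # vocab_size
LAYERS = 32           # num_hidden_layers
R = 8                 # LoRA rank from the config above

# (d_in, d_out) for each targeted projection inside one decoder layer
PER_LAYER_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in, d_out, r):
    """Parameters added by one adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in PER_LAYER_SHAPES.values())
lm_head = lora_params(HIDDEN, VOCAB, R)
total_trainable = LAYERS * per_layer + lm_head

print(f"per layer: {per_layer:,} | lm_head: {lm_head:,} | total: {total_trainable:,}")
```

If the assumed dimensions are right, the total should line up with the `trainable params` figure printed by `print_trainable_parameters`.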

## Applying the Accelerator

To further optimize our model training, we apply the accelerator. This step optimizes the training across multiple devices if available.

In [None]:
# Apply the accelerator. You can comment this out to remove the accelerator.
model = accelerator.prepare_model(model)

## Examining the Updated Model

After integrating LoRA and applying the accelerator, let's print the model to observe the changes.

In [None]:
print(model)

## Tracking Training with Weights & Biases (wandb)

Weights & Biases (wandb) provides tools to track and visualize the training process. By integrating with wandb, you can monitor your model's performance, visualize metrics, and more.

In [None]:
## Track the training stats on wandb
import wandb, os
wandb.login()

wandb_project = "marlin-finetune"
if len(wandb_project) > 0:
    os.environ["WANDB_PROJECT"] = wandb_project


## Training the Model

Now, we'll start the fine-tuning process. This involves specifying various training parameters and using the Trainer class from the transformers library.

In [None]:
import transformers
from datetime import datetime

project = "marlin-finetune"
base_model_name = "mistral"
run_name = base_model_name + "-" + project
output_dir = "./" + run_name

tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    args=transformers.TrainingArguments(
        output_dir=output_dir,
        warmup_steps=5,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=1000,
        learning_rate=2.5e-5, # Want about 10x smaller than the Mistral learning rate
        logging_steps=50,
        bf16=True,
        optim="paged_adamw_8bit",
        logging_dir="./logs",        # Directory for storing logs
        save_strategy="steps",       # Save checkpoints on a step interval
        save_steps=50,               # Save a checkpoint every 50 steps
        evaluation_strategy="steps", # Evaluate on a step interval (named eval_strategy in newer transformers releases)
        eval_steps=50,               # Evaluate every 50 steps
        do_eval=True,                # Enable evaluation on eval_dataset
        report_to="wandb",           # Comment this out if you don't want to use Weights & Biases
        run_name=f"{run_name}-{datetime.now().strftime('%Y-%m-%d-%H-%M')}"          # Name of the W&B run (optional)
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()
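
It helps to translate these settings into how much data the run actually sees. The arithmetic below is a sketch that assumes a single GPU (`num_devices = 1`) and the 8,000-sample training split from earlier; adjust both if your setup differs.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 1000
save_steps = 50
num_devices = 1        # assumption: single-GPU run
train_samples = 8000   # size of the training split used in this tutorial

# Samples contributing to each optimizer step
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Total samples processed over the whole run
samples_seen = effective_batch_size * max_steps
# How many passes over the training split that corresponds to
epochs = samples_seen / train_samples
# Number of checkpoints written at the configured interval
num_checkpoints = max_steps // save_steps

print(f"effective batch size: {effective_batch_size}")
print(f"samples seen: {samples_seen} ({epochs:.2f} epochs)")
print(f"checkpoints saved: {num_checkpoints}")
```

Under these assumptions, 1,000 steps correspond to one pass over the training split.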

## Evaluating the Fine-tuned Model

After training, it's important to evaluate the fine-tuned model to understand its performance. Here we reload the base model, attach the trained LoRA adapter from the final checkpoint, and run the same evaluation prompt used before training so the two outputs can be compared.

In [None]:
from peft import PeftModel

# Reload the base model without quantization so the adapter can be merged into it later.
# If you run out of GPU memory, free the training model or restart the kernel first.
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,  # Mistral, same as before
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
tokenizer.pad_token = tokenizer.eos_token

# Attach the trained LoRA adapter from the final checkpoint
ft_model = PeftModel.from_pretrained(base_model, "mistral-marlin-finetune/checkpoint-1000")

ft_model.eval()
with torch.no_grad():
    output_ids = ft_model.generate(**model_input, max_new_tokens=100, pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

## Pushing the Model to the Hugging Face Hub

At this point you have completed the fine-tuning of mistral-7b-v0.1 on a Stanford Alpaca-style dataset covering a variety of topics. You can stop now if you wish.

However, there are a few more steps if you wish to upload the model and share the result of this project.

You just need to merge the fine-tuned adapter into the base model, and then push the merged model to your profile.

In [None]:
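## push to hub
# Merge the LoRA weights into the base model, then upload the merged model
model = ft_model.merge_and_unload()
model.push_to_hub("macadeliccc/marlin-mistral-7b-v0.1")  # replace with your own <username>/<repo-name>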