In [70]:
# reference: https://github.com/brevdev/notebooks/blob/main/mistral-finetune.ipynb

In [2]:
# # You only need to run this once per machine
# !pip install -q -U bitsandbytes
# !pip install -q -U git+https://github.com/huggingface/transformers.git
# !pip install -q -U git+https://github.com/huggingface/peft.git
# !pip install -q -U git+https://github.com/huggingface/accelerate.git
# !pip install -q -U datasets scipy ipywidgets

In [2]:
import os
import sys
from datetime import datetime

import transformers
import wandb
from accelerate import FullyShardedDataParallelPlugin, Accelerator
from peft import PeftModel
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig
from transformers import Trainer, TrainingArguments

### 0. Accelerator

Set up the Accelerator. I'm not sure we really need this for QLoRA, given that the [description](https://huggingface.co/docs/accelerate/v0.19.0/en/usage_guides/fsdp) of FSDP is about sharding a model across multiple GPUs (I have to read more about it), but it doesn't seem to hurt, and it's helpful to have the code for future reference. You can always comment out the accelerator if you want to try without it.

In [3]:
fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=False),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=False),
)

accelerator = Accelerator(fsdp_plugin=fsdp_plugin)

Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


### 1. Load Dataset

Let's load an instruction dataset and fine-tune Mistral on it. The files loaded below live under `data/processed/lima/` (they appear to be a processed version of LIMA) and hold one `input`/`output` pair per example. Fine-tuning is best at teaching the model a desired form of output on which the base model performs poorly out of the box, which makes it easy and inexpensive to gauge whether the fine-tuned model has learned well. (Sources: [here](https://ragntune.com/blog/gpt3.5-vs-llama2-finetuning) and [here](https://www.anyscale.com/blog/fine-tuning-is-for-form-not-facts)) (In contrast, if you fine-tune on a fact-based dataset, the model may already do quite well on that, and gauging learning is less obvious and may be more computationally expensive.)

In [4]:
from datasets import load_dataset

# File paths
training_file_path = "data/processed/lima/training_set.jsonl"
validation_file_path = "data/processed/lima/validation_set.jsonl"

# Load the training dataset
train_dataset = load_dataset('json', data_files=training_file_path, split='train')

# Load the validation dataset
eval_dataset = load_dataset('json', data_files=validation_file_path, split='train')

# Print a summary of the training dataset (column names and row count)
print(train_dataset)

# Print a summary of the validation dataset (column names and row count)
print(eval_dataset)

Dataset({
    features: ['input', 'output'],
    num_rows: 1000
})
Dataset({
    features: ['input', 'output'],
    num_rows: 30
})
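
For reference, `load_dataset('json', ...)` reads JSON Lines files, i.e. one JSON object per line. The sketch below uses only the standard library to write and read back a file in that shape. Only the `input`/`output` keys come from this notebook; the records and the helper names `write_jsonl` and `read_jsonl` are hypothetical, and the real files may contain more than this.

```python
import json
import os
import tempfile

# Hypothetical records; only the 'input'/'output' keys come from the notebook.
records = [
    {"input": "What is the capital of France?", "output": "The capital of France is Paris."},
    {"input": "Name a primary color.", "output": "Red is a primary color."},
]


def write_jsonl(path, rows):
    """Write one JSON object per line (the JSON Lines layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


sample_path = os.path.join(tempfile.mkdtemp(), "training_set.jsonl")
write_jsonl(sample_path, records)
loaded = read_jsonl(sample_path)
print(sorted(loaded[0].keys()), len(loaded))
```

A file written this way should be loadable with the same `load_dataset('json', data_files=..., split='train')` call used above.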



### 2. Load Base Model

Let's now load Mistral - `mistralai/Mistral-7B-v0.1` - using 4-bit quantization!

In [6]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

base_model_id = "mistralai/Mistral-7B-v0.1"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(base_model_id, quantization_config=bnb_config)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
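
To get a feel for why 4-bit loading matters, here is a rough back-of-the-envelope estimate of weight memory at different precisions. The parameter count (about 7.24 billion) is an assumption taken from the public model size, and the helper `weight_memory_gib` is hypothetical. The estimate ignores quantization constants, any layers left unquantized, activations, and optimizer state, so the real footprint will be somewhat higher.

```python
PARAM_COUNT = 7.24e9  # assumed approximate parameter count of Mistral-7B


def weight_memory_gib(num_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


for name, bits in [("fp16/bf16", 16), ("int8", 8), ("nf4 (4-bit)", 4)]:
    print(f"{name:>12}: {weight_memory_gib(PARAM_COUNT, bits):.1f} GiB")
```

Under these assumptions the 4-bit weights need roughly a quarter of the memory of the 16-bit weights.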

### 3. Tokenization

Set up the tokenizer. Add padding on the left as it [makes training use less memory](https://ai.stackexchange.com/questions/41485/while-fine-tuning-a-decoder-only-llm-like-llama-on-chat-dataset-what-kind-of-pa).


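Set up the tokenize function so that `labels` starts out as a copy of `input_ids`. This is basically what [self-supervised fine-tuning is](https://neptune.ai/blog/self-supervised-learning). The version used below additionally sets the prompt positions in `labels` to `-100` so that they are meant to be ignored by the loss: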

In [7]:
# tokenizer = AutoTokenizer.from_pretrained(
#     base_model_id,
#     model_max_length=512,
#     padding_side="left",
#     add_special_tokens=True
# )
# tokenizer.pad_token = tokenizer.eos_token

# def format_and_tokenize(data_point):
#     """
#     Formats and tokenizes the input and output text.
#
#     Args:
#     input_text (str): The user input text.
#     output_text (str): The assistant's output text.
#     tokenizer: The tokenizer to use for tokenization.
#
#     Returns:
#     dict: A dictionary with tokenized 'input_ids' and 'attention_mask'.
#     """
#     # Format the text
#     formatted_text = f"<s>[INST] {data_point['input']} [/INST] {data_point['output']}</s>"

#     # Tokenize the formatted text
#     tokenized_output = tokenizer(formatted_text, truncation=True, padding="max_length")

#     return tokenized_output

In [56]:
tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    model_max_length=512,
    padding_side="left",
    add_special_tokens=True
)
tokenizer.pad_token = tokenizer.eos_token

def format_and_tokenize(data_point):
    """
    Formats and tokenizes the input and output text, and masks the prompt part.

    Args:
    data_point (dict): A dictionary with 'input' and 'output' keys.

    Returns:
    dict: A dictionary with tokenized 'input_ids', 'labels', and 'attention_mask'.
    """
    # Format the text; the prompt is everything up to and including [/INST]
    prompt_text = f"<s>[INST] {data_point['input']} [/INST]"
    formatted_text = f"{prompt_text} {data_point['output']}</s>"

    # Tokenize the formatted text, left-padded to model_max_length (512)
    tokenized_output = tokenizer(formatted_text, truncation=True, padding="max_length", return_tensors='pt')
    input_ids = tokenized_output['input_ids'].squeeze(0)
    attention_mask = tokenized_output['attention_mask'].squeeze(0)

    # Labels start as a copy of input_ids
    labels = input_ids.clone()

    # Number of left-padding positions (where attention_mask == 0)
    num_pad = int((attention_mask == 0).sum())
    # Number of prompt tokens, tokenized the same way as the full text.
    # Note: the prompt tokenized on its own may differ by a token at the
    # boundary from the same prompt inside the full text.
    prompt_length = len(tokenizer.encode(prompt_text, truncation=True))

    # Mask padding and prompt so only the response contributes to the loss
    labels[:num_pad + prompt_length] = -100

    return {
        'input_ids': input_ids,
        'labels': labels,
        'attention_mask': attention_mask,
    }
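
To make the masking idea concrete without a tokenizer, here is a minimal sketch on plain Python lists. The token ids are toy values, not real tokenizer output, and the helper `left_pad_and_mask` is hypothetical. It left-pads a prompt-plus-answer sequence and sets the labels for the padding and the prompt to `-100`, the value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def left_pad_and_mask(prompt_ids, answer_ids, max_length, pad_id):
    """Left-pad prompt+answer to max_length and mask pad and prompt positions in labels."""
    ids = (prompt_ids + answer_ids)[:max_length]  # truncate on the right
    num_pad = max_length - len(ids)
    input_ids = [pad_id] * num_pad + ids
    attention_mask = [0] * num_pad + [1] * len(ids)
    # Everything up to the end of the prompt is excluded from the loss
    num_masked = min(max_length, num_pad + len(prompt_ids))
    labels = [IGNORE_INDEX] * num_masked + input_ids[num_masked:]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


# Toy token ids: 1 stands in for BOS, 2 for EOS (also used as the pad id here).
example = left_pad_and_mask(prompt_ids=[1, 10, 11, 12], answer_ids=[20, 21, 2], max_length=10, pad_id=2)
print(example["input_ids"])  # [2, 2, 2, 1, 10, 11, 12, 20, 21, 2]
print(example["labels"])     # [-100, -100, -100, -100, -100, -100, -100, 20, 21, 2]
```

Only the answer tokens, including the final EOS, keep their ids in `labels`, so only they would contribute to the loss.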


In [57]:
# # Sample data point
# sample_data_point = {
#     "input": "What is the capital of France?",
#     "output": "The capital of France is Paris."
# }

# # Call the function with the sample data point
# tokenized_data = format_and_tokenize(sample_data_point)
# print(tokenized_data)

Convert each sample into a prompt. The approach comes from [this notebook](https://github.com/samlhuillier/viggo-finetune/blob/main/llama/fine-tune-code-llama.ipynb); here the prompt uses the `[INST] ... [/INST]` template from `format_and_tokenize`.

Reformat the prompt and tokenize each sample:

In [58]:
tokenized_train_dataset = train_dataset.map(format_and_tokenize)
tokenized_val_dataset = eval_dataset.map(format_and_tokenize)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/30 [00:00<?, ? examples/s]

Check that `input_ids` is padded on the left with the `eos_token` (2), that the prompt starts with a `bos_token` (1), and that an `eos_token` (2) is appended at the end.


In [59]:
print(tokenized_train_dataset[4])

{'input': 'I want to buy a used car in Santa Clara. Should I buy a Honda Civic or a Toyota Prius?', 'output': 'The Honda Civic and the Toyota Prius are two of the most trusted compact sedans available today. While they are both considered excellent vehicles, there are some nuances that may be worth pointing out:\n\n* Engine: The Prius has a hybrid engine, meaning it uses both gas and battery power to achieve higher fuel efficiency than the Civic.\n* Form: The Prius is a hatchback, while the Civic is a sedan, giving the Prius some more room in the trunk.\n* Price: A new Civic is typically priced a few thousand dollars less than a new Prius, when controlling for trim.\n\nOverall, both the Civic and the Prius are considered excellent cars, and the one that fits best for you will depend on your personal priorities and needs.', 'input_ids': [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2

Check that a sample has the max length, i.e. 512.

In [60]:
print(len(tokenized_train_dataset[4]['input_ids']))

512
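
The visual checks above can also be done programmatically. The following is a small sketch on a plain list of ids; the helper `inspect_padding` and the sample ids are hypothetical, and the default ids (pad and EOS = 2, BOS = 1) follow the values stated above. Note that a sample truncated at the maximum length would not end with EOS.

```python
def inspect_padding(input_ids, pad_id=2, bos_id=1, eos_id=2):
    """Return (leading pad count, content starts with BOS, content ends with EOS)."""
    num_pad = 0
    while num_pad < len(input_ids) and input_ids[num_pad] == pad_id:
        num_pad += 1
    content = input_ids[num_pad:]
    starts_with_bos = bool(content) and content[0] == bos_id
    ends_with_eos = bool(content) and content[-1] == eos_id
    return num_pad, starts_with_bos, ends_with_eos


# Toy ids, not real tokenizer output.
print(inspect_padding([2, 2, 2, 1, 50, 51, 2]))  # (3, True, True)
```

On the real data this could be called as `inspect_padding(tokenized_train_dataset[4]['input_ids'])`.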


#### How does the base model do?

Let's grab a test input (`input`) and desired output (`output`) pair to see how the base model does on it.

We can see it doesn't do very well out of the box.

### 4. Set Up LoRA

Now, to start our fine-tuning, we have to apply some preprocessing to the model to prepare it for training. For that, use the `prepare_model_for_kbit_training` function from PEFT.

In [61]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [62]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

Let's print the model to examine its layers, as we will apply QLoRA to all the linear layers of the model. Those layers are `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`, and `lm_head`.

Here we define the LoRA config.

`r` is the rank of the low-rank matrix used in the adapters, which thus controls the number of parameters trained. A higher rank will allow for more expressivity, but there is a compute tradeoff.

`alpha` is the scaling factor for the learned weights. The weight matrix is scaled by `alpha/r`, and thus a higher value for `alpha` assigns more weight to the LoRA activations.

The values used in the QLoRA paper were `r=64` and `lora_alpha=16` (a scaling factor of `16/64 = 0.25`), and these are said to generalize well, but we will use `r=8` and `lora_alpha=16` (a scaling factor of `16/8 = 2`) so that we have more emphasis on the new fine-tuned data while also reducing computational complexity.
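
The two knobs can be compared with a little arithmetic. This is a minimal sketch, assuming the standard LoRA formulation in which the update is scaled by `alpha/r` and each adapted linear layer gets an `r x in_features` matrix A and an `out_features x r` matrix B. The helper names are hypothetical; the 4096 x 4096 shape is the `q_proj` shape shown in the model printout further down.

```python
def lora_scaling(r, lora_alpha):
    """Scaling factor applied to the low-rank update."""
    return lora_alpha / r


def lora_params(r, in_features, out_features):
    """Trainable parameters added to one linear layer: A (r x in) plus B (out x r)."""
    return r * (in_features + out_features)


print(lora_scaling(64, 16))          # 0.25
print(lora_scaling(8, 16))           # 2.0
print(lora_params(8, 4096, 4096))    # 65536
print(lora_params(64, 4096, 4096))   # 524288
```

So moving from `r=64` to `r=8` at the same `lora_alpha` trains one eighth of the adapter parameters per layer while scaling the update eight times more strongly.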

In [63]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

# Apply the accelerator. You can comment this out to remove the accelerator.
model = accelerator.prepare_model(model)

trainable params: 21260288 || all params: 3773331456 || trainable%: 0.5634354746703705
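
Both numbers in that printout can be reproduced by hand. The sketch below assumes the published Mistral-7B dimensions (hidden size 4096, key/value projection size 1024, MLP size 14336, vocabulary 32000, 32 layers) and assumes that the 4-bit layers store two weights per element while the embeddings and `lm_head` stay unquantized. These are my assumptions rather than something the notebook states, but under them the arithmetic matches the output exactly.

```python
R = 8
HIDDEN, KV_DIM, INTERMEDIATE, VOCAB, NUM_LAYERS = 4096, 1024, 14336, 32000, 32

# (in_features, out_features) of each targeted linear layer in one decoder block
BLOCK_LINEARS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_f, out_f, r=R):
    """Adapter parameters for one linear layer: A (r x in) plus B (out x r)."""
    return r * (in_f + out_f)


# Trainable parameters: adapters on every block linear plus lm_head
lora_total = NUM_LAYERS * sum(lora_params(i, o) for i, o in BLOCK_LINEARS.values())
lora_total += lora_params(HIDDEN, VOCAB)

# Reported "all params": 4-bit weights are counted as packed elements
block_weights = NUM_LAYERS * sum(i * o for i, o in BLOCK_LINEARS.values())
packed_4bit = block_weights // 2            # two 4-bit values per stored element
unquantized = 2 * VOCAB * HIDDEN            # embed_tokens and lm_head
norms = (2 * NUM_LAYERS + 1) * HIDDEN       # normalization weights
all_params = packed_4bit + unquantized + norms + lora_total

print(lora_total)   # 21260288
print(all_params)   # 3773331456
print(100 * lora_total / all_params)
```

One consequence, if these assumptions hold: the reported `trainable%` is measured against the packed count, so relative to the full set of roughly 7.24 billion weights the adapters are closer to 0.29%.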


See how the model looks different now, with the LoRA adapters added:

In [64]:
print(model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): lora.Linear4bit(
                (base_layer): Linea


Let's use Weights & Biases to track our training metrics. You'll need to supply an API key when prompted. Feel free to skip this if you'd like, and just comment out the `wandb` parameters in the `Trainer` definition below.

In [65]:
# # !pip install -q wandb -U

# wandb.login()

# wandb_project = "viggo-finetune"
# if len(wandb_project) > 0:
#     os.environ["WANDB_PROJECT"] = wandb_project

[34m[1mwandb[0m: Currently logged in as: [33mhjz[0m. Use [1m`wandb login --relogin`[0m to force relogin


### 5. Run Training!

I used 500 steps at first, but the model had not converged by then, so the run below trains for longer (see `num_training_steps` in the training cell).

A note on training: you can set `max_steps` high initially and watch for the step at which your model's performance starts to degrade. That is where you'll find the sweet spot for how many steps to run. For example, say you start with 1000 steps and find that at around 500 steps the model starts overfitting: the validation loss goes up (bad) while the training loss keeps going down significantly, meaning the model is learning the training set really well but is unable to generalize to new datapoints. In that case, 500 steps would be your sweet spot, so you would use the `checkpoint-500` directory in your output dir (`output_dir` in the training cell, e.g. `mistral-lima-finetune-v4`) as your final model in step 6 below.

You can interrupt the process via Kernel -> Interrupt Kernel in the top nav bar once you realize you don't need to train any longer.

In [66]:
if torch.cuda.device_count() > 1: # If more than 1 GPU
    model.is_parallelizable = True
    model.model_parallel = True

In [67]:
# ### Restart from the checkpoint
# project = "lima-finetune"
# base_model_name = "mistral"
# run_name = base_model_name + "-" + project
# output_dir = "./" + run_name


# tokenizer.pad_token = tokenizer.eos_token

# trainer = transformers.Trainer(
#     model=model,
#     train_dataset=tokenized_train_dataset,
#     eval_dataset=tokenized_val_dataset,
#     args=transformers.TrainingArguments(
#         output_dir=output_dir,
#         warmup_steps=5,
#         per_device_train_batch_size=2,
#         gradient_checkpointing=True,
#         gradient_accumulation_steps=4,
#         max_steps=1200,
#         learning_rate=1e-5, # Want about 10x smaller than the Mistral learning rate
#         logging_steps=50,
#         bf16=True,   #True for bf16
#         optim="paged_adamw_8bit",
#         logging_dir="./logs",        # Directory for storing logs
#         save_strategy="steps",       # Save the model checkpoint every logging step
#         save_steps=50,                # Save checkpoints every 50 steps
#         evaluation_strategy="steps", # Evaluate the model every logging step
#         eval_steps=50,               # Evaluate and save checkpoints every 50 steps
#         do_eval=True,                # Perform evaluation at the end of training
#         report_to="wandb",           # Comment this out if you don't want to use Weights & Biases
#         run_name=f"{run_name}-{datetime.now().strftime('%Y-%m-%d-%H-%M')}"          # Name of the W&B run (optional)
#     ),
#     data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
# )


# model.config.use_cache = False  # silence the warnings. Please re-enable for inference!

# checkpoint_path = "mistral-lima-finetune/checkpoint-100"
# trainer.train(resume_from_checkpoint=checkpoint_path)

# # trainer.train()

Apply hyperparameters based on the LIMA paper.
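
The number of training steps in the next cell follows from simple arithmetic. Here is a minimal sketch of that calculation, generalized to gradient accumulation and multiple devices. The helper `total_optimizer_steps` is hypothetical, and the single-device setting is an assumption. The 1000 examples, batch size 4, and 30 epochs are the values used in this notebook.

```python
def total_optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps, num_devices, num_epochs):
    """Optimizer steps for num_epochs passes, dropping any final partial batch."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = num_examples // effective_batch
    return steps_per_epoch * num_epochs


# Values from this notebook, assuming a single device
print(total_optimizer_steps(1000, 4, 1, 1, 30))  # 7500

# Same data with 4 accumulation steps: effective batch of 16
print(total_optimizer_steps(1000, 4, 4, 1, 30))  # 1860
```

With an effective batch size of 4, each epoch is 250 steps, so a checkpoint such as `checkpoint-2500` corresponds to about 10 epochs.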

In [None]:
num_train_examples = len(tokenized_train_dataset)
gradient_accumulation_steps = 1
num_epochs = 30
per_device_train_batch_size = 4
# Optimizer steps for num_epochs passes over the data on a single device
num_training_steps = (num_train_examples // (per_device_train_batch_size * gradient_accumulation_steps)) * num_epochs

# Run name and output directory
project = "lima-finetune-v4"
base_model_name = "mistral"
run_name = base_model_name + "-" + project
output_dir = "./" + run_name


tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    args=transformers.TrainingArguments(
        output_dir=output_dir,
        warmup_steps=0,
        per_device_train_batch_size=per_device_train_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        gradient_checkpointing=True,
        max_steps=num_training_steps,
        weight_decay=0.1,
        lr_scheduler_type="linear",  # Linear decay of learning rate
        adam_beta1=0.9,
        adam_beta2=0.95,
        learning_rate=1e-5, # Want about 10x smaller than the Mistral learning rate
        logging_steps=50,            # Log the training loss every 50 steps
        bf16=True,                   # Train in bfloat16
        optim="paged_adamw_8bit",
        logging_dir="./logs",        # Directory for storing logs
        save_strategy="steps",       # Save checkpoints on a step schedule
        save_steps=50,               # Save a checkpoint every 50 steps
        evaluation_strategy="steps", # Evaluate on a step schedule
        eval_steps=50,               # Evaluate every 50 steps
        do_eval=True,                # Enable evaluation on eval_dataset
        report_to="wandb",           # Comment this out if you don't want to use Weights & Biases
        run_name=f"{run_name}-{datetime.now().strftime('%Y-%m-%d-%H-%M')}"          # Name of the W&B run (optional)
    ),
    # Note: with mlm=False, this collator builds `labels` from `input_ids` (masking pad tokens),
    # which replaces the `labels` column produced by format_and_tokenize at batch time.
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)


model.config.use_cache = False  # silence the warnings. Please re-enable for inference!

checkpoint_path = "mistral-lima-finetune-v4/checkpoint-2500"
trainer.train(resume_from_checkpoint=checkpoint_path)

# trainer.train()

Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


Step,Training Loss,Validation Loss
2550,0.8552,2.324494
2600,0.9421,2.275837
2650,0.9035,2.256337
2700,0.9481,2.25073
2750,0.9832,2.219952
2800,0.8584,2.314089
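
Picking the "sweet spot" checkpoint from a log like this can be done mechanically. A minimal sketch, using only the rows shown above (copied by hand into a list); it says nothing about steps before 2550, which are not in this output.

```python
# (step, training loss, validation loss), copied from the log above
log = [
    (2550, 0.8552, 2.324494),
    (2600, 0.9421, 2.275837),
    (2650, 0.9035, 2.256337),
    (2700, 0.9481, 2.25073),
    (2750, 0.9832, 2.219952),
    (2800, 0.8584, 2.314089),
]

# The row with the lowest validation loss marks the candidate checkpoint
best_step, _, best_val = min(log, key=lambda row: row[2])
print(f"checkpoint-{best_step}", best_val)  # checkpoint-2750 2.219952
```

Because checkpoints are saved every 50 steps, the selected step maps directly to a `checkpoint-<step>` directory under the output dir. The wide gap between training and validation loss in these rows suggests an earlier checkpoint may be worth comparing as well.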




### 6. Drum Roll... Try the Trained Model!

It's a good idea to kill the current process so that you don't run out of memory loading the base model again on top of the model we just trained. Go to `Kernel > Restart Kernel` or kill the process via the Terminal (`nvidia-smi` > `kill [PID]`).

By default, the PEFT library will only save the QLoRA adapters, so we need to first load the base Mistral model from the Hugging Face Hub:


In [24]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

In [25]:
base_model_id = "mistralai/Mistral-7B-v0.1"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,  # Mistral, same as before
    quantization_config=bnb_config,  # Same quantization config as before
    device_map="auto",
    trust_remote_code=True,
#     use_auth_token=True
)

eval_tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    add_bos_token=True,
    trust_remote_code=True,
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Now load the QLoRA adapter from the appropriate checkpoint directory, i.e. the best performing model checkpoint:

In [29]:
from peft import PeftModel

# ft_model = PeftModel.from_pretrained(base_model, "mistral-viggo-finetune/checkpoint-1000")
# ft_model = PeftModel.from_pretrained(base_model, "mistral-viggo-finetune/checkpoint-900")
ft_model = PeftModel.from_pretrained(base_model, "mistral-lima-finetune-v2/checkpoint-2400")

and run your inference!

Let's try a prompt from the evaluation set and see how the fine-tuned model responds.

In [33]:
from transformers import AutoTokenizer, PreTrainedModel

def generate_response(eval_prompt, model, tokenizer):
    """
    Generates a response for the given evaluation prompt using the specified model.

    Args:
    eval_prompt (str): The evaluation prompt.
    model (PreTrainedModel): The fine-tuned model.
    tokenizer: The tokenizer used for the model.

    Returns:
    str: The generated response text.
    """
    # Preprocess the prompt
    formatted_prompt = f"<s>[INST] {eval_prompt} [/INST]"

    # Tokenize the prompt (the result holds both input_ids and attention_mask)
    inputs = tokenizer(formatted_prompt, return_tensors='pt').to("cuda")

    # Generate a response; max_length counts prompt tokens plus generated tokens
    output_ids = model.generate(**inputs, max_length=512)

    # Decode the full sequence to text; the decoded text includes the prompt
    response = tokenizer.decode(output_ids[0], skip_special_tokens=True)

    return response


# Example prompt
eval_prompt = eval_dataset[0]['input']

eval_tokenizer.pad_token_id = eval_tokenizer.eos_token_id  # id 2 for this tokenizer

# Generate the response
response = generate_response(eval_prompt, ft_model, eval_tokenizer)
print(response)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


KeyboardInterrupt: 
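
Two details of `generate_response` are worth spelling out. For a decoder-only model, the generated sequence starts with the prompt tokens, so the decoded text repeats the prompt. Also, `max_length=512` counts prompt tokens plus new tokens, so a long prompt leaves less room for the answer. The sketch below illustrates both on toy id lists; the helper names and ids are hypothetical, not real model output.

```python
def split_generation(output_ids, prompt_length):
    """Split generated ids into (prompt part, newly generated part)."""
    return output_ids[:prompt_length], output_ids[prompt_length:]


def new_token_budget(prompt_length, max_length):
    """Tokens left for generation when max_length counts prompt plus new tokens."""
    return max(0, max_length - prompt_length)


# Toy ids standing in for tokenizer and model output.
prompt_ids = [1, 1, 100, 101, 102]
output_ids = prompt_ids + [200, 201, 2]

_, new_ids = split_generation(output_ids, len(prompt_ids))
print(new_ids)                                  # [200, 201, 2]
print(new_token_budget(len(prompt_ids), 512))   # 507
```

In the real function, the same idea would mean slicing `output_ids[0]` at the number of prompt tokens before calling `tokenizer.decode`, so that only the model's answer is returned.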

In [31]:
# Example prompt
eval_prompt = eval_dataset[1]['input']

# Generate the response
response = generate_response(eval_prompt, ft_model, eval_tokenizer)
print(response)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[INST] Why do people write #!/usr/bin/env python on the first line of a Python script? [/INST] The ```#!``` line is called a shebang and it tells the operating system how to run the script.

The ```python``` executable is a program that can run Python scripts. The ```env``` program is a program that can find other programs. The ```/usr/bin/env``` executable is a program that can find the ```python``` executable.

So the ```#!``` line tells the operating system to run the ```python``` program, which can be found by the ```env``` program.

The ```env``` program is useful because it allows you to run the ```python``` program that is in your ```PATH``` environment variable. This is useful if you have multiple versions of Python installed, and you want to run the correct one.

The ```env``` program also allows you to run the ```python``` program with different environment variables. This is useful if you want to run the script with different environment variables than the ones that are curr