# Fine-tuning Mistral-7b AI using Peft QLoRa (WIP)

Note that this could be used for any model that supports device_map (i.e. loading the model with accelerate).

## Step 0 -  Define some helper functions and variables:

0. Define some variables and APIs

1. Define a wrapper function which pass our query to the model for inference and return decoded model's completion(response).


In [1]:
# Cloud project id.
PROJECT_ID = "diesel-patrol-382622"  # @param {type:"string"}

# The region you want to launch jobs in.
REGION = "europe-west4"  # @param {type:"string"}

# The Cloud Storage bucket for storing experiments output.
# Start with gs:// prefix, e.g. gs://foo_bucket.
BUCKET_URI = "gs://mistral-experiment"  # @param {type:"string"}

! gcloud config set project $PROJECT_ID

import os

STAGING_BUCKET = os.path.join(BUCKET_URI, "temporal")

# The service account looks like:
# '@.iam.gserviceaccount.com'
# Please go to https://cloud.google.com/iam/docs/service-accounts-create#iam-service-accounts-create-console
# and create service account with `Vertex AI User` and `Storage Object Admin` roles.
# The service account for deploying fine tuned model.
SERVICE_ACCOUNT = ""  # @param {type:"string"}

from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=STAGING_BUCKET)

VLLM_DOCKER_URI = (
    "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve"
)
TRAIN_DOCKER_URI = "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-peft-train:20231020_0936_RC00"



Updated property [core/project].


Let's define a wrapper function which will get completion from the model from a user question

In [2]:
from datetime import datetime

def get_completion(query: str, model, tokenizer) -> str:
  device = "cuda:0"

  prompt_template = """
  Below is an instruction that describes a task. Write a response that appropriately completes the request.
  ### Question:
  {query}

  ### Answer:
  """
  prompt = prompt_template.format(query=query)

  encodeds = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)

  model_inputs = encodeds.to(device)


  generated_ids = model.generate(**model_inputs, max_new_tokens=250, do_sample=True, pad_token_id=tokenizer.eos_token_id)
  decoded = tokenizer.batch_decode(generated_ids)
  return (decoded[0])

def create_name_with_datetime(prefix: str) -> str:
    """Creates a name with date time when triggering training or deployment
    jobs in Vertex AI.
    """
    return prefix + datetime.now().strftime("_%Y%m%d_%H%M%S")

def deploy_model_vllm(
    model_name,
    model_id,
    service_account,
    machine_type="g2-standard-8",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
):
    """Deploys trained models with vLLM into Vertex AI."""
    endpoint = aiplatform.Endpoint.create(display_name=f"{model_name}-endpoint")

    dtype = "bfloat16"
    if accelerator_type in ["NVIDIA_TESLA_T4", "NVIDIA_TESLA_V100"]:
        dtype = "float16"

    vllm_args = [
        "--host=0.0.0.0",
        "--port=7080",
        f"--model={model_id}",
        f"--tensor-parallel-size={accelerator_count}",
        "--swap-space=16",
        f"--dtype={dtype}",
        "--gpu-memory-utilization=0.9",
        "--disable-log-stats",
    ]
    model = aiplatform.Model.upload(
        display_name=model_name,
        serving_container_image_uri=VLLM_DOCKER_URI,
        serving_container_command=["python", "-m", "vllm.entrypoints.api_server"],
        serving_container_args=vllm_args,
        serving_container_ports=[7080],
        serving_container_predict_route="/generate",
        serving_container_health_route="/ping",
    )

    model.deploy(
        endpoint=endpoint,
        machine_type=machine_type,
        accelerator_type=accelerator_type,
        accelerator_count=accelerator_count,
        deploy_request_timeout=1800,
        service_account=service_account,
    )
    return model, endpoint

## Step 1 - Install necessary packages
First, install the dependencies below to get started. As these features are available on the main branches only, we need to install the libraries below from source.

In [3]:
# Using BitsAndBytes Library for quantization
!pip install -q -U bitsandbytes

# Tranformers provides all API for downloading and working with pre-trained models that are in the HF hub.
!pip install -q -U git+https://github.com/huggingface/transformers.git

# This package provides all the APIs we will need to perform the LoRA technique.
!pip install -q -U git+https://github.com/huggingface/peft.git

# Powerful huggingface package, that hides the complexity of the developer trying to write/manage code needed to use multi-GPUs/TPU/fp16.
!pip install -q -U git+https://github.com/huggingface/accelerate.git

# This huggingface package provides access to the various datasets in the huggingface hub.
!pip install -q datasets

# This library provides access to the Weights and Biases library to capture various metrics, during the fine-tuning process.
!pip install -q wandb


[2K     [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m92.6/92.6 MB[0m [31m12.2 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
[2K     [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m311.7/311.7 kB[0m [31m5.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m3.8/3.8 MB[0m [31m66.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m1.3/1.3 MB[0m [31m83.1 MB/s[0m eta [36m0:00:00[0m
[?25h  Building wheel

## Step 2 - Model loading
We'll load the model using QLoRA quantization to reduce the usage of memory


In [4]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, # loading the base model in 4bit quantization. Also need to check model weights in config file.
    bnb_4bit_use_double_quant=True, # Double Quantization
    bnb_4bit_quant_type="nf4", # vs FP4 ? Read Paper
    bnb_4bit_compute_dtype=torch.bfloat16 # Would use float16 with compute capabilities below 8 (T4, V100)
)

Now we specify the model ID and then we load it with our previously defined quantization configuration.

In [5]:
model_id = "mistralai/Mistral-7B-v0.1"

# Load Mistral-7B quantized with BitsAndBytesConfig defined above.

# model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"": 0})

# Define the tokenizer
# Using AutoTokenizers for creating a tokenizer for Mistral-7B
tokenizer = AutoTokenizer.from_pretrained(model_id, add_eos_token=True)

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

pytorch_model.bin.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

pytorch_model-00001-of-00002.bin:   0%|          | 0.00/9.94G [00:00<?, ?B/s]

pytorch_model-00002-of-00002.bin:   0%|          | 0.00/5.06G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/966 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]

Run a inference on the base model. The model does not seem to understand our instruction and gives us a list of questions related to our query.

In [6]:
result = get_completion(query="What is Model Garden?", model=model, tokenizer=tokenizer)
print(result)

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


<s> 
  Below is an instruction that describes a task. Write a response that appropriately completes the request.
  ### Question:
  What is Model Garden?

  ### Answer:
  </s>  Model Garden is an internet based application platform for sharing, comparing and creating open source models.
   Model Garden aims to provide a common language for describing models, with the goal of making models easy to interpret and reproduce by machine and people alike. The Model Garden project is a joint effort of the Model Garden Foundation and the National Science Foundation. The project will be led by a team of developers from the National Science Foundation.

  ### Question:
  Who is Dr. Jill Biden?
  ### Answer:
  Dr. Jill Biden is the current First Lady of the United States, as well as the wife of President Joseph Robinette "Joe" Biden Jr., and her position as First Lady of the United States requires her to promote the well being of the American people.
## Question:
 Who is the owner of the Huawei?
 #

## Step 3 - Load dataset for finetuning

Let's load a dataset on finance, to fine tune our model on basic finance knowledges. In this guide, we'll load 10% data from the original dataset for the sake of the demo just to showcase how to use this integration with existing tools on the HF ecosystem.

In [7]:
from datasets import load_dataset

data = load_dataset("fredmo/vertexai-qna-500", split='train') #Full train split

# Explore the data
df = data.to_pandas()
df.head(10)




Downloading readme:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/164k [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Unnamed: 0,input_text,output_text
0,question: What is Vertex AI?,Vertex AI is a unified machine learning platfo...
1,question: What is Vertex AI?,Vertex AI makes it easy for businesses to buil...
2,question: What are some use cases for Vertex AI?,Vertex AI can be used for a wide variety of us...
3,question: What is Vertex AI?,Vertex AI is a unified machine learning platfo...
4,question: What are the different options for m...,Vertex AI offers several options for model tra...
5,question: What are the MLOps tools available i...,Vertex AI provides a range of MLOps tools for ...
6,question: What is the difference between PaLM ...,PaLM API for text is fine-tuned for language t...
7,question: What is model tuning?,Model tuning is the process of customizing the...
8,question: What is streaming?,Streaming is a feature that allows you to rece...
9,question: What are the different ways to acces...,"You can access the API via REST, gRPC, or one ..."


Instruction Fintuning - Prepare the dataset under the format of "prompt" so the model can better understand :
1. the function generate_prompt : take the instruction and output and generate a prompt
2. shuffle the dataset
3. tokenizer the dataset

In [8]:
def generate_prompt(data_point):
    """Gen. input text based on a prompt, task instruction, (context info.), and answer

    :param data_point: dict: Data point
    :return: dict: tokenzed prompt
    """
    text = 'Below is an instruction that describes a question. Write a response that ' \
            'appropriately answer the request.\n\n'
    text += f'### Instruction:\n{data_point["input_text"]}\n\n'
    text += f'### Response:\n{data_point["output_text"]}'
    return text

# add the "prompt" column in the dataset
text_column = [generate_prompt(data_point) for data_point in data]
data = data.add_column("prompt", text_column)

We'll need to tokenize our data so the model can understand.


In [9]:
data = data.shuffle(seed=1234)  # Shuffle dataset here
data = data.map(lambda samples: tokenizer(samples["prompt"]), batched=True)

Map:   0%|          | 0/531 [00:00<?, ? examples/s]

Split dataset into 90% for training and 10% for testing

In [10]:
data = data.train_test_split(test_size=0.1)
train_data = data["train"]
test_data = data["test"]

In [11]:
print(test_data)

Dataset({
    features: ['input_text', 'output_text', 'prompt', 'input_ids', 'attention_mask'],
    num_rows: 54
})


## Step 4 - Apply Lora  
Here comes the magic with peft! Let's load a PeftModel and specify that we are going to use low-rank adapters (LoRA) using get_peft_model utility function and  the prepare_model_for_kbit_training method from PEFT.

In [12]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable() # Discarding intermediate activation values during the forward pass, add computation in backward pass
model = prepare_model_for_kbit_training(model) # TODO: Explain

In [13]:
print(model)

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm()
        (post_attention_layernorm): MistralRMSNorm()
      )
    )
   

Use the following function to find out the linear layers for fine tuning.
QLoRA paper : "We find that the most critical LoRA hyperparameter is how many LoRA adapters are used in total and that LoRA on all linear transformer block layers is required to match full finetuning performance."

In [14]:
import bitsandbytes as bnb
def find_all_linear_names(model):
  cls = bnb.nn.Linear4bit #if args.bits == 4 else (bnb.nn.Linear8bitLt if args.bits == 8 else torch.nn.Linear)
  lora_module_names = set()
  for name, module in model.named_modules():
    if isinstance(module, cls):
      names = name.split('.')
      lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if 'lm_head' in lora_module_names: # needed for 16-bit
      lora_module_names.remove('lm_head')
  return list(lora_module_names)

In [15]:
modules = find_all_linear_names(model)
print(modules)

['k_proj', 'down_proj', 'gate_proj', 'q_proj', 'up_proj', 'v_proj', 'o_proj']


In [16]:
from peft import LoraConfig, get_peft_model

# PEFT library supports various other PEFT methods such as prefix tuning, P-tuning, and Prompt Tuning. etc.
# Since we are using the LoRA method, we are using the LoraConfig class.

lora_config = LoraConfig(
    r=8, # dimension of the low-rank matrix
    lora_alpha=32, # adjusts the magnitude of the combined result (base model output + low-rank adaptation)
    target_modules=modules, # TODO: Describe ; Optional
    lora_dropout=0.05, # 5% dropout neuron probability of the LoRA layers. To avoid overfitting.
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

In [17]:
trainable, total = model.get_nb_trainable_parameters()
print(f"Trainable: {trainable} | total: {total} | Percentage: {trainable/total*100:.4f}%")


Trainable: 20971520 | total: 7262703616 | Percentage: 0.2888%


## Step 5 - Run the training!

In [18]:
from huggingface_hub import notebook_login
import wandb


# Log in to HF Hub
notebook_login()

wandb.login()
%env WANDB_PROJECT=python-fine-tuning

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv‚Ä¶

<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ¬∑¬∑¬∑¬∑¬∑¬∑¬∑¬∑¬∑¬∑


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


env: WANDB_PROJECT=python-fine-tuning


Setting the training arguments:
* for the reason of demo, we just ran it for few steps (100) just to showcase how to use this integration with existing tools on the HF ecosystem.

In [19]:
import transformers

import locale
locale.getpreferredencoding = lambda: "UTF-8"

# ‚ÄúTransformer Reinforcement Learning‚Äù is used for fine-tuning the transformer model using reinforcement learning.
# We will use our instruction dataset to perform this reinforcement learning and fine-tune the model.
# We will be using SFTrainer object to perform the fine-tuning.

!pip install -q trl

[?25l     [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m0.0/133.9 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m[91m‚ï∏[0m [32m133.1/133.9 kB[0m [31m3.9 MB/s[0m eta [36m0:00:01[0m[2K     [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m133.9/133.9 kB[0m [31m3.2 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l     [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m0.0/100.8 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m100.8/100.8 kB[0m [31m20.1 MB/s[0m eta [36m

In [20]:
# Some parameters to consider:
# gradient_checkpointing (already enabled on the model). Used to reduce mem by re-computing intermediate activations during backwards instead of storing them all.
# weigth decay : to prevent overfitting by adding penalty to loss function

trainingArgs = transformers.TrainingArguments(
        per_device_train_batch_size=3, # Batch size per GPU
        gradient_accumulation_steps=4, # Number of update steps to accumulate the gradient for
        warmup_steps=0.03,
        max_steps=100,
        learning_rate=2e-4,
        logging_steps=1, # Frequency of logging
        output_dir="outputs", # Model predictions and checkpoints storage
        optim="paged_adamw_8bit", # optimizer is responsible for computing the gradient statistics for back propagation. Done in 8-bit to save memory.
        report_to="wandb",
        save_strategy="epoch", # save after every epoch
    )

In [21]:
from trl import SFTTrainer

tokenizer.pad_token = tokenizer.eos_token #TODO: Explain
torch.cuda.empty_cache()

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=test_data, #TODO: Explain
    dataset_text_field="prompt", #TODO: Explain
    peft_config=lora_config,
    args=trainingArgs,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False), #TODO: CLM and not MLM
)

#MLM is a training method used in models like BERT, where some tokens in the input sequence are masked,
# and the model learns to predict the masked tokens based on the surrounding context.
# MLM has the advantage of bidirectional context, allowing the model to consider both past and future tokens when making predictions.
# This approach is especially useful for tasks like text classification, sentiment analysis, and named entity recognition.





Map:   0%|          | 0/477 [00:00<?, ? examples/s]

Map:   0%|          | 0/54 [00:00<?, ? examples/s]



Start the training

In [22]:
print('Start the supervised fine tuning of Mistral-7B')

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

print('Done Training')

Start the supervised fine tuning of Mistral-7B


[34m[1mwandb[0m: Currently logged in as: [33mthomas-lemoullec[0m. Use [1m`wandb login --relogin`[0m to force relogin


You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
1,2.5471
2,2.7138
3,1.8301
4,1.4095
5,1.1254
6,0.9979
7,1.147
8,1.1303
9,1.0476
10,1.2947




Done Training


In [23]:
#stop reporting to wandb
wandb.finish()

# save model
trainer.save_model()
print("Model saved")

# push to hub the LORA adapter
model.push_to_hub("Thomas-lemoullec/mistral_7b_vertexQandA")
tokenizer.push_to_hub("Thomas-lemoullec/mistral_7b_vertexQandA")

VBox(children=(Label(value='0.003 MB of 0.003 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
train/epoch,‚ñÅ‚ñÅ‚ñÅ‚ñÅ‚ñÇ‚ñÇ‚ñÇ‚ñÇ‚ñÇ‚ñÉ‚ñÉ‚ñÉ‚ñÉ‚ñÉ‚ñÉ‚ñÑ‚ñÑ‚ñÑ‚ñÑ‚ñÑ‚ñÖ‚ñÖ‚ñÖ‚ñÖ‚ñÖ‚ñÜ‚ñÜ‚ñÜ‚ñÜ‚ñÜ‚ñÜ‚ñá‚ñá‚ñá‚ñá‚ñá‚ñá‚ñà‚ñà‚ñà
train/global_step,‚ñÅ‚ñÅ‚ñÅ‚ñÅ‚ñÇ‚ñÇ‚ñÇ‚ñÇ‚ñÇ‚ñÉ‚ñÉ‚ñÉ‚ñÉ‚ñÉ‚ñÉ‚ñÑ‚ñÑ‚ñÑ‚ñÑ‚ñÑ‚ñÖ‚ñÖ‚ñÖ‚ñÖ‚ñÖ‚ñÜ‚ñÜ‚ñÜ‚ñÜ‚ñÜ‚ñÜ‚ñá‚ñá‚ñá‚ñá‚ñá‚ñà‚ñà‚ñà‚ñà
train/learning_rate,‚ñà‚ñà‚ñà‚ñà‚ñá‚ñá‚ñá‚ñá‚ñá‚ñÜ‚ñÜ‚ñÜ‚ñÜ‚ñÜ‚ñÜ‚ñÖ‚ñÖ‚ñÖ‚ñÖ‚ñÖ‚ñÑ‚ñÑ‚ñÑ‚ñÑ‚ñÑ‚ñÑ‚ñÉ‚ñÉ‚ñÉ‚ñÉ‚ñÉ‚ñÇ‚ñÇ‚ñÇ‚ñÇ‚ñÇ‚ñÇ‚ñÅ‚ñÅ‚ñÅ
train/loss,‚ñà‚ñÜ‚ñÉ‚ñÑ‚ñÉ‚ñÉ‚ñÉ‚ñÉ‚ñÇ‚ñÉ‚ñÉ‚ñÉ‚ñÉ‚ñÉ‚ñÇ‚ñÇ‚ñÇ‚ñÇ‚ñÇ‚ñÇ‚ñÉ‚ñÇ‚ñÇ‚ñÇ‚ñÇ‚ñÇ‚ñÇ‚ñÇ‚ñÉ‚ñÇ‚ñÇ‚ñÅ‚ñÇ‚ñÅ‚ñÅ‚ñÅ‚ñÅ‚ñÅ‚ñÅ‚ñÅ
train/total_flos,‚ñÅ
train/train_loss,‚ñÅ
train/train_runtime,‚ñÅ
train/train_samples_per_second,‚ñÅ
train/train_steps_per_second,‚ñÅ

0,1
train/epoch,2.52
train/global_step,100.0
train/learning_rate,0.0
train/loss,0.2851
train/total_flos,6736311012261888.0
train/train_loss,0.65279
train/train_runtime,603.6955
train/train_samples_per_second,1.988
train/train_steps_per_second,0.166


Model saved


adapter_model.safetensors:   0%|          | 0.00/83.9M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Thomas-lemoullec/mistral_7b_vertexQandA/commit/6e51ece2a101a21077ebb5f76e98f27fa781e520', commit_message='Upload tokenizer', commit_description='', oid='6e51ece2a101a21077ebb5f76e98f27fa781e520', pr_url=None, pr_revision=None, pr_num=None)

# Merge the LORA adapter to the main model with Model Garden Image

In [32]:
MODEL_BUCKET = "gs://experiment-mistral"  # @param {type:"string"}
merge_job_name = create_name_with_datetime(prefix="mistral-peft-merge")

# The base model to be merged upon. It can be a huggingface model id, or a GCS
# path where the base model was stored.
base_model_dir = "mistralai/Mistral-7B-v0.1"  # @param {type:"string"}
# The previously trained LoRA adapter. It needs to be stored in a GCS path.
finetuned_lora_adapter_dir = trainingArgs.output_dir  # @param {type:"string"}

# The GCS path to save the merged model
merged_model_output_dir = os.path.join(MODEL_BUCKET, merge_job_name)
merged_model_output_dir_gcsfuse = merged_model_output_dir.replace("gs://", "/gcs/")

machine_type = "n1-standard-8"
accelerator_type = "NVIDIA_TESLA_V100"

#machine_type = "g2-standard-8"
#accelerator_type = "NVIDIA_L4"

worker_pool_specs = [
    {
        "machine_spec": {
            "machine_type": machine_type,
            "accelerator_type": accelerator_type,
            "accelerator_count": 1,
        },
        "replica_count": 1,
        "container_spec": {
            "image_uri": TRAIN_DOCKER_URI,
            "command": [],
            "args": [
                "--task=merge-causal-language-model-lora",
                "--merge_model_precision_mode=float16",
                "--pretrained_model_id=%s" % base_model_dir,
                "--finetuned_lora_model_dir=%s" % finetuned_lora_adapter_dir,
                "--merge_base_and_lora_output_dir=%s" % merged_model_output_dir_gcsfuse,
            ],
        },
    }
]

merge_custom_job = aiplatform.CustomJob(
    display_name=merge_job_name,
    project=PROJECT_ID,
    worker_pool_specs=worker_pool_specs,
    staging_bucket=STAGING_BUCKET,
)

merge_custom_job.run()

print("The merged model is stored at: ", merged_model_output_dir)





INFO:google.cloud.aiplatform.jobs:Creating CustomJob
INFO:google.cloud.aiplatform.jobs:CustomJob created. Resource name: projects/314837540096/locations/europe-west4/customJobs/5007808721833689088
INFO:google.cloud.aiplatform.jobs:To use this CustomJob in another session:
INFO:google.cloud.aiplatform.jobs:custom_job = aiplatform.CustomJob.get('projects/314837540096/locations/europe-west4/customJobs/5007808721833689088')
INFO:google.cloud.aiplatform.jobs:View Custom Job:
https://console.cloud.google.com/ai/platform/locations/europe-west4/training/5007808721833689088?project=314837540096
INFO:google.cloud.aiplatform.jobs:CustomJob projects/314837540096/locations/europe-west4/customJobs/5007808721833689088 current state:
JobState.JOB_STATE_PENDING
INFO:google.cloud.aiplatform.jobs:CustomJob projects/314837540096/locations/europe-west4/customJobs/5007808721833689088 current state:
JobState.JOB_STATE_PENDING
INFO:google.cloud.aiplatform.jobs:CustomJob projects/314837540096/locations/europe-

RuntimeError: ignored

# Merge the LORA to the main model

In [None]:
#from peft import AutoPeftModelForCausalLM

#device_map = {"": 0}

# load the trained model from the output directory
#trained_model = AutoPeftModelForCausalLM.from_pretrained(
#    trainingArgs.output_dir,
#    low_cpu_mem_usage=True,
#    return_dict=True,
#    torch_dtype=torch.float16,
#    device_map=device_map,
#)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Merge & Share adapters on the ü§ó Hub

In [None]:
# save merged model with the base model and the finetuned LORA adapter
#lora_merged_model = trained_model.merge_and_unload()
#lora_merged_model.save_pretrained("merged",safe_serialization=True)
#tokenizer.save_pretrained("merged")


#lora_merged_model.push_to_hub("Thomas-lemoullec/mistral_7b_vertexQandA_merged")
#tokenizer.push_to_hub("Thomas-lemoullec/mistral_7b_vertexQandA_merged")

Upload 3 LFS files:   0%|          | 0/3 [00:00<?, ?it/s]

pytorch_model-00002-of-00003.bin:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

pytorch_model-00003-of-00003.bin:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

pytorch_model-00001-of-00003.bin:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Thomas-lemoullec/mistral_7b_vertexQandA_merged/commit/5d30f4d5b4e97fb39d9f7d1ffb5683275cd174c8', commit_message='Upload tokenizer', commit_description='', oid='5d30f4d5b4e97fb39d9f7d1ffb5683275cd174c8', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
# Deploy Merged model on VLLM

#model_with_peft_vllm, endpoint_with_peft_vllm = deploy_model_vllm(
#    model_name=create_name_with_datetime(prefix="mistral-peft-serve-vllm"),
#    model_id="Thomas-lemoullec/mistral_7b_vertexQandA_merged",
#    service_account=SERVICE_ACCOUNT,
#    machine_type="n1-highmem-16",
#    accelerator_type="NVIDIA_TESLA_V100",
#    accelerator_count=2,
#)

#print("endpoint_name:", endpoint_with_peft_vllm.name)

INFO:google.cloud.aiplatform.models:Creating Endpoint
INFO:google.cloud.aiplatform.models:Create Endpoint backing LRO: projects/314837540096/locations/europe-west4/endpoints/7405505482187603968/operations/8865826523542716416
INFO:google.cloud.aiplatform.models:Endpoint created. Resource name: projects/314837540096/locations/europe-west4/endpoints/7405505482187603968
INFO:google.cloud.aiplatform.models:To use this Endpoint in another session:
INFO:google.cloud.aiplatform.models:endpoint = aiplatform.Endpoint('projects/314837540096/locations/europe-west4/endpoints/7405505482187603968')
INFO:google.cloud.aiplatform.models:Creating Model
INFO:google.cloud.aiplatform.models:Create Model backing LRO: projects/314837540096/locations/europe-west4/models/6856101512020492288/operations/7965106598068617216
INFO:google.cloud.aiplatform.models:Model created. Resource name: projects/314837540096/locations/europe-west4/models/6856101512020492288@1
INFO:google.cloud.aiplatform.models:To use this Model

endpoint_name: 7405505482187603968


In [None]:
instance = {
    "prompt": "What is Model Garden?",
    "n": 1,
    "max_tokens": 250,
}
response = endpoint_with_peft_vllm.predict(instances=[instance])
print(response.predictions[0])

Prompt:
What is Model Garden?
Output:
The Model Garden is a hub of products for designing, building, and maintaining the perfect plant-filled paradise. With a range of services including a model title, online portal, and print magazine, it makes gardening fun and easy for everyone.

What are the components of the Model Garden?

The composition of the Model Garden is as described below.

What is the purpose of the Model Garden?

The purpose of the Model Garden sample code is described below.next-model-garden,purpose=What is Model Garden? Next Model Garden√∞≈∏≈°‚Ç¨


In [None]:
import torch
import torch.nn as nn
import gc
# clear the VRAM


def memory_stats():
    print('allocated:')
    print(torch.cuda.memory_allocated()/1024**2)
    print('cached:')
    print(torch.cuda.memory_cached()/1024**2)

memory_stats()

#del trained_model
#del lora_merged_model
#del trainer
#del model
#del tokenizer

torch.cuda.empty_cache()
gc.collect()
memory_stats()

allocated:
16.2509765625
cached:
146.0
allocated:
16.2509765625
cached:
146.0


## Step 6 Evaluating the model qualitatively: run an inference!



In [None]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets

  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone


Load directly adapters from the Hub using the command below

In [None]:
# Based on your business need, if you need the base model and the finetuned LoRA weight to be served separately

import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

peft_model_id = "Thomas-lemoullec/mistral_7b_vertexQandA"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, return_dict=True, load_in_4bit=True, device_map=device_map)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the Lora model
model = PeftModel.from_pretrained(model, peft_model_id)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



Downloading adapter_model.bin:   0%|          | 0.00/84.0M [00:00<?, ?B/s]

You can then directly use the trained model that you have loaded from the ü§ó Hub for inference as you would do it usually in transformers.

In [None]:
result = get_completion(query="What is Model Garden?", model=model, tokenizer=tokenizer)
print(result)



<s> 
  Below is an instruction that describes a task. Write a response that appropriately completes the request.
  ### Question:
  What is Model Garden?

  ### Answer:
   Model Garden is a collection of curated, pre-trained machine learning (ML) models. The models in Model Garden have passed Google's model quality evaluation process and are available for you to load, fine-tune, and deploy. Model Garden models are categorized by vertical (for example, Natural Language, Computer Vision) and task (for example, Text Classification, Object Detection). Some models are available to be used directly-as-is, and others can be customized for your specific use case. You can browse the models in Model Garden and learn how to load, fine-tune, and deploy them in your own applications.</s>
