<a href="https://colab.research.google.com/github/thibaud-perrin/hibo-mistral-7b-fc-v1/blob/main/mistral_7b_instruct.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📖 hibo-mistral-7b-fc-v1 Training

This notebook focuses on fine-tuning the model `mistralai/Mistral-7B-v0.1` into an instruct and function calling model named: `thibaud-perrin/hibo-mistral-7b-fc-v1`.

## 📦 Installation of Required Packages

First, we install the packages this notebook depends on: `bitsandbytes` for quantization, the latest `transformers`, `peft`, and `accelerate` built from source, and `trl`, `xformers`, `wandb`, `datasets`, `einops`, and `sentencepiece` from PyPI. The `!pip install` commands below set up the environment for training and for interacting with the model.

In [1]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q trl xformers wandb datasets einops sentencepiece

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
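
Because three of these packages are installed from their development branches, the resolved versions can change between runs. The following is a minimal sketch, using only the standard library, for recording what was actually installed; the helper name `installed_versions` is hypothetical and the package list is copied from the install cell above.

```python
from importlib import metadata

# Package names taken from the install cell above.
PACKAGES = ["bitsandbytes", "transformers", "peft", "accelerate", "trl", "datasets"]


def installed_versions(packages):
    """Return {package: version string, or None when the package is missing}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```

Logging these versions next to a training run makes it easier to reproduce the run later.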


## 📚 Import of All Required Packages

After installing the required packages, we'll import them into our Jupyter notebook. This includes importing libraries such as `transformers`, `datasets`, and any other utilities we need for processing data, training the model, and evaluating its performance.

In [3]:
from transformers import (
    AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, HfArgumentParser,
    TrainingArguments, TextStreamer, pipeline, logging,
)
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
import os, torch, wandb, platform, warnings
from datasets import load_dataset
from trl import SFTTrainer
from huggingface_hub import notebook_login, Repository, login
from google.colab import userdata
from google.colab import drive

## 🔄 Google Drive Mount

To access and store large datasets or model files, we will mount Google Drive to our Colab environment. This allows us to seamlessly use files stored in Google Drive as if they were local files, making data management much easier.

In [4]:
drive.mount("/content/gdrive")
!ln -s "/content/gdrive/MyDrive/Colab Notebooks/Projects/mistral-7b-instruct" "/content/MistralNotebook"
!ls "./MistralNotebook"

Mounted at /content/gdrive
ln: failed to create symbolic link '/content/MistralNotebook/mistral-7b-instruct': File exists
mistral-7b-instruct	   test-hibo-mistral-7b-fc-v1.ipynb  thibaud-perrin-hibo-mistral-7b-fc-v1
mistral-7b-instruct.ipynb  test.md
results			   thibaud-perrin
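
The `ln` error in the output above appears when the cell is re-run and the link already exists. As a sketch of an idempotent alternative, the hypothetical helper `ensure_symlink` below replaces a stale link instead of failing; it is an illustration, not part of the original notebook.

```python
import os
from pathlib import Path


def ensure_symlink(target, link):
    """Create `link` pointing at `target`, replacing an existing symlink."""
    link = Path(link)
    if link.is_symlink():
        link.unlink()  # drop the old link so re-running the cell is harmless
    elif link.exists():
        raise FileExistsError(f"{link} exists and is not a symlink")
    os.symlink(target, link)
    return link


# Example call with the paths used in this notebook (only valid inside Colab):
# ensure_symlink(
#     "/content/gdrive/MyDrive/Colab Notebooks/Projects/mistral-7b-instruct",
#     "/content/MistralNotebook",
# )
```

Refusing to overwrite a real file or directory is a deliberate choice here, so the helper can never delete data by accident.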


## 🌐 Global Variables

In this section, we'll define global variables for our project. These include the base model name (`mistralai/Mistral-7B-v0.1`), the new model name (`thibaud-perrin/hibo-mistral-7b-fc-v1`), the dataset name (`thibaud-perrin/hibo-function-calling-v1`), and any necessary authentication tokens or secrets required for accessing APIs or services.

In [5]:
#Use a sharded model to fine-tune in the free version of Google Colab.
base_model = "mistralai/Mistral-7B-v0.1" #bn22/Mistral-7B-Instruct-v0.1-sharded
dataset_name, new_model = "thibaud-perrin/hibo-function-calling-v1", "thibaud-perrin/hibo-mistral-7b-fc-v1"

In [6]:
username = userdata.get('username')
email = userdata.get('email')
wandb_key = userdata.get('wandb_key')
hugginface_token = userdata.get('hugginface_token')
TRAINING = False
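
`userdata.get` only works inside Colab. If the notebook is ever run elsewhere, a small fallback to environment variables can help. This is a sketch under that assumption; the helper name `get_secret` is hypothetical.

```python
import os


def get_secret(name, default=None):
    """Read a secret from Colab's `userdata` when available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        return os.environ.get(name, default)
    try:
        return userdata.get(name)
    except Exception:
        # Secret missing or notebook access not granted: fall back to the environment.
        return os.environ.get(name, default)


# Example: wandb_key = get_secret("wandb_key")
```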

## 🔑 Login to Hugging Face

We need to authenticate with the Hugging Face Hub to access models and datasets. This involves logging in using our Hugging Face credentials, which will allow us to push models to the Hub at the end of this notebook.

In [7]:
login(token=hugginface_token)

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [8]:
!git config --global user.email "$email"
!git config --global user.name "$username"

In [9]:
!git config user.name

Thibaud Perrin


In [10]:
!git config user.email

thibaud.perrin6+HF@gmail.com


## 📂 Loading the Dataset

Here, we'll load the dataset `thibaud-perrin/hibo-function-calling-v1` from the Hugging Face Hub and select the train split.

In [11]:
# Load the train split of the function-calling dataset
dataset = load_dataset(dataset_name, split="train")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


[{'content': 'I was wondering about the topic of war crimes. Specifically, why was Winston Churchill never tried for war crimes, just because the allies won World War 2?',
  'role': 'user'},
 {'content': 'Winston Churchill was not tried for war crimes mainly for two reasons. First, at the time, the concept of war crimes was not as well-established as it is now. Second, Churchill was seen as a hero in his own country and the Allied powers, which made it politically difficult to try him.',
  'role': 'assistant'},
 {'content': "But wasn't he responsible for making some questionable decisions during the war?",
  'role': 'user'},
 {'content': 'Yes, some of his decisions during the war were controversial, but Churchill argued that they were necessary to win the war. For example, he ordered the bombing of cities like Dresden and Hamburg, which resulted in significant civilian casualties.',
  'role': 'assistant'},
 {'content': 'That sounds like it could be considered a war crime.',
  'role': '

In [None]:
# Example of a row
dataset["chat"][1]

## 🤖 Loading the Base Model (Mistral 7B)

We'll load the pre-trained `mistralai/Mistral-7B-v0.1` model from Hugging Face. This model will serve as the starting point for our fine-tuning process, where we'll adapt it to our specific task of instruction and function calling.

In [12]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit= True,
    bnb_4bit_quant_type= "nf4",
    bnb_4bit_compute_dtype= torch.bfloat16,
    bnb_4bit_use_double_quant= False,
)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map={"": 0}
)
model.config.use_cache = False # silence the warnings. Please re-enable for inference!
model.config.pretraining_tp = 1
model.gradient_checkpointing_enable()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
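
To see why 4-bit loading matters on a single Colab GPU, here is a rough back-of-the-envelope estimate of weight memory. It is a sketch only: the parameter count of about 7.24 billion is an approximation, and the estimate ignores quantization constants, layers that stay in higher precision, activations, and optimizer state.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 7.24e9  # approximate parameter count of Mistral-7B (assumption)

for label, bits in [("bf16", 16), ("nf4 (4-bit)", 4)]:
    print(f"{label}: ~{weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```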

## 🔤 Loading the Tokenizer

Next, we load the tokenizer, which converts text into the token IDs the model consumes. It must match the base model we're using.

In [13]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
# Do not use `tokenizer.pad_token = tokenizer.eos_token`: that was my mistake. The tokenizer then sets the attention mask of the eos_token to 0, which is why my fine-tuned model never stopped generating.
tokenizer.pad_token = tokenizer.unk_token  # use the unknown token for padding instead
tokenizer.add_eos_token = True
tokenizer.add_bos_token, tokenizer.add_eos_token

(True, True)
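
The comment in the cell above describes a model that never stops when EOS doubles as the pad token. The toy sketch below shows one mechanism that can produce this, assuming a collator that excludes padding from the loss by token id (as `DataCollatorForLanguageModeling` in `transformers` does). The token ids are illustrative: 2 for EOS matches the generation log later in this notebook, and 0 for the unknown token is an assumption.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_padding_labels(input_ids, pad_token_id):
    """Copy input ids to labels, hiding every position equal to the pad id."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


EOS_ID, UNK_ID = 2, 0          # illustrative ids (see the note above)
sequence = [1, 10, 11, EOS_ID]  # <s> token token </s>

padded_with_eos = sequence + [EOS_ID] * 2
padded_with_unk = sequence + [UNK_ID] * 2

# With pad == EOS the genuine EOS label is masked together with the padding.
print(mask_padding_labels(padded_with_eos, pad_token_id=EOS_ID))
# With pad == UNK the genuine EOS label survives and can be learned.
print(mask_padding_labels(padded_with_unk, pad_token_id=UNK_ID))
```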

### ✍️ Adding a Custom Template to the Tokenizer

To improve the model's understanding of our specific task, we'll add a custom template to the tokenizer. This involves defining a pattern or structure that the inputs to the model will follow, making it easier for the model to interpret the data.

In [14]:
import textwrap

template = """\
{% if messages[0]['role'] == 'system' %}\
    {% set loop_messages = messages[1:] %}\
    {% set system_message = messages[0]['content'] %}\
{% elif false == true and not '<<SYS>>' in messages[0]['content'] %}\
    {% set loop_messages = messages %}\
    {% set system_message = 'You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don\\'t know the answer to a question, please don\\'t share false information.' %}\
{% else %}\
    {% set loop_messages = messages %}\
    {% set system_message = false %}\
{% endif %}\
{% for message in loop_messages %}\
    {% if loop.index0 == 0 and system_message != false %}\
        {% set content = '<<SYS>>\n' + system_message + '\n<</SYS>>\n\n' + message['content'] %}\
    {% else %}\
        {% set content = message['content'] %}\
    {% endif %}\
    {% if message['role'] == 'user' %}\
        {{ bos_token + '[INST] ' + content.strip() + ' [/INST]' }}\
    {% elif message['role'] == 'system' %}\
        {{ '<<SYS>>\n' + content.strip() + '\n<</SYS>>\n\n' }}\
    {% elif message['role'] == 'assistant' %}\
        {% if loop.last or loop_messages[loop.index0 + 1]['role'] != 'assistant' %}
            {{ '[ASST] '  + content.strip() + ' [/ASST]' + eos_token }}\
        {% else %}
            {{ '[ASST] '  + content.strip() + ' [/ASST]' }}\
        {% endif %}
    {% elif message['role'] == 'function_response' %}\
        {{ bos_token + '[FUNC] '  + content.strip() + ' [/FUNC]' }}\
    {% endif %}\
{% endfor %}"""
tokenizer.chat_template = textwrap.dedent(template).replace('    ', '').strip()

"{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% elif false == true and not '<<SYS>>' in messages[0]['content'] %}{% set loop_messages = messages %}{% set system_message = 'You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don\\'t know the answer to a question, please don\\'t share false information.' %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\n' + system_message + '\n<</

In [None]:
# display the template to verify that it has been saved
tokenizer.chat_template
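
The Jinja template is hard to read once collapsed to a single line. As a reading aid, here is a plain-Python approximation of the string it produces. It is a sketch written for illustration, not the template itself: the function name `render_chat` is hypothetical, the inactive `elif false == true` branch is omitted, and the `<s>`/`</s>` defaults are assumed values for the BOS and EOS tokens.

```python
def render_chat(messages, bos_token="<s>", eos_token="</s>"):
    """Approximate the output of the chat template for a list of messages."""
    if messages and messages[0]["role"] == "system":
        system_message, loop_messages = messages[0]["content"], messages[1:]
    else:
        system_message, loop_messages = None, messages

    parts = []
    for i, message in enumerate(loop_messages):
        role, content = message["role"], message["content"]
        # The system prompt is folded into the first remaining message.
        if i == 0 and system_message is not None:
            content = f"<<SYS>>\n{system_message}\n<</SYS>>\n\n{content}"
        content = content.strip()
        if role == "user":
            parts.append(f"{bos_token}[INST] {content} [/INST]")
        elif role == "system":
            parts.append(f"<<SYS>>\n{content}\n<</SYS>>\n\n")
        elif role == "assistant":
            # EOS is emitted only when the assistant turn is really over.
            is_last = i == len(loop_messages) - 1
            closes_turn = is_last or loop_messages[i + 1]["role"] != "assistant"
            parts.append(f"[ASST] {content} [/ASST]" + (eos_token if closes_turn else ""))
        elif role == "function_response":
            parts.append(f"{bos_token}[FUNC] {content} [/FUNC]")
    return "".join(parts)


example = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]
print(render_chat(example))
```

Note how two consecutive assistant messages (an announcement followed by a function call) share one EOS token, placed after the second message.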

### ➕ Adding New Column to the Dataset

Based on the custom template, we'll add a new column to our dataset. This column will contain the processed inputs that are ready to be fed into the model for training.

In [15]:
# Define a function to concatenate and apply the template
def concatenate_and_transform(row):
    # Concatenate 'system' and 'chat' lists
    combined_list = row['system'] + row['chat']
    # Apply the tokenizer.apply_chat_template function on the concatenated list
    transformed_data = tokenizer.apply_chat_template(combined_list, tokenize=False, add_generation_prompt=False)
    return {'chat_sample': transformed_data}

In [16]:
dataset = dataset.map(concatenate_and_transform)
# Keep the first 200,000 rows (the dataset is not shuffled here)
dataset = dataset.select(range(200_000))

{'dataset_origin': 'glaiveai/glaive-function-calling-v2',
 'system': [{'content': 'You are a helpful assistant with access to the following functions. Use them if required -\n{\n    "name": "get_stock_price",\n    "description": "Get the current stock price of a company",\n    "parameters": {\n        "type": "object",\n        "properties": {\n            "company_name": {\n                "type": "string",\n                "description": "The name of the company"\n            },\n            "exchange": {\n                "type": "string",\n                "description": "The stock exchange where the company is listed"\n            }\n        },\n        "required": [\n            "company_name",\n            "exchange"\n        ]\n    }\n}',
   'role': 'system'}],
 'chat': [{'content': 'Hi, can you tell me the current stock price of Apple on NASDAQ?',
   'role': 'user'},
  {'content': 'Sure, let me fetch that information for you.',
   'role': 'assistant'},
  {'content': '<functionca

In [None]:
# example after adding the new column
dataset[35]
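
The selection cell takes rows in their stored order and assumes the dataset holds at least 200,000 rows. If a random subset is preferred, shuffling before selecting and clamping the limit are two small safeguards. The sketch below shows the idea on a plain Python list; `shuffled_head` is a hypothetical helper. With `datasets`, the analogous calls would be `dataset.shuffle(seed=...)` followed by `dataset.select(range(min(limit, len(dataset))))`.

```python
import random


def shuffled_head(rows, limit, seed=42):
    """Return up to `limit` rows in a reproducible random order."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded, so the subset is reproducible
    return [rows[i] for i in indices[: min(limit, len(rows))]]


rows = [{"id": i} for i in range(10)]
subset = shuffled_head(rows, limit=3)
print(subset)
```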

## 💪 Train the Model

### 📊 Login to Weights & Biases

Weights & Biases (W&B) is a tool for tracking experiments, and we'll log in to it to monitor our model's training progress. This helps in visualizing metrics and comparing different training runs.

In [19]:
if TRAINING:
  wandb.login(key = wandb_key)
  run = wandb.init(project='Fine tuning mistral 7B for FC', job_type="training", anonymous="allow")

### ⚙️ Init LoRA Config

LoRA (Low-Rank Adaptation) fine-tunes a model efficiently by training small low-rank matrices instead of the full weights. Here we configure rank 16, alpha 16, and dropout 0.05, applied to the attention projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`) and the MLP `gate_proj`.

In [20]:
if TRAINING:
  model = prepare_model_for_kbit_training(model)
  peft_config = LoraConfig(
          r=16,
          lora_alpha=16,
          lora_dropout=0.05,
          bias="none",
          task_type="CAUSAL_LM",
          target_modules=["q_proj", "k_proj", "v_proj", "o_proj","gate_proj"]
      )
  model = get_peft_model(model, peft_config)
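
To get a feel for how small the adapter is, the sketch below counts the LoRA parameters this configuration adds. Each adapted linear layer of shape `(in, out)` gains two matrices with `r * in` and `out * r` entries. The layer shapes are assumptions taken from the published Mistral-7B configuration (hidden size 4096, 8 key/value heads of dimension 128, MLP size 14336, 32 layers), not values read from this notebook.

```python
R = 16           # LoRA rank, as in the config above
NUM_LAYERS = 32  # assumed number of transformer blocks

# Assumed (in_features, out_features) for each targeted projection.
TARGET_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
}


def lora_params(in_features, out_features, r):
    """Parameters in the two low-rank matrices A (r x in) and B (out x r)."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters")
```

Under these assumptions the adapter has roughly 23 million parameters, a small fraction of the base model's size.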

### 🏋️ Init Training Arguments and SFTTrainer

We'll initialize the training arguments, specifying details like the number of epochs, batch size, and learning rate. Then, we'll set up the `SFTTrainer` (Supervised Fine-Tuning Trainer) with these arguments to start the training process.

In [21]:
if TRAINING:
  # Training Arguments
  # Hyperparameters should be adjusted based on the hardware you are using
  training_arguments = TrainingArguments(
      output_dir= "./MistralNotebook/results",
      resume_from_checkpoint="./MistralNotebook/results",
      num_train_epochs= 1,
      per_device_train_batch_size= 22,
      gradient_accumulation_steps= 2,
      optim = "paged_adamw_8bit",
      save_steps= 125,
      logging_steps= 30,
      learning_rate= 2e-4,
      weight_decay= 0.001,
      fp16= False,
      bf16= False,
      max_grad_norm= 0.3,
      max_steps= -1,
      warmup_ratio= 0.3,
      group_by_length= True,
      lr_scheduler_type= "constant",
      report_to="wandb"
  )
  # Setting sft parameters
  trainer = SFTTrainer(
      model=model,
      train_dataset=dataset,
      peft_config=peft_config,
      max_seq_length= None,
      dataset_text_field="chat_sample",
      tokenizer=tokenizer,
      args=training_arguments,
      packing= False,
  )
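
The batch settings above determine how many optimizer steps one epoch takes and how many checkpoints `save_steps=125` produces. The arithmetic is sketched below, assuming a single GPU and the 200,000 selected rows; the trainer's own rounding may differ by a step, so treat the numbers as approximate.

```python
import math


def steps_per_epoch(n_rows, per_device_batch, grad_accum, n_devices=1):
    """Return (effective batch size, approximate optimizer steps per epoch)."""
    effective = per_device_batch * grad_accum * n_devices
    return effective, math.ceil(n_rows / effective)


effective, steps = steps_per_epoch(n_rows=200_000, per_device_batch=22, grad_accum=2)
checkpoints = steps // 125  # save_steps=125

print(f"effective batch size: {effective}")
print(f"optimizer steps per epoch: ~{steps}")
print(f"checkpoints written: ~{checkpoints}")
```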

### 🚀 Start Training, and Save the Model

In this section, we'll start the training process using the data and configurations we've set up. After training, we'll save the fine-tuned model so it can be used for inference or further training in the future.

In [22]:
if TRAINING:
  trainer.train(resume_from_checkpoint=True)
  # Save the fine-tuned model
  trainer.model.save_pretrained(f"./MistralNotebook/{new_model}")
  trainer.model.save_pretrained(new_model)
  wandb.finish()
  model.config.use_cache = True
  model.eval()
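
`trainer.train(resume_from_checkpoint=True)` expects a checkpoint to exist in the output directory and raises an error on a first run when none does. The sketch below looks for the newest `checkpoint-<step>` folder first; `latest_checkpoint` is a hypothetical helper written with the standard library only.

```python
import re
from pathlib import Path

_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint folder, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    found = []
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        if match and child.is_dir():
            found.append((int(match.group(1)), child))
    if not found:
        return None
    return str(max(found, key=lambda item: item[0])[1])


# Possible use with the trainer defined above:
# trainer.train(resume_from_checkpoint=latest_checkpoint("./MistralNotebook/results"))
```

Passing `None` starts training from scratch, so the same cell works for both a first run and a resumed one.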

## 🧪 Test the Model

We'll run a quick qualitative check of the model by streaming its answer to a sample prompt, to see how it behaves on instruction following and function calling.

In [23]:
def stream(user_prompt):
    runtimeFlag = "cuda:0"
    system_prompt = '<<SYS>>\nThe conversation between Human and AI assistant\n<</SYS>>'
    B_INST, E_INST = "[INST]", " [/INST]"

    prompt = f"{system_prompt}{B_INST}{user_prompt.strip()}\n{E_INST}"

    inputs = tokenizer([prompt], return_tensors="pt").to(runtimeFlag)

    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    _ = model.generate(**inputs, streamer=streamer, max_new_tokens=200)

In [24]:
stream("Explain large language models")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.



<</SYS>>[INST]Explain large language models
[/INST]
<</SYS>>[INST]Explain large language models
[/INST]
<</SYS>>[INST]Explain large language models
[/INST]
<</SYS>>[INST]Explain large language models
[/INST]
<</SYS>>[INST]Explain large language models
[/INST]
<</SYS>>[INST]Explain large language models
[/INST]
<</SYS>>[INST]Explain large language models
[/INST]
<</SYS>>[INST]Explain large language models
[/INST]
<</SYS>>[INST]Explain large language models
[/INST]
<</SYS>>[INST]Explain large language models
[/INST]
<</SYS>>[INST]Explain large language models


KeyboardInterrupt: 

In [25]:
import shutil

# Correct the directory name if it was a typo
source_dir = './MistralNotebook/thibaud-perrin/hibo-mistral-7b-fc-v1'  # Ensure this is the correct path
destination_dir = 'hibo-mistral-7b-fc-v1'  # Specify your destination directory

# Copy the directory
try:
    shutil.copytree(source_dir, f"./{destination_dir}")
    shutil.copytree(source_dir, f"./thibaud-perrin/{destination_dir}")
    print(f"Directory copied successfully from {source_dir} to {destination_dir}.")
except Exception as e:
    print(f"Error copying directory: {e}")

Error copying directory: [Errno 17] File exists: './hibo-mistral-7b-fc-v1'
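
The `File exists` error above is raised because `shutil.copytree` refuses to write into an existing directory by default. Since Python 3.8, `dirs_exist_ok=True` allows the copy to merge into the destination instead. A minimal sketch, with a hypothetical helper name:

```python
import shutil
from pathlib import Path


def copy_tree_overwrite(source_dir, destination_dir):
    """Copy a directory tree, merging into the destination if it already exists."""
    shutil.copytree(source_dir, destination_dir, dirs_exist_ok=True)
    return Path(destination_dir)


# Example with the paths used in the cell above:
# copy_tree_overwrite(
#     "./MistralNotebook/thibaud-perrin/hibo-mistral-7b-fc-v1",
#     "./thibaud-perrin/hibo-mistral-7b-fc-v1",
# )
```

Files with the same name are overwritten, but files that exist only in the destination are left in place.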


## ♻️ Clear the Memory Footprint and Reload the Model

To ensure our environment is clean and to avoid any memory issues, we'll clear any unnecessary data from memory and then reload our fine-tuned model for further use or evaluation.

In [26]:
# Clear the memory footprint
del model, trainer
torch.cuda.empty_cache()

# Reload the base model
base_model_reload = AutoModelForCausalLM.from_pretrained(
    base_model,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
)
model = PeftModel.from_pretrained(base_model_reload, new_model)
model = model.merge_and_unload()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

ValueError: Can't find weights for thibaud-perrin/hibo-mistral-7b-fc-v1 in thibaud-perrin/hibo-mistral-7b-fc-v1 or in the Hugging Face Hub. Please check that the file adapter_model.bin or adapter_model.safetensors is present at thibaud-perrin/hibo-mistral-7b-fc-v1.
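
The error above names the two files the adapter loader looks for. A quick pre-check of the local folder can make this failure easier to diagnose before the base model is loaded. This is a sketch; `find_adapter_weights` is a hypothetical helper, and the file names are copied from the error message.

```python
from pathlib import Path

# File names taken from the error message above.
ADAPTER_WEIGHT_FILES = ("adapter_model.safetensors", "adapter_model.bin")


def find_adapter_weights(adapter_dir):
    """Return the first adapter weight file found in `adapter_dir`, or None."""
    root = Path(adapter_dir)
    for name in ADAPTER_WEIGHT_FILES:
        candidate = root / name
        if candidate.is_file():
            return candidate
    return None


weights = find_adapter_weights("thibaud-perrin/hibo-mistral-7b-fc-v1")
print(weights if weights else "No adapter weights found in this folder")
```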

## 🔁 Reload the Tokenizer

After clearing the memory, we'll reload the tokenizer from the base model and reapply the pad token, padding side, and custom chat template so it's ready for processing new inputs.

In [27]:
# Reload tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.unk_token
tokenizer.padding_side = "right"
tokenizer.chat_template = textwrap.dedent(template).replace('    ', '').strip()
tokenizer.chat_template

"{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% elif false == true and not '<<SYS>>' in messages[0]['content'] %}{% set loop_messages = messages %}{% set system_message = 'You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don\\'t know the answer to a question, please don\\'t share false information.' %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\n' + system_message + '\n<</

## 🚚 Push the Model and the Tokenizer to HuggingFace Repo

Finally, we'll push our fine-tuned model and its tokenizer to the Hugging Face Hub under the repository `thibaud-perrin/hibo-mistral-7b-fc-v1`. This allows others to access and use our model for their own tasks.

In [None]:
#push the model to hub
model.push_to_hub("thibaud-perrin/hibo-mistral-7b-fc-v1", use_temp_dir=False)

In [31]:
tokenizer.push_to_hub("thibaud-perrin/hibo-mistral-7b-fc-v1", use_temp_dir=False)

IsADirectoryError: [Errno 21] Is a directory: 'thibaud-perrin/hibo-mistral-7b-fc-v1'

## Alternative Technique to Push to HF

In [None]:
assert False, "Stop! Do not run the cells below unless you really want to push this way."

In [32]:
from huggingface_hub import HfFolder, Repository


In [33]:
repo_path = "./tokenizer"  # Local path to clone the repository
repo_url = "thibaud-perrin/hibo-mistral-7b-fc-v1"  # Your repository name on Hugging Face Hub

In [34]:
# Clone the repository locally (if not already cloned)
repo = Repository(repo_path, clone_from=repo_url)

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.
Cloning https://huggingface.co/thibaud-perrin/hibo-mistral-7b-fc-v1 into local empty directory.


Download file model-00002-of-00003.safetensors:   0%|          | 1.45k/4.66G [00:00<?, ?B/s]

Download file tokenizer.model:   0%|          | 1.46k/482k [00:00<?, ?B/s]

Download file model-00003-of-00003.safetensors:   0%|          | 32.0k/4.23G [00:00<?, ?B/s]

Download file model-00001-of-00003.safetensors:   0%|          | 26.2k/4.60G [00:00<?, ?B/s]

Clean file tokenizer.model:   0%|          | 1.00k/482k [00:00<?, ?B/s]

Clean file model-00003-of-00003.safetensors:   0%|          | 1.00k/4.23G [00:00<?, ?B/s]

Clean file model-00001-of-00003.safetensors:   0%|          | 1.00k/4.60G [00:00<?, ?B/s]

Clean file model-00002-of-00003.safetensors:   0%|          | 1.00k/4.66G [00:00<?, ?B/s]

In [35]:
# Reapply the tokenizer settings (padding side and chat template)
tokenizer.padding_side = "right"
tokenizer.chat_template = textwrap.dedent(template).replace('    ', '').strip()
# Save the tokenizer locally in the cloned repository directory
tokenizer.save_pretrained(repo_path)

('./tokenizer/tokenizer_config.json',
 './tokenizer/special_tokens_map.json',
 './tokenizer/tokenizer.model',
 './tokenizer/added_tokens.json',
 './tokenizer/tokenizer.json')

In [36]:
# Push changes
repo.push_to_hub()

To https://huggingface.co/thibaud-perrin/hibo-mistral-7b-fc-v1
   4be5e3a..da2598d  main -> main




'https://huggingface.co/thibaud-perrin/hibo-mistral-7b-fc-v1/commit/da2598d7994b12caff7d06d0151b83f2afc7fdee'