  # **Fine-Tuning Mistral-7B on MetaMathQA-40k and Deployment of Model**



This step installs necessary Python packages. Notably, it includes bitsandbytes, transformers (Hugging Face library), peft, accelerate, and other dependencies for various tasks like training, fine-tuning, and evaluation.

# **Goal:**


The primary objective of this project was to fine-tune the Mistral 7B language model on the MetaMathQA-40K dataset and deploy it with Gradio for real-time interaction. The fine-tuned model is meant to perform better on math word problems, and the Gradio deployment lets users query it through a simple chat interface.

In [6]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q trl xformers wandb datasets einops gradio sentencepiece


In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,HfArgumentParser,TrainingArguments,pipeline, logging, TextStreamer
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
import os,torch, wandb, platform, gradio, warnings
from datasets import load_dataset
from trl import SFTTrainer
from huggingface_hub import notebook_login

## Load Dataset

In [3]:
!pip install datasets
from datasets import load_dataset


MATH_dataset = load_dataset("meta-math/MetaMathQA-40K")


In [4]:
print(MATH_dataset)

DatasetDict({
    train: Dataset({
        features: ['query', 'type', 'response'],
        num_rows: 40000
    })
})


## Data Splitting

In [5]:
# Assuming you have a "train" split
train_dataset = MATH_dataset["train"]

# Specify the desired number of samples
desired_samples = 10000

# Ensure that the desired number of samples is not greater than the total number of samples
desired_samples = min(desired_samples, len(train_dataset))

# Take the first desired_samples rows
selected_samples = train_dataset.select(list(range(desired_samples)))

# Specify the desired ratio for your train/test split (e.g., 80% train, 20% test)
train_ratio = 0.8

# Calculate the number of samples for the train split
num_samples_train = int(desired_samples * train_ratio)

# Create the train and test splits
train_split = selected_samples.select(list(range(num_samples_train)))
test_split = selected_samples.select(list(range(num_samples_train, desired_samples)))

# train_split (8,000 rows) and test_split (2,000 rows) together hold the 10,000 selected rows


In [6]:
print(train_split,test_split)

Dataset({
    features: ['query', 'type', 'response'],
    num_rows: 8000
}) Dataset({
    features: ['query', 'type', 'response'],
    num_rows: 2000
})
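
The split above takes the first 10,000 rows in their stored order. If a random subset is preferred, the index arithmetic can be sketched in plain Python as below. The function name `shuffled_split_indices` and the seed are illustrative assumptions, not part of the original notebook.

```python
import random


def shuffled_split_indices(total_rows, subset_size, train_ratio, seed=42):
    """Return (train_indices, test_indices) drawn from a random subset of rows."""
    # Never ask for more rows than the dataset holds
    subset_size = min(subset_size, total_rows)
    rng = random.Random(seed)
    # Sample row indices without replacement, in random order
    subset = rng.sample(range(total_rows), subset_size)
    n_train = int(subset_size * train_ratio)
    return subset[:n_train], subset[n_train:]


train_idx, test_idx = shuffled_split_indices(40_000, 10_000, 0.8)
print(len(train_idx), len(test_idx))  # 8000 2000
```

The resulting index lists could then be passed to `train_dataset.select(...)` in place of the `range(...)` lists used above. The `datasets` library also offers `Dataset.train_test_split`, which shuffles and splits in one call.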


## Create a Prompt

In [None]:
def create_prompt(sample):
    query = sample['query']
    response = sample['response']
    prompt = f"<s>[INST] {query} [/INST]\n"

    # Include the response from the 'response' column
    prompt += f"{response}</s>"

    return prompt

prompt_example = create_prompt(train_split[0])
print(prompt_example)

<s>[INST] Reggie's father gave him $48. Reggie bought 5 books, each of which cost x. Reggie has 38 money left. What is the value of unknown variable x? [/INST]
To solve this problem, we need to determine the value of x, which represents the cost of each book that Reggie bought.
Let's break down the information given:
Amount of money Reggie's father gave him: $48
Number of books Reggie bought: 5
Amount of money Reggie has left: $38
We can set up the equation as follows:
Amount of money Reggie's father gave him - (Number of books Reggie bought * Cost per book) = Amount of money Reggie has left
$48 - (5 * x) = $38
Let's simplify and solve for x:
$48 - 5x = $38
To isolate x, we subtract $48 from both sides of the equation:
$48 - $48 - 5x = $38 - $48
-5x = -$10
To solve for x, we divide both sides of the equation by -5:
x = -$10 / -5
x = $2
The value of x is $2.
#### 2
The answer is: 2</s>
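
Every response in the sample above ends with a line of the form `The answer is: ...`, which makes automatic scoring possible. The helper below is a minimal sketch of how that final answer might be pulled out of a response or a generation. The name `extract_final_answer` is a hypothetical helper, not something the notebook defines.

```python
import re


def extract_final_answer(text):
    """Return the text after 'The answer is:' on the last line, or None if absent."""
    # Allow an optional trailing </s> token after the answer
    match = re.search(r"The answer is:\s*(.+?)\s*(?:</s>)?\s*$", text.strip())
    return match.group(1) if match else None


sample = "The value of x is $2.\n#### 2\nThe answer is: 2</s>"
print(extract_final_answer(sample))  # 2
```

Comparing the extracted string with the reference answer gives a simple exact-match check on the held-out split.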


# **Load the base Mistral 7B model with quantization configurations.**


Quantization reduces the memory and compute requirements of a neural network by representing its weights (and sometimes its activations) with fewer bits, for example 8-bit integers or, as in this notebook, 4-bit values. The lower precision compresses the model so that it fits on resource-constrained hardware. That is why the base model is loaded in 4-bit NF4 precision through bitsandbytes, with computation carried out in bfloat16.
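
To give a rough sense of the savings, the sketch below estimates the memory needed for the weights alone at several precisions. The parameter count of about 7.24 billion is an assumption about Mistral 7B and does not come from this notebook. The estimate ignores quantization constants, activations, and optimizer state.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 7.24e9  # assumed approximate parameter count for Mistral 7B

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>10}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions the 4-bit weights need roughly a quarter of the memory of the 16-bit weights, which is what allows a 7B model to fit on a single consumer or Colab GPU.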


The tokenizer that matches the Mistral 7B checkpoint is loaded to turn input text into token IDs. It is created with padding_side set to "left", model_max_length set to 512, and trust_remote_code enabled. The end-of-sequence (EOS) token is then reused as the padding token by assigning eos_token to pad_token, and add_eos_token is enabled so that an EOS token is appended to every encoded sequence. These settings keep the tokenizer consistent with what the model expects during training and inference.

In [3]:
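# Load base model (Mistral 7B) in 4-bit precision
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the weights in 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the matrix multiplications
    bnb_4bit_use_double_quant=False,        # do not quantize the quantization constants
)
model = AutoModelForCausalLM.from_pretrained(
    "filipealmeida/Mistral-7B-Instruct-v0.1-sharded",
    quantization_config=bnb_config,
    device_map={"": 0},  # place the whole model on GPU 0
)
model.config.use_cache = False  # the KV cache is not needed during training
model.config.pretraining_tp = 1
model.gradient_checkpointing_enable()  # trade extra compute for lower memory use

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "filipealmeida/Mistral-7B-Instruct-v0.1-sharded",
    padding_side="left",
    model_max_length=512,
    trust_remote_code=True,
)

tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the padding token
tokenizer.add_eos_token = True             # append EOS to every encoded sequence
tokenizer.add_bos_token, tokenizer.add_eos_token  # displayed as the cell output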


(True, True)
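
To make the padding settings concrete, here is a minimal pure-Python sketch of what left padding does to a batch of token ID lists, together with the matching attention mask. The function name `pad_left` and the example IDs are illustrative assumptions. The pad ID of 2 mirrors the EOS token ID reported in the generation logs of this notebook.

```python
def pad_left(batch, pad_id):
    """Left-pad token ID lists to equal length and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        # Padding goes in front, so real tokens stay at the end of the row
        input_ids.append([pad_id] * n_pad + ids)
        # 0 marks padding positions, 1 marks real tokens
        attention_mask.append([0] * n_pad + [1] * len(ids))
    return input_ids, attention_mask


ids, mask = pad_left([[1, 7, 8, 9], [1, 5]], pad_id=2)
print(ids)   # [[1, 7, 8, 9], [2, 2, 1, 5]]
print(mask)  # [[1, 1, 1, 1], [0, 0, 1, 1]]
```

With left padding the last position of every row is a real token, which is what a decoder-only model continues from during batched generation.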

In [4]:
def generate_response(prompt):
    encoded_input = tokenizer.apply_chat_template(prompt, return_tensors="pt")
    # attention_mask = encoded_input['attention_mask']
    model_inputs = encoded_input.to('cuda')
    generated_ids = model.generate(model_inputs, max_new_tokens=10, do_sample=True)
    decoded = tokenizer.batch_decode(generated_ids)
    return decoded[0]


In [6]:
# Note: apply_chat_template already wraps each user turn in [INST] ... [/INST],
# so the tags written by hand below appear twice in the final prompt.
messages = [
    {"role": "user", "content": "[INST]What is your favourite condiment?[/INST]"},
    {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
    {"role": "user", "content": "[INST]If 24 out of every 60 individuals like football and out of those that like it, 50% play it, how many people would you expect play football out of a group of 250?, just give me one word answer in number[/INST]"}
]
response = generate_response(messages)
print(response)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> [INST] [INST]What is your favourite condiment?[/INST] [/INST]Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!</s>  [INST] [INST]If 24 out of every 60 individuals like football and out of those that like it, 50% play it, how many people would you expect play football out of a group of 250?, just give me one word answer in number[/INST] [/INST] The answer would be 100.</s>


# Prepare for K-Bit Training



Adapters are small additional trainable components that capture task-specific information while the pre-trained weights stay frozen. The prepare_model_for_kbit_training function readies the quantized model for training, for example by freezing the base weights and enabling gradient checkpointing. The subsequent lines instantiate a LoraConfig object that specifies the adapter rank (r), the scaling factor (lora_alpha), the dropout rate, and the target modules to which the adapters are attached.

The get_peft_model function then wraps the Mistral 7B model with these LoRA adapters, so that only the adapter weights are updated during fine-tuning. This lets the model adapt to the downstream task at a fraction of the cost of full fine-tuning.

In [None]:
#Adding the adapters in the layers
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
        r=16,
        lora_alpha=16,
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj","gate_proj"]
    )
model = get_peft_model(model, peft_config)
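
As a rough check on how small the adapters are, the sketch below counts the LoRA parameters implied by the configuration above. For a linear layer with `in_features` inputs and `out_features` outputs, LoRA adds two matrices with `r * (in_features + out_features)` parameters in total. The layer shapes and the layer count are assumptions about the Mistral 7B architecture and are not stated in this notebook.

```python
# Assumed (in_features, out_features) for each targeted projection in one layer
TARGET_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
}
NUM_LAYERS = 32  # assumed number of transformer layers
RANK = 16        # r from the LoraConfig above


def lora_params(in_features, out_features, rank):
    """Parameters in the two LoRA matrices: A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters")   # 23,068,672
print(f"{total * 4 / 1e6:.1f} MB in float32")   # 92.3
```

Under these assumptions the adapters hold about 23 million parameters, a small fraction of the base model, and the float32 size appears consistent with the 92.3M `adapter_model.safetensors` download logged in the evaluation section. With `lora_alpha=16` and `r=16`, the LoRA scaling factor `lora_alpha / r` equals 1.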

# Monitor the Language Model

Initializing WandB (Weights and Biases) serves the purpose of monitoring and tracking the training process. WandB provides a platform for experiment tracking, visualization, and collaboration. By logging various metrics, parameters, and visualizations during the model training, it enables effective analysis and comparison of different experiments. In this specific context, the wandb.login and wandb.init functions authenticate the user, set up the project, and initialize a run for tracking the fine-tuning process of the Mistral 7B model. This integration with WandB enhances the reproducibility and visibility of the training procedure, facilitating collaboration and insights into the model's performance over time.

In [None]:
# Monitoring the LLM
# The API key is read from the WANDB_API_KEY environment variable rather than hard-coded
wandb.login(key=os.environ.get("WANDB_API_KEY"))
run = wandb.init(project='Fine tuning mistral 7B', job_type="training", anonymous="allow")

wandb: Currently logged in as: rutvik01 (rutvik011). Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


# Training Configuration

The code sets up training arguments, initializes an SFTTrainer for training the model, and saves the trained model. The model is then pushed to the Hugging Face Model Hub for easy sharing and retrieval.

The TrainingArguments object encapsulates the key parameters of the training process. Here it specifies the output directory for results, sets the number of training epochs to 1, uses a per-device batch size of 2 with 4 gradient accumulation steps, saves a checkpoint every 1000 steps, logs training progress every 10 steps, and sets the learning rate, weight decay, and gradient clipping. Mixed-precision training (fp16) speeds up computation, and reporting to WandB provides real-time monitoring.

In [None]:
training_arguments = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    save_steps=1000,
    logging_steps=10,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=True,
    max_grad_norm=0.3,
    max_steps=-1,        # -1: train for num_train_epochs instead of a fixed step count
    warmup_ratio=0.3,    # has no effect with the "constant" scheduler below
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="wandb",
)
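
The batch settings above combine into an effective batch size, which in turn determines roughly how many optimizer steps one epoch takes. The sketch below shows that arithmetic. The helper name and the figure of 8,000 sequences are illustrative assumptions, and with packing enabled the real number of training sequences depends on the total token count rather than on the number of rows.

```python
import math


def optimizer_steps_per_epoch(num_sequences, per_device_batch_size,
                              grad_accum_steps, num_devices=1):
    """Return (effective_batch_size, approximate optimizer steps per epoch)."""
    # Gradients are accumulated over several small batches before each update
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return effective_batch, math.ceil(num_sequences / effective_batch)


# Hypothetical case: 8,000 training sequences on a single GPU
effective, steps = optimizer_steps_per_epoch(8000, 2, 4)
print(effective, steps)  # 8 1000
```

In this hypothetical case a checkpoint interval of 1000 steps would produce a single checkpoint near the end of the epoch.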


The SFTTrainer is initialized with the model, a maximum sequence length of 256 tokens, the training and evaluation datasets, and the LoRA configuration (peft_config). It also receives a formatting function (create_prompt) that turns each sample into a prompt string, the tokenizer defined earlier, and the training arguments stored in training_arguments. Packing is enabled so that several short examples are concatenated into each fixed-length sequence, which makes better use of every training batch.

In [None]:
trainer = SFTTrainer(
    model=model,
    max_seq_length = 256,
    train_dataset=train_split,
    eval_dataset=test_split,
    peft_config=peft_config,
    formatting_func=create_prompt,
    # callbacks=[early_stopping],
    # dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    packing= True)
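
The idea behind packing can be sketched in a few lines of plain Python: tokenized examples are joined into one stream, separated by an EOS token, and the stream is cut into blocks of a fixed length. This is a simplified illustration of the concept, not the trainer's actual implementation. The function name, the toy token IDs, and the choice to drop the incomplete final block are assumptions.

```python
def pack_sequences(tokenized_examples, block_size, eos_id):
    """Concatenate examples with EOS separators and cut into fixed-size blocks."""
    stream = []
    for ids in tokenized_examples:
        stream.extend(ids)
        stream.append(eos_id)  # mark the boundary between examples
    # Keep only complete blocks; the leftover tail is dropped in this sketch
    n_full = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_full)]


examples = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
blocks = pack_sequences(examples, block_size=4, eos_id=2)
print(blocks)  # [[5, 6, 7, 2], [8, 9, 2, 10], [11, 12, 13, 2]]
```

Every block is completely filled with real tokens, so no compute is spent on padding. The trade-off is that one block can contain pieces of several unrelated examples.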





# Start the training process.

In [1]:
trainer.train()

# **Save the fine-tuned model and push it to the Hugging Face Model Hub.**


In [None]:
trainer.save_model("mistral_finetune")

In [None]:
model.push_to_hub("rutvik01/mistral_finetune")

# Model Evaluation

In [5]:
!huggingface-cli login




## Load Pre-trained Model for Evaluation

In [6]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

base_model_id = "filipealmeida/Mistral-7B-Instruct-v0.1-sharded"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,  # same base checkpoint as in training
    quantization_config=bnb_config,  # 4-bit NF4, here with double quantization enabled
    device_map="auto",
    trust_remote_code=True,
    token=True,  # use the token stored by `huggingface-cli login`
)

eval_tokenizer = AutoTokenizer.from_pretrained(base_model_id, add_bos_token=True, trust_remote_code=True)


## Generate Response with the Fine-tuned Model

In [7]:
from peft import PeftModel

ft_model = PeftModel.from_pretrained(base_model, "rutvik01/mistral_finetune")

adapter_config.json:   0%|          | 0.00/655 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/92.3M [00:00<?, ?B/s]

In [13]:
# eval_tokenizer was created with add_bos_token=True, so no literal "<s>" is written into the prompt
eval_prompt = "[INST] If 24 out of every 60 individuals like football and out of those that like it, 50% play it, how many people would you expect play football out of a group of 250? Please just give me one word answer in number. Think step by step. [/INST]"
model_input = eval_tokenizer(eval_prompt, return_tensors="pt").to("cuda")

ft_model.eval()
with torch.no_grad():
    output_ids = ft_model.generate(**model_input, max_new_tokens=4096, repetition_penalty=1.15)[0]
    print(eval_tokenizer.decode(output_ids, skip_special_tokens=True))


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[INST] If 24 out of every 60 individuals like football and out of those that like it, 50% play it, how many people would you expect play football out of a group of 250? Please just give me one word answer in number. Think step by step. [/INST] Let's break down the problem:
1. We know that 24 out of every 60 individuals like football. So we can set up a proportion to find out how many people out of 250 like football: 24/60 = x/250. Solving for x gives us x = (24*250)/60 = 75.
2. Out of these 75 people who like football, 50% play it. To find out how many people play football, we multiply 75 by 0.5: 75 * 0.5 = 37.5.
3. Since we cannot have half a person, we round up to the nearest whole number. Therefore, there are 38 people who play football out of a group of 250.
The final answer is **38**.
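
As a sanity check on the evaluation question, the expected value can be computed directly. The sketch below assumes the straightforward reading of the prompt: scale the 24-in-60 rate to 250 people, then take half of that group.

```python
from fractions import Fraction

like_rate = Fraction(24, 60)   # share of people who like football
play_rate = Fraction(1, 2)     # share of those who also play it
group_size = 250

likes = like_rate * group_size     # people in the group who like football
players = likes * play_rate        # people in the group who play football
print(likes, players)  # 100 50
```

Under this reading the expected answer is 50, which differs from both generations shown in this notebook (100 from the base model and 38 from the fine-tuned model). A single prompt is not an evaluation, so a fuller comparison would score many held-out questions from the test split.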


# Model deployment in Gradio

In [2]:
!pip install --upgrade jinja2



In [None]:
!pip install pyngrok
!pip install bitsandbytes
!pip install gradio==3.48.0
!pip install fastapi==0.103.2
!pip install gradio
!pip install accelerate

In [14]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Fine-tuned model result in Gradio

In [2]:
from transformers import AutoTokenizer, TextIteratorStreamer, BitsAndBytesConfig, AutoModelForCausalLM
from threading import Thread
import gradio as gr
import transformers
import torch
from peft import PeftModel


# Run the entire app with `python run_mixtral.py`

""" The messages list should be of the following format:

messages =

[
    {"role": "user", "content": "User's first message"},
    {"role": "assistant", "content": "Assistant's first response"},
    {"role": "user", "content": "User's second message"},
    {"role": "assistant", "content": "Assistant's second response"}
]

"""
""" The `format_chat_history` function below formats the dialogue history into a prompt that can be fed into the Mistral model, so that the model sees the context of the conversation and can generate an appropriate response.
The function takes a history of dialogues as input, which is a list of lists where each sublist represents a pair of user and assistant messages.
"""


def format_chat_history(history) -> str:
    # Each entry in `history` is a [user_message, assistant_message] pair.
    messages = []
    for user_message, assistant_message in history:
        if user_message:
            messages.append({"role": "user", "content": user_message})
        # The assistant slot is None (or empty) for the turn that is still
        # being generated, so it is skipped.
        if assistant_message:
            messages.append({"role": "assistant", "content": assistant_message})
    return pipeline.tokenizer.apply_chat_template(
        messages, tokenize=False,
        add_generation_prompt=True)

def model_loading_pipeline():
    # Note: the adapter was trained earlier in this notebook on Mistral-7B-Instruct-v0.1,
    # while this cell loads Instruct-v0.2 as the base model.
    model_id = "mistralai/Mistral-7B-Instruct-v0.2"

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,  # 4-bit NF4 with double quantization
        device_map="auto",
        trust_remote_code=True,
        token=True,  # use the token stored by `huggingface-cli login`
    )

    # Attach the fine-tuned LoRA adapter to the quantized base model
    model = PeftModel.from_pretrained(model, "rutvik01/mistral_finetune")
    tokenizer = AutoTokenizer.from_pretrained(model_id, add_bos_token=True, trust_remote_code=True)
    # `timeout` (lowercase) is the streamer's keyword: seconds to wait for the next token
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, timeout=5)

    # The model is already loaded and quantized, so no model_kwargs are needed here
    pipeline = transformers.pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        streamer=streamer,
    )
    return pipeline, streamer

def launch_gradio_app(pipeline, streamer):
    with gr.Blocks() as demo:
        chatbot = gr.Chatbot()
        msg = gr.Textbox()
        clear = gr.Button("Clear")

        def user(user_message, history):
            return "", history + [[user_message, None]]

        def bot(history):
            prompt = format_chat_history(history)

            history[-1][1] = ""
            kwargs = dict(text_inputs=prompt, max_new_tokens=2048, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
            thread = Thread(target=pipeline, kwargs=kwargs)
            thread.start()

            for token in streamer:
                history[-1][1] += token
                yield history

        msg.submit(user, [msg, chatbot], [msg, chatbot], queue=False).then(bot, chatbot, chatbot)
        clear.click(lambda: None, None, chatbot, queue=False)

    demo.queue()
    demo.launch(share=True, debug=True)

if __name__ == '__main__':

    pipeline, streamer = model_loading_pipeline()
    launch_gradio_app(pipeline, streamer)




Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FuyuForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MistralForCausalLM', 'MptForCausalLM', 'MusicgenForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PersimmonForCausalLM', 'PLBartFo

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://b9bf72378e7f37834c.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://b9bf72378e7f37834c.gradio.live
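
The `bot` callback above runs generation in a background thread and reads tokens from the streamer as they arrive. That producer and consumer pattern can be sketched with the standard library alone. The `TokenStream` class and `fake_generate` function below are hypothetical stand-ins for the real streamer and model, written only to show the control flow.

```python
import queue
import threading


class TokenStream:
    """Tiny stand-in for a streamer: a producer calls put()/end(), a consumer iterates."""

    _END = object()  # sentinel that signals the end of generation

    def __init__(self, timeout=None):
        self._queue = queue.Queue()
        self._timeout = timeout

    def put(self, token):
        self._queue.put(token)

    def end(self):
        self._queue.put(self._END)

    def __iter__(self):
        return self

    def __next__(self):
        # Blocks until a token arrives; raises queue.Empty after `timeout` seconds
        item = self._queue.get(timeout=self._timeout)
        if item is self._END:
            raise StopIteration
        return item


def fake_generate(stream, tokens):
    """Pretend model: pushes tokens one by one, then signals completion."""
    for tok in tokens:
        stream.put(tok)
    stream.end()


stream = TokenStream(timeout=5)
worker = threading.Thread(target=fake_generate,
                          args=(stream, ["The", " answer", " is", " 50"]))
worker.start()

partial = ""
for tok in stream:   # the UI would be refreshed after every token
    partial += tok
worker.join()
print(partial)  # The answer is 50
```

Because the consumer loop yields after every token, a Gradio callback written this way can update the chat window incrementally instead of waiting for the full response.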


# **Base model result in gradio**

In [None]:
from transformers import AutoTokenizer, TextIteratorStreamer, BitsAndBytesConfig
from threading import Thread
import gradio as gr
import transformers
import torch

# Run the entire app with `python run_mixtral.py`

""" The messages list should be of the following format:

messages =

[
    {"role": "user", "content": "User's first message"},
    {"role": "assistant", "content": "Assistant's first response"},
    {"role": "user", "content": "User's second message"},
    {"role": "assistant", "content": "Assistant's second response"}
]

"""
""" The `format_chat_history` function below formats the dialogue history into a prompt that can be fed into the Mistral model, so that the model sees the context of the conversation and can generate an appropriate response.
The function takes a history of dialogues as input, which is a list of lists where each sublist represents a pair of user and assistant messages.
"""

def format_chat_history(history) -> str:
    # Each entry in `history` is a [user_message, assistant_message] pair.
    messages = []
    for user_message, assistant_message in history:
        if user_message:
            messages.append({"role": "user", "content": user_message})
        # The assistant slot is None (or empty) for the turn that is still
        # being generated, so it is skipped.
        if assistant_message:
            messages.append({"role": "assistant", "content": assistant_message})
    return pipeline.tokenizer.apply_chat_template(
        messages, tokenize=False,
        add_generation_prompt=True)

def model_loading_pipeline():
    model_id = "mistralai/Mistral-7B-Instruct-v0.2"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # `timeout` (lowercase) is the streamer's keyword: seconds to wait for the next token
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, timeout=5)

    pipeline = transformers.pipeline(
        "text-generation",
        model=model_id,
        # 4-bit loading is requested once, through quantization_config
        model_kwargs={"torch_dtype": torch.float16,
                      "quantization_config": BitsAndBytesConfig(
                          load_in_4bit=True,
                          bnb_4bit_compute_dtype=torch.float16)},
        streamer=streamer
    )
    return pipeline, streamer

def launch_gradio_app(pipeline, streamer):
    with gr.Blocks() as demo:
        chatbot = gr.Chatbot()
        msg = gr.Textbox()
        clear = gr.Button("Clear")

        def user(user_message, history):
            return "", history + [[user_message, None]]

        def bot(history):
            prompt = format_chat_history(history)

            history[-1][1] = ""
            kwargs = dict(text_inputs=prompt, max_new_tokens=2048, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
            thread = Thread(target=pipeline, kwargs=kwargs)
            thread.start()

            for token in streamer:
                history[-1][1] += token
                yield history

        msg.submit(user, [msg, chatbot], [msg, chatbot], queue=False).then(bot, chatbot, chatbot)
        clear.click(lambda: None, None, chatbot, queue=False)

    demo.queue()
    demo.launch(share=True, debug=True)

if __name__ == '__main__':
    pipeline, streamer = model_loading_pipeline()
    launch_gradio_app(pipeline, streamer)


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://1b2ef5fff1dab290e7.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


# Conclusion:

The project walked through fine-tuning and deploying the Mistral 7B model: the base model was loaded with 4-bit quantization, LoRA adapters were trained on a subset of MetaMathQA-40K, training was tracked with WandB, and both the fine-tuned and the base model were served through a Gradio chat interface. Together, quantization, LoRA adapters, and monitoring tools make for an efficient workflow on limited hardware. Ongoing improvements and user feedback will guide future iterations of the project.