# LLM Finetuning with MonsterTuner

MonsterTuner is a no-code LLM finetuner for up to 10X more efficient and cost-effective finetuning of AI models for your business use-cases.

### Supported Models for Finetuning:

1. LLM (Large Language Model) - For use-cases like chat completion, summary generation, sentiment analysis, etc.
2. Whisper - For speech-to-text transcription improvement.
3. SDXL Dreambooth - Fine-tune Stable Diffusion model for customized image generation.


Check out our [Developer Docs](https://developer.monsterapi.ai/docs/launch-a-fine-tuning-job) to learn how to launch an LLM finetuning job without writing code.

**How to finetune an LLM and Deploy it on MonsterAPI - [Complete Guide](https://blog.monsterapi.ai/how-to-fine-tune-a-large-language-model-llm-and-deploy-it-on-monsterapi/)**


In [None]:
!pip install monsterapi==1.0.8
!pip install -q autoawq huggingface-hub peft

Sign up on [MonsterAPI](https://monsterapi.ai/signup?utm_source=llm-deploy-colab&utm_medium=referral) and get a free auth key. Paste it below:

In [None]:
import os
from monsterapi import client as mclient
import json
import logging
import tempfile
from awq import AutoAWQForCausalLM
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
import requests
import zipfile
from peft import PeftModel
import huggingface_hub as hf_hub
from huggingface_hub import HfApi, hf_hub_download, file_exists
from accelerate import init_empty_weights

os.environ['MONSTER_API_KEY'] = 'YOUR_MONSTER_API_KEY'
client = mclient(api_key=os.environ.get("MONSTER_API_KEY"))

### Launch Finetuning Job

This code block sets up `launch_payload` for fine-tuning an LLM with a specific configuration. The payload includes the model path, LoRA parameters, data source details, and training settings such as learning rate and number of epochs. The call to `client.finetune` then launches the job with these settings.
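To make the LoRA settings concrete, the sketch below counts the extra trainable parameters that a rank-`r` adapter adds to a single weight matrix, along with the `lora_alpha / lora_r` factor that scales its update. The 1024 x 1024 layer shape is a hypothetical example and is not read from `facebook/opt-350m`; the helper names are illustrative only.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds two low-rank factors per adapted matrix: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: int, lora_r: int) -> float:
    """The low-rank update is multiplied by alpha / r before being added to the frozen weight."""
    return lora_alpha / lora_r


# Hypothetical layer shape, chosen only for illustration.
d_in, d_out = 1024, 1024
adapter_params = lora_trainable_params(d_in, d_out, r=8)
full_params = d_in * d_out

print(f"Adapter parameters: {adapter_params}")
print(f"Share of the full matrix: {100 * adapter_params / full_params:.2f}%")
print(f"Update scaling (lora_alpha=16, lora_r=8): {lora_scaling(16, 8)}")
```

Raising `lora_r` grows the adapter linearly, which is why small ranks such as 8 keep the trainable share of each adapted matrix in the low single-digit percent range.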

In [None]:
launch_payload = {
    "pretrainedmodel_config": {
        "model_path": "facebook/opt-350m",
        "use_lora": True,
        "lora_r": 8,
        "lora_alpha": 16,
        "lora_dropout": 0,
        "lora_bias": "none",
        "use_quantization": False,
        "use_gradient_checkpointing": False,
        "parallelization": "nmp"
    },
    "data_config": {
        "data_path": "tatsu-lab/alpaca",
        "data_subset": "default",
        "data_source_type": "hub_link",
        "prompt_template": "Here is an example on how to use tatsu-lab/alpaca dataset ### Input: {instruction} ### Output: {output}",
        "cutoff_len": 512,
        "prevalidated": False
    },
    "training_config": {
        "early_stopping_patience": 5,
        "num_train_epochs": 1,
        "gradient_accumulation_steps": 1,
        "warmup_steps": 50,
        "learning_rate": 0.001,
        "lr_scheduler_type": "reduce_lr_on_plateau",
        "group_by_length": False
    },
    "logging_config": { "use_wandb": False }
}


ret = client.finetune(service="llm", params=launch_payload)
deployment_id = ret.get("deployment_id")
print(ret)
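As an illustration of how the `{instruction}` and `{output}` placeholders in `prompt_template` relate to dataset columns, the sketch below fills the template with one record using Python's `str.format`. This is an assumption about the substitution semantics rather than MonsterAPI's documented implementation, and the sample record is made up.

```python
prompt_template = (
    "Here is an example on how to use tatsu-lab/alpaca dataset "
    "### Input: {instruction} ### Output: {output}"
)

# Made-up record shaped like an instruction/output row.
record = {
    "instruction": "Name three primary colors.",
    "output": "Red, blue, and yellow.",
}


def render_prompt(template: str, row: dict) -> str:
    """Substitute each {column} placeholder with the matching value from the row."""
    return template.format(**row)


rendered = render_prompt(prompt_template, record)
print(rendered)
```

Under this assumption, every placeholder name in the template needs a matching column in the dataset; a missing column would raise a `KeyError` in this sketch.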

### Fetch your Finetuning Job Status:

Wait until the status is `Live`. It should take 5-10 minutes.

In [None]:
# Get deployment status
status_ret = client.get_deployment_status(deployment_id)
print(status_ret)
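Rather than re-running the status cell by hand, the check can be wrapped in a small polling loop. The helper below is a hypothetical sketch: `wait_until_live` is not part of the `monsterapi` client, and it assumes the status response is a dict with a `"status"` key, as in the deployment status output shown later in this notebook.

```python
import time


def wait_until_live(get_status, deployment_id, poll_seconds=30,
                    timeout_seconds=1800, sleep=time.sleep):
    """Call get_status(deployment_id) until its "status" is "live" (case-insensitive)."""
    waited = 0
    while True:
        status = get_status(deployment_id)
        state = str(status.get("status", "")).lower()
        if state == "live":
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"Deployment {deployment_id} still '{state}' after {waited} seconds"
            )
        sleep(poll_seconds)
        waited += poll_seconds
```

In this notebook it could be called as `status_ret = wait_until_live(client.get_deployment_status, deployment_id)`. The `sleep` argument exists only so the loop can be exercised without real waiting.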

------

### Get Finetuning Job Logs

To see your finetuning job's progress, run the cell below.

In [None]:
# Get deployment logs
logs_ret = client.get_deployment_logs(deployment_id)
print(logs_ret)

------

### Terminate Finetuning Job

CAUTION: Run this only if you wish to terminate your finetuning job. The calls in the cell below are commented out; uncomment them before running.

In [None]:
## Terminate Deployment
# terminate_return = client.terminate_deployment(deployment_id)
# print(terminate_return)

## Evaluate the Finetuned Model

In [None]:
import requests
base_model = launch_payload['pretrainedmodel_config']['model_path']
lora_model_path = status_ret['info']['model_url']


url = "https://api.monsterapi.ai/v1/deploy/evaluation/llm/lm_eval"

payload = {
    "eval_engine": "lm_eval",
    "basemodel_path": base_model,
    "loramodel_path": lora_model_path,
    "task": "mmlu"
}
headers = {
    "accept": "application/json",
    "content-type": "application/json",
    "authorization": f"Bearer {os.environ['MONSTER_API_KEY']}"
}

response = requests.post(url, json=payload, headers=headers)

print(response.text)
# Extracting deployment ID from response
response_data = response.json()
serving_params = response_data.get("servingParams", {})
eval_deployment_id = serving_params.get("deployment_id")
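The cell above reads the deployment ID from a nested key, so a failed request leaves `eval_deployment_id` silently set to `None`. A stricter variant is sketched below; the helper name is hypothetical, the sample response is made up, and the only structure assumed is the `servingParams` / `deployment_id` nesting used above.

```python
def extract_deployment_id(response_data: dict) -> str:
    """Return servingParams.deployment_id, or raise if the response does not contain it."""
    serving_params = response_data.get("servingParams") or {}
    deployment_id = serving_params.get("deployment_id")
    if not deployment_id:
        raise ValueError(f"No deployment_id found in response: {response_data}")
    return deployment_id


# Made-up response shaped like the one parsed above.
sample_response = {"servingParams": {"deployment_id": "example-id"}}
print(extract_deployment_id(sample_response))
```

Failing here, rather than in the next cell, keeps the error message next to the request that caused it.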

## Get Evaluation Job Status

In [None]:
# Get the evaluation deployment status; the next cell reads result_url from it
logs_ret = client.get_deployment_status(eval_deployment_id)
print(logs_ret)

## Get Evaluation Scores

In [None]:
result_url = logs_ret["info"]["result_url"]

response = requests.get(result_url)
result_json = response.json()

print(result_json)
# Extract required values from the JSON
Evaluation_Metrics = {
    "MMLU": result_json["results"]["mmlu"]["acc,none"]
}
print(Evaluation_Metrics)

{'results': {'mmlu': {'acc,none': 0.2610739210938613, 'acc_stderr,none': 0.003678548953480317, 'alias': 'mmlu'}, 'mmlu_humanities': {'alias': ' - humanities', 'acc,none': 0.24293304994686504, 'acc_stderr,none': 0.006243800221351997}, 'mmlu_formal_logic': {'alias': '  - formal_logic', 'acc,none': 0.19047619047619047, 'acc_stderr,none': 0.035122074123020514}, 'mmlu_high_school_european_history': {'alias': '  - high_school_european_history', 'acc,none': 0.2545454545454545, 'acc_stderr,none': 0.0340150671524904}, 'mmlu_high_school_us_history': {'alias': '  - high_school_us_history', 'acc,none': 0.25980392156862747, 'acc_stderr,none': 0.03077855467869327}, 'mmlu_high_school_world_history': {'alias': '  - high_school_world_history', 'acc,none': 0.2320675105485232, 'acc_stderr,none': 0.027479744550808507}, 'mmlu_international_law': {'alias': '  - international_law', 'acc,none': 0.371900826446281, 'acc_stderr,none': 0.04412015806624504}, 'mmlu_jurisprudence': {'alias': '  - jurisprudence', 'ac
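The result JSON also carries one entry per MMLU sub-task, each with its own `acc,none` value. A minimal sketch for collecting them into a flat summary follows; the sample dict copies three entries from the output above, and `summarize_accuracies` is a hypothetical helper name.

```python
def summarize_accuracies(result_json: dict, digits: int = 4) -> dict:
    """Map each task name to its rounded 'acc,none' score, skipping tasks without one."""
    summary = {}
    for task, metrics in result_json.get("results", {}).items():
        if "acc,none" in metrics:
            summary[task] = round(metrics["acc,none"], digits)
    return summary


# Three entries copied from the evaluation output above.
sample_result = {
    "results": {
        "mmlu": {"acc,none": 0.2610739210938613, "alias": "mmlu"},
        "mmlu_humanities": {"acc,none": 0.24293304994686504, "alias": " - humanities"},
        "mmlu_formal_logic": {"acc,none": 0.19047619047619047, "alias": "  - formal_logic"},
    }
}

for task, acc in summarize_accuracies(sample_result).items():
    print(f"{task}: {acc}")
```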

### MonsterAPI LORA Merge and Quantization Notebook

An accessory notebook to:

1.   Merge a LoRA adapter into its base model
2.   Quantize it using the AWQ method
3.   Push it to a Hugging Face repo

This notebook directly accepts a MonsterAPI model URL (e.g. *https://finetuning-service.s3.us-east-2.amazonaws.com/finetune_outputs/cba26def-4cc6-476b-927a-6e1eff7d68e0/cba26def-4cc6-476b-927a-6e1eff7d68e0.zip*) or a Hugging Face repo name (e.g. *monsterapi/mistral_7b_DolphinCoder*) as input, and can be used as an accessory once finetuning is complete on the main platform.



In [None]:
model_path = model_url = status_ret['info']['model_url']
print("Model Path: ",model_path)
# @title Model Configuration { display-mode: "form" }
quantize = True #@param {type:"boolean"}
hf_login_key = 'YOUR_HF_WRITE_TOKEN' #@param {type:"string"}
hf_model_path = 'Eval_Quantised_facebook_opt_350m' #@param {type:"string"}
save_path = 'content/Final_Model' #@param {type:"string"}

#@markdown ### Description of Parameters:
#@markdown - `model_path`: MonsterAPI Finetuned model url or the HuggingFace repo name.
#@markdown - `save_path`: Directory where the modified model should be saved after operations.
#@markdown - `quantize`: Enable or disable model quantization. Set to `True` to apply quantization.
#@markdown - `hf_login_key`: Hugging Face access token with write permission. If not provided, the model will not be pushed to Hugging Face.
#@markdown - `hf_model_path`: Name of the Hugging Face repo to push the model to.


Model Path:  https://finetuning-service.s3.us-east-2.amazonaws.com/finetune_outputs/26efd9b8-e776-4019-b3e6-f4325c82412d/26efd9b8-e776-4019-b3e6-f4325c82412d.zip



## LLM Quantization

LLM (Large Language Model) quantization is a technique that reduces the computational and memory costs of inference by representing weights and activations with low-precision data types, such as 8-bit integers (int8), instead of the usual 32-bit floating point (float32).

Reducing the number of bits means the resulting model requires less memory, consumes less energy (in theory), and can perform operations like matrix multiplication much faster with integer arithmetic. It also makes it possible to run models on embedded devices, which sometimes support only integer data types.

The quantization method used here is AWQ.

### AWQ Quantization

AWQ takes weight quantization a step further by considering the model's activations during the quantization process. In traditional weight quantization, the weights are quantized independently of the data they process. In AWQ, the quantization process takes into account the distribution of the activations the model actually produces during inference.
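To show what low-bit weight quantization does numerically, the sketch below quantizes one small group of weights to 4 bits with a zero point and then reconstructs them. It is plain round-to-nearest quantization in pure Python, not AWQ itself: the activation-aware step is left out, and the weights are made-up values. The function names are hypothetical.

```python
def quantize_group(weights, w_bit=4):
    """Asymmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** w_bit - 1                       # 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax or 1.0       # fall back to 1.0 if all weights are equal
    zero_point = round(-w_min / scale)          # integer code that represents 0.0
    codes = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate floating-point weights."""
    return [(c - zero_point) * scale for c in codes]


# Made-up weights standing in for one quantization group.
group = [-0.5, -0.1, 0.0, 0.25, 1.0]
codes, scale, zero_point = quantize_group(group)
restored = dequantize_group(codes, scale, zero_point)

print("codes:", codes)
print("restored:", [round(w, 3) for w in restored])
```

Each weight is stored as an integer code in the range 0 to 15, plus one shared scale and zero point per group, which is where the memory saving comes from.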




# Model Management Utilities

Let's write some utility functions for use later on. They download, unzip, merge, quantize, and save our model. Functions covered:


- `download_model_and_unzip`: Downloads and extracts model archives from specified URLs.
- `merge_adapter`: Integrates adapter modules with base transformer models, optionally utilizing LoRA.
- `quantize_and_load`: Applies quantization to models for efficient inference.
- `save_model`: Saves the model and tokenizer to a specified directory for future use.


In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)*0.8
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 11.7984 GB.
0.0 GB of memory reserved.


### Estimate Memory Requirements

In [None]:
def get_num_params(model_name: str, trust_remote_code: bool = True, hf_token: str = None) -> int:
    """
    Creates an empty model and calculates the number of parameters.
    """
    with init_empty_weights():
        # Pass the caller's trust_remote_code setting through instead of ignoring it
        config = AutoConfig.from_pretrained(model_name, trust_remote_code=trust_remote_code, token=hf_token)
        model = AutoModelForCausalLM.from_config(config)

    num_params = sum(p.numel() for p in model.parameters())
    return num_params

from typing import Dict, List, Optional, Literal
def estimate_memory_usage_hf(model_name, hf_token, lora_params_percentage) -> Dict[str, Dict[str, float]]:
    """
    Estimates the memory usage of the Hugging Face model.
    """

    num_params = get_num_params(model_name, hf_token=hf_token)
    model_size_gb = (num_params * 4) / (1024 ** 3)

    # check if lora_config is provided and use it to calculate percentage params using num_params
    # if lora_config:
    #     lora_params_percentage = ((lora_config.l * lora_config.w * lora_config.r) / num_params) * 100

    memory_usage = {}
    dtype_sizes = {
        'float32': 4,
        'float16': 2,
        'int8': 1,
        'int4': 0.5
    }

    inference_scale_factor = 1.2
    lora_scale_factor = (16 / 8) * 4 * inference_scale_factor

    for dtype, size in dtype_sizes.items():
        total_size = model_size_gb * (size / 4)
        training_adam = total_size * 3.9
        inference = total_size * inference_scale_factor
        lora_trainable_params_gb = total_size * (lora_params_percentage / 100) * lora_scale_factor
        lora_fine_tuning = total_size + lora_trainable_params_gb

        memory_usage[dtype] = {
            'inference': round(inference, 2),
            'training_adam': round(training_adam, 2),
            'lora_fine_tuning': round(lora_fine_tuning, 2)
        }

    return memory_usage

def check_memory(max_memory, memory_usage):
    # Directly extract memory usage values from the dictionary
    memory_f16 = memory_usage['float16']['inference']
    memory_f32 = memory_usage['float32']['inference']

    # Check the largest footprint first; float16 needs half the memory of float32
    if memory_f32 <= max_memory:
        print("This notebook is suitable for the given model using float32.")
    elif memory_f16 <= max_memory:
        print("Memory usage for float32 exceeds the limit. However, this notebook is suitable for float16 model.")
    else:
        print("Warning: Memory usage for float16 exceeds the limit. This colab notebook does not have enough memory for float16.")

### Memory Required to Execute

In [None]:
model_name = launch_payload['pretrainedmodel_config']['model_path']
memory_usage = estimate_memory_usage_hf(model_name=model_name,hf_token=hf_login_key,lora_params_percentage=1)
print(f"Memory Requirements for {model_name}: ",memory_usage)
check_memory(max_memory, memory_usage)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/644 [00:00<?, ?B/s]

Memory Requirements for facebook/opt-350m:  {'float32': {'inference': 1.6, 'training_adam': 5.19, 'lora_fine_tuning': 1.46}, 'float16': {'inference': 0.8, 'training_adam': 2.59, 'lora_fine_tuning': 0.73}, 'int8': {'inference': 0.4, 'training_adam': 1.3, 'lora_fine_tuning': 0.36}, 'int4': {'inference': 0.2, 'training_adam': 0.65, 'lora_fine_tuning': 0.18}}
This notebook is suitable for the given model using float32.
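The inference figures above follow from simple arithmetic: parameter count times bytes per parameter, converted to GiB and multiplied by the 1.2 overhead factor used in `estimate_memory_usage_hf`. The sketch below repeats that calculation for a hypothetical model with one billion parameters; the parameter count is an illustrative assumption, not a measurement of any specific model.

```python
def inference_memory_gb(num_params: int, bytes_per_param: float, overhead: float = 1.2) -> float:
    """Weights-only memory in GiB, scaled by an overhead factor for inference."""
    return round(num_params * bytes_per_param / (1024 ** 3) * overhead, 2)


# Hypothetical parameter count, chosen only for illustration.
num_params = 1_000_000_000
dtype_sizes = {"float32": 4, "float16": 2, "int8": 1, "int4": 0.5}

for dtype, size in dtype_sizes.items():
    print(f"{dtype}: {inference_memory_gb(num_params, size)} GB")
```

Halving the bytes per parameter halves the estimate, which is why the int4 row is one eighth of the float32 row.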


### Download Model Functions

In [None]:
def download_model_and_unzip(url):

    # Create a temporary directory
    model_dir = tempfile.mkdtemp()

    # Download the zip file, failing early on HTTP errors
    r = requests.get(url, allow_redirects=True)
    r.raise_for_status()
    zip_path = os.path.join(model_dir, 'model.zip')
    with open(zip_path, 'wb') as f:
        f.write(r.content)

    if not os.path.exists(zip_path):
        raise ValueError(f"Failed to download model from {url}")

    # Unzip the zip file into the temporary directory
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(model_dir)

    # Remove the zip file
    os.remove(zip_path)

    # Verify model downloaded by checking if dir is empty
    if len(os.listdir(model_dir)) == 0:
        raise ValueError(f"Failed to unzip model from {url}")

    # Return the temporary directory
    return model_dir


def merge_adapter(model_path):

    #download
    if model_path.startswith('http'):
        model_path = download_model_and_unzip(model_path)
    else:
        hf_hub.snapshot_download(
                            repo_id=model_path,  # type: ignore
                            repo_type='model',
                            local_dir="Final_Model",
                            local_dir_use_symlinks=False)
        model_path = "Final_Model"


    if os.path.isfile(model_path+'/adapter_config.json'):
        with open(model_path+'/adapter_config.json', 'r') as f:
            data = json.load(f)
            basemodel_path = data['base_model_name_or_path']
            loramodel_path = model_path
            use_lora = True
    else:
        basemodel_path = model_path
        use_lora = False

    tokenizer = AutoTokenizer.from_pretrained(basemodel_path, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(basemodel_path,
                                                 device_map='auto',
                                                 trust_remote_code=True,
                                                 low_cpu_mem_usage=True,
                                                 torch_dtype=torch.bfloat16)
    if use_lora:
        logging.info("Loading lora model")
        model = PeftModel.from_pretrained(model, loramodel_path)
        model = model.merge_and_unload()

    return tokenizer,model


def quantize_and_load(model_path):


    quant_config = { "zero_point": True,
                    "q_group_size": 128,
                    "w_bit": 4,
                    "version": "GEMM" }

    model = AutoAWQForCausalLM.from_pretrained(
        model_path, **{"low_cpu_mem_usage": True, "use_cache": False, "device_map": torch.device("cuda")}
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    model.quantize(tokenizer, quant_config=quant_config)


    return tokenizer, model

def save_model(tokenizer,model,save_path='Final_Model'):


    if os.path.exists(save_path):
        os.system(f'rm -rf {save_path}')

    model.save_pretrained(save_path)
    tokenizer.save_pretrained(save_path)
    return save_path
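The key decision inside `merge_adapter` is whether the downloaded directory holds a LoRA adapter or a full model, which it infers from the presence of `adapter_config.json`. That check is isolated below as a standalone sketch; `resolve_base_model` is a hypothetical helper, and the demo writes a made-up config into a temporary directory.

```python
import json
import os
import tempfile


def resolve_base_model(model_dir: str):
    """Return (base_model_path, is_lora) for a downloaded model directory."""
    config_path = os.path.join(model_dir, "adapter_config.json")
    if os.path.isfile(config_path):
        # A LoRA adapter records which base model it was trained on
        with open(config_path, "r") as f:
            return json.load(f)["base_model_name_or_path"], True
    # No adapter config: treat the directory itself as a full model
    return model_dir, False


with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "adapter_config.json"), "w") as f:
        json.dump({"base_model_name_or_path": "facebook/opt-350m"}, f)
    print(resolve_base_model(demo_dir))
```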

# Final Execution

Perform the final processing of the model. This cell loads the model, merges any LoRA adapter, quantizes it, and uploads it to Hugging Face according to the configuration set above.

In [None]:
if os.path.exists(save_path):
    os.system(f'rm -rf {save_path}')
if model_path != '':
    tokenizer, model = merge_adapter(model_path)
    print('Successfully Merged and loaded the adapters')
if save_path != '':
    save_model(tokenizer,model,save_path)
if quantize and save_path != '':
    del tokenizer, model
    tokenizer, model = quantize_and_load(save_path)
    print('Successfully quantized and loaded the model')
    # Replace the merged, unquantized checkpoint with the quantized one.
    # This stays inside the quantize branch because save_quantized exists only on AWQ models.
    os.system(f'rm -rf {save_path}')
    model.save_quantized(save_path)
    tokenizer.save_pretrained(save_path)
    print(f'Model quantized and saved, it can be found in the ./{save_path} Directory')
if hf_login_key != '':
    try:
        hf_hub.login(hf_login_key)
        hf_model_path = hf_hub.whoami(token=hf_login_key)['name'] + '/' + hf_model_path.split('/')[-1]
        tokenizer.push_to_hub(hf_model_path )
        hf_api = hf_hub.HfApi()
        hf_api.upload_folder(
            folder_path=save_path,
            repo_id=hf_model_path,
            repo_type="model"
        )
        logging.info('Model pushed to huggingface hub')
    except Exception as e:
        logging.warning('Failed to push to huggingface hub: %s', e)

tokenizer_config.json:   0%|          | 0.00/685 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/441 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/663M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

Successfully Merged and loaded the adapters


Downloading readme:   0%|          | 0.00/167 [00:00<?, ?B/s]

Repo card metadata block was not found. Setting CardData to empty.


Downloading data:   0%|          | 0.00/471M [00:00<?, ?B/s]

Generating validation split:   0%|          | 0/214670 [00:00<?, ? examples/s]

AWQ: 100%|██████████| 24/24 [05:19<00:00, 13.30s/it]


Successfully quantized and loaded the model
Model quantized and saved, it can be found in the ./content/Final_Model Directory
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


model.safetensors:   0%|          | 0.00/267M [00:00<?, ?B/s]

## Deploy Quantised Model as an API Endpoint

In [None]:
base_model = launch_payload['pretrainedmodel_config']['model_path']
lora_model_path = status_ret['info']['model_url']

# Use a separate name so the finetuning launch_payload is not overwritten
# and this cell can be re-run safely.
deploy_payload = {
    "basemodel_path": base_model,
    "loramodel_path": lora_model_path,
    "api_auth_token": "b6a97d3b-35d0-4720-a44c-59ee33dbc25b",
    "prompt_template": "Here is an example on how to use tatsu-lab/alpaca dataset ### Input: {instruction} ### Output: {output}",
    "per_gpu_vram": 24,
    "gpu_count": 1
}

# Launch a deployment
ret = client.deploy("llm", deploy_payload)
deployment_id = ret.get("deployment_id")
print(deployment_id)

dc2a598f-ee93-430f-b6a2-2662ff83b7bf


## Check Status of the Deployment

In [None]:
import json

status_ret = client.get_deployment_status(deployment_id)
print(status_ret)
assert status_ret.get("status") == "live", "Please wait until status is live!"

service_client = mclient(api_key=status_ret.get("api_auth_token"), base_url=status_ret.get("URL"))

{'status': 'live', 'message': 'Server has started !!!', 'URL': 'https://dc2a598f-ee93-430f-b6a2-2662ff83b7bf.monsterapi.ai', 'qblocks_url': 'https://94.101.98.249', 'api_auth_token': 'b6a97d3b-35d0-4720-a44c-59ee33dbc25b', 'credits_consumed': 41, 'created_at': '2024-05-31T02:24:17.074487'}


## Query the API Endpoint

In [None]:
payload = {
    "input_variables": {
        "instruction": "What is Global Warming?"},
    "stream": False,
    "temperature": 0.6,
    "max_tokens": 512
}

output = service_client.generate(model = "deploy-llm", data = payload)
print(output['text'])

['\nGlobal warming is an increase in temperatures caused by greenhouse gases such as carbon dioxide, methane, and hydrogen sulfide. It is a result of human activity and is an increase in the amount of carbon dioxide and other greenhouse gases in the atmosphere. \n\nGlobal warming is caused by the rising levels of carbon dioxide and other greenhouse gases in the atmosphere, which is a result of the burning of fossil fuels, such as oil, coal, and gas.\n\nGlobal warming is seen as a result of the human activities that contribute to the rise in climate change. \n\nGlobal warming is also an increase in the amount of carbon dioxide and other greenhouse gases in the atmosphere. It is a result of increased atmospheric carbon dioxide concentrations, which increase the amount of carbon dioxide in the atmosphere. \n\nGlobal warming is an increase in the amount of carbon dioxide in the atmosphere, which is a result of the burning of fossil fuels, such as oil, coal, and gas. \n\nGlobal warming is a
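In the output above, `output['text']` is a list containing one generated string with a leading newline. A small sketch for pulling out a clean completion follows; `first_completion` is a hypothetical helper, and the sample response is made up to mirror that shape.

```python
def first_completion(output: dict) -> str:
    """Return the first generated text with surrounding whitespace removed."""
    texts = output.get("text") or []
    if isinstance(texts, str):
        # Tolerate a plain string in case the response is not wrapped in a list
        texts = [texts]
    return texts[0].strip() if texts else ""


# Made-up response shaped like the printed output above.
sample_output = {"text": ["\nGlobal warming is an increase in temperatures."]}
print(first_completion(sample_output))
```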