Fine-tune a Large Language Model (LLM) and deploy it on MonsterAPI 🔥

The best part: it costs less than a cup of coffee and no coding is required! 🔥

-----

@monsterapis designed a no-code LLM fine-tuner that simplifies the process of finetuning by:

👉 Automatically configuring GPU computing environments, optimised for higher throughput (with vLLM in the backend),

👉 Optimising memory usage by finding the optimal batch size,

👉 Integrating experiment tracking with WandB and automatically pushing the model to the Hugging Face Hub, and

👉 Auto-configuring the pipeline to complete without errors on their cost-optimised GPU cloud.

📌 And all of the above happens without writing a single line of code.
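
To make the batch-size step concrete, here is a minimal sketch of one common way to find the largest batch size that fits in GPU memory: keep doubling until a size fails, then binary-search the gap. This is not MonsterAPI's actual implementation, which the post does not describe. `fits_in_memory` is a hypothetical stand-in for a trial training step that would hit an out-of-memory error, and the memory numbers in the usage line are made up.

```py
from typing import Callable


def find_max_batch_size(fits_in_memory: Callable[[int], bool],
                        upper_limit: int = 1024) -> int:
    """Return the largest batch size in [1, upper_limit] that fits.

    Assumes that if a batch size fits, every smaller one fits too.
    Returns 0 if even a batch size of 1 does not fit.
    """
    if not fits_in_memory(1):
        return 0

    # Phase 1: double the batch size until it fails or passes the limit.
    low, high = 1, 2
    while high <= upper_limit and fits_in_memory(high):
        low = high
        high *= 2
    # "high" is now the first size known (or assumed) not to be usable.
    high = min(high, upper_limit + 1)

    # Phase 2: binary search between low (fits) and high (does not fit).
    while high - low > 1:
        mid = (low + high) // 2
        if fits_in_memory(mid):
            low = mid
        else:
            high = mid
    return low


# Hypothetical memory model: 600 MB per sample on a 24,000 MB budget.
best = find_max_batch_size(lambda batch: batch * 600 <= 24_000)
print(best)
```

In a real trainer the predicate would run one forward and backward pass and catch the out-of-memory error, which is why the number of probes is kept logarithmic.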

![](assets/2024-01-18-22-27-14.png)

📌 Recently, MonsterAPI finetuned Google's Gemma-2B base model, and the finetuned model outperforms LLaMA-2-13B on mathematical reasoning.

For this project, Gemma-2B underwent fine-tuning on the Microsoft/Orca-Math-Word-Problems-200K dataset for 10 epochs using MonsterAPI's MonsterTuner service.

Evaluation was done on the GSM Plus benchmark, a specialized benchmark for checking the mathematical proficiency of LLMs on grade-school mathematics.

📌 The MonsterAPI-finetuned Gemma-2B model achieved a score of 20.02 on this benchmark, a 68% improvement over the performance of its baseline model.

This 2B-parameter finetuned model also outperformed much larger models such as LLaMA-2-13B and Code-LLaMA-7B.
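
As a quick sanity check on those two numbers, the snippet below backs out the baseline score they imply. It assumes the 68% figure is a relative improvement over the baseline; the post does not state the baseline score itself, so the result is only an inference.

```py
finetuned_score = 20.02      # GSM Plus score reported in the post
relative_improvement = 0.68  # "68% improvement", assumed relative to the baseline

# finetuned = baseline * (1 + improvement)  =>  baseline = finetuned / (1 + improvement)
implied_baseline = finetuned_score / (1 + relative_improvement)
print(f"Implied baseline score: {implied_baseline:.2f}")
```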

![](assets/2024-04-11-16-53-30.png)


The actual process for finetuning an LLM

📌 As a first step to finetune any model, sign up for a MonsterAPI account (monsterapi.ai/login) and get 2500 free credits.

📌 Launch the finetuning portal and choose from the latest Large Language Models (LLMs) such as Llama-2 7B, CodeLlama, Falcon or Mixtral 8x7B.

📌 Dataset preparation: you can choose from a curated selection of the most-used Hugging Face datasets, which come with a predefined training prompt configuration, OR

You can use your own custom dataset, which gives you a good amount of control over how the data is put into the right format. The portal provides a text area in which the target columns can be specified. Depending on the type of task chosen, you may have to alter the column names.
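
As a rough illustration of what specifying target columns amounts to, the sketch below turns dataset rows into training text by filling a prompt template from named columns. The column names `question` and `answer`, the template layout and the helper `build_training_text` are all assumptions made for this example, not the portal's actual format.

```py
from typing import Dict, List

# Hypothetical template: each placeholder names a dataset column.
TEMPLATE = "### Question:\n{question}\n\n### Answer:\n{answer}"


def build_training_text(rows: List[Dict[str, str]], template: str) -> List[str]:
    """Fill the template once per row, failing loudly on a missing column."""
    texts = []
    for index, row in enumerate(rows):
        try:
            texts.append(template.format(**row))
        except KeyError as missing:
            raise ValueError(f"row {index} has no column {missing}") from None
    return texts


rows = [{"question": "What is 7 + 5?", "answer": "12"}]
print(build_training_text(rows, TEMPLATE)[0])
```

If your dataset uses different column names, either rename the columns or change the placeholders in the template so the two agree.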

📌 Specify the hyperparameter configuration: epochs, learning rate, cutoff length, warmup steps, and so on.

📌 Track the stages of your finetuning jobs: view job logs and monitor your job metrics using Weights & Biases. Finally, upload the model outputs to Hugging Face.
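
To get a feel for how these settings interact, here is a small sketch that estimates the number of optimizer steps in a run and derives a warmup length from it. The 200,000 examples are read off the "200K" in the dataset name mentioned earlier and the 10 epochs come from the Gemma-2B run; the batch size of 32 and the 3% warmup share are hypothetical values chosen only for illustration.

```py
import math


def training_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Total optimizer steps, assuming one step per batch and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


total = training_steps(num_examples=200_000, batch_size=32, epochs=10)
warmup = total * 3 // 100  # hypothetical: warm up over the first 3% of steps
print(f"total steps: {total}, warmup steps: {warmup}")
```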

------



📌 Once you have finetuned an LLM on MonsterAPI, you will receive adapter weights as the final output. This adapter contains your fine-tuned model's weights, which MonsterAPI can host as an API endpoint using Monster Deploy.

📌 Monster Deploy runs on the vLLM framework in its backend. vLLM is a fast and easy-to-use library for large language model inference and serving, known for its state-of-the-art serving throughput.

MonsterAPI also hosts a wide range of popular LLMs as an inference service, and you can access them through popular tools like llama-index. In addition, Monster Deploy lets you host any vLLM-supported LLM, such as TinyLlama, Mixtral or Phi-2, as a REST API endpoint on MonsterAPI's cost-optimised GPU cloud.
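
The post does not name the adapter method, but if the adapter is a LoRA-style low-rank adapter (an assumption), a short calculation shows why the output is so much smaller than the full model. The layer size and rank below are hypothetical.

```py
def lora_adapter_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in a rank-r update for one d_out x d_in weight matrix.

    The update is the product of B (d_out x r) and A (r x d_in).
    """
    return rank * (d_in + d_out)


d_in = d_out = 2048  # hypothetical layer size
rank = 8             # hypothetical adapter rank

full = d_in * d_out
adapter = lora_adapter_params(d_in, d_out, rank)
print(f"full: {full:,}  adapter: {adapter:,}  share: {adapter / full:.2%}")
```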

The example code below deploys the Mixtral 8x7B Chat model with GPTQ 4-bit quantization on a 48GB GPU, using Monster Deploy.

The deployment serves the model as a REST API with support for both static and streaming token responses.

```py
# Install a specific beta version of the client for quick serve access.
# In a notebook: !python3 -m pip install monsterapi==1.0.2b3

from monsterapi import client as mclient

api_key = "YOUR_MONSTER_API_KEY"
deploy_client = mclient(api_key=api_key)

# Deploy the Mixtral 8x7B Chat model with GPTQ 4-bit quantization
# on a single 48GB GPU.
basemodel_path = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"
# The {prompt} placeholder should match the key sent in "input_variables" when querying.
prompt_template = "<s> [INST] {prompt} [/INST] {completion}</s>"
api_auth_token = "A_RANDOM_AUTH_TOKEN_TO_SECURE_YOUR_ENDPOINT"
per_gpu_vram = 48
gpu_count = 1

# Build the deployment request from the settings above.
launch_payload = {
    "basemodel_path": basemodel_path,
    "prompt_template": prompt_template,
    "api_auth_token": api_auth_token,
    "per_gpu_vram": per_gpu_vram,
    "gpu_count": gpu_count,
    "use_nightly": True,
}

# Launch a deployment
ret = deploy_client.deploy("llm", launch_payload)
deployment_id = ret.get("deployment_id")
print(deployment_id)

# Once the deployment is live, its status response looks like this:
# {
#     "status": "live",
#     "message": "Server has started !!!",
#     "URL": "https://c503a813-850a-4a78-93b9.monsterapi.ai",
#     "api_auth_token": "57b7b903-a4b6-4720-8154-af71aa8e8313"
# }
# Visit the URL to get the LLM service endpoint details,
# or <URL>/docs to get the Swagger docs.

```

Once the deployment is live, let's query our deployed LLM endpoint:

```py
import json

# Fetch the deployment status; it must be "live" before the endpoint can be queried.
status_ret = deploy_client.get_deployment_status(deployment_id)
print(status_ret)

assert status_ret.get("status") == "live", "Please wait until status is live!"

# The deployed endpoint is accessed with its own auth token and base URL.
service_client = mclient(api_key=status_ret.get("api_auth_token"),
                         base_url=status_ret.get("URL"))

payload = {
    "input_variables": {
        "prompt": "What's up?"},
    "stream": False,
    "temperature": 0.6,
    "max_tokens": 2048
}

output = service_client.generate(model = "deploy-llm", data = payload)

if payload.get("stream"):
    for i in output:
        print(i[0])
else:
    print(json.loads(output)['text'][0])

```
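
The `assert` in the snippet above stops the script if the deployment is still starting up. A small polling helper is one way around that; the sketch below is an illustration, not part of the MonsterAPI client. `wait_until_live` is a hypothetical helper, the timeout and polling interval are arbitrary defaults, and the only client call it relies on is the `get_deployment_status` method already used in the post.

```py
import time
from typing import Callable, Dict


def wait_until_live(get_status: Callable[[], Dict],
                    timeout_s: float = 1800.0,
                    poll_interval_s: float = 15.0) -> Dict:
    """Poll get_status() until it reports status == "live" or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status.get("status") == "live":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"deployment not live after {timeout_s} s; last status: {status}"
            )
        time.sleep(poll_interval_s)


# Usage with the client from the post:
# status_ret = wait_until_live(
#     lambda: deploy_client.get_deployment_status(deployment_id)
# )
```

Passing the status call in as a function keeps the helper independent of the client, so it can be tried out with a fake status source before pointing it at a real deployment.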

That's a wrap. Here are all the important links.

👉 Website : http://monsterapi.ai

👉 Discord (Monsterapis) : https://discord.com/invite/mVXfag4kZN

👉 Check out their API docs - https://developer.monsterapi.ai/docs/monster-deploy-beta

👉 Access all Finetuned Models by Monster here:

https://huggingface.co/qblocks?ref=blog.monsterapi.ai