# MonsterAPI Deploy

Monster Deploy is a new LLM deployment engine that lets you serve various LLMs, along with LoRA adapters, as API endpoints on MonsterAPI's robust, cost-optimised GPU cloud.

The following deployment options are supported:
1. Deploy SOTA LLMs and fine-tuned LLM LoRA adapters as a REST API serving endpoint
2. Deploy Docker containers for GPU-powered applications

Monster Deploy offers built-in optimisations for higher throughput and lower cost when serving LLMs.

Check out our [Developer Docs](https://developer.monsterapi.ai/docs/monster-deploy-beta).

If you haven't applied for the Deploy beta yet, you can sign up through this [Google form](https://forms.gle/ZHuZt68fJLRozo3v9) to receive 10K credits. Sign up with your organization ID to receive 30K credits.

## Install the MonsterAPI PyPI client

In [None]:
!pip install monsterapi==1.0.2b3
# Please install specific beta version of client for quick serve access.

Sign up on [MonsterAPI](https://monsterapi.ai/signup?utm_source=llm-deployment-colab&utm_medium=referral) to get a free auth key, and make sure you have also signed up for beta access [here](https://forms.gle/TTJRapHm59RxjttJA). Paste your key below:

In [None]:
api_key = "YOUR_MONSTERAPI_KEY"
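
Pasting the key into the notebook works, but it is easy to leak when the notebook is shared. As an alternative, here is a minimal sketch that reads the key from an environment variable. The variable name `MONSTER_API_KEY` and the helper `load_api_key` are hypothetical names chosen for this example, not something the client requires.

```python
import os


def load_api_key(env_var: str = "MONSTER_API_KEY") -> str:
    """Read the auth key from an environment variable.

    The variable name is an assumption made for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your MonsterAPI auth key"
        )
    return key


# Usage (run after exporting the variable in your shell or Colab secrets):
# api_key = load_api_key()
```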

## Initialize client

In [None]:
from monsterapi import client as mclient
deploy_client = mclient(api_key = api_key)

## Launch an LLM deployment on MonsterAPI
Let us deploy the Mixtral 8x7B Chat model with GPTQ 4-bit quantization on a 48GB GPU.

The deployment serves the model as a REST API that supports both static and streaming token responses.

An example of payload parameters for deploying the Mixtral model:

```
    basemodel_path="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"
    prompt_template="<s> [INST] {prompt} [/INST] {completion}</s>"
    api_auth_token="A_RANDOM_AUTH_TOKEN_TO_SECURE_YOUR_ENDPOINT"
    per_gpu_vram=48
    gpu_count=1
```
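
To make the role of `prompt_template` concrete, the sketch below shows how a placeholder such as `{prompt}` could be filled from the `input_variables` sent at query time. This is only an illustration: it assumes the service substitutes placeholders in a `str.format`-like way, which this notebook does not confirm, and `render_prompt` is a hypothetical helper written for this example.

```python
# Template with the same placeholder names used in the launch payload.
prompt_template = "<s> [INST] {prompt} [/INST] {completion}</s>"


def render_prompt(template: str, input_variables: dict) -> str:
    """Fill the template for inference (illustrative only).

    At inference time there is no completion yet, so everything from
    the {completion} placeholder onwards is dropped before formatting.
    """
    prefix = template.split("{completion}")[0]
    # Raises KeyError if a placeholder has no matching input variable.
    return prefix.format(**input_variables)


rendered = render_prompt(prompt_template, {"prompt": "What's up?"})
print(repr(rendered))
```

The practical takeaway is that the keys in `input_variables` need to match the placeholder names in the template: a template written with `{prompt}` expects a `"prompt"` key in the request.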

In [None]:
launch_payload = {
    "basemodel_path": "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
    "prompt_template": "<s> [INST] {prompt} [/INST] {completion}</s>",
    "api_auth_token": "b6a97d3b-35d0-4720-a44c-59ee33dbc25b",
    "per_gpu_vram": 48,
    "gpu_count": 1,
    "use_nightly": True
}

# Launch a deployment
ret = deploy_client.deploy("llm", launch_payload)
deployment_id = ret.get("deployment_id")
print(deployment_id)
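
The `api_auth_token` in the payload is meant to be a random value that secures your endpoint, so it is better to generate your own than to reuse a token copied from an example. A minimal sketch using Python's standard `uuid` module follows. It assumes a UUID-style string is acceptable, based on the format of the sample tokens in this notebook; `make_auth_token` is a hypothetical helper name.

```python
import uuid


def make_auth_token() -> str:
    """Return a random UUID4 string to use as the endpoint auth token."""
    return str(uuid.uuid4())


api_auth_token = make_auth_token()
print(api_auth_token)

# Usage: set it in the payload before launching, for example
# launch_payload["api_auth_token"] = api_auth_token
```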

### Deployment Status Types

#### The status call returns one of the following responses:
1. Pending (in progress)
    ```
    {
        "status":"pending",
        "message":"Instance is still being provisioned, please wait and try again"
    }
    ```

2. Failed
    ```
    {
        "status": "failed",
        "message": "Instance has failed, please launch a new instance"
    }
    ```

3. Building
    ```
    {
        "status": "building",
        "message": "Server has started but trying to connect to deployment container, just downloading your model and setting things up, please try again in few mins, if state persists, please use /restart or /terminate!"
    }
    ```
4. Live
    ```
    {
        "status":"live",
        "message":"Server has started !!!",
        "URL":"https://c503a813-850a-4a78-93b9.monsterapi.ai",
        "api_auth_token":"57b7b903-a4b6-4720-8154-af71aa8e8313"
    }
    ```

    Visit the URL to get the LLM service endpoint details, or open `<URL>/docs` for the Swagger docs.
5. Terminated by User
    ```
    {
        "status":"terminatedByUser",
        "message":"Instance is terminatedByUser"
    }
    ```

6. Terminated by System (Out of Credits)
    ```
    {
        "status":"terminatedBySystem",
        "message":"Instance is terminatedBySystem"
    }
    ```
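
Since a deployment moves through these statuses over several minutes, it can be convenient to poll until it is `live` instead of re-running the status cell by hand. The following is a minimal sketch under a few assumptions: `wait_until_live` is a hypothetical helper, the polling interval and timeout are arbitrary defaults, and the status function is passed in as an argument (for example `deploy_client.get_deployment_status`) so the helper itself does not depend on the client.

```python
import time

# Statuses from the list above after which the deployment will never become live.
TERMINAL_STATUSES = {"failed", "terminatedByUser", "terminatedBySystem"}


def wait_until_live(get_status, deployment_id, poll_seconds=30,
                    timeout_seconds=1800, sleep=time.sleep):
    """Poll get_status(deployment_id) until the status is 'live'.

    get_status: a callable such as deploy_client.get_deployment_status.
    Returns the final status response, or raises on failure or timeout.
    """
    waited = 0
    while True:
        status_ret = get_status(deployment_id)
        status = status_ret.get("status")
        if status == "live":
            return status_ret
        if status in TERMINAL_STATUSES:
            raise RuntimeError(
                f"Deployment ended with status {status!r}: {status_ret.get('message')}"
            )
        if waited >= timeout_seconds:
            raise TimeoutError(f"Deployment still {status!r} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Usage:
# status_ret = wait_until_live(deploy_client.get_deployment_status, deployment_id)
```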



## Get your Deployment Status


In [None]:
status_debug = True # Just a placeholder to show possible statuses.

In [None]:
if status_debug:
  status_ret = deploy_client.get_deployment_status(deployment_id)
  print(status_ret)

#### Get deployment logs (available from the `building` status onwards; this may take 5-10 minutes)

> Deployment configuration may take a few minutes. We are working on optimizing the service.

> `status` is initially set to `building` and changes to `live` as the deployment configuration progresses. Logs are available from the `building` state onwards.

In [None]:
logs_ret = deploy_client.get_deployment_logs(deployment_id, n_lines = 50)
if 'logs' not in logs_ret:
  raise Exception("Please wait until status changes to building!")
for i in logs_ret['logs']:
  print(i)

#### Live Status

In [None]:
status_ret = deploy_client.get_deployment_status(deployment_id)
print(status_ret)

## Once the deployment is live, let's query our deployed LLM endpoint:

---



In [None]:
import json

assert status_ret.get("status") == "live", "Please wait until status is live!"

service_client  = mclient(api_key = status_ret.get("api_auth_token"), base_url = status_ret.get("URL"))

payload = {
    "input_variables": {
        "prompt": "What's up?"},
    "stream": False,
    "temperature": 0.6,
    "max_tokens": 2048
}

output = service_client.generate(model = "deploy-llm", data = payload)

if payload.get("stream"):
    for i in output:
        print(i[0])
else:
    print(json.loads(output)['text'][0])



 Why do you look so sad?

I'm not sad, I'm just... contemplative.

Contemplative? What does that even mean?

It means I'm thinking deeply about life and all its mysteries.

Mysteries? Like what?

Like... like everything! The meaning of existence, the nature of reality, the purpose of humanity.

But why do you have to think about all that stuff? Can't you just enjoy life and have fun?

I do enjoy life, and I do have fun! But I also like to think about bigger questions. It's important to me to understand the world and my place in it.

But don't you think that's a waste of time? You could be out there living life instead of sitting around thinking about it all the time.

I don't think it's a waste of time at all. Thinking and living are not mutually exclusive. I can enjoy life and think deeply about it at the same time.

But don't you get overwhelmed by all the big questions?

Sometimes I do. But I try to remember that it's okay to not have all the answers. Life is a mystery, and that's w

------

## Terminate Deployment

Once your work is done, terminate the LLM deployment to stop billing on your account.

In [None]:
terminate_return = deploy_client.terminate_deployment(deployment_id)
print(terminate_return)

{'message': 'Instance Terminated'}
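
Because a running deployment keeps consuming credits, it can help to make termination automatic even when a cell raises an error midway. The sketch below wraps the `terminate_deployment` call used above in a context manager. `managed_deployment` is a hypothetical helper written for this example, not part of the MonsterAPI client.

```python
from contextlib import contextmanager


@contextmanager
def managed_deployment(client, deployment_id):
    """Yield the deployment id and always terminate the deployment on exit."""
    try:
        yield deployment_id
    finally:
        # Runs on normal exit and on exceptions, so billing stops either way.
        result = client.terminate_deployment(deployment_id)
        print(result)


# Usage:
# with managed_deployment(deploy_client, deployment_id):
#     ...  # query the endpoint here
```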


## Terminate Status

Fetch the deployment status to confirm the termination:

In [None]:
status_ret = deploy_client.get_deployment_status(deployment_id)
print(status_ret)

{'status': 'terminatedByUser', 'message': 'Instance is terminatedByUser'}
