# Deploy `TinyLlama/TinyLlama-1.1B-Chat-v1.0` model on MonsterAPI using Monster Deploy

Monster Deploy is a new LLM deployment engine that lets you serve various LLMs, along with LoRA adapters, as an API endpoint on MonsterAPI's robust and cost-optimised GPU cloud.

The following deployment options are supported:
1. Deploy SOTA LLMs and fine-tuned LLM LoRA adapters as a REST API serving endpoint
2. Deploy Docker containers for GPU-powered applications

Monster Deploy offers built-in optimisations for higher throughput and lower cost when serving LLMs.

Check out our [Developer Docs](https://developer.monsterapi.ai/docs/monster-deploy-beta) for details.

If you haven't applied for the Deploy beta yet, you can sign up using this [Google form](https://forms.gle/ZHuZt68fJLRozo3v9) for early access with free credits.

Sign up on [MonsterAPI](https://monsterapi.ai/signup?utm_source=llm-deploy-colab&utm_medium=referral) to get a free auth key, and paste it below.
Make sure you have also signed up for beta access [here](https://forms.gle/TTJRapHm59RxjttJA).

In [None]:
api_key = "YOUR_MONSTERAPI_KEY"

### Install and Initialize MonsterAPI Client

In [None]:
!python3 -m pip install monsterapi==1.0.2b3

from monsterapi import client as mclient
deploy_client = mclient(api_key = api_key)

### Create `TinyLlama/TinyLlama-1.1B-Chat-v1.0` model deployment:

In [None]:
prompt_template = """
<|system|>
{system} </s>
<|user|>
{prompt} </s>
<|assistant|>
{response}
"""

launch_payload = {
    "basemodel_path": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "prompt_template": prompt_template,
    "per_gpu_vram": 24,
    "gpu_count": 1
}

# Launch a deployment
ret = deploy_client.deploy("llm", launch_payload)
deployment_id = ret.get("deployment_id")
print(deployment_id)
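To see how the template placeholders line up with the `input_variables` sent at query time, the template can be filled locally with Python's `str.format`. This is only a local preview: that the service substitutes the variables in the same way is an assumption, and `preview_prompt` is a hypothetical helper, not part of the `monsterapi` client.

```python
# Local preview only. Assumes the placeholders behave like str.format fields;
# the actual server-side substitution is not documented here.
prompt_template = """
<|system|>
{system} </s>
<|user|>
{prompt} </s>
<|assistant|>
{response}
"""


def preview_prompt(template, system, prompt, response=""):
    """Hypothetical helper: fill the template placeholders for inspection."""
    return template.format(system=system, prompt=prompt, response=response)


preview = preview_prompt(
    prompt_template,
    system="You are a friendly chatbot",
    prompt="Are you sentient?",
)
print(preview)
```

Leaving `response` empty shows the text the model would be asked to continue from, ending at the `<|assistant|>` marker.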

### Fetch your Deployment Status:

Wait until the status is `live`. This usually takes 5-10 minutes, so re-run the cell below until it reports `live`.

In [None]:
status_ret = deploy_client.get_deployment_status(deployment_id)
print(status_ret)
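Instead of re-running the cell by hand, the status check can be wrapped in a small polling loop. The sketch below is a hypothetical helper, not part of the `monsterapi` client: it takes any callable that returns a dict with a `status` key (as `get_deployment_status` appears to, judging by the query cell later in this notebook). The poll interval and timeout are illustrative defaults, and since the other status values are not listed here, the loop only recognises `live` and otherwise waits until the timeout.

```python
import time


def wait_until_live(get_status, deployment_id, poll_seconds=30, timeout_seconds=1200):
    """Poll get_status(deployment_id) until the reported status is "live".

    Hypothetical helper: get_status is any callable returning a dict-like
    object with a "status" key. Raises TimeoutError if the deadline passes.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status_ret = get_status(deployment_id)
        # Compare case-insensitively, since the status casing may vary.
        if str(status_ret.get("status", "")).lower() == "live":
            return status_ret
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Deployment {deployment_id} was not live after {timeout_seconds}s"
            )
        time.sleep(poll_seconds)
```

With the client from this notebook, it could be called as `status_ret = wait_until_live(deploy_client.get_deployment_status, deployment_id)`.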

### Once the deployment is live, let's query our deployed LLM endpoint:

In [None]:
import json

assert status_ret.get("status") == "live", "Please wait until status is live!"

service_client = mclient(api_key = status_ret.get("api_auth_token"), base_url = status_ret.get("URL"))

payload = {
    "input_variables": {
        "system": "You are a friendly chatbot",
        "prompt": "Are you sentient?"
    },
    "stream": False,
    "temperature": 0.9,
    "max_tokens": 256
}

output = service_client.generate(model = "deploy-llm", data = payload)

if payload.get("stream"):
    for i in output:
        print(i[0])
else:
    print(json.loads(output)['text'][0])




<|system|>
You are a friendly chatbot </s>
<|user|>
Are you sentient? </s>
<|assistant|>
No, I am not sentient. Sentience is an idea that refers to the ability of living organisms to experience and have emotions, thoughts, and behaviors beyond those of a physical entity alone. Sentient refers to living biological entities that possess a form of intelligence and consciousness. While robots, artificial intelligence, and other machine-based entities may exhibit thought processes and behavior, they are not necessarily considered sentient.


------

### Terminate Deployment

Once your work is done, terminate your LLM deployment to stop billing on your account.

In [None]:
terminate_return = deploy_client.terminate_deployment(deployment_id)
print(terminate_return)
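Because a deployment keeps billing until it is terminated, it can help to guarantee termination even when a later cell fails. A minimal sketch using a context manager follows. `temporary_deployment` is a hypothetical wrapper, not part of the `monsterapi` client; it relies only on the `deploy` and `terminate_deployment` calls used above and assumes they behave as shown in this notebook.

```python
from contextlib import contextmanager


@contextmanager
def temporary_deployment(client, launch_payload):
    """Hypothetical wrapper: launch an LLM deployment and always terminate it on exit."""
    ret = client.deploy("llm", launch_payload)
    deployment_id = ret.get("deployment_id")
    try:
        yield deployment_id
    finally:
        # Runs on normal exit and on exceptions alike, so billing stops either way.
        client.terminate_deployment(deployment_id)
```

Used as `with temporary_deployment(deploy_client, launch_payload) as deployment_id: ...`, the status check and queries go inside the `with` block, and the deployment is terminated when the block exits.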