# Mistral 7B on Vertex AI with [vLLM](https://github.com/vllm-project/vllm)
Based on [this documentation](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/148?project=kic-chat-assistant) and [this notebook](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/model_garden/model_garden_pytorch_mistral.ipynb).

In [15]:
PROJECT_ID = "kic-chat-assistant"
REGION = "europe-west4"
SERVICE_ACCOUNT = "vertexai-endpoint-sa@kic-chat-assistant.iam.gserviceaccount.com"
# For experiment outputs
BUCKET_URI = "gs://vertexai_mistral"
STAGING_BUCKET = f"{BUCKET_URI}/temporal"
# The pre-built serving docker image with vLLM
VLLM_DOCKER_URI = "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve"


In [16]:
from datetime import datetime
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=STAGING_BUCKET)

def get_job_name_with_datetime(prefix: str):
    """Gets the job name with date time when triggering training or deployment
    jobs in Vertex AI.
    """
    return prefix + datetime.now().strftime("_%Y%m%d_%H%M%S")


def deploy_model_vllm(
    model_name,
    model_id,
    service_account,
    machine_type="g2-standard-8",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
):
    """Uploads a model served by vLLM and deploys it to a new Vertex AI endpoint."""
    endpoint = aiplatform.Endpoint.create(display_name=f"{model_name}-endpoint")

    # T4 and V100 GPUs do not natively support bfloat16, so use float16 there.
    dtype = "bfloat16"
    if accelerator_type in ["NVIDIA_TESLA_T4", "NVIDIA_TESLA_V100"]:
        dtype = "float16"

    vllm_args = [
        "--host=0.0.0.0",
        "--port=7080",
        f"--model={model_id}",
        f"--tensor-parallel-size={accelerator_count}",
        "--swap-space=16",
        f"--dtype={dtype}",
        "--gpu-memory-utilization=0.9",
        "--disable-log-stats",
    ]
    model = aiplatform.Model.upload(
        display_name=model_name,
        serving_container_image_uri=VLLM_DOCKER_URI,
        serving_container_command=["python", "-m", "vllm.entrypoints.api_server"],
        serving_container_args=vllm_args,
        serving_container_ports=[7080],
        serving_container_predict_route="/generate",
        serving_container_health_route="/ping",
    )

    model.deploy(
        endpoint=endpoint,
        machine_type=machine_type,
        accelerator_type=accelerator_type,
        accelerator_count=accelerator_count,
        deploy_request_timeout=1800,
        service_account=service_account,
    )
    return model, endpoint
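
The argument-building step of `deploy_model_vllm` can be tried on its own, without touching Vertex AI. The sketch below pulls it into a helper; `build_vllm_args` and `FLOAT16_ONLY_ACCELERATORS` are hypothetical names introduced here for illustration, and the flag values are simply the ones used in the function above.

```python
from typing import List

# Accelerators that the deployment function above switches to float16.
FLOAT16_ONLY_ACCELERATORS = {"NVIDIA_TESLA_T4", "NVIDIA_TESLA_V100"}


def build_vllm_args(model_id: str, accelerator_type: str, accelerator_count: int) -> List[str]:
    """Hypothetical helper: returns the vLLM server arguments used above."""
    dtype = "float16" if accelerator_type in FLOAT16_ONLY_ACCELERATORS else "bfloat16"
    return [
        "--host=0.0.0.0",
        "--port=7080",
        f"--model={model_id}",
        # One tensor-parallel shard per attached GPU.
        f"--tensor-parallel-size={accelerator_count}",
        "--swap-space=16",
        f"--dtype={dtype}",
        "--gpu-memory-utilization=0.9",
        "--disable-log-stats",
    ]


# Inspect the arguments for a 2x T4 configuration before deploying anything.
for arg in build_vllm_args("mistralai/Mistral-7B-Instruct-v0.1", "NVIDIA_TESLA_T4", 2):
    print(arg)
```

Printing the list before calling `aiplatform.Model.upload` is a cheap way to check that the dtype and tensor-parallel size match the chosen machine.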

In [17]:
prebuilt_model_id = "mistralai/Mistral-7B-Instruct-v0.1"

# Find Vertex AI prediction supported accelerators and regions in
# https://cloud.google.com/vertex-ai/docs/predictions/configure-compute.
# Pricing: https://cloud.google.com/vertex-ai/pricing#pred_eur

# Proposed configurations and pricing per hour for europe-west4 region:
# n1-standard-16 with 2 T4 GPUs    : $1.0123 + 2* GPU $0.4370 
# n1-standard-16 with 2 V100 GPUs  : $1.0123 + 2* GPU $2.9325
# g2-standard-8 with 1 L4 GPU      : $1.081  + GPU included?
# a2-highgpu-1g with 1 A100 GPU    : $4.3103 + GPU included!

machine_type = "g2-standard-8"
accelerator_type = "NVIDIA_L4"
accelerator_count = 1

model, endpoint = deploy_model_vllm(
    model_name=get_job_name_with_datetime(prefix="mistral-serve-vllm"),
    model_id=prebuilt_model_id,
    service_account=SERVICE_ACCOUNT,
    machine_type=machine_type,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
)

Creating Endpoint
Create Endpoint backing LRO: projects/675164168178/locations/europe-west4/endpoints/5312809399088054272/operations/9083172484563337216
Endpoint created. Resource name: projects/675164168178/locations/europe-west4/endpoints/5312809399088054272
To use this Endpoint in another session:
endpoint = aiplatform.Endpoint('projects/675164168178/locations/europe-west4/endpoints/5312809399088054272')
Creating Model
Create Model backing LRO: projects/675164168178/locations/europe-west4/models/8617395994415333376/operations/2789392005313069056
Model created. Resource name: projects/675164168178/locations/europe-west4/models/8617395994415333376@1
To use this Model in another session:
model = aiplatform.Model('projects/675164168178/locations/europe-west4/models/8617395994415333376@1')
Deploying model to Endpoint : projects/675164168178/locations/europe-west4/endpoints/5312809399088054272
Deploy Endpoint model backing LRO: projects/675164168178/locations/europe-west4/endpoints/531280
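
The pricing comments in the cell above can be turned into a quick hourly estimate. This is a minimal sketch using only the figures quoted there; it covers the two n1 configurations, because the comments leave open whether the g2 and a2 prices already include the GPU. `hourly_cost` is a hypothetical helper, and the prices may be out of date.

```python
# Hourly prices in USD for europe-west4, copied from the comments above.
N1_STANDARD_16_PER_HOUR = 1.0123
GPU_PER_HOUR = {"NVIDIA_TESLA_T4": 0.4370, "NVIDIA_TESLA_V100": 2.9325}


def hourly_cost(machine_price: float, gpu_price: float, gpu_count: int) -> float:
    """Hypothetical helper: machine price plus the price of each attached GPU."""
    return machine_price + gpu_count * gpu_price


for gpu, price in GPU_PER_HOUR.items():
    total = hourly_cost(N1_STANDARD_16_PER_HOUR, price, gpu_count=2)
    print(f"n1-standard-16 + 2x {gpu}: ${total:.4f}/hour")
```

With these inputs the T4 configuration comes to roughly $1.89 per hour and the V100 configuration to roughly $6.88 per hour.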

# Inference

In [34]:
instance = {
    "prompt": "My favourite condiment is",
    "n": 1,
    "max_tokens": 200,
}

async def get_predictions(endpoint, instance):
    """Gets predictions from the deployed model."""
    response = await endpoint.predict_async(instances=[instance])
    # `response.predictions` holds one entry per submitted instance.
    for prediction in response.predictions:
        print(prediction)

import asyncio

# Single request:
# task = asyncio.create_task(get_predictions(endpoint, instance))
# await task
# Concurrent requests: send the same instance 10 times.
tasks = [asyncio.create_task(get_predictions(endpoint, instance)) for _ in range(10)]
await asyncio.gather(*tasks)

print("Done")

Prompt:
My favourite joke goes like this:
Output:
Why don't scientists trust atoms?

Because they make up everything!

So you see, scientists are just like us, always questioning the world around us and trying to make sense of it. And that's what I love about science - it's an ongoing exploration of the unknown.

So, if you're ever curious about something, don't be afraid to ask a question or seek out some answers. Science is here to help!
Prompt:
My favourite condiment is
Output:
 easily sriracha sauce. It satisfies my cravings for a spicy kick and adds extra depth to my meals. I’ve even been known to incorporate it in my baking! Recently, I came across this recipe for vegan sriracha tofu that looked too good to pass up.

### Ingredients:

- 1 block of firm tofu (14 oz)
- 1/4 cup of nutritional yeast
- 2 tablespoons of sriracha sauce
- 2 tablespoons of soy sauce
- 1 tablespoon of rice vinegar
- 1 clove of garlic, minced
- 1 teaspoon of sesame oil
- 1 teaspoon of cornstarch
- 1/4 teasp
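
Judging by the output above, each prediction appears to come back as one string laid out as `Prompt:`, the prompt, `Output:`, and then the completion. Assuming that layout holds, the sketch below separates the two parts; `split_prediction` is a hypothetical helper, and the sample string is a shortened copy of the output above rather than a fresh response.

```python
from typing import Tuple


def split_prediction(text: str) -> Tuple[str, str]:
    """Hypothetical helper: splits one prediction string into (prompt, completion)."""
    prefix = "Prompt:\n"
    marker = "\nOutput:\n"
    if not text.startswith(prefix):
        raise ValueError("prediction does not start with 'Prompt:'")
    rest = text[len(prefix):]
    if marker not in rest:
        raise ValueError("prediction has no 'Output:' section")
    # Split on the first marker only, so later text in the completion is left alone.
    prompt, completion = rest.split(marker, 1)
    return prompt, completion


sample = "Prompt:\nMy favourite condiment is\nOutput:\n easily sriracha sauce."
prompt, completion = split_prediction(sample)
print(repr(prompt))
print(repr(completion))
```

Keeping the completion separate makes it easier to collect results from many concurrent requests instead of only printing them.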

# Clean up the endpoint and model

In [None]:
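# Set to True to delete the endpoints created above and the bucket.
delete_endpoint = False


def list_endpoints():
    """Returns (name, display name) for the endpoints created by this notebook."""
    return [
        (r.name, r.display_name)
        for r in aiplatform.Endpoint.list()
        if r.display_name.startswith("mistral-serve-vllm")
    ]

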
try:
    if delete_endpoint:
        endpoints = list_endpoints()
        for endpoint_id, endpoint_name in endpoints:
            endpoint = aiplatform.Endpoint(endpoint_id)
            print(
                f"Undeploying all deployed models and deleting endpoint {endpoint_id} [{endpoint_name}]"
            )
            endpoint.delete(force=True)

        # Delete the bucket
        !gsutil -m rm -r $BUCKET_URI
        
except Exception as e:
    print(e)
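
The cell above removes endpoints and the bucket but leaves the uploaded models in place. A minimal sketch of prefix-based deletion that could cover them follows. `delete_by_prefix` and `FakeResource` are hypothetical names; the fake class exists only so the sketch runs without credentials. Applying it to `aiplatform.Model.list()` is an assumption about how it would be used here, and it should only run after the endpoints are gone, since a model that is still deployed cannot be deleted.

```python
from typing import Iterable, List


def delete_by_prefix(resources: Iterable, prefix: str, dry_run: bool = True) -> List[str]:
    """Hypothetical helper: deletes resources whose display_name starts with prefix.

    Returns the display names that matched. With dry_run=True nothing is deleted.
    """
    matched = []
    for resource in resources:
        if resource.display_name.startswith(prefix):
            matched.append(resource.display_name)
            if not dry_run:
                resource.delete()
    return matched


class FakeResource:
    """Stand-in for a Vertex AI resource, used only to exercise the helper."""

    def __init__(self, display_name: str):
        self.display_name = display_name
        self.deleted = False

    def delete(self) -> None:
        self.deleted = True


fakes = [FakeResource("mistral-serve-vllm_20240101_000000"), FakeResource("other-model")]
print(delete_by_prefix(fakes, "mistral-serve-vllm", dry_run=False))

# In this notebook the assumed call would be:
# delete_by_prefix(aiplatform.Model.list(), "mistral-serve-vllm", dry_run=False)
```

Running it first with `dry_run=True` lists what would be removed without deleting anything.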

