## Script for benchmarking vLLM with Mistral

I got Mistral running using this [Medium post](https://medium.com/snowflake/generating-product-descriptions-with-mistral-7b-instruct-v0-2-with-vllm-serving-3fe7110b048b) and Eda's [GitHub repo](https://github.com/edemiraydin/mistral_vllm_demo/tree/main).

I will do more notebooks/material on LLMs. I am starting with traditional ML.

This was a quick notebook for benchmarking vLLM.

See all the models running on the vLLM server:

In [8]:
!curl http://localhost:8000/v1/models

{"object":"list","data":[{"id":"mistralai/Mistral-7B-Instruct-v0.2","object":"model","created":1706369454,"owned_by":"vllm","root":"mistralai/Mistral-7B-Instruct-v0.2","parent":null,"permission":[{"id":"modelperm-c1e2fd8706004bae8554a5101608bf11","object":"model_permission","created":1706369454,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
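The same response can be read from Python rather than by eye. The following is a minimal sketch that extracts the model ids from a `/v1/models` response body; the `model_ids` helper is hypothetical, and the JSON string is an abridged copy of the output above.

```python
import json

# Abridged copy of the /v1/models response shown above.
models_json = (
    '{"object":"list","data":[{"id":"mistralai/Mistral-7B-Instruct-v0.2",'
    '"object":"model","owned_by":"vllm"}]}'
)


def model_ids(payload: str) -> list[str]:
    """Return the `id` of every model listed in a /v1/models response body."""
    return [entry["id"] for entry in json.loads(payload)["data"]]


print(model_ids(models_json))  # ['mistralai/Mistral-7B-Instruct-v0.2']
```

The id returned here is the value to pass as `model` in the completion requests.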

Verify Mistral is working

In [None]:
!curl http://localhost:8000/v1/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "mistralai/Mistral-7B-Instruct-v0.2", \
        "prompt": "[INST]What is Snowflake [/INST] ", \
        "temperature": 0, "max_tokens":200 }'

{"id":"cmpl-c57675d45db14f76a9a7d9f836cb3de3","object":"text_completion","created":48076,"model":"mistralai/Mistral-7B-Instruct-v0.2","choices":[{"index":0,"text":" Snowflake is a cloud-based data warehousing platform that provides an elastic, scalable, and secure solution for managing and analyzing large volumes of data. Snowflake was designed to make it easier and more cost-effective to move and analyze data in the cloud, compared to traditional on-premises data warehousing solutions.\n\nSnowflake separates the compute and storage layers, allowing users to scale each independently based on their needs. It also supports various data formats and sources, including structured data in CSV, JSON, Avro, and Parquet formats, as well as semi-structured and unstructured data.\n\nSnowflake offers several features that make it an attractive option for data analytics and business intelligence applications, such as:\n\n* Multi-cloud support: Snowflake is available on multiple cloud platforms, inc
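If you would rather not hand-quote the JSON body, it can be built in Python. This is a sketch only: `build_completion_payload` is a hypothetical helper that wraps a question in the same `[INST] ... [/INST]` markers and uses the same field values as the curl call above.

```python
import json


def build_completion_payload(question: str, max_tokens: int = 200) -> str:
    """Serialize a /v1/completions request body for the Mistral instruct model."""
    # Same prompt layout as the curl example: [INST]<question> [/INST]
    prompt = f"[INST]{question} [/INST] "
    body = {
        "model": "mistralai/Mistral-7B-Instruct-v0.2",
        "prompt": prompt,
        "temperature": 0,
        "max_tokens": max_tokens,
    }
    return json.dumps(body)


payload = build_completion_payload("What is Snowflake")
print(payload)
```

The resulting string can be sent as the request body with any HTTP client.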

Running the model from the notebook. You can pass many other parameters to `completions.create`, such as `temperature`.

In [31]:
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
prompt = "San Francisco is a"

completion = client.completions.create(model="mistralai/Mistral-7B-Instruct-v0.2",
                                      prompt=prompt)
print("Completion result:", completion)

Completion result: Completion(id='cmpl-529c4a947aa849269f1158a7c8a01d25', choices=[CompletionChoice(finish_reason='length', index=0, logprobs=None, text=' town known for its drama, both in real life and on the stage. It')], created=53920, model='mistralai/Mistral-7B-Instruct-v0.2', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=16, prompt_tokens=5, total_tokens=21))
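The completion above stopped with `finish_reason='length'` after 16 completion tokens, which suggests a default token limit applied because `max_tokens` was not set. A minimal sketch of passing sampling parameters explicitly follows. The `complete` helper and its default values are hypothetical; `temperature`, `top_p`, and `max_tokens` are standard parameters of the OpenAI completions API.

```python
def complete(client, prompt, model="mistralai/Mistral-7B-Instruct-v0.2", **sampling):
    """Call the completions endpoint with explicit sampling parameters.

    `client` is expected to be an openai.OpenAI instance like the one above.
    The defaults below are illustrative, not recommendations.
    """
    params = {"temperature": 0.0, "max_tokens": 200}
    params.update(sampling)  # caller-supplied values override the defaults
    return client.completions.create(model=model, prompt=prompt, **params)
```

With the `client` defined earlier, `complete(client, "San Francisco is a", max_tokens=64)` would request up to 64 tokens instead of relying on the default.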


This benchmarking script comes from Hamel's [inference notes](https://hamel.dev/notes/llm/inference/03_inference.html)

In [53]:
questions = [
    # Coding questions
    "Implement a Python function to compute the Fibonacci numbers.",
    "Write a Rust function that performs binary exponentiation.",
    "What are the differences between Javascript and Python?",
    # Literature
    "Write a story in the style of James Joyce about a trip to the Australian outback in 2083, to see robots in the beautiful desert.",
    "Who does Harry turn into a balloon?",
    "Write a tale about a time-traveling historian who's determined to witness the most significant events in human history.",
    # Math
    "What is the product of 9 and 8?",
    "If a train travels 120 kilometers in 2 hours, what is its average speed?",
    "Think through this step by step. If the sequence a_n is defined by a_1 = 3, a_2 = 5, and a_n = a_(n-1) + a_(n-2) for n > 2, find a_6.",
]

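The `generate` function below is simplified; you can pass more sampling parameters (such as `temperature`, `top_p`, and `max_tokens`) to the request.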

In [62]:
import time

def generate(prompt, note=None):
    """Send one completion request and record its latency and token usage."""
    response = {'prompt': prompt, 'note': note}
    # Sampling parameters are left at their defaults here. To set them, pass
    # e.g. temperature=1.0, top_p=1, max_tokens=200 to create() below.
    start = time.perf_counter()
    result = client.completions.create(model="mistralai/Mistral-7B-Instruct-v0.2",
                                      prompt=prompt)
    request_time = time.perf_counter() - start  # wall-clock seconds

    # total_tokens counts prompt tokens plus generated tokens.
    response['tok_count'] = result.usage.total_tokens
    response['time'] = request_time
    response['answer'] = result.choices[0].text

    return response

In [63]:
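import time
import pandas as pd

counter = 1
responses = []

for q in questions:
    response = generate(prompt=q, note='vLLM')
    # Every question is sent, but the first response is not recorded
    # (presumably a warm-up), so the results start at the second question.
    if counter >= 2:
        responses.append(response)
    counter += 1

df = pd.DataFrame(responses)
df.to_csv('bench-vllm.csv', index=False)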
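The loop above records results only from the second request onward, which reads like a warm-up step. One way to make that explicit is sketched below. `run_benchmark` and its `warmup` parameter are hypothetical names, and a stand-in function replaces the real request so the sketch runs without a server.

```python
import time


def run_benchmark(generate_fn, prompts, warmup=1):
    """Time generate_fn on each prompt, discarding the first `warmup` results."""
    rows = []
    for i, prompt in enumerate(prompts):
        start = time.perf_counter()
        answer = generate_fn(prompt)
        elapsed = time.perf_counter() - start
        if i >= warmup:  # warm-up requests are sent but not recorded
            rows.append({"prompt": prompt, "time": elapsed, "answer": answer})
    return rows


# Stand-in for a real request so the sketch runs offline.
rows = run_benchmark(lambda p: p.upper(), ["a", "b", "c"])
print([r["prompt"] for r in rows])  # ['b', 'c']
```

Passing `warmup=0` would record every request, including the first.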

In [64]:
df

| | prompt | note | tok_count | time | answer |
|---|---|---|---|---|---|
| 0 | Write a Rust function that performs binary exp... | vLLM | 28 | 0.531325 | Note: you assumed operator overloading in Rus... |
| 1 | What are the differences between Javascript an... | vLLM | 28 | 0.530939 | \n\nJavascript and Python are both high-level ... |
| 2 | Write a story in the style of James Joyce abou... | vLLM | 51 | 0.530167 | \n\nI. Voyage to the Inland Sea\n\nThe sun |
| 3 | Who does Harry turn into a balloon? | vLLM | 26 | 0.530193 | In "Harry Potter and the Chamber of Secrets,"... |
| 4 | Write a tale about a time-traveling historian ... | vLLM | 42 | 0.527757 | \n\nOnce upon a time, in a quaint little town ... |
| 5 | What is the product of 9 and 8? | vLLM | 28 | 0.531444 | \nAnswer: The product of 9 and 8 is 72 |
| 6 | If a train travels 120 kilometers in 2 hours, ... | vLLM | 38 | 0.528193 | Let's calculate the average speed step by ste... |
| 7 | Think through this step by step. If the sequen... | vLLM | 76 | 0.532794 | \n\nFirst, we have to calculate a_3 and a_4 using |
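A rough tokens-per-second figure can be derived from the `tok_count` and `time` columns. The sketch below uses the first three rows of the table above. Note that `tok_count` comes from `usage.total_tokens`, so it includes prompt tokens and the rate overstates generation speed; using `usage.completion_tokens` instead would be one way to measure generated tokens only. The `tokens_per_second` helper is a hypothetical name.

```python
from statistics import mean

# (tok_count, time in seconds) for the first three rows of the table above.
rows = [(28, 0.531325), (28, 0.530939), (51, 0.530167)]


def tokens_per_second(tok_count: int, seconds: float) -> float:
    """Tokens handled per wall-clock second for a single request."""
    return tok_count / seconds


rates = [tokens_per_second(t, s) for t, s in rows]
print(f"mean latency: {mean(s for _, s in rows):.3f} s")
print(f"mean rate:    {mean(rates):.1f} tok/s (prompt tokens included)")
```

The latencies are nearly identical across prompts, which would be consistent with every request generating the same number of tokens, as in the 16-token completion shown earlier.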
