Notebook by [Build Fast with AI](https://www.buildfastwithai.com/genai-course)


# Fastest LLM APIs

A quick speed test of the Llama 3.3 70B model across three providers:
- Groq
- Cerebras
- Sambanova

Each provider is measured on tokens per second, time to first token, and total latency, using the same prompt.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://drive.google.com/file/d/16CJABHvVzc9kJvV--wHOzqfWZRilVkyl/view?usp=sharing)

In [None]:
!pip install -qU openai python-dotenv

In [None]:
import os
from dotenv import load_dotenv

load_dotenv()

True
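
The provider cells below read three API keys from the environment. Here is a minimal sketch of a pre-flight check, assuming the variable names used later in this notebook (`GROQ_API_KEY`, `CEREBRAS_API_KEY`, `SAMBANOVA_API_KEY`). `missing_keys` is a hypothetical helper written for this example, not part of any SDK.

```python
import os

# Environment variable names used by the provider cells in this notebook
REQUIRED_KEYS = ("GROQ_API_KEY", "CEREBRAS_API_KEY", "SAMBANOVA_API_KEY")


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the required variable names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_keys()
if missing:
    print("Missing API keys:", ", ".join(missing))
else:
    print("All API keys found.")
```

Running this before the provider cells surfaces a missing key as a clear message rather than as an authentication error from the first request.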

### Groq

In [None]:
from openai import OpenAI

# Set up the client with Groq's API key
client = OpenAI(
    api_key=os.getenv("GROQ_API_KEY"),
    base_url="https://api.groq.com/openai/v1")

# Use the Llama 3.3 70B model
completion = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "user", "content": "Tell me a long story"}
    ],
    stream=True,
)

for chunk in completion:
    print(chunk.choices[0].delta.content or "", end="")

Once upon a time, in a small village nestled in the rolling hills of rural England, there lived a young girl named Emily. Emily was a curious and adventurous child, with a heart full of wonder and a mind full of questions. She loved to explore the world around her, to climb trees, to chase after butterflies, and to watch the clouds drift lazily across the sky.

As she grew older, Emily's sense of wonder never faded, but her circumstances changed. Her parents, who had always been kind and supportive, fell on hard times. Her father, a skilled craftsman, lost his job due to the economic downturn, and her mother, a talented seamstress, struggled to make ends meet. The family was forced to live in a small, humble cottage on the outskirts of the village, and Emily's parents worked tirelessly to provide for her and her younger brother.

Despite the challenges they faced, Emily's parents encouraged her to pursue her love of learning. They scraped together what little money they had and sent he
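
The streaming loop above prints text as it arrives but records no timing. One way to measure time to first token and total time from the client side is sketched below. `time_stream` and `delta_texts` are hypothetical helpers written for this example. Client-side numbers include network time, so they will likely differ from the timing fields the providers report themselves.

```python
import time


def time_stream(chunks, clock=time.perf_counter):
    """Consume an iterable of text pieces and time it from the client side.

    Returns (text, seconds_to_first_piece, total_seconds).
    seconds_to_first_piece is None if no non-empty piece arrived.
    """
    start = clock()
    first = None
    parts = []
    for piece in chunks:
        if piece and first is None:
            first = clock() - start
        parts.append(piece or "")
    total = clock() - start
    return "".join(parts), first, total


def delta_texts(make_stream):
    """Lazily open a stream and yield the text of each chunk.

    make_stream is a zero-argument callable, so the request is only sent
    once iteration begins, i.e. after the timer in time_stream has started.
    """
    for chunk in make_stream():
        if chunk.choices:
            yield chunk.choices[0].delta.content or ""


# Offline demo with a fake stream of text pieces
text, ttft, total = time_stream(iter(["Once ", "upon ", "a time"]))
print(f"{len(text)} chars, first piece after {ttft:.6f}s, total {total:.6f}s")

# With a real client (requires an API key and network access):
# text, ttft, total = time_stream(delta_texts(lambda: client.chat.completions.create(
#     model="llama-3.3-70b-versatile",
#     messages=[{"role": "user", "content": "Tell me a long story"}],
#     stream=True,
# )))
```

Passing the request as a callable keeps the connection setup inside the timed region. Timing a stream that was already opened would leave that part out.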

### Cerebras

In [None]:
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",
    api_key=os.getenv("CEREBRAS_API_KEY"))

completion = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {
            "role": "user",
            "content": "Tell me a long story",
        }
    ],
    stream=True,
)

for chunk in completion:
    print(chunk.choices[0].delta.content or "", end="")

Once upon a time, in a small village nestled in the rolling hills of the countryside, there lived a young girl named Emily. She was a bright and curious child, with a mop of curly brown hair and a smile that could light up the darkest of rooms. Emily lived with her parents and younger brother, Jack, in a cozy little cottage on the outskirts of the village.

As a child, Emily was always fascinated by the stories her grandmother used to tell her. Her grandmother, who lived in a small cottage just across the village green, would regale Emily with tales of magic and adventure, of far-off lands and brave heroes. Emily's imagination would run wild as she listened to these stories, and she would spend hours dreaming of the day when she could have her own adventures.

As Emily grew older, her love of stories only deepened. She would spend hours reading books from the village library, devouring tales of fantasy and romance, of mystery and suspense. She was especially fond of the stories of the 

### Sambanova

In [None]:
from openai import OpenAI
import os

api_key = os.environ.get("SAMBANOVA_API_KEY")

client = OpenAI(
    base_url="https://api.sambanova.ai/v1/",
    api_key=api_key)

completion = client.chat.completions.create(
    model="Meta-Llama-3.3-70B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Tell me a long story",
        }
    ],
    stream=True,
)

for chunk in completion:
    print(chunk.choices[0].delta.content or "", end="")

Once upon a time, in a small village nestled in the rolling hills of the countryside, there lived a young girl named Sophia. Sophia was a curious and adventurous child, with a mop of curly brown hair and a smile that could light up the darkest of rooms. She lived with her parents and younger brother in a small cottage on the edge of the village, surrounded by fields of golden wheat and green pastures where cows grazed lazily in the sun.

Sophia loved to explore the countryside, and spent hours wandering through the fields and woods, discovering hidden streams and secret glades. She was especially fond of the old oak tree that stood at the edge of the village, its gnarled branches twisted and tangled in a way that seemed almost magical. According to local legend, the oak tree was over a thousand years old, and had been a meeting place for the ancient druids who had once inhabited the land.

One day, while out on a walk, Sophia stumbled upon a small, mysterious shop tucked away in a quie

### Speed and Latency Comparison

In [None]:
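import asyncio
import os

from openai import AsyncOpenAI


async def run_llm(name, base_url, api_key, model):
    """Send one non-streaming request and return the provider's self-reported metrics."""
    try:
        # The async client lets asyncio.gather() issue the requests concurrently
        client = AsyncOpenAI(
            base_url=base_url,
            api_key=api_key
        )

        completion = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "Tell me a long story"}],
        )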

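        # Extract metrics based on provider: each one reports timing in its own
        # non-standard response fields. For Cerebras and Groq, tokens/sec counts
        # prompt + completion tokens, and prompt_time stands in for time to first token.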
        if name == "Cerebras":
            time_info = completion.time_info
            metrics = {
                'name': name,
                'tokens_per_sec': completion.usage.total_tokens / time_info['total_time'],
                'time_to_first_token': time_info['prompt_time'],
                'total_latency': time_info['total_time'],
                'completion_tokens': completion.usage.completion_tokens,
                'total_tokens': completion.usage.total_tokens
            }
        elif name == "Groq":
            metrics = {
                'name': name,
                'tokens_per_sec': completion.usage.total_tokens / completion.usage.total_time,
                'time_to_first_token': completion.usage.prompt_time,
                'total_latency': completion.usage.total_time,
                'completion_tokens': completion.usage.completion_tokens,
                'total_tokens': completion.usage.total_tokens
            }
        else:  # Sambanova
            metrics = {
                'name': name,
                'tokens_per_sec': completion.usage.total_tokens_per_sec,
                'time_to_first_token': completion.usage.time_to_first_token,
                'total_latency': completion.usage.total_latency,
                'completion_tokens': completion.usage.completion_tokens,
                'total_tokens': completion.usage.total_tokens
            }
        return metrics
    except Exception as e:
        print(f"Error with {name}: {str(e)}")
        return None

async def compare_llms():
    tasks = [
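        # Groq is benchmarked with its speculative-decoding variant here, not the
        # "llama-3.3-70b-versatile" model used in the streaming cell above
        run_llm("Groq", "https://api.groq.com/openai/v1",
                os.getenv("GROQ_API_KEY"), "llama-3.3-70b-specdec"),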
        run_llm("Cerebras", "https://api.cerebras.ai/v1",
                os.getenv("CEREBRAS_API_KEY"), "llama-3.3-70b"),
        run_llm("Sambanova", "https://api.sambanova.ai/v1/",
                os.getenv("SAMBANOVA_API_KEY"), "Meta-Llama-3.3-70B-Instruct")
    ]

    results = [r for r in await asyncio.gather(*tasks) if r is not None]

    print("\nLLM Performance Comparison:")
    print("-" * 100)
    print(f"{'Provider':<12} {'Tokens/sec':<12} {'First Token':<12} {'Latency':<12} {'Completion':<12} {'Total'}")
    print("-" * 100)

    for result in sorted(results, key=lambda x: x['tokens_per_sec'], reverse=True):
        # Format each value first, then pad to the header's column width
        tokens_per_sec = f"{result['tokens_per_sec']:,.1f}/s"
        first_token = f"{result['time_to_first_token']:,.3f}s"
        latency = f"{result['total_latency']:,.2f}s"
        print(f"{result['name']:<12} "
              f"{tokens_per_sec:<12} "
              f"{first_token:<12} "
              f"{latency:<12} "
              f"{result['completion_tokens']:<12,} "
              f"{result['total_tokens']:,}")

# Run the comparison
await compare_llms()


LLM Performance Comparison:
----------------------------------------------------------------------------------------------------
Provider     Tokens/sec   First Token  Latency      Completion   Total
----------------------------------------------------------------------------------------------------
Groq         1,734.4/s    0.005s       0.72s        1,204        1,243
Cerebras     1,678.9/s    0.005s       1.01s        1,662        1,702
Sambanova    323.5/s      0.212s       4.33s        1,360        1,400
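
In the comparison code, the Tokens/sec figure for Groq and Cerebras divides total tokens (prompt plus completion) by total time, while the Sambanova figure is taken as reported by the provider. A rough way to put all three on one basis is to divide completion tokens by total latency. The sketch below does this with the rounded values from the table above, so the results are approximate. `output_tokens_per_sec` is a hypothetical helper written for this example.

```python
# (completion_tokens, total_latency_seconds) pairs copied from the table above
RESULTS = {
    "Groq": (1204, 0.72),
    "Cerebras": (1662, 1.01),
    "Sambanova": (1360, 4.33),
}


def output_tokens_per_sec(completion_tokens, total_latency):
    """Completion tokens generated per second of total request time."""
    if total_latency <= 0:
        raise ValueError("total_latency must be positive")
    return completion_tokens / total_latency


for name, (tokens, latency) in RESULTS.items():
    print(f"{name:<12} {output_tokens_per_sec(tokens, latency):,.1f} tokens/s")
```

Because total latency also covers prompt processing and any queueing, this measure understates pure generation speed. It is still a single definition applied to every provider. These figures come from one request per provider, so repeated runs would be needed before drawing firm conclusions.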
