# Quickstart - Inference

This notebook covers basic client usage, including:
- List available models.
- Use the SambaNova inference adaptor to interact with cloud-based LLM chat models.
- Implement a chat conversation loop using the SambaNova inference adaptor.

All inference calls go through chat completions in the llama-stack Python SDK.

Please refer to the [llama-stack quickstart documentation](https://llama-stack.readthedocs.io/en/latest/getting_started/index.html) for further details.

In [1]:
# Imports
import os
import sys

from llama_stack_client import LlamaStackClient

## Setup

In [2]:
# Create HTTP client
LLAMA_STACK_PORT = 8321
client = LlamaStackClient(base_url=f"http://localhost:{LLAMA_STACK_PORT}")
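
When the server does not run on the default local port, the base URL can be assembled from the environment instead of being hard-coded. The following is a minimal sketch: the variable names `LLAMA_STACK_HOST` and `LLAMA_STACK_PORT` and the helper `build_base_url` are assumptions made for illustration, not names the SDK reads on its own.

```python
import os


def build_base_url(env=None):
    """Assemble the server URL from (hypothetical) environment variables."""
    # Fall back to the real process environment when no mapping is passed in.
    env = os.environ if env is None else env
    host = env.get("LLAMA_STACK_HOST", "localhost")
    # int() rejects a malformed port early instead of failing at request time.
    port = int(env.get("LLAMA_STACK_PORT", "8321"))
    return f"http://{host}:{port}"


# With an empty mapping the defaults from this notebook are used.
print(build_base_url({}))
```

The returned string can then be passed as `base_url` when creating the client.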

In [3]:
# List available models
models = client.models.list()
print("--- Available models: ---")
for m in models:
    print(f"- {m.identifier}")
print()

--- Available models: ---
- sambanova/Meta-Llama-3.1-8B-Instruct
- meta-llama/Llama-3.1-8B-Instruct
- sambanova/Meta-Llama-3.1-405B-Instruct
- meta-llama/Llama-3.1-405B-Instruct-FP8
- sambanova/Meta-Llama-3.2-1B-Instruct
- meta-llama/Llama-3.2-1B-Instruct
- sambanova/Meta-Llama-3.2-3B-Instruct
- meta-llama/Llama-3.2-3B-Instruct
- sambanova/Meta-Llama-3.3-70B-Instruct
- meta-llama/Llama-3.3-70B-Instruct
- sambanova/Llama-3.2-11B-Vision-Instruct
- meta-llama/Llama-3.2-11B-Vision-Instruct
- sambanova/Llama-3.2-90B-Vision-Instruct
- meta-llama/Llama-3.2-90B-Vision-Instruct
- sambanova/Llama-4-Scout-17B-16E-Instruct
- meta-llama/Llama-4-Scout-17B-16E-Instruct
- sambanova/Llama-4-Maverick-17B-128E-Instruct
- meta-llama/Llama-4-Maverick-17B-128E-Instruct
- sambanova/Meta-Llama-Guard-3-8B
- meta-llama/Llama-Guard-3-8B
- all-MiniLM-L6-v2
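
Most identifiers in the list above carry a prefix before the `/`, such as `sambanova` or `meta-llama`. As a rough sketch, the identifiers can be grouped by that prefix to see which names are available under each one. The helper `group_by_prefix` and the short sample list are illustrative assumptions; in the notebook you would pass `[m.identifier for m in models]` instead.

```python
from collections import defaultdict


def group_by_prefix(identifiers):
    """Group model identifiers by the text before the first '/'."""
    groups = defaultdict(list)
    for identifier in identifiers:
        prefix, sep, _ = identifier.partition("/")
        # Identifiers without a '/' are collected under a placeholder key.
        groups[prefix if sep else "(no prefix)"].append(identifier)
    return dict(groups)


# A few identifiers copied from the output above.
sample = [
    "sambanova/Meta-Llama-3.3-70B-Instruct",
    "meta-llama/Llama-3.3-70B-Instruct",
    "sambanova/Meta-Llama-Guard-3-8B",
    "all-MiniLM-L6-v2",
]

for prefix, names in group_by_prefix(sample).items():
    print(f"{prefix}: {len(names)} model(s)")
```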



In [4]:
# Choose an inference model from the previous list
model = "sambanova/Meta-Llama-3.3-70B-Instruct"

## Create a Chat Completion Request
Pass the conversation context to `chat_completion` as a list of messages. Each message needs a `role` (here `system` or `user`) and a `content` string:

In [5]:
response = client.inference.chat_completion(
    messages=[
        {"role": "system", "content": "You are a friendly assistant."},
        {"role": "user", "content": "Write a two-sentence poem about llama."},
    ],
    model_id=model,
)

print(response.completion_message.content)


With gentle eyes and a soft, fuzzy face, the llama roams the Andes with a peaceful, gentle pace. Its long neck bends as it grazes with glee, a symbol of serenity in a world wild and free.
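
The message list used above follows a fixed shape: an optional system message first, then the user message. A small sketch of a helper that builds it is shown below; `build_messages` is a hypothetical name introduced here, not part of the SDK.

```python
def build_messages(user_prompt, system_prompt=None):
    """Return a message list in the role/content shape used in this notebook."""
    messages = []
    if system_prompt:
        # The system message, when present, goes first.
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages


messages = build_messages(
    "Write a two-sentence poem about llama.",
    system_prompt="You are a friendly assistant.",
)
print(messages)
```

The result can be passed as the `messages` argument of `client.inference.chat_completion`.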


## Conversation Loop
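To let users send multiple messages in one session, wrap the request in a loop like the one below. The loop ends when the user types `exit`, `quit`, or `bye`. It is defined as an `async` function so that it can be awaited from a notebook cell; the client calls inside it are synchronous. Each turn is sent on its own, so the model does not see earlier messages.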

In [6]:
import asyncio
from llama_stack_client import LlamaStackClient
from termcolor import cprint

async def chat_loop():
    while True:
        user_input = input("User> ")
        cprint(f"> User: {user_input}", "green")
        if user_input.lower() in ["exit", "quit", "bye"]:
            cprint("Ending conversation. Goodbye!", "yellow")
            break

        message = {"role": "user", "content": user_input}
        response = client.inference.chat_completion(messages=[message], model_id=model)
        cprint(f"> Response: {response.completion_message.content}", "cyan")


# Run the chat loop in a Jupyter Notebook cell using await
await chat_loop()
# To run it in a Python file, use this line instead
# asyncio.run(chat_loop())

> User: Hi, Tell me a joke
> Response: Here's one:

What do you call a fake noodle?

An impasta!

Hope that made you laugh! Do you want to hear another one?
> User: what is the capital of Austria
> Response: The capital of Austria is Vienna (German: Wien).
> User: quit
Ending conversation. Goodbye!
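
The loop logic can also be exercised without a running server by injecting the input source and the responder. This is only a sketch: `run_chat`, `read_input`, and `respond` are hypothetical names, and the echo responder stands in for the real `chat_completion` call. Unlike the cell above, this version strips surrounding whitespace before checking for an exit command.

```python
EXIT_COMMANDS = {"exit", "quit", "bye"}


def run_chat(read_input, respond):
    """Run the loop until an exit command; return the (prompt, reply) pairs."""
    transcript = []
    while True:
        text = read_input().strip()
        # Case-insensitive exit check, as in the notebook cell.
        if text.lower() in EXIT_COMMANDS:
            break
        transcript.append((text, respond(text)))
    return transcript


# Scripted input replaces input(); the lambda replaces the model call.
scripted = iter(["Hi", "  QUIT  "])
result = run_chat(lambda: next(scripted), lambda text: f"echo: {text}")
print(result)
```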


## Conversation History
Maintaining a conversation history allows the model to retain context from previous interactions. Use a list to accumulate the user and assistant messages, and send the whole list with every request so that each turn builds on the earlier ones.

In [7]:
async def chat_loop():
    conversation_history = []
    while True:
        user_input = input("User> ")
        cprint(f"> User: {user_input}", "green")
        if user_input.lower() in ["exit", "quit", "bye"]:
            cprint("Ending conversation. Goodbye!", "yellow")
            break

        user_message = {"role": "user", "content": user_input}
        conversation_history.append(user_message)

        response = client.inference.chat_completion(
            messages=conversation_history,
            model_id=model,
        )
        cprint(f"> Response: {response.completion_message.content}", "cyan")

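        # Append the model's reply so it is part of the context on the next turn.
        # Replies are stored under the "assistant" role; stop_reason is copied
        # from the response because assistant messages are expected to carry it.
        assistant_message = {
            "role": "assistant",
            "content": response.completion_message.content,
            "stop_reason": response.completion_message.stop_reason,
        }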
        conversation_history.append(assistant_message)


# Use `await` in the Jupyter Notebook cell to call the function
await chat_loop()
# To run it in a Python file, use this line instead
# asyncio.run(chat_loop())

> User: Hi, I want to learn spanish
> Response: ¡Hola! Learning Spanish can be a rewarding and enriching experience. With over 460 million native speakers, Spanish is the second most widely spoken language in the world, and it's an official language in 20 countries.

To get started, let's break down the basics:

1. **Alphabet**: Spanish uses the same alphabet as English, with a few additional letters like ñ, ü, and ll.
2. **Pronunciation**: Spanish pronunciation is generally phonetic, meaning that words are pronounced as they're written. Pay attention to accents and diacritical marks, as they can change the pronunciation of words.
3. **Grammar**: Spanish grammar is relatively similar to English grammar, with a few key differences. For example, Spanish has two forms of the verb "to be" (ser and estar), and it uses verb conjugations to indicate tense and mood.

Here are some beginner-friendly resources to help you learn Spanish:

* **Duolingo**: A popular language-learning 
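
To make the shape of the accumulated history concrete, the sketch below runs two turns against a stub instead of the model. The names `chat_turn` and `stub_reply` are hypothetical, and the stub only counts user messages; with the real API, assistant messages may need additional fields beyond `role` and `content`.

```python
def chat_turn(history, user_text, reply_fn):
    """Record the user message, get a reply for the full history, record it too."""
    history.append({"role": "user", "content": user_text})
    reply = reply_fn(history)
    history.append({"role": "assistant", "content": reply})
    return reply


def stub_reply(history):
    # Stand-in for the model: reports how many user messages it has seen.
    user_turns = sum(1 for m in history if m["role"] == "user")
    return f"I have seen {user_turns} user message(s)."


history = []
chat_turn(history, "Hi, I want to learn Spanish", stub_reply)
chat_turn(history, "Where should I start?", stub_reply)

for message in history:
    print(f'{message["role"]}: {message["content"]}')
```

Because the whole list is passed on every turn, the second reply reflects both user messages, which is the continuity the history is meant to provide.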

## Streaming Responses
Llama Stack offers a `stream` parameter in the `chat_completion` function. When it is set, partial responses are returned progressively as they are generated, so the user sees output immediately instead of waiting for the entire response.

In [8]:
from llama_stack_client.lib.inference.event_logger import EventLogger

async def run_main(stream: bool = True):
    message = {"role": "user", "content": "Please write me a 3 sentence poem about llamas."}
    cprint(f'User> {message["content"]}', "green")

    response = client.inference.chat_completion(
        messages=[message],
        model_id=model,
        stream=stream,
    )

    if not stream:
        cprint(f"> Response: {response.completion_message.content}", "cyan")
    else:
        for log in EventLogger().log(response):
            log.print()


# In a Jupyter Notebook cell, use `await` to call the function
await run_main()
# To run it in a Python file, use this line instead
# asyncio.run(run_main())

User> Please write me a 3 sentence poem about llamas.
Assistant> Here is a 3 sentence poem about llamas:
Llamas roam the Andean highlands with gentle ease, their soft fur a warm and fuzzy breeze. With ears so tall and eyes so bright, they watch the world with quiet delight. In their tranquil presence, all worries cease, and peace descends like a soft, llama-filled release.
