# Quickstart - Inference

This notebook covers basic client usage:
- Listing the available models.
- Using the SambaNova inference adaptor to interact with cloud-based LLM chat models.
- Implementing a chat loop with the SambaNova inference adaptor.

All examples run inference through chat completions using the llama-stack Python SDK.

In [1]:
# Imports
import os
import sys

from llama_stack_client import LlamaStackClient

## Setup

In [2]:
# Create HTTP client
client = LlamaStackClient(base_url=f"http://localhost:{os.environ['LLAMA_STACK_PORT']}")
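
The cell above raises a `KeyError` when `LLAMA_STACK_PORT` is not set. A minimal sketch of a more forgiving setup follows. `build_base_url` is a hypothetical helper, and the fallback port `8321` is an assumption; substitute the port your server actually listens on.

```python
import os


def build_base_url(default_port: str = "8321") -> str:
    """Return the local server URL, reading LLAMA_STACK_PORT if it is set."""
    # default_port is an assumed fallback, not a value this notebook defines
    port = os.environ.get("LLAMA_STACK_PORT", default_port)
    return f"http://localhost:{port}"


print(build_base_url())
```

The result could then be passed as `LlamaStackClient(base_url=build_base_url())`.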

In [3]:
# List available models
models = client.models.list()
print("--- Available models: ---")
for m in models:
    print(f"- {m.identifier}")
print()

--- Available models: ---
- sambanova/Meta-Llama-3.1-8B-Instruct
- meta-llama/Llama-3.1-8B-Instruct
- sambanova/Meta-Llama-3.1-70B-Instruct
- meta-llama/Llama-3.1-70B-Instruct
- sambanova/Meta-Llama-3.1-405B-Instruct
- meta-llama/Llama-3.1-405B-Instruct-FP8
- sambanova/Meta-Llama-3.2-1B-Instruct
- meta-llama/Llama-3.2-1B-Instruct
- sambanova/Meta-Llama-3.2-3B-Instruct
- meta-llama/Llama-3.2-3B-Instruct
- sambanova/Meta-Llama-3.3-70B-Instruct
- meta-llama/Llama-3.3-70B-Instruct
- sambanova/Llama-3.2-11B-Vision-Instruct
- meta-llama/Llama-3.2-11B-Vision-Instruct
- sambanova/Llama-3.2-90B-Vision-Instruct
- meta-llama/Llama-3.2-90B-Vision-Instruct
- sambanova/Meta-Llama-Guard-3-8B
- meta-llama/Llama-Guard-3-8B
- all-MiniLM-L6-v2



In [4]:
# Choose an inference model from the previous list
model = "sambanova/Meta-Llama-3.3-70B-Instruct"
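
Hard-coding the identifier works when the listing is known in advance. As a sketch of selecting one programmatically instead, the hypothetical helper `pick_model` below filters plain identifier strings by prefix and substring; the sample list is a subset of the output above.

```python
def pick_model(identifiers, prefix="sambanova/", contains="3.3-70B"):
    """Return the first identifier matching the prefix and substring, else None."""
    for identifier in identifiers:
        if identifier.startswith(prefix) and contains in identifier:
            return identifier
    return None


# A few identifiers copied from the model listing above
sample = [
    "sambanova/Meta-Llama-3.1-8B-Instruct",
    "sambanova/Meta-Llama-3.3-70B-Instruct",
    "meta-llama/Llama-3.3-70B-Instruct",
]
print(pick_model(sample))
```

With a live client, the same helper could be called as `pick_model(m.identifier for m in models)`.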

## Create a Chat Completion Request
Use the `chat_completion` function to define the conversation context. Each message you include needs a role and content:

In [5]:
response = client.inference.chat_completion(
    messages=[
        {"role": "system", "content": "You are a friendly assistant."},
        {"role": "user", "content": "Write a two-sentence poem about llama."},
    ],
    model_id=model,
)

print(response.completion_message.content)


With gentle eyes and a soft, fuzzy face, the llama roams the Andes with a peaceful, gentle pace. Its long neck bends as it grazes with glee, a symbol of serenity in a world wild and free.
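
Since every message is a dict with a role and content, a small builder can catch typos before a request is sent. This is a sketch only: `make_message` is a hypothetical helper, and the role set is limited to the roles used in this notebook, which is an assumption rather than the full set the API accepts.

```python
# Roles used in this notebook; an assumed subset, not the API's full list
VALID_ROLES = {"system", "user", "assistant"}


def make_message(role: str, content: str) -> dict:
    """Build a message dict, rejecting unknown roles and empty content."""
    if role not in VALID_ROLES:
        raise ValueError(f"unknown role: {role!r}")
    if not content.strip():
        raise ValueError("content must not be empty")
    return {"role": role, "content": content}


messages = [
    make_message("system", "You are a friendly assistant."),
    make_message("user", "Write a two-sentence poem about llama."),
]
print(messages)
```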


## Conversation Loop
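To let users send multiple messages in one session, wrap the request in a loop. This example defines the loop as an `async` function so it can be awaited in a notebook cell (the client call itself is synchronous), and the loop ends when the user types `exit`, `quit`, or `bye`. Each turn is sent on its own, so the model does not see earlier messages.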

In [6]:
import asyncio
from llama_stack_client import LlamaStackClient
from termcolor import cprint

async def chat_loop():
    while True:
        user_input = input("User> ")
        if user_input.lower() in ["exit", "quit", "bye"]:
            cprint("Ending conversation. Goodbye!", "yellow")
            break

        message = {"role": "user", "content": user_input}
        response = client.inference.chat_completion(messages=[message], model_id=model)
        cprint(f"> Response: {response.completion_message.content}", "cyan")


# Run the chat loop in a Jupyter Notebook cell using await
await chat_loop()
# To run it in a python file, use this line instead
# asyncio.run(chat_loop())

> Response: I can be used in a variety of ways, from helping you plan a vacation to creating art. I'm here to assist you in finding the help or information you need. My strengths include answering questions, generating text and images and even just chatting with you.
Ending conversation. Goodbye!

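The exit check in the loop compares the lowercased input directly, so an input such as `"quit "` with a trailing space keeps the loop running. A small sketch of a more tolerant check follows; `is_exit_command` is a hypothetical helper, not part of the SDK.

```python
EXIT_COMMANDS = {"exit", "quit", "bye"}


def is_exit_command(user_input: str) -> bool:
    """Return True if the input, ignoring case and surrounding spaces, ends the chat."""
    return user_input.strip().lower() in EXIT_COMMANDS


print(is_exit_command("  Quit "))
print(is_exit_command("goodbye"))
```

Inside the loop, the condition would become `if is_exit_command(user_input):`.
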

## Conversation History
Maintaining a conversation history allows the model to retain context from previous interactions. Use a list to accumulate messages, enabling continuity throughout the chat session.

In [7]:
async def chat_loop():
    conversation_history = []
    while True:
        user_input = input("User> ")
        if user_input.lower() in ["exit", "quit", "bye"]:
            cprint("Ending conversation. Goodbye!", "yellow")
            break

        user_message = {"role": "user", "content": user_input}
        conversation_history.append(user_message)

        response = client.inference.chat_completion(
            messages=conversation_history,
            model_id=model,
        )
        cprint(f"> Response: {response.completion_message.content}", "cyan")

        # Append the assistant's reply so the next turn includes it as context
        assistant_message = {
            "role": "assistant",
            "content": response.completion_message.content,
            # Some server versions require `stop_reason` on assistant messages
            "stop_reason": response.completion_message.stop_reason,
        }
        conversation_history.append(assistant_message)


# Use `await` in the Jupyter Notebook cell to call the function
await chat_loop()
# To run it in a python file, use this line instead
# asyncio.run(chat_loop())

> Response: I can be used in a variety of ways, from helping you plan a vacation to creating art. I'm here to assist you in finding the help or information you need. My strengths include answering questions, generating text and images and even just chatting with you.
Ending conversation. Goodbye!

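Because the history list grows with every turn, long sessions can eventually exceed the model's context window. One common mitigation, sketched here under the assumption that dropping the oldest turns is acceptable, is to keep any system messages plus only the most recent messages. `trim_history` and its `max_messages` parameter are hypothetical names.

```python
def trim_history(history, max_messages=20):
    """Keep all system messages plus the most recent max_messages others."""
    system = [m for m in history if m["role"] == "system"]
    rest = [m for m in history if m["role"] != "system"]
    if max_messages <= 0:
        return system
    return system + rest[-max_messages:]


history = [
    {"role": "system", "content": "You are a friendly assistant."},
    {"role": "user", "content": "first question"},
    {"role": "assistant", "content": "first answer"},
    {"role": "user", "content": "second question"},
]
print(trim_history(history, max_messages=2))
```

In the loop above, this could be applied as `messages=trim_history(conversation_history)` when building each request.
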

## Streaming Responses
The `chat_completion` function accepts a `stream` parameter. When it is `True`, the response is returned in partial chunks as they are generated, so users see output immediately instead of waiting for the full response.

In [8]:
from llama_stack_client.lib.inference.event_logger import EventLogger

async def run_main(stream: bool = True):
    message = {"role": "user", "content": "Please write me a 3 sentence poem about llamas."}
    cprint(f'User> {message["content"]}', "green")

    response = client.inference.chat_completion(
        messages=[message],
        model_id=model,
        stream=stream,
    )

    if not stream:
        cprint(f"> Response: {response.completion_message.content}", "cyan")
    else:
        for log in EventLogger().log(response):
            log.print()


# In a Jupyter Notebook cell, use `await` to call the function
await run_main()
# To run it in a python file, use this line instead
# asyncio.run(run_main())

User> Please write me a 3 sentence poem about llamas.
Assistant> Here is a 3 sentence poem about llamas:
Llamas roam the Andean highlands with gentle ease, their soft fur a warm and fuzzy breeze. With ears so tall and eyes so bright, they watch the world with quiet delight. In their tranquil presence, all worries cease, and peace descends like a soft, llama-filled release.

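When the full text is needed after streaming finishes, the fragments can be collected while they are printed. The sketch below shows the pattern with a stand-in generator, `fake_stream`, that yields plain strings. The real stream yields event objects, and how to extract text from each one depends on the SDK version, so that step is deliberately left out. `collect_text` is a hypothetical helper.

```python
def fake_stream():
    """Stand-in for a streaming response: yields text fragments in order."""
    yield from ["Llamas roam ", "the Andean ", "highlands."]


def collect_text(chunks):
    """Print each fragment as it arrives and return the joined text."""
    parts = []
    for text in chunks:
        print(text, end="", flush=True)
        parts.append(text)
    print()
    return "".join(parts)


full_text = collect_text(fake_stream())
```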