# Llama Stack Core Concepts: Inference Basics

Welcome to the foundational steps of working with Llama Stack! In this section, we'll get hands-on with the core capability of Llama Stack: **inference**. This involves sending requests to a Large Language Model (LLM) and receiving responses.

By the end of this module, you will be equipped to:

* **Initialize the Llama Stack Client:** Connect your Python code to the Llama Stack server you set up previously.
* **Perform Basic Chat Completions:** Send a simple prompt to the LLM and process the standard text response.
* **Extract Structured Data:** Request and parse responses in a defined format (like JSON) for easier programmatic use.

Let's start interacting with our Llama Stack server!

### Install Required Python Libraries

Before we write any code interacting with Llama Stack, we need to ensure the necessary Python libraries are installed in our environment.

The cell below uses `pip` to install the `llama-stack-client` library, which lets our Python code communicate with the Llama Stack server, and the `python-dotenv` library, which provides the `dotenv` module for loading environment variables from a `.env` file (we'll hardcode the values in this lab for simplicity).

**Execute the following code cell** to install the prerequisites, using one of the following methods:

* Press <kbd>Shift</kbd> + <kbd>Enter</kbd> to run the current cell and move on to the next cell.

* Press <kbd>Ctrl</kbd> + <kbd>Enter</kbd> to run the cell and stay in the same cell.

* Click the ▶️ "Run" button in the toolbar above; it looks like a play button.

In [6]:
!pip install -U llama-stack-client==0.2.7 python-dotenv
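
If you want to confirm that the installation worked before moving on, a small check like the one below may help. It is only a sketch: `installed_version` is a hypothetical helper name, and it relies solely on the standard library's `importlib.metadata`.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Use the distribution name as published on PyPI, not the import name.
version = installed_version("llama-stack-client")
print(f"llama-stack-client: {version or 'not installed'}")
```

If the package is reported as not installed, re-run the `pip` cell above and restart the notebook kernel.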



### Configure Llama Stack Server and Model Endpoints

Next, we need to tell our Python script *where* the Llama Stack server is running and *which* specific Language Model we want to use for inference.

The code cell below sets two environment variables, `LLAMA_STACK_SERVER` and `LLAMA_STACK_MODEL`, which hold the address of your Llama Stack server and the identifier of the model you wish to interact with.

In [7]:
# Load environment variables from a .env file, if one exists
import os
from dotenv import load_dotenv
load_dotenv()

# Hardcoded for this lab; these assignments override any values loaded from .env
os.environ['LLAMA_STACK_SERVER'] = 'http://localhost:8321'
os.environ['LLAMA_STACK_MODEL'] = 'meta-llama/Llama-3.2-3B-Instruct'

> **Note:**
> In a production or standard development environment, it's best practice to load sensitive information like server addresses and API keys from environment variables (often stored in a `.env` file) rather than hardcoding them directly in your script. For simplicity in this lab, we are hardcoding these values to make the steps immediately visible and clear.
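
As a rough illustration of the approach described in the note, the sketch below reads both settings from the environment and falls back to this lab's values only when a variable is unset. `load_settings` and the `DEFAULT_*` names are hypothetical helpers introduced for this example, not part of Llama Stack.

```python
import os

# Defaults mirror the values used in this lab; adjust them for your setup.
DEFAULT_SERVER = "http://localhost:8321"
DEFAULT_MODEL = "meta-llama/Llama-3.2-3B-Instruct"


def load_settings(env=None):
    """Read server and model settings from the environment, with lab defaults."""
    env = os.environ if env is None else env
    return {
        "server": env.get("LLAMA_STACK_SERVER", DEFAULT_SERVER),
        "model": env.get("LLAMA_STACK_MODEL", DEFAULT_MODEL),
    }


settings = load_settings()
print(settings["server"], settings["model"])
```

With this pattern, values set in the shell or in a `.env` file take precedence, and the script contains only non-sensitive fallbacks.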

### Initializing the Llama Stack Client

The `LlamaStackClient` is your primary interface for communicating with the Llama Stack server. It allows your Python code to access all the capabilities hosted by the server, such as running inference, managing models, and utilizing tools.

In this step, we will **initialize the client instance**, providing it with the `base_url` (the server address we defined earlier), and then list the models registered on the server to confirm that the connection works.

This client object is what we will use throughout the lab to send requests and interact with the Llama Stack services. Notice how simple it is to connect – this abstraction is key to Llama Stack's flexibility, allowing you to switch backend infrastructure (like the specific Llama Stack Server instance) with minimal code changes.

In [8]:
from llama_stack_client import LlamaStackClient

# Read the settings defined in the previous cell
LLAMA_STACK_SERVER = os.getenv("LLAMA_STACK_SERVER")
LLAMA_STACK_MODEL = os.getenv("LLAMA_STACK_MODEL")

client = LlamaStackClient(base_url=LLAMA_STACK_SERVER)

# List available models
models = client.models.list()

# Print table header
print("--- Available models: ---")

print("Model Identifier                         Provider ID     Provider Resource ID")

for m in models:
    print(f"{m.identifier:40} {m.provider_id:15} {m.provider_resource_id}")


--- Available models: ---
Model Identifier                         Provider ID     Provider Resource ID
meta-llama/Llama-3.2-3B-Instruct         ollama          llama3.2:3b-instruct-fp16
all-MiniLM-L6-v2                         ollama          all-minilm:latest
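
Before sending any requests, it can be useful to check that the model you configured actually appears in this list. The sketch below shows one way to do that. `find_model` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that mimic the `identifier` and `provider_id` fields printed above so the example runs without a server.

```python
from types import SimpleNamespace


def find_model(models, identifier):
    """Return the first model whose identifier matches, or None."""
    return next((m for m in models if m.identifier == identifier), None)


# Stand-in objects that mimic the fields printed above; in the notebook,
# pass the result of client.models.list() instead.
sample_models = [
    SimpleNamespace(identifier="meta-llama/Llama-3.2-3B-Instruct", provider_id="ollama"),
    SimpleNamespace(identifier="all-MiniLM-L6-v2", provider_id="ollama"),
]

match = find_model(sample_models, "meta-llama/Llama-3.2-3B-Instruct")
print("found" if match else "missing")
```

In the notebook, something like `find_model(models, LLAMA_STACK_MODEL)` would let you stop early with a clear message if the model is not registered.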


### Performing a Basic Chat Completion

With the `LlamaStackClient` initialized, we can now send our first request to the LLM hosted by the server. A common type of interaction is **chat completion**, where we provide a series of messages (representing a conversation) and ask the model to generate the next response.

The code cell below demonstrates a basic chat completion request. We provide a simple system message and a user query, then call the `client.inference.chat_completion` method.

If you've worked with other LLM frameworks, you'll find this syntax familiar. Llama Stack aims to provide a consistent interface that makes it easy to integrate different models and components. Execute the cell to see the LLM's response!

In [4]:
response = client.inference.chat_completion(
    model_id=LLAMA_STACK_MODEL,
    messages=[
        {"role": "system", "content": "You're a helpful assistant."},
        {
            "role": "user",
            "content": "What is the top speed of a leopard?",
        },
    ],
    # temperature=0.0, 
)
print(response.completion_message.content)

The top speed of a leopard can vary depending on several factors, such as the individual animal's age, sex, and motivation (e.g., chasing prey or escaping from predators). However, according to various sources, including the National Geographic and the World Wildlife Fund, the average running speed of a leopard is around 40-50 km/h (25-31 mph).

However, some leopards have been clocked at speeds of up to 60 km/h (37 mph) over short distances, such as when chasing prey or escaping from danger. In fact, one study found that a male leopard can reach speeds of up to 70 km/h (43.5 mph) for brief periods.

It's worth noting that leopards are agile and powerful climbers, and they often use their speed and agility to navigate through dense forests and rocky terrain with ease.
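
If you plan to try several prompts, a small helper for assembling the message list can reduce repetition. This is only a sketch: `build_messages` is a hypothetical function, and it produces the same system and user message structure used in the cell above.

```python
def build_messages(user_query, system_prompt="You're a helpful assistant."):
    """Build the system + user message list used by the chat completion call."""
    if not user_query.strip():
        raise ValueError("user_query must not be empty")
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_query},
    ]


messages = build_messages("What is the top speed of a cheetah?")
for message in messages:
    print(f"{message['role']}: {message['content']}")
```

The returned list can be passed as the `messages` argument of `client.inference.chat_completion` in place of the inline list.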


### Requesting Structured Data

Free-form text responses work well for conversational AI, but applications often need the LLM to return information in a predictable, structured format that code can parse and use programmatically.

Llama Stack lets you request a JSON response that conforms to a schema you define. In this step, we will define a simple data model (`AnimalSpeed`, using Pydantic) and ask the LLM to return the animal's speed in a JSON structure matching that model.

This is a powerful technique for building reliable integrations with LLMs. **Execute the code cell** and try modifying the user query with different animals to see how the structured output changes.

In [5]:
from pydantic import BaseModel
import json

class AnimalSpeed(BaseModel):
    speed: int
    animal: str
    metric_type: str

response = client.inference.chat_completion(
    model_id=LLAMA_STACK_MODEL,
    messages=[
        {"role": "system", "content": "You're a helpful assistant."},
        {
            "role": "user",
            "content": "What is the top speed of a leopard?",            
        },
    ],
    stream=False,
    response_format={
        "type": "json_schema",
        "json_schema": AnimalSpeed.model_json_schema(),
    },
)


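# The model returns JSON as text: parse it, then validate it against AnimalSpeed
try:
    response_data = json.loads(response.completion_message.content)
    animal = AnimalSpeed(**response_data)
    print("-------")
    print("Speed: ", animal.speed)
    print("Animal: ", animal.animal)
    print("metric_type: ", animal.metric_type)
    print("-------")
except (json.JSONDecodeError, ValueError) as e:
    # Pydantic validation errors subclass ValueError, so schema mismatches land here too
    print(f"Invalid format: {e}")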


-------
Speed:  80
Animal:  leopard
metric_type:  km/h
-------
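
To see how the parse-and-validate step behaves without calling the server, you can experiment with canned strings. The sketch below is a dependency-free stand-in for the Pydantic check above, not a replacement for it. `parse_animal_speed` and `EXPECTED_FIELDS` are hypothetical names, and the field list simply mirrors the `AnimalSpeed` model.

```python
import json

# Field names and types mirror the AnimalSpeed model defined above.
EXPECTED_FIELDS = {"speed": int, "animal": str, "metric_type": str}


def parse_animal_speed(raw_text):
    """Parse raw model output; return a dict, or None if it does not fit."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for field, expected_type in EXPECTED_FIELDS.items():
        value = data.get(field)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            return None
    return data


good = '{"speed": 80, "animal": "leopard", "metric_type": "km/h"}'
bad = "The leopard runs at about 80 km/h."
print(parse_animal_speed(good))
print(parse_animal_speed(bad))
```

The first call returns the parsed dictionary and the second returns `None`, which is the kind of outcome the `try`/`except` block in the cell above guards against.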


## Module Summary: Llama Stack Inference Basics

Great job completing this introductory module! You've successfully taken your first steps in interacting with the Llama Stack server for basic inference tasks.

In this section, you've learned how to:

* **Establish Connection:** Initialize the `LlamaStackClient` to connect your Python environment to the Llama Stack server.
* **Generate Responses:** Send chat completion requests to the LLM and receive free-form text outputs.
* **Extract Key Data:** Use a schema-based response format to have the LLM return information in a parseable format, like JSON.

You now have a fundamental understanding of how to send prompts to an LLM via Llama Stack and process the responses. This forms the bedrock for building more complex applications and agents, which we will explore in the following modules!