<a href="https://colab.research.google.com/github/sternlieb/AI/blob/main/llama_stack_agent_sdlc.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Llama Stack Setup

In [None]:
!apt-get install -y bubblewrap
!pip install uv
!uv pip install llama-stack --system

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
bubblewrap is already the newest version (0.6.1-1ubuntu0.1).
0 upgraded, 0 newly installed, 0 to remove and 22 not upgraded.
Using Python 3.11.11 environment at: /usr
Audited 1 package in 114ms


In [None]:
import os
from google.colab import userdata

os.environ['FIREWORKS_API_KEY'] = userdata.get('FIREWORKS_API_KEY')
os.environ['TAVILY_SEARCH_API_KEY'] = userdata.get('TAVILY_SEARCH_API_KEY')

!llama stack build --template fireworks --image-type venv --image-name __system__

Installing dependencies in system Python environment
Using Python 3.11.11 environment at: /usr
Audited 1 package in 82ms
Installing pip dependencies
Using Python 3.11.11 environment at: /usr
Audited 32 packages in 83ms
sentence-transformers --no-deps
Using Python 3.11.11 environment at: /usr
Audited 1 package in 83ms
torch torchvision --index-url https://download.pytorch.org/whl/cpu
Using Python 3.11.11 environment at: /usr
Audited 2 packages in 68ms
Build Successful!


In [None]:

from rich.pretty import pprint

from llama_stack.distribution.library_client import LlamaStackAsLibraryClient

from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger
from llama_stack_client.lib.agents.react.agent import ReActAgent, ReActOutput
from llama_stack_client.lib.agents.react.prompts import DEFAULT_REACT_AGENT_SYSTEM_PROMPT_TEMPLATE

from llama_stack_client.types.agent_create_params import AgentConfig


## Initialize the Llama Stack Client

In [None]:
client = LlamaStackAsLibraryClient(
    # distribution name ( this one uses fireworks.ai as the underlying provider for the /inference apis )
    "fireworks",
    provider_data = {"tavily_search_api_key": os.environ['TAVILY_SEARCH_API_KEY']}
)

client.initialize()

True

Most of the code above is only needed to set up the client inside a notebook.
With the llama-stack APIs and plugin architecture, you can point the code below at any distribution (on-prem, cloud, localhost) and it should work unchanged.

## Simple Chat Completion

In [None]:
model_id = "meta-llama/Llama-3.1-8B-Instruct"
response = client.inference.chat_completion(
    model_id=model_id,
    messages=[
        {"role": "system", "content": "You are a friendly assistant."},
        {"role": "user", "content": "Write a two-sentence poem about llama."},
    ],
    stream=False,
)

print(response.completion_message.content)


A gentle llama roams the land,
With soft fur and a gentle hand.


## Agents

Chat completion is a good way to work with the LLM directly.

The real power of LLMs shows when they carry out more complex tasks, which is why Llama Stack also provides Agent APIs.

A simple agent can be characterized as having access to

- Tool Calling - Ability to call tools like search, code execution, DB lookups
- A control loop controlled by the llm
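To make the two ingredients above concrete, here is a minimal sketch of a tool-calling control loop in plain Python. It does not use Llama Stack at all: `fake_llm`, `fake_search`, `TOOLS`, and `run_agent` are hypothetical stand-ins invented for illustration, and a real agent would replace `fake_llm` with a model call.

```python
from typing import Callable, Dict, List


def fake_search(query: str) -> str:
    """Hypothetical tool: a real one would call a search API."""
    return f"(stub) top result for: {query}"


# Hypothetical tool registry: tool name -> callable
TOOLS: Dict[str, Callable[[str], str]] = {"search": fake_search}


def fake_llm(messages: List[dict]) -> dict:
    """Stand-in for a model call: requests one search, then answers."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "search", "input": messages[-1]["content"]}
    return {"answer": f"Based on the search: {messages[-1]['content']}"}


def run_agent(prompt: str, max_steps: int = 5) -> str:
    """Loop until the 'model' answers or the step budget runs out."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_steps):
        decision = fake_llm(messages)
        if "answer" in decision:
            return decision["answer"]
        # The model asked for a tool: run it and feed the result back
        result = TOOLS[decision["tool"]](decision["input"])
        messages.append({"role": "tool", "content": result})
    return "Stopped: step limit reached"


print(run_agent("llama news"))
```

The step limit is a design choice worth keeping in any such loop, since the model, not the program, decides when to stop.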

Here is a simple search agent built with the Agent API:

In [None]:

config = AgentConfig(
    model="meta-llama/Llama-3.3-70B-Instruct",
    instructions="You are a helpful assistant",
    toolgroups=["builtin::websearch"],
)

agent = Agent(client, config)

user_prompts = [
    "What are the latest news about llama-con from Meta in the last week ?",
]

session_id = agent.create_session("test-session")
for prompt in user_prompts:
    # print(f"\nUser> {prompt}")
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=session_id,
        stream=False,
    )
    pprint(response)


## Evaluating Agents

Since the agentic loop is controlled by the LLM, we need a way to evaluate the agent and confirm that it works properly for our use case.

The prompt above was fairly simple, but what about more complex prompts?

For example: "What was the first film by the director of the movie about space and time travel that was released in November 2014?"

Agent evals help you understand and quantify agent quality by checking outputs against a representative dataset.

To test an agent's ability to use web search, we will use the SimpleQA dataset.



## Defining an eval dataset

In [None]:
# Let's register a dataset with llama-stack so that we can use it for evals across different agents.

# We will use the /datasets API for this.
dataset_id = "huggingface::simpleqa"  # has 4.3k question / answer pairs
client.datasets.register(
    dataset_id=dataset_id,
    # The data is available on hugging face ( this could be local file system too )
    provider_id="huggingface",
    url={"uri": "https://huggingface.co/datasets/llamastack/evals"},
    metadata={
        "path": "llamastack/evals",
        "name": "evals__simpleqa",
        "split": "train",
    },
    dataset_schema={
        "input_query": {"type": "string"},
        "expected_answer": {"type": "string"},
        "chat_completion_input": {"type": "chat_completion_input"},
    }
)

eval_rows = client.datasetio.get_rows_paginated(
    dataset_id=dataset_id,
    rows_in_page=20,
)

for row in eval_rows.rows[:1]:
    pprint({'input_query': row['input_query'], 'expected_answer': row['expected_answer']})


## Defining a Scoring function

Scoring functions translate an output answer into a number, so that we can quantify answer quality and aggregate the results over the entire dataset.
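As a rough illustration of the idea, the sketch below scores answers with a substring check and aggregates the scores into an accuracy. The functions `subset_of`, `equality`, and `aggregate_accuracy` are hypothetical local helpers written for this example; they are not the Llama Stack implementations, and details such as case sensitivity are assumptions.

```python
from typing import Dict, List


def subset_of(expected: str, generated: str) -> float:
    """1.0 if the expected answer appears inside the generated answer, else 0.0.
    Assumption: plain case-sensitive substring match."""
    return 1.0 if expected in generated else 0.0


def equality(expected: str, generated: str) -> float:
    """1.0 only on an exact string match."""
    return 1.0 if expected == generated else 0.0


def aggregate_accuracy(scores: List[float]) -> Dict[str, float]:
    """Collapse per-row scores into dataset-level metrics."""
    num_total = len(scores)
    num_correct = sum(scores)
    accuracy = num_correct / num_total if num_total else 0.0
    return {"accuracy": accuracy, "num_correct": num_correct, "num_total": num_total}


# Illustrative rows, not taken from a real eval run
rows = [
    {"expected_answer": "Paris", "generated_answer": "Paris is the capital of France."},
    {"expected_answer": "North America", "generated_answer": "USA"},
]

scores = [subset_of(r["expected_answer"], r["generated_answer"]) for r in rows]
print(scores)
print(aggregate_accuracy(scores))
```

Note how the second row scores 0.0 even though "USA" is arguably a partially correct answer: a string-based scorer is cheap and deterministic, but it only rewards answers that literally contain the expected text.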

In [None]:
# We can use the scoring_functions.list api to know all available scoring functions,
# We also allow for adding new scoring functions

scoring_params = {}
for sf in client.scoring_functions.list()[:3]:
  scoring_params[sf.identifier] = sf.params
  print(f"{sf.identifier} --> {sf.description}")


basic::equality --> Returns 1.0 if the input is equal to the target, 0.0 otherwise.
basic::regex_parser_multiple_choice_answer --> Extract answer from response matching Answer: [the_answer_letter], and compare with expected result
basic::subset_of --> Returns 1.0 if the expected is included in generated, 0.0 otherwise.


In [None]:
scoring_function = "basic::subset_of"

response = client.scoring.score(
    input_rows=[
        {
            "input_query": "What is the capital of France?",
            "expected_answer": "Paris",
            "generated_answer": "Paris is the capital of France."
        },
        {
            "input_query": "Where is the 2026 fifa world cup ?",
            "expected_answer": "North America",
            "generated_answer": "USA",
        }
    ],
    scoring_functions={
        scoring_function: scoring_params[scoring_function]
    }
)

pprint(response)

## Evaluating Agents

Now that we have finalized the dataset and the scoring function, let's write a few agents and evaluate them.

In [None]:
import json
import time
import uuid

from pydantic import BaseModel
from typing import Any, Dict, Optional

from llama_stack_client.types.agents import Turn


class EvalRow(BaseModel):
    input_query: str
    expected_answer: str
    turn: Optional[Turn]
    scores: Dict[str, Any]

    @property
    def generated_answer(self) -> Optional[str]:
        # None when the agent failed to produce a turn for this row
        if self.turn:
            return self.turn.output_message.content
        return None

def tool_call_fraction(eval_responses) -> float:
    """Fraction of eval rows whose turn includes at least one tool_execution step."""
    has_tools = 0
    for r in eval_responses:
        if r.turn is not None:
            has_tools += int(any(step.step_type == "tool_execution" for step in r.turn.steps))

    return 0.0 if len(eval_responses) == 0 else has_tools / len(eval_responses)


def run_generate(client, agent, eval_rows, sleep_time=3):
  """
  Run the agent on each of the eval_rows and produce the answers.
  """
  turns = []
  for r in eval_rows.rows:
    session_id = agent.create_session(uuid.uuid4().hex)
    try:
      turn = agent.create_turn(
          messages=[
              {
                  "role": "user",
                  "content": r["input_query"],
              }
          ],
          session_id=session_id,
          stream=False,
      )
      turns.append(turn)
      time.sleep(sleep_time)
    except Exception as e:
      # Report the failure and record None so that turns stays aligned with eval_rows.rows
      print(f"Exception: Failed to generate answer with error: {e}")
      turns.append(None)

  return turns


def run_eval(client, agent, eval_rows, scoring_functions):
  """
  Run the agent on each of the eval_rows and produce the answers.
  Score the answers by comparing to the expected answers.
  Return eval results, scores and aggregated metrics.
  """
  # get correct scoring params for provided scoring_functions
  scoring_params = {}
  for sf in client.scoring_functions.list():
    scoring_params[sf.identifier] = sf.params

  # generate responses for agents
  turns = run_generate(client, agent, eval_rows)

  # score
  input_rows = []
  for turn, row in zip(turns, eval_rows.rows):
    if turn is None:
      answer = "Failed to generate an answer"
    else:
      answer = turn.output_message.content
    input_rows.append({
        "input_query": row["input_query"],
        "expected_answer": row["expected_answer"],
        "generated_answer": answer
    })
  scoring_function_params = {
      sf_id: scoring_params[sf_id] for sf_id in scoring_functions
  }
  response = client.scoring.score(
      input_rows=input_rows,
      scoring_functions=scoring_function_params
  )
  eval_responses = []
  aggregated_eval_results = {}
  eval_scores = [{} for _ in range(len(input_rows))]  # [{}, {} ... accumulating all scores for each row]
  for scoring_fn, result in response.results.items():
      for i, score in enumerate(result.score_rows):
          eval_scores[i][scoring_fn] = score["score"]
      aggregated_eval_results[scoring_fn] = result.aggregated_results

  for turn, ip_row, scores in zip(turns, input_rows, eval_scores):
      eval_responses.append(
          EvalRow(
              input_query=ip_row["input_query"],
              expected_answer=ip_row["expected_answer"],
              turn=turn,
              scores=scores,
          )
      )

  return eval_responses, aggregated_eval_results
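As a quick sanity check of the tool-call metric, the sketch below runs the same logic on stand-in objects instead of real `Turn` objects. `stub_row` is a hypothetical helper built on `SimpleNamespace`, and the step type `"inference"` is used here only as a placeholder for any step that is not `"tool_execution"`.

```python
from types import SimpleNamespace


def tool_call_fraction(eval_responses) -> float:
    """Fraction of rows whose turn contains at least one tool_execution step."""
    if not eval_responses:
        return 0.0
    has_tools = sum(
        1
        for r in eval_responses
        if r.turn is not None
        and any(step.step_type == "tool_execution" for step in r.turn.steps)
    )
    return has_tools / len(eval_responses)


def stub_row(step_types):
    """Hypothetical stand-in for an EvalRow; None simulates a failed turn."""
    if step_types is None:
        return SimpleNamespace(turn=None)
    steps = [SimpleNamespace(step_type=t) for t in step_types]
    return SimpleNamespace(turn=SimpleNamespace(steps=steps))


rows = [
    stub_row(["inference", "tool_execution", "inference"]),  # used a tool
    stub_row(["inference"]),                                 # answered directly
    stub_row(None),                                          # generation failed
    stub_row(["inference"]),                                 # answered directly
]

print(tool_call_fraction(rows))  # 0.25
```

Failed rows stay in the denominator, so a run with many exceptions will report a lower tool call fraction even if every successful turn used a tool.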


## First Agent

- We will start by defining a simple agent.
- The agent has access to a web-search tool (we use the Tavily provider to get search results).
- We start with the Llama 3.1 70B model.


In [None]:
agent1 = Agent(
    client,
    AgentConfig(
        model="meta-llama/Llama-3.1-70B-Instruct",
        instructions="You are a helpful assistant",
        toolgroups=["builtin::websearch"],
    )
)

print(f"Evaluating for agent: {agent1.agent_id}")
responses, metrics = run_eval(client, agent1, eval_rows, ["basic::subset_of"])
# pprint(responses[:2])
print(f"Metrics: {metrics}")
print(f"Tool call fraction: {tool_call_fraction(responses)}")


Evaluating for agent: 740186ba-86eb-4874-90ab-45984a702301
Metrics: {'basic::subset_of': {'accuracy': {'accuracy': 0.25, 'num_correct': 5.0, 'num_total': 20}}}
Tool call fraction: 0.1


As seen,
- the agent answered only a quarter of the questions correctly (5 of 20).
- only a small fraction of the requests (0.1) actually resulted in a tool call.

Let's iterate on the prompt and maybe use a better model to see if we can get better performance.

## Trying the Llama 3.3 70B model

There are two issues in the previous results:
- The web_search tool is not called on several rows.
- The answers tend to be verbose sentences.

Let's try a better model, Llama 3.3, to see if it does better at tool calling.

Also note the updated instructions, which explicitly suggest using the web search tool.

In [None]:
agent2 = Agent(
    client,
    AgentConfig(
        model="meta-llama/Llama-3.3-70B-Instruct",
        instructions="""
        Additional Instructions:
        - When provided with questions, use the websearch tool if necessary.
        """,
        toolgroups=["builtin::websearch"],
        enable_session_persistence=True,
    )
)

print(f"Evaluating for agent: {agent2.agent_id}")
responses, metrics = run_eval(client, agent2, eval_rows, ["basic::subset_of"])
# pprint(responses[:2])
print(f"Metrics: {metrics}")
print(f"Tool call fraction: {tool_call_fraction(responses)}")

Evaluating for agent: a11d9286-c38e-4e71-9071-5b1ddbe272a8
Metrics: {'basic::subset_of': {'accuracy': {'accuracy': 0.7, 'num_correct': 14.0, 'num_total': 20}}}
Tool call fraction: 0.45



- These changes definitely helped: accuracy rose from 0.25 to 0.7 and the tool call fraction rose from 0.1 to 0.45.
- It looks like the model has much better built-in knowledge and was able to answer many questions without web search.


## ReACT Agent

ReACT stands for "Reasoning and Acting." It's a framework that combines reasoning and acting in an interleaved manner to improve problem-solving capabilities of AI agents.

The ReACT framework enables agents to:
- Reason about a problem by breaking it down into steps
- Act by taking actions (tool calls) based on that reasoning
- Observe the results of those actions
- Update their reasoning based on observations

This cycle of Reason-Act-Observe creates a more effective problem-solving approach that helps LLMs tackle complex tasks that require multiple steps and adaptation based on intermediate results.

Llama Stack provides an out-of-the-box utility that lets agents work with the ReACT framework.

Here is a sample of the format we expect the agent to return:
```
You must always respond in the following JSON format:
{
    "thought": $THOUGHT_PROCESS,
    "action": {
        "tool_name": $TOOL_NAME,
        "tool_params": $TOOL_PARAMS
    },
    "answer": $ANSWER
}
```
- Basically, we are pushing the model to think before taking actions.
- Separately, we also use structured outputs to force the model to produce results of type `ReActOutput`, so that it's easy to parse the outputs.
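To show why a fixed JSON shape makes the outputs easy to handle, here is a minimal sketch that parses a response in the format above and decides the next move. It uses only the standard library: `ParsedStep`, `parse_react_output`, and `next_move` are hypothetical helpers, not the `ReActOutput` class itself, and the `query` parameter name and the use of `null` for an unused `action` or `answer` are assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ParsedStep:
    thought: str
    tool_name: Optional[str]
    tool_params: Dict[str, Any]
    answer: Optional[str]


def parse_react_output(raw: str) -> ParsedStep:
    """Parse one model response that follows the thought/action/answer format."""
    data = json.loads(raw)
    action = data.get("action") or {}  # assumption: action may be null or absent
    return ParsedStep(
        thought=data["thought"],
        tool_name=action.get("tool_name"),
        tool_params=action.get("tool_params") or {},
        answer=data.get("answer"),
    )


def next_move(step: ParsedStep) -> str:
    """Act if a tool was requested, otherwise return the final answer."""
    if step.tool_name:
        return f"call {step.tool_name} with {step.tool_params}"
    return f"answer: {step.answer}"


# Illustrative responses, not captured from a real model
acting = '{"thought": "I should look this up.", "action": {"tool_name": "web_search", "tool_params": {"query": "capital of France"}}, "answer": null}'
answering = '{"thought": "The search result is clear.", "action": null, "answer": "Paris"}'

print(next_move(parse_react_output(acting)))
print(next_move(parse_react_output(answering)))
```

Because every response carries the same three keys, the control loop needs no free-text heuristics to tell a tool request from a final answer.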

In [None]:
# Let's look at the prompt

def build_prompt(builtin_toolgroups, additional_instructions=None):
    # Collect name, description and parameters for every tool in the given toolgroups
    tool_defs = []
    for x in builtin_toolgroups:
        tool_defs.extend(
            [
                {
                    "name": tool.identifier,
                    "description": tool.description,
                    "parameters": tool.parameters,
                }
                for tool in client.tools.list(toolgroup_id=x)
            ]
        )
    tool_names = ", ".join([x["name"] for x in tool_defs])
    tool_descriptions = "\n".join([f"- {x['name']}: {x}" for x in tool_defs])

    # Fill the placeholders in the default ReAct system prompt template
    instruction = DEFAULT_REACT_AGENT_SYSTEM_PROMPT_TEMPLATE.replace(
        "<<tool_names>>", tool_names
    ).replace(
        "<<tool_descriptions>>", tool_descriptions
    )
    if additional_instructions:
        instruction += f"\n\n{additional_instructions}"

    return instruction

instruction = build_prompt(
    ["builtin::websearch"],
    """
    Additional Instructions:
    - Respond concisely with just the answer and not long sentences.
    """
)
print(instruction[:2500])



You are an expert assistant who can solve any task using tool calls. You will be given a task to solve as best you can.
To do so, you have been given access to the following tools: web_search

You must always respond in the following JSON format:
{
    "thought": $THOUGHT_PROCESS,
    "action": {
        "tool_name": $TOOL_NAME,
        "tool_params": $TOOL_PARAMS
    },
    "answer": $ANSWER
}

Specifically, this json should have a `thought` key, a `action` key and an `answer` key.

The `action` key should specify the $TOOL_NAME the name of the tool to use and the `tool_params` key should specify the parameters key as input to the tool.

Make sure to have the $TOOL_PARAMS as a dictionary in the right format for the tool you are using, and do not put variable names as input if you can find the right values.

You should always think about one action to take, and have the `thought` key contain your thought process about this action.
If the tool responds, the tool will return an observat

In [None]:

agent_config = AgentConfig(
    model="meta-llama/Llama-3.3-70B-Instruct",
    instructions=instruction,
    toolgroups=["builtin::websearch"],
    tool_config={
        "system_message_behavior": "replace",
    },
    response_format={
        "type": "json_schema",
        "json_schema": ReActOutput.model_json_schema(),
    }
)

agent3 = ReActAgent(
    client=client,
    model="meta-llama/Llama-3.3-70B-Instruct",
    custom_agent_config=agent_config,
)

responses, metrics = run_eval(client, agent3, eval_rows, ["basic::subset_of"])
# pprint(responses[:2])
print(f"Metrics: {metrics}")
print(f"Tool call fraction: {tool_call_fraction(responses)}")


Metrics: {'basic::subset_of': {'accuracy': {'accuracy': 0.65, 'num_correct': 13.0, 'num_total': 20}}}
Tool call fraction: 0.95


While the accuracy is in the same range (0.65 versus 0.7), we were able to get a much higher tool call fraction (0.95 versus 0.45) using this framework.