In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())

## Test the endpoint with the OpenAI client

In [None]:
from openai import OpenAI

client = OpenAI(
    base_url=f'{os.getenv("qwen3_endpoint_url")}/v1',
    api_key=os.getenv("BENTO_CLOUD_API_KEY"),
)
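If `qwen3_endpoint_url` is unset, the f-string above quietly evaluates to `None/v1`, and the problem only surfaces on the first request. A minimal sketch of an early check follows. `build_base_url` is a hypothetical helper written for this notebook, not part of the OpenAI or BentoML SDKs, and the URL in the example is a placeholder.

```python
from typing import Optional


def build_base_url(endpoint: Optional[str]) -> str:
    """Return the OpenAI-compatible base URL for a deployment endpoint.

    Hypothetical helper for illustration: it fails fast instead of
    producing the string "None/v1" when the variable is missing.
    """
    if not endpoint:
        raise ValueError("qwen3_endpoint_url is not set")
    # Avoid a double slash if the endpoint already ends with "/".
    return endpoint.rstrip("/") + "/v1"


# Placeholder URL, not a real deployment.
print(build_base_url("https://example.invalid/"))
```

In the cell above, this could be used as `base_url=build_base_url(os.getenv("qwen3_endpoint_url"))`.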

In [24]:
client.models.list()

SyncPage[Model](data=[Model(id='Qwen/Qwen3-8B', created=1755676576, object='model', owned_by='vllm', root='/home/bentoml/bento/hf-models/models--qwen--qwen3-8b/snapshots/b968826d9c46dd6066d109eabc6255188de91218', parent=None, max_model_len=16548, permission=[{'id': 'modelperm-6b864e43fb80479c92e5ce8341e2c5ee', 'object': 'model_permission', 'created': 1755676576, 'allow_create_engine': False, 'allow_sampling': True, 'allow_logprobs': True, 'allow_search_indices': False, 'allow_view': True, 'allow_fine_tuning': False, 'organization': '*', 'group': None, 'is_blocking': False}])], object='list')

In [25]:
response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what quantum computers are and how they work like I'm a 16-year-old."},
    ],
)

In [29]:
from IPython.display import display, Markdown

display(Markdown(response.choices[0].message.content))



Sure! Let me break it down like you're 16 and trying to get the hang of it.  

**What is a quantum computer?**  
Imagine a regular computer as a superhero who can only do one thing at a time. Like, if you ask it to solve a puzzle, it checks one piece at a time. But a **quantum computer** is like a superhero with *superpowers*—it can check *all the pieces at once*!  

**How does it work?**  
Regular computers use **bits** (0s and 1s) to store information. Think of them like light switches: they’re either *on* (1) or *off* (0). But quantum computers use **qubits**. Here’s the magic:  
- **Superposition**: A qubit can be both 0 and 1 at the same time, like a spinning coin that’s *both heads and tails* until it lands. This lets quantum computers process *many possibilities* simultaneously.  
- **Entanglement**: Qubits can be linked, so changing one instantly affects the other, no matter how far apart they are. It’s like having two coins that *know* each other’s state, even if they’re on opposite sides of the universe.  

**Why is this cool?**  
Quantum computers can solve certain problems way faster than regular computers. For example:  
- **Breaking codes** (though they’re still working on that).  
- **Simulating molecules** for new drugs or materials.  
- **Optimizing complex systems** like traffic routes or financial models.  

**But wait…**  
They’re not *better* at everything. They’re like a supercharged calculator for specific tasks. Plus, they’re tricky to build because qubits are super sensitive to outside interference (like a coin spinning in a storm).  

**In short**:  
Quantum computers use qubits that can be in multiple states at once and link together magically. They’re not replacing your phone, but they might revolutionize fields like medicine or AI in the future! 🌌✨  

Want to know how they’re built? It’s like building a machine that can *see* all possible answers at once—kinda like a time machine for data! 😄
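Depending on how the server is configured, Qwen3 may return its reasoning inline, wrapped in `<think>...</think>` tags, ahead of the final answer. The output above shows no such block, so this may not be needed for this deployment. If it does appear, a small sketch like the following could remove it before rendering. `strip_think` is a hypothetical helper, not part of any SDK.

```python
import re

# Matches one inline reasoning block plus any whitespace that follows it.
_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_think(text: str) -> str:
    """Remove inline <think>...</think> blocks from a model reply."""
    return _THINK_BLOCK.sub("", text).strip()


# Self-contained example using a made-up reply string.
sample = "<think>\nThe user wants a short answer.\n</think>\n\nHello!"
print(strip_think(sample))

# With the response above, one might call:
# display(Markdown(strip_think(response.choices[0].message.content)))
```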

## Powering a LlamaCloud index with our BentoML deployment

In [31]:
from llama_index.indices.managed.llama_cloud import LlamaCloudIndex
from llama_index.llms.openai_like import OpenAILike

LLAMA_CLOUD_API_KEY = os.environ['LLAMA_CLOUD_API_KEY']

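# Retrieval settings passed to as_query_engine() further down.
kwargs = {
    'dense_similarity_top_k': 10,   # candidates from dense (embedding) retrieval
    'sparse_similarity_top_k': 20,  # candidates from sparse (keyword) retrieval
    'enable_reranking': True,       # rerank the combined candidates
    'alpha': 0.5,                   # weighting between dense and sparse retrieval
    'rerank_top_n': 8,              # nodes kept after reranking
}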

alita_index = LlamaCloudIndex(
    name="alita-paper",
    project_name="Default",
    organization_id="bf9b425c-54cb-4182-a93f-8ac6aed04348",
    api_key=LLAMA_CLOUD_API_KEY,
)

mcp_zero_index = LlamaCloudIndex(
    name="mcp-zero-paper",
    project_name="Default",
    organization_id="bf9b425c-54cb-4182-a93f-8ac6aed04348",
    api_key=LLAMA_CLOUD_API_KEY,
)

In [None]:
llm = OpenAILike(
    model="Qwen/Qwen3-8B",
    api_key=os.getenv("BENTO_CLOUD_API_KEY"),
    api_base=f'{os.getenv("qwen3_endpoint_url")}/v1',
    is_chat_model=True,
    is_function_calling_model=True,
    temperature=0,
)

In [33]:
alita_engine = alita_index.as_query_engine(llm=llm, **kwargs)
mcp_zero_engine = mcp_zero_index.as_query_engine(llm=llm, **kwargs)

In [34]:
response = mcp_zero_engine.query("What is MCP zero and how does it work?")
display(Markdown(str(response)))



MCP-Zero is an innovative framework designed to enhance the autonomy of large language models (LLMs) in discovering and utilizing tools dynamically. Unlike traditional approaches that embed all available tools into prompts, MCP-Zero enables LLMs to actively identify their capability gaps and request specific tools on-demand, transforming them into autonomous agents rather than passive selectors.  

The framework operates through three core mechanisms:  
1. **Active Tool Request**: LLMs generate structured requests to specify their exact tool requirements, ensuring alignment between task needs and available tools.  
2. **Hierarchical Semantic Routing**: A two-stage process filters relevant servers based on platform requirements and ranks tools within those servers using semantic similarity, optimizing precision and reducing search complexity.  
3. **Iterative Capability Extension**: Agents progressively build cross-domain toolchains during task execution, refining their requests and discovering alternative tools as needed, which enhances adaptability and fault tolerance.  

This approach minimizes context overhead, maintains high accuracy in multi-turn interactions, and scales efficiently with expanding tool ecosystems. By prioritizing dynamic, on-demand tool discovery, MCP-Zero addresses limitations of static retrieval methods and supports robust, scalable autonomous agent systems.
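To check that an answer like the one above is grounded in the indexed paper, it can help to look at the retrieved chunks behind it. LlamaIndex responses expose these as `source_nodes`, each with a `score` and a `node`. The sketch below uses a hypothetical helper, `summarize_sources`, and runs against a stand-in object with made-up values so that it works without a live index.

```python
from types import SimpleNamespace


def summarize_sources(response, max_chars=80):
    """Return (score, snippet) pairs for the nodes behind a response."""
    rows = []
    for item in getattr(response, "source_nodes", []):
        # Collapse whitespace so each snippet fits on one line.
        snippet = " ".join(item.node.get_content().split())[:max_chars]
        rows.append((item.score, snippet))
    return rows


# Stand-in object with made-up values, mimicking the attributes used above.
fake_response = SimpleNamespace(
    source_nodes=[
        SimpleNamespace(
            score=0.91,
            node=SimpleNamespace(
                get_content=lambda: "MCP-Zero lets agents\nrequest tools on demand."
            ),
        )
    ]
)

for score, snippet in summarize_sources(fake_response):
    print(score, snippet)
```

With the real query result, `summarize_sources(response)` should list the `rerank_top_n` chunks that were passed to the model.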