In [1]:
%load_ext autoreload
%autoreload 2

## Ingesting documents into Llama Cloud

In [None]:
import os
from dotenv import load_dotenv, find_dotenv

from llama_index.indices.managed.llama_cloud import LlamaCloudIndex

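# Load variables from a local .env file (if one exists) before reading the key
load_dotenv(find_dotenv())
LLAMA_CLOUD_API_KEY = os.environ['LLAMA_CLOUD_API_KEY']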

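# Retrieval settings passed to as_query_engine() below
kwargs = {
    'dense_similarity_top_k': 10,   # candidates from dense (embedding) search
    'sparse_similarity_top_k': 20,  # candidates from sparse (keyword) search
    'enable_reranking': True,       # rerank the merged candidates
    'alpha': 0.5,                   # dense/sparse weighting for hybrid search
    'rerank_top_n': 8,              # results kept after reranking
}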

In [None]:
alita_index = LlamaCloudIndex(
  name="alita-paper",
  project_name="Default",
  organization_id="bf9b425c-54cb-4182-a93f-8ac6aed04348",
  api_key=LLAMA_CLOUD_API_KEY,
)

In [None]:
mcp_zero_index = LlamaCloudIndex(
  name="mcp-zero-paper",
  project_name="Default",
  organization_id="bf9b425c-54cb-4182-a93f-8ac6aed04348",
  api_key=LLAMA_CLOUD_API_KEY,
)

In [44]:
from llama_index.llms.ollama import Ollama

llm = Ollama(model="gpt-oss:20b", request_timeout=600)
mcp_zero_engine = mcp_zero_index.as_query_engine(llm=llm, **kwargs)

In [45]:
response = mcp_zero_engine.query("What is MCP zero and how does it work?")

In [46]:
from IPython.display import display, Markdown

display(Markdown(str(response)))

**MCP‑Zero** is an agent framework that lets a large language model (LLM) discover and call external tools on its own instead of being handed a huge list of tool descriptions up front.  
It addresses two common bottlenecks in today’s tool‑augmented LLMs:

| Problem | Conventional approach | MCP‑Zero solution |
|---------|-----------------------|-------------------|
| **Context bloat** – all JSON‑schema definitions are injected, consuming tens of thousands of tokens | Inject the whole tool ecosystem into the prompt | The LLM requests only the tools it needs, so only a few relevant schemas are added |
| **Passive selection** – the model simply picks from a pre‑selected set | A single query is matched to the whole tool set | The model actively asks for a tool, can refine its request, and can request new tools in later turns |

### Core ideas

1. **Active Tool Request**  
   The LLM emits a tiny, structured block that states what it needs.  
   ```xml
   <tool assistant>
   server: <domain or permission>
   tool:   <operation type + target>
   </tool assistant>
   ```
   Because the request is generated by the model itself, it matches the semantics of the tool documentation more closely than a raw user query.

2. **Hierarchical Semantic Routing**  
   The request is handled in two stages using semantic embeddings:

   * **Server filtering** – match the `server` field against short server descriptions (often just one sentence).  
   * **Tool ranking** – within each chosen server, rank tools by similarity between the `tool` field and the tool’s description.  
   A combined score, the product of the two similarities multiplied by the larger of the two, selects the top‑k schemas to feed back to the LLM.

3. **Iterative Capability Extension**  
   After a tool is used, the LLM checks whether the task is still incomplete.  
   If more capability is required, it emits another request; if not, it proceeds.  
   This loop allows the agent to build a cross‑domain chain of tools (e.g., filesystem access → code editing → command execution) while never loading the full tool collection into the prompt.
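
The two-stage routing from idea 2 can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the toy bag-of-words `embed` function, the `SERVERS` registry, the `route` helper, and the reading of the combined score as `(s_server * s_tool) * max(s_server, s_tool)` are all assumptions made for the example. A real system would use a sentence-embedding model instead of word counts.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (assumption: stands in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical registry: each server has a short description and a few tools.
SERVERS = {
    "filesystem": {
        "description": "read and write files on the local file system",
        "tools": {
            "read_file": "read the contents of a file",
            "write_file": "write text to a file",
        },
    },
    "shell": {
        "description": "execute shell commands in a terminal",
        "tools": {
            "run_command": "execute a shell command and return output",
        },
    },
}


def route(server_request, tool_request, top_servers=1, top_k=2):
    """Return up to top_k (server, tool) pairs for a structured tool request."""
    s_req, t_req = embed(server_request), embed(tool_request)

    # Stage 1: server filtering by similarity to the server description.
    ranked_servers = sorted(
        ((cosine(s_req, embed(info["description"])), name) for name, info in SERVERS.items()),
        reverse=True,
    )[:top_servers]

    # Stage 2: tool ranking inside each selected server.
    scored = []
    for s_server, name in ranked_servers:
        for tool, description in SERVERS[name]["tools"].items():
            s_tool = cosine(t_req, embed(description))
            # Assumed combined score: product of both similarities times the larger one.
            score = (s_server * s_tool) * max(s_server, s_tool)
            scored.append((score, name, tool))

    scored.sort(reverse=True)
    return [(name, tool) for _, name, tool in scored[:top_k]]


print(route("file system access", "read a file"))
```

Only the schemas for the returned pairs would be handed back to the model; every other tool in the registry stays out of the prompt.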

### How it works in practice

1. **User asks**: “Debug my code in `src/train.py`.”  
2. **LLM** → notices it lacks filesystem, code‑analysis, and shell execution tools.  
3. **LLM** emits a request for a filesystem read tool.  
4. **System** finds the best matching server (e.g., “File System”) and the specific read tool, returns the JSON‑schema.  
5. **LLM** calls the tool, gets the file, then requests a code‑analysis tool, and so on.  
6. **Once all needed tools are called**, the LLM solves the problem and produces the final answer.

Because only the schemas of the actually used tools are sent back, the prompt stays tiny (often a few hundred tokens) even when the total tool ecosystem contains thousands of APIs. The agent remains fully autonomous: it decides *when* and *what* to request, can refine its requests in subsequent turns, and can grow a customized toolchain on the fly.
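
The request, call, and re-check loop described above can be sketched with a scripted stand-in for the model. Everything here is hypothetical and for illustration only: `TOOLS`, `scripted_llm`, `run_agent`, and the step dictionaries are invented names, and a real agent would replace `scripted_llm` with an actual LLM call and semantic tool lookup.

```python
# Hypothetical tool registry; real entries would carry JSON-schema definitions.
TOOLS = {
    "read_file": lambda path: f"<contents of {path}>",
    "analyze_code": lambda source: "possible bug: learning rate is never set",
}


def scripted_llm(context):
    """Stand-in for the LLM: picks the next step from what is already in context."""
    if "read_file" not in context["loaded_tools"]:
        return {"action": "request_tool", "tool": "read_file"}
    if "file" not in context["results"]:
        return {"action": "call", "tool": "read_file", "arg": "src/train.py", "save_as": "file"}
    if "analyze_code" not in context["loaded_tools"]:
        return {"action": "request_tool", "tool": "analyze_code"}
    if "analysis" not in context["results"]:
        return {
            "action": "call",
            "tool": "analyze_code",
            "arg": context["results"]["file"],
            "save_as": "analysis",
        }
    return {"action": "answer", "text": context["results"]["analysis"]}


def run_agent(llm, max_turns=10):
    """Loop until the model answers; only requested tools ever enter the context."""
    context = {"loaded_tools": [], "results": {}}
    for _ in range(max_turns):
        step = llm(context)
        if step["action"] == "request_tool":
            # Load just the one tool the model asked for.
            context["loaded_tools"].append(step["tool"])
        elif step["action"] == "call":
            context["results"][step["save_as"]] = TOOLS[step["tool"]](step["arg"])
        else:
            return step["text"], context["loaded_tools"]
    raise RuntimeError("agent did not finish within max_turns")


answer, loaded = run_agent(scripted_llm)
print(answer)
print(loaded)
```

The `max_turns` cap is a design choice for the sketch: it keeps a model that never stops requesting tools from looping forever.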

## Composite retrieval
Not recommended: it is better to split the question into sub-questions and send each one to the index that covers it.
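
One way to query each index with its own sub-question is sketched below. It is a minimal example under stated assumptions: `answer_by_parts` and `EchoEngine` are hypothetical helpers, the question is split by hand, and `EchoEngine` only stands in for a real query engine so the cell runs without Llama Cloud access.

```python
def answer_by_parts(sub_questions, engines):
    """Send each sub-question to the engine for its index and collect the answers.

    sub_questions: list of (index_name, question) pairs, split by hand.
    engines: dict mapping index_name to any object with a .query(str) method.
    """
    answers = []
    for index_name, question in sub_questions:
        response = engines[index_name].query(question)
        answers.append((index_name, question, str(response)))
    return answers


class EchoEngine:
    """Stand-in query engine (hypothetical) that just echoes the question."""

    def __init__(self, label):
        self.label = label

    def query(self, question):
        return f"[{self.label}] answer to: {question}"


engines = {
    "alita": EchoEngine("alita-paper"),
    "mcp_zero": EchoEngine("mcp-zero-paper"),
}
sub_questions = [
    ("alita", "What is Alita?"),
    ("mcp_zero", "What is MCP Zero?"),
]

for index_name, question, answer in answer_by_parts(sub_questions, engines):
    print(f"{index_name}: {answer}")
```

In this notebook the stand-ins could be swapped for real engines, for example `mcp_zero_engine` and an engine built the same way from `alita_index.as_query_engine(llm=llm, **kwargs)`.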

In [37]:
from llama_cloud import CompositeRetrievalMode
from llama_index.indices.managed.llama_cloud import LlamaCloudCompositeRetriever

retriever = LlamaCloudCompositeRetriever(
    name="Alita and MCP Zero Retriever",
    api_key=LLAMA_CLOUD_API_KEY,
    create_if_not_exists=True,
    mode=CompositeRetrievalMode.FULL,
    rerank_top_n=8,
)

In [38]:
retriever.add_index(
    alita_index, description="Knowledge base for the Alita paradigm for agents"
)
retriever.add_index(
    mcp_zero_index, description="Knowledge base of the (model context protocol) MCP zero paradigm"
)

Retriever(name='Alita and MCP Zero Retriever', pipelines=[RetrieverPipeline(name='alita-paper', description='Knowledge base for the Alita paradigm for agents', pipeline_id='6e287db8-a658-48c2-837f-1e13c85edc84', preset_retrieval_parameters=PresetRetrievalParams(dense_similarity_top_k=30, dense_similarity_cutoff=0.0, sparse_similarity_top_k=30, enable_reranking=True, rerank_top_n=6, alpha=0.5, search_filters=None, search_filters_inference_schema=None, files_top_k=1, retrieval_mode=<RetrievalMode.CHUNKS: 'chunks'>, retrieve_image_nodes=False, retrieve_page_screenshot_nodes=False, retrieve_page_figure_nodes=False, class_name='base_component')), RetrieverPipeline(name='mcp-zero-paper', description='Knowledge base of the (model context protocol) MCP zero paradigm', pipeline_id='1d5c48e0-9849-49a6-a59d-0af4eb09f794', preset_retrieval_parameters=PresetRetrievalParams(dense_similarity_top_k=30, dense_similarity_cutoff=0.0, sparse_similarity_top_k=30, enable_reranking=True, rerank_top_n=6, alph

In [39]:
nodes = retriever.retrieve(
    "What is Alita and what is MCP Zero? Can Alita and MCP zero work together?"
)

In [42]:
nodes

[NodeWithScore(node=TextNode(id_='014911b9-3bda-45c1-901f-2c96a94b2295', embedding=None, metadata={'id': 'mcp_zero_paper.pdf', 'file_size': 975244, 'last_modified_at': '2025-08-13T02:23:02', 'file_path': 'mcp_zero_paper.pdf', 'file_name': 'mcp_zero_paper.pdf', 'external_file_id': 'mcp_zero_paper.pdf', 'file_id': '99275b01-66e2-4f31-9b1c-914c3916ba5d', 'pipeline_file_id': '1c2460ac-3fe0-4394-ace3-d4b2df4559e3', 'pipeline_id': '1d5c48e0-9849-49a6-a59d-0af4eb09f794', 'page_label': 10, 'start_page_index': 9, 'start_page_label': 10, 'end_page_index': 9, 'end_page_label': 10, 'document_id': '2e13a37a379d317a1af0a61b9d831caed3d655a04d32d55bc9', 'start_char_idx': 81283, 'end_char_idx': 86577, 'retriever_id': 'd2c4add1-f4cb-41a0-bcfd-58dac5293769', 'retriever_pipeline_name': 'mcp-zero-paper'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text='**2. Semantic grounding.** The example also clarifies t
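
The raw `NodeWithScore` repr is hard to scan. A small helper can group the results by the `retriever_pipeline_name` metadata key visible in the output above, which shows which index each chunk came from. This is a sketch: `summarize_by_pipeline` is a hypothetical helper, and the stand-in objects below use made-up scores and a hypothetical `alita_paper.pdf` file name so that the cell runs without a live retriever.

```python
from collections import defaultdict
from types import SimpleNamespace


def summarize_by_pipeline(nodes):
    """Group retrieved nodes by source pipeline.

    Returns {pipeline_name: [(score, file_name, page_label), ...]}.
    Expects items with .score and .node.metadata, as NodeWithScore provides.
    """
    grouped = defaultdict(list)
    for item in nodes:
        meta = item.node.metadata
        pipeline = meta.get("retriever_pipeline_name", "unknown")
        grouped[pipeline].append((item.score, meta.get("file_name"), meta.get("page_label")))
    return dict(grouped)


def fake_node(score, pipeline, file_name, page_label):
    """Build a stand-in shaped like NodeWithScore (illustration only)."""
    metadata = {
        "retriever_pipeline_name": pipeline,
        "file_name": file_name,
        "page_label": page_label,
    }
    return SimpleNamespace(score=score, node=SimpleNamespace(metadata=metadata))


# Made-up scores and a hypothetical Alita file name, for illustration only.
fake_nodes = [
    fake_node(0.9, "mcp-zero-paper", "mcp_zero_paper.pdf", 10),
    fake_node(0.7, "alita-paper", "alita_paper.pdf", 3),
    fake_node(0.6, "mcp-zero-paper", "mcp_zero_paper.pdf", 4),
]

for pipeline, hits in summarize_by_pipeline(fake_nodes).items():
    print(pipeline, hits)
```

Calling `summarize_by_pipeline(nodes)` on the real result makes it easy to check whether both indexes contributed chunks to a question that spans the two papers.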