In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import nest_asyncio
nest_asyncio.apply()

## Observability Setup

In [3]:
import phoenix as px
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
from phoenix.otel import register

tracer_provider = register(
    project_name="multi-step-rag", 
)
LlamaIndexInstrumentor().instrument(
    tracer_provider=tracer_provider,
)

🔭 OpenTelemetry Tracing Details 🔭
|  Phoenix Project: multi-step-rag
|  Span Processor: SimpleSpanProcessor
|  Collector Endpoint: localhost:4317
|  Transport: gRPC
|  Transport Headers: {}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.



## RAG application setup

In [4]:
from llama_index.core import (
    Settings, 
    SimpleDirectoryReader, 
    VectorStoreIndex, 
)
from llama_index.llms.openai import OpenAI
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding

from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())

Settings.llm = OpenAI(model="gpt-5-nano", temperature=0)
Settings.embed_model = OllamaEmbedding(model_name="mxbai-embed-large:latest")
eval_llm = Ollama(model="gpt-oss:20b", request_timeout=60000)  # the Ollama LLM class takes request_timeout

### Ingestion

Load the source documents from `./docs`

In [5]:
documents = SimpleDirectoryReader(input_dir="./docs").load_data()

Create the vector index. With no vector store configured, the embeddings are held in LlamaIndex's default in-memory store.

In [6]:
index = VectorStoreIndex.from_documents(documents)

2025-11-05 16:56:13,859 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 16:56:13,935 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 16:56:13,993 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 16:56:14,058 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 16:56:14,121 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 16:56:14,174 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 16:56:14,227 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 16:56:14,279 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 16:56:14,317 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 16:56:14,359 - INFO - HTTP Request: POST http://localhost:1143

## The RAG query engine

In [7]:
from llama_index.core.indices.query.query_transform import (
    HyDEQueryTransform,
    StepDecomposeQueryTransform
)
from llama_index.core.postprocessor.llm_rerank import LLMRerank
from llama_index.core.query_engine import (
    MultiStepQueryEngine,
    SubQuestionQueryEngine, 
    TransformQueryEngine
)
from llama_index.core.question_gen import LLMQuestionGenerator
from llama_index.core.response_synthesizers import get_response_synthesizer
from llama_index.core.tools import QueryEngineTool, ToolMetadata

## Base Query Engine
reranker = LLMRerank(top_n=8)  # uses Settings.llm
base_query_engine = index.as_query_engine(
    similarity_top_k=20,
    node_postprocessors=[reranker]
)

## HyDE Query Engine
hyde = HyDEQueryTransform(include_original=True)
hyde_query_engine = TransformQueryEngine(base_query_engine, query_transform=hyde)

## Sub Question Query Engine
generator = LLMQuestionGenerator.from_defaults()
query_engine_tools = [
    QueryEngineTool(
        query_engine=hyde_query_engine,
        metadata=ToolMetadata(
            name="alita-gepa-mcpZero-harvardOSS-engine",
            description="Use this for specific questions relating to alita, gepa, mcp zero or Harvard's oss paper",
        ),
    ),
]
sub_question_query_engine = SubQuestionQueryEngine(
    question_gen=generator,
    response_synthesizer=get_response_synthesizer(),
    query_engine_tools=query_engine_tools,
    use_async=True
)

## Multi-Step Query Engine
transform = StepDecomposeQueryTransform(verbose=True)
final_query_engine = MultiStepQueryEngine(
    query_engine=sub_question_query_engine,
    query_transform=transform,
    index_summary="Answers questions relating to alita, gepa, mcp zero or Harvard's oss paper",
)
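
The base query engine above follows a two-stage pattern: a wide, cheap recall step (`similarity_top_k=20`) followed by a narrower, more expensive rerank (`top_n=8`). The sketch below illustrates that pattern in plain Python. It is not how LlamaIndex implements it: `retrieve_then_rerank`, `cosine`, and `score_fn` are hypothetical names, and `score_fn` stands in for the LLM relevance judgment that `LLMRerank` would make.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(query_vec, nodes, score_fn, similarity_top_k=20, top_n=8):
    """nodes is a list of (text, embedding) pairs.

    score_fn(text) is a hypothetical stand-in for an LLM relevance score.
    """
    # Stage 1: recall the similarity_top_k nearest nodes by embedding similarity.
    candidates = sorted(
        nodes, key=lambda node: cosine(query_vec, node[1]), reverse=True
    )[:similarity_top_k]
    # Stage 2: reorder only those candidates with the costlier scorer, keep top_n.
    reranked = sorted(candidates, key=lambda node: score_fn(node[0]), reverse=True)
    return [text for text, _ in reranked[:top_n]]
```

Keeping the first stage wide gives the reranker a chance to promote nodes that embedding similarity ranks low, while `top_n` bounds how much context reaches the response synthesizer.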

In [8]:
query = (
    "Can you tell me how Alita and MCP Zero can interplay with each other? "
    "Also, how can GEPA perform better than GRPO even though it's a prompt engineering "
    "technique that does not rewrite the weights of the LLM?"
)
response = await base_query_engine.aquery(query)

2025-11-05 16:56:27,260 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 16:56:49,504 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-11-05 16:57:26,252 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-11-05 16:57:59,298 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


In [10]:
from IPython.display import display, Markdown

display(Markdown(response.response))

Here’s how the two topics fit together, based on the provided material:

- Interplay between Alita and MCP-Zero
  - What each does
    - MCP-Zero is a tool-discovery engine: it actively searches for existing tools and capabilities across resources, and invokes them when suitable. It focuses on maximizing tool discovery and reuse.
    - Alita is a generalist agent framework that evolves capabilities by generating and refining task-related model context protocols (MCPs) from open-source material. It aims to synthesize and reuse external capabilities with minimal upfront handcrafting.
  - How they work together
    - They form a complementary loop: first, MCP-Zero tries to find and invoke existing tools to tackle the agent’s tasks.
    - If no suitable tool is found, Alita’s workflow can be engaged to synthesize a new tool by generating a new MCP tailored to the task, effectively creating new capabilities.
    - The newly created tool (and its MCP) can then be registered and made available to the community, enriching the tool ecosystem for future tasks.
  - Why this is powerful
    - This pairing balances discovery and creation: MCP-Zero maximizes what already exists, while Alita drives scalable self-evolution by producing and integrating new tools via MCPs.
    - The combination supports broader generalization across domains: semantic grounding via MCPs helps clarify tool semantics, enabling reliable tool use and faster adaptation to new tasks.

- Why GEPA can beat GRPO without changing LLM weights
  - Core idea
    - GEPA is a reflective prompt evolution method that optimizes prompts (system-level instructions and tool-use guidance) rather than updating model weights. It leverages natural-language reflection to diagnose issues, propose prompt updates, and combine lessons from multiple attempts.
  - Why it can outperform weight-space RL (GRPO)
    - High sample efficiency: GEPA can achieve large performance gains with far fewer rollouts (up to 35x fewer) by learning mainly from improved prompts and reflections rather than policy updates.
    - Better use of feedback: GEPA uses a reflection-based process to generate high-quality, task-relevant learning signals from each rollout, guiding prompt evolution more effectively than scalar reward signals alone.
    - Diverse, Pareto-guided exploration: GEPA uses Pareto-based candidate sampling to maintain diversity among evolving prompts, avoiding local optima that can trap strategies that always pick the current best candidate.
    - Systematic prompt combination: The approach includes mutation and a system-aware merge step, which can combine complementary prompt strategies from different evolutionary lineages to produce stronger prompts.
    - Evidence across tasks/models: In experiments, GEPA and its variant GEPA+Merge outperformed GRPO by up to about 19% on some tasks, with substantial reductions in rollouts required, and often matched or exceeded GRPO’s best validation scores with far fewer learning signals.
  - Practical takeaway
    - The gains come from optimizing the prompts and the learning dynamics (how prompts are mutated, merged, and selected) rather than from changing LLM weights. This makes GEPA a highly sample-efficient way to improve downstream performance for complex, modular AI systems where prompts and system behavior are crucial.
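
The "Pareto-guided exploration" point in the response can be made concrete with a small sketch. The response only says that candidates are sampled in a Pareto-based way to preserve diversity. The specific rule used here, weighting each candidate prompt by the number of task instances on which it ties for the best score, is an assumption made for illustration, and `pareto_sampling_weights` and `sample_candidate` are hypothetical names.

```python
import random


def pareto_sampling_weights(scores):
    """scores[c][i] is the score of candidate prompt c on task instance i.

    Returns {candidate_index: weight}, where weight counts the instances on
    which the candidate ties for the best score. Candidates that are never
    best on any instance are left out entirely.
    """
    n_instances = len(scores[0])
    wins = {}
    for i in range(n_instances):
        best = max(row[i] for row in scores)
        for c, row in enumerate(scores):
            if row[i] == best:
                wins[c] = wins.get(c, 0) + 1
    return wins


def sample_candidate(scores, rng=random):
    """Pick the next prompt to mutate, proportionally to its win count."""
    wins = pareto_sampling_weights(scores)
    candidates = list(wins)
    weights = [wins[c] for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

A greedy strategy would always mutate the candidate with the highest average score. Sampling from every candidate that is best somewhere keeps specialised prompts in play, which is the diversity argument the response makes.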

If you want, I can summarize how to architect a system that combines Alita with MCP-Zero in a concrete workflow, and separately outline a GEPA-inspired prompt-evolution protocol you could pilot for a given task.
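
The discover-first, synthesize-on-miss loop described in the response can be sketched in a few lines. This is a toy illustration under stated assumptions, not the actual MCP-Zero or Alita interface: `ToolRegistry`, `solve`, and the `synthesize` callback are hypothetical, and the exact-key lookup is a simplification of whatever tool search a real discovery engine performs.

```python
from typing import Callable, Dict, Optional

Tool = Callable[[str], str]


class ToolRegistry:
    """Hypothetical shared registry of tools, keyed by capability name."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, capability: str, tool: Tool) -> None:
        self._tools[capability] = tool

    def discover(self, capability: str) -> Optional[Tool]:
        # Simplification: exact-key lookup instead of a real tool search.
        return self._tools.get(capability)


def solve(
    task: str,
    capability: str,
    registry: ToolRegistry,
    synthesize: Callable[[str], Tool],
) -> str:
    # Discovery step (the role the response assigns to MCP-Zero).
    tool = registry.discover(capability)
    if tool is None:
        # Creation step (the role the response assigns to Alita).
        tool = synthesize(capability)
        # Register the new tool so later tasks can reuse it.
        registry.register(capability, tool)
    return tool(task)
```

With this shape, a second task that needs the same capability is served from the registry and never triggers synthesis again.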