In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import nest_asyncio
nest_asyncio.apply()

## Observability Setup

In [3]:
import phoenix as px
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
from phoenix.otel import register

tracer_provider = register(
    project_name="multi-step-rag", 
)
LlamaIndexInstrumentor().instrument(
    tracer_provider=tracer_provider,
)

  from .autonotebook import tqdm as notebook_tqdm


üî≠ OpenTelemetry Tracing Details üî≠
|  Phoenix Project: multi-step-rag
|  Span Processor: SimpleSpanProcessor
|  Collector Endpoint: localhost:4317
|  Transport: gRPC
|  Transport Headers: {}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.



## RAG application setup

In [4]:
from llama_index.core import (
    Settings, 
    SimpleDirectoryReader, 
    VectorStoreIndex, 
)
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding

from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())

Settings.llm = Ollama(model="gemma3", temperature=0, request_timeout=60000)
Settings.embed_model = OllamaEmbedding(model_name="mxbai-embed-large:latest")
eval_llm = Ollama(model="gpt-oss:20b", request_timeout=60000)

### Ingestion

Create the vector index

In [6]:
documents = SimpleDirectoryReader(input_dir="./docs").load_data()

Create the vector index and ingest vectors into PostGres

In [7]:
index = VectorStoreIndex.from_documents(documents)

2025-11-05 20:14:53,757 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 20:14:53,819 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 20:14:53,873 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 20:14:53,928 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 20:14:53,984 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 20:14:54,037 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 20:14:54,090 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 20:14:54,143 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 20:14:54,180 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 20:14:54,221 - INFO - HTTP Request: POST http://localhost:1143

## The RAG query engine

In [8]:
query = (
    "Can you tell me how Alita and MCP Zero can interplay with each other? "
    "Also, how can GEPA perform better than GRPO even though it's a prompt engineering "
    "technique that does not rewrite the weights of the LLM?"
)

In [9]:
base_query_engine = index.as_query_engine()

2025-11-05 20:15:09,831 - INFO - HTTP Request: POST http://localhost:11434/api/show "HTTP/1.1 200 OK"


In [10]:
response = await base_query_engine.aquery(query)

2025-11-05 20:15:10,694 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 20:15:14,821 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


In [11]:
from IPython.display import display, Markdown
display(Markdown(response.response))

Alita generates MCPs which can be utilized by other agents, enhancing their capabilities and problem-solving abilities. These MCPs, distilled from powerful models like Claude-3.7-Sonnet, bridge the gap in task-processing capabilities between agents utilizing smaller LLMs and those leveraging larger models.

In [12]:
higher_k_query_engine = index.as_query_engine(similarity_top_k=8)

In [13]:
response = await higher_k_query_engine.aquery(query)
display(Markdown(response.response))

2025-11-05 20:15:19,922 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 20:15:26,321 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


Alita‚Äôs design centers around leveraging the increasing coding and reasoning capabilities of LLMs. When running Alita on GAIA using GPT-4o-mini, it generates its own MCPs ‚Äì meaning it doesn‚Äôt rely on distilled MCPs from more powerful models like Claude-3.7-Sonnet. The experiment shows that Alita performs significantly worse on GAIA compared to when using GPT-4o-mini. This highlights the critical role of the underlying models‚Äô coding capabilities. 

GEPA can outperform GRPO, even though it‚Äôs a prompt engineering technique, because it uses a Pareto-based sampling strategy to generate prompts. This approach allows GEPA to explore a broader range of potential solutions and identify the most effective prompts, leading to improved performance. Furthermore, GEPA‚Äôs system-aware crossover strategies can provide large gains, but the optimal budget allocation between mutation and crossover, as well as when to invoke merge needs further study.

In [14]:
from llama_index.core.postprocessor.llm_rerank import LLMRerank
from llama_index.core.query_engine import RetrieverQueryEngine

reranker = LLMRerank(top_n=4)
reranker_query_engine = RetrieverQueryEngine.from_args(
    retriever = index.as_retriever(similarity_top_k=10),
    llm = Settings.llm,
    node_postprocessors=[reranker]
)

In [15]:
response = await reranker_query_engine.aquery(query)
display(Markdown(response.response))

2025-11-05 20:15:30,094 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 20:15:35,911 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-11-05 20:15:39,487 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


Alita and MCP-Zero address complementary halves of the same problem: MCP-Zero efficiently finds and invokes existing tools, while Alita automatically builds missing tools on-the-fly. When combined, they form a virtuous loop where an agent first actively discovers tools, and if none fit, Alita synthesizes a new one. GEPA achieves optimal test set performance by rapidly adapting and generalizing in compound AI systems, outperforming GRPO by up to 19% while using up to 35x fewer rollouts.

In [16]:
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine

hyde = HyDEQueryTransform(include_original=True)
hyde_query_engine = TransformQueryEngine(reranker_query_engine, query_transform=hyde)

In [17]:
response = await hyde_query_engine.aquery(query)
display(Markdown(response.response))

2025-11-05 20:15:54,153 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-11-05 20:15:54,238 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 20:15:54,266 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 20:15:59,439 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-11-05 20:16:02,741 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


According to the provided documents, Alita generates MCPs (Mission Control Protocols) which are then reused by other agents, like ODR-smolagents. Specifically, Alita, when using GPT-4o-mini, generates its own MCPs, unlike the experiment in Section 5.1.3 where the agent utilized MCPs distilled from Claude-3.7-Sonnet. GEPA, a prompt optimizer, can outperform GRPO, a reinforcement learning algorithm, because it incorporates natural language reflection to diagnose problems and propose prompt updates, leading to a significant quality gain with fewer rollouts.

In [18]:
from llama_index.core.question_gen import LLMQuestionGenerator
from llama_index.core.response_synthesizers import get_response_synthesizer
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata

query_engine_tools = [
    QueryEngineTool(
        query_engine=hyde_query_engine,
        metadata=ToolMetadata(
            name="alita-gepa-mcpZero",
            description="Use this for specific questions relating to alita, gepa and/or mcp zero",
        ),
    ),
]
generator = LLMQuestionGenerator.from_defaults()
sub_question_query_engine = SubQuestionQueryEngine(
    question_gen=generator,
    response_synthesizer=get_response_synthesizer(),
    query_engine_tools=query_engine_tools,
    use_async=False
)

response = sub_question_query_engine.query(query)
display(Markdown(response.response))

2025-11-05 20:16:06,838 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


Generated 3 sub questions.
[1;3;38;2;237;90;200m[alita-gepa-mcpZero] Q: What is the relationship between Alita and MCP Zero?
[0m

2025-11-05 20:16:07,729 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-11-05 20:16:07,781 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 20:16:07,806 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 20:16:15,592 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-11-05 20:16:18,973 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


[1;3;38;2;237;90;200m[alita-gepa-mcpZero] A: Alita autonomously expands its capabilities through continuous MCP integration. It generates MCPs, which are then encapsulated and stored in the MCP Box for future reuse. These MCPs are created through a self-reinforcing cycle where Alita continuously integrates new MCPs, enhancing its overall capabilities.
[0m[1;3;38;2;90;149;237m[alita-gepa-mcpZero] Q: How do Alita and MCP Zero interact?
[0m

2025-11-05 20:16:19,885 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-11-05 20:16:19,935 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 20:16:19,961 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 20:17:01,261 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


[1;3;38;2;90;149;237m[alita-gepa-mcpZero] A: Empty Response
[0m[1;3;38;2;11;159;203m[alita-gepa-mcpZero] Q: How does GEPA improve GRPO performance?
[0m

2025-11-05 20:17:02,453 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-11-05 20:17:02,505 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 20:17:02,530 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 20:17:08,339 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-11-05 20:17:11,661 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


[1;3;38;2;11;159;203m[alita-gepa-mcpZero] A: GEPA achieves superior test set performance compared to GRPO on tasks like HotpotQA, IFBench, HoVer, and PUPA by requiring significantly fewer rollouts. Specifically, it matches GRPO‚Äôs best validation scores with 402, 330, 1179, and 306 rollouts, respectively, while achieving up to 78 times greater sample efficiency. Furthermore, the combined GEPA+Merge approach out-performs GRPO by an even wider margin of 21% at a comparable rollout budget.
[0m

2025-11-05 20:17:12,754 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


Alita autonomously expands its capabilities through continuous integration of MCPs. It generates these MCPs and stores them for later use. This creates a cycle of enhancement. GEPA achieves better results than GRPO by using far fewer rollouts to reach similar validation scores, demonstrating greater sample efficiency.

In [19]:
from llama_index.core.indices.query.query_transform import (
    StepDecomposeQueryTransform
)
from llama_index.core.query_engine import MultiStepQueryEngine

transform = StepDecomposeQueryTransform(verbose=True)
multi_step_query_engine = MultiStepQueryEngine(
    query_engine = sub_question_query_engine,
    query_transform = transform,
    index_summary = "Answers questions relating to alita, gepa, and/or mcp zero."
)

In [20]:
response = multi_step_query_engine.query(query)
display(Markdown(response.response))

2025-11-05 20:17:21,853 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


[1;3;33m> Current query: Can you tell me how Alita and MCP Zero can interplay with each other? Also, how can GEPA perform better than GRPO even though it's a prompt engineering technique that does not rewrite the weights of the LLM?
[0m[1;3;38;5;200m> New query: How does MCP Zero interact with Alita?
[0m

2025-11-05 20:17:22,928 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


Generated 1 sub questions.
[1;3;38;2;237;90;200m[alita-gepa-mcpZero] Q: What is the interaction between Alita and MCP Zero?
[0m

2025-11-05 20:17:24,004 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-11-05 20:17:24,058 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 20:17:24,085 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 20:17:30,982 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


[1;3;38;2;237;90;200m[alita-gepa-mcpZero] A: Empty Response
[0m

2025-11-05 20:17:31,315 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-11-05 20:17:32,016 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


[1;3;33m> Current query: Can you tell me how Alita and MCP Zero can interplay with each other? Also, how can GEPA perform better than GRPO even though it's a prompt engineering technique that does not rewrite the weights of the LLM?
[0m[1;3;38;5;200m> New query: How does MCP Zero interact with Alita?
[0m

2025-11-05 20:17:33,121 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


Generated 1 sub questions.
[1;3;38;2;237;90;200m[alita-gepa-mcpZero] Q: What is the interaction between Alita and MCP Zero?
[0m

2025-11-05 20:17:34,331 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-11-05 20:17:34,387 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 20:17:34,414 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 20:17:41,863 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


[1;3;38;2;237;90;200m[alita-gepa-mcpZero] A: Empty Response
[0m

2025-11-05 20:17:42,187 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-11-05 20:17:42,961 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


[1;3;33m> Current query: Can you tell me how Alita and MCP Zero can interplay with each other? Also, how can GEPA perform better than GRPO even though it's a prompt engineering technique that does not rewrite the weights of the LLM?
[0m[1;3;38;5;200m> New query: How does MCP Zero interact with Alita?
[0m

2025-11-05 20:17:44,183 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


Generated 1 sub questions.
[1;3;38;2;237;90;200m[alita-gepa-mcpZero] Q: What is the interaction between Alita and MCP Zero?
[0m

2025-11-05 20:17:45,468 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-11-05 20:17:45,523 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 20:17:45,552 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-05 20:17:53,328 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


[1;3;38;2;237;90;200m[alita-gepa-mcpZero] A: Empty Response
[0m

2025-11-05 20:17:53,727 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-11-05 20:17:54,016 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


Empty Response


In [52]:
from llama_index.core.prompts import RichPromptTemplate
from llama_index.core.workflow import (
    Context,
    step,
    StartEvent,
    StopEvent,
    Event,
    Workflow
)
from pydantic import BaseModel
from typing import Annotated, Optional

check_answer_template = RichPromptTemplate("""
{% chat role = "user" %}
This is the original question: 
<question>
    {{ question }}
</question>

Here are the following questions we have asked:
<follow_up_questions>
    {{ follow_up_questions }}
</follow_up_questions>

Here is our current answer:
<answer>
    {{ answer }}
</answer>

Does the current answer address the original question? If not, generate
a follow-up question such that including the answer to this follow-up question
to the current answer we have so far answers the user's original question.
{% endchat %}
""")

class ShouldContinue(BaseModel):
    should_continue: bool
    reasoning: Annotated[str, "Whether the current answer answers the question"]

class ConsolidateEvent(Event):
    original_question: str
    current_response: str
    new_response: str
    follow_up_questions: list[Optional[str]]

class CheckAnswerEvent(Event):
    original_question: str
    follow_up_questions: list[Optional[str]]
    response: str

class ContinueEvent(Event):
    original_question: str
    current_answer: str
    follow_up_questions: list[Optional[str]]
    reason_to_continue: str

class AskQueryEvent(Event):
    query: str

In [29]:
llm = Settings.llm

In [None]:
class MultiStepRAG(Workflow):    
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.llm = Settings.llm        
    
    @step
    async def query_step(
        self, ctx: Context, ev: StartEvent | AskQueryEvent
    ) -> CheckAnswerEvent | ConsolidateEvent:
        
        query = ev.get('query')
        follow_up_questions = ev.get('follow_up_questions', [])
        response = sub_question_query_engine.query(query)
        current_response = await ctx.store.get("current_answer", None)
        
        if current_response:
            return ConsolidateEvent(
                original_question=query,
                current_response=current_response,
                new_response=response.response,
                follow_up_questions=follow_up_questions
            )
        
        else:
            await ctx.store.set("current_answer", current_response)
            return CheckAnswerEvent(
                original_question = query,
                follow_up_questions=follow_up_questions,
                response = response.response
            )
    
    @step
    async def consolidate_response(
        self, ctx: Context, ev: ConsolidateEvent
    ) -> CheckAnswerEvent:
        
        follow_up_questions = ev.get('follow_up_questions')
        original_question = ev.get('original_question')
        current_response = ev.get('current_response')
        new_response = ev.get('new_response')
        response = await self.llm.acomplete(
            f"""
            This is the question we're trying to answer: {original_question}
            
            Here is the answer we have so far: {current_response}
            
            This is an additional component to make our current answer more complete: {new_response}
            
            Generate a coherent answer based on our current answer and the additional component.
            """
        )
        await ctx.store.set("current_answer", response.text)
        print(current_response)
        print(original_question)
        print(follow_up_questions)
        
        return CheckAnswerEvent(
            original_question = query,
            follow_up_questions=follow_up_questions,
            response = response.text
        )
    
    @step
    async def check_answer_step(
        self, ctx: Context, ev: CheckAnswerEvent
    ) -> ContinueEvent | StopEvent:
        original_question = ev.get('original_question')
        follow_up_questions = ev.get('follow_up_questions')
        current_answer = ev.get("response")
        check_answer_template.format(
            question=original_question, 
            follow_up_questions = follow_up_questions,
            answer = current_answer
        )
        result = self.llm.structured_predict(ShouldContinue, check_answer_template)
        print(result)
        if result.should_continue:
            return ContinueEvent(
                current_answer = current_answer,
                original_question = original_question,
                follow_up_questions = follow_up_questions,
                reason_to_continue = result.reasoning
            )
        return StopEvent(result = current_answer)
    
    @step 
    async def generate_follow_up_question(
        self, ctx: Context, ev: ContinueEvent
    ) -> AskQueryEvent:
        original_question = ev.get("original_question")
        current_response = ev.get("current_answer")
        
        result = await llm.acomplete(
             f"""
            This is the question we're trying to answer: {original_question}
            
            Here is the answer we have so far: {current_response}
            
            We've not fully addressed the question yet. Generate a follow-up question
            so that the answer to this question will address the original question once
            combined with our current response.
            """
        )
        return AskQueryEvent(query = result.text)

In [43]:
from llama_index.utils.workflow import draw_all_possible_flows

draw_all_possible_flows(
    MultiStepRAG(),
    filename="multi_step_query_engine.html",
    # Optional, can limit long event names in your workflow
    # Can help with readability
    # max_label_length=10,
)

multi_step_query_engine.html


In [57]:
multi_step_query_engine = MultiStepRAG(timeout=6000)
ctx = Context(multi_step_query_engine)

In [58]:
result = await multi_step_query_engine.run(query=query)

2025-11-06 07:17:06,009 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


Generated 3 sub questions.
[1;3;38;2;237;90;200m[alita-gepa-mcpZero] Q: What is the relationship between Alita and MCP Zero?
[0m

2025-11-06 07:17:06,895 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-11-06 07:17:06,941 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-06 07:17:06,971 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-06 07:17:14,318 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-11-06 07:17:17,370 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


[1;3;38;2;237;90;200m[alita-gepa-mcpZero] A: Alita autonomously expands its capabilities through continuous MCP integration. It generates MCPs, which are then encapsulated and stored in the MCP Box for future reuse. These MCPs are created through a self-reinforcing cycle where Alita continuously integrates new MCPs, enhancing its overall capabilities.
[0m[1;3;38;2;90;149;237m[alita-gepa-mcpZero] Q: How do Alita and MCP Zero interact?
[0m

2025-11-06 07:17:18,273 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-11-06 07:17:18,327 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-06 07:17:18,356 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-06 07:17:57,887 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


[1;3;38;2;90;149;237m[alita-gepa-mcpZero] A: Empty Response
[0m[1;3;38;2;11;159;203m[alita-gepa-mcpZero] Q: How does GEPA improve GRPO performance?
[0m

2025-11-06 07:17:59,058 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-11-06 07:17:59,110 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-06 07:17:59,140 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-11-06 07:18:05,137 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-11-06 07:18:08,506 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


[1;3;38;2;11;159;203m[alita-gepa-mcpZero] A: GEPA achieves superior test set performance compared to GRPO on tasks like HotpotQA, IFBench, HoVer, and PUPA by requiring significantly fewer rollouts. Specifically, it matches GRPO‚Äôs best validation scores with 402, 330, 1179, and 306 rollouts, respectively, while achieving up to 78 times greater sample efficiency. Furthermore, the combined GEPA+Merge approach out-performs GRPO by an even wider margin of 21% at a comparable rollout budget.
[0m

2025-11-06 07:18:09,619 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-11-06 07:18:11,083 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


None
None
None
The current answer is empty. Therefore, it does not address the original question, which is missing. To address the original question, we need a follow-up question that prompts the user to provide an answer. A suitable follow-up question is: 'Please provide your answer to the original question.' 


ValidationError: 3 validation errors for ContinueEvent
original_question
  Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.12/v/string_type
current_answer
  Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.12/v/string_type
follow_up_questions
  Input should be a valid list [type=list_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.12/v/list_type

In [None]:
from llama_index.llms.openai import OpenAI

openai_query_engine = index.as_query_engine(similariry_top_k=20, llm=OpenAI(model="gpt-5-nano", temperature=0))
display(Markdown(response.response))

Here‚Äôs how the two topics fit together, based on the provided material:

- Interplay between Alita and MCP-Zero
  - What each does
    - MCP-Zero is a tool-discovery engine: it actively searches for existing tools and capabilities across resources, and invokes them when suitable. It focuses on maximizing tool discovery and reuse.
    - Alita is a generalist agent framework that evolves capabilities by generating and refining task-related model context protocols (MCPs) from open-source material. It aims to synthesize and reuse external capabilities with minimal upfront handcrafting.
  - How they work together
    - They form a complementary loop: first, MCP-Zero tries to find and invoke existing tools to tackle the agent‚Äôs tasks.
    - If no suitable tool is found, Alita‚Äôs workflow can be engaged to synthesize a new tool by generating a new MCP tailored to the task, effectively creating new capabilities.
    - The newly created tool (and its MCP) can then be registered and made available to the community, enriching the tool ecosystem for future tasks.
  - Why this is powerful
    - This pairing balances discovery and creation: MCP-Zero maximizes what already exists, while Alita drives scalable self-evolution by producing and integrating new tools via MCPs.
    - The combination supports broader generalization across domains: semantic grounding via MCPs helps clarify tool semantics, enabling reliable tool use and faster adaptation to new tasks.

- Why GEPA can beat GRPO without changing LLM weights
  - Core idea
    - GEPA is a reflective prompt evolution method that optimizes prompts (system-level instructions and tool-use guidance) rather than updating model weights. It leverages natural-language reflection to diagnose issues, propose prompt updates, and combine lessons from multiple attempts.
  - Why it can outperform weight-space RL (GRPO)
    - High sample efficiency: GEPA can achieve large performance gains with far fewer rollouts (up to 35x fewer) by learning mainly from improved prompts and reflections rather than policy updates.
    - Better use of feedback: GEPA uses a reflection-based process to generate high-quality, task-relevant learning signals from each rollout, guiding prompt evolution more effectively than scalar reward signals alone.
    - Diverse, Pareto-guided exploration: GEPA uses Pareto-based candidate sampling to maintain diversity among evolving prompts, avoiding local optima that can trap strategies that always pick the current best candidate.
    - Systematic prompt combination: The approach includes mutation and a system-aware merge step, which can combine complementary prompt strategies from different evolutionary lineages to produce stronger prompts.
    - Evidence across tasks/models: In experiments, GEPA and its variant GEPA+Merge outperformed GRPO by up to about 19% on some tasks, with substantial reductions in rollouts required, and often matched or exceeded GRPO‚Äôs best validation scores with far fewer learning signals.
  - Practical takeaway
    - The gains come from optimizing the prompts and the learning dynamics (how prompts are mutated, merged, and selected) rather than from changing LLM weights. This makes GEPA a highly sample-efficient way to improve downstream performance for complex, modular AI systems where prompts and system behavior are crucial.

If you want, I can summarize how to architect a system that combines Alita with MCP-Zero in a concrete workflow, and separately outline a GEPA-inspired prompt-evolution protocol you could pilot for a given task.