# 🦙 Using `MultiStepQueryEngine`: Tuning Context (k) vs. Hops with Arize Phoenix

**Objective:**
This notebook tunes the LlamaIndex `MultiStepQueryEngine` for complex reasoning tasks. We run a sensitivity analysis on a **subset of the HotpotQA dataset** to understand the trade-off between retrieval depth and the number of reasoning steps.

**Tools Used:**
* **Engine:** `MultiStepQueryEngine` (Sequential multi-hop reasoning)
* **Dataset:** HotpotQA (Wiki-based complex QA)
* **Observability:** Arize Phoenix (Trace analysis & Evaluation)

**The Experiment:**
We sweep through different values of `similarity_top_k` (context amount) and `num_steps` (reasoning depth) to find the "sweet spot" where the agent answers correctly without unnecessary latency.

**Results:**
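* **Model:** `llama3.2:3b`
* **Samples:** 30
* **Scores:** mean `CorrectnessEvaluator` score per configuration (your scores may vary). Rows are hops (`num_steps`); columns are `similarity_top_k`.

| Hops \ Top-k |    1 |    3 |    5 |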
|-------:|-----:|-----:|-----:|
|      1 | 2.50 | 3.67 | 4.03 |
|      2 | 3.00 | 3.90 | 3.07 |
|      3 | 2.90 | 3.23 | 3.53 |
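
To locate the "sweet spot" in a grid like this one, it can help to put the scores in long form and rank them. The snippet below is a minimal sketch that copies the nine values from the table above into a DataFrame with the same `Hops`, `Top-k`, and `Score` columns the grid search produces later in this notebook; the variable names are illustrative only.

```python
import pandas as pd

# Scores copied from the results table above, one row per configuration.
rows = [
    {"Hops": 1, "Top-k": 1, "Score": 2.50},
    {"Hops": 1, "Top-k": 3, "Score": 3.67},
    {"Hops": 1, "Top-k": 5, "Score": 4.03},
    {"Hops": 2, "Top-k": 1, "Score": 3.00},
    {"Hops": 2, "Top-k": 3, "Score": 3.90},
    {"Hops": 2, "Top-k": 5, "Score": 3.07},
    {"Hops": 3, "Top-k": 1, "Score": 2.90},
    {"Hops": 3, "Top-k": 3, "Score": 3.23},
    {"Hops": 3, "Top-k": 5, "Score": 3.53},
]
df = pd.DataFrame(rows)

# Row with the highest mean score.
best = df.loc[df["Score"].idxmax()]
best_hops, best_k = int(best["Hops"]), int(best["Top-k"])
print(f"Best: hops={best_hops}, top-k={best_k}, score={best['Score']:.2f}")

# Average over the other axis to see which knob moves the score more.
print(df.groupby("Top-k")["Score"].mean().round(2))
print(df.groupby("Hops")["Score"].mean().round(2))
```

On these numbers the highest-scoring cell is 1 hop with top-k 5 (4.03). With only 30 samples, small differences between cells should be read with caution.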

---

## 1. Installation and Setup

We begin by installing the necessary libraries. This includes:
* **`llama-index`**: The framework for building our Agentic RAG system.
* **`arize-phoenix`**: For observability, tracing, and evaluation.
* **`pandas`**: For analyzing the benchmark results.

In [None]:


!pip install llama-index llama-index-llms-google-genai llama-index-embeddings-google-genai llama-index-llms-ollama arize-phoenix openinference-instrumentation-llama-index pandas tabulate python-dotenv datasets nest_asyncio



In [None]:
import os
import argparse
import time
import pandas as pd
from datetime import datetime
from dotenv import load_dotenv

from datasets import load_dataset
from llama_index.core import Document, VectorStoreIndex, Settings
from llama_index.llms.google_genai import GoogleGenAI
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.google_genai import GoogleGenAIEmbedding
from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.core.query_engine import MultiStepQueryEngine
from llama_index.core.indices.query.query_transform import (
    StepDecomposeQueryTransform,
)

import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor

## 2. Configuration (API Keys)

The cell below reads the API keys from the Google Colab Secrets feature. Outside Colab, load them from a `.env` file instead.


In [None]:
# --- CONFIGURATION ---

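# Outside Colab: uncomment these lines to read the keys from a local .env file
# load_dotenv()
# GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
# HF_TOKEN = os.getenv("HF_TOKEN")

# In Colab: read the keys from Colab Secrets
# (comment out this import and the two lines below if using load_dotenv)
from google.colab import userdata

GEMINI_API_KEY = userdata.get("GEMINI_API_KEY")
HF_TOKEN = userdata.get("HF_TOKEN")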

## 3. Helper Function for embedding



In [None]:
def configure_global_settings():
    """
    Sets up the Embedding model globally.
    We use Google Embeddings for ALL runs to ensure retrieval quality is consistent.
    """
    print("🌍 Setting up Global Embeddings (Google text-embedding-004)...")
    Settings.embed_model = GoogleGenAIEmbedding(
        model_name="models/text-embedding-004", api_key=GEMINI_API_KEY
    )

## 4. Function for prepping the data

* Uses the HotpotQA dataset from Hugging Face.
* The dataset is suited to multi-hop reasoning: unlike standard QA tasks, where the answer is contained in a single document, each question requires combining facts from several documents.

In [None]:
# --- DATA PREP (Run Once) ---
def load_and_prep_data(num_samples=5):
    """Loads HotpotQA data and returns Documents and Ground Truth list."""
    if str(num_samples).upper() == "ALL":
        print("📥 Downloading ALL samples...")
        split = "validation"
    else:
        print(f"📥 Downloading top {num_samples} samples...")
        split = f"validation[:{num_samples}]"

    dataset = load_dataset("hotpot_qa", "distractor", split=split)
    documents = []
    ground_truth = []

    for row in dataset:
        titles = row["context"]["title"]
        sentences_list = row["context"]["sentences"]
        for title, sentences in zip(titles, sentences_list):
            doc = Document(
                text=f"Title: {title}\nContent: {''.join(sentences)}",
                metadata={"title": title, "question_id": row["id"]},
            )
            documents.append(doc)
        ground_truth.append(
            {"question": row["question"], "reference_answer": row["answer"]}
        )

    print(
        f"✅ Loaded {len(documents)} docs from {len(ground_truth)} questions."
    )
    return documents, ground_truth
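
To make the flattening step concrete without downloading anything, here is a minimal sketch that applies the same logic to one hand-written row. The row contents are hypothetical placeholders; only the field names (`id`, `question`, `answer`, `context.title`, `context.sentences`) are taken from the function above. Plain dicts stand in for LlamaIndex `Document` objects so the snippet runs on its own.

```python
# Hypothetical row with the same fields that load_and_prep_data reads.
row = {
    "id": "example-0",
    "question": "Placeholder question that needs facts from both articles?",
    "answer": "placeholder answer",
    "context": {
        "title": ["Article A", "Article B"],
        "sentences": [
            ["First sentence of A.", " Second sentence of A."],
            ["First sentence of B.", " Second sentence of B."],
        ],
    },
}


def flatten_row(row):
    """Turn one row into per-article documents plus one ground-truth entry."""
    docs = []
    titles = row["context"]["title"]
    sentences_list = row["context"]["sentences"]
    for title, sentences in zip(titles, sentences_list):
        docs.append(
            {
                # Same text layout as the Document built in the notebook.
                "text": f"Title: {title}\nContent: {''.join(sentences)}",
                "metadata": {"title": title, "question_id": row["id"]},
            }
        )
    truth = {"question": row["question"], "reference_answer": row["answer"]}
    return docs, truth


docs, truth = flatten_row(row)
print(len(docs), "documents for 1 question")
print(docs[0]["text"])
```

Each question contributes several documents, which is why the run output further down reports 10 docs for a single question.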

## 5. Helper Function to initialize LLM instance

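`get_student_llm` supports two backends, selected by `model_name`: a Google Gemini model (any name containing `gemini`) or a local Ollama model (anything else). The 30-sample results at the top of this notebook were produced with:

`model_name = "llama3.2:3b"`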


In [None]:
# --- DYNAMIC LLM FACTORY ---
def get_student_llm(model_name):
    """
    Creates a FRESH instance of the LLM for every single question.
    This ensures no 'context' or 'memory' leaks between benchmark questions.
    """
    # Option A: Google Gemini
    if "gemini" in model_name.lower():
        return GoogleGenAI(
            model=model_name, temperature=0, api_key=GEMINI_API_KEY
        )

    # Option B: Local Ollama
    else:
        # request_timeout=300 protects against slow local generations
        return Ollama(model=model_name, request_timeout=300.0)

## 6. Function to execute a full evaluation pass on all data samples.






In [None]:
# --- BENCHMARK ENGINE ---
def run_benchmark_iteration(index, data_samples, k, hops, model_name):
    """
    Runs a benchmark pass where the LLM is reset for every question.
    """
    print(f"\n🧪 STARTING RUN: k={k} | hops={hops} | model={model_name}")

    # Initialize Judge ONCE
    judge_llm = GoogleGenAI(
        model="models/gemini-2.5-pro", temperature=0.0, api_key=GEMINI_API_KEY
    )
    evaluator = CorrectnessEvaluator(llm=judge_llm)

    scores = []

    for i, sample in enumerate(data_samples):
        try:
            # A. FRESH LLM INSTANCE
            student_llm = get_student_llm(model_name)

            # --- FIX: SET GLOBAL SETTINGS FOR THIS ITERATION ---
            Settings.llm = student_llm
            # ---------------------------------------------------

            # B. Configure Base Engine (k)
            base_query_engine = index.as_query_engine(
                similarity_top_k=k, llm=student_llm
            )

            # C. Configure Logic (Hops)
            if hops > 1:
                step_decompose = StepDecomposeQueryTransform(
                    llm=student_llm, verbose=True
                )
                query_engine = MultiStepQueryEngine(
                    query_engine=base_query_engine,
                    query_transform=step_decompose,
                    index_summary="Contains detailed encyclopedia articles about specific people, places, and things. RESTRICTION: This search engine CANNOT compare items. You must query for one entity at a time. If the user asks for a comparison, break it down into single questions about each entity.",
                    num_steps=hops,
                    early_stopping=True,
                )
            else:
                query_engine = base_query_engine

            # D. Run Query
            response = query_engine.query(sample["question"])

            # E. Evaluate
            eval_result = evaluator.evaluate(
                query=sample["question"],
                response=response.response,
                reference=sample["reference_answer"],
            )
            scores.append(eval_result.score)

            # F. Rate Limit Safety
            time.sleep(1.0)

        except Exception as e:
            print(f"  ❌ Error on sample {i}: {e}")
            scores.append(0.0)

    avg_score = sum(scores) / len(scores) if scores else 0
    print(f"🏁 END RUN: Score: {avg_score:.2f}")
    return avg_score
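
The loop above records `0.0` for any sample that raises an exception, so a failed API call and a wrong answer look identical in the average. The sketch below shows one possible way to keep the two apart. It is not part of the benchmark: `summarize_scores` is a hypothetical helper, and using `None` to mark a failed sample is an assumption that differs from the `0.0` used above.

```python
def summarize_scores(scores):
    """Summarize per-sample scores, where None marks a failed sample.

    Returns the mean with failures counted as 0 (the notebook's behaviour),
    the mean over answered samples only, and the failure count.
    """
    valid = [s for s in scores if s is not None]
    failures = len(scores) - len(valid)
    mean_with_failures = sum(valid) / len(scores) if scores else 0.0
    mean_answered = sum(valid) / len(valid) if valid else 0.0
    return {
        "mean_with_failures": mean_with_failures,
        "mean_answered": mean_answered,
        "failures": failures,
    }


# Hypothetical scores: three judged answers and one failed sample.
summary = summarize_scores([4.0, None, 5.0, 2.0])
print(summary)
```

Reporting the failure count next to the score makes it easier to tell whether a low cell in the grid reflects poor answers or timeouts and rate limits.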

## 7. Function that loops through k and hops (nested)

Each combination is traced to its own Arize Phoenix project.


In [None]:
# --- GRID SEARCH CONTROLLER ---
def run_grid_search(index, data_samples, max_k, max_hops, model_name):
    """
    Grid search loop. Note: We now pass in the index and data,
    so we don't reload them every time.
    """
    k_values = range(1, max_k + 1, 2)
    hop_values = range(1, max_hops + 1)

    all_results = []
    total_combinations = len(k_values) * len(hop_values)

    print(
        f"\n🚀 Starting Grid Search over {total_combinations} combinations..."
    )

    for hops in hop_values:
        for k in k_values:
            # 1. DEFINE DYNAMIC PROJECT NAME
            current_project = datetime.now().strftime(
                f"run-grid-k{k}-hops{hops}-%Y-%m-%d-%H-%M"
            )
            print(f"🔄 Switching Phoenix Project to: {current_project}")

            # 2. RESET INSTRUMENTATION
            LlamaIndexInstrumentor().uninstrument()

            # 3. REGISTER NEW TRACER
            tracer_provider = register(
                project_name=current_project,
                endpoint="http://localhost:6006/v1/traces",
            )

            # 4. RE-INSTRUMENT
            LlamaIndexInstrumentor().instrument(
                tracer_provider=tracer_provider
            )

            # 5. RUN ITERATION
            score = run_benchmark_iteration(
                index, data_samples, k, hops, model_name
            )

            all_results.append({"Hops": hops, "Top-k": k, "Score": score})

    # Create DataFrame
    df = pd.DataFrame(all_results)

    print("\n🏆 --- FINAL GRID SEARCH RESULTS ---")
    if not df.empty:
        pivot_df = df.pivot(index="Hops", columns="Top-k", values="Score")
        print(pivot_df.to_markdown(floatfmt=".2f"))

    csv_filename = f"grid_results_{model_name.replace(':','-')}.csv"
    df.to_csv(csv_filename, index=False)
    print(f"\n💾 Results saved to '{csv_filename}'")
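
To see which configurations a given `max_k` and `max_hops` will produce before starting a long run, the grid can be enumerated on its own. This is a small sketch using the same `range` expressions and project-name format as `run_grid_search`; the example values and the fixed timestamp are assumptions chosen for reproducibility.

```python
from datetime import datetime
from itertools import product

max_k, max_hops = 5, 3  # example values; the notebook passes these in

# Same ranges as run_grid_search: k steps by 2, hops steps by 1.
k_values = range(1, max_k + 1, 2)
hop_values = range(1, max_hops + 1)

# Hops is the outer loop and k the inner loop, as in the notebook.
grid = list(product(hop_values, k_values))
print(f"{len(grid)} combinations: {grid}")

# Fixed timestamp instead of datetime.now() so the names are reproducible.
now = datetime(2025, 12, 16, 20, 4)
project_names = [
    now.strftime(f"run-grid-k{k}-hops{hops}-%Y-%m-%d-%H-%M")
    for hops, k in grid
]
print(project_names[0])
```

Because `k` advances in steps of 2 starting from 1, only odd values are tested: `max_k=4` yields `k` in `[1, 3]`, not `[1, 2, 3, 4]`.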


The class below stands in for the command-line arguments a script would normally receive.

Replace the values with the parameters of your choice.


In [None]:
# This is to simulate arguments
class args:
    mode = "single"
    k = 1
    hops = 1
    num_samples = "1"
    model = "gemini-2.5-flash-lite"
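
If this notebook is converted to a script, the same settings could come from `argparse` instead of the stand-in class. The sketch below is one possible layout, not the author's original CLI: the flag names mirror the attributes of `args`, and the `"grid"` choice for `--mode` is an assumption. The result is stored in `cli_args` so it does not overwrite the `args` class above.

```python
import argparse


def build_parser():
    """Build a parser whose flags mirror the attributes of the args class."""
    parser = argparse.ArgumentParser(
        description="Benchmark top-k vs. hops for MultiStepQueryEngine."
    )
    # "grid" is an assumed second mode; the notebook only shows "single".
    parser.add_argument("--mode", choices=["single", "grid"], default="single")
    parser.add_argument("--k", type=int, default=1)
    parser.add_argument("--hops", type=int, default=1)
    # Kept as a string so that "ALL" remains a valid value.
    parser.add_argument("--num_samples", default="1")
    parser.add_argument("--model", default="gemini-2.5-flash-lite")
    return parser


# Passing an explicit list avoids reading Jupyter's own sys.argv.
cli_args = build_parser().parse_args(["--mode", "grid", "--k", "5", "--hops", "3"])
print(cli_args)
```

In a script, calling `parse_args()` with no argument reads the real command line instead.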


We import `nest_asyncio` to work around a known issue with the async event loop that Jupyter already runs.

LlamaIndex and Arize Phoenix both use async code, so without `nest_asyncio` they would raise an error when they try to start a nested event loop.
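
The underlying error can be reproduced with the standard library alone. The following sketch is meant to be run as a plain Python script, not inside a notebook cell, because a notebook already has a running loop. It starts an event loop and then tries to call `asyncio.run()` from inside it, which is the situation library code ends up in under Jupyter.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    """Try to start a second event loop while one is already running."""
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: a loop is already running here
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that such nested calls are allowed, which is why the next cell calls it before anything else.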


In [None]:
import nest_asyncio

# This allows nested event loops (fixes the "asyncio.run()" error)
nest_asyncio.apply()

session = px.launch_app()
# if you want to see the Arize Phoenix Dashboard
# session = px.active_session()
# session.view()

# --- A. CONSOLIDATED SETUP (RUNS ONCE FOR BOTH MODES) ---
configure_global_settings()

# Load Data & Prep Index
# This is now done exactly once, regardless of mode.
documents, data_samples = load_and_prep_data(args.num_samples)


print("\n⚙️  Building Master Index (Reused for all runs)...")
index = VectorStoreIndex.from_documents(documents)



🌍 To view the Phoenix app in your browser, visit https://0rhnsdbge17a5-496ff2e9c6d22116-6006-colab.googleusercontent.com/
📖 For more information on how to use Phoenix, check out https://arize.com/docs/phoenix
🌍 Setting up Global Embeddings (Google text-embedding-004)...
📥 Downloading top 1 samples...
✅ Loaded 10 docs from 1 questions.

⚙️  Building Master Index (Reused for all runs)...


Run the grid search over every k × hops combination.

In [None]:
# Pass the pre-built index and data to the grid runner
run_grid_search(
    index,
    data_samples,
    max_k=args.k,
    max_hops=args.hops,
    model_name=args.model,
)




🚀 Starting Grid Search over 1 combinations...
🔄 Switching Phoenix Project to: run-grid-k1-hops1-2025-12-16-20-04
🔭 OpenTelemetry Tracing Details 🔭
|  Phoenix Project: run-grid-k1-hops1-2025-12-16-20-04
|  Span Processor: SimpleSpanProcessor
|  Collector Endpoint: http://localhost:6006/v1/traces
|  Transport: HTTP + protobuf
|  Transport Headers: {}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.


🧪 STARTING RUN: k=1 | hops=1 | model=gemini-2.5-flash-lite
🏁 END RUN: Score: 2.00

🏆 --- FINAL GRID SEARCH RESULTS ---
|   Hops |    1 |
|-------:|-----:|
|      1 | 2.00 |

💾 Results saved to 'grid_results_gemini-2.5-flash-lite.csv'




Execute a single run



In [None]:
current_project = datetime.now().strftime(
    f"run-single-k{args.k}-hops{args.hops}-%Y-%m-%d-%H-%M"
)

tracer_provider = register(
    project_name=current_project, endpoint="http://localhost:6006/v1/traces"
)

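# Reset instrumentation first, as in the grid search, so that traces go to
# the new project even if an earlier cell already instrumented LlamaIndex.
LlamaIndexInstrumentor().uninstrument()
LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)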

# Run the single iteration
run_benchmark_iteration(index, data_samples, args.k, args.hops, args.model)



🔭 OpenTelemetry Tracing Details 🔭
|  Phoenix Project: run-single-k1-hops1-2025-12-16-20-04
|  Span Processor: SimpleSpanProcessor
|  Collector Endpoint: http://localhost:6006/v1/traces
|  Transport: HTTP + protobuf
|  Transport Headers: {}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.


🧪 STARTING RUN: k=1 | hops=1 | model=gemini-2.5-flash-lite
🏁 END RUN: Score: 2.00


2.0