# Querying Basics: Advanced Query Techniques

This notebook explores advanced querying capabilities in LlamaIndex, including streaming, async operations, query transformations, and response evaluation.

## Learning Objectives

By the end of this notebook, you will:
1. Implement streaming responses for better UX
2. Use async queries for performance
3. Apply query transformations for better retrieval
4. Evaluate response quality
5. Handle different query types

---

In [None]:
# Setup
import nest_asyncio
nest_asyncio.apply()

import asyncio
from dotenv import load_dotenv
load_dotenv()

from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    Settings,
    StorageContext,
    load_index_from_storage,
)
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Configure
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0.1)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

print("✓ Setup complete!")

In [None]:
# Load or create index
import os

PERSIST_DIR = "./storage/query_basics_index"

if os.path.exists(PERSIST_DIR):
    print("Loading existing index...")
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)
else:
    print("Building new index...")
    documents = SimpleDirectoryReader("../data/sample_docs").load_data()
    index = VectorStoreIndex.from_documents(documents, show_progress=True)
    index.storage_context.persist(persist_dir=PERSIST_DIR)

print("✓ Index ready!")

## 1. Streaming Responses

Streaming provides a better user experience by showing the response as it's generated, rather than waiting for the complete response.

### Why Streaming?
- **Perceived speed**: Users see content immediately
- **Early feedback**: Users can cancel a response that is heading in the wrong direction
- **Better UX**: Feels more interactive and responsive

In [None]:
# Create a streaming query engine
streaming_query_engine = index.as_query_engine(
    streaming=True,
    similarity_top_k=3,
)

# Query with streaming
query = "Explain the different types of machine learning with examples."
print(f"Query: {query}\n")
print("Response (streaming):")
print("-" * 40)

streaming_response = streaming_query_engine.query(query)

# Print tokens as they arrive
for text in streaming_response.response_gen:
    print(text, end="", flush=True)

print("\n" + "-" * 40)
print("\n✓ Streaming complete!")

In [None]:
# Access source nodes after streaming completes
print("\nSource nodes used:")
for i, node in enumerate(streaming_response.source_nodes):
    print(f"  {i+1}. Score: {node.score:.4f}")
    print(f"     Preview: {node.text[:100]}...")
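The early-feedback benefit comes from the fact that the response is a generator: the consumer can stop iterating at any point and keep what it has received so far. The sketch below illustrates that pattern with the standard library only. `fake_token_stream` is a hypothetical stand-in for `streaming_response.response_gen`, and the `max_chars` cutoff is an arbitrary stopping rule chosen for illustration.

```python
from typing import Iterator


def fake_token_stream() -> Iterator[str]:
    """Hypothetical stand-in for `streaming_response.response_gen`."""
    for token in ["Machine ", "learning ", "has ", "three ", "main ", "types."]:
        yield token


def consume(stream: Iterator[str], max_chars: int) -> str:
    """Print tokens as they arrive and stop once `max_chars` is reached."""
    chunks = []
    total = 0
    for token in stream:
        print(token, end="", flush=True)
        chunks.append(token)
        total += len(token)
        if total >= max_chars:
            break  # stop reading; the remaining tokens are never requested
    print()
    return "".join(chunks)


# Stop early after roughly 20 characters, then read a full stream for comparison
partial = consume(fake_token_stream(), max_chars=20)
full = consume(fake_token_stream(), max_chars=10_000)
```

In a real application the stopping rule would more likely be a user action (such as a cancel button) than a character count.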

## 2. Async Queries

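Async queries let you run multiple queries concurrently. Because each query spends most of its time waiting on embedding and LLM API calls, overlapping those waits can noticeably reduce total wall-clock time for batch operations.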

In [None]:
import time

# Standard query engine
query_engine = index.as_query_engine(similarity_top_k=3)

# Questions to ask
questions = [
    "What is artificial intelligence?",
    "What are the main Python data types?",
    "Explain supervised learning.",
    "What is deep learning?",
]

# Synchronous approach (sequential)
print("Running queries SEQUENTIALLY...")
start_time = time.time()

sync_results = []
for q in questions:
    response = query_engine.query(q)
    sync_results.append(str(response))

sync_time = time.time() - start_time
print(f"Sequential time: {sync_time:.2f} seconds")

In [None]:
# Async approach (concurrent)
print("\nRunning queries CONCURRENTLY (async)...")
start_time = time.time()

async def run_async_queries():
    # Create async tasks for all queries
    tasks = [query_engine.aquery(q) for q in questions]
    # Run all concurrently
    results = await asyncio.gather(*tasks)
    return results

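# asyncio.run() works inside a running notebook loop only because
# nest_asyncio.apply() was called in the setup cell
async_results = asyncio.run(run_async_queries())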

async_time = time.time() - start_time
print(f"Async time: {async_time:.2f} seconds")

# Calculate speedup
speedup = sync_time / async_time
print(f"\nSpeedup: {speedup:.2f}x faster with async!")

In [None]:
# Display async results
print("\nAsync Query Results:")
print("=" * 60)

for q, result in zip(questions, async_results):
    print(f"\nQ: {q}")
    print(f"A: {str(result)[:200]}...")
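Launching every query at once can run into API rate limits when the batch is large. One common way to cap the number of in-flight requests is an `asyncio.Semaphore`. The sketch below shows the pattern with the standard library only; `fake_aquery` is a hypothetical stand-in for `query_engine.aquery` that sleeps instead of calling an API, and the limit of 2 is an arbitrary value for illustration.

```python
import asyncio


async def fake_aquery(question: str) -> str:
    """Hypothetical stand-in for `query_engine.aquery`; sleeps instead of calling an API."""
    await asyncio.sleep(0.05)
    return f"answer to: {question}"


async def run_bounded(questions: list[str], max_concurrent: int) -> list[str]:
    """Run all questions concurrently, with at most `max_concurrent` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(q: str) -> str:
        async with semaphore:  # wait here until a slot is free
            return await fake_aquery(q)

    # gather() returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(q) for q in questions))


sample_questions = [f"question {i}" for i in range(6)]
bounded_results = asyncio.run(run_bounded(sample_questions, max_concurrent=2))
```

Inside this notebook, `asyncio.run` relies on the `nest_asyncio.apply()` call from the setup cell; in a plain script it works as written.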

## 3. Query Transformations

Query transformations can improve retrieval by reformulating the user's query. Common techniques:

| Transformation | Description | Use Case |
|---------------|-------------|----------|
| HyDE | Generate hypothetical answer, use for retrieval | Complex queries |
| Query Decomposition | Break into sub-queries | Multi-part questions |
| Query Expansion | Add related terms | Improve recall |
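To make the last row of the table concrete, here is a minimal sketch of query expansion in plain Python. It is not a LlamaIndex API: `RELATED_TERMS` is a hypothetical, hand-written lookup table, and a real system might generate related terms with an LLM or a thesaurus instead.

```python
# Hypothetical lookup table of related terms (an assumption for illustration)
RELATED_TERMS = {
    "ml": ["machine learning"],
    "nn": ["neural network"],
    "errors": ["exceptions"],
}


def expand_query(query: str) -> str:
    """Append related terms for any known keyword found in the query."""
    extra = []
    for word in query.lower().replace("?", "").split():
        for term in RELATED_TERMS.get(word, []):
            if term not in extra:
                extra.append(term)
    # Leave the query untouched when no keyword matches
    return query if not extra else f"{query} ({', '.join(extra)})"


expanded = expand_query("How does Python handle errors?")
unchanged = expand_query("What is Python?")
print(expanded)
print(unchanged)
```

The expanded string would then be passed to the query engine in place of the original, which gives the retriever more terms to match against.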

In [None]:
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine

# HyDE: Hypothetical Document Embeddings
# Generates a hypothetical answer, then uses that for retrieval

hyde_transform = HyDEQueryTransform(include_original=True)
base_query_engine = index.as_query_engine(similarity_top_k=3)

hyde_query_engine = TransformQueryEngine(
    query_engine=base_query_engine,
    query_transform=hyde_transform,
)

# Compare results
query = "How do neural networks learn patterns in data?"

print("Query:", query)
print("\n" + "=" * 50)
print("Without HyDE:")
print("=" * 50)
base_response = base_query_engine.query(query)
print(base_response)

print("\n" + "=" * 50)
print("With HyDE:")
print("=" * 50)
hyde_response = hyde_query_engine.query(query)
print(hyde_response)
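The intuition behind HyDE is that a hypothetical answer usually shares more vocabulary with the relevant documents than the question does. The toy sketch below illustrates this with the standard library only. Everything in it is an assumption made for illustration: `embed` is a bag-of-words counter rather than a real embedding model, `fake_hypothetical_answer` returns a canned string instead of calling an LLM, and the two documents are made up.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (stands in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def fake_hypothetical_answer(query: str) -> str:
    """Hypothetical stand-in for the LLM call that drafts an answer."""
    return "neural networks learn by backpropagation adjusting weights with gradients to reduce loss"


docs = [
    "backpropagation adjusts network weights using gradients of the loss",
    "python lists are ordered mutable sequences",
]

query = "how do neural networks learn patterns in data"
hypothetical = fake_hypothetical_answer(query)

# Score the relevant document against the raw query and against the hypothetical answer
raw_score = cosine(embed(query), embed(docs[0]))
hyde_score = cosine(embed(hypothetical), embed(docs[0]))
best_doc = max(docs, key=lambda d: cosine(embed(hypothetical), embed(d)))
print(f"raw query score: {raw_score:.3f}, hypothetical answer score: {hyde_score:.3f}")
```

With real embeddings the gap is less extreme than in this word-overlap toy, and a poor hypothetical answer can also steer retrieval in the wrong direction, which is one reason `include_original=True` keeps the original query in the mix.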

### Step Decomposition

Another technique is step decomposition: the LLM rewrites a complex question into a simpler sub-question that can be answered first, and that answer can then inform the next step:

In [None]:
from llama_index.core.indices.query.query_transform.base import StepDecomposeQueryTransform

# Step decomposition breaks complex queries into steps
step_decompose_transform = StepDecomposeQueryTransform(
    llm=Settings.llm,
    verbose=True,
)

# This shows how the query is transformed
complex_query = "How does Python's object-oriented programming relate to AI development?"

transformed = step_decompose_transform.run(complex_query)
print(f"\nOriginal: {complex_query}")
# run() returns a QueryBundle; query_str holds the rewritten question
print(f"Transformed: {transformed.query_str}")

## 4. Different Query Types

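The same query engine can serve several kinds of questions. What changes from one kind to the next is how much the LLM has to combine or reason over the retrieved chunks: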

In [None]:
# Setup for demonstrations
query_engine = index.as_query_engine(similarity_top_k=3)

# Query Type 1: Factual Questions
print("=" * 60)
print("FACTUAL QUERY")
print("=" * 60)
factual_q = "What year was Python created?"
print(f"Q: {factual_q}")
print(f"A: {query_engine.query(factual_q)}")

In [None]:
# Query Type 2: Comparative Questions
print("=" * 60)
print("COMPARATIVE QUERY")
print("=" * 60)
comparative_q = "What are the differences between supervised and unsupervised learning?"
print(f"Q: {comparative_q}")
print(f"A: {query_engine.query(comparative_q)}")

In [None]:
# Query Type 3: Analytical Questions
print("=" * 60)
print("ANALYTICAL QUERY")
print("=" * 60)
analytical_q = "Why is Python popular for machine learning and data science?"
print(f"Q: {analytical_q}")
print(f"A: {query_engine.query(analytical_q)}")

In [None]:
# Query Type 4: Synthesis Questions (combining multiple sources)
print("=" * 60)
print("SYNTHESIS QUERY")
print("=" * 60)
synthesis_q = "How can Python and AI be combined to create intelligent applications?"
print(f"Q: {synthesis_q}")
print(f"A: {query_engine.query(synthesis_q)}")

## 5. Response Evaluation

LlamaIndex provides built-in evaluators to assess response quality:

In [None]:
from llama_index.core.evaluation import (
    FaithfulnessEvaluator,
    RelevancyEvaluator,
)

# Initialize evaluators
faithfulness_evaluator = FaithfulnessEvaluator(llm=Settings.llm)
relevancy_evaluator = RelevancyEvaluator(llm=Settings.llm)

print("✓ Evaluators ready!")

In [None]:
# Make a query and evaluate the response
query = "What are the key principles of object-oriented programming in Python?"
response = query_engine.query(query)

print(f"Query: {query}")
print(f"\nResponse: {response}\n")

# Evaluate faithfulness (is the response faithful to the source documents?)
print("Evaluating response quality...")
faithfulness_result = faithfulness_evaluator.evaluate_response(response=response)
relevancy_result = relevancy_evaluator.evaluate_response(
    query=query,
    response=response,
)

print("\n=== Evaluation Results ===")
print(f"Faithfulness: {'PASS' if faithfulness_result.passing else 'FAIL'}")
print(f"  - Score: {faithfulness_result.score}")
print(f"  - Feedback: {faithfulness_result.feedback}")
print(f"\nRelevancy: {'PASS' if relevancy_result.passing else 'FAIL'}")
print(f"  - Score: {relevancy_result.score}")
print(f"  - Feedback: {relevancy_result.feedback}")

### Batch Evaluation

In [None]:
# Evaluate multiple queries
test_queries = [
    "What is machine learning?",
    "How does Python handle errors?",
    "What are neural networks?",
]

evaluation_results = []

for query in test_queries:
    response = query_engine.query(query)
    
    faithfulness = faithfulness_evaluator.evaluate_response(response=response)
    relevancy = relevancy_evaluator.evaluate_response(query=query, response=response)
    
    evaluation_results.append({
        "query": query,
        "faithfulness": faithfulness.passing,
        "relevancy": relevancy.passing,
        "response_length": len(str(response)),
    })

# Display results as table
import pandas as pd

df = pd.DataFrame(evaluation_results)
print("\n=== Batch Evaluation Results ===")
print(df.to_string(index=False))

# Summary
print("\nSummary:")
print(f"  Faithfulness pass rate: {df['faithfulness'].mean()*100:.0f}%")
print(f"  Relevancy pass rate: {df['relevancy'].mean()*100:.0f}%")
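Pass rates alone do not show which queries need attention. The sketch below computes a pass rate and lists the failing queries in plain Python, without pandas. The records mirror the shape of `evaluation_results` above, but the values in `sample_results` are made up for illustration and are not real evaluation output.

```python
# Illustrative records in the same shape as `evaluation_results`; values are made up
sample_results = [
    {"query": "What is machine learning?", "faithfulness": True, "relevancy": True},
    {"query": "How does Python handle errors?", "faithfulness": True, "relevancy": False},
    {"query": "What are neural networks?", "faithfulness": False, "relevancy": True},
]


def pass_rate(results, key):
    """Fraction of records where `key` is truthy; 0.0 for an empty list."""
    return sum(1 for r in results if r[key]) / len(results) if results else 0.0


def failing_queries(results):
    """Queries that failed at least one of the two checks."""
    return [r["query"] for r in results if not (r["faithfulness"] and r["relevancy"])]


faith_rate = pass_rate(sample_results, "faithfulness")
needs_review = failing_queries(sample_results)
print(f"Faithfulness pass rate: {faith_rate:.0%}")
print("Needs review:", needs_review)
```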

## 6. Handling Edge Cases

What happens when the query can't be answered from the documents?

In [None]:
# Query about something not in our documents
out_of_scope_query = "What is the capital of France?"

response = query_engine.query(out_of_scope_query)

print(f"Query: {out_of_scope_query}")
print(f"Response: {response}")
print(f"\nSource nodes retrieved: {len(response.source_nodes)}")

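# The retriever still returns the top-k nearest chunks even when none are relevant,
# so the LLM may answer from unrelated context or from its own prior knowledge.
# This is where custom prompts help!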

In [None]:
from llama_index.core import PromptTemplate

# Better handling of out-of-scope queries with custom prompt
careful_prompt = PromptTemplate(
    """You are a helpful assistant. Answer the question based ONLY on the following context.
If the context does not contain information to answer the question, respond with:
"I cannot answer this question based on the available documents."

Context:
{context_str}

Question: {query_str}

Answer:"""
)

careful_query_engine = index.as_query_engine(
    similarity_top_k=3,
    text_qa_template=careful_prompt,
)

# Try again with careful prompt
response = careful_query_engine.query(out_of_scope_query)
print(f"Query: {out_of_scope_query}")
print(f"Response with careful prompt: {response}")
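A prompt is not the only guard. A complementary approach is to drop retrieved chunks whose similarity score falls below a cutoff and refuse when nothing survives. The sketch below shows that logic in plain Python; the chunk texts, scores, and the 0.75 cutoff are hypothetical values, and a suitable cutoff depends on the embedding model. LlamaIndex provides a `SimilarityPostprocessor` with a `similarity_cutoff` parameter for the filtering step; check the documentation for your installed version before relying on it.

```python
REFUSAL = "I cannot answer this question based on the available documents."


def filter_by_score(scored_chunks, cutoff):
    """Keep only (text, score) pairs whose score meets the cutoff."""
    return [
        (text, score)
        for text, score in scored_chunks
        if score is not None and score >= cutoff
    ]


def answer_or_refuse(scored_chunks, cutoff=0.75):
    """Refuse when no chunk is similar enough to the query."""
    kept = filter_by_score(scored_chunks, cutoff)
    if not kept:
        return REFUSAL
    # In a real pipeline the kept chunks would be sent to the LLM here
    return f"{len(kept)} chunk(s) passed the cutoff"


# Hypothetical retrieval results: (chunk text, similarity score)
off_topic = [("Python was created by Guido van Rossum.", 0.71), ("Supervised learning uses labels.", 0.69)]
on_topic = [("Neural networks learn by adjusting weights.", 0.88), ("Python lists are mutable.", 0.70)]

print(answer_or_refuse(off_topic))
print(answer_or_refuse(on_topic))
```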

## 7. Query Logging and Debugging

For production systems, logging queries and responses is essential:

In [None]:
from datetime import datetime
import json
import time  # used by QueryLogger.query; imported here so the cell runs on its own

class QueryLogger:
    """Simple query logger for debugging and analytics."""
    
    def __init__(self, query_engine):
        self.query_engine = query_engine
        self.logs = []
    
    def query(self, question: str) -> str:
        start_time = time.time()
        
        response = self.query_engine.query(question)
        
        elapsed = time.time() - start_time
        
        # A node's score can be None, so fall back to 0.0 to keep the stats numeric
        top_node = response.source_nodes[0] if response.source_nodes else None
        
        log_entry = {
            "timestamp": datetime.now().isoformat(),
            "query": question,
            "response_length": len(str(response)),
            "num_sources": len(response.source_nodes),
            "top_score": (top_node.score or 0.0) if top_node else 0.0,
            "elapsed_seconds": round(elapsed, 2),
        }
        
        self.logs.append(log_entry)
        
        return str(response)
    
    def get_stats(self) -> dict:
        if not self.logs:
            return {"message": "No queries logged yet"}
        
        return {
            "total_queries": len(self.logs),
            "avg_response_time": sum(l["elapsed_seconds"] for l in self.logs) / len(self.logs),
            "avg_sources_used": sum(l["num_sources"] for l in self.logs) / len(self.logs),
            "avg_top_score": sum(l["top_score"] for l in self.logs) / len(self.logs),
        }
    
    def export_logs(self, filepath: str):
        with open(filepath, 'w') as f:
            json.dump(self.logs, f, indent=2)
        print(f"Exported {len(self.logs)} log entries to {filepath}")

In [None]:
# Use the logger
logged_engine = QueryLogger(query_engine)

# Run some queries
queries = [
    "What is Python?",
    "Explain machine learning algorithms.",
    "How does error handling work in Python?",
]

for q in queries:
    print(f"Q: {q}")
    response = logged_engine.query(q)
    print(f"A: {response[:150]}...\n")

# View statistics
print("\n=== Query Statistics ===")
stats = logged_engine.get_stats()
for key, value in stats.items():
    print(f"{key}: {value}")
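Averages can hide slow outliers, so it is often useful to look at the median and the slowest query as well. The sketch below summarizes log entries with the standard library `statistics` module. The entries mimic the shape produced by `QueryLogger`, but the values in `sample_logs` are made up for illustration, and `summarize` is a hypothetical helper rather than part of the class above.

```python
import statistics

# Illustrative log entries in the shape QueryLogger produces; values are made up
sample_logs = [
    {"query": "What is Python?", "elapsed_seconds": 1.2, "top_score": 0.82},
    {"query": "Explain machine learning algorithms.", "elapsed_seconds": 2.6, "top_score": 0.79},
    {"query": "How does error handling work in Python?", "elapsed_seconds": 1.7, "top_score": None},
]


def summarize(logs):
    """Median and worst-case latency, plus the mean of the scores that are present."""
    times = [entry["elapsed_seconds"] for entry in logs]
    scores = [entry["top_score"] for entry in logs if entry["top_score"] is not None]
    slowest = max(logs, key=lambda entry: entry["elapsed_seconds"])
    return {
        "median_seconds": statistics.median(times),
        "max_seconds": max(times),
        "slowest_query": slowest["query"],
        "mean_top_score": statistics.mean(scores) if scores else None,
    }


summary = summarize(sample_logs)
for key, value in summary.items():
    print(f"{key}: {value}")
```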

## 8. Summary

You've learned advanced querying techniques in LlamaIndex:

### Key Takeaways

| Technique | When to Use | Benefit |
|-----------|-------------|--------|
| **Streaming** | User-facing apps | Better perceived performance |
| **Async** | Batch processing | Concurrent execution |
| **HyDE** | Complex queries | Better retrieval |
| **Evaluation** | Quality assurance | Measurable quality |
| **Custom Prompts** | Edge cases | Better handling |

### Best Practices

1. **Always stream** for user-facing applications
2. **Use async** for batch operations
3. **Evaluate regularly** to catch quality issues
4. **Log everything** in production
5. **Handle edge cases** with custom prompts

### Next Steps

In the Intermediate section, we'll explore:
- Different index types (Summary, Tree, Keyword)
- Custom retrievers and reranking
- Chat engines with memory

---

## Exercises

1. **Streaming chat**: Build a simple chat interface that streams responses

2. **Async benchmark**: Compare sync vs async with 10+ queries

3. **Custom evaluator**: Create an evaluator that checks for specific criteria

4. **Query analysis**: Use the logger to analyze query patterns

In [None]:
# Exercise space
# Try your own experiments here!