# Notebook 02: Inference and Question Generation

This notebook demonstrates:
1. Loading embedded chunks from the previous notebook
2. Asking user-defined questions about the document
3. Auto-generating questions from chunks
4. Tracking tokens and timing for all inference calls
5. Saving inference metrics for reporting

## Setup


In [1]:
# Import standard library modules
import json
import sys
from pathlib import Path

# Add the src directory to Python path
project_root = Path().resolve().parent
sys.path.insert(0, str(project_root))

# Import our custom modules
from src.config import Config
from src.pipeline import generate_questions, save_metrics
from src.ollama_client import generate
from src.timing_metrics import MetricsStore, TimingContext
from src.token_accounting import count_tokens, count_prompt_and_response

print("Modules imported successfully!")


Modules imported successfully!


## Configuration and Load Previous Results


In [2]:
# Create configuration (same as notebook 01)
config = Config(
    embedding_model="embeddinggemma",
    generation_model="gemma3:1b",
    ollama_endpoint="http://localhost:11434"
)

# Load embedded chunks from previous notebook
chunks_path = config.get_chunks_path()
print(f"Loading chunks from: {chunks_path}")

with open(chunks_path, 'r') as f:
    embedded_chunks = json.load(f)

print(f"Loaded {len(embedded_chunks)} embedded chunks")

# Create metrics store for inference metrics
# We'll add to this as we make inference calls
inference_metrics_store = MetricsStore()


Loading chunks from: results/chunks.json
Loaded 772 embedded chunks


## User-Defined Questions

Ask your own questions about the document. Each question will be sent to the model with relevant context from the chunks.


In [3]:
# Example user-defined questions
# You can modify these or add your own
user_questions = [
    "What is the main theme of this passage?",
    "What question does this passage make you ask yourself?",
    "What is the main charater of this passage?"
]

# Store all question-answer pairs
qa_pairs = []

# Process each user question
for idx, question in enumerate(user_questions):
    print(f"\n{'='*60}")
    print(f"Question {idx + 1}: {question}")
    print('='*60)
    
    # For simplicity, we'll use the first chunk as context
    # In a real RAG system, you'd use vector similarity to find the most relevant chunks
    context_chunk = embedded_chunks[0] if embedded_chunks else None
    
    if context_chunk:
        # Build prompt with context
        prompt = f"""Based on the following text, answer this question: {question}

Text:
{context_chunk['text']}

Answer:"""
        
        # Count prompt tokens
        prompt_tokens = count_tokens(prompt)
        
        # Time the inference call
        with TimingContext() as timer:
            # Call Ollama to generate answer
            response_text, metadata = generate(
                prompt,
                config.generation_model,
                config.ollama_endpoint
            )
        
        # Get duration
        duration = timer.duration
        
        # Count response tokens
        response_tokens = count_tokens(response_text)
        
        # Record metrics
        question_id = f"user_question_{idx}"
        inference_metrics_store.add_inference_metric(
            duration=duration,
            prompt_tokens=prompt_tokens,
            response_tokens=response_tokens,
            question_id=question_id,
            response_text=response_text
        )
        
        # Store Q&A pair
        qa_pairs.append({
            'question_id': question_id,
            'question': question,
            'answer': response_text,
            'chunk_id': context_chunk['chunk_id']
        })
        
        # Display results
        print(f"\nAnswer:")
        print(response_text)
        print(f"\nMetrics:")
        print(f"  Prompt tokens: {prompt_tokens}")
        print(f"  Response tokens: {response_tokens}")
        print(f"  Total tokens: {prompt_tokens + response_tokens}")
        print(f"  Duration: {duration:.2f} seconds")
        print(f"  Tokens/second: {(prompt_tokens + response_tokens) / duration:.2f}" if duration > 0 else "  Tokens/second: N/A")
    else:
        print("No chunks available for context")



Question 1: What is the main theme of this passage?

Answer:
The main theme of this passage is the exploration of obsession and the destructive nature of unchecked ambition, particularly as exemplified by Captain Ahab’s pursuit of Moby Dick. It’s a story about a man consumed by a desire for revenge and the dangerous consequences of pursuing a singular, almost mythical, goal. The text also hints at themes of fate, the unknowable, and the vastness of the natural world, all woven together through the narrative of a whale.

Metrics:
  Prompt tokens: 534
  Response tokens: 93
  Total tokens: 627
  Duration: 2.48 seconds
  Tokens/second: 252.59

Question 2: What question does this passage make you ask yourself?

Answer:
What question does this passage make you ask yourself?

The passage primarily invites self-reflection on the story’s themes and the author’s intent. It asks, “What does this text suggest about the nature of obsession, the power of storytelling, and the potential consequences

## Auto-Generate Questions from Chunks

Automatically generate questions from chunks. This is useful for creating a question-answer dataset or testing the system.


In [None]:
# Generate questions from a few sample chunks
# This demonstrates the auto-generation capability
num_chunks_to_process = max(3, len(embedded_chunks))  # Process first 3 chunks
questions_per_chunk = 2  # Generate 2 questions per chunk

print(f"Generating {questions_per_chunk} questions from {num_chunks_to_process} chunks...")
print("This may take a while.\n")

all_generated_questions = []

for i, chunk in enumerate(embedded_chunks[:num_chunks_to_process]):
    print(f"\n{'='*60}")
    print(f"Processing chunk {i+1}/{num_chunks_to_process}: {chunk['chunk_id']}")
    print('='*60)
    
    # Generate questions for this chunk
    questions = generate_questions(
        chunk,
        num_questions=questions_per_chunk,
        config=config,
        metrics_store=inference_metrics_store
    )
    
    all_generated_questions.extend(questions)
    
    # Display generated questions
    for q in questions:
        print(f"\nQuestion: {q['question_text']}")

print(f"\n\nTotal questions generated: {len(all_generated_questions)}")


Generating 2 questions from 772 chunks...
This may take a while.


Processing chunk 1/772: chunk_0

Question: What is the cost and restrictions of accessing the ebook?

Question: What is the release date of the ebook?

Processing chunk 2/772: chunk_1

Question: What does Stubb do to Ahab?

Question: What is the significance of “To Him, Stubb” in the text?

Processing chunk 3/772: chunk_2

Question: What is the primary focus of the text’s chapters 75-89?

Processing chunk 4/772: chunk_3

Question: What is the text about?

Question: What does the text imply about the original Usher?

Processing chunk 5/772: chunk_4

Question: What is the definition of “WHALE” as described in the text?

Question: What does the text suggest about the importance of the extracts?

Processing chunk 6/772: chunk_5

Question: What is the main point of the text?

Question: What does the text state about Leviathan?

Processing chunk 7/772: chunk_6

Question: What is the primary focus of Isaiah’s punishment?

Ques

## Save Inference Metrics

Save all inference metrics to disk so they can be used in the reporting notebook.


In [None]:
# Load existing metrics from notebook 01 (if they exist)
# Then merge with inference metrics
import json

metrics_path = config.get_metrics_path()
all_metrics = []

# Load existing metrics
if metrics_path.exists():
    with open(metrics_path, 'r') as f:
        all_metrics = json.load(f)
    print(f"Loaded {len(all_metrics)} existing metrics from notebook 01")

# Add inference metrics
inference_metrics = inference_metrics_store.metrics
all_metrics.extend(inference_metrics)
print(f"Added {len(inference_metrics)} inference metrics")

# Save combined metrics
print(f"\nSaving all metrics to: {metrics_path}")
with open(metrics_path, 'w') as f:
    json.dump(all_metrics, f, indent=2)

print("✅ All metrics saved successfully!")
print("You can now proceed to notebook 03 for reporting and visualization.")

# Display summary
if inference_metrics:
    total_inference_time = sum(m['duration_seconds'] for m in inference_metrics)
    total_inference_tokens = sum(m['token_counts'].get('total_tokens', 0) for m in inference_metrics)
    
    print(f"\nInference Summary:")
    print(f"  Total inference calls: {len(inference_metrics)}")
    print(f"  Total inference time: {total_inference_time:.2f} seconds")
    print(f"  Total tokens: {total_inference_tokens:,}")
    print(f"  Average time per call: {total_inference_time / len(inference_metrics):.2f} seconds")
    print(f"  Throughput: {total_inference_tokens / total_inference_time:.2f} tokens/second" if total_inference_time > 0 else "  Throughput: N/A")


Loaded 772 existing metrics from notebook 01
Added 9 inference metrics

Saving all metrics to: results/metrics.json
✅ All metrics saved successfully!
You can now proceed to notebook 03 for reporting and visualization.

Inference Summary:
  Total inference calls: 9
  Total inference time: 15.09 seconds
  Total tokens: 5,401
  Average time per call: 1.68 seconds
  Throughput: 357.89 tokens/second
