# Dynamic In-Context Learning (dICL) Demo with Seed Data

This notebook demonstrates the dICL system using seed data from `seed_data.yml`. The workflow has three steps:
1. Take a user question
2. Find the top 3 most relevant examples in the database (populated with seed data)
3. Use those examples as context for generating an answer
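
The retrieval step can be sketched without any of the notebook's dependencies. This is a minimal illustration, not the `DICLSystem` implementation: the three-dimensional vectors are invented for the example (the real embeddings come from nomic-embed-text), `top_k` is a hypothetical helper, and cosine similarity is an assumed metric, since the notebook does not show which distance the database search uses.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: List[float], store: Dict[str, List[float]], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k stored ids most similar to the query, best match first."""
    scored = [(ex_id, cosine_similarity(query, vec)) for ex_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors, invented for illustration only.
toy_store = {
    "structured_0": [1.0, 0.0, 0.0],
    "structured_1": [0.9, 0.1, 0.0],
    "pattern_0": [0.0, 1.0, 0.0],
    "voice_0": [0.0, 0.0, 1.0],
}

nearest = top_k([1.0, 0.05, 0.0], toy_store, k=3)
print([ex_id for ex_id, _ in nearest])
# ['structured_0', 'structured_1', 'pattern_0']
```

A query vector close to the "structured" examples pulls those in first, which is how a bare input such as "Banana" would end up with JSON-style examples as context.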

The system uses three types of examples:
- **Structured Output**: JSON format examples (e.g., "Banana" → `{"type": "fruit", "color": "yellow", "taste": "sweet", "season": "year-round"}`)
- **Pattern Learning**: Haiku examples (e.g., "Write a haiku about..." → haiku responses)
- **Voice & Style**: Narrative examples (e.g., "Tell me about..." → descriptive responses)
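
One way to picture a stored example is as a small record whose type is read from its ID prefix, as the database view later in this notebook does with IDs such as `structured_0` and `pattern_0`. The sketch below is an assumption about shape only: `SeedExample` is a hypothetical class, not the one used inside `DICLSystem`, and the haiku output is left as a placeholder rather than reproduced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeedExample:
    """Hypothetical record; fields mirror the table columns id, input, output."""
    id: str
    input: str
    output: str

    @property
    def example_type(self) -> str:
        # The type is the part of the id before the first underscore.
        return self.id.split("_")[0]


examples = [
    SeedExample(
        "structured_1",
        "Banana",
        '{"type": "fruit", "color": "yellow", "taste": "sweet", "season": "year-round"}',
    ),
    # Placeholder output: the full haiku text is not reproduced here.
    SeedExample("pattern_0", "Write a haiku about a Tyrannosaurus Rex.", "<haiku text>"),
]

for ex in examples:
    print(ex.id, "->", ex.example_type)
```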

## Smart Database Management
The notebook automatically checks if the database exists and is properly populated before attempting to use it. It will only populate the database if necessary, avoiding unnecessary re-seeding.
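
The populate-only-if-needed rule can be written as a small pure function, which makes the decision easy to test apart from the file system. This is a sketch under assumptions: `SeedAction` and `decide_seed_action` are hypothetical names, and the three inputs stand in for the directory check, the table check, and the row count that the notebook performs further down.

```python
from enum import Enum


class SeedAction(Enum):
    POPULATE = "populate"
    REUSE = "reuse"


def decide_seed_action(dir_exists: bool, table_exists: bool, row_count: int) -> SeedAction:
    """Populate only when the store is missing or empty; otherwise reuse it."""
    if not dir_exists or not table_exists:
        return SeedAction.POPULATE
    if row_count == 0:
        return SeedAction.POPULATE
    return SeedAction.REUSE


print(decide_seed_action(dir_exists=True, table_exists=True, row_count=15))
# SeedAction.REUSE
```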


## Setup and Imports

This notebook now uses the updated dICL system that loads examples from seed data instead of generating them via BAML.


In [1]:
import sys
import os
import requests
import pandas as pd
import lancedb
from typing import List, Dict, Any
from dataclasses import dataclass
import logging

# Add the src directory to the path
sys.path.append('/Users/davidhughes/dev/mmgraphrag-odsc-west-2025/src')

# Import our dICL system
from mmgraphrag_odsc_west_2025.lib.dicl_system import DICLSystem
from mmgraphrag_odsc_west_2025.config import config

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

print("Setup complete!")
print("✅ Updated dICL system with seed data support loaded")


Setup complete!
✅ Updated dICL system with seed data support loaded


## Check Prerequisites

Before we start, let's make sure Ollama is running and accessible with the nomic-embed-text model for embeddings.


In [2]:
# Check if Ollama is running
import requests

ollama_url = "http://localhost:11434"
try:
    response = requests.get(f"{ollama_url}/api/tags", timeout=5)
    if response.status_code == 200:
        print("✅ Ollama is running and accessible")
        
        # Check if nomic-embed-text model is available
        models = response.json().get('models', [])
        embed_models = [m for m in models if 'nomic-embed-text' in m.get('name', '').lower()]
        if embed_models:
            print(f"✅ Embedding model found: {embed_models[0]['name']}")
        else:
            print("⚠️  nomic-embed-text model not found. You may need to pull it with: ollama pull nomic-embed-text:v1.5")
        
        # Also check for phi4 (for BAML if available)
        phi4_models = [m for m in models if 'phi4' in m.get('name', '').lower()]
        if phi4_models:
            print(f"✅ Phi4 model found: {phi4_models[0]['name']} (for BAML)")
        else:
            print("ℹ️  Phi4 model not found (BAML will use fallback mode)")
    else:
        print("❌ Ollama is not responding properly")
except Exception as e:
    print(f"❌ Cannot connect to Ollama: {e}")
    print("Please make sure Ollama is running: ollama serve")


✅ Ollama is running and accessible
✅ Embedding model found: nomic-embed-text:v1.5
✅ Phi4 model found: phi4-ctx16k:latest (for BAML)
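
The model lookup in the cell above can be factored into a helper that works on the parsed `/api/tags` response, so it can be exercised without a running Ollama server. `find_models` is a hypothetical name, and the sample payload contains only the `name` field that the cell reads; a real response carries more fields.

```python
from typing import Any, Dict, List


def find_models(tags_payload: Dict[str, Any], needle: str) -> List[str]:
    """Return the names of all listed models whose name contains `needle` (case-insensitive)."""
    models = tags_payload.get("models", [])
    return [
        m.get("name", "")
        for m in models
        if needle.lower() in m.get("name", "").lower()
    ]


# Trimmed-down stand-in for a parsed /api/tags response.
sample_payload = {
    "models": [
        {"name": "nomic-embed-text:v1.5"},
        {"name": "phi4-ctx16k:latest"},
    ]
}

print(find_models(sample_payload, "nomic-embed-text"))
# ['nomic-embed-text:v1.5']
```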


## Initialize the dICL System

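The next cell initializes the dICL system and connects to the database populated with seed data. Initialization happens inside `ensure_database_populated()`, so there is no separate initialization cell.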


## View Database Contents

Let's take a look at what's stored in the LanceDB database populated with seed data.


In [3]:
# View the LanceDB table contents
import pandas as pd

def check_database_status():
    """Check if the database exists and is properly populated."""
    import os
    db_path = '/Users/davidhughes/dev/mmgraphrag-odsc-west-2025/dicl_examples'
    
    # Check if database directory exists
    if not os.path.exists(db_path):
        return False, "Database directory does not exist"
    
    # Check if examples table exists
    if not os.path.exists(os.path.join(db_path, 'examples.lance')):
        return False, "Examples table does not exist"
    
    return True, "Database exists"

def ensure_database_populated():
    """Ensure the database is populated with seed data."""
    global system
    
    # Check if system is already initialized and working
    if 'system' in globals() and hasattr(system, 'table') and system.table is not None:
        try:
            # Test if we can access the data
            test_data = system.table.to_pandas()
            if len(test_data) > 0:
                print("✅ Database is already populated and accessible")
                return True
        except Exception:
            # Existing handle is unusable; fall through and re-check the database on disk
            pass
    
    # Check database status
    db_exists, status_msg = check_database_status()
    print(f"🔍 Database status: {status_msg}")
    
    if not db_exists:
        print("📦 Database not found. Populating with seed data...")
        try:
            system = DICLSystem()
            system.populate_database(db_path='/Users/davidhughes/dev/mmgraphrag-odsc-west-2025/dicl_examples', seed_data_path='seed_data.yml')
            print("✅ Database populated successfully!")
            return True
        except Exception as e:
            print(f"❌ Error populating database: {e}")
            return False
    else:
        # Database exists, try to connect
        try:
            system = DICLSystem()
            system.initialize_database('/Users/davidhughes/dev/mmgraphrag-odsc-west-2025/dicl_examples')
            test_data = system.table.to_pandas()
            if len(test_data) == 0:
                print("📦 Database exists but is empty. Populating with seed data...")
                system.populate_database(db_path='/Users/davidhughes/dev/mmgraphrag-odsc-west-2025/dicl_examples', seed_data_path='seed_data.yml')
                print("✅ Database populated successfully!")
            else:
                print(f"✅ Database is already populated with {len(test_data)} examples")
            return True
        except Exception as e:
            print(f"❌ Error connecting to database: {e}")
            print("📦 Attempting to repopulate database...")
            try:
                system = DICLSystem()
                system.populate_database(db_path='/Users/davidhughes/dev/mmgraphrag-odsc-west-2025/dicl_examples', seed_data_path='seed_data.yml')
                print("✅ Database repopulated successfully!")
                return True
            except Exception as e2:
                print(f"❌ Error repopulating database: {e2}")
                return False

def view_database_contents():
    """Display the contents of the LanceDB database populated with seed data."""
    global system
    
    # Ensure database is populated first
    if not ensure_database_populated():
        print("❌ Could not populate database. Please check your setup.")
        return None
    
    try:
        # Get all data from the table
        all_data = system.table.to_pandas()
        
        print(f"📊 Database Overview (Seed Data):")
        print(f"   Total examples: {len(all_data)}")
        print(f"   Columns: {list(all_data.columns)}")
        print("\n" + "="*80)
        
        # Show basic statistics
        print(f"\n📈 Data Statistics:")
        print(f"   - Unique IDs: {all_data['id'].nunique()}")
        
        # Group by example type (based on ID prefix)
        example_types = all_data['id'].str.split('_').str[0].value_counts()
        print(f"   - Example Types: {dict(example_types)}")
        
        # Show what each type contains
        print(f"\n📋 Example Type Breakdown:")
        for ex_type, count in example_types.items():
            type_data = all_data[all_data['id'].str.startswith(ex_type)]
            sample_input = type_data.iloc[0]['input']
            print(f"   - {ex_type}: {count} examples (e.g., '{sample_input}')")
        
        print("\n" + "="*80)
        print(f"\n📋 Sample Data (first 5 examples):")
        print("\n" + "-"*80)
        
        # Display sample data
        for i, (_, row) in enumerate(all_data.head().iterrows()):
            print(f"\n🔹 Example {i+1} ({row['id'].split('_')[0]}):")
            print(f"   ID: {row['id']}")
            print(f"   Input: {row['input']}")
            print(f"   Output: {row['output'][:100]}{'...' if len(row['output']) > 100 else ''}")
            print(f"   Vector dimension: {len(row['vector'])}")
            print("-" * 60)
        
        # Show all data in a table format (without vectors for readability)
        print(f"\n📊 Complete Data Table (without vectors):")
        display_data = all_data[['id', 'input', 'output']].copy()
        display_data['output'] = display_data['output'].str[:50] + '...'
        display(display_data)
        
        return all_data
        
    except Exception as e:
        print(f"❌ Error viewing database: {e}")
        return None

# Call the function to view the database
database_data = view_database_contents()


2025-10-25 15:31:54,904 - mmgraphrag_odsc_west_2025.lib.dicl_system - INFO - Connected to database at /Users/davidhughes/dev/mmgraphrag-odsc-west-2025/dicl_examples


🔍 Database status: Database exists
✅ Database is already populated with 15 examples
📊 Database Overview (Seed Data):
   Total examples: 15
   Columns: ['id', 'vector', 'input', 'output']


📈 Data Statistics:
   - Unique IDs: 15
   - Example Types: {'structured': 5, 'pattern': 5, 'voice': 5}

📋 Example Type Breakdown:
   - structured: 5 examples (e.g., 'Apple')
   - pattern: 5 examples (e.g., 'Write a haiku about a Tyrannosaurus Rex.')
   - voice: 5 examples (e.g., 'Explain lambda calculus.')


📋 Sample Data (first 5 examples):

--------------------------------------------------------------------------------

🔹 Example 1 (structured):
   ID: structured_0
   Input: Apple
   Output: {"type": "fruit", "color": "red", "taste": "sweet", "season": "autumn"}
   Vector dimension: 768
------------------------------------------------------------

🔹 Example 2 (structured):
   ID: structured_1
   Input: Banana
   Output: {"type": "fruit", "color": "yellow", "taste": "sweet", "season": "year-round"}

|   | id | input | output |
|---|----|-------|--------|
| 0 | structured_0 | Apple | {"type": "fruit", "color": "red", "taste": "sw... |
| 1 | structured_1 | Banana | {"type": "fruit", "color": "yellow", "taste": ... |
| 2 | structured_2 | Orange | {"type": "fruit", "color": "orange", "taste": ... |
| 3 | structured_3 | Strawberry | {"type": "fruit", "color": "red", "taste": "sw... |
| 4 | structured_4 | Grape | {"type": "fruit", "color": "purple", "taste": ... |
| 5 | pattern_0 | Write a haiku about a Tyrannosaurus Rex. | King of ancient earth,\nMassive jaws and tiny ... |
| 6 | pattern_1 | Write a haiku about a Triceratops. | Three horns crown its head,\nArmored shield pr... |
| 7 | pattern_2 | Write a haiku about a Stegosaurus. | Plates along its spine,\nTail spikes swing lik... |
| 8 | pattern_3 | Write a haiku about a Velociraptor. | Swift and cunning hunter,\nPack hunter with de... |
| 9 | pattern_4 | Write a haiku about a Brachiosaurus. | Towering above trees,\nLong neck reaches for t... |


## Step 1: User Input and Similar Example Search

Enter your question below, and the system will find the top 3 most relevant examples from the seed data database.

**Try these example queries to see different types of responses:**
- **Structured Output**: "Banana", "Tyrannosaurus Rex", "Apple"
- **Pattern Learning**: "Write a haiku about a dinosaur", "Write a haiku about a banana"
- **Voice & Style**: "Tell me about a banana", "Tell me about a dinosaur"


In [4]:
# Interactive user input
import ipywidgets as widgets
from IPython.display import display, clear_output

# Create input widgets
question_input = widgets.Textarea(
    value="Banana",
    placeholder="Enter your question here...",
    description="Question:",
    layout=widgets.Layout(width='100%', height='80px')
)

submit_button = widgets.Button(
    description="Search for Similar Examples",
    button_style='primary',
    layout=widgets.Layout(width='200px')
)

output_area = widgets.Output()

def on_submit_clicked(b):
    """Handle submit button click."""
    with output_area:
        clear_output(wait=True)
        
        user_question = question_input.value.strip()
        
        if not user_question:
            print("❌ Please enter a question!")
            return
            
        print(f"🔍 User Question: {user_question}")
        print("\n" + "="*50)

        # Search for similar examples
        try:
            similar_examples = system._search_similar_examples(user_question, num_examples=3)
            
            print(f"📊 Found {len(similar_examples)} similar examples:")
            print("\n" + "-"*50)
            
            for i, example in enumerate(similar_examples, 1):
                print(f"\n📝 Example {i} ({example.id.split('_')[0]}):")
                print(f"   ID: {example.id}")
                print(f"   Input: {example.input}")
                print(f"   Output: {example.output}")
                print("\n" + "-"*30)
            
            # Store the results globally for the next cell
            globals()['user_question'] = user_question
            globals()['similar_examples'] = similar_examples
            
            print(f"\n✅ Ready for Step 2! Found {len(similar_examples)} examples to use as context.")
                
        except Exception as e:
            print(f"❌ Error searching for examples: {e}")

# Connect the button to the function
submit_button.on_click(on_submit_clicked)

# Display the widgets
print("📝 Enter your question below and click 'Search for Similar Examples':")
display(question_input)
display(submit_button)
display(output_area)


📝 Enter your question below and click 'Search for Similar Examples':


Textarea(value='Banana', description='Question:', layout=Layout(height='80px', width='100%'), placeholder='Ent…

Button(button_style='primary', description='Search for Similar Examples', layout=Layout(width='200px'), style=…

Output()

## Step 2: Dynamic In-Context Learning

Now we'll use the dICL system to process the query with the similar examples as context. The system will use BAML if available, or fall back to a simple response using the seed data examples.
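
To make "examples as context, with a fallback" concrete, here is a minimal sketch. It is not the BAML prompt or the actual fallback in `DICLSystem`, neither of which is shown in this notebook: `build_prompt`, `answer_with_fallback`, and the `generate` callable are hypothetical, and returning the closest example's output is only one plausible fallback.

```python
from typing import Callable, List, Tuple

# (input, output) pairs, assumed to be ordered from most to least similar.
Example = Tuple[str, str]


def build_prompt(question: str, examples: List[Example]) -> str:
    """Lay out the retrieved examples as input/output pairs, then the new input."""
    parts = [f"Input: {ex_in}\nOutput: {ex_out}" for ex_in, ex_out in examples]
    parts.append(f"Input: {question}\nOutput:")
    return "\n\n".join(parts)


def answer_with_fallback(
    question: str,
    examples: List[Example],
    generate: Callable[[str], str],
) -> str:
    """Try the generator; if it raises, return the closest example's output."""
    try:
        return generate(build_prompt(question, examples))
    except Exception:
        return examples[0][1] if examples else ""


def unavailable_model(prompt: str) -> str:
    # Stand-in for a model call that is not reachable.
    raise RuntimeError("model not available")


context = [("Apple", '{"type": "fruit", "color": "red", "taste": "sweet", "season": "autumn"}')]
print(build_prompt("Banana", context))
print(answer_with_fallback("Banana", context, unavailable_model))
```

Ending the prompt with an open `Output:` line nudges the model to continue in the same format as the examples, which is the core idea behind in-context learning.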


In [12]:
# Use the dICL system to process the query
print("🤖 Using Dynamic In-Context Learning with Seed Data...")
print("\n" + "="*50)

# Check if we have the required variables from Step 1
if 'user_question' not in globals() or 'similar_examples' not in globals():
    print("❌ Please run Step 1 first and click 'Search for Similar Examples'!")
    print("The variables 'user_question' and 'similar_examples' are not available.")
else:
    try:
        print(f"📤 Processing query with dICL system:")
        print(f"   Query: {user_question}")
        print(f"   Number of context examples: {len(similar_examples)}")
        print("\n" + "-"*30)
        
        # Use the dICL system's process_query method
        result = system.process_query(user_question, num_examples=3)
        
        print("\n🎯 Final Answer:")
        print("\n" + "="*50)
        print(f"{result['answer']}")
        
        print("\n\n🧠 Reasoning Process:")
        print("\n" + "="*50)
        print(f"{result['reasoning']}")
        
        print(f"\n📊 Examples Used:")
        print("\n" + "-"*30)
        for i, example in enumerate(result['examples'], 1):
            print(f"   {i}. {example.input} → {example.output[:50]}...")
        
    except Exception as e:
        print(f"❌ Error in dynamic in-context learning: {e}")
        import traceback
        traceback.print_exc()


🤖 Using Dynamic In-Context Learning with Seed Data...

📤 Processing query with dICL system:
   Query: Who is Alonzo Church in regards to programming?
   Number of context examples: 3

------------------------------

🎯 Final Answer:

Ah, Alonzo Church! A towering figure in the realm of theoretical computer science and mathematics. He is best known for his formulation of lambda calculus, a formal system that explores function definition, application, and recursion - the very foundation upon which modern functional programming languages are built. His work laid the groundwork for understanding computation itself, influencing not only programming but also logic and philosophy. Church's contributions extend to the Church-Turing thesis, which posits that any effectively calculable function can be computed by a Turing machine, thus bridging abstract mathematical concepts with practical computing.


🧠 Reasoning Process:

To answer this question in the style of the examples provided, I focused 

## Summary

This demo showed how the updated dICL system with seed data:
1. ✅ Took your question and embedded it using nomic-embed-text
2. ✅ Found the 3 most relevant examples from the seed data database
3. ✅ Used those examples as context for generating an answer (BAML or fallback)
4. ✅ Generated a contextual answer with reasoning based on the seed data examples

The system used dynamic in-context learning to ground its answer in the three types of seed examples:
- **Structured Output**: JSON format responses
- **Pattern Learning**: Haiku and creative writing patterns  
- **Voice & Style**: Narrative and descriptive responses


## Try Different Questions

To test a different query, change the question in the input widget above, click 'Search for Similar Examples' again, and re-run the Step 2 cell. Here are some suggested questions for each example type:

**Structured Output Examples:**
- "Banana" → Should return JSON format
- "Tyrannosaurus Rex" → Should return JSON format
- "Apple" → Should return JSON format

**Pattern Learning Examples:**
- "Write a haiku about a dinosaur" → Should return haiku examples
- "Write a haiku about a banana" → Should return haiku examples
- "Write a haiku about an apple" → Should return haiku examples

**Voice & Style Examples:**
- "Tell me about a banana" → Should return narrative examples
- "Tell me about a dinosaur" → Should return narrative examples
- "Tell me about an apple" → Should return narrative examples
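
A quick way to check these expectations is to look at which example type dominates the retrieved IDs. The helper below is a hypothetical sketch that relies only on the ID convention visible in the database view (`structured_*`, `pattern_*`, `voice_*`); the ID list in the usage line is made up.

```python
from collections import Counter
from typing import List


def dominant_type(example_ids: List[str]) -> str:
    """Return the most common id prefix among the retrieved examples ('' if none)."""
    if not example_ids:
        return ""
    counts = Counter(ex_id.split("_")[0] for ex_id in example_ids)
    return counts.most_common(1)[0][0]


# Made-up retrieval result for illustration.
print(dominant_type(["structured_1", "structured_0", "pattern_2"]))
# structured
```

In the notebook, the input would be `[ex.id for ex in similar_examples]` after running Step 1.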
