# 🎛️ Interactive SelfIndex Configuration & Testing
## Complete Control Over Data, Indexing, and Queries

This notebook allows you to:
- Choose different datasets
- Configure index variants (x, y, z combinations; see `INDEX_VARIANTS` in Part 1)
- Create custom queries
- Measure performance metrics (A: latency, B: throughput, C: memory, D: functional)
- Compare multiple configurations

---

## 📦 Part 0: Setup & Imports

In [1]:
# Install required dependencies
import subprocess
import sys

required_packages = [
    'nltk==3.9.1',
    'datasets==4.1.1',
    'matplotlib==3.10.6',
    'numpy==2.3.3',
    'pandas==2.3.3',
    'scipy==1.16.2',
    'psutil==5.9.6',
    'tqdm==4.67.1',
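    'ipywidgets==8.0.0',
    'seaborn',  # imported in the next cell; left unpinned
]

print("📦 Installing dependencies...")
failed = []
for package in required_packages:
    try:
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])
    except subprocess.CalledProcessError:
        failed.append(package)
        print(f"⚠️  Could not install {package}")

# Only report success when every package installed
if failed:
    print(f"⚠️  {len(failed)} package(s) failed to install")
else:
    print("✅ Dependencies installed")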

📦 Installing dependencies...
✅ Dependencies installed


In [2]:
# Core imports
import os
import sys
import time
import json
import pickle
import string
import statistics
import threading
from collections import Counter, defaultdict
from pathlib import Path
from datetime import datetime
from typing import List, Dict, Tuple, Any
import traceback

# Data & Analysis
import numpy as np
import pandas as pd
from scipy import stats

# Visualization
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
from matplotlib.gridspec import GridSpec

# UI & Widgets
from IPython.display import display, HTML, clear_output, Markdown
from tqdm import tqdm, tqdm_notebook
import ipywidgets as widgets

# NLP
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# System
import psutil

# Download NLTK data
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 6)

print("✅ All imports successful")

✅ All imports successful


---

## 🎛️ Part 1: Configuration Panel

In [3]:
# Dataset presets
DATASET_PRESETS = {
    'wikipedia_small': {'num_docs': 100, 'desc': 'Wikipedia - Small (100 docs)'},
    'wikipedia_medium': {'num_docs': 1000, 'desc': 'Wikipedia - Medium (1K docs)'},
    'wikipedia_large': {'num_docs': 5000, 'desc': 'Wikipedia - Large (5K docs)'},
    'wikipedia_xlarge': {'num_docs': 50000, 'desc': 'Wikipedia - XLarge (50K docs)'},
    'custom_political': {'num_docs': 10, 'desc': 'Custom - Political Philosophy (10 docs)'},
}

# Index variants (x, y, z combinations)
INDEX_VARIANTS = {
    'Boolean_Basic': {'info': 'BOOLEAN', 'dstore': 'CUSTOM', 'compr': 'NONE', 'qproc': 'TERMatat', 'optim': 'Null', 'desc': 'Boolean - Basic'},
    'Boolean_Compr_GapVByte': {'info': 'BOOLEAN', 'dstore': 'CUSTOM', 'compr': 'CODE', 'qproc': 'TERMatat', 'optim': 'Null', 'desc': 'Boolean - GapVByte compression'},
    'Boolean_Compr_NoCompr': {'info': 'BOOLEAN', 'dstore': 'CUSTOM', 'compr': 'NONE', 'qproc': 'TERMatat', 'optim': 'Null', 'desc': 'Boolean - NoCompr compression'},
    'Boolean_Compr_Zlib': {'info': 'BOOLEAN', 'dstore': 'CUSTOM', 'compr': 'CLIB', 'qproc': 'TERMatat', 'optim': 'Null', 'desc': 'Boolean - Zlib compression'},
    'Boolean_Custom_Zlib_DocAtTime_EarlyStop': {'info': 'BOOLEAN', 'dstore': 'CUSTOM', 'compr': 'CLIB', 'qproc': 'DOCatat', 'optim': 'EarlyStopping', 'desc': 'Boolean - Custom - Zlib - DocAtTime - EarlyStop'},
    'Boolean_DS_Custom': {'info': 'BOOLEAN', 'dstore': 'CUSTOM', 'compr': 'NONE', 'qproc': 'TERMatat', 'optim': 'Null', 'desc': 'Boolean - Custom datastore'},
    'Boolean_DS_Redis': {'info': 'BOOLEAN', 'dstore': 'DB2', 'compr': 'NONE', 'qproc': 'TERMatat', 'optim': 'Null', 'desc': 'Boolean - Redis datastore'},
    'Boolean_DS_SQLite': {'info': 'BOOLEAN', 'dstore': 'DB1', 'compr': 'NONE', 'qproc': 'TERMatat', 'optim': 'Null', 'desc': 'Boolean - SQLite datastore'},
    'Boolean_Opt_EarlyStop': {'info': 'BOOLEAN', 'dstore': 'CUSTOM', 'compr': 'NONE', 'qproc': 'TERMatat', 'optim': 'EarlyStopping', 'desc': 'Boolean - EarlyStop'},
    'Boolean_Opt_NoOpt': {'info': 'BOOLEAN', 'dstore': 'CUSTOM', 'compr': 'NONE', 'qproc': 'TERMatat', 'optim': 'Null', 'desc': 'Boolean - NoOpt'},
    'Boolean_Opt_Skip': {'info': 'BOOLEAN', 'dstore': 'CUSTOM', 'compr': 'NONE', 'qproc': 'TERMatat', 'optim': 'Skipping', 'desc': 'Boolean - Skip'},
    'Boolean_Opt_Thresh': {'info': 'BOOLEAN', 'dstore': 'CUSTOM', 'compr': 'NONE', 'qproc': 'TERMatat', 'optim': 'Thresholding', 'desc': 'Boolean - Thresh'},
    'Boolean_QProc_DocAtTime': {'info': 'BOOLEAN', 'dstore': 'CUSTOM', 'compr': 'NONE', 'qproc': 'DOCatat', 'optim': 'Null', 'desc': 'Boolean - DocAtTime'},
    'Boolean_QProc_TermAtTime': {'info': 'BOOLEAN', 'dstore': 'CUSTOM', 'compr': 'NONE', 'qproc': 'TERMatat', 'optim': 'Null', 'desc': 'Boolean - TermAtTime'},
    'TFIDF_Basic': {'info': 'TFIDF', 'dstore': 'CUSTOM', 'compr': 'NONE', 'qproc': 'TERMatat', 'optim': 'Null', 'desc': 'TF-IDF - Basic'},
    'TFIDF_Custom_GapVByte_TermAtTime': {'info': 'TFIDF', 'dstore': 'CUSTOM', 'compr': 'CODE', 'qproc': 'TERMatat', 'optim': 'Null', 'desc': 'TF-IDF - Custom - GapVByte - TermAtTime'},
    'TFIDF_Redis_GapVByte_TermAtTime_Thresh': {'info': 'TFIDF', 'dstore': 'DB2', 'compr': 'CODE', 'qproc': 'TERMatat', 'optim': 'Thresholding', 'desc': 'TF-IDF - Redis - GapVByte - TermAtTime - Threshold'},
    'TFIDF_SQLite_Zlib_DocAtTime': {'info': 'TFIDF', 'dstore': 'DB1', 'compr': 'CLIB', 'qproc': 'DOCatat', 'optim': 'Null', 'desc': 'TF-IDF - SQLite - Zlib - DocAtTime'},
    'WordCount_Basic': {'info': 'WORDCOUNT', 'dstore': 'CUSTOM', 'compr': 'NONE', 'qproc': 'TERMatat', 'optim': 'Null', 'desc': 'WordCount - Basic'},
    'WordCount_SQLite_NoCompr': {'info': 'WORDCOUNT', 'dstore': 'DB1', 'compr': 'NONE', 'qproc': 'TERMatat', 'optim': 'Null', 'desc': 'WordCount - SQLite - No compression'},
}

# Query presets
QUERY_PRESETS = {
    'political': {
        'queries': ['anarchism', 'political philosophy', 'socialism', 'democracy', 'government', 'authority', 'freedom', 'society'],
        'desc': 'Political philosophy queries'
    },
    'technology': {
        'queries': ['artificial intelligence', 'machine learning', 'neural networks', 'deep learning', 'computer', 'data', 'algorithm', 'system'],
        'desc': 'Technology queries'
    },
    'diverse': {
        'queries': ['philosophy', 'science', 'history', 'culture', 'society', 'technology', 'politics', 'economics'],
        'desc': 'Diverse general queries'
    },
    'simple': {
        'queries': ['the', 'and', 'of', 'to', 'in', 'is', 'it', 'you', 'that', 'he'],
        'desc': 'Simple common words'
    },
}

print("✅ Configuration presets loaded")

✅ Configuration presets loaded


In [4]:
# Create interactive widgets
dataset_dropdown = widgets.Dropdown(
    options=[(v['desc'], k) for k, v in DATASET_PRESETS.items()],
    value='wikipedia_small',
    description='Dataset:',
    style={'description_width': '120px'}
)

index_dropdown = widgets.Dropdown(
    options=[(v['desc'], k) for k, v in INDEX_VARIANTS.items()],
    value='Boolean_Basic',  # Changed from 'Boolean_Custom_NoCompr' to a valid key
    description='Index Type:',
    style={'description_width': '120px'}
)

query_dropdown = widgets.Dropdown(
    options=[(v['desc'], k) for k, v in QUERY_PRESETS.items()],
    value='political',
    description='Queries:',
    style={'description_width': '120px'}
)

custom_queries_text = widgets.Textarea(
    value='',
    placeholder='Enter custom queries (one per line)',
    description='Custom Queries:',
    rows=4,
    style={'description_width': '120px'}
)

# Buttons
create_index_button = widgets.Button(
    description='🔨 Create Index',
    button_style='info',
)

measure_metrics_button = widgets.Button(
    description='📊 Measure Metrics',
    button_style='success',
)

clear_button = widgets.Button(
    description='🗑️  Clear',
    button_style='warning',
)

output_area = widgets.Output()

# Display configuration panel
config_box = widgets.VBox([
    widgets.HTML("<h3>1️⃣  Select Configuration</h3>"),
    dataset_dropdown,
    index_dropdown,
    widgets.HTML("<h3>2️⃣  Select Queries</h3>"),
    query_dropdown,
    widgets.HTML("<h4>Or enter custom queries (one per line):</h4>"),
    custom_queries_text,
    widgets.HTML("<h3>3️⃣  Execute</h3>"),
    widgets.HBox([create_index_button, measure_metrics_button, clear_button]),
])

display(config_box)
display(output_area)

print("✅ Configuration panel ready")

✅ Configuration panel ready


In [5]:
# Session state management
class IndexingSession:
    def __init__(self):
        self.current_index = None
        self.current_config = None
        self.current_data = None
        self.metrics = {}
        self.query_results = {}
    
    def clear(self):
        self.current_index = None
        self.current_config = None
        self.current_data = None
        self.metrics = {}
        self.query_results = {}

session = IndexingSession()
print("✅ Session manager initialized")

✅ Session manager initialized


---

## 📚 Part 2: Data Loading

In [6]:
def load_preprocessed_dataset(max_docs=None):
    """Load the preprocessed dataset from ElasticSearch.ipynb"""
    try:
        # pandas is already imported as pd in the imports cell
        df = pd.read_csv('dataset/preprocessed_dataset.csv')
        if max_docs:
            df = df.head(max_docs)
        
        # Use original_text for indexing (not processed_tokens)
        docs = [(str(row['id']), str(row['original_text'])) for _, row in df.iterrows()]
        print(f"✅ Loaded {len(docs)} preprocessed documents from dataset/preprocessed_dataset.csv")
        return docs
    except FileNotFoundError:
        print("⚠️  Preprocessed dataset not found. Using fallback data.")
        return load_political_data()
    except Exception as e:
        print(f"⚠️  Error loading preprocessed dataset: {e}. Using fallback data.")
        return load_political_data()

def load_political_data():
    """Load custom political philosophy data"""
    docs = [
        ('1', 'anarchism is a political philosophy and movement skeptical of all authority'),
        ('2', 'marxism represents socialist economic theories focused on worker liberation'),
        ('3', 'capitalism is an economic system based on private property and markets'),
        ('4', 'democracy is a form of government in which power rests with the people'),
        ('5', 'socialism advocates for collective ownership of production means'),
        ('6', 'fascism is an authoritarian far-right form of government'),
        ('7', 'libertarianism emphasizes individual liberty and limited government'),
        ('8', 'liberalism emphasizes individual rights and freedoms'),
        ('9', 'conservatism emphasizes tradition and gradual change'),
        ('10', 'communism seeks a classless society with common ownership'),
    ]
    print(f"✅ Loaded {len(docs)} custom political documents")
    return docs

def load_wikipedia_data(num_docs=100):
    """Load Wikipedia data from preprocessed dataset"""
    docs = load_preprocessed_dataset(num_docs)
    if len(docs) < num_docs:
        print(f"⚠️  Only {len(docs)} documents available, requested {num_docs}")
    return docs

def load_data(dataset_key):
    """Load dataset based on preset"""
    preset = DATASET_PRESETS[dataset_key]
    
    if 'custom' in dataset_key:
        return load_political_data()
    else:
        return load_wikipedia_data(preset['num_docs'])

print("✅ Data loading functions defined")

✅ Data loading functions defined


---

## 🔨 Part 3: Indexing Functions

In [7]:
def preprocess_text(text):
    """Preprocess text for indexing"""
    try:
        stop_words = set(stopwords.words('english'))
        stemmer = PorterStemmer()
        
        text = text.lower()
        text = text.translate(str.maketrans('', '', string.punctuation))
        tokens = word_tokenize(text)
        tokens = [stemmer.stem(word) for word in tokens if word not in stop_words and word.isalpha()]
        return tokens
    except Exception as e:
        print(f"⚠️  Preprocessing error: {e}")
        return []

def create_index_from_config(dataset_key, variant_key, output_area):
    """Create index with specified configuration"""
    
    with output_area:
        print("\n" + "="*70)
        print("🔨 CREATING INDEX")
        print("="*70)
        
        try:
            # Load data
            print(f"\n1️⃣  Loading data...")
            documents = load_data(dataset_key)
            
            if not documents:
                print("❌ No documents loaded")
                return None
            
            # Prepare documents
            print(f"2️⃣  Preparing {len(documents)} documents...")
            prepared_docs = [(doc_id, text) for doc_id, text in documents]
            
            # Create index
            print(f"3️⃣  Creating index with variant: {INDEX_VARIANTS[variant_key]['desc']}")
            
            config = INDEX_VARIANTS[variant_key]
            
            # Import here to avoid circular imports
            from self_index import SelfIndex, create_self_index
            
            # Index documents using the helper function
            start_time = time.time()
            index_name = f"index_{dataset_key}_{variant_key}_{int(time.time())}"
            
            index = create_self_index(
                index_id=index_name,
                files=prepared_docs,
                info=config['info'],
                dstore=config['dstore'],
                qproc=config.get('qproc', 'TERMatat'),
                compr=config['compr'],
                optim=config.get('optim', 'Null')
            )
            
            index_time = time.time() - start_time
            
            print(f"4️⃣  Index created in {index_time:.3f} seconds")
            print(f"   Index ID: {index.identifier_short}")
            print(f"   Vocabulary size: {len(index.vocabulary)} terms")
            print(f"   Documents indexed: {index.num_docs}")
            
            # Save session state
            session.current_index = index
            session.current_config = {
                'dataset': dataset_key,
                'variant': variant_key,
                'num_docs': len(documents),
                'vocab_size': len(index.vocabulary),
                'index_time': index_time,
                'index_name': index_name
            }
            session.current_data = prepared_docs
            
            print(f"\n✅ Index created successfully!")
            return index
            
        except Exception as e:
            print(f"❌ Error creating index: {e}")
            traceback.print_exc()
            return None

print("✅ Indexing functions defined")

✅ Indexing functions defined


---

## 🔍 Part 4: Query & Metrics Functions

In [8]:
def run_queries(index, queries, output_area):
    """Run queries and measure latency"""
    
    with output_area:
        print("\n" + "="*70)
        print("🔍 RUNNING QUERIES")
        print("="*70)
        
        if not index:
            print("❌ No index available")
            return {}
        
        results = {}
        latencies = []
        
        print(f"\n📋 Running {len(queries)} queries...\n")
        
        for i, query in enumerate(tqdm(queries, desc="Executing queries"), 1):
            try:
                start_time = time.time()
                result_json = index.query(query)
                query_time = (time.time() - start_time) * 1000
                
                result = json.loads(result_json)
                
                results[query] = {
                    'num_results': result['num_results'],
                    'time_ms': query_time,
                    'top_result': result['results'][0]['doc_id'] if result['results'] else None,
                }
                
                latencies.append(query_time)
                
            except Exception as e:
                results[query] = {'error': str(e)}
        
        # Calculate statistics
        if latencies:
            stats_dict = {
                'mean_latency': statistics.mean(latencies),
                'median_latency': statistics.median(latencies),
                'p95_latency': np.percentile(latencies, 95),
                'p99_latency': np.percentile(latencies, 99),
            }
            
            print(f"\n📊 Query Statistics:")
            print(f"   Mean: {stats_dict['mean_latency']:.2f} ms")
            print(f"   P95: {stats_dict['p95_latency']:.2f} ms ⭐")
            print(f"   P99: {stats_dict['p99_latency']:.2f} ms ⭐")
            
            session.metrics['latency'] = stats_dict
        
        print(f"\n✅ Queries executed successfully")
        session.query_results = results
        return results

def measure_metrics(index, queries, output_area):
    """Measure Metrics A, B, C, D"""
    
    with output_area:
        print("\n" + "="*70)
        print("📊 MEASURING PERFORMANCE METRICS")
        print("="*70)
        
        if not index:
            print("❌ No index available")
            return
        
        # Metric A: Latency
        print(f"\n🅰️  METRIC A: LATENCY (Response Time)")
        latencies = []
        for query in queries[:10]:
            try:
                start_time = time.time()
                index.query(query)
                latencies.append((time.time() - start_time) * 1000)
            except Exception:
                # Skip failing queries; a bare except would also swallow KeyboardInterrupt
                pass
        
        if latencies:
            metric_a = {
                'mean': statistics.mean(latencies),
                'p95': np.percentile(latencies, 95),
                'p99': np.percentile(latencies, 99)
            }
            print(f"   Mean: {metric_a['mean']:.2f} ms")
            print(f"   P95: {metric_a['p95']:.2f} ms")
            print(f"   P99: {metric_a['p99']:.2f} ms")
            session.metrics['a_latency'] = metric_a
        
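        # Metric B: Throughput
        print(f"\n🅱️  METRIC B: THROUGHPUT (Queries/Second)")
        query_count = 0  # successful queries only
        attempts = 0     # every attempt, so a failing query is not retried for the whole window
        start_time = time.time()
        
        # An empty query list skips the loop instead of raising ZeroDivisionError
        while queries and (time.time() - start_time) < 5:
            query = queries[attempts % len(queries)]
            attempts += 1
            try:
                index.query(query)
                query_count += 1
            except Exception:
                pass
        
        elapsed = time.time() - start_time
        metric_b = query_count / elapsed if elapsed > 0 else 0.0
        print(f"   Throughput: {metric_b:.2f} qps")
        session.metrics['b_throughput'] = metric_b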
        
        # Metric C: Memory
        print(f"\n🅲  METRIC C: MEMORY FOOTPRINT")
        process = psutil.Process()
        memory_mb = process.memory_info().rss / 1024 / 1024
        vocab_size = len(index.vocabulary)
        print(f"   Memory: {memory_mb:.2f} MB")
        print(f"   Vocabulary: {vocab_size} terms")
        session.metrics['c_memory'] = {'memory_mb': memory_mb, 'vocab_size': vocab_size}
        
        # Metric D: Functional
        print(f"\n🅳  METRIC D: FUNCTIONAL METRICS")
        results_with_data = sum(1 for r in session.query_results.values() if r.get('num_results', 0) > 0)
        coverage = results_with_data / len(session.query_results) if session.query_results else 0
        print(f"   Queries with results: {results_with_data}/{len(session.query_results)}")
        print(f"   Coverage: {coverage:.1%}")
        session.metrics['d_functional'] = {'coverage': coverage}
        
        print(f"\n✅ All metrics measured")

print("✅ Query and metrics functions defined")

✅ Query and metrics functions defined


---

## 🎨 Part 5: Visualization

In [9]:
def visualize_results(output_area):
    """Create comprehensive visualization"""
    
    with output_area:
        if not session.query_results:
            print("❌ No query results to visualize")
            return
        
        print("\n" + "="*70)
        print("📈 VISUALIZATION")
        print("="*70)
        
        fig = plt.figure(figsize=(16, 10))
        gs = GridSpec(2, 2, figure=fig, hspace=0.3, wspace=0.3)
        
        # Query results distribution
        ax1 = fig.add_subplot(gs[0, 0])
        queries = list(session.query_results.keys())[:10]
        results_count = [session.query_results[q].get('num_results', 0) for q in queries]
        ax1.barh(range(len(queries)), results_count, color='steelblue')
        ax1.set_yticks(range(len(queries)))
        ax1.set_yticklabels([q[:20] for q in queries], fontsize=9)
        ax1.set_xlabel('Number of Results')
        ax1.set_title('📊 Query Results Distribution')
        ax1.grid(axis='x', alpha=0.3)
        
        # Latency distribution
        ax2 = fig.add_subplot(gs[0, 1])
        latencies = [session.query_results[q].get('time_ms', 0) for q in queries]
        ax2.bar(range(len(queries)), latencies, color='coral')
        ax2.set_xticks(range(len(queries)))
        ax2.set_xticklabels([q[:10] for q in queries], rotation=45, fontsize=8)
        ax2.set_ylabel('Time (ms)')
        ax2.set_title('⏱️  Query Latency Distribution')
        ax2.grid(axis='y', alpha=0.3)
        
        # Metrics summary
        ax3 = fig.add_subplot(gs[1, :])
        ax3.axis('off')
        
        metrics_text = "📊 PERFORMANCE METRICS SUMMARY\n\n"
        
        if 'a_latency' in session.metrics:
            m = session.metrics['a_latency']
            metrics_text += f"🅰️  Latency: Mean={m['mean']:.2f}ms, P95={m['p95']:.2f}ms, P99={m['p99']:.2f}ms\n\n"
        
        if 'b_throughput' in session.metrics:
            metrics_text += f"🅱️  Throughput: {session.metrics['b_throughput']:.2f} qps\n\n"
        
        if 'c_memory' in session.metrics:
            m = session.metrics['c_memory']
            metrics_text += f"🅲  Memory: {m['memory_mb']:.2f}MB, Vocab: {m['vocab_size']} terms\n\n"
        
        if 'd_functional' in session.metrics:
            m = session.metrics['d_functional']
            metrics_text += f"🅳  Coverage: {m['coverage']:.1%}"
        
        ax3.text(0.05, 0.5, metrics_text, fontsize=11, family='monospace',
                verticalalignment='center', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
        
        plt.suptitle("🎛️ Interactive SelfIndex Results Dashboard", fontsize=14, fontweight='bold')
        plt.tight_layout()
        plt.show()
        
        print("\n✅ Visualization complete")

print("✅ Visualization functions defined")

✅ Visualization functions defined


---

## 🔘 Part 6: Button Event Handlers

In [10]:
def on_create_index_clicked(b):
    """Handle create index button click"""
    with output_area:
        clear_output(wait=True)
        
        dataset_key = dataset_dropdown.value
        variant_key = index_dropdown.value
        
        index = create_index_from_config(dataset_key, variant_key, output_area)
        
        if index:
            # Get queries
            if custom_queries_text.value.strip():
                queries = [q.strip() for q in custom_queries_text.value.split('\n') if q.strip()]
            else:
                query_key = query_dropdown.value
                queries = QUERY_PRESETS[query_key]['queries']
            
            # Run queries
            run_queries(index, queries, output_area)

def on_measure_metrics_clicked(b):
    """Handle measure metrics button click"""
    with output_area:
        if not session.current_index:
            print("❌ No index created yet. Please create index first.")
            return
        
        # Get queries
        if custom_queries_text.value.strip():
            queries = [q.strip() for q in custom_queries_text.value.split('\n') if q.strip()]
        else:
            query_key = query_dropdown.value
            queries = QUERY_PRESETS[query_key]['queries']
        
        # Measure metrics
        measure_metrics(session.current_index, queries, output_area)
        
        # Visualize
        visualize_results(output_area)

def on_clear_clicked(b):
    """Handle clear button click"""
    session.clear()
    with output_area:
        clear_output(wait=True)
        # Print inside the output widget so the message is shown in the panel
        print("🗑️  Session cleared")

# Attach event handlers
create_index_button.on_click(on_create_index_clicked)
measure_metrics_button.on_click(on_measure_metrics_clicked)
clear_button.on_click(on_clear_clicked)

print("✅ Event handlers attached")

✅ Event handlers attached


---

## 📝 Usage Guide

### Quick Start:

1. **Select Configuration** (from dropdowns above)
   - Choose dataset (Wikipedia small/medium/large/xlarge or custom)
   - Select index variant (Boolean, WordCount, or TF-IDF)
   - Select query preset or enter custom queries

2. **Create Index**
   - Click "🔨 Create Index" button
   - Indexes the dataset with selected configuration
   - Automatically runs queries

3. **Measure Metrics**
   - Click "📊 Measure Metrics" button
   - Measures Metrics A (latency), B (throughput), C (memory), D (functional)
   - Shows visualization dashboard

4. **Compare Variants**
   - Change index type or dataset
   - Click "🔨 Create Index" again
   - Compare results across runs
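
Comparing runs by eye gets tedious after a few variants. One possible way to keep a side-by-side record is sketched below. It assumes you pass in the `session.current_config` and `session.metrics` dictionaries after each run; `record_run` and `comparison_table` are hypothetical helpers for illustration and are not part of the notebook.

```python
# Hypothetical helpers for collecting one row per run; not part of the notebook.
runs = []

def record_run(config, metrics):
    """Store the fields worth comparing from one index/measure cycle."""
    latency = metrics.get('a_latency', {})
    runs.append({
        'variant': config.get('variant'),
        'dataset': config.get('dataset'),
        'index_time_s': config.get('index_time'),
        'p95_ms': latency.get('p95'),
        'qps': metrics.get('b_throughput'),
    })

def comparison_table(rows):
    """Render the recorded runs as a plain-text table, lowest P95 first."""
    # Runs without a P95 value sort to the end
    ordered = sorted(rows, key=lambda r: (r['p95_ms'] is None, r['p95_ms'] or 0.0))
    lines = [f"{'variant':<28}{'dataset':<20}{'p95_ms':>10}{'qps':>10}"]
    for r in ordered:
        p95 = '-' if r['p95_ms'] is None else f"{r['p95_ms']:.2f}"
        qps = '-' if r['qps'] is None else f"{r['qps']:.2f}"
        lines.append(f"{str(r['variant']):<28}{str(r['dataset']):<20}{p95:>10}{qps:>10}")
    return "\n".join(lines)
```

After each "Measure Metrics" click, calling `record_run(session.current_config, session.metrics)` and then `print(comparison_table(runs))` would show every run so far in one table.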

### Metrics Explained:

- **🅰️ Metric A**: Latency (mean, p95, p99 response times in ms)
- **🅱️ Metric B**: Throughput (queries executed per second)
- **🅲 Metric C**: Memory (process memory and vocabulary size)
- **🅳 Metric D**: Functional (query coverage percentage)
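
To make the latency and throughput figures concrete, here is a small sketch of how such numbers can be derived from a list of per-query timings. It uses only the standard library. The notebook itself calls `np.percentile`; the `percentile` helper below is a stand-in that assumes the same linear-interpolation convention, and the sample timings are invented for illustration.

```python
import statistics

def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between ranks."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

def summarize(latencies_ms):
    """Metric A style summary plus the throughput implied by running sequentially."""
    total_seconds = sum(latencies_ms) / 1000.0
    return {
        'mean': statistics.mean(latencies_ms),
        'p95': percentile(latencies_ms, 95),
        'p99': percentile(latencies_ms, 99),
        'qps': len(latencies_ms) / total_seconds if total_seconds > 0 else 0.0,
    }

# Illustrative timings in milliseconds (not measured values)
sample = [1.0, 2.0, 3.0, 4.0, 10.0]
summary = summarize(sample)
print(summary)
```

With only a handful of queries, P95 and P99 both fall between the two slowest samples, so on the 8 to 10 query presets they are best read as rough indicators rather than stable tail estimates.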

### Custom Queries:

Enter your own queries one per line in the "Custom Queries" text box. Examples:
- `anarchism`
- `political philosophy`
- `machine learning`
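
The text box is split on newlines, and blank lines and surrounding whitespace are dropped; if nothing remains, the selected preset is used. A minimal sketch of that behaviour, mirroring the logic in the button handlers: `resolve_queries` is a hypothetical name, and the preset list is passed in explicitly so the snippet stands alone.

```python
def resolve_queries(custom_text, preset_queries):
    """Return custom queries (one per non-blank line), or fall back to the preset."""
    custom = [line.strip() for line in custom_text.split('\n') if line.strip()]
    return custom if custom else list(preset_queries)

preset = ['anarchism', 'democracy']
print(resolve_queries("machine learning\n\n  political philosophy  \n", preset))
# ['machine learning', 'political philosophy']
print(resolve_queries("   ", preset))
# ['anarchism', 'democracy']
```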