feat(benchmarks): Add comprehensive temporal reasoning and vector benchmarks #113
Merged
…chmarks

Introduces a new benchmark suite in /examples/benchmarks/ with:

Temporal Reasoning (based on TimePuzzles arXiv:2601.07148):
- TemporalPuzzle and TemporalConstraint types for date inference
- TemporalSolver with tool augmentation (calendar math, web search)
- TimePuzzles generator with configurable difficulty and cross-cultural events
- Sample puzzle sets (easy, medium, hard, cross-cultural)

Vector Index with IVF:
- DenseVec with cosine similarity and SIMD-friendly operations
- VectorIndex with IVF-style coarse quantization (k-means centroids)
- CoherenceGate for filtering based on external signals (mincut coherence)
- Persistence using serde + bincode

Swarm Controller Regret Tracking:
- RegretTracker for episode-based sublinear regret computation
- OracleBaseline for computing optimal rewards
- SwarmController with R_k/k metric tracking
- Sublinearity detection and trend analysis

Logging Schema:
- BenchmarkLogger with structured JSON logging
- Log entry types: Temporal, Vector, Swarm, Tool, System
- LogReader with aggregation capabilities

Binaries:
- temporal-benchmark: Run TimePuzzles-style probes
- vector-benchmark: Benchmark IVF search performance
- swarm-regret: Track sublinear regret across episodes
- timepuzzle-runner: Quick 10-minute probe for agent testing

Includes comprehensive integration tests.
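The R_k/k metric listed under Swarm Controller Regret Tracking can be illustrated with a small sketch. It assumes that cumulative regret R_k is the sum over episodes of (oracle reward minus achieved reward), so regret is sublinear when R_k/k tends toward zero. The type name RegretTracker comes from the commit message; the fields and method names below are hypothetical and are not taken from the PR's code.

```rust
/// Hypothetical sketch of an episode-based regret tracker.
/// Only the name `RegretTracker` appears in the PR; everything else is assumed.
#[derive(Debug, Default)]
pub struct RegretTracker {
    /// R_k: sum of (oracle reward - achieved reward) over all episodes so far.
    cumulative: f64,
    /// R_k / k recorded after each episode, kept for later trend analysis.
    average_history: Vec<f64>,
}

impl RegretTracker {
    /// Records one episode and returns the updated average regret R_k / k.
    pub fn record_episode(&mut self, oracle_reward: f64, actual_reward: f64) -> f64 {
        self.cumulative += oracle_reward - actual_reward;
        let k = (self.average_history.len() + 1) as f64;
        let average = self.cumulative / k;
        self.average_history.push(average);
        average
    }

    /// Returns R_k, the cumulative regret so far.
    pub fn cumulative_regret(&self) -> f64 {
        self.cumulative
    }

    /// Returns the full R_k / k series, one entry per episode.
    pub fn average_history(&self) -> &[f64] {
        &self.average_history
    }
}
```

If the agent's reward approaches the oracle's over time, each new episode adds less to R_k and the R_k/k series falls, which is the signal the sublinearity detection looks for.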
Vector Index Optimizations:
- Added 4-wide loop unrolling for l2_norm() enabling auto-vectorization
- Added 4-wide loop unrolling for dot() product
- Achieved 83% improvement in queries/sec (1,929 → 3,527)
- Achieved 119% improvement in insert/sec (4,151 → 9,082)
- Reduced P50 latency by 46% (497µs → 267µs)
- Reduced P99 latency by 39% (718µs → 439µs)

TimePuzzles Bug Fixes:
- Fixed reference anchor not being added when RelativeToAnchor constraints are generated
- Changed generate_constraint to generate_constraint_with_anchor to return both constraint and anchor
- Ensures all relative constraints have their required references populated
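As an illustration of the 4-wide unrolling described above, here is a minimal sketch of a dot product and L2 norm with four independent accumulators. The function names dot_unrolled and l2_norm_unrolled are hypothetical, and the PR's actual dot() and l2_norm() may be structured differently.

```rust
/// Dot product with a manually 4-wide unrolled loop (illustrative sketch).
/// Four independent accumulators remove the serial dependency on a single
/// running sum, which gives the compiler room to auto-vectorize the loop.
pub fn dot_unrolled(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");

    let mut acc = [0.0f32; 4];
    let mut chunks_a = a.chunks_exact(4);
    let mut chunks_b = b.chunks_exact(4);

    // Main loop: process four lanes per iteration.
    for (ca, cb) in chunks_a.by_ref().zip(chunks_b.by_ref()) {
        acc[0] += ca[0] * cb[0];
        acc[1] += ca[1] * cb[1];
        acc[2] += ca[2] * cb[2];
        acc[3] += ca[3] * cb[3];
    }

    // Combine the lanes, then handle the 0..=3 leftover elements.
    let mut sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);
    for (x, y) in chunks_a.remainder().iter().zip(chunks_b.remainder()) {
        sum += x * y;
    }
    sum
}

/// L2 norm expressed through the unrolled dot product.
pub fn l2_norm_unrolled(v: &[f32]) -> f32 {
    dot_unrolled(v, v).sqrt()
}
```

Because the lanes are summed in a different order than a plain sequential loop, results can differ from the naive version in the last few bits of floating-point precision.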
Advanced Optimizations:
- Added rayon dependency for parallel processing
- Implemented parallel batch search (search_batch)
- Added parallel flat search (search_flat_parallel) for large indices
- Implemented adaptive IVF probing (search_adaptive) with dynamic probe count
- Added search_ivf_adaptive with minimum candidates threshold

Norm Caching:
- Added cached_norm field to DenseVec with serde skip
- Implemented compute_norm() for explicit caching
- Added with_norm() constructor for precomputed norms
- Added random_normalized() for unit vectors
- Updated scale() to maintain cache when possible
- Updated normalize() to cache norm as 1.0
- Added invalidate_cache() for manual cache invalidation
- add_scaled() now properly invalidates cache

Performance Results (10K vectors, 128D):
- Queries/sec: 1,929 → 3,331 (+73%)
- Insert/sec: 4,151 → 8,526 (+105%)
- P50 latency: 497µs → 275µs (-45%)
- P99 latency: 718µs → 485µs (-32%)
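The norm-caching rules listed above can be sketched as follows. The method names (compute_norm, with_norm, scale, normalize, add_scaled, invalidate_cache) and the cached_norm field come from the commit message; the signatures, the Option<f32> representation, and the cached_norm() accessor are assumptions. The serde skip attribute is left out so the sketch has no external dependencies.

```rust
/// Illustrative sketch of a dense vector with a cached L2 norm.
#[derive(Debug, Clone)]
pub struct DenseVec {
    data: Vec<f32>,
    /// Assumed representation: None means "not computed or invalidated".
    /// The PR marks this field as skipped by serde; omitted here.
    cached_norm: Option<f32>,
}

impl DenseVec {
    pub fn new(data: Vec<f32>) -> Self {
        Self { data, cached_norm: None }
    }

    /// Constructor for callers that already know the norm.
    pub fn with_norm(data: Vec<f32>, norm: f32) -> Self {
        Self { data, cached_norm: Some(norm) }
    }

    /// Returns the norm, computing and caching it on first use.
    pub fn compute_norm(&mut self) -> f32 {
        if let Some(norm) = self.cached_norm {
            return norm;
        }
        let norm = self.data.iter().map(|x| x * x).sum::<f32>().sqrt();
        self.cached_norm = Some(norm);
        norm
    }

    /// Scaling keeps the cache valid because ||c * v|| = |c| * ||v||.
    pub fn scale(&mut self, factor: f32) {
        for x in &mut self.data {
            *x *= factor;
        }
        self.cached_norm = self.cached_norm.map(|n| n * factor.abs());
    }

    /// Normalizes in place and caches the norm as 1.0 (zero vectors are left alone).
    pub fn normalize(&mut self) {
        let norm = self.compute_norm();
        if norm > 0.0 {
            for x in &mut self.data {
                *x /= norm;
            }
            self.cached_norm = Some(1.0);
        }
    }

    /// self += factor * other; the new norm is unknown, so the cache is dropped.
    pub fn add_scaled(&mut self, other: &DenseVec, factor: f32) {
        assert_eq!(self.data.len(), other.data.len(), "dimension mismatch");
        for (x, y) in self.data.iter_mut().zip(&other.data) {
            *x += factor * y;
        }
        self.invalidate_cache();
    }

    pub fn invalidate_cache(&mut self) {
        self.cached_norm = None;
    }

    /// Hypothetical accessor, added here only to make the cache observable.
    pub fn cached_norm(&self) -> Option<f32> {
        self.cached_norm
    }
}
```

The main design point is that every mutating method either updates the cached value in closed form or clears it, so a stale norm can never be read by cosine-similarity code.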
- Fixed determine_search_range to use rewritten constraints
- Rewrite constraints BEFORE determining search range
- This ensures relative constraints (DaysAfter, etc.) are resolved to explicit dates before search range calculation
- Added tempfile dev-dependency for tests
- All 13 tests now pass
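To show why the ordering matters, here is a deliberately simplified sketch. Dates are modelled as integer day numbers, and the Constraint enum, rewrite_constraints, and the anchor map are hypothetical stand-ins; only the names DaysAfter and determine_search_range appear in the commit message, and the PR's real types are likely richer.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified constraint type (days as integers).
#[derive(Debug, Clone, PartialEq)]
pub enum Constraint {
    Exact(i64),
    After(i64),
    Before(i64),
    /// Relative constraint: target day = anchor day + `days`.
    DaysAfter { anchor: String, days: i64 },
}

/// Resolves relative constraints into explicit ones where the anchor is known.
pub fn rewrite_constraints(
    constraints: &[Constraint],
    anchors: &HashMap<String, i64>,
) -> Vec<Constraint> {
    constraints
        .iter()
        .map(|c| match c {
            Constraint::DaysAfter { anchor, days } => match anchors.get(anchor) {
                Some(day) => Constraint::Exact(day + days),
                None => c.clone(), // unknown anchor: leave unresolved
            },
            other => other.clone(),
        })
        .collect()
}

/// Narrows an inclusive (lo, hi) day range using explicit constraints only.
pub fn determine_search_range(constraints: &[Constraint], default: (i64, i64)) -> (i64, i64) {
    let (mut lo, mut hi) = default;
    for c in constraints {
        match c {
            Constraint::Exact(d) => {
                lo = lo.max(*d);
                hi = hi.min(*d);
            }
            Constraint::After(d) => lo = lo.max(*d + 1),
            Constraint::Before(d) => hi = hi.min(*d - 1),
            // An unresolved relative constraint cannot narrow the range.
            Constraint::DaysAfter { .. } => {}
        }
    }
    (lo, hi)
}
```

In this sketch, calling determine_search_range on the raw constraints ignores every DaysAfter entry and returns a range that is far too wide, while calling it on the output of rewrite_constraints narrows the range to the resolved date.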
Adds a new intelligence metrics module that measures cognitive capabilities:
- Capability scores: temporal reasoning, constraint satisfaction, planning
- Reasoning quality: logical coherence, solution optimality, efficiency
- Learning metrics: sample efficiency, regret sublinearity, generalization
- Tool use proficiency: selection, effectiveness, composition
- Meta-cognition: self-correction, strategy adaptation, progress monitoring

Includes intelligence-assessment binary for running full assessments with configurable episodes and detailed reporting.
Enhanced the learning metrics calculation:
- Use linear regression to detect regret trend slope
- Compare first half vs. second half average regret
- More accurate sublinearity detection for learning assessment
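The two signals named above can be sketched as a least-squares slope over per-episode regret plus a comparison of half-window means. The function names are hypothetical, and combining the signals with a logical AND is an assumption; the PR may weight or threshold them differently.

```rust
/// Least-squares slope of regret against episode index 0..n (illustrative sketch).
/// A negative slope means per-episode regret is trending down.
pub fn regret_slope(regrets: &[f64]) -> f64 {
    let n = regrets.len();
    if n < 2 {
        return 0.0; // a trend needs at least two points
    }
    let mean_x = (n as f64 - 1.0) / 2.0;
    let mean_y = regrets.iter().sum::<f64>() / n as f64;

    let mut covariance = 0.0;
    let mut variance = 0.0;
    for (i, &y) in regrets.iter().enumerate() {
        let dx = i as f64 - mean_x;
        covariance += dx * (y - mean_y);
        variance += dx * dx;
    }
    covariance / variance
}

fn mean(values: &[f64]) -> f64 {
    if values.is_empty() {
        0.0
    } else {
        values.iter().sum::<f64>() / values.len() as f64
    }
}

/// Average regret of the first and second half of the run.
pub fn half_means(regrets: &[f64]) -> (f64, f64) {
    let mid = regrets.len() / 2;
    (mean(&regrets[..mid]), mean(&regrets[mid..]))
}

/// Heuristic (assumed combination): regret looks sublinear when the trend
/// is negative and the second half averages lower than the first.
pub fn looks_sublinear(regrets: &[f64]) -> bool {
    let (first, second) = half_means(regrets);
    regret_slope(regrets) < 0.0 && second < first
}
```

Both checks are heuristics on a finite sample: they indicate that per-episode regret is falling, which is consistent with sublinear cumulative regret but does not prove it.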
Implements the learning loop described in lean-agentic design for achieving sublinear regret:
- Add ReasoningBank module with trajectory tracking, verdict judgment, and strategy optimization
- Add AdaptiveSolver that uses ReasoningBank to learn from experience
- Integrate lean-agentic crate for type theory infrastructure
- Update intelligence assessment with adaptive learning flag and progress reporting

Key features:
- Trajectory recording for all puzzle solutions
- Pattern learning from successful strategies
- Confidence calibration from historical accuracy
- Strategy selection based on puzzle characteristics
- Improvement tracking and sublinear regret verification

Results with 40 episodes:
- Sublinear regret: Yes ✓
- Regret trend: -0.1728 (decreasing ✓)
- Patterns learned: 20+
- Is improving: Yes ✓
ReasoningBank optimizations:
- Add pattern_index HashMap for O(1) lookups by (constraint_type, difficulty)
- Add constraint_frequency tracking for prioritization
- Add batch trajectory recording for parallel processing
- Index patterns at insertion time for fast retrieval

Vector index optimizations:
- Parallel centroid scoring using rayon for IVF search
- Optimized adaptive probe logic with early termination
- Better candidate collection with bounds checking
- Maintained P99 latency reduction

Benchmark results at 50k vectors:
- Insert: 1,342 vec/s
- Query: 808 q/s
- P99 latency: 2.0ms
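A minimal sketch of the indexing idea behind the ReasoningBank optimizations, assuming patterns live in a Vec and the index maps (constraint_type, difficulty) to positions in it. The names pattern_index and constraint_frequency come from the commit message; PatternStore, Pattern, its fields, and the key types (String, u8) are hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical learned pattern; the PR's real type is not shown in the source.
#[derive(Debug, Clone, PartialEq)]
pub struct Pattern {
    pub strategy: String,
    pub success_rate: f64,
}

/// Illustrative store that indexes patterns at insertion time.
#[derive(Default)]
pub struct PatternStore {
    patterns: Vec<Pattern>,
    /// (constraint_type, difficulty) -> positions in `patterns`.
    pattern_index: HashMap<(String, u8), Vec<usize>>,
    /// How often each constraint type has been seen, for prioritization.
    constraint_frequency: HashMap<String, usize>,
}

impl PatternStore {
    /// Stores a pattern and updates both maps in the same step,
    /// so retrieval never needs to scan `patterns`.
    pub fn insert(&mut self, constraint_type: &str, difficulty: u8, pattern: Pattern) {
        let id = self.patterns.len();
        self.patterns.push(pattern);
        self.pattern_index
            .entry((constraint_type.to_string(), difficulty))
            .or_default()
            .push(id);
        *self
            .constraint_frequency
            .entry(constraint_type.to_string())
            .or_insert(0) += 1;
    }

    /// Average-case O(1) hash lookup, then one step per matching pattern.
    pub fn lookup(&self, constraint_type: &str, difficulty: u8) -> Vec<&Pattern> {
        self.pattern_index
            .get(&(constraint_type.to_string(), difficulty))
            .map(|ids| ids.iter().map(|&i| &self.patterns[i]).collect::<Vec<_>>())
            .unwrap_or_default()
    }

    /// Number of patterns recorded for a constraint type.
    pub fn frequency(&self, constraint_type: &str) -> usize {
        self.constraint_frequency
            .get(constraint_type)
            .copied()
            .unwrap_or(0)
    }
}
```

The lookup allocates a String to build the key; a production version might avoid that with an enum or interned key, which is outside the scope of this sketch.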
ruvnet added a commit that referenced this pull request on Feb 20, 2026
Introduces a new benchmark suite in /examples/benchmarks/ with:
Temporal Reasoning (based on TimePuzzles arXiv:2601.07148):
Vector Index with IVF:
Swarm Controller Regret Tracking:
Logging Schema:
Binaries:
Includes comprehensive integration tests.