feat(benchmarks): Add comprehensive temporal reasoning and vector benchmarks #113

Merged
ruvnet merged 9 commits into main from claude/add-benchmarks-examples-nhkeM on Jan 15, 2026

@ruvnet (Owner) commented Jan 15, 2026

Introduces a new benchmark suite in /examples/benchmarks/ with:

Temporal Reasoning (based on TimePuzzles arXiv:2601.07148):

  • TemporalPuzzle and TemporalConstraint types for date inference
  • TemporalSolver with tool augmentation (calendar math, web search)
  • TimePuzzles generator with configurable difficulty and cross-cultural events
  • Sample puzzle sets (easy, medium, hard, cross-cultural)
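To make the date-inference idea concrete, here is a minimal sketch of how a puzzle and its constraints could be modelled. Only the names `TemporalPuzzle` and `TemporalConstraint` come from this PR; the variants, the `Day` alias (a plain day number instead of a calendar date), and the brute-force `solve` method are assumptions for illustration and may differ from the actual types.

```rust
/// Assumption: dates are plain day numbers (days since an arbitrary epoch),
/// so the sketch needs no calendar crate.
type Day = i64;

/// Hypothetical constraint variants; the real enum may differ.
#[derive(Debug, Clone, PartialEq)]
enum TemporalConstraint {
    /// Target date is strictly after the given day.
    After(Day),
    /// Target date is strictly before the given day.
    Before(Day),
    /// Target date falls on the given weekday (0..=6); day 0 is treated as weekday 0.
    Weekday(u8),
}

impl TemporalConstraint {
    fn is_satisfied_by(&self, day: Day) -> bool {
        match *self {
            TemporalConstraint::After(d) => day > d,
            TemporalConstraint::Before(d) => day < d,
            TemporalConstraint::Weekday(w) => day.rem_euclid(7) == w as i64,
        }
    }
}

struct TemporalPuzzle {
    constraints: Vec<TemporalConstraint>,
}

impl TemporalPuzzle {
    /// Brute-force search: every day in `range` that satisfies all constraints.
    fn solve(&self, range: std::ops::RangeInclusive<Day>) -> Vec<Day> {
        range
            .filter(|&day| self.constraints.iter().all(|c| c.is_satisfied_by(day)))
            .collect()
    }
}
```

A puzzle with `After(10)`, `Before(20)` and `Weekday(0)` has exactly one solution in the range `0..=30`, which is the property a generator would want to guarantee for well-posed puzzles.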

Vector Index with IVF:

  • DenseVec with cosine similarity and SIMD-friendly operations
  • VectorIndex with IVF-style coarse quantization (k-means centroids)
  • CoherenceGate for filtering based on external signals (mincut coherence)
  • Persistence using serde + bincode
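The IVF idea can be sketched as follows: each vector is stored in the inverted list of its most similar centroid, and a query scans only the `nprobe` most similar lists. This is an illustrative sketch, not the PR's `VectorIndex`; `IvfIndex`, its methods, and the choice to pass centroids in directly (instead of training them with k-means) are assumptions.

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Cosine similarity; returns 0.0 if either vector has zero length.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let denom = dot(a, a).sqrt() * dot(b, b).sqrt();
    if denom == 0.0 { 0.0 } else { dot(a, b) / denom }
}

/// Hypothetical IVF-style index. Centroids are supplied up front here;
/// a real index would typically train them with k-means.
struct IvfIndex {
    centroids: Vec<Vec<f32>>,
    /// One inverted list of (id, vector) pairs per centroid.
    lists: Vec<Vec<(usize, Vec<f32>)>>,
}

impl IvfIndex {
    fn new(centroids: Vec<Vec<f32>>) -> Self {
        let lists = vec![Vec::new(); centroids.len()];
        Self { centroids, lists }
    }

    /// Centroid indices ordered from most to least similar to `v`.
    fn ranked_centroids(&self, v: &[f32]) -> Vec<usize> {
        let mut order: Vec<usize> = (0..self.centroids.len()).collect();
        order.sort_by(|&a, &b| {
            cosine(&self.centroids[b], v).total_cmp(&cosine(&self.centroids[a], v))
        });
        order
    }

    /// Stores `v` in the list of its most similar centroid.
    /// Panics if the index has no centroids.
    fn insert(&mut self, id: usize, v: Vec<f32>) {
        let best = self.ranked_centroids(&v)[0];
        self.lists[best].push((id, v));
    }

    /// Scans the `nprobe` most similar lists and returns the top `k` hits.
    fn search(&self, query: &[f32], k: usize, nprobe: usize) -> Vec<(usize, f32)> {
        let mut hits: Vec<(usize, f32)> = self
            .ranked_centroids(query)
            .into_iter()
            .take(nprobe)
            .flat_map(|c| self.lists[c].iter())
            .map(|(id, v)| (*id, cosine(v, query)))
            .collect();
        hits.sort_by(|a, b| b.1.total_cmp(&a.1));
        hits.truncate(k);
        hits
    }
}
```

A small `nprobe` trades recall for speed: vectors whose list is not probed can never be returned, even if they would rank in the top `k`.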

Swarm Controller Regret Tracking:

  • RegretTracker for episode-based sublinear regret computation
  • OracleBaseline for computing optimal rewards
  • SwarmController with R_k/k metric tracking
  • Sublinearity detection and trend analysis
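The R_k/k metric can be illustrated with a small tracker: R_k is the cumulative gap between the oracle reward and the achieved reward after k episodes, and regret is sublinear when R_k/k shrinks as k grows. The struct below is a sketch under that definition; its fields, method names, and the simple "latest average below first average" check are assumptions, not the PR's `RegretTracker`.

```rust
/// Hypothetical tracker for cumulative regret R_k and the average R_k / k.
#[derive(Default)]
struct RegretTracker {
    cumulative: f64,
    /// R_k / k recorded after each episode k = 1, 2, ...
    average_history: Vec<f64>,
}

impl RegretTracker {
    /// Records one episode; per-episode regret is clamped at zero.
    fn record(&mut self, oracle_reward: f64, achieved_reward: f64) {
        self.cumulative += (oracle_reward - achieved_reward).max(0.0);
        let k = (self.average_history.len() + 1) as f64;
        self.average_history.push(self.cumulative / k);
    }

    /// Latest value of R_k / k, if any episode was recorded.
    fn average_regret(&self) -> Option<f64> {
        self.average_history.last().copied()
    }

    /// Crude heuristic: average regret is lower now than after the first episode.
    fn looks_sublinear(&self) -> bool {
        match (self.average_history.first(), self.average_history.last()) {
            (Some(first), Some(last)) if self.average_history.len() >= 2 => last < first,
            _ => false,
        }
    }
}
```

With per-episode regrets 1, 0.5, 0.25, 0.125 the average falls from 1.0 to 0.46875, whereas a constant regret of 1 per episode keeps R_k/k at 1.0 and is reported as not sublinear.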

Logging Schema:

  • BenchmarkLogger with structured JSON logging
  • Log entry types: Temporal, Vector, Swarm, Tool, System
  • LogReader with aggregation capabilities

Binaries:

  • temporal-benchmark: Run TimePuzzles-style probes
  • vector-benchmark: Benchmark IVF search performance
  • swarm-regret: Track sublinear regret across episodes
  • timepuzzle-runner: Quick 10-minute probe for agent testing

Includes comprehensive integration tests.


Vector Index Optimizations:
- Added 4-wide loop unrolling for l2_norm() enabling auto-vectorization
- Added 4-wide loop unrolling for dot() product
- Achieved 83% improvement in queries/sec (1,929 → 3,527)
- Achieved 119% improvement in insert/sec (4,151 → 9,082)
- Reduced P50 latency by 46% (497µs → 267µs)
- Reduced P99 latency by 39% (718µs → 439µs)
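For readers unfamiliar with the technique, a 4-wide unrolled dot product might look like the sketch below. It keeps four independent accumulators so the compiler is free to vectorize the loop body. The function names are illustrative and the PR's actual `dot()` and `l2_norm()` may be written differently.

```rust
/// Dot product with four independent accumulators (4-wide unrolling).
/// Panics if the slices differ in length.
fn dot_unrolled(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let mut acc = [0.0f32; 4];
    let mut a_chunks = a.chunks_exact(4);
    let mut b_chunks = b.chunks_exact(4);
    for (ca, cb) in a_chunks.by_ref().zip(b_chunks.by_ref()) {
        acc[0] += ca[0] * cb[0];
        acc[1] += ca[1] * cb[1];
        acc[2] += ca[2] * cb[2];
        acc[3] += ca[3] * cb[3];
    }
    // Handle the 0..=3 elements that did not fill a whole chunk.
    let tail: f32 = a_chunks
        .remainder()
        .iter()
        .zip(b_chunks.remainder())
        .map(|(x, y)| x * y)
        .sum();
    (acc[0] + acc[1]) + (acc[2] + acc[3]) + tail
}

/// L2 norm expressed through the unrolled dot product.
fn l2_norm_unrolled(v: &[f32]) -> f32 {
    dot_unrolled(v, v).sqrt()
}
```

Because the four partial sums are combined in a different order than a sequential loop, results can differ from a naive implementation in the last few bits for general inputs.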

TimePuzzles Bug Fixes:
- Fixed reference anchor not being added when RelativeToAnchor constraints are generated
- Changed generate_constraint to generate_constraint_with_anchor to return both constraint and anchor
- Ensures all relative constraints have their required references populated
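The shape of this fix can be sketched as follows: the generator returns the anchor together with the constraint, so the caller can register the anchor before storing the constraint. Only the names `generate_constraint_with_anchor` and `RelativeToAnchor` come from the commit message; the signature, the other types, the fixed 30-day offset, and the `"anchor_event"` name are hypothetical.

```rust
use std::collections::HashMap;

type Day = i64; // assumption: dates as plain day numbers

enum Constraint {
    Exact(Day),
    RelativeToAnchor { anchor: String, offset_days: i64 },
}

struct Anchor {
    name: String,
    day: Day,
}

#[derive(Default)]
struct Puzzle {
    constraints: Vec<Constraint>,
    /// Named reference dates that relative constraints depend on.
    references: HashMap<String, Day>,
}

/// Hypothetical signature: returns the constraint plus the anchor it needs, if any.
fn generate_constraint_with_anchor(target: Day, relative: bool) -> (Constraint, Option<Anchor>) {
    if relative {
        let anchor = Anchor { name: "anchor_event".to_string(), day: target - 30 };
        let constraint = Constraint::RelativeToAnchor {
            anchor: anchor.name.clone(),
            offset_days: target - anchor.day,
        };
        (constraint, Some(anchor))
    } else {
        (Constraint::Exact(target), None)
    }
}

impl Puzzle {
    /// Registers the anchor (if any) before storing the constraint.
    fn push_generated(&mut self, target: Day, relative: bool) {
        let (constraint, anchor) = generate_constraint_with_anchor(target, relative);
        if let Some(a) = anchor {
            self.references.insert(a.name, a.day);
        }
        self.constraints.push(constraint);
    }

    /// Resolves a constraint to an explicit day; `None` if its anchor is missing.
    fn resolve(&self, c: &Constraint) -> Option<Day> {
        match c {
            Constraint::Exact(d) => Some(*d),
            Constraint::RelativeToAnchor { anchor, offset_days } => {
                self.references.get(anchor).map(|d| d + offset_days)
            }
        }
    }
}
```

Returning the pair from one function makes it hard to create a relative constraint whose anchor was never recorded, which is the failure mode the commit describes.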

Advanced Optimizations:
- Added rayon dependency for parallel processing
- Implemented parallel batch search (search_batch)
- Added parallel flat search (search_flat_parallel) for large indices
- Implemented adaptive IVF probing (search_adaptive) with dynamic probe count
- Added search_ivf_adaptive with minimum candidates threshold
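One plausible reading of "dynamic probe count with a minimum candidates threshold" is sketched below: keep probing inverted lists, most similar centroid first, until enough candidates have been gathered or a probe cap is hit. The function name and parameters are assumptions; the actual `search_adaptive` and `search_ivf_adaptive` logic is not shown in this PR text, and the rayon-based parallelism is left out.

```rust
/// Decides how many inverted lists to probe.
///
/// `ranked_list_sizes` holds the sizes of the inverted lists, ordered from the
/// most to the least similar centroid. Probing stops once at least `min_probes`
/// lists were visited and `min_candidates` vectors were collected, or when
/// `max_probes` lists have been visited.
fn adaptive_probe_count(
    ranked_list_sizes: &[usize],
    min_probes: usize,
    max_probes: usize,
    min_candidates: usize,
) -> usize {
    let mut candidates = 0;
    let mut probes = 0;
    for &size in ranked_list_sizes.iter().take(max_probes) {
        candidates += size;
        probes += 1;
        if probes >= min_probes && candidates >= min_candidates {
            break;
        }
    }
    probes
}
```

With list sizes `[5, 3, 40, 10]` and a threshold of 20 candidates, the first two lists are too small, so a third probe is added. A query that lands in a dense list stops after a single probe.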

Norm Caching:
- Added cached_norm field to DenseVec with serde skip
- Implemented compute_norm() for explicit caching
- Added with_norm() constructor for precomputed norms
- Added random_normalized() for unit vectors
- Updated scale() to maintain cache when possible
- Updated normalize() to cache norm as 1.0
- Added invalidate_cache() for manual cache invalidation
- add_scaled() now properly invalidates cache
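The caching rules listed above can be sketched with a small struct. The method names follow the commit message, but the bodies, the `Option<f32>` representation of `cached_norm`, and the `new` constructor are assumptions. The serde attribute that skips the field is omitted so the sketch needs no external crates.

```rust
#[derive(Debug, Clone)]
struct DenseVec {
    data: Vec<f32>,
    /// `None` means "not computed or stale". The PR skips this field in serde.
    cached_norm: Option<f32>,
}

impl DenseVec {
    fn new(data: Vec<f32>) -> Self {
        Self { data, cached_norm: None }
    }

    /// Constructor for callers that already know the norm.
    fn with_norm(data: Vec<f32>, norm: f32) -> Self {
        Self { data, cached_norm: Some(norm) }
    }

    /// Returns the L2 norm, computing and caching it on first use.
    fn compute_norm(&mut self) -> f32 {
        if let Some(n) = self.cached_norm {
            return n;
        }
        let n = self.data.iter().map(|x| x * x).sum::<f32>().sqrt();
        self.cached_norm = Some(n);
        n
    }

    fn invalidate_cache(&mut self) {
        self.cached_norm = None;
    }

    /// Scaling by `s` multiplies the norm by |s|, so a cached value stays usable.
    fn scale(&mut self, s: f32) {
        for x in &mut self.data {
            *x *= s;
        }
        self.cached_norm = self.cached_norm.map(|n| n * s.abs());
    }

    /// Normalizes to unit length (no-op for the zero vector) and caches 1.0.
    fn normalize(&mut self) {
        let n = self.compute_norm();
        if n > 0.0 {
            for x in &mut self.data {
                *x /= n;
            }
            self.cached_norm = Some(1.0);
        }
    }

    /// self += s * other. The new norm is not derivable cheaply, so drop the cache.
    fn add_scaled(&mut self, other: &DenseVec, s: f32) {
        for (x, y) in self.data.iter_mut().zip(&other.data) {
            *x += s * y;
        }
        self.invalidate_cache();
    }
}
```

The design point is that every mutating method must either update the cached value exactly or clear it; a stale norm would silently corrupt cosine similarity results.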

Performance Results (10K vectors, 128D):
- Queries/sec: 1,929 → 3,331 (+73%)
- Insert/sec: 4,151 → 8,526 (+105%)
- P50 latency: 497µs → 275µs (-45%)
- P99 latency: 718µs → 485µs (-32%)

Search Range Fix:
- Fixed determine_search_range to use the rewritten constraints
- Constraints are now rewritten BEFORE the search range is determined
- This ensures relative constraints (DaysAfter, etc.) are resolved
  to explicit dates before the search range is calculated
- Added tempfile dev-dependency for tests
- All 13 tests now pass
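The ordering issue can be illustrated with a sketch: if the search range is computed from the raw constraints, a relative constraint contributes no bounds, whereas rewriting it to explicit dates first lets it narrow the range. `determine_search_range` and `DaysAfter` are names from the commit message; the signatures, the `rewrite` helper, the `Day` alias, and the reading of `DaysAfter` as "exactly N days after the anchor" are assumptions.

```rust
use std::collections::HashMap;

type Day = i64; // assumption: dates as plain day numbers

#[derive(Debug, Clone, PartialEq)]
enum Constraint {
    After(Day),
    Before(Day),
    /// Assumed meaning: target is exactly `days` days after the named anchor.
    DaysAfter { anchor: String, days: i64 },
}

/// Replaces resolvable relative constraints with explicit bounds.
fn rewrite(constraints: &[Constraint], refs: &HashMap<String, Day>) -> Vec<Constraint> {
    constraints
        .iter()
        .flat_map(|c| match c {
            Constraint::DaysAfter { anchor, days } => match refs.get(anchor) {
                Some(&a) => vec![
                    Constraint::After(a + days - 1),
                    Constraint::Before(a + days + 1),
                ],
                None => vec![c.clone()], // unknown anchor: leave untouched
            },
            other => vec![other.clone()],
        })
        .collect()
}

/// Narrows an inclusive default range using explicit bounds only.
fn determine_search_range(constraints: &[Constraint], default: (Day, Day)) -> (Day, Day) {
    let (mut lo, mut hi) = default;
    for c in constraints {
        match c {
            Constraint::After(d) => lo = lo.max(d + 1),
            Constraint::Before(d) => hi = hi.min(d - 1),
            Constraint::DaysAfter { .. } => {} // unresolved: cannot narrow the range
        }
    }
    (lo, hi)
}
```

With an anchor at day 100 and a "10 days after" constraint, the raw constraints leave the default range untouched, while the rewritten ones collapse it to the single day 110.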

Adds a new intelligence metrics module that measures cognitive capabilities:
- Capability scores: temporal reasoning, constraint satisfaction, planning
- Reasoning quality: logical coherence, solution optimality, efficiency
- Learning metrics: sample efficiency, regret sublinearity, generalization
- Tool use proficiency: selection, effectiveness, composition
- Meta-cognition: self-correction, strategy adaptation, progress monitoring

Includes intelligence-assessment binary for running full assessments with
configurable episodes and detailed reporting.

Enhanced the learning metrics calculation:
- Use linear regression to detect regret trend slope
- Compare first half vs. second half average regret
- More accurate sublinearity detection for learning assessment
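Both checks are simple to express. The sketch below fits an ordinary least-squares line to per-episode regret (x is the episode index) and compares the mean regret of the two halves. The function names and the use of per-episode rather than cumulative regret are assumptions about how the PR applies them.

```rust
/// Least-squares slope of regret against episode index 0, 1, 2, ...
/// Returns `None` for fewer than two episodes.
fn regret_slope(regrets: &[f64]) -> Option<f64> {
    if regrets.len() < 2 {
        return None;
    }
    let n = regrets.len() as f64;
    let mean_x = (n - 1.0) / 2.0;
    let mean_y = regrets.iter().sum::<f64>() / n;
    let (mut cov, mut var) = (0.0, 0.0);
    for (i, &y) in regrets.iter().enumerate() {
        let dx = i as f64 - mean_x;
        cov += dx * (y - mean_y);
        var += dx * dx;
    }
    Some(cov / var)
}

/// True when the second half of the run has lower mean regret than the first.
fn second_half_improves(regrets: &[f64]) -> bool {
    if regrets.len() < 2 {
        return false;
    }
    let (first, second) = regrets.split_at(regrets.len() / 2);
    let mean = |xs: &[f64]| xs.iter().sum::<f64>() / xs.len() as f64;
    mean(second) < mean(first)
}
```

A negative slope together with an improving second half is evidence of learning, though on short or noisy runs neither test is conclusive on its own.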

Implements the learning loop described in the lean-agentic design for achieving
sublinear regret:

- Add ReasoningBank module with trajectory tracking, verdict judgment, and
  strategy optimization
- Add AdaptiveSolver that uses ReasoningBank to learn from experience
- Integrate lean-agentic crate for type theory infrastructure
- Update intelligence assessment with adaptive learning flag and progress
  reporting

Key features:
- Trajectory recording for all puzzle solutions
- Pattern learning from successful strategies
- Confidence calibration from historical accuracy
- Strategy selection based on puzzle characteristics
- Improvement tracking and sublinear regret verification

Results with 40 episodes:
- Sublinear regret: Yes ✓
- Regret trend: -0.1728 (decreasing ✓)
- Patterns learned: 20+
- Is improving: Yes ✓

ReasoningBank optimizations:
- Add pattern_index HashMap for O(1) lookups by (constraint_type, difficulty)
- Add constraint_frequency tracking for prioritization
- Add batch trajectory recording for parallel processing
- Index patterns at insertion time for fast retrieval
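A minimal sketch of such an index is shown below, assuming patterns live in a `Vec` and the map stores positions into it. `pattern_index` and `constraint_frequency` are names from the commit message; `PatternKey`, `Pattern`, and the method names are hypothetical. Lookups are O(1) on average, as is usual for a hash map.

```rust
use std::collections::HashMap;

/// Hypothetical composite key: (constraint_type, difficulty).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct PatternKey {
    constraint_type: String,
    difficulty: u8,
}

#[derive(Debug, Clone)]
struct Pattern {
    key: PatternKey,
    strategy: String,
}

#[derive(Default)]
struct ReasoningBank {
    patterns: Vec<Pattern>,
    /// Maps a key to the positions of matching patterns in `patterns`.
    pattern_index: HashMap<PatternKey, Vec<usize>>,
    /// How often each constraint type has been seen, for prioritization.
    constraint_frequency: HashMap<String, usize>,
}

impl ReasoningBank {
    /// Stores a pattern and indexes it immediately, so lookups never scan.
    fn insert(&mut self, p: Pattern) {
        let slot = self.patterns.len();
        self.pattern_index.entry(p.key.clone()).or_default().push(slot);
        *self
            .constraint_frequency
            .entry(p.key.constraint_type.clone())
            .or_insert(0) += 1;
        self.patterns.push(p);
    }

    /// All patterns recorded for this constraint type and difficulty.
    fn lookup(&self, constraint_type: &str, difficulty: u8) -> Vec<&Pattern> {
        let key = PatternKey { constraint_type: constraint_type.to_string(), difficulty };
        self.pattern_index
            .get(&key)
            .into_iter()
            .flatten()
            .map(|&i| &self.patterns[i])
            .collect()
    }
}
```

Indexing at insertion time moves the cost to writes, which suits a workload where each recorded trajectory is consulted many times during strategy selection.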

Vector index optimizations:
- Parallel centroid scoring using rayon for IVF search
- Optimized adaptive probe logic with early termination
- Better candidate collection with bounds checking
- Maintained P99 latency reduction

Benchmark results at 50k vectors:
- Insert: 1,342 vec/s
- Query: 808 q/s
- P99 latency: 2.0ms

@ruvnet ruvnet merged commit 5834cd0 into main Jan 15, 2026
5 checks passed
@ruvnet ruvnet deleted the claude/add-benchmarks-examples-nhkeM branch April 21, 2026 20:30