feat(benchmarks): Add comprehensive temporal reasoning and vector benchmarks by ruvnet · Pull Request #113 · ruvnet/RuVector

ruvnet · 2026-01-15T02:37:49Z

Introduces a new benchmark suite in /examples/benchmarks/ with:

Temporal Reasoning (based on TimePuzzles arXiv:2601.07148):

TemporalPuzzle and TemporalConstraint types for date inference
TemporalSolver with tool augmentation (calendar math, web search)
TimePuzzles generator with configurable difficulty and cross-cultural events
Sample puzzle sets (easy, medium, hard, cross-cultural)

Vector Index with IVF:

DenseVec with cosine similarity and SIMD-friendly operations
VectorIndex with IVF-style coarse quantization (k-means centroids)
CoherenceGate for filtering based on external signals (mincut coherence)
Persistence using serde + bincode
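As a rough illustration of the IVF idea listed above, the sketch below buckets vectors by their most similar centroid and searches only the `nprobe` best buckets. `DenseVec` here is a hypothetical stand-in for the PR's type, `IvfSketch` and its method names are assumptions, and the centroids are supplied by the caller rather than trained with k-means.

```rust
/// Hypothetical stand-in for the PR's `DenseVec`; the real type may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct DenseVec(pub Vec<f32>);

impl DenseVec {
    pub fn dot(&self, other: &DenseVec) -> f32 {
        self.0.iter().zip(&other.0).map(|(a, b)| a * b).sum()
    }

    pub fn l2_norm(&self) -> f32 {
        self.dot(self).sqrt()
    }

    /// Cosine similarity; returns 0.0 if either vector has zero norm.
    pub fn cosine(&self, other: &DenseVec) -> f32 {
        let denom = self.l2_norm() * other.l2_norm();
        if denom == 0.0 { 0.0 } else { self.dot(other) / denom }
    }
}

/// Minimal IVF-style index: each vector lives in the list of its best centroid.
pub struct IvfSketch {
    centroids: Vec<DenseVec>,
    lists: Vec<Vec<(usize, DenseVec)>>,
}

impl IvfSketch {
    /// Centroids are assumed to come from a separate k-means step (not shown).
    pub fn new(centroids: Vec<DenseVec>) -> Self {
        let lists = vec![Vec::new(); centroids.len()];
        Self { centroids, lists }
    }

    /// Centroid indices ordered from most to least similar to `v`.
    fn ranked_centroids(&self, v: &DenseVec) -> Vec<usize> {
        let mut order: Vec<usize> = (0..self.centroids.len()).collect();
        order.sort_by(|&a, &b| {
            self.centroids[b].cosine(v).total_cmp(&self.centroids[a].cosine(v))
        });
        order
    }

    pub fn insert(&mut self, id: usize, v: DenseVec) {
        if let Some(&c) = self.ranked_centroids(&v).first() {
            self.lists[c].push((id, v));
        }
    }

    /// Scan only the `nprobe` closest lists and return the top `k` hits.
    pub fn search(&self, q: &DenseVec, k: usize, nprobe: usize) -> Vec<(usize, f32)> {
        let mut hits: Vec<(usize, f32)> = self
            .ranked_centroids(q)
            .into_iter()
            .take(nprobe)
            .flat_map(|c| self.lists[c].iter())
            .map(|(id, v)| (*id, v.cosine(q)))
            .collect();
        hits.sort_by(|a, b| b.1.total_cmp(&a.1));
        hits.truncate(k);
        hits
    }
}
```

A smaller `nprobe` scans fewer lists and is faster, at the cost of possibly missing neighbours that fell into an unprobed list.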

Swarm Controller Regret Tracking:

RegretTracker for episode-based sublinear regret computation
OracleBaseline for computing optimal rewards
SwarmController with R_k/k metric tracking
Sublinearity detection and trend analysis
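The R_k/k metric mentioned above can be sketched as follows. This is a minimal illustration, not the PR's `RegretTracker`: the type name `RegretSketch`, its methods, and the choice to clamp per-episode regret at zero (treating the oracle as an upper bound) are all assumptions.

```rust
/// Hypothetical sketch of episode-based regret tracking.
#[derive(Debug, Default)]
pub struct RegretSketch {
    /// Cumulative regret R_k after k episodes.
    cumulative: f64,
    /// R_k / k recorded after each episode k = 1, 2, ...
    averages: Vec<f64>,
}

impl RegretSketch {
    /// Record one episode and return the updated average regret R_k / k.
    /// Per-episode regret is the oracle's reward minus the achieved reward,
    /// clamped at zero (assumption: the oracle is an upper bound).
    pub fn record(&mut self, oracle_reward: f64, achieved_reward: f64) -> f64 {
        self.cumulative += (oracle_reward - achieved_reward).max(0.0);
        let k = (self.averages.len() + 1) as f64;
        let avg = self.cumulative / k;
        self.averages.push(avg);
        avg
    }

    pub fn cumulative_regret(&self) -> f64 {
        self.cumulative
    }

    pub fn averages(&self) -> &[f64] {
        &self.averages
    }
}
```

Regret is sublinear when R_k grows more slowly than k, which shows up as R_k/k shrinking toward zero over the recorded episodes.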

Logging Schema:

BenchmarkLogger with structured JSON logging
Log entry types: Temporal, Vector, Swarm, Tool, System
LogReader with aggregation capabilities

Binaries:

temporal-benchmark: Run TimePuzzles-style probes
vector-benchmark: Benchmark IVF search performance
swarm-regret: Track sublinear regret across episodes
timepuzzle-runner: Quick 10-minute probe for agent testing

Includes comprehensive integration tests.

Vector Index Optimizations:
- Added 4-wide loop unrolling for l2_norm() enabling auto-vectorization
- Added 4-wide loop unrolling for dot() product
- Achieved 83% improvement in queries/sec (1,929 → 3,527)
- Achieved 119% improvement in insert/sec (4,151 → 9,082)
- Reduced P50 latency by 46% (497µs → 267µs)
- Reduced P99 latency by 39% (718µs → 439µs)

TimePuzzles Bug Fixes:
- Fixed reference anchor not being added when RelativeToAnchor constraints are generated
- Changed generate_constraint to generate_constraint_with_anchor to return both constraint and anchor
- Ensures all relative constraints have their required references populated
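The 4-wide unrolling of dot() and l2_norm() described above might look roughly like the sketch below. The function names `dot_unrolled` and `l2_norm_unrolled` are hypothetical and the PR's actual code may be structured differently. The point is that four independent accumulators remove the serial dependency on a single sum, which gives the compiler room to auto-vectorize.

```rust
/// Dot product with four independent accumulators (hypothetical sketch).
pub fn dot_unrolled(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    let mut acc = [0.0f32; 4];
    let mut a_chunks = a.chunks_exact(4);
    let mut b_chunks = b.chunks_exact(4);
    // Main loop: process four lanes per iteration.
    for (x, y) in a_chunks.by_ref().zip(b_chunks.by_ref()) {
        acc[0] += x[0] * y[0];
        acc[1] += x[1] * y[1];
        acc[2] += x[2] * y[2];
        acc[3] += x[3] * y[3];
    }
    // Tail: up to three leftover elements when the length is not a multiple of 4.
    let tail: f32 = a_chunks
        .remainder()
        .iter()
        .zip(b_chunks.remainder())
        .map(|(x, y)| x * y)
        .sum();
    (acc[0] + acc[1]) + (acc[2] + acc[3]) + tail
}

/// L2 norm expressed through the unrolled dot product.
pub fn l2_norm_unrolled(v: &[f32]) -> f32 {
    dot_unrolled(v, v).sqrt()
}
```

Because the accumulation order differs from a simple left-to-right sum, results can differ from the scalar version in the last few bits for general floating-point inputs.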

Advanced Optimizations:
- Added rayon dependency for parallel processing
- Implemented parallel batch search (search_batch)
- Added parallel flat search (search_flat_parallel) for large indices
- Implemented adaptive IVF probing (search_adaptive) with dynamic probe count
- Added search_ivf_adaptive with minimum candidates threshold

Norm Caching:
- Added cached_norm field to DenseVec with serde skip
- Implemented compute_norm() for explicit caching
- Added with_norm() constructor for precomputed norms
- Added random_normalized() for unit vectors
- Updated scale() to maintain cache when possible
- Updated normalize() to cache norm as 1.0
- Added invalidate_cache() for manual cache invalidation
- add_scaled() now properly invalidates cache

Performance Results (10K vectors, 128D):
- Queries/sec: 1,929 → 3,331 (+73%)
- Insert/sec: 4,151 → 8,526 (+105%)
- P50 latency: 497µs → 275µs (-45%)
- P99 latency: 718µs → 485µs (-32%)
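The norm-caching rules listed above can be illustrated with a small stand-alone type. This is a sketch under assumptions: `CachedVec` is a hypothetical name, the cache is modelled as `Option<f32>`, and the serde skip attribute and `random_normalized()` from the PR are omitted to keep the example dependency-free.

```rust
/// Hypothetical vector with a lazily cached L2 norm.
#[derive(Debug, Clone)]
pub struct CachedVec {
    data: Vec<f32>,
    cached_norm: Option<f32>,
}

impl CachedVec {
    pub fn new(data: Vec<f32>) -> Self {
        Self { data, cached_norm: None }
    }

    /// Construct with a norm the caller has already computed.
    pub fn with_norm(data: Vec<f32>, norm: f32) -> Self {
        Self { data, cached_norm: Some(norm) }
    }

    /// Return the cached norm, computing and storing it on first use.
    pub fn compute_norm(&mut self) -> f32 {
        if let Some(n) = self.cached_norm {
            return n;
        }
        let n = self.data.iter().map(|x| x * x).sum::<f32>().sqrt();
        self.cached_norm = Some(n);
        n
    }

    /// Scaling by `s` multiplies the norm by |s|, so a valid cache stays valid.
    pub fn scale(&mut self, s: f32) {
        for x in &mut self.data {
            *x *= s;
        }
        self.cached_norm = self.cached_norm.map(|n| n * s.abs());
    }

    /// After normalization the norm is 1.0 by construction (zero vectors are left alone).
    pub fn normalize(&mut self) {
        let n = self.compute_norm();
        if n > 0.0 {
            for x in &mut self.data {
                *x /= n;
            }
            self.cached_norm = Some(1.0);
        }
    }

    /// Adding another vector changes the norm unpredictably, so drop the cache.
    pub fn add_scaled(&mut self, other: &CachedVec, s: f32) {
        for (x, y) in self.data.iter_mut().zip(&other.data) {
            *x += s * y;
        }
        self.invalidate_cache();
    }

    pub fn invalidate_cache(&mut self) {
        self.cached_norm = None;
    }

    pub fn cached_norm(&self) -> Option<f32> {
        self.cached_norm
    }
}
```

The design choice is that only operations with a known effect on the norm (scaling, normalizing) keep the cache; anything else invalidates it rather than risk a stale value.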

- Fixed determine_search_range to use rewritten constraints
- Rewrite constraints BEFORE determining search range
- This ensures relative constraints (DaysAfter, etc.) are resolved to explicit dates before search range calculation
- Added tempfile dev-dependency for tests
- All 13 tests now pass
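To show why the ordering in this fix matters, here is a toy sketch that uses integer day numbers instead of real dates. The `Constraint` variants, the meaning given to `DaysAfter` (exactly N days after a named anchor), and both function names are assumptions for illustration; the PR's own types are not shown on this page.

```rust
use std::collections::HashMap;

/// Hypothetical constraint model over integer day numbers.
#[derive(Debug, Clone, PartialEq)]
pub enum Constraint {
    NotBefore(i64),
    NotAfter(i64),
    Exact(i64),
    /// Assumed meaning: exactly `days` days after the named anchor.
    DaysAfter { anchor: String, days: i64 },
}

/// Resolve relative constraints to explicit days where the anchor is known.
pub fn rewrite(cs: &[Constraint], anchors: &HashMap<String, i64>) -> Vec<Constraint> {
    cs.iter()
        .map(|c| match c {
            Constraint::DaysAfter { anchor, days } => match anchors.get(anchor) {
                Some(base) => Constraint::Exact(base + days),
                None => c.clone(), // unknown anchor: leave unresolved
            },
            other => other.clone(),
        })
        .collect()
}

/// Narrow a default (lo, hi) range using only explicit constraints.
pub fn search_range(cs: &[Constraint], default: (i64, i64)) -> (i64, i64) {
    let (mut lo, mut hi) = default;
    for c in cs {
        match c {
            Constraint::NotBefore(d) => lo = lo.max(*d),
            Constraint::NotAfter(d) => hi = hi.min(*d),
            Constraint::Exact(d) => {
                lo = lo.max(*d);
                hi = hi.min(*d);
            }
            // Relative constraints carry no explicit day, so they cannot narrow the range.
            Constraint::DaysAfter { .. } => {}
        }
    }
    (lo, hi)
}
```

Computing the range on the raw constraints ignores every `DaysAfter`, while computing it on the output of `rewrite` lets the resolved day narrow the search.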

Adds a new intelligence metrics module that measures cognitive capabilities:
- Capability scores: temporal reasoning, constraint satisfaction, planning
- Reasoning quality: logical coherence, solution optimality, efficiency
- Learning metrics: sample efficiency, regret sublinearity, generalization
- Tool use proficiency: selection, effectiveness, composition
- Meta-cognition: self-correction, strategy adaptation, progress monitoring

Includes intelligence-assessment binary for running full assessments with configurable episodes and detailed reporting.

Enhanced the learning metrics calculation:
- Use linear regression to detect regret trend slope
- Compare first half vs. second half average regret
- More accurate sublinearity detection for learning assessment
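The two checks named above, a regression slope and a first-half versus second-half comparison, can be sketched like this. The function names and the rule for combining the two signals (both must indicate a decrease) are assumptions; the PR may weigh them differently.

```rust
/// Ordinary least-squares slope of `ys` against episode index 0, 1, 2, ...
/// Returns None when fewer than two points are available.
pub fn trend_slope(ys: &[f64]) -> Option<f64> {
    let n = ys.len();
    if n < 2 {
        return None;
    }
    let nf = n as f64;
    let mean_x = (nf - 1.0) / 2.0;
    let mean_y = ys.iter().sum::<f64>() / nf;
    let (mut cov, mut var) = (0.0, 0.0);
    for (i, y) in ys.iter().enumerate() {
        let dx = i as f64 - mean_x;
        cov += dx * (y - mean_y);
        var += dx * dx;
    }
    Some(cov / var)
}

/// Mean of the first half and mean of the second half of the series.
pub fn half_means(ys: &[f64]) -> Option<(f64, f64)> {
    if ys.len() < 2 {
        return None;
    }
    let (first, second) = ys.split_at(ys.len() / 2);
    let mean = |s: &[f64]| s.iter().sum::<f64>() / s.len() as f64;
    Some((mean(first), mean(second)))
}

/// Assumed decision rule: per-episode regret trends down AND the later half is lower.
pub fn looks_sublinear(per_episode_regret: &[f64]) -> bool {
    match (trend_slope(per_episode_regret), half_means(per_episode_regret)) {
        (Some(slope), Some((first, second))) => slope < 0.0 && second < first,
        _ => false,
    }
}
```

Using both signals guards against a single noisy episode: the slope captures the overall trend, while the half comparison checks that the improvement persists in the later episodes.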

Implements the learning loop described in lean-agentic design for achieving sublinear regret:
- Add ReasoningBank module with trajectory tracking, verdict judgment, and strategy optimization
- Add AdaptiveSolver that uses ReasoningBank to learn from experience
- Integrate lean-agentic crate for type theory infrastructure
- Update intelligence assessment with adaptive learning flag and progress reporting

Key features:
- Trajectory recording for all puzzle solutions
- Pattern learning from successful strategies
- Confidence calibration from historical accuracy
- Strategy selection based on puzzle characteristics
- Improvement tracking and sublinear regret verification

Results with 40 episodes:
- Sublinear regret: Yes ✓
- Regret trend: -0.1728 (decreasing ✓)
- Patterns learned: 20+
- Is improving: Yes ✓

ReasoningBank optimizations:
- Add pattern_index HashMap for O(1) lookups by (constraint_type, difficulty)
- Add constraint_frequency tracking for prioritization
- Add batch trajectory recording for parallel processing
- Index patterns at insertion time for fast retrieval

Vector index optimizations:
- Parallel centroid scoring using rayon for IVF search
- Optimized adaptive probe logic with early termination
- Better candidate collection with bounds checking
- Maintained P99 latency reduction

Benchmark results at 50k vectors:
- Insert: 1,342 vec/s
- Query: 808 q/s
- P99 latency: 2.0ms
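A minimal sketch of indexing patterns at insertion time, as described in the ReasoningBank list above. Only the names `pattern_index` and `constraint_frequency` come from the commit message; the `Pattern` and `PatternStore` types, the `u8` difficulty, and the string-typed constraint key are assumptions.

```rust
use std::collections::HashMap;

/// Hypothetical learned pattern.
#[derive(Debug, Clone, PartialEq)]
pub struct Pattern {
    pub strategy: String,
    pub successes: u32,
}

/// Hypothetical store that indexes patterns as they are inserted.
#[derive(Debug, Default)]
pub struct PatternStore {
    patterns: Vec<Pattern>,
    /// (constraint_type, difficulty) -> positions in `patterns`.
    pattern_index: HashMap<(String, u8), Vec<usize>>,
    /// How often each constraint type has been seen.
    constraint_frequency: HashMap<String, usize>,
}

impl PatternStore {
    /// Store the pattern and update both indexes in the same call.
    pub fn insert(&mut self, constraint_type: &str, difficulty: u8, pattern: Pattern) {
        let slot = self.patterns.len();
        self.patterns.push(pattern);
        self.pattern_index
            .entry((constraint_type.to_string(), difficulty))
            .or_default()
            .push(slot);
        *self
            .constraint_frequency
            .entry(constraint_type.to_string())
            .or_insert(0) += 1;
    }

    /// Average-case O(1) hash lookup, then a walk over the matching slots only.
    pub fn lookup(&self, constraint_type: &str, difficulty: u8) -> Vec<&Pattern> {
        self.pattern_index
            .get(&(constraint_type.to_string(), difficulty))
            .map(|slots| slots.iter().map(|&i| &self.patterns[i]).collect::<Vec<_>>())
            .unwrap_or_default()
    }

    pub fn frequency(&self, constraint_type: &str) -> usize {
        self.constraint_frequency
            .get(constraint_type)
            .copied()
            .unwrap_or(0)
    }
}
```

Paying the indexing cost on insert keeps retrieval independent of the total number of stored patterns, which matters when lookups happen on every puzzle.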


claude added 9 commits January 14, 2026 14:28

feat(benchmarks): Improve regret sublinearity detection

6a0d59f


chore: Add benchmark logs and data to gitignore

f63f58f

ruvnet merged commit 5834cd0 into main Jan 15, 2026
5 checks passed

ruvnet added a commit that referenced this pull request Feb 20, 2026

feat(benchmarks): Add comprehensive temporal reasoning and vector ben…

0e37715

…chmarks (#113)

ruvnet deleted the claude/add-benchmarks-examples-nhkeM branch April 21, 2026 20:30


ruvnet merged 9 commits into main from claude/add-benchmarks-examples-nhkeM
