Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
708 changes: 699 additions & 9 deletions examples/data/Cargo.lock

Large diffs are not rendered by default.

102 changes: 89 additions & 13 deletions examples/data/README.md
Original file line number Diff line number Diff line change
@@ -1,34 +1,91 @@
# RuVector Dataset Discovery Framework

Comprehensive examples demonstrating RuVector's capabilities for novel discovery across world-scale datasets.
**Find hidden patterns and connections in massive datasets that traditional tools miss.**

## What's New
RuVector turns your data—research papers, climate records, financial filings—into a connected graph, then uses cutting-edge algorithms to spot emerging trends, cross-domain relationships, and regime shifts *before* they become obvious.

- **SIMD-Accelerated Vectors** - 2.9x faster cosine similarity
- **Parallel Batch Processing** - 8.8x faster vector insertion
- **Statistical Significance** - P-values, effect sizes, confidence intervals
- **Temporal Causality** - Granger-style cross-domain prediction
- **Cross-Domain Bridges** - Automatic detection of hidden connections
## Why RuVector?

Most data analysis tools excel at answering questions you already know to ask. RuVector is different: it helps you **discover what you don't know you're looking for**.

**Real-world examples:**
- 🔬 **Research**: Spot a new field forming 6-12 months before it gets a name, by detecting when papers start citing across traditional boundaries
- 🌍 **Climate**: Detect regime shifts in weather patterns that correlate with economic disruptions
- 💰 **Finance**: Find companies whose narratives are diverging from their peers—often an early warning signal

## Features

| Feature | What It Does | Why It Matters |
|---------|--------------|----------------|
| **Vector Memory** | Stores data as 384-1536 dim embeddings | Similar concepts cluster together automatically |
| **HNSW Index** | O(log n) approximate nearest neighbor search | 10-50x faster than brute force for large datasets |
| **Graph Structure** | Connects related items with weighted edges | Reveals hidden relationships in your data |
| **Min-Cut Analysis** | Measures how "connected" your network is | Detects regime changes and fragmentation |
| **Cross-Domain Detection** | Finds bridges between different fields | Discovers unexpected correlations (e.g., climate → finance) |
| **ONNX Embeddings** | Neural semantic embeddings (MiniLM, BGE, etc.) | Production-quality text understanding |
| **Causality Testing** | Checks if changes in X predict changes in Y | Moves beyond correlation to actionable insights |
| **Statistical Rigor** | Reports p-values and effect sizes | Know which findings are real vs. noise |

### What's New in v0.3.0

- **HNSW Integration**: O(n log n) similarity search replaces O(n²) brute force
- **Similarity Cache**: 2-3x speedup for repeated similarity queries
- **Batch ONNX Embeddings**: Chunked processing with progress callbacks
- **Shared Utils Module**: `cosine_similarity`, `euclidean_distance`, `normalize_vector`
- **Auto-connect by Embeddings**: CoherenceEngine creates edges from vector similarity

### Performance

- ⚡ **10-50x faster** similarity search (HNSW vs brute force)
- ⚡ **8.8x faster** batch vector insertion (parallel processing)
- ⚡ **2.9x faster** similarity computation (SIMD acceleration)
- ⚡ **2-3x faster** repeated queries (similarity cache)
- 📊 Works with **millions of records** on standard hardware

## Quick Start

### Prerequisites

```bash
# Ensure you're in the ruvector workspace
cd /workspaces/ruvector
```

### Run Your First Example

```bash
# Run the optimized benchmark
# 1. Performance benchmark - see the speed improvements
cargo run --example optimized_benchmark -p ruvector-data-framework --features parallel --release

# Run the discovery hunter
# 2. Discovery hunter - find patterns in sample data
cargo run --example discovery_hunter -p ruvector-data-framework --features parallel --release

# Run cross-domain discovery
# 3. Cross-domain analysis - detect bridges between fields
cargo run --example cross_domain_discovery -p ruvector-data-framework --release
```

### Domain-Specific Examples

# Run climate regime detector
```bash
# Climate: Detect weather regime shifts
cargo run --example regime_detector -p ruvector-data-climate

# Run financial coherence watch
# Finance: Monitor corporate filing coherence
cargo run --example coherence_watch -p ruvector-data-edgar
```

### What You'll See

```
🔍 Discovery Results:
Pattern: Climate ↔ Finance bridge detected
Strength: 0.73 (strong connection)
P-value: 0.031 (statistically significant)

→ Drought indices may predict utility sector
performance with a 3-period lag
```

## The Discovery Thesis

RuVector's unique combination of **vector memory**, **graph structures**, and **dynamic minimum cut algorithms** enables discoveries that most analysis tools miss:
Expand Down Expand Up @@ -230,10 +287,23 @@ examples/data/
| `cross_domain` | true | Enable cross-domain discovery |
| `batch_size` | 256 | Parallel batch size |
| `use_simd` | true | Enable SIMD acceleration |
| `similarity_cache_size` | 10000 | Max cached similarity pairs |
| `significance_threshold` | 0.05 | P-value threshold |
| `causality_lookback` | 10 | Temporal lookback periods |
| `causality_min_correlation` | 0.6 | Minimum correlation for causality |

### CoherenceConfig (v0.3.0)

| Parameter | Default | Description |
|-----------|---------|-------------|
| `similarity_threshold` | 0.5 | Min similarity for auto-connecting embeddings |
| `use_embeddings` | true | Auto-create edges from embedding similarity |
| `hnsw_k_neighbors` | 50 | Neighbors to search per vector (HNSW) |
| `hnsw_min_records` | 100 | Min records to trigger HNSW (else brute force) |
| `min_edge_weight` | 0.01 | Minimum edge weight threshold |
| `approximate` | true | Use approximate min-cut for speed |
| `parallel` | true | Enable parallel computation |

## Discovery Examples

### Climate-Finance Bridge
Expand Down Expand Up @@ -271,6 +341,12 @@ Climate → Finance causality detected

## Algorithms

### HNSW (Hierarchical Navigable Small World)
Approximate nearest neighbor search in high-dimensional spaces.
- **Complexity**: O(log n) search, O(log n) insert
- **Use**: Fast similarity search for edge creation
- **Parameters**: `m=16`, `ef_construction=200`, `ef_search=50`

### Stoer-Wagner Min-Cut
Computes minimum cut of weighted undirected graph.
- **Complexity**: O(VE + V² log V)
Expand All @@ -279,7 +355,7 @@ Computes minimum cut of weighted undirected graph.
### SIMD Cosine Similarity
Processes 8 floats per iteration using AVX2.
- **Speedup**: 2.9x vs scalar
- **Fallback**: Chunked scalar (4 floats)
- **Fallback**: Chunked scalar (8 floats per iteration)

### Granger Causality
Tests if past values of X predict Y.
Expand Down
11 changes: 9 additions & 2 deletions examples/data/framework/Cargo.toml
Original file line number Diff line number Diff line change
@@ -1,10 +1,13 @@
[package]
name = "ruvector-data-framework"
version.workspace = true
version = "0.3.0"
edition.workspace = true
description = "Core discovery framework for RuVector dataset integrations"
description = "Core discovery framework for RuVector dataset integrations - find hidden patterns in massive datasets using vector memory, graph structures, and dynamic min-cut algorithms"
license.workspace = true
repository.workspace = true
readme = "../README.md"
documentation = "https://docs.rs/ruvector-data-framework"
authors = ["RuVector Team <team@ruvector.dev>"]
keywords = ["vector-database", "discovery", "graph", "mincut", "coherence"]
categories = ["science", "database", "data-structures"]

Expand Down Expand Up @@ -48,6 +51,9 @@ clap = { version = "4.5", features = ["derive"] }
num_cpus = "1.16"
warp = { version = "0.3", optional = true }

# ONNX embeddings (optional - for semantic embeddings)
ruvector-onnx-embeddings = { version = "0.1.0", optional = true }

[dev-dependencies]
tokio-test = "0.4"
rand = "0.8"
Expand Down Expand Up @@ -119,3 +125,4 @@ default = ["async", "parallel"]
async = []
parallel = ["rayon"]
sse = ["warp"]
onnx-embeddings = ["dep:ruvector-onnx-embeddings"]
4 changes: 4 additions & 0 deletions examples/data/framework/examples/multi_domain_discovery.rs
Original file line number Diff line number Diff line change
Expand Up @@ -388,6 +388,10 @@ async fn main() -> std::result::Result<(), Box<dyn std::error::Error>> {
epsilon: 0.15,
parallel: true,
track_boundaries: true,
similarity_threshold: 0.4, // Lower threshold for cross-domain connections
use_embeddings: true,
hnsw_k_neighbors: 40, // More neighbors for multi-domain
hnsw_min_records: 50,
};

let mut coherence = CoherenceEngine::new(coherence_config);
Expand Down
77 changes: 76 additions & 1 deletion examples/data/framework/examples/real_data_discovery.rs
Original file line number Diff line number Diff line change
Expand Up @@ -7,15 +7,27 @@
//! - Pattern trends and anomalies
//!
//! This demonstrates real-world discovery on live academic data.
//!
//! ## Embedder Options
//! - Default: SimpleEmbedder (bag-of-words, fast but low quality)
//! - With `onnx-embeddings` feature: OnnxEmbedder (neural, high quality)
//!
//! Run with ONNX:
//! ```bash
//! cargo run --example real_data_discovery --features onnx-embeddings --release
//! ```

use std::collections::HashMap;
use std::time::Instant;

use ruvector_data_framework::{
CoherenceConfig, CoherenceEngine, DiscoveryConfig, DiscoveryEngine, OpenAlexClient,
PatternCategory, SimpleEmbedder,
PatternCategory, SimpleEmbedder, Embedder,
};

#[cfg(feature = "onnx-embeddings")]
use ruvector_data_framework::OnnxEmbedder;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
// Initialize logging
Expand Down Expand Up @@ -87,6 +99,62 @@ async fn main() -> Result<(), Box<dyn std::error::Error>> {
return Ok(());
}

// ============================================================================
// Phase 1.5: Re-embed with ONNX (if feature enabled)
// ============================================================================
#[cfg(feature = "onnx-embeddings")]
{
println!();
println!("━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━");
println!("🧠 Phase 1.5: Generating Neural Embeddings (ONNX)");
println!();
println!(" Loading MiniLM-L6-v2 model (384-dim semantic embeddings)...");

let onnx_start = Instant::now();
match OnnxEmbedder::new().await {
Ok(embedder) => {
println!(" ✓ Model loaded in {:?}", onnx_start.elapsed());
println!(" Embedding {} papers...", all_records.len());

let embed_start = Instant::now();
for record in &mut all_records {
// Extract text from JSON data for embedding
let title = record.data.get("title")
.and_then(|v| v.as_str())
.unwrap_or("");
let abstract_text = record.data.get("abstract")
.and_then(|v| v.as_str())
.unwrap_or("");
let concepts = record.data.get("concepts")
.and_then(|v| v.as_array())
.map(|arr| arr.iter()
.filter_map(|c| c.get("display_name").and_then(|n| n.as_str()))
.collect::<Vec<_>>()
.join(" "))
.unwrap_or_default();

let text = format!("{} {} {}", title, abstract_text, concepts);
let embedding = embedder.embed_text(&text);
record.embedding = Some(embedding);
}

println!(" ✓ Embedded {} papers in {:?}", all_records.len(), embed_start.elapsed());
println!(" Embedding dimension: 384 (semantic)");
}
Err(e) => {
println!(" ⚠️ ONNX model failed to load: {}", e);
println!(" Falling back to bag-of-words embeddings");
}
}
}

#[cfg(not(feature = "onnx-embeddings"))]
{
println!();
println!(" 💡 Tip: Enable ONNX embeddings for better discovery quality:");
println!(" cargo run --example real_data_discovery --features onnx-embeddings --release");
}

// ============================================================================
// Phase 2: Build Coherence Graph
// ============================================================================
Expand All @@ -103,6 +171,10 @@ async fn main() -> Result<(), Box<dyn std::error::Error>> {
epsilon: 0.1,
parallel: true,
track_boundaries: true,
similarity_threshold: 0.5, // Connect papers with >= 50% similarity
use_embeddings: true, // Use ONNX embeddings for edge creation
hnsw_k_neighbors: 30, // Search 30 nearest neighbors per paper
hnsw_min_records: 50, // Use HNSW for datasets >= 50 records
};

let mut coherence = CoherenceEngine::new(coherence_config);
Expand Down Expand Up @@ -273,6 +345,9 @@ async fn main() -> Result<(), Box<dyn std::error::Error>> {
println!();

println!(" 🔬 Methodology:");
#[cfg(feature = "onnx-embeddings")]
println!(" • Semantic embeddings: ONNX MiniLM-L6-v2 (384-dim neural)");
#[cfg(not(feature = "onnx-embeddings"))]
println!(" • Semantic embeddings: Simple bag-of-words (128-dim)");
println!(" • Graph construction: Citation + concept relationships");
println!(" • Coherence metric: Dynamic minimum cut");
Expand Down
Loading