From 47d405e066e02577c35bcdf6f3e17f36253f907d Mon Sep 17 00:00:00 2001
From: Feng Zhang
Date: Thu, 24 Jul 2025 09:59:10 -0700
Subject: [PATCH] Clean up documents and add storage guides

---
 README.md                                     |  16 +-
 ...versioned-memory.md => ai-agent-memory.md} |   0
 ...xternal-storage.md => external-storage.md} |   0
 doc/guide/git-workflows.md                    | 776 ------------------
 doc/guide/{git-prolly.md => git.md}           |   0
 doc/guide/{git-prolly-sql.md => sql.md}       |   0
 doc/guide/storage.md                          | 399 +++++++++
 7 files changed, 405 insertions(+), 786 deletions(-)
 rename doc/design/{ai-agent-versioned-memory.md => ai-agent-memory.md} (100%)
 rename doc/design/{git-external-storage.md => external-storage.md} (100%)
 delete mode 100644 doc/guide/git-workflows.md
 rename doc/guide/{git-prolly.md => git.md} (100%)
 rename doc/guide/{git-prolly-sql.md => sql.md} (100%)
 create mode 100644 doc/guide/storage.md

diff --git a/README.md b/README.md
index 8ce0f6c..c7d06b6 100644
--- a/README.md
+++ b/README.md
@@ -221,16 +221,12 @@ The following features are for Prolly tree library for Version 0.2.0:
 The following features are for Prolly tree library for Version 0.2.1:
 - [X] tree diffing and merging examples
 - [X] show history of changes of the Prolly tree (git logs style) using `gitoxide` crate
-- [ ] support gluesql as a kv store
-
-The following features are for Prolly tree library for Version 0.2.2:
-- [ ] version-controlled databases
-  - [ ] support for IPFS (InterPlanetary File System) for distributed storage
-  - [ ] advanced probabilistic splitting algorithms
-- [ ] decentralized databases
-  - [ ] add support for cryptographic hash functions like SHA-256, BLAKE3, etc.
-  - [ ] support ZK-friendly hashes such as Poseidon, MiMC, etc.
-  - [ ] supporting a rollup-style L2 for a decentralized database
+- [X] support python bindings for Prolly Tree
+- [X] support sql query based on gluesql as a query engine
+- [X] add usage examples for git-prolly use cases
+- [X] add usage examples for ai agent memory use cases
+- [X] support rocksdb as storage backend
+- [ ] support ipld as storage backend
 
 ## Contributing
 
diff --git a/doc/design/ai-agent-versioned-memory.md b/doc/design/ai-agent-memory.md
similarity index 100%
rename from doc/design/ai-agent-versioned-memory.md
rename to doc/design/ai-agent-memory.md
diff --git a/doc/design/git-external-storage.md b/doc/design/external-storage.md
similarity index 100%
rename from doc/design/git-external-storage.md
rename to doc/design/external-storage.md
diff --git a/doc/guide/git-workflows.md b/doc/guide/git-workflows.md
deleted file mode 100644
index 94ed26c..0000000
--- a/doc/guide/git-workflows.md
+++ /dev/null
@@ -1,776 +0,0 @@
-# Design of Development Workflows using Prollytree and Git
-
-A comprehensive guide to integrating git-prolly into your development workflows, covering both separate repository and monorepo approaches.
-
-## Table of Contents
-
-1. [Overview](#overview)
-2. [Repository Architecture Patterns](#repository-architecture-patterns)
-3. [Separate Repository Workflow](#separate-repository-workflow)
-4. [Monorepo Workflow](#monorepo-workflow)
-5. [Cross-Branch Data Testing](#cross-branch-data-testing)
-6. [Advanced Debugging Techniques](#advanced-debugging-techniques)
-7. [Best Practices](#best-practices)
-8. [Common Scenarios](#common-scenarios)
-
-## Overview
-
-git-prolly enables versioned key-value storage with full Git integration, allowing you to version your data alongside your code. This manual covers recommended workflows for different development scenarios.
-
-### Key Benefits
-- **Version Control**: Full history tracking for both code and data
-- **Branching**: Separate data states for different features/environments
-- **Collaboration**: Standard Git workflows for team development
-- **Debugging**: Test code against different data states
-- **Deployment**: Coordinate code and data deployments
-
-## Repository Architecture Patterns
-
-### Pattern 1: Separate Repositories
-```
-myapp/          (Main application repository)
-├── .git/
-├── src/
-├── Cargo.toml
-└── data/       (Git submodule → kv-data repo)
-
-kv-data/        (Separate KV data repository)
-├── .git/
-├── prolly_tree_root
-└── README.md
-```
-
-### Pattern 2: Monorepo (Single Repository)
-```
-myapp/          (Single repository)
-├── .git/
-├── src/        (Application code)
-├── config/     (KV data store)
-│   └── prolly_tree_root
-├── user-data/  (Another KV store)
-│   └── prolly_tree_root
-└── Cargo.toml
-```
-
-## Separate Repository Workflow
-
-### Setup
-
-#### 1. Create KV Data Repository
-```bash
-# Create and initialize KV data repository
-git clone --bare https://github.com/myteam/kv-data.git
-cd kv-data
-git-prolly init
-git-prolly set config:app:name "MyApp"
-git-prolly set config:app:version "1.0.0"
-git-prolly commit -m "Initial configuration"
-git push origin main
-```
-
-#### 2. Add KV Data as Submodule
-```bash
-# In your main application repository
-git submodule add https://github.com/myteam/kv-data.git data
-git commit -m "Add KV data submodule"
-```
-
-### Development Workflow
-
-#### Feature Development
-```bash
-# Start new feature
-git checkout -b feature/new-ui
-
-# Update KV data for this feature
-cd data
-git checkout -b feature/new-ui-config
-git-prolly set ui:theme "material"
-git-prolly set ui:layout "grid"
-git-prolly commit -m "Add new UI configuration"
-git push origin feature/new-ui-config
-cd ..
-
-# Update submodule reference
-git add data
-git commit -m "Update KV data for new UI feature"
-```
-
-#### Environment-Specific Branches
-```bash
-# Production KV data
-cd data
-git checkout production
-git-prolly set db:host "prod-db.example.com"
-git-prolly set cache:ttl "3600"
-git-prolly commit -m "Production configuration"
-cd ..
-
-# Staging KV data
-cd data
-git checkout staging
-git-prolly set db:host "staging-db.example.com"
-git-prolly set cache:ttl "300"
-git-prolly commit -m "Staging configuration"
-cd ..
-```
-
-### Using KV Data in Code
-```rust
-// src/main.rs
-use prollytree::git::VersionedKvStore;
-
-fn main() -> Result<(), Box<dyn std::error::Error>> {
-    // Open KV store from submodule
-    let store = VersionedKvStore::open("./data")?;
-
-    let app_name = store.get(b"config:app:name")?;
-    let db_host = store.get(b"db:host")?;
-
-    println!("Starting {} with database at {}",
-        String::from_utf8_lossy(&app_name.unwrap_or_default()),
-        String::from_utf8_lossy(&db_host.unwrap_or_default())
-    );
-
-    Ok(())
-}
-```
-
-### Deployment
-```bash
-# Deploy to production
-git checkout main
-cd data
-git checkout production  # Use production KV data
-cd ..
-git add data
-git commit -m "Deploy with production configuration"
-git push origin main
-
-# Deploy to staging
-git checkout staging
-cd data
-git checkout staging     # Use staging KV data
-cd ..
-git add data
-git commit -m "Deploy with staging configuration"
-git push origin staging
-```
-
-## Monorepo Workflow
-
-### Setup
-
-#### 1. Initialize Monorepo
-```bash
-# Create project structure
-mkdir myapp && cd myapp
-git init
-
-# Initialize KV stores
-mkdir config && cd config
-git-prolly init
-git-prolly set app:name "MyApp"
-git-prolly set app:version "1.0.0"
-git-prolly commit -m "Initial app configuration"
-cd ..
-
-mkdir user-data && cd user-data
-git-prolly init
-git-prolly set schema:version "1"
-git-prolly commit -m "Initial user data schema"
-cd ..
-
-# Add application code
-mkdir src
-echo 'fn main() { println!("Hello World"); }' > src/main.rs
-
-# Commit everything
-git add .
-git commit -m "Initial project setup"
-```
-
-### Development Workflow
-
-#### Feature Development
-```bash
-# Create feature branch
-git checkout -b feature/user-profiles
-
-# Update both code and KV data
-echo 'fn create_user_profile() {}' >> src/lib.rs
-
-cd config
-git-prolly set features:user_profiles "true"
-git-prolly set ui:profile_page "enabled"
-git-prolly commit -m "Enable user profiles feature"
-cd ..
-
-cd user-data
-git-prolly set schema:user_profile "name,email,created_at"
-git-prolly commit -m "Add user profile schema"
-cd ..
-
-# Commit all changes together
-git add .
-git commit -m "Implement user profiles feature"
-```
-
-#### Environment-Specific Configurations
-```bash
-# Production configuration
-git checkout main
-cd config
-git-prolly set db:host "prod-db.example.com"
-git-prolly set features:beta_features "false"
-git-prolly commit -m "Production settings"
-cd ..
-git add config/
-git commit -m "Update production configuration"
-
-# Staging configuration
-git checkout -b staging
-cd config
-git-prolly set db:host "staging-db.example.com"
-git-prolly set features:beta_features "true"
-git-prolly commit -m "Staging settings"
-cd ..
-git add config/
-git commit -m "Update staging configuration"
-```
-
-### Using Multiple KV Stores
-```rust
-// src/main.rs
-use prollytree::git::VersionedKvStore;
-
-fn main() -> Result<(), Box<dyn std::error::Error>> {
-    // Open multiple KV stores
-    let config_store = VersionedKvStore::open("./config")?;
-    let user_store = VersionedKvStore::open("./user-data")?;
-
-    // Use configuration
-    let app_name = config_store.get(b"app:name")?;
-    let db_host = config_store.get(b"db:host")?;
-
-    // Use user data schema
-    let schema = user_store.get(b"schema:user_profile")?;
-
-    println!("App: {} | DB: {} | Schema: {}",
-        String::from_utf8_lossy(&app_name.unwrap_or_default()),
-        String::from_utf8_lossy(&db_host.unwrap_or_default()),
-        String::from_utf8_lossy(&schema.unwrap_or_default())
-    );
-
-    Ok(())
-}
-```
-
-## Cross-Branch Data Testing
-
-### The Problem
-You're working on a hotfix and need to test it against data from different branches/environments:
-- Production data (stable)
-- Staging data (recent changes)
-- Production-sample data (subset for testing)
-
-### Solution 1: Git Worktrees (Recommended)
-
-```bash
-# Create separate worktrees for different environments
-git worktree add ../myapp-staging staging
-git worktree add ../myapp-production production
-git worktree add ../myapp-sample production-sample
-
-# Test your hotfix against each environment
-cd ../myapp-staging
-cargo test -- --test-threads=1
-
-cd ../myapp-production
-cargo test -- --test-threads=1
-
-cd ../myapp-sample
-cargo test -- --test-threads=1
-
-# Clean up when done
-cd ../myapp
-git worktree remove ../myapp-staging
-git worktree remove ../myapp-production
-git worktree remove ../myapp-sample
-```
-
-### Solution 2: KV Data Branch Switching
-
-```bash
-#!/bin/bash
-# test_cross_branch.sh
-
-BRANCHES=("staging" "production" "production-sample")
-ORIGINAL_BRANCH=$(cd config && git-prolly current-branch)
-
-echo "Testing hotfix against multiple data branches..."
-
-for branch in "${BRANCHES[@]}"; do
-    echo "========================================="
-    echo "Testing against $branch data..."
-
-    # Switch KV data to this branch
-    cd config
-    git-prolly checkout $branch
-    cd ..
-
-    # Run tests
-    echo "Running tests with $branch data:"
-    cargo test --test integration_tests
-
-    if [ $? -eq 0 ]; then
-        echo "✅ Tests PASSED with $branch data"
-    else
-        echo "❌ Tests FAILED with $branch data"
-    fi
-
-    echo ""
-done
-
-# Restore original branch
-cd config
-git-prolly checkout $ORIGINAL_BRANCH
-cd ..
-
-echo "Cross-branch testing complete!"
-```
-
-### Solution 3: Programmatic Testing
-
-```rust
-// tests/cross_branch_test.rs
-use prollytree::git::VersionedKvStore;
-use std::process::Command;
-
-#[derive(Debug)]
-struct TestResult {
-    branch: String,
-    passed: bool,
-    details: String,
-}
-
-struct CrossBranchTester {
-    config_path: String,
-}
-
-impl CrossBranchTester {
-    fn new(config_path: &str) -> Self {
-        Self {
-            config_path: config_path.to_string(),
-        }
-    }
-
-    fn test_against_branch(&self, branch: &str) -> Result<TestResult, Box<dyn std::error::Error>> {
-        // Switch to the test branch
-        let mut store = VersionedKvStore::open(&self.config_path)?;
-        let current_branch = store.current_branch().to_string();
-
-        store.checkout(branch)?;
-
-        // Run your hotfix logic
-        let result = self.run_hotfix_tests(&store);
-
-        // Restore original branch
-        store.checkout(&current_branch)?;
-
-        Ok(TestResult {
-            branch: branch.to_string(),
-            passed: result.is_ok(),
-            details: match result {
-                Ok(msg) => msg,
-                Err(e) => format!("Error: {}", e),
-            },
-        })
-    }
-
-    fn run_hotfix_tests(&self, store: &VersionedKvStore<32>) -> Result<String, Box<dyn std::error::Error>> {
-        // Your actual hotfix testing logic
-        let db_host = store.get(b"db:host")?;
-        let timeout = store.get(b"db:timeout")?;
-
-        // Simulate hotfix logic
-        match (db_host, timeout) {
-            (Some(host), Some(timeout_val)) => {
-                let host_str = String::from_utf8_lossy(&host);
-                let timeout_str = String::from_utf8_lossy(&timeout_val);
-
-                // Your hotfix validation logic here
-                if host_str.contains("prod") && timeout_str.parse::<u64>().unwrap_or(0) > 1000 {
-                    Ok("Hotfix works correctly".to_string())
-                } else {
-                    Err("Hotfix validation failed".into())
-                }
-            }
-            _ => Err("Required configuration missing".into()),
-        }
-    }
-
-    fn test_all_branches(&self) -> Result<Vec<TestResult>, Box<dyn std::error::Error>> {
-        let branches = vec!["staging", "production", "production-sample"];
-        let mut results = Vec::new();
-
-        for branch in branches {
-            match self.test_against_branch(branch) {
-                Ok(result) => results.push(result),
-                Err(e) => {
-                    results.push(TestResult {
-                        branch: branch.to_string(),
-                        passed: false,
-                        details: format!("Error: {}", e),
-                    });
-                }
-            }
-        }
-
-        Ok(results)
-    }
-}
-
-#[test]
-fn test_hotfix_cross_branch() {
-    let tester = CrossBranchTester::new("./config");
-    let results = tester.test_all_branches().unwrap();
-
-    for result in results {
-        println!("Branch: {} - Passed: {} - Details: {}",
-            result.branch, result.passed, result.details);
-
-        // You can assert specific conditions here
-        // assert!(result.passed, "Hotfix failed for branch: {}", result.branch);
-    }
-}
-```
-
-## Advanced Debugging Techniques
-
-### 1. Historical Data Testing
-
-```bash
-# Test against specific historical commits
-cd config
-git-prolly checkout abc123def  # Specific commit
-cd ..
-cargo test
-
-# Test against tagged versions
-cd config
-git-prolly checkout v1.2.3
-cd ..
-cargo test
-```
-
-### 2. Data Diff Analysis
-
-```bash
-# Compare data between branches
-git-prolly diff production staging
-
-# Compare specific commits
-git-prolly diff abc123def def456abc
-
-# JSON output for automation
-git-prolly diff production staging --format=json > data_diff.json
-```
-
-### 3. Debugging with Multiple Datasets
-
-```rust
-// src/debug_tools.rs
-use prollytree::git::VersionedKvStore;
-
-pub fn debug_with_multiple_datasets() -> Result<(), Box<dyn std::error::Error>> {
-    let datasets = vec![
-        ("staging", "./config"),
-        ("production", "./config"),
-        ("production-sample", "./config"),
-    ];
-
-    for (name, path) in datasets {
-        println!("=== Debugging with {} dataset ===", name);
-
-        let mut store = VersionedKvStore::open(path)?;
-        store.checkout(name)?;
-
-        // Your debugging logic here
-        debug_specific_issue(&store, name)?;
-    }
-
-    Ok(())
-}
-
-fn debug_specific_issue(store: &VersionedKvStore<32>, dataset: &str) -> Result<(), Box<dyn std::error::Error>> {
-    // Example: Debug a specific configuration issue
-    let problematic_config = store.get(b"feature:problematic_feature")?;
-
-    if let Some(config) = problematic_config {
-        println!("Dataset {}: problematic_feature = {}",
-            dataset, String::from_utf8_lossy(&config));
-
-        // Apply your fix logic and test
-        let result = test_fix_logic(&config);
-        println!("Fix result for {}: {:?}", dataset, result);
-    }
-
-    Ok(())
-}
-
-fn test_fix_logic(config: &[u8]) -> bool {
-    // Your fix logic here
-    true
-}
-```
-
-## Best Practices
-
-### Repository Structure
-
-#### Separate Repositories
-```
-# Use when:
-- Teams manage data and code separately
-- Different release cycles for data and code
-- Multiple applications share the same data
-- Strict separation of concerns required
-
-# Benefits:
-- Clear ownership boundaries
-- Independent versioning
-- Reusable data across projects
-- Granular access control
-```
-
-#### Monorepo
-```
-# Use when:
-- Small team with unified workflow
-- Data and code are tightly coupled
-- Atomic updates required
-- Simple deployment pipeline
-
-# Benefits:
-- Atomic commits
-- Simplified dependency management
-- Unified testing and CI/CD
-- Easier refactoring
-```
-
-### Branch Strategy
-
-#### For Data Branches
-```bash
-# Environment branches
-main                    # Production-ready
-staging                 # Pre-production testing
-development             # Integration testing
-
-# Feature branches
-feature/new-ui-config   # UI configuration changes
-feature/api-v2-schema   # API schema updates
-hotfix/critical-config  # Critical configuration fixes
-```
-
-#### For Code Branches
-```bash
-# Standard Git flow
-main                    # Production code
-develop                 # Integration branch
-feature/new-feature     # Feature development
-hotfix/critical-fix     # Critical fixes
-```
-
-### Testing Strategy
-
-#### Unit Tests
-```rust
-#[cfg(test)]
-mod tests {
-    use super::*;
-
-    #[test]
-    fn test_with_mock_data() {
-        // Test with controlled data
-        let mut store = create_test_store();
-        store.insert(b"test:key".to_vec(), b"test:value".to_vec()).unwrap();
-
-        // Your test logic
-        assert_eq!(get_processed_value(&store), Some("expected".to_string()));
-    }
-
-    fn create_test_store() -> VersionedKvStore<32> {
-        // Create a temporary store for testing
-        let temp_dir = tempfile::tempdir().unwrap();
-        VersionedKvStore::init(temp_dir.path()).unwrap()
-    }
-}
-```
-
-#### Integration Tests
-```rust
-#[cfg(test)]
-mod integration_tests {
-    use super::*;
-
-    #[test]
-    fn test_with_real_data() {
-        // Test with real data from different branches
-        let tester = CrossBranchTester::new("./config");
-        let results = tester.test_all_branches().unwrap();
-
-        for result in results {
-            assert!(result.passed, "Integration test failed for {}: {}",
-                result.branch, result.details);
-        }
-    }
-}
-```
-
-### CI/CD Integration
-
-#### GitHub Actions Example
-```yaml
-# .github/workflows/test.yml
-name: Test with Multiple Datasets
-
-on: [push, pull_request]
-
-jobs:
-  test:
-    runs-on: ubuntu-latest
-    strategy:
-      matrix:
-        dataset: [staging, production, production-sample]
-
-    steps:
-    - uses: actions/checkout@v2
-      with:
-        submodules: true  # For separate repo approach
-
-    - name: Setup Rust
-      uses: actions-rs/toolchain@v1
-      with:
-        toolchain: stable
-
-    - name: Switch to test dataset
-      run: |
-        cd config
-        git-prolly checkout ${{ matrix.dataset }}
-        cd ..
-
-    - name: Run tests
-      run: cargo test
-```
-
-## Common Scenarios
-
-### Scenario 1: Feature Development with Data Changes
-
-```bash
-# Developer workflow
-git checkout -b feature/recommendation-engine
-
-# Update KV data
-cd config
-git-prolly set ml:model_version "2.1"
-git-prolly set ml:confidence_threshold "0.85"
-git-prolly commit -m "Update ML model configuration"
-cd ..
-
-# Update application code
-# ... make code changes ...
-
-# Test together
-cargo test
-
-# Commit everything
-git add .
-git commit -m "Implement recommendation engine v2.1"
-```
-
-### Scenario 2: Hotfix Testing
-
-```bash
-# Critical bug in production
-git checkout -b hotfix/memory-leak-fix
-
-# Fix the code
-vim src/memory_manager.rs
-
-# Test against production data
-cd config
-git-prolly checkout production
-cd ..
-cargo test --test memory_tests
-
-# Test against staging data
-cd config
-git-prolly checkout staging
-cd ..
-cargo test --test memory_tests
-
-# Deploy with confidence
-git checkout main
-git merge hotfix/memory-leak-fix
-```
-
-### Scenario 3: Environment Promotion
-
-```bash
-# Promote from staging to production
-git checkout staging
-
-# Verify staging tests pass
-cargo test
-
-# Update production KV data
-cd config
-git-prolly checkout production
-git-prolly merge staging
-git-prolly commit -m "Promote staging configuration to production"
-cd ..
-
-# Deploy to production
-git checkout main
-git merge staging
-git push origin main
-```
-
-### Scenario 4: Data Migration
-
-```bash
-# Migrate data schema
-cd config
-git-prolly branch migration/v2-schema
-git-prolly checkout migration/v2-schema
-
-# Update schema
-git-prolly set schema:version "2"
-git-prolly set schema:user_table "id,name,email,created_at,updated_at"
-git-prolly delete schema:legacy_fields
-git-prolly commit -m "Migrate to schema v2"
-
-# Test migration
-cd ..
-cargo test --test migration_tests
-
-# Merge when ready
-cd config
-git-prolly checkout main
-git-prolly merge migration/v2-schema
-```
-
-## Conclusion
-
-git-prolly provides powerful workflows for managing versioned key-value data alongside your application code. Whether you choose separate repositories or a monorepo approach, the key is to:
-
-1. **Maintain consistency** between code and data versions
-2. **Test thoroughly** across different data states
-3. **Use Git workflows** you're already familiar with
-4. **Automate testing** for multiple datasets
-5. **Document your patterns** for team consistency
-
-Choose the approach that best fits your team size, deployment complexity, and organizational structure. Both patterns provide robust solutions for different scenarios.
\ No newline at end of file diff --git a/doc/guide/git-prolly.md b/doc/guide/git.md similarity index 100% rename from doc/guide/git-prolly.md rename to doc/guide/git.md diff --git a/doc/guide/git-prolly-sql.md b/doc/guide/sql.md similarity index 100% rename from doc/guide/git-prolly-sql.md rename to doc/guide/sql.md diff --git a/doc/guide/storage.md b/doc/guide/storage.md new file mode 100644 index 0000000..ffbdb07 --- /dev/null +++ b/doc/guide/storage.md @@ -0,0 +1,399 @@ +# ProllyTree Storage Backends Guide + +ProllyTree supports multiple storage backends to meet different performance, persistence, and deployment requirements. This guide provides a comprehensive overview of each available storage backend, their characteristics, use cases, and configuration options. + +## Overview + +ProllyTree uses a pluggable storage architecture through the `NodeStorage` trait, allowing you to choose the appropriate backend for your specific needs: + +- **InMemoryNodeStorage**: Fast, volatile storage for development and testing +- **FileNodeStorage**: Simple file-based persistence for local applications +- **RocksDBNodeStorage**: High-performance LSM-tree storage for production workloads +- **GitNodeStorage**: Git object store integration for development (experimental) + +## InMemoryNodeStorage + +### Description +The in-memory storage backend keeps all ProllyTree nodes in a `HashMap` in RAM. This provides the fastest access times but offers no persistence across application restarts. + +### Characteristics +- **Performance**: Fastest read/write operations +- **Persistence**: None - data is lost when application terminates +- **Memory Usage**: Entire tree stored in RAM +- **Concurrency**: Thread-safe with internal locking +- **Storage Overhead**: Minimal (just HashMap overhead) + +### Use Cases +- **Unit testing**: Fast test execution without I/O overhead +- **Development**: Quick prototyping and debugging +- **Caching layer**: Temporary storage for frequently accessed data +- **Small datasets**: When entire dataset fits comfortably in memory + +### Usage Example +```rust +use prollytree::storage::InMemoryNodeStorage; +use prollytree::tree::{ProllyTree, Tree}; +use prollytree::config::TreeConfig; + +let storage = InMemoryNodeStorage::<32>::new(); +let config = TreeConfig::<32>::default(); +let mut tree = ProllyTree::new(storage, config); + +// Data will be lost when `tree` goes out of scope +tree.insert(b"key".to_vec(), b"value".to_vec()); +``` + +### Configuration +The in-memory storage is self-contained and requires no configuration. It automatically manages memory allocation and cleanup. + +## FileNodeStorage + +### Description +The file storage backend persists each ProllyTree node as a separate file on the filesystem using binary serialization. Configuration data is stored in separate files with a `config_` prefix. 
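+
+Because nodes are persisted on disk, a tree written by one process run can be reloaded by a later one. A minimal sketch (reusing `ProllyTree::load_from_storage` from the migration example later in this guide; error handling elided):
+
+```rust
+use prollytree::storage::FileNodeStorage;
+use prollytree::tree::{ProllyTree, Tree};
+use prollytree::config::TreeConfig;
+use std::path::PathBuf;
+
+// Point at the directory a previous run populated.
+let storage = FileNodeStorage::<32>::new(PathBuf::from("./prolly_data"));
+let config = TreeConfig::<32>::default();
+
+// Reload the persisted tree instead of creating an empty one.
+let tree = ProllyTree::load_from_storage(storage, config)?;
+```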
+
+### Characteristics
+- **Performance**: Moderate - limited by filesystem I/O
+- **Persistence**: Full persistence across application restarts
+- **Storage Format**: Binary-serialized nodes (using bincode)
+- **File Organization**: One file per node, named by hash
+- **Platform Support**: Works on all platforms with filesystem access
+
+### Use Cases
+- **Local applications**: Desktop applications needing persistence
+- **Development**: When you need persistence but don't want database setup
+- **Small to medium datasets**: Up to thousands of nodes
+- **Debugging**: Easy to inspect individual node files
+
+### Usage Example
+```rust
+use prollytree::storage::FileNodeStorage;
+use prollytree::tree::{ProllyTree, Tree};
+use prollytree::config::TreeConfig;
+use std::path::PathBuf;
+
+let storage_dir = PathBuf::from("./prolly_data");
+let storage = FileNodeStorage::<32>::new(storage_dir);
+let config = TreeConfig::<32>::default();
+let mut tree = ProllyTree::new(storage, config);
+
+tree.insert(b"key".to_vec(), b"value".to_vec());
+// Data persists in ./prolly_data/ directory
+```
+
+### File Structure
+```
+prolly_data/
+├── a1b2c3d4e5f6...     (node file - hex hash)
+├── f6e5d4c3b2a1...     (node file - hex hash)
+├── config_tree_config  (configuration file)
+└── config_custom_key   (custom configuration)
+```
+
+### Limitations
+- **Scalability**: Performance degrades with a large number of nodes
+- **Atomicity**: No atomic updates across multiple nodes
+- **Concurrent Access**: Not safe for concurrent writers
+
+## RocksDBNodeStorage
+
+### Description
+RocksDB storage provides a production-ready, high-performance backend using Facebook's RocksDB LSM-tree implementation. It's optimized for ProllyTree's content-addressed, write-heavy workload patterns.
+
+### Characteristics
+- **Performance**: High throughput for both reads and writes
+- **Persistence**: Durable storage with WAL (Write-Ahead Log)
+- **Scalability**: Handles millions of nodes efficiently
+- **Compression**: LZ4 for hot data, Zstd for cold data
+- **Caching**: Multi-level caching (LRU cache + RocksDB block cache)
+- **Compaction**: Background cleanup of obsolete data
+
+### Architecture
+```
+Application
+    ↓
+LRU Cache (1000 nodes default)
+    ↓
+RocksDB
+├── Write Buffer (128MB)
+├── Block Cache (512MB)
+├── Bloom Filters (10 bits/key)
+└── SST Files (compressed)
+```
+
+### Use Cases
+- **Production applications**: High-performance persistent storage
+- **Large datasets**: Millions of nodes and frequent updates
+- **Write-heavy workloads**: Frequent tree modifications
+- **Distributed systems**: Building block for distributed storage
+
+### Usage Example
+```rust
+use prollytree::storage::RocksDBNodeStorage;
+use prollytree::tree::{ProllyTree, Tree};
+use prollytree::config::TreeConfig;
+use std::path::PathBuf;
+
+// Basic usage (the path is cloned so the alternatives below can reuse it)
+let db_path = PathBuf::from("./rocksdb_data");
+let storage = RocksDBNodeStorage::<32>::new(db_path.clone())?;
+let config = TreeConfig::<32>::default();
+let mut tree = ProllyTree::new(storage, config);
+
+// Custom cache size
+let storage = RocksDBNodeStorage::<32>::with_cache_size(db_path.clone(), 5000)?;
+
+// Custom RocksDB options
+let mut opts = RocksDBNodeStorage::<32>::default_options();
+opts.set_write_buffer_size(256 * 1024 * 1024); // 256MB
+let storage = RocksDBNodeStorage::<32>::with_options(db_path, opts)?;
+```
+
+### Configuration Options
+
+#### Default Optimizations
+- **Write Buffer**: 128MB for batching writes
+- **Memory Tables**: Up to 4 concurrent memtables
+- **Compression**: LZ4 for L0-L2, Zstd for bottom levels
+- **Block Cache**: 512MB for frequently accessed data
+- **Bloom Filters**: 10 bits per key for faster lookups
+
+#### Performance Tuning
+```rust
+use rocksdb::{Options, DBCompressionType, BlockBasedOptions, Cache};
+
+let mut opts = Options::default();
+
+// Increase write buffer for high write throughput
+opts.set_write_buffer_size(256 * 1024 * 1024);
+
+// More aggressive compression for storage efficiency
+opts.set_compression_type(DBCompressionType::Zstd);
+
+// Larger block cache for read-heavy workloads
+let cache = Cache::new_lru_cache(1024 * 1024 * 1024); // 1GB
+let mut block_opts = BlockBasedOptions::default();
+block_opts.set_block_cache(&cache);
+opts.set_block_based_table_factory(&block_opts);
+```
+
+### Batch Operations
+RocksDB storage supports efficient batch operations:
+
+```rust
+let nodes = vec![
+    (hash1, node1),
+    (hash2, node2),
+    (hash3, node3),
+];
+
+// Atomic batch insert
+storage.batch_insert_nodes(nodes)?;
+
+// Atomic batch delete
+storage.batch_delete_nodes(&[hash1, hash2])?;
+```
+
+### Monitoring and Maintenance
+- **Statistics**: RocksDB provides detailed performance metrics
+- **Compaction**: Automatic background compaction
+- **Backup**: Use RocksDB backup utilities for data safety
+- **Tuning**: Monitor write amplification and adjust settings
+
+## GitNodeStorage
+
+### Description
+The Git storage backend stores ProllyTree nodes as Git blob objects in a Git repository. This experimental backend is designed for development workflows where you want to leverage Git's content-addressable storage.
+
+### ⚠️ Important Limitations
+
+**Development Use Only**: GitNodeStorage should only be used for local development and experimentation. It is not suitable for production use due to several important limitations:
+
+1. **Dangling Objects**: ProllyTree nodes are stored as Git blob objects but are **not committed** to any branch or tag. These objects exist as "dangling" or "unreachable" objects in Git's object database.
+
+2. **Garbage Collection Risk**: Git's garbage collector (`git gc`) will **delete these dangling objects** during cleanup operations. This can happen:
+   - When running `git gc` manually
+   - Automatically during Git operations (push, pull, repack, etc.)
+   - When Git's automatic garbage collection triggers
+
+3. **Data Loss**: Since the objects are not referenced by any commit, branch, or tag, they will be permanently lost when garbage collected. There is no recovery mechanism.
+
+### Characteristics
+- **Storage Format**: Git blob objects (binary serialized nodes)
+- **Content Addressing**: Leverages Git's SHA-1 content addressing
+- **Persistence**: Temporary - objects can be garbage collected
+- **Integration**: Works with existing Git repositories
+- **Caching**: LRU cache for performance
+
+### Use Cases (Development Only)
+- **Git Integration Experiments**: Testing Git-based storage concepts
+- **Development Workflows**: Temporary storage during development
+- **Learning**: Understanding content-addressable storage
+- **Prototyping**: Rapid prototyping with Git infrastructure
+
+### Usage Example
+```rust
+// Only available with "git" feature
+#[cfg(feature = "git")]
+use prollytree::git::GitNodeStorage;
+
+let repo = gix::open(".")?;
+let dataset_dir = std::path::PathBuf::from("./git_data");
+let storage = GitNodeStorage::<32>::new(repo, dataset_dir)?;
+
+// ⚠️ WARNING: Data may be lost during git gc!
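+// The nodes written below become loose blobs that no commit, branch, or tag
+// references, so a manual or automatic `git gc` can prune them permanently
+// (see the limitations above).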
+let config = TreeConfig::<32>::default();
+let mut tree = ProllyTree::new(storage, config);
+tree.insert(b"key".to_vec(), b"value".to_vec());
+```
+
+### Data Safety Measures
+
+If you must use GitNodeStorage for development, consider these safety measures:
+
+1. **Disable Automatic GC**:
+   ```bash
+   git config gc.auto 0
+   git config gc.autopacklimit 0
+   ```
+
+2. **Create Temporary Commits** (advanced):
+   ```bash
+   # Periodically commit to preserve objects
+   git add -A
+   git commit -m "temp: preserve prolly objects"
+   ```
+
+3. **Use Separate Repository**:
+   Create a dedicated Git repository just for ProllyTree storage to avoid conflicts.
+
+### Architecture
+```
+ProllyTree Node
+    ↓
+Bincode Serialization
+    ↓
+Git Blob Object (dangling)
+    ↓
+Git Object Database
+    ↓
+⚠️ git gc → Deletion
+```
+
+## Storage Backend Comparison
+
+| Feature | InMemory | File | RocksDB | Git |
+|---------|----------|------|---------|-----|
+| **Persistence** | None | Full | Full | Temporary⚠️ |
+| **Performance** | Fastest | Moderate | High | Moderate |
+| **Scalability** | RAM-limited | Poor | Excellent | Poor |
+| **Setup Complexity** | None | None | Low | Medium |
+| **Production Ready** | No | Limited | Yes | No⚠️ |
+| **Concurrent Access** | Limited | No | Yes | Limited |
+| **Storage Overhead** | None | High | Low | Medium |
+| **Backup/Recovery** | N/A | File copy | RocksDB tools | Git tools |
+
+## Choosing the Right Backend
+
+### Development & Testing
+- **Unit Tests**: InMemoryNodeStorage
+- **Integration Tests**: FileNodeStorage or InMemoryNodeStorage
+- **Local Development**: FileNodeStorage or RocksDBNodeStorage
+
+### Production Deployments
+- **Small Applications**: FileNodeStorage (with careful consideration)
+- **High-Performance Applications**: RocksDBNodeStorage
+- **Distributed Systems**: RocksDBNodeStorage as foundation
+
+### Experimental
+- **Git Integration Research**: GitNodeStorage (development only)
+
+## Performance Benchmarks
+
+Run the storage comparison benchmarks to understand performance characteristics:
+
+```bash
+# Compare all available backends
+cargo bench --bench storage_bench --features rocksdb_storage
+
+# Run specific benchmark
+cargo bench --bench storage_bench storage_insert
+```
+
+## Migration Between Backends
+
+Currently, there's no built-in migration tool between storage backends. To migrate:
+
+1. **Export Data**: Iterate through the old storage and collect all key-value pairs
+2. **Create New Storage**: Initialize the target storage backend
+3. **Import Data**: Insert all data into the new storage
+4. **Validate**: Verify data integrity after migration
+
+Example migration pattern:
+```rust
+// Export from old storage
+let old_tree = ProllyTree::load_from_storage(old_storage, config.clone())?;
+let mut data = Vec::new();
+// ... collect all key-value pairs
+
+// Import to new storage
+let mut new_tree = ProllyTree::new(new_storage, config);
+for (key, value) in data {
+    new_tree.insert(key, value);
+}
+```
+
+## Best Practices
+
+### General
+- Choose the simplest backend that meets your requirements
+- Always benchmark with your specific data patterns
+- Consider backup and recovery procedures
+- Plan for data growth and scaling needs
+
+### InMemoryNodeStorage
+- Monitor memory usage to prevent OOM conditions
+- Use for temporary data only
+- Consider data loss implications
+
+### FileNodeStorage
+- Ensure adequate disk space and I/O performance
+- Implement application-level locking for concurrent access
+- Regular filesystem maintenance and monitoring
+
+### RocksDBNodeStorage
+- Monitor RocksDB metrics for performance tuning
+- Configure appropriate cache sizes for your workload
+- Plan for disk space and compaction overhead
+- Use batch operations for bulk updates
+
+### GitNodeStorage
+- **Never use in production**
+- Disable automatic garbage collection during development
+- Use dedicated Git repositories
+- Regularly backup important data to commits
+- Understand that data can be lost without warning
+
+## Troubleshooting
+
+### Common Issues
+
+#### OutOfMemory with InMemoryNodeStorage
+- Reduce dataset size or switch to persistent storage
+- Monitor heap usage and tune runtime parameters
+
+#### Poor Performance with FileNodeStorage
+- Check filesystem performance and available disk space
+- Consider switching to RocksDBNodeStorage for better performance
+- Reduce concurrent access patterns
+
+#### RocksDB Compilation Issues
+- Ensure proper build tools (cmake, C++ compiler)
+- Check RocksDB system dependencies
+- Use pre-built binaries if available
+
+#### Git Storage Data Loss
+- This is expected behavior - objects are not committed
+- Disable garbage collection or switch to persistent storage
+- Create periodic commits to preserve important data
+
+For additional help, consult the project documentation or open an issue on the GitHub repository.
\ No newline at end of file