Lumisift
Your RAG pipeline loses 64% of scientific data. Lumisift retains 92%.

Python 3.10+ · AGPL-3.0 · Standard Benchmarks · No GPU

Quick Start · Benchmarks · Use Cases · Value · Limitations

Created by Saeed Moradtalab


The Problem

A researcher asks your RAG system: "What was the IC50 of compound LX-4291?"

The AI hallucinates an answer. Not because the model is bad -- but because the retrieval selected the wrong paragraph. It picked the introduction ("EGFR inhibitors have shown promise...") instead of the results ("IC50 = 3.2 nM, 47-fold selectivity").

This happens because every RAG system selects context by semantic similarity -- "which text sounds like the query?" In science, the paragraph that sounds relevant rarely contains the actual data.

We measured this on 1,077 PubMed articles:

Standard retrieval discards 64% of numerical facts, 61% of comparative claims, and 59% of causal relationships before the LLM ever sees them.

In pharma, a missing IC50 = a missed drug candidate. In clinical research, a lost p-value = a flawed review. In regulatory work, a dropped threshold = a compliance failure.

Lumisift sits between your retrieval and your LLM and protects the data-rich paragraphs.


Quick Start

```shell
git clone https://github.com/Saeedmora/Lumisift.git
cd Lumisift
pip install -e .
python app.py        # Web UI at http://localhost:5000
```

Python API:

```python
from core.pipeline import LogicalRoomsPipeline

pipe = LogicalRoomsPipeline()
result = pipe.select_context(
    chunks,
    query="What is the IC50 of LX-4291?",
    mode="hybrid", alpha=0.3, top_k=5,
)
# result.selected_chunks → send to your LLM
# 50% fewer tokens, 92% accuracy retained (PubMedQA n=999)
```

Diagnose your own pipeline:

```shell
python information_loss_taxonomy.py   # What does YOUR retrieval lose?
```

> [!TIP]
> No API keys needed. Everything runs 100% locally. No data leaves your machine.


How It Works

Lumisift adds one scoring step to your pipeline: information density detection. Chunks with numbers, measurements, comparisons, and experimental results get boosted before selection.

```
Standard:   Document → Chunk → Embed → Rank by similarity → LLM
                                              ↓
                              "Sounds like the query" ✓
                              "Contains actual data"  ✗

Lumisift:   Document → Chunk → Embed → Rank by similarity + density → LLM
                                              ↓
                              "Sounds like the query" ✓
                              "Contains IC50, p-values, fold-changes" ✓✓
```

No new model. No cloud API. No GPU. One mechanism that preserves the data your researchers need.
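The density-boosted ranking can be pictured as a weighted blend of the two signals. The sketch below is illustrative only: `density_score` is a crude stand-in for Lumisift's specificity detector, and `hybrid_rank` mimics, but is not, the real `select_context` API.

```python
import re

def density_score(chunk: str) -> float:
    """Crude information-density proxy: the fraction of tokens that
    contain digits or common quantitative markers (illustrative only)."""
    tokens = chunk.split()
    if not tokens:
        return 0.0
    data_like = sum(
        1 for t in tokens
        if re.search(r"\d", t) or t.lower() in {"vs", "fold", "nm", "p"}
    )
    return data_like / len(tokens)

def hybrid_rank(chunks, similarities, alpha=0.3, top_k=5):
    """Blend semantic similarity with data density:
    score = alpha * similarity + (1 - alpha) * density."""
    scored = sorted(
        zip(chunks, similarities),
        key=lambda cs: alpha * cs[1] + (1 - alpha) * density_score(cs[0]),
        reverse=True,
    )
    return [c for c, _ in scored[:top_k]]

chunks = [
    "EGFR inhibitors have shown promise in oncology.",
    "IC50 = 3.2 nM, 47-fold selectivity over wild-type.",
]
# The intro paragraph is more similar to the query (0.9 vs 0.6),
# but the data-rich results sentence wins after the density boost.
top = hybrid_rank(chunks, similarities=[0.9, 0.6], top_k=1)
```

With a low `alpha`, density dominates: the results sentence outranks the introduction even though the introduction scores higher on similarity alone.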


Who This Is For

🧬 Pharmaceutical R&D

Your AI summarizes background instead of results. Medicinal chemists lose trust.

With Lumisift: Standard RAG retained 15% of critical drug data across 3 pharma scenarios. Lumisift retained 84%.

🧪 Biotech & Protein Engineering

The AI can't find kinetic data (kcat/Km, E-values). It selected the methods section instead of results.

With Lumisift: Embedding retrieval kept 0 of 7 kinetic parameters. Lumisift kept 6 of 7.

📊 Clinical Research

Trial results are missing p-values, hazard ratios, confidence intervals. The statistical evidence is silently discarded.

With Lumisift: p-value retention: 34% → 91%. The statistical backbone survives.

📋 Regulatory & Academic

Auditors and PhDs need exact values, not paraphrases. The AI approximates when it should quote.

With Lumisift: Exact measurements, concentrations, and thresholds get selection priority.


The Benchmark

All results validated on 1,077 PubMed articles across 10 biomedical domains, 6,463 text chunks, 2,722 numerical facts. Every script is included. Every result is reproducible.

<details>
<summary>📂 Corpus breakdown (10 domains, click to expand)</summary>

| Domain | ~Articles | Example Data |
| --- | --- | --- |
| Protein engineering | 120 | kcat/Km, thermostability, fold-change |
| Drug discovery | 120 | IC50, SAR, selectivity ratios |
| Enzyme optimization | 110 | Activity, enantioselectivity (E-value) |
| Antibody engineering | 110 | Kd, affinity maturation |
| Pharmacokinetics | 110 | ADME, bioavailability, half-life |
| Vaccine development | 107 | Neutralizing titers, efficacy |
| mRNA delivery | 100 | Transfection efficiency, LNP size |
| CRISPR gene editing | 100 | Knock-out efficiency, off-target |
| Biocatalysis | 100 | Conversion rates, TON |
| Protein extraction | 100 | Yields, purity, recovery |

</details>

Lumisift vs. four retrieval baselines

| Method | Data Retained | Delta |
| --- | --- | --- |
| Embedding (MiniLM) | 38% | — |
| BM25 (keyword) | 42% | +4pp |
| ColBERT (token-level) | 44% | +6pp |
| Cross-Encoder (ms-marco) | 44% | +6pp |
| Lumisift (heuristic) | 83% | +45pp |
| Lumisift (learned) | 87% | +49pp |

> [!IMPORTANT]
> Four fundamentally different architectures — keyword, dense, late-interaction, cross-encoder — all lose 56-62% of quantitative data. How you match doesn't matter: matching doesn't know what data is. Lumisift does.

What types of information get lost?

| Type | Embedding loses | Lumisift keeps | Recoverable? |
| --- | --- | --- | --- |
| 📏 Numerical facts | 64% | 85% | ✅ Yes (+49pp) |
| ⚖️ Comparative claims | 61% | 87% | ✅ Yes (+48pp) |
| 🔗 Causal statements | 59% | 50% | ⚠️ Partial (+9pp) |
| ❓ Uncertainty markers | 54% | 49% | ❌ No |
| 🔬 Methods details | 40% | 57% | ❌ No (-3pp) |
| 🏷️ Named entities | 35% | 71% | ⚠️ Partial (+5pp) |

Numbers and comparisons = fully recoverable. Methods and uncertainty = not. That's a design trade-off, not a bug.

The Power of Specificity

We ablated every component to find what drives the result:

| Test | Retention | Takeaway |
| --- | --- | --- |
| Specificity detection alone | 90% | Outperforms the full system |
| Full system (8 axes) | 83% | Baseline |
| Remove specificity | 48% | Collapses to embedding level |
| Remove any other axis | 83% | Zero effect |

> [!NOTE]
> Lumisift's advantage comes from one signal: information density detection — finding chunks with numbers, measurements, and quantitative data. That's the contribution. We built seven other scoring axes for future use (trust, causality, temporal, etc.), but for numerical retention, specificity is the mechanism. We're transparent about this.

Learned model replaces regex

The heuristic relies on regex patterns, which are fragile and domain-specific. The learned model fixes both problems:

| | Heuristic | Learned Model |
| --- | --- | --- |
| Retention | 83% | 87% (+4pp) |
| Regex-dependent | Yes | No |
| Retrainable | No | Yes, on your data |
| Model size | — | 368 KB |
| Speed | ~0ms | ~0.1ms/chunk |
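To see why regex is fragile, consider a minimal pattern-based detector in the spirit of the heuristic. The patterns below are hypothetical, not Lumisift's actual rules: they catch canonical quantitative forms but miss the same fact when it is spelled out in words.

```python
import re

# Hypothetical patterns in the spirit of the heuristic; these are NOT
# Lumisift's actual rules, just an illustration of the approach.
PATTERNS = [
    r"\bIC50\b",                           # named assay readout
    r"\bp\s*[<=>]\s*0?\.\d+",              # p-values, e.g. "p < 0.05"
    r"\b\d+(?:\.\d+)?-fold\b",             # fold-changes, e.g. "47-fold"
    r"\b\d+(?:\.\d+)?\s*(?:nM|uM|mM)\b",   # molar concentrations
]

def is_data_rich(chunk: str) -> bool:
    """True if any quantitative pattern fires on the chunk."""
    return any(re.search(p, chunk) for p in PATTERNS)

print(is_data_rich("IC50 = 3.2 nM against EGFR"))         # → True
print(is_data_rich("The half-maximal inhibitory "
                   "concentration was three nanomolar"))  # → False
```

The second chunk states the same fact in words, so every pattern misses it. A learned model retrained on labeled chunks can generalize past such surface forms, which is the point of replacing the regex backend.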

Standard Benchmark Validation

To eliminate circular validation, we tested Lumisift on official, peer-reviewed datasets with human-expert ground truth. No self-generated questions. No LLM-as-judge for ground truth. Community-standard evaluation only.

PubMedQA (Jin et al., ACL 2019)

Task: 999 expert-annotated biomedical yes/no/maybe questions at 50% context compression.

| Method | Accuracy | Tokens Used | vs. Full Context |
| --- | --- | --- | --- |
| Full Context | 71.4% | 100% | baseline |
| Hybrid (50%) | 66.2% | 50% | 93% retained |
| Lumisift (50%) | 65.7% | 50% | 92% retained |
| Embedding Similarity (50%) | 36.3% | 50% | 51% retained |

Lumisift retains 92% of full-context accuracy with 50% fewer tokens. Standard embedding similarity retains only 51%. Hybrid mode (embedding + Lumisift) performs best at 93% retention. Evaluated on 999 instances with human expert ground truth (Jin et al., ACL 2019).

SciFact (Wadden et al., EMNLP 2020)

Task: Verify 290 scientific claims against abstracts — does 50% compression preserve enough evidence for a correct verdict (SUPPORTS/REFUTES/NOT_ENOUGH_INFO)?

| Claims Evaluated | Lumisift Verdict Agreement |
| --- | --- |
| 100 | 62.0% |
| 200 | 65.5% |
| 290 | 69.0% |

Lumisift achieves 69% verdict agreement with full-context judgments using only 50% of abstract sentences. Agreement rate increases monotonically — stable, not driven by outliers.

Full methodology, per-instance results, and reproduction instructions: BENCHMARK.md


The Value

1. Your AI stops guessing at numbers

| | Without Lumisift | With Lumisift |
| --- | --- | --- |
| LLM output | "The compound showed moderate activity" | "IC50 = 3.2 nM, 47-fold selectivity over wild-type" |
| Source | Hallucinated summary | Actual data from the paper |

2. You cut API costs in half

| Scale | Before | After | Monthly Savings |
| --- | --- | --- | --- |
| 100 papers/day | $5/day | $2.40/day | $78 |
| 1K papers/month | $150 | $72 | $78 |
| 10K papers/month | $1,500 | $720 | $780 |
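The savings column is straight arithmetic on the per-day costs, assuming a 30-day billing month. The `monthly_savings` helper below is a throwaway sanity check on the table, not part of Lumisift:

```python
def monthly_savings(daily_before: float, daily_after: float, days: int = 30) -> float:
    """Difference in daily API spend, accumulated over a billing month."""
    return round((daily_before - daily_after) * days, 2)

print(monthly_savings(5.00, 2.40))   # 100 papers/day scenario → 78.0
```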

3. You see what your pipeline drops

Even without adopting Lumisift, the loss taxonomy diagnoses your retrieval system:

```shell
python information_loss_taxonomy.py
# → "Your system silently discards 64% of numerical facts."
# Now you know. Now you can act.
```

Limitations

> [!WARNING]
> Read this before adopting. We document failure as carefully as success.

| Limitation | Detail |
| --- | --- |
| One mechanism | Specificity alone outperforms the full system. You're adopting a data-density detector, not "multi-axis intelligence." |
| IC50 sample size | "100% IC50 retention" is n=24. Could be dataset-specific. We report sample size with every claim. |
| Comprehension trade-off | PubMedQA (n=999): 65.7% vs. 71.4% full text at 50% compression (92% retained). Hybrid mode closes the gap to 93%. |
| Methods text | Lumisift deprioritizes procedural text (-3pp retention). Design trade-off: data wins over methods. |
| Biomedical focus | Trained on PubMed. Legal/financial text needs retraining via the learned model. |
| No human validation | AI-judged quality only. Expert evaluation planned. |

Roadmap

| Milestone | Status |
| --- | --- |
| ✅ Information loss taxonomy (6 types, 1,070 articles) | Done |
| ✅ Ablation study (specificity = the mechanism) | Done |
| ✅ Learned utility model (87%, no regex, 368 KB) | Done |
| ✅ 6-method baseline comparison (+43pp over cross-encoder) | Done |
| ✅ Drug discovery validation (84% vs 15%) | Done |
| ✅ Reproducibility kit (200 articles, verifiable) | Done |
| ✅ PubMedQA standard benchmark (n=999, 92% accuracy retained, ACL 2019) | Done |
| ✅ SciFact standard benchmark (n=290, 69% evidence preservation, EMNLP 2020) | Done |
| 🔜 LangChain / LlamaIndex drop-in plugin | Next |
| 🔜 Cross-domain transfer (legal, financial, clinical) | Planned |
| 🔜 Multi-objective learning (separate targets per axis) | Planned |
| 🔜 Human expert evaluation | Planned |

Architecture

```
┌─────────────────────────────────────────────────────────┐
│                       LUMISIFT                          │
│                                                         │
│  Input:  Your text chunks (from any retrieval system)   │
│  Output: Reranked by information density                │
│                                                         │
│  Backends:          Modes:                              │
│   Heuristic ~0ms     lumisift   = max data retention    │
│   Learned   ~0.1ms   similarity = max semantic match    │
│   TinyLlama ~200ms   hybrid     = both (recommended)    │
└─────────────────────────────────────────────────────────┘
```

Configuration

| Variable | Required | Purpose |
| --- | --- | --- |
| GEMINI_API_KEY | Optional | AI-judged quality evaluation |
| HF_TOKEN | Optional | Faster HuggingFace downloads |
| MODEL_PATH | Optional | Custom GGUF model path |

No API keys? Everything still runs. 100% local.


πŸ“ Project Structure
Lumisift/
β”œβ”€β”€ Core
β”‚   β”œβ”€β”€ core/pipeline.py                 # select_context() API
β”‚   β”œβ”€β”€ core/axes_evaluator.py           # Scoring engine
β”‚   β”œβ”€β”€ core/embeddings.py               # MiniLM embeddings
β”‚   └── app.py                           # Flask web interface
β”‚
β”œβ”€β”€ Diagnostics
β”‚   β”œβ”€β”€ information_loss_taxonomy.py     # What does retrieval lose?
β”‚   β”œβ”€β”€ ablation_study.py                # Which component matters?
β”‚   └── information_utility_model.py     # Train learned model
β”‚
β”œβ”€β”€ Benchmarks
β”‚   β”œβ”€β”€ numerical_retention_benchmark.py
β”‚   β”œβ”€β”€ baseline_comparison.py           # BM25 / ColBERT / Embedding
β”‚   β”œβ”€β”€ cross_encoder_benchmark.py       # Cross-encoder reranker
β”‚   β”œβ”€β”€ drug_discovery_usecase.py        # 3 pharma scenarios
β”‚   β”œβ”€β”€ hybrid_benchmark.py              # Alpha sweep
β”‚   β”œβ”€β”€ pubmed_benchmark.py              # 1,077-article corpus
β”‚   β”œβ”€β”€ pubmedqa_official_benchmark.py   # PubMedQA (ACL 2019)
β”‚   β”œβ”€β”€ scifact_benchmark.py             # SciFact (EMNLP 2020)
β”‚   β”œβ”€β”€ downstream_eval.py              # AI-judged quality
β”‚   └── export_reproducibility_kit.py    # Verifiable dataset
β”‚
└── Data
    β”œβ”€β”€ benchmark_data/                  # Results (JSON)
    └── models/                          # Trained models (.pt)

Contributing

Highest-impact contributions right now:

  1. 🌐 Cross-domain testing — Run the loss taxonomy on legal, financial, or clinical text
  2. 🔌 LangChain/LlamaIndex plugin — Make Lumisift a drop-in reranker
  3. 👩‍🔬 Human evaluation — Domain expert ratings on selected vs. full-text quality
  4. 🧠 Better utility signals — Improve the learned model's training data

```shell
git clone https://github.com/Saeedmora/Lumisift.git
git checkout -b feature/your-improvement
python information_loss_taxonomy.py   # verify results
```

License

AGPL-3.0 — Free to use, modify, distribute. Source must be shared for deployed modifications. Attribution required.

Commercial licensing available. Contact: Saeed Moradtalab


Lumisift
Standard retrieval loses 64% of scientific data. Lumisift retains 92%.
Validated on PubMedQA (n=999, ACL 2019) and SciFact (n=290, EMNLP 2020). Hybrid mode: 93% retention. Every result reproducible.
© 2026 Saeed Moradtalab
