Lumisift
Your RAG pipeline loses 64% of scientific data. Lumisift retains 92%.
Quick Start Β· Benchmarks Β· Use Cases Β· Value Β· Limitations
Created by Saeed Moradtalab
A researcher asks your RAG system: "What was the IC50 of compound LX-4291?"
The AI hallucinates an answer. Not because the model is bad -- but because the retrieval selected the wrong paragraph. It picked the introduction ("EGFR inhibitors have shown promise...") instead of the results ("IC50 = 3.2 nM, 47-fold selectivity").
This happens because every RAG system selects context by semantic similarity -- "which text sounds like the query?" In science, the paragraph that sounds relevant rarely contains the actual data.
We measured this on 1,077 PubMed articles:
Standard retrieval discards 64% of numerical facts, 61% of comparative claims, and 59% of causal relationships before the LLM ever sees them.
In pharma, a missing IC50 = a missed drug candidate. In clinical research, a lost p-value = a flawed review. In regulatory work, a dropped threshold = a compliance failure.
Lumisift sits between your retrieval and your LLM and protects the data-rich paragraphs.
git clone https://github.com/Saeedmora/Lumisift.git
cd Lumisift
pip install -e .
python app.py # Web UI at http://localhost:5000Python API:
from core.pipeline import LogicalRoomsPipeline
pipe = LogicalRoomsPipeline()
result = pipe.select_context(
chunks,
query="What is the IC50 of LX-4291?",
mode="hybrid", alpha=0.3, top_k=5
)
# result.selected_chunks β send to your LLM
# 50% fewer tokens, 92% accuracy retained (PubMedQA n=999)Diagnose your own pipeline:
python information_loss_taxonomy.py # What does YOUR retrieval lose?Tip
No API keys needed. Everything runs 100% locally. No data leaves your machine.
Lumisift adds one scoring step to your pipeline: information density detection. Chunks with numbers, measurements, comparisons, and experimental results get boosted before selection.
Standard: Document β Chunk β Embed β Rank by similarity β LLM
β
"Sounds like the query" β
"Contains actual data" β
Lumisift: Document β Chunk β Embed β Rank by similarity + density β LLM
β
"Sounds like the query" β
"Contains IC50, p-values, fold-changes" ββ
No new model. No cloud API. No GPU. One mechanism that preserves the data your researchers need.
|
Your AI summarizes background instead of results. Medicinal chemists lose trust. With Lumisift: Standard RAG retained 15% of critical drug data across 3 pharma scenarios. Lumisift retained 84%. |
The AI can't find kinetic data (kcat/Km, E-values). It selected the methods section instead of results. With Lumisift: Embedding retrieval kept 0 of 7 kinetic parameters. Lumisift kept 6 of 7. |
|
Trial results are missing p-values, hazard ratios, confidence intervals. The statistical evidence is silently discarded. With Lumisift: p-value retention: 34% β 91%. The statistical backbone survives. |
Auditors and PhDs need exact values, not paraphrases. The AI approximates when it should quote. With Lumisift: Exact measurements, concentrations, and thresholds get selection priority. |
All results validated on 1,077 PubMed articles across 10 biomedical domains, 6,463 text chunks, 2,722 numerical facts. Every script is included. Every result is reproducible.
π Corpus breakdown (10 domains, click to expand)
| Domain | ~Articles | Example Data |
|---|---|---|
| Protein engineering | 120 | kcat/Km, thermostability, fold-change |
| Drug discovery | 120 | IC50, SAR, selectivity ratios |
| Enzyme optimization | 110 | Activity, enantioselectivity (E-value) |
| Antibody engineering | 110 | Kd, affinity maturation |
| Pharmacokinetics | 110 | ADME, bioavailability, half-life |
| Vaccine development | 107 | Neutralizing titers, efficacy |
| mRNA delivery | 100 | Transfection efficiency, LNP size |
| CRISPR gene editing | 100 | Knock-out efficiency, off-target |
| Biocatalysis | 100 | Conversion rates, TON |
| Protein extraction | 100 | Yields, purity, recovery |
| Method | Data Retained | Delta |
|---|---|---|
| Embedding (MiniLM) | 38% | β |
| BM25 (keyword) | 42% | +4pp |
| ColBERT (token-level) | 44% | +6pp |
| Cross-Encoder (ms-marco) | 44% | +6pp |
| Lumisift (heuristic) | 83% | +45pp |
| Lumisift (learned) | 87% | +49pp |
Important
Four fundamentally different architectures β keyword, dense, late-interaction, cross-encoder β all lose 56-62% of quantitative data. The method of matching doesn't matter. Matching doesn't know what data is. Lumisift does.
| Type | Embedding loses | Lumisift keeps | Recoverable? |
|---|---|---|---|
| π Numerical facts | 64% | 85% | β Yes (+49pp) |
| βοΈ Comparative claims | 61% | 87% | β Yes (+48pp) |
| π Causal statements | 59% | 50% | |
| β Uncertainty markers | 54% | 49% | β No |
| π¬ Methods details | 40% | 57% | β No (-3pp) |
| π·οΈ Named entities | 35% | 71% |
Numbers and comparisons = fully recoverable. Methods and uncertainty = not. That's a design trade-off, not a bug.
We ablated every component to find what drives the result:
| Test | Retention | Takeaway |
|---|---|---|
| Specificity detection alone | 90% | Outperforms the full system |
| Full system (8 axes) | 83% | Baseline |
| Remove specificity | 48% | Collapses to embedding-level |
| Remove any other axis | 83% | Zero effect |
Note
Lumisift's advantage comes from one signal: information density detection β finding chunks with numbers, measurements, and quantitative data. That's the contribution. We built seven other scoring axes for future use (trust, causality, temporal, etc.), but for numerical retention, specificity is the mechanism. We're transparent about this.
The heuristic uses regex. Fragile, domain-specific. The learned model fixes both:
| Heuristic | Learned Model | |
|---|---|---|
| Retention | 83% | 87% (+5pp) |
| Regex-dependent | Yes | No |
| Retrainable | No | Yes, on your data |
| Model size | β | 368 KB |
| Speed | ~0ms | ~0.1ms/chunk |
To eliminate circular validation, we tested Lumisift on official, peer-reviewed datasets with human-expert ground truth. No self-generated questions. No LLM-as-judge for ground truth. Community-standard evaluation only.
Task: 999 expert-annotated biomedical yes/no/maybe questions at 50% context compression.
| Method | Accuracy | Tokens Used | vs Full Context |
|---|---|---|---|
| Full Context | 71.4% | 100% | baseline |
| Hybrid (50%) | 66.2% | 50% | 93% retained |
| Lumisift (50%) | 65.7% | 50% | 92% retained |
| Embedding Similarity (50%) | 36.3% | 50% | 51% retained |
Lumisift retains 92% of full-context accuracy with 50% fewer tokens. Standard embedding similarity retains only 51%. Hybrid mode (embedding + Lumisift) performs best at 93% retention. Evaluated on 999 instances with human expert ground truth (Jin et al., ACL 2019).
Task: Verify 290 scientific claims against abstracts β does 50% compression preserve enough evidence for correct verdict (SUPPORTS/REFUTES/NOT_ENOUGH_INFO)?
| Claims Evaluated | Lumisift Verdict Agreement |
|---|---|
| 100 | 62.0% |
| 200 | 65.5% |
| 290 | 69.0% |
Lumisift achieves 69% verdict agreement with full-context judgments using only 50% of abstract sentences. Agreement rate increases monotonically β stable, not driven by outliers.
Full methodology, per-instance results, and reproduction instructions: BENCHMARK.md
| Without Lumisift | With Lumisift | |
|---|---|---|
| LLM output | "The compound showed moderate activity" | "IC50 = 3.2 nM, 47-fold selectivity over wild-type" |
| Source | Hallucinated summary | Actual data from paper |
| Scale | Before | After | Monthly Savings |
|---|---|---|---|
| 100 papers/day | $5/day | $2.40/day | $78 |
| 1K papers/month | $150 | $72 | $78 |
| 10K papers/month | $1,500 | $720 | $780 |
Even without adopting Lumisift, the loss taxonomy diagnoses your retrieval system:
python information_loss_taxonomy.py
# β "Your system silently discards 64% of numerical facts."
# Now you know. Now you can act.Warning
Read this before adopting. We document failure as carefully as success.
| Limitation | Detail |
|---|---|
| One mechanism | Specificity alone outperforms the full system. You're adopting a data density detector, not "multi-axis intelligence." |
| IC50 sample size | "100% IC50 retention" is n=24. Could be dataset-specific. We report sample size with every claim. |
| Comprehension trade-off | PubMedQA (n=999): 65.7% vs 71.4% full text at 50% compression (92% retained). Hybrid mode closes to 93%. |
| Methods text | Lumisift deprioritizes procedural text by -3pp. Design trade-off: data wins over methods. |
| Biomedical focus | Trained on PubMed. Legal/financial text needs retraining via the learned model. |
| No human validation | AI-judged quality only. Expert evaluation planned. |
| Milestone | Status | |
|---|---|---|
| β | Information loss taxonomy (6 types, 1,070 articles) | Done |
| β | Ablation study (specificity = the mechanism) | Done |
| β | Learned utility model (87%, no regex, 368 KB) | Done |
| β | 6-method baseline comparison (+42pp over cross-encoder) | Done |
| β | Drug discovery validation (84% vs 15%) | Done |
| β | Reproducibility kit (200 articles, verifiable) | Done |
| β | PubMedQA standard benchmark (n=999, 92% accuracy retained, ACL 2019) | Done |
| β | SciFact standard benchmark (n=290, 69% evidence preservation, EMNLP 2020) | Done |
| π | LangChain / LlamaIndex drop-in plugin | Next |
| π | Cross-domain transfer (legal, financial, clinical) | Planned |
| π | Multi-objective learning (separate targets per axis) | Planned |
| π | Human expert evaluation | Planned |
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β LUMISIFT β
β β
β Input: Your text chunks (from any retrieval system) β
β Output: Reranked by information density β
β β
β Backends: Modes: β
β Heuristic ~0ms lumisift = max data retention β
β Learned ~0.1ms similarity = max semantic match β
β TinyLlama ~200ms hybrid = both (recommended) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
| Variable | Required | Purpose |
|---|---|---|
GEMINI_API_KEY |
Optional | AI-judged quality evaluation |
HF_TOKEN |
Optional | Faster HuggingFace downloads |
MODEL_PATH |
Optional | Custom GGUF model path |
No API keys? Everything still runs. 100% local.
π Project Structure
Lumisift/
βββ Core
β βββ core/pipeline.py # select_context() API
β βββ core/axes_evaluator.py # Scoring engine
β βββ core/embeddings.py # MiniLM embeddings
β βββ app.py # Flask web interface
β
βββ Diagnostics
β βββ information_loss_taxonomy.py # What does retrieval lose?
β βββ ablation_study.py # Which component matters?
β βββ information_utility_model.py # Train learned model
β
βββ Benchmarks
β βββ numerical_retention_benchmark.py
β βββ baseline_comparison.py # BM25 / ColBERT / Embedding
β βββ cross_encoder_benchmark.py # Cross-encoder reranker
β βββ drug_discovery_usecase.py # 3 pharma scenarios
β βββ hybrid_benchmark.py # Alpha sweep
β βββ pubmed_benchmark.py # 1,077-article corpus
β βββ pubmedqa_official_benchmark.py # PubMedQA (ACL 2019)
β βββ scifact_benchmark.py # SciFact (EMNLP 2020)
β βββ downstream_eval.py # AI-judged quality
β βββ export_reproducibility_kit.py # Verifiable dataset
β
βββ Data
βββ benchmark_data/ # Results (JSON)
βββ models/ # Trained models (.pt)
Highest-impact contributions right now:
- π Cross-domain testing β Run the loss taxonomy on legal, financial, or clinical text
- π LangChain/LlamaIndex plugin β Make Lumisift a drop-in reranker
- π©βπ¬ Human evaluation β Domain expert ratings on selected vs full-text quality
- π§ Better utility signals β Improve the learned model's training data
git clone https://github.com/Saeedmora/Lumisift.git
git checkout -b feature/your-improvement
python information_loss_taxonomy.py # verify resultsAGPL-3.0 β Free to use, modify, distribute. Source must be shared for deployed modifications. Attribution required.
Commercial licensing available. Contact: Saeed Moradtalab
Lumisift
Standard retrieval loses 64% of scientific data. Lumisift retains 92%.
Validated on PubMedQA (n=999, ACL 2019) and SciFact (n=290, EMNLP 2020). Hybrid mode: 93% retention. Every result reproducible.
Β© 2026 Saeed Moradtalab