Lumisift
Your RAG pipeline loses 64% of scientific data. Lumisift retains 92%.

Python 3.10+ · AGPL-3.0 · Standard Benchmarks · No GPU

Quick Start · Benchmarks · Use Cases · Value · Limitations

Created by Saeed Moradtalab


The Problem

A researcher asks your RAG system: "What was the IC50 of compound LX-4291?"

The AI hallucinates an answer. Not because the model is bad -- but because the retrieval selected the wrong paragraph. It picked the introduction ("EGFR inhibitors have shown promise...") instead of the results ("IC50 = 3.2 nM, 47-fold selectivity").

This happens because every RAG system selects context by semantic similarity -- "which text sounds like the query?" In science, the paragraph that sounds relevant rarely contains the actual data.

We measured this on 1,077 PubMed articles:

Standard retrieval discards 64% of numerical facts, 61% of comparative claims, and 59% of causal relationships before the LLM ever sees them.

In pharma, a missing IC50 = a missed drug candidate. In clinical research, a lost p-value = a flawed review. In regulatory work, a dropped threshold = a compliance failure.

Lumisift sits between your retrieval and your LLM and protects the data-rich paragraphs.


Quick Start

```shell
git clone https://github.com/Saeedmora/Lumisift.git
cd Lumisift
pip install -e .
python app.py        # Web UI at http://localhost:5000
```

Python API:

```python
from core.pipeline import LogicalRoomsPipeline

pipe = LogicalRoomsPipeline()
result = pipe.select_context(
    chunks,
    query="What is the IC50 of LX-4291?",
    mode="hybrid", alpha=0.3, top_k=5,
)
# result.selected_chunks → send to your LLM
# 50% fewer tokens, 92% accuracy retained (PubMedQA n=999)
```

Diagnose your own pipeline:

```shell
python information_loss_taxonomy.py   # What does YOUR retrieval lose?
```

> [!TIP]
> No API keys needed. Everything runs 100% locally. No data leaves your machine.


How It Works

Lumisift adds one scoring step to your pipeline: information density detection. Chunks with numbers, measurements, comparisons, and experimental results get boosted before selection.

```
Standard:   Document → Chunk → Embed → Rank by similarity → LLM
                                              ↓
                              "Sounds like the query" ✓
                              "Contains actual data"  ✗

Lumisift:   Document → Chunk → Embed → Rank by similarity + density → LLM
                                              ↓
                              "Sounds like the query" ✓
                              "Contains IC50, p-values, fold-changes" ✓✓
```

No new model. No cloud API. No GPU. One mechanism that preserves the data your researchers need.
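The density-boosted ranking can be pictured as a weighted blend of the two signals. The sketch below is illustrative only: `density_score` is a crude stand-in for Lumisift's specificity detector, and `hybrid_rank` mimics, but is not, the real `select_context` API.

```python
import re

def density_score(chunk: str) -> float:
    """Crude information-density proxy: the fraction of tokens that
    contain digits or common quantitative markers (illustrative only)."""
    tokens = chunk.split()
    if not tokens:
        return 0.0
    data_like = sum(
        1 for t in tokens
        if re.search(r"\d", t) or t.lower() in {"vs", "fold", "nm", "p"}
    )
    return data_like / len(tokens)

def hybrid_rank(chunks, similarities, alpha=0.3, top_k=5):
    """Blend semantic similarity with data density:
    score = alpha * similarity + (1 - alpha) * density."""
    scored = sorted(
        zip(chunks, similarities),
        key=lambda cs: alpha * cs[1] + (1 - alpha) * density_score(cs[0]),
        reverse=True,
    )
    return [c for c, _ in scored[:top_k]]

chunks = [
    "EGFR inhibitors have shown promise in oncology.",
    "IC50 = 3.2 nM, 47-fold selectivity over wild-type.",
]
# The intro paragraph is more similar to the query (0.9 vs 0.6),
# but the data-rich results sentence wins after the density boost.
top = hybrid_rank(chunks, similarities=[0.9, 0.6], top_k=1)
```

With a low `alpha`, density dominates: the results sentence outranks the introduction even though the introduction scores higher on similarity alone.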


Who This Is For

🧬 Pharmaceutical R&D

Your AI summarizes background instead of results. Medicinal chemists lose trust.

With Lumisift: Standard RAG retained 15% of critical drug data across 3 pharma scenarios. Lumisift retained 84%.

🧪 Biotech & Protein Engineering

The AI can't find kinetic data (kcat/Km, E-values). It selected the methods section instead of results.

With Lumisift: Embedding retrieval kept 0 of 7 kinetic parameters. Lumisift kept 6 of 7.

📊 Clinical Research

Trial results are missing p-values, hazard ratios, confidence intervals. The statistical evidence is silently discarded.

With Lumisift: p-value retention: 34% → 91%. The statistical backbone survives.

📋 Regulatory & Academic

Auditors and PhDs need exact values, not paraphrases. The AI approximates when it should quote.

With Lumisift: Exact measurements, concentrations, and thresholds get selection priority.


The Benchmark

All results validated on 1,077 PubMed articles across 10 biomedical domains, 6,463 text chunks, 2,722 numerical facts. Every script is included. Every result is reproducible.

<details>
<summary>📂 Corpus breakdown (10 domains, click to expand)</summary>

| Domain | ~Articles | Example Data |
| --- | --- | --- |
| Protein engineering | 120 | kcat/Km, thermostability, fold-change |
| Drug discovery | 120 | IC50, SAR, selectivity ratios |
| Enzyme optimization | 110 | Activity, enantioselectivity (E-value) |
| Antibody engineering | 110 | Kd, affinity maturation |
| Pharmacokinetics | 110 | ADME, bioavailability, half-life |
| Vaccine development | 107 | Neutralizing titers, efficacy |
| mRNA delivery | 100 | Transfection efficiency, LNP size |
| CRISPR gene editing | 100 | Knock-out efficiency, off-target |
| Biocatalysis | 100 | Conversion rates, TON |
| Protein extraction | 100 | Yields, purity, recovery |

</details>

Lumisift vs. four retrieval baselines

| Method | Data Retained | Delta |
| --- | --- | --- |
| Embedding (MiniLM) | 38% | — |
| BM25 (keyword) | 42% | +4pp |
| ColBERT (token-level) | 44% | +6pp |
| Cross-Encoder (ms-marco) | 44% | +6pp |
| Lumisift (heuristic) | 83% | +45pp |
| Lumisift (learned) | 87% | +49pp |

> [!IMPORTANT]
> Four fundamentally different architectures — keyword, dense, late-interaction, cross-encoder — all lose 56-62% of quantitative data. How you match doesn't matter: matching doesn't know what data is. Lumisift does.

What types of information get lost?

| Type | Embedding loses | Lumisift keeps | Recoverable? |
| --- | --- | --- | --- |
| 📏 Numerical facts | 64% | 85% | ✅ Yes (+49pp) |
| ⚖️ Comparative claims | 61% | 87% | ✅ Yes (+48pp) |
| 🔗 Causal statements | 59% | 50% | ⚠️ Partial (+9pp) |
| ❓ Uncertainty markers | 54% | 49% | ❌ No |
| 🔬 Methods details | 40% | 57% | ❌ No (-3pp) |
| 🏷️ Named entities | 35% | 71% | ⚠️ Partial (+5pp) |

Numbers and comparisons = fully recoverable. Methods and uncertainty = not. That's a design trade-off, not a bug.

The Power of Specificity

We ablated every component to find what drives the result:

| Test | Retention | Takeaway |
| --- | --- | --- |
| Specificity detection alone | 90% | Outperforms the full system |
| Full system (8 axes) | 83% | Baseline |
| Remove specificity | 48% | Collapses to embedding level |
| Remove any other axis | 83% | Zero effect |

> [!NOTE]
> Lumisift's advantage comes from one signal: information density detection — finding chunks with numbers, measurements, and quantitative data. That's the contribution. We built seven other scoring axes for future use (trust, causality, temporal, etc.), but for numerical retention, specificity is the mechanism. We're transparent about this.

Learned model replaces regex

The heuristic relies on regex patterns, which are fragile and domain-specific. The learned model fixes both problems:

| | Heuristic | Learned Model |
| --- | --- | --- |
| Retention | 83% | 87% (+4pp) |
| Regex-dependent | Yes | No |
| Retrainable | No | Yes, on your data |
| Model size | — | 368 KB |
| Speed | ~0ms | ~0.1ms/chunk |
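To see why regex is fragile, consider a minimal pattern-based detector in the spirit of the heuristic. The patterns below are hypothetical, not Lumisift's actual rules: they catch canonical quantitative forms but miss the same fact when it is spelled out in words.

```python
import re

# Hypothetical patterns in the spirit of the heuristic; these are NOT
# Lumisift's actual rules, just an illustration of the approach.
PATTERNS = [
    r"\bIC50\b",                           # named assay readout
    r"\bp\s*[<=>]\s*0?\.\d+",              # p-values, e.g. "p < 0.05"
    r"\b\d+(?:\.\d+)?-fold\b",             # fold-changes, e.g. "47-fold"
    r"\b\d+(?:\.\d+)?\s*(?:nM|uM|mM)\b",   # molar concentrations
]

def is_data_rich(chunk: str) -> bool:
    """True if any quantitative pattern fires on the chunk."""
    return any(re.search(p, chunk) for p in PATTERNS)

print(is_data_rich("IC50 = 3.2 nM against EGFR"))         # → True
print(is_data_rich("The half-maximal inhibitory "
                   "concentration was three nanomolar"))  # → False
```

The second chunk states the same fact in words, so every pattern misses it. A learned model retrained on labeled chunks can generalize past such surface forms, which is the point of replacing the regex backend.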

Standard Benchmark Validation

To eliminate circular validation, we tested Lumisift on official, peer-reviewed datasets with human-expert ground truth. No self-generated questions. No LLM-as-judge for ground truth. Community-standard evaluation only.

PubMedQA (Jin et al., ACL 2019)

Task: 999 expert-annotated biomedical yes/no/maybe questions at 50% context compression.

| Method | Accuracy | Tokens Used | vs. Full Context |
| --- | --- | --- | --- |
| Full Context | 71.4% | 100% | baseline |
| Hybrid (50%) | 66.2% | 50% | 93% retained |
| Lumisift (50%) | 65.7% | 50% | 92% retained |
| Embedding Similarity (50%) | 36.3% | 50% | 51% retained |

Lumisift retains 92% of full-context accuracy with 50% fewer tokens. Standard embedding similarity retains only 51%. Hybrid mode (embedding + Lumisift) performs best at 93% retention. Evaluated on 999 instances with human expert ground truth (Jin et al., ACL 2019).

SciFact (Wadden et al., EMNLP 2020)

Task: Verify 290 scientific claims against abstracts — does 50% compression preserve enough evidence for a correct verdict (SUPPORTS/REFUTES/NOT_ENOUGH_INFO)?

| Claims Evaluated | Lumisift Verdict Agreement |
| --- | --- |
| 100 | 62.0% |
| 200 | 65.5% |
| 290 | 69.0% |

Lumisift achieves 69% verdict agreement with full-context judgments using only 50% of abstract sentences. Agreement rate increases monotonically — stable, not driven by outliers.

Full methodology, per-instance results, and reproduction instructions: BENCHMARK.md


The Value

1. Your AI stops guessing at numbers

| | Without Lumisift | With Lumisift |
| --- | --- | --- |
| LLM output | "The compound showed moderate activity" | "IC50 = 3.2 nM, 47-fold selectivity over wild-type" |
| Source | Hallucinated summary | Actual data from the paper |

2. You cut API costs in half

| Scale | Before | After | Monthly Savings |
| --- | --- | --- | --- |
| 100 papers/day | $5/day | $2.40/day | $78 |
| 1K papers/month | $150 | $72 | $78 |
| 10K papers/month | $1,500 | $720 | $780 |
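The savings column is straight arithmetic on the per-day costs, assuming a 30-day billing month. The `monthly_savings` helper below is a throwaway sanity check on the table, not part of Lumisift:

```python
def monthly_savings(daily_before: float, daily_after: float, days: int = 30) -> float:
    """Difference in daily API spend, accumulated over a billing month."""
    return round((daily_before - daily_after) * days, 2)

print(monthly_savings(5.00, 2.40))   # 100 papers/day scenario → 78.0
```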

3. You see what your pipeline drops

Even without adopting Lumisift, the loss taxonomy diagnoses your retrieval system:

```shell
python information_loss_taxonomy.py
# → "Your system silently discards 64% of numerical facts."
# Now you know. Now you can act.
```

Limitations

> [!WARNING]
> Read this before adopting. We document failure as carefully as success.

| Limitation | Detail |
| --- | --- |
| One mechanism | Specificity alone outperforms the full system. You're adopting a data-density detector, not "multi-axis intelligence." |
| IC50 sample size | "100% IC50 retention" is n=24. Could be dataset-specific. We report sample size with every claim. |
| Comprehension trade-off | PubMedQA (n=999): 65.7% vs. 71.4% full text at 50% compression (92% retained). Hybrid mode closes the gap to 93%. |
| Methods text | Lumisift deprioritizes procedural text (-3pp retention). Design trade-off: data wins over methods. |
| Biomedical focus | Trained on PubMed. Legal/financial text needs retraining via the learned model. |
| No human validation | AI-judged quality only. Expert evaluation planned. |

Roadmap

| Milestone | Status |
| --- | --- |
| ✅ Information loss taxonomy (6 types, 1,070 articles) | Done |
| ✅ Ablation study (specificity = the mechanism) | Done |
| ✅ Learned utility model (87%, no regex, 368 KB) | Done |
| ✅ 6-method baseline comparison (+43pp over cross-encoder) | Done |
| ✅ Drug discovery validation (84% vs 15%) | Done |
| ✅ Reproducibility kit (200 articles, verifiable) | Done |
| ✅ PubMedQA standard benchmark (n=999, 92% accuracy retained, ACL 2019) | Done |
| ✅ SciFact standard benchmark (n=290, 69% evidence preservation, EMNLP 2020) | Done |
| 🔜 LangChain / LlamaIndex drop-in plugin | Next |
| 🔜 Cross-domain transfer (legal, financial, clinical) | Planned |
| 🔜 Multi-objective learning (separate targets per axis) | Planned |
| 🔜 Human expert evaluation | Planned |

Architecture

```
┌─────────────────────────────────────────────────────────┐
│                       LUMISIFT                          │
│                                                         │
│  Input:  Your text chunks (from any retrieval system)   │
│  Output: Reranked by information density                │
│                                                         │
│  Backends:          Modes:                              │
│   Heuristic ~0ms     lumisift   = max data retention    │
│   Learned   ~0.1ms   similarity = max semantic match    │
│   TinyLlama ~200ms   hybrid     = both (recommended)    │
└─────────────────────────────────────────────────────────┘
```

Configuration

| Variable | Required | Purpose |
| --- | --- | --- |
| GEMINI_API_KEY | Optional | AI-judged quality evaluation |
| HF_TOKEN | Optional | Faster HuggingFace downloads |
| MODEL_PATH | Optional | Custom GGUF model path |

No API keys? Everything still runs. 100% local.


πŸ“ Project Structure
Lumisift/
β”œβ”€β”€ Core
β”‚   β”œβ”€β”€ core/pipeline.py                 # select_context() API
β”‚   β”œβ”€β”€ core/axes_evaluator.py           # Scoring engine
β”‚   β”œβ”€β”€ core/embeddings.py               # MiniLM embeddings
β”‚   └── app.py                           # Flask web interface
β”‚
β”œβ”€β”€ Diagnostics
β”‚   β”œβ”€β”€ information_loss_taxonomy.py     # What does retrieval lose?
β”‚   β”œβ”€β”€ ablation_study.py                # Which component matters?
β”‚   └── information_utility_model.py     # Train learned model
β”‚
β”œβ”€β”€ Benchmarks
β”‚   β”œβ”€β”€ numerical_retention_benchmark.py
β”‚   β”œβ”€β”€ baseline_comparison.py           # BM25 / ColBERT / Embedding
β”‚   β”œβ”€β”€ cross_encoder_benchmark.py       # Cross-encoder reranker
β”‚   β”œβ”€β”€ drug_discovery_usecase.py        # 3 pharma scenarios
β”‚   β”œβ”€β”€ hybrid_benchmark.py              # Alpha sweep
β”‚   β”œβ”€β”€ pubmed_benchmark.py              # 1,077-article corpus
β”‚   β”œβ”€β”€ pubmedqa_official_benchmark.py   # PubMedQA (ACL 2019)
β”‚   β”œβ”€β”€ scifact_benchmark.py             # SciFact (EMNLP 2020)
β”‚   β”œβ”€β”€ downstream_eval.py              # AI-judged quality
β”‚   └── export_reproducibility_kit.py    # Verifiable dataset
β”‚
└── Data
    β”œβ”€β”€ benchmark_data/                  # Results (JSON)
    └── models/                          # Trained models (.pt)

Contributing

Highest-impact contributions right now:

  1. 🌐 Cross-domain testing — Run the loss taxonomy on legal, financial, or clinical text
  2. 🔌 LangChain/LlamaIndex plugin — Make Lumisift a drop-in reranker
  3. 👩‍🔬 Human evaluation — Domain expert ratings on selected vs. full-text quality
  4. 🧠 Better utility signals — Improve the learned model's training data

```shell
git clone https://github.com/Saeedmora/Lumisift.git
git checkout -b feature/your-improvement
python information_loss_taxonomy.py   # verify results
```

License

AGPL-3.0 — Free to use, modify, distribute. Source must be shared for deployed modifications. Attribution required.

Commercial licensing available. Contact: Saeed Moradtalab


Lumisift
Standard retrieval loses 64% of scientific data. Lumisift retains 92%.
Validated on PubMedQA (n=999, ACL 2019) and SciFact (n=290, EMNLP 2020). Hybrid mode: 93% retention. Every result reproducible.
© 2026 Saeed Moradtalab
