Qualify scraped leads, then measure the qualifier. Point it at a noisy scraped list and it returns kept vs dropped, each lead carrying a 0-100 relevance score and a short reason. Then it does the part most lead-gen tooling skips: it tells you how good the qualifier actually is, with precision, recall, f1, and accuracy against a labeled set.
A scraper hands you everything that matched a search. The business question is narrower: which of these are real, relevant leads? A page that says "veneers all inclusive" might be a cosmetic dentist, or it might be selling toothpaste. lead-qualifier is the small, reliable layer that makes that call, explains it, and lets you prove on real data that it is making the call well.
Zero runtime dependencies. The whole core runs on the Python standard library. The LLM is injected as a plain Callable[[str], str], so the package never imports any provider SDK.
flowchart LR
S["Scraped leads<br/>json / csv / list"] --> Q{Scorer}
C[["Criteria<br/>include · exclude<br/>required · threshold"]] -.-> Q
Q -->|"RuleScorer<br/>(deterministic)"| V[Verdicts]
Q -->|"LLMScorer<br/>(injected callable)"| V
V --> K["Kept<br/>score · reason"]
V --> D["Dropped<br/>score · reason"]
L[["Labeled set<br/>gold keep/drop"]] -.-> E
Q --> E{"evaluate()"}
E --> R["precision · recall<br/>f1 · accuracy"]
Every lead gets exactly one Verdict: a clamped 0-100 score, a keep/drop decision, the criteria terms that fired, and a one-line reason. The same criteria that drive qualification also drive the eval harness, so the metrics measure the thing you actually run in production, not a proxy.
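To make the shape of a verdict concrete, here is a minimal sketch of a record that clamps its score when it is constructed. The class below is a hypothetical stand-in: the field names (score, keep, matched, reason) follow the description above, but the package's real Verdict may be defined differently.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    """Hypothetical stand-in for the package's Verdict; field names are assumed."""

    score: int
    keep: bool
    matched: list[str] = field(default_factory=list)
    reason: str = ""

    def __post_init__(self) -> None:
        # Clamp on construction so no out-of-range score can leave this object.
        self.score = max(0, min(100, int(self.score)))


runaway = Verdict(score=250, keep=True, matched=["veneers"], reason="model overshot")
print(runaway.score)  # 100
```

Clamping in the constructor, rather than in each scorer, means every code path that produces a verdict gets the same guarantee.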
git clone https://github.com/vinimabreu/lead-qualifier
cd lead-qualifier
make install # creates .venv and installs the package + dev tools
make demo # runs the offline demo below
The demo has two acts. First it qualifies a noisy list of cosmetic-dentistry leads that includes traps a naive keyword filter gets wrong. Then it measures the qualifier against a labeled set with harder cases, so the metrics are honest instead of a suspicious 1.0:
=== Act 1: qualify a noisy scraped list ===
5 kept, 5 dropped (of 10) against 'cosmetic-dentistry'
KEPT:
+ [100] Bright Smile Dental - Porcelain Veneers & Cosmetic Dentistry
matched veneers, cosmetic, dentist, smile; score 100 vs threshold 50 -> kept.
+ [100] Lakeside Implant & Cosmetic Dentistry Clinic
matched veneers, cosmetic, dentist, implant, smile; score 100 vs threshold 50 -> kept.
+ [100] Downtown Smile Studio - Veneers, Implants, Invisalign
matched veneers, cosmetic, dentist, implant, smile; score 100 vs threshold 50 -> kept.
+ [ 75] Family Dental Care of Riverton
matched cosmetic, dentist, smile; score 75 vs threshold 50 -> kept.
+ [100] Premier Veneers of Hill Country
matched veneers, cosmetic, dentist, implant; score 100 vs threshold 50 -> kept.
DROPPED:
- [ 0] WhiteGlow Veneers Whitening Toothpaste, 6-Pack
Hard drop: matched excluded term(s): toothpaste, amazon.
- [ 0] Veneers (dental restoration) - Wikipedia
Hard drop: matched excluded term(s): wikipedia.
- [ 0] Now Hiring: Cosmetic Dentist - Full Time Vacancy
Hard drop: matched excluded term(s): job, vacancy.
- [ 25] 10 Foods That Naturally Whiten Your Smile
matched smile; score 25 vs threshold 50 -> dropped.
- [ 0] Local Plumber - 24/7 Emergency Service
no include terms matched; score 0 vs threshold 50 -> dropped.
=== Act 2: measure the qualifier ===
eval report (rule)
examples: 12
confusion: tp=6 fp=1 tn=4 fn=1
precision: 0.857
recall: 0.857
f1: 0.857
accuracy: 0.833
No network, no API keys.
# qualify a scrape into kept / dropped, with a reason per lead
lead-qualifier qualify --source json:leads.json --criteria criteria.json --out kept.json
# measure the qualifier against a labeled set
lead-qualifier eval --labeled labeled.json --criteria criteria.json
- --source is json:PATH or csv:PATH. JSON may be an array or an object with a leads key.
- --criteria is the JSON spec below.
- --out writes kept leads plus the full per-lead verdict list as JSON.
- qualify and eval exit 0 on success and 2 on a usage or input error.
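The source formats described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the package's own loader: load_leads is a hypothetical name, and the error messages are invented for the example.

```python
import csv
import json
from pathlib import Path


def load_leads(source: str) -> list[dict]:
    """Hypothetical loader for 'json:PATH' / 'csv:PATH' specs (not the package's code)."""
    kind, sep, path = source.partition(":")
    if not sep or not path:
        raise ValueError(f"expected json:PATH or csv:PATH, got {source!r}")
    if kind == "json":
        data = json.loads(Path(path).read_text(encoding="utf-8"))
        # Accept either a bare array or an object with a "leads" key.
        if isinstance(data, dict):
            data = data.get("leads")
        if not isinstance(data, list):
            raise ValueError("JSON source must be an array or an object with a 'leads' key")
        return data
    if kind == "csv":
        # Each CSV row becomes a dict keyed by the header row.
        with open(path, newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))
    raise ValueError(f"unsupported source kind: {kind!r}")
```

Splitting on the first colon only keeps paths that themselves contain a colon intact.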
A criteria file is small and declarative:
{
"name": "cosmetic-dentistry",
"description": "real cosmetic dentistry practices, not products or articles",
"include_keywords": ["veneers", "cosmetic", "dentist", "implant", "smile"],
"exclude_keywords": ["toothpaste", "amazon", "wikipedia", "job", "vacancy"],
"text_fields": ["title", "snippet"],
"required_fields": ["url"],
"threshold": 50
}
from lead_qualifier import RuleScorer, load_criteria, qualify
criteria = load_criteria("criteria.json")
leads = [
{"title": "Bright Smile - Veneers & Cosmetic Dentistry", "snippet": "...", "url": "https://a.example"},
{"title": "WhiteGlow Veneers Toothpaste", "snippet": "buy on amazon", "url": "https://b.example"},
]
result = qualify(leads, criteria, RuleScorer())
print(result.summary()) # {'total': 2, 'kept': 1, 'dropped': 1}
for verdict in result.verdicts: # aligned with input order
    print(verdict.score, verdict.keep, verdict.reason)
The RuleScorer is deterministic and key-free, which is exactly why it is the default and the baseline you measure against.
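To show what deterministic rule scoring can look like, here is a rough sketch that reproduces the scores in the demo output. Everything specific in it is an assumption: rule_score is a hypothetical function, not the package's RuleScorer. The 25 points per include hit, the case-insensitive substring matching, and the score >= threshold comparison are guesses that happen to agree with the demo. The missing-required-field penalty is left out because its size is not documented here.

```python
def rule_score(lead: dict, criteria: dict, points_per_hit: int = 25) -> dict:
    """Illustrative only; points_per_hit=25 is an assumption, not the package's formula."""
    # Join the configured text fields and lowercase once for case-insensitive matching.
    text = " ".join(str(lead.get(name, "")) for name in criteria["text_fields"]).lower()

    # Any exclude hit is a hard drop, no matter how many include terms also match.
    excluded = [term for term in criteria["exclude_keywords"] if term.lower() in text]
    if excluded:
        return {
            "score": 0,
            "keep": False,
            "matched": excluded,
            "reason": f"Hard drop: matched excluded term(s): {', '.join(excluded)}.",
        }

    matched = [term for term in criteria["include_keywords"] if term.lower() in text]
    score = max(0, min(100, points_per_hit * len(matched)))  # clamp to 0-100
    keep = score >= criteria["threshold"]  # assumed: boundary counts as kept
    return {"score": score, "keep": keep, "matched": matched, "reason": f"score {score}"}


criteria = {
    "include_keywords": ["veneers", "cosmetic", "dentist", "implant", "smile"],
    "exclude_keywords": ["toothpaste", "amazon", "wikipedia", "job", "vacancy"],
    "text_fields": ["title", "snippet"],
    "threshold": 50,
}
trap = {"title": "WhiteGlow Veneers Toothpaste", "snippet": "buy on amazon"}
print(rule_score(trap, criteria)["keep"])  # False
```

Checking exclusions before counting include hits is what keeps the toothpaste listing out even though it contains "veneers".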
The LLM is just a function from prompt to text. Adapt whatever client you already use to that signature and inject it; the package never imports a provider library, so it adds nothing to your dependency tree:
from lead_qualifier import LLMScorer, load_criteria, qualify
def my_llm(prompt: str) -> str:
    # call your own client here; return its raw text response
    # (client is a placeholder for whatever SDK object you already use)
    return client.complete(prompt)
scorer = LLMScorer(my_llm)
result = qualify(leads, load_criteria("criteria.json"), scorer)
LLMScorer builds the prompt, parses a strict JSON verdict out of the response (even when the model wraps it in prose or a code fence), clamps the score to 0-100, and degrades gracefully: if the call raises or the output cannot be parsed, it returns a neutral below-threshold verdict whose reason records the fallback instead of crashing the batch.
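One way to get the parse-and-fall-back behavior described above is to scan for the first decodable JSON object with the standard library's JSONDecoder.raw_decode, which also copes with braces inside strings. This is a sketch, not LLMScorer's actual implementation. The names extract_json_object and safe_verdict are hypothetical, and the fallback score of 0 and deriving keep from the threshold are assumptions.

```python
import json
from typing import Callable, Optional


def extract_json_object(text: str) -> Optional[dict]:
    """Return the first JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pass  # this brace did not open a valid object; try the next one
        else:
            if isinstance(obj, dict):
                return obj
        start = text.find("{", start + 1)
    return None


def safe_verdict(llm: Callable[[str], str], prompt: str, threshold: int = 50) -> dict:
    """Hypothetical wrapper: never raises, always returns a verdict-shaped dict."""

    def fallback(why: str) -> dict:
        # Assumed fallback shape: score 0, dropped, reason records what went wrong.
        return {"score": 0, "keep": False, "reason": f"fallback: {why}"}

    try:
        raw = llm(prompt)
    except Exception as exc:  # one bad call must not sink the batch
        return fallback(f"LLM call failed ({exc})")
    parsed = extract_json_object(raw)
    if parsed is None or "score" not in parsed:
        return fallback("no parseable verdict in response")
    try:
        score = int(float(parsed["score"]))
    except (TypeError, ValueError, OverflowError):
        return fallback("score was not a number")
    score = max(0, min(100, score))  # clamp runaway values
    return {"score": score, "keep": score >= threshold, "reason": str(parsed.get("reason", ""))}


reply = 'Sure, here is my verdict: {"score": 250, "reason": "cosmetic practice"} Hope that helps!'
print(safe_verdict(lambda prompt: reply, "..."))
# {'score': 100, 'keep': True, 'reason': 'cosmetic practice'}
```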
This is the part that makes the tool trustworthy. Pass evaluate() a labeled set (each lead paired with its gold keep/drop boolean) and it returns the full confusion matrix and the four metrics that matter:
from lead_qualifier import RuleScorer, evaluate, load_criteria
labeled = [
({"title": "Bright Smile - Veneers", "url": "https://a.example"}, True),
({"title": "WhiteGlow Veneers Toothpaste", "url": "https://b.example"}, False),
]
report = evaluate(RuleScorer(), labeled, load_criteria("criteria.json"))
print(report.to_dict())
# {'tp': 1, 'fp': 0, 'tn': 1, 'fn': 0, 'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'accuracy': 1.0}
Now you can compare a rule scorer against an LLM scorer, or tune a threshold, with numbers instead of a hunch. A "positive" is a lead the scorer keeps. Precision answers "of the leads we kept, how many were real?" Recall answers "of the real leads, how many did we keep?"
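The four metrics follow directly from the confusion counts. The sketch below uses the standard definitions and checks them against the demo's Act 2 counts. The function name is hypothetical, and returning 0.0 when a denominator is zero is an assumption about edge-case handling, not documented package behavior.

```python
def metrics_from_counts(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard precision/recall/f1/accuracy from confusion-matrix counts."""

    def ratio(numerator: float, denominator: float) -> float:
        # Assumed convention: an undefined ratio is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)  # of the leads kept, how many were real
    recall = ratio(tp, tp + fn)     # of the real leads, how many were kept
    f1 = ratio(2 * precision * recall, precision + recall)
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Counts from the demo's Act 2 report: tp=6 fp=1 tn=4 fn=1
report = metrics_from_counts(tp=6, fp=1, tn=4, fn=1)
print({name: round(value, 3) for name, value in report.items()})
# {'precision': 0.857, 'recall': 0.857, 'f1': 0.857, 'accuracy': 0.833}
```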
cron
0 7 * * * cd /app && lead-qualifier qualify --source json:/data/scrape.json \
--criteria /data/criteria.json --out /data/kept.json >> /var/log/qualifier.log 2>&1
Docker
docker build -t lead-qualifier .
docker run --rm -v "$PWD/data:/data" lead-qualifier qualify \
--source json:/data/scrape.json --criteria /data/criteria.json --out /data/kept.json
- Two scorers, one interface. Anything with score(lead, criteria) -> Verdict is a Scorer. The deterministic RuleScorer is the baseline; the LLMScorer is the upgrade. You can measure both against the same labeled set with the same harness.
- The toothpaste guard. Any exclude-keyword hit is a hard drop (keep=False, score=0), regardless of how many include words also matched. This is the exact case a naive if "veneers" in text filter gets wrong.
- Scores are clamped. Every score is forced into 0-100 on construction, so an LLM returning score: 250 can never leak a runaway value downstream.
- Graceful LLM fallback. A failed call or unparseable response yields a neutral, below-threshold verdict, not an exception, so one bad response does not sink a whole batch.
- Order is preserved. qualify keeps input order within both the kept and dropped lists, and the verdict list is aligned one-to-one with the input.
- No hidden dependencies. The LLM is injected, never imported, so the package installs with zero runtime dependencies and works fully offline in the demo and tests.
- Measure, do not assume. The eval harness is a first-class part of the package, not an afterthought, because a qualifier you cannot measure is a qualifier you cannot trust.
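The "one interface" and "order is preserved" points above can be sketched with typing.Protocol. Everything here is illustrative: the Verdict tuple, the Scorer protocol, the toy ContainsTermScorer, and the partition helper are hypothetical names, and criteria is a plain dict for brevity, where the package presumably uses its own criteria type.

```python
from typing import NamedTuple, Protocol


class Verdict(NamedTuple):
    score: int
    keep: bool
    reason: str


class Scorer(Protocol):
    """Structural type: any object with a matching score() method qualifies."""

    def score(self, lead: dict, criteria: dict) -> Verdict: ...


class ContainsTermScorer:
    """Toy scorer for illustration: keeps a lead whose title contains criteria['term']."""

    def score(self, lead: dict, criteria: dict) -> Verdict:
        hit = criteria["term"] in lead.get("title", "").lower()
        return Verdict(100 if hit else 0, hit, "term found" if hit else "term missing")


def partition(leads: list, criteria: dict, scorer: Scorer):
    # One verdict per lead, in input order; kept and dropped inherit that order.
    verdicts = [scorer.score(lead, criteria) for lead in leads]
    kept = [lead for lead, verdict in zip(leads, verdicts) if verdict.keep]
    dropped = [lead for lead, verdict in zip(leads, verdicts) if not verdict.keep]
    return kept, dropped, verdicts


leads = [{"title": "A Veneers"}, {"title": "B Plumber"}, {"title": "C Veneers"}]
kept, dropped, verdicts = partition(leads, {"term": "veneers"}, ContainsTermScorer())
print([lead["title"] for lead in kept])  # ['A Veneers', 'C Veneers']
```

Because the protocol is structural, a rule-based scorer and an LLM-backed scorer can be swapped into the same partition or eval call without sharing a base class.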
make test # pytest
make lint # ruff
The suite covers the rule scorer (include hits, the exclude hard-drop, the threshold boundary, the missing-required-field penalty, the matched list, determinism), prompt parsing (clean JSON, JSON embedded in prose or a code fence, braces inside strings, graceful fallback), the LLM scorer with a fake injected callable, the pipeline (kept/dropped split, order preservation, summary counts), the eval harness (hand-built confusion matrices with exact precision/recall/f1/accuracy), criteria loading, and the CLI.
MIT. See LICENSE.
Built by Vinicius Pereira (github.com/vinimabreu)