Terminal-first web research tool. Search the web, scrape pages, score content quality, filter out junk — get clean markdown ready for LLM consumption.
Search engines increasingly return SEO spam and low-quality content. LLM-powered search tools often hallucinate or give shallow answers. A simple research question shouldn't mean 25+ open tabs just to find a few good sources.
better-web automates the entire research workflow: query a private search engine, scrape results, score quality using multiple signals (domain reputation, AI detection, readability, semantic relevance), filter out the noise, and return focused, clean markdown — all in one command.
No GPU required — runs on simple hardware. The only ML model used is a small sentence-transformer (~80MB) for relevance scoring.
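The relevance signal is a cosine similarity between the query embedding and the page embedding. A stdlib-only sketch of that metric (the real tool gets its vectors from the sentence-transformer; the 3-dimensional vectors below are toy values):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" -- real sentence-transformer vectors have hundreds of dimensions.
query_vec = [1.0, 0.0, 1.0]
page_vec = [1.0, 0.0, 1.0]   # same direction as the query
other_vec = [0.0, 1.0, 0.0]  # orthogonal to the query

print(round(cosine_similarity(query_vec, page_vec), 4))   # → 1.0
print(round(cosine_similarity(query_vec, other_vec), 4))  # → 0.0
```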
- Python 3.12+
- A local SearXNG instance for search queries
Quick SearXNG setup with Docker:
```
docker run -d --name searxng -p 8882:8080 searxng/searxng
```

With Nix (recommended):

```
nix develop && poetry install
```

With pipx (isolated install):

```
pipx install git+https://github.com/wh1le/better-web.git
playwright install chromium
```

Without Nix:

```
pip install poetry
poetry install
playwright install chromium
```

Configure your SearXNG URL in `config.yaml` under `searx_engine` (default `http://localhost:8882/search`).
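For reference, a minimal `config.yaml` covering the two keys this README names might look like the following (the full key layout isn't documented here, so treat this as a sketch):

```yaml
searx_engine: http://localhost:8882/search  # URL of your SearXNG instance
min_quality_score: 30                       # pages scoring below this are dropped
```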
```
bw search "query"                # search + scrape + score + copy
bw search "q1" "q2" --limit 20   # multi-query batch
bw search --quick "query"        # snippets only, no scraping
bw scrape "https://example.com"  # single URL to stdout
bw digest --raw                  # re-export latest research
bw preview                       # render page as clean markdown
bw update-blocklist              # refresh domain blocklists
bin/explore                      # fzf picker -> preview in editor
bin/agent                        # fzf picker -> copy/claude
```

Every page gets a quality score from 0-100 based on:
| Signal | Tool | What it checks |
|---|---|---|
| Domain reputation | tranco | Top-1M ranking, boost only (unranked = neutral) |
| Domain heuristics | tldextract | Junk TLDs, hyphen stuffing, SEO keywords, year in name |
| AI detection | zippy | Compression-based, no ML models, no API keys |
| Readability | textstat | Flesch Reading Ease, grade level |
| Relevance | sentence-transformers | Cosine similarity between query and content |
| HTML structure | built-in | Code blocks, comments, link density, nav ratio, ad scripts |
| Text heuristics | built-in | Keyword stuffing, repetitive bigrams, slop phrases, thin content |
| Content dedup | datasketch | MinHash LSH, removes near-duplicate pages |
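The compression-based AI detection used by zippy rests on a simple observation: formulaic, repetitive text compresses much better than varied human prose. This is not zippy's implementation, just a stdlib sketch of the underlying idea (the example strings and any threshold you'd pick are illustrative):

```python
import zlib

def compression_ratio(text: str) -> float:
    """Compressed size divided by original size; lower = more repetitive/formulaic."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, level=9)) / len(raw)

varied = ("The quick brown fox jumps over the lazy dog while seventeen "
          "jazz musicians improvise in a crowded basement club.")
repetitive = "Great product. Highly recommend. " * 20

# Repetitive boilerplate compresses far better than varied prose,
# so its ratio is markedly lower -- a signal worth penalizing.
print(round(compression_ratio(varied), 2))
print(round(compression_ratio(repetitive), 2))
```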
Pages below `min_quality_score` (default 30) are filtered out. Remaining pages are sorted best-first and tier-labeled (HIGH/MED/LOW).
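In code terms, the filter-sort-label step looks roughly like this sketch (the 70/50 tier cutoffs are assumptions; the README only documents the 30 floor):

```python
MIN_QUALITY_SCORE = 30  # config default noted above

def rank_pages(pages: dict[str, float]) -> list[tuple[str, float, str]]:
    """Drop pages under the threshold, sort best-first, attach a tier label."""
    def tier(score: float) -> str:
        if score >= 70:  # assumed HIGH cutoff
            return "HIGH"
        if score >= 50:  # assumed MED cutoff
            return "MED"
        return "LOW"

    kept = [(url, score) for url, score in pages.items() if score >= MIN_QUALITY_SCORE]
    kept.sort(key=lambda p: p[1], reverse=True)
    return [(url, score, tier(score)) for url, score in kept]

print(rank_pages({"a.example": 82.0, "b.example": 55.0, "c.example": 12.0}))
# → [('a.example', 82.0, 'HIGH'), ('b.example', 55.0, 'MED')]
```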
`config.yaml` — SearXNG URL, scrape timing, quality thresholds, blocklist sources. Static lists (TLDs, blocked domains, AI phrases) live in `data/*.txt`.
- Support XDG configuration path at `~/.config/bw`
MIT
