
Better Web

Terminal-first web research tool. Search the web, scrape pages, score content quality, filter out junk — get clean markdown ready for LLM consumption.


Why

Search engines increasingly return SEO spam and low-quality content. LLM-powered search tools often hallucinate or give shallow answers. A simple research question shouldn't mean 25+ open tabs just to find a few good sources.

better-web automates the entire research workflow: query a private search engine, scrape results, score quality using multiple signals (domain reputation, AI detection, readability, semantic relevance), filter out the noise, and return focused, clean markdown — all in one command.

No GPU required — runs on simple hardware. The only ML model used is a small sentence-transformer (~80MB) for relevance scoring.
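The relevance signal boils down to cosine similarity between two embedding vectors. better-web computes the embeddings with the small sentence-transformers model mentioned above; the toy vectors below are stand-ins just to show the math:

```python
# Illustration of the relevance signal: cosine similarity between a query
# embedding and a page embedding. In better-web the vectors come from a
# small sentence-transformers model; here they are hypothetical toy values.
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity in [-1, 1]; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query_vec = [0.2, 0.8, 0.1]   # hypothetical embedding of the search query
page_vec = [0.25, 0.7, 0.05]  # hypothetical embedding of a scraped page

print(round(cosine_similarity(query_vec, page_vec), 3))
```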

Prerequisites

  • Python 3.12+
  • A local SearXNG instance for search queries

Quick SearXNG setup with Docker:

docker run -d --name searxng -p 8882:8080 searxng/searxng
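You can sanity-check the instance by hitting its JSON API directly (better-web does this under the hood). A minimal sketch, assuming your SearXNG settings permit `format=json`:

```python
# Sketch: query a local SearXNG instance's JSON API. Assumes the instance
# has the JSON output format enabled in its settings.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SEARX_URL = "http://localhost:8882/search"  # matches the Docker port above

def build_query_url(query: str, base: str = SEARX_URL) -> str:
    """Return the SearXNG search URL for a query, requesting JSON output."""
    return f"{base}?{urlencode({'q': query, 'format': 'json'})}"

url = build_query_url("python packaging")
print(url)
# results = json.load(urlopen(url))  # uncomment with a running instance
```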

Setup

With Nix (recommended):

nix develop && poetry install

With pipx (isolated install):

pipx install git+https://github.com/wh1le/better-web.git
playwright install chromium

Without Nix:

pip install poetry
poetry install
playwright install chromium

Configure your SearXNG URL in config.yaml under searx_engine (default http://localhost:8882/search).
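A minimal excerpt might look like this; note that only `searx_engine` and `min_quality_score` are named in this README, so treat any other keys as placeholders for whatever your installed version actually exposes:

```yaml
# Hypothetical config.yaml excerpt -- key names beyond searx_engine and
# min_quality_score are not confirmed by this README.
searx_engine: http://localhost:8882/search
min_quality_score: 30   # pages scoring below this are filtered out
```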

Usage

bw search "query"                     # search + scrape + score + copy
bw search "q1" "q2" --limit 20       # multi-query batch
bw search --quick "query"             # snippets only, no scraping
bw scrape "https://example.com"       # single URL to stdout
bw digest --raw                       # re-export latest research
bw preview                            # render page as clean markdown
bw update-blocklist                   # refresh domain blocklists
bin/explore                           # fzf picker -> preview in editor
bin/agent                             # fzf picker -> copy/claude

Scoring

Every page receives a score from 0-100 based on:

| Signal | Tool | What |
| --- | --- | --- |
| Domain reputation | tranco | Top-1M ranking, boost only (unranked = neutral) |
| Domain heuristics | tldextract | Junk TLDs, hyphen stuffing, SEO keywords, year in name |
| AI detection | zippy | Compression-based, no ML models, no API keys |
| Readability | textstat | Flesch Reading Ease, grade level |
| Relevance | sentence-transformers | Cosine similarity between query and content |
| HTML structure | built-in | Code blocks, comments, link density, nav ratio, ad scripts |
| Text heuristics | built-in | Keyword stuffing, repetitive bigrams, slop phrases, thin content |
| Content dedup | datasketch | MinHash LSH, removes near-duplicate pages |

Pages below min_quality_score (default 30) are filtered out. Remaining pages are sorted best-first and tier-labeled (HIGH/MED/LOW).
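The filtering step can be sketched as follows. The floor of 30 comes from the README; the HIGH/MED tier cutoffs (70/50) are illustrative guesses, not confirmed values:

```python
# Sketch of the final step: drop pages below min_quality_score, sort
# best-first, and attach a tier label. The 70/50 tier cutoffs are
# hypothetical; only the default floor of 30 is documented.
MIN_QUALITY_SCORE = 30

def tier(score: int) -> str:
    if score >= 70:
        return "HIGH"
    if score >= 50:
        return "MED"
    return "LOW"

def filter_and_rank(pages: dict[str, int]) -> list[tuple[str, int, str]]:
    kept = [(url, s) for url, s in pages.items() if s >= MIN_QUALITY_SCORE]
    kept.sort(key=lambda p: p[1], reverse=True)
    return [(url, s, tier(s)) for url, s in kept]

pages = {"a.com": 82, "b.net": 55, "c.biz": 12}
print(filter_and_rank(pages))
# [('a.com', 82, 'HIGH'), ('b.net', 55, 'MED')]
```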

Config

config.yaml — SearXNG URL, scrape timing, quality thresholds, blocklist sources. Static lists (TLDs, blocked domains, AI phrases) live in data/*.txt.

TODO

  • Support XDG configuration path at ~/.config/bw

License

MIT
