
Better Web

Terminal-first web research tool. Search the web, scrape pages, score content quality, filter out junk — get clean markdown ready for LLM consumption.


Why

Search engines increasingly return SEO spam and low-quality content. LLM-powered search tools often hallucinate or give shallow answers. A simple research question shouldn't mean 25+ open tabs just to find a few good sources.

better-web automates the entire research workflow: query a private search engine, scrape results, score quality using multiple signals (domain reputation, AI detection, readability, semantic relevance), filter out the noise, and return focused, clean markdown — all in one command.

No GPU required — runs on simple hardware. The only ML model used is a small sentence-transformer (~80MB) for relevance scoring.
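The relevance signal boils down to cosine similarity between two embedding vectors. better-web computes the embeddings with the small sentence-transformers model mentioned above; the toy vectors below are stand-ins just to show the math:

```python
# Illustration of the relevance signal: cosine similarity between a query
# embedding and a page embedding. In better-web the vectors come from a
# small sentence-transformers model; here they are hypothetical toy values.
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity in [-1, 1]; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query_vec = [0.2, 0.8, 0.1]   # hypothetical embedding of the search query
page_vec = [0.25, 0.7, 0.05]  # hypothetical embedding of a scraped page

print(round(cosine_similarity(query_vec, page_vec), 3))
```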

Prerequisites

  • Python 3.12+
  • A local SearXNG instance for search queries

Quick SearXNG setup with Docker:

docker run -d --name searxng -p 8882:8080 searxng/searxng
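You can sanity-check the instance by hitting its JSON API directly (better-web does this under the hood). A minimal sketch, assuming your SearXNG settings permit `format=json`:

```python
# Sketch: query a local SearXNG instance's JSON API. Assumes the instance
# has the JSON output format enabled in its settings.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SEARX_URL = "http://localhost:8882/search"  # matches the Docker port above

def build_query_url(query: str, base: str = SEARX_URL) -> str:
    """Return the SearXNG search URL for a query, requesting JSON output."""
    return f"{base}?{urlencode({'q': query, 'format': 'json'})}"

url = build_query_url("python packaging")
print(url)
# results = json.load(urlopen(url))  # uncomment with a running instance
```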

Setup

With Nix (recommended):

nix develop && poetry install

With pipx (isolated install):

pipx install git+https://github.com/wh1le/better-web.git
playwright install chromium

Without Nix:

pip install poetry
poetry install
playwright install chromium

Configure your SearXNG URL in config.yaml under searx_engine (default http://localhost:8882/search).
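A minimal excerpt might look like this; note that only `searx_engine` and `min_quality_score` are named in this README, so treat any other keys as placeholders for whatever your installed version actually exposes:

```yaml
# Hypothetical config.yaml excerpt -- key names beyond searx_engine and
# min_quality_score are not confirmed by this README.
searx_engine: http://localhost:8882/search
min_quality_score: 30   # pages scoring below this are filtered out
```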

Usage

bw search "query"                     # search + scrape + score + copy
bw search "q1" "q2" --limit 20       # multi-query batch
bw search --quick "query"             # snippets only, no scraping
bw scrape "https://example.com"       # single URL to stdout
bw digest --raw                       # re-export latest research
bw preview                            # render page as clean markdown
bw update-blocklist                   # refresh domain blocklists
bin/explore                           # fzf picker -> preview in editor
bin/agent                             # fzf picker -> copy/claude

Scoring

Every page receives a score from 0-100 based on:

| Signal | Tool | What |
| --- | --- | --- |
| Domain reputation | tranco | Top-1M ranking, boost only (unranked = neutral) |
| Domain heuristics | tldextract | Junk TLDs, hyphen stuffing, SEO keywords, year in name |
| AI detection | zippy | Compression-based, no ML models, no API keys |
| Readability | textstat | Flesch Reading Ease, grade level |
| Relevance | sentence-transformers | Cosine similarity between query and content |
| HTML structure | built-in | Code blocks, comments, link density, nav ratio, ad scripts |
| Text heuristics | built-in | Keyword stuffing, repetitive bigrams, slop phrases, thin content |
| Content dedup | datasketch | MinHash LSH, removes near-duplicate pages |

Pages below min_quality_score (default 30) are filtered out. Remaining pages are sorted best-first and tier-labeled (HIGH/MED/LOW).
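The filtering step can be sketched as follows. The floor of 30 comes from the README; the HIGH/MED tier cutoffs (70/50) are illustrative guesses, not confirmed values:

```python
# Sketch of the final step: drop pages below min_quality_score, sort
# best-first, and attach a tier label. The 70/50 tier cutoffs are
# hypothetical; only the default floor of 30 is documented.
MIN_QUALITY_SCORE = 30

def tier(score: int) -> str:
    if score >= 70:
        return "HIGH"
    if score >= 50:
        return "MED"
    return "LOW"

def filter_and_rank(pages: dict[str, int]) -> list[tuple[str, int, str]]:
    kept = [(url, s) for url, s in pages.items() if s >= MIN_QUALITY_SCORE]
    kept.sort(key=lambda p: p[1], reverse=True)
    return [(url, s, tier(s)) for url, s in kept]

pages = {"a.com": 82, "b.net": 55, "c.biz": 12}
print(filter_and_rank(pages))
# [('a.com', 82, 'HIGH'), ('b.net', 55, 'MED')]
```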

Config

config.yaml — SearXNG URL, scrape timing, quality thresholds, blocklist sources. Static lists (TLDs, blocked domains, AI phrases) live in data/*.txt.

TODO

  • Support XDG configuration path at ~/.config/bw

License

MIT
