Releases: tamnd/ccrawl-cli
v0.1.0
The first public release. ccrawl is a single pure-Go binary that puts Common Crawl behind a tool that feels like curl: find a URL in the index, fetch the exact capture, stream whole archives, run SQL over the columnar index, and look up domain ranks. It talks to the public data on data.commoncrawl.org over plain HTTPS, so there are no credentials to set up and nothing to pay for.
What you get
- Find captures. ccrawl search queries the CDX URL index for any URL or path pattern and filters by status, MIME, or language.
- Fetch the exact bytes. ccrawl get and ccrawl fetch pull a single capture with an HTTP byte-range request, so a page comes back without downloading the WARC file it lives in.
- Pull out the content. --text, --markdown, --links, and --headers turn a captured page into the form you actually want.
- Work with whole archives. ccrawl paths, download, parse, and convert list, fetch, decode, and reshape WARC, WAT, and WET files. convert writes columnar Parquet (zstd, dictionary-encoded) or JSONL.
- Query the columnar index. ccrawl table builds the SQL for bulk questions across a crawl and runs it through a local duckdb binary, or prints ready-to-run SQL when DuckDB is not installed.
- Look up ranks. ccrawl rank reads host and domain positions from the web graph tables.
- Scan CC-NEWS. ccrawl news streams the continuous news dataset, which has no index of its own.
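To make the byte-range idea behind ccrawl get and ccrawl fetch concrete, here is a minimal Go sketch of how such a request can be built. It is not ccrawl's own code: captureRequest, its parameters, and the placeholder URL are assumptions for illustration. It only relies on the standard HTTP Range header, where a record's offset and length (as an index lookup would report them) become an inclusive byte range.

```go
package main

import (
	"fmt"
	"net/http"
)

// captureRequest builds a GET request for one record inside a larger archive
// file. offset and length are the record's byte position and size
// (hypothetical parameter names, not ccrawl's API).
func captureRequest(fileURL string, offset, length int64) (*http.Request, error) {
	if offset < 0 || length <= 0 {
		return nil, fmt.Errorf("invalid range: offset=%d length=%d", offset, length)
	}
	req, err := http.NewRequest(http.MethodGet, fileURL, nil)
	if err != nil {
		return nil, err
	}
	// HTTP ranges are inclusive on both ends, so the last byte is offset+length-1.
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", offset, offset+length-1))
	return req, nil
}

func main() {
	// Placeholder URL and numbers, not a real capture.
	req, err := captureRequest("https://data.commoncrawl.org/path/to/file.warc.gz", 1000, 500)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Header.Get("Range")) // prints "bytes=1000-1499"
}
```

A server that honors the header answers with 206 Partial Content and only the requested bytes, which is why a single page does not require the whole WARC file.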
Dataset library
For bulk work, --library gives the archive files a home and extends the commands you already know to list, download, and process them in place. Raw files land under ~/notes/ccrawl/<crawl>/<kind>/ and processed output beside them under <crawl>/<format>/<kind>/. Re-running download only fetches what is missing, so a corpus grows a piece at a time. Point it elsewhere with CCRAWL_LIBRARY or --library-dir.
Parsers you can import
The archive parsers live in their own packages, so you can read Common Crawl files from your own program without pulling in the rest of the tool:
import "github.com/tamnd/ccrawl-cli/pkg/warc"

err := warc.Iterate(r, func(rec warc.Record) error {
	fmt.Println(rec.Header.TargetURI)
	return nil
})
pkg/warc reads WARC records and splits an HTTP block into its headers and body. pkg/wat decodes WAT metadata (status, title, meta tags, links) and pkg/wet decodes WET plain text, both on top of pkg/warc. None of them depend on the ccrawl library or the CLI.
Install
go install github.com/tamnd/ccrawl-cli/cmd/ccrawl@latest
Or grab a prebuilt archive below. There are builds for Linux, macOS, Windows, and FreeBSD, plus Linux packages (deb, rpm, apk). The container image is on GHCR:
docker run --rm ghcr.io/tamnd/ccrawl:0.1.0 get example.com --text
The binary is pure Go with no runtime dependencies. DuckDB is optional and only used to run the columnar index queries locally; without it, ccrawl prints the SQL for you to run elsewhere.
Verify a download
Every archive ships a CycloneDX SBOM, and checksums.txt is signed with cosign (keyless). Check the checksum, then the signature:
sha256sum -c checksums.txt --ignore-missing
cosign verify-blob checksums.txt \
  --signature checksums.txt.sig --certificate checksums.txt.pem \
  --certificate-identity-regexp 'https://github.com/tamnd/ccrawl-cli.*' \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com
Full documentation is at ccrawl-cli.tamnd.com, and these notes live at the release notes page.