Releases · tamnd/ccrawl-cli

The first public release. ccrawl is a single pure-Go binary that puts Common Crawl behind a tool that feels like curl: find a URL in the index, fetch the exact capture, stream whole archives, run SQL over the columnar index, and look up domain ranks. It talks to the public data on data.commoncrawl.org over plain HTTPS, so there are no credentials to set up and nothing to pay for.

What you get

Find captures. ccrawl search queries the CDX URL index for any URL or path pattern and filters by status, MIME, or language.
Fetch the exact bytes. ccrawl get and ccrawl fetch pull a single capture with an HTTP byte-range request, so a page comes back without downloading the WARC file it lives in.
Pull out the content. --text, --markdown, --links, and --headers turn a captured page into the form you actually want.
Work with whole archives. The paths, download, parse, and convert subcommands respectively list, fetch, decode, and reshape WARC, WAT, and WET files. convert writes columnar Parquet (zstd, dictionary-encoded) or JSONL.
Query the columnar index. ccrawl table builds the SQL for bulk questions across a crawl and runs it through a local duckdb binary, or prints ready-to-run SQL when DuckDB is not installed.
Look up ranks. ccrawl rank reads host and domain positions from the web graph tables.
Scan CC-NEWS. ccrawl news streams the continuous news dataset, which has no index of its own.
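
The byte-range fetch described under "Fetch the exact bytes" can be sketched in a few lines of Go. This is a minimal illustration, not ccrawl's own code: it assumes an index entry supplies the archive filename plus the record's byte offset and length, and the names rangeHeader and newCaptureRequest, as well as the example path, are hypothetical. It only builds the request; sending it and decompressing the returned record are left out.

```go
package main

import (
	"fmt"
	"net/http"
)

// rangeHeader returns the HTTP Range value selecting `length` bytes that
// start at `offset`. HTTP ranges are inclusive, so the last byte is
// offset+length-1.
func rangeHeader(offset, length int64) string {
	return fmt.Sprintf("bytes=%d-%d", offset, offset+length-1)
}

// newCaptureRequest builds (but does not send) a GET for a single record.
// filename, offset, and length are assumed to come from an index lookup.
func newCaptureRequest(filename string, offset, length int64) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, "https://data.commoncrawl.org/"+filename, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", rangeHeader(offset, length))
	return req, nil
}

func main() {
	// Hypothetical index entry, for illustration only.
	req, err := newCaptureRequest("crawl-data/EXAMPLE/warc/example.warc.gz", 1000, 250)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.URL, req.Header.Get("Range"))
}
```

A server that honors the range answers with 206 Partial Content and only those bytes, which is why a single page does not require the whole WARC file.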

Dataset library

For bulk work, --library gives the archive files a home and extends the commands you already know to list, download, and process them in place. Raw files land under ~/notes/ccrawl/<crawl>/<kind>/ and processed output beside them under <crawl>/<format>/<kind>/. Re-running download only fetches what is missing, so a corpus grows a piece at a time. Point it elsewhere with CCRAWL_LIBRARY or --library-dir.

Parsers you can import

The archive parsers live in their own packages, so you can read Common Crawl files from your own program without pulling in the rest of the tool:

import "github.com/tamnd/ccrawl-cli/pkg/warc"

err := warc.Iterate(r, func(rec warc.Record) error {
    fmt.Println(rec.Header.TargetURI)
    return nil
})

pkg/warc reads WARC records and splits an HTTP block into its headers and body. pkg/wat decodes WAT metadata (status, title, meta tags, links) and pkg/wet decodes WET plain text, both on top of pkg/warc. None of them depend on the ccrawl library or the CLI.

Install

go install github.com/tamnd/ccrawl-cli/cmd/ccrawl@latest

Or grab a prebuilt archive below. There are builds for Linux, macOS, Windows, and FreeBSD, plus Linux packages (deb, rpm, apk). The container image is on GHCR:

docker run --rm ghcr.io/tamnd/ccrawl:0.1.0 get example.com --text

The binary is pure Go with no runtime dependencies. DuckDB is optional and only used to run the columnar index queries locally; without it, ccrawl prints the SQL for you to run elsewhere.

Verify a download

Every archive ships a CycloneDX SBOM, and checksums.txt is signed with cosign (keyless). Check the checksum, then the signature:

sha256sum -c checksums.txt --ignore-missing
cosign verify-blob checksums.txt \
  --signature checksums.txt.sig --certificate checksums.txt.pem \
  --certificate-identity-regexp 'https://github.com/tamnd/ccrawl-cli.*' \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com

Full documentation is at ccrawl-cli.tamnd.com, and these notes live at the release notes page.
