
Crawl DB

A standalone database for crawl events.

Overview

Takes crawl events and turns them into Apache Parquet files, so they can be queried using tools like DuckDB and explored using tools like Apache Superset.

Requires Python >= 3.8 to avoid fastparquet bug dask/fastparquet#825

At present, there is no MrJob/Map-Reduce version, but that could be added if needed.
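The conversion itself amounts to parsing each log line into typed fields and writing them out in columnar form. As a minimal sketch of that idea using pandas and fastparquet (not the tool's actual parser or schema: the space-separated layout and the column names below are assumptions for illustration only):

import pandas as pd

rows = []
with open("crawl.log") as f:
    for line in f:
        parts = line.split()
        if len(parts) < 4:
            continue  # skip malformed lines
        rows.append({
            # Assumed layout: timestamp, status code, content length, URL.
            "timestamp": parts[0],
            "status_code": int(parts[1]),
            # '-' is assumed to mean no body was downloaded.
            "content_length": None if parts[2] == "-" else int(parts[2]),
            "url": parts[3],
        })

df = pd.DataFrame(rows)
# GZIP-compressed Parquet, readable directly by DuckDB, Superset, etc.
df.to_parquet("crawl-log.parquet", engine="fastparquet", compression="gzip")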

Example of use

During a crawl, we rapidly built up a lot of log activity: some 37,357,947 log lines in a 14.5GB file. This file was processed like this:

python -m crawldb.parquet.cli import crawl.log.cp00004-20231116123457 crawl-log-cp00004.parquet

This took about an hour, but produced a queryable file only 2.8GB in size.
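Once converted, the schema of the resulting file can be inspected directly with DuckDB, for example from Python (a small sketch, assuming the duckdb package is installed):

import duckdb

# DESCRIBE works directly against a Parquet file; no import step is needed.
print(duckdb.sql("DESCRIBE SELECT * FROM 'crawl-log-cp00004.parquet'"))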

Using the example queries (see duckdb-query.py for details), we could query this file very quickly, with even fairly intensive aggregation queries only requiring a second or so to run. Some simple examples are:

Total records

SELECT COUNT(*) FROM 'crawl-log-cp00004.parquet';

┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│     37357947 │
└──────────────┘

Breakdown by status codes

SELECT status_code, SUM(content_length) AS total_bytes, COUNT(*) AS total_records FROM 'crawl-log-cp00004.parquet' GROUP BY status_code ORDER BY COUNT(*) DESC;

┌─────────────┬──────────────┬───────────────┐
│ status_code │ total_bytes  │ total_records │
│    int64    │    int128    │     int64     │
├─────────────┼──────────────┼───────────────┤
│       -5003 │         NULL │      30199915 │
│         200 │ 561472969271 │       4868164 │
│       -9998 │         NULL │        613207 │
│         301 │    336555824 │        574581 │
│         302 │    459076182 │        344545 │
│         403 │    977360013 │        228733 │
│         404 │   8837109684 │        193986 │
│          -6 │         NULL │         87998 │
│         307 │       668014 │         66263 │
│         503 │    188275757 │         44856 │
│          ·  │           ·  │             · │
│          ·  │           ·  │             · │
│          ·  │           ·  │             · │
│         415 │          336 │             3 │
│         421 │          513 │             3 │
│         402 │        12790 │             2 │
│         417 │            0 │             2 │
│         299 │       314915 │             2 │
│         203 │        14252 │             1 │
│         524 │         7222 │             1 │
│         999 │         1530 │             1 │
│         408 │        12448 │             1 │
│          -5 │         NULL │             1 │
├─────────────┴──────────────┴───────────────┤
│ 63 rows (20 shown)               3 columns │
└────────────────────────────────────────────┘
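The same queries can also be run from Python with the duckdb package. The following is a minimal sketch along these lines, not a copy of duckdb-query.py (the file name matches the example above; everything else is illustrative):

import duckdb

con = duckdb.connect()

# Total number of crawl events in the Parquet file.
total = con.execute(
    "SELECT COUNT(*) FROM 'crawl-log-cp00004.parquet'"
).fetchone()[0]
print(f"Total records: {total}")

# Bytes and record counts per status code, largest groups first.
by_status = con.execute("""
    SELECT status_code,
           SUM(content_length) AS total_bytes,
           COUNT(*) AS total_records
    FROM 'crawl-log-cp00004.parquet'
    GROUP BY status_code
    ORDER BY total_records DESC
""").fetchdf()
print(by_status.head(10))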

Previous Designs

There have been many different versions and re-designs of this approach over the years: e.g. Monitrix on Cassandra, then on ElasticSearch, then directly using Solr, and later CockroachDB. All have been problematic because of the resources required to run these tools at the scale of UK domain crawls.

This latest design avoids a lot of that complexity by instead targeting the generation of static files in the Parquet format. This provides an efficient way to store data, while not requiring complex additional infrastructure. For modest sizes, we can just store them on local disk. For larger collections, we can store them on HDFS.

The first experiment with this showed that we could take an 11GB crawl log containing over 22 million crawl events and store it in Parquet using GZIP compression in just 3GB (136 bytes per line). This format is also much easier to work with, and on a decent machine with SSD disks, could be used to run large aggregated queries in less than a second. See duckdb-query.py for examples.