Fast crawler/bot detection from User-Agent strings.

- Sub-140 ns cold, ~5 ns warm
- Heuristic `bool` API
- Optional Crawlerdex info lookup
```toml
[dependencies]
iscrawl = "1.2"
```

Database metadata:
```toml
[dependencies]
iscrawl = { version = "1.2", features = ["database"] }
```

```rust
use iscrawl::is_crawler;

assert!(is_crawler("Googlebot/2.1 (+http://www.google.com/bot.html)"));
assert!(!is_crawler(
    "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"
));
```

The default build has no dependencies. Use `crawler_info` with the `database` feature.
`crawler_info(user_agent)` returns `Option<&'static CrawlerInfo>` from the
bundled Crawlerdex DB. Matching is separate from `is_crawler`.
```rust
use iscrawl::crawler_info;

assert_eq!(
    crawler_info("Googlebot/2.1").unwrap().description,
    "Google's main web crawling bot for search indexing"
);
```

Update the DB:
```sh
curl -fsSL https://github.com/tn3w/Crawlerdex/releases/latest/download/crawlers.min.json \
  -o crawlers.min.json
```

`is_crawler`:

- Stack buffer of 512 bytes, no heap.
- First-byte lookup table prunes 99% of needle scans.
- Single pass over the lowered input.
- Thread-local 256-slot direct-mapped cache keyed by pointer/length with edge-word guards.
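The thread-local cache can be sketched roughly as follows. This is an illustrative reimplementation under assumed mechanics, not the crate's code; in particular it omits the edge-word guards that the real cache uses to catch a different string reusing the same address and length.

```rust
use std::cell::RefCell;

// Hypothetical sketch of a 256-slot, direct-mapped, thread-local cache
// keyed by the input's pointer + length. Not the crate's actual code,
// and the edge-word guards are deliberately left out for brevity.
#[derive(Clone, Copy)]
struct Slot {
    ptr: usize,
    len: usize,
    result: bool,
}

thread_local! {
    static CACHE: RefCell<[Option<Slot>; 256]> = RefCell::new([None; 256]);
}

fn cached_classify(ua: &str, classify: impl Fn(&str) -> bool) -> bool {
    let (ptr, len) = (ua.as_ptr() as usize, ua.len());
    // Direct-mapped: each key maps to exactly one slot, so a lookup is
    // one index computation plus one compare -- no probing, no eviction list.
    let idx = ((ptr >> 4) ^ len) & 0xFF;
    CACHE.with(|c| {
        let mut cache = c.borrow_mut();
        if let Some(s) = cache[idx] {
            if s.ptr == ptr && s.len == len {
                return s.result; // warm hit: skip the string scan entirely
            }
        }
        let result = classify(ua); // cold path: run the full heuristic
        cache[idx] = Some(Slot { ptr, len, result });
        result
    })
}

fn main() {
    use std::cell::Cell;
    let calls = Cell::new(0u32);
    let classify = |ua: &str| {
        calls.set(calls.get() + 1);
        ua.contains("bot")
    };
    let ua = String::from("Googlebot/2.1");
    assert!(cached_classify(&ua, &classify));
    assert!(cached_classify(&ua, &classify)); // same ptr/len: served from cache
    assert_eq!(calls.get(), 1);
    println!("classify ran {} time(s)", calls.get());
}
```

A direct-mapped layout trades hit rate for latency: a colliding key simply overwrites the slot, which is acceptable when the classifier itself is cheap to rerun.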
`crawler_info`: Aho-Corasick literals + chunked regex fallback.

Release profile: `lto = "fat"`, `codegen-units = 1`, `panic = "abort"`.
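Those profile flags correspond to a `Cargo.toml` release profile along these lines (assumed placement; not copied from the crate):

```toml
# Standard Cargo release-profile settings matching the flags above.
[profile.release]
lto = "fat"          # whole-program link-time optimization
codegen-units = 1    # single codegen unit for maximum optimization
panic = "abort"      # no unwinding machinery
```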
Benchmarked on x86_64: cold ~140 ns/call, warm cache hit ~5 ns/call.
- Empty input counts as crawler.
- Input over 512 bytes is rejected (returns `false`).
- If any crawler keyword (`bot`, `crawl`, `spider`, `scanner`, `+http`, `@`, `archive`, ...) appears: crawler.
- If the UA does not start with `Mozilla/` or `Opera/` and has no known browser engine token (`gecko`, `webkit`, `chrome`, `firefox`, `msie`, `edge`, `opera`, ...): crawler.
- If the UA starts with `Mozilla/` or `Opera/` but is missing both an engine token and the `(compatible;` marker: crawler.
- Otherwise: browser.
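Taken together, the rules above amount to roughly this decision procedure. This is an illustrative sketch only: the real `is_crawler` does a single pass over the lowered input with a lookup table rather than repeated `contains` scans, and its keyword/engine lists are longer than shown here.

```rust
// Illustrative sketch of the decision order above; not the crate's code.
fn is_crawler_sketch(ua: &str) -> bool {
    // Empty input counts as crawler.
    if ua.is_empty() {
        return true;
    }
    // Input over 512 bytes is rejected outright.
    if ua.len() > 512 {
        return false;
    }
    let lower = ua.to_ascii_lowercase();
    // Any crawler keyword wins immediately (abbreviated list).
    const KEYWORDS: [&str; 7] = ["bot", "crawl", "spider", "scanner", "+http", "@", "archive"];
    if KEYWORDS.iter().any(|k| lower.contains(k)) {
        return true;
    }
    const ENGINES: [&str; 7] = ["gecko", "webkit", "chrome", "firefox", "msie", "edge", "opera"];
    let has_engine = ENGINES.iter().any(|e| lower.contains(e));
    let browser_prefix = ua.starts_with("Mozilla/") || ua.starts_with("Opera/");
    // No browser prefix and no engine token: crawler.
    if !browser_prefix && !has_engine {
        return true;
    }
    // Browser prefix but neither an engine token nor "(compatible;": crawler.
    if browser_prefix && !has_engine && !lower.contains("(compatible;") {
        return true;
    }
    false
}

fn main() {
    assert!(is_crawler_sketch(""));
    assert!(is_crawler_sketch("Googlebot/2.1 (+http://www.google.com/bot.html)"));
    assert!(is_crawler_sketch("curl/8.4.0")); // no prefix, no engine token
    assert!(!is_crawler_sketch(
        "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"
    ));
}
```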
is_crawler is heuristic. crawler_info is feature-gated + database-backed.
Measured against bundled fixture corpora:
| corpus | entries | result |
|---|---|---|
| crawler_user_agents.txt | 2,149 | 95.4% detected |
| loadkpi_crawlers.txt | 3,696 | 94.7% detected |
| crawler_user_agents_pgts.txt | 156 | 98.1% detected |
| browser_user_agents.txt | 19,897 | <1% false positive |
Run `cargo test --release` to verify on your machine.

```sh
cargo bench --bench bench
cargo bench --features database --bench database
```

| run | ns/call | M calls/s |
|---|---|---|
| cold corpus | 137.1 | 7.30 |
| warm hits | 4.5 | 222.26 |
Fixture corpus: 25,898 User-Agents.
```sh
cargo build --release
cargo test --release
cargo test --release --features database
cargo test --doc
cargo doc --no-deps --open
```

Standard formatting across Rust + markdown/yaml:
```sh
cargo fmt --all
cargo clippy --all-targets -- -D warnings
npx --yes prettier --write --single-quote --print-width=100 --trailing-comma=es5 --end-of-line=lf "**/*.{md,yml}"
```

CI enforces `cargo fmt --check` and `cargo clippy -D warnings` on every push.
Pushes to main/master trigger `.github/workflows/publish.yml`:

- Reads `name` + `version` from `Cargo.toml`.
- Skips if that version is already on crates.io.
- Tests on Ubuntu, macOS, Windows × stable, beta.
- Runs `cargo publish` using `CARGO_REGISTRY_TOKEN`.
- Tags the commit `vX.Y.Z`.
Bump version in Cargo.toml, push to main, done.
If this saved you time, buy me a coffee.
Apache-2.0. See LICENSE.