Fix: Overhaul NSRL workflow: streaming build, SHA1 verification, optional filter, download on install #166

Merged
steffenfritz merged 6 commits into steffenfritz:main from rhaist:fix/bloom-filter-sizing
Apr 11, 2026

@rhaist rhaist commented Apr 5, 2026

This PR fixes #165 and overhauls the entire NSRL workflow: memory/IO/disk
usage during bloom filter builds, file verification, making NSRL optional
at scan time, and distributing the bloom filter via GitHub Releases instead
of bundling it in the repository.

Changes

Remove db/nsrl.bloom from the repository

The pre-built bloom filter is too large to bundle in the repo. It is now
distributed as a GitHub Release asset and downloaded automatically during
ftrove --install.

Taskfile.nsrl.yml — streaming pipeline rewrite

  • No intermediate files: SHA1 hashes are streamed directly from the NSRL
    SQLite databases into admftrove via stdin, eliminating all temporary .txt
    files, so the export step adds almost no peak disk or memory overhead.
  • Accurate item count: SELECT COUNT(*) FROM DISTINCT_HASH is run first so
    the bloom filter is correctly sized before ingestion begins.
  • SHA1 verification: Each downloaded zip is verified against the
    corresponding NIST-provided .sha file using sha1sum -c.
  • bash -eo pipefail: Multi-database pipe commands are wrapped in an
    explicit bash -eo pipefail -c '...' invocation so that a failing sqlite3
    process propagates an error through the pipeline (Debian /bin/sh is dash
    and does not support pipefail).
  • FPR relaxed to 1%: For archival classification the 0.01% FPR was
    unnecessarily strict and produced a ~474 MB file. 1% halves the size with no
    practical impact on classification accuracy.
  • Post-build test: Every build-* task ends with task nsrl:test
    (go test -run TestBloomWithRealNSRL) to verify the filter contains known
    NSRL hashes before the file is used.
  • check task: Compares the version embedded in db/nsrl.bloom against
    NSRL_VERSION so operators can see at a glance whether an update is needed.

nsrl.go — stdin support and auto-count

  • CreateNSRLBloom accepts "-" as the source path to read from os.Stdin;
    --nsrl-estimate is required in that mode.
  • When estimatedItems == 0 for a regular file, the file is pre-scanned to
    count non-empty lines (two-pass), guaranteeing correct filter sizing without
    the caller needing to supply an estimate.

install.go — configurable download on install

  • Added URL constants for all three variants: NSRLBloomURLModern,
    NSRLBloomURLMobile, NSRLBloomURLAll.
  • Added NSRLVariants map for validation.
  • InstallFT accepts an nsrlVariant parameter ("modern", "mobile",
    "all"); defaults to "all".
  • InstallFT tries to copy a local nsrl.bloom first → falls back to
    downloading the selected variant → continues gracefully if unavailable.
  • DownloadNSRLBloom(dst, url string) takes an explicit URL so callers can
    target any variant.
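
A rough sketch of the install flow described above, under stated assumptions: the URLs are placeholders (the real constants live in install.go and are not shown in this description), and `resolveVariantURL`, `copyFile`, `download`, and `fetchBloom` are hypothetical names, not the functions in the PR.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// Placeholder URLs only; the real release asset URLs are not reproduced here.
var variantURLs = map[string]string{
	"modern": "https://example.invalid/nsrl-modern.bloom",
	"mobile": "https://example.invalid/nsrl-mobile.bloom",
	"all":    "https://example.invalid/nsrl-all.bloom",
}

// resolveVariantURL validates the variant name; empty means the default "all".
func resolveVariantURL(variant string) (string, error) {
	if variant == "" {
		variant = "all"
	}
	url, ok := variantURLs[variant]
	if !ok {
		return "", fmt.Errorf("unknown NSRL variant %q", variant)
	}
	return url, nil
}

// copyFile copies src to dst, creating or truncating dst.
func copyFile(src, dst string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()
	out, err := os.Create(dst)
	if err != nil {
		return err
	}
	if _, err := io.Copy(out, in); err != nil {
		out.Close()
		return err
	}
	return out.Close()
}

// download fetches url into dst and removes a partial file on failure.
func download(url, dst string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("download %s: %s", url, resp.Status)
	}
	out, err := os.Create(dst)
	if err != nil {
		return err
	}
	if _, err := io.Copy(out, resp.Body); err != nil {
		out.Close()
		os.Remove(dst)
		return err
	}
	return out.Close()
}

// fetchBloom tries a local copy first, then the download. It reports false
// instead of failing so the caller can continue without a filter.
func fetchBloom(localSrc, dst, url string) bool {
	if copyFile(localSrc, dst) == nil {
		return true
	}
	return download(url, dst) == nil
}

func main() {
	url, err := resolveVariantURL("modern")
	fmt.Println(url, err)

	_, err = resolveVariantURL("legacy")
	fmt.Println(err) // unknown variant is rejected before any I/O happens
}
```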

cmd/ftrove/main.go — --nsrl-variant flag + optional NSRL

  • New --nsrl-variant flag (default: all) selects which bloom filter
    variant is downloaded during --install.
  • Bloom filter load failure is now a Warn log instead of a fatal
    Error + os.Exit(1). The scan continues with nsrlFilter = nil.
  • Nsrlversion in the session record is set to "none" when no filter is
    loaded.
  • The NSRL check is guarded with nsrlFilter != nil; all files get
    Filensrl = "FALSE" when no filter is present.
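
The nil guard can be illustrated with a small sketch. The `hashFilter` interface, `setFilter` type, and both helper functions are hypothetical stand-ins for the real bloom filter and scan code, and the "TRUE" value for a hit is an assumption; the description only states "FALSE" and "none" for the no-filter case.

```go
package main

import "fmt"

// hashFilter is a hypothetical stand-in for the loaded bloom filter.
type hashFilter interface {
	Contains(sha1 string) bool
}

// setFilter is a toy exact-match filter used only for this example.
type setFilter map[string]bool

func (s setFilter) Contains(h string) bool { return s[h] }

// nsrlFlag returns "FALSE" when no filter is loaded, so a missing
// nsrl.bloom never aborts the scan. "TRUE" on a hit is assumed.
func nsrlFlag(f hashFilter, sha1 string) string {
	if f != nil && f.Contains(sha1) {
		return "TRUE"
	}
	return "FALSE"
}

// nsrlVersion reports "none" in the session record when no filter is loaded.
func nsrlVersion(f hashFilter, loadedVersion string) string {
	if f == nil {
		return "none"
	}
	return loadedVersion
}

func main() {
	var missing hashFilter // load failed: warn and continue with nil

	fmt.Println(nsrlFlag(missing, "abc"), nsrlVersion(missing, "")) // FALSE none

	loaded := setFilter{"abc": true}
	fmt.Println(nsrlFlag(loaded, "abc"), nsrlVersion(loaded, "2026.03.1")) // TRUE 2026.03.1
}
```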

nsrl_test.go — new tests

  • TestBloomAutoCount: verifies that estimatedItems=0 triggers two-pass
    auto-count and all inserted hashes are found.
  • TestBloomStdinRequiresEstimate: verifies that reading from stdin without
    an estimate returns an error.

testdata/nsrl_known_hashes.txt

Replaced placeholder hashes with 10 real SHA1 hashes extracted from NSRL RDS
2026.03.1 modern minimal, verified present in the built bloom filter.

Creating the GitHub Release assets

After merging, build all three variants and upload them to a release:

# Build all variants (requires NSRL RDS downloads, ~4 GB total)
task nsrl:build-modern   # produces db/nsrl.bloom → rename to nsrl-modern.bloom
task nsrl:build-mobile   # produces db/nsrl.bloom → rename to nsrl-mobile.bloom
task nsrl:build-all      # produces db/nsrl.bloom → rename to nsrl-all.bloom

# Create the release and upload all three assets
gh release create nsrl-2026.03.1 \
  --repo steffenfritz/FileTrove \
  --title "NSRL Bloom Filters 2026.03.1" \
  --notes "Pre-built NSRL bloom filters from RDS 2026.03.1 at 1% FPR.
- nsrl-modern.bloom: modern minimal subset (~150 MB)
- nsrl-mobile.bloom: modern + android + ios (~200 MB)
- nsrl-all.bloom: modern + android + ios + legacy (~240 MB)" \
  nsrl-modern.bloom nsrl-mobile.bloom nsrl-all.bloom

The URL constants in install.go must be updated whenever a new NSRL build
is published.

Usage

# Install with the default (all) variant
ftrove --install /opt/filetrove

# Install with a specific variant
ftrove --install /opt/filetrove --nsrl-variant modern
ftrove --install /opt/filetrove --nsrl-variant mobile

Test plan

  • go test -v ./... passes
  • task nsrl:build-all completes: verifies SHA1, builds bloom, runs TestBloomWithRealNSRL
  • ftrove --install <dir> downloads the all variant by default
  • ftrove --install <dir> --nsrl-variant modern downloads the modern variant
  • ftrove scan completes without db/nsrl.bloom present (NSRL checks skipped, no exit)
  • Upload all three bloom files to GitHub Release tag nsrl-2026.03.1 after building

Closes #165

rhaist added 3 commits April 4, 2026 12:29
The Bloom filter was always sized for 40M items (hardcoded default),
regardless of actual input size. When the NSRL hash file contained
significantly more entries (e.g. 160M+), the undersized filter produced
~40-45% false positives instead of the target 0.01%.

CreateNSRLBloom now pre-scans the input file to count actual hashes
when estimatedItems is 0 (the new default), ensuring the filter is
always correctly dimensioned for the target FPR.

Fixes steffenfritz#165

Use -readonly mode for all sqlite3 SELECT queries in Taskfile.nsrl.yml
to prevent accidental writes to the NSRL source databases. Add project
guidance file for Claude Code.
@rhaist rhaist changed the title from "Fix: bloom filter sizing" to "Fix: Overhaul NSRL workflow: streaming build, SHA1 verification, optional filter, download on install" Apr 6, 2026
@rhaist rhaist marked this pull request as ready for review April 6, 2026 20:03
@rhaist rhaist requested a review from steffenfritz as a code owner April 6, 2026 20:03
- install.go: replace single NSRLBloomURL with per-variant constants
  (NSRLBloomURLModern/Mobile/All), add NSRLVariants map, make
  InstallFT accept nsrlVariant param, DownloadNSRLBloom takes explicit URL
- cmd/ftrove/main.go: add --nsrl-variant flag (default: all)
- README.md: update installation and NSRL sections to reflect that
  nsrl.bloom is no longer bundled; document --nsrl-variant flag and variants
- BUILDING.md: update dist:bundle description, remove nsrl.bloom
  prerequisite, update bloom sizes to 1% FPR values, add update workflow
  for publishing new release assets
@steffenfritz steffenfritz self-assigned this Apr 7, 2026
@steffenfritz steffenfritz added the bug Something isn't working label Apr 7, 2026
@steffenfritz steffenfritz merged commit c0f026c into steffenfritz:main Apr 11, 2026
7 checks passed
Labels

bug Something isn't working

Development

Successfully merging this pull request may close these issues.

[BUG] Bloom Filter is not working as expected

2 participants