# dataprint

A high-performance tool for fingerprinting and structurally comparing large JSONL files.
- Overview
- Features
- Architecture
- Requirements
- Limitations
- Installation
- Configuration
- Usage
- Performance
- Benchmarks
- Testing
- Deployment
- Contributing
- Changelog
- License
## Overview

dataprint is Free and Open Source software for anyone who needs a high-performance tool to fingerprint or compare very large JSONL files. It uses the structure and semantics of JSON, rather than its raw bytes, to produce meaningful fingerprints and comparisons, so reordering content does not change the output.
## Features

- Hashing: computes a plain BLAKE3 hash of a file
- Fingerprint: creates a structural fingerprint of a JSONL file
- Compare: measures how similar two JSONL files are
## Architecture

dataprint uses a multi-threaded worker pool that pulls work items from a bounded queue. JSON parsing is handled by simdjson. If you are interested in more of the specifics, contact me.
## Requirements

| Dependency | Minimum Version | Notes |
|---|---|---|
| simdjson | v4.2.4 | JSON parser |
| BLAKE3 | v1.8.3 | Cryptographic hashing |
| xxhash | v0.8.3 | Non-cryptographic hashing |
| CLI11 | v2.6.1 | Command line parsing |
## Limitations

- Not compatible with MSVC (contact me if this interests you).
- A 32-bit address space may be insufficient for addressing large files.
- dataprint does not work well with a single large JSON document, for example a multi-gigabyte JSON file. There is no natural way to parallelize parsing of one JSON document, so a single thread is best in that scenario. Prefer breaking large documents into smaller ones, such as the JSONL format, and recomputing.
- Integers that exceed 64 bits (big integers) are approximated. This may cause collisions if two big integers are approximated to the same value.
- Malformed JSON may cause errors, depending on the issue. For example, a literal newline byte inside a record is invalid JSON and will not be parsed correctly.
- Pathological inputs may cause errors, for example deeply nested JSON (10,000+ levels of nesting).
- Different Unicode normalizations of the same text produce different fingerprints.
- Some edge-case floating-point values may be represented differently on different systems, producing different fingerprints for the same number. Treat this as a bug if you encounter it, though in practice it should rarely be an issue.
## Installation

```sh
git clone https://github.com/ttarvis/dataprint.git
cd dataprint
make
make install
```

## Usage

Fingerprint a single file:

```sh
dataprint fingerprint file.jsonl
```

Fingerprint multiple files:

```sh
dataprint fingerprint file1.jsonl file2.jsonl file3.jsonl ...
```

Fingerprint with the `-T` option:

```sh
dataprint fingerprint -T file.jsonl
```

Compare two files:

```sh
dataprint compare file1.jsonl file2.jsonl
```

## Performance

dataprint chunks files to create work items in multi-threaded mode. It allocates a minimal number of objects when walking JSON paths, and the fastest algorithms and data structures were chosen for each processing segment.
- Concurrency model: worker pool with a bounded work-item queue to handle backpressure on large files.
- Memory management: memory use is dominated by chunking; chunks have a capped size and are deallocated after use via RAII.
- I/O strategy: files are memory mapped and chunked.
- Hot-path optimization: simdjson parsing and fast hash functions are used to walk JSON paths.
In benchmarks, speedup over single-threaded execution scales linearly, and multi-threaded efficiency stays above 50%.
## Benchmarks

Benchmarks were run on an AWS c6a.8xlarge instance (AMD EPYC 7R13, 16 physical cores) with a warm OS page cache. The best of three runs was taken. Files were generated by the included file generators, so the sizes reflect what was actually tested.
| File Size | Single Threaded | Multithreaded | Speedup | Throughput |
|---|---|---|---|---|
| 1.1 GB | 5.76s | 0.81s | 8.12x | 1.54 GB/s |
| 11 GB | 57.81s | 7.05s | 7.54x | 1.43 GB/s |
- Sub-second processing of 1 GB JSONL files on commodity cloud hardware
- Consistent throughput from 1 GB to 11 GB, scaling linearly with file size
- 8x speedup over single-threaded execution on 16 physical cores
- 3086% CPU utilization on the 11 GB benchmark; the pipeline keeps all cores busy
To reproduce the benchmarks:

```sh
cd benchmarks
make generate   # generates test files (~30 min for the 11 GB file)
make run        # runs benchmarks and saves results to benchmarks/results/
```

## Testing

```sh
make test       # unit tests
```

## Contributing

See CONTRIBUTING.md for guidelines on reporting bugs, proposing changes, and submitting pull requests. Code style follows the conventions in the .clang-format file.
## Changelog

See CHANGELOG.md.
## License

Apache 2.0 © Terence Tarvis