
dataprint

High-performance tool for large JSONL file fingerprinting and structural comparisons.



Overview

dataprint is free and open source software for anyone who needs a high-performance tool to fingerprint or compare very large JSONL files. It uses the structure and semantics of JSON to produce meaningful fingerprints and file comparisons, which means that reordering does not affect the output.


Features

  • Hashing — computes a simple BLAKE3 hash of a file
  • Fingerprint — creates a structural fingerprint of a JSONL file
  • Compare — measures how similar two JSONL files are

Architecture

dataprint uses a multi-threaded worker pool that waits on a bounded queue of work items. JSON parsing is done with simdjson. If you're interested in more of the specifics, contact me.


Requirements

| Dependency | Minimum Version | Notes |
|------------|-----------------|-------|
| simdjson   | v4.2.4          | JSON parser |
| BLAKE3     | v1.8.3          | Cryptographic hashing |
| xxhash     | v0.8.3          | Non-cryptographic hashing |
| CLI11      | v2.6.1          | Command line parsing |

Limitations

  • Not compatible with MSVC (contact me if this interests you).
  • A 32-bit address space may be insufficient for addressing large files.
  • dataprint does not work well with a single large JSON document, for example a GB+ JSON file. There is no natural way to parse a single JSON document in parallel, so a single thread is best in that scenario. It is recommended to break large JSON documents into smaller documents, such as the JSONL format, and recompute.
  • Integers that exceed 64 bits (big integers) are approximated. This may cause collisions if two big integers are approximated by the same value.
  • Malformed JSON will likely cause errors, depending on the issue. For example, a literal newline byte inside a JSON value is invalid JSON and will not be parsed correctly.
  • Pathological inputs will likely cause errors, for example deeply nested JSON (10,000+ levels of nesting).
  • Different Unicode normalizations of the same text produce different fingerprints.
  • Some edge-case floating-point values may be represented differently on different systems, producing different fingerprints for the same number. This should be treated as a bug, but it generally shouldn't be an issue in practice.

Installation

From Source

```shell
git clone https://github.com/ttarvis/dataprint.git
cd dataprint
make
make install
```

Usage

Basic Examples

```shell
dataprint fingerprint file.jsonl
dataprint fingerprint file1.jsonl file2.jsonl file3.jsonl ...
dataprint fingerprint -T file.jsonl
dataprint compare file1.jsonl file2.jsonl
```

Performance

Design Decisions for Performance

dataprint uses a chunking pattern on files to create work items in multi-threaded mode. It tries to allocate a minimal number of objects when walking JSON paths, and fast algorithms and data structures were chosen for each processing stage.

  • Concurrency model: Worker pool with a bounded work-item queue to handle backpressure on large files.
  • Memory management: Memory use is concentrated in chunking, which has a capped size; chunks are deallocated after use via RAII.
  • I/O strategy: Files are memory mapped and chunked.
  • Hot-path optimization: simdjson parsing and fast hash functions are used to walk paths.

Scaling Characteristics

In benchmarks, speedup over single-threaded execution scales linearly with core count, and multithreaded efficiency stays above 50%.


Benchmarks

Benchmarks were run on an AWS c6a.8xlarge instance (AMD EPYC 7R13, 16 physical cores) with a warm OS page cache. The best of 3 runs was taken. Files were generated by the included file generators, so the listed sizes reflect the actual files tested.

Fingerprint — flat JSONL

| File Size | Single Threaded | Multithreaded | Speedup | Throughput |
|-----------|-----------------|---------------|---------|------------|
| 1.1 GB    | 5.76s           | 0.81s         | 8.12x   | 1.54 GB/s  |
| 11 GB     | 57.81s          | 7.05s         | 7.54x   | 1.43 GB/s  |

Key observations

  • Sub-second processing of 1 GB JSONL files on commodity cloud hardware
  • Consistent throughput from 1 GB to 11 GB, i.e. linear scaling with file size
  • 8x speedup over single-threaded execution on 16 physical cores
  • 3086% CPU utilization on the 11 GB benchmark, meaning the pipeline keeps all cores busy

Running benchmarks yourself

```shell
cd benchmarks
make generate  # generates test files (~30 min for 11GB)
make run       # runs benchmarks and saves results to benchmarks/results/
```

Testing

Running Tests

```shell
# unit tests
make test
```

Contributing

See CONTRIBUTING.md for guidelines on reporting bugs, proposing changes, and submitting pull requests.

Code Style

Style conventions are defined in the .clang-format file.


Changelog

See CHANGELOG.md.


License

Apache 2.0 © Terence Tarvis
