
dataprint

High-performance tool for large JSONL file fingerprinting and structural comparisons.



Overview

dataprint is free and open source software for anyone who needs a high-performance tool to fingerprint or compare very large JSONL files. It uses the structure and semantics of JSON to produce meaningful fingerprints and file comparisons, which means that reordering does not affect the output.


Features

  • Hashing — computes a simple BLAKE3 hash of a file
  • Fingerprint — creates a structural fingerprint of a JSONL file
  • Compare — measures how similar two JSONL files are

Architecture

dataprint uses a multi-threaded worker pool that waits on a bounded queue of work items. JSON parsing is done with simdjson. If you're interested in more of the specifics, contact me.


Requirements

| Dependency | Minimum Version | Notes |
|------------|-----------------|-------|
| simdjson   | v4.2.4          | JSON parser |
| BLAKE3     | v1.8.3          | Cryptographic hashing |
| xxhash     | v0.8.3          | Non-cryptographic hashing |
| CLI11      | v2.6.1          | Command line parsing |

Limitations

  • Not compatible with MSVC (contact me if this interests you).
  • A 32-bit address space may be insufficient for addressing large files.
  • dataprint does not work well with a single large JSON document, for example a GB+ JSON file. There is no natural way to parse a single JSON document in parallel, so a single thread is best in that scenario. It is recommended to break large JSON documents into smaller documents, such as the JSONL format, and recompute.
  • Integers that exceed 64 bits (big integers) are approximated. This may cause collisions if two big integers are approximated by the same value.
  • Malformed JSON will likely cause errors, depending on the issue. For example, a literal newline byte inside a JSON value is invalid JSON and will not be parsed correctly.
  • Pathological inputs will likely cause errors, for example deeply nested JSON (10,000+ levels of nesting).
  • Different Unicode normalizations of the same text produce different fingerprints.
  • Some edge-case floating-point values may be represented differently on different systems, producing different fingerprints for the same number. This should be treated as a bug, but it generally shouldn't be an issue in practice.

Installation

From Source

```shell
git clone https://github.com/ttarvis/dataprint.git
cd dataprint
make
make install
```

Usage

Basic Examples

```shell
dataprint fingerprint file.jsonl
dataprint fingerprint file1.jsonl file2.jsonl file3.jsonl ...
dataprint fingerprint -T file.jsonl
dataprint compare file1.jsonl file2.jsonl
```

Performance

Design Decisions for Performance

dataprint uses a chunking pattern on files to create work items in multi-threaded mode. It tries to allocate a minimal number of objects when walking JSON paths, and fast algorithms and data structures were chosen for each processing stage.

  • Concurrency model: Worker pool with a bounded work-item queue to handle backpressure on large files.
  • Memory management: Memory use is concentrated in chunking, which has a capped size; chunks are deallocated after use via RAII.
  • I/O strategy: Files are memory mapped and chunked.
  • Hot-path optimization: simdjson parsing and fast hash functions are used to walk paths.

Scaling Characteristics

In benchmarks, speedup over single-threaded execution scales linearly with core count, and multithreaded efficiency stays above 50%.


Benchmarks

Benchmarks were run on an AWS c6a.8xlarge instance (AMD EPYC 7R13, 16 physical cores) with a warm OS page cache. The best of 3 runs was taken. Files were generated by the included file generators, so the listed sizes reflect the actual files tested.

Fingerprint — flat JSONL

| File Size | Single Threaded | Multithreaded | Speedup | Throughput |
|-----------|-----------------|---------------|---------|------------|
| 1.1 GB    | 5.76s           | 0.81s         | 8.12x   | 1.54 GB/s  |
| 11 GB     | 57.81s          | 7.05s         | 7.54x   | 1.43 GB/s  |

Key observations

  • Sub-second processing of 1 GB JSONL files on commodity cloud hardware
  • Consistent throughput from 1 GB to 11 GB, i.e. linear scaling with file size
  • 8x speedup over single-threaded execution on 16 physical cores
  • 3086% CPU utilization on the 11 GB benchmark, meaning the pipeline keeps all cores busy

Running benchmarks yourself

```shell
cd benchmarks
make generate  # generates test files (~30 min for 11GB)
make run       # runs benchmarks and saves results to benchmarks/results/
```

Testing

Running Tests

```shell
# unit tests
make test
```

Contributing

See CONTRIBUTING.md for guidelines on reporting bugs, proposing changes, and submitting pull requests.

Code Style

Style conventions are defined in the .clang-format file.


Changelog

See CHANGELOG.md.


License

Apache 2.0 © Terence Tarvis
