V0ldek edited this page Apr 6, 2023 · 6 revisions

Welcome to the rsonpath wiki!

These documents are meant mostly for the project's developers. They describe the design principles and the internal architecture of rsonpath.

What is rsonpath?

We're building the fastest JSONPath engine in the world. That's the mission statement.

The project consists of two main crates: rsonpath and rsonpath-lib. The first is the CLI binary, the second is the core library. Both are "main" products, but they have slightly different user experiences in mind.

The CLI is meant to be used as a jq-like tool for processing JSON documents as part of a shell script or a developer's workflow.

The library is meant to be an all-encompassing high-performance JSONPath tool that can be integrated in an app for extremely fast JSON querying.

The key takeaway here is that the library has to be more flexible than the CLI itself demands, since it's a standalone product.

Core principles

Keep these in mind when making design decisions.

I. Correctness

First and foremost, the results should be correct. This tenet is currently violated in our handling of UTF-8 (see #117), but we should not repeat that mistake. The following must hold: if the input is valid and interoperable according to RFC 8259, then rsonpath must return either a result compatible with the JSONPath semantics of the RFC Draft, or a "not supported" error.

The "if" in that statement carries a lot of weight. We give no guarantees on behavior if the JSON is malformed, contains invalid UTF-8 characters, or violates interoperability, for example by including duplicate labels within an object. This is a conscious and important choice: we are not a JSON verification tool, and there are other fast tools for that, like simdjson.
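The contract above ("a correct result or a 'not supported' error, never a silent wrong answer") can be sketched in types. This is a minimal illustration, not rsonpath's actual API: `Selector`, `EngineError`, and `check_supported` are hypothetical names, and treating filters as the unsupported feature is only an assumption made for the example.

```rust
/// Hypothetical selector kinds, for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub enum Selector {
    Child(String),
    Descendant(String),
    Filter(String),
}

/// Hypothetical error type: the only acceptable alternative to a correct result.
#[derive(Debug, PartialEq)]
pub enum EngineError {
    NotSupported(&'static str),
}

/// Validates a query up front. Anything the engine cannot evaluate correctly
/// is rejected explicitly instead of being evaluated approximately.
pub fn check_supported(query: &[Selector]) -> Result<(), EngineError> {
    for selector in query {
        // Assumption for this sketch: filters are the unsupported feature.
        if let Selector::Filter(_) = selector {
            return Err(EngineError::NotSupported("filter expressions"));
        }
    }
    Ok(())
}
```

The point of the design is that the caller can always tell "no matches" apart from "we could not run this query".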

II. Portability

We are a relatively low-level library, and a flexible CLI application. The more platforms we can support, the better, even if we can't optimize for all of them as effectively. Thanks to Rust's robust conditional compilation we can achieve this without sacrificing performance. Every feature should be implemented so that it works on targets that support none of our current SIMD platforms, so that a portable fallback always exists. The feature can then be specialized for various SIMD architectures with conditional compilation.
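The "portable fallback first, SIMD specialization second" pattern can be sketched as follows. This is an illustrative example under assumptions, not code from rsonpath: the function names are hypothetical, and byte counting stands in for a real feature. SSE2 is used because it is part of the x86_64 baseline, so no runtime detection is needed in this sketch.

```rust
/// Portable fallback: compiles and works on every target.
pub fn count_byte_portable(haystack: &[u8], needle: u8) -> usize {
    haystack.iter().filter(|&&b| b == needle).count()
}

/// SSE2 specialization, compiled only on x86_64.
#[cfg(target_arch = "x86_64")]
fn count_byte_sse2(haystack: &[u8], needle: u8) -> usize {
    use std::arch::x86_64::{
        __m128i, _mm_cmpeq_epi8, _mm_loadu_si128, _mm_movemask_epi8, _mm_set1_epi8,
    };

    let mut count = 0;
    let mut chunks = haystack.chunks_exact(16);
    // SAFETY: SSE2 is always available on x86_64, and every chunk is exactly
    // 16 bytes long, so the unaligned 128-bit load stays in bounds.
    unsafe {
        let pattern = _mm_set1_epi8(needle as i8);
        for chunk in &mut chunks {
            let block = _mm_loadu_si128(chunk.as_ptr() as *const __m128i);
            // One mask bit per byte that equals the needle.
            let mask = _mm_movemask_epi8(_mm_cmpeq_epi8(block, pattern));
            count += mask.count_ones() as usize;
        }
    }
    // The tail shorter than 16 bytes goes through the portable code.
    count + count_byte_portable(chunks.remainder(), needle)
}

/// Public entry point on x86_64: uses the specialized version.
#[cfg(target_arch = "x86_64")]
pub fn count_byte(haystack: &[u8], needle: u8) -> usize {
    count_byte_sse2(haystack, needle)
}

/// Public entry point everywhere else: uses the portable version.
#[cfg(not(target_arch = "x86_64"))]
pub fn count_byte(haystack: &[u8], needle: u8) -> usize {
    count_byte_portable(haystack, needle)
}
```

Callers only ever see `count_byte`. Adding another architecture means adding another `cfg`-gated specialization, while the portable function remains the reference that every specialization must agree with.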

III. Performance

Our crate is supposed to go "brrr". We measure performance in gigabytes per second where other engines do it in megabytes. Even the non-SIMD parts that are explicitly expected to be slower than the SIMD parts shouldn't be half-assed. We may compromise on performance only when correctness requires it, and even then it's worth asking if we can somehow expose the faster version as a flag: if the user knows the JSON is minified, do they need to pay for our whitespace handling code?

Important consequences

  1. Since Correctness comes first, we need to compromise on our feature set. For example, it's hard to imagine a feasible streaming algorithm supporting arbitrary filtering expressions. Features that are seemingly very difficult to implement are pushed to the side in favour of ones that won't require months of banging our heads against a whiteboard.

  2. If there is a way to support a common case with major Performance gains, but it might not be Correct in the presence of some exotic JSONs, we should allow conditionally using the more performant version. However, the correct version should be the default. The assumption here is twofold. One, we are fast anyway. Two, a user who is disappointed with the performance of rsonpath and then discovers a way to make it faster for their case is better off than a user who forgets the flag that makes the library actually correct. "My API works 10% slower" is a better problem to have than "my API returns incorrect results 10% of the time".

  3. If there is a way to use hardware-specific features for major Performance gains, that platform should be special-cased and optimized for with a non-Portable solution. That's the whole shtick of rsonpath: we use SIMD everywhere we can. But the portable version must always exist, even if it's going to be 100x slower. Users of that hardware would rather have a slower implementation of JSONPath than none at all.
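The second consequence (correct by default, faster by explicit opt-in) can be sketched with an options struct whose `Default` is the correct behavior. This is a hypothetical illustration built on the whitespace example from the Performance section. `EngineOptions`, `assume_no_whitespace`, and `first_structural_byte` are made-up names, not rsonpath flags or functions.

```rust
/// Hypothetical engine options. `Default` yields the correct, slower behavior.
#[derive(Debug, Clone, Copy, Default)]
pub struct EngineOptions {
    /// Opt-in promise from the user that the input has no insignificant
    /// whitespace. `false` by default, so forgetting the flag costs speed,
    /// never correctness.
    pub assume_no_whitespace: bool,
}

/// The four whitespace bytes allowed between JSON tokens by RFC 8259.
fn is_json_whitespace(b: u8) -> bool {
    matches!(b, b' ' | b'\t' | b'\n' | b'\r')
}

/// Returns the first byte of the document that is not insignificant whitespace.
pub fn first_structural_byte(input: &[u8], options: EngineOptions) -> Option<u8> {
    if options.assume_no_whitespace {
        // Fast path: trusts the user's promise and skips the scan entirely.
        input.first().copied()
    } else {
        // Default path: always correct for valid JSON.
        input.iter().copied().find(|&b| !is_json_whitespace(b))
    }
}
```

With `EngineOptions::default()` the function is correct on any valid input. The fast path gives the same answer only when the user's promise holds, which is exactly why it must never be the default.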

Research

The core ideas in this crate come from my master's thesis.