Add new field type for semi-structured data indexing and efficient querying #1050

fmassot · 2021-05-18T11:41:29Z

Make semi-structured data a first-class citizen

In many use cases, users/engineers don't know the data schema in advance but want to index it and query it fast. This is very common for logs but also in the analytics realm.

The aim of this feature is to make what we will call semi-structured data from now on a first-class citizen in tantivy.

Definition of semi-structured data

Let's define what semi-structured data is by these two key properties:

it has no fixed schema, and the schema can evolve constantly over time: new attributes, or the same attribute with different types
it can contain an n-level hierarchy of nested attributes
optionally, we can also say that attribute ordering can be ignored

Some databases are already handling it

Many databases/search engines offer different ways to handle semi-structured data:

rockset offers what they call "smart schema": they ingest any json objects and index all fields with their associated types (object, array, string, float, int, null_type)
snowflake offers a VARIANT column where you can store your json; it creates columns from the data according to certain rules to make queries fast
elasticsearch allows inserting json and indexing it through dynamic field mapping, but it breaks if the same attribute has mixed types
elasticsearch has a data type field called "flattened" which indeed flattens everything in one field
veloci from @PSeitz

Json query language

One common way is to use "json path" (it was introduced in PostgreSQL 12, for example: https://www.postgresql.org/docs/12/functions-json.html#FUNCTIONS-SQLJSON-PATH).
It's clearly overkill for what we want to do but we can get some inspiration from these examples:

query on an attribute with the .key accessor: '$.track.location=paris'
query on attributes inside arrays by using [*]: '$.track.segments[*].location=paris'
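
To make the two accessors above concrete, here is a minimal sketch of how such a path could be evaluated against a document. It only supports `.key` and `[*]`, which is a tiny subset of what PostgreSQL's json path offers. The `Json` enum and the `obj`, `parse_path`, `select` and `path_equals` names are hypothetical and exist only for this illustration; none of them are tantivy APIs.

```rust
use std::collections::BTreeMap;

/// Minimal JSON value, defined here only to keep the sketch dependency-free.
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Str(String),
    Array(Vec<Json>),
    Object(BTreeMap<String, Json>),
}

/// Hypothetical helper to build an object from key/value pairs.
fn obj(pairs: Vec<(&str, Json)>) -> Json {
    Json::Object(pairs.into_iter().map(|(k, v)| (k.to_string(), v)).collect())
}

/// One step of the simplified path: `.key` or `[*]`.
#[derive(Debug, PartialEq)]
enum Step {
    Key(String),
    AnyElement,
}

/// Parses paths such as `$.track.segments[*].location`.
fn parse_path(path: &str) -> Option<Vec<Step>> {
    let rest = path.strip_prefix('$')?;
    if !rest.is_empty() && !rest.starts_with('.') {
        return None;
    }
    let mut steps = Vec::new();
    for part in rest.split('.').skip(1) {
        // A part is a key optionally followed by one or more `[*]`.
        let mut key = part;
        let mut wildcards = 0;
        while let Some(stripped) = key.strip_suffix("[*]") {
            key = stripped;
            wildcards += 1;
        }
        if key.is_empty() {
            return None;
        }
        steps.push(Step::Key(key.to_string()));
        for _ in 0..wildcards {
            steps.push(Step::AnyElement);
        }
    }
    Some(steps)
}

/// Returns every value reachable from `root` by following `steps`.
fn select<'a>(root: &'a Json, steps: &[Step]) -> Vec<&'a Json> {
    let mut current = vec![root];
    for step in steps {
        let mut next = Vec::new();
        for value in current {
            match (step, value) {
                (Step::Key(key), Json::Object(map)) => {
                    if let Some(child) = map.get(key) {
                        next.push(child);
                    }
                }
                (Step::AnyElement, Json::Array(items)) => next.extend(items.iter()),
                _ => {} // Type mismatch: this branch selects nothing.
            }
        }
        current = next;
    }
    current
}

/// True if at least one selected value is a string equal to `expected`.
fn path_equals(root: &Json, path: &str, expected: &str) -> bool {
    match parse_path(path) {
        Some(steps) => select(root, &steps)
            .iter()
            .any(|v| matches!(v, Json::Str(s) if s.as_str() == expected)),
        None => false,
    }
}
```

With a document whose `track.segments` array holds objects with a `location` key, `path_equals(&doc, "$.track.segments[*].location", "paris")` is true as soon as one segment is located in paris.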

Feature proposal

be able to add semi structured data (json object) in a new field type
be able to filter on any attribute value (non array)
be able to filter with an operator on any attribute value of a given type
be able to distinguish at least string, integer, float types

Optionally:

be able to make range queries on attribute values
be able to make regex queries on string values
be able to filter on any attribute value in nested items inside arrays
be able to treat a string like a text field (tokenization)
be able to handle null types

As of now, we lack user feedback, and it would definitely be worth defining some use cases for such a field.
To move fast, we can propose a very simple and straightforward implementation.

Implementation 1: add a new type field

Field type name

Let's call it ~~JsonField~~, flattened field.

Add new type field based on bytes field

A straightforward way to implement this is to use a binary field and encode attribute names and values one after the other in it.

Serialization

For each (nested) attribute, append a term:
level1.level2.key<separator><type><separator><value>

When writing the document, each term is added to the posting list.
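
As a rough illustration of this serialization, the sketch below flattens a nested object into byte terms of the form `path<separator><type><separator><value>`. Several choices here are assumptions that the proposal does not fix: the separator is a NUL byte, the type tags are `s`, `i` and `f`, numbers are encoded as decimal text, and array elements are indexed under their parent's path. The `Json` enum and the `obj`, `make_term` and `flatten` names are hypothetical.

```rust
use std::collections::BTreeMap;

/// Minimal JSON value, defined here only to keep the sketch dependency-free.
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Null,
    Int(i64),
    Float(f64),
    Str(String),
    Array(Vec<Json>),
    Object(BTreeMap<String, Json>),
}

/// Hypothetical helper to build an object from key/value pairs.
fn obj(pairs: Vec<(&str, Json)>) -> Json {
    Json::Object(pairs.into_iter().map(|(k, v)| (k.to_string(), v)).collect())
}

/// Assumed separator: a NUL byte, unlikely to appear in attribute names.
const SEP: u8 = 0;

/// Builds `path<SEP><type><SEP><value>` as raw bytes.
fn make_term(path: &str, type_tag: u8, value: &[u8]) -> Vec<u8> {
    let mut term = Vec::with_capacity(path.len() + value.len() + 3);
    term.extend_from_slice(path.as_bytes());
    term.push(SEP);
    term.push(type_tag);
    term.push(SEP);
    term.extend_from_slice(value);
    term
}

/// Walks the value depth-first and appends one term per leaf.
fn flatten(value: &Json, path: &mut Vec<String>, terms: &mut Vec<Vec<u8>>) {
    match value {
        Json::Object(map) => {
            for (key, child) in map {
                path.push(key.clone());
                flatten(child, path, terms);
                path.pop();
            }
        }
        // Assumption: array elements share the path of the array itself.
        Json::Array(items) => {
            for item in items {
                flatten(item, path, terms);
            }
        }
        Json::Str(s) => terms.push(make_term(&path.join("."), b's', s.as_bytes())),
        Json::Int(i) => terms.push(make_term(&path.join("."), b'i', i.to_string().as_bytes())),
        Json::Float(f) => terms.push(make_term(&path.join("."), b'f', f.to_string().as_bytes())),
        // Null handling is listed as optional in the proposal; skipped here.
        Json::Null => {}
    }
}
```

Decimal text keeps the example readable but does not sort numerically; range queries would need an order-preserving value encoding instead. Attribute names that themselves contain a `.` would also be ambiguous with this path format.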

Query

Filtering on a term value is straightforward: it is an exact term lookup.
We can provide the not-equal operation too.
I am still wondering how to support some kind of regex query with the above implementation. It seems possible, but I am not sure about it.
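
The queries discussed above can be sketched on a toy in-memory index. This is not how tantivy stores postings; it is a `BTreeMap` from term bytes to document ids, assuming the same hypothetical `path<NUL><type><NUL><value>` term layout. Equality is a single lookup, not-equal is a set difference against all documents, and because the terms are sorted, all values of one attribute and type form a contiguous range that can be scanned and filtered with any predicate (a prefix test here; a regex matcher could be plugged in the same way). All names (`ToyIndex`, `term_prefix`, `matching`) are made up for this illustration.

```rust
use std::collections::{BTreeMap, BTreeSet};

type DocId = u32;

/// Assumed separator between path, type tag and value.
const SEP: u8 = 0;

/// Toy inverted index: sorted terms mapped to the documents containing them.
#[derive(Default)]
struct ToyIndex {
    postings: BTreeMap<Vec<u8>, BTreeSet<DocId>>,
    all_docs: BTreeSet<DocId>,
}

/// Builds `path<SEP><type><SEP>`, the common prefix of all values of one attribute and type.
fn term_prefix(path: &str, type_tag: u8) -> Vec<u8> {
    let mut prefix = path.as_bytes().to_vec();
    prefix.extend_from_slice(&[SEP, type_tag, SEP]);
    prefix
}

impl ToyIndex {
    fn add(&mut self, doc: DocId, path: &str, type_tag: u8, value: &[u8]) {
        let mut term = term_prefix(path, type_tag);
        term.extend_from_slice(value);
        self.postings.entry(term).or_default().insert(doc);
        self.all_docs.insert(doc);
    }

    /// Exact match: a single lookup in the term dictionary.
    fn equals(&self, path: &str, type_tag: u8, value: &[u8]) -> BTreeSet<DocId> {
        let mut term = term_prefix(path, type_tag);
        term.extend_from_slice(value);
        self.postings.get(&term).cloned().unwrap_or_default()
    }

    /// Not equal: every document that does not contain the term.
    /// Note that this also returns documents lacking the attribute entirely.
    fn not_equals(&self, path: &str, type_tag: u8, value: &[u8]) -> BTreeSet<DocId> {
        let hits = self.equals(path, type_tag, value);
        self.all_docs.difference(&hits).copied().collect()
    }

    /// Scans the contiguous range of terms for one attribute and type,
    /// keeping documents whose value satisfies `pred`.
    fn matching(&self, path: &str, type_tag: u8, pred: impl Fn(&[u8]) -> bool) -> BTreeSet<DocId> {
        let prefix = term_prefix(path, type_tag);
        let mut out = BTreeSet::new();
        for (term, docs) in self.postings.range(prefix.clone()..) {
            if !term.starts_with(&prefix) {
                break; // Left the range of this attribute.
            }
            if pred(&term[prefix.len()..]) {
                out.extend(docs.iter().copied());
            }
        }
        out
    }
}
```

The scan cost grows with the number of distinct values of the attribute, so this is only a starting point for regex support, not a claim about its performance.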

Pros

json is a standard field in many databases and it offers flexibility with a simple implementation
while we won't be able to run complex queries on attribute values, like phrase queries, we will be able to run term queries and maybe a little more

Cons

we will need more code to control dynamic field creation and types
no score, no position, no tokenization

Implementation 2: adding fields dynamically

This solution will automatically detect new fields and add them to the schema during indexing.
We will then need to detect field types, which can lead to annoying issues that are well known in ES.

But, we will need to take care of several things:

to avoid mixed-type issues, convert all values to strings, as in veloci. This should be a first step and will need to be customizable later
handle different schemas per segment
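
To illustrate the mixed-type problem, here is a minimal sketch of dynamic field detection. The first value seen for a path fixes its type, and a later value of another type is reported as a conflict. What to do on conflict (reject the document as ES does, or fall back to strings as veloci does) is left open. `FieldType`, `Mapped` and `DynamicSchema` are hypothetical names, not tantivy types.

```rust
use std::collections::HashMap;

/// The three types the proposal wants to distinguish at minimum.
#[derive(Debug, Clone, Copy, PartialEq)]
enum FieldType {
    Str,
    Int,
    Float,
}

/// Outcome of observing one value during indexing.
#[derive(Debug, PartialEq)]
enum Mapped {
    /// The field was unknown and has been added to the schema.
    Added,
    /// The field exists with the same type.
    Existing,
    /// The field exists with another type: the mixed-type case.
    Conflict { expected: FieldType, found: FieldType },
}

/// Schema that grows as new attribute paths are encountered.
#[derive(Default)]
struct DynamicSchema {
    fields: HashMap<String, FieldType>,
}

impl DynamicSchema {
    fn observe(&mut self, path: &str, found: FieldType) -> Mapped {
        match self.fields.get(path).copied() {
            None => {
                self.fields.insert(path.to_string(), found);
                Mapped::Added
            }
            Some(expected) if expected == found => Mapped::Existing,
            Some(expected) => Mapped::Conflict { expected, found },
        }
    }
}
```

Observing `user.age` first as `Int` and later as `Str` yields a `Conflict`. Converting every value to a string before calling `observe` avoids the conflict entirely, at the cost of losing typed queries.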

Pros

it uses already existing field types, so less code to benefit from field features (tokenization) and query language
scoring is possible
dynamic field has been requested several times here: Dynamic Field feature like Solr #555 and Multiple schemas per index #385

Cons

we will need more code to control dynamic field creation and types
as of now, I don't fully understand all the consequences of such a choice for tantivy

Comments on the two possible implementations

The two implementations discussed here are in fact very different and correspond to different use cases:

json: you just want a field to put your garbage and be able to make simple queries fast, other?
dynamic field mapping: you're prototyping something and you're happy to let tantivy choose the mapping for you, other?

My feeling is that dynamic field mapping is a big thing to implement, and we would need to start thinking about enabling schema updates.

@fulmicoton your input is welcome :)


PSeitz · 2021-05-18T13:13:40Z

Let me explain how it's handled in veloci, which is neat in some parts I think, but I don't know how applicable it would be in tantivy.

Before indexing, you can add a configuration, which is completely optional. This configuration can cover single fields, but also a GLOBAL fallback for all fields.

During indexing all data is collected to their fields, in a HashMap<FieldPath, (FieldConfig, Data)>, where FieldPath is just a string similar to JSONPath.
FieldPath supports 1:n fields (or arrays), which are marked with [], examples would be: book.authors[] and book.author. They can be configured and queried that way.

When a new Field is encountered, it creates default settings for that field (which may come from the GLOBAL section in the config). All data is converted to strings for maximum compatibility. When finishing the index, fields and their metadata are written. There is no difference between a dynamic field and a configured field, all are just regular fields, so there is no special logic besides field detection during indexing.
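
The collection step described above could look roughly like the sketch below. It is not veloci's actual code, only an illustration of the description: a map from field path to a pair of config and data, default settings taken from a global fallback when a field is first seen, and every value converted to a string. The `FieldConfig` content (a single `tokenize` flag) and the `Collector` type are assumptions made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical per-field settings; the real options are not listed in the thread.
#[derive(Debug, Clone, PartialEq, Default)]
struct FieldConfig {
    tokenize: bool,
}

/// Collects values per field path while indexing.
#[derive(Default)]
struct Collector {
    /// Fallback settings, playing the role of the GLOBAL section.
    global: FieldConfig,
    /// Optional settings for individual fields.
    overrides: HashMap<String, FieldConfig>,
    /// Field path mapped to its settings and its values, all stored as strings.
    fields: HashMap<String, (FieldConfig, Vec<String>)>,
}

impl Collector {
    /// Records one value, creating the field on first sight.
    fn push(&mut self, path: &str, value: impl ToString) {
        if !self.fields.contains_key(path) {
            // New field: use its explicit config if any, else the global fallback.
            let config = self
                .overrides
                .get(path)
                .cloned()
                .unwrap_or_else(|| self.global.clone());
            self.fields.insert(path.to_string(), (config, Vec::new()));
        }
        if let Some((_, data)) = self.fields.get_mut(path) {
            // Everything becomes a string, so mixed types cannot conflict.
            data.push(value.to_string());
        }
    }
}
```

After collection, a configured field and a dynamically detected one look the same: both are plain entries in `fields`. A 1:n path such as `book.authors[]` is just another key in this sketch.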

fmassot · 2021-05-18T15:23:52Z

I get it, you have implemented what Elasticsearch calls "dynamic field mapping". You can have more control of data types by defining some mapping templates.

We definitely have to consider this solution and compare the pros and cons.

fmassot · 2023-01-18T23:21:20Z

closing, the JSON field is already there :)

fmassot mentioned this issue Jul 16, 2021

Flatten field quickwit-oss/quickwit#310

Closed

PSeitz linked a pull request Mar 16, 2022 that will close this issue

Added JSON type #1270

Merged

fmassot closed this as completed Jan 18, 2023

longjiquan mentioned this issue Nov 7, 2023

Could we specify the document id explictly when trying to add a document into the index? #2240

Closed
