Add new field type for semi-structured data indexing and efficient querying #1050

Closed
fmassot opened this issue May 18, 2021 · 3 comments · Fixed by #1270

@fmassot
Contributor

fmassot commented May 18, 2021

Make semi-structured data a first-class citizen

In many use cases, users/engineers don't know the data schema in advance but want to index it and query it fast. This is very common for logs but also in the analytics realm.

The aim of this feature is to make what we will call semi-structured data from now on a first-class citizen in tantivy.

Definition of semi-structured data

Let's define semi-structured data by these key properties:

  • it has no fixed schema; the schema can constantly evolve over time: new attributes, or the same attribute with different types
  • it can contain n-level hierarchy of nested attributes
  • optionally we can also say that attributes ordering can be ignored

Some databases are already handling it

Many databases/search engines offer different ways to handle semi-structured data:

  • Rockset offers what they call "smart schema": they ingest any JSON object and index all fields with their associated types (object, array, string, float, int, null_type)
  • Snowflake offers a VARIANT column where you can store your JSON; it creates columns from the data according to certain rules to make queries fast
  • Elasticsearch lets you insert JSON and index it through dynamic field mapping, but it breaks if the same attribute has mixed types
  • Elasticsearch also has a field data type called "flattened", which flattens everything into one field
  • veloci from @PSeitz

Json query language

One common way is to use "json path" (it was [introduced in PostgreSQL 12](https://www.postgresql.org/docs/12/functions-json.html#FUNCTIONS-SQLJSON-PATH), for example).
It's clearly overkill for what we want to do but we can get some inspiration from these examples:

  • query on an attribute with the .key accessor: '$.track.location=paris'
  • query on attributes inside arrays by using [*]: '$.track.segments[*].location=paris'
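
To make the two examples above concrete, here is a minimal sketch of a parser for this reduced syntax (`.key` accessors, `[*]`, and a single `=value`). The names `Step` and `parse_query` are hypothetical and only illustrate the idea; this is not a tantivy API and not a full JSON path implementation.

```rust
/// One step of a simplified JSON-path-like expression (hypothetical type).
#[derive(Debug, PartialEq)]
enum Step {
    /// `.key` accessor.
    Key(String),
    /// `[*]` wildcard over the elements of an array.
    AnyElement,
}

/// Parses a query such as `$.track.segments[*].location=paris` into its
/// path steps and the expected value. Returns `None` on malformed input.
fn parse_query(query: &str) -> Option<(Vec<Step>, String)> {
    let (path, value) = query.split_once('=')?;
    let path = path.strip_prefix('$')?;
    let mut steps = Vec::new();
    // The text before the first '.' is empty for a well-formed path, so skip it.
    for segment in path.split('.').skip(1) {
        // A segment may carry a trailing `[*]`, e.g. `segments[*]`.
        let (key, wildcard) = match segment.strip_suffix("[*]") {
            Some(key) => (key, true),
            None => (segment, false),
        };
        if key.is_empty() {
            return None;
        }
        steps.push(Step::Key(key.to_string()));
        if wildcard {
            steps.push(Step::AnyElement);
        }
    }
    if steps.is_empty() {
        return None;
    }
    Some((steps, value.to_string()))
}
```

With this sketch, `parse_query("$.track.location=paris")` yields the steps `track`, `location` and the value `paris`; the `[*]` form adds an `AnyElement` step after `segments`.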

Feature proposal

  • be able to add semi-structured data (JSON objects) in a new field type
  • be able to filter on any attribute value (non-array)
  • be able to filter with an operator on any attribute value of a given type
  • be able to distinguish at least string, integer, and float types

Optionally:

  • be able to make range queries on attribute values
  • be able to make regex queries on string values
  • be able to filter on attribute values of items nested in arrays
  • be able to treat a string like a text field (tokenization)
  • be able to handle null types

As of now, we lack user feedback, and it would definitely be worth defining some use cases for such a field.
To move fast, we can propose a very simple and straightforward implementation.

Implementation 1: add a new type field

Field type name

Let's call it JsonField, or flattened field.

Add new type field based on bytes field

A straightforward way to implement this is to use a binary field and encode attribute names and values in it, one after the other.

Serialization

For each (nested) attribute, append a term:
level1.level2.key<separator><type><separator><value>

When writing the document, each term is added to the posting list.
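
As a rough illustration of this serialization, the sketch below flattens a nested object into `path<separator><type><separator><value>` terms. Everything in it is an assumption made for the example: the `Json` enum stands in for a real JSON library type, the separator is a NUL character, the type codes are `i`, `f`, `s`, nulls are skipped, and array elements simply share the path of their array (array support is listed as optional above).

```rust
/// Minimal JSON value model, a stand-in for a real JSON library type.
enum Json {
    Null,
    Int(i64),
    Float(f64),
    Str(String),
    Array(Vec<Json>),
    Object(Vec<(String, Json)>),
}

/// Hypothetical separator, assumed never to occur in attribute names.
const SEP: char = '\0';

/// Builds one term: `path<separator><type><separator><value>`.
fn encode(path: &str, type_code: char, value: &str) -> String {
    format!("{}{}{}{}{}", path, SEP, type_code, SEP, value)
}

/// Walks the value recursively and appends one term per leaf attribute.
fn flatten(path: &str, value: &Json, terms: &mut Vec<String>) {
    match value {
        Json::Object(entries) => {
            for (key, child) in entries {
                // Nested keys are joined with '.', e.g. `level1.level2.key`.
                let child_path = if path.is_empty() {
                    key.clone()
                } else {
                    format!("{}.{}", path, key)
                };
                flatten(&child_path, child, terms);
            }
        }
        Json::Array(items) => {
            // Assumption: array elements are indexed under the array's own path.
            for item in items {
                flatten(path, item, terms);
            }
        }
        // Assumption: nulls produce no term.
        Json::Null => {}
        Json::Int(i) => terms.push(encode(path, 'i', &i.to_string())),
        Json::Float(f) => terms.push(encode(path, 'f', &f.to_string())),
        Json::Str(s) => terms.push(encode(path, 's', s)),
    }
}
```

Because the type code is part of the term, the integer `3` and the string `"3"` under the same attribute produce different terms, which is what lets mixed types coexist in one field.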

Query

Filtering on a term value is straightforward: it is a plain term lookup.
We can provide the not-equal operation too.
I am still wondering how to support some kind of regex query with the above implementation; it seems possible, but I am not sure yet.
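
A toy model of these query operations, assuming terms encoded as `path<separator><type><separator><value>` with a NUL separator. `TermIndex` is a hypothetical in-memory structure, not tantivy's posting list implementation. The `scan` method shows one possible route towards regex-like queries: walk the sorted terms sharing the `path<separator>s<separator>` prefix and test each value with a predicate.

```rust
use std::collections::{BTreeMap, BTreeSet};

type DocId = u32;

/// Toy inverted index: sorted terms mapped to the documents containing them.
#[derive(Default)]
struct TermIndex {
    postings: BTreeMap<String, BTreeSet<DocId>>,
    all_docs: BTreeSet<DocId>,
}

impl TermIndex {
    /// Adds every term of a document to its posting list.
    fn add(&mut self, doc: DocId, terms: &[&str]) {
        self.all_docs.insert(doc);
        for term in terms {
            self.postings.entry(term.to_string()).or_default().insert(doc);
        }
    }

    /// Equality filter: a plain term lookup.
    fn equal(&self, term: &str) -> BTreeSet<DocId> {
        self.postings.get(term).cloned().unwrap_or_default()
    }

    /// Not-equal filter: every document that does not contain the term.
    /// Note that this also returns documents lacking the attribute entirely.
    fn not_equal(&self, term: &str) -> BTreeSet<DocId> {
        let matching = self.equal(term);
        self.all_docs.difference(&matching).copied().collect()
    }

    /// Scans the terms starting with `prefix` and keeps the documents whose
    /// value part (the text after the prefix) satisfies `accept`.
    fn scan<F: Fn(&str) -> bool>(&self, prefix: &str, accept: F) -> BTreeSet<DocId> {
        self.postings
            .range(prefix.to_string()..)
            .take_while(|(term, _)| term.starts_with(prefix))
            .filter(|(term, _)| accept(&term[prefix.len()..]))
            .flat_map(|(_, docs)| docs.iter().copied())
            .collect()
    }
}
```

For example, `index.scan("host\0s\0", |v| v.starts_with("web"))` returns the documents whose `host` string value starts with `web`; a real regex matcher could be plugged in as the predicate.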

Pros

  • JSON is a standard field type in many databases, and it offers flexibility with a simple implementation
  • while we won't be able to run complex queries on attribute values, such as phrase queries, we will be able to make term queries and maybe a little more

Cons

  • we will need more code to control dynamic field creation and types
  • no score, no position, no tokenization

Implementation 2: adding fields dynamically

This solution automatically detects new fields and adds them to the schema during indexing.
We will then need to detect field types, which can lead to annoying issues that are well known in ES.

We will also need to take care of several things:

  • to avoid mixed-type issues, convert all values to strings, as veloci does. This should be a first step and will need to be customizable later
  • handle different schemas per segment
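
To illustrate the mixed-type problem and the "fall back to strings" workaround, here is a small sketch of a schema that grows while documents are indexed. `Value`, `FieldType`, and `DynamicSchema` are hypothetical names for this example; the rule "any type conflict degrades the field to a string" is an assumption, not how tantivy or Elasticsearch behave.

```rust
use std::collections::HashMap;

/// Leaf values seen in incoming documents (hypothetical model).
enum Value {
    Int(i64),
    Float(f64),
    Str(String),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum FieldType {
    Int,
    Float,
    Str,
}

fn type_of(value: &Value) -> FieldType {
    match value {
        Value::Int(_) => FieldType::Int,
        Value::Float(_) => FieldType::Float,
        Value::Str(_) => FieldType::Str,
    }
}

/// Schema that is extended automatically as new attributes appear.
#[derive(Default)]
struct DynamicSchema {
    fields: HashMap<String, FieldType>,
}

impl DynamicSchema {
    /// Registers a value for `path` and returns the type the field ends up with.
    fn observe(&mut self, path: &str, value: &Value) -> FieldType {
        let seen = type_of(value);
        // The first value seen for a path decides the initial field type.
        let entry = self.fields.entry(path.to_string()).or_insert(seen);
        if *entry != seen {
            // Assumption: on a type conflict, degrade the field to a string
            // field instead of rejecting the document.
            *entry = FieldType::Str;
        }
        *entry
    }
}
```

Once a field has degraded to `Str`, it stays a string field, so earlier segments written with the numeric type would still have to be reconciled, which is where per-segment schemas come in.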

Pros

Cons

  • we will need more code to control dynamic field creation and types
  • as of now, I don't fully understand the consequences of such a choice for tantivy

Comments on the two possible implementations

The two implementations discussed here are in fact very different and correspond to different use cases:

  • json: you just want a field to put your garbage and be able to make simple queries fast, other?
  • dynamic field mapping: you're prototyping something and you're happy to let tantivy choose the mapping for you, other?

My feeling is that dynamic field mapping is a big feature to implement, and we would need to start thinking about enabling schema updates first.

@fulmicoton your input is welcome :)

@PSeitz
Contributor

PSeitz commented May 18, 2021

Let me explain how it's handled in veloci, which is neat in some parts I think, but I don't know how applicable it would be in tantivy.

Before indexing, you can add a configuration, which is completely optional. This configuration can cover single fields, but also a GLOBAL fallback for all fields.

During indexing, all data is collected into its fields, in a HashMap<FieldPath, (FieldConfig, Data)>, where FieldPath is just a string similar to JSONPath.
FieldPath supports 1:n fields (or arrays), which are marked with [], examples would be: book.authors[] and book.author. They can be configured and queried that way.

When a new Field is encountered, it creates default settings for that field (which may come from the GLOBAL section in the config). All data is converted to strings for maximum compatibility. When finishing the index, fields and their metadata are written. There is no difference between a dynamic field and a configured field, all are just regular fields, so there is no special logic besides field detection during indexing.
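
A simplified sketch of the collection step described above, for readers who don't know veloci. It only mirrors the shape `HashMap<FieldPath, (FieldConfig, Data)>`; the `Collector` type, the `tokenize` option, and the use of `Vec<String>` for the data are assumptions for illustration, not veloci's actual types.

```rust
use std::collections::HashMap;

/// Field path such as `book.author` or `book.authors[]` (1:n field).
type FieldPath = String;

/// Per-field settings; `tokenize` is a hypothetical option.
#[derive(Debug, Clone, PartialEq)]
struct FieldConfig {
    tokenize: bool,
}

struct Collector {
    /// GLOBAL fallback used for fields without their own configuration.
    global: FieldConfig,
    /// Optional per-field configuration provided before indexing.
    configured: HashMap<FieldPath, FieldConfig>,
    /// Data collected per field, all values converted to strings.
    fields: HashMap<FieldPath, (FieldConfig, Vec<String>)>,
}

impl Collector {
    /// Adds a value to its field, creating the field on first sight.
    fn add(&mut self, path: &str, value: impl ToString) {
        // A newly encountered field gets its own config, or the GLOBAL one.
        let config = self.configured.get(path).unwrap_or(&self.global).clone();
        let entry = self
            .fields
            .entry(path.to_string())
            .or_insert_with(|| (config, Vec::new()));
        // Dynamic and configured fields are stored in exactly the same way.
        entry.1.push(value.to_string());
    }
}
```

Calling `add("book.authors[]", "Alice")` and then `add("book.authors[]", "Bob")` collects both values under the same 1:n field, while `add("book.year", 1965)` stores the number as the string `"1965"`.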

@fmassot
Contributor Author

fmassot commented May 18, 2021

I get it, you have implemented what Elasticsearch calls "dynamic field mapping". You can have more control of data types by defining some mapping templates.

We definitely have to consider this solution and compare the pros and cons.

@PSeitz PSeitz linked a pull request Mar 16, 2022 that will close this issue
@fmassot
Contributor Author

fmassot commented Jan 18, 2023

closing, the JSON field is already there :)
