Add new field type for semi-structured data indexing and efficient querying #1050
Comments
Let me explain how it's handled in veloci, which is neat in some parts I think, but I don't know how applicable it would be in tantivy. Before indexing, you can add a configuration, which is completely optional. This configuration can cover single fields, but also a During indexing all data is collected to their fields, in a When a new Field is encountered, it creates default settings for that field (which may come from the
I get it, you have implemented what Elasticsearch calls "dynamic field mapping". You can have more control of data types by defining some mapping templates. We definitely have to consider this solution and compare the pros and cons.
closing, the JSON field is already there :)
Make semi-structured data a first-class citizen
In many use cases, users/engineers don't know the data schema in advance but want to index it and query it fast. This is very common for logs but also in the analytics realm.
The aim of this feature is to make what we will call semi-structured data from now on a first-class citizen in tantivy.
Definition of semi-structured data
Let's define what semi-structured data is by these two key properties:
Some databases are already handling it
Many databases/search engines offer different ways to handle semi-structured data:
Json query language
One common way is to use "json path" (for example, it was introduced in PostgreSQL 12: https://www.postgresql.org/docs/12/functions-json.html#FUNCTIONS-SQLJSON-PATH).
It's clearly overkill for what we want to do but we can get some inspiration from these examples:
Feature proposal
Optionally:
As of now, we lack user feedback, and it would definitely be worth defining some use cases for such a field.
To move fast, we can propose a very simple and straightforward implementation.
Implementation 1: add a new field type
Field type name
Let's call it JsonField, flattened field.
Add a new field type based on the bytes field
A straightforward way to implement this is to use a binary field and encode attribute names and values in it, one after the other.
Serialization
For each (nested) attribute, append a term:
level1.level2.key<separator><type><separator><value>
When writing the document, each term is added to the posting list.
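The serialization above can be sketched as a recursive flattening step. This is only an illustration under assumptions, not tantivy code: the `Value` enum, the `flatten` function, the NUL separator `SEP` and the one-letter type codes (`s`, `n`, `b`) are all hypothetical names chosen for the example.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a parsed JSON value (arrays and null omitted for brevity).
enum Value {
    Str(String),
    Num(f64),
    Bool(bool),
    Object(BTreeMap<String, Value>),
}

/// Assumed separator: a character that should not appear in attribute names.
const SEP: char = '\u{0}';

/// Recursively appends one `path<SEP><type><SEP><value>` term per leaf attribute.
fn flatten(prefix: &str, value: &Value, terms: &mut Vec<String>) {
    match value {
        Value::Object(map) => {
            for (key, child) in map {
                // Nested keys are joined with dots: level1.level2.key
                let path = if prefix.is_empty() {
                    key.clone()
                } else {
                    format!("{}.{}", prefix, key)
                };
                flatten(&path, child, terms);
            }
        }
        // Assumed type codes: s = string, n = number, b = boolean.
        Value::Str(s) => terms.push(format!("{}{}s{}{}", prefix, SEP, SEP, s)),
        Value::Num(n) => terms.push(format!("{}{}n{}{}", prefix, SEP, SEP, n)),
        Value::Bool(b) => terms.push(format!("{}{}b{}{}", prefix, SEP, SEP, b)),
    }
}
```

Each resulting term would then be indexed like any other term. Note that this sketch does not handle keys that themselves contain a dot, which would make paths ambiguous.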
Query
Filtering on term value is obvious.
We can provide the not equal operation too.
I am still wondering how to support some kind of regex matching with the above implementation; it seems possible, but I am not sure.
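To make the query side concrete, here is a minimal sketch over a toy in-memory inverted index. Everything in it is an assumption for illustration (`ToyIndex`, `make_term`, the NUL separator, the type codes); it is not tantivy's API. Equality is a single term lookup, "not equal" is the complement of that lookup, and a value-prefix match is shown as the simplest restricted form of pattern matching, implemented as a range scan over the sorted terms.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Assumed separator between path, type code and value.
const SEP: char = '\u{0}';

/// Builds a term as `path<SEP><type><SEP><value>` (hypothetical helper).
fn make_term(path: &str, type_code: char, value: &str) -> String {
    format!("{}{}{}{}{}", path, SEP, type_code, SEP, value)
}

/// Toy inverted index: sorted term -> set of document ids.
#[derive(Default)]
struct ToyIndex {
    postings: BTreeMap<String, BTreeSet<u32>>,
    all_docs: BTreeSet<u32>,
}

impl ToyIndex {
    fn add(&mut self, doc: u32, path: &str, type_code: char, value: &str) {
        let term = make_term(path, type_code, value);
        self.postings.entry(term).or_default().insert(doc);
        self.all_docs.insert(doc);
    }

    /// `path = value`: a single term lookup.
    fn equal(&self, path: &str, type_code: char, value: &str) -> BTreeSet<u32> {
        self.postings
            .get(&make_term(path, type_code, value))
            .cloned()
            .unwrap_or_default()
    }

    /// `path != value`: all documents minus the matching ones.
    /// In this sketch, documents without the attribute also match.
    fn not_equal(&self, path: &str, type_code: char, value: &str) -> BTreeSet<u32> {
        let hits = self.equal(path, type_code, value);
        self.all_docs.difference(&hits).copied().collect()
    }

    /// Value-prefix match: scan the sorted terms that start with
    /// `path<SEP><type><SEP><prefix>` and union their posting lists.
    fn value_prefix(&self, path: &str, type_code: char, prefix: &str) -> BTreeSet<u32> {
        let start = make_term(path, type_code, prefix);
        self.postings
            .range(start.clone()..)
            .take_while(|(term, _)| term.starts_with(&start))
            .flat_map(|(_, docs)| docs.iter().copied())
            .collect()
    }
}
```

Because the path and type come first in each term, all values of one attribute are contiguous in the sorted term dictionary, which is what makes the prefix scan cheap. A general regex would need to walk the same term range with an automaton instead of a plain prefix test.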
Pros
Cons
Implementation 2: adding fields dynamically
This solution will automatically detect new fields and add them to the schema during indexing.
We will then need to detect field types, which can lead to the annoying issues well known in Elasticsearch (ES).
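The type-detection problem can be illustrated with a small sketch. It assumes a hypothetical `DynamicSchema` that infers a type the first time a field is seen and then freezes it; the names `FieldType`, `infer_type` and `observe` are invented for this example and do not exist in tantivy.

```rust
use std::collections::HashMap;

/// Hypothetical set of detectable field types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FieldType {
    Text,
    I64,
    F64,
    Bool,
}

/// Guesses a type from a raw textual value (a deliberately naive heuristic).
fn infer_type(raw: &str) -> FieldType {
    if raw == "true" || raw == "false" {
        FieldType::Bool
    } else if raw.parse::<i64>().is_ok() {
        FieldType::I64
    } else if raw.parse::<f64>().is_ok() {
        FieldType::F64
    } else {
        FieldType::Text
    }
}

/// Schema that grows while documents are indexed.
#[derive(Default)]
struct DynamicSchema {
    fields: HashMap<String, FieldType>,
}

impl DynamicSchema {
    /// Registers the field on first sight; later values must match the frozen type.
    fn observe(&mut self, name: &str, raw: &str) -> Result<FieldType, String> {
        let seen = infer_type(raw);
        match self.fields.get(name).copied() {
            None => {
                self.fields.insert(name.to_string(), seen);
                Ok(seen)
            }
            Some(known) if known == seen => Ok(known),
            Some(known) => Err(format!(
                "field `{}` was detected as {:?} but this value looks like {:?}",
                name, known, seen
            )),
        }
    }
}
```

With this sketch, a field whose first value is `200` is frozen as an integer, so a later document carrying `not_found` in the same field is rejected. This is the kind of mapping conflict the first-seen-wins approach tends to produce.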
But, we will need to take care of several things:
Pros
Cons
Comments on the two possible implementations
The two implementations discussed here are in fact very different, they correspond to different use cases:
My feeling is that dynamic field mapping is a big feature to implement, and we would first need to start thinking about enabling schema updates.
@fulmicoton your input is welcomed :)