I am quite keen to work on it, because I have a good use case in mind.
I expect this to be a difficult ticket touching many components.
I have read about Lucene's implementation - it relies on indexing nested child documents as documents in their own right (with their own schemas and fields) with a pointer to the parent document (we can use its parent
Below is my attempt to scope the problem across the tantivy components that will need to be changed.
I would appreciate your feedback and pointers, on 1) time estimate and 2) best way to break it into smaller tickets.
Allow an API like the one below by extending the schema builder to accept other Schemas as arguments (`add_nested_doc` is the proposed new method).
    let mut child_builder = Schema::builder();
    child_builder.add_text_field("child_title", TEXT | STORED);
    let child = child_builder.build();
    let mut root_builder = Schema::builder();
    root_builder.add_nested_doc(child);
    let full_doc = root_builder.build();
Prevent users from creating nested documents with the same field names as parent documents.
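One way the name-collision check could look is sketched below. It works on plain field-name slices rather than on tantivy's `Schema` type, and `colliding_field_names` is a hypothetical helper name, not an existing tantivy function.

```rust
use std::collections::HashSet;

/// Returns the field names that appear in both the parent schema and the
/// nested schema, sorted and de-duplicated so error messages are stable.
/// Hypothetical helper: operates on plain names, not on tantivy's `Schema`.
fn colliding_field_names(parent_fields: &[&str], child_fields: &[&str]) -> Vec<String> {
    let parent: HashSet<&str> = parent_fields.iter().copied().collect();
    let mut clashes: Vec<String> = child_fields
        .iter()
        .filter(|name| parent.contains(*name))
        .map(|name| name.to_string())
        .collect();
    clashes.sort();
    clashes.dedup();
    clashes
}
```

A schema builder could call a check like this when a nested schema is added and return an error if the result is non-empty.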
Users won't see this but we will transform Schemas of nested documents into a struct with parent `DocID` to make it easy to connect with its parent.
Will change the on-disk and in-RAM format of the index. This will require a major version bump and an announcement to current users. Hopefully, the cost of upgrading will be outweighed by the benefit of indexing nested structures.
Will need quite a bit of help with this. Might mess up the bitpacking magic, so please advise.
The Document struct is a vector of `FieldValue`s, so it should be possible to extend the struct with an
passing nested structures into
One option is to create a new Segment as soon as nested documents are indexed and redirect all future writes to that Segment. Will need a rework of the
If we expose
Will need a new NestedDocumentQuery type that wraps other query types, if the schema has nested documents.
Needs to support all query types for fields inside arbitrarily-deep nested documents.
v1 would support a bottom-up approach only. The parent document would be scored from its child documents, using either a simple sum of the child scores or heuristic-driven weights.
Give users the ability to customise scoring using nested documents.
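To make the bottom-up scoring concrete, here is a minimal sketch of the two options mentioned above (plain sum and weighted sum). `ScoreMode` and `parent_score` are hypothetical names for illustration; the positional weights stand in for whatever heuristic ends up being used.

```rust
/// How child scores are folded into the parent's score (hypothetical type).
enum ScoreMode {
    /// Parent score is the plain sum of its children's scores.
    Sum,
    /// One weight per child, applied by position; children without a
    /// matching weight are counted with weight 1.0.
    Weighted(Vec<f32>),
}

/// Computes a parent's score from the scores of its matching children.
fn parent_score(child_scores: &[f32], mode: &ScoreMode) -> f32 {
    match mode {
        ScoreMode::Sum => child_scores.iter().sum(),
        ScoreMode::Weighted(weights) => child_scores
            .iter()
            .enumerate()
            .map(|(i, score)| score * weights.get(i).copied().unwrap_or(1.0))
            .sum(),
    }
}
```

Exposing the mode as an enum (or later as a user-supplied closure) would be one way to let users customise scoring without touching the query types themselves.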
I think the Lucene solution is a good idea. I would use a normal inverted list for the moment, and store parent and child docs in the same segment as follows
So the parent doc comes after the child docs.
This will have the benefit of being able to reuse the logic of
Very few modifications are required in tantivy internals.
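As a rough illustration of the "children first, parent last" layout, the sketch below shows how a parent can be recovered from a child doc id when the segment keeps a sorted list of parent doc ids. This is an assumption about how the lookup could work, not tantivy code; `DocId` is aliased to `u32` here and `parent_of` / `children_of` are hypothetical helpers.

```rust
type DocId = u32;

/// Finds the parent of `doc`: the first parent id that is >= `doc`.
/// `parent_ids` must be sorted ascending. A parent maps to itself.
fn parent_of(parent_ids: &[DocId], doc: DocId) -> Option<DocId> {
    let idx = parent_ids.partition_point(|&p| p < doc);
    parent_ids.get(idx).copied()
}

/// Returns the doc id range of the children of `parent`: everything after
/// the previous parent and before `parent` itself. Returns `None` if
/// `parent` is not in the sorted parent list.
fn children_of(parent_ids: &[DocId], parent: DocId) -> Option<std::ops::Range<DocId>> {
    let idx = parent_ids.binary_search(&parent).ok()?;
    let start = if idx == 0 { 0 } else { parent_ids[idx - 1] + 1 };
    Some(start..parent)
}
```

For a segment laid out as child 0, child 1, parent 2, child 3, child 4, child 5, parent 6, the parent list is `[2, 6]`, so doc 4 resolves to parent 6 and parent 6 owns docs `3..6`.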
In Lucene child documents aren’t really supported, but instead Solr and ElasticSearch wrap them. Solr breaks a nested document down into multiple documents and indexes them bottom up. So the deepest child document would be indexed before its parent, linking them using a field called `_root_` that stores the parent's ID.
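The bottom-up flattening described here can be sketched as a post-order traversal. The types below are hypothetical and only model document ids; the `root` field stands in for Solr's `_root_` field, and this sketch assumes it holds the id of the top-level document of the block.

```rust
/// A nested document, reduced to its id and children (hypothetical type).
struct NestedDoc {
    id: String,
    children: Vec<NestedDoc>,
}

/// One flat document as it would be written to the index.
struct FlatDoc {
    id: String,
    /// Stands in for the `_root_` field: assumed to be the top-level doc's id.
    root: String,
}

/// Flattens a nested document so that every child precedes its parent.
fn flatten(doc: &NestedDoc) -> Vec<FlatDoc> {
    let mut out = Vec::new();
    flatten_into(doc, &doc.id, &mut out);
    out
}

fn flatten_into(doc: &NestedDoc, root_id: &str, out: &mut Vec<FlatDoc>) {
    // Post-order: emit all descendants first, then the document itself.
    for child in &doc.children {
        flatten_into(child, root_id, out);
    }
    out.push(FlatDoc {
        id: doc.id.clone(),
        root: root_id.to_string(),
    });
}
```

The output order matches the block layout discussed earlier in the thread: the deepest children come first and the top-level document closes the block.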