Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Nested documents #617

Open
petr-tik opened this issue Aug 7, 2019 · 8 comments
Open

Add support for Nested documents #617

petr-tik opened this issue Aug 7, 2019 · 8 comments

Comments

@petr-tik
Copy link
Contributor

petr-tik commented Aug 7, 2019

Nested documents have been mentioned before in #464 and #391, where @fulmicoton said they are hard to implement.

I am quite keen to work on it, because I have a good use case in mind.

I expect this to be a difficult ticket touching many components.

I have read about Lucene's implementation - it relies on indexing nested child documents as documents in their own right (with own schemas and fields) with a pointer to the parent document (we can use its parent DocId).

https://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene

Below is my attempt to scope the problem across the tantivy components that will need to be changed.

I would appreciate your feedback and pointers, on 1) time estimate and 2) best way to break it into smaller tickets.

Schema

Allow an API like this, extend the schema builder to accept other Schemas as arguments.

let child_builder = Schema::builder();
child_builder.add_text_field("child_title", TEXT | STORED);
let child = child_builder.build();
let root_builder = Schema::builder();
root_builder.add_nested_doc(child);
let full_doc = root_builder.build();

Prevent users from creating nested documents with the same field names as parent documents.

Users won't see this but we will transform Schemas of nested documents into a struct with parent `DocID` to make it easy to connect with its parent.

Storage/serialisation

Will change the on-disk and in-RAM format of the index. Will require a major version change + announcing to current users. Hopefully, the cost of updating will be compensated by the benefit of indexing nested strucutres.

Will need quite a bit of help with this. Might mess up the bitpacking magic, so please advise.

Document

The Document struct is a vector of `FieldValue`s, so it should be possible to extend the struct with an vec<FieldValue>s.

IndexWriter API

passing nested structures into add_document should accept nested documents For efficient reads, all nested documents will need to live in one Segment.

One of the options is to create a new Segment as soon as a nested documents are indexed and redirect all future writes to the same Segment. Will need a rework of the MergePolicy to avoid merging such "reserved" Segments.

If we expose DocIds to users then we should also add a method like
add_nested_document(DocId, new_nested_doc).

Query

Type

Will need a new NestedDocumentQuery type that wraps other query types, if the schema has nested documents.

Parser

Needs to support all query types for fields inside arbitrarily-deep nested documents.

Scorer

v1 would support bottom-up only approach. The parent document would be scored using the children documents - either a simple sum of children document scores or a heuristic-driven weights.

Further development

Give users the ability to customise scoring using nested documents.

@fulmicoton
Copy link
Collaborator

fulmicoton commented Aug 8, 2019

I think Lucene solution is a good idea. I would use a normal inverted list for the moment, and store parent and child docs in the same segment as follows

- child 1
- child 2
- ...
- child N
- parent doc

So the parent doc comes after the child docs.

This will have the benefit of being able to reuse the logic of skip_to code.
The main reason is that it should be much easier, give us slightly lesser performance for very low number of childs per parents and much better perf for very high number of childs per parents.

There is very little modifications required in tantivy internals.
The main difficulty is to make an API that is easy to understand and to maintain.

@petr-tik
Copy link
Contributor Author

petr-tik commented Aug 8, 2019

Hey, I've been thinking about it some more.

What if the children documents exceed the RAM budget for one segment? Will the merge policy have to change or not?

@fulmicoton
Copy link
Collaborator

fulmicoton commented Aug 8, 2019

@petr-tik I would not worry, and this has nothing to do with merge policies.

@barrotsteindev
Copy link
Contributor

In Lucerne child documents aren’t really supported, but instead Solr and ElasticSearch wrap them. Solr breaks down a nested document into multiple documents and indexes them, bottom up. So the deepest child document would be indexed before its parent linking them using a field called root storing the parents ID.
ElasticSearch supports this method or another in which the documents are stored in the same shard and are queried for using the field which specifies their parents ID.

@sabine
Copy link

sabine commented Apr 7, 2021

I have an app where, in Elasticsearch, I have a mapping with nested documents.

So, for some table A I have a mapping

{
  'properties': {
    [..]
    'B': {
      type': 'nested',
      'properties': {
        'x': {'type': 'integer'},
        'y': {'type': 'integer'},
        'z': {'type': 'integer'},
      }
   }
}

In my database, there is a one-to-many relation from A to B, and all of x, y, and z are actually arrays. I naively translated this to Elasticsearch nested values and actually got search to work as I expected, specifically, the ability to facet on the individual fields of B.

However, my backend is already in Rust and so it would be cool to be using a Rust search engine (which I believe to be more reliable and less resource-hungry).

The queries need to return A records (ids that I can look up in the database would actually be enough) and are a conjunction of
(a) one or more filters on fields ofA
(b) one or more filters of the shape "there exists a B for which some predicate P(B) holds"

There are two different ways the app needs to facet on x, y, and z:
(1) facet on x, y, z of those Bs that fulfill a specific P(B), where B belongs to on an A that fulfills (a) and (b)
(2) facet on x, y, z of all Bs that belong to an A that fulfills (a) and (b).

Is there a reasonable chance, tantivy will do nested faceting some time in the next few years?

@fulmicoton
Copy link
Collaborator

Thank you for stating your use case, so clearly!

The part that is the most difficult is probably:
(b) one or more filters of the shape "there exists a B for which some predicate P(B) holds"

I believe the way it is done in elastic is by allocating a contiguous range of docids for a parent and its children.
it is a very interested feature but it is one that involves a lot of work and implies a lot of technical debt... So maybe one day, but not now.

@at15
Copy link

at15 commented Dec 25, 2022

Actually in Lucene there are two types of joins #617 (comment) is index time join where children are saved before parent in same index file (segment) and you get parent id from child id without saving parent id in child document (ES call this nested). Another is query time join where you save ids (just like relational databases) and use the id to join, the parent and child are in same shard but not in same lucene segment file.

Ref

@fulmicoton
Copy link
Collaborator

Thanks for the information @at15!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants