
Extend DocumentDeserialize with a stateful variant DocumentDeserializeSeed #2362

Open · wants to merge 1 commit into main

Conversation

@adamreichold (Contributor):

This is modelled on Serde's DeserializeSeed, including how the relevant API entry points gain a _seed variant. It can be used, for example, to obtain runtime field ID values when deserializing a struct field by field without relying on the order of the fields as written to/read from the document store.
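
For reference, a minimal sketch of the proposed trait, reconstructed from the examples later in this thread rather than copied from the commit, so the exact signatures are an assumption:

pub trait DocumentDeserializeSeed: Sized {
    /// The type produced by deserializing with this seed.
    type Value;

    /// Like `DocumentDeserialize::deserialize`, but consumes `self`,
    /// making runtime state available during deserialization.
    fn deserialize<'de, D>(self, deserializer: D) -> Result<Self::Value, DeserializeError>
    where
        D: DocumentDeserializer<'de>;
}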

@ChillFish8 (Contributor):

I had a read through the discussion in #2359, but I am not entirely sure how this solves the issue of fields being deserialized in a non-guaranteed order without still needing to go through some intermediate value first (e.g. a BTreeMap or similar). Do you have a small example to demo the utility?

@adamreichold (Contributor, Author):

Consider a struct like

struct Foobar {
   foo: String,
   bar: f64,
}

which we want to deserialize field by field by implementing DocumentDeserialize. Doing that directly fails because

impl DocumentDeserialize for Foobar {
    fn deserialize<'de, D>(mut deserializer: D) -> Result<Self, DeserializeError>
    where
        D: DocumentDeserializer<'de>,
    {
        // Assumes the first field returned by the deserializer is `foo`...
        let (_, foo) = deserializer
            .next_field()?
            .ok_or_else(|| DeserializeError::custom("missing foo"))?;

        // ...and the second one is `bar`, but no such order is guaranteed.
        let (_, bar) = deserializer
            .next_field()?
            .ok_or_else(|| DeserializeError::custom("missing bar"))?;

        Ok(Foobar { foo, bar })
    }
}

relies on the unspecified order of the deserialized fields.

However, if we have the runtime field IDs available from the schema, stored for example as

struct FoobarFields {
    foo: Field,
    bar: Field,
}

we can implement DocumentDeserializeSeed for this type to get a robust implementation, e.g.

impl DocumentDeserializeSeed for FoobarFields {
    type Value = Foobar;

    fn deserialize<'de, D>(self, mut deserializer: D) -> Result<Self::Value, DeserializeError>
    where
        D: DocumentDeserializer<'de>,
    {
        let mut foo = None;
        let mut bar = None;

        // Match fields by their runtime ID instead of their position.
        while let Some((field, value)) = deserializer.next_field()? {
            if field == self.foo {
                match value {
                    OwnedValue::Str(value) => foo = Some(value),
                    _ => return Err(DeserializeError::custom("unexpected type for foo")),
                }
            } else if field == self.bar {
                match value {
                    OwnedValue::F64(value) => bar = Some(value),
                    _ => return Err(DeserializeError::custom("unexpected type for bar")),
                }
            } else {
                return Err(DeserializeError::custom("unexpected field"));
            }
        }

        let foo = foo.ok_or_else(|| DeserializeError::custom("missing foo"))?;
        let bar = bar.ok_or_else(|| DeserializeError::custom("missing bar"))?;

        Ok(Foobar { foo, bar })
    }
}
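
For completeness, usage might then look like the following sketch. The doc_seed entry point and its exact signature are hypothetical here; the PR only says that the relevant API entry points gain a _seed variant:

// Build the seed once from the schema; `get_field` returns an error for
// unknown field names in recent Tantivy versions.
let fields = FoobarFields {
    foo: schema.get_field("foo")?,
    bar: schema.get_field("bar")?,
};

// Hypothetical `_seed` variant of `Searcher::doc`, consuming the seed
// to produce the final value.
let foobar: Foobar = searcher.doc_seed(doc_address, fields)?;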

Of course, the positional guarantee would still be nicer because I could actually expect a certain type instead of matching on OwnedValue. But even this could be mitigated if DocumentDeserializer gained a method to peek at the next Field without committing to read its value.
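
Such a peek method (hypothetical, not part of this PR) might look like:

// Hypothetical addition to DocumentDeserializer: return the ID of the
// next field without consuming its value, so the caller can pick a
// concrete type for the subsequent `next_field` call.
fn peek_field(&mut self) -> Result<Option<Field>, DeserializeError>;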

@ChillFish8 (Contributor):

Right, that makes sense, thanks. I agree a guaranteed order would be nicer, although maybe it is also worth looking into adding custom serialization formats, since that could effectively kill two birds with one stone.

@adamreichold (Contributor, Author) commented Apr 17, 2024:

> although maybe it is also worth looking into adding custom serialization formats, since that could effectively kill two birds with one stone.

I am not sure, because I really want the document store to take full advantage of all the byte-and-bit-packing tricks Tantivy has to offer. If I want a custom serialization format, I can already put a Vec<u8> into the document store and deserialize it as I see fit. (For example, in our application we have exactly three stored fields: two identifiers and a BLOB which contains a complex bincode-based format.)
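
As an illustration of that pattern, a minimal sketch, assuming bincode 1.x and a stored bytes field; the Payload type is made up for the example:

use serde::{Deserialize, Serialize};
use tantivy::schema::{Schema, STORED};
use tantivy::TantivyDocument;

#[derive(Serialize, Deserialize)]
struct Payload {
    values: Vec<f64>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut builder = Schema::builder();
    let blob = builder.add_bytes_field("blob", STORED);
    let _schema = builder.build();

    // Serialize with any format you like; Tantivy stores the raw bytes
    // as an opaque blob and compresses them with the rest of the block.
    let bytes = bincode::serialize(&Payload { values: vec![1.0, 2.0] })?;
    let mut doc = TantivyDocument::default();
    doc.add_bytes(blob, bytes);

    Ok(())
}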

@ChillFish8 (Contributor):

> I am not sure, because I really want the document store to take full advantage of all the byte-and-bit-packing tricks Tantivy has to offer. If I want a custom serialization format, I can already put a Vec<u8> into the document store and deserialize it as I see fit. (For example, in our application we have exactly three stored fields: two identifiers and a BLOB which contains a complex bincode-based format.)

Perhaps, yes, although I think it may be application-specific whether being able to get a borrowed buffer helps. I'm thinking of something like bitcode, which does some great specialization for floating-point values in order to improve compression ratios and performance.

@adamreichold (Contributor, Author) commented Apr 17, 2024:

> although I think it may be application-specific whether being able to get a borrowed buffer helps

I think borrowing byte slices conflicts with document store compression?

> I'm thinking of something like bitcode, which does some great specialization for floating-point values in order to improve compression ratios and performance.

But you should be able to put bytes serialized using bitcode instead of bincode into the document store, can you not? (We are also currently evaluating whether to replace bincode by bitcode, but our Tantivy schema does not care.)

@ChillFish8 (Contributor):

> But you should be able to put bytes serialized using bitcode instead of bincode into the document store, can you not? (We are also currently evaluating whether to replace bincode by bitcode, but our Tantivy schema does not care.)

Absolutely, yes.

> I think borrowing byte slices conflicts with document store compression?

I am not sure. Decompression will create a new owned buffer, but the last time I went through the code, I believe it should be possible to allow borrowing from that decompressed buffer. Saying that, though, I'm not sure whether that leads to any gains or losses in performance. It is probably quite niche.

@adamreichold (Contributor, Author):

I would like to ping this, as making the default document type more opinionated increases the need for full flexibility when writing custom deserializers.

I think the main design question here is whether to add this as an additional API or to wholesale replace the deserialization traits with their stateful variants.

@PSeitz (Contributor) commented May 27, 2024:

> I would like to ping this, as making the default document type more opinionated increases the need for full flexibility when writing custom deserializers.
>
> I think the main design question here is whether to add this as an additional API or to wholesale replace the deserialization traits with their stateful variants.

I think the question is whether the added complexity justifies the potential performance gain. The deserialization step is usually dwarfed by other parts of the search execution. Maybe you can share some numbers or your use case.

@adamreichold (Contributor, Author):

> I think the question is whether the added complexity justifies the potential performance gain.

I do not want to frame this as a performance question. Without the state, some kinds of documents just cannot be deserialized from the stored representation.

One would always have to build an intermediate representation from which to later extract these documents. And if the limits chosen for CompactDoc are insufficient, that means having both a custom intermediate representation and a custom final representation.

So from my point of view, this is mainly a question of reducing complexity in downstream code.

As proposed, the API does not become any more complex for users who do not want the state, and there should be no runtime overhead at all.
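
To see why stateless users pay nothing, note that, as in Serde, the stateless trait can be bridged to the seed trait by a trivial zero-sized seed. A sketch, assuming the PR wires this up analogously to Serde's impl of DeserializeSeed for PhantomData<T>:

use std::marker::PhantomData;

// A zero-sized seed that carries no state: deserializing through it just
// forwards to the stateless DocumentDeserialize impl, which the compiler
// can inline down to the exact same code as before.
impl<T: DocumentDeserialize> DocumentDeserializeSeed for PhantomData<T> {
    type Value = T;

    fn deserialize<'de, D>(self, deserializer: D) -> Result<T, DeserializeError>
    where
        D: DocumentDeserializer<'de>,
    {
        T::deserialize(deserializer)
    }
}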

I admit there is additional implementation complexity, but I think it was relatively obvious that this would become a Serde reimplementation when DocumentDeserialize was first introduced.

Finally, note that when Tantivy's document store becomes a primary store, search is not the only use case. For example, we have endpoints which just stream documents and therefore do nothing but deserialize and serialize documents. (Without having done a CPU profile, I suspect that decompression would dominate in that case, at least if the workload does not fit into the block cache.)

@adamreichold (Contributor, Author):

One relevant alternative I see, at least for the case of Rust structures with statically known fields, is driving the deserialization by field IDs and names. This would allow matching fields without having access to the runtime field IDs, using only the statically known names. I fear this would make the general case more complicated, though, and would also likely require looking that information up during deserialization even when it isn't needed.

@PSeitz (Contributor) commented May 27, 2024:

If you want to reduce complexity, you can convert your document from the JSON value instead of implementing the deserialize trait. The whole document-as-trait machinery is about indexing performance, not ergonomics.
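
For example, with the Foobar struct from earlier in this thread, that conversion might look like the following sketch; it assumes the to_json rendering in which each field name maps to an array of its values:

use tantivy::schema::Schema;
use tantivy::TantivyDocument;

fn doc_to_foobar(doc: &TantivyDocument, schema: &Schema) -> Option<Foobar> {
    // Render the stored document as JSON keyed by field name; each field
    // maps to an array of values, so pick the first value of each.
    let json: serde_json::Value = serde_json::from_str(&doc.to_json(schema)).ok()?;
    Some(Foobar {
        foo: json["foo"][0].as_str()?.to_owned(),
        bar: json["bar"][0].as_f64()?,
    })
}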

If someone has more than u16::MAX fields, I would be interested in their use case. I think it's quite unlikely (and unnecessary with JSON fields now). The limit could be removed easily.

@adamreichold (Contributor, Author):

I really don't understand the abrasiveness.

This is a relatively simple change that closely follows Serde's precedent and would enable a straightforward conversion from and to Rust structures as documents, which is impossible to do otherwise. Personally, I don't think serde_json::Value is a reasonable approach to that use case, especially compared to the cost of supporting it directly.

And yes, my use case does fit into CompactDoc. But I just don't think we should force downstream projects into unnecessary API detours if the solution is a +92/-12 diff with no stated technical defects.
