Skip to content
This repository has been archived by the owner on Aug 15, 2021. It is now read-only.

Implement semantic tagging of values #157

Closed
1 of 15 tasks
pyfisch opened this issue Oct 6, 2019 · 12 comments
Closed
1 of 15 tasks

Implement semantic tagging of values #157

pyfisch opened this issue Oct 6, 2019 · 12 comments

Comments

@pyfisch
Copy link
Owner

pyfisch commented Oct 6, 2019

Support for CBOR tagged values is the most requested feature in serde_cbor (see #3, #56, #129, #151). Tags allow users to "give [a data item] additional semantics while retaining its structure". They enable the definition of common data types and they are needed to correctly implement some specifications that use CBOR (see for example COSE).

There is a hack (with a few variations) commonly suggested to implement tags: Create a struct with a special name and two fields, one for the tag and another for the value. Now use this type to represent tagged values and give it special treatment in the Serializer and Deserializer` to convert tags.

This has two downsides:

  1. If the name of the special struct occurs (or is introduced by an attacker) in a CBOR document the deserialization breaks.
  2. Serde decouples data formats and data types. The hack forces a coupling between the two.
  • If a CBOR deserializer emits a tag but the data type does not expect one the deserialization fails (or produces wrong results).
  • If a data type emits its tag to a serializer other than the CBOR serializer an artifact from the tag is found in the output.

Design criteria

  • The changes made in serde should be small and must be backwards-compatible (that is: not require changes to existing data formats).
  • Do not rely on features not yet available in stable rust (such as specialization).
  • Tagged values must be ignored by data formats and data types alike if they are not supported.
  • The solution should not be limited to CBOR but work for other data formats that use tags as well.
  • Users should be able to:
    • Serialize/Deserialize loosely typed serde_cbor::Values with tags.
    • Serialize custom types with the correct CBOR tag.
    • Deserialize CBOR documents according to the contained tags.

Tasks

  1. Extend the serde::ser::Serializer trait to allow the serialization of tagged values. (branch)
  2. Extend serde::de::Visitor to visit a tagged value and make other necessary changes for deserialization.
    • Get the deserialization changes reviewed and merged into serde.
  3. Implement tagged serialization in serde_cbor. (exprimental branch)
  4. Implement tagged deserialization in serde_cbor.
  5. Add Tagged variant to serde_cbor::Value
    • Implement serialization.
    • Implement deserializaton.
  6. (optional) Create procedural macros to simplify the serialization of tagged values.
  7. (optional) Always serialize common types from third-party crates with the correct tag.
    • url::Url with tag 32
    • standard date/time string with tag 0 (chrono crate?)
    • epoch-based date/time with tag 1 (chrono crate?)
    • bignums?
    • to be extended
@pyfisch pyfisch pinned this issue Oct 6, 2019
vmx added a commit to vmx/serde_ipld_dagcbor that referenced this issue Oct 22, 2019
vmx added a commit to vmx/serde_ipld_dagcbor that referenced this issue Oct 22, 2019
@vmx
Copy link

vmx commented Oct 22, 2019

Here's my version of the deserializer:

I'm not happy with the result. One missing thing is to deserialize the value (without the tag) by default. I just couldn't get it working. I always got trait bounds errors. I hope someone can help with that.

I also don't really like the Tagger. I tried several different ways. Traits with generics, traits with associated types, using a deserializer for the tag, implementing a TagAccess trait similar to the SeqAccess trait. I could none of those working. If anyone has a good idea, please let me know. Meanwhile I changed the Tagger to return an Any-type instead of having methods for each type to access it. I like it better, but I'm also open to change that to some other approach.

vmx added a commit to vmx/serde_ipld_dagcbor that referenced this issue Oct 22, 2019
vmx added a commit to vmx/serde_ipld_dagcbor that referenced this issue Oct 22, 2019
@pyfisch
Copy link
Owner Author

pyfisch commented Oct 26, 2019

@vmx Thanks for trying. Unfortunately I still have not received any feedback on the serde PR. I fear it might stall and go nowhere. Therefore I will only work on the deserialization part once it has received some positive comments by the serde maintainers. ☹️

@pyfisch
Copy link
Owner Author

pyfisch commented Nov 10, 2019

@dignifiedquire, @vmx, @rklaehn (@Actyx), @mcamou, @bantonsson, @Alexander89 and all others interested in support for CBOR tagged values I would appreciate your input which of the options outlined below you would prefer and if you have any other suggestions.

Two weeks have passed since serde-rs/serde#1643 was closed. I think it is unlikely that serde will incorporate support for tags any time soon or even at all.

Now there are different ways forward:

  1. Do nothing and release serde_cbor v1.0 as the crate is otherwise feature complete and rather well tested.
  2. Use the "struct hack" to support tags. Basically add a new struct to this crate that represents a value with the associated tag. Deserializer and Serializer recognize the special field names and correctly handle tagged values. See Add tags support behind feature flag #151 and feat: implement simple tagging #129.
    I've repeatededly rejected the "struct hack" in the past because you lose the interoperability between different serde data formats and I believed an extension to serde would be possible and preferable.
  3. Write a new Rust CBOR crate that does not depend on serde but instead supports the full range of CBOR features as suggested by dtolnay. In this case serde_cbor could be a thin layer depending on the serialization of the aforementioned crate and provide integration with serde.
  4. Fork serde and create a compatible version that includes the necessary infrastructure to support tagged values. By default serde_cbor would depend on the normal serde crate. But users who want to use tags can replace the serde dependency with the extended version and enable a "tags" feature in the serde_cbor crate and now use tags in all of their code while still reaping the benefits of being part of the serde ecosystem.
    I lean towards this option but I am unsure whether users would accept to patch their dependencies and probably I would need to be the person to regularly update the fork to follow normal serde.

@vmx
Copy link

vmx commented Nov 11, 2019

3. Write a new Rust CBOR crate that does not depend on serde but instead supports the full range of CBOR features as suggested by dtolnay.

I think that is generally spoken a good idea. The YAML and MessagePack implementations for Serde both use a general parsing library and have a Serde layer on top. If someone needs tags and doesn't necessarily need Serde, you could still use a the same well tested, established code without needing to look into alternatives.

4. Fork serde and create a compatible version that includes the necessary infrastructure to support tagged values. By default serde_cbor would depend on the normal serde crate. But users who want to use tags can replace the serde dependency with the extended version and enable a "tags" feature in the serde_cbor crate and now use tags in all of their code while still reaping the benefits of being part of the serde ecosystem.

Does [patch] also work for published crates?

  1. Find a way to extend Serde without forking it. I had hopped that this is possible and tried it, bit wasn't able to get it working. If that's something that is not possible in Rust today, it also is no viable short term solution to the problem.

I think opt-on for tags (with or without fork) is the way to go. This way hopefully other library authors that need tags will support that version and it will show how useful that feature is. This will at least keep the hope up for me that perhaps Serde will eventually support such a feature in the distant future.

@rklaehn
Copy link
Contributor

rklaehn commented Nov 16, 2019

Please don't hit me for the suggestion, but essentially the problem is that there is a missing communication channel between the format and the serializer to allow tags to work. Would it help to just put the damn tag into a threadlocal so that a special visitor can get the tag when it needs to know how to interpret the data? As a stopgap until we can convince dtolnay to help formats that don't exactly fit the serde data model a tiny little bit...

Not 100% sure what the consequences are, but it does seem like thread locals are pretty cheap in rust when using optimize: https://godbolt.org/z/MZDnX4

So if users are made aware of the downside (reserves an additional 16 bytes for every thread, or every thread that has ever been used for CBOR stuff, not sure how exactly it works), then this might work.

@rklaehn
Copy link
Contributor

rklaehn commented Nov 16, 2019

OK, I just could not resist trying this out (using a thread-local to get the info over the fence): rklaehn#2

@rklaehn
Copy link
Contributor

rklaehn commented Nov 18, 2019

Not sure what the implications are of option 4, forking serde. Serde is as close to a standard library as it gets, and lots of crates do offer serde support behind a feature flag. That would all no longer work with a forked serde, right (not entirely sure how that works, as I am rather new to rust)?

@vmx
Copy link

vmx commented Nov 20, 2019

I just want to mention that I've put more thought into the use-case I have. I'll probably go with something that doesn't need Serde support. So I'd be happy to see a separation between the CBOR processing and Serde (like e.g. serde_yaml or msgpack-rust are doing it) and having Tag support in the non-Serde part. I'd also happy to contribute some time helping with such a refactoring.

@pyfisch
Copy link
Owner Author

pyfisch commented Dec 7, 2019

@vmx I played around with deserializing CBOR without serde: https://github.com/pyfisch/minicbor

It takes a &[u8] and allows you to iterate over the different tokens. Everything else like CBOR values or serializaiton is missing.

One pain-point if something like this should serve as a backend to serde-cbor is that it needs to work with both Read/Write streams and &[u8]/Vec.

@vmx
Copy link

vmx commented Dec 16, 2019

@pyfisch Wow, nice! That looks really promising. I sadly have zero time atm to dig deeper or thinking about how serialization would look like. Hopefully I'll have next year.

@pyfisch
Copy link
Owner Author

pyfisch commented Dec 21, 2019

how serialization would look like

Serialization is the "boring" part because you can just output some bytes.

Getting deserialization right is more difficult. The majority of bugs in this crate were in the deserialization part. Introducing a token iterator is probably makes deserialization much easier than the approach used in serde-cbor where we go directly from bytes to serde structures.

@pyfisch
Copy link
Owner Author

pyfisch commented Jan 10, 2020

Now in v0.11!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants