Feature Request: 3rd party non-JSON serialization/deserialization #79

Open
NowanIlfideme opened this issue Jun 8, 2022 · 11 comments
@NowanIlfideme

Hi, author of pydantic-yaml here. I have no idea about anything Rust-related, unfortunately, but hopefully this feature request will make sense in Python land.

I'm going off this slide in this presentation by @samuelcolvin, specifically:

We could add support for other formats (e.g. yaml, toml); the only side effect would be bigger binaries.

Here's a relevant discussion about "3rd party" deserialization from v1: pydantic/pydantic#3025

It would be great if pydantic-core were built in a way where non-JSON formats could be added "on top" rather than necessarily being built into the core. I understand performance is a big question in this rewrite, so ideally these would be high-level interfaces that can be hacked in Python (or implemented in Rust/etc. for better performance).

From the examples available already, it's possible that such a feature could be quite simple on the pydantic-core side - the 3rd party would create their own function à la validate_json, possibly just calling validate_python. However, care would be needed on how format-specific details are passed between pydantic and the implementation. In V1 this is done with the Config class and the special json_encoder/decoder attributes, which have been a pain to re-implement properly for YAML (without way too much hackery).
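In Python terms, the "parse, then validate_python" pattern being requested could be sketched roughly as below. This is a stdlib-only sketch: `MappingValidator` is a hypothetical stand-in for pydantic-core's `SchemaValidator`, and `json.loads` stands in for a format-specific loader such as `yaml.safe_load`.

```python
import json
from typing import Any, Callable

def make_validate_format(loader: Callable[[str], Any], validator: Any) -> Callable[[str], Any]:
    """Build a validate_<format> function: parse text with a format-specific
    loader, then hand the resulting plain Python objects to
    validator.validate_python."""
    def validate_format(text: str) -> Any:
        data = loader(text)                     # e.g. yaml.safe_load for YAML
        return validator.validate_python(data)  # format-agnostic validation
    return validate_format

# Minimal stand-in with the same validate_python shape as pydantic-core's
# SchemaValidator (the real thing is built from a schema and returns
# validated, coerced data).
class MappingValidator:
    def validate_python(self, data: Any) -> Any:
        if not isinstance(data, dict):
            raise TypeError("expected a mapping")
        return data

validate_json_like = make_validate_format(json.loads, MappingValidator())
print(validate_json_like('{"a": 1}'))  # {'a': 1}
```

The format-specific details (encoders, custom tags, etc.) would live entirely in the `loader`, which is what keeps pydantic-core itself format-agnostic.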

Ideally for V2, this would be something more easily addable and configurable. The alternative would be to just implement TOML, YAML etc. directly in the binary (and I wouldn't have to keep supporting my project, ha!)

Thanks again for Pydantic!

@samuelcolvin
Member

Thanks for the question.

I think this is just calling validate_python, or indeed initializing the pydantic model as I guess you do now.

No changes should be required to pydantic-core to allow this.

I want to add support for line numbers in errors, but that requires a new rust JSON parser, so won't be added until v2.1 or later.

@samuelcolvin
Member

Closing, but feel free to ask if you have more questions.

@samuelcolvin
Member

To be clear, I would love this to be possible, but I don't want to have to add the capability to parse more formats to pydantic-core, so the only way this would be possible would be to achieve runtime linking of pydantic-core and the third party libraries that perform the parsing.

This is the only way I can think of that it might work:

(Note 1: there's probably a better way to do this, I'm not an expert at this stuff)
(Note 2: this might not work)
(Note 3: I'm not even convinced this is a good idea and I don't promise to add this functionality)

With that out of the way, here's a very rough idea:

  1. A new rust crate which basically just exports:
  • JsonInput (new name required)
  • A thin (but opaque) wrapper for JsonInput which makes it available as a python object (no actual access to the dict etc. from JsonInput, just a way to return a pointer to JsonInput back to python land)
  2. pydantic-core uses this new crate to parse JSON, as it does currently
  3. pydantic-core also provides a new way to pass the thin wrapper to SchemaValidator; pydantic-core then extracts the JsonInput and validates it with the same logic it uses now for json data
  4. 3rd party packages (written in rust) use the above pydantic-core-json-input crate, perform the logic of building JsonInputs in rust, then return them to python world to in turn be passed to pydantic-core

With this approach, while we go "via python", we never have to do the hard work to convert the JsonInput to a python object.
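A pure-Python analogue of the opaque-wrapper idea above, to make the data flow concrete. All names here are illustrative, not a real API: in the actual design the handle would hold a pointer to a Rust-side JsonInput, and the unwrapping would happen in Rust without ever materializing Python dicts.

```python
import json
from typing import Any

class JsonInputHandle:
    """Opaque wrapper passed through Python land. Callers hold it but never
    look inside; only the validator unwraps it (standing in for a pointer
    to a Rust-side JsonInput)."""
    def __init__(self, inner: Any) -> None:
        self._inner = inner  # in Rust: the parsed JsonInput tree

def third_party_parse(text: str) -> JsonInputHandle:
    """A 3rd-party parser (YAML, TOML, simd-json, ...) builds the internal
    representation and returns only the opaque handle to Python."""
    return JsonInputHandle(json.loads(text))  # json.loads as a stand-in parser

class SchemaValidatorSketch:
    """Accepts the handle, unwraps it, and reuses the existing JSON
    validation path on the internal representation."""
    def validate_input_handle(self, handle: JsonInputHandle) -> Any:
        return self._validate_json_input(handle._inner)

    def _validate_json_input(self, data: Any) -> Any:
        return data  # the real validation logic would live here

result = SchemaValidatorSketch().validate_input_handle(third_party_parse('{"x": 1}'))
print(result)  # {'x': 1}
```

The point of the indirection is that Python only ever shuttles the handle between the parser and the validator; the expensive internal-to-Python conversion never happens.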

@07pepa

07pepa commented Jul 27, 2022

I am coming here from here

and I want to share my ideas on how runtime plugins should work.

My idea is to do it kind of similar to DLLs:

you would have to have your pydantic-core plug-ins in one pre-agreed location (compiled) (or registered somehow),
so we know where to load them from and we can avoid "DLL hell",
and on startup of pydantic-core you would say: hey, load this and that of that version (no multiple versions allowed for one instance of pydantic-core; globally you could have them, but in one app I think it would create confusion).

During deserialization you would just say which deserializer to use (format is not enough, since there is more than one serializer/deserializer available for one format, like simd-json).

I am also not an expert in this, but I think there should be the following requirements:

  1. more than one serializer allowed, and allow deserializing to one class from multiple formats
  2. more than one serializer per format (chosen by name?)
  3. serializer can be chosen dynamically
  4. (not mandatory) only chosen serializers are loaded (to limit load time of the library)
  5. hard fail if a serializer is missing or incompatible
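Requirements 2, 3, and 5 above can be sketched as a small name-based registry. This is a hypothetical Python-level illustration only; the proposal in this comment would do the equivalent with compiled plug-ins loaded at startup.

```python
import json
from typing import Any, Callable

# Registry mapping a name to a deserializer. Names, not formats, are the key,
# so two implementations of one format can coexist (e.g. "json" vs "simd-json").
_DESERIALIZERS: dict[str, Callable[[str], Any]] = {}

def register_deserializer(name: str, func: Callable[[str], Any]) -> None:
    """Register a deserializer under a unique name (requirement 2)."""
    _DESERIALIZERS[name] = func

def deserialize(name: str, text: str) -> Any:
    """Pick a deserializer dynamically by name (requirement 3); hard fail
    if nothing is registered under that name (requirement 5)."""
    try:
        loader = _DESERIALIZERS[name]
    except KeyError:
        raise LookupError(f"no deserializer registered under {name!r}") from None
    return loader(text)

register_deserializer("json", json.loads)
print(deserialize("json", '{"x": 1}'))  # {'x': 1}
```

Lazy loading (requirement 4) would amount to registering a factory instead of the loader itself and importing the backing library only on first use.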

Why I am suggesting decoupling:

  • you do not have to care about serialization in pydantic-core (and have it be purely focused on validation)
  • a serialization can be upgraded/fixed/changed independently of pydantic-core, as long as the ABI is the same
  • get rid of requests to include some obscure format; just point people to how to write a plugin

If this is too hard, I would suggest doing it at compile time instead, but that may be too complicated...

@samuelcolvin
Member

This sounds very hard to do in a reliable cross-platform way, given the problems we already experience (at the scale of pydantic or watchfiles) with very subtle OS differences and wheels. I'm very unwilling to enter into this mess.

You're effectively proposing another way to link libraries that sidesteps both crates and python imports. Are there any other examples of python libraries that use DLLs/shared libraries to access functionality in other packages without going through the python runtime?

(Perhaps worth remembering that I'll probably be maintaining pydantic for the next decade at least; one "clever" idea proposed now which relies on shaky cross-platform behaviour could cost me 100s of hours of support over the years - hence my caution)

The real question is how much faster this would be than my proposal above.

To proceed with the conversation, we need to benchmark that. Really, someone needs to build both solutions (and any third solution proposed) and see how they perform.

@PrettyWood @tiangolo do you have any feedback on any of this?

@07pepa

07pepa commented Jul 27, 2022

Well, loading crates may be fine as well... but as I said, I am not an expert...

@samuelcolvin
Member

crates would need to be a compile time dependency, so distributed wheels couldn't be used.

@07pepa

07pepa commented Jul 27, 2022

Ah... yeah... I forgot about that... because I would just force you to compile when you install the library...

However, if there is little to no performance impact with @samuelcolvin's solution, I would also be fine with that.

However, there are people "needing" simd-json... and in extreme cases performance may degrade.

@samuelcolvin
Member

If you care about "extreme performance", don't use python, build the whole thing in Rust, Go or C.

@NowanIlfideme
Author

Sorry for missing this discussion 2 weeks ago...

I need to check out and play with the current (v0.3.1) version of pydantic-core before I can really give an informed opinion, but from a cursory glance it seems that validate_python() should be enough to implement in Python-land.

Regarding a Rust-side implementation, I think that it all sounds too messy for a Python-facing library. "Config parsing" use cases don't require cutting-edge performance anyway - you generally parse a single YAML file at the beginning of a script (vs. many JSON API requests/sec). And YAML isn't usually passed between (performance-critical) applications, since parsing YAML is slower anyway. There are similar considerations with TOML. I guess the most JSON-like thing would be XML derivatives, but I don't have much experience there, and haven't encountered anyone using Pydantic for XML yet 😉

@samuelcolvin
Member

I agree, validate_python is enough for everything except performance critical applications.

The only other thing you might need is line numbers, that's one of the main drivers (for me) of #10.

We need to think about how to make this possible without adding complexity or damaging performance.
