Allow subclasses of known types to be encoded with superclass encoder #1291
Conversation
Codecov Report
```
@@           Coverage Diff            @@
##            master     #1291   +/-  ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           21        21
  Lines         3736      3741     +5
  Branches       739       740     +1
=========================================
+ Hits          3736      3741     +5
```
Continue to review full report at Codecov.
Force-pushed from 9066ca6 to 38a41f6
I'm concerned about the performance impact of this, as discussed on #1157; I hope to provide a better solution for this kind of thing in future. You can achieve what you want by using your own `json_encoders`.
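A minimal sketch of that workaround, assuming pydantic v1's `Config.json_encoders` and a hypothetical `MyDateTime` subclass (not from this PR): registering the subclass explicitly sidesteps the exact-type encoder lookup.

```python
from datetime import datetime

from pydantic import BaseModel


class MyDateTime(datetime):  # hypothetical subclass of a known type
    pass


class Model(BaseModel):
    ts: datetime  # a MyDateTime instance passes isinstance-based validation

    class Config:
        # an explicit entry for the subclass, so no MRO fallback is needed
        json_encoders = {MyDateTime: lambda v: v.isoformat()}


m = Model(ts=MyDateTime(2020, 1, 1))
print(m.json())  # {"ts": "2020-01-01T00:00:00"}
```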
I'm not sure what performance impacts you are seeing; I am seeing the same or slightly faster performance with my tests (cleaning and compiling before each run), using this script to make sure things were fair.
As it is, FastAPI implements nearly the same logic in its `jsonable_encoder`. I would argue it is not "always going to be slower", as it will be exactly the same performance for native python types (as a loop over a single-element array amounts to a single lookup). Thank you for your time and consideration, and the efforts you've put into making this one of the best libraries for what it does.
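To illustrate the "same performance for native types" point (my own check, not from the thread): a type registered directly in `ENCODERS_BY_TYPE` is always the first element of its own `__mro__`, so the walk matches immediately.

```python
from datetime import datetime

# The registered class is always __mro__[0], so the loop returns on iteration one:
print(str.__mro__)       # (<class 'str'>, <class 'object'>)
print(datetime.__mro__)  # (<class 'datetime.datetime'>, <class 'datetime.date'>, <class 'object'>)
```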
@samuelcolvin For what it's worth, I think @StephenBrown2 is right that this should generally be very cheap, since it only matters for types with a long inheritance chain (where you could get better performance if desired through specifying an encoder).

(Note that if I recall/understand correctly, the approach discussed in #1157 looped over the possible encoding types, rather than over the base classes of the encoding target. So I think it makes sense that the approach taken in this PR wouldn't necessarily have the same performance issues.)

I do think it would be nice to support this style of lookup, and the benchmark execution times seem, at worst, pretty minimally affected. It seems to me much more likely that this change would reduce confusion around json encoding than that it would result in even a measurable reduction in encoding performance.

(That said, @StephenBrown2, it might be good to see what the performance impact looks like for types that do have a longer inheritance chain leading back to something encodable.)

On the other hand, I could see an argument in favor of preventing "surprising" performance issues by just requiring the type to be specified explicitly. I would also understand if you have some fundamentally different approach in mind that you intend to implement down the line... 😬
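As a starting point for that, here is a rough standalone timing sketch (the class names and the simplified `encode` are assumptions for illustration, not the PR's code) comparing a directly registered type against a deeper inheritance chain:

```python
import timeit
from datetime import datetime

ENCODERS_BY_TYPE = {datetime: lambda o: o.isoformat()}


class Deep1(datetime):
    pass


class Deep2(Deep1):
    pass


class Deep3(Deep2):
    pass


def encode(obj):
    # simplified version of the lookup under discussion
    for base in obj.__class__.__mro__:
        try:
            f = ENCODERS_BY_TYPE[base]
        except KeyError:
            continue
        return f(obj)
    raise TypeError(f"Object of type '{obj.__class__.__name__}' is not JSON serializable")


shallow = datetime(2020, 1, 1)
deep = Deep3(2020, 1, 1)  # datetime sits four steps down Deep3's __mro__
print(timeit.timeit(lambda: encode(shallow), number=100_000))
print(timeit.timeit(lambda: encode(deep), number=100_000))
```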
```python
for base in obj.__class__.__mro__:
    if base in ENCODERS_BY_TYPE:
        return ENCODERS_BY_TYPE[base](obj)
```

Mea culpa, I had missed that this trick avoids iterating through `ENCODERS_BY_TYPE`. I also think we can make that slightly faster by switching to:

```python
for base in obj.__class__.__mro__:
    try:
        f = ENCODERS_BY_TYPE[base]
    except KeyError:
        continue
    return f(obj)
```

So we're only doing one lookup on `ENCODERS_BY_TYPE` per base class.

However I'm still a bit concerned about performance of this and think we should put the time into a small benchmark (or extension to the current benchmarks). Unless I'm missing something.

In summary: "yes", but we should invest the time in benchmarking.
Also alternatively, the loop could be put in the exception block, and the list sliced to reduce unnecessary lookups, something like:

```python
try:
    encoder = ENCODERS_BY_TYPE[obj.__class__]
except KeyError:
    for base in obj.__class__.__mro__[1:-1]:
        encoder = ENCODERS_BY_TYPE.get(base)
        if encoder:
            return encoder(obj)
    else:
        raise TypeError(f"Object of type '{obj.__class__.__name__}' is not JSON serializable")
return encoder(obj)  # exact-class hit
```

since the first element of `__mro__` is the class itself (already tried above) and the last will always be `object`. Or, keeping a single loop:

```python
for base in obj.__class__.__mro__[:-1]:
    try:
        encoder = ENCODERS_BY_TYPE[base]
    except KeyError:
        continue
    return encoder(obj)
else:
    raise TypeError(f"Object of type '{obj.__class__.__name__}' is not JSON serializable")
```

(Still raising `TypeError` when no encoder is found.) I find the above try/except inside the loop to be fairly readable, and it only does one lookup, as both @samuelcolvin and @dmontagu suggested (Thanks!)
Yes, though I wouldn't want to add something like pendulum or asyncpg or bson to the benchmarks just to test inheritability. Something like the tests I added to this PR would be more like what I'd go for. I can however put together a gist to test and see how the inheritance chain affects things.
@dmontagu had mentioned testing "fastapi's `jsonable_encoder`".
Force-pushed from f5ee401 to ba0bc5a
Regarding the loop, let's go with:

```python
for base in obj.__class__.__mro__[:-1]:
    try:
        f = ENCODERS_BY_TYPE[base]
    except KeyError:
        continue
    return f(obj)
```

for now; we can always optimise in future if we find a faster solution.

Regarding benchmarking, let's start off with an extension of the current benchmarks that calls `.json()`.
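A rough sketch of what that extension could look like (the function and variable names here are assumptions, not the actual benchmarks/run.py code):

```python
import os
import timeit


def benchmark_json(models):
    # Time .json() over all prepared models, mirroring the validation benchmarks.
    repeats = int(os.getenv('BENCHMARK_REPEATS', '5'))
    return min(
        timeit.repeat(lambda: [m.json() for m in models], number=1, repeat=repeats)
    )
```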
- Blacken doc
- Fix test that worked on my machine, datetime.timestamp() is flakey?
- Single quotes only
- Remove last element in `__mro__` as it will always be `object`
- Use .get for compactness
Force-pushed from ba0bc5a to 4373495
Went with your version of the loop and added the start of a json benchmark. Thoughts?
If you create a new PR for the JSON benchmarks bit (see comment below), you'll solve the problem of conflicts with master without having to rebase this.
benchmarks/run.py (Outdated)

```python
tests += other_tests

repeats = int(os.getenv('BENCHMARK_REPEATS', '5'))
results, csv_results = run_tests(tests, cases, repeats, 'json' in sys.argv)
```
Suggested change:

```diff
-results, csv_results = run_tests(tests, cases, repeats, 'json' in sys.argv)
+test_json = 'TEST_JSON' in os.environ
+results, csv_results = run_tests(tests, cases, repeats, test_json)
```
That way:

- it's consistent with `BENCHMARK_REPEATS`
- you can alter it from outside the makefile
```diff
@@ -1,4 +1,5 @@
 from datetime import datetime, timedelta
+import pendulum
```
I think it's too much extra confusion for those who haven't heard of pendulum to introduce another package for one test. Either find an example in another package we use or create a synthetic example.
You can add a note about pendulum, asyncpg etc. in text.
Force-pushed from 4373495 to 1e50af8
Made a new PR and force-pushed this branch back a commit to avoid conflicts with travis.yml (which had been removed).
So, some examples of subclassed types have come up in other comments. Two out of three of these involve third-party packages, and one of them (asyncpg) is a cdef, but I suppose synthetic benchmarks could be made to imitate them.

EDIT: Found a mock asyncpg UUID in fastapi/fastapi#756 (comment), so I'll see if I can't add that and something similar for pendulum.
Sort ENCODERS_BY_TYPE
Just a heads up, I should have some time tonight and Friday to work on finishing this up; I will probably rebase/merge #1344 into it to see the benchmarks for the changes.
I've got this set up so far, just in case I don't get to this tonight like I hope:

```python
import datetime
import random
import string
import uuid

from devtools import debug

from pydantic import BaseModel, ValidationError
from pydantic.color import COLORS_BY_NAME, COLORS_BY_VALUE, Color


class SubStr(str):
    pass


class HexColor(Color):
    def __str__(self) -> str:
        return self.as_hex()


class ColorOriginal(Color):
    def __str__(self) -> str:
        return str(self.original())


class DateTime(datetime.datetime):
    def __str__(self):
        return self.isoformat("T")

    def __repr__(self):
        us = ""
        if self.microsecond:
            us = ", {}".format(self.microsecond)
        repr_ = "{klass}({year}, {month}, {day}, {hour}, {minute}, {second}{us}"
        if self.tzinfo is not None:
            repr_ += ", tzinfo={tzinfo}"
        repr_ += ")"
        return repr_.format(
            klass=self.__class__.__name__,
            year=self.year,
            month=self.month,
            day=self.day,
            hour=self.hour,
            minute=self.minute,
            second=self.second,
            us=us,
            tzinfo=self.tzinfo,
        )


# https://github.com/tiangolo/fastapi/pull/756#issuecomment-572251121
class MyUuid:
    def __init__(self, uuid_string: str):
        self.uuid = uuid_string

    def __str__(self):
        return self.uuid

    def __repr__(self):
        return f"{self.__class__.__name__}({self.uuid})"

    @property
    def __class__(self):
        return uuid.UUID

    @property
    def __dict__(self):
        """Spoof a missing __dict__ by raising TypeError, this is how
        asyncpg.pgproto.pgproto.UUID behaves"""
        raise TypeError("vars() argument must have __dict__ attribute")


class ValidatedUuid(str):
    @classmethod
    def __get_validators__(cls):
        yield cls.validate

    @classmethod
    def validate(cls, v):
        if not isinstance(v, str):
            raise TypeError("string required")
        return MyUuid(v)


class SomeCustomClass(BaseModel):
    a_uuid: ValidatedUuid
    sub_str: SubStr
    hex_color: HexColor
    orig_color: ColorOriginal
    date_time: DateTime

    class Config:
        arbitrary_types_allowed = True
        json_encoders = {
            MyUuid: str,
            SubStr: str,
            HexColor: str,
            ColorOriginal: str,
            DateTime: str,
        }


PUNCTUATION = " \t\n!\"#$%&'()*+,-./"
LETTERS = string.ascii_letters
UNICODE = "\xa0\xad¡¢£¤¥¦§¨©ª«¬ ®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
ALL = PUNCTUATION * 5 + LETTERS * 20 + UNICODE


def rand_string(min_length, max_length, corpus=ALL):
    return "".join(random.choices(corpus, k=random.randrange(min_length, max_length)))


def rand_date():
    r = random.randrange
    return f"{r(1900, 2020)}-{r(0, 12)}-{r(0, 32)}T{r(0, 24)}:{r(0, 60)}:{r(0, 60)}"


def generate_case():
    return dict(
        a_uuid=str(uuid.uuid4()),
        sub_str=rand_string(5, 20),
        hex_color=random.choice(
            list(COLORS_BY_NAME.keys()) + list(COLORS_BY_VALUE.keys())
        ),
        orig_color=random.choice(
            list(COLORS_BY_NAME.keys()) + list(COLORS_BY_VALUE.keys())
        ),
        date_time=rand_date(),
    )


cases = [generate_case() for _ in range(20)]
debug(cases)

models = []
for case in cases:
    try:
        models.append(SomeCustomClass(**case))
    except ValidationError:
        pass

debug(models)
for model in models:
    debug(model.json())
```
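For context on why the `MyUuid` mock above exercises the new lookup (my own illustration, assuming the encoder walks `obj.__class__.__mro__`): overriding `__class__` as a property makes both `isinstance()` and the MRO walk see `uuid.UUID`, just like asyncpg's cdef UUID.

```python
import uuid


class MyUuid:
    """Minimal stand-in mimicking asyncpg's cdef UUID (illustration only)."""

    def __init__(self, uuid_string: str):
        self.uuid = uuid_string

    @property
    def __class__(self):  # spoof the reported class
        return uuid.UUID


m = MyUuid(str(uuid.uuid4()))
print(m.__class__.__mro__)       # (<class 'uuid.UUID'>, <class 'object'>)
print(isinstance(m, uuid.UUID))  # True, since isinstance consults __class__
```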
```diff
@@ -18,25 +18,27 @@ def isoformat(o: Union[datetime.date, datetime.time]) -> str:

 ENCODERS_BY_TYPE: Dict[Type[Any], Callable[[Any], Any]] = {
     bytes: lambda o: o.decode(),
```
I'm not clear why we're bothering to re-order this dict?
It's a dict, so order doesn't matter, especially since we're KeyErroring, and sorted items make it easier to add more later. I sorted it when I moved Path and Enum.
This is awesome. Thanks so much and sorry again for the delay. 🥳 🍾 🎉
Thanks to you, for taking it over the finish line. Life has taken a bit of a busy turn, so I haven't had as much time for open-source contributions as I would like.

No problem, thank you.
Change Summary
After further research, it seems my aside on `__mro__` usage in #1281 was unfounded and unnecessary, and the only other place I figured this change might be useful (validators.py) seems to already be handled alright. Thus, I'm submitting this PR as-is to resolve the issue found with JSON serialization of subclasses of known types.

While adding documentation, I found that the custom `json_encoders` were being ignored, so I added the same fix to `custom_pydantic_encoder` as well.
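For reference, a sketch of what that same fallback looks like on the custom-encoder path (the name `custom_pydantic_encoder` matches pydantic v1's json.py, but this body is my paraphrase of the change, not a verbatim copy):

```python
from pydantic.json import pydantic_encoder


def custom_pydantic_encoder(type_encoders, obj):
    # Walk the MRO so json_encoders entries for base classes apply to subclasses too.
    for base in obj.__class__.__mro__[:-1]:
        try:
            encoder = type_encoders[base]
        except KeyError:
            continue
        return encoder(obj)
    return pydantic_encoder(obj)  # no custom encoder matched: use the default lookup
```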
Related issue number

Resolves #1281
Checklist
- `changes/<pull request or issue id>-<github username>.md` file added describing change (see changes/README.md for details)