Conversation

StephenBrown2
Contributor

@StephenBrown2 StephenBrown2 commented Mar 4, 2020

Change Summary

After further research, it seems my aside on __mro__ usage in #1281 was unfounded and unnecessary, and the only other place I figured this change might be useful (validators.py) seems to already be handled alright. Thus, I'm submitting this PR as-is to resolve the issue found with JSON serialization of subclasses of known types.

While adding documentation, I found that the custom json_encoders were being ignored, so I added the same fix to custom_pydantic_encoder as well.

Related issue number

Resolves #1281

Checklist

  • Unit tests for the changes exist
  • Tests pass on CI and coverage remains at 100%
  • Documentation reflects the changes where applicable
  • changes/<pull request or issue id>-<github username>.md file added describing change
    (see changes/README.md for details)

@codecov

codecov bot commented Mar 4, 2020

Codecov Report

Merging #1291 into master will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master     #1291   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           21        21           
  Lines         3736      3741    +5     
  Branches       739       740    +1     
=========================================
+ Hits          3736      3741    +5     
Impacted Files     Coverage Δ
pydantic/json.py   100.00% <100.00%> (ø)
pydantic/types.py  100.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 5a705a2...2ec046e.

@StephenBrown2 StephenBrown2 force-pushed the encode_known_subclasses branch from 9066ca6 to 38a41f6 Compare March 4, 2020 19:49
@samuelcolvin
Member

I'm concerned about the performance impact of this as discussed on #1157, I hope to provide a better solution for this kind of thing in future.

You can achieve what you want by using your own config.json_encoders

@StephenBrown2
Contributor Author

I'm concerned about the performance impact of this as discussed on #1157, I hope to provide a better solution for this kind of thing in future.

I'm not sure what performance impacts you are seeing, I am seeing the same or slightly faster performance with my tests (cleaning and compiling before each run):

Benchmark on master:
python benchmarks/run.py pydantic-only
testing pydantic, 10 times each
    pydantic ( 1/10) time=0.437s, success=50.10%
    pydantic ( 2/10) time=0.451s, success=50.10%
    pydantic ( 3/10) time=0.437s, success=50.10%
    pydantic ( 4/10) time=0.428s, success=50.10%
    pydantic ( 5/10) time=0.433s, success=50.10%
    pydantic ( 6/10) time=0.432s, success=50.10%
    pydantic ( 7/10) time=0.431s, success=50.10%
    pydantic ( 8/10) time=0.419s, success=50.10%
    pydantic ( 9/10) time=0.531s, success=50.10%
    pydantic (10/10) time=0.449s, success=50.10%
    pydantic best=0.419s, avg=0.445s, stdev=0.032s

    pydantic best=69.874μs/iter avg=74.140μs/iter stdev=5.273μs/iter version=1.4a1

Benchmark on encode_known_subclasses:
python benchmarks/run.py pydantic-only
testing pydantic, 10 times each
    pydantic ( 1/10) time=0.467s, success=50.10%
    pydantic ( 2/10) time=0.471s, success=50.10%
    pydantic ( 3/10) time=0.431s, success=50.10%
    pydantic ( 4/10) time=0.427s, success=50.10%
    pydantic ( 5/10) time=0.434s, success=50.10%
    pydantic ( 6/10) time=0.434s, success=50.10%
    pydantic ( 7/10) time=0.423s, success=50.10%
    pydantic ( 8/10) time=0.417s, success=50.10%
    pydantic ( 9/10) time=0.433s, success=50.10%
    pydantic (10/10) time=0.454s, success=50.10%
    pydantic best=0.417s, avg=0.439s, stdev=0.019s

    pydantic best=69.490μs/iter avg=73.185μs/iter stdev=3.087μs/iter version=1.4a1

Using this script to make sure things were fair

compare_branch_perf.sh:
#!/usr/bin/env bash
branch=${1:-$(git branch --show-current)}
export BENCHMARK_REPEATS=${2:-10}
output=$(mktemp -t "benchmark-${branch}")

if [[ "${branch}" == "master" ]]; then
  echo "Currently on master, and no branch specified."
  echo "Either check out the branch to compare or specify it as the first argument."
  exit 1
fi

echo "Preparing to compare benchmarks on master vs ${branch}, ${BENCHMARK_REPEATS} repetitions"

git checkout master
echo "Cleaning and building with cython"
make clean build-cython 2>&1 | awk '{printf "."; system("")} END{print "\n"}'
echo "Benchmarking master..."
echo "Benchmark on master:" > ${output}
make benchmark-pydantic | tee -a ${output}

git checkout ${branch}
echo "Cleaning and building with cython"
make clean build-cython 2>&1 | awk '{printf "."; system("")} END{print "\n"}'
echo "Benchmarking ${branch}..."
echo -e "\nBenchmark on ${branch}:" >> ${output}
make benchmark-pydantic | tee -a ${output}

clear
cat ${output}

Can you show performance benchmarks demonstrating significant degradation?
This branch in Travis: pydantic best=56.288μs/iter avg=56.288μs/iter stdev=0.000μs/iter version=1.4a1
Master in Travis: pydantic best=56.039μs/iter avg=56.039μs/iter stdev=0.000μs/iter version=1.4a1

A difference of 0.249μs.

You can achieve what you want by using your own config.json_encoders

Yes, but that's exactly the point of this. In case I want to use pendulum objects instead of naive datetimes everywhere, I would have to add the custom Config.json_encoders to every single model that has one. This is a simple, performant solution that allows me to use subclasses without having to add a lot of boilerplate to my code, and also resolves #1157.
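The mechanism under discussion can be sketched outside pydantic. This is a minimal illustration, with a hypothetical `ENCODERS` table standing in for pydantic's `ENCODERS_BY_TYPE` and a synthetic subclass standing in for something like `pendulum.DateTime`:

```python
import datetime

# Hypothetical encoder table keyed by exact type, standing in for
# pydantic's ENCODERS_BY_TYPE; a plain dict lookup misses subclasses.
ENCODERS = {datetime.datetime: lambda o: o.isoformat()}


class StampedDateTime(datetime.datetime):
    """Stands in for a third-party subclass such as pendulum.DateTime."""


def encode(obj):
    # Walking the MRO finds the nearest encodable base class, so
    # subclasses are serialized with their parent's encoder.
    for base in obj.__class__.__mro__:
        if base in ENCODERS:
            return ENCODERS[base](obj)
    raise TypeError(f"Object of type '{obj.__class__.__name__}' is not JSON serializable")


print(encode(StampedDateTime(2020, 3, 4, 12, 0)))  # 2020-03-04T12:00:00
```

With an exact-type dict lookup, `StampedDateTime` would raise `TypeError` even though its parent is perfectly encodable; the MRO walk removes the need to register each subclass in every model's `Config.json_encoders`.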

Of note, I attempted a faster solution that still compared all the elements but short-circuited using set logic to avoid a loop; it came out a tiny bit slower (within a rounding error), using this logic:

    if set(obj.__class__.__mro__) & set(ENCODERS_BY_TYPE.keys()):
        return next(ENCODERS_BY_TYPE[base](obj) for base in obj.__class__.__mro__ if base in ENCODERS_BY_TYPE)
    else:
        raise TypeError(f"Object of type '{obj.__class__.__name__}' is not JSON serializable")
Benchmark on master:
python benchmarks/run.py pydantic-only
testing pydantic, 10 times each
    pydantic ( 1/10) time=0.444s, success=50.10%
    pydantic ( 2/10) time=0.422s, success=50.10%
    pydantic ( 3/10) time=0.415s, success=50.10%
    pydantic ( 4/10) time=0.415s, success=50.10%
    pydantic ( 5/10) time=0.413s, success=50.10%
    pydantic ( 6/10) time=0.420s, success=50.10%
    pydantic ( 7/10) time=0.455s, success=50.10%
    pydantic ( 8/10) time=0.432s, success=50.10%
    pydantic ( 9/10) time=0.420s, success=50.10%
    pydantic (10/10) time=0.442s, success=50.10%
    pydantic best=0.413s, avg=0.428s, stdev=0.015s

    pydantic best=68.833μs/iter avg=71.283μs/iter stdev=2.440μs/iter version=1.4a1

Benchmark on encode_known_subclasses:
python benchmarks/run.py pydantic-only
testing pydantic, 10 times each
    pydantic ( 1/10) time=0.464s, success=50.10%
    pydantic ( 2/10) time=0.425s, success=50.10%
    pydantic ( 3/10) time=0.404s, success=50.10%
    pydantic ( 4/10) time=0.449s, success=50.10%
    pydantic ( 5/10) time=0.429s, success=50.10%
    pydantic ( 6/10) time=0.438s, success=50.10%
    pydantic ( 7/10) time=0.446s, success=50.10%
    pydantic ( 8/10) time=0.421s, success=50.10%
    pydantic ( 9/10) time=0.416s, success=50.10%
    pydantic (10/10) time=0.404s, success=50.10%
    pydantic best=0.404s, avg=0.430s, stdev=0.020s

    pydantic best=67.315μs/iter avg=71.630μs/iter stdev=3.276μs/iter version=1.4a1

@StephenBrown2
Contributor Author

As it is, FastAPI implements nearly the same logic in its jsonable_encoder, and while I would rather not add a custom encoder to all my classes, I suppose I could subclass BaseModel and add it there.

I would argue it is not "always going to be slower": for native Python types the performance is identical (a loop that finds its match on the first element costs the same as a direct O(1) key lookup), and it only gets slower as the inheritance chain grows. On average the performance hit is negligible, as demonstrated, and only becomes noticeable for very deeply inherited objects; in that case one should certainly register a custom encoder, which restores execution to native speed.
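A quick way to see why the average cost stays flat is to count loop iterations rather than wall time. This is a synthetic sketch with a hypothetical `ENCODERS` table and made-up subclass chain:

```python
import datetime

ENCODERS = {datetime.date: lambda o: o.isoformat()}  # hypothetical table


def lookup_steps(tp):
    """Count how many MRO entries are inspected before an encoder is found."""
    for steps, base in enumerate(tp.__mro__, start=1):
        if base in ENCODERS:
            return steps
    return None


class Child(datetime.date):
    pass


class GrandChild(Child):
    pass


print(lookup_steps(datetime.date))  # 1: native type, same cost as a direct dict hit
print(lookup_steps(Child))          # 2
print(lookup_steps(GrandChild))     # 3: cost grows only with inheritance depth
```

For a registered type the loop exits on the first MRO entry, so only subclasses pay extra iterations, in proportion to their depth.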

Thank you for your time and consideration, and the efforts you've put into making this one of the best libraries for what it does.

@dmontagu
Contributor

dmontagu commented Mar 7, 2020

@samuelcolvin For what it's worth, I think @StephenBrown2 is right that this should generally be very cheap since it only matters for types with a long inheritance chain (where you could get better performance if desired through specifying an encoder).

(Note that if I recall/understand correctly, the approach discussed in #1157 looped over the possible encoding types, rather than over the base classes of the encoding target. So I think it makes sense that the approach taken in this PR wouldn't necessarily have the same performance issues.)

I do think it would be nice to support this style of lookup, and the benchmark execution times seem, at worst, pretty minimally affected. It seems to me much more likely that this change would reduce confusion around json encoding than that it would result in even a measurable reduction in encoding performance.

(That said, @StephenBrown2 it might be good to see what the performance impact looks like for types that do have a longer inheritance chain leading back to something encodable.)

On the other hand, I could see an argument in favor of preventing "surprising" performance issues by just requiring the type to be specified explicitly. I would also understand if you have some fundamentally different approach in mind that you intend to implement down the line... 😬

@samuelcolvin
Member

    for base in obj.__class__.__mro__:
        if base in ENCODERS_BY_TYPE:
            return ENCODERS_BY_TYPE[base](obj)

Mea culpa: I had missed that this trick avoids iterating through ENCODERS_BY_TYPE, as was suggested in #1157. I therefore think it's probably a good idea.

I also think we can make that slightly faster by switching to

for base in obj.__class__.__mro__:
    try:
        f = ENCODERS_BY_TYPE[base]
    except KeyError:
        continue
    return f(obj)

So we're only doing one lookup on ENCODERS_BY_TYPE (unless I'm missing something and the above is somehow faster than this?).

However I'm still a bit concerned about performance of this and think we should put the time into a small benchmark (or extension to the current benchmarks). Unless I'm missing something benchmarks/run.py never calls .json() so does nothing to measure the performance of this PR.

In summary "yes", but we should invest the time in benchmarking .json() before we make this change.
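The difference between the two loop bodies can be measured in isolation with `timeit`. A sketch with a tiny stand-in table; absolute timings will vary by machine:

```python
import timeit

ENCODERS_BY_TYPE = {bytes: bytes.decode, frozenset: list}


def hit_two_lookups(base=bytes):
    # `in` check plus indexing: two hash lookups when the key is present
    if base in ENCODERS_BY_TYPE:
        return ENCODERS_BY_TYPE[base]


def hit_one_lookup(base=bytes):
    # try/except: a single hash lookup when the key is present
    try:
        return ENCODERS_BY_TYPE[base]
    except KeyError:
        return None


for fn in (hit_two_lookups, hit_one_lookup):
    print(fn.__name__, timeit.timeit(fn, number=200_000))
```

The try/except form halves the dict work on a hit, at the cost of exception machinery on a miss; since most objects match early in their MRO, hits dominate.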

@samuelcolvin samuelcolvin reopened this Mar 7, 2020
@StephenBrown2
Contributor Author

StephenBrown2 commented Mar 10, 2020

Alternatively, the exact-type lookup could be tried first, with the loop moved into the exception block and the MRO sliced to avoid unnecessary lookups

something like:
try:
    encoder = ENCODERS_BY_TYPE[obj.__class__]
except KeyError:
    for base in obj.__class__.__mro__[1:-1]:
        encoder = ENCODERS_BY_TYPE.get(base)
        if encoder:
            return encoder(obj)
    else:
        raise TypeError(f"Object of type '{obj.__class__.__name__}' is not JSON serializable")
return encoder(obj)

since obj.__class__.__mro__[0] == type(obj) and obj.__class__.__mro__[-1] == object, however, that may be unnecessary, and we can just trim the last element (object) from the loop with your suggestion:

for base in obj.__class__.__mro__[:-1]:
    try:
        encoder = ENCODERS_BY_TYPE[base]
    except KeyError:
        continue
    return encoder(obj)
else:
    raise TypeError(f"Object of type '{obj.__class__.__name__}' is not JSON serializable")

(Still raising TypeError if we exit the for loop without finding anything)
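The slicing relies on two invariants of `__mro__` that are easy to confirm (synthetic subclass for illustration):

```python
import datetime


class DateTime(datetime.datetime):
    pass


mro = DateTime.__mro__
print(mro)
# __mro__ always starts at the class itself and ends at object, so the
# first entry duplicates the exact-type lookup and the last entry
# (object) will never carry a registered encoder.
assert mro[0] is DateTime
assert mro[-1] is object
```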

I find the above try/except inside the loop to be fairly readable and only does one lookup as both @samuelcolvin and @dmontagu suggested (Thanks!)

That said, @StephenBrown2 it might be good to see what the performance impact looks like for types that do have a longer inheritance chain leading back to something encodable.

Yes, though I wouldn't want to add something like pendulum or asyncpg or bson to the benchmarks just to test inheritability. Something like the tests I added to this PR would be more like what I'd go for. I can however put together a gist to test and see how the inheritance chain affects things.

I'm still a bit concerned about performance of this and think we should put the time into a small benchmark (or extension to the current benchmarks). Unless I'm missing something benchmarks/run.py never calls .json() so does nothing to measure the performance of this PR.

@dmontagu had mentioned testing "fastapi's jsonable_encoder when called on a relatively large list of dicts of int to int" in the other PR. Would that be a decent test case, or would calling .json() on the existing test cases be enough? It's coming from json for the benchmarks, so possibly modifying generate_case to add a couple inherited classes and running it through .json() instead of json.dump?

@StephenBrown2 StephenBrown2 force-pushed the encode_known_subclasses branch from f5ee401 to ba0bc5a Compare March 17, 2020 22:45
@samuelcolvin
Member

Regarding the loop let's go with

for base in obj.__class__.__mro__[:-1]:
    try:
        f = ENCODERS_BY_TYPE[base]
    except KeyError:
        continue
    return f(obj)

for now, we can always optimise in future if we find a faster solution.

Regarding benchmarking, let's start off with an extension of the current benchmarks that calls .json() and go from there.
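Putting the agreed pieces together, the fallback would look roughly like this. A sketch of the shape under discussion, not the merged code; the `ENCODERS_BY_TYPE` entries and `TZDateTime` subclass here are illustrative:

```python
import datetime
from typing import Any, Callable, Dict, Type

ENCODERS_BY_TYPE: Dict[Type[Any], Callable[[Any], Any]] = {
    bytes: lambda o: o.decode(),
    datetime.datetime: lambda o: o.isoformat(),
}


def pydantic_encoder(obj: Any) -> Any:
    # Skip the trailing `object` entry; it never has a registered encoder.
    for base in obj.__class__.__mro__[:-1]:
        try:
            encoder = ENCODERS_BY_TYPE[base]
        except KeyError:
            continue
        return encoder(obj)
    raise TypeError(f"Object of type '{obj.__class__.__name__}' is not JSON serializable")


class TZDateTime(datetime.datetime):
    """Synthetic subclass for illustration."""


print(pydantic_encoder(TZDateTime(2020, 4, 18)))  # 2020-04-18T00:00:00
```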

Commit messages in this push:
  • Blacken doc
  • Fix test that worked on my machine (datetime.timestamp() is flaky?)
  • Single quotes only
  • Remove last element in `__mro__` as it will always be `object`
  • Use `.get` for compactness
@StephenBrown2 StephenBrown2 force-pushed the encode_known_subclasses branch from ba0bc5a to 4373495 Compare March 18, 2020 16:47
@StephenBrown2
Contributor Author

StephenBrown2 commented Mar 18, 2020

Went with your version of the loop and added the start of a json benchmark. Thoughts?

Member

@samuelcolvin samuelcolvin left a comment


If you create a new PR for the JSON benchmarks bit (see comment below), you'll solve the problem of conflicts with master without having to rebase this.

tests += other_tests

repeats = int(os.getenv('BENCHMARK_REPEATS', '5'))
results, csv_results = run_tests(tests, cases, repeats, 'json' in sys.argv)
Member


Suggested change
results, csv_results = run_tests(tests, cases, repeats, 'json' in sys.argv)
test_json = 'TEST_JSON' in os.environ
results, csv_results = run_tests(tests, cases, repeats, test_json)

That way:

  1. it's consistent with BENCHMARK_REPEATS
  2. you can alter it from outside the makefile.

@@ -1,4 +1,5 @@
from datetime import datetime, timedelta
import pendulum
Member


I think introducing another package for one test creates too much confusion for those who haven't heard of pendulum. Either find an example in another package we already use or create a synthetic example.

You can add a note about pendulum, asyncpg etc. in text.

@StephenBrown2 StephenBrown2 force-pushed the encode_known_subclasses branch from 4373495 to 1e50af8 Compare March 26, 2020 16:00
@StephenBrown2 StephenBrown2 mentioned this pull request Mar 26, 2020
4 tasks
@StephenBrown2
Contributor Author

Made a new PR and force-pushed this branch back a commit to avoid conflicts with travis.yml (which had been removed)

@StephenBrown2
Contributor Author

StephenBrown2 commented Apr 1, 2020

So, some examples for subclassed types I've seen from other comments include:

  • pendulum.DateTime -> datetime.datetime
  • ColorOriginal -> pydantic.Color
  • asyncpg.pgproto.pgproto.UUID -> uuid.UUID

Two out of three of these involve third-party packages, and one of them (asyncpg) is a cdef, but I suppose synthetic benchmarks could be made to imitate them.

EDIT: Found a mock asyncpg...UUID in fastapi/fastapi#756 (comment), so I'll see if I can't add that and something similar for pendulum.DateTime to a test.

@StephenBrown2
Contributor Author

Just a heads up, I should have some time tonight and Friday to work on finishing this up; will probably rebase/merge #1344 into it to see the Benchmarks for the changes.

@StephenBrown2
Contributor Author

StephenBrown2 commented Apr 16, 2020

I've got this set up so far, just in case I don't get to this tonight like I hope:

import datetime
import random
import string
import uuid

from devtools import debug
from pydantic import BaseModel, ValidationError
from pydantic.color import COLORS_BY_NAME, COLORS_BY_VALUE, Color


class SubStr(str):
    pass


class HexColor(Color):
    def __str__(self) -> str:
        return self.as_hex()


class ColorOriginal(Color):
    def __str__(self) -> str:
        return str(self.original())


class DateTime(datetime.datetime):
    def __str__(self):
        return self.isoformat("T")

    def __repr__(self):
        us = ""
        if self.microsecond:
            us = ", {}".format(self.microsecond)

        repr_ = "{klass}({year}, {month}, {day}, {hour}, {minute}, {second}{us}"

        if self.tzinfo is not None:
            repr_ += ", tzinfo={tzinfo}"

        repr_ += ")"

        return repr_.format(
            klass=self.__class__.__name__,
            year=self.year,
            month=self.month,
            day=self.day,
            hour=self.hour,
            minute=self.minute,
            second=self.second,
            us=us,
            tzinfo=self.tzinfo,
        )


# https://github.com/tiangolo/fastapi/pull/756#issuecomment-572251121
class MyUuid:
    def __init__(self, uuid_string: str):
        self.uuid = uuid_string

    def __str__(self):
        return self.uuid

    def __repr__(self):
        return f"{self.__class__.__name__}({self.uuid})"

    @property
    def __class__(self):
        return uuid.UUID

    @property
    def __dict__(self):
        """Spoof a missing __dict__ by raising TypeError, this is how
        asyncpg.pgroto.pgproto.UUID behaves"""
        raise TypeError("vars() argument must have __dict__ attribute")


class ValidatedUuid(str):
    @classmethod
    def __get_validators__(cls):
        yield cls.validate

    @classmethod
    def validate(cls, v):
        if not isinstance(v, str):
            raise TypeError("string required")
        return MyUuid(v)


class SomeCustomClass(BaseModel):
    a_uuid: ValidatedUuid
    sub_str: SubStr
    hex_color: HexColor
    orig_color: ColorOriginal
    date_time: DateTime

    class Config:
        arbitrary_types_allowed = True
        json_encoders = {
            MyUuid: str,
            SubStr: str,
            HexColor: str,
            ColorOriginal: str,
            DateTime: str,
        }


PUNCTUATION = " \t\n!\"#$%&'()*+,-./"
LETTERS = string.ascii_letters
UNICODE = "\xa0\xad¡¢£¤¥¦§¨©ª«¬ ®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
ALL = PUNCTUATION * 5 + LETTERS * 20 + UNICODE


def rand_string(min_length, max_length, corpus=ALL):
    return "".join(random.choices(corpus, k=random.randrange(min_length, max_length)))


def rand_date():
    r = random.randrange
    return f"{r(1900, 2020)}-{r(0, 12)}-{r(0, 32)}T{r(0, 24)}:{r(0, 60)}:{r(0, 60)}"


def generate_case():
    return dict(
        a_uuid=str(uuid.uuid4()),
        sub_str=rand_string(5, 20),
        hex_color=random.choice(
            list(COLORS_BY_NAME.keys()) + list(COLORS_BY_VALUE.keys())
        ),
        orig_color=random.choice(
            list(COLORS_BY_NAME.keys()) + list(COLORS_BY_VALUE.keys())
        ),
        date_time=rand_date(),
    )


cases = [generate_case() for _ in range(20)]

debug(cases)

models = []
for case in cases:
    try:
        models.append(SomeCustomClass(**case))
    except ValidationError:
        pass

debug(models)

for model in models:
    debug(model.json())

@@ -18,25 +18,27 @@ def isoformat(o: Union[datetime.date, datetime.time]) -> str:


ENCODERS_BY_TYPE: Dict[Type[Any], Callable[[Any], Any]] = {
bytes: lambda o: o.decode(),
Member


I'm not clear why we're bothering to re-order this dict?

Contributor Author


It's a dict, so ordering doesn't affect behaviour (lookups either succeed or raise KeyError), and keeping the entries sorted makes it easier to add more later. I sorted it when I moved Path and Enum.

@samuelcolvin samuelcolvin merged commit cab6313 into pydantic:master Apr 18, 2020
@samuelcolvin
Member

samuelcolvin commented Apr 18, 2020

this is awesome. Thanks so much and sorry again for the delay.

🥳 🍾 🎉

@StephenBrown2
Contributor Author

Thanks to you, for taking it over the finish line. Life has taken a bit of a busy turn so I haven't had as much time for open-source contributions like I would like.

@samuelcolvin
Member

No problem, thank you.

Successfully merging this pull request may close these issues.

Subclasses of known types are not JSON serializable