sydney-runkle
Contributor

@sydney-runkle sydney-runkle commented Aug 12, 2024

#10074 gives an in-depth description of the troubles with our parent namespace approach.

In this PR, currently:

In future PRs, or maybe this one, I'd like to:

  • Filter the parent (or global) namespaces that we acquire, to save on memory usage when building models. @dmontagu suggested that we could perhaps search through stringified field annotations to do this
  • I think we could really clean up how we handle namespaces, but this small fix might work as a mega performance-improving patch
  • I'd like to revert the fast build config setting if we can fix things with this

I wouldn't say this justifies closing #10074, but it gets us much closer!

@github-actions github-actions bot added the relnotes-change Used for changes to existing functionality which don't have a better categorization. label Aug 12, 2024
@sydney-runkle sydney-runkle added relnotes-performance Used for performance improvements. and removed relnotes-change Used for changes to existing functionality which don't have a better categorization. labels Aug 12, 2024
@sydney-runkle
Contributor Author

sydney-runkle commented Aug 12, 2024

My rough benchmark, which builds roughly 7,500 models:

time python sandbox/pydanticv2.py
> on this branch: 6.77s user 0.21s system 97% cpu 7.171 total
> on main: 37.64s user 37.71s system 76% cpu 1:38.82 total


codspeed-hq bot commented Aug 12, 2024

CodSpeed Performance Report

Merging #10113 will not alter performance

Comparing namespace-change (2ea648d) with main (f0d8f65)

Summary

✅ 17 untouched benchmarks

@sydney-runkle
Contributor Author

sydney-runkle commented Aug 12, 2024

Also open to ideas: maybe we don't do this in this function, but rather build out another function for this specific check. It looks like we're slowing down type adapter stuff a bit, which makes sense because getting the frame info is slowing down parent_namespace() by 96%.

@sydney-runkle
Contributor Author

I'll have time this evening to experiment with variations on this pattern. @dmontagu, what do you think about this idea in general?

@sydney-runkle
Contributor Author

@MarkusSintonen ping as well 🚀

Contributor

@dmontagu dmontagu left a comment

If you can get the docs tests passing with this I'm good with it in principle; if we make this change we should also change the name and docstring of that function to reflect the fact that it returns None for top-level module namespaces.

@MarkusSintonen
Contributor

we're slowing down type adapter stuff a bit, which makes sense bc getting the frame info is slowing down parent_namespace() by 96%

This sounds rather worrying. TypeAdapters are used at extreme levels eg in every FastAPI application 🤔 Is there any other way to do this? Or skipping the parent namespace stuff in TypeAdapters altogether?

@sydney-runkle
Contributor Author

@MarkusSintonen,

This sounds rather worrying. TypeAdapters are used at extreme levels eg in every FastAPI application 🤔 Is there any other way to do this? Or skipping the parent namespace stuff in TypeAdapters altogether?

Yep, I'll find an alternative that doesn't slow TA builds.


cloudflare-workers-and-pages bot commented Aug 13, 2024

Deploying pydantic-docs with Cloudflare Pages

Latest commit: 2ea648d
Status: ✅  Deploy successful!
Preview URL: https://6eae71c4.pydantic-docs.pages.dev
Branch Preview URL: https://namespace-change.pydantic-docs.pages.dev

Contributor

github-actions bot commented Aug 13, 2024

Coverage report

Coverage table (columns: File, Statements, Missing, Coverage, Coverage (new stmts), Lines missing) for pydantic/_internal/_typing_extra.py and the project total; values not captured.

This report was generated by python-coverage-comment-action

Co-authored-by: David Montague <35119617+dmontagu@users.noreply.github.com>
- # if f_back is None, it's the global module namespace and we don't need to include it here
- if frame.f_back is None:
+ # if the class is defined at the top module level, we don't need to add namespace information
+ if frame.f_back is None or frame.f_code.co_name == '<module>':
Contributor

Btw, it won't surprise me if we need to use a getattr here because some implementation of Python has f_code set to None or something; that feels like the kind of thing that might happen. But I don't know if those implementations would have a pydantic-core release that worked with them.

Contributor Author

Seems to be ok for the moment, but will keep an eye out for this as a potential source of bugs in the future!

Member

ye, I thought the exact same thing as @dmontagu.

Typeshed suggests that both .f_code and .co_name should always exist.

Member

inspect.getframeinfo (which was used before in this PR) also assumes frame.f_code.co_name is defined.
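
For readers following this review thread, the check under discussion can be illustrated outside Pydantic. This is a minimal sketch assuming CPython's `sys._getframe`; the helper names `caller_is_module_level`, `inside_function`, and `names_agree` are hypothetical and are not part of Pydantic.

```python
import inspect
import sys


def caller_is_module_level(depth: int = 1) -> bool:
    """Hypothetical helper: is the frame `depth` levels up running module-level code?"""
    frame = sys._getframe(depth)
    # Code objects compiled from a module body (or from exec'd source) are
    # named '<module>'; a function body carries the function's own name.
    return frame.f_code.co_name == '<module>'


def inside_function() -> bool:
    # The calling frame here is `inside_function`, so the check is False.
    return caller_is_module_level()


def names_agree() -> bool:
    # inspect.getframeinfo reports the same name (it also gathers file and
    # source-line context for the frame).
    frame = sys._getframe(0)
    return inspect.getframeinfo(frame).function == frame.f_code.co_name


# Make a call from top-level code compiled by exec, which behaves like a module body.
namespace = {'caller_is_module_level': caller_is_module_level}
exec('at_top_level = caller_is_module_level()', namespace)

print(namespace['at_top_level'], inside_function(), names_agree())
```

Reading `frame.f_code.co_name` directly avoids the extra work that `inspect.getframeinfo` does, which seems to be the kind of overhead discussed earlier in the thread.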

@sydney-runkle
Contributor Author

sydney-runkle commented Aug 13, 2024

@MarkusSintonen woohoo, with 5742f82, we're no longer seeing the 10% slowdown :)

@sydney-runkle sydney-runkle changed the title Addressing parent namespace issues Performance boost: skip caching parent namespaces in most cases Aug 13, 2024
@sydney-runkle
Contributor Author

On memory usage, for the case mentioned above with 7.5k models, we see 1/3 the memory usage 🚀 with this change

main:

📏 Total allocations:
        3171464

📦 Total memory allocated:
        15.083GB

📊 Histogram of allocation size:
        min: 1.000B
        ---------------------------------------------
        < 4.000B   : 307836 ▇▇▇▇▇▇▇▇▇▇▇▇
        < 16.000B  : 615674 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
        < 70.000B  : 457803 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
        < 288.000B : 525717 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
        < 1.163KB  : 678000 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
        < 4.795KB  : 419167 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
        < 19.771KB :  85507 ▇▇▇▇
        < 81.511KB :  44063 ▇▇
        < 336.054KB:  33217 ▇▇
        <=1.353MB  :   4480 ▇
        ---------------------------------------------
        max: 1.353MB

📂 Allocator type distribution:
         MALLOC: 3001528
         REALLOC: 133803
         CALLOC: 31662
         MMAP: 4471

🥇 Top 5 largest allocating locations (by size):
        - __init__:/Users/programming/pydantic_work/pydantic/pydantic/_internal/_model_construction.py:656 -> 2.801GB
        - unpack_lenient_weakvaluedict:/Users/programming/pydantic_work/pydantic/pydantic/_internal/_model_construction.py:698 -> 2.019GB
        - build_lenient_weakvaluedict:/Users/programming/pydantic_work/pydantic/pydantic/_internal/_model_construction.py:684 -> 2.019GB
        - <dictcomp>:/Users/programming/pydantic_work/pydantic/pydantic/_internal/_typing_extra.py:172 -> 2.019GB
        - build_lenient_weakvaluedict:/Users/programming/pydantic_work/pydantic/pydantic/_internal/_model_construction.py:681 -> 1.092GB

🥇 Top 5 largest allocating locations (by number of allocations):
        - create_schema_validator:/Users/programming/pydantic_work/pydantic/pydantic/plugin/_schema_validator.py:50 -> 1119544
        - validate_core_schema:/Users/programming/pydantic_work/pydantic/pydantic/_internal/_core_utils.py:568 -> 923396
        - complete_model_class:/Users/programming/pydantic_work/pydantic/pydantic/_internal/_model_construction.py:568 -> 478022
        - __init__:/Users/programming/anaconda3/envs/pydantic_env/lib/python3.10/typing.py:668 -> 91979
        - unpack_lenient_weakvaluedict:/Users/programming/pydantic_work/pydantic/pydantic/_internal/_model_construction.py:698 -> 62323

this branch:

📏 Total allocations:
        2980604

📦 Total memory allocated:
        4.992GB

📊 Histogram of allocation size:
        min: 1.000B
        ---------------------------------------------
        < 4.000B   : 307836 ▇▇▇▇▇▇▇▇▇▇▇▇▇
        < 16.000B  : 615674 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
        < 70.000B  : 457803 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
        < 288.000B : 525718 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
        < 1.163KB  : 634252 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
        < 4.795KB  : 375627 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
        < 19.771KB :  43216 ▇▇
        < 81.511KB :   6280 ▇
        < 336.054KB:  13850 ▇
        <=1.353MB  :    348 ▇
        ---------------------------------------------
        max: 1.353MB

📂 Allocator type distribution:
         MALLOC: 2814725
         REALLOC: 133851
         CALLOC: 31688
         MMAP: 340

🥇 Top 5 largest allocating locations (by size):
        - add_module_globals:/Users/programming/pydantic_work/pydantic/pydantic/_internal/_typing_extra.py:216 -> 2.028GB
        - push:/Users/programming/pydantic_work/pydantic/pydantic/_internal/_generate_schema.py:335 -> 1.014GB
        - validate_core_schema:/Users/programming/pydantic_work/pydantic/pydantic/_internal/_core_utils.py:568 -> 884.282MB
        - __init__:/Users/programming/anaconda3/envs/pydantic_env/lib/python3.10/typing.py:668 -> 350.568MB
        - create_schema_validator:/Users/programming/pydantic_work/pydantic/pydantic/plugin/_schema_validator.py:50 -> 164.853MB

🥇 Top 5 largest allocating locations (by number of allocations):
        - create_schema_validator:/Users/programming/pydantic_work/pydantic/pydantic/plugin/_schema_validator.py:50 -> 1119542
        - validate_core_schema:/Users/programming/pydantic_work/pydantic/pydantic/_internal/_core_utils.py:568 -> 923393
        - complete_model_class:/Users/programming/pydantic_work/pydantic/pydantic/_internal/_model_construction.py:568 -> 478018
        - __init__:/Users/programming/anaconda3/envs/pydantic_env/lib/python3.10/typing.py:668 -> 91964
        - get_cls_type_hints_lenient:/Users/programming/pydantic_work/pydantic/pydantic/_internal/_typing_extra.py:237 -> 51607

Member

@samuelcolvin samuelcolvin left a comment

LGTM.

Great find!

@sydney-runkle
Contributor Author

Going to merge. Excited to see impact in v2.9!

@sydney-runkle sydney-runkle merged commit 4bc2da6 into main Aug 14, 2024
61 checks passed
@sydney-runkle sydney-runkle deleted the namespace-change branch August 14, 2024 16:22
@zzstoatzz
Contributor

zzstoatzz commented Aug 27, 2024

hey folks!

I am trying to understand what potential breaking change occurred here (git bisect led me here) that's causing an issue with forward references over in prefect.

We have something like:

if TYPE_CHECKING:
    from prefect.flows import Flow

class EngineContext(RunContext):
    ...
    flow: Optional["Flow"] = None

which on import (python -c "from prefect import flow") now (2.9.0b1) gives a trace like this (full trace in linked issue)

» python -c "from prefect import flow"
Traceback (most recent call last):
  File "/Users/nate/github.com/prefecthq/prefect/.venv/lib/python3.12/site-packages/pydantic/_internal/_generate_schema.py", line 863, in _resolve_forward_ref
    obj = _typing_extra.eval_type_backport(obj, globalns=self._types_namespace)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/nate/github.com/prefecthq/prefect/.venv/lib/python3.12/site-packages/pydantic/_internal/_typing_extra.py", line 287, in eval_type_backport
    return _eval_type_backport(value, globalns, localns, type_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/nate/github.com/prefecthq/prefect/.venv/lib/python3.12/site-packages/pydantic/_internal/_typing_extra.py", line 311, in _eval_type_backport
    return _eval_type(value, globalns, localns, type_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/nate/github.com/prefecthq/prefect/.venv/lib/python3.12/site-packages/pydantic/_internal/_typing_extra.py", line 340, in _eval_type
    return typing._eval_type(  # type: ignore
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.12/3.12.5/Frameworks/Python.framework/Versions/3.12/lib/python3.12/typing.py", line 415, in _eval_type
    return t._evaluate(globalns, localns, type_params, recursive_guard=recursive_guard)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.12/3.12.5/Frameworks/Python.framework/Versions/3.12/lib/python3.12/typing.py", line 947, in _evaluate
    eval(self.__forward_code__, globalns, localns),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 1, in <module>
NameError: name 'Flow' is not defined

...
    
  File "<string>", line 1, in <module>
pydantic.errors.PydanticUndefinedAnnotation: name 'Flow' is not defined

For further information visit https://errors.pydantic.dev/2.9/u/undefined-annotation

I'm of course suspicious that we were depending on something weird, but I haven't yet found our error and I wasn't immediately seeing how postponed annotations or the other suggestions I found from the usage error docs would help, but could totally be missing something. this is something we're currently doing to avoid circular imports

im still very much looking into this, just figured i'd post here in case you could shortcut my investigation

happy to open an issue if that seems appropriate


EDIT (as I investigate more)

unsurprisingly given the diff, if I remove the added clause (or frame.f_code.co_name == '<module>') from the condition, everything works as expected

I also tried replacing "Flow" with the full "prefect.flows.Flow" and this also appeared to resolve (but then I appear to hit another instance of the same case)

@Viicos
Member

Viicos commented Aug 28, 2024

This is what happens in your case, step by step:

prefect.context module:

if TYPE_CHECKING:
    from prefect.flows import Flow

class EngineContext(BaseModel):
    flow: Optional["Flow"]

prefect.main module:

from prefect.flows import Flow

from prefect.context import EngineContext

EngineContext.model_rebuild()
  1. In the prefect.context module, EngineContext is defined with a field having a forward reference to Flow. At runtime, Pydantic will try to look for Flow in the module's __dict__ so that it can evaluate the forward reference and know what it points to. Because Flow is imported in an if TYPE_CHECKING block, it is not present in the module's __dict__, so the forward reference fails to evaluate. When that happens, Pydantic sets a temporary "mock" core schema and the model needs to be rebuilt.
  2. In the prefect.main module, you added a call to EngineContext.model_rebuild(). Doing so will try to reevaluate the forward annotations, using:
     • The model module's __dict__ (this time, the module is fully defined, so forward annotations to symbols defined after the model can be resolved).
     • The parent frame namespace. Prior to the change in this PR, this would include everything defined in prefect.main. I'm not sure if this is intended behavior, as it can be confusing.
  3. In prefect.main, Flow is imported, so model_rebuild() succeeds.

To work around this, I would provide an explicit _types_namespace={'Flow': Flow} argument to EngineContext.model_rebuild(). This way Pydantic won't try to infer the types namespace itself.
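
The lookup described in these steps can be imitated with the standard library alone. The sketch below is an analogy built on `typing.get_type_hints`, not Pydantic's actual resolution code; `Flow` and `EngineContext` are plain-class stand-ins for the prefect classes, and `resolve` is a hypothetical helper.

```python
from typing import Optional, get_type_hints


class Flow:
    """Stand-in for prefect.flows.Flow."""


class EngineContext:
    # Stored as Optional[ForwardRef('Flow')] until something evaluates it.
    flow: Optional['Flow'] = None


def resolve(namespace: dict) -> dict:
    """Hypothetical helper: evaluate the annotations against `namespace` only."""
    return get_type_hints(EngineContext, globalns=namespace, localns={})


# A namespace without Flow, like a module that only imports it under TYPE_CHECKING.
try:
    resolve({})
    failed_without_flow = False
except NameError:
    failed_without_flow = True

# Supplying the name explicitly lets the forward reference resolve.
hints = resolve({'Flow': Flow})

print(failed_without_flow, hints)
```

Passing the mapping explicitly makes the outcome independent of whatever happens to be defined in the calling module, which is the point of the suggested workaround.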


@sydney-runkle, there are probably other projects relying on this behavior (probably without explicit knowledge). In the case of prefect, it might be that Flow was imported to be reexported in __all__ in the first place.

On the other hand, I feel like including the parent namespace during model_rebuild() can be counterintuitive, e.g. what happens if you define a different Flow class in prefect.main?

So I think we can either have a boolean flag in the definition of parent_frame_namespace to fall back to the old behavior, and enable it for model_rebuild, or we keep it this way and recommend that users hitting new issues like this one explicitly provide _types_namespace.

@sydney-runkle
Contributor Author

Hey @zzstoatzz,

Thanks so much for reporting this. I'll look around for a fix on our end for this today - we definitely don't want to introduce this breaking change with an official minor release.

The goal of this namespace modification stuff was to improve performance for schema builds + reduce unnecessary memory usage, but we certainly don't want users to have to manually modify types namespaces in cases where they didn't have to before.

Will keep you updated regarding a fix!

@sydney-runkle
Contributor Author

I've created #10253 to track updates here.

@zzstoatzz
Contributor

thank you @Viicos and @sydney-runkle!

the context is super helpful, and with your _types_namespace suggestion we seem to be in good shape 🙂

I'll keep up with that tracking issue 👍

@skshetry

skshetry commented Sep 6, 2024

@sydney-runkle, I was debugging an issue with the newer version of pydantic and after some investigation using git bisect, I ended up here.

We have a Python script that is dynamically generated using datamodel-code-generator and is executed within a function using exec(). I’ve included a reproducible example below:

# models.py
from __future__ import annotations

from pydantic import BaseModel

class Model1(BaseModel):
    pass

class Model2(BaseModel):
    model1: Model1

print(Model2.model_fields)

# script.py
from pathlib import Path

def f():
    source = Path("models.py").read_text()
    exec(source)

f()

I get a different result when I run models.py and script.py.

$ python models.py
{'model1': FieldInfo(annotation=Model1, required=True)}

$ python script.py
{'model1': FieldInfo(annotation=ForwardRef('Model1'), required=True)}

In the second case, I’m seeing a ForwardRef in model_fields.

Adding model_rebuild() seems to resolve the issue, but I'm not sure if this behavior is a bug or if it’s expected. (As per docs, model_rebuild is only needed when the first attempt has failed, but it seems to me, that this should not fail at all).

I can open an issue if that would be helpful. Thank you. :)

@MarkusSintonen
Contributor

MarkusSintonen commented Sep 6, 2024

@skshetry does it work if you remove from __future__ import annotations? I guess it's (sometimes) expected to see ForwardRefs when using that. If it works without it, I think it should be documented that model_rebuild may be required when using __future__ import annotations 🤔
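
A minimal standard-library sketch of why unresolved references can show up here: under `from __future__ import annotations`, annotations are stored as strings and have to be evaluated later against some namespace. The classes below are plain Python stand-ins, not Pydantic models, and the module name `generated_models` is made up for illustration.

```python
from typing import get_type_hints

SOURCE = '''
from __future__ import annotations

class Model1:
    pass

class Model2:
    model1: Model1
'''

# Execute with a single dict so that it acts as the module namespace.
namespace = {'__name__': 'generated_models'}  # hypothetical module name
exec(SOURCE, namespace)

Model1 = namespace['Model1']
Model2 = namespace['Model2']

# With the future import, the annotation is kept as the string 'Model1'.
raw = Model2.__annotations__

# Resolving it requires a namespace in which 'Model1' can be looked up.
resolved = get_type_hints(Model2, globalns=namespace)

print(raw, resolved)
```

Whether a library ends up with a resolved class or a leftover forward reference therefore depends on which namespaces it consults when it evaluates the string.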

@skshetry

skshetry commented Sep 7, 2024

@skshetry does it work if you remove from __future__ import annotations?

Thank you for your comment. If I remove __future__ import from the above example, I get this error:

$ python script.py
Traceback (most recent call last):
  File "/Users/user/script.py", line 9, in <module>
    f()
  File "/Users/user/script.py", line 6, in f
    exec(source)
  File "<string>", line 8, in <module>
  File "<string>", line 9, in Model2
NameError: name 'Model1' is not defined

For comparison, if I directly invoke, it works fine.

$ python models.py
{'model1': FieldInfo(annotation=Model1, required=True)}

@Viicos
Member

Viicos commented Sep 7, 2024

In your case, using exec was working for an unexpected reason: it relied on the parent namespace fetching logic we had previously.

exec does things a bit differently than you would expect, especially when used inside functions. Your code will work fine if run in script.py in the module scope. Otherwise, you need to explicitly pass the globals:

def f():
    exec(source, globals())
    print(Model2)  # `<class '__main__.Model2'>`

f()

You may encounter new issues when using exec, and it should be generally avoided as a solution. We probably won't provide any non straightforward fixes to support such use cases.

More reading:

@skshetry

skshetry commented Sep 7, 2024

exec does things a bit differently than you would expect, especially when used inside functions. Your code will work fine if run in script.py in the module scope. Otherwise, you need to explicitly pass the globals.

We do pass globals. I am sorry, I cut some parts from the above example because I was not sure what was causing the issue. The real example looks something like the following:

def f():
    local_vars = {}
    source = Path("models.py").read_text()
    exec(source, globals(), local_vars)
    # access model from `local_vars`
    spec = local_vars["spec"]

f()

# models.py
from __future__ import annotations

from pydantic import BaseModel

class Model1(BaseModel):
    pass

class Model2(BaseModel):
    model1: Model1

model_fields = Model2.model_fields
print(model_fields)
spec = Model2

I guess it's a similar issue here (or, that the scope is not same?)

@Viicos
Member

Viicos commented Sep 9, 2024

By explicitly passing locals, you are hitting issues unrelated to Pydantic.

Let's say models.py contains the following:

A = 1

class B:
    a = A

As per the documentation:

Most users should just pass a globals argument and never locals. If exec gets two separate objects as globals and locals, the code will be executed as if it were embedded in a class definition.

So doing the following:

def f():
    ...
    exec(source, globals(), local_vars)

Is equivalent to:

def f():
    ...
    class SomeClass:
        A = 1
        class B:
            a = A

Which will result in a NameError, as A isn't defined in the scope of B.
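
This behaviour can be reproduced without Pydantic. In the sketch below, `SOURCE` is the small module from this comment; the function names `run_with_separate_locals` and `run_with_single_namespace` are hypothetical, and fresh dicts are used in place of `globals()` so the result does not depend on the surrounding module.

```python
SOURCE = '''
A = 1

class B:
    a = A
'''


def run_with_separate_locals() -> str:
    global_vars = {}
    local_vars = {}
    try:
        # Separate globals and locals: the source runs as if inside a class body,
        # so `A` is stored in local_vars and is invisible from the body of `B`.
        exec(SOURCE, global_vars, local_vars)
    except NameError as exc:
        return f'NameError: {exc}'
    return 'ok'


def run_with_single_namespace() -> int:
    namespace = {}
    # One dict serves as both globals and locals, like a real module namespace.
    exec(SOURCE, namespace)
    return namespace['B'].a


print(run_with_separate_locals())
print(run_with_single_namespace())
```

With a single namespace, `A` is a global of the executed code, so the class body of `B` can see it.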


What prevents you from importing models and grabbing spec from here? Again, I will strongly advise avoiding using exec.

@skshetry

skshetry commented Sep 9, 2024

What prevents you from importing models and grabbing spec from here? Again, I will strongly advise avoiding using exec.

We use pydantic in a library to dynamically generate pydantic models based on user's example unstructured data, so they are not saved anywhere.

Also, we don't want to pollute the user's workspace or sys.modules (which I believe importlib requires), or make any alterations to sys (as runpy.run_path does, which is also not thread-safe).

@MarkusSintonen
Contributor

dynamically generate pydantic models

@skshetry have you tried using create_model in Pydantic to generate models dynamically?

@skshetry

skshetry commented Sep 9, 2024

@skshetry have you tried using create_model in Pydantic to generate models dynamically?

We use pydantic and create_model() quite heavily in https://github.com/iterative/datachain. But the use case where it broke is validating that large amounts of unstructured data conform to the schema of some example data (which is also unstructured). We use datamodel-code-generator to generate a schema from that example data, and then apply it to other data to validate it.

We probably could have generated a jsonschema or implemented other mechanisms, but that's how it's implemented at the moment.

@sydney-runkle
Contributor Author

You may encounter new issues when using exec, and it should be generally avoided as a solution. We probably won't provide any non straightforward fixes to support such use cases.

I agree with @Viicos' analysis here 👍

Labels
relnotes-performance Used for performance improvements.
7 participants