Conversation

adriangb (Member) commented May 20, 2024

This PR introduces a new experimental API for composing constraints, parsing, and transformations of types in a more type-safe way.

from typing import Annotated
from pydantic import BaseModel, validate_as, transform

class User(BaseModel):
    username: Annotated[str, transform(str.lower)]
    age: Annotated[int, validate_as().gt(0)]

More specifically, from the docs updates:

This PR introduces an experimental "pipeline" API that allows composing parsing (validation), constraints and transformations in a more type-safe manner than existing APIs. This API is subject to change or removal; we are looking for feedback and suggestions before making it a permanent part of Pydantic.

Generally, the pipeline API is used to define a sequence of steps to apply to incoming data during validation. The pipeline API is designed to be more type-safe and composable than the existing Pydantic API.

Each step in the pipeline can be:

  • A validation step that runs pydantic validation on the provided type
  • A transformation step that modifies the data
  • A constraint step that checks the data against a condition
  • A predicate step that checks the data against a condition and raises an error if it returns False
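As a rough mental model of the four step kinds, here is a plain-Python sketch (no pydantic dependency; `MiniPipeline` and all its method names are illustrative, not the PR's actual API):

```python
# Toy model of a pipeline: an ordered list of steps, where each step
# maps a value to a value or raises ValueError. Illustrative only.
class MiniPipeline:
    def __init__(self, steps=None):
        self.steps = steps or []

    def _with(self, step):
        return MiniPipeline(self.steps + [step])

    def validate(self, tp):
        # validation step: parse/coerce into the target type
        return self._with(tp)

    def transform(self, fn):
        # transformation step: modify the data
        return self._with(fn)

    def constrain(self, check, msg):
        # constraint/predicate step: check a condition, error when it is False
        def step(v):
            if not check(v):
                raise ValueError(msg)
            return v
        return self._with(step)

    predicate = constrain

    def run(self, value):
        for step in self.steps:
            value = step(value)
        return value


username = MiniPipeline().validate(str).transform(str.lower)
age = MiniPipeline().validate(int).constrain(lambda x: x > 0, 'must be > 0')
```

Running `age.run("3")` parses the string to `3` and passes the constraint, while `age.run(-1)` raises `ValueError` at the constraint step.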

You can see more detail in the docs here: https://transform.pydantic-docs.pages.dev/concepts/experimental/#importing-experimental-feature-warnings

Also, this PR is very much aligned with PEP 746, which introduces type checking for annotated metadata.

Fix #9522

^^ adding new experimental feature pattern docs to the version policy

@github-actions github-actions bot added the relnotes-fix Used for bugfixes. label May 20, 2024

cloudflare-workers-and-pages bot commented May 20, 2024

Deploying pydantic-docs with Cloudflare Pages

Latest commit: 166df3d
Status: ✅  Deploy successful!
Preview URL: https://b2819b73.pydantic-docs.pages.dev
Branch Preview URL: https://transform.pydantic-docs.pages.dev



codspeed-hq bot commented May 20, 2024

CodSpeed Performance Report

Merging #9459 will not alter performance

Comparing transform (166df3d) with main (93d6015)

Summary

✅ 13 untouched benchmarks

Contributor

@sydney-runkle sydney-runkle left a comment

Very cool API you've designed here. I enjoyed this review, and am looking forward to your thoughts re my organizational questions!

@sydney-runkle
Contributor

Nice. I like the new namespace changes, and thanks for switching over to annotations in the docs!

@adriangb
Member Author

@sydney-runkle I think the main thing missing here is some tests and validating JSON schema?

@sydney-runkle
Contributor

@adriangb,

Indeed. I think we need:

  1. More testing (I'd be happy to add some tests, which will help me get more familiar with this API)
  2. JSON schema support - I can also try to work on this, if it helps!
  3. Docs for JSON schema
  4. Do we want to go ahead and support aggregations? Maybe not in this PR, but a feature for down the line?
  5. Let's decide on how we want to introduce experimental features going forward (experimental module, prefix, both, etc)

@sydney-runkle
Contributor

Other questions I thought about while working on tests:

  • Do we want to call the file with this logic transform.py eventually? I feel like that might confuse some folks, it's a pretty hot word in the data science space... don't want to step on any toes like we did with model.

Contributor

@davidhewitt davidhewitt left a comment


Overall I like this, though the verbosity seems quite high. It would be nice to explore ideas for how to encourage users to make reusable pipeline pieces, and similar techniques to reduce verbosity.

username: Annotated[str, parse(str).str.pattern(r'[a-z]+')]  # (3)!
password: Annotated[
    str,
    parse(str).transform(str.lower).predicate(lambda x: x != 'password'),
]  # (4)!
Contributor


Two observations here:

  • It feels pretty verbose to have this pipeline inline. Should we encourage users to make it reusable?

    validate_password = parse(str).transform(str.lower).predicate(lambda x: x != 'password')

    class User(BaseModel):
        password: Annotated[str, validate_password]
  • Is there a way to not need the Annotated? I can't see an obvious one, but I'd love for it to work somehow. e.g. Pipeline[str, transform(str.lower).predicate(...)]

Member Author


Pipeline[str, transform(str.lower).predicate(...)] would work if we just export Pipeline = Annotated (I think) but that's not much better? I would love for foo: parse(str).transform(int) to work and infer the output type as int but that just doesn't work.
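Concretely, the alias idea is just re-exporting `Annotated` under another name; a small sketch (illustrative only, and the metadata string is a placeholder for real pipeline metadata):

```python
# Illustrative only: exporting Pipeline as a plain alias of Annotated.
from typing import Annotated, get_args

Pipeline = Annotated  # hypothetical alias, not something the PR exports

Password = Pipeline[str, "validate-password"]

# At runtime nothing changes: it is still Annotated[str, ...] underneath,
# so the spelling is sugar rather than a new construct.
assert get_args(Password) == (str, "validate-password")
```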

Should we encourage users to make it reusable?

I agree, making examples reusable makes sense. I would suggest:

# or call some password check/validation
Password = Annotated[str, parse(str).predicate(lambda x: x != 'password')]

class User(BaseModel):
    password: Password

5. Use the `|` or `&` operators to combine steps (like a logical OR or AND).
6. Calling `parse()` with no arguments implies `parse(<field type>)`. Use `parse(Any)` to accept any type.
7. For recursive types you can use `parse_defer` to reference the type itself before it's defined.
8. You can call `parse()` before or after other steps to do pre or post processing.
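The `|`/`&` combinators described in note 5 can be modeled with plain functions (a sketch of the semantics, not the actual pydantic implementation; all names are illustrative):

```python
# "|": try the left chain; on failure, fall back to the right chain.
def either(a, b):
    def run(v):
        try:
            return a(v)
        except ValueError:
            return b(v)
    return run

# "&": pipe the value through the left chain, then the right chain.
def both(a, b):
    def run(v):
        return b(a(v))
    return run

def positive_int(v):
    n = int(v)
    if n <= 0:
        raise ValueError('must be positive')
    return n

# e.g. accept a positive int, or fall back to a default of 0
parse_or_default = either(positive_int, lambda v: 0)
```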
Contributor


Can you explain what this means? I assume this is about interactions with BeforeValidator. How do we implement wrap validators in this model?

Member Author


The point is you can do Annotated[int, parse(str).transform(str.strip).parse().transform(lambda x: x * 2)], which is a wrap validator since it does (1) validate as a string and strip whitespace, (2) parse as an integer and (3) multiply by 2.

Contributor


Oh I see. Using .parse() like that was not at all obvious to me. Is it a significant ergonomic win to allow the argument to be inferred? Otherwise I might argue to make it required for now, to simplify.

Or write an explicit section which is like

BeforeValidator -> parse(str).transform(bv).parse()
AfterValidator -> parse().transform(av)
WrapValidator -> parse(str).transform(bv).parse().transform(av)
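The mapping above can be modeled with plain functions, where `parse_t` stands for the core parse step and `bv`/`av` are the user's before/after hooks (a sketch of the semantics, not pydantic's implementation):

```python
def before_style(parse_t, bv):
    # BeforeValidator: transform the raw input, then parse
    return lambda v: parse_t(bv(v))

def after_style(parse_t, av):
    # AfterValidator: parse first, then transform the parsed value
    return lambda v: av(parse_t(v))

def wrap_style(parse_t, bv, av):
    # WrapValidator: pre-process, parse, post-process
    return lambda v: av(parse_t(bv(v)))

# e.g. the strip-then-parse-then-double chain from the earlier comment
double_stripped = wrap_style(int, str.strip, lambda x: x * 2)
```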

... @sydney-runkle and I had an interesting use case for a wrap validator the other day which split IANA timezone off a timestamp so that we could do the timestamp in Rust and then reconstruct a tz-aware datetime. It would be interesting to see how to implement that in this API, as it's not obvious to me.

Member Author


Could you elaborate on that example and we can try and see how it would work out?

Member Author

@adriangb adriangb May 31, 2024


is it a significant ergonomic win to allow the argument to be inferred

I'd say so, consider if the type is list[MyGeneric[dict[str, set[bool]]]] or something...

Member Author

@adriangb adriangb May 31, 2024


had an interesting use case for a wrap validator the other day which split IANA timezone off a timestamp so that we could do the timestamp in Rust and then reconstruct a tz-aware datetime

We could also add an explicit wrap validator: parse(str).transform(lambda x, h: h(x.strip())).parse(int)?

I don't really like that because it breaks the linearity and typing.
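The `lambda x, h: h(x.strip())` shape above passes the remainder of the pipeline in as a handler; the wiring can be sketched in plain Python (names hypothetical):

```python
# A handler-style (wrap) step: the user function receives the value plus a
# handler h that runs the rest of the chain (here, just int parsing).
def wrap_step(fn, rest):
    return lambda v: fn(v, rest)

stripped_int = wrap_step(lambda x, h: h(x.strip()), int)
```

As the comment notes, the downside is that `fn` now has a two-argument signature, which breaks the linear value-to-value typing the rest of the pipeline enjoys.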

"""Pipe the result of one validation chain into another."""
return _Pipeline([_PipelineAnd(self, other)])

__and__ = then
Contributor


I think I understand why this API was added: as long as the input and output types of a pipeline are the same, running them in sequence is basically equivalent. However, I'll note that sequencing the items like this may end up being unintuitive, in particular if you expect an error for each failure when there are multiple independent validators, rather than just the first failure. I understand it's hard to "fix" that given that transformations are possible, though.

Member Author


I take it that you mean that for validate_as(int).gt(0) & validate_as(int).gt(1) you’d expect -1 to give two errors? Indeed it would only give one. I think that’s reasonable behavior.

I also think you're saying that for the case where you have a chain of constraints you could error for all of them, e.g. validate_as(int).gt(0).gt(1), and indeed that will happen if they're all one after another and are known constraints, but not if they're custom predicates or there's a transformation between them. And also not when you use &. I think that's okay.

The one improvement we could make is "collapsing" sequential constraints into one level, e.g. validate_as(int).predicate(lambda x: x > 0).predicate(lambda x: x % 2 == 0) could give both errors despite it being arbitrary user code. That's a reasonable future feature request that shouldn't be too hard to implement.
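The "collapsing" idea sketched in plain Python (illustrative only): evaluate a run of adjacent predicates against the same value and report every failure at once instead of stopping at the first.

```python
def run_predicates(value, predicates):
    """predicates: iterable of (check, message) pairs applied to one value."""
    errors = [msg for check, msg in predicates if not check(value)]
    if errors:
        raise ValueError('; '.join(errors))
    return value
```

A transformation between two predicates would end the run, since the later predicate no longer sees the same value.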

Contributor

@dmontagu dmontagu Jun 5, 2024


Yeah, my point was that I might expect validate_as(int).predicate(lambda x: x > 0) & validate_as(int).predicate(lambda x: x % 2 == 0) to produce two errors. (Well, knowing it's doing chaining, I'm not even sure if that's valid as written there since it repeats the validate_as(int), but that's what my intuition would be for how I'd expect to use &.) I think it's reasonable for validate_as(int).predicate(lambda x: x > 0).predicate(lambda x: x % 2 == 0) to just produce one though.

Member Author

@adriangb adriangb Jun 5, 2024


My expectation would be the exact opposite: & is usually an eager operation, e.g. False & 1 / 0 raises rather than short-circuiting. So I don't think there's a valid general intuition here. If we could make it behave as you expect I suspect there'd be complaints because it's unintuitive or because it is doing unnecessary work. I don't know why you'd expect validate_as(int).predicate(lambda x: x > 0).predicate(lambda x: x % 2 == 0) to produce just one error. Do you also expect Field(min_length=10, pattern=...) to produce a single error? In any case, given that we can't really change the behavior and that no users have complained yet, I suspect trying to determine what is most intuitive here is not a productive debate.

@grantmwilliams

I'm late to the conversation, but I'd expect it to short-circuit the same way it would for an if expression or any other usage (that I know of) for and.

Member Author


@grantmwilliams I think it works as you'd expect then, right?

@grantmwilliams

@adriangb I've only tested it a bit, but it seems to work exactly as expected.

@adriangb
Member Author

adriangb commented Jun 5, 2024

@sydney-runkle the changes to make validate_as(...) work seem to have broken type checking: validate_as(str).transform(lambda x: x + 1) no longer fails type checking because x is Any

@sydney-runkle
Contributor

@sydney-runkle the changes to make validate_as(...) work seem to have broken type checking: validate_as(str).transform(lambda x: x.strip()) x is now Any.

😢 Ugh, I'll take another look. I tried overloading, but types.EllipsisType isn't available on 3.8 and 3.9.
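One possible shape for that workaround (a sketch, not the PR's actual solution): dispatch the overloads on a private sentinel class instead of `EllipsisType`, which works on 3.8/3.9. The runtime body here is purely illustrative.

```python
from typing import Any, TypeVar, Union, overload

T = TypeVar('T')

class _Unset:
    """Sentinel type standing in for an omitted argument."""

_UNSET = _Unset()

@overload
def validate_as(tp: _Unset = ...) -> Any: ...
@overload
def validate_as(tp: "type[T]") -> T: ...

def validate_as(tp: Union[type, _Unset] = _UNSET):
    # Illustrative runtime behavior: None means "infer from the field type".
    return None if isinstance(tp, _Unset) else tp
```

Type checkers then see `validate_as()` as returning `Any` (or the inferred field type) and `validate_as(str)` as returning `str`, so lambdas passed to `.transform` keep a concrete argument type.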

sydney-runkle and others added 5 commits June 5, 2024 13:16
Co-authored-by: David Montague <35119617+dmontagu@users.noreply.github.com>
Co-authored-by: David Montague <35119617+dmontagu@users.noreply.github.com>
@sydney-runkle
Contributor

@adriangb,

What's left to do here?

@adriangb
Member Author

adriangb commented Jun 6, 2024

@sydney-runkle I have a couple of follow-up ideas, but I think we should merge this and iterate from there.

@sydney-runkle
Contributor

@adriangb,

Awesome, going to merge then! Could you document those thoughts in an issue at some point for reference?

Successfully merging this pull request may close these issues.

Write new feature pattern docs
7 participants