Add bytes type validator #80

Merged
merged 12 commits into from
Jun 14, 2022
Contributor

@dswij dswij commented Jun 9, 2022

Part of #9

Add bytes type validator


codecov bot commented Jun 9, 2022

Codecov Report

Merging #80 (1425049) into main (1f0d57f) will decrease coverage by 1.38%.
The diff coverage is 86.80%.

❗ Current head 1425049 differs from pull request most recent head 8bdc703. Consider uploading reports for the commit 8bdc703 to get more accurate results

@@            Coverage Diff             @@
##             main      #80      +/-   ##
==========================================
- Coverage   93.53%   92.15%   -1.39%     
==========================================
  Files          37       35       -2     
  Lines        3157     2779     -378     
  Branches       23       21       -2     
==========================================
- Hits         2953     2561     -392     
- Misses        204      218      +14     
Impacted Files Coverage Δ
src/errors/kinds.rs 100.00% <ø> (ø)
src/input/shared.rs 100.00% <ø> (ø)
src/input/input_json.rs 93.10% <62.50%> (-4.34%) ⬇️
src/validators/bytes.rs 84.76% <84.76%> (ø)
pydantic_core/_types.py 100.00% <100.00%> (ø)
src/input/input_abstract.rs 100.00% <100.00%> (ø)
src/input/input_python.rs 87.27% <100.00%> (-4.64%) ⬇️
src/input/return_enums.rs 100.00% <100.00%> (ø)
src/validators/mod.rs 98.20% <100.00%> (-0.05%) ⬇️
src/validators/string.rs 90.96% <0.00%> (-5.81%) ⬇️
... and 11 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Member

@samuelcolvin samuelcolvin left a comment

This is looking amazing, thank you so much.

Let me know if you have any questions.

@@ -44,4 +44,10 @@ pub trait Input: fmt::Debug + ToPy + ToLocItem {
    fn lax_set<'data>(&'data self) -> ValResult<GenericSequence<'data>> {
        self.strict_set()
    }

    fn strict_bytes(&self) -> ValResult<Vec<u8>>;
Member

can we use a slice here with a lifetime?

Contributor Author

@dswij dswij Jun 12, 2022

I don't think we can use a slice here. We're creating a new value when we parse the input, and that new value needs an owner.

A slice returned here would point at a temporary that is dropped as soon as the function returns.
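
To illustrate the ownership point, here is a minimal sketch rather than the PR's code. `JsonInput` below is a hypothetical two-variant stand-in, and `bytes_owned` / `bytes_cow` are illustrative names. It assumes the string variant owns its buffer (so a borrow is possible there), while the integer bytes are produced on the fly and need an owner.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for the PR's `JsonInput`; only two variants are modelled.
#[derive(Debug)]
pub enum JsonInput {
    String(String),
    Int(i64),
}

impl JsonInput {
    /// Owned return type, as in the PR's signature: always valid, but always copies.
    pub fn bytes_owned(&self) -> Vec<u8> {
        match self {
            JsonInput::String(s) => s.clone().into_bytes(),
            JsonInput::Int(i) => i.to_ne_bytes().to_vec(),
        }
    }

    /// Borrow when the input already owns the bytes, allocate otherwise.
    pub fn bytes_cow(&self) -> Cow<'_, [u8]> {
        match self {
            // The String owns its buffer, so a slice tied to `&self` is fine.
            JsonInput::String(s) => Cow::Borrowed(s.as_bytes()),
            // `to_ne_bytes` returns a temporary array. Returning a reference to it
            // would not compile, so these bytes need an owner.
            JsonInput::Int(i) => Cow::Owned(i.to_ne_bytes().to_vec()),
        }
    }
}
```

Under these assumptions, a plain `&[u8]` return type only works for inputs that already hold their bytes; any variant that computes new bytes forces an owned value somewhere.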

src/validators/bytes.rs (outdated, resolved)
#[strum(message = "Value must be a valid bytes")]
BytesType,
#[strum(message = "Bytes must have at least {min_length} characters")]
BytesTooShort,
Member

as per #76, I think we should have one TooShort error which we use for strings and bytes.

You can either leave this and we'll fix them all in one PR, or do it here.

Member

You might want to choose a generic word like "Input" instead of "Bytes/String".

Contributor Author

as per #76, I think we should have one TooShort error which we use for strings and bytes.

You can either leave this and we'll fix them all in one PR, or do it here.

Let's fix them all in one PR. I can get to it after this one.
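
One way the shared error could look is sketched below. This is not the project's implementation: `ErrorKind`, `check_min_length`, `validate_str`, `validate_bytes` and the message wording are all hypothetical, and the sketch uses plain `std::fmt` rather than `strum`. The caller supplies the length, so strings can count characters while bytes count bytes.

```rust
use std::fmt;

/// Hypothetical error kind shared by the string and bytes validators.
#[derive(Debug, PartialEq, Eq)]
pub enum ErrorKind {
    TooShort { min_length: usize },
}

impl fmt::Display for ErrorKind {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // Generic wording ("Input") so the message fits both types.
            ErrorKind::TooShort { min_length } => {
                write!(f, "Input must have a length of at least {}", min_length)
            }
        }
    }
}

/// Shared check: the caller decides how length is measured.
pub fn check_min_length(len: usize, min_length: usize) -> Result<(), ErrorKind> {
    if len < min_length {
        Err(ErrorKind::TooShort { min_length })
    } else {
        Ok(())
    }
}

/// Strings are measured in characters, not UTF-8 bytes.
pub fn validate_str(value: &str, min_length: usize) -> Result<&str, ErrorKind> {
    check_min_length(value.chars().count(), min_length)?;
    Ok(value)
}

/// Bytes are measured in bytes.
pub fn validate_bytes(value: &[u8], min_length: usize) -> Result<&[u8], ErrorKind> {
    check_min_length(value.len(), min_length)?;
    Ok(value)
}
```

With this shape, both validators report the same error kind, and only the way the length is measured differs.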

fn strict_bytes(&self) -> ValResult<Vec<u8>> {
    match self {
        JsonInput::String(s) => Ok(s.clone().into_bytes()),
        JsonInput::Int(int) => Ok(int.to_ne_bytes().to_vec()),
Member

@samuelcolvin samuelcolvin Jun 10, 2022

I've been thinking about this a lot.

As per this, I think that deciding when to coerce and when to raise an error should follow this rule:

if there's a (single) obvious representation in the field type AND converting to that loses no information, coerce, otherwise error.

"Single" added now while thinking about this.

I think what you have here is not the same as what we have in pydantic v1. In pydantic v1 12.5 would be converted to b'12.5', while with what you have here I think it would be converted to b'@)\x00\x00\x00\x00\x00\x00'.

This highlights that there's no single obvious representation of an int or float in bytes.

I therefore think we should remove int and float automatic conversion.
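
The ambiguity can be made concrete with a small sketch using only the Rust standard library; the three function names are illustrative. The textual form gives `b"12.5"`, the big-endian IEEE 754 form gives the `b'@)\x00...'` value quoted above, and the little-endian form gives yet another byte string. The `to_ne_bytes` call in the diff uses the platform's native order, so its result would also depend on the machine.

```rust
/// Textual form, matching the pydantic v1 behaviour described above.
pub fn float_as_text_bytes(value: f64) -> Vec<u8> {
    value.to_string().into_bytes()
}

/// Raw IEEE 754 bits, most significant byte first.
pub fn float_as_be_bytes(value: f64) -> [u8; 8] {
    value.to_be_bytes()
}

/// Raw IEEE 754 bits, least significant byte first.
pub fn float_as_le_bytes(value: f64) -> [u8; 8] {
    value.to_le_bytes()
}
```

For `12.5` these return `b"12.5"`, `[0x40, 0x29, 0, 0, 0, 0, 0, 0]` and `[0, 0, 0, 0, 0, 0, 0x29, 0x40]` respectively: three defensible answers, so no single obvious one.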

My plan for next week is to write a long form blog post on my plans for pydantic v2 which should cover all this.

Contributor Author

I stumbled on this issue too while writing this. After giving it some thought, there's really no way to know exactly what value the user wants. I'd imagine that arbitrarily choosing a coercion would add frustration.

if there's a (single) obvious representation in the field type AND converting to that loses no information, coerce, otherwise error.

This sounds like a good rule of thumb. If there's no single obvious representation, letting the user work around it will remove ambiguity.

Member

Agreed 👍.

pydantic_core/_types.py (resolved)
tests/validators/test_bytes.py (resolved)
@dswij dswij marked this pull request as ready for review June 13, 2022 11:05
Contributor Author

dswij commented Jun 14, 2022

The benchmark doesn't look too good if I understand it correctly:

---------------------------------------------------- benchmark 'bytes': 4 tests ---------------------------------------------------
Name (time in us)                         Min                    Mean                StdDev            Outliers  Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------
test_long_bytes_pyd                    1.0410 (1.0)            1.1529 (1.0)          0.1400 (1.02)     288;2370   87589           1
test_list_of_long_bytes_pyd            1.6250 (1.56)           1.7862 (1.55)         0.1374 (1.0)      912;6538   75472           1
test_long_bytes_core                 155.7910 (149.66)       160.4364 (139.16)       3.0425 (22.15)     309;217    2590           1
test_list_of_long_bytes_core     157,143.9590 (>1000.0)  160,641.3751 (>1000.0)  3,385.3945 (>1000.0)       2;0       7           1
-----------------------------------------------------------------------------------------------------------------------------------

@samuelcolvin
Member

The benchmark doesn't look too good if I understand it correctly:

looks like something must be wrong, I'll have a look.

@samuelcolvin
Member

The benchmark doesn't look too good if I understand it correctly:

Ok, solved.

You had two problems:

First, you weren't using an optimised build of pydantic-core (make build-fast, soon to be renamed to make build-prod). With that, performance increases roughly 10x to:

------------------------------------------- benchmark 'bytes': 2 tests -------------------------------------------
Name (time in ns)            Min                  Mean              StdDev            Outliers  Rounds  Iterations
------------------------------------------------------------------------------------------------------------------
test_bytes_core         165.9500 (1.0)        220.4993 (1.0)       53.2037 (1.0)        211;14   27651           1
test_bytes_pyd        1,457.9855 (8.79)     1,575.0939 (7.14)     220.4102 (4.14)       46;698   36980           1
------------------------------------------------------------------------------------------------------------------

But we're still paying a penalty to convert PyBytes to Vec<u8> only to convert it straight back again when we have no length checks.

As per my recent work on dates and times (see #82), we can improve performance further by avoiding the conversion completely by using an enum. That's what I've added here.

With that the performance becomes:

------------------------------------------- benchmark 'bytes': 2 tests -------------------------------------------
Name (time in ns)            Min                  Mean              StdDev            Outliers  Rounds  Iterations
------------------------------------------------------------------------------------------------------------------
test_bytes_core          40.9782 (1.0)        101.8615 (1.0)       65.8615 (1.0)         95;95   88559           1
test_bytes_pyd        1,291.0459 (31.51)    1,396.9133 (13.71)    361.6564 (5.49)         4;38    2871           1
------------------------------------------------------------------------------------------------------------------

I've added the return_enums.rs file as we'll want to do this for a number of other types in future.
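
The idea of returning an enum instead of always converting can be sketched as follows. This is an assumption-laden illustration, not the contents of return_enums.rs: `EitherBytes`, its variant names and `validate_min_length` are hypothetical, and a borrowed `&[u8]` stands in for a reference to an existing Python bytes object so the sketch needs no PyO3.

```rust
/// Hypothetical sketch: either a borrowed view of bytes that already exist
/// (standing in for a Python `bytes` object) or bytes created during parsing.
#[derive(Debug)]
pub enum EitherBytes<'a> {
    Borrowed(&'a [u8]),
    Owned(Vec<u8>),
}

impl<'a> EitherBytes<'a> {
    /// The length is available without copying either variant.
    pub fn len(&self) -> usize {
        match self {
            EitherBytes::Borrowed(b) => b.len(),
            EitherBytes::Owned(v) => v.len(),
        }
    }

    /// View the bytes regardless of which variant holds them.
    pub fn as_slice(&self) -> &[u8] {
        match self {
            EitherBytes::Borrowed(b) => *b,
            EitherBytes::Owned(v) => v.as_slice(),
        }
    }
}

/// Checks the length and hands the input back unchanged, so a borrowed
/// input is never copied into a new `Vec<u8>`.
pub fn validate_min_length(
    input: EitherBytes<'_>,
    min_length: usize,
) -> Result<EitherBytes<'_>, String> {
    if input.len() < min_length {
        Err(format!("expected at least {} bytes, got {}", min_length, input.len()))
    } else {
        Ok(input)
    }
}
```

When the input is already a bytes object, the borrowed variant passes through validation untouched, which is the round trip the comment above describes avoiding.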

I've also cleaned up your benchmarks somewhat, you had a benchmark with just a list with a single element which wasn't doing anything.

Contributor Author

dswij commented Jun 14, 2022

I've added the return_enums.rs file as we'll want to do this for a number of other types in future.

I've also cleaned up your benchmarks somewhat, you had a benchmark with just a list with a single element which wasn't doing anything.

Sounds great. Is there anything else missing for the bytes validator?

@samuelcolvin samuelcolvin mentioned this pull request Jun 14, 2022
12 tasks
@samuelcolvin
Member

Conflicts, otherwise LGTM.

@samuelcolvin samuelcolvin merged commit 2b46ec5 into pydantic:main Jun 14, 2022
@samuelcolvin
Member

thanks so much for this.
