Add `bytes` type validator #80

dswij · 2022-06-09T09:52:51Z

Part of #9

Add bytes type validator

codecov · 2022-06-09T09:57:39Z

Codecov Report

Merging #80 (1425049) into main (1f0d57f) will decrease coverage by 1.38%.
The diff coverage is 86.80%.

❗ Current head 1425049 differs from pull request most recent head 8bdc703. Consider uploading reports for the commit 8bdc703 to get more accurate results

@@            Coverage Diff             @@
##             main      #80      +/-   ##
==========================================
- Coverage   93.53%   92.15%   -1.39%     
==========================================
  Files          37       35       -2     
  Lines        3157     2779     -378     
  Branches       23       21       -2     
==========================================
- Hits         2953     2561     -392     
- Misses        204      218      +14

Impacted Files	Coverage Δ
src/errors/kinds.rs	`100.00% <ø> (ø)`
src/input/shared.rs	`100.00% <ø> (ø)`
src/input/input_json.rs	`93.10% <62.50%> (-4.34%)`	⬇️
src/validators/bytes.rs	`84.76% <84.76%> (ø)`
pydantic_core/_types.py	`100.00% <100.00%> (ø)`
src/input/input_abstract.rs	`100.00% <100.00%> (ø)`
src/input/input_python.rs	`87.27% <100.00%> (-4.64%)`	⬇️
src/input/return_enums.rs	`100.00% <100.00%> (ø)`
src/validators/mod.rs	`98.20% <100.00%> (-0.05%)`	⬇️
src/validators/string.rs	`90.96% <0.00%> (-5.81%)`	⬇️
... and 11 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

samuelcolvin

This is looking amazing, thank you so much.

Let me know if you have any questions.

samuelcolvin · 2022-06-10T09:06:36Z

src/input/input_abstract.rs

@@ -44,4 +44,10 @@ pub trait Input: fmt::Debug + ToPy + ToLocItem {
    fn lax_set<'data>(&'data self) -> ValResult<GenericSequence<'data>> {
        self.strict_set()
    }
+
+    fn strict_bytes(&self) -> ValResult<Vec<u8>>;


can we use a slice here with a lifetime?

I don't think we can use a slice here. Parsing the input creates a new value, and that value needs an owner.

A slice created here would point at a temporary that is dropped as soon as the function returns.
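To make the ownership point concrete, here is a minimal sketch of one possible middle ground using `std::borrow::Cow`. The `Input` enum and `to_bytes` function are hypothetical stand-ins, not pydantic-core types: `Raw` models input that already holds bytes and can be borrowed, while `Chars` models input that has to be encoded first, so the result needs an owner.

```rust
use std::borrow::Cow;

// Hypothetical stand-in for an input value; not a pydantic-core type.
enum Input {
    // Already stored as bytes, so a slice can be borrowed from it.
    Raw(Vec<u8>),
    // Must be encoded to UTF-8 first, which creates a new buffer.
    Chars(Vec<char>),
}

// Borrow when the bytes already exist, allocate only when they must be created.
fn to_bytes(input: &Input) -> Cow<'_, [u8]> {
    match input {
        Input::Raw(bytes) => Cow::Borrowed(bytes.as_slice()),
        Input::Chars(chars) => {
            // The encoded String is a new value; Cow::Owned gives it an owner.
            let encoded: String = chars.iter().collect();
            Cow::Owned(encoded.into_bytes())
        }
    }
}
```

Returning a plain `&[u8]` would only work for the `Raw` case; the `Chars` case is the situation described above, where the slice would outlive the value it points into.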

src/validators/bytes.rs

samuelcolvin · 2022-06-10T09:11:13Z

src/errors/kinds.rs

+    #[strum(message = "Value must be a valid bytes")]
+    BytesType,
+    #[strum(message = "Bytes must have at least {min_length} characters")]
+    BytesTooShort,


as per #76, I think we should have one TooShort error which we use for strings and bytes.

You can either leave this and we'll fix them all in one PR, or do it here.

You might want to choose a generic word like "Input" instead of "Bytes/String"

as per #76, I think we should have one TooShort error which we use for strings and bytes.

You can either leave this and we'll fix them all in one PR, or do it here.

Let's fix them all in one PR. I can get to it after this one.
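For reference, a shared error could look roughly like the sketch below. This is only an illustration of the idea from #76: the `ErrorKind` enum, its `TooShort` variant, the message wording, and `check_min_length` are all assumptions made for this example, and the real kinds in `src/errors/kinds.rs` use `strum` messages instead of a hand-written `Display`.

```rust
use std::fmt;

// Hypothetical error kind shared by strings and bytes; not the real enum.
#[derive(Debug, PartialEq)]
enum ErrorKind {
    TooShort { min_length: usize },
}

impl fmt::Display for ErrorKind {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // "Input" is the generic word, so the message fits both types.
            ErrorKind::TooShort { min_length } => {
                write!(f, "Input must have at least {} characters", min_length)
            }
        }
    }
}

// One length check that both validators could call with their own length.
fn check_min_length(len: usize, min_length: usize) -> Result<(), ErrorKind> {
    if len < min_length {
        Err(ErrorKind::TooShort { min_length })
    } else {
        Ok(())
    }
}
```

A string validator and a bytes validator would each pass in their own length and get the same error kind back.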

samuelcolvin · 2022-06-10T09:15:28Z

src/input/input_json.rs

+    fn strict_bytes(&self) -> ValResult<Vec<u8>> {
+        match self {
+            JsonInput::String(s) => Ok(s.clone().into_bytes()),
+            JsonInput::Int(int) => Ok(int.to_ne_bytes().to_vec()),


I've been thinking about this a lot.

As per this, I think that when to coerce and when to raise an error should follow the following rule:

if there's a (single) obvious representation in the field type AND converting to that loses no information, coerce, otherwise error.

"Single" added now while thinking about this.

I think what you have here is not the same as what we have in pydantic v1. In pydantic v1 12.5 would be converted to b'12.5', while with what you have here I think it would be converted to b'@)\x00\x00\x00\x00\x00\x00'.

This highlights that there's no single obvious representation of an int or float in bytes.

I therefore think we should remove int and float automatic conversion.

My plan for next week is to write a long form blog post on my plans for pydantic v2 which should cover all this.

I stumbled on this issue too while writing this. After giving it some thought, there's really no way to know exactly what value the user wants. I'd imagine arbitrarily choosing one coercion will add frustration.

if there's a (single) obvious representation in the field type AND converting to that loses no information, coerce, otherwise error.

This sounds like a good rule of thumb. If there's no single obvious representation, letting the user work around it will remove ambiguity.

Agreed 👍.
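The ambiguity discussed in this thread can be shown with a small standalone sketch. The two helper functions are hypothetical and exist only for illustration; they use standard library calls to produce the two candidate byte representations of `12.5` mentioned above.

```rust
// Textual representation: the decimal digits as ASCII, like b'12.5'.
fn float_as_text_bytes(value: f64) -> Vec<u8> {
    value.to_string().into_bytes()
}

// Binary representation: the raw IEEE 754 bits, most significant byte first.
fn float_as_be_bytes(value: f64) -> Vec<u8> {
    value.to_be_bytes().to_vec()
}
```

For `12.5` the first gives the four bytes of `b"12.5"` and the second gives `0x40 0x29` followed by six zero bytes, which is the `b'@)\x00\x00\x00\x00\x00\x00'` quoted above. `to_ne_bytes` adds a further wrinkle because its byte order depends on the platform, so on a little-endian machine the same value comes out reversed.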

pydantic_core/_types.py

tests/validators/test_bytes.py

dswij · 2022-06-14T08:45:32Z

The benchmark doesn't look too good if I understand it correctly:

---------------------------------------------------- benchmark 'bytes': 4 tests ---------------------------------------------------
Name (time in us)                         Min                    Mean                StdDev            Outliers  Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------
test_long_bytes_pyd                    1.0410 (1.0)            1.1529 (1.0)          0.1400 (1.02)     288;2370   87589           1
test_list_of_long_bytes_pyd            1.6250 (1.56)           1.7862 (1.55)         0.1374 (1.0)      912;6538   75472           1
test_long_bytes_core                 155.7910 (149.66)       160.4364 (139.16)       3.0425 (22.15)     309;217    2590           1
test_list_of_long_bytes_core     157,143.9590 (>1000.0)  160,641.3751 (>1000.0)  3,385.3945 (>1000.0)       2;0       7           1
-----------------------------------------------------------------------------------------------------------------------------------

samuelcolvin · 2022-06-14T09:19:17Z

The benchmark doesn't look too good if I understand it correctly:

looks like something must be wrong, I'll have a look.

samuelcolvin · 2022-06-14T10:11:08Z

The benchmark doesn't look too good if I understand it correctly:

Ok, solved.

You had two problems:

First, you weren't using an optimised build of pydantic-core (make build-fast, soon to be renamed to make build-prod). With an optimised build, performance improves roughly 10x:

------------------------------------------- benchmark 'bytes': 2 tests -------------------------------------------
Name (time in ns)            Min                  Mean              StdDev            Outliers  Rounds  Iterations
------------------------------------------------------------------------------------------------------------------
test_bytes_core         165.9500 (1.0)        220.4993 (1.0)       53.2037 (1.0)        211;14   27651           1
test_bytes_pyd        1,457.9855 (8.79)     1,575.0939 (7.14)     220.4102 (4.14)       46;698   36980           1
------------------------------------------------------------------------------------------------------------------

But we're still paying a penalty to convert PyBytes to Vec<u8> only to convert it straight back again when we have no length checks.

As per my recent work on dates and times (see #82), we can improve performance further by avoiding the conversion completely by using an enum - that's what I've added here.

With that the performance becomes:

------------------------------------------- benchmark 'bytes': 2 tests -------------------------------------------
Name (time in ns)            Min                  Mean              StdDev            Outliers  Rounds  Iterations
------------------------------------------------------------------------------------------------------------------
test_bytes_core          40.9782 (1.0)        101.8615 (1.0)       65.8615 (1.0)         95;95   88559           1
test_bytes_pyd        1,291.0459 (31.51)    1,396.9133 (13.71)    361.6564 (5.49)         4;38    2871           1
------------------------------------------------------------------------------------------------------------------

I've added the return_enums.rs file as we'll want to do this for a number of other types in future.

I've also cleaned up your benchmarks somewhat, you had a benchmark with just a list with a single element which wasn't doing anything.
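The enum approach described in this comment can be sketched without PyO3 as follows. This is a simplified illustration under stated assumptions, not the code in `return_enums.rs`: `EitherBytes`, its variants, and `validate_bytes` are names chosen for this example, with `Borrowed` standing in for bytes the Python object already owns and `Owned` for bytes created during parsing.

```rust
// Hypothetical stand-in for the return enum; names are assumptions.
#[derive(Debug)]
enum EitherBytes<'a> {
    // Bytes that already exist in the input; nothing is copied.
    Borrowed(&'a [u8]),
    // Bytes created while parsing, for example from a JSON string.
    Owned(Vec<u8>),
}

impl<'a> EitherBytes<'a> {
    // The length is available for both variants without any conversion.
    fn len(&self) -> usize {
        match self {
            EitherBytes::Borrowed(slice) => slice.len(),
            EitherBytes::Owned(vec) => vec.len(),
        }
    }
}

// Run the optional length check, then hand the input back unchanged.
fn validate_bytes<'a>(
    input: EitherBytes<'a>,
    min_length: Option<usize>,
) -> Result<EitherBytes<'a>, String> {
    if let Some(min) = min_length {
        if input.len() < min {
            return Err(format!("Bytes must have at least {} characters", min));
        }
    }
    Ok(input)
}
```

When no length constraint is set, a borrowed input passes straight through, which is the round trip through `Vec<u8>` that the comment says is being avoided.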

dswij · 2022-06-14T15:33:59Z

I've added the return_enums.rs file as we'll want to do this for a number of other types in future.

I've also cleaned up your benchmarks somewhat, you had a benchmark with just a list with a single element which wasn't doing anything.

Sounds great. Is there anything else missing for the bytes validator?

samuelcolvin · 2022-06-14T18:10:14Z

Conflicts, otherwise LGTM.

samuelcolvin · 2022-06-14T19:58:10Z

thanks so much for this.

dswij force-pushed the main branch from 5ae6fb3 to b0a66ab Compare June 9, 2022 10:00

init bytes type

5d84607

dswij force-pushed the main branch from eeabce3 to 5d84607 Compare June 10, 2022 08:33

single quote lint

ee06d02

samuelcolvin reviewed Jun 10, 2022

Remove int and float coercion to bytes

23b08f9

dswij force-pushed the main branch from 3787ebf to db9868f Compare June 13, 2022 05:15

Finish tests

703a0a4

dswij force-pushed the main branch from 5823147 to 703a0a4 Compare June 13, 2022 06:20

dswij added 3 commits June 13, 2022 14:27

fix json string test

c59581a

Remove config setting for BytesValidator

6d1ad30

Add bytes case to test_typing

2c10c57

dswij marked this pull request as ready for review June 13, 2022 11:05

dswij added 2 commits June 14, 2022 15:01

Add benchmark for bytes type

a503a77

use slice for validation logic

d1fcbbf

using enum for bytes

152b6c1

use IntoPy

1425049

samuelcolvin mentioned this pull request Jun 14, 2022

More types #9

Closed

12 tasks

Merge branch 'main' into dswij-main

8bdc703

samuelcolvin merged commit 2b46ec5 into pydantic:main Jun 14, 2022
