
Validate bare items on construction #102

Open · wants to merge 17 commits into base: master

Conversation


@zcei zcei commented Jan 21, 2023

Closes #98

Hej folks,

as promised in the related issue, I took a stab at validating the bare items on construction.

For the moment I have introduced BareItem::Validated* variants adjacent to their unvalidated enum member counterparts. Using Deref I was able to pass them quite easily to their respective RefBareItem version, and thus could avoid a breaking change if desired. I think eventually the shorthand version of a BareItem member should be the safe variant, as shorter names tend to be used more frequently.

To demonstrate that constructing the bare_item::* structs without using (Try)From is not possible, I've added an unused function in the bench setup. When used from within the same module tree, one can access the private struct members, so I had to do it outside of it.
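The construction guarantee described above comes down to Rust's field privacy: a minimal, self-contained sketch of the pattern (hypothetical names, not the crate's exact code) might look like this, where the only way in from outside the defining module is the validating TryFrom:

```rust
// Sketch of the newtype pattern described above (hypothetical names,
// not the actual sfv code). The inner field is private, so external
// code must go through `TryFrom`, which validates on construction.
pub struct Token(String);

impl TryFrom<&str> for Token {
    type Error = &'static str;

    fn try_from(value: &str) -> Result<Self, Self::Error> {
        // Partial check only: RFC 8941 tokens must start with an ALPHA or '*'.
        match value.chars().next() {
            Some(c) if c.is_ascii_alphabetic() || c == '*' => Ok(Token(value.to_owned())),
            _ => Err("token must start with an ALPHA or '*'"),
        }
    }
}
```

Outside the defining module, `Token("1foo".into())` fails to compile because the field is private; within the same module tree the field is reachable, which is why the demonstration function had to live outside it.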

BareItem variant validations

I looked at all the currently existing enum variants and mapped out whether they have validations that run in the parse or serialize steps.

| Variant | Serialize validation | Parse validation |
| ------- | -------------------- | ---------------- |
| Decimal | ✔ | ✔ |
| Integer | ✔ | ✔ |
| String  | ✔ | ✔ |
| ByteSeq |   | ✔ |
| Boolean |   | ✔ |
| Token   | ✔ | ✔ |

(Right now all variants have some form of parse validation to verify the integrity of the input strings. I later decided not to focus on the parsing step for the moment, so this is mentioned only for completeness.)

On the serialization side, all but Boolean and ByteSeq check for specific conditions on the backing data structure.
For example, i64 supports a bigger number range than the spec allows for an Integer, Tokens have to start with specific characters, etc.
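To make the Integer point concrete: RFC 8941 limits an sf-integer to 15 decimal digits, which is narrower than i64. A construction-time range check could look like this hypothetical helper (not the crate's API):

```rust
// RFC 8941 restricts sf-integer to at most 15 decimal digits,
// i.e. -999,999,999,999,999 ..= 999,999,999,999,999.
// Hypothetical helper for illustration, not part of the sfv crate.
const SF_INTEGER_MAX: i64 = 999_999_999_999_999;

fn validate_sf_integer(value: i64) -> Result<i64, &'static str> {
    if (-SF_INTEGER_MAX..=SF_INTEGER_MAX).contains(&value) {
        Ok(value)
    } else {
        Err("integer exceeds the 15-digit range allowed by RFC 8941")
    }
}
```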

Approach

For each variant that needs validation while being serialized, its value struct implements a TryFrom which uses the Serializer::serialize_* method to check for correctness on construction. It is somewhat wasteful; in a follow-up PR, validation could be moved out to the respective structs, called upon construction, and then omitted during serialization.

The value structs' inner value is public within the crate. This means that after parsing an input we can create the value structs without going through validation again.

For the two variants that don't need validation and can always be serialized (Boolean and ByteSeq), the From trait is implemented. Backing value structs are created to a) keep the implementation consistent and b) allow validation/parsing to be offloaded into the value structs without a breaking change, if that is ever desired.
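The fallible/infallible split described above can be sketched roughly like this, with simplified stand-ins for the real value structs and serializer (all names here are illustrative, not the PR's actual code):

```rust
// Simplified stand-ins for the crate's value structs (hypothetical code).
struct Integer(i64);
struct Boolean(bool);

// Stand-in for a Serializer::serialize_* style check: validate by
// attempting to serialize into a scratch buffer, as the PR currently does.
fn serialize_integer(value: i64, output: &mut String) -> Result<(), &'static str> {
    if !(-999_999_999_999_999..=999_999_999_999_999).contains(&value) {
        return Err("integer out of range");
    }
    output.push_str(&value.to_string());
    Ok(())
}

// Fallible construction: the i64 range is wider than the spec allows.
impl TryFrom<i64> for Integer {
    type Error = &'static str;
    fn try_from(value: i64) -> Result<Self, Self::Error> {
        let mut scratch = String::new();
        serialize_integer(value, &mut scratch)?; // scratch is thrown away
        Ok(Integer(value))
    }
}

// Infallible construction: every bool is a valid Boolean.
impl From<bool> for Boolean {
    fn from(value: bool) -> Self {
        Boolean(value)
    }
}
```

The thrown-away scratch buffer is exactly the wastefulness mentioned above; splitting validation from serialization removes it.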

Non-goals

As parsing usually happens on a complete structured field value and not on its parts, I omitted this for the moment. This means right now you can only construct those value structs from their "native representation" and not from their serialized string value (e.g. 42_i64 can be converted into a bare_item::Integer, but "42" cannot).
If ever needed, parsing can be offloaded into the value structs, and bare_item::Integer::parse("42") or similar could become reality.

@valenting (Collaborator)

For the moment I have introduced BareItem::Validated* variants adjacent to their unvalidated enum member counterparts.

Do you see us keeping both Validated and unvalidated enum types, or do you intend to remove the unvalidated ones?

and thus could avoid a breaking change if wanted

I think changing the enum in any way might count as a breaking change. In any case, as mentioned earlier, we need to make a breaking change to add the Date type anyway, so we're not constrained by that.

@zcei (Author) commented Jan 23, 2023

Do you see us keeping both Validated and unvalidated enum types, or do you intend to remove the unvalidated ones?

Initially it was to keep the number of touched LOC low; then I thought it might be nicer to transition over. I forgot at that moment that you wanted to introduce a breaking change anyway.

One could argue that the "struct wrappers" are not needed for all the enum variants; Integer(i64), for example, can already only hold a valid 64-bit integer.
But I would say that with Rust's .into() and the proper From traits implemented, it's an okay-ish overhead to have all of them in their value type wrappers, so that we keep them consistent. The only outlier here is Decimal, as it's basically already a value type that we receive.


Idea: let's rename the current enum members to something like Raw* / Unvalidated* / Unsafe*. The migration path in an existing codebase is really easy via search & replace. Then we can have the new and recommended way under the same names as the current enum members and use the breaking change for that, yet the upgrade migration isn't painful and users can gradually transition to validated bare items. WDYT?

@valenting (Collaborator)

let's rename the current enum members to something like Raw* / Unvalidated* / Unsafe* members.

Would they still be in the same enum as the validated types? Either way, I'd love to look at some code.

The IETF list also has some suggestions about adding #[non_exhaustive] to BareItem.
If you don't add it now, I'll do it later when adding the Date type.

https://lists.w3.org/Archives/Public/ietf-http-wg/2023JanMar/0034.html

@zcei (Author) commented Feb 6, 2023

Just as a quick update from some work on the weekend:

I decided it's not really worth it to keep both the "unsafe" variants and the validated ones at the same time, as it becomes a bit weird to deal with. Parsing would have yielded "unsafe" variants, even though the parser validated that they're correct.

So the code change will be a bit bigger, as it changes most of the test invocations from .into() to .try_into(), etc. Also some of the tests will need slight changes where errors were expected during the serialization. Now they fail when the value struct gets created.

I'm currently fighting with the String value struct, as I can't easily use the parse_string function when implementing TryInto: the input may or may not be wrapped in quotes, and the parser expects "double escaped" values (like the literal value \" which is "\\\"" as a Rust string), whereas it's totally fine to have a quote " ("\"") as a string input that can be serialized.

Edit: The problem here was expecting that we are passed parseable inputs. That was a wrong assumption on my end. We need serializable values and therefore now use the Serializer::serialize_* methods internally to check for that. It's a bit wasteful, but can be optimized later.

Once that's done I'll open up this PR for review, with one commit per BareItem enum variant being moved to the value structs.

P.S.: Didn't know about the #[non_exhaustive], that's cool!

@zcei zcei marked this pull request as ready for review February 9, 2023 18:08
@zcei (Author) commented Feb 9, 2023

@valenting I updated the PR description to explain the work that has happened. Let me know if there's anything controversial in it. Specifically, I'm interested in what you think about the whole .try_into() and .into() dance, which is now necessary to construct field values manually.

I'm happy to be pointed to whatever you think is worth changing, I'm sure there's something I missed 🙂

Comment on lines +95 to +97
```rust
// TODO: must_fail has to always fail on creation
// As not all cases are moved yet, we take a two-step approach here:
// Either creation fails or serialization fails
```
zcei (Author):
This can currently not be tackled, because Dictionary keys have their own validations that only happen on serialization, so for the time being we would have to live with the workaround or try to enforce valid keys at construction time.
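For reference, the validation such a construction-time check would have to enforce is the key grammar from RFC 8941; a hypothetical helper (not part of this PR) could look like:

```rust
// RFC 8941 key grammar (hypothetical helper, not the crate's API):
// key = ( lcalpha / "*" ) *( lcalpha / DIGIT / "_" / "-" / "." / "*" )
fn is_valid_key(key: &str) -> bool {
    let mut chars = key.chars();
    // The first character must be a lowercase letter or '*'.
    let first_ok = matches!(chars.next(), Some(c) if c.is_ascii_lowercase() || c == '*');
    // Every following character comes from the small allowed set.
    first_ok
        && chars.all(|c| {
            c.is_ascii_lowercase() || c.is_ascii_digit() || matches!(c, '_' | '-' | '.' | '*')
        })
}
```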

valenting (Collaborator):

If we're just testing serialization, I think we can still create invalid bareItems if we simply don't use try_into, right?

zcei (Author):

If I try to use e.g. the Integer value struct in build_bare_item, I get

cannot initialize a tuple struct which contains private fields

Which is the correct behavior for external usage. Not sure whether we can make the tests "part of the crate module tree" to give them access to the pub(crate) value?

zcei (Author):

Validating keys can be done similarly to this PR, with a new Key value struct that is used with the IndexMap.

I have drafted the change here, but would probably take that in its own PR: zcei/sfv@validate-bare-items...zcei:sfv:validate-keys

One ergonomic advantage (though I'm not sure it justifies the implementation overhead) would be to allow .get with any unvalidated &str and return None early if the key can't be validated. Only for valid keys would we check the IndexMap for the value.
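That early-return .get idea could be sketched along these lines (all types here are simplified stand-ins, not the crate's actual Dictionary or Key):

```rust
use std::collections::HashMap;

// Hypothetical sketch of the `.get` idea: a key that fails validation
// can never have been inserted, so we return None without a lookup.
struct Dictionary {
    map: HashMap<String, i64>, // stand-in for IndexMap<Key, Item>
}

impl Dictionary {
    fn get(&self, key: &str) -> Option<&i64> {
        if !is_valid_key(key) {
            return None; // invalid keys bail out early
        }
        self.map.get(key)
    }
}

// Same RFC 8941 key grammar check as discussed in the thread above.
fn is_valid_key(key: &str) -> bool {
    let mut chars = key.chars();
    matches!(chars.next(), Some(c) if c.is_ascii_lowercase() || c == '*')
        && chars.all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || matches!(c, '_' | '-' | '.' | '*'))
}
```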

@zcei zcei mentioned this pull request Feb 14, 2023
@valenting valenting (Collaborator) left a comment

I think we're on the right track here.
My only concern is that all these try_into calls seem overly verbose. It also seems inelegant that some types implement Into while others implement TryInto. So I'm wondering if we should have some helper functions on BareItem that abstract away all the differences and easily allow you to create them.

```rust
fn new_boolean(value: bool) -> Result<BareItem, &str>
fn new_decimal(value: f64) -> Result<BareItem, &str>
fn new_decimal(value: &str) -> Result<BareItem, &str>
```
etc.

We also need to resolve the performance issue with the unnecessary serialization when constructing the bare item.

src/bare_item.rs Outdated
```rust
type Error = &'static str;
fn try_from(value: rust_decimal::Decimal) -> Result<Self, Self::Error> {
    let mut output = String::new();
    Serializer::serialize_decimal(value, &mut output)?;
```
valenting (Collaborator):

So, doing an actual serialization in the try_from implementation seems less than optimal.
That means for every bare item that we construct, we allocate a string we then throw away. And then we do the serialization again when we actually serialize it into a buffer.

I think what we should do here is split the serialization from the validity checks, so try_from for Decimal would only check if the decimal is in the allowed range, and not actually serialize it.

zcei (Author):

Yeah, I'm not happy with validating by serializing either, and as mentioned in the PR description we could (and should 😅) move the validations to a place both can access.
To get my idea validated without changing another few hundred lines, I decided to keep it like this for the moment, but I'm totally supportive of the fact that this needs to change.

@zcei zcei Feb 18, 2023

Created a new ValidateValue trait that the value structs implement. It returns a Result whose Ok value is either the input (passing back ownership) or a new "sanitized" value (important for Decimal, as we can round to three decimal places up front).
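As a rough illustration of what such a trait could look like (names and details are guesses, and a plain f64 stands in for rust_decimal::Decimal):

```rust
// Hypothetical sketch of a validation trait as described: `validate`
// hands ownership back, either unchanged or sanitized (e.g. a decimal
// rounded to three fractional digits before it is stored).
trait ValidateValue: Sized {
    type Inner;
    fn validate(value: Self::Inner) -> Result<Self::Inner, &'static str>;
}

struct Decimal(f64); // stand-in for a rust_decimal-backed value struct

impl ValidateValue for Decimal {
    type Inner = f64;
    fn validate(value: f64) -> Result<f64, &'static str> {
        if !value.is_finite() {
            return Err("decimal must be finite");
        }
        // Sanitize: round to the three fractional digits the spec allows.
        Ok((value * 1000.0).round() / 1000.0)
    }
}
```

Because validation lives on the value struct, both TryFrom construction and the serializer can call it, removing the throwaway serialization.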


```rust
/// use rust_decimal::Decimal;
/// # fn main() -> Result<(), &'static str> {
/// let decimal_number = Decimal::from_f64(48.01).unwrap();
/// let bare_item: BareItem = decimal_number.try_into()?;
```
valenting (Collaborator):

I'm starting to wonder if try_into is the most ergonomic solution.
Wouldn't it be nicer to have let decimal_number = BareItem::decimal_from_f64(48.01)?; and similar for the other subtypes?

@zcei zcei Feb 15, 2023

This is kinda like 3.i. from the issue (just with better method naming than mine):

We can have static methods on the primitives that allow early validation, like BareItem::try_token("foo") which returns a Result of either the token or the validation error. The actual token would be "unverified" in the sense that one could still create BareItem::Token("12345"). Edit: this wouldn't apply now, as we would keep the value structs that can't be created externally without going through validation.

Happy to explore that!

zcei (Author):

Also, I do think both ways can be supported: TryFrom<f64> could simply call BareItem::decimal_from_f64.

zcei (Author):

@valenting what's your stance on having the external API accept only f64? Do you want to keep the usage of rust_decimal::Decimal? The spec seems to allow only 12 digits in the integer component and 3 digits in the fractional component; f64 should cover that completely? 🤔

@valenting (Collaborator) commented Feb 15, 2023 via email

@zcei (Author) commented Feb 18, 2023

Allowed ways to construct a BareItem per variant

```rust
// BareItem::Decimal
let value = BareItem::new_decimal_from_f64(13.37)?;
let value: BareItem = 13.37.try_into()?;

let decimal = rust_decimal::Decimal::from_f64(13.37).unwrap();
let value = BareItem::new_decimal(decimal);

let value: BareItem = decimal.try_into()?;

// BareItem::Integer
let value = BareItem::new_integer(42)?;
let value: BareItem = 42.try_into()?;

// BareItem::String
let value = BareItem::new_string("foo")?;
// No `.try_into()`, as it's ambiguous whether `Token` or `String` variant should be used

// BareItem::ByteSeq
let value = BareItem::new_byte_seq("hello".as_bytes())?;
let value: BareItem = "hello".as_bytes().try_into()?;

// BareItem::Boolean
let value = BareItem::new_boolean(true)?;
let value: BareItem = true.try_into()?;

// BareItem::Token
let value = BareItem::new_token("foo")?;
// No `.try_into()`, as it's ambiguous whether `Token` or `String` variant should be used
```

The recommended way would be to use the BareItem::new_* constructors, but I do believe the TryInto variants are still quite useful when creating an Item right away, like:

Item::new(true.try_into()?);


Successfully merging this pull request may close these issues.

Validation of Items on local construction
2 participants