Value "refinements" #153

Merged
merged 28 commits into from
Feb 21, 2023

Conversation

apparentlymart
Collaborator

Originally the purpose of unknown values in cty was primarily to allow type checking of expressions whose input values are not fully known yet. Consequently they had a pretty simple model: an unknown value has a type constraint and operations on unknown values check that the type constraints are valid but otherwise just return even more unknown values to allow type checking to continue with subsequent derived expressions.

However, this limited approach misses some opportunities to use known constraints on the range of results from certain operations to allow some operations on unknown values to return known values, or at least values that are "more known" (have a smaller range than just meeting a type constraint).

This changeset introduces a new model of "refinements" that can constrain the range of an unknown value beyond just a type constraint, and some early work to make use of that new information to return refined values from various built-in operations.

This is backward-compatible with the old behavior in the sense that conventionally we don't consider it a breaking change for a result to be "more known" in a new minor release than it was in the previous release. However, applications depending on operation methods -- in particular Value.RawEquals, but not limited to that -- may find that their unit tests now produce slightly different results where previously-unknown values are now either fully known or are refined with a more limited range. If the new refined range seems correct then I'd suggest updating those tests as part of upgrading cty; if the refined range seems incorrect then that may be a bug, so please open a new issue to discuss it.

Future changes might improve the refinements analysis further, causing even smaller ranges for unknown values. However, refinement analysis is always "best effort" and so it is not a bug for cty to return an unknown value with a wider range than necessary: narrowing that range might be an opportunity for improvement in a future minor release.

Previously the internal definition of an unknown value was any value whose
raw value "v" field was exactly equal to the unexported global "unknown".

In preparation for supporting optional refinement of unknown values we'll
now broaden that definition to any value where "v" is an instance of
*unknownType, regardless of exact value.

In later commits we'll add optional fields inside struct unknownType to
capture the optional additional refinements. "Refinements" means extra
constraints on an unknown value that don't affect its type but do
constrain the possible concrete values the unknown value could be a
placeholder for.
This is a new concept allowing for an unknown value to optionally carry
some additional information about constraints on the value for which it
is a placeholder.

Refinements do not affect the type of the value or most equality testing
rules. As of this commit they also don't affect any other operations on
values, but in future commits we'll use refinements to improve the
precision of certain operations on unknown values if the refinements give
enough information to still produce a known result.

For the moment this is limited just to a few key cases that seem likely
to help produce known results in common operations involving unknown
values:
- Equality tests with null, which are commonly used as part of conditional
  branches in calling applications.
- Numeric bounds, in particular focused on comparison to zero but general
  enough to help with other numeric comparisons too.
- Bounds on collection length, to use in conjunction with the numeric
  bounds so that asking for the length of an unknown collection can return
  an unknown number with numeric bounds.
- String prefixes, to help with string-prefix-related tests that seem to be
  a common subset of validation checks in calling applications.

Applications will tend to need to do some work to refine their own values
in order to benefit fully from this, but we'll also plumb refinements into
some of the built-in value operations and stdlib functions over time so
that even applications not working directly with refinements can still get
some indirect benefit.
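For a concrete feel for how an application might attach refinements, here's a minimal sketch using the new Refine builder (method names beyond those mentioned elsewhere in this PR are my best reading of the new API, so treat them as illustrative; later snippets assume this same cty import):

```go
package main

import (
	"fmt"

	"github.com/zclconf/go-cty/cty"
)

func main() {
	// An unknown number that the application knows is definitely not null
	// and falls within the inclusive range [0, 10].
	n := cty.UnknownVal(cty.Number).Refine().
		NotNull().
		NumberRangeInclusive(cty.Zero, cty.NumberIntVal(10)).
		NewValue()

	fmt.Println(n.IsKnown())                   // false: still an unknown value
	fmt.Println(n.Range().DefinitelyNotNull()) // true: the refinement is recorded
}
```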
Using the "refinements" feature we can refine the range of any unknown
values from Equal to exclude cty.NullVal(cty.Bool), since we know that
equality tests can never return null.

We can also return a known False if the test is between a null value and
an unknown value which has been refined as non-null.
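For example (a sketch of the behavior described above, assuming a NotNull refinement and a Range inspection method along these lines):

```go
known := cty.NullVal(cty.String)
unknown := cty.UnknownVal(cty.String).Refine().NotNull().NewValue()

// Null can never equal a value refined as definitely not null, so the
// result is known even though one operand is unknown.
result := known.Equals(unknown) // cty.False

// Any equality result is itself refined as non-null, even when unknown.
other := cty.UnknownVal(cty.String).Equals(cty.StringVal("a"))
_ = other.Range().DefinitelyNotNull() // true
_ = result
```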
The length of a collection is never null and always at least zero.

For sets containing unknown values in particular we also know that they
must have at least one element (otherwise there wouldn't be an unknown
value) and that they can't have any more elements than currently stored
(because coalescing can't ever generate new elements).
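A sketch of what that means for Length on a set with unknown elements:

```go
s := cty.SetVal([]cty.Value{
	cty.StringVal("a"),
	cty.UnknownVal(cty.String),
})

// The unknown element might coalesce with "a" or might not, so the length
// is unknown, but it can now be refined as a non-null number in [1, 2].
length := s.Length()
_ = length
```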
We can refine the unknown value results for all of the arithmetic
functions to represent that they cannot possibly be null.

The Value.Absolute method is additionally guaranteed to return a value
greater than or equal to zero, by definition.
Many of our operation methods are guaranteed to never return null, so
we'll refine the unknown values they return to specify that.
This slightly reduces the result range described by unknown results from
these functions so that downstream operations on the results have more
chance of producing known results even though their input is unknown.
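For instance (ValueRange accessor names shown as I understand them):

```go
a := cty.UnknownVal(cty.Number)

// Arithmetic on an unknown number still yields an unknown number, but the
// result now carries a "definitely not null" refinement.
sum := a.Add(cty.NumberIntVal(1))
_ = sum.Range().DefinitelyNotNull() // true

// Absolute is additionally refined to be at least zero.
abs := a.Absolute()
low, _ := abs.Range().NumberLowerBound()
_ = low // cty.Zero: the result is refined to be >= 0
```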
I initially just copied the refinement model for numeric range, but in
retrospect that was silly because we know that collection lengths are
always integers and so there's no need to model unknown values and
fractional values and whatever else.

Instead, we now just have an integer lower and upper bound, always known
but when unconstrained they are the smallest and largest values of int
respectively.
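So inspecting length bounds is now a matter of plain ints (builder and accessor names shown as I'd expect them):

```go
c := cty.UnknownVal(cty.List(cty.String)).Refine().
	CollectionLengthLowerBound(1).
	CollectionLengthUpperBound(5).
	NewValue()

r := c.Range()
_ = r.LengthLowerBound() // 1
_ = r.LengthUpperBound() // 5
// An unconstrained bound reports an extreme int value rather than a
// separate "unknown bound" marker.
```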
The new concept of "refinements" for unknown values allows us to track
some additional constraints on the range of an unknown value.

Normally those constraints would get lost during serialization, but for
msgpack in particular we can use an extension code to represent an unknown
value that has refinements and then include a serialization of the
refinements inside its body.

This then allows preserving the refinements when communicating over wire
protocols that use msgpack as their base serialization. However, as of
this commit the unmarshal step isn't quite complete because the refinement
API for numeric range is incomplete. We'll address both the refinement API
and the msgpack package's use of it in a future commit.

This required upgrading to a newer major version of the msgpack library
because v4 did not have all of the functionality required to encode a raw
extension object's payload.
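A sketch of the intended round trip through the cty msgpack package (assuming its usual Marshal/Unmarshal entry points):

```go
import (
	"github.com/zclconf/go-cty/cty"
	ctymsgpack "github.com/zclconf/go-cty/cty/msgpack"
)

func roundTrip() (cty.Value, error) {
	v := cty.UnknownVal(cty.String).Refine().
		NotNull().
		StringPrefix("arn:").
		NewValue()

	// The refinements are serialized inside a msgpack extension payload...
	b, err := ctymsgpack.Marshal(v, cty.String)
	if err != nil {
		return cty.NilVal, err
	}
	// ...so a cooperating decoder can return an unknown string that still
	// carries the non-null and prefix refinements.
	return ctymsgpack.Unmarshal(b, cty.String)
}
```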
The API for declaring numeric bounds was previously rather awkward and
inconsistent with the API for inspecting them in the ValueRange type.

Now we use separate methods for defining the lower bound and upper bound,
allowing the inclusiveness of each one to be independent. We retain the
previous NumberRangeInclusive helper for setting both bounds at once, but
it's now just a thin wrapper around the other two.
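Assuming the two separate methods are named NumberRangeLowerBound and NumberRangeUpperBound, each taking a bound value plus an inclusiveness flag (my reading of the new API), usage might look like:

```go
// "At least zero (inclusive), strictly less than ten (exclusive)."
v := cty.UnknownVal(cty.Number).Refine().
	NumberRangeLowerBound(cty.Zero, true).
	NumberRangeUpperBound(cty.NumberIntVal(10), false).
	NewValue()

// The retained helper sets both bounds at once, inclusively, and is now
// just a thin wrapper around the two methods above.
w := cty.UnknownVal(cty.Number).Refine().
	NumberRangeInclusive(cty.Zero, cty.NumberIntVal(10)).
	NewValue()
_, _ = v, w
```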
Previously we considered it a usage error to assert a bound that was
outside of the existing refined range.

However, if we already know that a number value is less than 2 then it
isn't really an error to assert that it's also less than 4; the second
assertion just doesn't give us any new information and so we should
silently discard it and retain the original assertion.

It's still an error to make assertions that directly contradict existing
refinements, such as asserting that an unknown value isn't null and then
later asserting that it's null.
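For example (same hypothetical method names as above):

```go
b := cty.UnknownVal(cty.Number).Refine().
	NumberRangeUpperBound(cty.NumberIntVal(2), false) // "less than 2"

// Redundant: already implied by "less than 2", so it's silently ignored
// and the tighter bound is retained.
b = b.NumberRangeUpperBound(cty.NumberIntVal(4), false)

v := b.NewValue() // still refined as "less than 2"
_ = v

// A direct contradiction, such as refining the same builder as both
// not-null and null, is still treated as a usage error.
```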
When we're being asked for the length of an unknown collection we can
always put at least _some_ bounds on the range of the result, because we
know that a collection can't have a negative length and it cannot have a
length that is bigger than the current machine's integer size because
otherwise we wouldn't be able to index into it.

For some collections we may have further refinements on the length of the
unknown collection, in which case we'll also transfer those to being
refinements on the range of the numeric result.
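A sketch of the effect (length-refinement builder names as in the earlier sketches):

```go
coll := cty.UnknownVal(cty.List(cty.String)).Refine().
	CollectionLengthLowerBound(2).
	CollectionLengthUpperBound(10).
	NewValue()

n := coll.Length()
// n is an unknown cty.Number refined as non-null and within [2, 10].
// With no length refinements at all it would still be bounded below by
// zero and above by the machine's maximum collection size.
_ = n
```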
This allows us to return a known result in more cases than before by
making deductions based on the known numeric range.

This first implementation is a little more conservative than it strictly
needs to be because it just treats all bounds as exclusive rather than
taking into account inclusive bounds. This makes the implementation a bit
simpler for this initial round though, and should never generate an
incorrect known result.
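For example:

```go
small := cty.UnknownVal(cty.Number).Refine().
	NumberRangeInclusive(cty.Zero, cty.NumberIntVal(2)).
	NewValue()

// 0 <= small <= 2, so even though small is unknown this comparison can
// return a definitively known result.
_ = small.LessThan(cty.NumberIntVal(10)) // cty.True

// A comparison the range can't decide still returns an unknown cty.Bool.
_ = small.LessThan(cty.NumberIntVal(1))
```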
When given an unknown value, the Strlen function will now return a refined
unknown value with an inclusive lower bound. This bound is always at least
zero because a string can't have a negative length, but if we also know a
prefix for the string then we can refine the result even more to take that
into account.
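A sketch using the stdlib function (the import path github.com/zclconf/go-cty/cty/function/stdlib is assumed for this and the later stdlib examples):

```go
s := cty.UnknownVal(cty.String).Refine().
	StringPrefix("Hello, ").
	NewValue()

n, err := stdlib.Strlen(s)
// n is an unknown cty.Number with an inclusive lower bound: at least zero
// for any unknown string, and at least the length of whatever prefix was
// retained by the refinement when one is known.
_, _ = n, err
```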
When comparing a known value to an unknown value we may be able to return
a definitive False if we can prove that the known value isn't in the range
of the unknown value.
Because cty strings are Unicode text subject to both the normalization and
text segmentation rules, we need to be cautious in how we model prefix
matching of strings in the refinement system.

To avoid problems caused by combining character sequences, we'll trim off
characters from the end of a prefix if we cannot guarantee that they won't
combine with something that follows. This then avoids us overpromising
what the prefix will be in situations where the last character might
change after appending the rest of the string.

For applications that fully control their values and can guarantee that
no combining will occur at the boundary, the new function StringPrefixFull
retains the previous behavior of StringPrefix. The idea here is to be
conservative by default but allow applications to be less conservative if
they are able to offer greater guarantees.
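A sketch of the two entry points:

```go
// Conservative by default: the retained prefix may be shortened so that it
// never promises a final character that could still combine with whatever
// the unknown remainder begins with.
v1 := cty.UnknownVal(cty.String).Refine().
	StringPrefix("he").
	NewValue()

// Opt-in for applications that can rule out combining at the boundary:
// the prefix is kept exactly as given.
v2 := cty.UnknownVal(cty.String).Refine().
	StringPrefixFull("he").
	NewValue()
_, _ = v1, v2
```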
When encoding a JSON string we will often be able to predict the first
character of the result based just on the argument type.

That wouldn't be worthwhile alone, but we can also use a known first
character of a JSON document to infer a specific result type even for an
unknown value, and to raise an error immediately if the known prefix
cannot possibly be a JSON value of any type.

Unfortunately due to how JSON interacts with the cty type system the only
case where type information truly carries through in a round-trip is
encoding an unknown string and then decoding that unknown result, but
JSON may be produced and consumed by components other than this pair of
functions and they may be able to reach more definitive conclusions based
on this additional information.
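A sketch of the string case, which is the one that survives a full round trip:

```go
j, err := stdlib.JSONEncode(cty.UnknownVal(cty.String))
// j is an unknown string, but any JSON encoding of a string must begin
// with a quote, so it can be refined as non-null with the prefix `"`.
// A later JSONDecode of that refined unknown can then infer cty.String as
// its result type from that first character alone.
_, _ = j, err
```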
It's common for format strings to begin with at least a few literal
characters before the first verb, and if so we can predict the prefix of
the result even if some of the remaining arguments are unknown.
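For example:

```go
v, err := stdlib.Format(cty.StringVal("Hello, %s!"), cty.UnknownVal(cty.String))
// v is an unknown string refined as non-null and carrying the literal
// prefix that precedes the first verb ("Hello, ", subject to the
// conservative prefix trimming described above).
_, _ = v, err
```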
Previously we were preserving marks only in the simple cases, and not
properly dealing with already-marked known values when checking
refinements.

Now we'll always unmark a value on entry into Refine, and then reapply
the same marks on all return paths out of RefinementBuilder.NewValue.
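So, for example:

```go
marked := cty.UnknownVal(cty.String).Mark("sensitive")

// Refine unmarks on entry and NewValue reapplies the same marks on the
// way out, so the refined result is still marked.
refined := marked.Refine().NotNull().NewValue()
_ = refined.HasMark("sensitive") // true
```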
When we're converting from a structural type to a collection type the
source type gives us information to refine the range of the resulting
collection even if the input value is unknown.

Similarly when converting from one collection type to another we can
transfer the length range constraints because, aside from set element
coalescing, conversions cannot change the length of a collection.
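A sketch of the structural-to-collection case, using the convert package (github.com/zclconf/go-cty/cty/convert):

```go
tup := cty.UnknownVal(cty.Tuple([]cty.Type{cty.String, cty.String, cty.String}))

l, err := convert.Convert(tup, cty.List(cty.String))
// The tuple type alone fixes the element count, so even though tup is
// unknown the resulting list's length range is [3, 3], which the
// refinements system may then collapse into a known three-element list of
// unknown strings.
_, _ = l, err
```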
We know that the maximum possible cardinality for a resulting set is the
product of the cardinalities of all of the input sets. We might not know
the exact cardinalities of all of the input sets either, but if we do
know an upper bound then we can use that to calculate the maximum
cardinality of the result instead.
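A sketch of the idea using the stdlib SetProduct function:

```go
setA := cty.UnknownVal(cty.Set(cty.String)).Refine().
	CollectionLengthUpperBound(3).
	NewValue()
setB := cty.UnknownVal(cty.Set(cty.String)).Refine().
	CollectionLengthUpperBound(4).
	NewValue()

prod, err := stdlib.SetProduct(setA, setB)
// prod is an unknown set whose length refinement carries an upper bound
// of 12: the product of the two input upper bounds.
_, _ = prod, err
```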
Previously we were treating unconstrained bounds as unknown, which is also a reasonable
way to model a lack of bounds but is less convenient when we want to do
arithmetic against the bounds.
If we are performing addition, subtraction, or multiplication on unknown
numbers with known numeric bounds then we can propagate bounds to the
result by performing interval arithmetic.

This is not as complete as it could be because of trying to share a single
implementation across all of the functions while still dealing with all
of their panic edge cases.
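For example:

```go
a := cty.UnknownVal(cty.Number).Refine().
	NumberRangeInclusive(cty.NumberIntVal(1), cty.NumberIntVal(3)).
	NewValue()
b := cty.UnknownVal(cty.Number).Refine().
	NumberRangeInclusive(cty.NumberIntVal(10), cty.NumberIntVal(20)).
	NewValue()

sum := a.Add(b)
// sum is still unknown, but interval arithmetic refines its range to
// [11, 23]; Subtract and Multiply propagate bounds in the same way.
_ = sum
```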
Now that we're in the business of trying to produce known values from
operations on unknown inputs where possible, these are some simple cases
that hold regardless of the range/refinements of the unknown operands and
can help take a particular unknown result out of consideration when
evaluating a broader expression.
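One illustrative example of such a case, assuming the boolean operations are among those covered:

```go
// False AND anything is False, and True OR anything is True, no matter how
// little we know about the other operand.
_ = cty.False.And(cty.UnknownVal(cty.Bool)) // cty.False
_ = cty.True.Or(cty.UnknownVal(cty.Bool))   // cty.True
```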
If the list argument is an unknown value or contains unknown values then
we can't possibly return a fully-known result, but we do at least know
that sorting will never change the number of elements and so we can refine
our unknown result using the range of the input value.

The refinements system automatically collapses an unknown list collection
whose upper and lower length bounds are equal into a known list where
all elements are unknown, so this automatically preserves the known-ness
of the input length in the case where we're given a known list with
unknown elements, without needing to handle that as a special case here.
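For example, with stdlib.Sort:

```go
l := cty.ListVal([]cty.Value{
	cty.UnknownVal(cty.String),
	cty.StringVal("b"),
})

sorted, err := stdlib.Sort(l)
// Sorting never changes the number of elements, so the result's length
// range matches the input's: here [2, 2], which collapses back into a
// known two-element list whose elements are unknown strings.
_, _ = sorted, err
```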
So far I've just left this implied because mostly I was the one
responsible for upgrading this library in various applications that use it
(by virtue of an employment coincidence).

However, the (indirect) use of this library has broadened lately, so it's
worth being clear about what does not constitute a breaking change in a
minor release so that application developers can have some guidance on
what to do when they see a test failure after upgrading and so that
hopefully they can avoid depending on implementation details in their main
code, outside of unit tests.
apparentlymart merged commit 2d9df4d into main on Feb 21, 2023
apparentlymart deleted the value-refinements branch on February 21, 2023 18:49