Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow StringType and BytesType to have undefined byte lengths #413

Merged
merged 1 commit into from
Jan 18, 2024

Conversation

criccomini
Copy link
Contributor

Prior to this commit, Recap StringType and BytesType both required bytes_ to be set. If unset, a default of 65536 was used. This configuration allowed developers to skip the bytes field in the concrete syntax tree (CST), but still required bytes_ to be set in the abstract syntax tree (AST). @adrianisk and I converged on this design in
#284.

Recently, @mjperrone pointed out that requiring bytes_ in the AST is still awkward for JSON. In the absence of an undefined byte_ option, you have to set something for JSON--either a magic number or INT/LONG_MAX. Thus far, we defaulted to LONG_MAX. But a defined LONG_MAX and an undefined length are actually two separate states. Allowing a converter to specify an undefined string/byte length permits other converters to decide how to deal with the undefined byte length.

I made the change and it fit quite naturally, so I think we're on the right track with this. JSON and Hive metastore both benefited; both support undefined lengths.

I've left the other converters alone for now. This means they'll continue to barf if you go from JSON/HMS to Recap to Avro/Proto. I think that's fine for now. We can revisit this later if we want to.

@criccomini
Copy link
Contributor Author

criccomini commented Jan 17, 2024

@adrianisk @mjperrone @cpard @gwukelic @alexdemeo PTAL

Also, the spec change: recap-build/recap-website#13

Copy link
Contributor

@mjperrone mjperrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Having great existing tests and test frameworks makes this pretty straightforward 💯

recap/types.py Outdated
if not self.variable and self.bytes_ is None:
raise ValueError("Fixed length bytes must have a length set")
if self.bytes_ is not None and self.bytes_ < 1:
raise ValueError("Bytes must be greater than or equal to 1")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you dropping the max length constraint here and for Bytes below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmm, good question. I'm not 100% sure what I was thinking here. The spec still says 64-bit signed integer | null, so if bytes_ is set it should be enforced as such. Good catch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I pushed an updated PR here (force push squash). LMK if this looks OK.

  • Went back to LONG_MAX for string/bytes/list length
  • Moved list length min from 0 to 1 (in accordance with spec)
  • Updated 0.3.0.json to reflect maximum=int_max
  • Left 0.3.0.md as is since it already reflected these rules

@@ -20,7 +20,7 @@

VALID_SCHEMA_DIR = "tests/spec/valid"
INVALID_SCHEMA_DIR = "tests/spec/invalid"
RECAP_SPEC_JSON_HTTP = "https://recap.build/specs/type/0.2.0.json"
RECAP_SPEC_JSON_HTTP = "https://recap.build/specs/type/0.3.0.json"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that this is why the tests are failing, the website change hasn't landed yet. 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! PR for that is here:

recap-build/recap-website#13

Prior to this commit, Recap StringType and BytesType both required `bytes_` to
be set. If unset, a default of 65536 was used. This configuration allowed
developers to skip the `bytes` field in the concrete syntax tree (CST), but
still required `bytes_` to be set in the abstract syntax tree (AST). @adrianisk
and I converged on this design in
#284.

Recently, @mjperrone pointed out that requiring `bytes_` in the AST is still
awkward for JSON. In the absence of an undefined byte_ option, you have to set
*something* for JSON--either a magic number or INT/LONG_MAX. Thus far, we
defaulted to LONG_MAX. But a defined LONG_MAX and an undefined length are
actually two separate states. Allowing a converter to specify an undefined
string/byte length permits other converters to decide how to deal with the
undefined byte length.

I made the change and it fit quite naturally, so I think we're on the right
track with this. JSON and Hive metastore both benefited; both support undefined
lengths.

I've left the other converters alone for now. This means they'll continue to
barf if you go from JSON/HMS to Recap to Avro/Proto. I think that's fine for
now. We can revisit this later if we want to.
@criccomini criccomini merged commit 18b505f into main Jan 18, 2024
3 checks passed
@criccomini criccomini deleted the make-byte-length-optional branch January 18, 2024 17:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants