Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

De-duplication in BlobDataProvider #2062

Merged
merged 21 commits into from
Jul 11, 2022
Merged

De-duplication in BlobDataProvider #2062

merged 21 commits into from
Jul 11, 2022

Conversation

sffc
Copy link
Member

@sffc sffc commented Jun 14, 2022

Fixes #838

@sffc
Copy link
Member Author

sffc commented Jun 14, 2022

No code size change for the first commit. Waiting for the second one to come in.

@sffc sffc changed the title Change representation of BlobProvider to BlobSchemaV1 De-duplication in BlobDataProvider Jun 14, 2022
@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • provider/blob/src/blob_data_provider.rs is different
  • provider/blob/src/export/blob_exporter.rs is different
  • provider/blob/src/static_data_provider.rs is different
  • provider/blob/tests/data/hello_world.postcard is different
  • provider/datagen/src/bin/fingerprint-data.rs is now changed in the branch
  • provider/testdata/data/decimal-bn-en.postcard is different
  • provider/testdata/data/testdata.postcard is different
  • utils/litemap/src/map.rs is now changed in the branch
  • utils/zerovec/src/map2d/cursor.rs is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@sffc sffc marked this pull request as ready for review June 30, 2022 02:54
@sffc sffc requested review from Manishearth, a team and robertbastian as code owners June 30, 2022 02:54
@sffc
Copy link
Member Author

sffc commented Jun 30, 2022

Need to resolve the testdata conflict again, but this is good enough for review

.map_err(|kind| kind.with_req(key, req))?;
blob.buffers
.get(idx)
.ok_or(DataErrorKind::InvalidState.with_req(key, req))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking the invariant during construction would remove the invalid state possibility

@@ -68,17 +69,34 @@ impl BlobDataProvider {
BlobSchema::deserialize(&mut postcard::Deserializer::from_bytes(bytes)).map(
|blob| {
let BlobSchema::V001(blob) = blob;
blob.resources
blob
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're adding new invariants to BlobSchemaV1, which I think we should verify here. Something like

blob.keys.iter_values().max() < blob.buffers.len()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could check the invariant upon construction, but it means we need to scan over the buffer. I'd rather lean towards GIGO behavior than having a somewhat expensive validation operation for a case that should never happen.

I will make it a debug assertion instead.

#[doc(hidden)] // See #1771, we don't want this to be a publicly visible API
pub fn get_map(&self) -> &ZeroMap2dBorrowed<ResourceKeyHash, [u8], [u8]> {
self.data.get()
#[doc(hidden)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought: I don't like having this just for the fingerprinting script. I'll try to add fingerprinting to the exporter.

Manishearth
Manishearth previously approved these changes Jun 30, 2022
provider/blob/src/export/blob_exporter.rs Show resolved Hide resolved
Manishearth
Manishearth previously approved these changes Jun 30, 2022
.data
.buffers
.get(idx)
.ok_or_else(|| DataErrorKind::InvalidState.with_req(key, req))?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like this InvalidState error, because it doesn't tell the client anything. Can we make it a custom error with a message or camouflage it as a postcard error (with a message)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally agree with you that InvalidState errors are not great because they don't tell the client anything, but I think your other suggestions are even worse (especially camouflaging it as a postcard error). I can attach a message string to the InvalidState error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how Custom with message is even worse than InvalidState with message. This is arguably not even invalid "state", it's an invalid input (the bytes buffer), which is why I suggested pretending that it's a postcard error. If we checked the invariant during deserialization it would become a postcard error, and for most invalid blobs it will be a postcard error (because the ZeroMap or its contents don't deserialize), but for a very small subset of invalid inputs we return InvalidState instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the other place where we use InvalidState is in the try_unwrap_owned function.

for a very small subset of invalid inputs we return InvalidState instead.

I'm convinced by this argument. With postcard, we do

        crate::DataError::custom("Postcard deserialize").with_display_context(&e)

so we can use custom here as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re try_unwrap_owned, I'd be happy to use custom there and remove the InvalidState error kind.

@sffc sffc requested a review from robertbastian July 6, 2022 09:29
robertbastian
robertbastian previously approved these changes Jul 6, 2022
Copy link
Member

@robertbastian robertbastian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was confused that decimal-bn-en.postcard reduced in size, but it's because it also includes und. Is that intended?

Manishearth
Manishearth previously approved these changes Jul 6, 2022
@sffc sffc dismissed stale reviews from Manishearth and robertbastian via 0e5805e July 10, 2022 23:09
@sffc
Copy link
Member Author

sffc commented Jul 10, 2022

I was confused that decimal-bn-en.postcard reduced in size, but it's because it also includes und. Is that intended?

This would be something potentially fixed by #834

@sffc sffc merged commit 7082923 into unicode-org:main Jul 11, 2022
@sffc sffc deleted the blob-refactor branch July 11, 2022 21:11
samchen61661 pushed a commit to samchen61661/icu4x that referenced this pull request Jul 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement de-duplication in blob provider
3 participants