Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A path to killing checked in postcard #3530

Closed
Manishearth opened this issue Jun 14, 2023 · 15 comments
Closed

A path to killing checked in postcard #3530

Manishearth opened this issue Jun 14, 2023 · 15 comments
Assignees
Labels
A-data Area: Data coverage or quality C-data-infra Component: provider, datagen, fallback, adapters S-medium Size: Less than a week (larger bug fix or enhancement) T-docs-tests Type: Code change outside core library

Comments

@Manishearth
Copy link
Member

Manishearth commented Jun 14, 2023

Related: #3529

As our tests move towards baked data, having postcard be checked in becomes less relevant. Git doesn't like such large files and it slows down clones. I still want deserialization to be thoroughly tested, especially since we rely on funky zero-copy stuff that may not be seen without running tests. There are two things we can do, and I think we should do both:

  • Have a test that runs full-datagen, and then compares the data from each source.
  • Have some (internal) testing mode where you can override autodata with a postcard file, and on cronjob CI run full-datagen (postcard) + all tests.

Discuss with:

@Manishearth Manishearth added discuss Discuss at a future ICU4X-SC meeting A-data Area: Data coverage or quality labels Jun 14, 2023
@Manishearth
Copy link
Member Author

cc @sffc @robertbastian

@robertbastian
Copy link
Member

Have a test that runs full-datagen, and then compares the data from each source.

Comparing DataPayloads is not currently possible, although it's something we need for #2683. We could check that databake <-> postcard roundtrips.

Have some (internal) testing mode where you can override autodata with a postcard file, and on cronjob CI run full-datagen (postcard) + all tests.

Overriding baked data with postcard will be a lot of effort, as singleton data keys don't go through the DataProvider::load API.

@Manishearth
Copy link
Member Author

Comparing DataPayloads is not currently possible, although it's something we need for #2683. We could check that databake <-> postcard roundtrips.

It's a test, we can do registry shenanigans. Note that checking roundtrips may not be enough because databake ⇒ postcard ⇒ databake will happily roundtrip with malformed data in databake (not the other path though).

Though I think checking roundtrips into JSON may be enough since it forces the zero-copy stuff to actually be read.

Overriding baked data with postcard will be a lot of effort, as singleton data keys don't go through the DataProvider::load API.

Oh I was imagining we build a special secret baked datagen mode where all the macros hit postcard (and generate it on demand, replacing defaultdata). Given the refactors we've done I think it would be possible.

@robertbastian
Copy link
Member

We need to check serdify(databake gen) == serde gen and databakify(serde gen) == databake gen.

@robertbastian
Copy link
Member

We can rewrite verify-zero-copy as a DataExporter that serializes and deserializes each generated payload. We can then use datagen options to control which locales this runs on and speed it up by using the fallback deduplication mechanism (the exporters will not see duplicates).

@sffc
Copy link
Member

sffc commented Jun 14, 2023

I would be quite happy if we could plumb globaldata to read from a BufferProvider and then run all our docstests on postcard or JSON. I think that, plus the existing verify-zero-copy, should be enough coverage.

We need to decide whether to generate the testing postcard from tags or from repodata. I still prefer repodata so that the test can be run locally and reproducibly from repo artifacts. I think we should read all CLDR JSON locales checked into the repo for individual components.

It would be nice to test for equality amongst the different formats, but that's above and beyond what we currently test, and there's no reason to believe that there would be any bugs so long as the Serialize, Deserialize, and Bake impls are correct.

@Manishearth
Copy link
Member Author

Manishearth commented Jun 14, 2023

I think repodata is sufficient to ensure tests are reproducible.

Being able to redirect globaldata to postcard is definitely ideal.

Equality is going above and beyond but it reduces risk to what will be becoming a less-used codepaths. It's definitely nice to have.

@sffc
Copy link
Member

sffc commented Jun 14, 2023

I'm not a huge fan characterizing the globaldata change as making postcard a "less-used codepath", because it's already the case that docstests use bake data.

@Manishearth
Copy link
Member Author

Manishearth commented Jun 14, 2023

Yeah that's true. I do think in that case we should be fixing this even if we don't kill checked-in postcard.

And we probably should consider killing checked-in postcard anyway (especially as we're also talking about sunsetting test data)

@robertbastian
Copy link
Member

We need to decide whether to generate the testing postcard from tags or from repodata. I still prefer repodata so that the test can be run locally and reproducibly from repo artifacts. I think we should read all CLDR JSON locales checked into the repo for individual components.

One nice thing that databake allows us to do is move component tests from datagen to the components. They should be in components but currently aren't because of the fixed repodata locale set. If all tests need to work with globaldata replaced by buffer testdata generated from repodata, this is not possible.

@sffc
Copy link
Member

sffc commented Jun 14, 2023

Datagen tests still require the data to be repodata, but currently testdata ignores locales that aren't in the testdata locales list. I'm suggesting that we change the generated postcard to include data for any locale that we have in repodata, which is not necesarilly the same across components.

@Manishearth
Copy link
Member Author

@robertbastian in that case I am ok generating postcard from tags. I was imagining postcard would use the union of all the globaldata and testdata config files (potentially with exceptions for e.g. fallback tests)

@robertbastian
Copy link
Member

Datagen tests still require the data to be repodata

I was hoping to move all tests that actually require data into the components.

@Manishearth
Copy link
Member Author

Agreed, I plan to do the same with the casemapping datagen tests when I move on to doing test work for casemapping

@sffc sffc added the discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band label Jun 15, 2023
@sffc
Copy link
Member

sffc commented Jun 26, 2023

Conclusion:

  • In full-datagen, add a flag that serializes and deserializes and then invokes a new PartialEq-type function. This covers the Serde code for each individual data struct.
  • Improve the BlobProvider tests to:
    1. Cover the case of a key with no entries
    2. Cover a singleton key (check for the error types)
    3. Fallback on the HelloWorld data

We discussed running all tests with BlobProvider. This is challenging because:

  • Const functions relying on singletons cannot be patched to rely on blob
  • Things need to get blob_provider dependencies

LGTM: @sffc @robertbastian @Manishearth

@sffc sffc added S-small Size: One afternoon (small bug fix or enhancement) S-medium Size: Less than a week (larger bug fix or enhancement) and removed discuss Discuss at a future ICU4X-SC meeting discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band S-small Size: One afternoon (small bug fix or enhancement) labels Jun 26, 2023
@sffc sffc added this to the 1.x Priority ⟨P2⟩ milestone Jun 26, 2023
@sffc sffc added T-docs-tests Type: Code change outside core library C-data-infra Component: provider, datagen, fallback, adapters labels Jun 26, 2023
@robertbastian robertbastian self-assigned this Jul 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-data Area: Data coverage or quality C-data-infra Component: provider, datagen, fallback, adapters S-medium Size: Less than a week (larger bug fix or enhancement) T-docs-tests Type: Code change outside core library
Projects
None yet
Development

No branches or pull requests

3 participants