Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding source data versions to baked crates #4399

Merged
merged 5 commits into from
Dec 8, 2023

Conversation

robertbastian
Copy link
Member

@robertbastian robertbastian commented Dec 1, 2023

impl<T> ExportableProvider for T where
T: IterableDynamicDataProvider<ExportMarker> + DynamicDataProvider<AnyMarker> + Sync
{
/// Returns the source versions of the data.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This breaks semver but I don't know how else I would get this data from the source provider. I really don't want to define a data key for this...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding the function to the trait with a default impl doesn't break semver, but removing the blanket impl does. Can you use runtime specialization in the blanket impl to sniff for the known concrete impl that has a source_versions function? Well I guess that requires making T: 'static which is also semver. I lean toward just deleting the blanket impl at least to get the PR building and then we can escalate to the team.

Comment on lines 196 to 199
tag.strip_prefix("release-")
.unwrap_or(tag)
.replace('-', ".")
.leak(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: often we use pre-release tags. Maybe just use the whole tag name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The GH tags don't match the version in the ICU repo, so I have to process them somehow. release-75-rc will become 75.rc which I guess is fine. I do want release-75-1 to become 75.1, because that's the version that we would read from the files.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just read it from the files?

I'm thinking about the tags like icu4x/2023-12-01/73.1 which I'd like to not need to parse. I'm okayish with bubbling them up to users, though I agree having an actual version number seems more useful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just read it from the files?

Because then 10 different tags will be 74.1.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we save the tag string in SourceVersions and then if we want we can add an instance method to SourceVersions that parses the ICU version out of the tag.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about you let me parse release-xx-yy to xx.yy?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like a more consistent mental model if SourceVersions::Public always contains the tag string.

If you want to parse the release version into SourceVersions::Public, you could make a new variant SourceVersions::Tagged for arbitrary tags that aren't release tags (including hashes, etc)?

/// A concrete version
pub enum SourceVersion<'data> {
/// A released version
Public(#[cfg_attr(feature = "serde", serde(borrow))] &'data str),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: use a Cow so we don't leak strings.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't have cows in consts because of dropping stuff.

pub struct SourceVersions<'data> {
#[cfg_attr(feature = "serde", serde(borrow))]
/// The version of CLDR source data, if used during datagen.
pub cldr: Option<SourceVersion<'data>>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation/Issue: icu_provider has no concept of CLDR. I'd like to keep it agnostic if possible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

personally I think it's important to have this but it's not a strong opinion

impl<T> ExportableProvider for T where
T: IterableDynamicDataProvider<ExportMarker> + DynamicDataProvider<AnyMarker> + Sync
{
/// Returns the source versions of the data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding the function to the trait with a default impl doesn't break semver, but removing the blanket impl does. Can you use runtime specialization in the blanket impl to sniff for the known concrete impl that has a source_versions function? Well I guess that requires making T: 'static which is also semver. I lean toward just deleting the blanket impl at least to get the PR building and then we can escalate to the team.

@sffc
Copy link
Member

sffc commented Dec 1, 2023

Oh and thanks for this PR!

Manishearth
Manishearth previously approved these changes Dec 2, 2023
}

#[cfg(feature = "datagen")]
impl databake::Bake for SourceVersions<'_> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: ideally would be very cool if e.g. the SourceVersions for properties did not mention a cldr version (since it's irrelevant)

but that seems to be more hassle than it's worth to track

@robertbastian robertbastian added the discuss-priority Discuss at the next ICU4X meeting label Dec 7, 2023
@sffc
Copy link
Member

sffc commented Dec 7, 2023

  • @Manishearth - There are two pieces of information. One, what is the provenance of the data. Second, what is the versioning of that data source. So is it more useful to know your list of sources, or... ICU4C contains a lot of CLDR data and it could start exporting that data. What does that mean here? Do they reflect that as an ICU version or do they put it as the CLDR version?
  • @robertbastian - I see this as a feature for compiled data. If you are building your own data, you don't need this information. And baked data is always from our three datagen sources.
  • @younies - What's going on here... is it exposing the version of CLDR or LSTM?
  • @robertbastian - it's for everything, we want to expose the version of the source data. CLDR, icuexport, segmenter
  • @Manishearth - You need to know if you're up-to-date, etc.
  • @younies - Why not only in documentation?
  • @Manishearth - Because you want to access from code.
  • @sffc - @sven-oly's conformance suite wants to query the data version so it can select the correct test data.
  • @Manishearth - Chrome/V8 bubble this information up via version strings, bug reporting, ...
  • @sffc - This is just a fingerprint. Doesn't even need to be well structured.
  • @sffc - Don't like how it is in the global icu_provider crate. It could be a struct defined in each data crate.
  • @robertbastian - If this is metadata, why don't we put it in the data crates Cargo.toml? It's programatically accessible, and we're more flexible with the structure. I'm not a big fan of having a Rust API for this.
  • @younies - Structured data is better because you don't want to parse a string.
  • @sffc - we could put it in cargo.toml, would satisfy concern about schema. reading from cargo.toml is a pain. we do that for icu4x versions in the conformance test code.
  • @younies - don't like litemaps because you'll be forced to have strings and you have to parse them
  • @Manishearth - but we can use any type we want, including enums. if there are certain common structures we can add them to the enum
  • @robertbastian - We read the version numbers from either the git tag or... we don't get that guarantee ourselves.
  • @robertbastian - We can overdesign this but I really don't want to commit to a Rust API for this. For the conformance tests it's fine to put it in Cargo.toml
  • @Manishearth - want to highlight that conformance tests are an internal use case, we should make sure they have everything they need, but we shouldn't overindex on making it convenient unless other clients need the same thing.
  • @sffc - OK to do the Cargo.toml solution and relitigate this when we have a client other than the conformance suite.
  • @robertbastian - And the data crates can include a macro that generates docs with the version numbers.
  • @younies - How do people access the data from Cargo.toml?
  • @sffc - They have to read the Cargo.toml file. There is a cargo_metadata crate which makes it slightly easier.
  • @younies - How about a programmatic user?
  • @Manishearth - We can revisit that when we have a concrete use case other than @sven-oly's.

Consensus:

  • data crate Cargo.toml files contain an auto-generated metadata section with the versions:
[metadata.sources]
# some free structure chosen by datagen
cldr = { tagged = "44.0.0" }
icuexport = { custom = "74.0" } 
  • data crates include the version number they use in their own doc string (e.g. in lib.rs)
  • Revisit an API solution if a client comes to us with a clear use case.

LGTM: @sffc @Manishearth @robertbastian @younies

@sffc sffc removed the discuss-priority Discuss at the next ICU4X meeting label Dec 7, 2023
@robertbastian robertbastian changed the title Piping source data versions through datagen Adding source data versions to baked crates Dec 7, 2023
@@ -15,3 +15,8 @@ homepage.workspace = true
include.workspace = true
repository.workspace = true
rust-version.workspace = true

[metadata.sources]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think it should be [package.metadata.sources]

https://doc.rust-lang.org/cargo/reference/manifest.html#the-metadata-table

Comment on lines -115 to +124
.replace("_version_", version),
.replace("_version_", version)
.replace("_cldr_tag_", DatagenProvider::LATEST_TESTED_CLDR_TAG)
.replace(
"_icuexport_tag_",
DatagenProvider::LATEST_TESTED_ICUEXPORT_TAG,
)
.replace(
"_segmenter_lstm_tag_",
DatagenProvider::LATEST_TESTED_SEGMENTER_LSTM_TAG,
),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Praise: This is, like... a super clean and simple way to do things. Good suggestion 👍

@robertbastian robertbastian merged commit 60b84ef into unicode-org:main Dec 8, 2023
29 checks passed
@robertbastian robertbastian deleted the sourceversionos branch January 4, 2024 14:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants