Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add normalize-index background job #5558

Merged
merged 8 commits into from Jan 6, 2023
Merged

Conversation

arlosi
Copy link
Contributor

@arlosi arlosi commented Nov 29, 2022

Adds a background job that regenerates the git index by reading existing files, normalizing them and writing them back out again. Does not use the database.

  • Sort deps field
  • Sort features field
  • Remove "links": null (since it's optional)
  • Replace dep "kind": null with "kind": Normal (appears to be an early implementation error)
  • Replace dep "features": [""] with "features": [] (appears to be an early implementation error)

r? @Turbo87

@Turbo87 Turbo87 added C-internal 🔧 Category: Nonessential work that would make the codebase more consistent or clear A-backend ⚙️ labels Nov 30, 2022
@bors
Copy link
Contributor

bors commented Dec 3, 2022

☔ The latest upstream changes (presumably #5579) made this pull request unmergeable. Please resolve the merge conflicts.

@Turbo87
Copy link
Member

Turbo87 commented Dec 3, 2022

@arlosi took the liberty of rebasing your branch since I introduced a few merge conflicts 😅

@Turbo87 Turbo87 force-pushed the normalize branch 3 times, most recently from 0671bd9 to fc19915 Compare December 7, 2022 09:36
// Add an additional commit after the squash commit that normalizes the index.
println!("committing normalization");
let msg = "Normalize index format\n\n\
More information can be found at https://github.com/rust-lang/crates.io/pull/5066";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to update this link to point to this PR, or perhaps an internals post?

@arlosi
Copy link
Contributor Author

arlosi commented Dec 7, 2022

@Turbo87 The rebase looks good to me. Let me know if there's anything else you need.

I understand that this is a large change to the index repo and probably needs some time for discussion within the team.

@Turbo87
Copy link
Member

Turbo87 commented Dec 31, 2022

I've thought a bit about this PR, and there are a few things in it where I'm not sure yet whether it's the best solution or not:

  • Sort crate versions

unless I'm misreading the code, it sorts the version by semver, while previously they were sorted by insertion order. I understand that semver sorting might seem cleanest but a) the database can't sort them that way directly, b) it causes a big diff for most crates and c) when we run the index normalize command then new versions afterwards will still be sorted by insertion order.

in other words: I think it might be easier to keep the current sort order and for #5066 change it to sort by created_at

  • Remove "links": null (since it's optional)

I was initially a bit worried by that, but since you've already shown in #5066 (comment) that this shouldn't be an issue, I guess this is fine :)

Add normalize-index admin command

Unfortunately, with #5066 (comment) I might be partially responsible for the design of making this an admin command, but I wonder if it would be better to implement it as a background job. Since the background worker has a lock on the index repo it would allow us to run this migration without having to go into read-only mode. The dry-run functionality would have to look slightly different, potentially pushing the update to a dedicated branch instead of master, but that should be doable.

After normalizing the index files, this admin commands also performs a squash of the index

While I agree that a timely squash would be useful, I'm not convinced that this part should be done automatically. It's easy enough to trigger manually after the result of the normalization has been confirmed to be correct.

Sorry for the late feedback. It took me a while to understand my gut feeling hesitation on merging this as-is 😅

@Turbo87 Turbo87 force-pushed the normalize branch 3 times, most recently from 811ea40 to e9d56da Compare December 31, 2022 14:14
@Turbo87
Copy link
Member

Turbo87 commented Dec 31, 2022

I took the liberty of implementing the changes I mentioned above and tested it on the staging index. The result is rust-lang/staging.crates.io-index@1c98a7d, which looks reasonable to me.

One thing I noticed it that some dependencies don't seem to have a kind field and I'm wondering if it would make sense for us to skip the field in that case to save space in the index.

Now that I think about it, I'm wondering for what reason we have the dev dependencies in the index at all. They should not be used for dependency resolution, so I'm wondering if we could remove those as well. That is unrelated to the current cleanup though.

@bors
Copy link
Contributor

bors commented Dec 31, 2022

☔ The latest upstream changes (presumably #5771) made this pull request unmergeable. Please resolve the merge conflicts.

@Eh2406
Copy link
Contributor

Eh2406 commented Dec 31, 2022

You brought up a bunch of good points, but I think there's an underlying question required to make decisions here. Is normalizing the index a one time thing, or are we willing to do it multiple times?

If were willing to do it multiple times, then a lot of these questions become less important because we can do them at the next normalization. The cargo team is attempting to get a more nuanced understanding of how "optional" most fields really are. Once that is done I would not be surprised if we could remove kind and many fields that have been added since 1.19 when they equal their default. Sorting versions by semver probably increases the compressibility of the files, but can be put off until new files can also be written in sorted order.

Of course the normalization and its diff with no semantic meaning will cause a hassle for a lot of people. It would not be unreasonable for us to decide we will only do this once. In which case all of these questions need to be decided before we do the one and only normalization.

@Turbo87
Copy link
Member

Turbo87 commented Dec 31, 2022

my understanding was that we can do multiple normalizations without major issues, so I wouldn't want to put all of these steps into a single PR. Since we would couple the normalization with a timely squash most people probably won't notice it anyway.

@arlosi
Copy link
Contributor Author

arlosi commented Jan 3, 2023

Your changes look good to me. Thanks for taking the initiative here. Let me know if there's more you need help with.

  • Sorting by created_at seems fine. We'll need to manually verify that it produces the same order as the git index if that's the goal.
  • kind is documented as a required field (even though in practice it having an empty kind is equivalent to normal. If we want to change the Cargo documentation, then we could remove kind for normal dependencies.
  • I'm fine with decoupling the normalization from squashing, though we'll probably want to do a squash shortly after normalization, since it will increase the size quite a bit.

@arlosi arlosi changed the title Add normalize-index admin command Add normalize-index background job Jan 3, 2023
@Turbo87
Copy link
Member

Turbo87 commented Jan 6, 2023

We talked about this PR in our weekly team meeting today and the conclusion was that this should be good to go.

I'll merge this, deploy it, and then do a dry run on the production index. Unless there are any surprises in that dry run, we will probably run the proper normalization some time next week.

@jonhoo
Copy link
Contributor

jonhoo commented Jan 9, 2023

I may be missing something, but I don't see a sort of .features in the code that landed?

let mut krate: Crate = serde_json::from_str(&line)?;

@arlosi
Copy link
Contributor Author

arlosi commented Jan 9, 2023

The features keys should end up being sorted because they're in a BTreeMap now.

It was changed from a HashMap in this comment: arlosi@a26e319, so the normalization should end up sorting them by reading and writing back out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-backend ⚙️ C-internal 🔧 Category: Nonessential work that would make the codebase more consistent or clear
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

5 participants