# General outline of proposed updates to publishing logic #1522
Just for posterity: max crate size will eventually need to be checked asynchronously, so this should probably be enforced primarily in cargo.
I've been thinking of how this task could be split up into smaller tasks to make it easier to review and to lower the risk of breaking something (or at least making smaller changes so that we can tell which change broke something). I tried putting the current and proposed lists as revisions of a gist to enable viewing them in diff format; it helps a little, I think: https://gist.github.com/carols10cents/4f32c43855fdfd77a8a5b48f53ab06b5/revisions#diff-b160a194db1110d5914710115e64d429

I also think there's opportunity to refactor the …

So I'd like to see these smaller changes to the code in the publish controller made in separate PRs, in approximately this order (multiple items in this list might be accomplished in one PR, depending on how much reordering is or is not necessary):

* Verify Headers
* Verify Request Body
* With database, outside of main transaction
* Start writing within the transaction
@jtgeibel can you clarify a bit more what you mean by this in the proposed section:
As far as I can tell, this isn't a check we're doing directly right now. Are you thinking this is a quick way we can reject invalid requests rather than waiting until `verify_tarball` gets the data? Or would this new check prevent problems we could potentially be open to right now?
Do we really need a separate setting for this? Isn't max content length an effective maximum on metadata, since you could theoretically have a tarball that's 0 bytes and metadata that takes up the rest of the space? Just thinking in terms of what we would set this configuration option to if we had it! I suppose that gets into how we want to resolve this comment, and how we want to communicate this limit to someone whose crate is getting rejected who has a …
Since the original issue was created, we've added the requirement that the publishing user have a verified email address. It's currently pretty early in the process, and it doesn't depend on the crate content at all, but it does need a database connection. If I'm following the logic you've laid out here, I think it should get inserted here? Do you agree @jtgeibel ?

> ### With database, outside of main transaction
>
> * Obtain database connection
> * Ensure user has a verified email address *(the proposed new step)*
> * Ensure name is not reserved

(Also, the reason I'm all of a sudden all over this issue is that I think this can be split into a bunch of smaller contributions we can get new people to swarm on, and it'll also help us enable `--dry-run`.)
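For concreteness, here's a hypothetical Diesel 1.x-style sketch of what that check might look like. The `emails` table, its `user_id` and `verified` columns, and the function name are all assumptions made for this example, not the actual crates.io schema:

```rust
use diesel::prelude::*;

// Hypothetical: returns whether the user has a verified email address.
// A missing email row is treated the same as an unverified one.
fn has_verified_email(conn: &PgConnection, user_id: i32) -> QueryResult<bool> {
    use crate::schema::emails;

    emails::table
        .filter(emails::user_id.eq(user_id))
        .select(emails::verified)
        .first::<bool>(conn)
        .optional() // no row at all -> Ok(None)
        .map(|verified| verified.unwrap_or(false))
}
```

The publish controller could then reject the request with a user-facing error before opening the main transaction.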
Thanks for taking a fresh look at this @carols10cents! I definitely agree with the approach of taking small, reviewable steps towards this general outline.
👍 This is what I have in mind as well. In general, I think we should review logic that is currently in models like …

I've gone through the code again and updated the "Currently" section above, splitting out the background job work and adding notes showing where other steps are currently located. I've also added several new steps (to both sections).
No, we don't do that currently. I was considering it mainly as a quick sanity check on the request that can be done very early in the request processing. We might want to land it in an atomic deploy, if we decide to add such a check.
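To illustrate the kind of early sanity check being discussed (not a commitment to an implementation): the publish request body begins with a little-endian `u32` giving the length of the JSON metadata, so a check like the following sketch could reject oversized metadata before buffering anything else. The constant and function name are invented for this example:

```rust
use std::io::{self, Read};

// Placeholder limit; see the discussion below about what to set it to.
const MAX_METADATA_LENGTH: u32 = 128 * 1024;

// Reads the 4-byte metadata length prefix and rejects the request early
// if the declared length exceeds the configured maximum.
fn read_metadata_length<R: Read>(body: &mut R) -> io::Result<u32> {
    let mut buf = [0u8; 4];
    body.read_exact(&mut buf)?;
    let len = u32::from_le_bytes(buf);
    if len > MAX_METADATA_LENGTH {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("metadata length {} exceeds the maximum allowed", len),
        ));
    }
    Ok(len)
}
```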
I think it makes sense to add a separate config item here, but I don't know what we should set it to. Maybe we should add some logging to get a feel for typical metadata sizes. The publish endpoint is fairly attractive from a DoS perspective, and the metadata could kick off a lot of database activity. I could potentially see a scenario where it would be nice to drop this limit quickly via an environment variable, although I haven't put much serious thought into what such an attack might look like or whether this would be an effective mitigation.
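As a sketch of the "drop the limit quickly" idea, the limit could be read from an environment variable with a hard-coded fallback. The variable name and default below are made up for illustration:

```rust
// Hypothetical: an operator could lower this limit without a deploy by
// changing the environment variable on the running app.
fn max_metadata_length() -> u32 {
    std::env::var("MAX_PUBLISH_METADATA_LENGTH")
        .ok()
        .and_then(|value| value.parse().ok())
        .unwrap_or(128 * 1024) // fallback default, chosen arbitrarily
}
```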
Both of these sound great to me!
Thanks!!! I've updated the diff view and I'm going to update the checklist next :)
I decided to split off the max metadata length check to a separate issue: #1896
> ### Verify Request Body
> ...
> * Upload crate and rendered README (if not `--dry-run`)
>
> ### With database, outside of main transaction
> ...
>
> ### Start writing within the transaction
> ...

Hm, shouldn't we wait until we validate everything to upload to S3? If the crate doesn't make it into the database or the index, then I don't *think* there's a way for cargo to download the `.crate` thinking it's legitimate, but I wonder about invalid stuff (say, by someone who doesn't own a crate) being uploaded and accessible from the direct static.crates.io URL. Am I missing something?
I agree. Those uploads are now done in the background jobs. I think I forgot to remove it in my last edit (and I agree that was probably the wrong place in my original plan anyway).
The README rendering and uploading is done in a background job, but I don't think uploading the `.crate` file is yet?
You're right. I've updated the issue above to clarify. (I thought I had already done so, but maybe I didn't hit save on that edit.) In the proposed section above, I've added the crate upload step to be the last step before enqueueing the background jobs.
since there hasn't been any activity here for 2.5 years, I guess we can close this issue. feel free to reopen if it becomes relevant again :)
I've recently gone through our publishing logic and would like to document my findings and propose some changes. This was originally raised in the context of a `cargo publish --dry-run` option in #1517, but I think some refactoring here would help with proposed enhancements such as background jobs (#1466) and direct client uploads to S3.

## Currently (edit: updated 2019-11-13)

Our publishing logic currently follows this sequence:
* `NewCrate::validate`
* `NewCrate::ensure_name_not_reserved`
* `NewCrate::save_new_crate`
* `NewCrate::create_or_update`
* `NewCrate::create_or_update`
* `NewVersion::validate_license` (via `NewVersion::new`)
* `NewVersion::save`
* `NewVersion::save`
* `models::dependency::add_dependencies`
* `Keyword::update_crate`
* `Category::update_crate`
* `Badge::update_crate`
* `uploaders::upload_crate`
* Background job: Render and upload README (defined in `render::render_and_upload_readme`)
* Background job: Update Index (defined in `git::add_crate`)
## Proposed

### Notes

...

### Verify Headers

...

### Verify Request Body

...

### With database, outside of main transaction

...

### Start writing within the transaction

...
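Putting the proposal together, here is a compilable but entirely illustrative skeleton of the ordering described above. Every type and helper in it is a stub invented for this sketch; none of these names are the actual crates.io API:

```rust
// All types and helpers below are placeholders for this sketch only.
type DbConn = (); // stand-in for the real database connection type
type Error = String;

struct Metadata { name: String }
struct Tarball(Vec<u8>);

// --- Verify headers: cheap rejections before reading the body ---
fn verify_content_length(len: u64, max: u64) -> Result<(), Error> {
    if len > max { return Err("payload too large".into()); }
    Ok(())
}

// --- Verify request body: no database access needed yet ---
fn parse_publish_body(_body: &[u8]) -> Result<(Metadata, Tarball), Error> {
    Ok((Metadata { name: "demo".into() }, Tarball(Vec::new()))) // stub
}
fn verify_tarball(_t: &Tarball) -> Result<(), Error> { Ok(()) } // stub

// --- With database, outside of main transaction ---
fn ensure_verified_email(_c: &DbConn) -> Result<(), Error> { Ok(()) } // stub
fn ensure_name_not_reserved(_c: &DbConn, _name: &str) -> Result<(), Error> { Ok(()) }

// --- Writes, run inside the main transaction in the real code ---
fn save_crate_and_version(_c: &DbConn, _m: &Metadata) -> Result<(), Error> { Ok(()) }
fn upload_crate(_t: &Tarball) -> Result<(), Error> { Ok(()) } // last, before jobs
fn enqueue_background_jobs(_c: &DbConn) -> Result<(), Error> { Ok(()) } // README + index

fn publish(conn: &DbConn, content_length: u64, body: &[u8]) -> Result<(), Error> {
    verify_content_length(content_length, 10 * 1024 * 1024)?;

    let (metadata, tarball) = parse_publish_body(body)?;
    verify_tarball(&tarball)?;

    ensure_verified_email(conn)?;
    ensure_name_not_reserved(conn, &metadata.name)?;

    // In the real controller, everything from here would run inside one
    // database transaction, with the crate upload as the final step
    // before the background jobs are enqueued.
    save_crate_and_version(conn, &metadata)?;
    upload_crate(&tarball)?;
    enqueue_background_jobs(conn)
}

fn main() {
    // Example call with an empty body; the stubs make this succeed.
    publish(&(), 0, &[]).unwrap();
}
```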