Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encoded URLs with a + sign fail #4891

Closed
AngelloMaggio opened this issue Jun 14, 2022 · 13 comments
Closed

Encoded URLs with a + sign fail #4891

AngelloMaggio opened this issue Jun 14, 2022 · 13 comments
Labels
A-backend ⚙️ C-bug 🐞 Category: unintended, undesired behavior

Comments

@AngelloMaggio
Copy link

Some clients, proxies, and/or repository managers may encode URLs to the upstream (for instance JFrog Artifactory). Due to blocks on crates.io some of these requests fail if there is a + sign in the URL

e.g. If we go to this URL it works fine:

https://static.crates.io/crates/libgit2-sys/libgit2-sys-0.12.25+1.3.0.crate

But if we go to the encoded URL it gives us a 403 Access Denied
https://static.crates.io/crates/libgit2-sys/libgit2-sys-0.12.25%2B1.3.0.crate

This has also exposed some more strange behavior.
If we encode the + sign as a space (%20 instead of %2B), it works.
https://static.crates.io/crates/libgit2-sys/libgit2-sys-0.12.25%201.3.0.crate

Or even if I put a space instead in the URL (though that may be my browser encoding for me)
"https://static.crates.io/crates/libgit2-sys/libgit2-sys-0.12.25 1.3.0.crate"

Since URL encoding is fairly standard and other encoding works, I believe this is a bug

@carols10cents
Copy link
Member

static.crates.io goes to our s3 bucket. The docs for s3 object names say:

The following characters in a key name might require additional code handling and likely need to be URL encoded or referenced as HEX.

...

  • Plus ("+")

So it looks like we should be URL encoding +.

To fix this, we'll need to research whether we need to upload both URL encoded and unencoded for compatibility reasons or not...

@Turbo87 Turbo87 added C-bug 🐞 Category: unintended, undesired behavior A-backend ⚙️ labels Jun 19, 2022
@MaggioAngello
Copy link

Any update on this one? It's failing rust builds for us and wanted to see if there is any light at the end of the tunnel. Thank you so much!
@carols10cents @Turbo87

@carols10cents
Copy link
Member

carols10cents commented Jul 12, 2022

Hi, yes, we talked about this in our most recent meeting but due to vacations and other obligations, no one has the bandwidth to work on this immediately. If you have time to help, researching whether cargo can handle URLs using spaces returned from crates.io's /download redirects with or without encoding would be helpful. Thanks!

@ambassadorfunk
Copy link

@carols10cents - Are there any further updates on this issue and the timing of when the code fix will be merged? Thanks!

@carols10cents
Copy link
Member

No, it needs to be reviewed and tested and no one has had time to do that yet.

@Turbo87
Copy link
Member

Turbo87 commented Jun 7, 2023

together with @jdno I've done a bit of investigation on this topic to figure out the root causes and potential solutions:

Investigation

https://datatracker.ietf.org/doc/html/rfc3986#section-2.2 defines + as a
reserved character, but it is not forbidden from being used in URLs.

https://datatracker.ietf.org/doc/html/rfc1866#section-8.2.1 specifies that space
characters are replaced by + characters, but that is only relevant for the
application/x-www-form-urlencoded MIME type and for query parameters and
form submission request payloads.

https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html defines
+ as a character "that might require special handling". In practice, S3
converts + characters in object keys to space characters
, similar to the
specification in RFC1866 above.

cargo

cargo requests dependencies without any percent-encoding in the URL path.

e.g. https://crates.io/api/v1/crates/libgit2-sys/0.12.25+1.3.0/download

crates.io

For crates.io, + characters in the URL path are handled without conversion:

static.crates.io

For static.crates.io, + characters in the URL path are converted to space
characters due to the S3 backing storage:

Additional context

https://github.com/rust-lang/crates.io-index/blob/60eb74e915ef71515ada464494839a372bcae87f/config.json#L2
allows us to change the download endpoint in case of infrastructure issues with
the crates.io API or even the CDN. By default, all downloads go through the
crates.io API for download counting and normalization purposes, but they can be
switched to hit static.crates.io, or even point to S3 directly in case
CloudFront and/or Fastly are experiencing issues.

Proposed Solution

Summary: We switch S3 over to use %2B and configure the CDNs to automatically
convert + to %2B for backwards compatibility.

Detailed Plan:

  1. Temporarily upload crates with + in the version to the space- and
    plus-containing S3 paths.
  2. Copy all space-containing files on S3 to their plus-containing equivalents.
  3. Switch crates.io to use %2B for download redirects
  4. Implement path rewriting on the CDN level (if possible)
  5. Revert 1)
  6. Delete space-containing files on S3

This should not disrupt the regular downloads, and it should still allow us to
switch to static.crates.io in case of crates.io API incidents. In case of CDN
incidents we can switch to direct S3 downloads, but they might fail for versions
with build metadata since cargo requests use +, but at that point S3 would
only support %2B.

We could optionally fix that last part by adjusting cargo to always encode +
as %2B when downloading crates. Since the percent-encoding is part of the
official specs this should not cause any issues aside from people potentially
asserting on specific URLs, though that should be relatively easy to fix.

Alternatives

  • We could keep everything as-is, which is obviously the least amount of work,
    but would also keep causing issues for misbehaving "clients, proxies, and/or
    repository managers".

  • We could rely entirely on CDN path rewriting, but that might seem a bit
    confusing to crates.io contributors if we upload the file to an escaped path,
    but the download path uses an unescaped URL.

@rust-lang/crates-io @rust-lang/cargo @rust-lang/infra please let me know if you have any objections, suggestions or other comments. if not, I would start implementing the proposed solution in the next weeks.

@weihanglo
Copy link
Member

We could optionally fix that last part by adjusting cargo to always encode +
as %2B when downloading crates. Since the percent-encoding is part of the
official specs this should not cause any issues aside from people potentially
asserting on specific URLs, though that should be relatively easy to fix.

Thanks for the write-up! It's pretty great to just follow the spec, though I guess it may break some 3rd-party registries if we do the adjustment on Cargo side. Could you open an issue on rust-lang/cargo so that Cargo team can also track it?

@Turbo87
Copy link
Member

Turbo87 commented Jun 23, 2023

Progress Update:

Step 1 of the proposed solution was implemented by #6649 and #6660. Step 2 was also fixed this morning. I'm now working on step 3.

@Turbo87
Copy link
Member

Turbo87 commented Jun 23, 2023

Step 3 is being implemented by #6665

@rust-lang/infra I will need your help on step 4 since I don't have access to the CDN infrastructure.

@Turbo87
Copy link
Member

Turbo87 commented Jun 23, 2023

/cc @dtolnay since this might become relevant to https://github.com/dtolnay/get-all-crates depending on when step 4 is implemented :)

@jyn514
Copy link
Member

jyn514 commented Jun 23, 2023

@Turbo87 could you ask in the t-infra channel on Zulip, please? all of us get a lot of pings and it will be easier to coordinate the discussion.

(i don't have permissions to modify S3 or CloudFront)

@Turbo87
Copy link
Member

Turbo87 commented Jun 26, 2023

Could you open an issue on rust-lang/cargo so that Cargo team can also track it?

done :)

@Turbo87
Copy link
Member

Turbo87 commented Jul 14, 2023

rust-lang/simpleinfra#313 has been merged and deployed this morning, which implements step 4 of the plan above.

#6666 has subsequently been merged, which implements step 5 of the plan above. This PR will be deployed in the near future too.

The only thing remaining now is to remove all files from S3 that include a space character instead of the encoded + (step 6).

Since only cleanup for this is remaining and the original issue should be resolved I'll close the issue now :)

@Turbo87 Turbo87 closed this as completed Jul 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-backend ⚙️ C-bug 🐞 Category: unintended, undesired behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants