Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Network retry issue 1602 #2396

Merged
merged 1 commit into from May 12, 2016

Conversation

Projects
None yet
4 participants
@sbeckeriv
Copy link
Contributor

sbeckeriv commented Feb 19, 2016

Dearest Reviewer,

This branch resolves #1602 which relates to retrying network
issues automatically. There is a new utility helper for retrying
any network call.

There is a new config called net.retry value in the .cargo.config
file. The default value is 2. The documentation has also been
updated to reflect the new value.

Thanks
Becker

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented Feb 20, 2016

Thanks for the PR @sbeckeriv! Some thoughts of mine:

  • Could this logic be centralized as much as possible? Currently the retry loop is encoded in two separate places and ideally it'd only be defined in one.
  • I'm pretty wary of retrying on any error, I've generally had bad experiences with tools that have that kind of behavior. Most of the time an error is impossible to resolve until something changes, it's only pretty rarely that an intermittent error can be retried to fix behavior. Can we whitelist specific errors to retry on?
  • I think that this logic needs to be attached directly to the calls to the network themselves, right now it's placed very high up in the chain and far away from where some updates may happen. There's also network updates throughout the rest of cargo (e.g. when publishing or installing) which won't get affected in the locations this is added.
  • Can you be sure to add some tests for this? It may be easiest to just mirror the HTTPS tests where some local sever is spun up, connected to, and immediately dropped to generate an error.
  • Can you be sure to document any new options in src/doc/config.md?
@sbeckeriv

This comment has been minimized.

Copy link
Contributor Author

sbeckeriv commented Feb 20, 2016

Thanks for the reply.

I will move the net_retry logic in to the utils config object. For the logic it self I believe I know how to do this. I am going to write a function that takes a closure and handles the logic.

I will figure out a white list enum too.

logic needs to be attached directly to the calls to the network themselves

I think with the closure concept I could push the logic down the stack, however, that would not blanket all network calls. Do you have some thoughts on how I might go about that? I think most network calls are else where crates. I could update curl-rust to support a retry option pretty easy. Make it an http setting like proxy and timeout. I read that libcurl does not directly support it. I need to read more around the ssh side. Let me know your thoughts.

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented Feb 21, 2016

Do you have some thoughts on how I might go about that? I think most network calls are else where crates. I could update curl-rust to support a retry option pretty easy.

Nah I can help audit after the infrastructure in place, and then I can also help to ensure that we always push network requests through the same location. I'd also recommend that even if this were an option in curl-rust that we'd still do it in Cargo as we'd likely want to give a UI update as part of it as well anyway.

@sbeckeriv

This comment has been minimized.

Copy link
Contributor Author

sbeckeriv commented Feb 22, 2016

Before I get to far thoughts on these changes:
net retry value from the config with a default value:
https://github.com/rust-lang/cargo/pull/2396/files#diff-4559bc8f75ebf0a4d42be6e4e7fe9eaaR200

Wrapped retry logic:
https://github.com/rust-lang/cargo/pull/2396/files#diff-f9445846d299388561a3972b309c9d9eR1
and an example call:
https://github.com/rust-lang/cargo/pull/2396/files#diff-591901395b44f65a638e1c81531878d2R173

Thanks! It was fun to figure out the with_retry.

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented Feb 22, 2016

Yeah, that looks great to me!

I think we may want to have another callback for "was this error one we can retry from", but other than that it looks like the right infrastructure to me

@sbeckeriv

This comment has been minimized.

Copy link
Contributor Author

sbeckeriv commented Feb 24, 2016

Hey,
I have updated to the new callback. I updated the docs and added in some comments in the code. I copied a https test and I am checking the stdout.

For the "was this error one we can retry from"; error handling is new to me. Should I pass it the CargoResult value or the unwrapped Err? What might that callback look like? How do I check its a Err I care about vs any other error?

Thanks!

@@ -134,7 +140,7 @@ impl<'cfg> PackageRegistry<'cfg> {
Some(&(ref previous, _)) => {
if previous.precise() == namespace.precise() {
debug!("load/match {}", namespace);
return Ok(())
return Ok(());

This comment has been minimized.

@alexcrichton

alexcrichton Feb 25, 2016

Member

Perhaps you have rustfmt on auto-run? Could you back out these changes for now? They tend to clutter up the diff unfortunately :(

@@ -93,7 +97,8 @@ impl<'cfg> PackageRegistry<'cfg> {
let mut ret = Vec::new();

for (_, source) in self.sources.sources_mut() {
try!(source.download(package_ids));
try!(network::with_retry(&self.config, || source.download(package_ids)));

This comment has been minimized.

@alexcrichton

alexcrichton Feb 25, 2016

Member

This probably wants to be attached to the literal network calls themselves rather than way up here at the download layer. For example a download from everything but the registry doesn't actually need to be retried.

let repo = try!(self.remote.checkout(&self.db_path));

let repo = try!(network::with_retry(&self.config, ||{
self.remote.checkout(&self.db_path)

This comment has been minimized.

@alexcrichton

alexcrichton Feb 25, 2016

Member

Could you attach this to the actual fetch done by the git repo?

This comment has been minimized.

@sbeckeriv

sbeckeriv Feb 27, 2016

Author Contributor

Im going to need some advice here. I started to put a cargo_config: &'cfg Config attribute on GitRemote. This lead to a cascade of lifetime markers. After that cleared up I am getting more errors. I think I can reason about the errors but I just wanted to make sure my approach was right.
Thanks!
Becker

current list of errors

This comment has been minimized.

@alexcrichton

alexcrichton Feb 28, 2016

Member

Ah it should be fine to probably just pass in the Config instead of storing it. Each source already stores a pointer to Config somewhere so it should be readily available to pass in

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented Feb 25, 2016

I think that it may be good to introduce a new trait:

trait NetworkError {
    pub fn may_be_spurious(&self) -> bool;
}

That way you can define the retry function with a NetworkError bound on the E type coming out. I would only implement NetworkError for types like git2::Error and curl::Error rather than a blanket CargoError because that'll help target the errors individually.

@bors

This comment has been minimized.

Copy link
Contributor

bors commented Feb 26, 2016

☔️ The latest upstream changes (presumably #2397) made this pull request unmergeable. Please resolve the merge conflicts.

@sbeckeriv

This comment has been minimized.

Copy link
Contributor Author

sbeckeriv commented Mar 3, 2016

I pushed down the network call in to the checkout command.

I added the NetworkError trait. I am unsure how to apply the bounds in the retry method. I added it as a trait bound to T but the T is the return Ok value for the cargo result not the error. Any advice or examples of how to do this would be great!

Thanks again
Becker

/// Example:
/// use util::network;
/// cargo_result = network.with_retry(&config, || something.download());
pub fn with_retry<T, F>(config: &Config, mut callback: F) -> CargoResult<T>

This comment has been minimized.

@alexcrichton

alexcrichton Mar 3, 2016

Member

I think the signature you want for this method is something along the lines of:

pub fn with_retry<T, E, F>(config: &Config, mut callback: F) -> CargoResult<T>
    where F: FnMut() -> Result<T, E>,
          E: NetworkError + CargoError

This would then match on the result of callback and call maybe_spurious to see whether it is indeed a spurious error or not.

// NetworkError trait

pub trait NetworkError: Error + Send + 'static {
fn maybe_spurious(&self) -> bool { true }

This comment has been minimized.

@alexcrichton

alexcrichton Mar 3, 2016

Member

Yeah this looks along the right lines to me, but you may want to define it as:

pub trait NetworkError: CargoError {
    fn maybe_spurious(&self) -> bool;
}

That way you can elide the extra bounds by just inheriting from CargoError, and it also forces the implementations below to check correctly whether they're related to the network before unconditionally retrying.

This comment has been minimized.

@sbeckeriv

sbeckeriv Mar 4, 2016

Author Contributor

Did not work out of the box. I need some help deciphering what to do. I pushed the failing code but here is my errors: https://gist.github.com/sbeckeriv/bd2173d07e98b0e599d6

Thanks.
Becker

This comment has been minimized.

@sbeckeriv

sbeckeriv Mar 4, 2016

Author Contributor

I think I put this on the wrong comment. I dont think I want to bound the with_retry to the NetworkError trait. I think I want to check in with_retry if the error is NetworkError or not, but all the errors should all be CargoError bound?

let mut result: CargoResult<T> = Err(human("failed to initialize network call"));
for try in 0..net_retry + 1 {
result = callback();
if result.is_err() {

This comment has been minimized.

@alexcrichton

alexcrichton Mar 4, 2016

Member

Shouldn't this call maybe_spurious?

This comment has been minimized.

@sbeckeriv

sbeckeriv Mar 6, 2016

Author Contributor

What do you see maybe_spurious doing? I am thinking a is_retryable makes more sense. Then it would be if result.is_err() && result.is_retryable() which reads like I am thinking the code works. Right now it would return true for the two errors but it could do more for other errors.

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented Mar 4, 2016

@sbeckeriv yeah the point of the bounds was to highlight the fact that Box<CargoError> is opaque and very difficult to see if we need to retry it or not. It basically just means that the retry needs to be pushed farther down in the stack.

@sbeckeriv

This comment has been minimized.

Copy link
Contributor Author

sbeckeriv commented Mar 6, 2016

The bounds make perfect sense now. I pushed the config down to the bottom fetch call. I am still getting "mismatched type" errors on the result I am returning. The generics at this level are beyond me. The error text https://gist.github.com/sbeckeriv/3866bd2ab7241da32b7c I have tried many combinations. I thought the error came from how I initialized the result value but I removed that. The normal human err is does not impl network error. I then thought it was because the shell command internally could cause an error but I removed the try with no luck.

Do i need to rebuild the result or implement a from?

Thanks again,
Becker

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented Mar 7, 2016

No worries! Here's how I might write this function:

pub fn with_retry<T, E, F>(config: &Config, mut callback: F) -> CargoResult<T>
    where F: FnMut() -> Result<T, E>,
          E: NetworkError
{
    let mut remaining = try!(config.net_retry());
    loop {
        match callback() {
            Ok(ret) => return Ok(ret),
            Err(ref e) if e.maybe_spurious() && remaining > 0 => {
                // print to shell
                remaining -= 1;
            }
            Err(e) => return Err(Box::new(e)),
        }
    }
}

Perhaps that can help you out?

@sbeckeriv

This comment has been minimized.

Copy link
Contributor Author

sbeckeriv commented Mar 7, 2016

That does! When I first started I decomposed but I went the other way for reasons I do not remember. I need to update my test but it compiles now. Thanks!

@sbeckeriv

This comment has been minimized.

Copy link
Contributor Author

sbeckeriv commented Mar 10, 2016

I have the retry at the lowest level fetch. I think this covers all the git actions. I am looking in to the interactions with the registry.

@sbeckeriv

This comment has been minimized.

Copy link
Contributor Author

sbeckeriv commented Mar 10, 2016

As I figure out the code I am not sure I want to extern cargo in to crates-io. I am guessing there are separated in to there own project for a reason. Now I would require Config and network to cross back in to creates. I could just copy the concept but there is no config system for cargo-io helpers. Maybe this is figured out in another pull request? Let me know your thoughts.

Thanks again!
Becker

}

impl NetworkError for git2::Error {
fn maybe_spurious(&self) -> bool { true }

This comment has been minimized.

@alexcrichton

alexcrichton Mar 11, 2016

Member

For git errors this may want to at least check the class of the error for Net, and perhaps look at the code as well to ensure it's not a known non-spurious error.

fn maybe_spurious(&self) -> bool { true }
}
impl NetworkError for curl::ErrCode {
fn maybe_spurious(&self) -> bool { true }

This comment has been minimized.

@alexcrichton

alexcrichton Mar 11, 2016

Member

We probably want to inspect the error code here a little more to ensure it actually is spurious

This comment has been minimized.

@sbeckeriv

sbeckeriv Mar 16, 2016

Author Contributor

Im having a heck of a time understand how to use the curl errors. I believe this should work.

curl::ErrCode::COULDNT_CONNECT

But believing it doesn't make it work. Thoughts on how I can use the names? Should I give up and use the numbers instead?

This comment has been minimized.

@alexcrichton

alexcrichton Mar 16, 2016

Member

Ah using the raw error codes from curl is fine, this should basically just be a whitelist of known bad error codes that can be retried, and we can expand the list over time.

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented Mar 11, 2016

Looking good! I think though that this may not catch where we download from the registry (using curl), just git so far. API calls with crates.io also look like they may need instrumentation as well, and you can probably just implement NetworkError for the error type coming out of the crates-io library

@sbeckeriv

This comment has been minimized.

Copy link
Contributor Author

sbeckeriv commented Mar 16, 2016

I updated the error checking.

For git2::ErrorCode its not clear which errors relate to network activity. Maybe EOF for a disconnect? In my test I found that not connecting to a host gave an git2::ErrorClass::Os error. I am checking for both Os and Net now. Maybe my test is invalid?

For curl errors I left a comment to what I believe the number maps to based off of this website https://curl.haxx.se/libcurl/c/libcurl-errors.html

Thanks again for all the help.
Becker

.file("src/lib.rs", "");
.file("src/lib.rs", "").file(".cargo/config", r#"
[net]
retry=0

This comment has been minimized.

@alexcrichton

alexcrichton Mar 17, 2016

Member

In theory these modifications to tests shouldn't be necessary. All of these are indicative of errors that can never be resolved, so they shouldn't print out warnings anyway

This comment has been minimized.

@sbeckeriv

sbeckeriv Mar 17, 2016

Author Contributor

Yes. These are leftover from before the trait. I should be able to remove these.

try!(config.shell()
.say(format!("failed network call. {} \
tries remaining", remaining),
YELLOW));

This comment has been minimized.

@alexcrichton

alexcrichton Mar 17, 2016

Member

Could this use shell().warn(..)? It also may want to avoid using a period as we tend to avoid that kind of punctuation in Cargo.

5 | //COULDNT_RESOLVE_PROXY
6 | //COULDNT_RESOLVE_HOST
28 | //OPERATION_TIMEDOUT
56 // RECV_ERROR

This comment has been minimized.

@alexcrichton

alexcrichton Mar 17, 2016

Member

Can this link to the curl-sys crate and use the symbolic names for these constants?

This comment has been minimized.

@sbeckeriv

sbeckeriv Mar 17, 2016

Author Contributor

curl-sys is a dependency of curl but I cant figure out how to access it. Should I just added it as a dependency?

This comment has been minimized.

@alexcrichton

alexcrichton Mar 17, 2016

Member

Yeah adding a dependency is fine

fn maybe_spurious(&self) -> bool {
match self.class() {
git2::ErrorClass::Net |
git2::ErrorClass::Os => true,

This comment has been minimized.

@alexcrichton

alexcrichton Mar 17, 2016

Member

How come Os is included here? In theory that's not spurious?

This comment has been minimized.

@sbeckeriv

sbeckeriv Mar 17, 2016

Author Contributor

This is the class I got for my test when it timed out. When I tell the test to connect to a server that is not running https://github.com/rust-lang/cargo/pull/2396/files#diff-5a6267705b81a22950ae48ebdce1a553R8

I thought it would have gave a Net error but Os is what it printed.

@@ -1,3 +1,4 @@
extern crate curl_sys;

This comment has been minimized.

@alexcrichton

alexcrichton May 11, 2016

Member

Could this be added to the crate root?

This comment has been minimized.

@sbeckeriv

sbeckeriv May 11, 2016

Author Contributor

I moved it to src/cargo/lib.rs I think thats the "root". Let me know if that is wrong.

This comment has been minimized.

@alexcrichton

alexcrichton May 11, 2016

Member

Ah yeah that's it!

curl_sys::CURLcode::CURLE_COULDNT_CONNECT |
curl_sys::CURLcode::CURLE_COULDNT_RESOLVE_PROXY |
curl_sys::CURLcode::CURLE_COULDNT_RESOLVE_HOST |
curl_sys::CURLcode::CURLE_OPERATION_TIMEDOUT |

This comment has been minimized.

@alexcrichton

alexcrichton May 11, 2016

Member

This seems like an operation we shouldn't retry by default?

This comment has been minimized.

@sbeckeriv

sbeckeriv May 11, 2016

Author Contributor

It looks like curl retries (at least once) on CURLE_OPERATION_TIMEDOUT https://github.com/curl/curl/blob/b9728bca549709a26a5228f1d44f7488dd26811d/src/tool_operate.c#L1449

I do not have any data around which errors people are seeing. It looks like, after a quick search in the curl code, that CURLE_OPERATION_TIMEDOUT covers a lot of different case. After this is merge the list can be changed easy. In this case I believe catching more is better only because if this error code turns out to be bad the work around would be to configure the setting. If you release cargo without this and it turns out that it is the core failure case for curl everyone who is affected need to wait for a new release cycle.

Let me know if you would still like to remove it.

This comment has been minimized.

@alexcrichton

alexcrichton May 11, 2016

Member

Ok, let's leave it in and we can always back it out if weird things happen

Err(ref e) if e.maybe_spurious() && remaining > 0 => {
try!(config.shell()
.warn(format!("failed network call {} \
tries remaining", remaining)));

This comment has been minimized.

@alexcrichton

alexcrichton May 11, 2016

Member

Perhaps this could also print out the error (just for reference)

This comment has been minimized.

@sbeckeriv

sbeckeriv May 11, 2016

Author Contributor

Yes! I agree. I completely removed the "Failed network call" part. It now looks like
" [2/-1] Failed to connect to 127.0.0.1: Connection refused 1 tries remaining"

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented May 11, 2016

Looking good @sbeckeriv, thanks!

Err(ref e) if e.maybe_spurious() && remaining > 0 => {
try!(config.shell()
.warn(format!("{} {} tries remaining", e,
remaining)));

This comment has been minimized.

@alexcrichton

alexcrichton May 12, 2016

Member

Perhaps this could be along the lines of:

let msg = format!("spurious network error ({} tries remaining): {}", remaining, e);
try!(config.shell().warn(msg));
@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented May 12, 2016

Awesome! Just a minor tweak to some wording, and then perhaps some squashing? Other than that r=me

@sbeckeriv sbeckeriv force-pushed the sbeckeriv:network-retry-1602 branch from 4fc800d to b3de4c2 May 12, 2016

@sbeckeriv

This comment has been minimized.

Copy link
Contributor Author

sbeckeriv commented May 12, 2016

@alexcrichton I made it one commit. Thanks for working with me and all the help! I learned a lot.

Becker

Ok(ret) => return Ok(ret),
Err(ref e) if e.maybe_spurious() && remaining > 0 => {
let msg = format!("spurious network error ({} tries \
remaining): {}", remaining, e);

This comment has been minimized.

@alexcrichton

alexcrichton May 12, 2016

Member

I think the remaining count here is actually remaining - 1

This comment has been minimized.

@sbeckeriv

sbeckeriv May 12, 2016

Author Contributor

The retry config is 2 by default.

count is 2 
I run once and fail i output 
  2 tries remains 
    subtract 1 count is 1
run a second time 
  1 tries remaining 
    subtract 1 count is 0
retry again
just error. 

The retry count is the count of retries not the count of total tries. I can subtract one from the config or update the docs to be explicitly clear it is the count of retries and not the total network tries?

This comment has been minimized.

@alexcrichton

alexcrichton May 12, 2016

Member

Ah right yeah that makes sense, nevermind!

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented May 12, 2016

Looks like the Windows CI is failing for non-spurious reasons?

@sbeckeriv sbeckeriv force-pushed the sbeckeriv:network-retry-1602 branch from b3de4c2 to 2f04d00 May 12, 2016

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented May 12, 2016

bors added a commit that referenced this pull request May 12, 2016

Auto merge of #2396 - sbeckeriv:network-retry-1602, r=alexcrichton
Network retry issue 1602

Dearest Reviewer,

This branch resolves #1602 which relates to retrying network
issues automatically. There is a new utility helper for retrying
any network call.

There is a new config called net.retry value in the .cargo.config
file. The default value is 2. The documentation has also been
updated to reflect the new value.

Thanks
Becker
@bors

This comment has been minimized.

Copy link
Contributor

bors commented May 12, 2016

⌛️ Testing commit 2f04d00 with merge e2648ce...

@bors

This comment has been minimized.

Copy link
Contributor

bors commented May 12, 2016

💔 Test failed - cargo-win-msvc-64

@sbeckeriv

This comment has been minimized.

Copy link
Contributor Author

sbeckeriv commented May 12, 2016

Sorry I dont have access to a windows box right now. I might tonight. Does windows not load the .cargo/config file or does it output extra information?

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented May 12, 2016

It should load that, yeah, but I suspect the problem here is that the exact error message is slightly different. It should be possible to just use with_stderr_contains perhaps?

Network retry issue 1602
Dearest Reviewer,

This branch resolves #1602 which relates to retrying network
issues automatically. There is a new utility helper for retrying
any network call.

There is a new config called net.retry value in the .cargo.config
file. The default value is 2. The documentation has also been
updated to reflect the new value.

Thanks
Becker

@sbeckeriv sbeckeriv force-pushed the sbeckeriv:network-retry-1602 branch from 2f04d00 to 7b03532 May 12, 2016

@sbeckeriv

This comment has been minimized.

Copy link
Contributor Author

sbeckeriv commented May 12, 2016

Works locally. Lets see!

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented May 12, 2016

bors added a commit that referenced this pull request May 12, 2016

Auto merge of #2396 - sbeckeriv:network-retry-1602, r=alexcrichton
Network retry issue 1602

Dearest Reviewer,

This branch resolves #1602 which relates to retrying network
issues automatically. There is a new utility helper for retrying
any network call.

There is a new config called net.retry value in the .cargo.config
file. The default value is 2. The documentation has also been
updated to reflect the new value.

Thanks
Becker
@bors

This comment has been minimized.

Copy link
Contributor

bors commented May 12, 2016

⌛️ Testing commit 7b03532 with merge a4a9491...

@bors

This comment has been minimized.

@bors bors merged commit 7b03532 into rust-lang:master May 12, 2016

2 of 3 checks passed

continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
homu Test successful
Details

@sbeckeriv sbeckeriv deleted the sbeckeriv:network-retry-1602 branch May 12, 2016

@metajack metajack referenced this pull request May 13, 2016

Merged

Update Cargo to 2016-05-12 #11177

3 of 5 tasks complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.