Retry ACME challenge more than once #162

Closed · jkralik wants to merge 1 commit into master
Conversation

@jkralik (Contributor) commented Jan 23, 2020

Description

http.Client doesn't support retries during connect. For k8s this
can cause issues like:

  • dial tcp: i/o timeout
  • dial tcp 10.106.221.133:80: connect: connection refused
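
For illustration only, a minimal sketch of the kind of retry wrapper this change adds around the validation request; the package, function name, attempt count, and backoff below are placeholders, not the actual diff:

```go
package acmeretry

import (
	"fmt"
	"net/http"
	"time"
)

// getWithRetry retries the challenge request a few times so a transient
// connect error (e.g. "dial tcp: i/o timeout") does not immediately fail
// validation. The attempt count and backoff are illustrative values.
func getWithRetry(client *http.Client, url string, attempts int, backoff time.Duration) (*http.Response, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := client.Get(url)
		if err == nil {
			return resp, nil
		}
		lastErr = err
		time.Sleep(backoff)
	}
	return nil, fmt.Errorf("all %d attempts failed: %w", attempts, lastErr)
}
```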

Fixes

#149

💔 Thank you!

@jkralik jkralik requested review from dopey and maraino and removed request for dopey January 23, 2020 17:30
@codecov-io commented

Codecov Report

Merging #162 into master will increase coverage by 0.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #162      +/-   ##
==========================================
+ Coverage   78.98%   78.99%   +0.01%     
==========================================
  Files          58       58              
  Lines        6224     6228       +4     
==========================================
+ Hits         4916     4920       +4     
  Misses       1063     1063              
  Partials      245      245
Impacted Files       Coverage       Δ
acme/authority.go    96.45% <100%>  (+0.1%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 967e86a...d74f210.

@jkralik jkralik mentioned this pull request Jan 24, 2020
@maraino maraino removed their request for review January 25, 2020 02:29
@maraino (Contributor) commented Jan 25, 2020

@dopey: can you take a look at this?

@dopey (Contributor) commented Jan 25, 2020

Yep. I'd like to get to the bottom of the database issue first since it seems these are related. According to @jkralik the size of the database has stopped endlessly expanding since they made this change, so I definitely want to pull it in.

My only qualm is that we'd be pulling in an external dependency. I haven't had any time to check if that's definitely necessary, or if this behavior is something we can replicate using the standard golang clients.
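
For reference, a rough sketch of how the same behavior could be replicated with only the standard library, via a retrying http.RoundTripper; the type, field names, and values are hypothetical and not part of this PR:

```go
package acmeretry

import (
	"net/http"
	"time"
)

// retryTransport wraps an http.RoundTripper and retries when the transport
// itself errors (connect failures, timeouts). This is only safe for requests
// without a body; the http-01 validation lookup is a plain GET.
type retryTransport struct {
	base     http.RoundTripper
	attempts int
	backoff  time.Duration
}

func (t *retryTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	var (
		resp *http.Response
		err  error
	)
	for i := 0; i < t.attempts; i++ {
		resp, err = t.base.RoundTrip(req)
		if err == nil {
			return resp, nil
		}
		time.Sleep(t.backoff)
	}
	return nil, err
}

// newRetryClient builds an http.Client that retries connect-level failures.
func newRetryClient() *http.Client {
	return &http.Client{
		Transport: &retryTransport{base: http.DefaultTransport, attempts: 3, backoff: time.Second},
		Timeout:   30 * time.Second,
	}
}
```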

@jkralik (Contributor, Author) commented Jan 26, 2020

This fix doesn't address the endlessly expanding DB; it just retries the challenge more times. It fixes #149. I think that issue just triggers the bug with the endlessly expanding DB.

@dcow (Contributor) commented Jan 29, 2020

@jkralik I reviewed the ACME spec. Section 8.2 discusses some mandatory requirements and considerations for implementing challenge validation retries in both the server and the client. Most notably:

The server MUST provide information about its retry state to the client via the "error" field in the challenge and the Retry-After HTTP header field in response to requests to the challenge resource. The server MUST add an entry to the "error" field in the challenge after each failed validation query. The server SHOULD set the Retry-After header field to a time after the server's next validation query, since the status of the challenge will not change until that time.

Clients can explicitly request a retry by re-sending their response to a challenge in a new POST request (with a new nonce, etc.). This allows clients to request a retry when the state has changed (e.g., after firewall rules have been updated). Servers SHOULD retry a request immediately on receiving such a POST request. In order to avoid denial-of-service attacks via client-initiated retries, servers SHOULD rate-limit such requests.

The spec mandates that we tie the retry state into our challenge resource so that clients can discern what is happening during the challenge/validation process. This also helps us keep track of retries in the code in any event (as opposed to "fire and forget").
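
As a rough illustration (not the step-ca code), surfacing that retry state on the challenge resource could look something like the sketch below; the types and helper are hypothetical:

```go
package acmeretry

import (
	"encoding/json"
	"net/http"
	"strconv"
	"time"
)

// problemJSON and challenge are hypothetical stand-ins for the real ACME types.
type problemJSON struct {
	Type   string `json:"type"`
	Detail string `json:"detail"`
}

type challenge struct {
	Status string       `json:"status"`          // "pending", "processing", "valid", "invalid"
	Error  *problemJSON `json:"error,omitempty"` // updated after each failed validation query
}

// writeChallenge renders the challenge resource with the retry state the spec
// asks for: the accumulated error plus a Retry-After hint pointing at the
// server's next planned validation attempt.
func writeChallenge(w http.ResponseWriter, ch *challenge, nextValidation time.Time) {
	secs := int(time.Until(nextValidation).Seconds())
	if secs < 1 {
		secs = 1
	}
	w.Header().Set("Retry-After", strconv.Itoa(secs))
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(ch)
}
```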

Additionally, while some amount of server retry is allowable to handle exactly your type of scenario (propagation delay for newly provisioned infrastructure), based on my interpretation of language in the section, the responsibility really falls on the client to continue to bug the server until the challenge can be validated. In other words, make sure your client only requests the challenge once it knows the infrastructure is ready, and make sure your client re-requests the challenge until it is happy with the outcome.
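
A hedged sketch of that client-side loop, assuming a hypothetical post helper that re-signs the request with a fresh nonce on each call:

```go
package acmeretry

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// waitForChallenge keeps re-requesting the challenge until its status settles,
// honoring Retry-After between polls.
func waitForChallenge(post func() (*http.Response, error), maxPolls int) error {
	for i := 0; i < maxPolls; i++ {
		resp, err := post()
		if err != nil {
			return err
		}
		var ch struct {
			Status string `json:"status"`
		}
		decodeErr := json.NewDecoder(resp.Body).Decode(&ch)
		resp.Body.Close()
		if decodeErr != nil {
			return decodeErr
		}
		switch ch.Status {
		case "valid":
			return nil
		case "invalid":
			return errors.New("challenge failed validation")
		}
		delay := 5 * time.Second // fallback when no Retry-After is present
		if ra := resp.Header.Get("Retry-After"); ra != "" {
			if secs, err := strconv.Atoi(ra); err == nil && secs > 0 {
				delay = time.Duration(secs) * time.Second
			}
		}
		time.Sleep(delay)
	}
	return fmt.Errorf("challenge still pending after %d polls", maxPolls)
}
```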

Thank you for the patch! Since you've likely got this up and running and it works for you, feel free to keep on doing what you're doing. In terms of pulling this into the official CA, we'd like to hold off until we can fully address the requirements laid out in section 8.2 of the spec. If you're interested in revising your patch we're more than happy to work with you to get things in shape. Just say so and we can reopen or you can propose a new patch.
