Retry ACME challenge more than once #162

Closed · jkralik wants to merge 1 commit into master
Conversation

@jkralik (Contributor) commented Jan 23, 2020

Description

http.Client doesn't support retries during connect. For k8s this
can cause issues like:

  • dial tcp: i/o timeout
  • dial tcp 10.106.221.133:80: connect: connection refused
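
For illustration only, a minimal sketch of the kind of retry wrapper this change adds around the validation request; the package, function name, attempt count, and backoff below are placeholders, not the actual diff:

```go
package acmeretry

import (
	"fmt"
	"net/http"
	"time"
)

// getWithRetry retries the challenge request a few times so a transient
// connect error (e.g. "dial tcp: i/o timeout") does not immediately fail
// validation. The attempt count and backoff are illustrative values.
func getWithRetry(client *http.Client, url string, attempts int, backoff time.Duration) (*http.Response, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := client.Get(url)
		if err == nil {
			return resp, nil
		}
		lastErr = err
		time.Sleep(backoff)
	}
	return nil, fmt.Errorf("all %d attempts failed: %w", attempts, lastErr)
}
```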

Fixes

#149

💔 Thank you!

@jkralik jkralik requested review from dopey and maraino and removed request for dopey January 23, 2020 17:30
@codecov-io commented

Codecov Report

Merging #162 into master will increase coverage by 0.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #162      +/-   ##
==========================================
+ Coverage   78.98%   78.99%   +0.01%     
==========================================
  Files          58       58              
  Lines        6224     6228       +4     
==========================================
+ Hits         4916     4920       +4     
  Misses       1063     1063              
  Partials      245      245
Impacted Files       Coverage       Δ
acme/authority.go    96.45% <100%>  (+0.1%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 967e86a...d74f210.

@jkralik jkralik mentioned this pull request Jan 24, 2020
@maraino maraino removed their request for review January 25, 2020 02:29
@maraino (Contributor) commented Jan 25, 2020

@dopey: can you take a look at this?

@dopey (Contributor) commented Jan 25, 2020

Yep. I'd like to get to the bottom of the database issue first since it seems these are related. According to @jkralik the size of the database has stopped endlessly expanding since they made this change, so I definitely want to pull it in.

My only qualm is that we'd be pulling in an external dependency. I haven't had any time to check if that's definitely necessary, or if this behavior is something we can replicate using the standard golang clients.
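
For reference, a rough sketch of how the same behavior could be replicated with only the standard library, via a retrying http.RoundTripper; the type, field names, and values are hypothetical and not part of this PR:

```go
package acmeretry

import (
	"net/http"
	"time"
)

// retryTransport wraps an http.RoundTripper and retries when the transport
// itself errors (connect failures, timeouts). This is only safe for requests
// without a body; the http-01 validation lookup is a plain GET.
type retryTransport struct {
	base     http.RoundTripper
	attempts int
	backoff  time.Duration
}

func (t *retryTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	var (
		resp *http.Response
		err  error
	)
	for i := 0; i < t.attempts; i++ {
		resp, err = t.base.RoundTrip(req)
		if err == nil {
			return resp, nil
		}
		time.Sleep(t.backoff)
	}
	return nil, err
}

// newRetryClient builds an http.Client that retries connect-level failures.
func newRetryClient() *http.Client {
	return &http.Client{
		Transport: &retryTransport{base: http.DefaultTransport, attempts: 3, backoff: time.Second},
		Timeout:   30 * time.Second,
	}
}
```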

@jkralik (Contributor, Author) commented Jan 26, 2020

This fix doesn't address the endlessly expanding DB; it just retries the challenge more times. It fixes #149. I think that issue just triggers the bug with the endlessly expanding DB.

@dcow (Contributor) commented Jan 29, 2020

@jkralik I reviewed the ACME spec. Section 8.2 discusses some mandatory requirements and considerations for implementing challenge validation retries in both the server and the client. Most notably:

The server MUST provide information about its retry state to the client via the "error" field in the challenge and the Retry-After HTTP header field in response to requests to the challenge resource. The server MUST add an entry to the "error" field in the challenge after each failed validation query. The server SHOULD set the Retry-After header field to a time after the server's next validation query, since the status of the challenge will not change until that time.

Clients can explicitly request a retry by re-sending their response to a challenge in a new POST request (with a new nonce, etc.). This allows clients to request a retry when the state has changed (e.g., after firewall rules have been updated). Servers SHOULD retry a request immediately on receiving such a POST request. In order to avoid denial-of-service attacks via client-initiated retries, servers SHOULD rate-limit such requests.

The spec mandates that we tie the retry state into our challenge resource so that clients can discern what is happening during the challenge/validation process. This also helps us keep track of retries in the code in any event (as opposed to "fire and forget").
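
As a rough illustration (not the step-ca code), surfacing that retry state on the challenge resource could look something like the sketch below; the types and helper are hypothetical:

```go
package acmeretry

import (
	"encoding/json"
	"net/http"
	"strconv"
	"time"
)

// problemJSON and challenge are hypothetical stand-ins for the real ACME types.
type problemJSON struct {
	Type   string `json:"type"`
	Detail string `json:"detail"`
}

type challenge struct {
	Status string       `json:"status"`          // "pending", "processing", "valid", "invalid"
	Error  *problemJSON `json:"error,omitempty"` // updated after each failed validation query
}

// writeChallenge renders the challenge resource with the retry state the spec
// asks for: the accumulated error plus a Retry-After hint pointing at the
// server's next planned validation attempt.
func writeChallenge(w http.ResponseWriter, ch *challenge, nextValidation time.Time) {
	secs := int(time.Until(nextValidation).Seconds())
	if secs < 1 {
		secs = 1
	}
	w.Header().Set("Retry-After", strconv.Itoa(secs))
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(ch)
}
```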

Additionally, while some amount of server retry is allowable to handle exactly your type of scenario (propagation delay for newly provisioned infrastructure), based on my interpretation of language in the section, the responsibility really falls on the client to continue to bug the server until the challenge can be validated. In other words, make sure your client only requests the challenge once it knows the infrastructure is ready, and make sure your client re-requests the challenge until it is happy with the outcome.
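
A hedged sketch of that client-side loop, assuming a hypothetical post helper that re-signs the request with a fresh nonce on each call:

```go
package acmeretry

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// waitForChallenge keeps re-requesting the challenge until its status settles,
// honoring Retry-After between polls.
func waitForChallenge(post func() (*http.Response, error), maxPolls int) error {
	for i := 0; i < maxPolls; i++ {
		resp, err := post()
		if err != nil {
			return err
		}
		var ch struct {
			Status string `json:"status"`
		}
		decodeErr := json.NewDecoder(resp.Body).Decode(&ch)
		resp.Body.Close()
		if decodeErr != nil {
			return decodeErr
		}
		switch ch.Status {
		case "valid":
			return nil
		case "invalid":
			return errors.New("challenge failed validation")
		}
		delay := 5 * time.Second // fallback when no Retry-After is present
		if ra := resp.Header.Get("Retry-After"); ra != "" {
			if secs, err := strconv.Atoi(ra); err == nil && secs > 0 {
				delay = time.Duration(secs) * time.Second
			}
		}
		time.Sleep(delay)
	}
	return fmt.Errorf("challenge still pending after %d polls", maxPolls)
}
```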

Thank you for the patch! Since you've likely got this up and running and it works for you, feel free to keep on doing what you're doing. In terms of pulling this into the official CA, we'd like to hold off until we can fully address the requirements laid out in section 8.2 of the spec. If you're interested in revising your patch we're more than happy to work with you to get things in shape. Just say so and we can reopen or you can propose a new patch.
