
Trust both DOD and commercial certs when connecting to DMDC #1948

Merged
merged 6 commits into master from iws-commercial-cert on Apr 4, 2019

Conversation

@jamesatheyDDS (Contributor)

DMDC's Identity Web Services recently changed from presenting a certificate signed by DISA to a certificate signed by Entrust. As a result, SSN<->EDIPI conversions required by both the Orders Gateway and by the DPS-to-login.gov bridge stopped working. The error Get https://pkict.dmdc.osd.mil/appj/rbs/rest/{more URL components}: x509: certificate signed by unknown authority appears in the logs.

So, instead of trusting only DOD-signed certs when connecting to DMDC, or trusting only commercial certs, trust both. That way, if DMDC changes its mind about using a commercial cert, especially when the current cert expires in March of 2020, our code will continue to work.
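
For illustration, here is a minimal sketch of that approach, assuming the DOD CA package is available as a PEM bundle and using hypothetical names (the real loader in the codebase handles the actual package format, which may be PKCS#7 rather than PEM):

```go
package iws

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"io/ioutil"
	"net/http"
)

// newDMDCClient is a sketch only: it starts from the system trust store (so
// commercially signed certs such as Entrust's are trusted) and appends the
// DOD CAs on top (so DISA-signed certs keep working if DMDC switches back).
func newDMDCClient(dodCACertPackagePath string, clientCert tls.Certificate) (*http.Client, error) {
	pool, err := x509.SystemCertPool()
	if err != nil {
		return nil, fmt.Errorf("loading system cert pool: %v", err)
	}

	dodCerts, err := ioutil.ReadFile(dodCACertPackagePath)
	if err != nil {
		return nil, fmt.Errorf("reading DOD CA package: %v", err)
	}
	if ok := pool.AppendCertsFromPEM(dodCerts); !ok {
		return nil, fmt.Errorf("no certificates parsed from DOD CA package")
	}

	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				RootCAs:      pool,                          // system + DOD roots
				Certificates: []tls.Certificate{clientCert}, // our DOD-signed client cert
			},
		},
	}, nil
}
```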

@codecov (bot) commented Apr 2, 2019

Codecov Report

Merging #1948 into master will decrease coverage by 0.02%.
The diff coverage is 0%.

@@            Coverage Diff             @@
##           master    #1948      +/-   ##
==========================================
- Coverage    60.6%   60.58%   -0.02%     
==========================================
  Files         192      192              
  Lines       12259    12264       +5     
==========================================
  Hits         7429     7429              
- Misses       3952     3957       +5     
  Partials      878      878

@kahlouie left a comment (Contributor)

@jamesatheyDDS or @chrisgilmerproj can you add tests to make sure we don't have any regressions moving forward and to illuminate what's actually happening here?

@aileendds (Contributor)

@chrisgilmerproj do you know what CAs are in our AWS environments? @jamesatheyDDS testing in experimental might be helpful

@chrisgilmerproj (Contributor)

@jamesatheyDDS or @chrisgilmerproj can you add tests to make sure we don't have any regressions moving forward and to illuminate what's actually happening here?

Agreed, I'd also like to see some tests here.

@chrisgilmerproj (Contributor)

@chrisgilmerproj do you know what CAs are in our AWS environments? @jamesatheyDDS testing in experimental might be helpful

@aileendds - I don't know what CAs are in AWS. I'd have to introspect the secure migrations to figure it out.

@jamesatheyDDS (Contributor, Author)

@chrisgilmerproj @kahlouie What would a test look like for this integration with DMDC? They only allow specific client certificates to connect to their API - in our case, our staging and experimental DOD-signed certs can connect to their Contractor Test environment, while our production DOD-signed cert can connect to their production environment.

Do we have other unit tests that require network connectivity? Is it fair to block someone from committing if DMDC's servers are down?

@jamesatheyDDS (Contributor, Author)

This seems more like a monitoring issue than a unit test issue. What kind of health alert or monitoring thing can we set up to check the connection to DMDC in our environments?

@chrisgilmerproj (Contributor)

@chrisgilmerproj @kahlouie What would a test look like for this integration with DMDC? They only allow specific client certificates to connect to their API - in our case, our staging and experimental DOD-signed certs can connect to their Contractor Test environment, while our production DOD-signed cert can connect to their production environment.

I would say we should have some kind of dummy DoD cert package to test this code with. Specifically something we can pipe into dodCACertPackage to make sure parsing is happening and we're getting the certs added to the x509 cert pool.
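
A rough sketch of what such a test could look like, assuming a PEM fixture checked in under testdata/ (the fixture path and test name here are illustrative, not the actual ones in the repo):

```go
package iws

import (
	"crypto/x509"
	"io/ioutil"
	"testing"
)

// TestDODCACertPackageParses is a sketch: feed a known-good CA fixture through
// the same parsing path and confirm certs actually land in the x509 cert pool.
func TestDODCACertPackageParses(t *testing.T) {
	pemBytes, err := ioutil.ReadFile("testdata/dod-ca-fixture.pem") // hypothetical fixture
	if err != nil {
		t.Fatalf("reading fixture: %v", err)
	}

	pool := x509.NewCertPool()
	if ok := pool.AppendCertsFromPEM(pemBytes); !ok {
		t.Fatal("expected at least one certificate to parse from the fixture")
	}
	if len(pool.Subjects()) == 0 {
		t.Fatal("expected the cert pool to contain subjects after parsing")
	}
}
```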

Do we have other unit tests that require network connectivity? Is it fair to block someone from committing if DMDC's servers are down?

We don't have unit tests against network connectivity. That being said, we should gracefully handle this kind of situation in the code. So if RBS is down we shouldn't be throwing 5XX errors in the app. I'd prefer we throw a 421 Misdirected Request.

The 421 (Misdirected Request) status code indicates that the request was directed at a server that is not able to produce a response. This can be sent by a server that is not configured to produce responses for the combination of scheme and authority that are included in the request URI.

And something we should address in another PR: in the case that we can't get the EDIPI what should be the fallback behavior of our app?

@chrisgilmerproj (Contributor)

This seems more like a monitoring issue than a unit test issue. What kind of health alert or monitoring thing can we set up to check the connection to DMDC in our environments?

We should put this on the Infra backlog. We could probably create a Route53 health check. We also have an outstanding issue to create a custom health check Lambda for things that Route53 can't check for us.

@aileendds (Contributor)

@chrisgilmerproj - The documentation for 421 does not match this situation, so I don't think it's an appropriate response. That is meant to refer to the original request, not the secondary request to the 3rd party. This situation is closer to a proxy, and the documentation explicitly states that it should not be returned by a proxy.

The fact that we're calling RBS is an internal implementation detail, which the caller doesn't necessarily need to know about so I think 5xx makes sense. If RBS provides more detailed error codes, we can return a 503 instead of a 500 if it's a temporary issue.

@kahlouie (Contributor) commented Apr 4, 2019

@aileendds I agree. We should know when our dependencies are down, so we should give 500 level errors, but we should add more info so we can change the alert to go off separately from the 500 errors that are caused in our application.

@jamesatheyDDS (Contributor, Author)

@chrisgilmerproj It is reasonable to have a unit test that the DOD CA cert package file is being parsed and loaded correctly. At some point in the future, new DOD CAs will be created and the package will need to be updated.
The DOD CA package is not touched by this PR, so is adding a unit test for that a blocker? I need this PR accepted to demo the Orders Gateway to RADM Clarke...

@chrisgilmerproj (Contributor)

@chrisgilmerproj It is reasonable to have a unit test that the DOD CA cert package file is being parsed and loaded correctly. At some point in the future, new DOD CAs will be created and the package will need to be updated.

Yeah, that's good enough I think.

The DOD CA package is not touched by this PR, so is adding a unit test for that a blocker?

I'm just saying we need a unit test for the change to the method you're updating. If there's an existing test that covers this then that's great. If not then we've been implementing new test-coverage rules that mean we need to add a test here before merge.

I need this PR accepted to demo the Orders Gateway to RADM Clarke...

I'd be willing to provisionally accept this with a promise we get a test in a follow up PR.

@chrisgilmerproj (Contributor)

@chrisgilmerproj - The documentation for 421 does not match this situation, so I don't think it's an appropriate response. That is meant to refer to the original request, not the secondary request to the 3rd party. This situation is closer to a proxy, and the documentation explicitly states that it should not be returned by a proxy.

The fact that we're calling RBS is an internal implementation detail, which the caller doesn't necessarily need to know about so I think 5xx makes sense. If RBS provides more detailed error codes, we can return a 503 instead of a 500 if it's a temporary issue.

My experience is that 5XX means there was an error we were unable to handle. We ought to handle this type of situation and return a better code so the client can do something about it. Also, 5XX errors will end up paging the application on-call. If the application on-call cannot do anything about this then they ought not be paged, which I think is the case here. Again, returning a 4XX so a client can handle the problem gracefully would be better in my opinion.

@aileendds I agree. We should know when our dependencies are down, so we should give 500 level errors, but we should add more info so we can change the alert to go off separately from the 500 errors that are caused in our application.

We can't differentiate 5XX errors at the ALB -> PD level. So any 5XX will get bundled up to page the on-call. Again, if Truss has no power to handle this issue then we need to communicate to the requesting client what needs to happen next.

@chrisgilmerproj left a comment (Contributor)

I'm approving, provided that:

  1. we handle the errors this can produce if RBS is down (I'm not picky about how, but I want something that's not a 5XX), and

  2. we add a test that verifies the behavior here is what we want, in a file like rbs_person_lookup_test.go.

https://defensedigitalservice.slack.com/archives/G8P7UD05U/p1554398325153400?thread_ts=1554397836.149400&cid=G8P7UD05U

@aileendds (Contributor) commented Apr 4, 2019

5xx means the server failed to process a valid request. 4xx means the client provided an invalid request. In this specific situation, it was an issue on our end that the on-call should have been notified of and fixed (i.e. with this PR). There is nothing the client could have done to their request to make the server return a valid response.

If in the future RBS is temporarily down and we can identify that it's temporary, then we can handle it differently. @jamesatheyDDS do you know if RBS returns granular error codes? I still think this should be a 503 and not a 4xx, though.
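
As a rough illustration of that suggestion (the names below are hypothetical, not the actual handler code), a handler could map an unreachable RBS to a 503 while keeping other failures as 500s:

```go
package handlers

import (
	"fmt"
	"net"
	"net/http"
)

// personLookup stands in for the iws client interface; illustrative only.
type personLookup interface {
	GetEDIPIBySSN(ssn string) (string, error)
}

// edipiHandler is a sketch: a network-level failure talking to RBS becomes a
// 503 Service Unavailable (temporary, dependency-side), while anything else
// stays a 500 so we still get alerted on genuine application errors.
func edipiHandler(rbs personLookup) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		edipi, err := rbs.GetEDIPIBySSN(r.URL.Query().Get("ssn"))
		if err != nil {
			if _, ok := err.(net.Error); ok {
				http.Error(w, "DMDC identity service unavailable", http.StatusServiceUnavailable)
				return
			}
			http.Error(w, "identity lookup failed", http.StatusInternalServerError)
			return
		}
		fmt.Fprint(w, edipi)
	}
}
```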

@kahlouie left a comment (Contributor)

Given the follow-up plan, I'm good with this.

@jamesatheyDDS (Contributor, Author)

@aileendds @chrisgilmerproj Testing in experimental was successful; at least for DMDC's current cert, experimental had the right CA in its system bundle.

@jamesatheyDDS (Contributor, Author)

@aileendds Yes, RBS does have some granular error codes, but they are not HTTP status codes. Instead, the body of the response is an XML document rooted with <RbsError>. The iws package you and I wrote does parse this, but doesn't do anything different for different RBS error codes.

If the documentation is to be believed, here are the possible codes:
[Screenshot: table of possible RBS error codes from the DMDC documentation]
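
For reference, decoding that body with encoding/xml might look roughly like the following; the child element and field names are guesses based on the description above, not the actual iws types:

```go
package iws

import (
	"encoding/xml"
	"fmt"
)

// RbsError mirrors the XML error document RBS returns in a response body.
// The child element names here are illustrative and should be checked
// against the DMDC documentation.
type RbsError struct {
	XMLName xml.Name `xml:"RbsError"`
	Code    string   `xml:"code"`
	Message string   `xml:"message"`
}

func parseRbsError(body []byte) (*RbsError, error) {
	var rbsErr RbsError
	if err := xml.Unmarshal(body, &rbsErr); err != nil {
		return nil, fmt.Errorf("decoding RBS error body: %v", err)
	}
	return &rbsErr, nil
}
```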

@jamesatheyDDS merged commit 375863f into master on Apr 4, 2019
@jamesatheyDDS deleted the iws-commercial-cert branch on April 4, 2019 at 18:54
@chrisgilmerproj (Contributor)

5xx means the server failed to process a valid request. 4xx means the client provided an invalid request. In this specific situation, it was an issue on our end that the on-call should have been notified of and fixed (i.e. with this PR). There is nothing the client could have done to their request to make the server return a valid response.

If in the future RBS is temporarily down and we can identify that it's temporary, then we can handle it differently. @jamesatheyDDS do you know if RBS returns granular error codes? I still think this should be a 503 and not a 4xx, though.

I'd like to change our code so that we can process the request by returning some kind of valid information to the client so the client can smartly handle this. I'm sure we can all hop on a zoom sometime and have a technical discussion about what to do. But I'd prefer that we handle the RBS return codes and also gracefully manage it if RBS is down or not taking requests from us.

Can either @jamesatheyDDS or @aileendds follow up on this? I'd love to get some resolution. And in the meantime I have added a task to my backlog to figure out how we monitor these services.
