
Trust both DOD and commercial certs when connecting to DMDC #1948

Merged
merged 6 commits into master from iws-commercial-cert on Apr 4, 2019

Conversation

@jamesatheyDDS (Contributor)

DMDC's Identity Web Services recently changed from presenting a certificate signed by DISA to a certificate signed by Entrust. As a result, SSN<->EDIPI conversions required by both the Orders Gateway and by the DPS-to-login.gov bridge stopped working. The error Get https://pkict.dmdc.osd.mil/appj/rbs/rest/{more URL components}: x509: certificate signed by unknown authority appears in the logs.

So, instead of trusting only DOD-signed certs when connecting to DMDC, or trusting only commercial certs, trust both. That way, if DMDC changes its mind about using a commercial cert, especially when the current cert expires in March of 2020, our code will continue to work.
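
For illustration, here is a minimal sketch of that approach, assuming the DOD CA package is available as a PEM bundle and using hypothetical names (the real loader in the codebase handles the actual package format, which may be PKCS#7 rather than PEM):

```go
package iws

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"io/ioutil"
	"net/http"
)

// newDMDCClient is a sketch only: it starts from the system trust store (so
// commercially signed certs such as Entrust's are trusted) and appends the
// DOD CAs on top (so DISA-signed certs keep working if DMDC switches back).
func newDMDCClient(dodCACertPackagePath string, clientCert tls.Certificate) (*http.Client, error) {
	pool, err := x509.SystemCertPool()
	if err != nil {
		return nil, fmt.Errorf("loading system cert pool: %v", err)
	}

	dodCerts, err := ioutil.ReadFile(dodCACertPackagePath)
	if err != nil {
		return nil, fmt.Errorf("reading DOD CA package: %v", err)
	}
	if ok := pool.AppendCertsFromPEM(dodCerts); !ok {
		return nil, fmt.Errorf("no certificates parsed from DOD CA package")
	}

	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				RootCAs:      pool,                          // system + DOD roots
				Certificates: []tls.Certificate{clientCert}, // our DOD-signed client cert
			},
		},
	}, nil
}
```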

@codecov (bot) commented Apr 2, 2019

Codecov Report

Merging #1948 into master will decrease coverage by 0.02%.
The diff coverage is 0%.

@@            Coverage Diff             @@
##           master    #1948      +/-   ##
==========================================
- Coverage    60.6%   60.58%   -0.02%     
==========================================
  Files         192      192              
  Lines       12259    12264       +5     
==========================================
  Hits         7429     7429              
- Misses       3952     3957       +5     
  Partials      878      878

@kahlouie left a comment (Contributor)

@jamesatheyDDS or @chrisgilmerproj can you add tests to make sure we don't have any regressions moving forward and to illuminate what's actually happening here?

@aileendds (Contributor)

@chrisgilmerproj do you know what CAs are in our AWS environments? @jamesatheyDDS testing in experimental might be helpful

@chrisgilmerproj (Contributor)

@jamesatheyDDS or @chrisgilmerproj can you add tests to make sure we don't have any regressions moving forward and to illuminate what's actually happening here?

Agreed, I'd also like to see some tests here.

@chrisgilmerproj (Contributor)

@chrisgilmerproj do you know what CAs are in our AWS environments? @jamesatheyDDS testing in experimental might be helpful

@aileendds - I don't know what CAs are in AWS. I'd have to introspect the secure migrations to figure it out.

@jamesatheyDDS (Contributor, Author)

@chrisgilmerproj @kahlouie What would a test look like for this integration with DMDC? They only allow specific client certificates to connect to their API - in our case, our staging and experimental DOD-signed certs can connect to their Contractor Test environment, while our production DOD-signed cert can connect to their production environment.

Do we have other unit tests that require network connectivity? Is it fair to block someone from committing if DMDC's servers are down?

@jamesatheyDDS (Contributor, Author)

This seems more like a monitoring issue than a unit test issue. What kind of health alert or monitoring thing can we set up to check the connection to DMDC in our environments?

@chrisgilmerproj (Contributor)

@chrisgilmerproj @kahlouie What would a test look like for this integration with DMDC? They only allow specific client certificates to connect to their API - in our case, our staging and experimental DOD-signed certs can connect to their Contractor Test environment, while our production DOD-signed cert can connect to their production environment.

I would say we should have some kind of dummy DoD cert package to test this code with. Specifically something we can pipe into dodCACertPackage to make sure parsing is happening and we're getting the certs added to the x509 cert pool.
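
A rough sketch of what such a test could look like, assuming a PEM fixture checked in under testdata/ (the fixture path and test name here are illustrative, not the actual ones in the repo):

```go
package iws

import (
	"crypto/x509"
	"io/ioutil"
	"testing"
)

// TestDODCACertPackageParses is a sketch: feed a known-good CA fixture through
// the same parsing path and confirm certs actually land in the x509 cert pool.
func TestDODCACertPackageParses(t *testing.T) {
	pemBytes, err := ioutil.ReadFile("testdata/dod-ca-fixture.pem") // hypothetical fixture
	if err != nil {
		t.Fatalf("reading fixture: %v", err)
	}

	pool := x509.NewCertPool()
	if ok := pool.AppendCertsFromPEM(pemBytes); !ok {
		t.Fatal("expected at least one certificate to parse from the fixture")
	}
	if len(pool.Subjects()) == 0 {
		t.Fatal("expected the cert pool to contain subjects after parsing")
	}
}
```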

Do we have other unit tests that require network connectivity? Is it fair to block someone from committing if DMDC's servers are down?

We don't have unit tests against network connectivity. That being said, we should gracefully handle this kind of situation in the code. So if RBS is down we shouldn't be throwing 5XX errors in the app. I'd prefer we throw a 421 Misdirected Request.

The 421 (Misdirected Request) status code indicates that the request was directed at a server that is not able to produce a response. This can be sent by a server that is not configured to produce responses for the combination of scheme and authority that are included in the request URI.

And something we should address in another PR: in the case that we can't get the EDIPI what should be the fallback behavior of our app?

@chrisgilmerproj (Contributor)

This seems more like a monitoring issue than a unit test issue. What kind of health alert or monitoring thing can we set up to check the connection to DMDC in our environments?

We should put this on the Infra backlog. We could probably create a Route53 health check. We also have an outstanding issue to create a custom health check Lambda for things that Route53 can't check for us.

@aileendds (Contributor)

@chrisgilmerproj - The documentation for 421 does not match this situation, so I don't think it's an appropriate response. That is meant to refer to the original request, not the secondary request to the 3rd party. This situation is closer to a proxy, and the documentation explicitly states that it should not be returned by a proxy.

The fact that we're calling RBS is an internal implementation detail, which the caller doesn't necessarily need to know about so I think 5xx makes sense. If RBS provides more detailed error codes, we can return a 503 instead of a 500 if it's a temporary issue.

@kahlouie (Contributor) commented Apr 4, 2019

@aileendds I agree. We should know when our dependencies are down, so we should give 500 level errors, but we should add more info so we can change the alert to go off separately from the 500 errors that are caused in our application.

@jamesatheyDDS (Contributor, Author)

@chrisgilmerproj It is reasonable to have a unit test that the DOD CA cert package file is being parsed and loaded correctly. At some point in the future, new DOD CAs will be created and the package will need to be updated.
The DOD CA package is not touched by this PR, so is adding a unit test for that a blocker? I need this PR accepted to demo the Orders Gateway to RADM Clarke...

@chrisgilmerproj (Contributor)

@chrisgilmerproj It is reasonable to have a unit test that the DOD CA cert package file is being parsed and loaded correctly. At some point in the future, new DOD CAs will be created and the package will need to be updated.

Yeah, that's good enough I think.

The DOD CA package is not touched by this PR, so is adding a unit test for that a blocker?

I'm just saying we need a unit test for the change to the method you're updating. If there's an existing test that covers this then that's great. If not then we've been implementing new test-coverage rules that mean we need to add a test here before merge.

I need this PR accepted to demo the Orders Gateway to RADM Clarke...

I'd be willing to provisionally accept this with a promise we get a test in a follow up PR.

@chrisgilmerproj (Contributor)

@chrisgilmerproj - The documentation for 421 does not match this situation, so I don't think it's an appropriate response. That is meant to refer to the original request, not the secondary request to the 3rd party. This situation is closer to a proxy, and the documentation explicitly states that it should not be returned by a proxy.

The fact that we're calling RBS is an internal implementation detail, which the caller doesn't necessarily need to know about so I think 5xx makes sense. If RBS provides more detailed error codes, we can return a 503 instead of a 500 if it's a temporary issue.

My experience is that 5XX means there was an error we were unable to handle. We ought to handle this type of situation and return a better code so the client can do something about it. Also, 5XX errors will end up paging the application on-call. If the application on-call cannot do anything about this then they ought not be paged, which I think is the case here. Again, returning a 4XX so a client can handle the problem gracefully would be better in my opinion.

@aileendds I agree. We should know when our dependencies are down, so we should give 500 level errors, but we should add more info so we can change the alert to go off separately from the 500 errors that are caused in our application.

We can't differentiate 5XX errors at the ALB -> PD level. So any 5XX will get bundled up to page the on-call. Again, if Truss has no power to handle this issue then we need to communicate to the requesting client what needs to happen next.

@chrisgilmerproj left a comment (Contributor)

I'm approving, provided that:

  1. we handle the errors this can produce if RBS is down (I'm not picky about how, but I want something that's not a 5XX), and

  2. we add a test that verifies the behavior here is what we want, in a file like rbs_person_lookup_test.go.

https://defensedigitalservice.slack.com/archives/G8P7UD05U/p1554398325153400?thread_ts=1554397836.149400&cid=G8P7UD05U

@aileendds (Contributor) commented Apr 4, 2019

5xx means the server failed to process a valid request. 4xx means the client provided an invalid request. In this specific situation, it was an issue on our end that the on-call should have been notified of and fixed (i.e. with this PR). There is nothing the client could have done to their request to make the server return a valid response.

If in the future RBS is temporarily down and we can identify that it's temporary, then we can handle it differently. @jamesatheyDDS do you know if RBS returns granular error codes? I still think this should be a 503 and not a 4xx, though.
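
As a rough illustration of that suggestion (the names below are hypothetical, not the actual handler code), a handler could map an unreachable RBS to a 503 while keeping other failures as 500s:

```go
package handlers

import (
	"fmt"
	"net"
	"net/http"
)

// personLookup stands in for the iws client interface; illustrative only.
type personLookup interface {
	GetEDIPIBySSN(ssn string) (string, error)
}

// edipiHandler is a sketch: a network-level failure talking to RBS becomes a
// 503 Service Unavailable (temporary, dependency-side), while anything else
// stays a 500 so we still get alerted on genuine application errors.
func edipiHandler(rbs personLookup) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		edipi, err := rbs.GetEDIPIBySSN(r.URL.Query().Get("ssn"))
		if err != nil {
			if _, ok := err.(net.Error); ok {
				http.Error(w, "DMDC identity service unavailable", http.StatusServiceUnavailable)
				return
			}
			http.Error(w, "identity lookup failed", http.StatusInternalServerError)
			return
		}
		fmt.Fprint(w, edipi)
	}
}
```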

@kahlouie left a comment (Contributor)

Given the follow-up plan, I'm good with this.

@jamesatheyDDS (Contributor, Author)

@aileendds @chrisgilmerproj Testing in experimental was successful; at least for DMDC's current cert, experimental had the right CA in its system bundle.

@jamesatheyDDS (Contributor, Author)

@aileendds Yes, RBS does have some granular error codes, but they are not HTTP status codes. Instead, the body of the response is an XML document rooted with <RbsError>. The iws package you and I wrote does parse this, but doesn't do anything different for different RBS error codes.

If the documentation is to be believed, here are the possible codes:
[Screenshot: table of possible RBS error codes from the DMDC documentation]
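
For reference, decoding that body with encoding/xml might look roughly like the following; the child element and field names are guesses based on the description above, not the actual iws types:

```go
package iws

import (
	"encoding/xml"
	"fmt"
)

// RbsError mirrors the XML error document RBS returns in a response body.
// The child element names here are illustrative and should be checked
// against the DMDC documentation.
type RbsError struct {
	XMLName xml.Name `xml:"RbsError"`
	Code    string   `xml:"code"`
	Message string   `xml:"message"`
}

func parseRbsError(body []byte) (*RbsError, error) {
	var rbsErr RbsError
	if err := xml.Unmarshal(body, &rbsErr); err != nil {
		return nil, fmt.Errorf("decoding RBS error body: %v", err)
	}
	return &rbsErr, nil
}
```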

@jamesatheyDDS merged commit 375863f into master on Apr 4, 2019
@jamesatheyDDS deleted the iws-commercial-cert branch on April 4, 2019 at 18:54
@chrisgilmerproj (Contributor)

5xx means the server failed to process a valid request. 4xx means the client provided an invalid request. In this specific situation, it was an issue on our end that the on-call should have been notified of and fixed (i.e. with this PR). There is nothing the client could have done to their request to make the server return a valid response.

If in the future RBS is temporarily down and we can identify that it's temporary, then we can handle it differently. @jamesatheyDDS do you know if RBS returns granular error codes? I still think this should be a 503 and not a 4xx, though.

I'd like to change our code so that we can process the request by returning some kind of valid information to the client so the client can smartly handle this. I'm sure we can all hop on a zoom sometime and have a technical discussion about what to do. But I'd prefer that we handle the RBS return codes and also gracefully manage it if RBS is down or not taking requests from us.

Can either @jamesatheyDDS or @aileendds follow up on this? I'd love to get some resolution. And in the meantime I have added a task to my backlog to figure out how we monitor these services.
