Trust both DOD and commercial certs when connecting to DMDC #1948
Conversation
Codecov Report

```diff
@@            Coverage Diff             @@
##           master    #1948      +/-   ##
==========================================
- Coverage    60.6%    60.58%    -0.02%
==========================================
  Files         192      192
  Lines       12259    12264        +5
==========================================
  Hits         7429     7429
- Misses       3952     3957        +5
  Partials      878      878
```
@jamesatheyDDS or @chrisgilmerproj can you add tests to make sure we don't have any regressions moving forward and to illuminate what's actually happening here?
@chrisgilmerproj do you know what CAs are in our AWS environments? @jamesatheyDDS testing in experimental might be helpful.
Agreed, I'd also like to see some tests here.
@aileendds - I don't know what CAs are in AWS. I'd have to introspect the secure migrations to figure it out.
@chrisgilmerproj @kahlouie What would a test look like for this integration with DMDC? They only allow specific client certificates to connect to their API - in our case, our staging and experimental DOD-signed certs can connect to their Contractor Test environment, while our production DOD-signed cert can connect to their production environment. Do we have other unit tests that require network connectivity? Is it fair to block someone from committing if DMDC's servers are down?
This seems more like a monitoring issue than a unit test issue. What kind of health alert or monitoring thing can we set up to check the connection to DMDC in our environments? |
I would say we should have some kind of dummy DoD cert package to test this code with. Specifically something we can pipe into
We don't have unit tests against network connectivity. That being said, we should gracefully handle this kind of situation in the code. So if RBS is down we shouldn't be throwing 5XX errors in the app. I'd prefer we throw a 421 Misdirected Request.
And something we should address in another PR: in the case that we can't get the EDIPI, what should be the fallback behavior of our app?
We should put this on the Infra backlog. We could probably create a Route53 health check. We also have an outstanding issue to create a custom health check Lambda for things that Route53 can't check for us.
@chrisgilmerproj - The documentation for 421 does not match this situation, so I don't think it's an appropriate response. That is meant to refer to the original request, not the secondary request to the 3rd party. This situation is closer to a proxy, and the documentation explicitly states that it should not be returned by a proxy. The fact that we're calling RBS is an internal implementation detail, which the caller doesn't necessarily need to know about, so I think 5xx makes sense. If RBS provides more detailed error codes, we can return a 503 instead of a 500 if it's a temporary issue.
@aileendds I agree. We should know when our dependencies are down, so we should give 500 level errors, but we should add more info so we can change the alert to go off separately from the 500 errors that are caused in our application.
@chrisgilmerproj It is reasonable to have a unit test that the DOD CA cert package file is being parsed and loaded correctly. At some point in the future, new DOD CAs will be created and the package will need to be updated. |
Yeah, that's good enough I think.
I'm just saying we need a unit test for the change to the method you're updating. If there's an existing test that covers this then that's great. If not then we've been implementing new test-coverage rules that mean we need to add a test here before merge.
I'd be willing to provisionally accept this with a promise we get a test in a follow-up PR.
My experience is that 5XX means there was an error we were unable to handle. We ought to handle this type of situation and return a better code so the client can do something about it. Also, 5XX errors will end up paging the application on-call. If the application on-call cannot do anything about this then they ought not be paged, which I think is the case here. Again, returning a 4XX so a client can handle the problem gracefully would be better in my opinion.
We can't differentiate 5XX errors at the ALB -> PD level. So any 5XX will get bundled up to page the on-call. Again, if Truss has no power to handle this issue then we need to communicate to the requesting client what needs to happen next.
I'm approving, providing:
- we need to handle the errors that this can produce if RBS is down (I'm not picky about how, but I want something that's not a 5XX).
- we need some test that verifies the behavior here is what we want, in a file like rbs_person_lookup_test.go.
5xx means the server failed to process a valid request. 4xx means the client provided an invalid request. In this specific situation, it was an issue on our end that the on-call should have been notified of and fixed (i.e. with this PR). There is nothing the client could have done to their request to make the server return a valid response. If in the future RBS is temporarily down and we can identify that it's temporary, then we can handle it differently. @jamesatheyDDS do you know if RBS returns granular error codes? I still think this should be 503 and not 4xx though.
Given follow-up plan, I'm good with this.
This reverts commit 050fc56.
@aileendds @chrisgilmerproj Testing in experimental was successful; at least for DMDC's current cert, experimental had the right CA in its system bundle.
@aileendds Yes, RBS does have some granular error codes, but they are not HTTP status codes. Instead, the body of the response is an XML document rooted with If the documentation is to be believed, here are the possible codes:
I'd like to change our code so that we can process the request by returning some kind of valid information to the client so the client can smartly handle this. I'm sure we can all hop on a zoom sometime and have a technical discussion about what to do. But I'd prefer that we handle the RBS return codes and also gracefully manage it if RBS is down or not taking requests from us. Can either @jamesatheyDDS or @aileendds follow up on this? I'd love to get some resolution. And in the meantime I have added a task to my backlog to figure out how we monitor these services. |
DMDC's Identity Web Services recently changed from presenting a certificate signed by DISA to a certificate signed by Entrust. As a result, SSN<->EDIPI conversions required by both the Orders Gateway and by the DPS-to-login.gov bridge stopped working. The error
Get https://pkict.dmdc.osd.mil/appj/rbs/rest/{more URL components}: x509: certificate signed by unknown authority
appears in the logs. So, instead of trusting only DOD-signed certs when connecting to DMDC, or trusting only commercial certs, trust both. That way, if DMDC changes its mind about using a commercial cert, especially when the current cert expires in March of 2020, our code will continue to work.