-
Notifications
You must be signed in to change notification settings - Fork 19
Returning casServiceFailureView after Successful ST Validation #55
Comments
Could you please provide an example of this long service url, so that we may better be able to duplicate the issue? I am wondering if there are any particular strange characters in the url that could cause the issue. Just a guess at this point. |
We can provide a full list of the long URLs which we pulled from the logs when CAS went into the problem state, but it comes with the caveat that we haven't been able to reproduce the problem playing them back against a test CAS server, despite numerous attempts. We have isolated the new service to it's own dedicated CAS server, which is where we deployed the non-MFA version of CAS (the new service does not use MFA). The error text we're getting (I don't see it above) is "ServiceTicket [ST-117-DLSkRVYvyer9tiTuEvqt-cas01.example.org] does not satisfy validation specification.", with a convertedDescription of "ticket ''{0}'' not recognized" (from debug logging we added to generateErrorView). https%3A%2F%2Flogin.bay.evergreen.edu%2Flogin?qurl=ezp.2aHR0cHM6Ly9iYXkuZXZlcmdyZWVuLmVkdS91c2VyT2JqZWN0P3NlcnZpY2U9Z2V0VG9rZW4mcmV0dXJuVVJMPWh0dHBzJTNhJTJmJTJmbWFzdGVybmF2ZXpwLmlkbS5vY2xjLm9yZyUyZnVzZXJPYmplY3QlM2ZzZXJ2aWNlJTNkcHJveHlHZXRUb2tlbiUyNnJldHVyblVSTCUzZGh0dHAlMjUzYSUyNTJmJTI1MmZzdW1taXQud29ybGRjYXQub3JnJTI1MmZ3Y3BhJTI1MmZzZXJ2bGV0JTI1MmZvcmcub2NsYy5sYWMud2NuLldjblBhZ2VTZXJ2bGV0JTI1M2Z3Y29jbGNudW0lMjUzZDcxMjExNzgxMCUyNTI2Y29uc29ydGlhUmVnaXN0cnlJZCUyNTNkNTg0NjAlMjUyNnRpdGxlJTI1M2RzaWxlbmNlLWxlY3R1cmVzLWFuZC13cml0aW5ncyUyNTI2cGFnZSUyNTNkd2NuJTI1MjZzZXJ2aWNlJTI1M2RwbGFjZVJlcXVlc3QlMjUyNnN2Y1VybCUyNTNkaHR0cCUyNTI1M0ElMjUyNTJGJTI1MjUyRndvcmxkY2F0Lm9yZyUyNTI1MkZ3ZWJzZXJ2aWNlcyUyNTI1MkZkZWxpdmVyeSUyNTI1MkZyZXF1ZXN0Rm9ybSUyNTI1M0ZpdGVtdHlwZSUyNTI1M0RNb25vZ3JhcGglMjUyNnBhcnRuZXIlMjUzZG9jbGMlMjUyNndjcGFydG5lciUyNTNkb2NsYyUyNTI2b2NsY251bSUyNTNkNzEyMTE3ODEw https%3A%2F%2Flogin.bay.evergreen.edu%2Flogin?qurl=ezp.2aHR0cHM6Ly9iYXkuZXZlcmdyZWVuLmVkdS91c2VyT2JqZWN0P3NlcnZpY2U9Z2V0VG9rZW4mcmV0dXJuVVJMPWh0dHBzJTNhJTJmJTJmbWFzdGVybmF2ZXpwLmlkbS5vY2xjLm9yZyUyZnVzZXJPYmplY3QlM2ZzZXJ2aWNlJTNkcHJveHlHZXRUb2tlbiUyNnJldHVyblVSTCUzZGh0dHAlMjUzYSUyNTJmJTI1MmZzdW1taXQud29ybGRjYXQub3JnJTI1MmZ3Y3BhJTI1MmZzZXJ2bGV0JTI1MmZvcmcub2NsYy5sYWMud2NuLldjblBhZ2VTZXJ2bGV0JTI1M2Z3Y29jbGNudW0lMjUzZDcxMTc4ODg0NiUyNTI2Y29uc29ydGlhUmVnaXN0cnlJZCUyNTNkNTg0NjAlMjUyNnRpdGxlJTI1M2RwZXJzcGhvbmVzLWRlc2NlbnQtcmV0dXJuJTI1MjZwYWdlJTI1M2R3Y24lMjUyNnNlcnZpY2UlMjUzZHBsYWNlUmVxdWVzdCUyNTI2c3ZjVXJsJTI1M2RodHRwJTI1MjUzQSUyNTI1MkYlMjUyNTJGd29ybGRjYXQub3JnJTI1MjUyRndlYnNlcnZpY2VzJTI1MjUyRmRlbGl2ZXJ5JTI1MjUyRnJlcXVlc3RGb3JtJTI1MjUzRml0ZW10eXBlJTI1MjUzRE1vbm9ncmFwaCUyNTI2cGFydG5lciUyNTNkb2NsYyUyNTI2d2NwYXJ0bmVyJTI1M2RvY2xjJTI1MjZvY2xjbnVtJTI1M2Q3MTE3ODg4NDY- If you want the full list, let us know via email and we'll send them along. |
Thanks for the info. May I suggest that adjust the logging level to TRACE and observe the behavior? If you can get the integration to failed with TRACE turned on, I'd be interested to see what the log has to say, particularly up to the point where you see the "not satisfied" message. There are a number of conditions that might make the assertion fail validation, and the log should tell us more about the assertion detail. |
We enabled TRACE logging today and found some interesting things. Today just before the problem started we got an 'InvalidTicketException' (gt and lt signs converted to +):
The next time someone (my test user) got a second ST using the same TGT, it failed in 'isSatisfiedBy' with INVALID_TICKET (same as above). Interestingly, the trace logs show that the exact same methods and values in the isSatisfiedBy method for both success and failure, the only difference it the result. I did a brief check on one of our older logs and found the same pattern. We have more logs, let me know if they would help and I'll clean out the credentials and post or email them. |
Thanks for the repo. One other question; are there any particular log statements before/after this line?
I am specially interested in DEBUG entries. |
Also, bit of an obvious guess it might be, but is the application specially using renew=true either on login or ST validation? That should help narrow down the scope. |
Renew appears to be 'on' in the client config (they're using the EZProxy CAS client), but the httpd logs show that it is only being called ~16% of the time (403 of 2523 login requests over a week's time); we're still investigating the logic behind that observation. In the session we ran last Thursday, we had 2 requests come in with renew=true before the problem started, 2 'during' the ticket generation that seemed to mark the beginning of the problem, and 2 after that. I'll directly email you the TRACE log from the ticket generation that marked the beginning of the problem state. |
I pulled some new login URLs from the httpd logs, some of which included renew=true, and updated my test script to log in again when it found renew=true in the URL; I still could not reproduce the problem, although I did fill up the disk on the test server with the TRACE logs...twice :) Looking over my httpd log vs. the one I pulled from the production server, I see that my script is not preserving the jsessionid like the real-life users are, so I'll fix that and try again at some point soon. |
I'm doing a "spring cleaning" of the issues and have a question - can this issue be closed? If anything like this happens in the future it could always be re-opened, etc. |
Safe to close, for now I would say. Lets revisit once we can actually duplicate the issue on our end. |
The issue is still happening. We would love to get it solved. We've resorted to running a second non-MFA CAS for the application that triggers the error. -James From: Dmitriy Kopylenko [notifications@github.com] I'm doing a "spring cleaning" of the issues and have a question - can this issue be closed? If anything like this happens in the future it could always be re-opened, etc. — |
OK, I'm gonna move it to 1.0.0-M3 for now and assign to @mmoayyed ;-) |
Any idea when 1.0.0-M3 might be available? We can pretty easily reproduce the problem in our production environment, but unfortunately are still unable to reproduce it in a test environment. |
Keith, Thank you and your team jumping on this. We (UOFU, Internet2 and Unicon) are actively seeking other U’s to proof of concept this CAS MFA. I have missed the window to propose a presentation for I2 CAMP, with luck we can do an ACAMP presentation in Indianapolis this fall. Cheers, Bryan Any idea when 1.0.0-M3 might be available? We can pretty easily reproduce the problem in our production environment, but unfortunately are still unable to reproduce it in a test environment. — |
This is still happening with the almost-M3 code. It has happened twice now; the first time was yesterday afternoon, but I didn't have DEBUG logging enabled, so I enabled it and it happened again this morning. I believe the problem is with the renew logic, and unfortunately the EZProxy CAS client likes to use renew=true. What logs can I submit to help figure out and address this issue? If there is anything else we can do to help address this issue, please let us know. |
Turn up the log level to TRACE and post the entire cas.log file. you can certainly attach this file to my email as well mmoayyed@unicon.net We'll be deploying the changes for UOFU pretty soon, and so should be able to test this out quite a bit. |
Ok, I'll capture a new TRACE log; hopefully I'll have it tomorrow morning (its a production server so I have to wait until late tonight to change the log level). I'll also send along the httpd logs so you can see what traffic is coming in. Having spent 6 weeks at the beginning of this year investigating this and still not being able to reproduce the problem in test, my best guess is that the problem is a combination of the renew flag and the way EZProxy validates STs. In my tests, I generated the service validation URL based on the service URL from the httpd logs, I didn't use the service validation URLs from the httpd logs, and I'm wondering if that's why I couldn't reproduce the problem in my tests. My next step was going to be looking into the EZProxy client's service validation URLs, but I needed to get back to my regularly scheduled work so I haven't gone down that path. In the meantime, I re-sent you the 26k line TRACE log file we collected in February; hopefully that will yield some good info. |
Today we were able to capture the problem in TRACE logs. We have everything that happened from the service restart at 11:15am until the service entered the problem state at about 2:25pm. The total size is about 2.6GB. I broke it up into 16 ~1,000,000 line files; do you want them all, or just the last one, or...? |
Keyith, I am not sure when Unicon will be able to assist you on this. Please contact me and we can discuss the options of this may be addressed within community effort. |
We are seeing the same issue with the M3 build where the CAS logs show a successful ST validation but the service is receiving a failure response with a "ticket ''{0}'' not recognized" message. We are preparing to upgrade to the M6 build. Has this issue been resolved that version? |
I haven't heard that this has been fixed, and it wasn't fixed in the "almost-M3" version I last tried. This began happening for us when we added a service (OCLC's EZProxy) that often uses renew=true; prior to that we hadn't see this issue. JohnPfeifer, can you add any details about when this happens, specifically with regard to requests where renew=true (my best guess/current suspicion)? |
Yes, details would be very helpful in figuring this out. Particularly, trace calls from the app to the CAS server with renew=true would be helpful. We have not been able to successfully duplicate this problem, but hopefully with more data we'll review more thoroughly. |
It does not seem to be related to renew=true. There are only one or two services which use the renew flag and the are not reporting any problems. Reproducibility is the bane of this problem. It occurs irregularly on some services that are in their period of peak usage. I have been unable to reproduce it in my dev or qa environments. I should also note that the services in question are not doing MFA.
// |
Good to know that it's not MFA-request related. I think @keyith mentioned that he was not able to duplicate this with a vanilla CAS server. It may be some place in the MFA code that tries to handle non-mfa functionality perhaps. I'll run a few more tests with the latest M6 or later with some load and see if I can make sense out of this. If you can capture tomcat/server access requests logs that correlate with the issue, it would help. |
After much ado, we may have solved it. We were never able to reproduce the problem in our qa environment but, after a week of running the patched code in production, we have not seen this error again. The issue is with how the validationSpecificationClass is handled differently in this code v/s the stock ServiceValidateController class. In the JASIG code, a new instance is created by each call to getCommandClass() where, as in the Unicon code, there is a single instance that gets reused. I believe that this leads to a race condition when you have a mix of mfa and non-mfa requests happening at the same instant (which would explain the challenge in reproducing the problem). Below is the patch that I wrote:
Alternately, you could declare MultiFactorAuthenticationProtocolValidationSpecification to be Cloneable and then getCommandClass() could clone the instance. |
@JohnPfeifer thanks very much for the patch. I'll get to it tomorrow to merge the change into the mfa codebase. [or if you'd rather submit a PR, I'd be happy to merge that as well] |
@JohnPfeifer before I apply a patch, could you test one more time with scope set to prototype for the following beans: <bean id="mfaValidationWithoutProxySupportSpec"
class="net.unicon.cas.mfa.MultiFactorAuthenticationProtocolValidationSpecification"
c:validateProxyAuthenticationRequests="false"
scope="prototype"/>
<bean id="mfaValidationWithProxySupportSpec"
class="net.unicon.cas.mfa.MultiFactorAuthenticationProtocolValidationSpecification"
c:validateProxyAuthenticationRequests="true"
scope="prototype"/> |
Are you asking to test the straight M6 code or the modified code with that scope set for those two beans? |
The M6 code. |
Actually, to clarify, the M6 code with the two beans modified for their scope. |
That was the easy part; integrating our overlay with the M6 code is proving more difficult. Is there any reason I can't change the scope of those two beans to "prototype" in our M0 code? Also, I've been thinking about this a lot lately, and want to point out that I've been able to repro the issue on the server serving EZProxy without any MFA requests, while the server serving both MFA and non-MFA requests but not EZProxy hasn't had the issue. I think the logic behind creating a new instance vs. a single instance still stands, but I wanted to call that out. |
Sure, M0 would be fine too. |
I made the scope change to those two beans in our existing M0 code and got permission to deploy it to our CAS server dedicated to EZProxy yesterday at 10am. In past attempts, the failures began occurring within 2 hours of deployment, but today I am happy to say that we haven't encountered the issue after 24 hours! So far, so good... Current plans call for running it for the rest of this week, and if we still haven't encountered the issue, we'll deploy it to the main production CAS server next Sunday morning. |
Excellent news. I'll wait to hear back from you, and if the change still holds, we'll modify the code to include that. Thanks again. |
And cut RC1, perhaps ;-) |
I made a few changes to fix the renew behavior. |
With the MFA extension, CAS returns the casServiceFailureView view to clients calling serviceValidate, despite the fact that the actual validation is successful. This problem began when a new service was added that has very long service URL. The symptom only presents on the second or subsequent ST validation after TGT creation. Reverting back to a non-MFA CAS demonstrates that the problem only occurs in MFA CAS. The symptoms do not occur only on requests with these very long service URL, rather, the service with very long URL puts CAS in a state where this happens with all services and a CAS restart is required to correct it. Additionally, a single instance of the very long URL will not necessarily trigger the problem. We added a few logger.debug lines to MultiFactorServiceValidateController.java and started getting a bit more info about where the problem was occurring:
2014-01-23 11:10:26,703 DEBUG [net.unicon.cas.mfa.web.MultiFactorServiceValidateController] - <ServiceTicket [ST-117-DLSkRVYvyer9tiTuEvqt-cas01.example.org] does not satisfy validation specification.>
2014-01-23 11:10:26,703 DEBUG [net.unicon.cas.mfa.web.MultiFactorServiceValidateController] - <convertedDescription is ticket ''{0}'' not recognizedand code is INVALID_TICKET>
From MultiFactorServiceValidateController.java:
if (!validationSpecification.isSatisfiedBy(assertion)) {
logger.debug("ServiceTicket [" + serviceTicketId + "] does not satisfy validation specification.");
return generateErrorView("INVALID_TICKET", "INVALID_TICKET_SPEC", null);
I may be able to correct or provide more info but I need a little assistance in where to start looking.
The text was updated successfully, but these errors were encountered: