Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved ILS availability system #768

Merged
merged 3 commits into from Oct 18, 2016
Merged

Improved ILS availability system #768

merged 3 commits into from Oct 18, 2016

Conversation

demiankatz
Copy link
Member

@demiankatz demiankatz commented Aug 2, 2016

This pull request contains work in progress on a smarter approach to detecting and dealing with ILS technical difficulties. The existing "loadNoILSOnFailure" mechanism depends upon the ILS driver throwing an exception in its init() method to detect problems. This may be too limiting.

TODO:

  • Refactor VuFind\ILS\Connection to perform NoILS failover "on demand" rather than in the constructor.
  • Discuss whether the health check in getOfflineMode needs to be made more intelligent; perhaps we should add a method to the driver spec and fail over to this default behavior when the method is undefined. (This may be worth considering in the future, but was not implemented at this time).
  • Discuss whether we need a configuration setting for the AJAX health check on the search/home page. (We decided this is not needed at this time).
  • Investigate whether we can use some kind of cache to improve ILS performance -- for example, if we have a health check, can we cache health check status for a period of time to reduce load on the ILS? (We decided not to do this, as it could magnify intermittent problems).
  • Investigate whether we can extend Lazily create Voyager database connection and try to avoid using it u… #736 so that its on-demand database initialization functionality remains compatible with the NoILS failover capabilities of VuFind.

@EreMaijala
Copy link
Contributor

I think this is a good start. I'm just not sure if there's a good way to do a health check. For instance we have some SOAP based ILS connections that just fail calls intermittently and very temporarily meaning that the call may succeed if it's retried right after the failure. Especially caching would be harmful in this case, but also the health check might succeed even thought subsequent actual API call could fail. Could it work if the Connection class would catch ILS exceptions from the driver and then either pass them on or switch to NoILS based on whether failover is enabled?

@demiankatz
Copy link
Member Author

The possibility of making the Connection class catch exceptions and fail over dynamically definitely crossed my mind -- the problem is that it's hard to predict how VuFind would behave in scenarios where multiple ILS driver calls are made on a single page, and some fail while others do not. You could end up with a page that's partially constructed with the real ILS driver and partially with the NoILS driver. This might or might not be a problem -- it's hard to test all the permutations to find out exactly what the possibilities could be. If I'm being optimistic, I might suppose that there's a good chance this would work properly: the "getOfflineStatus" call is made as part of template rendering, which happens after the controller executes... so as long as the controller action catches problems and switches to NoILS, then the interface should render the appropriate messages. However, I'd want to do some fairly thorough testing to confirm that optimistic theory.

And, as you say, the whole idea of a failover strategy is going to depend on failure patterns, and these are going to vary from case to case, function to function and driver to driver. No "health check" function is going to be able to catch all the edge cases, and some "normal" errors (like an exception thrown due to unexpected bad input) could easily be misinterpreted as a system failure. The questions are, what can we do that offers more benefit than harm, and what configurable settings might we provide to make this adaptable to the widest range of use cases?

Just as one example, what if we offered an option to cache failure status for X seconds. If something fails, we cache the fail time. The driver then stays in offline mode for at least the configured number of seconds before running a health check and clearing the cached fail time. This would interrupt access to the ILS driver during problem times and potentially give it time to recover... but, of course, the down side is that false positive failures are going to boot all users out of the ILS. Potentially worth considering as a non-default option, though.

Another thing to think about is retries -- is there any value in retrying from the ILS Connection? (My initial feeling is no: any retry loops should be handled INSIDE the ILS driver, since higher-level code shouldn't be making those kinds of assumptions... but I bring it up for the sake of discussion).

Any thoughts on where we should go next? Would you like me to take a more serious look at putting try..catch blocks around all the calls and switching into NoILS mode more aggressively? Perhaps we can add another setting to the Demo driver to simulate intermittent failures in order to test this.

@EreMaijala
Copy link
Contributor

I suppose the worst situation is something like trying to renew loans where the renewal call goes through but then the getMyTransactions call fails, right? In this case the user can't know if the renewal happened or not, but I don't think the situation would be any worse than it is at the moment.

As far as I can see no ILS exception should be thrown for invalid input or similar situations. E.g. placeHold returns the array ["success" => false, "sysMessage" => 'something'] with invalid input.

If the ILS status is cached, it would need to be common place for all front ends for us (like the database). Otherwise there would be inconsistency between front ends and subsequent calls for the user.

I agree that any retrying should happen in the ILS driver. That's where it can be decided whether retrying makes sense or is even safe.

Yes, I'd go for the try-catch and see how it goes. Making intermittent failures configurable in Demo driver is a good idea.

@demiankatz
Copy link
Member Author

I agree that no ILS exception should be thrown for invalid input... but just observing that if a bug caused this to happen, it could potentially be exploited to bring down the whole ILS integration in the "cached status" scenario.

Using the database for an ILS status cache could work. My first instinct was to use the disk-based object cache, but that is not shared between instances. Perhaps the best answer is to abstract the caching to a service so that we can easily add configurability or local overrides.

In any case, I'll do some more work on the try-catch / Demo driver failure simulation next time I have a chance to do some coding!

@demiankatz
Copy link
Member Author

@EreMaijala, as discussed, I have implemented fake system-level failures in the Demo driver and added try..catch blocks to make NoILS failover more robust. Some initial tests seem to suggest that this is working in at least some cases. I'm not sure what the best strategy is for more thoroughly gaining confidence that this is a good idea (or proving that it is not). Any ideas? I'll try to play with this some more when I have more time, but in the meantime, any feedback would be appreciated.

@EreMaijala
Copy link
Contributor

Well, I'm still pretty sure this is a good idea, but I agree that it's difficult to validate. Perhaps checking all situations where the ILS drivers throw exceptions would help to make sure that they're not thrown with invalid user input etc. I think I already did that when working on the ILS exception handling before, but it wouldn't hurt to doublecheck.

@demiankatz
Copy link
Member Author

demiankatz commented Aug 9, 2016

I double checked -- while I didn't look at every single exception thrown by every single driver, I think I checked a large enough sample to be confident that there are no exceptions being thrown that we shouldn't be catching and treating as a real failure. (Though, as I said, there's always the possibility that a bug in some driver could introduce an accidental scenario where specific user input could lead to an exception -- but that's just a fact of life with software).

I'll leave this open for a while so we can test it a bit more, and I'll make sure it gets mentioned on the next dev call.

The next question then becomes whether this works well enough with #736 that we can consider merging both PR's, or if we need more time to discuss a health check mechanism. One important scenario to think about:

In the current master code, where we always connect to the database when we initialize the ILS driver, if Voyager goes down and the appropriate NoILS settings are configured, the home page of VuFind will hide the login button and display a message about the ILS being offline.

In this code, with the on-demand Voyager DB initialization, the VuFind home page will always look healthy, and the log in option will always be available. Only AFTER the user attempts to log in will they get the "system under maintenance" message. If they close the login window and refresh, the system will look healthy again, but as soon as they try to log in, they'll get the error again. (Note: this behavior description is based on testing with the modified demo driver -- I haven't actually confirmed that this is what happens with your Voyager revisions... we ought to do that in the near future just to be sure).

The scenario described with the revised code is not completely unacceptable -- the user receives clear feedback explaining why their action didn't work. However, it does allow them to waste more time before learning about the system outage, and it might encourage them to waste further time retrying again and again. The question then is whether implementing some kind of status cache/health check mechanism might make sense to offset that, or whether it's good enough as it is.

@demiankatz
Copy link
Member Author

@EreMaijala, any further thoughts on this? Is this something you'd urgently like to see resolved for release 3.1, or can we give this a bit more time for discussion? (I'm just trying to prioritize my work for the next few weeks... and this one will need a lot of attention if we want to finish it quickly).

@EreMaijala
Copy link
Contributor

Well, I find this with #736 good enough, but we're not in a bad hurry to finish it so no need for it to make it into 3.1 (we'll probably merge or cherry-pick from master anyway).

@demiankatz
Copy link
Member Author

@EreMaijala, I think @lahmann's work from #752 may help close up some of the gaps I encountered during testing. You might want to take a look at the most recent comment over there. Hopefully if everyone agrees I'll move some of his AJAX status checking into this branch, and then we can see whether we need to add some additional mechanisms to make everything work properly. I don't think it will all get done in time for 3.1, but I'm feeling increasingly confident that it may be done shortly after that release.

@EreMaijala
Copy link
Contributor

@demiankatz Nice, that sounds good to me. Please just make sure the AJAX offline thing is optional, since we will probably be unable to use it with MultiBackend.

@demiankatz
Copy link
Member Author

That should be easy enough to accomplish! I'll take a closer look once my latest round of comments on #752 are resolved.

@demiankatz demiankatz added this to the 4.0 milestone Sep 13, 2016
@demiankatz
Copy link
Member Author

I have rolled @lahmann AJAX status check work from #752 into this branch with some additions/modifications. Here are some recent changes I've just made in this branch:

1.) In the ILS connection class, even if "NoILS" failover is turned off, we now remember failures of the primary ILS and return 'ils-offline' status when getOfflineMode is called (unless the ILS driver has its own status method). This should allow us to detect and warn about problems in more situations.

2.) getOfflineMode now has an optional "health check" parameter which does an arbitrary lookup on the ILS driver to test the connection. This is used both in @lahmann's AJAX routine and in the install controller's ILS test. We may want to add an optional ILS driver method that can override this default test with something smarter and more specific -- I put a checkbox about that on the TODO list above.

3.) For the moment, the AJAX ILS status check on the home page is not configurable -- it's just always on. I'm not sure if it's preferable to disable this by customizing the search/home template, or if it is worth wiring up a config.ini setting for it. I'm open to suggestions either way, but for the moment I just wanted to get enough pieces put together for a proof of concept.

4.) I have tested that all of this functionality seems to work reasonably in combination with #736.

I'm interested in what @EreMaijala has to say about all of this. I'm definitely not comfortable enough with this to put it into release 3.1, but hopefully we can wrap this conversation up some time after the release so this can be incorporated into 4.0.

@demiankatz
Copy link
Member Author

@EreMaijala, now that 3.1 is released, would you have time to give this another look? I think we might be able to merge this to master fairly soon with your approval, though there are still some outstanding questions (summarized in the above comment and in the TODO checkboxes).

@EreMaijala
Copy link
Contributor

I'm ok with this, though it obviously doesn't work with MultiBackend driver.

@demiankatz demiankatz merged commit 9195ad4 into master Oct 18, 2016
@demiankatz
Copy link
Member Author

Okay, this has been merged to master. Thanks again for the input!

@demiankatz demiankatz deleted the ils-failover-refactor branch September 14, 2017 12:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants