I added a new site and it took a while to start because it needed to install the database schema. Eventually I got an error:
And then it appears z_site_sup has crashed: no site can be reached anymore.
The exact same thing happened to me last week when installing mem01 on production.
Workaround for when this happens:
We have a better startup sequence now. Is this still an issue?
Could easily be. It sure smells like it.
The problem with the restart loop was that z_sites_dispatcher needs a running z_sites_manager when it starts. When both servers have failed, zotonic_sup will keep restarting them, sometimes in the wrong order every time, so restarting z_sites_manager will always fail.
By using a one_for_all supervisor, both servers will always be started in the right order.
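A minimal sketch of the one_for_all idea, not the actual Zotonic code. The module name `sites_sup_sketch` and the workers `manager_standin` and `dispatcher_standin` are hypothetical stand-ins for z_sites_manager and z_sites_dispatcher, and the restart limits are arbitrary:

```erlang
-module(sites_sup_sketch).
-behaviour(supervisor).

-export([start_link/0, init/1, start_worker/1, worker_init/2]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% one_for_all: if either child dies, the other is stopped as well and
    %% both are restarted in the order listed below (manager first).
    SupFlags = #{strategy => one_for_all, intensity => 3, period => 10},
    Children = [worker(manager_standin), worker(dispatcher_standin)],
    {ok, {SupFlags, Children}}.

worker(Name) ->
    #{id => Name,
      start => {?MODULE, start_worker, [Name]},
      restart => permanent,
      shutdown => 1000,
      type => worker,
      modules => [?MODULE]}.

%% Hypothetical stand-in worker: registers its name, then waits.
start_worker(Name) ->
    proc_lib:start_link(?MODULE, worker_init, [self(), Name]).

worker_init(Parent, Name) ->
    register(Name, self()),
    proc_lib:init_ack(Parent, {ok, self()}),
    receive stop -> ok end.
```

Killing `dispatcher_standin` in this sketch also replaces `manager_standin`, because the supervisor stops the remaining child and then starts both again in list order.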
Maybe we should also think about letting zotonic_sup fail. It currently tries restarting stuff forever, but if we let it fail, heart has the opportunity to restart zotonic.
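A small sketch of what "letting the supervisor fail" looks like in OTP terms. The module `give_up_sup_sketch` and its always-crashing child are hypothetical, and the limits (2 restarts in 5 seconds) are only example values, not what zotonic_sup uses:

```erlang
-module(give_up_sup_sketch).
-behaviour(supervisor).

-export([start_link/0, init/1, start_flaky/0]).

start_link() ->
    supervisor:start_link(?MODULE, []).

init([]) ->
    %% More than 2 restarts within 5 seconds makes the supervisor stop its
    %% children and terminate itself with reason 'shutdown'.
    SupFlags = #{strategy => one_for_one, intensity => 2, period => 5},
    Child = #{id => flaky,
              start => {?MODULE, start_flaky, []},
              restart => permanent,
              shutdown => brutal_kill,
              type => worker,
              modules => [?MODULE]},
    {ok, {SupFlags, [Child]}}.

%% Hypothetical child that always crashes shortly after starting.
start_flaky() ->
    {ok, spawn_link(fun() -> timer:sleep(50), exit(crashed) end)}.
```

When the top supervisor of an application started as `permanent` terminates like this, the runtime system shuts down as well, which is the point where heart could restart the node.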
Thinking about it, what also could have happened was this:
1) One site failed (timed out). In my case it was the module indexer.
2) z_sites_manager keeps running and will restart the site after a delay.
3) Then somehow z_sites_dispatcher failed (why? it should not).
4) zotonic_sup tries to restart it (with a delay).
5) z_sites_dispatcher wants to fetch the dispatch table of the failed site,
6) and crashes because the dispatcher of the failed site is not running. GOTO 4
I think we should look at the sites dispatcher and manager to check dependencies. What worries me a bit is step 3: why did the sites_dispatcher fail when one of the sites was waiting to be restarted?
When we isolate the sites and the sites managers, we can surely let the main supervisor give up after a while, as the internal "plumbing" should be robust enough not to fail under non-lethal circumstances.
I'll assign this to the 0.9.1 milestone for further investigation.
The crash of z_sites_dispatcher is related to calling z_sites_dispatcher:update_dispatchinfo when the dispatcher of a site is down. This happens a lot when mod_develop is running, or when you are busy coding and call z:m/0 a lot.
This is a race condition between calling z_sites_manager:get_sites and actually calling the dispatcher of the site. Maybe the sites manager even reports that the site is running, but actually parts of it are failing.
This kind of situation should be handled, maybe with the same service we need for the delayed restarts. I have some thoughts on how to handle this, and have started experimenting (https://github.com/mmzeeman/fusebox).
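Independent of the fusebox experiment, the simplest way to survive this race is to treat a missing dispatcher as a normal result instead of a crash. A minimal sketch, assuming the per-site dispatchers are registered gen_servers; the module `safe_dispatch_sketch` and the `dispatchinfo` request are hypothetical and not the real Zotonic API:

```erlang
-module(safe_dispatch_sketch).

-export([collect/1, safe_call/2]).

%% Ask every dispatcher in the list for its info and skip the ones that
%% are down, instead of crashing the caller.
collect(Dispatchers) ->
    lists:filtermap(
      fun(Name) ->
              case safe_call(Name, dispatchinfo) of
                  {ok, Info} -> {true, {Name, Info}};
                  {error, _Reason} -> false
              end
      end,
      Dispatchers).

%% gen_server:call/2 exits when the target is not running, dies during
%% the call, or does not answer in time; turn those exits into errors.
safe_call(Name, Request) ->
    try
        {ok, gen_server:call(Name, Request)}
    catch
        exit:{noproc, _} -> {error, noproc};
        exit:{timeout, _} -> {error, timeout};
        exit:{Reason, {gen_server, call, _}} -> {error, Reason}
    end.
```

The trade-off is that a site whose dispatcher is down is silently left out of the result, so the caller still needs some way to refresh once the site is back.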
A related fix was made in e8ef40a.
Let's close this until it becomes a real issue again.