Provide a user with better information when application is misconfigured #78
Comments
This could then be replicated across all the images we provide.
I don't think it compares to a fork bomb. Handling pods that fail to start is part of the system, and looking at the logs is something people should be doing when their deployment fails. The idea of showing it as a web page sounds nice, but I would only do that in "development mode". It would be misleading to see that the deployment succeeded, but instead of your webapp you get an error message served... Imagine you break something and try to deploy a bad image: instead of the deployment failing and your good old version staying in place, you get your pods replaced with an error message web page.
The idea of a "development mode" seems like a reasonable argument 👍
It could mess up readiness checks too.

Ben Parees | OpenShift
The problem up till now has been that the log output has been hard to capture, due to there being no log aggregation to preserve it properly. So unless you are lucky enough to grab the logs in the very small amount of time before the pod is trashed, you don't know what is going on. I ended up having to guess what was wrong, as the logs weren't producing anything. I don't know if newer versions have changed. You might want to intentionally create a broken pod, try it under OpenShift, and see how you fare.
You can get the logs from a previous pod with --previous; does that help?

Ben Parees | OpenShift
Using --previous does help.
Also, which readiness check are you talking about? I can think of two you might mean. The first is simply whether the container is actually running; the second is an explicit readiness probe specified in the DeploymentConfig.
The thing is that the first is not a conclusive indication of a container being ready to handle requests. Many Python WSGI servers at least, and possibly web servers for other languages as well, do not validate the WSGI application entry point on server start, as they only lazily load the WSGI application. This means the server can appear to start up, yet all web requests return 500 anyway. This is where a proper readiness probe is really needed, as is used in some Java images, including the mlbparks example at https://blog.openshift.com/part-2-creating-a-template-a-technical-walkthrough/. I am assuming that without using a template it wouldn't be possible to incorporate a default readiness probe within an S2I built image.

So: touch a file only once all startup checks have passed, and then proceed to run the actual web server. A default readiness probe could check for that file and report okay if it exists.

This whole issue confirms my belief that one shouldn't just throw in a raw Python WSGI server pointing directly at the user's Python application. mod_wsgi-express, which uses Apache/mod_wsgi, is actually configured to start up a WSGI application that I provide with mod_wsgi-express; loading of that is always guaranteed to work. That application in turn loads the real target WSGI application and, internally within the process, passes through the request. If the target WSGI application cannot load, mod_wsgi-express can technically return a better error response. With this flexibility I could even incorporate readiness probe check support, which could be used in support of the OpenShift readiness probe mechanism. I will have to investigate whether OpenShift will log the output of a readiness probe if it returns a non-zero exit status. That could be a good way to return additional information about why something is not working properly, e.g. couldn't load the WSGI application.
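The touch-a-file scheme described above could be sketched roughly as follows. This is purely illustrative: the marker path and function names are made up, not from any actual image.

```python
import os

# Hypothetical marker path; the image's run script would own this file.
READY_FILE = "/tmp/.s2i-app-ready"

def mark_ready():
    """Called by the startup code only after every startup check has passed
    (e.g. the WSGI application module imported cleanly)."""
    with open(READY_FILE, "w") as f:
        f.write("ok\n")

def readiness_probe() -> int:
    """Exec-style probe body: 0 means ready, non-zero means not ready.
    OpenShift/Kubernetes treats a non-zero exit status as a probe failure."""
    return 0 if os.path.exists(READY_FILE) else 1
```

A probe script wired into the DeploymentConfig would then just `sys.exit(readiness_probe())`, and the platform would keep the pod out of rotation until the marker appears.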
Alternatively: "we get into teaching people the proper way to successfully use and administer our product, because not every image they encounter is going to have this sort of built-in debugging tool anyway". This is an education problem, not a technical problem.
That's a liveness check, not a readiness check.
Right, ultimately "readiness" is an application-dependent concept, so providing a generic one would be difficult. Just because WSGI is up doesn't mean, for example, that the backend DB the application depends on is available, or that the application can reach it, so the application still might not be "ready".

My point is that most people's first attempt at a readiness check (or ours, if we do provide a default one) will be "is the app serving http successfully". So if you make it such that the image successfully serves http content even when it is misconfigured, you're going to cause people's (admittedly naive) readiness checks to "pass" when they should be failing, and that's going to lead to actual requests getting sent to that broken image and being served the "debugging" page. (That the error page would return a 500 might be sufficient to avoid this, depending on just how naive the readiness check is.)

We have started discussing a DEBUG option on our DB images; perhaps it makes sense to do something like this when a DEBUG option is enabled. But I would still prefer to just print more useful information to the log and educate users about how to use "--previous" to see logs from failing/restarting containers (or, even better, make those logs more easily and obviously accessible via the web console or via changes to k8s upstream). Trying to fix it one image at a time is going to be painful and won't help the ecosystem of images people want to run on OpenShift.
Per the discussion above, I'm not in favor of solving this in a one-off way via an HTTP page for this particular image. I agree we need to give users more help in a crashloopbackoff situation; that should be opened as an issue against OpenShift (pointing them towards the --previous logs flag, for example).
Currently our run scripts fail, with error messages available only in the logs, when one does not provide certain values, see here:
This results in pod start failure and OpenShift restarting the pod over and over again, causing a kind of fork bomb, but with pods this time. There's the idea of running a simple HTTP server (
python -m SimpleHTTPServer 8000
) to return that information with a 500 error page. All credit for the idea goes to @GrahamDumpleton.
@bparees @rhcarvalho wdyt?
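A minimal sketch of that idea, assuming Python 3's http.server (the error text is hypothetical; a real run script would substitute the actual misconfiguration it detected): instead of a plain SimpleHTTPServer, run a tiny handler that answers every request with status 500 plus the configuration error, so even naive "is it serving 2xx?" readiness checks still fail.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical message; a real run script would interpolate the missing
# setting it detected (e.g. an unset environment variable).
ERROR_MESSAGE = b"Application misconfigured: required APP_MODULE is not set.\n"

class MisconfiguredAppHandler(BaseHTTPRequestHandler):
    """Answers every GET with HTTP 500 and the configuration error."""

    def do_GET(self):
        self.send_response(500)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(ERROR_MESSAGE)))
        self.end_headers()
        self.wfile.write(ERROR_MESSAGE)

    def log_message(self, fmt, *args):
        pass  # keep per-request noise out of the container logs

def serve(port=8000):
    # Bind on all interfaces, like `python -m SimpleHTTPServer 8000` would.
    HTTPServer(("", port), MisconfiguredAppHandler).serve_forever()
```

Because the page is served with 500 rather than 200, a deployment pointed at this server still looks unhealthy to status-code-based checks, which addresses part of the objection raised in the comments above.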