Provide a user with better information when application is misconfigured #78
Comments
This could then be replicated across all the images we provide.
I don't think it compares to a fork bomb. Handling pods that fail to start is part of the system, and looking at the logs is something people should be doing when their deployment fails. The idea of showing it as a web page sounds nice, but I would only do that in "development mode". It would be misleading to see that the deployment succeeded, but instead of your webapp you get an error message served... Imagine you break something and try to deploy a bad image: instead of the deployment failing and your good old version staying in place, you get your pods replaced with an error message web page.
The idea of a "development mode" seems like a reasonable argument 👍
It could mess up readiness checks too.

Ben Parees | OpenShift
The problem up till now has been that the log output has been hard to capture, due to there being no log aggregation to preserve it properly. So unless you are lucky enough to grab the logs in the very small amount of time before the pod is trashed, you don't know what is going on. I ended up having to guess what was wrong, as the logs weren't producing anything. I don't know if newer versions have changed. You might want to intentionally create a broken pod, try it under OpenShift, and see how you fare.
You can get the logs from a previous pod with --previous; does that help?

Ben Parees | OpenShift
Using --previous does help.
Also, which readiness check are you talking about? I can think of two you might mean. The first is simply whether the container is actually running; the second is an explicit readiness probe specified in the DeploymentConfig.
The thing is that the first is not a conclusive indication of a container being ready to handle requests. Many Python WSGI servers at least, and possibly web servers for other languages as well, do not validate the WSGI application entry point on server start, as they only lazily load the WSGI application. This means the server can appear to start up, yet all web requests return 500 anyway. This is where a proper readiness probe is really needed, as is used in some Java images, including the mlbparks example at https://blog.openshift.com/part-2-creating-a-template-a-technical-walkthrough/. I am assuming that without using a template it wouldn't be possible to incorporate a default readiness probe within an S2I built image.

So: touch a file only once all startup checks have passed, and then proceed to run the actual web server. A default readiness probe could check for that file and report okay if it exists.

This whole issue confirms my belief that one shouldn't just throw in a raw Python WSGI server pointing directly at the user's Python application. mod_wsgi-express, which uses Apache/mod_wsgi, is actually configured to start up a WSGI application that I provide with mod_wsgi-express; loading of that is always guaranteed to work. That application in turn loads the real target WSGI application and, internally within the process, passes through the request. If the target WSGI application cannot load, mod_wsgi-express can technically return a better error response. With this flexibility I could even incorporate readiness probe check support, which could be used in support of the OpenShift readiness probe mechanism. I will have to investigate whether OpenShift will log the output of a readiness probe if it returns a non-zero exit status. That could be a good way to return additional information about why something is not working properly, e.g. couldn't load the WSGI application.
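The touch-a-file scheme described above could be sketched roughly as follows. This is purely illustrative: the marker path and function names are made up, not from any actual image.

```python
import os

# Hypothetical marker path; the image's run script would own this file.
READY_FILE = "/tmp/.s2i-app-ready"

def mark_ready():
    """Called by the startup code only after every startup check has passed
    (e.g. the WSGI application module imported cleanly)."""
    with open(READY_FILE, "w") as f:
        f.write("ok\n")

def readiness_probe() -> int:
    """Exec-style probe body: 0 means ready, non-zero means not ready.
    OpenShift/Kubernetes treats a non-zero exit status as a probe failure."""
    return 0 if os.path.exists(READY_FILE) else 1
```

A probe script wired into the DeploymentConfig would then just `sys.exit(readiness_probe())`, and the platform would keep the pod out of rotation until the marker appears.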
Alternatively: "we get into teaching people the proper way to successfully use and administer our product, because not every image they encounter is going to have this sort of built-in debugging tool anyway". This is an education problem, not a technical problem.
That's a liveness check, not a readiness check.
Right, ultimately "readiness" is an application-dependent concept, so providing a generic one would be difficult. Just because WSGI is up doesn't mean, for example, that the backend DB the application depends on is available, or that the application can reach it, so the application still might not be "ready".

My point is that most people's first attempt at a readiness check (or ours, if we do provide a default one) will be "is the app serving http successfully". So if you make it such that the image successfully serves http content even when it is misconfigured, you're going to cause people's (admittedly naive) readiness checks to "pass" when they should be failing, and that's going to lead to actual requests getting sent to that broken image and being served the "debugging" page. (That the error page would return a 500 might be sufficient to avoid this, depending on just how naive the readiness check is.)

We have started discussing a DEBUG option on our DB images; perhaps it makes sense to do something like this when a DEBUG option is enabled. But I would still prefer to just print more useful information to the log and educate users about how to use "--previous" to see logs from failing/restarting containers (or, even better, make those logs more easily and obviously accessible via the web console or via changes to k8s upstream). Trying to fix it one image at a time is going to be painful and won't help the ecosystem of images people want to run on OpenShift.
Per the discussion above, I'm not in favor of solving this in a one-off way via an HTTP page for this particular image. I agree we need to give users more help in a crashloopbackoff situation; that should be opened as an issue against OpenShift (pointing them towards the --previous logs flag, for example).
Currently our run scripts fail, with error messages available only in the logs, when one does not provide certain values, see here:
This results in pod start failure and OpenShift restarting the pod over and over again, causing a kind of fork bomb, but with pods this time. There's the idea of running a simple HTTP server (
python -m SimpleHTTPServer 8000
) to return that information with a 500 error page. All credit for the idea goes to @GrahamDumpleton.
@bparees @rhcarvalho wdyt?
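A minimal sketch of that idea, assuming Python 3's http.server (the error text is hypothetical; a real run script would substitute the actual misconfiguration it detected): instead of a plain SimpleHTTPServer, run a tiny handler that answers every request with status 500 plus the configuration error, so even naive "is it serving 2xx?" readiness checks still fail.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical message; a real run script would interpolate the missing
# setting it detected (e.g. an unset environment variable).
ERROR_MESSAGE = b"Application misconfigured: required APP_MODULE is not set.\n"

class MisconfiguredAppHandler(BaseHTTPRequestHandler):
    """Answers every GET with HTTP 500 and the configuration error."""

    def do_GET(self):
        self.send_response(500)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(ERROR_MESSAGE)))
        self.end_headers()
        self.wfile.write(ERROR_MESSAGE)

    def log_message(self, fmt, *args):
        pass  # keep per-request noise out of the container logs

def serve(port=8000):
    # Bind on all interfaces, like `python -m SimpleHTTPServer 8000` would.
    HTTPServer(("", port), MisconfiguredAppHandler).serve_forever()
```

Because the page is served with 500 rather than 200, a deployment pointed at this server still looks unhealthy to status-code-based checks, which addresses part of the objection raised in the comments above.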