images built on rocker/binder can't run RStudio on mybinder.org #29
Comments
@januz Thanks for the bug report and sorry for the trouble. This does indeed sound very weird, I'll have to poke around. Sounds like something funny has happened on the Docker Hub end, if the same Dockerfile building locally is working fine. Possibly something in the post-build hook configuration? I'll tickle the hub to rebuild and then poke around.
@januz can you get your image to rebuild on Binder? No idea where this went wrong, but everything seems to be working on my fork of your 'binder-fails' example: https://github.com/cboettig/binder-fails
@cboettig Yes, indeed. After making a commit to the repo, the container builds successfully on Binder! @betatim's assumption that a cached layer with the old version of `nbrsessionproxy` was the problem seems to be confirmed. If Binder builds with a fresh cache, a stale layer like this shouldn't survive a rebuild.

Hm, I am not 100% sure anymore, but I think that during my tests for the above problems, I had the same thing happening to me (i.e., RStudio not opening) when I built/ran my Docker container from Docker Hub. But the same fault (having a cached layer from an earlier build with the outdated `nbrsessionproxy` version) shouldn't be possible there.
I think if long-term runnability is your goal, the best thing to do is to rebuild and run your image at regular intervals. From watching people use mybinder.org, and from using some of the repos in talks/demos over many months, my takeaway is that it is surprisingly hard to make something that works now and will still work in 6 months. Mostly this is about pinning the right kind of dependencies at the right level. Keeping the current Docker image is a good start, but if you want to keep open the option of ever re-building it, I'd attempt rebuilding the image once a month or so (via a cron job). TL;DR: this is really hard :D
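The periodic rebuild-and-rerun job suggested above can be sketched as a crontab entry like the following. The image name, repository path, and smoke-test command are all hypothetical placeholders, not from the thread:

```
# m h dom mon dow  command
# On the 1st of every month at 04:00: rebuild without cache, then run a quick smoke test.
0 4 1 * * docker build --no-cache -t youruser/your-compendium /path/to/repo && docker run --rm youruser/your-compendium Rscript -e 'sessionInfo()'
```

`--no-cache` matters here: it forces every layer to rebuild, so the job notices breakage in upstream dependencies that a cached build would silently paper over.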
Well said; I'm 💯 with Tim on this being a remarkably hard (and remarkably under-appreciated how hard) problem. Rocker's versioned tags (e.g. `rocker/binder:3.5.0`) are designed to help with exactly this kind of long-term stability.
Thanks to you two for taking the time to investigate and for your tips! So, if I understand you correctly, the best thing to do if I want long-term usability (which is what I want, as it is for a reproducible research compendium, so it might be relevant to somebody some months/years down the road) is to rebuild and try out my Docker image on Docker Hub regularly (ideally without needing a commit to the repo, as @cboettig describes). But how does this translate to mybinder.org? There, the image is built based on my Dockerfile, not on the image I provide at Docker Hub, correct? Also, how does it translate to reproducibility of the computational environment in general? I was hoping to "pin" the complete environment by using Docker. From what I understand, you are saying that you can't really pin everything, so a build now is always somehow different from a build in a year or so.
MyBinder and Rocker both try to pin things as well as possible, but nothing is perfect. E.g. R packages come from MRAN snapshots; Microsoft has done a great job keeping these (though it's hard to externally validate that everything in the snapshot comes from the date claimed), and the only failures I've seen are temporary server downtime. Of course MRAN could vanish in the future. System libraries are pinned by the Linux distro, but can get backported security patches. And of course some aspects of 'reproducibility' are contingent on hardware, clearly beyond the scope here. Right, I believe Binder looks at your Dockerfile and tries to build it, which is a good check on reproducibility. Of course, using "your own" Dockerfile from Binder's perspective means you've taken responsibility for ensuring (or not) that it's a stably reproducible build. Tim may have more insight on this, so I'd be curious on his take too.
Even doing something that is conceptually simple, like "pin all the things", turns out to be tricky to get right if you have a sufficiently large project (I'd say this is the fundamental reason this issue was created :) ). For example, we could pin everything to exactly the version we use. This would prevent accidental breakage, probably, once you find everything and pin it all (which takes time, because you won't notice the one thing you missed until a few months later, when it suddenly breaks). Now there is a bug fix in a package you were using. This means we need to decide whether to update (your result becomes more correct) or not (repeatability: we keep reproducing the result we know is incorrect, but it is the same as it has always been).

A lot of this is only hard because humans are building software and make mistakes in the process. If you rely on only two other things, chances are you won't be caught up in a mistake. However, if you depend on a large stack (everything from matplotlib to the Linux kernel via some Docker magic containerisation stuff), I bet you will be the victim of a mistake made somewhere :)

Hence, I would set up a monthly (or so) rebuild and re-run job. It costs nearly nothing, and at least I get a timely notification when something breaks. The hypothesis being that fixing it close to when it breaks is much easier than trying to fix the accumulation of all breakages over 12 or 24 months. Or you decide that "nope, we won't fix this, it is OK that it is now broken."
As a physicist, I assume "spherical cow in a vacuum without friction". Everything is nice and easy to calculate. In reality, cows are a weird shape, there is friction and an atmosphere. Now something that was a nice simple problem you could solve with pen and paper has turned into something requiring complicated numerical approximations. I see reproducibility a bit like that. In theory it should be simple; in practice there are so many factors that make it more complicated than you first thought :-/
Tim, I think this is great, but maybe overstating the goal slightly. As you note, the real catch in this scenario is wanting to update some part of your stack to a newer version that you didn't actually use, because perhaps some bug was fixed in your software and you want to see if it changed your result. That's an important use case, but it is also very distinct from the use case of "wanting to reproduce your original results in the original environment, bugs and all".
Thank you two so much for your insights!!
@cboettig At least for the R side of things, that is solvable, correct (at least assuming that MRAN works reliably)? All packages that are installed into the Docker image are installed from an MRAN snapshot that is fixed to a specific date by the base image. If one wants to install newer versions of specific packages, I found that there is the risk that @betatim describes if one just installs from current CRAN: the dependencies come in at whatever version is newest at build time. But if one instead specifies a specific MRAN snapshot that a package should be installed from, the installed dependencies should also be reliably installed from that snapshot, correct? For example, by pointing the repository argument of the install command at a dated snapshot URL. For everything outside of R (including non-R dependencies of R packages), there is less control, though, as you both point out.
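A minimal Dockerfile sketch of this snapshot-pinning approach; the package name and snapshot date below are illustrative, not from the thread:

```dockerfile
# Versioned base image: R version and default MRAN snapshot are fixed.
FROM rocker/binder:3.5.0

# Install an extra package from an explicit MRAN snapshot, so that the package
# AND its dependencies resolve against the same dated snapshot, not live CRAN.
# 'here' and the date are placeholders; pick the snapshot your project needs.
RUN install2.r --error \
      --repos https://mran.microsoft.com/snapshot/2019-01-07 \
      here
```

`install2.r` ships with the rocker images (via littler); the equivalent in plain R is `install.packages("here", repos = "https://mran.microsoft.com/snapshot/2019-01-07")`.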
@januz yes, the MRAN snapshots are a convenient way to pin versions (you don't need to specify a version when installing from MRAN, since the 'latest' version is already fixed by the snapshot date). Both the versioned rocker images and the standard Binder R configuration use this MRAN snapshot setup. For system libraries installed by apt-get, things are relatively stable on the rocker-versioned stack as well, since these are always installed from the same release. (Technically these can change in minor ways due to security updates, but the basic version is fixed. Most Linux distros work more like Bioconductor than CRAN: all software in the distro is effectively pinned at a version for the lifespan of that distribution.) Not trying to deter discussion, but I'm going to mark this as closed, since I believe the OP's question is resolved with the re-triggered builds.
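On the system-library side, the usual pattern on a versioned rocker base looks like the following sketch. The library chosen is illustrative; its version is effectively pinned by the Debian release underlying the base image, as described above:

```dockerfile
FROM rocker/binder:3.5.0

# libxml2-dev stands in for any non-R system dependency of an R package.
# apt pulls it from the Debian release the base image is built on, so the
# major version is fixed for the lifetime of that release (modulo security patches).
RUN apt-get update \
    && apt-get install -y --no-install-recommends libxml2-dev \
    && rm -rf /var/lib/apt/lists/*
```

Cleaning the apt lists in the same layer keeps the image small without affecting which versions get installed.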
I had problems running Docker containers that use

`FROM rocker/binder:3.5.0`

on mybinder.org. The interactive RStudio session wouldn't open, reporting "500 : Internal Server Error". Interestingly, the same images used to run at the end of last week.

With the help of @betatim (see jupyterhub/binderhub#753), I figured out that if I use another rocker image and add the code from the rocker/binder Dockerfile to my own Dockerfile, the container runs successfully on mybinder.org again. Apparently, when using `FROM rocker/binder`, an older version of `nbrsessionproxy` is installed that still has a bug that leads to the aforementioned error.

As there has been one commit to the rocker/binder repo that falls in between the image working and not working, I assume that this commit is at the core of the issue.
I made two repos for you to check for yourself:
https://github.com/januz/binder-fails
https://github.com/januz/binder-works