Analyze all Jupyter notebooks mentioned in PubMed Central #25

Open
Daniel-Mietchen opened this Issue Feb 26, 2017 · 85 comments

@Daniel-Mietchen
Collaborator

Daniel-Mietchen commented Feb 26, 2017

Jupyter notebooks are a popular vehicle these days for sharing data science workflows. To get an idea of best practices in this regard, it would be good to analyze a good number of them in terms of their reproducibility and other aspects of usability (e.g. documentation, ease of reuse).

A search in PubMed Central (PMC) currently yields just 102 hits, so a systematic reanalysis seems entirely doable and could perhaps itself be documented by way of reproducible notebooks that might eventually end up being mentioned in PMC.

A good starting point here could be An open RNA-Seq data analysis pipeline tutorial with an example of reprocessing data from a recent Zika virus study, for which both a Jupyter notebook and a Docker image are available.

I plan to give a lightning talk on this. Some background is in this recent news piece.

@rossmounce

rossmounce commented Mar 3, 2017

A few notes...

With EuropePMC a search for ipynb OR jupyter gives 107 results:
http://europepmc.org/search?query=jupyter%20OR%20ipynb

I find it extremely interesting to note that EuropePMC has the full text for 102 of these 107 articles/preprints:

(jupyter OR ipynb) AND (HAS_FT:Y)

This suggests that jupyter/IPython notebooks are almost exclusively associated with open-friendly journals(?). Or perhaps it is a bias caused by the legally enforced inability to do full-text searches on 'closed' journals, where jupyter/ipynb might be mentioned but can't be found by EuropePMC because it is not allowed to index them.
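
For anyone who wants to reproduce these counts programmatically, here is a minimal Python sketch against the Europe PMC REST API (the search endpoint and its hitCount field are as documented by Europe PMC; the exact numbers will have drifted since 2017):

import requests

BASE = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def hit_count(query):
    # Ask for a single record per page; we only need the total hit count.
    resp = requests.get(BASE, params={"query": query, "format": "json", "pageSize": 1})
    resp.raise_for_status()
    return resp.json()["hitCount"]

print(hit_count("jupyter OR ipynb"))                   # all mentions
print(hit_count("(jupyter OR ipynb) AND (HAS_FT:Y)"))  # only records with full text available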

@rossmounce

rossmounce commented Mar 3, 2017

R code to get bibliographic metadata on each of those 107 hits from EuropePMC:

install.packages('europepmc')  # Europe PMC client for R
library(europepmc)
# search for mentions of Jupyter/ipynb; limit=200 is comfortably above the ~107 expected hits
hits <- epmc_search(query='jupyter%20OR%20ipynb&synonym=TRUE', limit=200)
dim(hits)       # quick sanity check on the result data frame
names(hits)     # see which metadata fields are returned
write.csv(hits, file="107hits.csv")  # export for the shared spreadsheet

I've also made available the resulting CSV as an editable spreadsheet via GDocs:
https://docs.google.com/spreadsheets/d/1txg0u9zARHrLkY4MYuz5vmsVCZUOItgybEHqx13Bkbc/edit?usp=sharing

Perhaps with this sheet we can assign who takes responsibility for which papers?

@Daniel-Mietchen

Collaborator

Daniel-Mietchen commented Mar 3, 2017

That's a great starting point — thanks!

@npscience

Contributor

npscience commented Mar 4, 2017

+1 from me. Interested to contribute and to see the output.

@Daniel-Mietchen

Collaborator

Daniel-Mietchen commented Mar 4, 2017

We've taken Ross' spreadsheet and added some columns for

  • Notebook URL
  • Code in problem cell
  • Problem

The "Code in problem cell" column documents the notebook code causing the first problem, and the "Problem" column gives more details. So far, basically none of the notebooks ran through: We normally stopped after the first such error and went on to the next notebook, but for one rather complex notebook, we tried to go through to the end, which we have not reached yet.

@Daniel-Mietchen

Collaborator

Daniel-Mietchen commented Mar 5, 2017

I've also added a column for the PMC URL to reduce the fiddling with URLs.

@Daniel-Mietchen

Collaborator

Daniel-Mietchen commented Mar 5, 2017

I notified the Jupyter mailing list: https://groups.google.com/forum/#!topic/jupyter/6pQIarRmrsc .

@mrw34

mrw34 commented Mar 5, 2017

Here's a write-up of our efforts:

https://markwoodbridge.com/2017/03/05/jupyter-reproducible-science.html

Many thanks to @Daniel-Mietchen for the original idea, and for all the help over the weekend!

@Daniel-Mietchen

Collaborator

Daniel-Mietchen commented Mar 5, 2017

@mrw34 Thanks - I'll go right into it.

@Daniel-Mietchen

Collaborator

Daniel-Mietchen commented Mar 5, 2017

I found one that actually ran through, albeit after a warning about an old kernel:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4940747/bin/13742_2016_135_MOESM3_ESM.ipynb . A very simple notebook to test a random number generator, but hey, it works!

To celebrate the event, I introduced color coding to the spreadsheet: red for cases where the run resulted in an error, green when it did not.

@Daniel-Mietchen

Collaborator

Daniel-Mietchen commented Mar 5, 2017

Here's a notebook shared only as a screenshot, from a paper about reproducibility:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5014984/figure/figure1/ .

Just added yellow to the spreadsheet for cases like this, where the notebook neither produced errors nor ran through, and where "n/a" (in the sense of there being no notebook at all) does not quite apply.

@Daniel-Mietchen

Collaborator

Daniel-Mietchen commented Mar 5, 2017

There is a nice "Ten simple rules" series in PLOS Computational Biology:
http://collections.plos.org/ten-simple-rules . Perhaps we should do one on "how to share Jupyter notebooks"?

They already have Ten Simple Rules for Reproducible Computational Research and Ten Simple Rules for Cultivating Open Science and Collaborative R&D as well as other somewhat related articles, but none of them seem to touch upon Jupyter notebooks.

@Daniel-Mietchen

Collaborator

Daniel-Mietchen commented Mar 5, 2017

Some comments relevant for here are also in #41 (comment) .

@Daniel-Mietchen

Collaborator

Daniel-Mietchen commented Mar 5, 2017

The close above was just part of the wrap-up of the doathon. I will keep working on this and document my progress over at Daniel-Mietchen/ideas#2 .

@rossmounce

rossmounce commented Mar 5, 2017

@mrw34 @Daniel-Mietchen excellent write-up!

If licensing allows could you upload somewhere all the .ipynb notebooks you found that were related to those 107 papers?

@Daniel-Mietchen

Collaborator

Daniel-Mietchen commented Mar 5, 2017

@rossmounce I plan to do that but haven't yet checked for licensing (added column AH for that).

The notebook URLs are in column AD, which currently has the following list:

@Daniel-Mietchen

Collaborator

Daniel-Mietchen commented Mar 6, 2017

@Daniel-Mietchen

Collaborator

Daniel-Mietchen commented Mar 6, 2017

There is a validator tool for Jupyter notebooks: https://github.com/jupyter/nbformat/blob/master/nbformat/validator.py
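
A minimal sketch of how that validator can be used from Python (the file name is a placeholder; note that this checks the notebook against the nbformat JSON schema, not whether it actually runs):

import nbformat

with open("example_notebook.ipynb") as f:
    nb = nbformat.read(f, as_version=4)

try:
    nbformat.validate(nb)
    print("Notebook conforms to the nbformat schema.")
except nbformat.ValidationError as err:
    print("Schema problem:", err)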

@Daniel-Mietchen

Collaborator

Daniel-Mietchen commented Mar 6, 2017

Here is a discussion about using Jupyter notebooks programmatically, with a useful demo.

@Daniel-Mietchen

Collaborator

Daniel-Mietchen commented Mar 6, 2017

I am thinking of submitting this to JupyterCon — submission deadline March 14. Anyone in?

@rossmounce

rossmounce commented Mar 6, 2017

@Daniel-Mietchen I'd be happy to help you prepare the abstract submission and do a bit more analysis, but I can't go to the meeting :) Does that count as 'in'?

@Daniel-Mietchen

Collaborator

Daniel-Mietchen commented Mar 6, 2017

That's "in enough" for my taste. Don't know whether I can go either, but the point is to reach out to the Jupyter community and to help do something about these issues, e.g. by refining the recommendations and perhaps offering some validation mechanism (think Schematron for XML).

@Daniel-Mietchen

Collaborator

Daniel-Mietchen commented Mar 23, 2017

Thanks, @khinsen and @ThomasA, for chiming in here. The pointers in your posts sent me off on a long reading chain this morning, which provided me with additional perspectives on these issues — thanks for that.

Your ActivePapers project looks interesting, and using something like this for some basic verification would probably be a good step forward — have you seen @mrw34's CI for Jupyter verification?

@ThomasA

ThomasA commented Mar 23, 2017

@khinsen a Ph.D. student of mine, @Chroxvi, has also implemented some functionality in our Magni package that sounds a bit similar: https://github.com/SIP-AAU/Magni/tree/master/magni/reproducibility - http://magni.readthedocs.io/en/latest/magni.reproducibility.html

@khinsen

khinsen commented Mar 23, 2017

@Daniel-Mietchen Yes, I have seen this and other CI-based approaches. They are great when applicable but frustrating when they aren't - as with all technology-based solutions. For example, @mrw34's approach requires all code dependencies to be pip-installable, all data dependencies to be downloadable or part of the repository, and the whole analysis to run within whatever resource limitations GitLab's CI imposes.

ActivePapers takes an inverse approach to CI: guarantee that if the computation runs to completion, it is also reproducible. This is particularly valuable for long-running computations. Of course ActivePapers has technological limitations as well, in particular the restriction to pure Python code.

The parts that could be useful outside of the full ActivePapers framework are the ones that restrict the modules one is allowed to import and the data files one is allowed to access.

@ThomasA Yes, the code by @Chroxvi explores a similar approach. From the docs it seems to be a bit specialized to your environment. Do you think it could be generalized (not relying on Magni, not relying on Conda)?

@Daniel-Mietchen

Collaborator

Daniel-Mietchen commented Mar 23, 2017

@khinsen One of the motivations behind our verification exercise here was to come up with recommendations for how to share software, and Jupyter notebooks in particular.

Having "all data dependencies to be downloadable or part of the repository" seems like a good recommendation to me, and being able to show that your a particular piece of software complies with that criterion is helpful. Yes, there are cases when such an approach is not applicable, but I still think such a recommendation would be a better recommendation than the lack of recommendations that we currently have.

As for pip-installability, this obviously makes sense only for Python dependencies, since Jupyter notebooks can contain code or dependencies in multiple other languages, and there are of course several Python package managers. Still, I think it would be good if more of the code shared through articles in PMC that is pip-installable in principle actually demonstrated this pip-installability. And if people intending to share their code (e.g. as per BjornFJohansson/pygenome#1) were made more aware of these issues and the tools available, the tools and their usage could be improved in response to community needs, e.g. beyond pip or GitLab.

As an aside, I just came across the "Repeatability in Computer Science" project (from 2015) over at http://reproducibility.cs.arizona.edu/ , which set a much lower bar for replicability than we did here but made similar observations: http://reproducibility.cs.arizona.edu/v2/index.html . I assume this is known to some people in this thread - just adding it here to keep all reproducibility-related information in this repo in one place. I know the thread is getting long and unwieldy, so I'm also thinking of branching it out somewhere, so as to provide a simpler way of getting an overview of where we stand.

@khinsen

khinsen commented Mar 24, 2017

@Daniel-Mietchen First of all, I didn't mean my list of restrictions of @mrw34's approach as a criticism. Every technology has limitations. If there's anything to criticize, it's that the restrictions are not listed explicitly, leaving the task of figuring them out to potential users.

Having all dependencies downloadable or packaged with the notebook is indeed a decent compromise given today's state of the art. Recommending it is OK in a domain where it is most probably applicable. The same goes for pip-installability, although I'd expect its applicability to be limited almost everywhere, given how many Python packages depend on C libraries.

The problem with downloadable data in the long run is that it requires baking a URL into the notebook. Five years from now, that URL is probably stale. More stable references, such as DOIs, don't permit direct downloading today. So today everyone has to choose between ease of replication and long-term availability. It isn't obvious that one or the other choice is to be preferred in general.

I live in a world where datasets are a few GB in size and processing requires a few hours using 10 to 100 processors on a parallel computer. These machines often have network restrictions that make downloading data from the Internet impossible. I mention this just to illustrate that no recommendations can ever be absolute - there is too much diversity in scientific computing.

@gbilder

gbilder commented Mar 27, 2017

Hi. Daniel pointed me at this thread.

When you say that DOIs "don't permit direct downloading today", I'm not sure if the word "permit" here is in reference to:

  • access control restrictions
  • the convention that Crossref/DataCite DOIs tend to resolve to a landing page instead of 'the thing itself'

Or both.

As far as I can tell, parties appear unanimous on the fact that, barring privacy/confidentiality issues, data should be made available openly. Access control issues should be minimal.

And DOIs would allow direct downloading today if, when possible and practical, those registering DOI metadata included links to direct downloads in their Crossref or DataCite metadata. At Crossref we are increasingly seeing publishers registering full text links in their metadata. For examples, see the text-mining links in these metadata records from PeerJ:

http://api.crossref.org/members/4443/works
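
To make that concrete, a small Python sketch that lists any full-text links registered in the Crossref metadata for a given DOI (the DOI below is a placeholder to replace; the "link" field is part of the Crossref REST API's works schema):

import requests

doi = "10.7717/peerj.XXXX"  # replace with a DOI of interest
work = requests.get("https://api.crossref.org/works/" + doi).json()["message"]

for link in work.get("link", []):
    # Each entry carries a URL plus hints such as content-type and intended-application
    # (e.g. "text-mining"), which is what would enable direct, automated downloads.
    print(link.get("intended-application"), link.get("content-type"), link["URL"])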

@khinsen

khinsen commented Mar 27, 2017

@gbilder My reference was to the landing-page issue. It's good to hear that people are discussing solutions, but as far as I know it remains impractical today to download a dataset automatically given a DOI.

@mrw34

mrw34 commented Mar 27, 2017

@khinsen @gbilder I guess best practice here involves assigning DOIs to datasets (and not just to their parent publication, if any), and resolving any ambiguity over whether/how the data itself can be automatically retrieved given the relevant DOI. Lots more on the latter here: https://jcheminf.springeropen.com/articles/10.1186/s13321-015-0081-7

@Daniel-Mietchen

Collaborator

Daniel-Mietchen commented Mar 29, 2017

I was just pinged about another validation tool:

@Daniel-Mietchen

Collaborator

Daniel-Mietchen commented Mar 30, 2017

Here's an interesting blog post on what can go wrong in terms of reproducibility (with a focus on R): http://blog.appsilondatascience.com/rstats/2017/03/28/reproducible-research-when-your-results-cant-be-reproduced.html .

Daniel-Mietchen added a commit to Daniel-Mietchen/events that referenced this issue Apr 22, 2017

@DCGenomics

DCGenomics commented May 1, 2017

A demo for the original issue: https://www.ncbi.nlm.nih.gov/pubmed/27583132 (click on the LinkOut link). Would adding the Jupyter and Docker links there be helpful?

@mrw34

mrw34 commented May 8, 2017

More on reproducibility, including a few Jupyter mentions: https://www.practicereproducibleresearch.org

@Daniel-Mietchen

Collaborator

Daniel-Mietchen commented May 8, 2017

@DCGenomics I think making the Jupyter, Docker, mybinder etc. versions of the code more discoverable is useful in principle, but conventional LinkOut (which is not shown by default) may not be the best mechanism to do this.

What I could imagine is a mechanism similar to the way images are currently being presented in PubMed, i.e. something that is shown by default if the paper comes with code shared in a standard fashion. That standard would have to be defined, though.

While necessary for reproducibility, discoverability alone is not sufficient, and this example paper highlights that, as explained in Mark's initial write-up.

@Daniel-Mietchen

Collaborator

Daniel-Mietchen commented May 8, 2017

I asked for input from the JATS4R community on how such things could/ought to be tagged in JATS.

@Daniel-Mietchen

Collaborator

Daniel-Mietchen commented May 10, 2017

There is a JupyterCon talk about citing Jupyter notebooks. I have contacted the speakers.

@mpacer

mpacer commented May 22, 2017

I'm not going to read all of this 'cause it's long. But this is a cool idea and a neat dataset. @eseiver mentioned you wanted to know how to read in notebooks; for that you can use nbformat, specifically nbformat.read(), which you should probably use inside a context manager.
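
A minimal sketch of that suggestion (the file name is just a placeholder):

import nbformat

# Read the notebook inside a context manager, as suggested above.
with open("analysis.ipynb", encoding="utf-8") as f:
    nb = nbformat.read(f, as_version=4)

code_cells = [c for c in nb.cells if c.cell_type == "code"]
print(len(nb.cells), "cells in total,", len(code_cells), "of them code cells")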

@Daniel-Mietchen

Collaborator

Daniel-Mietchen commented May 25, 2017

At the WikiCite 2017 hackathon today, we made some further progress in terms of making this analysis itself more reproducible — a Jupyter notebook that runs the Jupyter notebooks listed in our Google spreadsheet and spits out the first error message: http://paws-public.wmflabs.org/paws-public/995/WikiCite%20notebook%20validator.ipynb . @mpacer - yes, it makes use of nbformat.read()

We also looked at Jupyter notebooks cited from Wikipedia — notes at https://meta.wikimedia.org/wiki/WikiCite_2017/Jupyter_notebooks_on_Wikimedia_sites .

@yuvipanda

yuvipanda commented Jul 31, 2017

Hello! I'm part of the team that's working on beta.mybinder.org and related stuff, and am extremely interested in the idea of a 'badge that represents that your code is reproducible, and has been reproduced by CI'. Funnily, I also built the PAWS stuff, which is awesome to find in completely unrelated contexts :D

The part of the stack I've been attacking right now is the 'how do we reproduce the environment that the analysis took place in', as part of the mybinder work. You can see the project used for that here: https://github.com/jupyter/repo2docker. It takes a git repository and converts it into a Docker image, using conventions that should be easy to use for most people (and does not require them to understand or use Docker unless they want to). It's what powers the building bits of mybinder :)

As part of the CI for that project, you can see that we also build and validate some external repositories that are popular! We just represent these as YAML files here: https://github.com/jupyter/repo2docker/tree/master/tests/external and have them auto test on push so we make sure we can keep building them. This can be inverted too - in the repo's CI they can use repo2docker to make sure their changes don't break the build.

The part where we haven't made much progress yet is in actual validation. nbval mentioned here seems to be the one I like most - it integrates into pytest! We can possibly integrate repo2docker into pytest too, and use that to easily validate repos? Lots of possible avenues to work towards :)

One of the things I'd love to have is something like what https://www.ssllabs.com/ssltest/analyze.html?d=beta.mybinder.org does for HTTPS on websites - scores you on a bunch of factors, with clear ways of improving it. Doing something like that for git repos with notebooks would be great, and I believe we can do a fair amount of work towards it now.

I'll also be at JupyterCon giving a few talks, and would love to meet up if any of you are going to be there!

/ccing @choldgraf who also does a lot of these things with me :)

@danielskatz

danielskatz commented Aug 4, 2017

Hi!

@Daniel-Mietchen pointed me at this thread/project yesterday, and it seems quite interesting.

I wonder if it makes sense to think about short term and long term reproducibility for notebooks?

By short term, I mean that the notebook might depend on a Python package that has to be installed, which could be done with pip before running the notebook; this step could perhaps be automated by a notebook launcher.

And by long term, I mean that at some point the dependency will not work, pip will be replaced by something new, etc., and the only way to solve this is to capture the full environment. This seems similar to what @yuvipanda describes, and to what @TanuMalik is trying to do a bit differently in https://arxiv.org/abs/1707.05731 (though I don't think her code is available). And "long term" here might still have OS requirements, so maybe I really mean medium term.
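
As a small illustration of the "capture the environment" point, here is a sketch that records the Python interpreter and installed package versions next to a notebook (this captures only the Python side; OS-level and non-Python dependencies would still be missing):

import json
import sys
import pkg_resources

snapshot = {
    "python": sys.version,
    "packages": sorted("%s==%s" % (d.project_name, d.version) for d in pkg_resources.working_set),
}

# Write the snapshot alongside the notebook so a later reader knows what it ran against.
with open("environment-snapshot.json", "w") as f:
    json.dump(snapshot, f, indent=2)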

Also, I thought I would cc some other people who I think will be interested in this topic, and could perhaps point to other work done in the context of making notebooks reproducible: @fperez @labarba @jennybc @katyhuff

@ThomasA

ThomasA commented Aug 8, 2017

@khinsen - sorry I lost track of this thread back in March... Yes, I think @Chroxvi's reproducibility functionality can be carved out of the Magni package, and I actually hope to do that sometime this fall. I hope to do it alongside @ppeder08's validation (https://github.com/SIP-AAU/Magni/tree/master/magni/utils/validation), which can be used for input/output validation of, for example, somewhat more abstract "data types" than Python's built-in types.

@TanuMalik

TanuMalik commented Aug 8, 2017

@Daniel-Mietchen

Collaborator

Daniel-Mietchen commented Aug 9, 2017

Thanks for the additional comments. I have proposed to work on this further during the Wikimania hackathon: https://phabricator.wikimedia.org/T172848 .

@Daniel-Mietchen

Collaborator

Daniel-Mietchen commented Aug 20, 2017

I got sick during the hackathon and haven't fully recovered, but JupyterCon is just days away, so I have started to distill the discussion here into an outline for the talk next Friday:
https://github.com/Daniel-Mietchen/events/blob/master/JupyterCon-2017.md#outline-of-the-talk . Will work on it from Tuesday onwards, and your contributions to this are as always welcome.

@RomanGurinovich

RomanGurinovich commented Aug 25, 2017

After chatting with @Daniel-Mietchen about this idea, we've implemented a web app to auto-run the notebooks mentioned in a paper.

exe.sci.ai

Just add a list of paper URLs, like
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5322252/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4965465/
and run the executability validator.

It is a pre-pre-pre-alpha version, done for fun and in the name of reproducibility. Please report any issues and suggest improvements.

The current setup might require additional horsepower to consume bigger datasets. We also plan to implement auto-deployment of whole repositories; at the moment, too many runs fail for lack of this feature.

List of current issues:

https://github.com/sciAI/exe/blob/master/Executability%20issues.md

Validator code:

https://github.com/sciAI/exe

All the credit goes to the sci.AI team and especially @AlexanderPashuk. Alex, thank you for the effort and for the fights with library compatibility.

@RomanGurinovich

RomanGurinovich commented Aug 25, 2017

@yuvipanda, nice job. High five! Dan mentioned you are at the conference now, right? If you are interested, we can combine efforts.

@choldgraf

choldgraf commented Aug 25, 2017

We're both at the conference now and will be at the hack sessions tomorrow!

@Daniel-Mietchen

Collaborator

Daniel-Mietchen commented Aug 27, 2017

I was traveling during the hackathon - I had heard about it too late. In any case, I hope we can combine efforts with the Binder team. For those interested, the talk sits at
https://github.com/Daniel-Mietchen/events/blob/master/JupyterCon-2017.md#outline-of-the-talk , and the video should hopefully become available in a few weeks.

@Daniel-Mietchen

Collaborator

Daniel-Mietchen commented Aug 29, 2017

Seen at JupyterCon:
https://github.com/jupyter/repo2docker , a tool that can dockerize a git repo and provide a Jupyter notebook to explore the container.
