
Analyze all Jupyter notebooks mentioned in PubMed Central #25

Closed
Daniel-Mietchen opened this issue Feb 26, 2017 · 99 comments

@Daniel-Mietchen
Collaborator

Daniel-Mietchen commented Feb 26, 2017

Jupyter notebooks are a popular vehicle these days to share data science workflows. To get an idea of best practices in this regard, it would be good to analyze a good number of them in terms of their reproducibility and other aspects of usability (e.g. documentation, ease of reuse).

A search in PubMed Central (PMC) reveals the following results:

With currently just 102 hits, a systematic reanalysis seems entirely doable and could perhaps itself be documented by way of reproducible notebooks that might eventually end up being mentioned in PMC.

A good starting point here could be An open RNA-Seq data analysis pipeline tutorial with an example of reprocessing data from a recent Zika virus study, for which both a Jupyter notebook and a Docker image are available.

I plan to give a lightning talk on this. Some background is in this recent news piece.

@rossmounce

A few notes...

With EuropePMC a search for ipynb OR jupyter gives 107 results:
http://europepmc.org/search?query=jupyter%20OR%20ipynb

I find it extremely interesting to note that EuropePMC has the full text for 102 of these 107 articles/preprints:

(jupyter OR ipynb) AND (HAS_FT:Y)

This suggests that Jupyter/IPython notebooks are almost exclusively associated with open-friendly journals. Or perhaps it reflects a bias caused by the legally enforced inability to run full-text searches on 'closed' journals, where jupyter/ipynb might be mentioned but can't be found because EuropePMC is not allowed to index the full text.
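For anyone wanting to re-run that count, the same query can be issued against the EuropePMC REST API. A minimal sketch in Python (the endpoint and the query/format/pageSize parameters are from EuropePMC's public REST documentation; the helper name is ours):

```python
from urllib.parse import quote

def europepmc_search_url(query, page_size=100):
    """Build a EuropePMC REST search URL for the given query string."""
    base = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"
    return f"{base}?query={quote(query)}&format=json&pageSize={page_size}"

# The full-text-only variant of the search discussed above:
print(europepmc_search_url("(jupyter OR ipynb) AND (HAS_FT:Y)"))
```

Fetching the resulting URL returns a JSON payload whose hitCount field gives the number of matches.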

@rossmounce

R code to get bibliographic metadata on each of those 107 hits from EuropePMC:

install.packages('europepmc')
library(europepmc)
# plain-text query; synonym expansion and the result limit are separate arguments
hits <- epmc_search(query = 'jupyter OR ipynb', synonym = TRUE, limit = 200)
dim(hits)
names(hits)
write.csv(hits, file = "107hits.csv")

I've also made available the resulting CSV as an editable spreadsheet via GDocs:
https://docs.google.com/spreadsheets/d/1txg0u9zARHrLkY4MYuz5vmsVCZUOItgybEHqx13Bkbc/edit?usp=sharing

Perhaps with this sheet we can assign who takes responsibility for which papers?

@Daniel-Mietchen
Collaborator Author

That's a great starting point — thanks!

@npscience
Contributor

+1 from me. Interested to contribute and to see the output.

@Daniel-Mietchen
Collaborator Author

We've taken Ross' spreadsheet and added some columns for

  • Notebook URL
  • Code in problem cell
  • Problem

The "Code in problem cell" column documents the notebook code causing the first problem, and the "Problem" column gives more details. So far, basically none of the notebooks ran through: we normally stopped after the first such error and moved on to the next notebook, but for one rather complex notebook, we tried to go through to the end, which we have not reached yet.

@Daniel-Mietchen
Collaborator Author

I've also added a column for the PMC URL to reduce the fiddling with URLs.

@Daniel-Mietchen
Collaborator Author

I notified the Jupyter mailing list: https://groups.google.com/forum/#!topic/jupyter/6pQIarRmrsc .

@mrw34

mrw34 commented Mar 5, 2017

Here's a write-up of our efforts:

https://markwoodbridge.com/2017/03/05/jupyter-reproducible-science.html

Many thanks to @Daniel-Mietchen for the original idea, and for all the help over the weekend!

@Daniel-Mietchen
Collaborator Author

@mrw34 Thanks - I'll dive right into it.

@Daniel-Mietchen
Collaborator Author

I found one that actually ran through, albeit after a warning about an old kernel:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4940747/bin/13742_2016_135_MOESM3_ESM.ipynb . A very simple notebook to test a random number generator, but hey, it works!

To celebrate the event, I introduced color coding to the spreadsheet: red for cases where the run resulted in an error, green when it did not.

@Daniel-Mietchen
Collaborator Author

Here's a notebook shared only as a screenshot, from a paper about reproducibility:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5014984/figure/figure1/ .

Just added yellow to the spreadsheet for cases like this, where the notebook neither produced errors nor ran through, e.g. because there is no actual notebook to run.

@Daniel-Mietchen
Collaborator Author

There is a nice "Ten simple rules" series in PLOS Computational Biology:
http://collections.plos.org/ten-simple-rules . Perhaps we should do one on "how to share Jupyter notebooks"?

They already have Ten Simple Rules for Reproducible Computational Research and Ten Simple Rules for Cultivating Open Science and Collaborative R&D as well as other somewhat related articles, but none of them seem to touch upon Jupyter notebooks.

@Daniel-Mietchen
Collaborator Author

Some comments relevant for here are also in #41 (comment) .

@Daniel-Mietchen
Collaborator Author

The above close was just part of the wrap-up of the doathon. I will keep working on this and document my progress over at Daniel-Mietchen/ideas#2 .

@rossmounce

@mrw34 @Daniel-Mietchen excellent write-up!

If licensing allows could you upload somewhere all the .ipynb notebooks you found that were related to those 107 papers?

@Daniel-Mietchen
Collaborator Author

@rossmounce I plan to do that but haven't yet checked for licensing (added column AH for that).

The notebook URLs are in column AD, which currently has the following list:

@Daniel-Mietchen
Collaborator Author

There is a validator tool for Jupyter notebooks: https://github.com/jupyter/nbformat/blob/master/nbformat/validator.py
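To illustrate the kind of checks such a validator performs, here is a deliberately simplified, stdlib-only sketch that tests the basic structure of a notebook file; the real nbformat validator does full JSON-schema validation against the notebook format spec:

```python
import json

REQUIRED_TOP_LEVEL = {"cells", "metadata", "nbformat", "nbformat_minor"}

def check_notebook_structure(path):
    """Raise ValueError if the file lacks the basic shape of a notebook."""
    with open(path, encoding="utf-8") as fp:
        nb = json.load(fp)
    missing = REQUIRED_TOP_LEVEL - nb.keys()
    if missing:
        raise ValueError(f"missing top-level keys: {sorted(missing)}")
    for i, cell in enumerate(nb["cells"]):
        if "cell_type" not in cell or "source" not in cell:
            raise ValueError(f"cell {i} lacks cell_type/source")
    return True
```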

@Daniel-Mietchen
Collaborator Author

Here is a discussion about using Jupyter notebooks programmatically, with a useful demo.

@Daniel-Mietchen
Collaborator Author

I am thinking of submitting this to JupyterCon — submission deadline March 14. Anyone in?

@rossmounce

@Daniel-Mietchen I'd be happy to help you prepare the abstract submission and do a bit more analysis, but I can't go to the meeting :) Does that count as 'in'?

@Daniel-Mietchen
Collaborator Author

That's "in enough" for my taste. Don't know whether I can go either, but the point is to reach out to the Jupyter community and to help do something about these issues, e.g. by refining the recommendations and perhaps offering some validation mechanism (think Schematron for XML).

@Daniel-Mietchen
Collaborator Author

Daniel-Mietchen commented May 10, 2017

There is a JupyterCon talk about citing Jupyter notebooks. I have contacted the speakers.

@mpacer

mpacer commented May 22, 2017

I'm not going to read all of this cause it's long. But this is a cool idea and a neat dataset. @eseiver mentioned you wanted to know how to read in notebooks; for that you can use nbformat, specifically nbformat.read(), which you should probably call inside a context manager.

@Daniel-Mietchen
Collaborator Author

At the WikiCite 2017 hackathon today, we made some further progress in terms of making this analysis itself more reproducible — a Jupyter notebook that runs the Jupyter notebooks listed in our Google spreadsheet and spits out the first error message: http://paws-public.wmflabs.org/paws-public/995/WikiCite%20notebook%20validator.ipynb . @mpacer - yes, it makes use of nbformat.read()

We also looked at Jupyter notebooks cited from Wikipedia — notes at https://meta.wikimedia.org/wiki/WikiCite_2017/Jupyter_notebooks_on_Wikimedia_sites .
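The core loop of such a notebook-running notebook can be sketched with the standard library alone (the actual notebook uses nbformat.read() and a real kernel; this simplified stand-in just exec()s the code cells in a shared namespace and reports the first error):

```python
import json

def first_error(notebook_path):
    """Run a notebook's code cells top to bottom; return (cell_index, message)
    for the first failure, or None if every cell runs through."""
    with open(notebook_path, encoding="utf-8") as fp:
        nb = json.load(fp)
    namespace = {}
    for i, cell in enumerate(nb.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = "".join(cell.get("source", []))  # source may be a list of lines
        try:
            exec(compile(source, f"<cell {i}>", "exec"), namespace)
        except Exception as exc:
            return i, f"{type(exc).__name__}: {exc}"
    return None
```

Running cells in-process like this skips kernel metadata, magics, and shell escapes, so it only approximates a real execution, but it is enough to surface the typical first failure (a missing import or file).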

@yuvipanda

Hello! I'm part of the team that's working on beta.mybinder.org and related stuff, and am extremely interested in the idea of a 'badge that represents that your code is reproducible, and has been reproduced by CI'. Funnily, I also built the PAWS stuff, which is awesome to find in completely unrelated contexts :D

The part of the stack I've been attacking right now is the 'how do we reproduce the environment that the analysis took place in', as part of the mybinder work. You can see the project used for that here: https://github.com/jupyter/repo2docker. It takes a git repository and converts it into a Docker image, using conventions that should be easy to use for most people (and does not require them to understand or use Docker unless they want to). It's what powers the building bits of mybinder :)

As part of the CI for that project, you can see that we also build and validate some external repositories that are popular! We just represent these as YAML files here: https://github.com/jupyter/repo2docker/tree/master/tests/external and have them auto test on push so we make sure we can keep building them. This can be inverted too - in the repo's CI they can use repo2docker to make sure their changes don't break the build.

The part where we haven't made much progress yet is in actual validation. nbval mentioned here seems to be the one I like most - it integrates into pytest! We can possibly integrate repo2docker into pytest too, and use that to easily validate repos? Lots of possible avenues to work towards :)

One of the things I'd love to have is something like what https://www.ssllabs.com/ssltest/analyze.html?d=beta.mybinder.org does for HTTPS on websites - scores you on a bunch of factors, with clear ways of improving it. Doing something like that for git repos with notebooks would be great, and I believe we can do a fair amount of work towards it now.

I'll also be at JupyterCon giving a few talks, and would love to meet up if any of you are going to be there!

/ccing @choldgraf who also does a lot of these things with me :)

@danielskatz

Hi!

@Daniel-Mietchen pointed me at this thread/project yesterday, and it seems quite interesting.

I wonder if it makes sense to think about short term and long term reproducibility for notebooks?

By short term, I mean that the notebook might depend on a python package that has to be installed, which could be done by pip before running the notebook, and this step could be automated by a notebook launcher perhaps.
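That short-term step could look something like the following hypothetical launcher helper, which turns a requirements file into the pip command to run before executing the notebook (the function name and interface are ours, not from any existing tool):

```python
def pip_install_command(requirements_text, python="python"):
    """Turn the contents of a requirements.txt into the command a notebook
    launcher could run before executing the notebook itself."""
    reqs = []
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if line:
            reqs.append(line)
    return [python, "-m", "pip", "install", *reqs]
```

Note this only pins today's package versions; it does nothing about the long-term case where pip itself or the package index has moved on.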

And long term meaning that at some point, the dependency will not work, pip will be replaced by something new, etc., and the only way to solve this is to capture the full environment. This seems similar to what @yuvipanda describes, and what @TanuMalik is trying to do a bit differently in https://arxiv.org/abs/1707.05731 (though I don't think her code is available). And long term here might still have OS requirements, so maybe I really mean medium term.

Also, I thought I would cc some other people who I think will be interested in this topic, and could perhaps point to other work done in the context of making notebooks reproducible: @fperez @labarba @jennybc @katyhuff

@ThomasA

ThomasA commented Aug 8, 2017

@khinsen - sorry I lost track of this thread back in March... Yes, I think @Chroxvi's reproducibility can be carved out of the Magni package and I actually hope to do that sometime this fall. I hope to do that along with @ppeder08's validation (https://github.com/SIP-AAU/Magni/tree/master/magni/utils/validation) which can be used for in-/output-validation of for example somewhat more abstract "data types" than Python's built-in types.

@TanuMalik

TanuMalik commented Aug 8, 2017 via email

@Daniel-Mietchen
Collaborator Author

Thanks for the additional comments. I have proposed to work on this further during the Wikimania hackathon: https://phabricator.wikimedia.org/T172848 .

@Daniel-Mietchen
Collaborator Author

I got sick during the hackathon and haven't fully recovered, but JupyterCon is just days away, so I have started to distill the discussion here into an outline for the talk next Friday:
https://github.com/Daniel-Mietchen/events/blob/master/JupyterCon-2017.md#outline-of-the-talk . Will work on it from Tuesday onwards, and your contributions to this are as always welcome.

@RomanGurinovich

RomanGurinovich commented Aug 25, 2017

After chatting with @Daniel-Mietchen about this idea, we've implemented a web app to autorun the notebooks mentioned in a paper.

exe.sci.ai

Just add a list of paper URLs, like
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5322252/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4965465/
and run the executability validator.

It is a pre-pre-pre-alpha version done for fun and in the name of reproducibility. Please, report all the issues and suggest improvements.

The current setup might require additional horsepower to consume bigger datasets. We also plan to implement whole-repo autodeployment; at the moment, too many runs fail for lack of this feature.

List of current issues:

https://github.com/sciAI/exe/blob/master/Executability%20issues.md

Validator code:

https://github.com/sciAI/exe

All credit goes to the sci.AI team and especially @AlexanderPashuk. Alex, thank you for the effort and the fights with library compatibility.

@RomanGurinovich

@yuvipanda, nice job. High five! Dan mentioned you are at the conference now, right? If you are interested, we can combine efforts.

@choldgraf

We're both at the conference now and will be at the hack sessions tomorrow!

@Daniel-Mietchen
Collaborator Author

I was traveling during the hackathon - had heard about it too late. In any case, I hope we can combine efforts with the Binder team. For those interested, the talk sits at
https://github.com/Daniel-Mietchen/events/blob/master/JupyterCon-2017.md#outline-of-the-talk , and the video shall hopefully become available in a few weeks.

@Daniel-Mietchen
Collaborator Author

Seen at JupyterCon:
https://github.com/jupyter/repo2docker , a tool that can dockerize a git repo and provide a Jupyter notebook to explore the container.

@cgpu

cgpu commented Nov 17, 2019

> Hello! I'm part of the team that's working on beta.mybinder.org and related stuff, and am extremely interested in the idea of a 'badge that represents that your code is reproducible, and has been reproduced by CI'. Funnily, I also built the PAWS stuff, which is awesome to find in completely unrelated contexts :D

Apologies for reviving a closed issue. I am also interested in the reproducibility badge (which is not necessarily the same as the Binder badge). I came across @mwoodbri's jupyter-ci. Are there any plans for this to be used, or people who currently use it?

(Also cc'ing @yuvipanda as he has been involved in repo2docker.)

@mwoodbri

@cgpu Hi! At the time jupyter-ci was an answer to "what's the simplest possible way to generate a badge and notify of validation failure?". Simply porting the project to GitHub and GitHub Actions (updating to latest Jupyter best-practice in the process, if necessary) would be a great start. But a more general solution involving repo2docker and/or Binder would be even better!

@cgpu

cgpu commented Nov 18, 2019

> @cgpu Hi! At the time jupyter-ci was an answer to "what's the simplest possible way to generate a badge and notify of validation failure?". Simply porting the project to GitHub and GitHub Actions (updating to latest Jupyter best-practice in the process, if necessary) would be a great start. But a more general solution involving repo2docker and/or Binder would be even better!

Hi @mwoodbri, thank you for the prompt response and the background information! I am really fond of the idea of having a binary "does it reproduce" badge of honour, and jupyter-ci sounds exactly like that. I am trying to reproduce publications accompanied by .ipynb files as part of a workshop exercise, and I am really struggling. It would be nice to know beforehand which ones to invest time in, hence the interest. Additionally, it would be nice to have/create a hub or awesome-style repo listing only the ones that are reproducible (verified by CI).

jupyter-ci is currently only on GitLab. I know it's the same in essence, but the community is much more active on GitHub, so if you plan to maintain the repo and the idea long term, it would be nice to bring it over here. Just a thought.

Thanks once again!

@mwoodbri

@cgpu Here's a version converted to use GitHub Actions: https://github.com/mwoodbri/jupyter-ci

@cgpu

cgpu commented Nov 20, 2019

> @cgpu Here's a version converted to use GitHub Actions: https://github.com/mwoodbri/jupyter-ci

@mwoodbri thank you! Time for me to test now :)

@Daniel-Mietchen thank you for providing the reproducibility cafe space for further discussions 2 years after the start of the initiative. Feel free to close this.

@Daniel-Mietchen
Collaborator Author

Hello everyone. It has been a while since the last post in this thread, but I am happy to report that there is now a preprint that reports on a reproducibility analysis of the Jupyter notebooks associated with publications available via PubMed Central: Computational reproducibility of Jupyter notebooks from biomedical publications — joint work with @Sheeba-Samuel . Here is the abstract:

Jupyter notebooks allow to bundle executable code with its documentation and output in one interactive environment, and they represent a popular mechanism to document and share computational workflows, including for research publications. Here, we analyze the computational reproducibility of 9625 Jupyter notebooks from 1117 GitHub repositories associated with 1419 publications indexed in the biomedical literature repository PubMed Central. 8160 of these were written in Python, including 4169 that had their dependencies declared in standard requirement files and that we attempted to re-run automatically. For 2684 of these, all declared dependencies could be installed successfully, and we re-ran them to assess reproducibility. Of these, 396 notebooks ran through without any errors, including 245 that produced results identical to those reported in the original. Running the other notebooks resulted in exceptions. We zoom in on common problems and practices, highlight trends and discuss potential improvements to Jupyter-related workflows associated with biomedical publications.

For data and code, see https://doi.org/10.5281/zenodo.6802158 .

I'll keep this thread open until the paper is formally published, and invite your comments in the meantime. Extra pings to some of you who have contributed to this thread before: @mwoodbri @khinsen @yuvipanda @rossmounce @tompollard @RomanGurinovich @choldgraf @JosephMcArthur .

@fperez

fperez commented Sep 15, 2022

Congrats @Daniel-Mietchen, excellent! I look forward to reading the paper, and will be sure to include it in my reading list for next year's reproducible research course I teach at Berkeley!

cc @facusapienza21.

@leipzig

leipzig commented Sep 15, 2022

Lots of Twitter interest in this preprint. It might be good to dig a bit deeper into all those unknown dependency-resolution issues, or perhaps just feature a couple of examples as a panel. There were some comments regarding Docker.

@khinsen

khinsen commented Sep 16, 2022

Thanks @Daniel-Mietchen for the update! The preprint is on my e-book reader.

@yuvipanda

Holy shit this is AWESOME!

@Daniel-Mietchen
Collaborator Author

We're still working on the revision of the paper but here are the slides of our JupyterCon 2023 talk: https://doi.org/10.5281/zenodo.7854503 .

Slide 23 — How you can get involved — asks for community input along a number of dimensions, which I am copying here.

  • How shall we share the dataset?
  • How would you use it?
    • Which variables (article/ repo/ notebook etc.) to look at?
    • Research questions
    • Teaching/ learning opportunities
  • What would be useful follow-ups?
  • Anyone interested in helping turn this into a service? NumFOCUS?
  • Who wants to do this for other languages?
  • Would it be useful to publish reports about computational reproducibility?
  • Would you like to publish such reports?

@Daniel-Mietchen
Collaborator Author

We recently submitted the revision of the paper — see https://arxiv.org/abs/2308.07333 for the latest preprint, which describes a complete re-run of our pipeline and provides some more contextualization. In the discussion, we also briefly touch upon scaling issues with such reproducibility studies, mentioning ReScience (pinging @khinsen) as an example. We are keen on putting this dataset to use, e.g. in educational settings (cc @fperez ).

As always, comments etc. are welcome.

@Daniel-Mietchen
Collaborator Author

Dear all,
thanks for your participation in this thread.

The paper on this (with @Sheeba-Samuel) was published yesterday: Computational reproducibility of Jupyter notebooks from biomedical publications, https://doi.org/10.1093/gigascience/giad113 .

We remain interested in

  1. ways to integrate our data with research and education workflows around reproducibility
  2. collaborating on similar analyses for
     • other research fields
     • languages other than Python
     • other types of notebooks
  3. exploring how such automated approaches can be integrated with peer review workflows

@mwoodbri @fperez @khinsen @yuvipanda @rossmounce @tompollard @RomanGurinovich @choldgraf @JosephMcArthur .

With that, I am closing this ticket after nearly 7 years - feel free to open up new ones in relevant places to discuss potential follow-ups.

@rossmounce

Extremely impressive feat! Well done @Sheeba-Samuel & @Daniel-Mietchen !
