
Analyze all Jupyter notebooks mentioned in PubMed Central #25

Open
Daniel-Mietchen opened this issue Feb 26, 2017 · 90 comments

Comments

@Daniel-Mietchen (Collaborator) commented Feb 26, 2017

Jupyter notebooks are a popular vehicle these days to share data science workflows. To get an idea of best practices in this regard, it would be good to analyze a good number of them in terms of their reproducibility and other aspects of usability (e.g. documentation, ease of reuse).

A search in PubMed Central (PMC) currently yields just 102 hits, so a systematic reanalysis seems entirely doable and could perhaps itself be documented by way of reproducible notebooks that might eventually end up being mentioned in PMC.

A good starting point here could be "An open RNA-Seq data analysis pipeline tutorial with an example of reprocessing data from a recent Zika virus study", for which both a Jupyter notebook and a Docker image are available.

I plan to give a lightning talk on this. Some background is in this recent news piece.

@rossmounce commented Mar 3, 2017

A few notes...

With EuropePMC, a search for ipynb OR jupyter gives 107 results:
http://europepmc.org/search?query=jupyter%20OR%20ipynb

I find it extremely interesting to note that EuropePMC has the full text for 102 of these 107 articles/preprints:

(jupyter OR ipynb) AND (HAS_FT:Y)

This suggests that Jupyter/IPython notebooks are almost exclusively associated with open-friendly journals(?). Or perhaps this is a bias caused by the legally enforced inability to run full-text searches on 'closed' journals, where jupyter/ipynb might be mentioned but can't be found because EuropePMC is not allowed to index them.
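For anyone who wants to reproduce the counts programmatically, here is a minimal sketch against the Europe PMC public REST API (the R package route is shown in the next comment):

import requests

# query Europe PMC's REST search endpoint for the same full-text-available set
resp = requests.get(
    "https://www.ebi.ac.uk/europepmc/webservices/rest/search",
    params={"query": "(jupyter OR ipynb) AND (HAS_FT:Y)", "format": "json"},
)
print(resp.json()["hitCount"])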

@rossmounce commented Mar 3, 2017

R code to get bibliographic metadata on each of those 107 hits from EuropePMC:

install.packages('europepmc')
library(europepmc)

# search Europe PMC for papers mentioning Jupyter or ipynb; note that
# synonym expansion is a separate argument, not part of the (unencoded)
# query string
hits <- epmc_search(query = 'jupyter OR ipynb', synonym = TRUE, limit = 200)
dim(hits)
names(hits)
write.csv(hits, file = "107hits.csv")

I've also made available the resulting CSV as an editable spreadsheet via GDocs:
https://docs.google.com/spreadsheets/d/1txg0u9zARHrLkY4MYuz5vmsVCZUOItgybEHqx13Bkbc/edit?usp=sharing

Perhaps with this sheet we can assign who takes responsibility for which papers?

@Daniel-Mietchen (Collaborator, Author) commented Mar 3, 2017

That's a great starting point — thanks!

@npscience (Contributor) commented Mar 4, 2017

+1 from me. Interested to contribute and to see the output.

@Daniel-Mietchen (Collaborator, Author) commented Mar 4, 2017

We've taken Ross' spreadsheet and added some columns for

  • Notebook URL
  • Code in problem cell
  • Problem

The "Code in problem cell" column documents the notebook code causing the first problem, and the "Problem" column gives more details. So far, basically none of the notebooks ran through: We normally stopped after the first such error and went on to the next notebook, but for one rather complex notebook, we tried to go through to the end, which we have not reached yet.

@Daniel-Mietchen (Collaborator, Author) commented Mar 5, 2017

I've also added a column for the PMC URL to reduce the fiddling with URLs.

@Daniel-Mietchen (Collaborator, Author) commented Mar 5, 2017

I notified the Jupyter mailing list: https://groups.google.com/forum/#!topic/jupyter/6pQIarRmrsc .

@mrw34 commented Mar 5, 2017

Here's a write-up of our efforts:

https://markwoodbridge.com/2017/03/05/jupyter-reproducible-science.html

Many thanks to @Daniel-Mietchen for the original idea, and for all the help over the weekend!

@Daniel-Mietchen (Collaborator, Author) commented Mar 5, 2017

@mrw34 Thanks - I'll go right into it.

@Daniel-Mietchen (Collaborator, Author) commented Mar 5, 2017

I found one that actually ran through, albeit after a warning about an old kernel:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4940747/bin/13742_2016_135_MOESM3_ESM.ipynb . A very simple notebook to test a random number generator, but hey, it works!

To celebrate the event, I introduced color coding to the spreadsheet: red for cases where the run resulted in an error, green when it did not.

@Daniel-Mietchen (Collaborator, Author) commented Mar 5, 2017

Here's a notebook shared only as a screenshot, from a paper about reproducibility:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5014984/figure/figure1/ .

Just added yellow to the spreadsheet for cases like this, where the notebook neither produced errors nor ran through, and where "n/a" does not apply in the sense that there is no notebook to run.

@Daniel-Mietchen (Collaborator, Author) commented Mar 5, 2017

There is a nice "Ten simple rules" series in PLOS Computational Biology:
http://collections.plos.org/ten-simple-rules . Perhaps we should do one on "how to share Jupyter notebooks"?

They already have Ten Simple Rules for Reproducible Computational Research and Ten Simple Rules for Cultivating Open Science and Collaborative R&D as well as other somewhat related articles, but none of them seem to touch upon Jupyter notebooks.

@Daniel-Mietchen (Collaborator, Author) commented Mar 5, 2017

Some comments relevant here are also in #41 (comment).

@Daniel-Mietchen (Collaborator, Author) commented Mar 5, 2017

The above close was just part of the wrap-up of the doathon. I will keep working on this and document my progress over at Daniel-Mietchen/ideas#2 .

@rossmounce commented Mar 5, 2017

@mrw34 @Daniel-Mietchen excellent write-up!

If licensing allows, could you upload all the .ipynb notebooks you found that relate to those 107 papers somewhere?

@Daniel-Mietchen (Collaborator, Author) commented Mar 5, 2017

@rossmounce I plan to do that but haven't yet checked for licensing (added column AH for that).

The notebook URLs are in column AD, which currently has the following list:

@Daniel-Mietchen (Collaborator, Author) commented Mar 6, 2017

@Daniel-Mietchen (Collaborator, Author) commented Mar 6, 2017

There is a validator tool for Jupyter notebooks: https://github.com/jupyter/nbformat/blob/master/nbformat/validator.py
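For anyone who wants to try it, here is a minimal sketch of schema validation with nbformat; the file name is a placeholder, and this only checks that the notebook matches the nbformat JSON schema, not that it runs:

import nbformat
from nbformat import validate, ValidationError

# read the notebook ("example.ipynb" is a placeholder path)
nb = nbformat.read("example.ipynb", as_version=4)
try:
    validate(nb)  # raises ValidationError if the notebook violates the schema
    print("Notebook is schema-valid")
except ValidationError as err:
    print("Schema violation:", err)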

@Daniel-Mietchen (Collaborator, Author) commented Mar 6, 2017

Here is a discussion about using Jupyter notebooks programmatically, with a useful demo.

@Daniel-Mietchen (Collaborator, Author) commented Mar 6, 2017

I am thinking of submitting this to JupyterCon — submission deadline March 14. Anyone in?

@rossmounce commented Mar 6, 2017

@Daniel-Mietchen I'd be happy to help you prepare the abstract submission and do a bit more analysis, but I can't go to the meeting :) Does that count as 'in'?

@Daniel-Mietchen (Collaborator, Author) commented Mar 6, 2017

That's "in enough" for my taste. Don't know whether I can go either, but the point is to reach out to the Jupyter community and to help do something about these issues, e.g. by refining the recommendations and perhaps offering some validation mechanism (think Schematron for XML).

@gbilder commented Mar 27, 2017

Hi. Daniel pointed me at this thread.

When you say that "DOIs, don't permit direct downloading today" I'm not sure if the word "permit" here is in reference to:

  • access control restrictions
  • the convention that Crossref/DataCite DOIs tend to resolve to a landing page instead of 'the thing itself'

Or both.

As far as I can tell, parties appear unanimous on the fact that, barring privacy/confidentiality issues, data should be made available openly. Access control issues should be minimal.

And DOIs would allow direct downloading today if, when possible and practical, those registering DOI metadata included links to direct downloads in their Crossref or DataCite metadata. At Crossref we are increasingly seeing publishers registering full text links in their metadata. For examples, see the text-mining links in these metadata records from PeerJ:

http://api.crossref.org/members/4443/works
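To make this concrete, here is a minimal sketch of pulling those full-text links out of a Crossref metadata record via the public REST API; the DOI below is just an illustrative placeholder:

import requests

# fetch the Crossref metadata record for one DOI
work = requests.get("https://api.crossref.org/works/10.7717/peerj.1").json()

# the "link" field, when registered, carries direct full-text URLs
for link in work["message"].get("link", []):
    print(link.get("intended-application"), link["URL"])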

@khinsen commented Mar 27, 2017

@gbilder My reference was to the landing-page issue. It's good to hear that people are discussing solutions, but as far as I know it remains impractical today to download a dataset automatically given a DOI.

@mrw34 commented Mar 27, 2017

@khinsen @gbilder I guess best practice here involves assigning DOIs to datasets (and not just their parent publication, if any), and resolving any ambiguity over whether/how the data itself can be automatically retrieved given the relevant DOI. Lots more on the latter here: https://jcheminf.springeropen.com/articles/10.1186/s13321-015-0081-7

@Daniel-Mietchen (Collaborator, Author) commented Mar 29, 2017

I was just pinged about another validation tool:

@Daniel-Mietchen (Collaborator, Author) commented Mar 30, 2017

Here's an interesting blog post on what can go wrong in terms of reproducibility (with a focus on R): http://blog.appsilondatascience.com/rstats/2017/03/28/reproducible-research-when-your-results-cant-be-reproduced.html .

Daniel-Mietchen added a commit to Daniel-Mietchen/events that referenced this issue Apr 22, 2017
@DCGenomics commented May 1, 2017

A demo for the original issue: https://www.ncbi.nlm.nih.gov/pubmed/27583132
If you click on the LinkOut link, would adding the Jupyter and Docker links here be helpful?

@mrw34 commented May 8, 2017

More on reproducibility, including a few Jupyter mentions: https://www.practicereproducibleresearch.org

@Daniel-Mietchen (Collaborator, Author) commented May 8, 2017

@DCGenomics I think making the Jupyter, Docker, mybinder etc. versions of the code more discoverable is useful in principle, but conventional LinkOut (which is not shown by default) may not be the best mechanism to do this.

What I could imagine is a mechanism similar to the way images are currently being presented in PubMed, i.e. something that is shown by default if the paper comes with code shared in a standard fashion. That standard would have to be defined, though.

While necessary for reproducibility, discoverability alone is not sufficient, and this example paper highlights that, as explained in Mark's initial write-up.

@Daniel-Mietchen (Collaborator, Author) commented May 8, 2017

I asked for input from the JATS4R community on how such things could/ought to be tagged in JATS.

@Daniel-Mietchen (Collaborator, Author) commented May 10, 2017

There is a JupyterCon talk about citing Jupyter notebooks. I have contacted the speakers.

@mpacer commented May 22, 2017

I'm not going to read all of this because it's long. But this is a cool idea and a neat dataset. @eseiver mentioned you wanted to know how to read in notebooks; for that you can use nbformat, specifically nbformat.read(), which you should probably call inside a context manager.
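A minimal sketch of that suggestion (the file name is a placeholder):

import nbformat

# open the file ourselves so it is reliably closed, and let nbformat
# normalize the notebook to format version 4
with open("analysis.ipynb", encoding="utf-8") as f:
    nb = nbformat.read(f, as_version=4)

print(len(nb.cells), "cells read")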

@Daniel-Mietchen (Collaborator, Author) commented May 25, 2017

At the WikiCite 2017 hackathon today, we made some further progress on making this analysis itself more reproducible: a Jupyter notebook that runs the Jupyter notebooks listed in our Google spreadsheet and spits out the first error message. See http://paws-public.wmflabs.org/paws-public/995/WikiCite%20notebook%20validator.ipynb . @mpacer: yes, it makes use of nbformat.read().
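The core of such a validator can be quite small. Here is a hedged sketch (not the actual PAWS notebook) using nbformat plus nbconvert's ExecutePreprocessor to surface the first execution error; the file name is a placeholder:

import nbformat
from nbconvert.preprocessors import ExecutePreprocessor, CellExecutionError

def first_error(path):
    """Execute a notebook; return the first error message, or None if it ran through."""
    nb = nbformat.read(path, as_version=4)
    ep = ExecutePreprocessor(timeout=600)  # seconds per cell
    try:
        ep.preprocess(nb, {"metadata": {"path": "."}})  # run in the current directory
    except CellExecutionError as err:
        return str(err)
    return None

print(first_error("example.ipynb") or "ran through")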

We also looked at Jupyter notebooks cited from Wikipedia — notes at https://meta.wikimedia.org/wiki/WikiCite_2017/Jupyter_notebooks_on_Wikimedia_sites .

@yuvipanda commented Jul 31, 2017

Hello! I'm part of the team that's working on beta.mybinder.org and related stuff, and am extremely interested in the idea of a 'badge that represents that your code is reproducible, and has been reproduced by CI'. Funnily, I also built the PAWS stuff, which is awesome to find in completely unrelated contexts :D

The part of the stack I've been attacking right now is the 'how do we reproduce the environment that the analysis took place in', as part of the mybinder work. You can see the project used for that here: https://github.com/jupyter/repo2docker. It takes a git repository and converts it into a Docker image, using conventions that should be easy to use for most people (and does not require them to understand or use Docker unless they want to). It's what powers the building bits of mybinder :)

As part of the CI for that project, you can see that we also build and validate some external repositories that are popular! We just represent these as YAML files here: https://github.com/jupyter/repo2docker/tree/master/tests/external and have them auto-tested on push, so we make sure we can keep building them. This can be inverted too - in the repo's CI they can use repo2docker to make sure their changes don't break the build.

The part where we haven't made much progress yet is actual validation. nbval, mentioned here, seems to be the one I like most - it integrates into pytest! We could possibly integrate repo2docker into pytest too, and use that to easily validate repos? Lots of possible avenues to work towards :)
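For reference, a minimal sketch of the nbval route, assuming nbval is installed and with a placeholder notebook name. On the command line this is just pytest --nbval analysis.ipynb; the same run can be driven from Python:

import pytest

# run nbval's notebook validation through pytest's programmatic entry point
exit_code = pytest.main(["--nbval", "analysis.ipynb"])
print("reproduced!" if exit_code == 0 else "validation failed")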

One of the things I'd love to have is something like what https://www.ssllabs.com/ssltest/analyze.html?d=beta.mybinder.org does for HTTPS on websites - scores you on a bunch of factors, with clear ways of improving it. Doing something like that for git repos with notebooks would be great, and I believe we can do a fair amount of work towards it now.

I'll also be at JupyterCon giving a few talks, and would love to meet up if any of you are going to be there!

/ccing @choldgraf who also does a lot of these things with me :)

@danielskatz commented Aug 4, 2017

Hi!

@Daniel-Mietchen pointed me at this thread/project yesterday, and it seems quite interesting.

I wonder if it makes sense to think about short term and long term reproducibility for notebooks?

By short term, I mean that the notebook might depend on a Python package that has to be installed, which could be done by pip before running the notebook; this step could perhaps be automated by a notebook launcher, as sketched below.
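A minimal sketch of that short-term automation, assuming the dependencies are declared in a requirements.txt next to the notebook (all file names are placeholders):

import subprocess, sys

# install the declared dependencies, then execute the notebook in place
subprocess.check_call([sys.executable, "-m", "pip", "install", "-r", "requirements.txt"])
subprocess.check_call(["jupyter", "nbconvert", "--to", "notebook",
                       "--execute", "--inplace", "analysis.ipynb"])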

And by long term I mean that at some point the dependency will not work, pip will be replaced by something new, etc., and the only way to solve this is to capture the full environment. This seems similar to what @yuvipanda describes, and to what @TanuMalik is trying to do a bit differently in https://arxiv.org/abs/1707.05731 (though I don't think her code is available). And "long term" here might still carry OS requirements, so maybe I really mean medium term.

Also, I thought I would cc some other people who I think will be interested in this topic, and could perhaps point to other work done in the context of making notebooks reproducible: @fperez @labarba @jennybc @katyhuff

@ThomasA commented Aug 8, 2017

@khinsen - sorry, I lost track of this thread back in March... Yes, I think @Chroxvi's reproducibility functionality can be carved out of the Magni package, and I actually hope to do that sometime this fall, along with @ppeder08's validation (https://github.com/SIP-AAU/Magni/tree/master/magni/utils/validation), which can be used for input/output validation of, for example, somewhat more abstract "data types" than Python's built-in types.

@TanuMalik commented Aug 8, 2017

@Daniel-Mietchen (Collaborator, Author) commented Aug 9, 2017

Thanks for the additional comments. I have proposed to work on this further during the Wikimania hackathon: https://phabricator.wikimedia.org/T172848 .

@Daniel-Mietchen (Collaborator, Author) commented Aug 20, 2017

I got sick during the hackathon and haven't fully recovered, but JupyterCon is just days away, so I have started to distill the discussion here into an outline for the talk next Friday:
https://github.com/Daniel-Mietchen/events/blob/master/JupyterCon-2017.md#outline-of-the-talk . I will work on it from Tuesday onwards, and your contributions are, as always, welcome.

@RomanGurinovich commented Aug 25, 2017

After chatting with @Daniel-Mietchen about this idea, we've implemented a web app to auto-run the notebooks mentioned in a paper.

exe.sci.ai

Just add a list of paper URLs, like
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5322252/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4965465/
and run the executability validator.

It is a pre-pre-pre-alpha version done for fun and in the name of reproducibility. Please report all issues and suggest improvements.

The current setup might require additional horsepower to consume bigger datasets. We also plan to implement whole-repo autodeployment; at the moment, the lack of this feature causes too many failures.

List of current issues:

https://github.com/sciAI/exe/blob/master/Executability%20issues.md

Validator code:

https://github.com/sciAI/exe

All the credit goes to the sci.AI team, and especially to @AlexanderPashuk. Alex, thank you for the effort and the fights with library compatibility.

@RomanGurinovich commented Aug 25, 2017

@yuvipanda, nice job. High five! Dan mentioned you are at the conference now, right? If you are interested, we can combine efforts.

@choldgraf commented Aug 25, 2017

We're both at the conference now and will be at the hack sessions tomorrow!

@Daniel-Mietchen (Collaborator, Author) commented Aug 27, 2017

I was traveling during the hackathon - I had heard about it too late. In any case, I hope we can combine efforts with the Binder team. For those interested, the talk outline sits at
https://github.com/Daniel-Mietchen/events/blob/master/JupyterCon-2017.md#outline-of-the-talk , and the video should hopefully become available in a few weeks.

@Daniel-Mietchen (Collaborator, Author) commented Aug 29, 2017

Seen at JupyterCon:
https://github.com/jupyter/repo2docker , a tool that can dockerize a git repo and provide a Jupyter notebook to explore the container.
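For anyone who wants to try it, usage is essentially a one-liner; the repository URL below is just an illustrative placeholder:

import subprocess

# build a Docker image from a git repo and launch a notebook server in it;
# equivalent to running `jupyter-repo2docker <repo-url>` in a shell
subprocess.run(["jupyter-repo2docker", "https://github.com/norvig/pytudes"], check=True)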

@cgpu commented Nov 17, 2019

> Hello! I'm part of the team that's working on beta.mybinder.org and related stuff, and am extremely interested in the idea of a 'badge that represents that your code is reproducible, and has been reproduced by CI'. Funnily, I also built the PAWS stuff, which is awesome to find in completely unrelated contexts :D

Apologies for reviving a closed issue. I am also interested in the reproducibility badge (which is not necessarily the same as the Binder badge). I came across @mwoodbri's jupyter-ci. Are there any plans for this to be used, or people that currently use it?

(Also cc'ing @yuvipanda as he has been involved in repo2docker.)

@mwoodbri commented Nov 18, 2019

@cgpu Hi! At the time jupyter-ci was an answer to "what's the simplest possible way to generate a badge and notify of validation failure?". Simply porting the project to GitHub and GitHub Actions (updating to latest Jupyter best-practice in the process, if necessary) would be a great start. But a more general solution involving repo2docker and/or Binder would be even better!

@cgpu commented Nov 18, 2019

> @cgpu Hi! At the time jupyter-ci was an answer to "what's the simplest possible way to generate a badge and notify of validation failure?". Simply porting the project to GitHub and GitHub Actions (updating to latest Jupyter best-practice in the process, if necessary) would be a great start. But a more general solution involving repo2docker and/or Binder would be even better!

Hi @mwoodbri, thank you for the prompt response and the background information! I am really fond of the idea of having a binary does-it-reproduce badge of honour, and jupyter-ci sounds exactly like that. I am trying to reproduce .ipynb-accompanied publications as part of a workshop exercise and I am really struggling. It would be nice to know beforehand which ones to invest time in, hence the interest. Additionally, it would be nice to have/create a hub or awesome-style repo with only the ones that are reproducible (verified by CI).

jupyter-ci is currently only on GitLab. I know it's the same in essence, but the community is much more active on GitHub, so if you plan to maintain the repo and the idea long term, it would be nice to bring it over here. Just a thought.

Thanks once again!

@mwoodbri commented Nov 19, 2019

@cgpu Here's a version converted to use GitHub Actions: https://github.com/mwoodbri/jupyter-ci

@cgpu commented Nov 20, 2019

> @cgpu Here's a version converted to use GitHub Actions: https://github.com/mwoodbri/jupyter-ci

@mwoodbri thank you! Time for me to test now :)

@Daniel-Mietchen thank you for providing the reproducibility cafe space for further discussions 2 years after the start of the initiative. Feel free to close this.
