Summary of BoF meeting at RDA Helsinki 23rd October 2019
This BoF was set up to discuss computational notebooks (such as Jupyter Notebooks or RStudio notebooks). The purpose of the meeting was to discuss a set of topics on notebooks where there is clear overlap with the activities of the RDA. There was a set of talks on the following topics:
- Publishing notebooks
- Long term preservation of notebooks
- Notebooks and FAIR digital objects
- Notebooks for Big data & compute
Breakout groups for each topic followed the talks. Each breakout group was asked to consider the following questions with respect to its topic:
- What are the gaps and opportunities?
- What could this group (or the RDA community) do here that would be done better than elsewhere?
- What are the next steps? (More time to read and discuss? Reports to be written? Code to be developed?)
The original notes from this meeting can be found here.
The main points of Martin Fenner's talk were as follows:
- How do you reference notebooks for specificity and/or credit (as per Software Citation Principles)?
- How do you find notebooks you can reuse?
- How do you find notebooks linked to a publication, dataset, and/or funding?
Practical solutions were:
- Use intrinsic identifiers and/or DOIs for specificity and/or credit and attribution.
- Use CodeMeta metadata for basic metadata describing the notebook.
- Use CodeMeta metadata for linking to publication, datasets, funding, etc.
It was also noted that related groups already exist within the RDA, namely the RDA Software Source Code IG and the RDA/Force11 Software Source Code Identification WG. Finally, there is the work on the PID graph that could be utilised.
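The practical solutions above can be illustrated with a minimal sketch, under stated assumptions. It computes an intrinsic identifier from the notebook's bytes (here a Software Heritage-style content identifier, which is one possible choice and was not named in the talk) and builds a small CodeMeta record linking the notebook to a publication, a dataset and a funder. The function names, the DOIs and the funder in the usage example are hypothetical placeholders, and the property names follow one reading of CodeMeta 2.0 that should be checked against the current CodeMeta terms before use.

```python
import hashlib
import json


def swhid_for_content(data: bytes) -> str:
    """Intrinsic identifier: SHA-1 of the bytes framed as a git blob,
    prefixed in the Software Heritage content-identifier style."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


def build_codemeta(notebook_bytes: bytes, name: str, author: str, doi: str,
                   paper_doi: str, dataset_doi: str, funder: str) -> dict:
    """Assemble a small CodeMeta-style record (illustrative, not exhaustive)."""
    return {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        "@type": "SoftwareSourceCode",
        "name": name,
        "author": [{"@type": "Person", "name": author}],
        # Specificity (intrinsic identifier) plus credit (DOI).
        "identifier": [swhid_for_content(notebook_bytes),
                       "https://doi.org/" + doi],
        # Links to the publication, the dataset and the funder.
        "referencePublication": "https://doi.org/" + paper_doi,
        "relatedLink": ["https://doi.org/" + dataset_doi],
        "funder": {"@type": "Organization", "name": funder},
    }


def to_codemeta_json(record: dict) -> str:
    """Serialise the record as it might be stored in a codemeta.json file."""
    return json.dumps(record, indent=2, sort_keys=True)
```

A call such as `build_codemeta(nb_bytes, "Demo notebook", "A. Researcher", "10.1234/example-notebook", "10.1234/example-paper", "10.1234/example-data", "Example Funder")` uses placeholder values only; the intrinsic identifier changes whenever a single byte of the notebook changes, whereas the DOI stays stable for citation.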
In the discussion, the following initial observation was made:
Notebooks are not just software, data or process. A notebook is a combination of the three and is a Research Object. This explains in part why notebooks have not yet been examined from this perspective.
Two overall themes were discussed:
There were three main reasons noted for publishing notebooks. These were:
Credit - in other words, academic credit for writing a notebook and being able to start tracking its use by treating it as a first class research object (i.e. something that should sit in a bibliography rather than be hidden away as a footnote in a paper). It's interesting to note that this is not just about academic credit but is also a mechanism to enable reuse of notebooks (F for Findable :^) ) for research and for teaching, which folds into the second point.
Understanding - notebooks provide a mechanism for explaining the software or workflows that have been implemented. A published notebook gives other researchers the means to understand how something was done computationally. This is related to, but distinct from, reproducibility, where the focus is on reproducing the results. Understanding is perhaps more powerful - it enables researchers to go beyond what the first researcher has done. Likewise, explaining how the software works may be even more important than reusability.
Preservation - keeping the notebooks themselves. There's an obvious connection with the preservation discussion here.
Related to this, publication provides access to stable notebooks and means that the same notebook can be shared (rather than many slightly different versions). It was also noted that a distinction is needed between notebooks that are being updated in the development cycle (uploaded to a git/svn/Mercurial/etc. resource on a daily basis), updates that correspond to major releases, and releases that correspond to publications.
Connection with the RDA
How could this fit with the RDA's activities? The first observation above is key here. Notebooks are important and yet fall through the cracks in terms of citation etc. There is already an IG on Virtual Research Environments. The data that is generated by a notebook is part of what makes a notebook interesting, so there's a clear overlap there as well. The recommendations that the RDA has made on publishing data and on archiving data and metadata provide an excellent starting point for putting together guidelines on publishing notebooks, as does the work of the related IG and WG on Preserving Software.
In terms of possible actions from an RDA group, one idea would be to provide guidelines on how to build a notebook to make it reusable.
Long-term preservation of notebooks
Patricia Herterich presented the case for the long term preservation of notebooks. Notebooks are increasingly cited in papers or listed as supplementary materials to published research. They seem to be becoming a primary research (and teaching) output, especially in the context of reproducible research.
Their preservation presents challenges, namely: they are complex objects; they have multiple dependencies (not all of which are obvious); they are hard to find as there's no central catalogue and linking to them varies; finally, contrary to popular opinion, they are hard to re-execute unless containerised.
Preservation suggests three possible layers. At the simplest level is converting the notebook to a PDF or text file and hence storing it in the same way as other documents are stored. The next level is to allow a limited form of access using nbviewer and extracting any media files separately. Finally, preserving fully actionable notebooks requires documentation of the OS, the computational environment and the libraries in, for example, a Dockerfile.
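The three layers can be made concrete with a rough sketch, assuming a notebook stored in the nbformat 4 JSON layout (a top-level `cells` list whose code cells carry `outputs`). The function names are hypothetical, only the Python standard library is used, and the third function records an environment description rather than producing a Dockerfile, so it covers only part of what a fully actionable layer would need.

```python
import base64
import json
import platform
import sys
from importlib import metadata


def load_notebook(path: str) -> dict:
    """Read an .ipynb file, which is plain JSON on disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def notebook_to_text(nb: dict) -> str:
    """Layer 1: flatten the notebook into a plain-text surrogate."""
    parts = []
    for cell in nb.get("cells", []):
        # 'source' may be a single string or a list of line strings.
        source = "".join(cell.get("source", []))
        parts.append(f"[{cell.get('cell_type', 'unknown')}]\n{source}")
    return "\n\n".join(parts)


def extract_media(nb: dict) -> list:
    """Layer 2: pull base64-encoded PNG outputs out as raw bytes."""
    images = []
    for cell in nb.get("cells", []):
        for output in cell.get("outputs", []):
            png = output.get("data", {}).get("image/png")
            if png:
                images.append(base64.b64decode("".join(png)))
    return images


def describe_environment() -> dict:
    """Layer 3 (partial): record the OS, Python version and installed packages."""
    return {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        "packages": sorted(
            f"{dist.metadata['Name']}=={dist.version}"
            for dist in metadata.distributions()
        ),
    }
```

The environment record describes the machine the function runs on, which is only useful if that is the machine where the notebook was last known to work; system libraries and external data are not captured at all.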
The discussion on this first focussed on gaps and opportunities.
Fully actionable preservation
Full preservation requires preserving the network of all the components, which invites comparison with the art world. Librarians typically want to aim for storage of an item for greater than ten years. Preserving software beyond two years is hard - you can lose existing code components very quickly. If you have an executable it’s OK - but you need the right version for that. Dependencies degrade quickly - you need a great deal of context.
A number of technical questions were raised: can Dockerfiles be part of things like emulators, and will emulators for standard container images address this? How much coverage would being able to form a standard AWS environment provide? How much science happens in non-standard environments?
Environments such as Binder are playing a key role but who will inherit Binder? Various initiatives at Harvard, Yale, Whole Tale, EGI - could one of those be involved? This suggests possible collaboration.
It became apparent that, like in all preservation, trade-off decisions have to be made. Does everything need to be preserved? What layer is appropriate? Should there be some decision making process to determine what to preserve? There is (as always) an archival processing cost - maintaining accessibility over time. One must look at the cost/benefit analysis over time. Different preservation actions may be appropriate at different stages. One could adopt a records management approach for this.
In terms of solutions, is there a language (metadata) to describe different types of notebooks? e.g. distinguishing between simple notebooks that run a script and those that call datasets and so on (the Software Ontology may be a partial answer here). Can we create surrogates (video screenings? PDFs as documentation) to enhance the lowest layer of preservation?
Connection with the RDA
Opportunities through the RDA could include the development of a terminology. This would fit in with other working group activities and could then be mapped to the layers of preservation. A preservation group could focus on “high quality” notebooks that have a certain level of curation - they need to work and be documented. There needs to be crosstalk between the initiatives currently working in this space. The surface has just been scratched here; more time is required to read and discuss this, and to consider what reports need to be written and what code needs to be developed.
Notebooks and FAIR digital objects
Christine Kirkpatrick presented on the relationship between notebooks and FAIR standards. Notebooks do not have the FAIR principles built in. On the other hand, used properly, Jupyter Notebooks can assist in aligning with FAIR. The discussion on the relationship between FAIR standards and software is ongoing, but within the Digital Object Architecture, application software is a Digital Object and hence “just another form of data”. It is not incorrect to replace the word 'data' with 'software' within the FAIR principles and be consistent in meaning [Note that there is disagreement with this opinion]. Specific items in FAIR have matches, namely:
F2 'Data are described with rich metadata' - This is true for software too. 'Rich' is not a concrete term, but notebooks allow 'rich' metadata to be included in their coding environment, giving more semantic understanding of the data sources. While this does not directly impact findability, it can build trust in the code and data used.
F4 '(meta)data are registered or indexed in a searchable resource' - resources such as GitHub and versioning enable this.
A1 '(meta)data are retrievable by their identifier using a standardized communications protocol; A1.1 the protocol is open, free, and universally implementable; A1.2 the protocol allows for an authentication and authorization procedure, where necessary.' - Platforms such as JupyterHub and MyBinder are examples of such a protocol. Python tools leveraging these protocols will need to be developed. This has not been done yet; however, the ease of creating Python modules should make this easy to accomplish, with the most useful modules finding wider adoption.
I1 '(meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation' - Python itself can be seen as a “formal, accessible, shared, and broadly applicable language” which includes tools to capture metadata about the code.
I2 '(meta)data use vocabularies that follow FAIR principles'; I3 '(meta)data include qualified references to other (meta)data' - These points go back to the additional documentation tools (F2 and F4): within a Notebook environment, detailed documentation builds trust, which in turn builds opportunities for interoperability of data and software.
R1 '(meta)data have a plurality of accurate and relevant attributes'; R1.1 '(meta)data are released with a clear and accessible data usage license'; R1.2 '(meta)data are associated with their provenance'; R1.3 '(meta)data meet domain-relevant community standards' - Notebooks are generally designed for sharing and ease of reuse.
Open version management tools (i.e. GitHub-like environments) promote licensing of the software stored within them and help keep a record of software provenance. They also provide community forums and contribution tools, allowing domain-relevant standards to be discussed and implemented. This says nothing about Notebooks directly, but since the Notebook community tends to overlap widely with the version control community, it seems to be a community preference (if not a community standard).
In the discussion that followed, a large number of questions were raised. How can notebooks be published? How can notebook sharing be made sexy so we see more reuse? How can the internals of a notebook (e.g. the JSON that describes the notebook) be described better? Could notebooks be used as a wrapper for data (after being placed in a container)? Can metadata for notebooks be standardised using domain-specific metadata standards? Can notebooks be metadata themselves? How can support for those who use notebooks be enabled? Can workflows in notebooks be made FAIR? There already exist at least 10 separate ontologies for workflows, as well as resources on FAIR workflows. ProvBook provides a mechanism, through a Jupyter plugin, to provide information for each cell in a notebook. Correspondingly, can notebooks be used to document transformations of raw data to controlled data, including visualisations?
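As a small illustration of the question about notebook internals: in the nbformat 4 layout a notebook is a JSON object with `cells`, `metadata`, `nbformat` and `nbformat_minor` keys, and the `metadata` object can carry additional entries. The sketch below summarises those internals and embeds a descriptive record under a key named `research_metadata`. That key and both function names are hypothetical; no standard for such a record was agreed in the discussion.

```python
import json


def summarise(nb: dict) -> dict:
    """Describe the internals of a notebook held as parsed nbformat 4 JSON."""
    cells = nb.get("cells", [])
    meta = nb.get("metadata", {})
    return {
        "nbformat": nb.get("nbformat"),
        "kernel": meta.get("kernelspec", {}).get("name"),
        "code_cells": sum(c.get("cell_type") == "code" for c in cells),
        "markdown_cells": sum(c.get("cell_type") == "markdown" for c in cells),
        # Anything beyond the usual kernel/language entries counts as custom.
        "custom_metadata_keys": sorted(
            k for k in meta if k not in ("kernelspec", "language_info")
        ),
    }


def embed_metadata(nb: dict, description: dict,
                   key: str = "research_metadata") -> dict:
    """Return a copy of the notebook with a descriptive record embedded.

    The key name is a hypothetical choice; the original notebook is untouched.
    """
    updated = json.loads(json.dumps(nb))  # deep copy via a JSON round-trip
    updated.setdefault("metadata", {})[key] = description
    return updated
```

Embedding the record inside the notebook keeps description and content together, at the cost of changing the file (and hence any intrinsic identifier computed from it) whenever the metadata is edited.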
Gaps and challenges (chopportunities)
Notebooks are a way to embed provenance and data transformation (e.g. ProvBook), but what happens if the relevant plugin stops working? There are no clear definitions for making notebooks FAIR, e.g. no metrics or maturity indicators. A notebook both is and is not software: it contains cells with code, but is also a document. Communities differ on what the scope of a notebook is. There can be differences between the licensing of the data in a notebook and that of the notebook itself. Notebooks could be a way to create FAIR workflows and improve machine testability/readability. Platforms are also going to evolve, e.g. Jupyter will be replaced by JupyterLab. Finally, with respect to domains, there are communities that do not use notebooks (e.g. linguistics) and those that face specific challenges (e.g. climate studies that have to deal with large data sets).
A working group could be set up to understand the implications of differing data and notebook licences, especially with respect to sharing notebooks (in collaboration with the WG on Data policies and research compendia). Other possible tasks include determining the scope of notebooks and providing guidance on how to machine test notebooks for FAIR. Different stakeholders face different tasks:
- Software providers: What would it mean for a notebook provider to be FAIR compliant? Specifically, to give recommendations on how to make them FAIR.
- IT/Research computing: Identifying best practices for institutional notebook hosting and course support/hosting, but also for research.
- Users/Researchers: Identifying best practices for including manual steps that are not in a notebook, making notebooks FAIR, working with large datasets, and batch computing.
Raise the profile of a recent paper on FAIR Software to the RDA community. Propose a working group with outputs on:
- FAIR definitions/metrics/maturity indicators for notebooks.
- Making notebooks FAIR and behaviors that engender FAIR.
- Best practices (or ‘known good config’/this works) on institutional notebook hosting for courses and research
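To give a feel for what machine-testable indicators for notebooks might look like, here is a deliberately simple sketch. The four checks and the metadata keys `identifier` and `license` are hypothetical illustrations, not agreed FAIR metrics or maturity indicators; `kernelspec` and `language_info` are the usual nbformat metadata entries.

```python
def fair_indicators(nb: dict) -> dict:
    """Run a few illustrative checks on a notebook held as parsed JSON."""
    meta = nb.get("metadata", {})
    cells = nb.get("cells", [])
    return {
        # Findable: the notebook names its own identifier (hypothetical key).
        "has_identifier": bool(meta.get("identifier")),
        # Reusable: a licence is stated (hypothetical key).
        "has_license": bool(meta.get("license")),
        # Interoperable: the kernel or language is declared.
        "declares_kernel": "kernelspec" in meta or "language_info" in meta,
        # Reusable: at least one non-empty markdown cell documents the work.
        "has_documentation": any(
            c.get("cell_type") == "markdown"
            and "".join(c.get("source", [])).strip() != ""
            for c in cells
        ),
    }


def fair_score(indicators: dict) -> float:
    """Fraction of indicators that passed (0.0 to 1.0)."""
    return sum(indicators.values()) / len(indicators)
```

A single score hides which checks failed, so a working group would more plausibly publish the individual indicators; the score is included only to show that the checks are machine-testable.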
Notebooks for Big Data & compute
Gergely Sipos' talk discussed running notebooks on the EGI e-infrastructure. The initial questions were: how to provide Notebooks ‘as a service’ for international communities? How to scale notebooks to big compute applications? How to handle big data I/O from notebooks? How to support reproducible analysis with notebooks?
Starting from JupyterHub, the goals are to provide scalable computing and scalable data access, and to enable reproducibility through the provision of DOIs for notebooks and the deployment of notebooks and data using Binder. The architecture developed to do this was outlined. The next set of questions raised were:
- How to provide scalable batch computing from notebooks?
- How can we ensure reproducibility from different JupyterHub installs?
- What are the community's best practices?
The group was composed of about 15 people, with ⅓ already using notebooks for ‘big compute’ and ⅔ looking for examples/solutions to do this. One was in both situations (how to help students move from a local notebook to big compute notebooks).
An early question was how to handle big data from notebooks. Gergely elaborated on how the EGI Notebooks service does this (with a back-end data management system called DataHub). There was a recognised connection point with the data repository interfaces BoF. As an action, notebook use case(s) could be sent to that group.
Batch computing became a key topic of conversation. The EGI (and other cloud providers) started with interactive notebooks but would now like to add the possibility of batch computing. Supercomputing centres, on the other hand, would like to do the opposite, i.e. to use notebooks to make access more interactive. This is a topic of common interest and there is a need for sharing good practices. Running long-running processes from notebooks brings problems such as time-outs in a browser, how to stay within ‘user quotas on the HPC/HTC site’, etc. The question was raised of why one should require batch computing. The response is that, much as on any other platform, a user will want to scale up from a test dataset to the full dataset once the analytical code is working. Someone who can write a notebook may not be able to run batch jobs over a large parameter space from the command line, so how can notebooks simplify this task from an interactive environment? For example, Condor (and a few other) libraries exist to do batch computing from a notebook. There is a need to report back on this. A related issue is running many notebooks, each of which requires little capacity alone but which together can be a challenge. Examples of scaling up a notebook from a local environment to a big compute machine are required. Do we have, or can we have, training for users? (The content depends on how the provider actually implements scalability in its notebook server.)
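The scale-up pattern described above (get the analysis working on a test case, then run it over a large parameter space without blocking the interactive session) can be sketched as follows. This is an illustration only: a local thread pool from the Python standard library stands in for a real batch back-end, `analyse` is a hypothetical analysis step, and none of this reflects how EGI or any HPC site implements batch submission from notebooks.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from itertools import product


def analyse(alpha: float, n: int) -> float:
    """Hypothetical analysis step, first developed on a small test case."""
    return sum(alpha * i for i in range(n))


def submit_sweep(executor, func, grid: dict) -> dict:
    """Submit one job per parameter combination; returns {future: params}.

    Submission returns immediately, so the notebook stays responsive.
    """
    names = list(grid)
    jobs = {}
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        jobs[executor.submit(func, **params)] = params
    return jobs


def collect(jobs: dict) -> list:
    """Gather (params, result) pairs as the jobs finish."""
    return [(jobs[future], future.result()) for future in as_completed(jobs)]
```

In a notebook, one cell might call `submit_sweep(pool, analyse, {"alpha": [0.5, 1.0], "n": [10, 100]})` with `pool = ThreadPoolExecutor(max_workers=4)`, and a later cell might call `collect(jobs)`. With a real batch system the futures would need to survive browser time-outs and kernel restarts, which a thread pool does not provide.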
The portability of notebooks across servers was also raised, and the action identified was to find out what metadata about notebooks is available: for example, how much does a Binder requirements.txt offer in this respect, and what more do we need?
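One way to start answering the requirements.txt question is to look at what such a file actually pins down. The sketch below, under simplifying assumptions, separates exactly pinned requirements (`name==version`) from everything else. The function name and the regular expression are illustrative; the parser ignores option lines such as `-r` or `-e`, treats wildcard versions and lines with environment markers as unpinned, and does not handle URLs containing `#`.

```python
import re

# Matches "name==version" or "name[extra]==version" with a concrete version.
PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]*\])?\s*==\s*[^\s*]+$")


def audit_requirements(text: str) -> dict:
    """Split the lines of a requirements.txt into pinned and unpinned entries."""
    pinned, unpinned = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        (pinned if PINNED.match(line) else unpinned).append(line)
    return {"pinned": pinned, "unpinned": unpinned}
```

Even a fully pinned file lists Python packages only; it says nothing about the operating system, system libraries, hardware or the data the notebook reads, which is part of the "what more do we need?" question.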
Certain themes emerge from these discussions.
- Notebooks are research objects, not software and not data. Lessons can be drawn from both communities but may not hold entirely.
- Notebooks are mostly about making data analysis understandable. It's not automatically about reproducibility.
- There is a need for various pieces of guidance and/or metadata standards on all of the above discussion topics.
- The configurability of Jupyter notebook extensions implies that they can be used in a huge number of contexts (e.g. HPC). The question of scope raises its head - are there agreed situations where notebooks are inappropriate?
- Different communities use notebooks in different ways (large data sets and high throughput computing; supercomputing; large embedded image or video sets) and hence there may be a variety of different answers.
- There are clear overlaps with different IG and WGs within the RDA and hence this represents a good location to discuss these issues.