Add to tale publishing design doc

amoeba committed Jan 26, 2018
1 parent 274fff6 commit 07e4691
Showing 1 changed file with 81 additions and 6 deletions.
87 changes: 81 additions & 6 deletions mockups/tale-publishing/README.rst
@@ -3,7 +3,7 @@ Tale Publishing

When a user has created a Tale and wishes to save it so it can be shared/launched by others, they will have to be able to publish their Tale on an external repository such as Globus or a DataONE Member Node.

Questions:
High-level Questions
--------------------

- What files get saved?
@@ -15,24 +15,98 @@ Requirements:

Solution should satisfy these requirements:

- Tales can be published to one or more external repositories in a standard manner
- Published Tales can be round-tripped (imported back into WholeTale) and "forked"
- Some amount of provenance information should be archived
- Tales can be published to one or more external repositories using OAI/ORE Resource Maps as the glue between artifacts
- Published Tales can be round-tripped: A Tale can be published, then imported back into a WholeTale environment
- A non-zero amount of provenance information should be archived
- Published Tales have to work outside the WT environment, at least to some degree (not necessarily as seamlessly as in WT?)


What files get saved?
---------------------

Types of things
***************

There are three main categories of Things involved in Tale saving:

::

    -----------------      ------      --------------
   | Registered Data | -> | Tale | -> | Derived Data |
    -----------------      ------      --------------

**Registered Data**
Zero or more filesystem artifacts, either externally registered or uploaded directly to the Dashboard
**Tale**
The combination of the Tale metadata and other artifacts (e.g. Dockerfiles) and the analytical code (Jupyter Notebooks, R scripts, etc.)
**Derived Data**
Any filesystem artifacts derived from executing a script/notebook.
This includes provenance traces.

(There are definitely other ways of thinking about this)
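
One way to make the three categories concrete is as simple types. This is only a sketch of the mental model above; all names and fields are illustrative, not WholeTale's actual data model.

```python
# Illustrative types for the three categories of things involved in
# Tale saving. Field names are assumptions, not the real schema.
from dataclasses import dataclass, field

@dataclass
class RegisteredData:
    """Zero or more filesystem artifacts, externally registered or uploaded."""
    artifacts: list = field(default_factory=list)

@dataclass
class Tale:
    """Tale metadata plus environment artifacts and analytical code."""
    metadata: dict = field(default_factory=dict)
    environment: list = field(default_factory=list)  # e.g. Dockerfiles
    code: list = field(default_factory=list)         # e.g. notebooks, R scripts

@dataclass
class DerivedData:
    """Artifacts produced by executing the code, including provenance traces."""
    artifacts: list = field(default_factory=list)
```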

Use cases
*********

Tales have two different modalities of use:

1. Re-tell Tale (e.g., re-compute each cell in a Jupyter Notebook)

For this case, the user doesn't necessarily need the **Derived Data** because they can generate it themselves.
However, this requires them to have the computational resources to do so, which can't be counted on.

2. Read through Tale to see what was done

For this case, the user *does* need the **Derived Data** (everything, really).
But they don't need the computational resources to re-run the analysis.

We want to cover both of these use cases.
Therefore, we need to archive enough information so the user doesn't have to re-run the analysis to read the Tale.

Plan
----

There are a lot of things we could do and, because of this, it makes a lot of sense to build this up in phases, starting with something simple.

Phases:

1. ORE + wt.yml + Dockerfile + Notebooks/R scripts + super-minimal metadata record + minimal provenance information
2. Above + Derived Data
3. Above + detailed PROV info

To make WholeTale useful, we really need to get close to 3.

TODO:

- Make a table of minimal vs. desired features
- Do it in passes (incremental releases)

A little or a lot could be archived:

- An OAI/ORE Resource Map aggregating the following
- WT manifest (the yml file the Dashboard & backend generate)
- Docker image(s) / Dockerfiles
- Docker image(s) / Dockerfiles (Matt: do archive Dockerfile)

TODO: Get more details on Docker usage.
- Registered datasets (the actual data files, not just pointers/URLs)

TODO: How do we deal with non-DataONE / non-WT data?

- Provenance capture in the front end
- Automatic
- Manual
- Could use recordr in RStudio and capture that PROV
- TODO: Check out https://github.com/gems-uff/noworkflow
- Lots of interest in notebook environments, and in capturing provenance in Jupyter
- Esp w/ YesWorkflow
- See NoWorkflow too
- See tickets from prov-a-thon

- Filesystem artifacts created using the front-end

Could: publish minimal metadata, generate a minimal doc, and include it in the package.


Potential problems:

- Authentication (see below for possible solutions)
@@ -41,7 +115,6 @@ Potential problems:

Possible solutions:


- Publish to an existing Member Node (KNB is a good candidate)
- Set up a dedicated Member Node just for WholeTale
- Unlikely: Don't publish into DataONE
@@ -54,6 +127,8 @@ Potential problems:
- Author a minimal EML record for the Tale. EML only requires a title, creator, and contact (title <=> Tale title, creator/contact <=> Logged-in user)
- Support a new metadata format just for Tales
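
The minimal EML option above (title, creator, contact) is small enough to generate with the standard library. The sketch below is a guess at what that could look like; the namespace URI, ``packageId`` value, and system string are placeholder assumptions, not WholeTale's actual implementation.

```python
# Hypothetical sketch: build the minimal EML record described above
# (title <=> Tale title, creator/contact <=> logged-in user).
# packageId and system values are placeholders.
import xml.etree.ElementTree as ET

EML_NS = "https://eml.ecoinformatics.org/eml-2.2.0"
ET.register_namespace("eml", EML_NS)

def minimal_eml(tale_title, user_surname):
    """Return a minimal EML document as a string."""
    root = ET.Element(
        f"{{{EML_NS}}}eml",
        {"packageId": "urn:uuid:PLACEHOLDER", "system": "whole-tale"},
    )
    dataset = ET.SubElement(root, "dataset")
    ET.SubElement(dataset, "title").text = tale_title
    # The logged-in user serves as both creator and contact.
    for role in ("creator", "contact"):
        person = ET.SubElement(dataset, role)
        name = ET.SubElement(person, "individualName")
        ET.SubElement(name, "surName").text = user_surname
    return ET.tostring(root, encoding="unicode")
```

Since EML requires so little, this keeps the publishing flow fully automatic: no metadata form needs to be shown to the user at publish time.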

TODO: what do we do with really large files?

Handling authentication
-----------------------

