Add to publishing and prov design docs
amoeba committed Jan 29, 2018
1 parent 07e4691 commit a6c81c0
Showing 2 changed files with 108 additions and 56 deletions.
13 changes: 13 additions & 0 deletions mockups/provenance-capture/README.rst
@@ -10,3 +10,16 @@ Possible sources of provenance
- The running container could be inspected capture some minimal provenance (like files written out to disk)
- We could encourage/assist/mandate the user use some of the provenance tools already available in Jupyter
- The user could manually write out provenance information in their frontend


More notes (TODO: integrate)

- Provenance capture in the front end
- Automatic
- Manual (see the sketch after this list)
- Could use recordr in RStudio and capture that PROV
- TODO: Check out https://github.com/gems-uff/noworkflow
- Lots of interest in notebook environments, and capturing prov in Jupyter
- Esp w/ YesWorkflow
- See NoWorkflow too
- See tickets from prov-a-thon
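
The "manual" option above could look something like the following sketch, which uses the Python ``prov`` package (not a WholeTale API); the namespace, entity, and activity names are hypothetical placeholders.

.. code:: python

    # Hand-written provenance for one notebook run, using the `prov` package.
    # Identifiers below are hypothetical, not WholeTale conventions.
    from prov.model import ProvDocument

    doc = ProvDocument()
    doc.add_namespace("wt", "https://wholetale.org/prov/")

    data_in = doc.entity("wt:registered-data.csv")   # input the user registered
    figure = doc.entity("wt:figure1.png")            # output written by the notebook
    run = doc.activity("wt:run-notebook")            # the execution itself

    doc.used(run, data_in)            # the run read the input data
    doc.wasGeneratedBy(figure, run)   # and produced the figure

    # PROV-JSON that could be archived alongside the Tale
    with open("tale-prov.json", "w") as f:
        f.write(doc.serialize())
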
151 changes: 95 additions & 56 deletions mockups/tale-publishing/README.rst
@@ -4,21 +4,22 @@ Tale Publishing
When a user has created a Tale and wishes to save it so it can be shared and launched by others, they need to be able to publish their Tale to an external repository such as Globus or a DataONE Member Node.

High-level Questions
--------------------

- What files get saved?
- How will we generate metadata for Tales?
- How will authentication happen with the service we save to?
- Will Tales get DOIs?

Requirements
-------------

The solution should satisfy these requirements:

1. Tales can be published to one or more external repositories using OAI/ORE Resource Maps as the glue between artifacts (see the sketch after this list)
2. Published Tales can be round-tripped: A Tale can be published, then imported back into a WholeTale environment
3. A non-zero amount of provenance information should be archived
4. Published Tales have to work outside the WT environment to at least some degree (not necessarily as seamlessly as in WT)
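
To make requirement 1 concrete, here is a minimal sketch of a Resource Map gluing a Tale's artifacts together, assuming ``rdflib`` is available; the Tale identifier, artifact names, and URIs are hypothetical placeholders, not WholeTale conventions.

.. code:: python

    # Build a tiny OAI/ORE Resource Map that aggregates the Tale's artifacts.
    from rdflib import Graph, Namespace, URIRef
    from rdflib.namespace import RDF

    ORE = Namespace("http://www.openarchives.org/ore/terms/")

    g = Graph()
    g.bind("ore", ORE)

    resource_map = URIRef("https://example.org/tale/1234/map")
    aggregation = URIRef("https://example.org/tale/1234/aggregation")

    g.add((resource_map, RDF.type, ORE.ResourceMap))
    g.add((resource_map, ORE.describes, aggregation))
    g.add((aggregation, RDF.type, ORE.Aggregation))

    # The artifacts the map glues together: manifest, Dockerfile, script, metadata
    for name in ["wt.yml", "Dockerfile", "analysis.ipynb", "metadata.xml"]:
        g.add((aggregation, ORE.aggregates, URIRef(f"https://example.org/tale/1234/{name}")))

    print(g.serialize(format="turtle"))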


What files get saved?
---------------------
Expand All @@ -36,78 +37,73 @@ There are three main categories of Things involved in Tale saving:

**Registered Data**
    Zero or more filesystem artifacts, either externally registered or uploaded directly to the Dashboard.
    This is the data the user will compute with using their Frontend.
**Tale**
    The combination of the Tale metadata and other artifacts (e.g. Dockerfiles) plus the analytical code (Jupyter Notebooks, R scripts, etc.)
**Derived Data**
    Any filesystem artifacts derived from executing a script/notebook.
    This includes provenance traces.

(There are definitely other ways of thinking about this)
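
As a rough illustration of the split, the artifacts could be grouped like this when assembling a package for publication; the class and field names below are hypothetical, not an actual WholeTale schema.

.. code:: python

    # Hypothetical grouping of a Tale's files into the three categories above.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class TaleArtifacts:
        registered_data: List[str] = field(default_factory=list)  # registered or uploaded inputs
        tale: List[str] = field(default_factory=list)             # metadata, Dockerfile, notebooks/scripts
        derived_data: List[str] = field(default_factory=list)     # outputs of a run, incl. provenance traces

    artifacts = TaleArtifacts(
        registered_data=["data.csv"],
        tale=["wt.yml", "Dockerfile", "analysis.ipynb"],
        derived_data=["figure1.png", "prov.json"],
    )
    print(artifacts)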

Use cases
*********

We can certainly save a lot of stuff, but whatever gets saved needs to serve a use case. Tales have two different modalities of use:

1. Re-tell Tale (e.g., re-compute each cell in a Jupyter Notebook, re-run the R script)

For this case, the user doesn't necessarily need the **Derived Data** because they can generate it themselves.
However, this requires them to have the computational resources to do so, which can't be counted on.

Advantages:

- Encourages re-running analysis and investigating results
- Fewer files / less data to archive

Disadvantages:

- User viewing Tale would have to re-run analysis to see the result
- User viewing Tale may not be able to re-run analysis (decently likely, probability increasing with time from Tale publication date)

2. Read through Tale to see what was done (read the code, look at the output)

For this case, the user *does* need the **Derived Data** (they need everything, really).
But they don't need the computational resources to re-run the analysis.

Advantages:

- User viewing Tale doesn't need to have access to WholeTale to see results
- User viewing Tale doesn't need to come up with the computational resources to see results

Disadvantages:

- More files / more data to archive

We want to cover both of these use cases, and covering use case 2 also covers use case 1.
Therefore, we need to archive enough information so the user doesn't have to re-run the analysis to read the Tale.

Plan
----

There are a lot of things we could do, so it makes sense to build this up in phases, starting with just getting basic publishing working from WT to an external repository (DataONE/Globus).

=============== ======= ======= ======= ====
Artifact        Phase 1 Phase 2 Phase 3 Note
=============== ======= ======= ======= ====
Registered Data N       N       N       Probably never (See below)
Uploaded Data   N       Y       Y
WT Manifest     Y       Y       Y
Dockerfile      Y       Y       Y
Script(s)       Y       Y       Y
Metadata        Y       Y       Y
Derived Data    N       Y       Y
PROV            N       N       Y
=============== ======= ======= ======= ====

To make WholeTale useful/special, we really need to get to Phase 3.

Potential risks/problems:

- Authentication (see below for possible solutions)
- For publishing to DataONE:
@@ -127,13 +123,56 @@ Potential problems:
- Author a minimal EML record for the Tale. EML only requires a title, creator, and contact (title <=> Tale title, creator/contact <=> Logged-in user)
- Support a new metadata format just for Tales

TODO: What do we do with really large files?

- What if the user generates a massive file? How will we save it (or tell the user we won't)?

Provenance
----------

See `Provenance Capture <../provenance-capture/README.rst>`_

Authentication
--------------

Right now, WT (Globus) auth and DataONE auth aren't designed such that a user working within WT can write to DataONE. This needs to be resolved if the user is going to save Tales, or if the WT backend is going to be able to save Tales on the user's behalf.

- Do we adopt one or the other auth system across both systems?
- Do we make DataONE trust Globus tokens?
- Do we make DataONE trust the WT backend and have the backend do the saving on behalf of the user?

The easiest thing is probably to get a DataONE token for the user.
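
A minimal sketch of what that could enable, assuming the Member Node exposes the usual ``MNStorage.create`` call (a multipart POST to ``/v2/object``) and accepts the token via an ``Authorization: Bearer`` header; the node URL, PID, and file names are hypothetical.

.. code:: python

    # Upload one artifact (e.g. the serialized Tale) plus its system metadata
    # to a DataONE Member Node using the user's auth token.
    import requests

    D1_TOKEN = "..."                                   # token the user obtains from DataONE
    MN_BASE = "https://example-mn.dataone.org/mn/v2"   # hypothetical Member Node base URL

    def create_object(pid, object_path, sysmeta_path):
        with open(object_path, "rb") as obj, open(sysmeta_path, "rb") as sysmeta:
            response = requests.post(
                f"{MN_BASE}/object",
                headers={"Authorization": f"Bearer {D1_TOKEN}"},
                files={
                    "pid": (None, pid),
                    "object": (object_path, obj),
                    "sysmeta": (sysmeta_path, sysmeta),
                },
            )
        response.raise_for_status()
        return response

    create_object("urn:uuid:1234", "tale.zip", "tale-sysmeta.xml")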

Metadata Creation
-----------------

- Which standard?
- How will the user generate it?
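
One option floated above is authoring a minimal EML record. A sketch of what that could look like, assuming EML 2.1.1; the ``packageId``, ``system`` value, title, and name are placeholders, and a real record would still need to validate against the EML schema.

.. code:: python

    # Build a minimal EML record: EML only requires a title, creator, and contact.
    def minimal_eml(package_id: str, tale_title: str, user_surname: str) -> str:
        return f"""<?xml version="1.0" encoding="UTF-8"?>
    <eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
             packageId="{package_id}" system="urn:wholetale">
      <dataset>
        <title>{tale_title}</title>
        <creator>
          <individualName><surName>{user_surname}</surName></individualName>
        </creator>
        <contact>
          <individualName><surName>{user_surname}</surName></individualName>
        </contact>
      </dataset>
    </eml:eml>"""

    print(minimal_eml("urn:uuid:1234", "My Tale", "User"))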

More coming in this section

Uploaded Data
-------------

- Are we going to archive this? Most likely

More coming in this section

Registered Data
---------------

- Maybe we won't archive this, since registered data already lives in long-term archives

More coming in this section

Saving to DataONE
-----------------

Basically: Which MN do we save to?
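
As one input to that decision, here is a rough sketch that lists candidate Member Nodes, assuming the production Coordinating Node's ``listNodes`` call (``GET /cn/v2/node``); XML elements are matched by local name to sidestep namespace details, and the exact response structure should be double-checked.

.. code:: python

    # List DataONE Member Nodes that are currently up, via the CN node registry.
    import requests
    import xml.etree.ElementTree as ET

    def list_member_nodes(cn_base="https://cn.dataone.org/cn/v2"):
        root = ET.fromstring(requests.get(f"{cn_base}/node").content)
        nodes = []
        for node in root.iter():
            if node.tag.split("}")[-1] != "node":
                continue
            if node.get("type") == "mn" and node.get("state") == "up":
                nodes.append((node.findtext("identifier"), node.findtext("baseURL")))
        return nodes

    for identifier, base_url in list_member_nodes():
        print(identifier, base_url)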

More coming in this section

Saving to Globus
----------------

The user is already authenticated with Globus, so is this easy?

More coming in this section
