Add a bunch of stuff to tale publishing and prov design docs
amoeba committed Jan 30, 2018
1 parent a6c81c0 commit a4c2175
Showing 2 changed files with 200 additions and 67 deletions.
131 changes: 114 additions & 17 deletions mockups/provenance-capture/README.rst
@@ -1,25 +1,122 @@
Provenance Capture
==================

How will the WT user, in combination with the WT Dashboard, capture Tale Provenance?
How will the WT user, in combination with the WT Dashboard, capture Tale Provenance and how will we display that to scientists?

Possible sources of provenance
------------------------------
Project deliverables
--------------------

- The WT manifest itself is sort of a prospective provenance trace
- The running container could be inspected to capture some minimal provenance (like files written out to disk)
- We could encourage/assist/mandate the user to use some of the provenance tools already available in Jupyter
- The user could manually write out provenance information in their frontend
- Include provenance in Tales
- Build Python package for authoring PROV

Include provenance in Tales
---------------------------

More notes (TODO: integrate)
Possible sources of provenance include:

- Provenance capture in the front end
- Automatic
- Manual
- Could use recordr in RStudio and capture that PROV
- TODO: Check out https://github.com/gems-uff/noworkflow
- Lots of interest in notebook environments, and in capturing PROV in Jupyter
- Esp w/ YesWorkflow
- See NoWorkflow too
- See tickets from prov-a-thon
- Super high-level PROV:

::

Tale :used Dataset
Tale :generated Dataset

This is an alternative to the type of Execution-oriented PROV information we usually try to store with PROV/PROVONE. I think this has a lot of value. If the archived Tale uses a DataONE dataset, we would likely not archive that data because it's already archived. We'd either include it in the package via the ORE or somehow link to it with PROV statements.
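
A minimal sketch of authoring those two statements, assuming the ``prov`` package from PyPI; the namespace and all identifiers are placeholders, not anything WT has settled on:

::

    from prov.model import ProvDocument

    doc = ProvDocument()
    doc.add_namespace('wt', 'https://wholetale.org/ns/')  # hypothetical namespace

    # Model the Tale as the activity and the datasets as entities
    tale = doc.activity('wt:tale-12345')        # placeholder Tale identifier
    input_ds = doc.entity('wt:dataset-A')       # in practice, the DataONE PID or DOI
    output_ds = doc.entity('wt:dataset-B')

    doc.used(tale, input_ds)                 # Tale :used Dataset
    doc.wasGeneratedBy(output_ds, tale)      # Tale :generated Dataset

    print(doc.serialize(indent=2))           # PROV-JSON by default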


Questions:

- Can we create this in PROVONE or do we need another way of doing it?

Pros:

- Easier for the Tale creator to author ("Which Dataset did you use?")
- Easier for the Tale user to understand ("Which Dataset did this use?")

Cons:

- Doesn't solve the problem of a person reading through a Tale and wanting to know how all the files connect without reading the source code

- The WT recipe itself is sort of a prospective provenance trace

This might not even be possible with just the information in the Recipe. To make a useful PROV trace, we'd probably need to have an Execution, which needs at least one Entity (either used or generated), and the Recipe alone doesn't really have this information right now.

Pros:

- We already have the Recipe, so this doesn't require the user to do anything extra

Cons:

- This is a super minimal amount of PROV and maybe not even possible given how little PROV-relevant information is in the Recipe

- The running container could be inspected to capture some minimal provenance (like files written out to disk); a sketch of this appears at the end of this list

Pros:

- Automated

Cons:

- Probably not useful due to low-level nature

- We could encourage/assist/mandate the user to use some of the provenance tools already available in Jupyter/RStudio

This could be as simple as automatically making sure PROV packages are included in the Docker image the Frontend is based on.

Pros:

- Easy to do

Cons:

- May be too confusing/hard to make the user author their PROV in a scripted/cell environment. Something graphical might be way easier.

- The user could manually write out provenance information in their frontend and we could automatically capture it. This would have to be via convention, which is always trickier for new users to pick up.
- We could somehow re-use the MetacatUI PROV editor

For example, the Dashboard could send the user over to an Edit screen in a hosted copy of MetacatUI with minimal information already filled in, where they would be asked to Edit (optionally) and then Save again.

Pros:

- Not reinventing the wheel
- Integrates well with one of the possible solutions for Tale Saving
- We get great PROV information from the user

Cons:

- Might be hard to integrate (but I think it's not)

- We could build a new PROV Editor into the Dashboard so the user's flow would look like...

::

... -> Frontend -> Archive Tale [ Author Metadata -> Describe PROV ] -> Done

Pros:

- Tightly integrated

Cons:

- Requires creating a new PROV Editor (and one that may only benefit WT)
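
Returning to the container-inspection option above, here is a minimal sketch using the Docker SDK for Python (the equivalent of ``docker diff``). The container name is a placeholder and the meaning of the numeric ``Kind`` codes is my assumption:

::

    import docker

    client = docker.from_env()
    container = client.containers.get('wholetale-frontend-1234')  # placeholder name

    # diff() lists filesystem changes made since the container started
    kinds = {0: 'modified', 1: 'added', 2: 'deleted'}  # assumed meaning of Kind
    for change in container.diff() or []:
        print(kinds.get(change['Kind'], '?'), change['Path'])

Even this only tells us *which* files changed, not how they relate, which is why it's listed as probably not useful on its own.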

Build Python package for authoring PROV
---------------------------------------

Problem: If a user is in a Jupyter Python environment, they have no easy way to author PROV information, either via a trace or retrospective assertion.

Questions:

- Does it implement the RunManager API?
- Is this just recordr-in-Python?
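
To make the "recordr-in-Python" idea concrete, here is a very rough, hypothetical sketch of what a run-capture helper could look like: snapshot the working directory before and after a run, treating pre-existing files as used and new/changed files as generated. None of these names exist anywhere yet:

::

    import os
    import time
    from contextlib import contextmanager

    def _snapshot(path):
        """Map relative file path -> (size, mtime) for every file under path."""
        state = {}
        for root, _dirs, files in os.walk(path):
            for name in files:
                full = os.path.join(root, name)
                st = os.stat(full)
                state[os.path.relpath(full, path)] = (st.st_size, st.st_mtime)
        return state

    @contextmanager
    def record_run(workdir='.'):
        """Yield a dict that gets filled with used/generated files on exit."""
        before = _snapshot(workdir)
        trace = {'started': time.time(), 'used': [], 'generated': []}
        try:
            yield trace
        finally:
            after = _snapshot(workdir)
            trace['ended'] = time.time()
            # Crude heuristic: files present before the run are inputs,
            # new or modified files are outputs
            trace['used'] = sorted(before)
            trace['generated'] = sorted(p for p in after
                                        if p not in before or after[p] != before[p])

    # In a notebook cell:
    # with record_run('.') as trace:
    #     run_analysis()            # the user's own code
    # trace['generated'] could then feed PROV statements like the ones above

A real package would presumably hook file I/O rather than diff the filesystem (closer to what recordr does), and that is where the RunManager API question above comes in.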

Other notes
-----------

TODO: Integrate these above ^

- Could use recordr in RStudio and capture that PROV
- TODO: Check out https://github.com/gems-uff/noworkflow
- Lots of interest in notebook environments, and in capturing PROV in Jupyter
- Esp w/ YesWorkflow
- TODO: See NoWorkflow too
- TODO: See tickets from prov-a-thon
136 changes: 86 additions & 50 deletions mockups/tale-publishing/README.rst
@@ -1,14 +1,18 @@
Tale Publishing
===============

When a user has created a Tale and wishes to save it so it can be shared/launched by others, they will have to be able to publish their Tale on an external repository such as Globus or a DataONE Member Node.
When a user has created a Tale and wishes to save it so it can be shared/launched by others, they will have to be able to publish their Tale on an external repository such as a DataONE Member Node.

TODOS
-----
- TODO: Should I call this "Tale Archiving"?

High-level Questions
--------------------

- What files get saved?
- How will we generate metadata for Tales?
- How will authentication happen with the service we save to?
- How will authentication happen with the DataONE Member Node?
- Will Tales get DOIs?

Requirements
@@ -21,7 +25,6 @@ Solution should satisfy these requirements:
3. A non-zero amount of provenance information should be archived
4. Published tales have to work outside the WT environment (to at least some degree) (not necessarily as seamlessly as in WT?)


What files get saved?
---------------------

@@ -53,7 +56,7 @@ We can certainly save a lot of stuff. Whatever gets saved needs to serve a use c

1. Re-tell Tale (e.g., re-compute each cell in a Jupyter Notebook, re-run the R script)

For this case, the user doesn't necessarily need the **Derived Data** because they can generate it themselves.
For this case, the user doesn't necessarily need the *Derived Data* because they can generate it themselves.
However, this requires them to have the computational resources to do so, which can't be counted on.

Advantages:
@@ -68,7 +71,7 @@ We can certainly save a lot of stuff. Whatever gets saved needs to serve a use c

2. Read through Tale to see what was done (read the code, look at the output)

For this case, the user *does* need the **Derived Data** (they need everything, really).
For this case, the user *does* need the *Derived Data* (they need everything, really).
But they don't need the computational resources to re-run the analysis.

Advantages:
@@ -83,17 +86,17 @@ We can certainly save a lot of stuff. Whatever gets saved needs to serve a use c
We want to cover both of these use cases and covering use case 2 covers use case 1.
Therefore, we need to archive enough information so the user doesn't have to re-run the analysis to read the Tale.

Plan
----
Proposal
********

There are a lot of things we could do and, because of this, it makes a lot of sense to build this up in phases, starting just getting basic publishing work from WT -> (DataONE/Globus).
There are a lot of things we could archive; because of this, it makes a lot of sense to build this up in phases, starting with just getting basic publishing working from WT -> DataONE.

=============== ======= ======= ======= ====
Artifact        Phase 1 Phase 2 Phase 3 Note
=============== ======= ======= ======= ====
Registered Data N       N       N       Probably never (See below)
Uploaded Data   N       Y       Y
YT Manifest     Y       Y       Y
Recipe          Y       Y       Y
Dockerfile      Y       Y       Y
Script(s)       Y       Y       Y
Metadata        Y       Y       Y
@@ -103,76 +106,109 @@ PROV            N       N       Y

To make WholeTale useful/special, we really need to get to Phase 3.

Potential risks/problems:
Provenance
----------

- Authentication (see below for possible solutions)
- For publishing to DataONE:
- DataONE itself cannot be published to. New content can only come into DataONE through a Member Node
**Problem:** We need to capture provenance for Tales.

Possible solutions:
See `Provenance Capture <../provenance-capture/README.rst>`_

- Publish to an existing Member Node (KNB is a good candidate)
- Set up a dedicated Member Node just for WholeTale
- Unlikely: Don't publish into DataONE
Authentication
--------------

- Content in DataONE is wrapped up in Data Packages, which are essentially aggregations of files described by OAI-ORE Resource Maps, except that DataONE requires an XML metadata document in every DataPackage. Users creating Tales might not necessarily (1) understand this requirement or (2) want to fill in the information
See ongoing discussion https://github.com/whole-tale/wt-design-docs/issues/4

Possible solutions:
**Problem:** Right now, WT (Globus) Auth and DataONE auth aren't designed such that a user working within WT can write to DataONE, and this needs to be resolved if the user is going to save Tales or if the WT backend is going to be able to save Tales for the user.

- Allow the Tale manifest (YML) to act as the metadata record
- Author a minimal EML record for the Tale. EML only requires a title, creator, and contact (title <=> Tale title, creator/contact <=> Logged-in user)
- Support a new metadata format just for Tales
**Problem:** Globus and DataONE have different ways of identifying users (Subjects): In DataONE, we use strings like the user's LDAP DN or their ORCID. Globus Auth generates unique identifiers for each user. If a user creates content in DataONE, how is that linked to their work in WT?

- What if the user generates a massive file, how will we save that (or tell the user we won't?)
- Do we make DataONE trust Globus?

Provenance
----------
From what others on the team are saying, it sounds like we could essentially just store a Globus certificate on a DataONE CN and authenticate the incoming request from WholeTale with this cert. I don't really know how this would work.

See `Provenance Capture <../provenance-capture/README.rst>`_
Pros:

Authentication
--------------
- The user doesn't have to log into DataONE ever. Users hate logging into things.

Cons:

- Will require discussion with DataONE CI about the change
- Potentially incompatible with how DataONE likes to do things
- If a user archives a Tale from the Dashboard, the Objects may not show up in their profile on DataONE because the Globus subject is unlikely to match their identity in DataONE

- Give the user a way to retrieve and store a DataONE auth token in the Dashboard

Right now, WT (Globus) Auth and DataONE auth aren't designed such that a user working within WT can write to DataONE and this needs to be resolved if the user is going to save Tales or if the WT backend is going be able to save tales for the user.
Pros:

- Do we adopt one or the other auth system across both systems?
- Do we make DataONE trust Globus tokens?
- Do we make DataONE trust the WT backend and have the backend do the saving on behalf of the user?
- Requires no buy-in from DataONE and no codebase changes on the DataONE side
- Doesn't require storing a Globus cert on a DataONE CN which reduces complexity and maintenance
- The user will definitely be able to view/edit their content they create from the Dashboard once on DataONE because the Objects they create will have been created by their Subject

Easiest thing is probably to get a D1 token
Cons:

- If we choose to generate tokens with an 18-hour expiry, the user would have to get a token more than once, which is annoying and unusual for users
- The user would have to log into DataONE which is normal for third-party integrations but is still extra steps

- Set up a shadow account on Globus
This is from Kacper; I have no clue how any of this would work:

> Could also set up a shadow account on WT/Globus? that automatically connects the Globus user to DataONE. Would need to establish transitive trust between the two systems (DataONE needs to trust Globus)

Pros:

- It sounds like this would be seamless for the user

Cons:

- Not sure. Is this hard to maintain?
- (From above) If a user archives a Tale from the Dashboard, the Objects may not show up in their profile on DataONE because the Globus subject is unlikely to match their identity in DataONE

**Proposed solution:**

- Phase 1: Store a DataONE JWT in the Dashboard and send it with requests (see the sketch below)
- Phase 2: Decide on the above issues (either trust Globus w/in DataONE or stick with the storing a DataONE token approach)
- If we just store a DataONE token, build out UI/UX for supporting this in the Dashboard
- If we choose to trust Globus w/in DataONE, we need to implement that on the backend in WT
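
A minimal sketch of the Phase 1 approach, assuming the token is sent as a Bearer header (which is how DataONE JWTs are normally passed) and that the object is created via the MNStorage.create() REST call; the MN base URL, identifier, and file names are placeholders:

::

    import requests

    MN_BASE = 'https://example-mn.test.dataone.org/metacat/d1/mn'  # placeholder MN
    token = 'eyJhbGciOi...'  # DataONE JWT the user retrieved and the Dashboard stored

    resp = requests.post(
        MN_BASE + '/v2/object',
        headers={'Authorization': 'Bearer ' + token},
        files={
            'pid': (None, 'urn:uuid:1234-abcd'),   # placeholder identifier
            'object': open('tale.zip', 'rb'),      # the archived Tale artifact
            'sysmeta': open('sysmeta.xml', 'rb'),  # DataONE system metadata
        },
    )
    resp.raise_for_status()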

Metadata Creation
-----------------

General questions:

- How much metadata do we let/make the user submit?
- Which standard?
- How will the user generate it?

More coming in this section
**Problem:** To publish in DataONE, and also to make a useful Tale, we'll need a metadata record for the Tale.

Uploaded Data
-------------
**Possible solutions:**

- Are we going to archive this? Most likely
- Generate a minimal metadata record automatically for the user (w/o interaction)
- Create a minimal metadata editor in the Dashboard
- Send the user to the MetacatUI EML Editor pre-populated with files and metadata and let them finish the upload there

More coming in this section
**Proposed solution:**

Registered Data
---------------
- Phase 1: Automatically generate an EML record (see the sketch below)
- Phase 2: Offer a rich metadata-editing environment, either in the Dashboard or via MetacatUI
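
A rough sketch of what the automatically generated Phase 1 record could look like, built with the Python standard library; the EML version, packageId, system, and names are placeholders, and the record only carries a title, creator, and contact (about the minimum EML accepts):

::

    import xml.etree.ElementTree as ET

    EML_NS = 'eml://ecoinformatics.org/eml-2.1.1'
    ET.register_namespace('eml', EML_NS)

    def minimal_eml(tale_title, user_surname, package_id):
        eml = ET.Element('{%s}eml' % EML_NS,
                         {'packageId': package_id, 'system': 'wholetale'})
        dataset = ET.SubElement(eml, 'dataset')
        ET.SubElement(dataset, 'title').text = tale_title
        for tag in ('creator', 'contact'):  # reuse the logged-in WT user for both
            party = ET.SubElement(dataset, tag)
            name = ET.SubElement(party, 'individualName')
            ET.SubElement(name, 'surName').text = user_surname
        return ET.tostring(eml, encoding='unicode')

    print(minimal_eml('My Example Tale', 'Scientist', 'urn:uuid:5678-efgh'))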

- Maybe we won't archive this since registration comes from long-term archives
Saving to DataONE
-----------------

More coming in this section
**Problem:** DataONE itself cannot be published to directly; new content can only come into DataONE through a Member Node.

Saving to DataONE
-----------------------
Possible solutions:

Basically: Which MN do we save to?
- Publish to an existing Member Node (KNB is a good candidate)
- Set up a dedicated Member Node just for WholeTale
- Unlikely: Don't publish into DataONE

More coming in this section
**Proposed solution:**

Saving to Globus
----------------
- Phase 1: Publish to a test MN just to get things working
- Phase 2: Decide on whether to re-use a production MN or set up a new one and make that work

The user is already authenticated with Globus, so is this easy?
Other potential risks/problems
------------------------------

More coming in this section
- What if the user generates a massive file? How will we save that (or tell the user we won't)?
