Data provenance mechanism #35

gselzer · 2021-06-18T17:12:23Z

This PR introduces a new mechanism, OpHistory, to provide data provenance for Ops and Op outputs involved with the matcher. This OpHistory provides a static repository of information about Op executions. For each top-level call to the matcher, a universally-unique identifier (UUID) is created. Then, Ops stores:

A Graph<OpInfo> describing the hierarchy of Ops involved in every return produced by the DefaultOpEnvironment
The top-level Op returned by the call to the matcher, along with the OpWrapper result (if used)
A history of the executions performed by the Op (when the OpWrapper was utilized), along with the output produced

Using these recorded Objects, we can then answer the following questions:

OpHistory.executionsUpon(Object o): For a given Object output, what Op(s) were responsible for creating/mutating this Object?
OpHistory.opExecutionChain(Object op): For a given Object op, which OpInfo(s) were tasked with creating/populating the OpDependencies needed to create this Object?

The answers to these questions allow us to obtain the OpInfos (and thus the backing algorithms) needed to construct an Op output in a reproducible way.
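As a rough illustration of the executionsUpon question, here is a minimal sketch of a history keyed by output Object. It is not the code in this PR: ExecutionRecord and SketchOpHistory are hypothetical names, the String opName stands in for an OpInfo, and keying by identity with an IdentityHashMap is an assumption made for the example. The Graph<OpInfo> side of the history is not modeled.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

/** Hypothetical execution record; a stand-in for the PR's OpExecutionSummary. */
final class ExecutionRecord {
	final UUID callId;   // ID of the top-level matcher call
	final String opName; // stand-in for the OpInfo that was executed
	final Object output; // the Object the Op created or mutated

	ExecutionRecord(UUID callId, String opName, Object output) {
		this.callId = callId;
		this.opName = opName;
		this.output = output;
	}
}

/** Hypothetical history: one Deque of executions per output Object. */
final class SketchOpHistory {
	// Keyed by identity (==), so two equal-but-distinct outputs stay separate.
	private final Map<Object, Deque<ExecutionRecord>> byOutput = new IdentityHashMap<>();

	/** @return true iff {@code e} was logged */
	boolean addExecution(ExecutionRecord e) {
		if (e == null || e.output == null) return false;
		// Appending to the tail of an ArrayDeque is amortized constant time.
		byOutput.computeIfAbsent(e.output, k -> new ArrayDeque<>()).addLast(e);
		return true;
	}

	/** Executions that produced or mutated {@code o}, oldest first. */
	List<ExecutionRecord> executionsUpon(Object o) {
		Deque<ExecutionRecord> d = byOutput.get(o);
		return d == null ? new ArrayList<>() : new ArrayList<>(d);
	}
}
```

Note that this sketch holds strong references to every output it has seen, so a real implementation would need some policy for resetting or pruning the history.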

This PR also refactors the outputs of various DefaultOpEnvironment methods to better facilitate the population of the OpHistory. When DefaultOpEnvironment.op is called, the general scheme that should be followed is:

match the Op
store the Op along with its corresponding OpInfo in the op cache
ask the cache for the Op, wrap the Op, and return the Op to the user

This new process allows us to store the "raw" Op in the cache, instead of the wrapped Op, better facilitating Op cache hits.
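The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: RawOpCacheSketch, the String request key, the single known Op "math.square", and the executions list are all hypothetical and do not correspond to the actual DefaultOpEnvironment code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.function.Function;

/** Hypothetical environment illustrating the match / cache / wrap order. */
final class RawOpCacheSketch {
	// The cache holds unwrapped ("raw") Ops, keyed by a hypothetical request name.
	private final Map<String, Function<Double, Double>> cache = new HashMap<>();
	// IDs of recorded executions, in order; stands in for the history.
	final List<UUID> executions = new ArrayList<>();
	// Counts how often the matching step actually ran.
	int matchCount = 0;

	// Step 1: match the Op. Only one Op is known in this sketch.
	private Function<Double, Double> match(String request) {
		matchCount++;
		if ("math.square".equals(request)) return x -> x * x;
		throw new IllegalArgumentException("No Op matches: " + request);
	}

	Function<Double, Double> op(String request) {
		// Steps 1 and 2: match only on a cache miss and store the raw Op;
		// then ask the cache for it.
		Function<Double, Double> raw = cache.computeIfAbsent(request, this::match);
		// Step 3: wrap the raw Op; each call gets a wrapper with a fresh ID.
		UUID id = UUID.randomUUID();
		return x -> {
			Double out = raw.apply(x);
			executions.add(id); // record the execution under this call's ID
			return out;
		};
	}
}
```

Because the wrapper is created after the cache lookup, two requests for the same Op share one cached raw Op while still being recorded under different IDs.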

TODO:

Remove this HACK, determine when/how the history should be reset
Open an issue about persisting the OpHistory, possibly using "scijava-persist: persisting objects with Gson and scijava discovery mechanism" (#30). The usage of UUIDs should aid in the storage.
Determine what API pertaining to logging Op history (if any) belongs in OpEnvironment. Should data provenance be something that all OpEnvironments implement, or something limited to our DefaultOpEnvironment?
Consider the return of a Graph<OpInfo> for OpHistory.opExecutionChain: this is exposing Google Guava data structures, which is not ideal. We need to be able to return a Tree of some sort, and this seems to be the best answer.

Closes scijava/scijava#53

hinerm · 2021-06-25T16:43:22Z

Remove this HACK, determine when/how the history should be reset

Possible solution: make the OpHistory a service so that each time we create a Context we get a fresh history

This commit makes two performance improvements. First, it removes the need to store the inputs. What is the point of storing the inputs if you know that they are never modified? If you want to know how a particular Object came to be the way it is, just look at the Ops that modified it, i.e. the Ops for which this Object was an output. Second, it changes the backing Object of the OpHistory to a LinkedDeque. This allows us to add to the end faster, removing the need to iterate through the history in order to add something new.

There are benefits to reworking the caching system and to refactoring the functionality of findOpInstance. The goal is to have findOpInstance place Ops DIRECTLY into the cache, and NOT return anything. Then Ops can be returned directly from the cache. This gives us more control over Op wrapping and retains a more "raw" version of the Op in the cache. We then push Op wrapping to the cache. TODO: Use the hints to only record executions of top-level Ops. What works now is re-wrapping an Op with a new UUID. What we DON'T yet do is re-wrap its dependencies with the same UUID.

We want to make sure that all external calls to the matcher generate a new UUID. Since we want to make a MatchingConditions anyway, let's just make all public API generate a MatchingConditions, which will generate a new UUID. Also fixes a bug in the DependencyMatching Hint.

gselzer · 2021-07-14T20:43:21Z

Possible solution: make the OpHistory a service so that each time we create a Context we get a fresh history

@hinerm thanks for the suggestion, I just implemented it.

gselzer · 2021-07-14T21:17:18Z

Determine what API pertaining to logging Op history (if any) belongs in OpEnvironment. Should data provenance be something that all OpEnvironments implement, or something limited to our DefaultOpEnvironment?

If OpHistory is a Service, then we do not need to access that API through OpEnvironment; we can access the OpHistoryService directly through the Context.

Consider the return of a Graph for OpHistory.opExecutionChain: this is exposing Google Guava data structures, which is not ideal. We need to be able to return a Tree of some sort, and this seems to be the best answer.

For lack of a better option, I say we keep this for now. @hinerm @ctrueden if you have other thoughts, feel free to express them.

Otherwise, let me know what you think about merging @hinerm

ctrueden

🚀 🎸 🏆 But change everything. Just kidding, change some things. 😉

For others: we are planning to merge this as is, but later, after reviewing the other PRs in the chain. They will all merge as is (barring major issues), but then we'll have another iteration of work on top of it all, based on these reviews. 👍

scijava/scijava-ops/src/main/java/org/scijava/ops/OpEnvironment.java

scijava/scijava-ops/src/main/java/org/scijava/ops/conversionLoss/LossReporterWrapper.java

scijava/scijava-ops/src/main/java/org/scijava/ops/hints/BaseOpHints.java

scijava/scijava-ops/src/main/java/org/scijava/ops/hints/Hints.java

scijava/scijava-ops/src/main/java/org/scijava/ops/provenance/OpHistoryService.java

ctrueden · 2021-08-10T20:55:47Z

scijava/scijava-ops/src/main/java/org/scijava/ops/provenance/impl/DefaultOpHistory.java

+	 * @return true iff {@code e} was successfully logged
+	 */
+	@Override
+	public boolean addExecution(OpExecutionSummary e) {


We discussed how simplification and adaptation should generate IDs for their OpInfos such that they can be reconstructed from those IDs later. Therefore, we need to make sure we test the history functionality with simplification, adaptation, and any other cases relating to ID generation and reexecution.

gselzer (Aug 12, 2021): IDs added with 9fcbc9e

...ava/scijava-ops/src/main/java/org/scijava/ops/provenance/impl/SingletonOpHistoryService.java

scijava/scijava-ops/src/main/java/org/scijava/ops/simplify/SimplifiedOpRef.java

scijava/scijava-ops/src/main/java/org/scijava/ops/util/OpWrapper.java

gselzer requested review from ctrueden and hinerm June 18, 2021 17:12

gselzer force-pushed the scijava/scijava-ops/data-provenance branch 3 times, most recently from d176eb6 to 21b3fb5 Compare June 21, 2021 17:47

gselzer and others added 17 commits July 14, 2021 15:26

OpHistory infrastructure: first cut

d3c112b

Use Javassist to bake Types into an Object

a669a0a

Use OpWrappers to add executions to the history

c6eb911

Greatly simplify OpExecutionSummary API

0ae635a

This commit removes the two new calls from the OpExecutionSummary API, resulting in less overhead in maintaining the Op history.

OpHistory: throw IAE when tracing primitives

119900a

Test provenance

846b5f0

OpMethod: fix naming bug

6bbe8b4

Not sure how this slipped past in the original work, but there were issues in the naming of Javassist classes generated by OpMethodInfos.

Make OpHistory a Map of Deques

6b4005e

executionsUpon: return entire dependency tree

059b536

OpEnvironment: fix typo in javadoc

f7ab76f

Remove unhelpful API

93556d8

Clean/format code

093e08b

Service-ize OpHistory

5f3611a

This creates ONE OpHistory per context, and removes the static nature of the class in favor of state kept within the Context.

Rename OpHistory to OpHistoryService

ba2e33e

gselzer force-pushed the scijava/scijava-ops/data-provenance branch from 2c3250b to ba2e33e Compare July 14, 2021 20:37

gselzer marked this pull request as ready for review July 14, 2021 21:17

gselzer added 2 commits July 15, 2021 10:08

Clean up

6352b3c

We made a couple changes:
* Changed packages for a couple classes to prevent package dependency cycles, and to separate API from implementations
* Removed a second OpInstance implementation, not sure how two implementations came about :P

De-service-ize OpHistory

d7c54be

We instead create an OpHistoryService that can provide an OpHistory. This can be used as long as we have an OpService.

gselzer mentioned this pull request Aug 5, 2021

Consolidate Parameter annotation data into Op Javadoc #28

Merged

3 tasks

ctrueden reviewed Aug 10, 2021

ctrueden mentioned this pull request Aug 20, 2021

Review and merge outstanding scijava-ops-related PRs scijava/scijava#75

Closed

25 tasks

ctrueden merged commit bf3018c into main Aug 23, 2021

ctrueden deleted the scijava/scijava-ops/data-provenance branch August 23, 2021 17:34

gselzer mentioned this pull request Nov 15, 2021

Data Provenance Revisions #46

Merged
