@gselzer
Member

@gselzer gselzer commented Jun 18, 2021

This PR introduces a new mechanism, OpHistory, to provide data provenance for Ops and Op outputs involved with the matcher. OpHistory provides a static repository of information about Op executions. For each top-level call to the matcher, a universally unique identifier (UUID) is created. Ops then stores:

  • A Graph<OpInfo> describing the hierarchy of Ops involved in every return produced by the DefaultOpEnvironment
  • The top-level Op returned by the call to the matcher, along with the OpWrapper result (if used)
  • A history of the executions performed by the Op (when the OpWrapper was utilized), along with the output produced

Using these recorded Objects, we can then answer the following questions:

  • OpHistory.executionsUpon(Object o): For a given Object output, what Op(s) were responsible for creating/mutating this Object?
  • OpHistory.opExecutionChain(Object op): For a given Object op, which OpInfo(s) were used to create this Object and to populate the OpDependency fields it needs?

The answers to these questions allow us to obtain the OpInfos (and thus the backing algorithms) needed to construct an Op output in a reproducible way.
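
As a rough illustration of the executionsUpon query, here is a minimal sketch of an identity-keyed execution log. Every name in it (HistorySketch, Execution, the String Op name) is hypothetical and chosen for illustration; it is not the OpHistory code in this PR, and it leaves out the Graph<OpInfo> bookkeeping entirely.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch: none of these types are the real scijava-ops classes.
final class HistorySketch {

    /** One recorded execution: which Op (by name) ran under which matcher call. */
    static final class Execution {
        final UUID callId;
        final String opName;

        Execution(UUID callId, String opName) {
            this.callId = callId;
            this.opName = opName;
        }
    }

    // Keyed by object identity rather than equals(), so two equal but
    // distinct outputs keep separate histories.
    private final Map<Object, List<Execution>> byOutput = new IdentityHashMap<>();

    /** Records that the Op named opName created or mutated output. */
    synchronized void addExecution(UUID callId, String opName, Object output) {
        byOutput.computeIfAbsent(output, k -> new ArrayList<>())
            .add(new Execution(callId, opName));
    }

    /** Returns, in execution order, a copy of every execution that wrote to output. */
    synchronized List<Execution> executionsUpon(Object output) {
        List<Execution> found = byOutput.get(output);
        return found == null ? Collections.emptyList() : new ArrayList<>(found);
    }
}
```

Note that this sketch holds strong references to every output it has seen, so it only makes sense alongside some policy for resetting the history.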

This PR also refactors the outputs of various DefaultOpEnvironment methods to better facilitate the population of the OpHistory. When DefaultOpEnvironment.op is called, the general scheme that should be followed is:

  1. match the Op
  2. store the Op along with its corresponding OpInfo in the op cache
  3. ask the cache for the Op, wrap the Op, and return the Op to the user

This new process allows us to store the "raw" Op in the cache, instead of the wrapped Op, better facilitating Op cache hits.
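
The three steps above can be sketched roughly as follows. This is a simplified illustration, not the DefaultOpEnvironment code: OpCacheSketch and WrappedOp are hypothetical names, the cache is keyed by a plain String instead of storing the OpInfo, and Ops are modeled as Function<Double, Double>.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical sketch of the match / cache raw / wrap-on-return scheme.
final class OpCacheSketch {

    /** Wrapper tagging a raw Op with the UUID of the call that requested it. */
    static final class WrappedOp implements Function<Double, Double> {
        final Function<Double, Double> raw;
        final UUID callId;

        WrappedOp(Function<Double, Double> raw, UUID callId) {
            this.raw = raw;
            this.callId = callId;
        }

        @Override
        public Double apply(Double in) {
            return raw.apply(in);
        }
    }

    // Only raw (unwrapped) Ops are stored, so repeated requests can share them.
    private final Map<String, Function<Double, Double>> cache = new HashMap<>();

    WrappedOp op(String name, Supplier<Function<Double, Double>> matcher) {
        // Steps 1 and 2: run the matcher only on a cache miss, and store the raw Op.
        Function<Double, Double> raw = cache.computeIfAbsent(name, k -> matcher.get());
        // Step 3: wrap the cached raw Op for this particular call.
        return new WrappedOp(raw, UUID.randomUUID());
    }
}
```

Because the wrapper is created per call, two requests for the same Op share one cached raw instance while still carrying distinct UUIDs.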

TODO:

  • Remove this HACK, determine when/how the history should be reset
  • Form an issue about persisting the OpHistory, possibly using scijava-persist : persisting objects with Gson and scijava discovery mechanism #30. The usage of UUID should aid in the storage.
  • Determine what API pertaining to logging Op history (if any) belongs in OpEnvironment. Should data provenance be something that all OpEnvironments implement, or something limited to our DefaultOpEnvironment?
  • Consider the return of a Graph<OpInfo> for OpHistory.opExecutionChain: this is exposing Google Guava data structures, which is not ideal. We need to be able to return a Tree of some sort, and this seems to be the best answer.

Closes scijava/scijava#53

@gselzer gselzer requested review from ctrueden and hinerm June 18, 2021 17:12
@gselzer gselzer force-pushed the scijava/scijava-ops/data-provenance branch 3 times, most recently from d176eb6 to 21b3fb5 Compare June 21, 2021 17:47
@hinerm
Member

hinerm commented Jun 25, 2021

Remove this HACK, determine when/how the history should be reset

Possible solution: make the OpHistory a service so that each time we create a Context we get a fresh history

gselzer and others added 17 commits July 14, 2021 15:26
This commit makes two performance improvements. Firstly, it removes the
need to store the inputs. What is the point of storing the inputs, if
you know that the inputs are never modified? If you want to know how a
particular Object came to be the way it is, just look at the Ops that
modified the Object, i.e. the Ops for which this Object was an output.

Secondly, it changes the backing Object of the OpHistory to a LinkedDeque.
This allows us to add faster to the end, removing the need to iterate
through the history in order to add something new.
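
The deque change described in this commit message might look roughly like the sketch below. It is an assumption-laden illustration: java.util.concurrent.ConcurrentLinkedDeque stands in for the "LinkedDeque" named above, since the actual class used is not shown here, and DequeHistorySketch with its String entries is hypothetical.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedDeque;

// Hypothetical sketch; ConcurrentLinkedDeque is an assumed stand-in.
final class DequeHistorySketch {

    private final Deque<String> history = new ConcurrentLinkedDeque<>();

    /** Appends at the tail without walking the existing entries. */
    void record(String execution) {
        history.addLast(execution);
    }

    /** Returns a snapshot of the history, oldest entry first. */
    List<String> snapshot() {
        return new ArrayList<>(history);
    }
}
```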

This commit removes the two new calls from the OpExecutionSummary API,
resulting in less overhead in maintaining the Op history.

Not sure how this slipped past in the original work, but there were
issues in the naming of Javassist classes generated by OpMethodInfos.

There are benefits to reworking the caching system and to refactoring the
functionality of findOpInstance. The goal is to have findOpInstance
place Ops DIRECTLY into the cache, and NOT return anything. Then Ops can
be returned directly from the cache. This gives us more control over Op
wrapping and retains a more "raw" version of the Op in the cache. We
then push Op wrapping to the cache.

TODO: Use the hints to only record executions of top-level Ops. What
already works is re-wrapping an Op with a new UUID. What we DON'T yet
do is re-wrap its dependencies with the same UUID.

We want to make sure that all external calls to the matcher generate a
new UUID. Since we want to make a MatchingConditions anyways, let's just
make all public API generate a MatchingConditions, which will generate a
new UUID.

Also fixes a bug in the DependencyMatching Hint.
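
One way to picture the "every public call creates a MatchingConditions with a new UUID" idea is the sketch below. The class shape, the from factory, and the forDependency method are all hypothetical and are not the MatchingConditions API in this PR. That dependencies should inherit the parent's UUID is an assumption drawn from the TODO in the earlier commit message.

```java
import java.util.UUID;

// Hypothetical sketch; not the real MatchingConditions class.
final class MatchingSketch {

    static final class MatchingConditions {
        final String opName;
        final UUID executionId;

        private MatchingConditions(String opName, UUID executionId) {
            this.opName = opName;
            this.executionId = executionId;
        }

        /** Entry point for external callers: always creates a fresh UUID. */
        static MatchingConditions from(String opName) {
            return new MatchingConditions(opName, UUID.randomUUID());
        }

        /** Assumed internal path: a dependency lookup reuses its parent's UUID. */
        MatchingConditions forDependency(String dependencyName) {
            return new MatchingConditions(dependencyName, executionId);
        }
    }
}
```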

This creates ONE OpHistory per context, and removes the static nature of
the class in favor of state kept within the Context.
@gselzer gselzer force-pushed the scijava/scijava-ops/data-provenance branch from 2c3250b to ba2e33e Compare July 14, 2021 20:37
@gselzer
Member Author

gselzer commented Jul 14, 2021

Possible solution: make the OpHistory a service so that each time we create a Context we get a fresh history

@hinerm thanks for the suggestion, I just implemented it.

@gselzer
Member Author

gselzer commented Jul 14, 2021

Determine what API pertaining to logging Op history (if any) belongs in OpEnvironment. Should data provenance be something that all OpEnvironments implement, or something limited to our DefaultOpEnvironment?

If OpHistory is a Service, then we do not need to access that API through OpEnvironment; we can just directly access the OpHistoryService through the Context.
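
To illustrate why a per-Context service gives each Context a fresh history, here is a toy sketch. ContextSketch and History are hypothetical stand-ins, not the SciJava Context or the OpHistoryService from this PR; the sketch only models "one lazily created instance per service type per context".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical toy model of context-scoped services.
final class ContextSketch {

    /** Stand-in for a history service; it just holds a log. */
    static final class History {
        final List<String> log = new ArrayList<>();
    }

    private final Map<Class<?>, Object> services = new HashMap<>();

    /** Returns this context's instance of type, creating it on first request. */
    <S> S service(Class<S> type, Supplier<S> factory) {
        return type.cast(services.computeIfAbsent(type, k -> factory.get()));
    }
}
```

Under this model, state that used to be static lives inside the context, so discarding the context discards the history with it.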

Consider the return of a Graph for OpHistory.opExecutionChain: this is exposing Google Guava data structures, which is not ideal. We need to be able to return a Tree of some sort, and this seems to be the best answer.

For lack of a better option, I say we keep this for now. @hinerm @ctrueden if you have other thoughts, feel free to express them.

Otherwise, let me know what you think about merging, @hinerm.

@gselzer gselzer marked this pull request as ready for review July 14, 2021 21:17
gselzer added 2 commits July 15, 2021 10:08
We made a couple of changes:
* Changed packages for a couple of classes to prevent package dependency
cycles, and to separate API from implementations
* Removed a second OpInstance implementation, not sure how two
implementations came about :P

We instead create an OpHistoryService that can provide an OpHistory.
This can be used as long as we have an OpService
Member

@ctrueden ctrueden left a comment

🚀 🎸 🏆 But change everything. Just kidding, change some things. 😉

For others: we are planning to merge this as is, but later, after reviewing the other PRs in the chain. They will all merge as is (barring major issues), but then we'll have another iteration of work on top of it all, based on these reviews. 👍

* @return true iff {@code e} was successfully logged
*/
@Override
public boolean addExecution(OpExecutionSummary e) {
Member

We discussed how simplification and adaptation should generate IDs for their OpInfos such that they can be reconstructed from those IDs later. Therefore, we need to make sure we test the history functionality with simplification, adaptation, and any other cases relating to ID generation and reexecution.

Member Author

IDs added with 9fcbc9e

@ctrueden ctrueden merged commit bf3018c into main Aug 23, 2021
@ctrueden ctrueden deleted the scijava/scijava-ops/data-provenance branch August 23, 2021 17:34
@gselzer gselzer mentioned this pull request Nov 15, 2021

Development

Successfully merging this pull request may close these issues.

Construct a mechanism for data provenance

4 participants