The yw-matlab module contains a YesWorkflow (YW) module and command line tools for interacting with the DataONE MATLAB RunManager. These tools make the RunManager's records of script runs and data provenance available to YW so that this retrospective provenance information can be queried in terms of, and in combination with, the prospective provenance revealed by YW annotations in the MATLAB scripts.
This repository also contains example MATLAB scripts marked up with YW annotations, the products of running YW on the scripts, and example prospective and retrospective queries of the provenance records jointly created by YW and the DataONE RunManager for runs of each script.
Problems solved by this package
The key feature currently implemented by this package is the ability to use the RunManager's authoritative list of input and output files for a script run when reconstructing the detailed provenance of a script's outputs using YesWorkflow.
This capability promises to be superior to YW's default behavior of searching the file system for files that match the URI templates declared for @OUT ports in the script when reconstructing script runs. The default approach fails, for example, when the outputs from multiple runs of a script intermingle in the same directory or in a single directory tree. In such situations the core, language-neutral implementation of YesWorkflow currently cannot determine which files are associated with a particular run. Because the MATLAB RunManager records all the files that are read or written by a run of a MATLAB script, the extended version of YW that uses this information does not share this weakness.
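The difference between the two discovery strategies can be illustrated with a minimal Python sketch. It is not the YesWorkflow or RunManager implementation: the `{name}` template syntax handling, the helper names `template_to_regex` and `discover_outputs`, and all file paths are hypothetical and chosen only for illustration.

```python
import re

def template_to_regex(template):
    """Turn a template such as 'outputs/map_{region}.png' into a regex (illustrative only)."""
    # Splitting on '{name}' yields literal text at even indices, variable names at odd ones.
    parts = re.split(r"\{(\w+)\}", template)
    pattern = "".join(
        re.escape(part) if i % 2 == 0 else r"[^/]+"
        for i, part in enumerate(parts)
    )
    return re.compile(pattern)

def discover_outputs(template, candidate_paths):
    """Return the candidate paths that match the template, in sorted order."""
    regex = template_to_regex(template)
    return sorted(p for p in candidate_paths if regex.fullmatch(p))

# Hypothetical situation: two runs wrote into the same outputs/ directory.
files_on_disk = ["outputs/map_NA.png", "outputs/map_EU.png", "outputs/notes.txt"]
# Hypothetical list of files recorded for the second run only.
recorded_for_run_2 = ["inputs/mask.nc", "outputs/map_EU.png"]

template = "outputs/map_{region}.png"

# Searching the file system cannot tell the two runs apart.
default_result = discover_outputs(template, files_on_disk)
# Matching against the recorded list yields only the files of that run.
recorded_result = discover_outputs(template, recorded_for_run_2)

print(default_result)
print(recorded_result)
```

In this sketch the same matching logic is applied in both cases; only the set of candidate paths changes, which is what lets a recorded file list disambiguate runs.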
YesWorkflow extracts variable values from file paths that match the URI templates declared via YW annotations in the same way whether the paths were recorded by the MATLAB RunManager or discovered on the file system by the default reconstruction mechanism. Consequently, both approaches to run resource discovery support the same kinds of queries of the retrospective provenance.
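As a rough illustration of this point, the following Python sketch extracts variable values from paths that match a template and shows that the result does not depend on where the paths came from. The template, the file names, and the helpers `template_to_regex` and `extract_variables` are assumptions made for this example, not part of YesWorkflow or the RunManager.

```python
import re

def template_to_regex(template):
    """Build a regex with one named group per '{name}' variable (illustrative only)."""
    parts = re.split(r"\{(\w+)\}", template)
    pattern = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>[^/]+)"
        for i, part in enumerate(parts)
    )
    return re.compile(pattern)

def extract_variables(template, paths):
    """Map each matching path to the variable values bound by the template."""
    regex = template_to_regex(template)
    bindings = {}
    for path in paths:
        match = regex.fullmatch(path)
        if match:
            bindings[path] = match.groupdict()
    return bindings

# Hypothetical template and file names.
template = "inputs/precip/precip_{year}_{month}.nc"
paths_from_file_system = [
    "inputs/precip/precip_2000_01.nc",
    "inputs/precip/precip_2001_12.nc",
]
paths_from_run_records = [
    "inputs/precip/precip_2001_12.nc",
    "inputs/precip/precip_2000_01.nc",
    "outputs/result.nc",  # recorded for the run, but does not match this template
]

from_disk = extract_variables(template, paths_from_file_system)
from_records = extract_variables(template, paths_from_run_records)

# The extracted values support the same kind of query either way,
# for example the range of years covered by the input files.
years = sorted({b["year"] for b in from_records.values()})
print(years[0], years[-1])
```

Because the extraction step only sees paths, a query such as the year range above gives the same answer for either source as long as both sources list the same matching files.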
Layout of repository
|Directory||Description|
|src/main/resources/scripts||MATLAB scripts for working with the RunManager and exporting provenance records for importing into YesWorkflow.|
|src/main/resources/examples||Example scripts and associated data files, provenance records, and results of prospective and retrospective provenance queries for runs of each script.|
|src/main/java/org/yesworkflow/matlab||Java package that extends YesWorkflow with the ability to import script run records exported from the MATLAB RunManager.|
|src/test/java/org/yesworkflow/matlab||Unit tests for the Java classes in the
The example script and data files currently used to exercise the tools in this repository are based on those used in the DataONE MATLAB Toolbox Walk-through.
Files for this example are located at src/main/resources/examples/C3_C4_mapping. Layout of directories and files within this example directory:
|Directory or file||Description|
|C3_C4_map_present_NA.m||The MATLAB script marked up with YW annotations.|
|inputs/||Script input data files.|
|outputs/||Outputs from one run of the script.|
|yw/||YesWorkflow configuration and output files.|
|yw/xsb/extractfacts.P, yw/xsb/modelfacts.P, yw/xsb/reconfacts.P||Fact files created by YesWorkflow and containing the prospective and retrospective provenance captured by YW.|
|yw/xsb/rules.P||Logic rules for querying the YW-written facts.|
|yw/xsb/extract_queries.P, yw/xsb/model_queries.P, yw/xsb/recon_queries.P||Queries employing the rules defined in rules.P.|
|yw/xsb/run_queries.sh||Bash script for running all of the queries.|
|yw/xsb/run_queries.txt||The output produced by running all of the queries using run_queries.sh.|
YesWorkflow combined view of script
YW provenance queries
The following is a summary of the queries posed by running yw/xsb/run_queries.sh and the corresponding results.
Queries about the script and the YW annotations extracted from it
See extract_queries.P for definitions of the queries.
| Query | Result
EQ1 | What source files were YW annotations extracted from? |
EQ2 | What are the names of all program blocks in the script? |
EQ3 | What out ports are qualified with URIs? |
Queries about the workflow model of the script (prospective provenance)
See model_queries.P for definitions of the queries.
| Query | Result
MQ1 | Where is the definition of program block fetch_monthly_mean_precipitation_data? | SourceFile=
MQ2 | What is the name of the top-level workflow? |
MQ3 | What are the names of the program blocks comprising the workflow? |
MQ4 | What are the names of the program blocks in the workflow that produce workflow outputs? |
MQ5 | What are the inputs to the script? |
MQ6 | What data is output by program block
MQ7 | What program blocks provide input directly to
MQ8 | What programs have input ports that receive data
MQ9 | How many ports read data
MQ10 | How many data are read by more than one port in workflow
MQ11 | What program blocks are immediately downstream of
MQ12 | What program blocks are immediately upstream of
MQ13 | What program blocks are upstream of
MQ14 | What program blocks are anywhere downstream of
MQ15 | What data is immediately downstream of
MQ16 | What data is immediately upstream of
MQ17 | What data is downstream of
MQ18 | What data is upstream of
MQ19 | What URI variables are associated with reads of data
MQ20 | What URI variables do data read into mean_airtemp have in common? |
Queries about a run of the script (retrospective provenance)
See recon_queries.P for definitions of the queries.
| Query | Result
RQ1 | What input files were used to compose the precipitation array
RQ2 | How many input files were used to compose the air temperature array
RQ3 | What input files provided the data used to derive the workflow output
RQ4 | What is the range of years over which the data in the mean_precip input files were collected? | StartYear=
RQ5 | What months of the year do the mean_airtemp input files correspond to? |