Handle distinct directories with same content #26

Open
simleo opened this issue May 9, 2023 · 0 comments
Labels
enhancement New feature or request

Comments

Collaborator

simleo commented May 9, 2023

When serializing executions of workflows that take directory parameters, CWLProv does not create corresponding directories in the RO bundle: rather, files are always placed in directories whose name consists of the first two characters of the file's sha1 checksum.
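
To illustrate the layout described above, here is a minimal sketch of how a file's location could be derived from its content. The helper name `cwlprov_style_path` is hypothetical, and naming the file after its full checksum is an assumption made for illustration; only the two-character directory prefix comes from the description above.

```python
import hashlib
from pathlib import PurePosixPath


def cwlprov_style_path(content: bytes) -> PurePosixPath:
    """Return '<first two sha1 chars>/<sha1>' for the given file content.

    Hypothetical helper: the file name (full checksum) is an assumption,
    only the two-character directory prefix is taken from the issue text.
    """
    digest = hashlib.sha1(content).hexdigest()
    return PurePosixPath(digest[:2]) / digest


# Two files with identical content land on the same path,
# regardless of the directory they originally lived in.
path_in_foo = cwlprov_style_path(b"dummy")
path_in_bar = cwlprov_style_path(b"dummy")
print(path_in_foo == path_in_bar)  # True
```

The original directory name plays no role in the computed path, which is why the directory structure has to be rebuilt during conversion.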

When converting from CWLProv we recreate the original directories, giving them a name obtained by concatenating the sorted checksums of all contained files and computing the checksum of the concatenation. This means that directories with the same contents end up being mapped to the same directory in the output RO-Crate. This is especially convenient to avoid data duplication between workflow parameters and tool parameters: for instance, when a directory is an input of the workflow and also of the first step.
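
The naming scheme can be sketched roughly as follows. This is not the converter's actual code: the function name `dir_name_from_checksums` is hypothetical, and both the use of sha1 for the final checksum and the plain concatenation of hex digests without a separator are assumptions.

```python
import hashlib


def dir_name_from_checksums(file_checksums):
    """Derive a directory name from the checksums of the files it contains.

    Hypothetical helper. Assumptions: hex digests are sorted, concatenated
    with no separator, and hashed again with sha1.
    """
    joined = "".join(sorted(file_checksums))
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


# foo and bar each contain one file with the same content, hence the
# same file checksum (value copied from the metadata in this issue).
foo_files = ["0c8b9d6f753e8d8ec9276bfe98e993a133847642"]
bar_files = ["0c8b9d6f753e8d8ec9276bfe98e993a133847642"]

# Same contents give the same directory name, so the two collapse into one.
print(dir_name_from_checksums(foo_files) == dir_name_from_checksums(bar_files))
```

Since the name depends only on the file checksums, any two directories with identical contents are indistinguishable after conversion, whatever they were called originally.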

However, there are cases where we might not want to do that. For instance, suppose that a workflow takes an array of two directories as input:

cwlVersion: v1.2
class: Workflow
requirements:
  ScatterFeatureRequirement: {}

inputs:
  dir_array: Directory[]
outputs: []

steps:
  date_step:
    label: Prints date of input dirs
    scatter: dir
    in:
      dir: dir_array
    out: []
    run: dirdate.cwl

Where dirdate.cwl is:

cwlVersion: v1.2
class: CommandLineTool
baseCommand: [date, "-r"]

inputs:
  dir:
    type: Directory
    inputBinding:
      position: 1
outputs: []

Suppose the workflow is launched with the following parameters:

dir_array:
  - class: Directory
    location: foo
  - class: Directory
    location: bar

Where foo and bar have the same contents, e.g., they both contain a text file whose content is the string "dummy". What we currently get in the RO-Crate is:

{
    "@id": "packed.cwl#main/dir_array",
    "@type": "FormalParameter",
    "additionalType": "Dataset",
    "multipleValues": "True",
    "name": "dir_array"
},
...
{
    "@id": "#pv-main/dir_array",
    "@type": "PropertyValue",
    "exampleOfWork": {
        "@id": "packed.cwl#main/dir_array"
    },
    "name": "dir_array",
    "value": [
        {
            "@id": "df3cc24afc943eab58469eebaff500a2a4a823c5/"
        },
        {
            "@id": "df3cc24afc943eab58469eebaff500a2a4a823c5/"
        }
    ]
},
...
{
    "@id": "df3cc24afc943eab58469eebaff500a2a4a823c5/",
    "@type": "Dataset",
    "alternateName": "foo",
    "exampleOfWork": {
        "@id": "packed.cwl#dirdate.cwl/dir"
    },
    "hasPart": [
        {
            "@id": "df3cc24afc943eab58469eebaff500a2a4a823c5/0c8b9d6f753e8d8ec9276bfe98e993a133847642"
        }
    ]
},

Note that the duplicate id in the value of #pv-main/dir_array is a bug: the list should contain only one copy, since the duplicate makes no sense in the RO-Crate JSON-LD. Also, the Dataset has an alternateName of "foo", while "bar" does not appear in the metadata. Thus, in this case, the representation does not reflect the fact that the workflow took a list of two distinct directories as input.
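
A quick way to spot the duplicated reference in generated metadata is sketched below. The entity is copied from the output above (with `exampleOfWork` omitted for brevity); the helper `duplicated_ids` is hypothetical and not part of any existing API.

```python
from collections import Counter

# PropertyValue entity as it currently appears in the generated metadata.
property_value = {
    "@id": "#pv-main/dir_array",
    "@type": "PropertyValue",
    "name": "dir_array",
    "value": [
        {"@id": "df3cc24afc943eab58469eebaff500a2a4a823c5/"},
        {"@id": "df3cc24afc943eab58469eebaff500a2a4a823c5/"},
    ],
}


def duplicated_ids(entity):
    """Return the sorted @ids referenced more than once in entity['value'].

    Hypothetical helper, written only to illustrate the problem.
    """
    values = entity.get("value", [])
    if not isinstance(values, list):
        values = [values]
    counts = Counter(
        v["@id"] for v in values if isinstance(v, dict) and "@id" in v
    )
    return sorted(id_ for id_, n in counts.items() if n > 1)


print(duplicated_ids(property_value))
# ['df3cc24afc943eab58469eebaff500a2a4a823c5/']
```

Removing the duplicate would fix the malformed list, but it would still leave no trace of "bar" in the crate.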

simleo added the enhancement (New feature or request) label on May 9, 2023