value of sos pack/unpack #850

Closed
BoPeng opened this issue Nov 9, 2017 · 12 comments

Comments

@BoPeng
Contributor

BoPeng commented Nov 9, 2017

I started to question whether we should keep the commands sos pack and sos unpack, because the former is basically sos remove --untracked plus tar czf, and the latter is basically tar zxf. Introducing new commands gives users the impression that the files can only be unpacked by sos, which makes them reluctant to use the commands.

So we should either make pack/unpack more useful/acceptable to users, or remove them.
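
To make the point concrete, here is roughly what the two commands boil down to (a sketch; the exact flags and archive name are illustrative):

```sh
# What "sos pack" effectively amounts to:
sos remove --untracked        # drop files not tracked by the workflow
tar czf project.tar.gz .      # archive what is left

# What "sos unpack" effectively amounts to:
tar zxf project.tar.gz
```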

@gaow
Member

gaow commented Nov 9, 2017

So we should either make pack/unpack more useful/acceptable to users

Are there more concrete proposals? This, to me, can be tied to reproducibility. It would be nice if pack/unpack could somehow clone the entire computational environment and ship it. Not sure how best to implement that, though.

@BoPeng
Contributor Author

BoPeng commented Nov 9, 2017

The problem with file tracking is that many times when we run a command with many outputs, we only pick a representative one as the result of the step. So

  1. sos pack tends to miss some files.
  2. sos pack --include can be long and still miss something (see the sketch below).
  3. If we manually remove unwanted files, sos pack lacks an --all option, and even with it there is no advantage of sos pack over tar czf.
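
To give a concrete sense of point 2, an --include list quickly becomes long and error-prone compared to a plain tarball (the file names below are hypothetical):

```sh
# A hypothetical --include invocation; file names are illustrative
sos pack --include data/sample1.bam data/sample2.bam results/summary.csv figures/fig1.pdf

# versus simply archiving the whole project directory
tar czf project.tar.gz .
```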

The "packing environment" proposal is reasonable, but I am not aware of any way to do it because of all the commands and their dependencies that would have to be captured. The closest thing is Singularity, which packs commands and data together for execution, but that serves a different use case.

@BoPeng
Contributor Author

BoPeng commented Nov 9, 2017

So something like .sosignore to mimic .gitignore could be useful for this particular case.
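
A sketch of what a (currently hypothetical) .sosignore could look like, mirroring .gitignore syntax:

```sh
# .sosignore is a proposal, not an existing SoS feature; the patterns are examples
cat > .sosignore <<'EOF'
*.log
tmp/
intermediate/
EOF
```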

@BoPeng
Contributor Author

BoPeng commented Jan 8, 2018

I think the point of bundling should be to bundle input and output files, not necessarily the intermediate files, and the primary purpose should be reproducible analysis. That is to say, if we bundle a project, the project should be reproducible by itself, except for certain verifiable requirements (executables, remote hosts). So, if we can include the following in bundles:

  1. metadata (keywords, descriptions, etc.)
  2. "external" requirements (executable targets, reference genome files, etc.)
  3. input and output data

The bundled package should be reproducible as long as the external requirements are met.

If we leave the "external" requirements outside of the bundle, we can create tar files with the required files, and the bundles are reproducible as long as the external requirements are met. If we would like to include the "external" requirements, we would have to create docker images for the bundles so that the workflow can be executed. I am not sure how to achieve the latter, though.
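
A rough sketch of the first option, a plain tar bundle that carries metadata and input/output data but leaves the external requirements to be verified on the receiving side (layout and file names are illustrative, not an existing SoS format):

```sh
mkdir -p bundle
cp analysis.sos bundle/            # the workflow itself
cp -r data/ results/ bundle/       # input and output data
# metadata and declared external requirements
cat > bundle/META.yml <<'EOF'
description: example analysis bundle
keywords: [example]
requires:
  executables: [bwa, samtools]
  resources: [hg38.fa]
EOF
tar czf bundle.tar.gz bundle/
```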

BoPeng pushed a commit that referenced this issue Feb 2, 2018
@BoPeng
Contributor Author

BoPeng commented Feb 2, 2018

The subcommands are now hidden. They will most likely be re-implemented after we consider more options, such as docker, to capture the entire environment.

@gaow
Member

gaow commented Feb 2, 2018

I've been thinking about this. I do not think we should leave out external requirements but we can come back to it.

we can create tar files with required files

This means input and output data? And what's your definition of input / output -- in terms of the root (input to root) and leaves (output from leaves) of the DAG, not internal nodes? Or should we also bundle everything?

I think for the bundle-everything case, some manual tarballing followed by untar + -s build is good enough (though a bit manual). What is truly valuable is bundling only the essential stuff.
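
The manual route might look like this (workflow and file names are assumptions for illustration; -s build rebuilds signatures from files that already exist):

```sh
# Sending side: tar everything by hand
tar czf project.tar.gz analysis.sos data/ results/

# Receiving side: untar and rebuild signatures from the existing files
tar zxf project.tar.gz
sos run analysis.sos -s build
```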

@gaow
Member

gaow commented Feb 2, 2018

For external requirements, files may be easy, but the computational environment is not, especially in a cross-platform fashion. If we can propose a somewhat satisfactory solution, I'm sure it will greatly increase the popularity of SoS. However, nextflow's approach already seems to be one step ahead of us ...

@BoPeng
Contributor Author

BoPeng commented Feb 2, 2018

What do you mean by nextflow's approach? Nextflow provides a lot of support for S3, cloud, docker, singularity, etc., but I do not see a systematic approach (maybe there will never be one).

I mean, we will first need to determine some goals and see how to achieve them. For example:

  1. Do we want to provide a complete working environment for "complete" reproducibility? That means OS, data, program, etc.
  2. If we want something less comprehensive, how much less?
  3. Do we want to encapsulate only the workflow (say, workflow as an image), or the entire analysis?

My plan is to disable these two commands, work on the S3 and singularity stuff as we move along, and see whether at some point a clear need for some sort of bundling feature emerges.

@BoPeng
Contributor Author

BoPeng commented Feb 2, 2018

The cloud solution, like the DNAnexus cloud from nextflow, is appealing but very resource intensive, cannot last without commercial support (like MS's Azure Jupyter Cloud), and is not really portable because it ties the workflow to the cloud.

My understanding is that "daily computational research" cannot afford a high level of reproducibility. As long as we can include all relevant information in the notebook (source code, software used, sessioninfo), the notebook is reproducible given enough resources. The only things missing are then the input and output files, which are the purpose of pack/unpack, but those commands do not support sos notebook.

Another idea: since bundling data with executables is expensive, maybe we can bundle them separately, e.g. a data/workflow bundle pointing to a specific version of some docker image...
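
One way to realize this, sketched under the assumption that the image is pinned by digest and the reference is shipped next to the data (image name and layout are illustrative):

```sh
# Resolve a tag to an immutable digest and record it alongside the data bundle
mkdir -p bundle
docker pull rocker/r-ver:4.2.0
docker inspect --format '{{index .RepoDigests 0}}' rocker/r-ver:4.2.0 > bundle/IMAGE
tar czf data_bundle.tar.gz bundle/ analysis.sos data/
```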

@gaow
Member

gaow commented Feb 2, 2018

What do you mean by nextflow's approach?

Exactly! That every step is designed to run tools from one of these docker-like sources, and that it also integrates tightly with the cloud.

It'd be good enough to start with a less comprehensive approach, which to me should include all files but not executables. Resource files are included by default (anything that is a valid path object), along with the input to the DAG root and the output from the DAG leaves. All other files are intermediate, and one can opt to include them or not.

I agree with your "another idea", but I think it is the user's duty to actually bundle it. Can SoS provide enough information, or even generate configuration files for different docker-like tools, so that users can bundle the executables with ease, in a consistently versioned fashion?

@BoPeng
Contributor Author

BoPeng commented Feb 2, 2018

That every step are designed to run tools from one of these docker-like sources, that it also integrates tightly with cloud.

Then that will not be our approach. sos should stay grounded in grass-roots daily data analysis, which in the majority of cases is unlikely to involve the cloud.

Let us learn from users and other workflow tools and determine a feature set later.

@BoPeng
Contributor Author

BoPeng commented Feb 28, 2018

Note the archive option of snakemake.
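
For reference, snakemake can produce a self-contained workflow archive with something along the lines of:

```sh
snakemake --archive workflow-archive.tar.gz
```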
