value of sos pack/unpack #850

Closed
BoPeng opened this issue Nov 9, 2017 · 12 comments

Comments

@BoPeng
Contributor

BoPeng commented Nov 9, 2017

I started to question whether we should keep the commands sos pack and sos unpack, because the former is basically sos remove --untracked plus tar czf, and the latter is basically tar zxf. Introducing new commands gives users the impression that the files can only be unpacked by sos, which makes them reluctant to use the commands.

So we should either make pack/unpack more useful/acceptable to users, or remove them.
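
To make the point concrete, here is roughly what the two commands boil down to (a sketch; the exact flags and archive name are illustrative):

```sh
# What "sos pack" effectively amounts to:
sos remove --untracked        # drop files not tracked by the workflow
tar czf project.tar.gz .      # archive what is left

# What "sos unpack" effectively amounts to:
tar zxf project.tar.gz
```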

@gaow
Member

gaow commented Nov 9, 2017

So we should either make pack/unpack more useful/acceptable to users

Are there more concrete proposals? This, to me, can be tied to reproducibility. It would be nice if pack/unpack could somehow clone the entire computational environment and ship it. Not sure how best to implement that, though.

@BoPeng
Contributor Author

BoPeng commented Nov 9, 2017

The problem with file tracking is that many times when we run a command with many outputs, we only pick a representative one as the result of the step. So

  1. sos pack tends to miss some files.
  2. sos pack --include can be long and still miss something (see the sketch below).
  3. If we manually remove unwanted files, sos pack lacks an --all option, and even with it there is no advantage of sos pack over tar czf.
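
To give a concrete sense of point 2, an --include list quickly becomes long and error-prone compared to a plain tarball (the file names below are hypothetical):

```sh
# A hypothetical --include invocation; file names are illustrative
sos pack --include data/sample1.bam data/sample2.bam results/summary.csv figures/fig1.pdf

# versus simply archiving the whole project directory
tar czf project.tar.gz .
```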

The "packing environment" proposal is reasonable, but I am not aware of any way to do it because of all the commands and their dependencies that would have to be captured. The closest thing is Singularity, which packs commands and data together for execution, but that serves a different use case.

@BoPeng
Contributor Author

BoPeng commented Nov 9, 2017

So something like .sosignore to mimic .gitignore could be useful for this particular case.
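
A sketch of what a (currently hypothetical) .sosignore could look like, mirroring .gitignore syntax:

```sh
# .sosignore is a proposal, not an existing SoS feature; the patterns are examples
cat > .sosignore <<'EOF'
*.log
tmp/
intermediate/
EOF
```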

@BoPeng
Contributor Author

BoPeng commented Jan 8, 2018

I think the point of bundling should be to bundle input and output files, not necessarily the intermediate files, and the primary purpose should be reproducible analysis. That is to say, if we bundle a project, the project should be reproducible by itself, except for certain verifiable requirements (executables, remote hosts). So, if we can include the following in bundles:

  1. metadata (keywords, descriptions, etc.)
  2. "external" requirements (executable targets, reference genome files, etc.)
  3. input and output data

The bundled package should be reproducible as long as the external requirements are met.

If we leave the "external" requirements outside of the bundle, we can create tar files with the required files, and the bundles are reproducible as long as the external requirements are met. If we would like to include the "external" requirements, we would have to create docker images for the bundles so that the workflow can be executed. I am not sure how to achieve the latter, though.
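
A rough sketch of the first option, a plain tar bundle that carries metadata and input/output data but leaves the external requirements to be verified on the receiving side (layout and file names are illustrative, not an existing SoS format):

```sh
mkdir -p bundle
cp analysis.sos bundle/            # the workflow itself
cp -r data/ results/ bundle/       # input and output data
# metadata and declared external requirements
cat > bundle/META.yml <<'EOF'
description: example analysis bundle
keywords: [example]
requires:
  executables: [bwa, samtools]
  resources: [hg38.fa]
EOF
tar czf bundle.tar.gz bundle/
```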

BoPeng pushed a commit that referenced this issue Feb 2, 2018
@BoPeng
Contributor Author

BoPeng commented Feb 2, 2018

The subcommands are now hidden. They will most likely be re-implemented after we consider more options, such as docker, to capture the entire environment.

@gaow
Member

gaow commented Feb 2, 2018

I've been thinking about this. I do not think we should leave out external requirements but we can come back to it.

we can create tar files with required files

This means input and output data? And what's your definition of input / output -- in terms of the root (input to root) and leaves (output from leaves) of the DAG, not internal nodes? Or should we also bundle everything?

I think for the bundle-everything case, some manual tarballing followed by untar + -s build is good enough (though a bit manual). What is truly valuable is bundling only the essential stuff.
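
The manual route might look like this (workflow and file names are assumptions for illustration; -s build rebuilds signatures from files that already exist):

```sh
# Sending side: tar everything by hand
tar czf project.tar.gz analysis.sos data/ results/

# Receiving side: untar and rebuild signatures from the existing files
tar zxf project.tar.gz
sos run analysis.sos -s build
```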

@gaow
Member

gaow commented Feb 2, 2018

For external requirements, files may be easy, but the computational environment is not, especially in a cross-platform fashion. If we can propose a somewhat satisfactory solution, I'm sure it will greatly increase the popularity of SoS. However, nextflow's approach already seems to be one step ahead of us ...

@BoPeng
Contributor Author

BoPeng commented Feb 2, 2018

What do you mean by nextflow's approach? Nextflow provides a lot of support for S3, cloud, docker, singularity, etc., but I do not see a systematic approach (maybe there will never be one).

I mean, we will first need to determine some goals and see how to achieve them. For example:

  1. Do we want to provide a complete working environment for "complete" reproducibility? That means OS, data, program, etc.
  2. If we want something less comprehensive, how much less?
  3. Do we want to encapsulate only the workflow (say, workflow as an image), or the entire analysis?

My plan is to disable these two commands, work on the S3 and singularity stuff as we move along, and see whether at some point a clear need for some sort of bundling feature emerges.

@BoPeng
Contributor Author

BoPeng commented Feb 2, 2018

The cloud solution, like the DNAnexus cloud from nextflow, is appealing but very resource intensive, cannot last without commercial support (like MS's Azure Jupyter Cloud), and is not really portable because it ties the workflow to the cloud.

My understanding is that "daily computational research" cannot afford a high level of reproducibility. As long as we can include all relevant information in the notebook (source code, software used, sessioninfo), the notebook is reproducible given enough resources. The only things missing are then the input and output files, which are the purpose of pack/unpack, but those commands do not support sos notebook.

Another idea: since bundling data with executables is expensive, maybe we can bundle them separately, e.g. a data/workflow bundle pointing to a specific version of some docker image...
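
One way to realize this, sketched under the assumption that the image is pinned by digest and the reference is shipped next to the data (image name and layout are illustrative):

```sh
# Resolve a tag to an immutable digest and record it alongside the data bundle
mkdir -p bundle
docker pull rocker/r-ver:4.2.0
docker inspect --format '{{index .RepoDigests 0}}' rocker/r-ver:4.2.0 > bundle/IMAGE
tar czf data_bundle.tar.gz bundle/ analysis.sos data/
```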

@gaow
Member

gaow commented Feb 2, 2018

What do you mean by nextflow's approach?

Exactly! That every step is designed to run tools from one of these docker-like sources, and that it also integrates tightly with the cloud.

It'd be good enough to start with a less comprehensive approach, which to me should include all files but not executables. Resource files are included by default (anything that is a valid path object), along with the input to the DAG root and the output from the DAG leaves. All other files are intermediate, and one can opt to include them or not.

I agree with your "another idea", but I think it is the user's duty to actually bundle it. Can SoS provide enough information, or even generate configuration files for different docker-like tools, so that users can bundle the executables with ease, in a consistently versioned fashion?

@BoPeng
Contributor Author

BoPeng commented Feb 2, 2018

That every step are designed to run tools from one of these docker-like sources, that it also integrates tightly with cloud.

Then that will not be our approach. sos should stay grounded in grass-roots daily data analysis, which in the majority of cases is unlikely to involve the cloud.

Let us learn from users and other workflow tools and determine a feature set later.

@BoPeng
Contributor Author

BoPeng commented Feb 28, 2018

Note the archive option of snakemake.
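
For reference, snakemake can produce a self-contained workflow archive with something along the lines of:

```sh
snakemake --archive workflow-archive.tar.gz
```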
