value of sos pack/unpack #850
Comments
Are there more concrete proposals? This, to me, can be tied to reproducibility. It would be nice if
The problem with file tracking is that many times, when we run a command with many outputs, we only pick a representative one as the result of the step. So
The "packing environment" proposal is reasonable, but I am not aware of any way to do it, given all the commands and their dependencies. The closest thing is singularity, which packs commands and data together for execution, but that is for another use case.
So something like
I think the point of bundling should be the bundling of input and output files, not necessarily the intermediate files, and the primary purpose should be reproducible analysis. That is to say, if we bundle a project, the project should be able to reproduce itself, except for certain verifiable requirements (executables, remote hosts). In other words, we should be able to include the following in bundles
The bundled package should be reproducible as long as the external requirements are met. If we leave "external" requirements outside of the bundle, we can create tar files with the required files, and allow bundles to be reproduced when the external requirements are met. If we would like to include "external" requirements, we should create docker images for the bundles so that the workflow can be executed. I am not sure how to achieve the latter, though.
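The docker-image idea above could look roughly like the following. This is a hypothetical sketch only, not anything SoS generates today; the file names (`analysis.sos`, `data/`), the base image, and the install command are all assumptions.

```dockerfile
# Hypothetical bundle image -- external requirements baked into the
# base layer, workflow source and tracked data copied on top.
FROM ubuntu:20.04
RUN apt-get update && apt-get install -y python3-pip \
    && pip3 install sos          # assumed environment setup
COPY analysis.sos /work/analysis.sos   # workflow source (assumed name)
COPY data/ /work/data/                 # tracked input files
WORKDIR /work
CMD ["sos", "run", "analysis.sos"]
```

With something like this, the bundle carries its own environment, at the cost of a much larger artifact than a plain tar file.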
The subcommands are now hidden. They will most likely be re-implemented after we consider more options, such as docker, to capture the entire environment.
I've been thinking about this. I do not think we should leave out external requirements but we can come back to it.
This means input and output data? And what's your definition of input/output -- in terms of the root (input to the root) and leaves (output from the leaves) of the DAG, not internal nodes? Or can we also bundle everything? I think for everything, some manual tarballing followed by untar +
For external requirements, files may be easy; but not the computational environment, in a cross-platform fashion. If we can propose a somewhat satisfactory solution, I'm sure it will greatly increase the popularity of SoS. However, nextflow's approach seems to be already one step ahead of us...
What do you mean by nextflow's approach? Nextflow provides a lot of support for S3, cloud, docker, singularity, etc., but I do not see a systematic approach (maybe there will never be one). I mean, we will need to first determine some goals and see how to achieve them. For example,
My plan is to disable these two commands, work on the S3 and singularity stuff as we move along, and see if at some point we see a clear need for some sort of bundling feature.
The cloud solution, like the DNAnexus cloud from nextflow, is appealing but very resource intensive, cannot last without commercial support (like MS's Azure Jupyter Cloud), and is not really portable because it ties the workflow to the cloud. My understanding is that "daily computational research" cannot afford a high level of reproducibility. As long as we can include all relevant information in the notebook (source code, software used, sessioninfo), the notebook is reproducible given enough resources. The only things missing are then the input and output files, which are the purpose of

Another idea is that, since bundling data with executables is expensive, maybe we can bundle them separately, something like a data/workflow bundle pointing to a specific version of some docker image...
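The separate data/workflow bundle idea above might be expressed as a small manifest. The following is a hypothetical format, not an existing SoS feature; every field name, the image name, and the tag are invented for illustration.

```yaml
# Hypothetical bundle manifest (not an existing SoS format).
# Data and workflow travel in a light tarball; the heavy executable
# environment is only referenced, via a pinned docker image.
bundle:
  workflow: analysis.sos        # workflow source (assumed name)
  data: data.tar.gz             # tracked input/output files
  environment:
    docker_image: example/tools # placeholder image name
    tag: "1.2.3"                # pin an exact version for reproducibility
```

The appeal of this split is that the cheap part (data + workflow) is archived per project, while the expensive part (executables) is versioned once and shared across bundles.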
Exactly! Every step is designed to run tools from one of these docker-like sources, and it also integrates tightly with the cloud. It'd be good enough to start with a less comprehensive approach, which to me should include all files but not executables. Resource files are included by default (anything being a valid

I agree with your "another idea", but I think it is the user's duty to actually bundle it. Can SoS provide enough information, or even generate configuration files for different docker-like tools, so that users can bundle the executables with ease, in a consistently versioned fashion?
Then that will not be our approach. sos should keep its grass-roots focus on daily data analysis, which will unlikely involve the cloud in the majority of cases. Let us learn from users and other workflow tools and determine a feature set later.
Note the archive option of snakemake. |
I started to question if we should keep the commands `sos pack` and `sos unpack`, because the former is basically `sos remove --untracked` plus `tar czf`, and the latter is basically `tar zxf`. Introducing new commands gives users the impression that the files can only be unpacked by sos, which makes them unwilling to use it.

So we should either make `pack`/`unpack` more useful and acceptable to users, or remove them.
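The claimed equivalence can be sketched with plain tar. This is a hedged sketch: the `sos remove --untracked` step is SoS-specific and is only emulated here with an `rm` of a made-up scratch file; all file names are invented for the demo.

```shell
set -e
# Sandbox: one tracked result file, one untracked scratch file.
mkdir -p demo/project
echo "final result" > demo/project/result.txt
echo "scratch"      > demo/project/tmp.log

# "sos pack" is roughly: drop untracked files, then archive.
rm demo/project/tmp.log               # stands in for: sos remove --untracked
tar czf demo/project.tar.gz -C demo/project .

# "sos unpack" is roughly just:
mkdir -p demo/elsewhere
tar zxf demo/project.tar.gz -C demo/elsewhere
cat demo/elsewhere/result.txt         # prints "final result"
```

Since standard tar can both create and open the archive, a dedicated unpack command adds little beyond branding, which is exactly the concern raised above.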