Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Match input/output file grouping #41

Closed
gaow opened this issue Mar 20, 2016 · 11 comments
Closed

Match input/output file grouping #41

gaow opened this issue Mar 20, 2016 · 11 comments

Comments

@gaow
Copy link
Member

gaow commented Mar 20, 2016

The ticket stems from discussion in a closed issue and for better organization I make it separate issue here. The problem is that when group_by and pattern are both used in input: in conjunction with pattern in output:, the output file names is then based on input file names yet it does not honor the group_by rule from input:, thus creating name mismatch in upcoming actions.

@BoPeng
Copy link
Contributor

BoPeng commented Mar 20, 2016

Please provide an example. The current rules are

  1. group_by generates _input.
  2. pattern deciphers _step.input and generates matching variables for _input (similar to paired_with)
  3. _output should be generated for each _input
  4. _step.output consists of unique names of _ouputs.

@gaow
Copy link
Member Author

gaow commented Mar 20, 2016

The example is to try to get this work. Currently the problem is exactly that _output does not match the group_by behavior of input. In my example, _input passes _step.input individually due to group_by option, but _output passes _step.output altogether.

My intended behavior is like this:

INFO: Execute DSC_10: 
INFO: _step.input: ['7b6a5af72eaaf6a9d5ad0d4ca21eb349.rds', '8e595e2a1634ae5d0b0a2dff6a0bba88.rds']
INFO: _step.output: ['b2eba77d1a492d0ec3aac735064b45c3.rds', '352d5e05ef793b7eac5d58f18baa2aad.rds']
INFO: _step.depends: []
INFO: _idx: 0
INFO: _input: ['7b6a5af72eaaf6a9d5ad0d4ca21eb349.rds']
INFO: _output: ['b2eba77d1a492d0ec3aac735064b45c3.rds']
INFO: _idx: 1
INFO: _input: ['8e595e2a1634ae5d0b0a2dff6a0bba88']
INFO: _output: [ '352d5e05ef793b7eac5d58f18baa2aad.rds']

where _step.output is everything, matching _step.input and _output dive into the loop, matching _input.

@BoPeng
Copy link
Contributor

BoPeng commented Mar 20, 2016

All you need to do is to change

output: pattern = 'exec=mean.R.{base}.${output_suffix}'

to

output: pattern = 'exec=mean.R.{_base}.${output_suffix}'

Because _base matches _input, base matches _step.input. I am not sure how to document this well if even you are having trouble understanding it.

@gaow
Copy link
Member Author

gaow commented Mar 20, 2016

Oh that's a typo! Actually I did document that myself after you introduced that Option pattern, that when group_by is used then base and _base are different. My old version worked at least for this but I accidentally git purged it from disk so I wrote this new one that did not work -- that's why I thought it was a new bug.

I think I've proposed to dedicate _ to loop variables, and find something else to private variable _step; for example if self.input confuses Python we use this.input. That will likely cause less confusion if one can always have loop vs non-loop variable in mind. Am I missing something obvious that makes the proposal not valid?

@BoPeng
Copy link
Contributor

BoPeng commented Mar 20, 2016

I just tried to summarize the effect of these options in
https://github.com/bpeng2000/SOS/wiki/Documentation#summary with a figure.
_step indicates the information in _step is temporary and will not be seen
from other step unless step option alias is used.

On Sun, Mar 20, 2016 at 12:21 AM, gaow notifications@github.com wrote:

Oh that's a typo! Actually I did document that myself after you introduced
that Option pattern, that when group_by is used then base and _base are
different. My old version worked at least for this but I accidentally git
purged it from disk so I wrote this new one that did not work -- that's why
I thought it was a new bug.

I think I've proposed to dedicate _ to loop variables, and find something
else to private variable _step. That will likely cause less confusion if
one can always have loop vs non-loop variable in mind. Am I missing
something obvious that makes the proposal not valid?


You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#41 (comment)

@gaow
Copy link
Member Author

gaow commented Mar 20, 2016

Great. Yes it is a nice summary! I just thought that since the only exception to _ being loop variable is _step, then if we change _step.input to this.input it might be more clear. After all, not all temporary variables are prefixed by _. For example, variables generated by pattern, e.g., {base} in my previous example, are local, too (right?).

@gaow
Copy link
Member Author

gaow commented Mar 20, 2016

Without group_by or for_each, what I said it is still logically Ok because it is understandable that if there is no loop involved then a loop variable means the same as the original.

@BoPeng
Copy link
Contributor

BoPeng commented Mar 20, 2016

Well, we can even use input, output and say _input and _output are
inside-loop version of input and output, but we will have to revert our
alias section option as well. Anyway, _step.input and _step.output is NOT
SUPPOSED to be used in SoS script so the name should not matter that much.
Let us keep this in mind and see if something better comes up.

On Sun, Mar 20, 2016 at 12:35 AM, gaow notifications@github.com wrote:

Great. Yes it is a nice summary! I just thought that since the only
exception to _ being loop variable is _step, then if we change _step.input
to this.input it might be more clear. After all, not all temporary
variables are prefixed by _. For example, variables generated by pattern,
e.g., {base} in my previous example, are local, too (right?).


You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#41 (comment)

@BoPeng
Copy link
Contributor

BoPeng commented Mar 21, 2016

How about we make all input and output publicly available? Instead of using local _step, we use step name such as human_20 with the same attributes (name, index, input, output, depends, perhaps even more). I used _step because I would like to discourage cross-reference using step name but it is users' choice which method they want to use.

Right now

[20: alias='align']

[30]
output: ...

result = _step.output[1]

[40]
input: align.output, result

Proposed

[20: alias='align']

[30]
output: ...

result = default_20.output[1]

[40]
input: align.output, result, defult_30.input

@gaow
Copy link
Member Author

gaow commented Mar 25, 2016

I think this might cause problem and we should stick with alias. In my application I have something like this:

[A_1: alias = 'A']
[A_2: alias = 'A']
[B_1]
...
params = A.output
...
[B_2]
...
params = A.output
...

[Final_1=A_1+B_1]
[Final_1=A_1+B_2]
[Final_1=A_2+B_1]
[Final_1=A_2+B_2]

and all I'm concerned is sos run ... Final. I'm not sure if this is legal syntax with the current version (need to find it out) but for things like this it is dangerous to hard-code step names into another step

@BoPeng
Copy link
Contributor

BoPeng commented Mar 25, 2016

I agree. I have learned the hard way from VPT not to use step index in other steps. Let us stick with _step for now.

@BoPeng BoPeng closed this as completed Mar 25, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants