Disallow parameters in tasks and store common information from subtasks to master task #1185

Closed
BoPeng opened this issue Jan 25, 2019 · 6 comments

@BoPeng
Contributor

BoPeng commented Jan 25, 2019

There has been a reported case in which tasks are slow to create and submit. The problem traces back to the following scenario:

[sub]
parameter: par=list
input: loop
task:

[default]
sos_run('sub', par=very_long_list)

SoS passes very_long_list to sub so that parameter: par gets the value directly instead of reading it from the command line. The parameters are then sent to the substeps and, from there, to the tasks.

@BoPeng
Contributor Author

BoPeng commented Jan 25, 2019

There are several problems here. First, do we need to parse parameters in tasks?

[1]
output: 'a.txt'

task:
parameter: value = 'a'
sh: expand=True
  echo {value} > {_output}

actually works, but conceptually speaking we do not allow

sos run task_id --par A

because a task should encapsulate all of its information, which is what determines its unique ID, and allowing --par A defeats the purpose of this mechanism.

Therefore we should disallow the use of parameters in tasks. This also helps reduce the size of task files because par is no longer passed to tasks.

eb82512 disallows parameters in tasks.
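
To illustrate why a command-line override conflicts with a content-derived ID, here is a minimal sketch. The `task_id` helper and its hashing scheme are assumptions made for illustration only and are not SoS's actual implementation; the point is just that an ID computed from the task's content stops being meaningful if a value can change after the ID is fixed.

```python
import hashlib
import json


def task_id(script: str, variables: dict) -> str:
    """Derive an ID from everything the task needs to run (hypothetical scheme)."""
    # Serialize deterministically so that equal content always yields an equal ID.
    payload = json.dumps({"script": script, "vars": variables}, sort_keys=True)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


script = "echo {value} > {_output}"

# Same content, different key order: the ID is identical.
id_a = task_id(script, {"value": "a", "_output": "a.txt"})
id_again = task_id(script, {"_output": "a.txt", "value": "a"})

# A different value is a different task, so it must get a different ID.
id_b = task_id(script, {"value": "A", "_output": "a.txt"})

print(id_a == id_again, id_a == id_b)
```

If `value` could be overridden when the task is executed, the task stored under `id_a` would run with content that no longer matches its ID.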

@BoPeng
Contributor Author

BoPeng commented Jan 25, 2019

Another problem:

In the case of

[sub]
parameter: par=list
input: for_each='par'

print(_par)

[default]
sos_run('sub', par=list(range(100)))

The parameter is used by the step but not by any of the substeps, so passing the variable to every substep is a potentially substantial waste of zmq bandwidth and slows down the substeps.
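
A rough way to see the overhead is to compare the serialized size of what each substep would receive. This is only a sketch: it assumes substep payloads are pickled dictionaries, and the payload layout is hypothetical rather than taken from the SoS source.

```python
import pickle

par = list(range(100))

# Naive payload: every substep receives the whole list plus its own item.
naive = [pickle.dumps({"par": par, "_par": item}) for item in par]

# Trimmed payload: a substep that only prints _par needs just that one item.
trimmed = [pickle.dumps({"_par": item}) for item in par]

naive_bytes = sum(len(p) for p in naive)
trimmed_bytes = sum(len(p) for p in trimmed)
print(f"naive: {naive_bytes} bytes, trimmed: {trimmed_bytes} bytes")
```

The naive total grows with the product of the number of substeps and the length of the list, while the trimmed total grows only with the number of substeps.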

@BoPeng
Contributor Author

BoPeng commented Jan 25, 2019

The last problem, in case the long variables are really needed in tasks:

[sub]
parameter: par=list
input: for_each='par'

task: trunk_size=5
print(_par)
print(par)

[default]
sos_run('sub', par=list(range(10)))

This creates a jumbo task in which every subtask carries its own copy of the par variable. It would be nice to somehow share the variable at the master task level so that there is no need to save several copies of it.
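
One possible shape for this sharing is sketched below. The `factor_common` and `restore` helpers are hypothetical names, and representing each subtask as a plain dictionary of variables is an assumption; the actual task file format may differ.

```python
def factor_common(subtasks):
    """Split per-subtask variable dicts into (shared, per-subtask remainders)."""
    if not subtasks:
        return {}, []
    first = subtasks[0]
    # A variable is shared only if every subtask has it with an equal value.
    shared = {
        key: value
        for key, value in first.items()
        if all(key in sub and sub[key] == value for sub in subtasks[1:])
    }
    remainders = [
        {key: value for key, value in sub.items() if key not in shared}
        for sub in subtasks
    ]
    return shared, remainders


def restore(shared, remainder):
    """Rebuild the full variable dict of one subtask before running it."""
    return {**shared, **remainder}


par = list(range(10))
# Five subtasks grouped into one master task, as with trunk_size=5.
subtasks = [{"par": par, "_par": i} for i in par[:5]]

shared, remainders = factor_common(subtasks)
print(shared)      # stored once at the master task level
print(remainders)  # only the values that differ between subtasks
```

Each subtask can be reconstructed with `restore(shared, remainders[i])`, so nothing is lost by storing `par` only once.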

@gaow
Member

gaow commented Jan 25, 2019

Therefore we should disallow the use of parameters in tasks.

But the following scenario will still work,

[1]
parameter: a  = 1

task:
sh: expand = True
  echo {a}

because it does not declare a parameter inside the task, right? I think the answer is yes, judging from your "last problem" statement, although we can be smarter about it (by not saving many copies).

so passing the variable to every substep is a potentially substantial waste of zmq bandwidth and slows down the substeps.

Indeed. Looks like this is fixed?

@BoPeng
Contributor Author

BoPeng commented Jan 25, 2019

Yes, the example will work because a is passed as "used signature var", not as parameter.
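
The idea of passing only the variables that the task body actually uses can be sketched as follows. This is an assumption-laden illustration: `used_names` and `select_vars` are hypothetical helpers, and an f-string stands in for the `expand=True` script; it is not how SoS itself detects used variables.

```python
import ast


def used_names(source: str) -> set:
    """Return every identifier that the source code reads."""
    tree = ast.parse(source)
    return {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load)
    }


def select_vars(source: str, namespace: dict) -> dict:
    """Keep only the namespace entries that the source refers to."""
    names = used_names(source)
    return {key: value for key, value in namespace.items() if key in names}


namespace = {"a": 1, "unused": list(range(1000))}
body = "print(f'echo {a}')"  # stand-in for the task body

selected = select_vars(body, namespace)
print(selected)
```

Only `a` is selected, so the large `unused` list would never be sent along with the task.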

@BoPeng
Contributor Author

BoPeng commented Jan 25, 2019

The last patch improves efficiency for the (hopefully not corner) case of large variables in subtasks:

[sub]
parameter: par=list
input: for_each=dict(_par=range(1000))

task: trunk_size=500
print(_par)
assert par

[default]
sos_run('sub', par=[f'a_{i}' for i in range(10000)])

For this particular example, the task file shrank from 6M to 66K and the run time dropped from 90s to 23s. The 90s was mostly spent compressing the pickled dictionary.

@BoPeng BoPeng closed this as completed Jan 25, 2019
@BoPeng BoPeng changed the title Disallow parameters in tasks Disallow parameters in tasks and store common information from subtasks to master task Jan 25, 2019