# How to group many small tasks into larger ones

* **Difficulty level**: intermediate
* **Time needed to learn**: 10 minutes or less
* **Key points**:
  * Option `trunk_size` groups small tasks into larger ones
  * Option `trunk_workers` determines number of workers per master task

## The problem with many small tasks

From time to time you may need to run a large number of small tasks, such as millions of simulations or analyses of thousands of genes. Although each simulation or analysis takes just a few minutes to complete, the entire analysis takes a long time and needs to be performed on a cluster. However, most cluster systems do not welcome millions of small tasks, because managing such a large number of jobs is a burden on the scheduler.

## The bash script approach

Users usually work around this by running the analyses in batches, which works more or less like the following script if implemented in SoS:

In [1]:
input: for_each=dict(batch=range(4))

bash: args=f'{{filename}} {batch*4+1} {(batch+1)*4}'
   for id in `seq $1 $2`
   do
      echo "Processing $id"
   done

Processing 1
Processing 2
Processing 3
Processing 4
Processing 5
Processing 6
Processing 7
Processing 8
Processing 9
Processing 10
Processing 11
Processing 12
Processing 13
Processing 14
Processing 15
Processing 16


The `args` option here determines what is passed to the underlying `bash` command. It should contain `{filename}`, which stands for the name of the temporary script file generated by SoS. In this particular example we use

```
f'{{filename}} {batch*4+1} {(batch+1)*4}'
```
so that the following bash commands will be executed
```
bash {filename} 1 4
bash {filename} 5 8
bash {filename} 9 12
bash {filename} 13 16
```
for substeps with `batch` equal to `0`, `1`, `2` and `3` respectively.
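The expansion of the `args` string can be checked in plain Python, outside SoS. The sketch below is only an illustration: `batch_args` is a hypothetical helper introduced here, not part of SoS, and the batch size of 4 mirrors the example above. The doubled braces `{{filename}}` are escaped in the f-string, so they come out as the literal text `{filename}`.

```python
def batch_args(batch, size=4):
    """Return the argument string for one batch (hypothetical helper, not an SoS API)."""
    # {{filename}} is an escaped brace pair and stays as the literal {filename};
    # the other two fields are the first and last id of the batch.
    return f'{{filename}} {batch*size+1} {(batch+1)*size}'


# One command line per substep, for batch = 0, 1, 2, 3.
commands = [f'bash {batch_args(batch)}' for batch in range(4)]

for command in commands:
    print(command)
```

Each printed line corresponds to one of the four `bash` commands listed above.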

Now that we have a smaller number of jobs, we can submit the shell scripts to a batch system as tasks:

In [1]:
input: for_each=dict(batch=range(4))

task:
bash: args=f'{{filename}} {batch*4+1} {(batch+1)*4}'
   for id in `seq $1 $2`
   do
      echo "Processing $id"
   done

The tasks in this example are executed locally but you can send the tasks to a remote host using

```
task: queue='host'
```
or
```
%run -q host
```

## Grouping SoS tasks

<div class="bs-callout bs-callout-primary" role="alert">
  <h4>The <code>trunk_size</code> task option</h4>
  <p>The <code>trunk_size=n</code> option collects tasks into groups of size <code>n</code> before submitting them to an executor.</p>
</div>

The example above can be implemented much more easily with the `trunk_size` task option:

In [1]:
input: for_each=dict(id=range(16))

task: trunk_size=4
bash: expand=True
    echo "Processing {id+1}"   

In this example, 16 tasks are generated from 16 substeps, each running a bash script
```
echo "Processing {id+1}"
```
with `id` = `0`, ..., `15` respectively.

With option `trunk_size=4`, the 16 tasks are grouped into four master tasks of four subtasks each, with names starting with `M4_`.
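The grouping itself is easy to picture in plain Python. The following is a minimal sketch of the idea, not the way SoS implements it: `group_tasks` is a hypothetical function, and the assumption that tasks are grouped consecutively in substep order is made only for illustration.

```python
def group_tasks(task_ids, trunk_size):
    """Split task_ids into consecutive groups of at most trunk_size items.

    Hypothetical helper for illustration only; not part of SoS.
    """
    return [
        task_ids[start:start + trunk_size]
        for start in range(0, len(task_ids), trunk_size)
    ]


# 16 task ids (0..15) grouped four at a time, as with trunk_size=4.
groups = group_tasks(list(range(16)), trunk_size=4)

for number, group in enumerate(groups, start=1):
    print(f'group {number}: ids {group}')
```

With 16 ids and a group size of 4 this yields four groups, so the scheduler would see 4 jobs instead of 16.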

<div class="bs-callout bs-callout-primary" role="alert">
  <h4>The <code>trunk_workers</code> task option</h4>
  <p>The <code>trunk_workers=n</code> option specifies the number of concurrent workers in each master task.</p>
</div>

By default, a master task executes its subtasks sequentially. If a master task has a large number of subtasks and computing resources are available, you can specify another option, `trunk_workers`, to set the number of workers for each master task. For example, in the following SoS workflow, the 16 tasks are grouped into two master tasks, each with 8 subtasks. Two workers are created in each master task to process these subtasks.

In [1]:
input: for_each=dict(id=range(16))

task: trunk_size=8, trunk_workers=2
bash: expand=True
    echo "Processing {id+1}" 
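The combined effect of `trunk_size=8` and `trunk_workers=2` can be imitated in plain Python to make the division of work concrete. This is only a sketch under stated assumptions: `process` and `run_master_task` are hypothetical names, and a thread pool from the standard library stands in for whatever worker mechanism SoS actually uses.

```python
from concurrent.futures import ThreadPoolExecutor


def process(task_id):
    """Stand-in for one subtask (hypothetical); mirrors the echo in the example."""
    return f'Processing {task_id + 1}'


def run_master_task(subtask_ids, workers):
    """Run the subtasks of one group with a fixed number of concurrent workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the order of the input ids.
        return list(pool.map(process, subtask_ids))


ids = list(range(16))
# Two groups of 8 ids, as with trunk_size=8.
master_tasks = [ids[start:start + 8] for start in range(0, len(ids), 8)]
# Each group is handled by two workers, as with trunk_workers=2.
results = [run_master_task(group, workers=2) for group in master_tasks]

for number, lines in enumerate(results, start=1):
    print(f'master task {number}: {len(lines)} subtasks done')
```

In this sketch the two groups run one after another; on a cluster each master task would be submitted as a separate job.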