# How to work with a PBS/Torque cluster system

* **Difficulty level**: easy
* **Time needed to learn**: 10 minutes or less
* **Key points**:
  * a
  

#### `queue_type`

Option `queue_type` determines the type of remote server or job queue. SoS currently supports the following types of job queues:

1. **`process`**: This is the default queue type. Tasks are executed directly, either on the local host or on a server.
2. **`pbs`**: A PBS/MOAB/LSF/Slurm cluster system where tasks are submitted using commands such as `qsub`.
3. **`rq`**: A redis queue where tasks are submitted to the rq server and monitored through rq-dashboard.

### PBS/Torque configuration

#### `job_template`

Option `job_template` is a template of a shell script that will be submitted to the PBS system. A typical template would be specified as follows (using a multi-line string literal of YAML):

```yaml
hosts:
  server:
    job_template: |
      #!/bin/bash
      #PBS -N {task}
      #PBS -l nodes=1:ppn={cores}
      #PBS -l walltime={walltime}
      #PBS -l mem={mem//10**6}MB
      #PBS -o {cur_dir}/{task}.out
      #PBS -e {cur_dir}/{task}.err
      #PBS -q long
      #PBS -m ae
      #PBS -M your@email.address
      #PBS -v {cur_dir}
      
      cd {cur_dir}
      
      sos execute {task} -v {verbosity} -s {sig_mode} \
        {'--dryrun' if run_mode == 'dryrun' else ''}
```

The template will be interpolated with the following information:

* `task`: task id
* `nodes`, `cores`, `walltime`, `mem`: resource task options
* `cur_dir`: current project directory, which will be translated to the corresponding path on the remote host if the task is executed remotely
* `verbosity` and `sig_mode`: sos run mode.
* `run_mode`: allows the script to be executed in dryrun mode, in which scripts are printed instead of executed. It is very important to pass this option because, when sos is running in dryrun mode (`sos run -q pbs -n`), the job script is executed directly (on the head node) instead of being sent to the PBS queue.
* Other key/value pairs you defined for the server
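
To make the interpolation concrete, here is a minimal Python sketch of how `{...}` expressions in a template could be expanded. It is not SoS's actual implementation: the `interpolate` helper and the sample values in `task_vars` (including `mem` being given in bytes) are assumptions for illustration, and the sketch simply evaluates each brace-delimited expression as a Python expression against the supplied variables.

```python
import re

def interpolate(template, variables):
    """Expand each {expr} by evaluating expr as a Python expression.

    Hypothetical helper for illustration only; SoS has its own interpolation logic.
    """
    def evaluate(match):
        # eval() is acceptable here only because the template is trusted input.
        return str(eval(match.group(1), {}, dict(variables)))
    return re.sub(r"\{([^{}]+)\}", evaluate, template)

# Assumed example values; in practice these come from task options and the host entry.
task_vars = {
    "task": "t1a2b3",
    "cores": 4,
    "walltime": "36:00:00",
    "mem": 8_000_000_000,  # assumed to be in bytes
    "run_mode": "dryrun",
}

header = interpolate("#PBS -l nodes=1:ppn={cores}", task_vars)
memory = interpolate("#PBS -l mem={mem//10**6}MB", task_vars)
flag = interpolate("{'--dryrun' if run_mode == 'dryrun' else ''}", task_vars)

print(header)
print(memory)
print(flag)
```

Because `//` is floor division, a `mem` value smaller than the divisor renders as `0` under these assumptions, so choose the unit in the template (`10**6` for MB, `10**9` for GB) to match the sizes you expect to request.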

Note that
1. You will need to specify resource options (`nodes`, `cores`, `walltime`, and `mem`) as task options if they are used in the job template without default values.
2. If you need to specify more options (e.g. queue name), you can define multiple host entries with different options, for example, a `cluster-short` and a `cluster-long` on the same cluster.
3. Alternatively, it is possible to derive the options from existing runtime options. For example, you could send a task to the long queue if it runs for more than 24 hours:

    ```
    #PBS -q {'long' if int(walltime.split(':')[0]) > 24 else 'short'}
    ```
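
The queue-selection expression can be checked outside a template. The sketch below wraps the same logic in a hypothetical `pick_queue` function (the function name and the `HH:MM:SS` walltime format are assumptions) to show how boundary values behave.

```python
def pick_queue(walltime, threshold_hours=24):
    """Return 'long' if the hour field of walltime exceeds threshold_hours."""
    hours = int(walltime.split(':')[0])  # only the hour field is inspected
    return 'long' if hours > threshold_hours else 'short'

for walltime in ("02:00:00", "24:30:00", "48:00:00"):
    print(walltime, pick_queue(walltime))
```

Since only the hour field is compared, a walltime such as `24:30:00` still maps to `short`; comparing with `>=` in the template is one way to send such jobs to the long queue.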


#### `submit_cmd`

A `submit_cmd` template is the command that will be executed to submit the job. It accepts the same set of variables as `job_template`, with an additional variable `job_file` pointing to the location of the job file on the remote host. The `submit_cmd` is usually as simple as

```bash
qsub {job_file}
```

but you could specify some options on the command line instead of in the job file and define `submit_cmd` as

```bash
msub -l walltime={walltime} < {job_file}
```

#### `submit_cmd_output`

This option specifies the expected output of the `submit_cmd` command and lets SoS know how to extract `job_id` and other information from it. For example, for a regular PBS system, the output is simply the `job_id` (after stripping spaces and newlines):

```
submit_cmd_output='{job_id}'
```

On an LSF system, the output should be similar to

```
submit_cmd_output='Job <{job_id}> is submitted to queue <{queue}>'
```

The information extracted from the output of `submit_cmd` (namely, the variables defined in the pattern) can be used for other commands such as `status_cmd`.

This option defaults to `{job_id}`.
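
As an illustration of how such a pattern could be matched against the output of the submit command, the following Python sketch converts a `submit_cmd_output` pattern into a regular expression with one named group per `{name}` placeholder. The `parse_submit_output` helper and the sample outputs are hypothetical, and SoS's actual matching logic may differ.

```python
import re

def parse_submit_output(pattern, output):
    """Extract the {name} fields of pattern from output; return {} if it does not match."""
    # re.split with a capturing group yields literal text at even indices
    # and placeholder names at odd indices.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)           # literal text must match exactly
        else:
            regex += f"(?P<{part}>\\S+)"       # placeholder: a run of non-space characters
    match = re.fullmatch(regex, output.strip())
    return match.groupdict() if match else {}

# Made-up sample outputs for a PBS-style and an LSF-style submit command.
pbs_info = parse_submit_output("{job_id}", "12345.headnode\n")
lsf_info = parse_submit_output(
    "Job <{job_id}> is submitted to queue <{queue}>",
    "Job <6789> is submitted to queue <normal>",
)

print(pbs_info)
print(lsf_info)
```

Each extracted name (here `job_id` and `queue`) then corresponds to a variable that later command templates can refer to.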

#### `status_cmd`

A command to query the status of a submitted task. For a standard PBS system, this option could be

```
qstat {job_id}
```

where `job_id` is obtained from the output of `submit_cmd`. The `status_cmd` is interpolated with variables `job_id` (PBS job ID), `task` (SoS task id), and `verbosity` (command line verbosity level), so you can adjust options for different verbosity levels (e.g. `{'-f' if verbosity > 2 else ''}`).

Note that the `status_cmd` is only called with `-v 2` or higher.

#### `kill_cmd`

A command to kill a submitted job on the cluster. For a standard PBS system, this option could be

```
qdel {job_id}
```
where `job_id` is the output of command `submit_cmd`.

#### Sample PBS/Torque configuration

```
    cluster:
        address: host.url
        description: cluster with PBS
        paths:
            home: /scratch/{user_name}
        queue_type: pbs
        status_check_interval: 30
        wait_for_task: false
        job_template: |
            #!/bin/bash
            #PBS -N {task}
            #PBS -l nodes={nodes}:ppn={ppn}
            #PBS -l walltime={walltime}
            #PBS -l mem={mem//10**9}GB
            #PBS -o {home_dir}/.sos/tasks/{task}.out
            #PBS -e {home_dir}/.sos/tasks/{task}.err
            #PBS -m ae
            #PBS -M email@address
            #PBS -v {cur_dir}
            cd {cur_dir}
            sos execute {task} -v {verbosity} -s {sig_mode} {'--dryrun' if run_mode == 'dryrun' else ''}
        max_running_jobs: 100
        submit_cmd: qsub {job_file}
        status_cmd: qstat {job_id}
        kill_cmd: qdel {job_id}
```

#### Sample MOAB configuration

```
    cluster:
        address: host.url
        description: cluster with MOAB
        paths:
            home: /scratch/{user_name}
        queue_type: pbs
        status_check_interval: 30
        wait_for_task: false
        job_template: |
            #!/bin/bash
            #PBS -N {task}
            #PBS -l nodes={nodes}:ppn={ppn}
            #PBS -l walltime={walltime}
            #PBS -l mem={mem//10**9}GB
            #PBS -o {home_dir}/.sos/tasks/{task}.out
            #PBS -e {home_dir}/.sos/tasks/{task}.err
            #PBS -m ae
            #PBS -M email@address
            #PBS -v {cur_dir}
            cd {cur_dir}
            sos execute {task} -v {verbosity} -s {sig_mode} {'--dryrun' if run_mode == 'dryrun' else ''}
        max_running_jobs: 100
        submit_cmd: msub {job_file}
        status_cmd: qstat {job_id}
        kill_cmd: qdel {job_id}
```

## Further reading

* 