# Running DSC on a remote computer

## Overview

DSC uses one additional configuration file and two command options to run on a remote computer. Here a remote computer can be:

1. A standalone desktop workstation without a job queue system
2. A system with a job queue manager, such as a PBS-based cluster

Conventionally, to run jobs on a remote computer (hereafter, the host), users have to log in to the host from their local computer (hereafter, the local), copy over the required resources, and install the required software before performing any computation. On cluster systems, users not only have to write job files to submit, but also have to actively monitor running jobs and keep submitting new ones, because submitting all jobs at once will likely exceed the system's limit on permitted jobs and fail. For a DSC benchmark, which contains an arbitrary number of computational tasks with implicit inputs and outputs and differing resource requirements, it is nearly impossible to configure cluster jobs properly by hand.

Using a DSC job template and command options, DSC allows users to:

1. Use a remote host without having to log in.
2. Easily configure resource requirements per module.
3. Submit the entire benchmark without worrying about job limits.
4. Automatically sync files to the remote computer even if file path conventions differ between local and host (e.g., `/Users/<username>` on Mac vs `/home/<username>` on Linux).
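
To make the path-convention point in item 4 concrete, here is a minimal Python sketch of translating a path under the local home directory to the corresponding path on the host. It is an illustration only and not DSC's or SoS's actual implementation: the function name `map_to_host` and the user name `alice` are hypothetical.

```python
from pathlib import PurePosixPath

def map_to_host(local_path: str, local_home: str, host_home: str) -> str:
    """Rewrite a path under the local home directory to its host equivalent.

    Hypothetical helper for illustration; not part of DSC or SoS.
    """
    path = PurePosixPath(local_path)
    try:
        relative = path.relative_to(local_home)
    except ValueError:
        # The path is outside the local home directory: leave it unchanged.
        return local_path
    return str(PurePosixPath(host_home) / relative)

# A file under a Mac home directory maps to the same location under a Linux home.
print(map_to_host("/Users/alice/project/data.rds", "/Users/alice", "/home/alice"))
```

In the host configuration, the `paths: home:` entry plays the role of `host_home` in this sketch.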

Under the hood, DSC uses local resources to:

1. Analyze the DSC file and build input / output dependencies.
2. Submit jobs to the remote so that no more than the maximum allowed number of jobs is submitted at any time; unsubmitted jobs are stashed on the local computer, which keeps monitoring the remote computer and submits more jobs as previous ones complete.
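
The throttling described in step 2 can be sketched as a small simulation. This is a simplified model under stated assumptions, not the scheduler that DSC or SoS actually runs: jobs are represented by integers, every job is assumed to finish a fixed number of polling rounds after submission, and `run_throttled` is a hypothetical name.

```python
from collections import deque

def run_throttled(job_ids, max_running_jobs, finish_after=2):
    """Simulate submitting jobs so that at most `max_running_jobs` run at once.

    Each simulated job finishes `finish_after` polling rounds after submission.
    Returns the peak number of concurrent jobs and the completion order.
    """
    pending = deque(job_ids)   # jobs stashed locally, not yet submitted
    running = {}               # job id -> polling rounds remaining
    done = []
    peak = 0
    while pending or running:
        # Fill free slots on the host from the local stash.
        while pending and len(running) < max_running_jobs:
            running[pending.popleft()] = finish_after
        peak = max(peak, len(running))
        # One polling round: every running job moves one step closer to done.
        for job in list(running):
            running[job] -= 1
            if running[job] == 0:
                done.append(job)
                del running[job]
    return peak, done

# 100 jobs with a limit of 30 concurrent jobs, as in `max_running_jobs: 30`.
peak, done = run_throttled(range(100), max_running_jobs=30)
print(peak, len(done))
```

The local machine owns the `pending` queue and the polling loop, which is why it must stay awake for the whole run.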

As one can probably tell, the local still has to perform some computation (mostly monitoring, using one CPU thread) while the benchmark is being executed. Therefore the local has to be kept active: the computer has to be on and not in "suspend" mode. We realize this might be inconvenient for laptop users. Though in principle one can use [`screen`](https://www.gnu.org/software/screen/) on a cluster's head node as local and a compute node as remote to keep the monitor running in the background, we discourage doing so because it is typically not good practice to run long computations on a head node. Users can either use an interactive session as "local" to submit jobs, or just keep a local laptop / desktop up and running throughout the entire benchmarking process.

## Implementation

DSC uses the [`SoS`](https://vatlab.github.io/sos-docs) library to execute jobs. For remote submission to work, one needs to install the `sos` package on the remote host, and additionally the `sos-pbs` package if the remote uses a PBS system. To install SoS on the remote host:

```
pip install sos sos-pbs
```

If you run into trouble, you may find the [DSC installation guide](../installation.html) a useful resource.

## Remote configuration

The DSC command option `--host` accepts a YAML configuration file that specifies a *template* for remote jobs. **We provide support for such configuration files on an as-needed basis**, because we (the DSC developers) can only verify that a template works on systems that we have access to and use regularly. For example, here is a template for a system running a PBS-type queue via the Slurm Workload Manager:

```yaml
DSC:
  midway2:
    description: UChicago RCC cluster Midway 2
    address: gaow@midway2.rcc.uchicago.edu
    paths:
      home: /home/gaow
    queue_type: pbs
    wait_for_tasks: false
    status_check_interval: 60
    max_running_jobs: 30
    max_cores: 40
    max_walltime: "36:00:00"
    max_mem: 64G
    job_template: |
      #!/bin/bash
      #SBATCH --time={walltime}
      #{partition}
      #{account}
      #SBATCH --nodes=1
      #SBATCH --ntasks-per-node={cores}
      #SBATCH --mem-per-cpu={mem//10**9}G
      #SBATCH --job-name={job_name}
      #SBATCH --output={cur_dir}/.sos/{job_name}.out
      #SBATCH --error={cur_dir}/.sos/{job_name}.err
      cd {cur_dir}
      module load R/3.4.3
    partition: "SBATCH --partition=broadwl"
    account: ""
    submit_cmd: sbatch {job_file}
    submit_cmd_output: "Submitted batch job {job_id}"
    status_cmd: squeue --job {job_id}
    kill_cmd: scancel {job_id}
  stephenslab:
    based_on: hosts.midway2
    max_cores: 28
    max_mem: 128G
    max_walltime: "10d"
    partition: "SBATCH --partition=mstephens"
    account: "SBATCH --account=pi-mstephens"


default:
  queue: midway2
  time_per_instance: 10m
  instances_per_job: 2
  n_cpu: 1
  mem_per_cpu: 2G

simulate:
  instances_per_job: 20
```

The section `DSC` is required to configure all systems. Its syntax mostly follows the [SoS remote task specification](https://vatlab.github.io/sos-docs/doc/documentation/Remote_Execution.html). Here `midway2` is a host provided by [The University of Chicago RCC group](https://rcc.uchicago.edu). Jobs are submitted to `partition=broadwl`. Typically it has 40 cores per node, allows 30 concurrent jobs per user, and limits each job to a maximum running time of 36 hours. These limits are reflected in the `max_*` values in the configuration. `stephenslab` is a special partition on `midway2` that uses a different configuration under the same system, so it is derived from `midway2` via `based_on: hosts.midway2`.
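
The effect of `based_on` can be pictured as a dictionary merge in which the derived host overrides only the keys it sets. The Python sketch below illustrates that idea with a subset of the values from the example configuration. It assumes child values simply replace parent values; SoS performs the actual resolution, and `resolve_host` is a hypothetical helper.

```python
def resolve_host(hosts, name):
    """Return the settings for `name`, filling unset keys from its `based_on` parent.

    Hypothetical helper for illustration; SoS performs the real resolution.
    """
    settings = dict(hosts[name])
    parent_ref = settings.pop("based_on", None)
    if parent_ref is None:
        return settings
    parent_name = parent_ref.split(".")[-1]   # "hosts.midway2" -> "midway2"
    merged = resolve_host(hosts, parent_name)
    merged.update(settings)                   # child values override the parent
    return merged

hosts = {
    "midway2": {"queue_type": "pbs", "max_running_jobs": 30, "max_cores": 40,
                "max_walltime": "36:00:00", "max_mem": "64G"},
    "stephenslab": {"based_on": "hosts.midway2", "max_cores": 28,
                    "max_mem": "128G", "max_walltime": "10d"},
}

resolved = resolve_host(hosts, "stephenslab")
print(resolved["max_cores"], resolved["max_running_jobs"])
```

Under this model `stephenslab` keeps `queue_type` and `max_running_jobs` from `midway2` and only changes the three limits it overrides.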

The section `default` is also required. It provides default settings for all modules in the DSC. Available settings are:

- `queue`: name of the remote queue that DSC uses, as defined in the `DSC` section (`midway2` in the example above).
- `time_per_instance`: maximum computation time for each module instance.
- `instances_per_job`: how many module instances to submit as one remote job. This is useful for consolidating numerous lightweight module instances into one job submission.
- `n_cpu` and `mem_per_cpu`: the CPU and memory requirements of a module instance.

For example, for 200 module instances of `simulate` that each generate some data in under a minute, one can specify `time_per_instance: 1m` and `instances_per_job: 200`. A single job containing all 200 simulations will then be submitted to the host, with a total of 200 minutes of computation time reserved.
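
The arithmetic behind this example can be written down as a short Python sketch. It assumes that the time reserved for a job is the per-instance time multiplied by the number of instances batched into the job, which matches the 200-minute figure above. The function `plan_jobs` is a hypothetical name, not a DSC API.

```python
import math

def plan_jobs(n_instances, instances_per_job, minutes_per_instance):
    """Return (number of jobs, minutes reserved per job) for a batching choice.

    Assumes the reserved time is instances_per_job * minutes_per_instance.
    """
    n_jobs = math.ceil(n_instances / instances_per_job)
    minutes_per_job = instances_per_job * minutes_per_instance
    return n_jobs, minutes_per_job

# 200 one-minute instances batched 200 per job: one job, 200 minutes reserved.
print(plan_jobs(200, 200, 1))
# The same instances batched 20 per job: ten jobs, 20 minutes reserved each.
print(plan_jobs(200, 20, 1))
```

Larger batches mean fewer submissions but longer reservations per job, so the reserved time still has to fit within the host's `max_walltime`.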

Typically, the `DSC` and `default` sections of the host configuration do not have to change between projects. Users can configure them carefully once and reuse them across projects. Stephens Lab users, for example, can take the example above and replace `gaow` with their UChicago CNetID.

## Run remote jobs

The command

```
dsc ... --host /path/to/config.yml
```

will load the remote configuration in `/path/to/config.yml` and submit jobs.

Additionally, 

```
dsc ... --host /path/to/config.yml --to-host file1 dir1 file2
```

will sync the specified files and folders to the remote, for cases where the benchmark requires them to execute (e.g., data resources, shell executables, or scripts in `DSC::lib_path`).

Note that to use `--host` and `--to-host` successfully, the [`rsync`](https://rsync.samba.org) program has to be available on the local computer.
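
A quick way to confirm this requirement before launching a benchmark is to look the program up on the local `PATH`. The Python sketch below uses only the standard library; `missing_tools` is a hypothetical helper and is not something DSC provides.

```python
import shutil

def missing_tools(tools=("rsync",)):
    """Return the names in `tools` that cannot be found on the local PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = missing_tools()
if missing:
    print("Install before using --host / --to-host:", ", ".join(missing))
else:
    print("rsync found at", shutil.which("rsync"))
```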