```julia
using Distributed

println(nprocs())   # number of processes before adding workers
addprocs(4)         # add 4 workers
println(nprocs())   # total number of processes
println(nworkers()) # only worker processes
rmprocs(workers())  # remove worker processes
```

From https://groups.google.com/g/slurm-users/c/r_MNRw4gYhQ

```
Now, you can get fancy and have hybrid applications which can be split
up across nodes (individual processes) but each one of those tasks can
also use multiple cores at the same time by multi-threading.
```

So how would I distribute the parameter combinations across nodes and multithread across trials within each node?
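One possible shape for this, sketched with `Distributed` and `Threads` on a single machine rather than across real nodes: each worker process handles one parameter combination at a time via `pmap`, and inside each call the trials are split across threads. The names `trial_loglik` and `nll`, the Gaussian likelihood, and the toy data and grid are all hypothetical stand-ins, not the actual model.

```julia
using Distributed

# Start 2 local workers with 2 threads each. On a cluster these numbers would
# correspond to the task count and --cpus-per-task (assumption for illustration).
addprocs(2; exeflags="--threads=2")

@everywhere begin
    # Hypothetical per-trial log-likelihood: Gaussian with mean `mu`, unit variance.
    trial_loglik(x, mu) = -0.5 * (x - mu)^2 - 0.5 * log(2π)

    # Negative log-likelihood over all trials; trials are split across threads.
    function nll(mu, data)
        partial = zeros(length(data))
        Threads.@threads for i in eachindex(data)
            partial[i] = trial_loglik(data[i], mu)   # each index written by one thread
        end
        return -sum(partial)
    end
end

data = [0.9, 1.1, 1.0, 0.8, 1.2]   # toy trial data
grid = [0.0, 0.5, 1.0, 1.5]        # toy parameter combinations

# One parameter combination per worker call; threads are used inside `nll`.
nlls = pmap(mu -> nll(mu, data), grid)
best_mu = grid[argmin(nlls)]
println("best mu = ", best_mu)

rmprocs(workers())                 # clean up worker processes
```

Spreading the workers over several nodes would additionally need a cluster manager that launches them through Slurm; the structure of the `pmap` call would stay the same.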

From https://researchcomputing.princeton.edu/support/knowledge-base/scaling-analysis#hybrid

```
Hybrid Multithreaded, Multinode Codes
Some codes take advantage of both shared- and distributed-memory parallelism (e.g., OpenMP and MPI). In these cases you will need to vary the number of nodes, ntasks-per-node and cpus-per-task. Construct a table as above except include a new column for cpus-per-task. Note that when taking full nodes, the product of ntasks-per-node and cpus-per-task should be equal to the total number of CPU-cores per node. Use the "snodes" command to find the total number of CPU-cores per node for a given cluster.

Find the optimal values for these Slurm directives:

#SBATCH --nodes=<M>
#SBATCH --ntasks-per-node=<N>
#SBATCH --cpus-per-task=<T>
```

On the Caltech cluster, compute nodes have 32 cores each. To avoid wasting resources, I can request, e.g.,

```
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=4
```
where the negative log-likelihood (NLL) for 8 parameter combinations is computed concurrently, with each computation using 4 threads to evaluate trial likelihoods in parallel (8 × 4 = 32 cores).
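The 8 × 4 split is only one way to fill a 32-core node. As a quick check, the following sketch lists every `--ntasks-per-node` / `--cpus-per-task` pair whose product equals the core count; `full_node_layouts` is a hypothetical helper written for this note, and the 32 comes from the node size mentioned above.

```julia
# All (ntasks_per_node, cpus_per_task) pairs whose product fills a node exactly.
function full_node_layouts(cores_per_node::Integer)
    return [(n, cores_per_node ÷ n) for n in 1:cores_per_node if cores_per_node % n == 0]
end

layouts = full_node_layouts(32)
for (ntasks, cpus) in layouts
    println("--ntasks-per-node=$ntasks --cpus-per-task=$cpus")
end
```

Which of these layouts is fastest depends on how well the trial-level threading scales, which is what the scaling analysis quoted above is meant to determine.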

## `--cpus-per-task` and `--ntasks-per-node` options in batch scripts

`--cpus-per-task` allocates CPUs on a single node, so it should not exceed the number of cores per node. Tasks, on the other hand, can be distributed across nodes, with `--ntasks-per-node` setting how many of them run on each node. If `--ntasks-per-node` is >1, then `--cpus-per-task` × `--ntasks-per-node` should not exceed the number of cores per node.

So can I:  
- Parallelize subjects across jobs using job arrays (`#SBATCH --array=0-4`)  
- Parallelize models/parameter combinations across tasks using `--ntasks-per-node`  
- Parallelize trial likelihood computation (for non-sequential models) across threads using `--cpus-per-task`  

The first and third items I know how to implement. For the second, I need to learn how to make `ADDM.grid_search` an MPI job.
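Whichever mechanism ends up driving the second item, the Julia script needs to know what Slurm actually allocated. A minimal sketch, assuming the standard Slurm environment variables `SLURM_ARRAY_TASK_ID`, `SLURM_NTASKS`, and `SLURM_CPUS_PER_TASK` are set inside the job (they may be absent depending on which options were requested, hence the defaults); `env_int` and the `subjects` list are hypothetical.

```julia
# Read an integer environment variable, falling back to a default when it is
# unset (e.g. when running outside a Slurm job).
env_int(name, default) = parse(Int, get(ENV, name, string(default)))

subject_index    = env_int("SLURM_ARRAY_TASK_ID", 0)  # which subject this array job fits
n_tasks          = env_int("SLURM_NTASKS", 1)         # parameter combinations in flight
threads_per_task = env_int("SLURM_CPUS_PER_TASK", 1)  # threads for trial likelihoods

subjects = ["s0", "s1", "s2", "s3", "s4"]  # hypothetical IDs matching --array=0-4
subject = subjects[subject_index + 1]      # array IDs start at 0, Julia indices at 1

println("subject=$subject, tasks=$n_tasks, threads per task=$threads_per_task")
```

On a single node, these values could size a `Distributed` worker pool, e.g. `addprocs(n_tasks; exeflags="--threads=$threads_per_task")`, as an alternative to MPI. Whether that fits `ADDM.grid_search` depends on how that function is structured internally, which is not covered here.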