# Running Monte Carlo Sampling on an HPC Cluster

Running MC sampling jobs on limited HPC resources can be quite challenging, for the following reasons:

1. Jobs typically wait in the queue for a long time, so we need to submit enough jobs at once to avoid long waiting times.
2. However, it is hard to predict in advance how much work a QEC simulation needs, because the logical error rate $p_L$ can vary by several orders of magnitude.
3. I often want a group of simulation data points, and I want to see intermediate (rough) results before the full simulation finishes.
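
To make point 2 concrete: for a binomial estimate, the relative standard error of the estimated $p_L$ after $N$ shots is $\sqrt{(1-p_L)/(N p_L)}$, so the number of shots needed for a fixed relative error grows roughly as $1/p_L$. The helper below is a small sketch of that estimate; the function name `shots_needed` and the 10% default target are illustrative choices, not part of any framework.

```python
import math

def shots_needed(p_L: float, rel_err: float = 0.1) -> int:
    """Shots required so the binomial standard error equals rel_err * p_L."""
    if not 0.0 < p_L < 1.0:
        raise ValueError("p_L must be strictly between 0 and 1")
    # Solve sqrt((1 - p_L) / (N * p_L)) = rel_err for N.
    return math.ceil((1.0 - p_L) / (p_L * rel_err**2))

for p_L in (1e-2, 1e-4, 1e-6):
    print(f"p_L = {p_L:.0e}: about {shots_needed(p_L):,} shots")
```

Going from $p_L = 10^{-2}$ to $10^{-6}$ raises the required shot count by about four orders of magnitude, which is why a fixed shot budget per data point rarely works.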

Apart from these challenges, the problem also has a few nice properties:

4. We can tolerate missing some accuracy on data points, e.g. when $p_L$ is too low, and the trade-off between cost and accuracy is somewhat
5. Unlike many other problems, an inherently time-consuming Monte Carlo sampling task can always be split into smaller ones.
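
The last property holds because shots are independent: a job of $N$ shots can be cut into chunks that run anywhere, and the chunk results simply add up. A minimal sketch, where `split_shots` and `run_chunk` are hypothetical helpers and `run_chunk` is a toy stand-in for a real simulation (it draws failures with a fixed probability):

```python
import random

def split_shots(total_shots: int, chunk_size: int) -> list[int]:
    """Split total_shots into chunks of at most chunk_size shots."""
    full, rest = divmod(total_shots, chunk_size)
    return [chunk_size] * full + ([rest] if rest else [])

def run_chunk(shots: int, p_fail: float, seed: int) -> tuple[int, int]:
    """Toy simulation: return (shots, errors) for one independent chunk."""
    rng = random.Random(seed)  # a distinct seed per chunk keeps chunks independent
    errors = sum(rng.random() < p_fail for _ in range(shots))
    return shots, errors

chunks = split_shots(10_000, chunk_size=3_000)  # [3000, 3000, 3000, 1000]
results = [run_chunk(n, p_fail=0.05, seed=i) for i, n in enumerate(chunks)]

# Merging is just summation of the per-chunk counts.
total_shots = sum(n for n, _ in results)
total_errors = sum(e for _, e in results)
print(f"{total_errors}/{total_shots} = {total_errors / total_shots:.4f}")
```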

Often, I need to manually decide how many samples I want and iterate several times before I get a proper result.
**Is it possible to let a program automatically run the simulation jobs for me?**

Because of conditions (1) and (3), we need a group of allocated "compute" nodes and a centralized "host" node that dynamically decides which task runs on which node. [Dask](https://docs.dask.org/en/stable/futures.html) provides this functionality and works with HPC schedulers such as Slurm.

The real challenge lies in (2) and (4): how can a program intelligently decide which data point to spend time on, the way I would manually? There is no single answer, but for generality we can let the user specify an "award function". Given a cost function and a group of data points to run, the framework can automatically choose the task with the highest award-to-cost ratio, where the cost is essentially the time consumed.
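
One way to read this selection rule is as a greedy choice. The sketch below assumes a hypothetical `Candidate` record that holds a user-supplied award and an estimated cost in seconds; none of these names come from an existing library, and the numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Candidate:
    name: str
    award: float  # user-defined value of running this task next
    cost: float   # estimated wall-clock seconds for the task

def pick_best(candidates: list[Candidate]) -> Optional[Candidate]:
    """Return the candidate with the highest award-to-cost ratio, if any."""
    runnable = [c for c in candidates if c.award > 0 and c.cost > 0]
    if not runnable:
        return None
    return max(runnable, key=lambda c: c.award / c.cost)

candidates = [
    Candidate("d=3, p=0.01", award=1.0, cost=2.0),   # ratio 0.5
    Candidate("d=5, p=0.01", award=3.0, cost=12.0),  # ratio 0.25
    Candidate("d=5, p=0.02", award=2.0, cost=2.5),   # ratio 0.8
]
best = pick_best(candidates)
print(best.name if best else "nothing to run")
```

Note that the largest award does not win here: the `d=5, p=0.01` task is worth the most but is also by far the most expensive.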

Fortunately, the nature of Monte Carlo sampling (5) makes it easier to organize the problem.
We can abstract the problem as simulating a list of Monte Carlo jobs:
```python
job_array = MonteCarloJobArray([
    MonteCarloJob(args=dict(d=3, p=0.01)),
    MonteCarloJob(args=dict(d=3, p=0.02)),
    MonteCarloJob(args=dict(d=5, p=0.01)),
    MonteCarloJob(args=dict(d=5, p=0.02)),
])
```

As a generic MC sampling framework, each Monte Carlo job object only maintains a `shot` variable; the framework does not care about the logical error rate or any other objective.
It is the user's responsibility to provide a custom award function that indicates which job to run next and how many shots to request:

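```python
def award_function_1(status: MonteCarloJobsStatus) -> tuple[MonteCarloJob, int] | None:
    """Bring every job up to 1000 shots, one job at a time."""
    for job in status.jobs:
        # Assumes expecting_shots counts finished plus in-flight shots
        if job.expecting_shots < 1000:
            return job, 1000 - job.expecting_shots
    return None  # nothing left to schedule
```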
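
To show how a host loop might consume such an award function, here is a self-contained sketch. It uses the standard library's `concurrent.futures` thread pool as a local stand-in for a cluster; Dask's futures interface follows the same `submit` pattern, so on a real cluster the pool would presumably be replaced by a `dask.distributed.Client`. The `Job` class, `award`, `simulate`, and `run_all` are all hypothetical names invented for this example, and the assumption that `expecting_shots` equals finished plus active shots is mine.

```python
import random
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from dataclasses import dataclass
from typing import Optional

@dataclass(eq=False)  # identity-based hashing so jobs can be dict keys
class Job:
    """Hypothetical stand-in for MonteCarloJob: it only tracks shot counts."""
    args: dict
    finished_shots: int = 0
    active_shots: int = 0

    @property
    def expecting_shots(self) -> int:
        return self.finished_shots + self.active_shots

def award(jobs: list[Job], target: int = 1000, chunk: int = 250) -> Optional[tuple[Job, int]]:
    """Pick the first job still short of `target`, at most `chunk` shots at a time."""
    for job in jobs:
        if job.expecting_shots < target:
            return job, min(chunk, target - job.expecting_shots)
    return None

def simulate(shots: int, p: float) -> int:
    """Toy workload: count failures among `shots` samples with probability p."""
    rng = random.Random()
    return sum(rng.random() < p for _ in range(shots))

def run_all(jobs: list[Job], max_workers: int = 4) -> dict[Job, int]:
    errors = {job: 0 for job in jobs}
    pending = {}  # future -> (job, shots)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while True:
            # Keep the workers busy with whatever the award function asks for.
            while len(pending) < max_workers:
                choice = award(jobs)
                if choice is None:
                    break
                job, shots = choice
                job.active_shots += shots
                pending[pool.submit(simulate, shots, job.args["p"])] = (job, shots)
            if not pending:
                return errors
            # Wake up as soon as any chunk finishes, then update the bookkeeping.
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                job, shots = pending.pop(future)
                job.active_shots -= shots
                job.finished_shots += shots
                errors[job] += future.result()

jobs = [Job(args=dict(d=3, p=0.01)), Job(args=dict(d=3, p=0.02))]
errors = run_all(jobs)
for job in jobs:
    print(job.args, job.finished_shots, errors[job])
```

All bookkeeping happens on the host thread, so the award function always sees a consistent view of finished and in-flight shots, and intermediate results are available after every completed chunk.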

```python
from dask.distributed import Client, LocalCluster

cluster = LocalCluster()  # starts a scheduler and workers on the local machine
cluster
```

The cluster summary reports 5 workers with 10 threads and 32.00 GiB of memory in total (2 threads and 6.40 GiB per worker), with the dashboard at http://127.0.0.1:8787/status.


## Useful Resources

- https://docs.dask.org/en/stable/deploying.html
- https://docs.dask.org/en/stable/futures.html
- https://docs.ycrc.yale.edu/clusters-at-yale/access/ood-jupyter/


```python
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue="regular",      # Slurm partition to submit to
    account="myaccount",  # Slurm account to charge
    cores=24,             # cores per job
    memory="500 GB",      # memory per job
)
cluster.scale(jobs=10)  # ask for 10 jobs
```