# Running Monte Carlo Sampling on an HPC Cluster

Running MC sampling jobs on limited HPC resources can be quite challenging, for the following reasons:

1. Jobs typically have a long queue time, so we need to submit enough jobs to avoid long waiting times.
2. However, it is hard to predict in advance how much work a QEC simulation needs, because the logical error rate $p_L$ can vary by several orders of magnitude.
3. Often I want a group of simulation data points, and I want to see some intermediate (rough) results before the full simulation finishes.
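
To make point 2 concrete, here is a rough estimate of how many shots are needed to reach a given relative precision. It assumes independent shots, so that the error count is binomial; the function name `shots_needed` and the 10% target are illustrative choices, not part of any library.

```python
import math


def shots_needed(p_L: float, rel_err: float) -> int:
    """Shots N for which the binomial standard error is about rel_err * p_L.

    Solves sqrt(p_L * (1 - p_L) / N) = rel_err * p_L for N.
    """
    if not (0.0 < p_L < 1.0) or rel_err <= 0.0:
        raise ValueError("need 0 < p_L < 1 and rel_err > 0")
    return math.ceil((1.0 - p_L) / (p_L * rel_err**2))


# The required work grows roughly as 1 / p_L for a fixed relative precision.
for p_L in (1e-2, 1e-4, 1e-6):
    print(f"p_L = {p_L:.0e}: about {shots_needed(p_L, 0.1):.1e} shots")
```

Since $p_L$ is not known before sampling, an estimate like this can only be refined as results come in, which is why a fixed shot count chosen up front tends to be either wasteful or insufficient.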

Apart from the challenges, the problem also has a few nice properties:

4. We can tolerate missing some accuracy on data points, e.g. when $p_L$ is too low, and the trade-off between cost and accuracy is somewhat
5. Unlike other problems, if some tasks are inherently time-consuming, we can always split a Monte Carlo sampling problem into smaller ones.
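
Property 5 can be sketched as follows: a request for a large number of shots is cut into bounded chunks, each chunk is sampled independently, and the counts are summed. The names `split_shots` and `run_chunk`, and the Bernoulli toy model standing in for a real QEC simulation, are assumptions made for illustration.

```python
import random


def split_shots(total_shots: int, max_chunk: int) -> list[int]:
    """Split total_shots into chunks of at most max_chunk shots each."""
    if total_shots < 0 or max_chunk <= 0:
        raise ValueError("need total_shots >= 0 and max_chunk > 0")
    full, rest = divmod(total_shots, max_chunk)
    return [max_chunk] * full + ([rest] if rest else [])


def run_chunk(shots: int, p_L: float, seed: int) -> tuple[int, int]:
    """Toy stand-in for a simulation: returns (shots, errors)."""
    rng = random.Random(seed)  # a distinct seed per chunk keeps chunks independent
    errors = sum(rng.random() < p_L for _ in range(shots))
    return shots, errors


chunks = split_shots(25_000, max_chunk=10_000)  # [10000, 10000, 5000]
results = [run_chunk(n, p_L=0.01, seed=i) for i, n in enumerate(chunks)]

# Merging is just a sum of counts, so chunks can finish in any order.
total_shots = sum(n for n, _ in results)
total_errors = sum(e for _, e in results)
print(total_shots, total_errors / total_shots)
```

Because the merge is a plain sum, a long-running data point never has to occupy a single worker for its whole duration.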

Often, I need to manually decide how many samples I want and iterate several times before I get a proper result.
**Is it possible to let a program automatically run the simulation jobs for me?**

Because of (1) and (3), it is necessary to use a group of allocated "compute" nodes and a centralized "host" node that dynamically decides which task runs on which node. [Dask](https://docs.dask.org/en/stable/futures.html) provides this functionality and works with various HPC cluster frameworks such as Slurm.

The real challenge is (2) and (4): how can we intelligently decide which data points to spend time on, the way I would manually? There is no single answer, but for generality we can let the user select which configuration to run and how many shots to run.

Fortunately, the nature of Monte Carlo sampling (5) makes it easier to organize the problem.
We can abstract the problem as running a list of Monte Carlo jobs:
```python
executor = MonteCarloJobExecutor(
    MonteCarloJob(d=3, p=0.01),
    MonteCarloJob(d=3, p=0.02),
    MonteCarloJob(d=5, p=0.01),
    MonteCarloJob(d=5, p=0.02),
)
```

As a generic framework for MC sampling, each Monte Carlo job object only maintains a `shot` variable. That is, the framework doesn't really care about the logical error rate or any other kind of objective.
It is the responsibility of the user to provide a custom "select" function that indicates which job to run next and with how many shots.
Once it returns `None`, the executor will finish all outstanding work and return.
If some of the submitted jobs fail, the executor may call the "select" function again to ask what the user wants to do.

```python
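def select(status: MonteCarloJobsStatus) -> tuple[MonteCarloJob, int] | None:
    # Top up the first job that is still expecting fewer than 1000 shots.
    for job in status.jobs:
        if job.expecting_shots < 1000:
            return job, 1000 - job.expecting_shots
    # Every job has enough shots: let the executor finish up and return.
    return None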
```
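
The interaction between the executor and the "select" function can be sketched with a toy loop. This is only an illustration under stated assumptions: it uses `concurrent.futures` from the standard library as a local stand-in for a cluster, and `ToyJob`, `finished_shots`, `pending_shots`, `run_shots`, and `execute` are hypothetical names, not the API of `qec_lego_bench`. For brevity, `select` here takes the job list directly rather than a status object.

```python
from concurrent.futures import FIRST_COMPLETED, Future, ThreadPoolExecutor, wait
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyJob:
    """Hypothetical stand-in for a Monte Carlo job: it only tracks shot counts."""
    name: str
    finished_shots: int = 0
    pending_shots: int = 0

    @property
    def expecting_shots(self) -> int:
        return self.finished_shots + self.pending_shots


def run_shots(shots: int) -> int:
    """Placeholder for the actual sampling; reports how many shots it ran."""
    return shots


def select(jobs: list[ToyJob]) -> Optional[tuple[ToyJob, int]]:
    """Top up each job to 1000 shots, in chunks of at most 300."""
    for job in jobs:
        if job.expecting_shots < 1000:
            return job, min(300, 1000 - job.expecting_shots)
    return None


def execute(jobs: list[ToyJob], max_workers: int = 4) -> None:
    running: dict[Future, tuple[ToyJob, int]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while True:
            # Fill free worker slots for as long as select has something to submit.
            while len(running) < max_workers and (choice := select(jobs)) is not None:
                job, shots = choice
                job.pending_shots += shots
                running[pool.submit(run_shots, shots)] = (job, shots)
            if not running:
                break  # select returned None and nothing is in flight
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                job, shots = running.pop(future)
                job.pending_shots -= shots
                if future.exception() is None:
                    job.finished_shots += future.result()
                # A failed chunk is no longer expected, so the next
                # select call is free to resubmit it.


jobs = [ToyJob("d=3, p=0.01"), ToyJob("d=5, p=0.01")]
execute(jobs)
print([(job.name, job.finished_shots) for job in jobs])
```

Keeping the bookkeeping in the executor and the policy in `select` means the same loop can serve very different stopping rules; only the function passed in by the user changes.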

In [1]:
%load_ext autoreload
%autoreload 2

In [11]:
from dask.distributed import Client, LocalCluster
cluster = LocalCluster()

Perhaps you already have a cluster running?
Hosting the HTTP server on port 50992 instead


In [12]:
cluster

LocalCluster summary:

- Dashboard: http://127.0.0.1:50992/status
- Workers: 5
- Total threads: 10
- Total memory: 32.00 GiB
- Status: running
- Using processes: True

Scheduler:

- Comm: tcp://127.0.0.1:50993
- Dashboard: http://127.0.0.1:50992/status
- Workers: 5
- Total threads: 10
- Started: Just now
- Total memory: 32.00 GiB

Workers:

| Comm | Dashboard | Nanny | Total threads | Memory | Local directory |
|---|---|---|---|---|---|
| tcp://127.0.0.1:51007 | http://127.0.0.1:51010/status | tcp://127.0.0.1:50996 | 2 | 6.40 GiB | /var/folders/vt/khhqppkd1472wb_cdm06rqhr0000gn/T/dask-scratch-space/worker-tuf3av70 |
| tcp://127.0.0.1:51006 | http://127.0.0.1:51009/status | tcp://127.0.0.1:50998 | 2 | 6.40 GiB | /var/folders/vt/khhqppkd1472wb_cdm06rqhr0000gn/T/dask-scratch-space/worker-zsm2rqmp |
| tcp://127.0.0.1:51008 | http://127.0.0.1:51014/status | tcp://127.0.0.1:51000 | 2 | 6.40 GiB | /var/folders/vt/khhqppkd1472wb_cdm06rqhr0000gn/T/dask-scratch-space/worker-kunox0_8 |
| tcp://127.0.0.1:51013 | http://127.0.0.1:51016/status | tcp://127.0.0.1:51002 | 2 | 6.40 GiB | /var/folders/vt/khhqppkd1472wb_cdm06rqhr0000gn/T/dask-scratch-space/worker-82hmiclm |
| tcp://127.0.0.1:51018 | http://127.0.0.1:51019/status | tcp://127.0.0.1:51004 | 2 | 6.40 GiB | /var/folders/vt/khhqppkd1472wb_cdm06rqhr0000gn/T/dask-scratch-space/worker-hvqmudot |


In [4]:
from qec_lego_bench.hpc.monte_carlo import MonteCarloJob

job = MonteCarloJob(a=3)
hash(job)

()


3772678777081781564

## Useful Resources

- https://docs.dask.org/en/stable/deploying.html
- https://docs.dask.org/en/stable/futures.html
- https://docs.ycrc.yale.edu/clusters-at-yale/access/ood-jupyter/


In [5]:
from dask_jobqueue import SLURMCluster
cluster = SLURMCluster(
    queue='regular',
    account="myaccount",
    cores=24,
    memory="500 GB"
)
cluster.scale(jobs=10)  # ask for 10 jobs

ModuleNotFoundError: No module named 'dask_jobqueue'