# Parallel Hyperparameter Search on a Remote Cluster with Slurm

The notebook consists of three parts:

- Python file for single-process optimization with Optuna.
- Slurm file to launch the Python script in parallel on multiple processes.
- Execution of the Slurm file on a remote cluster.

### Step 1 : Write a Python script to run the experiment

We first write a Python file (here called `HP_search_parallel_cluster.py`) similar to the one in the first Optuna tutorial, which runs without parallelization. On its own, this script runs several trials sequentially: it creates the dataset, defines the objective function, creates the study and runs it.

Next, we will create a Slurm file to launch `HP_search_parallel_cluster.py` multiple times. Parallelization is achieved by running separate instances of the Python script simultaneously on multiple CPUs or nodes.

Importantly, every instance writes its trials' hyperparameters and results to the same Optuna storage (here a journal file), so all trials end up in a single study that can be analyzed and compared as a whole.

#### HP_search_parallel_cluster.py

Python file to run multiple trials sequentially :

```python

import time
import argparse
import datetime
import numpy as np

import reservoirpy as rpy
import optuna

from reservoirpy.nodes import Reservoir, Ridge
from reservoirpy.observables import nrmse, rsquare
from reservoirpy.datasets import doublescroll

from optuna.storages import JournalStorage, JournalFileStorage

optuna.logging.set_verbosity(optuna.logging.ERROR)

parser = argparse.ArgumentParser()
parser.add_argument('--nb_trials', type=int, required=True)
parser.add_argument('--study_name', type=str, required=True)
args = parser.parse_args()

# Data Preprocessing

timesteps = 2000
x0 = [0.37926545, 0.058339, -0.08167691]
X = doublescroll(timesteps, x0=x0, method="RK23")

train_len = 1000

X_train = X[:train_len]
y_train = X[1 : train_len + 1]
X_test = X[train_len : -1]
y_test = X[train_len + 1:]

dataset = ((X_train, y_train), (X_test, y_test))

# Trial Fixed hyper-parameters

nb_seeds = 3
N = 500
iss = 0.9
ridge = 1e-7


def objective(trial):
    # Record objective values for each trial
    rpy.verbosity(0)
    losses = []

    # Trial generated parameters (with log scale)
    sr = trial.suggest_float("sr_1", 1e-2, 10, log=True)
    lr = trial.suggest_float("lr_1", 1e-3, 1, log=True)

    for seed in range(nb_seeds):
        reservoir = Reservoir(N,
                              sr=sr,
                              lr=lr,
                              input_scaling=iss,
                              seed=seed)
        
        readout = Ridge(ridge=ridge)
        model = reservoir >> readout

        # Train and test your model
        predictions = model.fit(X_train, y_train).run(X_test)

        # Compute the desired metric(s)
        loss = nrmse(y_test, predictions, norm_value=np.ptp(X_train))

        losses.append(loss)

    return np.mean(losses)


# Define study parameters
sampler = optuna.samplers.RandomSampler() 
log_name = f"optuna-journal_{args.study_name}.log"
storage = JournalStorage(JournalFileStorage(log_name))

# Create study
study = optuna.create_study(
    study_name=args.study_name,
    direction="minimize",
    sampler=sampler,
    storage=storage,
    load_if_exists=True)


# Launch the optimization for this specific job
start = time.time()
study.optimize(objective, n_trials=args.nb_trials)
end = time.time()

print(f"Optimization done with {args.nb_trials} trials in {str(datetime.timedelta(seconds=end-start))}")
```

### Step 2 : Write a Slurm file 

To parallelize `HP_search_parallel_cluster.py` on a remote cluster, we create a Slurm file (let's call it `HP_search_parallel_cluster.slurm`) that runs several instances of the script simultaneously as separate jobs. In this file we specify the number of jobs, the job name and the arguments passed to the Python file, including the number of trials per job. (You can also run the Python file in multiple processes locally, for example with a tool such as tmux.)

Launching `x` jobs of `y` trials each runs `x * y` trials in total, spread across the CPUs or nodes that the cluster allocates to the jobs.
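When starting from a total trial budget rather than from a number of trials per job, the arithmetic can be reversed. The helper below is a small sketch (the name `trials_per_job` is hypothetical and not part of Optuna or Slurm); it rounds up so that the budget is always covered.

```python
import math


def trials_per_job(total_trials, nb_jobs):
    """Smallest number of trials per job that covers the requested budget."""
    return math.ceil(total_trials / nb_jobs)


nb_jobs = 50
nb_trials = trials_per_job(250, nb_jobs)
print(nb_trials, nb_jobs * nb_trials)  # 5 trials per job, 250 trials in total
```

If the budget is not a multiple of the number of jobs, the total slightly exceeds it: 251 trials over 50 jobs gives 6 trials per job, hence 300 trials.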

Depending on the cluster you are working on, Slurm might not be installed. In that case, use the job scheduler available on that cluster instead.
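For a quick local test without any scheduler, the parallel launch can be approximated from Python with the standard `subprocess` module. This is a minimal sketch, not a replacement for the Slurm file: `build_command` and `launch_local_workers` are hypothetical helper names, and the script is assumed to be in the current directory.

```python
import subprocess
import sys


def build_command(nb_trials, study_name, script="HP_search_parallel_cluster.py"):
    """Return the command line for one worker process."""
    return [
        sys.executable, script,
        "--nb_trials", str(nb_trials),
        "--study_name", study_name,
    ]


def launch_local_workers(nb_workers, nb_trials, study_name):
    """Start nb_workers copies of the script and wait for all of them."""
    command = build_command(nb_trials, study_name)
    processes = [subprocess.Popen(command) for _ in range(nb_workers)]
    # wait() returns each worker's exit code (0 means success)
    return [process.wait() for process in processes]
```

Calling `launch_local_workers(4, 5, "local_test")` would start 4 processes of 5 trials each, all writing to the same journal file, for 20 trials in total.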

#### HP_search_parallel_cluster.slurm

Slurm file to run the Python file on several processes simultaneously :

```slurm
#!/bin/bash

#############################

# Your job name (displayed by the queue)
#SBATCH -J reservoirpy_parallel_HP_search_test

# Specify the number of desired jobs in your job array (here 50)
#SBATCH --array=0-49
# Specify the maximum walltime per process (hh:mm:ss)
#SBATCH -t 1:10:00

# Specify the number of nodes (-N) and the number of tasks per node (--ntasks-per-node) to be used
#SBATCH -N 1
#SBATCH --ntasks-per-node=1

# change working directory (relative to the directory where sbatch is executed)
#SBATCH --chdir=.

#############################

# useful information to print
echo "#############################"
echo "User:" $USER
echo "Date:" `date`
echo "Host:" `hostname`
echo "Directory:" `pwd`
echo "SLURM_JOBID:" $SLURM_JOBID
echo "SLURM_SUBMIT_DIR:" $SLURM_SUBMIT_DIR
echo "SLURM_JOB_NODELIST:" $SLURM_JOB_NODELIST
echo "#############################"

#############################

# What you actually want to launch 
python3 HP_search_parallel_cluster.py --nb_trials 5 --study_name cluster_parallelization_tutorial
# Total number of trials = nb_jobs * nb_trials = 50 * 5 = 250

# all done
echo "Job finished"


```
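Every job of the array runs the same command, and the script uses an unseeded `RandomSampler`, so the sampled values differ from one run to the next. If reproducible runs are wanted, one possible approach (an assumption on our side, not something the script above does) is to derive a seed from `SLURM_ARRAY_TASK_ID`, the environment variable that Slurm sets for each task of a job array. The helper name `sampler_seed` is hypothetical.

```python
import os


def sampler_seed(base_seed=0, environ=None):
    """Derive a distinct, reproducible seed for each array task."""
    environ = os.environ if environ is None else environ
    # SLURM_ARRAY_TASK_ID is set by Slurm for each task of a job array;
    # fall back to 0 when the script runs outside of a job array.
    task_id = int(environ.get("SLURM_ARRAY_TASK_ID", 0))
    return base_seed + task_id
```

In the Python script, the sampler could then be created with `optuna.samplers.RandomSampler(seed=sampler_seed())`, so that each job draws a different but repeatable sequence of hyperparameters.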

### Step 3 : Transfer the files to the cluster

With `HP_search_parallel_cluster.py` and `HP_search_parallel_cluster.slurm` ready in our local directory, we can transfer them to the remote cluster. On Linux, this can be done with the `rsync` command. Depending on the cluster you use, the [`scp` command](https://help.ubuntu.com/community/SSH/TransferFiles) may be recommended instead.

Bash commands to type in your local terminal, to transfer the Python and Slurm files to the cluster :

```bash
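# Replace remote_directory with the target directory on the cluster
rsync -av HP_search_parallel_cluster.py username@cluster_address:remote_directory/
rsync -av HP_search_parallel_cluster.slurm username@cluster_address:remote_directory/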

```

### Step 4 : Launch the Slurm file

Then, from the cluster command line, launch the whole optimization process by submitting the Slurm file with the `sbatch` command :

```bash
sbatch HP_search_parallel_cluster.slurm 

```

After launching the Slurm file, we need to wait for the hyperparameter optimization to finish. The command line output of each job (including the output of the Python file it launches) is stored in a file called `slurm-<job_array_id>_<job_id>.out`, where `<job_id>` is the index of the job within the array (0 to 49 here).

Here are some useful Bash commands to type directly in the cluster command line :

```bash
# Check the status of the jobs
squeue -u $USER

# Cancel the job array (you can find the <job_array_id> with the previous command)
scancel <job_array_id>

# When the jobs are finished, check the content of the output files 
cat slurm-<job_array_id>_<job_id>.out
```
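With 50 output files, checking them one by one with `cat` quickly becomes tedious. The sketch below scans them from Python instead; `unfinished_tasks` is a hypothetical helper. It assumes the default file naming described above and looks for the "Optimization done" line printed at the end of the Python script. The "Job finished" line is not a reliable marker, because the Slurm file echoes it even when the Python script fails.

```python
import re
from pathlib import Path

# Default Slurm naming for array jobs: slurm-<job_array_id>_<job_id>.out
OUTPUT_PATTERN = re.compile(r"slurm-(\d+)_(\d+)\.out")


def unfinished_tasks(directory, job_array_id, marker="Optimization done"):
    """Return the sorted job indices whose output file lacks the marker."""
    missing = []
    for path in Path(directory).glob(f"slurm-{job_array_id}_*.out"):
        match = OUTPUT_PATTERN.fullmatch(path.name)
        if match and marker not in path.read_text(errors="replace"):
            missing.append(int(match.group(2)))
    return sorted(missing)
```

Only jobs that have already written an output file are considered, so jobs still waiting in the queue do not appear in the result.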

### Step 5 : Retrieve the Optuna logs and visualize them

Once the hyperparameter optimization is complete on the remote cluster, you can copy the Optuna journal file back to your local directory using the `rsync` command.


Bash command to type on your local machine to retrieve the Optuna logs : 

```bash
rsync -av username@cluster_address:remote_directory/optuna-journal_cluster_parallelization_tutorial.log local_directory

```

After transferring the log file, you can then load the study and plot the results in a Jupyter notebook (using the same procedure as described in the first tutorial) : 

```python
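import optuna
from optuna.storages import JournalStorage, JournalFileStorage
from optuna.visualization import plot_slice

# Same study name and journal file as on the cluster
# (adjust the path if the log file is not in the current directory)
study_name = "cluster_parallelization_tutorial"
storage = JournalStorage(JournalFileStorage(f"optuna-journal_{study_name}.log"))

# Load the study with the correct name and storage
study = optuna.load_study(study_name=study_name, storage=storage)

# Plot it with the function of your choice
plot_slice(study)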
```