# Deploy tasks on SLURM schedulers

High-performance supercomputers, as well as clusters at academic institutions and other smaller clusters, are typically managed with a job scheduler such as PBS, SLURM, MOAB, SGE, LSF, or HTCondor. On these systems, Dask and [Dask-Jobqueue](https://jobqueue.dask.org/en/latest/) let you submit cluster jobs from Python scripts, so that big-data workloads can run on large-scale computing facilities.

## Overview

In this notebook, we will look briefly at two generic examples that illustrate the steps to instantiate workers on SLURM-based clusters. The first example covers computations that rely solely on Python instructions. The second example shows an alternative method that runs code from a separate file, normally a bash script, which can in turn call an external executable.

# 1. Libraries

In [None]:
import dask
import dask.dataframe
import textwrap
import subprocess

import pandas as pd

from dask.distributed import Client
from dask_jobqueue    import SLURMCluster

from pathlib import Path as path

# 2. Run a python function using the SLURM scheduler

The steps for this example are based on the following function, `run_task`. This is the driver function each worker will use to run its task. Since the code inside this function can become arbitrarily complex, here we will assume that it only receives one argument named `data`.

In [None]:
def run_task(data):
    
    # Include here the code of your choice
    
    pass

As usual, we now have to set up the Cluster and Client objects, then iterate over the elements obtained, e.g., from a DataFrame. Please keep in mind that strings of the form '\<some_string\>' are generic placeholders specific to the cluster you are using, so you will need to modify them accordingly.

Note that we use `client.submit` to schedule each task as a *future*, passing our callable and its arguments.

In [None]:
if __name__ == '__main__':
#
### If needed, prepare empty array to collect futures
#
    futures = []
#
### Define Cluster and Client
#
    cluster = SLURMCluster(cores=1,
                           memory='2GB',
                           account='<your_account>',
                           queue='<the_queue>',
                           walltime='12:00:00',
                           job_extra_directives=['--ntasks=1', '--nodes=1', '--qos=<your_qos>']
                          )
    
    cluster.scale(jobs=16)
    
    client = Client(cluster)
#
### Define the setting you need for 'run_task'
#

#
### Define your dataframe
#
    pandascsv = pd.read_csv('<your_file.csv>')
    dataframe = dask.dataframe.from_pandas(pandascsv, npartitions=1)
#
### Execute 'run_task' for a generic field named 'Species'
#
    for row in dataframe.itertuples():
        future = client.submit(run_task, row.Species)
        
        futures.append(future)
#
### Collect your results if needed
#
    results = [future.result() for future in futures]
#
### Close Cluster and Client
#
    client.shutdown()
    
    client.close()
    
    cluster.close()
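
The submit-then-collect pattern used above can be tried without a SLURM allocation. The sketch below is only a stand-in: it uses the standard-library `concurrent.futures` module instead of a Dask `Client`, on the assumption that the `submit` / `result` calls play the same role in both. The `run_task` body and the `species` list are hypothetical placeholders, not part of the original workflow.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(data):
    # Hypothetical placeholder task: return the length of the input string
    return len(data)


# Hypothetical values standing in for the 'Species' column of a DataFrame
species = ["setosa", "versicolor", "virginica"]

with ThreadPoolExecutor(max_workers=2) as executor:
    # submit() returns immediately with a future for each task
    futures = [executor.submit(run_task, name) for name in species]

    # result() blocks until the corresponding task has finished
    results = [future.result() for future in futures]

print(results)
```

Because results are read in the order the futures were submitted, `results` lines up with `species` even when tasks finish out of order.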

# 3. Run a script file using the SLURM scheduler

Instead of running pure Python code, sometimes we might be interested in running an external program we compiled previously, or simply calling some files outside Python. For such cases, we can use an analogous approach. The difference is that we use the `subprocess` module instead to execute an external command, e.g., `bash filename`, and collect the commands necessary to run the external code in `filename`. Again, please note that you will need to modify the code depending on your needs.

In [None]:
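def write_bash_file(filename):
    
    # Write the bash script that each worker will execute
    with open(filename, 'w') as script:
        script.write(textwrap.dedent('''\
            #!/bin/bash
            
            <your bash code>
        '''))
    
def run_bash_file(filename):
    
    # Accept plain strings as well as Path objects
    filename = path(filename)
    
    write_bash_file(filename)
    
    # Redirect the script's output and errors to '<filename>.out' and '<filename>.err'
    with open(filename.parent/f'{filename.name}.out', 'w') as stdout, \
         open(filename.parent/f'{filename.name}.err', 'w') as stderr:
        
        subprocess.call(['bash', filename], stdout=stdout, stderr=stderr)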
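
When a script runs on a remote worker, a failure inside it is easy to miss, because the exit code of the `bash` process is discarded. The following is a minimal sketch of one way to surface it, assuming it is acceptable for the task to return the exit code. The helper names `write_script` and `run_script`, the file `demo.sh`, and the script body are hypothetical and only serve the illustration.

```python
import shutil
import subprocess
import tempfile
import textwrap
from pathlib import Path


def write_script(script_path, body):
    # Write a bash script: shebang line followed by the dedented body
    script_path.write_text("#!/bin/bash\n" + textwrap.dedent(body))


def run_script(script_path):
    # Run the script with bash, log to '<name>.out' / '<name>.err',
    # and return the exit code so the caller can detect failures
    out_path = script_path.parent / f"{script_path.name}.out"
    err_path = script_path.parent / f"{script_path.name}.err"
    with open(out_path, "w") as stdout, open(err_path, "w") as stderr:
        return subprocess.call(["bash", str(script_path)], stdout=stdout, stderr=stderr)


with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "demo.sh"
    write_script(script, """\
        echo "hello from bash"
        exit 3
    """)
    if shutil.which("bash"):  # skip the demonstration where bash is unavailable
        exit_code = run_script(script)
        logged = (script.parent / "demo.sh.out").read_text()
    else:
        exit_code, logged = None, None

print(exit_code, logged)
```

If a function like `run_script` were submitted with `client.submit`, the value returned by `future.result()` would be this exit code, so a non-zero value could be checked when collecting results.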

Just like in the previous example, we now need to set the Cluster and Client objects. The following code also iterates over the rows of a generic DataFrame.

In [None]:
if __name__ == '__main__':
#
### If needed, prepare empty array to collect futures
#
    futures = []
#
### Define Cluster and Client
#
    cluster = SLURMCluster(cores=1,
                           memory='2GB',
                           account='<your_account>',
                           queue='<the_queue>',
                           walltime='12:00:00',
                           job_extra_directives=['--ntasks=1', '--nodes=1', '--qos=<your_qos>']
                          )
    
    cluster.scale(jobs=16)
    
    client = Client(cluster)
#
### Define the setting you need for 'run_bash_file'
#
    
#
### Define your dataframe
#
    pandascsv = pd.read_csv('<your_file.csv>')
    dataframe = dask.dataframe.from_pandas(pandascsv, npartitions=1)
#
### Execute 'run_bash_file' for a generic field named 'Species'
#
    for row in dataframe.itertuples():
        future = client.submit(run_bash_file, f'{row.Species}.sh')
        
        futures.append(future)
#
### Collect your results if needed
#
    results = [future.result() for future in futures]
#
### Close Cluster and Client
#
    client.shutdown()
    
    client.close()
    
    cluster.close()