## Run a pipeline using a dask gateway cluster

This notebook runs the entire pipeline described in a configuration file using a Dask cluster (a LocalCluster in this example).

## Import dependencies

In [1]:
%load_ext autoreload
%autoreload 2
from paidiverpy.pipeline import Pipeline

## Instantiate the Pipeline Class

In [2]:
pipeline = Pipeline(config_file_path="../config_files/config_benthic_client.yml", verbose=2)

When you instantiate the Pipeline class with this configuration file, a LocalCluster is created, and you can open its dashboard from the link shown in the cell output.

This occurs because the configuration file contains the following parameters:

```
"n_jobs": 2,
"client": {
    "cluster_type": "local",
    "params": {
        "n_workers": 4,
        "threads_per_worker": 4,
        "memory_limit": "8GB"
    }
}
```
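To get a feel for the resources this configuration requests, the fragment above can be parsed and summarised with the standard library alone. This is a minimal sketch, not part of paidiverpy: the helper `summarize_cluster` is hypothetical, and it assumes the `params` block is forwarded unchanged to Dask's `LocalCluster`, where `memory_limit` applies to each worker rather than to the whole cluster.

```python
import json

# The configuration fragment shown above, wrapped in braces so that it
# parses as a standalone JSON object.
CONFIG_FRAGMENT = """
{
    "n_jobs": 2,
    "client": {
        "cluster_type": "local",
        "params": {
            "n_workers": 4,
            "threads_per_worker": 4,
            "memory_limit": "8GB"
        }
    }
}
"""


def summarize_cluster(config: dict) -> dict:
    """Hypothetical helper: total up the resources requested by the client block."""
    params = config["client"]["params"]
    n_workers = params["n_workers"]
    threads_per_worker = params["threads_per_worker"]
    memory_limit = params["memory_limit"]
    # Assumption: the limit is written as "<integer>GB" and is per worker.
    if not memory_limit.endswith("GB"):
        raise ValueError(f"Unsupported memory_limit format: {memory_limit!r}")
    memory_per_worker_gb = int(memory_limit[:-2])
    return {
        "cluster_type": config["client"]["cluster_type"],
        "total_threads": n_workers * threads_per_worker,
        "total_memory_gb": n_workers * memory_per_worker_gb,
    }


summary = summarize_cluster(json.loads(CONFIG_FRAGMENT))
print(summary)
```

Under these assumptions the cluster runs 4 workers with 4 threads each, so up to 16 tasks can execute concurrently, and the workers together may use up to 32 GB of memory. Check that the machine can provide this before running the pipeline.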

In [3]:
pipeline.config

{
    "general": {
        "name": "raw",
        "step_name": "open",
        "sample_data": "benthic_csv",
        "input_path": "/home/tobfer/.paidiverpy_cache/benthic_csv/images",
        "metadata_path": "/home/tobfer/.paidiverpy_cache/benthic_csv/metadata/metadata_benthic_csv.csv",
        "metadata_type": "CSV_FILE",
        "image_type": "PNG",
        "append_data_to_metadata": "/home/tobfer/.paidiverpy_cache/benthic_csv/metadata/appended_metadata_benthic_csv.csv",
        "is_remote": false,
        "output_is_remote": false,
        "output_path": "output",
        "n_jobs": 2,
        "client": {
            "cluster_type": "local",
            "params": {
                "n_workers": 4,
                "threads_per_worker": 4,
                "memory_limit": "8GB"
            }
        },
        "track_changes": false,
        "rename": null,
        "sampling": [
            {
                "name": "sampling",
                "step_name": "sampling",
                "m

In [4]:
# See the pipeline steps. Click on a step to see more information about it
pipeline

## Run the pipeline

You can follow the workers on the dashboard while the pipeline runs.

In [5]:
# Run the pipeline
pipeline.run()

☁ paidiverpy ☁  |       INFO | 2025-03-06 10:28:33 | Processing images using Dask client using the following dashboard link: http://127.0.0.1:8787/status
☁ paidiverpy ☁  |       INFO | 2025-03-06 10:28:33 | Running step 0: raw - OpenLayer
☁ paidiverpy ☁  |       INFO | 2025-03-06 10:28:35 | Step 0 completed
☁ paidiverpy ☁  |       INFO | 2025-03-06 10:28:35 | Running step 1: colour_alteration - ColourLayer
☁ paidiverpy ☁  |       INFO | 2025-03-06 10:28:36 | Step 1 completed
☁ paidiverpy ☁  |       INFO | 2025-03-06 10:28:36 | Running step 2: gaussian_blur - ColourLayer
☁ paidiverpy ☁  |       INFO | 2025-03-06 10:28:37 | Step 2 completed
☁ paidiverpy ☁  |       INFO | 2025-03-06 10:28:37 | Running step 3: sharpen - ColourLayer
☁ paidiverpy ☁  |       INFO | 2025-03-06 10:28:37 | Step 3 completed
☁ paidiverpy ☁  |       INFO | 2025-03-06 10:28:37 | Running step 4: contrast - ColourLayer
☁ paidiverpy ☁  |       INFO | 2025-03-06 10:28:38 | Step 4 completed



## See the images

Because the configuration file sets "track_changes" to false, only the output of the last layer is saved.
This strategy saves memory when you are processing a large dataset.
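The memory trade-off can be illustrated with a toy pipeline. The sketch below is not how paidiverpy is implemented: `run_steps` and the two lambda steps are hypothetical stand-ins that only show why keeping every intermediate output costs more memory than keeping the last one.

```python
from typing import Callable, List

Step = Callable[[List[int]], List[int]]


def run_steps(data: List[int], steps: List[Step], track_changes: bool) -> List[List[int]]:
    """Apply each step in order and return the outputs still held in memory."""
    kept = [data]
    for step in steps:
        result = step(kept[-1])
        if track_changes:
            # Keep the input and every intermediate output.
            kept.append(result)
        else:
            # Drop the previous output and keep only the latest one.
            kept = [result]
    return kept


# Two toy "layers" standing in for image-processing steps.
steps: List[Step] = [
    lambda xs: [x + 1 for x in xs],
    lambda xs: [x * 2 for x in xs],
]

all_outputs = run_steps([1, 2, 3], steps, track_changes=True)
last_only = run_steps([1, 2, 3], steps, track_changes=False)

print(len(all_outputs), len(last_only))
```

With tracking enabled, the number of stored outputs grows with the number of steps. With tracking disabled, it stays at one regardless of pipeline length.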

In [6]:
pipeline.images