## Run a pipeline using a dask gateway cluster

This notebook runs the entire pipeline described in a configuration file using a Dask cluster (a LocalCluster in this example).

## Import dependencies

In [15]:
%load_ext autoreload
%autoreload 2
from paidiverpy.pipeline import Pipeline



## Instantiate the Pipeline Class

In [16]:
pipeline = Pipeline(config_file_path="../config_files/config_benthic_client.yml", verbose=2)

☁ paidiverpy ☁  |       INFO | 2025-02-27 13:42:41 | Created LocalCluster with Client: http://127.0.0.1:37537/status


When you instantiate the Pipeline class with this configuration file, a LocalCluster is created, and you can open its dashboard at the link in the log message above.

This occurs because the configuration file contains the following parameters, which also appear in the full configuration printed further down:

```
"n_jobs": 2,
"client": {
    "cluster_type": "local",
    "params": {
        "n_workers": 4,
        "threads_per_worker": 4,
        "memory_limit": "8GB"
    }
}
```
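
To get a feel for what these settings request, the short sketch below parses the fragment and works out the total thread count and memory budget. It assumes the "params" block is forwarded unchanged to `dask.distributed.LocalCluster`, where `memory_limit` is a per-worker limit; the helper `summarise_client` is hypothetical and not part of paidiverpy.

```python
import json

# The "client" fragment from the configuration shown above.
CLIENT_FRAGMENT = """
{
    "n_jobs": 2,
    "client": {
        "cluster_type": "local",
        "params": {
            "n_workers": 4,
            "threads_per_worker": 4,
            "memory_limit": "8GB"
        }
    }
}
"""


def summarise_client(config: dict) -> dict:
    """Hypothetical helper: summarise the resources a "client" block asks for.

    Assumes "params" is passed as-is to dask.distributed.LocalCluster,
    where "memory_limit" applies to each worker separately.
    """
    params = config["client"]["params"]
    workers = params["n_workers"]
    threads = params["threads_per_worker"]

    limit = params["memory_limit"]
    # Only the "<number>GB" form used in this example is handled.
    if not limit.endswith("GB"):
        raise ValueError(f"unsupported memory_limit format: {limit!r}")
    gb_per_worker = float(limit[:-2])

    return {
        "cluster_type": config["client"]["cluster_type"],
        "total_threads": workers * threads,
        "total_memory_gb": workers * gb_per_worker,
    }


summary = summarise_client(json.loads(CLIENT_FRAGMENT))
print(summary)
```

Under that assumption, the values above ask for 16 threads and 32 GB of memory across the four workers, so it is worth checking that the machine can provide that before running the pipeline.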

In [17]:
pipeline.config

{
    "general": {
        "name": "raw",
        "step_name": "open",
        "sample_data": "benthic_csv",
        "input_path": "/home/tobfer/.paidiverpy_cache/benthic_csv/images",
        "metadata_path": "/home/tobfer/.paidiverpy_cache/benthic_csv/metadata/metadata_benthic_csv.csv",
        "metadata_type": "CSV_FILE",
        "image_type": "PNG",
        "append_data_to_metadata": "/home/tobfer/.paidiverpy_cache/benthic_csv/metadata/appended_metadata_benthic_csv.csv",
        "is_remote": false,
        "output_is_remote": false,
        "output_path": "output",
        "n_jobs": 2,
        "client": {
            "cluster_type": "local",
            "params": {
                "n_workers": 4,
                "threads_per_worker": 4,
                "memory_limit": "8GB"
            }
        },
        "track_changes": false,
        "rename": null,
        "sampling": [
            {
                "name": "sampling",
                "step_name": "sampling",
                "m

In [18]:
# Display the pipeline steps. Click on a step to see more information about it.
pipeline

## Run the pipeline

While the pipeline runs, you can monitor the workers on the dashboard at the link shown in the log output.

In [19]:
# Run the pipeline
pipeline.run()

☁ paidiverpy ☁  |       INFO | 2025-02-27 13:42:48 | Processing images using Dask client using the following dashboard link: http://127.0.0.1:37537/status
☁ paidiverpy ☁  |       INFO | 2025-02-27 13:42:48 | Running step 0: raw - OpenLayer
☁ paidiverpy ☁  |       INFO | 2025-02-27 13:42:50 | Step 0 completed
☁ paidiverpy ☁  |       INFO | 2025-02-27 13:42:50 | Running step 1: colour_alteration - ColourLayer
☁ paidiverpy ☁  |       INFO | 2025-02-27 13:42:51 | Step 1 completed
☁ paidiverpy ☁  |       INFO | 2025-02-27 13:42:51 | Running step 2: gaussian_blur - ColourLayer
☁ paidiverpy ☁  |       INFO | 2025-02-27 13:42:51 | Step 2 completed
☁ paidiverpy ☁  |       INFO | 2025-02-27 13:42:51 | Running step 3: sharpen - ColourLayer
☁ paidiverpy ☁  |       INFO | 2025-02-27 13:42:52 | Step 3 completed
☁ paidiverpy ☁  |       INFO | 2025-02-27 13:42:52 | Running step 4: contrast - ColourLayer
☁ pa

## See the images

Because the configuration file sets "track_changes" to false, only the output of the last layer is kept.
This saves memory when you are processing a large dataset.

In [20]:
pipeline.images
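
The memory trade-off behind "track_changes" can be sketched in plain Python. This is only an illustration of the general idea, not paidiverpy's implementation: the function `run_steps` and the toy steps are hypothetical, and plain lists of numbers stand in for images.

```python
from typing import Callable, List

Step = Callable[[list], list]


def run_steps(images: list, steps: List[Step], track_changes: bool) -> List[list]:
    """Hypothetical illustration; not paidiverpy's implementation.

    Returns the outputs that are still held in memory after all steps ran.
    """
    kept = [images]
    for step in steps:
        result = step(kept[-1])
        if track_changes:
            kept.append(result)  # keep every intermediate output
        else:
            kept = [result]  # drop the previous output, keep only the latest
    return kept


# Two toy "layers" applied to a toy "image".
steps = [
    lambda imgs: [p + 1 for p in imgs],
    lambda imgs: [p * 2 for p in imgs],
]
data = [1, 2, 3]

all_outputs = run_steps(data, steps, track_changes=True)
last_only = run_steps(data, steps, track_changes=False)
print(len(all_outputs), len(last_only))
```

With tracking enabled, the input and every intermediate result stay in memory; with it disabled, only the final result does, which is why the setting matters for large datasets.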