# Creating a Rasterized ARCO Version of a Geodatabase

This Jupyter Notebook guides you through converting a geodatabase into a rasterized ARCO (Analysis-Ready, Cloud-Optimized) version. The geodatabase contains multiple layers of geological seabed substrate data obtained from EMODnet Geology.

The conversion process is divided into several steps, each utilizing different Python packages:

1. **Reading Geospatial Data**: We use the `fiona` package to read the geospatial data from the geodatabase.

2. **Data Manipulation**: The `geopandas` package allows us to manipulate the vector geospatial data as needed.

3. **Raster Operations**: We use the `rasterio` package to rasterize the vector features onto a regular grid.

4. **Working with Multi-dimensional Arrays**: The `xarray` package enables us to work with multi-dimensional arrays, which is crucial for handling geospatial data.

5. **Data Storage**: Finally, we use the `zarr` package to store the processed data in a compressed, chunked format. The resulting store can be placed in S3 object storage, which allows subsets of the data to be read directly in the cloud.


In [2]:
import fiona
import pandas as pd
import geopandas as gpd
import rasterio
import rasterio.features
import zarr
import xarray as xr
import os
import numpy as np
from tqdm import tqdm
from datetime import datetime, date
import urllib.request
import zipfile
import dask
import shutil

The `attributes_update()` function updates the attributes of a dataset. It takes the dataset, a title, the resolution, and the source URL (`zipurl`) as input. It updates the latitude and longitude attributes, sets the CRS (EPSG:4326), adds the spatial extent and the resolution, and records a `History` attribute stating where the data came from and when it was converted. Further attributes, such as sources or comments, can be added in the same way.


In [10]:
def attributes_update(dataset, title, resolution, zipurl):
        latitudeattrs = {'_CoordinateAxisType': 'Lat', 
                            'axis': 'Y', 
                            'long_name': 'latitude', 
                            'max': dataset.latitude.values.max(), 
                            'min': dataset.latitude.values.min(), 
                            'standard_name': 'latitude', 
                            'step': (dataset.latitude.values.max() - dataset.latitude.values.min()) / dataset.latitude.values.shape[0], 
                            'units': 'degrees_north'
            }
        longitudeattrs = {'_CoordinateAxisType': 'Lon', 
                        'axis': 'X', 
                        'long_name': 'longitude',
                        'max': dataset.longitude.values.max(),
                        'min': dataset.longitude.values.min(),
                        'standard_name': 'longitude', 
                        'step': (dataset.longitude.values.max() - dataset.longitude.values.min()) / dataset.longitude.values.shape[0], 
                        'units': 'degrees_east'
        }
        dataset.latitude.attrs.update(latitudeattrs)
        dataset.longitude.attrs.update(longitudeattrs)

        # Set the CRS as an attribute
        dataset.attrs['proj:epsg'] = 4326
        dataset.attrs['resolution'] = resolution
        dataset.attrs.update({
            'geospatial_lat_min': dataset['latitude'].min().item(),
            'geospatial_lat_max': dataset['latitude'].max().item(),
            'geospatial_lon_min': dataset['longitude'].min().item(),
            'geospatial_lon_max': dataset['longitude'].max().item()
        })
        # Record where the data comes from and when it was converted
        dataset.attrs['History'] = f'Zarr dataset converted from {title}.gdb, downloaded from {zipurl}, on {date.today()}'
        
        # Add any other attributes you consider necessary for the metadata of your Zarr dataset
        # dataset.attrs['sources'] = source
    

        return dataset
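
The `step` attribute above divides the coordinate range by the number of points. For evenly spaced coordinates, `n` points span `n - 1` intervals, so the two quantities differ slightly. The sketch below illustrates the difference with a hypothetical helper, `coord_step`, which is not part of the notebook:

```python
import numpy as np

def coord_step(values):
    """Mean spacing between consecutive values of an evenly spaced coordinate."""
    values = np.asarray(values, dtype=float)
    if values.size < 2:
        raise ValueError("need at least two coordinate values")
    # n points span n - 1 intervals
    return (values.max() - values.min()) / (values.size - 1)

lat = np.linspace(50.0, 51.0, 101)   # 101 points, 0.01 degrees apart
step = coord_step(lat)
# Same range divided by the number of points, as in attributes_update()
naive = (lat.max() - lat.min()) / lat.shape[0]
print(step, naive)
```

Which of the two values is more appropriate depends on whether `step` is meant to describe the coordinate spacing or the nominal pixel size.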


### Converting GeoDataFrame to Zarr Format

The `gdf2zarrconverter` function rasterizes one attribute (`native_var`) of one geodatabase layer and writes the result to a Zarr store.

**Steps:**

1. Data Cleaning: Normalizes empty or placeholder values so that missing data is represented uniformly.
2. Spatial Extent Determination: Reads the bounding box of the layer.
3. Resolution and Dimensions Calculation: Derives the raster width and height from the spatial extent and a fixed resolution of 0.01 degrees.
4. Data Preparation: Detects whether the column is categorical or numerical and encodes categorical values as integers.
5. Rasterization: Burns the encoded values into a raster layer.
6. Creating Xarray Dataset: Builds a dataset that holds the raster layer with latitude and longitude coordinates.
7. Attributes and Metadata: Attaches the categorical encoding to the variable and updates the dataset attributes.
8. Saving to Zarr: Saves the dataset to a Zarr store in the specified directory.
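
Step 4 can be sketched in isolation. The example below is a simplified, pure-Python take on the encoding idea, not the notebook's exact implementation; `encode_labels` and its `missing` parameter are hypothetical names chosen for illustration. Missing values share the code 1, and the remaining labels are numbered in sorted order starting at 2:

```python
def encode_labels(labels, missing=("", " ", "0", "nan", None)):
    """Map string labels to integer codes, reserving 1 for missing values."""
    cleaned = ["None" if value in missing else value for value in labels]
    mapping = {"None": 1}
    # Sorted order keeps the mapping reproducible between runs
    for label in sorted(set(cleaned)):
        if label != "None":
            mapping[label] = len(mapping) + 1
    codes = [mapping[value] for value in cleaned]
    return codes, mapping

codes, mapping = encode_labels(["Sand", "Mud", " ", "Sand", None])
print(codes)    # [3, 2, 1, 3, 1]
print(mapping)  # {'None': 1, 'Mud': 2, 'Sand': 3}
```

Because the raster is initialized with zeros, the value 0 stays free for cells that no polygon covers.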

In [73]:
def gdf2zarrconverter(file_path, native_var, title, layer, arco_asset_tmp_path, zipurl):

    def cleaner(data):
        # Normalize empty, blank, or placeholder strings to 'None'.
        # (A comparison with np.nan is always False, so it is not used here.)
        if isinstance(data, str) and data.strip() in ('', '0', 'nan'):
            data = 'None'
        return data

    def encode_categorical(data):
        if isinstance(data[0], str):
            data = pd.Series(data)
            data = data.fillna('None')  # replace None values with 'None'
            
            data[data == ' '] = 'None'
            data[data == '0'] = 'None'
            data = data.values 
            unique_categories = np.unique(data)
            category_mapping = {'None': 1}
            counter = 2
            for category in unique_categories:
                if category != 'None':
                    category_mapping[category] = counter
                    counter += 1
            encoded_data = np.array([category_mapping.get(item, np.nan) for item in data])
        else:
            encoded_data = data.astype(np.float32)
            category_mapping = {}
        return encoded_data, category_mapping

    with fiona.open(file_path, 'r', layer=layer) as src:
        crs = src.crs
        total_bounds = src.bounds
        lon_min, lat_min, lon_max, lat_max = total_bounds
        resolution = 0.01
        width = int(np.ceil((lon_max - lon_min) / resolution))
        height = int(np.ceil((lat_max - lat_min) / resolution))
        raster_transform = rasterio.transform.from_bounds(lon_min, lat_min, lon_max, lat_max, width, height)
        raster = np.zeros((height, width), dtype=np.float32)
        data = []
        geometries = []
        with tqdm(total=len(src), desc=f"Processing features of {layer} - {native_var}") as pbar:
            for feature in src:
                value = cleaner(feature['properties'][native_var])
                data.append(value)
                geometries.append(feature['geometry'])
                pbar.update()
        data = np.array(data)
        encoded_data, category_mapping = encode_categorical(data)
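        # rasterize() burns all shapes in a single call, so this bar only ticks once, when the call returns
        with tqdm(total=len(geometries), desc="Rasterizing") as pbar: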
            rasterio.features.rasterize(
                ((geom, value) for geom, value in zip(geometries, encoded_data)),
                out=raster,
                transform=raster_transform,
                merge_alg=rasterio.enums.MergeAlg.replace,
                dtype=np.float32,
            )
            pbar.update()
        
        # make xarray dataset, arrange latitude from max to min since rasterio makes rasters from top left to bottom right
        dataset = xr.Dataset(coords={'latitude':  np.round(np.linspace(lat_max, lat_min, height, dtype=float), decimals=4),
                                    'longitude': np.round(np.linspace(lon_min, lon_max, width, dtype=float), decimals=4)})
        dataset[native_var] = (['latitude', 'longitude'], raster)
        dataset = dataset.sortby('latitude')

        if category_mapping:
            # save the mapping dictionary with the variable attributes
            dataset[native_var].attrs['categorical_encoding']= category_mapping

        dataset = attributes_update(dataset, title, resolution, zipurl)
        zarr_var_path = f"{arco_asset_tmp_path}/{title}_{native_var}.zarr"
        dataset.to_zarr(zarr_var_path, mode='w', consolidated=True)
        return zarr_var_path
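
Steps 2 and 3 come down to a little arithmetic, sketched below without `rasterio`. The helper names `grid_shape` and `pixel_center` and the example bounds are assumptions made for illustration. The sketch mimics how a north-up grid is fitted to a bounding box: the width and height are rounded up, so the effective pixel size can be slightly smaller than the nominal resolution.

```python
import math

def grid_shape(lon_min, lat_min, lon_max, lat_max, resolution):
    """Number of columns and rows needed to cover the bounds."""
    width = math.ceil((lon_max - lon_min) / resolution)
    height = math.ceil((lat_max - lat_min) / resolution)
    return width, height

def pixel_center(col, row, lon_min, lat_min, lon_max, lat_max, width, height):
    """Centre of pixel (row, col); row 0 is the northernmost row."""
    dx = (lon_max - lon_min) / width    # effective pixel width
    dy = (lat_max - lat_min) / height   # effective pixel height
    return lon_min + (col + 0.5) * dx, lat_max - (row + 0.5) * dy

bounds = (-10.0, 40.0, -9.0, 40.6)      # hypothetical lon_min, lat_min, lon_max, lat_max
width, height = grid_shape(*bounds, resolution=0.25)
lon0, lat0 = pixel_center(0, 0, *bounds, width, height)
print(width, height, lon0, lat0)
```

Here the latitude range of 0.6 degrees needs three rows at a nominal 0.25 degrees, so each row is effectively 0.2 degrees high. Note that the notebook builds its coordinates with `np.linspace` from edge to edge rather than from pixel centres.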


In this example we use a multi-layer geodatabase of geological seabed substrate data from EMODnet Geology (https://emodnet.ec.europa.eu/geonetwork/srv/eng/catalog.search#/metadata/6eaf4c6bf28815e973b9c60aab5734e3ef9cd9c4).

In [74]:

import os
import fiona
import xarray as xr
import dask
from tqdm import tqdm


# Download the zip file
zipurl = 'https://s3.waw3-1.cloudferro.com/emodnet/emodnet_native/emodnet_geology/seabed_substrate/multiscale_folk_5/EMODnet_GEO_Seabed_Substrate_All_Res.zip'
geodatabase = 'EMODnet_Seabed_Substrate_1M.gdb'
zip_file = os.path.basename(zipurl)
class TqdmUpTo(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)

with TqdmUpTo(unit='B', unit_scale=True, miniters=1, desc=zip_file) as t:
    urllib.request.urlretrieve(zipurl, filename=zip_file, reporthook=t.update_to)

# Extract the geodatabase from the zip file
with zipfile.ZipFile(zip_file, 'r') as zip_ref:
    zip_ref.extractall('extracted_files')

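# Locate the geodatabase directory inside the extracted files
gdb_path = None
for root, dirs, files in os.walk('extracted_files'):
    if geodatabase in dirs:
        gdb_path = os.path.join(root, geodatabase)
        break
if gdb_path is None:
    raise FileNotFoundError(f"{geodatabase} not found in extracted_files")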

EMODnet_GEO_Seabed_Substrate_All_Res.zip: 777MB [01:12, 10.7MB/s]                              
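
As an alternative to walking the tree with `os.walk`, the geodatabase directory can be located with `pathlib`. This is a minimal sketch; `find_gdb` is a hypothetical helper, and the directory layout in the demonstration is made up:

```python
import tempfile
from pathlib import Path

def find_gdb(root, name):
    """Return the first directory called `name` under `root`, or raise."""
    for candidate in Path(root).rglob(name):
        if candidate.is_dir():
            return candidate
    raise FileNotFoundError(f"{name} not found under {root}")

# Demonstration on a throwaway directory tree (hypothetical layout)
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "nested" / "EMODnet_Seabed_Substrate_1M.gdb").mkdir(parents=True)
    found = find_gdb(tmp, "EMODnet_Seabed_Substrate_1M.gdb")
    found_name = found.name
    found_parent = found.parent.name

print(found_name, found_parent)
```

Raising an error when nothing matches avoids a later `NameError` on an undefined path.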


### Geodatabase to Zarr

Geodatabases often contain multiple layers, each with a number of columns (variables).

We reduce the burden of conversion by converting one variable of one layer at a time into its own Zarr dataset. We then combine these into a single Zarr dataset per layer, using dask to rechunk the variables. This keeps the Zarr datasets compatible with each other and lays the groundwork for distributed computation if necessary.
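
The combining step can be illustrated in memory, without Zarr stores or dask chunking. The sketch below assumes two single-variable datasets on the same grid; the helper `single_var`, the coordinate values, and the fill values are made up for illustration:

```python
import numpy as np
import xarray as xr

lat = np.array([50.0, 50.5])
lon = np.array([0.0, 0.5, 1.0])

def single_var(name, fill):
    """Build a one-variable dataset on the shared grid (illustrative data)."""
    data = np.full((lat.size, lon.size), fill, dtype=np.float32)
    return xr.Dataset(
        {name: (("latitude", "longitude"), data)},
        coords={"latitude": lat, "longitude": lon},
    )

# Start from an empty dataset and merge the variables in one by one
combined = xr.Dataset()
for ds in (single_var("Folk_5cl", 2.0), single_var("Folk_7cl", 5.0)):
    combined = xr.merge([combined, ds], compat="override", join="outer")

print(combined)
```

With `join="outer"`, datasets whose coordinates differ are aligned on the union of their coordinates, and cells without data are filled with NaN.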

In [75]:
temp_zarr_path = 'converted_zarr_files'

os.makedirs(temp_zarr_path, exist_ok=True)
title = os.path.splitext(os.path.basename(geodatabase))[0]

# Get the layers from the geodatabase
layers = fiona.listlayers(gdb_path)

# Create an empty xarray dataset to hold the combined data
combined_dataset = xr.Dataset()

# Process each layer and each variable using gdf2zarrconverter
for layer in layers:
    # Get the variables from the layer
    with fiona.open(gdb_path, layer=layer) as src:
        variables = list(src.schema['properties'].keys())
    
    zarr_vars_paths = []  # paths of the per-variable Zarr stores for this layer
    for variable in variables:
        try:
            print(f"Processing {layer} - {variable}")
            zarr_var_path = gdf2zarrconverter(gdb_path, variable, title, layer, temp_zarr_path, zipurl)
            zarr_vars_paths.append(zarr_var_path)
        except Exception as e:
            print(f"Failed to process {layer} - {variable}: {e}")
            continue

    with dask.config.set(scheduler='single-threaded'):
        for path in zarr_vars_paths:
            try:
                dataset = xr.open_dataset(path, chunks={})  # Use Dask to lazily load the dataset
                dataset = dataset.chunk({'latitude': 'auto', 'longitude': 'auto'}) 
                combined_dataset = xr.merge([combined_dataset, dataset], compat='override', join='outer')
            except Exception as e:
                print(f"Failed to combine zarr dataset {path}: {e}")
                continue

    # add applicable categorical encodings
    categorical_encodings_dict = {}
    for var in combined_dataset.variables:
        if 'categorical_encoding' in combined_dataset[var].attrs:
            categorical_encodings_dict[var] = combined_dataset[var].attrs['categorical_encoding']

    combined_dataset.attrs['categorical_encoding'] = categorical_encodings_dict

    with dask.config.set(scheduler='single-threaded'):
        try:
            final_dataset = combined_dataset.chunk({'latitude': 'auto', 'longitude': 'auto'})
            zarr_path = f"{layer}.zarr"
            final_dataset.to_zarr(zarr_path, mode='w')
            # Remove the temporary per-variable Zarr stores
            shutil.rmtree(temp_zarr_path)
        except Exception as e:
            print(f"final zarr dataset did not save {layer}: {e}")

# Print the combined dataset
print(combined_dataset)

Processing Seabed_substrate_1M_Sep2023 - Code


Processing features of Seabed_substrate_1M_Sep2023 - Code: 100%|██████████| 33645/33645 [00:03<00:00, 10369.23it/s]
Rasterizing:   0%|          | 1/33645 [00:02<21:18:08,  2.28s/it]


Processing Seabed_substrate_1M_Sep2023 - Country


Processing features of Seabed_substrate_1M_Sep2023 - Country: 100%|██████████| 33645/33645 [00:03<00:00, 10213.23it/s]
Rasterizing:   0%|          | 1/33645 [00:01<17:25:11,  1.86s/it]


Processing Seabed_substrate_1M_Sep2023 - Name


Processing features of Seabed_substrate_1M_Sep2023 - Name: 100%|██████████| 33645/33645 [00:03<00:00, 9875.61it/s] 
Rasterizing:   0%|          | 1/33645 [00:01<18:00:34,  1.93s/it]


Processing Seabed_substrate_1M_Sep2023 - Data_Holder


Processing features of Seabed_substrate_1M_Sep2023 - Data_Holder: 100%|██████████| 33645/33645 [00:03<00:00, 10695.55it/s]
Rasterizing:   0%|          | 1/33645 [00:02<21:04:40,  2.26s/it]


Processing Seabed_substrate_1M_Sep2023 - Contact


Processing features of Seabed_substrate_1M_Sep2023 - Contact: 100%|██████████| 33645/33645 [00:03<00:00, 10075.84it/s]
Rasterizing:   0%|          | 1/33645 [00:02<18:47:00,  2.01s/it]


Processing Seabed_substrate_1M_Sep2023 - Scale


Processing features of Seabed_substrate_1M_Sep2023 - Scale: 100%|██████████| 33645/33645 [00:03<00:00, 10276.32it/s]
Rasterizing:   0%|          | 1/33645 [00:01<17:52:19,  1.91s/it]


Processing Seabed_substrate_1M_Sep2023 - Original_Scale


Processing features of Seabed_substrate_1M_Sep2023 - Original_Scale: 100%|██████████| 33645/33645 [00:03<00:00, 9207.93it/s] 
Rasterizing:   0%|          | 1/33645 [00:01<18:01:22,  1.93s/it]


Processing Seabed_substrate_1M_Sep2023 - Original_Grain_Size


Processing features of Seabed_substrate_1M_Sep2023 - Original_Grain_Size: 100%|██████████| 33645/33645 [00:03<00:00, 9822.53it/s] 
Rasterizing:   0%|          | 1/33645 [00:02<21:18:47,  2.28s/it]


Processing Seabed_substrate_1M_Sep2023 - Mapping_Method


Processing features of Seabed_substrate_1M_Sep2023 - Mapping_Method: 100%|██████████| 33645/33645 [00:03<00:00, 10033.77it/s]
Rasterizing:   0%|          | 1/33645 [00:01<17:20:54,  1.86s/it]


Processing Seabed_substrate_1M_Sep2023 - References


Processing features of Seabed_substrate_1M_Sep2023 - References: 100%|██████████| 33645/33645 [00:03<00:00, 9747.00it/s] 
Rasterizing:   0%|          | 1/33645 [00:01<17:22:17,  1.86s/it]


Processing Seabed_substrate_1M_Sep2023 - Comments


Processing features of Seabed_substrate_1M_Sep2023 - Comments: 100%|██████████| 33645/33645 [00:03<00:00, 10026.67it/s]
Rasterizing:   0%|          | 1/33645 [00:01<17:50:00,  1.91s/it]


Processing Seabed_substrate_1M_Sep2023 - Reclassification


Processing features of Seabed_substrate_1M_Sep2023 - Reclassification: 100%|██████████| 33645/33645 [00:03<00:00, 9892.20it/s] 
Rasterizing:   0%|          | 1/33645 [00:01<17:11:05,  1.84s/it]


Processing Seabed_substrate_1M_Sep2023 - Method


Processing features of Seabed_substrate_1M_Sep2023 - Method: 100%|██████████| 33645/33645 [00:03<00:00, 9601.68it/s] 
Rasterizing:   0%|          | 1/33645 [00:01<17:55:27,  1.92s/it]


Processing Seabed_substrate_1M_Sep2023 - Sample_number


Processing features of Seabed_substrate_1M_Sep2023 - Sample_number: 100%|██████████| 33645/33645 [00:03<00:00, 10484.89it/s]
Rasterizing:   0%|          | 1/33645 [00:01<17:23:39,  1.86s/it]


Processing Seabed_substrate_1M_Sep2023 - Original_substrate


Processing features of Seabed_substrate_1M_Sep2023 - Original_substrate: 100%|██████████| 33645/33645 [00:03<00:00, 9323.22it/s] 
Rasterizing:   0%|          | 1/33645 [00:01<17:35:02,  1.88s/it]


Processing Seabed_substrate_1M_Sep2023 - Relation


Processing features of Seabed_substrate_1M_Sep2023 - Relation: 100%|██████████| 33645/33645 [00:03<00:00, 9705.02it/s] 
Rasterizing:   0%|          | 1/33645 [00:01<17:37:56,  1.89s/it]


Processing Seabed_substrate_1M_Sep2023 - Folk_16cl


Processing features of Seabed_substrate_1M_Sep2023 - Folk_16cl: 100%|██████████| 33645/33645 [00:03<00:00, 10520.65it/s]
Rasterizing:   0%|          | 1/33645 [00:02<22:02:23,  2.36s/it]


Processing Seabed_substrate_1M_Sep2023 - Folk_16cl_txt


Processing features of Seabed_substrate_1M_Sep2023 - Folk_16cl_txt: 100%|██████████| 33645/33645 [00:03<00:00, 9133.11it/s] 
Rasterizing:   0%|          | 1/33645 [00:01<18:20:25,  1.96s/it]


Processing Seabed_substrate_1M_Sep2023 - Folk_7cl


Processing features of Seabed_substrate_1M_Sep2023 - Folk_7cl: 100%|██████████| 33645/33645 [00:03<00:00, 8775.83it/s] 
Rasterizing:   0%|          | 1/33645 [00:01<18:25:26,  1.97s/it]


Processing Seabed_substrate_1M_Sep2023 - Folk_7cl_txt


Processing features of Seabed_substrate_1M_Sep2023 - Folk_7cl_txt: 100%|██████████| 33645/33645 [00:03<00:00, 10013.97it/s]
Rasterizing:   0%|          | 1/33645 [00:01<17:52:57,  1.91s/it]


Processing Seabed_substrate_1M_Sep2023 - Folk_5cl


Processing features of Seabed_substrate_1M_Sep2023 - Folk_5cl: 100%|██████████| 33645/33645 [00:03<00:00, 9619.27it/s] 
Rasterizing:   0%|          | 1/33645 [00:01<18:11:51,  1.95s/it]


Processing Seabed_substrate_1M_Sep2023 - Folk_5cl_txt


Processing features of Seabed_substrate_1M_Sep2023 - Folk_5cl_txt: 100%|██████████| 33645/33645 [00:03<00:00, 9145.59it/s] 
Rasterizing:   0%|          | 1/33645 [00:01<18:01:40,  1.93s/it]


Processing Seabed_substrate_1M_Sep2023 - Surface_feature


Processing features of Seabed_substrate_1M_Sep2023 - Surface_feature: 100%|██████████| 33645/33645 [00:03<00:00, 10299.56it/s]
Rasterizing:   0%|          | 1/33645 [00:02<21:31:13,  2.30s/it]


Processing Seabed_substrate_1M_Sep2023 - Surface_feature_Group


Processing features of Seabed_substrate_1M_Sep2023 - Surface_feature_Group: 100%|██████████| 33645/33645 [00:03<00:00, 9875.52it/s] 
Rasterizing:   0%|          | 1/33645 [00:01<18:20:12,  1.96s/it]


Processing Seabed_substrate_1M_Sep2023 - Conf_RS


Processing features of Seabed_substrate_1M_Sep2023 - Conf_RS: 100%|██████████| 33645/33645 [00:03<00:00, 10057.75it/s]
Rasterizing:   0%|          | 1/33645 [00:01<18:04:07,  1.93s/it]


Processing Seabed_substrate_1M_Sep2023 - Conf_S


Processing features of Seabed_substrate_1M_Sep2023 - Conf_S: 100%|██████████| 33645/33645 [00:03<00:00, 9506.80it/s] 
Rasterizing:   0%|          | 1/33645 [00:01<17:55:49,  1.92s/it]


Processing Seabed_substrate_1M_Sep2023 - Conf_D


Processing features of Seabed_substrate_1M_Sep2023 - Conf_D: 100%|██████████| 33645/33645 [00:03<00:00, 9134.10it/s] 
Rasterizing:   0%|          | 1/33645 [00:01<18:30:04,  1.98s/it]


Processing Seabed_substrate_1M_Sep2023 - Conf_TOT


Processing features of Seabed_substrate_1M_Sep2023 - Conf_TOT: 100%|██████████| 33645/33645 [00:03<00:00, 10001.35it/s]
Rasterizing:   0%|          | 1/33645 [00:02<22:35:16,  2.42s/it]


Processing Seabed_substrate_1M_Sep2023 - SHAPE_Length


Processing features of Seabed_substrate_1M_Sep2023 - SHAPE_Length: 100%|██████████| 33645/33645 [00:03<00:00, 9043.96it/s] 
Rasterizing:   0%|          | 1/33645 [00:01<18:33:40,  1.99s/it]


Processing Seabed_substrate_1M_Sep2023 - SHAPE_Area


Processing features of Seabed_substrate_1M_Sep2023 - SHAPE_Area: 100%|██████████| 33645/33645 [00:03<00:00, 10072.08it/s]
Rasterizing:   0%|          | 1/33645 [00:01<18:20:39,  1.96s/it]


<xarray.Dataset> Size: 14GB
Dimensions:                (latitude: 7395, longitude: 15675)
Coordinates:
  * latitude               (latitude) float64 59kB 7.903 7.913 ... 81.84 81.85
  * longitude              (longitude) float64 125kB -88.74 -88.73 ... 68.0
Data variables: (12/30)
    Code                   (latitude, longitude) float32 464MB dask.array<chunksize=(3704, 7840), meta=np.ndarray>
    Country                (latitude, longitude) float32 464MB dask.array<chunksize=(3704, 7840), meta=np.ndarray>
    Name                   (latitude, longitude) float32 464MB dask.array<chunksize=(3704, 7840), meta=np.ndarray>
    Data_Holder            (latitude, longitude) float32 464MB dask.array<chunksize=(3704, 7840), meta=np.ndarray>
    Contact                (latitude, longitude) float32 464MB dask.array<chunksize=(3704, 7840), meta=np.ndarray>
    Scale                  (latitude, longitude) float32 464MB dask.array<chunksize=(3704, 7840), meta=np.ndarray>
    ...                     
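
Categorical variables are stored as numeric codes, so reading them back requires inverting the `categorical_encoding` mapping. The sketch below shows one way to do this with NumPy; the `decode` helper, the example mapping, and the `background` label are assumptions for illustration:

```python
import numpy as np

def decode(raster, encoding, background="no data"):
    """Translate numeric codes back to labels using an inverted mapping."""
    inverse = {code: label for label, code in encoding.items()}
    flat = [inverse.get(int(code), background) for code in np.asarray(raster).ravel()]
    return np.array(flat, dtype=object).reshape(np.shape(raster))

encoding = {"None": 1, "Mud": 2, "Sand": 3}            # hypothetical mapping
raster = np.array([[0, 1], [3, 2]], dtype=np.float32)  # 0 = not covered by any polygon
decoded = decode(raster, encoding)
print(decoded)
```

Codes that are absent from the mapping, such as the 0 used for uncovered cells, fall back to the `background` label.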