# Downscaled bioclimatic indicators

Available for Europe, Central Africa and Northern Brazil

Alternative name: sis-biodiversity-era5-regional

**Information on Dataset:**
* Source: [Downscaled Bioclimatic Indicators](https://cds.climate.copernicus.eu/datasets/sis-biodiversity-era5-regional?tab=overview)
* Author:
* Resolution: 1 km x 1 km
* Notebook Version: 1.1 (Updated: December 17, 2024)

## 1. Specifying the paths and working directories

In [1]:
import os

''' ---- Specify the directories here ---- '''
download_folder = r".\data\sis-biodiversity-era5-regional\download"
working_folder = r".\data\sis-biodiversity-era5-regional\working"
geotiff_folder = r".\data\sis-biodiversity-era5-regional\geotiff"
csv_folder = r".\data\sis-biodiversity-era5-regional\csv"
output_folder = r".\data\sis-biodiversity-era5-regional\output"
''' ----- End of inputs ---- '''

os.makedirs(download_folder, exist_ok=True)
os.makedirs(working_folder, exist_ok=True)
os.makedirs(geotiff_folder, exist_ok=True)
os.makedirs(csv_folder, exist_ok=True)
os.makedirs(output_folder, exist_ok=True)

## 2. Download and Extract Dataset

### 2.1 Authentication

In [2]:
import cdsapi

def main():
    api_key = "fdae60fd-35d4-436f-825c-c63fedab94a4"
    api_url = "https://cds.climate.copernicus.eu/api"
    client = cdsapi.Client(url=api_url, key=api_key)
    return client
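
A key written directly into a notebook is easy to leak when the notebook is shared. The following is a minimal sketch of reading the credentials from environment variables instead. `load_cds_credentials` is a hypothetical helper that is not part of this notebook, and it assumes the user has set `CDSAPI_KEY` (and optionally `CDSAPI_URL`) beforehand.

```python
import os

DEFAULT_CDS_URL = "https://cds.climate.copernicus.eu/api"


def load_cds_credentials(env=None):
    """Return (url, key) read from environment variables (hypothetical helper)."""
    # Allow a plain dict to be passed in, which makes the helper easy to test
    env = os.environ if env is None else env
    url = env.get("CDSAPI_URL", DEFAULT_CDS_URL)
    key = env.get("CDSAPI_KEY")
    if not key:
        raise RuntimeError("CDSAPI_KEY is not set")
    return url, key
```

The returned pair could then be passed on as `cdsapi.Client(url=url, key=key)`, mirroring the call in `main()`.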

### 2.2 Request Definition and Download

In [3]:
# Define additional request fields to ensure the request stays within the file size limit.
# These coordinates were obtained using the BBox Extractor tool:
# https://str-ucture.github.io/bbox-extractor/

# Bounding boxes in [North, West, South, East] order (WGS84)
bbox_wgs84_deutschland = [56.0, 5.8, 47.2, 15.0]
bbox_wgs84_konstanz = [47.9, 8.9, 47.6, 9.3]

# Alternatively, use a shapefile for precise geographic filtering
import geopandas as gpd
import math

# Example: Load shapefile of Germany (WGS84 projection)
de_shapefile = r"./shapefiles/de_boundary.shp"
de_gdf = gpd.read_file(de_shapefile)
de_bounds = de_gdf.total_bounds

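# Snap the bounds outward to 0.1 degrees and add a 0.1 degree buffer on every side.
# total_bounds is ordered [min_lon, min_lat, max_lon, max_lat].
de_bounds_adjusted = [(math.floor(de_bounds[0]* 10)/10)-0.1,
                      (math.floor(de_bounds[1]* 10)/10)-0.1,
                      (math.ceil(de_bounds[2]* 10)/10)+0.1,
                      (math.ceil(de_bounds[3]* 10)/10)+0.1]

# Reorder to [North, West, South, East]
bbox_de_bounds_adjusted = [de_bounds_adjusted[3], de_bounds_adjusted[0],
                           de_bounds_adjusted[1], de_bounds_adjusted[2]]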
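
The snapping, buffering and reordering steps can also be wrapped in one reusable function. The sketch below is an illustration only: `bounds_to_nwse` and its parameters are hypothetical names, and it assumes the input follows the `[min_lon, min_lat, max_lon, max_lat]` order of `total_bounds` and that the output should follow the `[North, West, South, East]` order used by the bounding boxes in this notebook.

```python
import math


def bounds_to_nwse(bounds, buffer_deg=0.1, decimals=1):
    """Snap bounds outward, add a buffer, and return [North, West, South, East]."""
    min_lon, min_lat, max_lon, max_lat = bounds
    scale = 10 ** decimals

    # Round the minimum edges down and the maximum edges up, then widen the box
    west = math.floor(min_lon * scale) / scale - buffer_deg
    south = math.floor(min_lat * scale) / scale - buffer_deg
    east = math.ceil(max_lon * scale) / scale + buffer_deg
    north = math.ceil(max_lat * scale) / scale + buffer_deg

    # Rounding removes floating-point noise such as 5.699999999999999
    return [round(value, decimals) for value in (north, west, south, east)]


# Illustrative bounds only, not read from a shapefile
print(bounds_to_nwse([5.87, 47.27, 15.04, 55.06]))
```

With a GeoDataFrame at hand, the call would be `bounds_to_nwse(de_gdf.total_bounds)`.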

In [4]:
# Origins currently available per region:
#   Europe:          ERA5 only
#   Central Africa:  ERA5
#   Northern Brazil: ERA5-Land

## Variable group: Bioclimatic indicators as in WORLDCLIM
# cds.climate.copernicus.eu/datasets/sis-biodiversity-era5-regional?tab=download
variable_group = "bioclimatic_indicators_as_in_worldclim"

In [5]:
dataset = "sis-biodiversity-era5-regional"
request = {
    "region": ["europe"],
    "origin": "era5",
    "variable": [
        "annual_mean_temperature",
        "mean_diurnal_range",
        "isothermality",
        "temperature_seasonality",
        "maximum_temperature_of_warmest_month",
        "minimum_temperature_of_coldest_month",
        "temperature_annual_range",
        "mean_temperature_of_wettest_quarter",
        "mean_temperature_of_driest_quarter",
        "mean_temperature_of_warmest_quarter",
        "mean_temperature_of_coldest_quarter",
        "annual_precipitation",
        "precipitation_of_wettest_month",
        "precipitation_of_driest_month",
        "precipitation_seasonality",
        "precipitation_of_wettest_quarter",
        "precipitation_of_driest_quarter",
        "precipitation_of_warmest_quarter",
        "precipitation_of_coldest_quarter"
    ],
    "statistic": [
        "mean",
        "median",
        "25th_quartile",
        "75th_quartile"
    ],
    "version": ["1_0"]
}

In [6]:
# Run this cell to download the dataset:
def main_retrieve():
    dataset_filename = f"{dataset}_{variable_group}.zip"
    dataset_filepath = os.path.join(download_folder, dataset_filename)

    # Download the dataset only if the dataset has not been downloaded before
    if not os.path.isfile(dataset_filepath):
        # Download the dataset with the defined request parameters
        client.retrieve(dataset, request, dataset_filepath)
    else:
        print("Dataset already downloaded.")

if __name__ == "__main__":
    client = main()
    main_retrieve()


Dataset already downloaded.


### 2.3 Extract the Zip folder

In [7]:
import zipfile

extract_folder = os.path.join(working_folder, variable_group)
os.makedirs(extract_folder, exist_ok=True)

# Extract the zip file
try:
    if not os.listdir(extract_folder):
        dataset_filename = f"{dataset}_{variable_group}.zip"
        dataset_filepath = os.path.join(download_folder, dataset_filename)
        
        with zipfile.ZipFile(dataset_filepath, 'r') as zip_ref:
            zip_ref.extractall(extract_folder)
            print(f"Successfully extracted files to: {extract_folder}")
    else:
        print("Folder is not empty. Skipping extraction.")
except FileNotFoundError:
    print(f"Error: The file {dataset_filepath} was not found.")
except zipfile.BadZipFile:
    print(f"Error: The file {dataset_filepath} is not a valid zip file.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Folder is not empty. Skipping extraction.
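
If only the netCDF files are needed, the archive can be inspected first and extracted selectively. The following is a minimal sketch using the standard-library `zipfile` module. `extract_netcdf_members` is a hypothetical helper, and the assumption that the relevant members end in `.nc` is taken from the file filter used in the next section.

```python
import os
import zipfile


def extract_netcdf_members(zip_path, target_dir):
    """Extract only the .nc members of an archive and return their names."""
    os.makedirs(target_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as archive:
        # namelist() returns every member path stored in the archive
        members = [name for name in archive.namelist() if name.endswith(".nc")]
        archive.extractall(target_dir, members=members)
    return members
```

In this notebook the call would be `extract_netcdf_members(dataset_filepath, extract_folder)`.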


## 3. Read the netCDF file and print the metadata

In [30]:
import re
import pandas as pd
import netCDF4 as nc
import numpy as np

def meta(filename):
    match = re.search(r"(?P<ds_variable>BIO\d{2})_(?P<ds_origin>[\w-]+)-(?P<statistic>mean|median|q25|q75)_v(?P<version>\d+\.\d+)", filename)
    if not match:
        raise ValueError("the given filename does not fit the expected naming scheme")
    
    def get_nc_variable():
        with nc.Dataset(os.path.join(extract_folder, filename), 'r') as nc_dataset:
            nc_variable_name_list = nc_dataset.variables.keys()
            # Primary variable
            # In CDS dataset there is usually just 1 primary variable per dataset
            # Modify index based on the index of primary variable in nc_variable_name_list
            primary_variable_index = 3
            primary_variable = [*nc_variable_name_list][primary_variable_index]
            primary_variable_shape = np.shape(nc_dataset[primary_variable])
            
            return primary_variable, primary_variable_shape
    
    # Open the netCDF file only once and reuse both results
    variable_name, variable_shape = get_nc_variable()

    return dict(
        filename=filename,
        path=os.path.join(extract_folder, filename),
        ds_variable=match.group('ds_variable'),
        ds_origin=match.group('ds_origin'),
        variable_name=variable_name,
        variable_shape=variable_shape,
        statistic=match.group('statistic')
    )

# Example directory (adapt to your environment)
nc_files = [meta(f) for f in os.listdir(extract_folder) if f.endswith('.nc')]
df_nc_files = pd.DataFrame.from_dict(nc_files)

# Modify pandas display options
pd.options.display.max_colwidth = 30

# Display the DataFrame without displaying path
df_nc_files.loc[:, df_nc_files.columns != 'path']

| | filename | ds_variable | ds_origin | variable_name | variable_shape | statistic |
|---|---|---|---|---|---|---|
| 0 | BIO01_era5-to-1km_1979-201... | BIO01 | era5-to-1km_1979-2018 | BIO01 | (1, 4800, 9600) | mean |
| 1 | BIO01_era5-to-1km_1979-201... | BIO01 | era5-to-1km_1979-2018 | BIO01 | (1, 4800, 9600) | median |
| 2 | BIO01_era5-to-1km_1979-201... | BIO01 | era5-to-1km_1979-2018 | BIO01 | (1, 4800, 9600) | q25 |
| 3 | BIO01_era5-to-1km_1979-201... | BIO01 | era5-to-1km_1979-2018 | BIO01 | (1, 4800, 9600) | q75 |
| 4 | BIO02_era5-to-1km_1979-201... | BIO02 | era5-to-1km_1979-2018 | BIO02 | (1, 4800, 9600) | mean |
| ... | ... | ... | ... | ... | ... | ... |
| 71 | BIO18_era5-to-1km_1979-201... | BIO18 | era5-to-1km_1979-2018 | BIO18 | (1, 4800, 9600) | q75 |
| 72 | BIO19_era5-to-1km_1979-201... | BIO19 | era5-to-1km_1979-2018 | BIO19 | (1, 4800, 9600) | mean |
| 73 | BIO19_era5-to-1km_1979-201... | BIO19 | era5-to-1km_1979-2018 | BIO19 | (1, 4800, 9600) | median |
| 74 | BIO19_era5-to-1km_1979-201... | BIO19 | era5-to-1km_1979-2018 | BIO19 | (1, 4800, 9600) | q25 |
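
To see how the filename pattern splits a name into its parts without opening any file, the regular expression from `meta()` can be tried on its own. The sketch below reuses that pattern unchanged. `parse_filename` is a hypothetical helper, and the sample filename is the one that appears in the commented-out cell of section 4.

```python
import re

# Same pattern as in meta(), compiled once for reuse
FILENAME_PATTERN = re.compile(
    r"(?P<ds_variable>BIO\d{2})_(?P<ds_origin>[\w-]+)"
    r"-(?P<statistic>mean|median|q25|q75)_v(?P<version>\d+\.\d+)"
)


def parse_filename(filename):
    """Return the named groups of the filename pattern as a dict."""
    match = FILENAME_PATTERN.search(filename)
    if match is None:
        raise ValueError("the given filename does not fit the expected naming scheme")
    return match.groupdict()


print(parse_filename("BIO01_era5-to-1km_1979-2018-mean_v1.0.nc"))
```

Because `[\w-]+` is greedy, the origin group first consumes as much as it can and then backtracks to the last hyphen that is followed by one of the four statistic names.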


### 3.1 Print the list and summary of unique variables

In [36]:
seen_variables = set()
for i, nc_file in enumerate(nc_files):
    variable_name = nc_file['variable_name']
    
    if variable_name in seen_variables:
        continue

    # Open the NetCDF file in read mode and list all variables in the dataset;
    # the context manager closes the file again afterwards
    with nc.Dataset(nc_file['path'], mode='r') as nc_dataset:
        variables_list = list(nc_dataset.variables.keys())
    print(f"{i+1:<2} {variable_name:<8}: Available variables: {list(variables_list)}")
    
    # Add the variable name to the seen set
    seen_variables.add(variable_name)

1  BIO01   : Available variables: ['latitude', 'longitude', 'time', 'BIO01']
5  BIO02   : Available variables: ['latitude', 'longitude', 'time', 'BIO02']
9  BIO03   : Available variables: ['latitude', 'longitude', 'time', 'BIO03']
13 BIO04   : Available variables: ['latitude', 'longitude', 'time', 'BIO04']
17 BIO05   : Available variables: ['latitude', 'longitude', 'time', 'BIO05']
21 BIO06   : Available variables: ['latitude', 'longitude', 'time', 'BIO06']
25 BIO07   : Available variables: ['latitude', 'longitude', 'time', 'BIO07']
29 BIO08   : Available variables: ['latitude', 'longitude', 'time', 'BIO08']
33 BIO09   : Available variables: ['latitude', 'longitude', 'time', 'BIO09']
37 BIO10   : Available variables: ['latitude', 'longitude', 'time', 'BIO10']
41 BIO11   : Available variables: ['latitude', 'longitude', 'time', 'BIO11']
45 BIO12   : Available variables: ['latitude', 'longitude', 'time', 'BIO12']
49 BIO13   : Available variables: ['latitude', 'longitude', 'time', 'BIO13']

In [37]:
nc_file = nc_files[0]

rows = []
# Open the first file and collect the unit and shape of every variable;
# the context manager closes the file again afterwards
with nc.Dataset(nc_file['path'], mode='r') as nc_dataset:
    for test_var in nc_dataset.variables.keys():
        try:
            var_obj = nc_dataset.variables[test_var]
            rows.append({
                "nc_variables": test_var,
                "unit": getattr(var_obj, 'units', 'N/A'),
                "shape": var_obj.shape
            })
        except Exception as e:
            print(f"Error processing variable {test_var}: {e}")

# Create a DataFrame
df = pd.DataFrame(rows)
df

| | nc_variables | unit | shape |
|---|---|---|---|
| 0 | latitude | degrees_north | (4800,) |
| 1 | longitude | degrees_east | (9600,) |
| 2 | time | days since 1999-01-01 | (1,) |
| 3 | BIO01 | K | (1, 4800, 9600) |
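
The listing shows why `meta()` uses index 3 for the primary variable: the three coordinate variables come first. As an alternative to a fixed index, the primary variable could be chosen by its dimensions. The sketch below is a hypothetical helper, not the notebook's method. It assumes that coordinate variables are named after their own single dimension and that the data variable has the dimensions `('time', 'latitude', 'longitude')`, which is inferred from the shapes above.

```python
def pick_primary_variable(dimensions_by_name):
    """Return the non-coordinate variable with the most dimensions."""
    # A coordinate variable such as 'latitude' lists itself as its dimension
    data_vars = {
        name: dims
        for name, dims in dimensions_by_name.items()
        if name not in dims
    }
    if not data_vars:
        raise ValueError("no data variable found")
    return max(data_vars, key=lambda name: len(data_vars[name]))


# Dimension names assumed from the shapes listed above
example = {
    "latitude": ("latitude",),
    "longitude": ("longitude",),
    "time": ("time",),
    "BIO01": ("time", "latitude", "longitude"),
}
print(pick_primary_variable(example))
```

For an open `netCDF4` dataset, the input mapping could be built as `{name: var.dimensions for name, var in nc_dataset.variables.items()}`.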


### 3.2 Print summary of primary variable

In [35]:
seen_variables = set()
for i, nc_file in enumerate(nc_files):
    variable_name = nc_file['variable_name']
    
    if variable_name in seen_variables:
        continue
    
    # Read the attributes inside a context manager so the file is closed again
    with nc.Dataset(nc_file['path'], mode='r') as nc_dataset:
        variable_data = nc_dataset[variable_name]

        # Generate summary of the primary variable
        summary = {
            "Variable Name": variable_name,
            "Data Type": variable_data.dtype,
            "Shape": variable_data.shape,
            "Variable Info": f"{variable_data.dimensions}",
            "Units": getattr(variable_data, "units", "N/A"),
            "Long Name": getattr(variable_data, "long_name", "N/A"),
        }
    
    # Display dataset summary as a DataFrame for better visualization
    nc_summary = pd.DataFrame(list(summary.items()), columns=['Description', 'Remarks'])

    # Display the summary DataFrame
    print(f"{i+1}.")
    display(nc_summary)
    
    # Add the variable name to the seen set
    seen_variables.add(variable_name)
    if len(seen_variables)>=2:
        print("....")
        break

1.


| | Description | Remarks |
|---|---|---|
| 0 | Variable Name | BIO01 |
| 1 | Data Type | float32 |
| 2 | Shape | (1, 4800, 9600) |
| 3 | Variable Info | ('time', 'latitude', 'long... |
| 4 | Units | K |
| 5 | Long Name | Temperature annual mean |


5.


| | Description | Remarks |
|---|---|---|
| 0 | Variable Name | BIO02 |
| 1 | Data Type | float32 |
| 2 | Shape | (1, 4800, 9600) |
| 3 | Variable Info | ('time', 'latitude', 'long... |
| 4 | Units | K |
| 5 | Long Name | Mean diurnal range (mean o... |


....


## 4. Export Dataset to CSV

In [None]:
# import xarray as xr
# nc_filepath = os.path.join(extract_folder, 'BIO01_era5-to-1km_1979-2018-mean_v1.0.nc')
# variable_name = 'BIO01'

# # Open the NetCDF dataset using xarray
# with xr.open_dataset(nc_filepath) as nc_dataset:
#     variable_data = nc_dataset[variable_name]

#     # Apply geographic filtering using the defined bounding box
#     filtered_data = variable_data.where(
#         (nc_dataset['longitude'] >= de_bounds_adjusted[0]) & (nc_dataset['longitude'] <= de_bounds_adjusted[2]) &
#         (nc_dataset['latitude'] >= de_bounds_adjusted[1]) & (nc_dataset['latitude'] <= de_bounds_adjusted[3]),
#         drop=True
#     )
#     # Convert the filtered data into a DataFrame
#     filtered_df = filtered_data.to_dataframe().reset_index()

# # Modify display format for numbers in the DataFrames
# pd.options.display.float_format = '{:,.2f}'.format

# # Display the filtered DataFrame
# filtered_df

In [45]:
def netcdf_to_dataframe(nc_file, bounding_box=None):
    """
    Converts a netCDF file to a DataFrame, optionally filtering by a bounding box.
    
    Parameters:
        nc_file (dict): File metadata as returned by meta(); must contain 'path' and 'variable_name'.
        bounding_box (list): Bounding box as [lon_min, lat_min, lon_max, lat_max] (optional).
        
    Returns:
        pd.DataFrame: DataFrame with time, latitude, longitude, and the variable's values.
    """
    # Open the netCDF file
    with nc.Dataset(nc_file['path'], 'r') as nc_dataset:
        lon = nc_dataset['longitude'][:]
        lat = nc_dataset['latitude'][:]
        
        # Extract time and convert it to readable dates
        time_var = nc_dataset.variables['time']
        time_units = time_var.units
        time_calendar = getattr(time_var, "calendar", "standard")
        cftime = nc.num2date(time_var[:], units=time_units, calendar=time_calendar)
        
        # Filter by bounding box if provided
        if bounding_box:
            lon_min, lat_min, lon_max, lat_max = bounding_box
            lat_mask = (lat >= lat_min) & (lat <= lat_max)
            lon_mask = (lon >= lon_min) & (lon <= lon_max)

            lat_indices = np.where(lat_mask)[0]
            lon_indices = np.where(lon_mask)[0]
        else:
            lat_indices = slice(None)
            lon_indices = slice(None)

        filtered_lat = lat[lat_indices]
        filtered_lon = lon[lon_indices]
        
        # Extract variable data, slicing as needed
        variable_data_subset = nc_dataset.variables[nc_file['variable_name']][..., lat_indices, lon_indices]
        
    # Build the name of the value column from the metadata keys that are present,
    # so that the function can be reused for other datasets
    if 'variable_name' in nc_file and 'rcp' in nc_file and 'rcp_statistic' in nc_file:
        variable_column_name = f"{nc_file['variable_name']}_{nc_file['rcp']}_{nc_file['rcp_statistic']}"
    elif 'variable_name' in nc_file and 'statistic' in nc_file:
        variable_column_name = f"{nc_file['variable_name']}_{nc_file['statistic']}"
    elif 'variable_name' in nc_file:
        variable_column_name = f"{nc_file['variable_name']}"
    else:
        variable_column_name = None
        print("The required keys are missing in the 'nc_file' dictionary.")

    # Create rows for the DataFrame
    rows = []
    for t in range(variable_data_subset.shape[0]):
        for i in range(variable_data_subset.shape[1]):
            for j in range(variable_data_subset.shape[2]):
                if not np.ma.is_masked(variable_data_subset[t, i, j]):
                    rows.append({
                        'time': cftime[t],
                        'latitude': filtered_lat[i],
                        'longitude': filtered_lon[j],
                        variable_column_name: variable_data_subset[t, i, j]
                    })
                
    # Create a DataFrame from the rows
    df = pd.DataFrame(rows)
    df['time'] = pd.to_datetime(df['time'].map(str))
    df['latitude'] = pd.to_numeric(df['latitude'])
    df['longitude'] = pd.to_numeric(df['longitude'])
    df[variable_column_name] = pd.to_numeric(df[variable_column_name])
    
    # Set the index to time, latitude, and longitude
    return df.set_index(['time', 'latitude', 'longitude'])

In [46]:
# nc_file = nc_files[0]
# netcdf_to_dataframe(nc_file)

| time | latitude | longitude | BIO01_mean |
|---|---|---|---|
| 1999-01-01 | 32.004167 | -29.995833 | 293.614014 |
| 1999-01-01 | 32.004167 | -29.987500 | 293.613007 |
| 1999-01-01 | 32.004167 | -29.979167 | 293.612030 |
| 1999-01-01 | 32.004167 | -29.970833 | 293.611053 |
| 1999-01-01 | 32.004167 | -29.962500 | 293.610077 |
| 1999-01-01 | ... | ... | ... |
| 1999-01-01 | 71.995833 | 49.962500 | 271.485382 |
| 1999-01-01 | 71.995833 | 49.970833 | 271.481079 |
| 1999-01-01 | 71.995833 | 49.979167 | 271.476807 |
| 1999-01-01 | 71.995833 | 49.987500 | 271.472534 |
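
On the full 4800 x 9600 grid, the three nested Python loops in `netcdf_to_dataframe` visit about 46 million cells one by one, which is slow. The sketch below shows how the same flattening could be expressed with NumPy index arrays. `flatten_unmasked` is a hypothetical helper and not a drop-in replacement for the function above; it assumes a 3-D `(time, latitude, longitude)` array and returns plain columns that can be passed to `pd.DataFrame`.

```python
import numpy as np


def flatten_unmasked(values, times, lats, lons):
    """Return flat columns for every unmasked cell of a (time, lat, lon) array."""
    values = np.ma.asarray(values)

    # Indices of all unmasked cells, in the same row-major order as nested loops
    t_idx, i_idx, j_idx = np.nonzero(~np.ma.getmaskarray(values))

    return {
        "time": np.asarray(times)[t_idx],
        "latitude": np.asarray(lats)[i_idx],
        "longitude": np.asarray(lons)[j_idx],
        "value": values.data[t_idx, i_idx, j_idx],
    }


# Tiny illustrative grid with one masked cell
demo = np.ma.masked_array(
    [[[1.0, 2.0], [3.0, 4.0]]],
    mask=[[[False, True], [False, False]]],
)
columns = flatten_unmasked(demo, ["t0"], [10.0, 20.0], [1.0, 2.0])
print(columns["value"])
```

The `"value"` key would be renamed to the column name built from the file metadata, for example `BIO01_mean`.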


### 4.1 Create DataFrame and Export as merged CSV file