# SIS Health Vector

Suitability for the occurrence and seasonal activity of the tiger mosquito (Aedes albopictus) in Europe

This script processes the **SIS Health Vector** dataset from the Copernicus Climate Data Store. The dataset describes how suitable the environmental conditions are for the tiger mosquito and how long its seasonal activity lasts. It was developed as part of the C3S European Health Service. The information is available for several future time periods and climate change scenarios.

**Information on the dataset**:

* Source: [SIS Health Vector](https://cds.climate.copernicus.eu/datasets/sis-health-vector?tab=overview)
* Author: T. Tewes (Stadt Konstanz) 
* Resolution: 0.1° x 0.1°
* Notebook-Version: 1.1 (Updated: December 02, 2024)

## 1. Specifying the paths and working directories

In [1]:
import os

''' ---- Specify the directories here ---- '''
download_folder = r".\data\sis-health-vector\download"
working_folder = r".\data\sis-health-vector\working"
geotiff_folder = r".\data\sis-health-vector\geotiff"
csv_folder = r".\data\sis-health-vector\csv"
output_folder = r".\data\sis-health-vector\output"
''' ----- End of inputs ---- '''

os.makedirs(download_folder, exist_ok=True)
os.makedirs(working_folder, exist_ok=True)
os.makedirs(geotiff_folder, exist_ok=True)
os.makedirs(csv_folder, exist_ok=True)
os.makedirs(output_folder, exist_ok=True)

## 2. Download and Extract Dataset

### 2.1 API Authentication

In [2]:
import cdsapi

def main():
    api_key = "fdae60fd-35d4-436f-825c-c63fedab94a4"
    api_url = "https://cds.climate.copernicus.eu/api"
    client = cdsapi.Client(url=api_url, key=api_key)
    return client
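
Hardcoding the key as in the cell above makes it easy to leak when the notebook is shared. The sketch below shows one possible alternative: reading the credentials from environment variables and passing them to `cdsapi.Client` as before. The helper name `read_cds_credentials` is hypothetical, and the variable names `CDSAPI_URL` and `CDSAPI_KEY` are an assumption about your setup, not something this notebook defines.

```python
import os


def read_cds_credentials(env=os.environ):
    """Return (url, key) taken from environment variables.

    Assumption: the key is stored in CDSAPI_KEY and, optionally,
    the endpoint in CDSAPI_URL. Adjust the names to your setup.
    """
    url = env.get("CDSAPI_URL", "https://cds.climate.copernicus.eu/api")
    key = env.get("CDSAPI_KEY")
    if not key:
        raise RuntimeError("Set CDSAPI_KEY before creating the client.")
    return url, key
```

With this helper, `main()` could call `api_url, api_key = read_cds_credentials()` instead of assigning the literal strings.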

### 2.2 Request Definition and Download

In [3]:

# Define additional request fields to ensure the request stays within the file size limit.
# These coordinates were obtained using the BBox Extractor tool:
# https://str-ucture.github.io/bbox-extractor/

bbox_wgs84_deutschland = [56.0, 5.8, 47.2, 15.0]
bbox_wgs84_konstanz = [47.9, 8.9, 47.6, 9.3]

# Alternatively, use a shapefile for precise geographic filtering
import geopandas as gpd
import math

# Example: Load shapefile of Germany (WGS84 projection)
de_shapefile = r"./shapefiles/de_boundary.shp"
de_gdf = gpd.read_file(de_shapefile)
de_bounds = de_gdf.total_bounds

# Adjust and buffer
de_bounds_adjusted = [(math.floor(de_bounds[0]* 10)/10)-0.1,
                      (math.floor(de_bounds[1]* 10)/10)-0.1,
                      (math.ceil(de_bounds[2]* 10)/10)+0.1,
                      (math.ceil(de_bounds[3]* 10)/10)+0.1]

bbox_de_bounds_adjusted = [de_bounds_adjusted[3], de_bounds_adjusted[0],
                           de_bounds_adjusted[1], de_bounds_adjusted[2]]

bbox_de_bounds_adjusted

[55.2, 5.7, 47.1, 15.2]
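
The reordering above is easy to get wrong: `total_bounds` returns `[min_lon, min_lat, max_lon, max_lat]`, while the `area` field of the request is built here as `[north, west, south, east]`. The following sketch wraps the same snap-and-buffer logic in a small function without the geopandas dependency. The function name and the sample bounds are hypothetical and only chosen for illustration; they are not read from the shapefile.

```python
import math


def bounds_to_cds_area(bounds, buffer_deg=0.1):
    """Convert (min_lon, min_lat, max_lon, max_lat) to [north, west, south, east].

    Each edge is first snapped outwards to the 0.1 degree grid and then
    widened by buffer_deg, mirroring the notebook cell above.
    """
    min_lon, min_lat, max_lon, max_lat = bounds
    west = math.floor(min_lon * 10) / 10 - buffer_deg
    south = math.floor(min_lat * 10) / 10 - buffer_deg
    east = math.ceil(max_lon * 10) / 10 + buffer_deg
    north = math.ceil(max_lat * 10) / 10 + buffer_deg
    # Round to one decimal to remove floating-point noise such as 5.699999
    return [round(v, 1) for v in (north, west, south, east)]


# Illustrative bounds only (assumed values, not taken from the shapefile)
area = bounds_to_cds_area((5.87, 47.27, 15.04, 55.06))
```

Rounding the result keeps the values on the 0.1° grid that matches the dataset resolution.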

In [4]:
dataset = "sis-health-vector"
request = {
    "variable": [
        "suitability",
        "season_length"
    ],
    "experiment": [
        "rcp4_5",
        "rcp8_5"
    ],
    "ensemble_statistic": [
        "ensemble_members_average",
        "ensemble_members_standard_deviation"
    ],
    "area": bbox_de_bounds_adjusted
}

In [5]:
# Run this cell to download the dataset (the download is skipped if the ZIP file already exists):

def main_retrieve():
    dataset_filename = f"{dataset}.zip"
    dataset_filepath = os.path.join(download_folder, dataset_filename)

    # Download the dataset only if the dataset has not been downloaded before
    if not os.path.isfile(dataset_filepath):
        # Download the dataset with the defined request parameters
        client.retrieve(dataset, request, dataset_filepath)
    else:
        print("Dataset already downloaded.")

if __name__ == "__main__":
    client = main()
    main_retrieve()

Dataset already downloaded.


### 2.3 Extract the ZIP archive

In [6]:
import zipfile

dataset_filename = f"{dataset}.zip"
dataset_filepath = os.path.join(download_folder, dataset_filename)

# Extract the zip file
try:
    os.makedirs(working_folder, exist_ok=True)
    
    if not os.listdir(working_folder):
        with zipfile.ZipFile(dataset_filepath, 'r') as zip_ref:
            zip_ref.extractall(working_folder)
            print(f"Successfully extracted files to: {working_folder}")
    else:
        print("Folder is not empty. Skipping extraction.")
except FileNotFoundError:
    print(f"Error: The file {dataset_filepath} was not found.")
except zipfile.BadZipFile:
    print(f"Error: The file {dataset_filepath} is not a valid zip file.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Folder is not empty. Skipping extraction.


## 3. Read the netCDF file and print the metadata

In [7]:
import re
import pandas as pd

def meta(filename):
    match = re.search(r'mosquito_(suit|seas)_(rcp\d{2})_(\w+)_v(\d+\.\d+)\.', filename)
    if not match:
        raise ValueError("the given filename does not fit the expected naming scheme")
    
    var = match.group(1)
    return dict(
        filename=filename,
        path=os.path.join(working_folder, filename),
        variable=var,
        variable_name="season_length" if var == 'seas' else "suitability",
        rcp = match.group(2),
        statistic = match.group(3),
        version = match.group(4),
    )

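# Create DataFrame from the list of files inside the extracted directory.
# os.listdir() returns names in arbitrary order, so sort them to keep the
# row positions used in the following cells reproducible.
nc_files = [meta(f) for f in sorted(os.listdir(working_folder)) if f.endswith('.nc')]
df_nc_files = pd.DataFrame.from_dict(nc_files)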

# Modify pandas display options
pd.options.display.max_colwidth = 30

# Display the DataFrame
# df_nc_files
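
To see what the regular expression in `meta()` extracts, the sketch below applies the same pattern to a single file name. The file name `mosquito_seas_rcp45_mean_v1.0.nc` is an assumed example that follows the naming scheme; the actual version suffix in the downloaded archive may differ.

```python
import re

# Same pattern as in meta(): variable, RCP scenario, statistic, version
PATTERN = re.compile(r"mosquito_(suit|seas)_(rcp\d{2})_(\w+)_v(\d+\.\d+)\.")


def parse_name(filename):
    """Return the parsed fields as a dict, or None if the name does not match."""
    match = PATTERN.search(filename)
    if match is None:
        return None
    variable, rcp, statistic, version = match.groups()
    return {
        "variable": variable,
        "rcp": rcp,
        "statistic": statistic,
        "version": version,
    }


# Hypothetical file name used only for illustration
example = parse_name("mosquito_seas_rcp45_mean_v1.0.nc")
```

The greedy `(\w+)` group first swallows `mean_v1` and then backtracks until the literal `_v` can match, which is why the statistic ends up as `mean` and not `mean_v1`.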

### 3.1 For variable = 'seas'

In [8]:
import netCDF4 as nc

# Open the NetCDF file in read mode
nc_dataset = nc.Dataset(df_nc_files['path'][0], mode='r')

# List all variables in the dataset
variables_list = nc_dataset.variables.keys()
print(f"Available variables: {list(variables_list)}")

Available variables: ['season_length', 'height', 'lat', 'lon', 'time']


In [9]:
# Define variable name from available variables and read variable data
variable_name = 'season_length'
variable_data = nc_dataset[variable_name]

# Generate summary of the primary variable
summary = {
    "Variable Name": variable_name,
    "Data Type": variable_data.dtype,
    "Shape": variable_data.shape,
    "Variable Info": f"{variable_name}({', '.join(variable_data.dimensions)})",
    "Units": getattr(variable_data, "units", "N/A"),
    "Long Name": getattr(variable_data, "long_name", "N/A"),
}

# Display dataset summary as a DataFrame for better visualization
nc_summary = pd.DataFrame(list(summary.items()), columns=['Description', 'Remarks'])

# Display the summary DataFrame
nc_summary

| | Description | Remarks |
|---|---|---|
| 0 | Variable Name | season_length |
| 1 | Data Type | float32 |
| 2 | Shape | (100, 82, 95) |
| 3 | Variable Info | season_length(time, lat, lon) |
| 4 | Units | 1 |
| 5 | Long Name | Ensemble members average o... |


### 3.2 For variable = 'suit'

In [10]:
# Open the NetCDF file in read mode
nc_dataset = nc.Dataset(df_nc_files['path'][4], mode='r')

# List all variables in the dataset
variables_list = nc_dataset.variables.keys()
print(f"Available variables: {list(variables_list)}")

Available variables: ['suitability', 'height', 'lat', 'lon', 'time']


In [11]:
# Define variable name from available variables and read variable data
variable_name = 'suitability'
variable_data = nc_dataset[variable_name]

# Generate summary of the primary variable
summary = {
    "Variable Name": variable_name,
    "Data Type": variable_data.dtype,
    "Shape": variable_data.shape,
    "Variable Info": f"{variable_name}({', '.join(variable_data.dimensions)})",
    "Units": getattr(variable_data, "units", "N/A"),
    "Long Name": getattr(variable_data, "long_name", "N/A"),
}

# Display dataset summary as a DataFrame for better visualization
nc_summary = pd.DataFrame(list(summary.items()), columns=['Description', 'Remarks'])

# Display the summary DataFrame
nc_summary

| | Description | Remarks |
|---|---|---|
| 0 | Variable Name | suitability |
| 1 | Data Type | float32 |
| 2 | Shape | (100, 82, 95) |
| 3 | Variable Info | suitability(time, lat, lon) |
| 4 | Units | 1 |
| 5 | Long Name | Ensemble members average o... |


## 4. Export Dataset to CSV

In [12]:
import numpy as np
import netCDF4 as nc
from tqdm import tqdm

def netcdf_to_dataframe(
    nc_file,
    bounding_box=None):
    """
    Converts a netCDF file to a DataFrame, optionally filtering by a bounding box.

    Parameters:
        nc_file (dict): Dictionary with keys 'filename', 'path', 'variable', 'variable_name', 'rcp', 'statistic', 'version'.
        bounding_box (list): Bounding box as [lon_min, lat_min, lon_max, lat_max] (optional).

    Returns:
        pd.DataFrame: DataFrame with time, latitude, longitude, and the variable's values.
    """
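    # Open the netCDF file before entering the try block, so that the
    # finally clause never refers to an unassigned name if opening fails
    nc_dataset = nc.Dataset(nc_file['path'], 'r')
    try: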
        lon = nc_dataset['lon'][:]
        lat = nc_dataset['lat'][:]
        
        # Retrieve the variable name
        variable = nc_file['variable']
        if variable == 'suit':
            variable_name = 'suitability'
        elif variable == 'seas':
            variable_name = 'season_length'
        else:
            raise ValueError(f"Unexpected variable: {variable}")
        
        # Extract time variable and convert it to readable dates
        time_var = nc_dataset.variables['time']
        time_units = time_var.units
        time_calendar = getattr(time_var, "calendar", "standard")
        cftime = nc.num2date(time_var[:], units=time_units, calendar=time_calendar)

        # Extract the variable data (suitability or season length)
        temperature_data = nc_dataset.variables[variable_name][:]
        
        # Filter by bounding box if provided
        if bounding_box:
            lon_min, lat_min, lon_max, lat_max = bounding_box
            
            indices_lat = np.where((lat >= lat_min) & (lat <= lat_max))[0]
            indices_lon = np.where((lon >= lon_min) & (lon <= lon_max))[0]
            
            start_lat, end_lat = indices_lat[0], indices_lat[-1] + 1
            start_lon, end_lon = indices_lon[0], indices_lon[-1] + 1
            
            filtered_lat = lat[start_lat:end_lat]
            filtered_lon = lon[start_lon:end_lon]
            temperature_data_subset = temperature_data[:, start_lat:end_lat, start_lon:end_lon]
            
            # # Filter the data (Alternative approach)
            # # Suitable for irregularly spaced lat/lon values (curvilinear grids, non-uniform sampling)
            # filtered_lat = lat[indices_lat]
            # filtered_lon = lon[indices_lon]
            # temperature_data_subset = temperature_data[:, indices_lat, :][:, :, indices_lon]
        else:
            filtered_lat = lat
            filtered_lon = lon
            temperature_data_subset = temperature_data
            
        # Create a column name for the variable
        variable_column_name = f"{variable}-{nc_file['rcp']}-{nc_file['statistic']}"
        
        
        # Create rows for the DataFrame
        rows = []
        for t in range(temperature_data_subset.shape[0]):
            for i in range(temperature_data_subset.shape[1]):
                for j in range(temperature_data_subset.shape[2]):
                    if not np.ma.is_masked(temperature_data_subset[t, i, j]):
                        rows.append({
                            'time': cftime[t],
                            'latitude': filtered_lat[i],
                            'longitude': filtered_lon[j],
                            variable_column_name: temperature_data_subset[t, i, j]
                        })
                    
        
        # Create a DataFrame from the rows
        df = pd.DataFrame(rows)
        df['time'] = pd.to_datetime(df['time'].map(str))
        df['latitude'] = pd.to_numeric(df['latitude'])
        df['longitude'] = pd.to_numeric(df['longitude'])
        df[variable_column_name] = pd.to_numeric(df[variable_column_name])
        
        # Set the index to time, latitude, and longitude
        return df.set_index(['time', 'latitude', 'longitude'])
    except KeyError as e:
        raise ValueError(f"KeyError: Missing required variable in the netCDF file: {e}")
    finally:
        # Ensure the dataset is closed
        nc_dataset.close()
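
The triple loop in `netcdf_to_dataframe` visits every cell individually, which can be slow for a 100 x 82 x 95 array. As a possible alternative, the sketch below flattens a masked array with NumPy indexing instead. It is not the notebook's method; the function name `masked_cube_to_frame` and the tiny sample cube are made up for illustration.

```python
import numpy as np
import pandas as pd


def masked_cube_to_frame(data, times, lats, lons, column):
    """Flatten a (time, lat, lon) masked array into a long-format DataFrame.

    Masked cells are skipped, as in the nested-loop version.
    """
    data = np.ma.asarray(data)
    # Positions of all unmasked cells, in the same row-major order
    # that the nested loops would visit them
    t_idx, i_idx, j_idx = np.nonzero(~np.ma.getmaskarray(data))
    frame = pd.DataFrame({
        "time": np.asarray(times)[t_idx],
        "latitude": np.asarray(lats)[i_idx],
        "longitude": np.asarray(lons)[j_idx],
        column: np.ma.getdata(data)[t_idx, i_idx, j_idx],
    })
    return frame.set_index(["time", "latitude", "longitude"])


# Made-up 2 x 2 x 2 cube with one missing cell
cube = np.ma.masked_invalid(
    np.array([[[1.0, np.nan], [3.0, 4.0]],
              [[5.0, 6.0], [7.0, 8.0]]])
)
frame = masked_cube_to_frame(
    cube,
    pd.to_datetime(["2000-01-01", "2001-01-01"]),
    [47.0, 47.1],
    [9.0, 9.1],
    "seas-rcp45-mean",
)
```

In the notebook, `times` would be the array returned by `nc.num2date`, so the string-based conversion to `datetime` would still be needed afterwards.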

In [13]:
# Create individual DataFrame and Export as individual CSV files
for nc_file in nc_files:
    csv_filename = f"mosquito_{nc_file['variable']}_{nc_file['rcp']}_{nc_file['statistic']}.csv"
    csv_path = os.path.join(csv_folder, csv_filename)
    
    if not os.path.isfile(csv_path):
        df = netcdf_to_dataframe(nc_file)
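        # Keep the (time, latitude, longitude) index in the file;
        # index=False would drop the coordinates and write only the values
        df.to_csv(csv_path)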
        print(f"Data exported successfully to {csv_path}")
    else:
        print(f"File already exists at {csv_path}. Skipping export.")

File already exists at .\data\sis-health-vector\csv\mosquito_seas_rcp45_mean.csv. Skipping export.
File already exists at .\data\sis-health-vector\csv\mosquito_seas_rcp45_stdev.csv. Skipping export.
File already exists at .\data\sis-health-vector\csv\mosquito_seas_rcp85_mean.csv. Skipping export.
File already exists at .\data\sis-health-vector\csv\mosquito_seas_rcp85_stdev.csv. Skipping export.
File already exists at .\data\sis-health-vector\csv\mosquito_suit_rcp45_mean.csv. Skipping export.
File already exists at .\data\sis-health-vector\csv\mosquito_suit_rcp45_stdev.csv. Skipping export.
File already exists at .\data\sis-health-vector\csv\mosquito_suit_rcp85_mean.csv. Skipping export.
File already exists at .\data\sis-health-vector\csv\mosquito_suit_rcp85_stdev.csv. Skipping export.


In [14]:
# Create combined DataFrame and Export as merged CSV file
csv_filename = 'sis-health-vector-merged.csv.zip'
csv_path = os.path.join(csv_folder, csv_filename)

if not os.path.isfile(csv_path):
    dataframes = [netcdf_to_dataframe(nc_file) for nc_file in nc_files]
    df = pd.concat(dataframes, axis=1)
    df.to_csv(csv_path, sep=',', encoding='utf8', compression='zip')
else:
    print(f"File already exists at {csv_path}. Skipping export.")

File already exists at .\data\sis-health-vector\csv\sis-health-vector-merged.csv.zip. Skipping export.
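
The merged export relies on `pd.concat(..., axis=1)` aligning the individual DataFrames on their shared `(time, latitude, longitude)` index. The sketch below illustrates that behaviour with two tiny frames. All numbers are invented for the example and do not come from the dataset.

```python
import pandas as pd

# Shared MultiIndex, as produced by netcdf_to_dataframe
index = pd.MultiIndex.from_tuples(
    [("2000-01-01", 47.0, 9.0), ("2000-01-01", 47.0, 9.1)],
    names=["time", "latitude", "longitude"],
)

# Invented values: the second frame lacks one grid cell
mean = pd.DataFrame({"seas-rcp45-mean": [120.0, 118.0]}, index=index)
stdev = pd.DataFrame({"seas-rcp45-stdev": [4.0]}, index=index[:1])

# axis=1 places the frames side by side; rows are matched by index
# and cells missing from one frame become NaN (outer join)
merged = pd.concat([mean, stdev], axis=1)
```

Because the default join is an outer join, a cell that is masked in one file but present in another still appears in the merged table, with `NaN` in the column where it was missing.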
