# 2D Air Temperature Aggregation Tool

This notebook extracts 2-meter air temperature (`ta_2m`) data from PALM (Parallelized Large-Eddy Simulation Model) 2D NetCDF output files and aggregates it over time. The goal is to produce aggregated `ta_2m` NetCDF files (e.g., hourly averages) from finer-resolution simulation output, which subsequent analysis notebooks can then use as input.

## 1. Import dependencies

This section imports all necessary Python libraries for numerical operations, NetCDF file handling, interactive widget creation for user input, and basic operating system interactions.

In [1]:
import os

import numpy as np
import netCDF4 as nc
from netCDF4 import Dataset

from IPython.display import display
import ipywidgets as widgets

from utils import palm_variables  # Project-local module with PALM variable descriptions and units.

## 2. Load Simulation Data

This section defines the file paths for the 2D simulation output NetCDF files (a baseline and a scenario run) and the static driver file. The files are then opened as `netCDF4.Dataset` objects so that their contents can be read. The static driver is not used later in this notebook; it is loaded for potential future use (e.g., extracting building masks or grid information).

In [2]:
# Relative paths to the 2D xy output files and the static driver file.
file_xy_1 = r"./Data/_simulation_outputs_3/konstanz_4096x4096_v9_Baseline-48hr/OUTPUT/konstanz_4096x4096_v9_Baseline_av_xy_N03.000.nc"
file_xy_2 = r"./Data/_simulation_outputs_3/konstanz_4096x4096_v9_Scenario_1-48hr/OUTPUT/konstanz_4096x4096_v9_Scenario_1_av_xy_N03.000.nc"
file_static = r"./Data/_simulation_outputs_3/konstanz_4096x4096_v9_Scenario_1-48hr/INPUT/konstanz_4096x4096_v9_Scenario_1_static_N03"

# Read NetCDF files into Dataset objects in read mode ('r').
dataset_1 = nc.Dataset(file_xy_1, mode='r')
dataset_2 = nc.Dataset(file_xy_2, mode='r')
dataset_3 = nc.Dataset(file_static, mode='r') # Loaded for completeness, but not explicitly used later in *this* notebook.

# Store the Dataset objects and their corresponding file paths in lists for easy iteration.
file_xy_list = [file_xy_1, file_xy_2]
dataset_list = [dataset_1, dataset_2]

## 3. Variable Selection

This section lets the user interactively select a variable from the loaded NetCDF datasets via a dropdown widget. The description and unit of the chosen variable, retrieved from the `palm_variables` module, are then printed for identification.

In [3]:
# Extract the names of all variables in the first dataset that have more than 2 dimensions.
# In PALM xy output files (`_av_xy_N03.000.nc`), these are typically 2D spatial fields over time,
# with dimensions (time, z_fixed_level, y, x) or (time, y, x).
var_names_palm = [var for var in dataset_1.variables if dataset_1.variables[var].ndim > 2]

# Initialize `test_variable` with the first variable in the list (`var_names_palm[0]`),
# which is commonly 'ta_2m*_xy' for 2-m air temperature in these types of files.
test_variable = var_names_palm[0]

# Create a dropdown widget to allow the user to select the desired 2D variable.
drop_down = widgets.Dropdown(
    options=var_names_palm,         # Populate the dropdown with the extracted 2D variable names.
    value=var_names_palm[0],        # Set the initial selected value in the dropdown.
    description='Select test variable:' # Label displayed next to the dropdown.
)

# Define a handler function that will be called whenever the dropdown's value changes.
def dropdown_handler(change):
    global test_variable  # Declare `test_variable` as global to modify it.
    test_variable = change.new     # Update the global `test_variable` with the newly selected value.
    print(f"Selected variable: {test_variable}") # Print the newly selected variable to the console.

# Attach the `dropdown_handler` function to observe changes in the 'value' property of the dropdown.
drop_down.observe(dropdown_handler, names='value')

# Display the dropdown widget in the notebook output.
display(drop_down)

Dropdown(description='Select test variable:', options=('ta_2m*_xy', 'tsurf*_xy', 'wspeed_10m*_xy', 'bio_pet*_x…

In [4]:
# Check if the selected `test_variable` name contains an asterisk '*'.
# This is common for PALM variables in 2D xy output (e.g., 'ta_2m*_xy').
if "*" in test_variable:
    # If a wildcard is present, extract the base part of the variable name (e.g., 'ta_2m' from 'ta_2m*_xy')
    # and re-append '*' to match the keys in `palm_variables.variables_dict`.
    var_initial = test_variable.split("*")[0] + "*"
    # Retrieve the dictionary of information for `var_initial` from the `palm_variables` module.
    variable_info = palm_variables.variables_dict.get(var_initial, {})
    # Extract the 'unit' from `variable_info`, defaulting to 'No unit available' if the key is missing.
    unit = variable_info.get('unit', 'No unit available')
    # Extract the 'description' from `variable_info`, defaulting to 'No description available' if the key is missing.
    description = variable_info.get('description', 'No description available')
    # Print the capitalized description and its unit.
    print(f"{description.capitalize()}, {unit}")

2-m air temperature, °C
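
The lookup in the cell above can be sketched as a small function. The dictionary below is a hypothetical stand-in for `palm_variables.variables_dict` (its real contents and structure may differ), and `describe_variable` is an illustrative name, not part of the notebook. Unlike the cell above, this sketch also returns a fallback string for names without an asterisk.

```python
# Hypothetical stand-in for `palm_variables.variables_dict`; the real module may differ.
variables_dict = {
    "ta_2m*": {"description": "2-m air temperature", "unit": "°C"},
}

def describe_variable(variable_name, lookup):
    """Return 'description, unit' for a PALM variable name, with fallbacks."""
    if "*" in variable_name:
        # 'ta_2m*_xy' -> 'ta_2m*', matching the assumed dictionary keys.
        key = variable_name.split("*")[0] + "*"
    else:
        key = variable_name
    info = lookup.get(key, {})
    description = info.get("description", "No description available")
    unit = info.get("unit", "No unit available")
    return f"{description}, {unit}"

print(describe_variable("ta_2m*_xy", variables_dict))
print(describe_variable("unknown_xy", variables_dict))
```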


## 4. Define Time Sequences and Aggregation Logic

This section defines the `get_aggregate_time_list` function, which builds one list of time step indices per output time step. Each list is a window centered on that time step and clipped at the start and end of the simulation, so windows near the boundaries contain fewer steps. These moving windows drive the aggregation (e.g., averaging) in Section 6; the total number of time steps is read from the dataset in Section 5.

In [5]:
def get_aggregate_time_list(total_time_steps, aggregate_time_steps):
    """
    Generates a list of time step ranges for temporal aggregation, creating a moving window.

    Args:
        total_time_steps (int): Total number of time steps in the simulation.
        aggregate_time_steps (int): The size of the aggregation window (number of time steps).

    Returns:
        list: A list where each element is a sub-list of time step indices
              representing an aggregation window.
    """
    time_lists = []
    
    for i in range(total_time_steps):
        if aggregate_time_steps <= 1:
            time_list = [i] # No aggregation: window is just the current time step.
        else:
            half_window = aggregate_time_steps // 2 # Calculate half-window size.
            
            # Determine the start and end indices of the window based on even/odd `aggregate_time_steps`.
            if aggregate_time_steps % 2 == 0:
                # An even window cannot be centered exactly: `i` is the first index of the second half.
                # Example: for aggregate_time_steps=6, half_window=3. For `i=10`, range is [7, 13) -> [7, 8, 9, 10, 11, 12].
                time_list = [j for j in range(i - half_window, i + half_window)]
            else:
                # For odd window, it's perfectly centered around `i`.
                # Example: for aggregate_time_steps=5, half_window=2. For `i=10`, range is [8, 13) -> [8, 9, 10, 11, 12].
                time_list = [j for j in range(i - half_window, i + half_window + 1)]
        
        # Filter out time indices that are outside the total simulation time steps.
        valid_time_list = [j for j in time_list if 0 <= j < total_time_steps]
        time_lists.append(valid_time_list)
    
    return time_lists
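
To see how the windows are clipped at the boundaries, the following sketch re-implements the same rule in compact form and prints the windows for a short series. `centered_windows` is a hypothetical name used only for this illustration; it is intended to return the same lists as `get_aggregate_time_list`.

```python
def centered_windows(total_time_steps, aggregate_time_steps):
    """Compact, illustrative equivalent of get_aggregate_time_list."""
    if aggregate_time_steps <= 1:
        return [[i] for i in range(total_time_steps)]
    half = aggregate_time_steps // 2
    # Odd windows span [i - half, i + half]; even windows span [i - half, i + half).
    extra = aggregate_time_steps % 2
    return [
        [j for j in range(i - half, i + half + extra) if 0 <= j < total_time_steps]
        for i in range(total_time_steps)
    ]

print(centered_windows(5, 3))  # [[0, 1], [0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4]]
print(centered_windows(5, 2))  # [[0], [0, 1], [1, 2], [2, 3], [3, 4]]
```

The number of windows always equals the number of input time steps, so the aggregated series has the same length as the original one; only the first and last windows are shorter.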

## 5. Prepare Data for Aggregation

This section prepares the inputs for the aggregation. It extracts the base filenames of the input simulation files, loads the selected variable from both `dataset_1` and `dataset_2`, sets the `aggregate_time_steps` parameter (default 1, meaning no aggregation), and generates the corresponding `time_lists`.

In [6]:
# Extract the base filename (without path and extension) from the first xy output file.
# This will be used in naming the aggregated output files.
filename_xy_1 = os.path.basename(file_xy_1).split('.')[0]
# Extract the base filename for the second xy output file.
filename_xy_2 = os.path.basename(file_xy_2).split('.')[0]

# Load the actual variable data for the selected `test_variable` from both `dataset_1` and `dataset_2`.
# `test_variable` is determined by the dropdown selection in a previous step.
variable_data_1 = dataset_1[test_variable]
variable_data_2 = dataset_2[test_variable]

# Get the full shape of the variable data, (time, y, x) or (time, z, y, x), from `dataset_1`.
# The first element (`[0]`) gives the total number of time steps.
variable_data_shape = np.shape(dataset_1[test_variable])
total_time_steps = variable_data_shape[0]

# Define the aggregation window size in number of output time steps.
# A value of 1 means no aggregation (individual time steps are processed).
# Increase it to aggregate over time (e.g., 6 gives hourly averages if the output interval is 10 minutes).
aggregate_time_steps = 1 # Default to no aggregation.

# Generate the list of time step ranges for aggregation based on the defined window.
# This list (`time_lists`) will guide the averaging process.
time_lists = get_aggregate_time_list(total_time_steps, aggregate_time_steps)
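
The window size is counted in output time steps, so the value needed for a given duration depends on the output interval of the simulation. As a rough sketch, the window size could be derived from the time coordinate instead of being hard-coded. `steps_per_window` is a hypothetical helper, and the sketch assumes that the time values are in seconds and regularly spaced; neither is verified here.

```python
def steps_per_window(time_values, window_seconds):
    """Number of output steps spanning `window_seconds`, assuming regular spacing."""
    if len(time_values) < 2:
        raise ValueError("Need at least two time values to infer the output interval.")
    interval = time_values[1] - time_values[0]
    if interval <= 0:
        raise ValueError("Time values must be strictly increasing.")
    # Never return less than 1, which means "no aggregation".
    return max(1, round(window_seconds / interval))

# Illustrative 10-minute output interval (values in seconds).
example_times = [600.0, 1200.0, 1800.0, 2400.0]
print(steps_per_window(example_times, 3600))  # 6
```

In the notebook, `time_values` could be taken from `dataset_1['time'][:]`, provided that variable exists and uses seconds.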

## 6. Perform Aggregation and Save Data

This final section iterates through the loaded simulation datasets. For each dataset, it computes the temporal aggregate of the selected variable over the aggregation windows (`time_lists`) and saves the result to a new NetCDF file in `./output/04_aggregated_2D_data/`. If the output file already exists, the export is skipped to prevent accidental overwrites. Note that the output filename does not include the aggregation window size, so files from an earlier run with a different window must be removed or renamed before re-exporting.
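
The core of the aggregation is a mean over the time axis for each window. The sketch below illustrates this on a small synthetic array with one missing value; the array, the fill value, and the window lists are made up for the example. It uses a NumPy masked array, which is one way to keep missing cells out of the average, and compares the result with a plain mean over the raw values.

```python
import numpy as np

fill = -9999.0  # Illustrative fill value marking a missing cell.
raw = np.array([
    [[10.0, fill]],
    [[12.0, 20.0]],
    [[14.0, 22.0]],
])  # Shape (time=3, y=1, x=2).
data = np.ma.masked_values(raw, fill)

# Windows for 3 time steps and a window size of 3 (clipped at the boundaries).
windows = [[0, 1], [0, 1, 2], [1, 2]]

# One 2D mean per window; masked cells are ignored in the average.
aggregated = [data[w].mean(axis=0) for w in windows]
print(aggregated[0])

# A plain mean over the raw values averages the fill value in.
naive = np.mean(raw[windows[0]], axis=0)
print(naive)
```

In the first window, the cell with a missing value at the first time step gets the value 20.0 from the masked mean, whereas the plain mean yields -4989.5.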

In [7]:
# Iterate through each dataset in the `dataset_list` (e.g., Baseline and Scenario 1 simulations).
for i, current_dataset in enumerate(dataset_list):
    # Extract the variable data for the `test_variable` from the current dataset.
    variable_data = current_dataset[test_variable]
    
    # Get the total number of time steps for the current variable data.
    current_total_time_steps = np.shape(variable_data)[0]
    
    # Regenerate `time_lists` to ensure it's correct for the current dataset's total time steps,
    # in case datasets have different lengths or `aggregate_time_steps` was changed.
    time_lists = get_aggregate_time_list(current_total_time_steps, aggregate_time_steps)
    
    # Initialize a list to store the aggregated 2D arrays for all time steps.
    variable_data_agg = []
    
    # Loop through each generated time window (`time_list`) to compute the aggregate.
    for j, time_window_indices in enumerate(time_lists):
        values_in_window = []
        
        # For each time index within the current window, extract the 2D slice of the variable.
        # Assuming `variable_data` has dimensions (time, z_fixed_level, y, x) or (time, y, x).
        # If it's 4D (time, z, y, x), `variable_data[time_idx, 0, :, :]` extracts the 2D slice at z=0.
        # If it's 3D (time, y, x), `variable_data[time_idx, :, :]` extracts the 2D slice directly.
        # The following handles both cases assuming `z_fixed_level` is at index 1 if present.
        if variable_data.ndim == 4:
            for time_idx in time_window_indices:
                values_in_window.append(variable_data[time_idx, 0, :, :])
        elif variable_data.ndim == 3:
            for time_idx in time_window_indices:
                values_in_window.append(variable_data[time_idx, :, :])
        else:
            raise ValueError(f"Unexpected number of dimensions for variable {test_variable}: {variable_data.ndim}")
            
        # Compute the mean of all 2D slices in the current window along the time axis (axis=0).
        # `np.ma.stack` keeps the masks of netCDF4's masked arrays, so fill values are not averaged in.
        # The result is a single 2D array holding the aggregated value for that time window.
        variable_data_agg.append(np.ma.mean(np.ma.stack(values_in_window), axis=0))
        
    # Determine the base filename for the output NetCDF file from the original file path.
    source_filename = os.path.basename(file_xy_list[i]).split('.')[0]
    
    # Construct the `output_filename` by appending the `test_variable` name (with '*' removed if present).
    output_filename = f"{source_filename}_{test_variable.replace('*','')}"
    
    # Define the output directory path.
    output_directory = "./output/04_aggregated_2D_data/"
    # Create the output directory if it doesn't already exist. `exist_ok=True` prevents an error if it exists.
    os.makedirs(output_directory, exist_ok=True)
    
    # Construct the full output file path.
    output_filepath = os.path.join(output_directory, f"{output_filename}.nc")

    # --- Check if file already exists; if so, skip export ---
    if os.path.exists(output_filepath):
        print(f"Skipping export: File already exists at {output_filepath}")
        continue # Skip to the next iteration of the loop for the next dataset
    
    # Create a new NetCDF file in write mode ("w").
    # `nc.Dataset` is used for creating and writing to NetCDF files.
    with nc.Dataset(output_filepath, mode="w", format='NETCDF4_CLASSIC') as output_dataset:
        # --- Copy Global Attributes ---
        # Iterate through all global attributes of the input dataset.
        for attr_name in current_dataset.ncattrs():
            # Exclude "VAR_LIST" from direct copying as it will be reconstructed.
            if attr_name != "VAR_LIST":
                output_dataset.setncattr(attr_name, current_dataset.getncattr(attr_name))
        
        # Reconstruct and set the "VAR_LIST" attribute for the output file.
        # This attribute lists the variables contained within the new file, formatted as ';var1;var2;'.
        var_list_str = f";{test_variable};" # Only `test_variable` is exported.
        output_dataset.setncattr('VAR_LIST', var_list_str)
        
        # --- Create Dimensions ---
        # Dimension sizes are taken from the aggregated data, a list of 2D (y, x) arrays,
        # one per time step.
        
        # `num_time`: number of aggregated time steps.
        # `num_z_level`: fixed to 1 as we're exporting a single 2D layer (e.g., z=0).
        # `num_y`: number of rows in the 2D array.
        # `num_x`: number of columns in the 2D array.
        num_time = len(variable_data_agg)
        num_z_level = 1 
        num_y = variable_data_agg[0].shape[0] # Number of rows (y-dimension)
        num_x = variable_data_agg[0].shape[1] # Number of columns (x-dimension)
        
        output_dataset.createDimension('time', num_time)
        output_dataset.createDimension('z', num_z_level) # Creating a z-dimension with size 1 for consistency.
        output_dataset.createDimension('y', num_y) 
        output_dataset.createDimension('x', num_x) 
        
        # Also copy coordinate variables if they exist.
        if 'time' in current_dataset.variables:
            time_var = output_dataset.createVariable('time', current_dataset['time'].dtype, ('time',))
            # The moving window yields one aggregate per input time step, centered on that step,
            # so the time coordinate is copied unchanged. Striding by `aggregate_time_steps` would
            # produce fewer values than the `time` dimension holds.
            time_var[:] = current_dataset['time'][:]
        if 'z' in current_dataset.variables:
            z_var = output_dataset.createVariable('z', current_dataset['z'].dtype, ('z',))
            z_var[:] = current_dataset['z'][0] # Copy the z-coordinate of the extracted layer.
        if 'y' in current_dataset.variables:
            y_var = output_dataset.createVariable('y', current_dataset['y'].dtype, ('y',))
            y_var[:] = current_dataset['y'][:]
        if 'x' in current_dataset.variables:
            x_var = output_dataset.createVariable('x', current_dataset['x'].dtype, ('x',))
            x_var[:] = current_dataset['x'][:]


        # Create the variable in the new NetCDF file.
        # The dimensions are (time, z, y, x) to maintain a consistent structure with PALM 4D outputs.
        # Reuse the source fill value if one is defined; otherwise fall back to -9999.0.
        fill_value = variable_data._FillValue if '_FillValue' in variable_data.ncattrs() else -9999.0
        data_var = output_dataset.createVariable(test_variable, np.float32, ('time', 'z', 'y', 'x'),
                                               fill_value=fill_value)

        # Fill the newly created variable with the aggregated 2D data.
        # Each aggregated 2D array is assigned to its corresponding time step and the first z-layer.
        for k, array in enumerate(variable_data_agg):
            data_var[k, 0, :, :] = array # Assign the 2D aggregated array to the NetCDF variable

    print(f"Successfully extracted and saved aggregated data for '{test_variable}' to: {output_filepath}")

Skipping export: File already exists at ./output/04_aggregated_2D_data/konstanz_4096x4096_v9_Baseline_av_xy_N03_ta_2m_xy.nc
Skipping export: File already exists at ./output/04_aggregated_2D_data/konstanz_4096x4096_v9_Scenario_1_av_xy_N03_ta_2m_xy.nc
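
The skip check above keys only on the output path, which does not depend on `aggregate_time_steps`. One possible way to keep results from different window sizes apart is to encode the window size in a subdirectory name. The sketch below shows this idea; `build_output_path` and the `window_NNN` naming are hypothetical and not part of the notebook, and the example source path is made up.

```python
import os

def build_output_path(source_path, variable, aggregate_time_steps,
                      root="./output/04_aggregated_2D_data"):
    """Build an output path that includes the aggregation window size."""
    # Same base-name rule as the notebook: drop the directory and everything after the first dot.
    source_name = os.path.basename(source_path).split(".")[0]
    # '*' is removed from the variable name, as in the notebook.
    variable_tag = variable.replace("*", "")
    window_dir = f"window_{aggregate_time_steps:03d}"
    return os.path.join(root, window_dir, f"{source_name}_{variable_tag}.nc")

path = build_output_path("./Data/run/OUTPUT/example_av_xy_N03.000.nc", "ta_2m*_xy", 6)
print(path)
```

With such a scheme, re-running the export with a different window size writes to a different directory instead of being skipped.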
