# Script to Preprocess CMIP6 data

This script demonstrates the preprocessing of CMIP6 climate model data necessary for this analysis.

**The following steps are included in this script:**

1. Load netCDF files
   - Define data
   - Load data
2. Create consistent time coordinates
   - Define reference time coordinate and load it to ref_ds
   - Apply reference time coordinate to the datasets
3. Regrid data
4. Landmask
   - Define landmask location
   - Apply landmask
5. Remove Antarctica and Greenland/Iceland
6. Convert Units
7. Define output file path
8. Save the processed data

In [None]:
# ========== Import Required Libraries ==========
import sys
import dask
from dask.diagnostics import ProgressBar
import os
import xarray as xr

In [None]:
# ========== Configure Paths ==========
# Define the (relative) paths to the directories containing utility scripts and configurations
data_handling_dir = '../../src/data_handling'
config_dir = '../../src'

# Add directories to sys.path for importing custom modules
sys.path.append(data_handling_dir)
sys.path.append(config_dir)

# Import custom utility functions and configurations
import load_data as ld
import process_data as pro_dat
import save_data_as_nc as sd
from config import DATA_DIR, DEFAULT_EXPERIMENT, DEFAULT_TEMP_RES, DEFAULT_MODEL, DEFAULT_VARIABLE

### Step 1: Load netCDF files

In [None]:
# Step 1.1: Define the datasets
data_state = 'raw' # Preprocessing the raw data
experiments = [DEFAULT_EXPERIMENT] # You can load multiple experiments here with [experiment_id_1, experiment_id_2, ...] 
models = [DEFAULT_MODEL] # You can load multiple models here with [Model_name_1, Model_name_2 ...]
variables = [DEFAULT_VARIABLE] # You can load multiple variables here with [var_1, var_2 ...]

In [None]:
# Step 1.2: Load the datasets
print("Loading datasets...")
with ProgressBar():
    ds_dict = dask.compute(
        ld.load_multiple_models_and_experiments(
            DATA_DIR, data_state, 'CMIP6', experiments, DEFAULT_TEMP_RES, models, variables
        )
    )[0]

### Step 2: Create consistent time coordinates

Aligning the time coordinates of the different CMIP6 models is necessary for the subsequent ensemble analyses.
We take a simple approach and adopt the time coordinate of one reference model (in our case `cftime.DatetimeNoLeap` from CMCC-CM2-SR5) for all models.
The reference dataset must be downloaded for the respective experiment before the time coordinates of the other models can be corrected.
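
The idea can be sketched independently of the project helper `pro_dat.consis_time`, whose implementation is not shown here. The following is a minimal illustration, assuming monthly data and using NumPy `datetime64` values instead of `cftime` objects for simplicity. The function name `align_time` and the example dates are hypothetical.

```python
import numpy as np

def align_time(times, ref_times):
    """Return a copy of ref_times if both axes cover the same calendar months.

    Hypothetical helper for illustration only; it is not the project's
    consis_time function.
    """
    times = np.asarray(times, dtype="datetime64[D]")
    ref_times = np.asarray(ref_times, dtype="datetime64[D]")
    if times.shape != ref_times.shape:
        raise ValueError("time axes differ in length")
    # Compare at month resolution so that mid-month offsets
    # (e.g. the 15th vs. the 16th) do not count as a mismatch.
    same_months = np.array_equal(
        times.astype("datetime64[M]"), ref_times.astype("datetime64[M]")
    )
    if not same_months:
        raise ValueError("time axes cover different months")
    return ref_times.copy()

# Made-up example: two models stamp their monthly means on different days.
ref_times = np.array(["2000-01-16", "2000-02-15", "2000-03-16"], dtype="datetime64[D]")
model_times = np.array(["2000-01-15", "2000-02-14", "2000-03-15"], dtype="datetime64[D]")
aligned = align_time(model_times, ref_times)
```

With xarray, the resulting axis would typically be attached to a dataset via `ds.assign_coords(time=...)`. Checking that both axes cover the same months before overwriting guards against silently relabelling data from a different period.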

In [None]:
# Step 2.1: Define reference time coordinate and load it to ref_ds
file = f'raw/{DEFAULT_EXPERIMENT}/{DEFAULT_TEMP_RES}/{DEFAULT_VARIABLE}/CMCC-CM2-SR5.nc'
file_path = os.path.join(DATA_DIR, file)
ref_ds = xr.open_dataset(file_path)

In [None]:
# Step 2.2: Apply the reference time coordinate to the datasets
ds_dict[DEFAULT_EXPERIMENT] = pro_dat.consis_time(ds_dict[DEFAULT_EXPERIMENT], ref_ds)

### Step 3: Regrid data

We use daily and monthly outputs interpolated to a common 1° × 1° grid. For this purpose, we employ the conservative regridding method
provided by the xESMF package (Zhuang, J. et al. pangeo-data/xESMF: v0.8.2 (2023)), a first-order conservative interpolation technique designed to preserve the integral of the field values when moving from the source grid to the destination grid.
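
To make the "integral-preserving" property concrete, here is a simplified sketch. It is not the xESMF algorithm: it assumes the fine grid nests exactly into the coarse grid (0.5° into 1°) and approximates cell area by cos(latitude). The function name `conservative_coarsen` and the random test field are hypothetical.

```python
import numpy as np

def conservative_coarsen(field, lat, factor):
    """Area-weighted block average of a (lat, lon) field by an integer factor.

    Returns the coarse field and the summed weights of each coarse cell.
    Illustrative only: cell area is approximated by cos(latitude).
    """
    nlat, nlon = field.shape
    weights = np.repeat(np.cos(np.deg2rad(lat))[:, None], nlon, axis=1)
    # Group fine cells into (coarse_lat, factor, coarse_lon, factor) blocks.
    blocks = (nlat // factor, factor, nlon // factor, factor)
    weighted_sum = (field * weights).reshape(blocks).sum(axis=(1, 3))
    weight_sum = weights.reshape(blocks).sum(axis=(1, 3))
    return weighted_sum / weight_sum, weight_sum

# Synthetic 0.5-degree field, coarsened to 1 degree.
lat = np.linspace(-89.75, 89.75, 360)
lon = np.linspace(0.25, 359.75, 720)
rng = np.random.default_rng(0)
fine = rng.random((lat.size, lon.size))

coarse, coarse_weights = conservative_coarsen(fine, lat, factor=2)

# The area-weighted integral is the same on both grids.
fine_integral = (fine * np.cos(np.deg2rad(lat))[:, None]).sum()
coarse_integral = (coarse * coarse_weights).sum()
```

Unlike bilinear interpolation, this kind of averaging cannot create or destroy mass, which matters for flux variables such as precipitation.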

In [None]:
ds_dict[DEFAULT_EXPERIMENT] = pro_dat.regrid(ds_dict[DEFAULT_EXPERIMENT], method='conservative')

### Step 4: Landmask

We use the IMERG Land-Sea Mask, a map widely used in climate and hydrological analyses, to distinguish between land and water surfaces. Before it can be applied, the mask must be regridded from its original 0.1° × 0.1° resolution to the 1° × 1° grid of the CMIP6 data. The land mask can be downloaded from NASA's directory: https://gpm.nasa.gov/data/directory/imerg-land-sea-mask-netcdf.
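
Applying such a mask amounts to setting all non-land cells to NaN. The sketch below assumes that the regridded mask stores the percentage of water surface in each cell and that it already shares the data grid. The function name `mask_ocean` and the 50 % threshold are hypothetical choices for illustration; the project helper `pro_dat.apply_landmask` may use a different criterion.

```python
import numpy as np

def mask_ocean(field, water_percent, max_water_percent=50.0):
    """Keep cells with at most max_water_percent water cover; set the rest to NaN.

    Assumes field and water_percent are defined on the same grid.
    The threshold is an illustrative assumption.
    """
    land = water_percent <= max_water_percent
    return np.where(land, field, np.nan)

# Made-up 2 x 2 example.
field = np.array([[1.0, 2.0], [3.0, 4.0]])
water_percent = np.array([[0.0, 100.0], [25.0, 75.0]])
masked = mask_ocean(field, water_percent)
```

With xarray, the same operation is commonly written as `ds.where(condition)`, which also fills non-matching cells with NaN.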

In [None]:
# Step 4.1: Define landmask location
landmask_filename = 'land_sea_mask_1x1_grid.nc'
landmask_filepath = '/work/ch0636/g300115/phd_project/common/data/landmasks/imerg/'

# Step 4.2: Apply landmask
ds_dict[DEFAULT_EXPERIMENT] = pro_dat.apply_landmask(ds_dict[DEFAULT_EXPERIMENT], landmask_filename, landmask_filepath) 

### Step 5: Remove Antarctica and Greenland/Iceland

Antarctica and Greenland/Iceland are excluded as their ice-dominated systems differ significantly from hydroecological systems, which are the focus of this analysis. Including these regions would introduce biases irrelevant to the study's objectives.
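
One simple way to exclude such regions is with latitude/longitude boxes. The sketch below is only an illustration: the box limits are rough assumptions (the Greenland/Iceland box also clips parts of the Canadian Arctic), longitudes are assumed to run from -180° to 180°, and the function name `drop_ice_regions` is hypothetical. The project helper `pro_dat.remove_antarctica_greenland_iceland` may define the regions differently.

```python
import numpy as np

def drop_ice_regions(field, lat, lon):
    """Set cells inside rough Antarctica and Greenland/Iceland boxes to NaN.

    Box limits are illustrative assumptions, not the project's definition.
    """
    lat2d, lon2d = np.meshgrid(lat, lon, indexing="ij")
    antarctica = lat2d < -60.0
    greenland_iceland = (lat2d > 58.0) & (lon2d > -75.0) & (lon2d < -10.0)
    return np.where(antarctica | greenland_iceland, np.nan, field)

# 1-degree grid with cell centres at x.5 degrees.
lat = np.arange(-89.5, 90.0, 1.0)
lon = np.arange(-179.5, 180.0, 1.0)
field = np.ones((lat.size, lon.size))
cleaned = drop_ice_regions(field, lat, lon)
```

If the data use 0° to 360° longitudes, as many CMIP6 models do, the longitude limits would need to be shifted accordingly.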

In [None]:
ds_dict[DEFAULT_EXPERIMENT] = pro_dat.remove_antarctica_greenland_iceland(ds_dict[DEFAULT_EXPERIMENT])

### Step 6: Convert Units

Convert each variable to its target unit.
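
As an illustration of what such a conversion involves, consider a water flux stored in kg m⁻² s⁻¹, which is the usual CMIP6 unit for precipitation. Whether `DEFAULT_VARIABLE` is stored this way is an assumption here, and the function name `flux_to_mm_per_day` is hypothetical. Since 1 kg of water spread over 1 m² is 1 mm deep, only the time unit needs to change:

```python
SECONDS_PER_DAY = 86_400

def flux_to_mm_per_day(flux_kg_m2_s):
    """Convert a water flux from kg m-2 s-1 to mm/day.

    1 kg m-2 of liquid water corresponds to a depth of 1 mm,
    so the conversion is a multiplication by the seconds in a day.
    """
    return flux_kg_m2_s * SECONDS_PER_DAY

# Made-up example value: 3e-5 kg m-2 s-1 is roughly 2.59 mm/day.
example = flux_to_mm_per_day(3.0e-5)
```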

In [None]:
# Map each variable to its target unit
conv_units = {DEFAULT_VARIABLE: 'mm/day'}

In [None]:
ds_dict[DEFAULT_EXPERIMENT] = pro_dat.set_units(ds_dict[DEFAULT_EXPERIMENT], conv_units)

### Step 7: Define Output File Path

In [None]:
# Construct the output file path
data_path = f"processed/CMIP6/{DEFAULT_EXPERIMENT}/{DEFAULT_TEMP_RES}/"
file_path = os.path.join(DATA_DIR, data_path)
print(f"Saving files to: {file_path}")

### Step 8: Save Data

In [None]:
# Save the processed datasets and remove any existing files at the target path
sd.save_files(ds_dict[DEFAULT_EXPERIMENT], file_path)