# Pre-Processing Meteorological Predictor Fields 
Version 18 January 2024, Selina Kiefer

### Input: netcdf- or grib-file
continuous timeseries of meteorological predictors in netcdf- or grib-format (e.g. ERA5 with the predictor fields u10, z100, z250, z500, z850, t850, H850, u300 and msl, at 4 times per day (00, 06, 12, 18 UTC), 1950 - 2020, Oct-Apr, 60°W-60°E and 20-80°N, available e.g. from https://cds.climate.copernicus.eu/#!/search?text=ERA5&type=dataset)
### Output: csv-file
continuous timeseries of meteorological predictors in csv-format

## Used software: Climate Data Operators and Python

#### Climate Data Operators (CDO) 

CDO is an open-source command-line tool tailored to the most common operations on meteorological data. For the operations used in this notebook it is typically much faster than an equivalent Python workflow.

Up to date information about CDO: https://code.mpimet.mpg.de/projects/cdo

Reference: Schulzweida, U. (2019): "CDO User Guide". Available at: https://doi.org/10.5281/ZENODO.3539275.

#### Short introduction to CDO

The overall structure for most operations is:

cdo -operator_last_executed,optional_specifications -operator_first_executed,optional_specifications ifile ofile

e.g. cdo -daymean -selyear,1950,1951 input_file_name output_file_name

The input file (ifile) and the output file (ofile) of one operation must have different names. It is therefore best to give all intermediate files, which are not intended for further use, similar names (e.g. temp_1, temp_2, etc.) and to delete them directly afterwards.

CDO does not ask when overwriting an existing file. So make sure that everything is named uniquely and correctly.
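
To make the right-to-left execution order of chained operators concrete, here is a minimal Python sketch. The helper `build_cdo_command` is hypothetical (it is not part of CDO or of this project) and only assembles the command string; it does not call CDO.

```python
def build_cdo_command(operators, ifile, ofile):
    """Assemble a chained CDO call from operators listed in execution order.

    Hypothetical helper for illustration; it only builds the command string.
    """
    if ifile == ofile:
        raise ValueError("Input and output file must have different names.")
    # CDO executes the right-most operator first, so the list is reversed.
    chain = " ".join(f"-{op}" for op in reversed(operators))
    return f"cdo {chain} {ifile} {ofile}"


# First select the years, then compute the daily mean.
command = build_cdo_command(
    ["selyear,1950,1951", "daymean"], "input_file_name", "output_file_name"
)
print(command)
```

The printed command should match the example given above, with `daymean` written first although it is executed last.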

### Start with CDO

The spatial and temporal pre-processing is done with CDO first, since it is much faster than Python for these operations.

In [None]:
# Short overview of the data file's content.
!cdo sinfov  ./era5_msl_180W_180E_0N_90N_1950_1978.nc

In [None]:
# Detailed depiction of the data file's content. Use grib_dump for files in grib-format and
# ncdump for files in netcdf-format. It might be wise to use a separate terminal for this
# command since it prints all available information about the data file.
#! grib_dump ./era5_msl_60W_60E_20N_80N_1950_2020.nc
#! ncdump ./era5_msl_60W_60E_20N_80N_1950_2020.nc

#### Spatial Preprocessing 

In [None]:
# Selection of a gridbox (sellonlatbox,°W,°E,°S,°N). Western longitudes are given as
# (360° - °W), e.g. 60°W becomes 300. In case there is only 1 latitude or longitude to average
# over, select the desired longitude/latitude and, as second value, the desired
# longitude/latitude + 1. Otherwise CDO may not perform well.
! cdo sellonlatbox,300,60,20,80 ./era5_msl_180W_180E_0N_90N_1950_1978.nc temp_11
! cdo sellonlatbox,300,60,20,80 ./era5_msl_180W_180E_0N_90N_1979_2020.nc temp_12
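
The conversion of western longitudes used in the commands above can be sketched in Python. The function name `to_0_360` is an assumption made for this illustration; longitudes are passed in degrees east, so 60°W is -60.

```python
def to_0_360(longitude):
    """Map a longitude in degrees east (-180..180) to the range 0..360."""
    return longitude % 360


# 60°W is -60 in degrees east and becomes 300; 60°E stays 60.
lon_west = to_0_360(-60)
lon_east = to_0_360(60)
operator = f"sellonlatbox,{lon_west},{lon_east},20,80"
print(operator)
```

The resulting operator string is the one passed to `cdo` in the cell above.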

#### Temporal Preprocessing

In [None]:
# Calculation of the daily mean (daymean). Set the time to 00 UTC (settime,00:00:00) to avoid 
# any inconveniences when reading in the data later with Python.
! cdo -settime,00:00:00 -daymean temp_11 temp_21
! cdo -settime,00:00:00 -daymean temp_12 temp_22
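
As a small cross-check of what the daily mean does, the same aggregation can be sketched with pandas on synthetic 6-hourly values. This is only an illustration under assumed data (the series below is made up); it is not a replacement for the CDO step.

```python
import numpy as np
import pandas as pd

# Synthetic 6-hourly series covering two days (00, 06, 12, 18 UTC).
times = pd.date_range("1950-10-01 00:00", periods=8, freq=pd.Timedelta(hours=6))
series = pd.Series(np.arange(8, dtype=float), index=times, name="msl")

# Daily mean; the resulting timestamps are at 00:00 of each day.
daily = series.resample("D").mean()
print(daily)
```

Each daily value is the average of the four 6-hourly values of that day, and the time stamp of the result is 00:00, comparable to `settime,00:00:00`.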

In [None]:
# Selection of certain times, e.g. only the winter months (selmon).
! cdo selmon,1,2,3,4,10,11,12 temp_21 temp_31
! cdo selmon,1,2,3,4,10,11,12 temp_22 temp_32

In [None]:
# Selection of only the relevant data according to the lead time at the beginning of each
# winter. Number of days to delete = (Days_of_Month - lead_time).
! cdo delete,day=1,2,3,month=10 temp_31 temp_41
! cdo delete,day=1,2,3,month=10 temp_32 temp_42

In [None]:
# Selection of only the relevant data according to the lead time at the end of each winter.
# Number of days to delete = lead_time.
! cdo delete,day=3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,month=4 temp_41 temp_51
! cdo delete,day=3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,month=4 temp_42 temp_52

In [None]:
# Selection of only the relevant data according to the lead time at the end of the data in case
# it ends with 31 Dec instead of 30 Apr. Number of days to delete = lead_time (here the
# last 28 days of December 2020).
! cdo delete,day=4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,month=12,year=2020 temp_52 temp_62
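
The long day lists in the `delete` commands above are easy to mistype. A minimal Python sketch for generating them is given below. The function names are hypothetical, and the sketch assumes that the lead time is shorter than the month in question. It reproduces what the commands do: the first (Days_of_Month - lead_time) days are removed at the start, and the last lead_time days are removed at the end.

```python
def days_to_delete_at_start(days_in_month, lead_time):
    """Return the first (days_in_month - lead_time) days of a month."""
    return list(range(1, days_in_month - lead_time + 1))


def days_to_delete_at_end(days_in_month, lead_time):
    """Return the last lead_time days of a month."""
    return list(range(days_in_month - lead_time + 1, days_in_month + 1))


def delete_operator(days, month):
    """Build the CDO delete operator string for the given days of one month."""
    day_list = ",".join(str(day) for day in days)
    return f"delete,day={day_list},month={month}"


lead_time = 28
october = delete_operator(days_to_delete_at_start(31, lead_time), 10)
april = delete_operator(days_to_delete_at_end(30, lead_time), 4)
december = delete_operator(days_to_delete_at_end(31, lead_time), 12)
print(october)
print(april)
print(december)
```

For a lead time of 28 days, this yields days 1 to 3 for October, days 3 to 30 for April and days 4 to 31 for December, as used above. The `year=2020` restriction for December would still have to be appended by hand.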

In [None]:
# Temporal merging of two timeseries. The option "-b F64" sets the output precision to 64-bit
# floats, which makes sure that the two dataseries can be combined without errors.
! cdo -b F64 mergetime temp_51 temp_62 temp_7

In [None]:
# Make sure that the time is sorted correctly (sorttimestamp) and the file is named correctly.
! cdo sorttimestamp temp_7 ./era5_msl_60W_60E_20N_80N_1950_2020_only_Oct_Apr_lead_time_28d.nc

#### Convert from grib-format to netcdf-format

In [None]:
# Convert the grib-file to a netcdf-file if necessary. The Python scripts are designed to use
# netcdf-files.
#! cdo -f nc copy ofile.grib ofile.nc

#### Remove unnecessary files

In [None]:
# Remove unnecessary files which have been created by CDO since the names of the input files 
# and output files have to be unique.
! rm temp*

## Continue with Python


For a clear overview of the data, pandas dataframes are used. These are then written directly to csv-format for storage, which allows a safe and easy data transfer between the various Jupyter notebooks.

#### Define the path and file names

In [None]:
# Set the needed path and file names.
PATH_defined_functions = './Defined_Functions/'

PATH_data = './Data_in_Netcdf_Format/'
ifile_data = 'era5_msl_60W_60E_20N_80N_1950_2020_only_Oct_Apr_lead_time_28d.nc'

PATH_output_file = './Data_in_csv_Format/'
file_name_output_file = 'era5_msl_60W_60E_20N_80N_1950_2020_only_Oct_Apr_lead_time_28d.csv'

#### Import the necessary packages and functions
Nothing needs to be changed here.

In [None]:
# Import the necessary python packages.
import numpy as np
import pandas as pd

In [None]:
# Import the necessary functions.
import sys
sys.path.insert(1,PATH_defined_functions)
from read_in_netcdf_data import read_in_netcdf_data

#### Read in the data and check the file's content

In [None]:
# Read in the data and show its header.
df_data = read_in_netcdf_data(PATH_data, ifile_data)
df_data.head()

In [None]:
# Show the end of the dataframe.
df_data.tail()

#### Apply a 7-day running mean for temporal aggregation of the data

In [None]:
# Group the data by grid point so that the 7-day running mean is computed separately for
# every latitude-longitude pair.
df_data_grouped = df_data.groupby(['latitude', 'longitude'])

In [None]:
# For every latitude-longitude pair, apply the centered 7-day running mean.
list_rolling_means = []
for _, df_group in df_data_grouped:
    list_rolling_means.append(df_group['msl'].rolling(window=7, center=True).mean())

In [None]:
# Concatenate the list of rolling means into one pandas series (works like "extend" for lists).
rolling_means = pd.concat(list_rolling_means)

In [None]:
# Convert the concatenated series to a pandas dataframe and sort its index.
df_rolling_means = pd.DataFrame(rolling_means)
df_rolling_means = df_rolling_means.sort_index()

In [None]:
# Remove the column with the original variable and replace it with the 7-day rolling mean of the variable.
df_data = df_data.drop(['msl'], axis=1)
df_data['msl'] = df_rolling_means['msl']
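
The grouping, rolling and re-assignment steps above can also be written more compactly with `groupby(...).transform(...)`. The following is a minimal sketch on synthetic data; the dataframe `df_example` and the column name `msl_7d` are assumptions for illustration, and the real `df_data` is assumed to have the columns `latitude`, `longitude` and `msl` sorted by time within each grid point.

```python
import numpy as np
import pandas as pd

# Synthetic long-format data: two grid points with ten daily values each.
times = pd.date_range("1950-10-04", periods=10, freq="D")
df_example = pd.DataFrame({
    "time": np.tile(times, 2),
    "latitude": np.repeat([50.0, 51.0], 10),
    "longitude": np.full(20, 8.0),
    "msl": np.concatenate([np.arange(10.0), np.arange(100.0, 110.0)]),
})

# Centered 7-day running mean, computed separately for every grid point.
# transform() returns a series aligned with the original index.
df_example["msl_7d"] = (
    df_example.groupby(["latitude", "longitude"])["msl"]
    .transform(lambda s: s.rolling(window=7, center=True).mean())
)
print(df_example)
```

With a centered window of 7, the first three and last three values of every grid point are NaN because the window is incomplete there. Note that the window counts rows, not calendar days, so it would also span the gap between one winter and the next.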

#### Remove any rows containing NaN-values since the used ML-models cannot handle NaN-values

In [None]:
# Remove any rows containing NaN-values (dropna() drops rows by default).
df_data = df_data.dropna()

#### Create a minimal, useful representation of the data

In [None]:
# Extract the month of the winter to include seasonality. Needs to be done only for one predictor.
#df_data = df_data.set_index('time')
#df_data['month'] = df_data.index.month
#df_data = df_data.reset_index()
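
The commented-out month extraction can be tried on a small synthetic dataframe first. The sketch below assumes that the `time` column has a datetime type; `df_example` and its values are made up for illustration. It uses the `.dt` accessor instead of setting and resetting the index, which gives the same `month` column.

```python
import pandas as pd

# Synthetic example with three dates from one winter.
df_example = pd.DataFrame({
    "time": pd.to_datetime(["1950-10-04", "1950-12-24", "1951-04-02"]),
    "msl": [101325.0, 100980.0, 101510.0],
})

# Extract the calendar month as an integer (1 = January, ..., 12 = December).
df_example["month"] = df_example["time"].dt.month
print(df_example)
```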

In [None]:
# Rename the variable's column in case its naming is ambiguous.
df_data = df_data.rename(columns={'msl':'msl'})

#### Double-check the representation of the data

In [None]:
# Check if everything is sorted, renamed or removed correctly.
df_data.head()

In [None]:
# Also check if everything is sorted, renamed or removed correctly at the end of the
# dataframe.
df_data.tail()

#### Save the predictor data

In [None]:
# Save the pandas dataframe in csv-format.
df_data.to_csv(PATH_output_file+file_name_output_file)
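
To check that a csv-file written this way can be read back in another notebook, a round trip can be sketched with an in-memory buffer. The dataframe `df_example` is synthetic, and the `read_csv` arguments are one possible choice, assuming the default index was written as the first column and the time column is called `time`.

```python
import io

import pandas as pd

# Synthetic dataframe with the same kind of columns as the predictor data.
df_example = pd.DataFrame({
    "time": pd.to_datetime(["1950-10-04", "1950-10-05"]),
    "latitude": [50.0, 50.0],
    "longitude": [8.0, 8.0],
    "msl": [101325.0, 100980.0],
})

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
df_example.to_csv(buffer)
buffer.seek(0)
df_restored = pd.read_csv(buffer, index_col=0, parse_dates=["time"])
print(df_restored.dtypes)
```

Without `parse_dates`, the `time` column would come back as plain strings, and without `index_col=0` the saved index would appear as an extra unnamed column.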

In [None]:
# End of Program