# Creating the Binary Ground Truth Cold Wave Days
Version 17 January 2024, Selina Kiefer

### Input: netcdf- or grib-file and csv-file
1. netcdf- or grib-file with ground truth (i.e. E-OBS V23.1e, tg, daily mean, 1950 - 2020, Nov-Apr, 3-20°E and 45-60°N, e.g. from https://www.ecad.eu/download/ensembles/download.php ). Alternatively, a continuous timeseries of ground truth temperature in csv-format can be used directly; in that case, skip the CDO part and start with Python.
2. threshold temperatures for cold waves in csv-format
### Output: csv-file
binary timeseries of cold wave days in csv-format (1 = cold wave day, 0 = non cold wave day)

## Used software: Climate Data Operators and Python

#### Climate Data Operators (CDO) 

Open-source command-line software tailored to the most common meteorological operations. For these operations it is efficient and much faster than Python.

Up to date information about CDO: https://code.mpimet.mpg.de/projects/cdo

Reference: Schulzweida, U. (2019): "CDO User Guide". Available at: https://doi.org/10.5281/ZENODO.3539275.

#### Short introduction to CDO

The overall structure for most operations is:

cdo -operator_last_executed,optional_specifications -operator_first_executed,optional_specifications ifile ofile

e.g. cdo -daymean -selyear,1950,1951 input_file_name output_file_name

The input file (ifile) and the output file (ofile) of one operation must have different names. It is therefore best to give all files that are not intended for further use similar names, e.g. temp_1, temp_2, etc., and to delete them directly afterwards.

CDO overwrites an existing file without asking for confirmation. So make sure that every file is named uniquely and correctly.
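
The chaining order and the two naming pitfalls described above can be sketched in Python. The helper `build_cdo_command` is hypothetical (it is neither part of CDO nor of this notebook). It only assembles the argument list and rejects identical or already existing output names; it does not run CDO itself.

```python
from pathlib import Path


def build_cdo_command(operators, ifile, ofile):
    """Assemble a chained CDO call as an argument list (hypothetical helper).

    `operators` is given in execution order. CDO expects the operator that is
    executed last to be written first, so the order is reversed here.
    """
    if Path(ifile) == Path(ofile):
        raise ValueError('Input and output file must have different names.')
    if Path(ofile).exists():
        # CDO itself would overwrite the file silently, so stop here instead.
        raise FileExistsError(f'{ofile} already exists and would be overwritten.')
    chained = [f'-{operator}' for operator in reversed(operators)]
    return ['cdo', *chained, str(ifile), str(ofile)]


# Select the years first, then compute the daily means (same as the example above).
command = build_cdo_command(['selyear,1950,1951', 'daymean'],
                            'input_file_name', 'output_file_name')
print(' '.join(command))
```

On a system where CDO is installed, the resulting list could be passed to `subprocess.run(command, check=True)`.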

### Start with CDO

The preprocessing starts with CDO since it is much faster than Python for these operations.

#### At first, check the data file's content
This is optional.

In [None]:
# Short overview of the data file's content.
!cdo sinfov /home/my6406/Desktop/Wissenschaftliche_Mitarbeit/Data_Archive/E_OBS/eobs_v23e_daymean_sellonlatbox_3_20_45_60.nc

In [None]:
# Optional: use a more detailed description of the data file's content. It might be wise to use 
# a separate terminal for this command since it prints all available information about the data
# file. Use grib_dump for files in grib-format and
# ncdump for files in netcdf-format.
#! grib_dump ./Data_Archive/E_OBS/eobs_v23e_daymean_sellonlatbox_3_20_45_60.grib
! ncdump ./Data_Archive/E_OBS/eobs_v23e_daymean_sellonlatbox_3_20_45_60.nc

#### Spatial Preprocessing 

In [None]:
# Selection of a gridbox (sellonlatbox,lon1,lon2,lat1,lat2, i.e. western and eastern longitude,
# then southern and northern latitude). Western longitudes have to be given as 360° minus the
# value in °W. If only a single latitude or longitude is to be averaged over, give the desired
# value first and the desired value + 1 second. Otherwise CDO may not work properly.
! cdo sellonlatbox,3,20,45,60 ./E_OBS/eobs_v23e_daymean_sellonlatbox_3_20_45_60.nc temp_1

In [None]:
# Calculation of the areal mean (fldmean) over the desired area chosen above.
! cdo fldmean temp_1 temp_2

#### Temporal Preprocessing

In [None]:
# Selection of certain times, e.g. only the winter months (selmon).
! cdo selmon,1,2,3,4,11,12 temp_2 temp_3

In [None]:
# Remove the lead time from the beginning of the data. 
# Number of days to delete = lead_time.
! cdo delete,day=1,2,3,4,5,6,7,8,9,10,11,12,13,14,month=1,year=1950 temp_3 temp_4
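
Since the number of deleted days equals the lead time, the argument of `delete` can be generated instead of typed by hand. The following is a minimal sketch; the function name `build_delete_argument` is hypothetical, and it assumes that the lead time does not exceed the number of days in the first month of the data.

```python
import calendar


def build_delete_argument(lead_time, month, year):
    """Build the `delete` argument that removes the first `lead_time` days of a month."""
    days_in_month = calendar.monthrange(year, month)[1]
    if not 1 <= lead_time <= days_in_month:
        raise ValueError('lead_time must lie between 1 and the length of the month.')
    days = ','.join(str(day) for day in range(1, lead_time + 1))
    return f'delete,day={days},month={month},year={year}'


# Reproduces the command used above for a lead time of 14 days.
argument = build_delete_argument(lead_time=14, month=1, year=1950)
print(f'cdo {argument} temp_3 temp_4')
```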

In [None]:
# Make sure that the time is sorted correctly (sorttimestamp) and the file is named correctly.
! cdo sorttimestamp temp_4 ./Data_in_Netcdf_Format/eobsv23e_tg_3E_20E_45N_60N_1950_2020_only_Nov_Apr_lead_time_14d.nc

#### Convert from grib-format to netcdf-format

In [None]:
# Convert the grib-file to a netcdf-file if necessary. The Python-scripts are designed to use
# netcdf-files.
#! cdo -f nc copy ofile.grib ofile.nc

#### Remove unnecessary files

In [None]:
# Remove unnecessary files which have been created by CDO since the names of the input files 
# and output files have to be unique.
#! rm temp*

## Continue with Python


Pandas dataframes are used because they give a clear overview of the data. They are then written directly to csv-format for storage, which allows a safe and easy data transfer between the various Jupyter notebooks.

#### Define the path and file names

In [None]:
# Set the needed path and file names.
PATH_defined_functions = './Defined_Functions/'

PATH_data = './Data_in_csv_Format/'
ifile_data = 'eobsv23e_tg_3E_20E_45N_60N_1950_2020_only_Nov_Apr_14d_lead.csv'

PATH_thresholds = './Threshold_Data/'
ifile_thresholds = 'cold_wave_thresholds_Smid_et_al_2019_for_1970_2000.csv'

PATH_output_file = './Data_in_csv_Format/'
file_name_output_file = 'eobsv23e_tg_3E_20E_45N_60N_1950_2020_binary_cold_waves_lead_time_14d.csv'

#### Import the necessary packages and functions

In [None]:
# Import the necessary python packages.
import numpy as np
import pandas as pd
import calendar
from datetime import datetime, timedelta

In [None]:
# Import the necessary defined functions.
import sys
sys.path.insert(1,PATH_defined_functions)
from read_in_csv_data import *
from create_auxiliary_date import *
from truncate_data_by_date import *
from apply_cold_wave_definition_smid_et_al_2019 import *

#### Read in the ground truth temperature data and check the file's content

In [None]:
# Read in the ground truth and remove any unnamed columns as well as the index column.
df_data = read_in_csv_data(PATH_data,ifile_data)
df_data = df_data.loc[:, ~df_data.columns.str.contains('^Unnamed')]
df_data = df_data.drop( ['level_0', 'index'], axis=1)
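
The unnamed columns removed above typically appear when a dataframe is saved with its row index and read back in. The sketch below illustrates this with plain `pd.read_csv` as a stand-in, since `read_in_csv_data` is not shown here; the column names and values are made up for illustration.

```python
import io

import pandas as pd

# Made-up example data, not taken from E-OBS.
df_example = pd.DataFrame({'time': ['1950-01-15', '1950-01-16'],
                           't2m': [271.3, 270.8]})

# Writing with the default index=True stores the row index as a nameless first column.
buffer = io.StringIO()
df_example.to_csv(buffer)
buffer.seek(0)

# Reading the file back turns the nameless column into 'Unnamed: 0'.
df_read = pd.read_csv(buffer)
print(list(df_read.columns))

# The same filter as used above removes it again.
df_clean = df_read.loc[:, ~df_read.columns.str.contains('^Unnamed')]
print(list(df_clean.columns))
```

Saving with `to_csv(..., index=False)` would avoid the extra column in the first place.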

In [None]:
# Show the head of the dataframe.
df_data.head()

In [None]:
# Show the end of the dataframe.
df_data.tail()

In [None]:
# Set the name of the columns containing the time and the variables of the ground truth.
time_column_name = df_data.columns[0]
var_column_name = df_data.columns[1]

#### Read in the temperature thresholds of the cold wave criterion and check the file's content


In [None]:
# Read in the thresholds for the cold wave index and remove any unnamed columns as well as the
# index column.
df_thresholds = read_in_csv_data(PATH_thresholds,ifile_thresholds)
df_thresholds = df_thresholds.loc[:, ~df_thresholds.columns.str.contains('^Unnamed')]
df_thresholds = df_thresholds.drop(['index'], axis =1 )

In [None]:
# Show the head of the dataframe containing the cold wave thresholds.
df_thresholds.head()

In [None]:
# Show the tail of the dataframe containing the thresholds.
df_thresholds.tail()

In [None]:
# Set the name of the columns containing the time and the threshold.
time_column_name_thresholds = df_thresholds.columns[0]
var_column_name_thresholds = df_thresholds.columns[1]

#### Apply the cold wave criterion to the data

In [None]:
# First, two dataframes with the thresholds for the cold wave definition are created: one for
# regular years and one for leap years. To do so, the time column is set as the index of the
# original dataframe and the index entry of 29 February is determined. Then, a new dataframe
# without 29 February is created for regular years. The original dataframe is used for leap
# years.
df_thresholds[time_column_name_thresholds]=pd.to_datetime(df_thresholds[time_column_name_thresholds])
df_thresholds = df_thresholds.set_index(time_column_name_thresholds)
index_of_february_29 = df_thresholds[((df_thresholds.index.month == 2) & (df_thresholds.index.day == 29))].index
df_thresholds_without_29_feb = df_thresholds.drop(index_of_february_29)
df_thresholds = df_thresholds.reset_index()
df_thresholds_without_29_feb = df_thresholds_without_29_feb.reset_index()
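
The effect of removing 29 February can be checked on a small stand-alone example. The threshold values below are made up, and the winter 2019/20 is chosen only because it contains a 29 February. The sketch uses a boolean mask instead of `drop`, which gives the same result.

```python
import numpy as np
import pandas as pd

# Hypothetical thresholds: one value per day of a winter that contains a 29 February.
days = pd.date_range('2019-11-01', '2020-04-30', freq='D')
df_example = pd.DataFrame({'threshold': np.linspace(-5.0, 5.0, len(days))}, index=days)

# Mark 29 February and keep every other day.
is_february_29 = (df_example.index.month == 2) & (df_example.index.day == 29)
df_example_without_29_feb = df_example[~is_february_29]

# A leap-year winter (Nov-Apr) has 182 days, a regular one 181.
print(len(df_example), len(df_example_without_29_feb))
```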

In [None]:
# An array with the start years of all complete winters in the training period is created
# (1950 to 2019, since np.arange excludes the end value).
start_years_of_winter = np.arange(1950, 2020)
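
Each start year stands for a winter that runs from 1 November of that year to 30 April of the following year, so a 29 February can only occur in the year after the start year. A minimal sketch of this bookkeeping, with the hypothetical helper `winter_window`:

```python
import calendar
from datetime import datetime


def winter_window(start_year):
    """Return start date, end date and whether the winter contains a 29 February."""
    start = datetime(start_year, 11, 1)
    end = datetime(start_year + 1, 4, 30)
    # February belongs to the year after the start year of the winter.
    contains_29_feb = calendar.isleap(start_year + 1)
    return start, end, contains_29_feb


for start_year in range(1950, 1954):
    start, end, contains_29_feb = winter_window(start_year)
    number_of_days = (end - start).days + 1
    print(start_year, number_of_days, contains_29_feb)
```

In this example only the winter starting in 1951 contains a 29 February (in 1952) and therefore has 182 instead of 181 days.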

In [None]:
# In case the ground truth timeseries does not start at the beginning of a winter, the part of
# the timeseries before the start of the first complete winter is checked for cold waves first.
# To do so, the respective part of the ground truth and the matching part of the cold wave
# thresholds are extracted. Then, the cold wave definition by Smid et al. (2019) is applied and
# the binary classification of whether a cold wave occurred on a specific date ('1') or not
# ('0') is saved to a list. The dates are saved to a list as well.
dates = []

all_winters_list_cold_waves_data = []

start_winter = datetime(1950, 2, 1)
end_winter = datetime(1950, 4, 30)

df_thresholds_start_training_data = truncate_data_by_date(df_thresholds, time_column_name_thresholds, start_winter.strftime('%Y_%m_%d'), end_winter.strftime('%Y_%m_%d')) 
threshold_cold_waves = df_thresholds_start_training_data[var_column_name_thresholds]


df_data_respective_winter = truncate_data_by_date(df_data, time_column_name, start_winter.strftime('%Y_%m_%d'), end_winter.strftime('%Y_%m_%d')) 

df_data_binned = pd.DataFrame()
df_data_binned['time'] = df_data_respective_winter[time_column_name]
list_cold_waves_data = apply_cold_wave_definition_smid_et_al_2019(df_data_binned, df_data_respective_winter, var_column_name, threshold_cold_waves)
          
all_winters_list_cold_waves_data.extend(list_cold_waves_data)
dates.extend(pd.to_datetime(df_data_respective_winter[time_column_name]))
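
The function `truncate_data_by_date` is imported from the defined functions and not shown in this notebook. The following sketch only illustrates what such a date truncation might look like. The name `truncate_by_date_sketch`, the inclusive start and end dates and the reset of the index are assumptions, not the actual implementation.

```python
import pandas as pd


def truncate_by_date_sketch(df, time_column, start_date, end_date, date_format='%Y_%m_%d'):
    """Keep the rows whose time lies between start_date and end_date (both inclusive)."""
    times = pd.to_datetime(df[time_column])
    start = pd.to_datetime(start_date, format=date_format)
    end = pd.to_datetime(end_date, format=date_format)
    mask = (times >= start) & (times <= end)
    return df.loc[mask].reset_index(drop=True)


# Made-up example data covering five days.
df_example = pd.DataFrame({
    'time': pd.date_range('1950-01-30', '1950-02-03').strftime('%Y-%m-%d'),
    't2m': [271.0, 270.5, 269.8, 270.1, 272.3],
})

df_truncated = truncate_by_date_sketch(df_example, 'time', '1950_02_01', '1950_02_02')
print(df_truncated)
```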

In [None]:
# Now, the same is done for every complete winter in the ground truth timeseries. The cold wave
# thresholds are taken depending on whether it is a leap year or not, meaning that the 
# 29 February is included in the threshold data or not.
for start_year_of_winter in start_years_of_winter:
    
    if calendar.isleap(start_year_of_winter+1):
        threshold_cold_waves = df_thresholds[var_column_name_thresholds]
    else:
        threshold_cold_waves = df_thresholds_without_29_feb[var_column_name_thresholds]
    
    start_winter = datetime(start_year_of_winter, 11, 1)
    end_winter = datetime(start_year_of_winter+1, 4, 30)

    df_data_respective_winter = truncate_data_by_date(df_data, time_column_name, start_winter.strftime('%Y_%m_%d'), end_winter.strftime('%Y_%m_%d')) 

    df_data_binned = pd.DataFrame()
    df_data_binned['time'] = df_data_respective_winter[time_column_name]
    list_cold_waves_data = apply_cold_wave_definition_smid_et_al_2019(df_data_binned, df_data_respective_winter, var_column_name, threshold_cold_waves)
          
    all_winters_list_cold_waves_data.extend(list_cold_waves_data)

    dates.extend(pd.to_datetime(df_data_respective_winter[time_column_name]))
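
The function `apply_cold_wave_definition_smid_et_al_2019` is imported and not shown here, so the exact criterion should be taken from that function and from Smid et al. (2019). As a rough illustration only, the sketch below assumes that a cold wave is a run of at least three consecutive days whose temperature lies below the day-specific threshold. The function name, the signature, the minimum duration and the strict inequality are all assumptions.

```python
import numpy as np


def flag_cold_wave_days(temperatures, thresholds, min_duration=3):
    """Return 1 for days inside a sufficiently long run below the threshold, else 0."""
    temperatures = np.asarray(temperatures, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    if temperatures.shape != thresholds.shape:
        raise ValueError('Temperatures and thresholds must have the same length.')

    below = temperatures < thresholds
    flags = np.zeros(len(below), dtype=int)
    run_start = None
    # A trailing False closes a run that lasts until the end of the timeseries.
    for i, is_below in enumerate(np.append(below, False)):
        if is_below and run_start is None:
            run_start = i
        elif not is_below and run_start is not None:
            if i - run_start >= min_duration:
                flags[run_start:i] = 1
            run_start = None
    return flags.tolist()


# Made-up example: one run of three days and one run of two days below -2.0.
temperatures = [1.0, -3.0, -4.0, -5.0, 0.5, -6.0, -6.5, 2.0]
thresholds = [-2.0] * len(temperatures)
print(flag_cold_wave_days(temperatures, thresholds))
```

Only the three-day run is flagged; the two-day run is too short under the assumed minimum duration.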

In [None]:
# As a last step, the procedure is repeated for the remainder of the ground truth timeseries,
# which no longer covers a whole winter.
start_winter = datetime(2020, 11, 1)
end_winter = datetime(2020, 12, 28)

df_thresholds_end_training_data = truncate_data_by_date(df_thresholds, time_column_name_thresholds, start_winter.strftime('%Y_%m_%d'), end_winter.strftime('%Y_%m_%d')) 
threshold_cold_waves = df_thresholds_end_training_data[var_column_name_thresholds]


df_data_respective_winter = truncate_data_by_date(df_data, time_column_name, start_winter.strftime('%Y_%m_%d'), end_winter.strftime('%Y_%m_%d')) 

df_data_binned = pd.DataFrame()
df_data_binned['time'] = df_data_respective_winter[time_column_name]
list_cold_waves_data = apply_cold_wave_definition_smid_et_al_2019(df_data_binned, df_data_respective_winter, var_column_name, threshold_cold_waves)
          
all_winters_list_cold_waves_data.extend(list_cold_waves_data)

dates.extend(pd.to_datetime(df_data_respective_winter[time_column_name]))

In [None]:
# Now, a new dataframe is created containing the dates of the ground truth and the binary cold
# wave index.
df_data_binary_cold_waves = pd.DataFrame()
df_data_binary_cold_waves['time'] = dates
df_data_binary_cold_waves['Cold_Wave'] = all_winters_list_cold_waves_data

#### Plausibility check if the application of the cold wave criterion worked

In [None]:
# Here it is checked whether all dates in the ground truth timeseries have been checked for a
# cold wave occurrence.
if len(df_data_binary_cold_waves['Cold_Wave']) == len(df_data[var_column_name]):
    print('All data has been checked for cold waves.')
else:
    print('Not all data has been checked for cold waves!')
    print('Days to be checked: '+str(len(df_data[var_column_name])))
    print('Days actually checked: '+str(len(all_winters_list_cold_waves_data)))
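
If the two lengths differ, it can help to see which dates were skipped. The sketch below compares two date lists and returns the dates that were not checked. The helper `find_unchecked_dates` and the example dates are hypothetical.

```python
import pandas as pd


def find_unchecked_dates(all_dates, checked_dates):
    """Return the dates from all_dates that do not occur in checked_dates."""
    all_dates = pd.to_datetime(pd.Series(all_dates))
    checked_dates = pd.to_datetime(pd.Series(checked_dates))
    missing = all_dates[~all_dates.isin(checked_dates)]
    return missing.dt.strftime('%Y-%m-%d').tolist()


# Made-up example: the data start on 15 January, but only February was checked.
all_dates = pd.date_range('1950-01-15', '1950-02-03').strftime('%Y-%m-%d')
checked_dates = pd.date_range('1950-02-01', '1950-02-03')

unchecked = find_unchecked_dates(all_dates, checked_dates)
print(len(unchecked), unchecked[0], unchecked[-1])
```

In the notebook, the two inputs would correspond to `df_data[time_column_name]` and `dates`.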

In [None]:
# To check whether the cold wave definition has been applied correctly, it is checked whether a
# reasonable number of cold wave days has been detected in the ground truth timeseries.
binary_list_of_cold_waves = df_data_binary_cold_waves['Cold_Wave']
# Count the days flagged with 1 in a single vectorized step.
days_with_cold_waves = int((binary_list_of_cold_waves == 1).sum())

print('There are '+str(days_with_cold_waves)+' days with cold waves from a total of '+str(len(binary_list_of_cold_waves))+' days.')
print('This means that '+str(days_with_cold_waves/len(binary_list_of_cold_waves)*100)+'% of the winter days are cold wave days.')

#### Double-check the representation of the data

In [None]:
# Check if everything is sorted, renamed or removed correctly.
df_data_binary_cold_waves.head()

In [None]:
# Also check if everything is sorted, renamed or removed correctly at the end of the
# dataframe.
df_data_binary_cold_waves.tail()

#### Save the binary ground truth cold wave data

In [None]:
# Save the pandas dataframe in csv-format.
df_data_binary_cold_waves.to_csv(PATH_output_file+file_name_output_file)

In [None]:
# End of Program