# Creating the Continuous Ground Truth Temperature
Version 17 January 2024, Selina Kiefer

### Input: netcdf- or grib-files
1. netcdf- or grib-file with ground truth temperature (i.e. E-OBS V23.1e, tg, daily mean, 1950 - 2020, Nov-Apr, 3-20°E and 45-60°N, e.g. from https://www.ecad.eu/download/ensembles/download.php )
2. netcdf- or grib-file with elevation data (i.e. E-OBS V23.1e, elevation, 3-20°E and 45-60°N, e.g. from https://www.ecad.eu/download/ensembles/download.php )

### Output: csv-file
Continuous time series of the ground truth temperature in csv-format: 7-day running mean, averaged over 3-20°E and 45-60°N (only land grid points that are at least one grid point away from the coast and have an elevation < 800 m), adapted to the desired lead time of the model intended for forecasting.

## Used software: Climate Data Operators and Python

#### Climate Data Operators (CDO) 

Tailored open-source software to perform the most common meteorological operations efficiently (and much faster than Python).

Up-to-date information about CDO: https://code.mpimet.mpg.de/projects/cdo

Reference: Schulzweida, U. (2019): "CDO User Guide". Available at: https://doi.org/10.5281/ZENODO.3539275.

#### Short introduction to CDO

The overall structure for most operations is:

cdo -operator_last_executed,optional_specifications -operator_first_executed,optional_specifications ifile ofile

e.g. cdo -daymean -selyear,1950,1951 input_file_name output_file_name

The input file (ifile) and the output file (ofile) of an operation must have different names. It is therefore best to give all intermediate files that are not needed later similar names, e.g. temp_1, temp_2, etc., and to delete them directly afterwards.

CDO overwrites an existing file without asking for confirmation. So make sure that everything is named uniquely and correctly.
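
The operator ordering described above (the operator written last is executed first) can be sketched in Python. The helper `build_cdo_command` is hypothetical and only assembles the command string; it is not part of CDO or of this notebook's defined functions.

```python
def build_cdo_command(operators_in_execution_order, ifile, ofile):
    """Return a chained CDO call as a string (hypothetical helper)."""
    if ifile == ofile:
        raise ValueError("ifile and ofile must have different names")
    # CDO expects the operator that is executed first to be written last,
    # so the list is reversed before chaining.
    chained = " ".join("-" + op for op in reversed(operators_in_execution_order))
    return f"cdo {chained} {ifile} {ofile}"


# First select the years, then compute the daily mean.
command = build_cdo_command(
    ["selyear,1950,1951", "daymean"], "input_file_name", "output_file_name"
)
print(command)
```

The printed string has the same form as the example given above and could be passed to a shell cell.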

### Start with CDO

CDO is used first since it is much faster than Python.

#### At first, check the data files' content 
This is optional.

In [None]:
# Short overview of the temperature data file's content.
!cdo sinfov ./E_OBS/eobs_tg_mean_v23.1e.nc

In [None]:
# Short overview of the elevation data file's content.
!cdo sinfov ./E_OBS/eobs_v23e_surface_elevation.nc

### Elevation Data

In [None]:
# Select the correct longitude-latitude box (sellonlatbox,°W,°E,°S,°N) for the elevation data. Western longitudes
# have to be given as 360°-°W. In case there is only 1 latitude or longitude to average over, select the desired
# longitude/latitude and, in the second position, the desired longitude/latitude+1. Otherwise,
# CDO may not perform well.
! cdo sellonlatbox,3,20,45,60 ./E_OBS/eobs_v23e_surface_elevation.nc ./Data_in_Netcdf_Format/eobsv23e_elevation_sellonlatbox.nc

### Temperature Data

#### Spatial Preprocessing 

In [None]:
# Selection of a gridbox (sellonlatbox,°W,°E,°S,°N). Western longitudes have to be given as
# 360°-°W. In case there is only 1 latitude or longitude to average over, select the desired
# longitude/latitude and, in the second position, the desired longitude/latitude+1. Otherwise,
# CDO may not perform well.
! cdo sellonlatbox,3,20,45,60 ./E_OBS/eobs_tg_mean_v23.1e.nc temp_1
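
The longitude convention from the comment above (western longitudes given as 360°-°W) can be sketched as follows. Both function names are hypothetical helpers that only build the operator string; the western box is an illustrative example and is not used in this notebook.

```python
def west_to_cdo_longitude(degrees_west):
    """Convert a longitude in degrees west to the 0-360 degrees east convention."""
    return 360.0 - degrees_west


def sellonlatbox_operator(lon_start, lon_end, lat_south, lat_north):
    """Build the sellonlatbox operator string (hypothetical helper)."""
    return f"sellonlatbox,{lon_start:g},{lon_end:g},{lat_south:g},{lat_north:g}"


# The box used in this notebook: 3-20°E and 45-60°N.
box_used_here = sellonlatbox_operator(3, 20, 45, 60)

# An illustrative box from 10°W to 5°W (not used in this notebook).
box_in_the_west = sellonlatbox_operator(
    west_to_cdo_longitude(10), west_to_cdo_longitude(5), 45, 60
)
print(box_used_here)
print(box_in_the_west)
```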

#### Temporal Preprocessing

In [None]:
# Selection of certain times, e.g. only the winter months (selmon).
! cdo selmon,1,2,3,4,11,12 temp_1 temp_3 

In [None]:
# Remove the lead time from the beginning of the data. The number of days to delete equals
# the lead time (here: 28 days, i.e. 1-28 January 1950).
! cdo delete,day=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,month=1,year=1950 temp_3 temp_4
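
For other lead times, the day list of the delete operator can be generated instead of typed by hand. This is a minimal sketch; `build_delete_operator` is a hypothetical helper, and it assumes that the lead time fits into the first month of the data (at most 31 days for January).

```python
def build_delete_operator(lead_time, month=1, year=1950):
    """Build the CDO delete operator string for the first lead_time days of a month."""
    if not 1 <= lead_time <= 31:
        raise ValueError("this sketch only handles lead times of 1 to 31 days")
    days = ",".join(str(day) for day in range(1, lead_time + 1))
    return f"delete,day={days},month={month},year={year}"


operator_28d = build_delete_operator(28)
print(operator_28d)
```

For a lead time of 28 days, the result matches the operator used in the cell above.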

In [None]:
# Make sure that the time is sorted correctly (sorttimestamp) and the file is named correctly.
! cdo sorttimestamp temp_4 ./Data_in_Netcdf_Format/eobsv23e_tg_3E_20E_45N_60N_1950_2020_only_Nov_Apr_28d_lead.nc

#### Convert from grib-format to netcdf-format

In [None]:
# Convert the grib-file to a netcdf-file if necessary. The Python-scripts are designed to use
# netcdf-files.
#! cdo -f nc copy ofile.grib ofile.nc

#### Remove unnecessary files

In [None]:
# Remove unnecessary files which have been created by CDO.
! rm temp*

## Continue with Python


For a nice overview of the data, pandas dataframes are used. These are then converted directly into csv-format for storage which ensures a safe and easy data transfer between various jupyter notebooks.

#### Define the paths' and files' names 

In [None]:
# Set the needed path and file names.
PATH_defined_functions = './Defined_Functions/'

PATH_data = './Data_in_Netcdf_Format/'
ifile_data = 'eobsv23e_tg_3E_20E_45N_60N_1950_2020_only_Nov_Apr_28d_lead.nc'

PATH_mask = './Data_in_Netcdf_Format/'
ifile_mask = 'eobsv23e_elevation_sellonlatbox.nc'

PATH_output_file = './Data_in_csv_Format/'
file_name_output_file = 'eobsv23e_tg_3E_20E_45N_60N_1950_2020_only_Nov_Apr_28d_lead.csv'

PATH_plots = './Plots/' 

#### Import the necessary packages and functions

In [None]:
# Import the necessary python packages.
import numpy as np
import pandas as pd
from skimage.morphology import erosion
import cartopy.crs as ccrs
import cartopy
import cartopy.feature as cfeature
import matplotlib.pyplot as plt

In [None]:
# Import the necessary functions.
import sys
sys.path.insert(1,PATH_defined_functions)
from read_in_netcdf_data import *
from read_in_csv_data import *

#### Read in the data and check the file's content

In [None]:
# Read in the data and show its header.
df_data = read_in_netcdf_data(PATH_data, ifile_data)
df_data.head()

In [None]:
# Show the end of the dataframe.
df_data.tail()

In [None]:
# Read in the mask (aka the elevation data) and show its header.
df_mask = read_in_netcdf_data(PATH_mask, ifile_mask)
df_mask.head()

In [None]:
# Also show the end of the dataframe.
df_mask.tail()

#### Create a mask to exclude high mountains and coastal areas from the ground truth data

In [None]:
# For convenience, convert the relevant data (longitude, latitude and elevation) into separate numpy arrays.
df_mask_lon = np.array(df_mask['longitude'])
df_mask_lat = np.array(df_mask['latitude'])
df_mask_elevation = np.array(df_mask['elevation'])

In [None]:
# In order to reshape the elevation data into a 2d representation, find out the number of distinct longitudes and
# latitudes. To do so, use numpy's unique(), which returns the unique values of an array, and count them with len().
number_longitudes = len(np.unique(df_mask_lon))
number_latitudes = len(np.unique(df_mask_lat))

In [None]:
# Reshape the data into 2d fields.
df_mask_elevation = np.reshape(df_mask_elevation, (number_longitudes, number_latitudes))
df_mask_lon = np.reshape(df_mask_lon, (number_longitudes, number_latitudes))
df_mask_lat = np.reshape(df_mask_lat, (number_longitudes, number_latitudes))
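
Whether (number_longitudes, number_latitudes) is the right target shape depends on the row order that read_in_netcdf_data produces, which is not shown here. The following sketch checks this on a synthetic grid; `infer_grid_shape` is a hypothetical helper and the toy coordinates are assumptions for illustration only.

```python
import numpy as np


def infer_grid_shape(lon, lat):
    """Return the 2d shape that matches the row order of flattened coordinates."""
    lon = np.asarray(lon)
    lat = np.asarray(lat)
    n_lon = len(np.unique(lon))
    n_lat = len(np.unique(lat))
    if lon.size != n_lon * n_lat or lon.size < 2:
        raise ValueError("coordinates do not form a complete rectangular grid")
    # If the longitude changes between the first two rows, longitude is the
    # fastest-varying coordinate, so each 2d row holds one latitude.
    if lon[0] != lon[1]:
        return (n_lat, n_lon)
    return (n_lon, n_lat)


# Synthetic grid with 3 longitudes and 2 latitudes, longitude varying fastest.
lon_2d, lat_2d = np.meshgrid([3.0, 3.25, 3.5], [45.0, 45.25])
lon_flat, lat_flat = lon_2d.ravel(), lat_2d.ravel()

shape = infer_grid_shape(lon_flat, lat_flat)
lat_reshaped = np.reshape(lat_flat, shape)
print(shape)
```

If the shape is chosen correctly, one coordinate is constant along each row of the reshaped field; otherwise neighbouring array elements are not neighbouring grid points and a later neighbourhood filter acts on the wrong points.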

In [None]:
# Plot the data to get a first overview of the different elevations.
plt.figure()
ax = plt.axes(projection=ccrs.PlateCarree())
ax.set_extent((2, 21, 44, 61))
ax.coastlines(resolution='50m')
bodr = cartopy.feature.NaturalEarthFeature(category='cultural', 
    name='admin_0_boundary_lines_land', scale='50m', facecolor='none', alpha=0.7)
ax.add_feature(bodr, linestyle='--', edgecolor='r', alpha=1)
plt.scatter(df_mask_lon, df_mask_lat, c=df_mask_elevation, cmap='copper_r')
plt.colorbar(label='Elevation in m')
plt.title('E-OBS Elevation, Resolution 0.25°')
plt.savefig(PATH_plots+'E_OBS_Elevation_Data.png', bbox_inches='tight')
plt.show()

In [None]:
# In a next step, create a new binary elevation array in which all non-NaN values are replaced by 1 and all NaN
# values (aka the sea grid points) are replaced by 0. A copy is used so that the original elevation array is not
# overwritten.
df_mask_elevation_binary = df_mask_elevation.copy()

here_are_NaNs = np.isnan(df_mask_elevation)
df_mask_elevation_binary[~here_are_NaNs] = 1
df_mask_elevation_binary[here_are_NaNs] = 0
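
A point to keep in mind for this step: assigning a numpy array to a new name does not copy it, so in-place changes through the new name also change the original array. The sketch below shows this on toy values (assumed for illustration) and an alternative with np.where, which always returns a new array.

```python
import numpy as np

# Toy elevation field: NaN marks sea grid points.
elevation = np.array([[120.0, np.nan], [850.0, 35.0]])

# Plain assignment creates a second name for the same array, not a copy.
alias = elevation
same_object = alias is elevation

# np.where builds a new array and leaves the input untouched:
# 0 where the elevation is NaN (sea), 1 elsewhere (land).
land_sea_mask = np.where(np.isnan(elevation), 0.0, 1.0)

print(same_object)
print(land_sea_mask)
```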

In [None]:
# In order to exclude the coastal areas from the land, every land grid-point directly next to a sea grid-point should
# be set to 0 (instead of 1). Therefore, a filter in the shape of a cross made out of ones is created.
cross=np.zeros((3,3))
cross[1,:]=1
cross[:,1]=1
cross

In [None]:
# Then, this filter is applied to the binary land-sea mask using scikit-image's erosion function.
eroded_mask = erosion(df_mask_elevation_binary, cross)
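
What the erosion with a cross-shaped filter does can be illustrated with a numpy-only sketch on a toy land-sea mask. This is not the scikit-image implementation: `erode_with_cross` is a hypothetical helper, and treating everything outside the array as land is an assumption of this sketch that may differ from the library's border handling.

```python
import numpy as np


def erode_with_cross(mask):
    """Keep a grid point only if it and its 4 direct neighbours are all 1."""
    mask = np.asarray(mask)
    # Assumption: points outside the domain count as land (padding with 1).
    padded = np.pad(mask, 1, mode="constant", constant_values=1)
    centre = padded[1:-1, 1:-1]
    up = padded[:-2, 1:-1]
    down = padded[2:, 1:-1]
    left = padded[1:-1, :-2]
    right = padded[1:-1, 2:]
    # The minimum over the cross is 0 as soon as one of the five points is sea.
    return np.minimum.reduce([centre, up, down, left, right])


# Toy mask: a 3x3 island (1) surrounded by sea (0).
toy_mask = np.array([
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
])
toy_eroded = erode_with_cross(toy_mask)
print(toy_eroded)
```

Only the central point of the island survives, since every other land point has at least one sea point as a direct neighbour.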

In [None]:
# Now, the eroded data is reshaped to the original shape of the dataframe again. As a sanity check, also the
# longitudes and latitudes are reshaped into their original shape again.
eroded_mask_reshaped = np.reshape(eroded_mask, (number_longitudes*number_latitudes, 1))
lon_mask_reshaped = np.reshape(df_mask_lon, (number_longitudes*number_latitudes, 1))
lat_mask_reshaped = np.reshape(df_mask_lat, (number_longitudes*number_latitudes,1))

In [None]:
# The reshaped arrays are added to the original dataframe containing the elevation data.
df_mask['lon_reshaped'] = lon_mask_reshaped
df_mask['lat_reshaped'] = lat_mask_reshaped
df_mask['eroded_mask'] = eroded_mask_reshaped

In [None]:
# To exclude not only the coastal areas from the mask but also the high mountains, only grid-points with an elevation
# below 800m are kept (.where()). Then, the NaNs are replaced by 0 to be consistent with the binary elevation mask.
df_mask['eroded_mask'] = df_mask['eroded_mask'].where(df_mask['elevation']<800)
df_mask['eroded_mask'] = df_mask['eroded_mask'].fillna(0)
df_mask = df_mask.drop(['lon_reshaped', 'lat_reshaped'], axis=1)
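
The combination of .where() and .fillna() used above can be followed on a toy dataframe. The values are assumptions for illustration only; the column names match those of the mask dataframe.

```python
import numpy as np
import pandas as pd

# Toy mask: lowland, high mountain, sea, and a lowland coastal point.
df_toy = pd.DataFrame({
    "elevation": [100.0, 900.0, np.nan, 500.0],
    "eroded_mask": [1.0, 1.0, 0.0, 0.0],
})

# .where() keeps values where the condition is True and sets the rest to NaN.
# A comparison with NaN (sea) is False, so sea points become NaN as well.
df_toy["eroded_mask"] = df_toy["eroded_mask"].where(df_toy["elevation"] < 800)
df_toy["eroded_mask"] = df_toy["eroded_mask"].fillna(0)

print(df_toy)
```

Only the first point stays in the mask: the second is too high, the third is sea and the fourth was already removed as a coastal point.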

In [None]:
# Save the mask as csv-data.
df_mask.to_csv(PATH_output_file+'Mask_for_Defining_Central_Europe.csv')

In [None]:
# For a nice plot showing all remaining elevations, the column containing the elevation data is multiplied with
# the column containing the mask. All grid-points which are kept, keep the same value as before, the others are set
# to 0 by the multiplication. For a nicer plot, the zeros are replaced by NaNs.
df_mask['elevation_mask'] = df_mask['elevation']*df_mask['eroded_mask']
df_mask['elevation_mask'] = df_mask['elevation_mask'].replace(0, np.nan)

In [None]:
# The remaining grid-points are plotted together with their elevation.
plt.figure()
ax = plt.axes(projection=ccrs.PlateCarree())
ax.set_extent((2, 21, 44, 61))
ax.coastlines(resolution='50m')
bodr = cartopy.feature.NaturalEarthFeature(category='cultural', 
    name='admin_0_boundary_lines_land', scale='50m', facecolor='none', alpha=0.7)
ax.add_feature(bodr, linestyle='--', edgecolor='r', alpha=1)
plt.scatter(df_mask['longitude'], df_mask['latitude'], c=df_mask['elevation_mask'], alpha=1, cmap='copper_r')
plt.colorbar(label='Elevation in m')
plt.title('Elevation of Grid Points in E-OBS Dataset')
plt.savefig(PATH_plots+'Masked_E_OBS_Elevation_Data.png', bbox_inches='tight')
plt.show()

#### Apply the mask to the data

In [None]:
# Since the mask is binary and the temperature is given in °C, zeros are ambiguous: they can be a valid temperature
# (0 °C) or a masked value. Therefore, the temperature data is converted to Kelvin. Since we look at surface air
# temperatures, 0 K is not plausible, which resolves the ambiguity with the binary mask.
data = np.array(df_data['tg']) 
data = data + 273.15
df_data['tg'] = data

In [None]:
# Because the mask is static in time, it needs to be applied to every day separately. Therefore, the number of days
# has to be determined first.
number_of_days = int(len(df_data['tg'])/len(df_mask['eroded_mask']))

In [None]:
# Then, the mask is repeated accordingly (once per day).
mask_repeated = np.tile(np.array(df_mask['eroded_mask']), number_of_days)

In [None]:
# The repeated mask is added to the dataframe containing the temperatures.
df_data['eroded_mask'] = mask_repeated
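
Repeating the mask relies on the temperature dataframe listing the grid points of every day in exactly the same order as the mask dataframe. An alternative that does not depend on the row order is a merge on the coordinates. This is a sketch of a different technique on toy data, not the approach used in this notebook; it assumes that the coordinate values in both dataframes are exactly equal.

```python
import pandas as pd

# Toy mask for two grid points.
df_mask_toy = pd.DataFrame({
    "longitude": [3.0, 3.25],
    "latitude": [45.0, 45.0],
    "eroded_mask": [1.0, 0.0],
})

# Toy temperatures for two days; the first day is deliberately not in mask order.
df_data_toy = pd.DataFrame({
    "time": ["1950-01-29", "1950-01-29", "1950-01-30", "1950-01-30"],
    "longitude": [3.25, 3.0, 3.0, 3.25],
    "latitude": [45.0, 45.0, 45.0, 45.0],
    "tg": [1.5, 2.5, 3.5, 4.5],
})

# A left merge keeps the row order of the temperature data and attaches the
# mask value that belongs to each coordinate pair.
df_merged = df_data_toy.merge(
    df_mask_toy, on=["longitude", "latitude"], how="left", validate="many_to_one"
)
print(df_merged)
```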

In [None]:
# In a next step, the mask is applied to the data by multiplication. Again, the valid values keep their value and the
# values which are masked are set to 0 by the multiplication with the binary mask. Since in the next step an areal 
# mean will be calculated, the zeros are set to NaNs.
df_data['mask_applied_to_tg'] = df_data['tg']*df_data['eroded_mask']
df_data['mask_applied_to_tg'] = df_data['mask_applied_to_tg'].replace(0, np.nan)
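
Why the zeros matter can be seen on toy values: with temperatures in °C, masking by multiplication also removes a valid 0 °C value. The sketch below compares this with pandas' .where(), an alternative technique that keeps or discards values by condition and therefore does not depend on the unit. The toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy temperatures in °C and a binary mask (1 = keep, 0 = discard).
tg_celsius = pd.Series([0.0, 5.0, -3.0])
mask = pd.Series([1.0, 0.0, 1.0])

# Masking by multiplication: the valid 0 °C value is lost together with the
# masked value, because both end up as 0 before the replacement.
by_multiplication = (tg_celsius * mask).replace(0, np.nan)

# Masking by condition: only the point with mask == 0 becomes NaN.
by_where = tg_celsius.where(mask == 1)

print(by_multiplication.tolist())
print(by_where.tolist())
```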

#### Calculate the areal mean of the ground truth data

In [None]:
# In a first step, all rows containing NaNs are dropped. Then, the areal mean is calculated for every day.
df_data = df_data.dropna()
df_data_mean = df_data.groupby(df_data['time']).mean()
df_data_mean = df_data_mean.reset_index()
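
The areal mean per day can be followed on a toy dataframe with two days and two grid points each. The values are assumptions for illustration; note that this is an unweighted mean over the remaining grid points, as in the cell above.

```python
import numpy as np
import pandas as pd

df_toy = pd.DataFrame({
    "time": pd.to_datetime(["1950-01-29", "1950-01-29", "1950-01-30", "1950-01-30"]),
    "tg": [271.0, 273.0, 275.0, np.nan],  # NaN marks a masked grid point
})

# Drop the masked grid points, then average all remaining points of each day.
df_toy_mean = df_toy.dropna().groupby("time").mean().reset_index()
print(df_toy_mean)
```

The first day is the mean of two grid points, the second day is based on the single grid point that is left after masking.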

#### Apply a 7-day running mean for temporal aggregation of the data

In [None]:
# Use a 7-day rolling mean for temporal aggregation.
df_data_mean['tg_rolling_mean'] = df_data_mean['tg'].rolling(window=7, center=True).mean()
df_data_mean = df_data_mean.reset_index()
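
The effect of the centred 7-day window can be seen on a short toy series: the first three and the last three days have no complete window and are therefore NaN. The values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Ten toy daily values: 0, 1, ..., 9.
daily = pd.Series(np.arange(10, dtype=float))

# Centred window: the value at day i is the mean of days i-3 to i+3.
smoothed = daily.rolling(window=7, center=True).mean()

number_of_nans = int(smoothed.isna().sum())
print(smoothed.tolist())
print(number_of_nans)
```

These NaN values at both ends of the time series are the reason why rows with NaN are removed in a later step.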

In [None]:
# Drop the unsmoothed daily column and rename the column with the rolling mean to 'tg' to avoid confusion.
df_data_mean = df_data_mean.drop(['tg'], axis=1)
df_data_mean = df_data_mean.rename(columns={'tg_rolling_mean':'tg'})

#### Remove any rows containing NaN-Values since the used ML-models cannot handle NaN values

In [None]:
# Remove any rows containing NaN-values (dropna() drops rows by default).
df_data_mean = df_data_mean.dropna()

#### Create a minimal, useful representation of the data

In [None]:
# Remove any unnecessary columns here, e.g. the latitude and longitude for areal means.
df_data_mean = df_data_mean.drop(['longitude', 'latitude', 'eroded_mask', 'mask_applied_to_tg'], axis=1)

#### Double-check the representation of the data

In [None]:
# Check if everything is sorted, renamed or removed correctly. Note:
# Although the data is displayed with wrong extra precision, it is saved correctly in
# csv-format later. 
df_data_mean.head()

In [None]:
# Also check if everything is sorted, renamed or removed correctly at the end of the
# dataframe.
df_data_mean.tail()

#### Save the ground truth data

In [None]:
# Save the pandas dataframe in csv-format.
df_data_mean.to_csv(PATH_output_file+file_name_output_file)
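
As an optional check, the csv-data can be read back in to confirm that the columns and values survive the round trip. The sketch below uses an in-memory buffer and toy values instead of the real output file; both are assumptions for illustration.

```python
import io

import pandas as pd

df_toy = pd.DataFrame({
    "time": pd.to_datetime(["1950-01-29", "1950-01-30"]),
    "tg": [272.0, 275.0],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df_toy.to_csv(buffer)
buffer.seek(0)

# to_csv() writes the index as the first column, so it is read back as index.
df_back = pd.read_csv(buffer, index_col=0, parse_dates=["time"])
print(df_back)
```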

In [None]:
# End of Program