# Skill assessment (CRPS calculations)

Because seasonal forecasts consist of ensemble members, simple deterministic evaluation metrics such as RMSE (Root Mean Squared Error) and R² are of limited use: they compare a single value with the observation and ignore the spread of the ensemble.
Therefore, this code uses the CRPS (Continuous Ranked Probability Score) and the CRPSS (Continuous Ranked Probability Skill Score) to measure the skill of ensemble datasets.
The code calculates both scores automatically.

### 1. Import libraries
First, import the necessary libraries and tools (🚨 to run the code in a box like the one below, place the mouse pointer in the cell, then click the “run cell” button above or press Shift + Enter).

In [1]:
import os, re
import pandas as pd
from pandas import Series, DataFrame
import numpy as np

### 2. Simulation settings

In [9]:
forecast_center = 'ECMWF'

# Assign working directory and time series data
path = os.getcwd()

# Input simulation information
catchment_name = 'A'
start_year = 2011
start_month = 1
start_day = 1
start_date = str(start_month).zfill(2) + '/' + str(start_day).zfill(2) + '/' + str(start_year)
end_year = 2020
end_month = 12
end_day = 31
end_date = str(end_month).zfill(2) + '/' + str(end_day).zfill(2) + '/' + str(end_year)

### 3. Rearrange Datasets

CRPS and CRPSS are calculated separately for each lead time. Therefore, the forecasts must first be collected into a single file per lead time. The code below gathers all data that share the same lead time.
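
To illustrate the accumulation step used in the cells below, here is a minimal sketch on a toy table. The column names `ens1` and `obs` and all values are hypothetical; only the `leadtime` column and the `groupby(...).mean().cumsum()` chain come from the notebook code.

```python
import pandas as pd

# Hypothetical forecast output: two rows per lead month, one ensemble
# member column ('ens1') and one observation column ('obs').
toy = pd.DataFrame({
    'leadtime': [1, 1, 2, 2],
    'ens1':     [1.0, 3.0, 2.0, 4.0],
    'obs':      [2.0, 2.0, 3.0, 5.0],
})

# Average within each lead month, then accumulate over the lead months
accumulated = toy.groupby('leadtime').mean().cumsum()
print(accumulated)
```

In this toy case the row for lead time 2 holds the sum of the monthly means of lead times 1 and 2, which is the value the notebook extracts with `.loc[leadtime]`.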

#### 3.1 ESP ensemble

In [10]:
for leadtime in range(1,8):                             # from lead time 1 to 7 months 
    for year in range(start_year,end_year+1):           
        for month in range(start_month, end_month+1):
            # read esp simulation results
            df = pd.read_csv(path + '/analysis/3.ESP/3_run/[out]' + catchment_name + '_' + str(year) + '_' 
                             + str(month).zfill(2) + '.csv')
            df2=df.groupby(by=['leadtime']).mean().cumsum()  # calculate accumulated value at each lead time
            df2['date'] = str(year) + '_' + str(month)       # insert 'date' column
            # rearrange column order 
            col1=df2.columns[-1:].to_list()      
            col2=df2.columns[:-1].to_list()
            new_col=col1+col2
            df3=df2[new_col]
            temp = pd.DataFrame(df3.loc[leadtime]).T
            # stack data
            if year == start_year and month == start_month:
                temp1 = temp          # the first forecast starts the table
            else:
                temp1 = pd.concat([temp1, temp], axis=0, ignore_index=True)

    temp1.set_index('date', inplace=True)
    temp1.to_csv(path + '/analysis/3.ESP/3_run/skill/'  + catchment_name + '_' + str(leadtime) + '_esp.csv') # save csv

#### 3.2 SFFs ensemble

In [11]:
folder = {1:'original',2:'biascorrected'}

for bc_type in range(1,3):                        # bias correction type
    for leadtime in range(1,8):                   # lead time from 1 to 7 months
        for year in range(start_year,end_year+1):
            for month in range(start_month, end_month+1):
                # read seasonal hydrological forecasts results
                df = pd.read_csv(path + '/analysis/4.SFFs/3_run/' + folder[bc_type] + '/[out]' +  catchment_name + '_' 
                                 + str(year) + '_' + str(month).zfill(2) + '.csv')
                df2=df.groupby(by=['leadtime']).mean().cumsum()  # calculate accumulated value at each lead time
                df2['date'] = str(year) + '_' + str(month)       # insert 'date' column
                # rearrange column order                 
                col1=df2.columns[-1:].to_list()
                col2=df2.columns[:-1].to_list()
                new_col=col1+col2
                df3=df2[new_col]
                temp = pd.DataFrame(df3.loc[leadtime]).T
                # stack data
                if year == start_year and month == start_month:
                    temp1 = temp          # the first forecast starts the table
                else:
                    temp1 = pd.concat([temp1, temp], axis=0, ignore_index=True)

        # move the 'mean' and 'obs' columns to the end
        other_cols = [c for c in temp1.columns if c not in ('mean', 'obs')]
        temp1 = temp1[other_cols + ['mean', 'obs']]
        temp1.set_index('date', inplace=True)
        temp1.to_csv(path + '/analysis/4.SFFs/3_run/' + folder[bc_type] + '/skill/' + catchment_name + '_' 
                     + str(leadtime) + '_' + folder[bc_type] + '_sffs.csv') # save data

### 4. Calculate CRPS at each lead time

CRPS measures how well ensemble forecasts match the observed outcomes, taking every ensemble member into account. It is a quadratic measure of the difference between the forecast cumulative distribution function (CDF) and the empirical CDF of the observation (Zamo and Naveau, 2017). The CRPS is calculated as

$$ CRPS= \int [F(x) - H(x > y)]^2 dx $$

where F(x) is the cumulative distribution function of the seasonal forecasts, y is the observed value (e.g. observed precipitation), and H is the indicator function, which equals 1 when x > y and 0 when x < y. A CRPS of 0 means the forecast is perfectly accurate; conversely, the higher the CRPS, the worse the forecast performs.
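
As a worked illustration of the definition, the sketch below evaluates the CRPS of a single ensemble forecast with the identity CRPS = E|X − y| − ½ E|X − X′|, which equals the integral above when F is the empirical CDF of the ensemble members. The function name `crps_ensemble` and the numbers are hypothetical, and this is not necessarily the estimator that `hydrostats` uses internally.

```python
import numpy as np

def crps_ensemble(members, obs):
    """Empirical CRPS of one ensemble forecast against one observation."""
    members = np.asarray(members, dtype=float)
    # Mean absolute error of the members against the observation
    term_obs = np.mean(np.abs(members - obs))
    # Mean absolute difference between all pairs of members (ensemble spread)
    term_spread = np.mean(np.abs(members[:, None] - members[None, :]))
    return term_obs - 0.5 * term_spread

print(crps_ensemble([1.0, 2.0, 3.0], 2.0))   # about 0.222 (= 2/9)
print(crps_ensemble([2.0, 2.0, 2.0], 2.0))   # 0.0, a perfect forecast
```

For the first case, integrating the squared difference between the step CDF of the members (1, 2, 3) and the indicator at y = 2 gives 1/9 + 1/9 = 2/9, which matches the value from the identity.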

The number of ensemble members can also cause problems. Most originating centres have changed their number of ensemble members once (see A. Download seasonal forecasts datasets / 3. Seasonal forecasts systems and datasets for 8 originating centres / Total precipitation table). In that case, the row at which the change occurs and the number of ensemble members on each side of it must be specified manually.

This example shows the case of calculating the CRPS from ECMWF datasets for 2011 to 2020. Here, the 72 rows from Jan. 2011 to Dec. 2016 have 25 ensemble members, and the remaining rows have 51 ensemble members. If the number of ensemble members does not change, enter the same number for both.  <font color='red'> Please note that you need to manually revise some of the code below.</font>
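
The row count of 72 can be checked with a little arithmetic. The helper `rows_before_change` below is a hypothetical name introduced only for illustration, and it assumes one row per monthly forecast in each lead-time file, as in the rearranged datasets above.

```python
def rows_before_change(start_year, start_month, change_year, change_month):
    """Number of monthly forecasts issued before the ensemble size changed."""
    return (change_year - start_year) * 12 + (change_month - start_month)

# Forecasts start in Jan. 2011; the 51-member ensemble starts in Jan. 2017
num_row = rows_before_change(2011, 1, 2017, 1)
print(num_row)   # 72
```

The result is the value to enter for `num_row` in the SFFs cell below when the analysis period or the change date differs from this example.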

#### 4.1 CRPS of ESP

In [12]:
import hydrostats.ens_metrics as em   # read library for calculating crps

for leadtime in range(1,8):
    # read ESP data rearranged by lead time
    df = pd.read_csv(path + '/analysis/3.ESP/3_run/skill/'  + catchment_name + '_' + str(leadtime) + '_esp.csv')
    df_a = df.to_numpy().astype(float)
    df_a2 = df_a[:,1:df_a.shape[1]-2]    # select ensemble data only
    df_obs = df_a[:, len(df.columns)-1]  # select the column for observed data
    # calculate crps using observed and ensemble data    
    crps_dictionary_rand1 = em.ens_crps(df_obs, df_a2)   
    crps = crps_dictionary_rand1['crps']
    csv = pd.DataFrame(crps)
    csv['month'] = df['date'].str.slice(start=5,stop=7)
    csv.set_index(df['date'], inplace=True)
    csv=csv[['month',0]]
    csv=csv.rename(columns={0:'CRPS'})
    csv['leadtime'] = leadtime
    # stack data
    if leadtime == 1:
        temp = csv                       # the first lead time starts the table
    else:
        temp = pd.concat([temp, csv])    # later lead times are appended
temp = temp[['month', 'leadtime', 'CRPS']]
temp.to_csv(path + '/analysis/3.ESP/3_run/skill/[skill]'  + catchment_name + '_esp.csv')  # save the results

print('The skill (CRPS) of ESP is computed')

The skill (CRPS) of ESP is computed


#### 4.2 CRPS of SFFs

In [13]:
import hydrostats.ens_metrics as em   # read library for calculating crps
folder = {1:'original',2:'biascorrected'}

# (Should be manually revised) 
num_row = 72    # The number of rows for the first datasets having 'num_col1' of ensemble members
num_col1 = 25    # The number of ensemble members for the first datasets
num_col2 = 51    # The number of ensemble members for the second datasets

for bc_type in range(1,3):
    for leadtime in range(1,8):
        # read SFFs data rearranged by lead time
        df = pd.read_csv(path + '/analysis/4.SFFs/3_run/' + folder[bc_type] + '/skill/' + catchment_name + '_' 
                    + str(leadtime) + '_' + folder[bc_type] + '_sffs.csv')
        df_a = df.to_numpy().astype(float)
        df_a2 = df_a[:,1:df_a.shape[1]-2]            # select ensemble data only
        df_a3 = df_a2[:num_row, :num_col1]           # select ensemble data having 25 ensembles (~ Dec. 2016)
        df_a4 = df_a2[num_row:, :num_col2]           # select ensemble data having 51 ensembles (Jan. 2017 ~)
        df_obs1 = df_a[:num_row, len(df.columns)-1]  # select the column for observed data having 25 ensembles (~ Dec. 2016)
        df_obs2 = df_a[num_row:, len(df.columns)-1]  # select the column for observed data having 51 ensembles (Jan. 2017 ~)
        # calculate crps using observed and ensemble data
        crps_dictionary_rand1 = em.ens_crps(df_obs1, df_a3)
        crps_dictionary_rand2 = em.ens_crps(df_obs2, df_a4)
        temp1 = crps_dictionary_rand1['crps']
        temp2 = crps_dictionary_rand2['crps']
        crps = np.concatenate([temp1, temp2], axis=0)  # join the two arrays along axis=0, i.e. row-wise (stacked downward)
        csv = pd.DataFrame(crps)
        csv['month'] = df['date'].str.slice(start=5,stop=7)
        csv['date'] = df['date']
        csv.set_index(csv['date'], inplace=True)
        csv=csv[['month',0]]
        csv=csv.rename(columns={0:'CRPS'})
        csv['leadtime'] = leadtime
        # stack data
        if leadtime == 1:
            temp = csv                       # the first lead time starts the table
        else:
            temp = pd.concat([temp, csv])    # later lead times are appended
    temp = temp[['month', 'leadtime', 'CRPS']]
    temp.to_csv(path + '/analysis/4.SFFs/3_run/' + folder[bc_type] + '/skill/[skill]' + catchment_name 
                + '_' + folder[bc_type] + '_sffs.csv')  # save the results
print('The skill (CRPS) of SFFs has been computed.')

The skill (CRPS) of SFFs has been computed.


### 5. CRPSS calculation

CRPSS compares the skill of the seasonal forecasts with that of a reference forecast (climatology), and is calculated as

$$ CRPSS = 1 - \frac{CRPS^{Sys}}{CRPS^{Ref}} $$

where $CRPS^{Sys}$ is the previously calculated $CRPS$ of the seasonal forecasts and $CRPS^{Ref}$ is the reference $CRPS$ obtained from climatology. When the skill score is higher (lower) than zero, the forecasting system is more (less) skilful than the reference. When it is equal to zero, the system (seasonal forecasts) and the reference (climatology) have equivalent skill. In the code of this section, the CRPS of the ESP ensemble is used as the reference.
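
A small numerical sketch of the formula, using hypothetical CRPS values rather than results from this notebook, shows how the sign of the CRPSS separates skilful from unskilful forecasts. The flag computed at the end mirrors the `count` column written by the notebook code.

```python
import numpy as np

# Hypothetical CRPS values for three forecasts
crps_sys = np.array([0.8, 1.0, 1.5])   # seasonal forecasts
crps_ref = np.array([1.0, 1.0, 1.0])   # reference

crpss = 1 - crps_sys / crps_ref        # approximately [0.2, 0.0, -0.5]
skilful = (crpss > 0).astype(int)      # 1 where the forecast beats the reference

print(crpss)
print(skilful)
```

Only the first forecast has a lower CRPS than the reference, so it is the only one flagged as skilful; the second merely ties with the reference.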

CRPSS can be calculated by running the code below:

In [14]:
folder = {1:'original',2:'biascorrected'}

for bc_type in range(1,3):
    # read calculated CRPS data (SFFs)
    df = pd.read_csv(path + '/analysis/4.SFFs/3_run/' + folder[bc_type] + '/skill/[skill]' + catchment_name 
                     + '_' + folder[bc_type] + '_sffs.csv')
    # read reference CRPS data (ESP)
    df_ref = pd.read_csv(path + '/analysis/3.ESP/3_run/skill/[skill]'  + catchment_name + '_esp.csv')    
    df['CRPS_ref'] = df_ref['CRPS']                 # add reference CRPS column
    df['CRPSS'] = 1 - df['CRPS'] / df['CRPS_ref']   # add CRPSS column
    df['count'] = (df['CRPSS'] > 0).astype(int)     # 1 where the forecast beats the reference, else 0
    df.to_csv(path + '/analysis/4.SFFs/3_run/' + folder[bc_type] + '/skill/[skill]' + catchment_name 
                    + '_' + folder[bc_type] + '_sffs.csv')       # save the result on the same file
print('CRPSS calculation has completed.')

CRPSS calculation has completed.
