overview of this notebook...

A bit about these packages and any extensions...

In [1]:
import xarray as xr
import pandas as pd
import numpy as np
import xskillscore as xs
%load_ext blackcellmagic

List the available metrics up front...

In [2]:
dir(xs)

['XSkillScoreAccessor',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 'brier_score',
 'core',
 'crps_ensemble',
 'crps_gaussian',
 'crps_quadrature',
 'effective_sample_size',
 'mae',
 'mape',
 'median_absolute_error',
 'mse',
 'pearson_r',
 'pearson_r_eff_p_value',
 'pearson_r_p_value',
 'r2',
 'rmse',
 'smape',
 'spearman_r',
 'spearman_r_eff_p_value',
 'spearman_r_p_value',
 'threshold_brier_score']

Let's say you are a data scientist working for a company that owns four stores, each of which sells three items (Stock Keeping Units, or SKUs)...

In [3]:
stores = np.arange(4)
skus = np.arange(3)

and you are tracking the daily performance of items sold between January 1st and January 5th, 2020...

In [4]:
dates = pd.date_range("1/1/2020", "1/5/2020", freq="D")

You can query your database to obtain this data. In our case we will generate data to mimic it...

In [5]:
rows = []
# one row per (date, store, sku) combination
for date in dates:
    for store in stores:
        for sku in skus:
            rows.append(
                {
                    "DATE": date,
                    "STORE": store,
                    "SKU": sku,
                    "QUANTITY_SOLD": np.random.randint(10),
                }
            )
df = pd.DataFrame(rows)
df.head()

,DATE,STORE,SKU,QUANTITY_SOLD
0,2020-01-01,0,0,9
1,2020-01-01,0,1,1
2,2020-01-01,0,2,7
3,2020-01-01,1,0,2
4,2020-01-01,1,1,5
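
The nested loops above can also be written as a Cartesian product. The following is a minimal sketch, assuming the same `dates`, `stores` and `skus` as in the tutorial; the names `rng`, `index` and `df_alt` and the seed are hypothetical and only there to make the sketch self-contained and reproducible.

```python
import numpy as np
import pandas as pd

stores = np.arange(4)
skus = np.arange(3)
dates = pd.date_range("1/1/2020", "1/5/2020", freq="D")

# Hypothetical seeded generator so the sketch is reproducible.
rng = np.random.default_rng(0)

# Every (DATE, STORE, SKU) combination, in the same order as the nested loops.
index = pd.MultiIndex.from_product(
    [dates, stores, skus], names=["DATE", "STORE", "SKU"]
)

# One random quantity in 0..9 per combination, then flatten the index
# back into ordinary columns to match the layout of `df`.
df_alt = pd.DataFrame(
    {"QUANTITY_SOLD": rng.integers(0, 10, size=len(index))}, index=index
).reset_index()
```

This yields 5 x 4 x 3 = 60 rows with the same columns as `df`. The values differ from the table above because they are random.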


Your boss has asked you to predict how many items were sold during this period...

The prediction itself is outside the scope of this tutorial, but we will use `xskillscore` to tell us how good the prediction was...

First, rename the target variable to `y`...

In [6]:
df.rename(columns={"QUANTITY_SOLD": "y"}, inplace=True)
df.head()

,DATE,STORE,SKU,y
0,2020-01-01,0,0,9
1,2020-01-01,0,1,1
2,2020-01-01,0,2,7
3,2020-01-01,1,0,2
4,2020-01-01,1,1,5


Use a pandas `MultiIndex` so we can handle the dimensions more easily...

In [7]:
df.set_index(['DATE', 'STORE', 'SKU'], inplace=True)

This also displays the data more clearly in the notebook...

In [8]:
df.head()

DATE,STORE,SKU,y
2020-01-01,0,0,9
2020-01-01,0,1,1
2020-01-01,0,2,7
2020-01-01,1,0,2
2020-01-01,1,1,5


Make a prediction...

As mentioned, this is outside the scope of the tutorial. In our case we are going to take `y` and perturb it slightly. This gives a middle ground between a prediction that overfits the data (very similar to `y`) and random numbers, for which the skill would be zero.

The perturbations scale each number by anywhere between -100% and +100%, drawn from a uniform distribution; e.g. a value of 5 in `y` will become a value between 0 and 10 in the prediction.

In [9]:
noise = np.random.uniform(-1, 1, size=len(df['y']))
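
To make the bounds of the perturbation concrete, here is a small sketch using NumPy only. The seeded generator `rng` and the sample size are assumptions made for illustration; the tutorial itself uses the unseeded `np.random` functions.

```python
import numpy as np

# Hypothetical seed and sample size, chosen only for reproducibility.
rng = np.random.default_rng(0)
y = rng.integers(0, 10, size=1000)
noise = rng.uniform(-1, 1, size=y.size)

# y + y * noise equals y * (1 + noise), so each value lands between 0 and 2 * y.
# astype(int) truncates toward zero rather than rounding.
yhat = (y + y * noise).astype(int)

lower_ok = bool((yhat >= 0).all())
upper_ok = bool((yhat <= 2 * y).all())
```

One consequence of a multiplicative perturbation is that a `y` of 0 always produces a prediction of 0.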

Name the prediction `yhat` and add it as a field to `df`. Lastly, convert it to an `int` to match `y`...

In [10]:
df['yhat'] = (df['y'] + (df['y'] * noise)).astype(int)
df.head()

DATE,STORE,SKU,y,yhat
2020-01-01,0,0,9,10
2020-01-01,0,1,1,1
2020-01-01,0,2,7,5
2020-01-01,1,0,2,1
2020-01-01,1,1,5,7


# Using xskillscore - RMSE

What is RMSE and why use it...

You can obtain the skill of the prediction (rmse) using `sklearn` as below...

In [11]:
from sklearn.metrics import mean_squared_error

mean_squared_error(df['y'], df['yhat'], squared=False)

3.278719262151
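
For reference, RMSE is the square root of the mean of the squared differences between `y` and `yhat`. The sketch below computes it by hand with NumPy for the five rows shown in the `df.head()` output above. It does not rely on sklearn, which may matter if your scikit-learn version does not accept `squared=False`.

```python
import numpy as np

# The first five (y, yhat) pairs from the table above.
y = np.array([9, 1, 7, 2, 5])
yhat = np.array([10, 1, 5, 1, 7])

# Errors are -1, 0, 2, 1, -2; their squares sum to 10, so the mean is 2.
rmse = np.sqrt(np.mean((y - yhat) ** 2))  # sqrt(2), about 1.414
```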

While this is simple, it lacks the flexibility that xskillscore offers. Note: xskillscore computes the same metrics as sklearn, and in some cases it uses NumPy implementations of those metrics...

You can convert the `pandas.DataFrame` to an `xarray.Dataset`. This gives you access to all of the functionality of xarray and, most importantly, lets you use xskillscore...

In [12]:
ds = df.to_xarray()
ds

We can call xskillscore as an accessor on this xarray.Dataset...

`xskillscore` works by specifying `(y_true, y_pred, dim(s))`: you pass the target variable (`y` in our case), the predicted variable (`yhat` in our case) and the dimension(s) over which to reduce...

To replicate the sklearn metric above we want to reduce over all dimensions, `[DATE, STORE, SKU]`. Root mean squared error is called `rmse` in xskillscore. Lastly, call `.values` on the result to obtain the data as a `np.ndarray`...

In [13]:
ds.xs.rmse('y', 'yhat', ['DATE', 'STORE', 'SKU']).values

array(3.27871926)

Your boss is interested in how good your model is at a store level...

In this case reduce over the `DATE` and `SKU` dimensions...

In [14]:
ds.xs.rmse('y', 'yhat', ['DATE', 'SKU'])
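
Reducing over a subset of dimensions can be cross-checked with plain xarray arithmetic. This is a sketch under stated assumptions: `ds_demo` is a hypothetical stand-in for `ds` filled with seeded random integers, and the calculation is written out by hand rather than calling xskillscore.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for `ds`: 5 dates x 4 stores x 3 SKUs.
rng = np.random.default_rng(0)
dims = ("DATE", "STORE", "SKU")
shape = (5, 4, 3)
ds_demo = xr.Dataset(
    {
        "y": (dims, rng.integers(0, 10, size=shape)),
        "yhat": (dims, rng.integers(0, 10, size=shape)),
    }
)

squared_error = (ds_demo["y"] - ds_demo["yhat"]) ** 2

# Averaging over DATE and SKU leaves one value per STORE.
rmse_by_store = np.sqrt(squared_error.mean(dim=["DATE", "SKU"]))

# Averaging over every dimension leaves a single scalar.
rmse_overall = np.sqrt(squared_error.mean(dim=["DATE", "STORE", "SKU"]))
```

The dimensions you list are the ones that disappear from the result, so `rmse_by_store` has only the `STORE` dimension left.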

# Providing weights to the verification metrics

You can specify weights when calculating skill metrics...

Your boss has asked you to create a prediction for the next five days. You will update this prediction every day, and there is a larger focus on the performance of the next day's sales than on the fifth day's...

In this case you can weight your metric so that the performance on day one has a larger influence than day five. You can apply a linear scaling from 1 to 0, with day one having a weight of 1 and day five having a weight of 0...

We will reduce over `DATE` and therefore obtain the metric for the forecasts at the `STORE` and `SKU` level...

Create the weights as an `xarray.DataArray` and name it to match the dimension it applies to...

In [15]:
dim = 'DATE'
np_weights = np.linspace(1, 0, num=len(ds[dim]))
weights = xr.DataArray(np_weights, dims=dim)
weights

Pass this to the `weights` parameter of the skill metric...

In [16]:
ds.xs.rmse('y', 'yhat', 'DATE', weights=weights)

and you can compare this with the unweighted result...

In [17]:
ds.xs.rmse('y', 'yhat', 'DATE')
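
To see what the weights do, here is a hand-written weighted RMSE in NumPy. It assumes the conventional normalized weighted mean, `sum(w * err**2) / sum(w)`; that convention is an assumption made for this sketch, not a statement about xskillscore's internals. The five values are made up so that only day five is wrong.

```python
import numpy as np

# Same linear weights as above: [1.0, 0.75, 0.5, 0.25, 0.0].
weights = np.linspace(1, 0, num=5)

# Made-up series in which only the day-five prediction is wrong.
y = np.array([4.0, 4.0, 4.0, 4.0, 4.0])
yhat = np.array([4.0, 4.0, 4.0, 4.0, 8.0])

squared_error = (y - yhat) ** 2  # [0, 0, 0, 0, 16]

# Unweighted: mean of 3.2, so RMSE is sqrt(3.2), about 1.789.
unweighted = np.sqrt(squared_error.mean())

# Weighted: day five carries a weight of 0, so its error vanishes entirely.
weighted = np.sqrt((weights * squared_error).sum() / weights.sum())
```

Note that a weight of exactly 0 removes day five from the metric altogether. If day five should still count a little, a scaling that ends above 0 would be needed.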

# Handle missing values

It is often the case that on some days, in some stores, there are no purchases of some products. These entries will be blank in the relational database...

To mimic this, let's create the same type of data structure as before but randomly suppress rows...

In [18]:
random_number_threshold = 0.8

rows = []
for date in dates:
    for store in stores:
        for sku in skus:
            # keep roughly 80% of the rows; the rest mimic days with no sales
            if np.random.rand() < random_number_threshold:
                rows.append(
                    {
                        "DATE": date,
                        "STORE": store,
                        "SKU": sku,
                        "QUANTITY_SOLD": np.random.randint(10),
                    }
                )
df = pd.DataFrame(rows)
df.rename(columns={"QUANTITY_SOLD": "y"}, inplace=True)
df.set_index(['DATE', 'STORE', 'SKU'], inplace=True)  # index by the three dimensions
df.head(10)

DATE,STORE,SKU,y
2020-01-01,0,0,8
2020-01-01,0,1,8
2020-01-01,0,2,5
2020-01-01,1,1,6
2020-01-01,1,2,2
2020-01-01,2,1,4
2020-01-01,2,2,2
2020-01-01,3,0,1
2020-01-01,3,1,3
2020-01-01,3,2,5


Xarray will fill missing values with `NaN`s, provided that every index value is present at some point in the data...

In [19]:
ds = df.to_xarray()
ds
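
The NaN filling can be seen on a tiny example. This is a minimal sketch with a hypothetical two-store, two-SKU frame (`df_demo`) in which the combination `STORE=1, SKU=1` was never recorded.

```python
import pandas as pd

# Hypothetical frame: three of the four (STORE, SKU) combinations are present.
df_demo = pd.DataFrame(
    {
        "STORE": [0, 0, 1],
        "SKU": [0, 1, 0],
        "y": [8, 5, 2],
    }
).set_index(["STORE", "SKU"])

# to_xarray builds the full 2 x 2 grid and fills the missing cell with NaN.
ds_demo = df_demo.to_xarray()
missing_cell = ds_demo["y"].sel(STORE=1, SKU=1)
```

Because integers cannot represent NaN, `y` is converted to a floating-point type, which is why the values in the tables that follow are shown as `8.0` rather than `8`.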

You can check this by converting the xarray object back to a pandas DataFrame...

Note: xarray returns the index levels in alphabetical order (`DATE`, `SKU`, `STORE`), but the NaNs are still shown...

In [20]:
df_with_nans = ds.to_dataframe()
df_with_nans.head(10)

DATE,SKU,STORE,y
2020-01-01,0,0,8.0
2020-01-01,0,1,NaN
2020-01-01,0,2,NaN
2020-01-01,0,3,1.0
2020-01-01,1,0,8.0
2020-01-01,1,1,6.0
2020-01-01,1,2,4.0
2020-01-01,1,3,3.0
2020-01-01,2,0,5.0
2020-01-01,2,1,2.0


Append a prediction column. In most cases you still want to make predictions where the values are missing. We would hope the predicted quantity is low...

In [21]:
df_with_nans['yhat'] = df_with_nans['y'] + (df_with_nans['y'] * noise)
df_with_nans.head()

DATE,SKU,STORE,y,yhat
2020-01-01,0,0,8.0,9.044095
2020-01-01,0,1,NaN,NaN
2020-01-01,0,2,NaN,NaN
2020-01-01,0,3,1.0,0.95589
2020-01-01,1,0,8.0,12.489592


Our prediction still contains NaNs, so replace them with small random numbers (hoping the model would predict a low number)...

In [22]:
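# work on a copy so the assignment does not act on a view of the DataFrame
yhat = df_with_nans['yhat'].copy()

# fill each missing prediction with a random integer in 0..4
missing = yhat.isna()
yhat[missing] = np.random.randint(5, size=missing.sum())

df_with_nans['yhat'] = yhat
df_with_nans.head()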

DATE,SKU,STORE,y,yhat
2020-01-01,0,0,8.0,9.044095
2020-01-01,0,1,NaN,3.0
2020-01-01,0,2,NaN,4.0
2020-01-01,0,3,1.0,0.95589
2020-01-01,1,0,8.0,12.489592


Now if we try using `sklearn`...

In [23]:
mean_squared_error(df_with_nans['y'], df_with_nans['yhat'], squared=False)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

you get a `ValueError` because the data contains `NaN`s...

In xskillscore you don't need to worry about this: simply specify `skipna=True`...

In [24]:
ds = df_with_nans.to_xarray()
ds.xs.rmse('y', 'yhat', ['DATE', 'STORE', 'SKU'], skipna=True).values

array(2.89445127)
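
What skipping NaNs means for RMSE can be sketched by hand with NumPy. The five pairs below are the rows shown in the last `df_with_nans.head()` output; the sketch uses `np.nanmean` as a stand-in and is not xskillscore's own code.

```python
import numpy as np

# The five (y, yhat) pairs from the table above; y is missing in two of them.
y = np.array([8.0, np.nan, np.nan, 1.0, 8.0])
yhat = np.array([9.044095, 3.0, 4.0, 0.95589, 12.489592])

squared_error = (y - yhat) ** 2  # NaN wherever y is NaN

# A plain mean propagates the NaN, so the metric is undefined.
rmse_all = np.sqrt(np.mean(squared_error))

# nanmean averages only the three pairs where both values are present.
rmse_skipna = np.sqrt(np.nanmean(squared_error))
```

The random fill values of 3.0 and 4.0 never enter the skipped metric, because their `y` counterparts are missing.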

# Handle weights and missing values

You can specify `weights` and `skipna` together for powerful analysis...

In [25]:
ds.xs.rmse('y', 'yhat', 'DATE', weights=weights, skipna=True)
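
When weights and missing values are combined, a reasonable convention is to drop the weight of every missing pair before normalizing. The sketch below does this by hand for one made-up five-day series; the convention is an assumption for illustration, not a statement about xskillscore's internals.

```python
import numpy as np

weights = np.linspace(1, 0, num=5)  # [1.0, 0.75, 0.5, 0.25, 0.0]

# Made-up series for one store and SKU; day two has no recorded sales.
y = np.array([8.0, np.nan, 6.0, 1.0, 8.0])
yhat = np.array([9.0, 3.0, 4.0, 1.0, 12.0])

squared_error = (y - yhat) ** 2  # [1, NaN, 4, 0, 16]
valid = ~np.isnan(squared_error)

# Numerator: 1*1 + 0.5*4 + 0.25*0 + 0*16 = 3
# Denominator: 1 + 0.5 + 0.25 + 0 = 1.75 (day two's weight is excluded)
weighted_rmse = np.sqrt(
    (weights[valid] * squared_error[valid]).sum() / weights[valid].sum()
)
```

Excluding the weight of the missing day keeps the remaining weights normalized; leaving it in the denominator would bias the metric low.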