## [01_Deterministic.ipynb](https://github.com/raybellwaves/xskillscore-tutorial/blob/master/01_Determinisitic.ipynb)

In this notebook I show how `xskillscore` can be dropped into a typical data science task where the data is a [`pandas.DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).

I use the RMSE metric to verify forecasts of items sold.

I also show how you can apply weights to the verification and handle missing values.

Import the necessary packages:

In [1]:
import xarray as xr
import pandas as pd
import numpy as np
import xskillscore as xs

Let's say you are a data scientist who works for a company which owns four stores, each of which sells three items (Stock Keeping Units, or SKUs).

Set up `stores` and `skus` arrays:

In [2]:
stores = np.arange(4)
skus = np.arange(3)

and you are tracking the daily performance of items sold between Jan 1st and Jan 5th 2020.

Set up the `dates` array:

In [3]:
dates = pd.date_range("1/1/2020", "1/5/2020", freq="D")

Generate a `pandas.DataFrame` to show the number of items that were sold during this period. The number of items sold will be a random integer between 1 and 9 (`np.random.randint(9)` draws from 0 to 8, and 1 is added to it).

This may be something you would obtain from querying a database:

In [10]:
rows = []
for date in dates:
    for store in stores:
        for sku in skus:
            rows.append(
                {
                    "DATE": date,
                    "STORE": store,
                    "SKU": sku,
                    # randint(9) draws from 0..8, so adding 1 gives 1..9
                    "QUANTITY_SOLD": np.random.randint(9) + 1,
                }
            )
df = pd.DataFrame(rows)
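
The nested loops above can also be written without an explicit loop. The following is a minimal sketch, assuming you are happy to use `pandas.MultiIndex.from_product` and NumPy's `default_rng`; the name `df_alt` and the seed are my own and are not part of the original notebook:

```python
import numpy as np
import pandas as pd

stores = np.arange(4)
skus = np.arange(3)
dates = pd.date_range("1/1/2020", "1/5/2020", freq="D")

# Every (DATE, STORE, SKU) combination, in the same order as the nested loops.
index = pd.MultiIndex.from_product(
    [dates, stores, skus], names=["DATE", "STORE", "SKU"]
)

# Seeded generator so the sketch is repeatable; integers(1, 10) draws from 1..9.
rng = np.random.default_rng(0)

# df_alt is a hypothetical name used only in this sketch.
df_alt = pd.DataFrame(
    {"QUANTITY_SOLD": rng.integers(1, 10, size=len(index))}, index=index
).reset_index()

print(df_alt.shape)
```

This gives one row per date, store and SKU (5 x 4 x 3 = 60 rows) with the same four columns as `df`.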

Print the first 5 rows of the `pandas.DataFrame`:

In [11]:
df.head()

| | DATE | STORE | SKU | QUANTITY_SOLD |
|---|---|---|---|---|
| 0 | 2020-01-01 | 0 | 0 | 5 |
| 1 | 2020-01-01 | 0 | 1 | 6 |
| 2 | 2020-01-01 | 0 | 2 | 1 |
| 3 | 2020-01-01 | 1 | 0 | 9 |
| 4 | 2020-01-01 | 1 | 1 | 4 |


Your boss has asked you to use this data to predict the number of items sold for each store and SKU for the next 5 days.

The prediction itself is outside the scope of this tutorial, but we will use `xskillscore` to tell us how good our prediction may be.

First, rename the target variable to ``y``:

In [13]:
df.rename(columns={"QUANTITY_SOLD": "y"}, inplace=True)
df.head()

| | DATE | STORE | SKU | y |
|---|---|---|---|---|
| 0 | 2020-01-01 | 0 | 0 | 5 |
| 1 | 2020-01-01 | 0 | 1 | 6 |
| 2 | 2020-01-01 | 0 | 2 | 1 |
| 3 | 2020-01-01 | 1 | 0 | 9 |
| 4 | 2020-01-01 | 1 | 1 | 4 |


Use [pandas MultiIndex](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html) to help handle the granularity of the forecast:

In [14]:
df.set_index(['DATE', 'STORE', 'SKU'], inplace=True)

This also displays the data in a cleaner format in the notebook:

In [15]:
df.head()

| DATE | STORE | SKU | y |
|---|---|---|---|
| 2020-01-01 | 0 | 0 | 5 |
| 2020-01-01 | 0 | 1 | 6 |
| 2020-01-01 | 0 | 2 | 1 |
| 2020-01-01 | 1 | 0 | 9 |
| 2020-01-01 | 1 | 1 | 4 |


Time for your prediction! As mentioned, this is outside of the scope of this tutorial.

In our case we are going to generate data to mimic a prediction by taking `y` and perturbing it randomly. This provides a middle ground between a prediction that overfits the data (being very similar to `y`) and the other extreme of random numbers, for which the skill will be 0.

The perturbations will scale `y` by between -100% and 100% using a uniform distribution. For example, a value of 5 in `y` will be between 0 and 10 in the prediction (`yhat`).

Set up the perturbation array:

In [17]:
noise = np.random.uniform(-1, 1, size=len(df['y']))

Name the prediction `yhat` and append it to the `pandas.DataFrame`.

Lastly, convert it to an `int` to match the format of the target (`y`):

In [18]:
df['yhat'] = (df['y'] + (df['y'] * noise)).astype(int)
df.head()

| DATE | STORE | SKU | y | yhat |
|---|---|---|---|---|
| 2020-01-01 | 0 | 0 | 5 | 7 |
| 2020-01-01 | 0 | 1 | 6 | 7 |
| 2020-01-01 | 0 | 2 | 1 | 1 |
| 2020-01-01 | 1 | 0 | 9 | 8 |
| 2020-01-01 | 1 | 1 | 4 | 3 |
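
As a quick sanity check on the perturbation, the sketch below confirms that `y + y * noise` stays between 0 and `2 * y`, and that `astype(int)` truncates rather than rounds. It uses its own seeded toy data (an assumption for repeatability), not the `df` from the cells above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

y = pd.Series(rng.integers(1, 10, size=1000))   # toy targets in 1..9
noise = rng.uniform(-1, 1, size=len(y))         # perturbation in [-1, 1)

yhat_float = y + y * noise        # lies in [0, 2 * y)
yhat = yhat_float.astype(int)     # truncates toward zero, e.g. 7.9 becomes 7

print(yhat.min(), yhat.max())
```

Because of the truncation, `yhat` is slightly biased low compared with the unrounded prediction.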


## Using xskillscore - RMSE

RMSE (root-mean-square error) is the square root of the average of the squared differences between forecasts and verification data:

\begin{align}
RMSE = \sqrt{\overline{(f - o)^{2}}}
\end{align}

Because the error is squared, it is sensitive to outliers and is a more conservative metric than mean absolute error.

See https://climpred.readthedocs.io/en/stable/metrics.html#root-mean-square-error-rmse for further documentation.
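
To make the outlier sensitivity concrete, here is a small hand-rolled sketch in NumPy. The helper names `rmse` and `mae` and the toy numbers are mine, chosen so that both forecasts have the same mean absolute error:

```python
import numpy as np


def rmse(f, o):
    """Square root of the mean squared difference between forecast and observation."""
    return np.sqrt(np.mean((f - o) ** 2))


def mae(f, o):
    """Mean of the absolute differences between forecast and observation."""
    return np.mean(np.abs(f - o))


o = np.array([5.0, 6.0, 1.0, 9.0, 4.0])

f_small = o + 1.0          # every forecast is off by 1
f_outlier = o.copy()
f_outlier[0] += 5.0        # one forecast is off by 5, the rest are exact

print(mae(f_small, o), rmse(f_small, o))      # 1.0 1.0
print(mae(f_outlier, o), rmse(f_outlier, o))  # 1.0 and sqrt(5), about 2.236
```

Both forecasts have an MAE of 1, but the single large miss more than doubles the RMSE.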

### sklearn

Most data scientists are familiar with using `scikit-learn` for verifying forecasts, especially if you used `scikit-learn` for the prediction.

To obtain RMSE from `scikit-learn` import `mean_squared_error` and specify `squared=False`:

In [20]:
from sklearn.metrics import mean_squared_error
mean_squared_error(df['y'], df['yhat'], squared=False)

2.819574435974337
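
A note on versions: to my knowledge, recent `scikit-learn` releases deprecate the `squared` argument in favour of a separate `root_mean_squared_error` function, so the call above may warn or fail depending on what you have installed. A sketch that should not depend on the version is to take the square root yourself (the toy arrays are mine):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values for illustration only.
y_true = np.array([5, 6, 1, 9, 4])
y_pred = np.array([7, 7, 1, 8, 3])

# mean_squared_error returns the MSE by default; the square root gives RMSE.
rmse_value = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmse_value)
```

The squared errors here are 4, 1, 0, 1 and 1, so the result is the square root of 1.4.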

While `scikit-learn` is simple, it doesn't offer the flexibility of `xskillscore`.

Note: `xskillscore` does use the same metrics as in `scikit-learn` such as [`r2`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html).

### xskillscore

To use `xskillscore` you first have to put your data into an `xarray` object.

Because `xarray` is part of the PyData stack, it integrates well with other Python data science packages.

`pandas` has a convenient [`to_xarray`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_xarray.html) method which makes going from `pandas` to `xarray` seamless.

Use `to_xarray` to convert the `pandas.DataFrame` to an `xarray.Dataset`:

In [12]:
ds = df.to_xarray()
ds

As seen above, `xarray` has a very nice HTML representation.

Click on the data symbol to see the data associated with the Coordinates and Data.

You now have one variable which houses the data and the associated metadata (this is why `xarray` was developed).
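
Since the HTML representation does not survive outside the notebook, here is a small sketch of what the conversion produces. The toy frame `toy` is my own, not the `df` above: each level of the MultiIndex becomes a dimension, and each column becomes a data variable.

```python
import numpy as np
import pandas as pd

# to_xarray needs the xarray package installed, but no explicit import here.
index = pd.MultiIndex.from_product(
    [pd.date_range("2020-01-01", periods=2), [0, 1], [0, 1, 2]],
    names=["DATE", "STORE", "SKU"],
)
toy = pd.DataFrame(
    {"y": np.arange(12), "yhat": np.arange(12) + 1}, index=index
)

toy_ds = toy.to_xarray()

print(dict(toy_ds.sizes))      # length of each dimension
print(list(toy_ds.data_vars))  # one data variable per column
```

The 12 rows are reshaped into a 2 x 2 x 3 cube for each of `y` and `yhat`.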

In [None]:
...

We can call `xskillscore` as an accessor on this `xarray.Dataset`...

`xskillscore` works by specifying `(y_true, y_pred, dim(s))`. Here you pass the target variable (`y` in our case), the predicted variable (`yhat` in our case) and the dimensions over which to reduce...

To replicate the sklearn metric above we want to reduce over all dimensions `[DATE, STORE, SKU]`. Root mean squared error is called `rmse` in xskillscore... Lastly, call `.values` on the object to obtain the data as a `np.array`...

In [13]:
ds.xs.rmse('y', 'yhat', ['DATE', 'STORE', 'SKU']).values

array(3.39116499)
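
To see what reducing over the dimensions means, the same quantity can be written with plain `xarray` operations. This is a sketch of my understanding of what `rmse` computes, on a toy dataset of my own, rather than a description of the library's internals:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [pd.date_range("2020-01-01", periods=2), [0, 1]], names=["DATE", "STORE"]
)
toy = pd.DataFrame({"y": [5, 6, 1, 9], "yhat": [7, 7, 1, 8]}, index=index)
toy_ds = toy.to_xarray()

# Square the errors, average over every dimension, then take the square root.
squared_error = (toy_ds["yhat"] - toy_ds["y"]) ** 2
rmse_by_hand = np.sqrt(squared_error.mean(["DATE", "STORE"]))

print(float(rmse_by_hand))
```

The squared errors are 4, 1, 0 and 1, so the result is the square root of 1.5. Passing fewer dimensions to `mean` leaves the others in the output.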

Your boss is interested in how good your model is at a store level...

In this case reduce over the `DATE` and `SKU` dimensions...

In [16]:
ds.xs.rmse('y', 'yhat', ['DATE', 'SKU'])
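
For readers who want to cross-check a per-store result in `pandas`, a `groupby` on the `STORE` level gives the same kind of reduction. This is a sketch on toy data of my own, not the values from the notebook run:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [pd.date_range("2020-01-01", periods=2), [0, 1], [0, 1]],
    names=["DATE", "STORE", "SKU"],
)
toy = pd.DataFrame(
    {"y": [5, 6, 1, 9, 4, 2, 8, 3], "yhat": [7, 7, 1, 8, 3, 2, 6, 3]},
    index=index,
)

# Average the squared errors within each store (that is, over DATE and SKU),
# then take the square root.
squared_error = (toy["yhat"] - toy["y"]) ** 2
rmse_per_store = np.sqrt(squared_error.groupby(level="STORE").mean())

print(rmse_per_store)
```

Store 0 has squared errors 4, 1, 1 and 0 and store 1 has 0, 1, 4 and 0, so the two values are the square roots of 1.5 and 1.25.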

# Providing weights to the verification metrics

You can specify weights when calculating skill metrics...

Your boss has asked you to create a prediction for the next five days. You will update this prediction every day and there is a larger focus on the performance of the next day's sales compared to the fifth day...

In this case you can weight your metric so the performance of day one has a larger influence than day five. You can apply a linear scaling from 1 to 0, with day 1 having a weight of 1 and day 5 having a weight of 0...

We will reduce over `DATE` and therefore obtain the metric for the forecasts at the `STORE` and `SKU` level...

Create the weights as an `xarray.DataArray` and name it to match the dimension it applies to...

In [20]:
dim = 'DATE'
np_weights = np.linspace(1, 0, num=len(ds[dim]))
weights = xr.DataArray(np_weights, dims=dim)
print(weights)

<xarray.DataArray (DATE: 5)>
array([1.  , 0.75, 0.5 , 0.25, 0.  ])
Dimensions without coordinates: DATE


Pass this to the `weights` parameter of the skill metric...

In [19]:
print(ds.xs.rmse('y', 'yhat', 'DATE', weights=weights))

<xarray.DataArray (STORE: 4, SKU: 3)>
array([[1.70293864, 1.76068169, 1.67332005],
       [4.74341649, 3.67423461, 5.30094331],
       [2.21359436, 4.35889894, 4.        ],
       [1.58113883, 1.8973666 , 3.03315018]])
Coordinates:
  * STORE    (STORE) int64 0 1 2 3
  * SKU      (SKU) int64 0 1 2
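
The effect of the weights is easier to see on a single store and SKU. The sketch below assumes the weighted metric is a weighted mean of the squared errors, normalised by the sum of the weights; that is my understanding of the behaviour, not something this notebook confirms. The toy numbers are mine:

```python
import numpy as np

weights = np.linspace(1, 0, num=5)            # [1, 0.75, 0.5, 0.25, 0]
y = np.array([5.0, 6.0, 1.0, 9.0, 4.0])       # one store/SKU over five days
yhat = np.array([7.0, 7.0, 1.0, 8.0, 9.0])    # day five is badly wrong

squared_error = (yhat - y) ** 2               # [4, 1, 0, 1, 25]

# Weighted mean of the squared errors, then the square root.
weighted_rmse = np.sqrt(np.sum(weights * squared_error) / np.sum(weights))
unweighted_rmse = np.sqrt(np.mean(squared_error))

print(weighted_rmse, unweighted_rmse)
```

With a weight of 0 on day five, the large miss on that day does not contribute at all, so the weighted value (the square root of 2) is well below the unweighted one (the square root of 6.2).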


and you can compare without weights...

In [22]:
print(ds.xs.rmse('y', 'yhat', 'DATE'))

<xarray.DataArray (STORE: 4, SKU: 3)>
array([[2.19089023, 1.67332005, 2.89827535],
       [5.07937004, 3.71483512, 4.42718872],
       [2.28035085, 4.87852437, 3.71483512],
       [1.54919334, 2.23606798, 3.54964787]])
Coordinates:
  * STORE    (STORE) int64 0 1 2 3
  * SKU      (SKU) int64 0 1 2


# Handle missing values

It is often the case that on some days in some stores for some products there are no purchases. These entries will be blank in the relational database...

To mimic this, let's create the same type of data structure as before but randomly suppress rows...

In [23]:
random_number_threshold = 0.8

rows = []
for date in dates:
    for store in stores:
        for sku in skus:
            # keep a row only when the draw falls below the threshold (about 80% of rows)
            if np.random.rand() < random_number_threshold:
                rows.append(
                    {
                        "DATE": date,
                        "STORE": store,
                        "SKU": sku,
                        "QUANTITY_SOLD": np.random.randint(10),
                    }
                )
df = pd.DataFrame(rows)
df.rename(columns={"QUANTITY_SOLD": "y"}, inplace=True)
df.set_index(['DATE', 'STORE', 'SKU'], inplace=True) # order alphabetically
df.head(10)

| DATE | STORE | SKU | y |
|---|---|---|---|
| 2020-01-01 | 0 | 1 | 9 |
| 2020-01-01 | 0 | 2 | 3 |
| 2020-01-01 | 1 | 0 | 5 |
| 2020-01-01 | 1 | 2 | 1 |
| 2020-01-01 | 2 | 1 | 5 |
| 2020-01-01 | 2 | 2 | 0 |
| 2020-01-01 | 3 | 0 | 4 |
| 2020-01-01 | 3 | 2 | 3 |
| 2020-01-02 | 0 | 0 | 1 |
| 2020-01-02 | 0 | 1 | 6 |

`xarray` will fill the missing entries with `NaN`s, provided that every index value is present at some point in the data...

In [24]:
ds = df.to_xarray()
ds
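
A minimal sketch of this behaviour, using a toy frame of my own (`sparse`) in which one date and store combination is deliberately absent:

```python
import pandas as pd

# The (2020-01-01, store 1) row is missing on purpose.
sparse = pd.DataFrame(
    {
        "DATE": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-02"]),
        "STORE": [0, 0, 1],
        "y": [5, 6, 1],
    }
).set_index(["DATE", "STORE"])

sparse_ds = sparse.to_xarray()

# The result is a full 2 x 2 grid; the absent combination is filled with NaN
# and the integer column is upcast to float to hold it.
print(sparse_ds["y"].values)
```

Store 1 appears on 2020-01-02, so it becomes a coordinate, and the grid cell for 2020-01-01 is filled in as missing.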

You can check this by converting the `xarray` object back to a `pandas.DataFrame`...

Note: `xarray` returns the fields alphabetically but it still shows the `NaN`s...

In [25]:
df_with_nans = ds.to_dataframe()
df_with_nans.head(10)

| DATE | SKU | STORE | y |
|---|---|---|---|
| 2020-01-01 | 0 | 0 | NaN |
| 2020-01-01 | 0 | 1 | 5.0 |
| 2020-01-01 | 0 | 2 | NaN |
| 2020-01-01 | 0 | 3 | 4.0 |
| 2020-01-01 | 1 | 0 | 9.0 |
| 2020-01-01 | 1 | 1 | NaN |
| 2020-01-01 | 1 | 2 | 5.0 |
| 2020-01-01 | 1 | 3 | NaN |
| 2020-01-01 | 2 | 0 | 3.0 |
| 2020-01-01 | 2 | 1 | 1.0 |


Append a prediction column. In most cases you still want to make predictions where there are missing values. We would hope the predicted number is low...

In [26]:
df_with_nans['yhat'] = df_with_nans['y'] + (df_with_nans['y'] * noise)
df_with_nans.head()

| DATE | SKU | STORE | y | yhat |
|---|---|---|---|---|
| 2020-01-01 | 0 | 0 | NaN | NaN |
| 2020-01-01 | 0 | 1 | 5.0 | 2.448041 |
| 2020-01-01 | 0 | 2 | NaN | NaN |
| 2020-01-01 | 0 | 3 | 4.0 | 5.119504 |
| 2020-01-01 | 1 | 0 | 9.0 | 17.061177 |


Our prediction still contains `NaN`s, so replace them with small random numbers (hoping a real prediction would predict a low number):

In [27]:
yhat = df_with_nans['yhat']

yhat.loc[pd.isna(yhat)] = yhat[pd.isna(yhat)].apply(lambda x: np.random.randint(5))

df_with_nans['yhat'] = yhat
df_with_nans.head()

| DATE | SKU | STORE | y | yhat |
|---|---|---|---|---|
| 2020-01-01 | 0 | 0 | NaN | 2.0 |
| 2020-01-01 | 0 | 1 | 5.0 | 2.448041 |
| 2020-01-01 | 0 | 2 | NaN | 3.0 |
| 2020-01-01 | 0 | 3 | 4.0 | 5.119504 |
| 2020-01-01 | 1 | 0 | 9.0 | 17.061177 |


Now if we try using `sklearn`

In [28]:
mean_squared_error(df_with_nans['y'], df_with_nans['yhat'], squared=False)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

You get a `ValueError` as the data contains `NaN`s...

In xskillscore you don't need to worry about this and can simply specify `skipna=True`...

In [29]:
ds = df_with_nans.to_xarray()
ds.xs.rmse('y', 'yhat', ['DATE', 'STORE', 'SKU'], skipna=True).values

array(3.64171459)
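
Conceptually, skipping `NaN`s means dropping every pair in which either the target or the prediction is missing before averaging. The sketch below does this by hand in NumPy on toy values of my own; I am assuming, rather than asserting, that `skipna=True` follows the same convention:

```python
import numpy as np

y = np.array([np.nan, 5.0, np.nan, 4.0, 9.0])   # two targets are missing
yhat = np.array([2.0, 3.0, 3.0, 5.0, 12.0])

# Keep only the pairs where both values are present.
valid = ~np.isnan(y) & ~np.isnan(yhat)
rmse_skipna = np.sqrt(np.mean((yhat[valid] - y[valid]) ** 2))

print(rmse_skipna)
```

Only three pairs survive, with squared errors 4, 1 and 9, so the result is the square root of 14 / 3. The same masking (for example with `dropna` on the `pandas.DataFrame`) would also let the `scikit-learn` call run.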

# Handle weights and missing values

You can specify `weights` and `skipna` together for powerful analysis..

In [31]:
print(ds.xs.rmse('y', 'yhat', 'DATE', weights=weights, skipna=True))

<xarray.DataArray (SKU: 3, STORE: 4)>
array([[1.50745603, 2.61873643, 1.93507265, 2.82230063],
       [6.31019042, 7.92586319, 2.55215275, 3.03957086],
       [2.40743101, 1.4316764 , 2.0019758 , 0.20805517]])
Coordinates:
  * SKU      (SKU) int64 0 1 2
  * STORE    (STORE) int64 0 1 2 3
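
When both options are combined, one reasonable convention is to drop the missing pairs and also leave their weights out of the normalisation. The following is a sketch under that assumption on toy data of my own; whether `xskillscore` does exactly this is not shown in this notebook:

```python
import numpy as np

weights = np.linspace(1, 0, num=5)             # [1, 0.75, 0.5, 0.25, 0]
y = np.array([np.nan, 6.0, 1.0, 9.0, 4.0])     # day one has no recorded sales
yhat = np.array([2.0, 7.0, 3.0, 8.0, 9.0])

valid = ~np.isnan(y) & ~np.isnan(yhat)

# Zero out both the weight and the squared error wherever a value is missing,
# so the missing day affects neither the numerator nor the denominator.
w = np.where(valid, weights, 0.0)
squared_error = np.where(valid, (yhat - y) ** 2, 0.0)

weighted_rmse_skipna = np.sqrt(np.sum(w * squared_error) / np.sum(w))
print(weighted_rmse_skipna)
```

Here the remaining weights sum to 1.5 and the weighted squared errors sum to 3, so the result is the square root of 2.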
