# Tabular Data

`xskillscore` can be used on tabular data such as that stored in a `pandas.DataFrame`.

It is most useful when predictions have to be evaluated separately for each value of one or more fields (columns).

In [None]:
import numpy as np
import pandas as pd
import xskillscore as xs
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error

np.random.seed(seed=42)

## California house prices dataset

A small example is to take a dataset and evaluate the model according to a field (column).

Load the California house prices dataset:

In [None]:
housing = fetch_california_housing(as_frame=True)
df = housing.frame
df["AveRooms"] = df["AveRooms"].round()
df = df.rename(columns={"MedHouseVal": "y"})
df

Create a dummy prediction column by adding noise to `y`:

In [None]:
noise = np.random.uniform(-1, 1, size=len(df["y"]))
df["yhat"] = (df["y"] + (df["y"] * noise)).clip(lower=df["y"].min())

Evaluate the model over the field `AveRooms` using `pandas.groupby.apply` with `mean_squared_error` from `scikit-learn`:

In [None]:
df.groupby("AveRooms").apply(lambda x: mean_squared_error(x["y"], x["yhat"])).head()

You can compute the same result using `xskillscore`.

First, structure the `pandas.DataFrame` to keep the core fields when converting to an `xarray` object:

In [None]:
min_df = df.reset_index().set_index(["index", "AveRooms"])[["y", "yhat"]]
min_df

Convert it to an `xarray.Dataset` using `pandas.DataFrame.to_xarray`. Note: This will create an array of `index` by `AveRooms` and pad the values that do not exist with `nan`.

In [None]:
ds = min_df.to_xarray()
ds
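The padding behaviour is easier to see on a tiny frame. The following is a minimal sketch using a hypothetical three-row `small` DataFrame (not part of the housing data), assuming `xarray` is installed so that `to_xarray` is available:

```python
import pandas as pd

# Hypothetical three-row frame: two rows share AveRooms == 4.0.
small = pd.DataFrame(
    {"AveRooms": [4.0, 5.0, 4.0], "y": [1.0, 2.0, 3.0], "yhat": [1.5, 2.5, 2.0]}
)
small_ds = (
    small.reset_index().set_index(["index", "AveRooms"])[["y", "yhat"]].to_xarray()
)

# Every (index, AveRooms) pair becomes a grid cell, so the 3 rows
# are spread over a 3 x 2 grid and the 3 unused cells are filled with nan.
print(dict(small_ds.sizes))
print(small_ds["y"].values)
```

Each row occupies exactly one cell of the grid, which is why most of the `xarray.Dataset` built from the housing data is `nan`.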

You can now apply any metric from `xskillscore` using the accessor method. The input for the `dim` argument is `index`, as we want to reduce this dimension and apply the metric over `AveRooms`. In addition, there are `nan` values in the `xarray.Dataset`, so you should use `skipna=True`:

In [None]:
out = ds.xs.mse("y", "yhat", dim="index", skipna=True)
out
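To make the roles of `dim` and `skipna` concrete, the same reduction can be written by hand with plain `xarray` operations. This is only a sketch on a hypothetical four-row `small` frame; it illustrates the arithmetic and is not meant to describe how `xskillscore` implements `mse` internally:

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame with two AveRooms groups.
small = pd.DataFrame(
    {
        "AveRooms": [4.0, 5.0, 4.0, 5.0],
        "y": [1.0, 2.0, 3.0, 4.0],
        "yhat": [1.5, 2.5, 2.0, 4.0],
    }
)
small_ds = (
    small.reset_index().set_index(["index", "AveRooms"])[["y", "yhat"]].to_xarray()
)

# Manual MSE: squared error, then a nan-skipping mean over the "index" dimension.
manual_mse = ((small_ds["y"] - small_ds["yhat"]) ** 2).mean(dim="index", skipna=True)

# Reference: the same quantity computed per group in plain pandas.
expected = ((small["y"] - small["yhat"]) ** 2).groupby(small["AveRooms"]).mean()

print(manual_mse.to_series())
print(expected)
```

Reducing over `index` leaves one value per `AveRooms`, and skipping `nan` means the padded cells do not contribute to the mean.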

Since the input was tabular, it makes sense to return the result in tabular form. You can call `xarray.DataArray.to_series` to convert it to a `pandas.Series`:

In [None]:
out.to_series().head()

## Evaluating predictions over many columns

`xskillscore` is built upon `xarray.apply_ufunc`, which offers a speed-up by vectorizing operations. As a result, `xskillscore` can be faster than `pandas.groupby.apply`. This is especially true if there are many samples in the dataset and if the predictions have to be evaluated over many fields.

For this exercise we will create fake data with three fields (`DATE`, `STORE` and `SKU`) and evaluate the predictions for each combination of `STORE` and `SKU`:

In [None]:
stores = np.arange(100)
skus = np.arange(100)
dates = pd.date_range("1/1/2020", "1/10/2020", freq="D")

# Build one row per (date, store, sku) combination with a random target in 1..9
rows = []
for date in dates:
    for store in stores:
        for sku in skus:
            rows.append(
                {
                    "DATE": date,
                    "STORE": store,
                    "SKU": sku,
                    "y": np.random.randint(9) + 1,
                }
            )
df = pd.DataFrame(rows)

noise = np.random.uniform(-1, 1, size=len(df["y"]))
df["yhat"] = (df["y"] + (df["y"] * noise)).clip(lower=df["y"].min())
df
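The nested loops make the structure of the data explicit but are slow to run. As an alternative, here is a sketch that builds a frame of the same shape with `pandas.MultiIndex.from_product`. It assumes a `numpy.random.Generator` (`rng`) instead of `np.random.seed`, and the name `fake_df` is illustrative, so the random values will not match the cell above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # assumption: Generator API rather than np.random.seed

stores = np.arange(100)
skus = np.arange(100)
dates = pd.date_range("1/1/2020", "1/10/2020", freq="D")

# Cartesian product of the three fields, in the same DATE, STORE, SKU order as the loop.
index = pd.MultiIndex.from_product(
    [dates, stores, skus], names=["DATE", "STORE", "SKU"]
)
fake_df = index.to_frame(index=False)

# Integers from 1 to 9 inclusive, the same range as np.random.randint(9) + 1.
fake_df["y"] = rng.integers(1, 10, size=len(fake_df))

noise = rng.uniform(-1, 1, size=len(fake_df))
fake_df["yhat"] = (fake_df["y"] + fake_df["y"] * noise).clip(lower=fake_df["y"].min())
```

The result has one row per `DATE`, `STORE` and `SKU` combination, so it can be used in place of `df` in the timing cells that follow.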

Time the `pandas.groupby.apply` method:

In [None]:
%%time
df.groupby(["STORE", "SKU"]).apply(lambda x: mean_squared_error(x["y"], x["yhat"]))

Time it using `xskillscore`:

In [None]:
%%time
df.set_index(["DATE", "STORE", "SKU"]).to_xarray().xs.mse(
    "y", "yhat", dim="DATE"
).to_series()
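For context, `groupby(...).apply` calls a Python function once per group. A pandas-only alternative is to compute the squared errors for the whole column first and let `groupby` average them. The sketch below uses a small hypothetical `toy` frame to check that this gives the same numbers as the per-group function. It makes no claim about relative speed, so it may be worth timing alongside the two cells above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy data: 2 stores x 2 SKUs x 3 observations each.
toy = pd.DataFrame(
    {
        "STORE": np.repeat([0, 1], 6),
        "SKU": np.tile(np.repeat([0, 1], 3), 2),
        "y": rng.integers(1, 10, size=12).astype(float),
    }
)
toy["yhat"] = toy["y"] + rng.uniform(-1, 1, size=12)

# Compute squared errors once for the whole column, then average within groups.
squared_error = (toy["y"] - toy["yhat"]) ** 2
vectorized = squared_error.groupby([toy["STORE"], toy["SKU"]]).mean()

# Reference: a per-group function passed to apply (MSE written with NumPy here).
per_group = toy.groupby(["STORE", "SKU"])[["y", "yhat"]].apply(
    lambda g: np.mean((g["y"] - g["yhat"]) ** 2)
)

print(vectorized)
```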

See [xskillscore-tutorial](https://github.com/raybellwaves/xskillscore-tutorial) for further reading.