### Overview of this notebook

* what are time series? how are they represented in `sktime`?
* terminology: univariate, multivariate, panel data, hierarchical data
* loading data and validity checking

first some basic terminology on time series (required for the above)

## 2.1 time series - terminology

* time series
* variables, univariate, multivariate
* time index
* panel of time series, instances
* hierarchical time series

#### **time series**, **time index**, **variables**

time series = recorded observations of one object or process at different time points.

observations at different time points are of same kind/type.

observations recorded with **time index** (= recorded time stamp)

(the index need not be time, only ordered - for simplicity, we still call this a time series)

observations are of **variables** (= recording of an observable)

time series with 2 or more variables is called **multivariate**

with 1 variable is called **univariate**
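
As a minimal sketch of the univariate/multivariate distinction, using plain `pandas`: the numbers and the variable names (`passengers`, `gdp`, `unemp`) are made up for illustration and are not taken from the datasets used in this notebook.

```python
import pandas as pd

# hypothetical monthly time index (all values below are made up)
index = pd.period_range("2000-01", periods=4, freq="M")

# univariate: one variable observed at each time point
y_uni = pd.Series([112.0, 118.0, 132.0, 129.0], index=index, name="passengers")

# multivariate: two variables observed at the same time points
y_multi = pd.DataFrame(
    {"gdp": [2710.3, 2778.8, 2775.5, 2785.2], "unemp": [5.8, 5.1, 5.3, 5.6]},
    index=index,
)

# number of variables = 1 for a pd.Series, number of columns for a pd.DataFrame
n_vars_uni = 1 if y_uni.ndim == 1 else y_uni.shape[1]
n_vars_multi = y_multi.shape[1]
print(n_vars_uni, n_vars_multi)
```

Both objects share the same time index; only the number of variables differs.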

**Example: airline data**

one time series recording number of airline passengers

one observable = number of passengers in given calendar month

index = which calendar month (period = span of calendar month)

In [None]:
from sktime.datasets import load_airline

y = load_airline()
y

In [None]:
# pandas models the time index as a separate object:
y.index

In [None]:
from sktime.utils.plotting import plot_series

fig, ax = plot_series(y)

**Example: macroeconomic data**

one time series recording various macroeconomic variables over time

multiple observables = GDP, unemployment, etc

index = which calendar quarter (period = span of three calendar months)

In [None]:
from sktime.datasets import load_macroeconomic

y = load_macroeconomic()

In [None]:
from sktime.utils.plotting import plot_series

for i in y.columns[0:3]:
    fig, ax = plot_series(y.loc[:, i])

common abstract data model:

data frame, with row index = time index; column index = variable index

In [None]:
y

#### **panel of time series**

panel of time series is a collection of multiple time series instances

different time series in the collection = **instances**

instances usually assumed independent, or conditionally independent

**instance index** = names/tags of the different instances

**Example: basic motions data**

multiple time series, each time series (or sequence) comes from one trial

each trial involves smartwatch recording of a person while running etc

six observables = 3 accelerometer, 3 gyroscope

index = time stamp of the observable recording

instance = which trial

In [None]:
from sktime.datasets import load_basic_motions

X, _ = load_basic_motions(return_type="pd-multiindex")
X

abstract data model: one value per instance number, time stamp, variable

no common data model, so `sktime` supports multiple (will revisit later)
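
The "one value per instance, time stamp, variable" model can be sketched with a toy `pandas` data frame. The sizes, the level names `instance`/`time`, and the column names `var_0`/`var_1` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# hypothetical toy panel: 3 instances, 4 time points, 2 variables
n_instances, n_timepoints = 3, 4
index = pd.MultiIndex.from_product(
    [range(n_instances), range(n_timepoints)], names=["instance", "time"]
)
values = np.arange(n_instances * n_timepoints * 2, dtype=float).reshape(-1, 2)
X_toy = pd.DataFrame(values, index=index, columns=["var_0", "var_1"])

# one value per (instance, time stamp, variable)
value = X_toy.loc[(1, 2), "var_1"]

# all observations of a single instance form one multivariate time series
series_1 = X_toy.loc[1]
```

Selecting a single instance with `X_toy.loc[1]` returns an ordinary data frame with time as row index, i.e., a single multivariate time series.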

#### **panel vs multivariate?**

important to distinguish independent instance from variable!

instances indicate different observations; variables indicate different observables!

Example - macroeconomic data. These are different *variables*, because they all observe the same thing - the economy.

Variables in the same economy are highly interdependent.

Example - basic motions data. These are different *instances*, because they observe different things - different humans.

Motion/gait data of different humans is independent: they do not influence each other (causally or via a confounder).

#### **hierarchical time series**

are collections of time series with nested/hierarchical instance index

example: runner & trial repetition = index; observables = motion data of that runner in repetition

example: hospital & patient = index; observables = clinical variables of that patient

example: store & product = index; observable = sales over time period in store of product

Time series at different hierarchy nodes may or may not be independent (important assumption to track)

**Example: dummy sales data**

![](../images/hierarchy.png)

concurrent time series

sales of product group, in product line, at date

one observable = sales during period, of product group

index = day on which sales are recorded

Note: hierarchies are not independent

In [None]:
from odsc_utils import load_product_hierarchy

y = load_product_hierarchy()

In [None]:
y

## 2.2 time series - `sktime` in-memory data formats

* `sktime` supports and recognizes multiple data formats for convenience and internal use, e.g., `dask`, `xarray`
* abstract data type = "scitype"; in-memory specification = "mtype"
* More information in tutorial on [in-memory data representations and data loading](https://www.sktime.net/en/latest/examples/AA_datatypes_and_datasets.html#In-memory-data-representations-and-data-loading)

#### single time series (univariate or multivariate) = `Series` data scitype

preferred format: `pd.DataFrame` with index=time, cols=variables

#### panel data (univariate or multivariate) = `Panel` data scitype

Preferred format 1: `pd.DataFrame` with 2-level `MultiIndex`, (instance, time), cols=variables

Preferred format 2: 3D `np.ndarray` with index (instance, variable, time)

#### hierarchical data = `Hierarchical` data scitype

preferred format: `pd.DataFrame` with multi-level `MultiIndex`, cols=variables

Overall, using `pd.DataFrame` is most consistent between different scenarios

### 2.2.1 `Series` preferred format 1 - `pd.DataFrame` specification

`pd.DataFrame` mtype = `pd.DataFrame` with index=time, cols=variables

In [None]:
from sktime.datasets import load_macroeconomic

# load an example time series in pd.DataFrame mtype
y = load_macroeconomic()

The macroeconomic dataset has:

* 12 variables
* observations at 203 time points (quarterly periods)

It is a single time series (not 12 separate time series instances).

In [None]:
y

### 2.2.2 `Panel` preferred format 1 - `pd-multiindex` specification

`pd-multiindex` = `pd.DataFrame` with 2-level `MultiIndex`, (instance, time), cols=variables

In [None]:
from sktime.datasets import load_basic_motions

# load an example time series panel in pd-multiindex mtype
X, _ = load_basic_motions(return_type="pd-multiindex")

The basic motions dataset has:

* 80 individual time series instances
* six variables per time series instance, `dim_0` to `dim_5`
* individual time series are observed at 100 time points (the same number for all instances)

In [None]:
X

### 2.2.3 `Panel` preferred format 2 - `numpy3D` specification

`numpy3D` = 3D `np.ndarray` with index (instance, variable, time)

instance/time index is interpreted as integer

IMPORTANT: unlike `pd-multiindex`, this assumes:

* all individual series have the same length
* all individual series have the same index

In [None]:
from sktime.datasets import load_basic_motions

# load an example time series panel in numpy3D mtype
X, _ = load_basic_motions(return_type="numpy3D")

The basic motions dataset has:

* 80 individual time series instances
* six variables per time series instance
* individual time series are observed at 100 time points (the same number for all instances)

In [None]:
X.shape
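
To make the axis order (instance, variable, time) concrete, here is a hand-rolled sketch that rearranges a toy `pd-multiindex` style data frame into that layout with plain `numpy`. It is only an illustration of the layout, not how `sktime` converts internally; the toy data and names are assumptions. It relies on exactly the two assumptions listed above: rows sorted by (instance, time) and equal length for all instances.

```python
import numpy as np
import pandas as pd

# hypothetical toy panel: 2 instances, 3 time points, 2 variables
index = pd.MultiIndex.from_product([range(2), range(3)], names=["instance", "time"])
X_df = pd.DataFrame(
    {"var_0": np.arange(6.0), "var_1": np.arange(6.0) + 100}, index=index
)

n_instances = X_df.index.get_level_values(0).nunique()
n_timepoints = len(X_df) // n_instances

# rows are ordered (instance, time), so reshape to (instance, time, variable),
# then swap the last two axes to obtain (instance, variable, time)
X_np = X_df.to_numpy().reshape(n_instances, n_timepoints, -1).transpose(0, 2, 1)
print(X_np.shape)
```

`X_np[i, j]` is then the full time series of variable `j` in instance `i`.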

### 2.2.4 `Hierarchical` preferred format - `pd_multiindex_hier` specification

`pd_multiindex_hier` = `pd.DataFrame` with multi-level `MultiIndex`,  last level is time, cols=variables

In [None]:
from odsc_utils import load_product_hierarchy

y = load_product_hierarchy()

the dummy sales data has:

* two hierarchy levels, `Product line` and `Product group`
* two hierarchy nodes for `Product line`, four hierarchy nodes for `Product group`
* a `MultiIndex` with three levels, two of them hierarchy, one time at monthly periods
* a single variable, `Sales`, each observed for the same 48 months

In [None]:
y
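
A data frame with the same kind of structure can be built by hand, as in the sketch below. The node names (`Food`, `Hardware`, etc.), the time level name `Date`, and the values are hypothetical and not those of the dummy sales data.

```python
import numpy as np
import pandas as pd

# hypothetical hierarchy: 2 product lines, 2 product groups each, 3 months
time = pd.period_range("2020-01", periods=3, freq="M")
nodes = [
    ("Food", "Drinks"),
    ("Food", "Snacks"),
    ("Hardware", "Paint"),
    ("Hardware", "Tools"),
]
index = pd.MultiIndex.from_tuples(
    [(line, group, t) for line, group in nodes for t in time],
    names=["Product line", "Product group", "Date"],
)
y_toy = pd.DataFrame({"Sales": np.arange(len(index), dtype=float)}, index=index)

# the last index level is time, all earlier levels identify the series
n_levels = y_toy.index.nlevels
n_series = len(y_toy.index.droplevel(-1).unique())

# selecting one node of the top level leaves a panel of its product groups
food = y_toy.loc["Food"]
```

Dropping the top hierarchy level in this way leaves a 2-level (instance, time) index, which is the `Panel` layout from above.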

## 2.3 loading and validity checking

for custom data sets:

1. use `pandas` `read_csv` or similar utilities to obtain a `pd.DataFrame` or `np.ndarray`
2. try to bring the result in one of the preferred specifications
3. use the `check_is_mtype` utility to check compliance - inspect informative error messages
4. repeat 2-3 until the data format check passes
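
Steps 1 and 2 can be sketched as follows for a small long-format csv. The file content and the column names (`trial`, `step`, `acc_x`, `acc_y`) are made up for illustration; the manual checks at the end are only a quick sanity pass and do not replace the formal check in step 3.

```python
import io

import pandas as pd

# hypothetical long-format csv: one row per (trial, time step), one column per variable
csv_text = """trial,step,acc_x,acc_y
0,0,0.1,1.0
0,1,0.2,1.1
1,0,0.3,1.2
1,1,0.4,1.3
"""

# step 1: read the raw table
raw = pd.read_csv(io.StringIO(csv_text))

# step 2: move the instance and time columns into a 2-level row index
X_custom = raw.set_index(["trial", "step"]).sort_index()

# quick manual sanity checks before the formal mtype check
has_two_levels = X_custom.index.nlevels == 2
is_sorted = X_custom.index.is_monotonic_increasing
no_duplicates = not X_custom.index.duplicated().any()
print(has_two_levels, is_sorted, no_duplicates)
```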

In [None]:
# let's pretend we just loaded this from csv
from sktime.datasets import load_osuleaf

X_pd, _ = load_osuleaf(return_type="pd-multiindex")

let's now check whether it complies with the `pd-multiindex` specification

In [None]:
from sktime.datatypes import check_is_mtype

valid, error_msg, metadata = check_is_mtype(X_pd, "pd-multiindex", return_metadata=True)

In [None]:
# is it valid?
valid

In [None]:
# helpful metadata, check if this is as per expectations
metadata

let's see what happens if it is not in the expected format.

We have a `pd.DataFrame`, so if we check against `numpy3D`, it should complain:

In [None]:
valid, error_msg, metadata = check_is_mtype(X_pd, "numpy3D", return_metadata=True)

In [None]:
valid

In [None]:
error_msg

This tells us that the data must first be converted to an `np.ndarray` before it can pass the `numpy3D` check.

For further details on data formats, see the tutorial on [in-memory data representations and data loading](https://www.sktime.net/en/latest/examples/AA_datatypes_and_datasets.html#In-memory-data-representations-and-data-loading).

The "datatypes" tutorial also contains:

* full formal specifications of the mtypes (= machine representations)
* common examples for loading from csv and formatting
* utilities for loading data for commonly used benchmark problems

All supported in-memory representations can be inspected from Python in `sktime.datatypes.MTYPE_REGISTER`

Note that this includes "exotic", rarely used ones and representations of objects that aren't time series.
Formats for time series panels are indicated by the `Panel` scitype.


---

### Credits: notebook 2 - time series

notebook creation: fkiraly