**Set-up instructions:** this notebook gives a tutorial on in-memory data representations and data loading in `sktime`.
On binder, this should run out-of-the-box.

To run this notebook as intended, ensure that `sktime` with basic dependency requirements is installed in your python environment.

To run this notebook with a local development version of sktime, either uncomment and run the below, or `pip install -e` a local clone of the `sktime` `main` branch.

In [None]:
# import sys
# sys.path.append("..")

# In-memory data representations and data loading

`sktime` provides modules for a number of time series related learning tasks.

These modules use `sktime` specific in-memory (i.e., python workspace) representations for time series and related objects, most importantly individual time series and time series panels. `sktime`'s in-memory representations rely on `pandas` and `numpy`, with additional conventions on the `pandas` and `numpy` objects.

Users of `sktime` should be aware of these representations, since presenting the data in an `sktime` compatible representation is usually the first step in using any of the `sktime` modules.

This notebook introduces the data types used in `sktime`, related functionality such as converters and validity checkers, and common workflows for loading and conversion:

**Section 1** introduces in-memory data containers used in `sktime`, with examples.

**Section 2** introduces validity checkers and conversion functionality for in-memory data containers.

**Section 3** introduces common workflows to load data from file formats.

## Section 1: in-memory data containers

This section provides a reference to data containers used for time series and related objects in `sktime`.

Conceptually, `sktime` distinguishes:

* the *scientific type* (or short: scitype) of a data container, defined by relational and statistical properties of the data being represented - for instance, a (scientific) "time series" or a (scientific) "time series panel", in the mathematical-abstract sense, without specifying a machine representation
* the *machine type* (or short: mtype) of a data container, which, for a defined *scientific type*, specifies the python type and conventions on structure and value of the python in-memory object. For instance, a specific (scientific) time series is represented by a concrete `pandas.DataFrame` in `sktime`, subject to certain conventions on the `pandas.DataFrame`. Formally, these conventions form a specific mtype, i.e., a way to represent the (abstract) "time series" scitype.

In `sktime`, the same scitype can be represented by multiple mtypes. For instance, `sktime` allows the user to specify time series as `pandas.DataFrame`, as `pandas.Series`, or as a `numpy.ndarray`. These are different mtypes which are admissible representations of the same scitype, "time series". Also, not all mtypes are equally rich in metadata - for instance, `pandas.DataFrame` can store column names, while this is not possible in `numpy.ndarray`.
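
As a minimal sketch of this distinction, the cell below builds the same univariate series in the three containers named above, using plain `pandas` and `numpy` only. The values and the variable name `"a"` are illustrative assumptions, not taken from `sktime`.

```python
import numpy as np
import pandas as pd

# illustrative values for one variable observed at four time points
values = [1.0, 4.0, 0.5, -3.0]

# the same univariate series in three machine representations
as_dataframe = pd.DataFrame({"a": values})   # keeps the variable name as a column
as_series = pd.Series(values, name="a")      # keeps the variable name in .name
as_array = np.array(values).reshape(-1, 1)   # 2D array, no variable names

# all three hold the same numbers
same_values = np.array_equal(as_dataframe.to_numpy(), as_array) and np.array_equal(
    as_series.to_numpy(), as_array[:, 0]
)
```

Only the two `pandas` containers carry the variable name; the array stores the values alone.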

Both scitypes and mtypes are encoded by strings in `sktime`, for easy reference.

This section introduces the mtypes for the following scitypes:
* `"Series"`, the `sktime` scitype for time series of any kind
* `"Panel"`, the `sktime` scitype for time series panels of any kind

### Section 1.1: Time series - the `"Series"` scitype

The major representations of time series in `sktime` are:

* `"pd.DataFrame"` - a uni- or multivariate `pandas.DataFrame`, with rows = time points, cols = variables
* `"pd.Series"` - a (univariate) `pandas.Series`, with entries corresponding to different time points
* `"np.ndarray"` - a 2D `numpy.ndarray`, with rows = time points, cols = variables

`pandas` objects must have one of the following `pandas` index types:
`Int64Index`, `RangeIndex`, `DatetimeIndex`, `PeriodIndex`; if `DatetimeIndex`, the `freq` attribute must be set.

`numpy.ndarray` 2D arrays are interpreted as having a `RangeIndex` on the rows, and are generally equivalent to the `pandas.DataFrame` obtained after default coercion using the `pandas.DataFrame` constructor.
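
The two index conventions above can be illustrated with plain `pandas`. This is a sketch with made-up values: it shows what default coercion of a 2D array produces, and the difference between a `DatetimeIndex` with and without `freq` set.

```python
import numpy as np
import pandas as pd

arr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# default coercion: rows get a RangeIndex, columns get integer labels
coerced = pd.DataFrame(arr)
has_range_index = isinstance(coerced.index, pd.RangeIndex)

# a DatetimeIndex built from explicit timestamps has no freq set
no_freq = pd.DatetimeIndex(["2021-01-01", "2021-01-02", "2021-01-03"])

# pd.date_range sets the freq attribute on construction
with_freq = pd.date_range("2021-01-01", periods=3, freq="D")
```

Both indices contain the same timestamps, but only `with_freq` has the `freq` attribute set, as the convention above requires.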


In [None]:
# import to retrieve examples
from sktime.datatypes import get_examples

### Section 1.1.1: Time series - the `"pd.DataFrame"` mtype

In the `"pd.DataFrame"` mtype, time series are represented by an in-memory container `obj: pandas.DataFrame` as follows.

* structure convention: `obj.index` must be monotonic, and one of `Int64Index`, `RangeIndex`, `DatetimeIndex`, `PeriodIndex`.
* variables: columns of `obj` correspond to different variables
* variable names: column names `obj.columns`
* time points: rows of `obj` correspond to different, distinct time points
* time index: `obj.index` is interpreted as a time index.
* capabilities: can represent multivariate series; can represent unequally spaced series
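
A hand-built container following these conventions might look as follows. This is a sketch with illustrative values; the time points 0, 1, 4, 9 are chosen to show that unequal spacing is allowed.

```python
import pandas as pd

# two variables "a" and "b", observed at unequally spaced integer time points
series = pd.DataFrame(
    {"a": [1.0, 4.0, 0.5, -3.0], "b": [3.0, 7.0, 2.0, -0.5]},
    index=pd.Index([0, 1, 4, 9]),
)

index_is_monotonic = series.index.is_monotonic_increasing
n_timepoints, n_variables = series.shape
variable_names = list(series.columns)
```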

Example of a univariate series in `"pd.DataFrame"` representation.
The single variable has name `"a"`, and is observed at four time points 0, 1, 2, 3.

In [None]:
get_examples(mtype="pd.DataFrame", as_scitype="Series")[0]

Example of a bivariate series in `"pd.DataFrame"` representation.
This series has two variables, named `"a"` and `"b"`. Both are observed at the same four time points 0, 1, 2, 3.

In [None]:
get_examples(mtype="pd.DataFrame", as_scitype="Series")[1]

### Section 1.1.2: Time series - the `"pd.Series"` mtype

In the `"pd.Series"` mtype, time series are represented by an in-memory container `obj: pandas.Series` as follows.

* structure convention: `obj.index` must be monotonic, and one of `Int64Index`, `RangeIndex`, `DatetimeIndex`, `PeriodIndex`.
* variables: there is a single variable, corresponding to the values of `obj`. Only univariate series can be represented.
* variable names: by default, there is no column name. If needed, a variable name can be provided as `obj.name`.
* time points: entries of `obj` correspond to different, distinct time points
* time index: `obj.index` is interpreted as a time index.
* capabilities: cannot represent multivariate series; can represent unequally spaced series
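
As a sketch of these conventions, the cell below builds a named `pandas.Series` with a monthly `PeriodIndex`. The values and dates are illustrative assumptions. It also shows how the optional `obj.name` turns into a column name when the series is moved to a data frame with plain `pandas`.

```python
import pandas as pd

series = pd.Series(
    [1.0, 4.0, 0.5, -3.0],
    index=pd.period_range("2021-01", periods=4, freq="M"),
    name="a",
)

# the single variable becomes the single column of a data frame
frame = series.to_frame()
has_period_index = isinstance(series.index, pd.PeriodIndex)
```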

Example of a univariate series in `"pd.Series"` mtype representation.
The single variable has name `"a"`, and is observed at four time points 0, 1, 2, 3.

In [None]:
get_examples(mtype="pd.Series", as_scitype="Series")[0]

### Section 1.1.3: Time series - the `"np.ndarray"` mtype

In the `"np.ndarray"` mtype, time series are represented by an in-memory container `obj: np.ndarray` as follows.

* structure convention: `obj` must be 2D, i.e., `obj.shape` must have length 2. This is also true for univariate time series.
* variables: variables correspond to columns of `obj`.
* variable names: the `"np.ndarray"` mtype cannot represent variable names.
* time points: the rows of `obj` correspond to different, distinct time points. 
* time index: The time index is implicit and by-convention. The `i`-th row (for an integer `i`) is interpreted as an observation at the time point `i`.
* capabilities: can represent multivariate series; cannot represent unequally spaced series
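
The 2D requirement is easy to miss for univariate data. As a sketch with illustrative values, a 1D array can be brought into the expected layout by reshaping it to a single column:

```python
import numpy as np

flat = np.array([1.0, 4.0, 0.5, -3.0])  # 1D: does not follow the convention
univariate = flat.reshape(-1, 1)         # 2D with one column

# row i is the observation at the implicit time point i
value_at_time_2 = univariate[2, 0]
```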

Example of a univariate series in `"np.ndarray"` mtype representation.
There is a single (unnamed) variable, observed at four time points 0, 1, 2, 3.

In [None]:
get_examples(mtype="np.ndarray", as_scitype="Series")[0]

Example of a bivariate series in `"np.ndarray"` mtype representation.
There are two (unnamed) variables, both observed at four time points 0, 1, 2, 3.

In [None]:
get_examples(mtype="np.ndarray", as_scitype="Series")[1]

### Section 1.2: Time series panels - the `"Panel"` scitype

The major representations of time series panels in `sktime` are:

* `"pd-multiindex"` - a `pandas.DataFrame`, with row multi-index (`instances`, `timepoints`), cols = variables
* `"numpy3D"` - a 3D `np.ndarray`, with axis 0 = instances, axis 1 = variables, axis 2 = time points
* `"df-list"` - a `list` of `pandas.DataFrame`, with list index = instances, data frame rows = time points, data frame cols = variables

These representations are considered primary representations in `sktime` and are core to internal computations.

There are further, minor representations of time series panels in `sktime`:

* `"nested_univ"` - a `pandas.DataFrame`, with `pandas.Series` in cells. data frame rows = instances, data frame cols = variables, and series axis = time points
* `"numpyflat"` - a 2D `np.ndarray` with rows = instances, and columns indexed by a pair index of (variables, time points). Other formats can be converted to this format, but it cannot be converted back, since the number of variables and time points may be ambiguous.
* `"pd-wide"` - a `pandas.DataFrame` in wide format: has column multi-index (variables, time points), rows = instances
* `"pd-long"` - a `pandas.DataFrame` in long format: has cols `instances`, `timepoints`, `variable`, `value`; entries in `value` are indexed by tuples of values in (`instances`, `timepoints`, `variable`).
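
The ambiguity mentioned for `"numpyflat"` can be seen with a plain `numpy` reshape. This is a sketch only: the column ordering produced by `reshape` is an assumption here and may differ from the ordering used by `sktime`'s own converters.

```python
import numpy as np

# 3 instances, 2 variables, 4 time points
panel = np.arange(24).reshape(3, 2, 4)

# flatten variables and time points into one column axis: shape (3, 8)
flat = panel.reshape(3, -1)

# without knowing the number of variables, several 3D shapes fit the flat array
candidate_a = flat.reshape(3, 2, 4)  # 2 variables, 4 time points (the original)
candidate_b = flat.reshape(3, 4, 2)  # 4 variables, 2 time points
```

Both candidates are valid 3D arrays built from the same flat data, which is why the flat format alone does not determine the original panel.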

The minor representations are currently not fully consolidated in-code and are not discussed further below. Contributions are appreciated.

### Section 1.2.1: Time series panels - the `"pd-multiindex"` mtype

In the `"pd-multiindex"` mtype, time series panels are represented by an in-memory container `obj: pandas.DataFrame` as follows.

* structure convention: `obj.index` must be a pair multi-index of type `(RangeIndex, t)`, where `t` is one of `Int64Index`, `RangeIndex`, `DatetimeIndex`, `PeriodIndex` and monotonic. `obj.index` must have names `("instances", "timepoints")`.
* instances: rows with the same `"instances"` index correspond to the same instance; rows with different `"instances"` index correspond to different instances.
* instance index: the first element of pairs in `obj.index` is interpreted as an instance index. 
* variables: columns of `obj` correspond to different variables
* variable names: column names `obj.columns`
* time points: rows of `obj` with the same `"timepoints"` index correspond to the same time point; rows of `obj` with different `"timepoints"` index correspond to different time points.
* time index: the second element of pairs in `obj.index` is interpreted as a time index. 
* capabilities: can represent panels of multivariate series; can represent unequally spaced series; can represent panels of unequally supported series; cannot represent panels of series with different sets of variables.
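
A container with this structure can be sketched with plain `pandas` as below. The values are illustrative; the variable names `"var_0"` and `"var_1"` mirror those of the `sktime` example that follows.

```python
import pandas as pd

# pair multi-index: 3 instances, each observed at time points 0, 1, 2
index = pd.MultiIndex.from_product(
    [range(3), range(3)], names=["instances", "timepoints"]
)
panel = pd.DataFrame(
    {"var_0": list(range(9)), "var_1": list(range(10, 19))},
    index=index,
)

# selecting on the first index level returns the series of one instance
first_instance = panel.loc[0]
```

Here `first_instance` is a data frame indexed by time points only, with one column per variable.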

Example of a panel of multivariate series in `"pd-multiindex"` mtype representation.
The panel contains three multivariate series, with instance indices 0, 1, 2. All series have two variables with names `"var_0"`, `"var_1"`. All series are observed at three time points 0, 1, 2.

In [None]:
get_examples(mtype="pd-multiindex", as_scitype="Panel")[0]

### Section 1.2.2: Time series panels - the `"numpy3D"` mtype

In the `"numpy3D"` mtype, time series panels are represented by an in-memory container `obj: np.ndarray` as follows.

* structure convention: `obj` must be 3D, i.e., `obj.shape` must have length 3.
* instances: instances correspond to axis 0 elements of `obj`.
* instance index: the instance index is implicit and by-convention. The `i`-th element of axis 0 (for an integer `i`) is interpreted as instance `i`.
* variables: variables correspond to axis 1 elements of `obj`.
* variable names: the `"numpy3D"` mtype cannot represent variable names.
* time points: time points correspond to axis 2 elements of `obj`.
* time index: the time index is implicit and by-convention. The `i`-th element of axis 2 (for an integer `i`) is interpreted as an observation at the time point `i`.
* capabilities: can represent panels of multivariate series; cannot represent unequally spaced series; cannot represent panels of unequally supported series; cannot represent panels of series with different sets of variables.
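
The axis order can be sketched with plain `numpy`, using illustrative sizes and values. Note that a single instance sliced from the panel has variables on its rows, so it needs a transpose to match the rows = time points layout of the `"np.ndarray"` mtype for `"Series"`.

```python
import numpy as np

n_instances, n_variables, n_timepoints = 3, 2, 4
panel = np.zeros((n_instances, n_variables, n_timepoints))

# fill variable 0 of instance 1 with four observations
panel[1, 0, :] = [1.0, 2.0, 3.0, 4.0]

instance_1 = panel[1]             # shape (n_variables, n_timepoints)
instance_1_as_rows = instance_1.T  # shape (n_timepoints, n_variables)
```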

Example of a panel of multivariate series in `"numpy3D"` mtype representation.
The panel contains three multivariate series, with instance indices 0, 1, 2. All series have two variables (unnamed). All series are observed at three time points 0, 1, 2.

In [None]:
get_examples(mtype="numpy3D", as_scitype="Panel")[0]

### Section 1.2.3: Time series panels - the `"df-list"` mtype

In the `"df-list"` mtype, time series panels are represented by an in-memory container `obj: List[pandas.DataFrame]` as follows.

* structure convention: `obj` must be a list of `pandas.DataFrame` objects. Individual list elements of `obj` must follow the `"pd.DataFrame"` mtype convention for the `"Series"` scitype.
* instances: instances correspond to different list elements of `obj`.
* instance index: the instance index of an instance is the list index at which it is located in `obj`. That is, the data at `obj[i]` correspond to observations of the instance with index `i`.
* variables: columns of `obj[i]` correspond to different variables available for instance `i`.
* variable names: column names `obj[i].columns` are the names of variables available for instance `i`.
* time points: rows of `obj[i]` correspond to different, distinct time points, at which instance `i` is observed.
* time index: `obj[i].index` is interpreted as the time index for instance `i`.
* capabilities: can represent panels of multivariate series; can represent unequally spaced series; can represent panels of unequally supported series; can represent panels of series with different sets of variables.
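
As a sketch of the extra flexibility, the list below holds two series with different lengths and different time indices. The values are illustrative. The final step stacks the list into a long layout with a (`instances`, `timepoints`) row multi-index using plain `pandas.concat`; this is not `sktime`'s own converter, only an illustration of how the two layouts relate.

```python
import pandas as pd

panel = [
    pd.DataFrame({"var_0": [1, 2, 3], "var_1": [4, 5, 6]}),
    pd.DataFrame({"var_0": [1, 2], "var_1": [4, 55]}, index=[0, 5]),
]

# instances may have different numbers of time points
lengths = [len(df) for df in panel]

# stack into one data frame, using the list position as instance index
stacked = pd.concat(panel, keys=[0, 1], names=["instances", "timepoints"])
```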

Example of a panel of multivariate series in `"df-list"` mtype representation.
The panel contains three multivariate series, with instance indices 0, 1, 2. All series have two variables with names `"var_0"`, `"var_1"`. All series are observed at three time points 0, 1, 2.

In [None]:
get_examples(mtype="df-list", as_scitype="Panel")[0]

## Section 2: validity checking and mtype conversion

`sktime`'s `datatypes` module provides users with generic functionality for:

* checking in-memory containers against mtype conventions, with informative error messages that help move data to the right format
* converting different mtypes to each other, for a given scitype

In this section, this functionality and intended usage workflows are presented.
There is further validation functionality in the `utils` module, meant for internal use, as under-the-hood functionality or by developers; this is also briefly summarized at the end.

### Section 2.1: Preparing data, checking in-memory containers for validity
