### Overview of this notebook

This notebook covers:

* introduction to time series classification, regression, clustering
* `sktime` data format for "time series panels" = collections of time series
* basic vignettes for TSC, TSR, TScl
* advanced vignettes - pipelines, ensembles, tuning
* appendix: data loading
* appendix: implementing third party estimators for TSC, TSR, TScl

# 2. Learning tasks - Classification, Regression, Clustering & more

#### What are Panel tasks?

Panel tasks are learning problems that operate on a panel of data, referred to simply as panel data.

Panel data comprises multiple time series (entities or instances), where a single time series looks like:

<INSERT TIME-SERIES, Image-1>

Hence, Panel Data can be visualized as follows:

<INSERT PANEL-DATA, Image-2>

Depending on the kind of response variable and the goal of the task, we can define different tasks, all of them analogous to tasks on time-independent (often called cross-sectional) data:
1. _Classification_: The response variable is a label (e.g., good / bad, or an integer rating from 0 to 5)
    <INSERT CLASSIFICATION-BOXES, Image-3>
2. _Regression_: The response variable is numeric (floating point or integer values)
    <INSERT REGRESSION-PLOT FROM `utils.load_experiments(variables="pressure")`, Image-4>
3. _Clustering_: There is no response variable; the goal of this task is to group entities that are "similar" to each other.
    <INSERT CLUSTERING-PLOT, Image-5>
4. _Forecasting_: Given historical data, predict near-future values by capturing temporal dependencies and patterns within each panel.
    <TODO - Think of what image to add here, Image-6>
5. _TODO: Add more obscure tasks (causal inference, survival analysis) or something more relevant to the talk - distances and kernel based_
    <TODO - Think of what image to add here, Image-7>

## 2.1 Panel data - `sktime` data formats

Preferred format 1: `pd.DataFrame` with 2-level `MultiIndex`, (instance, time), columns = variables

Preferred format 2: 3D `np.ndarray` with index (instance, variable, time)

* `sktime` supports and recognizes multiple data formats for convenience and internal use, e.g., `dask`, `xarray`
* abstract data type = "scitype"; in-memory specification = "mtype"
* More information in tutorial on [in-memory data representations and data loading](https://www.sktime.net/en/latest/examples/AA_datatypes_and_datasets.html#In-memory-data-representations-and-data-loading)
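
As a rough illustration of the two preferred formats, the same small panel can be built by hand with `pandas` and `numpy` alone. The level names `instance` and `time`, the column name `var_0`, and the values are arbitrary choices for this sketch:

```python
import numpy as np
import pandas as pd

n_instances, n_timepoints = 3, 4
values = np.arange(n_instances * n_timepoints, dtype=float)

# format 1: DataFrame with a 2-level (instance, time) MultiIndex, one column per variable
index = pd.MultiIndex.from_product(
    [range(n_instances), range(n_timepoints)], names=["instance", "time"]
)
X_df = pd.DataFrame({"var_0": values}, index=index)

# format 2: 3D array with axes (instance, variable, time)
X_np = values.reshape(n_instances, 1, n_timepoints)

# both hold the same numbers: the series of instance 1, variable 0
print(X_df.xs(1, level="instance")["var_0"].to_numpy())
print(X_np[1, 0, :])
```

In the `DataFrame` layout each row is one observation, whereas in the array layout each instance is one rectangular block.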

### 2.1.1 preferred format 1 - `pd-multiindex` specification

`pd-multiindex` = `pd.DataFrame` with 2-level `MultiIndex`, (instance, time), columns = variables

In [1]:
from sktime.datasets import load_osuleaf

# load an example time series panel in pd-multiindex mtype
X, _ = load_osuleaf(return_type="pd-multiindex")

The osuleaf dataset has:

* 442 individual time series instances
* one single variable, `dim_0`
* each individual time series is observed at 427 time points

In [2]:
X

,timepoints,dim_0
0,0,0.550671
0,1,0.464716
0,2,0.375261
0,3,0.293060
0,4,0.206427
...,...,...
441,422,1.761683
441,423,1.838726
441,424,1.927078
441,425,2.004444


### 2.1.2 preferred format 2 - `numpy3D` specification

`numpy3D` = 3D `np.ndarray` with index (instance, variable, time)

The instance and time indices are interpreted as integers.

IMPORTANT: unlike `pd-multiindex`, this assumes:

* all individual series have the same length
* all individual series have the same index

In [11]:
from sktime.datasets import load_osuleaf

# load an example time series panel in numpy3D mtype
X, _ = load_osuleaf(return_type="numpy3D")

In [12]:
X.shape

(442, 1, 427)
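
The two assumptions above are what make it possible to reshape the `MultiIndex` layout into a rectangular 3D array. A minimal sketch using only `pandas` and `numpy`; the helper `to_3d_if_regular` is hypothetical and written for illustration, it is not an `sktime` function:

```python
import numpy as np
import pandas as pd


def to_3d_if_regular(df):
    """Reshape an (instance, time)-indexed frame to (instance, variable, time).

    Hypothetical helper for illustration; raises if series differ in length or index.
    """
    df = df.sort_index()
    groups = [g.droplevel(0) for _, g in df.groupby(level=0)]
    first_index = groups[0].index
    for g in groups:
        if not g.index.equals(first_index):
            raise ValueError("series differ in length or time index")
    # each group is (time, variable); transpose to (variable, time) and stack
    return np.stack([g.to_numpy().T for g in groups])


regular = pd.DataFrame(
    {"var_0": [1.0, 2.0, 3.0, 4.0]},
    index=pd.MultiIndex.from_product([[0, 1], [0, 1]], names=["instance", "time"]),
)
print(to_3d_if_regular(regular).shape)

# the second series has one extra time point, so the panel is not rectangular
ragged = pd.DataFrame(
    {"var_0": [1.0, 2.0, 3.0, 4.0, 5.0]},
    index=pd.MultiIndex.from_tuples(
        [(0, 0), (0, 1), (1, 0), (1, 1), (1, 2)], names=["instance", "time"]
    ),
)
try:
    to_3d_if_regular(ragged)
except ValueError as err:
    print("cannot reshape:", err)
```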

### 2.1.3 loading and validity checking

For custom data sets:

1. use `pandas` `read_csv` or similar utilities to obtain a `pd.DataFrame` or `np.ndarray`
2. try to bring the result into one of the preferred specifications
3. use the `check_is_mtype` utility to check compliance - inspect informative error messages
4. repeat 2-3 until the data format check passes

In [14]:
# let's pretend we just loaded this from csv
from sktime.datasets import load_osuleaf

X_pd, _ = load_osuleaf(return_type="pd-multiindex")

Let's now check whether it complies with the `pd-multiindex` specification:

In [17]:
from sktime.datatypes import check_is_mtype

valid, error_msg, metadata = check_is_mtype(X_pd, "pd-multiindex", return_metadata=True)

In [19]:
# is it valid?
valid

True

In [20]:
# helpful metadata, check if this is as per expectations
metadata

{'is_univariate': True,
 'is_empty': False,
 'has_nans': False,
 'n_instances': 442,
 'is_one_series': False,
 'is_equal_length': True,
 'is_equally_spaced': True,
 'n_panels': 1,
 'is_one_panel': True,
 'mtype': 'pd-multiindex',
 'scitype': 'Panel'}

Let's see what happens if it is not in the expected format.

We have a `pd.DataFrame`, so if we check against `numpy3D`, it should complain:

In [21]:
valid, error_msg, metadata = check_is_mtype(X_pd, "numpy3D", return_metadata=True)

In [22]:
valid

False

In [23]:
error_msg

"obj must be a numpy.ndarray, found <class 'pandas.core.frame.DataFrame'>"

This tells us that the data must first be converted to an `np.ndarray` before it can pass the `numpy3D` check.

For further details on data formats, see the tutorial on [in-memory data representations and data loading](https://www.sktime.net/en/latest/examples/AA_datatypes_and_datasets.html#In-memory-data-representations-and-data-loading).

This tutorial also contains full formal specifications of the mtypes (= machine representations).

All supported in-memory representations can be inspected from Python via `sktime.datatypes.MTYPE_REGISTER`.

Note that this includes "exotic", rarely used representations, as well as representations of objects that are not time series.


## 2.2 Classification Tasks

    -- TODO: A simple, motivating problem --

In [None]:
# TODO: Load the dataset and split

### 2.2.1 --TODO: Decide Topic Name--: A small list of famous & simple estimators to 'solve' this problem

In [None]:
# TODO-1: Estimator-1
# TODO-2: Estimator-2
# TODO-3: Estimator-3

### 2.2.2 Evaluation Metrics for Time Series Classification

In [None]:
# TODO: `scikit-learn` compatibility showcase for evaluation, provide in-house evaluation functions too

## 2.3 Regression Tasks
    -- TODO: A simple, motivating problem, relevant to the first one --

In [None]:
# TODO: Load the dataset and split

### 2.3.1 -- TODO: Decide Topic Name --: A small list of famous & simple estimators to 'solve' this problem

In [None]:
# TODO-1: Estimator-1
# TODO-2: Estimator-2
# TODO-3: Estimator-3

### 2.3.2 Evaluation Metrics for Time Series Regression

In [None]:
# TODO: `scikit-learn` compatibility showcase for evaluation, provide in-house evaluation functions too

## 2.4 Clustering Tasks
    -- TODO: A simple, motivating problem, relevant to both problems --

In [None]:
# TODO: Load the dataset and split

### 2.4.1 -- TODO: Decide Topic Name --: A small list of famous & simple estimators to 'solve' this problem

In [None]:
# TODO-1: Estimator-1
# TODO-2: Estimator-2
# TODO-3: Estimator-3

### 2.4.2 Evaluation Metrics for Time Series Clustering

In [None]:
# TODO: `scikit-learn` compatibility showcase for evaluation, provide in-house evaluation functions too

## 2.5 Introduction to Pipelines in `sktime`
    -- TODO: Decide subtopic distribution --

## 2.6 Advanced Topics
    -- TODO : Choose 2 or 3 of the following
        - DL estimators
        - Extending both classical and DL estimators for different tasks
        - Combining pipelines with DL estimators
        - GridSearch Pipelines
        - Reduction Pipelines