# Benchmarking with sktime

The benchmarking module lets you orchestrate benchmarking experiments that compare the performance of one or more algorithms over one or more data sets. It also provides a number of statistical tests to check whether observed performance differences are statistically significant.

The benchmarking module is based on [mlaut](https://github.com/alan-turing-institute/mlaut).

## Preliminaries

In [1]:
import os
from sklearn.metrics import accuracy_score
from sktime.benchmarking.data import UEADataset, make_datasets
from sktime.benchmarking.evaluation import Evaluator
from sktime.benchmarking.metrics import PairwiseMetric
from sktime.benchmarking.orchestration import Orchestrator
from sktime.benchmarking.results import HDDResults
from sktime.benchmarking.strategies import TSCStrategy
from sktime.benchmarking.tasks import TSCTask

from sktime.classification.interval_based import TimeSeriesForestClassifier

from sktime.series_as_features.model_selection import PresplitFilesCV

## Setup

In [10]:
# set paths to data and results folder
import sktime
DATAPATH = os.path.join(os.path.dirname(sktime.__file__), "datasets/data")
RESULTSPATH = "results"

### Create pointers to datasets on hard drive
Here we use `UEADataset`, which follows the [UEA/UCR format](http://www.timeseriesclassification.com), together with some of the time series classification datasets included in sktime.

In [11]:
# Create individual pointers to the datasets on disk
datasets = [
    UEADataset(path=DATAPATH, name="GunPoint"),
    UEADataset(path=DATAPATH, name="ItalyPowerDemand"),
    UEADataset(path=DATAPATH, name="ArrowHead")
]

# Alternatively, we can use a helper function to create them automatically
datasets = make_datasets(path=DATAPATH, dataset_cls=UEADataset, 
                         names=["GunPoint", "ItalyPowerDemand", "ArrowHead"])

### For each dataset, we also need to specify a learning task
In this case, all tasks are the same, because the target variable has the same name for all datasets. 

In [12]:
tasks = [TSCTask(target="target") for _ in range(len(datasets))]

### Specify learning strategies

In [13]:
# Specify learning strategies
strategies = [
    TSCStrategy(TimeSeriesForestClassifier(n_estimators=10), name="tsf10"),
    TSCStrategy(TimeSeriesForestClassifier(n_estimators=20), name="tsf20")
]



### Set up a results object
The results object encapsulates where and how benchmarking results are stored. Here we choose to write them to the hard drive.

In [14]:
# Specify results object which manages the output of the benchmarking
results = HDDResults(path=RESULTSPATH)

## Run benchmarking

In [15]:
# run orchestrator
orchestrator = Orchestrator(datasets=datasets,
                            tasks=tasks,  
                            strategies=strategies, 
                            cv=PresplitFilesCV(), 
                            results=results)
 
orchestrator.fit_predict(save_fitted_strategies=False, overwrite_predictions=True)
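
Conceptually, the orchestrator evaluates every strategy on every dataset. The following is a minimal pure-Python sketch of that grid, assuming one predefined train/test split per dataset. The function `run_grid` and the placeholder callback are hypothetical and are not part of sktime.

```python
from itertools import product

# Plain strings stand in for the dataset and strategy objects defined above.
dataset_names = ["GunPoint", "ItalyPowerDemand", "ArrowHead"]
strategy_names = ["tsf10", "tsf20"]


def run_grid(dataset_names, strategy_names, fit_predict_one):
    """Call fit_predict_one for every (dataset, strategy) pair and collect the outputs."""
    outputs = {}
    for dataset, strategy in product(dataset_names, strategy_names):
        outputs[(strategy, dataset)] = fit_predict_one(strategy, dataset)
    return outputs


# Placeholder callback that only records what would be run.
runs = run_grid(dataset_names, strategy_names, lambda s, d: f"{s} on {d}")
print(len(runs))  # 3 datasets x 2 strategies
```

Each entry in `runs` corresponds to one fitted strategy and one set of predictions that the results object would need to store.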



## Evaluate and compare results
Having run the orchestrator, we can evaluate and compare the prediction strategies.

In [8]:
evaluator = Evaluator(results)
metric = PairwiseMetric(func=accuracy_score, name="accuracy")
metrics_by_strategy = evaluator.evaluate(metric=metric)
metrics_by_strategy.head()

| | strategy | accuracy_mean | accuracy_stderr |
|---|---|---|---|
| 0 | tsf10 | 0.870884 | 0.021333 |
| 1 | tsf20 | 0.879359 | 0.018483 |
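
To make the `accuracy_mean` column more concrete, here is a small sketch of accuracy as the fraction of correct predictions, averaged over datasets. The labels are made up, and the plain average across datasets is an assumption about the aggregation, not a description of how the evaluator computes its columns.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels and predictions for two hypothetical datasets.
per_dataset = [
    accuracy(["a", "b", "a", "b"], ["a", "b", "b", "b"]),  # 3 of 4 correct
    accuracy([1, 0, 1, 1, 0], [1, 0, 1, 1, 0]),  # 5 of 5 correct
]

# Assumed aggregation: plain average of the per-dataset scores.
accuracy_mean = sum(per_dataset) / len(per_dataset)
print(per_dataset, accuracy_mean)
```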


The evaluator offers a number of additional methods for evaluating and comparing strategies, including statistical hypothesis tests and visualisation tools, for example:

In [9]:
evaluator.rank()

| | strategy | accuracy_mean_rank |
|---|---|---|
| 0 | tsf10 | 1.333333 |
| 1 | tsf20 | 1.666667 |
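
A mean rank is obtained by ranking the strategies on each dataset and averaging those ranks per strategy. The sketch below assumes that rank 1 marks the best strategy and that ties share the average rank. The helper `mean_ranks` and the scores are illustrative and are not taken from sktime. With three datasets, a mean rank of 1.333333 arises, for example, from per-dataset ranks of 1, 1 and 2.

```python
def mean_ranks(scores, higher_is_better=True):
    """Average per-dataset rank of each strategy; rank 1 is best, ties share the average rank."""
    strategies = list(scores)
    n_datasets = len(next(iter(scores.values())))
    totals = {name: 0.0 for name in strategies}
    for i in range(n_datasets):
        column = {name: scores[name][i] for name in strategies}
        for name, value in column.items():
            if higher_is_better:
                better = sum(v > value for v in column.values())
            else:
                better = sum(v < value for v in column.values())
            tied = sum(v == value for v in column.values())  # includes itself
            # Average rank over the tied block: better + (1 + tied) / 2.
            totals[name] += better + (1 + tied) / 2
    return {name: totals[name] / n_datasets for name in strategies}


# Made-up accuracies on three datasets (not the values behind the output above).
scores = {
    "tsf10": [0.95, 0.90, 0.80],
    "tsf20": [0.94, 0.89, 0.85],
}
ranks = mean_ranks(scores)
print(ranks)
```

Note that a strategy can have the better mean rank while having the lower mean score, because ranks ignore the size of the differences.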


Currently, the evaluator implements the following methods:

* `evaluator.plot_boxplots()`
* `evaluator.rank()`
* `evaluator.t_test()`
* `evaluator.sign_test()`
* `evaluator.ranksum_test()`
* `evaluator.t_test_with_bonferroni_correction()`
* `evaluator.wilcoxon_test()`
* `evaluator.friedman_test()`
* `evaluator.nemenyi()`
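
To illustrate what one of these tests checks, the following sketch implements a textbook two-sided exact sign test on paired per-dataset scores, using only the standard library. It is not the sktime implementation, and the function name `sign_test` and the accuracies are made up for the example.

```python
from math import comb


def sign_test(scores_a, scores_b):
    """Two-sided exact sign test on paired scores; ties are dropped."""
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(a < b for a, b in zip(scores_a, scores_b))
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # no informative pairs
    k = min(wins_a, wins_b)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for the two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up paired accuracies of two strategies on six datasets.
acc_a = [0.91, 0.88, 0.79, 0.95, 0.83, 0.90]
acc_b = [0.89, 0.85, 0.80, 0.93, 0.81, 0.87]
p_value = sign_test(acc_a, acc_b)
print(p_value)
```

Here the first strategy wins on five of six datasets, yet the p-value stays well above 0.05, which shows why a handful of datasets is rarely enough to establish a significant difference.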