[ENH] New transformation based pipeline classifiers (#1721)

* typo * fitted logic in panel case * boxcox refactored as example * linting * linting * added new tags * added checks for object dtype in Series types * fixing if/else logic path 2 and 3 * added nested_univ to df-list conversion * and back * changed fit-in-transform default to True * linting * linting * improved informativity of estimator tags error message * added more tags * fixed missing tag in boxcox refactor * random interval transformer * updated fit check with check_is_fitted * removed superfluous comments * remove return_pred_int errors * corrected data frame check * added inner types for panel trafos * output conversion for Series/Primitives * signature fixes * linting signatures * linting * linting signatures * stc transform saving * signatures update * typo * coerce to pandas * japanese vowels test * Fixing error on parameters for experiment * relaxing numpy series mtype * new rotation forest * remove useless atts * rotf contracting, train estimate and comments * doc fix * rotf example fix * stc comments p1 * include rotation forest * hc refactor p1 * hc1 test fix * stc rotf fix * cboss train probs * use transformed data in stc * hc1 test outline * config change * hc2.0 * rotf seed fix * st * st fix * stc and st tests and finishing touches * remove mstc dev * more st fixing * import code quality * rotf code quality * hc1 test code quality * hc2 tests and contracting * hc2 stc default * hc2 stc default (fix) * st rename and test fixes * hc2 train estimate fix * stc test fixes and codeowners * rotf fix * remove identical shapelets at end and transform n_jobs * rounding ST ig to prevent differences between OS * rotf test fix * st and stc docs * hc1 and hc2 test fixes * hc1 and hc2 examples fix * higher max shapelets for tests * more mv cases in tests * hc test parameters * hc test parameters 2 * hc test parameters 3 * hc test parameters 4 * wrong probas * hopefully done * dumb linalg packages ruin my week by causing differences between os * experiments 
update and a couple of getattr bugfixes * contrib experiments update * set classifier hc2 * base * enforce univariate in base class * remove unnecessary classifier tags * predict and predict_proba * tweaks to classifier base class * formatting 1 * formatting 3 * formatting 4 * formatting 6? * blank lines or no blank lines? * remove unnecessary argument to get_tag * negate tag correctly, remove unnecessary get_tag argument * correct tag negation * _predict _predict_proba _predict made abstract, _ predict_proba given a default implmentation n_classes_ added to base class * formatting 1 * formatting 2 * renamed fitted_trafos to transformers * deprecation warning category, bump version number * linting * linting * linting * linting * added defensive assert * added error message to defensive assertion * added transform-input tag in docstring * wrong tag used, now "univariate-only" * corrected reference in ForecastingPipeline * clarified mtype in docstring * linting * fixing transformer tests to correctly refer to create_test_instance * Revert "fixing transformer tests to correctly refer to create_test_instance" This reverts commit ad6877f. 
* transformer extension template * extension template docstrings * HC comments an experiments fixes * added link to transformer extension template in README * remove deprecated versions and more feature based updates * test config * feature based test tolerance * clearer mtype/scitype tracking logic * minor update to extension template * refactored input checks * bug in check_is function fixed * linting * linting * removing mtypes with unsupported checks * fixed typo in signature method * signature _fit should ignore y * added missing docstring in _series_as_panel/_convert * fixed docstring * corrected docstring * added valueerror message and capture for dim 1 np.ndarray in series/panel conversions * scitypes condition * linting * changed variable names in fit * bugfix in output conversion in transform * added comments suggested by Lovkush * remove new transform classifiers * signature and mp tests * signature example * reintroduce new transformers/transform classifiers * only mv summarycls is broken now * replaced fit-copy with fit-reference * Update extension_templates/transformer.py Co-authored-by: Lovkush <lovkush@gmail.com> * clarify special case * clarified * clarification on output type * fixed test reference * replaced transform(Z) reference by X reference in test_date * moved comments before lines * numpydoc compliance in extension template * clarified imports * added clarification on init * est2 fixed * removed fixed arg for transform testing * refactor summarizer and base class * linting * wrong place * output for transformer fix * had wrong X_inner_mtype * fixed test_raises_not_fitted_error test * changed fit-in-transform behaviour to "fit must always be called even if empty" * fresh prince * fresh prince 2 * i have to make this commit to swap branches and fix a bug * fix secondary error caused by changing fit-in-transform behaviour * testing changes, still some train estimate stuff to do * fresh prince test * summary classifier fix and docs * pre-pr 
bug fix * catch22 replace nans * catch22 single feature public * catch22 nan replacement change * tsfresh link Co-authored-by: Franz Király <f.kiraly@ucl.ac.uk> Co-authored-by: a-pasos-ruiz <56823538+a-pasos-ruiz@users.noreply.github.com> Co-authored-by: Alejandro Pasos Ruiz (CMP - Postgraduate Researcher) <fbu19zru@UEA.AC.UK> Co-authored-by: Tony Bagnall <ajb@uea.ac.uk> Co-authored-by: Lovkush <lovkush@gmail.com>
sktime · Dec 19, 2021 · 8fee732
1 parent 5c608f0

Showing 18 changed files with 2,336 additions and 956 deletions.
diff --git a/sktime/classification/feature_based/__init__.py b/sktime/classification/feature_based/__init__.py
@@ -3,15 +3,23 @@
 __all__ = [
     "Catch22Classifier",
     "MatrixProfileClassifier",
+    "RandomIntervalClassifier",
     "SignatureClassifier",
+    "SummaryClassifier",
     "TSFreshClassifier",
+    "FreshPRINCE",
 ]
 
 from sktime.classification.feature_based._catch22_classifier import Catch22Classifier
+from sktime.classification.feature_based._fresh_prince import FreshPRINCE
 from sktime.classification.feature_based._matrix_profile_classifier import (
     MatrixProfileClassifier,
 )
+from sktime.classification.feature_based._random_interval_classifier import (
+    RandomIntervalClassifier,
+)
 from sktime.classification.feature_based._signature_classifier import (
     SignatureClassifier,
 )
+from sktime.classification.feature_based._summary_classifier import SummaryClassifier
 from sktime.classification.feature_based._tsfresh_classifier import TSFreshClassifier
diff --git a/sktime/classification/feature_based/_catch22_classifier.py b/sktime/classification/feature_based/_catch22_classifier.py
@@ -26,6 +26,8 @@ class Catch22Classifier(BaseClassifier):
     outlier_norm : bool, default=False
         Normalise each series during the two outlier catch22 features, which can take a
         while to process for large values
+    replace_nans : bool, default=True
+        Replace NaN or inf values from the catch22 transform with 0.
     estimator : sklearn classifier, default=None
         An sklearn estimator to be built using the transformed data. Defaults to a
         Random Forest with 200 trees.
@@ -83,11 +85,13 @@ class Catch22Classifier(BaseClassifier):
     def __init__(
         self,
         outlier_norm=False,
+        replace_nans=True,
         estimator=None,
         n_jobs=1,
         random_state=None,
     ):
         self.outlier_norm = outlier_norm
+        self.replace_nans = replace_nans
         self.estimator = estimator
 
         self.n_jobs = n_jobs
@@ -118,7 +122,9 @@ def _fit(self, X, y):
         Changes state by creating a fitted model that updates attributes
         ending in "_" and sets is_fitted flag to True.
         """
-        self._transformer = Catch22(outlier_norm=self.outlier_norm)
+        self._transformer = Catch22(
+            outlier_norm=self.outlier_norm, replace_nans=self.replace_nans
+        )
 
         self._estimator = _clone_estimator(
             RandomForestClassifier(n_estimators=200)
@@ -132,7 +138,7 @@ def _fit(self, X, y):
             self._estimator.n_jobs = self._threads_to_use
 
         X_t = self._transformer.fit_transform(X, y)
-        X_t = np.nan_to_num(X_t, False, 0, 0, 0)
+
         self._estimator.fit(X_t, y)
 
         return self
@@ -150,9 +156,7 @@ def _predict(self, X):
         y : array-like, shape = [n_instances]
             Predicted class labels.
         """
-        X_t = self._transformer.transform(X)
-        X_t = np.nan_to_num(X_t, False, 0, 0, 0)
-        return self._estimator.predict(X_t)
+        return self._estimator.predict(self._transformer.transform(X))
 
     def _predict_proba(self, X):
         """Predict class probabilities for n instances in X.
@@ -167,15 +171,12 @@ def _predict_proba(self, X):
         y : array-like, shape = [n_instances, n_classes_]
             Predicted probabilities using the ordering in classes_.
         """
-        X_t = self._transformer.transform(X)
-        X_t = np.nan_to_num(X_t, False, 0, 0, 0)
-
         m = getattr(self._estimator, "predict_proba", None)
         if callable(m):
-            return self._estimator.predict_proba(X_t)
+            return self._estimator.predict_proba(self._transformer.transform(X))
         else:
             dists = np.zeros((X.shape[0], self.n_classes_))
-            preds = self._estimator.predict(X_t)
+            preds = self._estimator.predict(self._transformer.transform(X))
             for i in range(0, X.shape[0]):
                 dists[i, self._class_dictionary[preds[i]]] = 1
             return dists
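The fallback branch of `_predict_proba` above turns hard predictions into 0/1 probability rows when the wrapped estimator has no `predict_proba`. A minimal sketch of that idea follows, using plain lists instead of NumPy; the function name `one_hot_proba` and the sample labels are hypothetical and not part of sktime.

```python
def one_hot_proba(preds, class_dictionary):
    """Convert hard class predictions into 0/1 probability rows.

    `class_dictionary` maps each class label to its column index,
    playing the role of `self._class_dictionary` in the diff.
    """
    n_classes = len(class_dictionary)
    # One row per prediction, all zeros to start with.
    dists = [[0.0] * n_classes for _ in preds]
    for i, label in enumerate(preds):
        # Put all probability mass on the predicted class.
        dists[i][class_dictionary[label]] = 1.0
    return dists


# Hypothetical two-class example.
class_dictionary = {"1": 0, "2": 1}
probs = one_hot_proba(["2", "1", "2"], class_dictionary)
```

Each row sums to 1, so downstream code that expects a probability matrix keeps working, although the values carry no confidence information.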
diff --git a/sktime/classification/feature_based/_fresh_prince.py b/sktime/classification/feature_based/_fresh_prince.py
@@ -0,0 +1,206 @@
+# -*- coding: utf-8 -*-
+"""FreshPRINCE Classifier.
+
+Pipeline classifier using the full set of TSFresh features and a RotationForest
+classifier.
+"""
+
+__author__ = ["MatthewMiddlehurst"]
+__all__ = ["FreshPRINCE"]
+
+
+from sktime.classification.base import BaseClassifier
+from sktime.contrib.vector_classifiers._rotation_forest import RotationForest
+from sktime.transformations.panel.tsfresh import TSFreshFeatureExtractor
+from sktime.utils.validation.panel import check_X_y
+
+
+class FreshPRINCE(BaseClassifier):
+    """Fresh Pipeline with RotatIoN forest Classifier.
+
+    This classifier simply transforms the input data using the TSFresh [1]_
+    transformer with comprehensive features and builds a RotationForest estimator using
+    the transformed data.
+
+    Parameters
+    ----------
+    default_fc_parameters : str, default="comprehensive"
+        Set of TSFresh features to be extracted, options are "minimal", "efficient" or
+        "comprehensive".
+    n_estimators : int, default=200
+        Number of estimators for the RotationForest ensemble.
+    verbose : int, default=0
+        Console output level: 0 is silent, 1 adds a progress bar, 2+ adds warnings.
+    n_jobs : int, default=1
+        The number of jobs to run in parallel for both `fit` and `predict`.
+        ``-1`` means using all processors.
+    chunksize : int or None, default=None
+        Number of series processed in each parallel TSFresh job, should be optimised
+        for efficient parallelisation.
+    random_state : int or None, default=None
+        Seed for random number generation.
+
+    Attributes
+    ----------
+    n_classes_ : int
+        Number of classes. Extracted from the data.
+    classes_ : ndarray of shape (n_classes_)
+        Holds the label for each class.
+
+    See Also
+    --------
+    TSFreshFeatureExtractor, TSFreshClassifier, RotationForest
+
+    References
+    ----------
+    .. [1] Christ, Maximilian, et al. "Time series feature extraction on basis of
+        scalable hypothesis tests (tsfresh–a python package)." Neurocomputing 307
+        (2018): 72-77.
+        https://www.sciencedirect.com/science/article/pii/S0925231218304843
+
+    Examples
+    --------
+    >>> from sktime.classification.feature_based import FreshPRINCE
+    >>> from sktime.contrib.vector_classifiers._rotation_forest import RotationForest
+    >>> from sktime.datasets import load_unit_test
+    >>> X_train, y_train = load_unit_test(split="train", return_X_y=True)
+    >>> X_test, y_test = load_unit_test(split="test", return_X_y=True)
+    >>> clf = FreshPRINCE(
+    ...     default_fc_parameters="minimal",
+    ...     n_estimators=10,
+    ... )
+    >>> clf.fit(X_train, y_train)
+    FreshPRINCE(...)
+    >>> y_pred = clf.predict(X_test)
+    """
+
+    _tags = {
+        "capability:multivariate": True,
+        "capability:multithreading": True,
+        "capability:train_estimate": True,
+    }
+
+    def __init__(
+        self,
+        default_fc_parameters="comprehensive",
+        n_estimators=200,
+        save_transformed_data=False,
+        verbose=0,
+        n_jobs=1,
+        chunksize=None,
+        random_state=None,
+    ):
+        self.default_fc_parameters = default_fc_parameters
+        self.n_estimators = n_estimators
+
+        self.save_transformed_data = save_transformed_data
+        self.verbose = verbose
+        self.n_jobs = n_jobs
+        self.chunksize = chunksize
+        self.random_state = random_state
+
+        self.n_instances_ = 0
+        self.n_dims_ = 0
+        self.series_length_ = 0
+        self.transformed_data_ = []
+
+        self._rotf = None
+        self._tsfresh = None
+
+        super(FreshPRINCE, self).__init__()
+
+    def _fit(self, X, y):
+        """Fit a pipeline on cases (X,y), where y is the target variable.
+
+        Parameters
+        ----------
+        X : 3D np.array of shape = [n_instances, n_dimensions, series_length]
+            The training data.
+        y : array-like, shape = [n_instances]
+            The class labels.
+
+        Returns
+        -------
+        self :
+            Reference to self.
+
+        Notes
+        -----
+        Changes state by creating a fitted model that updates attributes
+        ending in "_" and sets is_fitted flag to True.
+        """
+        self.n_instances_, self.n_dims_, self.series_length_ = X.shape
+
+        self._rotf = RotationForest(
+            n_estimators=self.n_estimators,
+            save_transformed_data=self.save_transformed_data,
+            n_jobs=self._threads_to_use,
+            random_state=self.random_state,
+        )
+        self._tsfresh = TSFreshFeatureExtractor(
+            default_fc_parameters=self.default_fc_parameters,
+            n_jobs=self._threads_to_use,
+            chunksize=self.chunksize,
+            show_warnings=self.verbose > 1,
+            disable_progressbar=self.verbose < 1,
+        )
+
+        X_t = self._tsfresh.fit_transform(X, y)
+        self._rotf.fit(X_t, y)
+
+        if self.save_transformed_data:
+            self.transformed_data_ = X_t
+
+        return self
+
+    def _predict(self, X):
+        """Predict class values of n instances in X.
+
+        Parameters
+        ----------
+        X : 3D np.array of shape = [n_instances, n_dimensions, series_length]
+            The data to make predictions for.
+
+        Returns
+        -------
+        y : array-like, shape = [n_instances]
+            Predicted class labels.
+        """
+        return self._rotf.predict(self._tsfresh.transform(X))
+
+    def _predict_proba(self, X):
+        """Predict class probabilities for n instances in X.
+
+        Parameters
+        ----------
+        X : 3D np.array of shape = [n_instances, n_dimensions, series_length]
+            The data to predict probabilities for.
+
+        Returns
+        -------
+        y : array-like, shape = [n_instances, n_classes_]
+            Predicted probabilities using the ordering in classes_.
+        """
+        return self._rotf.predict_proba(self._tsfresh.transform(X))
+
+    def _get_train_probs(self, X, y):
+        self.check_is_fitted()
+        X, y = check_X_y(X, y, coerce_to_numpy=True)
+
+        n_instances, n_dims, series_length = X.shape
+
+        if (
+            n_instances != self.n_instances_
+            or n_dims != self.n_dims_
+            or series_length != self.series_length_
+        ):
+            raise ValueError(
+                "n_instances, n_dims, series_length mismatch. X should be "
+                "the same as the training data used in fit for generating train "
+                "probabilities."
+            )
+
+        if not self.save_transformed_data:
+            raise ValueError("Currently only works with saved transform data from fit.")
+
+        return self._rotf._get_train_probs(self.transformed_data_, y)
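`_get_train_probs` above only works when two conditions hold: the data passed in has the same shape as the data seen in `fit`, and the classifier was built with `save_transformed_data=True` (an `__init__` argument that the Parameters section of the docstring does not list). The sketch below isolates those two guards in a small stand-in class; `TrainProbsGuard` and its methods are hypothetical and only mirror the checks, not the TSFresh or RotationForest steps.

```python
class TrainProbsGuard:
    """Hypothetical stand-in mirroring the checks in _get_train_probs."""

    def __init__(self, save_transformed_data=False):
        self.save_transformed_data = save_transformed_data
        self.train_shape_ = None
        self.transformed_data_ = []

    def fit(self, shape, transformed):
        # Remember (n_instances, n_dims, series_length) from training.
        self.train_shape_ = tuple(shape)
        # Keep the transformed features only when asked to.
        if self.save_transformed_data:
            self.transformed_data_ = transformed
        return self

    def saved_train_data(self, shape):
        # Guard 1: the data must look like the training data.
        if tuple(shape) != self.train_shape_:
            raise ValueError("shape mismatch with the data used in fit")
        # Guard 2: the transformed training data must have been saved.
        if not self.save_transformed_data:
            raise ValueError("requires save_transformed_data=True")
        return self.transformed_data_


guard = TrainProbsGuard(save_transformed_data=True).fit((20, 1, 24), ["features"])
saved = guard.saved_train_data((20, 1, 24))
```

Note that the shape comparison is a heuristic: it cannot tell whether the values match the training data, only whether the dimensions do.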