# Table of Contents

* [Introduction](#Introduction)
* [1. Multivariate Time Series](#1.-Multivariate-Time-Series)
  * [1.1. Derivatives](#1.1.-Derivatives)
* [2. Fourier Transformation](#2.-Fourier-Transformation)
* [3. ANOVA-F test](#3.-ANOVA-F-test)
* [4. Binning](#4.-Binning)
* [5. Words](#5.-Words)
  * [5.1. Bigrams](#5.1.-Bigrams)
* [6. Bag of Patterns](#6.-Bag-of-Patterns)
* [7. Summary](#7.-Summary)
* [Bibliography](#Bibliography)

# Introduction

In this kernel, I present Patrick Schäfer's (Humboldt University) approach to multivariate time series analysis. The original author proposes to decompose each time series into Fourier coefficients, choose those that best separate the classes, and build special words from those coefficients. The pipeline looks like the one below [1].

![WEASEL MUSE](https://i.imgur.com/47GaxQF.png "WEASEL MUSE")

Let's follow each step to learn more details.

In [None]:
import numpy as np
import pandas as pd

import os
print(os.listdir("../input"))

# 1. Multivariate Time Series

In this competition, we're given data where each column is a time series representing readings from a robot's sensor. We have 4 orientation readings, 3 angular velocity readings, and 3 linear acceleration readings. Using this data, we're tasked with guessing which surface the robot has been moving on.

In practice, each unique series of readings is represented as a 2D array, where columns represent sensors and rows represent samples of the signal from a given sensor.

First, we will do a bit of preprocessing, namely, we'll smooth out signals using rolling average and then obtain derivatives of signals.

In [None]:
X_train = pd.read_csv('../input/career-con-2019/X_train.csv')
y_train = pd.read_csv('../input/career-con-2019/y_train.csv')

X_test = pd.read_csv('../input/career-con-2019/X_test.csv')
y_sample = pd.read_csv('../input/career-con-2019/sample_submission.csv')

X_train.set_index('series_id', inplace=True)
y_train.set_index('series_id', inplace=True)

X_test.set_index('series_id', inplace=True)

new_col_names = ['oX', 'oY', 'oZ', 'oW', 'avX', 'avY', 'avZ', 'laX', 'laY', 'laZ']
columns_to_drop = ['row_id', 'measurement_number']

X_train = X_train.drop(columns_to_drop, axis=1)
y_train = y_train.drop(['group_id'], axis=1)

X_test = X_test.drop(columns_to_drop, axis=1)
#y_sample = y_sample.drop(['group_id'], axis=1)

X_train.columns = new_col_names
X_test.columns = new_col_names

X_train = X_train.groupby('series_id').rolling(10).mean()
X_test = X_test.groupby('series_id').rolling(10).mean()

def add_derivatives(df):
    for col in df.columns:
        series = df[col].values
        diff = np.diff(np.array(series))
        df[col+'_der'] = np.append(0, diff)
    return df

X_train = add_derivatives(X_train).dropna()
X_train.index = X_train.index.droplevel()

X_test = add_derivatives(X_test).dropna()
X_test.index = X_test.index.droplevel()

The most common class here is concrete, which constitutes about 20% of all measurements. Therefore our baseline for prediction is 20% when the most common class is chosen for all predictions.

In [None]:
pd.DataFrame(y_train['surface'].value_counts()/y_train.shape[0]*100)
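
The baseline above is simply the share of the most common class. Here's a minimal sketch of that arithmetic on made-up labels (the `toy_labels` values are hypothetical, not the competition data):

```python
import pandas as pd

# Hypothetical toy labels; the real ones come from y_train['surface'].
toy_labels = pd.Series(['concrete', 'concrete', 'wood', 'tiled', 'carpet'])

# Share of each class. The largest share is the accuracy we would get by
# always predicting the most common class.
class_share = toy_labels.value_counts(normalize=True)
majority_class = class_share.idxmax()
baseline_accuracy = class_share.max()

print(majority_class, baseline_accuracy)
```

Any model we build later should beat this number to be worth the effort.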

## 1.1. Derivatives

The author advises adding derivatives to our dataset, as they might improve our model.

The column codes are:
* X, Y, Z, W - stand for direction
* 'o' - stands for orientation
* 'av' - stands for angular velocity
* 'la' - stands for linear acceleration
* '_der' - stands for derivative

This is what the pre-processed data looks like. So far we've used a rolling mean to smooth out the measurements and calculated derivatives by simply subtracting consecutive data points.

In [None]:
X_train.head()
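
To see what the smoothing and the derivative do to a signal, here's a small sketch on a made-up signal with two series. The toy frame, the window of 2 samples, and taking the difference within each series are assumptions for illustration; the kernel itself uses a window of 10.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two series of five samples each.
toy = pd.DataFrame({
    'series_id': [0] * 5 + [1] * 5,
    'signal': [1.0, 2.0, 4.0, 7.0, 11.0, 5.0, 5.0, 6.0, 8.0, 8.0],
})

# Rolling mean over 2 samples within each series. The first row of each
# series has no full window yet, so it becomes NaN.
smoothed = toy.groupby('series_id')['signal'].transform(
    lambda s: s.rolling(2).mean()
)

# Discrete derivative: difference between consecutive smoothed samples,
# again computed within each series.
derivative = smoothed.groupby(toy['series_id']).diff()

print(pd.DataFrame({'smoothed': smoothed, 'derivative': derivative}))
```

The NaN rows at the start of each series are the ones that dropna() removes in the preprocessing cell.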

# 2. Fourier Transformation

In this kernel, we'll implement each data processing step as a scikit-learn transformer that can be chained with sklearn.pipeline, as it seems to be a consistent way of transforming data.

Before we transform the data using FFT, we'll z-score standardize our time series. The Fourier transform (FT) of a z-score standardized series follows a Gaussian distribution, which will help us in the next step.

Note that, for simplicity, this example only uses the series with series_id 0 through 100.
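
As a quick illustration of per-series z-scoring, here's a sketch on made-up numbers. The toy frame is hypothetical; the formula uses the population standard deviation (ddof=0), which is also the default of scipy.stats.zscore.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two series on very different scales.
toy = pd.DataFrame({
    'series_id': [0, 0, 0, 1, 1, 1],
    'oX': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Standardize within each series: subtract the series mean and divide by
# the series' population standard deviation.
standardized = toy.groupby('series_id')['oX'].transform(
    lambda s: (s - s.mean()) / s.std(ddof=0)
)

print(standardized.round(4).tolist())
```

After standardization both series look identical, which is the point: the scale of a sensor no longer matters, only the shape of the signal.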

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

In [None]:
from scipy import stats

class ZScoreScaler(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        arr = X.groupby('series_id').apply(stats.zscore).values
        arr = np.concatenate(arr, axis=0)
        return pd.DataFrame(arr, columns=X.columns, index=X.index)

In [None]:
zscore = ZScoreScaler().fit_transform(X_train.loc[:100])

And this is what our z-score transformed series look like.

In [None]:
zscore.loc[20, 'oX':'avZ'].reset_index().drop(columns=['series_id']).plot()

In [None]:
zscore.loc[20, 'oX_der':'avZ_der'].reset_index().drop(columns=['series_id']).plot()

The goal of using FT and choosing the best coefficients is to separate signal from the noise. In the above example, we can see a random sample of readings from sensors (top chart), and their derivatives (bottom chart).

Now it's time to take Fourier transforms of our series **using a rolling window**. We transform each series (including derivatives) by sliding a window of a given length over it (10 samples in our example) and applying FFT to each window.
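
Here's a minimal sketch of the windowed FFT on a single made-up signal. It uses numpy's sliding_window_view (available in NumPy 1.20 and later) instead of the hand-written stride trick, and the signal, window length and FFT size are assumptions chosen to keep the output small.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

signal = np.arange(8, dtype=float)  # hypothetical 8-sample signal
window_length = 4
n_fft = 6                           # each window is zero-padded to 6 points

# One row per window position: shape (8 - 4 + 1, 4) = (5, 4).
windows = sliding_window_view(signal, window_length)

# FFT of every window along its last axis: shape (5, 6), complex values.
spectrum = np.fft.fft(windows, n=n_fft, axis=-1)

# Real parts first, then imaginary parts: shape (5, 12).
features = np.concatenate((spectrum.real, spectrum.imag), axis=1)

print(windows.shape, features.shape)
```

The first real coefficient of each window is just the sum of its samples, and its imaginary counterpart is always zero.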

In [None]:
def rolling_window_2d(arr, size=2):
    shape = arr.shape
    strides = arr.strides
    arr = np.lib.stride_tricks.as_strided(arr,
                                         shape=(shape[1], shape[0]+1-size, size),
                                         strides=(strides[1],strides[0],strides[0]))
    return arr

class RollingWindowFourier(BaseEstimator, TransformerMixin):

    def __init__(self, window_length):
        self.window_length = window_length
        
    def fit(self, X, y=None):
        return self
        
    def transform(self, X):
        index = []
        arr_list = []
        for i in X.index.unique():
            new_one = rolling_window_2d(X.loc[i].values, self.window_length)
            f_transformed = np.fft.fft(new_one, 12)
            concated = np.concatenate((f_transformed.real, f_transformed.imag), axis=2)
            two_dims = concated.transpose(1,0,2).reshape(concated.shape[1],-1)
            arr_list.append(two_dims)
            index += [i for ind in range(new_one.shape[1])]
        df = pd.DataFrame(np.concatenate(arr_list, axis=0), index=index)
        del arr_list
        return df

In [None]:
fourier = RollingWindowFourier(10).fit_transform(zscore)

In [None]:
fourier.head()

In [None]:
fourier.shape

Note that we got rid of the column names. Now we're left with numbers only.

Each column now represents the i-th real or imaginary coefficient of the j-th sensor reading, i.e.:

>The first 24 columns hold the Fourier coefficients of the 'oX' sensor readings: the first 12 of those 24 columns are the real parts and the remaining 12 are the imaginary parts. For the second reading, 'oY', we look at columns 24-47, and so on up to column 479, because we have 20 readings times 24 coefficients per reading.
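
The column layout described above can be sketched as a small lookup helper. The function name describe_column is hypothetical and only here for illustration; the sensor order follows the columns of our pre-processed data.

```python
SENSORS = ['oX', 'oY', 'oZ', 'oW', 'avX', 'avY', 'avZ', 'laX', 'laY', 'laZ',
           'oX_der', 'oY_der', 'oZ_der', 'oW_der', 'avX_der', 'avY_der',
           'avZ_der', 'laX_der', 'laY_der', 'laZ_der']
N_FFT = 12  # FFT size used in RollingWindowFourier


def describe_column(col):
    """Map a column number (0-479) to (sensor, part, coefficient index)."""
    # Each sensor owns a block of 24 columns: 12 real, then 12 imaginary.
    sensor_idx, offset = divmod(col, 2 * N_FFT)
    part = 'real' if offset < N_FFT else 'imag'
    return SENSORS[sensor_idx], part, offset % N_FFT


print(describe_column(0))    # ('oX', 'real', 0)
print(describe_column(24))   # ('oY', 'real', 0)
print(describe_column(479))  # ('laZ_der', 'imag', 11)
```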

# 3. ANOVA-F test

Now it is time to choose the best Fourier coefficients for each sensor reading. We're using the ANOVA F-test to check which coefficients best separate the classes.

Another, much simpler (and less reliable) solution is to choose the first few coefficients, as they correspond to low frequencies and therefore should separate signal from noise.

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif

class ANOVA_ColumnSelector(BaseEstimator, TransformerMixin):

    def __init__(self, k_best):
        self.k = k_best
        
    def fit(self, X, y):
        y = X.merge(y, left_on=X.index, right_on=y.index)['surface']
        col_list = []
        for sensor in range(20):
            col_indexes = np.linspace(sensor*24, sensor*24+23, 24).astype('int16')
            skb = SelectKBest(f_classif, k=self.k).fit(X[col_indexes], y)
            col_list += list(np.argsort(skb.pvalues_)[:self.k] + sensor*24)
        self.columns = col_list
        return self
    
    def transform(self, X):
        df = X[self.columns]
        del X
        return df

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
anova = ANOVA_ColumnSelector(4).fit_transform(fourier, y_train)

The author advises choosing 3-5 best coefficients. We'll go with 4, which is a conservative number - we have neither complex nor oversimplified model.
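
To make the selection criterion more concrete, here's a minimal sketch of the ANOVA F-test on made-up data: one feature whose mean depends on the class and one feature that is pure noise. The data and the random seed are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)

# Three hypothetical classes with 50 samples each.
labels = np.repeat([0, 1, 2], 50)

# Column 0: noise unrelated to the class.
# Column 1: mean shifts with the class (0, 2, 4), so it separates them well.
noise = rng.normal(size=labels.size)
informative = labels * 2.0 + rng.normal(scale=0.5, size=labels.size)
X = np.column_stack((noise, informative))

# Higher F-score (lower p-value) means better separation between classes.
f_scores, p_values = f_classif(X, labels)
best_first = np.argsort(f_scores)[::-1]

print(f_scores, best_first)
```

The informative column gets a far larger F-score, so it is ranked first, which is the same ranking our selector applies within each sensor's block of 24 coefficients.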

In [None]:
anova.merge(y_train, left_on=anova.index, right_on=y_train.index).boxplot(2, by='surface')

In [None]:
anova.merge(y_train, left_on=anova.index, right_on=y_train.index).boxplot(178, by='surface')

In [None]:
anova.merge(y_train, left_on=anova.index, right_on=y_train.index).boxplot(347, by='surface')

Although we see some difference in means between different classes, their variance is quite high. This may cause overlapping of classes, which might result in weak separation.

# 4. Binning

In order to create words, we need to create bins for each Fourier coefficient [2]. We'll do this using entropy-based information gain. Luckily, decision trees already implement this kind of splitting. Note that scikit-learn's DecisionTreeClassifier uses Gini impurity by default, which is a closely related impurity measure rather than entropy itself; passing criterion='entropy' would select splits by information gain. Either way, the tree decides where to split, and we'll use those split values to create bins.

In [None]:
from sklearn import tree

class Binner(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        pass
    
    def fit(self, X, y):
        y = X.merge(y, left_on=X.index, right_on=y.index)['surface']
        column_limits = []
        for col in X.columns:
            clf = tree.DecisionTreeClassifier(max_depth=2, max_leaf_nodes=4).fit(X[col].values.reshape(-1,1), y)
            threshold = clf.tree_.threshold[:3]
            limits = np.sort(np.insert(threshold, 0, [-np.inf, np.inf]))  # np.NINF was removed in NumPy 2.0
            column_limits.append(limits)
        self.column_limits = column_limits
        return self
    
    def transform(self, X):
        for idx, col in enumerate(X.columns):
            X.loc[:,col] = pd.cut(X[col], bins=self.column_limits[idx], labels=[1,2,3,4])
        return X.astype('int16')

In [None]:
binned = Binner().fit_transform(anova, y_train)

In [None]:
binned.head()

Now each Fourier coefficient has its own label (bin number) based on Decision Tree splits.
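
Here's a minimal sketch of how split thresholds can be read from a fitted tree and turned into bin numbers. The toy values and labels are hypothetical, and this sketch differs from the Binner above in two ways that should be treated as assumptions: it uses criterion='entropy', and it keeps only the thresholds of real split nodes (leaves carry a negative placeholder in tree_.feature).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical coefficient values that fall into three clear groups.
values = np.array([0.1, 0.2, 0.3, 1.1, 1.2, 1.3, 2.1, 2.2, 2.3]).reshape(-1, 1)
labels = np.array(['a'] * 3 + ['b'] * 3 + ['c'] * 3)

clf = DecisionTreeClassifier(criterion='entropy', max_leaf_nodes=4,
                             random_state=0).fit(values, labels)

# Internal nodes have a non-negative feature index; leaves do not.
is_split = clf.tree_.feature >= 0
thresholds = np.sort(clf.tree_.threshold[is_split])

# Bin numbers start at 1: values below the first threshold get bin 1, etc.
bins = np.digitize(values.ravel(), thresholds) + 1

print(thresholds, bins)
```

Because the three groups are already pure after two splits, the tree stops at two thresholds and we get three bins instead of four.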

# 5. Words

We'll combine those bin numbers into 'words'. Each word consists of the bin numbers of the selected Fourier coefficients for one sensor reading. We should be left with 20 columns, because each word symbolizes the Fourier coefficients of a certain window in a reading.

In [None]:
class CreateWords(BaseEstimator, TransformerMixin):
    
    def __init__(self, window_length):
        self.window_length = window_length
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        cols = X.columns
        ind = X.index
        col_names = ['oX', 'oY', 'oZ', 'oW', 'avX', 'avY', 'avZ', 'laX', 'laY', 'laZ',
                     'oX_der', 'oY_der', 'oZ_der', 'oW_der', 'avX_der', 'avY_der', 'avZ_der',
                     'laX_der', 'laY_der', 'laZ_der']
        df_dict = {}
        for i in range(20):
            x = X[cols[i*4:i*4+4]].apply(lambda x: int(''.join(map(str,x))), axis=1)
            df_dict[str(self.window_length)+col_names[i]] = x
        df = pd.DataFrame(df_dict, index=ind)
        del x, df_dict, X
        return df

In [None]:
words = CreateWords(10).fit_transform(binned)

In [None]:
words.head()

We renamed columns by appending used window length to the original sensor name. In this manner, we will be able to differentiate between 'words' of different window lengths and of different sensors later on.
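
As a small illustration of how a word is formed, here's a sketch that joins four bin numbers per row into one integer. The toy frame is hypothetical and stands for the four selected coefficients of a single sensor.

```python
import pandas as pd

# Hypothetical bin numbers of 4 selected coefficients for 3 windows.
binned_toy = pd.DataFrame({
    0: [1, 4, 2],
    1: [3, 3, 1],
    2: [2, 1, 1],
    3: [4, 2, 3],
})

# Join the digits of each row into a single 'word' and store it as an int.
words_toy = binned_toy.astype(str).apply(''.join, axis=1).astype(int)

print(words_toy.tolist())  # [1324, 4312, 2113]
```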

## 5.1. Bigrams

Another step proposed by the author is to get bigrams from the above words. This way we'll be able to find which signals tend to occur in pairs.

In [None]:
class GetBigrams(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        arr_list = []
        index = []
        for i in X.index.unique():
            unigrams2d = rolling_window_2d(X.loc[i].values, 2)
            bigrams = np.apply_along_axis(lambda x: int(''.join(map(str,x))), 2, unigrams2d).T
            stacked = np.vstack((X.loc[i], bigrams))
            arr_list.append(stacked)
            index += [i for ind in range(stacked.shape[0])]
        df = pd.DataFrame(np.concatenate(arr_list, axis=0), columns=X.columns, index=index)
        del X, arr_list
        return df

In [None]:
bigrams = GetBigrams().fit_transform(words)

In [None]:
bigrams.tail(5)
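
The idea behind the bigrams can be sketched in plain Python: each bigram is just two neighbouring words glued together. The toy word list is hypothetical.

```python
# Hypothetical sequence of words for one sensor in one series.
words_toy = [1324, 4312, 2113, 1324]

# Pair every word with its successor and concatenate their digits.
bigrams_toy = [int(f'{first}{second}')
               for first, second in zip(words_toy, words_toy[1:])]

print(bigrams_toy)  # [13244312, 43122113, 21131324]
```

A series with n words yields n - 1 bigrams, which is why the stacked output above has almost twice as many rows per series as the words table.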

# 6. Bag of Patterns

Now that we have both words and bigrams, we'll count them for each unique series ID. We hope to see certain words occurring more frequently for certain classes. I would speculate that we'd see, for example, high-frequency terms in medium and long window lengths for orientation sensors on concrete, as concrete tends to be bumpy, which may cause the robot to vibrate.

In [None]:
class TextTransformer(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        for col in X.columns:
            X[col] = col+X[col].astype(str)
        df = X.groupby(X.index).apply(lambda x: ' '.join(x.values.flatten()))
        del X
        return df

In [None]:
text = TextTransformer().fit_transform(bigrams)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

class Dummify(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        self.CV = CountVectorizer().fit(X)
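        # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() exists since 1.0
        self.columns = list(self.CV.get_feature_names_out())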
        return self
    
    def transform(self, X):
        counts = self.CV.transform(X)
        index = X.index
        del X
        return pd.DataFrame(counts.toarray(), index=index, columns=self.columns)

In [None]:
dummies = Dummify().fit_transform(text)

In [None]:
dummies.head()

# 7. Summary

I used this method for prediction with varying window lengths, from 5 to 68 samples. For each window length, I chose the 350 best features and then combined them. After trying several models (logreg, SVM, GBT, RFC) with Grid Search, I haven't observed very high accuracy. Most of them score around 40-45%, which is still better than our 20% baseline.

I suspect that this data is simply not suitable for this model. The paper [1] reports that some data sets containing sensor readings did not perform well. The low performance is probably caused by poor separation between classes in the Fourier coefficients (see the ANOVA-F test section).

Nevertheless, it's an interesting approach to signal analysis and certainly an alternative to ANNs.

# Bibliography
* [1] Multivariate Time Series Classification with WEASEL+MUSE - Patrick Schäfer, Ulf Leser
* [2] SFA: A Symbolic Fourier Approximation and Index for Similarity Search in High Dimensional Datasets - Patrick Schäfer
* [3] Fast and Accurate Time Series Classification with WEASEL - Patrick Schäfer, Ulf Leser

In [None]:
from sklearn.pipeline import make_pipeline, make_union
from sklearn.feature_selection import SelectKBest, chi2

union = make_union(
    make_pipeline(
        ZScoreScaler(),
        RollingWindowFourier(5),
        ANOVA_ColumnSelector(4),
        Binner(),
        CreateWords(5),
        GetBigrams(),
        TextTransformer(),
        Dummify(),
        SelectKBest(chi2, k=350)
    ),
    make_pipeline(
        ZScoreScaler(),
        RollingWindowFourier(10),
        ANOVA_ColumnSelector(4),
        Binner(),
        CreateWords(10),
        GetBigrams(),
        TextTransformer(),
        Dummify(),
        SelectKBest(chi2, k=350)
    ),
    make_pipeline(
        ZScoreScaler(),
        RollingWindowFourier(14),
        ANOVA_ColumnSelector(4),
        Binner(),
        CreateWords(14),
        GetBigrams(),
        TextTransformer(),
        Dummify(),
        SelectKBest(chi2, k=350)
    ),
    make_pipeline(
        ZScoreScaler(),
        RollingWindowFourier(23),
        ANOVA_ColumnSelector(4),
        Binner(),
        CreateWords(23),
        GetBigrams(),
        TextTransformer(),
        Dummify(),
        SelectKBest(chi2, k=350)
    ),
    make_pipeline(
        ZScoreScaler(),
        RollingWindowFourier(32),
        ANOVA_ColumnSelector(4),
        Binner(),
        CreateWords(32),
        GetBigrams(),
        TextTransformer(),
        Dummify(),
        SelectKBest(chi2, k=350)
    ),
    make_pipeline(
        ZScoreScaler(),
        RollingWindowFourier(41),
        ANOVA_ColumnSelector(4),
        Binner(),
        CreateWords(41),
        GetBigrams(),
        TextTransformer(),
        Dummify(),
        SelectKBest(chi2, k=350)
    ),
    make_pipeline(
        ZScoreScaler(),
        RollingWindowFourier(50),
        ANOVA_ColumnSelector(4),
        Binner(),
        CreateWords(50),
        GetBigrams(),
        TextTransformer(),
        Dummify(),
        SelectKBest(chi2, k=350)
    ),
    make_pipeline(
        ZScoreScaler(),
        RollingWindowFourier(59),
        ANOVA_ColumnSelector(4),
        Binner(),
        CreateWords(59),
        GetBigrams(),
        TextTransformer(),
        Dummify(),
        SelectKBest(chi2, k=350)
    ),
    make_pipeline(
        ZScoreScaler(),
        RollingWindowFourier(68),
        ANOVA_ColumnSelector(4),
        Binner(),
        CreateWords(68),
        GetBigrams(),
        TextTransformer(),
        Dummify(),
        SelectKBest(chi2, k=350)
    )    
)
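
Each pipeline above ends with SelectKBest(chi2, k=350). Here's a minimal sketch of what that last step does, on made-up word counts with k=2 instead of 350. The counts and labels are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical word counts: 4 series (rows) x 3 words (columns).
# Word 0 appears only in class 0, word 1 only in class 1,
# word 2 appears equally often everywhere.
counts = np.array([[3, 0, 1],
                   [4, 0, 1],
                   [0, 5, 1],
                   [0, 4, 1]])
labels = np.array([0, 0, 1, 1])

# Keep the 2 words whose counts depend most strongly on the class.
selector = SelectKBest(chi2, k=2).fit(counts, labels)
kept = selector.get_support(indices=True)

print(selector.scores_, kept)
```

The word that occurs equally often in both classes gets a chi-squared score of zero and is dropped, while the two class-specific words are kept.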

In [None]:
"""union.fit(X_train, y_train)
X_train_transformed = union.transform(X_train)
X_test_transformed = union.transform(X_test)"""

In [None]:
"""X_train_transformed = pd.DataFrame(X_train_transformed)
X_test_transformed = pd.DataFrame(X_test_transformed)"""

In [None]:
"""X_train_transformed.to_csv('train_transformed.csv',index=False)
X_test_transformed.to_csv('test_transformed.csv',index=False)"""

In [None]:
"""train = pd.read_csv('../input/weaselmuse-robots/train_transformed.csv')
test = pd.read_csv('../input/weaselmuse-robots/test_transformed.csv')"""

In [None]:
"""from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, make_scorer


scorer = make_scorer(accuracy_score)

GBC_params = dict(n_estimators=(100,250,500),
                  learning_rate=(0.1,1,10),
                  min_samples_split=(2,3,4))

RFC_params = dict(n_estimators=(10,50,100,200),
                 oob_score=(False, True),
                 class_weight=['balanced'])

GBC_grid = GridSearchCV(GradientBoostingClassifier(random_state=42), GBC_params, cv=5, scoring=scorer)
RFC_grid = GridSearchCV(RandomForestClassifier(random_state=42), RFC_params, cv=5, scoring=scorer)"""

In [None]:
#GBC = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, min_samples_split=3).fit(train, y_train.values.ravel())