# PHM North America challenge '23

## Problem description: Gear pitting

Gear pitting is a surface fatigue failure of the gear tooth. It occurs when repeated loading of the tooth surface causes the contact stress to exceed the surface fatigue strength of the material. Material in the fatigued region is removed and a pit forms. The pit itself causes stress concentration, so the pitting soon spreads to adjacent regions until the whole surface is covered [[source](https://gearsmechon.wordpress.com/pitting-of-gears/)].

## Dataset description

The **training** dataset includes measurements under varied operating conditions from a healthy state as well as six known fault levels. The **testing and validation** datasets contain data from eleven health levels. Data from some fault levels and operating conditions are excluded from the training dataset to mirror real-world conditions, where data collection may only be possible for a subset of the full range of operation. The training data are collected under 15 different rotational speeds and 6 different torque levels. The test and validation operating conditions span 18 different rotational speeds and 6 different torque levels.

[[source](https://data.phmsociety.org/phm2023-conference-data-challenge/)]

<img src="https://data.phmsociety.org/wp-content/uploads/sites/9/2023/06/PHM2023dc_fig1.png" alt="PHM 2023 data challenge, figure 1" style="height: 375px; width:800px;"/>

<img src="https://data.phmsociety.org/wp-content/uploads/sites/9/2023/06/PHM2023dc_fig2.png" alt="PHM 2023 data challenge, figure 2" style="height: 300px; width:800px;"/>



In [None]:
%load_ext autoreload
%autoreload 2

from conscious_engie_icare import distance_metrics
from conscious_engie_icare.normalization import normalize_1
from conscious_engie_icare.nmf_profiling import derive_df_orders, derive_df_vib, extract_nmf_per_number_of_component
from conscious_engie_icare.util import calc_tpr_at_fpr_threshold, calc_fpr_at_tpr_threshold, calculate_roc_characteristics
from conscious_engie_icare.viz.viz import illustrate_nmf_components_for_paper
from conscious_engie_icare.data import phm_data_handler

# standard library
import glob
import math
import os
import pickle
import random
import string

# third-party
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns
from matplotlib.colors import LogNorm
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.signal import periodogram, stft, welch
from scipy.spatial.distance import pdist
from sklearn.decomposition import PCA
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import MinMaxScaler
from tqdm import tqdm
from umap import UMAP

# Data exploration

We first load the data and examine the structure of the dataset.

In [None]:
phm_data_handler.fetch_and_unzip_data()

In [None]:
!ls ../data

## Vibration data

First we load the vibration dataset and examine a single vibration entry. 
For each vibration measurement there are triaxial time-domain vibration measurements available (`x`, `y` and `z`) in addition to the actual rpm (`tachometer`).

In [None]:
rpm = 100
torque = 500
run = 1
df_example = load_train_data(rpm, torque, run)
print(f"A single sample (rpm={rpm}, torque={torque}, run={run}) has the following shape:")
print(df_example.shape)

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for var, ax in zip(['x', 'y', 'z'], axes):
    ax.plot(df_example[var], label=var)
    ax.set_title(var)
    ax.legend()

### STFT

Vibration Sampling Frequency = 20480 Hz [[source](https://data.phmsociety.org/phm2023-conference-data-challenge/)].

> **Difference to industrial use case**: we choose the FFT parameters ourselves.

The STFT divides the signal into overlapping segments and calculates a Fourier Transform for each segment.
It provides a localized view on the signal which is particularly useful for signals where the frequency components change over the measurement period.

In [None]:
def plot_stft(df, var, nperseg=256, noverlap=None, nfft=None, fs=1):
    f, t, Zxx = stft(df[var], nperseg=nperseg, noverlap=noverlap, nfft=nfft, fs=fs)
    plt.pcolormesh(t, f, np.abs(Zxx))
    plt.title(f'STFT Magnitude {var}')
    plt.ylabel('Frequency [Hz]')
    plt.xlabel('Time [sec]')
    plt.show()

plot_stft(df_example, 'z', nperseg=None, fs=20480)

As we expect that within each measurement there are no changes in the frequency components, we also check the periodogram below.

### Power density estimation

A periodogram is an estimate of the spectral density of a signal [[source](https://en.wikipedia.org/wiki/Periodogram)].

> Note: for this data, `periodogram` returns an array containing NaN values.

Besides the periodogram, we also use Welch's method.
**The primary idea behind Welch's method is to divide the original signal into overlapping segments, calculate the periodogram for each segment, and then average these periodograms to obtain a more stable estimate of the PSD.** This approach helps reduce the variance and noise inherent in the standard periodogram.
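
The variance reduction can be illustrated on a synthetic signal. The following is a minimal sketch, not part of the original analysis: the 1 kHz tone, the noise level and the helper name `compare_psd_estimates` are assumptions made for illustration, and only the sampling frequency of 20480 Hz is taken from the challenge description.

```python
import numpy as np
from scipy.signal import periodogram, welch


def compare_psd_estimates(fs=20480, duration=2.0, tone_hz=1000.0, seed=0):
    """Estimate the PSD of a noisy synthetic tone with both methods."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(fs * duration)) / fs
    x = np.sin(2 * np.pi * tone_hz * t) + rng.normal(scale=1.0, size=t.size)
    f_p, P_p = periodogram(x, fs=fs)                       # one long FFT
    f_w, P_w = welch(x, fs=fs, nperseg=1024, noverlap=512)  # averaged segments
    return f_p, P_p, f_w, P_w


f_p, P_p, f_w, P_w = compare_psd_estimates()

# compare the scatter of the noise floor away from the tone (above 2 kHz)
noise_p = P_p[f_p > 2000]
noise_w = P_w[f_w > 2000]
print("relative spread, periodogram:", noise_p.std() / noise_p.mean())
print("relative spread, Welch:      ", noise_w.std() / noise_w.mean())
```

The price for the smoother estimate is a coarser frequency resolution, since each segment is shorter than the full signal.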

In [None]:
def plot_periodogram(df, var, fs=1):
    f, Pxx = periodogram(df[var], fs=fs)
    plt.semilogy(f, Pxx)
    plt.title(f'Periodogram {var}')
    plt.xlabel('frequency [Hz]')
    plt.ylabel('PSD amplitude')
    plt.show()

def plot_welch(df, var, nperseg=256, noverlap=None, nfft=None, fs=1, ax=None, legend=None):
    f, Pxx = welch(df[var], nperseg=nperseg, noverlap=noverlap, nfft=nfft, fs=fs)
    plt.semilogy(f, Pxx)
    plt.title(f'Welch Power Spectral Density {var}')
    plt.xlabel('frequency [Hz]')
    plt.ylabel('PSD amplitude')
    # plt.show()

# dimension = 'x'
# plot_periodogram(df_example, dimension, fs=20480)

rpm=200
torque=300
run=3
df_example = load_train_data(rpm=rpm, torque=torque, run=run)
plot_welch(df_example, 'x', nperseg=128, fs=20480)
plot_welch(df_example, 'y', nperseg=128, fs=20480)
plot_welch(df_example, 'z', nperseg=128, fs=20480)
plt.title(f'Measurement {run} @ {rpm} rpm, {torque} Nm');
plt.legend(['x', 'y', 'z'], title='Direction')

## Process parameters

In contrast to the industrial feedwater pump use case, **operating conditions are very stable in the given dataset**.
Therefore, clustering of operating modes based on a separate set of process parameters is not necessary.

In [None]:
fnames = glob.glob(os.path.join(BASE_PATH_HEALTHY, '*.txt'))
# extract rpm, torque, run from filename

def extract_process_parameters(file_path, use_train_data_for_validation=True):
    # os.path.basename handles the path separator of the current platform
    filename = os.path.basename(file_path)
    if use_train_data_for_validation:
        v_value, n_value, sample_number = filename.split('_')  # Extract V, N, and sample number
        return int(v_value[1:]), int(n_value[:-1]), int(sample_number.split('.')[0])
    else:
        sample_number, v_value, n_value = filename.split('_')
        return int(v_value[1:]), int(n_value.split('.')[0][:-1]), int(sample_number)

data = []
for file_path in fnames:
    v_value, n_value, sample_number = extract_process_parameters(file_path)
    data.append({
        'V': v_value,  # Remove the 'V' prefix and convert to integer
        'N': n_value,  # Remove the 'N' suffix and convert to integer
        'SampleNumber': sample_number  # Remove the '.txt' extension and convert to integer
    })

df_process = pd.DataFrame(data)

print("--- Healthy data (pitting level 0) ---")
print(f"Number of samples: {len(df_process)}")
print(f"Number of unique RPM values: {len(df_process['V'].unique())}")
print(f"Number of unique torque values: {len(df_process['N'].unique())}")
print(f"Number of unique sample numbers: {len(df_process['SampleNumber'].unique())}")
df_process.head()

There are 77 unique combinations of rotational speed and torque in the training dataset.
Each combination has 1-5 samples.

In [None]:
# get the unique number of combinations of RPM and torque
df_runs = df_process.groupby(['V', 'N']).size().reset_index(name='counts')
df_runs.head()

# Building a decomposition matrix 

1. [Load healthy data & extract FFT](#Load-all-healthy-data-and-convert-to-frequency-domain.)
2. [Convert to orders](#Order-transformation-and-binning)

> **Difference to industrial use case**: There is only one location --> matrix is 3 dimensional instead of 6 dimensional

> - [ ] TODO: Build decomposition matrix for all fault levels
> - [x] TODO: Build decomposition matrix for all directions (x, y and z) 

## Load all healthy data and convert to frequency domain.

As the data is given in the time domain, we transform it to the frequency domain.
Each individual measurement is transformed to a frequency spectrum with a short-time Fourier transform (STFT).

**Short-time Fourier transform (STFT)**: Let $x(t)$ represent the original vibration signal in the time domain with the time index $t$.
The **STFT** is applied to $x(t)$ to obtain a representation $X(f,\tau)$ in the frequency domain, where $f$ is the frequency index and $\tau$ is the time window index.


**Welch's method**: Welch's method (also called the periodogram method) for estimating power spectra is carried out by dividing the time signal into successive blocks, forming the periodogram for each block, and averaging [[source](https://ccrma.stanford.edu/~jos/sasp/Welch_s_Method.html)].
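
The segment length determines the frequency resolution: the bin spacing is `fs / nperseg`. As a rough sketch, the snippet below applies the parameters used in this notebook (`fs = 20480`, `nperseg = 10240`, 50% overlap) to a synthetic noise signal. The signal and its 3-second length are assumptions for illustration and stand in for one vibration measurement.

```python
import numpy as np
from scipy.signal import stft

fs = 20480               # sampling frequency from the challenge description [Hz]
nperseg = 10240          # segment length used in this notebook
noverlap = nperseg // 2  # 50% overlap

# synthetic stand-in for a single vibration channel (assumption)
rng = np.random.default_rng(0)
x = rng.normal(size=3 * fs)

f, t, Zxx = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
delta_f = f[1] - f[0]    # bin spacing, equal to fs / nperseg
print(f"{len(f)} frequency bins, spacing {delta_f} Hz, up to {f[-1]} Hz")
print(f"{len(t)} time windows, STFT shape {Zxx.shape}")
```

With these settings the spectrum is sampled every 2 Hz up to the Nyquist frequency of 10240 Hz.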

In [None]:
fnames = glob.glob(os.path.join(BASE_PATH_HEALTHY, '*.txt'))

import signal

class timeout:
    def __init__(self, seconds=1, error_message='Timeout'):
        self.seconds = seconds
        self.error_message = error_message
    def handle_timeout(self, signum, frame):
        raise TimeoutError(self.error_message)
    def __enter__(self):
        signal.signal(signal.SIGALRM, self.handle_timeout)
        signal.alarm(self.seconds)
    def __exit__(self, type, value, traceback):
        signal.alarm(0)

def load_data(fnames, use_train_data_for_validation=True, base_path=BASE_PATH_HEALTHY, **kwargs):
    """train_data --> process parameters are known. (TODO: change later)"""
    data = []
    for fn in tqdm(fnames):
        rpm, torque, run = extract_process_parameters(fn, use_train_data_for_validation=use_train_data_for_validation)
        try:
            with timeout(seconds=4):
                df = load_train_data(rpm, torque, run, base_path=base_path) if use_train_data_for_validation else load_test_data(rpm, torque, run, base_path=base_path)
        except TimeoutError:
            # skip this file; otherwise `df` would be undefined or stale below
            print(f'timed out loading {fn}')
            continue
        # STFT per direction
        f, t, stft_x = stft(df['x'], **kwargs)
        f, t, stft_y = stft(df['y'], **kwargs)
        f, t, stft_z = stft(df['z'], **kwargs)
        # Welch PSD per direction (same segment parameters as the STFT)
        f, psd_x = welch(df['x'], **kwargs)
        f, psd_y = welch(df['y'], **kwargs)
        f, psd_z = welch(df['z'], **kwargs)
        data.append({
            'rpm': rpm,
            'torque': torque,
            'sample_id': run,
            'unique_sample_id': f'{rpm}_{torque}_{run}',
            'vibration_time_domain': df,
            'stft_x': stft_x,
            'stft_y': stft_y,
            'stft_z': stft_z,
            'psd_x': psd_x,
            'psd_y': psd_y,
            'psd_z': psd_z
        })
    return data, f

nperseg = 10240
noverlap = nperseg // 2
nfft = None
fs = 20480
data_healthy, f = load_data(fnames, nperseg=nperseg, noverlap=noverlap, nfft=nfft, fs=fs)

We use most of the healthy data for training. 25% are held out for validation.

> TODO: have all operating modes in the training set.

In [None]:
# V1: randomly shuffle the data and split into train and test set once
# V2: 4 different independent splits
# V3: repeat N times: Sample equal amount of samples from healthy and faulty data
# ADD_VALIDATION_AND_TEST: determines whether to add validation and test data from the challenge (used with 'V1')
RANDOM_SPLIT = 'V3'
ADD_VALIDATION_AND_TEST = True
SPLIT = 0.75
N = 100
CACHE_RESULTS = False
LOAD_CACHED_RESULTS = True
CACHING_FOLDER_NAME = 'CACHED_RESULTS_300124'
assert RANDOM_SPLIT in ['V1', 'V2', 'V3']
assert CACHE_RESULTS != LOAD_CACHED_RESULTS

data_healthy_train_folds = []
data_healthy_test_folds = []
if RANDOM_SPLIT == 'V1':
    N=1
    # randomly shuffle the data and split into train and test set once
    split_id = int(len(data_healthy) * SPLIT)
    random.Random(42).shuffle(data_healthy)   # !!!
    data_healthy_train = data_healthy[:split_id]
    data_healthy_test = data_healthy[split_id:]
    data_healthy_train_folds = [data_healthy_train]
    data_healthy_test_folds = [data_healthy_test]
elif RANDOM_SPLIT == 'V2':
    N=4
    n_total = len(data_healthy)
    for i in range(4):
        split_id_start = (n_total * i) // 4
        split_id_stop = (n_total * (i+1)) // 4
        data_healthy_test_ = data_healthy[split_id_start:split_id_stop]
        data_healthy_train_ = data_healthy[:split_id_start] + data_healthy[split_id_stop:]
        data_healthy_test_folds.append(data_healthy_test_)
        data_healthy_train_folds.append(data_healthy_train_)
elif RANDOM_SPLIT == 'V3':
    for i in range(N):
        # shuffle the healthy data with seed i and split it into a train and a test part
        split_id = int(len(data_healthy) * SPLIT)
        random.Random(i).shuffle(data_healthy)
        data_healthy_train_ = data_healthy[:split_id]
        data_healthy_test_ = data_healthy[split_id:]
        data_healthy_test_folds.append(data_healthy_test_)
        data_healthy_train_folds.append(data_healthy_train_)

len(data_healthy_train_folds[0])

## Order transformation and binning

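In the order domain, frequencies are expressed as multiples (orders) of the shaft rotational frequency: order = frequency [Hz] / (RPM / 60). A component at order 40, for example, occurs 40 times per shaft revolution, independent of the rotational speed.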

> Contrast to industrial use case: trivial due to static RPM
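
The idea can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the implementation of `derive_df_orders`: the helper name `to_order_bands`, averaging as the aggregation within a band, and the toy spectrum are all hypothetical. Only the band limits (0.5 to 100.5 orders, 50 bands) mirror the `setup` dictionary used below.

```python
import numpy as np


def to_order_bands(f, psd, rpm, start=0.5, stop=100.5, n_windows=50):
    """Hypothetical helper: average a spectrum within equally wide order bands."""
    shaft_hz = rpm / 60.0                  # shaft rotational frequency [Hz]
    orders = f / shaft_hz                  # order 1 = once per shaft revolution
    edges = np.linspace(start, stop, n_windows + 1)
    band_values = np.full(n_windows, np.nan)
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        mask = (orders >= lo) & (orders < hi)
        if mask.any():
            band_values[i] = psd[mask].mean()
    return edges, band_values


# toy spectrum: a single peak at 400 Hz, i.e. order 40 of a 600 rpm shaft
f = np.arange(0, 10241, 2.0)
psd = np.where(f == 400.0, 1.0, 0.0)
edges, bands = to_order_bands(f, psd, rpm=600)
peak_band = int(np.nanargmax(bands))
print(f"peak in band {edges[peak_band]}-{edges[peak_band + 1]} orders")
```

Because the speed is constant within each measurement, a single division by the shaft frequency is sufficient; no resampling of the time signal is needed.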

In [None]:
fpath_df_orders_train_folds = os.path.join(CACHING_FOLDER_NAME, f'df_orders_train_folds.pkl')
fpath_meta_data_train_folds = os.path.join(CACHING_FOLDER_NAME, f'meta_data_train_folds.pkl')
setup = {'start': 0.5, 'stop': 100.5, 'n_windows': 50, 'window_steps': 2, 'window_size': 2}

# load transformed data (if specified)
if LOAD_CACHED_RESULTS:
    with open(fpath_df_orders_train_folds, 'rb') as file:
        df_orders_train_folds = pickle.load(file)
    with open(fpath_meta_data_train_folds, 'rb') as file:
        meta_data_train_folds = pickle.load(file)

# load train data and transform to orders
else:
    df_vib_train_folds = []
    df_orders_train_folds = []
    meta_data_train_folds = []
    for fold, data_healthy_train_ in enumerate(tqdm(data_healthy_train_folds, desc='Deriving orders on training set per fold')):
        df_vib_train_folds.append(derive_df_vib(data_healthy_train_, f)) # f!!!
        df_orders_train_, meta_data_train_ = derive_df_orders(df_vib_train_folds[-1], setup, f, verbose=False)
        df_orders_train_[meta_data_train_.columns] = meta_data_train_
        df_orders_train_folds.append(df_orders_train_)
        meta_data_train_folds.append(meta_data_train_)
        """
        fpath = os.path.join('df_nmf_models_folds_241023', f'df_orders_train_folds_{fold}.pkl')
        with open(fpath, 'wb') as file:
            pickle.dump(df_orders_train_, file)
        fpath = os.path.join('df_nmf_models_folds_241023', f'meta_data_train_folds_{fold}.pkl')
        with open(fpath, 'wb') as file:
            pickle.dump(meta_data_train_, file)
        """
    if CACHE_RESULTS:
        # cache order-transformed training data
        with open(fpath_df_orders_train_folds, 'wb') as file:
            pickle.dump(df_orders_train_folds, file)
        # cache the corresponding meta data
        with open(fpath_meta_data_train_folds, 'wb') as file:
            pickle.dump(meta_data_train_folds, file)

# plot effect of orders
cols = df_orders_train_folds[-1].columns
BAND_COLS = cols[cols.str.contains('band')].tolist()
idx_cols = ['index', 'rotational speed [RPM]', 'torque [Nm]', 'direction',
            'unique_sample_id', 'sample_id']
cols = BAND_COLS + idx_cols
df_ = df_orders_train_folds[-1].reset_index()[cols]
df_ = pd.melt(df_, id_vars=idx_cols, var_name='frequency band', value_name='frequency band value')
fig = px.line(df_, x='frequency band', y='frequency band value',
              facet_row='direction', color='unique_sample_id',
              hover_data=['rotational speed [RPM]', 'torque [Nm]'],
              title='Frequency bands for healthy samples, before normalisation',
              markers=True, width=1200, height=600)
# draw vertical line at band_39.5-40.5 in plotly express figure
# for x in [39, 79]:
#    fig.add_shape(type='line', x0=x, y0=0, x1=x, y1=2, line=dict(color='black', width=1, dash='dash'))
fig

We observe **major peaks at 40 and 80 orders**. 
40 orders corresponds to the number of teeth of the driving gear (= **gear mesh frequency**), 80 orders corresponds to a **harmonic frequency**.
The driven gear has 72 teeth which are not visible in the order spectrum. 
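
The relation between tooth count, speed and gear mesh frequency can be written down directly. The sketch below assumes that the stated rotational speed refers to the shaft carrying the 40-tooth driving gear; the function names are hypothetical.

```python
def gear_mesh_frequency_hz(rpm, n_teeth):
    """Gear mesh frequency = number of teeth x shaft rotational frequency."""
    return n_teeth * rpm / 60.0


def gear_mesh_order(n_teeth, harmonic=1):
    """Order (multiple of the shaft frequency) of a gear mesh harmonic."""
    return n_teeth * harmonic


for rpm in (100, 200):  # the two example speeds plotted earlier in this notebook
    gmf = gear_mesh_frequency_hz(rpm, n_teeth=40)
    print(f"{rpm} rpm: GMF = {gmf:.1f} Hz, 2nd harmonic = {2 * gmf:.1f} Hz")

print("orders:", gear_mesh_order(40), gear_mesh_order(40, harmonic=2))
```

Both gears of a pair mesh at the same frequency (40 x driving shaft speed = 72 x driven shaft speed), so relative to the driving shaft no separate peak at order 72 would be expected.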

## Frequency-band normalization

> **Observation**: if only one direction is used (i.e. `y` and `z` are left out), the explained variance is much higher without normalisation.

In [None]:
df_V_train_normalized_folds = [normalize_1(df_orders_train_, BAND_COLS) for df_orders_train_ in df_orders_train_folds]
idx_vars = ['rotational speed [RPM]', 'torque [Nm]', 'direction', 'unique_sample_id', 'sample_id']
df_ = df_V_train_normalized_folds[-1].reset_index()
df_[idx_vars] = df_orders_train_folds[-1][idx_vars]
df_ = pd.melt(df_, id_vars=['index'] + idx_vars, 
    var_name='frequency band', value_name='frequency band value'
    )
fig = px.line(df_, x='frequency band', y='frequency band value',
              facet_row='direction', color='unique_sample_id',
            # hover_data=['rotational speed [RPM]', 'torque [Nm]'], 
              title='Frequency bands for healthy samples, after normalisation',
              markers=True, width=1200, height=600)
fig.show()

In [None]:
df_V_train_folds = df_V_train_normalized_folds # df_V_train_not_normalized 

How large are the folds?

In [None]:
len_V = [len(V_) for V_ in df_V_train_folds]
pd.Series(len_V).plot.hist()

## Non-negative matrix factorization (NMF)
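
NMF approximates a non-negative matrix $V$ (measurements x frequency bands) by the product of two non-negative matrices, $V \approx W H$, where $H$ holds the spectral components and $W$ the per-measurement weights. The internals of `extract_nmf_per_number_of_component` are not shown in this notebook, so the following sketch uses scikit-learn's `NMF` on toy data as a stand-in; matrix sizes and solver settings are assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# toy decomposition matrix V: 30 measurements x 50 order bands, built from
# 3 non-negative spectral shapes so that a rank-3 factorisation fits well
H_true = rng.random((3, 50))
W_true = rng.random((30, 3))
V = W_true @ H_true

model = NMF(n_components=3, init="nndsvda", max_iter=1000, random_state=42)
W = model.fit_transform(V)   # weights per measurement (n_samples x n_components)
H = model.components_        # spectral components (n_components x n_bands)

rel_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(W.shape, H.shape, f"relative reconstruction error: {rel_error:.3f}")
```

The rows of `W` are what the later sections aggregate into vibration fingerprints.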

In [None]:
# ignore convergence warnings (1000 iterations reached by NMF)
import warnings;
warnings.filterwarnings('ignore');

# if LOAD_CACHED_RESULTS:
if False:
    with open(os.path.join(CACHING_FOLDER_NAME, f'df_nmf_models_folds.pkl'), 'rb') as file:
        df_nmf_models_folds = pickle.load(file)
if not LOAD_CACHED_RESULTS:
    MAX_N_COMPONENTS = 40
    # cache df_nmf_folds on the disk
    # df_nmf_models_folds = []
    for fold, df_V_train_ in enumerate(tqdm(df_V_train_folds, desc='Extracting NMF models per fold')):
        fpath = os.path.join(CACHING_FOLDER_NAME, 'df_nmf_models_folds', f'df_nmf_models_folds_{fold}.pkl')
        df_nmf_models_ = extract_nmf_per_number_of_component(
            df_V_train_, n_components=MAX_N_COMPONENTS, timestamps=df_V_train_.index, verbose=False
        )   #changed timestamps from df_V_train_normalized.index to df_V_train.index
        # df_nmf_models_folds.append(df_nmf_models_)
        fpath = os.path.join(CACHING_FOLDER_NAME, 'df_nmf_models_folds', f'df_nmf_models_folds_{fold}.pkl')
        if CACHE_RESULTS:
            with open(fpath, 'wb') as file:
                pickle.dump(df_nmf_models_, file)

> - If we use 10-fold cross-validation, we have to extract the NMF components for each fold individually. In some folds 4 components work better, in others 5.
> - There seems to be a bug in the decomposition matrix

In [None]:
# plot hyperparameters for NMF
FOLD = 2
QMIN = 0.001
QMAX = 0.999
MIN_EXPLAINED_VARIANCE = 95 # 99.9 # 95
ORDER_COMPONENTS = None # list(range(15))

df_nmf_models_folds = []
for fold in tqdm(list(range(N)), desc='Loading NMF models per fold'):
    # fpath = os.path.join('df_nmf_models_folds_301023', f'df_nmf_models_folds_{i}.pkl')
    fpath = os.path.join(CACHING_FOLDER_NAME, 'df_nmf_models_folds', f'df_nmf_models_folds_{fold}.pkl')
    with open(fpath, 'rb') as file:
        models_ = pickle.load(file)
    df_nmf_models_folds.append(models_)

df_V_train_ = df_V_train_folds[FOLD]
df_nmf_models_ = df_nmf_models_folds[FOLD]

# get minimum and maximum for feature space
vmin = df_V_train_.stack().quantile(q=QMIN)
print(f'    - took {vmin} (={QMIN} quantile) as vmin for plotting feature space')
vmax = df_V_train_.stack().quantile(q=QMAX)
print(f'    - took {vmax} (={QMAX} quantile) as vmax for plotting feature space')

# calculate explained variance
print(f'- calculating explained variance...')
pca = PCA(n_components=len(BAND_COLS), random_state=42)
pca.fit(df_V_train_)
explained_variance_ratio = pca.explained_variance_ratio_.cumsum()
explained_variance_ratio = explained_variance_ratio[:39]

fig, _ = illustrate_nmf_components_for_paper(
    df_V_train_, explained_variance_ratio, df_nmf_models_, pd.Series(BAND_COLS),
    min_explained_variance=MIN_EXPLAINED_VARIANCE, order_components=ORDER_COMPONENTS,
    vmin=vmin, vmax=vmax, xlims=(-1,101), plot_x_ticks=[0, 10, 20, 30, 40, 49]
)

fig.savefig(os.path.join('figs', 'nmf_exemplary_fold.pdf'), dpi=300, bbox_inches='tight')

> - TODO: SOME EMPTY FREQUENCY BANDS or just extreme outliers (set `QMIN=0.01` and `QMAX=0.99`)?
> - after adding all directions (x, y, z): slightly less explained variance, still very high though

In [None]:
N_COMPONENTS = 5
COMPONENT_COLUMNS = list(range(N_COMPONENTS))  # used later

model_folds = []
for df_nmf_models_ in df_nmf_models_folds:
    model_ = df_nmf_models_[(df_nmf_models_.n_components == N_COMPONENTS)].iloc[0]
    model_folds.append(model_)

# Offline vibration fingerprint extraction

In contrast to the industrial dataset, in this dataset there are no timestamps. However, we know speed and torque for each measurement. Therefore, we merge the fingerprint with the operating mode.

In [None]:
df_W_train_with_OM_folds = []

for i, (model_, df_V_train_normalized_, meta_data_train_) in enumerate(zip(model_folds, df_V_train_normalized_folds, meta_data_train_folds)):
    W_train_ = model_.W.reshape(-1, N_COMPONENTS)
    df_W_train_ = pd.DataFrame(W_train_)
    display(f'Fold {i}. Shape: {W_train_.shape}')
    df_W_train_.index = df_V_train_normalized_.index
    df_W_train_['direction'] = meta_data_train_['direction']

    # add operating mode (OM)
    df_W_train_with_OM_ = pd.merge(df_W_train_, meta_data_train_.drop(columns=['direction']), left_index=True, right_index=True)
    df_W_train_with_OM_folds.append(df_W_train_with_OM_)

df_W_train_with_OM_folds[0].head()

Below we plot the weights of an individual measurement from the training set.

In [None]:
# The same measurement is in three of four training folds
rpm=100
torque=500
run=1
fold=0

df_W_train_with_OM_ = df_W_train_with_OM_folds[fold]
df_ = df_W_train_with_OM_[(df_W_train_with_OM_['rotational speed [RPM]']==rpm) &
                          (df_W_train_with_OM_['torque [Nm]']==torque) &
                          (df_W_train_with_OM_['sample_id']==run)]
df_ = df_.set_index('direction')
fig, ax = plt.subplots()
sns.heatmap(df_[list(range(N_COMPONENTS))], annot=True, fmt=".3f", ax=ax, cmap='Blues', vmin=0, vmax=0.1, cbar=False)
ax.set_title(f'Measurement {run} @ {rpm} rpm, {torque} Nm');
ax.set_xlabel('component');

A vibration fingerprint is the aggregation over all measurements of a given operating mode; below we use the mean weight of each component per direction.

In [None]:
df_W_train_with_OM_ = df_W_train_with_OM_folds[fold]
df_ = df_W_train_with_OM_[(df_W_train_with_OM_['rotational speed [RPM]']==rpm) & (df_W_train_with_OM_['torque [Nm]']==torque)]
df_ = df_[list(range(N_COMPONENTS)) + ['direction']].groupby('direction').mean()
fig, ax = plt.subplots()
sns.heatmap(df_, annot=True, fmt=".3f", ax=ax, cmap='Blues', vmin=0, vmax=0.1, cbar=False)
ax.set_title(f'Vibration fingerprint @ {rpm} rpm, {torque} Nm');
ax.set_xlabel('component');

## Operating mode detection

In this dataset, a large part of our previously constructed processing pipeline is not needed to cluster operating modes:
1. First, there are only two process parameters, removing the need to reduce the dimensionality of the data.
2. Second, process parameters and vibration measurements are already associated, making it unnecessary to manually merge them based on timestamps.

In this section we propose two methods to cluster the operating modes: 

1. Treating each unique combination of speed and torque as a separate operating mode.
2. (Setting clusters based on differences in vibration profiles.) 

> The second method is currently an optional TODO.

### Treating each unique combination of speed and torque as a separate operating mode

In [None]:
# for each unique combination of RPM and torque, assign a unique cluster label
cluster_label_unique_name_mapping_folds = []
for i, df_W_train_with_OM_ in enumerate(df_W_train_with_OM_folds):
    df_W_train_with_OM_['cluster_label_unique'] = df_W_train_with_OM_.groupby(['rotational speed [RPM]', 'torque [Nm]']).ngroup()
    df_W_train_with_OM_folds[i] = df_W_train_with_OM_
    cluster_label_unique_name_mapping_folds.append(df_W_train_with_OM_.groupby('cluster_label_unique').first()[['rotational speed [RPM]', 'torque [Nm]']].reset_index())

cluster_label_unique_name_mapping_folds[0].head()

Below we extract and illustrate the fingerprints.

In [None]:
SHOW_FINGERPRINTS = True

# extract operating mode wise fingerprints
grouping_vars = ['direction', 'cluster_label_unique']
fingerprints_folds = []
for i, df_W_train_with_OM_ in enumerate(df_W_train_with_OM_folds):
    df_ = df_W_train_with_OM_[COMPONENT_COLUMNS + grouping_vars].copy()
    fingerprints_ = {
        om: om_group.groupby(['direction']).mean().drop(columns=['cluster_label_unique']) for om, om_group in df_.groupby('cluster_label_unique')
    }
    fingerprints_folds.append(fingerprints_)

# illustrate fingerprints
fingerprints_ = fingerprints_folds[fold]
cluster_label_unique_name_mapping_ = cluster_label_unique_name_mapping_folds[fold]
if SHOW_FINGERPRINTS:
    nrows = math.ceil(len(fingerprints_) / 3)
    fig, axes = plt.subplots(figsize=(18, 3*nrows), nrows=nrows, ncols=3, sharex=True, sharey=True)
    for om, ax in tqdm(zip(fingerprints_, axes.flat), total=len(fingerprints_), desc='Plotting fingerprints'):
        om_group = fingerprints_[om]
        om_group.columns = om_group.columns.astype(str)
        sns.heatmap(om_group, annot=True, fmt=".3f", ax=ax, cmap='Blues', vmin=0, vmax=0.1, cbar=False)
        rpm = cluster_label_unique_name_mapping_[cluster_label_unique_name_mapping_.cluster_label_unique == om]['rotational speed [RPM]'].values[0]
        Nm = cluster_label_unique_name_mapping_[cluster_label_unique_name_mapping_.cluster_label_unique == om]['torque [Nm]'].values[0]
        ax.set_title(f'OM {om} ({rpm} rpm, {Nm} Nm)')
        ax.set_xlabel('component')
    fig.tight_layout()

> **Observation**: when the peaks of two components overlap (e.g. components 1 and 2), often only one of the two is activated.
> - --> local method? 
> - --> other decomposition method?

Plot pairwise distance between operating modes:

In [None]:
pairwise_distances = []
fingerprints_ = fingerprints_folds[fold]
for om1 in fingerprints_:
    fp1 = fingerprints_[om1]
    for om2 in fingerprints_:
        fp2 = fingerprints_[om2]
        dist_ = distance_metrics.cosine_distance(fp1, fp2)
        pairwise_distances.append({'om1': om1, 'om2': om2, 'dist': dist_})
df_pairwise_dist = pd.DataFrame(pairwise_distances)

# keyword arguments are required by recent pandas versions
df_plot = df_pairwise_dist.pivot(index='om1', columns='om2', values='dist')
fig, ax = plt.subplots(figsize=(20, 16))
sns.heatmap(df_plot, ax=ax, cmap='Blues', annot=False, fmt=".2f")
ax.set_title('Cosine distance between fingerprints')

# Regrouping vibration fingerprints (consensus operating modes)
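
Operating modes with similar fingerprints are merged by agglomerative clustering on cosine distances. The following sketch shows the mechanics on toy data; the two artificial groups are assumptions, while cosine distance, average linkage and the threshold of 0.5 correspond to the choices made in the cells below.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# toy fingerprints: one row per operating mode, flattened to
# (directions x components); two groups with clearly different patterns
rng = np.random.default_rng(0)
group_a = np.array([1.0, 0.1, 0.1, 0.1, 0.1, 0.1]) + 0.01 * rng.random((4, 6))
group_b = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 1.0]) + 0.01 * rng.random((3, 6))
toy_fingerprints = np.vstack([group_a, group_b])

# condensed cosine-distance matrix, then average-link (UPGMA) clustering
distances = pdist(toy_fingerprints, metric="cosine")
linkage_matrix = linkage(distances, method="average")

# cut the tree at a cosine distance of 0.5
labels = fcluster(linkage_matrix, t=0.5, criterion="distance")
print(labels)
```

Each entry of `labels` is the consensus group of the fingerprint in the same row, so the order of the rows has to be kept track of when mapping groups back to operating modes.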

In [None]:
fingerprints_ = fingerprints_folds[fold]

def plot_dendrogram(linkage_matrix, ax=None, **kwargs):
    dendrogram(linkage_matrix, ax=ax, **kwargs)
    ax.set_title('Hierarchical Clustering Dendrogram')
    ax.set_xlabel('Samples')
    ax.set_ylabel('Distance')
    #xlbls = ax.get_xmajorticklabels()
    #lbls = [replace_number_with_letter_(l.get_text()) for l in xlbls]
    #ax.set_xticklabels(lbls)
    return ax

fingerprints_feature_space = np.vstack(pd.Series(fingerprints_).apply(lambda df: df.stack().values).to_numpy())

fig, axes = plt.subplots(figsize=(15, 10), ncols=2, nrows=2, sharey=True)
axes = axes.flatten()
fig.suptitle('Hierarchical Clustering Dendrogram')

# Single linkage clustering
linkage_matrix_single = linkage(pdist(fingerprints_feature_space, metric='cosine'), optimal_ordering=True, method='single')
ax = plot_dendrogram(linkage_matrix_single, ax=axes[0])
ax.set_title('Single-link (nearest point)')

# Complete linkage clustering
linkage_matrix_complete = linkage(pdist(fingerprints_feature_space, metric='cosine'), optimal_ordering=True, method='complete')
ax = plot_dendrogram(linkage_matrix_complete, ax=axes[1])
ax.set_title('Complete-link (farthest point)')

# Average linkage clustering: WPGMA
linkage_matrix_wpgma = linkage(pdist(fingerprints_feature_space, metric='cosine'), optimal_ordering=True, method='weighted')
ax = plot_dendrogram(linkage_matrix_wpgma, ax=axes[2])
ax.set_title('Average-link (WPGMA)')

# Average linkage clustering: UPGMA
linkage_matrix_upgma = linkage(pdist(fingerprints_feature_space, metric='cosine'), optimal_ordering=True, method='average')
ax = plot_dendrogram(linkage_matrix_upgma, ax=axes[3])
ax.set_ylim(0, 1)
ax.set_title('Average-link (UPGMA): preferred method')

fig.tight_layout()

In [None]:
# Set a threshold to determine the number of clusters (you can adjust this threshold as needed)
threshold = 0.5

linkage_matrix_avg = linkage(pdist(fingerprints_feature_space, metric='cosine'), optimal_ordering=True, method='average')
fig, ax = plt.subplots(figsize=(20, 7))
ax = plot_dendrogram(linkage_matrix_avg, ax=ax, color_threshold=threshold, above_threshold_color='k')
ax.set_title('Average-link (UPGMA)')
ax.axhline(y=threshold, c='k', ls='--', lw=1)

# Get the cluster labels based on the threshold
cluster_labels = fcluster(linkage_matrix_avg, threshold, criterion='distance')
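# generate dictionary where each operating mode is mapped to the respective group and save locally
# NOTE: string.ascii_uppercase has only 26 letters, so zip() silently truncates
# the mapping when there are more than 26 operating modes
cluster_labels_dict = dict(zip(list(string.ascii_uppercase[0:len(cluster_labels)]), cluster_labels))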
#if CACHE_RESULTS:
if False:
    with open('cluster_labels_dict.pickle', 'wb') as fp:
        pickle.dump(cluster_labels_dict, fp)

In [None]:
tick_color = {1: 'blue', 2: 'red', 3: 'green', 4: 'orange', 5: 'purple', 6: 'brown', 7: 'pink', 8: 'gray', 9: 'olive', 10: 'cyan'}

dfs_ = {om: fingerprints_[om].values.flatten() for om in fingerprints_}
fig, ax = plt.subplots(figsize=(4, 18))
ax.set_title('Fingerprinting featurespace with corresponding cluster', fontsize=32)

df = pd.DataFrame(dfs_).T
sns.heatmap(df, ax=ax, cmap='Blues', annot=False, fmt=".2f", norm=LogNorm())
texts = []
for tick, cluster_label_ in zip(ax.get_yticklabels(), cluster_labels):
    tick.set_color(tick_color[cluster_label_])
    texts.append(f'{tick.get_text()} ({cluster_label_})')
ax.set_xlabel('Component', fontsize=24)
ax.set_ylabel('Fingerprint (cluster-id)', fontsize=24)
ax.set_yticklabels(texts, rotation=0, fontsize=16);
fig.tight_layout()

# Online fingerprint extraction (C02)

For the test set, we first load the process and vibration data that exhibit a high level of pitting and merge it with the healthy data that we previously held back.

> Formerly we wrote: First we load the process data for the test set (which is formatted slightly different than the train data).

## Loading test data

At the moment, the test data consists of two different conditions:
1. **Anomalous condition**: Pitting level 8
2. **Normal condition**: Healthy data

### Loading anomalous test data

There are 296 samples in the test set that exhibit a high level of pitting (pitting level 8) that were recorded at different speeds and torques.

In [None]:
# this code cell is not essential for the notebook

USE_TRAINING_SET_FOR_VALIDATION = True
# PITTING_LEVEL = 1
pitting_levels = [1, 2, 3, 4, 6, 8]

if USE_TRAINING_SET_FOR_VALIDATION:
    # BASE_PATH_TEST = os.path.join('Data_Challenge_PHM2023_training_data', f'Pitting_degradation_level_{PITTING_LEVEL}')
    base_paths_test = [os.path.join('Data_Challenge_PHM2023_training_data', f'Pitting_degradation_level_{pitting_level}') for pitting_level in pitting_levels]
else:
    # BASE_PATH_TEST = os.path.join('Data_Challenge_PHM2023_test_data')
    base_paths_test = [os.path.join('Data_Challenge_PHM2023_test_data')]

df_process_test_dict_ = {}
for lvl, path_ in zip(pitting_levels, base_paths_test):
    fnames = glob.glob(os.path.join(path_, '*.txt'))
    data_test_ = []
    for file_path in fnames:
        v_value, n_value, sample_number = extract_process_parameters(file_path, use_train_data_for_validation=USE_TRAINING_SET_FOR_VALIDATION)  # change train_data after restarting notebook
        data_test_.append({
            'V': v_value,  # Remove the 'V' prefix and convert to integer
            'N': n_value,  # Remove the 'N' suffix and convert to integer
            'SampleNumber': sample_number  # Remove the '.txt' extension and convert to integer
        })

    df_process_test_dict_[lvl] = pd.DataFrame(data_test_)

    print(f"--- Unhealthy data (pitting level {lvl}) ---")
    print(f"Number of samples: {len(df_process_test_dict_[lvl])}")
    print(f"Number of unique RPM values: {len(df_process_test_dict_[lvl]['V'].unique())},",
          f"torque values: {len(df_process_test_dict_[lvl]['N'].unique())},",
          f"sample numbers: {len(df_process_test_dict_[lvl]['SampleNumber'].unique())}")

df_process_test_dict_[1].head()

Load test vibration data and transform to orders.
We normalize test vibration data with the same normalization parameters as the train data.
In contrast to the original dataset, it is not necessary to select valid vibration measurement periods, as all measurements were taken at the same time.
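
To make the "transform to orders" step more concrete, here is a minimal sketch of the general idea: a spectrum on a frequency axis in Hz is rescaled by the shaft rotation frequency, so that spectral peaks line up across different rotational speeds. This is not the implementation of `derive_df_orders`; the signal, the sampling frequency, and the Welch parameters below are illustrative assumptions.

```python
import numpy as np
from scipy.signal import welch

fs = 1000                  # sampling frequency in Hz (illustrative value)
rpm = 1200                 # rotational speed of the shaft (illustrative value)
shaft_freq = rpm / 60.0    # rotations per second, here 20 Hz

# synthetic vibration signal with a single component at the 3rd order (60 Hz)
t = np.arange(0, 5, 1 / fs)
signal = np.sin(2 * np.pi * 3 * shaft_freq * t)

# power spectral density on a frequency axis in Hz
f, psd = welch(signal, fs=fs, nperseg=1000, noverlap=500, nfft=1000)

# rescale the frequency axis to orders (multiples of the shaft frequency)
orders = f / shaft_freq
peak_order = orders[np.argmax(psd)]
print(f'Dominant order: {peak_order:.2f}')
```

On the order axis, a component tied to the shaft rotation stays at the same position when the speed changes, which is what makes spectra from different operating conditions comparable.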

> TODO: operating mode detection based on vibration fingerprints
> 
> For now, we directly assign the ground-truth operating mode. Eventually, we want to train a classifier on the operating mode groups that predicts the operating mode from the vibration fingerprint.

In [None]:
# extract data in original format
df_orders_test_pitting_dict = {}
meta_data_test_pitting_dict = {}
for lvl, path in tqdm(zip(pitting_levels, base_paths_test), desc='Extracting test data', total=len(base_paths_test)):
    fnames = glob.glob(os.path.join(path, '*.txt'))
    data_test, f = load_data(fnames, nperseg=nperseg, noverlap=noverlap, nfft=nfft, fs=fs, base_path=path, 
                            use_train_data_for_validation=USE_TRAINING_SET_FOR_VALIDATION)  # !!! change train_data to use_train_data_for_validation after restarting notebook

    # extract vibration data
    df_vib_test_unhealthy = derive_df_vib(data_test, f)

    # convert to orders and derive meta data
    df_orders_test_pitting_, meta_data_test_pitting_ = derive_df_orders(df_vib_test_unhealthy, setup, f, verbose=False)
    if USE_TRAINING_SET_FOR_VALIDATION:
        print('transforming sample-id in test set')
        # meta_data_test_pitting_8['test_sample_id'] = meta_data_test_pitting_8.groupby(['rotational speed [RPM]', 'torque [Nm]', 'sample_id']).ngroup() + 1   # !!! might not be necessary
        rpm = meta_data_test_pitting_['rotational speed [RPM]']
        torque = meta_data_test_pitting_['torque [Nm]']
        run = meta_data_test_pitting_['sample_id']
        meta_data_test_pitting_['unique_sample_id'] = rpm.astype(str) + '_' + torque.astype(str) + '_' + run.astype(str) + f'_pitting_level_{lvl}'

    df_orders_test_pitting_['unique_sample_id'] = meta_data_test_pitting_['unique_sample_id'] # + f'_pitting_level_{lvl}'

    df_orders_test_pitting_dict[lvl] = df_orders_test_pitting_
    meta_data_test_pitting_dict[lvl] = meta_data_test_pitting_

In [None]:
# For the balanced train-test split we exclude all operating modes that are not present in the training set.
# check if we can filter unrelated operating modes from df_orders_test_pitting_dict and meta_data_test_pitting_dict here
# --> need to do this per healthy fold 
"""
if SPLIT == 'V3':
    print('Removing samples from test set that expose operating conditions not found in the training set')
    for pitting_level in meta_data_test_pitting_dict:
        # exclude entries that show rotational speed [RPM] and torque [Nm] that are not in the training set
        old_meta_data_ = meta_data_test_pitting_dict[pitting_level]
        new_meta_data_ = old_meta_data_[old_meta_data_.unique_sample_id.isin(df_process.unique_sample_id)]
        meta_data_test_pitting_dict[pitting_level] = new_meta_data_
"""
pass

--- 


In [None]:
fpath_df_orders_test_folds = os.path.join(CACHING_FOLDER_NAME, f'df_orders_test_folds.pkl')
fpath_meta_data_test_folds = os.path.join(CACHING_FOLDER_NAME, f'meta_data_test_folds.pkl')

# load transformed data (if specified)
if LOAD_CACHED_RESULTS:
    with open(fpath_df_orders_test_folds, 'rb') as file:
        df_orders_test_folds = pickle.load(file)
    with open(fpath_meta_data_test_folds, 'rb') as file:
        meta_data_test_folds = pickle.load(file)
# transform test data to orders
else:
    # convert healthy test samples to orders
    meta_data_test_healthy_folds = []
    df_orders_test_healthy_folds = []
    for data_healthy_test_ in tqdm(data_healthy_test_folds, desc='convert healthy test samples to orders per fold'):
        df_vib_test_healthy_ = derive_df_vib(data_healthy_test_, f)
        df_orders_test_healthy_, meta_data_test_healthy_ = derive_df_orders(df_vib_test_healthy_, setup, f, verbose=False)
        meta_data_test_healthy_['unique_sample_id'] = meta_data_test_healthy_['unique_sample_id'] + '_healthy'
        df_orders_test_healthy_['unique_sample_id'] = meta_data_test_healthy_['unique_sample_id']
        meta_data_test_healthy_folds.append(meta_data_test_healthy_)
        df_orders_test_healthy_folds.append(df_orders_test_healthy_)

    # concat all pitting levels samples
    df_orders_test_pitting = pd.concat(list(df_orders_test_pitting_dict.values()))
    meta_data_test_pitting = pd.concat(list(meta_data_test_pitting_dict.values()))

    # merge healthy and unhealthy samples for each fold
    df_orders_test_folds = []
    meta_data_test_folds = []
    for i, (df_orders_test_healthy_, meta_data_test_healthy_) in enumerate(zip(df_orders_test_healthy_folds, meta_data_test_healthy_folds)):
        if RANDOM_SPLIT=='V3':
            # only use operating modes in the test set that are also in the training set
            om_test_healthy = meta_data_test_healthy_['rotational speed [RPM]'].astype(str) + '_' + meta_data_test_healthy_['torque [Nm]'].astype(str)
            om_test_pitting = meta_data_test_pitting['rotational speed [RPM]'].astype(str) + '_' + meta_data_test_pitting['torque [Nm]'].astype(str)
            new_meta_data_test_pitting_without_missing_oms = meta_data_test_pitting[om_test_pitting.isin(om_test_healthy)]
            new_df_orders_test_pitting_without_missing_oms = df_orders_test_pitting[om_test_pitting.isin(om_test_healthy)]

            # sample equal amount of samples from healthy and faulty data
            om_test_pitting_with_run = new_meta_data_test_pitting_without_missing_oms['rotational speed [RPM]'].astype(str) + '_' + new_meta_data_test_pitting_without_missing_oms['torque [Nm]'].astype(str) + '_' + new_meta_data_test_pitting_without_missing_oms['sample_id'].astype(str)
            om_test_healthy_with_run = meta_data_test_healthy_['rotational speed [RPM]'].astype(str) + '_' + meta_data_test_healthy_['torque [Nm]'].astype(str) + '_' + meta_data_test_healthy_['sample_id'].astype(str)
            n_samples = len(om_test_healthy_with_run.unique())
            samples = new_df_orders_test_pitting_without_missing_oms['unique_sample_id'].sample(n_samples, random_state=i, replace=False)
            new_meta_data_test_pitting = new_meta_data_test_pitting_without_missing_oms[new_meta_data_test_pitting_without_missing_oms['unique_sample_id'].isin(samples)]
            new_df_orders_test_pitting = new_df_orders_test_pitting_without_missing_oms[new_df_orders_test_pitting_without_missing_oms['unique_sample_id'].isin(samples)]
            df_orders_test_folds.append(pd.concat([df_orders_test_healthy_, new_df_orders_test_pitting]).reset_index(drop=True))
            meta_data_test_folds.append(pd.concat([meta_data_test_healthy_, new_meta_data_test_pitting]).reset_index(drop=True))
        else:
            df_orders_test_folds.append(pd.concat([df_orders_test_healthy_, df_orders_test_pitting]).reset_index(drop=True))
            meta_data_test_folds.append(pd.concat([meta_data_test_healthy_, meta_data_test_pitting]).reset_index(drop=True))

    if CACHE_RESULTS:
        # cache order-transformed test data
        with open(fpath_df_orders_test_folds, 'wb') as file:
            pickle.dump(df_orders_test_folds, file)
        # cache test meta data
        with open(fpath_meta_data_test_folds, 'wb') as file:
            pickle.dump(meta_data_test_folds, file)

In [None]:
# !!! healthy samples != faulty samples
display(meta_data_test_folds[-1].unique_sample_id.str.contains('healthy').sum())
display(meta_data_test_folds[-1].unique_sample_id.str.contains('pitting').sum())

In [None]:
# stop # 216 unique ids are missing, why?  --> because of the healthy samples --> issue fixed

In [None]:
# stop # meta_data_test has different types of unique_sample_id: <1200_50_3> for pitting and <1200_100_3_healthy> for no pitting 
# --> previously no issue as there was only one level of pitting
# fixed this 


---

Below we extract the vibration fingerprint for each measurement (***later, maybe: and assign the operating mode based on the previously trained classifier.***).
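
The fingerprint extraction below projects each normalized measurement period onto NMF components that were fitted on the training data. The following sketch illustrates that projection with scikit-learn's `NMF` on synthetic data. It assumes that `model.nmf` in the notebook behaves like a fitted scikit-learn `NMF` object, which is not confirmed here; the sizes and variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
n_bands, n_components = 8, 3   # illustrative sizes

# synthetic non-negative training spectra: one row per (sample, direction)
V_train = rng.random((30, n_bands))
nmf = NMF(n_components=n_components, init='nndsvda', max_iter=1000, random_state=0)
nmf.fit(V_train)

# one measurement period = three rows (x, y, z), assumed to be normalized already
V_period = pd.DataFrame(rng.random((3, n_bands)),
                        index=['x', 'y', 'z'],
                        columns=[f'band_{i}' for i in range(n_bands)])

# project the period onto the fixed components: W has one row per direction
W = pd.DataFrame(nmf.transform(V_period.to_numpy()),
                 index=V_period.index,
                 columns=[f'component_{i}' for i in range(n_components)])
print(W.shape)
```

Because the components stay fixed after fitting, weights derived for train and test periods live in the same feature space and can be compared against the fingerprints.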

In [None]:
# extract train vibration measurement periods
train_vibration_measurement_periods_folds = []
for df_V_train_normalized_, meta_data_train_ in zip(df_V_train_normalized_folds, meta_data_train_folds):
    df_ = df_V_train_normalized_
    #meta_data_train['sample_id_unique'] = meta_data_train.groupby(['sample_id', 'rotational speed [RPM]', 'torque [Nm]']).ngroup() + 1
    df_[['unique_sample_id', 'direction']] = meta_data_train_[['unique_sample_id', 'direction']]   # !!! wrong? 
    train_vibration_measurement_periods = []
    for sample_id, group in df_.groupby('unique_sample_id'):
        measurement_period = {
            'start': 'unknown', 
            'stop': 'unknown',
            'group': group,
            'sample_id': sample_id,
            #'rpm': group['rotational speed [RPM]'].unique()[0],
            #'torque': group['torque [Nm]'].unique()[0],
        }
        train_vibration_measurement_periods.append(group)
    train_vibration_measurement_periods_folds.append(train_vibration_measurement_periods)

In [None]:
# DOES IT REALLY WORK?

# extract test vibration measurement periods
test_vibration_measurement_periods_folds = []
test_vibration_measurement_periods_meta_data_folds = []
for df_orders_test_, meta_data_test_, cluster_label_unique_name_mapping_ in tqdm(zip(df_orders_test_folds, meta_data_test_folds, cluster_label_unique_name_mapping_folds), 
                                                                                 total=len(df_orders_test_folds)):
    df_V_test_normalized = normalize_1(df_orders_test_, BAND_COLS)
    df_ = df_V_test_normalized
    df_[['sample_id', 'unique_sample_id', 'direction']] = meta_data_test_[['sample_id', 'unique_sample_id', 'direction']]
    test_vibration_measurement_periods_ = []
    test_vibration_measurement_periods_meta_data_ = []
    n_index_errors = 0
    for unique_sample_id, group in df_.groupby('unique_sample_id'):
        rpm = meta_data_test_[meta_data_test_['unique_sample_id'] == unique_sample_id]['rotational speed [RPM]'].unique()[0]
        torque = meta_data_test_[meta_data_test_['unique_sample_id'] == unique_sample_id]['torque [Nm]'].unique()[0]
        try:
            om = cluster_label_unique_name_mapping_[
                (cluster_label_unique_name_mapping_['rotational speed [RPM]'] == rpm) & 
                (cluster_label_unique_name_mapping_['torque [Nm]'] == torque)
            ]['cluster_label_unique'].iloc[0]
        except IndexError:
            n_index_errors += 1
            om = -1
        measurement_period = {
            'start': 'unknown', 
            'stop': 'unknown',
            'group': group,
            'unique_sample_id': unique_sample_id,
            'rpm': rpm,
            'torque': torque,
            'unique_cluster_label': om
        }
        test_vibration_measurement_periods_.append(group)
        test_vibration_measurement_periods_meta_data_.append(measurement_period)
    test_vibration_measurement_periods_folds.append(test_vibration_measurement_periods_)
    test_vibration_measurement_periods_meta_data_folds.append(test_vibration_measurement_periods_meta_data_)

    n_total = len(test_vibration_measurement_periods_)
    #print(f'Total number of measurement periods: {n_total}')
    #print(f'Number of measurement periods with unknown RPM and/or torque: {n_index_errors}')

In [None]:
# extract df_W_offline and df_W_online
def extract_vibration_weights_per_measurement_period(measurement_periods, col_names, band_cols, normalization, model, verbose=False):
    Ws = []
    for period in tqdm(measurement_periods, disable=not verbose, desc='Extracting vibration weights per measurement period'):
        assert len(period) == 3, 'should have exactly 3 directions per measurement period'
        band_column_names = period.columns[period.columns.str.contains('band_')]
        V = period.set_index(['direction'])[band_column_names]  # already normalized
        # dim(W) = 3 x N_COMPONENTS (one row per direction)
        W = model.nmf.transform(V)
        W = pd.DataFrame(W, columns=col_names)  # !!!
        Ws.append({
            # 'Sample_id': period.sample_id.unique()[0],
            'unique_sample_id': period.unique_sample_id.unique()[0],  # !!!
            'V_normalized': V,
            'W': W
        })
    return pd.DataFrame(Ws)

df_W_offline_folds = []
df_W_online_folds = []
for train_vibration_measurement_periods_, test_vibration_measurement_periods_, fingerprints_, model_ in tqdm(zip(train_vibration_measurement_periods_folds, 
                         test_vibration_measurement_periods_folds,
                         fingerprints_folds,
                         model_folds),
                         total=len(fingerprints_folds)):
    df_W_offline_ = extract_vibration_weights_per_measurement_period(train_vibration_measurement_periods_, fingerprints_[0].columns, BAND_COLS, normalize_1, model_)
    df_W_online_ = extract_vibration_weights_per_measurement_period(test_vibration_measurement_periods_, fingerprints_[0].columns, BAND_COLS, normalize_1, model_)
    df_W_offline_folds.append(df_W_offline_)
    df_W_online_folds.append(df_W_online_)

df_W_online_folds[fold].head()

In [None]:
# old code before cross-validation
# extract train vibration measurement periods
"""
df_ = df_V_train_normalized
#meta_data_train['sample_id_unique'] = meta_data_train.groupby(['sample_id', 'rotational speed [RPM]', 'torque [Nm]']).ngroup() + 1
df_[['unique_sample_id', 'direction']] = meta_data_train[['unique_sample_id', 'direction']]   # !!! wrong? 
train_vibration_measurement_periods = []
for sample_id, group in df_.groupby('unique_sample_id'):
    measurement_period = {
        'start': 'unknown', 
        'stop': 'unknown',
        'group': group,
        'sample_id': sample_id,
        #'rpm': group['rotational speed [RPM]'].unique()[0],
        #'torque': group['torque [Nm]'].unique()[0],
    }
    train_vibration_measurement_periods.append(group)

# extract test vibration measurement periods
df_V_test_normalized = normalize_1(df_orders_test, BAND_COLS)
df_ = df_V_test_normalized
df_[['sample_id', 'unique_sample_id', 'direction']] = meta_data_test[['sample_id', 'unique_sample_id', 'direction']]
test_vibration_measurement_periods = []
test_vibration_measurement_periods_meta_data = []
n_index_errors = 0
for unique_sample_id, group in df_.groupby('unique_sample_id'):
    rpm = meta_data_test[meta_data_test['unique_sample_id'] == unique_sample_id]['rotational speed [RPM]'].unique()[0]
    torque = meta_data_test[meta_data_test['unique_sample_id'] == unique_sample_id]['torque [Nm]'].unique()[0]
    try:
        om = cluster_label_unique_name_mapping[
            (cluster_label_unique_name_mapping['rotational speed [RPM]'] == rpm) & 
            (cluster_label_unique_name_mapping['torque [Nm]'] == torque)
        ]['cluster_label_unique'].iloc[0]
    except IndexError:
        n_index_errors += 1
        om = -1
    measurement_period = {
        'start': 'unknown', 
        'stop': 'unknown',
        'group': group,
        'unique_sample_id': unique_sample_id,
        'rpm': rpm,
        'torque': torque,
        'unique_cluster_label': om
    }
    test_vibration_measurement_periods.append(group)
    test_vibration_measurement_periods_meta_data.append(measurement_period)

n_total = len(test_vibration_measurement_periods)
print(f'Total number of measurement periods: {n_total}')
print(f'Number of measurement periods with unknown RPM and/or torque: {n_index_errors}')

# extract df_W_offline and df_W_online
def extract_vibration_weights_per_measurement_period(measurement_periods, col_names, band_cols, normalization, model):
    Ws = []
    for period in tqdm(measurement_periods):
        assert len(period) == 3, 'should have exactly 3 directions per measurement period'
        band_column_names = period.columns[period.columns.str.contains('band_')]
        V = period.set_index(['direction'])[band_column_names]  # already normalized
        # dim(W) = 6 x 16
        W = model.nmf.transform(V)
        W = pd.DataFrame(W, columns=col_names)  # !!!
        Ws.append({
            # 'Sample_id': period.sample_id.unique()[0],
            'unique_sample_id': period.unique_sample_id.unique()[0],  # !!!
            'V_normalized': V,
            'W': W
        })
    return pd.DataFrame(Ws)

df_W_offline = extract_vibration_weights_per_measurement_period(train_vibration_measurement_periods, fingerprints[0].columns, BAND_COLS, normalize_1, model)
df_W_online = extract_vibration_weights_per_measurement_period(test_vibration_measurement_periods, fingerprints[0].columns, BAND_COLS, normalize_1, model)
df_W_online
"""
pass

Illustrate a derived weight matrix:

In [None]:
period = 10
df_W_online_ = df_W_online_folds[fold]

fig, ax = plt.subplots(figsize=(8, 4))
sns.heatmap(df_W_online_['W'][period], annot=True, fmt=".6f", ax=ax, cmap='Blues', vmin=0, vmax=0.05, cbar=False)
ax.set_title(f'Derived weights for measurements period {period}');
# set y tick labels to x, y and z
ax.set_yticklabels(['x', 'y', 'z'], rotation=0);
# set x  tick labels to component_0, component_1, etc.
ax.set_xticklabels([f'frequency_component_{x}' for x in range(N_COMPONENTS)], rotation=45);

Below we illustrate all measurement periods.
We start with a UMAP embedding of the vectorized measurement matrices.

In [None]:
# old code before cross-validation
"""
feature_space = pd.DataFrame(df_W_online['W'].apply(lambda df: df.stack().values).to_list())
X_umap = UMAP(random_state=42).fit_transform(X=feature_space.to_numpy())
df_umap = pd.DataFrame(data=X_umap, index=feature_space.index, columns=['umap_1', 'umap_2'])
df_umap['unique_sample_id'] = df_W_online['unique_sample_id']
df_info_ = pd.DataFrame(test_vibration_measurement_periods_meta_data)
# plot unique cluster name
# px.scatter(df_umap.reset_index(), x='umap_1', y='umap_2', color=df_info_['unique_cluster_label'].astype(str), hover_data=['index'], width=800, height=600)
# plot rpm
px.scatter(df_umap.reset_index(), x='umap_1', y='umap_2', color=df_info_['rpm'], hover_data=['index', 'unique_sample_id'], width=800, height=600)
"""
pass

## Distance to fingerprints

Calculate distances on test set:

> `df_W_train` != `df_W_offline` --> make sure that this is not a bug in the industrial dataset!
> `df_W_train` has only components per direction, should have components per measurement period
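
The distances in the next cell come from the `distance_metrics` module, whose implementation is not shown in this notebook. As a hedged illustration of what such metrics typically compute, the sketch below defines stand-in functions on flattened weight matrices. The function names with the `sketch_` prefix are hypothetical, and the module's own definitions may differ.

```python
import numpy as np


def sketch_cosine_distance(weights, fingerprint):
    """1 - cosine similarity of the two flattened matrices."""
    a = np.asarray(weights, dtype=float).ravel()
    b = np.asarray(fingerprint, dtype=float).ravel()
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def sketch_frobenius_distance(weights, fingerprint):
    """Frobenius norm of the element-wise difference."""
    diff = np.asarray(weights, dtype=float) - np.asarray(fingerprint, dtype=float)
    return np.linalg.norm(diff)


def sketch_manhattan_distance(weights, fingerprint):
    """Sum of absolute element-wise differences."""
    diff = np.asarray(weights, dtype=float) - np.asarray(fingerprint, dtype=float)
    return np.abs(diff).sum()


W = np.array([[1.0, 0.0], [0.0, 1.0]])
F_same = W.copy()
F_other = np.array([[0.0, 1.0], [1.0, 0.0]])

print(sketch_cosine_distance(W, F_same), sketch_cosine_distance(W, F_other))
print(sketch_frobenius_distance(W, F_other), sketch_manhattan_distance(W, F_other))
```

The cosine distance ignores the overall scale of the weights, whereas the Frobenius and Manhattan distances do not, so the metrics can rank the same measurement differently.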

In [None]:
# cell takes around 60 minutes to run (!) --> going to cache the results
SHOW_DISTANCES = False

def calculate_distances_per_measurement_period(measurement_period, fingerprints, verbose=False):
    # pointwise Mahalanobis distance
    fingerprint_matrix = np.array([fingerprints[om].to_numpy().flatten() for om in fingerprints])
    # calculate covariance matrix
    fingerprint_S = np.cov(fingerprint_matrix.T)
    # calculate inverse
    fingerprint_SI = np.linalg.inv(fingerprint_S)
    # calculate mu
    fingerprint_mu = fingerprint_matrix.mean(axis=0)
    df_dist_ = []
    for idx, row in tqdm(measurement_period.iterrows(), total=len(measurement_period), disable=not verbose):
        for om in fingerprints:
            weights = row['W']
            fingerprint = fingerprints[om]
            tmp = {
                'idx': idx,
                'data': row, 
                'om': om, 
                'frobenius_norm': distance_metrics.frobenius_norm(weights, fingerprint),
                'frobenius_norm_pow2': distance_metrics.frobenius_norm_v2(weights, fingerprint),
                'frobenius_norm_sqrt': distance_metrics.frobenius_norm_v3(weights, fingerprint),
                'cosine_distance': distance_metrics.cosine_distance(weights, fingerprint),
                'manhattan_distance': distance_metrics.manhattan_distance(weights, fingerprint),
            }
            df_dist_.append(tmp)
    df_dist_ = pd.DataFrame(df_dist_)
    df_dist_['frobenius_norm_minmax'] = MinMaxScaler().fit_transform(df_dist_['frobenius_norm'].to_numpy().reshape(-1, 1))
    return df_dist_


df_dist_offline_folds = []
df_dist_online_folds = []
for i, (df_W_offline_, df_W_online_, fingerprints_) in tqdm(enumerate(zip(df_W_offline_folds,
                                                                     df_W_online_folds,
                                                                     fingerprints_folds)),
                                                                     desc='Calculating distances per fold'):
    fpath_offline = os.path.join(CACHING_FOLDER_NAME, 'distance_folds', f'df_dist_offline_fold_{i}.pkl')
    fpath_online = os.path.join(CACHING_FOLDER_NAME, 'distance_folds', f'df_dist_online_fold_{i}.pkl')
    if LOAD_CACHED_RESULTS:
        # use context managers so that the cache files are closed after reading
        with open(fpath_offline, 'rb') as file:
            df_dist_offline_folds.append(pickle.load(file))
        with open(fpath_online, 'rb') as file:
            df_dist_online_folds.append(pickle.load(file))
    else:
        df_dist_offline_ = calculate_distances_per_measurement_period(df_W_offline_, fingerprints=fingerprints_)
        if CACHE_RESULTS:
            with open(fpath_offline, 'wb') as file:
                pickle.dump(df_dist_offline_, file)
        df_dist_online_ = calculate_distances_per_measurement_period(df_W_online_, fingerprints=fingerprints_)
        if CACHE_RESULTS:
            with open(fpath_online, 'wb') as file:
                pickle.dump(df_dist_online_, file)
        df_dist_offline_folds.append(df_dist_offline_)
        df_dist_online_folds.append(df_dist_online_)

if SHOW_DISTANCES:
    g = sns.displot(data=df_dist_offline_folds[fold], 
                    x="cosine_distance", col="om", col_wrap=3, height=2, aspect=4, bins=20, kind="hist")

> Observation: In contrast to the industrial use case, the cosine distance is spread out for almost every operating mode.

# Anomaly detection 

Because the measurements are not timestamped, it is not possible to order them in time.
Instead, we perform anomaly detection.
For each operating mode, we check the distance between the derived weights and the corresponding fingerprint.
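
As a small illustration of this lookup, the sketch below pivots long-format distances into one row per measurement period and then picks, for each period, the distance to the fingerprint of its own operating mode. The data and the names `df_dist`, `assigned_om`, and `own_distance` are made up for the example; an operating mode of `-1` marks an unknown mode, mirroring the convention used in this notebook.

```python
import numpy as np
import pandas as pd

# long-format distances: one row per (measurement period, operating mode)
df_dist = pd.DataFrame({
    'idx': [0, 0, 1, 1, 2, 2],
    'om': ['A', 'B', 'A', 'B', 'A', 'B'],
    'cosine_distance': [0.01, 0.30, 0.25, 0.02, 0.40, 0.50],
})
# operating mode assigned to each period; -1 marks an unknown mode
assigned_om = pd.Series(['A', 'B', -1], index=[0, 1, 2])

# wide format: one row per period, one column per operating mode
df_wide = df_dist.pivot(index='idx', columns='om', values='cosine_distance')

# distance to the fingerprint of the period's own operating mode (NaN if unknown)
own_distance = pd.Series(
    [df_wide.at[idx, om] if om != -1 else np.nan for idx, om in assigned_om.items()],
    index=assigned_om.index,
)
print(own_distance)
```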

In [None]:
df_dist_online_folds[fold].head()

In [None]:
# for each measurement period (row), get the distance to each operating mode (column)
df_cosine_folds = []
for df_dist_online_, test_vibration_measurement_periods_meta_data_ in zip(df_dist_online_folds, test_vibration_measurement_periods_meta_data_folds):
    df_cosine_ = df_dist_online_[['idx', 'om', 'cosine_distance']].pivot(index='idx', columns='om', values='cosine_distance')
    # assign the corresponding operating mode to the given row (if known), else, assign -1
    # unique cluster label is wrong!!! (might be correct)
    df_cosine_[['rpm', 'torque', 'unique_cluster_label']] = pd.DataFrame(test_vibration_measurement_periods_meta_data_)[['rpm', 'torque', 'unique_cluster_label']]
    df_cosine_folds.append(df_cosine_)

df_cosine_folds[fold].head()

As we already saw before, there are 658 measurements in the test set where the cluster label is known and 142 measurements where it is unknown. (Later we will implement a classifier to predict the cluster label for all measurements, add it as an additional column, and check whether the classification is correct.)

In [None]:
for i, (df_cosine_, df_W_online_) in enumerate(zip(df_cosine_folds, df_W_online_folds)):
    distance_to_own_cluster_center_ = []
    for idx, row in df_cosine_.iterrows():
        om = row['unique_cluster_label']
        if om != -1:
            distance_to_own_cluster_center_.append(row[om])
        else:
            distance_to_own_cluster_center_.append(np.nan)
    df_cosine_['distance_to_own_cluster_center'] = distance_to_own_cluster_center_
    df_cosine_['pitting'] = df_W_online_['unique_sample_id'].str.contains(f'pitting_level_')
    df_cosine_['pitting_level'] = df_W_online_['unique_sample_id'].str.extract(r'pitting_level_(\d)')
    df_cosine_['pitting_level'] = df_cosine_['pitting_level'].fillna(0).astype(int)
    df_cosine_folds[i] = df_cosine_

df_cosine_folds[fold].head()

Plot the distance to the own cluster center vs. the distance to other cluster centers:

In [None]:
df_cosine_ = df_cosine_folds[fold]

fig, axes = plt.subplots(figsize=(16, 8), ncols=2)

ax = df_cosine_['distance_to_own_cluster_center'].plot(kind='hist', bins=20, ax=axes[0], alpha=0.5, legend=False)
ax.set_title('Distance to own cluster centers')
ax.set_xlabel('Cosine distance')

# plot distance to other cluster centers
ax = df_cosine_.drop(columns=['rpm', 'torque', 'unique_cluster_label', 'distance_to_own_cluster_center', 'pitting', 'pitting_level']).melt()['value'].plot(kind='hist', bins=20, ax=axes[1], alpha=0.5, legend=False)
ax.set_title('Distance to other cluster centers')
ax.set_xlabel('Cosine distance')
ax.legend(loc='upper right', bbox_to_anchor=(1.2, 1))

Idea: Set a threshold based on the distances observed in the training set. If the distance to the own cluster center exceeds the threshold, flag the measurement as an anomaly.
> - TODO: if the OM is unknown, then anomaly --> drop unknown cluster labels
> - TODO: anomaly detection per operating mode group, not per operating mode
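
One possible way to derive such a threshold, sketched under stated assumptions: take a high quantile of the distances observed on healthy training data and flag test measurements above it. The quantile level of 0.95, the synthetic distances, and the choice to treat an unknown operating mode (NaN distance) as an anomaly are all illustrative assumptions, not decisions made in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative stand-ins for the distances to the own cluster center
train_distances = rng.uniform(0.0, 0.01, size=200)         # healthy training data
test_distances = np.array([0.002, 0.008, 0.05, np.nan])    # NaN = unknown operating mode

# threshold = high quantile of the healthy training distances (0.95 is an arbitrary choice)
threshold = np.quantile(train_distances, 0.95)

# comparisons with NaN evaluate to False, so unknown modes are flagged explicitly
is_anomaly = (test_distances > threshold) | np.isnan(test_distances)
print(threshold, is_anomaly)
```

Choosing the quantile fixes the expected false positive rate on healthy training data at roughly one minus the quantile level, which makes the threshold easier to reason about than a hand-picked constant.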

In [None]:
# There are many anomalies when there is pitting
# However, the cosine distance threshold is set manually to 0.01 (in a next step we need to set it based on the observed cosine distances in the training set)
anomaly_ = df_cosine_['distance_to_own_cluster_center'] > 0.01   # TODO: setting threshold to 0.01 as first test, later set threshold based on distance in training set
display(anomaly_.value_counts())
ax = sns.boxplot(data=df_cosine_, y='distance_to_own_cluster_center', x='pitting_level')
ax.set_title(f'Distance to own cluster center per pitting level (Fold {fold})')

## Creating ROC-curve
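
For orientation, the sketch below computes a ROC curve, its AUC, and the TPR at a bounded FPR with scikit-learn on toy data. It is not the notebook's `calculate_roc_characteristics` or `calc_tpr_at_fpr_threshold` from the `util` module, which may handle thresholds differently; the labels, scores, and the name `fpr_limit` are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# toy data: 1 = pitting, 0 = healthy; the score plays the role of the
# distance to the own cluster center (larger = more anomalous)
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, scores)
roc_auc = auc(fpr, tpr)

# best TPR that is reachable while the FPR stays at or below the limit
fpr_limit = 0.1
tpr_at_fpr = tpr[fpr <= fpr_limit].max()
print(f'AUC = {roc_auc:.3f}, TPR@FPR<={fpr_limit} = {tpr_at_fpr:.2f}')
```

Reporting the TPR at a low FPR complements the AUC, because in a monitoring setting only the low false alarm region of the curve is usually of practical interest.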

In [None]:
# creating single ROC-curve
# --> now in util module
"""
def calculate_roc_characteristics(df_):
    df_ = df_.sort_values(by='distance_to_own_cluster_center', ascending=True)

    # Initialize variables to store ROC curve values
    fpr = []
    tpr = []

    for threshold in df_['distance_to_own_cluster_center']:
        df_['predicted_anomaly'] = df_['distance_to_own_cluster_center'] >= threshold

        # Calculate True Positive Rate (TPR) and False Positive Rate (FPR)
        true_positives = df_[(df_['pitting'] == 1) & (df_['predicted_anomaly'] == 1)].shape[0]
        false_positives = df_[(df_['pitting'] == 0) & (df_['predicted_anomaly'] == 1)].shape[0]
        true_negatives = df_[(df_['pitting'] == 0) & (df_['predicted_anomaly'] == 0)].shape[0]
        false_negatives = df_[(df_['pitting'] == 1) & (df_['predicted_anomaly'] == 0)].shape[0]

        tpr.append(true_positives / (true_positives + false_negatives))
        fpr.append(false_positives / (false_positives + true_negatives))

    # Calculate the area under the ROC curve (AUC)
    roc_auc = auc(fpr, tpr)

    return fpr, tpr, roc_auc
"""


"""
fig, ax = plt.subplots(figsize=(8, 6))

# plot individual ROC curves
linestyles = ['-', '--', ':', '-', '--', ':']
for lvl, style in zip(pitting_levels, linestyles):
    df_ = df_cosine[(~df_cosine['pitting']) | (df_cosine['pitting_level'] == lvl)]
    fpr, tpr, roc_auc = calculate_roc_characteristics(df_)
    ax.plot(fpr, tpr, lw=1, linestyle=style, alpha=0.66, label=f'level {lvl} (area = {roc_auc:.3f})')
    ax.set_xlim(0.0, 1.0)
    ax.set_ylim(0.0, 1.05)

# Plot the general ROC curve
fpr, tpr, roc_auc = calculate_roc_characteristics(df_cosine)
ax.plot(fpr, tpr, color='blue', lw=4, label=f'overall (area = {roc_auc:.3f})', alpha=0.66)
ax.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
ax.set_xlim(0.0, 1.0)
ax.set_ylim(0.0, 1.05)
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title(f'Receiver Operating Characteristic (ROC) Curve')

ax.legend(loc='lower right', title='Pitting severity level');
"""
pass

In [None]:
PLOT = False 

if PLOT:
    for fold in range(N):
        # why does the individual ROC curve not go until FP = 1 (?)
        df_cosine_ = df_cosine_folds[fold]
        df_cosine_ = df_cosine_[df_cosine_.unique_cluster_label != -1]  # QUICK FIX !!! : removed unknown cluster labels

        fig, ax = plt.subplots(figsize=(8, 6))

        # plot individual ROC curves
        linestyles = ['-', '--', ':', '-', '--', ':']
        for lvl, style in zip(pitting_levels, linestyles):
            df_ = df_cosine_[(~df_cosine_['pitting']) | (df_cosine_['pitting_level'] == lvl)]
            fpr, tpr, roc_auc = calculate_roc_characteristics(df_)
            ax.plot(fpr, tpr, lw=1, linestyle=style, alpha=0.66, label=f'level {lvl} (area = {roc_auc:.3f})')
            ax.set_xlim(0.0, 1.0)
            ax.set_ylim(0.0, 1.05)

        # Plot the general ROC curve
        fpr, tpr, roc_auc = calculate_roc_characteristics(df_cosine_)
        ax.plot(fpr, tpr, color='blue', lw=4, label=f'overall (area = {roc_auc:.3f})', alpha=0.66)
        ax.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='baseline')
        ax.set_xlim(0.0, 1.0)
        ax.set_ylim(0.0, 1.05)
        ax.set_xlabel('False Positive Rate')
        ax.set_ylabel('True Positive Rate')
        ax.set_title(f'ROC Curve (trial {fold})')
        n_total = len(df_cosine_)
        n_healthy = len(df_cosine_[df_cosine_['pitting'] == False])
        n_unhealthy = len(df_cosine_[df_cosine_['pitting'] == True])
        text = f"n={n_total} ({n_healthy} healthy, {n_unhealthy} unhealthy)"
        ax.annotate(xy=(0.1, 0.025), text=text)
        ax.legend(loc='lower right', title='Pitting severity level');

In [None]:
fold = 0
threshold = 0.1

# --> now in module
"""
def calc_tpr_at_fpr_threshold(tpr, fpr, threshold=0.1):
    # sort tpr and fpr such that they are in ascending order
    if (fpr[0] > fpr[-1]) or (tpr[0] > tpr[-1]):
        assert(tpr[0] > tpr[-1] and fpr[0] > fpr[-1])
        tpr = list(reversed(tpr))
        fpr = list(reversed(fpr))
    try:
        idx = next(i for i, value in enumerate(fpr) if value > threshold)
    except StopIteration:
        idx = 0
    tpr_at_fpr = tpr[idx]
    return tpr_at_fpr
"""

# why does the individual ROC curve not go until FP = 1 (?)
df_cosine_ = df_cosine_folds[fold]
df_cosine_ = df_cosine_[df_cosine_.unique_cluster_label != -1]  # QUICK FIX !!! : removed unknown cluster labels

fig, ax = plt.subplots(figsize=(8, 6))

# Plot the general ROC curve
fpr, tpr, roc_auc = calculate_roc_characteristics(df_cosine_)
tpr_at_fpr = calc_tpr_at_fpr_threshold(tpr, fpr, threshold=threshold)
ax.plot(fpr, tpr, color='blue', lw=4, label=f'overall (area = {roc_auc:.3f})', alpha=0.66)
ax.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='baseline')
ax.plot([0, threshold], [tpr_at_fpr, tpr_at_fpr], color='red', lw=2, linestyle='--', label=f'TPR@FPR={threshold:.2f} = {tpr_at_fpr:.2f}')
ax.plot([threshold, threshold], [0, tpr_at_fpr], color='red', lw=2, linestyle='--')
ax.set_xlim(0.0, 1.0)
ax.set_ylim(0.0, 1.05)
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title(f'ROC Curve (trial {fold})')
n_total = len(df_cosine_)
n_healthy = len(df_cosine_[df_cosine_['pitting'] == False])
n_unhealthy = len(df_cosine_[df_cosine_['pitting'] == True])
text = f"n={n_total} ({n_healthy} healthy, {n_unhealthy} unhealthy)"
ax.annotate(xy=(0.1, 0.025), text=text)

ax.legend(loc='lower right', title='Pitting severity level');
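
The TPR at a fixed FPR can also be read off the ROC curve by linear interpolation. The block below is a minimal sketch under that assumption. It is not the implementation in `conscious_engie_icare.util`, which (per the reference copy above) picks the first ROC point above the threshold. The function name `tpr_at_fpr` and the toy ROC points are hypothetical.

```python
import numpy as np

def tpr_at_fpr(fpr, tpr, fpr_threshold=0.1):
    """Linearly interpolate the ROC curve at a fixed false positive rate."""
    fpr = np.asarray(fpr, dtype=float)
    tpr = np.asarray(tpr, dtype=float)
    # np.interp expects increasing x values: sort by FPR first, then by TPR
    order = np.lexsort((tpr, fpr))
    return float(np.interp(fpr_threshold, fpr[order], tpr[order]))

# toy ROC curve with four operating points (illustrative values only)
fpr = [0.0, 0.05, 0.2, 1.0]
tpr = [0.0, 0.60, 0.9, 1.0]
value = tpr_at_fpr(fpr, tpr, fpr_threshold=0.1)
print(f"TPR@FPR=0.10: {value:.2f}")
```

Interpolation yields a value between two neighbouring operating points, whereas picking the first point above the threshold reports the TPR of an operating point whose FPR already exceeds the threshold. Which convention is preferable depends on how the number is reported.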

In [None]:
# --> now in module
"""
def calc_fpr_at_tpr_threshold(tpr, fpr, threshold=0.1):
    return calc_tpr_at_fpr_threshold(tpr=fpr, fpr=tpr, threshold=threshold)
"""

fold = 0
df_cosine_ = df_cosine_folds[fold]
df_cosine_ = df_cosine_[df_cosine_.unique_cluster_label != -1]  # QUICK FIX !!! : removed unknown cluster labels

fig, ax = plt.subplots(figsize=(8, 6))

threshold = 0.90

# Plot the general ROC curve
fpr, tpr, roc_auc = calculate_roc_characteristics(df_cosine_)
fpr_at_tpr = calc_fpr_at_tpr_threshold(tpr, fpr, threshold=threshold)
ax.plot(fpr, tpr, color='blue', lw=4, label=f'overall (area = {roc_auc:.3f})', alpha=0.66)
ax.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='baseline')
ax.plot([fpr_at_tpr, fpr_at_tpr], [0, threshold], color='green', lw=2, linestyle='--', label=f'FPR@TPR={threshold:.2f} = {fpr_at_tpr:.2f}')
ax.plot([0, fpr_at_tpr], [threshold, threshold], color='green', lw=2, linestyle='--')
ax.set_xlim(0.0, 1.0)
ax.set_ylim(0.0, 1.05)
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title(f'ROC Curve (trial {fold})')
n_total = len(df_cosine_)
n_healthy = len(df_cosine_[df_cosine_['pitting'] == False])
n_unhealthy = len(df_cosine_[df_cosine_['pitting'] == True])
text = f"n={n_total} ({n_healthy} healthy, {n_unhealthy} unhealthy)"
ax.annotate(xy=(0.1, 0.025), text=text)

ax.legend(loc='lower right', title='Pitting severity level');

In [None]:
trials = list(range(N))

fig, ax = plt.subplots(figsize=(8, 6))
threshold = 0.90

ax.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='baseline')
#text = f"n={n_total} ({n_healthy} healthy, {n_unhealthy} unhealthy)"
#ax.annotate(xy=(0.1, 0.025), text=text)
#ax.legend(loc='lower right', title='Pitting severity level');

for fold in tqdm(trials):
    df_cosine_ = df_cosine_folds[fold]
    df_cosine_ = df_cosine_[df_cosine_.unique_cluster_label != -1]  # QUICK FIX !!! : removed unknown cluster labels

    # Plot the general ROC curve
    fpr, tpr, roc_auc = calculate_roc_characteristics(df_cosine_)
    fpr_at_tpr = calc_fpr_at_tpr_threshold(tpr, fpr, threshold=threshold)
    ax.plot(fpr, tpr, color='blue', lw=4, label=f'overall (area = {roc_auc:.3f})', alpha=0.05)
    ax.plot([fpr_at_tpr, fpr_at_tpr], [0, threshold], color='green', lw=2, linestyle='--', label=f'FPR@TPR={threshold:.2f} = {fpr_at_tpr:.2f}', alpha=0.05)
    ax.plot([0, fpr_at_tpr], [threshold, threshold], color='green', lw=2, linestyle='--', alpha=0.05)
    #ax.set_xlim(0.0, 1.0)
    ax.set_xlim(-0.01, 0.5)
    #ax.set_ylim(0.0, 1.05)
    ax.set_ylim(0.5, 1.01)
    ax.set_xlabel('False Positive Rate')
    ax.set_ylabel('True Positive Rate')
    ax.set_title(f'ROC Curve (all trials)')
    n_total = len(df_cosine_)
    n_healthy = len(df_cosine_[df_cosine_['pitting'] == False])
    n_unhealthy = len(df_cosine_[df_cosine_['pitting'] == True])

In [None]:
trials = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 20]

fig, ax = plt.subplots(figsize=(4, 3))
threshold = 0.90

ax.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='baseline')
#text = f"n={n_total} ({n_healthy} healthy, {n_unhealthy} unhealthy)"
#ax.annotate(xy=(0.1, 0.025), text=text)
#ax.legend(loc='lower right', title='Pitting severity level');

for fold in tqdm(trials):
    df_cosine_ = df_cosine_folds[fold]
    df_cosine_ = df_cosine_[df_cosine_.unique_cluster_label != -1]  # QUICK FIX !!! : removed unknown cluster labels

    # Plot the general ROC curve
    fpr, tpr, roc_auc = calculate_roc_characteristics(df_cosine_)
    fpr_at_tpr = calc_fpr_at_tpr_threshold(tpr, fpr, threshold=threshold)
    ax.plot(fpr, tpr, color=None, lw=4, label=f'overall (area = {roc_auc:.3f})', alpha=0.2)
    ax.plot([fpr_at_tpr, fpr_at_tpr], [0, threshold], color='green', lw=2, linestyle='--', label=f'FPR@TPR={threshold:.2f} = {fpr_at_tpr:.2f}', alpha=0.2)
    ax.plot([0, fpr_at_tpr], [threshold, threshold], color='green', lw=2, linestyle='--', alpha=0.2)
    ax.set_xlim(-0.025, 1.0)
    #ax.set_xlim(-0.01, 0.5)
    ax.set_ylim(0.0, 1.025)
    #ax.set_ylim(0.5, 1.01)
    ax.set_xlabel('False Positive Rate [FPR]')
    ax.set_ylabel('True Positive Rate [TPR]')
    ax.set_title(f'ROC Curves')
    n_total = len(df_cosine_)
    n_healthy = len(df_cosine_[df_cosine_['pitting'] == False])
    n_unhealthy = len(df_cosine_[df_cosine_['pitting'] == True])

In [None]:
import matplotlib.patches as patches

trials = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
zoom = {'x': (-0.01, 0.2), 'y': (0.8, 1.01)}
cmap = plt.get_cmap('tab20').colors

fig, axes = plt.subplots(figsize=(7, 3), ncols=2)
threshold = 0.90

#text = f"n={n_total} ({n_healthy} healthy, {n_unhealthy} unhealthy)"
#ax.annotate(xy=(0.1, 0.025), text=text)
#ax.legend(loc='lower right', title='Pitting severity level');

ax = axes[0]
ax.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='baseline')
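# Rectangle takes (x, y), width, height: derive width/height from the zoom limits
rect = patches.Rectangle((zoom['x'][0], zoom['y'][0]), zoom['x'][1] - zoom['x'][0], zoom['y'][1] - zoom['y'][0], linewidth=5, edgecolor='r', facecolor='none', label='zoomed area')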
ax.add_patch(rect)
for i, fold in enumerate(tqdm(trials)):
    df_cosine_ = df_cosine_folds[fold]
    df_cosine_ = df_cosine_[df_cosine_.unique_cluster_label != -1]  # QUICK FIX !!! : removed unknown cluster labels

    # Plot the general ROC curve
    fpr, tpr, roc_auc = calculate_roc_characteristics(df_cosine_)
    fpr_at_tpr = calc_fpr_at_tpr_threshold(tpr, fpr, threshold=threshold)
    label=f'trial {fold}'
    ax.plot(fpr, tpr, color=cmap[i], lw=4, label=label, alpha=0.2)
    #label=f'FPR@TPR={threshold:.2f} = {fpr_at_tpr:.2f}'
    ax.plot([fpr_at_tpr, fpr_at_tpr], [0, threshold], color=cmap[i], lw=2, linestyle='dotted', alpha=0.2)
    ax.plot([0, fpr_at_tpr], [threshold, threshold], color=cmap[i], lw=2, linestyle='dotted', alpha=0.2)
    ax.set_xlim(-0.05, 1.0)
    #ax.set_xlim(-0.01, 0.5)
    ax.set_ylim(0.0, 1.05)
    #ax.set_ylim(0.5, 1.01)
    ax.set_xlabel('False Positive Rate [FPR]')
    ax.set_ylabel('True Positive Rate [TPR]')
    ax.set_title(f'ROC Curves')
    n_total = len(df_cosine_)
    n_healthy = len(df_cosine_[df_cosine_['pitting'] == False])
    n_unhealthy = len(df_cosine_[df_cosine_['pitting'] == True])
    # create rectangle for zoomed in plot

ax = axes[1]
# ax.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='baseline')
for i, fold in enumerate(tqdm(trials)):
    df_cosine_ = df_cosine_folds[fold]
    df_cosine_ = df_cosine_[df_cosine_.unique_cluster_label != -1]  # QUICK FIX !!! : removed unknown cluster labels

    # Plot the general ROC curve
    fpr, tpr, roc_auc = calculate_roc_characteristics(df_cosine_)
    fpr_at_tpr = calc_fpr_at_tpr_threshold(tpr, fpr, threshold=threshold)
    ax.plot(fpr, tpr, color=cmap[i], lw=4, label=None, alpha=0.25)
    ax.plot([fpr_at_tpr, fpr_at_tpr], [0, threshold], color=cmap[i], lw=2, linestyle='dotted', label=None, alpha=0.25)
    ax.plot([0, fpr_at_tpr], [threshold, threshold], color=cmap[i], lw=2, linestyle='dotted', alpha=0.25)
    ax.set_xlim(zoom['x'])
    ax.set_ylim(zoom['y'])
    ax.set_xlabel('False Positive Rate [FPR]')
    ax.set_ylabel('True Positive Rate [TPR]')
    ax.set_title(f'ROC Curves (zoomed in)')
    n_total = len(df_cosine_)
    n_healthy = len(df_cosine_[df_cosine_['pitting'] == False])
    n_unhealthy = len(df_cosine_[df_cosine_['pitting'] == True])

fig.legend(ncol=6, fontsize=9.5, loc='lower center', bbox_to_anchor=(0.5, -0.15))
fig.tight_layout()
#fig.savefig(os.path.join('figs', 'roc_curves.pdf'), bbox_inches='tight')

# Baseline: Isolation Forest

Isolation Forest is an unsupervised anomaly detection algorithm that isolates instances by recursively partitioning the data with random splits. Anomalies are identified by how easily a data point can be separated from the rest of the dataset (fewer splits are needed to isolate an outlier), which makes the method effective for detecting outliers in large and complex datasets.

- The **contamination** parameter (the expected proportion of outliers in the dataset) should be set to 0, since the training set contains no anomalies.
- Provides a scoring function (`score_samples`), which is used below to build the ROC curve.

Variables:
- $X = V$
- train_vibration_measurement_periods_folds = $X_{train}$

Preliminary results:
- Low performance (AUC $\approx$ 0.6)
- Reducing the feature space (with PCA) barely improves performance

Notes: 
- More extensive hyperparameter tuning?
- Unfair comparison, as process parameters are not taken into account. Train model per operating mode? --> issue with feature space size
- Train per sensor instead?
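
As a minimal sketch of the evaluation idea (fit on healthy data only, then rank test samples by anomaly score), the block below uses synthetic Gaussian data instead of the vibration features. The data, the variable names and the choice to leave `contamination` at its default are assumptions for illustration. In scikit-learn, `contamination` only sets the threshold used by `predict`; the ranking returned by `score_samples`, and therefore the ROC curve, does not depend on it.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# synthetic stand-ins: healthy samples around 0, faulty samples shifted away
X_healthy_train = rng.normal(0.0, 1.0, size=(500, 5))
X_healthy_test = rng.normal(0.0, 1.0, size=(200, 5))
X_faulty_test = rng.normal(4.0, 1.0, size=(200, 5))

X_test = np.vstack([X_healthy_test, X_faulty_test])
is_faulty = np.r_[np.zeros(200, dtype=int), np.ones(200, dtype=int)]

# fit on healthy data only
clf = IsolationForest(random_state=0).fit(X_healthy_train)

# score_samples: the lower, the more abnormal -> negate to get an anomaly score
anomaly_score = -clf.score_samples(X_test)

# AUC with the faulty class as the positive class
roc_auc = roc_auc_score(is_faulty, anomaly_score)
print(f"AUC: {roc_auc:.3f}")
```

Negating the score and labelling faulty samples as 1 makes the faulty class the positive class, so TPR and FPR on the resulting ROC curve refer to detected faults and false alarms respectively.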

## Single trial

In [None]:
from sklearn.ensemble import IsolationForest

exemplary_fold = 0
binned_vibrations_train = train_vibration_measurement_periods_folds[exemplary_fold]
binned_vibrations_test = test_vibration_measurement_periods_folds[exemplary_fold]
df_cosine_test = df_cosine_folds[exemplary_fold]

#clf = IsolationForest(contamination=0)
#clf.fit(X)
#y_pred_train = clf.predict(X) 

# frequency band columns are all columns that contain the string 'band_'
# --> there are 50 frequency bands per sensor
exemplary_vibration_column_names = binned_vibrations_train[0].columns
frequency_band_column_names = exemplary_vibration_column_names[exemplary_vibration_column_names.str.contains('band_')]

# create a matrix representation of the feature space for the train set,
# where the three different sensors are stacked to a single vector with 150 features
# (50 frequency bands per sensor)
# --> this is the feature space for the clustering algorithm
flatten_df = lambda df_: df_.to_numpy().flatten()
X_train = np.array([flatten_df(individual_measurements[frequency_band_column_names]) for individual_measurements in binned_vibrations_train])
print(f'Shape of X_train: {X_train.shape}')

# Fit Isolation Forest (without preprocessing the data)
clf = IsolationForest(contamination=0)
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
print(f'Number of outliers detected in training set: {np.sum(y_pred_train == -1)} (should be 0)')

# create a matrix representation for the test set
X_test = np.array([flatten_df(individual_measurements[frequency_band_column_names]) for individual_measurements in binned_vibrations_test])
print(f'Shape of X_test: {X_test.shape}')

# predict outliers in test set
y_test = df_cosine_test['pitting'].replace({True: -1, False: 1}).to_numpy()
y_pred_test = clf.predict(X_test)
y_score_test = clf.score_samples(X_test)  # the lower, the more abnormal
print(f'Number of outliers detected in test set with default parameters: {np.sum(y_pred_test == -1)} (should be {np.sum(y_test == -1)})')

# create ROC curve for test set
fpr, tpr, thresholds = roc_curve(y_true=y_test, y_score=y_score_test)
roc_auc = auc(fpr, tpr)
print(f'AUC: {roc_auc:.3f}')

# plot ROC curve
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(fpr, tpr, color='blue', lw=4, label=f'overall (area = {roc_auc:.3f})', alpha=0.66)
ax.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='baseline')
ax.set_xlim(0.0, 1.0)
ax.set_ylim(0.0, 1.05)
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title(f'ROC Curve (trial {exemplary_fold})')
n_total = len(df_cosine_test)
n_healthy = len(df_cosine_test[df_cosine_test['pitting'] == False])
n_unhealthy = len(df_cosine_test[df_cosine_test['pitting'] == True])
text = f"n={n_total} ({n_healthy} healthy, {n_unhealthy} unhealthy), AUC={round(roc_auc, 3)}"
ax.annotate(xy=(0.1, 0.025), text=text);
fig.suptitle(f'Isolation Forest', color='blue', size=16);

## Single trial (+ PCA)

--> The size of the feature space does not seem to be the issue: reducing it with PCA barely changes the result.

In [None]:
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import Pipeline

exemplary_fold = 4
binned_vibrations_train = train_vibration_measurement_periods_folds[exemplary_fold]
binned_vibrations_test = test_vibration_measurement_periods_folds[exemplary_fold]
df_cosine_test = df_cosine_folds[exemplary_fold]

#clf = IsolationForest(contamination=0)
#clf.fit(X)
#y_pred_train = clf.predict(X) 

# frequency band columns are all columns that contain the string 'band_'
# --> there are 50 frequency bands per sensor
exemplary_vibration_column_names = binned_vibrations_train[0].columns
frequency_band_column_names = exemplary_vibration_column_names[exemplary_vibration_column_names.str.contains('band_')]

# create a matrix representation of the feature space for the train set,
# where the three different sensors are stacked to a single vector with 150 features
# (50 frequency bands per sensor)
# --> this is the feature space for the clustering algorithm
flatten_df = lambda df_: df_.to_numpy().flatten()
X_train = np.array([flatten_df(individual_measurements[frequency_band_column_names]) for individual_measurements in binned_vibrations_train])
print(f'Shape of X_train: {X_train.shape}')

# Fit PCA (10 components) followed by Isolation Forest
clf = Pipeline([('pca', PCA(n_components=10)), ('clf', IsolationForest(contamination=0))])
clf = clf.fit(X_train)
y_pred_train = clf.predict(X_train)
print(f'Number of outliers detected in training set: {np.sum(y_pred_train == -1)} (should be 0)')

# create a matrix representation for the test set
X_test = np.array([flatten_df(individual_measurements[frequency_band_column_names]) for individual_measurements in binned_vibrations_test])
print(f'Shape of X_test: {X_test.shape}')

# predict outliers in test set
y_test = df_cosine_test['pitting'].replace({True: -1, False: 1}).to_numpy()
y_pred_test = clf.predict(X_test)
y_score_test = clf.score_samples(X_test)  # the lower, the more abnormal
print(f'Number of outliers detected in test set with default parameters: {np.sum(y_pred_test == -1)} (should be {np.sum(y_test == -1)})')

# create ROC curve for test set
fpr, tpr, thresholds = roc_curve(y_true=y_test, y_score=y_score_test)
roc_auc = auc(fpr, tpr)
print(f'AUC: {roc_auc:.3f}')

# plot ROC curve
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(fpr, tpr, color='blue', lw=4, label=f'overall (area = {roc_auc:.3f})', alpha=0.66)
ax.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='baseline')
ax.set_xlim(0.0, 1.0)
ax.set_ylim(0.0, 1.05)
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title(f'ROC Curve (trial {exemplary_fold})')
n_total = len(df_cosine_test)
n_healthy = len(df_cosine_test[df_cosine_test['pitting'] == False])
n_unhealthy = len(df_cosine_test[df_cosine_test['pitting'] == True])
text = f"n={n_total} ({n_healthy} healthy, {n_unhealthy} unhealthy), AUC={round(roc_auc, 3)}"
ax.annotate(xy=(0.1, 0.025), text=text);
fig.suptitle(f'Isolation Forest + PCA (n_components=10)', color='blue', size=16);

## Single trial on NMF weights

In [None]:
from sklearn.ensemble import IsolationForest

exemplary_fold = 4
W_train = df_W_offline_folds[exemplary_fold]['W']
W_test = df_W_online_folds[exemplary_fold]['W']
df_cosine_test = df_cosine_folds[exemplary_fold]

#clf = IsolationForest(contamination=0)
#clf.fit(X)
#y_pred_train = clf.predict(X) 

# create a matrix representation of the feature space for the train set
# by flattening the NMF weights W of each measurement into a single vector
flatten_df = lambda df_: df_.to_numpy().flatten()
X_train = np.array([flatten_df(individual_measurements) for individual_measurements in W_train])
print(f'Shape of X_train: {X_train.shape}')

# Fit Isolation Forest (without preprocessing the data)
clf = IsolationForest(contamination=0)
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
print(f'Number of outliers detected in training set: {np.sum(y_pred_train == -1)} (should be 0)')

# create a matrix representation for the test set
X_test = np.array([flatten_df(individual_measurements) for individual_measurements in W_test])
print(f'Shape of X_test: {X_test.shape}')

# predict outliers in test set
y_test = df_cosine_test['pitting'].replace({True: -1, False: 1}).to_numpy()
y_pred_test = clf.predict(X_test)
y_score_test = clf.score_samples(X_test)  # the lower, the more abnormal
print(f'Number of outliers detected in test set with default parameters: {np.sum(y_pred_test == -1)} (should be {np.sum(y_test == -1)})')

# create ROC curve for test set
fpr, tpr, thresholds = roc_curve(y_true=y_test, y_score=y_score_test)
roc_auc = auc(fpr, tpr)
print(f'AUC: {roc_auc:.3f}')

# plot ROC curve
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(fpr, tpr, color='blue', lw=4, label=f'overall (area = {roc_auc:.3f})', alpha=0.66)
ax.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='baseline')
ax.set_xlim(0.0, 1.0)
ax.set_ylim(0.0, 1.05)
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title(f'ROC Curve (trial {exemplary_fold})')
n_total = len(df_cosine_test)
n_healthy = len(df_cosine_test[df_cosine_test['pitting'] == False])
n_unhealthy = len(df_cosine_test[df_cosine_test['pitting'] == True])
text = f"n={n_total} ({n_healthy} healthy, {n_unhealthy} unhealthy), AUC={round(roc_auc, 3)}"
ax.annotate(xy=(0.1, 0.025), text=text);
fig.suptitle(f'Isolation Forest + NMF', color='blue', size=16);

## Isolation Forest on Vibration + Process Parameters (speed, torque)

> TODO: Why does `df_cosine_folds` not have the same dimensionality as the `df_W` folds?

In [None]:
from sklearn.ensemble import IsolationForest

exemplary_fold = 4
W_train = df_W_offline_folds[exemplary_fold]['W']
# process parameters are parsed from the sample id and cast to float,
# so that stacking them onto W yields a purely numeric feature matrix
meta_data_train = pd.DataFrame({
    'rpm': df_W_offline_folds[exemplary_fold]['unique_sample_id'].str.extract(r'^(\d+)_')[0].astype(float),
    'torque': df_W_offline_folds[exemplary_fold]['unique_sample_id'].str.extract(r'_(\d+)_')[0].astype(float),
})
W_test = df_W_online_folds[exemplary_fold]['W']
df_cosine_test = df_cosine_folds[exemplary_fold]
meta_data_test = pd.DataFrame({
    'rpm': df_W_online_folds[exemplary_fold]['unique_sample_id'].str.extract(r'^(\d+)_')[0].astype(float),
    'torque': df_W_online_folds[exemplary_fold]['unique_sample_id'].str.extract(r'_(\d+)_')[0].astype(float),
})

#clf = IsolationForest(contamination=0)
#clf.fit(X)
#y_pred_train = clf.predict(X) 

# create a matrix representation of the feature space for the train set
# by flattening the NMF weights W of each measurement into a single vector;
# the process parameters (rpm, torque) are appended as two extra features
flatten_df = lambda df_: df_.to_numpy().flatten()
X_train = np.array([flatten_df(individual_measurements) for individual_measurements in W_train])
X_train = np.hstack((X_train, meta_data_train.to_numpy()))
print(f'Shape of X_train: {X_train.shape}')

# Fit Isolation Forest (without preprocessing the data)
clf = IsolationForest(contamination=0)
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
print(f'Number of outliers detected in training set: {np.sum(y_pred_train == -1)} (should be 0)')

# create a matrix representation for the test set
X_test = np.array([flatten_df(individual_measurements) for individual_measurements in W_test])
X_test = np.hstack((X_test, meta_data_test.to_numpy()))
print(f'Shape of X_test: {X_test.shape}')

# predict outliers in test set
y_test = df_cosine_test['pitting'].replace({True: -1, False: 1}).to_numpy()
y_pred_test = clf.predict(X_test)
y_score_test = clf.score_samples(X_test)  # the lower, the more abnormal
print(f'Number of outliers detected in test set with default parameters: {np.sum(y_pred_test == -1)} (should be {np.sum(y_test == -1)})')

# create ROC curve for test set
fpr, tpr, thresholds = roc_curve(y_true=y_test, y_score=y_score_test)
roc_auc = auc(fpr, tpr)
print(f'AUC: {roc_auc:.3f}')

# plot ROC curve
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(fpr, tpr, color='blue', lw=4, label=f'overall (area = {roc_auc:.3f})', alpha=0.66)
ax.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='baseline')
ax.set_xlim(0.0, 1.0)
ax.set_ylim(0.0, 1.05)
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title(f'ROC Curve (trial {exemplary_fold})')
n_total = len(df_cosine_test)
n_healthy = len(df_cosine_test[df_cosine_test['pitting'] == False])
n_unhealthy = len(df_cosine_test[df_cosine_test['pitting'] == True])
text = f"n={n_total} ({n_healthy} healthy, {n_unhealthy} unhealthy), AUC={round(roc_auc, 3)}"
ax.annotate(xy=(0.1, 0.025), text=text);
fig.suptitle(f'Isolation Forest + NMF + MetaData', color='blue', size=16);
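
The process parameters can be parsed from the sample id in a single pass with named groups. This is a minimal sketch that assumes ids of the form `<rpm>_<torque>_<...>`, which is what the two regular expressions above imply. The example ids are hypothetical.

```python
import pandas as pd

# hypothetical sample ids following the assumed "<rpm>_<torque>_<...>" pattern
sample_ids = pd.Series(["100_50_1", "200_100_2", "3600_500_3"], name="unique_sample_id")

# named groups become column names; cast to float for use as model features
meta = sample_ids.str.extract(r"^(?P<rpm>\d+)_(?P<torque>\d+)_").astype(float)
print(meta)
```

Casting to float keeps the stacked feature matrix numeric. Since rpm and torque live on a different scale than the NMF weights, rescaling them may matter for distance-based models such as the 1-class SVM; tree-based models such as Isolation Forest are less sensitive to feature scale.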

# Baseline: 1-class SVM

In [None]:
from sklearn.svm import OneClassSVM

exemplary_fold = 1
binned_vibrations_train = train_vibration_measurement_periods_folds[exemplary_fold]
binned_vibrations_test = test_vibration_measurement_periods_folds[exemplary_fold]
df_cosine_test = df_cosine_folds[exemplary_fold]

#clf = IsolationForest(contamination=0)
#clf.fit(X)
#y_pred_train = clf.predict(X) 

# frequency band columns are all columns that contain the string 'band_'
# --> there are 50 frequency bands per sensor
exemplary_vibration_column_names = binned_vibrations_train[0].columns
frequency_band_column_names = exemplary_vibration_column_names[exemplary_vibration_column_names.str.contains('band_')]

# create a matrix representation of the feature space for the train set,
# where the three different sensors are stacked to a single vector with 150 features
# (50 frequency bands per sensor)
# --> this is the feature space for the clustering algorithm
flatten_df = lambda df_: df_.to_numpy().flatten()
X_train = np.array([flatten_df(individual_measurements[frequency_band_column_names]) for individual_measurements in binned_vibrations_train])
print(f'Shape of X_train: {X_train.shape}')

# Fit 1-class SVM (without preprocessing the data)
clf = OneClassSVM(kernel='linear', gamma='auto')
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
print(f'Number of outliers detected in training set: {np.sum(y_pred_train == -1)} (should be 0)')

# create a matrix representation for the test set
X_test = np.array([flatten_df(individual_measurements[frequency_band_column_names]) for individual_measurements in binned_vibrations_test])
print(f'Shape of X_test: {X_test.shape}')

# predict outliers in test set
y_test = df_cosine_test['pitting'].replace({True: -1, False: 1}).to_numpy()
y_pred_test = clf.predict(X_test)
y_score_test = clf.score_samples(X_test)  # the lower, the more abnormal
print(f'Number of outliers detected in test set with default parameters: {np.sum(y_pred_test == -1)} (should be {np.sum(y_test == -1)})')

# create ROC curve for test set
fpr, tpr, thresholds = roc_curve(y_true=y_test, y_score=y_score_test)
roc_auc = auc(fpr, tpr)
print(f'AUC: {roc_auc:.3f}')

# plot ROC curve
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(fpr, tpr, color='blue', lw=4, label=f'overall (area = {roc_auc:.3f})', alpha=0.66)
ax.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='baseline')
ax.set_xlim(0.0, 1.0)
ax.set_ylim(0.0, 1.05)
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title(f'ROC Curve (trial {exemplary_fold})')
n_total = len(df_cosine_test)
n_healthy = len(df_cosine_test[df_cosine_test['pitting'] == False])
n_unhealthy = len(df_cosine_test[df_cosine_test['pitting'] == True])
text = f"n={n_total} ({n_healthy} healthy, {n_unhealthy} unhealthy), AUC={round(roc_auc, 3)}"
ax.annotate(xy=(0.1, 0.025), text=text);
fig.suptitle(f'1-class SVM', color='blue', size=16);
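
The printout above expects zero outliers in the training set. With scikit-learn's `OneClassSVM`, the number of training samples flagged as outliers is governed by `nu`, an upper bound on the fraction of training errors, which defaults to 0.5. The sketch below illustrates this on synthetic Gaussian data; the data and the helper `training_outlier_fraction` are assumptions for illustration only.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 5))  # synthetic stand-in for healthy training data

def training_outlier_fraction(nu):
    """Fraction of training samples that the fitted model flags as outliers."""
    clf = OneClassSVM(kernel="rbf", gamma="scale", nu=nu).fit(X_train)
    return float(np.mean(clf.predict(X_train) == -1))

frac_default = training_outlier_fraction(nu=0.5)  # scikit-learn default
frac_small = training_outlier_fraction(nu=0.05)
print(f"nu=0.50: {frac_default:.2f} of the training set flagged")
print(f"nu=0.05: {frac_small:.2f} of the training set flagged")
```

A small `nu` brings the `predict` output on the training set closer to the "should be 0" expectation. The ROC curve is built from continuous scores rather than from `predict`, so it is affected by `nu` only through the fitted decision function.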

In [None]:
exemplary_fold = 1
binned_vibrations_train = train_vibration_measurement_periods_folds[exemplary_fold]
binned_vibrations_test = test_vibration_measurement_periods_folds[exemplary_fold]
df_cosine_test = df_cosine_folds[exemplary_fold]

#clf = IsolationForest(contamination=0)
#clf.fit(X)
#y_pred_train = clf.predict(X) 

# frequency band columns are all columns that contain the string 'band_'
# --> there are 50 frequency bands per sensor
exemplary_vibration_column_names = binned_vibrations_train[0].columns
frequency_band_column_names = exemplary_vibration_column_names[exemplary_vibration_column_names.str.contains('band_')]

# create a matrix representation of the feature space for the train set,
# where the three different sensors are stacked to a single vector with 150 features
# (50 frequency bands per sensor)
# --> this is the feature space for the clustering algorithm
flatten_df = lambda df_: df_.to_numpy().flatten()
X_train = np.array([flatten_df(individual_measurements[frequency_band_column_names]) for individual_measurements in binned_vibrations_train])
print(f'Shape of X_train: {X_train.shape}')

# Fit PCA (retaining 90% of the variance) followed by a 1-class SVM
clf = Pipeline([('pca', PCA(n_components=0.9)), ('clf', OneClassSVM(kernel='rbf', gamma='auto'))])
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
print(f'Number of outliers detected in training set: {np.sum(y_pred_train == -1)} (should be 0)')

# create a matrix representation for the test set
X_test = np.array([flatten_df(individual_measurements[frequency_band_column_names]) for individual_measurements in binned_vibrations_test])
print(f'Shape of X_test: {X_test.shape}')

# predict outliers in test set
y_test = df_cosine_test['pitting'].replace({True: -1, False: 1}).to_numpy()
y_pred_test = clf.predict(X_test)
y_score_test = clf.score_samples(X_test)  # the lower, the more abnormal
print(f'Number of outliers detected in test set with default parameters: {np.sum(y_pred_test == -1)} (should be {np.sum(y_test == -1)})')

# create ROC curve for test set
fpr, tpr, thresholds = roc_curve(y_true=y_test, y_score=y_score_test)
roc_auc = auc(fpr, tpr)
print(f'AUC: {roc_auc:.3f}')

# plot ROC curve
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(fpr, tpr, color='blue', lw=4, label=f'overall (area = {roc_auc:.3f})', alpha=0.66)
ax.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='baseline')
ax.set_xlim(0.0, 1.0)
ax.set_ylim(0.0, 1.05)
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title(f'ROC Curve (trial {exemplary_fold})')
n_total = len(df_cosine_test)
n_healthy = len(df_cosine_test[df_cosine_test['pitting'] == False])
n_unhealthy = len(df_cosine_test[df_cosine_test['pitting'] == True])
text = f"n={n_total} ({n_healthy} healthy, {n_unhealthy} unhealthy), AUC={round(roc_auc, 3)}"
ax.annotate(xy=(0.1, 0.025), text=text);
fig.suptitle(f'1-class SVM + PCA', color='blue', size=16);

In [None]:
# plot UMAP embedding of X_train, coloured by the predicted label (1: inlier, -1: outlier)
fig, ax = plt.subplots(figsize=(8, 6))
umap = UMAP(n_neighbors=5, min_dist=0.3, metric='cosine', random_state=42)
X_train_umap = umap.fit_transform(X_train)
ax.scatter(X_train_umap[:, 0], X_train_umap[:, 1], c=y_pred_train, cmap='coolwarm', s=1)
ax.set_title('UMAP embedding of X_train');

In [None]:
import plotly.express as px

# compute UMAP embedding of X_train
umap = UMAP(n_neighbors=5, min_dist=0.3, metric='cosine', random_state=42)
X_train_umap = umap.fit_transform(X_train)

# interactive plot of the UMAP embedding of X_train
fig = px.scatter(x=X_train_umap[:, 0], y=X_train_umap[:, 1], color=None, width=800, height=600)
fig.update_layout(title='UMAP embedding of X_train')
fig.show()

In [None]:
import plotly.express as px

# compute UMAP embedding of X_test
umap = UMAP(n_neighbors=5, min_dist=0.3, metric='cosine', random_state=42)
X_test_umap = umap.fit_transform(X_test)

# interactive plot of the UMAP embedding of X_test, coloured by cluster label
fig = px.scatter(x=X_test_umap[:, 0], y=X_test_umap[:, 1], color=df_cosine_test['unique_cluster_label'].astype(str), width=800, height=600)
fig.update_layout(title='UMAP embedding of X_test')
fig.show()

In [None]:
# calculate accuracy
accuracy = np.sum(y_pred_test == y_test) / len(y_test)
print(f'Accuracy: {accuracy:.3f}')

In [None]:
from sklearn.svm import OneClassSVM

exemplary_fold = 0
W_train = df_W_offline_folds[exemplary_fold]['W']
W_test = df_W_online_folds[exemplary_fold]['W']
df_cosine_test = df_cosine_folds[exemplary_fold]

#clf = IsolationForest(contamination=0)
#clf.fit(X)
#y_pred_train = clf.predict(X) 

# create a matrix representation of the feature space for the train set
# by flattening the NMF weights W of each measurement into a single vector
flatten_df = lambda df_: df_.to_numpy().flatten()
X_train = np.array([flatten_df(individual_measurements) for individual_measurements in W_train])
print(f'Shape of X_train: {X_train.shape}')

# Fit 1-class SVM on the NMF weights
clf = OneClassSVM(kernel='rbf', gamma='auto')
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
print(f'Number of outliers detected in training set: {np.sum(y_pred_train == -1)} (should be 0)')

# create a matrix representation for the test set
X_test = np.array([flatten_df(individual_measurements) for individual_measurements in W_test])
print(f'Shape of X_test: {X_test.shape}')

# predict outliers in test set
y_test = df_cosine_test['pitting'].replace({True: -1, False: 1}).to_numpy()
y_pred_test = clf.predict(X_test)
y_score_test = clf.score_samples(X_test)  # the lower, the more abnormal
print(f'Number of outliers detected in test set with default parameters: {np.sum(y_pred_test == -1)} (should be {np.sum(y_test == -1)})')

# create ROC curve for test set
fpr, tpr, thresholds = roc_curve(y_true=y_test, y_score=y_score_test)
roc_auc = auc(fpr, tpr)
print(f'AUC: {roc_auc:.3f}')

# plot ROC curve
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(fpr, tpr, color='blue', lw=4, label=f'overall (area = {roc_auc:.3f})', alpha=0.66)
ax.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='baseline')
ax.set_xlim(0.0, 1.0)
ax.set_ylim(0.0, 1.05)
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title(f'ROC Curve (trial {exemplary_fold})')
n_total = len(df_cosine_test)
n_healthy = len(df_cosine_test[df_cosine_test['pitting'] == False])
n_unhealthy = len(df_cosine_test[df_cosine_test['pitting'] == True])
text = f"n={n_total} ({n_healthy} healthy, {n_unhealthy} unhealthy), AUC={round(roc_auc, 3)}"
ax.annotate(xy=(0.1, 0.025), text=text);
fig.suptitle(f'1-class SVM + NMF', color='blue', size=16);

In [None]:
from sklearn.svm import OneClassSVM

exemplary_fold = 4
W_train = df_W_offline_folds[exemplary_fold]['W']
# process parameters are parsed from the sample id and cast to float,
# so that stacking them onto W yields a purely numeric feature matrix
meta_data_train = pd.DataFrame({
    'rpm': df_W_offline_folds[exemplary_fold]['unique_sample_id'].str.extract(r'^(\d+)_')[0].astype(float),
    'torque': df_W_offline_folds[exemplary_fold]['unique_sample_id'].str.extract(r'_(\d+)_')[0].astype(float),
})
W_test = df_W_online_folds[exemplary_fold]['W']
df_cosine_test = df_cosine_folds[exemplary_fold]
meta_data_test = pd.DataFrame({
    'rpm': df_W_online_folds[exemplary_fold]['unique_sample_id'].str.extract(r'^(\d+)_')[0].astype(float),
    'torque': df_W_online_folds[exemplary_fold]['unique_sample_id'].str.extract(r'_(\d+)_')[0].astype(float),
})

#clf = IsolationForest(contamination=0)
#clf.fit(X)
#y_pred_train = clf.predict(X) 

# create a matrix representation of the feature space for the train set
# by flattening the NMF weights W of each measurement into a single vector;
# the process parameters (rpm, torque) are appended as two extra features
flatten_df = lambda df_: df_.to_numpy().flatten()
X_train = np.array([flatten_df(individual_measurements) for individual_measurements in W_train])
X_train = np.hstack((X_train, meta_data_train.to_numpy()))
print(f'Shape of X_train: {X_train.shape}')

# Fit 1-class SVM on the NMF weights + process parameters
clf = OneClassSVM(kernel='linear', gamma='auto')
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
print(f'Number of outliers detected in training set: {np.sum(y_pred_train == -1)} (should be 0)')

# create a matrix representation for the test set
X_test = np.array([flatten_df(individual_measurements) for individual_measurements in W_test])
X_test = np.hstack((X_test, meta_data_test.to_numpy()))
print(f'Shape of X_test: {X_test.shape}')

# predict outliers in test set
y_test = df_cosine_test['pitting'].replace({True: -1, False: 1}).to_numpy()
y_pred_test = clf.predict(X_test)
y_score_test = clf.score_samples(X_test)  # the lower, the more abnormal
print(f'Number of outliers detected in test set with default parameters: {np.sum(y_pred_test == -1)} (should be {np.sum(y_test == -1)})')

# create ROC curve for test set
fpr, tpr, thresholds = roc_curve(y_true=y_test, y_score=y_score_test)
roc_auc = auc(fpr, tpr)
print(f'AUC: {roc_auc:.3f}')

# plot ROC curve
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(fpr, tpr, color='blue', lw=4, label=f'overall (area = {roc_auc:.3f})', alpha=0.66)
ax.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='baseline')
ax.set_xlim(0.0, 1.0)
ax.set_ylim(0.0, 1.05)
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title(f'ROC Curve (trial {exemplary_fold})')
n_total = len(df_cosine_test)
n_healthy = len(df_cosine_test[df_cosine_test['pitting'] == False])
n_unhealthy = len(df_cosine_test[df_cosine_test['pitting'] == True])
text = f"n={n_total} ({n_healthy} healthy, {n_unhealthy} unhealthy), AUC={round(roc_auc, 3)}"
ax.annotate(xy=(0.1, 0.025), text=text);
fig.suptitle(f'Isolation Forest + NMF + MetaData', color='blue', size=16);
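
The metadata columns above are taken from `unique_sample_id` with two regular expressions. The sketch below shows what each pattern captures, assuming IDs of the form `<rpm>_<torque>_<run>`; the example ID and the helper name `parse_operating_condition` are hypothetical. Note that the captures are strings, so a numeric cast is advisable before stacking them next to float features.

```python
import re

def parse_operating_condition(unique_sample_id):
    """Return (rpm, torque) as floats, using the same two patterns as the cell above."""
    rpm_match = re.search(r'^(\d+)_', unique_sample_id)     # leading digits before the first '_'
    torque_match = re.search(r'_(\d+)_', unique_sample_id)  # first run of digits enclosed by '_'
    if rpm_match is None or torque_match is None:
        raise ValueError(f'unexpected sample id: {unique_sample_id!r}')
    # the regex groups are strings; cast them so they can be used as numeric features
    return float(rpm_match.group(1)), float(torque_match.group(1))

rpm, torque = parse_operating_condition('1200_50_3')  # hypothetical sample ID
print(rpm, torque)
```

In pandas, the same cast could be applied with `.astype(float)` on the extracted columns.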

# Baseline: Reconstruction error for NMF

In [None]:
pass
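
The cell above is still a placeholder. One possible way to build this baseline is sketched below. It uses plain NumPy multiplicative updates instead of the notebook's own NMF helpers, and all data are synthetic; the function names `fit_nmf` and `nmf_reconstruction_error` are hypothetical.

```python
import numpy as np

def fit_nmf(V, n_components, n_iter=500, seed=0, eps=1e-9):
    """Factorize V ~ W @ H with multiplicative updates (Frobenius loss)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], n_components)) + eps
    H = rng.random((n_components, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def nmf_reconstruction_error(V, H, n_iter=500, seed=0, eps=1e-9):
    """Fit non-negative weights for V with H held fixed; return the squared error per row."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], H.shape[0])) + eps
    for _ in range(n_iter):
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return np.sum((V - W @ H) ** 2, axis=1)

# synthetic stand-in for band energies: healthy rows mix two fixed patterns
rng = np.random.default_rng(42)
H_true = np.array([[1.0, 0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0, 1.0]])
V_healthy = rng.random((50, 2)) @ H_true
V_faulty = V_healthy[:5].copy()
V_faulty[:, 0] += 2.0                      # extra energy in one band only

_, H = fit_nmf(V_healthy, n_components=2)  # components learned on healthy data
err_healthy = nmf_reconstruction_error(V_healthy, H)
err_faulty = nmf_reconstruction_error(V_faulty, H)
print(err_healthy.mean(), err_faulty.mean())
```

As with the PCA baseline, the error would have to be flipped before passing it to `roc_curve` if healthy samples are labelled 1.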

# Baseline: Reconstruction error based on PCA

In [None]:
exemplary_fold = 0
explained_variance_ratio = 0.999
binned_vibrations_train = train_vibration_measurement_periods_folds[exemplary_fold]
binned_vibrations_test = test_vibration_measurement_periods_folds[exemplary_fold]
df_cosine_test = df_cosine_folds[exemplary_fold]

# frequency band columns are all columns that contain the string 'band_'
# --> there are 50 frequency bands per sensor
exemplary_vibration_column_names = binned_vibrations_train[0].columns
frequency_band_column_names = exemplary_vibration_column_names[exemplary_vibration_column_names.str.contains('band_')]

# create a matrix representation of the feature space for the train set,
# where the three different sensors are stacked to a single vector with 150 features
# (50 frequency bands per sensor)
# --> this is the feature space for the clustering algorithm
flatten_df = lambda df_: df_.to_numpy().flatten()
X_train = np.array([flatten_df(individual_measurements[frequency_band_column_names]) for individual_measurements in binned_vibrations_train])
print(f'Shape of X_train: {X_train.shape}')

# Fit PCA on the training data (without preprocessing the data) and measure
# how well each sample is reconstructed from the retained components
clf = PCA(n_components=explained_variance_ratio)
X_train_projected = clf.fit_transform(X_train)
reconstruction_error_train = np.sum((X_train - clf.inverse_transform(X_train_projected))**2, axis=1)
print('Reconstruction error train')
print(pd.Series(reconstruction_error_train).describe())

# create a matrix representation for the test set
X_test = np.array([flatten_df(individual_measurements[frequency_band_column_names]) for individual_measurements in binned_vibrations_test])
print(f'Shape of X_test: {X_test.shape}')

# score the test set by its reconstruction error
y_test = df_cosine_test['pitting'].replace({True: -1, False: 1}).to_numpy()
X_test_projected = clf.transform(X_test)
reconstruction_error_test = np.sum((X_test - clf.inverse_transform(X_test_projected))**2, axis=1)
# flip the error so that, as with score_samples, a lower score means more abnormal
y_score_test = reconstruction_error_test.max() - reconstruction_error_test
print('Reconstruction error test')
print(pd.Series(reconstruction_error_test).describe())

# create ROC curve for test set
fpr, tpr, thresholds = roc_curve(y_true=y_test, y_score=y_score_test)
roc_auc = auc(fpr, tpr)
print(f'AUC: {roc_auc:.3f}')

# plot ROC curve
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(fpr, tpr, color='blue', lw=4, label=f'overall (area = {roc_auc:.3f})', alpha=0.66)
ax.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='baseline')
ax.set_xlim(0.0, 1.0)
ax.set_ylim(0.0, 1.05)
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title(f'ROC Curve (trial {exemplary_fold})')
n_total = len(df_cosine_test)
n_healthy = len(df_cosine_test[df_cosine_test['pitting'] == False])
n_unhealthy = len(df_cosine_test[df_cosine_test['pitting'] == True])
text = f"n={n_total} ({n_healthy} healthy, {n_unhealthy} unhealthy), AUC={round(roc_auc, 3)}"
ax.annotate(xy=(0.1, 0.025), text=text);
fig.suptitle(f'PCA ({100*explained_variance_ratio}% explained variance)', color='blue', size=16);
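
The cells above encode pitting as -1 and healthy as 1, so with the default `pos_label` the healthy class is the positive one and the score must be higher for healthier samples. That is why the reconstruction error is flipped before it is passed to `roc_curve`. The following sketch uses made-up values and a hypothetical `pairwise_auc` helper to show what happens to the AUC when the flip is omitted.

```python
def pairwise_auc(y_true, y_score, pos_label=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, y_score) if y == pos_label]
    neg = [s for y, s in zip(y_true, y_score) if y != pos_label]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, -1, -1]                          # 1 = healthy, -1 = pitting
reconstruction_error = [0.1, 0.2, 0.3, 0.9, 0.25]   # made-up values
flipped = [max(reconstruction_error) - e for e in reconstruction_error]

auc_raw = pairwise_auc(y_true, reconstruction_error)  # error used directly as score
auc_flipped = pairwise_auc(y_true, flipped)           # higher score = healthier
print(auc_raw, auc_flipped)
```

With no ties, the two values add up to 1, so an AUC well below 0.5 is usually a sign that the score orientation and the label encoding disagree.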

# Baseline: KNN anomaly detection

In [None]:
pass
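
The cell above is still a placeholder. A minimal sketch of a k-nearest-neighbour baseline follows, assuming the mean distance to the k closest healthy training samples is used as the anomaly score. The data are synthetic and `knn_anomaly_score` is a hypothetical helper, not part of the notebook's package.

```python
import numpy as np

def knn_anomaly_score(X_train, X_query, k=5):
    """Mean Euclidean distance from each query row to its k nearest training rows."""
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sqrt(np.sum(diffs ** 2, axis=2))    # shape (n_query, n_train)
    nearest = np.sort(dists, axis=1)[:, :k]        # k smallest distances per query row
    return nearest.mean(axis=1)

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(200, 3))      # synthetic "healthy" training features
X_healthy = rng.normal(0.0, 1.0, size=(20, 3))     # unseen healthy samples
X_faulty = rng.normal(6.0, 1.0, size=(20, 3))      # shifted cluster standing in for faults

score_healthy = knn_anomaly_score(X_train, X_healthy)
score_faulty = knn_anomaly_score(X_train, X_faulty)

# negate so that, as in the other baselines, a lower score means more abnormal
y_score = -np.concatenate([score_healthy, score_faulty])
print(score_healthy.mean(), score_faulty.mean())
```

The pairwise distance matrix grows with `n_query * n_train`, so a tree-based neighbour search would be preferable for larger data sets.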

# Baseline: Autoencoder

In [None]:
pass

# Compare multiple trials

In [None]:
from conscious_engie_icare.supervised_benchmarking import Benchmarking

CALCULATE = False

if CALCULATE:
    benchmarking = Benchmarking(
        train_vibration_measurement_periods_folds=train_vibration_measurement_periods_folds,
        test_vibration_measurement_periods_folds=test_vibration_measurement_periods_folds,
        df_cosine_folds=df_cosine_folds,
        df_W_offline_folds=df_W_offline_folds,
        df_W_online_folds=df_W_online_folds,
    )
    df_roc_curves = benchmarking.run_all_approaches()
    df_roc_curves.to_pickle('df_roc_curves.pkl')
else:
    df_roc_curves = pd.read_pickle('df_roc_curves.pkl')
df_roc_curves.head()

Rename approaches:

In [None]:
df_roc_curves['approach'].unique()

In [None]:
df_roc_curves['approach (renamed)'] = df_roc_curves['approach'].replace({
    'IForest+ Meta Data': 'IForest+ Meta',
    'Isolation Forest+ PCA+ Meta Data': 'IForest+ PCA+ Meta',
    'IForest+ Hyperparameter tuning': 'IForest+ Hyperparam',
    '1cSVM+ Hyperparameter tuning': '1cSVM+ Hyperparam',
    '1cSVM+ Meta Data': '1cSVM+ Meta',
})

Plot results:

In [None]:
sns.set_style('whitegrid')
fig, ax = plt.subplots(figsize=(8, 4))
sns.boxplot(x='approach (renamed)', y='roc_auc', data=df_roc_curves, ax=ax);
# ax.set_title('ROC AUC');
# before each '+' in each tick label, create a new line for the label
ax.set_xticklabels([label.get_text().replace('+', '\n+') for label in ax.get_xticklabels()])
# rotate x-axis labels
#for tick in ax.get_xticklabels():
#    tick.set_rotation(10)
# ax.set_title('Performance comparison (100 trials)', size=18);
# make the last tick red
ax.get_xticklabels()[-1].set_color('red');
ax.set_ylabel('ROC AUC', size=14);
ax.set_xlabel(None);
# rotate x-axis labels by 45 degrees
plt.xticks(rotation=45);
fig.tight_layout()
fig.savefig(os.path.join('figs', 'performance_comparison.pdf'))

In [None]:
stop

In [None]:
df_roc_curves_test_pca = benchmarking.run_test()
display(df_roc_curves_test_pca.head())

sns.set_style('whitegrid')
fig, ax = plt.subplots(figsize=(12, 4))
sns.boxplot(x='approach', y='roc_auc', data=df_roc_curves_test_pca, ax=ax);
# ax.set_title('ROC AUC');
# before each '+' in each tick label, create a new line for the label
ax.set_xticklabels([label.get_text().replace('+', '\n+') for label in ax.get_xticklabels()])
# rotate x-axis labels
#for tick in ax.get_xticklabels():
#    tick.set_rotation(10)
ax.set_title('TEST RUN NEW PCA: Performance comparison (100 trials)', size=18);
# make the last tick red
ax.get_xticklabels()[-1].set_color('red');
ax.set_ylabel('ROC AUC', size=14);
ax.set_xlabel('Approach', size=14);
fig.tight_layout()

In [None]:
df_ = df_roc_curves_test_pca[df_roc_curves_test_pca['approach'] == 'PCA + Crossvalidation']
df_ = pd.DataFrame(df_.cv_results.tolist())
# all split0_test_scores the same?
df_.head()

In [None]:
df_['split0_test_score']

In [None]:
df_roc_curves_test_pca['approach'].unique()

In [None]:
df_ = df_roc_curves_test_pca[df_roc_curves_test_pca['approach'] == '1cSVM+ Crossvalidation']
df_ = pd.DataFrame(df_.cv_results.tolist())
# all split0_test_scores the same?
df_.head()

In [None]:
df_.iloc[0]

In [None]:
stop

In [None]:
from sklearn.model_selection import PredefinedSplit, GridSearchCV
from sklearn.metrics import roc_auc_score, make_scorer


def train_isolation_forest(fold_nr, **kwargs):
    pipeline = IsolationForest(contamination=0)
    return train_standard_anomaly_detection(fold_nr, pipeline=pipeline, **kwargs)

def train_isolation_forest_with_crossvalidation(fold_nr, **kwargs):
    pipeline = IsolationForest()
    param_grid = {
        'n_estimators' : [10, 100, 200, 500], 
        'max_samples' : [0.1, 0.5, 1.0, 'auto'], 
        'max_features': [0.1, 0.5, 1.0, 10, 100]
    }
    return train_standard_anomaly_detection_with_crossvalidation(fold_nr, pipeline=pipeline, param_grid=param_grid, **kwargs)

def train_isolation_forest_with_metadata(fold_nr, verbose=False):
    return train_isolation_forest(fold_nr, verbose=verbose, use_meta_data=True)

def train_isolation_forest_with_pca(fold_nr, **kwargs):
    pipeline = Pipeline([('pca', PCA(n_components=0.999)), ('clf', IsolationForest(contamination=0))])
    return train_standard_anomaly_detection(fold_nr, pipeline=pipeline, **kwargs)

def train_isolation_forest_with_pca_and_metadata(fold_nr, verbose=False):
    return train_isolation_forest_with_pca(fold_nr, verbose=verbose, use_meta_data=True)

def train_one_class_svm(fold_nr, **kwargs):
    pipeline = OneClassSVM(kernel='rbf', gamma='auto')
    return train_standard_anomaly_detection(fold_nr, pipeline=pipeline, **kwargs)

def train_one_class_svm_with_crossvalidation(fold_nr, **kwargs):
    pipeline = OneClassSVM()
    param_grid = {
        'kernel' : ['linear', 'poly', 'rbf'], 
        'gamma' : [0.0001, 0.001, 0.01, 0.1, 1, 'scale', 'auto'], 
        'nu': [0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
    }
    return train_standard_anomaly_detection_with_crossvalidation(fold_nr, pipeline=pipeline, param_grid=param_grid, **kwargs)

def train_one_class_svm_with_metadata(fold_nr, verbose=False):
    return train_one_class_svm(fold_nr, verbose=verbose, use_meta_data=True)

def train_one_class_svm_with_pca(fold_nr, **kwargs):
    pipeline = Pipeline([('pca', PCA(n_components=0.999)), ('clf', OneClassSVM(kernel='linear', gamma='auto'))])
    return train_standard_anomaly_detection(fold_nr, pipeline=pipeline, **kwargs)

def train_standard_anomaly_detection(fold_nr, pipeline, use_meta_data=False, verbose=False):
    binned_vibrations_train = train_vibration_measurement_periods_folds[fold_nr]
    binned_vibrations_test = test_vibration_measurement_periods_folds[fold_nr]
    df_cosine_test = df_cosine_folds[fold_nr]

    # frequency band columns are all columns that contain the string 'band_'
    # --> there are 50 frequency bands per sensor
    exemplary_vibration_column_names = binned_vibrations_train[0].columns
    frequency_band_column_names = exemplary_vibration_column_names[exemplary_vibration_column_names.str.contains('band_')]

    # create a matrix representation of the feature space for the train set,
    # where the three different sensors are stacked to a single vector with 150 features
    # (50 frequency bands per sensor)
    # --> this is the feature space for the clustering algorithm
    X_train = create_matrix_representation_train_set(binned_vibrations_train, fold_nr, frequency_band_column_names,
                                                     use_meta_data=use_meta_data, verbose=verbose)

    # Fit Pipeline (without preprocessing the data)
    pipeline.fit(X_train)
    y_pred_train = pipeline.predict(X_train)
    if verbose:
        print(f'Number of outliers detected in training set: {np.sum(y_pred_train == -1)} (should be 0)')

    # create a matrix representation for the test set
    X_test = create_matrix_representation_test_set(binned_vibrations_test, fold_nr, frequency_band_column_names, 
                                                   use_meta_data=use_meta_data, verbose=verbose)

    # predict outliers in test set
    y_test = df_cosine_test['pitting'].replace({True: -1, False: 1}).to_numpy()
    y_pred_test = pipeline.predict(X_test)
    y_score_test = pipeline.score_samples(X_test)  # the lower, the more abnormal
    if verbose:
        print(f'Number of outliers detected in test set with default parameters: {np.sum(y_pred_test == -1)} (should be {np.sum(y_test == -1)})')

    # create ROC curve for test set
    results = calc_metrics_(y_test, y_score_test, verbose=verbose)
    return results

def train_standard_anomaly_detection_with_crossvalidation(fold_nr, pipeline, param_grid, validation_ratio=0.2, use_meta_data=False, verbose=False):
    # TODO: crossvalidation does not work with roc_auc
    binned_vibrations_train = train_vibration_measurement_periods_folds[fold_nr]
    binned_vibrations_test = test_vibration_measurement_periods_folds[fold_nr]
    df_cosine_test = df_cosine_folds[fold_nr]

    # frequency band columns are all columns that contain the string 'band_'
    # --> there are 50 frequency bands per sensor
    exemplary_vibration_column_names = binned_vibrations_train[0].columns
    frequency_band_column_names = exemplary_vibration_column_names[exemplary_vibration_column_names.str.contains('band_')]

    # create a matrix representation of the feature space for the train set,
    # where the three different sensors are stacked to a single vector with 150 features
    # (50 frequency bands per sensor)
    # --> this is the feature space for the clustering algorithm
    X_train = create_matrix_representation_train_set(binned_vibrations_train, fold_nr, frequency_band_column_names,
                                                     use_meta_data=use_meta_data, verbose=verbose)

    # create a matrix representation for the validation/test set
    X_val_test = create_matrix_representation_test_set(binned_vibrations_test, fold_nr, frequency_band_column_names, 
                                                       use_meta_data=use_meta_data, verbose=verbose)
    X_val = X_val_test[:int(validation_ratio * len(X_val_test))]
    X_test = X_val_test[int(validation_ratio * len(X_val_test)):]
    y_val_test = df_cosine_test['pitting'].replace({True: -1, False: 1}).to_numpy()
    y_val = y_val_test[:int(validation_ratio * len(y_val_test))]
    y_test = y_val_test[int(validation_ratio * len(y_val_test)):]

    # Gridsearch
    X_train_val = np.vstack((X_train, X_val))
    y_train_val = np.concatenate((np.ones(len(X_train)), y_val)).astype(int)
    assert len(X_train_val) == len(y_train_val)
    #train_ind = list(range(len(X_train)))
    #val_ind = list(range(len(X_train), len(X_train_val)))
    train_ind = np.ones(len(X_train), dtype=int) * -1
    val_ind = np.zeros(len(X_val), dtype=int)
    assert len(train_ind) + len(val_ind) == len(X_train_val)
    ind = np.concatenate((train_ind, val_ind))
    assert len(ind) == len(X_train_val)
    predefined_split = PredefinedSplit(test_fold=ind)
    assert predefined_split.get_n_splits() == 1
    # debugging code for scoring = 'roc_auc' (which did not work, but making a custom scorer did work)
    #split_ = next(predefined_split.split())
    #print(f'train set in predefined_split should only contain non-anomalous data: {split_}')
    #X_train_ = X_train_val[split_[0]]
    #X_val_ = X_train_val[split_[1]]
    #y_train_ = y_train_val[split_[0]]
    #y_val_ = y_train_val[split_[1]]
    #print("y_train_:", y_train_)
    #print("y_val_:", y_val_)
    #pipeline_ = pipeline
    #pipeline_.fit(X_train_val, y=y_train_val)
    #y_pred_train_ = pipeline_.predict(X_train_)
    #y_pred_val_ = pipeline_.predict(X_val_)
    #roc_auc_train_ = roc_auc_score(y_train_, y_pred_train_)
    #print("roc_auc_train_:", roc_auc_train_)
    #roc_auc_val_ = roc_auc_score(y_val_, y_pred_val_)
    #print("roc_auc_val_:", roc_auc_val_)
    # assert True, 'validation set in predefined_split should contain anomalous and non-anomalous data'
    roc_auc_scorer = make_scorer(roc_auc_score)
    grid_search_clf = GridSearchCV(pipeline, param_grid=param_grid, cv=predefined_split, scoring=roc_auc_scorer, error_score="raise")
    grid_search_clf.fit(X_train_val, y=y_train_val)
    cv_results = {'cv_results': grid_search_clf.cv_results_}

    # predict on the training set with the best estimator found by the grid search
    y_pred_train = grid_search_clf.predict(X_train)
    if verbose:
        print(f'Number of outliers detected in training set: {np.sum(y_pred_train == -1)} (should be 0)')

    # predict outliers in test set
    y_pred_test = grid_search_clf.predict(X_test)
    y_score_test = grid_search_clf.score_samples(X_test)  # the lower, the more abnormal
    if verbose:
        print(f'Number of outliers detected in test set with default parameters: {np.sum(y_pred_test == -1)} (should be {np.sum(y_test == -1)})')

    # create ROC curve for test set
    results = calc_metrics_(y_test, y_score_test, verbose=verbose)
    results.update(cv_results)
    return results

def train_reconstruction_error_based_approach(fold_nr, pipeline, use_meta_data=False, verbose=False):
    binned_vibrations_train = train_vibration_measurement_periods_folds[fold_nr]
    binned_vibrations_test = test_vibration_measurement_periods_folds[fold_nr]
    df_cosine_test = df_cosine_folds[fold_nr]

    # frequency band columns are all columns that contain the string 'band_'
    # --> there are 50 frequency bands per sensor
    exemplary_vibration_column_names = binned_vibrations_train[0].columns
    frequency_band_column_names = exemplary_vibration_column_names[exemplary_vibration_column_names.str.contains('band_')]

    # create a matrix representation of the feature space for the train set,
    # where the three different sensors are stacked to a single vector with 150 features
    # (50 frequency bands per sensor)
    # --> this is the feature space for the clustering algorithm
    X_train = create_matrix_representation_train_set(binned_vibrations_train, fold_nr, frequency_band_column_names,
                                                     use_meta_data=use_meta_data, verbose=verbose)

    # Fit the pipeline on the training data (without preprocessing the data) and
    # measure how well each sample is reconstructed from the reduced representation
    X_train_projected = pipeline.fit_transform(X_train)
    reconstruction_error_train = np.sum((X_train - pipeline.inverse_transform(X_train_projected))**2, axis=1)
    if verbose:
        print('Reconstruction error train')
        print(pd.Series(reconstruction_error_train).describe())

    # create a matrix representation for the test set
    X_test = create_matrix_representation_test_set(binned_vibrations_test, fold_nr, frequency_band_column_names, 
                                                   use_meta_data=use_meta_data, verbose=verbose)

    # score the test set by its reconstruction error
    y_test = df_cosine_test['pitting'].replace({True: -1, False: 1}).to_numpy()
    X_test_projected = pipeline.transform(X_test)
    reconstruction_error_test = np.sum((X_test - pipeline.inverse_transform(X_test_projected))**2, axis=1)
    # flip the error so that, as with score_samples, a lower score means more abnormal
    y_score_test = reconstruction_error_test.max() - reconstruction_error_test
    if verbose:
        print('Reconstruction error test')
        print(pd.Series(reconstruction_error_test).describe())

    # create ROC curve for test set
    results = calc_metrics_(y_test, y_score_test, verbose=verbose)
    return results

def train_reconstruction_error_based_approach_with_crossvalidation(fold_nr, pipeline, param_grid, validation_ratio=0.2, use_meta_data=False, verbose=False):
    pass

def train_pca_old(fold_nr, use_meta_data=False, verbose=False):
    pipeline = PCA(n_components=0.999)
    return train_reconstruction_error_based_approach(fold_nr, pipeline, use_meta_data=use_meta_data, verbose=verbose)

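def train_pca_new(fold_nr, use_meta_data=False, verbose=False):
    pipeline = PCA_ANOMALY_DETECTOR(n_components=0.999)
    # forward the explicit arguments (the previous **kwargs was undefined in this scope)
    return train_standard_anomaly_detection(fold_nr, pipeline=pipeline, use_meta_data=use_meta_data, verbose=verbose)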

def flatten_df(df_):
    return df_.to_numpy().flatten()

def create_matrix_representation_train_set(binned_vibrations_train, fold_nr, frequency_band_column_names, use_meta_data=False, verbose=False):
    X_train = np.array([flatten_df(individual_measurements[frequency_band_column_names]) for individual_measurements in binned_vibrations_train])
    if use_meta_data:
        meta_data_train = pd.DataFrame({
            'rpm': df_W_offline_folds[fold_nr]['unique_sample_id'].str.extract(r'^(\d+)_')[0],
            'torque': df_W_offline_folds[fold_nr]['unique_sample_id'].str.extract(r'_(\d+)_')[0],
        })
        X_train = np.hstack((X_train, meta_data_train.to_numpy()))
    if verbose:
        print(f'Shape of X_train: {X_train.shape}')
    return X_train

def create_matrix_representation_test_set(binned_vibrations_test, fold_nr, frequency_band_column_names, use_meta_data=False, verbose=False):
    X_test = np.array([flatten_df(individual_measurements[frequency_band_column_names]) for individual_measurements in binned_vibrations_test])
    if use_meta_data:
        meta_data_test = pd.DataFrame({
            'rpm': df_W_online_folds[fold_nr]['unique_sample_id'].str.extract(r'^(\d+)_')[0],
            'torque': df_W_online_folds[fold_nr]['unique_sample_id'].str.extract(r'_(\d+)_')[0],
        })
        X_test = np.hstack((X_test, meta_data_test.to_numpy()))
    if verbose:
        print(f'Shape of X_test: {X_test.shape}')
    return X_test

def calc_metrics_(y_test, y_score_test, verbose=False):
    # create ROC curve for test set
    fpr, tpr, thresholds = roc_curve(y_true=y_test, y_score=y_score_test)
    roc_auc = auc(fpr, tpr)
    fpr_at_tpr = calc_fpr_at_tpr_threshold(tpr, fpr, threshold=0.9)
    if verbose:
        print(f'AUC: {roc_auc:.3f}')

    results = {
        'fpr': fpr,
        'tpr': tpr,
        'fpr_at_tpr': fpr_at_tpr,
        'thresholds': thresholds,
        'roc_auc': roc_auc,
    }
    return results

def get_results_of_our_method(fold_nr):
    df_cosine_ = df_cosine_folds[fold_nr]
    df_cosine_ = df_cosine_[df_cosine_.unique_cluster_label != -1]  # QUICK FIX !!! : removed unknown cluster labels

    # compute the overall ROC characteristics
    fpr, tpr, roc_auc = calculate_roc_characteristics(df_cosine_)
    fpr_at_tpr = calc_fpr_at_tpr_threshold(tpr, fpr, threshold=0.9)
    results = {
        'fpr': fpr,
        'tpr': tpr,
        'fpr_at_tpr': fpr_at_tpr,
        'thresholds': None,  # calculate_roc_characteristics does not return thresholds
        'roc_auc': roc_auc,
    }
    return results
    

approaches = [
    {'name': 'IForest', 'function': train_isolation_forest},
    {'name': 'IForest+ Meta Data', 'function': train_isolation_forest_with_metadata},
    {'name': 'IForest+ PCA', 'function': train_isolation_forest_with_pca},
    {'name': 'Isolation Forest+ PCA+ Meta Data', 'function': train_isolation_forest_with_pca_and_metadata},
    {'name': 'IForest+ Crossvalidation', 'function': train_isolation_forest_with_crossvalidation},
    {'name': '1cSVM', 'function': train_one_class_svm},
    {'name': '1cSVM+ Crossvalidation', 'function': train_one_class_svm_with_crossvalidation},
    {'name': '1cSVM+ Meta Data', 'function': train_one_class_svm_with_metadata},
    {'name': '1cSVM+ PCA', 'function': train_one_class_svm_with_pca},
    # {'name': '1-class SVM+ PCA+ Meta Data', 'function': train_one_class_svm_with_pca_and_metadata},   # interrupts the kernel
    {'name': 'PCA', 'function': train_pca_old},  # train_pca is not defined; train_pca_old is the reconstruction-error variant
    {'name': 'Our method', 'function': get_results_of_our_method}
]

trials = []
for approach in approaches:
    for trial in tqdm(list(range(N)), desc=f'Approach: {approach["name"]}'):
        # calculate results
        results = approach['function'](trial)
        # add meta info
        results = dict({'trial': trial, 'approach': approach['name']}, **results)
        trials.append(results)
df_roc_curves = pd.DataFrame(trials)
df_roc_curves.head()
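
The cross-validation helper above relies on `PredefinedSplit` to keep the healthy training samples out of the validation fold. The toy sketch below, with sizes chosen only for illustration, shows how the `test_fold` array is interpreted: entries of -1 are never used for validation, and entries of 0 form the single validation fold.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

n_train, n_val = 6, 4                  # toy sizes, not the notebook's real ones
test_fold = np.concatenate((
    np.full(n_train, -1),              # -1: sample is never placed in a validation fold
    np.zeros(n_val, dtype=int),        #  0: sample belongs to validation fold 0
))

split = PredefinedSplit(test_fold=test_fold)
train_idx, val_idx = next(split.split())
print(split.get_n_splits(), train_idx, val_idx)
```

The grid search therefore fits every parameter combination on the healthy samples only and scores it on the labelled validation part.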

In [None]:
# dump as pickle
#with open('df_roc_curves.pkl', 'wb') as f:
#    pickle.dump(df_roc_curves, f)

In [None]:
df_ = df_roc_curves[df_roc_curves.approach == '1cSVM+ Crossvalidation']
df_.cv_results.iloc[10].keys()
df_.cv_results.iloc[10]['split0_test_score']

In [None]:
df_ = df_roc_curves[df_roc_curves.approach == '1cSVM+ Crossvalidation']
df_.cv_results.iloc[0].keys()

In [None]:
df_.cv_results.iloc[0]['split0_test_score'] 

In [None]:
df_.cv_results.iloc[0]['params']

In [None]:
df_.cv_results.iloc[0]['rank_test_score']

Illustrate results:

In [None]:
sns.set_style('whitegrid')
fig, ax = plt.subplots(figsize=(12, 4))
sns.boxplot(x='approach', y='roc_auc', data=df_roc_curves, ax=ax);
# ax.set_title('ROC AUC');
# before each '+' in each tick label, create a new line for the label
ax.set_xticklabels([label.get_text().replace('+', '\n+') for label in ax.get_xticklabels()])
# rotate x-axis labels
#for tick in ax.get_xticklabels():
#    tick.set_rotation(10)
ax.set_title('Performance comparison (100 trials)', size=18);
# make the last tick red
ax.get_xticklabels()[-1].set_color('red');
ax.set_ylabel('ROC AUC', size=14);
ax.set_xlabel('Approach', size=14);
fig.tight_layout()
fig.savefig('performance_comparison_on_supervised_test_setup.pdf')

In [None]:
sns.set_style('whitegrid')
fig, ax = plt.subplots(figsize=(6, 4))
sns.boxplot(x='approach', y='roc_auc', data=df_roc_curves, ax=ax);
# ax.set_title('ROC AUC');
# before each '+' in each tick label, create a new line for the label
ax.set_xticklabels([label.get_text().replace('+', '\n+') for label in ax.get_xticklabels()])
# rotate x-axis labels
#for tick in ax.get_xticklabels():
#    tick.set_rotation(10)
ax.set_title('Performance comparison (100 trials)', size=18);
# make the last tick red
ax.get_xticklabels()[-1].set_color('red');
ax.set_ylabel('ROC AUC', size=14);
ax.set_xlabel('Approach', size=14);
fig.tight_layout()
fig.savefig(os.path.join('figs', 'performance_comparison_on_supervised_test_setup.pdf'))

In [None]:
sns.set_style('whitegrid')
fig, ax = plt.subplots(figsize=(6, 4))
sns.boxplot(x='approach', y='fpr_at_tpr', data=df_roc_curves, ax=ax);
# ax.set_title('ROC AUC');
# before each '+' in each tick label, create a new line for the label
ax.set_xticklabels([label.get_text().replace('+', '\n+') for label in ax.get_xticklabels()])
# rotate x-axis labels
#for tick in ax.get_xticklabels():
#    tick.set_rotation(10)
ax.set_title('Performance comparison (100 trials)', size=18);
# make the last tick red
ax.get_xticklabels()[-1].set_color('red');
ax.set_ylabel('FPR@TPR=0.9', size=14);
ax.set_xlabel('Approach', size=14);
fig.tight_layout()
# fig.savefig('performance_comparison_on_supervised_test_setup.pdf')
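
The metric plotted above, FPR@TPR=0.9, is computed in the notebook by `calc_fpr_at_tpr_threshold`, whose implementation is not shown here. The sketch below is one plausible reading of that metric, under the assumption that it is the smallest false positive rate among ROC points whose true positive rate reaches the threshold. The helper `fpr_at_tpr` and the ROC points are made up for illustration.

```python
import numpy as np

def fpr_at_tpr(tpr, fpr, threshold=0.9):
    """Smallest false positive rate among ROC points whose TPR reaches the threshold."""
    tpr = np.asarray(tpr, dtype=float)
    fpr = np.asarray(fpr, dtype=float)
    reached = tpr >= threshold
    if not reached.any():
        return np.nan                  # the threshold is never reached on this curve
    return float(fpr[reached].min())

# made-up ROC points, ordered as roc_curve would return them
fpr = [0.0, 0.0, 0.1, 0.3, 1.0]
tpr = [0.0, 0.6, 0.85, 0.95, 1.0]
value = fpr_at_tpr(tpr, fpr, threshold=0.9)
print(value)
```

The argument order `(tpr, fpr)` mirrors the calls in the notebook; lower values are better.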

In [None]:
sns.set_style('whitegrid')
fig, axes = plt.subplots(figsize=(6, 4), ncols=2)

# left subfigure (FPR at TPR=0.9)
ax = axes[0]
sns.boxplot(x='approach', y='fpr_at_tpr', data=df_roc_curves, ax=ax);
# ax.set_title('ROC AUC');
# before each '+' in each tick label, create a new line for the label
ax.set_xticklabels([label.get_text().replace('+', '\n+') for label in ax.get_xticklabels()])
# rotate x-axis labels
#for tick in ax.get_xticklabels():
#    tick.set_rotation(10)
ax.set_title('Performance comparison (100 trials)', size=18);
# make the last tick red
ax.get_xticklabels()[-1].set_color('red');
ax.set_ylabel('FPR@TPR=0.9', size=14);
ax.set_xlabel('Approach', size=14);

# right subfigure ()

fig.tight_layout()
# fig.savefig('performance_comparison_on_supervised_test_setup.pdf')

In [None]:
sns.set_style('whitegrid')
fig, ax = plt.subplots(figsize=(8, 6))
sns.boxplot(x='approach', y='roc_auc', data=df_roc_curves, ax=ax);
ax.set_title('ROC AUC');
# before each '+' in each tick label, create a new line for the label
ax.set_xticklabels([label.get_text().replace('+', '\n+') for label in ax.get_xticklabels()])
# rotate x-axis labels
#for tick in ax.get_xticklabels():
#    tick.set_rotation(10)
fig.suptitle('Performance comparison on supervised test setup', color='blue', size=16);
fig.tight_layout()

In [None]:
stop

# Analyze samples without labels

In [None]:
data_test_set, f = load_data(fnames, nperseg=nperseg,

data_no_labels = data_validation_set + data_test_set
del data_validation_set
data_test_set

In [None]:
len(data_no_labels)

In [None]:
'''
import sys

# Create a function to get the size of all variables
def get_size_of_all_variables():
    variable_sizes = [(var, sys.getsizeof(globals()[var]) / (1024 * 1024)) for var in tqdm(globals())]
    total_size_mb = sum(size for _, size in variable_sizes)
    return total_size_mb, variable_sizes

# Call the function and print the results
total_size, variable_sizes = get_size_of_all_variables()
print(f"Total size of all variables: {total_size:.2f} MB")

# Print the sizes of individual variables
for var, size in variable_sizes:
    print(f"{var}: {size:.2f} MB")
'''
pass

In [None]:
# extract vibration data
df_vib_test_no_labels = derive_df_vib(data_no_labels, f)

# convert to orders and derive meta data
df_orders_no_labels, meta_data_no_labels = derive_df_orders(df_vib_test_no_labels, setup, f, verbose=False)
"""
if USE_TRAINING_SET_FOR_VALIDATION:
    print('transforming sample-id in test set')
    # meta_data_test_pitting_8['test_sample_id'] = meta_data_test_pitting_8.groupby(['rotational speed [RPM]', 'torque [Nm]', 'sample_id']).ngroup() + 1   # !!! might not be necessary
    rpm = meta_data_test_pitting_['rotational speed [RPM]']
    torque = meta_data_test_pitting_['torque [Nm]']
    run = meta_data_test_pitting_['sample_id']
    meta_data_test_pitting_['unique_sample_id'] = rpm.astype(str) + '_' + torque.astype(str) + '_' + run.astype(str) + f'_pitting_level_{lvl}'
"""

df_orders_no_labels['unique_sample_id'] = meta_data_no_labels['unique_sample_id'] # + f'_pitting_level_{lvl}'

#df_orders_test_pitting_dict[lvl] = df_orders_test_pitting_
#meta_data_test_pitting_dict[lvl] = meta_data_test_pitting_

In [None]:
fingerprints_ = fingerprints_folds[0]  # there is only one fold
cluster_label_unique_name_mapping_ = cluster_label_unique_name_mapping_folds[0]

# normalize data
print('normalizing data')
df_V_no_labels_normalized = normalize_1(df_orders_no_labels, BAND_COLS)

# extract vibration measurement periods
print('extracting vibration measurement periods')
df_ = df_V_no_labels_normalized
#meta_data_train['sample_id_unique'] = meta_data_train.groupby(['sample_id', 'rotational speed [RPM]', 'torque [Nm]']).ngroup() + 1
df_[['unique_sample_id', 'direction']] = meta_data_no_labels[['unique_sample_id', 'direction']]   # !!! wrong? 
no_labels_vibration_measurement_periods_ = []
no_labels_vibration_measurement_periods_meta_data_ = []
n_index_errors = 0  # number of samples whose operating mode has no known cluster label
for sample_id, group in df_.groupby('unique_sample_id'):
    # TODO: there are duplicate names between validation and test set!
    # for the moment we exclude those samples (since it only concerns 5 measurement periods)
    #  --> fix this later if time left
    # assert len(group) == 3, f'should have exactly 3 directions per measurement period, had {len(group)} instead for sample_id {sample_id}'
    rpm = meta_data_no_labels[meta_data_no_labels['unique_sample_id'] == sample_id]['rotational speed [RPM]'].unique()[0]
    torque = meta_data_no_labels[meta_data_no_labels['unique_sample_id'] == sample_id]['torque [Nm]'].unique()[0]
    try:
        om = cluster_label_unique_name_mapping_[
            (cluster_label_unique_name_mapping_['rotational speed [RPM]'] == rpm) & 
            (cluster_label_unique_name_mapping_['torque [Nm]'] == torque)
        ]['cluster_label_unique'].iloc[0]
    except IndexError:
        n_index_errors += 1
        om = -1
    if len(group) == 3:
        # append measurement period
        measurement_period = {
            'start': 'unknown',
            'stop': 'unknown',
            'group': group,
            'sample_id': sample_id,
            'rpm': rpm,
            'torque': torque,
            'unique_cluster_label': om
        }
        no_labels_vibration_measurement_periods_meta_data_.append(measurement_period)
        no_labels_vibration_measurement_periods_.append(group)

In [None]:
#  derive weights for measurement periods
print('deriving weights for measurement periods')
df_W_no_labels_ = extract_vibration_weights_per_measurement_period(no_labels_vibration_measurement_periods_, fingerprints_[0].columns, BAND_COLS, normalize_1, model_)

# calculate distances
print('calculating distances')
df_dist_no_labels_ = calculate_distances_per_measurement_period(df_W_no_labels_, fingerprints=fingerprints_)
#if CACHE_RESULTS:
if False:
    pickle.dump(df_dist_no_labels_, open(os.path.join('distances_no_labels', f'df_dist_no_labels.pkl'), 'wb'))
df_dist_no_labels_.head()

In [None]:
df_cosine_with_labels = df_cosine_folds[fold]

df_cosine_no_labels = df_dist_no_labels_[['idx', 'om', 'cosine_distance']].pivot(index='idx', columns='om', values='cosine_distance')
# assign the corresponding operating mode to the given row (if known), else, assign -1
# unique cluster label is wrong!!! (might be correct)
df_cosine_no_labels[['rpm', 'torque', 'unique_cluster_label']] = pd.DataFrame(no_labels_vibration_measurement_periods_meta_data_)[['rpm', 'torque', 'unique_cluster_label']]

distance_to_own_cluster_center_ = []
for idx, row in df_cosine_no_labels.iterrows():
    om = row['unique_cluster_label']
    if om != -1:
        distance_to_own_cluster_center_.append(row[om])
    else:
        distance_to_own_cluster_center_.append(np.nan)
df_cosine_no_labels['distance_to_own_cluster_center'] = distance_to_own_cluster_center_
df_cosine_no_labels.head()

In [None]:
plot_density = False

# plot distribution of cosine distances to own cluster center
fig, axes = plt.subplots(figsize=(15, 10), nrows=3)

min_ = -0.001
max_ = 0.8
bins = np.arange(min_, max_, 0.0025)

df_cosine_healthy = df_cosine_with_labels[df_cosine_with_labels.pitting == False]
ax = df_cosine_healthy['distance_to_own_cluster_center'].plot(kind='hist', density=plot_density, bins=bins, ax=axes[0], alpha=0.5, legend=False)
ax.set_title('healthy samples')

df_cosine_anomalous = df_cosine_with_labels[df_cosine_with_labels.pitting == True]
ax = df_cosine_anomalous['distance_to_own_cluster_center'].plot(kind='hist', density=plot_density, bins=bins, ax=axes[1], alpha=0.5, legend=False)
ax.set_title('anomalous samples')

ax = df_cosine_no_labels['distance_to_own_cluster_center'].plot(kind='hist', density=plot_density, bins=bins, ax=axes[2], alpha=0.5, legend=False)
ax.set_title('unknown labels')
ax.set_xlabel('Cosine distance');

fig.suptitle('Cosine distance to vibration fingerprint');
fig.tight_layout()

In [None]:
df_cosine_healthy['distance_to_own_cluster_center'].describe()

In [None]:
df_ = df_cosine_no_labels[df_cosine_no_labels.distance_to_own_cluster_center < 0.3]
fig, ax = plt.subplots(figsize=(15, 4))
ax = df_['distance_to_own_cluster_center'].plot(kind='hist', density=plot_density, bins=100, ax=ax, alpha=0.5, legend=False)
ax.set_title('unknown labels (with distance < 0.3)')
ax.axvline(x=0.005, color='red', linestyle='--', label='distance threshold')
ax.set_xlabel('Cosine distance');
ax.legend()
fig.tight_layout()

In [None]:
plot_density = False

# plot distribution of cosine distances to own cluster center
fig, axes = plt.subplots(figsize=(15, 10), nrows=3)

min_ = -0.00001
max_ = 0.30001
bins = np.arange(min_, max_, 0.002)

df_cosine_healthy = df_cosine_with_labels[(df_cosine_with_labels.pitting == False) & (df_cosine_with_labels.distance_to_own_cluster_center < 0.3)]
ax = df_cosine_healthy['distance_to_own_cluster_center'].plot(kind='hist', density=plot_density, bins=bins, ax=axes[0], alpha=0.5, legend=False)
ax.axvline(x=0.004, color='red', linestyle='--', label='distance threshold')
tp = int((df_cosine_healthy.distance_to_own_cluster_center < 0.004).sum())
fn = int((df_cosine_healthy.distance_to_own_cluster_center >= 0.004).sum())
ax.text(x=0.1, y=0.1, s=f'TP={tp}, FN={fn}', transform=ax.transAxes)
ax.set_title('healthy samples')

df_cosine_anomalous = df_cosine_with_labels[(df_cosine_with_labels.pitting == True) & (df_cosine_with_labels.distance_to_own_cluster_center < 0.3)]
ax = df_cosine_anomalous['distance_to_own_cluster_center'].plot(kind='hist', density=plot_density, bins=bins, ax=axes[1], alpha=0.5, legend=False)
ax.axvline(x=0.004, color='red', linestyle='--', label='distance threshold')
fp = int((df_cosine_anomalous.distance_to_own_cluster_center < 0.004).sum())
tn = int((df_cosine_anomalous.distance_to_own_cluster_center >= 0.004).sum())
ax.text(x=0.1, y=0.1, s=f'FP={fp}, TN={tn}', transform=ax.transAxes)
ax.set_title('anomalous samples')

df_ = df_cosine_no_labels[df_cosine_no_labels.distance_to_own_cluster_center < 0.3]
ax = df_['distance_to_own_cluster_center'].plot(kind='hist', density=plot_density, bins=bins, ax=axes[2], alpha=0.5, legend=False)
ax.axvline(x=0.004, color='red', linestyle='--', label='distance threshold')
ax.set_title('unknown labels')
healthy = len(df_cosine_no_labels[df_cosine_no_labels.distance_to_own_cluster_center < 0.004])
anomalies = len(df_cosine_no_labels[df_cosine_no_labels.distance_to_own_cluster_center >= 0.004])
perc_a = round(anomalies / (healthy + anomalies) * 100, 2)
perc_h = round(healthy / (healthy + anomalies) * 100, 2)
ax.text(x=0.1, y=0.1, s=f'no anomalies={healthy} ({perc_h}%), anomalies={anomalies} ({perc_a}%)', transform=ax.transAxes)
ax.legend()
ax.set_xlabel('Cosine distance');

fig.suptitle('Cosine distance to vibration fingerprint');
fig.tight_layout()

In [None]:
df_cosine_combined = pd.concat([df_cosine_healthy, df_cosine_anomalous, df_cosine_no_labels])
df_cosine_combined['pitting'] = df_cosine_combined['pitting'].fillna('unknown')
fig, ax = plt.subplots(figsize=(18, 4))
ax = sns.boxplot(data=df_cosine_combined, x='distance_to_own_cluster_center', y='pitting', ax=ax)
ax.set_title('Distance to own cluster center per pitting state (healthy, pitting, unknown)');

In [None]:
df_cosine_combined = pd.concat([df_cosine_healthy, df_cosine_anomalous, df_cosine_no_labels])
df_cosine_combined['pitting'] = df_cosine_combined['pitting'].fillna('unknown')
fig, ax = plt.subplots(figsize=(18, 18))
ax = sns.swarmplot(data=df_cosine_combined, x='distance_to_own_cluster_center', y='pitting', ax=ax)
ax.set_title('Distance to own cluster center per pitting state (healthy, pitting, unknown)');

In [None]:
df_cosine_combined.head()

In [None]:
df_cosine_combined = pd.concat([df_cosine_healthy, df_cosine_anomalous, df_cosine_no_labels])
df_cosine_combined['pitting_level'] = df_cosine_combined['pitting_level'].fillna(-1.0).astype(int).astype(str)
# df_cosine_combined['pitting_level'] = df_cosine_combined['pitting_level'].replace({'unknown': -1})
fig, ax = plt.subplots(figsize=(18, 4))
ax = sns.boxplot(data=df_cosine_combined, x='distance_to_own_cluster_center', y='pitting_level', ax=ax)
ax.set_title(f'Distance to own cluster center per pitting level');

Calculating anomaly score:
- lower distance --> lower score
- higher distance --> higher score
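
As a minimal sketch of this mapping, assuming min-max scaling followed by a fourth root (the transformation applied in the cells below): the helper `anomaly_score` is hypothetical and not part of `conscious_engie_icare`, and the input distances are made up.

```python
def anomaly_score(distances, exponent=0.25):
    """Map raw distances to scores in [0, 1]; larger distance -> larger score.

    Hypothetical helper for illustration: min-max scaling followed by a
    power transform (exponent=0.25 corresponds to the fourth root).
    """
    lo, hi = min(distances), max(distances)
    if hi == lo:
        # all distances identical: no ranking information, return zeros
        return [0.0 for _ in distances]
    return [((d - lo) / (hi - lo)) ** exponent for d in distances]


# made-up distances, sorted from smallest to largest
scores = anomaly_score([0.0, 0.0001, 0.0016, 0.16])
```

The fourth root preserves the ordering of the samples but stretches the region of small distances, which is presumably why it is used for the score distribution.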

In [None]:
# normalize distance
min_ = df_cosine_combined.distance_to_own_cluster_center.min()
max_ = df_cosine_combined.distance_to_own_cluster_center.max()

df_cosine_combined['distance_to_own_cluster_center_normalized'] = (df_cosine_combined.distance_to_own_cluster_center - min_) / (max_ - min_)
fig, ax = plt.subplots(figsize=(18, 4))
df_cosine_combined['distance_to_own_cluster_center_normalized'].plot.hist(bins=100, ax=ax)
ax.set_title(f'anomaly score distribution (normalized distance to own cluster center $d_n$)');

In [None]:
score_ = np.sqrt(np.sqrt((df_cosine_combined.distance_to_own_cluster_center - min_) / (max_ - min_)))
fig, ax = plt.subplots(figsize=(18, 4))
score_.plot.hist(bins=100, ax=ax)
ax.set_title(f'anomaly score distribution (4th root of $d_n$)');

---

In [None]:
stop

## Hypothesis testing (only important for V3)

source: https://statistics.laerd.com/spss-tutorials/binomial-test-using-spss-statistics.php 

## Hypothesis 1
Idea: Formulate anomaly detection as a Bernoulli experiment (exactly two possible outcomes per trial) in order to test whether the outcome is statistically significant with respect to a predetermined definition of a successful experiment.
In this case statistical significance can be tested with a binomial test.

- ***$H_0$="The FPR@TPR=90% is larger than 10%"***
- $H_A$: In order to achieve a TPR of at least 90%, the FPR is no more than 10%.
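
For reference, FPR@TPR=90% can be read off a ROC curve as the smallest false positive rate among the operating points whose true positive rate reaches the target. The sketch below is an assumption about what is meant, written without dependencies; `fpr_at_tpr` is a hypothetical stand-in and may differ from `calc_fpr_at_tpr_threshold` in `conscious_engie_icare.util` (for example in how it interpolates between points).

```python
def fpr_at_tpr(fpr, tpr, target_tpr):
    """Smallest FPR among the ROC points with TPR >= target_tpr.

    Hypothetical helper; fpr and tpr are equally long sequences of ROC points.
    Returns None if no point reaches the target TPR.
    """
    candidates = [f for f, t in zip(fpr, tpr) if t >= target_tpr]
    return min(candidates) if candidates else None


# toy ROC curve with five operating points (made-up values)
fpr = [0.0, 0.05, 0.10, 0.40, 1.0]
tpr = [0.0, 0.70, 0.92, 0.97, 1.0]
fpr_at_90 = fpr_at_tpr(fpr, tpr, 0.90)
fpr_at_95 = fpr_at_tpr(fpr, tpr, 0.95)
```

On this toy curve, raising the TPR target from 90% to 95% moves the operating point from the third to the fourth ROC point, so the FPR increases.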

In [None]:
import scipy.stats as stats

ALPHA = 0.05  # 5% significance level
THRESHOLD = 0.90

# calculate FPR@TPR=X% for each fold
fpr_at_tpr_folds = []
for fold in tqdm(range(N), total=N):
    df_cosine_ = df_cosine_folds[fold]
    df_cosine_ = df_cosine_[df_cosine_.unique_cluster_label != -1]  # QUICK FIX !!! : removed unknown cluster labels
    fpr, tpr, roc_auc = calculate_roc_characteristics(df_cosine_)
    fpr_at_tpr = calc_fpr_at_tpr_threshold(tpr, fpr, threshold=THRESHOLD)
    fpr_at_tpr_folds.append(fpr_at_tpr)
fpr_at_tpr_folds = pd.Series(fpr_at_tpr_folds)

# count the folds in which FPR@TPR exceeds the expected proportion
EXPECTED_PROPORTION = 0.1  # null hypothesis value as a decimal
observed_successes = (fpr_at_tpr_folds > EXPECTED_PROPORTION).sum()  # number of folds with FPR@TPR above the expected proportion

# Perform the one-sample binomial test
test_result = stats.binomtest(k=observed_successes, n=N, p=EXPECTED_PROPORTION, alternative='less')   # need to increase the number of samples to get a significant result
print(f"Observed samples where FPR@TPR={THRESHOLD*100}% > {EXPECTED_PROPORTION*100}%: {observed_successes} out of {N}")
if test_result.pvalue < ALPHA:
    print(f"Reject the null hypothesis. The FPR@TPR={THRESHOLD*100}% is statistically significantly smaller than {EXPECTED_PROPORTION*100}%.")
else:
    print(f"Fail to reject the null hypothesis. There isn't enough evidence to conclude that FPR@TPR={THRESHOLD*100}% is statistically significantly smaller than {EXPECTED_PROPORTION*100}%.")
print("p-value:", test_result.pvalue)

- ***$H_0$="The FPR@TPR=95% is larger than 10%"***
- $H_A$: In order to achieve a TPR of at least 95%, the FPR is no more than 10%.

In [None]:
ALPHA = 0.05  # 5% significance level
THRESHOLD = 0.95

# calculate FPR@TPR=X% for each fold
fpr_at_tpr_folds = []
for fold in tqdm(range(N), total=N):
    df_cosine_ = df_cosine_folds[fold]
    df_cosine_ = df_cosine_[df_cosine_.unique_cluster_label != -1]  # QUICK FIX !!! : removed unknown cluster labels
    fpr, tpr, roc_auc = calculate_roc_characteristics(df_cosine_)
    fpr_at_tpr = calc_fpr_at_tpr_threshold(tpr, fpr, threshold=THRESHOLD)
    fpr_at_tpr_folds.append(fpr_at_tpr)
fpr_at_tpr_folds = pd.Series(fpr_at_tpr_folds)

# count the folds in which FPR@TPR exceeds the expected proportion
EXPECTED_PROPORTION = 0.1  # null hypothesis value as a decimal
observed_successes = (fpr_at_tpr_folds > EXPECTED_PROPORTION).sum()  # number of folds with FPR@TPR above the expected proportion

# Perform the one-sample binomial test
test_result = stats.binomtest(k=observed_successes, n=N, p=EXPECTED_PROPORTION, alternative='less')   # need to increase the number of samples to get a significant result
print(f"Observed samples where FPR@TPR={THRESHOLD*100}% > {EXPECTED_PROPORTION*100}%: {observed_successes} out of {N}")
if test_result.pvalue < ALPHA:
    print(f"Reject the null hypothesis. The FPR@TPR={THRESHOLD*100}% is statistically significantly smaller than {EXPECTED_PROPORTION*100}%.")
else:
    print(f"Fail to reject the null hypothesis. There isn't enough evidence to conclude that FPR@TPR={THRESHOLD*100}% is statistically significantly smaller than {EXPECTED_PROPORTION*100}%.")
print("p-value:", test_result.pvalue)

In [None]:
stop

---

## Hypothesis 2 (draft)
- We claim that in more than 95% of the cases, the TPR@FPR=10% is higher than 85%.
- ***$H_0$="The TPR@FPR=10% is less than or equal to 80%"***
- ***$H_A$="The TPR@FPR=10% is more than 80%"***
- (*) Our null hypothesis is ***$H_0$="The TPR@FPR=10% is less than 85%"***, hence the alternative hypothesis would be ***$H_A$="The TPR@FPR=10% is equal to or more than 85%"***.
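
As a reminder of what the one-sided test computes: with `alternative='less'`, the p-value of an exact binomial test is the probability of observing at most `k` successes in `n` trials when the success probability is `p`. A dependency-free sketch with made-up numbers follows; the function name `binom_pvalue_less` is hypothetical and only mirrors this definition, it is not the SciPy implementation.

```python
from math import comb


def binom_pvalue_less(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p): exact one-sided p-value ('less')."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# illustrative numbers only: at most 1 success in 10 trials with p = 0.5
p_value = binom_pvalue_less(k=1, n=10, p=0.5)
```

Here the p-value is (1 + 10) / 2^10 = 11/1024, roughly 0.011, which would be below a 5% significance level.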

In [None]:
import scipy.stats as stats

ALPHA = 0.05  # 5% significance level
FPR_THRESHOLD = 0.1  # FPR threshold to calculate TPR@FPR=<FPR_THRESHOLD>%

# calculate TPR@FPR=X% for each fold
tpr_at_fpr_folds = []
for fold in tqdm(range(N), total=N):
    df_cosine_ = df_cosine_folds[fold]
    fpr, tpr, roc_auc = calculate_roc_characteristics(df_cosine_)
    tpr_at_fpr = calc_tpr_at_fpr_threshold(tpr, fpr, threshold=FPR_THRESHOLD)
    tpr_at_fpr_folds.append(tpr_at_fpr)
tpr_at_fpr_folds = pd.Series(tpr_at_fpr_folds)

# count the folds in which TPR@FPR is at or below the expected proportion
EXPECTED_PROPORTION = 0.8  # null hypothesis value as a decimal
observed_successes = (tpr_at_fpr_folds <= EXPECTED_PROPORTION).sum()  # number of folds with TPR@FPR at or below the expected proportion

# Perform the one-sample binomial test
test_result = stats.binomtest(k=observed_successes, n=N, p=EXPECTED_PROPORTION, alternative='less')
print(f"Observed samples where TPR@FPR={FPR_THRESHOLD*100}% <= {EXPECTED_PROPORTION}: {observed_successes}")
if test_result.pvalue < ALPHA:
    print(f"Reject the null hypothesis. The TPR@FPR={FPR_THRESHOLD*100}% is statistically significantly greater than or equal to {EXPECTED_PROPORTION*100}%.")
else:
    print(f"Fail to reject the null hypothesis. There isn't enough evidence to conclude that TPR@FPR={FPR_THRESHOLD*100}% is greater than or equal to {EXPECTED_PROPORTION*100}%.")
print("p-value:", test_result.pvalue)

In [None]:
stats.binomtest(k=35, n=100, p=0.25, alternative='greater')

In [None]:
# plot average ROC curve
pass

Reminder: Precision & Recall:
- **Precision** is the fraction of relevant instances among the retrieved instances: `true_positives / (true_positives + false_positives)`
- **Recall** is the fraction of relevant instances that have been retrieved over the total amount of relevant instances: `true_positives / (true_positives + false_negatives)`
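
A short worked example of both definitions, using illustrative counts rather than results from this notebook (`precision_recall` is a hypothetical helper):

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) computed from confusion-matrix counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


# illustrative counts: 10 samples flagged, 8 of them correctly; 16 true anomalies in total
precision, recall = precision_recall(true_positives=8, false_positives=2, false_negatives=8)
```

With these counts, precision is 8 / 10 = 0.8 and recall is 8 / 16 = 0.5.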

Why is a precision-recall curve better for imbalanced problems?
- Precision-recall curves are more sensitive to the performance of the model on the minority class.
- In imbalanced problems, it's often more critical to correctly identify the positive class instances (high recall) and minimize false positives (high precision) rather than worrying about true negatives.
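
To make the first point concrete: a ROC operating point (TPR, FPR) does not depend on the class balance, whereas the precision it implies does. A small sketch with illustrative numbers, where `precision_from_rates` is a hypothetical helper:

```python
def precision_from_rates(tpr, fpr, prevalence):
    """Precision implied by TPR, FPR and the fraction of positive samples."""
    tp = tpr * prevalence          # expected share of true positives
    fp = fpr * (1 - prevalence)    # expected share of false positives
    return tp / (tp + fp)


# same detector (TPR = 90%, FPR = 10%), different class balance
balanced = precision_from_rates(0.9, 0.1, prevalence=0.5)
imbalanced = precision_from_rates(0.9, 0.1, prevalence=0.01)
```

With balanced classes the precision is 0.9, while at 1% prevalence it drops to about 0.083, even though the ROC point is unchanged.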

### PR-curve with anomaly as minority class

In [None]:
# plot precision-recall curve
def calculate_pr_characteristics(df_):
    df_ = df_.sort_values(by='distance_to_own_cluster_center', ascending=True)

    # Initialize variables to store the precision-recall curve values
    precision = []
    recall = []

    for threshold in df_['distance_to_own_cluster_center']:
        df_['predicted_anomaly'] = df_['distance_to_own_cluster_center'] >= threshold
        positive = 1
        negative = 1 - positive

        # Count the confusion matrix entries (positive class = pitting)
        true_positives = df_[(df_['pitting'] == positive) & (df_['predicted_anomaly'] == positive)].shape[0]
        false_positives = df_[(df_['pitting'] == negative) & (df_['predicted_anomaly'] == positive)].shape[0]
        true_negatives = df_[(df_['pitting'] == negative) & (df_['predicted_anomaly'] == negative)].shape[0]
        false_negatives = df_[(df_['pitting'] == positive) & (df_['predicted_anomaly'] == negative)].shape[0]

        # Precision = fraction of positive predictions that actually belong to the positive class.
        precision.append(true_positives / (true_positives + false_positives))
        # Recall = fraction of all positive instances in the data set that are predicted as positive.
        recall.append(true_positives / (true_positives + false_negatives))

    # Calculate the area under the precision-recall curve
    pr_auc = auc(recall, precision)
    #pr_auc = 0

    return precision, recall, pr_auc

for fold in range(min([4, len(df_cosine_folds)])):
    # why does the individual ROC curve not go until FP = 1 (?)
    df_cosine_ = df_cosine_folds[fold]
    df_cosine_ = df_cosine_[df_cosine_.unique_cluster_label != -1]  # QUICK FIX !!! : removed unknown cluster labels

    fig, ax = plt.subplots(figsize=(8, 6))

    # plot individual precision-recall curves (one per pitting level)
    linestyles = ['-', '--', ':', '-', '--', ':']
    for lvl, style in zip(pitting_levels, linestyles):
        df_ = df_cosine_[(~df_cosine_['pitting']) | (df_cosine_['pitting_level'] == lvl)]
        precision, recall, pr_auc = calculate_pr_characteristics(df_)
        ax.plot(recall, precision, lw=1, linestyle=style, alpha=0.66, label=f'level {lvl} (area = {pr_auc:.3f})')
        ax.set_xlim(0.0, 1.0)
        ax.set_ylim(0.0, 1.05)

    # Plot the overall precision-recall curve
    precision, recall, pr_auc = calculate_pr_characteristics(df_cosine_)
    ax.plot(recall, precision, color='blue', lw=4, label=f'overall (area = {pr_auc:.3f})', alpha=0.66)
    # plot baseline that goes from x=0, y=0.5 to x=1, y=0.5
    ax.plot([0, 1], [0.5, 0.5], color='navy', lw=2, linestyle='--', label='baseline')
    # ax.plot([0, 0.5], [1, 0.5], color='navy', lw=2, linestyle='--')
    ax.set_xlim(0.0, 1.0)
    ax.set_ylim(0.0, 1.05)
    ax.set_xlabel('Recall')
    ax.set_ylabel('Precision')
    ax.set_title(f'Precision-recall Curve (fold {fold})')
    n_total = len(df_cosine_)
    n_healthy = len(df_cosine_[df_cosine_['pitting'] == False])
    n_unhealthy = len(df_cosine_[df_cosine_['pitting'] == True])
    text = f"n={n_total} ({n_healthy} healthy, {n_unhealthy} unhealthy)"
    ax.annotate(xy=(0.1, 0.025), text=text)
    ax.legend(loc='lower right', title='Pitting severity level');

> WHAT IS THE MAJORITY CLASS?

### PR-curve with healthy as minority class

In [None]:
# plot precision-recall curve
def calculate_pr_characteristics(df_):
    df_ = df_.sort_values(by='distance_to_own_cluster_center', ascending=True)

    # Initialize variables to store the precision-recall curve values
    precision = []
    recall = []

    for threshold in df_['distance_to_own_cluster_center']:
        df_['predicted_anomaly'] = df_['distance_to_own_cluster_center'] >= threshold
        positive = 1
        negative = 1 - positive

        # Count the confusion matrix entries (positive class = healthy)
        # correctly identified healthy samples
        true_positives = df_[(df_['pitting'] == False) & (df_['predicted_anomaly'] == False)].shape[0]
        # incorrectly identified samples as healthy
        false_positives = df_[(df_['pitting'] == True) & (df_['predicted_anomaly'] == False)].shape[0]
        # correctly identified unhealthy samples
        true_negatives = df_[(df_['pitting'] == True) & (df_['predicted_anomaly'] == True)].shape[0]
        # incorrectly identified samples as unhealthy
        false_negatives = df_[(df_['pitting'] == False) & (df_['predicted_anomaly'] == True)].shape[0]

        # Precision = fraction of positive predictions that actually belong to the positive class.
        try:
            precision.append(true_positives / (true_positives + false_positives))
        except ZeroDivisionError:
            precision.append(1)
        # Recall = fraction of all positive instances in the data set that are predicted as positive.
        try:
            recall.append(true_positives / (true_positives + false_negatives))
        except ZeroDivisionError:
            recall.append(1)

    # Calculate the area under the precision-recall curve
    pr_auc = auc(recall, precision)
    #pr_auc = 0

    return precision, recall, pr_auc

for fold in range(min([4, len(df_cosine_folds)])):
    # why does the individual ROC curve not go until FP = 1 (?)
    df_cosine_ = df_cosine_folds[fold]
    df_cosine_ = df_cosine_[df_cosine_.unique_cluster_label != -1]  # QUICK FIX !!! : removed unknown cluster labels

    fig, ax = plt.subplots(figsize=(8, 6))

    # plot individual precision-recall curves (one per pitting level)
    linestyles = ['-', '--', ':', '-', '--', ':']
    for lvl, style in zip(pitting_levels, linestyles):
        df_ = df_cosine_[(~df_cosine_['pitting']) | (df_cosine_['pitting_level'] == lvl)]
        precision, recall, pr_auc = calculate_pr_characteristics(df_)
        ax.plot(recall, precision, lw=1, linestyle=style, alpha=0.66, label=f'level {lvl} (area = {pr_auc:.3f})')
        ax.set_xlim(0.0, 1.0)
        ax.set_ylim(0.0, 1.05)

    # Plot the overall precision-recall curve
    precision, recall, pr_auc = calculate_pr_characteristics(df_cosine_)
    ax.plot(recall, precision, color='blue', lw=4, label=f'overall (area = {pr_auc:.3f})', alpha=0.66)
    # plot baseline that goes from x=0, y=0.5 to x=1, y=0.5
    ax.plot([0, 1], [0.5, 0.5], color='navy', lw=2, linestyle='--', label='baseline')
    # ax.plot([0, 0.5], [1, 0.5], color='navy', lw=2, linestyle='--')
    ax.set_xlim(0.0, 1.0)
    ax.set_ylim(0.0, 1.05)
    ax.set_xlabel('Recall')
    ax.set_ylabel('Precision')
    ax.set_title(f'Precision-recall Curve (fold {fold})')
    n_total = len(df_cosine_)
    n_healthy = len(df_cosine_[df_cosine_['pitting'] == False])
    n_unhealthy = len(df_cosine_[df_cosine_['pitting'] == True])
    text = f"n={n_total} ({n_healthy} healthy, {n_unhealthy} unhealthy)"
    ax.annotate(xy=(0.1, 0.025), text=text)
    ax.legend(loc='lower right', title='Pitting severity level');

# Analyse false positives

In [None]:
df_ = df_cosine_folds[0]
df_sorted_ = df_.sort_values(by='distance_to_own_cluster_center', ascending=True).dropna(subset=['distance_to_own_cluster_center'])
df_sorted_.tail()

In [None]:
df_sorted_no_pitting_ = df_sorted_[df_sorted_['pitting'] == False]
df_sorted_pitting_ = df_sorted_[df_sorted_['pitting'] == True]
df_sorted_no_pitting_.tail()

In [None]:
fig, axes = plt.subplots(figsize=(16, 4), ncols=2)
df_sorted_no_pitting_['distance_to_own_cluster_center'].plot(kind='hist', bins=100, ax=axes[0], title='No pitting')
df_sorted_pitting_['distance_to_own_cluster_center'].plot(kind='hist', bins=100, ax=axes[1], title='Pitting')
dist_ = 0.0075
axes[0].axvline(x=dist_, color='red', linestyle='--');
axes[0].text(dist_+0.25*dist_, 10, 'FP', color='red', verticalalignment='top')
axes[0].text(dist_-0.35*dist_, 10, 'TN', color='red', verticalalignment='top')
axes[1].axvline(x=dist_, color='red', linestyle='--');
axes[1].text(0.05, 5, 'TP', color='red', verticalalignment='top')
axes[1].text(-0.02, 5, 'FN', color='red', verticalalignment='top')
TP = len(df_sorted_pitting_[df_sorted_pitting_['distance_to_own_cluster_center'] >= dist_])
FP = len(df_sorted_no_pitting_[df_sorted_no_pitting_['distance_to_own_cluster_center'] >= dist_])
TN = len(df_sorted_no_pitting_[df_sorted_no_pitting_['distance_to_own_cluster_center'] < dist_])
FN = len(df_sorted_pitting_[df_sorted_pitting_['distance_to_own_cluster_center'] < dist_])
TPR = TP / (TP + FN)
FPR = FP / (FP + TN)
TNR = TN / (TN + FP)
FNR = FN / (FN + TP)
print(f'TP: {TP} (TPR = {round(100 * TPR, 1)}%)')
print(f'FP: {FP} (FPR = {round(100 * FPR, 1)}%)')
print(f'TN: {TN} (TNR = {round(100 * TNR, 1)}%)')
print(f'FN: {FN} (FNR = {round(100 * FNR, 1)}%)')

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))
ax.scatter(x=df_sorted_no_pitting_['rpm'], y=df_sorted_no_pitting_['distance_to_own_cluster_center'], label='No pitting')
ax.set_xlabel('RPM')
ax.set_ylabel('Cosine distance')
ax.set_title('Cosine distance to own cluster center per RPM for healthy samples')
ax.axhline(y=dist_, color='red', linestyle='--');

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))
ax.scatter(x=df_sorted_no_pitting_['torque'], y=df_sorted_no_pitting_['distance_to_own_cluster_center'], label='No pitting')
ax.set_xlabel('Torque')
ax.set_ylabel('Cosine distance')
ax.set_title('Cosine distance to own cluster center per torque for healthy samples')
ax.axhline(y=dist_, color='red', linestyle='--');

# Test on merged validation and test set

In [None]:
BASE_PATH_VALIDATION = os.path.join('Data_Challenge_PHM2023_validation_data')
BASE_PATH_TEST = os.path.join('Data_Challenge_PHM2023_test_data')

nperseg = 10240
noverlap = nperseg // 2
nfft = None
fs = 20480

def load_test_data(rpm, torque, run, path=BASE_PATH_HEALTHY):
    df = pd.read_csv(path, names=['x', 'y', 'z', 'tachometer'], delimiter=' ')
    return df

def load_data(fnames, use_train_data_for_validation=True, base_path=BASE_PATH_HEALTHY, **kwargs):
    """train_data --> process parameters are known. (TODO: change later)"""
    data = []
    for fn in tqdm(fnames):
        rpm, torque, run = extract_process_parameters(fn, use_train_data_for_validation=use_train_data_for_validation)
        try:
            with timeout(seconds=2):
                df = load_train_data(rpm, torque, run, base_path=base_path) if use_train_data_for_validation else load_test_data(rpm, torque, run, path=fn)
        except TimeoutError:
            print(f"timed out: {fn}")
            continue  # skip this file, otherwise `df` would be undefined or stale
        f, t, stft_x = stft(df['x'], **kwargs)
        f, t, stft_y = stft(df['y'], **kwargs)
        f, t, stft_z = stft(df['z'], **kwargs)
        f, psd_x = welch(df['x'], **kwargs)
        f, psd_y = welch(df['y'], **kwargs)
        f, psd_z = welch(df['z'], **kwargs)
        data.append({
            'rpm': rpm,
            'torque': torque, 
            'sample_id': run,
            'unique_sample_id': f'{rpm}_{torque}_{run}',  # identifier built from rpm, torque and run
            'vibration_time_domain': df, 
            'stft_x': stft_x,
            'stft_y': stft_y,
            'stft_z': stft_z,
            'psd_x': psd_x,
            'psd_y': psd_y,
            'psd_z': psd_z
        })
    return data, f

fnames = glob.glob(os.path.join(BASE_PATH_VALIDATION, '*.txt'))
data_validation_set, f = load_data(fnames, nperseg=nperseg,
                                   noverlap=noverlap, nfft=nfft, fs=fs,
                                   use_train_data_for_validation=False)

"""
fnames = glob.glob(os.path.join(BASE_PATH_TEST, '*.txt'))
data_test_set, f = load_data(fnames, nperseg=nperseg,
                                   noverlap=noverlap, nfft=nfft, fs=fs,
                                   use_train_data_for_validation=False)
"""
pass

In [None]:
# analyse size of variables (with possibility to filter out variables with high memory consumption)

import sys

# Create a function to get the size of all variables
def get_size_of_all_variables():
    variable_sizes = [(var, sys.getsizeof(globals()[var]) / (1024 * 1024)) for var in globals()]
    total_size_mb = sum(size for _, size in variable_sizes)
    return total_size_mb, variable_sizes

# Call the function and print the results
total_size, variable_sizes = get_size_of_all_variables()
print(f"Total size of all variables: {total_size:.2f} MB")

# Print the sizes of individual variables
for var, size in variable_sizes:
    print(f"{var}: {size:.2f} MB")

In [None]:
# extract vibration data
df_vib_test_unhealthy = derive_df_vib(data_test, f)

# convert to orders and derive meta data
df_orders_test_pitting_, meta_data_test_pitting_ = derive_df_orders(df_vib_test_unhealthy, setup, f, verbose=False)

---

In [None]:
df_cosine_.distance_to_own_cluster_center.isna().sum()

In [None]:
pass

In [None]:
stop

---

Below we iterate over all operating modes and check the distances to the fingerprint.
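
Conceptually, each measurement period is reduced to a weight vector, which is compared to the fingerprint of every operating mode; the distance to the fingerprint of the period's own operating mode is the quantity that is thresholded later. A simplified sketch with made-up vectors follows. The actual computation is done by `calculate_distances_per_measurement_period`, and the names and numbers below are hypothetical.

```python
from math import sqrt


def cosine_distance(u, v):
    """1 - cosine similarity of two equally long, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# hypothetical fingerprints: one weight vector per operating mode
fingerprints = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 1.0]}
weights = [2.0, 0.0, 0.0]   # weights of one measurement period
own_mode = 0                # operating mode derived from rpm and torque

# distance of the measurement period to every fingerprint
distances = {om: cosine_distance(weights, fp) for om, fp in fingerprints.items()}
distance_to_own_cluster_center = distances[own_mode]
```

Because the cosine distance ignores vector length, the scaled weight vector still has distance 0 to its own fingerprint, while the orthogonal fingerprint is at distance 1.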

In [None]:
df_vib_test_healthy = derive_df_vib(data_healthy_test, f)
df_orders_test_healthy, meta_data_test_healthy = derive_df_orders(df_vib_test_healthy, setup, f)
df_V_test_normalized_healthy = normalize_1(df_orders_test_healthy, BAND_COLS)
df_ = df_V_test_normalized_healthy
# meta_data_test_healthy['sample_id'] = meta_data_test_healthy.groupby(['rotational speed [RPM]', 'torque [Nm]', 'sample_id']).ngroup() + 1
df_[['unique_sample_id', 'direction']] = meta_data_test_healthy[['unique_sample_id', 'direction']]
test_vibration_measurement_periods = []
test_vibration_measurement_periods_meta_data = []
n_index_errors = 0
for unique_sample_id, group in df_.groupby('unique_sample_id'):
    rpm = meta_data_test_healthy[meta_data_test_healthy['unique_sample_id'] == unique_sample_id]['rotational speed [RPM]'].unique()[0]
    torque = meta_data_test_healthy[meta_data_test_healthy['unique_sample_id'] == unique_sample_id]['torque [Nm]'].unique()[0]
    try:
        om = cluster_label_unique_name_mapping[
            (cluster_label_unique_name_mapping['rotational speed [RPM]'] == rpm) & 
            (cluster_label_unique_name_mapping['torque [Nm]'] == torque)
        ]['cluster_label_unique']
        assert len(om.unique()) <= 1, f'should have maximum one unique cluster label, got instead: {om}'
        om = om.iloc[0]
        """
        if len(oms.unique()) == 1:
            om = oms.iloc[0]
        elif len(oms.unique()) == 0:
            pass
        else:
            print(oms)
            raise ValueError(f'Found more than one unique cluster label for RPM={rpm} and torque={torque}')
        """
    except IndexError:
        n_index_errors += 1
        om = -1
    measurement_period = {
        'start': 'unknown', 
        'stop': 'unknown',
        'group': group,
        'sample_id': unique_sample_id,
        'rpm': rpm,
        'torque': torque,
        'unique_cluster_label': om
    }
    test_vibration_measurement_periods.append(group)
    test_vibration_measurement_periods_meta_data.append(measurement_period)

n_total = len(test_vibration_measurement_periods)
print(f'Total number of measurement periods: {n_total}')
print(f'Number of measurement periods with unknown RPM and/or torque: {n_index_errors}')

df_W_online = extract_vibration_weights_per_measurement_period(test_vibration_measurement_periods, fingerprints[0].columns, BAND_COLS, normalize_1, model)
df_dist_online = calculate_distances_per_measurement_period(df_W_online)

# for each measurement period (row), get the distance to each operating mode (column)
df_cosine = df_dist_online[['idx', 'om', 'cosine_distance']].pivot(index='idx', columns='om', values='cosine_distance')
# assign the corresponding operating mode to the given row (if known), else, assign -1
# unique cluster label is wrong!!! (might be correct)
df_cosine[['rpm', 'torque', 'unique_cluster_label']] = pd.DataFrame(test_vibration_measurement_periods_meta_data)[['rpm', 'torque', 'unique_cluster_label']]

distance_to_own_cluster_center = []
for idx, row in df_cosine.iterrows():
    om = row['unique_cluster_label']
    if om != -1:
        distance_to_own_cluster_center.append(row[om])
    else:
        distance_to_own_cluster_center.append(np.nan)
df_cosine['distance_to_own_cluster_center'] = distance_to_own_cluster_center
df_cosine.head()

In [None]:
# There are almost no anomalies when there is no pitting
anomaly = df_cosine['distance_to_own_cluster_center'] > 0.01   # TODO: setting threshold to 0.01 as first test, later set threshold based on distance in training set
anomaly.value_counts()

In [None]:
fig, axes = plt.subplots(figsize=(16, 8), ncols=2)

ax = df_cosine['distance_to_own_cluster_center'].plot(kind='hist', bins=20, ax=axes[0], alpha=0.5, legend=False)
ax.set_title('Distance to own cluster centers')
ax.set_xlabel('Cosine distance')

# plot distance to other cluster centers
ax = df_cosine.drop(columns=['rpm', 'torque', 'unique_cluster_label', 'distance_to_own_cluster_center']).melt()['value'].plot(kind='hist', bins=20, ax=axes[1], alpha=0.5, legend=False)
ax.set_title('Distance to other cluster centers')
ax.set_xlabel('Cosine distance')
ax.legend(loc='upper right', bbox_to_anchor=(1.2, 1))

# Setting an anomaly threshold with an ROC-curve

In [None]:
pass

# NOTES & TODOS

- [x] change test set --> use train set with and without pitting
- [ ] experiment with normalisation
- [ ] ROC curve

©, 2023, Sirris