# PHM North America challenge '23

# 02 - Data processing

This notebook preprocesses the vibration data from the PHM North America challenge '23, with the goal of constructing a decomposition matrix of the vibration data.

The preprocessing consists of the following steps:
1. [domain-conversion](#(1)-Conversion-from-time-to-frequency-domain): Convert the vibration data into the frequency domain using the Fast Fourier Transform (FFT).
2. [order-conversion](#2-order-transformation-and-binning): Convert the frequency domain data into constant orders and bin the resulting data.
3. [normalization](#3-frequency-band-normalization): Normalize the order converted data.

In [None]:
%load_ext autoreload
%autoreload 2

from conscious_engie_icare.normalization import normalize_1
from conscious_engie_icare.nmf_profiling import derive_df_orders, derive_df_vib
from conscious_engie_icare.data.phm_data_handler import BASE_PATH_HEALTHY, FILE_NAMES_HEALTHY, CACHING_FOLDER_NAME, fetch_and_unzip_data, \
                                                        load_data, load_cached_data, FPATH_DF_ORDERS_TRAIN_FOLDS, FPATH_META_DATA_TRAIN_FOLDS, FPATH_DF_V_TRAIN_FOLDS, \
                                                        FPATH_DATA_HEALTHY_TEST_FOLDS
import os
import pandas as pd
from tqdm import tqdm
import plotly.express as px
import matplotlib.pyplot as plt
import random
import numpy as np
import pickle

In [None]:
fetch_and_unzip_data()

CACHE_RESULTS = True

# (1) Conversion from time to frequency domain

As described in the [data exploration notebook](01_data_exploration.ipynb), the data is given in the time domain.
We transform it to the frequency domain.

Each individual measurement is transformed to a frequency spectrum using the short-time Fourier transform (STFT) and Welch's method.
- **Short-time Fourier transform (STFT)**: Let $x(t)$ denote the original vibration signal in the time domain, with time index $t$.
The **STFT** is applied to $x(t)$ to obtain a time-frequency representation $X(f,\tau)$, where $f$ is the frequency index and $\tau$ is the time-window index.
- **Welch's method**: Welch's method (also called the periodogram method) for estimating power spectra is carried out by dividing the time signal into successive blocks, forming the periodogram for each block, and averaging [source](https://ccrma.stanford.edu/~jos/sasp/Welch_s_Method.html).
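
The averaging step of Welch's method can be sketched with NumPy alone. This is a minimal illustration, not the implementation behind `load_data`: the function name `welch_psd`, the Hann window, the per-segment mean removal and the synthetic 800 Hz test signal are assumptions. Only `fs`, `nperseg` and the 50% overlap mirror the values used in this notebook.

```python
import numpy as np

def welch_psd(x, fs, nperseg, noverlap):
    """Average windowed periodograms of overlapping segments (Welch's method)."""
    step = nperseg - noverlap
    window = np.hanning(nperseg)          # assumed window choice
    scale = fs * np.sum(window ** 2)      # scales the result to a power spectral density
    periodograms = []
    for start in range(0, len(x) - nperseg + 1, step):
        segment = x[start:start + nperseg]
        segment = segment - segment.mean()            # remove the constant offset
        spectrum = np.fft.rfft(segment * window)
        periodograms.append(np.abs(spectrum) ** 2 / scale)
    psd = np.mean(periodograms, axis=0)   # averaging reduces the variance of the estimate
    psd[1:-1] *= 2                        # one-sided spectrum (nperseg is even here)
    freqs = np.fft.rfftfreq(nperseg, d=1 / fs)
    return freqs, psd

# Synthetic example: an 800 Hz tone in noise, sampled for 5 seconds.
fs = 20480
nperseg = 10240
t = np.arange(5 * fs) / fs
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 800 * t) + 0.1 * rng.standard_normal(t.size)

freqs, psd = welch_psd(x, fs, nperseg, noverlap=nperseg // 2)
peak_hz = freqs[np.argmax(psd)]
```

With `nperseg = 10240` and `fs = 20480`, the frequency resolution is `fs / nperseg = 2` Hz, so the 800 Hz tone falls exactly on a frequency bin and `peak_hz` recovers it.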

In [None]:
nperseg = 10240
noverlap = nperseg // 2
nfft = None
fs = 20480
data_healthy, f = load_data(FILE_NAMES_HEALTHY, nperseg=nperseg, noverlap=noverlap, nfft=nfft, fs=fs)
len(data_healthy)

We use most of the healthy data for training. Different train-test splits have been used in the past. In this latest version, we sample an equal number of samples from the healthy and faulty data.

Methods tried in the past:
1. randomly shuffle the data and split into train and test set once
2. 4 different independent splits
3. repeat N times: sample an equal number of samples from the healthy and faulty data

We sample the data according to (3).

In [None]:
SPLIT = 0.75
N = 100

data_healthy_train_folds = []
data_healthy_test_folds = []
for i in range(N):
    # shuffle the healthy data in place (seeded per fold), then split into train and test
    split_id = int(len(data_healthy) * SPLIT)
    random.Random(i).shuffle(data_healthy)
    data_healthy_train_ = data_healthy[:split_id]
    data_healthy_test_ = data_healthy[split_id:]
    data_healthy_train_folds.append(data_healthy_train_)
    data_healthy_test_folds.append(data_healthy_test_)

# save data_healthy_test_folds for usage in 04_online_anomaly_detection.ipynb
if CACHE_RESULTS:
    with open(FPATH_DATA_HEALTHY_TEST_FOLDS, 'wb') as file:  # TODO: store indices instead of complete file
        pickle.dump(data_healthy_test_folds, file)

print(f'There are {len(data_healthy_train_folds[0])} healthy samples in the first training fold.')
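
The fold construction above shuffles and slices the full list of spectra. A lighter variant, in the spirit of the TODO about storing indices, keeps only the index lists per fold. The sketch below is an assumption about how that could look, not the notebook's implementation: `make_folds` is a hypothetical helper, and unlike the loop above it starts every fold from a fresh index list, so each fold depends only on its own seed.

```python
import random

def make_folds(n_samples, n_folds, train_fraction):
    """Return one (train_indices, test_indices) pair per fold."""
    split_id = int(n_samples * train_fraction)
    folds = []
    for seed in range(n_folds):
        indices = list(range(n_samples))
        random.Random(seed).shuffle(indices)   # fold-specific seed, fresh index list
        folds.append((indices[:split_id], indices[split_id:]))
    return folds

# Toy example: 20 samples, 3 folds, 75% of the samples used for training.
folds = make_folds(n_samples=20, n_folds=3, train_fraction=0.75)
train_idx, test_idx = folds[0]
```

The samples of a fold can then be recovered with a list comprehension such as `[data_healthy[i] for i in train_idx]`, provided `data_healthy` itself is left unshuffled.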

# (2) Order transformation and binning

In the order domain, frequency components are expressed as multiples of the rotational speed of the gear, which is given in rotations per minute (RPM) in the process conditions.
The measurements are transformed to orders by dividing the frequency by the gear's rotational speed.
We then bin the data into 50 bins, starting from 0.5 orders up to 100.5 orders. The bins are equally spaced and non-overlapping, with a window size of 2 orders.
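
The order transformation and binning can be sketched as follows. This is a minimal illustration under stated assumptions and not the code of `derive_df_orders`: the helper `bin_orders` is hypothetical, summing the spectrum values inside each bin is an assumed aggregation, and the 1200 RPM operating point and the single-peak spectrum are made up for the example.

```python
import numpy as np

def bin_orders(freqs_hz, spectrum, rpm, start=0.5, stop=100.5, n_windows=50):
    """Convert a frequency axis to orders and aggregate the spectrum per order bin."""
    shaft_hz = rpm / 60.0                     # rotational speed in Hz
    orders = freqs_hz / shaft_hz              # order = multiple of the rotational speed
    edges = np.linspace(start, stop, n_windows + 1)   # 51 edges -> 50 bins of width 2
    bin_ids = np.digitize(orders, edges) - 1  # -1: below range, n_windows: above range
    band_values = np.zeros(n_windows)
    for b in range(n_windows):
        in_bin = bin_ids == b
        if in_bin.any():
            band_values[b] = spectrum[in_bin].sum()   # assumed aggregation: sum
    return edges, band_values

# Toy spectrum with 2 Hz resolution and a single peak at 800 Hz.
freqs_hz = np.arange(0, 10241, 2.0)
spectrum = np.zeros_like(freqs_hz)
spectrum[freqs_hz == 800.0] = 1.0

edges, band_values = bin_orders(freqs_hz, spectrum, rpm=1200)
peak_bin = int(np.argmax(band_values))
```

At 1200 RPM the rotational speed is 20 Hz, so the 800 Hz peak sits at order 40 and lands in the bin covering 38.5 to 40.5 orders. Frequencies above 100.5 orders fall outside all bins and are dropped.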

Below we load/calculate said order transformation.
We cache this process, because it takes around 15 minutes to calculate.
The plot shows the frequency band values of all measurements for the x-, y- and z-axis, from top to bottom.
The `unique_sample_id` has the format `<rpm>_<torque>_<run>`.

In [None]:
CACHE_RESULTS = True   # If True, cache the results locally
setup = {'start': 0.5, 'stop': 100.5, 'n_windows': 50, 'window_steps': 2, 'window_size': 2}

# load cached, order-transformed data (if available)
try:
    df_orders_train_folds, meta_data_train_folds = load_cached_data(
            fpath_df_orders_train_folds=FPATH_DF_ORDERS_TRAIN_FOLDS,
            fpath_meta_data_train_folds=FPATH_META_DATA_TRAIN_FOLDS
        )

# load train data and transform to orders
except FileNotFoundError:
    print('Constructing the train folds...')
    df_vib_train_folds = []
    df_orders_train_folds = []
    meta_data_train_folds = []
    for fold, data_healthy_train_ in enumerate(tqdm(data_healthy_train_folds, desc='Deriving orders on training set per fold')):
        df_vib_train_folds.append(derive_df_vib(data_healthy_train_, f))  # f: frequency axis returned by load_data
        df_orders_train_, meta_data_train_ = derive_df_orders(df_vib_train_folds[-1], setup, f, verbose=False)
        df_orders_train_[meta_data_train_.columns] = meta_data_train_
        df_orders_train_folds.append(df_orders_train_)
        meta_data_train_folds.append(meta_data_train_)
    if CACHE_RESULTS:
        # cache train data
        os.makedirs(os.path.dirname(FPATH_DF_ORDERS_TRAIN_FOLDS), exist_ok=True)
        with open(FPATH_DF_ORDERS_TRAIN_FOLDS, 'wb') as file:
            pickle.dump(df_orders_train_folds, file)
        # cache train meta data
        os.makedirs(os.path.dirname(FPATH_META_DATA_TRAIN_FOLDS), exist_ok=True)
        with open(FPATH_META_DATA_TRAIN_FOLDS, 'wb') as file:
            pickle.dump(meta_data_train_folds, file)

# plot effect of orders
cols = df_orders_train_folds[-1].columns
BAND_COLS = cols[cols.str.contains('band')].tolist()
idx_cols = ['index', 'rotational speed [RPM]', 'torque [Nm]', 'direction',
            'unique_sample_id', 'sample_id']
cols = BAND_COLS + idx_cols
df_ = df_orders_train_folds[-1].reset_index()[cols]
df_ = pd.melt(df_, id_vars=idx_cols, var_name='frequency band', value_name='frequency band value')
fig = px.line(df_, x='frequency band', y='frequency band value',
              facet_row='direction', color='unique_sample_id',
              hover_data=['rotational speed [RPM]', 'torque [Nm]'],
              title='Frequency bands for healthy samples, after order-conversion, before normalisation',
              markers=True, width=1200, height=600)
fig

We observe **major peaks at 40 and 80 orders** across all measurements, which shows why the order transformation is needed to make different running conditions comparable.
40 orders corresponds to the number of teeth of the driving gear (= **gear mesh frequency**), and 80 orders corresponds to its second **harmonic**.
The driven gear has 72 teeth; no corresponding peak at 72 orders is visible in the order spectrum.
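
The relation between tooth count, rotational speed and order can be checked with a few lines of arithmetic. The helper name and the three rotational speeds below are hypothetical and only serve the illustration; the tooth count of 40 is the one stated above.

```python
def gear_mesh_frequency_hz(n_teeth, rpm, harmonic=1):
    """Gear mesh frequency: tooth count times the rotational speed in Hz."""
    shaft_hz = rpm / 60.0
    return harmonic * n_teeth * shaft_hz

results = []
for rpm in (600, 1200, 2400):                 # hypothetical operating points
    gmf_hz = gear_mesh_frequency_hz(40, rpm)  # changes with the rotational speed
    order = gmf_hz / (rpm / 60.0)             # stays at the tooth count
    results.append((rpm, gmf_hz, order))
```

The gear mesh frequency moves from 400 Hz to 1600 Hz across these speeds, while its order stays at 40, which is why the peaks align in the order spectrum.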

# (3) Frequency-band normalization

We further normalize the frequency bands so that all measurements share the same range. Given the $i\text{th}$ measurement vector $\mathbf{v}_i$, each order-transformed bin $v_{ij}$ is divided by the sum of all bins of that measurement:

$$v_{ij}' = \frac{v_{ij}}{\sum_k v_{ik}}$$

where

- $v_{ij}$ is the $j\text{th}$ bin of the $i\text{th}$ measurement vector $\mathbf{v}_i$.
- $v_{ij}'$ is the normalized $j\text{th}$ bin of the $i\text{th}$ measurement vector.
- $\sum_k v_{ik}$ is the sum of all bins of the $i\text{th}$ measurement vector, so the normalized bins of each measurement sum to 1.
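
The row-wise normalization above can be sketched with NumPy. This is an illustration only and not the code of `normalize_1`: the function name `normalize_rows` and the toy matrix are assumptions, and the sketch presumes that every measurement has a strictly positive bin sum.

```python
import numpy as np

def normalize_rows(V):
    """Divide every row (measurement) by the sum of its bins."""
    V = np.asarray(V, dtype=float)
    row_sums = V.sum(axis=1, keepdims=True)   # one sum per measurement
    return V / row_sums

# Two toy measurements with the same spectral shape but different energy.
V = np.array([[1.0, 3.0, 4.0],
              [10.0, 30.0, 40.0]])
V_norm = normalize_rows(V)
```

Both rows normalize to the same vector, which illustrates the purpose of this step: measurements that differ only in overall energy become directly comparable.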

In [None]:
df_V_train_normalized_folds = [normalize_1(df_orders_train_, BAND_COLS) for df_orders_train_ in df_orders_train_folds]
idx_vars = ['rotational speed [RPM]', 'torque [Nm]', 'direction', 'unique_sample_id', 'sample_id']
df_ = df_V_train_normalized_folds[-1].reset_index()
df_[idx_vars] = df_orders_train_folds[-1][idx_vars]
df_ = pd.melt(df_, id_vars=['index'] + idx_vars, 
    var_name='frequency band', value_name='frequency band value'
    )

# plot effect of normalization
fig = px.line(df_, x='frequency band', y='frequency band value',
              facet_row='direction', color='unique_sample_id',
            # hover_data=['rotational speed [RPM]', 'torque [Nm]'], 
              title='Frequency bands for healthy samples, after normalisation',
              markers=True, width=1200, height=600)
fig.show()

The order-transformed and normalized vibration measurements will be further used in the following notebook for the construction of the context-sensitive vibration fingerprints.
They are cached below.

In [None]:
if CACHE_RESULTS:
    with open(FPATH_DF_V_TRAIN_FOLDS, 'wb') as file:
        pickle.dump(df_V_train_normalized_folds, file)

©, 2023, Sirris