# PHM North America challenge '23

# Data Exploration

## Problem description: Gear pitting

Gear pitting is a surface fatigue failure of the gear tooth. It occurs when repeated loading of the tooth surface produces contact stress that exceeds the surface fatigue strength of the material. Material in the fatigued region is removed and a pit forms. The pit itself causes a stress concentration, and the pitting soon spreads to adjacent regions until the whole surface is covered [[source](https://gearsmechon.wordpress.com/pitting-of-gears/)].

## Dataset description

The **training** dataset includes measurements under varied operating conditions from a healthy state as well as six known fault levels. The **testing and validation** datasets contain data from eleven health levels. Data from some fault levels and operating conditions are excluded from the training dataset to mirror real-world conditions, where data may only be available for a subset of the full range of operation. The training data cover 15 different rotational speeds and 6 different torque levels. The test and validation data span 18 different rotational speeds and 6 different torque levels.

[[source](https://data.phmsociety.org/phm2023-conference-data-challenge/)]

<img src="https://data.phmsociety.org/wp-content/uploads/sites/9/2023/06/PHM2023dc_fig1.png" alt="PHM 2023 data challenge, figure 1" style="height: 375px; width:800px;"/>

<img src="https://data.phmsociety.org/wp-content/uploads/sites/9/2023/06/PHM2023dc_fig2.png" alt="PHM 2023 data challenge, figure 2" style="height: 300px; width:800px;"/>



In [None]:
%load_ext autoreload
%autoreload 2

import glob
import os
import pickle
import string

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import MinMaxScaler
from tqdm import tqdm
from umap import UMAP

from conscious_engie_icare.data.phm_data_handler import (
    BASE_PATH_HEALTHY, FILE_NAMES_HEALTHY, load_train_data,
    fetch_and_unzip_data, extract_process_parameters, load_data, load_cached_data,
)
from conscious_engie_icare.viz.spectrogram import plot_stft, plot_periodogram, plot_welch

We first load the data and examine the structure of the dataset.
The academic dataset consists of 7 folders (`Pitting_degradation_level_<level>`).

In [None]:
fetch_and_unzip_data()

# Vibration data

First we load the vibration dataset and examine a single vibration entry.
Each measurement contains triaxial time-domain vibration signals (`x`, `y` and `z`) as well as the actual rpm (`tachometer`).

In [None]:
rpm = 100
torque = 500
run = 1
df_example = load_train_data(rpm, torque, run)
print(f"A single sample (rpm={rpm}, torque={torque}, run={run}) has the following shape:")
print(df_example.shape)

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for var, ax in zip(['x', 'y', 'z'], axes):
    ax.plot(df_example[var], label=var)
    ax.set_title(var)
    ax.legend()

## STFT

The vibration sampling frequency is 20480 Hz [[source](https://data.phmsociety.org/phm2023-conference-data-challenge/)].

The short-time Fourier transform (STFT) divides the signal into overlapping segments and calculates a Fourier transform for each segment.
It provides a time-localized view of the signal, which is particularly useful when the frequency components change over the measurement period.
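
To illustrate the idea, here is a minimal sketch that uses `scipy.signal.stft` on a synthetic signal. The signal, the segment length `nperseg=1024` and the tone frequencies are assumptions chosen for illustration; they are not taken from the challenge data, and `plot_stft` may be implemented differently.

```python
import numpy as np
from scipy.signal import stft

fs = 20480  # sampling frequency in Hz, as stated for the challenge data
t = np.arange(2 * fs) / fs  # two seconds of samples

# Synthetic stand-in for one vibration channel (assumption):
# a 500 Hz tone during the first second, a 2000 Hz tone during the second.
signal = np.where(t < 1.0,
                  np.sin(2 * np.pi * 500 * t),
                  np.sin(2 * np.pi * 2000 * t))

nperseg = 1024  # samples per segment (illustrative choice)
freqs, times, Zxx = stft(signal, fs=fs, nperseg=nperseg)
magnitude = np.abs(Zxx)  # shape: (n_frequencies, n_segments)

# Dominant frequency in every segment
dominant = freqs[np.argmax(magnitude, axis=0)]
early = np.median(dominant[times < 0.8])
late = np.median(dominant[times > 1.2])

print(f"frequency resolution: {fs / nperseg:.1f} Hz")
print(f"dominant frequency, early segments: {early:.0f} Hz")
print(f"dominant frequency, late segments:  {late:.0f} Hz")
```

The dominant frequency differs between early and late segments, which a single Fourier transform over the whole signal would not reveal. Longer segments give finer frequency resolution (`fs / nperseg`) at the cost of coarser time resolution.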

In [None]:
plot_stft(df_example, 'z', nperseg=None, fs=20480)

Since we expect the frequency components to stay constant within each measurement, we also check the periodogram below.

## Spectral density estimation

A periodogram is an estimate of the spectral density of a signal [[source](https://en.wikipedia.org/wiki/Periodogram)].
We use Welch's method.
**The primary idea behind Welch's method is to divide the original signal into overlapping segments, calculate the periodogram for each segment, and then average these periodograms to obtain a more stable estimate of the PSD.** This approach helps to reduce the variance and noise inherent in the standard periodogram.
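
The variance reduction can be sketched on synthetic data with `scipy.signal.periodogram` and `scipy.signal.welch`. The test signal (a 1600 Hz tone in white noise) and the noise-only band used for comparison are assumptions for illustration; `nperseg=128` and `fs=20480` match the values used in this notebook, while the internals of `plot_welch` may differ.

```python
import numpy as np
from scipy.signal import periodogram, welch

fs = 20480  # sampling frequency in Hz
rng = np.random.default_rng(0)
t = np.arange(fs) / fs  # one second of samples

# Synthetic signal (assumption): 1600 Hz tone buried in white noise
signal = np.sin(2 * np.pi * 1600 * t) + rng.normal(scale=1.0, size=t.size)

# Standard periodogram: one Fourier transform over the full signal
f_raw, psd_raw = periodogram(signal, fs=fs)

# Welch's method: average the periodograms of overlapping segments
nperseg = 128
f_welch, psd_welch = welch(signal, fs=fs, nperseg=nperseg)

resolution = fs / nperseg  # spacing between frequency bins in Hz
peak_welch = f_welch[np.argmax(psd_welch)]


def relative_spread(freqs, psd, f_min=5000.0, f_max=10000.0):
    """Standard deviation over mean of the PSD in a noise-only band."""
    band = psd[(freqs >= f_min) & (freqs < f_max)]
    return band.std() / band.mean()


spread_raw = relative_spread(f_raw, psd_raw)
spread_welch = relative_spread(f_welch, psd_welch)

print(f"frequency resolution (Welch): {resolution:.0f} Hz")
print(f"peak frequency (Welch): {peak_welch:.0f} Hz")
print(f"relative spread, periodogram: {spread_raw:.2f}")
print(f"relative spread, Welch:       {spread_welch:.2f}")
```

The price of the smoother estimate is frequency resolution: with `nperseg=128` the bins are 160 Hz apart, so closely spaced spectral lines merge. A larger `nperseg` would resolve them at the cost of a noisier estimate.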

In [None]:
rpm = 200
torque = 300
run = 3

df_example = load_train_data(rpm=rpm, torque=torque, run=run)
# Plot the Welch PSD estimate for each vibration axis
for axis in ['x', 'y', 'z']:
    plot_welch(df_example, axis, nperseg=128, fs=20480)
plt.title(f'Measurement {run} @ {rpm} rpm, {torque} Nm');
plt.legend(['x', 'y', 'z'], title='Direction');

## Process parameters

In contrast to the industrial feedwater pump use case, **operating conditions are very stable in the given dataset**.
Therefore, clustering of operating modes based on a separate set of process parameters is not necessary.

In [None]:
data = []
for file_path in FILE_NAMES_HEALTHY:
    # Parse rotational speed (V), torque (N) and sample number from the file name
    v_value, n_value, sample_number = extract_process_parameters(file_path)
    data.append({
        'V': v_value,  # rotational speed (rpm)
        'N': n_value,  # torque
        'SampleNumber': sample_number  # index of the repeated measurement
    })

df_process = pd.DataFrame(data)

print("--- Healthy data (pitting level 0) ---")
print(f"Number of samples: {len(df_process)}")
print(f"Number of unique RPM values: {df_process['V'].nunique()}")
print(f"Number of unique torque values: {df_process['N'].nunique()}")
print(f"Number of unique sample numbers: {df_process['SampleNumber'].nunique()}")
df_process.head()

There are 77 unique combinations of rotational speed and torque in the training dataset.
Each combination has 1-5 samples.

In [None]:
# count the number of samples for each unique combination of RPM and torque
df_runs = df_process.groupby(['V', 'N']).size().reset_index(name='counts')
df_runs.head()
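
As a hedged sketch of how the number of combinations and the range of samples per combination could be read off such a table, the example below applies the same `groupby` pattern to a small made-up DataFrame. `df_toy` and its values are hypothetical stand-ins for `df_process`, not challenge data.

```python
import pandas as pd

# Toy stand-in for df_process; all values are invented for illustration
df_toy = pd.DataFrame({
    "V": [100, 100, 100, 200, 200, 300],  # rotational speed
    "N": [50, 50, 100, 50, 50, 200],      # torque
    "SampleNumber": [1, 2, 1, 1, 2, 1],
})

# One row per (V, N) combination with its number of samples
counts = df_toy.groupby(["V", "N"]).size().reset_index(name="counts")

n_combinations = len(counts)
min_samples = counts["counts"].min()
max_samples = counts["counts"].max()
print(f"{n_combinations} combinations, {min_samples}-{max_samples} samples each")

# Grid view: rows are speeds, columns are torques, 0 marks a missing combination
coverage = (
    counts.pivot(index="V", columns="N", values="counts")
    .fillna(0)
    .astype(int)
)
print(coverage)
```

The grid view makes gaps in the operating range visible at a glance, which is relevant here because some operating conditions are deliberately excluded from the training data.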

©, 2023, Sirris