# Data Analytics and Classification Model for Failure Detection of Wind Turbine from IIoT Data

<div>
<img src="https://user-images.githubusercontent.com/51282928/143174095-7908d4f9-08d6-4c9a-9446-a26f60e80953.png" width="800"/>
</div>

[Image Source](https://www.windpowermonthly.com/article/1594597/windtech-new-hybrid-gearbox-splits-loads-scalability)

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [1]:
import matplotlib.pyplot as plt
import missingno as msno
import seaborn as sns

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split, cross_validate, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import plot_confusion_matrix, classification_report
from lightgbm import LGBMClassifier

%config InlineBackend.figure_format = 'retina'

# 1. Read data

We have three datasets:

* `scada_data.csv`: more than 60 measurements (or status readings) of wind turbine components recorded by the SCADA system
* `fault_data.csv`: wind turbine fault types (or modes)
* `status_data.csv`: descriptions of the wind turbine's operational status

In [1]:
scada_df = pd.read_csv('../input/iiot-data-of-wind-turbine/scada_data.csv')
scada_df['DateTime'] = pd.to_datetime(scada_df['DateTime'])
# scada_df.set_index('DateTime', inplace=True)

scada_df

In [1]:
status_df = pd.read_csv('../input/iiot-data-of-wind-turbine/status_data.csv')
status_df['Time'] = pd.to_datetime(status_df['Time'])
status_df.rename(columns={'Time': 'DateTime'}, inplace=True)
# status_df.set_index('DateTime', inplace=True)

status_df

In [1]:
fault_df = pd.read_csv('../input/iiot-data-of-wind-turbine/fault_data.csv')
fault_df['DateTime'] = pd.to_datetime(fault_df['DateTime'])
# fault_df.set_index('DateTime', inplace=True)

fault_df

In the fault data, there are 5 types of faults, or fault modes:

* gf: generator heating fault
* mf: mains failure fault
* ff: feeding fault
* af: air cooling fault
* ef: excitation fault

I don't know exactly what these are. The fault modes come from this [GitHub repository](https://github.com/lkev/wt-fdd).

In [1]:
fault_df.Fault.unique()

# 2. Time series analysis

The three datasets cover different time spans. The status data has the longest record, from January 2014 to December 2015. The SCADA data has the shortest, from April 2014 to April 2015. Therefore, when inspecting the SCADA records, we can refer to the status and fault data to see what happened on the turbine at a given timestamp.
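A quick way to check such spans is to print the first and last timestamp of each table. The sketch below uses small made-up frames as stand-ins for `scada_df`, `fault_df`, and `status_df`; the dates and the `time_span` helper are illustrative assumptions, not values or code from the dataset.

```python
import pandas as pd

# Toy stand-ins for the three tables (dates are made up for illustration)
frames = {
    'scada': pd.DataFrame({'DateTime': pd.to_datetime(['2014-04-10', '2015-04-05'])}),
    'fault': pd.DataFrame({'DateTime': pd.to_datetime(['2014-02-01', '2015-09-01'])}),
    'status': pd.DataFrame({'DateTime': pd.to_datetime(['2014-01-02', '2015-12-30'])}),
}

def time_span(df, col='DateTime'):
    """Return (first timestamp, last timestamp, duration) of a datetime column."""
    start, end = df[col].min(), df[col].max()
    return start, end, end - start

spans = {name: time_span(df) for name, df in frames.items()}
for name, (start, end, duration) in spans.items():
    print(f'{name:>6}: {start.date()} -> {end.date()} ({duration.days} days)')
```

With the real dataframes, the same helper can be called directly on `scada_df`, `fault_df`, and `status_df`.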

In [1]:
# Plot time span of all data
t_scada = scada_df.DateTime
t_fault = fault_df.DateTime
t_status = status_df.DateTime

plt.figure(figsize=(10,4))
plt.plot(t_scada, np.full(len(scada_df), 1), label='scada data')
plt.plot(t_fault, np.full(len(fault_df), 2), label='fault data')
plt.plot(t_status, np.full(len(status_df), 3), label='status data')
plt.legend(loc='lower right')
plt.ylim(0,4)

In [1]:
# Plot of max power from SCADA data
scada_df.plot(x='DateTime', y='WEC: max. Power', figsize=(15,4))

In [1]:
# Plot of max power on weekly resampled data
y = 'WEC: max. Power'
scada_df.resample('W', on='DateTime').mean().plot(y=y, figsize=(15,4))

There were times when power dropped, for example in October 2014, December 2014, and most significantly in January 2015.

In [1]:
# Plot of power production on daily resampled data
y = 'WEC: Production kWh'
scada_df.resample('D', on='DateTime').mean().plot(y=y, figsize=(15,4))

The number of wind turbine faults increases significantly in October 2014.

In [1]:
# Plot of number of faults on monthly resampled data
fault_df.resample('M', on='DateTime').Fault.count().plot.bar()

In [1]:
fault_df.resample('M', on='DateTime').Fault.value_counts()

Let's plot the faults grouped by fault mode. There are many EF events in October and November 2014, and many FF events from October 2014 to January 2015.
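To see how a table of monthly counts per fault mode is formed, here is a minimal sketch on a made-up fault log. It groups by calendar month with `dt.to_period('M')`, which is one possible alternative to `resample`; the `toy_faults` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy fault log: the timestamps and labels are made up for illustration
toy_faults = pd.DataFrame({
    'DateTime': pd.to_datetime(['2014-10-03', '2014-10-17', '2014-10-21',
                                '2014-11-02', '2014-11-25']),
    'Fault': ['EF', 'EF', 'FF', 'FF', 'EF'],
})

# Group by calendar month and fault mode, then pivot the modes into columns
monthly = (toy_faults
           .groupby([toy_faults['DateTime'].dt.to_period('M'), 'Fault'])
           .size()
           .unstack(fill_value=0))
print(monthly)
```

Each row is a month and each column a fault mode, which is the shape a stacked bar plot expects.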

In [1]:
def line_format(label):
    """
    Convert time label to the format of pandas line plot
    """
    month = label.month_name()[:3]
    if month == 'Jan':
        month += f'\n{label.year}'
    return month

In [1]:
c = ['red', 'orange', 'green', 'blue', 'violet']
fault_df.resample('M', on='DateTime').Fault.value_counts().unstack().plot.bar(stacked=True, width=0.8, figsize=(10,5), color=c, rot=45,
                                                                              title='Wind Turbine Faults', ylabel='Fault Counts')

# 3. Combine SCADA and faults data

We combine the SCADA and fault data to pair each measurement with its associated fault.

In [1]:
# Combine scada and fault data
df_combine = scada_df.merge(fault_df, on='Time', how='outer')
msno.matrix(df_combine)

There are lots of NaNs, that is, SCADA timestamps with no matching fault timestamp, simply because no fault happened at those times. We replace these NaNs with "NF".

**NF is No Fault (normal condition)**
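The merge-and-label step can be illustrated on two tiny made-up tables. In this sketch, `toy_scada` and `toy_fault` are hypothetical stand-ins, and `indicator=True` is added only to show where each row came from; it is not part of the notebook's own merge.

```python
import pandas as pd

# Toy tables: 'Time' plays the role of the shared timestamp key (values made up)
toy_scada = pd.DataFrame({'Time': [100, 200, 300, 400],
                          'Spinner temp.': [21.0, 22.5, 19.0, 25.0]})
toy_fault = pd.DataFrame({'Time': [200, 400], 'Fault': ['FF', 'EF']})

# Outer merge keeps every row from both sides; indicator=True records the origin
merged = toy_scada.merge(toy_fault, on='Time', how='outer', indicator=True)

# SCADA rows without a matching fault get NaN in 'Fault'; label them as 'NF'
merged['Fault'] = merged['Fault'].fillna('NF')
print(merged.sort_values('Time'))
```

Rows marked `left_only` in the `_merge` column are the SCADA records that had no fault and end up labeled `NF`.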

In [1]:
# Label records that have no fault (NaN) as 'NF' (no fault)
df_combine['Fault'] = df_combine['Fault'].fillna('NF')

df_combine

# 4. Exploratory Data Analysis

Print the averages of SCADA values grouped by fault modes.

In [1]:
# Suppress scientific notations
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# Group by fault and average the numeric columns only
df_summary = df_combine.groupby('Fault').mean(numeric_only=True).T
df_summary.tail(20)

Looking at the averages above, we can identify anomalous behavior in the fault modes:

* MF has lower average, min, and max active reactive power than No Fault (NF)
* EF has higher average, min, and max active reactive power than No Fault (NF)
* GF has ZERO average, min, and max active reactive power
* FF and MF have higher nacelle cable twisting than NF
* AF and GF have negative nacelle cable twisting
* AF and MF have lower production
* All faults have a higher blade angle; the highest is FF
* GF in general has the lowest temperature in ALL components (cabinet temp, T spinner, T front bearing, ..., T transformer)
* The other faults (FF, AF, MF, EF) have higher temperatures
* EF temperature is highest in cabinet, pitch, rotor, stator, ambient, control, tower, and transformer
* AF temperature is highest in spinner, front bearing, rear bearing, nacelle, main carrier, rectifier, yaw, and fan inverter

The boxplots of temperatures (at the spinner, bearing, nacelle, and fan inverter) show that during GF the temperatures are anomalously lower than in normal condition, whereas during AF and EF they are higher than normal.

In [1]:
# Boxplots of temperature
f, axes = plt.subplots(nrows=2, ncols=2,figsize=(10,8))

sns.boxplot(x='Fault', y='Spinner temp.', data=df_combine, ax=axes[0][0])
axes[0][0].set_title('Spinner Temperature')
sns.boxplot(x='Fault', y='Rear bearing temp.', data=df_combine, ax=axes[0][1])
axes[0][1].set_title('Rear Bearing Temperature')
sns.boxplot(x='Fault', y='Nacelle temp.', data=df_combine, ax=axes[1][0])
axes[1][0].set_title('Nacelle Temperature')
sns.boxplot(x='Fault', y='Fan inverter cabinet temp.', data=df_combine, ax=axes[1][1])
axes[1][1].set_title('Fan Inverter Temperature')

plt.tight_layout()

The boxplot of reactive power (power from the generator?) shows that the power during EF is anomalously high, while the power during MF is lower than in normal condition.

In [1]:
sns.catplot(data=df_combine, x='Fault', y='WEC: ava. reactive Power', kind='box')

The boxplot of nacelle position and cable twisting shows that during AF, nacelle position is negative (up to -500) while during MF, FF, and EF, nacelle position is positive (up to +500).

In [1]:
sns.catplot(data=df_combine, x='Fault', y='WEC: ava. Nacel position including cable twisting', kind='box')

The boxplot of operating hours shows that during MF and AF the operating hours are shorter than in normal condition, whereas during FF they are longer than normal.

In [1]:
sns.catplot(data=df_combine, x='Fault', y='WEC: Operating Hours', kind='box')

# 5. Data preparation for ML

There are far more NF (normal condition) records than faulty records, so the dataset is imbalanced. We will sample the No Fault dataframe and keep only 300 records.

In [1]:
df_combine.Fault.value_counts()

In [1]:
# Pick 300 samples of NF (No Fault) mode data
df_nf = df_combine[df_combine.Fault=='NF'].sample(300, random_state=42)

df_nf

In [1]:
# With fault mode data
df_f = df_combine[df_combine.Fault!='NF']

df_f

In [1]:
# Combine no fault and faulty dataframes
df_combine = pd.concat((df_nf, df_f), axis=0).reset_index(drop=True)

df_combine

To prepare the training dataset, we **drop irrelevant features**. First we drop the datetime, time, and error columns. Next, we drop features that are de facto outputs of the wind turbine, such as power from wind, operating hours, and kWh production. Climatic variables such as wind speed are also dropped because they are not useful.

In [1]:
# Drop irrelevant features
train_df = df_combine.drop(columns=['DateTime_x', 'Time', 'Error', 'WEC: ava. windspeed', 
                                    'WEC: ava. available P from wind',
                                    'WEC: ava. available P technical reasons',
                                    'WEC: ava. Available P force majeure reasons',
                                    'WEC: ava. Available P force external reasons',
                                    'WEC: max. windspeed', 'WEC: min. windspeed', 
                                    'WEC: Operating Hours', 'WEC: Production kWh',
                                    'WEC: Production minutes', 'DateTime_y'])

train_df

In [1]:
# Imbalanced fault modes
train_df.Fault.value_counts().plot.pie(title='Fault Modes')

# 6. Machine learning - fault modes classification

We are going to build a predictive model that classifies wind turbine fault modes based on the information or status of wind turbine components (gear box, tower, nacelle, bearing, etc.) from the SCADA system. This is a multiclass classification task.

Because our training data is highly imbalanced across fault modes, we use **SMOTE** (Synthetic Minority Oversampling Technique) to oversample the minority classes. The classifier is **LightGBM** (Light Gradient Boosting Machine). To check for overfitting, we use **Stratified K-Fold** cross-validation with 5 folds. **Multiple scoring metrics** are used: accuracy, macro-averaged precision, macro recall, and macro F1 score.
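Macro averaging computes a metric for each class separately and then takes the unweighted mean, so a rare fault mode counts as much as a common one. The following plain-Python sketch shows this for F1 on made-up labels; the helper `per_class_f1` is a hypothetical name used for illustration, not part of scikit-learn.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for each label."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

# Made-up labels for illustration only
y_true = ['NF', 'NF', 'NF', 'NF', 'FF', 'FF', 'EF', 'EF']
y_pred = ['NF', 'NF', 'NF', 'NF', 'FF', 'EF', 'FF', 'EF']

f1 = per_class_f1(y_true, y_pred)
macro_f1 = sum(f1.values()) / len(f1)   # unweighted mean over classes
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f1, round(macro_f1, 3), accuracy)
```

In this toy case the accuracy (0.75) is higher than the macro F1 (about 0.667) because the majority class is predicted perfectly while the two minority classes are not.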

**NOTE.** During cross-validation, the train set is split into a training part and a validation part in each fold. SMOTE must balance only the training part, so it should be placed inside the **pipeline**. If SMOTE is applied outside, before the split, synthetic samples leak into the validation folds and the scores will be unfairly optimistic (see this [article](https://towardsdatascience.com/the-right-way-of-using-smote-with-cross-validation-92a8d09d00c7)).
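The idea can be sketched with NumPy alone. The function `smote_like` below is a simplified stand-in for SMOTE (it interpolates toward the single nearest minority neighbour, whereas the real algorithm picks among several neighbours), and the data and the hand-made fold are made up. The point is only that oversampling touches the training part of the fold and leaves the validation part untouched.

```python
import numpy as np

rng = np.random.default_rng(42)

def smote_like(X_min, n_new, rng):
    """Create n_new synthetic points by interpolating between a minority
    sample and its nearest minority neighbour (simplified: k=1)."""
    # Pairwise distances within the minority class
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)        # a point is not its own neighbour
    nearest = dists.argmin(axis=1)
    base = rng.integers(0, len(X_min), size=n_new)
    gap = rng.random((n_new, 1))           # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[nearest[base]] - X_min[base])

# Made-up data: 20 majority samples (label 0) and 6 minority samples (label 1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (6, 2))])
y = np.array([0] * 20 + [1] * 6)

# One hand-made fold: hold out a few samples for validation
val_idx = np.array([0, 1, 2, 3, 20, 21])
train_idx = np.setdiff1d(np.arange(len(y)), val_idx)
X_tr, y_tr = X[train_idx], y[train_idx]

# Oversample the minority class in the training part only
n_new = int((y_tr == 0).sum() - (y_tr == 1).sum())
X_syn = smote_like(X_tr[y_tr == 1], n_new, rng)
X_bal = np.vstack([X_tr, X_syn])
y_bal = np.concatenate([y_tr, np.ones(n_new, dtype=int)])

print('train after oversampling:', np.bincount(y_bal))
print('validation (untouched):  ', np.bincount(y[val_idx]))
```

Putting `SMOTE()` as a step of an imbalanced-learn pipeline achieves the same separation automatically for every fold.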

In [1]:
# Feature and target
# X = df_combine.iloc[:,3:-2]
# y = df_combine.iloc[:,-1]
X = train_df.iloc[:,:-1]
y = train_df.iloc[:,-1]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Make pipeline of SMOTE, scaling, and classifier (seed SMOTE too, for reproducible scores)
pipe = make_pipeline(SMOTE(random_state=42), StandardScaler(), LGBMClassifier(random_state=42))

# Define multiple scoring metrics
scoring = {
    'acc': 'accuracy',
    'prec_macro': 'precision_macro',
    'rec_macro': 'recall_macro',
    'f1_macro': 'f1_macro'
}

# Stratified K-Fold 
stratkfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Return a dictionary of all scorings
cv_scores = cross_validate(pipe, X_train, y_train, cv=stratkfold, scoring=scoring)

In [1]:
# Print scoring results from dictionary
for metric_name, metric_value in cv_scores.items():
    mean = np.mean(metric_value)
    print(f'{metric_name}: {np.round(metric_value, 4)}, Mean: {np.round(mean, 4)}')

**The macro-averaged precision, recall, and F1 score are all about 65%.**

In [1]:
# Fit pipeline to train set
pipe.fit(X_train, y_train)

# Predict on test set
y_pred = pipe.predict(X_test)

In [1]:
# Confusion matrix of test set
plot_confusion_matrix(pipe, X_test, y_test, values_format='.5g') 
plt.show()

The two problematic (frequently misclassified) classes here are FF and EF: 30 EF records are predicted as FF, and 26 FF records are predicted as EF.
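Reading the largest off-diagonal cells of a confusion matrix can also be done programmatically. Below is a minimal sketch using `pd.crosstab` on made-up labels; the label values and the `confusions` dictionary are assumptions for illustration and do not reproduce the counts reported here.

```python
import pandas as pd

# Made-up labels for illustration only
y_true = pd.Series(['EF', 'EF', 'EF', 'FF', 'FF', 'NF'], name='True')
y_pred = pd.Series(['EF', 'FF', 'FF', 'FF', 'EF', 'NF'], name='Predicted')

# Rows are true classes, columns are predicted classes
cm = pd.crosstab(y_true, y_pred)

# Collect the off-diagonal cells, i.e. the misclassifications
confusions = {(t, p): int(cm.loc[t, p])
              for t in cm.index for p in cm.columns if t != p}

# The most frequent confusion as a (true class, predicted class) pair
worst_pair = max(confusions, key=confusions.get)
print(cm)
print('most frequent confusion:', worst_pair, confusions[worst_pair])
```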

In [1]:
# Classification report
print(classification_report(y_test, y_pred))

# 7. Hyperparameter tuning

Our previous LightGBM model achieved 65% precision, recall, and F1, but two problematic classes were often misclassified. We will try to improve the model with hyperparameter tuning, running a grid search over **4 hyperparameters** and optimizing the macro **F1 score**.
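It helps to know how much work a grid search implies before running it. The sketch below enumerates the grid used in this section with `itertools.product` and counts the cross-validation fits, assuming the same 5 folds as before. The `lgbmclassifier__` prefix is the step name that `make_pipeline` derives from the lowercased class name.

```python
from itertools import product

# Same grid as searched in this section
param_grid = {
    'lgbmclassifier__n_estimators': [6, 8, 16, 24],
    'lgbmclassifier__num_leaves': [4, 6, 8],
    'lgbmclassifier__reg_alpha': [1, 1.5],
    'lgbmclassifier__reg_lambda': [1, 1.5],
    'lgbmclassifier__boosting_type': ['gbdt'],
}
n_folds = 5

# Every combination of values is one candidate model
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]
n_fits = len(candidates) * n_folds
print(len(candidates), n_fits)  # 48 candidates, 240 cross-validation fits
```

By default `GridSearchCV` also refits the best candidate once on the whole training set after the search.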

In [1]:
# Define parameter search grid
param_grid = {'lgbmclassifier__n_estimators': [6, 8, 16, 24], 
              'lgbmclassifier__num_leaves': [4, 6, 8],
              'lgbmclassifier__reg_alpha' : [1, 1.5],
              'lgbmclassifier__reg_lambda': [1, 1.5],
              'lgbmclassifier__boosting_type': ['gbdt'] # Gradient Boosting Decision Tree
             }

# Grid search CV
grid = GridSearchCV(pipe, param_grid, verbose=1, cv=stratkfold, n_jobs=-1, scoring='f1_macro')

# Fit grid on train set
grid.fit(X_train, y_train)

Tuning **improved the F1 score to 70%** with the following hyperparameters.

In [1]:
# Best model from tuning
print(grid.best_params_)
print(f'Average of Macro F1: {grid.best_score_}')

With the tuned LightGBM model, we reduced the misclassifications: the number of EF records falsely predicted as FF dropped from 30 to only 8.

In [1]:
# Confusion matrix of test set
plot_confusion_matrix(grid, X_test, y_test, values_format='.5g') 
plt.show()

Comparing this classification report with the previous one, the F1 score of the EF class improved from 46% to 69%, and the FF class improved slightly from 62% to 68%.

In [1]:
# Classification report
y_pred = grid.predict(X_test)

print(classification_report(y_test, y_pred))

Despite this improvement, some issues remain and are recommended for **future improvements of this work**:
* Even though there is improvement, 26 FF records are still falsely predicted as EF
* The scores of MF and AF are very low: 13% and 41%

# 8. Conclusion

We analyzed data from the Supervisory Control and Data Acquisition (SCADA) system of a wind turbine from April 2014 until April 2015, together with the associated faults that occurred during operation. The SCADA system provides more than 60 measurements of wind turbine components such as the nacelle, inverter, bearing, and so on. There were 5 fault modes, labeled FF, AF, EF, MF, and GF. We found that the number of faults increases significantly in October 2014.

During faulty times, the wind turbine components behave anomalously. For example, during GF, the temperatures of all wind turbine components are lower than during normal conditions, whereas temperatures during AF and EF are higher than normal. The reactive power is anomalously high during EF and lower during MF. Therefore, we could classify fault modes based on the various SCADA measurements.

We built a LightGBM (gradient boosting) predictive model to classify fault modes. The data is highly imbalanced among fault modes, so SMOTE was applied inside a pipeline evaluated with 5-fold cross-validation. This first attempt achieved a macro F1 score of 64-65%. Two problematic classes, EF and FF, were frequently misclassified. To correct this and improve the model, we tuned 4 LightGBM hyperparameters. After tuning, the macro F1 score improved to 69-70%, and the number of misclassified EF records was reduced.

The individual F1 scores of AF and MF are still low, so further improvement of this work is recommended and will be appreciated.

**Thank you!**