# Maintenance of Naval Propulsion Plants Data Set
#### _Predicting Gas Turbine propulsion plant's decay state coefficient_


## Motivation:
In this case study, we build a predictive model of the decay state of rotating equipment.

## Abstract: 

Dataset source: http://archive.ics.uci.edu/ml/datasets/condition+based+maintenance+of+naval+propulsion+plants
Kaggle: https://www.kaggle.com/elikplim/maintenance-of-naval-propulsion-plants-data-set

The data were generated by a sophisticated simulator of a Gas Turbine (GT) mounted on a frigate with a COmbined Diesel eLectric And Gas (CODLAG) propulsion plant.

## Problem Statement: 

The experiments were carried out with a numerical simulator of a naval vessel (frigate) characterized by a Gas Turbine (GT) propulsion plant. The blocks that form the complete simulator (Propeller, Hull, GT, Gear Box and Controller) have been developed and fine-tuned over the years on several similar real propulsion plants. In view of these observations, the available data are in agreement with a possible real vessel.

In this release of the simulator it is also possible to take into account the performance decay over time of the GT components such as GT compressor and turbines.

The propulsion system behaviour is described by these parameters:
- Ship speed (a linear function of the lever position lp).
- Compressor degradation coefficient kMc.
- Turbine degradation coefficient kMt.

Each possible degradation state can therefore be described by the triple (lp, kMt, kMc).

The decay ranges of the compressor and the turbine have been sampled on a uniform grid with a precision of 0.001, so as to give a fine-grained representation.
In particular, the compressor coefficient kMc has been investigated in the domain [1; 0.95], and the turbine coefficient kMt in the domain [1; 0.975].
Ship speed has been investigated by sampling the range of feasible speeds from 3 knots to 27 knots in steps of three knots.
A series of measures (16 features) that indirectly represent the state of the system subject to performance decay has been acquired and stored in the dataset over the parameter space.
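
As a rough check on the sampling described above, the grid can be enumerated directly. This is only a sketch: the variable names are made up for illustration, and ship speed stands in for the lever position lp, of which it is a linear function.

```python
from itertools import product

import numpy as np

# Decay coefficients sampled on a uniform grid with step 0.001.
kmc_values = np.round(np.linspace(0.95, 1.0, 51), 3)    # compressor: 51 values
kmt_values = np.round(np.linspace(0.975, 1.0, 26), 3)   # turbine: 26 values
# Ship speed from 3 to 27 knots in steps of 3 knots.
speed_values = np.arange(3, 28, 3)                      # 9 values

# Every (speed, kMc, kMt) combination is one simulated operating point.
grid = list(product(speed_values, kmc_values, kmt_values))
print(len(kmc_values), len(kmt_values), len(speed_values), len(grid))
# → 51 26 9 11934
```

If every grid point was simulated exactly once, the number of rows in the dataset should equal 51 × 26 × 9 = 11934.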


## Background


##### References:
- https://www.simonwenkel.com/2019/04/19/revisitingML-naval-propulsion.html
- https://www.researchgate.net/publication/245386997_Real-time_simulation_of_a_COGAG_naval_ship_propulsion_system
- https://www.linkedin.com/pulse/gas-turbine-compressor-decay-state-coefficient-john-kingsley/?trackingId=5S5swf3uTqCizwyGWxxSIw%3D%3D

Steps:

1. Data cleaning/Preparation
2. EDA: Data visualization and understanding
3. PCA: Feature selection
4. Model building and hyperparameter tuning with GridSearchCV:
    1. LinearRegression
    2. RandomForestRegressor
    3. KNeighborsRegressor 
    4. DecisionTreeRegressor
    5. BaggingRegressor
    6. XGBRegressor
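
Step 4 can be sketched in isolation as follows. This is a minimal illustration on synthetic data, not the notebook's own run: the toy `X`, `y` and the explicit `scoring` choice are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression problem (assumption: stands in for the naval data).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.normal(size=200)

# 5-fold cross-validated search over the number of neighbours.
search = GridSearchCV(
    estimator=KNeighborsRegressor(),
    param_grid={"n_neighbors": [2, 3, 4, 5, 6]},
    cv=5,
    scoring="neg_mean_squared_error",  # higher (closer to 0) is better
)
search.fit(X, y)

print(search.best_params_, -search.best_score_)
```

Without a `scoring` argument, GridSearchCV falls back to the estimator's own `score` method, which is R² for scikit-learn regressors.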

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler

from sklearn.decomposition import PCA


%matplotlib inline

from sklearn import metrics

from sklearn.metrics import explained_variance_score, mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.inspection import permutation_importance

import warnings
warnings.filterwarnings("ignore")

In [None]:

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
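# `dirname` and `filename` are left over from the os.walk loop above, so this reads the last file listed.
# sep=r"\s+" splits on whitespace and replaces the deprecated delim_whitespace=True.
naval_df = pd.read_csv(os.path.join(dirname, filename), sep=r"\s+", header=None)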
naval_df.head()

In [None]:
naval_df.columns = ['lever_position', 'ship_speed', 'gt_shaft', 'gt_rate', 'gg_rate', 'sp_torque', 'pp_torque', 'hpt_temp', 'gt_c_i_temp', 'gt_c_o_temp', 'hpt_pressure', 'gt_c_i_pressure', 'gt_c_o_pressure', 'gt_exhaust_pressure', 'turbine_inj_control', 'fuel_flow', 'gt_c_decay',  'gt_t_decay']

In [None]:
100*naval_df.isna().sum()/len(naval_df)

In [None]:
naval_df = naval_df.dropna()

In [None]:
# ??
#naval_df = naval_df.drop('gt_c_i_temp', axis=1)

In [None]:
naval_df.head()

In [None]:
naval_df.describe()

In [None]:
naval_df.info()

In [None]:
naval_df.shape

In [None]:
def PrintUniqueLenforallCols(temp_df):
    for col in temp_df:
        print(col, len(temp_df[col].unique()))
        
PrintUniqueLenforallCols(naval_df)

In [None]:
# we can drop gt_c_i_pressure and gt_c_i_temp: each has only 1 unique value, so they carry no information
naval_df = naval_df.drop(['gt_c_i_pressure', 'gt_c_i_temp'], axis=1)
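
The constant columns can also be found programmatically rather than read off the printed counts. The sketch below uses a small hypothetical frame (`toy_df`, with made-up values) in place of `naval_df`.

```python
import pandas as pd

# Hypothetical toy frame standing in for naval_df; the values are illustrative only.
toy_df = pd.DataFrame({
    "gt_c_i_temp": [288.0, 288.0, 288.0],
    "gt_c_i_pressure": [0.998, 0.998, 0.998],
    "fuel_flow": [0.082, 0.287, 0.259],
})

# A column with a single unique value has zero variance and cannot help a model.
constant_cols = [c for c in toy_df.columns if toy_df[c].nunique() == 1]
reduced_df = toy_df.drop(columns=constant_cols)

print(constant_cols)
print(list(reduced_df.columns))
```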

In [None]:
# let's check lever_position and ship_speed
naval_df.lever_position.unique()

In [None]:
naval_df.ship_speed.unique()

In [None]:
naval_df.gt_c_decay.unique()

In [None]:
naval_df.gt_t_decay.unique()

## EDA

Let's look at the target variables: 

In [None]:
# let's look at the target variables: 
plt.figure(figsize=(10, 6))
plt.plot(naval_df.index, naval_df.gt_c_decay,'.-')
plt.xlabel("sampleID")
plt.ylabel("gt_c_decay")
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
plt.plot(naval_df.index, naval_df.gt_t_decay,'.-')
plt.xlabel("sampleID")
plt.ylabel("gt_t_decay")
plt.show()

In [None]:
# Let's check plot for other features
plt.figure(figsize=(22, 20))
icount =1
for col in naval_df.columns:
    plt.subplot(4,4, icount)
    sns.boxplot(y=naval_df[col])  # pass the data as y to get a vertical box plot
    icount = icount+1
plt.show()

We don't see any outliers. Let's now look at the distribution of each variable.

In [None]:
plt.figure(figsize=(20,20))
icount =1
for col in naval_df.columns:
    plt.subplot(4,4, icount)
    sns.histplot(naval_df[col], kde=True)  # histplot replaces the deprecated distplot
    icount +=1
plt.show()

In [None]:
# let's check the pairplot
sns.pairplot(naval_df)
plt.show()

In [None]:
# let's look at pair plot only for continuous variables
sns.pairplot(naval_df[naval_df.columns[2:-2]])
plt.show()

#### From the graph above, there seems to be a linear pattern between the features. Let us check the correlation between the parameters.

In [None]:
plt.figure(figsize=(15,10))
cols = naval_df.corr().index
corr_mat = np.corrcoef(naval_df[cols].values.T)
sns.set(font_scale =1)
hm = sns.heatmap(corr_mat, annot=True, yticklabels = cols.values, xticklabels=cols.values)
plt.show()

As can be seen, many of the features are strongly correlated with one another.
Let's try using RFE or PCA to reduce the feature set.
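
Besides reading the heatmap, the strongly correlated pairs can be listed explicitly. This is a sketch on synthetic data; the column names and the 0.95 threshold are assumptions chosen for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic frame (assumption): "b" is nearly collinear with "a", "c" is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy_df = pd.DataFrame({
    "a": base,
    "b": 2.0 * base + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

corr = toy_df.corr().abs()

# Visit each unordered pair of columns once and keep those above the threshold.
threshold = 0.95
strong_pairs = [
    (x, y, corr.loc[x, y])
    for x, y in combinations(corr.columns, 2)
    if corr.loc[x, y] > threshold
]
print(strong_pairs)
```

Applied to `naval_df`, a long list of such pairs would support reducing the feature set before modelling.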

In [None]:
# First let's split data into X and y

# we have two target variables, so we'll have two sets 
np.random.seed(0)
df_train_navel, df_test_navel = train_test_split(naval_df, train_size = 0.7, test_size=0.3, random_state = 100)


In [None]:
y_train_c = df_train_navel.pop('gt_c_decay')
y_train_t = df_train_navel.pop('gt_t_decay')
X_train = df_train_navel


y_test_c = df_test_navel.pop('gt_c_decay')
y_test_t = df_test_navel.pop('gt_t_decay')
X_test = df_test_navel


In [None]:
X_train.shape

In [None]:
# fit the scaler on the training data only
scaler = StandardScaler()
tr_scaled_features = scaler.fit_transform(X_train.values)
X_train = pd.DataFrame(tr_scaled_features, index=X_train.index, columns=X_train.columns)


# apply the training-set mean and variance to the test data (no refitting)
tt_scaled_features = scaler.transform(X_test.values)
X_test = pd.DataFrame(tt_scaled_features, index=X_test.index, columns=X_test.columns)
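
Scaling statistics are normally learned from the training split and then reused on the test split. One way to enforce this is a scikit-learn `Pipeline`, sketched below on synthetic data. The data, step names and the choice of `LinearRegression` are assumptions for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption: stands in for the naval features and one target).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(300, 6))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0, 3.0]) + rng.normal(scale=0.1, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=100)

# fit() learns the scaler and PCA from the training data only;
# predict()/score() reuse those fitted transforms on new data.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=4, random_state=42)),
    ("model", LinearRegression()),
])
pipe.fit(X_tr, y_tr)
test_r2 = pipe.score(X_te, y_te)
print(test_r2)
```

A pipeline like this can also be passed to `GridSearchCV`, so the scaler and PCA are refitted inside each cross-validation fold.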


# Feature selection: apply PCA on the data

In [None]:
pca = PCA(random_state=42)

In [None]:
pca.fit(X_train)

In [None]:
plt.bar(range(1,len(pca.explained_variance_ratio_)+1), pca.explained_variance_ratio_)

In [None]:
var_cumu = np.cumsum(pca.explained_variance_ratio_)
plt.plot(range(1,len(var_cumu)+1), var_cumu)
plt.grid()

In [None]:
print("no. of Components  Variance accounted")
for i in range(2, 8):
    # var_cumu is 0-indexed: var_cumu[i - 1] is the cumulative variance of the first i components
    s = "      " + str(i) + "             " + str(100 * var_cumu[i - 1])
    print(s)
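
The number of components needed for a given share of variance can also be computed instead of read from the table. This sketch uses synthetic data and a 99% threshold, both assumptions. It also shows that `PCA` accepts a float `n_components` and then picks the count itself.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (assumption): 8 observed features driven by 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 8))
X = latent @ mixing + rng.normal(scale=0.01, size=(500, 8))

pca = PCA(random_state=42).fit(X)
var_cumu = np.cumsum(pca.explained_variance_ratio_)

# var_cumu[k - 1] is the variance captured by the first k components,
# so the first index reaching the threshold, plus one, is the count needed.
n_needed = int(np.searchsorted(var_cumu, 0.99) + 1)

# A float in (0, 1) asks scikit-learn to keep enough components for that share.
pca_auto = PCA(n_components=0.99).fit(X)
print(n_needed, pca_auto.n_components_)
```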

In [None]:
def getPCAMostImportantFeat(model, initial_feature_names):
    # number of components
    n_pcs = model.components_.shape[0]

    # index of the feature with the largest absolute loading on EACH component
    most_important = [np.abs(model.components_[i]).argmax() for i in range(n_pcs)]

    # get the names
    most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]

    # map each component (numbered from PC1, matching cmp_lst) to its dominant feature
    dic = {'PC{}'.format(i + 1): most_important_names[i] for i in range(n_pcs)}

    # build the dataframe
    df = pd.DataFrame(list(dic.items()))
    return df

In [None]:
pca_grid_df = getPCAMostImportantFeat(pca,X_train.columns)
pca_grid_df

In [None]:
pca_4_cpnt = PCA(n_components=4, random_state=42)

In [None]:
navel_pca_data = pca_4_cpnt.fit_transform(X_train)

In [None]:
cmp_lst = ['PC' + str(i) for i in range(1, 5)]

In [None]:
#Create Dataframe
navel_pca_X = pd.DataFrame(navel_pca_data, columns=cmp_lst)
navel_pca_X

In [None]:
navel_pca_X.reset_index(drop=True, inplace=True)

In [None]:
x_pca_cols = pca_grid_df.iloc[:, 1].tolist()

In [None]:
# most important feature after running Logistic Regression
# important_features_lg = pd.Series(lg_coef, index=x_pca_lg_cols)
# important_features_lg.sort_values()[-10:].plot(kind = 'barh')

In [None]:
cmp_lst

In [None]:
# Transform test set
navel_pca_data_test = pca_4_cpnt.transform(X_test)                               
navel_pca_test_X  = pd.DataFrame(navel_pca_data_test, columns=cmp_lst)
#navel_pca_test_X

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor 
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor #Ensemble using averaging method
from xgboost import XGBRegressor #Ensemble using boosting method
from sklearn.ensemble import GradientBoostingRegressor

# Model Building

In [None]:
# the models that you want to compare
models = {'LinearRegression': LinearRegression(),
          'RandomForestRegressor': RandomForestRegressor(),
          'KNeighborsRegressor': KNeighborsRegressor(),
          'DecisionTreeRegressor':DecisionTreeRegressor(),
          'BaggingRegressor' : BaggingRegressor(),
          'XGBRegressor': XGBRegressor()}


# the optimisation parameters for each of the above models
# note: `normalize` was removed from LinearRegression in scikit-learn 1.2, so it is not tuned here
params = {'LinearRegression': [{'fit_intercept':[True,False], 'copy_X':[True, False]}],
          'RandomForestRegressor': [{'n_estimators': [ 50, 60, 80]}],
          'KNeighborsRegressor': [{'n_neighbors': [2,3,4,5,6]}],
          'DecisionTreeRegressor': [{'max_depth': [2,4,6,8,10,12]}],        
          # `estimator` replaced `base_estimator` in scikit-learn 1.2; use `base_estimator` on older versions
          'BaggingRegressor': [{'estimator': [None, GradientBoostingRegressor(), KNeighborsRegressor()],
          'n_estimators': [20,50,100]}],
          'XGBRegressor': [{'n_estimators': [50,500]}]
         }


#models = {'BaggingRegressor' : BaggingRegressor()}
#params = {'BaggingRegressor': [{'base_estimator': [None, KNeighborsRegressor()]}]}

x_pca_cols = pca_grid_df.iloc[:, 1].tolist() 
important_features_list = []


def runregressors(X_train, Y_train, X_test, Y_test):
    """
    fits the list of models to the training data, thereby obtaining in each 
    case an evaluation score after GridSearchCV cross-validation
    """
    i_count = 0
    fig, ax = plt.subplots(nrows=3, ncols=2, figsize = (20, 15))
    
    # Evaluations
    result_name = []
    result_summary1 = []
    result_mae = []
    result_mse = []
    result_exp_var = []
    result_r2_score = []
    result_ac_score = []

    for name in models.keys():
        est = models[name]
        est_params = params[name]
        gscv = GridSearchCV(estimator=est, param_grid=est_params, cv=5) #, verbose=2
        gscv.fit(X_train, Y_train)
        
        msg1 = str(gscv.best_estimator_)
        result_summary1.append(msg1)
        result_name.append(name)
        

        # Evaluate the model
        y_pred = gscv.predict(X_test)
        score = explained_variance_score(Y_test, y_pred)
        mae = mean_absolute_error(Y_test, y_pred)
        mse = mean_squared_error(Y_test, y_pred)
        # for regressors, .score() returns R^2 (not a classification accuracy)
        ascore = gscv.best_estimator_.score(X_test, Y_test)
        r2 = r2_score(Y_test, y_pred)
        msg2 = "%s: %f (%f)" % (name, score*100, mae*100)
        #print(msg2)
        result_mse.append(mse)
        result_mae.append(mae)
        result_exp_var.append(score)
        result_r2_score.append(r2)
        result_ac_score.append(ascore)

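        # Note: the inputs are principal components; x_pca_cols labels each one
        # with the original feature that has the largest absolute loading on it.
        if name == "LinearRegression":
            important_features = pd.Series(gscv.best_estimator_.coef_, index=x_pca_cols[:4])
        elif name == "KNeighborsRegressor":
            # KNN exposes no importances, so use permutation importance
            results = permutation_importance(gscv.best_estimator_, X_train, Y_train, scoring='neg_mean_squared_error')
            important_features = pd.Series(results.importances_mean, index=x_pca_cols[:4])
        elif name == "BaggingRegressor":
            members = gscv.best_estimator_.estimators_
            if all(hasattr(m, 'feature_importances_') for m in members):
                # average the importances of the fitted base estimators
                feature_importances = np.mean([m.feature_importances_ for m in members], axis=0)
            else:
                # e.g. KNeighborsRegressor members: fall back to permutation importance
                feature_importances = permutation_importance(
                    gscv.best_estimator_, X_train, Y_train, scoring='neg_mean_squared_error').importances_mean
            important_features = pd.Series(feature_importances, index=x_pca_cols[:4])
        else:
            # tree-based models expose feature_importances_ directly
            important_features = pd.Series(gscv.best_estimator_.feature_importances_, index=x_pca_cols[:4])
        important_features_list.append(important_features)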
        #important_features.sort_values().plot(kind = 'barh')
        col = i_count%2
        row = i_count//2
        ax[row][col].scatter(Y_test, y_pred)
        ax[row][col].plot([Y_test.min(), Y_test.max()], [Y_test.min(), Y_test.max()], 'k--', lw=2)
        ax[row][col].set_xlabel('Measured')
        ax[row][col].set_ylabel('Predicted')
        ax[row][col].set_title(msg2)
        i_count+=1
            
    plt.show()

    
    result_summary_list = pd.DataFrame({'name': result_name,
                                        'best_estimator': result_summary1,
                                        'R2': result_r2_score,
                                        'MAE': result_mae,
                                        'MSE': result_mse,
                                        'explained variance score': result_exp_var,
                                        'score (R2)': result_ac_score})
    return result_summary_list
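
For scikit-learn regressors, `estimator.score(X, y)` returns the coefficient of determination R², so it reports the same quantity as `r2_score` on the predictions. The sketch below checks this on synthetic data (the data and the train/test slicing are assumptions for the example).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic regression data (assumption: illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -0.5, 2.0]) + rng.normal(scale=0.2, size=100)

# Train on the first 70 rows, evaluate on the remaining 30.
model = LinearRegression().fit(X[:70], y[:70])
score = model.score(X[70:], y[70:])
r2 = r2_score(y[70:], model.predict(X[70:]))

print(np.isclose(score, r2))
```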
        

# Predicting Compressor Decay

In [None]:
result_summary_list = runregressors(navel_pca_X, y_train_c, navel_pca_test_X, y_test_c)

In [None]:
# LinearRegression coefficients can be negative; compare them by magnitude
important_features_list[0] = important_features_list[0].abs()

fig, ax = plt.subplots(nrows=3, ncols=2, figsize = (20, 15))
i_count = 0
nm = result_summary_list.name.to_list()
for imp_fea in important_features_list:
    col = i_count%2
    row = i_count//2
    imp_fea.sort_values().plot(kind = 'barh', ax = ax[row][col] )
    ax[row][col].set_title(nm[i_count])
    i_count+=1
            
plt.show()

In [None]:

result_summary_list

# Predicting Turbine Decay

In [None]:
result_summary_list_t= runregressors(navel_pca_X, y_train_t, navel_pca_test_X, y_test_t)

In [None]:
result_summary_list_t

# Conclusion

**Predicting compressor decay**
*    KNeighborsRegressor, with a score of 84%, seems to be the best model for prediction.


**Predicting turbine decay**
*    This seems to be a lot more challenging. Despite rather good metrics, the predictions fall into clearly visible “categories”. Hence, we can conclude that all models generalize rather poorly.