# CI Portfolio Project 5 - Filter Maintenance Predictor 2022
## **ML Model - Predict Remaining Useful Life (RUL)**

## Objectives

Answer [Business Requirement 1](https://github.com/roeszler/filter-maintenance-predictor/blob/main/README.md#business-requirements) :
*   Fit and evaluate a **regression model** to predict the Remaining Useful Life of a replaceable part
*   Fit and evaluate a **classification model** to predict the Remaining Useful Life of a replaceable part should the regressor not perform well.

## Inputs

Data cleaning:
* outputs/datasets/cleaned/dfCleanTotal.csv

## Outputs

* Train set (features and target)
* Test set (features and target)
* Validation set (features and target)
* ML pipeline to predict RUL
* A map of the labels
* Feature Importance Plot



---

### Change working directory

In [None]:
import os
current_dir = os.getcwd()
current_dir

In [None]:
os.chdir(os.path.dirname(current_dir))
print("Current directory set to new location")

In [None]:
current_dir = os.getcwd()
current_dir

---

## The major steps in this Regressor Pipeline

<details>
<summary style="font-size: 0.9rem;"><strong>1. ML Pipeline: Regressor</strong> (Dropdown List)</summary>

* Create Regressor Pipeline
* Split the train set
* Grid Search CV SKLearn
    * Use standard hyperparameters to find most suitable algorithm
    * Extensive search on most suitable algorithm to find the best hyperparameter configuration
* Assess Feature Performance
* Evaluate Regressor
* Create Train, Test, Validation Sets
</details></br>


<details>
<summary style="font-size: 0.9rem;"><strong>2. ML Pipeline: Regressor + Principal Component Analysis</strong> (PCA)</summary>

* Prepare the Data for the Pipeline
* Create Regressor + PCA Pipeline
* Split the train and validation sets
* Grid Search CV SKLearn
    * Use standard hyperparameters to find most suitable algorithm
    * Do an extensive search on most suitable algorithm to find the best hyperparameter configuration
* Assess Feature Performance
* Evaluate Regressor
* Create Train, Test, Validation Sets
</details></br>

<details>
<summary style="font-size: 0.9rem;"><strong>3. Convert Regression to Classification</strong> (Optionally)</summary>

* Convert numerical target to bins, and check if it is balanced
* Rewrite Pipeline for ML Modelling
* Load Algorithms For Classification
* Split the Train Test sets:
* Grid Search CV SKLearn:
    * Use standard hyperparameters to find most suitable model
    * Grid Search CV
    * Check Result
* Do an extensive search on the most suitable model to find the best hyperparameter configuration.
    * Define Model Parameters
    * Extensive Grid Search CV
    * Check Results
    * Check Best Model
    * Parameters for best model
    * Define the best clf_pipeline
* Assess Feature Importance
* Evaluate Classifier on Train and Test Sets
    * Custom Function
    * List that relates the classes and tenure interval
</details></br>

<details><summary style="font-size: 0.9rem;"><strong>4. Decide which pipeline to use</strong></summary></details></br>

<details>
<summary style="font-size: 0.9rem;"><strong>5. Refit with the best features</strong></summary>

* Rewrite Pipeline
* Split Train Test Set with only best features
* Subset best features
* Grid Search CV SKLearn
* Best Parameters
    * Manually
* Grid Search CV
* Check Results
* Check Best Model
* Define the best pipeline
</details></br>

<details><summary style="font-size: 0.9rem;"><strong>6. Assess Feature Importance</strong></summary></details></br>

<details><summary style="font-size: 0.9rem;"><strong>7. Push Files to Repo</strong></summary></details>

<!-- Modelling:
The hypothesis part of the process where you will find out whether you can answer the question.
* Identify what techniques to use.
* Split your data into train, validate and test sets.
* Build and train the models with the train data set.
* Validate Models and hyper-parameter : Trial different machine learning methods and models with the validation data set.
* Poor Results - return to data preparation for feature engineering
* Successful hypothesis - where the inputs from the data set are mapped to the output target / label appropriately to evaluate.

5. Evaluation:
Where you test whether the model can predict unseen data.
* Test Dataset
* Choose the model that meets the business success criteria best.
* Review and document the work that you have done.
* If your project meets the success metrics you defined with your customer?
- Ready to deploy. -->

---

### Load Cleaned Data
Load the data containing the target variable for the regressor (to be removed for the classifier) and drop the other variables that are not required.

In [None]:
import numpy as np
import pandas as pd

# Data with all negative log_EWM values removed
df_total = pd.read_csv('outputs/datasets/transformed/dfTransformedTotal.csv')
# Same data, minus the columns not required for modelling
df_total_model = df_total.drop(labels=['4point_EWM', 'change_DP', 'change_EWM'], axis=1)
# Pre-built train set (even distribution), minus the columns not required for modelling
df_train_even_dist = (pd.read_csv('outputs/datasets/transformed/dfTransformedTrain.csv')
        .drop(labels=['4point_EWM', 'change_DP', 'change_EWM', 'std_DP', 'median_DP', 'bin_size'], axis=1)
    )
print(df_total.shape, '= df_total')
print(df_total_model.shape, '= df_total_model')
print(df_train_even_dist.shape, '= df_train_even_dist')
df_total

In [None]:
df_total_model

# ML Pipeline : Regressor
## Create Regressor Pipeline
### Set the Transformations
* Smart correlation
* feat_scaling
* feat_selection
* Modelling
* Model as variable

Note: numerical transformation is not required because the data is supplied as integers.
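The smart correlation step listed above groups features whose pairwise Spearman correlation exceeds a threshold and keeps one feature per group. The sketch below only illustrates the detection half of that idea with plain pandas. The toy columns `a`, `b`, `c` and the frame `demo` are hypothetical, and the code is not the `SmartCorrelatedSelection` implementation.

```python
import pandas as pd

# Hypothetical toy data: b is a monotonic function of a, c is only weakly related
demo = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6],
    'b': [1, 4, 9, 16, 25, 36],
    'c': [2, 5, 1, 6, 3, 4],
})

# Absolute Spearman (rank) correlation between every pair of columns
corr = demo.corr(method='spearman').abs()
cols = list(corr.columns)

# Collect each pair above the 0.6 threshold used in this notebook
correlated_pairs = [
    (cols[i], cols[j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if corr.iloc[i, j] > 0.6
]
print(correlated_pairs)
```

Only the pair `a`, `b` is flagged here, because squaring preserves rank order and so gives a Spearman correlation of 1.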

#### ML Pipeline for **Fitting Models** (regression)
Modelling and Hyperparameter Optimization

In [None]:
# # Feature Management
# from sklearn.pipeline import Pipeline
# from feature_engine.encoding import OrdinalEncoder
# from feature_engine.selection import SmartCorrelatedSelection
# from sklearn.preprocessing import StandardScaler, Normalizer
# from sklearn.feature_selection import SelectFromModel

# # ML regression algorithms
# from sklearn.tree import DecisionTreeRegressor
# from xgboost import XGBRegressor
# from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, AdaBoostRegressor, ExtraTreesRegressor
# from sklearn.linear_model import LogisticRegression, LinearRegression

# # # ML classification algorithms
# # from sklearn.tree import DecisionTreeClassifier
# # from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier, AdaBoostClassifier
# # from xgboost import XGBClassifier

# def PipelineOptimization(model):
#     pipeline_base = Pipeline([
#         # ("OrdinalCategoricalEncoder", OrdinalEncoder(encoding_method='arbitrary',
#         #                                              variables=['Differential_pressure', 'Flow_rate',
#         #                                                         # 'log_EWM', 'Time', 'mass_g', 'Tt', 'filter_balance'
#         #                                                         'Dust_feed', 'Dust', 'cumulative_mass_g'])),
#         ('SmartCorrelatedSelection', SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.6, selection_method="variance")),
#         ('feat_scaling', StandardScaler()),
#         ('feat_selection', SelectFromModel(model)),
#         ('model', model),])
#     return pipeline_base

In [None]:
df_total_model

In [None]:
# model = PipelineOptimization(self.models[key])

#### **Custom Class** to fit a set of algorithms, each with its own set of hyperparameters

In [None]:
# from sklearn.model_selection import GridSearchCV


# class HyperparameterOptimizationSearch:

#     def __init__(self, models, params):
#         self.models = models
#         self.params = params
#         self.keys = models.keys()
#         self.grid_searches = {}

#     def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
#         for key in self.keys:
#             model = PipelineOptimization(self.models[key])
#             print(f"\nRunning GridSearchCV for {key} \n")

#             params = self.params[key]
#             gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs,
#                               verbose=verbose, scoring=scoring)
#             gs.fit(X, y)
#             self.grid_searches[key] = gs

#     def score_summary(self, sort_by='mean_score'):
#         def row(key, scores, params):
#             d = {
#                 'estimator': key,
#                 'min_score': min(scores),
#                 'max_score': max(scores),
#                 'mean_score': np.mean(scores),
#                 'std_score': np.std(scores),
#             }
#             return pd.Series({**params, **d})

#         rows = []
#         for k in self.grid_searches:
#             params = self.grid_searches[k].cv_results_['params']
#             scores = []
#             for i in range(self.grid_searches[k].cv):
#                 key = "split{}_test_score".format(i)
#                 r = self.grid_searches[k].cv_results_[key]
#                 scores.append(r.reshape(len(params), 1))

#             all_scores = np.hstack(scores)
#             for p, s in zip(params, all_scores):
#                 rows.append((row(k, s, p)))

#         df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)

#         columns = ['estimator', 'min_score',
#                    'mean_score', 'max_score', 'std_score']
#         columns = columns + [c for c in df.columns if c not in columns]

#         return df[columns], self.grid_searches

## Split the data into Train, Test, Validate

The data is discrete but grouped in bins, so:
#### Define Cleaned **Train** & **Test** Datasets

In [None]:
df_total_model

In [None]:
# Data_No identifies the data bin of each observation
n = df_total_model['Data_No']
# df_train = df_total_model[n < 51].reset_index(drop=True)
# Bins above 50 make up the test set
df_test = df_total_model[n > 50].reset_index(drop=True)
# Use the pre-built, evenly distributed train set
df_train = df_train_even_dist
df_train_model = df_train_even_dist
df_train

In [None]:
df_test

#### Engineer Even Distribution of `Dust` in **Train** dataset

In [None]:
# df_train_copy = df_train.copy()

# bin_sum = df_train_copy.groupby('Data_No')['Data_No'].count().reset_index(name='bin_tot')
# map_bin = df_train_copy['Data_No'].map(bin_sum.set_index('Data_No')['bin_tot'])
# df_train_copy.loc[:, 'bin_size'] = map_bin

# dust_A2 = df_train_copy[df_train_copy['Dust'] == 0.900]
# filter_A2 = dust_A2[dust_A2.Data_No != dust_A2.Data_No.shift(-1)]
# df_train_A2 = filter_A2.sort_values(by='filter_balance', ascending=True)
# df_train_A2['c_sum'] = df_train_A2['bin_size'].cumsum()
# # df_train_A2.head(13).style.hide(['Time', 'Dust_feed', 'Flow_rate', 'Dust', 'mass_g', 'cumulative_mass_g', 'Tt'], axis="columns")

# dust_A3 = df_train_copy[df_train_copy['Dust'] == 1.025]
# filter_A3 = dust_A3[dust_A3.Data_No != dust_A3.Data_No.shift(-1)]
# df_train_A3 = filter_A3.sort_values(by='filter_balance', ascending=True)
# df_train_A3['c_sum'] = df_train_A3['bin_size'].cumsum()
# # dn_fb = df_train_A3.loc[:, 'Data_No'].head(14).sort_values(ascending=True).reset_index(drop=True)
# # df_train_A3.head(14).style.hide(['Time', 'Dust_feed', 'Flow_rate', 'Dust', 'mass_g', 'cumulative_mass_g', 'Tt'], axis="columns")

# A2_bin = df_train_A2['Data_No'].head(9)
# A3_bin = df_train_A3['Data_No'].head(9)

# df_train_cleaned_A1 = df_train_copy[df_train_copy['Dust'] == 1.200]
# df_train_cleaned_A2 = df_train_copy[df_train_copy['Data_No'].isin(A2_bin)]
# df_train_cleaned_A3 = df_train_copy[df_train_copy['Data_No'].isin(A3_bin)]

# df_train_concat = pd.concat([df_train_cleaned_A1, df_train_cleaned_A2, df_train_cleaned_A3], ignore_index = True)
# df_train_model = df_train_concat.sort_values(by='Data_No', ascending=True)

# print('A1 Dust_train :', df_train_cleaned_A1.shape)
# print('A2 Dust_train :', df_train_cleaned_A2.shape)
# print('A3 Dust_train :', df_train_cleaned_A3.shape)
# print('\nTotal df_train :', df_train_model.shape)

#### Determine **Target** and **Independent** Variables and Extract **Validation** Dataset

As discussed in the readme, this data has been supplied pre-split into **train** and **test** within unique **data bins**. 
We extract random observations from the **test** dataset to create a **validation** set, in a 70:30 split.
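As a minimal sketch of the 70:30 split described above, the example below runs `train_test_split` on hypothetical toy data. The names `toy_X`, `toy_y` and the `_demo` suffixes are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 10 observations
toy_X = pd.DataFrame({'feature': np.arange(10)})
toy_y = pd.Series(np.arange(10) * 2, name='target')

# 70% stays in the test set, 30% is held back for validation
X_test_demo, X_validate_demo, y_test_demo, y_validate_demo = train_test_split(
    toy_X, toy_y, test_size=0.30, random_state=0)

print(X_test_demo.shape, X_validate_demo.shape)
```

Fixing `random_state` keeps the split reproducible between runs.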


In [None]:
df_test

In [None]:
# from feature_engine.selection import SmartCorrelatedSelection
# corr_sel = SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.6, selection_method="variance")

# df_engineering = df_test.copy()
# corr_sel.fit_transform(df_engineering)
# print('Correlated Variables :\n', corr_sel.correlated_feature_sets_)
# print('\nFeatures to Drop :\n', corr_sel.features_to_drop_)

Review correlations, Drop Features and Split into **70% test** and **30% validate**. 

In [None]:
from sklearn.model_selection import train_test_split
from feature_engine.selection import SmartCorrelatedSelection

corr_sel = SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.6, selection_method="variance")
df_engineering = df_test.copy()
corr_sel.fit_transform(df_engineering)

X = df_test.drop(corr_sel.features_to_drop_,axis=1)
y = df_test['Differential_pressure']

X_test, X_validate, y_test, y_validate = train_test_split(X,y,test_size=0.30, random_state=0)

print(X_test.shape, 'X_test')
print(X_validate.shape, 'X_validate')
print(y_test.shape, 'y_test')
print(y_validate.shape, 'y_validate')
print('\nFeatures Dropped :\n', corr_sel.features_to_drop_)

In [None]:
X_test

#### Define **X_train**, **y_train** variables

In [None]:
df_train

In [None]:
corr_sel = SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.6, selection_method="variance")
df_engineering = df_train.copy()
corr_sel.fit_transform(df_engineering)

X_train = df_train.drop(corr_sel.features_to_drop_,axis=1)
y_train = df_train['Differential_pressure']

print(X_train.shape, 'X_train')
print(y_train.shape, 'y_train')
print('\nFeatures Dropped :\n', corr_sel.features_to_drop_)

In [None]:
X_train

In [None]:
y_train

## Handling Target Imbalance
### No need to handle target imbalance in this **regression model**.
* Typically we only need to create a single pipeline for a classification or regression task.
* The exception occurs when we need to handle a **classification target imbalance**, which requires more than one model to process.
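Should the target later be converted to classes, balance can be checked by counting the observations per class. The sketch below assumes equal-frequency bins built with `pd.qcut`. The toy `target` values and the choice of 3 bins are illustrative only.

```python
import pandas as pd

# Hypothetical continuous target
target = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90], name='target')

# Equal-frequency binning into 3 classes, labelled 0, 1, 2
classes = pd.qcut(target, q=3, labels=False)

# Share of observations per class; similar shares indicate a balanced target
class_share = classes.value_counts(normalize=True).sort_index()
print(class_share)
```

Equal-frequency bins are balanced by construction, whereas fixed-width bins (for example via `pd.cut`) may not be.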

## Fit the pipeline with Data
* Prepare the data for handling the train set
* ~~Fix the target imbalance~~
* Feature scaling
* Feature selection
* Modelling
* Custom Python class for **hyperparameter optimization**
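The scaling, selection and modelling steps above can be sketched on synthetic data as follows. This is an illustration under stated assumptions rather than the pipeline used in this notebook: the data, the `demo_pipeline` name and the choice of `DecisionTreeRegressor` are hypothetical, and the smart correlation step is left out.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: only the first of four features drives the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

demo_pipeline = Pipeline([
    ('feat_scaling', StandardScaler()),
    # Keeps features whose importance is above the mean importance
    ('feat_selection', SelectFromModel(DecisionTreeRegressor(random_state=0))),
    ('model', DecisionTreeRegressor(random_state=0)),
])
demo_pipeline.fit(X_demo, y_demo)

# Boolean mask of the features that survived selection
selected = demo_pipeline.named_steps['feat_selection'].get_support()
print(selected)
```

Wrapping the steps in one `Pipeline` means the scaler and selector are fitted only on the data passed to `fit`, which keeps cross-validation folds separate.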

### Use standard hyperparameters to find most suitable algorithm for the data

In [None]:
# Feature Management
from sklearn.pipeline import Pipeline
from feature_engine.encoding import OrdinalEncoder
from feature_engine.selection import SmartCorrelatedSelection
from sklearn.preprocessing import StandardScaler, Normalizer
from sklearn.feature_selection import SelectFromModel

# ML regression algorithms
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, AdaBoostRegressor, ExtraTreesRegressor
from sklearn.linear_model import LinearRegression
# from sklearn.linear_model import LogisticRegression

In [None]:
models_quick_search = {
    'LinearRegression': LinearRegression(),
    # 'LogisticRegression': LogisticRegression(),
    'DecisionTreeRegressor': DecisionTreeRegressor(random_state=0),
    'RandomForestRegressor': RandomForestRegressor(random_state=0),
    'ExtraTreesRegressor': ExtraTreesRegressor(random_state=0),
    'AdaBoostRegressor': AdaBoostRegressor(random_state=0),
    'GradientBoostingRegressor': GradientBoostingRegressor(random_state=0),
    'XGBRegressor': XGBRegressor(random_state=0),
}

params_quick_search = {
    'LinearRegression': {},
    # 'LogisticRegression': {},
    'DecisionTreeRegressor': {},
    'RandomForestRegressor': {},
    'ExtraTreesRegressor': {},
    'AdaBoostRegressor': {},
    'GradientBoostingRegressor': {},
    'XGBRegressor': {},
}

#### Fit the pipelines, using the above models with **default hyperparameters**
* Pass in the train set
* Set the performance metric to the R² score (regression, as described in our ML business case)
* Set cross-validation to 5 folds (rule of thumb)
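For reference, the R² score compares the model's squared error with that of always predicting the mean: R² = 1 − SS_res / SS_tot. A minimal sketch with a hypothetical helper `r2_manual` and toy values:

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true_demo = [1.0, 2.0, 3.0, 4.0]
r2_perfect = r2_manual(y_true_demo, y_true_demo)  # predictions equal the truth
r2_mean = r2_manual(y_true_demo, [2.5] * 4)       # always predict the mean
print(r2_perfect, r2_mean)
```

A perfect prediction gives 1 and a constant mean prediction gives 0, so the 0.7 tolerance sits between those two reference points.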

In [None]:
# def PipelineOptimization(model):
#     pipeline_base = Pipeline([
#         ("OrdinalCategoricalEncoder", OrdinalEncoder(encoding_method='arbitrary',
#                                                     variables=['Differential_pressure', 'Flow_rate',
#                                                                 # 'log_EWM', 'Time', 'mass_g', 'Tt', 'filter_balance',
#                                                                 'Dust_feed', 'Dust', 'cumulative_mass_g'])),
#         ('SmartCorrelatedSelection', SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.6, selection_method="variance")),
#         ('feat_scaling', StandardScaler()),
#         ('feat_selection', SelectFromModel(model)),
#         ('model', model),])
#     return pipeline_base

In [None]:
def PipelineOptimization(model):
    pipeline_base = Pipeline([
        # ("OrdinalCategoricalEncoder", OrdinalEncoder(encoding_method='arbitrary',
        #                                              variables=['Differential_pressure', 'Flow_rate',
        #                                                         # 'log_EWM', 'Time', 'mass_g', 'Tt', 'filter_balance',
        #                                                         'Dust_feed', 'Dust', 'cumulative_mass_g'])),
        ("SmartCorrelatedSelection", SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.6, selection_method="variance")),
        ("feat_scaling", StandardScaler()),
        ("feat_selection",  SelectFromModel(model)),
        ("model", model),
    ])
    return pipeline_base

In [None]:
from sklearn.model_selection import GridSearchCV


class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")
            model = PipelineOptimization(self.models[key]) # the model

            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs,
                              verbose=verbose, scoring=scoring)
            gs.fit(X, y)
            self.grid_searches[key] = gs

    def score_summary(self, sort_by='mean_score (R²)'):
        def row(key, scores, params):
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score (R²)': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]
                scores.append(r.reshape(len(params), 1))

            all_scores = np.hstack(scores)
            for p, s in zip(params, all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)

        columns = ['estimator', 'min_score',
                   'mean_score (R²)', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns], self.grid_searches


In [None]:
search = HyperparameterOptimizationSearch(models=models_quick_search, params=params_quick_search)
search.fit(X_train, y_train, scoring='r2', n_jobs=-1, cv=5)

Check results
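The summary is assembled from the per-fold scores that `GridSearchCV` stores in `cv_results_` under keys such as `split0_test_score`. The sketch below shows that structure on synthetic data. The data, the `gs` name and the small `fit_intercept` grid are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Synthetic linear data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

gs = GridSearchCV(LinearRegression(),
                  param_grid={'fit_intercept': [True, False]},
                  cv=5, scoring='r2')
gs.fit(X_demo, y_demo)

# One row per fold, one column per hyperparameter candidate
split_scores = np.vstack(
    [gs.cv_results_[f'split{i}_test_score'] for i in range(5)])
mean_scores = split_scores.mean(axis=0)
print(split_scores.shape, mean_scores)
```

Averaging each column reproduces `cv_results_['mean_test_score']`, which is the quantity reported as the mean score in the summary table.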

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score (R²)')
grid_search_summary

#### Observations

* The average **R² score** (mean_score) ranges from **0.98 to 1**, which is exceptional. R² indicates how well a model fits the data, and a score of 1 represents a perfect fit.
* This is much higher than the **0.7** tolerance we decided on in the business case.
* The best result is **LinearRegression** and/or **ExtraTreesRegressor**, however all algorithms can be confidently used to train a model.
<!-- * We will perform an extensive search to hopefully improve performance. -->
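Scores this close to 1 are worth a quick sanity check that the target column is not also present among the features. The sketch below shows one way to do that. The helper `leaked_target_columns` and the toy frame are hypothetical and are not part of this notebook's pipeline.

```python
import pandas as pd

def leaked_target_columns(X, target_name):
    """Return the feature columns whose name matches the target."""
    return [col for col in X.columns if col == target_name]

# Hypothetical toy frame containing both a feature and the target
toy = pd.DataFrame({'Differential_pressure': [1.0, 2.0],
                    'Flow_rate': [5.0, 6.0]})

X_leaky = toy                                           # target still present
X_clean = toy.drop(columns=['Differential_pressure'])   # target removed

print(leaked_target_columns(X_leaky, 'Differential_pressure'))
print(leaked_target_columns(X_clean, 'Differential_pressure'))
```

An empty list means the feature set does not contain the target by name. Features derived from the target would need a separate check.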

In [None]:
X_train

In [None]:
y_train

In [None]:
df_stop

### Train the Model

Multiple regression and classification models under consideration 

* sklearn.linear_model.**LinearRegression**(*, fit_intercept=True, copy_X=True, n_jobs=None, positive=False)
* sklearn.linear_model.**LogisticRegression**() (a classifier, despite its name)
    * *.predict_proba(X)*
* sklearn.linear_model.**SGDRegressor**()
    * *.SGDClassifier()* is the classification counterpart

The full list of available models, including those under consideration, can be seen in the scikit-learn [linear models](https://scikit-learn.org/stable/modules/linear_model.html#) guide.

* There is no single optimal model. `.LogisticRegression()` seems the most appropriate when the target is treated as discrete classes.
<!-- 
**.LinearRegression()** - Ordinary Least Squares
**.SGDClassifier()** and **.SGDRegressor()** - Stochastic Gradient Descent - SGD
.Ridge() 
.Lasso()
.MultiTaskLasso()
.ElasticNet()
.MultiTaskElasticNet()
.Lars() - Least Angle Regression
.LassoLars()
.OrthogonalMatchingPursuit() and orthogonal_mp()
.BayesianRidge() - Bayesian Regression
.ARDRegression() - Automatic Relevance Determination
Generalized Linear Models
**.LogisticRegression()** + **.predict_proba(X)**
.TweedieRegressor()
.Perceptron()
.PassiveAggressiveClassifier() and .PassiveAggressiveRegressor()
Robustness regression: outliers and modeling errors
.RANSACRegressor()
.TheilSenRegressor() and 
.HuberRegressor()
.QuantileRegressor()
Polynomial regression: extending linear models with basis functions
.PolynomialFeatures() transformer -->


In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train,y_train)

In [None]:
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train,y_train)

In [None]:
from sklearn.linear_model import SGDRegressor
SGDreg = SGDRegressor()
SGDreg.fit(X_train,y_train)

### Predictions and Model Evaluation

In [None]:
from sklearn.metrics import classification_report

prediction = logreg.predict(X_test)
print(classification_report(y_test,prediction))
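
`classification_report` applies to classifiers such as `LogisticRegression`. For the regressors, error-based metrics are a more natural fit. A minimal sketch on hypothetical toy arrays (`y_true_demo` and `y_pred_demo` are illustrative, not notebook outputs):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical true values and predictions
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_demo = np.array([1.0, 2.0, 3.0, 5.0])

mae = mean_absolute_error(y_true_demo, y_pred_demo)           # average absolute error
rmse = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))  # root mean squared error
print(mae, rmse)
```

Both metrics are expressed in the units of the target, which makes them easier to relate to the business case than R² alone.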