# CI Portfolio Project 5 - Filter Maintenance Predictor 2022
## **ML Model - Predict Remaining Useful Life (RUL)**

## Objectives

Answer [Business Requirement 1](https://github.com/roeszler/filter-maintenance-predictor/blob/main/README.md#business-requirements) :
*   Fit and evaluate a **regression model** to predict the Remaining Useful Life of a replaceable part
*   Fit and evaluate a **classification model** to predict the Remaining Useful Life of a replaceable part should the regressor not perform well.

## Inputs

Data cleaning:
* outputs/datasets/cleaned/dfCleanTotal.csv

## Outputs

* Train set (features and target)
* Test set (features and target)
* Validation set (features and target)
* ML pipeline to predict RUL
* A map of the labels
* Feature Importance Plot



---

### Change working directory

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/churnometer/jupyter_notebooks'

In [2]:
os.chdir(os.path.dirname(current_dir))
print("Current directory set to new location")

Current directory set to new location


In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/churnometer'

---

## The major steps in this Regressor Pipeline

1. **ML Pipeline: Regressor**
    * Create Regressor Pipeline
    * Split the train set
    * Grid Search CV SKLearn
        * Use standard hyperparameters to find most suitable algorithm
        * Extensive search on most suitable algorithm to find the best hyperparameter configuration
    * Assess Feature Performance
    * Evaluate Regressor
    * Create Train, Test, Validation Sets

2. **ML Pipeline: Regressor + Principal Component Analysis (PCA)**
    * Prepare the Data for the Pipeline
    * Create Regressor + PCA Pipeline
    * Split the train and validation sets
    * Grid Search CV SKLearn
        * Use standard hyperparameters to find most suitable algorithm
        * Do an extensive search on most suitable algorithm to find the best hyperparameter configuration
    * Assess Feature Performance
    * Evaluate Regressor
    * Create Train, Test, Validation Sets

_Optionally_

3. **Convert Regression to Classification**
    * Convert numerical target to bins, and check if it is balanced
    * Rewrite Pipeline for ML Modelling
    * Load Algorithms For Classification
    * Split the Train Test sets:
    * Grid Search CV SKLearn:
        * Use standard hyperparameters to find most suitable model
        * Grid Search CV
        * Check Result
    * Do an extensive search on the most suitable model to find the best hyperparameter configuration.
        * Define Model Parameters
        * Extensive Grid Search CV
        * Check Results
        * Check Best Model
        * Parameters for best model
        * Define the best clf_pipeline
    * Assess Feature Importance
    * Evaluate Classifier on Train and Test Sets
        * Custom Function
        * List that relates the classes and tenure interval

4. **Decide which pipeline to use**

5. **Refit with the best features**
    * Rewrite Pipeline
    * Split Train Test Set with only best features
    * Subset best features
    * Grid Search CV SKLearn
    * Best Parameters
        * Manually
    * Grid Search CV
    * Check Results
    * Check Best Model
    * Define the best pipeline

6. **Assess Feature Importance**

7. **Push Files to Repo**

<!-- Modelling:
The hypothesis part of the process where you will find out whether you can answer the question.
* Identify what techniques to use.
* Split your data into train, validate and test sets.
* Build and train the models with the train data set.
* Validate Models and hyper-parameter : Trial different machine learning methods and models with the validation data set.
* Poor Results - return to data preparation for feature engineering
* Successful hypothesis - where the inputs from the data set are mapped to the output target / label appropriately to evaluate.

5. Evaluation:
Where you test whether the model can predict unseen data.
* Test Dataset
* Choose the model that meets the business success criteria best.
* Review and document the work that you have done.
* If your project meets the success metrics you defined with your customer?
- Ready to deploy. -->

---

### Load Cleaned Data
The pipeline should handle the cleaning and engineering by itself

In [4]:
# import numpy as np
# import pandas as pd
# import os
# os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
# ! chmod 600 kaggle.json

# KaggleDatasetPath = 'prognosticshse/preventive-to-predicitve-maintenance'
# DestinationFolder = 'inputs/datasets/raw'   
# ! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

# ! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
#   && rm {DestinationFolder}/*.zip \
#   && rm {DestinationFolder}/*.pdf \
#   && rm {DestinationFolder}/*.mat \
# #   && rm kaggle.json

# df_test = pd.read_csv(f'inputs/datasets/raw/Test_Data_CSV.csv')
# df_train = pd.read_csv(f'inputs/datasets/raw/Train_Data_CSV.csv')
# # df_total = pd.read_csv(f'outputs/datasets/cleaned/dfCleanTotal.csv')
# df_total = pd.concat([df_train, df_test], ignore_index=True)

In [5]:
import numpy as np
import pandas as pd
df_total = pd.read_csv(f'outputs/datasets/cleaned/dfCleanTotal.csv')
# df_total = pd.read_csv(f'outputs/datasets/transformed/dfTransformedTotal.csv') # data with negative log_EWM values removed
# df_total = (pd.read_csv("outputs/datasets/collection/PredictiveMaintenanceTotal.csv").drop(labels=['customerID', 'TotalCharges', 'Churn'], axis=1))
print(df_total.shape)
# df_total

(78834, 7)


### Include 4 Point Exponentially Weighted Mean and Remove Negative Values

**Note:** Calculating `log_EWM` is expected to raise a divide-by-zero runtime warning. The log values are only used to identify which rows to delete, so the warning is temporarily suppressed for this calculation.

In [6]:
# Create Log EWM to identify values to remove
bins = []
for n in df_total['Data_No'].unique():
    # Work on a copy of each bin so the EWM restarts for every filter
    df_bin = df_total[df_total['Data_No'] == n].copy()

    ewm_calc = df_bin['Differential_pressure'].ewm(span=4, adjust=False).mean()
    df_bin.insert(loc=2, column='4point_EWM', value=ewm_calc)

    # log of zero or negative values raises a RuntimeWarning; suppress it locally
    with np.errstate(divide='ignore', invalid='ignore'):
        log_ewm = np.log(ewm_calc)
    df_bin.insert(loc=3, column='log_EWM', value=log_ewm)

    bins.append(df_bin)
df_total = pd.concat(bins, ignore_index=True)

# Delete rows where log_EWM is negative, -inf or NaN
df_total = df_total[df_total['log_EWM'].ge(0)].reset_index(drop=True)

# Delete engineered values
del df_total['log_EWM']
# del df_total['4point_EWM']

# df_total.loc[391:397]
# print(df_total.shape)
df_total

Unnamed: 0,Data_No,Differential_pressure,4point_EWM,Flow_rate,Time,Dust_feed,Dust,RUL
0,1,1.537182,1.046296,54.143527,5.5,236.428943,"ISO 12103-1, A3 Medium Test Dust",
1,1,1.537182,1.242651,54.518255,5.6,236.428943,"ISO 12103-1, A3 Medium Test Dust",
2,1,1.537182,1.360463,54.658781,5.7,236.428943,"ISO 12103-1, A3 Medium Test Dust",
3,1,3.345631,2.154530,54.780562,5.8,236.428943,"ISO 12103-1, A3 Medium Test Dust",
4,1,5.244502,3.390519,54.574466,5.9,236.428943,"ISO 12103-1, A3 Medium Test Dust",
...,...,...,...,...,...,...,...,...
69681,100,465.494800,457.888170,82.675521,52.0,316.985065,"ISO 12103-1, A4 Coarse Test Dust",8.2
69682,100,464.228900,460.424462,82.421873,52.1,316.985065,"ISO 12103-1, A4 Coarse Test Dust",8.1
69683,100,466.037300,462.669597,82.743156,52.2,316.985065,"ISO 12103-1, A4 Coarse Test Dust",8.0
69684,100,472.276500,466.512358,82.785427,52.3,316.985065,"ISO 12103-1, A4 Coarse Test Dust",7.9
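
With `span=4` and `adjust=False`, pandas computes the exponentially weighted mean recursively, using a smoothing factor of 2 / (span + 1) = 0.4. The sketch below is a plain-Python illustration of that recursion, not the notebook's own code: `ewm_mean` is a hypothetical helper, the toy series is made up, and the two values in the single-step check are copied from the first rows of the table above.

```python
def ewm_mean(values, span=4):
    """Recursive exponentially weighted mean (the adjust=False form)."""
    alpha = 2 / (span + 1)  # span=4 gives alpha = 0.4
    smoothed = []
    for i, x in enumerate(values):
        if i == 0:
            smoothed.append(x)  # the first observation seeds the average
        else:
            smoothed.append((1 - alpha) * smoothed[-1] + alpha * x)
    return smoothed

# Toy series: the average lags behind the jump from 1.0 to 5.0
toy = ewm_mean([1.0, 1.0, 5.0, 5.0])

# One recursion step with values from the table above:
# previous 4point_EWM = 1.046296, next Differential_pressure = 1.537182
alpha = 2 / (4 + 1)
next_ewm = (1 - alpha) * 1.046296 + alpha * 1.537182
print(toy, next_ewm)
```

The single step agrees with the 1.242651 reported in the following row, up to the rounding of the displayed inputs.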


# ML Pipeline : Regressor
## Create Regressor Pipeline
### Set the Transformations
* Smart correlation
* feat_scaling
* feat_selection
* Modelling
* Model as variable

Note: Numerical transformation is not required, as all features are supplied as numeric values.

#### ML Pipeline for **Fitting Models** (regression)
Modelling and Hyperparameter Optimization

In [7]:
# def PipelineOptimization(model):
#     pipeline_base = Pipeline([
#         # ("OrdinalCategoricalEncoder", OrdinalEncoder(encoding_method='arbitrary',
#         #                                              variables=['gender', 'Partner', 'Dependents', 'PhoneService',
#         #                                                         'MultipleLines', 'InternetService', 'OnlineSecurity',
#         #                                                         'OnlineBackup', 'DeviceProtection', 'TechSupport',
#         #                                                         'StreamingTV', 'StreamingMovies', 'Contract',
#         #                                                         'PaperlessBilling', 'PaymentMethod'])),
#         ("SmartCorrelatedSelection", SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.6, selection_method="variance")),
#         ("feat_scaling", StandardScaler()),
#         ("feat_selection",  SelectFromModel(model)),
#         ("model", model),
#     ])
#     return pipeline_base

In [8]:
# Feature Management
from sklearn.pipeline import Pipeline
from feature_engine.encoding import OrdinalEncoder
from feature_engine.selection import SmartCorrelatedSelection
from sklearn.preprocessing import StandardScaler, Normalizer
from sklearn.feature_selection import SelectFromModel

# ML regression algorithms
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, AdaBoostRegressor, ExtraTreesRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression

# # ML classification algorithms
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier, AdaBoostClassifier
# from xgboost import XGBClassifier

def PipelineOptimization(model):
    # All features are numeric by the time the pipeline is fitted (Dust is
    # converted to a float below), so no categorical encoding step is needed.
    pipeline_base = Pipeline([
        ('SmartCorrelatedSelection', SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.6, selection_method="variance")),
        ('feat_scaling', StandardScaler()),
        ('feat_selection', SelectFromModel(model)),
        ('model', model),
    ])
    return pipeline_base
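
The scaling, feature-selection and model steps can be seen in isolation in the sketch below. It is an illustration under stated assumptions rather than the project pipeline: the data is synthetic, only the first of three features drives the target, and the `SmartCorrelatedSelection` step is left out so that the example depends on scikit-learn alone.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): only the first of three features drives the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 5.0 * X_demo[:, 0] + rng.normal(scale=0.01, size=200)

model = DecisionTreeRegressor(random_state=0)
demo_pipeline = Pipeline([
    ('feat_scaling', StandardScaler()),
    # For tree models the default threshold is the mean feature importance
    ('feat_selection', SelectFromModel(model)),
    ('model', model),
])
demo_pipeline.fit(X_demo, y_demo)

# Boolean mask of the features that reached the final model
kept = demo_pipeline.named_steps['feat_selection'].get_support()
print(kept)
```

`SelectFromModel` fits its own clone of the estimator, so passing the same `model` object to both steps, as the function above does, is safe.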



In [9]:
# model = PipelineOptimization(self.models[key])

#### **Custom Class** to fit a set of algorithms, each with its own set of hyperparameters

In [10]:
from sklearn.model_selection import GridSearchCV


class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            model = PipelineOptimization(self.models[key])
            print(f"\nRunning GridSearchCV for {key} \n")

            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs,
                              verbose=verbose, scoring=scoring)
            gs.fit(X, y)
            self.grid_searches[key] = gs

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]
                scores.append(r.reshape(len(params), 1))

            all_scores = np.hstack(scores)
            for p, s in zip(params, all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)

        columns = ['estimator', 'min_score',
                   'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns], self.grid_searches
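
`score_summary` builds its table from the per-fold entries in `cv_results_`. The sketch below shows where those `split{i}_test_score` keys come from, using synthetic data and a `Ridge` model chosen purely for illustration (neither is part of this project).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (assumption), two candidate settings, three folds
X_demo, y_demo = make_regression(n_samples=120, n_features=4, noise=5.0, random_state=0)
gs = GridSearchCV(Ridge(), {'alpha': [0.1, 1.0]}, cv=3, scoring='r2')
gs.fit(X_demo, y_demo)

params = gs.cv_results_['params']
# One column per fold, one row per parameter combination
per_split = [gs.cv_results_[f'split{i}_test_score'].reshape(len(params), 1)
             for i in range(gs.cv)]
all_scores = np.hstack(per_split)

print(all_scores.shape)         # (combinations, folds)
print(all_scores.mean(axis=1))  # same values as cv_results_['mean_test_score']
```

Because the class iterates over `range(self.grid_searches[k].cv)`, the `cv` argument passed to `fit` has to be an integer rather than a splitter object.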

Convert `Dust` to a floating-point variable

In [11]:
dust_density_total = [0.900 if n == 'ISO 12103-1, A2 Fine Test Dust' else (1.025 if n == 'ISO 12103-1, A3 Medium Test Dust' else 1.200) for n in df_total['Dust']]
df_total['Dust'] = dust_density_total

Add Filter Balance to **df_total**

In [12]:
df_total_dp = df_total['Differential_pressure']
df_test_filter_balance = (((600 - df_total_dp)/600)*100).round(decimals = 2)
df_total.insert(loc=8, column='filter_balance', value=df_test_filter_balance)
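
`filter_balance` expresses the remaining pressure headroom as a percentage of 600, the constant used in the cell above. The sketch below restates that formula as a small function; `filter_balance` as a function name and the `max_pressure` parameter are hypothetical, and reading 600 as the pressure at which the filter counts as fully loaded is an assumption.

```python
def filter_balance(differential_pressure, max_pressure=600):
    """Remaining headroom as a percentage of max_pressure, rounded to 2 decimals."""
    return round((max_pressure - differential_pressure) / max_pressure * 100, 2)

# Differential_pressure values taken from rows of df_total
print(filter_balance(1.537182))    # 99.74 (almost clean filter)
print(filter_balance(466.037300))  # 22.33 (heavily loaded filter)
```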

## Split the data into Train, Test, Validate

The data is organised in discrete bins (`Data_No`), so the split is made by bin:
#### Extract Cleaned **Train** & **Test** Datasets

In [13]:
n = df_total['Data_No'].iloc[0:len(df_total)]
df_train = df_total[n < 51].reset_index(drop=True)
df_test = df_total[n > 50].reset_index(drop=True)
del df_train['RUL']
df_train

Unnamed: 0,Data_No,Differential_pressure,4point_EWM,Flow_rate,Time,Dust_feed,Dust,filter_balance
0,1,1.537182,1.046296,54.143527,5.5,236.428943,1.025,99.74
1,1,1.537182,1.242651,54.518255,5.6,236.428943,1.025,99.74
2,1,1.537182,1.360463,54.658781,5.7,236.428943,1.025,99.74
3,1,3.345631,2.154530,54.780562,5.8,236.428943,1.025,99.44
4,1,5.244502,3.390519,54.574466,5.9,236.428943,1.025,99.13
...,...,...,...,...,...,...,...,...
33319,50,359.971800,357.193974,58.721877,59.4,177.321707,1.200,40.00
33320,50,360.785600,358.630624,58.699919,59.5,177.321707,1.200,39.87
33321,50,361.509000,359.781974,58.743820,59.6,177.321707,1.200,39.75
33322,50,362.051500,360.689785,58.601152,59.7,177.321707,1.200,39.66


In [14]:
df_test

Unnamed: 0,Data_No,Differential_pressure,4point_EWM,Flow_rate,Time,Dust_feed,Dust,RUL,filter_balance
0,51,2.622251,1.159577,55.524146,0.4,236.428943,1.025,58.6,99.56
1,51,3.888165,2.251012,55.852018,0.5,236.428943,1.025,58.5,99.35
2,51,4.521122,3.159056,56.130203,0.6,236.428943,1.025,58.4,99.25
3,51,4.521122,3.703883,56.150070,0.7,236.428943,1.025,58.3,99.25
4,51,4.521122,4.030778,56.090457,0.8,236.428943,1.025,58.2,99.25
...,...,...,...,...,...,...,...,...,...
36357,100,465.494800,457.888170,82.675521,52.0,316.985065,1.200,8.2,22.42
36358,100,464.228900,460.424462,82.421873,52.1,316.985065,1.200,8.1,22.63
36359,100,466.037300,462.669597,82.743156,52.2,316.985065,1.200,8.0,22.33
36360,100,472.276500,466.512358,82.785427,52.3,316.985065,1.200,7.9,21.29


#### Engineer Even Distribution of `Dust` in **Train** dataset

In [15]:
df_train_copy = df_train.copy()

bin_sum = df_train_copy.groupby('Data_No')['Data_No'].count().reset_index(name='bin_tot')
map_bin = df_train_copy['Data_No'].map(bin_sum.set_index('Data_No')['bin_tot'])
df_train_copy.loc[:, 'bin_size'] = map_bin

dust_A2 = df_train_copy[df_train_copy['Dust'] == 0.900]
filter_A2 = dust_A2[dust_A2.Data_No != dust_A2.Data_No.shift(-1)]
df_train_A2 = filter_A2.sort_values(by='filter_balance', ascending=True)
df_train_A2['c_sum'] = df_train_A2['bin_size'].cumsum()
# df_train_A2.head(13).style.hide(['Time', 'Dust_feed', 'Flow_rate', 'Dust', 'mass_g', 'cumulative_mass_g', 'Tt'], axis="columns")

dust_A3 = df_train_copy[df_train_copy['Dust'] == 1.025]
filter_A3 = dust_A3[dust_A3.Data_No != dust_A3.Data_No.shift(-1)]
df_train_A3 = filter_A3.sort_values(by='filter_balance', ascending=True)
df_train_A3['c_sum'] = df_train_A3['bin_size'].cumsum()
# dn_fb = df_train_A3.loc[:, 'Data_No'].head(14).sort_values(ascending=True).reset_index(drop=True)
# df_train_A3.head(14).style.hide(['Time', 'Dust_feed', 'Flow_rate', 'Dust', 'mass_g', 'cumulative_mass_g', 'Tt'], axis="columns")

A2_bin = df_train_A2['Data_No'].head(9)
A3_bin = df_train_A3['Data_No'].head(9)

df_train_cleaned_A1 = df_train_copy[df_train_copy['Dust'] == 1.200]
df_train_cleaned_A2 = df_train_copy[df_train_copy['Data_No'].isin(A2_bin)]
df_train_cleaned_A3 = df_train_copy[df_train_copy['Data_No'].isin(A3_bin)]

df_train_concat = pd.concat([df_train_cleaned_A1, df_train_cleaned_A2, df_train_cleaned_A3], ignore_index = True)
df_train_model = df_train_concat.sort_values(by='Data_No', ascending=True)

print('A1 Dust_train :', df_train_cleaned_A1.shape)
print('A2 Dust_train :', df_train_cleaned_A2.shape)
print('A3 Dust_train :', df_train_cleaned_A3.shape)
print('\nTotal df_train :', df_train_model.shape)

A1 Dust_train : (6142, 9)
A2 Dust_train : (7581, 9)
A3 Dust_train : (7208, 9)

Total df_train : (20931, 9)
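
The `bin_size` column above is built by counting rows per `Data_No` and mapping the counts back onto each row. As a hedged alternative, the sketch below shows that `groupby().transform('count')` gives the same result in one step; the toy frame is made up for illustration.

```python
import pandas as pd

# Toy frame (assumption): three rows in bin 1, two rows in bin 2
toy = pd.DataFrame({'Data_No': [1, 1, 1, 2, 2]})

# Approach used above: count per bin, then map the counts back onto each row
bin_sum = toy.groupby('Data_No')['Data_No'].count().reset_index(name='bin_tot')
mapped = toy['Data_No'].map(bin_sum.set_index('Data_No')['bin_tot'])

# Equivalent one-liner with groupby().transform()
transformed = toy.groupby('Data_No')['Data_No'].transform('count')

print(mapped.tolist(), transformed.tolist())
```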


In [16]:
df_train_model

Unnamed: 0,Data_No,Differential_pressure,4point_EWM,Flow_rate,Time,Dust_feed,Dust,filter_balance,bin_size
13723,3,2.260561,1.208044,55.742718,1.4,236.428943,1.025,99.62,436
14021,3,152.181000,148.822382,55.762598,31.2,236.428943,1.025,74.64,436
14020,3,149.106600,146.583303,56.239490,31.1,236.428943,1.025,75.15,436
14019,3,147.840700,144.901106,56.150070,31.0,236.428943,1.025,75.36,436
14018,3,145.670600,142.941376,56.179876,30.9,236.428943,1.025,75.72,436
...,...,...,...,...,...,...,...,...,...
5852,50,35.083910,32.161310,58.601152,30.9,177.321707,1.200,94.15,427
5851,50,31.919120,30.212911,59.018164,30.8,177.321707,1.200,94.68,427
5850,50,28.754340,29.075438,58.798686,30.7,177.321707,1.200,95.21,427
5848,50,28.844760,30.851912,58.546286,30.5,177.321707,1.200,95.19,427


#### Determine **Target** and **Independent** Variables and Extract **Validation** Dataset

As discussed in the README, this data has been supplied pre-split into **train** and **test** within unique **data bins**.
We extract random observations from the **test** dataset to create a **validation** set, keeping 70% as test and 30% as validation.


In [17]:
df_test

Unnamed: 0,Data_No,Differential_pressure,4point_EWM,Flow_rate,Time,Dust_feed,Dust,RUL,filter_balance
0,51,2.622251,1.159577,55.524146,0.4,236.428943,1.025,58.6,99.56
1,51,3.888165,2.251012,55.852018,0.5,236.428943,1.025,58.5,99.35
2,51,4.521122,3.159056,56.130203,0.6,236.428943,1.025,58.4,99.25
3,51,4.521122,3.703883,56.150070,0.7,236.428943,1.025,58.3,99.25
4,51,4.521122,4.030778,56.090457,0.8,236.428943,1.025,58.2,99.25
...,...,...,...,...,...,...,...,...,...
36357,100,465.494800,457.888170,82.675521,52.0,316.985065,1.200,8.2,22.42
36358,100,464.228900,460.424462,82.421873,52.1,316.985065,1.200,8.1,22.63
36359,100,466.037300,462.669597,82.743156,52.2,316.985065,1.200,8.0,22.33
36360,100,472.276500,466.512358,82.785427,52.3,316.985065,1.200,7.9,21.29


In [18]:
# from feature_engine.selection import SmartCorrelatedSelection
# corr_sel = SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.6, selection_method="variance")

# df_engineering = df_test.copy()
# corr_sel.fit_transform(df_engineering)
# print('Correlated Variables :\n', corr_sel.correlated_feature_sets_)
# print('\nFeatures to Drop :\n', corr_sel.features_to_drop_)

Review correlations, Drop Features and Split into **70% test** and **30% validate**. 

In [19]:
from sklearn.model_selection import train_test_split
from feature_engine.selection import SmartCorrelatedSelection

corr_sel = SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.6, selection_method="variance")
df_engineering = df_test.copy()
corr_sel.fit_transform(df_engineering)

X = df_test.drop(corr_sel.features_to_drop_,axis=1)
y = df_test['Differential_pressure']

X_test, X_validate, y_test, y_validate = train_test_split(X,y,test_size=0.30, random_state=0)

print(X_test.shape, 'X_test')
print(X_validate.shape, 'X_validate')
print(y_test.shape, 'y_test')
print(y_validate.shape, 'y_validate')
print('\nFeatures Dropped :\n', corr_sel.features_to_drop_)

(25453, 5) X_test
(10909, 5) X_validate
(25453,) y_test
(10909,) y_validate

Features Dropped :
 ['4point_EWM', 'Time', 'RUL', 'filter_balance']
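
The row counts printed above follow directly from how `train_test_split` treats a fractional `test_size`: the held-out share is rounded up and the remainder goes to the other part. A short arithmetic check, using only the shapes shown above:

```python
import math

n_rows = 25453 + 10909  # rows in df_test, from the shapes printed above
test_size = 0.30

# The held-out share is rounded up; the rest stays in the other part
n_validate = math.ceil(test_size * n_rows)
n_test = n_rows - n_validate
print(n_test, n_validate)  # 25453 10909
```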




In [20]:
X_test

Unnamed: 0,Data_No,Differential_pressure,Flow_rate,Dust_feed,Dust
2168,56,4.521122,57.383039,158.492533,1.025
20305,77,62.210650,57.690319,237.738799,1.025
32959,94,16.637730,60.006667,316.985065,0.900
24724,79,209.237600,81.686305,59.107236,1.200
34580,99,41.865590,80.313263,59.107236,1.200
...,...,...,...,...,...
20757,78,7.595486,83.445949,59.107236,1.200
32103,91,8.861401,80.846225,237.738799,0.900
30403,85,69.625290,80.485188,177.321707,1.200
21243,78,31.195750,82.624780,59.107236,1.200


#### Define **X_train**, **y_train** variables

In [21]:
df_train

Unnamed: 0,Data_No,Differential_pressure,4point_EWM,Flow_rate,Time,Dust_feed,Dust,filter_balance
0,1,1.537182,1.046296,54.143527,5.5,236.428943,1.025,99.74
1,1,1.537182,1.242651,54.518255,5.6,236.428943,1.025,99.74
2,1,1.537182,1.360463,54.658781,5.7,236.428943,1.025,99.74
3,1,3.345631,2.154530,54.780562,5.8,236.428943,1.025,99.44
4,1,5.244502,3.390519,54.574466,5.9,236.428943,1.025,99.13
...,...,...,...,...,...,...,...,...
33319,50,359.971800,357.193974,58.721877,59.4,177.321707,1.200,40.00
33320,50,360.785600,358.630624,58.699919,59.5,177.321707,1.200,39.87
33321,50,361.509000,359.781974,58.743820,59.6,177.321707,1.200,39.75
33322,50,362.051500,360.689785,58.601152,59.7,177.321707,1.200,39.66


In [22]:
corr_sel = SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.6, selection_method="variance")
df_engineering = df_train.copy()
corr_sel.fit_transform(df_engineering)

X_train = df_train.drop(corr_sel.features_to_drop_,axis=1)
y_train = df_train['Differential_pressure']

print(X_train.shape, 'X_train')
print(y_train.shape, 'y_train')
print('\nFeatures Dropped :\n', corr_sel.features_to_drop_)

(33324, 5) X_train
(33324,) y_train

Features Dropped :
 ['4point_EWM', 'Time', 'filter_balance']




In [23]:
X_train

Unnamed: 0,Data_No,Differential_pressure,Flow_rate,Dust_feed,Dust
0,1,1.537182,54.143527,236.428943,1.025
1,1,1.537182,54.518255,236.428943,1.025
2,1,1.537182,54.658781,236.428943,1.025
3,1,3.345631,54.780562,236.428943,1.025
4,1,5.244502,54.574466,236.428943,1.025
...,...,...,...,...,...
33319,50,359.971800,58.721877,177.321707,1.200
33320,50,360.785600,58.699919,177.321707,1.200
33321,50,361.509000,58.743820,177.321707,1.200
33322,50,362.051500,58.601152,177.321707,1.200


In [24]:
y_train

0          1.537182
1          1.537182
2          1.537182
3          3.345631
4          5.244502
            ...    
33319    359.971800
33320    360.785600
33321    361.509000
33322    362.051500
33323    366.482200
Name: Differential_pressure, Length: 33324, dtype: float64
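
As printed above, `X_train` still contains `Differential_pressure`, the same column that is used for `y_train`, so a model could reproduce the target directly from its inputs. The sketch below shows one way to keep the target out of the feature matrix; `split_features_target` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd

def split_features_target(df, target, drop=()):
    """Return (X, y) with the target and any extra columns removed from X."""
    X = df.drop(columns=[target, *drop])
    y = df[target]
    return X, y

toy = pd.DataFrame({
    'Flow_rate': [54.1, 54.5, 54.7],
    'Dust_feed': [236.4, 236.4, 236.4],
    'Differential_pressure': [1.5, 1.5, 3.3],
})
X_demo, y_demo = split_features_target(toy, target='Differential_pressure')
print(list(X_demo.columns))
```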

## Handling Target Imbalance
### No need to handle target imbalance in this **regression model**.
* Typically we only need to create a single pipeline for a classification or regression task.
* The exception is a **classification target imbalance**, which requires more than one pipeline to handle.

## Fit the pipeline with Data
* Prepare the data for handling the train set
* ~~Fix the target imbalance~~
* Feature scaling
* Feature selection
* Modelling
* Custom Python class for **hyperparameter optimization**

### Use standard hyperparameters to find most suitable algorithm for the data

In [25]:
# LogisticRegression is a classifier and cannot be fitted to a continuous
# target, so it is left out of the regression search.
models_quick_search = {
    'LinearRegression': LinearRegression(),
    'DecisionTreeRegressor': DecisionTreeRegressor(random_state=0),
    'RandomForestRegressor': RandomForestRegressor(random_state=0),
    'ExtraTreesRegressor': ExtraTreesRegressor(random_state=0),
    'AdaBoostRegressor': AdaBoostRegressor(random_state=0),
    'GradientBoostingRegressor': GradientBoostingRegressor(random_state=0),
    'XGBRegressor': XGBRegressor(random_state=0),
}

params_quick_search = {
    'LinearRegression': {},
    'DecisionTreeRegressor': {},
    'RandomForestRegressor': {},
    'ExtraTreesRegressor': {},
    'AdaBoostRegressor': {},
    'GradientBoostingRegressor': {},
    'XGBRegressor': {},
}

#### Fit the pipelines, using the above models with **default hyperparameters**
* Pass in the train set
* Set the performance metric to the R2 score (regression, as described in our ML business case)
* Set cross-validation to 5 folds (rule of thumb)
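
What an R2 score under 5-fold cross-validation looks like can be sketched on a single model. The example below uses synthetic, almost-linear data (an assumption made so that the scores are easy to interpret) and compares the mean score with the 0.7 threshold from the business case.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic, almost-linear data (assumption) so a linear model scores well
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=1.0, random_state=0)

# Five folds: each fold is held out once and scored with R2
scores = cross_val_score(LinearRegression(), X_demo, y_demo, scoring='r2', cv=5)

# Compare the mean score with the 0.7 threshold from the business case
meets_target = scores.mean() >= 0.7
print(scores.round(3), meets_target)
```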

In [None]:
search = HyperparameterOptimizationSearch(models=models_quick_search, params=params_quick_search)
search.fit(X_train, y_train, scoring='r2', n_jobs=-1, cv=5)


Check results

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

#### Observations

* The average R2 score is around **0.46**, which is lower than the **0.7** threshold set in the business case.
* The best result comes from **GradientBoostingRegressor**.
* Next step: perform an extensive search to try to improve performance.

In [None]:
df_stop

### Train the Model

Multiple regression and classification models are under consideration:

* sklearn.linear_model.**LinearRegression**(*, fit_intercept=True, copy_X=True, n_jobs=None, positive=False)
* sklearn.linear_model.**LogisticRegression**() (a classifier, so it needs a discrete target)
    * *.predict_proba(X)*
* sklearn.linear_model.**SGDRegressor**()
    * *.SGDClassifier()* is the classification counterpart

The full list of available models is in the scikit-learn [linear models](https://scikit-learn.org/stable/modules/linear_model.html#) documentation.

* There is no single optimal model. The most appropriate seems to be .LogisticRegression(), which applies only once the target has been binned into classes.
<!-- 
**.LinearRegression()** - Ordinary Least Squares
**.SGDClassifier()** and **.SGDRegressor()** - Stochastic Gradient Descent - SGD
.Ridge() 
.Lasso()
.MultiTaskLasso()
.ElasticNet()
.MultiTaskElasticNet()
.Lars() - Least Angle Regression
.LassoLars()
.OrthogonalMatchingPursuit() and orthogonal_mp()
.BayesianRidge() - Bayesian Regression
.ARDRegression() - Automatic Relevance Determination
Generalized Linear Models
**.LogisticRegression()** + **.predict_proba(X)**
.TweedieRegressor()
.Perceptron()
.PassiveAggressiveClassifier() and .PassiveAggressiveRegressor()
Robustness regression: outliers and modeling errors
.RANSACRegressor()
.TheilSenRegressor() and 
.HuberRegressor()
.QuantileRegressor()
Polynomial regression: extending linear models with basis functions
.PolynomialFeatures() transformer -->


In [None]:
from sklearn.linear_model import LogisticRegression
# LogisticRegression is a classifier: y_train must hold discrete class labels
logreg = LogisticRegression()
logreg.fit(X_train,y_train)

In [None]:
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train,y_train)

In [None]:
from sklearn.linear_model import SGDRegressor
SGDreg = SGDRegressor()
SGDreg.fit(X_train,y_train)

### Predictions and Model Evaluation

In [None]:
from sklearn.metrics import classification_report

prediction = logreg.predict(X_test)
print(classification_report(y_test,prediction))
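
`LogisticRegression` and `classification_report` both expect discrete class labels, so a continuous target has to be binned first, as outlined in step 3 of the plan. The sketch below is a minimal illustration on synthetic data: the feature names `f1` and `f2`, the three equal-frequency bins and the 70:30 split are all assumptions, not choices made by this project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): a continuous target driven by two features
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 2)), columns=['f1', 'f2'])
y_cont = 3.0 * X_demo['f1'] - 2.0 * X_demo['f2'] + rng.normal(scale=0.5, size=300)

# Convert the continuous target into three equal-frequency classes (0, 1, 2)
y_class = pd.qcut(y_cont, q=3, labels=False)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_class, test_size=0.3, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)

report = classification_report(y_te, clf.predict(X_te))
print(report)
```

Equal-frequency bins keep the classes balanced by construction, which avoids a separate target-imbalance step.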