# CI Portfolio Project 5 - Filter Maintenance Predictor 2022
## **ML Model - Predict Remaining Useful Life (RUL)**

## Objectives

Answer [Business Requirement 1](https://github.com/roeszler/filter-maintenance-predictor/blob/main/README.md#business-requirements):
*   Fit and evaluate a **regression model** to predict the Remaining Useful Life of a replaceable part
*   Fit and evaluate a **classification model** to predict the Remaining Useful Life of a replaceable part should the regressor not perform well.

## Inputs

Data cleaning and feature engineering from their respective notebooks:
* inputs/datasets/cleaned/df_train.csv
* inputs/datasets/cleaned/df_test.csv
* inputs/datasets/cleaned/df_validate.csv

## Outputs

* Train set (features and target)
* Test set (features and target)
* Validation set (features and target)
* ML pipeline to predict RUL
* A map of the labels
* Feature Importance Plot



---

### Change working directory

In [None]:
import os
current_dir = os.getcwd()
current_dir

In [None]:
os.chdir(os.path.dirname(current_dir))
print("Current directory set to new location")

In [None]:
current_dir = os.getcwd()
current_dir

---

## The major steps in this Regressor Pipeline

1. **ML Pipeline: Regressor**
    * Create Regressor Pipeline
    * Split the train set
    * Grid Search CV SKLearn
        * Use standard hyperparameters to find most suitable algorithm
        * Extensive search on most suitable algorithm to find the best hyperparameter configuration
    * Assess Feature Performance
    * Evaluate Regressor
    * Create Train, Test, Validation Sets

2. **ML Pipeline: Regressor + Principal Component Analysis (PCA)**
    * Prepare the Data for the Pipeline
    * Create Regressor + PCA Pipeline
    * Split the train and validation sets
    * Grid Search CV SKLearn
        * Use standard hyperparameters to find most suitable algorithm
        * Do an extensive search on most suitable algorithm to find the best hyperparameter configuration
    * Assess Feature Performance
    * Evaluate Regressor
    * Create Train, Test, Validation Sets

_Optionally_

3. **Convert Regression to Classification**
    * Convert numerical target to bins, and check if it is balanced
    * Rewrite Pipeline for ML Modelling
    * Load Algorithms For Classification
    * Split the Train Test sets
    * Grid Search CV SKLearn
        * Use standard hyperparameters to find most suitable model
        * Grid Search CV
        * Check Result
    * Do an extensive search on the most suitable model to find the best hyperparameter configuration
        * Define Model Parameters
        * Extensive Grid Search CV
        * Check Results
        * Check Best Model
        * Parameters for best model
        * Define the best clf_pipeline
    * Assess Feature Importance
    * Evaluate Classifier on Train and Test Sets
        * Custom Function
        * List that relates the classes and tenure interval

4. **Decide which pipeline to use**

5. **Refit with the best features**
    * Rewrite Pipeline
    * Split Train Test Set with only best features
    * Subset best features
    * Grid Search CV SKLearn
    * Best Parameters
        * Manually
    * Grid Search CV
    * Check Results
    * Check Best Model
    * Define the best pipeline

6. **Assess Feature Importance**

7. **Push Files to Repo**
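
The two-pass grid search named in steps 1 to 3 can be sketched roughly as follows. This is a minimal illustration, not the project's final search: the data is synthetic (`make_regression`), and the candidate models and parameter grids are assumptions chosen only to show the pattern of a quick pass with default hyperparameters followed by an extensive pass on the winner.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, SGDRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the filter dataset).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Pass 1: default hyperparameters, one candidate per algorithm.
candidates = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "SGDRegressor": SGDRegressor(random_state=0),
}
quick_scores = {}
for name, model in candidates.items():
    pipe = Pipeline([("feat_scaling", StandardScaler()), ("model", model)])
    # An empty grid means "cross-validate the defaults only".
    search = GridSearchCV(pipe, param_grid={}, cv=5, scoring="r2")
    search.fit(X_train, y_train)
    quick_scores[name] = search.best_score_

best_name = max(quick_scores, key=quick_scores.get)

# Pass 2: wider grid on the winning algorithm only (grids are illustrative).
extensive_grids = {
    "LinearRegression": {"model__fit_intercept": [True, False]},
    "Ridge": {"model__alpha": [0.01, 0.1, 1.0, 10.0]},
    "SGDRegressor": {"model__alpha": [1e-5, 1e-4, 1e-3],
                     "model__penalty": ["l2", "l1"]},
}
best_pipe = Pipeline([("feat_scaling", StandardScaler()),
                      ("model", candidates[best_name])])
extensive = GridSearchCV(best_pipe, extensive_grids[best_name],
                         cv=5, scoring="r2")
extensive.fit(X_train, y_train)

print(best_name, extensive.best_params_)
print("held-out R2:", round(extensive.score(X_test, y_test), 3))
```

Running the cheap pass first keeps the expensive search confined to one algorithm, which matters once the grids grow.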

<!-- Modelling:
The hypothesis part of the process where you will find out whether you can answer the question.
* Identify what techniques to use.
* Split your data into train, validate and test sets.
* Build and train the models with the train data set.
* Validate Models and hyper-parameter : Trial different machine learning methods and models with the validation data set.
* Poor Results - return to data preparation for feature engineering
* Successful hypothesis - where the inputs from the data set are mapped to the output target / label appropriately to evaluate.

5. Evaluation:
Where you test whether the model can predict unseen data.
* Test Dataset
* Choose the model that meets the business success criteria best.
* Review and document the work that you have done.
* If your project meets the success metrics you defined with your customer?
- Ready to deploy. -->

---

### Load Cleaned Data
The pipeline should handle the cleaning and feature engineering by itself.

In [None]:
import numpy as np
import pandas as pd
df_total = pd.read_csv('outputs/datasets/cleaned/dfCleanTotal.csv')
# df_total = pd.read_csv(f'outputs/datasets/transformed/dfTransformedTotal.csv') # data with negative log_EWM values removed
# df_total = (pd.read_csv("outputs/datasets/collection/PredictiveMaintenanceTotal.csv").drop(labels=['customerID', 'TotalCharges', 'Churn'], axis=1))
print(df_total.shape)
df_total

### Remove Negative Values

In [None]:
# Remove rows with negative log_EWM values.
# NOTE: `log_ewm` is not defined in this notebook; it must be computed before
# this cell runs (skip the insert if df_total already has a log_EWM column).
df_total.insert(loc=4, column='log_EWM', value=log_ewm)
df_total = df_total[df_total['log_EWM'] >= 0]
# del df_total['log_EWM']
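
As a small self-contained illustration of the filtering step above, the sketch below applies the same boolean-mask idea to a toy frame. The frame, its values, and the `feature_a` column are made up for the example; only the `log_EWM` column name comes from the notebook.

```python
import pandas as pd

# Toy frame standing in for df_total (hypothetical values).
toy = pd.DataFrame({
    "feature_a": [10.0, 12.5, 15.0, 18.0],
    "log_EWM": [0.3, -0.1, 0.0, 1.2],
})

# Keep only rows whose log_EWM is zero or positive.
mask = toy["log_EWM"] >= 0
toy_clean = toy.loc[mask].reset_index(drop=True)

print(f"dropped {(~mask).sum()} of {len(toy)} rows")
```

Printing how many rows were dropped is a cheap check that the filter is not silently discarding most of the data.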

### Set the Transformations
* Numerical Transformation as required
* Smart correlation

In [None]:
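from feature_engine.encoding import OrdinalEncoder
from feature_engine.selection import SmartCorrelatedSelection
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def PipelineOptimization(model):
    """Build the base pipeline around the estimator passed in as `model`."""
    pipeline_base = Pipeline([
        # Encode categorical variables as arbitrary integers
        ("OrdinalCategoricalEncoder", OrdinalEncoder(
            encoding_method='arbitrary', variables=['gender', 'Partner', 'Dependents',
            'PhoneService','MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup',
            'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
            'PaperlessBilling', 'PaymentMethod'])
        ),
        # From each group of features correlated above 0.6 (Spearman), keep the highest-variance one
        ("SmartCorrelatedSelection", SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.6, selection_method="variance")),
        ("feat_scaling", StandardScaler()),
        # Keep only the features the model itself rates as important
        ("feat_selection", SelectFromModel(model)),
        ("model", model),
    ])
    return pipeline_base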
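
To show what the `feat_selection` step does in isolation, here is a minimal sketch that uses scikit-learn only. The synthetic data and the choice of `RandomForestRegressor` are assumptions for illustration; the encoder and correlation steps from the pipeline above are left out on purpose.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 8 features, of which only 3 carry signal (assumption).
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0)
pipe = Pipeline([
    ("feat_scaling", StandardScaler()),
    # SelectFromModel fits its own clone of `model` and keeps the features
    # whose importance is at or above the mean importance (the default).
    ("feat_selection", SelectFromModel(model)),
    ("model", model),
])
pipe.fit(X, y)

# Boolean mask over the 8 input columns: True means the feature was kept.
kept = pipe.named_steps["feat_selection"].get_support()
print("features kept:", kept.sum(), "of", kept.size)
```

The final `model` step is then trained only on the columns that survive selection, which is also what a later feature importance plot would report on.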

### Split the data into Train, Test, Validate

If the data were continuous, with each row an independent observation, a simple random split would do:

```
from sklearn.model_selection import train_test_split

X = ad_data.drop(['Ad Topic Line', 'City', 'Timestamp', 'Clicked on Ad', 'Country'], axis=1)
y = ad_data['Clicked on Ad']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
```

However, the data is discrete and grouped in bins, so:

In [None]:
from sklearn.model_selection import train_test_split

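# NOTE: `ad_data` and these column names are not defined in this notebook;
# substitute the cleaned DataFrame and its target column before running.
X = ad_data.drop(['Ad Topic Line', 'City', 'Timestamp', 'Clicked on Ad', 'Country'], axis=1)
y = ad_data['Clicked on Ad']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)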
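
The cell above yields only train and test sets. A three-way train, validate, test split over a binned target could look roughly like the sketch below. It assumes that "bins" refers to a numerical target converted into classes, as in step 3 of the outline; if the bins instead identify groups of rows that must stay together, a grouped split would be needed. The toy frame and the `feature_a`, `feature_b`, `RUL` and `RUL_bin` column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the cleaned dataset (hypothetical columns).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature_a": rng.normal(size=300),
    "feature_b": rng.normal(size=300),
    "RUL": rng.uniform(0, 300, size=300),
})

# Quantile bins give classes of roughly equal size; check the balance.
toy["RUL_bin"] = pd.qcut(toy["RUL"], q=3, labels=False)
print(toy["RUL_bin"].value_counts(normalize=True).sort_index())

X = toy.drop(columns=["RUL", "RUL_bin"])
y = toy["RUL_bin"]

# 60 % train, then split the remaining 40 % evenly into validate and test.
# `stratify` keeps the class proportions the same in every subset.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42)

print(len(X_train), len(X_val), len(X_test))
```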

### Train the Model

Several regression and classification models are under consideration:

* sklearn.linear_model.**LinearRegression**(*, fit_intercept=True, copy_X=True, n_jobs=None, positive=False)
* sklearn.linear_model.**LogisticRegression**() (a classifier, despite its name)
    * *.predict_proba(X)*
* sklearn.linear_model.**SGDRegressor**()
    * *.SGDClassifier()*

The full list of models available and under consideration can be seen in the scikit-learn [linear models](https://scikit-learn.org/stable/modules/linear_model.html#) guide.

* There is no single optimal model; the most appropriate seems to be .LogisticRegression()
<!-- 
**.LinearRegression()** - Ordinary Least Squares
**.SGDClassifier()** and **.SGDRegressor()** - Stochastic Gradient Descent - SGD
.Ridge() 
.Lasso()
.MultiTaskLasso()
.ElasticNet()
.MultiTaskElasticNet()
.Lars() - Least Angle Regression
.LassoLars()
.OrthogonalMatchingPursuit() and orthogonal_mp()
.BayesianRidge() - Bayesian Regression
.ARDRegression() - Automatic Relevance Determination
Generalized Linear Models
**.LogisticRegression()** + **.predict_proba(X)**
.TweedieRegressor()
.Perceptron()
.PassiveAggressiveClassifier() and .PassiveAggressiveRegressor()
Robustness regression: outliers and modeling errors
.RANSACRegressor()
.TheilSenRegressor() and 
.HuberRegressor()
.QuantileRegressor()
Polynomial regression: extending linear models with basis functions
.PolynomialFeatures() transformer -->


In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train,y_train)

In [None]:
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train,y_train)

In [None]:
from sklearn.linear_model import SGDRegressor
SGDreg = SGDRegressor()
SGDreg.fit(X_train,y_train)

### Predictions and Model Evaluation

In [None]:
from sklearn.metrics import classification_report

prediction = logreg.predict(X_test)
print(classification_report(y_test,prediction))
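
`classification_report` applies only to the classifier. For the regressors fitted earlier, a comparable evaluation could be sketched as below. The data is synthetic and the `regression_report` helper is a hypothetical name introduced for this example, not something defined elsewhere in the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the filter dataset).
X, y = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
linreg = LinearRegression().fit(X_train, y_train)


def regression_report(model, X, y):
    """Return R2, MAE and RMSE for `model` on the given features and target."""
    pred = model.predict(X)
    return {
        "r2": r2_score(y, pred),
        "mae": mean_absolute_error(y, pred),
        "rmse": float(np.sqrt(mean_squared_error(y, pred))),
    }


# Comparing train and test scores side by side helps spot overfitting.
for name, (X_part, y_part) in {"train": (X_train, y_train),
                               "test": (X_test, y_test)}.items():
    print(name, regression_report(linreg, X_part, y_part))
```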

In [None]:
df_stop