## Machine Learning Workflow

There are many steps in the process of creating, implementing and iterating over a machine learning model for a specific data-driven problem. While there is no single universal way of sequencing the different steps that go into a workflow, there are some general principles that are good to follow for optimal performance of a machine learning algorithm.

A typical machine learning workflow has the following steps:
1. ETL (Extract, Transform and Load) data
2. Data Cleaning and Aggregation
3. Train-Test-Validation Split
4. EDA (Exploratory Data Analysis)
5. Feature Engineering (normalization, removing autocorrelations, discretization, etc.)
6. Model Selection and Implementation
7. Model Evaluation
8. Hyperparameter Tuning
9. Model Validation
10. Build ML pipeline

#### 1. ETL (Extract, Transform and Load) Data

Data is often stored in a SQL database with a cloud service provider like AWS, DigitalOcean, etc. Depending on the volume of data, an engineer might use a tool like PySpark to extract this data, transform it and load it into a local database.
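
As a minimal sketch of the three ETL stages, the example below uses an in-memory SQLite database and pandas as small stand-ins for a cloud database and PySpark. The table names `events` and `events_clean` and their columns are hypothetical, chosen only for illustration.

```python
import sqlite3
import pandas as pd

# Stand-in for a remote SQL source (hypothetical table and columns)
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE events (user_id INTEGER, amount TEXT)")
source.executemany("INSERT INTO events VALUES (?, ?)",
                   [(1, "10.5"), (2, "3.0"), (1, "7.25")])
source.commit()

# Extract: pull the raw rows into a DataFrame
raw = pd.read_sql_query("SELECT user_id, amount FROM events", source)

# Transform: cast the text amounts to floats and total them per user
raw["amount"] = raw["amount"].astype(float)
totals = raw.groupby("user_id", as_index=False)["amount"].sum()

# Load: write the transformed table into a local database
local = sqlite3.connect(":memory:")
totals.to_sql("events_clean", local, index=False)
loaded = pd.read_sql_query("SELECT * FROM events_clean ORDER BY user_id", local)
print(loaded)
```

With larger volumes the same three stages would typically run in a distributed tool, but the shape of the work stays the same.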

#### 2. Data Cleaning and Aggregation

This can involve a range of tasks depending on the form and type of data, as well as the problem that the machine learning pipeline is being designed to solve. Some examples include dealing with null or missing entries, conforming timestamps to a standard, and carrying out aggregations like grouping events by the hour or day based on their timestamps, grouping IP addresses by location, etc. Since Spark is well suited to performing such tasks on big data, this step might very well be the “Transform” part of the ETL step above.
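
The cleaning tasks listed above can be sketched with pandas on a tiny, made-up event log. The column names `timestamp` and `bytes`, and the choice to fill missing values with zero, are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical raw event log with a missing timestamp and a missing value
events = pd.DataFrame({
    "timestamp": ["2023-01-01 10:15:00", "2023-01-01 10:45:00",
                  "2023-01-01 11:05:00", None],
    "bytes": [120.0, None, 300.0, 50.0],
})

# Conform timestamps to a single datetime type; unparseable values become NaT
events["timestamp"] = pd.to_datetime(events["timestamp"], errors="coerce")

# Drop rows without a usable timestamp, then fill missing numeric entries
events = events.dropna(subset=["timestamp"])
events["bytes"] = events["bytes"].fillna(0.0)

# Aggregate events by the hour
hourly = events.groupby(events["timestamp"].dt.floor("60min"))["bytes"].sum()
print(hourly)
```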

#### 3. Train-Test-Validation Split

Before the modelling phase, we split our data into train, validation and test sets. The training data is used to fit the machine learning models, and the test data is used to measure the performance of the trained model. When we take models to production, we also use the validation data (a part of the training data) to tune hyperparameters and/or validate the model before we evaluate it on the test data.

All manipulations such as scaling and encoding categorical features should be done after splitting the data, so that no information from the test set leaks into training.

In [None]:
from sklearn.model_selection import train_test_split
# For feature matrix X and target variable y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
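
The cell above produces only a train and a test set. One common way to also get a validation set, sketched here on synthetic data, is to call `train_test_split` twice. The split proportions are an assumption. The scaler is fit on the training set only and then reused, which is how the "manipulate after splitting" rule plays out in practice.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): 100 rows, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# First hold out the test set, then carve a validation set out of the remainder
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

# Fit the scaler on the training set only, then reuse it on the other sets
scaler = StandardScaler().fit(X_train)
X_train_s, X_val_s, X_test_s = (scaler.transform(a) for a in (X_train, X_val, X_test))

print(len(X_train), len(X_val), len(X_test))
```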

#### 4. Exploratory Data Analysis

Exploratory Data Analysis (EDA), in the context of a machine learning workflow, is the step of inspecting, analyzing and altering your data to get it ready for machine learning modeling.

#### 5. Feature Engineering

Feature engineering refers to prepping, selecting and reducing features in a machine learning problem. This can involve methods that overlap with EDA, such as normalization, removing autocorrelations, discretization, etc. It can also involve using algorithms like PCA to reduce dimensionality, or methods applied during the model fitting step, like regularization.
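
A short sketch of three of the techniques named above (normalization, discretization and PCA) on synthetic data. The bin count and the number of components are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler

# Synthetic stand-in data (an assumption): 50 rows, 4 numeric features
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))

# Normalization: zero mean and unit variance per feature
X_scaled = StandardScaler().fit_transform(X)

# Discretization: cut the first feature into 3 equal-width ordinal bins
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
X_binned = binner.fit_transform(X[:, [0]])

# Dimensionality reduction: keep the 2 leading principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape)
```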

#### 6. Model Selection and Implementation

Now we’re ready to test out different machine learning models. The choice of the model depends on the attributes of the data one’s working with as well as the type of question we’re trying to answer.

Refer to this [cheatsheet](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) to choose the right algorithm.

#### 7. Model Evaluation

We’re now getting into the iterative part of the workflow. Whatever model is built, it must be evaluated on the test data. For classification problems, metrics like accuracy, precision, recall, F1 score and AUROC indicate how performant the model is; for regression problems, RMSE and R-squared are commonly used.

Machine learning engineers iterate over different types of models to find the best one for the problem at hand.
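
The metrics named above are all available in `sklearn.metrics`. The labels and predictions below are made up by hand so that the numbers are easy to verify.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, mean_squared_error,
                             precision_score, r2_score, recall_score)

# Classification: hand-made labels with 3 TP, 1 FP, 1 FN and 1 TN
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
accuracy = accuracy_score(y_true, y_pred)    # 4 correct out of 6
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

# Regression: RMSE is the square root of the mean squared error
y_true_r = [3.0, 5.0, 7.0]
y_pred_r = [2.0, 5.0, 8.0]
rmse = np.sqrt(mean_squared_error(y_true_r, y_pred_r))
r2 = r2_score(y_true_r, y_pred_r)

print(accuracy, precision, recall, f1, rmse, r2)
```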

#### 8. Hyperparameter Tuning

Once a model has been decided upon, it can be tuned for better performance. Hyperparameter tuning is essential in making sure that the model does not overfit or underfit the data.

This is key to how well the model is fitting known data and how well it’s able to generalize to new data as well. Hence hyperparameter tuning might be done on the validation or holdout dataset.

#### 9. Model Validation

Model validation is the process of making sure that the model is still performant on data that it hasn’t seen at all, neither in the training phase nor in the test phase. This can be done either during the hyperparameter tuning step or after. Typically, the same metrics used during the model evaluation phase need to be used here as well, so that the results can reasonably be compared.

#### 10. Build ML pipeline!

When a machine learning workflow is part of a production cycle, it is often the case that a model is tuned and updated based on incoming information. In other words the model that worked well on last month’s data might not be applicable for this month. It is the job of a Machine Learning Engineer or a Pipeline Engineer to make sure that the model deployed into production is thus flexible and alterable without affecting the rest of the codebase. ML pipelines allow one to do the same!

An ML pipeline is a modular sequence of objects that codifies and automates an ML workflow to make it efficient, reproducible and generalizable.

## Let's get into the implementation of the ML Pipeline

### Data Cleaning (Numeric)

To introduce pipelines, let’s look at a common set of data cleaning/EDA tasks: dealing with missing values and scaling numeric variables. We’re going to convert an existing code base that performs these tasks into more concise code that uses scikit-learn’s Pipeline, using the following steps.

1. First, to define a pipeline, we pass a list of tuples of the form (name, transform/estimator) into a Pipeline object. For example, if we wanted to perform imputation with a SimpleImputer first, and scale our numerical variables with a StandardScaler next, the code would look as follows:

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
pipeline = Pipeline([("imputer", SimpleImputer()), ("scale", StandardScaler())])

2. Once a Pipeline object has been instantiated, the methods .fit and .transform can be called like we would with any data transformation object in scikit-learn.

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X,y, random_state=0, test_size=0.25)
pipeline.fit(x_train)
pipeline.transform(x_test)

In [None]:
# Fit on the numeric training columns only, then transform the numeric test columns
pipeline.fit(x_train[num_cols])
x_transform = pipeline.transform(x_test[num_cols])

If the pipeline includes a machine learning model as well, .predict() can also be called down the line. Each step in the pipeline will be fit in the order provided.
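
To make the "existing code base versus pipeline" conversion concrete, here is a small sketch on a hand-made array. It runs the imputer and scaler manually, then through a `Pipeline`, and checks that both give the same result. The data values are arbitrary.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan], [5.0, 40.0]])
test = np.array([[2.0, np.nan], [np.nan, 30.0]])

# Manual version: fit each transformer on the training data, step by step
imputer = SimpleImputer().fit(train)
scaler = StandardScaler().fit(imputer.transform(train))
manual = scaler.transform(imputer.transform(test))

# Pipeline version: one fit call runs the same steps in order
pipe = Pipeline([("imputer", SimpleImputer()), ("scale", StandardScaler())])
pipe.fit(train)
piped = pipe.transform(test)

print(np.allclose(manual, piped))
```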

### Data Cleaning (Categorical)

We’re now going to implement a task similar to the previous exercise with pipeline.Pipeline(), but with categorical variables now. Specifically, we’ll be dealing with missing values in categorical data and one-hot-encoding categorical variables. We will convert an existing codebase to a pipeline like in the previous exercise.

In [None]:
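# `sparse_output` replaced `sparse` in scikit-learn 1.2; use `sparse=False` on older versions
pipeline = Pipeline([('imputer', SimpleImputer(strategy = 'most_frequent')),
                     ('ohe', OneHotEncoder(drop = 'first', sparse_output = False))])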

In [None]:
pipeline.fit(x_train[cat_cols])
pipeline.transform(x_test[cat_cols])
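
The effect of mode imputation followed by one-hot encoding with `drop='first'` is easier to see on a toy column. The `color` column below is hypothetical. To stay independent of how the dense-output option is named in a given scikit-learn version, this sketch leaves the encoder output sparse and converts it with `.toarray()`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column with missing entries
train = pd.DataFrame({"color": ["red", "blue", "red", np.nan]})
test = pd.DataFrame({"color": ["blue", np.nan]})

cat_pipe = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                     ("ohe", OneHotEncoder(drop="first"))])
cat_pipe.fit(train)

# Categories are sorted as ["blue", "red"]; drop="first" removes "blue",
# leaving a single indicator column for "red"
encoded = cat_pipe.transform(test).toarray()
print(encoded)
```

The missing test value is filled with the training mode (`"red"`), so the second row is encoded as 1 and the `"blue"` row as 0.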

### Column Transformer

Often, you may not want to apply every function to all columns. If our columns are of different types, we may only want to apply certain parts of the pipeline to a subset of columns. This is what we saw in the two previous exercises: one set of transformations is applied to the numeric columns and another set to the categorical ones. We can use scikit-learn’s ColumnTransformer as one way of combining these processes.

In [None]:
num_vals = Pipeline([('imputer', SimpleImputer(strategy = 'mean')), ('scale', StandardScaler())])
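# `sparse_output` replaced `sparse` in scikit-learn 1.2; use `sparse=False` on older versions
cat_vals = Pipeline([('imputer', SimpleImputer(strategy = 'most_frequent')),
                     ('ohe', OneHotEncoder(drop = 'first', sparse_output = False))])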

In [None]:
#create the column transformer with the categorical and numerical processes
#num_cols is the list of numerical columns and cat_cols is the list of categorical columns
preprocess = ColumnTransformer(transformers = [('num_preprocess', num_vals, num_cols), 
                                               ('cat_preprocess', cat_vals, cat_cols)])

In [None]:
preprocess.fit(x_train)
preprocess.transform(x_test)
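
The column routing can be checked on a toy frame with one numeric and one categorical column. The column names `size` and `color` are hypothetical, and `sparse_threshold=0` is used here only to force a dense array so the result is easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    "size": [1.0, 2.0, np.nan, 4.0],
    "color": ["red", "blue", "red", np.nan],
})

num_pipe = Pipeline([("imputer", SimpleImputer(strategy="mean")),
                     ("scale", StandardScaler())])
cat_pipe = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                     ("ohe", OneHotEncoder(handle_unknown="ignore"))])

# Each tuple is (name, transformer, columns the transformer should receive)
ct = ColumnTransformer([("num", num_pipe, ["size"]),
                        ("cat", cat_pipe, ["color"])],
                       sparse_threshold=0)
out = ct.fit_transform(toy)

# One scaled numeric column followed by two one-hot columns (blue, red)
print(out.shape)
```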

### Adding a Model

In [None]:
import numpy as np
import pandas as pd

from sklearn import svm, datasets
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import r2_score

In [None]:
columns = ["sex","length","diam","height","whole","shucked","viscera","shell","age"]
df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data",names=columns)

X = df.drop(columns=['age'])
y = df.age

In [None]:
num_cols = X.select_dtypes(include=np.number).columns
cat_cols = X.select_dtypes(include=['object']).columns

In [None]:
#create some missing values
for i in range(1000):
    X.loc[np.random.choice(X.index),np.random.choice(X.columns)] = np.nan

In [None]:
#train-test split
x_train, x_test, y_train, y_test = train_test_split(X,y, random_state=0, test_size=0.25)

In [None]:
#categorical and numerical data processing pipelines
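# `sparse_output` replaced `sparse` in scikit-learn 1.2; use `sparse=False` on older versions
cat_vals = Pipeline([("imputer",SimpleImputer(strategy='most_frequent')), 
                     ("ohe",OneHotEncoder(sparse_output=False, drop='first'))])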

num_vals = Pipeline([("imputer",SimpleImputer(strategy='mean')), ("scale",StandardScaler())])

In [None]:
#combining categorical and numerical pipelines together
preprocess = ColumnTransformer(
    transformers=[
        ("cat_process", cat_vals, cat_cols),
        ("num_process", num_vals, num_cols)
    ]
)

In [None]:
#Create a pipeline with `preprocess` and a linear regression model, `regr`
pipeline = Pipeline([('preprocess', preprocess), ('regr', LinearRegression())])

#Fit the pipeline on the training data and predict on the test data
pipeline.fit(x_train, y_train)
y_pred = pipeline.predict(x_test)

In [None]:
#Calculate pipeline score and compare to estimator score
pipeline_score = pipeline.score(x_test, y_test)
print(pipeline_score)

#r-squared score (stored as `r2` so the imported `r2_score` function is not overwritten)
r2 = r2_score(y_test, y_pred)
print(r2)
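
For a regressor, a pipeline's `.score` method returns the R-squared of its final estimator, so the two printed values above should agree. A minimal check on synthetic data (the coefficients and noise level are arbitrary assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (an assumption): linear signal plus small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=80)

pipe = Pipeline([("scale", StandardScaler()), ("regr", LinearRegression())])
pipe.fit(X[:60], y[:60])

# Both routes compute R-squared on the same held-out rows
score_from_pipeline = pipe.score(X[60:], y[60:])
score_from_metric = r2_score(y[60:], pipe.predict(X[60:]))
print(np.isclose(score_from_pipeline, score_from_metric))
```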

### Hyperparameter Tuning

In [None]:
#Very simple parameter grid, with and without the intercept
param_grid = {
    "regr__fit_intercept": [True,False]
}

In [None]:
#Grid search using previous pipeline
gs = GridSearchCV(pipeline, param_grid = param_grid, scoring = 'neg_mean_squared_error', cv = 5)

In [None]:
#fit grid using training data and print best score
gs.fit(x_train, y_train)
best_score = gs.best_score_
best_params = gs.best_params_
print(best_score)
print(best_params)
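
Two details in the cells above are worth spelling out. The key `regr__fit_intercept` follows the `<step name>__<parameter>` convention, and `neg_mean_squared_error` is a negated MSE, so higher is better and `best_score_` is at most zero. The sketch below shows both on synthetic data. It uses a Ridge model and an arbitrary alpha grid as assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (an assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

pipe = Pipeline([("scale", StandardScaler()), ("regr", Ridge())])

# "regr__alpha" targets the `alpha` parameter of the step named "regr"
grid = GridSearchCV(pipe, {"regr__alpha": [0.01, 1.0, 100.0]},
                    scoring="neg_mean_squared_error", cv=5)
grid.fit(X, y)

# Flip the sign of the negated MSE and take the root to report an RMSE
best_rmse = np.sqrt(-grid.best_score_)
print(grid.best_params_, best_rmse)
```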

### Final Pipeline

We will now be searching over different types of models, each having their own sets of hyperparameters!

In [None]:
# Search space with one dictionary per model type: linear, Ridge and Lasso regression
search_space = [{'regr': [LinearRegression()], 'regr__fit_intercept': [True, False]},
                {'regr': [Ridge()], 'regr__alpha': [0.01, 0.1, 1, 10, 100]},
                {'regr': [Lasso()], 'regr__alpha': [0.01, 0.1, 1, 10, 100]}]

In [None]:
# Initialize a grid search on `search_space`
gs = GridSearchCV(pipeline, search_space, scoring='neg_mean_squared_error', cv=5)

In [None]:
# Find the best pipeline, regression model and its hyperparameters
## Fit to training data
gs.fit(x_train, y_train)

## Find the best pipeline
best_pipeline = gs.best_estimator_

In [None]:
## Find the best regression model
best_regression_model = best_pipeline.named_steps['regr']
print('The best regression model is:')
print(best_regression_model)

In [None]:
## Find the hyperparameters of the best regression model
best_model_hyperparameters = best_regression_model.get_params()
print('The hyperparameters of the regression model are:')
print(best_model_hyperparameters)

In [None]:
# Access the hyperparameters of the categorical preprocessing step
# The transformer name must match the one given to ColumnTransformer ('cat_process')
cat_preprocess_hyperparameters = best_pipeline.named_steps['preprocess'].named_transformers_['cat_process'].named_steps['imputer'].get_params()
print('The hyperparameters of the imputer are:')
print(cat_preprocess_hyperparameters)
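
Searching over model types works because a pipeline step is itself a parameter: `get_params` exposes nested settings as `<step>__<parameter>`, and `set_params` can replace a whole step. A small sketch with an arbitrary two-step pipeline (the step names and values are assumptions):

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import Pipeline

pipe = Pipeline([("imputer", SimpleImputer(strategy="median")),
                 ("regr", Ridge(alpha=2.0))])

# Nested parameters are exposed as "<step>__<parameter>"
params = pipe.get_params()
print(params["imputer__strategy"], params["regr__alpha"])

# A whole step is also a parameter, which is what lets a grid swap models
pipe.set_params(regr=Lasso(alpha=0.5))
print(type(pipe.named_steps["regr"]).__name__)
```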

### Practical Example Combining All the Above Steps

In [None]:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix
from scipy.io import arff

In [None]:
data = arff.loadarff('bone-marrow.arff')
df = pd.DataFrame(data[0])
df.drop(columns=['Disease'], inplace=True)

In [None]:
#Convert all columns to numeric, coerce errors to null values
for c in df.columns:
    df[c] = pd.to_numeric(df[c], errors='coerce')
    
#Make sure binary columns are encoded as 0 and 1
for c in df.columns[df.nunique()==2]:
    df[c] = (df[c]==1)*1.0

In [None]:
# 1. Calculate the number of unique values for each column
print('Count of unique values in each column:')
print(df.nunique())

# 2. Set target, survival_status, as y; features (dropping survival status and time) as X
y = df.survival_status
X = df.drop(columns=['survival_time','survival_status'])

In [None]:
# 3. Define lists of numeric and categorical columns based on number of unique values
num_cols = X.columns[X.nunique()>7]
cat_cols = X.columns[X.nunique()<=7]

# 4. Print columns with missing values
print('Columns with missing values:')
print(X.columns[X.isnull().sum()>0])

In [None]:
# 5. Split data into train/test split
x_train, x_test, y_train, y_test = train_test_split(X,y, random_state=0, test_size=0.2)

The categorical pipeline will consist of two steps: the first will fill in missing values using the mode, and the second will one-hot-encode the variables.

In [None]:
# 6. Create categorical preprocessing pipeline
# Using mode to fill in missing values and OHE
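# `sparse_output` replaced `sparse` in scikit-learn 1.2; use `sparse=False` on older versions
cat_vals = Pipeline([("imputer",SimpleImputer(strategy='most_frequent')), ("ohe",OneHotEncoder(sparse_output=False, drop='first', handle_unknown = 'ignore'))])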

The numerical pipeline will consist of two steps: the first will fill in missing values using the mean, and the second will scale the features.

In [None]:
# 7. Create numerical preprocessing pipeline
# Using mean to fill in missing values and standard scaling of features
num_vals = Pipeline([("imputer",SimpleImputer(strategy='mean')), ("scale",StandardScaler())])

In [None]:
# 8. Create column transformer that will preprocess the numerical and categorical features separately
preprocess = ColumnTransformer(
    transformers=[
        ("cat_process", cat_vals, cat_cols),
        ("num_process", num_vals, num_cols)
    ]
)

In [None]:
# 9. Create a pipeline with preprocess, PCA, and a logistic regression model
pipeline = Pipeline([("preprocess",preprocess), 
                     ("pca", PCA()),
                     ("clf",LogisticRegression())])

In [None]:
# 10. Fit the pipeline on the training data
pipeline.fit(x_train, y_train)
#Score the pipeline on the test data
print('Pipeline Accuracy Test Set:')
print(pipeline.score(x_test,y_test))

In [None]:
# 11. Define search space of hyperparameters
search_space = [{'clf': [LogisticRegression()],
                 'clf__C': np.logspace(-4, 2, 10),
                 'pca__n_components': np.linspace(30, 37, 3).astype(int)}]

In [None]:
# 12. Search over the hyperparameters above to optimize the pipeline, and fit
gs = GridSearchCV(pipeline, search_space, cv=5)
gs.fit(x_train, y_train)

In [None]:
# 13. Save the best estimator from the grid search
best_model = gs.best_estimator_

In [None]:
# 14. Print attributes of best_model
print('The best classification model is:')
print(best_model.named_steps['clf'])
print('The hyperparameters of the best classification model are:')
print(best_model.named_steps['clf'].get_params())
print('The number of components selected in the PCA step is:')
print(best_model.named_steps['pca'].n_components)

# 15. Print final accuracy score 
print('Best Model Accuracy Test Set:')
print(best_model.score(x_test,y_test))
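
The imports for this example include `confusion_matrix`, which the steps above do not use. As a possible follow-up to the accuracy score, the sketch below shows how it breaks predictions down by error type. The labels are made up by hand; in the example above the inputs would be `y_test` and `best_model.predict(x_test)`.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels (an assumption), standing in for y_test and model predictions
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print(tn, fp, fn, tp)
```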