## ML Workflow

![workflow](workflow.jpg)

1. Extract, transform, and load (ETL) data
2. Data cleaning and aggregation
3. Train-Test-Validation split
4. Exploratory data analysis (EDA)
5. Feature engineering
   - Normalization
   - Removing autocorrelations
   - Discretization
   - PCA
   - Regularization
   - ...
6. Model selection and implementation
   - scikit-learn cheat sheet: choosing the right estimator
7. Model evaluation
8. Hyperparameter tuning
9. Model validation
10. Building ML pipelines

## ML Pipelines

- Intermediate steps of a pipeline are transformers: each must implement both **.fit()** and **.transform()**
    - Intermediate steps include: imputation, feature selection, dimension reduction, normalization, ...
- The final step of a pipeline only needs **.fit()**; it is usually the estimator (model)
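
To make the contract above concrete, here is a toy sketch in plain Python. The classes `MeanImputer`, `MeanPredictor`, and `TinyPipeline` are hypothetical names made up for illustration; this is not how scikit-learn implements `Pipeline`, only a simplified picture of why intermediate steps need both methods while the last step needs only `.fit()`.

```python
# Toy illustration of the pipeline contract (hypothetical classes, not scikit-learn code).

class MeanImputer:
    """Intermediate step: learns the mean in fit(), fills None values in transform()."""

    def fit(self, X, y=None):
        observed = [x for x in X if x is not None]
        self.mean_ = sum(observed) / len(observed)
        return self

    def transform(self, X):
        return [self.mean_ if x is None else x for x in X]


class MeanPredictor:
    """Final step: fit() is all the pipeline requires during training."""

    def fit(self, X, y):
        self.prediction_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.prediction_ for _ in X]


class TinyPipeline:
    """Chains steps: every step but the last is fitted, then used to transform the data."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        *intermediate, final = self.steps
        for step in intermediate:
            X = step.fit(X, y).transform(X)  # needs both fit() and transform()
        final.fit(X, y)                      # needs only fit()
        return self

    def predict(self, X):
        *intermediate, final = self.steps
        for step in intermediate:
            X = step.transform(X)            # reuse what was learned during fit()
        return final.predict(X)


pipe = TinyPipeline([MeanImputer(), MeanPredictor()])
pipe.fit([1.0, None, 3.0], [10.0, 20.0, 30.0])

imputed = pipe.steps[0].transform([None, 5.0])
predictions = pipe.predict([None, 5.0])
print(imputed, predictions)
```

The output of each intermediate `.transform()` becomes the input of the next step, which is why a step without `.transform()` can only sit at the end of the chain.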

### Import Packages

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, StandardScaler

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import r2_score, mean_squared_error
```

### Load Dataset

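```python
# Fix the seed so the synthetic data is reproducible
np.random.seed(42)

# Build a synthetic dataset: two numerical features, two categorical
# features, and a target drawn independently of all features
num_records = 1000
df = pd.DataFrame({
    'feature_1' : np.random.uniform(1, 10, num_records),
    'feature_2' : np.random.randint(1, 101, num_records),
    'feature_3' : np.random.choice(['Yes','No'], num_records),
    'feature_4' : np.random.choice(['A','B','C','D'], num_records),
    'target' : np.random.uniform(-50, 50, num_records)})

# Blank out a random 10% of the rows in every column to simulate missing data
for col in df.columns:
    nan_indices = np.random.choice(df[col].shape[0], size=int(0.1*num_records), replace=False)
    df.loc[nan_indices, col] = np.nan

df.head()
```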

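| | feature_1 | feature_2 | feature_3 | feature_4 | target |
|---|---|---|---|---|---|
| 0 | 4.370861 | 47.0 | Yes | A | 46.407842 |
| 1 | 9.556429 | 12.0 | Yes | B | NaN |
| 2 | NaN | 62.0 | Yes | A | 18.701927 |
| 3 | 6.387926 | NaN | Yes | B | -0.618596 |
| 4 | 2.404168 | NaN | No | NaN | NaN |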


### Imputation Pipeline

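```python
# Define the numerical and categorical columns.
# Note: 'target' is imputed here and imputation runs before the split to keep
# the demo short; in practice, rows with a missing target are usually dropped
# and imputers are fitted on the training set only to avoid data leakage.
num_cols = ['feature_1', 'feature_2', 'target']
cat_cols = ['feature_3', 'feature_4']

# Create pipelines for numerical and categorical features
num_pipeline = Pipeline([
    ('num_imputer', SimpleImputer(strategy='mean')),
])
cat_pipeline = Pipeline([
    ('cat_imputer', SimpleImputer(strategy='most_frequent')),
    # sparse_output requires scikit-learn >= 1.2
    ('onehot', OneHotEncoder(drop='first', sparse_output=False)),
])

# Create a column transformer that routes each column group to its pipeline
col_T = ColumnTransformer(transformers=[
    ('num_pipeline', num_pipeline, num_cols),
    ('cat_pipeline', cat_pipeline, cat_cols),
])

# Fit and apply the transformer (the result is a NumPy array)
df_imputed = col_T.fit_transform(df)

# Recover the one-hot column names and convert the result back into a DataFrame
onehot = col_T.named_transformers_['cat_pipeline'].named_steps['onehot']
cat_names = onehot.get_feature_names_out(cat_cols)
df_imputed = pd.DataFrame(df_imputed, columns = num_cols + list(cat_names))

df_imputed.head()
```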

| | feature_1 | feature_2 | target | feature_3_Yes | feature_4_B | feature_4_C | feature_4_D |
|---|---|---|---|---|---|---|---|
| 0 | 4.370861 | 47.0 | 46.407842 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 9.556429 | 12.0 | -0.084782 | 1.0 | 1.0 | 0.0 | 0.0 |
| 2 | 5.448955 | 62.0 | 18.701927 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 6.387926 | 49.433333 | -0.618596 | 1.0 | 1.0 | 0.0 | 0.0 |
| 4 | 2.404168 | 49.433333 | -0.084782 | 0.0 | 0.0 | 1.0 | 0.0 |


### Train-Test Split

```python
df_train, df_test = train_test_split(df_imputed, test_size=0.2, random_state=42)

print(df_train.shape)
print(df_test.shape)
```

```text
(800, 7)
(200, 7)
```


### Preprocessing Pipeline

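- **Min-Max Normalization**
    - (value - min)/(max - min), which maps each feature to the range [0, 1]
    - Sensitive to outliers: one extreme value compresses the remaining values into a narrow band
- **Z-Score Standardization**
    - (value - $\mu$)/$\sigma$, which gives each feature zero mean and unit variance
    - Does not bound values to a fixed range, so features are not guaranteed to share the same minimum and maximum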
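
The trade-off between the two scalings can be seen on a small worked example. The sketch below uses only the standard library and a made-up list of values containing one outlier; it uses the population standard deviation (`pstdev`), which matches what `StandardScaler` computes.

```python
from statistics import mean, pstdev

values = [1.0, 2.0, 3.0, 4.0, 100.0]  # hypothetical data; 100.0 is the outlier

# Min-max normalization: (value - min) / (max - min)
lo, hi = min(values), max(values)
minmax = [(v - lo) / (hi - lo) for v in values]

# Z-score standardization: (value - mu) / sigma
mu, sigma = mean(values), pstdev(values)
zscore = [(v - mu) / sigma for v in values]

print([round(v, 3) for v in minmax])
print([round(v, 3) for v in zscore])
```

With min-max scaling the outlier takes the value 1 and the four regular points are squeezed below 0.04. The z-scores have zero mean and unit variance, but their minimum and maximum depend on the data rather than on a fixed range.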

```python
# Create a column transformer for data normalization
preprocessor = ColumnTransformer(transformers=[
    ('minmax', MinMaxScaler(), ['feature_1', 'feature_2']),
    ('zscore', StandardScaler(), ['target']),
], remainder='passthrough')

# Fit the scalers on the training set only, then reuse them on the test set
df_train_T = preprocessor.fit_transform(df_train)
df_test_T = preprocessor.transform(df_test)

# Convert the result back into a DataFrame.
# The output order is: minmax columns, zscore columns, then passthrough columns,
# which happens to match the original column order of df_train here.
df_train_T = pd.DataFrame(df_train_T, columns=df_train.columns)
df_test_T = pd.DataFrame(df_test_T, columns=df_test.columns)

df_train_T.head()
```

| | feature_1 | feature_2 | target | feature_3_Yes | feature_4_B | feature_4_C | feature_4_D |
|---|---|---|---|---|---|---|---|
| 0 | 0.042025 | 0.489226 | 0.951841 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.492115 | 0.171717 | -0.237947 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0.608981 | 0.444444 | 0.146212 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 0.02427 | 0.30303 | -0.730968 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.914709 | 0.489226 | -0.034902 | 1.0 | 1.0 | 0.0 | 0.0 |


```python
X_train = df_train_T.drop('target', axis=1)
X_test = df_test_T.drop('target', axis=1)
y_train = df_train_T['target']
y_test = df_test_T['target']
```

### Model Pipeline

- To reference a hyperparameter of a step inside a model pipeline, join the names with double underscores: **pipeline_step_name + '__' + hyperparameter**

- **Regularization**: prevents overfitting by adding a penalty on the coefficients $\theta_i$ to the unregularized loss $\text{loss}_0$; $\lambda$ sets the penalty strength (the `alpha` parameter of `Lasso` and `Ridge`)
    - L1-norm (Lasso)
        $$\text{loss} = \text{loss}_0 + \lambda \sum_{i=1}^{n} |\theta_i|$$
    - L2-norm (Ridge)
        $$\text{loss} = \text{loss}_0 + \lambda \sum_{i=1}^{n} \theta_i^2$$
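
As a small numeric illustration of the two penalty terms, the sketch below evaluates both for one coefficient vector. The coefficients, the penalty strength `lam`, and the unregularized loss `base_loss` are hypothetical values chosen for the example; the helper functions are not part of scikit-learn.

```python
def l1_penalty(theta, lam):
    """Lasso penalty: lam times the sum of absolute coefficient values."""
    return lam * sum(abs(t) for t in theta)


def l2_penalty(theta, lam):
    """Ridge penalty: lam times the sum of squared coefficient values."""
    return lam * sum(t ** 2 for t in theta)


theta = [0.5, -2.0, 0.0, 3.0]  # hypothetical model coefficients
lam = 0.1                      # hypothetical penalty strength
base_loss = 1.25               # hypothetical unregularized loss

lasso_loss = base_loss + l1_penalty(theta, lam)
ridge_loss = base_loss + l2_penalty(theta, lam)

print(lasso_loss, ridge_loss)
```

Because the L2 penalty squares each coefficient, the large coefficient 3.0 dominates it, whereas the L1 penalty grows only linearly with coefficient size.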

```python
# Ensemble that averages the predictions of three linear models
voting_regr = VotingRegressor(estimators=[
    ('lr', LinearRegression()),
    ('lasso', Lasso()),
    ('ridge', Ridge()),
])

# Multiple models pipeline with grid search
model_pipeline = Pipeline(steps=[
    ('voting_regr', voting_regr),
])

# A list of dicts defines separate grids: each dict varies one hyperparameter
# while the others keep their defaults (2 + 4 + 4 = 10 candidates in total)
search_space = [
    {'voting_regr__lr__fit_intercept': [True, False]},
    {'voting_regr__lasso__alpha': [0.01, 0.1, 1, 10]},
    {'voting_regr__ridge__alpha': [0.01, 0.1, 1, 10]},
]

gs = GridSearchCV(model_pipeline, param_grid=search_space, scoring='neg_mean_squared_error', cv=5)

# Run the 5-fold cross-validated search, then predict with the refitted best model
gs.fit(X_train, y_train)
y_pred = gs.best_estimator_.predict(X_test)

print('Best model:', gs.best_estimator_.named_steps['voting_regr'])
print('Best hyperparameters:', gs.best_params_)
print(r2_score(y_test, y_pred))            # R^2 on the test set
print(mean_squared_error(y_test, y_pred))  # MSE on the test set
```

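```text
Best model: VotingRegressor(estimators=[('lr', LinearRegression()), ('lasso', Lasso()),
                            ('ridge', Ridge(alpha=10))])
Best hyperparameters: {'voting_regr__ridge__alpha': 10}
-0.03674353999220692
1.0353195262271795
```

The last two lines are the test-set R² and MSE. An R² close to zero is expected here because the synthetic target was sampled independently of the features, so there is no signal for the models to learn. The MSE is close to 1 because the target was standardized to unit variance.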
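
The double-underscore names used in `search_space` do not have to be guessed: `get_params()` lists every addressable name, and `set_params()` accepts the same names. The sketch below rebuilds the same pipeline structure without any data; the filtering by prefix is only for readability.

```python
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.pipeline import Pipeline

pipe = Pipeline(steps=[
    ('voting_regr', VotingRegressor(estimators=[
        ('lr', LinearRegression()),
        ('lasso', Lasso()),
        ('ridge', Ridge()),
    ])),
])

# Names are built as <pipeline step>__<sub-estimator>__<hyperparameter>;
# keep only the ones that address the ridge sub-estimator
ridge_keys = sorted(k for k in pipe.get_params() if k.startswith('voting_regr__ridge__'))
print(ridge_keys)

# The same name can be used to set the value directly
pipe.set_params(voting_regr__ridge__alpha=10)
ridge_alpha = pipe.get_params()['voting_regr__ridge__alpha']
print(ridge_alpha)
```

`GridSearchCV` applies each candidate from the grid through the same `set_params()` mechanism, which is why the keys in `search_space` must match these names exactly.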
