## An Empirical Analysis of Feature Engineering for Predictive Modeling
This notebook reproduces the analysis from the following paper. The code has been updated to run on Kaggle.

Heaton, J. (2016, April). [An Empirical Analysis of Feature Engineering for Predictive Modeling](https://arxiv.org/abs/1701.07852). In *SoutheastCon 2016* (pp. 1-6). IEEE.

## Paper Abstract

Machine learning models, such as neural networks, decision trees, random forests, and gradient boosting machines, accept a feature vector, and provide a prediction.  These models learn in a supervised fashion where we provide feature vectors with the expected output.  It is common practice to engineer new features from the provided feature set.  Such engineered features will either augment or replace portions of the existing feature vector.  These engineered features are essentially calculated fields based on the values of the other features.  

Engineering such features is primarily a manual, time-consuming task.  Additionally, each type of model will respond differently to different kinds of engineered features.  This paper reports empirical research to demonstrate what kinds of engineered features are best suited to various machine learning model types.  We provide this recommendation by generating several datasets that we designed to benefit from a particular type of engineered feature.  The experiment demonstrates to what degree the machine learning model can synthesize the needed feature on its own.  If a model can synthesize a planned feature, it is not necessary to provide that feature.  The research demonstrated that the studied models do indeed perform differently with various types of engineered features. 
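
As a rough illustration of the idea in the abstract, the sketch below builds a made-up dataset whose target is the ratio of two raw features, then compares a fit on the raw features alone against a fit that also receives the engineered ratio. Everything here is an assumption for illustration: the data is synthetic, the helper `linear_fit_rmse` is hypothetical, and ordinary least squares stands in for a model that cannot synthesize a ratio on its own. It is not one of the four model types studied in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: the target is the ratio of two raw features.
x1 = rng.uniform(1.0, 10.0, size=1000)
x2 = rng.uniform(1.0, 10.0, size=1000)
y = x1 / x2


def linear_fit_rmse(features, target):
    """Fit ordinary least squares and return RMSE divided by std(target)."""
    design = np.column_stack([features, np.ones(len(target))])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = target - design @ coef
    return np.sqrt(np.mean(residual ** 2)) / np.std(target)


raw = np.column_stack([x1, x2])
engineered = np.column_stack([x1, x2, x1 / x2])  # add the calculated field

score_raw = linear_fit_rmse(raw, y)
score_engineered = linear_fit_rmse(engineered, y)
print(f"raw features only:     {score_raw:.3f}")
print(f"with engineered ratio: {score_engineered:.3f}")
```

With the ratio supplied, the linear fit reproduces the target almost exactly. Without it, the error stays well above zero, which is the kind of gap the experiments in this notebook measure for each model type.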

In [None]:
import os
import multiprocessing
import pandas as pd
import time
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor
from sklearn.metrics import mean_squared_error

PATH = '../input/tabular-feature-engineering-dataset'

TOKEN = "_train.csv"
VERBOSE = 0
THREADS = multiprocessing.cpu_count()
CYCLES = 5
FAIL_ON_NAN = False
SAMPLE = 1.0
            
# Human readable time elapsed string.
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60.
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)
            
# Build a deep neural network for the experiments.
def neural_network_regression(x_train):
    model = Sequential()
    model.add(Dense(400, input_dim=x_train.shape[1], activation='relu'))
    model.add(Dense(200, activation='relu'))
    model.add(Dense(100, activation='relu'))
    model.add(Dense(50, activation='relu'))
    model.add(Dense(25, activation='relu'))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

# Grid-search for an SVM with good C and gamma.
def svr_grid():
    param_grid = {
        'C': [1e-2, 1, 1e2],
        'gamma': [1e-1, 1, 1e1]
    }
    clf = GridSearchCV(SVR(kernel='rbf'), cv=5, verbose=VERBOSE,
                       n_jobs=THREADS, param_grid=param_grid)
    return clf

# Perform an experiment for a single model type.
def run_model(name, model, results, x_train, y_train, x_validate, y_validate):
    model_name = model.__class__.__name__

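    # Scale features for the SVM grid search. Fit the scaler on the
    # training data only, then apply the same scaling to validation data.
    if 'GridSearchCV' in model_name:
        scaler = MinMaxScaler()
        x_train = scaler.fit_transform(x_train)
        x_validate = scaler.transform(x_validate)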
    
    # Run cycles
    cycle_list = []
    for cycle_num in range(1, CYCLES + 1):
        start_time = time.time()
        if 'KerasRegressor' in model_name:

            monitor = EarlyStopping(
                monitor='val_loss', min_delta=1e-3, patience=20, verbose=VERBOSE, mode='auto')
            model.fit(x_train, y_train,
                      validation_data=(x_validate, y_validate),
                      callbacks=[monitor], verbose=VERBOSE, epochs=100000)
        else:
            model.fit(x_train, y_train)
            
        score = validate_model(model, x_validate, y_validate)
        elapsed_time = hms_string(time.time() - start_time)
        line = [name, model_name, score, np.std(y_validate), 
                np.mean(y_validate), elapsed_time]
        cycle_list.append(line)
        print(f"Cycle {cycle_num}:{line}")

    best_cycle = min(cycle_list, key=lambda k: k[2])
    print("{}(Best)".format(best_cycle))
    results.append(best_cycle)

    #writer.writerow(best_cycle)
    
# Score a fitted model on the validation set.
def validate_model(model, x_validate, y_validate):
    model_name = model.__class__.__name__
    if 'KerasRegressor' in model_name:
        pred = model.predict(x_validate, verbose=VERBOSE)
    else:
        pred = model.predict(x_validate)

    # Get the validation score
    if np.isnan(pred).any():
        if FAIL_ON_NAN:
            raise Exception("Unstable model. Can't validate.")
        score = 1e5 # a bad score
    else:
        # RMSE, normalized below by the std of the validation targets
        score = np.sqrt(mean_squared_error(y_validate, pred))
        score /= np.std(y_validate)

    return score

# Load one dataset and evaluate every model type on it.
def eval_data(name, results):
    path_train = os.path.join(PATH, f"{name}_train.csv")
    path_validate = os.path.join(PATH, f"{name}_validate.csv")
    df_train = pd.read_csv(path_train)
    df_validate = pd.read_csv(path_validate)
    if SAMPLE<1.0:
        df_train = df_train.sample(frac=SAMPLE)
        df_validate = df_validate.sample(frac=SAMPLE)
    print(f"Training size: {len(df_train)}")
    print(f"Validate size: {len(df_validate)}")
    x_cols = list(df_train.columns)
    x_cols.remove('y1')
    y_cols = ['y1']
    x_train = df_train[x_cols].values
    y_train = df_train[y_cols].values.ravel()
    x_validate = df_validate[x_cols].values
    y_validate = df_validate[y_cols].values.ravel()
    
    models = [
        svr_grid(),
        RandomForestRegressor(n_estimators=100),
        GradientBoostingRegressor(
            n_estimators=100, learning_rate=0.1, max_depth=10, random_state=0, verbose=VERBOSE),
        KerasRegressor(build_fn=neural_network_regression, x_train=x_train)
    ]
    
    for model in models:
        run_model(name, model, results, x_train, y_train, x_validate, y_validate)

# Best cycle for each (dataset, model) pair
results = []

# Find all of the tests
tests = set()
for dirname, _, filenames in os.walk(PATH):
    for filename in filenames:
        if filename.endswith(TOKEN):
            tests.add(filename[:-len(TOKEN)])
            
# Run the tests in a stable (sorted) order
start_time = time.time()
for test in sorted(tests):
    eval_data(test, results)
print(f"Total elapsed time: {hms_string(time.time() - start_time)}")
            
# Format results
df = pd.DataFrame(
    results, columns=["equation", "model", "score", "std", "mean", "time"])
df.to_csv("/kaggle/working/results.csv", index=False)

## Data Collected
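
The `score` column holds the value returned by `validate_model`: the RMSE divided by the standard deviation of the validation targets. As a reading aid, the sketch below uses made-up target values and a hypothetical helper, `normalized_rmse`, to show the two reference points on that scale. It is an illustration, not part of the experiment.

```python
import numpy as np


def normalized_rmse(y_true, y_pred):
    """RMSE divided by the (population) standard deviation of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / np.std(y_true)


# Made-up validation targets, for illustration only.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Baseline that always predicts the mean of the targets.
baseline_score = normalized_rmse(y_true, np.full_like(y_true, y_true.mean()))

# Predictions that match the targets exactly.
perfect_score = normalized_rmse(y_true, y_true.copy())

print(baseline_score, perfect_score)
```

A score near 0 means the model reproduced the target, while a score near 1 means it did no better than always predicting the mean. The value `1e5` is the sentinel that `validate_model` assigns when predictions contain NaN.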

In [None]:
df

## Support Vector Machine Results

In [None]:
df[df.model=='GridSearchCV'].plot.bar(x="equation",y="score",title="Support Vector Machine")

## Random Forest Results

In [None]:
df[df.model=='RandomForestRegressor'].plot.bar(x="equation",y="score",title="Random Forest")

## Gradient Boosted Machine Results

In [None]:
df[df.model=='GradientBoostingRegressor'].plot.bar(x="equation",y="score",title="Gradient Boosted Machine")

## Neural Network Results

In [None]:
df[df.model=='KerasRegressor'].plot.bar(x="equation",y="score",title="Neural Network")