# 1. What Is Feature Engineering
- Determine which features are the most important
- Invent new features
- Encode high-cardinality categoricals with a target encoding
- Create segmentation features with k-means clustering
- Decompose a dataset's variation into features with **PCA**

## Goal
- Improve a model's predictive performance
- Reduce computational or data needs
- Improve interpretability of the results

## A Guiding Principle of Feature Engineering
- Transform features so that they reflect the relationship between the feature and the target
- Transformations let the model learn relationships it cannot learn on its own
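
The principle above can be illustrated with a minimal sketch. It assumes hypothetical toy data (the side length and area of square plots) and a hand-written one-variable least-squares fit; neither comes from the Concrete example. A straight line cannot represent a squared relationship, but it fits perfectly once the squared feature is supplied.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def mae(xs, ys, slope, intercept):
    """Mean absolute error of the fitted line on (xs, ys)."""
    return mean(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys))


# Hypothetical toy data: the target depends on the square of the raw feature
side = [float(s) for s in range(1, 11)]
area = [s ** 2 for s in side]

# Linear fit on the raw feature: the line cannot bend, so errors remain
raw_mae = mae(side, area, *fit_line(side, area))

# Linear fit on the engineered feature side^2: the relationship becomes linear
side_sq = [s ** 2 for s in side]
eng_mae = mae(side_sq, area, *fit_line(side_sq, area))

print(f"MAE with raw feature:        {raw_mae:.2f}")
print(f"MAE with engineered feature: {eng_mae:.2f}")
```

The model is the same in both fits; only the feature changes. Tree-based models such as random forests are more flexible than a line, but they still tend to struggle with ratios and other combinations of several columns, which is what the concrete example relies on.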

## Example - Concrete Formulations
- We add synthetic features to improve the predictive performance of a random forest model
- The **Concrete** dataset contains concrete formulations; the target to predict is the resulting **compressive strength**

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

concrete_path = './resources/Concrete_Data.xls'
df = pd.read_excel(concrete_path)
df.head()

| | Cement (component 1)(kg in a m^3 mixture) | Blast Furnace Slag (component 2)(kg in a m^3 mixture) | Fly Ash (component 3)(kg in a m^3 mixture) | Water (component 4)(kg in a m^3 mixture) | Superplasticizer (component 5)(kg in a m^3 mixture) | Coarse Aggregate (component 6)(kg in a m^3 mixture) | Fine Aggregate (component 7)(kg in a m^3 mixture) | Age (day) | Concrete compressive strength(MPa, megapascals) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 | 79.986111 |
| 1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 | 61.887366 |
| 2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270 | 40.269535 |
| 3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 | 41.05278 |
| 4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360 | 44.296075 |


In [9]:
df.columns

Index(['Cement (component 1)(kg in a m^3 mixture)',
       'Blast Furnace Slag (component 2)(kg in a m^3 mixture)',
       'Fly Ash (component 3)(kg in a m^3 mixture)',
       'Water  (component 4)(kg in a m^3 mixture)',
       'Superplasticizer (component 5)(kg in a m^3 mixture)',
       'Coarse Aggregate  (component 6)(kg in a m^3 mixture)',
       'Fine Aggregate (component 7)(kg in a m^3 mixture)', 'Age (day)',
       'Concrete compressive strength(MPa, megapascals) '],
      dtype='object')

In [10]:
# baseline without feature engineering
X = df.copy()
y = X.pop('Concrete compressive strength(MPa, megapascals) ')

baseline = RandomForestRegressor(criterion='absolute_error', random_state=0)
baseline_score = cross_val_score(baseline, X, y, cv=5, scoring="neg_mean_absolute_error")
baseline_score = -1 * baseline_score.mean()

print(f"MAE Baseline Score: {baseline_score:.4}")

MAE Baseline Score: 8.397


In [12]:
# the ratio of ingredients in a recipe is usually a better predictor than the absolute amounts
X = df.copy()
y = X.pop('Concrete compressive strength(MPa, megapascals) ')

# column names are copied verbatim from df.columns (note the double spaces in some names)
fine = X['Fine Aggregate (component 7)(kg in a m^3 mixture)']
coarse = X['Coarse Aggregate  (component 6)(kg in a m^3 mixture)']
cement = X['Cement (component 1)(kg in a m^3 mixture)']
water = X['Water  (component 4)(kg in a m^3 mixture)']

X["FCRatio"] = fine / coarse                 # fine-to-coarse aggregate ratio
X["AggCmtRatio"] = (fine + coarse) / cement  # total-aggregate-to-cement ratio
X["WtrCmtRatio"] = water / cement            # water-to-cement ratio

model = RandomForestRegressor(criterion='absolute_error', random_state=0)
score = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
score = -1 * score.mean()

print(f"MAE Score with Ratio Features: {score:.4}")

MAE Score with Ratio Features: 8.01
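
To put the two printed scores side by side, the short calculation below works out the absolute and relative reduction in MAE. It is only a sketch: it uses the rounded values shown in the outputs above, and the variable names are illustrative. The result comes from a single 5-fold run with a fixed seed, so it indicates an improvement but does not measure its statistical significance.

```python
# Rounded scores copied from the printed outputs above
baseline_mae = 8.397  # MAE without engineered features
ratio_mae = 8.01      # MAE with the three ratio features

# The target is measured in MPa, so the MAE is in MPa as well
absolute_gain = baseline_mae - ratio_mae
relative_gain = absolute_gain / baseline_mae

print(f"Absolute MAE reduction: {absolute_gain:.3f} MPa")
print(f"Relative MAE reduction: {relative_gain:.1%}")
```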
