**The Goal of Feature Engineering**

The goal of feature engineering is simply to make your data better suited to the problem at hand.

Consider "apparent temperature" measures like the heat index and the wind chill. These quantities attempt to measure the perceived temperature to humans based on air temperature, humidity, and wind speed, things which we can measure directly. You could think of an apparent temperature as the result of a kind of feature engineering, an attempt to make the observed data more relevant to what we actually care about: how it actually feels outside!
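To make the analogy concrete, here is a minimal sketch that derives a wind chill column from two raw measurements. It assumes the US National Weather Service wind chill formula (air temperature in degrees Fahrenheit, wind speed in miles per hour, intended for temperatures at or below 50 °F and winds above 3 mph). The `weather` frame, its column names, and its values are hypothetical and only serve as an illustration.

```python
import pandas as pd


def wind_chill_f(temp_f, wind_mph):
    """NWS wind chill (deg F) from air temperature (deg F) and wind speed (mph)."""
    v = wind_mph ** 0.16
    return 35.74 + 0.6215 * temp_f - 35.75 * v + 0.4275 * temp_f * v


# Hypothetical raw measurements; the values are made up for illustration.
weather = pd.DataFrame({
    "TempF": [0.0, 20.0, 30.0],
    "WindMph": [15.0, 10.0, 25.0],
})

# The engineered feature combines two raw columns into one perceived value.
weather["WindChillF"] = wind_chill_f(weather["TempF"], weather["WindMph"])
print(weather.round(1))
```

The new column is closer to the quantity we care about (how cold it feels) than either raw column on its own.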

You might perform feature engineering to:

- improve a model's predictive performance
- reduce computational or data needs
- improve interpretability of the results

**Example - Concrete Formulations**

To illustrate these ideas we'll see how adding a few synthetic features to a dataset can improve the predictive performance of a random forest model.

The Concrete dataset contains a variety of concrete formulations and the resulting product's compressive strength, which is a measure of how much load that kind of concrete can bear. **The task for this dataset is to predict a concrete's compressive strength given its formulation.**

In [41]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

In [42]:
data = pd.read_csv("concrete.csv")

data.head()

Unnamed: 0,Cement,BlastFurnaceSlag,FlyAsh,Water,Superplasticizer,CoarseAggregate,FineAggregate,Age,CompressiveStrength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


You can see here the various ingredients going into each variety of concrete. We'll see in a moment how adding some additional synthetic features derived from these can help a model to learn important relationships among them.

We'll first establish a baseline by training the model on the un-augmented dataset. This will help us determine whether our new features are actually useful.

Establishing baselines like this is good practice at the start of the feature engineering process. A baseline score can help you decide whether your new features are worth keeping, or whether you should discard them and possibly try something else.
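One way to make baseline comparisons repeatable is to wrap the scoring in a small helper. The sketch below is an assumption about how that could look, not part of the original notebook: `score_dataset` is a hypothetical name, and the data comes from `make_regression` as a stand-in for the Concrete dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score


def score_dataset(X, y, model=None):
    """Return the mean cross-validated MAE for X and y (lower is better)."""
    if model is None:
        model = RandomForestRegressor(n_estimators=50, random_state=0)
    scores = cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_absolute_error"
    )
    # scikit-learn reports negated errors, so flip the sign back.
    return -scores.mean()


# Synthetic stand-in data (not the Concrete dataset).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

baseline_mae = score_dataset(X_demo, y_demo)
print(f"Baseline MAE: {baseline_mae:.3f}")
```

Calling the same helper again after adding candidate features gives a score that can be compared directly with the baseline.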

In [43]:
X = data.copy()

y = X.pop('CompressiveStrength')

X.head()

Unnamed: 0,Cement,BlastFurnaceSlag,FlyAsh,Water,Superplasticizer,CoarseAggregate,FineAggregate,Age
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360


In [44]:
# Hold out 20% of the rows for validation
X_train, X_vaild, y_train, y_vaild = train_test_split(X, y, test_size=0.2, random_state=0)

The baseline below is scored with mean absolute error (MAE), where lower values indicate better predictive accuracy. MAE measures the average absolute difference between the predicted values and the actual values, so a lower MAE means that, on average, the model's predictions are closer to the actual values.
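As a small worked example of the metric, the sketch below computes MAE by hand and with scikit-learn's `mean_absolute_error`. The four target and prediction values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up targets and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# Absolute errors are 0.5, 0.0, 1.5 and 1.0; their mean is 0.75.
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)
```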


In [45]:
X = data.copy()
y = X.pop("CompressiveStrength")

# Train and score baseline model
# "absolute_error" is the current name of the criterion formerly called "mae"
baseline = RandomForestRegressor(criterion="absolute_error", random_state=0)
baseline_score = cross_val_score(
    baseline, X, y, cv=5, scoring="neg_mean_absolute_error"
)
baseline_score = -1 * baseline_score.mean()

print(f"MAE Baseline Score: {baseline_score:.4}")

MAE Baseline Score: 8.232


In [46]:
model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=0)
model.fit(X_train, y_train)
model.score(X_vaild, y_vaild)  # R^2 on the held-out validation split

0.9201924502721118

In [47]:
num_folds = 5

# Use cross_val_score to obtain cross-validation scores
scores = cross_val_score(model, X_train, y_train, cv=num_folds, scoring='neg_mean_squared_error')
print(scores)

[-28.41658724 -22.71439776 -30.76954739 -42.48765775 -27.66658295]


These scores are negated mean squared errors, so values closer to zero are better. The second fold, with a score of approximately -22.71 (an MSE of about 22.71), is the best-performing fold, followed by the fifth and first folds with scores around -27.67 and -28.42, respectively. The third and fourth folds have more negative scores (about -30.77 and -42.49), which indicates that the model performed relatively worse on those subsets of the data.

However, it's important to consider the entire array of cross-validation scores and not just a single fold when evaluating the overall performance of your model. You might want to calculate additional statistics, such as the mean and standard deviation of the cross-validation scores, to get a better sense of the model's overall performance and variability across different subsets of the data.
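Because the sign convention is easy to misread, it can help to convert the fold scores back into errors before comparing them. The sketch below copies the five fold scores printed above, negates them to recover the MSE per fold, and takes the square root to get an RMSE in the units of the target. The variable names are illustrative.

```python
import numpy as np

# Fold scores copied from the printed output above (negated MSE per fold).
fold_scores = np.array(
    [-28.41658724, -22.71439776, -30.76954739, -42.48765775, -27.66658295]
)

mse_per_fold = -fold_scores            # flip the sign to get plain MSE
rmse_per_fold = np.sqrt(mse_per_fold)  # same units as the target

# The score closest to zero (the maximum) is the smallest error.
best_fold = int(np.argmax(fold_scores))
worst_fold = int(np.argmin(fold_scores))

print("MSE per fold: ", np.round(mse_per_fold, 2))
print("RMSE per fold:", np.round(rmse_per_fold, 2))
print("Best fold index:", best_fold, "| worst fold index:", worst_fold)
```

With zero-based indexing, the best fold is index 1 (the second fold) and the worst is index 3 (the fourth fold).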

In [48]:
import numpy as np

#scores = [-28.41658724, -22.71439776, -30.76954739, -42.48765775, -27.66658295]

mean_score = np.mean(scores)
std_dev = np.std(scores)

print("Mean Score:", mean_score)
print("Standard Deviation of Scores:", std_dev)

Mean Score: -30.410954616999696
Standard Deviation of Scores: 6.5836374710218


The mean score represents the average performance of your model across the different folds of cross-validation. In this case, the mean score is approximately -30.41, which indicates that, on average, the model's predictions have a mean squared error (MSE) of around 30.41.

The standard deviation of the scores measures the variability or spread of the model's performance across different folds. A smaller standard deviation suggests that the model's performance is relatively consistent across folds, while a larger standard deviation suggests greater variability.
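One detail worth knowing when reading this number: `np.std` computes the population standard deviation (`ddof=0`) by default. With only five folds, the sample standard deviation (`ddof=1`) is noticeably larger. The sketch below reuses the fold scores printed earlier and compares the two; which one to report is a judgment call and not something the original notebook specifies.

```python
import numpy as np

# Fold scores copied from the cross-validation output above.
fold_scores = np.array(
    [-28.41658724, -22.71439776, -30.76954739, -42.48765775, -27.66658295]
)

std_population = np.std(fold_scores)      # ddof=0, the NumPy default
std_sample = np.std(fold_scores, ddof=1)  # divides by n - 1 instead of n

# Spread relative to the size of the mean error.
relative_spread = std_population / abs(fold_scores.mean())

print(f"Population std: {std_population:.4f}")
print(f"Sample std:     {std_sample:.4f}")
print(f"Relative spread: {relative_spread:.2%}")
```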

With a standard deviation of approximately 6.58, roughly a fifth of the mean MSE, there is noticeable variability in the model's performance across folds. A mean MSE of 30.41 corresponds to a root mean squared error of about 5.5, in the same units as CompressiveStrength.

Overall, a mean score of -30.41 suggests that the model captures much of the underlying pattern in the data, and the standard deviation of 6.58 indicates how much its performance varies between folds. Whether this is good enough depends on the accuracy your task requires.


In [49]:
X = data.copy()
y = X.pop("CompressiveStrength")

# Create synthetic features
X["FCRatio"] = X["FineAggregate"] / X["CoarseAggregate"]
X["AggCmtRatio"] = (X["CoarseAggregate"] + X["FineAggregate"]) / X["Cement"]
X["WtrCmtRatio"] = X["Water"] / X["Cement"]

# Train and score model on dataset with additional ratio features
# "absolute_error" is the current name of the criterion formerly called "mae"
model = RandomForestRegressor(criterion="absolute_error", random_state=0)
score = cross_val_score(
    model, X, y, cv=5, scoring="neg_mean_absolute_error"
)
score = -1 * score.mean()

print(f"MAE Score with Ratio Features: {score:.4}")

MAE Score with Ratio Features: 7.948


And sure enough, performance improved! The cross-validated MAE dropped from 8.232 for the baseline to 7.948 with the ratio features. This is evidence that the new ratio features exposed important information to the model that it wasn't detecting before.
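A common explanation for this kind of gain, offered here as an assumption rather than something the notebook demonstrates, is that tree-based models split on one feature at a time, so a relationship that depends on a ratio of two columns is hard for them to approximate. The sketch below illustrates the idea on synthetic data where the target is, by construction, the ratio of two columns plus a little noise. The column names `A`, `B`, `ABRatio` and the helper `cv_mae` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 300

# Synthetic data: the target depends on A / B by construction.
demo = pd.DataFrame({
    "A": rng.uniform(1.0, 10.0, n),
    "B": rng.uniform(1.0, 10.0, n),
})
target = demo["A"] / demo["B"] + rng.normal(0.0, 0.05, n)


def cv_mae(X, y):
    """Mean 5-fold cross-validated MAE of a random forest."""
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    scores = cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_absolute_error"
    )
    return -scores.mean()


mae_raw = cv_mae(demo, target)

# Same data with the ratio exposed as an explicit column.
demo_ratio = demo.assign(ABRatio=demo["A"] / demo["B"])
mae_ratio = cv_mae(demo_ratio, target)

print(f"MAE with raw columns:  {mae_raw:.4f}")
print(f"MAE with ratio column: {mae_ratio:.4f}")
```

On real data the effect is usually smaller than in this constructed case, which is why comparing against a baseline matters.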

In [53]:
num_folds = 5

# Use cross_val_score to obtain cross-validation scores
scores = cross_val_score(model, X, y, cv=num_folds, scoring='neg_mean_squared_error')
print(scores)

[ -94.28856021  -82.52819244  -59.19305004  -30.31051343 -288.13298297]


In [51]:
scores.mean()

-110.8906598177281
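This mean of about -110.89 is not directly comparable with the earlier -30.41. The earlier run used a different model (`max_depth=10`, default criterion), no ratio features, and only `X_train`, whose rows had been shuffled by `train_test_split`. Here the scores come from the full `X` in its original row order, and when `cv` is an integer, `cross_val_score` uses `KFold` without shuffling for regressors. Whether row order explains part of the gap on this dataset is an assumption that the notebook does not test. The sketch below only illustrates the general effect on synthetic, deliberately sorted data, and shows how a shuffled `KFold` can be passed as `cv`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic data sorted by the feature (and therefore roughly by the target).
x = np.linspace(0.0, 10.0, 200)
X_sorted = x.reshape(-1, 1)
y_sorted = 3.0 * x + rng.normal(0.0, 0.5, x.size)

model = RandomForestRegressor(n_estimators=50, random_state=0)


def mean_cv_mse(cv):
    """Mean MSE across folds for the given cross-validation strategy."""
    scores = cross_val_score(
        model, X_sorted, y_sorted, cv=cv, scoring="neg_mean_squared_error"
    )
    return -scores.mean()


# Integer cv: contiguous, unshuffled folds.
mse_unshuffled = mean_cv_mse(5)

# Explicit KFold with shuffling: each fold samples the whole range.
mse_shuffled = mean_cv_mse(KFold(n_splits=5, shuffle=True, random_state=0))

print(f"Unshuffled folds MSE: {mse_unshuffled:.3f}")
print(f"Shuffled folds MSE:   {mse_shuffled:.3f}")
```

For a fair comparison between the baseline and the augmented features, the same model settings, rows, and fold strategy should be used for both.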