## New feature

In this section, we'll see how adding new **synthetic features** to a dataset can improve the predictive performance of a random forest model.

The Concrete dataset contains a variety of concrete formulations and the resulting product's compressive strength, which is a measure of how much load that kind of concrete can bear. The task for this dataset is to predict a concrete's compressive strength given its formulation.

In [13]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

In [14]:
df = pd.read_csv(r"C:\Users\Fujitsu\Desktop\concrete_data.csv")
df.head()

| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age | Strength |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 | 79.99 |
| 1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 | 61.89 |
| 2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270 | 40.27 |
| 3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 | 41.05 |
| 4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360 | 44.3 |


You can see here the various ingredients going into each variety of concrete. We'll see in a moment how adding some additional synthetic features derived from these can help a model to learn important relationships among them.

We'll first establish a baseline by training the model on the un-augmented dataset. This will help us determine whether our new features are actually useful.

The **MAE score** (mean absolute error) is the average of the absolute error values. The difference between an expected value and a predicted value can be positive or negative, but taking its absolute value makes every term non-negative, so over- and under-predictions cannot cancel each other out. A lower MAE means a better model.
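To make the definition concrete, here is a minimal sketch that computes the MAE by hand. The helper name `mean_absolute_error_by_hand` and the three strength values are made up for illustration; they are not taken from the Concrete dataset.

```python
def mean_absolute_error_by_hand(y_true, y_pred):
    """Average of |expected - predicted| over all samples."""
    absolute_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute_errors) / len(absolute_errors)


# Hypothetical expected and predicted strengths (illustrative values only)
y_true = [40.0, 50.0, 60.0]
y_pred = [43.0, 48.0, 60.0]

# Signed errors are -3, +2 and 0; they partly cancel when averaged directly
mean_signed_error = sum(t - p for t, p in zip(y_true, y_pred)) / len(y_true)

# Absolute errors are 3, 2 and 0; nothing cancels
mae_value = mean_absolute_error_by_hand(y_true, y_pred)

print(f"Mean signed error: {mean_signed_error:.3f}")
print(f"MAE: {mae_value:.3f}")
```

The signed average looks better than the model really is, which is why the absolute value is used.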

In [15]:
X = df.copy()
y = X.pop("Strength") # setting target variable

# Train and score baseline model
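# The MAE criterion is named "absolute_error" in scikit-learn 1.0+ (it was "mae" in older releases)
baseline = RandomForestRegressor(criterion="absolute_error", random_state=0)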
baseline_score = cross_val_score(
    baseline, X, y, cv=5, scoring="neg_mean_absolute_error"
)
baseline_score = -1 * baseline_score.mean()

print(f"MAE Baseline Score: {baseline_score:.4}")

MAE Baseline Score: 8.232
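The cell above relies on two conventions: `cv=5` splits the data into five folds, and the `"neg_mean_absolute_error"` scorer returns the MAE with its sign flipped so that higher is always better, which is why the mean is multiplied by -1. The sketch below imitates that procedure in plain Python. It is only a conceptual illustration, not scikit-learn's implementation: the helper names, the mean-predicting stand-in model, and the ten target values are all assumptions made for the example.

```python
def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous, unshuffled folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_neg_mae(y, k=5):
    """Return one negated MAE per fold for a stand-in model that predicts the training mean."""
    scores = []
    for test_idx in k_fold_indices(len(y), k):
        held_out = set(test_idx)
        train = [y[i] for i in range(len(y)) if i not in held_out]
        prediction = sum(train) / len(train)  # "fit" on the other k-1 folds
        test = [y[i] for i in test_idx]
        scores.append(-mae(test, [prediction] * len(test)))  # negated: higher is better
    return scores


# Hypothetical strength values (illustrative only)
y_toy = [80.0, 62.0, 40.0, 41.0, 44.0, 47.0, 36.0, 45.0, 39.0, 52.0]

fold_scores = cross_val_neg_mae(y_toy, k=5)
toy_mae = -1 * sum(fold_scores) / len(fold_scores)  # flip the sign back, as in the cell above

print(f"Toy cross-validated MAE: {toy_mae:.4}")
```

Each sample is used for scoring exactly once, so the averaged result is less sensitive to one lucky or unlucky split than a single train/test split would be.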



In [16]:
# Add three new ratio features to the dataset

X = df.copy()
y = X.pop("Strength")

# Create synthetic features
X["FC Ratio"] = X["Fine Aggregate"] / X["Coarse Aggregate"]
X["Agg Cmt Ratio"] = (X["Coarse Aggregate"] + X["Fine Aggregate"]) / X["Cement"]
X["Wtr Cmt Ratio"] = X["Water"] / X["Cement"]

# Train and score model on dataset with additional ratio features
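# The MAE criterion is named "absolute_error" in scikit-learn 1.0+ (it was "mae" in older releases)
model = RandomForestRegressor(criterion="absolute_error", random_state=0)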
score = cross_val_score(
    model, X, y, cv=5, scoring="neg_mean_absolute_error"
)
score = -1 * score.mean()

print(f"MAE Score with Ratio Features: {score:.4}")

MAE Score with Ratio Features: 7.948



And sure enough, performance improved: the MAE dropped from 8.232 to 7.948. This is evidence that these new ratio features exposed important information to the model that it wasn't detecting before.
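One plausible reason a ratio helps is that each split in a tree thresholds a single column, so a relationship that depends on one column divided by another can only be approximated with many splits on the raw columns. The toy sketch below illustrates this with a single split (a decision stump). Everything in it is an assumption for illustration: the data is randomly generated, the target is defined to depend only on the water-to-cement ratio (real concrete is more complicated), and `best_stump_mae` is a hypothetical helper, not part of scikit-learn.

```python
import random


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def median(values):
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def best_stump_mae(feature, target):
    """MAE of the best single-threshold split on one feature.

    Each side of the split predicts the median of its targets,
    which is the constant that minimizes absolute error.
    """
    pairs = sorted(zip(feature, target))
    targets = [t for _, t in pairs]
    best = float("inf")
    for i in range(1, len(targets)):
        left, right = targets[:i], targets[i:]
        preds = [median(left)] * len(left) + [median(right)] * len(right)
        best = min(best, mae(targets, preds))
    return best


rng = random.Random(0)
water = [rng.uniform(120, 250) for _ in range(200)]
cement = [rng.uniform(100, 540) for _ in range(200)]
ratio = [w / c for w, c in zip(water, cement)]

# Toy assumption: the target depends only on the water-to-cement ratio
toy_strength = [100 - 40 * r for r in ratio]

stump_scores = {
    "Water": best_stump_mae(water, toy_strength),
    "Cement": best_stump_mae(cement, toy_strength),
    "Wtr Cmt Ratio": best_stump_mae(ratio, toy_strength),
}

for name, value in stump_scores.items():
    print(f"{name}: best single-split MAE = {value:.3f}")
```

In this constructed setting, one split on the ratio column does at least as well as one split on either raw column. A full random forest can combine many splits, so the gap on real data is smaller, as the scores above show.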