## Feature Engineering

To improve predictive performance, we add **domain-driven features** that directly capture known failure mechanisms from the AI4I 2020 dataset:

- **Temperature difference (`temp_diff`)**  
  - Formula: `process_temperature - air_temperature`  
  - Reason: Heat Dissipation Failures (HDF) occur when heat cannot escape. A small gap between process and air temperature indicates poor cooling.  

- **Power (`power`)**  
  - Formula: `torque * rotational_speed`  
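  - Reason: Power Failures (PWF) happen when torque × speed falls outside safe operating limits. This derived feature makes the relationship explicit. Because speed is kept in rpm, the value is proportional to mechanical power rather than expressed in watts.  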

- **Wear × Torque (`wear_torque`)**  
  - Formula: `tool_wear * torque`  
  - Reason: Overstrain Failures (OSF) occur when a heavily worn tool is also subjected to high torque, which can break the tool.  

- **Normalized Wear (`norm_wear`)**  
  - Formula: `tool_wear / wear_limit_by_product_type`  
  - Reason: Each product quality type (L = Low, M = Medium, H = High) is assigned a different wear limit. Dividing by that limit makes tool usage comparable across product types.  

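- **One-Hot Encoded Product Type (`type_L`, `type_M`)**  
  - Reason: Product type influences tool wear (H > M > L). Encoding product categories lets the model capture this effect. With `drop_first=True`, pandas drops the alphabetically first category (`H`), which becomes the baseline.  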
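
Which dummy columns appear depends on category order, because `drop_first=True` drops the first category in sorted order. The following is a minimal sketch on a toy frame (not the real dataset). It shows the default result and, as one possible alternative, an explicit ordering that makes `L` the baseline:

```python
import pandas as pd

# Toy frame for illustration only; the real data has 10,000 rows.
toy = pd.DataFrame({"type": ["L", "M", "H", "L"]})

# Default: categories are sorted (H, L, M), so H is dropped as the baseline.
default_cols = list(pd.get_dummies(toy, columns=["type"], drop_first=True).columns)

# Alternative: an explicit category order makes L the baseline instead.
ordered = toy.copy()
ordered["type"] = pd.Categorical(ordered["type"], categories=["L", "M", "H"])
ordered_cols = list(pd.get_dummies(ordered, columns=["type"], drop_first=True).columns)

print(default_cols)  # ['type_L', 'type_M']
print(ordered_cols)  # ['type_M', 'type_H']
```

Either baseline carries the same information; the choice only changes which category the remaining columns are compared against.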

---

👉 These features are designed to align with the **five failure modes**:  

- **TWF** = Tool Wear Failure  
- **HDF** = Heat Dissipation Failure  
- **PWF** = Power Failure  
- **OSF** = Overstrain Failure  
- **RNF** = Random Failure  

**Takeaway:**  
Good feature engineering embeds domain knowledge into the data. By mirroring the mechanics behind TWF, HDF, PWF, and OSF, we make these patterns learnable with simpler models, which can improve both accuracy and interpretability. RNF is random by definition, so no engineered feature targets it.
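
To illustrate how closely an engineered feature can track a failure mechanism, here is a rough sketch that turns two of the features into rule-based flags. The thresholds are assumptions taken from our reading of the AI4I 2020 dataset description and should be verified against the source. `rule_flags` is a hypothetical helper, and it expects the raw `type` column, so it would have to run before one-hot encoding.

```python
import pandas as pd

# Assumed thresholds (verify against the dataset description before use).
HDF_MAX_TEMP_DIFF_K = 8.6      # HDF: temperature difference below this ...
HDF_MAX_SPEED_RPM = 1380       # ... and rotational speed below this
OSF_LIMIT_MIN_NM = {"L": 11_000, "M": 12_000, "H": 13_000}  # wear x torque limit per type

def rule_flags(frame: pd.DataFrame) -> pd.DataFrame:
    """Return boolean rule-based flags derived from the engineered features."""
    out = pd.DataFrame(index=frame.index)
    out["hdf_rule"] = (frame["temp_diff"] < HDF_MAX_TEMP_DIFF_K) & (
        frame["rotational_speed_[rpm]"] < HDF_MAX_SPEED_RPM
    )
    out["osf_rule"] = frame["wear_torque"] > frame["type"].map(OSF_LIMIT_MIN_NM)
    return out

# Toy rows for illustration only.
toy = pd.DataFrame({
    "type": ["L", "M", "H"],
    "temp_diff": [8.0, 10.0, 8.0],
    "rotational_speed_[rpm]": [1300, 1500, 1500],
    "wear_torque": [5000.0, 12500.0, 12500.0],
})
flags = rule_flags(toy)
print(flags)
```

Comparing such flags with the `hdf` and `osf` label columns is one way to check whether a feature captures the intended mechanism.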

In [21]:
import pandas as pd
df = pd.read_csv("/Users/swetha/predictive-maintenance-etl-ml/data/ai4i2020.csv")
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
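
The bracketed column names used in the next cell come from this normalization step. A small sketch on an empty toy frame shows the effect; the raw headers listed here are assumed to match a subset of the headers in the AI4I 2020 CSV:

```python
import pandas as pd

# Assumed raw headers (a subset of the AI4I 2020 CSV columns).
raw_columns = [
    "UDI", "Type", "Air temperature [K]", "Rotational speed [rpm]",
    "Torque [Nm]", "Tool wear [min]", "Machine failure",
]
toy = pd.DataFrame(columns=raw_columns)

# Same chain as above: trim whitespace, lowercase, spaces to underscores.
toy.columns = toy.columns.str.strip().str.lower().str.replace(" ", "_")
normalized = list(toy.columns)
print(normalized)
```

Note that units keep their brackets and lose their capitalization, so `[K]` becomes `[k]` and `[Nm]` becomes `[nm]`.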

In [22]:
# Temperature difference (for HDF detection)
df["temp_diff"] = df["process_temperature_[k]"] - df["air_temperature_[k]"]

# Power = torque * rotational speed (for PWF detection)
# Note: speed is in rpm, so this is in Nm*rpm (proportional to power), not watts
df["power"] = df["torque_[nm]"] * df["rotational_speed_[rpm]"]

# Wear × Torque interaction (for OSF detection)
df["wear_torque"] = df["tool_wear_[min]"] * df["torque_[nm]"]

# Normalized wear by product type (vectorized lookup of the per-type limit)
wear_limits = {"l": 240, "m": 250, "h": 260}
df["norm_wear"] = df["tool_wear_[min]"] / df["type"].str.lower().map(wear_limits)

# One-hot encode product type
df = pd.get_dummies(df, columns=["type"], drop_first=True)
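
The `power` column multiplies torque by speed in rpm, so its values are in Nm·rpm rather than watts. As we understand the dataset description, the PWF rule is stated in watts with speed in rad/s. If physical units matter, the conversion could be sketched as follows. `power_w` is a hypothetical column name that is not part of the pipeline above, and the rows are toy values:

```python
import math
import pandas as pd

# One revolution is 2*pi radians and one minute is 60 seconds.
RPM_TO_RAD_PER_S = 2 * math.pi / 60

# Toy rows for illustration only.
toy = pd.DataFrame({
    "torque_[nm]": [40.0, 20.0],
    "rotational_speed_[rpm]": [1500.0, 3000.0],
})

# Mechanical power in watts = torque (Nm) * angular speed (rad/s).
toy["power_w"] = toy["torque_[nm]"] * toy["rotational_speed_[rpm]"] * RPM_TO_RAD_PER_S
print(toy["power_w"].round(1).tolist())  # both rows: 6283.2
```

Because the conversion is a constant factor, it changes the scale of the feature but not its ranking of rows, so tree-based models would be unaffected either way.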

In [23]:
df.head()  # not displayed: a notebook cell shows only its last expression
df.describe()

| | udi | air_temperature_[k] | process_temperature_[k] | rotational_speed_[rpm] | torque_[nm] | tool_wear_[min] | machine_failure | twf | hdf | pwf | osf | rnf | temp_diff | power | wear_torque | norm_wear |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 5000.5 | 300.00493 | 310.00556 | 1538.7761 | 39.98691 | 107.951 | 0.0339 | 0.0046 | 0.0115 | 0.0095 | 0.0098 | 0.0019 | 10.00063 | 59967.14704 | 4314.66455 | 0.440984 |
| std | 2886.89568 | 2.000259 | 1.483734 | 179.284096 | 9.968934 | 63.654147 | 0.180981 | 0.067671 | 0.106625 | 0.097009 | 0.098514 | 0.04355 | 1.001094 | 10193.093881 | 2826.567692 | 0.260542 |
| min | 1.0 | 295.3 | 305.7 | 1168.0 | 3.8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.6 | 10966.8 | 0.0 | 0.0 |
| 25% | 2500.75 | 298.3 | 308.8 | 1423.0 | 33.2 | 53.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.3 | 53105.4 | 1963.65 | 0.216 |
| 50% | 5000.5 | 300.1 | 310.1 | 1503.0 | 40.1 | 108.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.8 | 59883.9 | 4012.95 | 0.441667 |
| 75% | 7500.25 | 301.5 | 311.1 | 1612.0 | 46.8 | 162.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 66873.75 | 6279.0 | 0.6625 |
| max | 10000.0 | 304.5 | 313.8 | 2886.0 | 76.6 | 253.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 12.1 | 99980.4 | 16497.0 | 1.045833 |


In [24]:
df.to_csv("/Users/swetha/predictive-maintenance-etl-ml/data/ai4i2020_featured.csv", index=False)