### Dataset Characteristics & Design Decisions

The AI4I 2020 dataset contains **one independent observation per machine (UDI)**.
Unlike datasets such as CMAPSS, there is **no temporal sequence, cycle index, or historical sensor trace** per machine.

Therefore:
- Lag features and rolling-window statistics are **not meaningful**: the rows have no temporal order to lag or roll over
- Artificial time indices or pseudo rolling windows would inject spurious signal

To remain **methodologically correct**, this notebook focuses on:
- Physics-based interaction features
- Nonlinear transformations
- Degradation and stress proxies
- Encoded categorical indicators

This keeps every feature tied to a single observation, avoids fabricated temporal structure, and matches deployment, where a prediction is made from one set of current readings.
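
The static-data assumption can be checked directly. The following is a minimal sketch, assuming a DataFrame with a `UDI` column; the helper name `has_repeated_units` and the toy frame are hypothetical illustrations, not part of the notebook's pipeline:

```python
import pandas as pd

def has_repeated_units(frame: pd.DataFrame, id_col: str = "UDI") -> bool:
    """Return True if any identifier occurs more than once (a per-unit history exists)."""
    return bool(frame[id_col].duplicated().any())

# Toy stand-in for the raw data: one row per identifier
toy = pd.DataFrame({"UDI": [1, 2, 3], "Torque [Nm]": [42.8, 46.3, 49.4]})
print(has_repeated_units(toy))
```

If this returned `True` on the real data, there would be repeated units and lag features might be worth revisiting.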


In [316]:
import pandas as pd
import numpy as np

df = pd.read_csv("../data/raw/ai4i2020.csv")
print(df.shape)
df.head()


(10000, 14)


Unnamed: 0,UDI,Product ID,Type,Air temperature [K],Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure,TWF,HDF,PWF,OSF,RNF
0,1,M14860,M,298.1,308.6,1551,42.8,0,0,0,0,0,0,0
1,2,L47181,L,298.2,308.7,1408,46.3,3,0,0,0,0,0,0
2,3,L47182,L,298.1,308.5,1498,49.4,5,0,0,0,0,0,0
3,4,L47183,L,298.2,308.6,1433,39.5,7,0,0,0,0,0,0
4,5,L47184,L,298.2,308.7,1408,40.0,9,0,0,0,0,0,0


In [317]:
df.columns = (
    df.columns
    .str.strip()
    .str.replace(" ", "_")
    .str.replace("[\\(\\)\\[\\]]", "", regex=True)
)
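
To make the effect of this cleaning step concrete, here is a small sketch that applies the same chain to a few header strings. The list `raw_names` is a hand-picked illustration (including one name padded with spaces), not the full header:

```python
import pandas as pd

raw_names = pd.Index(
    ["Air temperature [K]", "Rotational speed [rpm]", "Product ID", " Torque [Nm] "]
)

clean_names = (
    raw_names
    .str.strip()                               # drop leading/trailing whitespace
    .str.replace(" ", "_", regex=False)        # spaces -> underscores
    .str.replace(r"[()\[\]]", "", regex=True)  # remove brackets and parentheses
)
print(list(clean_names))
```

The unit suffix survives as part of the name (for example `Torque_Nm`), which is why later cells refer to columns such as `Rotational_speed_rpm`.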


### Target Variable

`Machine_failure` is used as the binary target.

Because AI4I provides no future cycles, the task is framed as:
> **Predicting failure risk given the current machine condition**

This is a valid and commonly accepted formulation for static industrial datasets.


In [318]:
df["label"] = df["Machine_failure"].astype(int)


In [319]:
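# Mechanical load & energy
# Shaft power = torque * angular velocity; convert rpm to rad/s (2*pi/60), then W to kW
df["Power_kw"] = df["Torque_Nm"] * df["Rotational_speed_rpm"] * 2 * np.pi / 60 / 1000
df["Combined_Energy"] = df["Power_kw"] * df["Tool_wear_min"]  # kW * min, a cumulative-load proxy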

# Thermal stress
df["Temp_Delta"] = df["Process_temperature_K"] - df["Air_temperature_K"]
df["High_Temp_Flag"] = (
    df["Process_temperature_K"] > df["Process_temperature_K"].median()
).astype(int)

# Wear–load interactions
df["Wear_x_Torque"] = df["Tool_wear_min"] * df["Torque_Nm"]
df["Stress_Index"] = df["Tool_wear_min"] * df["Rotational_speed_rpm"]
df["Torque_per_Wear"] = df["Torque_Nm"] / (df["Tool_wear_min"] + 1)

# Speed–temperature interaction
df["Speed_x_Temp"] = df["Rotational_speed_rpm"] * df["Process_temperature_K"]
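
For reference, physical shaft power is torque times angular velocity in rad/s, so an rpm reading needs a factor of 2π/60 before the product is in watts. A minimal sketch of that conversion follows; `shaft_power_kw` is a hypothetical helper name used only for illustration:

```python
import numpy as np

def shaft_power_kw(torque_nm, speed_rpm):
    """Mechanical shaft power in kW: P = torque * angular velocity."""
    omega_rad_s = np.asarray(speed_rpm, dtype=float) * 2.0 * np.pi / 60.0  # rpm -> rad/s
    return np.asarray(torque_nm, dtype=float) * omega_rad_s / 1000.0       # W -> kW

# First sample row shown earlier: 42.8 Nm at 1551 rpm
print(round(float(shaft_power_kw(42.8, 1551)), 3))
```

The function also accepts whole columns, for example `shaft_power_kw(df["Torque_Nm"], df["Rotational_speed_rpm"])`.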


In [320]:
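# Nonlinear transformations of the raw sensor readings
df["Torque_sq"] = df["Torque_Nm"] ** 2
df["Speed_sq"] = df["Rotational_speed_rpm"] ** 2
df["Temp_Squared"] = df["Process_temperature_K"] ** 2
# log1p is defined at zero, which matters because Tool_wear_min can be 0
df["Log_Tool_Wear"] = np.log1p(df["Tool_wear_min"])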


In [321]:
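# Degradation proxy: quartile of tool wear (0 = least worn, 3 = most worn).
# Note that the bin edges are computed over the full dataset.
df["Wear_Bin"] = pd.qcut(df["Tool_wear_min"], q=4, labels=False)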
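
Quantile edges from `pd.qcut` depend on whichever rows are passed in. If the data is later split for evaluation, one option is to derive the edges from the training rows only and reuse them elsewhere. The sketch below illustrates that idea under assumed names (`fit_quartile_edges`, `apply_bins`) and toy values; it is not what the cell above does:

```python
import numpy as np
import pandas as pd

def fit_quartile_edges(train_values: pd.Series) -> np.ndarray:
    """Quartile edges from training data only, with open-ended outer bins."""
    inner = np.quantile(train_values.to_numpy(dtype=float), [0.25, 0.5, 0.75])
    return np.concatenate(([-np.inf], inner, [np.inf]))

def apply_bins(values: pd.Series, edges: np.ndarray) -> pd.Series:
    """Assign integer bin codes 0..3 using pre-computed edges."""
    return pd.cut(values, bins=edges, labels=False)

train_wear = pd.Series([0, 50, 100, 150, 200, 250])  # toy training values
new_wear = pd.Series([10, 120, 300])                 # toy unseen values

edges = fit_quartile_edges(train_wear)
print(apply_bins(new_wear, edges).tolist())
```

The open-ended outer bins mean unseen values outside the training range still receive a bin instead of `NaN`.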


### Identifier Columns & Leakage Prevention

The dataset contains identifier columns:
- `UDI`
- `Product_ID`

These identifiers are:
- **Retained** in the saved feature file for dashboard filtering and inspection
- **Excluded** from model training, because they carry no physical signal and invite overfitting and leakage

They stay in the saved file but must be dropped before the model feature matrix `X` is built.
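
One way to enforce this at training time is sketched below. The helper `split_features_target`, the constant names, and the one-hot encoding of `Type` are assumptions made for illustration; the training notebook may do this differently:

```python
import pandas as pd

ID_COLS = ["UDI", "Product_ID"]  # identifiers: kept for inspection, never modeled
TARGET = "label"

def split_features_target(frame: pd.DataFrame):
    """Return (X, y) with identifiers and the target removed from X."""
    X = frame.drop(columns=ID_COLS + [TARGET])
    # One-hot encode the product type so X is fully numeric (an assumed choice)
    X = pd.get_dummies(X, columns=["Type"], prefix="Type")
    y = frame[TARGET]
    return X, y

# Toy frame with the same column names as the engineered dataset
toy = pd.DataFrame({
    "UDI": [1, 2, 3],
    "Product_ID": ["M14860", "L47181", "L47182"],
    "Type": ["M", "L", "L"],
    "Torque_Nm": [42.8, 46.3, 49.4],
    "label": [0, 0, 0],
})
X, y = split_features_target(toy)
print(X.columns.tolist())
```

Keeping the identifier list in one constant makes it easy to assert, before fitting, that none of those names appear in `X.columns`.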


In [322]:
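# Drop columns that encode the outcome directly.
# Identifiers (UDI, Product_ID) and `label` remain in the saved file.
df_model = df.drop(
    columns=[
        "Machine_failure",  # original target, duplicated by `label`
        "TWF", "HDF", "PWF", "OSF", "RNF",  # failure mode flags (target leakage)
    ]
)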


In [323]:
feature_cols = df_model.columns.tolist()
print("Saved column count (features + identifiers + label):", len(feature_cols))
feature_cols


Saved column count (features + identifiers + label): 22


['UDI',
 'Product_ID',
 'Type',
 'Air_temperature_K',
 'Process_temperature_K',
 'Rotational_speed_rpm',
 'Torque_Nm',
 'Tool_wear_min',
 'label',
 'Power_kw',
 'Combined_Energy',
 'Temp_Delta',
 'High_Temp_Flag',
 'Wear_x_Torque',
 'Stress_Index',
 'Torque_per_Wear',
 'Speed_x_Temp',
 'Torque_sq',
 'Speed_sq',
 'Temp_Squared',
 'Log_Tool_Wear',
 'Wear_Bin']

In [324]:
df_model.to_csv("../data/processed/ai4i2020_features.csv", index=False)
print("✔ Engineered feature dataset saved successfully.")


✔ Engineered feature dataset saved successfully.


### Feature Engineering Summary

✔ AI4I dataset treated correctly as **static (non–time-series)**  
✔ No artificial rolling or lag features introduced  
✔ 13 engineered features created, including:
- Load, stress, and thermal interaction terms
- Nonlinear transformations
- Degradation proxies
- Categorical risk indicators

✔ Identifier columns retained in the saved file for dashboard use only  
✔ Failure-mode flags (`TWF`, `HDF`, `PWF`, `OSF`, `RNF`) dropped to prevent target leakage; identifiers to be excluded from `X` at training time  
✔ Fully reproducible pipeline from raw data to model-ready features  

**Next step → `03_Model_Training.ipynb`**
