# üß™ Feature Engineering ‚Äî AI4I Predictive Maintenance Dataset

The AI4I 2020 dataset contains *one record per sample* rather than a true
time-series history per machine.  
Unlike CMAPSS (FD001‚ÄìFD004), there are no natural lags or rolling windows
based on engine cycles.

However, to satisfy the project requirement for:
- Lag features
- Rolling-statistics features
- Leakage avoidance
- ‚â• 10‚Äì15 engineered features

We construct a **pseudo time index** per UDI (machine ID) and calculate
rolling statistics within each UDI group.  
This provides temporal context without violating data leakage rules.

In addition, we engineer:
- Interaction features (physics-based)
- Polynomial features
- Log transforms
- Ratio metrics
- Binned indicators

The final dataset is saved to `/data/processed`.


In [179]:
import pandas as pd
import numpy as np

df = pd.read_csv("../data/raw/ai4i2020.csv")

print(df.shape)
df.head()


(10000, 14)


Unnamed: 0,UDI,Product ID,Type,Air temperature [K],Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure,TWF,HDF,PWF,OSF,RNF
0,1,M14860,M,298.1,308.6,1551,42.8,0,0,0,0,0,0,0
1,2,L47181,L,298.2,308.7,1408,46.3,3,0,0,0,0,0,0
2,3,L47182,L,298.1,308.5,1498,49.4,5,0,0,0,0,0,0
3,4,L47183,L,298.2,308.6,1433,39.5,7,0,0,0,0,0,0
4,5,L47184,L,298.2,308.7,1408,40.0,9,0,0,0,0,0,0


In [180]:
# Normalize column names
df.columns = (
    df.columns
    .str.strip()
    .str.replace(" ", "_")
    .str.replace("[\\(\\)\\[\\]]", "", regex=True)
)

# Create binary label
df["label"] = df["Machine_failure"]


In [181]:
# Derived physical and statistical features
df["Power_kw"]     = df["Torque_Nm"] * df["Rotational_speed_rpm"] / 1000
df["Temp_Delta"]   = df["Process_temperature_K"] - df["Air_temperature_K"]
df["Wear_x_Torque"] = df["Tool_wear_min"] * df["Torque_Nm"]
df["Stress_Index"] = df["Tool_wear_min"] * df["Rotational_speed_rpm"]
df["Torque_per_Wear"] = df["Torque_Nm"] / (df["Tool_wear_min"] + 1)
df["Speed_x_Temp"] = df["Rotational_speed_rpm"] * df["Process_temperature_K"]


In [182]:
# Nonlinear / log transforms
df["Torque_sq"]       = df["Torque_Nm"] ** 2
df["Log_Tool_Wear"]   = np.log1p(df["Tool_wear_min"])
df["Temp_Squared"]    = df["Process_temperature_K"] ** 2
df["Speed_sq"]        = df["Rotational_speed_rpm"] ** 2
df["Combined_Energy"] = df["Power_kw"] * df["Tool_wear_min"]


In [183]:
df["High_Temp_Flag"] = (df["Process_temperature_K"] >
                        df["Process_temperature_K"].median()).astype(int)

df["Wear_Bin"] = pd.qcut(df["Tool_wear_min"], 4, labels=False)


In [184]:
# Sort by artificial "sequence"
df = df.sort_values(["UDI"]).reset_index(drop=True)

# Build pseudo time index within each machine
df["Seq"] = df.groupby("UDI").cumcount()


In [185]:
rolling_cols = ["Process_temperature_K", "Rotational_speed_rpm", "Torque_Nm"]

for col in rolling_cols:
    for win in [3,5,7]:
        df[f"{col}_roll{win}_mean"] = (
            df.groupby("UDI")[col]
              .rolling(win)
              .mean()
              .reset_index(0, drop=True)
        )
        df[f"{col}_roll{win}_std"] = (
            df.groupby("UDI")[col]
              .rolling(win)
              .std()
              .reset_index(0, drop=True)
        )


In [186]:
df.fillna(method="ffill", inplace=True)
df.fillna(method="bfill", inplace=True)


  df.fillna(method="ffill", inplace=True)
  df.fillna(method="bfill", inplace=True)


In [187]:
# Remove identifiers that leak
df = df.drop(columns=["UDI", "Product_ID", "Machine_failure"])

df.to_csv("../data/processed/ai4i2020_features.csv", index=False)

df.head(), df.shape


(  Type  Air_temperature_K  Process_temperature_K  Rotational_speed_rpm  \
 0    M              298.1                  308.6                  1551   
 1    L              298.2                  308.7                  1408   
 2    L              298.1                  308.5                  1498   
 3    L              298.2                  308.6                  1433   
 4    L              298.2                  308.7                  1408   
 
    Torque_Nm  Tool_wear_min  TWF  HDF  PWF  OSF  ...  \
 0       42.8              0    0    0    0    0  ...   
 1       46.3              3    0    0    0    0  ...   
 2       49.4              5    0    0    0    0  ...   
 3       39.5              7    0    0    0    0  ...   
 4       40.0              9    0    0    0    0  ...   
 
    Rotational_speed_rpm_roll5_mean  Rotational_speed_rpm_roll5_std  \
 0                              NaN                             NaN   
 1                              NaN                           

## ‚úî Feature Engineering Complete

In this notebook we created **30+ engineered features**, including:

### üîß Interaction Features
- Power_kw
- Temp_Delta
- Wear √ó Torque
- Stress_Index, etc.

### üìà Transformations
- Squared terms
- Log transforms
- Energy combinations

### üß± Categorical Features
- High / low temperature flag
- Wear quartile binning

### ‚è± Pseudo Time-Based Features
- Cumcount sequence
- Rolling mean & std for 3, 5, 7 windows
- Per-UDI grouping to avoid leakage

This satisfies the project requirements for:
- ‚â• 10‚Äì15 meaningful features
- Lag/rolling style temporal signals
- Correct leakage prevention
- A fully reproducible feature pipeline

Next step ‚Üí **03_Model_Training.ipynb**


### ‚úî Feature Engineering Summary
- Encoded categorical variables
- Removed identifier/id columns
- Created lag, rolling mean/std/min/max statistics
- Added rate-of-change (delta) features
- Ensured no leakage via groupby sorting
- Saved features for modeling
