## Feature Engineering – C-MAPSS Dataset

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
# 1. Imports and Data Loading
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# DATA LOADING

# Column names: engine_id, cycle, 3 operational settings, 21 sensors
column_names = [
    "engine_id", "cycle", "op_setting_1", "op_setting_2", "op_setting_3"
] + [f"sensor_{i}" for i in range(1, 22)]

# Directory with the train files
data_dir = Path("/content/drive/MyDrive/PrognosAI_OCT25/Data/raw")

# Load all four files and add an identifier column
datasets = {}
for fd_id in range(1, 5):
    file_path = data_dir / f"train_FD00{fd_id}.txt"
    datasets[f"FD00{fd_id}"] = pd.read_csv(
        file_path, sep=r"\s+", header=None, names=column_names
    )
    datasets[f"FD00{fd_id}"]["dataset_id"] = f"FD00{fd_id}"

# Merge into a single DataFrame
df = pd.concat(datasets.values(), ignore_index=True)

print(f"Shape of the merged DataFrame: {df.shape}")
display(df.head())



Shape of the merged DataFrame: (160359, 27)


Unnamed: 0,engine_id,cycle,op_setting_1,op_setting_2,op_setting_3,sensor_1,sensor_2,sensor_3,sensor_4,sensor_5,...,sensor_13,sensor_14,sensor_15,sensor_16,sensor_17,sensor_18,sensor_19,sensor_20,sensor_21,dataset_id
0,1,1,-0.0007,-0.0004,100.0,518.67,641.82,1589.7,1400.6,14.62,...,2388.02,8138.62,8.4195,0.03,392,2388,100.0,39.06,23.419,FD001
1,1,2,0.0019,-0.0003,100.0,518.67,642.15,1591.82,1403.14,14.62,...,2388.07,8131.49,8.4318,0.03,392,2388,100.0,39.0,23.4236,FD001
2,1,3,-0.0043,0.0003,100.0,518.67,642.35,1587.99,1404.2,14.62,...,2388.03,8133.23,8.4178,0.03,390,2388,100.0,38.95,23.3442,FD001
3,1,4,0.0007,0.0,100.0,518.67,642.35,1582.79,1401.87,14.62,...,2388.08,8133.83,8.3682,0.03,392,2388,100.0,38.88,23.3739,FD001
4,1,5,-0.0019,-0.0002,100.0,518.67,642.37,1582.85,1406.22,14.62,...,2388.04,8133.8,8.4294,0.03,393,2388,100.0,38.9,23.4044,FD001


#### Aggregate Features – Mean, Std, Min, Max per Engine

This cell selects every sensor column and computes engine-wise aggregate statistics (mean, standard deviation, minimum, and maximum) over each engine's full run. Grouping by engine_id condenses the per-cycle time series into one row of static descriptors per group, and flattening the multi-level column index yields readable names such as sensor_1_mean. Note that each C-MAPSS file numbers its engines from 1, so in the merged DataFrame engine_id alone does not identify a unique engine: the 260 rows reported below pool engines that share an ID across FD001 to FD004. Grouping by both dataset_id and engine_id would keep them separate.

These aggregates matter for C-MAPSS because they compress noisy per-cycle measurements into a compact summary of each engine's history. Such static features can complement per-cycle inputs to models that predict Remaining Useful Life (RUL) or detect faults: they are low-dimensional, easy to interpret, and reflect long-run operating characteristics.

In [6]:
# Sensor columns
sensor_cols = [col for col in df.columns if 'sensor_' in col]

# Engine-wise aggregate features (static for each engine)
engine_aggs = df.groupby('engine_id')[sensor_cols].agg(['mean', 'std', 'min', 'max'])
engine_aggs.columns = ['_'.join(col) for col in engine_aggs.columns]
engine_aggs.reset_index(inplace=True)
print(f"Aggregate feature matrix shape: {engine_aggs.shape}")
engine_aggs.tail()


Aggregate feature matrix shape: (260, 85)


Unnamed: 0,engine_id,sensor_1_mean,sensor_1_std,sensor_1_min,sensor_1_max,sensor_2_mean,sensor_2_std,sensor_2_min,sensor_2_max,sensor_3_mean,...,sensor_19_min,sensor_19_max,sensor_20_mean,sensor_20_std,sensor_20_min,sensor_20_max,sensor_21_mean,sensor_21_std,sensor_21_min,sensor_21_max
255,256,475.14,27.486466,445.0,518.67,583.608466,38.064278,536.09,644.12,1431.573988,...,84.93,100.0,21.881902,10.297101,10.26,39.14,13.137912,6.167984,6.2293,23.4157
256,257,473.989741,26.798684,445.0,518.67,582.076214,37.081296,536.12,643.39,1427.211618,...,84.93,100.0,21.37521,10.022302,10.38,39.17,12.826693,6.01058,6.241,23.4548
257,258,475.573636,27.340533,445.0,518.67,583.308322,38.870335,536.0,644.05,1429.926853,...,84.93,100.0,21.69049,10.152721,10.34,38.98,13.010299,6.094522,6.2127,23.3676
258,259,472.357659,25.644641,445.0,518.67,579.046976,36.139349,535.78,643.15,1418.357512,...,84.93,100.0,20.772488,9.668143,10.3,39.02,12.472388,5.807734,6.2383,23.5015
259,260,473.033608,26.415225,445.0,518.67,579.25462,37.604858,536.13,644.15,1417.488165,...,84.93,100.0,20.835095,9.955657,10.38,39.18,12.498821,5.973274,6.1757,23.5412
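
Because engine numbering restarts at 1 in each C-MAPSS training file, grouping on `engine_id` alone pools unrelated engines. A minimal sketch on toy data, assuming the `dataset_id` column added during loading serves as a second grouping key; the `toy` frame, its values, and the `aggs` name are illustrative only:

```python
import pandas as pd

# Toy stand-in for the merged frame: engine_id 1 appears in two different files.
toy = pd.DataFrame({
    "dataset_id": ["FD001", "FD001", "FD002", "FD002", "FD002"],
    "engine_id":  [1, 1, 1, 1, 2],
    "sensor_2":   [641.8, 642.2, 536.0, 537.0, 549.5],
})

stats = ["mean", "std", "min", "max"]

# Grouping on engine_id alone pools engine 1 of FD001 with engine 1 of FD002.
by_id_only = toy.groupby("engine_id")["sensor_2"].agg(stats)

# Grouping on (dataset_id, engine_id) keeps each physical engine separate.
aggs = toy.groupby(["dataset_id", "engine_id"])[["sensor_2"]].agg(stats)
aggs.columns = ["_".join(col) for col in aggs.columns]  # e.g. sensor_2_mean
aggs = aggs.reset_index()

print(len(by_id_only), len(aggs))
```

If the static aggregates are meant to sit alongside the per-cycle rows, something like `df.merge(aggs, on=["dataset_id", "engine_id"])` could attach them; the notebook itself keeps them in a separate table.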


#### Rolling Statistics and Trends

This cell computes a rolling mean and a rolling standard deviation for each sensor within each engine, capturing short-term temporal patterns. The statistics use a moving window of 5 cycles. With min_periods=1 the rolling mean produces a value from the first cycle onward, whereas the rolling std of a single observation is undefined and therefore NaN on the first row of each group. The new columns are appended with descriptive names such as sensor_1_rollmean5 and sensor_1_rollstd5.

These features matter for C-MAPSS because they expose recent changes in engine condition that raw, noisy readings obscure. The rolling mean smooths noise to reveal the underlying trend, while the rolling std flags fluctuations that may indicate an emerging fault. Alongside the static aggregates, they describe degradation at more than one time scale, which is useful input for Remaining Useful Life (RUL) prediction and maintenance planning.

In [7]:
# Add rolling mean and rolling std (window = 5 cycles) for each sensor, per engine
for col in sensor_cols:
    rolling = df.groupby('engine_id')[col].rolling(window=5, min_periods=1)
    # Drop the engine_id index level so the results align with df's original index
    df[f"{col}_rollmean5"] = rolling.mean().reset_index(level=0, drop=True)
    df[f"{col}_rollstd5"] = rolling.std().reset_index(level=0, drop=True)

# Prepare list of columns to display
cols_to_show = sensor_cols + [f"{col}_rollmean5" for col in sensor_cols] + [f"{col}_rollstd5" for col in sensor_cols]

# Display first 10 rows of the selected columns
df[cols_to_show].head(10)

Unnamed: 0,sensor_1,sensor_2,sensor_3,sensor_4,sensor_5,sensor_6,sensor_7,sensor_8,sensor_9,sensor_10,...,sensor_12_rollstd5,sensor_13_rollstd5,sensor_14_rollstd5,sensor_15_rollstd5,sensor_16_rollstd5,sensor_17_rollstd5,sensor_18_rollstd5,sensor_19_rollstd5,sensor_20_rollstd5,sensor_21_rollstd5
0,518.67,641.82,1589.7,1400.6,14.62,21.61,554.36,2388.06,9046.19,1.3,...,,,,,,,,,,
1,518.67,642.15,1591.82,1403.14,14.62,21.61,553.75,2388.04,9044.07,1.3,...,0.438406,0.035355,5.041671,0.008697,0.0,0.0,0.0,0.0,0.042426,0.003253
2,518.67,642.35,1587.99,1404.2,14.62,21.61,554.26,2388.08,9052.94,1.3,...,0.404475,0.026458,3.71745,0.00764,0.0,1.154701,0.0,0.0,0.055076,0.044573
3,518.67,642.35,1582.79,1401.87,14.62,21.61,554.45,2388.11,9049.48,1.3,...,0.49595,0.029439,3.050906,0.028117,0.0,1.0,0.0,0.0,0.076322,0.037977
4,518.67,642.37,1582.85,1406.22,14.62,21.61,554.0,2388.06,9055.15,1.3,...,0.432574,0.025884,2.651326,0.025953,0.0,1.095445,0.0,0.0,0.073621,0.033498
5,518.67,642.1,1584.47,1398.37,14.62,21.61,554.67,2388.02,9049.68,1.3,...,0.425417,0.023452,0.958697,0.025727,0.0,1.140175,0.0,0.0,0.051186,0.031436
6,518.67,642.48,1592.32,1397.77,14.62,21.61,554.34,2388.02,9059.13,1.3,...,0.425652,0.021679,0.643141,0.023476,0.0,1.140175,0.0,0.0,0.086718,0.021634
7,518.67,642.56,1582.96,1400.97,14.62,21.61,553.85,2388.0,9040.8,1.3,...,0.429919,0.021679,1.149274,0.022477,0.0,0.83666,0.0,0.0,0.086487,0.034405
8,518.67,642.12,1590.98,1394.8,14.62,21.61,553.69,2388.05,9046.46,1.3,...,0.341101,0.008944,3.205438,0.02074,0.0,0.83666,0.0,0.0,0.077136,0.038939
9,518.67,641.71,1591.24,1400.46,14.62,21.61,553.59,2388.05,9051.7,1.3,...,0.35826,0.014142,2.883881,0.020493,0.0,0.83666,0.0,0.0,0.062849,0.058103
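
The empty first row of the `_rollstd5` columns follows from pandas' default `ddof=1`: a window holding one value has no sample standard deviation. A small sketch on made-up numbers, using `groupby(...).transform` as one possible way to keep results aligned with the original rows (the `toy` frame and its values are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "engine_id": [1, 1, 1, 2, 2],
    "sensor_2":  [1.0, 2.0, 4.0, 10.0, 14.0],
})

grouped = toy.groupby("engine_id")["sensor_2"]

# transform returns a Series indexed like `toy`, so no reset_index is needed.
toy["sensor_2_rollmean5"] = grouped.transform(
    lambda s: s.rolling(window=5, min_periods=1).mean()
)
toy["sensor_2_rollstd5"] = grouped.transform(
    lambda s: s.rolling(window=5, min_periods=1).std()
)

print(toy)
```

The first row of each engine gets a rolling mean equal to its own reading and a NaN rolling std; every later row has both.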


In [8]:
# Percentage of missing values per column
missing_values = df.isnull().sum() / df.shape[0] * 100
missing_values[missing_values > 0].sort_values(ascending=False)


Unnamed: 0,0
sensor_1_rollstd5,0.162136
sensor_2_rollstd5,0.162136
sensor_3_rollstd5,0.162136
sensor_4_rollstd5,0.162136
sensor_5_rollstd5,0.162136
sensor_6_rollstd5,0.162136
sensor_7_rollstd5,0.162136
sensor_8_rollstd5,0.162136
sensor_9_rollstd5,0.162136
sensor_10_rollstd5,0.162136


In [9]:
# Drop rows with missing values (the first cycle of each engine_id group, where the rolling std is NaN)
df.dropna(inplace=True)
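
Dropping these rows removes the first recorded cycle of every group. If those cycles should be kept, one option (an assumption here, not what the notebook does) is to treat the std of a one-value window as 0. A sketch on a hypothetical three-row frame:

```python
import numpy as np
import pandas as pd

# Made-up values; only the NaN in the first rolling-std row matters here.
toy = pd.DataFrame({
    "cycle": [1, 2, 3],
    "sensor_2_rollmean5": [10.0, 10.5, 11.0],
    "sensor_2_rollstd5":  [np.nan, 0.7, 1.0],
})

# Option A (used in the notebook): drop incomplete rows.
dropped = toy.dropna()

# Option B (assumption): keep every cycle and fill the undefined std with 0.
std_cols = [c for c in toy.columns if c.endswith("_rollstd5")]
filled = toy.copy()
filled[std_cols] = filled[std_cols].fillna(0.0)

print(len(dropped), len(filled))
```

Which option is preferable depends on whether the first cycle carries information the downstream model needs.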

#### Sensor Value Normalization

In [10]:
# Normalize all raw sensor columns and rolling-feature columns
# (one global scaler per column, for simplicity; operational settings are left unscaled)
features_to_scale = [col for col in df.columns if ('sensor_' in col) or ('roll' in col)]

# StandardScaler (mean=0, std=1)
scaler = StandardScaler()
df_scaled = df.copy()
df_scaled[features_to_scale] = scaler.fit_transform(df[features_to_scale])

# Confirm scaled feature distribution
df_scaled[features_to_scale].describe().T[['mean', 'std']]

Unnamed: 0,mean,std
sensor_1,-4.544661e-16,1.000003
sensor_2,-8.123582e-16,1.000003
sensor_3,2.499564e-16,1.000003
sensor_4,-1.364819e-15,1.000003
sensor_5,-3.749346e-16,1.000003
...,...,...
sensor_19_rollstd5,-3.692537e-17,1.000003
sensor_20_rollmean5,5.908060e-16,1.000003
sensor_20_rollstd5,1.732652e-16,1.000003
sensor_21_rollmean5,-1.255463e-15,1.000003
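
The reported std of 1.000003 rather than exactly 1 is expected: `StandardScaler` divides by the population standard deviation (ddof=0), while `describe()` reports the sample standard deviation (ddof=1). For n rows the two differ by a factor of sqrt(n / (n - 1)). A quick check, assuming n = 160099 (the row count after `dropna`, as printed for the final feature matrix); the synthetic column is illustrative only:

```python
import math

import numpy as np
import pandas as pd

n = 160_099  # rows left after dropna, per the notebook output
ratio = math.sqrt(n / (n - 1))  # sample std / population std
print(round(ratio, 6))

# Same effect on a small synthetic column, standardized the way
# StandardScaler does it: subtract the mean, divide by the ddof=0 std.
rng = np.random.default_rng(0)
x = rng.normal(loc=500.0, scale=25.0, size=1_000)
z = (x - x.mean()) / x.std(ddof=0)

reported_std = pd.Series(z).describe()["std"]  # describe() uses ddof=1
print(reported_std)
```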


#### Feature Matrix Construction & Validation

In [11]:
# Use every column except engine_id and cycle as the feature matrix
# (note: this keeps the non-numeric dataset_id column)
exclude_cols = ['engine_id', 'cycle']
feature_cols = [col for col in df_scaled.columns if col not in exclude_cols]

# Check for missing values
print("Missing values per feature column:")
print(df_scaled[feature_cols].isnull().sum())

Missing values per feature column:
op_setting_1           0
op_setting_2           0
op_setting_3           0
sensor_1               0
sensor_2               0
                      ..
sensor_19_rollstd5     0
sensor_20_rollmean5    0
sensor_20_rollstd5     0
sensor_21_rollmean5    0
sensor_21_rollstd5     0
Length: 67, dtype: int64


In [12]:
# Final feature matrix
X = df_scaled[feature_cols]
print(f"Final feature matrix shape: {X.shape}")
X.head()

Final feature matrix shape: (160099, 67)


Unnamed: 0,op_setting_1,op_setting_2,op_setting_3,sensor_1,sensor_2,sensor_3,sensor_4,sensor_5,sensor_6,sensor_7,...,sensor_17_rollmean5,sensor_17_rollstd5,sensor_18_rollmean5,sensor_18_rollstd5,sensor_19_rollmean5,sensor_19_rollstd5,sensor_20_rollmean5,sensor_20_rollstd5,sensor_21_rollmean5,sensor_21_rollstd5
1,0.0019,-0.0003,100.0,1.079459,1.054654,1.056148,1.043391,1.108018,1.115325,1.114508,...,1.374561,-1.455685,1.262047,-1.184951,0.705019,-0.802578,1.461675,-1.367305,1.462277,-1.374635
2,-0.0043,0.0003,100.0,1.079459,1.059362,1.023736,1.051169,1.108018,1.115325,1.117437,...,1.345246,-1.369314,1.262047,-1.184951,0.705019,-0.802578,1.458692,-1.364754,1.457485,-1.360744
3,0.0007,0.0,100.0,1.079459,1.059362,0.97973,1.034073,1.108018,1.115325,1.118528,...,1.352574,-1.380886,1.262047,-1.184951,0.705019,-0.802578,1.455242,-1.360468,1.456473,-1.362962
4,-0.0019,-0.0002,100.0,1.079459,1.059833,0.980238,1.06599,1.108018,1.115325,1.115944,...,1.365766,-1.373746,1.262047,-1.184951,0.705019,-0.802578,1.45362,-1.361012,1.457004,-1.364467
5,-0.0043,-0.0001,100.0,1.079459,1.053477,0.993947,1.008393,1.108018,1.115325,1.119791,...,1.356972,-1.370401,1.262047,-1.184951,0.705019,-0.802578,1.45183,-1.365538,1.455061,-1.36516


In [13]:
X.info()
# X.drop(columns=['dataset_id'],inplace=True)

<class 'pandas.core.frame.DataFrame'>
Index: 160099 entries, 1 to 160358
Data columns (total 67 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   op_setting_1         160099 non-null  float64
 1   op_setting_2         160099 non-null  float64
 2   op_setting_3         160099 non-null  float64
 3   sensor_1             160099 non-null  float64
 4   sensor_2             160099 non-null  float64
 5   sensor_3             160099 non-null  float64
 6   sensor_4             160099 non-null  float64
 7   sensor_5             160099 non-null  float64
 8   sensor_6             160099 non-null  float64
 9   sensor_7             160099 non-null  float64
 10  sensor_8             160099 non-null  float64
 11  sensor_9             160099 non-null  float64
 12  sensor_10            160099 non-null  float64
 13  sensor_11            160099 non-null  float64
 14  sensor_12            160099 non-null  float64
 15  sensor_13            1

In [14]:
# save the processed feature matrix to a CSV file
output_path = data_dir / "processed_feature_matrix.csv"
X.to_csv(output_path, index=False)
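
Since `feature_cols` excludes only `engine_id` and `cycle`, the string column `dataset_id` is still among the 67 saved columns. If a purely numeric matrix is wanted, a sketch along these lines could be used; `select_dtypes` is standard pandas, while the toy frame and the `id_cols` / `X_numeric` names are illustrative:

```python
import pandas as pd

toy = pd.DataFrame({
    "engine_id": [1, 1],
    "cycle": [1, 2],
    "op_setting_1": [-0.0007, 0.0019],
    "sensor_2": [0.12, -0.12],
    "dataset_id": ["FD001", "FD001"],
})

id_cols = ["engine_id", "cycle"]

# Keep numeric columns only, then remove the identifier columns.
numeric_cols = toy.select_dtypes(include="number").columns
feature_cols = [c for c in numeric_cols if c not in id_cols]

X_numeric = toy[feature_cols]
print(list(X_numeric.columns))
```

Keeping `dataset_id` in a separate key column, rather than in the feature matrix, still allows rows to be traced back to their source file.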

#### Feature Engineering Summary

- Features Created:
  - Aggregate statistics per engine: mean, std, min, max for each sensor (static features, kept in the separate engine_aggs table)
  - Rolling-window features (window=5): rolling mean and rolling std for each sensor per cycle/engine (dynamic features)
  - Raw sensor readings and rolling features scaled using StandardScaler (zero mean, unit variance); operational settings are left unscaled

- Validation:
  - Verified the absence of missing values in the final feature set.
  - Final feature matrix has shape (160099, 67): 160,099 rows and 67 columns (3 operational settings, 21 sensors, 42 rolling features, and dataset_id).

- Observations:
  - The four C-MAPSS training files were loaded, merged, and labeled with a dataset ID; each record holds three operational settings and 21 sensor readings for one engine cycle.
  - Aggregate features (mean, std, min, max) were computed per engine to summarize long-term sensor behavior.
  - Rolling statistics (mean and std over a window of 5 cycles) were created for each sensor to capture short-term fluctuations and variability.
  - Missing values in the rolling features were identified and the affected rows dropped, ensuring data consistency.
  - All sensor features were standardized to put them on a common scale for model training.
  - The process converts raw time-series sensor data into compact, informative features suitable for further modeling, especially Remaining Useful Life (RUL) prediction.
  - Overall, the notebook provides a foundation for machine learning models in predictive maintenance applications.


