## C-MAPSS Dataset

### Dataset Overview
- C-MAPSS (Commercial Modular Aero-Propulsion System Simulation) simulates realistic sensor data from large commercial turbofan engines using a high-fidelity thermodynamic model.

- The dataset contains multivariate time-series data from multiple engines, each under various operational conditions and fault scenarios.

- It is divided into four subsets (FD001, FD002, FD003, FD004), which differ in the number of operational conditions (1 or 6) and fault modes (1 or 2).

- Each subset includes both training and testing trajectories, one per engine. Training trajectories run until failure (run-to-failure); test trajectories are cut off at some point before failure.
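
The subset settings described above can be kept as a small lookup table. This is a minimal sketch: the per-subset values follow the readme distributed with the dataset rather than anything computed in this notebook, and the names `SUBSET_CONFIG` and `describe_subset` are hypothetical.

```python
# Operating conditions and fault modes per subset, as listed in the
# readme that accompanies the dataset (not derived in this notebook).
SUBSET_CONFIG = {
    "FD001": {"operating_conditions": 1, "fault_modes": 1},
    "FD002": {"operating_conditions": 6, "fault_modes": 1},
    "FD003": {"operating_conditions": 1, "fault_modes": 2},
    "FD004": {"operating_conditions": 6, "fault_modes": 2},
}


def describe_subset(subset_id: str) -> str:
    """Return a one-line description of a subset (hypothetical helper)."""
    cfg = SUBSET_CONFIG[subset_id]
    return (
        f"{subset_id}: {cfg['operating_conditions']} operating condition(s), "
        f"{cfg['fault_modes']} fault mode(s)"
    )


for subset_id in SUBSET_CONFIG:
    print(describe_subset(subset_id))
```

A table like this could later be joined onto the data through a subset identifier column if the analysis needs to split results by setting.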

### Data Structure
- Columns: Each record has 26 columns: engine ID, cycle number, three operational settings, and 21 sensor measurements.

- Goal: Predict the remaining useful life (RUL) of each engine in the test set, using only the partial sensor history (truncated before failure) provided for it.

- Noise and Variability: The dataset incorporates realistic elements such as sensor noise, manufacturing variance, and operational differences to mimic real-world degradation scenarios.

Source Link: https://data.nasa.gov/dataset/cmapss-jet-engine-simulated-data
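
The notebook does not say how RUL labels are built for the training data. One common convention, assumed here, is that a training engine fails at its last recorded cycle, so RUL is the last cycle minus the current cycle. The sketch below applies that to a made-up toy frame; `add_rul` and the `rul` column are hypothetical names.

```python
import pandas as pd

# Toy run-to-failure data (made up): two engines that share engine_id 1
# but belong to different subsets.
toy = pd.DataFrame({
    "dataset_id": ["FD001"] * 3 + ["FD002"] * 2,
    "engine_id": [1, 1, 1, 1, 1],
    "cycle": [1, 2, 3, 1, 2],
})


def add_rul(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with `rul` = last observed cycle - current cycle.

    Assumes every trajectory ends at failure, which holds for the
    training files only.
    """
    out = frame.copy()
    # Group by subset AND engine so engines with the same ID stay separate.
    last_cycle = out.groupby(["dataset_id", "engine_id"])["cycle"].transform("max")
    out["rul"] = last_cycle - out["cycle"]
    return out


toy_rul = add_rul(toy)
print(toy_rul)
```

This labeling does not apply to the test trajectories, which stop before failure.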

In [1]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
# DATA LOADING
import pandas as pd
from pathlib import Path

# Column names: engine_id, cycle, 3 operational settings, 21 sensors (26 columns)
column_names = [
    "engine_id", "cycle", "op_setting_1", "op_setting_2", "op_setting_3"
] + [f"sensor_{i}" for i in range(1, 22)]

# Directory with the train files
data_dir = Path("/content/drive/MyDrive/PrognosAI_OCT25/Data/raw")

# Load all four files and add an identifier column
datasets = {}
for fd_id in range(1, 5):
    file_path = data_dir / f"train_FD00{fd_id}.txt"
    datasets[f"FD00{fd_id}"] = pd.read_csv(
        file_path, sep=r"\s+", header=None, names=column_names
    )
    datasets[f"FD00{fd_id}"]["dataset_id"] = f"FD00{fd_id}"

# Merge into a single DataFrame
df = pd.concat(datasets.values(), ignore_index=True)

print(f"Shape of the merged DataFrame: {df.shape}")
display(df.head())



Shape of the merged DataFrame: (160359, 27)


| | engine_id | cycle | op_setting_1 | op_setting_2 | op_setting_3 | sensor_1 | sensor_2 | sensor_3 | sensor_4 | sensor_5 | ... | sensor_13 | sensor_14 | sensor_15 | sensor_16 | sensor_17 | sensor_18 | sensor_19 | sensor_20 | sensor_21 | dataset_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | -0.0007 | -0.0004 | 100.0 | 518.67 | 641.82 | 1589.7 | 1400.6 | 14.62 | ... | 2388.02 | 8138.62 | 8.4195 | 0.03 | 392 | 2388 | 100.0 | 39.06 | 23.419 | FD001 |
| 1 | 1 | 2 | 0.0019 | -0.0003 | 100.0 | 518.67 | 642.15 | 1591.82 | 1403.14 | 14.62 | ... | 2388.07 | 8131.49 | 8.4318 | 0.03 | 392 | 2388 | 100.0 | 39.0 | 23.4236 | FD001 |
| 2 | 1 | 3 | -0.0043 | 0.0003 | 100.0 | 518.67 | 642.35 | 1587.99 | 1404.2 | 14.62 | ... | 2388.03 | 8133.23 | 8.4178 | 0.03 | 390 | 2388 | 100.0 | 38.95 | 23.3442 | FD001 |
| 3 | 1 | 4 | 0.0007 | 0.0 | 100.0 | 518.67 | 642.35 | 1582.79 | 1401.87 | 14.62 | ... | 2388.08 | 8133.83 | 8.3682 | 0.03 | 392 | 2388 | 100.0 | 38.88 | 23.3739 | FD001 |
| 4 | 1 | 5 | -0.0019 | -0.0002 | 100.0 | 518.67 | 642.37 | 1582.85 | 1406.22 | 14.62 | ... | 2388.04 | 8133.8 | 8.4294 | 0.03 | 393 | 2388 | 100.0 | 38.9 | 23.4044 | FD001 |


### Data Inspection and Initial Exploration

In [4]:
# Quick structure and type info
print("DataFrame Info:")
df.info()

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 160359 entries, 0 to 160358
Data columns (total 27 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   engine_id     160359 non-null  int64  
 1   cycle         160359 non-null  int64  
 2   op_setting_1  160359 non-null  float64
 3   op_setting_2  160359 non-null  float64
 4   op_setting_3  160359 non-null  float64
 5   sensor_1      160359 non-null  float64
 6   sensor_2      160359 non-null  float64
 7   sensor_3      160359 non-null  float64
 8   sensor_4      160359 non-null  float64
 9   sensor_5      160359 non-null  float64
 10  sensor_6      160359 non-null  float64
 11  sensor_7      160359 non-null  float64
 12  sensor_8      160359 non-null  float64
 13  sensor_9      160359 non-null  float64
 14  sensor_10     160359 non-null  float64
 15  sensor_11     160359 non-null  float64
 16  sensor_12     160359 non-null  float64
 17  sensor_13     160359 non-null  f

In [5]:
# Statistical profiling of all sensor and operational columns
print("\nSummary statistics for numeric columns:")
display(df.describe().transpose())


Summary statistics for numeric columns:


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| engine_id | 160359.0 | 105.553758 | 72.867325 | 1.0 | 44.0 | 89.0 | 164.0 | 260.0 |
| cycle | 160359.0 | 123.331338 | 83.538146 | 1.0 | 57.0 | 114.0 | 173.0 | 543.0 |
| op_setting_1 | 160359.0 | 17.211973 | 16.527988 | -0.0087 | 0.0013 | 19.9981 | 35.0015 | 42.008 |
| op_setting_2 | 160359.0 | 0.410004 | 0.367938 | -0.0006 | 0.0002 | 0.62 | 0.84 | 0.842 |
| op_setting_3 | 160359.0 | 95.724344 | 12.359044 | 60.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| sensor_1 | 160359.0 | 485.84089 | 30.420388 | 445.0 | 449.44 | 489.05 | 518.67 | 518.67 |
| sensor_2 | 160359.0 | 597.361022 | 42.478516 | 535.48 | 549.96 | 605.93 | 642.34 | 645.11 |
| sensor_3 | 160359.0 | 1467.035653 | 118.175261 | 1242.67 | 1357.36 | 1492.81 | 1586.59 | 1616.91 |
| sensor_4 | 160359.0 | 1260.956434 | 136.300073 | 1023.77 | 1126.83 | 1271.74 | 1402.2 | 1441.49 |
| sensor_5 | 160359.0 | 9.894999 | 4.265554 | 3.91 | 5.48 | 9.35 | 14.62 | 14.62 |
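
The summary above pools all four subsets, so subsets with one operating condition are mixed with subsets that have six. Grouping by `dataset_id` before summarizing keeps them apart. The sketch below uses a small made-up frame in place of `df`; the values are illustrative only, and `per_subset` is a hypothetical name.

```python
import pandas as pd

# Made-up rows in the style of the merged frame (illustrative values only).
toy = pd.DataFrame({
    "dataset_id": ["FD001", "FD001", "FD002", "FD002", "FD002"],
    "op_setting_1": [-0.0007, 0.0019, 0.0013, 19.9981, 35.0015],
    "op_setting_3": [100.0, 100.0, 100.0, 60.0, 100.0],
})

# Min and max of each operational setting, computed separately per subset.
per_subset = (
    toy.groupby("dataset_id")[["op_setting_1", "op_setting_3"]]
    .agg(["min", "max"])
)
print(per_subset)
```

On the real data the same call would be `df.groupby("dataset_id")[...]` with whichever columns are of interest.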


In [7]:
# Check for missing data per column
print("\nMissing values per column:")
missing = df.isnull().sum()
print(missing[missing > 0] if missing.sum() else "No missing values detected.")


Missing values per column:
No missing values detected.


In [8]:
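# Distribution of number of records per engine_id (just as a quick check)
# Note: engine_id restarts at 1 in every subset, so each count below pools
# the engines that share an ID across FD001 to FD004.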
records_per_engine = df["engine_id"].value_counts().sort_index()
print("\nRecords per engine_id (min, median, max):")
print(f"{records_per_engine.min()}, "
      f"{records_per_engine.median()}, "
      f"{records_per_engine.max()}")


Records per engine_id (min, median, max):
135, 527.5, 1305
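
Because `engine_id` restarts at 1 in each subset, the counts above pool engines that share an ID across subsets (the maximum of 1305 records exceeds the longest cycle count of 543). Counting per (`dataset_id`, `engine_id`) pair gives one count per physical trajectory. The sketch below shows the difference on a made-up toy frame; the variable names are hypothetical.

```python
import pandas as pd

# Made-up frame: engine_id 1 appears in both FD001 and FD002.
toy = pd.DataFrame({
    "dataset_id": ["FD001"] * 3 + ["FD002"] * 2 + ["FD002"] * 4,
    "engine_id": [1, 1, 1, 1, 1, 2, 2, 2, 2],
})

# Pooled count: rows from different subsets with the same ID are merged.
pooled = toy["engine_id"].value_counts().sort_index()

# Per-trajectory count: one entry per (subset, engine) pair.
per_engine = toy.groupby(["dataset_id", "engine_id"]).size()

# Min, median, and max trajectory length within each subset.
length_summary = per_engine.groupby(level="dataset_id").agg(["min", "median", "max"])

print(pooled)
print(per_engine)
print(length_summary)
```

Here the pooled count reports 5 records for `engine_id` 1, while the per-trajectory count separates them into 3 (FD001) and 2 (FD002).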


In [9]:
# Sanity check: count from each dataset
print("\nRow count per dataset_id subset (FD001–FD004):")
print(df["dataset_id"].value_counts())


Row count per dataset_id subset (FD001–FD004):
dataset_id
FD004    61249
FD002    53759
FD003    24720
FD001    20631
Name: count, dtype: int64


- **Observations**

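  - The dataset was collected from the NASA data portal; column names were added and the data was converted to CSV format.
  - `df.info()` was used to inspect the structure of the merged DataFrame, which has 160,359 rows and 27 columns (the 26 original columns plus `dataset_id`).
  - `df.isnull().sum()` was used to count missing values per column; none were found.
  - The row count of each subset was then checked: FD004 has 61,249 rows, FD002 has 53,759, FD003 has 24,720, and FD001 has 20,631.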