# Process NASA Turbofan Data for Survival Analysis

This notebook processes the NASA turbofan jet engine dataset into a format compatible with the survival analysis utilities in this repo.

## Data Description
- Multiple engines, each with time series of sensor measurements until failure
- 3 operational settings (discrete operating conditions with sensor noise)
- 21 sensor measurements
- Goal: Predict Remaining Useful Life (RUL)

## Transformation Strategy
Each row of a time series (one engine at one cycle) becomes a separate datapoint:
- **time**: RUL, the number of cycles from the current measurement until the engine's failure
- **failure**: True (all engines run to failure in training data)
- Features: operational settings + sensor measurements
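
The transformation above can be sketched on a tiny, made-up table. The two engines and their cycle counts below are hypothetical and only illustrate how `time` and `failure` are derived; the column names `unit_id` and `cycle` match the ones used later in this notebook.

```python
import pandas as pd

# Hypothetical toy data: engine 1 fails after 3 cycles, engine 2 after 2.
toy = pd.DataFrame({
    "unit_id": [1, 1, 1, 2, 2],
    "cycle":   [1, 2, 3, 1, 2],
})

# RUL = last recorded cycle of the engine minus the current cycle.
toy["time"] = toy.groupby("unit_id")["cycle"].transform("max") - toy["cycle"]

# Every engine in the training data runs to failure, so the event is always observed.
toy["failure"] = True

print(toy["time"].tolist())  # [2, 1, 0, 1, 0]
```

The last measurement of each engine gets `time == 0`, i.e. the failure happens at that cycle.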

In [1]:
import pandas as pd
import numpy as np
import os
from pathlib import Path

## Load and Process Training Data

In [2]:
# Column names based on readme.txt and Table 2 in the PDF documentation
# Using interpretable names from C-MAPSS simulation
column_names = ['unit_id', 'cycle'] + \
               ['op_setting_1', 'op_setting_2', 'op_setting_3'] + \
               ['T2', 'T24', 'T30', 'T50', 'P2', 'P15', 'P30',
                'Nf', 'Nc', 'epr', 'Ps30', 'phi', 'NRf', 'NRc',
                'BPR', 'farB', 'htBleed', 'Nf_dmd', 'PCNfR_dmd', 'W31', 'W32']

In [3]:
def load_turbofan_file(filepath):
    """Load a single turbofan data file."""
    # Raw string avoids an invalid-escape warning for the whitespace regex
    df = pd.read_csv(filepath, sep=r'\s+', header=None, names=column_names)
    return df

def compute_rul(df):
    """
    Compute Remaining Useful Life (RUL) for each measurement.
    RUL = max_cycle - current_cycle, where max_cycle is the engine's last recorded cycle.
    Adds 'time' and 'failure' columns to df in place and returns it.
    """
    # Last recorded cycle of each engine (time of failure), aligned to every row
    max_cycle = df.groupby('unit_id')['cycle'].transform('max')
    
    # Vectorized RUL for each row (avoids a slow row-wise apply)
    df['time'] = max_cycle - df['cycle']
    
    # All training data has observed failures
    df['failure'] = True
    
    return df

## Process Each Dataset

There are 4 datasets:
- **FD001**: 1 operating condition, 1 fault mode
- **FD002**: 6 operating conditions, 1 fault mode
- **FD003**: 1 operating condition, 2 fault modes
- **FD004**: 6 operating conditions, 2 fault modes
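
One possible way to check the number of operating conditions in a dataset is to round the three operational settings and count the distinct combinations. The sketch below uses made-up values, and the rounding precision per column is an assumption that would need tuning against the real files.

```python
import pandas as pd

# Hypothetical settings: two underlying conditions, each recorded with small noise.
toy = pd.DataFrame({
    "op_setting_1": [0.0012, 0.0007, 35.0031, 34.9992],
    "op_setting_2": [0.0003, 0.0001, 0.8402, 0.8400],
    "op_setting_3": [100.0, 100.0, 100.0, 100.0],
})

# Assumed rounding precision per column; adjust after inspecting the real data.
decimals = {"op_setting_1": 0, "op_setting_2": 1, "op_setting_3": 0}

# Rows that round to the same triple are treated as the same operating condition.
conditions = toy.round(decimals).drop_duplicates()
n_conditions = len(conditions)

print(n_conditions)  # 2
```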

In [4]:
data_dir = Path('../data/raw/nasa')
datasets = ['FD001', 'FD002', 'FD003', 'FD004']

all_data = {}

for dataset in datasets:
    print(f"\nProcessing {dataset}...")
    
    train_file = data_dir / f'train_{dataset}.txt'
    df_train = load_turbofan_file(train_file)
    
    df_train = compute_rul(df_train)
    
    all_data[dataset] = df_train


Processing FD001...

Processing FD002...

Processing FD003...

Processing FD004...


## Create Combined Dataset

In [5]:
# Add dataset identifier to each dataframe
for dataset, df in all_data.items():
    df['dataset'] = dataset

df_combined = pd.concat(all_data.values(), ignore_index=True)
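output_file = 'processed/nasa_turbofan_combined_processed.pkl'
# Create the output directory if it does not exist yet
Path(output_file).parent.mkdir(parents=True, exist_ok=True)
df_combined.to_pickle(output_file)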
print(f"\nSaved combined dataset: {output_file}")


Saved combined dataset: processed/nasa_turbofan_combined_processed.pkl
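
A downstream consumer might read the pickle back and separate features from the survival targets. The following is a minimal sketch on a hypothetical stand-in frame written to a temporary directory; the list `non_features` is an assumption based on the transformation strategy (features are the operational settings and sensor measurements), not something defined elsewhere in this notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for df_combined, with only one sensor column.
toy = pd.DataFrame({
    "unit_id": [1, 1, 2],
    "cycle": [1, 2, 1],
    "T24": [641.8, 642.1, 642.4],
    "time": [1, 0, 0],
    "failure": [True, True, True],
    "dataset": ["FD001", "FD001", "FD002"],
})

# Round-trip through a pickle file, as the notebook does for the real data.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_processed.pkl"
    toy.to_pickle(path)
    loaded = pd.read_pickle(path)

# Identifier and target columns (assumed) that should not be used as model inputs.
non_features = ["unit_id", "cycle", "time", "failure", "dataset"]
X = loaded.drop(columns=non_features)
durations = loaded["time"]
events = loaded["failure"]

print(list(X.columns))  # ['T24']
```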
