### Rolling Window Sequence
A `rolling window sequence` is a fixed-size window of consecutive time steps that moves ("rolls") across a time series. At each position, the window captures a segment of the data (for example, the last 30 cycles of sensor readings) that can serve as input to a model or calculation. The window then shifts forward by one or more time steps, always covering the same number of points, so every sequence reflects recent context while preserving temporal order.
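
To make the definition concrete, here is a minimal sketch on a toy 1-D series. The helper name `rolling_windows_1d` and its `step` parameter are illustrative assumptions and are not part of this notebook's pipeline.

```python
import numpy as np

def rolling_windows_1d(values, window_size, step=1):
    """Return every full window of `window_size` consecutive points,
    moving forward `step` points at a time (hypothetical helper)."""
    values = np.asarray(values)
    # Only start positions that leave room for a complete window
    starts = range(0, len(values) - window_size + 1, step)
    return np.array([values[s:s + window_size] for s in starts])

series = np.arange(1, 8)  # toy readings 1..7
windows = rolling_windows_1d(series, window_size=3)
print(windows.shape)  # (5, 3)
print(windows[0], windows[-1])  # [1 2 3] [5 6 7]
```

With `step=1`, consecutive windows share all but one point; a larger `step` reduces the overlap and the number of windows.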

- Why do we generate rolling window sequences?

  - This step is essential for time-series modeling techniques (like LSTMs or GRUs) that require input data shaped as sequences of fixed length rather than individual time points.

  - Rolling windows create these context-rich, fixed-size sequences from the continuous stream of data for each engine, capturing temporal dependencies and trends.

  - They let models learn from patterns that span multiple cycles rather than from isolated measurements.

  - Earlier steps already compute rolling statistics as features; this step does something different: it arranges the rows into the 3-D structure (sequences, time steps, features) used for model training.

In [2]:
# 1. Imports and Data Loading
import pandas as pd
import numpy as np

# Load the feature-engineered dataset from the previous step (adjust the path as needed)
df = pd.read_csv('C:/Users/SRIJAN SASMAL/Desktop/Infosys Springboard Intern/prognosAI-Infosys-intern-project/data/processed/cmapss_preprocessed.csv')

# Basic info
print("Dataset shape:", df.shape)
df.head()


Dataset shape: (160099, 68)


|  | engine_id | cycle | op_setting_1 | op_setting_2 | op_setting_3 | sensor_1 | sensor_2 | sensor_3 | sensor_4 | sensor_5 | ... | sensor_17_rollmean5 | sensor_17_rollstd5 | sensor_18_rollmean5 | sensor_18_rollstd5 | sensor_19_rollmean5 | sensor_19_rollstd5 | sensor_20_rollmean5 | sensor_20_rollstd5 | sensor_21_rollmean5 | sensor_21_rollstd5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 1 | 1.075919 | 1.168421 | 0.345918 | -1.196412 | -0.989581 | -0.917426 | -0.907656 | -1.034833 | ... | 0.139278 | 1.357228 | 0.444362 | -0.045512 | 0.754282 | -0.850881 | 0.14892 | 1.878359 | 0.142641 | 1.88263 |
| 1 | 1 | 1 | -1.04168 | -1.113557 | 0.345918 | 1.079459 | 1.059597 | 0.983453 | 0.997166 | 1.108018 | ... | 0.702843 | 0.564134 | 0.829328 | -0.358053 | 0.754282 | -0.850881 | 0.788164 | 1.018412 | 0.774208 | 1.009349 |
| 2 | 1 | 1 | 1.499853 | 1.168421 | 0.345918 | -1.342373 | -1.122364 | -1.045889 | -1.08595 | -1.402916 | ... | 0.063801 | 0.685183 | 0.405865 | -0.310204 | 0.754282 | -0.850881 | -0.010433 | 1.368269 | -0.021259 | 1.35533 |
| 3 | 1 | 2 | -1.041535 | -1.115459 | 0.345918 | 1.079459 | 1.054654 | 1.056147 | 1.043391 | 1.108018 | ... | 0.42911 | 0.514635 | 0.644545 | -0.384259 | 0.754282 | -0.850881 | 0.40153 | 1.144667 | 0.394119 | 1.13976 |
| 4 | 1 | 2 | 1.499448 | 1.170595 | 0.345918 | -1.342373 | -1.117183 | -0.96304 | -0.991667 | -1.402916 | ... | -0.319623 | 0.543376 | 0.151788 | -0.369223 | 0.754282 | -0.850881 | -0.498069 | 1.264855 | -0.503559 | 1.254081 |


In [3]:
# Identifier columns that are not model inputs ('dataset_id' is excluded when present)
exclude_cols = ['engine_id', 'cycle', 'dataset_id']
feature_cols = [col for col in df.columns if col not in exclude_cols]

print(f"Feature columns ({len(feature_cols)}): {feature_cols}")

# Verify all feature columns are numeric (is_numeric_dtype also handles pandas extension dtypes)
numeric_check = df[feature_cols].dtypes.apply(pd.api.types.is_numeric_dtype).all()
assert numeric_check, "Non-numeric columns found in feature_cols!"

# Sort data by engine_id and cycle to ensure correct temporal order
df = df.sort_values(['engine_id', 'cycle']).reset_index(drop=True)


Feature columns (66): ['op_setting_1', 'op_setting_2', 'op_setting_3', 'sensor_1', 'sensor_2', 'sensor_3', 'sensor_4', 'sensor_5', 'sensor_6', 'sensor_7', 'sensor_8', 'sensor_9', 'sensor_10', 'sensor_11', 'sensor_12', 'sensor_13', 'sensor_14', 'sensor_15', 'sensor_16', 'sensor_17', 'sensor_18', 'sensor_19', 'sensor_20', 'sensor_21', 'sensor_1_rollmean5', 'sensor_1_rollstd5', 'sensor_2_rollmean5', 'sensor_2_rollstd5', 'sensor_3_rollmean5', 'sensor_3_rollstd5', 'sensor_4_rollmean5', 'sensor_4_rollstd5', 'sensor_5_rollmean5', 'sensor_5_rollstd5', 'sensor_6_rollmean5', 'sensor_6_rollstd5', 'sensor_7_rollmean5', 'sensor_7_rollstd5', 'sensor_8_rollmean5', 'sensor_8_rollstd5', 'sensor_9_rollmean5', 'sensor_9_rollstd5', 'sensor_10_rollmean5', 'sensor_10_rollstd5', 'sensor_11_rollmean5', 'sensor_11_rollstd5', 'sensor_12_rollmean5', 'sensor_12_rollstd5', 'sensor_13_rollmean5', 'sensor_13_rollstd5', 'sensor_14_rollmean5', 'sensor_14_rollstd5', 'sensor_15_rollmean5', 'sensor_15_rollstd5', 'sensor_

In [4]:
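def generate_rolling_windows(data, engine_col, features, window_size=30):
    """Build overlapping fixed-length windows separately for each engine.

    Returns (sequences, engine_ids, cycle_ids). `sequences` has shape
    (num_sequences, window_size, num_features); each window is labeled with
    its engine and the cycle of its last row. Assumes `data` is already
    sorted by engine and cycle.
    """
    sequences = []
    engine_ids = []
    cycle_ids = []

    for engine in data[engine_col].unique():
        engine_data = data[data[engine_col] == engine]
        engine_features = engine_data[features].values

        # Window ending at row i covers rows i - window_size + 1 .. i;
        # engines with fewer than window_size rows produce no windows
        for i in range(window_size - 1, len(engine_data)):
            seq = engine_features[i - window_size + 1 : i + 1]
            sequences.append(seq)
            engine_ids.append(engine)
            # Row access on a mixed-dtype frame returns the cycle as a float
            cycle_ids.append(engine_data.iloc[i]['cycle'])

    # Stack into one 3-D array for modeling
    sequences = np.array(sequences)
    return sequences, engine_ids, cycle_ids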
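
The nested Python loop above can be slow on large frames. As a possible alternative, the sketch below builds the same kind of windows with NumPy's `sliding_window_view` (available in NumPy 1.20 and later). The function name `generate_rolling_windows_vectorized` and the `cycle_col` parameter are assumptions introduced for illustration; this is not the implementation used in this notebook.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def generate_rolling_windows_vectorized(data, engine_col, features,
                                        window_size=30, cycle_col="cycle"):
    """Hypothetical vectorized variant: one strided view per engine."""
    seq_parts, engine_ids, cycle_ids = [], [], []
    for engine, engine_data in data.groupby(engine_col, sort=False):
        if len(engine_data) < window_size:
            continue  # too few rows for even one full window
        values = engine_data[features].to_numpy()
        # The view has shape (n_windows, n_features, window_size);
        # reorder to (n_windows, window_size, n_features)
        windows = sliding_window_view(values, window_size, axis=0).transpose(0, 2, 1)
        seq_parts.append(windows)
        engine_ids.extend([engine] * len(windows))
        # Label each window with the cycle of its last row
        cycle_ids.extend(engine_data[cycle_col].to_numpy()[window_size - 1:].tolist())
    if not seq_parts:
        return np.empty((0, window_size, len(features))), engine_ids, cycle_ids
    return np.concatenate(seq_parts), engine_ids, cycle_ids

# Toy check: engine 1 has 5 rows, engine 2 has 4 rows, window of 3
toy = pd.DataFrame({
    "engine_id": [1] * 5 + [2] * 4,
    "cycle": list(range(1, 6)) + list(range(1, 5)),
    "a": np.arange(9.0),
    "b": np.arange(9.0) * 10,
})
toy_seqs, toy_engines, toy_cycles = generate_rolling_windows_vectorized(
    toy, "engine_id", ["a", "b"], window_size=3
)
print(toy_seqs.shape)  # (5, 3, 2)
```

Like the loop version, this sketch assumes the rows are already sorted by engine and cycle, and it never lets a window cross an engine boundary.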



In [5]:
window_size = 30  # Typical rolling window length; adjust as needed
sequences, engine_ids, cycle_ids = generate_rolling_windows(df, 'engine_id', feature_cols, window_size)

print("Shape of rolling window sequences:", sequences.shape)  # (num_sequences, window_size, num_features)
print("Example sequence shape:", sequences[0].shape)


Shape of rolling window sequences: (152559, 30, 66)
Example sequence shape: (30, 66)
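
The sequence count can be checked by hand. Each `engine_id` group with `n` rows yields `n - window_size + 1` windows, so every group loses 29 rows with a window of 30. The short sketch below redoes that arithmetic using the figures reported in this notebook (160,099 rows, 260 unique `engine_id` values); the helper `expected_window_count` is a hypothetical name added for illustration.

```python
def expected_window_count(rows_per_engine, window_size):
    """Number of full windows when each group is windowed separately."""
    return sum(max(n - window_size + 1, 0) for n in rows_per_engine)

total_rows = 160_099   # from the dataset shape printed earlier
n_engines = 260        # unique engine_id values reported by the validation cell
window_size = 30

# Valid when every group has at least window_size rows
expected = total_rows - n_engines * (window_size - 1)
print(expected)  # 152559
```

The result matches the first dimension of `sequences.shape`, which is consistent with every `engine_id` group contributing at least one window.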


In [6]:
# Print the first sequence info
print(f"Engine ID: {engine_ids[0]}, Cycle: {cycle_ids[0]}")
print("Sequence data for first time window (shape {}):".format(sequences[0].shape))
print(sequences[0])


Engine ID: 1, Cycle: 8.0
Sequence data for first time window (shape (30, 66)):
[[ 1.0759192   1.1684207   0.34591845 ...  1.8783585   0.14264108
   1.8826302 ]
 [-1.0416802  -1.1135565   0.34591845 ...  1.0184119   0.7742078
   1.0093489 ]
 [ 1.4998531   1.1684207   0.34591845 ...  1.3682685  -0.02125945
   1.3553296 ]
 ...
 [-1.0418557  -1.1138283   0.34591845 ...  0.6388587   1.1585671
   0.63089   ]
 [ 0.16856942  0.7884537   0.34591845 ...  0.6468219   0.70961046
   0.6372574 ]
 [-1.0416924  -1.1160026   0.34591845 ... -1.0404671   1.6012071
  -1.0732535 ]]


In [7]:
assert sequences.shape[1] == window_size, "Sequence window length mismatch"

# Check sequences integrity: cycles should increase within each engine
# When engine changes, cycle resets (decreases), which is expected
for i in range(1, len(cycle_ids)):
    if engine_ids[i] == engine_ids[i-1]:
        # Same engine: cycle should increase
        assert cycle_ids[i] >= cycle_ids[i-1], f"Cycle order violation within engine {engine_ids[i]}"
    # Different engine: cycle can reset (no assertion needed)

print("✓ Cycle order validation passed")
print(f"  Total sequences: {len(sequences)}")
print(f"  Unique engines: {len(set(engine_ids))}")

✓ Cycle order validation passed
  Total sequences: 152559
  Unique engines: 260
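
The assertion above uses `>=`, so repeated cycle values within one `engine_id` also pass. A stricter check can be run on the source DataFrame before windowing. The sketch below is one way to do that, shown on toy data; the function names are hypothetical, and the column names `engine_id` and `cycle` are taken from this notebook.

```python
import pandas as pd

def find_repeated_cycles(df, engine_col="engine_id", cycle_col="cycle"):
    """Return the rows whose (engine, cycle) pair occurs more than once."""
    dup_mask = df.duplicated(subset=[engine_col, cycle_col], keep=False)
    return df.loc[dup_mask, [engine_col, cycle_col]]

def cycles_strictly_increasing(df, engine_col="engine_id", cycle_col="cycle"):
    """True if the cycle grows by a positive amount on every row of each engine."""
    diffs = df.groupby(engine_col, sort=False)[cycle_col].diff()
    # The first row of each engine has no predecessor, so its diff is NaN
    return bool((diffs.dropna() > 0).all())

clean = pd.DataFrame({"engine_id": [1, 1, 1, 2, 2], "cycle": [1, 2, 3, 1, 2]})
mixed = pd.DataFrame({"engine_id": [1, 1, 1, 1], "cycle": [1, 1, 2, 2]})

print(cycles_strictly_increasing(clean))  # True
print(cycles_strictly_increasing(mixed))  # False
print(len(find_repeated_cycles(mixed)))   # 4
```

If `find_repeated_cycles(df)` returns rows for the real data, the windows are mixing rows that share an `engine_id` and `cycle`, and the grouping key would need an additional identifier.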


In [8]:
# Save sequences and metadata for modeling
np.save('rolling_window_sequences.npy', sequences)
pd.DataFrame({'engine_id': engine_ids, 'cycle': cycle_ids}).to_csv('sequence_metadata.csv', index=False)
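
Because the array and its metadata are stored in separate files, it helps to confirm on load that they still line up row for row. The following is a minimal round-trip sketch on toy data in a temporary directory; `save_windows` and `load_windows` are hypothetical helpers, and only the two file names come from the cell above.

```python
import os
import tempfile

import numpy as np
import pandas as pd

def save_windows(out_dir, sequences, engine_ids, cycle_ids):
    """Write the window array and its per-window metadata side by side."""
    if not (len(sequences) == len(engine_ids) == len(cycle_ids)):
        raise ValueError("sequences and metadata must have the same length")
    np.save(os.path.join(out_dir, "rolling_window_sequences.npy"), sequences)
    meta = pd.DataFrame({"engine_id": engine_ids, "cycle": cycle_ids})
    meta.to_csv(os.path.join(out_dir, "sequence_metadata.csv"), index=False)

def load_windows(out_dir):
    """Load both files and verify that row i of the metadata describes window i."""
    sequences = np.load(os.path.join(out_dir, "rolling_window_sequences.npy"))
    meta = pd.read_csv(os.path.join(out_dir, "sequence_metadata.csv"))
    if len(meta) != len(sequences):
        raise ValueError("metadata rows do not match the number of sequences")
    return sequences, meta

# Round trip on toy data
toy_sequences = np.zeros((4, 3, 2), dtype=np.float32)
with tempfile.TemporaryDirectory() as tmp_dir:
    save_windows(tmp_dir, toy_sequences, [1, 1, 2, 2], [3, 4, 3, 4])
    loaded_sequences, loaded_meta = load_windows(tmp_dir)

print(loaded_sequences.shape, len(loaded_meta))  # (4, 3, 2) 4
```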

### Observations:

1. Fixed-Size Sequence Creation for Modeling: This step transforms the time-series data into fixed-length sequences (or "windows") of 30 rows each. Sequence models such as Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) networks are typically trained on batches of temporally ordered input shaped as (sequences, time steps, features), which is the structure produced here.

2. Input Data and Feature Set: The input was the feature-engineered dataset with shape (160,099, 68). After excluding `engine_id` and `cycle`, 66 feature columns remained: 3 operational settings, 21 sensor readings, and 42 pre-calculated rolling statistics (mean and standard deviation over a 5-cycle window for each sensor).

3. Window Configuration: A window size of 30 was applied. Windows are generated separately for each `engine_id` and advance one row at a time, so consecutive windows overlap by 29 rows and each window contains the current row plus the 29 rows before it. This produced 152,559 sequences of shape (30, 66).

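4. Temporal Integrity Check: The data was sorted by `engine_id` and `cycle` before sequence generation, and validation confirmed that the cycle of each window's last row is non-decreasing within each `engine_id`. Because the check allows equal values, it does not rule out repeated cycles. The first window for engine 1 ends at cycle 8 rather than 30, and the data preview shows engine 1 at cycle 1 on three separate rows, which suggests that one `engine_id` is shared by more than one unit (for example, if several subsets were concatenated). If so, windows should be grouped by a key that identifies each unit uniquely before modeling.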
