# SET UP

This notebook sets up the project environment and directory structure for the RUL prediction workflow.

It defines the project environment and paths, imports the required libraries, and loads the initial datasets.

It also creates the validation and working samples that will be used in later stages. The goal is to establish a clear and organized foundation before beginning the data quality, transformation, and modeling steps.

## PROJECT ENVIRONMENT

We will activate the environment used for risk scoring projects (called 'riesgos'), because computational issues currently prevent us from creating a dedicated environment for this new project (which would be the ideal setup).

This environment covers everything we'll need during this new project.

For reference, these are the steps to create the 'riesgos' environment:


**1. Create the environment (terminal):**

conda create --name riesgos -c conda-forge python=3.11.11 numpy pandas matplotlib seaborn scikit-learn=1.3.1 scipy=1.9.3 sqlalchemy xgboost=2.0.3

**2. Activate the environment:**

conda activate riesgos

**3. Install libraries from other channels:**

conda install -c conda-forge pyjanitor scikit-plot yellowbrick imbalanced-learn cloudpickle 

conda install -c districtdatalabs yellowbrick

conda install -c conda-forge streamlit

pip install notebook==5.7.8 jupyter_client jupyter_contrib_nbextensions category_encoders==2.6.2

pip install streamlit-echarts

pip install pipreqs

**4. Save the environment as a .yml**

conda env export > riesgos.yml
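
Since this project reuses an existing environment, it can help to confirm that the pinned packages are actually installed at the expected versions. The following is a minimal sketch, not part of the original setup: the helper `check_versions` and the `PINNED` dictionary are hypothetical names, and the version numbers are copied from the commands above.

```python
from importlib import metadata

# Pinned versions copied from the conda/pip commands above.
# PINNED and check_versions are illustrative names, not part of the project.
PINNED = {
    "scikit-learn": "1.3.1",
    "scipy": "1.9.3",
    "xgboost": "2.0.3",
    "category_encoders": "2.6.2",
}

def check_versions(pinned):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches

# An empty dict means every pinned package matches.
print(check_versions(PINNED))
```

Run inside the activated 'riesgos' environment, an empty dictionary would indicate that the pins match. Any entry lists the expected version next to the installed one (or `None` if the package is missing).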

## IMPORT LIBRARIES

In [2]:
import os
import numpy as np
import pandas as pd

# Autocomplete
%config IPCompleter.greedy=True

## DIRECTORY

In [3]:
root = '/Users/rober/'

### Project name

In [4]:
dir_name = 'cmapss-rul-prediction'

### Create the directory and project structure

In [6]:
path = root + dir_name

In [None]:
# Leaf folders of the project structure (parents are created automatically)
subdirs = [
    '01_Documents',
    '02_Data/01_Raw',
    '02_Data/02_Validation',
    '02_Data/03_Working',
    '02_Data/04_Caches',
    '03_Notebooks/01_Functions',
    '03_Notebooks/02_Development',
    '03_Notebooks/03_System',
    '04_Models',
    '05_Outputs',
    '09_Other',
]

try:
    for subdir in subdirs:
        # exist_ok=True makes the cell safe to re-run on an existing structure
        os.makedirs(os.path.join(path, subdir), exist_ok=True)
except OSError as err:
    print("The directory %s has NOT been created: %s" % (path, err))
else:
    print("The directory %s has been successfully created" % path)

### Set the directory in the project

In [7]:
os.chdir(path)

## INITIAL DATASETS

Raw data in '/02_Data/01_Raw'

3 raw files:

🔹 **train_FD00X.txt** → training data
- Engines run from cycle 1 until failure.
- Each row = one engine at one cycle, grouped by engine ID (unit_number).
- You can compute RUL because you know the failure happens at the last cycle.
- Used to train the model.

🔹 **test_FD00X.txt** → prediction data
- Engines run from cycle 1 until some cutoff (not failure).
- Same structure: each row = one engine at one cycle, grouped by engine ID (unit_number).
- The last cycle per engine is the row where you must predict RUL.
- Used to generate predictions.

🔹 **RUL_FD00X.txt** → ground truth for test
- How much life was left at the cutoff.
- One value per test engine (in the same order they appear in test).
- Each value = the true RUL at the last observed cycle in test.
- Used to evaluate your predictions.

✅ Summary

- Rows are cycles.
- Grouped by engine ID.
- Train files → complete until failure.
- Test files → partial until some cut-off (not failure).
- RUL files → the truth: how far from failure those last test rows really are.
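
Because every training engine runs until failure, its RUL at any row can be derived from the last recorded cycle of that engine. The sketch below illustrates the idea on made-up toy data. It assumes the convention that RUL is 0 at the final cycle, and the function name `add_train_rul` is hypothetical.

```python
import pandas as pd

# Toy data: two engines, each observed until failure (values are made up).
toy_train = pd.DataFrame({
    "unit_number":    [1, 1, 1, 2, 2],
    "time_in_cycles": [1, 2, 3, 1, 2],
})

def add_train_rul(df):
    """Add RUL = (last cycle of the engine) - (current cycle).

    Assumes the engine fails at its last recorded cycle, so RUL is 0 there.
    """
    out = df.copy()
    last_cycle = out.groupby("unit_number")["time_in_cycles"].transform("max")
    out["RUL"] = last_cycle - out["time_in_cycles"]
    return out

toy_with_rul = add_train_rul(toy_train)
print(toy_with_rul)
```

Using `transform("max")` keeps the result aligned row by row with the original frame, so no merge is needed.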

### Import data

Choose which dataset to process (FD001, FD002, FD003, FD004)

**We'll focus on the first simulation scenario FD001 end to end** and then apply the same process to the rest.

In [8]:
dataset_id = 'FD001'  # FD001, FD002, FD003, FD004

raw_dir = os.path.join(path, '02_Data', '01_Raw')

train_path = os.path.join(raw_dir, f'train_{dataset_id}.txt')
test_path  = os.path.join(raw_dir,  f'test_{dataset_id}.txt')
rul_path   = os.path.join(raw_dir,  f'RUL_{dataset_id}.txt')

In [10]:
# sep=r'\s+' splits on any whitespace (delim_whitespace is deprecated in recent pandas)
train_df = pd.read_csv(train_path, sep=r'\s+', header=None)
test_df  = pd.read_csv(test_path,  sep=r'\s+', header=None)
rul_df   = pd.read_csv(rul_path,   sep=r'\s+', header=None)

In [11]:
train_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,16,17,18,19,20,21,22,23,24,25
0,1,1,-0.0007,-0.0004,100.0,518.67,641.82,1589.70,1400.60,14.62,...,521.66,2388.02,8138.62,8.4195,0.03,392,2388,100.0,39.06,23.4190
1,1,2,0.0019,-0.0003,100.0,518.67,642.15,1591.82,1403.14,14.62,...,522.28,2388.07,8131.49,8.4318,0.03,392,2388,100.0,39.00,23.4236
2,1,3,-0.0043,0.0003,100.0,518.67,642.35,1587.99,1404.20,14.62,...,522.42,2388.03,8133.23,8.4178,0.03,390,2388,100.0,38.95,23.3442
3,1,4,0.0007,0.0000,100.0,518.67,642.35,1582.79,1401.87,14.62,...,522.86,2388.08,8133.83,8.3682,0.03,392,2388,100.0,38.88,23.3739
4,1,5,-0.0019,-0.0002,100.0,518.67,642.37,1582.85,1406.22,14.62,...,522.19,2388.04,8133.80,8.4294,0.03,393,2388,100.0,38.90,23.4044
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20626,100,196,-0.0004,-0.0003,100.0,518.67,643.49,1597.98,1428.63,14.62,...,519.49,2388.26,8137.60,8.4956,0.03,397,2388,100.0,38.49,22.9735
20627,100,197,-0.0016,-0.0005,100.0,518.67,643.54,1604.50,1433.58,14.62,...,519.68,2388.22,8136.50,8.5139,0.03,395,2388,100.0,38.30,23.1594
20628,100,198,0.0004,0.0000,100.0,518.67,643.42,1602.46,1428.18,14.62,...,520.01,2388.24,8141.05,8.5646,0.03,398,2388,100.0,38.44,22.9333
20629,100,199,-0.0011,0.0003,100.0,518.67,643.23,1605.26,1426.53,14.62,...,519.67,2388.23,8139.29,8.5389,0.03,395,2388,100.0,38.29,23.0640


In [12]:
test_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,16,17,18,19,20,21,22,23,24,25
0,1,1,0.0023,0.0003,100.0,518.67,643.02,1585.29,1398.21,14.62,...,521.72,2388.03,8125.55,8.4052,0.03,392,2388,100.0,38.86,23.3735
1,1,2,-0.0027,-0.0003,100.0,518.67,641.71,1588.45,1395.42,14.62,...,522.16,2388.06,8139.62,8.3803,0.03,393,2388,100.0,39.02,23.3916
2,1,3,0.0003,0.0001,100.0,518.67,642.46,1586.94,1401.34,14.62,...,521.97,2388.03,8130.10,8.4441,0.03,393,2388,100.0,39.08,23.4166
3,1,4,0.0042,0.0000,100.0,518.67,642.44,1584.12,1406.42,14.62,...,521.38,2388.05,8132.90,8.3917,0.03,391,2388,100.0,39.00,23.3737
4,1,5,0.0014,0.0000,100.0,518.67,642.51,1587.19,1401.92,14.62,...,522.15,2388.03,8129.54,8.4031,0.03,390,2388,100.0,38.99,23.4130
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13091,100,194,0.0049,0.0000,100.0,518.67,643.24,1599.45,1415.79,14.62,...,520.69,2388.00,8213.28,8.4715,0.03,394,2388,100.0,38.65,23.1974
13092,100,195,-0.0011,-0.0001,100.0,518.67,643.22,1595.69,1422.05,14.62,...,521.05,2388.09,8210.85,8.4512,0.03,395,2388,100.0,38.57,23.2771
13093,100,196,-0.0006,-0.0003,100.0,518.67,643.44,1593.15,1406.82,14.62,...,521.18,2388.04,8217.24,8.4569,0.03,395,2388,100.0,38.62,23.2051
13094,100,197,-0.0038,0.0001,100.0,518.67,643.26,1594.99,1419.36,14.62,...,521.33,2388.08,8220.48,8.4711,0.03,395,2388,100.0,38.66,23.2699


In [13]:
rul_df

Unnamed: 0,0
0,112
1,98
2,69
3,82
4,91
...,...
95,137
96,82
97,59
98,117


### Insert column names

Column names come from CMAPSS dataset documentation (by NASA)

The 5 known features are always in this order:
- unit_number
- time_in_cycles
- 3 operational settings (labels are arbitrary: op_setting_1, 2, 3)

The remaining 21 columns are unnamed sensors in the data, so we label them as: sensor_1, sensor_2, ..., sensor_21

In [15]:
# Name columns dynamically

n_cols = train_df.shape[1] # this n_cols is also valid for 'test'
n_sensors = n_cols - 5 # the first 5 columns (unit, cycle, 3 op settings) are not sensors

columns = (
    ['unit_number', 'time_in_cycles'] +
    [f'op_setting_{i}' for i in range(1, 4)] +
    [f'sensor_{i}' for i in range(1, n_sensors + 1)]
) # concatenate names to create the whole list of column names ('columns')

train_df.columns = columns
test_df.columns = columns
rul_df.columns = ['RUL']

In [16]:
train_df

Unnamed: 0,unit_number,time_in_cycles,op_setting_1,op_setting_2,op_setting_3,sensor_1,sensor_2,sensor_3,sensor_4,sensor_5,...,sensor_12,sensor_13,sensor_14,sensor_15,sensor_16,sensor_17,sensor_18,sensor_19,sensor_20,sensor_21
0,1,1,-0.0007,-0.0004,100.0,518.67,641.82,1589.70,1400.60,14.62,...,521.66,2388.02,8138.62,8.4195,0.03,392,2388,100.0,39.06,23.4190
1,1,2,0.0019,-0.0003,100.0,518.67,642.15,1591.82,1403.14,14.62,...,522.28,2388.07,8131.49,8.4318,0.03,392,2388,100.0,39.00,23.4236
2,1,3,-0.0043,0.0003,100.0,518.67,642.35,1587.99,1404.20,14.62,...,522.42,2388.03,8133.23,8.4178,0.03,390,2388,100.0,38.95,23.3442
3,1,4,0.0007,0.0000,100.0,518.67,642.35,1582.79,1401.87,14.62,...,522.86,2388.08,8133.83,8.3682,0.03,392,2388,100.0,38.88,23.3739
4,1,5,-0.0019,-0.0002,100.0,518.67,642.37,1582.85,1406.22,14.62,...,522.19,2388.04,8133.80,8.4294,0.03,393,2388,100.0,38.90,23.4044
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20626,100,196,-0.0004,-0.0003,100.0,518.67,643.49,1597.98,1428.63,14.62,...,519.49,2388.26,8137.60,8.4956,0.03,397,2388,100.0,38.49,22.9735
20627,100,197,-0.0016,-0.0005,100.0,518.67,643.54,1604.50,1433.58,14.62,...,519.68,2388.22,8136.50,8.5139,0.03,395,2388,100.0,38.30,23.1594
20628,100,198,0.0004,0.0000,100.0,518.67,643.42,1602.46,1428.18,14.62,...,520.01,2388.24,8141.05,8.5646,0.03,398,2388,100.0,38.44,22.9333
20629,100,199,-0.0011,0.0003,100.0,518.67,643.23,1605.26,1426.53,14.62,...,519.67,2388.23,8139.29,8.5389,0.03,395,2388,100.0,38.29,23.0640


In [17]:
test_df

Unnamed: 0,unit_number,time_in_cycles,op_setting_1,op_setting_2,op_setting_3,sensor_1,sensor_2,sensor_3,sensor_4,sensor_5,...,sensor_12,sensor_13,sensor_14,sensor_15,sensor_16,sensor_17,sensor_18,sensor_19,sensor_20,sensor_21
0,1,1,0.0023,0.0003,100.0,518.67,643.02,1585.29,1398.21,14.62,...,521.72,2388.03,8125.55,8.4052,0.03,392,2388,100.0,38.86,23.3735
1,1,2,-0.0027,-0.0003,100.0,518.67,641.71,1588.45,1395.42,14.62,...,522.16,2388.06,8139.62,8.3803,0.03,393,2388,100.0,39.02,23.3916
2,1,3,0.0003,0.0001,100.0,518.67,642.46,1586.94,1401.34,14.62,...,521.97,2388.03,8130.10,8.4441,0.03,393,2388,100.0,39.08,23.4166
3,1,4,0.0042,0.0000,100.0,518.67,642.44,1584.12,1406.42,14.62,...,521.38,2388.05,8132.90,8.3917,0.03,391,2388,100.0,39.00,23.3737
4,1,5,0.0014,0.0000,100.0,518.67,642.51,1587.19,1401.92,14.62,...,522.15,2388.03,8129.54,8.4031,0.03,390,2388,100.0,38.99,23.4130
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13091,100,194,0.0049,0.0000,100.0,518.67,643.24,1599.45,1415.79,14.62,...,520.69,2388.00,8213.28,8.4715,0.03,394,2388,100.0,38.65,23.1974
13092,100,195,-0.0011,-0.0001,100.0,518.67,643.22,1595.69,1422.05,14.62,...,521.05,2388.09,8210.85,8.4512,0.03,395,2388,100.0,38.57,23.2771
13093,100,196,-0.0006,-0.0003,100.0,518.67,643.44,1593.15,1406.82,14.62,...,521.18,2388.04,8217.24,8.4569,0.03,395,2388,100.0,38.62,23.2051
13094,100,197,-0.0038,0.0001,100.0,518.67,643.26,1594.99,1419.36,14.62,...,521.33,2388.08,8220.48,8.4711,0.03,395,2388,100.0,38.66,23.2699


In [18]:
rul_df

Unnamed: 0,RUL
0,112
1,98
2,69
3,82
4,91
...,...
95,137
96,82
97,59
98,117
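
With the columns named, the RUL values can be attached to the last observed row of each test engine. This is a minimal sketch on made-up toy data, assuming (as described above) that the RUL file lists one value per test engine in ascending `unit_number` order. The helper name `last_cycle_with_rul` is hypothetical.

```python
import pandas as pd

# Toy stand-ins for test_df and rul_df (values are made up).
toy_test = pd.DataFrame({
    "unit_number":    [1, 1, 2, 2, 2],
    "time_in_cycles": [1, 2, 1, 2, 3],
})
toy_rul = pd.DataFrame({"RUL": [112, 98]})

def last_cycle_with_rul(test, rul):
    """Return the last observed row per engine with its true RUL attached.

    Assumes the RUL rows follow ascending unit_number order.
    """
    last_rows = (
        test.sort_values(["unit_number", "time_in_cycles"])
            .groupby("unit_number")
            .tail(1)
            .reset_index(drop=True)
    )
    if len(last_rows) != len(rul):
        raise ValueError("Number of engines and RUL values differ")
    # Positional assignment: row i of rul belongs to the i-th engine
    last_rows["RUL"] = rul["RUL"].to_numpy()
    return last_rows

labeled = last_cycle_with_rul(toy_test, toy_rul)
print(labeled)
```

The length check guards against pairing the wrong RUL file with a test file, which would otherwise go unnoticed because the assignment is purely positional.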


### Split into validation and work datasets

In [19]:
# Validation split %
val_frac = 0.3

# Get engine IDs
unit_ids = train_df['unit_number'].unique()
val_units = pd.Series(unit_ids).sample(frac=val_frac, random_state=42).tolist()

# Split by engine ID
val_df  = train_df[train_df['unit_number'].isin(val_units)].copy()
work_df = train_df[~train_df['unit_number'].isin(val_units)].copy()

In [22]:
work_df.shape

(14507, 26)

In [23]:
val_df.shape

(6124, 26)
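
Splitting by engine ID rather than by row keeps every cycle of an engine in the same part, so the validation set contains no engine that also appears in the working data. The sketch below repeats the same sampling logic on made-up toy data and adds two sanity checks. The function name `split_by_unit` is hypothetical.

```python
import pandas as pd

# Toy data: four engines with two cycles each (values are made up).
toy = pd.DataFrame({
    "unit_number":    [1, 1, 2, 2, 3, 3, 4, 4],
    "time_in_cycles": [1, 2, 1, 2, 1, 2, 1, 2],
})

def split_by_unit(df, val_frac, seed=42):
    """Split rows into (work, validation) by sampling whole engines."""
    unit_ids = pd.Series(df["unit_number"].unique())
    val_units = unit_ids.sample(frac=val_frac, random_state=seed)
    is_val = df["unit_number"].isin(val_units)
    return df[~is_val].copy(), df[is_val].copy()

toy_work, toy_val = split_by_unit(toy, val_frac=0.25)

# Sanity checks: no engine in both parts, and no row lost or duplicated.
shared_units = set(toy_work["unit_number"]) & set(toy_val["unit_number"])
assert not shared_units
assert len(toy_work) + len(toy_val) == len(toy)
```

The same two assertions can be applied to `work_df` and `val_df` against `train_df`.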

### Save validation and work datasets

In [20]:
# Output folders
validation_dir = os.path.join(path, '02_Data', '02_Validation')
work_dir       = os.path.join(path, '02_Data', '03_Working')

# Ensure folders exist
os.makedirs(validation_dir, exist_ok=True)
os.makedirs(work_dir, exist_ok=True)

# Save CSVs
val_df.to_csv(os.path.join(validation_dir, f'validation_{dataset_id}.csv'), index=False)
rul_df.to_csv(os.path.join(validation_dir, f'RUL_{dataset_id}.csv'), index=False)

work_df.to_csv(os.path.join(work_dir, f'work_{dataset_id}.csv'), index=False)
test_df.to_csv(os.path.join(work_dir, f'test_{dataset_id}.csv'), index=False)

### (Optional) Save a sample from work

We won't save a sample this time, since the largest table has only about 20k rows, which is easy to process in full.