# FEATURE ENGINEERING

This notebook applies transformations to the variables to improve the predictive power of the dataset.

It covers the creation of the RUL (Remaining Useful Life) target, numerical transformations such as normalization, standardization, and binarization, and the merging of the transformed variables into a single analytical base table.

The goal is a well-structured, transformed dataset ready for feature selection, where we preselect the variables to use for modeling.

## IMPORT LIBRARIES

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import LabelEncoder
from category_encoders import TargetEncoder
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.preprocessing import Binarizer
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import QuantileTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import MaxAbsScaler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

# Autocomplete
%config IPCompleter.greedy=True

## IMPORT DATASETS

In [2]:
WORK_PATH = '/Users/rober/cmapss-rul-prediction/02_Data/03_Working/'

df_work = pd.read_pickle(f"{WORK_PATH}df_work_eda.pickle")
cat = pd.read_pickle(f"{WORK_PATH}cat_eda.pickle")
num = pd.read_pickle(f"{WORK_PATH}num_eda.pickle")

## CREATE THE TARGET: Remaining Useful Life (RUL)

We always create the project target at this stage because many projects need it for some transformations (such as target encoding).

In this case, we won't apply any transformation that needs the target, but we'll create it anyway.

We'll merge it at the end of the notebook with the other dataframes created during the transformations to build our analytical base table (_df_analytical_base_).

We calculate the target like this:

**Remaining Useful Life (_RUL_) = maximum time per engine (_unit_number_) - current time (_time_in_cycles_)**

RUL = 'max_cycle' (new variable) - 'time_in_cycles'

In [3]:
# Get the last cycle for each engine
rul_per_unit = df_work.groupby('unit_number')['time_in_cycles'].max().reset_index()
rul_per_unit.columns = ['unit_number', 'max_cycle']

# Merge back to the main dataframe
df_work = df_work.merge(rul_per_unit, on='unit_number', how='left')

# Calculate RUL
df_work['RUL'] = df_work['max_cycle'] - df_work['time_in_cycles']

# Clean up (optional)
df_work.drop(columns=['max_cycle'], inplace=True)

# Extract the target
target = df_work[['RUL']].copy()

In [4]:
df_work[['unit_number', 'time_in_cycles', 'RUL']].head(15)

   unit_number  time_in_cycles  RUL
0            2               1  286
1            2               2  285
2            2               3  284
3            2               4  283
4            2               5  282
5            2               6  281
6            2               7  280
7            2               8  279
8            2               9  278
9            2              10  277
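
As a quick check of the formula, the sketch below computes RUL on a small made-up dataframe (`toy` is hypothetical, not CMAPSS data). It uses `groupby(...).transform('max')`, an alternative to the merge used in the cell above that should give the same result without the temporary `max_cycle` column.

```python
import pandas as pd

# Toy data (hypothetical): engine 1 runs 3 cycles, engine 2 runs 2 cycles
toy = pd.DataFrame({
    'unit_number':    [1, 1, 1, 2, 2],
    'time_in_cycles': [1, 2, 3, 1, 2],
})

# transform('max') broadcasts each engine's last cycle back to every row,
# so no intermediate dataframe or merge is needed
max_cycle = toy.groupby('unit_number')['time_in_cycles'].transform('max')
toy['RUL'] = max_cycle - toy['time_in_cycles']

print(toy)
```

The last recorded cycle of each engine gets RUL = 0, and the value grows by one for every cycle further back in time.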


## NUMERICAL TRANSFORMATION

(No categorical transformations in this project: 'unit_number' is the only categorical variable.)

**SUMMARY AFTER DATA QUALITY AND EDA:**

👉🏻 **ACTION: Create a new variable: binary flag sensor_6_drop** and keep sensor_6 as-is for now.
- We'll include it in the notebook dedicated to transforming variables to keep the same scheme as other projects:
- *df_work['sensor_6_drop'] = (df_work['sensor_6'] < 21).astype(int)*

⚠️ **Flat or extremely narrow distributions (low variance)**:
- **sensor_6, op_setting_1, op_setting_2**: almost constant or near-constant. Likely not informative.
- **➤ Action: Flag for potential removal unless they're critical domain-wise... or just drop them**

🔍 **Multimodal patterns** (potential operational modes):
- **sensor_17** (clearly multimodal), and perhaps **op_setting_2** show multiple peaks.
- **➤ Action: Might require mode separation (e.g., clustering) or further investigation. Could reflect different regimes or engine types.**

✅ **Clean bell curves**:
- **sensor_2, sensor_3, sensor_4, sensor_11...** look well-behaved and normally distributed.
- **➤ Action: Good candidates for standard scaling and modeling.**

🔍 **Skewed or asymmetric distributions**:
- **sensor_14** is right-skewed, **sensor_9** has long tails.
- **➤ Action: Consider transformations depending on correlation with target.**

### Binarize variables

**SENSOR 6** 

1 when sensor_6 drops below 21

**Off-topic**:

We'll binarize it manually. We used the *Binarizer* class in previous projects, but it only supports the > comparison against its threshold, which is the opposite of what we want:

- What we want:

✅ *sensor_6_drop = 1 if sensor_6 **<** 21 else 0*

- What does *Binarizer(threshold=21)* do?

❌ *sensor_6_drop = 1 if sensor_6 **>** 21 else 0*

Surprisingly, Binarizer has no parameter to reverse the comparison.


**Why is Binarizer so basic?**
Because it's not meant for feature engineering; it's meant for preprocessing pipelines.
- Binarizer was designed to work as a transformer in a pipeline
- It's meant to threshold data quickly. With the default threshold of 0, it will:
    - Turn any positive number into 1
    - Map 0 and negative values to 0
#### Binarize manually

In [None]:
df_work['sensor_6_drop'] = (df_work['sensor_6'] < 21).astype(int)

### Normalize (Gauss)

- sensor_14 right-skewed: Yeo-Johnson
- sensor_9 long tails: Quantile Transformer

#### Yeo-Johnson

##### Variables to normalize with Yeo-Johnson

In [6]:
var_yeo = ['sensor_14']

##### Instance

In [7]:
yeo = PowerTransformer(method = 'yeo-johnson')

##### Train and apply

In [8]:
num_yeo = yeo.fit_transform(num[var_yeo])

##### Save as dataframe

In [9]:
# Add suffix to names
names_yeo = [variable + '_yeo' for variable in var_yeo]

# Save as dataframe
num_yeo = pd.DataFrame(num_yeo, columns=names_yeo)
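
The effect of Yeo-Johnson is easier to see on synthetic data. The sketch below is only an illustration, assuming a log-normal variable as a stand-in for a right-skewed sensor (it does not use CMAPSS data, and the names `raw`, `pt`, and `x_yeo` are made up for the example). It compares skewness before and after the transformation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed variable (log-normal); not CMAPSS data
rng = np.random.default_rng(0)
raw = pd.DataFrame({'x': rng.lognormal(mean=0.0, sigma=0.8, size=2000)})

# standardize=True is the default, so the output is also centered and scaled
pt = PowerTransformer(method='yeo-johnson')
x_yeo = pd.Series(pt.fit_transform(raw).ravel(), name='x_yeo')

print('skew before:', round(raw['x'].skew(), 2))
print('skew after: ', round(x_yeo.skew(), 2))
print('mean / std after:', round(x_yeo.mean(), 2), round(x_yeo.std(ddof=0), 2))
print('fitted lambda:', pt.lambdas_)
```

Because `PowerTransformer` standardizes its output by default, a variable transformed this way already has mean 0 and unit variance and does not need a separate standardization step.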

#### Quantile Transformer

##### Variables to normalize with Quantile Transformer

In [10]:
var_qt = ['sensor_9']

##### Instance

In [11]:
qt = QuantileTransformer(output_distribution='normal')

##### Train and apply

In [12]:
num_qt = qt.fit_transform(num[var_qt])

##### Save as dataframe

In [13]:
# Add suffix to names
names_qt = [variable + '_qt' for variable in var_qt]

# Save as dataframe
num_qt = pd.DataFrame(num_qt, columns=names_qt)
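
As a rough illustration of what the quantile transformation does to long tails, the sketch below uses a synthetic heavy-tailed variable (Student's t with 2 degrees of freedom, an assumption standing in for a long-tailed sensor, not CMAPSS data). The names `raw`, `qt_demo`, and `x_qt` are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

# Synthetic heavy-tailed variable (Student's t, 2 degrees of freedom); not CMAPSS data
rng = np.random.default_rng(0)
raw = pd.DataFrame({'x': rng.standard_t(df=2, size=2000)})

# n_quantiles=1000 is the default; random_state only matters when
# scikit-learn subsamples a large input to estimate the quantiles
qt_demo = QuantileTransformer(output_distribution='normal',
                              n_quantiles=1000, random_state=0)
x_qt = pd.Series(qt_demo.fit_transform(raw).ravel(), name='x_qt')

print('excess kurtosis before:', round(raw['x'].kurt(), 2))
print('excess kurtosis after: ', round(x_qt.kurt(), 2))
print('range after:', round(x_qt.min(), 2), 'to', round(x_qt.max(), 2))
```

The transformation maps ranks to normal quantiles, so extreme values are pulled into a bounded range. The trade-off is that distances between the original values are not preserved, only their order.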

## MERGE TRANSFORMED DATASETS

### Create a list including all dataframes generated

In [14]:
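# Collect every dataframe whose variable name starts with 'cat_' or 'num_'
# (at this point: num_yeo and num_qt). The result depends on which cells have already run.
dataframes = []
dataframes.extend(value for name, value in locals().items() if name.startswith(('cat_', 'num_')))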

### Merge all dataframes

In [15]:
df = pd.concat(dataframes, axis = 1)
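
One point worth keeping in mind with `pd.concat(..., axis=1)` is that rows are matched by index label, not by position. The sketch below uses two made-up frames (`a` and `b` are hypothetical) to show what happens when the indexes differ and one way to avoid it.

```python
import pandas as pd

# Two toy frames with the same number of rows but different index labels
a = pd.DataFrame({'a': [1.0, 2.0, 3.0]}, index=[10, 11, 12])
b = pd.DataFrame({'b': [4.0, 5.0, 6.0]})   # default RangeIndex 0..2

# concat(axis=1) aligns on index labels, not on row position
misaligned = pd.concat([a, b], axis=1)
print(misaligned.shape)   # (6, 2): the union of both indexes, padded with NaN

# Resetting the index first restores a position-based match
aligned = pd.concat([a.reset_index(drop=True), b], axis=1)
print(aligned.shape)      # (3, 2): no NaN
```

In this notebook every transformed dataframe is built from a NumPy array and therefore carries a default RangeIndex, so the concatenation lines up as long as the source data has the same row order. An explicit list such as `[num_yeo, num_qt]` would be a less fragile alternative to scanning `locals()`.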

## RESCALE VARIABLES

🟢 **Variables to rescale:**

- time_in_cycles
- sensor_2
- sensor_3
- sensor_4
- sensor_7
- sensor_8
- sensor_11
- sensor_12
- sensor_13
- sensor_15
- sensor_20
- sensor_21

**Method: for regression, a common and widely recommended rescaling method is standardization (a.k.a. Z-scores):**
- It centers the mean at 0 and scales the standard deviation to 1

🔴 Variables already transformed or set aside (excluded from rescaling):
- sensor_6 → replaced by binary flag sensor_6_drop
- sensor_9 → already transformed with Quantile Transformer (sensor_9_qt), and we checked its scale
- sensor_14 → transformed (Yeo-Johnson)
- sensor_17 → multimodal, to be analyzed separately
- op_setting_1, op_setting_2 → near-constant (likely drop)

### Standardization (Z-scores)

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14507 entries, 0 to 14506
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   sensor_14_yeo  14507 non-null  float64
 1   sensor_9_qt    14507 non-null  float64
dtypes: float64(2)
memory usage: 226.8 KB


#### Variables to rescale via standardization

In [17]:
var_ss = [
    'time_in_cycles',
    'sensor_2',
    'sensor_3',
    'sensor_4',
    'sensor_7',
    'sensor_8',
    'sensor_11',
    'sensor_12',
    'sensor_13',
    'sensor_15',
    'sensor_20',
    'sensor_21'
]

#### Instance

In [18]:
ss = StandardScaler()

#### Train and apply

In [19]:
num_ss = ss.fit_transform(num[var_ss])

#### Save as dataframe

In [20]:
# Add suffix to names
names_ss = [variable + '_ss' for variable in var_ss]

# Save as dataframe
num_ss = pd.DataFrame(num_ss, columns=names_ss)
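
For reference, standardization computes `z = (x - mean) / std` per column. The sketch below checks this by hand on a toy column (hypothetical values) and shows that `StandardScaler` uses the population standard deviation (`ddof=0`), not the sample one that pandas uses by default.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy column (hypothetical values): mean = 5, population std = 2
toy = pd.DataFrame({'x': [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]})

scaled = StandardScaler().fit_transform(toy).ravel()

# Same result by hand: StandardScaler divides by the population std (ddof=0)
manual = ((toy['x'] - toy['x'].mean()) / toy['x'].std(ddof=0)).to_numpy()

print(np.allclose(scaled, manual))   # True
print(scaled)                        # [-1.5 -0.5 -0.5 -0.5  0.   0.   1.   2. ]
```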

## MERGE DATASETS

### List of dataframes to include in our analytical base table

In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14507 entries, 0 to 14506
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   sensor_14_yeo  14507 non-null  float64
 1   sensor_9_qt    14507 non-null  float64
dtypes: float64(2)
memory usage: 226.8 KB


In [32]:
num_ss.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14507 entries, 0 to 14506
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   time_in_cycles_ss  14507 non-null  float64
 1   sensor_2_ss        14507 non-null  float64
 2   sensor_3_ss        14507 non-null  float64
 3   sensor_4_ss        14507 non-null  float64
 4   sensor_7_ss        14507 non-null  float64
 5   sensor_8_ss        14507 non-null  float64
 6   sensor_11_ss       14507 non-null  float64
 7   sensor_12_ss       14507 non-null  float64
 8   sensor_13_ss       14507 non-null  float64
 9   sensor_15_ss       14507 non-null  float64
 10  sensor_20_ss       14507 non-null  float64
 11  sensor_21_ss       14507 non-null  float64
dtypes: float64(12)
memory usage: 1.3 MB


In [22]:
include = [df, num_ss, target]

### Merge those dataframes to create the analytical base table

In [27]:
df_analytical_base = pd.concat(include, axis = 1)

In [28]:
df_analytical_base.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14507 entries, 0 to 14506
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sensor_14_yeo      14507 non-null  float64
 1   sensor_9_qt        14507 non-null  float64
 2   time_in_cycles_ss  14507 non-null  float64
 3   sensor_2_ss        14507 non-null  float64
 4   sensor_3_ss        14507 non-null  float64
 5   sensor_4_ss        14507 non-null  float64
 6   sensor_7_ss        14507 non-null  float64
 7   sensor_8_ss        14507 non-null  float64
 8   sensor_11_ss       14507 non-null  float64
 9   sensor_12_ss       14507 non-null  float64
 10  sensor_13_ss       14507 non-null  float64
 11  sensor_15_ss       14507 non-null  float64
 12  sensor_20_ss       14507 non-null  float64
 13  sensor_21_ss       14507 non-null  float64
 14  RUL                14507 non-null  int64  
dtypes: float64(14), int64(1)
memory usage: 1.7 MB


## SAVE DATASET AFTER FEATURE ENGINEERING

In [29]:
# File name
ANALYTICAL_BASE_PATH = WORK_PATH + 'df_analytical_base.pickle'

In [30]:
# Save file
df_analytical_base.to_pickle(ANALYTICAL_BASE_PATH)
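
A quick way to confirm that a pickle was written correctly is to read it back and compare. The sketch below does this with a toy dataframe in a temporary directory (the frame `toy` and the file name `toy_base.pickle` are hypothetical, so it does not touch the project files).

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({'feature_ss': [-1.0, 0.0, 1.0], 'RUL': [2, 1, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_base.pickle')   # hypothetical file name
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# Raises AssertionError if values, dtypes, or index differ
pd.testing.assert_frame_equal(toy, restored)
```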