# PREPARATION OF PRODUCTION CODE

Goal: Prepare the final, production-ready pipeline for the RUL prediction model.

Key idea:
- pandas-based operations are done before the pipeline (data cleaning and structural changes).
- scikit-learn transformations are done inside the pipeline (for modeling). 


Our approach at this stage is:

1. **Load the raw dataset.** Read original data (no preprocessing applied). 


2. **Transformations in the structure (outside the pipeline, with pandas)**
- Correct column names
- Remove duplicates and nulls (not needed in this project)
- Restrict the dataset to final selected variables
- Create the target variable (RUL) (and declare X and y)

3. **Create the pipeline and include transformations in the data (with scikit-learn)**, such as:
- imputations, encodings, and scalings using scikit-learn transformers (only standard scaling in this project)

4. **Integrate the model within the pipeline**

 
5. **Save the final execution-ready pipeline and store it for retraining or production use**

After this, only two more notebooks remain. Both are really scripts, into which we paste what we prepared in this notebook:
- 08_Retraining Code
- 09_Execution Code

## IMPORT LIBRARIES

In [2]:
import numpy as np
import pandas as pd
import cloudpickle  # alternative to pickle that can also serialize functions defined in this notebook (e.g. data_quality)

#Enable fast autocomplete
%config IPCompleter.greedy=True

#Libraries needed for any project
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

#Specific libraries for this example project template:
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import HistGradientBoostingRegressor

## IMPORT DATA

### Import datasets

In [3]:
# Project path
PROJECT_PATH = '/Users/rober/cmapss-rul-prediction'

# Dataset selection (FD001, FD002, FD003, or FD004)
dataset_id = 'FD001'

# Data paths
path = PROJECT_PATH + '/02_Data/01_Raw/'
train_path = path + f'train_{dataset_id}.txt'

# Load datasets
df = pd.read_csv(train_path, sep=r'\s+', header=None)  # whitespace-delimited file; delim_whitespace is deprecated in recent pandas

### Keep only the selected variables

#### Load list of selected variables (saved in a previous notebook)

In [5]:
selected_variables_path = PROJECT_PATH + '/05_Results/' + 'selected_variables.pickle'

pd.read_pickle(selected_variables_path).sort_index().values.tolist()

['time_in_cycles_ss',
 'sensor_11_ss',
 'sensor_4_ss',
 'sensor_12_ss',
 'sensor_7_ss',
 'sensor_15_ss',
 'sensor_21_ss',
 'sensor_20_ss']

#### List of selected variables without suffixes

In [6]:
selected_variables = ['time_in_cycles',
                     'sensor_11',
                     'sensor_4',
                     'sensor_12',
                     'sensor_7',
                     'sensor_15',
                     'sensor_21',
                     'sensor_20']
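
The hand-typed list above could also be derived from the saved names, which avoids copy errors if the selection ever changes. A minimal sketch, assuming every saved name carries the `_ss` suffix shown in the output above; `strip_suffix`, `saved_names`, and `derived_variables` are hypothetical names, not part of the project:

```python
def strip_suffix(name, suffix='_ss'):
    """Return `name` without the trailing suffix, if it has one."""
    return name[:-len(suffix)] if name.endswith(suffix) else name

# Names as displayed by the previous cell
saved_names = ['time_in_cycles_ss', 'sensor_11_ss', 'sensor_4_ss', 'sensor_12_ss',
               'sensor_7_ss', 'sensor_15_ss', 'sensor_21_ss', 'sensor_20_ss']

# Same order as the saved list, with the rescaling suffix removed
derived_variables = [strip_suffix(name) for name in saved_names]
```

In the notebook, `saved_names` would come from the `pd.read_pickle(...)` call rather than from a literal list.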

#### Transformations in selected variables

| Transformation | Method | time_in_cycles | sensor_11 | sensor_4 | sensor_12 | sensor_7 | sensor_15 | sensor_21 | sensor_20 |
| --------------- | ------ | -------------- | --------- | -------- | --------- | -------- | --------- | --------- | --------- |
| NAMES           | MANUAL | X              | X         | X        | X         | X        | X         | X         | X         |
| TYPES           |        |                | X         | X        | X         | X        | X         | X         | X         |
| RESCALING       | SS     | X              | X         | X        | X         | X        | X         | X         | X         |

## STRUCTURAL TRANSFORMATIONS

1. Correct names (original columns have just numbers as names)
2. Create the target (calculate RUL)

### Correct names

In [7]:
# Name columns dynamically

n_cols = df.shape[1] # this n_cols is valid for 'test' also
n_sensors = n_cols - 5 # first 5 columns are not sensors (2 identifiers + 3 operational settings)

columns = (
    ['unit_number', 'time_in_cycles'] +
    [f'op_setting_{i}' for i in range(1, 4)] +
    [f'sensor_{i}' for i in range(1, n_sensors + 1)]
) # concatenate names to create the whole list of column names ('columns')

df.columns = columns
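
As a sanity check on the naming rule, the same logic can be wrapped in a function and tried on a known column count. A sketch, assuming the raw file has 26 columns (2 identifiers, 3 operational settings, 21 sensors); `build_column_names` and the two constants are hypothetical helpers, not part of the project:

```python
ID_COLS = ['unit_number', 'time_in_cycles']
SETTING_COLS = [f'op_setting_{i}' for i in range(1, 4)]

def build_column_names(n_cols):
    """Build the full list of column names for a raw file with n_cols columns."""
    n_fixed = len(ID_COLS) + len(SETTING_COLS)  # 5 non-sensor columns
    n_sensors = n_cols - n_fixed
    if n_sensors < 1:
        raise ValueError(f'Expected more than {n_fixed} columns, got {n_cols}')
    return ID_COLS + SETTING_COLS + [f'sensor_{i}' for i in range(1, n_sensors + 1)]

names = build_column_names(26)  # 26 is an assumed column count
```

Failing loudly on an unexpected column count is a design choice: a silently misnamed column would make the later selection by name pick up the wrong sensor.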

### Create the target and assign X and y

In [8]:
# Get the last cycle for each engine
rul_per_unit = df.groupby('unit_number')['time_in_cycles'].max().reset_index()
rul_per_unit.columns = ['unit_number', 'max_cycle']

# Merge back to the main dataframe
df = df.merge(rul_per_unit, on='unit_number', how='left')

# Calculate RUL
df['RUL'] = df['max_cycle'] - df['time_in_cycles']

# Clean up (optional)
df.drop(columns=['max_cycle'], inplace=True)
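
The same target can be computed without the intermediate merge by broadcasting each engine's last cycle with `groupby(...).transform('max')`. A minimal sketch on a toy frame (the values are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'unit_number':    [1, 1, 1, 2, 2],
    'time_in_cycles': [1, 2, 3, 1, 2],
})

# Last observed cycle per engine, aligned row by row with the original frame
max_cycle = toy.groupby('unit_number')['time_in_cycles'].transform('max')

# RUL = cycles remaining until the engine's last observed cycle
toy['RUL'] = max_cycle - toy['time_in_cycles']
```

Both versions give the same result; the `transform` form just avoids creating and then dropping the helper column.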

#### For X: index only selected variables

In [9]:
X = df[selected_variables].copy()

#### For y: specify the target and create y

In [10]:
# Define target variable
target = 'RUL'

# Create y
y = df[target].copy()

## CREATE THE PIPELINE (for data quality and transformations; not mandatory but efficient) 

### Instantiate the data quality function

#### Create the function

The only data quality step we applied to the selected variables was converting their types to float.

In [11]:
def data_quality(df):

    # Work on a copy to avoid mutating the original
    temp = df.copy()

    # Cast all sensor columns to float to unify types
    sensor_cols = [col for col in temp.columns if 'sensor_' in col]
    temp[sensor_cols] = temp[sensor_cols].astype(float)

    return temp

#### Convert the function into a transformer

In [12]:
do_data_quality = FunctionTransformer(data_quality)
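
A quick way to check a transformer of this kind is to run it on a toy frame and confirm that the sensor columns come back as float while the input is left untouched. A sketch, assuming the wrapped function writes the converted columns to the copy it returns; `to_float_sensors` and the toy values are illustrative:

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def to_float_sensors(df):
    """Return a copy of df with every 'sensor_' column cast to float."""
    temp = df.copy()
    sensor_cols = [col for col in temp.columns if 'sensor_' in col]
    temp[sensor_cols] = temp[sensor_cols].astype(float)
    return temp

# Integer-typed toy input
toy = pd.DataFrame({'time_in_cycles': [1, 2], 'sensor_4': [10, 11]})

# FunctionTransformer is stateless here, so fit_transform just applies the function
converted = FunctionTransformer(to_float_sensors).fit_transform(toy)
```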

### Instantiate transformations in variables (feature engineering)

The only transformation we applied to the selected variables was standard scaling (rescaling), applied to all of them.

In [13]:
ss = StandardScaler()

### Create the preprocessing pipeline

#### Create the column transformer

In [14]:
ct = make_column_transformer(
    (ss, selected_variables),
    remainder='drop')

#### Create the pipeline

In [15]:
pipe_prepro = make_pipeline(do_data_quality, 
                            ct)
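
To see what a preprocessing pipeline of this shape produces, it can be fitted on a toy frame: the output is a NumPy array with one standardized column per selected variable, and unselected columns are dropped. A sketch with made-up values; `selected`, `toy`, `toy_ct`, and `toy_prepro` are illustrative names, and an identity `FunctionTransformer()` stands in for the data quality step:

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

selected = ['time_in_cycles', 'sensor_4']
toy = pd.DataFrame({
    'time_in_cycles': [1, 2, 3, 4],
    'sensor_4': [10.0, 12.0, 14.0, 16.0],
    'op_setting_1': [0.1, 0.2, 0.3, 0.4],  # not selected, so it is dropped
})

# Scale only the selected columns and drop everything else
toy_ct = make_column_transformer((StandardScaler(), selected), remainder='drop')

# Identity transformer first, then the column transformer
toy_prepro = make_pipeline(FunctionTransformer(), toy_ct)

scaled = toy_prepro.fit_transform(toy)  # array of shape (n_rows, len(selected))
```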

### Instantiate the model

#### Instantiate the algorithm

In [16]:
model = HistGradientBoostingRegressor(l2_regularization=0.5,
                                      learning_rate=0.025,
                                      max_depth=10, max_iter=200,
                                      min_samples_leaf=500,
                                      scoring='neg_mean_absolute_percentage_error')

#### Build the final training pipeline (not yet trained)

In [17]:
pipe_training = make_pipeline(pipe_prepro, model)

#### Save the final training pipeline (not yet trained)

In [19]:
pipe_training_name = 'pipe_training.pickle'

PIPE_TRAINING_PATH = PROJECT_PATH + '/04_Models/' + pipe_training_name

with open(PIPE_TRAINING_PATH, mode='wb') as file:
   cloudpickle.dump(pipe_training, file)

#### Train the execution pipeline

In [20]:
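# fit() trains in place and returns the same object, so both names now refer to the fitted pipeline
pipe_execution = pipe_training.fit(X, y)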

## SAVE THE PIPELINE

### Save the execution pipeline (already trained)

In [23]:
pipe_execution_name = 'pipe_execution.pickle'

PIPE_EXECUTION_PATH = PROJECT_PATH + '/04_Models/' + pipe_execution_name

with open(PIPE_EXECUTION_PATH, mode='wb') as file:
   cloudpickle.dump(pipe_execution, file)
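
Loading is the mirror image of saving. A minimal sketch of the round trip, using a temporary directory and a small stand-in pipeline (a `LinearRegression` on made-up data, not the project's model) so that it does not touch the project folders; in the project, the path would be `PIPE_EXECUTION_PATH`:

```python
import os
import tempfile

import cloudpickle
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in for the fitted execution pipeline
X_toy = pd.DataFrame({'sensor_4': [1.0, 2.0, 3.0, 4.0]})
y_toy = pd.Series([30.0, 20.0, 10.0, 0.0])
fitted = make_pipeline(StandardScaler(), LinearRegression()).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = os.path.join(tmp_dir, 'pipe_execution.pickle')

    # Save the fitted pipeline
    with open(tmp_path, mode='wb') as file:
        cloudpickle.dump(fitted, file)

    # Load it back, as an execution script would
    with open(tmp_path, mode='rb') as file:
        reloaded = cloudpickle.load(file)

# The reloaded pipeline should reproduce the original predictions
same_predictions = np.allclose(reloaded.predict(X_toy), fitted.predict(X_toy))
```

The loading environment needs compatible versions of scikit-learn and cloudpickle, since pickled estimators are not guaranteed to load across versions.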

## Recap

We have saved:

- **pipe_training**: the untrained pipeline, in case we want to retrain it in the future.
- **pipe_execution**: the trained pipeline (already fitted), which we will later use to make predictions.

From this point on, we will generate two scripts:

- **Retraining**: 

    - Models gradually lose predictive power over time (data drifts, the market evolves...)
    - There is no fixed rule for how often to retrain (e.g., insurance or energy companies may retrain every 5 years, while in digital advertising it can be every few seconds). 
    - Typically, retraining is triggered when predictive performance drops by 5–10%. 
    - We’ll keep this code ready, although the one we’ll actually use is the execution script. 
    

- **Execution**: An engineer will deploy this script in a production environment (to run in batch mode, or via API, or as part of an app...).
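
The retraining trigger mentioned above can be expressed as a relative comparison against the error measured at training time. A minimal sketch, assuming the monitored metric is an error where lower is better (such as MAE) and a 10% tolerance; `needs_retraining` and its arguments are hypothetical names, not part of the project:

```python
def needs_retraining(baseline_error, current_error, tolerance=0.10):
    """Return True when the error has grown by more than `tolerance` (relative)."""
    if baseline_error <= 0:
        raise ValueError('baseline_error must be positive')
    relative_increase = (current_error - baseline_error) / baseline_error
    return relative_increase > tolerance

check_small = needs_retraining(20.0, 21.0)  # error grew by 5%
check_large = needs_retraining(20.0, 23.0)  # error grew by 15%
```

The right tolerance and metric depend on the application, so both are left as parameters here.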