# RETRAINING CODE

- Models gradually lose predictive power over time (the data drifts, the market evolves, and so on).
- There is no fixed rule for how often to retrain: insurance or energy companies may retrain every 5 years, while in digital advertising it can be every few seconds.
- As a rule of thumb, retraining is triggered when predictive performance drops by roughly 5–10%.
- We keep this retraining code ready, although the script we will actually run day to day is the execution script.
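
The 5–10% rule of thumb above can be sketched as a simple check. This is only an illustration, not part of the project code: the function names, the choice of an error metric such as MAE (where lower is better, so a "drop in performance" means the error grows), and the numbers are all assumptions.

```python
def relative_error_increase(baseline_error, current_error):
    """Relative growth of an error metric (e.g. MAE) versus its baseline value."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (current_error - baseline_error) / baseline_error


def should_retrain(baseline_error, current_error, threshold=0.05):
    """True when the error has grown by at least `threshold` (0.05 means 5%)."""
    return relative_error_increase(baseline_error, current_error) >= threshold


# Illustrative numbers only (assumed, not measured on this project).
baseline_mae = 20.0  # error measured when the model was deployed
current_mae = 21.5   # error measured on recent data

print(should_retrain(baseline_mae, current_mae, threshold=0.05))  # True
print(should_retrain(baseline_mae, current_mae, threshold=0.10))  # False
```

The threshold is a judgment call: a lower value retrains more often at a higher compute cost, while a higher value tolerates more degradation before reacting.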

IMPORTANT: This code must be executed in the exact same environment in which it was originally created.

The environment can be recreated on a new machine from the *riesgos.yml* file generated during the project setup.

Copy the file *riesgos.yml* to your working directory and run this command in the terminal (or Anaconda Prompt): 

*conda env create --file riesgos.yml --name riesgos*
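
Once the environment is created and activated, a quick sanity check is to compare the installed package versions against the ones pinned in *riesgos.yml*. The sketch below uses only the standard library; the helper names and the pinned versions shown are hypothetical placeholders, so take the real pins from your own *riesgos.yml*.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: installed version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def find_mismatches(expected, installed):
    """Return {package: (expected, installed)} for every pin that differs."""
    return {
        name: (want, installed.get(name))
        for name, want in expected.items()
        if installed.get(name) != want
    }


# Hypothetical pins, for illustration only; copy the real ones from riesgos.yml.
expected_pins = {"scikit-learn": "1.3.0", "pandas": "2.0.3"}

mismatches = find_mismatches(expected_pins, installed_versions(expected_pins))
for name, (want, have) in mismatches.items():
    print(f"{name}: expected {want}, found {have}")
```

If nothing is printed, the checked packages match the pins. Any mismatch is a hint that pickled pipelines may not load or behave as they did originally.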

In [1]:
# --- Import libraries ---
import numpy as np
import pandas as pd
import cloudpickle 
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import HistGradientBoostingRegressor


# --- Import data ---
PROJECT_PATH = '/Users/rober/cmapss-rul-prediction'
dataset_id = 'FD001' # can be switched between FD001, FD002, FD003, FD004
path = PROJECT_PATH + '/02_Data/01_Raw/'
train_path = path + f'train_{dataset_id}.txt'
# sep=r'\s+' splits on any run of whitespace (delim_whitespace is deprecated since pandas 2.2)
df = pd.read_csv(train_path, sep=r'\s+', header=None)


# --- Rename columns ---
n_cols = df.shape[1] 
n_sensors = n_cols - 5 
columns = (
    ['unit_number', 'time_in_cycles'] +
    [f'op_setting_{i}' for i in range(1, 4)] +
    [f'sensor_{i}' for i in range(1, n_sensors + 1)]
)
df.columns = columns


# --- Create RUL variable ---
rul_per_unit = df.groupby('unit_number')['time_in_cycles'].max().reset_index()
rul_per_unit.columns = ['unit_number', 'max_cycle']
df = df.merge(rul_per_unit, on='unit_number', how='left')
df['RUL'] = df['max_cycle'] - df['time_in_cycles']
df.drop(columns=['max_cycle'], inplace=True)


# --- Declare X and y ---
selected_variables = ['time_in_cycles',
                     'sensor_11',
                     'sensor_4',
                     'sensor_12',
                     'sensor_7',
                     'sensor_15',
                     'sensor_21',
                     'sensor_20']
X = df[selected_variables].copy()
target = 'RUL'
y = df[target].copy()


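# --- Create training and execution pipelines ---
# NOTE: pipe_training is used below but was never defined in this script.
# The definition here is an ASSUMPTION inferred from the imports above
# (scaling + gradient boosting with default hyperparameters); replace it
# with the pipeline and hyperparameters selected during modeling.
pipe_training = make_pipeline(StandardScaler(), HistGradientBoostingRegressor())

pipe_training_name = 'pipe_training.pickle'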

PIPE_TRAINING_PATH = PROJECT_PATH + '/04_Models/' + pipe_training_name

with open(PIPE_TRAINING_PATH, mode='wb') as file:
   cloudpickle.dump(pipe_training, file)

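# fit() trains the pipeline in place and returns the same object, so pipe_execution
# is the fitted version of pipe_training (the unfitted copy was already saved above)
pipe_execution = pipe_training.fit(X, y)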

pipe_execution_name = 'pipe_execution.pickle'

PIPE_EXECUTION_PATH = PROJECT_PATH + '/04_Models/' + pipe_execution_name

with open(PIPE_EXECUTION_PATH, mode='wb') as file:
   cloudpickle.dump(pipe_execution, file)