# üè≠ **Overview**

# üìö **Library and Configuration**

This section initializes the environment and loads the essential libraries used throughout the preprocessing workflow. It first extends sys.path to allow importing project modules, then suppresses warnings to keep the notebook clean. Core tools such as `numpy`, `pandas`, `Pathlib`, and the `KMeans` clustering algorithm are imported, along with the custom preprocessing utilities from src.preprocessing. A small reload helper is defined to streamline development when updating the module. The notebook then sets standardized paths to the processed train and test files, including the engineered versions that will be created later. A final message confirms that all libraries and configuration steps¬†are¬†ready.

In [None]:
# System & Environment Configuration
import sys
import importlib
sys.path.append("..")

# Ignore warning
from warnings import filterwarnings
filterwarnings("ignore")

# Core Library
import numpy as np
import pandas as pd
from pathlib import Path

# Engineering library
from sklearn.cluster import KMeans
from statsmodels.tsa.seasonal import STL
from sklearn.linear_model import LinearRegression
from sklearn.impute import KNNImputer

# Source helper
import src.preprocessing as preprocessing

# Reload shortcut
def r(module=preprocessing):
    importlib.reload(module)

r()

# Train and Test Processed
PROCESSED_ROOT = Path('../data/processed/')

TRAIN_PATH_PROCESSED = PROCESSED_ROOT/'train.csv'
TEST_PATH_PROCESSED = PROCESSED_ROOT/'test.csv'

TRAIN_PATH_ENGINEERED = PROCESSED_ROOT/'train_engineered.csv'
TEST_PATH_ENGINEERED = PROCESSED_ROOT/'test_engineered.csv'

print('library and configuration ready!')

library and configuration ready!


# üóÉÔ∏è **Train and Test Loading**

In [259]:
train = pd.read_csv(TRAIN_PATH_PROCESSED)
test = pd.read_csv(TEST_PATH_PROCESSED)

print('Train shape :', train.shape)
print('Test Shape  :', test.shape)

Train shape : (18942, 39)
Test Shape  : (1077, 39)


In [260]:
# To preserve original dataset, we copy just to give some demonstration how these feature engineering works
train_lore = train.copy()
test_lore = test.copy()

# ‚å®Ô∏è **Imputer**

Before applying any feature engineering, the dataset shows several missing values across key raw variables, including `GHI`, `DHI`, `DNI`, the clearsky radiation components, and the `Cloud Type` field. These measurements form the foundation for all subsequent analysis, meaning we must restore their completeness before constructing any derived features or transformations. Missing values in this stage do not yet break engineered features‚Äîbut they do disrupt the underlying physical continuity of the solar and weather signals, which would lead to unreliable or biased feature construction later on.

By identifying these gaps up front, the imputation stage ensures that the feature-engineering pipeline receives clean, consistent, and physically coherent inputs. This early correction prevents NaNs from propagating into downstream steps such as dynamic lag calculations, weather clustering, or any astronomy- or physics-based transformations that rely on uninterrupted numeric sequences.

In [130]:
missing_df = pd.DataFrame({
    'train_missing': train_lore.isna().sum(),
    'test_missing':  test_lore.isna().sum()
})

missing_df = missing_df[(missing_df['train_missing'] > 0) | (missing_df['test_missing'] > 0)]

print('Train test initial missing values:\n\n', missing_df, sep='')

Train test initial missing values:

              train_missing  test_missing
% Baseline                0          1077
moonrise                629            34
moonset                 639            33
DHI                    1044           167
DNI                    1044           167
GHI                    1044           167
Clearsky DHI           1044           167
Clearsky DNI           1044           167
Clearsky GHI           1044           167
Cloud Type              977           112


# **Moon Imputer**

he `moonrise` and `moonset` columns originally contained only time-of-day strings (e.g., `"05:12 AM"`) or placeholders like `"No moonset"`, but during feature engineering we converted them into full datetime values by attaching the corresponding year, month, and day. Entries labeled `"No moonrise"` or `"No moonset"` cannot be parsed into valid timestamps, causing them to become `NaT` and introducing unwanted missing values into the dataset. To prevent this parsing failure, we explicitly replace missing entries with the same placeholder strings before datetime conversion, ensuring consistency and avoiding unintended `NaT` values during analysis.

In [131]:
train_lore['moonrise'] = train_lore['moonrise'].fillna("No moonrise")
train_lore['moonset']  = train_lore['moonset'].fillna("No moonset")

test_lore['moonrise'] = test_lore['moonrise'].fillna("No moonrise")
test_lore['moonset']  = test_lore['moonset'].fillna("No moonset")

## **Solar Imputer**

Since `GHI`, `DHI`, and `DNI` are continuous solar-irradiance measurements that follow smooth temporal patterns, we impute their missing values using a time-aware strategy. After converting `Timestamp` into a proper datetime index, we apply `interpolate(method='time')` to estimate missing points based on their chronological neighbors, which aligns with the physical behavior of irradiance throughout the day. Any remaining gaps‚Äîsuch as those at the beginning or end of the series‚Äîare then filled using forward and backward fill (`ffill` and `bfill`). This approach preserves the natural temporal structure of the data while ensuring that downstream features relying on these variables remain consistent and free of NaN values.

In [132]:
train_lore['Timestamp'] = pd.to_datetime(train_lore['Timestamp'], errors='coerce')
train_lore = train_lore.set_index('Timestamp')

test_lore['Timestamp'] = pd.to_datetime(test_lore['Timestamp'], errors='coerce')
test_lore = test_lore.set_index('Timestamp')

train_lore[['GHI', 'DHI', 'DNI']] = (
    train_lore[['GHI', 'DHI', 'DNI']].interpolate(method='time')
)
train_lore[['GHI', 'DHI', 'DNI']] = train_lore[['GHI','DHI','DNI']].ffill().bfill()

test_lore[['GHI', 'DHI', 'DNI']] = (
    test_lore[['GHI', 'DHI', 'DNI']].interpolate(method='time')
)
test_lore[['GHI', 'DHI', 'DNI']] = test_lore[['GHI','DHI','DNI']].ffill().bfill()

## **Clearsky Imputer**

For the clearsky radiation components (`Clearsky DHI`, `Clearsky DNI`, and `Clearsky GHI`), we use a `KNNImputer` because these variables exhibit strong correlations with each other and behave in smooth, physically consistent patterns. Unlike simple mean or median replacement, KNN considers the relationships between multiple features and imputes missing values based on the nearest valid observations in the feature space. This allows the imputed values to better reflect realistic atmospheric conditions instead of producing overly generic or biased estimates. By using distance-weighted neighbors, the imputer assigns more influence to samples that are physically similar, resulting in smoother and more reliable clearsky estimates for downstream processing.

In [133]:
knn = KNNImputer(n_neighbors=3, weights='distance')
solar_cols = ['Clearsky DHI', 'Clearsky DNI', 'Clearsky GHI']

train_lore[solar_cols] = knn.fit_transform(train_lore[solar_cols])
test_lore[solar_cols] = knn.fit_transform(test_lore[solar_cols])

After imputing the solar and clearsky components, we recompute `clearsky_index` and `diffuse_fraction` to ensure these derived ratios remain physically consistent. These features are simple yet informative indicators of atmospheric clarity and radiation scattering, and recalculating them after imputation guarantees that they accurately reflect the corrected underlying measurements.

In [134]:
eps = 1e-6
train_lore['clearsky_index'] = train_lore['GHI'] / (train_lore['Clearsky GHI'] + eps)
train_lore['diffuse_fraction'] = train_lore['DHI'] / (train_lore['GHI'] + eps)

test_lore['clearsky_index'] = test_lore['GHI'] / (test_lore['Clearsky GHI'] + eps)
test_lore['diffuse_fraction'] = test_lore['DHI'] / (test_lore['GHI'] + eps)

Missing values in `Cloud Type` are filled with `1` to represent the ‚ÄúUnknown‚Äù category, ensuring all observations retain a valid and consistent cloud classification.


In [152]:
train_lore['Cloud Type'] = train_lore['Cloud Type'].fillna(1)
test_lore['Cloud Type'] = test_lore['Cloud Type'].fillna(1)

In [147]:
train_lore.reset_index(inplace=True)
test_lore.reset_index(inplace=True)

In [153]:
missing_df = pd.DataFrame({
    'train_missing': train_lore.isna().sum(),
    'test_missing':  test_lore.isna().sum()
})

missing_df = missing_df[(missing_df['train_missing'] > 0) | (missing_df['test_missing'] > 0)]

print('Train test initial missing values:\n\n', missing_df, sep='')

Train test initial missing values:

            train_missing  test_missing
% Baseline              0          1077


With all missing values now resolved and the raw inputs restored to a consistent state, the dataset is ready for the next stage. Since `% Baseline` is the only remaining missing field in the test set and represents our target variable, we can proceed to apply the full feature engineering pipeline on the cleaned train and test data.

# ‚öôÔ∏è **Feature Engineering**

## **Cloud Type Mapping**

From exploration before, we found that our only categorical features is ordinal and has an order. Sorted by Q3, we found that `Cloud Type` fog is producing more energy than `Probably Clear` which seems like an anomaly in terms of real life. But we'll make a data driven decision, so lets map `Cloud Type` with this order:

<p float="left">
  <center>
  <img src="../reports/figures/astronomical/cloudtype_impact.png" width="60%">
</p>

In [138]:
cloud_map = {
    'Unknown': 1, 'Opaque Ice': 2, 'Overlapping': 3,
    'Super-Cooled Water': 4, 'Cirrus': 5, 'Fog': 6,
    'Water': 7, 'Overshooting': 8,
    'Probably Clear': 9, 'Clear': 10
}

train_lore['Cloud Type'] = train_lore['Cloud Type'].map(cloud_map)
test_lore['Cloud Type'] = test_lore['Cloud Type'].map(cloud_map)

## **Time Features**

From earlier exploration, we saw that the raw `Timestamp` column contains repeating daily and monthly cycles that models cannot naturally learn from linear time values. To preserve these cyclical patterns, we convert hour and month components into `hour_sin`, `hour_cos`, `month_sin`, and `month_cos`, allowing the model to recognize periodic relationships such as midnight being close to early morning. We also shift time by ‚àí5 hours to better align the solar peak with true solar noon.

<p style="text-align: center;">
  <img src="..\reports\figures\astronomical\diurnal_cycle.png" width="60%">
</p>

Additionally, we extract `doy` (day-of-year) to capture broader seasonal trends that influence solar irradiance throughout the year. These engineered time features provide a more physically meaningful representation of temporal behavior, enabling the model to interpret solar patterns more effectively than using the raw timestamp alone.

In [139]:
def add_time_features(df):
    df = df.copy()

    # Standardize timestamp
    df['Timestamp'] = pd.to_datetime(df['Timestamp'], errors='coerce')
    df = df.sort_values('Timestamp')

    # Shift time -5h (solar noon correction)
    solar_time = df['Timestamp'] - pd.Timedelta(hours=5)
    solar_hour = solar_time.dt.hour

    # Cyclical hour
    df['hour_sin'] = np.sin(2 * np.pi * solar_hour / 24)
    df['hour_cos'] = np.cos(2 * np.pi * solar_hour / 24)

    # Cyclical month
    df['month_sin'] = np.sin(2 * np.pi * df['Timestamp'].dt.month / 12)
    df['month_cos'] = np.cos(2 * np.pi * df['Timestamp'].dt.month / 12)

    # Day-of-year (for astronomy)
    df['doy'] = solar_time.dt.dayofyear

    return df

train_time = add_time_features(train_lore)
test_time = add_time_features(test_lore)

print('Train shape after time features :', train_time.shape)
print('Test shape after time features  :', test_time.shape)

Train shape after time features : (18942, 46)
Test shape after time features  : (1077, 46)


## **Astronomical Features**

Building on the `doy` feature, we introduce astronomy-based variables that reflect the physical behavior of the Sun throughout the year. Using day-of-year, we compute `solar_declination`, which represents the Sun‚Äôs angle relative to the equatorial plane and directly affects the intensity of solar radiation received at the surface. We also add the `equation_of_time`, a small correction that captures irregularities in solar time caused by Earth‚Äôs orbital mechanics.

To account for changes in Earth‚ÄìSun distance, we compute `sun_earth_distance_factor` and derive `extraterrestrial_radiation` as a theoretical maximum solar input. These astronomy-driven features ground the model in physical reality, giving it access to seasonal solar dynamics that raw weather data cannot fully represent.

In [140]:
def add_astronomy_features(df):
    df = df.copy()
    doy = df['doy']

    # Solar declination
    delta = 23.45 * np.sin(np.radians(360 * (284 + doy) / 365))
    df['solar_declination'] = delta

    # Equation of Time
    B = np.radians((doy - 81) * 360 / 365)
    df['equation_of_time'] = (
        9.87 * np.sin(2*B) - 7.53 * np.cos(B) - 1.5 * np.sin(B)
    )

    # Earth-Sun distance
    df['sun_earth_distance_factor'] = 1 + 0.033 * np.cos(np.radians(360 * doy / 365))
    df['extraterrestrial_radiation'] = 1367 * df['sun_earth_distance_factor']

    return df

train_astro = add_astronomy_features(train_time)
test_astro = add_astronomy_features(test_time)

print('Train shape after astronomy features :', train_astro.shape)
print('Test shape after astronomy features  :', test_astro.shape)

Train shape after astronomy features : (18942, 50)
Test shape after astronomy features  : (1077, 50)


## **Sun Features**

To incorporate sun-position context, we extract two simple yet meaningful features from the raw `sunrise` and `sunset` columns. First, we compute `sunHour`, representing the total daylight duration for each day. Then we derive `is_daytime`, a binary flag indicating whether a given timestamp falls between sunrise and sunset. These features help the model distinguish between naturally dark hours and hours where solar energy is physically possible.

In [141]:
def add_sun_features(df):
    df = df.copy()

    sunrise_dt = pd.to_datetime(df['sunrise'])
    sunset_dt = pd.to_datetime(df['sunset'])

    # length of daylight
    df['sunHour'] = (sunset_dt - sunrise_dt).dt.total_seconds()

    # day/night flag
    curr = df['Timestamp'].dt.time
    rise = sunrise_dt.dt.time
    set_  = sunset_dt.dt.time

    df['is_daytime'] = [
        1 if (r <= c <= s) else 0
        for c, r, s in zip(curr, rise, set_)
    ]

    return df

train_sun = add_sun_features(train_astro)
test_sun = add_sun_features(test_astro)

print('Train shape after sun features :', train_sun.shape)
print('Test shape after sun features  :', test_sun.shape)

Train shape after sun features : (18942, 51)
Test shape after sun features  : (1077, 51)


## **Physics Features**

To capture physically meaningful relationships in solar production, we derive features grounded in solar and atmospheric physics. The `clearsky_index` measures how actual irradiance compares to an ideal clear-sky condition, while `diffuse_fraction` reflects the portion of radiation scattered by clouds or aerosols. We also compute `wind_cooling_potential` to approximate how wind and temperature together influence panel efficiency. These derived ratios provide the model with clearer signals than raw GHI or weather values alone.

In [142]:
def add_physics_features(df):
    df = df.copy()
    eps = 1e-6

    df['clearsky_index'] = df['GHI'] / (df['Clearsky GHI'] + eps)
    df['diffuse_fraction'] = df['DHI'] / (df['GHI'] + eps)

    # wind cooling using Kelvin
    df['wind_cooling_potential'] = df['windspeedKmph'] / (df['tempC'] + 273.15)

    return df

train_physics = add_physics_features(train_sun)
test_physics = add_physics_features(test_sun)

print('Train shape after physics features :', train_physics.shape)
print('Test shape after physics features  :', test_physics.shape)

Train shape after physics features : (18942, 52)
Test shape after physics features  : (1077, 52)


## **Rolling Features**

To capture short-term temporal behavior in solar conditions, we introduce time-aware dynamic features based on lagged and rolling information. Using each row‚Äôs timestamp, we construct a 1-hour lookup to retrieve the previous values of `GHI` and `cloudcover`, producing `GHI_lag1` and `cloudcover_lag1`. Unlike a simple `.shift()`, this merge-based approach ensures correctness even when timestamps are irregular or contain gaps.

We also compute a `GHI_rolling_mean_3h`, which summarizes irradiance over the previous three hours and smooths out rapid fluctuations caused by changing cloud patterns. Together, these dynamic features allow the model to track short-term trends and transitions, something that cannot be captured from static weather snapshots alone.

In [143]:
def add_time_dynamic_features(df):
    df = df.copy()

    # lag 1h reference timestamp
    df['target_time_1h'] = df['Timestamp'] - pd.Timedelta(hours=1)

    # lookup table
    lookup = df[['Timestamp', 'GHI', 'cloudcover']].copy()
    lookup.columns = ['ts_ref', 'GHI_lag1', 'cloudcover_lag1']

    df = df.merge(lookup, left_on='target_time_1h',
                  right_on='ts_ref', how='left')

    # rolling 3-hour mean
    idx = df.set_index('Timestamp')
    df['GHI_rolling_mean_3h'] = (
        idx['GHI'].rolling('3h', min_periods=1).mean().values
    )

    # cleanup + fill
    df.drop(columns=['target_time_1h', 'ts_ref'], inplace=True)
    df[['GHI_lag1','cloudcover_lag1','GHI_rolling_mean_3h']] = \
        df[['GHI_lag1','cloudcover_lag1','GHI_rolling_mean_3h']].fillna(0)

    return df

train_roll = add_time_dynamic_features(train_physics)
test_roll = add_time_dynamic_features(test_physics)

print('Train shape after rolling features :', train_roll.shape)
print('Test shape after rolling features  :', test_roll.shape)

Train shape after rolling features : (18942, 55)
Test shape after rolling features  : (1077, 55)


## **Gradient Features**

To capture rapid changes in solar and weather conditions, we introduce gradient-based features that quantify how key variables evolve over time. `GHI_diff_1h`, `cloudcover_diff_1h`, and `clearsky_index_diff` measure first-order changes from one hour to the next, while `GHI_acceleration` represents the second derivative, highlighting sudden shifts in irradiance. These temporal gradients help the model detect transitions‚Äîsuch as clouds moving in or out‚Äîthat strongly influence short-term solar output.

In [144]:
def add_gradient_features(df):
    df = df.copy()

    df['GHI_diff_1h'] = df['GHI'] - df['GHI_lag1']
    df['cloudcover_diff_1h'] = df['cloudcover'] - df['cloudcover_lag1']
    df['clearsky_index_diff'] = df['clearsky_index'].diff().fillna(0)

    # 2nd derivative (acceleration)
    df['GHI_acceleration'] = df['GHI_diff_1h'].diff().fillna(0)

    return df

train_grad = add_gradient_features(train_roll)
test_grad = add_gradient_features(test_roll)

print('Train shape after gradient features :', train_grad.shape)
print('Test shape after gradient features  :', test_grad.shape)

Train shape after gradient features : (18942, 59)
Test shape after gradient features  : (1077, 59)


## **Weather Clustering**

To capture broader weather regimes rather than relying solely on raw meteorological values, we cluster historical conditions using KMeans. The clustering is based on short-term irradiance dynamics (`GHI_rolling_mean_3h`), atmospheric properties (`humidity`, `diffuse_fraction`), and cloud-related signals (`cloudcover_lag1`, `clearsky_index`, `windspeedKmph`). The resulting `weather_cluster` assigns each timestamp to a learned weather state, helping the model recognize patterns such as clear-sky conditions, scattered clouds, or rapidly changing environments.

To avoid data leakage, the clustering model is fit **only on the training set**, ensuring that the structure of weather regimes comes entirely from past data. Once fitted, this model is applied to both training and test sets using identical feature columns and transformations. This guarantees that the test data is assigned to pre-learned clusters without influencing how those clusters were originally formed.

In [145]:
def fit_weather_clusters(train_df, n_clusters=4):
    features = [
        'cloudcover_lag1',
        'humidity',
        'windspeedKmph',
        'GHI_rolling_mean_3h',
        'clearsky_index',
        'diffuse_fraction'
    ]

    model = KMeans(n_clusters=n_clusters, random_state=42)
    model.fit(train_df[features].fillna(0))

    return model, features

def apply_weather_clusters(df, model, features):
    df = df.copy()
    df['weather_cluster'] = model.predict(df[features].fillna(0))
    return df


cluster_model, cluster_feats = fit_weather_clusters(train_grad)

train_cluster = apply_weather_clusters(train_grad, cluster_model, cluster_feats)
test_cluster  = apply_weather_clusters(test_grad, cluster_model, cluster_feats)

print('Train shape after weather clustering :', train_cluster.shape)
print('Test shape after weather clustering  :', test_cluster.shape)

Train shape after weather clustering : (18942, 60)
Test shape after weather clustering  : (1077, 60)


## **STL Decomposition Features**

To isolate different temporal components of solar irradiance, we apply STL decomposition to the training set‚Äôs `GHI` values. This produces three signals such as `trend`, `seasonal`, and `residual` which reflect long-term behavior, daily cyclical patterns, and short-term irregularities, respectively. Because STL itself cannot be directly applied to unseen data, we fit simple linear models on each extracted component to approximate how these signals evolve over time.

To prevent leakage, the STL decomposition and the linear models are fit exclusively on the training data. The fitted models are then used to generate `GHI_trend`, `GHI_seasonal`, and `GHI_residual` for both the training and test sets using only their timestamps. This ensures the test set never influences how the decomposition was learned while still benefiting from the same temporal structure.

In [146]:
def fit_stl_decomposition(train_df, period=24):
    # Fit STL on train
    stl = STL(train_df['GHI'].fillna(0), period=period).fit()

    train_trend = stl.trend
    train_seasonal = stl.seasonal
    train_resid = stl.resid

    # Fit linear models to approximate patterns for test set
    t = np.arange(len(train_df)).reshape(-1, 1)

    trend_model = LinearRegression().fit(t, train_trend)
    seasonal_model = LinearRegression().fit(t, train_seasonal)
    resid_model = LinearRegression().fit(t, train_resid)

    return trend_model, seasonal_model, resid_model

def apply_stl_features(df, trend_model, seasonal_model, resid_model):
    df = df.copy()
    t = np.arange(len(df)).reshape(-1, 1)

    df['GHI_trend'] = trend_model.predict(t)
    df['GHI_seasonal'] = seasonal_model.predict(t)
    df['GHI_residual'] = resid_model.predict(t)

    return df

trend_m, season_m, resid_m = fit_stl_decomposition(train_cluster)

train_stl = apply_stl_features(train_cluster, trend_m, season_m, resid_m)
test_stl  = apply_stl_features(test_cluster, trend_m, season_m, resid_m)

print('Train shape after STL features :', train_stl.shape)
print('Test shape after STL features  :', test_stl.shape)

Train shape after STL features : (18942, 63)
Test shape after STL features  : (1077, 63)


## **Drop Unnecessary Features**

Before modeling, we remove all remaining datetime-related columns (`Timestamp`, `sunrise`, `sunset`, `moonrise`, `moonset`) since they cannot be used directly by tree-based models. These features have already been transformed into numerical representations during feature engineering, so dropping the raw columns ensures the final dataset contains only numeric, model-ready inputs.

In [313]:
train_final = train_stl.drop(columns=['Timestamp', 'moonrise', 'moonset', 'sunrise', 'sunset'])
test_final = test_stl.drop(columns=['Timestamp', 'moonrise', 'moonset', 'sunrise', 'sunset'])

print('Train final shape :', train_final.shape)
print('Test final shape  :', test_final.shape)

Train final shape : (18942, 58)
Test final shape  : (1077, 58)


# üõ†Ô∏è **Pipelines**

`SolarImputer` cleans and imputes the raw data (fit on train, applied to both train and test), while `FeatureEngineering` learns patterns such as clusters and STL components from the training set and transforms both datasets consistently. This produces final engineered features ready for modeling.

In [306]:
imputer  = preprocessing.SolarImputer()
fe       = preprocessing.FeatureEngineering(n_clusters=4, stl_period=24)

# Fit only on train
train_clean  = imputer.fit_transform(train)
test_clean   = imputer.transform(test)

fe.fit(train_clean)
train_engineered = fe.transform(train_clean)
test_engineered  = fe.transform(test_clean)

print('Train engineered shape :', train_engineered.shape)
print('Test engineered shape  :', test_engineered.shape)

Train engineered shape : (18942, 58)
Test engineered shape  : (1077, 58)


In [307]:
missing_df = pd.DataFrame({
    'train_missing': train_engineered.isna().sum(),
    'test_missing':  train_engineered.isna().sum()
})

missing_df = missing_df[(missing_df['train_missing'] > 0) | (missing_df['test_missing'] > 0)]

print('Train test initial missing values:\n\n', missing_df, sep='')

Train test initial missing values:

Empty DataFrame
Columns: [train_missing, test_missing]
Index: []


We export the engineered train and test datasets so we don‚Äôt need to rerun the entire preprocessing pipeline (imputation + feature engineering) every time we experiment with models. Saving these files makes the workflow faster, ensures reproducibility, and allows us to reuse the exact same processed features for training and validation.

In [308]:
train_engineered.to_csv(TRAIN_PATH_ENGINEERED, index=False)
test_engineered.to_csv(TEST_PATH_ENGINEERED, index=False)

print(f'Train and Test engineered saved to {TRAIN_PATH_ENGINEERED} and {TEST_PATH_ENGINEERED}')

Train and Test engineered saved to ..\data\processed\train_engineered.csv and ..\data\processed\test_engineered.csv
