## Baseline Feature Engineering Pipeline

This notebook builds the baseline feature engineering pipeline.
All feature-selection steps and imputation statistics are fitted on the
training data only and applied unchanged to the test set.
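
The train-only rule can be illustrated with a minimal sketch on toy data. The frames `toy_train` and `toy_test` are made up for illustration and are not part of the Tg dataset. The imputer learns its median from the training frame, and that same value fills gaps in the test frame regardless of what the test values look like.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data for illustration only (hypothetical, not from the Tg dataset).
toy_train = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0]})
toy_test = pd.DataFrame({"a": [np.nan, 100.0]})

toy_imputer = SimpleImputer(strategy="median")
toy_imputer.fit(toy_train)                     # median learned from train only
filled_test = toy_imputer.transform(toy_test)  # test is transformed, never fitted

print(toy_imputer.statistics_)  # median of the non-missing training values
print(filled_test.ravel())      # missing test value filled with the train median
```

Calling `fit` or `fit_transform` on the test frame instead would let test-set statistics leak into preprocessing, which is what this notebook avoids.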

In [3]:
import pandas as pd
import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.feature_selection import VarianceThreshold

NAN_THRESHOLD = 0.8
VAR_THRESHOLD = 1e-5
CORR_THRESHOLD = 0.95

In [4]:
train_df = pd.read_csv("../data/processed/tg_rdkit/train_raw.csv")
test_df = pd.read_csv("../data/processed/tg_rdkit/test_raw.csv")

train_df.shape, test_df.shape

((5827, 221), (1457, 221))

In [5]:
x_train = train_df.drop(columns=["Tg"])
y_train = train_df["Tg"]

x_test = test_df.drop(columns=["Tg"])
y_test = test_df["Tg"]

x_train.shape, x_test.shape

((5827, 220), (1457, 220))

Drop features whose NaN ratio in the training set exceeds `NAN_THRESHOLD`

In [6]:
nan_ratio = x_train.isna().mean()
drop_nan_columns = nan_ratio[nan_ratio > NAN_THRESHOLD].index

x_train1 = x_train.drop(columns=drop_nan_columns)
x_test1 = x_test.drop(columns=drop_nan_columns)

len(drop_nan_columns), x_train1.shape, x_test1.shape

(12, (5827, 208), (1457, 208))

Check whether all columns are numeric, and drop those that are not.

In [7]:
x_train1.dtypes.value_counts()

int64      110
float64     95
object       3
Name: count, dtype: int64

In [8]:
numeric_cols = x_train1.select_dtypes(include=[np.number]).columns

x_train1 = x_train1[numeric_cols]
x_test1 = x_test1[numeric_cols]

x_train1.shape, x_test1.shape

((5827, 205), (1457, 205))

Median Imputation

In [9]:
imputer = SimpleImputer(strategy="median")

x_train2 = pd.DataFrame(
    imputer.fit_transform(x_train1), 
    columns=x_train1.columns, 
    index=x_train1.index)
x_test2 = pd.DataFrame(
    imputer.transform(x_test1), 
    columns=x_test1.columns, 
    index=x_test1.index)


Drop near-zero-variance features

In [10]:
var_selector = VarianceThreshold(threshold=VAR_THRESHOLD)
x_train3 = pd.DataFrame(
    var_selector.fit_transform(x_train2), 
    columns=x_train2.columns[var_selector.get_support()], 
    index=x_train2.index)
x_test3 = pd.DataFrame(
    var_selector.transform(x_test2),
    columns=x_test2.columns[var_selector.get_support()],
    index=x_test2.index)
x_train3.shape, x_test3.shape

((5827, 193), (1457, 193))
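
One point worth keeping in mind is that `VarianceThreshold` compares the raw, unscaled variance of each column against the threshold. The sketch below uses made-up toy columns (`small`, `big`) to show that a feature measured on a very small scale can fall below `1e-5` even though it is not constant.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy data for illustration only (hypothetical column names).
toy = pd.DataFrame({
    "small": [0.001, 0.002, 0.003, 0.004],  # varies, but on a tiny scale
    "big": [1.0, 2.0, 3.0, 4.0],
})

toy_selector = VarianceThreshold(threshold=1e-5)
toy_selector.fit(toy)

# Columns whose variance is above the threshold.
kept = toy.columns[toy_selector.get_support()]
print(toy_selector.variances_)
print(list(kept))
```

If small-scale descriptors matter for a given model, checking `variances_` before freezing the threshold is a cheap safeguard.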

Drop highly correlated features

In [11]:
corr_matrix = x_train3.corr().abs()
upper_tri = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [column for column in upper_tri.columns if any(upper_tri[column] > CORR_THRESHOLD)]

x_train_final = x_train3.drop(columns=to_drop)
x_test_final = x_test3.drop(columns=to_drop)
len(to_drop), x_train_final.shape, x_test_final.shape

(44, (5827, 149), (1457, 149))
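
To make the behaviour of the upper-triangle filter concrete, here is a small sketch on toy data (the columns `f1`, `f2`, `f3` are hypothetical). Because only entries above the diagonal are inspected, the later column of each highly correlated pair is the one dropped, so the result depends on column order.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.0, 4.0, 6.0, 8.0, 10.0],  # exact multiple of f1
    "f3": [5.0, 1.0, 4.0, 2.0, 3.0],   # only weakly correlated with f1 and f2
})

corr = toy.corr().abs()
# Keep entries strictly above the diagonal; everything else becomes NaN.
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
# A column is dropped if it correlates strongly with any earlier column.
toy_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]

print(toy_drop)
```

Here `f2` is removed and `f1` is kept. Reordering the columns so that `f2` comes first would keep `f2` and drop `f1` instead.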

In [12]:
x_train_final.to_csv("../data/processed/tg_rdkit/x_train_proc.csv", index=False)
x_test_final.to_csv("../data/processed/tg_rdkit/x_test_proc.csv", index=False)


## Pipeline Freeze Note

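Baseline feature engineering pipeline frozen with:
- NaN threshold: 80% (features with a higher NaN ratio in the training set are dropped)
- Non-numeric columns dropped
- Median imputation (fitted on training data only)
- Variance threshold: 1e-5
- Correlation threshold: 0.95 (absolute correlation, computed on training data)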

This pipeline will be reused unchanged for all baseline models.
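
One possible way to reuse the frozen steps is to wrap them in a fit/apply pair. This is a sketch under assumptions, not the notebook's actual code: `fit_baseline`, `apply_baseline`, and the toy frames are hypothetical names introduced here, while the thresholds and the order of steps follow the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer

NAN_THRESHOLD = 0.8
VAR_THRESHOLD = 1e-5
CORR_THRESHOLD = 0.95


def fit_baseline(x_train):
    """Fit every step on training data only (hypothetical helper)."""
    nan_ratio = x_train.isna().mean()
    kept = x_train.loc[:, nan_ratio <= NAN_THRESHOLD]
    numeric_cols = kept.select_dtypes(include=[np.number]).columns

    imputer = SimpleImputer(strategy="median")
    imputed = pd.DataFrame(
        imputer.fit_transform(kept[numeric_cols]),
        columns=numeric_cols,
        index=kept.index,
    )

    var_selector = VarianceThreshold(threshold=VAR_THRESHOLD)
    var_selector.fit(imputed)
    var_cols = numeric_cols[var_selector.get_support()]

    corr = imputed[var_cols].corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    to_drop = [c for c in upper.columns if (upper[c] > CORR_THRESHOLD).any()]
    final_cols = [c for c in var_cols if c not in to_drop]

    return {
        "numeric_cols": list(numeric_cols),
        "imputer": imputer,
        "final_cols": final_cols,
    }


def apply_baseline(x, state):
    """Apply the fitted state to any frame with the same raw columns."""
    cols = state["numeric_cols"]
    imputed = pd.DataFrame(
        state["imputer"].transform(x[cols]), columns=cols, index=x.index
    )
    return imputed[state["final_cols"]]


# Toy frames for illustration only (not the Tg data).
toy_train = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],       # highly correlated with a
    "c": [5.0, 1.0, 4.0, 2.0, 3.0, 6.0],
    "const": [1.0] * 6,                           # zero variance
    "mostly_nan": [np.nan] * 5 + [1.0],           # NaN ratio above 0.8
    "label": ["x", "y", "x", "y", "x", "y"],      # non-numeric
})
toy_test = pd.DataFrame({
    "a": [np.nan, 3.0],
    "b": [1.0, 2.0],
    "c": [0.0, 1.0],
    "const": [1.0, 1.0],
    "mostly_nan": [np.nan, np.nan],
    "label": ["x", "y"],
})

state = fit_baseline(toy_train)
toy_out = apply_baseline(toy_test, state)
print(state["final_cols"])
print(toy_out)
```

Keeping the fitted state in one object makes it harder to refit a step on test data by accident, and the same `state` can be passed to every baseline model.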
