# Feature Pipeline: Polymer-augmented Features

This notebook applies preprocessing and feature selection to the
polymer-augmented feature matrix: removal of mostly-missing columns, median
imputation, low-variance filtering, correlation-based feature removal, and
clipping of extreme values.

All preprocessing statistics are fitted on the training set only and applied
unchanged to the test set. The resulting feature set is frozen and reused for
all downstream polymer-aware baseline models.
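
The train-only fitting described above can be sketched on toy data as follows. This is a minimal illustration, not the notebook's actual pipeline: the frames `toy_train` and `toy_test` and their values are made up, and only `SimpleImputer` and `VarianceThreshold` are taken from the cells below.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import VarianceThreshold

# Toy data (hypothetical values, not from the polymer dataset)
toy_train = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [5.0, 5.0, 5.0, 5.0],   # constant in train, so the variance filter removes it
})
toy_test = pd.DataFrame({
    "a": [np.nan, 10.0],
    "b": [5.0, 7.0],
})

imputer = SimpleImputer(strategy="median")
selector = VarianceThreshold(threshold=1e-5)

# Statistics (medians, variances) are learned from the training rows only
train_imputed = imputer.fit_transform(toy_train)
selector.fit(train_imputed)

# The test rows go through the already-fitted objects; nothing is refit
test_imputed = imputer.transform(toy_test)
test_selected = selector.transform(test_imputed)

kept = toy_train.columns[selector.get_support()]
toy_test_proc = pd.DataFrame(test_selected, columns=kept, index=toy_test.index)
print(toy_test_proc)
```

The missing test value in `a` is filled with the training median, and `b` is dropped even though it varies in the test rows, because the decision was made from the training data alone.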

In [2]:
import pandas as pd
import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.feature_selection import VarianceThreshold

NAN_THRESHOLD = 0.8
VAR_THRESHOLD = 1e-5
CORR_THRESHOLD = 0.95
CLIP_VALUE = 1e6

In [3]:
meta_cols = ["SMILES", "Tg", "PID", "Polymer Class"]

# feature matrix (feature engineering already completed)
df = pd.read_csv("../data/intermediate/tg_with_rdkit_and_polymer_features.csv")

y = df["Tg"]
x = df.drop(columns=meta_cols)

# frozen split
train_idx = pd.read_csv("../data/processed/train_idx.csv").squeeze()
test_idx  = pd.read_csv("../data/processed/test_idx.csv").squeeze()

x_train = x.loc[train_idx].copy()
x_test  = x.loc[test_idx].copy()

y_train = y.loc[train_idx]
y_test  = y.loc[test_idx]

x_train.shape, x_test.shape

((5827, 227), (1457, 227))

In [4]:
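# fraction of missing values per feature, computed on the training set only
nan_ratio = x_train.isna().mean()
# features missing in more than NAN_THRESHOLD of the training rows
drop_nan_cols = nan_ratio[nan_ratio > NAN_THRESHOLD].index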

x_train = x_train.drop(columns=drop_nan_cols)
x_test  = x_test.drop(columns=drop_nan_cols)

len(drop_nan_cols), x_train.shape

(12, (5827, 215))

In [5]:
# keep numeric columns only; imputation and variance filtering need numeric input
numeric_cols = x_train.select_dtypes(include=[np.number]).columns

x_train = x_train[numeric_cols]
x_test  = x_test[numeric_cols]

x_train.shape, x_test.shape

((5827, 215), (1457, 215))

In [6]:
imputer = SimpleImputer(strategy="median")

x_train = pd.DataFrame(
    imputer.fit_transform(x_train),
    columns=x_train.columns,
    index=x_train.index
)

x_test = pd.DataFrame(
    imputer.transform(x_test),
    columns=x_train.columns,
    index=x_test.index
)

In [7]:
var_selector = VarianceThreshold(threshold=VAR_THRESHOLD)

x_train = pd.DataFrame(
    var_selector.fit_transform(x_train),
    columns=x_train.columns[var_selector.get_support()],
    index=x_train.index
)

# x_train was reassigned above, so x_train.columns now holds only the retained features
x_test = pd.DataFrame(
    var_selector.transform(x_test),
    columns=x_train.columns,
    index=x_test.index
)

x_train.shape, x_test.shape

((5827, 203), (1457, 203))

In [8]:
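# absolute Pearson correlations between features, training set only
corr_matrix = x_train.corr().abs()
# keep the strict upper triangle so each feature pair is considered once
upper_tri = corr_matrix.where(
    np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
)

# drop a feature if it is highly correlated with any feature that precedes it
to_drop = [
    col for col in upper_tri.columns
    if any(upper_tri[col] > CORR_THRESHOLD)
]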

x_train_final = x_train.drop(columns=to_drop)
x_test_final  = x_test.drop(columns=to_drop)

len(to_drop), x_train_final.shape

(47, (5827, 156))
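
To make the pair-handling of the correlation filter concrete, the same logic can be wrapped in a small function and run on toy data. This is a sketch under stated assumptions: the helper name `correlated_columns_to_drop` and the frame `toy` are hypothetical and do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

def correlated_columns_to_drop(frame, threshold):
    """Return columns whose |Pearson r| with any earlier column exceeds threshold."""
    corr = frame.corr().abs()
    # Strict upper triangle: for column j, only rows i < j remain
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Toy features (hypothetical values)
toy = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.0, 4.0, 6.0, 8.0, 10.5],  # almost a multiple of f1
    "f3": [5.0, 1.0, 4.0, 2.0, 3.0],   # only weakly related to f1 and f2
})

dropped = correlated_columns_to_drop(toy, threshold=0.95)
print(dropped)
```

Because only the upper triangle is inspected, the earlier column of a correlated pair is kept and the later one is dropped, so the result depends on column order.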

In [9]:
# limit extreme descriptor values to [-CLIP_VALUE, CLIP_VALUE]
x_train_final = x_train_final.clip(-CLIP_VALUE, CLIP_VALUE)
x_test_final  = x_test_final.clip(-CLIP_VALUE, CLIP_VALUE)
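
As a small illustration of what the clipping step does, the sketch below clips a toy frame and checks that every value ends up finite and within the bound. The frame `toy` and its values are hypothetical, and the infinite entry is included only to show how `clip` treats it.

```python
import numpy as np
import pandas as pd

CLIP_VALUE = 1e6  # same bound as in this notebook

# Toy descriptor values (hypothetical)
toy = pd.DataFrame({
    "d1": [0.5, 3.2e9, -7.1e8],
    "d2": [1.0, np.inf, 2.0],
})

clipped = toy.clip(-CLIP_VALUE, CLIP_VALUE)

# After clipping, every entry should be finite and no larger than the bound
all_finite = bool(np.isfinite(clipped.to_numpy()).all())
max_abs = float(clipped.abs().to_numpy().max())
print(all_finite, max_abs)
```

A check of this kind could be run on `x_train_final` and `x_test_final` before saving, if a guard against non-finite values is wanted.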

In [10]:
x_train_final.to_csv(
    "../data/processed/tg_rdkit_polymer/x_train_proc.csv",
    index=False
)

x_test_final.to_csv(
    "../data/processed/tg_rdkit_polymer/x_test_proc.csv",
    index=False
)

pd.Series(x_train_final.columns).to_csv(
    "../data/processed/selected_baseline_polymer_features.csv",
    index=False
)

## Pipeline Freeze Note

The polymer-augmented feature pipeline is frozen with the following settings:

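- NaN threshold: 80% (features with a higher missing fraction in the training set are dropped)
- Median imputation (medians fitted on the training set only)
- Variance threshold: 1e-5 (fitted on the training set only)
- Correlation threshold: 0.95 (absolute Pearson correlation, computed on the training set only)
- Clipping: values limited to [-1e6, 1e6]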

This preprocessing pipeline is reused unchanged for all polymer-aware baseline
models to ensure fair and reproducible comparisons.
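
One way a downstream notebook might reuse the frozen feature list is sketched below. The helper `select_frozen_features`, the list `frozen`, and the frame `new_x` are hypothetical stand-ins; in practice the list would presumably be loaded from the saved feature file, for example with `pd.read_csv(path).squeeze("columns").tolist()`.

```python
import pandas as pd

def select_frozen_features(frame, feature_names):
    """Return `frame` restricted to `feature_names`, in that order.

    Raises KeyError if a frozen feature is absent, rather than
    silently producing a misaligned matrix.
    """
    missing = [name for name in feature_names if name not in frame.columns]
    if missing:
        raise KeyError(f"Missing frozen features: {missing}")
    return frame.loc[:, list(feature_names)]

# Hypothetical stand-ins for the saved feature list and a new feature matrix
frozen = ["MolWt", "TPSA"]
new_x = pd.DataFrame({
    "TPSA": [20.2, 40.5],
    "Extra": [0, 1],
    "MolWt": [100.0, 250.0],
})

aligned = select_frozen_features(new_x, frozen)
print(aligned.columns.tolist())
```

Failing loudly on a missing feature keeps every baseline model on exactly the same column set and order, which is the point of freezing the pipeline.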
