## Baseline Feature Engineering Pipeline with Added Polymer-Aware Features

This notebook applies the baseline feature engineering pipeline to
polymer-aware, repeat-unit-level molecular features that augment generic
RDKit descriptors. These features encode polymer-specific information such
as composition, backbone rigidity, and chain flexibility. All feature
selection steps and imputation statistics are fitted on the training data
only and applied unchanged to the test set. The data and the pipeline are
the same as for the previous baseline.

In [1]:
import pandas as pd
import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.feature_selection import VarianceThreshold

RANDOM_SEED = 42
CLIP_VALUE = 1e6

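NAN_THRESHOLD = 0.8    # drop features with more than 80% NaN values in train
VAR_THRESHOLD = 1e-5   # drop features whose training variance is not above this
CORR_THRESHOLD = 0.95  # drop one feature of each pair with |correlation| above this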

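# Row labels of the train/test split; they are used with .loc below
train_idx = pd.read_csv("../data/processed/train_idx.csv").squeeze()
test_idx = pd.read_csv("../data/processed/test_idx.csv").squeeze()

# RDKit descriptors plus polymer-aware features, one row per polymer
df_polymer_features = pd.read_csv("../data/intermediate/tg_with_rdkit_and_polymer_features.csv")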

In [2]:
meta_cols = ["SMILES", "Tg", "PID", "Polymer Class"]

x_train = df_polymer_features.loc[train_idx].drop(columns=meta_cols)
x_test = df_polymer_features.loc[test_idx].drop(columns=meta_cols)
y_train = df_polymer_features.loc[train_idx, 'Tg']
y_test = df_polymer_features.loc[test_idx, 'Tg']

x_train.shape, x_test.shape, y_train.shape, y_test.shape

((5827, 227), (1457, 227), (5827,), (1457,))
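
Because the split is applied with `.loc`, the values in `train_idx` and `test_idx` are interpreted as row labels, so they need to exist in the frame's index and should not overlap. The following is a minimal sanity-check sketch on toy data; `check_split` is a hypothetical helper that is not part of the original notebook.

```python
import pandas as pd


def check_split(index, train_idx, test_idx):
    """Summarise common problems with a label-based train/test split.

    Hypothetical helper for illustration only.
    """
    train = pd.Index(train_idx)
    test = pd.Index(test_idx)
    return {
        # Labels that appear in both splits (would leak test rows into train)
        "overlap": len(train.intersection(test)),
        # Labels that are not present in the frame's index (.loc would fail)
        "missing": len(train.union(test).difference(index)),
        # Labels repeated within a split
        "duplicates": int(train.duplicated().sum() + test.duplicated().sum()),
    }


# Toy stand-in for df_polymer_features.index and the two index files
toy_index = pd.RangeIndex(10)
report = check_split(toy_index, [0, 1, 2, 3, 4, 5, 6], [7, 8, 9])
print(report)
```

In the notebook, the same call could be made with `df_polymer_features.index`, `train_idx`, and `test_idx`; all three counts would be expected to be zero.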

Drop features with a high fraction of NaN values in the training set

In [3]:
nan_ratio = x_train.isna().mean()
drop_nan_columns = nan_ratio[nan_ratio > NAN_THRESHOLD].index

x_train1 = x_train.drop(columns=drop_nan_columns)
x_test1 = x_test.drop(columns=drop_nan_columns)

len(drop_nan_columns), x_train1.shape, x_test1.shape

(12, (5827, 215), (1457, 215))
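
To illustrate the rule used above, here is a small sketch on toy data (the frame and its column names are made up for the example). A column is dropped only when its NaN fraction in the training frame is strictly greater than `NAN_THRESHOLD`.

```python
import numpy as np
import pandas as pd

NAN_THRESHOLD = 0.8  # same value as in the notebook

# Toy training frame: "a" is 10% missing, "b" is 90% missing, "c" is complete
toy_train = pd.DataFrame({
    "a": [1.0, np.nan] + [2.0] * 8,
    "b": [np.nan] * 9 + [5.0],
    "c": range(10),
})

# Fraction of missing values per column, computed on training rows only
toy_nan_ratio = toy_train.isna().mean()
toy_drop = toy_nan_ratio[toy_nan_ratio > NAN_THRESHOLD].index.tolist()
print(toy_drop)
```

Only `"b"` exceeds the threshold here; the same list of column names would then be dropped from both the train and test frames.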

Check whether all columns are numeric, and drop any that are not.

In [4]:
x_train1.dtypes.value_counts()

int64      113
float64    102
Name: count, dtype: int64

In [5]:
numeric_cols = x_train1.select_dtypes(include=[np.number]).columns

x_train1 = x_train1[numeric_cols]
x_test1 = x_test1[numeric_cols] 

x_train1.shape, x_test1.shape

((5827, 215), (1457, 215))

Median Imputation

In [6]:
imputer = SimpleImputer(strategy="median")

x_train2 = pd.DataFrame(
    imputer.fit_transform(x_train1), 
    columns=x_train1.columns, 
    index=x_train1.index)
x_test2 = pd.DataFrame(
    imputer.transform(x_test1), 
    columns=x_test1.columns, 
    index=x_test1.index)
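
As a small illustration of what "train-only" imputation means, the sketch below fits a `SimpleImputer` on a toy training column and reuses the learned median on a toy test column. The toy frames are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy_train = pd.DataFrame({"f": [1.0, 2.0, 3.0, np.nan]})
toy_test = pd.DataFrame({"f": [np.nan, 100.0]})

# The median is learned from the non-missing training values only
toy_imputer = SimpleImputer(strategy="median").fit(toy_train)
learned_median = float(toy_imputer.statistics_[0])

# The test NaN is filled with the training median, not a test-set statistic
filled_test = toy_imputer.transform(toy_test).ravel().tolist()
print(learned_median, filled_test)
```

The test value of 100.0 has no influence on the fill value, which is the property the pipeline relies on to avoid leakage.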


Remove near-constant features (variance threshold)

In [7]:
var_selector = VarianceThreshold(threshold=VAR_THRESHOLD)
x_train3 = pd.DataFrame(
    var_selector.fit_transform(x_train2), 
    columns=x_train2.columns[var_selector.get_support()], 
    index=x_train2.index)
x_test3 = pd.DataFrame(
    var_selector.transform(x_test2),
    columns=x_test2.columns[var_selector.get_support()],
    index=x_test2.index)
x_train3.shape, x_test3.shape

((5827, 203), (1457, 203))
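
The sketch below shows the effect of `VarianceThreshold` on a toy frame (column names are made up). Note that the variance is computed on the raw, unscaled values, so the same threshold is stricter for features with small numeric ranges.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

VAR_THRESHOLD = 1e-5  # same value as in the notebook

toy = pd.DataFrame({
    "constant": [1.0] * 6,
    "almost_constant": [0.0, 0.0, 0.0, 0.0, 0.0, 0.001],
    "informative": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
})

# Fit on the (toy) training frame; get_support() marks the columns that are kept
toy_selector = VarianceThreshold(threshold=VAR_THRESHOLD).fit(toy)
kept = toy.columns[toy_selector.get_support()].tolist()
print(kept)
```

Both the constant column and the nearly constant one fall below the threshold, so only `"informative"` is kept.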

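Remove highly correlated features

In [8]:
# Absolute pairwise Pearson correlations, computed on training data only
corr_matrix = x_train3.corr().abs()
# Keep the strict upper triangle so that each pair is considered once
upper_tri = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Drop the later column of every pair whose correlation exceeds the threshold
to_drop = [column for column in upper_tri.columns if any(upper_tri[column] > CORR_THRESHOLD)]

x_train_final = x_train3.drop(columns=to_drop)
x_test_final = x_test3.drop(columns=to_drop)
len(to_drop), x_train_final.shape, x_test_final.shape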

(47, (5827, 156), (1457, 156))
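
The upper-triangle filter above depends on column order: for each highly correlated pair, the column that appears later in the frame is the one dropped. A minimal sketch on synthetic data (column names and values are assumptions for the example) makes this visible.

```python
import numpy as np
import pandas as pd

CORR_THRESHOLD = 0.95  # same value as in the notebook

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "x": base,
    "x_scaled": 3.0 * base + 1.0,   # linear function of "x", so |r| is 1
    "noise": rng.normal(size=200),  # independent of "x"
})

corr = toy.corr().abs()
# Strict upper triangle: entry (row, col) is kept only when row comes before col
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
toy_to_drop = [col for col in upper.columns if (upper[col] > CORR_THRESHOLD).any()]
print(toy_to_drop)
```

Here `"x_scaled"` is dropped and `"x"` is kept. Reordering the columns so that `"x_scaled"` comes first would drop `"x"` instead, so the column order of the input file is effectively part of the frozen pipeline.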

In [9]:
# Rows are written in train_idx / test_idx order; the index itself is not saved
x_train_final.to_csv("../data/processed/tg_rdkit_polymer/x_train_proc.csv", index=False)
x_test_final.to_csv("../data/processed/tg_rdkit_polymer/x_test_proc.csv", index=False)


## Pipeline Freeze Note

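The baseline feature engineering pipeline is frozen with the following settings:
- NaN threshold: 80% (features with a larger missing fraction in the training set are dropped)
- Median imputation (medians fitted on training data only)
- Variance threshold: 1e-5
- Correlation threshold: 0.95 (absolute Pearson correlation)

This pipeline will be reused unchanged for all baseline models.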
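
One way to make this reuse explicit is to separate fitting from applying. The sketch below is an assumption about how that could look, not the notebook's implementation: `fit_filters` and `apply_filters` are hypothetical helpers, and they use plain pandas (`median`, `var(ddof=0)`) in place of `SimpleImputer` and `VarianceThreshold`.

```python
import numpy as np
import pandas as pd

NAN_THRESHOLD = 0.8
VAR_THRESHOLD = 1e-5
CORR_THRESHOLD = 0.95


def fit_filters(x_train):
    """Learn medians and the kept columns from the training frame only."""
    nan_ratio = x_train.isna().mean()
    x = x_train.loc[:, nan_ratio <= NAN_THRESHOLD].select_dtypes(include=[np.number])
    medians = x.median()
    x = x.fillna(medians)
    # Population variance (ddof=0), used here as a stand-in for VarianceThreshold
    x = x.loc[:, x.var(ddof=0) > VAR_THRESHOLD]
    corr = x.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    to_drop = [c for c in upper.columns if (upper[c] > CORR_THRESHOLD).any()]
    return {"medians": medians, "columns": x.columns.drop(to_drop)}


def apply_filters(x, state):
    """Apply training-set medians and column selection to any split."""
    return x.fillna(state["medians"])[state["columns"]]


# Toy data: one duplicate-like column, one constant column, one mostly-NaN column
rng = np.random.default_rng(0)
base = rng.normal(size=50)
toy_train = pd.DataFrame({
    "f1": base,
    "f1_copy": 2.0 * base,
    "flat": np.ones(50),
    "sparse": [np.nan] * 45 + [1.0] * 5,
    "f2": rng.normal(size=50),
})
toy_train.loc[0, "f2"] = np.nan
toy_test = toy_train.iloc[:10].copy()

state = fit_filters(toy_train)
toy_out = apply_filters(toy_test, state)
print(list(state["columns"]), toy_out.shape)
```

With this split, the fitted `state` could be stored alongside the processed files so that later baseline models apply exactly the same medians and column list.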
