## Clean and reduce descriptors

This notebook handles missing values in the RDKit descriptor table, removes low-variance descriptors,
removes highly correlated descriptors, and collects the result into the final feature table for modeling.

⚠️ This notebook performs exploratory feature cleaning
using statistics computed on the full dataset.
It is used to determine reasonable thresholds
and is NOT used for final model training or evaluation.
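
For the final modeling workflow, the same statistics would be computed on a training split only and then applied unchanged to held-out data. The sketch below illustrates the idea on a toy frame. The split, the column names, and the helpers `fit_nan_cleaning` and `apply_nan_cleaning` are hypothetical and not part of this notebook.

```python
import numpy as np
import pandas as pd


def fit_nan_cleaning(train: pd.DataFrame, nan_threshold: float = 0.8):
    """Learn which columns to keep and their medians from the training rows only."""
    nan_ratio = train.isna().mean()
    keep_cols = nan_ratio[nan_ratio <= nan_threshold].index.tolist()
    medians = train[keep_cols].median()
    return keep_cols, medians


def apply_nan_cleaning(df: pd.DataFrame, keep_cols, medians) -> pd.DataFrame:
    """Apply the column selection and medians learned on the training rows."""
    return df[keep_cols].fillna(medians)


# Toy descriptor table (values are made up for illustration).
toy = pd.DataFrame({
    "desc_a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "desc_b": [np.nan, np.nan, np.nan, np.nan, np.nan, 1.0],
    "desc_c": [0.5, 0.1, 0.3, np.nan, 0.2, 0.4],
})
train, test = toy.iloc[:4], toy.iloc[4:]

keep_cols, medians = fit_nan_cleaning(train)
train_clean = apply_nan_cleaning(train, keep_cols, medians)
test_clean = apply_nan_cleaning(test, keep_cols, medians)
```

Here `desc_b` is dropped because it is entirely missing in the training rows, and the held-out rows are filled with training medians rather than their own.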


In [14]:
import pandas as pd
import numpy as np
df_desc = pd.read_csv("../data/processed/tg_with_rdkit_descriptors.csv")

In [4]:
nan_ratio = df_desc.isna().mean()
nan_ratio.sort_values(ascending=False).head(20)

BCUT2D_MWLOW           1.000000
BCUT2D_LOGPLOW         1.000000
BCUT2D_LOGPHI          1.000000
BCUT2D_MRHI            1.000000
BCUT2D_CHGLO           1.000000
BCUT2D_CHGHI           1.000000
BCUT2D_MWHI            1.000000
BCUT2D_MRLOW           1.000000
MinPartialCharge       0.948517
MinAbsPartialCharge    0.948517
MaxPartialCharge       0.948517
MaxAbsPartialCharge    0.948517
Tg                     0.000000
ExactMolWt             0.000000
HeavyAtomMolWt         0.000000
MolWt                  0.000000
SPS                    0.000000
qed                    0.000000
MinEStateIndex         0.000000
MinAbsEStateIndex      0.000000
dtype: float64

### NaN Cleaning
- Removed molecular descriptors with more than 80% missing values.
- Remaining missing values were imputed using column-wise median.
- This resulted in a prototype descriptor table used to explore feature quality and thresholds.


Drop columns with more than 80% missing data

In [5]:
NAN_THRESHOLD = 0.8
drop_cols = nan_ratio[nan_ratio > NAN_THRESHOLD].index
len(drop_cols)

12

In [6]:
df_desc_step1 = df_desc.drop(columns=drop_cols)

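Remaining NaNs, if any, are filled with the column median. In this dataset the `nan_ratio` output above shows no missing values outside the dropped columns, so this step acts as a safeguard.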

In [7]:
meta_cols = ["SMILES", "Tg", "PID", "Polymer Class"]
descriptor_cols = [c for c in df_desc_step1.columns if c not in meta_cols]

df_desc_clean = df_desc_step1.copy()

df_desc_clean[descriptor_cols] = (
    df_desc_clean[descriptor_cols]
    .fillna(df_desc_clean[descriptor_cols].median())
)

In [8]:
df_desc_clean.isna().sum().sum()

np.int64(0)

## Low-Variance Column Handling

In [9]:
desciptor_data = df_desc_clean[descriptor_cols]
variances = desciptor_data.var()
variances.sort_values().head(20)

SMR_VSA8              0.000000
SlogP_VSA9            0.000000
fr_lactam             0.000000
fr_nitroso            0.000000
fr_guanido            0.000000
fr_benzodiazepine     0.000000
fr_barbitur           0.000000
fr_thiocyan           0.000000
fr_prisulfonamd       0.000000
fr_isothiocyan        0.000000
fr_dihydropyridine    0.000000
fr_tetrazole          0.000137
fr_HOCCN              0.000137
fr_epoxide            0.000137
fr_isocyan            0.000137
fr_term_acetylene     0.000275
fr_priamide           0.000275
fr_oxime              0.000275
fr_amidine            0.000412
fr_SH                 0.000549
dtype: float64

In [10]:
VAR_THRESHOLD = 1e-5
low_variance_cols = variances[variances < VAR_THRESHOLD].index
len(low_variance_cols)

11

In [11]:
desciptor_data_reduced = desciptor_data.drop(columns=low_variance_cols)
desciptor_data_reduced.shape

(7284, 194)

In [12]:
print("Before reduction:", desciptor_data.shape[1])
print("After reduction:", desciptor_data_reduced.shape[1])

Before reduction: 205
After reduction: 194
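
The smallest non-zero variances listed above are consistent with fragment counts that are non-zero for a single molecule. For a 0/1 column with exactly one 1 among n rows, the sample variance (pandas' default, `ddof=1`) equals 1/n, which is about 0.000137 for n = 7284. Such columns pass `VAR_THRESHOLD = 1e-5`, so in practice this threshold removes only constant columns. The check below uses a synthetic column, not the real data.

```python
import numpy as np
import pandas as pd

n_rows = 7284  # number of rows reported by the notebook

# Synthetic fragment-count column: present in exactly one molecule.
single_hit = pd.Series(np.zeros(n_rows))
single_hit.iloc[0] = 1.0

var_single = single_hit.var()  # sample variance (ddof=1)
std_single = single_hit.std()

print(f"variance = {var_single:.6f}")
print(f"std      = {std_single:.6f}")
print("survives 1e-5 threshold:", bool(var_single >= 1e-5))
```

If near-constant fragment counts should also be removed, a larger threshold or a minimum count of non-zero rows would be needed. Either choice is a modeling decision outside the scope of this notebook.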


## Correlation Filtering

In [29]:
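# Absolute Pearson correlation between all remaining descriptors
corr_matrix = desciptor_data_reduced.corr().abs()
# Keep only entries strictly above the diagonal so each pair is examined once
upper_tri = corr_matrix.where(
    np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
)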

In [30]:
CORR_THRESHOLD = 0.95
# Drop the later column of every pair whose |r| exceeds the threshold
to_drop = [
    column
    for column in upper_tri.columns
    if (upper_tri[column] > CORR_THRESHOLD).any()
]

In [31]:
descriptor_final = desciptor_data_reduced.drop(columns=to_drop)
descriptor_final.shape

(7284, 150)
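
To make the behaviour of the upper-triangle filter concrete, the toy example below applies the same logic to three made-up columns. Because only entries above the diagonal are inspected, the column that appears later in the table is dropped from each highly correlated pair, so the result depends on column order. Column names and values are illustrative only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [3.0, 5.0, 7.0, 9.0, 11.0, 13.0],  # b = 2*a + 1, perfectly correlated with a
    "c": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],    # only moderately correlated with a and b
})

toy_corr = toy.corr().abs()
# Keep only entries strictly above the diagonal so each pair is examined once.
toy_upper = toy_corr.where(
    np.triu(np.ones(toy_corr.shape), k=1).astype(bool)
)

TOY_CORR_THRESHOLD = 0.95
toy_to_drop = [
    col for col in toy_upper.columns if (toy_upper[col] > TOY_CORR_THRESHOLD).any()
]
toy_reduced = toy.drop(columns=toy_to_drop)
print(toy_to_drop)
```

Column `a` is kept and `b` is dropped. Reordering the columns so that `b` comes first would keep `b` and drop `a` instead.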

In [32]:
descriptor_final.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| MaxAbsEStateIndex | 7284.0 | 11.357325 | 3.391001 | 1.682870 | 11.269619 | 12.558593 | 13.402306 | 17.428248 |
| MinAbsEStateIndex | 7284.0 | 0.131352 | 0.191667 | 0.000006 | 0.033196 | 0.079493 | 0.152922 | 3.423234 |
| MinEStateIndex | 7284.0 | -1.317516 | 1.923319 | -9.594603 | -1.349483 | -0.551287 | -0.189081 | 0.875000 |
| qed | 7284.0 | 0.339146 | 0.219577 | 0.007751 | 0.140923 | 0.300258 | 0.519413 | 0.908118 |
| SPS | 7284.0 | 15.482425 | 6.376848 | 9.555556 | 12.025000 | 13.703704 | 15.871536 | 74.875000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| fr_tetrazole | 7284.0 | 0.000137 | 0.011717 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| fr_thiazole | 7284.0 | 0.003981 | 0.080234 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
| fr_thiophene | 7284.0 | 0.037068 | 0.303757 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6.000000 |
| fr_unbrch_alkane | 7284.0 | 1.518259 | 3.535127 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 32.000000 |


In [35]:
print(f"Descriptor count before correlation filtering: {desciptor_data_reduced.shape[1]}")
print(f"Final descriptor count: {descriptor_final.shape[1]}")

Descriptor count before correlation filtering: 194
Final descriptor count: 150


In [33]:
low_std = descriptor_final.std().sort_values().head(10)
low_std

fr_HOCCN             0.011717
fr_tetrazole         0.011717
fr_isocyan           0.011717
fr_epoxide           0.011717
fr_term_acetylene    0.016569
fr_oxime             0.016569
fr_priamide          0.016569
fr_amidine           0.020292
fr_SH                0.023434
fr_morpholine        0.026193
dtype: float64

In [34]:
descriptor_final.std().describe()

count    1.500000e+02
mean     6.945145e+54
std      8.506031e+55
min      1.171696e-02
25%      1.698111e-01
50%      1.099795e+00
75%      6.915655e+00
max      1.041772e+57
dtype: float64

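Descriptor standard deviations span many orders of magnitude, from about 1.2e-2 to about 1.0e57, with a median of about 1.1. Wide ranges are expected for RDKit molecular descriptors, but the maximum, far above the 75th percentile of about 6.9, indicates at least one descriptor with extremely large values that may warrant inspection. Feature scaling will be applied prior to model training.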
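
As a rough sketch of the scaling step mentioned above, the example below standardizes each column to zero mean and unit standard deviation using plain pandas. The toy values are made up, and the scaler actually used for model training is not specified in this notebook, so this is one possible approach rather than the method used.

```python
import pandas as pd

# Toy columns on very different scales (values are illustrative only).
toy_scales = pd.DataFrame({
    "small_scale": [0.0, 0.0, 1.0, 0.0],
    "medium_scale": [10.0, 12.5, 9.0, 14.0],
    "huge_scale": [1.0e50, 3.0e55, 2.0e57, 5.0e52],
})

col_means = toy_scales.mean()
col_stds = toy_scales.std()

# Z-score standardization: subtract the column mean, divide by the column std.
toy_scaled = (toy_scales - col_means) / col_stds
print(toy_scaled.std())
```

Standardization puts the columns on a common scale but does not change the shape of a heavy-tailed distribution, so a column dominated by a few extreme values may still need separate treatment.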

In [36]:
df_features_final = pd.concat(
    [df_desc_clean[meta_cols].reset_index(drop=True), descriptor_final.reset_index(drop=True)], axis=1)

df_features_final.shape

(7284, 154)

### Feature Engineering Summary
- Generated RDKit molecular descriptors from validated SMILES.
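- Removed descriptors with a high missing-value ratio (>80%): 12 columns.
- Imputed remaining missing values using the column-wise median.
- Removed low-variance descriptors (variance < 1e-5): 11 columns, 205 to 194.
- Removed highly correlated descriptors (|r| > 0.95): 44 columns, 194 to 150.
- Final feature table: 7284 rows, 150 descriptors plus 4 metadata columns.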
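
The filtering steps summarized above can be collected into a single function, sketched below under the same default thresholds. The function name `reduce_descriptors` and the toy data are hypothetical. The notebook itself runs these steps cell by cell.

```python
import numpy as np
import pandas as pd


def reduce_descriptors(
    descriptors: pd.DataFrame,
    nan_threshold: float = 0.8,
    var_threshold: float = 1e-5,
    corr_threshold: float = 0.95,
) -> pd.DataFrame:
    """Apply the NaN, variance, and correlation filters in sequence."""
    # 1. Drop columns whose missing-value ratio exceeds the threshold.
    nan_ratio = descriptors.isna().mean()
    out = descriptors.drop(columns=nan_ratio[nan_ratio > nan_threshold].index)

    # 2. Fill remaining NaNs with the column median.
    out = out.fillna(out.median())

    # 3. Drop columns with variance below the threshold.
    variances = out.var()
    out = out.drop(columns=variances[variances < var_threshold].index)

    # 4. Drop the later column of each pair with |r| above the threshold.
    corr = out.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    to_drop = [c for c in upper.columns if (upper[c] > corr_threshold).any()]
    return out.drop(columns=to_drop)


# Toy descriptor table (values are made up for illustration).
toy_desc = pd.DataFrame({
    "mostly_nan": [np.nan, np.nan, np.nan, np.nan, np.nan, 1.0],
    "constant": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "x": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "x_scaled": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
    "y": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
})
reduced = reduce_descriptors(toy_desc)
print(list(reduced.columns))
```

On the toy table, `mostly_nan` is removed by the NaN filter, `constant` by the variance filter, and `x_scaled` by the correlation filter, leaving `x` and `y`.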