## Train / Test Split

This notebook performs a fixed 80/20 train/test split
on the RDKit descriptor dataset, using a fixed random seed.

No feature engineering, imputation, or filtering
is performed in this step.

In [14]:
import pandas as pd
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42
TEST_SIZE = 0.2

In [15]:
df = pd.read_csv("../data/intermediate/tg_with_rdkit_descriptors.csv")
df.shape

(7284, 221)

In [16]:
df.head()

| Unnamed: 0 | SMILES | Tg | PID | Polymer Class | MaxAbsEStateIndex | MaxEStateIndex | MinAbsEStateIndex | MinEStateIndex | qed | SPS | ... | fr_sulfide | fr_sulfonamd | fr_sulfone | fr_term_acetylene | fr_tetrazole | fr_thiazole | fr_thiocyan | fr_thiophene | fr_unbrch_alkane | fr_urea |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | `*C*` | -54.0 | P010001 | Polyolefins | 1.75 | 1.75 | 0.875 | 0.875 | 0.355446 | 20.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | `*CC(*)C` | -3.0 | P010002 | Polyolefins | 2.395833 | 2.395833 | 0.75 | 0.75 | 0.41472 | 25.666667 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | `*CC(*)CC` | -24.1 | P010003 | Polyolefins | 2.332824 | 2.332824 | 0.743056 | 0.743056 | 0.451401 | 22.25 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | `*CC(*)CCC` | -37.0 | P010004 | Polyolefins | 2.31 | 2.31 | 0.734306 | 0.734306 | 0.476641 | 20.2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | `*CC(*)C(C)C` | 60.0 | P010006 | Polyolefins | 2.374491 | 2.374491 | 0.699074 | 0.699074 | 0.465496 | 21.4 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
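
The `Unnamed: 0` column in the preview is what pandas typically produces when a CSV was written with its row index and later read back without `index_col`. The following is a minimal sketch of that effect on a toy frame (the values and the in-memory buffer are illustrative assumptions, not the real descriptor file). It is not applied in this notebook, and the recorded shapes include the extra column.

```python
import io

import pandas as pd

# Toy frame standing in for the descriptor table (illustrative values only).
toy = pd.DataFrame({"SMILES": ["*C*", "*CC(*)C"], "Tg": [-54.0, -3.0]})

# Writing with the default index=True stores the row index as an unnamed first column.
buffer = io.StringIO()
toy.to_csv(buffer)

# Reading back without index_col surfaces that column as "Unnamed: 0".
buffer.seek(0)
with_extra = pd.read_csv(buffer)

# Reading back with index_col=0 restores it as the row index instead.
buffer.seek(0)
without_extra = pd.read_csv(buffer, index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```

Whether to drop the column here or in a later feature-selection step is a design choice; keeping it means the raw split files still carry the original row positions.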


In [17]:
x = df.drop(columns=["Tg"])
y = df["Tg"]

x.shape, y.shape

((7284, 220), (7284,))

In [18]:
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=TEST_SIZE, random_state=RANDOM_SEED)

x_train.shape, x_test.shape, y_train.shape, y_test.shape

((5827, 220), (1457, 220), (5827,), (1457,))
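
The row counts above follow from how a fractional `test_size` is converted into whole rows. As far as I know, scikit-learn rounds the test count up and gives the remainder to the training set when `train_size` is not set. The sketch below checks that arithmetic on a synthetic array of the same length (the array is a stand-in, not the real data) and confirms that repeating the call with the same seed returns the same partition.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

N_ROWS = 7284          # number of rows reported by df.shape
TEST_SIZE = 0.2
RANDOM_SEED = 42

# Expected counts: test rows rounded up, training rows are the remainder.
expected_test = math.ceil(TEST_SIZE * N_ROWS)
expected_train = N_ROWS - expected_test

# Synthetic stand-in for the row positions of the descriptor table.
rows = np.arange(N_ROWS)

train_a, test_a = train_test_split(rows, test_size=TEST_SIZE, random_state=RANDOM_SEED)
train_b, test_b = train_test_split(rows, test_size=TEST_SIZE, random_state=RANDOM_SEED)

print(expected_train, expected_test)
print(len(train_a), len(test_a))
print(np.array_equal(train_a, train_b) and np.array_equal(test_a, test_b))
```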

In [19]:
train_df = pd.concat([x_train, y_train], axis=1)
test_df = pd.concat([x_test, y_test], axis=1)

In [20]:
train_df.isna().mean().sort_values(ascending=False).head()

BCUT2D_CHGLO      1.0
BCUT2D_LOGPHI     1.0
BCUT2D_LOGPLOW    1.0
BCUT2D_MRHI       1.0
BCUT2D_MRLOW      1.0
dtype: float64
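
A missing fraction of 1.0 means these `BCUT2D_*` columns are entirely empty in the training split. Because `head()` shows only the top five entries, other columns may also be fully or partly missing. The sketch below lists both groups on a toy frame (column names borrowed from the output above, values invented for illustration). It only reports and drops nothing, in keeping with the no-filtering rule of this step.

```python
import numpy as np
import pandas as pd

# Toy training frame: one fully missing column, one partly missing, one complete.
toy_train = pd.DataFrame({
    "BCUT2D_MRHI": [np.nan, np.nan, np.nan],
    "qed": [0.36, np.nan, 0.45],
    "SPS": [20.0, 25.7, 22.25],
})

# Fraction of missing values per column, as in the cell above.
missing_fraction = toy_train.isna().mean()

# Columns with no usable values at all.
all_missing = missing_fraction[missing_fraction == 1.0].index.tolist()

# Columns with some, but not all, values missing.
partly_missing = missing_fraction[
    (missing_fraction > 0) & (missing_fraction < 1)
].index.tolist()

print(all_missing)
print(partly_missing)
```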

In [23]:
train_df.to_csv(
    "../data/processed/tg_rdkit/train_raw.csv",
    index=False
)

test_df.to_csv(
    "../data/processed/tg_rdkit/test_raw.csv",
    index=False
)
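
Writing with `index=False` discards the shuffled row labels, so a reloaded file starts again from a fresh 0-based index. A small round-trip sketch on a toy frame illustrates this (the temporary directory and values are assumptions for the example, not the project paths).

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy split with shuffled row labels, as a train/test split would produce.
toy_split = pd.DataFrame(
    {"SMILES": ["*C*", "*CC(*)C"], "Tg": [-54.0, -3.0]},
    index=[4, 1],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "train_raw.csv"
    # Same call pattern as above: the row labels are not written.
    toy_split.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Shape and columns survive the round trip; the original labels do not.
print(reloaded.shape == toy_split.shape)
print(list(reloaded.columns))
print(list(reloaded.index))
```

This is one reason to store the split indices separately when the same rows need to be selected from other feature tables.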

## Split Freeze Note

- Input data: `tg_with_rdkit_descriptors.csv`
- Train/test split: 80 / 20 (5,827 / 1,457 rows)
- Random seed: 42
- No feature engineering, imputation, or filtering performed

This split is fixed and reused for all baseline experiments.

## Save Train and Test Indices for Different Feature Sets

In [22]:
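# Split the row index with the same test size and seed as the split above,
# so other feature sets can be partitioned into the same rows.
train_idx, test_idx = train_test_split(
    df.index,
    test_size=TEST_SIZE,
    random_state=RANDOM_SEED
)

# Persist the index labels for reuse with other feature tables.
pd.Series(train_idx).to_csv("../data/processed/train_idx.csv", index=False)
pd.Series(test_idx).to_csv("../data/processed/test_idx.csv", index=False)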
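
The saved indices are useful only if they select the same rows as the original split. The sketch below checks this on toy frames and then reuses the indices on a second, hypothetical feature table. The frames, the `feat` column, and the in-memory buffer are assumptions for illustration. The index is converted with `to_numpy()` before splitting, which is a small deviation from the cell above.

```python
import io

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42
TEST_SIZE = 0.2

# Toy base table and a second feature table sharing the same row index.
base = pd.DataFrame({"Tg": np.linspace(-50.0, 150.0, 10)})
other_features = pd.DataFrame({"feat": np.arange(10) * 2}, index=base.index)

# Split the table itself and, separately, only its index labels.
x_train, x_test = train_test_split(
    base, test_size=TEST_SIZE, random_state=RANDOM_SEED
)
train_idx, test_idx = train_test_split(
    base.index.to_numpy(), test_size=TEST_SIZE, random_state=RANDOM_SEED
)

# With equal row count, test size and seed, both calls pick the same rows.
same_partition = (
    list(x_train.index) == list(train_idx)
    and list(x_test.index) == list(test_idx)
)

# Round-trip the training indices through CSV, as in the cell above.
buffer = io.StringIO()
pd.Series(train_idx).to_csv(buffer, index=False)
buffer.seek(0)
loaded_train_idx = pd.read_csv(buffer).iloc[:, 0].to_numpy()

# Apply the stored labels to the second feature table.
other_train = other_features.loc[loaded_train_idx]
other_test = other_features.loc[test_idx]

print(same_partition)
print(other_train.shape, other_test.shape)
```

Selecting by label with `.loc` assumes the other feature table keeps the same row index as the frame the indices came from; if rows were reordered or dropped in between, the selection would no longer match.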