# Molecular Properties Prediction: Analysis, Modeling, and Visualization

This notebook predicts molecular properties (Tg, FFV, Tc, Density, Rg) for SMILES strings using the `train.csv` dataset. It includes:
- Data loading and preprocessing
- Featurization using RDKit
- Correlation analysis with visualizations
- Random Forest model training and evaluation
- Predictions for test SMILES
- Generation of model report and submission file

Outputs:
- `model_report.csv`: Model performance metrics (R2, RMSE)
- `correlation_matrix.png`, `scatter_{descriptor}_vs_{property}.png`: Visualization images
- `submission.csv`: Predicted properties for test SMILES

**Note**: Ensure `train.csv` is in the working directory. In Kaggle, use `/kaggle/input/` paths if needed.

In [65]:
!conda install -c rdkit rdkit -y

Channels:
 - rdkit
 - defaults
Platform: win-64
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... failed



LibMambaUnsatisfiableError: Encountered problems while solving:
  - nothing provides numpy 1.10.* needed by rdkit-2015.09.2-np110py27_0

Could not solve for environment specs
The following packages are incompatible
\u251c\u2500 pin-1 is installable and it requires
\u2502  \u2514\u2500 python 3.12.* , which can be installed;
\u2514\u2500 rdkit is not installable because there are no viable options
   \u251c\u2500 rdkit [2014.09.2|2015.03.1|...|2017.09.3.0] would require
   \u2502  \u2514\u2500 python [2.7* |>=2.7,<2.8.0a0 ], which conflicts with any installable versions previously reported;
   \u251c\u2500 rdkit [2015.09.2|2016.03.1] would require
   \u2502  \u2514\u2500 numpy 1.10.* , which does not exist (perhaps a missing channel);
   \u251c\u2500 rdkit [2016.03.1|2016.09.2|...|2018.03.1.1] would require
   \u2502  \u2514\u2500 python [3.5* |>=3.5,<3.6.0a0 ], which conflicts with any installable versions previously reported;
   \u251c\u2500 rdkit [2017.03.1|2017.03.2|...|2020.09.1.0

In [66]:
pip install rdkit

Defaulting to user installation because normal site-packages is not writeableNote: you may need to restart the kernel to use updated packages.





In [67]:
# Install RDKit if needed (uncomment in Kaggle)
# !conda install -c rdkit rdkit -y

# Import libraries
import pandas as pd
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors, AllChem
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import warnings
warnings.filterwarnings('ignore')

# Set random seed
np.random.seed(42)

In [68]:
# Load dataset
try:
    train_df = pd.read_csv('train.csv')
except FileNotFoundError:
    train_df = pd.read_csv(r"C:\Users\DELL\Downloads\neurips-open-polymer-prediction-2025\train.csv")

# Define test SMILES
test_data = [
    {"id": 1109053969, "SMILES": "*Oc1ccc(C=NN=Cc2ccc(Oc3ccc(C(c4ccc(*)cc4)(C(F)(F)F)C(F)(F)F)cc3)cc2)cc1"},
    {"id": 1422188626, "SMILES": "*Oc1ccc(C(C)(C)c2ccc(Oc3ccc(C(=O)c4cccc(C(=O)c5ccc(*)cc5)cc3)cc2)cc1"},
    {"id": 2032016830, "SMILES": "*c1cccc(OCCCCCCCCOc2cccc(N3C(=O)c4ccc(-c5cccc6c5C(=O)N(*)C6=O)cc4C3=O)c2)c1"}
]
test_df = pd.DataFrame(test_data)

# Verify data
print('Train dataset shape:', train_df.shape)
print('Test dataset shape:', test_df.shape)
train_df.head()

Train dataset shape: (7973, 7)
Test dataset shape: (3, 2)


Unnamed: 0,id,SMILES,Tg,FFV,Tc,Density,Rg
0,87817,*CC(*)c1ccccc1C(=O)OCCCCCC,,0.374645,0.205667,,
1,106919,*Nc1ccc([C@H](CCC)c2ccc(C3(c4ccc([C@@H](CCC)c5...,,0.37041,,,
2,388772,*Oc1ccc(S(=O)(=O)c2ccc(Oc3ccc(C4(c5ccc(Oc6ccc(...,,0.37886,,,
3,519416,*Nc1ccc(-c2c(-c3ccc(C)cc3)c(-c3ccc(C)cc3)c(N*)...,,0.387324,,,
4,539187,*Oc1ccc(OC(=O)c2cc(OCCCCCCCCCOCC3CCCN3c3ccc([N...,,0.35547,,,


In [70]:
# Featurization function
def compute_descriptors(smiles):
    smiles = smiles.replace('*', 'H')
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    descriptors = {
        'MolWt': Descriptors.MolWt(mol),
        'TPSA': Descriptors.TPSA(mol),
        'NumRotatableBonds': Descriptors.NumRotatableBonds(mol),
        'NumHeavyAtoms': Descriptors.HeavyAtomCount(mol)
    }
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    descriptors['Fingerprint'] = np.array(fp)
    return descriptors

# Apply featurization
train_df['Descriptors'] = train_df['SMILES'].apply(compute_descriptors)
test_df['Descriptors'] = test_df['SMILES'].apply(compute_descriptors)

# Remove invalid SMILES
train_df = train_df[train_df['Descriptors'].notnull()]
test_df = test_df[test_df['Descriptors'].notnull()]

# Extract numerical descriptors
train_descriptors = pd.DataFrame(list(train_df['Descriptors']))[['MolWt', 'TPSA', 'NumRotatableBonds', 'NumHeavyAtoms']]
test_descriptors = pd.DataFrame(list(test_df['Descriptors']))[['MolWt', 'TPSA', 'NumRotatableBonds', 'NumHeavyAtoms']]

# Extract fingerprints
train_fps = np.vstack(train_df['Descriptors'].apply(lambda x: x['Fingerprint']))
test_fps = np.vstack(test_df['Descriptors'].apply(lambda x: x['Fingerprint']))

# Combine features
train_features = np.hstack([train_descriptors.values, train_fps])
test_features = np.hstack([test_descriptors.values, test_fps])

print('Train features shape:', train_features.shape)
print('Test features shape:', test_features.shape)

KeyError: "None of [Index(['MolWt', 'TPSA', 'NumRotatableBonds', 'NumHeavyAtoms'], dtype='object')] are in the [columns]"

In [71]:
# Correlation analysis
properties = ['Tg', 'FFV', 'Tc', 'Density', 'Rg']
corr_data = train_df[properties].join(train_descriptors)

# Correlation matrix
corr_matrix = corr_data.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix of Descriptors and Properties')
plt.savefig('correlation_matrix.png')
plt.show()

# Scatter plots
for prop in properties:
    for desc in ['MolWt', 'TPSA']:
        plt.figure(figsize=(6, 4))
        sns.scatterplot(data=corr_data, x=desc, y=prop)
        plt.title(f'{desc} vs {prop}')
        plt.savefig(f'scatter_{desc}_vs_{prop}.png')
        plt.show()

NameError: name 'train_descriptors' is not defined

In [72]:
# Model training and evaluation
model_report = {}
models = {}

for prop in properties:
    if train_df[prop].isna().all():
        print(f'No data for {prop}')
        continue
    valid_data = train_df[train_df[prop].notna()]
    X = np.vstack(valid_data['Descriptors'].apply(lambda x: np.hstack([list(x.values())[:-1], x['Fingerprint']])))
    y = valid_data[prop]
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    r2 = r2_score(y_val, y_pred)
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    model_report[prop] = {'R2': r2, 'RMSE': rmse}
    models[prop] = model
    print(f'{prop} - R2: {r2:.3f}, RMSE: {rmse:.3f}')

# Save model report
report_df = pd.DataFrame(model_report).T
report_df.to_csv('model_report.csv')
print('Model Report:')
print(report_df)

No data for Tg
No data for FFV
No data for Tc
No data for Density
No data for Rg
Model Report:
Empty DataFrame
Columns: []
Index: []


In [73]:
# Predictions for test SMILES
predictions = {'id': test_df['id'], 'SMILES': test_df['SMILES']}
for prop in properties:
    if prop in models:
        predictions[prop] = models[prop].predict(test_features)
    else:
        predictions[prop] = [np.nan] * len(test_df)

# Create submission file
submission_df = pd.DataFrame(predictions)
submission_df.to_csv('submission.csv', index=False)
print('Submission file:')
print(submission_df)

Submission file:
Empty DataFrame
Columns: [id, SMILES, Tg, FFV, Tc, Density, Rg]
Index: []


## Outputs
- **Model Report**: `model_report.csv` with R² and RMSE.
- **Visualizations**: `correlation_matrix.png`, scatter plots (`scatter_{descriptor}_vs_{property}.png`).
- **Submission**: `submission.csv` with predictions.

To download, locate files in the working directory (or `/kaggle/working/` in Kaggle) and use the download option in Jupyter/Kaggle.