# Predicting Molecular Properties using Machine Learning Models on the QM9 Dataset

## Research Question:
- Can we accurately predict quantum chemical properties of molecules using classical machine learning models such as Random Forest and XGBoost, and can we improve performance through feature transfer from neural network-based embeddings?

## Project Objectives:
### Develop Predictive Models:
- Use Random Forest and XGBoost algorithms to predict molecular properties from the QM9 dataset.
### Target Key Properties:
Focus on predicting critical quantum molecular properties, including:
- HOMO-LUMO gap
- Dipole moment
- Atomization energy
### Leverage QM9 Dataset:
- Utilize the standardized and widely accepted QM9 dataset for training and evaluation.
### Incorporate Feature Transfer:
- Augment the tabular input features with learned embeddings from neural networks such as SchNet or PhysNet.
### Bridge ML Paradigms:
- Integrate traditional machine learning with representation learning to improve model performance.
### Evaluate Model Performance:
- Benchmark models using appropriate metrics (e.g., MAE, RMSE) to assess prediction accuracy and generalization.
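
The two metrics named in the last objective can be written out in a few lines. The following is a minimal sketch in plain Python, not the project's evaluation code; the function names and the toy values are assumptions chosen for illustration and are not QM9 results.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute deviation between targets and predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean squared deviation; penalizes large errors more."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Hypothetical targets and predictions, for illustration only.
y_true = [0.0, 1.0, 2.0, 3.0]
y_pred = [0.0, 1.0, 2.0, 5.0]

mae = mean_absolute_error(y_true, y_pred)       # one error of 2 over 4 samples
rmse = root_mean_squared_error(y_true, y_pred)  # the same error, squared first
print(f"MAE = {mae:.3f}, RMSE = {rmse:.3f}")
```

RMSE is never smaller than MAE on the same predictions, so a large difference between the two suggests that a few molecules are predicted much worse than the rest.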


## Motivation and Significance:
### 1. Reduce Computational Costs:
- Density Functional Theory (DFT) calculations are resource-intensive; machine learning offers a faster alternative.
### 2. Accelerate Material Discovery:
- Predictive ML models can streamline the search for new molecules and materials.
### 3. Enable Scalable Simulations:
- Efficient algorithms allow large-scale quantum simulations that were previously limited by the cost of DFT.
### 4. Enhance Interpretability:
- Combining traditional ML with modern representations supports transparent, explainable models.
### 5. Cross-Disciplinary Impact:
- Potential applications in drug discovery, catalyst development, and electronic materials design.

## Dataset:
We will use the QM9 dataset available through TensorFlow Datasets.
- Dataset link: https://www.tensorflow.org/datasets/catalog/qm9
- The dataset includes over 130,000 small organic molecules with the following features:
  - Atom types and 3D coordinates
  - Quantum chemical properties (e.g., energy levels, dipole moments)
  - Molecular descriptors and calculated DFT outputs
### Examples of Features:
- homo, lumo, gap – frontier orbital energies and the difference between them
- mu – dipole moment
- alpha – polarizability
- U0, H, G – internal energy, enthalpy, and Gibbs free energy
- SMILES and InChI – chemical string representations


In [7]:
import tensorflow_datasets as tfds
import tensorflow as tf
import pandas as pd

# Load the dataset (QM9 ships as a single "train" split)
ds, ds_info = tfds.load("qm9", split="train", with_info=True)

In [8]:
# show the features in our dataset
print(ds_info)

tfds.core.DatasetInfo(
    name='qm9',
    full_name='qm9/original/1.0.0',
    description="""
    QM9 consists of computed geometric, energetic, electronic, and thermodynamic
    properties for 134k stable small organic molecules made up of C, H, O, N, and F.
    As usual, we remove the uncharacterized molecules and provide the remaining
    130,831.
    """,
    config_description="""
    QM9 does not define any splits. So this variant puts the full QM9 dataset in the train split, in the original order (no shuffling).
    """,
    homepage='https://doi.org/10.6084/m9.figshare.c.978904.v5',
    data_dir='C:\\Users\\tkasiror\\tensorflow_datasets\\qm9\\original\\1.0.0',
    file_format=tfrecord,
    download_size=82.62 MiB,
    dataset_size=177.16 MiB,
    features=FeaturesDict({
        'A': float32,
        'B': float32,
        'C': float32,
        'Cv': float32,
        'G': float32,
        'G_atomization': float32,
        'H': float32,
        'H_atomization': float32,
        '

In [9]:
# Convert the entire dataset to a list of dicts
data_list = [dict(example) for example in tfds.as_numpy(ds)]

# Convert list of dicts to DataFrame
df = pd.DataFrame(data_list)

print(df.head())

            A           B           C     Cv          G  G_atomization  \
0  157.711807  157.709976  157.706985  6.469 -40.498596      -0.593572   
1  293.609741  293.541107  191.393967  6.316 -56.544960      -0.413283   
2  799.588135  437.903870  282.945465  6.002 -76.422348      -0.320963   
3    0.000000   35.610035   35.610035  8.574 -77.327431      -0.582941   
4    0.000000   44.593884   44.593884  6.278 -93.431244      -0.460105   

           H  H_atomization                        InChI  \
0 -40.475117      -0.639058         b'InChI=1S/CH4/h1H4'   
1 -56.522083      -0.446845         b'InChI=1S/H3N/h1H3'   
2 -76.400925      -0.342879         b'InChI=1S/H2O/h1H2'   
3 -77.304581      -0.619937  b'InChI=1S/C2H2/c1-2/h1-2H'   
4 -93.408424      -0.484601     b'InChI=1S/CHN/c1-2/h1H'   

                 InChI_relaxed  ...     gap    homo index    lumo      mu  \
0         b'InChI=1S/CH4/h1H4'  ...  0.5048 -0.3877     1  0.1171  0.0000   
1         b'InChI=1S/H3N/h1H3'  ...  0.3
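
The string columns in the output above print as byte literals (`b'InChI=...'`). A minimal sketch of decoding them is shown below, assuming string features arrive as `bytes` objects as that output suggests; `decode_record` is a hypothetical helper, and the example record is a hand-written stand-in for one element of `data_list`.

```python
def decode_record(record, encoding="utf-8"):
    """Return a copy of `record` with every bytes value decoded to str."""
    return {
        key: value.decode(encoding) if isinstance(value, bytes) else value
        for key, value in record.items()
    }


# Stand-in for one element of `data_list` (values taken from the first row above).
example = {"InChI": b"InChI=1S/CH4/h1H4", "gap": 0.5048, "index": 1}

clean = decode_record(example)
print(clean["InChI"])  # a plain str, without the b'...' wrapper
```

Applied before building the DataFrame, this would look like `data_list = [decode_record(r) for r in data_list]`, which leaves numeric and array-valued features untouched.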

In [10]:
# a description of the dataset
df.describe()

statistic,A,B,C,Cv,G,G_atomization,H,H_atomization,U,U0,...,U_atomization,alpha,gap,homo,index,lumo,mu,num_atoms,r2,zpve
count,130831.0,130831.0,130831.0,130831.0,130831.0,130831.0,130831.0,130831.0,130831.0,130831.0,...,130831.0,130831.0,130831.0,130831.0,130831.0,130831.0,130831.0,130831.0,130831.0,130831.0
mean,9.966022,1.406729,1.1274,31.620365,-410.852844,-2.603199,-410.809998,-2.830369,-410.810944,-410.819458,...,-2.814281,75.281181,0.252045,-0.24021,66839.584976,0.011835,2.672953,18.0325,1189.410522,0.14909
std,1830.463013,1.600828,1.107471,4.067581,39.894783,0.349058,39.894066,0.385474,39.894066,39.894283,...,0.382751,8.173831,0.047192,0.021967,38457.235392,0.04685,1.503479,2.943715,280.478149,0.033138
min,0.0,0.33712,0.33118,6.002,-714.602112,-3.851932,-714.559204,-4.211903,-714.560181,-714.568054,...,-4.185451,6.31,0.0246,-0.4286,1.0,-0.175,0.0,3.0,19.0002,0.015951
25%,2.55504,1.091545,0.911495,28.955,-437.911835,-2.827818,-437.869919,-3.078822,-437.870865,-437.878799,...,-3.061095,70.480003,0.217,-0.2526,33749.5,-0.0233,1.5778,16.0,1017.431244,0.125638
50%,3.0901,1.37065,1.08203,31.577999,-416.841309,-2.608441,-416.799591,-2.835484,-416.800537,-416.808472,...,-2.819446,75.599998,0.2502,-0.2411,67093.0,0.0126,2.4753,18.0,1147.221069,0.148629
75%,3.83689,1.65505,1.28272,34.298,-387.074524,-2.379072,-387.030273,-2.581601,-387.031219,-387.040466,...,-2.567252,80.610001,0.2894,-0.2289,100063.5,0.0509,3.59635,20.0,1309.046997,0.171397
max,619867.6875,437.90387,282.945465,46.969002,-40.498596,-0.320963,-40.475117,-0.342879,-40.476063,-40.478931,...,-0.34099,196.619995,0.6221,-0.1017,133885.0,0.1935,29.5564,29.0,3374.753174,0.273944
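
Because the dataset description states that QM9 defines no splits and is stored in its original order, any evaluation needs an explicit, shuffled hold-out set. The sketch below shows one way to produce reproducible index lists in plain Python; `split_indices`, the 20% test fraction, and the seed are assumptions for illustration, not choices made elsewhere in this notebook.

```python
import random


def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle row indices with a fixed seed and split them into train/test."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # local RNG, global state untouched
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


# 130831 is the row count reported by df.describe() above.
train_idx, test_idx = split_indices(130831, test_fraction=0.2, seed=42)
print(len(train_idx), len(test_idx))
```

The index lists can then be passed to `df.iloc[train_idx]` and `df.iloc[test_idx]`. Keeping the seed fixed makes the split repeatable across runs.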


## Key Columns and Their Meanings (QM9 Dataset)

| **Symbol**    | **Meaning** |
|---------------|-------------|
| `A`           | **Rotational constant A** (GHz): corresponds to rotation around the **principal axis with the smallest moment of inertia** |
| `B`           | **Rotational constant B** (GHz): corresponds to rotation around the **intermediate moment of inertia axis** |
| `C`           | **Rotational constant C** (GHz): corresponds to rotation around the **axis with the largest moment of inertia** |
| `mu`          | **Dipole moment** (Debye): quantifies charge separation in the molecule |
| `alpha`       | **Isotropic polarizability** (Bohr³): how easily a molecule's electron cloud distorts in an electric field |
| `homo`        | **Highest Occupied Molecular Orbital energy** (Hartree): energy of the highest-energy occupied orbital |
| `lumo`        | **Lowest Unoccupied Molecular Orbital energy** (Hartree): energy of the lowest empty orbital |
| `gap`         | **HOMO-LUMO energy gap** (Hartree): `lumo` minus `homo`, important for optical/electronic properties |
| `r2`          | **Electronic spatial extent** (Bohr²): represents the size of the electron cloud |
| `zpve`        | **Zero Point Vibrational Energy** (Hartree): energy remaining when vibrational motion is at its lowest quantum state |
| `U0`          | **Internal energy at 0 K** (Hartree): includes electronic and zero-point vibrational components |
| `U`           | **Internal energy at 298.15 K** (Hartree) |
| `H`           | **Enthalpy at 298.15 K** (Hartree): total energy including pressure-volume work |
| `G`           | **Gibbs free energy at 298.15 K** (Hartree): useful for predicting spontaneity of reactions |
| `Cv`          | **Heat capacity at constant volume** (cal/(mol·K)): how much heat is needed to raise the temperature by one degree |
| `SMILES`      | **Simplified Molecular Input Line Entry System**: a compact ASCII string that encodes a molecular structure |
| `InChI`       | **IUPAC International Chemical Identifier**: a textual identifier providing a standard way to encode molecular information |
| `omega1` to `omega3N-6` | **Vibrational frequencies** (cm⁻¹): frequencies of the normal modes of vibration for each molecule (3N-5 modes for linear molecules) |
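
The raw energy values shown earlier (for example a mean `gap` of about 0.25) are on the Hartree scale, so reporting them in eV requires a unit conversion. The sketch below uses the CODATA 2018 conversion factor and the first row of the `df.head()` output to check that `gap` equals `lumo - homo`; `hartree_to_ev` is a hypothetical helper introduced only for this illustration.

```python
HARTREE_TO_EV = 27.211386245988  # CODATA 2018 value of 1 Hartree in eV


def hartree_to_ev(value):
    """Convert an energy from Hartree to electronvolts."""
    return value * HARTREE_TO_EV


# First row of the df.head() output (methane), in Hartree.
homo, lumo, gap = -0.3877, 0.1171, 0.5048

# The stored gap should match the difference of the frontier orbital energies.
gap_is_consistent = abs((lumo - homo) - gap) < 1e-6

gap_ev = hartree_to_ev(gap)
print(f"gap consistent: {gap_is_consistent}, gap = {gap_ev:.2f} eV")
```

Applied to a DataFrame, the same factor could scale whole columns at once, for instance `df[["homo", "lumo", "gap"]] * HARTREE_TO_EV`, which makes MAE and RMSE values easier to compare with results quoted in eV.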
