# Predicting Molecular Properties using Machine Learning Models on the QM9 Dataset

## Research Question:
- Can we accurately predict quantum chemical properties of molecules using classical machine learning models such as Random Forest and XGBoost, and can we improve performance through feature transfer from neural network-based embeddings?

## Project Objectives:
### Develop Predictive Models:
- Use Random Forest and XGBoost algorithms to predict molecular properties from the QM9 dataset.
### Target Key Properties:
Focus on predicting critical quantum molecular properties, including:
- HOMO-LUMO gap
- Dipole moment
- Atomization energy
### Leverage QM9 Dataset:
- Utilize the standardized and widely accepted QM9 dataset for training and evaluation.
### Incorporate Feature Transfer:
- Enhance tabular input with abstract representations inspired by neural networks such as SchNet or PhysNet.
### Bridge ML Paradigms:
- Integrate traditional machine learning with representation learning to improve model performance.
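One simple reading of the feature-transfer objective is plain concatenation: each molecule's tabular descriptors are extended with an embedding vector produced by a pretrained network, and the widened table is passed to the tree-based model. The sketch below assumes the embeddings have already been computed elsewhere. The function `augment_features` and all numeric values are hypothetical placeholders for illustration, not part of the project code.

```python
from typing import List, Sequence


def augment_features(
    descriptors: Sequence[Sequence[float]],
    embeddings: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Append each molecule's learned embedding to its tabular descriptors."""
    if len(descriptors) != len(embeddings):
        raise ValueError("descriptors and embeddings must cover the same molecules")
    return [list(d) + list(e) for d, e in zip(descriptors, embeddings)]


# Hypothetical toy inputs: 2 molecules, 3 descriptors and 2 embedding dimensions each.
descriptors = [[5.0, 1.0, 16.04], [3.0, 2.0, 18.02]]
embeddings = [[0.12, -0.40], [0.33, 0.08]]

augmented = augment_features(descriptors, embeddings)
print(len(augmented), len(augmented[0]))  # 2 rows, 5 columns each
```

Keeping the embedding as extra columns means the downstream model needs no changes; it simply sees a wider feature matrix.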
### Evaluate Model Performance:
- Benchmark models using appropriate metrics (e.g., MAE, RMSE) to assess prediction accuracy and generalization.
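The two metrics named above differ in how they weight large errors: MAE averages absolute errors, while RMSE squares them first and so penalizes outliers more heavily. A minimal pure-Python sketch with made-up values shows the difference; the helper names `mae` and `rmse` are illustrative, and in practice a library implementation could be used instead.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Made-up targets and predictions: three exact hits and one error of 4.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 8.0]

print(mae(y_true, y_pred))   # 1.0
print(rmse(y_true, y_pred))  # 2.0
```

A single large miss doubles the RMSE relative to the MAE here, which is why reporting both gives a fuller picture of model error.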


## Motivation and Significance:
### 1. Reduce Computational Costs:
- Density Functional Theory (DFT) calculations are resource-intensive; machine learning offers a faster alternative.
### 2. Accelerate Material Discovery:
- Predictive ML models can streamline the search for new molecules and materials.
### 3. Enable Scalable Simulations:
- Efficient models allow large-scale screening and simulation that the cost of DFT previously made impractical.
### 4. Enhance Interpretability:
- Combining traditional ML with modern representations supports transparent, explainable models.
### 5. Cross-Disciplinary Impact:
- Potential applications in drug discovery, catalyst development, and electronic materials design.

## Dataset:
We will use the QM9 dataset available through TensorFlow Datasets.
- Dataset link: https://www.tensorflow.org/datasets/catalog/qm9
- The dataset includes over 130,000 small organic molecules with the following features:
  - Atom types and 3D coordinates
  - Quantum chemical properties (e.g., energy levels, dipole moments)
  - Molecular descriptors and calculated DFT outputs
### Examples of Features:
- homo, lumo, gap – HOMO energy, LUMO energy, and the gap between them
- mu – dipole moment
- alpha – isotropic polarizability
- U0, H, G – internal energy at 0 K, and enthalpy and Gibbs free energy at 298.15 K
- SMILES and InChI – chemical string representations
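The `gap` feature is the difference between the `lumo` and `homo` energies, which gives a quick consistency check on loaded records. The sketch below assumes energies are stored in Hartree, as in the original QM9 release; that unit should be verified against the dataset documentation before relying on it. The orbital energies used are made-up illustrative values, and `hartree_to_ev` is a hypothetical helper.

```python
import math

HARTREE_TO_EV = 27.211386  # 1 Hartree expressed in electronvolts


def hartree_to_ev(energy_ha: float) -> float:
    """Convert an energy from Hartree to electronvolts."""
    return energy_ha * HARTREE_TO_EV


# Made-up frontier orbital energies in Hartree (assumed unit).
homo = -0.25
lumo = 0.05

gap_ha = lumo - homo            # HOMO-LUMO gap in Hartree
gap_ev = hartree_to_ev(gap_ha)  # same gap in eV

print(round(gap_ev, 2))  # 8.16
```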


In [1]:
!pip install tensorflow tensorflow-datasets

import tensorflow as tf
import tensorflow_datasets as tfds

# Load QM9. The dataset defines no splits, so every molecule is in "train".
ds, ds_info = tfds.load("qm9", split="train", with_info=True)





Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to C:\Users\tkasiror\tensorflow_datasets\qm9\original\1.0.0...



Dataset qm9 downloaded and prepared to C:\Users\tkasiror\tensorflow_datasets\qm9\original\1.0.0. Subsequent calls will reuse this data.
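Tree-based models such as Random Forest and XGBoost expect a fixed-width table, so the per-molecule records need to be flattened into rows of scalars. The sketch below assumes each record is a dictionary keyed by feature name, which is the shape `tfds.as_numpy(ds)` yields. The helper `to_rows`, the `TARGETS` selection, and the toy records are hypothetical stand-ins so the example runs without downloading the dataset.

```python
from typing import Any, Dict, Iterable, List, Sequence

# Hypothetical choice of scalar target columns.
TARGETS = ("gap", "mu", "H_atomization")


def to_rows(
    records: Iterable[Dict[str, Any]],
    keys: Sequence[str] = TARGETS,
) -> List[List[float]]:
    """Pull the selected scalar features out of each record, in a fixed column order."""
    return [[float(record[key]) for key in keys] for record in records]


# Toy records with made-up values, standing in for real dataset examples.
toy_records = [
    {"gap": 0.30, "mu": 0.0, "H_atomization": -0.63, "num_atoms": 5},
    {"gap": 0.25, "mu": 1.85, "H_atomization": -0.35, "num_atoms": 3},
]

rows = to_rows(toy_records)
print(rows)
```

With the real dataset, the same helper could be called as `to_rows(tfds.as_numpy(ds))`, assuming the selected keys are scalar features.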


In [4]:
# show the features in our dataset
print(ds_info)

tfds.core.DatasetInfo(
    name='qm9',
    full_name='qm9/original/1.0.0',
    description="""
    QM9 consists of computed geometric, energetic, electronic, and thermodynamic
    properties for 134k stable small organic molecules made up of C, H, O, N, and F.
    As usual, we remove the uncharacterized molecules and provide the remaining
    130,831.
    """,
    config_description="""
    QM9 does not define any splits. So this variant puts the full QM9 dataset in the train split, in the original order (no shuffling).
    """,
    homepage='https://doi.org/10.6084/m9.figshare.c.978904.v5',
    data_dir='C:\\Users\\tkasiror\\tensorflow_datasets\\qm9\\original\\1.0.0',
    file_format=tfrecord,
    download_size=82.62 MiB,
    dataset_size=177.16 MiB,
    features=FeaturesDict({
        'A': float32,
        'B': float32,
        'C': float32,
        'Cv': float32,
        'G': float32,
        'G_atomization': float32,
        'H': float32,
        'H_atomization': float32,
        '