# QM9 Modeling

QM9 is a dataset of about 134,000 molecules built from the elements C, H, O, N, and F, each with up to 9 heavy (non-hydrogen) atoms {cite}`ramakrishnan2014quantum`. The features are the xyz coordinates ($\mathbf{X}$) and elements ($\vec{e}$) of each molecule. The coordinates come from a DFT geometry optimization at the B3LYP/6-31G(2df,p) level. There are multiple labels (see the table below), but we'll be interested specifically in the energy of formation (enthalpy at 298.15 K). The goal of this chapter is to see how we can model this data.


QM9 has been one of the most popular datasets for machine learning and deep learning since it came out in 2014. The first papers achieved an error of about 10 kcal/mol on this regression problem, and current models are down to ~1 kcal/mol and lower. Any model on this dataset must be translation, rotation, and permutation invariant.
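
One way to state these requirements more formally is sketched below. The symbol $f$ is a hypothetical name for the model, and $\mathbf{X}$ is taken here to be only the $N\times3$ matrix of coordinates; both are assumptions made for this illustration.

$$
f(\mathbf{X}\mathbf{R} + \mathbf{1}\vec{t}^{\,T},\, \vec{e}) = f(\mathbf{X}, \vec{e}), \qquad
f(\mathbf{P}\mathbf{X},\, \mathbf{P}\vec{e}) = f(\mathbf{X}, \vec{e})
$$

Here $\mathbf{R}$ is any $3\times3$ rotation matrix, $\vec{t}$ is a translation added to every atom ($\mathbf{1}$ is a column of $N$ ones), and $\mathbf{P}$ is any $N\times N$ permutation matrix that reorders the atoms.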

## Label Description

|Index | Name  | Units       | Description|
|:-----|-------|-------------|-----------:|
|0     | index | -           | Consecutive, 1-based integer identifier of molecule|
|1     | A     | GHz         | Rotational constant A|
|2     | B     | GHz         | Rotational constant B|
|3     | C     | GHz         | Rotational constant C|
|4     | mu    | Debye       | Dipole moment|
|5     | alpha | Bohr^3      | Isotropic polarizability|
|6     | homo  | Hartree     | Energy of highest occupied molecular orbital (HOMO)|
|7     | lumo  | Hartree     | Energy of lowest unoccupied molecular orbital (LUMO)|
|8     | gap   | Hartree     | Gap, difference between LUMO and HOMO|
|9     | r2    | Bohr^2      | Electronic spatial extent|
|10    | zpve  | Hartree     | Zero point vibrational energy|
|11    | U0    | Hartree     | Internal energy at 0 K|
|12    | U     | Hartree     | Internal energy at 298.15 K|
|13    | H     | Hartree     | Enthalpy at 298.15 K|
|14    | G     | Hartree     | Free energy at 298.15 K|
|15    | Cv    | cal/(mol K) | Heat capacity at 298.15 K|


## Data

I have written some helper code in the `fetch_qm9.py` file. It downloads the data and converts it into a format that is easy to use in Python. The data returned from this function is broken into the features $\mathbf{X}$ and $\vec{e}$, plus a label vector. $\mathbf{X}$ is an $N\times4$ matrix whose first three columns are the atom positions and whose fourth column is the partial charge of each atom. $\vec{e}$ is a vector of atomic numbers, one for each atom in the molecule. Remember to slice the specific label you want from the label vector.

## Running This Notebook


Click the &nbsp;<i aria-label="Launch interactive content" class="fas fa-rocket"></i>&nbsp; above to launch this page as an interactive Google Colab. See the details below on installing packages, either in your own environment or on Google Colab.

````{tip} My title
:class: dropdown
To install packages, execute this code in a new cell:

```
!pip install matplotlib numpy pandas seaborn tensorflow jax jaxlib
```

````

In [1]:
import tensorflow as tf
import numpy as np
from fetch_qm9 import fetch_qm9, get_qm9

Let's load the data. This step will take a few minutes as it is downloaded and processed. 

In [2]:
qm9_records = fetch_qm9()
data = get_qm9(qm9_records)

Found existing record file, delete if you want to re-fetch


`data` is an iterable containing the data for the roughly 134k molecules. Let's examine the first one.

In [3]:
for d in data:
    print(d)
    break

((<tf.Tensor: shape=(5,), dtype=int64, numpy=array([6, 1, 1, 1, 1])>, <tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[-1.2698136e-02,  1.0858041e+00,  8.0009960e-03, -5.3568900e-01],
       [ 2.1504159e-03, -6.0313176e-03,  1.9761203e-03,  1.3392100e-01],
       [ 1.0117308e+00,  1.4637512e+00,  2.7657481e-04,  1.3392200e-01],
       [-5.4081506e-01,  1.4475266e+00, -8.7664372e-01,  1.3392299e-01],
       [-5.2381361e-01,  1.4379326e+00,  9.0639728e-01,  1.3392299e-01]],
      dtype=float32)>), <tf.Tensor: shape=(16,), dtype=float32, numpy=
array([ 1.0000000e+00,  1.5771181e+02,  1.5770998e+02,  1.5770699e+02,
        0.0000000e+00,  1.3210000e+01, -3.8769999e-01,  1.1710000e-01,
        5.0480002e-01,  3.5364101e+01,  4.4748999e-02, -4.0478931e+01,
       -4.0476063e+01, -4.0475117e+01, -4.0498596e+01,  6.4689999e+00],
      dtype=float32)>)


These are TensorFlow tensors. They can be converted to numpy arrays via `x.numpy()`. The first item is the element vector `6,1,1,1,1`. Do you recognize the elements? It's C, H, H, H, H, which is methane. The positions come next. Note that there is an extra column containing the atom partial charges, which we will not use as a feature. Finally, the last tensor is the label vector.
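
As a small sketch of slicing one label out of the label vector, the snippet below looks up a label by name and converts it from Hartree to kcal/mol, the unit in which errors on this dataset are usually quoted. The names `LABEL_NAMES`, `HARTREE_TO_KCAL_PER_MOL`, and `get_label` are hypothetical helpers introduced here for illustration and are not part of `fetch_qm9.py`. The label values are copied from the record printed above.

```python
import numpy as np

# Label names in the order of the label description table (assumed helper list)
LABEL_NAMES = [
    "index", "A", "B", "C", "mu", "alpha", "homo", "lumo",
    "gap", "r2", "zpve", "U0", "U", "H", "G", "Cv",
]

# Standard conversion factor: 1 Hartree is about 627.5095 kcal/mol
HARTREE_TO_KCAL_PER_MOL = 627.5095


def get_label(y, name):
    """Return the entry of label vector y that corresponds to name."""
    return y[LABEL_NAMES.index(name)]


# Label vector of the first molecule, copied from the printed record
y = np.array(
    [1.0000000e+00, 1.5771181e+02, 1.5770998e+02, 1.5770699e+02,
     0.0000000e+00, 1.3210000e+01, -3.8769999e-01, 1.1710000e-01,
     5.0480002e-01, 3.5364101e+01, 4.4748999e-02, -4.0478931e+01,
     -4.0476063e+01, -4.0475117e+01, -4.0498596e+01, 6.4689999e+00],
    dtype=np.float32,
)

h_hartree = float(get_label(y, "H"))  # enthalpy at 298.15 K, index 13
h_kcal = h_hartree * HARTREE_TO_KCAL_PER_MOL
print("H (Hartree):", h_hartree)
print("H (kcal/mol):", h_kcal)
```

Looking labels up by name rather than by a bare index makes it harder to pick the wrong column by accident.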

Now we will do some processing of the data to get into a more usable format. Let's convert to numpy arrays, remove the partial charges, and convert the elements into one-hot vectors.

In [8]:
def convert_record(d):
    # break up record into elements, positions (+ charges), and labels
    (e, x), y = d
    # convert TensorFlow tensors to numpy arrays
    e = e.numpy()
    x = x.numpy()
    # keep only the xyz coordinates, dropping the partial charge column
    r = x[:, :3]
    # one-hot encode atomic numbers 1 through 9 (H through F)
    ohc = np.zeros((len(e), 9))
    ohc[np.arange(len(e)), e - 1] = 1
    # label 13 is the enthalpy at 298.15 K
    return (ohc, r), y.numpy()[13]

for d in data:
    (e, x), y = convert_record(d)
    print('Element one hots\n', e)
    print('Coordinates\n', x)
    print('Label:', y)
    break

Element one hots
 [[0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0.]]
Coordinates
 [[-1.2698136e-02  1.0858041e+00  8.0009960e-03]
 [ 2.1504159e-03 -6.0313176e-03  1.9761203e-03]
 [ 1.0117308e+00  1.4637512e+00  2.7657481e-04]
 [-5.4081506e-01  1.4475266e+00 -8.7664372e-01]
 [-5.2381361e-01  1.4379326e+00  9.0639728e-01]]
Label: -40.475117
Label: -40.475117


We can now work with this data to build a model. Let's build a simple model that predicts the energy and obeys the invariances the problem requires. We will use a graph neural network (GNN) because it obeys permutation invariance. We will create a *graph* from the coordinates and element vector by joining every atom to every other atom and using their inverse pairwise distance as the edge weight. Pairwise distances do not change when the molecule is translated or rotated, so this choice gives us translation and rotation invariance. The choice of inverse distance means that atoms that are far apart naturally have low edge weights.
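
The graph construction described above can be sketched in a few lines of numpy. This is a minimal illustration and not the chapter's final model: the function name `build_graph`, the `eps` threshold, and the choice of zero weight on the diagonal (no self-edges) are assumptions made here. The coordinates are the methane positions from the first record, rounded to four decimals.

```python
import numpy as np


def build_graph(r, eps=1e-8):
    """Return an (N, N) matrix of inverse pairwise distances.

    r is an (N, 3) array of atom coordinates. The diagonal is set to
    zero so that atoms have no self-edges (an assumption of this sketch).
    """
    # pairwise displacement vectors via broadcasting, shape (N, N, 3)
    diff = r[:, None, :] - r[None, :, :]
    # pairwise distances, shape (N, N)
    dist = np.linalg.norm(diff, axis=-1)
    # invert only the nonzero distances to avoid dividing by zero
    w = np.zeros_like(dist)
    mask = dist > eps
    w[mask] = 1.0 / dist[mask]
    return w


# methane coordinates (carbon first), rounded from the first record
r = np.array([
    [-0.0127, 1.0858, 0.0080],
    [0.0022, -0.0060, 0.0020],
    [1.0117, 1.4638, 0.0003],
    [-0.5408, 1.4475, -0.8766],
    [-0.5238, 1.4379, 0.9064],
])
w = build_graph(r)

# rotate about the z axis and translate, then rebuild the graph
theta = 0.7
rot = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta), np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
])
r_moved = r @ rot.T + np.array([1.0, -2.0, 0.5])
w_moved = build_graph(r_moved)

# edge weights are unchanged by the rigid motion
print(np.allclose(w, w_moved))
```

Reordering the atoms permutes the rows and columns of the edge-weight matrix in the same way, which is the part of the problem the GNN itself is meant to handle.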


**In progress...**

## Cited References

```{bibliography} references.bib
```