## Coulomb Matrix Representation

In [1]:
import numpy as np
from math import sqrt

## Useful Resources
 - [Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning](https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.108.058301)
 - [Prediction Errors of Molecular Machine Learning Models Lower than Hybrid DFT Error](https://pubs.acs.org/doi/abs/10.1021/acs.jctc.7b00577)
 - [Understanding molecular representations in machine learning: The role of uniqueness and target similarity](https://aip.scitation.org/doi/10.1063/1.4964627)

## Introduction
Machine learning models need their data represented in a form from which they can infer patterns and make predictions. In chemistry, the data are molecules, and the targets the model learns are property values for those molecules. The goal is to represent each molecule with enough detail about its underlying physics to predict its properties accurately. This has led to a lot of work on how best to represent molecules for a model to learn from. One of the simplest representations, and the one we will work on today, is the Coulomb matrix.

## General Theory
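The Coulomb matrix is one of the simpler representations used to describe a molecule. It is a square matrix with one row and one column per atom. The diagonal elements approximate the potential energy of the free atom (a polynomial fit of atomic energies to the nuclear charge), and the off-diagonal elements are the Coulomb repulsion between the nuclei of atoms $I$ and $J$. Here $Z_I$ is the nuclear charge of atom $I$ and $R_I$ is its position.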

$$M_{IJ} =\begin{cases}0.5Z_{I}^{2.4} &\text{for } I = J, \\ \frac{Z_I Z_J}{\left | R_I - R_J \right |} &\text{for } I \neq J.\end{cases} $$
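As a quick check of the formula, the two kinds of matrix element can be evaluated by hand for carbon and hydrogen. This is a minimal sketch: the C–H distance of 1.089 and the two positions are illustrative assumptions, not values read from a file.

```python
from math import sqrt

# Nuclear charges of carbon and hydrogen
z_c, z_h = 6, 1

# Diagonal elements: 0.5 * Z^2.4
m_cc = 0.5 * z_c ** 2.4
m_hh = 0.5 * z_h ** 2.4

# Off-diagonal element: Z_I * Z_J / |R_I - R_J|
# Positions are assumed for illustration (same length unit as the coordinates)
r_c = (0.0, 0.0, 0.0)
r_h = (1.089, 0.0, 0.0)
r_ch = sqrt(sum((a - b) ** 2 for a, b in zip(r_c, r_h)))
m_ch = z_c * z_h / r_ch

print(m_cc, m_hh, m_ch)
```

The carbon diagonal comes out near 36.86 and the hydrogen diagonal is exactly 0.5, which are the values to expect on the diagonal for any molecule containing these elements.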
 
## Setup
1. Parse file for atoms and coordinates
2. Build Coulomb Matrix

In [2]:
# read the XYZ file into a list of lines; `with` closes the file afterwards
with open('methane.xyz', 'r') as file:
    doc = file.readlines()

In [3]:
# read number of atoms
natoms = int(doc[0].split()[0])

# parse coordinates
coords = []
for i in range(natoms):
    a_coords = doc[i + 2].split()[0:4]
    coords.append(a_coords)

coords

[['C', '1.041682', '-0.056200', '-0.071481'],
 ['H', '2.130894', '-0.056202', '-0.071496'],
 ['H', '0.678598', '0.174941', '-1.072044'],
 ['H', '0.678613', '0.694746', '0.628980'],
 ['H', '0.678614', '-1.038285', '0.228641']]
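The parsed entries are still strings, so every distance calculation has to convert them with `float`. One option, sketched here, is to split them once into a list of element symbols and a numeric array. The helper name `split_coords` is hypothetical, and the rows are copied from the output above so the cell runs without the file.

```python
import numpy as np

# Parsed rows in the same [symbol, x, y, z] string form as `coords`
coords = [['C', '1.041682', '-0.056200', '-0.071481'],
          ['H', '2.130894', '-0.056202', '-0.071496'],
          ['H', '0.678598', '0.174941', '-1.072044'],
          ['H', '0.678613', '0.694746', '0.628980'],
          ['H', '0.678614', '-1.038285', '0.228641']]


def split_coords(rows):
    """Hypothetical helper: return (symbols, positions) from parsed rows."""
    symbols = [row[0] for row in rows]
    # one row per atom, columns are x, y, z as floats
    positions = np.array([[float(v) for v in row[1:4]] for row in rows])
    return symbols, positions


symbols, positions = split_coords(coords)
print(symbols)
print(positions.shape)
```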

## What do we need for CM?
1. Nuclear charges
2. Calculate when $I = J$
3. Calculate when $I \neq J$
4. Output lower triangle of matrix

In [4]:
# nuclear charges
nuc = {'C': 6, 'H': 1}

# build CM matrix (natoms x natoms; only the lower triangle is filled)
mat = np.zeros((natoms, natoms))
for i in range(natoms):
    for j in range(i, natoms):
        zi = nuc[coords[i][0]]  # nuc['C'] = 6
        zj = nuc[coords[j][0]]
        if i == j:
            mii = 0.5 * zi ** 2.4
            mat[i, i] = mii

        else:
            # mij = zizj/rij
            # rij = sqrt((xi - xj)^2 + (yi - yj)^2 + (zi - zj)^2)
            x = float(coords[i][1]) - float(coords[j][1])
            y = float(coords[i][2]) - float(coords[j][2])
            z = float(coords[i][3]) - float(coords[j][3])
            rij = sqrt((x ** 2) + (y ** 2) + (z ** 2))
            mij = (zi * zj) / rij

            mat[j, i] = mij

print(mat)

[[36.8581052   0.          0.          0.          0.        ]
 [ 5.5085695   0.5         0.          0.          0.        ]
 [ 5.50857022  0.56221605  0.5         0.          0.        ]
 [ 5.50856526  0.56221405  0.56221669  0.5         0.        ]
 [ 5.50857007  0.56221611  0.56221501  0.56221777  0.5       ]]
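The same matrix can also be built without explicit loops by using NumPy broadcasting. This is a sketch of an alternative, not the approach used in this notebook; the function name `coulomb_matrix_vectorized` is hypothetical, and the methane data are repeated inline so the cell is self-contained.

```python
import numpy as np

nuc = {'C': 6, 'H': 1}
symbols = ['C', 'H', 'H', 'H', 'H']
positions = np.array([[1.041682, -0.056200, -0.071481],
                      [2.130894, -0.056202, -0.071496],
                      [0.678598,  0.174941, -1.072044],
                      [0.678613,  0.694746,  0.628980],
                      [0.678614, -1.038285,  0.228641]])


def coulomb_matrix_vectorized(symbols, positions):
    """Hypothetical helper: lower-triangular Coulomb matrix via broadcasting."""
    z = np.array([nuc[s] for s in symbols], dtype=float)
    # pairwise difference vectors, shape (natoms, natoms, 3)
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # placeholder on the diagonal to avoid dividing by zero; overwritten below
    np.fill_diagonal(dist, 1.0)
    mat = np.outer(z, z) / dist
    np.fill_diagonal(mat, 0.5 * z ** 2.4)
    # zero the upper triangle to match the loop-based result
    return np.tril(mat)


mat_vec = coulomb_matrix_vectorized(symbols, positions)
print(mat_vec)
```

For a five-atom molecule the loops are perfectly adequate; the broadcast form mainly pays off for larger molecules or many molecules.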


In [5]:
# return the lower triangle of the CM as a vector
mat = mat[np.tril_indices(natoms)]
mat

array([36.8581052 ,  5.5085695 ,  0.5       ,  5.50857022,  0.56221605,
        0.5       ,  5.50856526,  0.56221405,  0.56221669,  0.5       ,
        5.50857007,  0.56221611,  0.56221501,  0.56221777,  0.5       ])
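The order of the entries in this vector follows from how `np.tril_indices` walks the matrix: row by row, from the first column up to the diagonal. A small 3x3 example with made-up values makes the ordering easy to see.

```python
import numpy as np

# Toy lower-triangular matrix; the values just label the positions
demo = np.array([[1, 0, 0],
                 [2, 3, 0],
                 [4, 5, 6]])

rows, cols = np.tril_indices(3)
print(list(zip(rows.tolist(), cols.tolist())))
# [(0, 0), (1, 0), (1, 1), (2, 0), (2, 1), (2, 2)]
print(demo[rows, cols])
# [1 2 3 4 5 6]
```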

## What if we wanted this as a function so we could do multiple molecules?

In [6]:
def cm(natoms, coords, size):
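    '''
    Build the Coulomb matrix of one molecule as a lower-triangle vector.

    Parameters
    ----------
    natoms: int
        number of atoms in the molecule
    coords: list of lists
        one [symbol, x, y, z] entry per atom, as parsed above
    size: int
        size of the zero-padded CM matrix (must be >= natoms)

    Returns
    -------
    mat: numpy.ndarray
        lower triangle of the CM, flattened to a vector
    '''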
    # build CM matrix
    mat = np.zeros((size, size))
    for i in range(natoms):
        for j in range(i, natoms):
            zi = nuc[coords[i][0]]  # nuc['C'] = 6
            zj = nuc[coords[j][0]]
            if i == j:
                mii = 0.5 * zi ** 2.4
                mat[i, i] = mii

            else:
                # mij = zizj/rij
                # rij = sqrt((xi - xj)^2 + (yi - yj)^2 + (zi - zj)^2)
                x = float(coords[i][1]) - float(coords[j][1])
                y = float(coords[i][2]) - float(coords[j][2])
                z = float(coords[i][3]) - float(coords[j][3])
                rij = sqrt((x ** 2) + (y ** 2) + (z ** 2))
                mij = (zi * zj) / rij

                mat[j, i] = mij

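    # return the lower triangle of the CM as a vector
    # note: indexing with natoms keeps only the natoms x natoms block, so the
    # zero padding from `size` is dropped; np.tril_indices(size) would keep it
    mat = mat[np.tril_indices(natoms)]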

    return mat

In [7]:
cm(natoms, coords, size=29)

array([36.8581052 ,  5.5085695 ,  0.5       ,  5.50857022,  0.56221605,
        0.5       ,  5.50856526,  0.56221405,  0.56221669,  0.5       ,
        5.50857007,  0.56221611,  0.56221501,  0.56221777,  0.5       ])
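The output above has 15 entries regardless of `size`, because the final indexing uses `natoms`. If the aim is a vector of the same length for every molecule in a dataset, one possible variant indexes with `size` instead. The sketch below assumes that goal; the name `cm_padded` is hypothetical, and the methane data are repeated inline so the cell runs on its own.

```python
import numpy as np
from math import sqrt

nuc = {'C': 6, 'H': 1}
coords = [['C', '1.041682', '-0.056200', '-0.071481'],
          ['H', '2.130894', '-0.056202', '-0.071496'],
          ['H', '0.678598', '0.174941', '-1.072044'],
          ['H', '0.678613', '0.694746', '0.628980'],
          ['H', '0.678614', '-1.038285', '0.228641']]


def cm_padded(coords, size):
    """Hypothetical variant of cm() that keeps the zero padding."""
    natoms = len(coords)
    if natoms > size:
        raise ValueError('size must be at least the number of atoms')
    mat = np.zeros((size, size))
    for i in range(natoms):
        zi = nuc[coords[i][0]]
        ri = [float(v) for v in coords[i][1:4]]
        mat[i, i] = 0.5 * zi ** 2.4
        # fill the lower triangle only (j < i)
        for j in range(i):
            zj = nuc[coords[j][0]]
            rj = [float(v) for v in coords[j][1:4]]
            rij = sqrt(sum((a - b) ** 2 for a, b in zip(ri, rj)))
            mat[i, j] = zi * zj / rij
    # index with size, not natoms, so the padding survives
    return mat[np.tril_indices(size)]


vec = cm_padded(coords, size=29)
print(vec.shape)
```

With `size=29` the vector has 29 * 30 / 2 = 435 entries. Because `np.tril_indices` goes row by row, the first 15 entries match the unpadded vector and the remaining entries are zero.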

## If this interests you, feel free to help out with [chemreps](https://github.com/dlf57/chemreps)!