# Data Exploration
The goal of the competition is to predict NMR coupling constants from molecule structures and other information.

Briefly, coupling constants represent the strength of the interaction between different atomic nuclei in a molecule. The interactions occur through bonds, rather than directly through space. The number of bonds between the two atoms of interest will therefore strongly affect their coupling constant, while the spatial separation between the two atoms may have only a weak correlation with it. Couplings through more than three bonds cannot normally be detected, except in certain special cases. The coupling constant is also strongly dependent on the chemical functional group to which the atoms belong. For example, the three-bond coupling between two alkene protons (hydrogen atoms) may differ significantly from that between two alkyl protons. The geometry of the bonds has a great impact on the coupling: the strength of proton-proton three-bond couplings is strongly dependent on the dihedral angle between the two proton-carbon/heteroatom bonds, as described by the [Karplus equation](https://en.wikipedia.org/wiki/Karplus_equation).
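To make the angular dependence concrete, here is a minimal sketch of the Karplus form $J(\phi) = A\cos^2\phi + B\cos\phi + C$. The coefficients `A`, `B` and `C` are empirical and depend on the atoms and substituents involved; the values below are placeholders chosen only for illustration, not fitted parameters.

```python
import math

def karplus(phi_degrees, a, b, c):
    """Evaluate J(phi) = a*cos^2(phi) + b*cos(phi) + c for a dihedral angle in degrees."""
    phi = math.radians(phi_degrees)
    return a * math.cos(phi) ** 2 + b * math.cos(phi) + c

# Placeholder coefficients (assumed for illustration only, in Hz)
A, B, C = 8.0, -1.0, 1.5

for phi in (0, 60, 90, 180):
    print(phi, round(karplus(phi, A, B, C), 2))
```

With coefficients of this kind the coupling is largest near 0° and 180° and smallest near 90°, which is the qualitative behaviour to look for in the data.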

Additional literature:
- [Vicinal Coupling Constants and Conformation of Biomolecules, Altona](https://doi.org/10.1002/9780470034590.emrstm0587)
- [Scalar Coupling Constants—Their Analysis and Their Application for the Elucidation of Structures, Eberstadt _et al._](https://doi.org/10.1002/anie.199516711)
- <a href="https://doi.org/10.1016/0040-4020(80)80155-4">The relationship between proton-proton NMR coupling constants and substituent electronegativities, Haasnoot _et al._</a>

It may be difficult to identify the functional group involved from the raw structural data. However, the hybridisation (essentially the number of neighbouring atoms) and the identities of the other atoms in the bond may serve as proxies. The interatomic bond length can also be strongly influenced by the functional group that it belongs to:

- [Typical interatomic distances: organic compounds, Allen _et al._](https://doi.org/10.1107/97809553602060000621)

Before we begin to explore our data, therefore, some features which we may be interested in engineering are:
- Bond lengths
- Functional groups (perhaps bond lengths may serve as a proxy?)
- Number of neighbouring atoms (also may serve as a proxy for functional group)
- Angles between bonds (in the case of two-bond couplings)
- Dihedral angles (for three-bond couplings)
- Mean coupling constants (possibly for coupling types which don't vary much)
- Identities of the other atoms involved in two- or three-bond couplings
- Electronegativities of the atoms involved
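As a sketch of how the electronegativity feature might be encoded, the lookup below uses Pauling electronegativities for the elements common in small organic molecules. The helper name `electronegativity_difference` is hypothetical and is not used elsewhere in this notebook.

```python
# Pauling electronegativities for H, C, N, O and F
PAULING = {'H': 2.20, 'C': 2.55, 'N': 3.04, 'O': 3.44, 'F': 3.98}

def electronegativity_difference(atom_a, atom_b):
    """Absolute difference in Pauling electronegativity between two element symbols."""
    return abs(PAULING[atom_a] - PAULING[atom_b])

print(round(electronegativity_difference('C', 'H'), 2))
print(round(electronegativity_difference('N', 'H'), 2))
```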

Let's first import the necessary libraries, load the training and test data, and look at the information provided.

In [None]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
train.head()

We have a molecule name, allowing us to match the training data entry to a structure, and the indices of the two atoms that are coupled. Note that multiple coupling constants correspond to the same structure. Let's write a function to import the structural data and see what form it takes:

In [None]:
def read_xyz(path, filename):
    return pd.read_csv(path+filename, skiprows = 2, header = None, sep = ' ', usecols=[0, 1,2,3], names=['atom', 'x', 'y', 'z'])

path = '../input/structures/'
filename = 'dsgdb9nsd_000001.xyz'

read_xyz(path, filename)

We have what appears to be a molecule of methane, with the positions of each of the atoms represented with Cartesian coordinates. This molecule contains only carbon and hydrogen, but how many different atom types are present in our training data?

In [None]:
# This is the code, but it is quite time consuming to run so I'll just provide the answer below
"""
atom_list = []
for filename in os.listdir("../input/structures"):
    atom_list = atom_list + list(read_xyz(path, filename)['atom'])
atom_list = set(atom_list)
print(atom_list)
"""
print("{'O', 'H', 'C', 'F', 'N'}")


All the molecules are made up only of oxygen, hydrogen, carbon, fluorine, and nitrogen. 
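A lighter-weight way to collect the element symbols is to parse the text of each file directly instead of building a DataFrame per molecule. The sketch below assumes the standard XYZ layout (atom count, comment line, then one atom per line) and runs on two small hand-written examples rather than the competition files; `atoms_in_xyz` is a hypothetical helper.

```python
from collections import Counter

def atoms_in_xyz(text):
    """Return the element symbols listed in the text of one XYZ file."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])                      # first line: number of atoms
    return [line.split()[0] for line in lines[2:2 + n_atoms]]

# Hand-written examples with illustrative coordinates (not taken from the data set)
methane = """5
methane
C 0.000 0.000 0.000
H 0.629 0.629 0.629
H -0.629 -0.629 0.629
H -0.629 0.629 -0.629
H 0.629 -0.629 -0.629
"""
water = """3
water
O 0.000 0.000 0.117
H 0.000 0.757 -0.469
H 0.000 -0.757 -0.469
"""

counts = Counter()
for text in (methane, water):
    counts.update(atoms_in_xyz(text))

print(sorted(counts))   # distinct element symbols
print(counts)           # how often each one occurs
```

For the real data, the same function could be applied to the contents of each file in the structures directory.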

In [None]:
x_list = []
y_list = []
z_list = []
for filename in os.listdir("../input/structures"):
    # Read each file once and reuse it for all three coordinates
    structure = read_xyz(path, filename)
    x_list.extend(structure['x'])
    y_list.extend(structure['y'])
    z_list.extend(structure['z'])
dimfig, dimaxes = plt.subplots(3, 1, figsize = (6, 6))
sns.distplot(x_list, ax=dimaxes[0])
sns.distplot(y_list, ax=dimaxes[1])
sns.distplot(z_list, ax=dimaxes[2])
print("x max: " + str(np.max(x_list)) + " x min : " + str(np.min(x_list)))
print("y max: " + str(np.max(y_list)) + " y min : " + str(np.min(y_list)))
print("z max: " + str(np.max(z_list)) + " z min : " + str(np.min(z_list)))

Almost all of the coordinates appear to lie between -10 and 10 Å, and the great majority lie within 5 Å of the origin. Let's next look at how many distinct coupling types there are in the data.

In [None]:
coupling_types = set(train['type'])
print(coupling_types)

So we have eight types of coupling, all of which are between a proton and another proton, a carbon, or a nitrogen. We also have one-, two- and three-bond couplings. Let's look at these coupling types and the distribution of their coupling constants in turn.
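The type labels encode both the number of bonds and the pair of elements, so they can be split into separate features. A minimal sketch, assuming every label follows the four-character pattern seen here (for example `3JHC`); `parse_coupling_type` is a hypothetical helper.

```python
def parse_coupling_type(label):
    """Split a label such as '3JHC' into (number of bonds, first atom, second atom)."""
    if len(label) != 4 or label[1] != 'J':
        raise ValueError(f"unexpected coupling label: {label!r}")
    return int(label[0]), label[2], label[3]

for label in ('1JHN', '2JHH', '3JHC'):
    print(label, parse_coupling_type(label))
```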

In [None]:
coupling_types = list(coupling_types)
totals = [np.sum(train['type'] == x) for x in coupling_types]

subsets = dict()
for x in coupling_types:
    subsets[x] = train.loc[train['type'] == x]

bar_fig, bar_axis = plt.subplots()

# Pass x and y by keyword; recent seaborn versions no longer accept them positionally
sns.barplot(x=coupling_types, y=totals, ax=bar_axis)

dist_fig, dist_axes = plt.subplots(len(subsets), 1, figsize = (6, 12))

for (x, y) in zip(dist_axes, coupling_types):
    sns.distplot(subsets[y]['scalar_coupling_constant'], ax=x)
    x.set_title(y)

dist_fig.tight_layout()

The distributions of the coupling constants look quite complex, and several coupling types show more than one peak.

# Feature Engineering

We will write some functions to allow us to engineer some features and see what relationship they have with the coupling constants. The algorithm for determining the dihedral angle was based on [this Stack Exchange thread.](https://math.stackexchange.com/questions/47059/how-do-i-calculate-a-dihedral-angle-given-cartesian-coordinates)

In [None]:
def length(data, index1, index2):
    """Takes an xyz file imported by read_xyz and calculates the distance between two points"""
    return np.sqrt(np.sum(np.square(data[['x', 'y', 'z']].loc[index1]-data[['x', 'y', 'z']].loc[index2])))

def neighbours(data, index):
    """Takes an xyz file imported by read_xyz and calculates the number of neighbours within sqrt(3) Å of the indexed atom"""
    l2 = np.array([np.sum(np.square(data[['x', 'y', 'z']].loc[index]-data[['x', 'y', 'z']].loc[x])) for x in range(len(data))])
    return np.sum(l2 < 3) - 1

def nearest(data, index):
    """Takes an xyz file imported by read_xyz and finds the index of the nearest non-hydrogen atom"""
    point = data.loc[index, ['x', 'y', 'z']].astype(float)
    # Only heavy (non-hydrogen) atoms are candidates; exclude the query atom itself
    heavy = data.loc[data['atom'] != 'H', ['x', 'y', 'z']].drop(index, errors='ignore')
    squared = np.square(heavy - point).sum(axis=1)
    # idxmin returns the index label, which still refers to a row of the full structure
    return squared.idxmin()

def magnitude(vector):
    """Calculates the magnitude of a vector"""
    return np.sqrt(np.sum(np.square(vector)))
    
def dihedral(point1, point2, point3, point4):
    """Calculates the dihederal angle between two bonds"""
    b1 = point1-point2
    b2 = point2-point3
    b3 = point3-point4
    n1 = np.cross(b1, b2)
    n1 = n1/magnitude(n1)
    n2 = np.cross(b2, b3)
    n2 = n2/magnitude(n2)
    m1 = np.cross(n1, b2/magnitude(b2))
    x = np.dot(n1, n2)
    y = np.dot(m1, n2)
    # arctan2 expects the sine-like term first, then the cosine-like term
    return np.arctan2(y, x)
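One way to gain confidence in a dihedral routine is to test it on arrangements whose angle is known in advance. The sketch below is a standalone NumPy version of the recipe from the linked thread, using `arctan2(y, x)`; the name `dihedral_angle` and the test points are made up for illustration.

```python
import numpy as np

def dihedral_angle(p1, p2, p3, p4):
    """Signed dihedral angle (radians) defined by four points."""
    p1, p2, p3, p4 = (np.asarray(p, dtype=float) for p in (p1, p2, p3, p4))
    b1, b2, b3 = p1 - p2, p2 - p3, p3 - p4
    n1 = np.cross(b1, b2)
    n1 /= np.linalg.norm(n1)                 # unit normal of the first plane
    n2 = np.cross(b2, b3)
    n2 /= np.linalg.norm(n2)                 # unit normal of the second plane
    m1 = np.cross(n1, b2 / np.linalg.norm(b2))
    x = np.dot(n1, n2)                       # cosine-like component
    y = np.dot(m1, n2)                       # sine-like component
    return np.arctan2(y, x)

# Central bond along the x axis; the first substituent points along +y
central_a, central_b = [0, 0, 0], [1, 0, 0]
start = [0, 1, 0]

cis = dihedral_angle(start, central_a, central_b, [1, 1, 0])
perpendicular = dihedral_angle(start, central_a, central_b, [1, 0, 1])
trans = dihedral_angle(start, central_a, central_b, [1, -1, 0])
print(np.degrees([cis, perpendicular, trans]))
```

The sign of the result depends on the direction chosen for the bond vectors, so it is safest to compare magnitudes: about 0°, 90° and 180° for the three cases.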

The next step is to generate the features appropriate to each coupling type. We will do this for one thousand training examples for each coupling type for the purposes of visualisation:

In [None]:
def single_bond(coupling_type):    
    feature_list = []
    
    for x in range(1000):#len(subsets[coupling_type])):
        current = subsets[coupling_type].iloc[x]
        index0 = current['atom_index_0']
        index1 = current['atom_index_1']
        filename = current['molecule_name'] + '.xyz'
        data = read_xyz(path, filename)
        feature_list.append((length(data, index0, index1), neighbours(data, index1), current['scalar_coupling_constant']))
    
    return pd.DataFrame(feature_list, columns = ['length', 'hybrid', 'coupling'])

def two_bond(coupling_type):
    feature_list = []
    for x in range(1000):
        current = subsets[coupling_type].iloc[x]
        data = read_xyz(path, current['molecule_name'] + '.xyz')
        index_0 = current['atom_index_0']
        index_1 = current['atom_index_1']
        shared = nearest(data, index_0)
        length1 = length(data, index_0, shared)
        length2 = length(data, index_1, shared)
        vector1 = data[['x', 'y', 'z']].loc[index_0]-data[['x', 'y', 'z']].loc[shared]
        vector2 = data[['x', 'y', 'z']].loc[index_1]-data[['x', 'y', 'z']].loc[shared]
        cosine = np.dot(vector1, vector2)/(length1 * length2)
        shared_hybrid = neighbours(data, shared)
        carbon_hybrid = neighbours(data, index_1)
        feature_list.append((length1, length2, cosine, data['atom'].iloc[shared], shared_hybrid, carbon_hybrid, current['scalar_coupling_constant']))
    return pd.DataFrame(feature_list, columns = ['length1', 'length2', 'cosine', 'atom', 'hybrid1', 'hybrid2', 'coupling'])

def three_bond(coupling_type):
    feature_list = []
    for x in range(1000):
        current = subsets[coupling_type].iloc[x]
        data = read_xyz(path, current['molecule_name'] + '.xyz')
        index_0 = current['atom_index_0']
        index_1 = current['atom_index_1']
        shared1 = nearest(data, index_0)
        shared2 = nearest(data, index_1)
        length1 = length(data, index_0, shared1)
        length2 = length(data, index_1, shared2)
        # Despite its name, length_shared is the distance between the two coupled atoms, not the central bond
        length_shared = length(data, index_0, index_1)
        angle = dihedral(data[['x', 'y', 'z']].loc[index_0], data[['x', 'y', 'z']].loc[shared1], data[['x', 'y', 'z']].loc[shared2], data[['x', 'y', 'z']].loc[index_1])
        shared1_hybrid = neighbours(data, shared1)
        shared2_hybrid = neighbours(data, shared2)
        terminal_hybrid = neighbours(data, index_1)
        feature_list.append((length1, length2, length_shared, angle, data['atom'].iloc[shared1], data['atom'].iloc[shared2], shared1_hybrid, shared2_hybrid, terminal_hybrid, current['scalar_coupling_constant']))
    return pd.DataFrame(feature_list, columns = ['length1', 'length2', 'length_shared', 'angle', 'atom1', 'atom2', 'hybrid1', 'hybrid2', 'terminal_hybrid', 'coupling'])

function_dict = {'1': single_bond, '2': two_bond, '3': three_bond}
engineered = {x:function_dict[x[0]](x) for x in coupling_types}
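The `cosine` feature in `two_bond` is the cosine of the angle at the shared atom between the two bonds. As a quick check of that calculation, the sketch below applies the same formula to an idealised tetrahedral centre, where the expected value is -1/3 (about 109.5°). The coordinates and the name `bond_angle_cosine` are assumptions for illustration.

```python
import numpy as np

def bond_angle_cosine(centre, end_a, end_b):
    """Cosine of the angle end_a - centre - end_b."""
    centre = np.asarray(centre, dtype=float)
    v_a = np.asarray(end_a, dtype=float) - centre
    v_b = np.asarray(end_b, dtype=float) - centre
    return float(np.dot(v_a, v_b) / (np.linalg.norm(v_a) * np.linalg.norm(v_b)))

# Idealised tetrahedral geometry: two substituents on alternating cube corners
carbon = [0, 0, 0]
hydrogen_1 = [1, 1, 1]
hydrogen_2 = [1, -1, -1]

cos_hch = bond_angle_cosine(carbon, hydrogen_1, hydrogen_2)
print(cos_hch, np.degrees(np.arccos(cos_hch)))
```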

Now let's look at the relationship between some of the engineered features and the coupling constant. Firstly, the dihedral angle and the 3JHH couplings:

In [None]:
# Pass x and y by keyword; recent seaborn versions no longer accept them positionally
sns.scatterplot(x=engineered['3JHH']['angle'], y=engineered['3JHH']['coupling'], hue=engineered['3JHH']['length_shared'])

The dihedral angle and the coupling constant appear to be related by a periodic function, as one might expect from the Karplus equation. The dihedral angle also correlates with `length_shared`. Note that `three_bond` computes this feature as the distance between the two coupled atoms rather than the length of the central bond, so for a pair of protons the correlation follows largely from geometry: rotating about the central bond moves the two protons closer together or further apart. Let's look at the relationships of all the engineered features with a pairplot, paying special attention to the last row: the plots of the engineered features against the coupling constant.
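If the relationship does follow the Karplus form, its three coefficients can be estimated by ordinary least squares, because the model is linear in $A$, $B$ and $C$ once $\cos^2\phi$ and $\cos\phi$ are used as features. The sketch below demonstrates this on synthetic data generated from assumed coefficients; it does not use the competition data, and the numbers carry no chemical meaning.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed "true" coefficients used only to generate synthetic couplings
true_abc = np.array([8.0, -1.0, 1.5])

phi = rng.uniform(-np.pi, np.pi, size=200)            # synthetic dihedral angles (radians)
design = np.column_stack([np.cos(phi) ** 2, np.cos(phi), np.ones_like(phi)])
coupling = design @ true_abc + rng.normal(scale=0.1, size=phi.size)

# Least-squares estimate of A, B and C
fitted_abc, *_ = np.linalg.lstsq(design, coupling, rcond=None)
print(fitted_abc)
```

On the engineered features, `phi` would be the `angle` column and `coupling` the `coupling` column of `engineered['3JHH']`.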

In [None]:
sns.pairplot(engineered['3JHH'])

Does the same hold true for the other three-bond couplings? Next we will analyse the three-bond proton-carbon coupling:

In [None]:
sns.pairplot(engineered['3JHC'])

The relationship between dihedral angle and coupling constant is not so clear in this case. The relationship between the other engineered features and the coupling constant is also weak. The last of the three-bond couplings is proton to nitrogen:

In [None]:
sns.pairplot(engineered['3JHN'])

Here we see essentially similar patterns as in the proton-carbon couplings. Next come the two-bond couplings. Let's start with the proton-proton couplings.

In [None]:
sns.pairplot(engineered['2JHH'])

Here we can see a clear correlation between the cosine of the angle between the two bonds and the coupling constant. Unsurprisingly (as both protons are bonded to the same atom) the lengths of both bonds are strongly correlated. Next, the two-bond couplings between proton and carbon.

In [None]:
sns.pairplot(engineered['2JHC'])

Here we may be able to use the bond lengths, the hybridisation, and the cosine of the angle between bonds to predict coupling constants. The last of the two-bond couplings is proton-nitrogen:

In [None]:
sns.pairplot(engineered['2JHN'])

The correlations here appear rather weak. Let's now look at the single-bond couplings. These should be significantly simpler to analyse as we have calculated fewer engineered features. The only (non-exotic) molecule in which a single-bond proton-proton interaction is possible is molecular hydrogen, and so we only have single-bond proton-carbon and proton-nitrogen couplings. Let's start with proton-carbon.

In [None]:
sns.pairplot(engineered['1JHC'])

Both the hybridisation of the carbon atom and the length of the proton-carbon bond appear to correlate with the coupling constant - these should be good features to use to make predictions. Finally, the single-bond proton-nitrogen couplings:

In [None]:
sns.pairplot(engineered['1JHN'])

A very nice relationship between the length of the proton-nitrogen bond and the coupling constant can be observed here. The hybridisation of the nitrogen also offers some predictive power: sp$^2$-hybridised nitrogen appears to couple to protons in a very narrow range around 35 Hz.

# Conclusions

The number of features analysed here has been quite limited; information such as the identity of the functional groups to which the coupled atoms belong can be expected to yield more information. However, several of the features engineered here correlate clearly with the coupling constant and should improve predictive ability if incorporated into machine learning models.