# Predicting Molecular Properties - Theory and EDA

# Table of Contents
1. [Background](#Background)

2. [Loading Modules and Data](#Loading-Modules-and-Data)

3. [Exploratory Data Analysis](#Exploratory-Data-Analysis)

### Background

In this competition, we will develop an algorithm that can predict the magnetic interaction between two atoms in a molecule (i.e., the scalar coupling constant). 

A qualitative understanding of the scalar coupling constant and the mechanics behind it may help guide feature engineering, so that the features succinctly capture the interactions between the atoms.

The snippets below, drawn from several sources, lay out the qualitative characteristics of scalar coupling based on the theoretical background:

> - Scalar couplings arise from spin-spin interactions that occur via bonding electrons. 
    - More specifically, it arises from the interaction of the nuclear magnetic moment with the electrons involved in the chemical bond. The nuclear spin polarization of one atom affects the polarization of the surrounding electrons. The electron polarization subsequently produces a change in the magnetic field that is sensed by the coupled spin.
    - Consequently, they provide information on the chemical connectivity between atoms. 
    - The sizes of three-bond scalar couplings are sensitive to the electron distribution of the intervening bonds.
    - The nomenclature that is used to describe the coupling is as follows: "nJAB" where n refers to the number of intervening bonds, and A and B identify the two coupled spins. 
[(source)](https://www.springer.com/gp/book/9781402034992)

> As the coupling is mediated by bonding electrons, it provides important information about the constitution of molecules in terms of the connectivity of the coupled nuclei. The size of the coupling constant depends not only on the number of bonds that separate the coupled nuclei, but also on the configuration of the electrons and their spatial arrangement. [(source)](https://onlinelibrary.wiley.com/doi/abs/10.1002/anie.199516711)

> The most important contribution to the scalar coupling is the Fermi contact term.  This term relies on the probability of finding an electron at the site of the two coupled nuclei.  With this term, it is therefore expected that s electrons will play a very significant role since these are the only electrons that do not have nodes at the nuclear sites. [(source)](https://bouman.chem.georgetown.edu/nmr/scalar/scalar.htm)

> More on Fermi Contact: The scalar interaction arises between two different
nuclear spins, I1 and I2, and is mediated by the
electrons surrounding these two spins. Through the
Fermi contact, the electrons are polarised in the
opposite direction to the nucleus they are interacting
with. This polarisation in turn has an effect on the
other electrons in close proximity, which in turn
affects the neighbouring nuclei. [(source)](https://groups.chem.ubc.ca/straus/l3.pdf)

> The Karplus equation, named after Martin Karplus, describes the correlation between $^3J$-coupling constants and dihedral torsion angles: 
\begin{equation*} J(\phi) = C\cos(2\phi) + B\cos(\phi) + A\end{equation*}
where $J$ is the $^3J$ coupling constant, $\phi$ is the dihedral angle, and $A$, $B$, and $C$ are empirically derived parameters whose values depend on the atoms and substituents involved. [(source)](https://en.wikipedia.org/wiki/Karplus_equation)

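As a minimal sketch of how the Karplus relation could become a feature, the function below evaluates $J(\phi) = C\cos(2\phi) + B\cos(\phi) + A$ for a dihedral angle in degrees. The function name `karplus` and the coefficient values are placeholders chosen for illustration only; they are not fitted or literature values.

```python
import numpy as np


def karplus(phi_deg, A, B, C):
    """Evaluate J(phi) = C*cos(2*phi) + B*cos(phi) + A for angles given in degrees."""
    phi = np.radians(phi_deg)
    return C * np.cos(2 * phi) + B * np.cos(phi) + A


# Placeholder coefficients (assumed for illustration, not fitted to any data).
A, B, C = 4.5, -1.0, 4.0

# The curve is largest near 0 and 180 degrees and smallest near 90 degrees.
angles = np.array([0.0, 90.0, 180.0])
print(karplus(angles, A, B, C))
```

In practice the two cosine terms, `cos(phi)` and `cos(2*phi)`, could be supplied to a model as separate features so that it learns the coefficients per coupling type.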
> Atom-centered Symmetry Functions:  
In chemistry, the most common way of representing molecules is to use Cartesian coordinates. The problem of Cartesian coordinates is that if one rotates or translates a molecule, the coordinates change.  However, properties like the total energy of the molecule, the dipole moment or the atomisation energy remain unchanged.  Therefore, if you want to learn the properties of a molecule with a neural network, you want to represent the molecular structure in a way that doesn’t change when you rotate or translate the molecule.  
By doing this, you make the training process more efficient because the neural network doesn’t need to learn that many different Cartesian coordinates represent the same structure.  This is where Atom Centred Symmetry Functions (ACSF) come in handy. They are a way of representing a molecular structure which remains the same when the molecule is rotated or translated. [(source)](https://samabilino.wordpress.com/2018/07/07/atom-centred-symmetry-functions/)

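To make the invariance argument concrete, here is a small sketch of a radial atom-centred symmetry function in the commonly used Behler-Parrinello form, $G_i = \sum_{j \neq i} e^{-\eta (r_{ij} - r_s)^2} f_c(r_{ij})$ with a cosine cutoff $f_c$. The parameter values (`eta`, `r_s`, `r_c`) and the methane-like coordinates are assumptions made for illustration. The value is unchanged when the molecule is rotated and translated.

```python
import numpy as np


def cutoff(r, r_c):
    """Cosine cutoff: decays smoothly from 1 at r=0 to 0 at r=r_c, and is 0 beyond r_c."""
    return np.where(r <= r_c, 0.5 * (np.cos(np.pi * r / r_c) + 1.0), 0.0)


def radial_symmetry_function(coords, i, eta=1.0, r_s=0.0, r_c=6.0):
    """Radial symmetry function of atom i; it depends only on interatomic distances."""
    r_ij = np.linalg.norm(coords - coords[i], axis=1)
    r_ij = np.delete(r_ij, i)  # exclude the atom itself
    return float(np.sum(np.exp(-eta * (r_ij - r_s) ** 2) * cutoff(r_ij, r_c)))


# Methane-like geometry (illustrative coordinates in angstrom).
coords = np.array([
    [0.0, 0.0, 0.0],
    [0.629, 0.629, 0.629],
    [-0.629, -0.629, 0.629],
    [-0.629, 0.629, -0.629],
    [0.629, -0.629, -0.629],
])

# Rotate by 40 degrees about the z axis, then translate.
theta = np.radians(40.0)
rotation = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta), np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
])
moved = coords @ rotation.T + np.array([1.0, -2.0, 0.5])

g_before = radial_symmetry_function(coords, 0)
g_after = radial_symmetry_function(moved, 0)
print(g_before, g_after)  # the two values agree up to floating-point error
```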


### Loading Modules and Data

Load data analysis modules we may require:

In [None]:


# Load data analysis/plotting modules
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
import matplotlib as mpl
import pandas as pd
import numpy as np
import plotly
import sympy
import scipy
import os
import os






List the CSV files for the competition:

In [None]:
os.listdir('../input/champs-scalar-coupling/')

Load all the data:

In [None]:
PATH = "../input/champs-scalar-coupling/"
train = pd.read_csv(PATH + "train.csv")
test = pd.read_csv(PATH + "test.csv")
struct = pd.read_csv(PATH + "structures.csv")
dpm = pd.read_csv(PATH + "dipole_moments.csv")
mst = pd.read_csv(PATH + "magnetic_shielding_tensors.csv")
mlk = pd.read_csv(PATH + "mulliken_charges.csv")
pe = pd.read_csv(PATH + "potential_energy.csv")
scc = pd.read_csv(PATH + "scalar_coupling_contributions.csv")
smpsub = pd.read_csv(PATH + "sample_submission.csv")


### Exploratory Data Analysis

In [None]:
# Preview the train, test, and structures dataframes
display(train.head(), train.shape)

display(test.head(), test.shape)

display(struct.head(), struct.shape)


In [None]:
print(train.molecule_name.nunique(), "molecules in train dataset.")
print(test.molecule_name.nunique(), "molecules in test dataset.")
print(struct.molecule_name.nunique(), "molecules in structures dataset.")
# structures contains the molecules from both the train and test datasets

In [None]:

# Fraction of all coupling rows that belong to the train dataset
train_fraction = train.shape[0] / (train.shape[0] + test.shape[0])
print('train fraction:', train_fraction)


We have roughly a 60/40 split for the train/test datasets.
The notation for the type of coupling follows the method mentioned in the background section.   
Let's look at the distribution of types between the train/test dataset:

In [None]:
print(train.type.unique()) # 8 different coupling types (between carbon, nitrogen, and hydrogen)
print(test.type.unique())
print(set(train.type.unique()) == set(test.type.unique())) #same types exist in both train/test datasets

In [None]:
f,ax=plt.subplots(1,2,figsize=(15,5))
train.type.value_counts().plot.bar(ax=ax[0], color = 'blue')
ax[0].set_title('Train: Number of Coupling Types')
test.type.value_counts().plot.bar(ax=ax[1], color = 'red')
ax[1].set_title('Test: Number of Coupling Types')
plt.show()
print('train:')
display(train.type.str[0].value_counts(), train.type.str[-2:].value_counts()) # number of coupling interactions for intervening bonds and atom pairs.
print('test:')
display(test.type.str[0].value_counts(), test.type.str[-2:].value_counts())

The datasets have a surprisingly similar distribution of coupling types.  This is good, since it means we won't have to worry about oversampling certain types of coupling when we train our model on the training dataset.  
Intuitively, however, coupling constants with 3 intervening bonds seem likely to be the most challenging to predict (due to long-range interactions), and they are also the most abundant.

H-C interactions also make up a disproportionately large share of the couplings compared to the others (H-C >> H-H > H-N).
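Since the type string encodes both the number of intervening bonds and the atom pair, it can be split into two separate features. The sketch below uses a tiny hand-made dataframe (`toy`) rather than the competition data; the column names `n_bonds` and `atom_pair` are assumptions introduced here.

```python
import pandas as pd

# Hand-made example rows using the "nJAB" notation from the background section.
toy = pd.DataFrame({"type": ["1JHC", "2JHH", "3JHN", "2JHC"]})

# First character: number of intervening bonds; last two characters: coupled atoms.
toy["n_bonds"] = toy["type"].str[0].astype(int)
toy["atom_pair"] = toy["type"].str[-2:]

print(toy)
```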

Let's see what the distribution of the number of atoms per molecule looks like:

In [None]:

data_train = struct[struct.molecule_name.isin(train.molecule_name.unique())].molecule_name.value_counts()
bins_train = data_train.nunique()
data_test = struct[struct.molecule_name.isin(test.molecule_name.unique())].molecule_name.value_counts()
bins_test = data_test.nunique()

f,ax=plt.subplots(1,2,figsize=(15,5))
data_train.plot.hist(ax=ax[0], color = 'blue', bins=bins_train)
ax[0].set_title('Distribution of # of atoms per Molecule in Train')
data_test.plot.hist(ax=ax[1], color = 'red', bins=bins_test)
ax[1].set_title('Distribution of # of atoms per Molecule in Test')
plt.show()
display(data_train.describe())


Once again, it seems that the train/test split was made to have near-identical distributions of molecule size and coupling type.  

It is likely that the small deviation from a 60/40 split is due to the need to balance the coupling type/molecule size between the train/test datasets.  Maybe we can take this into consideration when we create the validation dataset.  
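One way this could be taken into account is to hold out whole molecules for validation and then compare the coupling-type proportions of the two parts. The helper below, `split_by_molecule`, is a hypothetical sketch and not part of the competition tooling; it runs on a small synthetic dataframe that only mimics the `molecule_name` and `type` columns.

```python
import numpy as np
import pandas as pd


def split_by_molecule(df, valid_fraction=0.2, seed=0):
    """Split rows so that every molecule lands entirely in train or in validation."""
    rng = np.random.default_rng(seed)
    molecules = rng.permutation(np.array(df["molecule_name"].unique()))
    n_valid = max(1, int(round(len(molecules) * valid_fraction)))
    valid_names = set(molecules[:n_valid])
    is_valid = df["molecule_name"].isin(valid_names)
    return df[~is_valid], df[is_valid]


# Synthetic stand-in for the train dataframe: 10 molecules with 3 couplings each.
toy_train = pd.DataFrame({
    "molecule_name": np.repeat([f"mol_{i:02d}" for i in range(10)], 3),
    "type": ["1JHC", "2JHH", "3JHC"] * 10,
})

fit_part, valid_part = split_by_molecule(toy_train, valid_fraction=0.2)

# Compare the coupling-type proportions of the two parts.
print(fit_part["type"].value_counts(normalize=True))
print(valid_part["type"].value_counts(normalize=True))
```

Splitting by molecule avoids having couplings from the same molecule on both sides of the split, which would otherwise make the validation score look better than it should.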

Let's look at the coupling contributions with respect to the resultant coupling constant:

In [None]:
pd.concat([scc, train.scalar_coupling_constant], axis=1).head()

The Fermi contact term accounts for most of the coupling constant contribution (as was mentioned in the background section).  Predicting this term would be the primary objective of the model.
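A rough way to quantify this is to compare the mean absolute size of each contribution. The sketch below assumes the contribution columns are named `fc`, `sd`, `pso`, and `dso`, and it uses invented placeholder numbers in `toy_scc`, not values from the dataset.

```python
import pandas as pd

# Invented placeholder values; the column names are an assumption about the contributions file.
toy_scc = pd.DataFrame({
    "fc": [83.0, -11.2, 3.2],
    "sd": [0.25, 0.35, 0.10],
    "pso": [1.2, 2.8, 0.4],
    "dso": [0.3, -3.4, -0.5],
})
terms = ["fc", "sd", "pso", "dso"]

# The coupling constant is taken here as the sum of the four contributions.
toy_scc["total"] = toy_scc[terms].sum(axis=1)

# Share of each term in the summed mean absolute contribution.
mean_abs = toy_scc[terms].abs().mean()
share = mean_abs / mean_abs.sum()
print(share.sort_values(ascending=False))
```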


A great molecule visualization function was written by [Mykola Zotko](https://www.kaggle.com/mykolazotko/3d-visualization-of-magnetic-interactions):



In [None]:

import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
from sympy.geometry import Point3D


# initiate the plotly notebook mode
init_notebook_mode(connected=True)
    

def plot_interactions(molecule_name, structures, train_df):
    """Creates a 3D plot of the molecule"""


    atomic_radii = dict(C=0.77, F=0.71, H=0.38, N=0.75, O=0.73)  
    cpk_colors = dict(C='black', F='green', H='white', N='blue', O='red')
    
    if molecule_name not in train_df.molecule_name.unique():
        print(f'Molecule "{molecule_name}" is not in the training set!')
        return
    
    molecule = structures[structures.molecule_name == molecule_name]
    coordinates = molecule[['x', 'y', 'z']].values
    x_coordinates = coordinates[:, 0]
    y_coordinates = coordinates[:, 1]
    z_coordinates = coordinates[:, 2]
    elements = molecule.atom.tolist()
    radii = [atomic_radii[element] for element in elements]
    
    data_train = train_df[train_df.molecule_name == molecule_name][['atom_index_0', 'atom_index_1', 'scalar_coupling_constant']]
    interactions = data_train.groupby('atom_index_0')['atom_index_1'].apply(set).to_dict()
    coupling_constants = data_train.set_index(['atom_index_0', 'atom_index_1']).round(2).to_dict()['scalar_coupling_constant']
    
    def get_bonds():
        """Generates a set of bonds from atomic cartesian coordinates"""
        ids = np.arange(coordinates.shape[0])
        bonds = dict()
        coordinates_compare, radii_compare, ids_compare = coordinates, radii, ids
        
        for _ in range(len(ids)):
            coordinates_compare = np.roll(coordinates_compare, -1, axis=0)
            radii_compare = np.roll(radii_compare, -1, axis=0)
            ids_compare = np.roll(ids_compare, -1, axis=0)
            distances = np.linalg.norm(coordinates - coordinates_compare, axis=1)
            bond_distances = (radii + radii_compare) * 1.3
            mask = np.logical_and(distances > 0.1, distances <  bond_distances)
            distances = distances.round(2)
            new_bonds = {frozenset([i, j]): dist for i, j, dist in zip(ids[mask], ids_compare[mask], distances[mask])}
            bonds.update(new_bonds)
        return bonds      
            
    def atom_trace():
        """Creates an atom trace for the plot"""
        colors = [cpk_colors[element] for element in elements]
        markers = dict(color=colors, line=dict(color='lightgray', width=2), size=7, symbol='circle', opacity=0.8)
        trace = go.Scatter3d(x=x_coordinates, y=y_coordinates, z=z_coordinates, mode='markers', marker=markers,
                             text=elements, name='')
        return trace

    def bond_trace():
        """Creates a bond trace for the plot"""
        trace = go.Scatter3d(x=[], y=[], z=[], hoverinfo='none', mode='lines',
                             marker=dict(color='grey', size=7, opacity=1), line=dict(width=5))
        for i, j in bonds.keys():
            trace['x'] += (x_coordinates[i], x_coordinates[j], None)
            trace['y'] += (y_coordinates[i], y_coordinates[j], None)
            trace['z'] += (z_coordinates[i], z_coordinates[j], None)
        return trace
    
    def interaction_trace(atom_id):
        """Creates an interaction trace for the plot"""
        trace = go.Scatter3d(x=[], y=[], z=[], hoverinfo='none', mode='lines',
                             marker=dict(color='pink', size=7, opacity=0.5),
                            visible=False)
        for i in interactions[atom_id]:
            trace['x'] += (x_coordinates[atom_id], x_coordinates[i], None)
            trace['y'] += (y_coordinates[atom_id], y_coordinates[i], None)
            trace['z'] += (z_coordinates[atom_id], z_coordinates[i], None)
        return trace
    
    bonds = get_bonds()
    
    zipped = zip(range(len(elements)), x_coordinates, y_coordinates, z_coordinates)
    annotations_id = [dict(text=num, x=x, y=y, z=z, showarrow=False, yshift=15, font = dict(color = "blue"))
                      for num, x, y, z in zipped]
    
    annotations_length = []
    for (i, j), dist in bonds.items():
        p_i, p_j = Point3D(coordinates[i]), Point3D(coordinates[j])
        p = p_i.midpoint(p_j)
        annotation = dict(text=dist, x=float(p.x), y=float(p.y), z=float(p.z), showarrow=False, yshift=10)
        annotations_length.append(annotation)
    
    annotations_interaction = []
    for k, v in interactions.items():
        annotations = []
        for i in v:
            p_i, p_j = Point3D(coordinates[k]), Point3D(coordinates[i])
            p = p_i.midpoint(p_j)
            constant = coupling_constants[(k, i)]
            annotation = dict(text=constant, x=float(p.x), y=float(p.y), z=float(p.z), showarrow=False, yshift=25,
                              font = dict(color = "hotpink"))
            annotations.append(annotation)
        annotations_interaction.append(annotations)
    
    buttons = []
    for num, i in enumerate(interactions.keys()):
        mask = [False] * len(interactions)
        mask[num] = True
        button = dict(label=f'Atom {i}',
                      method='update',
                      args=[{'visible': [True] * 2 + mask},
                            {'scene.annotations': annotations_id + annotations_length + annotations_interaction[num]}])
        buttons.append(button)
        
    updatemenus = list([
        dict(buttons = buttons,
             direction = 'down',
             xanchor = 'left',
             yanchor = 'top'
            )
    ])
    
    data = [atom_trace(), bond_trace()]
    
    # add interaction traces
    for num, i in enumerate(interactions.keys()):
        trace = interaction_trace(i)
        if num == 0:
            trace.visible = True 
        data.append(trace)
        
    axis_params = dict(showgrid=False, showticklabels=False, zeroline=False, titlefont=dict(color='white'))
    layout = dict(scene=dict(xaxis=axis_params, yaxis=axis_params, zaxis=axis_params,
                             annotations=annotations_id + annotations_length + annotations_interaction[0]),
                  margin=dict(r=0, l=0, b=0, t=0), showlegend=False, updatemenus=updatemenus)

    fig = go.Figure(data=data, layout=layout)
    iplot(fig)

In [None]:
plot_interactions('dsgdb9nsd_000001', struct, train)

This visualization is a good starting point for generating deeper features.  Let's compile a list of structural properties that can be engineered as features based on what we can intuit from the theory.  This [source](https://www.ucl.ac.uk/nmr/NMR_lecture_notes/L3_3_97_web.pdf) has a great summary of the variables to consider.  The influential variables can be placed in three groups:

### 1.  The hybridization of the atoms involved in the coupling.

The hybridization of the atoms along the coupling pathway affects the scalar coupling constant (scc).  For instance, a 3JHH coupling would have a different scc in H-C-C-H than in H-C=C-H (where '-' and '=' represent a single and a double bond, respectively).
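The structures file has no hybridization labels, so one rough proxy is the number of bonded neighbours of each carbon (4 for sp3, 3 for sp2, 2 for sp). The sketch below is a heuristic under stated assumptions: it reuses the radii and the 1.3 distance factor from `plot_interactions`, the names `neighbor_counts` and `CARBON_HYBRIDIZATION` are introduced here, and the methane-like coordinates are illustrative.

```python
import numpy as np

# Radii taken from the plotting function above; the mapping below is a heuristic for carbon only.
COVALENT_RADII = dict(C=0.77, F=0.71, H=0.38, N=0.75, O=0.73)
CARBON_HYBRIDIZATION = {4: "sp3", 3: "sp2", 2: "sp"}


def neighbor_counts(elements, coords, scale=1.3):
    """Count bonded neighbours per atom using a scaled sum-of-radii distance threshold."""
    coords = np.asarray(coords, dtype=float)
    radii = np.array([COVALENT_RADII[e] for e in elements])
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    limit = (radii[:, None] + radii[None, :]) * scale
    bonded = (dist > 0.1) & (dist < limit)  # dist > 0.1 drops each atom's self-distance
    return bonded.sum(axis=1)


# Methane-like geometry (illustrative coordinates in angstrom).
elements = ["C", "H", "H", "H", "H"]
coords = [
    [0.0, 0.0, 0.0],
    [0.629, 0.629, 0.629],
    [-0.629, -0.629, 0.629],
    [-0.629, 0.629, -0.629],
    [0.629, -0.629, -0.629],
]

counts = neighbor_counts(elements, coords)
print(counts, CARBON_HYBRIDIZATION.get(int(counts[0]), "unknown"))
```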
    
### 2. Dihedral/bond angles of the atoms involved in the coupling.

The angles between the atoms involved in the coupling are also important variables.  For 1J coupling, the angle would be 0 since we just have two atoms bonded directly together.  For 2J coupling, it would be the bond angle between the two atoms, and for 3J coupling it would be the dihedral angle.  The dihedral angle is the angle between the atoms with respect to a common rotational axis (usually a pair of carbon atoms). Looking down the axis of the angle in question would look like this ([source](https://www.ucl.ac.uk/nmr/NMR_lecture_notes/L3_3_97_web.pdf) for figures below):

<img src='https://i.imgur.com/RKcjhFZ.png'>    
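Both angles can be computed directly from Cartesian coordinates. The functions below are a minimal sketch (the names `bond_angle` and `dihedral_angle` are introduced here): the bond angle is measured at the central atom of a 2J pathway, and the dihedral angle is measured around the middle bond of a 3J pathway.

```python
import numpy as np


def bond_angle(a, b, c):
    """Angle a-b-c in degrees, measured at the central atom b."""
    v1 = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    v2 = np.asarray(c, dtype=float) - np.asarray(b, dtype=float)
    cos_angle = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))


def dihedral_angle(p0, p1, p2, p3):
    """Signed dihedral angle in degrees for the chain p0-p1-p2-p3 (rotation about p1-p2)."""
    p0, p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p0, p1, p2, p3))
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return float(np.degrees(np.arctan2(y, x)))


# Simple test geometry: central bond along z, outer atoms offset in the x/y plane.
print(bond_angle([1, 0, 0], [0, 0, 0], [0, 1, 0]))
print(dihedral_angle([1, 0, 0], [0, 0, 0], [0, 0, 1], [1, 0, 1]))   # eclipsed (cis)
print(dihedral_angle([1, 0, 0], [0, 0, 0], [0, 0, 1], [-1, 0, 1]))  # anti (trans)
```

The sign of the dihedral angle depends on the orientation convention; since the cosine terms in the Karplus relation are even functions, the sign does not matter for those features.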
    

### 3. Substituent effects due to bonded/adjacent chemical elements.

The molecular structures attached to the primary coupling pathway must also be considered.
    
For 1J and 2J couplings, the electronegativity of the $\alpha$ and $\beta$ substituents attached adjacent to the coupling also influences the scc.  The $\alpha$ substituent is the first atom bonded to the coupling and the $\beta$ substituent is the second.  We could define a third, the $\gamma$ substituent, but the effect of its electronegativity is negligible.  It would be useful to consider chiefly the electronegativity of the $\alpha$ substituent and to average the electronegativities of all elements attached to it.
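One possible reading of this idea is to average the Pauling electronegativities of the atoms bonded to an atom on the coupling pathway, leaving out the atoms that belong to the pathway itself. The helper `mean_neighbor_electronegativity` and the bond-list input format are assumptions made for this sketch; the example uses a methanol-like connectivity.

```python
# Pauling electronegativities of the elements that appear in the dataset.
PAULING = dict(H=2.20, C=2.55, N=3.04, O=3.44, F=3.98)


def mean_neighbor_electronegativity(atom_idx, elements, bonds, exclude=()):
    """Average electronegativity of the atoms bonded to atom_idx, skipping indices in exclude."""
    neighbors = set()
    for i, j in bonds:
        if i == atom_idx:
            neighbors.add(j)
        elif j == atom_idx:
            neighbors.add(i)
    neighbors -= set(exclude)
    if not neighbors:
        return float("nan")
    return sum(PAULING[elements[n]] for n in neighbors) / len(neighbors)


# Methanol-like connectivity: C(0) bonded to O(1) and H(2, 3, 4); O(1) bonded to H(5).
elements = ["C", "O", "H", "H", "H", "H"]
bonds = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 5)]

# For the 1J coupling between H(2) and C(0), look at the other atoms attached to the carbon.
value = mean_neighbor_electronegativity(0, elements, bonds, exclude=(2,))
print(round(value, 3))
```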

For 3J couplings, the electronegativity is still a factor, but it now requires a deeper analysis of the substituent's location/angle along the 3J coupling pathway.  For example, in the case of 3JHH coupling, the position of the OH substituent plays a role ([source](https://www.ucl.ac.uk/nmr/NMR_lecture_notes/L3_3_97_web.pdf) for figures below):
        
<img src="https://i.imgur.com/necGAcQ.png">    


The presence of $\pi$ (double) bonds adjacent to the coupling also affects the scc.  In this case, we need to take into account the angle between the $\pi$ bond and the bond for the coupling pathway in question, where the effect is maximized if they're parallel ([source](http://www.sliderbase.com/spitem-114-3.html) for figures below).
<table><tr>
<td> <img src="https://i.imgur.com/mWYLFIQ.png" alt="Drawing" style="width: 300px;"/> </td> 
<td> <img src="https://i.imgur.com/32thny7.png" alt="Drawing" style="width: 300px;"/> </td>
</tr></table> 

The complexity of the problem becomes apparent when we lay out all of the variables we need to consider for an accurate prediction of the scc.  Most of the features are going to rely on a structural approach to the problem; we need to define the primary coupling pathway and the substituents bonded/adjacent to it separately.  

These substituents then need to be characterized by their overall electronegativity and their distance/angle with respect to the coupling pathway.  A similar approach would be applied to $\pi$ bonds (not on the coupling pathway), but instead of the electronegativity we could consider the distance/angle. 



**Note:** There's a *dist* feature that seems to be a significant predictor of the coupling constant.  Though I haven't begun generating features and training a model based on my EDA, it would be interesting to consider why the *dist* feature does so well and how it relates to other features.  Some points to consider:

- The average distance between coupling atoms is likely to increase with a longer coupling pathway, i.e., with more intervening bonds (1J, 2J, 3J).  


- Is it possible that ignoring substituent effects and other characteristics that could describe the Fermi contact (fc) contribution to the scc more accurately is creating a plateau for the best leaderboard (lb) scores?


- What would happen if we included the theoretically derived features along with dist? Would the model become more accurate, or would it suffer due to the dist feature?


- In the case above, it may be wise to remove the dist feature if we're experimenting with new/different features (LOFO analysis?).
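
For reference, a *dist* feature of this kind is usually the Euclidean distance between the two coupled atoms, obtained by joining the atom coordinates onto each coupling row. The sketch below is an assumption about how such a feature could be built: the helper `add_distance` is hypothetical, the structures table is assumed to have an `atom_index` column, and the two small dataframes are made up.

```python
import numpy as np
import pandas as pd


def add_distance(pairs, structures):
    """Attach the coordinates of both coupled atoms and their Euclidean distance."""
    out = pairs.copy()
    for k in (0, 1):
        atoms = structures.rename(columns={
            "atom_index": f"atom_index_{k}", "atom": f"atom_{k}",
            "x": f"x_{k}", "y": f"y_{k}", "z": f"z_{k}",
        })
        out = out.merge(atoms, how="left", on=["molecule_name", f"atom_index_{k}"])
    p0 = out[["x_0", "y_0", "z_0"]].to_numpy()
    p1 = out[["x_1", "y_1", "z_1"]].to_numpy()
    out["dist"] = np.linalg.norm(p0 - p1, axis=1)
    return out


# Made-up three-atom fragment (coordinates in angstrom).
toy_struct = pd.DataFrame({
    "molecule_name": ["mol_a"] * 3,
    "atom_index": [0, 1, 2],
    "atom": ["C", "H", "H"],
    "x": [0.0, 0.0, 1.0],
    "y": [0.0, 0.0, 0.0],
    "z": [0.0, 1.09, 0.0],
})
toy_pairs = pd.DataFrame({
    "molecule_name": ["mol_a", "mol_a"],
    "atom_index_0": [1, 1],
    "atom_index_1": [0, 2],
    "type": ["1JHC", "2JHH"],
})

with_dist = add_distance(toy_pairs, toy_struct)

# Mean distance per coupling type, to check how dist varies with the number of bonds.
print(with_dist.groupby("type")["dist"].mean())
```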