## Introduction
The aim of this competition is to predict the **scalar coupling constant** between pairs of atoms in a molecule, using chemical features of the molecule provided in the dataset. The scalar coupling constant is simply a measure of the magnetic interaction between two atoms in a molecule.

In this kernel, I will explore different features provided in the data and give a brief chemical background of each feature. I will also visualize the distributions of these features in the dataset.

<img src="https://i.imgur.com/47YCR3M.png" width="500px">

## Acknowledgements

The chemistry images in this kernel are courtesy of the book [Chemistry for the IB Diploma by Steve Owen](https://www.ibdocuments.com/IB%20BOOKS/Group%204%20-%20Sciences/Chemistry/CAMBRIDGE/Chemistry%20-%20Steve%20Owen%20-%20Second%20Edition%20-%20Cambridge%202014.pdf). I would like to thank [funkyboy](https://www.kaggle.com/super13579) for [this kernel](https://www.kaggle.com/super13579/simple-eda-and-lightgbm), from which I borrowed many ideas for visualization. I would also like to thank [Chemistry LibreTexts](https://chem.libretexts.org), from which I picked up many of the chemistry concepts explained in this kernel.

### Import necessary libraries

In [None]:
import os
import gc
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

### Check the files provided in the dataset

In [None]:
os.listdir('../input')

### Load training data and get list of coupling types

In [None]:
train = pd.read_csv('../input/train.csv')
train.head(10)

In [None]:
typelist = list(train['type'].value_counts().index)
typelist

### Visualize distribution of scalar coupling constant

In [None]:
sns.distplot(train['scalar_coupling_constant'], color='orangered')
plt.show()

From the above graph, we can see that the distribution of *scalar_coupling_constant* is skewed to the right and is not unimodal. Besides the main peak, there is a significantly smaller peak close to 100, making it bimodal. The mode is approximately 0, and the second peak pulls the mean somewhat above it.
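
The direction of the skew can also be checked numerically rather than by eye. The sketch below uses a synthetic sample (made-up values standing in for `train['scalar_coupling_constant']`, not real data) and pandas' `Series.skew()`; a positive value indicates a longer right tail, a negative value a longer left tail.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train['scalar_coupling_constant'] (assumption: not real data):
# a large cluster near 0 plus a much smaller cluster near 100.
rng = np.random.default_rng(0)
sample = pd.Series(np.concatenate([
    rng.normal(0, 5, size=9000),
    rng.normal(100, 10, size=1000),
]))

skewness = sample.skew()   # > 0: longer right tail; < 0: longer left tail
print(f"mean={sample.mean():.2f}, median={sample.median():.2f}, skew={skewness:.2f}")
```

On the real data, the same call would be `train['scalar_coupling_constant'].skew()`.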

In [None]:
plt.figure(figsize=(26, 24))
for i, col in enumerate(typelist):
    plt.subplot(4,2, i + 1)
    sns.distplot(train[train['type']==col]['scalar_coupling_constant'],color ='indigo')
    plt.title(col)

Above are the distributions of the target for the different coupling types. Some distributions are unimodal, some are heavily skewed and others are clearly bimodal.

## Dipole Moment

When two electrical charges, of opposite sign and equal magnitude, are separated by a distance, an electric dipole is established. In the case of a molecule, there is a dipole established between two atoms when there is a large difference in [electronegativity](https://en.wikipedia.org/wiki/Electronegativity). This causes the electron density between the atoms to be imbalanced, creating a dipole (more electrons are attracted to the atom with the higher electronegativity).

The size of a dipole is measured by its **dipole moment** (μ). For two equal and opposite charges, the dipole moment is the magnitude of the charge multiplied by the distance between the charges, and it is commonly measured in debye (D). More generally, the dipole moment is calculated using the formula provided below, where:
* μ is the dipole moment vector
* q<sub>i</sub> is the magnitude of the i<sup>th</sup> charge and
* r<sub>i</sub> is the position vector of the i<sup>th</sup> charge

<img src="https://i.imgur.com/0FmyfsG.png" width="175px">
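
The formula above (the sum of each charge multiplied by its position vector) can be sketched in a few lines of NumPy. The charges and positions below are made-up illustrative values in arbitrary units, and `dipole_moment` is a hypothetical helper written for this example.

```python
import numpy as np

def dipole_moment(charges, positions):
    """Return the dipole moment vector: sum over i of q_i * r_i."""
    charges = np.asarray(charges, dtype=float)      # shape (n,)
    positions = np.asarray(positions, dtype=float)  # shape (n, 3)
    return (charges[:, None] * positions).sum(axis=0)

# Two equal and opposite charges separated by a distance d along the x-axis.
q, d = 0.5, 2.0
mu = dipole_moment([+q, -q], [[d, 0.0, 0.0], [0.0, 0.0, 0.0]])

print(mu)                  # vector pointing from the negative to the positive charge
print(np.linalg.norm(mu))  # magnitude equals q * d
```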

The dipole moment is a vector, so it has a direction as well as a magnitude. An example of a polar molecule is H<sub>2</sub>O. Because of the lone pairs on oxygen, the structure of H<sub>2</sub>O is bent (via [VSEPR theory](https://en.wikipedia.org/wiki/VSEPR_theory)), which means that the vectors representing the dipole moment of each bond do not cancel each other out. Hence, water is a [polar molecule](https://en.wikipedia.org/wiki/Chemical_polarity). The case of water is shown below. As one can see, the two dipole moment vectors do not cancel out; instead, they add up to form a net dipole moment in the upward direction (shown by the arrow in the figure).

<center><font size=4>Water molecule</font></center>

<img src="https://i.imgur.com/v7qMcFW.png" width="150px">

When the dipole moment vectors in a molecule cancel out, however, the molecule is non-polar. For example, CO<sub>2</sub> has a linear structure (a 180 degree angle between the two carbon-oxygen bonds) due to the lack of a lone pair on carbon, so the dipole moment vectors of the two bonds cancel out as they point in exactly opposite directions. Therefore, carbon dioxide is a non-polar molecule. The case of carbon dioxide is shown below.

<center><font size=4>Carbon dioxide molecule</font></center>

<img src="https://i.imgur.com/fUykc8C.png" width="200px">
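
The cancellation argument can be illustrated by adding two bond dipole vectors. This is only a rough sketch: each bond dipole is given unit magnitude, the 104.5 degree H-O-H angle is the commonly quoted value for water rather than something taken from the dataset, and `net_dipole` is a hypothetical helper.

```python
import numpy as np

def net_dipole(bond_angle_deg, bond_dipole=1.0):
    """Sum two bond dipoles of equal magnitude separated by the given angle."""
    half = np.deg2rad(bond_angle_deg) / 2.0
    # Place the two bonds symmetrically about the y-axis in the xy-plane.
    d1 = bond_dipole * np.array([np.sin(half), np.cos(half)])
    d2 = bond_dipole * np.array([-np.sin(half), np.cos(half)])
    return d1 + d2

water = net_dipole(104.5)  # bent: the y-components add up
co2 = net_dipole(180.0)    # linear: the two vectors point in opposite directions

print(np.linalg.norm(water))  # clearly non-zero
print(np.linalg.norm(co2))    # zero up to floating-point error
```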

### Load dipole moment data

In [None]:
dipole_moments = pd.read_csv('../input/dipole_moments.csv')
dipole_moments.head(10)

### Visualize the distribution of dipole moments in X, Y and Z directions

In [None]:
sns.distplot(dipole_moments.X, color='mediumseagreen')
plt.title('Dipole moment along X-axis')
plt.show()
sns.distplot(dipole_moments.Y, color='seagreen')
plt.title('Dipole moment along Y-axis')
plt.show()
sns.distplot(dipole_moments.Z, color='green')
plt.title('Dipole moment along Z-axis')
plt.show()

The distributions of dipole moment along the X and Y axes are approximately normal with a mean of 0, with the X-axis distribution having a greater standard deviation and range. On the other hand, the dipole moment along the Z-axis has a slightly skewed distribution (skewed to the right), with a secondary peak around 1 in addition to the primary peak (mode) above 0.
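
Because the three components depend on how each molecule is oriented in the coordinate frame, the overall magnitude of the dipole moment may be a useful additional summary. A minimal sketch, using a toy frame with the same `X`, `Y` and `Z` columns as `dipole_moments` (the values and the `magnitude` column name are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Toy stand-in for dipole_moments (made-up values).
toy_dipoles = pd.DataFrame({
    'X': [0.0, 3.0, -1.0],
    'Y': [0.0, 4.0, 2.0],
    'Z': [0.0, 0.0, 2.0],
})

# Euclidean norm of each row's (X, Y, Z) vector.
toy_dipoles['magnitude'] = np.linalg.norm(toy_dipoles[['X', 'Y', 'Z']].to_numpy(), axis=1)
print(toy_dipoles)
```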

### Visualize the distribution of dipole moments in all directions for each coupling type

In [None]:
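plt.figure(figsize=(26, 24))
for i, col in enumerate(typelist):
    plt.subplot(4,2, i + 1)
    # A mask built from train cannot be applied to dipole_moments row by row,
    # so select the molecules that contain this type instead
    # (assumes both tables have a 'molecule_name' column)
    mols = train.loc[train['type'] == col, 'molecule_name'].unique()
    subset = dipole_moments[dipole_moments['molecule_name'].isin(mols)]
    sns.distplot(subset['X'], color='orange', kde=False)
    sns.distplot(subset['Y'], color='red', kde=False)
    sns.distplot(subset['Z'], color='blue', kde=False)
    plt.title(col)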

In the above figures, the orange, red and blue distributions represent the dipole moment distributions along the X, Y and Z axes respectively. They are all approximately normal with a mean of 0, but the standard deviation ("spread") increases from Z to Y to X.
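
The dipole moment table appears to hold one row per molecule, whereas `train` holds one row per atom pair, so the two tables need to be joined before values can be grouped by `type`. A minimal sketch with toy data, assuming both tables share a `molecule_name` key column (an assumption about the file layout that is worth checking against `train.columns` and `dipole_moments.columns`):

```python
import pandas as pd

# Toy stand-ins for train and dipole_moments (made-up values).
toy_train = pd.DataFrame({
    'molecule_name': ['mol_a', 'mol_a', 'mol_b'],
    'type': ['1JHC', '2JHH', '1JHC'],
})
toy_dipoles = pd.DataFrame({
    'molecule_name': ['mol_a', 'mol_b'],
    'X': [0.1, -0.3],
})

# A left join keeps every coupling row and attaches that molecule's dipole moment.
merged = toy_train.merge(toy_dipoles, on='molecule_name', how='left')
print(merged.groupby('type')['X'].mean())
```

A left join is used so that no coupling row is dropped if a molecule happens to be missing from the per-molecule table.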

## Potential Energy

The [potential energy](https://en.wikipedia.org/wiki/Potential_energy) of a molecule (or any object for that matter) is the energy contained in the molecule by virtue of its position, composition or arrangement in a force field. In the case of molecules, potential energy arises from several forces acting between the atoms and molecules in the substance. These forces include [covalent bonds](https://en.wikipedia.org/wiki/Covalent_bond) between atoms, [electrostatic forces](https://en.wikipedia.org/wiki/Electrostatic_forces) between oppositely charged ions and [nuclear forces](https://en.wikipedia.org/wiki/Nuclear_force) that hold the protons and neutrons in atomic nuclei together. Together, these forces determine the potential energy of a molecule and, as one can see, they all depend on the relative positions of the particles in the substance (atoms, protons, neutrons etc.), as well as on the charges and masses of these particles.

For example, the potential energy of liquid bromine is lower than that of bromine vapor. The molecules of bromine vapor are further apart from each other than those of liquid bromine, so the forces of attraction between them are weaker (for electrostatic interactions, **force is inversely proportional to distance<sup>2</sup>**). Because energy must be supplied to pull mutually attracting molecules apart, the widely separated vapor molecules have a higher (less negative) potential energy than the closely packed liquid molecules. Here, **the relative positions of the molecules affect the potential energy**.

<center><font size=4>Bromine liquid</font></center>
<img src="https://i.imgur.com/4GsJgl2.png" width="400px">

<center><font size=4>Bromine vapor</font></center>
<img src="https://i.imgur.com/MzwkKUm.png" width="400px">

Also, the molecules in water are more strongly bound to each other than those in carbon dioxide because there is [hydrogen bonding](https://en.wikipedia.org/wiki/Hydrogen_bond) between the hydrogen and oxygen atoms of different H<sub>2</sub>O molecules (due to the polar nature of H<sub>2</sub>O). This hydrogen bond is a strong intermolecular force that brings the molecules of water closer together. Carbon dioxide, on the other hand, only has [London forces](https://www.chem.purdue.edu/gchelp/liquids/disperse.html) between atoms of different CO<sub>2</sub> molecules. These London forces are a lot weaker than the hydrogen bonds between water molecules, and thus the intermolecular potential energy of water is lower (more negative) than that of carbon dioxide. Here, **the intermolecular forces affect the potential energy**.



Potential energy need not always refer to that in molecules. It may also refer to [gravitational](https://en.wikipedia.org/wiki/Potential_energy#Gravitational_potential_energy), [electrical](https://en.wikipedia.org/wiki/Potential_energy#Electric_potential_energy) or [magnetic](https://en.wikipedia.org/wiki/Potential_energy#Magnetic_potential_energy) potential energy.

### Load potential energy data

In [None]:
potential_energy = pd.read_csv('../input/potential_energy.csv')
potential_energy.head(10)

### Visualize the distribution of potential energy

In [None]:
sns.distplot(potential_energy.potential_energy, color='darkblue', kde=False)
plt.show()

The distribution of potential energy of the molecules is approximately normal with a mean of around -400.

### Visualize the distribution of potential energy for each coupling type

In [None]:
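plt.figure(figsize=(26, 24))
for i, col in enumerate(typelist):
    plt.subplot(4,2, i + 1)
    # Select the molecules that contain this type (assumes a shared 'molecule_name' column)
    mols = train.loc[train['type'] == col, 'molecule_name'].unique()
    subset = potential_energy[potential_energy['molecule_name'].isin(mols)]
    sns.distplot(subset['potential_energy'], color='orangered')
    plt.title(col)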

Above are the distributions of potential energy for each coupling type. One can see that the distributions are very different for each type.

## Magnetic Shielding


Magnetic shielding is a complicated quantum mechanical concept which is difficult to explain in plain terms. But, [here](https://www.sciencedirect.com/topics/physics-and-astronomy/magnetic-shielding) is a good article explaining what magnetic shielding tensors are.

Essentially, we are given a rank-2 (2-D) magnetic shielding tensor for each atom in each molecule. Each tensor is a 3-by-3 matrix whose entries represent the magnetic shielding for every combination of directions (XX, XY, XZ, YX, YY, YZ, ZX, ZY and ZZ).
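
To make the layout concrete, the nine components of one row can be reshaped into a 3-by-3 matrix. The sketch below uses made-up numbers in a toy row with the same nine column names. The isotropic shielding (one third of the trace) is a standard scalar summary of such a tensor, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

components = ['XX', 'XY', 'XZ', 'YX', 'YY', 'YZ', 'ZX', 'ZY', 'ZZ']

# Toy stand-in for one row of magnetic_shielding_tensors (made-up values).
toy_row = pd.Series([30.0, 1.0, 0.0, 1.0, 33.0, 2.0, 0.0, 2.0, 27.0], index=components)

# Row-major reshape: the first row is (XX, XY, XZ), the second (YX, YY, YZ), and so on.
sigma = toy_row[components].to_numpy().reshape(3, 3)
isotropic = np.trace(sigma) / 3.0   # mean of the diagonal elements XX, YY, ZZ

print(sigma)
print(isotropic)
```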

### Load magnetic shielding tensor data

In [None]:
magnetic_shielding_tensors = pd.read_csv('../input/magnetic_shielding_tensors.csv')
magnetic_shielding_tensors.head(10)

### Define helper function to remove outliers

In [None]:
def is_outlier(points, thresh=3.5):
    """
    Returns a boolean array with True if points are outliers and False 
    otherwise.

    Parameters:
    -----------
        points : A numobservations by numdimensions array of observations
        thresh : The modified z-score to use as a threshold. Observations with
            a modified z-score (based on the median absolute deviation) greater
            than this value will be classified as outliers.

    Returns:
    --------
        mask : A numobservations-length boolean array.

    References:
    ----------
        Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and
        Handle Outliers", The ASQC Basic References in Quality Control:
        Statistical Techniques, Edward F. Mykytka, Ph.D., Editor. 
    """
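    # Convert to a NumPy array first: pandas Series do not support [:, None] indexing
    points = np.asarray(points)
    if points.ndim == 1:
        points = points[:, None]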
    median = np.median(points, axis=0)
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)

    modified_z_score = 0.6745 * diff / med_abs_deviation

    return modified_z_score > thresh
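
As a quick illustration of the idea, here is a compact one-dimensional version of the same modified z-score rule applied to made-up numbers (`modified_z_scores` is a hypothetical helper written for this example, not part of any library):

```python
import numpy as np

def modified_z_scores(values):
    """Modified z-score: 0.6745 * |x - median| / MAD, for a 1-D array."""
    values = np.asarray(values, dtype=float)
    abs_dev = np.abs(values - np.median(values))
    mad = np.median(abs_dev)   # median absolute deviation
    return 0.6745 * abs_dev / mad

data = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 50.0])  # made-up values
mask = modified_z_scores(data) > 3.5                    # same default threshold as is_outlier

print(data[mask])  # only the extreme value 50.0 is flagged
```

Note that the rule divides by the median absolute deviation, so it is undefined when more than half of the values are identical.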

### Visualize the magnetic shielding in each direction combination

In [None]:
# Plot each tensor component with outliers removed, one figure per component
components = ['XX', 'XY', 'XZ', 'YX', 'YY', 'YZ', 'ZX', 'ZY', 'ZZ']
colors = ['red', 'orangered', 'orange', 'yellow', 'green', 'blue', 'darkblue', 'indigo', 'darkviolet']
for component, color in zip(components, colors):
    values = magnetic_shielding_tensors[component]
    sns.distplot(values[~is_outlier(values)], color=color)
    plt.title('Magnetic Shielding ({})'.format(component))
    plt.show()

The distributions of magnetic shielding in each direction combination seem to fall off roughly linearly on either side of a peak near 0. Some of them have steep, sharp slopes from the peak to the tails (YX for example), while others have smooth, bulging tails (ZY for example). All of them have unique shapes, but most fall into one of these two categories.

## Mulliken Charges
Mulliken charges arise from [Mulliken population analysis](http://iqc.udg.es/articles/pdf/iqc413.pdf), and are calculated using the methods of computational chemistry. They provide a means to estimate the partial charge on each atom.

A partial charge is a non-integer charge value when measured in elementary charge units. Partial charge is more commonly called net atomic charge. It is represented by the Greek lowercase letter δ, namely δ− or δ+.

Partial charges are created due to the asymmetric distribution of electrons in chemical bonds. For example, in a polar covalent bond like HCl, the shared electrons are drawn more towards the chlorine atom due to its higher electronegativity (greater tendency to attract electrons) as compared to the hydrogen atom. This creates a higher electron density around the chlorine atom, and thus chlorine gains a partial negative charge and hydrogen gains a partial positive charge due to the lower electron density surrounding it. These charges are relative, which means that the chlorine atom has a negative charge only because it is more negative than hydrogen. The case of HCl is shown below.

<center><font size=4>Hydrogen chloride</font></center>
<img src="https://i.imgur.com/nNP2mtE.png" width="100px">
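
One sanity check on partial charges is that they should add up to the net charge of the molecule, which is zero for a neutral molecule. A minimal sketch with made-up values, assuming the charge table carries a `molecule_name` column alongside `atom_index` and `mulliken_charge` (the `molecule_name` column is an assumption here):

```python
import pandas as pd

# Toy stand-in for mulliken_charges (made-up values for two neutral molecules).
toy_charges = pd.DataFrame({
    'molecule_name': ['mol_a', 'mol_a', 'mol_b', 'mol_b', 'mol_b'],
    'atom_index': [0, 1, 0, 1, 2],
    'mulliken_charge': [0.18, -0.18, -0.50, 0.25, 0.25],
})

# Sum the partial charges within each molecule.
net_charge = toy_charges.groupby('molecule_name')['mulliken_charge'].sum()
print(net_charge.round(6))
```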

### Load Mulliken charge data

In [None]:
mulliken_charges = pd.read_csv('../input/mulliken_charges.csv')
mulliken_charges.head(10)

### Visualize distribution of Mulliken charge

In [None]:
sns.distplot(mulliken_charges.mulliken_charge, color = 'seagreen')
plt.show()

The distribution of the Mulliken charges of the atoms peaks at around 0.175 and is clearly unimodal. However, the tails have a very uneven appearance (small peaks and valleys). Additionally, there is a clear rightward skew.

### Visualize distribution of Mulliken charge for each atom index

In [None]:
# One distribution per atom index (0 to 4); the last plot now selects index 4, not 3
colors = ['blue', 'darkblue', 'blueviolet', 'purple', 'indigo']
for atom_index, color in enumerate(colors):
    charges = mulliken_charges.loc[mulliken_charges.atom_index == atom_index].mulliken_charge
    sns.distplot(charges, color=color)
    plt.title('Atom index {}'.format(atom_index))
    plt.show()

As one can see from the distributions above, there are four peaks (quadmodal?). The means of the distributions tend to decrease as the atom index increases (the Mulliken charges are higher for lower atom indices). I am not sure about the reason for this trend.
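
The trend across atom indices can also be summarised in a table instead of plotting each index separately. A minimal sketch on a toy frame with the same `atom_index` and `mulliken_charge` columns (the values are made up):

```python
import pandas as pd

# Toy stand-in for mulliken_charges (made-up values).
toy_charges = pd.DataFrame({
    'atom_index': [0, 0, 1, 1, 2, 2],
    'mulliken_charge': [0.30, 0.10, 0.15, 0.05, -0.10, 0.00],
})

# Mean, spread and row count of the charge for each atom index.
summary = toy_charges.groupby('atom_index')['mulliken_charge'].agg(['mean', 'std', 'count'])
print(summary)
```

On the real data, the same `groupby` on `mulliken_charges` would cover every atom index at once.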

That's it! Thanks for reading this kernel. I hope you found it useful :)