# Can ML predict the properties of molecules as accurately as Quantum Mechanical Algorithms?

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

In [None]:
data = pd.read_csv('../input/roboBohr.csv',index_col=0)

In [None]:
data.head()

The first thing we should do is drop the `pubchem_id` column: it is only an identifier and has no influence on the energy.

In [None]:
data.drop('pubchem_id',axis=1,inplace=True)

In [None]:
data.head()

Is there any missing data?

In [None]:
data.isnull().sum().sum()

What is the distribution of atomization energies?

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
fig = plt.figure(figsize=(8,6))
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(data['Eat'],bins=50,kde=True)
plt.title("Atomization Energy Distribution")
plt.xlabel("Atomization Energy (Eat)")

Before going further, we should know what the data values actually mean. According to the dataset description, the feature columns are the entries of each molecule's Coulomb matrix.

## What is a Coulomb Matrix?
The Coulomb matrix is defined as
$$ C_{ii} = \frac{1}{2}Z_i^{2.4}$$
$$ C_{ij} = \frac{Z_i Z_j}{|R_i-R_j|} \quad (i \neq j)$$
where $Z_i$ and $Z_j$ are the nuclear charges of atoms $i$ and $j$, and $R_i$ and $R_j$ are their positions. Because the off-diagonal entries depend only on interatomic distances, the Coulomb matrix has built-in invariance to translation and rotation of the molecule.
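To make the definition concrete, here is a minimal sketch that builds a Coulomb matrix with NumPy and checks the invariance claim numerically. The helper name `coulomb_matrix` and the three-atom geometry are hypothetical, chosen only for illustration; they are not taken from the dataset, whose features are already precomputed.

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Build the Coulomb matrix from nuclear charges Z (n,) and positions R (n, 3)."""
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    # Pairwise distances |R_i - R_j|
    dist = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=-1)
    # Placeholder on the diagonal to avoid dividing by zero; overwritten below
    np.fill_diagonal(dist, 1.0)
    C = np.outer(Z, Z) / dist
    # Diagonal terms: 0.5 * Z_i^2.4
    np.fill_diagonal(C, 0.5 * Z ** 2.4)
    return C

# Illustrative water-like geometry (made-up coordinates, not from the dataset)
Z = [8, 1, 1]
R = np.array([[0.0, 0.0, 0.0],
              [1.8, 0.0, 0.0],
              [-0.45, 1.75, 0.0]])
C = coulomb_matrix(Z, R)

# Rotate about the z axis and translate: distances are preserved
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0,            0.0,           1.0]])
R_moved = R @ rot.T + np.array([1.0, -2.0, 3.0])
C_moved = coulomb_matrix(Z, R_moved)

print(C)
print(np.allclose(C, C_moved))
```

The final check compares the matrix before and after a rigid motion of the molecule; since only interatomic distances enter the off-diagonal terms, the two matrices agree up to floating-point error.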

So in a way it's an interaction matrix, much like a covariance matrix. The aim of this ML task is therefore to map each molecule's Coulomb matrix to a single scalar output, the energy `Eat`.

In [None]:
# as_matrix() was removed in pandas 1.0; to_numpy() is the replacement
X = data.drop('Eat',axis=1).to_numpy()
y = data['Eat'].to_numpy()

First, let's try a linear regression model as a baseline.

In [None]:
from sklearn.linear_model import LinearRegression,BayesianRidge
from sklearn.model_selection import train_test_split,cross_val_predict,cross_val_score

In [None]:
# Fix the seed so the split (and the reported errors) are reproducible
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=42)
linear_model = LinearRegression()
linear_model.fit(X_train,y_train)
pred = linear_model.predict(X_test)

In [None]:
from sklearn.metrics import mean_absolute_error,mean_squared_error
print("The MAE is {}.\nThe MSE is {}".format(mean_absolute_error(y_test,pred),mean_squared_error(y_test,pred)))
fig = plt.figure(figsize=(6,3))
sns.histplot(pred-y_test)  # histplot replaces the deprecated distplot(kde=False)
plt.title("Linear Regression Error Distribution")
plt.xlabel("Error Value")

Thus the linear model fails miserably. Next, let's try a Bayesian Ridge Regression.

In [None]:
Bayesian_model = BayesianRidge()
Bayesian_model.fit(X_train,y_train)
pred = Bayesian_model.predict(X_test)
print("The MAE is {}.\nThe MSE is {}".format(mean_absolute_error(y_test,pred),mean_squared_error(y_test,pred)))
# cross_val_score uses the estimator's default scorer, which is R^2 for regressors
print("The Cross Validation Scores are: {}".format(cross_val_score(Bayesian_model,X,y,cv=10)))
fig = plt.figure(figsize=(6,3))
sns.histplot(pred-y_test)  # histplot replaces the deprecated distplot(kde=False)
plt.title("Bayesian Ridge Regression Error Distribution")
plt.xlabel("Error Value")

Before proceeding further, it would be best to try to reduce the number of features with Principal Component Analysis (PCA).
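As a rough illustration of what such a reduction does, the sketch below runs scikit-learn's `PCA` on synthetic data rather than on the notebook's `X`. The array `X_demo`, the 95% variance threshold, and the standardization step are all assumptions made for the example, not choices taken from this analysis.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 samples, 50 correlated features driven by 5 latent factors
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

# Standardize so that no single feature dominates the variance (an assumption here)
X_scaled = StandardScaler().fit_transform(X_demo)

# A float n_components keeps just enough components to reach that share of variance
pca = PCA(n_components=0.95, svd_solver='full')
X_reduced = pca.fit_transform(X_scaled)

print("Reduced shape:", X_reduced.shape)
print("Explained variance retained:", pca.explained_variance_ratio_.sum())
```

Because the synthetic features are driven by only a handful of latent factors, a few components capture nearly all of the variance; how far real Coulomb matrix features compress depends on the data.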

In [None]:
from sklearn.decomposition import PCA