### Machine Learning for Drug Discovery - Towards Data Science article

Predicting the solubility of drug molecules from their molecular structure<br>
https://towardsdatascience.com/how-to-use-machine-learning-for-drug-discovery-1ccb5fdf81ad <br>
Source: https://pubs.acs.org/doi/10.1021/ci034243x <br><br>

Quantifying molecular properties using RDKit

In [1]:
# get data
import os
if not os.path.exists("./delaney.csv"):
    ! wget https://raw.githubusercontent.com/dataprofessor/data/master/delaney.csv

In [2]:
import pandas as pd
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors

The initial dataframe contains a compound's name, its measured solubility (plus the value predicted by the paper's model) and its SMILES string. <br>
The paper's predictions could serve as a baseline to improve on.

In [3]:
compoundsDF = pd.read_csv("delaney.csv")
print("Number of entries: ",len(compoundsDF))
compoundsDF.head(3)

Number of entries:  1144


Unnamed: 0,Compound ID,measured log(solubility:mol/L),ESOL predicted log(solubility:mol/L),SMILES
0,"1,1,1,2-Tetrachloroethane",-2.18,-2.794,ClCC(Cl)(Cl)Cl
1,"1,1,1-Trichloroethane",-2.0,-2.232,CC(Cl)(Cl)Cl
2,"1,1,2,2-Tetrachloroethane",-1.74,-2.549,ClC(Cl)C(Cl)Cl
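The baseline idea can be sketched by scoring the `ESOL predicted log(solubility:mol/L)` column against the `measured log(solubility:mol/L)` column. The helper name `baseline_metrics` is an assumption made for this illustration. It computes the mean squared error and the coefficient of determination with plain NumPy, using the standard definitions of those two metrics. The demo uses only the three rows printed above, which is far too few for a meaningful score.

```python
import numpy as np


def baseline_metrics(measured, predicted):
    """Return (MSE, R^2) of `predicted` against `measured`.

    Hypothetical helper for illustration; not part of the article's code.
    """
    measured = np.asarray(measured, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residuals = measured - predicted
    mse = float(np.mean(residuals ** 2))
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((measured - measured.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return mse, r2


# The three rows shown by head(3): measured vs. ESOL-predicted log solubility.
measured = [-2.18, -2.0, -1.74]
esol_predicted = [-2.794, -2.232, -2.549]

mse, r2 = baseline_metrics(measured, esol_predicted)
print("Baseline MSE on the three example rows:", round(mse, 3))
```

On the full dataset the two columns of `compoundsDF` can be passed in directly, which gives numbers to compare the regression model against.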


In [4]:
# convert one compound's SMILES (column 3) to an RDKit Mol and try one of its methods
Chem.MolFromSmiles(compoundsDF.iloc[99,3]).GetNumAtoms()

11

In [5]:
# build a list of RDKit Mol objects from the SMILES column
molList = [Chem.MolFromSmiles(x) for x in compoundsDF.iloc[:,3]]
molList[0:3]

[<rdkit.Chem.rdchem.Mol at 0x7fe0c4e9fd00>,
 <rdkit.Chem.rdchem.Mol at 0x7fe0c4e9ff80>,
 <rdkit.Chem.rdchem.Mol at 0x7fe0c4e9fc10>]

The authors use four molecular descriptors for their model: <br>
1. cLogP (Octanol-water partition coefficient) <br>
2. MW (Molecular weight) <br>
3. RB (Number of rotatable bonds) <br>
4. AP (Aromatic proportion = number of aromatic atoms / number of heavy atoms) <br>

I implement a `generate` function similar to the one in the post. It uses RDKit's built-in descriptor functions for every descriptor except the last one (AP).
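The contents of `mediumArticleFunctionGenerate` are not shown in this notebook, so the following is only a minimal sketch of what such a `generate` function might look like. It assumes the function takes a list of RDKit `Mol` objects and returns one row per molecule. `Descriptors.MolLogP`, `Descriptors.MolWt` and `Descriptors.NumRotatableBonds` are existing RDKit functions. Note that `MolLogP` is RDKit's Crippen estimate, which stands in for the cLogP used in the paper.

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors


def generate(mols):
    """Build a DataFrame with one row of RDKit descriptors per molecule.

    Hypothetical re-implementation; the notebook's real module is not shown.
    """
    rows = [
        {
            "MolLogP": Descriptors.MolLogP(mol),  # Crippen logP estimate
            "MolWt": Descriptors.MolWt(mol),  # molecular weight
            "NumRotatableBonds": Descriptors.NumRotatableBonds(mol),
        }
        for mol in mols
    ]
    return pd.DataFrame(rows, columns=["MolLogP", "MolWt", "NumRotatableBonds"])


# Two SMILES strings taken from the head of the dataset.
example = generate(
    [Chem.MolFromSmiles(s) for s in ["ClCC(Cl)(Cl)Cl", "CC(Cl)(Cl)Cl"]]
)
print(example)
```

`Chem.MolFromSmiles` returns `None` for a SMILES string it cannot parse, so a more defensive version would filter those entries out before computing descriptors.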


In [6]:
from mediumArticleFunctionGenerate import generate  # TODO: add SMILES or name
descriptorDF = generate(molList)  # don't rely on row indices; they change in any filtering step
descriptorDF.head(3)

Unnamed: 0,MolLogP,MolWt,NumRotatableBonds
0,2.5954,167.85,0
1,2.3765,133.405,0
2,2.5938,167.85,1


The fourth descriptor has to be computed manually. <br>
First I count the aromatic atoms in each molecule, then I divide by its number of heavy atoms:

In [7]:
# number of aromatic atoms per molecule
aromAtoms = [
    sum(atom.GetIsAromatic() for atom in mol.GetAtoms())
    for mol in molList
]
# number of heavy (non-hydrogen) atoms per molecule
heavyAtoms = [Descriptors.HeavyAtomCount(mol) for mol in molList]


In [8]:
aromPropDf = pd.DataFrame(np.round(
    (np.array(aromAtoms)/np.array(heavyAtoms)),
    decimals=2),
    columns=["Arom. Prop."])

In [17]:
#all descriptors:
X = pd.concat([descriptorDF,aromPropDf],axis=1)
X.head(3)

Unnamed: 0,MolLogP,MolWt,NumRotatableBonds,Arom. Prop.
0,2.5954,167.85,0,0.0
1,2.3765,133.405,0,0.0
2,2.5938,167.85,1,0.0


In [10]:
Y = compoundsDF.iloc[:,1]
Y.head(3)

0   -2.18
1   -2.00
2   -1.74
Name: measured log(solubility:mol/L), dtype: float64

In [11]:
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

In [12]:
X_train, x_test, Y_train, y_test = train_test_split(X, Y, test_size=0.2)

model = linear_model.LinearRegression()
model.fit(X_train, Y_train)

LinearRegression()

In [13]:
# predictions on training and test data
predTrain = model.predict(X_train)
predTest = model.predict(x_test)
pd.DataFrame([predTrain,Y_train])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,905,906,907,908,909,910,911,912,913,914
0,-1.356315,-3.703669,-1.051723,-2.403899,-0.351034,-1.502174,-2.136071,-1.049519,-0.716177,-6.425677,...,-2.378611,-1.947288,-2.723806,-2.044786,-3.708265,-2.643873,-2.755287,-4.373323,-2.636848,-5.055805
1,-0.66,-4.14,-1.34,-3.73,0.57,-1.11,-1.17,-0.85,-0.49,-7.28,...,-2.58,-0.85,-3.11,-1.04,-3.59,-2.337,-1.64,-4.88,-2.35,-5.27


In [14]:
# fitted parameters and training-set metrics
print("Coefficients:", model.coef_,
    "\nIntercept:", model.intercept_,
    "\n\nMSE: ", mean_squared_error(Y_train, predTrain),
    "\nCoefficient of determination (R^2): ", r2_score(Y_train, predTrain))

# test-set metrics
print("\nMSE: ", mean_squared_error(y_test, predTest),
    "\nCoefficient of determination (R^2): ", r2_score(y_test, predTest))

Coefficients: [-0.74099342 -0.00661237 -0.00323885 -0.49076549] 
Intercept: 0.29805134478838324 

MSE:  1.0316180073353405 
Coefficient of determination (R^2):  0.7688615915796508

MSE:  0.9253264504156217 
Coefficient of determination (R^2):  0.7736548556232452
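
Read together with the column order of `X` (MolLogP, MolWt, NumRotatableBonds, Arom. Prop.), the printed parameters correspond roughly to log(solubility) = 0.298 - 0.741 * MolLogP - 0.0066 * MolWt - 0.0032 * NumRotatableBonds - 0.491 * Arom. Prop. The sketch below applies that equation by hand. The function name `predict_log_solubility` is an assumption for illustration, and the coefficients are copied from the output above, so they hold only for this particular random train/test split.

```python
# Intercept and coefficients copied from the printed output above.
INTERCEPT = 0.29805134478838324
COEFFICIENTS = {
    "MolLogP": -0.74099342,
    "MolWt": -0.00661237,
    "NumRotatableBonds": -0.00323885,
    "Arom. Prop.": -0.49076549,
}


def predict_log_solubility(descriptors):
    """Evaluate the fitted linear model for one compound.

    `descriptors` maps each descriptor name to its value.
    Hypothetical helper; equivalent in spirit to model.predict on one row.
    """
    return INTERCEPT + sum(
        COEFFICIENTS[name] * descriptors[name] for name in COEFFICIENTS
    )


# First row of X: 1,1,1,2-Tetrachloroethane (measured value: -2.18).
first_row = {
    "MolLogP": 2.5954,
    "MolWt": 167.85,
    "NumRotatableBonds": 0,
    "Arom. Prop.": 0.0,
}
prediction = predict_log_solubility(first_row)
print("Predicted log(solubility):", round(prediction, 3))
```

All four coefficients are negative, so in this fit a higher logP, weight, flexibility or aromatic share each lowers the predicted solubility.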


In [15]:
0.298+