# Regression Models for Fa Prediction using Descriptors Calculated with Mordred

## Materials and Method

- Libraries: NumPy, pandas, scikit-learn, matplotlib, RDKit, mordred and SHAP
- Dataset: Fraction of absorption (Fa) and permeability measured in Caco-2 cells (Papp), collected in a previous study (Esaki et al., Journal of Pharmaceutical Sciences, 2019)
- Descriptor calculation: Mordred

### Library import

In [None]:
import numpy as np
print('numpy version: ', np.__version__)

from rdkit import Chem, rdBase
print('rdkit version: ', rdBase.rdkitVersion)

### Datasets

The dataset contains the chemical structures of 5567 compounds as SMILES strings. It includes 946 experimental Fa values and 4460 experimental Papp values. Owing to its accuracy, we used CORINA (ver. 4.4.0) to generate 3D structures of the compounds in structure-data file (SDF) format.

In [None]:
sdf = 'corina_result_all_SI_JPS_largestMWFragment_LowestEnergyConformation.sdf'

suppl = Chem.SDMolSupplier(sdf, removeHs=False)
mols = [mol for mol in suppl if mol is not None]
print(len(mols))

Checking the properties stored in a mol object

In [None]:
mols[0].GetPropsAsDict()

Extracting the SDF properties as lists

In [None]:
chemblids = [mol.GetPropsAsDict()['chembl_id'] for mol in mols]
Fas = [mol.GetPropsAsDict()['Fa'] for mol in mols]
Papps = [mol.GetPropsAsDict()['Papp'] for mol in mols]

print(len(chemblids), len(Fas), len(Papps))
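
As a quick sanity check against the counts quoted in the dataset description (946 Fa and 4460 Papp values), the number of measured values in each list can be tallied. The sketch below is a minimal illustration on toy data. It assumes that missing measurements are stored as the string `'.'` (inferred from the `replace('.', np.nan)` step used later in this notebook), and `count_measured` is a hypothetical helper, not part of RDKit.

```python
def count_measured(values, missing_marker='.'):
    """Count the entries that are not the missing-value marker."""
    return sum(1 for value in values if value != missing_marker)

# Toy stand-ins for the Fas and Papps lists (values invented for illustration)
toy_Fas = [0.95, '.', 0.40, '.']
toy_Papps = ['.', 12.3, 0.8, 5.1]

print(count_measured(toy_Fas), count_measured(toy_Papps))
```

Applied to the real `Fas` and `Papps` lists, the same helper should reproduce the dataset counts if the missing-value assumption holds.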

### Descriptor calculation

#### Installing and importing Mordred

Mordred is a descriptor calculation tool for Python that wraps RDKit. It was developed as an improvement on PaDEL-Descriptor.

> Moriwaki H, Tian Y-S, Kawashita N, Takagi T (2018) Mordred: a molecular descriptor calculator. Journal of Cheminformatics 10:4 . doi: 10.1186/s13321-018-0258-y

The numbers of calculable descriptors are as follows:
- 1D, 2D: 1613
- 3D: 213

To install Mordred, run the following at the Anaconda prompt: `conda install -c rdkit -c mordred-descriptor mordred`

In [None]:
import mordred
print(f'mordred version: {mordred.__version__}')

from mordred import Calculator, descriptors

`Calculator()` takes the descriptors to calculate and the calculation options (1D2D only: `ignore_3D=True`; 1D2D3D: `ignore_3D=False`).

#### Calculation of 1D2D descriptors

In [None]:
calc2d = Calculator(descriptors, ignore_3D=True)
print('Number of 1D2D descriptors: ', len(calc2d))

df_2D_mordred = calc2d.pandas(mols)
df_2D_mordred.head(5)

Transform error messages generated in descriptor calculation into NaN

In [None]:
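import pandas as pd

# Cast every cell to str so that Mordred error objects become plain text,
# then coerce anything that does not parse as a number to NaN.
# A letter-based filter would also discard valid values such as '1e-05'.
df_2Ddescriptors = df_2D_mordred.astype(str).apply(pd.to_numeric, errors='coerce')
df_2Ddescriptors = df_2Ddescriptors.astype(float)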

df_2Ddescriptors.head(5)
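
To make the conversion step concrete, the sketch below applies the same idea to a small toy table. It uses `pd.to_numeric(..., errors='coerce')` as one possible way to turn non-numeric cells into NaN. The column names and error texts are invented for illustration, and `errors_to_nan` is a hypothetical helper rather than a Mordred function.

```python
import pandas as pd

def errors_to_nan(df):
    """Return a float DataFrame in which every non-numeric cell is NaN."""
    # Cast to str first so that error objects become plain text,
    # then coerce anything that does not parse as a number to NaN.
    return df.astype(str).apply(pd.to_numeric, errors='coerce').astype(float)

# Toy table mixing numbers with made-up error messages
toy = pd.DataFrame({
    'desc_a': [1.5, 'invalid value encountered in divide', 3.0],
    'desc_b': [1e-05, 2.0, 'missing 3D coordinate'],
})

clean = errors_to_nan(toy)
print(clean)
```

Values written in scientific notation, such as `1e-05`, survive this conversion, whereas a filter that treats any letter as an error would turn them into NaN.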

Merging descriptors and labels

In [None]:
df_2Ddescriptors.insert(0, 'ChEMBL_ID', chemblids)
df_2Ddescriptors.insert(1, 'Fa', Fas)
df_2Ddescriptors.insert(2, 'Papp', Papps)

df_2Ddescriptors = df_2Ddescriptors.replace('.', np.nan)
df_2Ddescriptors.head(5)
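
The merging step can be tried on toy data first. This sketch assumes, as the `replace` call above suggests, that unmeasured labels appear as the string `'.'`; the IDs and values are invented for illustration. Note that `DataFrame.insert` expects one value per row, so the label lists must have the same length and order as the descriptor table.

```python
import numpy as np
import pandas as pd

# Toy descriptor table with one row per molecule
toy = pd.DataFrame({'desc_a': [1.0, 2.0, 3.0]})

# Toy labels in the same order as the rows; '.' marks a missing measurement
toy_ids = ['ID_1', 'ID_2', 'ID_3']
toy_Fas = [0.9, '.', 0.3]

toy.insert(0, 'ChEMBL_ID', toy_ids)
toy.insert(1, 'Fa', toy_Fas)

# Replace the missing-value marker, then make the label column numeric
toy = toy.replace('.', np.nan)
toy['Fa'] = pd.to_numeric(toy['Fa'])

print(toy['Fa'].notna().sum())
```

Converting the label column with `pd.to_numeric` after the replacement gives it a float dtype, which is convenient when the table is later filtered on measured values.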

Export as csv file

In [None]:
df_2Ddescriptors.to_csv(sdf.split('.')[0] + '_mordred_1D2D.csv')
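
The output name above is built with `sdf.split('.')[0]`, which keeps everything before the first dot. That works for the file name used here; for names that contain additional dots, `pathlib.Path.stem` removes only the final extension. The sketch below compares the two on a hypothetical file name, with `csv_name` as an illustrative helper.

```python
from pathlib import Path

def csv_name(sdf_path, suffix):
    """Build an output CSV name from an SDF path, dropping the directory and only the last extension."""
    return Path(sdf_path).stem + suffix + '.csv'

name = 'compounds_v1.2_conformers.sdf'  # hypothetical file name with an extra dot

split_name = name.split('.')[0] + '_mordred_1D2D.csv'
stem_name = csv_name(name, '_mordred_1D2D')

print(split_name)
print(stem_name)
```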

#### Calculation of 1D2D3D descriptors

In [None]:
calc3d = Calculator(descriptors, ignore_3D=False)
print('Number of 1D2D3D descriptors: ', len(calc3d))

df_3D_mordred = calc3d.pandas(mols)
df_3D_mordred.head(5)

Transform error messages generated in descriptor calculation into NaN

In [None]:
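import pandas as pd

# Cast every cell to str so that Mordred error objects become plain text,
# then coerce anything that does not parse as a number to NaN.
# A letter-based filter would also discard valid values such as '1e-05'.
df_3Ddescriptors = df_3D_mordred.astype(str).apply(pd.to_numeric, errors='coerce')
df_3Ddescriptors = df_3Ddescriptors.astype(float)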

df_3Ddescriptors.head(5)

Merging descriptors and labels

In [None]:
df_3Ddescriptors.insert(0, 'ChEMBL_ID', chemblids)
df_3Ddescriptors.insert(1, 'Fa', Fas)
df_3Ddescriptors.insert(2, 'Papp', Papps)

df_3Ddescriptors = df_3Ddescriptors.replace('.', np.nan)
df_3Ddescriptors.head(5)

Export as csv file

In [None]:
df_3Ddescriptors.to_csv(sdf.split('.')[0] + '_mordred_1D2D3D.csv')

## References

- J-Stage: https://www.jstage.jst.go.jp/article/ciqs/2016/0/2016_Y4/_pdf/-char/ja
- github: https://github.com/mordred-descriptor/mordred
- kiseno-log: https://kiseno-log.com/2019/11/07/mordred%E3%81%A7%E8%A8%98%E8%BF%B0%E5%AD%90%E3%82%92%E8%A8%88%E7%AE%97%E3%81%97%E3%81%A6pandas%E5%BD%A2%E5%BC%8F%E3%81%A7%E5%87%BA%E5%8A%9B%E3%81%99%E3%82%8B/