## Descriptors

To build machine learning models, we need to convert chemical structures into a format that the chosen algorithms can interpret. Although some algorithms can work directly with molecular structures represented as text or graphs, numeric variables remain the most common input for building models.

Descriptors represent characteristics of chemical structures. Some are easy to interpret, such as relative molecular mass. Others, such as logP, have a physical meaning but are calculated from the structure, so they may not match the experimentally measured value. We can also count chemical groups, such as hydrogen-bond donors or fluorine atoms. Finally, we can compute topological descriptors and others that are not easy to interpret.
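As a minimal sketch of these descriptor types, the snippet below computes one example of each with RDKit: an easily interpretable value (`MolWt`), an estimated physical property (`MolLogP`), two counts (hydrogen-bond donors and fluorine atoms), and a topological index (`BalabanJ`). The helper names `count_element` and `describe` are hypothetical, and the second molecule (trifluorotoluene) is chosen only for illustration.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors


def count_element(mol, symbol):
    """Count the atoms of one element (hypothetical helper)."""
    return sum(1 for atom in mol.GetAtoms() if atom.GetSymbol() == symbol)


def describe(smiles):
    """Return a few descriptors of different kinds for one SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    return {
        "MolWt": Descriptors.MolWt(mol),            # easy to interpret
        "MolLogP": Descriptors.MolLogP(mol),        # estimated, not measured
        "NumHDonors": Descriptors.NumHDonors(mol),  # group count
        "n_fluorine": count_element(mol, "F"),      # atom count
        "BalabanJ": Descriptors.BalabanJ(mol),      # topological index
    }


fenfuram = describe("Cc1occc1C(=O)Nc2ccccc2")
trifluorotoluene = describe("FC(F)(F)c1ccccc1")
print(fenfuram)
print(trifluorotoluene)
```

Note that `MolLogP` is an estimate derived from atom contributions, so it illustrates the caveat above: it may differ from an experimentally determined logP.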

Several packages are freely available for computing descriptors. In this notebook, we use the RDKit package to compute descriptors for the structures in the [Delaney (solubility) dataset](https://www.kaggle.com/c/drug-solubility-challenge/data). RDKit provides roughly 200 descriptors; the code here computes the first 100 of them.
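To check how many descriptors the installed RDKit exposes, one can inspect `Descriptors.descList`, a list of `(name, function)` pairs. This is only a sketch: the exact count, and possibly the order of the entries, may differ between RDKit versions, so a slice such as `[:100]` may not select the same descriptors everywhere.

```python
from rdkit.Chem import Descriptors

# Each entry of descList is a (name, function) pair
all_names = [name for name, _ in Descriptors.descList]
print(f"Descriptors available in this RDKit version: {len(all_names)}")

# Same kind of slice as used in this notebook: the first 100 names
first_100 = all_names[:100]
print(first_100[:5])
```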

Importing the required modules

In [1]:
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.ML.Descriptors import MoleculeDescriptors

Loading the dataset from a .csv file into a pandas DataFrame

In [2]:
df = pd.read_csv("delaney-processed.csv")
# Show the first 10 rows
df.head(10)

| | Compound ID | ESOL predicted log solubility in mols per litre | Minimum Degree | Molecular Weight | Number of H-Bond Donors | Number of Rings | Number of Rotatable Bonds | Polar Surface Area | measured log solubility in mols per litre | smiles |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Amigdalin | -0.974 | 1 | 457.432 | 7 | 3 | 7 | 202.32 | -0.77 | OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)... |
| 1 | Fenfuram | -2.885 | 1 | 201.225 | 1 | 2 | 2 | 42.24 | -3.3 | Cc1occc1C(=O)Nc2ccccc2 |
| 2 | citral | -2.579 | 1 | 152.237 | 0 | 0 | 4 | 17.07 | -2.06 | CC(C)=CCCC(C)=CC(=O) |
| 3 | Picene | -6.618 | 2 | 278.354 | 0 | 5 | 0 | 0.0 | -7.87 | c1ccc2c(c1)ccc3c2ccc4c5ccccc5ccc43 |
| 4 | Thiophene | -2.232 | 2 | 84.143 | 0 | 1 | 0 | 0.0 | -1.33 | c1ccsc1 |
| 5 | benzothiazole | -2.733 | 2 | 135.191 | 0 | 2 | 0 | 12.89 | -1.5 | c2ccc1scnc1c2 |
| 6 | 2,2,4,6,6'-PCB | -6.545 | 1 | 326.437 | 0 | 2 | 1 | 0.0 | -7.32 | Clc1cc(Cl)c(c(Cl)c1)c2c(Cl)cccc2Cl |
| 7 | Estradiol | -4.138 | 1 | 272.388 | 2 | 4 | 0 | 40.46 | -5.03 | CC12CCC3C(CCc4cc(O)ccc34)C2CCC1O |
| 8 | Dieldrin | -4.533 | 1 | 380.913 | 0 | 5 | 0 | 12.53 | -6.29 | ClC4=C(Cl)C5(Cl)C3C1CC(C2OC12)C3C4(Cl)C5(Cl)Cl |
| 9 | Rotenone | -5.246 | 1 | 394.423 | 0 | 5 | 3 | 63.22 | -4.42 | COc5cc4OCC3Oc2c1CC(Oc1ccc2C(=O)C3c4cc5OC)C(C)=C |


We keep only the column containing the SMILES strings and the response variable (*measured log solubility in mols per litre*)

In [3]:
df = df[["smiles", "measured log solubility in mols per litre"]]
# Rename the columns
df.columns = ["SMILES", "Solubilidade_medida"]

Computing the descriptors

In [4]:
# This option makes pandas display every column of a DataFrame
pd.set_option('display.max_columns', None)

In [5]:
# Collect the names of the first 100 descriptors in Descriptors.descList
names = [name[0] for name in Descriptors.descList][:100]
# Compute the descriptors for each molecule and store them in a list
calc = MoleculeDescriptors.MolecularDescriptorCalculator(names)
descs = [calc.CalcDescriptors(Chem.MolFromSmiles(cpd)) for cpd in df["SMILES"]]
# Convert the list to a DataFrame and add a column with the response variable
descriptors_df = pd.DataFrame(descs, columns=names)
descriptors_df["Solubilidade_medida"] = df["Solubilidade_medida"]

descriptors_df.head()

Unnamed: 0,MaxEStateIndex,MinEStateIndex,MaxAbsEStateIndex,MinAbsEStateIndex,qed,MolWt,HeavyAtomMolWt,ExactMolWt,NumValenceElectrons,NumRadicalElectrons,MaxPartialCharge,MinPartialCharge,MaxAbsPartialCharge,MinAbsPartialCharge,FpDensityMorgan1,FpDensityMorgan2,FpDensityMorgan3,BalabanJ,BertzCT,Chi0,Chi0n,Chi0v,Chi1,Chi1n,Chi1v,Chi2n,Chi2v,Chi3n,Chi3v,Chi4n,Chi4v,HallKierAlpha,Ipc,Kappa1,Kappa2,Kappa3,LabuteASA,PEOE_VSA1,PEOE_VSA10,PEOE_VSA11,PEOE_VSA12,PEOE_VSA13,PEOE_VSA14,PEOE_VSA2,PEOE_VSA3,PEOE_VSA4,PEOE_VSA5,PEOE_VSA6,PEOE_VSA7,PEOE_VSA8,PEOE_VSA9,SMR_VSA1,SMR_VSA10,SMR_VSA2,SMR_VSA3,SMR_VSA4,SMR_VSA5,SMR_VSA6,SMR_VSA7,SMR_VSA8,SMR_VSA9,SlogP_VSA1,SlogP_VSA10,SlogP_VSA11,SlogP_VSA12,SlogP_VSA2,SlogP_VSA3,SlogP_VSA4,SlogP_VSA5,SlogP_VSA6,SlogP_VSA7,SlogP_VSA8,SlogP_VSA9,TPSA,EState_VSA1,EState_VSA10,EState_VSA11,EState_VSA2,EState_VSA3,EState_VSA4,EState_VSA5,EState_VSA6,EState_VSA7,EState_VSA8,EState_VSA9,VSA_EState1,VSA_EState10,VSA_EState2,VSA_EState3,VSA_EState4,VSA_EState5,VSA_EState6,VSA_EState7,VSA_EState8,VSA_EState9,FractionCSP3,HeavyAtomCount,NHOHCount,NOCount,NumAliphaticCarbocycles,Solubilidade_medida
0,10.253329,-1.701605,10.253329,0.486602,0.217518,457.432,430.216,457.158411,178,0,0.188266,-0.393567,0.393567,0.188266,0.8125,1.375,1.96875,1.654937,759.662938,23.413485,16.86252,16.86252,15.277295,9.998816,9.998816,7.601218,7.601218,5.431494,5.431494,3.50693,3.50693,-1.73,12121200.0,24.903474,10.926356,5.251706,182.935327,54.693144,48.831731,18.68402,0.0,0.0,0.0,0.0,0.0,5.261892,0.0,30.331835,5.563451,0.0,19.282985,54.693144,0.0,5.261892,0.0,0.0,67.515751,13.213764,35.895287,0.0,6.069221,0.0,0.0,0.0,0.0,110.37124,18.947452,11.331113,11.667418,30.331835,0.0,0.0,0.0,202.32,80.729515,41.007583,0.0,0.0,5.563451,0.0,0.0,30.331835,6.069221,0.0,18.947452,21.592355,0.0,0.0,79.055254,0.486602,0.0,10.31518,-16.476443,-1.139614,0.0,0.65,32,7,12,0,-0.77
1,11.724911,-0.14588,11.724911,0.14588,0.811283,201.225,190.137,201.078979,76,0,0.258698,-0.468799,0.468799,0.258698,1.2,1.933333,2.533333,2.148162,459.484175,10.673362,8.357948,8.357948,7.270857,4.676643,4.676643,3.210611,3.210611,2.135103,2.135103,1.340444,1.340444,-2.03,4231.896,9.52216,4.002882,2.070849,87.724095,9.73394,5.760247,0.0,0.0,5.90718,0.0,4.794537,0.0,0.0,0.0,18.199101,25.122838,5.687386,11.826614,9.211688,11.594566,0.0,0.0,0.0,6.923737,5.316789,53.985064,0.0,0.0,5.316789,5.687386,0.0,0.0,5.90718,0.0,6.923737,16.118236,47.078516,0.0,0.0,0.0,42.24,0.0,4.794537,0.0,5.90718,11.323699,5.687386,6.263163,12.990104,30.331835,5.316789,4.417151,5.062498,0.0,11.724911,2.786204,1.349031,0.482297,10.988763,1.509167,1.763796,0.0,0.083333,15,1,3,0,-3.3
2,10.020498,0.84509,10.020498,0.84509,0.343706,152.237,136.109,152.120115,62,0,0.142281,-0.298566,0.298566,0.142281,1.272727,1.909091,2.363636,3.62576,171.311799,8.690234,7.554513,7.554513,5.163902,3.908188,3.908188,2.969252,2.969252,1.44382,1.44382,0.788002,0.788002,-0.85,203.6951,10.15,5.899351,7.042356,68.806046,0.0,6.286161,0.0,0.0,0.0,0.0,4.794537,0.0,0.0,0.0,17.222229,39.688875,0.0,0.0,4.794537,6.286161,0.0,0.0,0.0,33.612855,0.0,23.298249,0.0,0.0,0.0,0.0,0.0,0.0,6.286161,4.794537,0.0,33.612855,23.298249,0.0,0.0,0.0,17.07,0.0,4.794537,0.0,0.0,0.0,24.700908,5.573105,6.07602,6.923737,19.923495,0.0,0.0,0.0,10.020498,0.0,2.48741,0.0,0.0,6.681799,6.143627,0.0,0.5,11,0,1,0,-2.06
3,2.270278,1.301055,2.270278,1.301055,0.291526,278.354,264.242,278.10955,102,0,-0.009873,-0.061629,0.061629,0.009873,0.272727,0.636364,1.136364,2.041379,1071.547817,14.518297,12.082904,12.082904,10.915816,7.636751,7.636751,5.829201,5.829201,4.648219,4.648219,3.586716,3.586716,-2.86,296139.6,11.762233,4.315741,1.523286,128.158061,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,84.929139,43.089794,0.0,0.0,0.0,43.089794,0.0,0.0,0.0,0.0,0.0,84.929139,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,84.929139,0.0,43.089794,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,43.089794,0.0,0.0,84.929139,0.0,0.0,0.0,0.0,10.598207,0.0,0.0,30.735126,0.0,0.0,0.0,0.0,22,0,0,0,-7.87
4,2.041667,1.712963,2.041667,1.712963,0.448927,84.143,80.111,84.003371,26,0,-0.009338,-0.152454,0.152454,0.009338,1.0,1.6,1.8,3.125,60.124818,3.535534,2.717649,3.534146,2.5,1.471405,2.414214,0.793148,1.609645,0.425381,1.05392,0.226805,0.680414,-0.3,22.88644,2.912766,1.22105,0.484065,35.071766,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,11.336786,0.0,12.132734,10.760126,0.0,0.0,0.0,11.336786,0.0,0.0,0.0,0.0,0.0,22.89286,0.0,0.0,0.0,0.0,0.0,11.336786,0.0,0.0,0.0,0.0,22.89286,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,11.336786,22.89286,0.0,0.0,0.0,1.712963,0.0,4.083333,0.0,0.0,4.037037,0.0,0.0,0.0,0.0,5,0,0,0,-1.33
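The calculation above assumes that every SMILES string can be parsed. `Chem.MolFromSmiles` returns `None` for an invalid string, so on other datasets it may be worth skipping such entries. The sketch below shows one possible way to do that. The function `compute_descriptors`, the three descriptor names, and the deliberately invalid string are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd
from rdkit import Chem
from rdkit.ML.Descriptors import MoleculeDescriptors


def compute_descriptors(smiles_list, names):
    """Compute descriptors, skipping SMILES that RDKit cannot parse."""
    calc = MoleculeDescriptors.MolecularDescriptorCalculator(names)
    rows, kept = [], []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)  # None if parsing fails
        if mol is None:
            continue
        rows.append(calc.CalcDescriptors(mol))
        kept.append(smi)
    # Index by SMILES so rows can be matched back to the source data
    return pd.DataFrame(rows, columns=names, index=kept)


names = ["MolWt", "TPSA", "NumHDonors"]
smiles = ["c1ccsc1", "not_a_smiles", "Cc1occc1C(=O)Nc2ccccc2"]
demo = compute_descriptors(smiles, names)
print(demo)
```

RDKit logs a parse error for the invalid entry, and the resulting DataFrame contains only the two valid molecules. Indexing by SMILES is one design choice among several; it keeps the link to the response variable explicit when rows are dropped.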


In [6]:
# Optional: save the DataFrame as a semicolon-separated .csv file
descriptors_df.to_csv("Delaney_descriptors.csv", sep=";", index=False)
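Because the file is written with `sep=";"`, the same separator has to be passed when reading it back. The sketch below demonstrates the round trip on a small in-memory DataFrame, using an `io.StringIO` buffer in place of a file on disk. The two rows reuse values shown earlier for thiophene and Fenfuram; the variable names are illustrative.

```python
import io

import pandas as pd

# Small stand-in for descriptors_df
demo = pd.DataFrame(
    {"MolWt": [84.143, 201.225], "Solubilidade_medida": [-1.33, -3.3]}
)

# Write with the same options used above, into a buffer instead of a file
buffer = io.StringIO()
demo.to_csv(buffer, sep=";", index=False)

# Read it back; without sep=";" each row would be parsed as a single column
buffer.seek(0)
restored = pd.read_csv(buffer, sep=";")
print(restored)
```

For the real file, the equivalent call would be `pd.read_csv("Delaney_descriptors.csv", sep=";")`.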