<a href="https://colab.research.google.com/github/rromerov/Proyecto_Integrador/blob/main/Avance2/Avance2.12.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Instituto Tecnológico y de Estudios Superiores de Monterrey
## Maestría en Inteligencia Artificial Aplicada
### Proyecto Integrador (Gpo 10) - TC5035.10

### **Project: Accelerated Drug Design**

### Milestone 2: Feature Engineering

#### **Instructors:**
- Dra. Grettel Barceló Alonso - Lead Professor
- Dr. Luis Eduardo Falcón Morales - Lead Professor
- Dr. Ricardo Ambrocio Ramírez Mendoza - Tutor Professor

#### **Team members:**
- Ernesto Enríquez Rubio - A01228409
- Roberto Romero Vielma - A00822314
- Herbert Joadan Romero Villarreal - A01794199

### Combine three columns (molecule_chembl_id, canonical_smiles, standard_value) and bioactivity_class into one DataFrame

* **molecule_chembl_id**: Unique identifier of each molecule in the dataset. It is needed to tell molecules apart and to run operations keyed on a specific compound.

* **canonical_smiles**: The canonical SMILES is a unique, standardized text representation of a molecule's chemical structure. It is the input for chemical analyses and for comparing structures across the molecules in the dataset.

* **standard_value**: Numeric measurement associated with each molecule, such as the biological activity of a compound (for example, the half-maximal inhibitory concentration, IC50, from biological assays). It is the key column for analyzing bioactivity and for making quantitative comparisons between molecules.

Selecting these columns focuses the analysis on molecule identity, chemical structure, and the standard measurement of biological activity.

The bioactivity class is assigned in the following steps by applying thresholds to standard_value.

In [67]:
! pip install rdkit



In [68]:
import pandas as pd
import numpy as np
from google.colab import drive
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski, AllChem
from sklearn.preprocessing import StandardScaler
import os

In [69]:
# Mount Google Drive in the notebook
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [70]:
ruta_archivo = '/content/drive/My Drive/Colab Notebooks/data/bioactivity_data_preprocessed.csv'

# Read the CSV file into a DataFrame
df = pd.read_csv(ruta_archivo)

# Confirm that the file was imported correctly
df.head(10)

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL212560,CN(C)CCCOc1cccc2c(C#N)c(-c3ccccc3)c(NC3CCCCC3)n12,79000.0
1,CHEMBL386641,CN(C)CCNc1cccc2c(C#N)c(-c3ccccc3)c(NC3CCCCC3)n12,42000.0
2,CHEMBL425440,N#Cc1c(-c2ccccc2)c(Nc2cccc(N)c2)n2c(Cl)cccc12,80000.0
3,CHEMBL436932,CN(C)CCCNc1c(-c2ccccc2)c(C#N)c2cccc(Cl)n12,62000.0
4,CHEMBL213321,Cc1cccc(C)c1Nc1c(-c2ccccc2)c(C#N)c2cccc(Cl)n12,60000.0
5,CHEMBL379547,CN(C)c1ccc(Nc2c(-c3ccccc3)c(C#N)c3cccc(Cl)n23)cc1,47000.0
6,CHEMBL215998,COCCNc1c(-c2ccccc2)c(C#N)c2cccc(Cl)n12,9000.0
7,CHEMBL411137,CN(CCNc1c(-c2ccccc2)c(C#N)c2cccc(Cl)n12)Cc1ccccc1,4000.0
8,CHEMBL386808,N#Cc1c(-c2ccccc2)c(NCCc2ccc(Cl)cc2)n2c(Cl)cccc12,4000.0
9,CHEMBL214248,N#Cc1c(-c2ccccc2)c(NCCc2ccccc2Cl)n2c(Cl)cccc12,3000.0


In [71]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df2 = df[selection]

df2.head(10)

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL212560,CN(C)CCCOc1cccc2c(C#N)c(-c3ccccc3)c(NC3CCCCC3)n12,79000.0
1,CHEMBL386641,CN(C)CCNc1cccc2c(C#N)c(-c3ccccc3)c(NC3CCCCC3)n12,42000.0
2,CHEMBL425440,N#Cc1c(-c2ccccc2)c(Nc2cccc(N)c2)n2c(Cl)cccc12,80000.0
3,CHEMBL436932,CN(C)CCCNc1c(-c2ccccc2)c(C#N)c2cccc(Cl)n12,62000.0
4,CHEMBL213321,Cc1cccc(C)c1Nc1c(-c2ccccc2)c(C#N)c2cccc(Cl)n12,60000.0
5,CHEMBL379547,CN(C)c1ccc(Nc2c(-c3ccccc3)c(C#N)c3cccc(Cl)n23)cc1,47000.0
6,CHEMBL215998,COCCNc1c(-c2ccccc2)c(C#N)c2cccc(Cl)n12,9000.0
7,CHEMBL411137,CN(CCNc1c(-c2ccccc2)c(C#N)c2cccc(Cl)n12)Cc1ccccc1,4000.0
8,CHEMBL386808,N#Cc1c(-c2ccccc2)c(NCCc2ccc(Cl)cc2)n2c(Cl)cccc12,4000.0
9,CHEMBL214248,N#Cc1c(-c2ccccc2)c(NCCc2ccccc2Cl)n2c(Cl)cccc12,3000.0


### Labeling compounds
The bioactivity data are IC50 values in nM. Compounds with an IC50 of 1,000 nM or less are labeled **active**, those with an IC50 of 10,000 nM or more are labeled **inactive**, and values strictly between 1,000 and 10,000 nM are labeled **intermediate**.

In [72]:
bioactivity_threshold = []
for i in df2.standard_value:
  if float(i) >= 10000:
    bioactivity_threshold.append("inactive")
  elif float(i) <= 1000:
    bioactivity_threshold.append("active")
  else:
    bioactivity_threshold.append("intermediate")
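
The loop above can also be written as a vectorized operation. The following is a minimal sketch rather than the notebook's own code: the function name `label_bioactivity` and the example values are hypothetical, and only the two thresholds come from the text. It returns a Series that keeps the index of its input.

```python
import numpy as np
import pandas as pd

def label_bioactivity(ic50_nM: pd.Series) -> pd.Series:
    """Label IC50 values (in nM) as active, inactive, or intermediate."""
    # Conditions are checked in order; anything matching neither is intermediate
    conditions = [ic50_nM >= 10000, ic50_nM <= 1000]
    labels = ["inactive", "active"]
    classes = np.select(conditions, labels, default="intermediate")
    # Reuse the input index so the labels stay attached to the right rows
    return pd.Series(classes, index=ic50_nM.index, name="class")

# Hypothetical example values, including both boundary cases
example = pd.Series([79000.0, 9000.0, 609.0, 1000.0, 10000.0])
print(label_bioactivity(example).tolist())
```

Because the result carries the same index as the input, it can be assigned directly as a new column without relying on positional order.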

In [73]:
# Wrap the generated list in a pandas Series and append it to the DataFrame
bioactivity_class = pd.Series(bioactivity_threshold, name='class')
df3 = pd.concat([df2, bioactivity_class], axis=1)


df3.head(10)

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class
0,CHEMBL212560,CN(C)CCCOc1cccc2c(C#N)c(-c3ccccc3)c(NC3CCCCC3)n12,79000.0,inactive
1,CHEMBL386641,CN(C)CCNc1cccc2c(C#N)c(-c3ccccc3)c(NC3CCCCC3)n12,42000.0,inactive
2,CHEMBL425440,N#Cc1c(-c2ccccc2)c(Nc2cccc(N)c2)n2c(Cl)cccc12,80000.0,inactive
3,CHEMBL436932,CN(C)CCCNc1c(-c2ccccc2)c(C#N)c2cccc(Cl)n12,62000.0,inactive
4,CHEMBL213321,Cc1cccc(C)c1Nc1c(-c2ccccc2)c(C#N)c2cccc(Cl)n12,60000.0,inactive
5,CHEMBL379547,CN(C)c1ccc(Nc2c(-c3ccccc3)c(C#N)c3cccc(Cl)n23)cc1,47000.0,inactive
6,CHEMBL215998,COCCNc1c(-c2ccccc2)c(C#N)c2cccc(Cl)n12,9000.0,intermediate
7,CHEMBL411137,CN(CCNc1c(-c2ccccc2)c(C#N)c2cccc(Cl)n12)Cc1ccccc1,4000.0,intermediate
8,CHEMBL386808,N#Cc1c(-c2ccccc2)c(NCCc2ccc(Cl)cc2)n2c(Cl)cccc12,4000.0,intermediate
9,CHEMBL214248,N#Cc1c(-c2ccccc2)c(NCCc2ccccc2Cl)n2c(Cl)cccc12,3000.0,intermediate


## Compute Lipinski descriptors

### Lipinski's rule


Lipinski's rule sets the following criteria for assessing whether a molecule is a suitable drug candidate:

1. **Molecular weight (MW):** MW ≤ 500
2. **LogP (octanol-water partition coefficient):** LogP ≤ 5
3. **Number of hydrogen-bond donors (HBD):** HBD ≤ 5
4. **Number of hydrogen-bond acceptors (HBA):** HBA ≤ 10
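
The four criteria can be checked mechanically once the descriptors are known. The sketch below is an illustration under stated assumptions: the helper `lipinski_violations` is hypothetical and not part of the notebook, and the two sets of descriptor values are copied from rows 0 and 93 of the descriptor table computed in this notebook.

```python
def lipinski_violations(mw, logp, hbd, hba):
    """Count how many of the four Lipinski criteria a molecule breaks."""
    broken = [
        mw > 500,    # molecular weight above 500
        logp > 5,    # LogP above 5
        hbd > 5,     # more than 5 hydrogen-bond donors
        hba > 10,    # more than 10 hydrogen-bond acceptors
    ]
    return sum(broken)

# Row 0: MW=416.569, LogP=5.55308, HBD=1, HBA=5 (only LogP is out of range)
print(lipinski_violations(416.569, 5.55308, 1, 5))    # 1
# Row 93: MW=688.835, LogP=-2.51836, HBD=12, HBA=9 (MW and HBD are out of range)
print(lipinski_violations(688.835, -2.51836, 12, 9))  # 2
```

The rule is commonly read as tolerating at most one violation, so a count like this is usually more informative than a single pass/fail flag.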

In [74]:
def lipinski(smiles, verbose=False):
  # Compute the four Lipinski descriptors for every SMILES string
  rows = []
  for element in smiles:
    mol = Chem.MolFromSmiles(element)
    if mol is None:
      # Fail early with a clear message if RDKit cannot parse the SMILES
      raise ValueError(f"Invalid SMILES: {element}")

    rows.append([Descriptors.MolWt(mol),
                 Descriptors.MolLogP(mol),
                 Lipinski.NumHDonors(mol),
                 Lipinski.NumHAcceptors(mol)])

  # Building the DataFrame from a list of rows also works for a single molecule
  columNames = ['MW','LogP','NumHDonors','NumHAcceptors']
  descriptors = pd.DataFrame(data=rows, columns=columNames, dtype=float)

  return descriptors

In [75]:
df_lipinski = lipinski(df3.canonical_smiles)

## Combine DataFrames

In [76]:
# Display the DataFrame with the computed descriptors
df_lipinski

Unnamed: 0,MW,LogP,NumHDonors,NumHAcceptors
0,416.569,5.55308,1.0,5.0
1,401.558,5.19608,2.0,5.0
2,358.832,5.45718,2.0,4.0
3,352.869,4.49498,1.0,4.0
4,371.871,6.49182,1.0,3.0
...,...,...,...,...
93,688.835,-2.51836,12.0,9.0
94,436.566,4.38742,1.0,8.0
95,450.593,4.69584,1.0,8.0
96,452.565,4.09302,2.0,9.0


In [77]:
combined_df = pd.concat([df3, df_lipinski], axis=1)
combined_df.head(10)

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class,MW,LogP,NumHDonors,NumHAcceptors
0,CHEMBL212560,CN(C)CCCOc1cccc2c(C#N)c(-c3ccccc3)c(NC3CCCCC3)n12,79000.0,inactive,416.569,5.55308,1.0,5.0
1,CHEMBL386641,CN(C)CCNc1cccc2c(C#N)c(-c3ccccc3)c(NC3CCCCC3)n12,42000.0,inactive,401.558,5.19608,2.0,5.0
2,CHEMBL425440,N#Cc1c(-c2ccccc2)c(Nc2cccc(N)c2)n2c(Cl)cccc12,80000.0,inactive,358.832,5.45718,2.0,4.0
3,CHEMBL436932,CN(C)CCCNc1c(-c2ccccc2)c(C#N)c2cccc(Cl)n12,62000.0,inactive,352.869,4.49498,1.0,4.0
4,CHEMBL213321,Cc1cccc(C)c1Nc1c(-c2ccccc2)c(C#N)c2cccc(Cl)n12,60000.0,inactive,371.871,6.49182,1.0,3.0
5,CHEMBL379547,CN(C)c1ccc(Nc2c(-c3ccccc3)c(C#N)c3cccc(Cl)n23)cc1,47000.0,inactive,386.886,5.94098,1.0,4.0
6,CHEMBL215998,COCCNc1c(-c2ccccc2)c(C#N)c2cccc(Cl)n12,9000.0,intermediate,325.799,4.18968,1.0,4.0
7,CHEMBL411137,CN(CCNc1c(-c2ccccc2)c(C#N)c2cccc(Cl)n12)Cc1ccccc1,4000.0,intermediate,414.94,5.67528,1.0,4.0
8,CHEMBL386808,N#Cc1c(-c2ccccc2)c(NCCc2ccc(Cl)cc2)n2c(Cl)cccc12,4000.0,intermediate,406.316,6.43938,1.0,3.0
9,CHEMBL214248,N#Cc1c(-c2ccccc2)c(NCCc2ccccc2Cl)n2c(Cl)cccc12,3000.0,intermediate,406.316,6.43938,1.0,3.0


## Convert IC50 to pIC50

To obtain more uniformly distributed data, **IC50** is converted to its negative logarithmic scale, which is essentially ${-\log_{10}(IC_{50})}$ with IC50 expressed in molar units.

A function **pIC50** is defined that takes a DataFrame as input and does the following:

* Takes the IC50 values from the **standard_value_norm** column (the capped copy of **standard_value**) and converts them from nM to M by multiplying each value by ${10^{-9}}$.
* Takes the molar value and applies ${-\log_{10}}$.
* Drops the **standard_value_norm** column and adds a new column named **pIC50**.
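
As a worked example of the conversion, the sketch below applies the two arithmetic steps to a single value using only the standard library. The function name `pic50_from_nM` is hypothetical; the input 79000.0 nM is the first IC50 value in this dataset.

```python
import math

def pic50_from_nM(ic50_nM):
    """Convert one IC50 value in nM to pIC50."""
    molar = ic50_nM * 1e-9      # nM to M
    return -math.log10(molar)   # negative base-10 logarithm

print(round(pic50_from_nM(79000.0), 6))  # 4.102373
print(round(pic50_from_nM(1000.0), 6))   # 6.0
```

Since ${-\log_{10}(x \cdot 10^{-9}) = 9 - \log_{10}(x)}$, a lower IC50 (a more potent compound) gives a higher pIC50, and the 1,000 nM activity threshold corresponds to a pIC50 of 6.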

In [78]:
def pIC50(df_in):
  # Work on a copy so the caller's DataFrame is not modified
  x = df_in.copy()
  # Convert nM to M, then take the negative base-10 logarithm
  x['pIC50'] = -np.log10(x['standard_value_norm'] * (10**-9))
  x = x.drop(columns='standard_value_norm')
  return x

Values above 100,000,000 nM are capped at 100,000,000 nM. Without a cap, extremely large IC50 values (above ${10^{9}}$ nM, that is, 1 M) would produce negative pIC50 values; with the cap, pIC50 never falls below 1.

In [79]:
combined_df.standard_value.describe()

count        98.000000
mean      29928.472755
std       39451.145929
min         609.000000
25%        5550.000000
50%       14000.000000
75%       37750.000000
max      187000.000000
Name: standard_value, dtype: float64

The cap is not needed for this dataset, whose maximum is 187,000 nM, but the logic is kept to guard against such values in future data.
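
The effect of the cap can be seen on a few single values. This is a small sketch with hypothetical names (`CAP_NM`, `capped_pic50`); 187000.0 nM is the maximum of this dataset, while the other two inputs are made-up extreme values.

```python
import math

CAP_NM = 100_000_000  # cap described in the text, in nM

def capped_pic50(ic50_nM, cap=CAP_NM):
    """pIC50 of an IC50 value in nM after capping it at `cap`."""
    return -math.log10(min(ic50_nM, cap) * 1e-9)

for value in (187000.0, 1e8, 1e10):
    uncapped = -math.log10(value * 1e-9)
    print(value, round(uncapped, 3), round(capped_pic50(value), 3))
```

The dataset maximum is unaffected by the cap, a value exactly at the cap gives a pIC50 of 1, and the made-up value of ${10^{10}}$ nM would give -1 without the cap but 1 with it.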

In [80]:
def norm_value(df_in):
    # Work on a copy so the caller's DataFrame is not modified
    x = df_in.copy()

    # Cap IC50 at 100,000,000 nM so that pIC50 never falls below 1
    x['standard_value_norm'] = x['standard_value'].clip(upper=100000000)
    x = x.drop(columns='standard_value')

    return x

First, apply the norm_value function so that the values in the standard_value column are capped.

In [81]:
df_norm = norm_value(combined_df)
df_norm.head(5)

Unnamed: 0,molecule_chembl_id,canonical_smiles,class,MW,LogP,NumHDonors,NumHAcceptors,standard_value_norm
0,CHEMBL212560,CN(C)CCCOc1cccc2c(C#N)c(-c3ccccc3)c(NC3CCCCC3)n12,inactive,416.569,5.55308,1.0,5.0,79000.0
1,CHEMBL386641,CN(C)CCNc1cccc2c(C#N)c(-c3ccccc3)c(NC3CCCCC3)n12,inactive,401.558,5.19608,2.0,5.0,42000.0
2,CHEMBL425440,N#Cc1c(-c2ccccc2)c(Nc2cccc(N)c2)n2c(Cl)cccc12,inactive,358.832,5.45718,2.0,4.0,80000.0
3,CHEMBL436932,CN(C)CCCNc1c(-c2ccccc2)c(C#N)c2cccc(Cl)n12,inactive,352.869,4.49498,1.0,4.0,62000.0
4,CHEMBL213321,Cc1cccc(C)c1Nc1c(-c2ccccc2)c(C#N)c2cccc(Cl)n12,inactive,371.871,6.49182,1.0,3.0,60000.0


The next step is to convert the capped IC50 column to pIC50.

In [82]:
df_plc50 = pIC50(df_norm)
df_plc50.head(10)

Unnamed: 0,molecule_chembl_id,canonical_smiles,class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,CHEMBL212560,CN(C)CCCOc1cccc2c(C#N)c(-c3ccccc3)c(NC3CCCCC3)n12,inactive,416.569,5.55308,1.0,5.0,4.102373
1,CHEMBL386641,CN(C)CCNc1cccc2c(C#N)c(-c3ccccc3)c(NC3CCCCC3)n12,inactive,401.558,5.19608,2.0,5.0,4.376751
2,CHEMBL425440,N#Cc1c(-c2ccccc2)c(Nc2cccc(N)c2)n2c(Cl)cccc12,inactive,358.832,5.45718,2.0,4.0,4.09691
3,CHEMBL436932,CN(C)CCCNc1c(-c2ccccc2)c(C#N)c2cccc(Cl)n12,inactive,352.869,4.49498,1.0,4.0,4.207608
4,CHEMBL213321,Cc1cccc(C)c1Nc1c(-c2ccccc2)c(C#N)c2cccc(Cl)n12,inactive,371.871,6.49182,1.0,3.0,4.221849
5,CHEMBL379547,CN(C)c1ccc(Nc2c(-c3ccccc3)c(C#N)c3cccc(Cl)n23)cc1,inactive,386.886,5.94098,1.0,4.0,4.327902
6,CHEMBL215998,COCCNc1c(-c2ccccc2)c(C#N)c2cccc(Cl)n12,intermediate,325.799,4.18968,1.0,4.0,5.045757
7,CHEMBL411137,CN(CCNc1c(-c2ccccc2)c(C#N)c2cccc(Cl)n12)Cc1ccccc1,intermediate,414.94,5.67528,1.0,4.0,5.39794
8,CHEMBL386808,N#Cc1c(-c2ccccc2)c(NCCc2ccc(Cl)cc2)n2c(Cl)cccc12,intermediate,406.316,6.43938,1.0,3.0,5.39794
9,CHEMBL214248,N#Cc1c(-c2ccccc2)c(NCCc2ccccc2Cl)n2c(Cl)cccc12,intermediate,406.316,6.43938,1.0,3.0,5.522879


In [83]:
df_plc50.pIC50.describe()

count    98.000000
mean      4.846317
std       0.555514
min       3.728158
25%       4.423832
50%       4.853872
75%       5.256167
max       6.215383
Name: pIC50, dtype: float64

## Remove the intermediate bioactivity class

Dropping the intermediate class from the ChEMBL data for the VEGF165 protein leaves two clearly separated groups, active and inactive. This simplifies the analysis and focuses it on compounds whose activity label is unambiguous, at the cost of reducing the dataset from 98 to 62 compounds.

In [84]:
df_plc50_two_classes = df_plc50[df_plc50['class'] != 'intermediate']
df_plc50_two_classes.head(10)

Unnamed: 0,molecule_chembl_id,canonical_smiles,class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,CHEMBL212560,CN(C)CCCOc1cccc2c(C#N)c(-c3ccccc3)c(NC3CCCCC3)n12,inactive,416.569,5.55308,1.0,5.0,4.102373
1,CHEMBL386641,CN(C)CCNc1cccc2c(C#N)c(-c3ccccc3)c(NC3CCCCC3)n12,inactive,401.558,5.19608,2.0,5.0,4.376751
2,CHEMBL425440,N#Cc1c(-c2ccccc2)c(Nc2cccc(N)c2)n2c(Cl)cccc12,inactive,358.832,5.45718,2.0,4.0,4.09691
3,CHEMBL436932,CN(C)CCCNc1c(-c2ccccc2)c(C#N)c2cccc(Cl)n12,inactive,352.869,4.49498,1.0,4.0,4.207608
4,CHEMBL213321,Cc1cccc(C)c1Nc1c(-c2ccccc2)c(C#N)c2cccc(Cl)n12,inactive,371.871,6.49182,1.0,3.0,4.221849
5,CHEMBL379547,CN(C)c1ccc(Nc2c(-c3ccccc3)c(C#N)c3cccc(Cl)n23)cc1,inactive,386.886,5.94098,1.0,4.0,4.327902
11,CHEMBL214266,N#Cc1c(C2CCNCC2)c(NCCc2ccccc2)n2c(Cl)cccc12,inactive,378.907,4.58598,2.0,4.0,4.022276
12,CHEMBL215632,Cc1cnc(-c2c(C#N)c3cccc(Cl)n3c2NCCc2ccccc2)s1,inactive,392.915,5.5509,1.0,5.0,4.657577
13,CHEMBL427044,CC(C)(C)c1ccc(-c2c(C#N)c3cccc(Cl)n3c2NCCc2cccc...,inactive,427.979,7.08348,1.0,3.0,4.853872
14,CHEMBL211986,N#Cc1c(CCO)c(NCCc2ccccc2)n2c(Cl)cccc12,inactive,339.826,3.65378,2.0,4.0,4.886057


Convert the class column to a numeric binary label.

In [85]:
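# Replace "active" with 1 and "inactive" with 0
# Copy first so the assignment below does not write to a slice of df_plc50
df_plc50_two_classes = df_plc50_two_classes.copy()
df_plc50_two_classes.loc[:, 'class'] = df_plc50_two_classes['class'].replace({'active': 1, 'inactive': 0})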

df_plc50_two_classes.head(10)

Unnamed: 0,molecule_chembl_id,canonical_smiles,class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,CHEMBL212560,CN(C)CCCOc1cccc2c(C#N)c(-c3ccccc3)c(NC3CCCCC3)n12,0,416.569,5.55308,1.0,5.0,4.102373
1,CHEMBL386641,CN(C)CCNc1cccc2c(C#N)c(-c3ccccc3)c(NC3CCCCC3)n12,0,401.558,5.19608,2.0,5.0,4.376751
2,CHEMBL425440,N#Cc1c(-c2ccccc2)c(Nc2cccc(N)c2)n2c(Cl)cccc12,0,358.832,5.45718,2.0,4.0,4.09691
3,CHEMBL436932,CN(C)CCCNc1c(-c2ccccc2)c(C#N)c2cccc(Cl)n12,0,352.869,4.49498,1.0,4.0,4.207608
4,CHEMBL213321,Cc1cccc(C)c1Nc1c(-c2ccccc2)c(C#N)c2cccc(Cl)n12,0,371.871,6.49182,1.0,3.0,4.221849
5,CHEMBL379547,CN(C)c1ccc(Nc2c(-c3ccccc3)c(C#N)c3cccc(Cl)n23)cc1,0,386.886,5.94098,1.0,4.0,4.327902
11,CHEMBL214266,N#Cc1c(C2CCNCC2)c(NCCc2ccccc2)n2c(Cl)cccc12,0,378.907,4.58598,2.0,4.0,4.022276
12,CHEMBL215632,Cc1cnc(-c2c(C#N)c3cccc(Cl)n3c2NCCc2ccccc2)s1,0,392.915,5.5509,1.0,5.0,4.657577
13,CHEMBL427044,CC(C)(C)c1ccc(-c2c(C#N)c3cccc(Cl)n3c2NCCc2cccc...,0,427.979,7.08348,1.0,3.0,4.853872
14,CHEMBL211986,N#Cc1c(CCO)c(NCCc2ccccc2)n2c(Cl)cccc12,0,339.826,3.65378,2.0,4.0,4.886057


Standardize the column values

In [86]:
# Check the value range

valores_maximos = df_plc50_two_classes.max()
valores_minimos = df_plc50_two_classes.min()
# Print the maximum and minimum value of each column
for columna in df_plc50_two_classes.columns:
    if pd.api.types.is_numeric_dtype(df_plc50_two_classes[columna]):  # Check whether the column is numeric
        print(f"Column '{columna}':")
        print("  Maximum value:", valores_maximos[columna])
        print("  Minimum value:", valores_minimos[columna])

Column 'MW':
  Maximum value: 1595.918000000001
  Minimum value: 310.78799999999995
Column 'LogP':
  Maximum value: 10.153270000000008
  Minimum value: -4.985660000000015
Column 'NumHDonors':
  Maximum value: 15.0
  Minimum value: 1.0
Column 'NumHAcceptors':
  Maximum value: 16.0
  Minimum value: 3.0
Column 'pIC50':
  Maximum value: 6.215382707367125
  Minimum value: 3.728158393463501


In [87]:
df_standar = df_plc50_two_classes.copy()

In [88]:
# Select the columns to standardize
columnas_numericas = ['MW', 'LogP', 'NumHDonors', 'NumHAcceptors', 'pIC50']

# Create a StandardScaler object
scaler = StandardScaler()

# Standardize the selected columns (zero mean, unit variance per column)
df_standar[columnas_numericas] = scaler.fit_transform(df_standar[columnas_numericas])
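
For reference, the transformation applied to each column is the z-score, ${z = (x - \mu) / \sigma}$, where `StandardScaler` uses the population standard deviation (divisor n rather than n - 1). The sketch below reproduces that arithmetic for one column with the standard library only. The helper name `standardize` is hypothetical, and the input is just the first five MW values of the table, so the resulting numbers differ from the scaled values in this notebook, which are based on all 62 rows.

```python
import statistics

def standardize(values):
    """Z-score a list of numbers using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # divisor n, as in StandardScaler
    return [(v - mean) / std for v in values]

# First five MW values from the descriptor table (illustrative subset)
mw = [416.569, 401.558, 358.832, 352.869, 371.871]
z = standardize(mw)
print([round(v, 3) for v in z])
print(round(statistics.fmean(z), 6), round(statistics.pstdev(z), 6))
```

After the transformation the column has mean 0 and standard deviation 1, which is why the minimum and maximum values move to a range of a few units around zero.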

In [89]:
# Check the value range

valores_maximos = df_standar.max()
valores_minimos = df_standar.min()
# Print the maximum and minimum value of each column
for columna in df_standar.columns:
    if pd.api.types.is_numeric_dtype(df_standar[columna]):  # Check whether the column is numeric
        print(f"Column '{columna}':")
        print("  Maximum value:", valores_maximos[columna])
        print("  Minimum value:", valores_minimos[columna])

Column 'MW':
  Maximum value: 5.556083357421796
  Minimum value: -1.2484561310114908
Column 'LogP':
  Maximum value: 2.3498910709563305
  Minimum value: -1.4467959159700419
Column 'NumHDonors':
  Maximum value: 1.7678226297888884
  Minimum value: -1.2767607881808638
Column 'NumHAcceptors':
  Maximum value: 3.1558543971924524
  Minimum value: -1.6616051560975413
Column 'pIC50':
  Maximum value: 3.675211357077131
  Minimum value: -1.798315621699227


Transform the canonical SMILES column

In [90]:
df_canonical_rep = df_standar.copy()

In [91]:
selection = ['canonical_smiles','molecule_chembl_id']
df_selection = df_plc50_two_classes[selection]
df_selection.to_csv('/content/drive/My Drive/Colab Notebooks/data/molecule.smi', sep='\t', index=False, header=False)

In [92]:
! cat '/content/drive/My Drive/Colab Notebooks/data/molecule.smi' | head -5

CN(C)CCCOc1cccc2c(C#N)c(-c3ccccc3)c(NC3CCCCC3)n12	CHEMBL212560
CN(C)CCNc1cccc2c(C#N)c(-c3ccccc3)c(NC3CCCCC3)n12	CHEMBL386641
N#Cc1c(-c2ccccc2)c(Nc2cccc(N)c2)n2c(Cl)cccc12	CHEMBL425440
CN(C)CCCNc1c(-c2ccccc2)c(C#N)c2cccc(Cl)n12	CHEMBL436932
Cc1cccc(C)c1Nc1c(-c2ccccc2)c(C#N)c2cccc(Cl)n12	CHEMBL213321


In [93]:
! cat '/content/drive/My Drive/Colab Notebooks/data/molecule.smi' | wc -l

62


## Calcular descriptores

In [94]:
# Compute the fingerprint descriptors for one SMILES string.
# Despite the function name and the "PubchemFP" column prefix used below, this is an
# 881-bit Morgan fingerprint (radius 2), not the PubChem substructure fingerprint.
def calculate_pubchem_fingerprints(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=881)  # Morgan fingerprint, radius=2, nBits=881
    return fp

# Read molecule.smi (tab-separated, no header: SMILES first, then the ChEMBL ID)
data = pd.read_csv("/content/drive/My Drive/Colab Notebooks/data/molecule.smi", sep="\t", header=None, names=["SMILES", "ID"])

# Round-trip each SMILES through RDKit (parse, remove explicit hydrogens, write back)
# to obtain RDKit's canonical SMILES. This step does not strip salts.
data["SMILES"] = data["SMILES"].apply(Chem.MolFromSmiles)
data["SMILES"] = data["SMILES"].apply(Chem.RemoveHs)
data["SMILES"] = data["SMILES"].apply(Chem.MolToSmiles)

fingerprints = []
for smiles in data["SMILES"]:
    fp = calculate_pubchem_fingerprints(smiles)
    if fp is not None:
        # Convert the ExplicitBitVect object to a list of integers (0/1)
        arr = [int(x) for x in fp.ToBitString()]
        fingerprints.append(arr)

# Convert the fingerprint descriptors to a pandas DataFrame
fingerprints_df = pd.DataFrame(fingerprints)
fingerprints_df.columns = [f"PubchemFP{i}" for i in range(len(fingerprints_df.columns))]
fingerprints_df.insert(0, "Name", data["ID"])

# Save the fingerprint descriptors to a CSV file
fingerprints_df.to_csv('/content/drive/My Drive/Colab Notebooks/data/descriptors_output.csv', index=False)

In [95]:
len(fingerprints_df)

62

In [96]:
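# NOTE: pd.concat(axis=1) aligns rows by index label, not by position. fingerprints_df has a
# fresh 0..61 index, while df_plc50_two_classes keeps the labels of the original 98 rows, so
# some pIC50 values end up next to a different molecule or are left as NaN. Calling
# .reset_index(drop=True) on the pIC50 Series first would align the rows by position.
df_pubchem_fp_pic = pd.concat([fingerprints_df,df_plc50_two_classes['pIC50']], axis=1)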
df_pubchem_fp_pic.head(10)

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,CHEMBL212560,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,4.102373
1,CHEMBL386641,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,4.376751
2,CHEMBL425440,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,4.09691
3,CHEMBL436932,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,4.207608
4,CHEMBL213321,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,4.221849
5,CHEMBL379547,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,4.327902
6,CHEMBL214266,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,
7,CHEMBL215632,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,
8,CHEMBL427044,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,
9,CHEMBL211986,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,


In [97]:
# Drop the rows of df_pubchem_fp_pic in which Name or pIC50 is NaN
df_pubchem_fp_pic = df_pubchem_fp_pic.dropna(subset=['Name', 'pIC50'])

In [98]:
df_pubchem_fp_pic

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,CHEMBL212560,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,4.102373
1,CHEMBL386641,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,4.376751
2,CHEMBL425440,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,4.09691
3,CHEMBL436932,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,4.207608
4,CHEMBL213321,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,4.221849
5,CHEMBL379547,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,4.327902
11,CHEMBL379832,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,4.022276
12,CHEMBL213376,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,4.657577
13,CHEMBL379579,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,4.853872
14,CHEMBL212851,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,4.886057
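
In the table above, index 11 pairs CHEMBL379832 with a pIC50 of 4.022276, whereas the filtered table earlier in the notebook lists that pIC50 for CHEMBL214266 at index 11. This is what happens when `pd.concat(axis=1)` joins a frame with a fresh index to a Series that kept its original labels. The sketch below reproduces the effect on toy data; the names `mol_a`, `mol_b`, `mol_c` and the values are hypothetical.

```python
import pandas as pd

# Features with a fresh 0..2 index, as after reading a file back from disk
features = pd.DataFrame({"Name": ["mol_a", "mol_b", "mol_c"]})

# Target that kept the labels of a filtered DataFrame
target = pd.Series([4.1, 4.0, 4.6], index=[0, 5, 11], name="pIC50")

# Alignment by label: only label 0 exists on both sides
misaligned = pd.concat([features, target], axis=1)
print(misaligned)

# Alignment by position: drop the old labels first
aligned = pd.concat([features, target.reset_index(drop=True)], axis=1)
print(aligned)
```

The first result has five rows, four of them with a missing value on one side; the second has three complete rows. A subsequent `dropna` hides the missing rows but does not correct pairs that were matched by coincidence of labels.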


## Prepare the X and Y data matrices

### X data matrix

In [99]:
df2_X = pd.read_csv('/content/drive/My Drive/Colab Notebooks/data/descriptors_output.csv')

In [100]:
df2_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL212560,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,CHEMBL386641,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1
2,CHEMBL425440,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
3,CHEMBL436932,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,1
4,CHEMBL213321,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57,CHEMBL4515173,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
58,CHEMBL4646334,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
59,CHEMBL4643884,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
60,CHEMBL4637483,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [101]:
df2_X = df2_X.drop(columns=['Name'])
df2_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1
2,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1,1
4,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
58,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
59,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
60,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Y variable

### pIC50 as the target variable

In [102]:
df2_Y = df_pubchem_fp_pic['pIC50']
df2_Y

0     4.102373
1     4.376751
2     4.096910
3     4.207608
4     4.221849
5     4.327902
11    4.022276
12    4.657577
13    4.853872
14    4.886057
15    4.886057
16    5.000000
17    5.000000
37    4.537602
38    4.366532
39    4.721246
40    4.886057
41    4.744727
42    4.522879
43    4.853872
44    4.886057
45    4.853872
46    4.769551
47    4.744727
48    4.283997
50    4.481486
52    4.769551
53    4.920819
55    4.721246
56    3.767004
57    4.036212
59    6.215383
60    6.096910
61    4.600000
Name: pIC50, dtype: float64

In [108]:
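# NOTE: df2_X has a fresh 0..61 index while df2_Y keeps a subset of the original labels, so
# this concat aligns by label and leaves pIC50 as NaN for the rows of df2_X without a match.
dataset = pd.concat([df2_X,df2_Y], axis=1)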
dataset.head()

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,4.102373
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,4.376751
2,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1,4.09691
3,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,1,1,4.207608
4,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1,4.221849


Save the dataset for the supervised learning model that will be built later

In [104]:
dataset.to_csv('/content/drive/My Drive/Colab Notebooks/data/bioactivity_data_2class_pIC50_pubchem_fp.csv', index=False)

Save the generated files in a .zip archive

In [105]:
# Save the files in a zip archive
! zip -r /content/drive/My\ Drive/Colab\ Notebooks/data/results.zip /content/drive/My\ Drive/Colab\ Notebooks/data


updating: content/drive/My Drive/Colab Notebooks/data/plot_bioactivity_class.pdf (deflated 37%)
updating: content/drive/My Drive/Colab Notebooks/data/plot_ic50.pdf (deflated 38%)
updating: content/drive/My Drive/Colab Notebooks/data/plot_LogP.pdf (deflated 38%)
updating: content/drive/My Drive/Colab Notebooks/data/plot_MW.pdf (deflated 38%)
updating: content/drive/My Drive/Colab Notebooks/data/plot_MW_vs_LogP.pdf (deflated 19%)
updating: content/drive/My Drive/Colab Notebooks/data/plot_NumHAcceptors.pdf (deflated 37%)
updating: content/drive/My Drive/Colab Notebooks/data/plot_NumHDonors.pdf (deflated 37%)
updating: content/drive/My Drive/Colab Notebooks/data/bioactivity_data_2class_pIC50.csv (deflated 77%)
updating: content/drive/My Drive/Colab Notebooks/data/bioactivity_data.csv (deflated 90%)
updating: content/drive/My Drive/Colab Notebooks/data/bioactivity_data_curated.csv (deflated 83%)
updating: content/drive/My Drive/Colab Notebooks/data/bioactivity_data_preprocessed.csv (deflate

In [106]:
# Verify the changes
! ls '/content/drive/My Drive/Colab Notebooks/data/'

bioactivity_data_2class_pIC50.csv	      mannwhitneyu_pIC50.csv
bioactivity_data_2class_pIC50_pubchem_fp.csv  molecule.smi
bioactivity_data_3class_pIC50_pubchem_fp.csv  plot_bioactivity_class.pdf
bioactivity_data.csv			      plot_ic50.pdf
bioactivity_data_curated.csv		      plot_LogP.pdf
bioactivity_data_preprocessed.csv	      plot_MW.pdf
descriptors_output.csv			      plot_MW_vs_LogP.pdf
mannwhitneyu_LogP.csv			      plot_NumHAcceptors.pdf
mannwhitneyu_MW.csv			      plot_NumHDonors.pdf
mannwhitneyu_NumHAcceptors.csv		      results.zip
mannwhitneyu_NumHDonors.csv
