# Data preprocessing

This notebook walks through the steps needed to turn the raw data into model inputs. This mainly involves formatting the DataWarrior descriptors. In addition, Mordred descriptors can be calculated and filtered so that the same set of descriptors is used in all later steps.

### Housekeeping

In [1]:
import os

from cyc_pep_perm.data.paths import DATA_PATH, TRAIN_RANDOM_DW, TRAIN_RANDOM_MORDRED

## DataWarrior descriptor processing
This step only reformats the provided ```.ods``` file. By default, the script assumes that the input data is in the ```cyc_pep_perm/data``` folder; adapt the path if that is not the case.

In [2]:
from cyc_pep_perm.data.processing import DataProcessing

datapath = os.path.join(DATA_PATH, "perm_random80_train_raw.ods")

# instantiate the class and make sure the columns match your input file - otherwise change the arguments
dp = DataProcessing(datapath=datapath)

Target column: CAPA [1 µM]
SMILES column: SMILES


In [3]:
df_dw = dp.read_data(filename="perm_random80_train_dw.csv")
df_dw.head()

Saved data to /home/rebecca/code/CycPepPerm/data/perm_random80_train_datawarrior.csv


Unnamed: 0,SMILES,target,MW,cLogP,cLogS,HBA,HBD,Total Surface Area,Rel. PSA,PSA,Rot. Bonds,Amides
0,CC(C)[C@@H](C(NCCSCc1cccc(CSCCC(N[C@@H](CC(NCC...,41.206682,821.5,3.0452,-6.508,13.0,5.0,644.43,0.29243,227.45,15.0,5.0
1,O=C(C[C@@H](C(N[C@H](CCCC1)[C@@H]1C(N(CCC1)C[C...,39.74797,825.53,2.9212,-6.027,13.0,4.0,640.81,0.28174,218.66,14.0,5.0
2,O=C(C[C@@H](C(NCC(NC(CC1)CCC1CC(NCCSCc1cccc(CS...,24.527463,785.47,2.1131,-5.843,13.0,5.0,615.45,0.3062,227.45,14.0,5.0
3,OC[C@@H](C(NCCSCc1cccc(CSCCC(N[C@@H](CC(NCCOCC...,13.128625,813.44,-0.312,-4.362,16.0,7.0,628.35,0.36048,276.36,17.0,5.0
4,CC(C)[C@@H](C(NCCSCc1cccc(CSCCC(N[C@@H](CC(NCC...,86.647628,835.53,3.0732,-6.46,13.0,5.0,656.93,0.28686,227.45,17.0,5.0
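The kind of formatting done here can be illustrated with a minimal sketch. It is not the actual `DataProcessing` implementation: the helper name `format_raw_table` and the toy values are hypothetical, and the only things taken from the output above are the column names (`CAPA [1 µM]` as the raw target column, `SMILES`, and `target` in the formatted table).

```python
import pandas as pd


def format_raw_table(
    df_raw: pd.DataFrame,
    target_col: str = "CAPA [1 µM]",
    smiles_col: str = "SMILES",
) -> pd.DataFrame:
    """Hypothetical helper: rename the target column and move SMILES/target to the front."""
    df = df_raw.rename(columns={target_col: "target", smiles_col: "SMILES"})
    front = ["SMILES", "target"]
    rest = [c for c in df.columns if c not in front]
    return df[front + rest]


# Toy input (made-up values) standing in for the table read from the .ods file.
toy = pd.DataFrame(
    {
        "MW": [800.0, 820.0],
        "CAPA [1 µM]": [40.0, 25.0],
        "SMILES": ["CCO", "CCN"],
    }
)
formatted = format_raw_table(toy)
print(list(formatted.columns))
```

If the raw file uses different column names, the `target_col` and `smiles_col` arguments of this sketch would be changed accordingly, in the same spirit as the arguments of `DataProcessing`.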


## Mordred descriptors
Calculates Mordred descriptors, an extensive set of 2D molecular properties. This takes some time.

In [4]:
df_mordred = dp.calc_mordred(filename="perm_random80_train_mordred.csv")
df_mordred.head()

100%|██████████| 29/29 [00:07<00:00,  4.06it/s]
Saved Mordred descriptors to /home/rebecca/code/CycPepPerm/data/Cyclic_peptide_membrane_permeability_random20percent_mordred.csv


Unnamed: 0,SpAbs_A,SpMax_A,SpDiam_A,SpAD_A,SpMAD_A,LogEE_A,VE1_A,VE2_A,VE3_A,VR1_A,...,TSRW10,MW,AMW,WPath,WPol,Zagreb1,Zagreb2,mZagreb1,mZagreb2,SMILES
0,70.524608,2.422422,4.844786,70.524608,1.259368,4.916128,4.450055,0.079465,3.215683,294078.399525,...,93.963407,834.321132,7.516407,15622,77,268.0,298.0,17.444444,12.861111,Oc1ccc(C[C@@H](C(N(C2)CC2C(NCCSCc2cccc(CSCCC(N...
1,77.805423,2.341892,4.683785,77.805423,1.275499,4.99337,5.637772,0.092422,3.537778,59440.079607,...,99.031578,898.352433,7.48627,18898,84,288.0,317.0,18.694444,14.083333,Oc1ccc(C[C@@H](C(NCCSCc2cccc(CSCCC(N[C@@H](CC(...
2,66.789727,2.324614,4.649229,66.789727,1.236847,4.853806,5.185249,0.096023,3.332217,40398.990677,...,91.135011,817.326946,7.430245,14195,73,246.0,267.0,19.333333,12.611111,C[C@H]([C@@H](C(NCCSCc1cccc(CSCCC(N[C@@H](CC(N...
3,69.641153,2.419902,4.839804,69.641153,1.289651,4.87744,4.732239,0.087634,3.240798,139525.694782,...,91.658732,810.357518,7.171305,14199,79,258.0,289.0,16.333333,12.5,O=C(C[C@@H](C(N(CCCC1)[C@H]1C(N(CC1)CCC1C(NCCS...
4,63.265425,2.440843,4.881687,63.265425,1.240499,4.815109,4.080159,0.080003,3.035376,394066.337656,...,88.445591,770.326218,7.267228,12398,71,240.0,267.0,17.083333,11.833333,CC(C)[C@@H](C(N(CC1)[C@@H]1C(NCCSCc1cccc(CSCCC...
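One way to keep the descriptor set identical across datasets is to choose the columns once on the training data and reapply that list everywhere else. The sketch below is an assumption about how such a filter could look, not the filter used by `calc_mordred`: the helper names, the filtering criteria (drop columns with non-numeric or missing entries and columns with a single constant value), and the toy data are all hypothetical.

```python
import pandas as pd


def select_descriptor_columns(df: pd.DataFrame, smiles_col: str = "SMILES") -> list:
    """Hypothetical filter: keep fully numeric, non-constant descriptor columns."""
    features = df.drop(columns=[smiles_col])
    # Entries that cannot be parsed as numbers become NaN.
    numeric = features.apply(pd.to_numeric, errors="coerce")
    return [
        col
        for col in numeric.columns
        if numeric[col].notna().all() and numeric[col].nunique() > 1
    ]


def apply_descriptor_columns(
    df: pd.DataFrame, columns: list, smiles_col: str = "SMILES"
) -> pd.DataFrame:
    """Restrict another table to the stored descriptor list (missing columns become NaN)."""
    return df.reindex(columns=[*columns, smiles_col])


# Toy training table (made-up values): one usable, one constant, one broken column.
train = pd.DataFrame(
    {
        "MW": [834.3, 898.4, 817.3],
        "nConst": [1, 1, 1],
        "Bad": [1.0, "error", 2.0],
        "SMILES": ["CCO", "CCN", "CCC"],
    }
)
kept = select_descriptor_columns(train)

# A later dataset is reduced to exactly the same descriptors.
new = pd.DataFrame({"MW": [770.3], "Other": [3.0], "SMILES": ["CCCl"]})
new_filtered = apply_descriptor_columns(new, kept)
print(kept, list(new_filtered.columns))
```

Storing the list `kept` (for example as a text or JSON file) alongside the trained model would let prediction-time data be reduced to the same columns.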


## Scaling data

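Scaling is not needed yet, since only tree-based models are considered so far. Their splits depend only on the ordering of feature values, so rescaling the features does not change them.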
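Should scale-sensitive models be added later, a common approach is to standardize each descriptor with statistics computed on the training set only. The following is a minimal sketch under that assumption; the helper names and toy values are hypothetical and nothing like this is part of the current pipeline.

```python
import pandas as pd


def fit_standardizer(train: pd.DataFrame):
    """Compute per-column mean and standard deviation on the training data."""
    mean = train.mean()
    # Guard against division by zero for constant columns.
    std = train.std(ddof=0).replace(0.0, 1.0)
    return mean, std


def standardize(df: pd.DataFrame, mean: pd.Series, std: pd.Series) -> pd.DataFrame:
    """Apply the training statistics to any table with the same columns."""
    return (df[mean.index] - mean) / std


# Toy descriptor table (made-up values).
train = pd.DataFrame({"MW": [800.0, 820.0, 840.0], "cLogP": [1.0, 2.0, 3.0]})
mean, std = fit_standardizer(train)
scaled = standardize(train, mean, std)
print(scaled)
```

Reusing `mean` and `std` from the training set for validation and test data avoids leaking information from those sets into the preprocessing.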