# Data preprocessing

This notebook walks through the steps needed to produce data that can be used as input to the models. The main step is formatting the DataWarrior descriptors. In addition, Mordred descriptors can be calculated and filtered so that the same set of descriptors is used consistently in later steps.

## Housekeeping

In [1]:
from pathlib import Path

ROOT_PATH = Path.cwd().parent
DATA_PATH = ROOT_PATH / 'data'
TRAIN_RANDOM_DW = DATA_PATH / 'perm_random80_train_dw.csv'
TRAIN_RANDOM_MORDRED = DATA_PATH / 'perm_random80_train_mordred.csv'

## DataWarrior descriptor processing
This step only reformats the provided ```.ods``` file. The defaults in this script assume that the input data is located in the ```cyc_pep_perm/data``` folder; adapt the path if that is not the case.
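
Since the defaults depend on where the input file lives, it can help to check the path before instantiating the processing class. The following is a minimal sketch, not part of `cyc_pep_perm`: the helper `find_missing` is a hypothetical name introduced here, and the directory layout is assumed to match the housekeeping cell.

```python
from pathlib import Path


def find_missing(paths):
    """Return those entries of `paths` that do not exist on disk."""
    return [Path(p) for p in paths if not Path(p).exists()]


# Assumption: same layout as in the housekeeping cell, i.e. a `data`
# folder next to the folder this notebook is run from.
data_path = Path.cwd().parent / "data"
raw_file = data_path / "perm_random80_train_raw.ods"

missing = find_missing([raw_file])
if missing:
    print("Missing input files:", [str(p) for p in missing])
else:
    print("All input files found.")
```

If the file is reported as missing, adjust `datapath` in the next cell rather than moving the data.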

In [None]:
from cyc_pep_perm.data.processing import DataProcessing

# TODO: CHANGE TO YOUR PATH!
datapath = DATA_PATH / 'perm_random80_train_raw.ods'

# instantiate the class and make sure the columns match your input file; otherwise change the arguments
dp = DataProcessing(datapath=datapath)

In [None]:
df_dw = dp.read_data(filename="perm_random80_train_dw.csv")
df_dw.head()

## Mordred descriptors
This calculates Mordred descriptors, an extensive set of 2D molecular properties. The calculation takes some time.

In [None]:
df_mordred = dp.calc_mordred(filename="perm_random80_train_mordred.csv")
df_mordred.head()
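
To keep the same set of descriptors for later datasets, one option is to store the descriptor column names once and reuse them. The sketch below uses only the standard library and is an illustration, not the method used inside `calc_mordred`. The helper names, the JSON file name and the example descriptor names are all assumptions.

```python
import json
from pathlib import Path


def save_descriptor_names(names, path):
    """Write the descriptor column names to a JSON file, preserving order."""
    Path(path).write_text(json.dumps(list(names), indent=2))


def load_descriptor_names(path):
    """Read the descriptor column names back as a list of strings."""
    return json.loads(Path(path).read_text())


def missing_descriptors(available, required):
    """Return the required names that are absent from `available`."""
    available = set(available)
    return [name for name in required if name not in available]


# Hypothetical example: three descriptor names standing in for the real set.
names_file = Path("mordred_descriptor_names.json")
save_descriptor_names(["nAcid", "SLogP", "TopoPSA"], names_file)
print(load_descriptor_names(names_file))
names_file.unlink()  # clean up the example file
```

With a DataFrame, the stored list could be applied as `df_new[load_descriptor_names(path)]`, after using `missing_descriptors(df_new.columns, ...)` to confirm that every stored descriptor is present. Note that `df_mordred.columns` may also contain non-descriptor columns, so the list to save should be chosen accordingly.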

## Scaling data

Scaling is not needed yet because only tree-based models are considered, and their splits are insensitive to monotonic rescaling of the features.
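
Should scale-sensitive models be added later, standardization would need to be fitted on the training data only and then applied unchanged to other splits. The following plain-Python sketch illustrates that idea; it is not part of `cyc_pep_perm`, the function names are hypothetical, and in practice a library scaler such as scikit-learn's `StandardScaler` would typically be used instead.

```python
from statistics import fmean, pstdev


def fit_scaler(columns):
    """Compute (mean, std) per feature from training data.

    `columns` maps a feature name to a list of floats.
    """
    params = {}
    for name, values in columns.items():
        std = pstdev(values)  # population standard deviation
        # Constant features get std 1.0 so they map to zero
        # instead of causing a division by zero.
        params[name] = (fmean(values), std if std > 0 else 1.0)
    return params


def apply_scaler(columns, params):
    """Standardize each feature with previously fitted parameters."""
    return {
        name: [(v - params[name][0]) / params[name][1] for v in values]
        for name, values in columns.items()
    }


# Toy data (made up for illustration): fit on train, reuse on test.
train = {"a": [1.0, 2.0, 3.0], "b": [5.0, 5.0, 5.0]}
test = {"a": [2.0, 4.0], "b": [5.0, 6.0]}

params = fit_scaler(train)
train_scaled = apply_scaler(train, params)
test_scaled = apply_scaler(test, params)
print(train_scaled["a"])
```

Fitting on the training split alone avoids leaking information from held-out data into the preprocessing.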