# Data Pre-processing

We begin by pre-processing the data. The sample data comes with the following features (columns):

|   **Column**   |   **Description**   |
|:-:|:--|
|   STUD   |   Study Number   |
|   DSFQ   |   Dosing Frequency   |
|   PTNM   |   Patient Number   |
|   CYCL   |   Dosing Cycle   |
|   AMT   |   Dosing Amount   |
|   TIME   |   Time in Hours Since the Experiment Began for One Individual   |
|   TFDS   |   Time in Hours Since the Last Dose   |
|   DV/PK_timeCourse   |   The PK Observations   |

Let us first import the data into the notebook and inspect these features of the dataset.

In [31]:
# Importing required libraries for data pre-processing

import pandas as pd
import numpy as np

#Reading the csv file with the data

data_complete = pd.read_csv("/Users/rishabhgoel/Desktop/NeuralODE_Paper_Supplementary_Code/ExampleData/sim_data.csv", na_values='.')

data_complete

Unnamed: 0,STUD,PTNM,DSFQ,CYCL,AMT,TIME,TFDS,DV
0,1000.0,1.0,3.0,1.0,296.6,0.0,0.0,20.382000
1,1000.0,1.0,3.0,1.0,0.0,24.0,24.0,73.148000
2,1000.0,1.0,3.0,1.0,0.0,216.0,216.0,19.764000
3,1000.0,1.0,3.0,1.0,0.0,504.0,504.0,3.219900
4,1000.0,1.0,3.0,2.0,288.0,504.0,0.0,3.219900
...,...,...,...,...,...,...,...,...
5371,3000.0,200.0,3.0,15.0,0.0,7416.0,360.0,6.091200
5372,3000.0,200.0,3.0,15.0,0.0,7584.0,7584.0,2.043100
5373,3000.0,200.0,3.0,17.0,259.2,8064.0,0.0,0.090115
5374,3000.0,200.0,3.0,17.0,0.0,8088.0,24.0,53.990000


There are 5375 observations (rows indexed 0 to 5374) and 8 features that we are looking at. Now let us begin pre-processing the data.

Our sample data contains only the columns that are relevant for the model. However, the data we receive from patients will contain many more features, so it is important to select the correct ones. Thus, let us begin by creating a placeholder for the features that are important to the model.

In [32]:
#variable for the columns that we will eventually select from the raw dataset we receive

select_cols = ["STUD", "DSFQ", "PTNM", "CYCL", "AMT", "TIME", "TFDS", "DV"]

#Selecting the relevant columns from the dataframe

data_complete = data_complete[select_cols]
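
Since real patient data may contain extra columns, or may lack an expected one, it can help to check for the required columns before selecting them. The following is a minimal sketch on a made-up two-row frame; the helper name `select_required` and the extra `SITE` column are hypothetical and are not part of the original code.

```python
import pandas as pd

REQUIRED_COLS = ["STUD", "DSFQ", "PTNM", "CYCL", "AMT", "TIME", "TFDS", "DV"]

def select_required(df, cols=REQUIRED_COLS):
    """Hypothetical helper: return df restricted to cols, failing early if any are missing."""
    missing = [c for c in cols if c not in df.columns]
    if missing:
        raise KeyError(f"Missing required columns: {missing}")
    return df[cols]

# Made-up raw data with one extra column ("SITE") that the model does not need
raw = pd.DataFrame({
    "STUD": [1000.0, 1000.0], "PTNM": [1.0, 1.0], "DSFQ": [3.0, 3.0],
    "CYCL": [1.0, 1.0], "AMT": [296.6, 0.0], "TIME": [0.0, 24.0],
    "TFDS": [0.0, 24.0], "DV": [20.382, 73.148], "SITE": ["A", "A"],
})

selected = select_required(raw)
print(list(selected.columns))
```

The selection itself is the same `df[cols]` indexing used above; the only addition is the explicit error message when a column is absent.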

We then start filtering data based on multiple parameters:
    
1. Dosing Cycle < 100

In [33]:
# filtering for only the rows in the dataframe with a dosing cycle less than 100

data_complete = data_complete[data_complete.CYCL < 100]
data_complete

Unnamed: 0,STUD,DSFQ,PTNM,CYCL,AMT,TIME,TFDS,DV
0,1000.0,3.0,1.0,1.0,296.6,0.0,0.0,20.382000
1,1000.0,3.0,1.0,1.0,0.0,24.0,24.0,73.148000
2,1000.0,3.0,1.0,1.0,0.0,216.0,216.0,19.764000
3,1000.0,3.0,1.0,1.0,0.0,504.0,504.0,3.219900
4,1000.0,3.0,1.0,2.0,288.0,504.0,0.0,3.219900
...,...,...,...,...,...,...,...,...
5371,3000.0,3.0,200.0,15.0,0.0,7416.0,360.0,6.091200
5372,3000.0,3.0,200.0,15.0,0.0,7584.0,7584.0,2.043100
5373,3000.0,3.0,200.0,17.0,259.2,8064.0,0.0,0.090115
5374,3000.0,3.0,200.0,17.0,0.0,8088.0,24.0,53.990000


We then convert the "PTNM" column to an integer type using the astype() method, and then use the map() method to apply a format string to each value in the column. The format string "{:05d}" specifies that each value should be formatted as a zero-padded integer with a minimum width of 5 digits, so the resulting column holds strings rather than numbers.

For example, if the "PTNM" column originally contained the values [1, 10, 100, 1000], after this line of code is executed, the "PTNM" column would contain the values ['00001', '00010', '00100', '01000'].

This line of code is useful for standardizing the format of the values in the "PTNM" column, which can make it easier to perform operations on the column and to compare values within the column. It can also be useful for preparing the data for downstream model applications that require the data to be in a particular format.
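
The example values above can be checked on a toy Series. This is only an illustration; the Series `ptnm` is made up, and the values are stored as floats to mimic how the column looks right after the CSV is read.

```python
import pandas as pd

# Toy patient numbers stored as floats, as they appear after reading the CSV
ptnm = pd.Series([1.0, 10.0, 100.0, 1000.0])

# Cast to int first: the "d" format code does not accept floats
padded = ptnm.astype("int").map("{:05d}".format)

print(padded.tolist())
```

Each element of `padded` is now a string, which is what allows it to be concatenated with the study number in the next step.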

In [34]:
# formatting Patient Number for Unique Identifier
data_complete["PTNM"] = data_complete["PTNM"].astype("int").map("{:05d}".format)
data_complete

Unnamed: 0,STUD,DSFQ,PTNM,CYCL,AMT,TIME,TFDS,DV
0,1000.0,3.0,00001,1.0,296.6,0.0,0.0,20.382000
1,1000.0,3.0,00001,1.0,0.0,24.0,24.0,73.148000
2,1000.0,3.0,00001,1.0,0.0,216.0,216.0,19.764000
3,1000.0,3.0,00001,1.0,0.0,504.0,504.0,3.219900
4,1000.0,3.0,00001,2.0,288.0,504.0,0.0,3.219900
...,...,...,...,...,...,...,...,...
5371,3000.0,3.0,00200,15.0,0.0,7416.0,360.0,6.091200
5372,3000.0,3.0,00200,15.0,0.0,7584.0,7584.0,2.043100
5373,3000.0,3.0,00200,17.0,259.2,8064.0,0.0,0.090115
5374,3000.0,3.0,00200,17.0,0.0,8088.0,24.0,53.990000


Now we create a new column called "ID" in the pandas DataFrame data_complete. The "ID" column is being created by concatenating two existing columns in ```data_complete```: "STUD" and "PTNM".

First, the "STUD" column is converted to an integer data type using the ```astype()``` method with the argument "int", which drops the trailing ".0" from values such as 1000.0. Then, the "STUD" column is converted to a string data type using the ```astype()``` method with the argument "str". This ensures that the values in the "STUD" column are in string format and can be joined with the string values in "PTNM".

Next, the values in the "STUD" and "PTNM" columns are being concatenated using the + operator, which joins the two strings together. This creates a new string for each row in the DataFrame, which is assigned to the "ID" column.

For example, if the "STUD" column contained the values ['100', '101', '102', '103'] and the "PTNM" column contained the values ['00001', '00010', '00100', '01000'], then after this line of code is executed, the "ID" column would contain the values ['10000001', '10100010', '10200100', '10301000'].

This line of code is useful for creating a unique identifier for each patient within each study based on the values in the "STUD" and "PTNM" columns. All rows that belong to the same patient in the same study share the same "ID". This can be useful for identifying and tracking individual patients, as well as for linking data across multiple datasets.
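
The concatenation can be traced on a small made-up frame that uses the example values from the text. The frame `toy` and the variable `without_int` are illustrative only; the second variable shows what the string would look like if the integer cast were skipped.

```python
import pandas as pd

toy = pd.DataFrame({
    "STUD": [100.0, 101.0, 102.0, 103.0],               # floats, as read from the CSV
    "PTNM": ["00001", "00010", "00100", "01000"],      # already zero-padded strings
})

# Same expression as in the notebook: int cast, then str cast, then concatenation
toy["ID"] = toy["STUD"].astype("int").astype("str") + toy["PTNM"]

# For comparison only: skipping the int cast keeps the ".0" in the identifier
without_int = toy["STUD"].astype("str") + toy["PTNM"]

print(toy["ID"].tolist())
print(without_int.tolist())
```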

In [35]:
# Creating a unique identifier column, 'ID', and formatting it based on Patient Number and Study Number
data_complete["ID"] = data_complete["STUD"].astype("int").astype("str") + data_complete["PTNM"]
data_complete

Unnamed: 0,STUD,DSFQ,PTNM,CYCL,AMT,TIME,TFDS,DV,ID
0,1000.0,3.0,00001,1.0,296.6,0.0,0.0,20.382000,100000001
1,1000.0,3.0,00001,1.0,0.0,24.0,24.0,73.148000,100000001
2,1000.0,3.0,00001,1.0,0.0,216.0,216.0,19.764000,100000001
3,1000.0,3.0,00001,1.0,0.0,504.0,504.0,3.219900,100000001
4,1000.0,3.0,00001,2.0,288.0,504.0,0.0,3.219900,100000001
...,...,...,...,...,...,...,...,...,...
5371,3000.0,3.0,00200,15.0,0.0,7416.0,360.0,6.091200,300000200
5372,3000.0,3.0,00200,15.0,0.0,7584.0,7584.0,2.043100,300000200
5373,3000.0,3.0,00200,17.0,259.2,8064.0,0.0,0.090115,300000200
5374,3000.0,3.0,00200,17.0,0.0,8088.0,24.0,53.990000,300000200


In [36]:
# Creating a dataframe with the maximum time for each ID (i.e., the last recorded time point for each patient in each study)
time_summary = data_complete[["ID", "TIME"]].groupby("ID").max().reset_index()

# Selecting only the IDs whose last recorded time point is greater than 0 (eliminating patients with no data after time 0)
selected_ptnms = time_summary[time_summary.TIME > 0].ID

# Keeping only the rows of our main table, data_complete, that belong to the selected IDs
data_complete = data_complete[data_complete.ID.isin(selected_ptnms)]
data_complete

Unnamed: 0,STUD,DSFQ,PTNM,CYCL,AMT,TIME,TFDS,DV,ID
0,1000.0,3.0,00001,1.0,296.6,0.0,0.0,20.382000,100000001
1,1000.0,3.0,00001,1.0,0.0,24.0,24.0,73.148000,100000001
2,1000.0,3.0,00001,1.0,0.0,216.0,216.0,19.764000,100000001
3,1000.0,3.0,00001,1.0,0.0,504.0,504.0,3.219900,100000001
4,1000.0,3.0,00001,2.0,288.0,504.0,0.0,3.219900,100000001
...,...,...,...,...,...,...,...,...,...
5371,3000.0,3.0,00200,15.0,0.0,7416.0,360.0,6.091200,300000200
5372,3000.0,3.0,00200,15.0,0.0,7584.0,7584.0,2.043100,300000200
5373,3000.0,3.0,00200,17.0,259.2,8064.0,0.0,0.090115,300000200
5374,3000.0,3.0,00200,17.0,0.0,8088.0,24.0,53.990000,300000200

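The filter in the cell above can be illustrated on a small made-up frame. The IDs "A" and "B" are hypothetical: "A" has observations after time 0, while "B" only has a single record at time 0 and is therefore dropped.

```python
import pandas as pd

toy = pd.DataFrame({
    "ID": ["A", "A", "A", "B"],
    "TIME": [0.0, 24.0, 216.0, 0.0],
})

# Latest recorded time per ID
time_summary = toy[["ID", "TIME"]].groupby("ID").max().reset_index()

# Keep only the IDs whose latest recorded time is after time 0
selected_ids = time_summary[time_summary.TIME > 0].ID
filtered = toy[toy.ID.isin(selected_ids)]

print(filtered)
```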

In [37]:
# filling in missing values in the "AMT" column with Dosing Amounts equal to 0
data_complete["AMT"] = data_complete["AMT"].fillna(0)


#Renaming the column DV to PK_timeCourse so that the name is more self-explanatory
data_complete = data_complete.rename(columns={"DV": "PK_timeCourse"})


# Duplicating PK_timeCourse column to make changes to it separately
data_complete["PK_round1"] = data_complete["PK_timeCourse"]

# Changing PK_round1 such that if the dosing frequency is once weekly (DSFQ == 1) and the time is 
# at or beyond 168 hrs, the PK for round 1 is zero. Similarly, if the dosing frequency is once every 
# three weeks (DSFQ == 3) and the time is at or beyond 504 hrs, the PK for round 1 is zero
data_complete.loc[(data_complete.DSFQ == 1) & (data_complete.TIME >= 168), "PK_round1"] = 0
data_complete.loc[(data_complete.DSFQ == 3) & (data_complete.TIME >= 504), "PK_round1"] = 0

# filling in missing values in the "PK_round1" column with 0
data_complete["PK_round1"] = data_complete["PK_round1"].fillna(0)

# filling in missing values in the "PK_timeCourse" column with -1
data_complete["PK_timeCourse"] = data_complete["PK_timeCourse"].fillna(-1)
data_complete

Unnamed: 0,STUD,DSFQ,PTNM,CYCL,AMT,TIME,TFDS,PK_timeCourse,ID,PK_round1
0,1000.0,3.0,00001,1.0,296.6,0.0,0.0,20.382000,100000001,20.382
1,1000.0,3.0,00001,1.0,0.0,24.0,24.0,73.148000,100000001,73.148
2,1000.0,3.0,00001,1.0,0.0,216.0,216.0,19.764000,100000001,19.764
3,1000.0,3.0,00001,1.0,0.0,504.0,504.0,3.219900,100000001,0.000
4,1000.0,3.0,00001,2.0,288.0,504.0,0.0,3.219900,100000001,0.000
...,...,...,...,...,...,...,...,...,...,...
5371,3000.0,3.0,00200,15.0,0.0,7416.0,360.0,6.091200,300000200,0.000
5372,3000.0,3.0,00200,15.0,0.0,7584.0,7584.0,2.043100,300000200,0.000
5373,3000.0,3.0,00200,17.0,259.2,8064.0,0.0,0.090115,300000200,0.000
5374,3000.0,3.0,00200,17.0,0.0,8088.0,24.0,53.990000,300000200,0.000

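The effect of the masking and of the two different fill values can be traced on a made-up frame. The numbers below are illustrative; only the column names and the thresholds (168 and 504 hours) come from the cell above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "DSFQ": [1.0, 1.0, 3.0, 3.0, 3.0],
    "TIME": [24.0, 168.0, 216.0, 504.0, 696.0],
    "PK_timeCourse": [5.0, 2.0, 19.764, np.nan, 23.21],
})

# Copy the observations, then zero out everything at or beyond the first cycle
toy["PK_round1"] = toy["PK_timeCourse"]
toy.loc[(toy.DSFQ == 1) & (toy.TIME >= 168), "PK_round1"] = 0
toy.loc[(toy.DSFQ == 3) & (toy.TIME >= 504), "PK_round1"] = 0

# Missing values: 0 in PK_round1, -1 in PK_timeCourse
toy["PK_round1"] = toy["PK_round1"].fillna(0)
toy["PK_timeCourse"] = toy["PK_timeCourse"].fillna(-1)

print(toy)
```

After these steps `PK_round1` keeps only the first-cycle observations, while `PK_timeCourse` keeps the full time course with -1 marking missing measurements.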

In [38]:
# Removing rows where the "AMT" column is 0 and the "TIME" column is also 0.
data_complete = data_complete[~((data_complete.AMT == 0) & (data_complete.TIME == 0))]

# Where a patient has two rows at the same time point, copying the dosing amount from the later 
# duplicate row onto the first row, so that the dose is not lost when the later row is dropped
data_complete.loc[data_complete[["PTNM", "TIME"]].duplicated(keep="last"), "AMT"] = \
    data_complete.loc[data_complete[["PTNM", "TIME"]].duplicated(keep="first"), "AMT"].values

# Keeping only the first row for each (patient, time) pair and dropping the later duplicates
data_complete = data_complete[~data_complete[["PTNM", "TIME"]].duplicated(keep="first")]

data_complete

Unnamed: 0,STUD,DSFQ,PTNM,CYCL,AMT,TIME,TFDS,PK_timeCourse,ID,PK_round1
0,1000.0,3.0,00001,1.0,296.6,0.0,0.0,20.382000,100000001,20.382
1,1000.0,3.0,00001,1.0,0.0,24.0,24.0,73.148000,100000001,73.148
2,1000.0,3.0,00001,1.0,0.0,216.0,216.0,19.764000,100000001,19.764
3,1000.0,3.0,00001,1.0,288.0,504.0,504.0,3.219900,100000001,0.000
5,1000.0,3.0,00001,2.0,0.0,696.0,192.0,23.210000,100000001,0.000
...,...,...,...,...,...,...,...,...,...,...
5371,3000.0,3.0,00200,15.0,0.0,7416.0,360.0,6.091200,300000200,0.000
5372,3000.0,3.0,00200,15.0,0.0,7584.0,7584.0,2.043100,300000200,0.000
5373,3000.0,3.0,00200,17.0,259.2,8064.0,0.0,0.090115,300000200,0.000
5374,3000.0,3.0,00200,17.0,0.0,8088.0,24.0,53.990000,300000200,0.000

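The duplicate handling is easier to follow on a small made-up frame modelled on the first patient: an observation row and a dosing row share the time point 504 hours. The names `first_of_pair` and `later_of_pair` are introduced here for readability and do not appear in the original code.

```python
import pandas as pd

toy = pd.DataFrame({
    "PTNM": ["00001", "00001", "00001", "00001"],
    "CYCL": [1.0, 1.0, 2.0, 2.0],
    "AMT": [0.0, 0.0, 288.0, 0.0],
    "TIME": [216.0, 504.0, 504.0, 696.0],
    "TFDS": [216.0, 504.0, 0.0, 192.0],
})

# keep="last" flags the earlier row of each duplicated pair (index 1 here)
first_of_pair = toy[["PTNM", "TIME"]].duplicated(keep="last")
# keep="first" flags the later row of each duplicated pair (index 2 here)
later_of_pair = toy[["PTNM", "TIME"]].duplicated(keep="first")

# Move the dose from the later row onto the earlier row, then drop the later row
toy.loc[first_of_pair, "AMT"] = toy.loc[later_of_pair, "AMT"].values
toy = toy[~later_of_pair]

print(toy)
```

The positional assignment through `.values` relies on both masks selecting the same number of rows in the same order, which holds when each duplicated time point appears exactly twice.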

# Splitting Data

Now, we want to split our data into training, validation, and test groups. This lets us tune the model's hyperparameters on data that is kept separate from the final test set, so that the final evaluation is not biased by the tuning and we end up with the most robust version of our model.

We begin by defining a function called ```data_split``` that we will use eventually to split our data in a customized way. Arguments for the function include a dataframe (```df```), a column (```on_col```), a list of columns to be saved in the output dataframes (```save_cols```), the seed of the split (```seed```), and the proportion of the data that will go into the test set (```test_size```).

In [39]:
#Importing the necessary library for splitting the data into training and test sets
from sklearn.model_selection import train_test_split

#Defining the function called data_split, which takes in a dataframe, the name of the column to split on 
# (e.g., patient identifiers), the columns to keep, a random seed, and a test size that defaults to 20%. 
def data_split(df, on_col, save_cols=None, seed=2020, test_size=0.2):
    
    #Setting the default save_cols to all columns of the dataset
    if save_cols is None:
        save_cols = df.columns.values

    #Setting a variable called target which contains all the unique values of a certain column. 
    #This variable will be helpful to split the unique patients into train and test later in the code.
    target = df[on_col].unique()
    
    #Setting the train variable to contain the random unique patients numbers of 80% of the patients 
    #while the remaining 20% is the allocated to the test variable
    train, test = train_test_split(target, random_state=seed, test_size=test_size, shuffle=True)
    
    #Creating the train and test dataframes (train_df and test_df). The training dataframe (train_df) includes all rows of 
    #patients in the train variable (train). The test dataframe (test_df) includes all rows of remaining patients (test).
    train_df = df[df[on_col].isin(train)]
    test_df = df[df[on_col].isin(test)]

    #Returning the train and test dataframes, restricted to the columns in save_cols
    return train_df[save_cols], test_df[save_cols]
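
The key property of this kind of split is that all rows of a patient land on the same side. The sketch below checks that on made-up data. It uses a compact, hypothetical re-statement of the same idea (`split_by_group`) so that the example is self-contained; it is not a replacement for `data_split`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_by_group(df, on_col, seed=2020, test_size=0.2):
    """Hypothetical compact version: split the unique groups, then select rows."""
    groups = df[on_col].unique()
    train_g, test_g = train_test_split(
        groups, random_state=seed, test_size=test_size, shuffle=True
    )
    return df[df[on_col].isin(train_g)], df[df[on_col].isin(test_g)]

# Made-up data: 10 patients with 3 observations each
toy = pd.DataFrame({
    "PTNM": [f"{p:05d}" for p in range(1, 11) for _ in range(3)],
    "TIME": [0.0, 24.0, 216.0] * 10,
})

train_df, test_df = split_by_group(toy, "PTNM", seed=1329, test_size=0.2)

# No patient should appear in both sets
overlap = set(train_df.PTNM) & set(test_df.PTNM)
print(train_df.PTNM.nunique(), test_df.PTNM.nunique(), overlap)
```

Splitting on unique patients rather than on rows avoids leaking observations of a test patient into training through a plain row-wise split.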

We then build the code for the ```main```, which takes in our input data and uses the helper function that we created above, ```data_split()```, to split our data into training, validation, and test sets.

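Next, the main adds a specific portion of the test set to the training and validation sets that we created above: the rows of test patients recorded before 168 hours when the dosing frequency is 1, and before 504 hours when the dosing frequency is 3.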
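
A minimal sketch of this step on made-up data is shown below. The frames and the combined boolean mask `first_cycle` are illustrative; the main script builds the same selection with two separate filters, so the result is the same up to row order.

```python
import pandas as pd

# Made-up splits; patient "00004" belongs to the test set
train = pd.DataFrame({"PTNM": ["00001", "00001"], "DSFQ": [3.0, 3.0], "TIME": [0.0, 504.0]})
test = pd.DataFrame({
    "PTNM": ["00004"] * 4,
    "DSFQ": [3.0] * 4,
    "TIME": [0.0, 24.0, 504.0, 696.0],
})

# Early window: before 168 h for DSFQ == 1, before 504 h for DSFQ == 3
first_cycle = ((test.DSFQ == 1) & (test.TIME < 168)) | ((test.DSFQ == 3) & (test.TIME < 504))
test_add_to_train = test[first_cycle]

# The early rows of the test patient are appended to the training set
train = pd.concat([train, test_add_to_train], ignore_index=True)
print(train)
```

The test set itself is left unchanged, so only the later observations of a test patient remain unseen during training.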

Then, the main performs an augmentation that adds three truncated copies of each training patient's record, cut off at 1008, 1512, and 2016 hours. The main purpose of this augmentation, in my opinion, is to eliminate the problems associated with translating the results from one dosing regimen to the next.


In [43]:
# data = pd.read_csv(args.data)
train, test = data_split(data_complete, "PTNM", seed=1329, test_size=0.2)
    
# test[(test.DSFQ == 1) & (test.TIME < 168)]
test[(test.DSFQ == 3) & (test.TIME < 504)]

Unnamed: 0,STUD,DSFQ,PTNM,CYCL,AMT,TIME,TFDS,PK_timeCourse,ID,PK_round1
100,1000.0,3.0,00004,1.0,222.0,0.0,0.0,6.8146,100000004,6.8146
101,1000.0,3.0,00004,1.0,0.0,24.0,24.0,52.2980,100000004,52.2980
102,1000.0,3.0,00004,1.0,0.0,192.0,192.0,17.1410,100000004,17.1410
186,1000.0,3.0,00007,1.0,205.0,0.0,0.0,20.4480,100000007,20.4480
187,1000.0,3.0,00007,1.0,0.0,192.0,192.0,19.7200,100000007,19.7200
...,...,...,...,...,...,...,...,...,...,...
4875,3000.0,3.0,00184,1.0,0.0,360.0,360.0,2.9028,300000184,2.9028
5129,3000.0,3.0,00194,1.0,309.0,0.0,0.0,2.8220,300000194,2.8220
5130,3000.0,3.0,00194,1.0,0.0,24.0,24.0,60.3760,300000194,60.3760
5131,3000.0,3.0,00194,1.0,0.0,288.0,288.0,6.5730,300000194,6.5730


In [None]:
# Script entry point: split the data, augment the training set, and save the three sets
if __name__ == "__main__":
    
    from args import args

    data = pd.read_csv(args.data)
    train, test = data_split(data, "PTNM", seed=1329+args.fold, test_size=0.2)
    train, validate = data_split(train, "PTNM", seed=1329+args.fold+args.model, test_size=0.2)

    test_add_to_train = pd.DataFrame()
    test_add_to_train = pd.concat([test_add_to_train, test[(test.DSFQ == 1) & (test.TIME < 168)]], ignore_index=True)
    test_add_to_train = pd.concat([test_add_to_train, test[(test.DSFQ == 3) & (test.TIME < 504)]], ignore_index=True)
    train = pd.concat([train, test_add_to_train], ignore_index=True)
    validate = pd.concat([validate, test_add_to_train], ignore_index=True)
    
    # James' augmentation
    # Deep learning is prone to overfitting, so augmentation is applied to reduce it. Timewise truncation 
    # is used to increase the number of training examples: for each training example, in addition to the 
    # original example, copies truncated at 1008 hr, 1512 hr, and 2016 hr are generated and added to the 
    # training examples.
    
    augment_data = pd.DataFrame(columns=train.columns)
    
    for ptnm in train.PTNM.unique():
        # Copy truncated at 2 cycles (1008 hr); .copy() avoids modifying a view of train
        df = train[(train.PTNM == ptnm) & (train.TIME <= 2 * 21 * 24) & (train.TIME >= 0)].copy()
        df["PTNM"] = df["PTNM"] + 0.1
        augment_data = pd.concat([augment_data, df], ignore_index=True)

        # Copy truncated at 3 cycles (1512 hr)
        df = train[(train.PTNM == ptnm) & (train.TIME <= 3 * 21 * 24) & (train.TIME >= 0)].copy()
        df["PTNM"] = df["PTNM"] + 0.2
        augment_data = pd.concat([augment_data, df], ignore_index=True)

        # Copy truncated at 4 cycles (2016 hr)
        df = train[(train.PTNM == ptnm) & (train.TIME <= 4 * 21 * 24) & (train.TIME >= 0)].copy()
        df["PTNM"] = df["PTNM"] + 0.3
        augment_data = pd.concat([augment_data, df], ignore_index=True)

    train = pd.concat([train, augment_data], ignore_index=True).reset_index(drop=True)

    train.to_csv("/Users/rishabhgoel/Desktop/NeuralODE_Paper_Supplementary_Code/5fold_models/Neural-ODE/results/train.csv", index=False) 
    validate.to_csv("/Users/rishabhgoel/Desktop/NeuralODE_Paper_Supplementary_Code/5fold_models/Neural-ODE/results/validate.csv", index=False)
    test.to_csv("/Users/rishabhgoel/Desktop/NeuralODE_Paper_Supplementary_Code/5fold_models/Neural-ODE/results/test.csv", index=False)
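
The timewise truncation can be illustrated on a single made-up patient. This is a sketch under stated assumptions: the data are invented, `CYCLE_HOURS` and `copies` are names introduced here, and the loop collects the copies in a list instead of concatenating inside the loop as the script does.

```python
import pandas as pd

# Made-up training data: one patient (numeric PTNM, as read from a CSV)
train = pd.DataFrame({
    "PTNM": [1.0] * 6,
    "TIME": [0.0, 504.0, 1008.0, 1512.0, 2016.0, 2520.0],
})

CYCLE_HOURS = 21 * 24  # 504 hours

copies = []
for ptnm in train.PTNM.unique():
    for offset, n_cycles in ((0.1, 2), (0.2, 3), (0.3, 4)):
        # Keep only the rows up to the truncation time
        truncated = train[(train.PTNM == ptnm) & (train.TIME <= n_cycles * CYCLE_HOURS)].copy()
        # Offsets 0.1, 0.2, 0.3 give each copy its own patient number
        truncated["PTNM"] = truncated["PTNM"] + offset
        copies.append(truncated)

augmented = pd.concat([train] + copies, ignore_index=True)
print(augmented.groupby("PTNM").size())
```

Each copy is treated as a separate patient with a shorter record, which is how the truncation increases the number of training examples.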

In [None]:
import argparse


# parser that reads the different args that are passed in along with the python file
parser = argparse.ArgumentParser("neural ODE model")

# the data that takes the input of the data file for data splitting steps
parser.add_argument("--data", type=str, help="data file for processing")

#argument for fold number
parser.add_argument("--fold", type=int, help="current fold number")

#argument for model number
parser.add_argument("--model", type=int, help="current model number")

#argument for the directory of where to save the file
parser.add_argument("--save", type=str, help="save dirs for the results")

# Flag that, when set, continues training (presumably from a previously saved model; not sure exactly how it is used)
parser.add_argument("--continue-train", action="store_true", help="continue training")

#Argument for the random seed, with a default of 1000
parser.add_argument("--random-seed", type=int, default=1000, help="random seed")

#Argument for the number of hidden layers in the ODE Function, with a default of 2
parser.add_argument("--layer", type=int, default=2, help="hidden layer of the ODE Function")

#Argument for the learning rate of the model (usually set to 0.00005)
parser.add_argument("--lr", type=float, help="learning rate")

#Argument for lambda of the regularization term (which is usually 0.1)
parser.add_argument("--l2", type=float, help="l2 regularization")

#Argument for number of hidden dimensions in the ODE Function
parser.add_argument("--hidden", type=int, help="hidden dim in ODE Function")

#Argument for the tolerance that controls the precision of the ODE solver
parser.add_argument("--tol", type=float, help="control the precision in ODE solver")

#Argument for number of training epochs (set to 30 in the actual code)
parser.add_argument("--epochs", type=int, help="epochs for training")

args = parser.parse_args()
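
To see how these arguments are read, the sketch below builds a reduced parser with a few of the arguments above and parses an explicit list instead of the command line. The file name `sim_data.csv` and the values passed are made up for illustration.

```python
import argparse

# A reduced parser with a few of the arguments defined above
parser = argparse.ArgumentParser("neural ODE model")
parser.add_argument("--data", type=str, help="data file for processing")
parser.add_argument("--fold", type=int, help="current fold number")
parser.add_argument("--model", type=int, help="current model number")
parser.add_argument("--continue-train", action="store_true", help="continue training")
parser.add_argument("--random-seed", type=int, default=1000, help="random seed")

# Passing a list avoids reading sys.argv, which is convenient inside a notebook
demo_args = parser.parse_args(["--data", "sim_data.csv", "--fold", "1", "--model", "2"])

print(demo_args.data, demo_args.fold, demo_args.model)
# Dashes in option names become underscores in attribute names
print(demo_args.continue_train, demo_args.random_seed)
# Seed that the main script would use for the train/validate split
print(1329 + demo_args.fold + demo_args.model)
```

Arguments without a default that are not supplied are set to `None`, so the main script needs `--data`, `--fold`, and `--model` to be passed explicitly.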