# Dataset Formation - Tabular

In this document I build the "tabular" version of this task. The goal is to end up with the following:

* `df_features`: a pandas dataframe of input features, with one row per patient id and timepoint.
    * this contains all types of features.
* `df_targets`: the `VL` and `CD4` values, as well as the `reward`, for that particular timepoint. Note that we assume these target observations are NOT known at timestep t.


## Feature Selection
I group the features in this dataset as follows:

* `ignored`: `[]`
    * features that I currently ignore. For example, imputation flags could be ignored if we treat the preprocessing as encapsulated from everything else. However, the literature has shown that such *missingness* masks are often informative, so I still include them as covariates of the target features.
* `target`: `['VL', 'CD4', 'Rel CD4', 'VL (M)', 'CD4 (M)']`
    * As mentioned in the instructions, these are the covariates related to the targets.
* `static`: `['Gender', 'Ethnic']`
* `categorical`: `['Comp. NNRTI', 'Comp. INI', 'Base Drug Combo', 'Drug (M)', 'Extra PI','Extra pk-En',]`
* `numerical`: `['VL','CD4','Rel CD4',]`

To summarize:

For a particular row, corresponding to patient id `pid` and timestamp `t`, we want to use:
* `static` features
* `categorical` features up until and including `t` timestep
* `target` features up until and NOT including `t` (until `t-1`).
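The windowing rule above can be sketched on a toy frame. This is only an illustration under my own assumptions: the values are made up, the timepoints are integers rather than the normalized ones used later, and `df_features_toy` / `df_targets_toy` are hypothetical names. A per-patient `shift(1)` is one simple way to guarantee that a target covariate at `t` never leaks into the features of the same row.

```python
import pandas as pd

# Toy long-format data; column names mirror the notebook, values are made up.
df = pd.DataFrame({
    "PatientID": [1, 1, 1, 2, 2],
    "Timepoints": [0, 1, 2, 0, 1],
    "Gender": [1, 1, 1, 2, 2],
    "Base Drug Combo": [0, 0, 3, 1, 1],
    "VL": [50.0, 40.0, 30.0, 80.0, 60.0],
})
df = df.sort_values(["PatientID", "Timepoints"])

# Target covariates are only known up to t-1, so shift them by one step
# within each patient. The first row of every patient has no history (NaN).
df["prev_month_VL"] = df.groupby("PatientID")["VL"].shift(1)

# Static and categorical columns at t are used as-is; the raw VL at t is
# kept out of the features and goes to the targets instead.
df_features_toy = df[["PatientID", "Timepoints", "Gender", "Base Drug Combo", "prev_month_VL"]]
df_targets_toy = df[["PatientID", "Timepoints", "VL"]]
print(df_features_toy)
```

Rows without any history end up with `NaN` in the shifted column, so they have to be dropped or imputed before training.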

In [1]:
# - building this metadata just for our use.
feature_groups = dict(
    static=['Gender', 'Ethnic'],
    categorical=['Comp. NNRTI', 'Comp. INI', 'Base Drug Combo', 'Drug (M)', 'Extra PI','Extra pk-En',],
    numerical=['VL','CD4','Rel CD4', 'reward'],
    target=['VL', 'CD4', 'Rel CD4', 'VL (M)', 'CD4 (M)', 'reward']
)

__Remark__: In the above, I also added a `reward` feature. It is computed per row, but it is a target covariate (derived from values such as VL that are unobserved at timestep t), so it is a `numerical` value that belongs to the `target` group.

Let's get the dataframe:

In [2]:
# - preparing the master dataframe
from typing import List, Tuple
import os, sys
sys.path.insert(0, os.path.abspath('../../'))
from tqdm import tqdm
import random
import numpy
import pandas
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.linear_model import LinearRegression
import torch
import torch.utils.data.dataset
import torch.utils.data.dataloader

from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_absolute_percentage_error
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from tabulate import tabulate

from psmpy import PsmPy
from psmpy.functions import cohenD
from psmpy.plotting import *


from kp_problem.dataset.tabular import fetch_dataset

In [3]:
%%time

model_type = 'xgboost'
df_features, df_targets, feature_groups = fetch_dataset(
    ewma_histories=True,
    ewma_alpha=0.8,
    ewma_adjust=True,
    prev_month=True,
    balance_by_n_rows_per_treatment=10000,
)

100%|██████████| 30/30 [00:04<00:00,  7.41it/s]


CPU times: user 21.5 s, sys: 872 ms, total: 22.4 s
Wall time: 22.4 s
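The internals of `fetch_dataset` are not shown here, so the following is only my assumption of what the `ewma_histories`, `ewma_alpha` and `ewma_adjust` arguments correspond to: an exponentially weighted moving average of each one-hot column, computed per patient with pandas `ewm`. The toy frame and its values are made up.

```python
import pandas as pd

# Toy one-hot treatment indicator for a single patient over four visits.
toy = pd.DataFrame({
    "PatientID": [1, 1, 1, 1],
    "Timepoints": [0, 1, 2, 3],
    "Drug (M)_1.0": [0.0, 0.0, 1.0, 1.0],
})

alpha = 0.8  # same value as ewma_alpha above

# With adjust=True the weights (1 - alpha)**i are normalized by their sum,
# so early rows are not biased towards the first observation.
toy["Drug (M)_1.0_ewma"] = (
    toy.sort_values(["PatientID", "Timepoints"])
    .groupby("PatientID")["Drug (M)_1.0"]
    .transform(lambda s: s.ewm(alpha=alpha, adjust=True).mean())
)
print(toy)
```

With `alpha = 0.8` the most recent visit dominates, and older visits decay by a factor of 0.2 per step.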


In [4]:
import pickle

with open(os.path.abspath('../../resources/train_test_split.pkl'), 'rb') as handle:
    df_train_meta, df_test_meta = pickle.load(handle)
    df_train_meta = df_train_meta.loc[:, ['PatientID']]
    df_test_meta = df_test_meta.loc[:, ['PatientID']]

In [5]:
df_features

Unnamed: 0,PatientID,Timepoints,Comp. NNRTI_0.0,Comp. NNRTI_1.0,Comp. NNRTI_2.0,Comp. NNRTI_3.0,Comp. INI_0.0,Comp. INI_1.0,Comp. INI_2.0,Comp. INI_3.0,...,prev_month_Drug (M)_0.0_ewma,prev_month_Drug (M)_1.0_ewma,prev_month_Extra PI_0.0_ewma,prev_month_Extra PI_1.0_ewma,prev_month_Extra PI_2.0_ewma,prev_month_Extra PI_3.0_ewma,prev_month_Extra PI_4.0_ewma,prev_month_Extra PI_5.0_ewma,prev_month_Extra pk-En_0.0_ewma,prev_month_Extra pk-En_1.0_ewma
0,7052,0.900000,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,2.233785e-36,1.0,0.00128,0.000000e+00,0.032002,0.000000e+00,0.160011,0.806707,1.000000,0.000000
1,5938,0.700000,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,2.744381e-27,1.0,0.00128,0.000000e+00,0.032002,0.000000e+00,0.000061,0.966657,1.000000,0.000000
2,1083,0.500000,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1.611042e-17,1.0,0.00000,0.000000e+00,0.032000,0.000000e+00,0.000000,0.968000,1.000000,0.000000
3,7961,0.700000,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,9.005687e-22,1.0,0.00000,0.000000e+00,0.000000,0.000000e+00,0.000000,1.000000,1.000000,0.000000
4,2087,0.216667,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1.966080e-08,1.0,0.00128,0.000000e+00,0.000000,0.000000e+00,0.160051,0.838669,1.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39995,3403,0.666667,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,...,2.638828e-27,1.0,0.00000,4.398047e-28,0.000000,1.374390e-24,0.000000,1.000000,1.000000,0.000000
39996,3689,0.766667,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,...,1.000000e+00,0.0,0.00000,0.000000e+00,0.000000,0.000000e+00,0.000000,1.000000,1.000000,0.000000
39997,7982,0.600000,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,2.748779e-25,1.0,0.00000,1.000000e+00,0.000000,0.000000e+00,0.000000,0.000000,1.000000,0.000000
39998,3124,0.850000,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,...,5.404320e-35,1.0,0.00000,1.125900e-33,0.000000,1.759219e-29,0.000000,1.000000,1.000000,0.000000


In [6]:
df_features_test = pandas.merge(df_features, df_test_meta, on=['PatientID'], how='inner')
df_features_train = pandas.merge(df_features, df_train_meta, on=['PatientID'], how='inner')

df_targets_test = pandas.merge(df_targets, df_test_meta, on=['PatientID'], how='inner')
df_targets_train = pandas.merge(df_targets, df_train_meta, on=['PatientID'], how='inner')

assert len(set(df_targets_test.PatientID.unique()).intersection(df_targets_train.PatientID.unique())) == 0
assert len(set(df_features_test.PatientID.unique()).intersection(df_features_train.PatientID.unique())) == 0
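The split itself is loaded from a pickle, so how it was produced is not visible here. As a sketch of the idea the assertions check, a patient-disjoint split can be made by shuffling unique patient ids rather than rows. The toy frame, the 20% test fraction and the seed are my own assumptions.

```python
import numpy as np
import pandas as pd

# Toy rows: several timepoints per patient (values are made up).
rows = pd.DataFrame({
    "PatientID": [1, 1, 2, 2, 3, 4, 4, 5],
    "Timepoints": [0.1, 0.2, 0.1, 0.2, 0.1, 0.1, 0.2, 0.1],
})

# Split on unique patients, not on rows, so no patient lands on both sides.
rng = np.random.default_rng(0)
patient_ids = rng.permutation(rows["PatientID"].unique())
n_test = max(1, int(0.2 * len(patient_ids)))
test_meta = pd.DataFrame({"PatientID": patient_ids[:n_test]})
train_meta = pd.DataFrame({"PatientID": patient_ids[n_test:]})

# Same inner-merge pattern as above; the meta frames hold unique ids,
# so the merge filters rows without duplicating them.
rows_test = rows.merge(test_meta, on="PatientID", how="inner")
rows_train = rows.merge(train_meta, on="PatientID", how="inner")
```

Keeping one row per patient in the meta frames matters: a repeated id there would silently multiply the matching rows in the merge.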




In [7]:
df_features_test_original = df_features_test.copy()
feature_columns = [e for e in df_features_train.columns.tolist() if e not in ['PatientID']]

In [8]:
standard_scaler = StandardScaler()

scaling_columns = [e for e in df_features_train.columns.tolist() if e in feature_groups['numerical']]
if scaling_columns:
    print("scaling...")
    df_features_train.loc[:, scaling_columns] = standard_scaler.fit_transform(df_features_train.loc[:, scaling_columns])
    df_features_test.loc[:, scaling_columns] = standard_scaler.transform(df_features_test.loc[:, scaling_columns])


In [9]:
df_features_train.head()

Unnamed: 0,PatientID,Timepoints,Comp. NNRTI_0.0,Comp. NNRTI_1.0,Comp. NNRTI_2.0,Comp. NNRTI_3.0,Comp. INI_0.0,Comp. INI_1.0,Comp. INI_2.0,Comp. INI_3.0,...,prev_month_Drug (M)_0.0_ewma,prev_month_Drug (M)_1.0_ewma,prev_month_Extra PI_0.0_ewma,prev_month_Extra PI_1.0_ewma,prev_month_Extra PI_2.0_ewma,prev_month_Extra PI_3.0_ewma,prev_month_Extra PI_4.0_ewma,prev_month_Extra PI_5.0_ewma,prev_month_Extra pk-En_0.0_ewma,prev_month_Extra pk-En_1.0_ewma
0,7052,0.9,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,2.233785e-36,1.0,0.00128,0.0,0.032002,0.0,0.160011,0.8067072,1.0,0.0
1,7052,0.7,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,5.453578e-28,1.0,0.00128,0.0,2e-06,0.0,0.160061,0.8386565,1.0,0.0
2,7052,0.4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,2.080375e-15,1.0,0.00128,0.0,0.032002,0.0,0.166462,0.800256,1.0,0.0
3,7052,0.7,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,5.453578e-28,1.0,0.00128,0.0,2e-06,0.0,0.160061,0.8386565,1.0,0.0
4,7052,0.3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,3.250586e-11,1.0,0.001536,0.0,0.032002,0.0,0.966462,1.641651e-08,1.0,0.0


In [10]:
df_targets_test#['reward']

Unnamed: 0,VL,CD4,Rel CD4,VL (M),CD4 (M),reward,reward_advantage,PatientID,Timepoints
0,23.355646,701.59220,24.842684,0.0,0.0,1.726424,-1.612224,7961,0.700000
1,59.432470,490.77365,20.943201,0.0,0.0,0.858201,-2.269312,7961,0.300000
2,23.355646,701.59220,24.842684,0.0,0.0,1.726424,-1.612224,7961,0.700000
3,24.697287,630.32855,26.647427,0.0,0.0,1.623059,-1.154584,7961,0.900000
4,24.697287,630.32855,26.647427,0.0,0.0,1.623059,-1.154584,7961,0.900000
...,...,...,...,...,...,...,...,...,...
7947,39.206230,784.85960,39.077830,0.0,0.0,1.431118,-1.459456,7667,0.616667
7948,90.116356,358.45180,18.295029,0.0,0.0,0.378305,-0.870689,6353,0.900000
7949,5.931007,457.42566,37.628155,0.0,0.0,2.429233,0.631894,6513,0.783333
7950,26.191462,691.90610,28.665380,0.0,0.0,1.637867,0.424517,4656,0.666667


In [17]:
if model_type == 'xgboost':
    model = xgb.XGBRegressor(n_estimators=100, objective='reg:squarederror')
elif model_type == 'linear_regression':
    model = LinearRegression()
elif model_type == 'random_forest':
    model = RandomForestRegressor(n_estimators=100, criterion='squared_error')
else:
    raise ValueError(f"Unsupported model_type: {model_type!r}")


In [18]:
# feature_columns = [e for e in feature_columns if e != 'Timepoints']

In [19]:
X_train = df_features_train.loc[:, feature_columns]
X_test = df_features_test.loc[:, feature_columns]

y_train = df_targets_train['reward_advantage'].to_numpy()
y_test = df_targets_test['reward_advantage'].to_numpy()

In [20]:
feature_columns

['Timepoints',
 'Comp. NNRTI_0.0',
 'Comp. NNRTI_1.0',
 'Comp. NNRTI_2.0',
 'Comp. NNRTI_3.0',
 'Comp. INI_0.0',
 'Comp. INI_1.0',
 'Comp. INI_2.0',
 'Comp. INI_3.0',
 'Base Drug Combo_0.0',
 'Base Drug Combo_1.0',
 'Base Drug Combo_2.0',
 'Base Drug Combo_3.0',
 'Base Drug Combo_4.0',
 'Base Drug Combo_5.0',
 'Drug (M)_0.0',
 'Drug (M)_1.0',
 'Extra PI_0.0',
 'Extra PI_1.0',
 'Extra PI_2.0',
 'Extra PI_3.0',
 'Extra PI_4.0',
 'Extra PI_5.0',
 'Extra pk-En_0.0',
 'Extra pk-En_1.0',
 'Gender_1.0',
 'Gender_2.0',
 'Ethnic_2.0',
 'Ethnic_3.0',
 'Ethnic_4.0',
 'Comp. NNRTI_0.0_ewma',
 'Comp. NNRTI_1.0_ewma',
 'Comp. NNRTI_2.0_ewma',
 'Comp. NNRTI_3.0_ewma',
 'Comp. INI_0.0_ewma',
 'Comp. INI_1.0_ewma',
 'Comp. INI_2.0_ewma',
 'Comp. INI_3.0_ewma',
 'Base Drug Combo_0.0_ewma',
 'Base Drug Combo_1.0_ewma',
 'Base Drug Combo_2.0_ewma',
 'Base Drug Combo_3.0_ewma',
 'Base Drug Combo_4.0_ewma',
 'Base Drug Combo_5.0_ewma',
 'Drug (M)_0.0_ewma',
 'Drug (M)_1.0_ewma',
 'Extra PI_0.0_ewma',
 'Extr

In [21]:
X_train

Unnamed: 0,Timepoints,Comp. NNRTI_0.0,Comp. NNRTI_1.0,Comp. NNRTI_2.0,Comp. NNRTI_3.0,Comp. INI_0.0,Comp. INI_1.0,Comp. INI_2.0,Comp. INI_3.0,Base Drug Combo_0.0,...,prev_month_Drug (M)_0.0_ewma,prev_month_Drug (M)_1.0_ewma,prev_month_Extra PI_0.0_ewma,prev_month_Extra PI_1.0_ewma,prev_month_Extra PI_2.0_ewma,prev_month_Extra PI_3.0_ewma,prev_month_Extra PI_4.0_ewma,prev_month_Extra PI_5.0_ewma,prev_month_Extra pk-En_0.0_ewma,prev_month_Extra pk-En_1.0_ewma
0,0.900000,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,2.233785e-36,1.0,0.001280,0.0,3.200205e-02,0.0,0.160011,8.067072e-01,1.0,0.0
1,0.700000,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,5.453578e-28,1.0,0.001280,0.0,2.048131e-06,0.0,0.160061,8.386565e-01,1.0,0.0
2,0.400000,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,2.080375e-15,1.0,0.001280,0.0,3.200205e-02,0.0,0.166462,8.002560e-01,1.0,0.0
3,0.700000,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,5.453578e-28,1.0,0.001280,0.0,2.048131e-06,0.0,0.160061,8.386565e-01,1.0,0.0
4,0.300000,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,3.250586e-11,1.0,0.001536,0.0,3.200205e-02,0.0,0.966462,1.641651e-08,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32043,0.300000,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,...,2.726298e-11,1.0,0.000000,0.0,0.000000e+00,0.0,0.000000,1.000000e+00,1.0,0.0
32044,0.366667,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,...,1.006633e-14,1.0,0.000000,0.0,2.097152e-13,0.0,0.000000,1.000000e+00,1.0,0.0
32045,0.600000,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,...,8.592683e-22,1.0,0.000000,0.0,0.000000e+00,0.0,0.000000,1.000000e+00,1.0,0.0
32046,0.983333,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,...,5.629499e-35,1.0,0.000000,0.0,0.000000e+00,0.0,0.992000,8.000000e-03,1.0,0.0


In [22]:
%%time
# - training the model
model.fit(X_train, y_train)

CPU times: user 9.25 s, sys: 157 ms, total: 9.41 s
Wall time: 300 ms


In [23]:
%%time
# - evaluating on both train and test
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

CPU times: user 643 ms, sys: 4.73 ms, total: 648 ms
Wall time: 20.3 ms


In [24]:
evaluation_metrics = {
    'MSE': mean_squared_error,
    'MAE': mean_absolute_error,
    'MAPE': mean_absolute_percentage_error
}

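def evaluate(y_pred, y_true):
    output = []
    for metric_name, metric in evaluation_metrics.items():
        # sklearn metrics take (y_true, y_pred). MSE and MAE are symmetric, but MAPE
        # divides by |y_true|, so the order matters. The MAPE values printed below
        # come from the original run, which passed (y_pred, y_true).
        output.append(dict(title=metric_name, result=f"{metric(y_true, y_pred):.8f}"))
    print(tabulate(output))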

In [25]:
print("-" * 10)
print("TRAIN")
evaluate(y_pred_train, y_train)

print("-" * 10)
print("TEST")
evaluate(y_pred_test, y_test)

----------
TRAIN
----  --------
MSE   0.143929
MAE   0.263271
MAPE  9.24765
----  --------
----------
TEST
----  --------
MSE   0.307669
MAE   0.387567
MAPE  5.13614
----  --------
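A note on reading the MAPE rows: unlike MSE and MAE, MAPE is not symmetric in its arguments, because it divides each absolute error by the magnitude of the true value. It also blows up when targets are close to zero, which is common for a centered quantity like `reward_advantage`. The small `mape` helper below is my own re-implementation of that definition for illustration (scikit-learn's `mean_absolute_percentage_error` takes `y_true` first and `y_pred` second); the numbers are made up.

```python
import numpy as np

def mape(y_true, y_pred, eps=np.finfo(np.float64).eps):
    """Mean of |y_true - y_pred| / max(|y_true|, eps)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred) / np.maximum(np.abs(y_true), eps)))

y_true = np.array([-1.5, 0.01, 2.0])  # made-up targets, one close to zero
y_pred = np.array([-1.2, 0.50, 1.8])  # made-up predictions

print(mape(y_true, y_pred))  # dominated by the near-zero target
print(mape(y_pred, y_true))  # arguments swapped: a different number
```

For this reason MSE and MAE are the more reliable rows in the tables above.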


In [None]:
y_pred_test

array([-1.7090292 , -2.0199852 , -1.7090292 , ...,  0.16745645,
       -0.57883316, -0.53397644], dtype=float32)

In [None]:
y_test

array([-1.64793447, -1.70510119, -1.64793447, ..., -0.11119979,
       -0.54240722, -0.64610931])

In [None]:
df_features_test.iloc[0]['Comp. NNRTI_0.0']

1.0

## Treatment Recommendation

In [None]:
record_index = 629

In [None]:
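# Use the unscaled copy of the test features so that scaling is applied exactly once below.
the_input = df_features_test_original.iloc[[record_index]]

assert the_input.shape[0] == 1
output = []
# Counterfactual query: set each NNRTI option in turn and predict the outcome.
for recommended_nnrti in range(4):
    sample_input = the_input.copy()
    for i in range(4):
        # One-hot encode the candidate NNRTI (the *_ewma history columns are left unchanged).
        sample_input.loc[:, [f'Comp. NNRTI_{i}.0']] = 1. if recommended_nnrti == i else 0.

    if scaling_columns:
        sample_input.loc[:, scaling_columns] = standard_scaler.transform(sample_input.loc[:, scaling_columns])
    output.append(
        dict(
            # The model was trained on `reward_advantage`, so this is a predicted advantage.
            predicted_reward=model.predict(sample_input.loc[:, feature_columns]).item(),
            recommended_nnrti=recommended_nnrti
        )
    )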

print(tabulate(output))

---------  -
-1.29339   0
-0.933871  1
-0.977708  2
-0.933871  3
---------  -
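Turning the table into a recommendation is then a matter of picking the option with the highest predicted value. The sketch below copies the rounded values from the printed table for `record_index = 629` and returns every option tied for the maximum. How ties should be handled is my own assumption, and the ties may disappear at full float precision.

```python
# Predicted values copied (rounded) from the table above.
candidates = [
    {"recommended_nnrti": 0, "predicted_reward": -1.29339},
    {"recommended_nnrti": 1, "predicted_reward": -0.933871},
    {"recommended_nnrti": 2, "predicted_reward": -0.977708},
    {"recommended_nnrti": 3, "predicted_reward": -0.933871},
]

# Keep all options that reach the best predicted value instead of an arbitrary one.
best_value = max(c["predicted_reward"] for c in candidates)
best = [c["recommended_nnrti"] for c in candidates if c["predicted_reward"] == best_value]
print(best)
```

On these rounded numbers, options 1 and 3 tie, and both score higher than option 0, which is the NNRTI this record actually has set.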


In [None]:
df_targets_test.iloc[[record_index]]

Unnamed: 0,VL,CD4,Rel CD4,VL (M),CD4 (M),reward,reward_advantage,PatientID,Timepoints
629,22.981445,904.13837,33.23349,0.0,0.0,1.889908,-1.237163,7881,0.9


In [None]:
df_features_test.iloc[[record_index]]

Unnamed: 0,PatientID,Timepoints,Comp. NNRTI_0.0,Comp. NNRTI_1.0,Comp. NNRTI_2.0,Comp. NNRTI_3.0,Comp. INI_0.0,Comp. INI_1.0,Comp. INI_2.0,Comp. INI_3.0,...,prev_month_Drug (M)_0.0_ewma,prev_month_Drug (M)_1.0_ewma,prev_month_Extra PI_0.0_ewma,prev_month_Extra PI_1.0_ewma,prev_month_Extra PI_2.0_ewma,prev_month_Extra PI_3.0_ewma,prev_month_Extra PI_4.0_ewma,prev_month_Extra PI_5.0_ewma,prev_month_Extra pk-En_0.0_ewma,prev_month_Extra pk-En_1.0_ewma
629,7881,0.9,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,2.233785e-36,1.0,8.192524e-08,0.0,0.032,0.0,0.16,0.808,1.0,0.0


In [None]:
feature_importances = pandas.DataFrame(dict(name=model.feature_names_in_, importance=model.feature_importances_))
feature_importances.sort_values(by='importance', ascending=False)

Unnamed: 0,name,importance
1,Comp. NNRTI_0.0,0.452861
55,prev_month_CD4,0.059366
72,prev_month_Base Drug Combo_3.0,0.043099
73,prev_month_Base Drug Combo_4.0,0.038195
60,prev_month_reward,0.025049
...,...,...
16,Drug (M)_1.0,0.000000
17,Extra PI_0.0,0.000000
27,Ethnic_2.0,0.000000
26,Gender_2.0,0.000000
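Because the categorical features are one-hot encoded, a single source feature is spread over several rows of this table. One way to read it, sketched below under my own assumptions, is to strip the level suffix (such as `_0.0`) and sum importances per base feature. Only the rows visible in the truncated output above are used, so the totals are partial.

```python
import pandas as pd

# Rows copied from the (truncated) table above.
fi = pd.DataFrame({
    "name": [
        "Comp. NNRTI_0.0", "prev_month_CD4", "prev_month_Base Drug Combo_3.0",
        "prev_month_Base Drug Combo_4.0", "prev_month_reward",
        "Drug (M)_1.0", "Extra PI_0.0", "Ethnic_2.0", "Gender_2.0",
    ],
    "importance": [0.452861, 0.059366, 0.043099, 0.038195, 0.025049, 0.0, 0.0, 0.0, 0.0],
})

# Drop the one-hot level (e.g. "_3.0"), keeping an optional "_ewma" suffix.
fi["base_feature"] = fi["name"].str.replace(r"_\d+\.\d+(?=_ewma$|$)", "", regex=True)

# Sum importances over the levels of each base feature.
grouped = fi.groupby("base_feature")["importance"].sum().sort_values(ascending=False)
print(grouped)
```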
