# For each patient visit where a protein/peptide sample was taken, you will estimate UPDRS scores:
 * for that visit (month 0)
 * for 6 months later
 * for 12 months later
 * for 24 months later
 
Predictions for any visits that didn't ultimately take place are ignored.

## This project has multiple fascinating features and challenges, including:
 - Reusing base layers of another model I also train
 - Multi-output models (16 outputs!)

## Approach

 1. **Ingest and format data**
 2. **Pre-train a multi-output model using `supplemental_clinical_data.csv`**
 
 Since there is no protein/peptide information associated with the supplemental clinical data, we can use it to train a model on the typical progression of Parkinson's disease (as measured by UPDRS scores). That model predicts scores from the time since the initial visit alone, which gives us base layers to reuse when predicting progression from proteins/peptides.
 3. **Train a multi-output model using `train_clinical_data`**
 
 Our goal in both steps 2 and 3 is to train a single model that can predict UPDRS scores at months 0, 6, 12, and 24. I'll use a multi-output approach: UPDRS has 4 parts and each part is predicted at 4 time points, for a total of 16 outputs!
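As a quick check on the output count, the 16 targets can be enumerated as every combination of the 4 UPDRS parts and the 4 prediction horizons. This is only an illustrative sketch; the constant names are made up here, and the naming pattern mirrors the `updrs_N_plus_M_months` label columns built later in this notebook.

```python
from itertools import product

UPDRS_PARTS = (1, 2, 3, 4)    # updrs_1 .. updrs_4
HORIZONS = (0, 6, 12, 24)     # months after the visit

# One target per (part, horizon) pair
target_cols = [
    f'updrs_{part}_plus_{months}_months'
    for part, months in product(UPDRS_PARTS, HORIZONS)
]

print(len(target_cols))
print(target_cols[:4])
```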

## Data Ingestion

In [1]:
import numpy as np
import pandas as pd
import sklearn
from pathlib import Path

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/amp-parkinsons-disease-progression-prediction/train_proteins.csv
/kaggle/input/amp-parkinsons-disease-progression-prediction/train_clinical_data.csv
/kaggle/input/amp-parkinsons-disease-progression-prediction/public_timeseries_testing_util.py
/kaggle/input/amp-parkinsons-disease-progression-prediction/supplemental_clinical_data.csv
/kaggle/input/amp-parkinsons-disease-progression-prediction/train_peptides.csv
/kaggle/input/amp-parkinsons-disease-progression-prediction/amp_pd_peptide/competition.cpython-37m-x86_64-linux-gnu.so
/kaggle/input/amp-parkinsons-disease-progression-prediction/amp_pd_peptide/__init__.py
/kaggle/input/amp-parkinsons-disease-progression-prediction/example_test_files/sample_submission.csv
/kaggle/input/amp-parkinsons-disease-progression-prediction/example_test_files/test_proteins.csv
/kaggle/input/amp-parkinsons-disease-progression-prediction/example_test_files/test_peptides.csv
/kaggle/input/amp-parkinsons-disease-progression-prediction/example_test

In [2]:
# Pull in all clinical data

INPUTS_DIR = Path('/kaggle/input/amp-parkinsons-disease-progression-prediction/') 

# train_clinical_data.csv
clinical_df = pd.read_csv(
    INPUTS_DIR / 'train_clinical_data.csv',
    index_col='visit_id',
)
print(clinical_df.shape)
print(clinical_df.head(n=10))

# supplemental_clinical_data.csv
supp_clinical_df = pd.read_csv(
    INPUTS_DIR / 'supplemental_clinical_data.csv',
    index_col='visit_id',
)
print(supp_clinical_df.shape)
print(supp_clinical_df.head(n=10))

# concat both to a single dataframe
all_clinical_df = pd.concat([clinical_df, supp_clinical_df])

print("\nUnion of both clinical data sets:")
print(all_clinical_df.shape)

(2615, 7)
          patient_id  visit_month  updrs_1  updrs_2  updrs_3  updrs_4  \
visit_id                                                                
55_0              55            0     10.0      6.0     15.0      NaN   
55_3              55            3     10.0      7.0     25.0      NaN   
55_6              55            6      8.0     10.0     34.0      NaN   
55_9              55            9      8.0      9.0     30.0      0.0   
55_12             55           12     10.0     10.0     41.0      0.0   
55_18             55           18      7.0     13.0     38.0      0.0   
55_24             55           24     16.0      9.0     49.0      0.0   
55_30             55           30     14.0     13.0     49.0      0.0   
55_36             55           36     17.0     18.0     51.0      0.0   
55_42             55           42     12.0     20.0     41.0      0.0   

         upd23b_clinical_state_on_medication  
visit_id                                      
55_0               

In [3]:
# Pull in protein data
proteins_df = pd.read_csv(
    INPUTS_DIR / 'train_proteins.csv',
)
print(proteins_df.shape)
proteins_df.head(n=10)

# Pull in peptide data
peptide_df = pd.read_csv(
    INPUTS_DIR / 'train_peptides.csv',
)
print(peptide_df.shape)
peptide_df.head(n=10)

(232741, 5)
(981834, 6)


Unnamed: 0,visit_id,visit_month,patient_id,UniProt,Peptide,PeptideAbundance
0,55_0,0,55,O00391,NEQEQPLGQWHLS,11254.3
1,55_0,0,55,O00533,GNPEPTFSWTK,102060.0
2,55_0,0,55,O00533,IEIPSSVQQVPTIIK,174185.0
3,55_0,0,55,O00533,KPQSAVYSTGSNGILLC(UniMod_4)EAEGEPQPTIK,27278.9
4,55_0,0,55,O00533,SMEQNGPGLEYR,30838.7
5,55_0,0,55,O00533,TLKIENVSYQDKGNYR,23216.5
6,55_0,0,55,O00533,VIAVNEVGR,170878.0
7,55_0,0,55,O00533,VMTPAVYAPYDVK,148771.0
8,55_0,0,55,O00533,VNGSPVDNHPFAGDVVFPR,55202.1
9,55_0,0,55,O00584,ELDLNSVLLK,27229.3


## Data Exploration

In [4]:
# sort by patient id and month
clinical_df = clinical_df.sort_values(
    ['patient_id', 'visit_month'],
)
supp_clinical_df = supp_clinical_df.sort_values(
    ['patient_id', 'visit_month'],
)
all_clinical_df = all_clinical_df.sort_values(
    ['patient_id', 'visit_month'],
)
all_clinical_df.head()

# get all unique patient ids (already sorted)
clin_patient_ids = clinical_df['patient_id'].unique()
supp_patient_ids = supp_clinical_df['patient_id'].unique()
patient_ids = all_clinical_df['patient_id'].unique()

# Verifying there are no supplemental patient ids that may have protein/peptide data
np.intersect1d(clin_patient_ids, supp_patient_ids)



array([], dtype=int64)

## Data Processing
This is (with only minor modifications) from [this work by Xavier R Nogueira](https://www.kaggle.com/code/xaviernogueira/pre-processing-making-labels-and-imputation#Add-supplemental-clinical-data-to-the-main-clininical-data) released under the [Apache 2.0](http://www.apache.org/licenses/LICENSE-2.0) open source license.

## Clinical Data Processing
The main change I made was to wrap it in a reusable function, so that I could process the `train_clinical_data` and the `supplemental_clinical_data` separately. Since the `supplemental_clinical_data` doesn't have protein/peptide information, I don't think it should be grouped in with the labeled data.

Instead, I'll use the `supplemental_clinical_data` to pretrain the base layers of my multi-output model on typical Parkinson's disease progression.

In [5]:
def process_clinical_data(clinical_data, parquet_outfile_name):
    # rename same visit columns
    for updrs_col in [i for i in clinical_data.columns if 'updrs' in i]:
        clinical_data = clinical_data.rename(
            columns={updrs_col: updrs_col + '_plus_0_months'},
        )
    clinical_data.head()

    # sort by patient id and month
    clinical_data = clinical_data.sort_values(
        ['patient_id', 'visit_month'],
    )
    clinical_data.head()

    # get all unique patient ids (already sorted)
    patient_ids = clinical_data['patient_id'].unique()

    # get all unique visit_months and sort
    months = clinical_data.visit_month.unique()
    months.sort()
    evenly_spaced_months = np.arange(months[0], months[-1] + 1)

    # create a new multi index with all unique patient id evenly spaced months
    multi_index = pd.MultiIndex.from_product(
        [patient_ids, evenly_spaced_months],
        names=['patient_id', 'visit_month'],
    )
    print(len(multi_index))

    # convert into dataframe multiindex
    multi_index = pd.DataFrame(
        index=multi_index,
    )

    multi_index.head(n=5)

    # join full clinical data
    clinical_data = multi_index.join(
        clinical_data.reset_index().set_index(['patient_id', 'visit_month']),
        how='left',
    )
    clinical_data.head()

    # shift data forward for our labels while grouping by patient_id;
    # on the evenly spaced monthly grid, shifting by k rows looks k months ahead
    for col in [i for i in clinical_data.columns if 'plus_0_months' in i]:
        for horizon in (6, 12, 24):
            # replace the full suffix so digits elsewhere in the name are never touched
            new_col = col.replace('plus_0_months', f'plus_{horizon}_months')
            clinical_data[new_col] = clinical_data.groupby('patient_id')[col].shift(-horizon)

    # drop all rows where visit_id is NaN
    clinical_data = clinical_data.loc[clinical_data.visit_id.notna()]

    # reset index back to visit_id
    clinical_data = (
        clinical_data
        .reset_index()
        .set_index('visit_id')
    )
    clinical_data.head()
    
    # save labels data frame as a parquet file (without non labels columns)
    clinical_data.drop(
        columns=['patient_id', 'visit_month', 'upd23b_clinical_state_on_medication'],
    ).to_parquet(
        parquet_outfile_name,
    )
    
    return clinical_data

In [6]:
# Clinical data post-processing, saving labels to a parquet file
clinical_df_pp = process_clinical_data(clinical_df, 'clinical_data_training_labels.parquet')

# Supplemental clinical data post-processing, saving labels to a parquet file
supp_clinical_df_pp = process_clinical_data(supp_clinical_df, 'supp_clinical_data_training_labels.parquet')

27032
28527
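
The label-shifting step inside `process_clinical_data` is easier to follow on a toy frame. The sketch below assumes a single hypothetical patient seen at months 0, 6, and 12; once the visits are placed on a month-by-month grid, shifting by 6 rows within the patient is the same as looking 6 months ahead. The toy data and the final `dropna` filter are illustrative stand-ins (the function above filters on `visit_id` instead).

```python
import numpy as np
import pandas as pd

# Toy data: one hypothetical patient with visits at months 0, 6 and 12
toy = pd.DataFrame({
    'patient_id': [1, 1, 1],
    'visit_month': [0, 6, 12],
    'updrs_1_plus_0_months': [10.0, 8.0, 12.0],
})

# Reindex onto every month 0..12 so a shift by k rows equals k months
grid = pd.MultiIndex.from_product(
    [toy['patient_id'].unique(), np.arange(0, 13)],
    names=['patient_id', 'visit_month'],
)
monthly = pd.DataFrame(index=grid).join(
    toy.set_index(['patient_id', 'visit_month']),
    how='left',
)

# Look 6 months ahead within each patient
monthly['updrs_1_plus_6_months'] = (
    monthly.groupby('patient_id')['updrs_1_plus_0_months'].shift(-6)
)

# Keep only the months where a visit actually happened
labels = monthly.dropna(subset=['updrs_1_plus_0_months'])
print(labels)
```

The month-12 row has no visit 6 months later, so its `plus_6_months` label stays missing, which is where many of the NaNs in the label frames come from.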


## Protein/Peptide data processing

In [7]:
proteins_df = proteins_df.pivot(
    index='visit_id',
    columns='UniProt',
    values='NPX',
)
print(proteins_df.index.unique().shape)
print(proteins_df.shape)
proteins_df.head(n=5)


peptide_df = peptide_df.pivot(
    index='visit_id',
    columns='Peptide',
    values='PeptideAbundance',
)
print(peptide_df.index.unique().shape)
print(peptide_df.shape)
peptide_df.head(n=5)

(1113,)
(1113, 227)
(1113,)
(1113, 968)


visit_id,AADDTWEPFASGK,AAFGQGSGPIMLDEVQC(UniMod_4)TGTEASLADC(UniMod_4)K,AAFTEC(UniMod_4)C(UniMod_4)QAADK,AANEVSSADVK,AATGEC(UniMod_4)TATVGKR,AATVGSLAGQPLQER,AAVYHHFISDGVR,ADDKETC(UniMod_4)FAEEGK,ADDKETC(UniMod_4)FAEEGKK,ADDLGKGGNEESTKTGNAGSR,...,YSLTYIYTGLSK,YTTEIIK,YVGGQEHFAHLLILR,YVM(UniMod_35)LPVADQDQC(UniMod_4)IR,YVMLPVADQDQC(UniMod_4)IR,YVNKEIQNAVNGVK,YWGVASFLQK,YYC(UniMod_4)FQGNQFLR,YYTYLIMNK,YYWGGQYTWDMAK
10053_0,6580710.0,31204.4,7735070.0,,,,46620.3,236144.0,,,...,202274.0,,4401830.0,77482.6,583075.0,76705.7,104260.0,530223.0,,7207.3
10053_12,6333510.0,52277.6,5394390.0,,,,57554.5,108298.0,45885.4,,...,201009.0,,5001750.0,36745.3,355643.0,92078.1,123254.0,453883.0,49281.9,25332.8
10053_18,7129640.0,61522.0,7011920.0,35984.7,17188.0,19787.3,36029.4,708729.0,5067790.0,30838.2,...,220728.0,,5424380.0,39016.0,496021.0,63203.6,128336.0,447505.0,52389.1,21235.7
10138_12,7404780.0,46107.2,10610900.0,,20910.2,66662.3,55253.9,79575.5,6201210.0,26720.0,...,188362.0,9433.71,3900280.0,48210.3,328482.0,89822.1,129964.0,552232.0,65657.8,9876.98
10138_24,13788300.0,56910.3,6906160.0,13785.5,11004.2,63672.7,36819.8,34160.9,2117430.0,15645.2,...,206187.0,6365.15,3521800.0,69984.6,496737.0,80919.3,111799.0,,56977.6,4903.09
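
For readers unfamiliar with `pivot`, here is a minimal sketch of the long-to-wide reshaping used above, on made-up rows (the protein names `P_A` and `P_B` and the values are hypothetical). Any (visit, protein) pair that was never measured becomes NaN in the wide table, which is why imputation is needed afterwards.

```python
import pandas as pd

# Hypothetical long-format rows: one row per (visit, protein) measurement
long_df = pd.DataFrame({
    'visit_id': ['1_0', '1_0', '1_6'],
    'UniProt': ['P_A', 'P_B', 'P_A'],
    'NPX': [10.0, 20.0, 11.0],
})

# One row per visit, one column per protein
wide_df = long_df.pivot(index='visit_id', columns='UniProt', values='NPX')
print(wide_df)
```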


In [8]:
# keep track of our protein / peptide columns
protein_cols = list(proteins_df.columns)
peptide_cols = list(peptide_df.columns)

# join the protein / peptide data
prot_and_peps_df = pd.concat(
    [proteins_df, peptide_df],
    axis=1,
)
prot_and_peps_df.head()

visit_id,O00391,O00533,O00584,O14498,O14773,O14791,O15240,O15394,O43505,O60888,...,YSLTYIYTGLSK,YTTEIIK,YVGGQEHFAHLLILR,YVM(UniMod_35)LPVADQDQC(UniMod_4)IR,YVMLPVADQDQC(UniMod_4)IR,YVNKEIQNAVNGVK,YWGVASFLQK,YYC(UniMod_4)FQGNQFLR,YYTYLIMNK,YYWGGQYTWDMAK
10053_0,9104.27,402321.0,,,7150.57,2497.84,83002.9,15113.6,167327.0,129048.0,...,202274.0,,4401830.0,77482.6,583075.0,76705.7,104260.0,530223.0,,7207.3
10053_12,10464.2,435586.0,,,,,197117.0,15099.1,164268.0,108114.0,...,201009.0,,5001750.0,36745.3,355643.0,92078.1,123254.0,453883.0,49281.9,25332.8
10053_18,13235.7,507386.0,7126.96,24525.7,,2372.71,126506.0,16289.6,168107.0,163776.0,...,220728.0,,5424380.0,39016.0,496021.0,63203.6,128336.0,447505.0,52389.1,21235.7
10138_12,12600.2,494581.0,9165.06,27193.5,22506.1,6015.9,156313.0,54546.4,204013.0,56725.0,...,188362.0,9433.71,3900280.0,48210.3,328482.0,89822.1,129964.0,552232.0,65657.8,9876.98
10138_24,12003.2,522138.0,4498.51,17189.8,29112.4,2665.15,151169.0,52338.1,240892.0,85767.1,...,206187.0,6365.15,3521800.0,69984.6,496737.0,80919.3,111799.0,,56977.6,4903.09


## Use regression to impute missing protein/peptide values

The author mentioned above uses both `IterativeImputer` and `KNNImputer`. With time, I may try other approaches.

In [9]:
# enable experimental imputer
from sklearn.experimental import enable_iterative_imputer

# import our imputation algos
from sklearn.impute import (
    IterativeImputer,
    KNNImputer,
    MissingIndicator,
)

In [10]:
%%time
iter_imputer = IterativeImputer(
    max_iter=10,
    n_nearest_features=25,
    imputation_order='ascending',
    random_state=0,
    verbose=1,
)

# fit the imputer
iter_prot_and_peps_array = iter_imputer.fit_transform(prot_and_peps_df)
iter_prot_and_peps_df = pd.DataFrame(
    data=iter_prot_and_peps_array,
    columns=prot_and_peps_df.columns,
    index=prot_and_peps_df.index,
)

# save to parquet
iter_prot_and_peps_df[protein_cols].to_parquet(
    'protein_data_iterative_impute.parquet',
)

iter_prot_and_peps_df[peptide_cols].to_parquet(
    'peptide_data_iterative_impute.parquet',
)

# Saving the data together makes more sense to me
iter_prot_and_peps_df.to_parquet(
    'protein_peptide_data_iterative_impute.parquet',
)

[IterativeImputer] Completing matrix with shape (1113, 1195)
[IterativeImputer] Change: 130567040.68250778, scaled tolerance: 613851.0 
[IterativeImputer] Change: 77675980.52362576, scaled tolerance: 613851.0 
[IterativeImputer] Change: 78590208.8506695, scaled tolerance: 613851.0 
[IterativeImputer] Change: 68482566.33326018, scaled tolerance: 613851.0 
[IterativeImputer] Change: 58949685.60739637, scaled tolerance: 613851.0 
[IterativeImputer] Change: 49104473.013554275, scaled tolerance: 613851.0 
[IterativeImputer] Change: 53711002.099543095, scaled tolerance: 613851.0 
[IterativeImputer] Change: 4.3106789405883556e+19, scaled tolerance: 613851.0 
[IterativeImputer] Change: 4.5373704590897545e+19, scaled tolerance: 613851.0 
[IterativeImputer] Change: 9.571200423027006e+19, scaled tolerance: 613851.0 




CPU times: user 3min 17s, sys: 2min 33s, total: 5min 51s
Wall time: 1min 30s


In [11]:
%%time
knn_imputer = KNNImputer(
    weights='uniform',
    n_neighbors=10,
)

# fit the imputer
knn_prot_and_peps_array = knn_imputer.fit_transform(prot_and_peps_df)
knn_prot_and_peps_df = pd.DataFrame(
    data=knn_prot_and_peps_array,
    columns=prot_and_peps_df.columns,
    index=prot_and_peps_df.index,
)

# save to parquet
knn_prot_and_peps_df[protein_cols].to_parquet(
    'protein_data_KNN_impute.parquet',
)

knn_prot_and_peps_df[peptide_cols].to_parquet(
    'peptide_data_KNN_impute.parquet',
)

# Saving the data together makes more sense to me
knn_prot_and_peps_df.to_parquet(
    'protein_peptide_data_KNN_impute.parquet',
)

CPU times: user 11.2 s, sys: 367 ms, total: 11.6 s
Wall time: 10.7 s
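
To make the KNN imputation concrete, here is a tiny sketch on a hypothetical 4x2 array. With `weights='uniform'`, a missing entry is filled with the plain mean of that column over the nearest rows, where distance is computed only from the features both rows have observed. The array and `n_neighbors=2` are chosen for illustration and are not the settings used above.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: row 1 is missing its second feature
toy = np.array([
    [1.0, 2.0],
    [3.0, np.nan],
    [5.0, 6.0],
    [7.0, 8.0],
])

imputer = KNNImputer(n_neighbors=2, weights='uniform')
filled = imputer.fit_transform(toy)

# Rows 0 and 2 are closest to row 1 on the first feature,
# so the gap is filled with the mean of their second feature.
print(filled[1, 1])
```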


# Train a deep learning model on Parkinson's Progression

Use the `supplemental_clinical_data` which has no protein/peptide information to create the base layers.

In [12]:
# Split the supplemental_clinical_data into a train/test set
from sklearn.model_selection import train_test_split

# Split the post-processed supplemental into data and labels
supp_X = supp_clinical_df_pp[['visit_month']].reset_index(drop=True)
supp_y = supp_clinical_df_pp.drop(columns=['patient_id', 'visit_month', 'upd23b_clinical_state_on_medication'],).reset_index(drop=True)

supp_X.head()
supp_y.head()

print(supp_X.shape)
print(supp_y.shape)

# Split 80/20 into train and test data
# Not sure if we'll end up using this or k-fold cv
pre_train_X, pre_test_X, pre_train_y, pre_test_y = train_test_split(supp_X, supp_y, test_size=0.2, random_state=42)

pre_train_X.head()
pre_train_y.head()

# Turns out, there are a lot of holes in this data
# print(supp_y.isna().sum().sum())
# print(supp_y.dropna().shape)

(2223, 1)
(2223, 16)


Unnamed: 0,updrs_1_plus_0_months,updrs_2_plus_0_months,updrs_3_plus_0_months,updrs_4_plus_0_months,updrs_1_plus_6_months,updrs_1_plus_12_months,updrs_1_plus_24_months,updrs_2_plus_6_months,updrs_2_plus_12_months,updrs_2_plus_24_months,updrs_3_plus_6_months,updrs_3_plus_12_months,updrs_3_plus_24_months,updrs_4_plus_6_months,updrs_4_plus_12_months,updrs_4_plus_24_months
23,6.0,3.0,11.0,6.0,,,,,,,,,,,,
1903,0.0,1.0,20.0,,,,,,,,,,,,,
427,13.0,6.0,34.0,0.0,,,,,,,,,,,,
554,2.0,1.0,23.0,,0.0,1.0,,2.0,1.0,,23.0,20.0,,,3.0,
1767,4.0,9.0,26.0,0.0,,,,,,,,,,,,


In [13]:
from numpy import mean
from numpy import std
from sklearn.model_selection import RepeatedKFold
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout

# Define our multi-output regression model
def multi_output_regression_model(n_inputs, n_outputs, extra_dense=False, dropout=False):
    print("Building a model with " + str(n_inputs) + " inputs and " + str(n_outputs) + " outputs.")
    model = Sequential()
    model.add(Dense(16, input_dim=n_inputs, kernel_initializer='he_uniform', activation='tanh'))
    # note: this first Dropout is always applied; the `dropout` flag only adds the later ones
    model.add(Dropout(0.2))
    if extra_dense:
        model.add(Dense(64, activation='tanh'))
    if dropout:
        model.add(Dropout(0.2))
    if extra_dense:
        model.add(Dense(32, activation='tanh'))
    if dropout:        
        model.add(Dropout(0.2))
    model.add(Dense(n_outputs))
    model.compile(loss='mae', optimizer='adam')
    return model

# Define a k-fold cross validation
def evaluate_model(X, y, dense=False, dropout=False):
    results = list()
    n_inputs, n_outputs = X.shape[1], y.shape[1]
    # print("Multi-output shape of inputs/outputs: " + str(n_inputs) + ":" + str(n_outputs))
    # define evaluation procedure
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=256)
    # enumerate folds
    for train_ix, test_ix in cv.split(X):
        # prepare data
        X_train, X_test = X.iloc[train_ix], X.iloc[test_ix]
        y_train, y_test = y.iloc[train_ix], y.iloc[test_ix]
        # define model
        model = multi_output_regression_model(n_inputs, n_outputs, dense, dropout)
        # fit model
        model.fit(X_train, y_train, verbose=0, epochs=100)
        # evaluate model on test set
        mae = model.evaluate(X_test, y_test, verbose=0)
        # store result
        print('>%.3f' % mae)
        results.append(mae)
    return results
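
As a rough guide to the cost of `evaluate_model`: `RepeatedKFold` with 10 splits and 3 repeats yields 30 train/test splits, so a fresh model is built and trained 30 times per call. The sketch below only counts the splits on a hypothetical 20-sample array; it does not train anything.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X_toy = np.arange(20).reshape(-1, 1)   # 20 hypothetical samples
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=256)

splits = list(cv.split(X_toy))
print(len(splits))                     # number of model fits per evaluation

train_ix, test_ix = splits[0]
print(len(train_ix), len(test_ix))     # fold sizes for the first split
```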


In [14]:
# Let's try it
results = evaluate_model(supp_X, supp_y)

# summarize performance
print('MAE: %.3f (%.3f)' % (mean(results), std(results)))

Building a model with 1 inputs and 16 outputs.
>nan
Building a model with 1 inputs and 16 outputs.
>nan
Building a model with 1 inputs and 16 outputs.
>nan
Building a model with 1 inputs and 16 outputs.
>nan
Building a model with 1 inputs and 16 outputs.
>nan
Building a model with 1 inputs and 16 outputs.
>nan
Building a model with 1 inputs and 16 outputs.
>nan
Building a model with 1 inputs and 16 outputs.
>nan
Building a model with 1 inputs and 16 outputs.
>nan
Building a model with 1 inputs and 16 outputs.
>nan
Building a model with 1 inputs and 16 outputs.
>nan
Building a model with 1 inputs and 16 outputs.
>nan
Building a model with 1 inputs and 16 outputs.
>nan
Building a model with 1 inputs and 16 outputs.
>nan
Building a model with 1 inputs and 16 outputs.
>nan
Building a model with 1 inputs and 16 outputs.
>nan
Building a model with 1 inputs and 16 outputs.
>nan
Building a model with 1 inputs and 16 outputs.
>nan
Building a model with 1 inputs and 16 outputs.
>nan
Building a m

## About 56% of this data is missing
The regression model also isn't converging (every fold reports an MAE of `nan`), so using it for pre-training may not be particularly effective after all.

For now, I'll move on to training on the clinical data, and circle back to this if time allows.

## Update
Coming back from working with the clinical data, I've discovered that NaNs are throwing off the training and that imputing missing values is pretty effective. I'll try the same approach here.

As noted later on, I'm using KNN imputation only, because it performs on par with, or perhaps trivially better than, iterative imputation.
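
A small NumPy illustration of why missing values lead to a `nan` score: mean absolute error averages over every entry, so a single NaN label is enough to make the whole mean NaN. The arrays are made up. The masked variant at the end is shown only as one possible alternative to imputation; it is not something this notebook uses.

```python
import numpy as np

# Hypothetical labels with one missing entry, and matching predictions
y_true = np.array([[3.0, np.nan],
                   [5.0, 7.0]])
y_pred = np.array([[2.0, 4.0],
                   [5.0, 9.0]])

# Plain MAE: the NaN propagates through abs() and mean()
plain_mae = np.mean(np.abs(y_true - y_pred))
print(plain_mae)

# Masked MAE: average only over the entries that have a label
mask = ~np.isnan(y_true)
masked_mae = np.mean(np.abs(y_true[mask] - y_pred[mask]))
print(masked_mae)
```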

In [15]:
%%time
print(supp_y.isna().sum().sum())
knn_imputer = KNNImputer(
    weights='uniform',
    n_neighbors=10,
)

# fit the imputer
supp_y_knn_array = knn_imputer.fit_transform(supp_y)
supp_y_knn = pd.DataFrame(
    data=supp_y_knn_array,
    columns=supp_y.columns,
    index=supp_y.index,
)

# save to parquet
supp_y_knn.to_parquet(
    'supp_clinical_data_KNN_impute.parquet',
)
print(supp_y_knn.isna().sum().sum())

20126
0
CPU times: user 1.09 s, sys: 446 ms, total: 1.54 s
Wall time: 861 ms


In [16]:
# Repeat using KNN Imputed Supplemental Clinical Data
print(supp_X.isna().sum().sum())
print(supp_y_knn.isna().sum().sum())

results = evaluate_model(supp_X, supp_y_knn)

# summarize performance
print('MAE: %.3f (%.3f)' % (mean(results), std(results)))

0
0
Building a model with 1 inputs and 16 outputs.
>3.984
Building a model with 1 inputs and 16 outputs.
>3.976
Building a model with 1 inputs and 16 outputs.
>3.774
Building a model with 1 inputs and 16 outputs.
>3.889
Building a model with 1 inputs and 16 outputs.
>3.954
Building a model with 1 inputs and 16 outputs.
>3.868
Building a model with 1 inputs and 16 outputs.
>3.735
Building a model with 1 inputs and 16 outputs.
>4.119
Building a model with 1 inputs and 16 outputs.
>3.916
Building a model with 1 inputs and 16 outputs.
>3.924
Building a model with 1 inputs and 16 outputs.
>3.980
Building a model with 1 inputs and 16 outputs.
>3.898
Building a model with 1 inputs and 16 outputs.
>3.844
Building a model with 1 inputs and 16 outputs.
>3.741
Building a model with 1 inputs and 16 outputs.
>3.722
Building a model with 1 inputs and 16 outputs.
>4.068
Building a model with 1 inputs and 16 outputs.
>3.899
Building a model with 1 inputs and 16 outputs.
>3.999
Building a model with 1 

# Summary of results: 

Using KNN to impute missing UPDRS scores yields the following - not bad!

**MAE: 3.913 STD: 0.123**

# Train on Clinical Data



In [17]:
def merge_align_split(data, labels, join_col, drop_na):
    data_cols = list(data.columns)
    label_cols = list(labels.columns)
    merge_df = pd.merge(data, labels, how='inner', on = join_col)
    if drop_na:
        merge_df = merge_df.dropna()
    out_X = merge_df.drop(columns=label_cols,).reset_index(drop=True)
    out_y = merge_df.drop(columns=data_cols,).reset_index(drop=True)
    return out_X, out_y

In [18]:
# Join clinical data with peptide data to align indices
iter_prot_and_peps_df.head()
print(iter_prot_and_peps_df.shape)
knn_prot_and_peps_df.head()
print(knn_prot_and_peps_df.shape)
clinical_df_pp.head()
print(clinical_df_pp.shape)

# Sanity check to make sure there's no protein/peptide data in the supplemental dataset
# (DataFrame.append was removed in pandas 2.0, so use pd.concat instead)
all_clinical_df = pd.concat([clinical_df_pp, supp_clinical_df_pp])
print(all_clinical_df.shape)
all_clinical_df.head()
all_clinical_df_iter = pd.merge(all_clinical_df, iter_prot_and_peps_df, how='inner', on = 'visit_id')
print(all_clinical_df_iter.shape)

# Align the protein/peptide data to order it with the clinical
iter_X, y = merge_align_split(iter_prot_and_peps_df, clinical_df_pp, 'visit_id', False)
knn_X, y = merge_align_split(knn_prot_and_peps_df, clinical_df_pp, 'visit_id', False)

y = y.drop(columns=['patient_id', 'visit_month', 'upd23b_clinical_state_on_medication'],)

# Try dropping rows that contain NaNs...but this reduces our dataset to almost nothing
iter_X_dropNA, y_dropNA = merge_align_split(iter_prot_and_peps_df, clinical_df_pp, 'visit_id', True)
knn_X_dropNA, y_dropNA = merge_align_split(knn_prot_and_peps_df, clinical_df_pp, 'visit_id', True)

y_dropNA = y_dropNA.drop(columns=['patient_id', 'visit_month', 'upd23b_clinical_state_on_medication'],)

y_dropNA.head()

(1113, 1195)
(1113, 1195)
(2615, 19)
(4838, 19)
(1068, 1214)


Unnamed: 0,updrs_1_plus_0_months,updrs_2_plus_0_months,updrs_3_plus_0_months,updrs_4_plus_0_months,updrs_1_plus_6_months,updrs_1_plus_12_months,updrs_1_plus_24_months,updrs_2_plus_6_months,updrs_2_plus_12_months,updrs_2_plus_24_months,updrs_3_plus_6_months,updrs_3_plus_12_months,updrs_3_plus_24_months,updrs_4_plus_6_months,updrs_4_plus_12_months,updrs_4_plus_24_months
0,3.0,6.0,31.0,0.0,6.0,4.0,5.0,11.0,7.0,2.0,25.0,19.0,11.0,1.0,10.0,0.0
1,17.0,13.0,30.0,0.0,15.0,14.0,15.0,11.0,8.0,6.0,25.0,26.0,26.0,0.0,0.0,0.0
2,14.0,8.0,26.0,0.0,15.0,15.0,14.0,9.0,6.0,6.0,35.0,26.0,41.0,0.0,0.0,4.0
3,15.0,6.0,26.0,0.0,14.0,14.0,13.0,11.0,6.0,9.0,29.0,41.0,18.0,5.0,4.0,0.0
4,6.0,1.0,21.0,0.0,10.0,7.0,2.0,6.0,3.0,2.0,30.0,18.0,21.0,0.0,0.0,0.0


In [19]:
# Let's try it again using the Iterative data!
# Since we've abstracted it into a function, we can reuse our earlier work
results = evaluate_model(iter_X, y)

# summarize performance
print('MAE: %.3f (%.3f)' % (mean(results), std(results)))

Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a

In [20]:
# Let's try it again using the KNN data!
# Since we've abstracted it into a function, we can reuse our earlier work
results = evaluate_model(knn_X, y)

# summarize performance
print('MAE: %.3f (%.3f)' % (mean(results), std(results)))

Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a model with 1195 inputs and 16 outputs.
>nan
Building a

## Again, no convergence
So I need to explore other models and hyperparameter tuning.

# Searching for Signal via Enhanced Data Processing

## Replacing NaNs with 0s gives signal!

So I can finally compare iterative and KNN imputation of the missing values.

Since they perform about equally, I also try adding Dense layers and dropout. More hyperparameter tuning will come later; this is just a sanity check to see whether performance improves dramatically with a larger model.

In [21]:
# Let's try replacing NANs with 0's
results = evaluate_model(iter_X.fillna(0), y.fillna(0), False, False)

# summarize performance
print('MAE: %.3f (%.3f)' % (mean(results), std(results)))

# SIGNAL!!!

Building a model with 1195 inputs and 16 outputs.
>6.102
Building a model with 1195 inputs and 16 outputs.
>4.979
Building a model with 1195 inputs and 16 outputs.
>6.073
Building a model with 1195 inputs and 16 outputs.
>5.992
Building a model with 1195 inputs and 16 outputs.
>5.575
Building a model with 1195 inputs and 16 outputs.
>6.367
Building a model with 1195 inputs and 16 outputs.
>5.633
Building a model with 1195 inputs and 16 outputs.
>5.657
Building a model with 1195 inputs and 16 outputs.
>5.616
Building a model with 1195 inputs and 16 outputs.
>6.001
Building a model with 1195 inputs and 16 outputs.
>5.915
Building a model with 1195 inputs and 16 outputs.
>5.666
Building a model with 1195 inputs and 16 outputs.
>5.603
Building a model with 1195 inputs and 16 outputs.
>6.240
Building a model with 1195 inputs and 16 outputs.
>5.701
Building a model with 1195 inputs and 16 outputs.
>5.695
Building a model with 1195 inputs and 16 outputs.
>5.458
Building a model with 1195 inpu

In [22]:
# Repeat, using dense models and dropout
results = evaluate_model(iter_X.fillna(0), y.fillna(0), True, True)

# summarize performance
print('MAE: %.3f (%.3f)' % (mean(results), std(results)))

Building a model with 1195 inputs and 16 outputs.
>6.063
Building a model with 1195 inputs and 16 outputs.
>4.966
Building a model with 1195 inputs and 16 outputs.
>6.073
Building a model with 1195 inputs and 16 outputs.
>5.988
Building a model with 1195 inputs and 16 outputs.
>5.622
Building a model with 1195 inputs and 16 outputs.
>6.344
Building a model with 1195 inputs and 16 outputs.
>5.675
Building a model with 1195 inputs and 16 outputs.
>5.675
Building a model with 1195 inputs and 16 outputs.
>5.448
Building a model with 1195 inputs and 16 outputs.
>6.010
Building a model with 1195 inputs and 16 outputs.
>5.932
Building a model with 1195 inputs and 16 outputs.
>5.601
Building a model with 1195 inputs and 16 outputs.
>5.575
Building a model with 1195 inputs and 16 outputs.
>6.287
Building a model with 1195 inputs and 16 outputs.
>5.780
Building a model with 1195 inputs and 16 outputs.
>5.650
Building a model with 1195 inputs and 16 outputs.
>5.445
Building a model with 1195 inpu

In [23]:
# Repeat with KNN
results = evaluate_model(knn_X.fillna(0), y.fillna(0), False, False)

# summarize performance
print('MAE: %.3f (%.3f)' % (mean(results), std(results)))

Building a model with 1195 inputs and 16 outputs.
>6.108
Building a model with 1195 inputs and 16 outputs.
>4.979
Building a model with 1195 inputs and 16 outputs.
>6.091
Building a model with 1195 inputs and 16 outputs.
>5.973
Building a model with 1195 inputs and 16 outputs.
>5.549
Building a model with 1195 inputs and 16 outputs.
>6.359
Building a model with 1195 inputs and 16 outputs.
>5.672
Building a model with 1195 inputs and 16 outputs.
>5.687
Building a model with 1195 inputs and 16 outputs.
>5.555
Building a model with 1195 inputs and 16 outputs.
>5.996
Building a model with 1195 inputs and 16 outputs.
>5.895
Building a model with 1195 inputs and 16 outputs.
>5.680
Building a model with 1195 inputs and 16 outputs.
>5.599
Building a model with 1195 inputs and 16 outputs.
>6.308
Building a model with 1195 inputs and 16 outputs.
>5.671
Building a model with 1195 inputs and 16 outputs.
>5.681
Building a model with 1195 inputs and 16 outputs.
>5.439
Building a model with 1195 inpu

In [24]:
# Repeat with KNN using dense layers and dropout
results = evaluate_model(knn_X.fillna(0), y.fillna(0), True, True)

# summarize performance
print('MAE: %.3f (%.3f)' % (mean(results), std(results)))

Building a model with 1195 inputs and 16 outputs.
>6.101
Building a model with 1195 inputs and 16 outputs.
>5.034
Building a model with 1195 inputs and 16 outputs.
>6.093
Building a model with 1195 inputs and 16 outputs.
>5.972
Building a model with 1195 inputs and 16 outputs.
>5.583
Building a model with 1195 inputs and 16 outputs.
>6.335
Building a model with 1195 inputs and 16 outputs.
>5.674
Building a model with 1195 inputs and 16 outputs.
>5.718
Building a model with 1195 inputs and 16 outputs.
>5.504
Building a model with 1195 inputs and 16 outputs.
>5.994
Building a model with 1195 inputs and 16 outputs.
>5.909
Building a model with 1195 inputs and 16 outputs.
>5.672
Building a model with 1195 inputs and 16 outputs.
>5.583
Building a model with 1195 inputs and 16 outputs.
>6.308
Building a model with 1195 inputs and 16 outputs.
>5.725
Building a model with 1195 inputs and 16 outputs.
>5.540
Building a model with 1195 inputs and 16 outputs.
>5.400
Building a model with 1195 inpu

## Finally, let's try with just protein data, and just peptide data

In [25]:
# Repeat, using Protein data on the Iterative imputed dataset
iter_X_prot = iter_X[protein_cols]
results = evaluate_model(iter_X_prot.fillna(0), y.fillna(0), True, True)

# summarize performance
print('MAE: %.3f (%.3f)' % (mean(results), std(results)))

Building a model with 227 inputs and 16 outputs.
>6.059
Building a model with 227 inputs and 16 outputs.
>4.930
Building a model with 227 inputs and 16 outputs.
>6.111
Building a model with 227 inputs and 16 outputs.
>5.999
Building a model with 227 inputs and 16 outputs.
>5.563
Building a model with 227 inputs and 16 outputs.
>6.357
Building a model with 227 inputs and 16 outputs.
>5.707
Building a model with 227 inputs and 16 outputs.
>5.684
Building a model with 227 inputs and 16 outputs.
>5.387
Building a model with 227 inputs and 16 outputs.
>5.986
Building a model with 227 inputs and 16 outputs.
>5.903
Building a model with 227 inputs and 16 outputs.
>5.748
Building a model with 227 inputs and 16 outputs.
>5.619
Building a model with 227 inputs and 16 outputs.
>6.296
Building a model with 227 inputs and 16 outputs.
>5.726
Building a model with 227 inputs and 16 outputs.
>5.683
Building a model with 227 inputs and 16 outputs.
>5.441
Building a model with 227 inputs and 16 outputs.

In [26]:
# Repeat, using Peptide data on the Iterative imputed dataset
iter_X_pep = iter_X[peptide_cols]
results = evaluate_model(iter_X_pep.fillna(0), y.fillna(0), True, True)

# summarize performance
print('MAE: %.3f (%.3f)' % (mean(results), std(results)))

Building a model with 968 inputs and 16 outputs.
>6.091
Building a model with 968 inputs and 16 outputs.
>4.853
Building a model with 968 inputs and 16 outputs.
>6.036
Building a model with 968 inputs and 16 outputs.
>5.920
Building a model with 968 inputs and 16 outputs.
>5.611
Building a model with 968 inputs and 16 outputs.
>6.296
Building a model with 968 inputs and 16 outputs.
>5.546
Building a model with 968 inputs and 16 outputs.
>5.636
Building a model with 968 inputs and 16 outputs.
>5.569
Building a model with 968 inputs and 16 outputs.
>6.065
Building a model with 968 inputs and 16 outputs.
>5.900
Building a model with 968 inputs and 16 outputs.
>5.662
Building a model with 968 inputs and 16 outputs.
>5.607
Building a model with 968 inputs and 16 outputs.
>6.280
Building a model with 968 inputs and 16 outputs.
>5.574
Building a model with 968 inputs and 16 outputs.
>5.587
Building a model with 968 inputs and 16 outputs.
>5.438
Building a model with 968 inputs and 16 outputs.

In [27]:
# Repeat, using Protein data on the KNN imputed dataset
knn_X_prot = knn_X[protein_cols]
results = evaluate_model(knn_X_prot.fillna(0), y.fillna(0), True, True)

# summarize performance
print('MAE: %.3f (%.3f)' % (mean(results), std(results)))

Building a model with 227 inputs and 16 outputs.
>6.048
Building a model with 227 inputs and 16 outputs.
>5.011
Building a model with 227 inputs and 16 outputs.
>6.082
Building a model with 227 inputs and 16 outputs.
>5.950
Building a model with 227 inputs and 16 outputs.
>5.534
Building a model with 227 inputs and 16 outputs.
>6.244
Building a model with 227 inputs and 16 outputs.
>5.519
Building a model with 227 inputs and 16 outputs.
>5.657
Building a model with 227 inputs and 16 outputs.
>5.629
Building a model with 227 inputs and 16 outputs.
>5.991
Building a model with 227 inputs and 16 outputs.
>5.980
Building a model with 227 inputs and 16 outputs.
>5.564
Building a model with 227 inputs and 16 outputs.
>5.617
Building a model with 227 inputs and 16 outputs.
>6.292
Building a model with 227 inputs and 16 outputs.
>5.753
Building a model with 227 inputs and 16 outputs.
>5.630
Building a model with 227 inputs and 16 outputs.
>5.432
Building a model with 227 inputs and 16 outputs.

In [28]:
# Repeat, using Peptide data on the KNN imputed dataset
knn_X_pep = knn_X[peptide_cols]
results = evaluate_model(knn_X_pep.fillna(0), y.fillna(0), True, True)

# summarize performance
print('MAE: %.3f (%.3f)' % (mean(results), std(results)))

Building a model with 968 inputs and 16 outputs.
>6.168
Building a model with 968 inputs and 16 outputs.
>4.983
Building a model with 968 inputs and 16 outputs.
>5.904
Building a model with 968 inputs and 16 outputs.
>5.894
Building a model with 968 inputs and 16 outputs.
>5.552
Building a model with 968 inputs and 16 outputs.
>6.255
Building a model with 968 inputs and 16 outputs.
>5.605
Building a model with 968 inputs and 16 outputs.
>5.622
Building a model with 968 inputs and 16 outputs.
>5.592
Building a model with 968 inputs and 16 outputs.
>6.016
Building a model with 968 inputs and 16 outputs.
>5.914
Building a model with 968 inputs and 16 outputs.
>5.680
Building a model with 968 inputs and 16 outputs.
>5.622
Building a model with 968 inputs and 16 outputs.
>6.265
Building a model with 968 inputs and 16 outputs.
>5.710
Building a model with 968 inputs and 16 outputs.
>5.682
Building a model with 968 inputs and 16 outputs.
>5.454
Building a model with 968 inputs and 16 outputs.

## NaNs and Additional Data Processing

For UPDRS_4 values, converting NaN to 0 feels appropriate: as I understand it, the 4th section of the test is only administered if deemed necessary, so if it was not administered, a score of 0 can be safely assumed.

However, where the visit was simply not conducted, a score of 0 isn't appropriate; it would be better to impute those values from their neighbors, or perhaps drop them. This section experiments with that.
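To make "impute from neighbors" concrete, here is a minimal pure-Python sketch of the idea behind KNN imputation with uniform weights. It is not scikit-learn's implementation; the function names (`nan_euclidean`, `knn_impute`) and the toy matrix are my own, chosen only for illustration.

```python
import math

def nan_euclidean(a, b):
    """Euclidean distance over coordinates present in both rows,
    rescaled to compensate for the coordinates that were skipped."""
    shared = [(x, y) for x, y in zip(a, b)
              if not (math.isnan(x) or math.isnan(y))]
    if not shared:
        return math.inf
    squared = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(squared * len(a) / len(shared))

def knn_impute(rows, n_neighbors=2):
    """Fill each NaN with the unweighted mean of that column, taken from
    the n_neighbors closest rows that have the column observed."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if not math.isnan(value):
                continue
            # Only rows with column j observed can donate a value.
            donors = [r for k, r in enumerate(rows)
                      if k != i and not math.isnan(r[j])]
            donors.sort(key=lambda r: nan_euclidean(row, r))
            nearest = donors[:n_neighbors]
            if nearest:  # leave the NaN in place if nobody can donate
                filled[i][j] = sum(r[j] for r in nearest) / len(nearest)
    return filled

nan = math.nan
toy = [[1.0, 2.0, nan],
       [3.0, 4.0, 3.0],
       [nan, 6.0, 5.0],
       [8.0, 8.0, 7.0]]
print(knn_impute(toy))
# [[1.0, 2.0, 4.0], [3.0, 4.0, 3.0], [5.5, 6.0, 5.0], [8.0, 8.0, 7.0]]
```

Distances are always computed on the original rows, so a value filled in for one cell never influences the neighbors chosen for another.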

In [29]:
# Fill NaN UPDRS_4 values with 0
updrs_4_cols = [
    'updrs_4_plus_0_months', 'updrs_4_plus_6_months',
    'updrs_4_plus_12_months', 'updrs_4_plus_24_months',
]
print(y.isna().sum().sum())  # total NaN count before filling
y[updrs_4_cols] = y[updrs_4_cols].fillna(value=0)
print(y.isna().sum().sum())  # total NaN count after filling
# The count drops from 4510 to 2462, so 2048 UPDRS_4 values were NaN and are now 0
y.head()

4510
2462


Unnamed: 0,updrs_1_plus_0_months,updrs_2_plus_0_months,updrs_3_plus_0_months,updrs_4_plus_0_months,updrs_1_plus_6_months,updrs_1_plus_12_months,updrs_1_plus_24_months,updrs_2_plus_6_months,updrs_2_plus_12_months,updrs_2_plus_24_months,updrs_3_plus_6_months,updrs_3_plus_12_months,updrs_3_plus_24_months,updrs_4_plus_6_months,updrs_4_plus_12_months,updrs_4_plus_24_months
0,3.0,0.0,13.0,0.0,2.0,4.0,4.0,4.0,2.0,3.0,5.0,8.0,0.0,0.0,0.0,0.0
1,4.0,2.0,8.0,0.0,2.0,4.0,,2.0,3.0,,0.0,0.0,,0.0,0.0,0.0
2,2.0,2.0,0.0,0.0,4.0,,,3.0,,,0.0,,,0.0,0.0,0.0
3,3.0,6.0,31.0,0.0,6.0,4.0,5.0,11.0,7.0,2.0,25.0,19.0,11.0,1.0,10.0,0.0
4,4.0,7.0,19.0,10.0,10.0,5.0,,10.0,2.0,,20.0,11.0,,5.0,0.0,0.0


In [30]:
%%time
print(y.isna().sum().sum())
iter_imputer = IterativeImputer(
    max_iter=10,
    n_nearest_features=25,
    imputation_order='ascending',
    random_state=0,
    verbose=1,
)

# fit the imputer
y_iter_array = iter_imputer.fit_transform(y)
y_iter = pd.DataFrame(
    data=y_iter_array,
    columns=y.columns,
    index=y.index,
)

# save to parquet
y_iter.to_parquet(
    'clinical_data_iterative_impute.parquet',
)
print(y_iter.isna().sum().sum())

2462
[IterativeImputer] Completing matrix with shape (1068, 16)
[IterativeImputer] Change: 104.40097674228682, scaled tolerance: 0.08600000000000001 
[IterativeImputer] Change: 37.69813334271678, scaled tolerance: 0.08600000000000001 
[IterativeImputer] Change: 22.484742356633873, scaled tolerance: 0.08600000000000001 
[IterativeImputer] Change: 15.2116153838379, scaled tolerance: 0.08600000000000001 
[IterativeImputer] Change: 10.766832527873715, scaled tolerance: 0.08600000000000001 
[IterativeImputer] Change: 7.814567505924514, scaled tolerance: 0.08600000000000001 
[IterativeImputer] Change: 5.719819802433758, scaled tolerance: 0.08600000000000001 
[IterativeImputer] Change: 4.189443068809356, scaled tolerance: 0.08600000000000001 
[IterativeImputer] Change: 3.06973956559695, scaled tolerance: 0.08600000000000001 
[IterativeImputer] Change: 2.2497384569014054, scaled tolerance: 0.08600000000000001 
0
CPU times: user 1.6 s, sys: 1.35 s, total: 2.95 s
Wall time: 771 ms




In [31]:
%%time
print(y.isna().sum().sum())
knn_imputer = KNNImputer(
    weights='uniform',
    n_neighbors=10,
)

# fit the imputer
y_knn_array = knn_imputer.fit_transform(y)
y_knn = pd.DataFrame(
    data=y_knn_array,
    columns=y.columns,
    index=y.index,
)

# save to parquet
y_knn.to_parquet(
    'clinical_data_KNN_impute.parquet',
)
print(y_knn.isna().sum().sum())

2462
0
CPU times: user 260 ms, sys: 209 ms, total: 469 ms
Wall time: 161 ms


In [32]:
# Repeat using Iterative Imputed Clinical Data and Imputed Protein/Peptide
print(iter_X.isna().sum().sum())
print(y_iter.isna().sum().sum())

results = evaluate_model(iter_X, y_iter, True, True)

# summarize performance
print('MAE: %.3f (%.3f)' % (mean(results), std(results)))

0
0
Building a model with 1195 inputs and 16 outputs.
>5.871
Building a model with 1195 inputs and 16 outputs.
>5.295
Building a model with 1195 inputs and 16 outputs.
>5.595
Building a model with 1195 inputs and 16 outputs.
>5.612
Building a model with 1195 inputs and 16 outputs.
>5.342
Building a model with 1195 inputs and 16 outputs.
>6.078
Building a model with 1195 inputs and 16 outputs.
>5.724
Building a model with 1195 inputs and 16 outputs.
>5.860
Building a model with 1195 inputs and 16 outputs.
>5.778
Building a model with 1195 inputs and 16 outputs.
>5.798
Building a model with 1195 inputs and 16 outputs.
>5.908
Building a model with 1195 inputs and 16 outputs.
>5.590
Building a model with 1195 inputs and 16 outputs.
>5.672
Building a model with 1195 inputs and 16 outputs.
>6.017
Building a model with 1195 inputs and 16 outputs.
>5.481
Building a model with 1195 inputs and 16 outputs.
>5.423
Building a model with 1195 inputs and 16 outputs.
>5.339
Building a model with 1195 

In [33]:
# Repeat using KNN Imputed Clinical Data and KNN Imputed Protein/Peptide
print(knn_X.isna().sum().sum())
print(y_knn.isna().sum().sum())

results = evaluate_model(knn_X, y_knn, True, True)

# summarize performance
print('MAE: %.3f (%.3f)' % (mean(results), std(results)))

0
0
Building a model with 1195 inputs and 16 outputs.
>5.957
Building a model with 1195 inputs and 16 outputs.
>5.174
Building a model with 1195 inputs and 16 outputs.
>5.515
Building a model with 1195 inputs and 16 outputs.
>5.648
Building a model with 1195 inputs and 16 outputs.
>5.268
Building a model with 1195 inputs and 16 outputs.
>6.095
Building a model with 1195 inputs and 16 outputs.
>5.721
Building a model with 1195 inputs and 16 outputs.
>5.859
Building a model with 1195 inputs and 16 outputs.
>5.699
Building a model with 1195 inputs and 16 outputs.
>5.788
Building a model with 1195 inputs and 16 outputs.
>5.877
Building a model with 1195 inputs and 16 outputs.
>5.594
Building a model with 1195 inputs and 16 outputs.
>5.593
Building a model with 1195 inputs and 16 outputs.
>6.039
Building a model with 1195 inputs and 16 outputs.
>5.405
Building a model with 1195 inputs and 16 outputs.
>5.442
Building a model with 1195 inputs and 16 outputs.
>5.184
Building a model with 1195 

### Summary of results:

| Method | MAE (mean) | MAE (std) |
| ------ | ---------- | --------- |
| Iterative - Simple model | 5.792 | 0.301 |
| KNN - Simple model | 5.786 | 0.307 |
| Iterative - Dense + Dropout model | 5.760 | 0.302 |
| KNN - Dense + Dropout model | 5.767 | 0.313 |
| Iterative - Protein data | 5.785 | 0.299 |
| KNN - Protein data | 5.787 | 0.307 |
| Iterative - Peptide data | 5.772 | 0.298 |
| KNN - Peptide data | 5.762 | 0.308 |
| Iterative - Clinical NaNs imputed | 5.697 | 0.271 |
| KNN - Clinical NaNs imputed | 5.658 | 0.255 |

I'm not seeing significant differences among these, but there are a few small lessons:
 * Imputing missing UPDRS scores gave a marginal improvement in MAE (5.658 to 5.697, versus 5.760 to 5.792 for the other runs). Note that those two runs also pass the imputed targets to `evaluate_model`, so their MAE may not be strictly comparable with the zero-filled runs.
 * The protein-only and peptide-only runs scored about the same as the full feature set, so neither subset is particularly favorable on its own; feature selection may still yield further improvements.
 * Iterative and KNN imputation performed about equally.

I will move forward with KNN imputation. It's not significantly superior, but it does seem to perform equal to or very slightly better than iterative imputation.
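One rough way to read the summary table is to compare the gap between the best and worst mean MAE with the typical spread across folds. The sketch below does that with the numbers copied from the table; the `summary` dict and variable names are my own, and comparing a gap of means against a fold standard deviation is only an informal sanity check, not a significance test.

```python
import statistics

# (mean MAE, std of fold MAEs), copied from the summary table
summary = {
    "Iterative - Simple model": (5.792, 0.301),
    "KNN - Simple model": (5.786, 0.307),
    "Iterative - Dense + Dropout model": (5.760, 0.302),
    "KNN - Dense + Dropout model": (5.767, 0.313),
    "Iterative - Protein data": (5.785, 0.299),
    "KNN - Protein data": (5.787, 0.307),
    "Iterative - Peptide data": (5.772, 0.298),
    "KNN - Peptide data": (5.762, 0.308),
    "Iterative - Clinical NaNs imputed": (5.697, 0.271),
    "KNN - Clinical NaNs imputed": (5.658, 0.255),
}

# Lower MAE is better.
best = min(summary, key=lambda name: summary[name][0])
worst = max(summary, key=lambda name: summary[name][0])
gap = summary[worst][0] - summary[best][0]
typical_std = statistics.mean(std for _, std in summary.values())

print(f"best:  {best} ({summary[best][0]:.3f})")
print(f"worst: {worst} ({summary[worst][0]:.3f})")
print(f"gap between best and worst mean MAE: {gap:.3f}")
print(f"average fold std: {typical_std:.3f}")
```

Here the gap between the best and worst mean is 0.134, while the fold standard deviations average about 0.30, which is consistent with reading the differences as small.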

# Hyperparameter Tuning


## Clinical Data Model
First, let's tune a model on just the UPDRS and protein/peptide data, so we have a baseline for judging whether reusing the supplemental data's base layers improves it or not.
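As a starting point, tuning could be as simple as an exhaustive grid search that scores every combination and keeps the lowest mean MAE. The sketch below shows only that loop. The hyperparameter names (`n_hidden`, `dropout_rate`) and the `toy_score` function are hypothetical stand-ins: in practice the scoring function would build a model with the given settings and return its mean cross-validated MAE.

```python
from itertools import product

def grid_search(score_fn, grid):
    """Score every combination in `grid` and return
    (best_params, best_score, all_results); lower scores are better."""
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((score_fn(**params), params))
    best_score, best_params = min(results, key=lambda r: r[0])
    return best_params, best_score, results

# Hypothetical search space, for illustration only.
grid = {
    "n_hidden": [32, 64, 128],
    "dropout_rate": [0.0, 0.2, 0.5],
}

def toy_score(n_hidden, dropout_rate):
    # Stand-in for a mean cross-validated MAE; replace with a real
    # train-and-evaluate call before drawing any conclusions.
    return abs(n_hidden - 64) / 64 + abs(dropout_rate - 0.2)

best_params, best_score, results = grid_search(toy_score, grid)
print(f"tried {len(results)} combinations")
print(f"best params: {best_params}, score: {best_score:.3f}")
```

Keeping the full `results` list, rather than only the winner, makes it easy to check whether the best setting is clearly ahead or within fold-to-fold noise of its neighbors.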

## Supplemental Data For Model Reuse

Now that we've got our supplemental data working, let's get that model trained and tuned so we can reuse it with our UPDRS data.

# Sources Used & Further Resources

### Data processing
 * https://www.kaggle.com/code/xaviernogueira/pre-processing-making-labels-and-imputation#Add-supplemental-clinical-data-to-the-main-clininical-data

### Model reuse of base layers
 * <Placeholder>

### Predicting multiple outputs with one model
 * https://pub.towardsai.net/how-to-predict-multiple-variables-with-one-model-and-why-31e6a8efc09e
 * https://machinelearningmastery.com/deep-learning-models-for-multi-output-regression/

### Fixes to bugs
 * https://stackoverflow.com/a/64691642