# For each patient visit where a protein/peptide sample was taken you will estimate future UPDRS scores:
 * ## for that visit (month 0)
 * ## for 6 months later
 * ## for 12 months later
 * ## for 24 months later
 
Predictions for any visits that didn't ultimately take place are ignored.
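Since visits that never happened are ignored at scoring time, it seems reasonable to ignore them during evaluation as well. Below is a minimal sketch of that idea, assuming a missing label is stored as `NaN`. The helper `masked_mae` is a hypothetical name, and mean absolute error is only a stand-in here, not necessarily the competition's official metric.

```python
import numpy as np


def masked_mae(y_true, y_pred):
    """Mean absolute error over entries that have a label.

    A NaN in y_true is assumed to mean "this visit never took place",
    so the matching prediction is left out of the average.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = ~np.isnan(y_true)
    if not mask.any():
        return float("nan")
    return float(np.abs(y_true[mask] - y_pred[mask]).mean())


# Toy values: the second target of the first row has no label
y_true = np.array([[10.0, np.nan], [8.0, 12.0]])
y_pred = np.array([[12.0, 99.0], [8.0, 9.0]])
score = masked_mae(y_true, y_pred)
print(score)
```

The prediction of 99.0 has no effect on the result because its label is missing.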

## This project has multiple fascinating features and challenges, including:
 - Reusing base layers of another model I also train
 - Multi-output models (16 outputs!)

## Approach

 1. **Ingest and format data**
 2. **Pre-train a multi-output model using `supplemental_clinical_data.csv`**
 
 Since there is no protein/peptide information associated with the supplemental clinical data, we can use it to train a model on the typical progression of Parkinson's disease (as measured by UPDRS scores). That model predicts scores given only the time since the initial visit. This gives us base layers that we can reuse when predicting progression based on proteins/peptides.
 3. **Train a multi-output model using `train_clinical_data`**
 
 Our goal in both steps 2 and 3 is to train a single model that can predict UPDRS scores at months 0, 6, 12, and 24. I'll use a multi-output approach: the UPDRS has 4 parts and each part is predicted at 4 time horizons, for a total of 16 outputs!
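To make the 16 outputs concrete, here is a small sketch that enumerates them. The column naming follows the `updrs_<part>_plus_<months>_months` pattern produced later in the data processing step; the constant names `UPDRS_PARTS` and `HORIZONS_MONTHS` are my own for illustration.

```python
from itertools import product

UPDRS_PARTS = (1, 2, 3, 4)         # the four UPDRS parts in the clinical data
HORIZONS_MONTHS = (0, 6, 12, 24)   # prediction horizons relative to the visit

# One output per (part, horizon) pair
target_cols = [
    f"updrs_{part}_plus_{horizon}_months"
    for part, horizon in product(UPDRS_PARTS, HORIZONS_MONTHS)
]

print(len(target_cols))
print(target_cols[:4])
```

A multi-output model then has one output unit per entry in `target_cols`.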

## Data Ingestion

In [1]:
import numpy as np
import pandas as pd
import sklearn
from pathlib import Path

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/amp-parkinsons-disease-progression-prediction/train_proteins.csv
/kaggle/input/amp-parkinsons-disease-progression-prediction/train_clinical_data.csv
/kaggle/input/amp-parkinsons-disease-progression-prediction/public_timeseries_testing_util.py
/kaggle/input/amp-parkinsons-disease-progression-prediction/supplemental_clinical_data.csv
/kaggle/input/amp-parkinsons-disease-progression-prediction/train_peptides.csv
/kaggle/input/amp-parkinsons-disease-progression-prediction/amp_pd_peptide/competition.cpython-37m-x86_64-linux-gnu.so
/kaggle/input/amp-parkinsons-disease-progression-prediction/amp_pd_peptide/__init__.py
/kaggle/input/amp-parkinsons-disease-progression-prediction/example_test_files/sample_submission.csv
/kaggle/input/amp-parkinsons-disease-progression-prediction/example_test_files/test_proteins.csv
/kaggle/input/amp-parkinsons-disease-progression-prediction/example_test_files/test_peptides.csv
/kaggle/input/amp-parkinsons-disease-progression-prediction/example_test

In [2]:
# Pull in all clinical data

INPUTS_DIR = Path('/kaggle/input/amp-parkinsons-disease-progression-prediction/') 

# train_clinical_data.csv
clinical_df = pd.read_csv(
    INPUTS_DIR / 'train_clinical_data.csv',
    index_col='visit_id',
)
print(clinical_df.shape)
print(clinical_df.head(n=10))

# supplemental_clinical_data.csv
supp_clinical_df = pd.read_csv(
    INPUTS_DIR / 'supplemental_clinical_data.csv',
    index_col='visit_id',
)
print(supp_clinical_df.shape)
print(supp_clinical_df.head(n=10))

# concat both to a single dataframe
all_clinical_df = pd.concat([clinical_df, supp_clinical_df])

print("\nUnion of both clinical data sets:")
print(all_clinical_df.shape)

(2615, 7)
          patient_id  visit_month  updrs_1  updrs_2  updrs_3  updrs_4  \
visit_id                                                                
55_0              55            0     10.0      6.0     15.0      NaN   
55_3              55            3     10.0      7.0     25.0      NaN   
55_6              55            6      8.0     10.0     34.0      NaN   
55_9              55            9      8.0      9.0     30.0      0.0   
55_12             55           12     10.0     10.0     41.0      0.0   
55_18             55           18      7.0     13.0     38.0      0.0   
55_24             55           24     16.0      9.0     49.0      0.0   
55_30             55           30     14.0     13.0     49.0      0.0   
55_36             55           36     17.0     18.0     51.0      0.0   
55_42             55           42     12.0     20.0     41.0      0.0   

         upd23b_clinical_state_on_medication  
visit_id                                      
55_0               

In [3]:
# Pull in protein data
proteins_df = pd.read_csv(
    INPUTS_DIR / 'train_proteins.csv',
)
print(proteins_df.shape)
proteins_df.head(n=10)

# Pull in peptide data
peptide_df = pd.read_csv(
    INPUTS_DIR / 'train_peptides.csv',
)
print(peptide_df.shape)
peptide_df.head(n=10)

(232741, 5)
(981834, 6)


Unnamed: 0,visit_id,visit_month,patient_id,UniProt,Peptide,PeptideAbundance
0,55_0,0,55,O00391,NEQEQPLGQWHLS,11254.3
1,55_0,0,55,O00533,GNPEPTFSWTK,102060.0
2,55_0,0,55,O00533,IEIPSSVQQVPTIIK,174185.0
3,55_0,0,55,O00533,KPQSAVYSTGSNGILLC(UniMod_4)EAEGEPQPTIK,27278.9
4,55_0,0,55,O00533,SMEQNGPGLEYR,30838.7
5,55_0,0,55,O00533,TLKIENVSYQDKGNYR,23216.5
6,55_0,0,55,O00533,VIAVNEVGR,170878.0
7,55_0,0,55,O00533,VMTPAVYAPYDVK,148771.0
8,55_0,0,55,O00533,VNGSPVDNHPFAGDVVFPR,55202.1
9,55_0,0,55,O00584,ELDLNSVLLK,27229.3


## Data Exploration

In [4]:
# sort by patient id and month
clinical_df = clinical_df.sort_values(
    ['patient_id', 'visit_month'],
)
supp_clinical_df = supp_clinical_df.sort_values(
    ['patient_id', 'visit_month'],
)
all_clinical_df = all_clinical_df.sort_values(
    ['patient_id', 'visit_month'],
)
all_clinical_df.head()

# get all unique patient ids (already sorted)
clin_patient_ids = clinical_df['patient_id'].unique()
supp_patient_ids = supp_clinical_df['patient_id'].unique()
patient_ids = all_clinical_df['patient_id'].unique()

# Verify that no patient appears in both datasets, i.e. no supplemental patient has protein/peptide data
np.intersect1d(clin_patient_ids, supp_patient_ids)



array([], dtype=int64)

## Data Processing
This is (with only minor modifications) from [this work by Xavier R Nogueira](https://www.kaggle.com/code/xaviernogueira/pre-processing-making-labels-and-imputation#Add-supplemental-clinical-data-to-the-main-clininical-data) released under the [Apache 2.0](http://www.apache.org/licenses/LICENSE-2.0) open source license.

## Clinical Data Processing
The main change I made was to wrap the processing in a reusable function, so that I could process the `train_clinical_data` and the `supplemental_clinical_data` separately. Since the `supplemental_clinical_data` doesn't have protein/peptide information, I don't think it should be grouped in with the labeled data.

Instead, I'll use the `supplemental_clinical_data` to pretrain the base layers of my multi-output model on typical Parkinson's disease progression.

In [5]:
def process_clinical_data(clinical_data, parquet_outfile_name):
    # rename the same-visit UPDRS columns, e.g. updrs_1 -> updrs_1_plus_0_months
    clinical_data = clinical_data.rename(
        columns={
            col: col + '_plus_0_months'
            for col in clinical_data.columns
            if 'updrs' in col
        },
    )

    # sort by patient id and month
    clinical_data = clinical_data.sort_values(
        ['patient_id', 'visit_month'],
    )
    clinical_data.head()

    # get all unique patient ids (already sorted)
    patient_ids = clinical_data['patient_id'].unique()

    # get all unique visit_months and sort
    months = clinical_data.visit_month.unique()
    months.sort()
    evenly_spaced_months = np.arange(months[0], months[-1] + 1)

    # create a new multi index with all unique patient id evenly spaced months
    multi_index = pd.MultiIndex.from_product(
        [patient_ids, evenly_spaced_months],
        names=['patient_id', 'visit_month'],
    )
    print(len(multi_index))

    # convert into dataframe multiindex
    multi_index = pd.DataFrame(
        index=multi_index,
    )

    multi_index.head(n=5)

    # join full clinical data
    clinical_data = multi_index.join(
        clinical_data.reset_index().set_index(['patient_id', 'visit_month']),
        how='left',
    )
    clinical_data.head()

    # shift data forward for our labels while grouping by patient_id
    # (the grid has one row per month, so shifting by -n looks n months ahead)
    for col in [i for i in clinical_data.columns if 'plus_0_months' in i]:
        for horizon in (6, 12, 24):
            future_col = col.replace('plus_0_months', f'plus_{horizon}_months')
            clinical_data[future_col] = clinical_data.groupby('patient_id')[col].shift(-horizon)

    # drop all rows where visit_id is NaN
    clinical_data = clinical_data.loc[clinical_data.visit_id.notna()]

    # reset index back to visit_id
    clinical_data = (
        clinical_data
        .reset_index()
        .set_index('visit_id')
    )
    clinical_data.head()
    
    # save labels data frame as a parquet file (without non labels columns)
    clinical_data.drop(
        columns=['patient_id', 'visit_month', 'upd23b_clinical_state_on_medication'],
    ).to_parquet(
        parquet_outfile_name,
    )
    
    return clinical_data

In [6]:
# Clinical data post-processing, saving labels to a parquet file
clinical_df_pp = process_clinical_data(clinical_df, 'clinical_data_training_labels.parquet')

# Supplemental clinical data post-processing, saving labels to a parquet file
supp_clinical_df_pp = process_clinical_data(supp_clinical_df, 'supp_clinical_data_training_labels.parquet')

27032
28527
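
The label construction in `process_clinical_data` is easier to follow on a toy example. The sketch below uses made-up scores for two hypothetical patients and only one UPDRS column and one horizon. It builds the same dense patient-by-month grid, shifts within each patient, and then keeps only the rows that were real visits. Unlike the function above, it filters on the score column because the toy frame has no `visit_id`.

```python
import numpy as np
import pandas as pd

# Toy data: patient 2 has no month-6 visit
toy = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "visit_month": [0, 6, 12, 0, 12],
    "updrs_1_plus_0_months": [10.0, 8.0, 11.0, 5.0, 7.0],
})

# Dense grid with one row per patient per month
months = np.arange(toy["visit_month"].min(), toy["visit_month"].max() + 1)
grid = pd.MultiIndex.from_product(
    [toy["patient_id"].unique(), months],
    names=["patient_id", "visit_month"],
)
dense = toy.set_index(["patient_id", "visit_month"]).reindex(grid)

# With one row per month, shifting by -6 looks exactly 6 months ahead
dense["updrs_1_plus_6_months"] = (
    dense.groupby(level="patient_id")["updrs_1_plus_0_months"].shift(-6)
)

# Keep only rows that correspond to real visits
labels = dense.dropna(subset=["updrs_1_plus_0_months"])
print(labels)
```

Patient 1's month-0 row picks up the month-6 score as its 6-month label, while patient 2's month-0 row gets `NaN` because that visit never happened.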


## Protein/Peptide data processing

In [7]:
proteins_df = proteins_df.pivot(
    index='visit_id',
    columns='UniProt',
    values='NPX',
)
print(proteins_df.index.unique().shape)
print(proteins_df.shape)
proteins_df.head(n=5)


peptide_df = peptide_df.pivot(
    index='visit_id',
    columns='Peptide',
    values='PeptideAbundance',
)
print(peptide_df.index.unique().shape)
print(peptide_df.shape)
peptide_df.head(n=5)

(1113,)
(1113, 227)
(1113,)
(1113, 968)


visit_id,AADDTWEPFASGK,AAFGQGSGPIMLDEVQC(UniMod_4)TGTEASLADC(UniMod_4)K,AAFTEC(UniMod_4)C(UniMod_4)QAADK,AANEVSSADVK,AATGEC(UniMod_4)TATVGKR,AATVGSLAGQPLQER,AAVYHHFISDGVR,ADDKETC(UniMod_4)FAEEGK,ADDKETC(UniMod_4)FAEEGKK,ADDLGKGGNEESTKTGNAGSR,...,YSLTYIYTGLSK,YTTEIIK,YVGGQEHFAHLLILR,YVM(UniMod_35)LPVADQDQC(UniMod_4)IR,YVMLPVADQDQC(UniMod_4)IR,YVNKEIQNAVNGVK,YWGVASFLQK,YYC(UniMod_4)FQGNQFLR,YYTYLIMNK,YYWGGQYTWDMAK
10053_0,6580710.0,31204.4,7735070.0,,,,46620.3,236144.0,,,...,202274.0,,4401830.0,77482.6,583075.0,76705.7,104260.0,530223.0,,7207.3
10053_12,6333510.0,52277.6,5394390.0,,,,57554.5,108298.0,45885.4,,...,201009.0,,5001750.0,36745.3,355643.0,92078.1,123254.0,453883.0,49281.9,25332.8
10053_18,7129640.0,61522.0,7011920.0,35984.7,17188.0,19787.3,36029.4,708729.0,5067790.0,30838.2,...,220728.0,,5424380.0,39016.0,496021.0,63203.6,128336.0,447505.0,52389.1,21235.7
10138_12,7404780.0,46107.2,10610900.0,,20910.2,66662.3,55253.9,79575.5,6201210.0,26720.0,...,188362.0,9433.71,3900280.0,48210.3,328482.0,89822.1,129964.0,552232.0,65657.8,9876.98
10138_24,13788300.0,56910.3,6906160.0,13785.5,11004.2,63672.7,36819.8,34160.9,2117430.0,15645.2,...,206187.0,6365.15,3521800.0,69984.6,496737.0,80919.3,111799.0,,56977.6,4903.09
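
For readers less familiar with `pivot`, here is a tiny long-to-wide sketch on a toy frame with the same column names as `train_peptides.csv`. The `55_6` row is made up for illustration. Any peptide that was not measured at a visit becomes `NaN` in the wide table, which is where the missing values imputed later come from.

```python
import pandas as pd

# Toy long-format data: one row per (visit, peptide) measurement
long_df = pd.DataFrame({
    "visit_id": ["55_0", "55_0", "55_6"],
    "Peptide": ["VIAVNEVGR", "ELDLNSVLLK", "VIAVNEVGR"],
    "PeptideAbundance": [170878.0, 27229.3, 165000.0],
})

# Wide format: one row per visit, one column per peptide
wide_df = long_df.pivot(
    index="visit_id",
    columns="Peptide",
    values="PeptideAbundance",
)
print(wide_df)
```

Note that `pivot` raises a `ValueError` if the same (index, column) pair appears more than once, so it also doubles as a check that each peptide is reported once per visit.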


In [8]:
# keep track of our protein / peptide columns
protein_cols = list(proteins_df.columns)
peptide_cols = list(peptide_df.columns)

# join the protein / peptide data
prot_and_peps_df = pd.concat(
    [proteins_df, peptide_df],
    axis=1,
)
prot_and_peps_df.head()

visit_id,O00391,O00533,O00584,O14498,O14773,O14791,O15240,O15394,O43505,O60888,...,YSLTYIYTGLSK,YTTEIIK,YVGGQEHFAHLLILR,YVM(UniMod_35)LPVADQDQC(UniMod_4)IR,YVMLPVADQDQC(UniMod_4)IR,YVNKEIQNAVNGVK,YWGVASFLQK,YYC(UniMod_4)FQGNQFLR,YYTYLIMNK,YYWGGQYTWDMAK
10053_0,9104.27,402321.0,,,7150.57,2497.84,83002.9,15113.6,167327.0,129048.0,...,202274.0,,4401830.0,77482.6,583075.0,76705.7,104260.0,530223.0,,7207.3
10053_12,10464.2,435586.0,,,,,197117.0,15099.1,164268.0,108114.0,...,201009.0,,5001750.0,36745.3,355643.0,92078.1,123254.0,453883.0,49281.9,25332.8
10053_18,13235.7,507386.0,7126.96,24525.7,,2372.71,126506.0,16289.6,168107.0,163776.0,...,220728.0,,5424380.0,39016.0,496021.0,63203.6,128336.0,447505.0,52389.1,21235.7
10138_12,12600.2,494581.0,9165.06,27193.5,22506.1,6015.9,156313.0,54546.4,204013.0,56725.0,...,188362.0,9433.71,3900280.0,48210.3,328482.0,89822.1,129964.0,552232.0,65657.8,9876.98
10138_24,12003.2,522138.0,4498.51,17189.8,29112.4,2665.15,151169.0,52338.1,240892.0,85767.1,...,206187.0,6365.15,3521800.0,69984.6,496737.0,80919.3,111799.0,,56977.6,4903.09
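
One detail worth knowing about `pd.concat(..., axis=1)`: rows are aligned on the index (here `visit_id`), and by default the result keeps the union of both indexes. In this notebook both tables have the same 1113 visits, so nothing is lost or padded. The toy sketch below deliberately uses partially overlapping visits to show what would happen otherwise; the frame names are hypothetical.

```python
import pandas as pd

proteins_toy = pd.DataFrame(
    {"O00391": [9104.27, 10464.2]},
    index=pd.Index(["10053_0", "10053_12"], name="visit_id"),
)
peptides_toy = pd.DataFrame(
    {"YYTYLIMNK": [49281.9, 52389.1]},
    index=pd.Index(["10053_12", "10053_18"], name="visit_id"),
)

# Column-wise concat aligns on visit_id and keeps every visit from either frame
combined = pd.concat([proteins_toy, peptides_toy], axis=1)
print(combined)
```

Visits that appear in only one table are kept, with `NaN` in the columns from the other table.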


## Use regression to impute missing protein/peptide values

The author mentioned above uses `IterativeImputer` and `KNNImputer`, and I follow the same approach here. With time I may try other approaches.

In [9]:
# enable experimental imputer
from sklearn.experimental import enable_iterative_imputer

# import our imputation algos
from sklearn.impute import (
    IterativeImputer,
    KNNImputer,
    MissingIndicator,
)

In [10]:
%%time
iter_imputer = IterativeImputer(
    max_iter=10,
    n_nearest_features=25,
    imputation_order='ascending',
    random_state=0,
    verbose=1,
)

# fit the imputer
iter_prot_and_peps_array = iter_imputer.fit_transform(prot_and_peps_df)
iter_prot_and_peps_df = pd.DataFrame(
    data=iter_prot_and_peps_array,
    columns=prot_and_peps_df.columns,
    index=prot_and_peps_df.index,
)

# save to parquet
iter_prot_and_peps_df[protein_cols].to_parquet(
    'protein_data_iterative_impute.parquet',
)

iter_prot_and_peps_df[peptide_cols].to_parquet(
    'peptide_data_iterative_impute.parquet',
)

# Saving the data together makes more sense to me
# (write the full frame, not just the protein columns)
iter_prot_and_peps_df.to_parquet(
    'protein_peptide_data_iterative_impute.parquet',
)

[IterativeImputer] Completing matrix with shape (1113, 1195)
[IterativeImputer] Change: 130567040.68250778, scaled tolerance: 613851.0 
[IterativeImputer] Change: 77675980.52362576, scaled tolerance: 613851.0 
[IterativeImputer] Change: 78590208.8506695, scaled tolerance: 613851.0 
[IterativeImputer] Change: 68482566.33326018, scaled tolerance: 613851.0 
[IterativeImputer] Change: 58949685.60739637, scaled tolerance: 613851.0 
[IterativeImputer] Change: 49104473.013554275, scaled tolerance: 613851.0 
[IterativeImputer] Change: 53711002.099543095, scaled tolerance: 613851.0 
[IterativeImputer] Change: 4.3106789405883556e+19, scaled tolerance: 613851.0 
[IterativeImputer] Change: 4.5373704590897545e+19, scaled tolerance: 613851.0 
[IterativeImputer] Change: 9.571200423027006e+19, scaled tolerance: 613851.0 




CPU times: user 3min 25s, sys: 2min 39s, total: 6min 4s
Wall time: 1min 33s


In [11]:
%%time
knn_imputer = KNNImputer(
    weights='uniform',
    n_neighbors=10,
)

# fit the imputer
knn_prot_and_peps_array = knn_imputer.fit_transform(prot_and_peps_df)
knn_prot_and_peps_df = pd.DataFrame(
    data=knn_prot_and_peps_array,
    columns=prot_and_peps_df.columns,
    index=prot_and_peps_df.index,
)

# save to parquet
knn_prot_and_peps_df[protein_cols].to_parquet(
    'protein_data_KNN_impute.parquet',
)

knn_prot_and_peps_df[peptide_cols].to_parquet(
    'peptide_data_KNN_impute.parquet',
)

# Saving the data together makes more sense to me
knn_prot_and_peps_df.to_parquet(
    'protein_peptide_data_KNN_impute.parquet',
)

CPU times: user 11.8 s, sys: 426 ms, total: 12.3 s
Wall time: 11.3 s
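
To see what `KNNImputer` does on a scale small enough to check by hand, here is a sketch on a toy 4x3 matrix (the values are arbitrary and unrelated to the protein data). Each missing entry is filled with the mean of that feature over the nearest rows that have it, with distances computed only over features observed in both rows.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with two missing entries
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_filled = imputer.fit_transform(X)

# Observed entries are left untouched; only the NaNs are replaced
observed = ~np.isnan(X)
print(X_filled)
```

With `weights="uniform"`, as in the cell above, every neighbor counts equally. `weights="distance"` would give closer rows more influence.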


# Sources Used & Further Resources

### Data processing
 * https://www.kaggle.com/code/xaviernogueira/pre-processing-making-labels-and-imputation#Add-supplemental-clinical-data-to-the-main-clininical-data

### Model reuse of base layers
 * <Placeholder>

### Predicting multiple outputs with one model
* https://pub.towardsai.net/how-to-predict-multiple-variables-with-one-model-and-why-31e6a8efc09e
* https://machinelearningmastery.com/deep-learning-models-for-multi-output-regression/