# For each patient visit where a protein/peptide sample was taken, you will estimate current and future UPDRS scores:
 * ## for that visit (month 0)
 * ## for 6 months later
 * ## for 12 months later
 * ## for 24 months later
 
Predictions for any visits that didn't ultimately take place are ignored.
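
One way to picture "ignored" predictions: if a missing visit is stored as `NaN` in the label array, an error measure can simply skip those entries. The sketch below uses a masked mean absolute error on made-up numbers. Both the `masked_mae` helper and the choice of metric are assumptions for illustration, not the competition's scoring function.

```python
import numpy as np

def masked_mae(y_true, y_pred):
    """Mean absolute error over entries whose true label is not NaN."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = ~np.isnan(y_true)  # True where a visit actually took place
    if not mask.any():
        return float("nan")
    return float(np.abs(y_true[mask] - y_pred[mask]).mean())

# Toy labels for two visits at horizons 0/6/12/24 months; NaN = no visit.
y_true = np.array([[10.0, 8.0, np.nan, 16.0],
                   [8.0, np.nan, np.nan, 12.0]])
y_pred = np.array([[9.0, 9.0, 11.0, 14.0],
                   [8.0, 7.0, 10.0, 15.0]])

score = masked_mae(y_true, y_pred)  # only the 5 non-NaN entries contribute
```

Predictions made for the `NaN` positions have no effect on `score`, whatever their values are.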

## This project has multiple fascinating features and challenges, including:
 - Reusing the base layers of another model that I also train
 - Multi-output models (16 outputs!)

## Approach

 1. **Ingest and format data**
 2. **Pre-train a multi-output model using `supplemental_clinical_data.csv`**
 
 Since the supplemental clinical data has no associated protein/peptide information, we can use it to train a model on the typical progression of Parkinson's disease (as measured by UPDRS scores). This model predicts scores from the time since the initial visit alone, which gives us base layers that we can reuse when predicting progression from proteins/peptides.
 3. **Train a multi-output model using `train_clinical_data`**
 
 Our goal in both steps 2 and 3 is to train a single model that predicts UPDRS scores at months 0, 6, 12, and 24. I'll use a multi-output approach: there are 4 UPDRS parts (`updrs_1` to `updrs_4`) and 4 time horizons, so the model predicts 16 outputs in total!
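
To make the 16 outputs concrete, here is a minimal sketch that builds the 16 target names (following the `updrs_N_plus_M_months` naming used in the processing step) and fits one multi-output regressor on synthetic data. The use of scikit-learn's `MLPRegressor`, the layer sizes, and the random data are all assumptions for illustration; the actual model in this notebook may look different.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# 4 UPDRS parts x 4 horizons = 16 target columns.
TARGETS = [
    f"updrs_{part}_plus_{months}_months"
    for part in (1, 2, 3, 4)
    for months in (0, 6, 12, 24)
]

rng = np.random.default_rng(0)
# Synthetic input: months since the initial visit (not real patient data).
visit_month = rng.integers(0, 60, size=(200, 1)).astype(float)
# Synthetic targets: a noisy linear drift, one column per target.
slopes = rng.uniform(0.05, 0.3, size=(1, len(TARGETS)))
y = 5.0 + visit_month * slopes + rng.normal(0.0, 1.0, size=(200, len(TARGETS)))

# A single network with 16 output units, fitted in one call.
model = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=300, random_state=0)
model.fit(visit_month, y)

preds = model.predict(visit_month[:5])  # one row per visit, one column per target
```

`MLPRegressor` does not accept `NaN` labels, so real labels with missing visits would need imputation or a masked loss before a fit like this.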

## Data Ingestion

In [1]:
import numpy as np
import pandas as pd
import sklearn
from pathlib import Path

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/amp-parkinsons-disease-progression-prediction/train_proteins.csv
/kaggle/input/amp-parkinsons-disease-progression-prediction/train_clinical_data.csv
/kaggle/input/amp-parkinsons-disease-progression-prediction/public_timeseries_testing_util.py
/kaggle/input/amp-parkinsons-disease-progression-prediction/supplemental_clinical_data.csv
/kaggle/input/amp-parkinsons-disease-progression-prediction/train_peptides.csv
/kaggle/input/amp-parkinsons-disease-progression-prediction/amp_pd_peptide/competition.cpython-37m-x86_64-linux-gnu.so
/kaggle/input/amp-parkinsons-disease-progression-prediction/amp_pd_peptide/__init__.py
/kaggle/input/amp-parkinsons-disease-progression-prediction/example_test_files/sample_submission.csv
/kaggle/input/amp-parkinsons-disease-progression-prediction/example_test_files/test_proteins.csv
/kaggle/input/amp-parkinsons-disease-progression-prediction/example_test_files/test_peptides.csv
/kaggle/input/amp-parkinsons-disease-progression-prediction/example_test

In [2]:
# Pull in all clinical data

INPUTS_DIR = Path('/kaggle/input/amp-parkinsons-disease-progression-prediction/') 

# train_clinical_data.csv
clinical_df = pd.read_csv(
    INPUTS_DIR / 'train_clinical_data.csv',
    index_col='visit_id',
)
print(clinical_df.shape)
print(clinical_df.head(n=10))

# supplemental_clinical_data.csv
supp_clinical_df = pd.read_csv(
    INPUTS_DIR / 'supplemental_clinical_data.csv',
    index_col='visit_id',
)
print(supp_clinical_df.shape)
print(supp_clinical_df.head(n=10))

# concat both to a single dataframe
all_clinical_df = pd.concat([clinical_df, supp_clinical_df])

print("\nUnion of both clinical data sets:")
print(all_clinical_df.shape)

(2615, 7)
          patient_id  visit_month  updrs_1  updrs_2  updrs_3  updrs_4  \
visit_id                                                                
55_0              55            0     10.0      6.0     15.0      NaN   
55_3              55            3     10.0      7.0     25.0      NaN   
55_6              55            6      8.0     10.0     34.0      NaN   
55_9              55            9      8.0      9.0     30.0      0.0   
55_12             55           12     10.0     10.0     41.0      0.0   
55_18             55           18      7.0     13.0     38.0      0.0   
55_24             55           24     16.0      9.0     49.0      0.0   
55_30             55           30     14.0     13.0     49.0      0.0   
55_36             55           36     17.0     18.0     51.0      0.0   
55_42             55           42     12.0     20.0     41.0      0.0   

         upd23b_clinical_state_on_medication  
visit_id                                      
55_0               

## Data Exploration

In [3]:
# sort by patient id and month
clinical_df = clinical_df.sort_values(
    ['patient_id', 'visit_month'],
)
supp_clinical_df = supp_clinical_df.sort_values(
    ['patient_id', 'visit_month'],
)
all_clinical_df = all_clinical_df.sort_values(
    ['patient_id', 'visit_month'],
)
all_clinical_df.head()

# get all unique patient ids (already sorted)
clin_patient_ids = clinical_df['patient_id'].unique()
supp_patient_ids = supp_clinical_df['patient_id'].unique()
patient_ids = all_clinical_df['patient_id'].unique()

# Verifying there are no supplemental patient ids that may have protein/peptide data
np.intersect1d(clin_patient_ids, supp_patient_ids)



array([], dtype=int64)

## Data Processing
This is (with only minor modifications) from [this work by Xavier R Nogueira](https://www.kaggle.com/code/xaviernogueira/pre-processing-making-labels-and-imputation#Add-supplemental-clinical-data-to-the-main-clininical-data) released under the [Apache 2.0](http://www.apache.org/licenses/LICENSE-2.0) open source license.

The main change I made was to wrap it in a reusable function so that I could process `train_clinical_data` and `supplemental_clinical_data` separately. Since `supplemental_clinical_data` doesn't have protein/peptide information, I don't think it should be grouped in with the labeled data.

Instead, I'll use `supplemental_clinical_data` to pre-train the base layers of my multi-output model on typical Parkinson's disease progression.
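
As a rough illustration of the pre-train-then-continue idea, the sketch below fits a network on one synthetic dataset and then keeps training the same weights on a second one, using scikit-learn's `warm_start=True`. This is a simplified stand-in rather than the approach described above: it continues training the whole network on the same input (visit month), whereas reusing base layers with protein/peptide inputs means transplanting layers into a new network. The data, the `make_fake_scores` helper, and the layer sizes are hypothetical.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)

def make_fake_scores(n_rows, slope):
    """Synthetic (visit_month -> 16 scores) data standing in for real labels."""
    months = rng.integers(0, 60, size=(n_rows, 1)).astype(float)
    scores = 5.0 + slope * months + rng.normal(0.0, 1.0, size=(n_rows, 16))
    return months, scores

X_supp, y_supp = make_fake_scores(300, slope=0.20)  # "supplemental" stand-in
X_clin, y_clin = make_fake_scores(100, slope=0.25)  # "train_clinical" stand-in

# warm_start=True makes the second fit() start from the already learned weights.
model = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=200,
                     warm_start=True, random_state=0)

model.fit(X_supp, y_supp)                        # pre-train
pretrained_first_layer = model.coefs_[0].copy()  # snapshot of the first layer

model.fit(X_clin, y_clin)                        # continue on the second dataset
clin_preds = model.predict(X_clin)
```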

In [4]:
def process_clinical_data(clinical_data, parquet_outfile_name):
    # rename the same-visit UPDRS columns, e.g. updrs_1 -> updrs_1_plus_0_months
    clinical_data = clinical_data.rename(
        columns={
            col: f'{col}_plus_0_months'
            for col in clinical_data.columns
            if 'updrs' in col
        },
    )

    # sort by patient id and month
    clinical_data = clinical_data.sort_values(
        ['patient_id', 'visit_month'],
    )
    clinical_data.head()

    # get all unique patient ids (already sorted)
    patient_ids = clinical_data['patient_id'].unique()

    # get all unique visit_months and sort
    months = clinical_data.visit_month.unique()
    months.sort()
    evenly_spaced_months = np.arange(months[0], months[-1] + 1)

    # create a multi-index with every (patient_id, month) combination
    multi_index = pd.MultiIndex.from_product(
        [patient_ids, evenly_spaced_months],
        names=['patient_id', 'visit_month'],
    )
    print(len(multi_index))

    # convert into dataframe multiindex
    multi_index = pd.DataFrame(
        index=multi_index,
    )

    multi_index.head(n=5)

    # join full clinical data
    clinical_data = multi_index.join(
        clinical_data.reset_index().set_index(['patient_id', 'visit_month']),
        how='left',
    )
    clinical_data.head()

    # shift data forward for our labels while grouping by patient_id;
    # each patient has one row per month, so shift(-k) looks k months ahead
    for col in [i for i in clinical_data.columns if 'plus_0_months' in i]:
        for horizon in (6, 12, 24):
            new_col = col.replace('plus_0_months', f'plus_{horizon}_months')
            clinical_data[new_col] = clinical_data.groupby('patient_id')[col].shift(-horizon)
    clinical_data.head()

    # drop all rows where visit_id is NaN
    clinical_data = clinical_data.loc[clinical_data.visit_id.notna()]

    # reset index back to visit_id
    clinical_data = (
        clinical_data
        .reset_index()
        .set_index('visit_id')
    )
    clinical_data.head()
    
    # save labels data frame as a parquet file (without non labels columns)
    clinical_data.drop(
        columns=['patient_id', 'visit_month', 'upd23b_clinical_state_on_medication'],
    ).to_parquet(
        parquet_outfile_name,
    )
    
    return clinical_data

In [5]:
# Clinical data post-processing, saving labels to a parquet file
clinical_df_pp = process_clinical_data(clinical_df, 'clinical_data_training_labels.parquet')

# Supplemental clinical data post-processing, saving labels to a parquet file
supp_clinical_df_pp = process_clinical_data(supp_clinical_df, 'supp_clinical_data_training_labels.parquet')

27032
28527
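
The two numbers printed above are the sizes of the patient × month grids built inside `process_clinical_data` (from `print(len(multi_index))`). To see why the grid matters, here is a toy version of the label shift for a single made-up patient and one UPDRS part. The values and the restriction to the 6- and 12-month horizons are assumptions for brevity.

```python
import numpy as np
import pandas as pd

# Toy data: one hypothetical patient seen at months 0, 6 and 12.
toy = pd.DataFrame(
    {
        "visit_id": ["1_0", "1_6", "1_12"],
        "patient_id": [1, 1, 1],
        "visit_month": [0, 6, 12],
        "updrs_1_plus_0_months": [10.0, 8.0, 11.0],
    }
)

# Dense grid: one row per patient per month, so shifting by k rows
# within a patient is the same as looking k months ahead.
grid = pd.MultiIndex.from_product(
    [toy["patient_id"].unique(), np.arange(0, 13)],
    names=["patient_id", "visit_month"],
)
dense = pd.DataFrame(index=grid).join(
    toy.set_index(["patient_id", "visit_month"]), how="left"
)

for horizon in (6, 12):
    dense[f"updrs_1_plus_{horizon}_months"] = (
        dense.groupby("patient_id")["updrs_1_plus_0_months"].shift(-horizon)
    )

# Keep only rows that correspond to real visits, indexed by visit_id.
labels = dense.loc[dense["visit_id"].notna()].reset_index().set_index("visit_id")
print(labels.filter(like="updrs"))
```

The month-0 row picks up the month-6 and month-12 scores as its future labels, while rows with no visit that far ahead keep `NaN`. Without the dense grid, `shift(-6)` would move by six visits rather than six months.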


# Sources Used & Further Resources

### Data processing
 * https://www.kaggle.com/code/xaviernogueira/pre-processing-making-labels-and-imputation#Add-supplemental-clinical-data-to-the-main-clininical-data

### Model reuse of base layers
 * <Placeholder>

### Predicting multiple outputs with one model
* https://pub.towardsai.net/how-to-predict-multiple-variables-with-one-model-and-why-31e6a8efc09e
* https://machinelearningmastery.com/deep-learning-models-for-multi-output-regression/