# PPMI Gait Analysis
This data set is part of the Parkinson's Progression Markers Initiative (PPMI). Anat Mirelman, PhD, of Tel Aviv University is the PI. According to the study summary: "The Gait study was proposed in order to obtain quantitative, objective motor measures that could inform on pre-clinical symptoms, progression markers, and dynamic changes of function throughout disease and potential modifiers and mediators of motor symptoms."

My goals are to explore the data set for an interesting clinical question and to improve my data science skills. One such question is whether subjects can be classified from these gait measures. The two main categories of subjects are individuals who carry a genetic marker for PD (e.g., LRRK2) but show no signs of PD, and individuals who carry a genetic marker and have a PD diagnosis.

In [1]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from numpy import linalg as LA
from datetime import datetime

from sklearn import preprocessing
from sklearn.preprocessing import scale 
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.metrics import confusion_matrix, classification_report, precision_score

from sklearn.model_selection import cross_val_score, cross_val_predict, LeaveOneOut
from sklearn.model_selection import KFold

# This sets a higher resolution for figures
%config InlineBackend.figure_format = 'retina'

First, we need to import the necessary data sets. The Gait_Data file contains the objective gait measures. The Screening file contains information on all of the more than 600 individuals participating in PPMI. The MDS_UPDRS_Part_III file contains the motor scores.

In [2]:
pd.options.display.max_rows = 1000
gait = pd.read_csv('Gait_Data___Arm_swing.csv')
screen_all = pd.read_csv('Screening___Demographics.csv')
updrs = pd.read_csv('MDS_UPDRS_Part_III.csv')
gait.columns

Index(['PATNO', 'EVENT_ID', 'INFODT', 'COHORT', 'SP_U', 'RA_AMP_U', 'LA_AMP_U',
       'RA_STD_U', 'LA_STD_U', 'SYM_U', 'R_JERK_U', 'L_JERK_U', 'ASA_U',
       'ASYM_IND_U', 'TRA_U', 'T_AMP_U', 'CAD_U', 'STR_T_U', 'STR_CV_U',
       'STEP_REG_U', 'STEP_SYM_U', 'JERK_T_U', 'SP__DT', 'RA_AMP_DT',
       'LA_AMP_DT', 'RA_STD_DT', 'LA_STD_DT', 'SYM_DT', 'R_JERK_DT',
       'L_JERK_DT', 'ASA_DT', 'ASYM_IND_DT', 'TRA_DT', 'T_AMP_DT', 'CAD_DT',
       'STR_T_DT', 'STR_CV_DT', 'STEP_REG_DT', 'STEP_SYM_DT', 'JERK_T_DT',
       'SW_VEL_OP', 'SW_PATH_OP', 'SW_FREQ_OP', 'SW_JERK_OP', 'SW_VEL_CL',
       'SW_PATH_CL', 'SW_FREQ_CL', 'SW_JERK_CL', 'TUG1_DUR', 'TUG1_STEP_NUM',
       'TUG1_STRAIGHT_DUR', 'TUG1_TURNS_DUR', 'TUG1_STEP_REG', 'TUG1_STEP_SYM',
       'TUG2_DUR', 'TUG2_STEP_NUM', 'TUG2_STRAIGHT_DUR', 'TUG2_TURNS_DUR',
       'TUG2_STEP_REG', 'TUG2_STEP_SYM'],
      dtype='object')

In [None]:
screen_all.columns

From the screening data, I only need a few variables.

In [3]:
# Keep only the patient ID, diagnosis, birth date, and gender columns
key_vars = ['PATNO', 'APPRDX', 'CURRENT_APPRDX', 'BIRTHDT', 'GENDER']
screen_filt = screen_all[key_vars]

For UPDRS Part III (motor scores), I only need the 33 variables related to the testing instrument, plus the patient ID and visit date.

In [4]:
updrs_vars = ['PATNO', 'INFODT', 'NP3SPCH', 'NP3FACXP', 'NP3RIGN', 'NP3RIGRU', 'NP3RIGLU', 'PN3RIGRL', 'NP3RIGLL', 
              'NP3FTAPR', 'NP3FTAPL', 'NP3HMOVR', 'NP3HMOVL', 'NP3PRSPR', 'NP3PRSPL', 'NP3TTAPR', 
              'NP3TTAPL', 'NP3LGAGR', 'NP3LGAGL', 'NP3RISNG', 'NP3GAIT', 'NP3FRZGT', 'NP3PSTBL', 
              'NP3POSTR', 'NP3BRADY', 'NP3PTRMR', 'NP3PTRML', 'NP3KTRMR', 'NP3KTRML', 'NP3RTARU', 
              'NP3RTALU', 'NP3RTARL', 'NP3RTALL', 'NP3RTALJ', 'NP3RTCON']
updrs_filt = updrs[updrs_vars]
updrs_filt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15832 entries, 0 to 15831
Data columns (total 35 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   PATNO     15832 non-null  int64  
 1   INFODT    15832 non-null  object 
 2   NP3SPCH   15832 non-null  int64  
 3   NP3FACXP  15832 non-null  int64  
 4   NP3RIGN   15829 non-null  float64
 5   NP3RIGRU  15829 non-null  float64
 6   NP3RIGLU  15829 non-null  float64
 7   PN3RIGRL  15827 non-null  float64
 8   NP3RIGLL  15826 non-null  float64
 9   NP3FTAPR  15830 non-null  float64
 10  NP3FTAPL  15827 non-null  float64
 11  NP3HMOVR  15830 non-null  float64
 12  NP3HMOVL  15826 non-null  float64
 13  NP3PRSPR  15830 non-null  float64
 14  NP3PRSPL  15828 non-null  float64
 15  NP3TTAPR  15799 non-null  float64
 16  NP3TTAPL  15802 non-null  float64
 17  NP3LGAGR  15831 non-null  float64
 18  NP3LGAGL  15830 non-null  float64
 19  NP3RISNG  15829 non-null  float64
 20  NP3GAIT   15830 non-null  fl

In [5]:
# Now I want to create new dataframe that only includes patients common to both gait and screen_filt
df_merged = pd.merge(gait, screen_filt, how='inner', on=['PATNO'])

# Did all of the subjects keep the same diagnosis? If so, we can drop one of the dx columns
print("Subjects kept same dx? " + str((df_merged['APPRDX'] == df_merged['CURRENT_APPRDX']).all()))

# Let's drop some more columns
df_merged = df_merged.drop(columns=['CURRENT_APPRDX', 'EVENT_ID', 'COHORT'])

df_merged.columns

print(len(np.unique(df_merged['PATNO'])))
print(len(df_merged))

Subjects kept same dx? True
103
191
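
The merge above silently drops any gait visit whose PATNO is missing from the screening table. A minimal sketch of one way to check for that, using made-up toy frames rather than the PPMI files (`gait_toy` and `screen_toy` are hypothetical stand-ins): a left merge with `indicator=True` flags unmatched rows, and `validate="many_to_one"` raises an error if the screening table has duplicate patient IDs.

```python
import pandas as pd

# Toy stand-ins for the gait and screening tables (all values are made up)
gait_toy = pd.DataFrame({"PATNO": [1, 1, 2, 3], "SP_U": [1.1, 1.2, 0.9, 1.0]})
screen_toy = pd.DataFrame({"PATNO": [1, 2, 4], "APPRDX": [5, 6, 5]})

# A left merge keeps every gait row; validate checks that each PATNO
# appears at most once in the screening table
merged_toy = pd.merge(gait_toy, screen_toy, how="left", on="PATNO",
                      validate="many_to_one", indicator=True)

# Rows flagged 'left_only' are gait visits with no screening record
unmatched = merged_toy[merged_toy["_merge"] == "left_only"]
n_unmatched = len(unmatched)
print(n_unmatched)
```

In this notebook the inner merge kept all 103 gait subjects, so nothing appears to have been lost, but the check is cheap to run when the input files change.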


In [6]:
# keep track of the cols (excluding UPDRS scores) that we want to keep later when we drop rows based on NaNs
predictor_cols = ['SP_U', 'RA_AMP_U', 'LA_AMP_U', 'RA_STD_U',
       'LA_STD_U', 'SYM_U', 'R_JERK_U', 'L_JERK_U', 'ASA_U', 'ASYM_IND_U',
       'TRA_U', 'T_AMP_U', 'CAD_U', 'STR_T_U', 'STR_CV_U', 'STEP_REG_U',
       'STEP_SYM_U', 'JERK_T_U', 'SP__DT', 'RA_AMP_DT', 'LA_AMP_DT',
       'RA_STD_DT', 'LA_STD_DT', 'SYM_DT', 'R_JERK_DT', 'L_JERK_DT', 'ASA_DT',
       'ASYM_IND_DT', 'TRA_DT', 'T_AMP_DT', 'CAD_DT', 'STR_T_DT', 'STR_CV_DT',
       'STEP_REG_DT', 'STEP_SYM_DT', 'JERK_T_DT', 'SW_VEL_OP', 'SW_PATH_OP',
       'SW_FREQ_OP', 'SW_JERK_OP', 'SW_VEL_CL', 'SW_PATH_CL', 'SW_FREQ_CL',
       'SW_JERK_CL', 'TUG1_DUR', 'TUG1_STEP_NUM', 'TUG1_STRAIGHT_DUR',
       'TUG1_TURNS_DUR', 'TUG1_STEP_REG', 'TUG1_STEP_SYM', 'TUG2_DUR',
       'TUG2_STEP_NUM', 'TUG2_STRAIGHT_DUR', 'TUG2_TURNS_DUR', 'TUG2_STEP_REG',
       'TUG2_STEP_SYM']

# How many unique subjects are there? 
print(len(np.unique(gait['PATNO'])))
print(len(np.unique(df_merged['PATNO'])))
print(len(df_merged))
df_merged.head()

103
103
191


Unnamed: 0,PATNO,INFODT,SP_U,RA_AMP_U,LA_AMP_U,RA_STD_U,LA_STD_U,SYM_U,R_JERK_U,L_JERK_U,...,TUG1_STEP_SYM,TUG2_DUR,TUG2_STEP_NUM,TUG2_STRAIGHT_DUR,TUG2_TURNS_DUR,TUG2_STEP_REG,TUG2_STEP_SYM,APPRDX,BIRTHDT,GENDER
0,42443,11/2018,1.445,42.787289,31.405978,2.783204,2.597315,0.369588,0.001618,0.002542,...,1.421568,10.390625,12.291016,0.546875,1.523438,0.565465,1.056312,6.0,1957.0,2.0
1,42443,11/2017,1.211,41.744432,42.194299,2.286481,2.235276,0.019062,0.002209,0.003016,...,1.284985,9.4375,11.674805,0.523438,1.441406,0.744995,1.20942,6.0,1957.0,2.0
2,42443,11/2019,1.431,44.932577,33.966371,2.373181,1.987091,0.32948,0.003805,0.005256,...,0.98132,10.085938,10.582031,6.046875,1.472656,0.618923,0.901443,6.0,1957.0,2.0
3,42438,10/2018,1.131,30.357805,42.788477,5.422287,5.012269,0.289054,0.002742,0.008316,...,0.993754,13.78125,16.245117,0.554688,2.363281,0.71315,1.002639,5.0,1955.0,1.0
4,42438,11/2019,1.068,19.245223,41.001083,4.567233,4.523336,0.52518,0.002266,0.009318,...,1.207092,20.8125,22.253418,14.242188,2.679688,0.327035,1.072561,5.0,1955.0,1.0


We want to create another data frame that contains data only from each patient's initial visit.

In [7]:
# We first need to parse the dates. Note that strptime sets the day of month to 1, which is harmless
# here because INFODT only records month and year
df_merged['INFODATE'] = [datetime.strptime(x, '%m/%Y') for x in df_merged['INFODT']]

# Next we want to create new data frame with only first visit's data
df_baseline = df_merged.sort_values('INFODATE').groupby(['PATNO'], as_index=False).first()
df_baseline.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103 entries, 0 to 102
Data columns (total 62 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   PATNO              103 non-null    int64  
 1   INFODT             103 non-null    object 
 2   SP_U               94 non-null     float64
 3   RA_AMP_U           94 non-null     float64
 4   LA_AMP_U           94 non-null     float64
 5   RA_STD_U           94 non-null     float64
 6   LA_STD_U           94 non-null     float64
 7   SYM_U              94 non-null     float64
 8   R_JERK_U           94 non-null     float64
 9   L_JERK_U           94 non-null     float64
 10  ASA_U              94 non-null     float64
 11  ASYM_IND_U         94 non-null     float64
 12  TRA_U              94 non-null     float64
 13  T_AMP_U            94 non-null     float64
 14  CAD_U              94 non-null     float64
 15  STR_T_U            94 non-null     float64
 16  STR_CV_U           94 non-
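
One caveat about the baseline step above: `groupby(...).first()` returns the first non-null value in each column, not the first row. If a patient's earliest visit has a missing measure, the value can be filled in from a later visit. The sketch below shows the difference on a made-up two-visit patient and compares it with `drop_duplicates`, which keeps the earliest row as-is. Whether this changes anything in the real data has not been checked here.

```python
import numpy as np
import pandas as pd

# One made-up patient with two visits; the first visit is missing SP_U
toy = pd.DataFrame({
    "PATNO": [1, 1],
    "INFODATE": pd.to_datetime(["2017-06-01", "2018-06-01"]),
    "SP_U": [np.nan, 1.2],
})
toy_sorted = toy.sort_values("INFODATE")

# first() skips NaN column by column, so SP_U comes from the second visit
first_non_null = toy_sorted.groupby("PATNO", as_index=False).first()

# drop_duplicates keeps the whole earliest row, NaN included
first_row = toy_sorted.drop_duplicates("PATNO", keep="first")

print(first_non_null["SP_U"].iloc[0])
print(first_row["SP_U"].iloc[0])
```

The second approach keeps the original row index rather than renumbering it, so a `reset_index(drop=True)` may be wanted afterwards.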

Let's figure out how many subjects were assessed on multiple visits.

In [8]:
df_pts = df_merged.groupby('PATNO', as_index=False).count()
df_pts['MULT_VISITS'] = [1 if x > 1 else 0 for x in df_pts['INFODT']]
df_mult_visits = df_pts[df_pts['MULT_VISITS']==1]
print(len(df_mult_visits))

51
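
The same count can be sketched more compactly with `groupby(...).size()`, shown here on a made-up list of patient IDs. Note that `size()` counts rows, whereas the `count()` call above counts non-null INFODT values; the two agree when no visit date is missing.

```python
import pandas as pd

# Made-up visit log: patients 1 and 3 have more than one visit
toy = pd.DataFrame({"PATNO": [1, 1, 2, 3, 3, 3]})

# Number of rows (visits) per patient
visits_per_patient = toy.groupby("PATNO").size()

# Number of patients with more than one visit
n_multi = int((visits_per_patient > 1).sum())
print(n_multi)
```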


Uncomment the next line to save the baseline data frame to a CSV file.

In [None]:
# df_baseline.to_csv('gait_baseline.csv', index=False)

Let's start to look at the data. 

In [None]:
df_baseline.info()

In [None]:
# Plot correlation matrix 
# Restrict to numeric columns, since corr() is only defined for them
sns.heatmap(df_baseline.select_dtypes('number').corr(), cmap='coolwarm')

In [None]:
# Plot each variable against every other -- runtime is very slow, so don't include too much
sns.pairplot(df_baseline[['SP_U', 'RA_AMP_U', 
       'RA_STD_U', 'SYM_U', 'R_JERK_U', 'SP__DT', 'RA_AMP_DT',
       'RA_STD_DT', 'SYM_DT', 'R_JERK_DT']])

In [9]:
# count number of patients with each diagnosis
print(df_baseline.groupby('APPRDX').count())

        PATNO  INFODT  SP_U  RA_AMP_U  LA_AMP_U  RA_STD_U  LA_STD_U  SYM_U  \
APPRDX                                                                       
4.0         7       7     2         6         6         6         6      6   
5.0        53      53    52        49        49        49        49     49   
6.0        42      42    40        38        38        38        38     38   
8.0         1       1     0         1         1         1         1      1   

        R_JERK_U  L_JERK_U  ...  TUG1_STEP_SYM  TUG2_DUR  TUG2_STEP_NUM  \
APPRDX                      ...                                           
4.0            6         6  ...              5         5              5   
5.0           49        49  ...             53        52             52   
6.0           38        38  ...             42        40             40   
8.0            1         1  ...              1         1              1   

        TUG2_STRAIGHT_DUR  TUG2_TURNS_DUR  TUG2_STEP_REG  TUG2_STEP_SYM  \
APPRD
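
The table above repeats the count for every column, which makes the group sizes hard to read. A minimal sketch of a narrower alternative, on made-up diagnosis codes: `value_counts()` on the single APPRDX column gives one number per group.

```python
import pandas as pd

# Made-up baseline table with one row per patient
toy = pd.DataFrame({"PATNO": [1, 2, 3, 4], "APPRDX": [5.0, 5.0, 6.0, 4.0]})

# Patients per diagnosis code, ordered by code
group_sizes = toy["APPRDX"].value_counts().sort_index()
print(group_sizes)
```

Unlike `groupby(...).count()`, this does not show how many non-null values each gait measure has per group, so the wide table is still useful when missingness matters.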

In [None]:
# info() prints its summary directly and returns None, so no print() is needed
df_baseline.info()

In [10]:
# Count how many subjects have missing data
print(df_baseline.isna().any(axis=1).sum())

df_baseline.info()

# Create new data frame that excludes subjects with missing data
df_filt = df_baseline.dropna(subset=predictor_cols)

print(len(df_filt))

# Now how many subjects are there per group?
print(df_filt.groupby('APPRDX', as_index=False).count())

df_filt.info()

26
<class 'pandas.core.frame.DataFrame'>
Int64Index: 103 entries, 0 to 102
Data columns (total 62 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   PATNO              103 non-null    int64  
 1   INFODT             103 non-null    object 
 2   SP_U               94 non-null     float64
 3   RA_AMP_U           94 non-null     float64
 4   LA_AMP_U           94 non-null     float64
 5   RA_STD_U           94 non-null     float64
 6   LA_STD_U           94 non-null     float64
 7   SYM_U              94 non-null     float64
 8   R_JERK_U           94 non-null     float64
 9   L_JERK_U           94 non-null     float64
 10  ASA_U              94 non-null     float64
 11  ASYM_IND_U         94 non-null     float64
 12  TRA_U              94 non-null     float64
 13  T_AMP_U            94 non-null     float64
 14  CAD_U              94 non-null     float64
 15  STR_T_U            94 non-null     float64
 16  STR_CV_U           94 n
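
Dropping every subject with any missing predictor removes 26 of 103 subjects. Before committing to that, it can help to see which columns account for the missing values. The sketch below does this on a made-up frame (the column names are borrowed from the gait table, the values are not).

```python
import numpy as np
import pandas as pd

# Made-up predictor table with a few missing values
toy = pd.DataFrame({
    "SP_U": [1.0, np.nan, 1.2, 1.1],
    "TUG1_DUR": [9.5, 10.1, np.nan, 8.7],
    "CAD_U": [110.0, 105.0, 98.0, 101.0],
})

# Missing values per column, worst first
missing_per_col = toy.isna().sum().sort_values(ascending=False)

# Rows with at least one missing value, and the complete cases that remain
rows_with_missing = int(toy.isna().any(axis=1).sum())
complete = toy.dropna()

print(missing_per_col)
print(rows_with_missing, len(complete))
```

If a handful of columns cause most of the losses, excluding those columns from `predictor_cols` is one option for keeping more subjects.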

According to the presentation on accessing PPMI data ("08b_v2_Caspell_Foster_PPMI-Data-Access_May-2015-v2.0.pdf"), the APPRDX codes in our data set correspond to:

**4 - Prodromal** (an individual who appears at risk for PD based on reported "anosmia" or REM sleep behavior disorder)

**5 - Genetic Cohort subject with PD**

**6 - Genetic Cohort subject unaffected**

In [13]:
# Based on the classifications above, label subjects as having PD (APPRDX=5) or no PD (APPRDX=4 or 6)
ParkDx = [1 if x == 5 else 0 for x in df_filt['APPRDX']]

# Display a copy of the data frame with the binary Parkinson's disease variable added.
# Note: assign() returns a new data frame, so df_filt itself is left unchanged here.
df_filt.assign(PD=ParkDx)

Unnamed: 0,PATNO,INFODT,SP_U,RA_AMP_U,LA_AMP_U,RA_STD_U,LA_STD_U,SYM_U,R_JERK_U,L_JERK_U,...,TUG2_STEP_NUM,TUG2_STRAIGHT_DUR,TUG2_TURNS_DUR,TUG2_STEP_REG,TUG2_STEP_SYM,APPRDX,BIRTHDT,GENDER,INFODATE,PD
0,40553,05/2017,0.806,36.126752,17.767148,4.386049,2.648999,1.144444,0.023956,0.090631,...,13.097656,0.539062,1.539062,0.642412,1.046546,5.0,1945.0,1.0,2017-05-01 00:00:00,1
1,40555,06/2017,1.0,3.700673,24.505864,1.345954,2.865374,0.847282,0.016979,0.020427,...,14.117188,0.609375,1.507812,0.442304,0.821767,5.0,1957.0,1.0,2017-06-01 00:00:00,1
3,40567,02/2019,0.973,29.782323,41.685751,9.155024,18.263038,0.811041,0.045715,0.098497,...,9.140625,0.585938,1.714844,0.636833,1.788158,6.0,1966.0,2.0,2019-02-01 00:00:00,0
4,40578,12/2017,1.104,35.03777,30.911425,4.972681,5.745657,0.134001,0.026055,0.047951,...,12.363281,0.570312,1.632812,0.552915,1.465661,5.0,1963.0,2.0,2017-12-01 00:00:00,1
5,40585,06/2017,1.305,38.543712,21.629526,4.982411,5.998333,0.805728,0.041232,0.019728,...,13.681641,0.5,2.210938,0.308428,0.782646,5.0,1962.0,2.0,2017-06-01 00:00:00,1
6,40586,06/2017,1.072,18.217614,11.348703,2.496837,2.775233,0.627736,0.062392,0.029549,...,8.810547,0.585938,2.222656,0.603835,1.76176,5.0,1975.0,2.0,2017-06-01 00:00:00,1
7,40587,06/2017,1.117,35.03777,30.911425,4.972681,5.745657,0.134001,0.026055,0.047951,...,12.407227,0.570312,1.632812,0.555093,1.464773,6.0,1950.0,2.0,2017-06-01 00:00:00,0
8,40593,01/0019,1.4,37.8283,55.009772,9.515026,6.200623,0.308241,0.035884,0.048152,...,6.069336,0.617188,1.789062,0.568204,0.750719,6.0,1959.0,1.0,0019-01-01 00:00:00,0
9,40596,11/2017,1.3,19.482179,25.957144,2.807639,2.232487,0.249487,0.034289,0.045664,...,8.929688,0.71875,2.34375,0.463234,5.936446,6.0,1951.0,2.0,2017-11-01 00:00:00,0
10,40599,03/2019,1.6,74.429563,58.530308,7.754609,5.568363,0.272419,0.118047,0.068798,...,5.824219,3.882812,1.324219,0.360449,1.080128,5.0,1956.0,1.0,2019-03-01 00:00:00,1
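
Row 8 in the table above has an INFODT of 01/0019, which parses to the year 19 and is possibly a data-entry error. Because the baseline step sorts by date, a visit like this would sort before every other visit for that patient. A minimal sketch of a plausibility check follows; the year window is an assumption and should be set to the actual study period.

```python
import pandas as pd

# Two visit dates copied from the table above
toy = pd.DataFrame({"PATNO": [40587, 40593], "INFODT": ["06/2017", "01/0019"]})

# Assumed plausibility window (hypothetical bounds, not taken from PPMI)
min_year, max_year = 2010, 2025

# Pull the year out of the 'MM/YYYY' string and flag anything outside the window
years = toy["INFODT"].str.split("/").str[1].astype(int)
suspicious = toy[(years < min_year) | (years > max_year)]
print(suspicious)
```

Flagged rows are best reviewed against the source records rather than corrected by guesswork.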


In [14]:
df_filt.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 77 entries, 0 to 101
Data columns (total 62 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   PATNO              77 non-null     int64  
 1   INFODT             77 non-null     object 
 2   SP_U               77 non-null     float64
 3   RA_AMP_U           77 non-null     float64
 4   LA_AMP_U           77 non-null     float64
 5   RA_STD_U           77 non-null     float64
 6   LA_STD_U           77 non-null     float64
 7   SYM_U              77 non-null     float64
 8   R_JERK_U           77 non-null     float64
 9   L_JERK_U           77 non-null     float64
 10  ASA_U              77 non-null     float64
 11  ASYM_IND_U         77 non-null     float64
 12  TRA_U              77 non-null     float64
 13  T_AMP_U            77 non-null     float64
 14  CAD_U              77 non-null     float64
 15  STR_T_U            77 non-null     float64
 16  STR_CV_U           77 non-n
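
The summary above still reports 62 columns, the same number as df_baseline, which suggests the PD column was displayed but never stored. `DataFrame.assign` returns a new data frame, so the result has to be assigned back to keep it. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up filtered table
toy = pd.DataFrame({"PATNO": [1, 2, 3], "APPRDX": [5.0, 6.0, 4.0]})

# Calling assign() without keeping the result leaves toy unchanged
toy.assign(PD=[1, 0, 0])
has_pd_before = "PD" in toy.columns

# Assigning the result back stores the new column
toy = toy.assign(PD=(toy["APPRDX"] == 5).astype(int))
has_pd_after = "PD" in toy.columns

print(has_pd_before, has_pd_after)
```

The boolean comparison followed by `astype(int)` produces the same 0/1 coding as the list comprehension used for ParkDx.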