# PPMI Gait Analysis
This data set is part of the Parkinson's Progression Markers Initiative. Anat Mirelman, PhD, of Tel Aviv University is the PI. According to the study summary: "The Gait study was proposed in order to obtain quantitative, objective motor measures that could inform on pre-clinical symptoms, progression markers, and dynamic changes of function throughout disease and potential modifiers and mediators of motor symptoms."

Your goal is to examine the data and later implement some ML algorithms to distinguish PD patients from non-PD patients based on the gait measures.

To learn more about PPMI, go here: http://www.ppmi-info.org/  
To learn more about working with the data, go here: www.ppmi-info.org/wp-content/uploads/2015/12/PPMI-data-access-final.mp4


In [1]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# seaborn is for making figures look really nice
import seaborn as sns

# this is a big hint for a later part of the exercise
from datetime import datetime 

First, you need to import the necessary data sets. The gait data are the objective gait measures. The screening data contain information on all of the individuals participating in PPMI.  

In [3]:
# read in csv files as pandas data frames
gait_variables=pd.read_csv('Gait_Data___Arm_swing.csv')
Screening_variables=pd.read_csv('Screening___Demographics.csv',usecols=['PATNO','APPRDX','CURRENT_APPRDX','BIRTHDT','GENDER','ORIG_ENTRY','LAST_UPDATE'])


# take a look at the variables in each csv


Unnamed: 0,PATNO,APPRDX,CURRENT_APPRDX,BIRTHDT,GENDER,ORIG_ENTRY,LAST_UPDATE
0,3400,1.0,1.0,1971.0,0.0,06/2010,2010-12-17 10:58:57.0
1,3401,2.0,2.0,1954.0,1.0,06/2010,2010-07-20 08:09:56.0
2,3402,3.0,3.0,1964.0,2.0,06/2010,2011-09-27 12:12:25.0
3,3403,1.0,1.0,1941.0,2.0,06/2010,2010-07-20 08:35:45.0
4,3404,2.0,2.0,1954.0,0.0,06/2010,2010-07-20 09:00:48.0
...,...,...,...,...,...,...,...
2249,5012,9.0,9.0,1952.0,2.0,09/2019,2019-09-27 08:31:52.0
2250,5013,9.0,9.0,1962.0,2.0,09/2019,2019-09-27 10:51:18.0
2251,5014,9.0,9.0,1953.0,2.0,11/2019,2019-11-19 11:26:48.0
2252,5015,9.0,9.0,1956.0,1.0,11/2019,2019-11-20 09:20:16.0


We have to read in both csv files because the gait csv does not contain all of the information we need about the subjects. For instance, the gait data do not tell us what diagnosis each individual has received. This is a problem if we later want to classify PD vs non-PD. The diagnosis information is contained within the screening csv. The trick is that many subjects in the screening csv were not part of the gait study. ***What to do?***

My advice is to:  
1) Get rid of the variables from the screening csv that we don't want or need.   
2) Merge the two data frames into one that has both gait and screening data for all of the subjects in the gait study.

**Hint:** Every individual in PPMI has a unique ID that is contained in 'PATNO'. The diagnosis each subject initially received is in 'APPRDX' and their most current diagnosis is in 'CURRENT_APPRDX'.

From screening data, we only care about a few variables.

In [42]:
# Probably want to keep 'PATNO', 'APPRDX', 'CURRENT_APPRDX', 'BIRTHDT', 'GENDER', 'ORIG_ENTRY', 'LAST_UPDATE'
# Now create new dataframe that only includes patients common to both gait and screen

Common_Screeningdata=(Screening_variables[Screening_variables.PATNO.isin(gait_variables.PATNO)])

# Some subjects were tested on multiple visits. How many unique subjects are there? 103
#To ensure that common screening data has all the unique patients
gait_unique=np.unique(gait_variables['PATNO'])
screen_unique=np.unique(Common_Screeningdata['PATNO'])

print (len(gait_unique))
print (len(screen_unique))

# merging data: PATNO is the only column shared by both data frames, so name it explicitly as the key
gait_screening_variables=gait_variables.merge(Common_Screeningdata, on='PATNO')
gait_screening_variables


103
103


Unnamed: 0,PATNO,EVENT_ID,INFODT,COHORT,SP_U,RA_AMP_U,LA_AMP_U,RA_STD_U,LA_STD_U,SYM_U,...,TUG2_STRAIGHT_DUR,TUG2_TURNS_DUR,TUG2_STEP_REG,TUG2_STEP_SYM,APPRDX,CURRENT_APPRDX,BIRTHDT,GENDER,ORIG_ENTRY,LAST_UPDATE
0,42443,V06,11/2018,1.0,1.445,42.787289,31.405978,2.783204,2.597315,0.369588,...,0.546875,1.523438,0.565465,1.056312,6.0,6.0,1957.0,2.0,12/2016,2016-12-20 00:37:15.0
1,42443,V04,11/2017,1.0,1.211,41.744432,42.194299,2.286481,2.235276,0.019062,...,0.523438,1.441406,0.744995,1.209420,6.0,6.0,1957.0,2.0,12/2016,2016-12-20 00:37:15.0
2,42443,V8,11/2019,1.0,1.431,44.932577,33.966371,2.373181,1.987091,0.329480,...,6.046875,1.472656,0.618923,0.901443,6.0,6.0,1957.0,2.0,12/2016,2016-12-20 00:37:15.0
3,42438,V06,10/2018,3.0,1.131,30.357805,42.788477,5.422287,5.012269,0.289054,...,0.554688,2.363281,0.713150,1.002639,5.0,5.0,1955.0,1.0,11/2016,2016-11-23 22:38:32.0
4,42438,V8,11/2019,3.0,1.068,19.245223,41.001083,4.567233,4.523336,0.525180,...,14.242188,2.679688,0.327035,1.072561,5.0,5.0,1955.0,1.0,11/2016,2016-11-23 22:38:32.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
186,60045,V12,04/2019,,,29.782323,41.685751,9.155024,18.263038,0.811041,...,0.585938,1.714844,0.638668,1.755831,4.0,4.0,1941.0,2.0,01/2014,2014-02-21 07:35:29.0
187,60046,V12,05/2019,,,23.160092,27.543798,3.630631,3.522706,0.158126,...,,,,,4.0,4.0,1941.0,2.0,01/2014,2014-04-23 08:15:23.0
188,60057,V12,04/2019,,,29.782323,41.685751,9.155024,18.263038,0.811041,...,0.585938,1.714844,0.638727,1.743030,4.0,4.0,1953.0,2.0,02/2014,2014-03-20 09:46:40.0
189,60059,V12,05/2019,,,31.132878,33.728352,8.217441,3.558946,0.172383,...,0.570312,1.917969,0.532902,1.168936,4.0,4.0,1943.0,2.0,02/2014,2014-04-01 11:02:26.0
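The filter-then-merge step can also be written as a single inner merge. The sketch below uses made-up toy tables (the names `gait` and `screening` and all values are assumptions for illustration, not PPMI data) to show that an inner merge on 'PATNO' keeps only subjects present in both tables.

```python
import pandas as pd

# Toy stand-ins for the gait and screening tables (values are made up).
gait = pd.DataFrame({
    'PATNO': [1, 1, 2],
    'EVENT_ID': ['BL', 'V04', 'BL'],
    'SP_U': [1.2, 1.1, 0.9],
})
screening = pd.DataFrame({
    'PATNO': [1, 2, 3],          # subject 3 never took part in the gait study
    'APPRDX': [5.0, 6.0, 1.0],
})

# An inner merge on PATNO keeps only subjects present in both tables.
# validate='many_to_one' raises an error if PATNO is duplicated in the
# screening table, which would silently multiply rows otherwise.
merged = gait.merge(screening, on='PATNO', how='inner', validate='many_to_one')
print(merged)
```

Every gait visit keeps its own row, and the screening columns are repeated for each visit of the same subject.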


Check to see if any subjects changed diagnoses within the course of the study. If so, drop one of the diagnosis columns. 
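One possible way to run this check is to compare the two diagnosis columns row by row. The sketch below uses a made-up toy frame (the name `toy` and its values are assumptions), not the PPMI data.

```python
import pandas as pd

# Toy data: subject 3 has a current diagnosis that differs from the initial one.
toy = pd.DataFrame({
    'PATNO': [1, 2, 3],
    'APPRDX': [5.0, 6.0, 4.0],
    'CURRENT_APPRDX': [5.0, 6.0, 5.0],
})

# Rows where the initial and current diagnosis disagree.
# Note: a missing value (NaN) in either column also counts as a mismatch here.
changed = toy[toy['APPRDX'] != toy['CURRENT_APPRDX']]
print(len(changed), "subject(s) changed diagnosis")

# If nobody changed, the two columns are redundant and one can be dropped.
if changed.empty:
    toy = toy.drop(columns='CURRENT_APPRDX')
```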

It also makes sense to subset the data so that you can look at data only from each subject's initial visit. How are you going to do this? **Hint:** You will need to reformat the dates so that they are in a format that python/pandas will understand as datetime. **Extra hint:** You will probably want to use the datetime.strptime function.

In [43]:
# We first need to format the dates correctly.
# Assigning through .loc writes to the data frame itself; chained assignment such as
# gait_screening_variables.INFODT[i] = ... may write to a temporary copy instead.
for i in range(len(gait_screening_variables)):
    # INFODT
    infodtstr = gait_screening_variables.loc[i, 'INFODT']
    infodt = datetime.strptime(infodtstr, '%m/%Y')

    # ORIG_ENTRY
    origstr = gait_screening_variables.loc[i, 'ORIG_ENTRY']
    orig_entry = datetime.strptime(origstr, '%m/%Y')

    gait_screening_variables.loc[i, 'INFODT'] = infodt
    gait_screening_variables.loc[i, 'ORIG_ENTRY'] = orig_entry

# Next we want to create new data frame with only first visit's data
first_visit = gait_screening_variables.groupby('PATNO')['INFODT'].transform('min')
data = gait_screening_variables[gait_screening_variables['INFODT'] == first_visit]
data = data.reset_index(drop=True)

data


Unnamed: 0,PATNO,EVENT_ID,INFODT,COHORT,SP_U,RA_AMP_U,LA_AMP_U,RA_STD_U,LA_STD_U,SYM_U,...,TUG2_STRAIGHT_DUR,TUG2_TURNS_DUR,TUG2_STEP_REG,TUG2_STEP_SYM,APPRDX,CURRENT_APPRDX,BIRTHDT,GENDER,ORIG_ENTRY,LAST_UPDATE
0,42443,V04,2017-11-01 00:00:00,1.0,1.211,41.744432,42.194299,2.286481,2.235276,0.019062,...,0.523438,1.441406,0.744995,1.209420,6.0,6.0,1957.0,2.0,2016-12-01 00:00:00,2016-12-20 00:37:15.0
1,42438,V06,2018-10-01 00:00:00,3.0,1.131,30.357805,42.788477,5.422287,5.012269,0.289054,...,0.554688,2.363281,0.713150,1.002639,5.0,5.0,1955.0,1.0,2016-11-01 00:00:00,2016-11-23 22:38:32.0
2,42426,BL,2016-11-01 00:00:00,1.0,0.982,51.516231,30.989870,7.412588,4.762775,0.672047,...,0.640625,1.449219,0.667530,1.035053,6.0,6.0,1951.0,2.0,2016-11-01 00:00:00,2016-11-28 03:08:43.0
3,42422,BL,2016-11-01 00:00:00,1.0,1.143,38.314673,33.248165,7.560887,8.128498,0.168624,...,0.609375,1.269531,0.491730,0.733162,6.0,6.0,1958.0,2.0,2016-12-01 00:00:00,2016-12-04 01:01:10.0
4,42418,BL,2017-02-01 00:00:00,3.0,0.964,15.610665,36.560067,4.258304,3.149224,0.569155,...,0.585938,2.457031,0.484080,1.557265,5.0,5.0,1949.0,1.0,2018-02-01 00:00:00,2018-02-05 23:03:40.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98,60045,V12,2019-04-01 00:00:00,,,29.782323,41.685751,9.155024,18.263038,0.811041,...,0.585938,1.714844,0.638668,1.755831,4.0,4.0,1941.0,2.0,2014-01-01 00:00:00,2014-02-21 07:35:29.0
99,60046,V12,2019-05-01 00:00:00,,,23.160092,27.543798,3.630631,3.522706,0.158126,...,,,,,4.0,4.0,1941.0,2.0,2014-01-01 00:00:00,2014-04-23 08:15:23.0
100,60057,V12,2019-04-01 00:00:00,,,29.782323,41.685751,9.155024,18.263038,0.811041,...,0.585938,1.714844,0.638727,1.743030,4.0,4.0,1953.0,2.0,2014-02-01 00:00:00,2014-03-20 09:46:40.0
101,60059,V12,2019-05-01 00:00:00,,,31.132878,33.728352,8.217441,3.558946,0.172383,...,0.570312,1.917969,0.532902,1.168936,4.0,4.0,1943.0,2.0,2014-02-01 00:00:00,2014-04-01 11:02:26.0
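The row-by-row date conversion can also be done in one vectorized call. This is a minimal sketch on a made-up toy frame (the names `visits` and `baseline` and all values are assumptions), using `pd.to_datetime` with the same '%m/%Y' format and `idxmin` to pick each subject's earliest visit.

```python
import pandas as pd

# Toy visit table with month/year strings, as in the gait csv (values are made up).
visits = pd.DataFrame({
    'PATNO': [1, 1, 2, 2],
    'EVENT_ID': ['V04', 'BL', 'V06', 'V04'],
    'INFODT': ['11/2017', '11/2016', '10/2018', '10/2017'],
})

# Convert the whole column at once; the result has a datetime64 dtype.
visits['INFODT'] = pd.to_datetime(visits['INFODT'], format='%m/%Y')

# idxmin returns the row label of the earliest visit for each subject.
first_rows = visits.groupby('PATNO')['INFODT'].idxmin()
baseline = visits.loc[first_rows].reset_index(drop=True)
print(baseline)
```

Unlike filtering on `INFODT == min`, `idxmin` keeps exactly one row per subject even when two visits share the same month.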


It might be a good idea to write the baseline data out to a csv at this point.

In [6]:
data.to_csv(r'C:\Users\bhats\OneDrive\Documents\GitHub\codingsess\PPMI/Modified_Gait_ScreeningData.csv', index = False)

PermissionError: [Errno 13] Permission denied: 'C:\\Users\\bhats\\OneDrive\\Documents\\GitHub\\codingsess\\PPMI/Modified_Gait_ScreeningData.csv'

All right. At this point, you can probably appreciate that data wrangling isn't easy. But now that we have the slice of the data that we're interested in, let's start to look at it.

In [6]:
# df.info() and df.columns are a good place to start -- you should probably have used them earlier, too
data.info()
data.columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 66 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   PATNO              103 non-null    int64  
 1   EVENT_ID           91 non-null     object 
 2   INFODT             103 non-null    object 
 3   COHORT             81 non-null     float64
 4   SP_U               94 non-null     float64
 5   RA_AMP_U           94 non-null     float64
 6   LA_AMP_U           94 non-null     float64
 7   RA_STD_U           94 non-null     float64
 8   LA_STD_U           94 non-null     float64
 9   SYM_U              94 non-null     float64
 10  R_JERK_U           94 non-null     float64
 11  L_JERK_U           94 non-null     float64
 12  ASA_U              94 non-null     float64
 13  ASYM_IND_U         94 non-null     float64
 14  TRA_U              94 non-null     float64
 15  T_AMP_U            94 non-null     float64
 16  CAD_U              94 non-

Index(['PATNO', 'EVENT_ID', 'INFODT', 'COHORT', 'SP_U', 'RA_AMP_U', 'LA_AMP_U',
       'RA_STD_U', 'LA_STD_U', 'SYM_U', 'R_JERK_U', 'L_JERK_U', 'ASA_U',
       'ASYM_IND_U', 'TRA_U', 'T_AMP_U', 'CAD_U', 'STR_T_U', 'STR_CV_U',
       'STEP_REG_U', 'STEP_SYM_U', 'JERK_T_U', 'SP__DT', 'RA_AMP_DT',
       'LA_AMP_DT', 'RA_STD_DT', 'LA_STD_DT', 'SYM_DT', 'R_JERK_DT',
       'L_JERK_DT', 'ASA_DT', 'ASYM_IND_DT', 'TRA_DT', 'T_AMP_DT', 'CAD_DT',
       'STR_T_DT', 'STR_CV_DT', 'STEP_REG_DT', 'STEP_SYM_DT', 'JERK_T_DT',
       'SW_VEL_OP', 'SW_PATH_OP', 'SW_FREQ_OP', 'SW_JERK_OP', 'SW_VEL_CL',
       'SW_PATH_CL', 'SW_FREQ_CL', 'SW_JERK_CL', 'TUG1_DUR', 'TUG1_STEP_NUM',
       'TUG1_STRAIGHT_DUR', 'TUG1_TURNS_DUR', 'TUG1_STEP_REG', 'TUG1_STEP_SYM',
       'TUG2_DUR', 'TUG2_STEP_NUM', 'TUG2_STRAIGHT_DUR', 'TUG2_TURNS_DUR',
       'TUG2_STEP_REG', 'TUG2_STEP_SYM', 'APPRDX', 'CURRENT_APPRDX', 'BIRTHDT',
       'GENDER', 'ORIG_ENTRY', 'LAST_UPDATE'],
      dtype='object')

In [7]:
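# Given the large number of predictors, you might want to start looking for correlations among the data. 
# Try plotting a correlation matrix using a seaborn function. 
# Restrict to numeric columns first: pandas 2.0 and later raise an error if corr() meets non-numeric columns.
data.select_dtypes(include='number').corr()
#sns.heatmap(data.select_dtypes(include='number').corr(), cmap='coolwarm')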





Unnamed: 0,PATNO,COHORT,SP_U,RA_AMP_U,LA_AMP_U,RA_STD_U,LA_STD_U,SYM_U,R_JERK_U,L_JERK_U,...,TUG2_DUR,TUG2_STEP_NUM,TUG2_STRAIGHT_DUR,TUG2_TURNS_DUR,TUG2_STEP_REG,TUG2_STEP_SYM,APPRDX,CURRENT_APPRDX,BIRTHDT,GENDER
PATNO,1.000000,-0.199155,-0.028207,-0.002119,0.052518,0.150884,0.285902,-0.063076,-0.016283,0.165301,...,-0.100698,-0.098702,-0.065378,-0.124020,0.041947,0.005329,-0.519410,-0.519410,-0.222325,0.149438
COHORT,-0.199155,1.000000,-0.315497,-0.330249,-0.246705,-0.389277,-0.292700,0.152277,-0.051655,-0.291661,...,0.400599,0.270737,0.044194,0.517439,-0.268924,0.118707,-0.879266,-0.879266,0.109980,-0.151517
SP_U,-0.028207,-0.315497,1.000000,0.372597,0.427593,0.196154,0.177339,-0.237216,0.050328,0.013971,...,-0.608836,-0.550797,-0.117999,-0.391345,0.347427,0.014924,0.278154,0.278154,0.072085,0.238884
RA_AMP_U,-0.002119,-0.330249,0.372597,1.000000,0.472256,0.435489,0.252331,0.097031,0.039307,0.138914,...,-0.512312,-0.362703,-0.054569,-0.490480,0.262289,-0.059754,0.249945,0.249945,-0.033804,-0.040930
LA_AMP_U,0.052518,-0.246705,0.427593,0.472256,1.000000,0.479620,0.404807,-0.383773,0.013296,0.058811,...,-0.307720,-0.265168,0.110564,-0.359637,0.153503,-0.074525,0.153614,0.153614,0.051620,-0.076607
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
TUG2_STEP_SYM,0.005329,0.118707,0.014924,-0.059754,-0.074525,-0.102167,0.030169,-0.022338,-0.005821,0.092821,...,0.059292,-0.010679,0.078108,0.095408,0.025607,1.000000,0.028697,0.028697,-0.095670,-0.050080
APPRDX,-0.519410,-0.879266,0.278154,0.249945,0.153614,0.076424,-0.021116,-0.203190,0.229334,0.114284,...,-0.266094,-0.174397,-0.032289,-0.328413,0.204903,0.028697,1.000000,1.000000,0.215861,0.034620
CURRENT_APPRDX,-0.519410,-0.879266,0.278154,0.249945,0.153614,0.076424,-0.021116,-0.203190,0.229334,0.114284,...,-0.266094,-0.174397,-0.032289,-0.328413,0.204903,0.028697,1.000000,1.000000,0.215861,0.034620
BIRTHDT,-0.222325,0.109980,0.072085,-0.033804,0.051620,0.065384,0.051339,-0.185493,0.190360,0.138907,...,-0.096722,-0.057556,0.061189,-0.155043,-0.085314,-0.095670,0.215861,0.215861,1.000000,-0.077884
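With this many predictors, the full matrix is hard to read by eye. One option, sketched below on synthetic data (the frame `toy` and its columns are assumptions for illustration), is to list the most strongly correlated pairs from the upper triangle of the matrix.

```python
import numpy as np
import pandas as pd

# Synthetic data: 'b' is nearly a scaled copy of 'a', 'c' is unrelated noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    'a': x,
    'b': 2 * x + rng.normal(scale=0.1, size=200),
    'c': rng.normal(size=200),
    'label': ['u'] * 200,        # non-numeric column, excluded below
})

corr = toy.select_dtypes(include='number').corr()

# Keep each pair once: mask everything except the strict upper triangle.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna()

# Order the pairs by absolute correlation, strongest first.
pairs = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(pairs.head())
```

Pairs near the top of this list are candidates for redundancy, for example left and right measures of the same quantity.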


In [41]:
data.groupby('APPRDX').count()

Unnamed: 0_level_0,PATNO,SP_U,RA_AMP_U,LA_AMP_U,RA_STD_U,LA_STD_U,SYM_U,R_JERK_U,L_JERK_U,ASA_U,...,TUG1_STEP_SYM,TUG2_DUR,TUG2_STEP_NUM,TUG2_STRAIGHT_DUR,TUG2_TURNS_DUR,TUG2_STEP_REG,TUG2_STEP_SYM,CURRENT_APPRDX,ORIG_ENTRY,LAST_UPDATE
APPRDX,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
4.0,7,2,6,6,6,6,6,6,6,6,...,5,5,5,5,5,5,5,7,7,7
5.0,53,52,49,49,49,49,49,49,49,49,...,52,50,50,50,50,50,50,53,53,53
6.0,42,40,38,38,38,38,38,38,38,38,...,41,39,39,39,39,39,39,42,42,42
8.0,1,0,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
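If only the number of subjects per diagnosis is needed, `value_counts` on the single column gives a more compact tally than counting every column. A small sketch on made-up values (the frame `toy` is an assumption):

```python
import pandas as pd

# Toy diagnosis codes (values are made up).
toy = pd.DataFrame({'PATNO': [1, 2, 3, 4, 5],
                    'APPRDX': [5.0, 5.0, 6.0, 4.0, 6.0]})

# One count per diagnosis code, ordered by code rather than by frequency.
counts = toy['APPRDX'].value_counts().sort_index()
print(counts)
```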


In [44]:
# count number of patients with each diagnosis
data.groupby('APPRDX').count()

data=data.drop(['INFODT','COHORT','EVENT_ID','BIRTHDT','GENDER','CURRENT_APPRDX'],axis=1)
# Count how many subjects have missing data (rows with at least one null value)
a=data.isnull().any(axis=1).sum()
print(a)

# Create new data frame that excludes subjects with missing data
data2=data.dropna()
data2=data2.reset_index(drop=True)


# Now how many subjects are there per group?
data2.groupby('APPRDX').count()

data2

35


Unnamed: 0,PATNO,SP_U,RA_AMP_U,LA_AMP_U,RA_STD_U,LA_STD_U,SYM_U,R_JERK_U,L_JERK_U,ASA_U,...,TUG1_STEP_SYM,TUG2_DUR,TUG2_STEP_NUM,TUG2_STRAIGHT_DUR,TUG2_TURNS_DUR,TUG2_STEP_REG,TUG2_STEP_SYM,APPRDX,ORIG_ENTRY,LAST_UPDATE
0,42443,1.211,41.744432,42.194299,2.286481,2.235276,0.019062,0.002209,0.003016,0.612344,...,1.284985,9.437500,11.674805,0.523438,1.441406,0.744995,1.209420,6.0,2016-12-01 00:00:00,2016-12-20 00:37:15.0
1,42438,1.131,30.357805,42.788477,5.422287,5.012269,0.289054,0.002742,0.008316,10.824114,...,0.993754,13.781250,16.245117,0.554688,2.363281,0.713150,1.002639,5.0,2016-11-01 00:00:00,2016-11-23 22:38:32.0
2,42422,1.143,38.314673,33.248165,7.560887,8.128498,0.168624,0.081082,0.104493,4.843285,...,0.737591,7.562500,6.070312,0.609375,1.269531,0.491730,0.733162,6.0,2016-12-01 00:00:00,2016-12-04 01:01:10.0
3,42418,0.964,15.610665,36.560067,4.258304,3.149224,0.569155,0.005900,0.004559,24.238172,...,2.056776,11.101562,8.835938,0.585938,2.457031,0.484080,1.557265,5.0,2018-02-01 00:00:00,2018-02-05 23:03:40.0
4,40621,1.500,51.613331,37.382977,2.566866,3.752649,0.383879,0.030677,0.030634,10.120941,...,0.854427,9.640625,7.667969,0.656250,2.210938,0.762715,1.158379,6.0,2017-02-01 00:00:00,2017-02-06 22:35:36.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
63,42447,1.272,20.811657,20.972986,2.774855,3.700150,0.496404,0.041731,0.014464,13.038048,...,6.031192,9.406250,8.804688,0.585938,2.097656,0.643226,1.068970,5.0,2016-12-01 00:00:00,2016-12-20 03:47:35.0
64,42451,1.333,30.460233,70.541678,6.811038,13.168107,0.558768,0.010897,0.004974,23.693095,...,0.846846,7.859375,6.589844,0.523438,1.613281,0.592706,1.036322,5.0,2017-02-01 00:00:00,2017-02-01 00:14:02.0
65,42452,1.271,16.311517,40.832855,6.109233,4.708778,0.596745,0.002103,0.001666,25.823470,...,2.517994,8.570312,5.712891,0.617188,1.718750,0.491318,1.121519,5.0,2017-01-01 00:00:00,2017-01-29 02:46:23.0
66,42453,0.982,2.764410,25.210230,1.611778,3.438281,0.890630,0.008972,0.001199,43.065015,...,1.153814,20.445312,18.304688,0.687500,3.492188,0.507125,0.966052,5.0,2017-01-01 00:00:00,2017-01-31 00:04:01.0


Unnamed: 0_level_0,PATNO,SP_U,RA_AMP_U,LA_AMP_U,RA_STD_U,LA_STD_U,SYM_U,R_JERK_U,L_JERK_U,ASA_U,...,TUG1_STEP_SYM,TUG2_DUR,TUG2_STEP_NUM,TUG2_STRAIGHT_DUR,TUG2_TURNS_DUR,TUG2_STEP_REG,TUG2_STEP_SYM,CURRENT_APPRDX,ORIG_ENTRY,LAST_UPDATE
APPRDX,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
4.0,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
5.0,38,38,38,38,38,38,38,38,38,38,...,38,38,38,38,38,38,38,38,38,38
6.0,29,29,29,29,29,29,29,29,29,29,...,29,29,29,29,29,29,29,29,29,29


According to the presentation on accessing PPMI data ("08b_v2_Caspell_Foster_PPMI-Data-Access_May-2015-v2.0.pdf"), the APPRDX codes in our data set correspond to:

**4 - Prodromal (this means an individual who appears at risk for PD based on report of "anosmia" or disrupted REM behavior)**

**5 - Genetic Cohort subject with PD**

**6 - Genetic Cohort subject unaffected**

In [40]:
# Based on above diagnoses, we want to classify subjects as either having PD (APPRDX = 5) or no PD (APPRDX = 4 or 6)
# Create a new column called CLASSIFY. For each subject, CLASSIFY takes a value of 1 for those with PD and 0 for those without.
data2['CLASSIFY'] = (data2['APPRDX'] == 5).astype(int)
data2

Unnamed: 0,PATNO,EVENT_ID,INFODT,COHORT,SP_U,RA_AMP_U,LA_AMP_U,RA_STD_U,LA_STD_U,SYM_U,...,TUG2_TURNS_DUR,TUG2_STEP_REG,TUG2_STEP_SYM,APPRDX,CURRENT_APPRDX,BIRTHDT,GENDER,ORIG_ENTRY,LAST_UPDATE,CLASSIFY
0,42443,V04,11/2017,1.0,1.211,41.744432,42.194299,2.286481,2.235276,0.019062,...,1.441406,0.744995,1.209420,6.0,6.0,1957.0,2.0,12/2016,2016-12-20 00:37:15.0,0
1,42438,V06,10/2018,3.0,1.131,30.357805,42.788477,5.422287,5.012269,0.289054,...,2.363281,0.713150,1.002639,5.0,5.0,1955.0,1.0,11/2016,2016-11-23 22:38:32.0,1
2,42422,BL,11/2016,1.0,1.143,38.314673,33.248165,7.560887,8.128498,0.168624,...,1.269531,0.491730,0.733162,6.0,6.0,1958.0,2.0,12/2016,2016-12-04 01:01:10.0,0
3,42418,BL,02/2017,3.0,0.964,15.610665,36.560067,4.258304,3.149224,0.569155,...,2.457031,0.484080,1.557265,5.0,5.0,1949.0,1.0,02/2018,2018-02-05 23:03:40.0,1
4,42415,V04,01/2018,3.0,0.996,17.191551,29.273684,2.702083,3.963163,0.412973,...,1.742188,0.761910,1.146845,5.0,5.0,1947.0,1.0,10/2017,2017-11-06 00:21:59.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56,42447,V8,10/2019,3.0,1.119,18.535454,23.644239,2.857698,2.644760,0.280598,...,2.523438,0.405288,0.948151,5.0,5.0,1947.0,2.0,12/2016,2016-12-20 03:47:35.0,1
57,42451,BL,12/2016,1.0,1.333,30.460233,70.541678,6.811038,13.168107,0.558768,...,1.613281,0.592706,1.036322,5.0,5.0,1947.0,1.0,02/2017,2017-02-01 00:14:02.0,1
58,42452,BL,12/2016,1.0,1.271,16.311517,40.832855,6.109233,4.708778,0.596745,...,1.718750,0.491318,1.121519,5.0,5.0,1954.0,2.0,01/2017,2017-01-29 02:46:23.0,1
59,42453,BL,12/2016,3.0,0.982,2.764410,25.210230,1.611778,3.438281,0.890630,...,3.492188,0.507125,0.966052,5.0,5.0,1948.0,1.0,01/2017,2017-01-31 00:04:01.0,1
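A note on building binary labels from diagnosis codes: in Python the `|` operator binds more tightly than `==`, so an expression such as `x == 4|6` is read as `x == (4 | 6)`, which is `x == 6`, and does not test membership in {4, 6}. The sketch below, on a made-up toy frame (`toy`, `PD` and `NO_PD` are illustrative names), shows two explicit alternatives.

```python
import pandas as pd

# Toy diagnosis codes: 4 = prodromal, 5 = PD, 6 = unaffected (per the list above).
toy = pd.DataFrame({'APPRDX': [4.0, 5.0, 6.0, 5.0]})

# Bitwise OR of the integers 4 and 6 is 6, so `x == 4|6` only matches 6.
assert (4 | 6) == 6

# Membership test for the non-PD codes, and a direct test for the PD code.
toy['NO_PD'] = toy['APPRDX'].isin([4, 6]).astype(int)
toy['PD'] = (toy['APPRDX'] == 5).astype(int)
print(toy)
```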


Once you have your labelled data (identifying each subject as having PD or no PD) you can start thinking about building an ML algorithm to predict the diagnosis based on the gait data. The basic idea is that you train your algorithm on a portion of the data and then see how well it predicts a diagnosis of PD on the out-of-sample (i.e., "test") data. A good first classification algorithm to learn and use is **logistic regression.** 

But before you learn about logistic regression, you should probably learn the ins and outs of linear regression. For that, and all other things ML, I highly recommend "An Introduction to Statistical Learning" by James, Witten, Hastie and Tibshirani.  
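To make the train/test idea concrete, here is a minimal sketch using scikit-learn on synthetic data. Everything in it is an assumption for illustration: the features are random numbers rather than gait measures, and the 70/30 split, the scaling step and the default model settings are just one reasonable starting point.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the gait data: 200 subjects, 5 features.
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)          # 1 = PD, 0 = no PD (made-up labels)
X = rng.normal(size=(n, 5))
X[:, 0] += 2.0 * y                      # only the first feature carries signal

# Hold out 30% of the subjects; stratify keeps the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Scale the features, then fit logistic regression on the training part only.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# Evaluate on the out-of-sample (test) subjects.
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {test_accuracy:.2f}")
```

With only about 70 complete subjects in the real data, a single split gives a noisy estimate, so cross-validation may be worth considering later.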