# PPMI Gait Analysis
This data set is part of the Parkinson's Progression Markers Initiative. Anat Mirelman, PhD, of Tel Aviv University is the PI. According to the study summary: "The Gait study was proposed in order to obtain quantitative, objective motor measures that could inform on pre-clinical symptoms, progression markers, and dynamic changes of function throughout disease and potential modifiers and mediators of motor symptoms."

Your goal is to examine the data and later implement some ML algorithms to distinguish PD patients from non-PD patients based on the gait measures.

To learn more about PPMI, go here: http://www.ppmi-info.org/  
To learn more about working with the data, go here: www.ppmi-info.org/wp-content/uploads/2015/12/PPMI-data-access-final.mp4


In [1]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# seaborn is for making figures look really nice
import seaborn as sns

# this is a big hint for a later part of the exercise
from datetime import datetime

First, you need to import the necessary data sets. The gait data are the objective gait measures. The screening data contain information on all of the individuals participating in PPMI.  

In [None]:
# read in csv files as pandas data frames

# take a look at the variables in each csv


We have to read in both csv files because the gait csv does not contain all of the information we need about the subjects. For instance, the gait data do not tell us what diagnosis each individual has received, which is a problem if we later want to classify PD vs non-PD. The diagnosis information is in the screening csv. The catch is that the screening csv also contains many subjects who were not part of the gait study. ***What to do?***

My advice is to:  
1) Get rid of the variables from the screening csv that we don't want or need.   
2) Merge the two data frames into one that has both gait and screening data for all of the subjects in the gait study.

**Hint:** Every individual in PPMI has a unique ID that is contained in 'PATNO'. The diagnosis each subject initially received is in 'APPRDX' and their most current diagnosis is in 'CURRENT_APPRDX'.
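
If the two steps above feel abstract, here is a minimal sketch on made-up data. The toy frames stand in for the real csv files, and `gait_speed` and `SITE` are hypothetical column names used only for illustration; only 'PATNO', 'APPRDX', 'CURRENT_APPRDX' and 'GENDER' come from the exercise itself.

```python
import pandas as pd

# Toy stand-ins for the two csv files; all values are invented.
gait = pd.DataFrame({
    "PATNO": [101, 101, 102, 103],           # subject 101 has two visits
    "gait_speed": [1.10, 1.05, 0.92, 1.20],  # hypothetical gait measure
})
screen = pd.DataFrame({
    "PATNO": [101, 102, 103, 104, 105],      # 104 and 105 are not in the gait study
    "APPRDX": [5, 6, 4, 1, 2],
    "CURRENT_APPRDX": [5, 6, 4, 1, 2],
    "GENDER": [1, 2, 1, 2, 1],
    "SITE": [10, 11, 12, 13, 14],            # hypothetical column we do not need
})

# Step 1: keep only the screening variables we care about.
keep = ["PATNO", "APPRDX", "CURRENT_APPRDX", "GENDER"]
screen_small = screen[keep]

# Step 2: an inner merge on PATNO keeps only subjects present in both frames.
merged = gait.merge(screen_small, on="PATNO", how="inner")

# Rows are visits, so count unique IDs to get the number of subjects.
n_subjects = merged["PATNO"].nunique()
print(merged.shape, n_subjects)
```

An inner merge is one reasonable choice here because it drops screening-only subjects automatically; a left merge from the gait frame would also work if every gait subject appears in the screening data.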

From screening data, we only care about a few variables.

In [None]:
# Probably want to keep 'PATNO', 'APPRDX', 'CURRENT_APPRDX', 'BIRTHDT', 'GENDER', 'ORIG_ENTRY', 'LAST_UPDATE'

# Now create new dataframe that only includes patients common to both gait and screen

# Some subjects were tested on multiple visits. How many unique subjects are there? 


Check to see if any subjects changed diagnoses within the course of the study. If so, drop one of the diagnosis columns. 
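
One possible way to run this check is to compare the two diagnosis columns row by row. The sketch below uses invented values, and it assumes neither column has missing entries (a missing value would be counted as a change).

```python
import pandas as pd

# Toy merged data; subject 103 is given a changed diagnosis for illustration.
merged = pd.DataFrame({
    "PATNO": [101, 102, 103],
    "APPRDX": [5, 6, 4],
    "CURRENT_APPRDX": [5, 6, 5],
})

# Rows where the initial and current diagnosis disagree.
changed = merged[merged["APPRDX"] != merged["CURRENT_APPRDX"]]
n_changed = changed["PATNO"].nunique()
print(n_changed, changed["PATNO"].tolist())
```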

It also makes sense to subset the data so that you can look at data only from each subject's initial visit. How are you going to do this? **Hint:** You will need to reformat the dates so that they are in a format that Python/pandas will understand as datetime. **Extra hint:** You will probably want to use the datetime.strptime function.
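
Here is a rough sketch of how the date parsing and first-visit subsetting could go. It assumes the visit date is stored as a "MM/YYYY" string in a column called `INFODT`; both the format and the column name are assumptions, so check them against your own csv. The `gait_speed` column is hypothetical.

```python
from datetime import datetime

import pandas as pd

# Toy visit table; dates are assumed to be "MM/YYYY" strings.
visits = pd.DataFrame({
    "PATNO": [101, 101, 102, 102, 103],
    "INFODT": ["03/2016", "01/2015", "07/2015", "11/2016", "05/2016"],
    "gait_speed": [1.05, 1.10, 0.92, 0.90, 1.20],  # hypothetical measure
})

# Parse each string into a real datetime so that dates sort chronologically.
visits["visit_date"] = visits["INFODT"].apply(
    lambda s: datetime.strptime(s, "%m/%Y")
)

# For each subject, find the row index of the earliest visit and keep that row.
first_idx = visits.groupby("PATNO")["visit_date"].idxmin()
baseline = visits.loc[first_idx].reset_index(drop=True)
print(baseline)
```

`pd.to_datetime(visits["INFODT"], format="%m/%Y")` would do the same parsing in one vectorized call; `strptime` is used here only because the hint points to it.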

In [None]:
# We first need to format the dates correctly. 

# Next we want to create new data frame with only first visit's data


It might be a good idea to write the baseline data out to a csv at this point.

In [None]:
# print df to csv

All right. At this point, you can probably appreciate that data wrangling isn't easy. But now that we have a slice of the data that we're interested in, let's start to look at the data. 

In [None]:
# df.info() and df.columns are a good place to start -- you should probably have used them earlier, too

In [None]:
# Given the large number of predictors, you might want to start looking for correlations among the data. 
# Try plotting a correlation matrix using a seaborn function. 
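
As a starting point, a correlation matrix can be drawn with `sns.heatmap`. The sketch below uses synthetic data with hypothetical column names (`gait_speed`, `cadence`, `stride_var`), so the pattern it shows says nothing about the real gait measures.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: cadence is built to correlate with gait_speed.
rng = np.random.default_rng(0)
n = 50
speed = rng.normal(1.1, 0.1, n)
toy = pd.DataFrame({
    "gait_speed": speed,
    "cadence": 100 + 20 * speed + rng.normal(0, 1, n),
    "stride_var": rng.normal(2.0, 0.5, n),
})

# Restrict to numeric columns before correlating (IDs and dates would get in the way).
corr = toy.select_dtypes("number").corr()

# Fix the color scale to [-1, 1] so the colors are comparable across plots.
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, vmin=-1, vmax=1, cmap="coolwarm", annot=True, ax=ax)
ax.set_title("Correlation matrix (synthetic data)")
fig.tight_layout()
```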

In [None]:
# count number of patients with each diagnosis

# Count how many subjects have missing data

# Create new data frame that excludes subjects with missing data

# Now how many subjects are there per group?
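
A minimal sketch of these four steps on invented data is shown below. `gait_speed` and `cadence` are hypothetical column names, and dropping every row with any missing value is just one option; you may prefer to drop sparse columns first.

```python
import numpy as np
import pandas as pd

# Toy baseline data with a couple of missing gait values.
baseline = pd.DataFrame({
    "PATNO": [101, 102, 103, 104, 105],
    "APPRDX": [5, 5, 6, 4, 6],
    "gait_speed": [1.1, np.nan, 0.9, 1.2, 1.0],
    "cadence": [110.0, 105.0, np.nan, 120.0, 115.0],
})

# Number of subjects per diagnosis code.
counts_before = baseline["APPRDX"].value_counts()

# A subject counts as "missing data" if any column in their row is NaN.
n_missing = int(baseline.isna().any(axis=1).sum())

# Keep only complete rows, then recount per group.
complete = baseline.dropna()
counts_after = complete["APPRDX"].value_counts()
print(counts_before, n_missing, counts_after, sep="\n")
```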


According to the presentation on accessing PPMI data ("08b_v2_Caspell_Foster_PPMI-Data-Access_May-2015-v2.0.pdf"), the APPRDX codes in our data set correspond to:

**4 - Prodromal (this means an individual who appears at risk for PD based on report of "anosmia" or disrupted REM behavior)**

**5 - Genetic Cohort subject with PD**

**6 - Genetic Cohort subject unaffected**

In [None]:
# Based on above diagnoses, we want to classify subjects as either having PD (APPRDX = 5) or no PD (APPRDX = 4 or 6)
# Create a new column called PD. For each subject, PD takes a value of 1 for those with PD and 0 for those without.
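
One way to build the label is a boolean comparison cast to integers, sketched here on invented rows. Note that this maps every code other than 5 to 0, so it assumes the data contain only codes 4, 5 and 6.

```python
import pandas as pd

# Toy data using the three APPRDX codes listed above.
complete = pd.DataFrame({
    "PATNO": [101, 104, 105, 106],
    "APPRDX": [5, 4, 6, 5],
})

# True where APPRDX is 5 (PD), False otherwise; astype(int) gives 1 / 0.
complete["PD"] = (complete["APPRDX"] == 5).astype(int)
print(complete)
```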


Once you have your labelled data (identifying each subject as having PD or no PD), you can start thinking about building an ML algorithm to predict the diagnosis based on the gait data. The basic idea is that you train your algorithm on a portion of the data and then see how well it predicts a diagnosis of PD on the out-of-sample (i.e., "test") data. A good first classification algorithm to learn and use is **logistic regression.** 

But before you learn about logistic regression, you should probably learn the ins and outs of linear regression. For that, and all other things ML, I highly recommend "An Introduction to Statistical Learning" by James, Witten, Hastie and Tibshirani.  
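
To make the train/test idea described above concrete, here is a rough sketch using scikit-learn on synthetic data. Using scikit-learn is an assumption (the notebook does not import it), the features are random numbers rather than real gait measures, and the 75/25 split and the scaling step are illustrative choices, not recommendations from the study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 200 "subjects", 3 features, binary PD label.
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, n)
X = rng.normal(0, 1, (n, 3))
X[:, 0] -= 1.0 * y  # shift one feature for the PD group so there is signal

# Hold out 25% of subjects as the test set, keeping class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Standardize features, then fit logistic regression on the training set only.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# Accuracy on the held-out subjects is the out-of-sample estimate.
test_accuracy = model.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.2f}")
```

Putting the scaler inside the pipeline means it is fit on the training data only, which avoids leaking information from the test set.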

# Data visualization