# Detecting Parkinson's Disease from Speech Patterns


## Import libraries
We will start by importing all the key libraries for this project.

In [3]:
import numpy as np
import pandas as pd
import os, sys
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Import data
Import the saved raw data from the project.

This dataset comprises a range of biomedical voice measurements from 31 people, 23 of whom have Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals (identified by the "name" column). The main aim of the data is to discriminate healthy people from those with PD, according to the "status" column.

In [4]:
df=pd.read_csv('./data/parkinsons.data') # Import Parkinson's data (in csv format)
df.head() # Look at first 5 entries in the table to check data integrity

| | name | MDVP:Fo(Hz) | MDVP:Fhi(Hz) | MDVP:Flo(Hz) | MDVP:Jitter(%) | MDVP:Jitter(Abs) | MDVP:RAP | MDVP:PPQ | Jitter:DDP | MDVP:Shimmer | ... | Shimmer:DDA | NHR | HNR | status | RPDE | DFA | spread1 | spread2 | D2 | PPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | phon_R01_S01_1 | 119.992 | 157.302 | 74.997 | 0.00784 | 7e-05 | 0.0037 | 0.00554 | 0.01109 | 0.04374 | ... | 0.06545 | 0.02211 | 21.033 | 1 | 0.414783 | 0.815285 | -4.813031 | 0.266482 | 2.301442 | 0.284654 |
| 1 | phon_R01_S01_2 | 122.4 | 148.65 | 113.819 | 0.00968 | 8e-05 | 0.00465 | 0.00696 | 0.01394 | 0.06134 | ... | 0.09403 | 0.01929 | 19.085 | 1 | 0.458359 | 0.819521 | -4.075192 | 0.33559 | 2.486855 | 0.368674 |
| 2 | phon_R01_S01_3 | 116.682 | 131.111 | 111.555 | 0.0105 | 9e-05 | 0.00544 | 0.00781 | 0.01633 | 0.05233 | ... | 0.0827 | 0.01309 | 20.651 | 1 | 0.429895 | 0.825288 | -4.443179 | 0.311173 | 2.342259 | 0.332634 |
| 3 | phon_R01_S01_4 | 116.676 | 137.871 | 111.366 | 0.00997 | 9e-05 | 0.00502 | 0.00698 | 0.01505 | 0.05492 | ... | 0.08771 | 0.01353 | 20.644 | 1 | 0.434969 | 0.819235 | -4.117501 | 0.334147 | 2.405554 | 0.368975 |
| 4 | phon_R01_S01_5 | 116.014 | 141.781 | 110.655 | 0.01284 | 0.00011 | 0.00655 | 0.00908 | 0.01966 | 0.06425 | ... | 0.1047 | 0.01767 | 19.649 | 1 | 0.417356 | 0.823484 | -3.747787 | 0.234513 | 2.33218 | 0.410335 |
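
The description above mentions 195 recordings from 31 people, and the `name` values in the first rows look like `phon_R01_S01_1`, where the trailing number seems to index the recording. Assuming that pattern holds for every row (an assumption based only on the rows shown), a per-person identifier could be recovered roughly as follows. `subject_id` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def subject_id(name: str) -> str:
    """Strip the trailing recording index, e.g. 'phon_R01_S01_3' -> 'phon_R01_S01'.

    Assumes the naming pattern seen in the first rows holds for every recording.
    """
    return name.rsplit("_", 1)[0]

# Three names copied from the rows shown above, plus one made-up name
# ('phon_R01_S02_1') added only to illustrate a second subject.
names = pd.Series([
    "phon_R01_S01_1", "phon_R01_S01_2", "phon_R01_S01_3",
    "phon_R01_S02_1",
])
subjects = names.map(subject_id)
recordings_per_subject = subjects.value_counts()
print(recordings_per_subject)
```

On the real data, `df['name'].map(subject_id).nunique()` would be expected to return 31 if the naming assumption is correct.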


## Clean data
Get the features and labels from the DataFrame. 

The features are all the columns except ‘name’ (the recording identifier) and ‘status’:

- MDVP:Fo(Hz) - *Average vocal fundamental frequency*
- MDVP:Fhi(Hz) - *Maximum vocal fundamental frequency*
- MDVP:Flo(Hz) - *Minimum vocal fundamental frequency*
- MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, Jitter:DDP - *Several measures of variation in fundamental frequency*
- MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ, Shimmer:DDA - *Several measures of variation in amplitude*
- NHR, HNR - *Two measures of the ratio of noise to tonal components in the voice*
- RPDE, D2 - *Two nonlinear dynamical complexity measures*
- DFA - *Signal fractal scaling exponent*
- spread1, spread2, PPE - *Three nonlinear measures of fundamental frequency variation*

Labels are those in the ‘status’ column:
- 0 = Healthy
- 1 = Parkinson's disease


In [5]:
features=df.loc[:,df.columns!='status'].values[:,1:] # Keep every column except 'status'; [:,1:] then drops the first column ('name')
labels=df.loc[:,'status'].values # Get the labels (0/1) from the 'status' column
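
The positional slice `[:,1:]` only works while `name` is the first column. One possible alternative, sketched here on a toy frame rather than the real file, is to drop the non-feature columns by name. The toy frame holds a small subset of the real columns, with values copied from the first two rows of the table; `toy`, `feature_frame`, `toy_features` and `toy_labels` are illustrative names.

```python
import pandas as pd

# Toy frame with a subset of the real columns; values copied from the first two rows above.
toy = pd.DataFrame({
    "name": ["phon_R01_S01_1", "phon_R01_S01_2"],
    "MDVP:Fo(Hz)": [119.992, 122.400],
    "HNR": [21.033, 19.085],
    "status": [1, 1],
})

# Drop the identifier and the label by name instead of by position.
feature_frame = toy.drop(columns=["name", "status"])
toy_features = feature_frame.to_numpy(dtype=float)  # numeric matrix, ready for scaling
toy_labels = toy["status"].to_numpy()

print(list(feature_frame.columns))
print(toy_features.shape, toy_labels.shape)
```

Selecting by name also keeps the feature names available (via `feature_frame.columns`), which can help later when inspecting which measurements the model relies on.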

In [16]:
print(f"{(labels==1).sum()} observations are labeled for Parkinson's disease")
print(f"{(labels==0).sum()} observations are labeled for healthy controls")

147 observations are labeled for Parkinson's disease
48 observations are labeled for healthy controls
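
These counts show that the classes are imbalanced, which matters when reading an accuracy score later. As a small worked example using only the two counts printed above, the accuracy of a trivial classifier that always predicts the majority class can be computed as follows. Note this is the baseline on the full dataset; the class mix in any particular test split may differ.

```python
# Class counts reported in the output above.
n_parkinsons = 147
n_healthy = 48
n_total = n_parkinsons + n_healthy

# Accuracy of a trivial classifier that always predicts the majority class.
majority_baseline = max(n_parkinsons, n_healthy) / n_total
print(f"{n_total} recordings, majority-class baseline accuracy: {majority_baseline:.1%}")
```

Any model accuracy should be judged against this roughly 75% baseline rather than against 50%.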


## Prepare data
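Scale each feature to the range [-1, 1] using MinMaxScaler, so that all features share a common scale.
The fit_transform() method learns each feature's minimum and maximum from the data and then applies the scaling in one step. Here the scaler is fit on the full dataset, before the train/test split.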

In [17]:
scaler=MinMaxScaler((-1,1)) # Set scaler to be between -1 and 1
x=scaler.fit_transform(features) # Apply this scaler to our features that we extracted above
y=labels # These are 0/1 so don't need to be scaled
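
For a target range (a, b), MinMaxScaler maps each column as x' = a + (x - min)(b - a) / (max - min). The sketch below checks that formula against scikit-learn on made-up numbers and separates `fit` from `transform`, to show that rows not seen during fitting reuse the stored minimum and maximum. All array names and values here are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up numbers: two columns on very different scales.
train = np.array([[100.0, 0.002],
                  [150.0, 0.010],
                  [200.0, 0.006]])
new_rows = np.array([[250.0, 0.004]])  # first value lies above the fitted maximum

lo, hi = -1.0, 1.0
toy_scaler = MinMaxScaler(feature_range=(lo, hi))
toy_scaler.fit(train)                      # learns the per-column min and max only
train_scaled = toy_scaler.transform(train)

# The same mapping written out by hand.
col_min, col_max = train.min(axis=0), train.max(axis=0)
manual = lo + (train - col_min) * (hi - lo) / (col_max - col_min)
assert np.allclose(train_scaled, manual)

# Rows not seen during fit reuse the stored min/max and may fall outside [-1, 1].
new_scaled = toy_scaler.transform(new_rows)
print(train_scaled)
print(new_scaled)
```

Fitting on training rows only and then calling `transform` on held-out rows is a common way to keep test-set statistics out of preprocessing; it is shown here as an option, not as what this notebook does.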

## Split data
Split the dataset into training (80%) and testing (20%) sets.

In [8]:
x_train,x_test,y_train,y_test=train_test_split(x, y, test_size=0.2, random_state=7) # Split data set into _train and _test data to be used by the model
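
Because the 195 recordings come from only 31 people, a row-wise random split can place recordings from the same person in both the training and the test set. A possible alternative, not used in this notebook, is a group-aware split with scikit-learn's `GroupShuffleSplit`. The sketch below runs on synthetic data; `groups`, `X_toy` and `y_toy` are made-up stand-ins for a per-person identifier, the features and the labels.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(7)

# Synthetic stand-in: 10 hypothetical subjects with 6 recordings each, 3 features.
n_subjects, n_recordings = 10, 6
groups = np.repeat(np.arange(n_subjects), n_recordings)
X_toy = rng.normal(size=(n_subjects * n_recordings, 3))
y_toy = np.repeat(rng.integers(0, 2, size=n_subjects), n_recordings)  # one label per subject

# Hold out roughly 20% of the subjects (not 20% of the rows).
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=7)
train_idx, test_idx = next(splitter.split(X_toy, y_toy, groups=groups))

train_subjects = set(groups[train_idx])
test_subjects = set(groups[test_idx])
print("subjects in both sets:", train_subjects & test_subjects)
```

With a split like this, the test score reflects performance on people the model has never heard, which is usually the question of interest for a diagnostic aid.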

## Initialize and fit model to training data
Initialize an XGBClassifier with its default settings and train it on the training set. XGBoost (*eXtreme Gradient Boosting*) is an ensemble learning method: it builds many decision trees in sequence, each one correcting the errors of the trees before it, and combines their outputs into a single prediction.

In [9]:
model=XGBClassifier() # Define the model as an XGBoost classifier with default hyperparameters
model.fit(x_train,y_train) # Fit the model to the training data from the previous step





XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=4,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
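
To make the idea of combining many models concrete, the sketch below hand-rolls gradient boosting for squared-error regression using shallow scikit-learn trees: each new tree is fit to the residuals of the ensemble built so far. This is a simplified illustration of the boosting principle on synthetic data, not XGBoost's actual implementation, which among other things uses a regularized objective, second-order gradient information and a logistic loss for classification. All names below are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)  # synthetic target

learning_rate = 0.3   # roughly the learning_rate shown in the repr above
n_rounds = 50

# Start from a constant prediction, then add one small tree per round.
prediction = np.full_like(y_demo, y_demo.mean())
initial_mse = np.mean((y_demo - prediction) ** 2)
trees = []
for _ in range(n_rounds):
    residuals = y_demo - prediction            # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=2).fit(X_demo, residuals)
    prediction += learning_rate * tree.predict(X_demo)
    trees.append(tree)

final_mse = np.mean((y_demo - prediction) ** 2)
print(f"training MSE: {initial_mse:.3f} -> {final_mse:.3f}")
```

The learning rate shrinks each tree's contribution, so the ensemble improves in small steps rather than letting any single tree dominate.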

## Test model on withheld data
Generate predicted values for status from the withheld feature data, then calculate the model's accuracy on the test set.

In [10]:
y_pred=model.predict(x_test) # use the model to predict status from the test features
print(accuracy_score(y_test, y_pred)*100) # output the accuracy of this model

94.87179487179486
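
With a 20% test split of 195 recordings (39 rows), this figure is consistent with 37 correct predictions out of 39. Because the classes are imbalanced (147 versus 48 recordings), accuracy alone does not show how the errors are distributed between the two classes. The sketch below, on made-up labels, shows how a confusion matrix, per-class recall and balanced accuracy could be computed with scikit-learn; in the notebook the same calls would take `y_test` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             confusion_matrix, recall_score)

# Made-up labels for illustration only (1 = Parkinson's disease, 0 = healthy).
y_true_demo = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes, in the order [0, 1].
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

acc = accuracy_score(y_true_demo, y_pred_demo)
sensitivity = recall_score(y_true_demo, y_pred_demo, pos_label=1)  # recall on PD
specificity = recall_score(y_true_demo, y_pred_demo, pos_label=0)  # recall on healthy
balanced = balanced_accuracy_score(y_true_demo, y_pred_demo)

print(cm)
print(f"accuracy={acc:.2f} sensitivity={sensitivity:.2f} "
      f"specificity={specificity:.2f} balanced={balanced:.2f}")
```

In this toy case the accuracy is 0.80 even though half of the healthy examples are misclassified, which is the kind of gap that balanced accuracy (0.75 here) is meant to expose.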
