# Detecting Parkinson's with XGBoost

Parkinson's disease affects an estimated 1-2 people per thousand at any given time, which makes reliable detection an important problem.

The dataset used in this study is the "Oxford Parkinson's Disease Detection Dataset" from the UC Irvine Machine Learning Repository, donated on 6/25/2008.

STEP 1: Install the necessary libraries: numpy, pandas, scikit-learn (imported as sklearn), and xgboost

In [1]:
pip install xgboost

Collecting xgboost
  Obtaining dependency information for xgboost from https://files.pythonhosted.org/packages/c3/eb/496aa2f5d356af4185f770bc76055307f8d1870e11016b10fd779b21769c/xgboost-2.0.3-py3-none-manylinux2014_x86_64.whl.metadata
  Downloading xgboost-2.0.3-py3-none-manylinux2014_x86_64.whl.metadata (2.0 kB)
Downloading xgboost-2.0.3-py3-none-manylinux2014_x86_64.whl (297.1 MB)
Installing collected packages: xgboost
Successfully installed xgboost-2.0.3
Note: you may need to restart the kernel to use updated packages.


STEP 2: Import libraries

In [None]:
# Importing libraries
import numpy as np
import pandas as pd
import os, sys
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

STEP 3: Read "parkinsons.data" into a DataFrame and display the first 5 records

In [3]:
# Read the data
df=pd.read_csv('parkinsons.data')
df.head()

Unnamed: 0,name,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
0,phon_R01_S01_1,119.992,157.302,74.997,0.00784,7e-05,0.0037,0.00554,0.01109,0.04374,...,0.06545,0.02211,21.033,1,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654
1,phon_R01_S01_2,122.4,148.65,113.819,0.00968,8e-05,0.00465,0.00696,0.01394,0.06134,...,0.09403,0.01929,19.085,1,0.458359,0.819521,-4.075192,0.33559,2.486855,0.368674
2,phon_R01_S01_3,116.682,131.111,111.555,0.0105,9e-05,0.00544,0.00781,0.01633,0.05233,...,0.0827,0.01309,20.651,1,0.429895,0.825288,-4.443179,0.311173,2.342259,0.332634
3,phon_R01_S01_4,116.676,137.871,111.366,0.00997,9e-05,0.00502,0.00698,0.01505,0.05492,...,0.08771,0.01353,20.644,1,0.434969,0.819235,-4.117501,0.334147,2.405554,0.368975
4,phon_R01_S01_5,116.014,141.781,110.655,0.01284,0.00011,0.00655,0.00908,0.01966,0.06425,...,0.1047,0.01767,19.649,1,0.417356,0.823484,-3.747787,0.234513,2.33218,0.410335


STEP 4: Get the features and labels from the DataFrame.

- Features: all columns except "status" (the "name" identifier column is also dropped)
- Labels: the "status" column

Coding Notes:
df.columns!="status" creates a boolean mask that is True for every column name in df except "status".

.values converts the selected portion of df into a NumPy array.

"[:,1:]" is a slicing operation on the NumPy array that selects all rows and all columns from index 1 onwards, which removes the first column ("name").



In [4]:
# Get the features and labels
features=df.loc[:,df.columns!="status"].values[:,1:]
labels=df.loc[:,"status"].values
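
The mask, `.values`, and slice used in this step can be tried on a small stand-in table. The sketch below uses a made-up DataFrame (the `toy` name and all of its values are illustrative, not taken from the dataset) to show how the three operations combine.

```python
import pandas as pd

# Hypothetical miniature table: an identifier column first,
# numeric feature columns, and a "status" label column.
toy = pd.DataFrame({
    "name": ["rec_1", "rec_2", "rec_3"],
    "f1": [0.1, 0.2, 0.3],
    "status": [1, 0, 1],
    "f2": [10.0, 20.0, 30.0],
})

# Boolean mask: True for every column name except "status".
mask = toy.columns != "status"
print(mask)

# .values turns the selected columns into a NumPy array;
# [:, 1:] then drops the first remaining column ("name").
toy_features = toy.loc[:, mask].values[:, 1:]
toy_labels = toy.loc[:, "status"].values

print(toy_features.shape)  # (3, 2)
print(toy_labels)
```

Because the string column is still present when `.values` is called, the resulting array has dtype `object`; appending `.astype(float)` after the slice is one way to get a purely numeric array.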

STEP 5: The "status" column holds the labels, with values 0 and 1. Get the count of each label.

Coding Notes:
.shape[0] returns the size of the first dimension, which for a one-dimensional array is the total number of elements. Here, labels[labels==1] keeps only the elements equal to 1, so its .shape[0] is the number of times 1 appears in labels; the same applies to 0.
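
As a quick illustration of this counting idiom, here is a minimal sketch on a made-up label array (the `toy_labels` values are illustrative only), together with two equivalent ways of getting the same counts.

```python
import numpy as np

# Illustrative label array (made up, not the dataset's labels).
toy_labels = np.array([1, 0, 1, 1, 0])

# Boolean indexing keeps only the matching elements;
# .shape[0] is then the number of elements kept.
n_ones = toy_labels[toy_labels == 1].shape[0]
n_zeros = toy_labels[toy_labels == 0].shape[0]
print(n_ones, n_zeros)  # 3 2

# Equivalent: True counts as 1 when a boolean array is summed.
n_ones_sum = int((toy_labels == 1).sum())

# Equivalent: np.bincount counts each non-negative integer value,
# so index 0 holds the number of zeros and index 1 the number of ones.
counts = np.bincount(toy_labels)
print(counts)
```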

In [5]:
# Get count of each label (0 and 1) in labels
print(labels[labels==1].shape[0],
     labels[labels==0].shape[0])

147 48


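We have 147 ones and 48 zeros in the status column of the dataset (in this dataset, 1 denotes a recording from a person with Parkinson's and 0 a healthy one).

STEP 6: Initialize a MinMaxScaler and scale the features to the range -1 to 1 so that all features share a common scale.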

In [6]:
# Scale the features to between -1 and 1
scaler = MinMaxScaler((-1, 1))
x = scaler.fit_transform(features)
y = labels
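
To make the scaling concrete, the sketch below reproduces the min-max formula by hand with NumPy: each column is first mapped to the range 0 to 1 and then stretched to the target range. `min_max_scale` is a hypothetical helper written for this illustration, not part of scikit-learn, and the input values are made up.

```python
import numpy as np

def min_max_scale(a, lo=-1.0, hi=1.0):
    """Rescale each column of `a` linearly so its min maps to lo and its max to hi."""
    a = np.asarray(a, dtype=float)
    col_min = a.min(axis=0)
    col_max = a.max(axis=0)
    # Assumes no column is constant (col_max > col_min);
    # otherwise this would divide by zero.
    unit = (a - col_min) / (col_max - col_min)  # now in [0, 1]
    return unit * (hi - lo) + lo                # now in [lo, hi]

# Made-up values on very different scales.
toy = np.array([[100.0, 0.002],
                [150.0, 0.004],
                [200.0, 0.010]])

scaled = min_max_scale(toy)
print(scaled)
```

Each column's smallest value becomes -1 and its largest becomes 1, regardless of the column's original units.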

STEP 7: Split the dataset into training and testing sets, with 20% held out for testing

In [8]:
# Split the dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 7)
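
Because the two classes are imbalanced (147 ones versus 48 zeros), a purely random split can give the test set a somewhat different class ratio than the training set. One option, which this notebook does not use, is the `stratify` parameter of `train_test_split`. The sketch below tries it on synthetic arrays that only share the dataset's label counts; `x_demo` and `y_demo` are made up for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 195 rows with the same label counts
# as the dataset (147 ones, 48 zeros). Feature values are random.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(195, 3))
y_demo = np.array([1] * 147 + [0] * 48)

# stratify=y_demo asks for approximately the same class ratio
# in the training and testing parts.
xs_train, xs_test, ys_train, ys_test = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=7, stratify=y_demo
)

print(len(ys_train), len(ys_test))
print(ys_train.mean(), ys_test.mean())  # share of ones in each part
```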

STEP 8: Initialize an XGBClassifier and train the model

Classification uses XGBoost (eXtreme Gradient Boosting), an implementation of gradient boosting. This is a form of ensemble learning in ML: many models are trained and their predictions are combined to produce one output that is better than any individual model's.
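
The idea of combining many weak models can be sketched in a few lines. The example below is a deliberately simplified illustration of boosting for a regression problem with squared error, using one-split "stumps" written in NumPy. It is not XGBoost's actual algorithm, which adds regularization and other refinements; the data, `fit_stump`, and `predict_stump` are all made up for this sketch.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold on 1-D x that best fits residual (least squares)."""
    best = None
    for t in np.unique(x)[:-1]:  # skip the largest value so both sides are non-empty
        left = x <= t
        left_val = residual[left].mean()
        right_val = residual[~left].mean()
        err = ((residual - np.where(left, left_val, right_val)) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, left_val, right_val)
    return best[1:]

def predict_stump(stump, x):
    t, left_val, right_val = stump
    return np.where(x <= t, left_val, right_val)

# Synthetic 1-D data: a noisy sine curve.
rng = np.random.default_rng(0)
x_toy = rng.uniform(0, 6, size=80)
y_toy = np.sin(x_toy) + rng.normal(scale=0.1, size=80)

learning_rate = 0.3
pred = np.full_like(y_toy, y_toy.mean())  # start from a constant model
baseline_mse = ((y_toy - pred) ** 2).mean()

stumps = []
for _ in range(50):
    residual = y_toy - pred             # what the ensemble still gets wrong
    stump = fit_stump(x_toy, residual)  # weak model trained on those errors
    stumps.append(stump)
    pred += learning_rate * predict_stump(stump, x_toy)

boosted_mse = ((y_toy - pred) ** 2).mean()
print(baseline_mse, boosted_mse)
```

Each stump on its own is a poor model, but because every new stump is fitted to the errors left by the ones before it, the summed prediction has a lower training error than the constant starting point.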

In [11]:
# Train the model
model=XGBClassifier()
model.fit(x_train, y_train)

STEP 9: Generate y_pred (the predicted values for x_test) and calculate the accuracy of the model

In [10]:
# Calculate accuracy
y_pred=model.predict(x_test)
print(accuracy_score(y_test, y_pred)*100)

94.87179487179486
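
For context on what this number measures, accuracy is simply the fraction of predictions that match the true labels. The sketch below computes it by hand on made-up arrays, and also works out a majority-class baseline from the label counts reported earlier (147 ones, 48 zeros). The baseline refers to the full dataset; the class mix of the test split is not shown in this notebook, so its baseline may differ somewhat.

```python
import numpy as np

# Made-up labels and predictions, for illustration only.
y_true_demo = np.array([1, 1, 0, 1, 0, 1])
y_pred_demo = np.array([1, 1, 0, 0, 0, 1])

# Accuracy: share of positions where prediction and label agree.
accuracy = (y_true_demo == y_pred_demo).mean()
print(accuracy * 100)

# Majority-class baseline: a model that always predicts 1
# would be right on 147 of the 195 records.
baseline = 147 / (147 + 48)
print(round(baseline * 100, 2))  # 75.38
```

With imbalanced classes, comparing the model's accuracy against such a baseline gives a better sense of how much the model has actually learned.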


# Summary

We detected the presence of Parkinson's disease in individuals using the features in the dataset. An XGBClassifier was used, giving an accuracy of 94.87% on the test set.