## Detecting Parkinson's Disease
<p>Parkinson’s disease is a progressive, neurodegenerative disorder of the central nervous system. It affects dopamine-producing neurons in the brain, impairing movement and causing tremors and stiffness. It is commonly described as having five stages and affects more than 1 million individuals every year in India. It is chronic and currently has no cure.</p>

<p>I will build a predictive model for detecting the disease using XGBoost. XGBoost stands for eXtreme Gradient Boosting and is based on decision trees.</p>

<p>The data I will be using comes from the UCI Machine Learning Repository; you can <a href="https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/">download it here</a>.</p>



<p><img src="https://media.springernature.com/m685/springer-static/image/art%3A10.1038%2Fs41598-020-78418-8/MediaObjects/41598_2020_78418_Fig1_HTML.png" alt="Machine Learning for Parkinson's Disease Detection"></p>



## 1. Import the libraries and load data

<p>I have stored the data locally so it is easy to load, and I preview the first few rows with <code>head()</code>.</p>

<p>I then separate the features from the labels. The features are every column except 'status' (and the 'name' identifier); the labels are the values of the 'status' column. Finally, I count how many times each label occurs.</p>

<p>There are 147 ones (Parkinson's) and 48 zeros (healthy) in the 'status' column of our dataset.</p>



In [None]:
# import libraries
import numpy as np
import pandas as pd
import os, sys
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
df = pd.read_csv('parkinsons.data')
print(df.head())

# get features (every column except 'name' and 'status') and labels
features = df.drop(columns=['name', 'status']).values
labels = df['status'].values

# count status labels
print(labels[labels==1].shape[0], labels[labels==0].shape[0])

             name  MDVP:Fo(Hz)  MDVP:Fhi(Hz)  MDVP:Flo(Hz)  MDVP:Jitter(%)  \
0  phon_R01_S01_1      119.992       157.302        74.997         0.00784   
1  phon_R01_S01_2      122.400       148.650       113.819         0.00968   
2  phon_R01_S01_3      116.682       131.111       111.555         0.01050   
3  phon_R01_S01_4      116.676       137.871       111.366         0.00997   
4  phon_R01_S01_5      116.014       141.781       110.655         0.01284   

   MDVP:Jitter(Abs)  MDVP:RAP  MDVP:PPQ  Jitter:DDP  MDVP:Shimmer  ...  \
0           0.00007   0.00370   0.00554     0.01109       0.04374  ...   
1           0.00008   0.00465   0.00696     0.01394       0.06134  ...   
2           0.00009   0.00544   0.00781     0.01633       0.05233  ...   
3           0.00009   0.00502   0.00698     0.01505       0.05492  ...   
4           0.00011   0.00655   0.00908     0.01966       0.06425  ...   

   Shimmer:DDA      NHR     HNR  status      RPDE       DFA   spread1  \
0      0.0654
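
<p>As a small aside, the label count can also be written with pandas directly. The sketch below uses a made-up six-row frame (the name <code>toy</code> and its values are illustrative, not the real data) to show how <code>value_counts()</code> tallies each label.</p>

```python
import pandas as pd

# Hypothetical miniature stand-in for the 'status' column (not the real data).
toy = pd.DataFrame({"status": [1, 1, 0, 1, 0, 1]})

# value_counts() tallies every distinct label in a single call.
counts = toy["status"].value_counts()

# Look the counts up by label, mirroring the ones/zeros print in the cell above.
print(counts.loc[1], counts.loc[0])
```

<p>On the real dataset the same call would be <code>df['status'].value_counts()</code>.</p>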

## 2. Initialise MinMaxScaler and Split Data
<p>I will scale the features to the range -1 to 1 to normalise them, using <code>MinMaxScaler</code>. Its <code>fit_transform()</code> method learns the minimum and maximum of each feature and then applies the transformation in a single step. Scaling is a good habit because many algorithms fit better on normalised features, although tree-based models such as XGBoost are largely insensitive to feature scale.</p>

<p>I then split the dataset into two parts: 80% for training and 20% for testing.</p>


In [None]:
# normalise features to the range -1 to 1
scaler = MinMaxScaler((-1,1))
x = scaler.fit_transform(features)
y = labels

# split the dataset: 80% training, 20% test
x_train,x_test,y_train,y_test = train_test_split(x, y, test_size=0.2, random_state=7)
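
<p>One variation worth considering: the cell above fits the scaler on the whole dataset before splitting, so the test rows influence the scaling. A minimal sketch of the alternative, assuming synthetic data (all <code>toy_*</code> and <code>xs_*</code>/<code>ys_*</code> names are illustrative), splits first, fits the scaler on the training rows only, and uses <code>stratify</code> to keep the class ratio the same in both parts.</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption): 100 rows, 4 features, 75% ones.
rng = np.random.default_rng(7)
toy_features = rng.normal(size=(100, 4))
toy_labels = np.array([1] * 75 + [0] * 25)

# Split first; stratify keeps the 75/25 class ratio in both parts.
xs_train, xs_test, ys_train, ys_test = train_test_split(
    toy_features, toy_labels, test_size=0.2, random_state=7, stratify=toy_labels
)

# Fit the scaler on the training rows only, then reuse it for the test rows.
toy_scaler = MinMaxScaler(feature_range=(-1, 1))
xs_train_scaled = toy_scaler.fit_transform(xs_train)
xs_test_scaled = toy_scaler.transform(xs_test)

print(xs_train_scaled.min(), xs_train_scaled.max())
```

<p>With this ordering, test values can fall slightly outside -1 to 1, which is expected because the scaler never saw them.</p>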

## 3. Initialise an XGBClassifier and train the model
<p><code>XGBClassifier</code> classifies using eXtreme Gradient Boosting, an implementation of gradient-boosted decision trees. It falls under the category of ensemble learning in ML, where many models are trained and their predictions are combined to produce one stronger output.</p>

In [None]:
# initialise XGBClassifier and train the model
model = XGBClassifier()
model.fit(x_train,y_train)



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=16, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
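
<p>To give a feel for the ensemble idea, here is a deliberately simplified sketch of gradient boosting with squared-error loss. It is not XGBoost's actual algorithm, which adds regularisation and other refinements. Each small tree is fitted to the errors left by the trees before it. The data is synthetic and every name in the block is illustrative.</p>

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): the target is 1 when the first two features sum above 0.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

learning_rate = 0.3
prediction = np.full_like(y, y.mean())        # start from a constant guess
initial_error = np.mean((y - prediction) ** 2)

trees = []
for _ in range(20):
    residuals = y - prediction                # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                    # the next tree learns those errors
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_error = np.mean((y - prediction) ** 2)
print(initial_error, final_error)
```

<p>The training error shrinks as trees are added, which is the "many models, one output" behaviour described above.</p>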

## 4. Test the algorithm
<p>Finally, I use <code>model.predict()</code> to generate predictions (<code>y_pred</code>) for the test data and print the accuracy of the model. The model reaches about 94.9% accuracy on the held-out test set, which is high for such a short program.</p>
  


In [None]:
# predict on the test data and print the model's accuracy as a percentage
y_pred = model.predict(x_test)
print(accuracy_score(y_test, y_pred)*100)
  

94.87179487179486
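
<p>To put that number in context, the short calculation below uses only the figures already shown in this notebook (147 ones, 48 zeros, a 20% test split). It works out the accuracy of always guessing the majority class and how many test rows the reported score corresponds to. The baseline is computed on the whole dataset, so treat it as a rough reference rather than the exact figure for this particular test split.</p>

```python
import math

n_ones, n_zeros = 147, 48            # label counts reported in section 1
n_total = n_ones + n_zeros           # 195 rows in total

# Accuracy of a model that always predicts the majority class (status == 1).
baseline_accuracy = n_ones / n_total

# train_test_split rounds the test size up: ceil(0.2 * 195) rows.
n_test = math.ceil(0.2 * n_total)

# Convert the reported accuracy back into a count of correct predictions.
n_correct = round(0.9487179487179486 * n_test)

print(f"baseline: {baseline_accuracy:.3f}, test rows: {n_test}, correct: {n_correct}")
```

<p>Because the classes are imbalanced, a confusion matrix or per-class recall would complement plain accuracy here.</p>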
