# Prediction of Parkinson's Disease 
**(Using various Classification Algorithms)**

Parkinson's disease is a progressive neurodegenerative disorder marked by reduced dopamine production by neurons in the brain, resulting in symptoms such as tremors, stiffness, slow movement, loss of balance and speech changes. Parkinson's is chronic and has no medical cure yet, but medications can help control the symptoms and give those affected a good quality of life.

Studies have also shown that diagnosing the condition early can help mitigate the effects of later stages and keep the symptoms from getting worse. This project explores a way to detect the occurrence of Parkinson's disease early from a set of voice measurements, and tries to identify which algorithm does so most accurately.

We first import the libraries we need:

In [182]:
import numpy as np
import pandas as pd
import os, sys
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

**THE DATA**

The data for this project comes from the Parkinson's dataset in the UCI Machine Learning Repository. Here's a link:
https://archive.ics.uci.edu/ml/datasets/parkinsons

It has the following features/inputs:
- name - ASCII subject name and recording number
- MDVP:Fo(Hz) - Average vocal fundamental frequency
- MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
- MDVP:Flo(Hz) - Minimum vocal fundamental frequency
- MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP - Several measures of variation in fundamental frequency
- MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA - Several measures of variation in amplitude
- NHR,HNR - Two measures of ratio of noise to tonal components in the voice
- RPDE,D2 - Two nonlinear dynamical complexity measures
- DFA - Signal fractal scaling exponent
- spread1,spread2,PPE - Three nonlinear measures of fundamental frequency variation


The label, that is, the variable holding the Parkinson's positive/negative result, is:
- status - Health status of the subject: 1 = Parkinson's, 0 = healthy

In [5]:
# Raw string, so the backslashes in the Windows path are not read as escape sequences
df=pd.read_csv(r'R:\Rahul\Projects\Parkinsons Project\parkinsons.data')
df

Unnamed: 0,name,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
0,phon_R01_S01_1,119.992,157.302,74.997,0.00784,0.00007,0.00370,0.00554,0.01109,0.04374,...,0.06545,0.02211,21.033,1,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654
1,phon_R01_S01_2,122.400,148.650,113.819,0.00968,0.00008,0.00465,0.00696,0.01394,0.06134,...,0.09403,0.01929,19.085,1,0.458359,0.819521,-4.075192,0.335590,2.486855,0.368674
2,phon_R01_S01_3,116.682,131.111,111.555,0.01050,0.00009,0.00544,0.00781,0.01633,0.05233,...,0.08270,0.01309,20.651,1,0.429895,0.825288,-4.443179,0.311173,2.342259,0.332634
3,phon_R01_S01_4,116.676,137.871,111.366,0.00997,0.00009,0.00502,0.00698,0.01505,0.05492,...,0.08771,0.01353,20.644,1,0.434969,0.819235,-4.117501,0.334147,2.405554,0.368975
4,phon_R01_S01_5,116.014,141.781,110.655,0.01284,0.00011,0.00655,0.00908,0.01966,0.06425,...,0.10470,0.01767,19.649,1,0.417356,0.823484,-3.747787,0.234513,2.332180,0.410335
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
190,phon_R01_S50_2,174.188,230.978,94.261,0.00459,0.00003,0.00263,0.00259,0.00790,0.04087,...,0.07008,0.02764,19.517,0,0.448439,0.657899,-6.538586,0.121952,2.657476,0.133050
191,phon_R01_S50_3,209.516,253.017,89.488,0.00564,0.00003,0.00331,0.00292,0.00994,0.02751,...,0.04812,0.01810,19.147,0,0.431674,0.683244,-6.195325,0.129303,2.784312,0.168895
192,phon_R01_S50_4,174.688,240.005,74.287,0.01360,0.00008,0.00624,0.00564,0.01873,0.02308,...,0.03804,0.10715,17.883,0,0.407567,0.655683,-6.787197,0.158453,2.679772,0.131728
193,phon_R01_S50_5,198.764,396.961,74.904,0.00740,0.00004,0.00370,0.00390,0.01109,0.02296,...,0.03794,0.07223,19.020,0,0.451221,0.643956,-6.744577,0.207454,2.138608,0.123306


**DATA PREPROCESSING**

Before we run any analysis or prediction, the data must be preprocessed through cleaning and transformation.

First, we retrieve the Y values (the outputs):

In [141]:
#scol = df.columns.get_loc("status")   #to store the column number of status
#y = df.values[:,scol]
y = df.loc[:,'status'].values
y


array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
      dtype=int64)

Next, we retrieve the X values (the inputs):

In [133]:
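# Drop the label column, then skip the first remaining column (name) so only numeric features are kept
X = df.drop(['status'],axis=1).values[:,1:]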
X

array([[119.992, 157.302, 74.997, ..., 0.266482, 2.301442, 0.284654],
       [122.4, 148.65, 113.819, ..., 0.33559, 2.486855, 0.368674],
       [116.682, 131.111, 111.555, ..., 0.311173, 2.342259, 0.332634],
       ...,
       [174.688, 240.005, 74.287, ..., 0.158453, 2.679772, 0.131728],
       [198.764, 396.961, 74.904, ..., 0.207454, 2.138608, 0.123306],
       [214.289, 260.277, 77.973, ..., 0.190667, 2.555477, 0.148569]],
      dtype=object)

We count the positive and negative cases in the dataset so that we can compare against them later:

In [134]:
p_count = 0    #positive count
n_count = 0    #negative count
for i in range(0,len(y),1):
    if(y[i] == 0):
        n_count = n_count+1
    else:
        p_count = p_count+1

p_count, n_count

(147, 48)
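
The same counts can be obtained without an explicit loop, and they also give the accuracy a model would reach by always predicting the majority class. This is a minimal sketch: `y_demo` is a stand-in array built only from the counts printed above, not the real `status` column, and `majority_baseline` is a name introduced here for illustration.

```python
import numpy as np

# Stand-in label vector with the class counts reported above
# (147 positive, 48 negative); the real labels come from df['status'].
y_demo = np.array([1] * 147 + [0] * 48)

# np.bincount counts how often each non-negative integer label occurs:
# index 0 holds the negative count, index 1 the positive count.
n_count, p_count = np.bincount(y_demo, minlength=2)

# Accuracy of a trivial classifier that always predicts the larger class.
majority_baseline = max(p_count, n_count) / len(y_demo)

print(p_count, n_count)             # 147 48
print(round(majority_baseline, 4))  # 0.7538
```

Any accuracy score reported later is easier to judge against this ~75% baseline than against 0%.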

Next, we rescale each feature to the range [0,1] using min-max scaling.

In [155]:
scaler=MinMaxScaler((0,1))
xd = scaler.fit_transform(X)
yd = y
xd

array([[0.18430827, 0.11259173, 0.05481479, ..., 0.58576513, 0.39066128,
        0.4973096 ],
       [0.19832685, 0.09493044, 0.2783228 , ..., 0.74133704, 0.47314522,
        0.67132602],
       [0.16503854, 0.05912816, 0.26528838, ..., 0.68637091, 0.40881938,
        0.59668246],
       ...,
       [0.50273036, 0.28141298, 0.05072714, ..., 0.34257652, 0.55896743,
        0.18057983],
       [0.6428929 , 0.60180655, 0.05427936, ..., 0.45288473, 0.31822198,
        0.16313677],
       [0.73327434, 0.32279413, 0.07194837, ..., 0.41509481, 0.50367281,
        0.21545975]])
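
For each feature column, min-max scaling computes `(x - min) / (max - min)`, so the smallest value in the column maps to 0 and the largest to 1. The sketch below checks this by hand against `MinMaxScaler` on a tiny made-up matrix; `X_demo` and its values are illustrative only and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up matrix: 3 samples, 2 features on very different scales.
X_demo = np.array([[1.0, 10.0],
                   [2.0, 30.0],
                   [4.0, 20.0]])

# Manual min-max scaling, applied column by column.
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
manual = (X_demo - col_min) / (col_max - col_min)

# The same transformation through scikit-learn.
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_demo)

print(np.allclose(manual, scaled))  # True
```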

We then split the data into training and test sets, with 80% of the data used for training and 20% for testing.

In [156]:
x_train,x_test,y_train,y_test=train_test_split(xd, yd, test_size=0.2, random_state=7)
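
In the cells above, the scaler is fitted on the full dataset before the split, so the test rows contribute to the min/max values used for scaling. One possible alternative, sketched here on synthetic stand-in data, is to split first and wrap the scaler and classifier in a pipeline so that scaling is learned from the training rows only. The `stratify` argument is also an addition not used in this notebook; it keeps the class ratio similar in both sets. `X_demo`, `y_demo` and `model` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the voice features: 195 rows, 22 columns,
# roughly 75% positive. It is NOT the Parkinson's data.
X_demo, y_demo = make_classification(
    n_samples=195, n_features=22, weights=[0.25, 0.75], random_state=0
)

# Split first; stratify keeps the positive/negative ratio in both parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=7, stratify=y_demo
)

# fit() learns the min/max from x_tr only; score() reuses those values
# to transform x_te before predicting.
model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(x_tr, y_tr)

print(x_tr.shape, x_te.shape)
print(model.score(x_te, y_te))
```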

First, we try the Support Vector Machine algorithm with the polynomial kernel:

In [161]:
x_train
y_train

clf=SVC(kernel='poly')
clf.fit(x_train,y_train)

SVC(kernel='poly')

In [162]:
y_test

array([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1], dtype=int64)

We predict the Y values for the test set with the trained model and then compute the accuracy score.

In [165]:
from sklearn import metrics
y_pred=clf.predict(x_test)
y_pred
print("SVM Polynomial Accuracy score = ",metrics.accuracy_score(y_test,y_pred)*100,"%")


SVM Polynomial Accuracy score =  92.3076923076923 %
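
The accuracy score is the number of correctly classified test samples divided by the total, so 92.3% on 39 samples corresponds to 36 correct predictions. Accuracy alone does not say which class the errors fall in. The sketch below uses the `y_test` labels printed above together with HYPOTHETICAL predictions (`y_hyp`, made up for illustration, not the SVM's actual output) to show how a confusion matrix and per-class recall expose that.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Test labels copied from the y_test output above (39 samples, 7 healthy).
y_true = np.array([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1,
                   1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1])

# HYPOTHETICAL predictions: all correct except the first three healthy
# samples, which are flipped to 1. This is not the model's real output.
y_hyp = y_true.copy()
y_hyp[np.flatnonzero(y_true == 0)[:3]] = 1

acc = accuracy_score(y_true, y_hyp)

# Rows are true classes (0, 1); columns are predicted classes (0, 1).
cm = confusion_matrix(y_true, y_hyp, labels=[0, 1])

# Recall for the healthy class: share of true 0s that were predicted 0.
healthy_recall = recall_score(y_true, y_hyp, pos_label=0)

print(acc)
print(cm)
print(healthy_recall)
```

In this made-up case the accuracy is 36/39, the same as the score above, yet only 4 of the 7 healthy samples are recognized.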


As seen above, the SVM with the polynomial kernel has an accuracy of ~92% on the test set (36 of the 39 samples classified correctly), which is pretty good. Let's try the same with the RBF kernel.

In [167]:
clf=SVC(kernel='rbf')
clf.fit(x_train,y_train)

SVC()

In [168]:
from sklearn import metrics
y_pred=clf.predict(x_test)
y_pred
print("SVM RBF Accuracy score = ",metrics.accuracy_score(y_test,y_pred)*100,"%")

SVM RBF Accuracy score =  87.17948717948718 %


The RBF kernel scores ~87%, about five percentage points below the polynomial kernel.

**Decision Trees**

In [173]:
clf=DecisionTreeClassifier()
clf.fit(x_train,y_train)
y_pred=clf.predict(x_test)
y_pred
print("Decision Tree Accuracy score = ",metrics.accuracy_score(y_test,y_pred)*100,"%")

Decision Tree Accuracy score =  84.61538461538461 %


At ~85%, the decision tree does worse than both SVM variants.

**K-Nearest Neighbor**

In [177]:
k = KNeighborsClassifier(n_neighbors = 3)         #Creates a classifier object 
k.fit(x_train, y_train)

KNeighborsClassifier(n_neighbors=3)

In [178]:
y_pred=k.predict(x_test)
y_pred
print("KNN Accuracy score = ",metrics.accuracy_score(y_test,y_pred)*100,"%")

KNN Accuracy score =  97.43589743589743 %




At ~97%, the KNN result is the best thus far.
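
With `n_neighbors = 3`, the classifier labels a sample by majority vote among its three closest training points. By default scikit-learn measures closeness with Euclidean distance, which is also why rescaling the features matters for this algorithm. A minimal sketch of that vote on made-up 2-D points (`X_tr`, `y_tr` and `query` are illustrative and not from the dataset):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up training points and labels.
X_tr = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8], [0.8, 0.9]])
y_tr = np.array([0, 0, 1, 1, 1])
query = np.array([0.7, 0.7])

# Manual KNN: Euclidean distance to every training point,
# then a majority vote among the 3 nearest.
dist = np.linalg.norm(X_tr - query, axis=1)
nearest = np.argsort(dist)[:3]
manual_label = np.bincount(y_tr[nearest]).argmax()

# The same prediction through scikit-learn.
sk_label = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr).predict([query])[0]

print(manual_label, sk_label)  # 1 1
```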

**Naive Bayes**

In [181]:
gnb = GaussianNB()        #Creates a classifier object 
gnb.fit(x_train, y_train)
y_pred=gnb.predict(x_test)
y_pred
print("Naive Bayes Accuracy score = ",metrics.accuracy_score(y_test,y_pred)*100,"%")

Naive Bayes Accuracy score =  71.7948717948718 %


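At ~72%, Naive Bayes produced the worst results thus far. This is lower than the ~82% that labeling every test sample as positive would score, since 32 of the 39 test labels are 1.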

Of all the algorithms tried above, K-Nearest Neighbor produced the most accurate results in predicting Parkinson's disease on this test split. Because the test set holds only 39 recordings, each misclassified sample shifts the accuracy by about 2.6 percentage points, so the ranking is best confirmed on more data. In the future, we may use more features to improve accuracy, and may also analyze the average age of onset, stage-wise statistics, region-wise statistics and so on.
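
One way to make such a comparison less dependent on a single train/test split is to score every model with stratified k-fold cross-validation. The sketch below is an illustration under stated assumptions, not part of the original analysis: it runs on synthetic stand-in data (`X_demo`, `y_demo`), uses 5 folds chosen arbitrarily, and reuses the notebook's classifiers with the same settings except for a fixed `random_state` on the decision tree.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the Parkinson's dataset).
X_demo, y_demo = make_classification(
    n_samples=195, n_features=22, weights=[0.25, 0.75], random_state=0
)

models = {
    "SVM (poly)": SVC(kernel="poly"),
    "SVM (rbf)": SVC(kernel="rbf"),
    "Decision tree": DecisionTreeClassifier(random_state=7),
    "KNN (k=3)": KNeighborsClassifier(n_neighbors=3),
    "Naive Bayes": GaussianNB(),
}

# Five stratified folds; every sample is used for testing exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)

results = {}
for name, estimator in models.items():
    # Scaling sits inside the pipeline so it is refitted on each training fold.
    pipeline = make_pipeline(MinMaxScaler(), estimator)
    scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())

for name, (mean, std) in results.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

The standard deviation across folds gives a rough sense of how much each score moves with the choice of split.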