# Parkinson's Disease Detection

### Problem Statement

A health company, hospital, or research institute is looking for solutions for Parkinson's disease, which affects many patients. The institute has asked a machine learning engineer or data scientist to help build a system that detects the disease quickly, reducing both human error and the time health professionals need to give patients answers. Such a system would also help the institution spend less on resources. The aim of this project is therefore to build a model that predicts, with minimal effort, whether a patient tests positive or negative for the disease.

## Data Gathering and Analysis

### Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import svm
from sklearn.metrics import accuracy_score

### Import Dataset

In [2]:
df = pd.read_csv(r"C:\Users\Cash Crusaders\Desktop\My Portfolio\Projects\Data Science Projects\Machine Learning Project 6 - Parkinson's Disease Detection\dataset\Parkinsson disease.csv")

In [3]:
# print the first 5 rows of the dataframe
df.head()

Unnamed: 0,name,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
0,phon_R01_S01_1,119.992,157.302,74.997,0.00784,7e-05,0.0037,0.00554,0.01109,0.04374,...,0.06545,0.02211,21.033,1,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654
1,phon_R01_S01_2,122.4,148.65,113.819,0.00968,8e-05,0.00465,0.00696,0.01394,0.06134,...,0.09403,0.01929,19.085,1,0.458359,0.819521,-4.075192,0.33559,2.486855,0.368674
2,phon_R01_S01_3,116.682,131.111,111.555,0.0105,9e-05,0.00544,0.00781,0.01633,0.05233,...,0.0827,0.01309,20.651,1,0.429895,0.825288,-4.443179,0.311173,2.342259,0.332634
3,phon_R01_S01_4,116.676,137.871,111.366,0.00997,9e-05,0.00502,0.00698,0.01505,0.05492,...,0.08771,0.01353,20.644,1,0.434969,0.819235,-4.117501,0.334147,2.405554,0.368975
4,phon_R01_S01_5,116.014,141.781,110.655,0.01284,0.00011,0.00655,0.00908,0.01966,0.06425,...,0.1047,0.01767,19.649,1,0.417356,0.823484,-3.747787,0.234513,2.33218,0.410335


In [4]:
# get the shape of the dataset
df.shape

(195, 24)

In [5]:
# get info about the df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              195 non-null    object 
 1   MDVP:Fo(Hz)       195 non-null    float64
 2   MDVP:Fhi(Hz)      195 non-null    float64
 3   MDVP:Flo(Hz)      195 non-null    float64
 4   MDVP:Jitter(%)    195 non-null    float64
 5   MDVP:Jitter(Abs)  195 non-null    float64
 6   MDVP:RAP          195 non-null    float64
 7   MDVP:PPQ          195 non-null    float64
 8   Jitter:DDP        195 non-null    float64
 9   MDVP:Shimmer      195 non-null    float64
 10  MDVP:Shimmer(dB)  195 non-null    float64
 11  Shimmer:APQ3      195 non-null    float64
 12  Shimmer:APQ5      195 non-null    float64
 13  MDVP:APQ          195 non-null    float64
 14  Shimmer:DDA       195 non-null    float64
 15  NHR               195 non-null    float64
 16  HNR               195 non-null    float64
 1

### Missing Values

In [6]:
# check for null values
df.isnull().sum()

name                0
MDVP:Fo(Hz)         0
MDVP:Fhi(Hz)        0
MDVP:Flo(Hz)        0
MDVP:Jitter(%)      0
MDVP:Jitter(Abs)    0
MDVP:RAP            0
MDVP:PPQ            0
Jitter:DDP          0
MDVP:Shimmer        0
MDVP:Shimmer(dB)    0
Shimmer:APQ3        0
Shimmer:APQ5        0
MDVP:APQ            0
Shimmer:DDA         0
NHR                 0
HNR                 0
status              0
RPDE                0
DFA                 0
spread1             0
spread2             0
D2                  0
PPE                 0
dtype: int64

In [8]:
# how is the data distributed
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
MDVP:Fo(Hz),195.0,154.228641,41.390065,88.333,117.572,148.79,182.769,260.105
MDVP:Fhi(Hz),195.0,197.104918,91.491548,102.145,134.8625,175.829,224.2055,592.03
MDVP:Flo(Hz),195.0,116.324631,43.521413,65.476,84.291,104.315,140.0185,239.17
MDVP:Jitter(%),195.0,0.00622,0.004848,0.00168,0.00346,0.00494,0.007365,0.03316
MDVP:Jitter(Abs),195.0,4.4e-05,3.5e-05,7e-06,2e-05,3e-05,6e-05,0.00026
MDVP:RAP,195.0,0.003306,0.002968,0.00068,0.00166,0.0025,0.003835,0.02144
MDVP:PPQ,195.0,0.003446,0.002759,0.00092,0.00186,0.00269,0.003955,0.01958
Jitter:DDP,195.0,0.00992,0.008903,0.00204,0.004985,0.00749,0.011505,0.06433
MDVP:Shimmer,195.0,0.029709,0.018857,0.00954,0.016505,0.02297,0.037885,0.11908
MDVP:Shimmer(dB),195.0,0.282251,0.194877,0.085,0.1485,0.221,0.35,1.302


### Imbalanced Data

In [9]:
# check the distribution of the target variable
df['status'].value_counts()

1    147
0     48
Name: status, dtype: int64

1 - positive (Parkinson's)
0 - negative (healthy)
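
Because the classes are imbalanced, it helps to know what accuracy a trivial model would reach by always predicting the majority class. The sketch below computes that baseline from the counts reported above; the variable names are illustrative and not part of the original notebook.

```python
from collections import Counter

# Class counts reported by df['status'].value_counts() above
counts = Counter({1: 147, 0: 48})

total = sum(counts.values())
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = majority_count / total

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
```

Any classifier trained on this data should be judged against this baseline rather than against 50%.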

In [10]:
# group the data by the target variable
df.groupby('status').mean()

status,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,MDVP:Shimmer(dB),...,MDVP:APQ,Shimmer:DDA,NHR,HNR,RPDE,DFA,spread1,spread2,D2,PPE
0,181.937771,223.63675,145.207292,0.003866,2.3e-05,0.001925,0.002056,0.005776,0.017615,0.162958,...,0.013305,0.028511,0.011483,24.67875,0.442552,0.695716,-6.759264,0.160292,2.154491,0.123017
1,145.180762,188.441463,106.893558,0.006989,5.1e-05,0.003757,0.0039,0.011273,0.033658,0.321204,...,0.0276,0.053027,0.029211,20.974048,0.516816,0.725408,-5.33342,0.248133,2.456058,0.233828


## Data Preprocessing

In [11]:
# separate the target variable from the feature variables
X = df.drop(columns=['status', 'name'])
y = df['status']

In [12]:
X

Unnamed: 0,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,MDVP:Shimmer(dB),...,MDVP:APQ,Shimmer:DDA,NHR,HNR,RPDE,DFA,spread1,spread2,D2,PPE
0,119.992,157.302,74.997,0.00784,0.00007,0.00370,0.00554,0.01109,0.04374,0.426,...,0.02971,0.06545,0.02211,21.033,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654
1,122.400,148.650,113.819,0.00968,0.00008,0.00465,0.00696,0.01394,0.06134,0.626,...,0.04368,0.09403,0.01929,19.085,0.458359,0.819521,-4.075192,0.335590,2.486855,0.368674
2,116.682,131.111,111.555,0.01050,0.00009,0.00544,0.00781,0.01633,0.05233,0.482,...,0.03590,0.08270,0.01309,20.651,0.429895,0.825288,-4.443179,0.311173,2.342259,0.332634
3,116.676,137.871,111.366,0.00997,0.00009,0.00502,0.00698,0.01505,0.05492,0.517,...,0.03772,0.08771,0.01353,20.644,0.434969,0.819235,-4.117501,0.334147,2.405554,0.368975
4,116.014,141.781,110.655,0.01284,0.00011,0.00655,0.00908,0.01966,0.06425,0.584,...,0.04465,0.10470,0.01767,19.649,0.417356,0.823484,-3.747787,0.234513,2.332180,0.410335
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
190,174.188,230.978,94.261,0.00459,0.00003,0.00263,0.00259,0.00790,0.04087,0.405,...,0.02745,0.07008,0.02764,19.517,0.448439,0.657899,-6.538586,0.121952,2.657476,0.133050
191,209.516,253.017,89.488,0.00564,0.00003,0.00331,0.00292,0.00994,0.02751,0.263,...,0.01879,0.04812,0.01810,19.147,0.431674,0.683244,-6.195325,0.129303,2.784312,0.168895
192,174.688,240.005,74.287,0.01360,0.00008,0.00624,0.00564,0.01873,0.02308,0.256,...,0.01667,0.03804,0.10715,17.883,0.407567,0.655683,-6.787197,0.158453,2.679772,0.131728
193,198.764,396.961,74.904,0.00740,0.00004,0.00370,0.00390,0.01109,0.02296,0.241,...,0.01588,0.03794,0.07223,19.020,0.451221,0.643956,-6.744577,0.207454,2.138608,0.123306


In [13]:
y

0      1
1      1
2      1
3      1
4      1
      ..
190    0
191    0
192    0
193    0
194    0
Name: status, Length: 195, dtype: int64

In [14]:
# split the data into training and test sets (default test_size=0.25)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)


In [15]:
print(X_train.shape, X_test.shape)

(146, 22) (49, 22)


In [16]:
print(y_train.shape, y_test.shape)

(146,) (49,)
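
With only 48 negative cases, a purely random split can leave the test set with a different class ratio than the full dataset. A minimal sketch of a stratified split follows, using synthetic stand-in arrays with the same class counts. The `stratify` argument is not used in the original notebook; it is shown here as one possible option.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class counts as the dataset (147 positive, 48 negative)
y_demo = np.array([1] * 147 + [0] * 48)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# stratify=y_demo keeps the class ratio roughly equal in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

print("overall positive rate:", round(float(y_demo.mean()), 3))
print("train positive rate:  ", round(float(y_tr.mean()), 3))
print("test positive rate:   ", round(float(y_te.mean()), 3))
```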


## Data Standardization

In [17]:
sc = StandardScaler()
sc.fit(X_train)

StandardScaler()

In [18]:
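# transform() returns new arrays and leaves X_train and X_test unchanged;
# assign the results (e.g. X_train = sc.transform(X_train)) to train on standardized features
sc.transform(X_train)
sc.transform(X_test)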

array([[-0.99491236, -0.74459136, -0.29897103, ...,  0.96450045,
         0.10664994,  0.05355063],
       [-0.21458024,  1.77944465, -0.81790045, ...,  0.16703424,
        -0.28608523,  0.14967735],
       [-0.8573791 , -0.73620247, -0.50431927, ..., -0.21635141,
         0.15827968,  0.02210831],
       ...,
       [ 1.18434627,  0.20716237,  1.92974213, ..., -0.67052798,
         0.1772297 , -1.57031924],
       [ 0.51814892,  0.43857107, -0.53820744, ..., -1.25109542,
         0.71122081, -0.76213917],
       [ 2.05780827,  0.62389755,  2.64786493, ..., -1.53161093,
         0.10052107, -1.16942165]])
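
The behaviour of `StandardScaler` can be checked on a small example. The sketch below uses randomly generated arrays (not the real features) to show that the scaler is fitted on the training data only, that `transform` returns new arrays, and that the inputs keep their original scale. All names ending in `_demo` or `_scaled` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Random stand-in data on a scale similar to MDVP:Fo(Hz); not the real dataset
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=150.0, scale=40.0, size=(146, 3))
X_test_demo = rng.normal(loc=150.0, scale=40.0, size=(49, 3))

# Fit on the training data only, then apply the same statistics to both sets
scaler = StandardScaler().fit(X_train_demo)
X_train_scaled = scaler.transform(X_train_demo)
X_test_scaled = scaler.transform(X_test_demo)

print("original train means:", X_train_demo.mean(axis=0).round(2))   # unchanged scale
print("scaled train means:  ", X_train_scaled.mean(axis=0).round(6)) # close to 0
print("scaled train stds:   ", X_train_scaled.std(axis=0).round(6))  # close to 1
```

The test set is scaled with the training mean and standard deviation, so its scaled means are close to, but not exactly, zero.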

In [19]:
print(X_train)

     MDVP:Fo(Hz)  MDVP:Fhi(Hz)  MDVP:Flo(Hz)  MDVP:Jitter(%)  \
65       228.969       239.541       113.201         0.00238   
104      154.003       160.267       128.621         0.00183   
159      127.930       138.752       112.173         0.00605   
78        95.385       102.145        90.264         0.00608   
76       112.876       148.826       106.981         0.00500   
..           ...           ...           ...             ...   
106      155.078       163.736       144.148         0.00168   
14       152.845       163.305        75.836         0.00294   
92       148.272       164.989       142.299         0.00459   
179      148.143       155.982       135.041         0.00392   
102      139.224       586.567        66.157         0.03011   

     MDVP:Jitter(Abs)  MDVP:RAP  MDVP:PPQ  Jitter:DDP  MDVP:Shimmer  \
65            0.00001   0.00136   0.00140     0.00408       0.01745   
104           0.00001   0.00076   0.00100     0.00229       0.01030   
159           0.00

In [20]:
print(X_test)

     MDVP:Fo(Hz)  MDVP:Fhi(Hz)  MDVP:Flo(Hz)  MDVP:Jitter(%)  \
138      112.239       126.609       104.095         0.00472   
16       144.188       349.259        82.764         0.00544   
155      117.870       127.349        95.654         0.00647   
96       159.116       168.913       144.811         0.00342   
68       143.533       162.215        65.809         0.01101   
153      121.345       139.644        98.250         0.00684   
55       109.860       126.358       104.437         0.00874   
15       142.167       217.455        83.159         0.00369   
112      204.664       221.300       189.621         0.00841   
111      208.519       220.315       199.020         0.00609   
184      116.848       217.552        99.503         0.00531   
18       153.046       175.829        68.623         0.00742   
82       100.960       110.019        95.628         0.00606   
9         95.056       120.103        91.226         0.00532   
164      102.273       142.830        85

## Model Training

In [21]:
classifier = svm.SVC(kernel = 'linear')

In [22]:
classifier.fit(X_train, y_train)

SVC(kernel='linear')
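
One way to make sure the classifier always receives standardized features, at both training and prediction time, is to chain the scaler and the SVM in a scikit-learn pipeline. The sketch below uses synthetic data from `make_classification` as a stand-in for the 22 voice features; it is an alternative arrangement, not the notebook's original training code.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data standing in for the 22 voice features (not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=195, n_features=22, n_informative=8, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

# fit() fits the scaler on the training data, then fits the SVM on the scaled output
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X_tr, y_tr)

# score() and predict() pass the data through the fitted scaler before the SVM
test_accuracy = model.score(X_te, y_te)
print(f"test accuracy on synthetic data: {test_accuracy:.3f}")
```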

### Model Evaluation

In [23]:
# accuracy score on training data
y_pred_train = classifier.predict(X_train)
acc_score = accuracy_score(y_train, y_pred_train)
print("Accuracy score for training : ", acc_score)

Accuracy score for training :  0.8904109589041096


In [24]:
# accuracy score on testing data
y_pred_test = classifier.predict(X_test)
acc_score = accuracy_score(y_test, y_pred_test)
print("Accuracy score for testing : ", acc_score)

Accuracy score for testing :  0.8775510204081632
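
Accuracy alone can hide how a model treats the minority class. The sketch below derives sensitivity and specificity from a confusion matrix. The label arrays are hypothetical and serve only to illustrate the calculation; they are not the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only (not the notebook's predictions)
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 1, 1, 0, 0]

# With labels=[0, 1], ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)  # share of positive cases that were detected
specificity = tn / (tn + fp)  # share of healthy cases that were correctly cleared

print(f"accuracy={accuracy:.2f} sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```

In this toy case the accuracy is 0.70 even though only half of the healthy cases are classified correctly, which is the kind of gap a single accuracy figure does not reveal.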


### Predictive System

In [26]:
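# input data: 22 feature values in the same column order as X
# note: the 16th value (1) sits in the HNR position and looks like a status label rather than an HNR reading
input_data = (152.845,163.305,75.836,0.00294,0.00002,0.00121,0.00149,0.00364,0.01828,0.158,0.01064,0.00972,0.01246,0.03191,0.00609,1,0.474791,0.654027,-6.105098,0.203653,2.125618,0.1701)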
# change the input data into numpy array
input_arr = np.asarray(input_data)

# reshape the input data 
input_data = input_arr.reshape(1,-1)

# apply the standard scaler to the input data
input_data = sc.transform(input_data)

# predict using the model
prediction = classifier.predict(input_data)
# check the prediction
print(prediction)

if prediction[0]==0:
    print("The Person does not have Parkinsons Disease")
else:
    print("The Person has Parkinsons")


[1]
The Person has Parkinsons
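
The prediction steps above can be wrapped in a small helper that checks the number of input values before calling the model. This is a sketch under stated assumptions: `predict_status`, `N_FEATURES`, and `demo_model` are hypothetical names, and the stand-in model is trained on synthetic data rather than on the notebook's dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

N_FEATURES = 22  # feature columns left after dropping 'status' and 'name'


def predict_status(model, features):
    """Return 0 or 1 for one recording, given feature values in training-column order."""
    arr = np.asarray(features, dtype=float)
    if arr.shape != (N_FEATURES,):
        raise ValueError(f"expected {N_FEATURES} feature values, got shape {arr.shape}")
    # reshape to a single-row 2D array, as scikit-learn estimators expect
    return int(model.predict(arr.reshape(1, -1))[0])


# Hypothetical stand-in model trained on synthetic data (not the notebook's classifier)
X_demo, y_demo = make_classification(n_samples=195, n_features=N_FEATURES, random_state=0)
demo_model = make_pipeline(StandardScaler(), SVC(kernel="linear")).fit(X_demo, y_demo)

label = predict_status(demo_model, X_demo[0])
print("positive" if label == 1 else "negative")
```

The length check catches inputs with a missing or extra value, but it cannot detect a value placed in the wrong column, so the column order still has to be verified by hand.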


