# Parkinson's Disease Classification with Replicated Acoustic Features

* The dataset is a collection of acoustic features extracted from 3 voice samples from each of 80 people, 40 of whom tested positive for Parkinson's disease.

```
80 persons
3 voice samples per person
80 * 3 = 240 samples in total
```

## **Abstract**
  - Contains acoustic features extracted from 3 replicated voice recordings of the sustained /a/ phonation for each of the 80 subjects (40 tested positive).

## **Dataset Information**
  - **Data Set Characteristics**: Multivariate
  - **Number of Instances**: 240
  - **Area**: Life
  - **Number of Attributes**: 46
  - **Associated Tasks**: Classification
  - **Date Donated**: 10th April 2019

## **Dataset Links**
  - [dataset-csv](https://archive.ics.uci.edu/ml/machine-learning-databases/00489/ReplicatedAcousticFeatures-ParkinsonDatabase.csv)
  - [dataset-description](https://archive.ics.uci.edu/ml/datasets/Parkinson+Dataset+with+replicated+acoustic+features+)

## **Information**
  1. Rows cannot be treated independently, because each one is one of three replications from the same individual. Observations are dependent within a subject but independent between subjects. Traditional machine learning techniques, which assume independent instances, therefore do not apply directly to this dataset: there are 240 instances but only 80 subjects. Techniques such as those presented in Naranjo et al. (2016) and Naranjo et al. (2017), or others designed specifically for replicated data, can be used instead.
  2. The concept of replication considered here does not match the classical concept of statistical repeated measurements. The term 'replications' refers to the collection of features extracted from voice recordings belonging to the same subject. Because the features are extracted from multiple consecutive voice recordings of the same subject, they should in principle be identical. Imperfections in the technology and the subject's own biological variability produce replicated features that are not identical, but that are more similar to one another than to features from different subjects.
  3. All information about how the dataset was generated is presented in Naranjo et al. (2016).
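
The dependence between replications described above matters as soon as the data is split for evaluation. The following is a minimal sketch, not the method of Naranjo et al., of one common precaution: keeping all three rows of a subject on the same side of a split with scikit-learn's `GroupShuffleSplit`. The arrays are synthetic stand-ins, and `subject_ids` is a hypothetical name playing the role of the `ID` column.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Synthetic stand-in for the real table: 80 subjects x 3 replications each.
rng = np.random.default_rng(0)
n_subjects, n_reps, n_features = 80, 3, 5
subject_ids = np.repeat(np.arange(n_subjects), n_reps)       # role of the ID column (assumption)
X = rng.normal(size=(n_subjects * n_reps, n_features))
y = np.repeat(rng.integers(0, 2, size=n_subjects), n_reps)   # one label per subject

# Split on subjects rather than rows, so no subject appears in both sets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=30)
train_idx, test_idx = next(splitter.split(X, y, groups=subject_ids))

train_subjects = set(subject_ids[train_idx])
test_subjects = set(subject_ids[test_idx])
print(len(train_subjects), len(test_subjects))
```

With a row-wise split, replications of one subject can land in both the training and the test set, which tends to make test accuracy look better than it would on unseen subjects.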
  
## **Attribute Information**
  1. **ID**: Subject's identifier.
  2. **Recording**: Number of the recording.
  3. **Status**: 0=Healthy; 1=PD
  4. **Gender**: 0=Man; 1=Woman
  5. Pitch local perturbation measures: relative jitter (**Jitter_rel**), absolute jitter (**Jitter_abs**), relative average perturbation (**Jitter_RAP**), and pitch perturbation quotient (**Jitter_PPQ**).
  6. Amplitude perturbation measures: local shimmer (**Shim_loc**), shimmer in dB (**Shim_dB**), 3-point amplitude perturbation quotient (**Shim_APQ3**), 5-point amplitude perturbation quotient (**Shim_APQ5**), and 11-point amplitude perturbation quotient (**Shim_APQ11**).
  7. Harmonic-to-noise ratio measures: harmonic-to-noise ratio in the frequency band 0-500 Hz (**HNR05**), in 0-1500 Hz (**HNR15**), in 0-2500 Hz (**HNR25**), in 0-3500 Hz (**HNR35**), and in 0-3800 Hz (**HNR38**).
  8. Mel frequency cepstral coefficient-based spectral measures of order 0 to 12 (**MFCC0, MFCC1,..., MFCC12**) and their derivatives (**Delta0, Delta1,..., Delta12**).
  9. Recurrence period density entropy (**RPDE**).
  10. Detrended fluctuation analysis (**DFA**).
  11. Pitch period entropy (**PPE**).
  12. Glottal-to-noise excitation ratio (**GNE**).

In [1]:
#@title Getting the Dataset from the [dataset-link](https://archive.ics.uci.edu/ml/machine-learning-databases/00489/ReplicatedAcousticFeatures-ParkinsonDatabase.csv).
!wget -q https://archive.ics.uci.edu/ml/machine-learning-databases/00489/ReplicatedAcousticFeatures-ParkinsonDatabase.csv

In [3]:
#@title Importing the Necessary Libraries.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
import numpy as np

In [None]:
#@title Converting the CSV data to a [pandas.DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).
dataset = pd.read_csv("ReplicatedAcousticFeatures-ParkinsonDatabase.csv")
dataset = dataset.drop('ID',axis=1)
x = dataset.drop('Status',axis=1,inplace=False)
y = dataset['Status']
dataset.head()

In [None]:
#@title Plotting a Correlation HeatMap for the Attributes in the Dataset.
%matplotlib inline
# Compute the correlation matrix once and reuse it for the plot.
corrMatrix = dataset.corr()
fig, ax = plt.subplots(figsize=(10,10))
hm = sns.heatmap(corrMatrix, xticklabels=True, yticklabels=True, linewidths=1, ax=ax)
hm.set_title("Heat Map")

In [None]:
#@title Performing Principal Component Analysis for the given Dataset.
scaler=StandardScaler()
x = scaler.fit_transform(x)

components = 10
pca = PCA(n_components=components)
principalComponents = pca.fit_transform(x)
x = pd.DataFrame(data = principalComponents, columns = ["pc"+str(i) for i in range(components)])

ax = sns.barplot( x=["pc"+str(i) for i in range(components)],y=pca.explained_variance_)
ax.set_title("PCA")
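
The bar plot above shows the absolute variance of each component. As a complementary view, the sketch below prints the cumulative share of variance retained, using `explained_variance_ratio_`. It runs on synthetic data with a hypothetical column count of 20, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder: 240 rows, 20 feature columns (both are assumptions).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(240, 20))

X_scaled = StandardScaler().fit_transform(X_demo)
pca_demo = PCA(n_components=10).fit(X_scaled)

# Running total of the fraction of variance explained by the first k components.
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
for i, share in enumerate(cumulative):
    print(f"pc0..pc{i}: {share:.3f}")
```

If the cumulative share flattens out well before the last component, fewer components may be enough; if it is still climbing at `pc9`, 10 components may be discarding useful variance.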

In [None]:
#@title Performing Train Test Split with 80-20.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=30)
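
In the cells above, the scaler and PCA are fitted on all rows before the split, so the test rows influence the transformation. One possible variation, sketched here under assumptions and not what this notebook does, wraps the steps in a scikit-learn `Pipeline` so that they are fitted on the training portion only. The data is synthetic, and `RandomForestClassifier` is swapped in for the regressor-plus-rounding used in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: class 1 is shifted so that there is some signal to learn.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
X_demo = rng.normal(size=(200, 20)) + y_demo[:, None]

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=30)

# Each step is fitted on X_tr only; X_te is merely transformed at scoring time.
model = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
])
model.fit(X_tr, y_tr)
test_accuracy = model.score(X_te, y_te)
print(test_accuracy)
```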

In [None]:
#@title Choosing the Optimal Random State
accuracy_max = 0
random_state_choice = 0
for state in range(0, 50):
  rf = RandomForestRegressor(n_estimators = 2500, random_state = state)
  rf.fit(x_train, y_train)
  y_pred = np.round(rf.predict(x_test))
  accuracy = accuracy_score(y_test, y_pred)
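  if accuracy > accuracy_max:
    # Record the best score so far; otherwise any later state with nonzero accuracy overwrites the choice.
    accuracy_max = accuracy
    random_state_choice = state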

print('random_state_choice: ' + str(random_state_choice))

In [None]:
#@title Choosing the Optimal Number of Estimators
accuracy_max = 0
estimator_choice = 0
for estimator in range(500, 5000, 500):
  rf = RandomForestRegressor(n_estimators = estimator, random_state = random_state_choice)
  rf.fit(x_train, y_train)
  y_pred = np.round(rf.predict(x_test))
  accuracy = accuracy_score(y_test, y_pred)
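  if accuracy > accuracy_max:
    # Record the best score so far; otherwise any later setting with nonzero accuracy overwrites the choice.
    accuracy_max = accuracy
    estimator_choice = estimator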

print('estimator_choice: ' + str(estimator_choice))
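
The two search loops above score each candidate on the test set, so the final accuracy is measured on data that also guided the choice. A hedged alternative, shown on synthetic data with a deliberately small grid, is to select `n_estimators` by cross-validation on subject-level folds using `GridSearchCV` with `GroupKFold`. The `groups` array is a hypothetical stand-in for the `ID` column, and `RandomForestClassifier` replaces the regressor used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

# Synthetic data: 40 subjects x 3 replications, one label per subject.
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(40), 3)
y_demo = np.repeat(np.arange(40) % 2, 3)
X_demo = rng.normal(size=(120, 6)) + y_demo[:, None]  # shift class 1 to add signal

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [25, 50, 100]},  # small grid, for illustration only
    cv=GroupKFold(n_splits=5),                   # folds never split a subject
    scoring="accuracy",
)
search.fit(X_demo, y_demo, groups=groups)
print(search.best_params_, round(search.best_score_, 3))
```

A held-out test set then stays untouched until the final evaluation.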

In [None]:
#@title Training the Model with the Optimal Number of Estimators and Random States
rf = RandomForestRegressor(n_estimators = estimator_choice, random_state = random_state_choice)
rf.fit(x_train, y_train)
y_pred = np.round(rf.predict(x_test))

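macro = f1_score(y_test, y_pred, average='macro')
micro = f1_score(y_test, y_pred, average='micro')
weighted = f1_score(y_test, y_pred, average='weighted')

print("Accuracy: ", accuracy_score(y_test, y_pred))
# Report the F1 scores computed above, which were otherwise unused.
print("F1 (macro): ", macro)
print("F1 (micro): ", micro)
print("F1 (weighted): ", weighted)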
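
To make the three `average` settings passed to `f1_score` concrete, here is a small worked example on made-up labels (not results from this dataset). Macro averaging takes the plain mean of the per-class F1 scores, weighted averaging weights each class by its number of true samples, and micro averaging pools all decisions, which for single-label classification equals accuracy.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: four healthy (0) and two PD (1) cases, one PD case missed.
y_true_demo = [0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 0, 1, 0]

per_class = f1_score(y_true_demo, y_pred_demo, average=None)   # F1 for class 0, class 1
macro_demo = f1_score(y_true_demo, y_pred_demo, average='macro')
micro_demo = f1_score(y_true_demo, y_pred_demo, average='micro')
weighted_demo = f1_score(y_true_demo, y_pred_demo, average='weighted')
accuracy_demo = accuracy_score(y_true_demo, y_pred_demo)

print(per_class, macro_demo, micro_demo, weighted_demo, accuracy_demo)
```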

In [None]:
#@title Confusion Matrix for the Random Forest Regressor.
cm = confusion_matrix(y_test, y_pred)
ax= plt.subplot()

sns.heatmap(cm, annot=True, ax = ax)

ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
# confusion_matrix orders classes ascending: index 0 is Healthy (Status=0), index 1 is PD (Status=1).
ax.xaxis.set_ticklabels(['Healthy', 'PD'])
ax.yaxis.set_ticklabels(['Healthy', 'PD'])
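
When labelling the axes of this plot, it helps to recall how `confusion_matrix` lays out its result: rows are true classes, columns are predicted classes, and both follow sorted label order, so `Status=0` (Healthy) comes first. The toy example below uses made-up labels to show the layout.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 0 = Healthy, 1 = PD, matching the Status encoding.
y_true_demo = [0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
# Row = true class, column = predicted class, in the order given by `labels`.
tn, fp, fn, tp = cm_demo.ravel()
print(cm_demo)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```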

Accuracy of the Model used: **0.854167**

***~ 85.42% accuracy***