# Application: Diagnosing Parkinson’s disease from voice signals

We consider the medical application of diagnosing Parkinson’s disease from a person’s voice. We use the data from Little et al. (2009), which can be obtained from the UCI benchmark repository (Frank and Asuncion, 2010).

The data were collected from 31 people, 23 of whom suffer from Parkinson’s disease. Several voice recordings of these people were processed, and each line in the data files corresponds to one recording. The first 22 columns are features derived from the recording: the minimum, average and maximum vocal fundamental frequency, several measures of variation in fundamental frequency, several measures of variation in amplitude, two measures of the ratio of noise to tonal components in the voice, two nonlinear dynamical complexity measures, a measure called signal fractal scaling exponent, and nonlinear measures of fundamental frequency variation (Little et al., 2009; Frank and Asuncion, 2010). The last column is the target label indicating whether the subject is healthy (0) or suffers from Parkinson’s disease (1). The data were split into a training set and a test set, parkinsonsTrainStatML.dt and parkinsonsTestStatML.dt, respectively.

In [1]:
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split


In [2]:
data_train = np.loadtxt("parkinsonsTrainStatML.dt")
data_test = np.loadtxt("parkinsonsTestStatML.dt")

train_label = data_train[:,-1]
test_label = data_test[:,-1]

## Data normalization

Consider the training data in parkinsonsTrainStatML.dt. Compute the mean and the variance of every input feature (i.e., of every component of the input vector). Find the affine linear mapping $f_{norm} : \mathbb{R}^{22} \rightarrow \mathbb{R}^{22}$ that transforms the training data such that the mean and the variance of every feature in the transformed data are 0 and 1, respectively (verify by computing these values).

In [3]:
data_feature = data_train[:,:-1]

We take the mean and variance of each of the 22 features, i.e., along axis 0 (over the recordings):

In [4]:
mu_train = np.mean(data_feature, axis = 0)
sig_train = np.var(data_feature, axis = 0)

Normalize the training data with the per-feature mean and variance, using the formula (all operations are component-wise):

$$ f_{norm}(x) = \frac{x-\mu_{\text{train data}}}{\sqrt{\mathrm{var}(\text{train data})}} $$

In [5]:
new_training_data = (data_train[:,:-1]-mu_train)/np.sqrt(sig_train)
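
To make explicit why this normalization is an affine linear mapping, it can be written in matrix form. The symbols $A$, $b$, $\mu_i$ and $\sigma_i^2$ are notation introduced here for illustration: $\mu_i$ and $\sigma_i^2$ denote the training mean and variance of feature $i$, and every $\sigma_i$ is assumed to be strictly positive.

$$ f_{norm}(x) = A x + b, \qquad A = \mathrm{diag}\left(\frac{1}{\sigma_1}, \dots, \frac{1}{\sigma_{22}}\right), \qquad b = -A \mu, \qquad \mu = (\mu_1, \dots, \mu_{22})^T $$

Component-wise this gives $f_{norm}(x)_i = (x_i - \mu_i)/\sigma_i$, which is what the code above computes through broadcasting.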

In principle, every feature should now have a mean of zero and a variance of one. Let's check, aggregating over all entries of the normalized data:

In [6]:
mu_train_new = np.mean(new_training_data)
sig_train_new = np.var(new_training_data)

print(f"The mean of the normalized training data is {mu_train_new:.3e} and variance {sig_train_new}")

The mean of the normalized training data is -2.307e-17 and variance 1.0


The mean is not exactly zero because of floating-point rounding errors.
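
The check above aggregates over all entries of the array, whereas the task asks for the mean and variance of every feature. A minimal sketch of the per-feature check follows. It uses a synthetic array `X` as a stand-in for the 22 training features (an assumption, since the data files are not loaded here); with the real data one would pass `axis=0` to the calls on `new_training_data`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the training features: 100 recordings, 22 features.
X = rng.normal(loc=5.0, scale=3.0, size=(100, 22))

# Per-feature statistics (axis 0 runs over the recordings).
mu = X.mean(axis=0)
sigma = X.std(axis=0)  # population standard deviation, i.e. sqrt(np.var)

X_norm = (X - mu) / sigma

# One value per feature instead of one value for the whole array.
per_feature_mean = X_norm.mean(axis=0)
per_feature_var = X_norm.var(axis=0)

print("all means close to 0:    ", np.allclose(per_feature_mean, 0.0))
print("all variances close to 1:", np.allclose(per_feature_var, 1.0))
```

Using `np.allclose` avoids printing 22 values that differ from 0 and 1 only by rounding error.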

We also need to normalize the test data. We do this with the same $f_{norm}$ as for the training data, i.e., with the mean and variance computed on the training set.

In [7]:
new_test_data = (data_test[:,:-1]-mu_train)/np.sqrt(sig_train)

mu_test_new = np.mean(new_test_data)
sig_test_new = np.var(new_test_data)

print(f"The mean of the normalized test data is {mu_test_new:.3e} and variance {sig_test_new}")

The mean of the normalized test data is 1.267e-01 and variance 1.5727196891758013
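
The test statistics are not expected to be exactly 0 and 1, because the mapping was fitted on the training data only. For comparison, the same fit-on-train, apply-to-test pattern can be sketched with scikit-learn's `StandardScaler`, which also uses the population variance. The arrays `X_train` and `X_test` below are synthetic stand-ins (an assumption), not the Parkinson's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Hypothetical stand-ins for the training and test feature matrices.
X_train = rng.normal(loc=2.0, scale=4.0, size=(80, 22))
X_test = rng.normal(loc=2.5, scale=5.0, size=(40, 22))

# Fit the mapping on the training data only, then apply it to the test data.
scaler = StandardScaler().fit(X_train)
X_test_scaled = scaler.transform(X_test)

# Manual version of the same mapping, as used in this notebook.
manual = (X_test - X_train.mean(axis=0)) / np.sqrt(X_train.var(axis=0))

print("scaler matches manual mapping:", np.allclose(X_test_scaled, manual))
print("test mean of first feature:   ", X_test_scaled[:, 0].mean())
print("test variance of first feature:", X_test_scaled[:, 0].var())
```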


## Model selection using grid-search

The performance of your SVM classifier depends on the choice of the regularization parameter $C$ and the kernel parameters. 
We use the radial basis function kernel and will thus vary $\gamma$. 
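
To illustrate the role of $\gamma$, the sketch below evaluates the Gaussian (RBF) kernel $k(x, z) = \exp(-\gamma \lVert x - z \rVert^2)$ for a fixed pair of points. The helper `rbf_kernel` and the vectors `x` and `z` are hypothetical names introduced for this example only.

```python
import numpy as np

def rbf_kernel(x, z, gamma):
    """Gaussian (RBF) kernel value exp(-gamma * ||x - z||^2) for two vectors."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

x = np.zeros(22)
z = np.ones(22)  # squared Euclidean distance to x is 22

# Larger gamma makes the kernel decay faster with distance.
for gamma in (1e-3, 1e-1, 1e1):
    print(f"gamma = {gamma:g}: k(x, z) = {rbf_kernel(x, z, gamma):.3e}")
```

A small $\gamma$ makes distant points look similar (smoother decision boundary), while a large $\gamma$ makes the kernel very local, which is why $\gamma$ is tuned together with $C$.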

In [8]:
C_list = 10**(np.linspace(-2,2,7))  # 7 log-spaced values from 0.01 to 100
gamma_list = 10**(np.linspace(-3,1,7))  # 7 log-spaced values from 0.001 to 10
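
For reference, the same grids can be produced with `np.logspace`, and printing them shows the actual candidate values. The names `C_grid` and `gamma_grid` are introduced here only to avoid overwriting the notebook's variables.

```python
import numpy as np

# np.logspace(a, b, n) is equivalent to 10**np.linspace(a, b, n).
C_grid = np.logspace(-2, 2, 7)
gamma_grid = np.logspace(-3, 1, 7)

np.set_printoptions(precision=4, suppress=True)
print("C candidates:    ", C_grid)
print("gamma candidates:", gamma_grid)
```

Neighbouring candidates differ by a constant factor of $10^{2/3} \approx 4.64$, so each grid covers four orders of magnitude with seven values.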

We hold out some validation data on which to evaluate each parameter combination. It is chosen randomly from the training data; we take 20 %. Note that this is a single hold-out split rather than k-fold cross-validation.

In [9]:
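# NOTE: new_training_data already holds only the 22 features, so [:, :-1]
# additionally drops the last feature. The slicing is kept here because the
# output below was recorded with it.
# The split is fixed by random_state, so it is done once, outside the loops.
X_train, X_valid, y_train, y_valid = train_test_split(
    new_training_data[:,:-1], train_label, test_size=0.2, random_state=42)

score = 0
scores = []
for C in C_list:
    for gamma in gamma_list:
        clf = SVC(C=C, gamma=gamma)  # SVC uses the RBF kernel by default
        clf.fit(X_train, y_train)
        scores.append(clf.score(X_valid, y_valid))

        # Keep the first combination that reaches the highest validation accuracy.
        if scores[-1] > score:
            score = scores[-1]
            best = [C, gamma]

print(f"The best score was {score} found from [C, gamma] = {best}")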

The best score was 0.95 found from [C, gamma] = [1.0, 0.021544346900318832]
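
With so few recordings, a single 20 % hold-out split gives a noisy estimate, and several parameter combinations may tie. One alternative, sketched here and not the method used in this notebook, is k-fold cross-validation with scikit-learn's `GridSearchCV`. The data come from `make_classification` as a synthetic stand-in (an assumption), and the scaler is placed in a pipeline so that it is refitted on each training fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training set: 100 samples, 22 features.
X, y = make_classification(n_samples=100, n_features=22, random_state=0)

# Scaling inside the pipeline keeps validation folds out of the scaler fit.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# make_pipeline names each step after its lowercased class name ("svc").
param_grid = {
    "svc__C": np.logspace(-2, 2, 7),
    "svc__gamma": np.logspace(-3, 1, 7),
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print("best parameters:      ", search.best_params_)
print("mean cross-val accuracy:", search.best_score_)
```

By default `GridSearchCV` refits the best pipeline on all the data passed to `fit`, so `search` can afterwards be used directly for prediction.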


Let's train on the full training data set using the selected parameters $C$ and $\gamma$, and evaluate the resulting classifier on the test set.

In [10]:
# Same feature slicing as in the grid search ([:, :-1] drops the last feature).
clf = SVC(C=best[0], gamma=best[1])
clf.fit(new_training_data[:,:-1], train_label)
test_score = clf.score(new_test_data[:,:-1], test_label)

print(f"The test score was {test_score:3.3f}.")

The test score was 0.866.
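
Since 23 of the 31 subjects have Parkinson's disease, the labels are probably imbalanced at the recording level too (an assumption, as the per-recording counts are not stated here). A test accuracy is therefore easier to interpret next to the accuracy of always predicting the majority class. The sketch below computes that baseline for a hypothetical label vector `y_test`; with the real data one would use `test_label` instead.

```python
import numpy as np

# Hypothetical test labels (1 = Parkinson's, 0 = healthy); not the real data.
y_test = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1])

# Accuracy of a classifier that always predicts the most frequent label.
values, counts = np.unique(y_test, return_counts=True)
majority_label = values[np.argmax(counts)]
baseline_accuracy = np.mean(y_test == majority_label)

print(f"Majority label: {majority_label}, baseline accuracy: {baseline_accuracy:.3f}")
```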
