## Lab Introduction
Growing up, everyone has a hero. For many people, that hero was Muhammad Ali. He taught people it was okay to be proud of who they were, at a time when others would not accept that. He showed people how to stand up for their beliefs in the face of oppression and tyranny. He made people value themselves, and encouraged them to care for those around them. He showed us what bravery truly meant, how to be a heck of a boxer, and so much more. Every single person who met Muhammad Ali, either in the ring or outside of it, had a motivating story to share about their encounter.

On June 3, 2016, Muhammad Ali passed away at the age of 74 due to septic shock. Thirty years earlier, he was diagnosed with Parkinson's syndrome, a neurodegenerative condition that doctors attributed to his boxing-related brain injuries.

Parkinson's disease itself is a long-term disorder of the nervous system that affects many aspects of a person's mobility over time. It's characterized by shaking, slowed movement, rigidity, dementia, and depression. In 2013, some 53 million people were diagnosed with it, mostly men. Other famous personalities affected by it include actor Michael J. Fox and Olympic cyclist Davis Phinney.

In this lab, you will apply SVC to the [Parkinson's Data Set](https://archive.ics.uci.edu/ml/datasets/Parkinsons), provided courtesy of UCI's Machine Learning Repository. The dataset was created at the University of Oxford, in collaboration with 10 medical centers around the US, along with Intel, who developed the device used to record the primary features of the dataset: speech signals. Your goals for this lab are first to see if it's possible to differentiate between people who have Parkinson's and those who don't using SciKit-Learn's support vector classifier, and then to take a first stab at a naive way of fine-tuning your parameters in an attempt to maximize accuracy on your testing set.

"I've never really resented hard work because I've always liked it. Up every morning for roadwork. Going to the gymnasium every day at 12 o'clock. I never change my pattern."

## Cycle 1

Download the dataset from the link above, then load the **parkinsons.data** file into a variable **X**, being sure to drop the name column.


In [63]:
# .. your code here ..
import pandas as pd
df = pd.read_csv('parkinsons.data')
df.head()

Unnamed: 0,name,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
0,phon_R01_S01_1,119.992,157.302,74.997,0.00784,7e-05,0.0037,0.00554,0.01109,0.04374,...,0.06545,0.02211,21.033,1,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654
1,phon_R01_S01_2,122.4,148.65,113.819,0.00968,8e-05,0.00465,0.00696,0.01394,0.06134,...,0.09403,0.01929,19.085,1,0.458359,0.819521,-4.075192,0.33559,2.486855,0.368674
2,phon_R01_S01_3,116.682,131.111,111.555,0.0105,9e-05,0.00544,0.00781,0.01633,0.05233,...,0.0827,0.01309,20.651,1,0.429895,0.825288,-4.443179,0.311173,2.342259,0.332634
3,phon_R01_S01_4,116.676,137.871,111.366,0.00997,9e-05,0.00502,0.00698,0.01505,0.05492,...,0.08771,0.01353,20.644,1,0.434969,0.819235,-4.117501,0.334147,2.405554,0.368975
4,phon_R01_S01_5,116.014,141.781,110.655,0.01284,0.00011,0.00655,0.00908,0.01966,0.06425,...,0.1047,0.01767,19.649,1,0.417356,0.823484,-3.747787,0.234513,2.33218,0.410335


Splice out the status column into a variable **y** and delete it from **X**.

In [64]:
# .. your code here ..
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              195 non-null    object 
 1   MDVP:Fo(Hz)       195 non-null    float64
 2   MDVP:Fhi(Hz)      195 non-null    float64
 3   MDVP:Flo(Hz)      195 non-null    float64
 4   MDVP:Jitter(%)    195 non-null    float64
 5   MDVP:Jitter(Abs)  195 non-null    float64
 6   MDVP:RAP          195 non-null    float64
 7   MDVP:PPQ          195 non-null    float64
 8   Jitter:DDP        195 non-null    float64
 9   MDVP:Shimmer      195 non-null    float64
 10  MDVP:Shimmer(dB)  195 non-null    float64
 11  Shimmer:APQ3      195 non-null    float64
 12  Shimmer:APQ5      195 non-null    float64
 13  MDVP:APQ          195 non-null    float64
 14  Shimmer:DDA       195 non-null    float64
 15  NHR               195 non-null    float64
 16  HNR               195 non-null    float64
 1

In [65]:
y = df['status'].values
X = df.drop(columns=['name','status']).values

Perform a train/test split. **30**% test group size, with a random_state equal to **7**.

In [66]:
# .. your code here ..
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)
print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)

Train set: (136, 22) (136,)
Test set: (59, 22) (59,)


Create a SVC classifier. Don't specify any parameters, just leave everything as default. Fit it against your training data and then score your testing data with accuracy and F1 score.

In [28]:
# .. your code here ..
from sklearn import svm
from sklearn.metrics import f1_score
from sklearn.metrics import jaccard_score
clf = svm.SVC()
clf.fit(X_train, y_train) 
yhat = clf.predict(X_test)
print("Avg F1-score: %.4f" % f1_score(y_test, yhat, average='weighted'))
print("Jaccard score: %.4f" % jaccard_score(y_test, yhat))

Avg F1-score: 0.7303
Jaccard score: 0.7544
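
The cell above reports a weighted F1 score and a Jaccard score, while the instructions also ask for accuracy. As a minimal sketch of how the three metrics relate, the example below scores a small set of hypothetical labels (`y_true_demo` and `y_pred_demo` are made up for illustration, not taken from the dataset). On a fitted classifier, `clf.score(X_test, y_test)` returns the same accuracy that `accuracy_score` computes from predictions.

```python
from sklearn.metrics import accuracy_score, f1_score, jaccard_score

# Hypothetical labels, chosen only to make the arithmetic easy to follow.
y_true_demo = [1, 1, 1, 0, 0, 1]
y_pred_demo = [1, 1, 0, 0, 1, 1]

# Counts for the positive class (label 1):
# 3 true positives, 1 true negative, 1 false positive, 1 false negative.
acc_demo = accuracy_score(y_true_demo, y_pred_demo)  # (TP + TN) / total = 4/6
f1_demo = f1_score(y_true_demo, y_pred_demo)         # 2TP / (2TP + FP + FN) = 6/8
jac_demo = jaccard_score(y_true_demo, y_pred_demo)   # TP / (TP + FP + FN) = 3/5

print("Accuracy: %.4f" % acc_demo)
print("F1 score: %.4f" % f1_demo)
print("Jaccard score: %.4f" % jac_demo)
```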


## Cycle 2


That accuracy was just too low to be useful. We need to get it up. One way you could go about doing that would be to manually try a bunch of combinations of **C and gamma values for your rbf kernel**. But that could take a very long time. Also, you might unknowingly skip a pair of values that would have resulted in a very good accuracy.

Instead, let's get the computer to do what computers do best. Program a naive, best-parameter search by creating nested for-loops. The outer for-loop should iterate a variable **C from 0.05 to 2, using 0.05 unit increments**. The inner for-loop should increment a variable **gamma from 0.001 to 0.1, using 0.001 unit increments**. As you know, Python ranges won't allow for float intervals, so you'll have to do some research on NumPy's `arange`, if you don't already know how to use it.
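
As a rough sketch of the float ranges described above, the snippet below builds both grids with `np.arange` and, for comparison, with `np.linspace`. The variable names are illustrative only. Because `np.arange` excludes its stop value and float steps accumulate rounding error, half a step is added to the stop here so the intended endpoint is kept; that padding is a choice made for this sketch, not something the lab prescribes.

```python
import numpy as np

# Pad the stop value by half a step so the endpoint survives
# floating-point rounding (np.arange excludes the stop value itself).
c_values = np.arange(0.05, 2.0 + 0.025, 0.05)
gamma_values = np.arange(0.001, 0.1 + 0.0005, 0.001)

# np.linspace takes the number of points instead of the step size
# and always includes both endpoints.
c_linspace = np.linspace(0.05, 2.0, 40)
gamma_linspace = np.linspace(0.001, 0.1, 100)

print(len(c_values), len(gamma_values))
print(np.allclose(c_values, c_linspace))
print(np.allclose(gamma_values, gamma_linspace))
```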

Since the goal is to find the parameters that result in the model having the best accuracy score, you'll need a **best_score = 0 variable that you initialize outside of the for-loops.** Inside the inner for-loop, create an SVC model and pass the C and gamma parameters to its class constructor. Train and score the model appropriately. If the current best_score is less than the model's score, update the best_score, being sure to print it out along with the C and gamma values that resulted in it.

After running your lab again, what are the highest accuracy and F1 score you are able to get?

In [29]:
import numpy as np

In [41]:
# .. your code here ..
c_att = np.linspace(0.05,2,40)
g_att = np.linspace(0.001,0.1,100)

In [50]:
best_f1_score = 0
best_jaccard_score = 0
C,G = 0,0
for c in c_att:
    for g in g_att:
        clf = svm.SVC(C = c, gamma = g)
        clf.fit(X_train, y_train) 
        yhat = clf.predict(X_test)
        if (f1_score(y_test, yhat, average='weighted') > best_f1_score):
            best_f1_score = f1_score(y_test, yhat, average='weighted')
            best_jaccard_score = jaccard_score(y_test,yhat)
            C = c
            G = g
print('Best F1 score:',best_f1_score,'@ C =',C,'& gamma =',G)
print('Best Jaccard score:',best_jaccard_score,'@ C =',C,'& gamma =',G)

Best F1 score: 0.9062435235495003 @ C = 1.65 & gamma = 0.005
Best Jaccard score: 0.9038461538461539 @ C = 1.65 & gamma = 0.005
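
The nested loops above choose C and gamma by scoring directly on the test set, so the best score they report is likely to be optimistic. One common alternative is SciKit-Learn's `GridSearchCV`, which cross-validates each parameter pair on the training data only. The sketch below is an illustration under stated assumptions: it uses synthetic data from `make_classification` as a stand-in for the lab's **X** and **y**, and a coarser grid than the lab's to keep it quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; substitute the lab's X and y).
X_demo, y_demo = make_classification(n_samples=195, n_features=22, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7
)

# Coarser grid than the lab's 40 x 100 search, to keep the sketch fast.
param_grid = {
    "C": np.linspace(0.05, 2.0, 8),
    "gamma": np.linspace(0.001, 0.1, 10),
}

# 5-fold cross-validation on the training split only.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="f1_weighted", cv=5)
search.fit(X_tr, y_tr)

# search.score() applies the same scorer (weighted F1) to held-out data.
test_f1 = search.score(X_te, y_te)

print("Best cross-validated F1:", search.best_score_)
print("Best parameters:", search.best_params_)
print("Held-out weighted F1:", test_f1)
```

With this arrangement the test split is touched only once, after the parameters have been chosen.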


## Cycle 3

Wait a second. Pull open the dataset's label file from: https://archive.ics.uci.edu/ml/datasets/Parkinsons

Look at the units on those columns: **Hz, %, Abs, dB, etc.** What happened to transforming your data? With all of those units interacting with one another, some pre-processing is surely in order.
Right after you perform the train/test split but before you train your model, inject SciKit-Learn's pre-processing code. Unless you have a good idea which one is going to work best, you're going to have to try the various pre-processors one at a time, checking to see if they improve your predictive accuracy.

Experiment with ***Normalizer(), MaxAbsScaler(), MinMaxScaler(), KernelCenterer(), and StandardScaler().***

After trying all of these scalers, what are the new highest accuracy and F1 score you're able to achieve?
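
When a pre-processor learns statistics from the data (as `StandardScaler`, `MinMaxScaler`, and `MaxAbsScaler` do), the usual practice is to fit it on the training split and then apply the same fitted transform to the test split. The sketch below shows that pattern twice, once by hand and once with a pipeline. It runs on synthetic stand-in data from `make_classification`, which is an assumption made so the example is self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; substitute the lab's X and y).
X_demo, y_demo = make_classification(n_samples=195, n_features=22, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7
)

# Manual version: learn mean and standard deviation from the training rows only,
# then reuse those statistics on the test rows.
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

clf_demo = SVC().fit(X_tr_scaled, y_tr)
manual_acc = clf_demo.score(X_te_scaled, y_te)

# Pipeline version: the scaler is fitted inside fit() and reused inside score().
pipe = make_pipeline(StandardScaler(), SVC())
pipe.fit(X_tr, y_tr)
pipe_acc = pipe.score(X_te, y_te)

print("Manual accuracy:", manual_acc)
print("Pipeline accuracy:", pipe_acc)
```

Swapping `StandardScaler()` for another pre-processor changes only one line in either version.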

In [51]:
from sklearn import preprocessing

In [75]:
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.3, random_state=7)
normalizer = preprocessing.Normalizer()
X_train = normalizer.fit_transform(X_train)
# Normalizer is stateless (it rescales each row independently), so transform() is enough here
X_test = normalizer.transform(X_test)

best_f1_score = 0
best_jaccard_score = 0
C,G = 0,0
for c in c_att:
    for g in g_att:
        clf = svm.SVC(C = c, gamma = g)
        clf.fit(X_train, y_train) 
        yhat = clf.predict(X_test)
        if (f1_score(y_test, yhat, average='weighted') > best_f1_score):
            best_f1_score = f1_score(y_test, yhat, average='weighted')
            best_jaccard_score = jaccard_score(y_test,yhat)
            C = c
            G = g
print('Best F1 score:',best_f1_score,'@ C =',C,'& gamma =',G)
print('Best Jaccard score:',best_jaccard_score,'@ C =',C,'& gamma =',G)

Best F1 score: 0.7064278861528621 @ C = 0.05 & gamma = 0.001
Best Jaccard score: 0.7966101694915254 @ C = 0.05 & gamma = 0.001


In [76]:
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.3, random_state=7)
maxabsscaler = preprocessing.MaxAbsScaler()
X_train = maxabsscaler.fit_transform(X_train)
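# Caution: this refits the scaler on the test set; maxabsscaler.transform(X_test) would reuse the training-set maxima instead
X_test = maxabsscaler.fit_transform(X_test)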

best_f1_score = 0
best_jaccard_score = 0
C,G = 0,0
for c in c_att:
    for g in g_att:
        clf = svm.SVC(C = c, gamma = g)
        clf.fit(X_train, y_train) 
        yhat = clf.predict(X_test)
        if (f1_score(y_test, yhat, average='weighted') > best_f1_score):
            best_f1_score = f1_score(y_test, yhat, average='weighted')
            best_jaccard_score = jaccard_score(y_test,yhat)
            C = c
            G = g
print('Best F1 score:',best_f1_score,'@ C =',C,'& gamma =',G)
print('Best Jaccard score:',best_jaccard_score,'@ C =',C,'& gamma =',G)

Best F1 score: 0.861040640454873 @ C = 1.2 & gamma = 0.1
Best Jaccard score: 0.8703703703703703 @ C = 1.2 & gamma = 0.1


In [77]:
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.3, random_state=7)
minmaxscaler = preprocessing.MinMaxScaler()
X_train = minmaxscaler.fit_transform(X_train)
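# Caution: this refits the scaler on the test set; minmaxscaler.transform(X_test) would reuse the training-set range instead
X_test = minmaxscaler.fit_transform(X_test)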
best_f1_score = 0
best_jaccard_score = 0
C,G = 0,0
for c in c_att:
    for g in g_att:
        clf = svm.SVC(C = c, gamma = g)
        clf.fit(X_train, y_train) 
        yhat = clf.predict(X_test)
        if (f1_score(y_test, yhat, average='weighted') > best_f1_score):
            best_f1_score = f1_score(y_test, yhat, average='weighted')
            best_jaccard_score = jaccard_score(y_test,yhat)
            C = c
            G = g
print('Best F1 score:',best_f1_score,'@ C =',C,'& gamma =',G)
print('Best Jaccard score:',best_jaccard_score,'@ C =',C,'& gamma =',G)

Best F1 score: 0.861040640454873 @ C = 0.75 & gamma = 0.094
Best Jaccard score: 0.8703703703703703 @ C = 0.75 & gamma = 0.094


In [78]:
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.3, random_state=7)
kernelcenterer = preprocessing.KernelCenterer()
X_train = kernelcenterer.fit_transform(X_train)
X_test = kernelcenterer.fit_transform(X_test)

best_f1_score = 0
best_jaccard_score = 0
C,G = 0,0
for c in c_att:
    for g in g_att:
        clf = svm.SVC(C = c, gamma = g)
        clf.fit(X_train, y_train) 
        yhat = clf.predict(X_test)
        if (f1_score(y_test, yhat, average='weighted') > best_f1_score):
            best_f1_score = f1_score(y_test, yhat, average='weighted')
            best_jaccard_score = jaccard_score(y_test,yhat)
            C = c
            G = g
print('Best F1 score:',best_f1_score,'@ C =',C,'& gamma =',G)
print('Best Jaccard score:',best_jaccard_score,'@ C =',C,'& gamma =',G)

ValueError: Kernel matrix must be a square matrix. Input is a 136x22 matrix.
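
The ValueError above arises because `KernelCenterer` operates on a precomputed kernel (Gram) matrix rather than on the raw feature matrix, so fitting it expects a square train-by-train input. The following is a minimal sketch of one way to use it with an SVC, assuming an RBF kernel computed by hand with `rbf_kernel`, an arbitrary `gamma` of 0.05, and synthetic stand-in data; none of these choices come from the lab itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KernelCenterer
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; substitute the lab's X and y).
X_demo, y_demo = make_classification(n_samples=120, n_features=8, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7
)

# Kernel matrices: train-vs-train is square, test-vs-train is rectangular.
K_train = rbf_kernel(X_tr, X_tr, gamma=0.05)
K_test = rbf_kernel(X_te, X_tr, gamma=0.05)

# Fit the centerer on the square training kernel, then apply it to both.
centerer = KernelCenterer().fit(K_train)
K_train_c = centerer.transform(K_train)
K_test_c = centerer.transform(K_test)

# kernel='precomputed' tells SVC that its inputs are kernel values, not features.
clf_pre = SVC(kernel="precomputed").fit(K_train_c, y_tr)
kc_accuracy = clf_pre.score(K_test_c, y_te)

print("Centered train kernel shape:", K_train_c.shape)
print("Centered test kernel shape:", K_test_c.shape)
print("Accuracy with centered kernel:", kc_accuracy)
```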

In [79]:
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.3, random_state=7)
standardscaler = preprocessing.StandardScaler()
X_train = standardscaler.fit_transform(X_train)
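# Caution: this refits the scaler on the test set; standardscaler.transform(X_test) would reuse the training-set mean and variance instead
X_test = standardscaler.fit_transform(X_test)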

best_f1_score = 0
best_jaccard_score = 0
C,G = 0,0
for c in c_att:
    for g in g_att:
        clf = svm.SVC(C = c, gamma = g)
        clf.fit(X_train, y_train) 
        yhat = clf.predict(X_test)
        if (f1_score(y_test, yhat, average='weighted') > best_f1_score):
            best_f1_score = f1_score(y_test, yhat, average='weighted')
            best_jaccard_score = jaccard_score(y_test,yhat)
            C = c
            G = g
print('Best F1 score:',best_f1_score,'@ C =',C,'& gamma =',G)
print('Best Jaccard score:',best_jaccard_score,'@ C =',C,'& gamma =',G)

Best F1 score: 0.9649139702105802 @ C = 1.9 & gamma = 0.1
Best Jaccard score: 0.9591836734693877 @ C = 1.9 & gamma = 0.1
