# Homework 6 - Data Splitting, Support Vector Machines

You should have downloaded:
- pulsar.csv

## 0 Load Data
Pulsars are a rare type of neutron star that produce radio emission detectable here on Earth. They are of considerable scientific interest as probes of space-time, the interstellar medium, and states of matter.

You can read more (interesting!) details at the [source](https://archive.ics.uci.edu/ml/datasets/HTRU2).

`pulsar.csv` contains statistics from two types of signal from pulsar candidates:
1. integrated profile (IP) and 
2. dispersion-measure signal-to-noise ratio (DMSNR) curve. 

Run the cell below to see what data we have.

In [None]:
# ---------- DO NOT CHANGE CODE HERE ---------
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("pulsar.csv")
display(data)
X = data.iloc[:,:8].to_numpy()
y = data.iloc[:,8].to_numpy()
# --------------------------------------------

## 1 Data Splitting (3 ways)
There are many ways to split the training and test data. Here is a short exercise to learn and compare 3 such ways using `sklearn.model_selection`:
1. `train_test_split`
2. `KFold`
3. `StratifiedShuffleSplit`

**Read and understand** how the 3 methods work by reading the code demonstration below.
- You should know what every line of code is doing.

#### Method 1: train_test_split 
Using `sklearn.model_selection.train_test_split`, we split the data into training and test.

In [None]:
# ---------- DO NOT CHANGE CODE HERE ---------
from sklearn.model_selection import train_test_split
X_train_tts, X_test_tts, y_train_tts, y_test_tts = train_test_split(X, y, test_size=1/3, shuffle=False)

n_pulsar_train_tts = (y_train_tts==1).sum()
n_pulsar_test_tts = (y_test_tts==1).sum()
print("Training Set, Pulsars:", n_pulsar_train_tts, "out of", y_train_tts.shape[0])
print("Test Set    , Pulsars:", n_pulsar_test_tts, "out of", y_test_tts.shape[0])
# --------------------------------------------

#### Method 2: K-Fold
Using `sklearn.model_selection.KFold` on default shuffle settings, we split the data into training and test.

In [None]:
# ---------- DO NOT CHANGE CODE HERE ---------
from sklearn.model_selection import KFold
kf = KFold(n_splits=3) 

for i, (train_idx_kf, test_idx_kf) in enumerate(kf.split(X)):
    X_train_kf, y_train_kf = X[train_idx_kf], y[train_idx_kf]
    X_test_kf, y_test_kf = X[test_idx_kf], y[test_idx_kf]

    n_pulsar_train_kf = (y_train_kf==1).sum()
    n_pulsar_test_kf = (y_test_kf==1).sum()
    print("Training Set, Pulsars:", n_pulsar_train_kf, "out of", y_train_kf.shape[0])
    print("Test Set    , Pulsars:", n_pulsar_test_kf, "out of", y_test_kf.shape[0], '\n')
# --------------------------------------------

#### Method 3: Stratified Shuffle Split
Using `sklearn.model_selection.StratifiedShuffleSplit`, we split the data into training and test. 

In [None]:
# ---------- DO NOT CHANGE CODE HERE ---------
from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=3, test_size=1/3, random_state=0)

X_train_sss, y_train_sss, X_test_sss, y_test_sss = {}, {}, {}, {}
for i, (train_idx, test_idx) in enumerate(sss.split(X, y)):
    X_train_sss[i], y_train_sss[i] = X[train_idx], y[train_idx]
    X_test_sss[i], y_test_sss[i] = X[test_idx], y[test_idx]

    n_pulsar_train = (y_train_sss[i]==1).sum()
    n_pulsar_test = (y_test_sss[i]==1).sum()
    print("Training Set, Pulsars:", n_pulsar_train, "out of", y_train_sss[i].shape[0])
    print("Test Set    , Pulsars:", n_pulsar_test, "out of", y_test_sss[i].shape[0], '\n')
# --------------------------------------------

### 1.1 Discussion (Stratified Shuffle Split)
For stratified shuffle split, the number of pulsars in the training set and in the test set is identical across all three splits.

**Task:**
1. [1 pt] Why is the number of pulsars identical for each stratified shuffle split? (i.e., what does "stratified" mean?)

    **Ans:** 

2. [1 pt] Using the code cell below, verify that the splits are actually not identical. (Tip: use np.all(...), where ... is code you fill in yourself.)

3. [1 pt] Why is the number of pulsars for stratified shuffle split different from those of train_test_split and KFold? A short answer will do.
        
    **Ans:** 
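As a reminder of how the `np.all(...)` tip works, here is a minimal sketch on made-up arrays. The names `a`, `b`, and `c` are hypothetical and unrelated to the pulsar splits. `==` compares two arrays element by element, and `np.all` collapses the result to a single boolean. The element-wise comparison assumes both arrays have the same shape; `np.array_equal` is an alternative that also copes with a shape mismatch.

```python
import numpy as np

# Toy arrays, not the homework data
a = np.array([0, 1, 2, 3])
b = np.array([0, 1, 2, 3])
c = np.array([0, 1, 3, 2])  # same values as a, but in a different order

same_ab = np.all(a == b)  # True only if every position matches
same_ac = np.all(a == c)  # False as soon as one position differs

print(same_ab, same_ac)
print(np.array_equal(a, np.array([0, 1])))  # False: shapes differ
```

Note that order matters: two arrays holding the same values in a different order do not compare as equal.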


In [None]:
# TODO Use this cell to verify that all the splits from stratified shuffle split are different


### 1.2 Discussion (train_test_split and KFold)
There is an identical match between the split for train_test_split and one of the splits for KFold. 

**Task:**
1. [1 pt] Using the code cell below, verify that the split is indeed identical. (Tip: use np.all(...), where ... is code you fill in yourself.)

2. [1 pt] Why does this identical match happen? What settings or function/method arguments explain the occurrence of this match?

    **Ans:** 


In [None]:
# TODO Use this cell to verify that the split from train_test_split matches one of the splits from KFold


## 2 Cross Validation

### 2.1 sklearn cross val score
**Task:**
1. [2 pt] Use `sklearn.model_selection.cross_val_score` to perform cross validation on a decision tree classifier.
    - Define your DecisionTreeClassifier as `clf`.
    - Set `max_depth=9` and `random_state=0` in your DecisionTreeClassifier object.
    - Perform a 3-fold validation in `cross_val_score`.
    - Print the cross validation scores; this should be an array of three elements.

Note: You may not have done trees by the time you are doing this homework... Luckily you don't need to know anything about them for this section!
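
For orientation, the general call pattern of `cross_val_score` can be sketched on synthetic data. This sketch deliberately uses a different estimator (`KNeighborsClassifier`) and a made-up dataset from `make_classification`, so the names `X_toy`, `y_toy`, and `knn` are illustrative assumptions and the scores say nothing about the pulsar data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data, unrelated to pulsar.csv
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)

# cv=3 requests three folds; one score is returned per fold
scores = cross_val_score(knn, X_toy, y_toy, cv=3)
print(scores)
print("mean accuracy:", scores.mean())
```

The estimator passed in does not need to be fitted beforehand; `cross_val_score` fits a fresh copy on each training fold.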

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score


clf = DecisionTreeClassifier(max_depth=None,random_state=None) # TODO
cross_val_score(clf,X,y,cv=3)

### 2.2 "Manual" cross val score 
**Task:**

Run the code cell below.

Based on the lecture Jupyter notebooks, the code below *should* reproduce what the `cross_val_score` function does. If it really is what `cross_val_score` does, we ought to see the same three validation scores printed.

1. [1 pt] Read the documentation for [sklearn.model_selection.cross_val_score](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html). Pay attention to the description in the "cv" parameter. Why isn't the code below performing as we expected?

    **Ans:** 

2. [1 pt] Based on what you found was wrong with the code below, make a change to the splitting method (it may not be one that we have discussed before, so read the documentation carefully) and print out the new cross validation scores. Make sure they match the previous cell. You should not need to change anything in the for-loop, just the code before it.

In [None]:
k_fold = KFold(n_splits=3, shuffle=False)

# ---------- DO NOT CHANGE CODE HERE ---------
for k, (train, test) in enumerate(k_fold.split(X,y)):
    clf.fit(X[train],y[train])
    ypred = clf.predict(X[test])
    print ( clf.score(X[test],y[test]) )
# --------------------------------------------

## 3 SVM Implementation
In this section you will use sklearn's `svm` module to create an SVM classifier and then test it on different types of data.

In [24]:
# --------- RUN BUT DO NOT CHANGE -------------------
from sklearn import svm
from sklearn.datasets import make_moons
np.random.seed(1)

def generate_data(s = 2):
    X, y = make_moons(n_samples = 600, noise = .15, random_state = 10)
    X1, X2 = X[np.where(y==0)] + s, X[np.where(y==1)]
    y1, y2 = y[np.where(y==0)], y[np.where(y==1)]
    y = np.hstack([y1,y2])
    X = np.vstack([X1,X2])
    return train_test_split(X, y, test_size=1/3)
# -------------------------------------------------

### 3.1 Fitting the SVM

Complete the `SVM_train` function below.

**Tasks**
- [1 pt] Initialize a classifier `clf` using `svm.SVC`, make sure to set the kernel to `kernelType`
- [1 pt] Fit the classifier on `X` and `y`
- [1 pt] Return the fitted classifier


In [25]:
def SVM_train(X,y,kernelType = "linear"):
    pass  # TODO

### Run the plotting function

In [26]:
# -------- RUN THIS CODE DO NOT EDIT -----------------------
def SVM_plot(X,y,error,clf):

    # Create grid to evaluate model
    xx = np.linspace(X[:,0].min()-0.5, X[:,0].max()+0.5, 30)
    yy = np.linspace(X[:,1].min()-0.5, X[:,1].max()+0.5, 30)

    YY, XX = np.meshgrid(yy, xx)
    xy = np.vstack([XX.ravel(), YY.ravel()]).T
    Z = clf.decision_function(xy).reshape(XX.shape)

    # Plot decision boundary and margins
    plt.contour(XX, YY, Z, colors='k', levels=[-1, 0, 1], alpha=0.5, linestyles=['--', '-', '--'])

    plt.scatter(X[:,0], X[:,1], c=y, edgecolors = error)
    plt.show()
# -------------------------------------------------------
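
The grid construction in `SVM_plot` can be easier to follow on a tiny example. The sketch below uses made-up axis values (the 3 x-values and 2 y-values are assumptions chosen only to keep the output small) and the same argument order as the plotting function, so that each row of `xy` is one `(x, y)` grid point.

```python
import numpy as np

xx = np.linspace(0.0, 1.0, 3)    # 3 x-values
yy = np.linspace(10.0, 11.0, 2)  # 2 y-values

# Same argument order as in SVM_plot: XX and YY both have shape (3, 2)
YY, XX = np.meshgrid(yy, xx)

# Flatten both grids and pair them up: one (x, y) row per grid point
xy = np.vstack([XX.ravel(), YY.ravel()]).T

print(XX.shape, xy.shape)
print(xy[:2])
```

Any function evaluated row by row on `xy` returns one value per grid point, which is why the result can be reshaped back to `XX.shape` before it is passed to `plt.contour`.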


### 3.2 Accuracy

Complete the accuracy function below

**Tasks** 

- [3 pt] Complete the function to return the percentage of points that were classified correctly

In [27]:
def accuracy(y_true,y_pred):
    pass  # TODO

### 3.3 Testing

Complete the for loop to test the SVM classifier in a few different situations

**Tasks**
For each iteration the loop should:
- [2 pt] Use `SVM_train` to fit a classifier `clf` on `X_train` and `y_train`
    - Note: Make sure to set `kernelType` to `"linear"`
- [1 pt] Predict `y_pred` using this classifier and `X_test`
- [1 pt] Print the accuracy of `y_pred` when compared to `y_test` (the true values)

In [None]:
for s in [5, 2, .5, .1, 0]:
    X_train, X_test, y_train, y_test = generate_data(s)
    
    clf = None # TODO
    y_pred = None # TODO
    
    err = np.where(np.abs(y_test - y_pred)==0, "None", "red") # This is just for plotting purposes later
    SVM_plot(X = X_test,y = y_pred, error=err, clf = clf)


### 3.4 Discussion

Note what is happening as the data changes.

[2 pt] As the data changes, what about our SVM implementation causes it to classify more poorly despite there still being visually distinct clusters?
- Note that incorrectly classified points are highlighted in red

**Ans** 

[1 pt] How would you recommend that, while still using an SVM, we change either our training data or our classifier to better classify data like this?
- (This is a tough problem and there are many correct answers; we just want to see that you have thought about it a little.)

**Ans** 

### 3.5 Changing the kernel

Complete the following two code blocks by training the classifier on `X_train`, `y_train` with different kernels

**Tasks**
- [1 pt] In the first code block set the kernel to `"poly"`
- [1 pt] In the second code block set the kernel to `"rbf"`

Observe what changes in the plots

In [None]:
# TODO: Use a polynomial kernel
X_train, X_test, y_train, y_test = generate_data(.1)

clf = None # TODO

y_pred = clf.predict(X_test)
print(accuracy(y_test,y_pred))
err = np.where(np.abs(y_test - y_pred)==0, "None", "red")
SVM_plot(X = X_test,y = y_pred, error = err, clf = clf)


In [None]:
# TODO: Use a radial basis kernel
X_train, X_test, y_train, y_test = generate_data(.1)

clf = None # TODO

y_pred = clf.predict(X_test)
print(accuracy(y_test,y_pred))
err = np.where(np.abs(y_test - y_pred)==0, "None", "red")
SVM_plot(X = X_test,y = y_pred, error = err, clf = clf)

### 3.6 Discussion

[1 pt] What did you observe in the plots using the different kernels? Why might this be?

**Ans** 