# Activity 09 - $k$-Nearest Neighbor

***
##### CS 434 - Data Mining and Machine Learning
##### Oregon State University-Cascades
***

# Load packages

In [0]:
import ssl
import math 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.preprocessing import StandardScaler
ssl._create_default_https_context = ssl._create_unverified_context

# Dataset

[UCI Yeast Dataset](http://archive.ics.uci.edu/ml/datasets/Yeast)

Predicted attribute: localization site of the protein (non-numeric).

### Features
* **Sequence Name**: Accession number for the SWISS-PROT database
* **mcg**: McGeoch's method for signal sequence recognition.
* **gvh**: von Heijne's method for signal sequence recognition.
* **alm**: Score of the ALOM membrane spanning region prediction program.
* **mit**: Score of discriminant analysis of the amino acid content of the N-terminal region (20 residues long) of mitochondrial and non-mitochondrial proteins.
* **erl**: Presence of "HDEL" substring (thought to act as a signal for retention in the endoplasmic reticulum lumen). Binary attribute.
* **pox**: Peroxisomal targeting signal in the C-terminus.
* **vac**: Score of discriminant analysis of the amino acid content of vacuolar and extracellular proteins.

### Class label
* **nuc**: Score of discriminant analysis of nuclear localization signals of nuclear and non-nuclear proteins.

> Note: this is a hard dataset.  The best reported result is $F_1$ = `0.608`. (src: [Table 4](https://arxiv.org/ftp/arxiv/papers/1703/1703.08283.pdf))

In [0]:
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/yeast/yeast.data'
header = ['sequence', 'mcg', 'gvh', 'alm', 'mit', 'erl', 'pox', 'vac', 'nuc']

*** 
# Exercise #1 - Load data
*** 

##### 1.1 Read the url into new dataframe `df`.

* mind your delimiter
* take the first column as row index (default)

In [0]:
# read the url into df
print('your code here')
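
As a rough illustration of the delimiter hint, the sketch below reads a few made-up whitespace-delimited rows from an in-memory string rather than from the course URL. The rows, the column names `f1`, `f2`, `label`, and the variable `df_demo` are all hypothetical. It also names the ID column and passes `index_col=0` explicitly, whereas the activity's `header` list relies on pandas defaults.

```python
import io
import pandas as pd

# Made-up rows in a whitespace-delimited layout: an ID, two numeric features, a label.
sample = io.StringIO(
    "PROT_A  0.58  0.61  CYT\n"
    "PROT_B  0.43  0.67  NUC\n"
    "PROT_C  0.64  0.62  CYT\n"
)

# Hypothetical column names; the first column is used as the row index.
names = ["sequence", "f1", "f2", "label"]

# sep=r"\s+" treats any run of spaces or tabs as one delimiter.
df_demo = pd.read_csv(sample, sep=r"\s+", names=names, index_col=0)

print(df_demo)
```

The same call should work with a URL in place of the in-memory buffer, since `pd.read_csv` accepts both.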

##### 1.2 Describe your data


In [0]:
# describe the dataset
print('your code here')

*** 
# Exercise #2 - Prepare the experiment
***

##### 2.1 Split `X` and `y`

In [0]:
# split to X and y
print('your code here')

##### 2.2 Partition to train and test sets

* test proportion of 20%
* `random_state=1`
* stratify by `y`

In [0]:
# partition to train and test sets
print('your code here')
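
The three requirements above (20% test proportion, `random_state=1`, stratification) map directly onto arguments of `train_test_split`. The sketch below shows them on synthetic stand-in data. The frame `demo`, its column names, and the `_demo` variables are hypothetical and are not the Yeast data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 numeric features, an imbalanced 3-class label.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((100, 3)), columns=["f1", "f2", "f3"])
demo["label"] = ["A"] * 60 + ["B"] * 30 + ["C"] * 10

# Features are every column except the label; the label column is the target.
X_demo = demo.drop(columns="label")
y_demo = demo["label"]

# stratify=y_demo keeps the class proportions the same in both partitions.
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

print(X_tr_demo.shape, X_te_demo.shape, y_tr_demo.shape, y_te_demo.shape)
print(y_te_demo.value_counts())
```

With stratification, the 60/30/10 class mix of the full set is preserved in the 20-row test set.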

##### 2.3 Print the shapes of your four sets

In [0]:
# print shapes of train and test
print('your code here')

##### 2.4 Experiment to train and test model

Complete the function to:
1. Build [$k$-NN](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) model with Euclidean distance
1. Train the model
1. Test the model and save prediction as `y_pred`
1. Calculate `score`
1. Return the `score` and `y_pred`

In [0]:
def train_and_test(X_train, X_test, y_train, y_test, k=5):
  # (1) build k-NN model
  print('your code here')

  # (2) train the model
  print('your code here')

  # (3) test the model
  print('your code here')

  # (4) calculate score
  print('your code here')

  # (5) return score and y_pred
  return score, y_pred
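
One possible shape for the five steps is sketched below under a different name, `train_and_test_sketch`, so it does not replace your own function. It assumes the score is accuracy, which is how later cells refer to it, and it runs on a tiny made-up dataset rather than the Yeast data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score


def train_and_test_sketch(X_train, X_test, y_train, y_test, k=5):
    # (1) build the model; Minkowski distance with p=2 is Euclidean distance
    model = KNeighborsClassifier(n_neighbors=k, metric="minkowski", p=2)
    # (2) train the model (k-NN stores the training instances)
    model.fit(X_train, y_train)
    # (3) predict a label for each test instance
    y_pred = model.predict(X_test)
    # (4) calculate the score (accuracy is an assumption here)
    score = accuracy_score(y_test, y_pred)
    # (5) return both values
    return score, y_pred


# Tiny, well-separated made-up example: three "lo" points and three "hi" points.
X_a = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_a = np.array(["lo", "lo", "lo", "hi", "hi", "hi"])
X_b = np.array([[0.05, 0.05], [1.05, 1.0]])
y_b = np.array(["lo", "hi"])

demo_score, demo_pred = train_and_test_sketch(X_a, X_b, y_a, y_b, k=3)
print(demo_score, demo_pred)
```

`metric="minkowski"` with `p=2` is also the default for `KNeighborsClassifier`, so spelling it out only makes the Euclidean choice explicit.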

*** 
# Exercise #3 - Run the experiment
*** 

##### 3.1 Predict the test set
* use $k=11$
* save predictions as `y_pred`
* save score as `score`

In [0]:
# call your train_and_test function
print('your code here')

##### 3.2 Print the test score returned in Exercise 3.1

In [0]:
# print the accuracy you returned in 3.1
print('your code here')

##### 3.3 Count the predictions by class label

In [0]:
# count the number of predictions by class
print('your code here')
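
Counting predictions per class can be done by wrapping the predictions in a pandas `Series`, as in this small sketch. The list `demo_pred` is made up for illustration.

```python
import pandas as pd

# Made-up predictions standing in for y_pred.
demo_pred = ["CYT", "NUC", "CYT", "MIT", "CYT", "NUC"]

# value_counts() returns one row per distinct label, most frequent first.
counts = pd.Series(demo_pred).value_counts()
print(counts)
```

`np.unique(demo_pred, return_counts=True)` is an alternative that avoids pandas. Classes that were never predicted will not appear in either result.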

##### 3.4 Graph a confusion matrix

In [0]:
# graph a confusion matrix
print('your code here')
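
A confusion matrix can be computed with `confusion_matrix` and drawn with plain matplotlib. The sketch below is one way to do that on made-up labels. The lists `y_true_demo` and `y_pred_demo` and the plotting choices are assumptions, not the activity's required style.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels.
y_true_demo = ["A", "A", "A", "B", "B", "C"]
y_pred_demo = ["A", "A", "B", "B", "B", "A"]
labels = ["A", "B", "C"]

# Rows are true labels, columns are predicted labels, in the order given by `labels`.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)

fig, ax = plt.subplots()
image = ax.imshow(cm, cmap="Blues")
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels)
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)
ax.set_xlabel("predicted label")
ax.set_ylabel("true label")

# Write the count inside each cell.
for i in range(len(labels)):
    for j in range(len(labels)):
        ax.text(j, i, cm[i, j], ha="center", va="center")

fig.colorbar(image)
plt.show()
```

Passing `labels` explicitly fixes the row and column order, which also keeps classes that never occur in the predictions visible as all-zero columns.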

##### 3.5 Calculate precision, recall, and $F_1$

In [0]:
# precision, recall, and F1
print('your code here')

> The warning appears because we didn't predict any instances for some of the classes, so `precision` hits a `0/0` in its calculation. We want this undefined value to be treated as `0`, which is the default behaviour (with a warning).
>
> To suppress the warning, add the parameter `zero_division=0`. In most cases, though, it's useful to be reminded that we aren't predicting all labels.
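
To make the `0/0` case concrete, the sketch below uses made-up labels in which class `"C"` is never predicted. Using `average="macro"` is an assumption: multiclass targets need some `average` other than the default `"binary"`, and the activity does not say which one to use.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up labels; class "C" never appears in the predictions.
y_true_demo = ["A", "A", "B", "B", "C", "C"]
y_pred_demo = ["A", "A", "B", "A", "A", "B"]

# zero_division=0 scores the undefined precision for "C" as 0 without a warning.
precision = precision_score(y_true_demo, y_pred_demo, average="macro", zero_division=0)
recall = recall_score(y_true_demo, y_pred_demo, average="macro", zero_division=0)
f1 = f1_score(y_true_demo, y_pred_demo, average="macro", zero_division=0)

print(precision, recall, f1)
```

Here the per-class precisions are 0.5, 0.5, and 0, so the macro average is 1/3. Leaving out `zero_division=0` gives the same numbers plus the warning.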

*** 
# Exercise #4 - Select the best $k$ by math
*** 

> Before lazily asking someone what to do here, review the lecture and/or textbook.

##### 4.1 Calculate an "informed" value of $k$ using the math approach

* calculate it; don't just hard-code a number
* save as variable `k`
* round to nearest *odd* number

In [0]:
# calculate k
print('your code here')  
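
One widely taught rule of thumb sets $k \approx \sqrt{n}$, where $n$ is the number of training instances. Whether that is the "math approach" from lecture is an assumption, so check your notes. The sketch below shows only the arithmetic, including rounding to the nearest odd number. The helper `nearest_odd` and the size `n_train_demo` are hypothetical.

```python
import math


def nearest_odd(x):
    """Round x to the nearest odd integer (an exact even x rounds up)."""
    return 2 * math.floor(x / 2) + 1


n_train_demo = 1000                              # hypothetical training-set size
k_demo = nearest_odd(math.sqrt(n_train_demo))    # sqrt(1000) is about 31.6
print(k_demo)
```

An odd $k$ avoids tied votes in two-class problems. With more classes ties can still occur, but keeping $k$ odd is a common convention.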

##### 4.2 Print $k$

In [0]:
# print k
print(k)

35


##### 4.3 Call `train_and_test` with your value of $k$

In [0]:
# call train_and_test with your new k
print('your code here')

*** 
# Exercise #5 - Select $k$ by experiment
*** 

##### 5.1 Select the best $k$ by experiment

* run $k$ = 1 to 99
* only check odd values of $k$
* call `train_and_test` for each $k$
* save the score in list `k_results`

In [0]:
# run k = 1 to 99 and save results
print('your code here')
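
The loop structure can be sketched as follows on synthetic two-class data. To stay self-contained, the sketch calls `KNeighborsClassifier.score` (mean accuracy) directly instead of your `train_and_test` function. All names ending in `_demo`, and the loop variable `k_i`, are hypothetical and chosen so they do not overwrite your own variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data: two Gaussian blobs in 2-D.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(2.0, 1.0, (100, 2))])
y_demo = np.array([0] * 100 + [1] * 100)

X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# range(1, 100, 2) yields the odd values 1, 3, ..., 99.
k_values_demo = list(range(1, 100, 2))
k_results_demo = []
for k_i in k_values_demo:
    model = KNeighborsClassifier(n_neighbors=k_i).fit(X_tr_demo, y_tr_demo)
    k_results_demo.append(model.score(X_te_demo, y_te_demo))

print(len(k_results_demo))
```

Note that $k$ cannot exceed the number of training instances. The 160 training rows here leave room for $k = 99$.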

##### 5.2 Plot $k$ vs score

* x-axis is $k$ from 1 to 99 (odds)
* y-axis is the respective score
* label your graph

In [0]:
# graph k against the respective score
print('your code here')

##### 5.3 Find the best $k$

* iterate over your values of $k$
* print the value $k$, a tab `\t`, and the respective score
* use `.format` notation
* format your floats (to six places, the default for `:f`)
* your numbers should line up (width) with the header

In [0]:
# print accuracy for each k
print('k\taccuracy\n================')
print('your code here')
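
The formatting requirements can be sketched with made-up numbers as below. The lists `k_values_demo` and `scores_demo` are hypothetical stand-ins for your values of $k$ and `k_results`.

```python
# Made-up values standing in for the k values and their scores.
k_values_demo = [1, 3, 5]
scores_demo = [0.5, 0.5625, 0.61]

# '{:f}' formats a float with six digits after the decimal point.
rows = ['{}\t{:f}'.format(k_i, s) for k_i, s in zip(k_values_demo, scores_demo)]

print('k\taccuracy\n================')
for row in rows:
    print(row)

# The best k is the one paired with the highest score.
best_k_demo = k_values_demo[scores_demo.index(max(scores_demo))]
print('best k:', best_k_demo)
```

The tab character places every score under the `accuracy` column of the header.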

##### 5.4 Print the average score across all values of $k$

In [0]:
# average score of the k_results
print('your code here')

> This doesn't tell us anything useful or informative, but it's a Canvas question nevertheless.

##### 5.5 Train vs. Validation vs. Test

> Note: In the above section we calculated the best `k` by viewing performance on the `test` set.  This is fine for our learning purposes and can be informative. However, in a true experiment, this would be *cheating* because we used the `test` set to learn a parameter for our model.
>
>  The proper way is to divide the `train` set into `train'` and `validation` sets. We then train on `train'` and use the `validation` set to test our model while exploring the values of $k$. After selecting a final $k$, we retrain the model on the original `train` set, and evaluate with the held-out `test` set. 
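
The procedure described in the note can be sketched as follows on synthetic data. The 25% validation proportion, the candidate range for $k$, and every variable name here are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data: two Gaussian blobs in 2-D.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, (150, 2)), rng.normal(2.0, 1.0, (150, 2))])
y_demo = np.array([0] * 150 + [1] * 150)

# Hold out the test set first; it is not used again until the final evaluation.
X_trn, X_tst, y_trn, y_tst = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# Split the training set into train' (X_sub) and validation (X_val).
X_sub, X_val, y_sub, y_val = train_test_split(
    X_trn, y_trn, test_size=0.25, random_state=1, stratify=y_trn
)

# Explore k using the validation set only.
candidate_ks = list(range(1, 32, 2))
val_scores = [
    KNeighborsClassifier(n_neighbors=k_i).fit(X_sub, y_sub).score(X_val, y_val)
    for k_i in candidate_ks
]
best_k_demo = candidate_ks[int(np.argmax(val_scores))]

# Retrain on the original training set with the chosen k, then evaluate once on test.
final_model = KNeighborsClassifier(n_neighbors=best_k_demo).fit(X_trn, y_trn)
test_score_demo = final_model.score(X_tst, y_tst)
print(best_k_demo, test_score_demo)
```

`GridSearchCV`, imported at the top of this notebook, automates a similar search using cross-validation on the training set instead of a single validation split.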

<img src="https://66.media.tumblr.com/dded9d1a2bf2068f92af9f7a9b6b5451/tumblr_p6s3hbPzgV1vd8jsjo1_500.gifv" width="300">