# K-Nearest Neighbor Lab
Read over the scikit-learn documentation on [nearest neighbor learners](https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-classification).




In [7]:
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
import numpy as np
import pandas as pd
from scipy.io import arff
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

## 1 K-Nearest Neighbor (KNN) algorithm

### 1.1 (15%) Basic KNN Classification

Learn the [Glass data set](https://archive.ics.uci.edu/dataset/42/glass+identification) using [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier) with default parameters.
- Randomly split your data into train/test sets. Whenever specifics are not given (such as what percentage is train vs. test), choose your own reasonable values
- Give typical train and test set accuracies after running with different random splits
- Print the output probabilities for a test set (predict_proba)
- Try it with different p values (Minkowskian exponent) and discuss any differences
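
To make the role of the `p` parameter concrete, here is a minimal sketch of the Minkowski distance that `KNeighborsClassifier` uses by default. With `p=1` it is the Manhattan distance and with `p=2` the Euclidean distance. The helper name `minkowski_distance` and the two sample points are illustrative only and are not part of the lab data.

```python
import numpy as np

def minkowski_distance(a, b, p=2):
    """Minkowski distance between two 1-D points: (sum |a_i - b_i|^p)^(1/p)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

# Hypothetical points chosen so the p=1 and p=2 results are easy to check by hand.
point_a = [0.0, 0.0]
point_b = [3.0, 4.0]
for p in (1, 2, 3, 4):
    print(f"p={p}: {minkowski_distance(point_a, point_b, p):.4f}")
```

As `p` grows, the largest single-feature difference contributes more and more of the total, so which neighbors count as "nearest" can change with `p`.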

In [8]:
# Learn the glass data

Data_Set = arff.loadarff('datasets/glass_train.arff')
df = pd.DataFrame(Data_Set[0])
le = LabelEncoder()

display(df.head())
display(df["Type"].value_counts())


df['Type'] = le.fit_transform(df['Type'])



display(df.head())

X = pd.get_dummies(df.drop('Type', axis=1)).to_numpy()
y = df['Type'].to_numpy()

X_voting = X
y_voting = y

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)


| |RI|Na|Mg|Al|Si|'K'|Ca|Ba|Fe|Type|
|---|---|---|---|---|---|---|---|---|---|---|
|0|1.51793|12.79|3.5|1.12|73.03|0.64|8.77|0.0|0.0|b'build wind float'|
|1|1.51643|12.16|3.52|1.35|72.89|0.57|8.53|0.0|0.0|b'vehic wind float'|
|2|1.51793|13.21|3.48|1.41|72.64|0.59|8.43|0.0|0.0|b'build wind float'|
|3|1.51299|14.4|1.74|1.54|74.55|0.0|7.59|0.0|0.0|b'tableware'|
|4|1.53393|12.3|0.0|1.0|70.16|0.12|16.19|0.0|0.24|b'build wind non-float'|


Type
b'build wind non-float'    44
b'build wind float'        41
b'headlamps'               18
b'vehic wind float'        12
b'containers'               7
b'tableware'                5
Name: count, dtype: int64

| |RI|Na|Mg|Al|Si|'K'|Ca|Ba|Fe|Type|
|---|---|---|---|---|---|---|---|---|---|---|
|0|1.51793|12.79|3.5|1.12|73.03|0.64|8.77|0.0|0.0|0|
|1|1.51643|12.16|3.52|1.35|72.89|0.57|8.53|0.0|0.0|5|
|2|1.51793|13.21|3.48|1.41|72.64|0.59|8.43|0.0|0.0|0|
|3|1.51299|14.4|1.74|1.54|74.55|0.0|7.59|0.0|0.0|4|
|4|1.53393|12.3|0.0|1.0|70.16|0.12|16.19|0.0|0.24|1|


In [22]:
def run_stuff(test_size=0.2, p=2):
    """Fit KNN on a fresh random split and print train/test accuracy."""
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=test_size)

    # Note: n_neighbors=3 is used here; the scikit-learn default is 5.
    clf = KNeighborsClassifier(n_neighbors=3, p=p)
    clf.fit(x_train, y_train)

    print("----------------------------------------------------------")
    print(f"Test Size: {test_size * 100}% Train Size: {(1 - test_size) * 100}% P Value: {p}")
    print(f"Train score: {clf.score(x_train, y_train)}")
    print(f"Test score: {clf.score(x_test, y_test)}")


run_stuff(0.1)
run_stuff(0.2)
run_stuff(0.3)
run_stuff(0.4)
run_stuff(0.5)
run_stuff(0.6)
run_stuff(0.7)
run_stuff(0.8)
run_stuff(0.9)

run_stuff()
run_stuff(p=1)
run_stuff(p=2)
run_stuff(p=3)
run_stuff(p=4)



----------------------------------------------------------
Test Size: 10.0% Train Size: 90.0% P Value: 2
Train score: 0.7982456140350878
Test score: 0.6153846153846154
----------------------------------------------------------
Test Size: 20.0% Train Size: 80.0% P Value: 2
Train score: 0.7524752475247525
Test score: 0.6923076923076923
----------------------------------------------------------
Test Size: 30.0% Train Size: 70.0% P Value: 2
Train score: 0.7613636363636364
Test score: 0.5897435897435898
----------------------------------------------------------
Test Size: 40.0% Train Size: 60.0% P Value: 2
Train score: 0.7105263157894737
Test score: 0.6274509803921569
----------------------------------------------------------
Test Size: 50.0% Train Size: 50.0% P Value: 2
Train score: 0.7777777777777778
Test score: 0.578125
----------------------------------------------------------
Test Size: 60.0% Train Size: 40.0% P Value: 2
Train score: 0.76
Test score: 0.5714285714285714
----------------

#### Test/Train Split Analysis (P=2)

|Test Size|Train Size|P Value|Train Score|Test Score|
|---|---|---|---|---|
|10.0%|90.0%|2|0.7982|0.6154|
|20.0%|80.0%|2|0.7525|0.6923|
|30.0%|70.0%|2|0.7614|0.5897|
|40.0%|60.0%|2|0.7105|0.6275|
|50.0%|50.0%|2|0.7778|0.5781|
|60.0%|40.0%|2|0.7600|0.5714|
|70.0%|30.0%|2|0.7632|0.5281|
|80.0%|20.0%|2|0.8400|0.5294|
|90.0%|10.0%|2|0.5833|0.3478|
|20.0%|80.0%|2|0.7327|0.5769|

#### P Value Comparison (20% Test / 80% Train)

|Test Size|Train Size|P Value|Train Score|Test Score|
|---|---|---|---|---|
|20.0%|80.0%|2|0.7327|0.5769|
|20.0%|80.0%|1|0.8119|0.4615|
|20.0%|80.0%|2|0.7129|0.6538|
|20.0%|80.0%|3|0.8119|0.6154|
|20.0%|80.0%|4|0.8020|0.5000|
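
The task also asks for the output probabilities, which the cell above does not print. The sketch below shows what `predict_proba` returns, using a tiny made-up 1-D data set (`X_toy`, `y_toy`, and `query` are hypothetical and not the Glass data). With the default uniform weighting, each probability is the fraction of the k nearest neighbors that belong to that class.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 1-D data (hypothetical, not the Glass data): class 0 near 0, class 1 near 10.
X_toy = np.array([[0.0], [1.0], [2.0], [9.0], [10.0], [11.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

# One row per query point, one column per class in clf.classes_; each row sums to 1.
query = np.array([[1.5], [6.0]])
proba = clf.predict_proba(query)
print(clf.classes_)
print(proba)
```

On the Glass data the same call would be `clf.predict_proba(x_test)` on a fitted classifier; with k=3 and uniform weights every entry is a multiple of 1/3.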


#### Discussion
What were your accuracies or output probabilities and how did different hyperparameter values affect the outcome? Discuss the differences you see.

** Your discussion goes here **

## 2 KNN Classification with normalization and distance weighting

Use the [magic telescope](https://axon.cs.byu.edu/data/uci_class/MagicTelescope.arff) dataset

### 2.1 (5%) - Without Normalization or Distance Weighting
- Do random 80/20 train/test splits each time
- Run with k=3 and *without* distance weighting and *without* normalization
- Show train and test set accuracy
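
The workflow described above might look like the following sketch. It runs on synthetic stand-in data rather than the magic telescope file, so `X_demo`, `y_demo`, and the feature scales are assumptions made for illustration. The second feature is deliberately uninformative and on a much larger scale, which is the kind of situation where skipping normalization can hurt.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not the magic telescope data).
rng = np.random.default_rng(0)
n = 400
y_demo = rng.integers(0, 2, size=n)
X_demo = np.column_stack([
    y_demo + rng.normal(0, 0.3, size=n),   # informative feature, small scale
    rng.normal(0, 1000.0, size=n),         # uninformative feature, large scale
])

for seed in range(3):  # a fresh random 80/20 split each time
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    # k=3, no distance weighting (weights="uniform"), no normalization
    clf = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_tr, y_tr)
    print(f"split {seed}: train={clf.score(X_tr, y_tr):.3f} test={clf.score(X_te, y_te):.3f}")
```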

In [1]:
# Learn magic telescope data

#### Discussion
What did you observe in your results?

** Your discussion goes here **

### 2.2 (10%) With Normalization
- Try it with k=3 without distance weighting but *with* normalization of input features.  You may use any reasonable normalization approach (e.g. standard min-max normalization between 0-1, z-transform, etc.)
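
One reasonable way to normalize, sketched below, is to fit `MinMaxScaler` on the training rows only and reuse the learned minimum and maximum on the test rows. The small arrays are hypothetical; fitting on the training split alone is a common practice assumed here, not a requirement stated by the lab.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical features on very different scales.
X_train_demo = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
X_test_demo = np.array([[2.0, 600.0]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learn min/max from training rows only
X_test_scaled = scaler.transform(X_test_demo)        # reuse the training min/max

print(X_train_scaled)
print(X_test_scaled)  # test values can fall outside [0, 1]
```

After scaling, both features span the same range on the training set, so neither dominates the distance computation purely because of its units.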

In [None]:
# Train/Predict with normalization

#### Discussion
Discuss the results of using normalized data vs. unnormalized data

** Your discussion goes here **

### 2.3 (10%) With Distance Weighting
- Try it with k=3 and with distance weighting *and* normalization
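
In scikit-learn, distance weighting is enabled with `weights="distance"`, which weights each neighbor by the inverse of its distance. The sketch below shows the idea of an inverse-distance weighted vote in plain NumPy. The function name `weighted_vote` and the `eps` guard against division by zero are assumptions of this sketch, not scikit-learn's implementation.

```python
import numpy as np

def weighted_vote(distances, labels, eps=1e-12):
    """Return the label with the largest total inverse-distance weight."""
    distances = np.asarray(distances, dtype=float)
    labels = np.asarray(labels)
    weights = 1.0 / (distances + eps)  # eps (an assumption here) avoids division by zero
    classes = np.unique(labels)
    totals = np.array([weights[labels == c].sum() for c in classes])
    return classes[np.argmax(totals)]

# Three hypothetical neighbors: one very close "a", two farther "b".
neighbor_distances = [0.1, 1.0, 1.0]
neighbor_labels = ["a", "b", "b"]
print(weighted_vote(neighbor_distances, neighbor_labels))
```

An unweighted majority vote over the same three neighbors would pick "b", while the weighted vote favors the single close neighbor, which is the kind of difference to look for when comparing results.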

In [None]:
# Train/Predict with normalization and distance weighting

#### Discussion
Compare and discuss the differences you see with distance weighting and normalization vs. without.

** Your discussion goes here **

### 2.4 (10%) Different k Values
- Using your normalized data with distance weighting, create one graph with classification accuracy on the test set on the y-axis and k values on the x-axis.
- Use values of k from 1 to 15.  Use the same train/test split for each. 
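
A sweep over k could be structured as in the sketch below. It uses synthetic data from `make_classification` as a stand-in (an assumption; swap in the magic telescope features and labels), keeps one fixed train/test split, and plots test accuracy against k.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (an assumption, not the magic telescope data).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

# One fixed split, reused for every value of k.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

scaler = MinMaxScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

k_values = list(range(1, 16))
test_accuracy = []
for k in k_values:
    clf = KNeighborsClassifier(n_neighbors=k, weights="distance").fit(X_tr_s, y_tr)
    test_accuracy.append(clf.score(X_te_s, y_te))

fig, ax = plt.subplots()
ax.plot(k_values, test_accuracy, marker="o")
ax.set_xlabel("k (number of neighbors)")
ax.set_ylabel("Test accuracy")
ax.set_title("Test accuracy vs. k")
```

In a notebook the figure renders at the end of the cell; in a script, a call to `plt.show()` would be needed.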

In [3]:
# Calculate and Graph classification accuracy vs k values

#### Discussion
How do the k values affect your results?

** Your discussion goes here **

## 3 KNN Regression with normalization and distance weighting

Use the [sklearn KNeighborsRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor) on the [housing price prediction](https://axon.cs.byu.edu/data/uci_regression/housing.arff) problem.  
### 3.1 (5%) Ethical Data
Note this data set has an example of an inappropriate input feature which we discussed.  State which feature is inappropriate and discuss why.

#### Discussion
Discuss the inappropriate feature. Which one is it, and why?

** Your discussion goes here **

### 3.2 (15%) - KNN Regression 
- Do random 80/20 train/test splits each time
- Run with k=3
- Print the score (coefficient of determination) and Mean Absolute Error (MAE) for the train and test set for the cases of
  - No input normalization and no distance weighting
  - Normalization and no distance weighting
  - Normalization and distance weighting
- Normalize input features where needed but do not normalize the output
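
For reference, the two metrics requested above can be written out directly. `KNeighborsRegressor.score` returns the coefficient of determination (R²), and `sklearn.metrics.mean_absolute_error` computes MAE; the NumPy versions below are a sketch of the underlying formulas, with helper names (`mae`, `r_squared`) and sample values chosen only for illustration.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: average of |y_true - y_pred|."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)

# Hypothetical targets and predictions.
y_true_demo = [1.0, 2.0, 3.0]
y_pred_demo = [1.0, 2.0, 4.0]
print(mae(y_true_demo, y_pred_demo))
print(r_squared(y_true_demo, y_pred_demo))
```

MAE is in the same units as the target, which is one reason not to normalize the output: the error stays interpretable as a price difference.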

In [6]:
# Learn and experiment with housing price prediction data

#### Discussion
Discuss your results. How did the hyperparameters affect your results? Discuss each one and combinations of each.

** Your discussion goes here **

### 3.3 (10%)  Different k Values
- Using housing with normalized data and distance weighting, create one graph with MAE on the test set on the y-axis and k values on the x-axis
- Use values of k from 1 to 15.  Use the same train/test split for each. 

In [None]:
# Learn and graph for different k values

#### Discussion
How did the k values affect your results for this dataset? How does that compare to your previous work in this lab?

** Your discussion goes here **

## 4 (20%) KNN with nominal and real data

- Use the [lymph dataset](https://axon.cs.byu.edu/data/uci_class/lymph.arff)
- Use an 80/20 split of the data for the training/test set
- This dataset has both continuous and nominal attributes 
- Implement a distance metric which uses Euclidean distance for continuous features and 0/1 distance for nominal. Hints:
    - Write your own distance function (e.g. mydist) and use clf = KNeighborsClassifier(metric=mydist)
    - Change the nominal features in the data set to integer values since KNeighborsClassifier expects numeric features. I used LabelEncoder on the nominal features.
    - Keep a list of which features are nominal which mydist can use to decide which distance measure to use
    - There was an occasional bug in scikit-learn version 1.3.0 ("Flags object has no attribute 'c_contiguous'") that went away when I upgraded to scikit-learn version 1.3.1
- Use your own choice for k and other parameters
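
One possible shape for such a metric is sketched below. It combines squared differences on continuous features with 0/1 mismatches on nominal features under a single square root. That particular way of combining the two parts is an assumption of this sketch, as are the names `make_mixed_distance` and `nominal_mask`; other combinations are equally defensible.

```python
import numpy as np

def make_mixed_distance(nominal_mask):
    """Build a distance function from a boolean mask marking the nominal columns."""
    nominal_mask = np.asarray(nominal_mask, dtype=bool)

    def mydist(a, b):
        a = np.asarray(a, dtype=float)
        b = np.asarray(b, dtype=float)
        # Continuous columns contribute squared differences (Euclidean part).
        continuous_sq = (a[~nominal_mask] - b[~nominal_mask]) ** 2
        # Nominal columns contribute 0 if the codes match and 1 otherwise.
        nominal_mismatch = (a[nominal_mask] != b[nominal_mask]).astype(float)
        return float(np.sqrt(continuous_sq.sum() + nominal_mismatch.sum()))

    return mydist

# Hypothetical layout: the middle column is nominal, the other two are continuous.
mydist = make_mixed_distance([False, True, False])
print(mydist([0.0, 1.0, 0.0], [3.0, 2.0, 4.0]))  # nominal codes differ
print(mydist([0.0, 1.0, 0.0], [3.0, 1.0, 4.0]))  # nominal codes match
```

The returned function can be passed as `KNeighborsClassifier(metric=mydist)`, as the hint suggests. Because a nominal mismatch contributes at most 1, it may help to scale continuous features to a comparable range first.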

In [5]:
# Train/Predict lymph with your own distance metric

#### Discussion
Explain your distance metric and discuss your results

** Your discussion goes here **

## 5 (Optional 15% extra credit) Code up your own KNN Learner
Below is a scaffold you could use if you want. Requirements for this task:
- Your model should support the methods shown in the example scaffold below
- Use Euclidean distance to decide closest neighbors
- Implement both the classification and regression versions
- Include optional distance weighting for both algorithms
- Run your algorithm on the magic telescope and housing data sets above and discuss and compare your results 
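
As a starting point for the regression variant, the sketch below finds the k nearest training rows by Euclidean distance and returns either their plain mean or an inverse-distance weighted mean. The function names, the `eps` guard, and the tiny data set are all assumptions for illustration; this is not the scaffold's required interface.

```python
import numpy as np

def k_nearest(X_train, query, k):
    """Return (indices, distances) of the k training rows closest to `query`."""
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    order = np.argsort(distances, kind="stable")[:k]
    return order, distances[order]

def predict_regression(X_train, y_train, query, k=3, weighted=True, eps=1e-12):
    """Predict a real value as the (optionally distance-weighted) mean of k neighbors."""
    idx, dist = k_nearest(X_train, query, k)
    if not weighted:
        return float(y_train[idx].mean())
    w = 1.0 / (dist + eps)  # eps (an assumption here) avoids division by zero
    return float(np.sum(w * y_train[idx]) / np.sum(w))

# Hypothetical 1-D training data.
X_small = np.array([[0.0], [1.0], [2.0], [10.0]])
y_small = np.array([0.0, 10.0, 20.0, 100.0])
print(predict_regression(X_small, y_small, np.array([1.5]), k=2))
print(predict_regression(X_small, y_small, np.array([0.5]), k=3, weighted=False))
```

The classification version can reuse `k_nearest` and replace the mean with a (weighted) vote over the neighbor labels.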

*Discussion*

In [None]:
from sklearn.base import BaseEstimator, ClassifierMixin

class KNNClassifier(BaseEstimator,ClassifierMixin):
    def __init__(self, columntype=[], weight_type='inverse_distance'): ## add parameters here
        """
        Args:
            columntype: for each column, tells you if it is continuous [real] or nominal [categorical].
            weight_type: inverse-distance weighted voting or unweighted voting. Options = ["no_weight","inverse_distance"]
        """
        self.columntype = columntype  # Note: this won't be needed until part 5
        self.weight_type = weight_type

    def fit(self, data, labels):
        """ Fit the data; run the algorithm (for this lab really just saves the data :D)
        Args:
            data (array-like): A 2D numpy array with the training data, excluding targets
            labels (array-like): A numpy array with the training targets
        Returns:
            self: this allows this to be chained, e.g. model.fit(X,y).predict(X_test)
        """
        return self
    
    def predict(self, data):
        """ Predict all classes for a dataset
        Args:
            data (array-like): A 2D numpy array with the data to predict, excluding targets
        Returns:
            array, shape (n_samples,)
                Predicted target values per element in data.
        """
        pass

    # Returns the mean accuracy given input data and labels
    def score(self, X, y):
        """ Return accuracy of model on a given dataset. Must implement own score function.
        Args:
            X (array-like): A 2D numpy array with data, excluding targets
            y (array-like): A numpy array with targets
        Returns:
            score : float
                Mean accuracy of self.predict(X) wrt. y.
        """
        return 0