[Question 1.1]
### The value of *k* in KNN Classification
When analysing data with nearest-neighbors classification, the optimal value of *k* depends largely on the size of the dataset. Generally speaking, larger values of *k* reduce variability, because pooling more neighbors dampens the impact of noise in the data, but they also blur the boundaries between classes and so make classifications less precise.

The value of *k* is the number of nearest neighbors consulted when a data point is classified. In scikit-learn's `KNeighborsClassifier`, the argument `n_neighbors` sets *k* (the default is 5). How much each neighbor contributes to the final classification is controlled by the argument `weights`. Its most common values are `'uniform'` (all data points in a neighborhood are weighted equally) and `'distance'` (each neighbor is weighted by the inverse of its distance from the point being classified, so closer neighbors have more influence).
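
To make the difference between the two weighting schemes concrete, here is a minimal pure-Python sketch of a KNN vote. The function `knn_vote` and the toy data are hypothetical and written only for illustration; this is not scikit-learn's implementation. In particular, the small epsilon used to avoid dividing by zero is a simplification.

```python
import math
from collections import defaultdict

def knn_vote(train, query, k, weights="uniform"):
    """Classify `query` from its k nearest points in `train`.

    `train` is a list of (features, label) pairs. Illustrative only.
    """
    # Sort the training points by Euclidean distance to the query point.
    by_distance = sorted(
        (math.dist(features, query), label) for features, label in train
    )
    votes = defaultdict(float)
    for distance, label in by_distance[:k]:
        if weights == "uniform":
            votes[label] += 1.0  # every neighbor counts equally
        else:
            # 'distance': closer neighbors count more (1 / distance).
            # The epsilon is an assumption made here to avoid dividing by zero.
            votes[label] += 1.0 / (distance + 1e-12)
    return max(votes, key=votes.get)

# Made-up 1-D data: one 'a' point close to the query, two 'b' points farther away.
train = [((1.0,), "a"), ((2.0,), "b"), ((2.1,), "b")]
query = (1.1,)

print(knn_vote(train, query, k=3, weights="uniform"))   # b
print(knn_vote(train, query, k=3, weights="distance"))  # a
```

With uniform weights the two distant `b` points outvote the single nearby `a` point; with distance weights the nearby point dominates.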

In [1]:
#Prework -- Staging the Environment

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score

In [2]:
# Load iris dataset

iris = sns.load_dataset('iris')
iris.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [3]:
#  Convert `species` variable from string to numeric

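le = preprocessing.LabelEncoder()
# Classes are encoded in alphabetical order: setosa -> 0, versicolor -> 1, virginica -> 2
species_encoded = le.fit_transform(iris.species)
print(species_encoded)

# LabelEncoder already returns integers, so this call leaves the values unchanged
pd.to_numeric(species_encoded)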

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [4]:
#[Question 1.2] Create feature matrix and target array with `species` as the target

X_iris = iris.drop('species', axis=1)
Y_iris = species_encoded
print('X shape:', X_iris.shape, '\ny shape:', Y_iris.shape)

X shape: (150, 4) 
y shape: (150,)


In [5]:
#[Question 1.2] Creating a KNN model object where k = 1

neigh = KNeighborsClassifier(n_neighbors=1)

In [6]:
#[Question 1.3] Fitting the model

neigh.fit(X_iris, Y_iris)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')

In [7]:
#[Question 1.3] Predicting values based on all data available

Xpredict = neigh.predict(X_iris)
print(Xpredict)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


In [8]:
#[Question 1.3] Determining accuracy score of the model for k=1

accuracy_score(Y_iris, Xpredict)



1.0

In [9]:
#[Question 1.3] Determining accuracy score of the model for k=5

#Creating new KNN model for k = 5
neigh5 = KNeighborsClassifier(n_neighbors=5)
neigh5.fit(X_iris, Y_iris)

#Predicting values based on new KNN model
X5predict = neigh5.predict(X_iris)

#Calculate accuracy score for k = 5
accuracy_score(Y_iris, X5predict)

0.9666666666666667

[Question 1.3]
### KNN Models and Accuracy Scores
As discussed above, a larger value of *k* dampens the impact of noise in the data, but it also makes the boundaries of the classification "buckets" less distinct. Put another way, the larger *k* is, the smoother and less precise the boundaries between classes become, so observations near a boundary are more likely to be outvoted by neighbors from the adjacent class. That is why the *k = 5* model scores slightly lower here (about 0.967) than the *k = 1* model (1.0). However, the comparison is skewed by the fact that both models are trained and scored on the full dataset. With *k = 1*, the nearest neighbor of every observation is the observation itself, so an accuracy of 100% is expected and says little about how the model would handle new data. With *k = 5*, four other neighbors also vote, and because this dataset is fairly well-segregated with respect to the `species` variable, the few misclassifications are most likely observations near the boundaries of the `species` categories.
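
A quick way to see why a *k = 1* model scores perfectly on its own training data is to look at the distance from each training point to its nearest stored neighbor. The sketch below assumes scikit-learn's bundled `load_iris` data as a stand-in for the seaborn copy used in this notebook (the two may differ in a few values), and the names `nearest_is_exact_match` and `training_accuracy` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled iris data stands in for sns.load_dataset('iris').
X, y = load_iris(return_X_y=True)

model = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# For each training point, the distance to (and index of) its single nearest stored point.
distances, indices = model.kneighbors(X)

# If every nearest neighbor sits at distance 0, each point is matched with itself
# (or with an identical duplicate row).
nearest_is_exact_match = np.allclose(distances[:, 0], 0.0, atol=1e-6)
training_accuracy = model.score(X, y)

print("nearest neighbor is at distance 0:", nearest_is_exact_match)
print("training accuracy:", training_accuracy)
```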

In [10]:
#[Question 2.1] Creation of training and test sets from the 'Iris' dataset, with a 50% split and a seed of 0

X_train, X_test, y_train, y_test = train_test_split(X_iris, Y_iris, test_size=0.5, random_state=0)
print('X train:', X_train.shape, '\ny train:', y_train.shape)
print('X test:', X_test.shape, '\ny test:', y_test.shape)

X train: (75, 4) 
y train: (75,)
X test: (75, 4) 
y test: (75,)


In [11]:
#[Question 2.2] Creation of KNN model for training data
neight5 = KNeighborsClassifier(n_neighbors=5)

#Fitting the model
neight5.fit(X_train, y_train)

#Predicting test data

NE5predict = neight5.predict(X_test)

#[Question 2.3] Computing accuracy score

accuracy_score(y_test, NE5predict)

0.96

[Question 2.3]

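### Why the accuracy scores from 1.3 and 2.3 are similar
The accuracy score obtained for *k = 5* in Question 1.3 (about 0.967) is very close to the accuracy score obtained in Question 2.3 (0.96). Both models use the same *k*-value, and the 75 observations in the Q2.3 training batch were evidently enough for the algorithm to learn the structure of the classes. The two scores do measure different things: the Q1.3 model is scored on the same data it was trained on, which tends to be optimistic, whereas the Q2.3 model is scored on data it has not seen. That the two scores nearly agree suggests the *k = 5* model generalizes well on this dataset.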
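
One way to put the two kinds of score side by side is to compute training accuracy and held-out accuracy from the same fitted model. This is a sketch under the assumption that scikit-learn's bundled `load_iris` data can stand in for the seaborn copy, so the exact numbers may differ slightly from those printed in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled iris data stands in for sns.load_dataset('iris').
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# score() returns mean accuracy, the same quantity accuracy_score computes.
train_accuracy = model.score(X_train, y_train)  # data the model has seen
test_accuracy = model.score(X_test, y_test)     # data the model has not seen

print("training accuracy:", train_accuracy)
print("held-out accuracy:", test_accuracy)
```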

[Question 2.4]

### The merits of `cross_val_score`

Cross-validation (CV) is a process that helps protect against overfitting a model and against "knowledge leaks" from the test data into the model during the learning process, while still letting the model learn from most of the available data.

To best protect against overfitting, data scientists can split their data into three batches: a training batch (from which the algorithm learns about the data), a validation batch (used to compare hyperparameter settings, so that the algorithm does not accidentally learn about the test data during hyperparameter tuning), and a test batch (held back for the final evaluation of the model). However, this three-way split reduces the amount of data from which the algorithm can learn, and the results can be misleading because they depend on one particular random choice of training and validation batches.

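CV sidesteps these issues by splitting the training batch into sub-batches (folds). In *k*-fold CV, the training set is split into *k* sub-batches. The algorithm is trained on *k - 1* of those sub-batches and validated on the remaining one, and the process is repeated *k* times so that each sub-batch serves as the validation set exactly once. The resulting *k* scores are usually averaged. Note that this *k* is unrelated to the *k* in KNN.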
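
The splitting step can be sketched in a few lines of plain Python. The helper `k_fold_indices` is hypothetical and deliberately simple: its folds are contiguous and unshuffled, whereas `cross_val_score` uses stratified folds by default when given a classifier and an integer `cv`.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) pairs for k-fold CV.

    Illustrative sketch: folds are contiguous and unshuffled.
    """
    # Spread any remainder over the first folds so fold sizes differ by at most 1.
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_samples) if not start <= i < start + size]
        yield train, validation
        start += size

# Ten samples, five folds: each sample is used for validation exactly once.
for fold, (train, validation) in enumerate(k_fold_indices(10, 5)):
    print(f"fold {fold}: train on {train}, validate on {validation}")
```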

CV is used in situations in which it would be methodologically unsound to split a dataset into three discrete batches (e.g., small datasets). It can also be used when the analyst is trying to find the most effective number of neighbors for a large dataset whose classes are not well separated with respect to the variable of interest.

In [12]:
#[Question 2.5] Creating a KNN model with k = 3

KNN3 = KNeighborsClassifier(n_neighbors=3)

# Running cross-validation on the model

cross_val_score(KNN3, X_iris, Y_iris, cv=5)

array([0.96666667, 0.96666667, 0.93333333, 0.96666667, 1.        ])

### Cross-Validation Performance of the KNN3 model

Based on the CV scores returned by `cross_val_score`, my model performs fairly well: it classifies between 93.3% and 100% of the query points correctly in each of the five folds, with a mean accuracy of about 96.7%.
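
The five fold scores can be condensed into a mean and a spread, which makes models easier to compare. The sketch below simply re-enters the scores printed above by hand rather than re-running the model; the variable name `scores` is introduced here for illustration.

```python
import numpy as np

# Fold scores copied from the cross_val_score output above (k = 3, cv = 5).
scores = np.array([0.96666667, 0.96666667, 0.93333333, 0.96666667, 1.0])

print(f"mean accuracy: {scores.mean():.3f}")   # 0.967
print(f"std deviation: {scores.std():.3f}")    # 0.021
print(f"worst fold:    {scores.min():.3f}")    # 0.933
```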

In [13]:
#Bonus Round: Scripting a function
#Variables involved: 
#    list_of_k: List of k-values to be tested
#    cvfold: Number of sub-batches desired for CV testing
#    xparam: Independent variable
#    yparam: Variable of interest/classification

def knncv(list_of_k, cvfold, xparam, yparam):
    for k in list_of_k:
        KNN = KNeighborsClassifier(n_neighbors=k)
        print(cross_val_score(KNN, xparam, yparam, cv=cvfold))


#Demonstration of use -- quick testing of multiple k-values to see which of them gives the most accurate model.
knncv([73, 15, 3, 5, 2, 1], 5, X_iris, Y_iris)
    

[0.9        0.9        0.83333333 0.93333333 0.9       ]
[0.93333333 1.         0.93333333 0.96666667 1.        ]
[0.96666667 0.96666667 0.93333333 0.96666667 1.        ]
[0.96666667 1.         0.93333333 0.96666667 1.        ]
[0.96666667 0.93333333 0.93333333 0.9        1.        ]
[0.96666667 0.96666667 0.93333333 0.93333333 1.        ]
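
The raw fold scores printed by `knncv` have to be compared by eye. One possible extension, sketched here, returns the mean score for each *k* so the best candidate can be picked programmatically. The function `knncv_summary` is a hypothetical variant, and scikit-learn's bundled `load_iris` data is assumed as a stand-in for the seaborn copy, so results may differ slightly from the output above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def knncv_summary(list_of_k, cvfold, xparam, yparam):
    """Return {k: mean cross-validated accuracy} for each k in list_of_k."""
    return {
        k: cross_val_score(
            KNeighborsClassifier(n_neighbors=k), xparam, yparam, cv=cvfold
        ).mean()
        for k in list_of_k
    }

# Assumption: scikit-learn's bundled iris data stands in for sns.load_dataset('iris').
X, y = load_iris(return_X_y=True)

mean_scores = knncv_summary([73, 15, 3, 5, 2, 1], 5, X, y)
best_k = max(mean_scores, key=mean_scores.get)

for k, score in sorted(mean_scores.items()):
    print(f"k = {k:>2}: mean accuracy = {score:.3f}")
print("best k among those tested:", best_k)
```

Ties are broken in favor of whichever *k* appears first in the list, which is a side effect of `max` rather than a deliberate design choice.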
