# KNN Exercise

![iris](images/iris.jpg)

We are going to use the famous **iris data set** again. 

The dataset consists of four attributes, which can be used to distinguish different iris species: 
* sepal-width
* sepal-length
* petal-width 
* petal-length


The task is to predict the class to which these plants belong. There are three classes in the dataset: **Iris-setosa, Iris-versicolor and Iris-virginica.** 

Further details of the dataset are available here:
https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html

## Task

1. Import and preprocess the data (as far as necessary). Then split it into a training and a test set, fit a KNN model and make predictions on the test set. Finally, evaluate your model. Also try scaling your data, and fit the model to both the unscaled and the scaled data. Can you see a difference in performance?

2. Calculate the accuracy for K values from 1 to 40. In each iteration, compute the accuracy of the predictions on the test set and append the result to a list. Then plot the accuracy values against the K values.

#### Task 1

Let's first import some libraries, as we will need them later.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from itertools import combinations 
from scipy.stats import zscore

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler, MinMaxScaler


After unzipping the data and placing it in data/iris.csv, we can read it into a dataframe:

In [None]:
df=pd.read_csv('data/iris.csv')

And let's quickly check if the import worked.

In [None]:
df.head()

At first glance, everything looks fine. Let's check the datatypes and see if we have missing values:

In [None]:
df.info()

This also looks fine: 
- No missing values
- all features are floats (numerical) 
- only the target is an object (=string)

Next, let's quickly have a look at the distribution of the features:

In [None]:
desc=df.describe().T
display(desc)

...as well as the range spanned by each feature and the z-scores of the extreme observations (to check for outliers):

In [None]:
desc['range']=desc['max'] - desc['min']
desc['zscore_min']=(desc['mean']-desc['min']) / desc['std']
desc['zscore_max']=(desc['max']-desc['mean']) / desc['std']
display(desc)

We shouldn't run into big issues here, but we could check later whether scaling improves things. After all, the range of the feature ```petal_length``` is more than twice as large as that of the ```xxx_width``` features.

A z-score threshold of 3 is often used for outlier detection. At least one observation could be considered an outlier with regard to ```sepal_width```. Let's identify these 'outliers':


In [None]:
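#find outliers according to the |zscore| > 3 criterion
features=df.columns.drop('species')
#compute the z-scores column by column so that the result stays a DataFrame
zscores=df[features].apply(zscore)
#flag every row in which at least one feature exceeds the threshold
is_outlier=(zscores.abs()>3).any(axis=1)

print('These are the outliers:')
outliers=df.loc[is_outlier,features]
display(outliers)

print('These are their zscores:')
display(zscores[is_outlier])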

Only one observation exceeds the threshold, and only slightly, so we should be able to keep it.
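One detail worth keeping in mind when comparing the numbers above: `describe()` reports the sample standard deviation (`ddof=1`), while `scipy.stats.zscore` uses the population standard deviation (`ddof=0`) by default. The following sketch illustrates the difference on a small made-up series (the values in `toy_feature` are an assumption for illustration and are not taken from the iris data).

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical toy values, chosen only to illustrate the computation
toy_feature = pd.Series([2.0, 2.9, 3.0, 3.1, 3.2, 4.4], name="toy_feature")

# scipy's default: population standard deviation (ddof=0)
z_population = np.asarray(zscore(toy_feature))

# what you get when dividing by the std reported by describe(): sample std (ddof=1)
z_sample = ((toy_feature - toy_feature.mean()) / toy_feature.std()).to_numpy()

print(pd.DataFrame({"value": toy_feature,
                    "z (ddof=0)": z_population,
                    "z (ddof=1)": z_sample}))
```

Because the sample standard deviation is slightly larger, the `ddof=1` z-scores are slightly smaller in magnitude. For 150 observations the difference is tiny, but it can matter for points sitting right at the threshold.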

It seems that this data set is reasonably clean and does not require any preprocessing.

Next, let's generate some scatter plots to visualise the data. In them, we can mark the point that was identified as an outlier.

In [None]:
fig,ax = plt.subplots(2,3,figsize=(16,9))
ft_combinations=combinations(features,2)
# for i=0..5 the pair (i%2, i%3) visits each of the 2x3 panels exactly once
for i,(f1,f2) in enumerate(ft_combinations):
    sns.scatterplot(data=df,x=f1,y=f2,hue='species',ax=ax[i%2,i%3])
    sns.scatterplot(data=outliers,x=f1,y=f2,color='red',s=500,marker='x',ax=ax[i%2,i%3])

So far so good! Next, let's check the distributions of the features and see how they are correlated. For that we can use a pairplot:

In [None]:
sns.pairplot(df,kind='reg')

```petal_length``` and ```petal_width``` show two distinct peaks in their histograms. This suggests that the data are a mixture of differently distributed groups. Indeed, we could already see from the previous plots that one of the species (```iris_setosa```) can be separated based on just one of those features.

Additionally, we can see that several of the features are correlated. Let's take a closer look:

In [None]:
# Create a new DataFrame that only includes the numerical variables
df_numeric = df.select_dtypes(include=['float64', 'int64'])

# Boolean mask for the upper triangle (including the diagonal),
# so that each pair of features is shown only once in the heatmap
mask = np.triu(np.ones_like(df_numeric.corr(), dtype=bool))

heatmap = sns.heatmap(df_numeric.corr(), mask=mask, vmin=-1, vmax=1, annot=True, cmap=sns.diverging_palette(20, 220, n=100))
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':18}, pad=16);


#### Modeling

With the EDA done, we can start building a model.

In [None]:
#define the target
y=df.species
y.head()

The scale of all features is quite similar. To demonstrate the effect of scaling we artificially inflate the scale of sepal_length.

In [None]:
df.sepal_length = df.sepal_length * 10

In [None]:
#define the features
X=df[features]
X.head()

The first step of the modelling process is a train-test split. In this case we use an unusually small training fraction, because the classes are already easily separable.

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X,y,shuffle=True,stratify=y, train_size=0.5, random_state=123)

In [None]:
X_train.shape,X_test.shape,y_train.shape,y_test.shape
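The `stratify=y` argument used above keeps the class proportions equal in both parts of the split. A minimal sketch of that effect, assuming the copy of the iris data bundled with scikit-learn (`load_iris`) instead of `data/iris.csv`, and using the hypothetical names `X_demo` and `y_demo`:

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled iris data (150 rows, 50 per class)
X_demo, y_demo = load_iris(return_X_y=True)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.5, shuffle=True, stratify=y_demo, random_state=123
)

# Count how often each class label occurs in the two parts
train_counts = Counter(y_tr.tolist())
test_counts = Counter(y_te.tolist())
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

With 50 observations per class and a 50/50 split, each class contributes 25 observations to the training set and 25 to the test set. Without stratification, the counts would vary with the random seed.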

Let's train a KNN classifier, starting with 5 neighbors and using the Euclidean distance as the metric:

In [None]:
clf=KNeighborsClassifier(n_neighbors=5,metric='euclidean')
clf.fit(X_train,y_train)

y_pred=clf.predict(X_test)
print('Classification report:')
print(classification_report(y_test,y_pred))

print('\n\nConfusion matrix')
print(confusion_matrix(y_test,y_pred))
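To make explicit what a KNN classifier computes, here is a minimal NumPy sketch of the prediction step: measure the Euclidean distance from a query point to every training point, take the `k` closest ones, and let them vote. The function `knn_predict` and the toy data are hypothetical and only meant for illustration; scikit-learn's implementation is more efficient and may break ties differently.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, X_query, k=5):
    """Predict labels by majority vote among the k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    y_train = np.asarray(y_train)
    predictions = []
    for x in X_query:
        # Euclidean distance from the query point to every training point
        dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        # indices of the k smallest distances
        nearest = np.argsort(dists)[:k]
        # most frequent label among those neighbors
        label, _ = Counter(y_train[nearest]).most_common(1)[0]
        predictions.append(label)
    return np.array(predictions)


# Hypothetical toy data: two well-separated clusters
X_toy = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_toy = ["a", "a", "a", "b", "b", "b"]
toy_pred = knn_predict(X_toy, y_toy, [[0.5, 0.5], [5.5, 5.5]], k=3)
print(toy_pred)
```

Note that there is no real training phase: "fitting" a KNN model essentially means storing the training data, and all the work happens at prediction time.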

Now let's try normalisation (min-max scaling) and standardisation to check whether they affect the classification.

In [None]:
std=StandardScaler()
norm=MinMaxScaler()

In [None]:
X_train_norm=norm.fit_transform(X_train)
X_test_norm=norm.transform(X_test)

In [None]:
clf=KNeighborsClassifier(n_neighbors=5,metric='euclidean')
clf.fit(X_train_norm,y_train)

y_pred=clf.predict(X_test_norm)
print('Classification report:')
print(classification_report(y_test,y_pred))

print('\n\nConfusion matrix')
print(confusion_matrix(y_test,y_pred))

In [None]:
X_train_std=std.fit_transform(X_train)
X_test_std=std.transform(X_test)

In [None]:
clf=KNeighborsClassifier(n_neighbors=5,metric='euclidean')
clf.fit(X_train_std,y_train)

y_pred=clf.predict(X_test_std)
print('Classification report:')
print(classification_report(y_test,y_pred))

print('\n\nConfusion matrix')
print(confusion_matrix(y_test,y_pred))

KNN is very sensitive to the scale of the data because it relies on computing distances. A feature with a much larger scale dominates the distance calculation, so the other features barely influence which neighbors are selected, and this can produce poor results. It is thus advisable to scale the data before running KNN. The same holds for other algorithms that rely on distance computations.
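The effect can be reproduced with three made-up points. In the sketch below (all values and the helper `nearest_index` are hypothetical), multiplying the first feature by 10 changes which candidate is the nearest neighbor of the query point.

```python
import numpy as np


def nearest_index(query, candidates):
    """Return the index of the candidate closest to the query (Euclidean)."""
    dists = np.linalg.norm(candidates - query, axis=1)
    return int(np.argmin(dists))


# Hypothetical points with two features each
query = np.array([5.0, 1.0])
candidates = np.array([
    [5.4, 1.1],   # close in the second feature
    [5.1, 2.0],   # close in the first feature
])

# Inflate the first feature by a factor of 10, leave the second unchanged
scale = np.array([10.0, 1.0])

before = nearest_index(query, candidates)
after = nearest_index(query * scale, candidates * scale)
print("nearest neighbor before inflating:", before)
print("nearest neighbor after inflating: ", after)
```

On the original scale the first candidate is closer. After inflating the first feature, differences in that feature dominate the distance and the second candidate becomes the nearest neighbor, even though the points themselves have not changed relative to each other.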

#### Task 2

Now let's wrap this into a function that allows us to compute the accuracy for different parameters, i.e. different numbers of neighbors and different orders of the Minkowski metric:

In [None]:
def fit_predict(k=3,metric_p=2):
    clf=KNeighborsClassifier(n_neighbors=k,p=metric_p)
    clf.fit(X_train,y_train)
    res=clf.score(X_test,y_test)
    return res
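The parameter `p` of `KNeighborsClassifier` is the order of the Minkowski distance, $d_p(a, b) = \left(\sum_i |a_i - b_i|^p\right)^{1/p}$: `p=1` gives the Manhattan distance and `p=2` the Euclidean distance. A small sketch with a hypothetical helper `minkowski_distance` shows how the order changes the distance between the same two points:

```python
import numpy as np


def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two points."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float((np.abs(a - b) ** p).sum() ** (1.0 / p))


# Hypothetical example points
point_a, point_b = [0, 0], [3, 4]
distances = {p: minkowski_distance(point_a, point_b, p) for p in (1, 2, 3)}
for p, d in distances.items():
    print(f"p={p}: {d:.4f}")
```

As `p` grows, the largest coordinate difference contributes more and more to the result, so the distance approaches the maximum difference along any single axis.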

Using this function, we can compute the resulting accuracy for different combinations of the neighbors and distance metric

In [None]:
n_neighbors=range(1,41)  # K values 1 to 40 (the upper bound of range is exclusive)
data=pd.DataFrame(index=n_neighbors)

for p in range(1,4):
    data[f'p={p}']=[fit_predict(k,p) for k in n_neighbors]
data=data.reset_index()  
data=data.rename(columns={'index':'NrNeighbors'})
data.head()

In [None]:
data_long=pd.wide_to_long(data, ['p='], i='NrNeighbors', j='MinkovskiOrder', sep='').reset_index()
data_long=data_long.rename(columns={'p=':'Accuracy'})
data_long.head()
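The reshaping step turns one accuracy column per Minkowski order into one row per combination of `NrNeighbors` and order, which is the layout seaborn needs for the `hue` argument. A sketch on a tiny made-up table (the accuracy values in `toy_wide` are hypothetical) shows what happens:

```python
import pandas as pd

# Hypothetical wide table: one accuracy column per Minkowski order
toy_wide = pd.DataFrame({
    "NrNeighbors": [1, 2],
    "p=1": [0.90, 0.80],
    "p=2": [0.70, 0.60],
})

# 'p=' is the common column prefix (stub); the remaining digits become the new column
toy_long = (
    pd.wide_to_long(toy_wide, ["p="], i="NrNeighbors", j="MinkovskiOrder", sep="")
    .reset_index()
    .rename(columns={"p=": "Accuracy"})
)
print(toy_long)
```

The two rows with two accuracy columns become four rows with a single `Accuracy` column.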

In [None]:
fig, ax = plt.subplots(figsize=(12,6))

# positions of the three highest accuracies across all combinations of k and p
highest_accuracy=np.argsort(data_long.Accuracy.to_numpy())[-3:]
best_k=data_long.NrNeighbors.iloc[highest_accuracy].tolist()
sns.lineplot(data=data_long,x='NrNeighbors',y='Accuracy',hue='MinkovskiOrder',ax=ax)
ax.vlines(best_k,data_long.Accuracy.min(),data_long.Accuracy.max(),colors='grey',linestyle='dashed')

print(f'The numbers of neighbors with the three highest accuracies are {best_k}')