# Drills in Classification
You cannot claim to know a technique until you have practiced it, and these drills give you that practice. Are you ready to classify some interesting data?


## Exercise 1
* **Dataset:** `Iris`
* **Model to use:** [`KNN`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
* **Model evaluation:** try the [classification report](https://muthu.co/understanding-the-classification-report-in-sklearn/)

The Iris dataset includes three iris species with 50 samples each as well as some properties about each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other. 
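The separability claim above can be checked directly. The sketch below assumes that the linearly separable species is setosa (class `0`) and that petal length alone is enough to separate it. Both are assumptions about which class and feature to inspect; the text above does not name them.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Column 2 is petal length (cm); class 0 is setosa.
petal_length = X[:, 2]
setosa_max = petal_length[y == 0].max()
others_min = petal_length[y != 0].min()

print(f"longest setosa petal:      {setosa_max} cm")
print(f"shortest non-setosa petal: {others_min} cm")

# If the two ranges do not overlap, a single threshold separates setosa.
is_separable = bool(setosa_max < others_min)
print("setosa separable by one threshold:", is_separable)
```

A gap between the two ranges means any threshold inside it acts as a linear decision boundary for setosa versus the rest.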

You can load the dataset with `scikit-learn` by using: 

```python
from sklearn.datasets import load_iris

iris = load_iris()
```

Your mission is to apply KNN to this dataset and find the best K.
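One possible way to search for K is to compare cross-validated accuracy over a range of values. This is a minimal sketch, not the only valid approach: the range `1..25`, the 5-fold cross-validation, and the use of a scaling pipeline are all choices made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Candidate values of K (an arbitrary range chosen for illustration).
k_values = range(1, 26)
mean_scores = {}

for k in k_values:
    # Scaling sits inside the pipeline, so each fold is scaled
    # using only its own training portion.
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(pipe, X, y, cv=5)
    mean_scores[k] = scores.mean()

# Pick the K with the highest mean accuracy (ties go to the smallest K).
best_k = max(mean_scores, key=mean_scores.get)
print(f"best K: {best_k} (mean CV accuracy: {mean_scores[best_k]:.3f})")
```

On a dataset this small, several values of K often score almost the same, so treat the winner as a reasonable choice rather than a definitive answer.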

You will quickly see that you cannot evaluate a complex classification model with an accuracy percentage alone.

To understand how accurate your model is and, more importantly, where it is wrong, use scikit-learn's [classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html).

To use it properly, you need to understand the following terms:
* `Recall`
* `Precision`
* `F1-score`
* `Support`

You can do your own research or [read this article](https://www.analyticsvidhya.com/blog/2020/09/precision-recall-machine-learning/).
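As a rough illustration of the four terms, the sketch below computes them by hand from a confusion matrix and compares the result with scikit-learn. The labels `y_true` and `y_hat` are made up for this example and have nothing to do with the Iris data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Hypothetical labels, invented for illustration only.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
y_hat = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2, 1])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)
true_positives = np.diag(cm)

# Precision: of everything predicted as a class, how much really is that class.
precision = true_positives / cm.sum(axis=0)
# Recall: of everything that really is a class, how much was found.
recall = true_positives / cm.sum(axis=1)
# F1-score: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
# Support: number of true samples per class.
support = cm.sum(axis=1)

# Reference values from scikit-learn, for comparison.
ref_precision, ref_recall, ref_f1, ref_support = precision_recall_fscore_support(y_true, y_hat)
print("precision matches:", np.allclose(precision, ref_precision))
print("recall matches:   ", np.allclose(recall, ref_recall))
print("f1 matches:       ", np.allclose(f1, ref_f1))
```

The manual formulas divide by column and row sums, so they fail with a division by zero if a class is never predicted. `classification_report` handles that case through its `zero_division` parameter.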

In [27]:
# Import libraries

import numpy as np 
import pandas as pd 

from sklearn import datasets

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import Normalizer

from sklearn.metrics import classification_report

import math

In [28]:
from sklearn.datasets import load_iris
iris = load_iris()


In [29]:
# Explore the dataset to understand it (use pandas and your favorite data visualization library)
df = pd.DataFrame(iris['data'], columns=iris['feature_names'])
df

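| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 |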


In [30]:
# Store the data and labels in the "X" and "y" variables
X = iris.data
y = iris.target

In [31]:
# Preprocess the data (deal with NaNs, deal with text features,...)
df.isnull().values.any()

False

In [32]:
# Split into X (features) and y (labels); same assignment as in the earlier cell
X = iris.data
y = iris.target

In [33]:
# split the data into train and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print(f'training set size: {X_train.shape[0]} samples \ntest set size: {X_test.shape[0]} samples')

training set size: 120 samples 
test set size: 30 samples
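A plain random split does not guarantee that each species is equally represented in the test set. The sketch below compares the split above with a stratified one. Passing `stratify=y` is an optional variation, not something the exercise requires.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same split as above: random, without stratification.
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, random_state=0)

# Stratified variant: class proportions are preserved in both subsets.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

plain_counts = Counter(y_test_plain.tolist())
strat_counts = Counter(y_test_strat.tolist())

print("plain split, test samples per class:     ", sorted(plain_counts.items()))
print("stratified split, test samples per class:", sorted(strat_counts.items()))
```

With 50 samples per species and a 20% test set, the stratified split puts 10 samples of each species in the test set, which makes per-class metrics easier to compare.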


In [34]:
# Feature scaling
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)  # learn mean and std from the training set, then scale it
X_test = sc_X.transform(X_test)  # scale the test set with the training mean and std (no refit)
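To see what fitting on the training set only means in practice, the sketch below inspects the scaled data. It repeats the split and scaling from the cells above under new variable names (`scaler`, `X_train_scaled`, `X_test_scaled`), which are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from the training set
X_test_scaled = scaler.transform(X_test)        # reuses the training mean/std

train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
test_means = X_test_scaled.mean(axis=0)

# Training features end up centered at 0 with unit variance.
print("train means:", np.round(train_means, 2))
print("train stds: ", np.round(train_stds, 2))
# Test features are close to 0 but not exactly 0, because the
# statistics came from the training set.
print("test means: ", np.round(test_means, 2))
```

Calling `fit_transform` on the test set as well would leak information about the test data into preprocessing and make the evaluation slightly optimistic.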

In [35]:
#Define the KNN model
model = KNeighborsClassifier(n_neighbors=3)
# Training or fitting the model with the train data
model.fit(X_train,y_train)


KNeighborsClassifier(n_neighbors=3)

In [36]:
y_pred = model.predict(X_test)

In [37]:
model.score(X_test,y_test)

0.9666666666666667

In [38]:
# Evaluate your model
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       0.93      1.00      0.96        13
           2       1.00      0.83      0.91         6

    accuracy                           0.97        30
   macro avg       0.98      0.94      0.96        30
weighted avg       0.97      0.97      0.97        30
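The report above suggests a single error: a recall of 0.83 on 6 samples of class 2 corresponds to 5 of 6 found, and a precision of 0.93 for class 1 corresponds to 13 correct out of 14 predictions. A confusion matrix shows directly which classes get confused. The sketch below repeats the steps from the cells above in one place so it can run on its own.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Same preprocessing and model as in the cells above.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)

# Off-diagonal entries are the misclassified samples.
n_errors = int(cm.sum() - cm.trace())
print("misclassified samples:", n_errors)
```

Each off-diagonal cell `cm[i, j]` counts samples of true class `i` that were predicted as class `j`.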

