In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [4]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# Assign column names to the dataset
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

# Read the dataset into a pandas DataFrame
dataset = pd.read_csv(url, names=names)

*The dataset has four numeric attributes: sepal-length, sepal-width, petal-length and petal-width. Each is a measurement, in centimetres, of an iris flower. The task is to predict the class (species) to which each plant belongs. There are three classes in the dataset: Iris-setosa, Iris-versicolor and Iris-virginica.*

In [5]:
dataset.head()

| | sepal-length | sepal-width | petal-length | petal-width | Class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


**PREPROCESSING**

In [7]:
X = dataset.iloc[:, :-1].values  # feature matrix: the first four columns
y = dataset.iloc[:, 4].values  # class labels: the fifth column

**SPLITTING INTO TRAINING AND TESTING SETS**

In [8]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
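
*The split above is random, so the class counts in the test set (and every score computed later) change from run to run. The following is a minimal sketch of a reproducible, class-balanced split. It assumes the `random_state` and `stratify` arguments of `train_test_split` are acceptable here; the original notebook does not use them, and `X_demo` and `y_demo` are hypothetical stand-ins with the same shape as the iris data.*

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 150 samples, 4 features, 3 classes of 50 each.
X_demo = np.arange(150 * 4, dtype=float).reshape(150, 4)
y_demo = np.repeat(["a", "b", "c"], 50)

# random_state fixes the shuffle; stratify keeps class proportions equal
# in the training and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

labels, counts = np.unique(y_te, return_counts=True)
print(dict(zip(labels, counts)))
```

*With stratification, each class contributes 10 of the 30 test samples. Without it, a class can end up with only a handful of test samples, which makes its per-class scores noisy.*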

**FEATURE SCALING**

*Because the ranges of raw features can differ widely, the objective functions of some machine learning algorithms do not work properly without normalization. For example, distance-based classifiers such as k-nearest neighbors typically measure the distance between two points with the Euclidean distance. If one feature has a much broader range of values than the others, the distance is dominated by that feature. The features should therefore be rescaled so that each contributes roughly proportionately to the final distance. The scaler below is fitted on the training set only, and the same transformation is then applied to the test set.*
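
*To make the effect concrete, here is a small sketch with made-up numbers (the `points` array is hypothetical, not taken from the iris data). It compares Euclidean distances before and after standardizing each column to zero mean and unit variance, which is the same transformation `StandardScaler` applies.*

```python
import numpy as np

# Hypothetical toy data: column 0 spans about 0-1, column 1 spans thousands.
points = np.array([
    [0.1, 1000.0],
    [0.9, 1010.0],
    [0.2, 3000.0],
])

def euclidean(a, b):
    """Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Raw distances: almost entirely determined by column 1.
raw_d01 = euclidean(points[0], points[1])
raw_d02 = euclidean(points[0], points[2])

# Standardize each column: subtract its mean, divide by its standard deviation.
scaled = (points - points.mean(axis=0)) / points.std(axis=0)

# Scaled distances: both columns now contribute on a comparable scale.
scaled_d01 = euclidean(scaled[0], scaled[1])
scaled_d02 = euclidean(scaled[0], scaled[2])

print(raw_d01, raw_d02)
print(scaled_d01, scaled_d02)
```

*On the raw data, point 0 looks about 200 times closer to point 1 than to point 2, even though points 0 and 1 sit at opposite ends of the first feature. After scaling, the two distances are of similar size.*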

In [12]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
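
*The fit-on-train, transform-both pattern above can also be bundled into a single estimator. This is only a sketch of one alternative, not the notebook's method. It assumes scikit-learn's `make_pipeline` and, to stay self-contained, the copy of the iris data bundled with scikit-learn (`load_iris`) rather than the CSV from the URL, so the labels are integers instead of names like `Iris-setosa`.*

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled iris data stands in for the CSV read from the URL.
iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.20, random_state=0
)

# The pipeline fits the scaler on the training data only, then reuses the
# fitted mean and standard deviation whenever it predicts or scores.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_tr, y_tr)

accuracy = model.score(X_te, y_te)
print(accuracy)
```

*Keeping the scaler inside the pipeline makes it harder to accidentally fit it on test data.*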

**TRAINING AND PREDICTIONS**

In [13]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')
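
*The parameters printed above (`metric='minkowski'` with `p=2`, `weights='uniform'`) mean that neighbors are found by Euclidean distance and each neighbor gets one equal vote. The sketch below illustrates that idea in plain NumPy. `knn_predict` and the toy arrays are hypothetical names made up for this illustration; it is not scikit-learn's implementation, and ties may be broken differently.*

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_query, k=5):
    """Predict the label of x_query by majority vote of its k nearest neighbors."""
    # Euclidean distance from the query to every training point.
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Uniform weights: every neighbor counts as one vote.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters.
toy_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [3.0, 3.0], [3.1, 2.9], [2.9, 3.2]])
toy_y = np.array(["near-origin"] * 3 + ["far"] * 3)

prediction = knn_predict(toy_X, toy_y, np.array([0.15, 0.1]), k=3)
print(prediction)
```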

In [14]:
y_pred = classifier.predict(X_test)

In [16]:
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df

| | Actual | Predicted |
|---|---|---|
| 0 | Iris-setosa | Iris-setosa |
| 1 | Iris-virginica | Iris-versicolor |
| 2 | Iris-setosa | Iris-setosa |
| 3 | Iris-virginica | Iris-virginica |
| 4 | Iris-setosa | Iris-setosa |
| 5 | Iris-setosa | Iris-setosa |
| 6 | Iris-setosa | Iris-setosa |
| 7 | Iris-virginica | Iris-virginica |
| 8 | Iris-versicolor | Iris-versicolor |
| 9 | Iris-virginica | Iris-versicolor |


**EVALUATING THE ALGORITHM**

In [19]:
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[14  0  0]
 [ 0  3  1]
 [ 0  2 10]]
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        14
Iris-versicolor       0.60      0.75      0.67         4
 Iris-virginica       0.91      0.83      0.87        12

      micro avg       0.90      0.90      0.90        30
      macro avg       0.84      0.86      0.85        30
   weighted avg       0.91      0.90      0.90        30



*Rows (Actual) - Iris-setosa, Iris-versicolor, Iris-virginica*

*Columns (Predicted) - Iris-setosa, Iris-versicolor, Iris-virginica*

* The classifier correctly predicted all 14 Iris-setosa samples.
* Of the 4 Iris-versicolor samples, 3 were correctly recognized and 1 was misclassified as Iris-virginica.
* Of the 12 Iris-virginica samples, 10 were correctly recognized and 2 were misclassified as Iris-versicolor.
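
*The figures in the classification report can be recomputed by hand from the confusion matrix. The sketch below copies the matrix printed above and uses scikit-learn's layout, in which each row is an actual class and each column a predicted class. The variable names are chosen for this illustration only.*

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows = actual class, columns = predicted class,
# in the order setosa, versicolor, virginica.
cm = np.array([
    [14, 0, 0],
    [0, 3, 1],
    [0, 2, 10],
])

true_positives = np.diag(cm)

# Precision: correct predictions of a class / all predictions of that class.
precision = true_positives / cm.sum(axis=0)
# Recall: correct predictions of a class / all actual members of that class.
recall = true_positives / cm.sum(axis=1)
# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
# Accuracy: all correct predictions / all samples.
accuracy = true_positives.sum() / cm.sum()

print(np.round(precision, 2))
print(np.round(recall, 2))
print(np.round(f1, 2))
print(round(float(accuracy), 2))
```

*For Iris-versicolor, for example, precision is 3 / (3 + 2) = 0.60 and recall is 3 / (3 + 1) = 0.75, matching the report. The row sums 14, 4 and 12 equal the support column.*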