Welcome to the IrisRecognitionProject. In this short project of mine, I will be experimenting with ML models to try to accurately predict the species of an iris flower based on 4 of its features: sepal length, sepal width, petal length, and petal width. There are 3 different species that we could predict: Iris setosa, Iris versicolor, and Iris virginica.
Side note: The iris zip file, downloaded from the UCI ML Repository, has two different data files that differ only in 2 datapoints. This project will be using the bezdekIris file.

Before I can create models, I must first convert my data into a format that the models can understand. The only conversion I must do is to substitute a number for each species name. I will load the data from the csv into a pandas dataframe, and convert the species names into numerical labels using a custom mapping.

In [1]:
import pandas as pd
import numpy as np
from sklearn.neighbors import NearestCentroid
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

In [2]:
species_mapping = {"Iris-setosa":0,"Iris-versicolor":1,"Iris-virginica":2}
data = pd.read_csv('C:\\Users\\Sharvin Joshi\\IrisRecognitionProject\\iris\\bezdekIris.csv')
print(data)
data['speciesName'] = data['speciesName'].map(species_mapping)
print(data)

     sepalLength  sepalWidth  petalLength  petalWidth     speciesName
0            5.1         3.5          1.4         0.2     Iris-setosa
1            4.9         3.0          1.4         0.2     Iris-setosa
2            4.7         3.2          1.3         0.2     Iris-setosa
3            4.6         3.1          1.5         0.2     Iris-setosa
4            5.0         3.6          1.4         0.2     Iris-setosa
..           ...         ...          ...         ...             ...
145          6.7         3.0          5.2         2.3  Iris-virginica
146          6.3         2.5          5.0         1.9  Iris-virginica
147          6.5         3.0          5.2         2.0  Iris-virginica
148          6.2         3.4          5.4         2.3  Iris-virginica
149          5.9         3.0          5.1         1.8  Iris-virginica

[150 rows x 5 columns]
     sepalLength  sepalWidth  petalLength  petalWidth  speciesName
0            5.1         3.5          1.4         0.2            0
1 
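One thing worth checking with a custom mapping is that every species name in the file actually appears in it, since pandas' `map` returns NaN for any value missing from the dictionary. The snippet below is a minimal sketch on a made-up three-row frame (the `toy` frame and its misspelled last entry are hypothetical, not from the dataset):

```python
import pandas as pd

species_mapping = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}

# Hypothetical mini-frame; the last name is deliberately misspelled
toy = pd.DataFrame({"speciesName": ["Iris-setosa", "Iris-virginica", "Iris-versicolour"]})
toy["label"] = toy["speciesName"].map(species_mapping)

# Names that are missing from the mapping come back as NaN,
# so counting them is a cheap sanity check before training
unmapped = int(toy["label"].isna().sum())
print("Unmapped rows:", unmapped)
```

On the real dataframe, the same `isna().sum()` check on the mapped column should report zero if the mapping covers every species name.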

Now that the labels are converted into numbers, I must normalize my data so that all features are on the same scale and features with larger values don't heavily skew the training and predictions. For this I will be using the simple Z score scaling method. Thankfully, the dataset is complete (no values missing), so no datapoints must be removed.
The Z score scaling method is as follows: for each feature column X, take its mean (call it M) and its standard deviation (call it S). Every value in the column is then replaced by (X-M)/S.
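To make the formula concrete, here is a minimal sketch on a made-up column (the values in `toy` are hypothetical, not taken from the dataset). One detail to be aware of: pandas' `.std()` defaults to the sample standard deviation (ddof=1), while NumPy's `np.std` defaults to the population standard deviation (ddof=0), so the two give slightly different scaled values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, not taken from the iris data
toy = pd.Series([2.0, 4.0, 6.0, 8.0])

M = toy.mean()   # mean of the column
S = toy.std()    # sample standard deviation (ddof=1), the pandas default
z = (toy - M) / S

# After scaling, the column has mean 0 and sample standard deviation 1
print(z.mean(), z.std())

# NumPy's default is the population standard deviation (ddof=0),
# which is slightly smaller than the pandas value for the same data
S_population = np.std(toy.to_numpy())
print(S, S_population)
```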

In [3]:
# Thankfully, pandas takes care of the boring math for me with its helpful methods.
features = ['sepalLength','sepalWidth','petalLength','petalWidth']
for column in features:
    featureMean = data[column].mean()
    featureStd = data[column].std()
    data[column]=(data[column]-featureMean)/featureStd
print(data)

     sepalLength  sepalWidth  petalLength  petalWidth  speciesName
0      -0.897674    1.015602    -1.335752   -1.311052            0
1      -1.139200   -0.131539    -1.335752   -1.311052            0
2      -1.380727    0.327318    -1.392399   -1.311052            0
3      -1.501490    0.097889    -1.279104   -1.311052            0
4      -1.018437    1.245030    -1.335752   -1.311052            0
..           ...         ...          ...         ...          ...
145     1.034539   -0.131539     0.816859    1.443994            2
146     0.551486   -1.278680     0.703564    0.919223            2
147     0.793012   -0.131539     0.816859    1.050416            2
148     0.430722    0.786174     0.930154    1.443994            2
149     0.068433   -0.131539     0.760211    0.788031            2

[150 rows x 5 columns]


Now that our data has been normalized, I must split it into training and testing sets.
The data must also be shuffled before the split, since it's originally ordered by species.

In [4]:
X = data[features]
y = data['speciesName']

X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.20, random_state=27,shuffle=True)
X_train = X_train.to_numpy()
X_test = X_test.to_numpy()
y_train = y_train.to_numpy()
y_test = y_test.to_numpy()

#print(X_train.size)
#print(y_train)
#print(y_test)
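To illustrate why the shuffle matters, here is a minimal sketch using stand-in labels ordered the same way as the dataset (50 of each species in consecutive blocks). The `X_dummy` and `y_ordered` arrays are placeholders I made up for the demonstration, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labels ordered like the dataset: 50 of each species, in blocks
y_ordered = np.repeat([0, 1, 2], 50)
X_dummy = np.arange(150).reshape(-1, 1)  # placeholder features

# Without shuffling, the test set is simply the last 20% of the rows,
# so it contains only the last species
_, _, _, y_test_unshuffled = train_test_split(
    X_dummy, y_ordered, test_size=0.20, shuffle=False)
print("Unshuffled test labels:", np.unique(y_test_unshuffled))

# With shuffling, the test set is drawn from across the whole file
_, _, _, y_test_shuffled = train_test_split(
    X_dummy, y_ordered, test_size=0.20, random_state=27, shuffle=True)
print("Shuffled test labels:", np.unique(y_test_shuffled))
```

A model evaluated on the unshuffled split would only ever be tested on one species, so its score would say nothing about the other two.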

With our data prepared, I can start with my models. For my first model, I want to work with nearest centroid. Since there are only a small number of classes, I'd think that finding the "average" of each class and assigning a new datapoint to the closest one could be an ideal way to classify.
Since I'm not used to manually dealing with dimensions (features) greater than 3, I will be using the NearestCentroid class from the sklearn library.

In [5]:
nearestCentroidModel = NearestCentroid()
nearestCentroidModel.fit(X_train,y_train)
print("Training Set Score:",nearestCentroidModel.score(X_train,y_train))
print("Testing Set Score:",nearestCentroidModel.score(X_test,y_test))

Training Set Score: 0.8833333333333333
Testing Set Score: 0.8666666666666667
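For intuition, the nearest centroid idea can be sketched by hand in a few lines of NumPy. This is only an illustration of the technique with Euclidean distance, not sklearn's actual implementation; the helper names `fit_centroids` and `predict_nearest_centroid` and the toy arrays are my own.

```python
import numpy as np

def fit_centroids(X, y):
    """Return the sorted class labels and the mean feature vector of each class."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_nearest_centroid(X, classes, centroids):
    """Give each row of X the label of the closest centroid (Euclidean distance)."""
    # distances has shape (n_samples, n_classes)
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[distances.argmin(axis=1)]

# Hypothetical toy data: two well-separated classes in 4 dimensions
X_toy = np.array([[0.0, 0.0, 0.0, 0.0],
                  [1.0, 1.0, 1.0, 1.0],
                  [9.0, 9.0, 9.0, 9.0],
                  [10.0, 10.0, 10.0, 10.0]])
y_toy = np.array([0, 0, 1, 1])

classes, centroids = fit_centroids(X_toy, y_toy)
new_points = np.array([[2.0, 2.0, 2.0, 2.0], [8.0, 8.0, 8.0, 8.0]])
toy_predictions = predict_nearest_centroid(new_points, classes, centroids)
print(toy_predictions)
```

The number of features never appears explicitly in the code, which is why the same approach works in 4 dimensions just as it would in 2 or 3.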


Not terrible as far as first attempts go, but since NearestCentroid has very few hyperparameters to tune, this is about as far as it goes. Let's try something else, like kNN, aka k Nearest Neighbors.

Since we have more than 3 dimensions (features), it becomes difficult to visually plot and measure the distance between datapoints. However, kNN is simple: using the training set, find the k nearest neighbors to a given datapoint using the Euclidean distance formula, and return the label that appears most commonly among them. For funsies, I'll write this one manually...

In [6]:
class KNNClassifier:
    
    def __init__(self,X_training_set,y_training_set,k=5):
        self.k=k
        self.X_training_set = X_training_set
        self.y_training_set = y_training_set
        
    def euclidianDistance(self,point1,point2):
        try:
            return np.sqrt(np.sum((np.array(point1, dtype=float) - np.array(point2, dtype=float)) ** 2))
        except ValueError:
            return 10000000
    
    def predict(self,X_testing_set):
        predictions = []
        for datapoint in X_testing_set:
            predicted_label = self._predict(datapoint)
            predictions.append(predicted_label)
        
        return predictions
    
    def _predict(self,datapoint):
        # Get the distance between datapoint and every training datapoint, find the indices of the
        # k smallest distances, look up their labels in y_training_set, and return the most frequent label.
        distances = np.array([self.euclidianDistance(datapoint, training_datapoint) for training_datapoint in self.X_training_set])
        # argpartition puts the k smallest distances (in no particular order) in the first k positions
        nearestK_indices = np.argpartition(distances, self.k)[:self.k]
        corresponding_labels = [self.y_training_set[index] for index in nearestK_indices]
        #print(corresponding_labels)
        # bincount needs non-negative integer labels, which the 0/1/2 species mapping provides
        return np.bincount(np.array(corresponding_labels)).argmax()
        

In [7]:
knnModel = KNNClassifier(X_train,y_train)
knnModel_predictions = knnModel.predict(X_test)
correct = 0
for i in range(len(knnModel_predictions)):
    if knnModel_predictions[i] == y_test[i]:
        correct += 1
print("Testing Set Score:",correct/len(knnModel_predictions))

Testing Set Score: 0.9333333333333333


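While simple, the knnModel proves to be a spectacular choice for the job. The model most likely does so well because of the relatively small dataset, and the fact that a majority vote among several neighbors doesn't get easily swayed by individual outliers and anomalies. But if the k value were to become too extreme, or if the dataset grew significantly larger, I'd expect the model to struggle: a very large k drowns out the local neighborhood, and a much larger dataset makes computing the distance to every training point slow.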
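The effect of k can be explored with a quick sweep. The sketch below is an assumption-heavy stand-in rather than a rerun of this notebook: it uses sklearn's bundled iris data (loaded with `load_iris`) in place of the bezdekIris csv, and sklearn's `KNeighborsClassifier` in place of the manual class, so the scores it prints may not match the ones above. The list of k values is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: sklearn's bundled iris data stands in for the bezdekIris csv
X_raw, y_all = load_iris(return_X_y=True)

# Same Z score scaling as earlier (sample standard deviation, like pandas)
X_scaled = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0, ddof=1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_scaled, y_all, test_size=0.20, random_state=27, shuffle=True)

# Try a few k values, up to the extreme case where k is the whole training set
k_scores = {}
for k in [1, 5, 15, 51, len(X_tr)]:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_tr, y_tr)
    k_scores[k] = model.score(X_te, y_te)

for k, score in k_scores.items():
    print(f"k={k}: testing set score {score:.3f}")
```

In the extreme case where k equals the size of the training set, every datapoint has the same neighbors, so every prediction collapses to the most common training label regardless of the input.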