In the previous version of this notebook, I ran into a problem: I could not generate a rectangle that contains all the points of a class. Then I realised that it is almost never possible to contain a set of random points inside a rectangle. What I need instead is an algorithm that builds a convex hull around the class points (*e.g. Graham scan*). I have not implemented that idea because it would be a **too complex** solution, and my intention is to create easy solutions.\
So I am going back to the latest approach I used while classifying the dataset with one feature: find the **mean or mode** of the train data for each class, calculate the distance from every class to a given point, and find out which class is closest.\
This method worked very well when classifying with one feature. Let's see whether it works as well with two features.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math
from sklearn.model_selection import train_test_split
import random
import pprint
%matplotlib inline

#**Fetching Data**
---

In [2]:
def fetchData(linkToFile):
  return pd.read_csv(linkToFile)

In [3]:
dataset = fetchData("https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/639388c2cbc2120a14dcf466e85730eb8be498bb/iris.csv")

#**A very basic classification, version-2.2.1**
---

In [4]:
column_names = list(dataset.columns)
print(column_names)

['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']


In [5]:
# Get unique names of species
class_names = dataset['species'].unique()

**Let's build our classifier**
---

In [6]:
def two_feature_classifier(train_data, mode="MEAN"):
  '''
  Input:
  train_data  - a two-dimensional numpy array; each row holds two features followed by the corresponding class

  Output:
  classifier  - a dictionary that maps each class to its centroid (mean/median)
  '''
  classifier = {}

  for flower in train_data:
    #flower[0] and flower[1] are features
    #flower[2] is class of that flower
    if flower[2] in classifier:
      classifier[flower[2]].append((flower[0], flower[1]))
    else:
      classifier[flower[2]] = [(flower[0], flower[1])]
  
  # replace each class's list of points with a single centroid,
  # chosen according to mode (MEAN/MEDIAN)
  if mode == "MEAN":
    for class_name, points in classifier.items():
      x_co_ordinates = [point[0] for point in points]
      y_co_ordinates = [point[1] for point in points]
      # every class collected above has at least one point, so len() is never zero
      classifier[class_name] = (sum(x_co_ordinates)/len(x_co_ordinates), sum(y_co_ordinates)/len(y_co_ordinates))
  elif mode == "MEDIAN":
    for this_class in classifier:
      # tuples sort by the first feature, then by the second;
      # the middle point of that ordering is used as the centroid
      classifier[this_class].sort()
      classifier[this_class] = classifier[this_class][len(classifier[this_class]) // 2]

  return classifier
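
Note that the `MEDIAN` branch takes the middle point of the sorted tuples, so the ordering is driven by the first feature and the second feature only breaks ties. For comparison, a per-coordinate median could look roughly like the sketch below. The name `coordinate_median` is hypothetical and not part of this notebook, and the three points are made up for illustration.

```python
import statistics

def coordinate_median(points):
  """Return (median of x values, median of y values) for a list of 2D points."""
  xs = [p[0] for p in points]
  ys = [p[1] for p in points]
  return (statistics.median(xs), statistics.median(ys))

# illustrative points, not taken from the dataset
points = [(5.0, 3.0), (5.1, 3.9), (5.4, 3.4)]

# what the MEDIAN branch picks: the middle element of the sorted tuples
middle_of_sorted = sorted(points)[len(points) // 2]

# the alternative: median of each coordinate taken separately
per_coordinate = coordinate_median(points)

print(middle_of_sorted, per_coordinate)
```

The two results can differ in the second coordinate, so the choice may matter when the second feature separates the classes better than the first.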

In [7]:
def distance(centroid, point):
  '''
  Input:
  centroid  - a tuple that represents a point in a 2D plane and is the center of a class.
  point     - a tuple that represents a point in a 2D plane and feature of a test data.

  Output:
  distance  - the distance between the two given points

  Method:
  Euclidean distance
  '''
  return math.sqrt((centroid[0]-point[0])**2 + (centroid[1]-point[1])**2)
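
As a quick sanity check of the Euclidean distance, the classic 3-4-5 right triangle is handy. The sketch below restates the formula under the stand-in name `euclidean` (an assumption for illustration, mirroring `distance`) and compares it with `math.hypot` from the standard library, which computes the same quantity.

```python
import math

def euclidean(centroid, point):
  # same formula as distance(): square root of the summed squared differences
  return math.sqrt((centroid[0]-point[0])**2 + (centroid[1]-point[1])**2)

d_manual = euclidean((0, 0), (3, 4))
d_hypot = math.hypot(3 - 0, 4 - 0)

print(d_manual, d_hypot)
```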

In [8]:
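def predict_class(feature_value, classifier):
  # start from infinity so the first centroid always becomes the current best
  closest = float("inf")
  feature_class = None
  for class_name, centroid in classifier.items():
    current_distance = distance(centroid, feature_value)
    if current_distance < closest:
      closest = current_distance
      feature_class = class_name

  return feature_class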

In [9]:
def calculate_accuracy(test_data_with_class, classifier):
  correct_prediction = 0
  for flower in test_data_with_class:
    # flower[0] and flower[1] are features
    # flower[2] is the class
    if (flower[2] == predict_class((flower[0], flower[1]), classifier)):
      correct_prediction += 1

  # compute the percentage once, after all flowers have been checked
  accuracy = (correct_prediction/len(test_data_with_class))*100

  return accuracy

In [10]:
def simulate(dataset, mode='MEAN'):
  for x_cor_feature_index in range(4):
    for y_cor_feature_index in range(x_cor_feature_index+1, 4):
      print("Showing results for {} and {}\n".format(column_names[x_cor_feature_index], column_names[y_cor_feature_index]))
      numpy_dataset = dataset[[column_names[x_cor_feature_index], column_names[y_cor_feature_index], "species"]].to_numpy()
    
      # hold out 20% of the rows for validation, then 20% of the remainder for testing
      train_set, validation_set = train_test_split(numpy_dataset, test_size=0.2, random_state=42)
      train_set, test_set = train_test_split(train_set, test_size=0.2, random_state=42)

      classifier = two_feature_classifier(train_set, mode)
      pprint.pprint(classifier)

      print("Accuracy on validation set: {}".format(calculate_accuracy(validation_set, classifier)))
      print("Accuracy on test set: {}".format(calculate_accuracy(test_set, classifier)))
    
      # checking a random data
      feature_value = test_set[random.randint(0,len(test_set)-1)][:2]
      print("Flower of {feature1} {feature1_value} and {feature2} {feature2_value} is {class_name}".format(feature1=column_names[x_cor_feature_index], feature1_value=feature_value[0], feature2=column_names[y_cor_feature_index], feature2_value=feature_value[1], class_name=predict_class((feature_value[0],feature_value[1]), classifier)))
      print("\n---------------------------------------------------------------------------------\n")

In [11]:
simulate(dataset)

Showing results for sepal_length and sepal_width

{'setosa': (5.053125, 3.5156250000000004),
 'versicolor': (5.959999999999999, 2.7900000000000005),
 'virginica': (6.552941176470587, 2.988235294117647)}
Accuracy on validation set: 90.0
Accuracy on test set: 87.5
Flower of sepal_length 4.9 and sepal_width 3.1 is setosa

---------------------------------------------------------------------------------

Showing results for sepal_length and petal_length

{'setosa': (5.053125, 1.4625000000000001),
 'versicolor': (5.959999999999999, 4.23),
 'virginica': (6.552941176470587, 5.529411764705881)}
Accuracy on validation set: 93.33333333333333
Accuracy on test set: 87.5
Flower of sepal_length 6.0 and petal_length 4.0 is versicolor

---------------------------------------------------------------------------------

Showing results for sepal_length and petal_width

{'setosa': (5.053125, 0.25312500000000004),
 'versicolor': (5.959999999999999, 1.3133333333333332),
 'virginica': (6.552941176470587, 2.0

In [12]:
simulate(dataset, mode='MEDIAN')

Showing results for sepal_length and sepal_width

{'setosa': (5.1, 3.4), 'versicolor': (5.9, 3.2), 'virginica': (6.4, 3.2)}
Accuracy on validation set: 83.33333333333334
Accuracy on test set: 75.0
Flower of sepal_length 4.9 and sepal_width 3.1 is setosa

---------------------------------------------------------------------------------

Showing results for sepal_length and petal_length

{'setosa': (5.1, 1.4), 'versicolor': (5.9, 4.8), 'virginica': (6.4, 5.5)}
Accuracy on validation set: 93.33333333333333
Accuracy on test set: 91.66666666666666
Flower of sepal_length 4.9 and petal_length 1.5 is setosa

---------------------------------------------------------------------------------

Showing results for sepal_length and petal_width

{'setosa': (5.1, 0.2), 'versicolor': (5.9, 1.8), 'virginica': (6.4, 2.3)}
Accuracy on validation set: 96.66666666666667
Accuracy on test set: 87.5
Flower of sepal_length 6.3 and petal_width 1.8 is versicolor

--------------------------------------------------

#**Discussion**
---

This model seems to work surprisingly well with both **mean** and **median** centroids.

Is it possible to improve overall performance by changing the algorithm that computes the distance? I will have to look into that.
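
One way to experiment with this question is to make the distance function a parameter of the prediction step. The sketch below is an assumption about how that could be wired up, not something this notebook does: `euclidean`, `manhattan`, and `predict_with_metric` are hypothetical names, and the centroids are rounded from the mean-based output for sepal_length and petal_length shown earlier.

```python
import math

def euclidean(a, b):
  return math.sqrt((a[0] - b[0])**2 + (a[1] - b[1])**2)

def manhattan(a, b):
  # sum of absolute differences along each axis
  return abs(a[0] - b[0]) + abs(a[1] - b[1])

def predict_with_metric(point, centroids, metric=euclidean):
  # pick the class whose centroid minimises the chosen metric
  return min(centroids, key=lambda name: metric(centroids[name], point))

# rounded mean centroids for (sepal_length, petal_length)
centroids = {
  "setosa": (5.05, 1.46),
  "versicolor": (5.96, 4.23),
  "virginica": (6.55, 5.53),
}

point = (6.0, 4.0)
pred_euclidean = predict_with_metric(point, centroids, euclidean)
pred_manhattan = predict_with_metric(point, centroids, manhattan)
print(pred_euclidean, pred_manhattan)
```

Whether a different metric actually improves accuracy on this dataset would have to be measured; the sketch only shows how to swap one in.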

When we move on to three or more features at a time, it will no longer be possible to get an idea of the dataset's behavior by looking at data plots. So I need to come up with a new idea for working with more features.
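
One possible starting point is that neither the per-class mean nor the Euclidean distance depends on plots, so both extend to any number of features. The following is a minimal sketch under that assumption, not a worked-out solution: `fit_centroids` and `predict` are hypothetical names, and the toy rows are made up for illustration rather than taken from the dataset.

```python
import math
from collections import defaultdict

def fit_centroids(rows):
  """rows: iterable of (f1, ..., fn, label). Returns {label: mean feature tuple}."""
  grouped = defaultdict(list)
  for *features, label in rows:
    grouped[label].append(features)
  # zip(*points) regroups the rows into one column per feature
  return {
    label: tuple(sum(column) / len(column) for column in zip(*points))
    for label, points in grouped.items()
  }

def predict(features, centroids):
  # math.dist (Python 3.8+) is the Euclidean distance in any dimension
  return min(centroids, key=lambda label: math.dist(centroids[label], features))

# made-up rows with three features each
toy_rows = [
  (1.0, 1.0, 1.0, "a"),
  (1.0, 3.0, 1.0, "a"),
  (5.0, 5.0, 5.0, "b"),
  (7.0, 5.0, 5.0, "b"),
]

centroids = fit_centroids(toy_rows)
prediction = predict((1.5, 2.0, 0.5), centroids)
print(centroids, prediction)
```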