In the previous model, I tried to classify the iris dataset (using one feature) by creating a range of feature values for each class and using those ranges to find the class. This idea worked on some features but failed on others, because the class ranges overlap. When classifying, I checked whether the data point fell into any range, but this was not always correct. Why? Because of the overlaps, a data point can show up in more than one class range (it might be an outlier, but inputs can always be unpredictable). In that case, the classifier is likely to give the wrong answer.\
So here, I am going to use the mean feature value of each class (since the features are numerical), or the median. When classifying, I will compute the distance from the given data point to the mean or median value of every class. The class whose value is closest to the point will be considered its class.\
Hopefully this will work better than the last model, but there are still overlaps and outliers, so there will be errors.
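
To make the idea concrete, here is a minimal sketch of the nearest-center rule on a single feature. The class centers are made-up illustrative values, not numbers computed from the iris data, and `nearest_class` is a hypothetical helper name that does not appear elsewhere in this notebook.

```python
# Hypothetical per-class centers (mean or median) for a single feature.
# These values are illustrative only, not computed from the dataset.
centers = {"setosa": 1.5, "versicolor": 4.3, "virginica": 5.5}

def nearest_class(centers, value):
    """Return the class whose center is closest to the given feature value."""
    # In one dimension, the distance is just the absolute difference.
    return min(centers, key=lambda name: abs(centers[name] - value))

print(nearest_class(centers, 1.9))
print(nearest_class(centers, 4.8))
```

A value that falls in the overlap between two classes still gets assigned to whichever center is nearer, which is why some errors are expected.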

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import random
%matplotlib inline

# **Fetching Data**
---

In [2]:
def fetchData(linkToFile):
  return pd.read_csv(linkToFile)

In [3]:
dataset = fetchData("https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/639388c2cbc2120a14dcf466e85730eb8be498bb/iris.csv")

# **A very basic classification, version-1.2**
---

In [4]:
column_names = list(dataset.columns)
print(column_names)

['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']


In [5]:
class_names = dataset[column_names[4]].unique()

In [6]:
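def median_model_trainer(train_set):
  '''
  Input:
  train_set   : a numpy array of [feature value, class] rows

  Output:
  classifier: a dictionary that contains the median feature value for each class.
  '''
  classifier = {}

  # group the feature values by class
  for flower in train_set:
    if flower[1] in classifier:
      classifier[flower[1]].append(flower[0])
    else:
      classifier[flower[1]]=[flower[0]]
  
  # sort each group and keep its middle element
  # (for an even count this is the upper of the two middle values)
  for each_class in classifier:
    classifier[each_class].sort()
    classifier[each_class] = classifier[each_class][len(classifier[each_class])//2]
  
  return classifier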
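
One detail worth noting: for a class with an even number of samples, taking the middle index of the sorted list (as `median_model_trainer` does) returns the upper of the two middle values, whereas the conventional median averages them. The sketch below compares the two on made-up values using Python's standard `statistics` module; `upper_middle` is a hypothetical helper that mirrors that indexing.

```python
import statistics

def upper_middle(values):
    """Mirror the trainer's indexing: the element at len // 2 after sorting."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]

even_sample = [4.9, 5.0, 5.2, 5.4]   # made-up values, even count
odd_sample = [4.9, 5.0, 5.2]         # made-up values, odd count

print(upper_middle(even_sample), statistics.median(even_sample))
print(upper_middle(odd_sample), statistics.median(odd_sample))
```

For odd counts the two agree. For even counts the difference is usually small, and `statistics.median_high` gives the same result as the indexing approach.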

In [7]:
def model_trainer(train_set):
  '''
  Input:
  train_set   : a numpy array, on which the model is to be created

  Output:
  classifier: a dictionary that contains the average feature value for each class.

  Process:
  Sum the feature values and count the flowers of each class, then divide each sum by its count.
  '''
  
  classifier = {}
  for flower in train_set:
    #flower[0] is feature value
    #flower[1] is class of that data
    if flower[1] in classifier:
      classifier[flower[1]]['total_feature_value'] += flower[0]
      classifier[flower[1]]['data_in_this_class'] += 1
    else:
      classifier[flower[1]] = {'total_feature_value':flower[0], 'data_in_this_class':1}
    
  for each_class in classifier:
    classifier[each_class] = classifier[each_class]['total_feature_value']/classifier[each_class]['data_in_this_class']

  return classifier

In [8]:
def distance(point_from_class, point_from_feature):
  return abs(point_from_class - point_from_feature)

In [9]:
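def predict_class(classifier, flower_feature):
  # start from infinity so the first class always becomes the initial candidate
  smallest_distance = float('inf')
  probable_class = None
  for class_name, center in classifier.items():
    # center is the mean or median feature value of this class
    current_distance = distance(center, flower_feature)
    if current_distance < smallest_distance:
      smallest_distance = current_distance
      probable_class = class_name
  return probable_class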

In [10]:
def calculate_accuracy(validation_set, classifier):
  '''
  Input:
  validation_set: a numpy array, which will be used to find classification accuracy
  classifier    : a dictionary of classification info

  Output:
  accuracy: a floating point number (denoting percentage)

  Process:
  A loop will iterate through the validation_set, predict the class from the classifier (using the predict_class() function) and match it with the given class.
  Count if the prediction was correct.
  At the end of the loop, just calculate the percentage and return.
  '''
  correct_prediction = 0
  for flower in validation_set:
    # flower[0] is the feature
    # flower[1] is the class
    if (flower[1] == predict_class(classifier, flower[0])):
      correct_prediction += 1

  # compute the percentage once, after the loop
  accuracy = (correct_prediction/len(validation_set))*100

  return accuracy

In [14]:
def simulate(dataset, classify_on = 'MEAN'):
  for feature in column_names[:4]:
    print("Working feature: {}".format(feature))
  
    numpy_dataset = dataset[[feature, "species"]].to_numpy()
    train_set, validation_set = train_test_split(numpy_dataset, test_size=0.2, random_state=42)
    train_set, test_set = train_test_split(train_set, test_size=0.2, random_state=42)

    if classify_on == 'MEDIAN':
      classifier = median_model_trainer(train_set)
    else:
      classifier = model_trainer(train_set)
    print(classifier)

    print("Accuracy on validation set: {}".format(calculate_accuracy(validation_set, classifier)))
    print("Accuracy on test set: {}".format(calculate_accuracy(test_set, classifier)))

    # checking a random data
    feature_value = test_set[random.randint(0,len(test_set)-1)][0]
    print("Flower of {} {} is {}".format(feature, feature_value,predict_class(classifier, feature_value)))
    print("\n---------------------------------------------------------------------------------\n")

In [15]:
simulate(dataset)

Working feature: sepal_length
{'virginica': 6.552941176470587, 'versicolor': 5.959999999999999, 'setosa': 5.053125}
Accuracy on validation set: 86.66666666666667
Accuracy on test set: 66.66666666666666
Flower of sepal_length 6.6 is virginica

---------------------------------------------------------------------------------

Working feature: sepal_width
{'virginica': 2.988235294117647, 'versicolor': 2.7900000000000005, 'setosa': 3.5156250000000004}
Accuracy on validation set: 56.666666666666664
Accuracy on test set: 54.166666666666664
Flower of sepal_width 2.5 is versicolor

---------------------------------------------------------------------------------

Working feature: petal_length
{'virginica': 5.529411764705881, 'versicolor': 4.23, 'setosa': 1.4625000000000001}
Accuracy on validation set: 100.0
Accuracy on test set: 91.66666666666666
Flower of petal_length 5.0 is virginica

---------------------------------------------------------------------------------

Working feature: petal_wi

It looks like the accuracy on the validation set has increased for all features, but the same cannot be said for the test data.\
Maybe we can improve further by using the **median** instead of the **mean**, *maybe*.\
Let's see.
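
Before running it, a quick aside on why the median might help: it is less sensitive to outliers than the mean. A small sketch with made-up values (not taken from the iris data) shows the effect.

```python
import statistics

# Made-up feature values for one class, with and without a single outlier.
clean = [1.3, 1.4, 1.4, 1.5, 1.6]
with_outlier = clean + [4.0]

for label, values in (("clean", clean), ("with outlier", with_outlier)):
    mean = statistics.mean(values)
    median = statistics.median(values)
    print("{}: mean={:.2f}, median={:.2f}".format(label, mean, median))
```

The single outlier pulls the mean noticeably upward while the median barely moves, so a class center based on the median stays closer to where most of the class actually sits.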

In [16]:
simulate(dataset, classify_on='MEDIAN')

Working feature: sepal_length
{'virginica': 6.4, 'versicolor': 5.9, 'setosa': 5.1}
Accuracy on validation set: 80.0
Accuracy on test set: 75.0
Flower of sepal_length 6.6 is virginica

---------------------------------------------------------------------------------

Working feature: sepal_width
{'virginica': 3.0, 'versicolor': 2.9, 'setosa': 3.5}
Accuracy on validation set: 63.33333333333333
Accuracy on test set: 50.0
Flower of sepal_width 2.7 is versicolor

---------------------------------------------------------------------------------

Working feature: petal_length
{'virginica': 5.5, 'versicolor': 4.3, 'setosa': 1.5}
Accuracy on validation set: 100.0
Accuracy on test set: 91.66666666666666
Flower of petal_length 4.5 is versicolor

---------------------------------------------------------------------------------

Working feature: petal_width
{'virginica': 2.0, 'versicolor': 1.3, 'setosa': 0.2}
Accuracy on validation set: 100.0
Accuracy on test set: 100.0
Flower of petal_width 0.1 is

Well, it seems like both mean and median are working well.
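
When comparing the two runs, it may help to remember how small the evaluation sets are. The sketch below is only rough arithmetic under two assumptions that are not stated in this notebook: that the dataset has 150 rows (the usual size of iris), and that `train_test_split` rounds the held-out share up to a whole number of rows.

```python
import math

n_rows = 150           # assumption: the standard iris dataset size
held_out_percent = 20  # corresponds to test_size=0.2 in both splits of simulate()

# First split: hold out the validation set.
n_validation = math.ceil(n_rows * held_out_percent / 100)
n_remaining = n_rows - n_validation

# Second split: hold out the test set from what remains.
n_test = math.ceil(n_remaining * held_out_percent / 100)
n_train = n_remaining - n_test

print("train={}, validation={}, test={}".format(n_train, n_validation, n_test))
print("one validation flower = {:.2f} accuracy points".format(100 / n_validation))
print("one test flower = {:.2f} accuracy points".format(100 / n_test))
```

Under these assumptions, the gap between 66.67% and 75% test accuracy on sepal_length corresponds to just two flowers, so small differences between the mean and median runs should be read with caution.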