### Iris dataset classification

This is one of a few notebooks designed to showcase how Conveyor can make your work in Jupyter more organized. The objective of this example is to separate the Iris dataset classification task (covered [here](https://scikit-learn.org/stable/tutorial/statistical_inference/supervised_learning.html)) into smaller subtasks, from exploratory data analysis to evaluating different classification strategies.

In [1]:
import numpy as np
from sklearn import datasets

In [2]:
iris_X, iris_y = datasets.load_iris(return_X_y=True)

(From scikit-learn's website) "The iris dataset is a classification task consisting in identifying 3 different types of irises (Setosa, Versicolour, and Virginica) from their petal and sepal length and width."

In [3]:
np.unique(iris_y)

array([0, 1, 2])
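To see which species each integer label stands for, one option is to load the dataset as a `Bunch` instead of a tuple and read its `target_names` and `feature_names` attributes. This is a minimal sketch; the `label_to_name` variable is introduced here for illustration and is not used elsewhere in the notebook.

```python
from sklearn import datasets

# Without return_X_y=True, load_iris returns a Bunch with metadata attached
iris = datasets.load_iris()

# target_names[i] is the species name for integer label i
label_to_name = {i: str(name) for i, name in enumerate(iris.target_names)}
print(label_to_name)

# feature_names gives the column order of the measurement matrix
print(list(iris.feature_names))
```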

In [4]:
# Each row of iris_X holds sepal length, sepal width, petal length, petal width (in cm)
iris_X[0]

array([5.1, 3.5, 1.4, 0.2])

In [5]:
# Label 0 is one of the three flower types
iris_y[0]

0

How many of each type of flower does the dataset contain?

In [6]:
class_count = [0]*len(np.unique(iris_y))

for flower_type in iris_y:
    class_count[flower_type] += 1

class_count

[50, 50, 50]
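The same count can be obtained without an explicit loop. The sketch below uses `np.unique` with `return_counts=True`; the dictionary form is just one convenient way to pair labels with counts.

```python
import numpy as np
from sklearn import datasets

_, iris_y = datasets.load_iris(return_X_y=True)

# np.unique returns the sorted labels and how often each one occurs
labels, counts = np.unique(iris_y, return_counts=True)
class_count = dict(zip(labels.tolist(), counts.tolist()))
print(class_count)  # {0: 50, 1: 50, 2: 50}
```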

Are there any obvious identifying characteristics about each flower's petals? What about sepal and petal areas?

In [7]:
class_data = [iris_X[np.where(iris_y == flower_type)] for flower_type in np.unique(iris_y)]
class_areas_avg = []
class_areas = []

for flower_data in class_data:
    # Average each measurement over all flowers in the class (axis=0), giving
    # [sepal length, sepal width, petal length, petal width]
    flower_avg_dims = np.mean(flower_data, axis=0)
    class_areas_avg.append((flower_avg_dims[0]*flower_avg_dims[1],
                            flower_avg_dims[2]*flower_avg_dims[3]))
    # Per-flower (sepal area, petal area) pairs, kept for the variance check
    class_areas.append([(x[0]*x[1], x[2]*x[3]) for x in flower_data])

# Petal area grows from class 0 to 1 to 2; sepal area does not
np.round(class_areas_avg, 2)

array([[17.16,  0.36],
       [16.44,  5.65],
       [19.59, 11.25]])

The average petal areas are markedly different among the three types of flowers, while the average sepal areas are much closer together. Do the areas vary much across individual flowers, relative to these values? If so, area will not be a useful indicator for classifying our flowers.

In [8]:
np.var(class_areas, axis=1)

array([[ 8.43489316,  0.03216064],
       [ 8.05463156,  1.83507584],
       [11.72391684,  4.56133956]])

Let's get the areas for each flower in the order we see them.

In [12]:
flower_areas = []

for flower_idx in range(len(iris_X)):
    flower_data = iris_X[flower_idx]
    flower_areas.append([flower_data[0] * flower_data[1], flower_data[2] * flower_data[3]])
    
flower_areas

[[17.849999999999998, 0.27999999999999997],
 [14.700000000000001, 0.27999999999999997],
 [15.040000000000001, 0.26],
 [14.26, 0.30000000000000004],
 [18.0, 0.27999999999999997],
 [21.060000000000002, 0.68],
 [15.639999999999999, 0.42],
 [17.0, 0.30000000000000004],
 [12.76, 0.27999999999999997],
 [15.190000000000001, 0.15000000000000002],
 [19.980000000000004, 0.30000000000000004],
 [16.32, 0.32000000000000006],
 [14.399999999999999, 0.13999999999999999],
 [12.899999999999999, 0.11000000000000001],
 [23.2, 0.24],
 [25.080000000000002, 0.6000000000000001],
 [21.060000000000002, 0.52],
 [17.849999999999998, 0.42],
 [21.66, 0.51],
 [19.38, 0.44999999999999996],
 [18.36, 0.34],
 [18.87, 0.6000000000000001],
 [16.56, 0.2],
 [16.83, 0.85],
 [16.32, 0.38],
 [15.0, 0.32000000000000006],
 [17.0, 0.6400000000000001],
 [18.2, 0.30000000000000004],
 [17.68, 0.27999999999999997],
 [15.040000000000001, 0.32000000000000006],
 [14.879999999999999, 0.32000000000000006],
 [18.36, 0.6000000000000001],
 ...

Petal area doesn't seem to vary too much within each class. In the next notebook, I'll try classifying each flower using the sepal and petal areas.
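As a preview of what that could look like, here is a minimal sketch that builds the two area features in vectorized form and fits a k-nearest-neighbours classifier on them. The choice of `KNeighborsClassifier`, the 25% hold-out split, and the random seed are all assumptions made for illustration, not necessarily what the next notebook does.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_X, iris_y = datasets.load_iris(return_X_y=True)

# Vectorized area features: column 0 = sepal area, column 1 = petal area
area_features = np.column_stack((iris_X[:, 0] * iris_X[:, 1],
                                 iris_X[:, 2] * iris_X[:, 3]))

# Hold out a quarter of the flowers; split size and seed are arbitrary choices
X_train, X_test, y_train, y_test = train_test_split(
    area_features, iris_y, test_size=0.25, random_state=0, stratify=iris_y)

knn = KNeighborsClassifier()  # default n_neighbors=5
knn.fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```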