# Iris Data set Notebook

The Iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper *The use of multiple measurements in taxonomic problems* as an example of linear discriminant analysis. It is sometimes called Anderson’s Iris data set because Edgar Anderson collected the data to quantify the morphological variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula “all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus”.


The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other. 

![flowers](images/flowers.jpeg)

In [1]:
# packages needed
import sklearn.neighbors as neigh
import pandas as pd

Imports the data set to be used

In [2]:
df = pd.read_csv("https://github.com/ianmcloughlin/datasets/raw/master/iris.csv")

### Thanks to GMIT lecturer Dr. Ian McLoughlin for the use of this dataset
The data set contains 150 observations of iris flowers. The first four columns hold measurements of the flowers in centimetres. The fifth column is the species of the flower observed. Every observed flower belongs to one of three species.

In [3]:
df
# displays data 

,sepal_length,sepal_width,petal_length,petal_width,class
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa


In [4]:
import seaborn as sns
sns.pairplot(df, hue="class")
# draws a grid of pairwise scatter plots of the measurements, coloured by class

<seaborn.axisgrid.PairGrid at 0x154543913c8>

Separates the data frame into inputs (the four measurements) and outputs (the species)

In [5]:
inputs = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
outputs = df['class']

In [6]:
inputs  # displays the input columns

,sepal_length,sepal_width,petal_length,petal_width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
5,5.4,3.9,1.7,0.4
6,4.6,3.4,1.4,0.3
7,5.0,3.4,1.5,0.2
8,4.4,2.9,1.4,0.2
9,4.9,3.1,1.5,0.1


In [7]:
outputs  # displays the class labels

0         setosa
1         setosa
2         setosa
3         setosa
4         setosa
5         setosa
6         setosa
7         setosa
8         setosa
9         setosa
10        setosa
11        setosa
12        setosa
13        setosa
14        setosa
15        setosa
16        setosa
17        setosa
18        setosa
19        setosa
20        setosa
21        setosa
22        setosa
23        setosa
24        setosa
25        setosa
26        setosa
27        setosa
28        setosa
29        setosa
         ...    
120    virginica
121    virginica
122    virginica
123    virginica
124    virginica
125    virginica
126    virginica
127    virginica
128    virginica
129    virginica
130    virginica
131    virginica
132    virginica
133    virginica
134    virginica
135    virginica
136    virginica
137    virginica
138    virginica
139    virginica
140    virginica
141    virginica
142    virginica
143    virginica
144    virginica
145    virginica
146    virginica
147    virgini

In [8]:
knn = neigh.KNeighborsClassifier(n_neighbors=5)

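Creates a k-nearest-neighbours classifier with k = 5. To classify an iris, it finds the 5 training flowers whose measurements are closest to it and assigns the species that most of those 5 belong to, the idea being that flowers with similar measurements are usually of the same species.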
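The nearest-neighbour vote behind `KNeighborsClassifier` can be sketched in plain Python. This is a minimal illustration rather than scikit-learn's actual implementation: the function name `knn_predict` and the nine-row `training` sample are hypothetical. The setosa rows are copied from the table above, while the versicolor and virginica rows are assumed illustrative values. It uses Euclidean distance, which matches the `metric='minkowski', p=2` defaults, and needs Python 3.8+ for `math.dist`.

```python
import math
from collections import Counter

# Hypothetical toy sample, NOT the full 150-row data set.
# Each entry: ([sepal_length, sepal_width, petal_length, petal_width], class)
training = [
    ([5.1, 3.5, 1.4, 0.2], "setosa"),
    ([4.9, 3.0, 1.4, 0.2], "setosa"),
    ([4.7, 3.2, 1.3, 0.2], "setosa"),
    ([7.0, 3.2, 4.7, 1.4], "versicolor"),  # assumed illustrative values
    ([6.4, 3.2, 4.5, 1.5], "versicolor"),
    ([6.9, 3.1, 4.9, 1.5], "versicolor"),
    ([6.3, 3.3, 6.0, 2.5], "virginica"),   # assumed illustrative values
    ([5.8, 2.7, 5.1, 1.9], "virginica"),
    ([7.1, 3.0, 5.9, 2.1], "virginica"),
]


def knn_predict(query, data, k=5):
    """Return the majority class among the k training points nearest to query."""
    # Sort the training rows by Euclidean distance to the query point.
    by_distance = sorted(data, key=lambda row: math.dist(query, row[0]))
    # Count the class labels of the k closest rows and take the most common one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


small_flower = knn_predict([5.0, 3.4, 1.5, 0.2], training)
large_flower = knn_predict([6.5, 3.0, 5.5, 2.0], training)
print(small_flower, large_flower)
```

With only three rows per class, two of the five votes always come from another species, which shows why the label is decided by the majority and not by unanimity.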

In [9]:
knn.fit(inputs, outputs)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [10]:
df.loc[0]  # shows the first iris in the data set

sepal_length       5.1
sepal_width        3.5
petal_length       1.4
petal_width        0.2
class           setosa
Name: 0, dtype: object

In [11]:
knn.predict([[5.1, 3.5, 1.4, 0.2]])
# predicts which class the input measurements fall into

array(['setosa'], dtype=object)

In [12]:
knn.predict([[5.1, 3.5, 1.4, 0.2], [0.2,3.5,1.8,8.2]])
# predicts which classes the measurements of two irises fall into

array(['setosa', 'versicolor'], dtype=object)

In [13]:
knn.predict(inputs) == outputs

0      True
1      True
2      True
3      True
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
12     True
13     True
14     True
15     True
16     True
17     True
18     True
19     True
20     True
21     True
22     True
23     True
24     True
25     True
26     True
27     True
28     True
29     True
       ... 
120    True
121    True
122    True
123    True
124    True
125    True
126    True
127    True
128    True
129    True
130    True
131    True
132    True
133    True
134    True
135    True
136    True
137    True
138    True
139    True
140    True
141    True
142    True
143    True
144    True
145    True
146    True
147    True
148    True
149    True
Name: class, Length: 150, dtype: bool

In [14]:
(knn.predict(inputs) == outputs).sum()
# counts the correct predictions: 145 of the 150 irises, so 5 are classified incorrectly
# each prediction is a majority vote among the 5 nearest training flowers,
# so an iris surrounded mostly by another species is given that species' label
# note that this is accuracy on the same data the classifier was fitted to



145
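The count above is measured on the same flowers the classifier was fitted to, so it is likely to flatter the model. A common alternative is to hold some flowers back for testing. The sketch below is one way to do that, not part of the original notebook. It assumes scikit-learn's bundled `load_iris` copy of the data in place of the CSV so that it runs offline, and the 30% test share, `random_state=0` and the name `knn_holdout` are arbitrary illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Bundled copy of the iris data (assumption: used here in place of the CSV above).
X, y = load_iris(return_X_y=True)

# Hold back 30% of the flowers, keeping the three species balanced in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Fit on the training part only, with the same k = 5 as above.
knn_holdout = KNeighborsClassifier(n_neighbors=5)
knn_holdout.fit(X_train, y_train)

# Fraction of the held-out flowers that are classified correctly.
test_accuracy = knn_holdout.score(X_test, y_test)
print(f"held-out accuracy: {test_accuracy:.3f}")
```

The exact figure depends on which flowers end up in the test part, so it is worth trying a few values of `random_state` before reading much into a single number.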