In [1]:
### template for notebook
### this will need to be included for all examples
%matplotlib notebook
### %matplotlib inline (is another alternative)
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import mglearn
from IPython.display import display


# Classifying Iris Species - Briefing

We've been collecting measurements of irises, and a botanist has handily provided measurements of irises that have already been identified as one of three species: _setosa_, _versicolor_ or _virginica_.

Using the botanist's measurements, we can predict the species of our own irises from the measurements we've taken.

This is a *classification* problem. Every iris belongs to exactly one of three classes, so this is a three-class classification problem.

For each data point (a single iris), the aim is to find out the species of the flower. The species it belongs to is called its _label_.

## Origin of the dataset

The dataset was introduced by [British statistician and biologist Ronald Fisher in his 1936 paper _The use of multiple measurements in taxonomic problems_](https://en.wikipedia.org/wiki/Iris_flower_data_set).

In [2]:
from sklearn.datasets import load_iris
iris_dataset = load_iris()

## We want to find out what's available in the data set
Let's find out by looking at the keys. Be aware that the type is a `Bunch`, which behaves much like a dictionary, so you can list its keys.


In [3]:
print(type(iris_dataset))
print("Keys of iris_dataset: \n{}".format(iris_dataset.keys()))

<class 'sklearn.utils.Bunch'>
Keys of iris_dataset: 
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
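
Because a `Bunch` behaves like a dictionary, its entries can be read either by key or as attributes; both styles appear in this notebook. A minimal sketch (the variable names `by_key` and `by_attr` are introduced here purely for illustration):

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Dictionary-style access, as used in most cells of this notebook
by_key = iris_dataset['feature_names']

# Attribute-style access to the same entry
by_attr = iris_dataset.feature_names

print(by_key == by_attr)          # both styles return the same list
print('target' in iris_dataset)   # membership test works as for a dict
```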


## Look at the description of the data.

In [4]:
print(iris_dataset['DESCR'] + "\nEND\n____")

Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris d

### The species = target_names

In [5]:
print("Target names: {}".format(iris_dataset['target_names']))

Target names: ['setosa' 'versicolor' 'virginica']


### The feature_names are the column names

In [6]:
print("Feature names: \n{}".format(iris_dataset['feature_names']))

Feature names: 
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


### Data is stored in 'data' (unsurprisingly)

It's stored in a numpy array.

In [7]:
print("data type:  {}".format(type(iris_dataset['data'])))

data type:  <class 'numpy.ndarray'>


If you ran just 

```python
print(iris_dataset['data'])
```

You would see every row printed. Each row here corresponds to one flower. In machine learning the individual items are called **_samples_**, and their properties are called **_features_**.

An ndarray has a `shape` attribute, so you can see what you're working with.

In [8]:
print("Shape of data (numpy ndarray): {}".format(iris_dataset['data'].shape))

Shape of data (numpy ndarray): (150, 4)


The shape of the data array is the number of samples by the number of features, here (150, 4). (This is a two-dimensional array.)

This is a convention in scikit-learn: it assumes your data is in this shape.
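
The same samples-by-features convention applies even to a single flower: one row taken from the array is one-dimensional, so it has to be reshaped into a two-dimensional array with one row before scikit-learn will treat it as a sample. A minimal sketch (the names `single_flower` and `as_row` are illustrative, not part of the dataset):

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()
X = iris_dataset['data']

# One row on its own is a 1-D array of four measurements
single_flower = X[0]

# reshape(1, -1) turns it into a 2-D array: 1 sample, 4 features
as_row = single_flower.reshape(1, -1)

print(single_flower.shape)
print(as_row.shape)
```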

### The first five rows

In [9]:
print("First five rows of data:\n{}".format(iris_dataset['data'][:5]))

First five rows of data:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]


### Don't forget the 'target' key

This is another numpy ndarray, but this time it is one-dimensional, with one entry per sample.

In [10]:
print("target type: {}".format(type(iris_dataset['target'])))
print("shape of target (numpy ndarray): {}".format(iris_dataset['target'].shape))

target type: <class 'numpy.ndarray'>
shape of target (numpy ndarray): (150,)


In [11]:
print("Target:\n{}".format(iris_dataset['target']))

Target:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


This might not make sense at first, but notice that target_names lists three species. The numbers are the positions in the array returned by target_names. This means that:

* 0 = setosa
* 1 = versicolor
* 2 = virginica
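
The mapping above can be applied in code by using the target array as an index into target_names. A minimal sketch (the variable name `species` is introduced here for illustration):

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()
target = iris_dataset['target']
target_names = iris_dataset['target_names']

# Indexing target_names with the integer labels gives one name per flower
species = target_names[target]

# One flower from each block of 50 in the target array
print(species[[0, 50, 100]])

# Number of flowers in each class
print(np.bincount(target))
```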

# Assessing the model's performance

We need a way to assess the model's performance. To do that, we split the labeled data (150 flower measurements in the Iris example) into two parts.

One part, the **training data**, is used to build the machine learning model. The rest, the **test data**, is used to assess how well the model works.

Typically about 25% of the data is held out as test data, which is also the default used by scikit-learn's splitting function.

scikit-learn handily provides a function that splits the data, called `train_test_split`:

```python
from sklearn.model_selection import train_test_split
```

When defining your variables, data is represented with a capital X while labels (the outcome of the model) are represented with a lowercase y.
\begin{equation*}
f(X) = y
\end{equation*}

Because X represents a two-dimensional array, it is treated as a matrix in mathematics and written in upper case; y is lower case because it represents a one-dimensional array (or vector).



In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)


The train_test_split function also shuffles the data before splitting it.

The random_state parameter is set to 0 so that the shuffle uses a fixed seed and the split is deterministic (in other words, the same outcome on every run).
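
To see what a fixed random_state buys you, the split can be run twice and compared. A minimal sketch, which passes `test_size=0.25` explicitly to make the 25% share visible (the names `split_a`, `split_b` and `same` are illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X, y = iris_dataset['data'], iris_dataset['target']

# Two independent calls with the same seed
split_a = train_test_split(X, y, test_size=0.25, random_state=0)
split_b = train_test_split(X, y, test_size=0.25, random_state=0)

# Each call returns (X_train, X_test, y_train, y_test); compare them pairwise
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)

# Shapes of the training and test portions of X
print(split_a[0].shape, split_a[1].shape)
```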


In [13]:
## Training set
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {} (remember there is no second outcome because it's a one dimensional array)".format(y_train.shape))

X_train shape: (112, 4)
y_train shape: (112,) (remember there is no second outcome because it's a one dimensional array)


In [14]:
print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))

X_test shape: (38, 4)
y_test shape: (38,)


# Check the data

We need to check the data, so we'll convert the numpy array to a pandas DataFrame and plot it.
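
Alongside a plot, a per-species numerical summary is a quick sanity check on the training data. A minimal sketch, assuming the same split as above; the `species` column and the `summary` variable are introduced here for illustration only:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

# Build the dataframe and attach the species name for each training sample
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset['feature_names'])
iris_dataframe['species'] = iris_dataset['target_names'][y_train]

# Mean of each measurement per species
summary = iris_dataframe.groupby('species').mean()
print(summary)
```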


In [16]:
# create a dataframe from the data in X_train
# label the columns using the strings in iris_dataset.feature_names
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)
# create a scatterplot matrix from the dataframe, color by y_train
pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15),
                          marker='o', hist_kwds={'bins' : 20}, s=60,
                          alpha=.8, cmap=mglearn.cm3)

[Figure: 4 x 4 scatter matrix of the training features, points colored by species (y_train), with histograms on the diagonal]