---
# Classifying Iris Species
---

Let’s assume that a hobby botanist is interested in distinguishing the species of some iris flowers that she has found. She has collected some measurements associated with each iris: the length and width of the petals and the length and width of the sepals, all measured in centimeters.

She also has the measurements of some irises that have been previously identified by
an expert botanist as belonging to the species setosa, versicolor, or virginica. For these measurements, she can be certain of which species each iris belongs to. Let’s assume that these are the only species our hobby botanist will encounter in the wild.

Our goal is to build a machine learning model that can learn from the measurements
of these irises whose species is known, so that we can predict the species for a new
iris.

## Meet the Data
The data we will use for this example is the Iris dataset, a classical dataset in machine learning and statistics. It is included in scikit-learn in the datasets module. We can load it by calling the load_iris function:

__Note:__ Classes, labels, samples, and features

Classes -> The possible outputs of our model

Labels -> The desired output (class) for a given input

Samples -> The individual items in a data array

Features -> The properties that describe each sample

The "iris" dataset ships with scikit-learn and is loaded from the sklearn.datasets module.

In [7]:
from sklearn.datasets import load_iris
iris_dataset = load_iris()

The dataset is loaded into a "Bunch" object, a scikit-learn type that behaves much like a dictionary.

In [3]:
print("Keys of iris_dataset: \n{}".format(iris_dataset.keys()))

Keys of iris_dataset: 
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])


In [6]:
type(iris_dataset)

sklearn.utils.Bunch
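
Because a Bunch behaves like a dictionary, its fields can be read by key; it also exposes the same fields as attributes. The following is a minimal sketch, assuming scikit-learn is installed. The names `same_object`, `X`, and `y` are illustrative only and are not used elsewhere in this notebook.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Key access and attribute access return the same underlying array.
same_object = iris_dataset['data'] is iris_dataset.data
print("Same object: {}".format(same_object))

# return_X_y=True skips the Bunch and returns (data, target) directly.
X, y = load_iris(return_X_y=True)
print("X shape: {}, y shape: {}".format(X.shape, y.shape))
```

Key access is used throughout the rest of this section; the attribute form is simply a shorthand for the same lookup.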

The 'DESCR' field contains a detailed description of the dataset; the following cell shows only the beginning of that description.

In [8]:
print(iris_dataset['DESCR'][:193] + "\n...")

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, pre
...


The 'target_names' field is an array of strings containing the species of flower that we want to predict (the outputs, or classes).

In [9]:
print("Target names: {}".format(iris_dataset['target_names']))

Target names: ['setosa' 'versicolor' 'virginica']


The 'feature_names' field is a list of strings, giving the description of each feature.

In [10]:
print("Feature names: \n{}".format(iris_dataset['feature_names']))

Feature names: 
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
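
The order of the feature names matches the order of the columns in the data array, so names and values can be paired up. Here is a small sketch in plain Python, with the feature names copied from the output above and the measurements of the first flower in the dataset hardcoded so that the example runs on its own. The names `first_sample` and `labeled` are illustrative.

```python
feature_names = ['sepal length (cm)', 'sepal width (cm)',
                 'petal length (cm)', 'petal width (cm)']
# Measurements of the first flower in the dataset, hardcoded for illustration.
first_sample = [5.1, 3.5, 1.4, 0.2]

# Pair each feature name with the corresponding measurement.
labeled = dict(zip(feature_names, first_sample))
for name, value in labeled.items():
    print("{:<18} {}".format(name, value))
```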


The data itself is contained in the target and data fields. data contains the numeric measurements of sepal length, sepal width, petal length, and petal width in a NumPy array:

In [14]:
print("Type of data: {}".format(type(iris_dataset['data'])))

Type of data: <class 'numpy.ndarray'>


The rows of the data array correspond to individual flowers, and the columns hold the four measurements taken for each flower.

The following cell shows the shape of the data array.

In [15]:
# Note: shape is an attribute of the NumPy array, not a function call
print("Shape of data: {}".format(iris_dataset['data'].shape))

Shape of data: (150, 4)


We see that the array contains measurements for 150 different flowers. Remember
that the individual items are called samples in machine learning, and their properties are called features. The shape of the data array is the number of samples by the number of features, here (150, 4). This is a convention in scikit-learn, and your data will always be assumed to be in this shape. Here are the feature values for the first five samples:

In [16]:
print("First five rows of data:\n{}".format(iris_dataset['data'][:5]))

First five rows of data:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
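
To make the samples-by-features layout concrete, the sketch below rebuilds the five samples shown above as a NumPy array and slices it both ways. The values are copied from the output; the variable names (`first_five`, `one_flower`, `petal_lengths`) are illustrative.

```python
import numpy as np

# First five samples, copied from the output above.
first_five = np.array([[5.1, 3.5, 1.4, 0.2],
                       [4.9, 3.0, 1.4, 0.2],
                       [4.7, 3.2, 1.3, 0.2],
                       [4.6, 3.1, 1.5, 0.2],
                       [5.0, 3.6, 1.4, 0.2]])

n_samples, n_features = first_five.shape
print("Samples: {}, features: {}".format(n_samples, n_features))

# One row is one flower; one column is one measurement across all flowers.
one_flower = first_five[0]
petal_lengths = first_five[:, 2]  # column 2 is 'petal length (cm)'
print("First flower: {}".format(one_flower))
print("Petal lengths: {}".format(petal_lengths))
```

Indexing with a single integer selects a sample (row), while `[:, j]` selects feature `j` for every sample.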


### Target array

The target array holds the desired output for each set of measurements.

It contains the species of each flower that was measured, also as a NumPy array.

In [19]:
print("Type of target: {}".format(type(iris_dataset['target'])))

Type of target: <class 'numpy.ndarray'>


In [20]:
print("Shape of target: {}".format(iris_dataset['target'].shape))

Shape of target: (150,)


Here the target array encodes the species of each flower as an integer from 0 to 2, one number per species (class). The meaning of each number is given by the entry at that index in the iris_dataset\['target_names'\] array.

In [21]:
print("Target:\n{}".format(iris_dataset['target']))

Target:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


In [24]:
iris_dataset['target_names']

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
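
Since each integer in the target array is an index into target_names, the species name for every sample can be recovered with NumPy integer-array indexing. The sketch below rebuilds both arrays from the outputs shown above (50 samples per class, in order) so that it runs on its own. The names `species` and `counts` are illustrative.

```python
import numpy as np

target_names = np.array(['setosa', 'versicolor', 'virginica'])
# 50 samples per class, in order, matching the target output printed above.
target = np.repeat([0, 1, 2], 50)

# Indexing target_names with the integer array yields one name per sample.
species = target_names[target]
print(species[[0, 50, 100]])

# Count how many samples fall into each class.
counts = np.bincount(target)
print(dict(zip(target_names.tolist(), counts.tolist())))
```

With the real dataset, the same lookup would be `iris_dataset['target_names'][iris_dataset['target']]`.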