
# **A First Application: Classifying Iris Species**

We are going to build a classifier for iris flowers.
In the process, we will introduce some core concepts and terms.

The dataset contains samples of three species of iris: Iris setosa, Iris virginica, and Iris versicolor. For each sample, four traits were measured: the length and width of the sepals and the length and width of the petals.

[This photo](https://www.oreilly.com/library/view/python-artificial-intelligence/9781789539462/assets/462dc4fa-fd62-4539-8599-ac80a441382c.png) shows the difference between a petal and a sepal.

Our goal is to build a machine learning model that can learn from the measurements of these irises whose species is known, so that we can predict the species for a new iris.

# Meet the Data

The data we will use is the **Iris dataset**.

It is included in **scikit-learn** in the datasets module. scikit-learn comes with a few small standard datasets that do not require downloading any file from an external website.

We can load it by calling the **load_iris** function, which loads and returns the iris dataset (classification).

The iris object that is returned by load_iris is a **Bunch object**, which is very similar to a dictionary. It contains keys and values, much like a printed dictionary in which you look up a word (the key) and get its definition (the value). In programming, the keys and values can be almost anything you choose (words, numbers, etc.).

In [None]:
from sklearn.datasets import load_iris
iris_dataset = load_iris()
print("Keys of iris_dataset: \n{}".format(iris_dataset.keys()))



Keys of iris_dataset: 
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
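
As a small aside on the dictionary-like behavior of the Bunch object, the sketch below shows that a value can be reached either with dictionary-style indexing or as an attribute. The variable name `same_object` is introduced here only for illustration.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# A Bunch can be indexed like a dictionary ...
data_by_key = iris_dataset['data']
# ... or read through an attribute with the same name as the key.
data_by_attribute = iris_dataset.data

# Both expressions refer to the same underlying NumPy array.
same_object = data_by_key is data_by_attribute
print("Same array object: {}".format(same_object))
```

The rest of this notebook mostly uses the dictionary-style form, but the two are interchangeable.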


The value of the key **DESCR** (short for DESCRIPTION) is a short description of the dataset. Here we print the beginning of it:


In [None]:
print(iris_dataset['DESCR'][:193] + "\n...")


.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, pre
...


We can now view the **target_names** for the iris dataset. The value of the key target_names is an array of strings containing the species of
flower that we want to predict:

In [None]:
print("Target names: {}".format(iris_dataset['target_names']))


Target names: ['setosa' 'versicolor' 'virginica']


Next, let's look at the features inside the dataset.
The value of **feature_names** is a list with the names of the feature variables, in other words the names of the columns in data:

In [None]:
print("Feature names: \n{}".format(iris_dataset['feature_names']))

Feature names: 
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


The data itself is contained in the target and data fields. data contains the numeric
measurements of sepal length, sepal width, petal length, and petal width in a NumPy array:

In [None]:
print("Type of data: {}".format(type(iris_dataset['data'])))

Type of data: <class 'numpy.ndarray'>


The rows in the data array correspond to flowers, while the columns represent the
four measurements that were taken for each flower:

In [None]:
print("Shape of data: {}".format(iris_dataset['data'].shape))

Shape of data: (150, 4)


Let's look at the data.

Here are the feature values for the first five samples:




In [None]:
print("First five rows of data:\n{}".format(iris_dataset['data'][:5]))

First five rows of data:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]


The target array contains the species of each of the flowers that were measured, also as a NumPy array:

In [None]:
print("Type of target: {}".format(type(iris_dataset['target'])))

Type of target: <class 'numpy.ndarray'>


target is a one-dimensional array, with one entry per flower:

In [None]:
print("Shape of target: {}".format(iris_dataset['target'].shape))

Shape of target: (150,)


Species are encoded as integers from 0 to 2:

0 means setosa, 1 means versicolor, and 2 means virginica.
The meaning of the numbers is given by the iris_dataset['target_names'] array:

In [None]:
print("Target:\n{}".format(iris_dataset['target']))

Target:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
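
To make the integer encoding concrete, here is a minimal sketch that turns the integer labels back into species names by indexing the target_names array with the target array, and counts the samples per class. The names `species` and `counts` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Indexing the array of names with the integer labels gives one name per flower.
species = iris_dataset['target_names'][iris_dataset['target']]
print("First label: {} -> {}".format(iris_dataset['target'][0], species[0]))
print("Last label: {} -> {}".format(iris_dataset['target'][-1], species[-1]))

# np.bincount counts how often each integer label occurs.
counts = np.bincount(iris_dataset['target'])
print("Samples per class: {}".format(counts))
```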


We will build a machine learning model from this data that can predict the iris species for a new set of measurements.
But we need to know whether it actually works, that is, whether we
should trust its predictions.

We cannot use the data we used to build the model to evaluate it. This is because our model can always simply remember the whole training set, and will therefore always predict the correct label for any point in the training set.
To assess the model’s performance, we show it data that it hasn’t seen before, for which we have labels. This is usually done by splitting the labeled data we
have collected (here, our 150 flower measurements) into two parts:
- One part, called the training data or training set, is used to build our machine learning model.
- The rest, called the test data, test set, or hold-out set, is used to assess how well the model works.

scikit-learn contains a function that shuffles the dataset and splits it for you: the
train_test_split function. By default, this function extracts 75% of the rows in the data as the
training set, together with the corresponding labels for this data. The remaining 25%
of the data, together with the remaining labels, is declared as the test set. Deciding how much data you want to put into the training and the test set respectively is somewhat
arbitrary, but using a test set containing 25% of the data is a good rule of thumb.
In scikit-learn, data is usually denoted with a capital X, while labels are denoted by
a lowercase y.

Let’s call train_test_split on our data and assign the outputs using this nomenclature.

To make sure that we will get the same output if we run the same function several
times, we provide the pseudorandom number generator with a fixed seed using the
random_state parameter. This will make the outcome deterministic, so this line will
always have the same outcome. We will always fix the random_state in this way when
using randomized procedures in this book.


The output of the train_test_split function is X_train, X_test, y_train, and
y_test, which are all NumPy arrays. X_train contains 75% of the rows of the dataset,
and X_test contains the remaining 25%:

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
 iris_dataset['data'], iris_dataset['target'], random_state=0)

In [None]:
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))

X_train shape: (112, 4)
y_train shape: (112,)


In [None]:
print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))

X_test shape: (38, 4)
y_test shape: (38,)
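
The effect of fixing random_state can be checked directly. The sketch below is only an illustration: it calls train_test_split twice with the same seed and compares the results, and it also passes test_size=0.25 explicitly, which matches the default split described above. The names `split_a`, `split_b`, and `identical` are introduced here for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X, y = iris_dataset['data'], iris_dataset['target']

# Two calls with the same seed should produce exactly the same split.
split_a = train_test_split(X, y, random_state=0)
split_b = train_test_split(X, y, random_state=0)
identical = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Same split with the same seed: {}".format(identical))

# Writing the test size out explicitly gives the same 112/38 division.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
print("Train rows: {}, test rows: {}".format(X_train.shape[0], X_test.shape[0]))
```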


Before building a machine learning model it is often a good idea to inspect the data,
to see if the task is easily solvable without machine learning, or if the desired information
might not be contained in the data.
Additionally, inspecting your data is a good way to find abnormalities and peculiarities.
One of the best ways to inspect data is to visualize it. One way to do this is by using a
scatter plot. Unfortunately, computer
screens have only two dimensions, which allows us to plot only two (or maybe three)
features at a time. It is difficult to plot datasets with more than three features this way.
One way around this problem is to do a pair plot, which looks at all possible pairs of
features. If you have a small number of features, such as the four we have here, this is
quite reasonable. You should keep in mind, however, that a pair plot does not show
the interaction of all of the features at once, so some interesting aspects of the data may
not be revealed when visualizing it this way.


In [None]:
import pandas as pd

# create dataframe from data in X_train
# label the columns using the strings in iris_dataset.feature_names
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)
# create a scatter matrix from the dataframe, color by y_train
# (pd.scatter_matrix was removed from pandas; pd.plotting.scatter_matrix replaces it.
# The built-in 'viridis' colormap is used here in place of mglearn.cm3.)
grr = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
 hist_kwds={'bins': 20}, s=60, alpha=.8, cmap='viridis')

To build a model, we import a classifier class from scikit-learn.
Here we will use a k-nearest neighbors classifier, which is easy to understand. Building this model only consists of
storing the training set. To make a prediction for a new data point, the algorithm
(with n_neighbors=1, as used here) finds the point in the training set that is closest to the new point. Then it assigns the
label of this training point to the new data point.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)

The knn object encapsulates the algorithm that will be used to build the model from
the training data, as well as the algorithm to make predictions on new data points. It will
also hold the information that the algorithm has extracted from the training data. In
the case of KNeighborsClassifier, it will just store the training set.
To build the model on the training set, we call the fit method of the knn object,
which takes as arguments the NumPy array X_train containing the training data and
the NumPy array y_train of the corresponding training labels:


In [None]:
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')
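
To illustrate the nearest-neighbor rule described above, here is a minimal from-scratch sketch in plain NumPy. It is not how scikit-learn implements KNeighborsClassifier; the function `predict_1nn` and the toy arrays are hypothetical and made up for this example. It assumes Euclidean distance and a single neighbor.

```python
import numpy as np

def predict_1nn(X_train, y_train, X_new):
    """Return, for each row of X_new, the label of its closest training row."""
    # Differences between every new point and every training point,
    # shape (n_new, n_train, n_features).
    diffs = X_new[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    # Euclidean distances, shape (n_new, n_train).
    distances = np.sqrt((diffs ** 2).sum(axis=2))
    # Index of the closest training point for each new point.
    nearest = distances.argmin(axis=1)
    return y_train[nearest]

# Toy data, invented for illustration: two well-separated groups.
toy_X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
toy_y = np.array([0, 0, 1, 1])
toy_new = np.array([[0.9, 1.1], [4.8, 5.1]])

toy_pred = predict_1nn(toy_X, toy_y, toy_new)
print("Toy predictions: {}".format(toy_pred))
```

"Fitting" in this sketch amounts to keeping toy_X and toy_y around, which mirrors the statement that the model only stores the training set.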

### Making Predictions
We can now make predictions using this model on new data for which we might not
know the correct labels.

In [None]:
import numpy as np

X_new = np.array([[5, 2.9, 1, 0.2]])
print("X_new.shape: {}".format(X_new.shape))

X_new.shape: (1, 4)


Note that we made the measurements of this single flower into a row in a two-dimensional NumPy array, as scikit-learn always expects two-dimensional arrays
for the data.
To make a prediction, we call the predict method of the knn object:

In [None]:
prediction = knn.predict(X_new)
print("Prediction: {}".format(prediction))
print("Predicted target name: {}".format(
 iris_dataset['target_names'][prediction]))


Prediction: [0]
Predicted target name: ['setosa']
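
Regarding the two-dimensional input mentioned earlier: if the measurements of a single flower arrive as a flat, one-dimensional array, one common way to turn them into a single-row two-dimensional array is reshape(1, -1). The sketch below assumes a hypothetical variable `flower` holding the four measurements.

```python
import numpy as np

# One flower as a flat array of four measurements (shape (4,)).
flower = np.array([5, 2.9, 1, 0.2])
print("flower.shape: {}".format(flower.shape))

# reshape(1, -1) makes one row and lets NumPy infer the number of columns.
X_single = flower.reshape(1, -1)
print("X_single.shape: {}".format(X_single.shape))
```

The resulting array has the same layout as X_new above, so it could be passed to predict in the same way.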


### Evaluating the Model
This is where the test set that we created earlier comes in. This data was not used to
build the model, but we do know what the correct species is for each iris in the test
set.
Therefore, we can make a prediction for each iris in the test data and compare it
against its label (the known species). We can measure how well the model works by
computing the accuracy, which is the fraction of flowers for which the right species
was predicted:

In [None]:
y_pred = knn.predict(X_test)
print("Test set predictions:\n {}".format(y_pred))

Test set predictions:
 [2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 2]


In [None]:
print("Test set score: {:.2f}".format(np.mean(y_pred == y_test)))

Test set score: 0.97


For this model, the test set accuracy is about 0.97, which means we made the right
prediction for 97% of the irises in the test set. Provided that new irises resemble those in the dataset, we can expect our model to be correct about 97% of the time for new
irises. For our hobby botanist application, this high level of accuracy means that our
model may be trustworthy enough to use.

We can also use the score method of the knn object, which will compute the test set accuracy for us:

In [None]:
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))


Test set score: 0.97
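
As a small worked example of the accuracy computation, the sketch below uses made-up label arrays (not taken from the iris data) and compares the manual fraction of correct predictions with accuracy_score from sklearn.metrics. The names `toy_true`, `toy_guess`, `manual_accuracy`, and `library_accuracy` are introduced here for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Invented labels: 4 of the 5 predictions agree with the true labels.
toy_true = np.array([0, 1, 2, 2, 1])
toy_guess = np.array([0, 1, 2, 1, 1])

# Fraction of positions where prediction and true label match.
manual_accuracy = np.mean(toy_true == toy_guess)
# The same quantity computed by scikit-learn.
library_accuracy = accuracy_score(toy_true, toy_guess)

print("Manual accuracy: {:.2f}".format(manual_accuracy))
print("accuracy_score: {:.2f}".format(library_accuracy))
```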



This snippet contains the core code for applying any machine learning algorithm
using scikit-learn. 

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
 iris_dataset['data'], iris_dataset['target'], random_state=0)
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))

Test set score: 0.97
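
To illustrate why this pattern is described as the core code, the sketch below runs the same split, fit, and score steps with two different estimators. The choice of DecisionTreeClassifier as the second model is an assumption made for this example and is not part of the original walkthrough; the dictionary `scores` is likewise introduced here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

# Every scikit-learn classifier offers the same fit and score methods,
# so only the line that constructs the model changes.
models = [
    ("1-nearest neighbor", KNeighborsClassifier(n_neighbors=1)),
    ("decision tree", DecisionTreeClassifier(random_state=0)),
]

scores = {}
for name, model in models:
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print("{} test set score: {:.2f}".format(name, scores[name]))
```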


# Reference

Müller, A. C. and Guido, S. (2016). **Introduction to Machine Learning with Python**. O’Reilly Media, Inc.

