# Classifying Iris Species
You're a botanist who is interested in distinguishing iris species. The measurements taken for each flower are the length and width of the petals and the length and width of the sepals.

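- Goal: Build a machine learning model that learns from the measurements of irises whose species is known, so that it can predict the species of a new iris.
- This is an example of <em>supervised learning</em>, because the correct species is known for every training sample.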

## Meet the Data

In [33]:
from sklearn.datasets import load_iris
iris_dataset = load_iris()

In [34]:
print("Keys: \n{}".format(iris_dataset.keys()))
print(iris_dataset['DESCR'][:193] + "\n")

Keys: 
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, pre



In [35]:
print("Target names: {}".format(iris_dataset['target_names']))

Target names: ['setosa' 'versicolor' 'virginica']


In [36]:
print("Feature names: \n{}".format(iris_dataset['feature_names']))

Feature names: 
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [37]:
print("Type of data: {}".format(type(iris_dataset['data'])))
print("Shape of data: {}".format(iris_dataset['data'].shape))

Type of data: <class 'numpy.ndarray'>
Shape of data: (150, 4)


In [38]:
print("First five rows of data: \n{}".format(iris_dataset['data'][:5]))

First five rows of data: 
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
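
As a quick sanity check of the dataset description (50 samples in each of three classes), the integer labels in `target` can be counted and mapped back to species names. This is a minimal sketch; the variable names `counts` and `first_species` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# target holds one integer label (0, 1, or 2) per sample
counts = np.bincount(iris_dataset['target'])
print("Samples per class: {}".format(counts))

# Indexing target_names with integer labels recovers the species names
first_species = iris_dataset['target_names'][iris_dataset['target'][:5]]
print("Species of first five rows: {}".format(first_species))
```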


## Training and Testing

- A common rule of thumb is to use 75% of the data for training and 25% for testing.
- scikit-learn's train_test_split() function shuffles the data and applies this 75/25 split by default.

In [39]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris_dataset['data'], iris_dataset['target'], random_state=0)

In [40]:
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))
print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))

X_train shape: (112, 4)
y_train shape: (112,)
X_test shape: (38, 4)
y_test shape: (38,)
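
The shapes above follow from the default 75/25 split. The sketch below spells out the proportion with `test_size=0.25` and checks that it gives the same split as the default call. It also shows `stratify=y`, an optional variation not used in this notebook, which keeps the class proportions similar in both parts. The names ending in `_b` and `y_test_strat` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X, y = iris_dataset['data'], iris_dataset['target']

# Default call: test_size is left unset, which scikit-learn treats as 0.25
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same split with the proportion spelled out
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.25, random_state=0)
print("Identical splits: {}".format(np.array_equal(X_test, X_test_b)))

# Optional variation (an assumption, not in the original): stratified split
_, _, _, y_test_strat = train_test_split(X, y, random_state=0, stratify=y)
print("Test class counts (stratified): {}".format(np.bincount(y_test_strat)))
```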


## Data Visualization
- Inspect the data to see if the task is solvable.
- Inspect the data for any abnormalities or peculiarities.

In [41]:
import pandas as pd

In [42]:
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)

In [43]:
import mglearn as mg

ModuleNotFoundError: No module named 'mglearn'

In [None]:
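# Left commented out because mglearn is not installed; scatter_matrix now lives in pd.plotting
# plot_grid = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o', hist_kwds={'bins': 20}, s=60, alpha=0.8, cmap=mg.cm3)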
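
Since mglearn is not available here, the pair plot can be sketched with pandas and matplotlib alone. This version assumes the built-in `'viridis'` colormap as a stand-in for mglearn's `cm3`, so the colors will differ from the intended figure. The `matplotlib.use("Agg")` line is only needed outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset['feature_names'])

# 'viridis' replaces mglearn's cm3 colormap (an assumption, not the original)
plot_grid = pd.plotting.scatter_matrix(
    iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
    hist_kwds={'bins': 20}, s=60, alpha=0.8, cmap='viridis')

# One subplot per pair of features
print(plot_grid.shape)
```

Each off-diagonal panel plots one feature against another, colored by species, and the diagonal shows a histogram of each feature.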

## Building the Model: _k_-Nearest Neighbors

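- There are many classification algorithms; this example uses _k_-nearest neighbors (KNN).
- To make a prediction for a new data point, the algorithm finds the point in the training set that is closest to the new point and assigns that point's label to it. With _k_ > 1, it uses the majority label among the _k_ closest points.
- All machine learning models in scikit-learn are implemented in their own classes, which are called Estimator classes.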
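
The nearest-neighbor rule described above can be sketched directly in NumPy. This is an illustration of the idea for _k_ = 1 using Euclidean distance, not scikit-learn's actual implementation; the helper `predict_1nn` is a hypothetical name, and its tie-breaking may differ from the library's.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

def predict_1nn(X_train, y_train, X_query):
    """Label each query row with the label of its closest training row."""
    # Pairwise differences via broadcasting: shape (n_query, n_train, n_features)
    diffs = X_query[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    # Squared Euclidean distances: shape (n_query, n_train)
    dists = (diffs ** 2).sum(axis=2)
    # Index of the closest training point for each query point
    nearest = dists.argmin(axis=1)
    return y_train[nearest]

manual_pred = predict_1nn(X_train, y_train, X_test)
print("Manual 1-NN accuracy: {:.2f}".format(np.mean(manual_pred == y_test)))
```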

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)

In [None]:
# Fitting the model with the training data
knn.fit(X_train, y_train)
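
A few conventions of the Estimator interface are visible on the fitted object. The sketch below rebuilds the same classifier and inspects it; `fitted` is an illustrative name, and the printed attributes are only a small sample of what the class exposes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

knn = KNeighborsClassifier(n_neighbors=1)
fitted = knn.fit(X_train, y_train)

# fit returns the estimator itself, so calls can be chained
print("fit returns self: {}".format(fitted is knn))
# Attributes learned from the data end with an underscore
print("Classes seen during fit: {}".format(knn.classes_))
# Constructor arguments are available through get_params
print("n_neighbors: {}".format(knn.get_params()['n_neighbors']))
```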

## Making Predictions
What if we found an iris in the wild with:
- Sepal length of 5 cm,
- Sepal width of 2.9 cm,
- Petal length of 1 cm,
- Petal width of 0.2 cm.

In [None]:
import numpy as np
X_new = np.array([[5, 2.9, 1, 0.2]])
print(X_new.shape)

(1, 4)


In [None]:
prediction = knn.predict(X_new)
print("Prediction: {}".format(prediction))
print("Predicted target name: {}".format(iris_dataset['target_names'][prediction]))

Prediction: [0]
Predicted target name: ['setosa']
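
To see why the model arrives at a given prediction, the fitted classifier's `kneighbors` method reports which training sample was closest to the new point. The sketch below also builds `X_new` from a flat array with `reshape(1, -1)`, since scikit-learn expects a 2-D array with one row per sample. The name `nearest_idx` is illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# One row per sample, one column per feature
X_new = np.array([5, 2.9, 1, 0.2]).reshape(1, -1)

# kneighbors returns distances to, and training-set indices of, the neighbors
distances, indices = knn.kneighbors(X_new)
nearest_idx = indices[0, 0]
print("Nearest training sample: {}".format(X_train[nearest_idx]))
print("Distance: {:.3f}".format(distances[0, 0]))
print("Its species: {}".format(iris_dataset['target_names'][y_train[nearest_idx]]))
```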


## Model Evaluation

In [None]:
y_pred = knn.predict(X_test)
print("Test set predictions:\n{}".format(y_pred))

Test set predictions:
[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 2]


In [None]:
print("Test set score: {:.2f}".format(np.mean(y_pred == y_test)))

Test set score: 0.97


In [None]:
# As an alternative to np.mean
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))

Test set score: 0.97
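
Accuracy is a single number, so it hides which species get confused with each other. As a hedged extension of the evaluation above, the sketch below uses `accuracy_score` and `confusion_matrix` from `sklearn.metrics` (neither is used in the original notebook) to compute the same accuracy and break the errors down by class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Three equivalent ways to compute the fraction of correct predictions
acc_manual = np.mean(y_pred == y_test)
acc_metric = accuracy_score(y_test, y_pred)
acc_score = knn.score(X_test, y_test)
print("Accuracy: {:.2f}".format(acc_metric))

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

The diagonal of the matrix counts correct predictions, so its sum divided by the total number of test samples equals the accuracy.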
