# Our First Machine Learning Application
In this application, we will classify flowers from the iris dataset. The dataset contains labelled measurements of iris flowers, and the task is to predict the species of a flower from those measurements: the length and width of the sepals and the length and width of the petals, all in centimeters.
The species recorded in the dataset are **setosa, versicolor, and virginica**.

<img src="./iris_image_2.png">

Because we have measurements for which we know the correct species of iris, this is a **supervised learning problem**. We want to predict one of several options (the species of iris), which makes it a **classification problem**. The possible outputs are called classes. Every iris in the dataset belongs to exactly one of three classes, so this is a three-class problem.

# Data Collection and PreProcessing
We are not going to focus on data collection in our first example, so we will use the copy of the dataset that ships with the scikit-learn library. It is already collected, cleaned, and structured, so we can focus on other things, such as which algorithm to use.

In [1]:
from sklearn.datasets import load_iris
iris_dataset = load_iris()

The dataset returned by the load_iris function is a *Bunch* object, which is very similar to a dictionary. It contains keys and values.
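As a quick illustration of this dictionary-like behaviour, the sketch below reads the same field in two ways: by key and by attribute. Attribute access is a convenience that scikit-learn's Bunch offers on top of key access. The variable names `data_by_key`, `data_by_attr`, and `same_object` are only illustrative.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Dictionary-style access, as used throughout this tutorial
data_by_key = iris_dataset['data']

# Attribute-style access, which Bunch objects also allow
data_by_attr = iris_dataset.data

# Both expressions refer to the same underlying array
same_object = data_by_key is data_by_attr
print(same_object)

# Keys can be iterated like those of a regular dictionary
for key in iris_dataset.keys():
    print(key)
```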

# Exploring the DataSet
Here we will explore the dataset to see what it contains and how we can use and process it to our advantage. This step is very important as it gives us crucial insights on how to classify this dataset.

In [2]:
print("Keys of iris_dataset:\n{}".format(iris_dataset.keys()))

Keys of iris_dataset:
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])


The 'DESCR' key contains a short description of the dataset; its first 200 characters are displayed below. Feel free to print the rest of it.

In [3]:
print(iris_dataset['DESCR'][:200])

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive


The value of the key 'target_names' is an array of strings, containing the species of the flowers that we want to predict.

In [4]:
print(iris_dataset['target_names'])

['setosa' 'versicolor' 'virginica']


The value of 'feature_names' is a list of strings, giving the description of each feature:

In [5]:
print("Feature Names: \n{}".format(iris_dataset['feature_names']))

Feature Names: 
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


The data itself is contained in the 'data' and 'target' fields. 'data' contains the numeric measurements of sepal length, sepal width, petal length, and petal width in a NumPy array:

In [6]:
print("Type of data: {}".format(type(iris_dataset['data'])))

Type of data: <class 'numpy.ndarray'>


The rows in the data array correspond to the flowers that were measured, while the columns represent the four measurements that were taken for each flower:

In [7]:
print("Shape of data: {}".format(iris_dataset['data'].shape))

Shape of data: (150, 4)


This means that the dataset contains 150 flower examples (called *samples* in machine learning), each with 4 measurements (called *features*).

Here are the feature values for the first 5 samples:

In [8]:
print("First Five Samples: \n{}".format(iris_dataset['data'][:5]))

First Five Samples: 
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]


The target array contains the species of each of the flowers that were measured, also as a NumPy array:

In [9]:
print("Type of Target: {}".format(type(iris_dataset['target'])))

Type of Target: <class 'numpy.ndarray'>


In [10]:
print("Shape of Target: {}".format(iris_dataset['target'].shape))

Shape of Target: (150,)


The species are encoded as integers from 0 to 2:

0. setosa
1. versicolor
2. virginica

In [11]:
print("Target: \n{}".format(iris_dataset['target']))

Target: 
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
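Because the labels are stored as integers, it is often handy to turn them back into species names. The following is a minimal sketch of one way to do that with NumPy indexing, plus a count of samples per class. The variable names `species` and `counts` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()
target = iris_dataset['target']
target_names = iris_dataset['target_names']

# Index the array of names with the integer codes to decode every label
species = target_names[target]
print(species[:3])

# Count how many samples fall into each class
counts = np.bincount(target)
for name, count in zip(target_names, counts):
    print("{}: {}".format(name, count))
```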


# Training and Testing
We also need a way to measure how much our machine learning model has actually learned. It's just like sitting an exam: we cannot test the model on the same examples we used to teach it, because it could simply have memorized them. Therefore we split the dataset into two parts, called the **training dataset and testing dataset**. We train our model on the training dataset and then test its performance on the testing dataset.
scikit-learn comes with a function that shuffles the dataset and splits it into training and testing datasets. This function is called **train_test_split**, and by default it splits in the ratio **0.75:0.25**. That is, 75% of the data goes to the training set and 25% goes to the testing set. This ratio can be changed to suit your needs through the test_size parameter.

In [12]:
from sklearn.model_selection import train_test_split

In [13]:
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=21)

In the above example, X_train and y_train are the data and target labels of the training set; similarly, X_test and y_test are the data and target labels of the testing set. The random_state parameter sets the random seed, so that every time you use a particular random_state you get the same split of examples.
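To make the split sizes and the effect of random_state concrete, here is a small sketch. It checks the shapes produced by the default split, tries a different ratio through the `test_size` parameter (the 0.2 value is just an example), and confirms that repeating the call with the same random_state reproduces the split.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X, y = iris_dataset['data'], iris_dataset['target']

# Default split: 75% of the samples for training, 25% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=21)
print("Default split:", X_train.shape, X_test.shape)

# Example of a custom ratio: hold out 20% of the samples for testing
X_train_80, X_test_20, y_train_80, y_test_20 = train_test_split(
    X, y, test_size=0.2, random_state=21)
print("80/20 split:", X_train_80.shape, X_test_20.shape)

# Calling the function again with the same random_state gives the same split
X_train_again, _, _, _ = train_test_split(X, y, random_state=21)
same_split = (X_train == X_train_again).all()
print("Reproducible:", same_split)
```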

# Having a look at the data
It is a good idea to look at the data before building a model, as this gives important insights, and one of the best ways to do that is to visualize it. One way is a scatter plot. Since a single scatter plot can show only two features at a time, we use a scatter matrix (also called a pair plot), which plots every feature against every other feature, two at a time. Have a look:

In [14]:
import pandas as pd
import mglearn
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)
grr = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15,15),marker='o',hist_kwds={'bins':20},
                       s=60,alpha=0.8,cmap = mglearn.cm3)

Don't worry if the parameters scare you a bit. For now, just think of this as a piece of magic code that creates the plot. You will understand everything in no time.
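Alongside the plot, a simple numeric summary can also hint at which features separate the species. The following is a minimal sketch, assuming the same split as above (random_state=21), that computes the mean of each feature per species on the training set. The name `class_means` is illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=21)

# One row per species, one column per feature
class_means = np.array([
    X_train[y_train == label].mean(axis=0)
    for label in range(len(iris_dataset['target_names']))
])

print(iris_dataset['feature_names'])
for name, means in zip(iris_dataset['target_names'], class_means):
    print(name, np.round(means, 2))
```

A large gap between the rows in a given column suggests that the feature helps tell the species apart.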

# Building your first model: k-Nearest Neighbors
Now we are going to use the k-nearest neighbors classifier, also called the kNN classifier, to classify our dataset, but first we have to train it. Training the kNN algorithm is really easy to understand, as the algorithm only stores the training set. To make a prediction for a new data point, the algorithm finds the point in the training set that is closest to the new point and assigns the label of that training point to the new data point.
The **k** in **k-nearest neighbors** signifies that instead of using only the single closest neighbor, we can use the k nearest neighbors and make the prediction based on the majority label among them. This is the only parameter we need to set for this model here.
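The idea described above can be sketched in a few lines of NumPy. This is a simplified illustration, not how scikit-learn implements it: the function `knn_predict` is hypothetical, it assumes Euclidean distance, and it breaks voting ties by picking the smallest label.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


def knn_predict(X_train, y_train, X_new, k=3):
    """Predict a label for each row of X_new by majority vote among
    its k nearest training points (Euclidean distance)."""
    predictions = []
    for point in X_new:
        # Distance from the new point to every stored training point
        distances = np.sqrt(((X_train - point) ** 2).sum(axis=1))
        # Indices of the k closest training points
        nearest = np.argsort(distances)[:k]
        # Majority vote; on a tie, argmax picks the smallest label
        votes = np.bincount(y_train[nearest])
        predictions.append(votes.argmax())
    return np.array(predictions)


iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=21)

y_pred_sketch = knn_predict(X_train, y_train, X_test, k=3)
accuracy_sketch = np.mean(y_pred_sketch == y_test)
print("Accuracy of the sketch: {:.3f}".format(accuracy_sketch))
```

"Training" in this sketch is nothing more than keeping X_train and y_train around; all the work happens at prediction time.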

In [15]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3) #Setting the no of neighbors to 3

In [16]:
#Training the model
knn.fit(X_train,y_train) #This fit method trains our model on the given dataset

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=3, p=2,
           weights='uniform')
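Once the model is trained, it can label flowers it has never seen. Below is a minimal sketch that predicts the species of a single made-up flower. The measurements in `X_new` are hypothetical values chosen for illustration, not taken from the dataset. Note that predict expects a two-dimensional array, even for a single sample.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=21)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Hypothetical flower (cm): sepal length, sepal width, petal length, petal width
X_new = np.array([[5.0, 2.9, 1.0, 0.2]])

prediction = knn.predict(X_new)
predicted_name = iris_dataset['target_names'][prediction][0]
print("Predicted label: {}".format(prediction))
print("Predicted species: {}".format(predicted_name))
```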

# Evaluating the model on the test dataset:
Now we will evaluate the performance of the model on the test dataset. We will measure the **accuracy** of the model, which is the fraction of flowers for which the right species was predicted:

In [17]:
y_pred = knn.predict(X_test)
print(y_pred) # prints the predicted labels

[1 0 0 0 1 1 0 2 0 0 1 1 2 2 0 1 2 1 0 2 2 1 2 1 0 1 0 0 1 2 0 2 2 0 2 1 1
 2]


In [18]:
knn.score(X_test, y_test)  # returns the testing accuracy, which the notebook displays

0.9473684210526315
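Since accuracy is just the fraction of correct predictions, it can also be computed by hand. The sketch below compares predictions with the true labels and checks that the result agrees with knn.score. The names `manual_accuracy`, `score_accuracy`, and `n_correct` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=21)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Fraction of test flowers whose predicted species matches the true species
manual_accuracy = np.mean(y_pred == y_test)
score_accuracy = knn.score(X_test, y_test)

n_correct = int((y_pred == y_test).sum())
print("Correct predictions: {} of {}".format(n_correct, len(y_test)))
print("Manual accuracy: {:.4f}".format(manual_accuracy))
print("knn.score:       {:.4f}".format(score_accuracy))
```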

### So our testing accuracy is approximately 95% (94.737% when rounded to three decimal places). Congratulations on building your first model!