# Building a Classification Model for the Iris data set
Notebook adapted from:
Chanin Nantasenamat

<i>Data Professor YouTube channel, http://youtube.com/dataprofessor </i>





In this Jupyter notebook, we will be building a classification model for the Iris data set using the random forest algorithm. 

You can follow the video tutorial: https://www.youtube.com/watch?v=XmSlFPDjKdc

[![Machine Learning in Python: Building a Classification Model](http://img.youtube.com/vi/XmSlFPDjKdc/0.jpg)](http://www.youtube.com/watch?v=XmSlFPDjKdc "Machine Learning in Python: Building a Classification Model")

## 1. Import libraries

In [2]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

## 2. Load the *iris* data set

In [3]:
iris = datasets.load_iris()

## 3. Input features
The ***iris*** data set contains 4 input features and 1 output variable (the class label).
![picture](https://sebastianraschka.com/images/blog/2015/principal_component_analysis_files/iris.png)

(image from: https://sebastianraschka.com/Articles/2015_pca_in_3_steps.html)


### 3.1. Input features

In [4]:
print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


### 3.2. Output features

There are three species of iris flowers in this dataset.

In [5]:
print(iris.target_names)

['setosa' 'versicolor' 'virginica']


## 4. Glimpse of the data

### 4.1. Input features

We first look at how the dataset is composed.
Each row is one flower, described by 4 measurements (features): ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

In [6]:
iris.data

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

### 4.2. Output variable (the Class label)

Each row of measurements above belongs to one of the three species of iris flowers (0 = 'setosa', 1 = 'versicolor', 2 = 'virginica').

In [8]:
iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
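
As a quick check on the label array above, the sketch below counts how many samples carry each integer label and pairs the counts with the species names. The variable names `counts` and `class_counts` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Count how many samples carry each integer label (0, 1, 2).
counts = np.bincount(iris.target)

# Pair each count with the species name at the same index.
class_counts = {str(name): int(n) for name, n in zip(iris.target_names, counts)}
print(class_counts)
```

The labels are stored in sorted order (all 0s, then all 1s, then all 2s), which matters later when the data is split into training and test sets.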

### 4.3. Assigning *input* and *output* variables
Let's assign the 4 input variables to X and the output variable (class label) to Y

In [9]:
X = iris.data
Y = iris.target

### 4.4. Let's examine the data dimension

In [10]:
X.shape

(150, 4)

In [11]:
Y.shape

(150,)

## 5. Build Classification Model using Random Forest

Random forest is the classification method used in this notebook.

More specifically: random forests (or random decision forests) are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of the random forest is the class selected by most trees. https://en.wikipedia.org/wiki/Random_forest

After creating the classifier, we fit it to the data so that it can later predict the class of unseen samples.

In [13]:
clf = RandomForestClassifier()

In [25]:
clf.fit(X, Y)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
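
To make the idea of "many trees deciding together" concrete, here is a minimal sketch that inspects the individual trees of a fitted forest. It assumes a fixed `random_state` purely for reproducibility; the names `tree_votes`, `vote_counts` and `mean_proba` are illustrative. Note that scikit-learn combines the trees by averaging their predicted class probabilities rather than by counting hard votes, so the sketch shows both views.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

iris = datasets.load_iris()
X, Y = iris.data, iris.target

# random_state is fixed only to make this sketch reproducible (an assumption).
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, Y)

sample = X[[0]]  # first flower, kept 2-D as predict() expects

# Hard votes: ask every tree in clf.estimators_ for its own prediction.
tree_votes = np.array([tree.predict(sample)[0] for tree in clf.estimators_]).astype(int)
vote_counts = np.bincount(tree_votes, minlength=len(clf.classes_))
print("votes per class:", vote_counts)

# Soft votes: average the per-tree class probabilities, then take the largest.
mean_proba = np.mean([tree.predict_proba(sample) for tree in clf.estimators_], axis=0)
forest_label = clf.classes_[np.argmax(mean_proba, axis=1)]
print("averaged probabilities:", mean_proba)
print("forest prediction:", forest_label)
```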

## 6. Feature Importance

We can show how much each feature contributes to the fitted model. The importances are fractions that sum to 1, listed in the same order as iris.feature_names.

In [15]:
print(clf.feature_importances_)

[0.11514125 0.01836092 0.41335839 0.45313944]
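
The raw array is easier to read when each value is paired with its feature name. The following is a small sketch, again assuming a fixed `random_state` so the numbers are reproducible; the exact importances differ from run to run and from the output shown above.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

iris = datasets.load_iris()

# random_state is fixed only for reproducibility (an assumption of this sketch).
clf = RandomForestClassifier(random_state=0)
clf.fit(iris.data, iris.target)

# Pair each feature name with its importance and sort from most to least important.
ranked = sorted(
    zip(iris.feature_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.1%}")
```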


## 7. Make Prediction

Once we have a fitted model, we can pass a known or unknown data point (a set of measurements) to it to get a prediction.

Before we look at unknown data, we perform a check on a sample from the given dataset (X).

In [18]:
X[0]

array([5.1, 3.5, 1.4, 0.2])

In [19]:
print(clf.predict([[5.1, 3.5, 1.4, 0.2]]))

[0]


In [22]:
print(clf.predict(X[[0]]))

[0]


We can also check how confident the model is in this outcome by printing the predicted probability for each class.

In [23]:
print(clf.predict_proba(X[[0]]))

[[1. 0. 0.]]
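
The columns of `predict_proba` follow the order of `clf.classes_`. A minimal sketch that labels each probability with its species name, assuming the classifier was fitted on the integer labels and using a fixed `random_state` for reproducibility:

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

iris = datasets.load_iris()
X, Y = iris.data, iris.target

clf = RandomForestClassifier(random_state=0)  # fixed seed is an assumption
clf.fit(X, Y)

# One row of probabilities for the first sample; columns follow clf.classes_.
proba = clf.predict_proba(X[[0]])[0]

for label, p in zip(clf.classes_, proba):
    # The integer label doubles as an index into iris.target_names.
    print(f"{iris.target_names[label]}: {p:.2f}")
```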


To get the class name instead of the class number, refit the classifier on the species names. Indexing iris.target_names with the integer labels converts each label into its name.

Note: you can also type clf.fit(X, iris.target_names[Y]), as X is defined as iris.data and Y is defined as iris.target.

In [32]:
clf.fit(iris.data, iris.target_names[iris.target])

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [33]:
print(clf.predict(X[[0]]))

['setosa']
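
Refitting is not the only way to get names. As an alternative sketch (not the approach used in this notebook), a classifier fitted on the integer labels can have its predictions translated afterwards by indexing `iris.target_names`; the variable names below are illustrative.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

iris = datasets.load_iris()
X, Y = iris.data, iris.target

clf = RandomForestClassifier(random_state=0)  # fixed seed is an assumption
clf.fit(X, Y)  # fitted on integer labels 0, 1, 2

# predict() returns integer labels; use them as indices into the name array.
predicted_labels = clf.predict(X[[0]])
predicted_names = iris.target_names[predicted_labels]
print(predicted_labels, predicted_names)
```

This keeps the model on integer labels, which is convenient when later steps expect numeric classes.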


## 8. Data split (80/20 ratio)

So far the model was trained and checked on the same samples, which overstates how well it performs. To get an honest estimate, the data should be separated: one part to train the model on and another part to validate/test the model performance on.

It is also wise to randomize the data before splitting. A test set taken from ordered data can be biased, and this dataset is sorted by flower type (as you might have noticed). train_test_split already shuffles the data by default; passing shuffle=True only makes this explicit.

see https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [34]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

In [35]:
X_train.shape, Y_train.shape

((120, 4), (120,))

In [36]:
X_test.shape, Y_test.shape

((30, 4), (30,))
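
Because the split is random, the shapes stay the same but the selected samples change on every run. The sketch below shows two optional parameters of `train_test_split` that are not used in this notebook: `random_state` to make the split reproducible and `stratify` to keep the class proportions equal in both parts. The seed value 42 is arbitrary.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, Y = iris.data, iris.target

# random_state makes the split repeatable; stratify=Y keeps the 1:1:1 class ratio.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42, stratify=Y
)

print(X_train.shape, X_test.shape)
print("test samples per class:", np.bincount(Y_test))
```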

## 9. Rebuild the Random Forest Model

Now we fit the model again, this time using only the training set.

In [37]:
clf.fit(X_train, Y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

### 9.1. Perform a prediction on a single sample from the data set

In [38]:
print(clf.predict([[5.1, 3.5, 1.4, 0.2]]))

[0]


In [39]:
print(clf.predict_proba([[5.1, 3.5, 1.4, 0.2]]))

[[1. 0. 0.]]


### 9.2. Perform predictions on the test set

After fitting the model on the training set, we validate it on the test set (data the model was not trained on, so it has never seen these samples).

#### *Predicted class labels*

In [41]:
print(clf.predict(X_test))

[1 1 1 2 2 2 1 2 0 2 0 2 2 0 0 2 2 2 0 1 0 2 0 2 2 2 0 1 1 2]


#### *Actual class labels*

In [42]:
print(Y_test)

[1 1 2 2 2 2 1 2 0 2 0 2 2 0 0 2 2 1 0 1 0 2 0 2 2 2 0 1 2 2]


## 10. Model Performance

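Let's evaluate how accurate the model is by comparing the predicted classes with the actual classes. For a classifier, score returns the mean accuracy: here 27 of the 30 test predictions match the actual labels.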

In [43]:
print(clf.score(X_test, Y_test))

0.9
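
The score can be reproduced by hand from the two label arrays printed in section 9.2. The sketch below copies those arrays verbatim, computes the fraction of matching labels, and adds a confusion matrix (via `sklearn.metrics.confusion_matrix`, which is not used in the original notebook) to show which classes get confused.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Predicted and actual labels, copied from the outputs in section 9.2.
predicted = np.array([1, 1, 1, 2, 2, 2, 1, 2, 0, 2, 0, 2, 2, 0, 0,
                      2, 2, 2, 0, 1, 0, 2, 0, 2, 2, 2, 0, 1, 1, 2])
actual = np.array([1, 1, 2, 2, 2, 2, 1, 2, 0, 2, 0, 2, 2, 0, 0,
                   2, 2, 1, 0, 1, 0, 2, 0, 2, 2, 2, 0, 1, 2, 2])

# Accuracy is the fraction of positions where prediction and truth agree.
accuracy = np.mean(predicted == actual)
print("accuracy:", accuracy)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(actual, predicted)
print(cm)
```

In this particular split every setosa sample is classified correctly; the three errors are all confusions between versicolor and virginica.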
