# IRIS DATA CLASSIFICATION

This notebook classifies the Iris data set, which is available through scikit-learn.

# TABLE OF CONTENTS

1. [IMPORTS FOR NOTEBOOK](#1-IMPORTS-FOR-NOTEBOOK)
2. [LOAD IRIS DATA](#2-LOAD-IRIS-DATA)
3. [CONVERT TO PANDAS DATAFRAME](#3-CONVERT-TO-PANDAS-DATAFRAME)
4. [TRAIN TEST SPLIT](#4-TRAIN-TEST-SPLIT)
5. [IRIS DATA CLASSIFICATION](#5-IRIS-DATA-CLASSIFICATION)
6. [METRICS FOR CLASSIFICATION](#6-METRICS-FOR-CLASSIFICATION)
    1. [ACCURACY SCORE](#6.1-ACCURACY-SCORE)
    2. [F1 - SCORE](#6.2-F1-SCORE)
    3. [LOG LOSS](#6.3-LOG-LOSS)

# 1 - IMPORTS FOR NOTEBOOK

In [42]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# 2 LOAD IRIS DATA

In [43]:
from sklearn.datasets import load_iris
iris_data = load_iris()

# 3 CONVERT TO PANDAS DATAFRAME

### 3.1 - DATA OVERVIEW
<p>Attribute Information:</p>
<ol>
    <li><b>sepal length </b>in cm      </li>
    <li><b>sepal width </b>in cm       </li>
    <li><b>petal length </b>in cm     </li>
    <li><b>petal width </b>in cm      </li>
    <li><b>class:</b></li>
        <ol>
            <li>Iris-Setosa</li>
            <li>Iris-Versicolour</li>
            <li>Iris-Virginica</li>
        </ol>
</ol>
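
The attributes listed above correspond to fields on the object returned by `load_iris`. A minimal sketch, assuming a standard scikit-learn installation, of reading the feature names and class names from that object rather than typing them by hand. The variable names `iris`, `feature_names` and `class_names` are illustrative and are not used elsewhere in this notebook.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Four measurement columns, in the same order as the attribute list
feature_names = list(iris.feature_names)
# Class names, indexed by the integer codes stored in iris.target
class_names = [str(name) for name in iris.target_names]

print(iris.data.shape)
print(feature_names)
print(class_names)
```

The integer code `0` in `iris.target` then corresponds to `class_names[0]`, and so on.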

In [3]:
pd_iris_data = pd.DataFrame(iris_data.data,columns=['SEP-LEN','SEP-WID','PET-LEN','PET-WID'])

In [4]:
pd_iris_data

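| | SEP-LEN | SEP-WID | PET-LEN | PET-WID |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 |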


In [5]:
iris_data.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [6]:
target = pd.DataFrame(iris_data.target,columns=['target'])

In [7]:
target

| | target |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
| ... | ... |
| 145 | 2 |
| 146 | 2 |
| 147 | 2 |
| 148 | 2 |


# 4 TRAIN TEST SPLIT

In [8]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(pd_iris_data, target, test_size=0.33, random_state=2020)

In [9]:
x_test.shape

(50, 4)

In [10]:
y_test.shape

(50, 1)
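
The split above is purely random, so the three classes need not be equally represented among the 50 test rows. A hedged sketch of one alternative, assuming class balance in the test set matters: `train_test_split` accepts a `stratify` argument that keeps class proportions roughly equal in both parts. The `_s` suffixed names and `counts` are illustrative, and this is not what the notebook's own cell does.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same test_size and random_state as the cell above, plus stratify=y
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.33, random_state=2020, stratify=y
)

# Number of test samples in each of the three classes
counts = np.bincount(y_test_s)
print(X_test_s.shape, counts)
```

With stratification the per-class counts in the test set differ by at most one sample.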

# 5 IRIS DATA CLASSIFICATION 

In [11]:
from sklearn.linear_model import LogisticRegression
# Pass the target as a 1-D Series; fitting on the one-column DataFrame
# triggers scikit-learn's DataConversionWarning.
clf = LogisticRegression(random_state=0).fit(x_train, y_train['target'])
y_pred = clf.predict(x_test)


# 6 METRICS FOR CLASSIFICATION

## 6.1 ACCURACY SCORE

In [44]:
from sklearn.metrics import accuracy_score
# Store the result under a separate name so the imported function is not overwritten
accuracy = accuracy_score(y_test, y_pred)
accuracy

0.94

In [45]:
from sklearn.metrics import balanced_accuracy_score
b_accuracy = balanced_accuracy_score(y_test, y_pred)
b_accuracy

0.9375
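
Balanced accuracy is the unweighted mean of the per-class recalls, which is why it can differ from plain accuracy when the classes are not equally frequent in the test set. A minimal sketch on made-up labels (not the notebook's data) that checks this relationship; `y_true`, `y_hat`, `per_class_recall` and `balanced` are illustrative names.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, recall_score

# Made-up labels for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_hat = np.array([0, 0, 0, 1, 1, 1, 2, 2, 1, 0])

# Recall of each class: fraction of that class's samples predicted correctly
per_class_recall = recall_score(y_true, y_hat, average=None)

balanced = balanced_accuracy_score(y_true, y_hat)
print(per_class_recall, balanced)
```

Here the recalls are 0.75, 1.0 and 0.5, so the balanced accuracy is 0.75, while plain accuracy on the same labels is 0.7.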

In [46]:
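# average_precision_score expects continuous scores (probabilities or
# decision values) rather than hard labels, so it is left disabled here
#from sklearn.metrics import average_precision_score
#avg_pscore = average_precision_score(y_test, y_pred)
#avg_pscore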

In [47]:
# this only supports binary classification
#from sklearn.metrics import brier_score_loss
#b_score_loss = brier_score_loss(y_test, y_pred)
#b_score_loss

## 6.2 F1 - SCORE

#### MACRO: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

In [48]:
from sklearn.metrics import f1_score
f1_macro = f1_score(y_test, y_pred, average='macro')
print("f1_macro : {}".format(f1_macro))

f1_macro : 0.9369458128078817


#### MICRO: Calculate metrics globally by counting the total true positives, false negatives and false positives.

In [49]:
f1_micro = f1_score(y_test, y_pred, average='micro')
print("f1_micro : {}".format(f1_micro))

f1_micro : 0.94


#### WEIGHTED: Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.

In [50]:
f1_weighted = f1_score(y_test, y_pred, average='weighted')
print("f1_weighted : {}".format(f1_weighted))

f1_weighted : 0.9394679802955664
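
The three averaging modes above can be reproduced from the per-class F1 scores. A minimal sketch on made-up labels (not the notebook's data), assuming single-label multiclass targets: macro is the plain mean, weighted is the mean weighted by each class's support, and micro coincides with accuracy. All variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_hat = np.array([0, 0, 0, 1, 1, 1, 2, 2, 1, 0])

# One F1 score per class, and the number of true samples per class
per_class_f1 = f1_score(y_true, y_hat, average=None)
support = np.bincount(y_true)

macro_by_hand = per_class_f1.mean()
weighted_by_hand = np.average(per_class_f1, weights=support)
micro = f1_score(y_true, y_hat, average='micro')

print(macro_by_hand, weighted_by_hand, micro)
```

Because every sample has exactly one true and one predicted label, each misclassification counts once as a false positive and once as a false negative, which is why micro F1 equals accuracy in this setting.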


#### SAMPLES: Calculate metrics for each instance, and find their average (only meaningful for multilabel classification where this differs from accuracy_score).

In [51]:
#f1_samples = f1_score(y_test, y_pred, average='samples')
#print("f1_samples : {}".format(f1_samples))
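
The cell above stays commented out because the Iris targets are single-label, and sample-wise averaging only applies to multilabel targets. A hedged sketch on made-up multilabel indicator matrices, unrelated to the Iris data, showing what `average='samples'` computes; `y_true_ml` and `y_pred_ml` are illustrative names.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up multilabel data: rows are samples, columns are labels
y_true_ml = np.array([[1, 0, 1],
                      [0, 1, 0]])
y_pred_ml = np.array([[1, 0, 0],
                      [0, 1, 1]])

# F1 is computed per row, then averaged over the rows
f1_samples = f1_score(y_true_ml, y_pred_ml, average='samples')
print(f1_samples)
```

The first row has precision 1 and recall 0.5, the second has precision 0.5 and recall 1, so each row's F1 is 2/3 and so is their average.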

## 6.3 LOG LOSS

In [52]:
y_test['target'].values

array([2, 0, 1, 1, 1, 2, 2, 1, 0, 0, 2, 2, 0, 2, 2, 0, 1, 1, 2, 0, 0, 2,
       1, 0, 2, 1, 1, 1, 0, 0, 2, 0, 0, 0, 2, 0, 0, 1, 0, 2, 0, 2, 1, 0,
       1, 2, 2, 1, 1, 1])

In [53]:
np.unique(y_test['target'].values)

array([0, 1, 2])

In [54]:
np.unique(y_pred)

array([0, 1, 2])

In [55]:
from sklearn.metrics import log_loss
# log_loss expects class probabilities of shape (n_samples, n_classes);
# passing the hard labels in y_pred raises a ValueError about the number of classes.
y_proba = clf.predict_proba(x_test)
lloss = log_loss(y_test['target'].values, y_proba, labels=[0, 1, 2])
lloss
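
Log loss is defined on predicted probabilities rather than on hard class labels: it is the mean negative log of the probability assigned to each sample's true class. A minimal sketch on made-up probabilities (not the output of the notebook's classifier) comparing a hand computation with `log_loss`; `proba`, `p_true`, `by_hand` and `from_sklearn` are illustrative names.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 1, 2, 2])
# Made-up probabilities; each row sums to 1, columns are classes 0, 1, 2
proba = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.3, 0.6],
    [0.3, 0.3, 0.4],
])

# Probability that each row assigns to its true class
p_true = proba[np.arange(len(y_true)), y_true]
by_hand = -np.mean(np.log(p_true))

from_sklearn = log_loss(y_true, proba, labels=[0, 1, 2])
print(by_hand, from_sklearn)
```

For a fitted classifier, an array of this shape is what `predict_proba` returns, whereas `predict` returns one label per row.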