# Iris Flower Dataset

### Context

The Iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems". It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.

#### Import libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from pandas.plotting import parallel_coordinates
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

### EDA

### Import dataset

In [None]:
dataset = pd.read_csv('../input/iris-flower-dataset/IRIS.csv')

In [None]:
dataset.head(10)

In [None]:
dataset.info()

In [None]:
dataset.shape

In [None]:
dataset.describe()

Let's check whether we have any missing values.

In [None]:
dataset.isnull().sum()

Let's take a quick look at our target variable.

In [None]:
dataset.species.value_counts().plot(kind="bar",color='green')

We have 50 samples in each category (species).

There are many types of plots, such as bar plots, box plots, and scatter plots.
A scatter plot is very useful for analyzing the relationship between two features, one on the x axis and one on the y axis.
The seaborn library provides the pairplot function, which draws scatter plots for all pairs of features at once instead of plotting them individually.

In [None]:
#sns.pairplot(dataset)
sns.pairplot(dataset, hue="species", height = 2, palette = 'colorblind');

Note that some variables seem to be highly correlated, e.g. petal_length and petal_width. In addition, the petal measurements separate the different species better than the sepal ones.

In [None]:
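plt.figure(figsize=(10,11))
# Correlations are only defined for the numeric columns, so drop the species label first
sns.heatmap(dataset.drop(columns='species').corr(), annot=True)
plt.show()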

The heatmap above shows how the features are correlated with each other. Sepal length and sepal width are only weakly correlated with each other.
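As a rough illustration of what each heatmap cell contains, the sketch below computes a Pearson correlation coefficient by hand (Pearson is the default method of `DataFrame.corr()`) and checks it against NumPy. The helper `pearson_r` and the toy measurements are hypothetical and are not taken from the Iris data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    # Covariance term divided by the product of the two spreads
    return (x_centered * y_centered).sum() / np.sqrt(
        (x_centered ** 2).sum() * (y_centered ** 2).sum()
    )

# Hypothetical toy measurements (assumption: made up for illustration only)
length = [1.0, 2.0, 3.0, 4.0, 5.0]
width = [2.1, 3.9, 6.2, 8.1, 9.8]

r_manual = pearson_r(length, width)
r_numpy = np.corrcoef(length, width)[0, 1]
print(round(r_manual, 4), round(r_numpy, 4))
```

A value close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates a weak one.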

Let’s see how our data is distributed based on Sepal Length and Width features using scatterplot.

In [None]:
sns.FacetGrid(dataset,hue="species").map(plt.scatter,"sepal_length","sepal_width").add_legend()
plt.show()

Similarly, here is the scatter plot of the data based on the Petal Length and Width features.


In [None]:
sns.FacetGrid(dataset,hue="species").map(plt.scatter,"petal_length","petal_width").add_legend()
plt.show()

We now have a better overview of our dataset.
Let's conclude this EDA with one of my favorite automatic EDA libraries: Pandas Profiling.

### Pandas Profiling

In [None]:
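# Note: recent releases of this package are published as ydata-profiling (import name ydata_profiling)
from pandas_profiling import ProfileReport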

In [None]:
design_report = ProfileReport(dataset)
design_report.to_file(output_file='report.html')

Another useful visualization tool is the parallel coordinates plot, which represents each row as a line.

In [None]:
parallel_coordinates(dataset, "species", color = ['blue', 'red', 'green']);

As we have seen before, petal measurements can separate species better than the sepal ones.

## Train-Test Split


Now, we can split the dataset into a training set and a test set. Usually, we should also have a validation set, which is used to evaluate the performance of each classifier, fine-tune, and determine the best model. The test set is mainly used for reporting. However, due to the small size of this dataset, we can simplify it by using the test set to serve the purpose of the validation set.
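For comparison, a three-way split into training, validation, and test sets could look like the sketch below. It assumes scikit-learn's bundled copy of the Iris data (`load_iris`) as a stand-in for the CSV used in this notebook, and the `_demo` names and the 60/20/20 proportions are illustrative choices, not part of the original workflow.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the bundled Iris data stands in for the notebook's CSV file
X_demo, y_demo = load_iris(return_X_y=True)

# First hold out 20% of the rows as the final test set
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Then take 25% of the remainder as a validation set (0.25 * 0.8 = 0.2 of all rows)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42
)

print(len(X_tr), len(X_val), len(X_te))  # 90 30 30
```

With only 150 rows, each of the held-out sets contains just 10 samples per species, which is one reason the notebook settles for a single train/test split.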

In [None]:
list(dataset.columns)

In [None]:
X = dataset.drop('species',axis=1)

In [None]:
y = dataset.species

In [None]:
from sklearn.model_selection import train_test_split


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,stratify=y,random_state=42)

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

In [None]:
y_test.value_counts()

In [None]:
y_train.value_counts()

### Build Classifiers

### Decision Tree

In [None]:
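# A shallow tree (max_depth=3) keeps the classification rules easy to read
classifier_dt = DecisionTreeClassifier(max_depth = 3, random_state = 1)
classifier_dt.fit(X_train,y_train)
prediction=classifier_dt.predict(X_test)
# accuracy_score expects the true labels first, then the predictions
print("Accuracy Score : " , accuracy_score(y_test,prediction))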

This decision tree predicts 89.4% of the test data correctly.

We can use the `feature_importances_` attribute to understand the importance of each predictor.

In [None]:
classifier_dt.feature_importances_

From the output and based on the indices of the four features, we know that the first two features (sepal measurements) are of no importance, and only the petal ones are used to build this tree.
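Reading importances by index is error-prone, so it can help to pair each score with its feature name. The sketch below does this under the assumption that scikit-learn's bundled Iris data (`load_iris`) replaces the notebook's CSV and that the tree is fitted on the full data rather than on the training split, so the scores may differ from those above. The names `tree_demo` and `importances` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled Iris data, fitted on all rows (not the notebook's train split)
iris = load_iris()
tree_demo = DecisionTreeClassifier(max_depth=3, random_state=1)
tree_demo.fit(iris.data, iris.target)

# Map each feature name to its importance score
importances = dict(zip(iris.feature_names, tree_demo.feature_importances_))

# Print the features from most to least important
for name, score in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

The scores are normalized to sum to 1, so each value can be read as that feature's share of the total impurity reduction in the tree.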

*We can also visualize the classification rules*

In [None]:
fn = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
cn = ['setosa', 'versicolor', 'virginica']

In [None]:
plt.figure(figsize = (10,8))
plot_tree(classifier_dt, feature_names = fn, class_names = cn, filled = True);

Another way to show the prediction results is through a confusion matrix:

In [None]:
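# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay.from_estimator replaces it
conf_mat = metrics.ConfusionMatrixDisplay.from_estimator(classifier_dt, X_test, y_test,
                                 display_labels=cn,
                                 cmap=plt.cm.Blues,
                                 normalize=None)
conf_mat.ax_.set_title('Decision Tree Confusion matrix, without normalization');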

The classification report shows the precision, recall, and F1-score of each class.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report

# Raw confusion matrix: rows are true classes, columns are predicted classes
confusion = confusion_matrix(y_test, prediction)
print('Confusion Matrix\n')
print(confusion)

print('\nAccuracy: {:.2f}\n'.format(accuracy_score(y_test, prediction)))

# Per-class precision, recall and F1; 'Class 1'..'Class 3' follow the sorted label order
print('\nClassification Report\n')
print(classification_report(y_test, prediction, target_names=['Class 1', 'Class 2', 'Class 3']))


- Precision: the fraction of samples predicted as positive that are actually positive. Formula: TP/(TP+FP).
- Recall: the fraction of all positive samples that the classifier correctly predicted as positive. It is also known as the True Positive Rate (TPR), sensitivity, or probability of detection. Formula: TP/(TP+FN).
- F1-score: combines precision and recall into a single measure. Mathematically, it is the harmonic mean of precision and recall: 2 × precision × recall / (precision + recall).
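In a multiclass problem such as this one, each class is treated as the positive class in turn. The sketch below derives the three metrics directly from a confusion matrix and compares them with scikit-learn's `precision_recall_fscore_support`. The labels are hypothetical and are not the notebook's actual predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Hypothetical labels for a 3-class problem (assumption: made up for illustration)
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 2, 1, 2, 1, 2])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

tp = np.diag(cm)            # correctly predicted samples of each class
fp = cm.sum(axis=0) - tp    # predicted as the class, but actually another class
fn = cm.sum(axis=1) - tp    # belong to the class, but predicted as another class

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Reference values computed by scikit-learn, one entry per class
p_ref, r_ref, f_ref, _ = precision_recall_fscore_support(y_true, y_pred)

print(cm)
print(precision, recall, f1)
```

Note that this hand computation divides by zero if a class is never predicted or never occurs; the toy labels above avoid that case.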