# Iris Flower Dataset


* Dataset: Iris Flower Dataset
1. A simple dataset for learning the basics.
2. Three species of iris flowers.
3. Features: Sepal_length, Sepal_width, Petal_length, Petal_width

### Objective: Classify a new flower as belonging to one of the 3 classes given the 4 features.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Load Iris.csv into a pandas DataFrame.
iris = pd.read_csv("../input/Iris.csv")
iris.head()

In [None]:
#(Q) how many data points and features?
iris.shape

In [None]:
#(Q) what are the column names in our dataset?
iris.columns

In [None]:
#(Q) how many data points are present for each class?
#(or) how many flowers are present for each species?
iris["Species"].value_counts()

* The dataset is balanced, because every class has the same number of data points (50).
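
The balance check above can also be written as a small helper. This is only a minimal sketch: `is_balanced` is a hypothetical name, and the label list is stand-in data that mirrors the reported counts rather than the real column.

```python
from collections import Counter

def is_balanced(labels):
    """Return True when every class occurs the same number of times."""
    counts = Counter(labels)
    return len(set(counts.values())) == 1

# Stand-in labels (an assumption): 50 of each class, as reported above.
labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
print(Counter(labels))
print(is_balanced(labels))
```

With the loaded data, `is_balanced(iris["Species"])` would apply the same check to the real column.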

# Exploratory Data Analysis (EDA)
## Univariate analysis
### Histograms / PDF

In [None]:
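#Histogram of petal length
# `height` was called `size` before seaborn 0.9.
# sns.distplot is deprecated since seaborn 0.11 (sns.histplot is its replacement).
sns.FacetGrid(iris, hue="Species", height=5) \
    .map(sns.distplot, "PetalLengthCm") \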
    .add_legend()
plt.show()

### Observations:
1. Setosa is well separated from versicolor and virginica.
2. There is some overlap between versicolor and virginica.

In [None]:
#Histogram of petal width
sns.FacetGrid(iris, hue="Species", height=5) \
    .map(sns.distplot,"PetalWidthCm") \
    .add_legend()
plt.show()

### Observations:
1. There is a small overlap between setosa and versicolor.
2. There is some overlap between versicolor and virginica.

In [None]:
#Histogram of sepal length
sns.FacetGrid(iris, hue="Species", height=5) \
    .map(sns.distplot,"SepalLengthCm") \
    .add_legend()
plt.show()


### Observations:
1. Setosa itself is not well separated from versicolor and virginica.
2. There is massive overlap between setosa, versicolor and virginica.


In [None]:
#Histogram of sepal width
sns.FacetGrid(iris, hue="Species", height=5) \
    .map(sns.distplot,"SepalWidthCm") \
    .add_legend()
plt.show()

### Observation:
1. It is very difficult to separate setosa, versicolor and virginica using sepal width.

### Conclusions:
1. Petal length is slightly better than petal width at separating the species.
2. If I had to pick one feature, I would pick petal length.
3. If I had to pick two features, I would pick petal length and petal width and draw a pair plot.
4. PL > PW >> SL >> SW
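
The ranking above was read off the histograms by eye. One way to put a number on "how well a single feature separates the classes" is the ratio of between-class variance to within-class variance. The sketch below is an illustration, not part of the original analysis: `separation_score` is a hypothetical helper and the two feature columns are made-up toy values.

```python
from statistics import mean, pvariance

def separation_score(values, labels):
    """Between-class variance divided by within-class variance for one feature.

    Assumes at least one class has some spread (within-class variance > 0).
    """
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    overall = mean(values)
    n = len(values)
    between = sum(len(g) * (mean(g) - overall) ** 2 for g in groups.values()) / n
    within = sum(len(g) * pvariance(g) for g in groups.values()) / n
    return between / within

# Toy data (an assumption): one feature splits the classes cleanly, the other overlaps.
toy_labels = ["a", "a", "a", "b", "b", "b"]
well_separated = [1.0, 1.2, 1.4, 4.0, 4.2, 4.4]
overlapping = [1.0, 3.0, 5.0, 2.0, 4.0, 6.0]
print(separation_score(well_separated, toy_labels))
print(separation_score(overlapping, toy_labels))
```

A larger score means the class means are far apart relative to the spread inside each class, which is what a well-separated histogram looks like.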

## Bivariate Analysis
### 1) 2D Scatter plot




In [None]:
#2D scatter plot:
iris.plot(kind='scatter', x='SepalLengthCm',y='SepalWidthCm')
plt.show()

### Observations:
1. We cannot make much sense of this plot.
2. What if we color the points by class label (flower type)?

In [None]:
#2D scatter plot with color-coding for each flower type/class.
#here "sns" corresponds to seaborn.
sns.set_style("whitegrid")
sns.FacetGrid(iris, hue="Species", height=4) \
    .map(plt.scatter,"SepalLengthCm","SepalWidthCm") \
    .add_legend()
plt.show()

# Notice that the blue points can be easily separated
# from the red and green points by drawing a line,
# but the red and green data points cannot be easily separated.
# Can we draw a 2-D scatter plot for each combination of features?
# How many combinations exist? 4C2 = 6.

### Observations:
1. Using sepal_length and sepal_width features, we can distinguish setosa flowers from others.
2. Separating versicolor from virginica is much harder, as they overlap considerably.
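
The count of feature pairs noted just before these observations (4C2 = 6) can be checked by listing the pairs. A minimal sketch, assuming the four feature columns are named as in the plotting code above:

```python
from itertools import combinations

features = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]

# Every unordered pair of distinct features is one possible 2-D scatter plot.
pairs = list(combinations(features, 2))
for x_name, y_name in pairs:
    print(x_name, "vs", y_name)
print(len(pairs))  # 6
```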

### 2) Pair-Plot

#### Pairwise Scatter Plot: Pair-Plot
#### Disadvantages:
1. It is practical only when the number of features is small (below about 10).
2. It cannot visualize higher-dimensional (3-D and 4-D) patterns.
3. Only 2-D patterns can be viewed.


In [None]:
sns.set_style("whitegrid")
sns.pairplot(iris, hue="Species", height=3, aspect=1)
plt.show()
# The diagonal panels show the PDF of each feature.

### Observations:
1. petal_length and petal_width are the most useful features for identifying the flower types.
2. Setosa can be easily identified (it is linearly separable), while virginica and versicolor have some overlap (they are almost linearly separable).
3. We can find "lines" and "if-else" conditions to build a simple model that classifies the flower types.
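
The "if-else" idea from the last observation could look like the sketch below. Both thresholds are illustrative assumptions chosen for this example, not values derived in this notebook, and `classify_by_rules` is a hypothetical name. The short species names are also an assumption; the labels in the CSV may be spelled differently.

```python
def classify_by_rules(petal_length_cm, petal_width_cm):
    """Toy if-else classifier; both thresholds are illustrative assumptions."""
    if petal_length_cm < 2.5:    # assumed cut: setosa has the shortest petals
        return "setosa"
    if petal_width_cm < 1.75:    # assumed cut between the two overlapping species
        return "versicolor"
    return "virginica"

# Example measurements (made up for illustration).
print(classify_by_rules(1.4, 0.2))
print(classify_by_rules(4.5, 1.5))
print(classify_by_rules(6.0, 2.5))
```

Because versicolor and virginica overlap, any pair of fixed cuts like these will misclassify some flowers near the boundary.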

# Data Preprocessing

In [None]:
iris.head(2) #show the first 2 rows from the dataset


In [None]:
#(Q) are any null values present?
iris.isnull().sum()

Observation: No null values are present in the dataset

In [None]:
#(Q) what are the summary statistics (count, mean, standard deviation, min, quartiles, max) of each feature?
iris.describe()

In [None]:
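# X keeps every column except the label. Caution: a row-identifier column such as "Id",
# if the CSV has one, stays in X and can leak the label when rows are sorted by species.
X = iris.drop(["Species"], axis=1)
y = iris["Species"]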

# Machine Learning

**The given problem is a classification problem, so we will use classification algorithms to build a model.**
1. **Classification:** samples belong to two or more classes, and we want to learn from already labeled data how to predict the class of unlabeled data.

In [None]:
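# The heatmap helps identify highly correlated features.
# Only numeric columns are passed to corr(); the Species strings cannot be correlated.
plt.figure(figsize=(7,4))
sns.heatmap(iris.select_dtypes("number").corr(), annot=True, cmap='cubehelix_r')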
plt.show()

### Observations:
1. The correlation coefficient between petal length and petal width is 0.96.
2. The correlation coefficient between sepal length and petal length is 0.87.
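
Each cell of the heatmap is a Pearson correlation coefficient (the pandas default for `corr()`). As a reminder of what that number measures, here is a minimal pure-Python sketch; `pearson_r` is a hypothetical helper and the inputs are toy values, not the iris measurements.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Toy inputs (assumptions): a perfectly increasing and a perfectly decreasing pair.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson_r([1, 2, 3], [3, 2, 1]))
```

Values near +1 or -1 mean the two features move together almost linearly; values near 0 mean no linear relationship.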


### Using KNN Algorithm

In [None]:
from sklearn import metrics
from sklearn.model_selection import train_test_split

In [None]:
# Split the data into training and testing sets.
# test_size=0.3 splits the data in a 70/30 ratio: train=70% and test=30%.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)
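
To make the 70/30 split concrete, the sketch below shuffles row indices with a fixed seed and cuts them into two parts. It only illustrates the idea: `simple_train_test_split` is a hypothetical helper, and its shuffle will not reproduce the exact rows that scikit-learn selects with `random_state=2`.

```python
import random

def simple_train_test_split(rows, test_size=0.3, seed=2):
    """Shuffle a copy of rows, then cut it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

# 150 row indices, matching the number of flowers reported earlier.
train, test = simple_train_test_split(range(150))
print(len(train), len(test))  # 105 45
```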

In [None]:
#select the algorithm
from sklearn import neighbors
model = neighbors.KNeighborsClassifier(n_neighbors=3)

In [None]:
# we train the algorithm with the training data and the training output
model.fit(X_train,y_train)

In [None]:
#now we pass the testing data to the trained algorithm
predict = model.predict(X_test)

In [None]:
# Now we check the accuracy of the algorithm.
# accuracy_score takes the actual labels first, then the labels predicted by the model.
from sklearn.metrics import accuracy_score # for checking the model accuracy
print('The accuracy of the KNN is', accuracy_score(y_test, predict))

### Observation:
KNN gives 100% accuracy. We will go on to check the accuracy of other models.
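
For intuition about what the KNN classifier does, here is a minimal from-scratch sketch: find the `k` training points closest to the query and take a majority vote. It is an illustration of the idea, not how scikit-learn implements it; `knn_predict` is a hypothetical name and the points are made-up values.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),  # Euclidean distance
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up (petal length, petal width) points, for illustration only.
points = [(1.4, 0.2), (1.5, 0.3), (1.3, 0.2), (4.5, 1.5), (4.7, 1.4), (4.4, 1.3)]
labels = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "versicolor"]
print(knn_predict(points, labels, (1.6, 0.25)))
print(knn_predict(points, labels, (4.6, 1.45)))
```

With `k=3`, as in the model above, each prediction depends only on the three closest training flowers.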

### Support Vector Machine (SVM)

In [None]:
#importing svm from sklearn library

from sklearn import svm
svc = svm.SVC(C=1.0, kernel='rbf') #select the algorithm

In [None]:
# we train the algorithm with the training data and the training output
svc.fit(X_train,y_train)

In [None]:
#now we pass the testing data to the trained algorithm
pred = svc.predict(X_test)

In [None]:
# Now we check the accuracy of the algorithm.
# We pass the actual labels first, then the labels predicted by the model.
print('The accuracy of the SVM is:', metrics.accuracy_score(y_test, pred))

### Observation:
SVM gives 100% accuracy.

### Decision Tree

In [None]:
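from sklearn.tree import DecisionTreeClassifier # for the Decision Tree algorithm
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
# X holds all feature columns, so the message does not single out the petals.
print('The accuracy of the Decision Tree is:', metrics.accuracy_score(y_test, prediction))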


### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression # for Logistic Regression algorithm
model = LogisticRegression()
model.fit(X_train,y_train) 
prediction=model.predict(X_test) 
print('The accuracy of the Logistic Regression is:', metrics.accuracy_score(y_test, prediction))



### Observation:
Logistic regression gives 86% accuracy.

### Conclusions:
1. Using the petal features rather than the sepal features for training gives much better accuracy.
2. This was expected: the heatmap above shows that the correlation between sepal width and sepal length is very low, whereas the correlation between petal width and petal length is very high.
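
The models above were fit on every column of `X`, so the petal-versus-sepal comparison is not run directly in this notebook. A minimal sketch of how it could be checked follows. It makes several assumptions: it uses the copy of the dataset bundled with scikit-learn instead of `Iris.csv`, 5-fold cross-validation instead of the single split above, and the variable names are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

data = load_iris()
X_all, y_all = data.data, data.target

# Column order in the bundled dataset:
# sepal length, sepal width, petal length, petal width.
sepal_only = X_all[:, :2]
petal_only = X_all[:, 2:]

knn = KNeighborsClassifier(n_neighbors=3)
sepal_acc = cross_val_score(knn, sepal_only, y_all, cv=5).mean()
petal_acc = cross_val_score(knn, petal_only, y_all, cv=5).mean()
print("sepal only:", round(sepal_acc, 3))
print("petal only:", round(petal_acc, 3))
```

Comparing the two printed means shows how much of the accuracy comes from the petal measurements alone.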