# Iris Dataset

**Context**

The Iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems". It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphological variation of Iris flowers of three related species. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.

This dataset became a typical test case for many statistical classification techniques in machine learning, such as support vector machines.

**Content**

The dataset contains 150 records with 5 attributes: Petal Length, Petal Width, Sepal Length, Sepal Width, and Class (Species).
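
As a quick sanity check of those figures, here is a minimal sketch that uses the copy of Iris bundled with scikit-learn (`load_iris`) instead of the CSV file read later in this notebook. That substitution is an assumption, and the bundled column names differ from the CSV's.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris copy stands in for IRIS.csv.
bundled = load_iris(as_frame=True)
frame = bundled.frame  # four feature columns plus an integer 'target' column

n_rows, n_cols = frame.shape
samples_per_class = frame['target'].value_counts().sort_index().tolist()

print(n_rows, n_cols)        # number of records and attributes
print(samples_per_class)     # samples for each of the three species
```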

In [None]:
# required libraries

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
import numpy as np
np.seterr(divide='ignore', invalid='ignore')


In [None]:
# read the dataset
iris_data = pd.read_csv('../input/iris-flower-dataset/IRIS.csv')
iris_data.head()

In [None]:
iris_data.tail()

In [None]:
iris_data.columns

**Fixing the headings**

In [None]:
iris_data.columns = iris_data.columns.str.title()

In [None]:
iris_data.columns
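
To see what `str.title()` does to column names, here is a small sketch on a throwaway frame. The column names below are hypothetical and chosen only for illustration; note that a letter following an underscore is capitalized too.

```python
import pandas as pd

# Hypothetical column names, used only to illustrate str.title().
demo = pd.DataFrame(columns=['sepal_length', 'petal width', 'species'])

# Capitalize the first letter of every word in each column name.
demo.columns = demo.columns.str.title()
titled = list(demo.columns)
print(titled)
```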

**Label encode the target variable**

In [None]:
encode = LabelEncoder()
iris_data.Species = encode.fit_transform(iris_data.Species)


In [None]:
iris_data.head()
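
The label encoding step can be inspected on its own. The sketch below assumes the species labels have the form `Iris-<species>`, which is an assumption about the CSV, and shows that `LabelEncoder` assigns integer codes in sorted order and can map them back.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels in an assumed 'Iris-<species>' form.
labels = ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor', 'Iris-setosa']

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ is sorted, so each integer code is an index into this array.
mapping = {name: int(code) for code, name in enumerate(encoder.classes_)}

# inverse_transform recovers the original string labels.
decoded = list(encoder.inverse_transform(codes))

print(mapping)
print(list(codes))
```

Keeping the fitted encoder around is what makes it possible to turn predicted integers back into species names later.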

## Exploratory Data Analysis

Let's create some simple plots to check out the data!

In [None]:
iris_data.describe()

In [None]:
iris_data.info()

In [None]:
sns.set_style("whitegrid")
sns.pairplot(iris_data, hue="Species")

In [None]:
sns.heatmap(iris_data.corr(), cmap="magma",annot=True)


## Train Test Split
Let's split the data into a training set and a testing set. We will train our model on the training set and then use the test set to evaluate the model.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = iris_data.drop('Species', axis=1)
y = iris_data['Species']

In [None]:
# train-test-split   

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)


In [None]:
print('shape of training data : ', X_train.shape)
print('shape of testing data : ', X_test.shape)
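
The split in this notebook is not seeded, so the numbers change from run to run. A minimal sketch of a reproducible variant follows; it is not what the notebook does. The `random_state` value and the use of `stratify` are my additions, and the bundled Iris copy stands in for `X` and `y`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the bundled Iris copy stands in for the notebook's X and y.
X_demo, y_demo = load_iris(return_X_y=True, as_frame=True)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.40, random_state=42, stratify=y_demo
)

test_class_counts = y_te.value_counts().sort_index().tolist()
print(X_tr.shape, X_te.shape)
print(test_class_counts)
```

With 150 balanced samples and `test_size=0.40`, stratifying gives every species the same share of the test set.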



In [None]:
# create the model object; multi_class='auto' is the default, so it is omitted
# (the parameter is deprecated in recent scikit-learn releases)
model = LogisticRegression(solver='newton-cg')

model.fit(X_train,y_train)



In [None]:
predict = model.predict(X_test)
predict
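
Besides hard class labels, a fitted logistic regression can report per-class probabilities. The sketch below is self-contained and rests on assumptions: it uses the bundled Iris copy, a fixed `random_state`, and a raised `max_iter`, none of which come from the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumption: bundled Iris data and a seeded, stratified 60/40 split.
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.40, random_state=0, stratify=y_demo
)

clf = LogisticRegression(solver='newton-cg', max_iter=1000)
clf.fit(X_tr, y_tr)

# One row per test sample, one column per class; each row sums to 1.
proba = clf.predict_proba(X_te)

# The predicted label is the class with the highest probability.
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]
print(proba[:3].round(3))
```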

## Model Evaluation
Let's evaluate the model using a classification report, a confusion matrix, and the accuracy score on the test set.

In [None]:
from sklearn.metrics import classification_report,confusion_matrix

In [None]:
# Summary of the predictions made by the classifier
print(classification_report(y_test, predict))
print(confusion_matrix(y_test, predict))

# Accuracy score
print('\n\nAccuracy Score on test data : \n\n')
print(accuracy_score(y_test,predict))
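
To make the link between these metrics concrete, here is a small sketch on made-up labels. They are not results from this notebook. The accuracy is the sum of the confusion matrix diagonal divided by the total number of samples.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels, for illustration only.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 0, 1, 2, 2, 2, 1, 1])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# Correct predictions sit on the diagonal.
accuracy_from_cm = np.trace(cm) / cm.sum()

print(cm)
print(accuracy_from_cm, accuracy_score(y_true, y_pred))
```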

## Using KNN
Import KNeighborsClassifier from scikit-learn.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

**Create a KNN model instance with n_neighbors=1**

In [None]:
knn = KNeighborsClassifier(n_neighbors=1)

**Fit this KNN model to the training data.**




In [None]:
knn.fit(X_train,y_train)

In [None]:
pred = knn.predict(X_test)

## Predictions and Evaluations

Let's evaluate our KNN model!

In [None]:
from sklearn.metrics import classification_report,confusion_matrix

In [None]:
print(confusion_matrix(y_test,pred))

In [None]:
print(classification_report(y_test,pred))

## Choosing a K Value

Let's go ahead and use the elbow method to pick a good K Value:

In [None]:
error_rate = []

# Will take some time
for i in range(1,40):
    
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))

In [None]:
plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
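
The loop above scores each K on the test set, so the chosen K is tuned to that particular split. One possible variation, sketched here under assumptions, is to score each K with cross-validation on the training data only. The bundled Iris copy, the fixed `random_state`, and `cv=5` are my choices, not the notebook's.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: bundled Iris data and a seeded, stratified 60/40 split.
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.40, random_state=0, stratify=y_demo
)

k_values = list(range(1, 40))
cv_error = []
for k in k_values:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds of the training data; error = 1 - accuracy.
    scores = cross_val_score(knn_k, X_tr, y_tr, cv=5)
    cv_error.append(1.0 - scores.mean())

# K with the lowest cross-validated error (first one in case of ties).
best_k = k_values[int(np.argmin(cv_error))]
print(best_k, min(cv_error))
```

The test set is then left untouched until the final evaluation.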

Here we can see that beyond roughly K > 1 the error rate tends to hover around 0.05-0.06. Let's retrain the model and check the classification report!

In [None]:
# Retrain with the chosen K (set to 1 here; change k to try another value)
k = 1
knn = KNeighborsClassifier(n_neighbors=k)

knn.fit(X_train,y_train)
pred = knn.predict(X_test)

print(f'WITH K={k}')
print('\n')
print(confusion_matrix(y_test,pred))
print('\n')
print(classification_report(y_test,pred))

## Train a Support Vector Machine (SVM) Model

Now it's time to train a Support Vector Machine classifier.


In [None]:
from sklearn.svm import SVC

In [None]:
svc_model = SVC()

In [None]:
svc_model.fit(X_train,y_train)
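
Since `SVC()` is created without arguments, it helps to know which defaults are in play. A short sketch, assuming a reasonably recent scikit-learn version (older releases used a different default for `gamma`):

```python
from sklearn.svm import SVC

# get_params() returns the constructor arguments, including defaults.
defaults = SVC().get_params()

print(defaults['kernel'], defaults['C'], defaults['gamma'])
```

`C` and `gamma` are the two settings usually tuned for an RBF-kernel SVM.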

## Model Evaluation


In [None]:
predictions = svc_model.predict(X_test)

In [None]:
from sklearn.metrics import classification_report,confusion_matrix

In [None]:
print(confusion_matrix(y_test,predictions))

In [None]:
print(classification_report(y_test,predictions))

Wow! You should have noticed that the model was pretty good! Let's see if we can tune the parameters to do even better. This is unlikely, and in real life you would probably be satisfied with these results because the data set is quite small, but I just want to practice using GridSearch.

## GridSearch Practice

**Import GridSearchCV from scikit-learn.**

In [None]:
from sklearn.model_selection import GridSearchCV

**Create a dictionary called param_grid and fill out some parameters for C and gamma.**

In [None]:
param_grid = {'C': [0.1,1, 10, 100], 'gamma': [1,0.1,0.01,0.001]} 
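
To get a feel for how much work this grid implies, the sketch below enumerates the combinations with `ParameterGrid`. The fit count assumes 5-fold cross-validation, which is the default in recent scikit-learn versions. The name `param_grid_demo` is mine.

```python
from sklearn.model_selection import ParameterGrid

param_grid_demo = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001]}

# Every (C, gamma) pair is one candidate model.
combos = list(ParameterGrid(param_grid_demo))
n_candidates = len(combos)

# Assumption: 5 cross-validation folds per candidate.
n_fits = n_candidates * 5

print(n_candidates, n_fits)
```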

**Create a GridSearchCV object and fit it to the training data.**

In [None]:
grid = GridSearchCV(SVC(),param_grid,refit=True,verbose=2)
grid.fit(X_train,y_train)
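
After fitting, the search object exposes the winning settings and their cross-validated score. The sketch below is self-contained and rests on assumptions: it uses the bundled Iris copy and a fixed `random_state`, so its output will not match this notebook's unseeded split.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Assumption: bundled Iris data and a seeded, stratified 60/40 split.
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.40, random_state=0, stratify=y_demo
)

param_grid_demo = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001]}
search = GridSearchCV(SVC(), param_grid_demo, refit=True)
search.fit(X_tr, y_tr)

best_params = search.best_params_      # winning C and gamma
best_cv_score = search.best_score_     # mean cross-validated accuracy
test_accuracy = search.score(X_te, y_te)  # refit model on held-out data

print(best_params, best_cv_score, test_accuracy)
```

Because `refit=True`, the search object itself can be used for prediction with the best parameters.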

**Now take that grid model, create some predictions using the test set, and create a classification report and confusion matrix for them.**

In [None]:
grid_predictions = grid.predict(X_test)

In [None]:
print(confusion_matrix(y_test,grid_predictions))

In [None]:
print(classification_report(y_test,grid_predictions))

You should have done about the same or exactly the same. This makes sense: there is basically just one point that is too noisy to capture, and we don't want an overfit model that would be able to capture it.