### Python Refresher ML
_#1_ :: IRIS DATASET

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings("ignore")

In [None]:
df = pd.read_csv("../input/iris-flower-dataset/IRIS.csv")

In [None]:
df.head()

# Exploratory Data Analysis

### Dataset Summary
1. Dimensions of the Dataset
1. A peek at the data
1. Statistical summary of all attributes
1. Breakdown of the data by the class variable

### Data Visualization
1. Univariate Plot
1. Multivariate Plot

In [None]:
## Dimensions: Data Shape
df.shape

In [None]:
### Statistical Summary
df.describe() 


Notice that `describe()` only summarizes the numerical columns. The remaining column, `species`, is categorical, and since it is the value we want to predict, we are dealing with a classification problem.
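
A minimal sketch of this behaviour on a tiny hand-made frame. The frame name `toy` and its three rows are illustrative assumptions, not values read from IRIS.csv. Passing `include="all"` is one way to get a summary of the text column as well.

```python
import pandas as pd

# Hypothetical three-row frame that mimics the layout of the Iris data.
toy = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "petal_length": [1.4, 4.7, 6.0],
    "species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Default behaviour: only the numeric columns are summarized.
numeric_summary = toy.describe()

# include="all" also reports count/unique/top/freq for non-numeric columns.
full_summary = toy.describe(include="all")

print(list(numeric_summary.columns))
print(list(full_summary.columns))
print(toy["species"].nunique(), "distinct classes")
```

On the real data, the same calls on `df` should show the `species` column appearing only in the `include="all"` summary.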

In [None]:
### Class Distribution
df.groupby('species').size()

## Data Visualization

In [None]:
### Univariate Plot: Box and whisker plots

df.plot(kind="box", subplots=True, layout=(2,2), sharex=False, sharey=False);

In [None]:
## Histograms
df.hist();

It looks like the two `sepal` variables may have a roughly Gaussian distribution. This is useful to note, as we can use algorithms that exploit this assumption.
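
The visual impression can be backed up with a rough numeric check. The sketch below assumes scikit-learn's bundled copy of Iris as a stand-in for IRIS.csv, so the column names differ from those in `df`, and it uses skewness and excess kurtosis as an informal indicator rather than a formal normality test.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: the bundled scikit-learn copy of Iris stands in for IRIS.csv.
# Its columns are named "sepal length (cm)", "sepal width (cm)", and so on.
features = load_iris(as_frame=True).data

# For a Gaussian sample, skewness and excess kurtosis are both close to 0.
shape_stats = pd.DataFrame({
    "skew": features.skew(),
    "kurt": features.kurt(),
})

print(shape_stats.round(2))
```

Values near zero in both columns are consistent with a bell-shaped histogram; large magnitudes suggest the Gaussian assumption is a poor fit for that feature.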

In [None]:
## Multivariate Plots
from pandas.plotting import scatter_matrix

scatter_matrix(df);

## Modeling
1. Separate out a validation dataset.
1. Set up the test harness to use 10-fold cross-validation.
1. Build multiple different models to predict species from flower measurements.
1. Select the best model.
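
To illustrate what the 10-fold harness in step 2 does, here is a minimal sketch on made-up labels. The names `X_toy` and `y_toy` and the 40-samples-per-class layout are assumptions chosen for illustration, not the notebook's actual training split. It shows that stratified folds keep the class proportions the same in every held-out fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical balanced labels: 3 classes with 40 samples each.
y_toy = np.repeat(["a", "b", "c"], 40)
X_toy = np.zeros((len(y_toy), 1))  # placeholder features; only the labels matter here

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

# Count how many samples of each class land in every held-out fold.
fold_counts = []
for train_idx, test_idx in kfold.split(X_toy, y_toy):
    labels, counts = np.unique(y_toy[test_idx], return_counts=True)
    fold_counts.append(dict(zip(labels.tolist(), counts.tolist())))

print(len(fold_counts), "folds")
print(fold_counts[0])
```

Each model is then trained on nine folds and scored on the tenth, ten times over, which is what `cross_val_score` does when given `cv=kfold`.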

In [None]:
## Load Modelling libraries/dependencies

from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

In [None]:
## Splitting Validation Dataset
array = df.values ## NumPy array: columns 0-3 hold the measurements, column 4 the species label
X = array[:,0:4]
y = array[:,4]

X_train, X_validation, y_train, y_validation = train_test_split(X, y, test_size = 0.20, random_state = 1)

In [None]:
## Model Selection
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma = 'auto')))

## evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = StratifiedKFold(n_splits = 10, random_state=1, shuffle = True)
    cv_results = cross_val_score(model, X_train, y_train, cv = kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))

In [None]:
pd.DataFrame(results, index = names).T

In [None]:
## plotting the results
plt.boxplot(results, labels = names)
plt.title('Algorithm Comparison')
plt.show()

The results above show that every algorithm reaches 100% accuracy on some folds, but on average SVM provides the best accuracy. So far this has only been measured with cross-validation on the training set. To examine how accurate the algorithm really is, we will make predictions on the reserved validation set.

In [None]:
## Support Vector Classification
## Make predictions on the validation set
model = SVC(gamma="auto")
model.fit(X_train, y_train)
predict = model.predict(X_validation)

In [None]:
## Evaluate Predictions
print("Acc_Score","\n\n",accuracy_score(y_validation, predict),"\n")
print("Confusion_Matrix","\n\n",confusion_matrix(y_validation, predict),"\n")
print("Classification_Report","\n\n",classification_report(y_validation, predict),"\n")

In [None]:
sns.heatmap(confusion_matrix(y_validation, predict), annot=True)