### Hello Data Ninjas

This is an Iris machine learning kernel that I forked and modified from Ashwin's page. As a data enthusiast, I tried to make his code a little more elegant, concise and explanatory at the same time. This follows two principles, the first taken from the Zen of Python and the second a general programming maxim:
* simple is better than complex
* don't repeat yourself

The dataset for this analysis is available on Kaggle. You can download it from https://www.kaggle.com/ . However, I have included the dataset and some images with this notebook.

<h1 style="text-align:center"> Overview of the Iris Flower </h1>
<p style="text-align:justify">
    Iris is a genus of about 260–300 species of flowering plants with showy flowers. 
    It takes its name from the Greek word for a rainbow, which is also the name for the Greek 
    goddess of the rainbow, Iris. Some authors state that the name refers to the wide variety of 
    flower colors found among the many species. See https://en.wikipedia.org/wiki/Iris_(plant)
</p>

<table>
<col width="280px">
<tr>
<th>
<img src="images/iris_setosa.jpg" align="left" style="width:250px;height:200px;"/>
</th>
<th>
<img src="images/iris_versicolor.jpg" align="middle" style="width:250px;height:200px;"/>
</th>
<th>
<img src="images/iris_virginica.jpg" align="right" style="width:250px;height:200px;"/><br>
</th>
</tr>
<tr>
<td><b style="float:left">Iris Setosa Species</b> </td>
<td><b style="float:left">Iris Versicolor Species</b> </td>
<td><b style="float:left">Iris Virginica Species</b> </td>
</tr>
</table>
<table>
<col width="280px">
<tr>
<th>
<img src="images/petal and sepal.jfif" align="middle" style="width:250px;height:200px;"/>
</th>
</tr>
<tr>
<td><b style="float:left">Petals and Sepals of a Flower</b> </td>
</tr>
</table>


### This data analysis and machine learning is carried out using Python and its associated frameworks/libraries.
I also used some Unix commands. You may import libraries in the middle of your analysis, with one caveat: you must import a library before you can use anything it provides.

In [None]:
#import the libraries for data analysis

import numpy as np                       # for scientific calculation
import pandas as pd                      # for loading and exploring the data 
import matplotlib.pyplot as plt          # for visualizing your analysis
import seaborn as sns                    # built upon matplotlib also for visualizing your analysis
import os                                # for interfacing with the operating system
import sqlite3                           # for loading and manipulating an sqlite dataset


## Quality Assurance 

In [None]:
#check what is in the project folder
!ls

In [None]:
#check what is in the data and images folder
!ls data images

In [None]:
'''
 quality assurance measure to ensure that the datasets are available before the analysis.
 an AssertionError with a helpful message is raised if either the .csv or the .sqlite file is missing from the data folder
'''
data_folder_content = os.listdir("data")
sqlite_error_message = "Error: sqlite file not available, check instructions above to download it"
csv_error_message = "Error: csv file not available, check instructions above to download it"
assert "database.sqlite" in data_folder_content, sqlite_error_message
assert "Iris.csv" in data_folder_content, csv_error_message
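
The same check can be written so that it reports every missing file at once instead of stopping at the first failed assert. The following is a minimal sketch and not part of the original kernel: the helper name `missing_files` is hypothetical, and the demo runs against a throwaway temporary folder rather than the real `data` folder.

```python
import os
import tempfile

def missing_files(folder, required):
    """Return the names in `required` that are absent from `folder` (hypothetical helper)."""
    present = set(os.listdir(folder))
    return [name for name in required if name not in present]

# Demo in a temporary folder that contains only the csv file
with tempfile.TemporaryDirectory() as demo_folder:
    open(os.path.join(demo_folder, "Iris.csv"), "w").close()
    demo_missing = missing_files(demo_folder, ["Iris.csv", "database.sqlite"])

print(demo_missing)   # lists the files that were not found
```

An empty list means everything needed is in place, so the result can be fed straight into an assert.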

## Exploring the .csv Data Format 

In [None]:
iris = pd.read_csv("data/Iris.csv")           #load the iris.csv data
iris.head(5)                                  #display the first 5 rows 

In [None]:
'''
Ashwin used
iris.drop('Id', axis=1, inplace=True)
to remove the Id column (axis=1 means columns) from the DataFrame itself (inplace=True).
We can accomplish the same thing with del iris['Id'].
'''
del iris['Id']      # remove the Id column; we don't need it

In [None]:
iris.isnull().any().any(), iris.shape     #check to see if there is any missing cell

In [None]:
iris.info()      # see a more detailed description of the dataset 

In [None]:
iris.describe().transpose()       #let's do some elementary statistic on the iris data

The table above says that, in the iris sample, the mean sepal length (measured in cm) is 5.84 and the standard deviation is 0.828. The smallest sepal length is 4.3 cm, while the largest is 7.9 cm. The other measured attributes (sepal width, petal length and petal width) are read the same way.
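
To make the table concrete, the sketch below recomputes the same kind of summary by hand with Python's `statistics` module. The five values are a toy sample of sepal lengths used only for illustration, so the numbers differ from the full-dataset values quoted above.

```python
import statistics

sample = [5.1, 4.9, 4.7, 4.6, 5.0]     # toy sample of sepal lengths in cm

summary = {
    "count": len(sample),
    "mean": statistics.mean(sample),
    "std": statistics.stdev(sample),   # sample standard deviation, as in describe()
    "min": min(sample),
    "max": max(sample),
}
for name, value in summary.items():
    print(f"{name:>5}: {value:.3f}")
```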

## Exploring the Sqlite Data Format

In [None]:
cnx = sqlite3.connect('data/database.sqlite')           # Create a connection to the sql database
iris2 = pd.read_sql("SELECT * FROM Iris", cnx)          # Read the data with a pandas method

In [None]:
iris2.info()

Observe that we have loaded both a CSV and an SQLite version of the dataset. Which one you work with depends on your skills with either of these technologies.
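
If the SQLite route is unfamiliar, the sketch below shows the same idea on a throwaway in-memory database using only the standard library. The table layout mirrors the column names used in this notebook, but the two rows are typed in by hand purely for illustration.

```python
import sqlite3

# Build a tiny in-memory database (nothing is written to disk)
cnx_demo = sqlite3.connect(":memory:")
cnx_demo.execute(
    "CREATE TABLE Iris (Id INTEGER, SepalLengthCm REAL, SepalWidthCm REAL, "
    "PetalLengthCm REAL, PetalWidthCm REAL, Species TEXT)"
)
cnx_demo.executemany(
    "INSERT INTO Iris VALUES (?, ?, ?, ?, ?, ?)",
    [(1, 5.1, 3.5, 1.4, 0.2, "Iris-setosa"),
     (2, 7.0, 3.2, 4.7, 1.4, "Iris-versicolor")],
)

# Run a SELECT statement with a plain cursor instead of pandas
cursor = cnx_demo.execute("SELECT Species, PetalLengthCm FROM Iris ORDER BY Id")
rows = cursor.fetchall()
cnx_demo.close()
print(rows)
```

Each row comes back as a plain tuple, whereas `pd.read_sql` wraps the same result in a DataFrame.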

## Visualizing the Iris Data 
We will be working with the .csv iris data from this point on

In [None]:
'''
the first plot call returns a matplotlib Axes, which we keep in the fig variable;
passing ax=fig to the next two calls draws the other species onto the same Axes.
each species gets its own color and a corresponding label.
plt.gcf() returns the current Figure so that we can resize it.
the axis ranges follow the min and max values we saw in the statistics above.
the order of the statements below matters: the first plot must exist before the others can be drawn onto it.
'''

fig = iris[iris.Species=='Iris-setosa'].plot(kind='scatter',x='SepalLengthCm',y='SepalWidthCm',color='orange', label='setosa')
iris[iris.Species=='Iris-versicolor'].plot(kind='scatter',x='SepalLengthCm',y='SepalWidthCm',color='blue', label='versicolor',ax=fig)
iris[iris.Species=='Iris-virginica'].plot(kind='scatter',x='SepalLengthCm',y='SepalWidthCm',color='green', label='virginica', ax=fig)
fig.set_xlabel("Sepal Length")
fig.set_ylabel("Sepal Width")
fig.set_title("Sepal Length VS Sepal Width")
fig = plt.gcf()
fig.set_size_inches(10,6)
plt.show()

The graph above shows the relationship between sepal length and sepal width. Now we will check the relationship between petal length and petal width.

In [None]:
fig = iris[iris.Species=='Iris-setosa'].plot.scatter(x='PetalLengthCm',y='PetalWidthCm',color='orange', label='setosa')
iris[iris.Species=='Iris-versicolor'].plot.scatter(x='PetalLengthCm',y='PetalWidthCm',color='blue', label='versicolor',ax=fig)
iris[iris.Species=='Iris-virginica'].plot.scatter(x='PetalLengthCm',y='PetalWidthCm',color='green', label='virginica', ax=fig)
fig.set_xlabel("Petal Length")
fig.set_ylabel("Petal Width")
fig.set_title("Petal Length VS Petal Width")
fig=plt.gcf()
fig.set_size_inches(10,6)
plt.show()

The petal features are better clustered than the sepal features. This indicates that the petal features may be the preferred features for analysis and modelling.

## Now let us see how the length and width are distributed

A histogram is a plot that lets you discover, and show, the underlying frequency distribution (shape) of a set of continuous data. This allows the inspection of the data for its underlying distribution (e.g., normal distribution), outliers, skewness, etc.
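
What a histogram does behind the scenes is count how many values fall into each equal-width bin. Below is a minimal sketch of that counting, with a hypothetical helper `bin_counts` and made-up toy values. It is only an illustration, not the routine pandas or matplotlib actually use.

```python
def bin_counts(values, bins):
    """Count values per equal-width bin between min(values) and max(values)."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        # the maximum value belongs to the last bin, not a bin of its own
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts

toy_lengths = [1.0, 1.2, 1.4, 1.5, 4.0, 4.5, 5.0]   # made-up values
counts = bin_counts(toy_lengths, bins=4)
print(counts)
```

Tall bars correspond to large counts, which is how the plot reveals where the measurements pile up.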

In [None]:
iris.hist(edgecolor='black', linewidth=1.2, bins=20)
fig=plt.gcf()
fig.set_size_inches(12,6)
plt.show()

The graph above shows that, across all three species, the most common measurements (taken at the midpoint of the tallest bars) are about 1.25 for petal length, 0.25 for petal width, 5.6 and 6.3 for sepal length, and 3.1 for sepal width.

## Now let us see how the length and width vary according to the species

In [None]:
plt.figure(figsize=(15,10))
plt.subplot(2,2,1)
sns.violinplot(x='Species',y='PetalLengthCm',data=iris)
plt.subplot(2,2,2)
sns.violinplot(x='Species',y='PetalWidthCm',data=iris)
plt.subplot(2,2,3)
sns.violinplot(x='Species',y='SepalLengthCm',data=iris)
plt.subplot(2,2,4)
sns.violinplot(x='Species',y='SepalWidthCm',data=iris)
plt.show()

The violin plots show how the length and width measurements vary across the sampled species.

### Before we start, Some Machine Learning Intro

Any ML task is accomplished with data in a two-dimensional format, i.e. rows and columns. Typically we are interested in the attributes and the target variable of the data. An attribute, also known as a feature, is a property (length, width, color, etc.) of a sample that may be used to determine its classification. In this dataset, the attributes are the petal and sepal lengths and widths. The target variable, in the machine learning context, is the value the trained model should output for a given sample. Here the target variable is the species, which takes one of three values.

## Supervised and Unsupervised Machine Learning

Supervised learning is where you have some input data and the outputs they map to, and you train an algorithm on this data so that the model can predict the output for an input whose output we do not know.

Unsupervised learning is where we have data without known outputs and want to categorize it into groups based on similarity. This means we determine which group each input falls into.

### The Project is a Supervised Classification Task. 

We will be using the classification algorithms to build a model.

**Classification:** This is what you do when you are trying to determine which category a sample belongs to. From our data we want to learn which of the three species a newly collected sample belongs to, using a model built from the data we have now.



In [None]:

'''
Now we will import specific modules from SciKit-Learn
importing classification algorithms
'''

# For Supervised Learning
from sklearn.linear_model import LogisticRegression               # for Logistic Regression algorithm
from sklearn.tree import DecisionTreeClassifier                   # for using Decision Tree Classifier algorithm
from sklearn.neighbors import KNeighborsClassifier                # for K-nearest neighbours algorithm
from sklearn.svm import SVC                                       # for support vector classifier
from sklearn.model_selection import train_test_split              # for splitting the dataset
from sklearn.metrics import accuracy_score                        # for checking the model accuracy



Now, when we train any algorithm, the number of features and their correlation play an important role. If many of the features are highly correlated, then training an algorithm with all of the features may reduce the accuracy. Thus feature selection should be done carefully. This dataset has few features, but we will still look at the correlation.

In [None]:
plt.figure(figsize=(7,4)) 
sns.heatmap(iris.drop(columns='Species').corr(),annot=True,cmap='cubehelix_r')      # draws a heatmap of the correlation matrix of the numeric columns (Species is text, so it is left out)
plt.show()

### Observation

The sepal width and length are not correlated. The petal width and length are highly correlated.
We will use all the features for training the algorithm and check the accuracy.
Then we will use one petal feature and one sepal feature to check the accuracy of the algorithm, since those two features are not correlated. This keeps more variance in the dataset, which may help accuracy. We will check it later.
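
Each cell of the heatmap is a Pearson correlation coefficient. As a rough sketch of what that number measures, here it is computed by hand. The function name `pearson` and the toy lists are illustrative and are not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

length = [1.0, 2.0, 3.0, 4.0]
width_up = [0.5, 1.0, 1.5, 2.0]      # rises in step with length
width_down = [2.0, 1.5, 1.0, 0.5]    # falls as length rises

r_up = pearson(length, width_up)
r_down = pearson(length, width_down)
print(round(r_up, 3), round(r_down, 3))
```

Values near +1 or -1 mean the two columns carry nearly the same information, while values near 0 mean they vary independently of each other in the linear sense.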

### Steps to be followed when applying an algorithm

<ol>
<li>Split the dataset into a training and a testing dataset. The testing dataset is generally the smaller of the two (for example, one-third of the dataset), which leaves more data for training the model. </li>
<li> Select an algorithm suited to the problem (classification or regression). This is a classification task, so several classification algorithms are tried in this analysis. </li>
<li> Then pass the training dataset to the algorithm to train it. We use the .fit() method. </li>
<li> Then pass the testing data to the trained algorithm to predict the outcome. We use the .predict() method. </li>
<li> We then check the accuracy by comparing the predicted outcome with the actual output, here using accuracy_score. </li>
</ol>
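
The five steps above can be sketched end to end without any library. The class `NearestMeanClassifier` below is a made-up toy with the same `.fit()` / `.predict()` shape as the scikit-learn models used later; it is not one of the algorithms in this notebook, and the measurements are invented for the demo.

```python
class NearestMeanClassifier:
    """Toy one-feature classifier (hypothetical): predicts the class whose
    training mean is closest to the input value."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for value, label in zip(X, y):
            sums[label] = sums.get(label, 0.0) + value
            counts[label] = counts.get(label, 0) + 1
        self.means_ = {label: sums[label] / counts[label] for label in sums}
        return self

    def predict(self, X):
        return [min(self.means_, key=lambda label: abs(self.means_[label] - value))
                for value in X]

# Made-up petal lengths (cm) and labels
X_all = [1.4, 1.5, 1.3, 4.7, 4.5, 4.9, 1.6, 4.4]
y_all = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "versicolor",
         "setosa", "versicolor"]

# Step 1: split (here the last two samples are simply held out for testing)
train_X_demo, test_X_demo = X_all[:6], X_all[6:]
train_y_demo, test_y_demo = y_all[:6], y_all[6:]

# Steps 2 and 3: choose an algorithm and train it
toy_model = NearestMeanClassifier().fit(train_X_demo, train_y_demo)

# Step 4: predict on the held-out data
toy_prediction = toy_model.predict(test_X_demo)

# Step 5: compare the predictions with the actual labels
toy_accuracy = sum(p == t for p, t in zip(toy_prediction, test_y_demo)) / len(test_y_demo)
print(toy_prediction, toy_accuracy)
```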

### Splitting The Data into Training And Testing Dataset

In [None]:
'''
From our earlier explanation we know that the features are the measured attributes:
petal length, petal width, sepal length, sepal width,
while the target is the species of each iris sample, i.e. Iris-setosa, Iris-virginica or Iris-versicolor.
We create a list of the feature columns and select the target as a single column,
so that y is one-dimensional, which is the shape scikit-learn expects for labels.
'''
features = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
target = 'Species'
X = iris[features]
y = iris[target]

In [None]:
X

In [None]:
y

In [None]:
'''
the whole iris dataset is split into training and testing data.
test_size=0.3 holds out 30% of the 150 samples (45 rows) as the test set;
you could equally pass train_size=0.7, which gives a test:train ratio of 3:7.
train_test_split shuffles the rows before splitting; random_state fixes the seed
of that shuffle so that the same split is produced on every run.
'''
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size = 0.3, random_state=324)
print(train_X.shape)              #see number of data for the training phase
print(test_X.shape)               #see number of data for the testing phase
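
To see what `random_state` buys you, here is a rough standard-library sketch of a shuffled split. The helper `split_indices` is hypothetical and is not how scikit-learn implements `train_test_split`; it only illustrates that fixing the seed makes the shuffle repeatable.

```python
import random

def split_indices(n_samples, test_size, seed):
    """Shuffle row indices with a fixed seed, then cut off the test portion."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_size)
    return indices[n_test:], indices[:n_test]     # train, test

train_a, test_a = split_indices(150, test_size=0.3, seed=324)
train_b, test_b = split_indices(150, test_size=0.3, seed=324)

print(len(train_a), len(test_a))      # sizes of the two parts
print(test_a == test_b)               # same seed, same split
```

Changing the seed gives a different but equally valid split, which is why accuracy figures can shift slightly between runs that use different `random_state` values.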

### Let's manually see a feature-to-target mapping of the Train and Test Dataset

In [None]:
train_X.head(2) , train_y.head(2)   # let's see which targets some rows of the training features map to

In [None]:
test_X.head(2) , test_y.head(2)

### Support Vector Machine (SVM)

In [None]:
model = SVC()                                           # the chosen algorithm
model.fit(train_X,train_y)                              # train the algorithm with the training input and the training output
prediction = model.predict(test_X)                      # pass the testing data to the trained algorithm
accuracy = accuracy_score(test_y,prediction)            # compare the algorithm's predictions with the reserved test output
print('The accuracy of the SVM is:',accuracy) 
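
Accuracy here is simply the fraction of test samples whose predicted label matches the actual one. A small hand-rolled sketch with made-up labels follows; the function `accuracy` is a stand-in for illustration, not scikit-learn's implementation. It also shows that swapping the two arguments does not change the result.

```python
def accuracy(actual, predicted):
    """Fraction of positions where the two label sequences agree."""
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)

actual_demo = ["setosa", "versicolor", "virginica", "virginica"]
predicted_demo = ["setosa", "versicolor", "versicolor", "virginica"]

acc_forward = accuracy(actual_demo, predicted_demo)
acc_reversed = accuracy(predicted_demo, actual_demo)
print(acc_forward, acc_reversed)
```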

### Logistic Regression

In [None]:
model = LogisticRegression()                            # the chosen algorithm
model.fit(train_X,train_y)                              # train the algorithm with the training input and the training output
prediction = model.predict(test_X)                      # pass the testing data to the trained algorithm
accuracy = accuracy_score(test_y,prediction)            # compare the algorithm's predictions with the reserved test output
print('The accuracy of the Logistic Regression is:',accuracy) 

### Decision Tree Classifier

In [None]:
model = DecisionTreeClassifier(max_leaf_nodes=4, random_state=0) # the chosen algorithm
model.fit(train_X,train_y)                              # train the algorithm with the training input and the training output
prediction = model.predict(test_X)                      # pass the testing data to the trained algorithm
accuracy = accuracy_score(test_y,prediction)            # compare the algorithm's predictions with the reserved test output
print('The accuracy of the Decision Tree Classifier is:',accuracy) 

### K-Nearest Neighbours

In [None]:
model=KNeighborsClassifier(n_neighbors=3)               # this examines 3 neighbours for putting the new data into a class
model.fit(train_X,train_y)                              # we train the algorithm with the training input and the training output
prediction = model.predict(test_X)                      # now we pass the testing data to the trained algorithm
accuracy = accuracy_score(prediction,test_y)            # now we check the accuracy of the algorithm.
print('The accuracy of the KNN is:',accuracy) 
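
To show what "examining 3 neighbours" means, here is a bare-bones sketch of the voting idea behind K-Nearest Neighbours. The function `knn_predict` and the measurements are made up for the demo; scikit-learn's classifier is far more optimised than this.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Label `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up (petal length, petal width) pairs
points = [(1.4, 0.2), (1.5, 0.2), (4.7, 1.4), (4.5, 1.5), (5.1, 1.9)]
labels = ["setosa", "setosa", "versicolor", "versicolor", "virginica"]

knn_guess = knn_predict(points, labels, query=(4.8, 1.6), k=3)
print(knn_guess)
```

The three closest points to the query are two versicolor samples and one virginica sample, so the vote goes to versicolor.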

#### Let's check the accuracy for various values of n for K-Nearest neighbours

In [None]:
n = range(1,11)                       # the numbers of neighbours we would like to check
a = []                                # the accuracy for each value of n (Series.append was removed in pandas 2.0, so we use a list)
for i in n:
    model=KNeighborsClassifier(n_neighbors=i) 
    model.fit(train_X,train_y)
    prediction=model.predict(test_X)
    a.append(accuracy_score(test_y,prediction))
plt.plot(n, a)
plt.xticks(n)
plt.xlabel("n_value")
plt.ylabel("accuracy")
plt.show()

Above is the graph showing the accuracy of the KNN models for different values of n. Values 1-3 and 5-9 performed well.

### We used all the features of iris in the above models. Now we will use Petals and Sepals Separately

### Creating Petals And Sepals Training Data

In [None]:
# Preparing the features and target for the petal and sepal datasets
# the p and s suffixes on X and y indicate the features and target for the petal and sepal datasets
Xp = iris[['PetalLengthCm','PetalWidthCm']].copy()
yp = iris['Species'].copy()

Xs = iris[['SepalLengthCm','SepalWidthCm']].copy()
ys = iris['Species'].copy()

### Splitting the Petal and Sepal Data into Training and Testing Data

In [None]:
train_Xp,test_Xp,train_yp,test_yp = train_test_split(Xp,yp,test_size=0.3,random_state=342)  #petals

train_Xs,test_Xs,train_ys,test_ys = train_test_split(Xs,ys,test_size=0.3,random_state=342)  #sepals


### SVC

In [None]:
model=SVC()
model.fit(train_Xp,train_yp) 
prediction=model.predict(test_Xp)
accuracy = accuracy_score(prediction,test_yp)
print('The accuracy of the SVC using Petals is:',accuracy)

model=SVC()
model.fit(train_Xs,train_ys) 
prediction=model.predict(test_Xs) 
accuracy = accuracy_score(prediction,test_ys)      # compare against the sepal test labels
print('The accuracy of the SVC using Sepals is:',accuracy)

### Logistic Regression

In [None]:
model = LogisticRegression()
model.fit(train_Xp,train_yp) 
prediction=model.predict(test_Xp)
accuracy = accuracy_score(prediction,test_yp)
print('The accuracy of the Logistic Regression using Petals is:',accuracy)

model = LogisticRegression()
model.fit(train_Xs,train_ys) 
prediction=model.predict(test_Xs) 
accuracy = accuracy_score(prediction,test_ys)      # compare against the sepal test labels
print('The accuracy of the Logistic Regression using Sepals is:',accuracy)

### Decision Tree

In [None]:
model=DecisionTreeClassifier()
model.fit(train_Xp,train_yp) 
prediction=model.predict(test_Xp)
accuracy = accuracy_score(prediction,test_yp)
print('The accuracy of the Decision Tree using Petals is:',accuracy)

model=DecisionTreeClassifier()
model.fit(train_Xs,train_ys) 
prediction=model.predict(test_Xs) 
accuracy = accuracy_score(prediction,test_ys)      # compare against the sepal test labels
print('The accuracy of the Decision Tree using Sepals is:',accuracy)

### K-Nearest Neighbours

In [None]:
model=KNeighborsClassifier(n_neighbors=3) 
model.fit(train_Xp,train_yp) 
prediction=model.predict(test_Xp)
accuracy = accuracy_score(prediction,test_yp)
print('The accuracy of the KNN using Petals is:',accuracy)

model=KNeighborsClassifier(n_neighbors=3)          # use KNN here too, to match the printed label
model.fit(train_Xs,train_ys) 
prediction=model.predict(test_Xs) 
accuracy = accuracy_score(prediction,test_ys)      # compare against the sepal test labels
print('The accuracy of the KNN using Sepals is:',accuracy)

## Unsupervised Learning
Although we have already handled the data with the algorithms our task requires, let us try to group our collected data as if it had no Species column. Our goal is to categorize the data into three, four or more groups, learn the pattern of the group each sample falls into, and then label the groups ourselves.

This kind of task may suit a customer dataset. I will just do it on the iris data for fun.

In [None]:
# For Unsupervised Learning
from sklearn.cluster import KMeans
from itertools import cycle, islice
from pandas.plotting import parallel_coordinates


In [None]:
# Choose the features 
features = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
X = iris[features].copy()                          # X is the input data
X

### K-Means

In [None]:
# Use k-Means Clustering 
model = KMeans(n_clusters=3)
cluster = model.fit(X)
prediction = model.predict(X)
print(prediction)
print(prediction.shape)

The output above confirms that our K-Means model assigned one cluster label to every sample in the input data.

In [None]:
#What are the centers of 3 clusters we formed ? 
centers = cluster.cluster_centers_
centers
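
For intuition about where these centers come from: K-Means alternates between assigning each point to its nearest center and moving each center to the mean of its points. The sketch below is a stripped-down version of that loop on made-up points. It is not scikit-learn's implementation, which among other things picks its starting centers more carefully; the function name `kmeans` and the fixed starting centers are assumptions for the demo.

```python
import math

def kmeans(points, centers, iterations=10):
    """Plain Lloyd-style loop: assign points to the nearest center, then re-average."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: index of the nearest center for every point
        assignment = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                       # keep the old center if a cluster is empty
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignment, centers

# Made-up (petal length, petal width) pairs forming two obvious groups
toy_points = [(1.0, 0.2), (1.2, 0.2), (1.4, 0.2), (4.0, 1.4), (4.2, 1.4), (4.4, 1.4)]
toy_assignment, toy_centers = kmeans(toy_points, centers=[toy_points[0], toy_points[3]])
print(toy_assignment)
print(toy_centers)
```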


## Plots 

Let us first create some utility functions which will help us in plotting graphs:

In [None]:
# Function that creates a DataFrame with a column for cluster group

def pd_centers(features, centers):
    colNames = list(features)
    colNames.append('Prediction')

    # Zip with a column called 'Prediction' (index)
    Z = [np.append(A, index) for index, A in enumerate(centers)]

    # Convert to pandas data frame for plotting
    P = pd.DataFrame(Z,columns=colNames)
    P['Prediction'] = P['Prediction'].astype(int)
    return P

In [None]:
P = pd_centers(features, centers)
P

In [None]:
# K-Means numbers its clusters arbitrarily, so we label the centers by cluster number rather than by species
fig = P[P.Prediction== 0].plot.scatter(x='PetalLengthCm',y='PetalWidthCm',color='orange', label='cluster 0')
P[P.Prediction== 1].plot.scatter(x='PetalLengthCm',y='PetalWidthCm',color='blue', label='cluster 1',ax=fig)
P[P.Prediction== 2].plot.scatter(x='PetalLengthCm',y='PetalWidthCm',color='green', label='cluster 2', ax=fig)
fig.set_xlabel("Petal Length")
fig.set_ylabel("Petal Width")
fig.set_title("Cluster Centers: Petal Length VS Petal Width")
fig=plt.gcf()
fig.set_size_inches(10,6)
plt.show()

### We will now create a new merged data of our initial collected samples and our model's prediction and see if the clusters fit.
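
One way to do this check, sketched here with made-up labels rather than this notebook's actual output, is to pair each sample's known species with its predicted cluster and count the combinations. Cluster numbers are arbitrary, so what matters is whether each cluster is dominated by a single species.

```python
from collections import Counter

# Made-up example: known species and the cluster id a model assigned to each sample
species_demo = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
cluster_demo = [1, 1, 0, 0, 2, 0]

# Merge the two columns row by row, then count each (cluster, species) pair
merged = list(zip(cluster_demo, species_demo))
pair_counts = Counter(merged)

for (cluster_id, species), count in sorted(pair_counts.items()):
    print(f"cluster {cluster_id}: {species} x {count}")
```

In this toy example cluster 0 mixes versicolor with one virginica sample, which is the kind of overlap such a table makes visible.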

### Observations:

Using petals rather than sepals for training gave much better accuracy.
This was expected, as we noticed earlier that the petal features were better suited to the analysis. Thus we have implemented some of the common machine learning algorithms.


### Did I hear you ask: how do we make sense of all this, and can it be useful to a non-data-scientist?

#### Good question! The answer:

Suppose you are a botanist who goes out into the field to collect flower samples. This is where we can help reduce your workload.

The final part of this process will be to build a web application where you can enter the measurements and, voila, you are told which species your sample belongs to. This is similar to how Amazon recommends a list of books you may like after you have clicked on a particular book that belongs to a certain category in their database.

I am currently working on developing a web application for this data analysis.
