# My Project Notebook

**Rachel King**

***

## Introduction ##

The Iris dataset is one of the best-known datasets in pattern recognition.

The dataset was published in 1936 by R.A. Fisher. It contains 3 classes, each referring to a species of iris plant. There are 50 instances of each class, giving 150 instances in total.

Each instance has 4 attributes, or variables, which are:

- Sepal length in centimetres
- Sepal width in centimetres
- Petal length in centimetres
- Petal width in centimetres

One class (Iris-setosa) is linearly separable from the other two, while the remaining two (Iris-versicolor and Iris-virginica) are not linearly separable from each other.
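
To make the idea of separability concrete, the sketch below checks whether a single cut on one variable splits one class from all the others. The helper name `threshold_separates` and the nine-row `sample` frame (the first three rows of each class) are assumptions made for illustration only. A `True` result is sufficient to show linear separability, but a `False` result only means that this one variable is not enough on its own.

```python
import pandas as pd

# First three rows of each class, used only as a small illustrative sample.
sample = pd.DataFrame(
    [
        (5.1, 3.5, 1.4, 0.2, "Iris-setosa"),
        (4.9, 3.0, 1.4, 0.2, "Iris-setosa"),
        (4.7, 3.2, 1.3, 0.2, "Iris-setosa"),
        (7.0, 3.2, 4.7, 1.4, "Iris-versicolor"),
        (6.4, 3.2, 4.5, 1.5, "Iris-versicolor"),
        (6.9, 3.1, 4.9, 1.5, "Iris-versicolor"),
        (6.3, 3.3, 6.0, 2.5, "Iris-virginica"),
        (5.8, 2.7, 5.1, 1.9, "Iris-virginica"),
        (7.1, 3.0, 5.9, 2.1, "Iris-virginica"),
    ],
    columns=["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width", "Class"],
)


def threshold_separates(df, column, target, class_col="Class"):
    """Return True if one cut on `column` splits `target` from every other class."""
    inside = df.loc[df[class_col] == target, column]
    outside = df.loc[df[class_col] != target, column]
    # The target class must lie entirely below or entirely above all other rows.
    return bool(inside.max() < outside.min() or inside.min() > outside.max())


setosa_separable = threshold_separates(sample, "Petal_Length", "Iris-setosa")
versicolor_separable = threshold_separates(sample, "Petal_Length", "Iris-versicolor")
print(setosa_separable, versicolor_separable)
```

The same helper can be called on the full dataset once it has been loaded, which gives a more meaningful result than this small sample.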

An image of the three Iris plant species referred to in the dataset can be seen below:

![Iris Plants]()

## Importing of Modules ##

Here we import the packages we rely on to aid analysis and visualisation of the data:

- Pandas (a Python library used to analyse, explore and manipulate datasets)
- Numpy (a Python library used for working with arrays)
- Matplotlib (a Python library used for plotting data and for visualisation)
- Tabulate (a Python package used to print tabular data in nicely formatted tables)
- Seaborn (a Python library used for data visualisation - provides informative statistical graphics)
- Sys (a Python module that provides functions and variables that are used to manipulate parts of the Python runtime environment)

These modules make it possible to analyse the data, present it in readable, well-formatted tables and graphs, and control the input and output of the program.
This is important when trying to build a clear picture of the data and what it represents.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
from tabulate import tabulate
import seaborn as sns
import sys

The dataset is imported into the workspace directly from its URL.
It is then stored in the variable `iris` so that it can be used to analyse and visualise the data it contains.

In [2]:
csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
# using the attribute information as the column names
col_names = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width','Class']
iris = pd.read_csv(csv_url, names=col_names)
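
As a minimal sketch of what the `names` argument does, the example below reads three headerless lines in the same comma-separated layout from an in-memory buffer rather than from the URL. The `raw` buffer and the `demo` frame are assumptions made for illustration. On the full `iris` frame the same checks should report 150 rows and 50 rows per class, matching the figures in the introduction.

```python
import io

import pandas as pd

# Three headerless lines in the same layout as the source file (illustrative sample).
raw = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

col_names = ["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width", "Class"]

# Passing `names` tells pandas the file has no header row and supplies the labels.
demo = pd.read_csv(raw, names=col_names)

# Basic sanity checks after loading: dimensions, rows per class, missing values.
print(demo.shape)
print(demo["Class"].value_counts())
print(int(demo.isna().sum().sum()))
```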

## Summary of Each Variable ##

A number of tables were created which statistically summarise the dataset as a whole, as well as statistical summaries of each of the three classes of iris plant.
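
The per-class tables all use the same group-then-aggregate pattern, which can be sketched in a compact form. The helper name `class_summary` and the six-row `sample` frame are assumptions made for illustration, not part of the notebook's own code.

```python
import pandas as pd

# Small illustrative sample: two rows per class, two of the four variables.
sample = pd.DataFrame(
    {
        "Petal_Length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
        "Petal_Width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
        "Class": [
            "Iris-setosa", "Iris-setosa",
            "Iris-versicolor", "Iris-versicolor",
            "Iris-virginica", "Iris-virginica",
        ],
    }
)


def class_summary(df, column, by="Class"):
    """Mean, min and max of one variable for each class."""
    return df.groupby(by)[column].agg(["mean", "min", "max"])


# One summary table per variable, built in a loop instead of repeated code.
summaries = {col: class_summary(sample, col) for col in ["Petal_Length", "Petal_Width"]}
for col, table in summaries.items():
    print(col)
    print(table, "\n")
```

Selecting the column before calling `agg` gives flat column names (`mean`, `min`, `max`), which keeps the resulting table straightforward to label.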

In [3]:
# Summary of each Variable
print("A table of statistics for each variable can be seen below.")
print("The first table displays statistics of the data set as a whole, while Tables 2,3,4 and 5")
print("display statistics such as mean, min and max for each of the 4 variables when grouped together by class.\n")

print("Table 1 - Iris Dataset Statistics")
iris_stats = iris.agg({'Sepal_Length': ['mean', 'min', 'max', 'std'],
                       'Sepal_Width': ['mean', 'min', 'max', 'std'],
                       'Petal_Length': ['mean', 'min', 'max', 'std'],
                       'Petal_Width': ['mean', 'min', 'max', 'std']})
print(tabulate(iris_stats, headers = ["Stat", "Sepal_Length (cm)", "Sepal_Width (cm)", "Petal_Length (cm)", "Petal_Width (cm)"], tablefmt='grid', stralign='center'))
print('\n')

# Sepal Length Statistics
print("Table 2 - Iris Dataset Sepal Length Statistics")
# After groupby, Class is the index, so tabulate prints it as the first column
table_of_data = iris.groupby('Class').agg({'Sepal_Length': ['mean', 'min', 'max']})
print(tabulate(table_of_data, headers = ["Class", "Mean (cm)", "Min (cm)", "Max (cm)"], tablefmt='grid', stralign='center'))
print('\n')

# Sepal Width Statistics
print("Table 3 - Iris Dataset Sepal Width Statistics")
# After groupby, Class is the index, so tabulate prints it as the first column
table_of_data = iris.groupby('Class').agg({'Sepal_Width': ['mean', 'min', 'max']})
print(tabulate(table_of_data, headers = ["Class", "Mean (cm)", "Min (cm)", "Max (cm)"], tablefmt='grid', stralign='center'))
print('\n')

# Petal Length Statistics
print("Table 4 - Iris Dataset Petal Length Statistics")
# After groupby, Class is the index, so tabulate prints it as the first column
table_of_data = iris.groupby('Class').agg({'Petal_Length': ['mean', 'min', 'max']})
print(tabulate(table_of_data, headers = ["Class", "Mean (cm)", "Min (cm)", "Max (cm)"], tablefmt='grid', stralign='center'))
print('\n')

# Petal Width Statistics
print("Table 5 - Iris Dataset Petal Width Statistics")
# After groupby, Class is the index, so tabulate prints it as the first column
table_of_data = iris.groupby('Class').agg({'Petal_Width': ['mean', 'min', 'max']})
print(tabulate(table_of_data, headers = ["Class", "Mean (cm)", "Min (cm)", "Max (cm)"], tablefmt='grid', stralign='center'))
print("\n")

print("Sepal Length has the largest mean of the four variables (5.8), while Petal Width has the smallest (1.2).")
print("The largest Sepal Lengths occur in the Iris virginica species, while the smallest occur in Iris setosa.")
print("However, Iris setosa has the largest Sepal Widths. \n\n")
print("For the petal characteristics, Petal Length and Petal Width are both largest in the Iris virginica species and smallest in the Iris setosa species.")
print("Petal Length is the variable with the widest spread of data, with a standard deviation of about 1.76 and a range from a minimum of 1 to a maximum of 6.9.")
print("Sepal Width is the variable with the smallest spread of data, as all three species of Iris plant have similar sepal widths.")

A table of statistics for each variable can be seen below.
The first table displays statistics of the data set as a whole, while Tables 2,3,4 and 5
display statistics such as mean, min and max for each of the 4 variables when grouped together by class.

Table 1 - Iris Dataset Statistics
+--------+---------------------+--------------------+---------------------+--------------------+
|  Stat  |   Sepal_Length (cm) |   Sepal_Width (cm) |   Petal_Length (cm) |   Petal_Width (cm) |
+========+=====================+====================+=====================+====================+
|  mean  |            5.84333  |           3.054    |             3.75867 |           1.19867  |
+--------+---------------------+--------------------+---------------------+--------------------+
|  min   |            4.3      |           2        |             1       |           0.1      |
+--------+---------------------+--------------------+---------------------+--------------------+
|  max   |            7.9      |           4.4      |             6.9     |           2.5      |
+--------+---------------------+-

***

End