This repository holds the final project for the Programming and Scripting module of the Higher Diploma in Data Analytics course at GMIT. The topic of the project is the research and investigation of Fisher's Iris dataset.
A detailed project description can be found on the GitHub page of the lecturer, Ian McLoughlin.
The Iris flower data, also known as Fisher's Iris dataset, was introduced by the British biologist and statistician Sir Ronald Aylmer Fisher. In 1936, Fisher published a paper titled “The Use of Multiple Measurements in Taxonomic Problems” in the journal Annals of Eugenics. Fisher did not collect these data himself. Credit for the data goes to Dr. Edgar Anderson, who collected the majority of the data on the Gaspé Peninsula.
In this article, Fisher developed and evaluated a linear function to differentiate Iris species based on the morphology of their flowers. It was the first time that the sepal and petal measurements of the three Iris species appeared publicly. [01]
The differences between the Iris species are pictured below. [02]
The Iris dataset contains 150 records representing three Iris species (Iris setosa, Iris versicolor and Iris virginica), with 50 samples each.
The columns in each record are:
- Id
- SepalLengthCm
- SepalWidthCm
- PetalLengthCm
- PetalWidthCm
- Species
The Iris dataset [03] used in this analysis can be found among the files in this repository as Iris_dataset.csv.
This section explains the code for the imported libraries, the dataset import and the summary. The code used for plotting is explained in Plots.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sys
NumPy is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.
A shorter definition is that NumPy is the fundamental package for scientific computing in Python. [04]\
pandas is a Python package for data science; it offers data structures for data manipulation and analysis. [05]
In this project pandas is used for creating a summary of the dataset from a .csv file.\
Matplotlib is a comprehensive visualisation library for Python, built on NumPy arrays, for creating static, animated and interactive 2D plots. [06] [07]
matplotlib.pyplot is a state-based interface to matplotlib. It provides a MATLAB-like way of plotting. pyplot is mainly intended for interactive plots and simple cases of programmatic plot generation. [08]
Seaborn is a library for making statistical graphics in Python. It is built on top of matplotlib and closely integrated with pandas data structures. [09]
Working with DataFrames is a bit easier with Seaborn because its plotting functions operate on DataFrames and arrays that contain a whole dataset. [10]
Elite Data Science has an interesting Seaborn tutorial based on a dataset from the famous Pokémon cartoon.
The sys module provides system-specific parameters and functions: it gives access to some variables used or maintained by the interpreter and to functions that interact strongly with the interpreter. [11]
Interesting tutorials for working with these libraries can be found in Worthy mentions.
List of useful cheat sheets for the libraries used in this project:
ifds = pd.read_csv("Iris_dataset.csv", index_col = "Id")
This line of code reads the .csv file into a DataFrame and stores it in the variable ifds (iris flower dataset) for further analysis and manipulation.
By default, pandas adds its own zero-based integer index to a DataFrame, so index_col = "Id" was used to make the Id column the index while reading the file. As a result, the Id column is not treated as data during the analysis. [12]
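The effect of index_col can be seen in a minimal sketch. It uses a three-row inline sample in place of Iris_dataset.csv; the sample rows, their Id values and the names `csv_text`, `default_index` and `sample` are illustrative assumptions, not part of the project code.

```python
import io
import pandas as pd

# Three illustrative rows in the same column layout as Iris_dataset.csv.
csv_text = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,7.0,3.2,4.7,1.4,Iris-versicolor
3,6.3,3.3,6.0,2.5,Iris-virginica
"""

# Without index_col, pandas adds its own 0-based index and keeps Id as a data column.
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col="Id", the Id column becomes the row index instead.
sample = pd.read_csv(io.StringIO(csv_text), index_col="Id")

print(list(default_index.columns))  # includes "Id"
print(list(sample.columns))         # "Id" is no longer a data column
print(list(sample.index))           # index values come from the Id column
```

Because Id is now the index, methods such as describe() do not report statistics for it.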
Part of the code for summary:
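def summary_to_file():
    # Redirect everything printed below into analysis_summary.txt
    sys.stdout = open("analysis_summary.txt", "w")
    ...
    print(ifds)
    ...
    print(ifds.describe())
    ...
    # info() prints its report itself and returns None, so it is not wrapped in print()
    ifds.info()
    ...
    print(ifds["Species"].value_counts())
    ...
    print(ifds["Species"].value_counts(normalize=True) * 100)
    sys.stdout.close()
    # Restore the normal standard output so later prints do not fail
    sys.stdout = sys.__stdout__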
The dataset summary is not shown when the program starts; instead, it is stored in analysis_summary.txt.
The function summary_to_file() creates the summary and writes it into the file at the same time.
Writing the summary output into a file is achieved with the sys module and its attribute stdout. stdout (standard output stream) is simply the default place to send a program’s text output. [13] [14]
The initial idea was to create a function that returns the summary output and then write that output into a .txt file. After long research and some trial and error, that seemed too complicated to code. Redirecting sys.stdout was chosen over writing to the file with .write() because the code is simpler and every print call writes its output to the .txt file, whereas .write() only accepts a string as input. [15] [16] [17]
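For comparison, the same redirection idea can be sketched with contextlib.redirect_stdout from the standard library. This is an alternative to the approach used in this project, not the project's code; the function name `write_summary`, the file name and the printed lines are hypothetical.

```python
import contextlib
import os
import tempfile


def write_summary(path):
    """Send everything printed inside the with-block to the file at `path`."""
    with open(path, "w") as handle, contextlib.redirect_stdout(handle):
        print("Summary line 1")
        print("Summary line 2")
    # On leaving the with-block, sys.stdout is restored and the file is closed.


# Hypothetical output path inside a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "example_summary.txt")
write_summary(out_path)

with open(out_path) as handle:
    contents = handle.read()
print(contents)
```

The context manager restores the standard output automatically, so prints made after the function returns appear on the screen again.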
print(ifds) gives an overview of the whole dataset loaded from the Iris_dataset.csv file.
ifds.describe() gives a summary of the numeric columns in the dataset. It shows the count of non-null values in each column, which can reveal missing values. It also calculates the mean, standard deviation, minimum and maximum, and the 25th, 50th and 75th percentiles (the first, second and third quartiles) of each numeric column. [18]
Output
       SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count     150.000000    150.000000     150.000000    150.000000
mean        5.843333      3.054000       3.758667      1.198667
std         0.828066      0.433594       1.764420      0.763161
min         4.300000      2.000000       1.000000      0.100000
25%         5.100000      2.800000       1.600000      0.300000
50%         5.800000      3.000000       4.350000      1.300000
75%         6.400000      3.300000       5.100000      1.800000
max         7.900000      4.400000       6.900000      2.500000
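The count row and the quartile rows (25%, 50%, 75%) of describe() can be checked on a small example. The sketch below uses made-up values with one missing entry; the names `values` and `summary` are illustrative.

```python
import pandas as pd

# Six made-up entries, one of which is missing (NaN).
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, float("nan")], name="example")

summary = values.describe()
print(summary)

# count reports only the non-missing entries, so it is lower than len(values).
print(len(values), summary["count"])

# The percentage rows match the quantiles of the non-missing data.
print(values.quantile([0.25, 0.5, 0.75]))
```

A count lower than the number of rows is the signal that a column contains missing values.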
ifds.info() prints information about the dataset, including the index data type and column data types, the number of non-null values and the memory usage. [18]
Output
<class 'pandas.core.frame.DataFrame'>
Int64Index: 150 entries, 1 to 150
Data columns (total 5 columns):
SepalLengthCm 150 non-null float64
SepalWidthCm 150 non-null float64
PetalLengthCm 150 non-null float64
PetalWidthCm 150 non-null float64
Species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 7.0+ KB
The method value_counts() counts how many times each unique value occurs in a column. In this case, the column of interest is Species. [19]
By setting the parameter normalize to True (it is False by default), these counts are returned as relative frequencies, which are multiplied by 100 here to present them as percentages. [20]
Output
Iris-virginica 50
Iris-versicolor 50
Iris-setosa 50
Name: Species, dtype: int64
Or, viewed as percentages:
Iris-setosa 33.333333
Iris-versicolor 33.333333
Iris-virginica 33.333333
Name: Species, dtype: float64
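The difference between the two calls can be seen in a minimal sketch on a made-up four-row column. The values and the names `species`, `counts` and `shares` are illustrative, not taken from the dataset.

```python
import pandas as pd

# Made-up column with an uneven split, so counts and percentages differ visibly.
species = pd.Series(
    ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-versicolor"],
    name="Species",
)

# Absolute number of rows per unique value.
counts = species.value_counts()

# Relative frequencies (fractions that sum to 1), scaled to percentages.
shares = species.value_counts(normalize=True) * 100

print(counts)
print(shares)
```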
Histograms are coded with the help of functions. There are 4 functions, one for each histogram: Sepal Length, Sepal Width, Petal Length and Petal Width. All of those functions are grouped in a function called histograms().
Example of part of the code:
iris_s = ifds[ifds.Species == "Iris-setosa"]
iris_vers = ifds[ifds.Species == "Iris-versicolor"]
iris_virg = ifds[ifds.Species == "Iris-virginica"]
def petal_length_hist():
    plt.figure(figsize = (9,9))
    sns.distplot(iris_s["PetalLengthCm"], kde = False, label = "Iris setosa", color = "deeppink")
    sns.distplot(iris_vers["PetalLengthCm"], kde = False, label = "Iris versicolor", color = "mediumorchid")
    sns.distplot(iris_virg["PetalLengthCm"], kde = False, label = "Iris virginica", color = "navy")
    plt.title("Petal length in cm", size = 20)
    plt.xlabel("")
    plt.ylabel("Frequency", size = 16)
    plt.legend()
    plt.savefig("Petal-lenght.png")
    plt.show()
The variables iris_s, iris_vers and iris_virg hold subsets of the original DataFrame for Iris setosa, Iris versicolor and Iris virginica, respectively. They are defined outside the functions so they can be reused. [21]
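The subsetting step relies on a boolean mask. A minimal sketch on a made-up four-row DataFrame shows what the comparison produces and how it filters the rows; the names `frame`, `mask` and `setosa_only` are illustrative.

```python
import pandas as pd

# Made-up rows in the style of the dataset.
frame = pd.DataFrame({
    "PetalLengthCm": [1.4, 4.7, 6.0, 1.3],
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"],
})

# The comparison yields one True/False value per row.
mask = frame["Species"] == "Iris-setosa"

# Indexing with the mask keeps only the rows where it is True.
setosa_only = frame[mask]

print(list(mask))
print(setosa_only)
```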
Many parameters in the code are added for aesthetic purposes only. An example of that is setting the size of the title and label text. [22] figsize is defined as 9 by 9 inches so that the legend in the saved picture is not positioned over the histogram. Important to note: the figure size must be defined before plotting starts. [23]
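The ordering of these calls can be sketched with plain Matplotlib. This sketch swaps sns.distplot() for Matplotlib's own hist() to stay short, uses made-up values, and selects the non-interactive "Agg" backend as an assumption so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Create the figure first, with the size fixed at creation time.
fig = plt.figure(figsize=(9, 9))
ax = fig.add_subplot(111)

# Illustrative values only, not taken from the Iris dataset.
ax.hist([1.4, 1.5, 1.3, 4.7, 4.5, 6.0], bins=5, label="example values")

# Text sizes are purely aesthetic and can be set per element.
ax.set_title("Example histogram", size=20)
ax.set_ylabel("Frequency", size=16)
ax.legend()

width, height = fig.get_size_inches()
print(width, height)
```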
distplot() is a function used to flexibly plot a univariate distribution of observations. [24]
Parameter kde (kernel density estimate) is set to False as it was unnecessary in this case.
The parameter color was set for a better distinction between the species and a nicer picture. [25]\
From the Sepal length and Sepal width comparison picture, it is visible that Iris setosa is easier to distinguish than Iris versicolor and Iris virginica. Iris setosa has wider and shorter sepals, while the other two species are not easy to differentiate based on this data.
From the Petal length and Petal width comparison picture, the difference between the three species is much more noticeable. Iris setosa is very distinct and has the smallest and narrowest petals of the three. Iris virginica has the biggest petals.
Scatterplots are coded as two different functions: Sepal width and length comparison and Petal width and length comparison. Both functions are united under a function scatterplots().
Scatterplot code example:
def sepal_length_width_scat():
    plt.figure(figsize = (9,9))
    sns.scatterplot(x = "SepalLengthCm", y = "SepalWidthCm", data = ifds, marker = "o", hue = "Species",
                    palette = ["deeppink","mediumorchid","navy"], edgecolor = "dimgrey")
    plt.title("Sepal length and Sepal width comparison", size = 20)
    plt.xlabel("Sepal length", size = 16)
    plt.ylabel("Sepal width", size = 16)
    plt.legend()
    plt.savefig("Sepal-length-width.png")
    plt.show()
sns.scatterplot() depicts the joint distribution of two variables using a cloud of points, where each point represents an observation in the dataset. The viewer can then determine whether there are any meaningful relationships in the presented data. [26] [27]
The data compared here are the columns "SepalLengthCm" and "SepalWidthCm", and the points are colored by "Species" through the hue parameter. [28]
Like in histograms, lots of parameters for scatterplots are added for aesthetic purposes.
The color palette is the same as for the histograms.
A circle-style marker with an edgecolor is chosen for a neater look. [29] [30]
The pairplot gives a better comparison and overview of the data and provides enough information to draw conclusions.
def pairplot():
    sns.pairplot(ifds, hue = "Species", diag_kind = "hist", palette = ["deeppink","mediumorchid","navy"])
    plt.savefig("Iris-dataset-pairplot.png")
    plt.show()
sns.pairplot() plots pairwise relationships in a dataset. When hue is set, the default diagonal plot is a KDE, but in this case it is changed to a histogram with the parameter diag_kind. The color palette remained the same. [31]
Because there are 4 numeric variables (SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm), a 4x4 grid of plots is created.
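The grid size follows from the number of numeric columns, assuming pairplot is called without an explicit vars argument, as in the code above. A minimal pandas-only sketch on a made-up two-row DataFrame shows the count; the names `frame`, `numeric_columns` and `grid_shape` are illustrative.

```python
import pandas as pd

# Made-up rows with the same columns as the dataset (Id is the index, so it is left out).
frame = pd.DataFrame({
    "SepalLengthCm": [5.1, 7.0],
    "SepalWidthCm": [3.5, 3.2],
    "PetalLengthCm": [1.4, 4.7],
    "PetalWidthCm": [0.2, 1.4],
    "Species": ["Iris-setosa", "Iris-versicolor"],
})

# Only numeric columns are plotted against each other; Species is used for hue.
numeric_columns = list(frame.select_dtypes(include="number").columns)
grid_shape = (len(numeric_columns), len(numeric_columns))

print(numeric_columns)
print(grid_shape)
```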
Even though it has the widest sepals of all three species, Iris setosa is the smallest flower.
Compared by sepal width and length alone, Iris versicolor and Iris virginica are not easily distinguished.
But when observing the petal length and width, and the petal and sepal ratios, the difference becomes noticeable, with Iris virginica being the biggest of the flowers.
- Visual Studio Code - version 1.44.2
- cmder - version 1.3.14.982
- python - version 3.7.4.final.0
- Anaconda3 - 2019.10
- Notepad++ - version 7.8.5
- Mozilla Firefox 75.0 (64-bit)
[01] Towards data science. The Iris dataset - A little bit of history and biology
[02] The Good Python. Iris dataset.
[03] Kaggle. UCI Machine learning. Iris dataset download.
[04] Numpy.org. What is Numpy?
[05] Datacamp. Pandas tutorial
[06] Geeksforgeeks. Introduction to Matplotlib
[07] Matplotlib.org
[08] Matplotlib.org. Matplotlib.pyplot
[09] Seaborn. Introduction.
[10] Datacamp. Seaborn Python tutorial
[11] Python.org. Sys
[12] Real python. Python csv.
[13] Lutz, M. (2009). "Learning Python", pg. 303
[14] StackOverflow. Sys.stdout
[15] Real Python. Read Write files Python
[16] Geeksforgeeks. Reading and writing text files
[17] StackOverflow. Python writing function output to a file.
[18] Towards Data Science. Getting started to data analysis with Python pandas
[19] Medium. Exploratory data analysis.
[20] Towards Data Science. Getting more value from the pandas value counts.
[21] Cmdline tips. How to make histogram in python with pandas and seaborn.
[22] StackOverflow. Text size of x and y axis and the title on matplotlib.
[23] StackOverflow. Change size of figures drawn with matplotlib.
[24] Seaborn.pydata. Seaborn.distplot
[25] Python graph gallery. Select color with matplotlib
[26] Seaborn. Seaborn scatterplot.
[27] Seaborn. Relational tutorial.
[28] Honing Data Science
[29] Matplotlib. Markers
[30] StackOverflow. Matplotlib border around Scatterplot points.
[31] Kite. Seaborn pairplot.
This is a list of sources that were not used in the analysis or summary of the Iris dataset, but rather for a better understanding of the project requirements and for researching how to edit the readme file. It also includes interesting sources worth reading.
- Fisher, R. A. (1936). “The Use of Multiple Measurements in Taxonomic Problems”
- Towards data science. Introduction to pandas.
- Geeksforgeeks. NumPy - Introduction
- Geeksforgeeks. NumPy - Advanced
- Real python. Pandas DataFrame
- Pandas.pydata.org
- Elite data science. Seaborn tutorial.
- Python programming. Sys module.
- Data Flair. Python sys module
- Python graph gallery
- Queirozf. Pandas dataframe plot examples with matplotlib pyplot.
- Matplotlib. Colormaps.