# Fundamentals of Data Analysis Project

## Rebecca Feeley

***

> - The project is to create a notebook investigating the variables and data points within the well-known iris flower data set associated with Ronald A Fisher.
> - In the notebook, you should discuss the classification of each variable within the data set according to common variable types and scales of measurement in mathematics, statistics, and Python.
> - Select, demonstrate, and explain the most appropriate summary statistics to describe each variable.
> - Select, demonstrate, and explain the most appropriate plot(s) for each variable.

The Iris Flower Data Set is a multivariate data set introduced in a 1936 paper by Ronald Fisher.
The measurements of the Iris Setosa, Iris Virginica and Iris Versicolor flowers were collected by Dr Edgar Anderson in Canada.
The data set is widely used for multivariate data analysis and in Machine Learning.
It contains 50 samples from each of three species of Iris Flower:
- Iris Setosa
- Iris Versicolor
- Iris Virginica

For each sample, there are 4 measurements available, measured in centimeters:
- Sepal Length
- Sepal Width
- Petal Length
- Petal Width

_Obtaining the dataset_

I downloaded the Iris Data Set from the UC Irvine Machine Learning Repository, which can be accessed at https://archive.ics.uci.edu/dataset/53/iris. I stored this data in a CSV file in my repository and used the Pandas library to read the data from the CSV file so that I could manipulate it with Python code.


!['Image of the Iris Flower Species'](https://miro.medium.com/v2/resize:fit:3500/1*f6KbPXwksAliMIsibFyGJw.png)


In [None]:
# Importing all of the libraries required to complete analysis of the dataset
import numpy as np 
import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt 
import sys


column_names = ["Sepal_Length(cm)", "Sepal_Width(cm)", "Petal_Length(cm)", "Petal_Width(cm)", "Species"] # I have assigned names to each of the columns
# header=None states that the file has no header row, so it is recognised that the data begins from line 1
iris_data = pd.read_csv("IrisData.csv", names=column_names, header=None)
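
To illustrate what `header=None` and `names` do, here is a minimal sketch that reads two rows in the same layout from an in-memory string instead of from `IrisData.csv`. The variable names `raw_csv` and `sample` are illustrative assumptions and are not part of the notebook.

```python
import io
import pandas as pd

# Two rows in the same layout as the data file: no header line, five comma-separated fields.
raw_csv = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"

column_names = ["Sepal_Length(cm)", "Sepal_Width(cm)", "Petal_Length(cm)", "Petal_Width(cm)", "Species"]

# header=None tells pandas that the first line is data, not column names;
# names=... supplies the column labels instead.
sample = pd.read_csv(io.StringIO(raw_csv), names=column_names, header=None)

print(sample.shape)           # both lines are kept as data rows
print(list(sample.columns))   # the labels come from column_names
```

Without `header=None` and `names`, pandas would treat the first measurement row as the column labels and one observation would be lost.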

Firstly, I am inspecting the dataset for any missing values by using the isnull() function. 
The resulting output of False for each variable shows that there are no missing values.

In [None]:
iris_data.isnull().any()  # if any values are missing in a column, the result for that column will display as True
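
Because the Iris data has no missing values, the check above only ever returns False. The following sketch shows how the same check behaves when a value is missing, using a small made-up frame (`toy` is hypothetical and is not the Iris data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one deliberately missing measurement.
toy = pd.DataFrame({
    "Petal_Length(cm)": [1.4, np.nan, 5.1],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-virginica"],
})

missing_any = toy.isnull().any()    # True/False per column
missing_count = toy.isnull().sum()  # number of missing entries per column

print(missing_any)
print(missing_count)
```

Using `sum()` instead of `any()` is a small variation that also reports how many entries are missing in each column.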


Now, I am using the head() and tail() functions to display the first 5 lines of the Iris data set and the last 5 lines of the Iris data set.
This allows us to see the column names of the data, how many columns are in the data, and a general overview of the top and bottom of the dataset.

In [242]:
iris_data.head()  

|   | Sepal_Length(cm) | Sepal_Width(cm) | Petal_Length(cm) | Petal_Width(cm) | Species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [241]:
iris_data.tail() 

|   | Sepal_Length(cm) | Sepal_Width(cm) | Petal_Length(cm) | Petal_Width(cm) | Species |
|---|---|---|---|---|---|
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |


Now, I am using the value_counts() function to determine whether the dataset is balanced. This function counts the number of rows for each species of Iris flower, so a balanced dataset will show equal counts.
The output shows that this dataset is balanced: there are 50 rows for each of the three species of Iris flower.

In [243]:
iris_data.value_counts("Species") 

Species
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: count, dtype: int64
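
The balance check can also be expressed as proportions. This is a minimal sketch that rebuilds a Species column from the counts shown in the output above rather than reading the CSV file. The names `species` and `is_balanced` are illustrative assumptions.

```python
import pandas as pd

# Rebuild a Species column from the counts reported above (50 rows per species).
species = pd.Series(
    ["Iris-setosa"] * 50 + ["Iris-versicolor"] * 50 + ["Iris-virginica"] * 50,
    name="Species",
)

counts = species.value_counts()                      # rows per species
proportions = species.value_counts(normalize=True)   # share of the dataset per species

# The dataset is balanced when every species has the same count.
is_balanced = counts.nunique() == 1

print(proportions)
print("Balanced:", is_balanced)
```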

The dtypes attribute displays the data type of each variable. The possible outcomes here are object (typically a string), int64 (whole number) and float64 (number with decimals).

The output shows that the Sepal Length, Sepal Width, Petal Length and Petal Width are float64 type data, which means these variables are numerical, and that Species is object type data, which means it holds text labels and is categorical in nature.

In [248]:
iris_data.dtypes

Sepal_Length(cm)    float64
Sepal_Width(cm)     float64
Petal_Length(cm)    float64
Petal_Width(cm)     float64
Species              object
dtype: object

In [251]:
iris_data.info() # a function which summarises the above information on the dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Sepal_Length(cm)  150 non-null    float64
 1   Sepal_Width(cm)   150 non-null    float64
 2   Petal_Length(cm)  150 non-null    float64
 3   Petal_Width(cm)   150 non-null    float64
 4   Species           150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


## Classifying the data types 

Firstly, I will discuss the classification of each variable of the Iris data set according to common variable types and scales of measurement in mathematics, statistics, and Python.

There are two primary classes of data: Qualitative and Quantitative data.

Qualitative data can also be called categorical data. Such data cannot easily be measured using numbers; it generally refers to a category or type.
There are two subsets of qualitative data: Nominal and Ordinal data.
Nominal data refers to variables which do not have an intrinsic or numerical value which would allow them to be ordered or ranked; they may only be sorted by category.
Ordinal data refers to variables whose categories have a natural ordering on a scale, which differentiates them from nominal data. However, they still do not have a numerical value, so the distance between categories is not defined.


Quantitative data can also be called numerical data. This is data which can be quantified, i.e. it has an intrinsic numerical value through which it can be measured.
There are two subsets of quantitative data: Discrete and Continuous data.
Discrete data refers to variables which are numerical in nature but can take only a countable set of values, typically integers, which cannot be further subdivided.
Continuous data refers to variables which can be broken down into ever smaller values. It is not limited to integers, and it can be further subdivided into interval and ratio data. Interval data is data which can be categorised, ordered and evenly spaced (even intervals) but does not have a natural zero point. Ratio data is data which can be categorised, ordered and evenly spaced and which also has a natural zero point.

- Petal Length - Ratio Continuous numerical variable
- Petal Width - Ratio Continuous numerical variable
- Sepal Length - Ratio Continuous numerical variable
- Sepal Width - Ratio Continuous numerical variable
- Species - Nominal Categorical variable

It is clear that the species variable is qualitative (categorical), and it is nominal as the species have no intrinsic value which would allow them to be ranked in any particular order.

The remaining four variables of length and width are quantitative (numerical). They are continuous as they can be measured to a very exact figure, and they are ratio data as they have a natural zero point (a length or width cannot be negative) and the figures provide a means of ranking.
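
As a rough illustration of how these classifications can map onto pandas data types, the sketch below uses a small made-up frame. The `Size_Class` column is hypothetical: the Iris data contains no ordinal variable, so it is included only to contrast ordinal with nominal data.

```python
import pandas as pd

toy = pd.DataFrame({
    "Petal_Length(cm)": [1.4, 4.7, 5.1],                              # continuous (ratio)
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],  # nominal
    "Size_Class": ["small", "medium", "large"],                       # hypothetical ordinal column
})

# Nominal: an unordered categorical, so the categories cannot be ranked.
toy["Species"] = toy["Species"].astype("category")

# Ordinal: an ordered categorical, so the categories have a rank but no numeric distance.
size_type = pd.CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
toy["Size_Class"] = toy["Size_Class"].astype(size_type)

print(toy.dtypes)
print(toy["Species"].cat.ordered)     # False: no ordering between species
print(toy["Size_Class"].cat.ordered)  # True: small < medium < large
print(toy["Size_Class"].max())        # the highest-ranked category
```

Continuous measurements stay as float64, which is how the four length and width variables are stored in the notebook.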


The summary statistics of the Iris Dataset are obtained most easily using the describe() function. This function will generate the following analysis of the numerical variables (sepal length, sepal width, petal length and petal width). 

- Count - number of non-missing entries for the variable
- Mean - average value
- Standard deviation - a measure of how spread out the values are around the mean
- Minimum - smallest value of the variable
- 25th percentile - value below which 25% of the entries fall
- 50th percentile - value below which 50% of the entries fall (the median)
- 75th percentile - value below which 75% of the entries fall
- Maximum - largest value of the variable
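
To show what each of these statistics means, the sketch below computes them one at a time and compares them with the output of describe(). It uses only the five Sepal Length values shown by head() earlier, so the figures are not the summary statistics of the full dataset. The names `values` and `manual` are illustrative assumptions.

```python
import pandas as pd

# The five Sepal Length values displayed by head() (not the full dataset).
values = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0], name="Sepal_Length(cm)")

summary = values.describe()

# The same statistics computed individually.
manual = {
    "count": values.count(),        # number of non-missing entries
    "mean": values.mean(),          # average value
    "std": values.std(),            # sample standard deviation (divides by n - 1)
    "min": values.min(),
    "25%": values.quantile(0.25),
    "50%": values.median(),         # the 50th percentile is the median
    "75%": values.quantile(0.75),
    "max": values.max(),
}

for name, value in manual.items():
    print(f"{name:>5}: describe() = {summary[name]:.4f}, manual = {value:.4f}")
```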

In [258]:
iris_data['Species'].describe()

count             150
unique              3
top       Iris-setosa
freq               50
Name: Species, dtype: object

In [257]:
pd.set_option('display.max_rows', None) #this stops output from only displaying some of the result
pd.set_option('display.max_columns', None)

summary_stats = iris_data.groupby('Species').describe() # I have used the groupby function to separate all of the statistics based on species type
print(summary_stats)

                Sepal_Length(cm)                                              \
                           count   mean       std  min    25%  50%  75%  max   
Species                                                                        
Iris-setosa                 50.0  5.006  0.352490  4.3  4.800  5.0  5.2  5.8   
Iris-versicolor             50.0  5.936  0.516171  4.9  5.600  5.9  6.3  7.0   
Iris-virginica              50.0  6.588  0.635880  4.9  6.225  6.5  6.9  7.9   

                Sepal_Width(cm)                                                \
                          count   mean       std  min    25%  50%    75%  max   
Species                                                                         
Iris-setosa                50.0  3.418  0.381024  2.3  3.125  3.4  3.675  4.4   
Iris-versicolor            50.0  2.770  0.313798  2.0  2.525  2.8  3.000  3.4   
Iris-virginica             50.0  2.974  0.322497  2.2  2.800  3.0  3.175  3.8   

                Petal_Length(cm)

# Plotting the Variables & Analysis

Firstly, I will conduct univariate analysis on each of the variables, which means I will analyse each of the variables separately. 
Then, I will conduct bivariate analysis on pairs of variables, e.g. Sepal Length and Sepal Width, then Petal Length and Sepal Width and so on, and explore any potential relationships and correlations amongst the variables.

# Univariate Analysis

### Species 

As I have already established, there are 3 species of Iris Flower contained in the dataset:
- Iris Setosa
- Iris Virginica
- Iris Versicolor

There are 50 entries for each species.

This variable is a categorical variable and is best visualised using a count plot, i.e. a bar chart showing the number of entries in each category.

In [None]:
column_names = ["Sepal_Length(cm)", "Sepal_Width(cm)", "Petal_Length(cm)", "Petal_Width(cm)", "Species"] 
iris_data = pd.read_csv("IrisData.csv", names = column_names, header=None)
iris_data['Species'].value_counts().plot(kind = 'bar', linewidth = 3, edgecolor = 'black', grid=False, color = ['purple', 'pink', 'blue'])
plt.title('Species of the Iris Dataset')
plt.xlabel('Species Type')
plt.ylabel('Count')
plt.xticks(rotation = 0) # so the x labels display horizontally instead of vertically
plt.show()

The above visualisation is very limited in what it tells us about the data: it shows only the frequency of each species.
I will now examine the numerical variables below, first overall and then by species.

In [None]:
iris_data = pd.read_csv("IrisData.csv", names=column_names)
# creating a histogram of each numerical variable (does not specify iris flower species)
iris_data.hist(bins=8, color='green', edgecolor='black', grid=False, figsize=(9,8))  
plt.suptitle('Histograms of the Length and Width of each Iris Characteristic') 
plt.subplots_adjust(top=0.9, hspace=0.4) #adjust space between graphs to prevent titles overlapping
plt.show()

Looking at the numerical variables, Sepal Length, Sepal Width, Petal Length and Petal Width, a histogram is a very useful visualisation for the data. 
It allows us to see the distribution of the data of each of the above variables, and the separate species are differentiated by colour.
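
To make clear what a histogram with `bins=8` is counting, the sketch below applies NumPy's `histogram` function to the ten Petal Length values shown by head() and tail() earlier. This is only an illustration of the binning step on a small subset, not a reproduction of the plots. The name `petal_lengths` is an illustrative assumption.

```python
import numpy as np

# Petal Length values from the head() and tail() output (ten of the 150 rows).
petal_lengths = np.array([1.4, 1.4, 1.3, 1.5, 1.4, 5.2, 5.0, 5.2, 5.4, 5.1])

# bins=8 splits the range from the minimum to the maximum into 8 equal-width intervals.
counts, bin_edges = np.histogram(petal_lengths, bins=8)

for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:.2f} to {right:.2f} cm: {count} value(s)")
```

In this subset, the setosa rows fall in the lowest bin and the virginica rows in the highest bin, with empty bins in between. The height of each bar in a histogram is one of these counts.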

In [None]:
column_names = ["Sepal_Length(cm)", "Sepal_Width(cm)", "Petal_Length(cm)", "Petal_Width(cm)", "Species"] 
iris_data = pd.read_csv("IrisData.csv", names = column_names, header=None)
fig, axes = plt.subplots(2,2, figsize = (15,15))            
IrisSetosa = iris_data[iris_data.Species == "Iris-setosa"] # I created variables for each species type of Iris plant
IrisVersicolor = iris_data[iris_data.Species == "Iris-versicolor"]
IrisVirginica = iris_data[iris_data.Species == "Iris-virginica"]

axes[0,0].set_title("Sepal Length (in cm)", fontweight='bold')    
axes[0,0].set(xlabel='Sepal Length (in cm)')
axes[0,0].hist(IrisSetosa['Sepal_Length(cm)'], bins=8, alpha=0.5, label="Iris-setosa", color='green')            
axes[0,0].hist(IrisVersicolor['Sepal_Length(cm)'], bins=8, alpha=0.5, label="Iris-versicolor", color='purple')  
axes[0,0].hist(IrisVirginica['Sepal_Length(cm)'], bins=8, alpha=0.5, label="Iris-virginica", color='blue')    
axes[0,0].legend(loc='upper right') #setting the legend to display in upper right hand corner                        
    
axes[0,1].set_title("Sepal Width (in cm)", fontweight='bold')      
axes[0,1].set(xlabel='Sepal Width (in cm)')
axes[0,1].hist(IrisSetosa['Sepal_Width(cm)'], bins=8, alpha=0.5, label="Iris-setosa", color='green')
axes[0,1].hist(IrisVersicolor['Sepal_Width(cm)'], bins=8, alpha=0.5, label="Iris-versicolor", color='purple')
axes[0,1].hist(IrisVirginica['Sepal_Width(cm)'], bins=8, alpha=0.5, label="Iris-virginica", color='blue')
axes[0,1].legend(loc='upper right')

axes[1,0].set_title("Petal Length (in cm)", fontweight='bold')     
axes[1,0].set(xlabel='Petal Length (in cm)')
axes[1,0].hist(IrisSetosa['Petal_Length(cm)'], bins=8, alpha=0.5, label="Iris-setosa", color='green')
axes[1,0].hist(IrisVersicolor['Petal_Length(cm)'], bins=8, alpha=0.5, label="Iris-versicolor", color='purple')
axes[1,0].hist(IrisVirginica['Petal_Length(cm)'], bins=8, alpha=0.5, label="Iris-virginica", color='blue')
axes[1,0].legend(loc='upper right')

axes[1,1].set_title("Petal Width (in cm)", fontweight='bold')       
axes[1,1].set(xlabel='Petal Width (in cm)')
axes[1,1].hist(IrisSetosa['Petal_Width(cm)'], bins=8, alpha=0.5, label="Iris-setosa", color='green')
axes[1,1].hist(IrisVersicolor['Petal_Width(cm)'], bins=8, alpha=0.5, label="Iris-versicolor", color='purple')
axes[1,1].hist(IrisVirginica['Petal_Width(cm)'], bins=8, alpha=0.5, label="Iris-virginica", color='blue')
axes[1,1].legend(loc='upper right')

In [None]:
column_names = ["Sepal_Length(cm)", "Sepal_Width(cm)", "Petal_Length(cm)", "Petal_Width(cm)", "Species"] 
iris_data = pd.read_csv("IrisData.csv", names = column_names, header=None)
fig, axes = plt.subplots(2,2, figsize = (15,15))            
IrisSetosa = iris_data[iris_data.Species == "Iris-setosa"] # I created variables for each species type of Iris plant
IrisVersicolor = iris_data[iris_data.Species == "Iris-versicolor"]
IrisVirginica = iris_data[iris_data.Species == "Iris-virginica"]


axes[0,0].set_title("Sepal Length of Iris Setosa, Versicolor and Virginica", fontweight='bold')    
axes[0,0].set(xlabel='Sepal Length (in cm)')
sns.histplot(IrisSetosa['Sepal_Length(cm)'], bins=8, label="Iris-setosa", color='green', kde=True, ax=axes[0,0], edgecolor=None)            
sns.histplot(IrisVersicolor['Sepal_Length(cm)'], bins=8, label="Iris-versicolor", color='purple', kde=True, ax=axes[0,0], edgecolor=None)  
sns.histplot(IrisVirginica['Sepal_Length(cm)'], bins=8, label="Iris-virginica", color='blue', kde=True, ax=axes[0,0], edgecolor=None)  

axes[0,1].set_title("Sepal Width of Iris Setosa, Versicolor and Virginica", fontweight='bold')      
axes[0,1].set(xlabel='Sepal Width (in cm)')
sns.histplot(IrisSetosa['Sepal_Width(cm)'], bins=8, label="Iris-setosa", color='green', kde=True, ax=axes[0,1], edgecolor=None)
sns.histplot(IrisVersicolor['Sepal_Width(cm)'], bins=8, label="Iris-versicolor", color='purple', kde=True, ax=axes[0,1], edgecolor=None)
sns.histplot(IrisVirginica['Sepal_Width(cm)'], bins=8, label="Iris-virginica", color='blue', kde=True, ax=axes[0,1], edgecolor=None)

axes[1,0].set_title("Petal Length of Iris Setosa, Versicolor and Virginica", fontweight='bold')     
axes[1,0].set(xlabel='Petal Length (in cm)')
sns.histplot(IrisSetosa['Petal_Length(cm)'], bins=8, label="Iris-setosa", color='green', kde=True, ax=axes[1,0], edgecolor=None)
sns.histplot(IrisVersicolor['Petal_Length(cm)'], bins=8, label="Iris-versicolor", color='purple', kde=True, ax=axes[1,0], edgecolor=None)
sns.histplot(IrisVirginica['Petal_Length(cm)'], bins=8, label="Iris-virginica", color='blue', kde=True, ax=axes[1,0], edgecolor=None)

axes[1,1].set_title("Petal Width of Iris Setosa, Versicolor and Virginica", fontweight='bold')       
axes[1,1].set(xlabel='Petal Width (in cm)')
sns.histplot(IrisSetosa['Petal_Width(cm)'], bins=8, label="Iris-setosa", color='green',kde=True, ax=axes[1,1], edgecolor=None)
sns.histplot(IrisVersicolor['Petal_Width(cm)'], bins=8, label="Iris-versicolor", color='purple', kde=True, ax=axes[1,1], edgecolor=None)
sns.histplot(IrisVirginica['Petal_Width(cm)'], bins=8, label="Iris-virginica", color='blue', kde=True, ax=axes[1,1], edgecolor=None)
for ax in axes.flat:
    ax.legend(loc='upper right') # displaying the legend so each colour can be matched to its species
plt.subplots_adjust(bottom=0.2)

fig.suptitle("Histograms and Probability Distribution Plots of each of the Iris Flower characteristics", y=0.92) #removing large gap between the plots and the main title

In [None]:
column_names = ["Sepal_Length(cm)", "Sepal_Width(cm)", "Petal_Length(cm)", "Petal_Width(cm)", "Species"] 
iris_data = pd.read_csv("IrisData.csv", names = column_names, header=None)
fig, axes = plt.subplots(2,2, figsize = (15,15))            
IrisSetosa = iris_data[iris_data.Species == "Iris-setosa"] # I created variables for each species type of Iris plant
IrisVersicolor = iris_data[iris_data.Species == "Iris-versicolor"]
IrisVirginica = iris_data[iris_data.Species == "Iris-virginica"]


axes[0,0].set_title("Sepal Length (in cm)", fontweight='bold')    
axes[0,0].set(xlabel='Species') # the x axis of every boxplot shows the species
sns.boxplot(x='Species',y='Sepal_Length(cm)',data=iris_data ,palette='PuBuGn', ax=axes[0,0])   

axes[0,1].set_title("Sepal Width (in cm)", fontweight='bold')      
axes[0,1].set(xlabel='Species')
sns.boxplot(x='Species',y='Sepal_Width(cm)',data=iris_data ,palette='PuBuGn', ax=axes[0,1]) 

axes[1,0].set_title("Petal Length (in cm)", fontweight='bold')     
axes[1,0].set(xlabel='Species')
sns.boxplot(x='Species',y='Petal_Length(cm)',data=iris_data ,palette='PuBuGn', ax=axes[1,0]) 

axes[1,1].set_title("Petal Width (in cm)", fontweight='bold')       
axes[1,1].set(xlabel='Species')
sns.boxplot(x='Species',y='Petal_Width(cm)',data=iris_data ,palette='PuBuGn', ax=axes[1,1]) # plotting Petal Width in this panel to match its title

fig.suptitle('Boxplots of each Iris Flower characteristic by Species Class', y=0.92)
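
A boxplot is a picture of a few summary values: the quartiles, the median and the points treated as outliers. The sketch below computes those values for the five Sepal Length figures shown by tail() earlier, assuming the default rule used by matplotlib and seaborn, in which points more than 1.5 times the interquartile range beyond the quartiles are drawn as outliers. The variable names are illustrative assumptions, and the figures describe only this small subset.

```python
import numpy as np

# The five Sepal Length values displayed by tail() (not the full dataset).
values = np.array([6.7, 6.3, 6.5, 6.2, 5.9])

q1, median, q3 = np.percentile(values, [25, 50, 75])  # edges of the box and the line inside it
iqr = q3 - q1                                         # interquartile range: the height of the box

# Points outside these fences are drawn individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1 = {q1:.2f}, median = {median:.2f}, Q3 = {q3:.2f}, IQR = {iqr:.2f}")
print(f"Fences: {lower_fence:.2f} to {upper_fence:.2f}, outliers: {outliers}")
```

The whiskers extend to the most extreme values that still lie within the fences, which is why they may differ in length on either side of the box.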


# Bivariate Analysis

In [None]:
iris_data = pd.read_csv("IrisData.csv", names=column_names)

# Scatter plot of Sepal Length x Sepal Width
sns.scatterplot(data=iris_data, x='Sepal_Length(cm)', y='Sepal_Width(cm)', hue='Species', palette=['green', 'purple', 'blue'])
plt.title('Correlation between Sepal Length & Width (by Species Type)')
plt.show()

# Scatter plot of Petal Length x Petal Width
sns.scatterplot(data=iris_data, x='Petal_Length(cm)', y='Petal_Width(cm)', hue='Species', palette=['green', 'purple', 'blue'])
plt.title('Correlation between Petal Length & Width (by Species Type)')
plt.show()


In [None]:
sns.pairplot(iris_data, hue='Species', palette=['green', 'purple', 'blue'])
# saving the pairplot for future reference; this must happen before plt.show(), otherwise an empty figure is saved
plt.savefig('Pairplot of all the features of the Iris Setosa, Versicolour & Virginica.png')
plt.show()
plt.close()


CorrelationMatrix = iris_data.corr(numeric_only=True) #so it only uses the numerical variables, not by species
CorrelationMatrix.to_string('Correlation.txt') #saving the output to a text file
print(CorrelationMatrix)


In [None]:


# Create the heatmap with annotations, annotations to 2 decimal places
sns.heatmap(CorrelationMatrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=1)
# annot=True so the Pearson values display on each square of the heatmap.

plt.suptitle('Heatmap of the correlation between each variable')
plt.subplots_adjust(left=0.25, bottom = 0.3)
plt.show()


The heatmap displays Pearson correlation coefficients, which is the default method used by the corr() function. A value of 1 means total positive linear correlation, 0 means no linear correlation and -1 means total negative linear correlation. The coefficients for the iris data set include both positive and negative values within this range.
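
For reference, Pearson's coefficient for two variables is the sum of the products of their deviations from the mean, divided by the square root of the product of their summed squared deviations. The sketch below computes it by hand and checks it against NumPy's `corrcoef`. It uses only the ten Petal Length and Petal Width values shown by head() and tail() earlier, so the result is not the value shown on the heatmap. The function name `pearson_r` is an illustrative assumption.

```python
import numpy as np

# Petal Length and Petal Width from the head() and tail() output (ten of the 150 rows).
petal_length = np.array([1.4, 1.4, 1.3, 1.5, 1.4, 5.2, 5.0, 5.2, 5.4, 5.1])
petal_width = np.array([0.2, 0.2, 0.2, 0.2, 0.2, 2.3, 1.9, 2.0, 2.3, 1.8])


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length arrays."""
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())


r_manual = pearson_r(petal_length, petal_width)
r_numpy = np.corrcoef(petal_length, petal_width)[0, 1]

print(f"manual: {r_manual:.4f}, numpy: {r_numpy:.4f}")
```

Because this subset contains only setosa and virginica rows, which sit far apart on both measurements, the coefficient here is very close to 1.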


The End 

*** 