
# **Iris Dataset Analysis**




#### Authored by: Stephen Kerr

---



## **Iris Dataset Introduction**

The **Iris Dataset** is a famous, classic multi-class classification dataset containing **$150$ samples** of Iris flowers from three species: **Setosa**, **Versicolor**, and **Virginica**.  
Each sample includes four features: **Sepal length** (in cm), **Sepal width** (in cm), **Petal length** (in cm), and **Petal width** (in cm).

The raw data can be seen in the **Inputs** folder in [iris.data](https://github.com/skerr17/pands_project/blob/main/inputs/iris.data), which was sourced from the [UCI Machine Learning Repository - Iris Dataset](https://archive.ics.uci.edu/dataset/53/iris). 

The Image below illustrates the Three Iris Flower Species and their anatomy (Image sourced from [here](https://www.analyticsvidhya.com/blog/2022/06/iris-flowers-classification-using-machine-learning/)).

![iris flower image](iris_species_image.png) 

---


## **Exploring the Iris Dataset**

After researching the Iris Dataset, I conducted some initial exploration of the data in [iris.data](https://github.com/skerr17/pands_project/blob/main/inputs/iris.data). 

The function `generate_descriptive_statistics()`, found in [analysis_code.py](https://github.com/skerr17/pands_project/blob/main/analysis_code.py), generates the descriptive statistics for the whole Iris Dataset (`global_descriptive_stats`) and for each species (`stats_by_species`).  

`global_descriptive_stats` contains the Count, Mean, Standard Deviation, Minimum, $25th$ Percentile, Median, $75th$ Percentile, and Maximum of each feature across all $n=150$ Iris samples. 

`stats_by_species` contains the same statistics for each **Species** of Iris. It uses `groupby('Species')` to subset the dataset by each sample's species (**Setosa**, **Versicolor**, and **Virginica**), each of which has a sample size of $n=50$.

Finally, the function saves all the Descriptive Statistics in a tabular format to a text file titled **[iris_descriptive_stats.txt](outputs/iris_descriptive_stats.txt)** in the **Outputs Folder**.
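
As a rough illustration of the two steps described above, the sketch below uses Pandas' `describe()` on a whole table and then on a `groupby()` of the same table. It is not the code from `analysis_code.py`: the data is a small synthetic stand-in (not the real Iris measurements), and the names `toy`, `toy_global_stats` and `toy_stats_by_species` are made up for this example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the real Iris measurements),
# used only so that this sketch runs on its own.
rng = np.random.default_rng(0)
species_names = ['setosa', 'versicolor', 'virginica']
toy = pd.DataFrame({
    'sepal_length': rng.normal(5.8, 0.8, 30),
    'petal_length': rng.normal(3.8, 1.7, 30),
    'species': np.repeat(species_names, 10),  # 10 rows per species
})

# Count, mean, std, min, 25%, 50% (median), 75% and max
# for every numeric column across all rows.
toy_global_stats = toy.describe()

# The same statistics, calculated separately for each species.
toy_stats_by_species = toy.groupby('species').describe()

print(toy_global_stats)
print(toy_stats_by_species)
```

The grouped result has one row per species and a two-level column index (feature, statistic), which is why it is convenient to write out as a table.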

### Insights / Observations:

#### Key insights from the **Global Descriptive Statistics** are the following:

-  The **Petal Length & Width** are more variable than the **Sepal Length & Width**, suggesting they might have more predictive power for classification (will return to this point to validate).
- **Sepal Width** is the most consistent feature across all the species with a standard deviation of $0.43 cm$.
- **Petal Length** has the highest variability with a standard deviation of $1.77 cm$.  

Overall, the **global statistics** show that Sepal dimensions tend to be larger than Petal dimensions. Specifically, **Sepal Length** has a mean of $5.84 cm$ and **Sepal Width** a mean of $3.05 cm$, compared to **Petal Length** with a mean of $3.76 cm$ and **Petal Width** with $1.20 cm$.

#### Key insights from the **Descriptive Statistics per Species** are the following:

- **Setosa** is the smallest in overall measurements and has low variability, making it the easiest species to separate from the others.
- **Versicolor** is intermediate in its feature values, with more variability than **Setosa**, and it commonly overlaps with **Virginica**, which makes it harder to classify.
- **Virginica** has the largest Sepal and Petal dimensions and the highest within-species variation.

As suggested by the **Global Descriptive Statistics**, **Petal Length & Width** are the most effective features for distinguishing species, while **Sepal Width** is the least effective.

```python
# Generate the descriptive statistics tables so they can be displayed in the notebook
from pathlib import Path

import pandas as pd

import analysis_code

# Create a directory for the output files if it doesn't exist
output_dir = Path('outputs')
output_dir.mkdir(parents=True, exist_ok=True)

# Path to the iris data file
raw_data_dir = Path('inputs')
raw_data_dir.mkdir(parents=True, exist_ok=True)  # Create the directory if it doesn't exist
data_path = raw_data_dir / 'iris.data'

# Check if the file exists
if data_path.exists():
    # Read the data into a pandas dataframe # Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
    # Add column names to the dataframe from the iris documentation # Reference: https://archive.ics.uci.edu/dataset/53/iris
    iris_data = pd.read_csv(data_path, header=None, usecols=[0, 1, 2, 3, 4],
                names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'])

    # Prepare the data for plotting
    variables, variables_titles, species, format_species, colors, labels = analysis_code.prepare_data(iris_data)

    # Generate the global and per-species descriptive statistics
    global_descriptive_stats, stats_by_species = analysis_code.generate_descriptive_statistics(
        iris_data, output_dir, variables_titles, species, format_species)
else:
    print(f"Error: The file {data_path} does not exist.")
```


---

## **Visualise the Features**

After conducting the descriptive analysis, I visualised the Iris Dataset features and their relationships:
- **Histograms** for each feature using **Matplotlib's** `hist()` method (see documentation [here](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html)).
- **Scatter Plots** for the features using **Matplotlib's** `scatter()` method (see documentation [here](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html)).
- **Pairs Plot** for each feature and their relationship with each other using **Seaborn's** `pairplot()` function (see documentation [here](https://seaborn.pydata.org/generated/seaborn.pairplot.html)).
- Finally, a **Heatmap** of the **Pearson Correlation Coefficient**, calculated using **`.corr()`**, a method from **Pandas** (see `.corr()` method documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html)), and visualised using **Seaborn's** `heatmap()` function (see documentation [here](https://seaborn.pydata.org/generated/seaborn.heatmap.html)).


### **Histograms of the Features**

Histograms are used to display the distribution of numerical features within a dataset by showing the frequency of observations within defined intervals or 'bins'. 

The function `plot_histograms()`, found in [analysis_code.py](https://github.com/skerr17/pands_project/blob/main/analysis_code.py), generates a Histogram for each variable with the species colour coded, and saves each one to its own `.png` file titled `{variable}_histogram.png` in the **[Outputs Folder](https://github.com/skerr17/pands_project/tree/main/outputs)**.

See all the Histograms linked here:
- [petal_length_histogram.png](https://github.com/skerr17/pands_project/blob/main/outputs/petal_length_histogram.png)
- [petal_width_histogram.png](https://github.com/skerr17/pands_project/blob/main/outputs/petal_width_histogram.png)
- [sepal_length_histogram.png](https://github.com/skerr17/pands_project/blob/main/outputs/sepal_length_histogram.png)
- [sepal_width_histogram.png](https://github.com/skerr17/pands_project/blob/main/outputs/sepal_width_histogram.png)
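
A minimal sketch of a species-coloured histogram is shown below. It is not the `plot_histograms()` implementation: the petal lengths are synthetic values chosen so that one group sits apart from the other two, and the bin count and transparency are assumptions made for this example.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic petal lengths (NOT the real Iris measurements).
rng = np.random.default_rng(1)
toy_petal_length = {
    'setosa': rng.normal(1.5, 0.2, 50),
    'versicolor': rng.normal(4.3, 0.5, 50),
    'virginica': rng.normal(5.5, 0.5, 50),
}
colours = {'setosa': 'red', 'versicolor': 'green', 'virginica': 'blue'}

fig, ax = plt.subplots()
for name, values in toy_petal_length.items():
    # One semi-transparent histogram per species on the same axes.
    ax.hist(values, bins=10, alpha=0.5, color=colours[name], label=name)

ax.set_xlabel('Petal length (cm)')
ax.set_ylabel('Frequency')
ax.set_title('Petal length histogram (synthetic data)')
ax.legend(title='Species')
```

Drawing all species on one set of axes with `alpha` below 1 keeps any overlap between the distributions visible.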

#### Insights / Observations:
 
For the Iris dataset, plotting histograms for each feature helps reveal the spread of the data across each species. For example, the [petal_length_histogram.png](https://github.com/skerr17/pands_project/blob/main/outputs/petal_length_histogram.png) clearly shows the difference between the species, with **Setosa** distinctly separated from **Versicolor** and **Virginica**. Therefore, **Petal Length** is a useful feature for classification. See the Histogram below.

<img src="outputs/petal_length_histogram.png" alt="Petal Length Histogram" style="width:400px; height:auto;">



Histograms also aid in spotting skewness or outliers that could influence modeling decisions. The Iris Dataset doesn't seem to have any significant outliers. 

Notably, the [sepal_width_histogram.png](https://github.com/skerr17/pands_project/blob/main/outputs/sepal_width_histogram.png) shows considerable overlap between the species, which illustrates why **Sepal Width** is less effective for classification. See the Histogram below.

<img src="outputs/sepal_width_histogram.png" alt="Sepal Width Histogram" style="width:400px; height:auto;">

Reference: this [Stack Overflow thread](https://stackoverflow.com/questions/41598916/resize-the-image-in-jupyter-notebook-using-markdown) explains how I resized the Histograms.


### **Scatter Plot of the Features**

Scatter plots are used to visualise the relationship between two numerical features.

The function `plot_scatter()`, found in [analysis_code.py](https://github.com/skerr17/pands_project/blob/main/analysis_code.py), generates Scatter Plots of the variables with the species colour coded, and saves them to a single `.png` file titled [iris_scatter.png](https://github.com/skerr17/pands_project/blob/main/outputs/iris_scatter.png) in the **[Outputs Folder](https://github.com/skerr17/pands_project/tree/main/outputs)**. 

#### Insights / Observations: 

- There appears to be a strong correlation between **Petal Length** and **Petal Width** within all the Iris Species.
- As was seen in the Descriptive Statistics and the Histograms **Setosa** is the most distinct from the two other species **Versicolor** and **Virginica** that share significant overlap. See the Scatter Plots below.

<img src="outputs/iris_scatter.png" alt="Iris Scatter Plots" style="width:700px; height:auto;">


### **Pairs Plot of the Features**

**Seaborn's** `pairplot()` function (see documentation [here](https://seaborn.pydata.org/generated/seaborn.pairplot.html)) is a powerful tool for exploratory data analysis, as it provides a quick and comprehensive visual summary of the relationships between multiple numerical variables. See [GeeksforGeeks](https://www.geeksforgeeks.org/python-seaborn-pairplot-method/) for a helpful guide on how to use `pairplot()`.

The function `pairsplots()`, found in [analysis_code.py](https://github.com/skerr17/pands_project/blob/main/analysis_code.py), generates a grid of plots for each pair of variables with the species colour coded, and saves it to a single `.png` file titled [iris_pairplot.png](https://github.com/skerr17/pands_project/blob/main/outputs/iris_pairplot.png) in the **[Outputs Folder](https://github.com/skerr17/pands_project/tree/main/outputs)**. 

In [analysis_code.py](https://github.com/skerr17/pands_project/blob/main/analysis_code.py) `pairsplots()` was configured with the following:
- **Off-Diagonal plots** show scatterplots with regression lines (using the argument `kind='reg'` ), helping us understand the correlation and trends. 
- **Diagonal plots** display the distribution of each variable giving us insight into the spread of the data across the Iris dataset. 
- The `hue='species'` and `palette='colours'` arguments colour the data points by species, using the colour coding applied throughout this analysis of the Iris Dataset: `{'setosa': 'red', 'versicolor': 'green', 'virginica': 'blue'}`.

#### Insights / Observations: 

- The **Pairs Plot** again validates the observations seen in the other visualisations, for instance, the **Diagonal plots** again highlight the previously stated difference between **Sepal** dimensions and **Petal** dimensions in terms of their ability to differentiate between species.

<img src="outputs/iris_pairplot.png" alt="Iris Pairs Plot" style="width:700px; height:auto;">


### **Heatmap of the Correlation Coefficient**

The **Pearson Correlation Coefficient** (also known as the Standard Correlation Coefficient) ranges from -1 to 1 and measures the strength of the linear relationship between two variables. A Heatmap is a convenient way to visualise the correlation between each pair of variables.  
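
For reference, the standard definition of the coefficient for two variables $x$ and $y$ measured on $n$ samples, where $\bar{x}$ and $\bar{y}$ are the sample means, is:

$$
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

The numerator measures how the two variables vary together, and the denominator rescales that quantity by the spread of each variable, which is what keeps $r_{xy}$ between -1 and 1.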

The function `corrleation_matrix_heatmap()`, found in [analysis_code.py](https://github.com/skerr17/pands_project/blob/main/analysis_code.py), generates the Correlation Matrix of the Iris Dataset (called `corr_matrix` in the code) using **`.corr()`**, a method from **Pandas** (see the documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html)).  
Then, using **Seaborn's** `heatmap()` function (see documentation [here](https://seaborn.pydata.org/generated/seaborn.heatmap.html)), a Heatmap is created with a Colour Bar that explains what the colours represent: 
- **Red** indicates a Strong Positive Correlation between Features (values close to **1**), 
- **Light Grey** indicates a Weak Correlation between the Features (values close to **0**),
- **Blue** indicates a Strong Negative Correlation between the Features (values close to **-1**).

Finally, the Heatmap is saved to a `.png` file titled **[iris_correlation_matrix.png](https://github.com/skerr17/pands_project/blob/main/outputs/iris_correlation_matrix.png)** in the **[Outputs Folder](https://github.com/skerr17/pands_project/tree/main/outputs)**.
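
The two steps described above (a correlation matrix from `.corr()`, then a Heatmap) can be sketched as follows. This is an illustration under assumptions, not the project's function: the data is synthetic with made-up feature names, and `cmap='coolwarm'` is assumed because it gives the red / light grey / blue scale described in the list above.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic features (NOT the real Iris measurements) built to have
# a strong positive, a negative, and a near-zero relationship.
rng = np.random.default_rng(3)
base = rng.normal(size=100)
toy = pd.DataFrame({
    'feature_a': base,
    'feature_b': 0.9 * base + rng.normal(scale=0.3, size=100),   # follows feature_a closely
    'feature_c': -0.5 * base + rng.normal(scale=1.0, size=100),  # tends to fall as feature_a rises
    'feature_d': rng.normal(size=100),                           # generated independently
})

# Pairwise Pearson correlation (the default method of DataFrame.corr()).
toy_corr_matrix = toy.corr()

fig, ax = plt.subplots(figsize=(6, 5))
# Fix the colour scale to [-1, 1] so that 0 maps to the neutral middle colour.
sns.heatmap(toy_corr_matrix, annot=True, fmt='.2f', cmap='coolwarm',
            vmin=-1, vmax=1, cbar=True, ax=ax)
ax.set_title('Correlation matrix (synthetic data)')
```

Setting `vmin` and `vmax` explicitly keeps the colours comparable between plots, since otherwise the scale is fitted to whatever range the data happens to cover.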


#### Insights / Observations:
The Correlation Coefficient is useful because it helps identify relationships between features. However, strongly correlated features may provide redundant information, and while weakly correlated features may be independent, there could also be an unknown confounding variable at play. 

>***'Correlation Does Not Imply Causation'***

An **absolute value of 1** (i.e., 1 or -1) indicates a perfect linear relationship between the x and y, with every data point lying on the line. In the Heatmap a value of 1 is seen when a **Feature** is correlated with itself. This value of 1 doesn't provide any useful insight and is an artifact of the way the calculation was done and displayed.

In the Heatmap, an example of a **Strong Positive Correlation Coefficient** is seen between **Sepal Length** and **Petal Length**, with a value of $0.87$, shown in a colour close to red. A **Positive Correlation Coefficient** means that when x increases, y tends to increase as well: when **Sepal Length** increases, **Petal Length** tends to increase too, strongly but not perfectly. 
> **Note:** The **Strongest Positive Correlation** is between ***'Petal Length'*** and ***'Petal Width'*** with a value of $0.96$, which is indicated in the Heatmap with red.  

A value of 0 implies there is no linear relationship between the variables, so values close to 0 indicate a **Weak Correlation Coefficient**. In the Heatmap, for example, **Sepal Length** and **Sepal Width** have a **Weak Negative Correlation Coefficient** of $-0.11$, indicated by a light grey colour.  

A **Negative Correlation Coefficient** means that as x increases, y tends to decrease. For example, **Petal Length** and **Sepal Width** have a **Weak to Moderate Negative Correlation Coefficient** of $-0.42$, which suggests that when **Petal Length** increases, **Sepal Width** typically decreases, but not in a perfectly linear way. 

See the Heatmap below.


<img src="outputs/iris_correlation_matrix.png" alt="Iris Correlation Matrix Heatmap" style="width:700px; height:auto;">


---

## **Principal Component Analysis** 


Following on from the visualisation of the features, I decided to perform a Principal Component Analysis (PCA) on the Iris Dataset. Having too many features in a dataset can cause problems such as overfitting (a model performs well on training data but fails on new data), slower computation, and lower accuracy. This is known as **the curse of dimensionality** (see the [GeeksforGeeks article on the curse of dimensionality](https://www.geeksforgeeks.org/curse-of-dimensionality-in-machine-learning/) and [Wikipedia](https://en.wikipedia.org/wiki/Curse_of_dimensionality) for more details). 

PCA is one of the most widely used dimensionality reduction techniques. It works by transforming high-dimensional data into a lower-dimensional space while retaining as much of the variance (spread) of the data as possible, preserving the most important patterns and relationships in the data. PCA prioritises the directions in which the data varies the most, on the assumption that more variation means more useful information (see the [GeeksforGeeks article on PCA](https://www.geeksforgeeks.org/principal-component-analysis-pca/)).
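
To make the idea concrete, here is a minimal sketch of PCA written with NumPy only, using a singular value decomposition. It is not the `pca_analysis()` implementation: the data is synthetic, and the variable names are made up for this example.

```python
import numpy as np

# Synthetic 4-feature data (NOT the real Iris measurements):
# two features share a common hidden signal, two are pure noise.
rng = np.random.default_rng(4)
latent = rng.normal(size=(150, 1))
X = np.hstack([
    latent + rng.normal(scale=0.2, size=(150, 1)),
    2 * latent + rng.normal(scale=0.2, size=(150, 1)),
    rng.normal(size=(150, 1)),
    rng.normal(size=(150, 1)),
])

# 1. Standardise each feature to mean 0 and standard deviation 1,
#    so that no feature dominates purely because of its scale.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Singular value decomposition: the rows of Vt are the principal
#    directions, ordered from most to least variance.
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)

# 3. Share of the total variance captured by each component.
explained_variance_ratio = S**2 / np.sum(S**2)

# 4. Project onto the first two principal components (4 features -> 2).
X_2d = X_std @ Vt[:2].T

print(explained_variance_ratio)
print(X_2d.shape)
```

The two new columns are uncorrelated with each other, which is why a 2D scatter plot of them is a common way to inspect how well groups separate after the reduction.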


The function `pca_analysis()`, found in [analysis_code.py](https://github.com/skerr17/pands_project/blob/main/analysis_code.py), performs PCA on the Iris Dataset to reduce the number of features from 4 to 2. 
 
The Iris Dataset is first standardised using **** ``


Note: I chose to explore PCA because, as part of my job, I work with Data Scientists who create models of real-time bioreactors producing drugs. They have mentioned techniques such as PCA and MSPM (multivariate statistical process monitoring), and they do a good deal of feature engineering, so I wanted to learn what PCA means.


