
# **Iris Dataset Analysis**




#### Authored by: Stephen Kerr

---



## **Iris Dataset Introduction**

The **Iris Dataset** is a classic multi-class classification dataset containing **$150$ samples** of Iris flowers from three species: **Setosa**, **Versicolor**, and **Virginica**.   
Each sample includes four features: **Sepal length** (in cm), **Sepal width** (in cm), **Petal length** (in cm), and **Petal width** (in cm).

The raw data is stored in the **Inputs** folder as [iris.data](https://github.com/skerr17/pands_project/blob/main/inputs/iris.data), which was sourced from the [UCI Machine Learning Repository - Iris Dataset](https://archive.ics.uci.edu/dataset/53/iris). 

The Image below illustrates the Three Iris Flower Species and their anatomy (Image sourced from [here](https://www.analyticsvidhya.com/blog/2022/06/iris-flowers-classification-using-machine-learning/)).

![iris flower image](inputs/iris_species_image.png) 

---


## **Exploring the Iris Dataset**

After some background research on the Iris Dataset, I carried out an initial exploration of the data in [iris.data](https://github.com/skerr17/pands_project/blob/main/inputs/iris.data). 

The function `generate_descriptive_statistics()`, found in [analysis_code.py](https://github.com/skerr17/pands_project/blob/main/analysis_code.py), generates descriptive statistics for the Iris Dataset both globally and by species (`global_descriptive_stats` and `stats_by_species`).  

For the global table (`global_descriptive_stats`), the function calculates the Count, Mean, Standard Deviation, Minimum, $25th$ Percentile, Median, $75th$ Percentile, and Maximum for all $n=150$ Iris samples. 

For the per-species table (`stats_by_species`), it calculates the same statistics for each **Species** of Iris, using `groupby('Species')` to subset the dataset by each sample's species (**Setosa**, **Versicolor**, and **Virginica**), each with a sample size of $n=50$.

Finally, the function saves all the descriptive statistics in tabular format to a text file titled **[iris_descriptive_stats.txt](outputs/iris_descriptive_stats.txt)** in the **Outputs Folder**.
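
The two tables described above can be sketched with plain pandas. This is a minimal illustration, not the code in `analysis_code.py`: it assumes scikit-learn's bundled copy of Iris as a stand-in for `inputs/iris.data` (so the column names differ from the notebook's), and it uses `describe()` as one plausible way to obtain the listed statistics.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for inputs/iris.data.
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={"target": "species"})
df["species"] = df["species"].map(dict(enumerate(iris.target_names)))

# Global statistics: count, mean, std, min, quartiles and max over all 150 rows.
global_stats = df.describe()

# Per-species statistics: the same summary computed on each group of 50 rows.
stats_by_species = df.groupby("species").describe()

print(global_stats.round(2))
print(stats_by_species["petal length (cm)"].round(2))
```

Grouping before summarising is what exposes the between-species differences that the global table averages away.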

### Insights / Observations:

#### Key insights from the **Global Descriptive Statistics** are the following:

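- The **Petal Length & Width** vary more relative to their means than the **Sepal Length & Width** (for example, Petal Width has a standard deviation of $0.76 cm$ on a mean of $1.20 cm$, whereas Sepal Length has $0.83 cm$ on a mean of $5.84 cm$), suggesting they might have more predictive power for classification (I return to this point later to validate it).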
- **Sepal Width** is the most consistent feature across all the species with a standard deviation of $0.43 cm$.
- **Petal Length** has the highest variability with a standard deviation of $1.77 cm$.  

Overall, the **global statistics** show that Sepal dimensions tend to be larger than Petal dimensions. Specifically, **Sepal Length** has a mean of $5.84 cm$ and **Sepal Width** a mean of $3.05 cm$, compared to **Petal Length** with a mean of $3.76 cm$ and **Petal Width** with $1.20 cm$.

#### Key insights from the **Descriptive Statistics per Species** are the following:

- **Setosa** is the smallest in overall measurements and has low variability, making it the easiest species to separate from the others.
- **Versicolor** is intermediate in its feature values, with more variability than **Setosa**, and commonly overlaps with **Virginica**, which makes it harder to classify.
- **Virginica** has the largest Sepal and Petal dimensions, with the highest within-species variation.

As suggested by the **Global Descriptive Statistics**, **Petal Length & Width** are the most effective features for distinguishing species, while **Sepal Width** is the least effective.

### **Imports**

In [1]:
# import pandas as pd
import pandas as pd

# import numpy as np
import numpy as np

# import matplotlib.pyplot as plt
import matplotlib.pyplot as plt

# import tabulate - For nice tables - Reference: https://pypi.org/project/tabulate/
from tabulate import tabulate

# from pathlib import Path - Reference: https://docs.python.org/3/library/pathlib.html
# imported pathlib as it has better error handling and is more robust and OS independent than os.path
from pathlib import Path

# import seaborn as sns - Reference: https://seaborn.pydata.org/index.html
import seaborn as sns

# import PCA from sklearn.decomposition - Reference: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
from sklearn.decomposition import PCA

# import StandardScaler from sklearn.preprocessing - Reference: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
from sklearn.preprocessing import StandardScaler

# our own code module for analysis
import analysis_code as ac

# import Ipython.display as display - to display my tables
from IPython.display import display 


### **Setting up the parameters**

In [2]:
# create a directory for the output files
output_dir = Path('outputs')
output_dir.mkdir(parents=True, exist_ok=True) # Create the directory if it doesn't exist

# Read the data from the iris.data file
raw_data_dir = Path('inputs') # Path to the iris data file
raw_data_dir.mkdir(parents=True, exist_ok=True) # Create the directory if it doesn't exist
data_path = raw_data_dir / 'iris.data' # Path to the iris data file
# Check if the file exists
if data_path.exists():
    # Read the data into a pandas dataframe # Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
    # Add column names to the dataframe from the iris documentation # Reference: https://archive.ics.uci.edu/dataset/53/iris
    iris_data = pd.read_csv(data_path, header=None, usecols=[0, 1, 2, 3, 4], 
                names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'])
else:
    # Stop here: without the file, iris_data would be undefined in the next step
    raise FileNotFoundError(f"The file {data_path} does not exist.")

# prepare the data using the analysis_code module
variables, variables_titles, species, format_species, colours, labels = ac.prepare_data(iris_data)


### **Displaying the Head of the Iris DataFrame**

In [3]:
# display the first 10 rows of the Iris dataframe
iris_data.head(10).style.set_caption("Iris DataFrame")

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa
8,4.4,2.9,1.4,0.2,Iris-setosa
9,4.9,3.1,1.5,0.1,Iris-setosa


### **Displaying the Descriptive Statistics Tables**

In [4]:

# generate descriptive statistics both global and by species
global_iris_descriptive_stats, iris_descriptive_stats_by_species = ac.generate_descriptive_statistics(iris_data, output_dir, variables_titles, species, format_species)

In [5]:

# print the descriptive statistics to the console
print('===Global Descriptive Statistics===\n')
print(tabulate(global_iris_descriptive_stats, headers='keys', tablefmt='grid'))
print('\n\n')
print('===Descriptive Statistics by Species===\n')
print(tabulate(iris_descriptive_stats_by_species.stack(future_stack=True), headers='keys', tablefmt='grid'))

===Global Descriptive Statistics===

+--------------------+----------------+---------------+----------------+---------------+
|                    |   Sepal Length |   Sepal Width |   Petal Length |   Petal Width |
+====================+================+===============+================+===============+
| Count              |     150        |    150        |      150       |    150        |
+--------------------+----------------+---------------+----------------+---------------+
| Mean               |       5.84333  |      3.054    |        3.75867 |      1.19867  |
+--------------------+----------------+---------------+----------------+---------------+
| Standard Deviation |       0.828066 |      0.433594 |        1.76442 |      0.763161 |
+--------------------+----------------+---------------+----------------+---------------+
| Minimum            |       4.3      |      2        |        1       |      0.1      |
+--------------------+----------------+---------------+----------------+---------------+
| 25th Percentile    |       5.1      |      2.8      |        1.6     | 


---

## **Data Visualisation**

After the descriptive analysis, I visualised the Iris dataset features and their relationships: 
- First, **Histograms** for each feature using **Matplotlib's** `hist()` method (see documentation [here](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html)).
- **Scatter Plots** for each pair of features using **Matplotlib's** `scatter()` method (see documentation [here](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html)).
- A **Pairs Plot** of the features and their relationships with each other using **Seaborn's** `pairplot()` function (see documentation [here](https://seaborn.pydata.org/generated/seaborn.pairplot.html)).
- Finally, a **Heatmap** of the **Pearson Correlation Coefficient**, calculated using the **Pandas** method **`.corr()`** (see `.corr()` method documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html)) and visualised using **Seaborn's** `heatmap()` function (see documentation [here](https://seaborn.pydata.org/generated/seaborn.heatmap.html)).


### **Histograms of the Features**

Histograms are used to display the distribution of numerical features within a dataset by showing the frequency of observations within defined intervals or 'bins'. 

The function `plot_histograms()`, found in [analysis_code.py](https://github.com/skerr17/pands_project/blob/main/analysis_code.py), generates a Histogram for each variable, colour coded by species, and saves each one as an individual `.png` file titled `{variable}_histogram.png` in the **[Outputs Folder](https://github.com/skerr17/pands_project/tree/main/outputs)**.
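
A species-coloured histogram of this kind can be sketched as follows. This is an illustrative sketch only; the real `plot_histograms()` may differ. The bin count, transparency, and the use of scikit-learn's bundled Iris data are all assumptions made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for inputs/iris.data.
iris = load_iris(as_frame=True)
df = iris.frame
colours = {"setosa": "red", "versicolor": "green", "virginica": "blue"}

feature = "petal length (cm)"
fig, ax = plt.subplots(figsize=(6, 4))
for code, name in enumerate(iris.target_names):
    # Overlay one semi-transparent histogram per species on the same axes.
    values = df.loc[df["target"] == code, feature]
    ax.hist(values, bins=10, alpha=0.6, color=colours[name], label=name)

ax.set_xlabel(feature)
ax.set_ylabel("Frequency")
ax.set_title("Petal length by species")
ax.legend()
# fig.savefig("petal_length_histogram.png")  # uncomment to write the file
```

Drawing the species on shared axes, rather than in separate panels, is what makes overlap or separation between them visible at a glance.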

See all the Histograms linked here:
- [petal_length_histogram.png](https://github.com/skerr17/pands_project/blob/main/outputs/petal_length_histogram.png)
- [petal_width_histogram.png](https://github.com/skerr17/pands_project/blob/main/outputs/petal_width_histogram.png)
- [sepal_length_histogram.png](https://github.com/skerr17/pands_project/blob/main/outputs/sepal_length_histogram.png)
- [sepal_width_histogram.png](https://github.com/skerr17/pands_project/blob/main/outputs/sepal_width_histogram.png)

#### **Insights / Observations:**
 
For the Iris dataset, plotting histograms for each feature helps reveal the spread of the data across each species. For example, the [petal_length_histogram.png](https://github.com/skerr17/pands_project/blob/main/outputs/petal_length_histogram.png) clearly shows the difference between the species, with **Setosa** distinctly separated from **Versicolor** and **Virginica**. **Petal Length** is therefore a useful feature for classification. See the Histogram below.

<img src="outputs/petal_length_histogram.png" alt="Petal Length Histogram" style="width:400px; height:auto;">



Histograms also aid in spotting skewness or outliers that could influence modelling decisions. The Iris Dataset doesn't seem to have any significant outliers. 

Notably, the [sepal_width_histogram.png](https://github.com/skerr17/pands_project/blob/main/outputs/sepal_width_histogram.png) shows heavy overlap between the species, which illustrates why **Sepal Width** is less effective for classification. See the Histogram below.

<img src="outputs/sepal_width_histogram.png" alt="Sepal Width Histogram" style="width:400px; height:auto;">

Reference: see this Stack Overflow thread, linked [here](https://stackoverflow.com/questions/41598916/resize-the-image-in-jupyter-notebook-using-markdown), for how I resized the Histograms.


### **Scatter Plot of the Features**

Scatter plots are used to visualise the relationship between two numerical features.

The function `plot_scatter()`, found in [analysis_code.py](https://github.com/skerr17/pands_project/blob/main/analysis_code.py), generates a Scatter Plot for each pair of variables, colour coded by species, and saves them all to a single `.png` file titled [iris_scatter.png](https://github.com/skerr17/pands_project/blob/main/outputs/iris_scatter.png) in the **[Outputs Folder](https://github.com/skerr17/pands_project/tree/main/outputs)**. 
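
With four features there are six unique feature pairs, so a grid of six scatter plots covers every pairwise relationship. The sketch below shows one way to build such a grid; the 2x3 layout, figure size, and use of scikit-learn's bundled Iris data are assumptions, and the real `plot_scatter()` may be organised differently.

```python
from itertools import combinations

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for inputs/iris.data.
iris = load_iris(as_frame=True)
df = iris.frame
colours = ["red", "green", "blue"]  # setosa, versicolor, virginica

# Every unordered pair of the four features: 4 choose 2 = 6 pairs.
pairs = list(combinations(iris.feature_names, 2))

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, (x, y) in zip(axes.flat, pairs):
    for code, name in enumerate(iris.target_names):
        subset = df[df["target"] == code]
        ax.scatter(subset[x], subset[y], color=colours[code], label=name, alpha=0.7)
    ax.set_xlabel(x)
    ax.set_ylabel(y)

axes.flat[0].legend()
fig.tight_layout()
# fig.savefig("iris_scatter.png")  # uncomment to write the file
```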

#### **Insights / Observations:** 

- There appears to be a strong correlation between **Petal Length** and **Petal Width** within all the Iris Species.
- As was seen in the Descriptive Statistics and the Histograms, **Setosa** is the most distinct species, while **Versicolor** and **Virginica** share significant overlap. 

See the Scatter Plots below.

<img src="outputs/iris_scatter.png" alt="Iris Scatter Plots" style="width:700px; height:auto;">


### **Pairs Plot of the Features**

**Seaborn's** `pairplot()` function (see documentation [here](https://seaborn.pydata.org/generated/seaborn.pairplot.html)) is a powerful tool for exploratory data analysis, as it provides a quick and comprehensive visual summary of the relationships between multiple numerical variables. See [GeeksforGeeks](https://www.geeksforgeeks.org/python-seaborn-pairplot-method/) for a helpful guide on how to use `pairplot()`.

The function `pairsplots()`, found in [analysis_code.py](https://github.com/skerr17/pands_project/blob/main/analysis_code.py), generates a grid of plots covering every pair of variables, colour coded by species, and saves it to a single `.png` file titled [iris_pairplot.png](https://github.com/skerr17/pands_project/blob/main/outputs/iris_pairplot.png) in the **[Outputs Folder](https://github.com/skerr17/pands_project/tree/main/outputs)**. 

In [analysis_code.py](https://github.com/skerr17/pands_project/blob/main/analysis_code.py) `pairsplots()` was configured with the following:
- **Off-Diagonal plots** show scatterplots with regression lines (using the argument `kind='reg'`), helping us understand the correlation and trends. 
- **Diagonal plots** display the distribution of each variable, giving us insight into the spread of the data across the Iris dataset. 
- The `hue='species'` and `palette=colours` arguments colour the data points by species, using the colour coding applied throughout this analysis of the Iris Dataset: `{'setosa': 'red', 'versicolor': 'green', 'virginica': 'blue'}`.

#### **Insights / Observations:** 

The **Pairs Plot** again validates the observations from the other visualisations. For instance, the **Diagonal plots** again highlight the previously stated difference between **Sepal** dimensions and **Petal** dimensions in terms of their ability to differentiate between species. 

See the Pairs Plot below.

<img src="outputs/iris_pairplot.png" alt="Iris Pairs Plot" style="width:700px; height:auto;">


### **Heatmap of the Correlation Coefficient**

The **Pearson Correlation Coefficient** (also known as the Standard Correlation Coefficient) ranges from -1 to 1 and measures the linear relationship between two variables. A Heatmap is a convenient way to visualise the correlation between every pair of variables at a glance.  
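
To make the definition concrete, the coefficient can be computed by hand as the sum of products of deviations from the mean, divided by the square root of the product of the summed squared deviations. The sketch below does this for the first ten rows of the dataset (the Sepal Length and Sepal Width values shown in the DataFrame head earlier) and checks the result against NumPy. The helper name `pearson_r` is hypothetical and introduced only for this example; ten Setosa rows are far too few to say anything about the full dataset.

```python
import numpy as np

# First ten rows of the dataset (all Setosa), as displayed in the DataFrame head.
sepal_length = np.array([5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9])
sepal_width = np.array([3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1])


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length 1D arrays."""
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())


r_manual = pearson_r(sepal_length, sepal_width)
r_numpy = np.corrcoef(sepal_length, sepal_width)[0, 1]
print(round(r_manual, 4), round(r_numpy, 4))
```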

The function `correlation_matrix_heatmap()`, found in [analysis_code.py](https://github.com/skerr17/pands_project/blob/main/analysis_code.py), generates the Correlation Matrix of the Iris Dataset (called `corr_matrix` in the code) using the **Pandas** method **`.corr()`** (see the documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html)).  
Then, using **Seaborn's** `heatmap()` function (see documentation [here](https://seaborn.pydata.org/generated/seaborn.heatmap.html)), a Heatmap is created with a Colour Bar included to explain what the colours represent: 
- **Red** indicates a Strong Positive Correlation between Features (values close to **1**), 
- **Light Grey** indicates a Weak Correlation between the Features (values close to **0**),
- **Blue** indicates a Strong Negative Correlation between the Features (values close to **-1**).

Finally the Heatmap is saved as a `.png` file titled [iris_correlation_matrix.png](https://github.com/skerr17/pands_project/blob/main/outputs/iris_correlation_matrix.png) in the [Outputs Folder](https://github.com/skerr17/pands_project/tree/main/outputs).
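
A minimal sketch of these two steps is shown below. It is not the code in `correlation_matrix_heatmap()`: the `coolwarm` colour map is an assumption chosen because it matches the red / light grey / blue scheme described above, and scikit-learn's bundled Iris data stands in for `inputs/iris.data`, so the values may differ slightly from the notebook's output.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for inputs/iris.data.
features = load_iris(as_frame=True).data  # the four numeric feature columns

# DataFrame.corr() computes pairwise Pearson correlations by default.
corr_matrix = features.corr()

fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(
    corr_matrix,
    annot=True,       # write each coefficient inside its cell
    fmt=".2f",
    cmap="coolwarm",  # assumption: blue for negative, red for positive
    vmin=-1,
    vmax=1,           # fix the colour scale to the full range of the coefficient
    ax=ax,
)
ax.set_title("Iris correlation matrix")
# fig.savefig("iris_correlation_matrix.png")  # uncomment to write the file
```

Pinning `vmin` and `vmax` to -1 and 1 keeps the colours comparable across datasets instead of stretching them to whatever range happens to occur.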


#### **Insights / Observations:**
The Correlation Coefficient is useful because it helps identify relationships between features. However, strongly correlated features may carry redundant information, weakly correlated features may be independent of each other, and any observed correlation could be driven by an unknown confounding variable. 

>***'Correlation Does Not Imply Causation'***

An **absolute value of 1** (i.e., 1 or -1) indicates a perfect linear relationship between x and y, with every data point lying on the line. In the Heatmap a value of 1 is seen when a **Feature** is correlated with itself. This value of 1 doesn't provide any useful insight; it simply reflects that every variable is perfectly correlated with itself.

In the Heatmap, an example of a **Strong Positive Correlation Coefficient** is seen between **Sepal Length** and **Petal Length**, with a value of $0.87$, indicated visually by a colour close to red. A **Positive Correlation Coefficient** means that when x increases, y tends to increase as well: when **Sepal Length** increases, **Petal Length** tends to increase too, with a strong but not perfect correlation. 
> **Note:** The **Strongest Positive Correlation** is between **Petal Length** and **Petal Width**, with a value of $0.96$, which is indicated in the Heatmap with red.  

A value of 0 implies there is no linear relationship between the variables. Therefore, values close to 0 indicate a **Weak Correlation Coefficient**. In the Heatmap, for example, **Sepal Length** and **Sepal Width** have a **Weak Negative Correlation Coefficient** of $-0.11$, indicated by a light grey colour.  

A **Negative Correlation Coefficient** means that as x increases, y tends to decrease. For example, **Petal Length** and **Sepal Width** have a **Weak to Moderate Negative Correlation Coefficient** of $-0.42$, which suggests that when **Petal Length** increases, **Sepal Width** typically decreases, but not in a perfectly linear way. 

See the Heatmap below.


<img src="outputs/iris_correlation_matrix.png" alt="Iris Correlation Matrix Heatmap" style="width:700px; height:auto;">

### **Displaying the Correlation Matrix**

In [6]:
# generate the correlation matrix
correlation_matrix = ac.correlation_matrix_heatmap(iris_data, variables_titles, output_dir)

In [7]:
# display the correlation matrix
correlation_matrix.style.set_caption("Iris Dataset Correlation Matrix")

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
sepal_length,1.0,-0.109369,0.871754,0.817954
sepal_width,-0.109369,1.0,-0.420516,-0.356544
petal_length,0.871754,-0.420516,1.0,0.962757
petal_width,0.817954,-0.356544,0.962757,1.0



---

## **Principal Component Analysis** 


Following on from the visualisation of the features, I decided to perform a **Principal Component Analysis (PCA)** on the Iris Dataset. Having too many features in a dataset can cause problems such as overfitting (a model performs well on training data but fails on new data), slower computation, and lower accuracy. This is known as **the curse of dimensionality** (see the [GeeksforGeeks article on the curse of dimensionality](https://www.geeksforgeeks.org/curse-of-dimensionality-in-machine-learning/) and [Wikipedia](https://en.wikipedia.org/wiki/Curse_of_dimensionality) for more details). 

PCA is one of the most widely used dimensionality reduction techniques. It works by transforming high-dimensional data into a lower-dimensional space while retaining as much of the variance (spread) of the data as possible, preserving the most important patterns and relationships in the data. PCA prioritises the directions in which the data varies the most, on the assumption that more variation means more useful information (see the [GeeksforGeeks article on PCA](https://www.geeksforgeeks.org/principal-component-analysis-pca/)).


The function `pca_analysis()`, found in [analysis_code.py](https://github.com/skerr17/pands_project/blob/main/analysis_code.py), performs PCA on the Iris Dataset to reduce the number of features from 4 to 2 and then visualises the result in a scatter plot. 


>Note: I chose to do PCA because, as part of my job, I work with Data Scientists in the Life Sciences industry who build models of real-time data from manufacturing processes (for example, the measured conditions, such as pH, live cell count, and temperature, of bioreactors producing drugs). They have mentioned techniques such as PCA and MSPM (multivariate statistical process monitoring) in the course of my work, and they do a good deal of feature engineering, so I wanted to learn what PCA involves.


#### **Preprocessing for PCA**
1. The Iris Dataset is first standardised using `StandardScaler()` from **sklearn.preprocessing** (see documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)). The scaler standardises the features so that each has a mean of zero and a variance of one. 
2. The `scaled_data` (which will be used in the PCA) is created using `fit_transform()`. This **sklearn** method fits the scaler to the Iris Dataset (learning each feature's mean and standard deviation) and then transforms the data (see documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html#sklearn.base.TransformerMixin.fit_transform)). 
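
The two preprocessing steps can be sketched as below, with a quick check that each feature ends up with mean zero and unit standard deviation. The sketch assumes scikit-learn's bundled Iris data as a stand-in for `inputs/iris.data`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris data stands in for inputs/iris.data.
X = load_iris().data  # shape (150, 4): one row per flower, one column per feature

scaler = StandardScaler()
# fit: learn each column's mean and standard deviation;
# transform: subtract the mean and divide by the standard deviation.
scaled_data = scaler.fit_transform(X)

print(np.round(scaled_data.mean(axis=0), 6))  # column means after scaling
print(np.round(scaled_data.std(axis=0), 6))   # column standard deviations after scaling
```

Standardising matters for PCA because the method is driven by variance: without it, the feature with the largest raw spread (here Petal Length) would dominate the components.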

#### **Performing the PCA** 
Now that the Iris Dataset is standardised, we can perform the PCA using the `PCA()` class from **sklearn.decomposition** (see the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)). In `PCA()` we can define the number of principal components we want to reduce the Iris Dataset to. In my code I chose to reduce the data to 2 components (`PCA(n_components=2)`). As before, we then apply the PCA to the scaled data using its `fit_transform()` method; this time, however, it returns the data with reduced dimensionality (originally 4D, now 2D).
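
A minimal sketch of this step, again assuming scikit-learn's bundled Iris data in place of `inputs/iris.data`:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris data stands in for inputs/iris.data.
scaled_data = StandardScaler().fit_transform(load_iris().data)

pca = PCA(n_components=2)                    # keep the two strongest components
pca_result = pca.fit_transform(scaled_data)  # fit the components, then project the data

print(scaled_data.shape, "->", pca_result.shape)  # (150, 4) -> (150, 2)
```

Each row of `pca_result` is the same flower expressed in the new two-dimensional coordinate system.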

#### **Package the PCA Results into a DataFrame** 
The `pca_result` variable that contains the PCA Results is a **numpy array**. To make it easier to plot, I packaged the PCA Results into a Pandas DataFrame called `pca_df`, setting the column names (`columns=['PC1', 'PC2']`) and adding the **Species** column so that the PCA Results can be plotted by species. 
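
The packaging step might look like the sketch below. The variable names `pca_result` and `pca_df` follow the text; the species labels come from scikit-learn's bundled Iris data, which is an assumption and uses names such as `setosa` rather than the `Iris-setosa` labels in `inputs/iris.data`.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris data stands in for inputs/iris.data.
iris = load_iris()
pca_result = PCA(n_components=2).fit_transform(
    StandardScaler().fit_transform(iris.data)
)

# Wrap the (150, 2) array in a DataFrame with named component columns.
pca_df = pd.DataFrame(pca_result, columns=["PC1", "PC2"])
# Row order is unchanged by PCA, so the species labels line up by position.
pca_df["species"] = iris.target_names[iris.target]

print(pca_df.head())
```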

#### **Calculated the Explained Variance Ratio for the PCA** 
In PCA, the **explained variance ratio** indicates how much of the original data's variability is captured by each principal component, i.e., the information that was retained from the original data after the PCA was performed. The ratios are read from the `explained_variance_ratio_` attribute of the fitted **sklearn** `PCA` object (see the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)). 

Explained Variance Ratios: 
- `pc1_variance`$= 72.77\%$,
- `pc2_variance`$= 23.03\%$,
- `total_variance`$=95.80\%$ (the sum of the two principal components' explained variance ratios). 
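
To show where these ratios come from, the sketch below reads them from a fitted `PCA` object and also recomputes them as each eigenvalue of the covariance matrix divided by the sum of all eigenvalues. It assumes scikit-learn's bundled Iris data rather than `inputs/iris.data`, so the printed percentages may differ slightly from the figures listed above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris data stands in for inputs/iris.data.
scaled = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=2).fit(scaled)

# Manual route: eigenvalues of the covariance matrix, sorted largest first.
eigenvalues = np.linalg.eigvalsh(np.cov(scaled, rowvar=False))[::-1]
manual_ratio = eigenvalues / eigenvalues.sum()

# Library route: the same quantities for the two retained components.
pc1_variance, pc2_variance = pca.explained_variance_ratio_
total_variance = pc1_variance + pc2_variance

print(f"PC1: {pc1_variance:.2%}, PC2: {pc2_variance:.2%}, total: {total_variance:.2%}")
print("manual:", np.round(manual_ratio[:2], 4))
```

The remaining two eigenvalues account for the variance that is discarded when only two components are kept.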


#### **Plotting the PCA Results into a Scatter Plot** 
The final thing the `pca_analysis()` function does is plot the **PCA Result Data** as a scatter plot using **Matplotlib** and save it as a `.png` file titled [iris_pca.png](https://github.com/skerr17/pands_project/blob/main/outputs/iris_pca.png) in the [Outputs Folder](https://github.com/skerr17/pands_project/tree/main/outputs).
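
A plot of this kind can be sketched as follows. The styling choices (figure size, transparency, axis labels) are assumptions, as is the use of scikit-learn's bundled Iris data; the real `pca_analysis()` may draw the figure differently.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris data stands in for inputs/iris.data.
iris = load_iris()
pcs = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(iris.data))
colours = ["red", "green", "blue"]  # setosa, versicolor, virginica

fig, ax = plt.subplots(figsize=(7, 5))
for code, name in enumerate(iris.target_names):
    mask = iris.target == code  # rows belonging to this species
    ax.scatter(pcs[mask, 0], pcs[mask, 1], color=colours[code], label=name, alpha=0.7)

ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_title("PCA of the Iris Dataset")
ax.legend()
# fig.savefig("iris_pca.png")  # uncomment to write the file
```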





#### **Insights / Observations:**

The **PCA** on the Iris Dataset successfully reduced the dimensionality from 4D to 2D while preserving most of the variance. It showed a clear separation of **Setosa** from the other species and highlighted the overlap between **Versicolor** and **Virginica**.

See the PCA Scatter Plot below.

<img src="outputs/iris_pca.png" alt="Iris PCA Plot" style="width:700px; height:auto;">


To highlight the power of PCA, let us visually compare the [iris_scatter.png](https://github.com/skerr17/pands_project/blob/main/outputs/iris_scatter.png) created previously with the PCA scatter plot [iris_pca.png](https://github.com/skerr17/pands_project/blob/main/outputs/iris_pca.png). Note: I added the two images to a table using HTML so they can be viewed side by side (see the Stack Overflow thread [here](https://stackoverflow.com/questions/33647774/how-to-include-two-pictures-side-by-side-in-for-ipython-notebook-jupyter)). 

As can be seen in the two images below, the **iris_scatter.png** (on the right) shows all the pairwise relationships between the features of the Iris Dataset (6 scatter plots in total). Some show a strong correlation, such as **Petal Length vs Petal Width**, while others show a weak correlation, such as **Sepal Width vs Petal Width**. In contrast, the **iris_pca.png** (on the left) combines the features into two principal components and so condenses the data into 1 scatter plot, reducing the visual overload with minimal information loss. 

The separation of the different Iris species was also maintained following the PCA reduction of features from 4 to 2: both the distinct separation of **Setosa** described previously and the overlap of **Versicolor** and **Virginica** remain visible. 

Given the above observations, it is fair to say the PCA was an efficient summariser of the original data structure, reducing complexity while retaining most of the information (with $95.80\%$ total explained variance). 

<table>
    <tr>
        <td><img src="outputs/iris_pca.png" alt="Iris PCA Plot" style="width:700px; height:auto;"></td>
        <td><img src="outputs/iris_scatter.png" alt="Iris Scatter Plots" style="width:700px; height:auto;"></td>
    </tr>
</table>


### **Displaying the PCA 2D DataFrame**

In [8]:
# perform PCA analysis
pca_df = ac.pca_analysis(iris_data, variables, species, colours, output_dir)

In [9]:
# display the first 10 rows of the PCA dataframe
pca_df.head(10).style.set_caption("Iris PCA Analysis")

Unnamed: 0,PC1,PC2,species
0,-2.264542,0.505704,Iris-setosa
1,-2.086426,-0.655405,Iris-setosa
2,-2.36795,-0.318477,Iris-setosa
3,-2.304197,-0.575368,Iris-setosa
4,-2.388777,0.674767,Iris-setosa
5,-2.070537,1.518549,Iris-setosa
6,-2.445711,0.074563,Iris-setosa
7,-2.233842,0.247614,Iris-setosa
8,-2.341958,-1.095146,Iris-setosa
9,-2.188676,-0.448629,Iris-setosa



---

## **Conclusion** 

This Jupyter Notebook provides a comprehensive analysis of the Iris Dataset, covering data exploration, visualisation, and finally a dimensionality reduction technique (PCA). Below is a summary of the steps we took and the key insights gained: 

#### 1.  **Iris Dataset Introduction**

-   Following some external research, we outlined the context of the Iris Dataset: its origin, its use as a multi-class classification dataset, the number of samples, and what they represent (Iris flowers from three different species).

#### 2.  **Exploring Iris Dataset**

-   After we established the context around the data we took a deep dive into exploring it by conducting descriptive statistics both globally and by species. 
-   This initial exploration provided the foundation for further analysis. The key observations were that **Petal Length** and **Petal Width** show high variability, making them more useful for classification, and that **Sepal Width** is the most consistent feature across the species, making it less useful for classification. 

#### 3.  **Data Visualisation**

- Next we conducted some data visualisation creating: 

    - **Histograms**, which highlighted the distribution of each feature across the species, 
    - **Scatter Plots**, which displayed the relationships between features, again with a strong correlation observed between **Petal Length** and **Petal Width**,
    - a **Pairs Plot**, which provided a comprehensive view of all the feature relationships and the distribution of each feature (combining the value of the histograms and scatter plots into one grid), 
    - a **Heatmap**, which visualised the correlation matrix and confirmed the strong positive correlation between **Petal Length** and **Petal Width**. 


#### 4.  **Principal Component Analysis (PCA)**

- Finally, a PCA was performed to reduce the dataset's dimensionality from 4 to 2 while retaining $95.80\%$ of the total variance seen in the original dataset. 


### **Key Takeaways**

- The Iris Dataset's features, particularly **Petal Length** and **Petal Width**, are effective for distinguishing Iris species, especially **Setosa** from the other species. 

- However, as shown above, **Versicolor** and **Virginica** overlap, suggesting these two species are more difficult to differentiate based on the features provided in the Iris Dataset.  

- PCA proved to be a valuable tool for reducing the complexity while preserving the Iris Dataset structure, enabling better visualisation and understanding of the data. 

Overall, the analysis demonstrated the importance of combining statistical exploration, visualisation, and dimensionality reduction for effective data understanding. 

# **End** 