# **Iris Dataset Analysis**

### Authored by: Stephen Kerr

---

## **Iris Dataset Introduction**

The **Iris Dataset** is a classic multi-class classification dataset containing **$150$ samples** of Iris flowers from three species: **Setosa**, **Versicolor**, and **Virginica**.   
Each sample includes four features: **Sepal length**, **Sepal width**, **Petal length**, and **Petal width** (all in cm).

The raw data is in the **Inputs** folder in [iris.data](https://github.com/skerr17/pands_project/blob/main/inputs/iris.data), which was sourced from the [UCI Machine Learning Repository - Iris Dataset](https://archive.ics.uci.edu/dataset/53/iris). 

The image below illustrates the three Iris flower species and their anatomy (image sourced from [here](https://www.analyticsvidhya.com/blog/2022/06/iris-flowers-classification-using-machine-learning/)).

![iris flower image](iris_species_image.png) 

---

## **Exploring the Iris Dataset**

After some background research on the Iris Dataset, I conducted an initial exploration of the data in [iris.data](https://github.com/skerr17/pands_project/blob/main/inputs/iris.data). 

The function `generate_descriptive_statistics()` found in [analysis_code.py](https://github.com/skerr17/pands_project/blob/main/analysis_code.py), generates the descriptive statistics for the global Iris Dataset and by species (`global_descriptive_stats` and `stats_by_species`).  

`global_descriptive_stats` holds the Count, Mean, Standard Deviation, Minimum, 25th Percentile, Median, 75th Percentile, and Maximum across all $n=150$ Iris samples. 

In contrast, `stats_by_species` holds the same statistics for each **Species** of Iris, using `groupby('Species')` to subset the dataset by each sample's species (**Setosa**, **Versicolor**, and **Virginica**), each with a sample size of $n=50$.

Finally, the function saves all the descriptive statistics in tabular format to a text file titled **[iris_descriptive_stats.txt](outputs/iris_descriptive_stats.txt)** in the **Outputs Folder**.
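The summary described above can be sketched with pandas as follows. This is a minimal illustration under stated assumptions, not the actual `generate_descriptive_statistics()` from `analysis_code.py`: `describe_iris()` is a hypothetical helper, and the six inline rows are only a small sample in the `iris.data` layout standing in for the full file.

```python
import io

import pandas as pd

# A few rows in the iris.data layout (illustrative sample, not the full dataset)
SAMPLE = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
"""

COLUMNS = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']


def describe_iris(df):
    """Hypothetical helper: return (global stats, per-species stats)."""
    features = df.drop(columns='species')
    # describe() gives count, mean, std, min, 25%, 50% (median), 75%, max
    global_stats = features.describe()
    # groupby('species') repeats the same summary within each species
    by_species = df.groupby('species').describe()
    return global_stats, by_species


iris_sample = pd.read_csv(io.StringIO(SAMPLE), header=None, names=COLUMNS)
global_stats, by_species = describe_iris(iris_sample)

# Render both tables as plain text, ready to be written to a .txt file
report = (
    "Global statistics\n" + global_stats.to_string()
    + "\n\nStatistics by species\n" + by_species.to_string()
)
print(report)
```

The `report` string could then be written out with `Path.write_text()`; the real function may format its tables differently.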

### Insights / Observations: 
#### Key insights from the **Global Descriptive Statistics** are the following:
- The **Petal Length & Width** are more variable relative to their means than the **Sepal Length & Width**, suggesting they might have more predictive power for classification (I will return to this point to validate it).
- **Sepal Width** is the most consistent feature across all the species, with a standard deviation of 0.43 cm.
- **Petal Length** has the highest variability, with a standard deviation of 1.77 cm.  

Overall, the **global statistics** show that sepal dimensions tend to be larger than petal dimensions. Specifically, sepal length has a mean of 5.84 cm and sepal width a mean of 3.05 cm, compared to petal length with a mean of 3.76 cm and petal width with a mean of 1.20 cm.
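Because the four features sit on different scales, one way to compare their spread is the coefficient of variation (standard deviation divided by mean). The sketch below is an illustration only: `coefficient_of_variation()` is a hypothetical helper that is not part of `analysis_code.py`, and the inline rows are a small sample, so its output will not match the full-dataset figures quoted above.

```python
import pandas as pd

# Small illustrative sample (not the full 150-row dataset)
iris_sample = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    'sepal_width': [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    'petal_length': [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    'petal_width': [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
})


def coefficient_of_variation(df):
    """Sample standard deviation of each column divided by its mean."""
    return df.std() / df.mean()


# Rank the features from most to least variable relative to their mean
cv = coefficient_of_variation(iris_sample).sort_values(ascending=False)
print(cv.round(2))
```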

#### Key insights from the **Descriptive Statistics per Species** are the following:
- **Setosa** is the smallest in overall measurements and has low variability, making it the easiest species to distinguish from the others.
- **Versicolor** is intermediate in its feature values, with more variability than **Setosa**, and commonly overlaps with **Virginica**, which makes it harder to classify.
- **Virginica** has the largest Sepal and Petal dimensions and the highest within-species variation.

As suggested by the **Global Descriptive Statistics**, **Petal Length & Width** are the most effective features for distinguishing the species, while **Sepal Width** is the least effective.

In [3]:
# Generate the descriptive statistics tables using the helpers in analysis_code.py
from pathlib import Path

import pandas as pd

# prepare_data is assumed to live alongside generate_descriptive_statistics in analysis_code.py
from analysis_code import generate_descriptive_statistics, prepare_data

# Create a directory for the output files if it doesn't exist
output_dir = Path('outputs')
output_dir.mkdir(parents=True, exist_ok=True)

# Path to the iris data file
data_path = Path('inputs') / 'iris.data'

# Stop early with a clear error if the file is missing
if not data_path.exists():
    raise FileNotFoundError(f"The file {data_path} does not exist.")

# Read the data into a pandas dataframe # Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
# Add column names to the dataframe from the iris documentation # Reference: https://archive.ics.uci.edu/dataset/53/iris
iris_data = pd.read_csv(data_path, header=None, usecols=[0, 1, 2, 3, 4],
                        names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'])

# Prepare the data for plotting
variables, variables_titles, species, format_species, colors, labels = prepare_data(iris_data)

global_descriptive_stats, stats_by_species = generate_descriptive_statistics(iris_data, output_dir, variables_titles, species, format_species)

---

## **Visualise the Features**

After conducting the descriptive analysis, I went about visualising the Iris dataset features and their relationships: 
- First, **Histograms** for each feature using **Matplotlib's** `hist()` method (see documentation [here](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html)).
- **Scatter Plots** for each feature using **Matplotlib's** `scatter()` method (see documentation [here](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html)).
- A **Pairs Plot** of each feature and their relationships with each other using **Seaborn's** `pairplot()` function (see documentation [here](https://seaborn.pydata.org/generated/seaborn.pairplot.html)).
- Finally, a **Heatmap** of the **Pearson Correlation Coefficient**, calculated using **`.corr()`**, a method from **Pandas** (see `.corr()` method documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html)), and visualised using **Seaborn's** `heatmap()` function (see documentation [here](https://seaborn.pydata.org/generated/seaborn.heatmap.html)).
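As a rough illustration of the correlation step, the sketch below computes a Pearson correlation matrix with `.corr()` and cross-checks one entry against the covariance-based definition. The inline values are a small sample rather than the full dataset, so the coefficients are illustrative only.

```python
import pandas as pd

# Small illustrative sample (not the full 150-row dataset)
iris_sample = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    'sepal_width': [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    'petal_length': [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    'petal_width': [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
})

# DataFrame.corr() uses the Pearson method by default; it is named here for clarity
corr_matrix = iris_sample.corr(method='pearson')
print(corr_matrix.round(2))

# Cross-check one entry against the definition r = cov(x, y) / (std(x) * std(y))
x = iris_sample['petal_length']
y = iris_sample['petal_width']
r_manual = x.cov(y) / (x.std() * y.std())
print(round(r_manual, 2))
```

A matrix like `corr_matrix` is the kind of input `seaborn.heatmap()` accepts, for example with `annot=True` to print the coefficient in each cell.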

### **Histograms of the Features**

Histograms are used to display the distribution of numerical features within a dataset by showing the frequency of observations within defined intervals or 'bins'. 
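To make the idea of bins concrete, here is a small sketch using `numpy.histogram()`. The ten petal lengths are illustrative values, not the full dataset, and the choice of three bins over the 1 to 7 cm range is an assumption made for readability.

```python
import numpy as np

# Illustrative petal lengths in cm (a handful of values, not the full dataset)
petal_lengths = np.array([1.4, 1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9])

# Split the 1-7 cm range into three equal-width bins: [1, 3), [3, 5), [5, 7]
counts, bin_edges = np.histogram(petal_lengths, bins=3, range=(1.0, 7.0))

# Each bin's count is the height of the corresponding histogram bar
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:.1f}-{right:.1f} cm: {count} samples")
```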

The function `plot_histograms()`, found in [analysis_code.py](https://github.com/skerr17/pands_project/blob/main/analysis_code.py), generates a histogram for each variable, with the species colour coded, and saves each one as an individual `.png` file titled `{variable}_histogram.png` in the **[Outputs Folder](/workspaces/pands_project/outputs)**.

See all the Histograms linked here:
- [petal_length_histogram.png](/workspaces/pands_project/outputs/petal_length_histogram.png)
- [petal_width_histogram.png](/workspaces/pands_project/outputs/petal_width_histogram.png)
- [sepal_length_histogram.png](/workspaces/pands_project/outputs/sepal_length_histogram.png)
- [sepal_width_histogram.png](/workspaces/pands_project/outputs/sepal_width_histogram.png)
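A minimal sketch of how per-species histogram files like these might be produced with Matplotlib. It is not the actual `plot_histograms()` implementation: `save_species_histograms()` is a hypothetical helper, the bin count and transparency are arbitrary choices, and the inline rows are a small sample rather than the full dataset.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Small illustrative sample (not the full 150-row dataset)
iris_sample = pd.DataFrame({
    'petal_length': [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    'sepal_width': [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    'species': ['Iris-setosa', 'Iris-setosa', 'Iris-versicolor',
                'Iris-versicolor', 'Iris-virginica', 'Iris-virginica'],
})


def save_species_histograms(df, variables, output_dir):
    """Hypothetical helper: save one histogram per variable, one colour per species."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for variable in variables:
        fig, ax = plt.subplots()
        # Each hist() call takes the next colour in Matplotlib's default cycle
        for name, group in df.groupby('species'):
            ax.hist(group[variable], bins=10, alpha=0.6, label=name)
        ax.set_xlabel(f"{variable} (cm)")
        ax.set_ylabel("Frequency")
        ax.set_title(f"{variable} histogram")
        ax.legend()
        path = output_dir / f"{variable}_histogram.png"
        fig.savefig(path)
        plt.close(fig)  # free the figure once it has been written
        saved.append(path)
    return saved


# Written to a temporary folder here so the real outputs folder is left untouched
saved_files = save_species_histograms(
    iris_sample, ['petal_length', 'sepal_width'], tempfile.mkdtemp()
)
print([p.name for p in saved_files])
```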

#### Insights / Observations: 
For the Iris dataset, plotting histograms for each feature helps reveal the spread of the data across each species. For example, the [petal_length_histogram.png](/workspaces/pands_project/outputs/petal_length_histogram.png) clearly shows the difference between the species, with **Setosa** distinctly separated from **Versicolor** and **Virginica**. Therefore, **Petal Length** is a useful feature for classification, as the histogram below shows.

<img src="outputs/petal_length_histogram.png" alt="Petal Length Histogram" style="width:400px; height:auto;">



The histograms also aid in spotting skewness or outliers that could influence modelling decisions. Notably, the [sepal_width_histogram.png](/workspaces/pands_project/outputs/sepal_width_histogram.png) shows considerable overlap between the species, which illustrates why **Sepal Width** is a less effective feature for classification. See the histogram below.

<img src="outputs/sepal_width_histogram.png" alt="Sepal Width Histogram" style="width:400px; height:auto;">

Reference: see this Stack Overflow thread, linked [here](https://stackoverflow.com/questions/41598916/resize-the-image-in-jupyter-notebook-using-markdown), for how I resized the histogram images.