# IrisSpeciesPredictor

#### By: Tiffany Chu, Gaurang Ahuja, Nguyen Nguyen, Vienne Lee

## Summary
We are using the Iris dataset to answer the question: “Can we predict the Iris species using petal and sepal measurements?”

## Introduction
The dataset contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are not linearly separable from each other.

We cleaned the data and then examined each feature. We plotted a histogram for each feature, followed by pairwise plots to see the relationships between features. We then trained a model on the data and used it to answer the question.

- Key findings: TODO

## Methods & Results

### Loading the Data
The code below reads the CSV at the URL into a pandas DataFrame and then saves a copy as a CSV file in the `data` folder.

In [14]:
import pandas as pd
import os

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"

# Create the data folder if it doesn't already exist
os.makedirs("./data", exist_ok=True)

# Read from URL
df = pd.read_csv(url)

# Save locally
save_path = "./data/iris.csv"
df.to_csv(save_path, index=False)

print(f"File saved to: {save_path}")

File saved to: ./data/iris.csv
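
The cell above downloads the file on every run. As a minimal sketch, assuming it is acceptable to reuse the saved copy once it exists, the download could be skipped when `./data/iris.csv` is already present. `load_iris_csv` is a hypothetical helper name and is not part of the original notebook; the URL and path are the ones used above.

```python
import os

import pandas as pd

URL = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
SAVE_PATH = "./data/iris.csv"


def load_iris_csv(url: str = URL, save_path: str = SAVE_PATH) -> pd.DataFrame:
    """Hypothetical helper: read the local CSV if present, else download and cache it."""
    if os.path.exists(save_path):
        # Local copy found, so no network access is needed
        return pd.read_csv(save_path)

    # First run: create the target folder, fetch from the URL, and save a copy
    folder = os.path.dirname(save_path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    frame = pd.read_csv(url)
    frame.to_csv(save_path, index=False)
    return frame
```

Calling `df = load_iris_csv()` would then behave like the original cell on the first run and read from disk afterwards.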


### Feature Summary
There are 4 numerical features in this dataset:

1. `sepal_length`: The length of the sepal (outer part of the flower)
2. `sepal_width`: The width of the sepal
3. `petal_length`: The length of the petal
4. `petal_width`: The width of the petal

There is 1 categorical column in the dataset, which serves as the target:

1. `species`: The target variable (label)

### Data Cleaning
In the code below, we check some basic properties of the dataframe, such as the number of rows and columns and the column types. We check for missing values and then convert the column names to a standard lowercase, underscore-separated format. The column names in this copy of the dataset already follow that format, so the step acts as a safeguard. We then check how many unique species there are.

In [16]:
df_shape = df.shape
print(f"Dataframe has {df_shape[0]} rows")
print(f"Dataframe has {df_shape[1]} cols\n")

Dataframe has 150 rows
Dataframe has 5 cols



In [18]:
print("Dataframe info:\n")
df.info()

Dataframe info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [27]:
# Check for missing values
missing_values = df.isna().sum()
print("\nMissing values per column:\n")
print(missing_values)


Missing values per column:

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64
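
The checks above are visual. As a hedged sketch, they could also be expressed as a function that fails fast if a later reload of the data is incomplete. `validate_iris` and `EXPECTED_COLUMNS` are hypothetical names introduced for illustration; the column list is taken from the `df.info()` output above.

```python
import pandas as pd

# Column names as reported by df.info() above
EXPECTED_COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]


def validate_iris(frame: pd.DataFrame) -> None:
    """Hypothetical helper: raise ValueError if columns or values are missing."""
    missing_cols = [c for c in EXPECTED_COLUMNS if c not in frame.columns]
    if missing_cols:
        raise ValueError(f"Missing columns: {missing_cols}")

    # Count NaN entries per expected column
    na_counts = frame[EXPECTED_COLUMNS].isna().sum()
    if na_counts.any():
        raise ValueError(f"Missing values found:\n{na_counts[na_counts > 0]}")
```

Running `validate_iris(df)` on the dataframe loaded above should return silently, since every column reports 150 non-null values.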


In [21]:
# Clean up column names: trim whitespace, lowercase, replace spaces with underscores
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
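
Because the Iris column names are already tidy, the effect of this chain is not visible in the notebook output. The sketch below applies the same chain to a small frame with deliberately untidy headers; the frame `messy` and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with untidy headers (the real Iris CSV is already tidy)
messy = pd.DataFrame(columns=[" Sepal Length ", "Petal Width", "species"])

# Same chain as in the notebook: trim, lowercase, spaces -> underscores
messy.columns = messy.columns.str.strip().str.lower().str.replace(" ", "_")

print(list(messy.columns))
# ['sepal_length', 'petal_width', 'species']
```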

In [23]:
# See how many unique species there are
print(f"\nUnique species of Iris: {df['species'].unique()}\n")


Unique species of Iris: ['setosa' 'versicolor' 'virginica']



In [24]:
# See the first few rows after cleanup
df.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


### Data Wrangling
Since our goal is classification, this section looks at statistics that help distinguish between the species: the mean, median, min, max, etc. of each feature, both overall and grouped by species. This will be followed by graphical analysis, where we will explore histograms, density plots, etc.

In [15]:
# Summary of numeric columns
df.describe()

| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |


In [25]:
# Summary per species
df.groupby("species").agg(["mean", "median", "std", "min", "max"])

| species | feature | mean | median | std | min | max |
|---|---|---|---|---|---|---|
| setosa | sepal_length | 5.006 | 5.0 | 0.35249 | 4.3 | 5.8 |
| setosa | sepal_width | 3.428 | 3.4 | 0.379064 | 2.3 | 4.4 |
| setosa | petal_length | 1.462 | 1.5 | 0.173664 | 1.0 | 1.9 |
| setosa | petal_width | 0.246 | 0.2 | 0.105386 | 0.1 | 0.6 |
| versicolor | sepal_length | 5.936 | 5.9 | 0.516171 | 4.9 | 7.0 |
| versicolor | sepal_width | 2.77 | 2.8 | 0.313798 | 2.0 | 3.4 |
| versicolor | petal_length | 4.26 | 4.35 | 0.469911 | 3.0 | 5.1 |
| versicolor | petal_width | 1.326 | 1.3 | 0.197753 | 1.0 | 1.8 |
| virginica | sepal_length | 6.588 | 6.5 | 0.63588 | 4.9 | 7.9 |
| virginica | sepal_width | 2.974 | 3.0 | 0.322497 | 2.2 | 3.8 |
| virginica | petal_length | 5.552 | 5.55 | 0.551895 | 4.5 | 6.9 |
| virginica | petal_width | 2.026 | 2.0 | 0.27465 | 1.4 | 2.5 |


In [26]:
# Count per species
print(df['species'].value_counts())

species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64


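From the summary statistics above, we can see:
- The species are evenly distributed, with 50 samples each.
- Petal measurements show the strongest separation: mean petal length is 1.462 for setosa, 4.26 for versicolor, and 5.552 for virginica.
- Setosa is distinct: its petal length range (1.0 to 1.9) does not overlap with versicolor (3.0 to 5.1) or virginica (4.5 to 6.9), while the latter two overlap with each other.
- Petal width follows the same pattern, with means of 0.246, 1.326, and 2.026, and only versicolor and virginica overlapping.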
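
The overlap observations can be checked directly from the per-species min and max values. The sketch below is only an illustration: the ranges are copied by hand from the per-species summary above, and `overlaps` is a hypothetical helper. Non-overlapping ranges on a single feature mean one threshold on that feature separates the two species; overlapping ranges on every single feature do not by themselves prove that two species cannot be separated using several features together.

```python
from itertools import combinations

# Per-species (min, max) ranges copied from the summary statistics above
ranges = {
    "sepal_length": {"setosa": (4.3, 5.8), "versicolor": (4.9, 7.0), "virginica": (4.9, 7.9)},
    "sepal_width": {"setosa": (2.3, 4.4), "versicolor": (2.0, 3.4), "virginica": (2.2, 3.8)},
    "petal_length": {"setosa": (1.0, 1.9), "versicolor": (3.0, 5.1), "virginica": (4.5, 6.9)},
    "petal_width": {"setosa": (0.1, 0.6), "versicolor": (1.0, 1.8), "virginica": (1.4, 2.5)},
}


def overlaps(a, b):
    """Return True if the closed intervals a and b share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]


# Compare every pair of species on every feature
for feature, by_species in ranges.items():
    for s1, s2 in combinations(by_species, 2):
        status = "overlap" if overlaps(by_species[s1], by_species[s2]) else "separated"
        print(f"{feature}: {s1} vs {s2} -> {status}")
```

With the dataframe loaded, the same ranges could be obtained with `df.groupby("species")["petal_length"].agg(["min", "max"])` instead of typing them in.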

## EDA Plots - TODO Vienne

Add histograms, plots etc

## Discussion
- What do the results mean?
- Limitations?
- Improvements?

## References

Iris dataset reference: https://archive.ics.uci.edu/dataset/53/iris
- Documentation we referred to?
- AI tools?