# Iris Species Classification - Exploratory Data Analysis

## Dataset Overview
This analysis explores the Iris dataset, introduced in R.A. Fisher's 1936 paper, which records sepal and petal length and width for three iris species: setosa, versicolor, and virginica.

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

# Load the training dataset
df = pd.read_csv('/Users/yuvalheffetz/ds-agent-projects/session_d952a262-98d7-433d-8f7e-ad34becf969d/data/train_set.csv')

print(f"Dataset shape: {df.shape}")
print(f"\nColumn names: {list(df.columns)}")
print(f"\nData types:\n{df.dtypes}")

In [None]:
# Display basic information about the dataset
print("Dataset Info:")
df.info()
print("\nFirst few rows:")
df.head()

In [None]:
# Examine target variable distribution
print("Species distribution:")
species_counts = df['Species'].value_counts()
print(species_counts)
print(f"\nTotal samples: {len(df)}")
print(f"Number of features: {len(df.columns) - 2}")  # Excluding Id and Species columns

In [None]:
# Create species distribution visualization
app_color_palette = [
    'rgba(99, 110, 250, 0.8)',   # Blue
    'rgba(239, 85, 59, 0.8)',    # Red/Orange-Red
    'rgba(0, 204, 150, 0.8)',    # Green
    'rgba(171, 99, 250, 0.8)',   # Purple
    'rgba(255, 161, 90, 0.8)',   # Orange
    'rgba(25, 211, 243, 0.8)',   # Cyan
    'rgba(255, 102, 146, 0.8)',  # Pink
    'rgba(182, 232, 128, 0.8)',  # Light Green
    'rgba(255, 151, 255, 0.8)',  # Magenta
    'rgba(254, 203, 82, 0.8)'    # Yellow
]

# Create an interactive bar chart for species distribution
fig = px.bar(
    x=species_counts.index,
    y=species_counts.values,
    labels={'x': 'Iris Species', 'y': 'Number of Samples'},
    color=species_counts.index,
    color_discrete_sequence=app_color_palette[:3]
)

# Apply consistent styling
fig.update_layout(
    height=600,
    paper_bgcolor='rgba(0,0,0,0)',  # Transparent background
    plot_bgcolor='rgba(0,0,0,0)',   # Transparent plot area
    font=dict(color='#8B5CF6', size=12),  # App's purple color for text
    title_font=dict(color='#7C3AED', size=16),  # Slightly darker purple for titles
    xaxis=dict(
        gridcolor='rgba(139,92,246,0.2)',  # Purple-tinted grid
        zerolinecolor='rgba(139,92,246,0.3)',
        tickfont=dict(color='#8B5CF6', size=11),  # Purple tick labels
        title_font=dict(color='#7C3AED', size=12)  # Darker purple axis titles
    ),
    yaxis=dict(
        gridcolor='rgba(139,92,246,0.2)',  # Purple-tinted grid
        zerolinecolor='rgba(139,92,246,0.3)', 
        tickfont=dict(color='#8B5CF6', size=11),  # Purple tick labels
        title_font=dict(color='#7C3AED', size=12)  # Darker purple axis titles
    ),
    legend=dict(font=dict(color='#8B5CF6', size=11)),  # Purple legend
    showlegend=False  # Hide legend for cleaner look
)

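# Add hover information.
# Because `color` is set, px.bar creates one single-bar trace per species, so each
# trace gets its own percentage; a shared list would give every bar the first value.
percentages = species_counts / len(df) * 100
fig.for_each_trace(
    lambda trace: trace.update(
        hovertemplate="<b>%{x}</b><br>" +
                      "Samples: %{y}<br>" +
                      "Percentage: %{customdata:.1f}%<extra></extra>",
        customdata=[percentages[trace.name]]
    )
)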

# Save the plot
fig.write_html(
    "/Users/yuvalheffetz/ds-agent-projects/session_d952a262-98d7-433d-8f7e-ad34becf969d/research/plots/species_distribution.html",
    include_plotlyjs=True,
    config={'responsive': True, 'displayModeBar': False}
)

fig.show()

In [None]:
# Summary statistics for numerical features
numerical_cols = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
print("Summary statistics for numerical features:")
df[numerical_cols].describe()

In [None]:
# Check for missing values
print("Missing values:")
print(df.isnull().sum())
print(f"\nTotal missing values: {df.isnull().sum().sum()}")

## Key Findings from EDA

### Dataset Characteristics:
- **Balanced dataset**: Perfect class balance, with 40 samples per species in the training set
- **No missing values**: No column contains missing values; other quality checks (duplicates, outliers) were not part of this analysis
- **4 numerical features**: Sepal length, sepal width, petal length, and petal width, all continuous measurements in centimetres
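
The characteristics above can also be checked programmatically. The following is a minimal sketch rather than part of the original analysis: the helper `check_dataset` is hypothetical, the column names are assumed to match those used in the cells above, and the three-row frame is only an illustrative stand-in for the real `df`.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

FEATURE_COLS = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']


def check_dataset(df, feature_cols=FEATURE_COLS, target_col='Species'):
    """Return basic data-quality facts (hypothetical helper, for illustration)."""
    counts = df[target_col].value_counts()
    return {
        # Total number of missing cells across all columns
        'n_missing': int(df.isnull().sum().sum()),
        # True when every feature column has a numeric dtype
        'all_numeric': all(is_numeric_dtype(df[c]) for c in feature_cols),
        # True when every class has the same number of rows
        'balanced': counts.nunique() == 1,
    }


# Illustrative stand-in frame (one row per species); replace with the real `df`.
demo = pd.DataFrame({
    'SepalLengthCm': [5.1, 7.0, 6.3],
    'SepalWidthCm': [3.5, 3.2, 3.3],
    'PetalLengthCm': [1.4, 4.7, 6.0],
    'PetalWidthCm': [0.2, 1.4, 2.5],
    'Species': ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'],
})
report = check_dataset(demo)
print(report)
```

Calling `check_dataset(df)` on the training set should report zero missing values, numeric features, and balanced classes if the findings above hold.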

### Species Distribution:
- **Iris-setosa**: 40 samples (33.3%)
- **Iris-versicolor**: 40 samples (33.3%)
- **Iris-virginica**: 40 samples (33.3%)
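
The percentages listed above follow directly from the class counts. As a small sketch, assuming a stand-in `Species` column built from the reported class sizes rather than the real file, `value_counts(normalize=True)` gives the proportions in one step:

```python
import pandas as pd

# Stand-in target column using the class sizes reported above (40 per species).
species = pd.Series(
    ['Iris-setosa'] * 40 + ['Iris-versicolor'] * 40 + ['Iris-virginica'] * 40,
    name='Species',
)

# normalize=True returns class proportions instead of raw counts.
summary = pd.DataFrame({
    'count': species.value_counts(),
    'percent': (species.value_counts(normalize=True) * 100).round(1),
})
print(summary)
```

With the real data, `df['Species']` would take the place of the stand-in series.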

### Insights for ML Pipeline:
1. **No class imbalance handling needed** - all three species are equally represented
2. **No missing value imputation required** - no missing values were found
3. **Feature scaling may be beneficial** - all features are in centimetres, but their value ranges differ (compare the summary statistics above), which can affect distance-based and regularized models
4. **Classification task** - three-class problem with a well-defined target variable (`Species`)
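
One way these insights could translate into a baseline is sketched below. It is an illustration under stated assumptions, not the pipeline used in this project: scikit-learn does not appear in the notebook, its bundled copy of Iris (`load_iris`) stands in for `train_set.csv`, and logistic regression is an arbitrary choice of three-class model.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris copy stands in for train_set.csv.
X, y = load_iris(return_X_y=True)

# Scaling lives inside the pipeline so it is re-fit on each training fold,
# which keeps validation-fold statistics out of the scaler.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds preserve the class balance in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
print(f"Mean CV accuracy: {scores.mean():.3f}")
```

Because the classes are balanced, plain accuracy is a reasonable first metric here; no resampling or imputation step is included, in line with the first two insights.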