# Iris Dataset - Exploratory Data Analysis

This notebook performs a comprehensive exploratory data analysis on the famous Iris dataset.
The Iris dataset contains measurements of 150 iris flowers from three different species:
- Iris setosa
- Iris versicolor
- Iris virginica

Each flower has four features measured:
- Sepal length (cm)
- Sepal width (cm)
- Petal length (cm)
- Petal width (cm)

## 1. Import Required Libraries

In [None]:
# Import necessary libraries for data analysis and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
import warnings
warnings.filterwarnings('ignore')

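# Set the aesthetic style of the plots: reset matplotlib to its defaults first,
# then apply the seaborn theme so the theme is not overwritten
plt.style.use('default')
sns.set_theme(style="whitegrid")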

# Set figure size for better readability
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12

print("Libraries imported successfully!")

## 2. Data Loading & Basic Exploration

In [None]:
# Load the Iris dataset using sklearn
iris_sklearn = load_iris()

# Create a pandas DataFrame
df = pd.DataFrame(iris_sklearn.data, columns=iris_sklearn.feature_names)
df['species'] = iris_sklearn.target

# Map target numbers to species names
species_names = {0: 'setosa', 1: 'versicolor', 2: 'virginica'}
df['species'] = df['species'].map(species_names)

# Clean column names (remove spaces and make them more readable)
df.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

print("Dataset loaded successfully!")
print(f"Dataset shape: {df.shape}")

In [None]:
# Display first 5 rows of the dataset
print("First 5 rows of the dataset:")
print(df.head())

In [None]:
# Print basic information about the dataset
print("Dataset Shape:", df.shape)
print("\nColumn Names:", list(df.columns))
print("\nData Types:")
print(df.dtypes)

In [None]:
# Get detailed information about the dataset
print("Dataset Info:")
df.info()

In [None]:
# Get summary statistics for numeric columns
print("Summary Statistics:")
print(df.describe())

In [None]:
# Count the number of samples for each species
print("Number of samples per species:")
print(df['species'].value_counts())
print("\nPercentage distribution:")
print(df['species'].value_counts(normalize=True) * 100)

## 3. Data Cleaning

In [None]:
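# Check for missing values
missing_counts = df.isnull().sum()
print("Missing values in each column:")
print(missing_counts)

# Check for duplicate rows (identical measurements can legitimately repeat,
# so duplicates are reported here rather than dropped)
n_duplicates = df.duplicated().sum()
print(f"\nNumber of duplicate rows: {n_duplicates}")

# Report readiness based on the checks above instead of assuming it
if missing_counts.sum() == 0:
    print("\nNo missing values found; dataset is ready for analysis.")
else:
    print("\nMissing values found; handle them before analysis.")
print(f"Final dataset shape: {df.shape}")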

In [None]:
# Save the cleaned dataset as CSV
df.to_csv('iris_dataset.csv', index=False)
print("Dataset saved as 'iris_dataset.csv'")

## 4. Visualization Section

### 4.1 Pairplot - Overall Relationships

In [None]:
# Create pairplot to show relationships between all numeric variables.
# sns.pairplot builds and lays out its own figure, so no plt.figure() call is needed.
import os
os.makedirs('plots', exist_ok=True)  # savefig fails if the output folder is missing
pairplot = sns.pairplot(df, hue='species', diag_kind='kde', height=2.5)
pairplot.fig.suptitle('Pairplot of Iris Dataset Features by Species', y=1.02, fontsize=16)
plt.savefig('plots/01_pairplot.png', dpi=300, bbox_inches='tight')
plt.show()
plt.close()

**Key Observation:** The pairplot reveals clear separation between species, especially setosa which is distinctly different from the other two species. Petal measurements show stronger discriminative power than sepal measurements.
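
One way to put a number on the visual impression above is a one-way ANOVA F statistic per feature: the ratio of between-species variance to within-species variance. The sketch below is an illustrative check, not part of the original analysis; the helper name `f_ratio` and the variable `f_scores` are assumptions introduced here. A larger value suggests the species means are further apart relative to the spread inside each species.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the DataFrame so the snippet runs on its own
iris = load_iris()
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
df = pd.DataFrame(iris.data, columns=features)
df['species'] = iris.target_names[iris.target]

def f_ratio(frame, feature, group='species'):
    """One-way ANOVA F statistic: between-group over within-group variance."""
    grouped = frame.groupby(group)[feature]
    k, n = grouped.ngroups, len(frame)
    grand_mean = frame[feature].mean()
    # Variance of the group means around the grand mean, weighted by group size
    between = (grouped.size() * (grouped.mean() - grand_mean) ** 2).sum() / (k - 1)
    # Pooled variance inside the groups
    within = ((grouped.size() - 1) * grouped.var()).sum() / (n - k)
    return between / within

# Hypothetical helper output: features ranked from most to least separating
f_scores = pd.Series({f: f_ratio(df, f) for f in features}).sort_values(ascending=False)
print(f_scores.round(1))
```

If the observation holds, both petal features should rank above both sepal features in `f_scores`.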

### 4.2 Boxplot - Sepal Length Distribution

In [None]:
# Create boxplot for sepal length by species
plt.figure(figsize=(10, 6))
ax = sns.boxplot(data=df, x='species', y='sepal_length', palette='Set2')
ax.set_title('Distribution of Sepal Length by Species', fontsize=16, pad=20)
ax.set_xlabel('Species', fontsize=14)
ax.set_ylabel('Sepal Length (cm)', fontsize=14)
plt.tight_layout()
plt.savefig('plots/02_boxplot_sepal_length.png', dpi=300, bbox_inches='tight')
plt.show()
plt.close()

**Key Observation:** Virginica has the longest sepals on average, followed by versicolor, then setosa. Versicolor and virginica overlap substantially, and the upper end of setosa's range also overlaps the lower end of versicolor's, so sepal length alone does not cleanly isolate any species.
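
The numbers behind a boxplot can be computed directly. The following sketch reproduces the quartiles and counts points outside the conventional 1.5 × IQR fences for sepal length; `box_stats` and `sepal_stats` are names made up for this example, and the fence rule is the common default rather than something stated in the notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the DataFrame so the snippet runs on its own
iris = load_iris()
df = pd.DataFrame(iris.data, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
df['species'] = iris.target_names[iris.target]

def box_stats(values):
    """Mean, quartiles, and the number of points beyond the 1.5 * IQR fences."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return pd.Series({
        'mean': values.mean(),
        'q1': q1,
        'median': median,
        'q3': q3,
        'n_outliers': ((values < lower) | (values > upper)).sum(),
    })

# One row of statistics per species
sepal_stats = pd.DataFrame(
    {name: box_stats(group) for name, group in df.groupby('species')['sepal_length']}
).T
print(sepal_stats.round(2))
```

Comparing the `q1` and `q3` columns across rows shows how much the boxes overlap, which is harder to judge from the means alone.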

### 4.3 Violin Plot - Petal Length Distribution

In [None]:
# Create violin plot for petal length by species
plt.figure(figsize=(10, 6))
ax = sns.violinplot(data=df, x='species', y='petal_length', palette='viridis')
ax.set_title('Distribution of Petal Length by Species', fontsize=16, pad=20)
ax.set_xlabel('Species', fontsize=14)
ax.set_ylabel('Petal Length (cm)', fontsize=14)
plt.tight_layout()
plt.savefig('plots/03_violin_petal_length.png', dpi=300, bbox_inches='tight')
plt.show()
plt.close()

**Key Observation:** Petal length shows excellent species separation with minimal overlap. Setosa has significantly shorter petals, while virginica has the longest petals. The violin plot reveals the distribution shape within each species.
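
To see how small the overlap is, the observed range of petal length can be tabulated per species and the samples inside the shared versicolor/virginica interval counted. This is a rough sketch; `ranges`, `overlap_low`, `overlap_high`, and `in_overlap` are illustrative names, and observed ranges are sensitive to single extreme samples.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the DataFrame so the snippet runs on its own
iris = load_iris()
df = pd.DataFrame(iris.data, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
df['species'] = iris.target_names[iris.target]

# Observed petal length range for each species
ranges = df.groupby('species')['petal_length'].agg(['min', 'max'])
print(ranges)

# Interval where versicolor and virginica petal lengths overlap
overlap_low = ranges.loc['virginica', 'min']
overlap_high = ranges.loc['versicolor', 'max']

# Samples from either species that fall inside the shared interval
in_overlap = df[
    df['species'].isin(['versicolor', 'virginica'])
    & df['petal_length'].between(overlap_low, overlap_high)
]
print(f"Overlap interval: {overlap_low} to {overlap_high} cm")
print(in_overlap['species'].value_counts())
```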

### 4.4 KDE Plot - Sepal Width Distribution

In [None]:
# Create KDE plot for sepal width by species
plt.figure(figsize=(10, 6))
for species in df['species'].unique():
    sns.kdeplot(data=df[df['species'] == species], x='sepal_width', 
                label=species, fill=True, alpha=0.6)
plt.title('KDE Plot of Sepal Width by Species', fontsize=16, pad=20)
plt.xlabel('Sepal Width (cm)', fontsize=14)
plt.ylabel('Density', fontsize=14)
plt.legend(title='Species', fontsize=12)
plt.tight_layout()
plt.savefig('plots/04_kde_sepal_width.png', dpi=300, bbox_inches='tight')
plt.show()
plt.close()

**Key Observation:** Setosa has the widest sepals on average, while versicolor and virginica show considerable overlap in sepal width. This feature alone is not sufficient for species classification.
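
The degree of overlap in sepal width can be summarized with a standardized mean difference (Cohen's d) for each pair of species. This is a swapped-in summary statistic, not something the notebook computes; `cohens_d`, `groups`, and `effect` are hypothetical names. Values near zero indicate heavily overlapping distributions.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the DataFrame so the snippet runs on its own
iris = load_iris()
df = pd.DataFrame(iris.data, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
df['species'] = iris.target_names[iris.target]

def cohens_d(a, b):
    """Difference in means divided by the pooled standard deviation."""
    pooled_var = ((len(a) - 1) * a.var() + (len(b) - 1) * b.var()) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Sepal width values keyed by species name
groups = {name: values for name, values in df.groupby('species')['sepal_width']}

# Effect size for every pair of species
effect = {
    (first, second): cohens_d(groups[first], groups[second])
    for first, second in combinations(groups, 2)
}
for pair, d in effect.items():
    print(f"{pair[0]} vs {pair[1]}: d = {d:.2f}")
```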

### 4.5 Swarm Plot - Petal Length vs Species

In [None]:
# Create swarm plot for petal length vs species
plt.figure(figsize=(10, 6))
ax = sns.swarmplot(data=df, x='species', y='petal_length', palette='husl', size=6)
ax.set_title('Swarm Plot of Petal Length by Species', fontsize=16, pad=20)
ax.set_xlabel('Species', fontsize=14)
ax.set_ylabel('Petal Length (cm)', fontsize=14)
plt.tight_layout()
plt.savefig('plots/05_swarm_petal_length.png', dpi=300, bbox_inches='tight')
plt.show()
plt.close()

**Key Observation:** The swarm plot shows individual data points, revealing the clear clustering of petal lengths within each species. There's virtually no overlap between setosa and the other species, making petal length an excellent discriminating feature.
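
The setosa gap can be checked with a single cut-off. The sketch below places a threshold halfway between the longest setosa petal and the shortest petal of the other two species, then verifies that the rule "petal length below the threshold" matches the setosa label on every row. The midpoint choice and the variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the DataFrame so the snippet runs on its own
iris = load_iris()
df = pd.DataFrame(iris.data, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
df['species'] = iris.target_names[iris.target]

is_setosa = df['species'] == 'setosa'

# Edges of the gap between setosa and the other two species
setosa_max = df.loc[is_setosa, 'petal_length'].max()
others_min = df.loc[~is_setosa, 'petal_length'].min()

# Illustrative threshold: midpoint of the gap
threshold = (setosa_max + others_min) / 2

# Does the one-feature rule agree with the label for every sample?
predicted_setosa = df['petal_length'] < threshold
rule_is_exact = bool((predicted_setosa == is_setosa).all())

print(f"Gap: {setosa_max} cm to {others_min} cm, threshold at {threshold:.2f} cm")
print(f"Rule matches the setosa label on all rows: {rule_is_exact}")
```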

### 4.6 Correlation Heatmap

In [None]:
# Create correlation heatmap for numeric features
plt.figure(figsize=(10, 8))
numeric_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
correlation_matrix = df[numeric_cols].corr()

ax = sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
                square=True, linewidths=0.5, cbar_kws={"shrink": .8})
ax.set_title('Correlation Heatmap of Iris Features', fontsize=16, pad=20)
plt.tight_layout()
plt.savefig('plots/06_correlation_heatmap.png', dpi=300, bbox_inches='tight')
plt.show()
plt.close()

**Key Observation:** Petal length and petal width are very strongly correlated (r ≈ 0.96), so they carry largely redundant information. Sepal length is also strongly correlated with petal length (r ≈ 0.87) and petal width (r ≈ 0.82). Sepal width is weakly negatively correlated with sepal length and moderately negatively correlated with both petal measurements.
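
Reading values off a heatmap is error-prone, so it can help to list each feature pair with its Pearson coefficient, sorted by strength. This is a small sketch; `pairs` is an illustrative name, and the coefficients are pooled over all three species, so they partly reflect differences between species rather than relationships within one.

```python
from itertools import combinations

import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the DataFrame so the snippet runs on its own
iris = load_iris()
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
df = pd.DataFrame(iris.data, columns=features)

# Pearson correlation matrix of the four numeric features
corr = df[features].corr()

# One entry per unordered feature pair, strongest relationship first
pairs = pd.Series({(a, b): corr.loc[a, b] for a, b in combinations(features, 2)})
pairs = pairs.sort_values(key=abs, ascending=False)
print(pairs.round(2))
```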

### 4.7 Count Plot - Species Distribution

In [None]:
# Create count plot showing number of samples per species
plt.figure(figsize=(10, 6))
ax = sns.countplot(data=df, x='species', palette='pastel')
ax.set_title('Number of Samples per Species', fontsize=16, pad=20)
ax.set_xlabel('Species', fontsize=14)
ax.set_ylabel('Count', fontsize=14)

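# Add count labels on top of bars, reading heights from the drawn bars so
# each label matches its bar regardless of category order
for bar in ax.patches:
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 1,
            str(int(bar.get_height())), ha='center', va='bottom', fontsize=12, fontweight='bold')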

plt.tight_layout()
plt.savefig('plots/07_count_plot.png', dpi=300, bbox_inches='tight')
plt.show()
plt.close()

**Key Observation:** The dataset is perfectly balanced with exactly 50 samples for each of the three iris species, making it ideal for classification tasks without class imbalance concerns.

## 5. Analysis & Insights

### Summary of Key Findings:

**Species Characteristics:**
- **Setosa**: Smallest petals (length & width), widest sepals, most distinct species
- **Versicolor**: Medium-sized features, some overlap with virginica
- **Virginica**: Largest petals and longest sepals, some overlap with versicolor

**Best Discriminating Features:**
1. **Petal Length**: Separates setosa completely; versicolor and virginica overlap only in a narrow range
2. **Petal Width**: Strong discriminating power, highly correlated with petal length
3. **Sepal Length**: Species means differ clearly, but the ranges overlap
4. **Sepal Width**: Least discriminating feature
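
The ranking above can be sanity-checked by classifying each flower from one feature at a time, assigning it to the species whose mean is closest. This nearest-class-mean rule is a stand-in chosen for simplicity, not a method used in the notebook, and `nearest_mean_accuracy` is a hypothetical helper. The accuracy is measured on the same data used to compute the means, so it is optimistic.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the DataFrame so the snippet runs on its own
iris = load_iris()
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
df = pd.DataFrame(iris.data, columns=features)
df['species'] = iris.target_names[iris.target]

def nearest_mean_accuracy(frame, feature, label='species'):
    """In-sample accuracy of assigning each row to the closest class mean."""
    means = frame.groupby(label)[feature].mean()
    # Distance from every sample to every class mean (rows: samples, columns: classes)
    distances = np.abs(frame[feature].to_numpy()[:, None] - means.to_numpy()[None, :])
    predicted = means.index.to_numpy()[distances.argmin(axis=1)]
    return (predicted == frame[label].to_numpy()).mean()

# Accuracy per feature, best first
accuracy = pd.Series({f: nearest_mean_accuracy(df, f) for f in features})
print(accuracy.sort_values(ascending=False).round(3))
```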

**Strong Correlations:**
- Petal length vs Petal width (r = 0.96): Very strong positive correlation
- Sepal length vs Petal length (r = 0.87): Strong positive correlation
- Sepal length vs Petal width (r = 0.82): Strong positive correlation

**Species Overlap:**
- **Minimal overlap**: Setosa is clearly separable from other species
- **Some overlap**: Versicolor and Virginica show overlap in sepal measurements
- **Best separation**: Achieved using petal measurements

**Dataset Quality:**
- Perfectly balanced dataset (50 samples per species)
- No missing values (see the duplicate-row count in Section 3 for repeated measurements)
- Suitable for machine learning classification tasks

In [None]:
# Final summary statistics by species
print("Summary Statistics by Species:")
print("=" * 50)
for species in df['species'].unique():
    print(f"\n{species.upper()}:")
    species_data = df[df['species'] == species]
    print(species_data[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].describe())
    print("-" * 50)

## Conclusion

This exploratory data analysis of the Iris dataset reveals clear patterns and relationships that make it an excellent dataset for classification tasks. The combination of petal length and width provides the strongest discriminating power for species identification, while the dataset's balanced nature and clean structure make it ideal for machine learning applications.
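
As a rough check of the claim about petal measurements, the sketch below compares cross-validated accuracy of a k-nearest-neighbours classifier trained on sepal features only, petal features only, and all four. The choice of classifier, `n_neighbors=5`, and 5-fold cross-validation are assumptions made for illustration; other models or settings may give different numbers.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Column indices in iris.data: 0-1 are sepal length/width, 2-3 are petal length/width
subsets = {
    'sepal only': [0, 1],
    'petal only': [2, 3],
    'all four': [0, 1, 2, 3],
}

# Mean 5-fold cross-validated accuracy for each feature subset
scores = {
    name: cross_val_score(KNeighborsClassifier(n_neighbors=5), X[:, cols], y, cv=5).mean()
    for name, cols in subsets.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

If the petal-only score is close to the score for all four features, the sepal measurements add little beyond what the petals already provide.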