# Machine Learning Tutorial: Iris Classification

This notebook demonstrates a complete machine learning workflow using the classic Iris dataset.

## Objectives
- Load and explore the Iris dataset
- Visualize the data to understand patterns
- Build and compare multiple classification models
- Evaluate model performance
- Make predictions on new data

## 1. Import Libraries

In [None]:
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from lightgbm import LGBMClassifier

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print("Libraries imported successfully!")

## 2. Load and Explore the Dataset

The Iris dataset contains measurements of 150 iris flowers from three different species:
- Setosa
- Versicolor
- Virginica

Features:
- Sepal length (cm)
- Sepal width (cm)
- Petal length (cm)
- Petal width (cm)

In [None]:
# Load the dataset
iris = load_iris()

# Create a DataFrame for easier manipulation
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['species'] = iris.target
df['species_name'] = df['species'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
df.head()

In [None]:
# Basic statistics
df.describe()

In [None]:
# Check class distribution
print("Class distribution:")
print(df['species_name'].value_counts())

# Visualize class distribution
plt.figure(figsize=(8, 5))
df['species_name'].value_counts().plot(kind='bar', color=['#FF6B6B', '#4ECDC4', '#45B7D1'])
plt.title('Distribution of Iris Species')
plt.xlabel('Species')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## 3. Data Visualization

In [None]:
# Pairplot to visualize relationships between features
sns.pairplot(df, hue='species_name', palette='Set2', diag_kind='kde')
plt.suptitle('Pairplot of Iris Features', y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# Correlation heatmap
plt.figure(figsize=(10, 8))
correlation_matrix = df[iris.feature_names].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            square=True, linewidths=1, fmt='.2f')
plt.title('Feature Correlation Heatmap')
plt.tight_layout()
plt.show()

In [None]:
# Box plots for each feature
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Distribution of Features by Species', fontsize=16)

for idx, feature in enumerate(iris.feature_names):
    ax = axes[idx // 2, idx % 2]
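    # Recent seaborn releases deprecate `palette` without `hue`, so map hue to the same column
    sns.boxplot(data=df, x='species_name', y=feature, hue='species_name',
                palette='Set2', dodge=False, ax=ax)
    if ax.get_legend() is not None:  # hue duplicates the x-axis, so drop the legend
        ax.get_legend().remove()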
    ax.set_title(feature.replace(' (cm)', '').title())
    ax.set_xlabel('Species')
    ax.set_ylabel('Measurement (cm)')

plt.tight_layout()
plt.show()
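
The box plots can be complemented with a numeric summary. The sketch below computes per-species feature means with `groupby`. The names `species_means` and `mean_range` are introduced here for illustration and are not used elsewhere in the notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species_name'] = iris.target_names[iris.target]

# Mean of every feature within each species
species_means = df.groupby('species_name')[iris.feature_names].mean()
print(species_means.round(2))

# Gap between the largest and smallest species mean, per feature
mean_range = species_means.max() - species_means.min()
print(mean_range.sort_values(ascending=False).round(2))
```

These ranges are raw centimetre differences and ignore within-species spread, so treat them as a rough indicator of which features separate the species rather than a formal measure.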

## 4. Prepare Data for Modeling

In [None]:
# Split features and target
X = iris.data
y = iris.target

# Split into training and testing sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")

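# Standardize features (matters for scale-sensitive models such as Logistic Regression).
# The scaler is fitted on the training set only, so test-set statistics do not leak into training.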
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("\nData prepared and standardized successfully!")
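
As a quick sanity check on the split and scaling above, the following sketch rebuilds the same split and confirms that `stratify=y` keeps the three classes balanced and that the scaled training features have zero mean and unit standard deviation. The names `train_counts` and `test_counts` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Count samples per class in each subset
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("Train class counts:", train_counts)
print("Test class counts: ", test_counts)

# The scaler learns mean and standard deviation from the training set only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Train mean per feature:", X_train_scaled.mean(axis=0).round(6))
print("Train std per feature: ", X_train_scaled.std(axis=0).round(6))
# Test-set means are close to zero but not exactly zero
print("Test mean per feature: ", X_test_scaled.mean(axis=0).round(3))
```

With 150 samples split 80/20, the stratified split yields 40 training and 10 test samples per species.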

## 5. Train Multiple Models

We'll train and compare four different classification algorithms:
1. Logistic Regression
2. Decision Tree
3. Random Forest
4. LightGBM

In [None]:
# Initialize models
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=200),
    'Decision Tree': DecisionTreeClassifier(random_state=42, max_depth=5),
    'Random Forest': RandomForestClassifier(random_state=42, n_estimators=100),
    'LightGBM': LGBMClassifier(random_state=42, n_estimators=100, verbose=-1)
}

# Train models and store results
results = {}

for name, model in models.items():
    print(f"\nTraining {name}...")
    
    # Use scaled data for Logistic Regression, original for tree-based models
    if name == 'Logistic Regression':
        X_fit, X_eval = X_train_scaled, X_test_scaled
    else:
        X_fit, X_eval = X_train, X_test

    model.fit(X_fit, y_train)
    y_pred = model.predict(X_eval)

    # 5-fold cross-validation on the training set (cross_val_score fits clones,
    # so the fitted model above is left unchanged)
    cv_scores = cross_val_score(model, X_fit, y_train, cv=5)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    
    # Store results
    results[name] = {
        'model': model,
        'predictions': y_pred,
        'accuracy': accuracy,
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std()
    }
    
    print(f"Test Accuracy: {accuracy:.4f}")
    print(f"Cross-validation Score: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
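
One caveat about the loop above: Logistic Regression is cross-validated on features that were scaled with statistics from the whole training set, so each validation fold has already influenced the scaler. On Iris the effect is likely negligible, but a common way to avoid it is to wrap the scaler and classifier in a pipeline. The sketch below is one possible arrangement, not a replacement for the notebook's approach; `pipe` and `test_accuracy` are names introduced for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaler and classifier are fitted together, so each CV fold computes
# its own scaling statistics from that fold's training portion
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=42, max_iter=200),
)

cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)
print(f"CV accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

# Fit on the full training set, then score once on the held-out test set
pipe.fit(X_train, y_train)
test_accuracy = pipe.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.4f}")
```

A pipeline also removes the need to remember which models expect scaled input at prediction time, because the scaling step travels with the model.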

## 6. Model Evaluation and Comparison

In [None]:
# Compare model accuracies
comparison_df = pd.DataFrame({
    'Model': list(results.keys()),
    'Test Accuracy': [results[m]['accuracy'] for m in results.keys()],
    'CV Mean': [results[m]['cv_mean'] for m in results.keys()],
    'CV Std': [results[m]['cv_std'] for m in results.keys()]
})

comparison_df = comparison_df.sort_values('Test Accuracy', ascending=False).reset_index(drop=True)
print("Model Comparison:")
print(comparison_df.to_string(index=False))

# Visualize comparison
fig, ax = plt.subplots(figsize=(12, 6))
x = np.arange(len(comparison_df))
width = 0.35

bars1 = ax.bar(x - width/2, comparison_df['Test Accuracy'], width, label='Test Accuracy', color='#4ECDC4')
bars2 = ax.bar(x + width/2, comparison_df['CV Mean'], width, label='CV Mean', color='#45B7D1')

ax.set_xlabel('Model')
ax.set_ylabel('Accuracy')
ax.set_title('Model Performance Comparison')
ax.set_xticks(x)
ax.set_xticklabels(comparison_df['Model'], rotation=45, ha='right')
ax.legend()
ax.set_ylim([0.9, 1.02])  # Extra headroom for labels at top

# Add value labels on bars
for bars in [bars1, bars2]:
    for bar in bars:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height + 0.002,
                f'{height:.3f}', ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()

In [None]:
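# Detailed evaluation for the best model (ranked by test accuracy).
# Note: picking the winner on the test set makes its reported test score optimistic;
# ranking by 'CV Mean' avoids this.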
best_model_name = comparison_df.iloc[0]['Model']
best_predictions = results[best_model_name]['predictions']

print(f"\nDetailed Evaluation for {best_model_name}:\n")
print("Classification Report:")
print(classification_report(y_test, best_predictions, target_names=iris.target_names))

In [None]:
# Confusion Matrix for best model
cm = confusion_matrix(y_test, best_predictions)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=iris.target_names, 
            yticklabels=iris.target_names,
            cbar_kws={'label': 'Count'})
plt.title(f'Confusion Matrix - {best_model_name}')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.show()
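
To connect the confusion matrix to the precision and recall values in the classification report, the sketch below derives both directly from the matrix. The labels in `y_true` and `y_pred` are hypothetical values made up for illustration. They are not output from the models above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for illustration (0=setosa, 1=versicolor, 2=virginica)
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 2, 1, 2, 2, 1])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Recall: correct predictions divided by the row total (all true members of the class)
recall = cm.diagonal() / cm.sum(axis=1)
# Precision: correct predictions divided by the column total (all predictions of the class)
precision = cm.diagonal() / cm.sum(axis=0)

print("Recall per class:   ", recall.round(3))
print("Precision per class:", precision.round(3))

# Both agree with scikit-learn's per-class metrics
assert np.allclose(recall, recall_score(y_true, y_pred, average=None))
assert np.allclose(precision, precision_score(y_true, y_pred, average=None))
```

Off-diagonal entries show which species are confused with each other. In this made-up example the only errors are between versicolor and virginica.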

## 7. Make Predictions on New Data

Let's use our best model to make predictions on hypothetical new iris flowers.

In [None]:
# Create some example new data
new_samples = np.array([
    [5.1, 3.5, 1.4, 0.2],  # Likely Setosa
    [6.7, 3.0, 5.2, 2.3],  # Likely Virginica
    [5.9, 3.0, 4.2, 1.5],  # Likely Versicolor
])

# Get the best model
best_model = results[best_model_name]['model']

# Make predictions (scale if using Logistic Regression)
if best_model_name == 'Logistic Regression':
    new_samples_scaled = scaler.transform(new_samples)
    predictions = best_model.predict(new_samples_scaled)
    probabilities = best_model.predict_proba(new_samples_scaled)
else:
    predictions = best_model.predict(new_samples)
    probabilities = best_model.predict_proba(new_samples)

# Display predictions
print("Predictions for New Samples:\n")
for i, (sample, pred, probs) in enumerate(zip(new_samples, predictions, probabilities)):
    print(f"Sample {i+1}: {sample}")
    print(f"  Predicted Species: {iris.target_names[pred]}")
    print(f"  Probabilities: {dict(zip(iris.target_names, probs.round(3)))}")
    print()

## Summary

In this tutorial, we:

1. ✅ Loaded and explored the Iris dataset
2. ✅ Visualized the data using various plots
3. ✅ Prepared the data by splitting and scaling
4. ✅ Trained four different classification models
5. ✅ Evaluated and compared model performance
6. ✅ Made predictions on new data

**Key Findings:**
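- All four models typically reach high accuracy (around 90% or better) on this small, well-separated dataset. With only 30 test samples, a single misclassification shifts test accuracy by about 3.3 percentage points
- Petal measurements (length and width) separate the species more cleanly than sepal measurements, as the pair plot and box plots show
- Tree-based models (Random Forest, LightGBM) perform well without any feature scaling. Check the comparison table before ranking models, since a one-sample difference on a 30-sample test set is not meaningful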
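
One way to check the claim about petal measurements is to inspect a Random Forest's impurity-based feature importances. The sketch below fits on the full dataset purely for illustration, which is an assumption made here and not what the notebook does. Impurity-based importances are only one of several importance measures, and `petal_share` is a name introduced for this example.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

# Fit on all 150 samples (illustration only; no held-out evaluation here)
forest = RandomForestClassifier(random_state=42, n_estimators=100)
forest.fit(iris.data, iris.target)

# Importances sum to 1 across the four features
importances = pd.Series(
    forest.feature_importances_, index=iris.feature_names
).sort_values(ascending=False)
print(importances.round(3))

petal_share = importances[['petal length (cm)', 'petal width (cm)']].sum()
print(f"Share of importance from petal features: {petal_share:.2f}")
```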

**Next Steps:**
- Try different hyperparameters with `GridSearchCV` or `RandomizedSearchCV`
- Experiment with feature engineering
- Apply these techniques to more complex datasets
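
As a starting point for the first item, here is a minimal `GridSearchCV` sketch for the Random Forest. The grid values are assumptions chosen to keep the search small, not tuned recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Illustrative grid: 2 x 3 = 6 candidate configurations
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [2, 4, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")
# refit=True (the default) retrains the best configuration on all of X_train
print(f"Test accuracy: {search.score(X_test, y_test):.4f}")
```

The search only ever sees the training set, so the final test score remains an honest estimate of the tuned model's performance.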