# Machine Learning Basics: Iris Classification

## Overview
This notebook demonstrates a simple machine learning workflow using the famous Iris dataset. We'll build a model that can automatically identify iris flower species based on physical measurements.

## The Problem
Imagine you're a botanist who finds an iris flower and measures:
- Sepal length (cm)
- Sepal width (cm)
- Petal length (cm)
- Petal width (cm)

**Goal:** Predict which of the 3 iris species it belongs to:
- Iris Setosa
- Iris Versicolor
- Iris Virginica

## Why Machine Learning?
Instead of manually creating rules, we'll let the model **learn patterns** from 150 known examples, then use those patterns to classify new flowers automatically.

## Dataset
- **Size:** 150 flowers (50 of each species)
- **Features:** 4 measurements per flower
- **Target:** Species classification (3 classes)
- **Source:** Built into scikit-learn (no external files needed)

#### Descriptive statistics on the dataset

In [None]:
# Import required libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt

# Load the Iris dataset
iris = load_iris()
X = iris.data      # Features: the 4 measurements
y = iris.target    # Labels: species (0, 1, or 2)

# Create a DataFrame for better visualization
df = pd.DataFrame(X, columns=iris.feature_names)
df['species'] = iris.target_names[y]

# Display basic information
print(f"Dataset shape: {X.shape}")
print(f"Number of samples: {len(X)}")
print(f"Number of features: {X.shape[1]}")
print(f"Species: {iris.target_names}")
print(df.describe())
print(df.groupby('species').size())
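print("\nDistribution of each measurement")
# df.hist() creates its own figure, so the size is passed directly
# (a separate plt.figure() call would only leave an empty figure behind)
df.hist(figsize=(12, 8))
plt.tight_layout()
plt.show()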


### The first rows of the dataset

In [None]:
df.head(20)


#### Split Data into Training and Test Sets

We split the data into two parts:
- **70% Training data:** The model learns patterns from this
- **30% Test data:** We use this to evaluate how well the model works on unseen data

This prevents "cheating": a model that merely memorized the training answers would be exposed by the test set.
The Iris dataset is already prepared:
- `X` holds the features (4 measurements per flower)
- `y` holds the target (the species, encoded as 0, 1, or 2)
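
One optional variation, not used in the rest of this notebook, is a stratified split. The sketch below assumes you want each species equally represented in both subsets and passes `stratify` to `train_test_split`. The names `X_demo`, `y_demo`, `X_tr`, and `X_te` are placeholders chosen here so the notebook's own variables are left untouched.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load a separate copy of the data so the notebook's X and y are not modified
X_demo, y_demo = load_iris(return_X_y=True)

# stratify=y_demo keeps the 1:1:1 species ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.3,
    random_state=7,
    stratify=y_demo,
)

# Count how many flowers of each species ended up in each subset
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print("Flowers per species (training):", train_counts)
print("Flowers per species (test):    ", test_counts)
```

Without `stratify`, the split is purely random, so one species can end up slightly over- or under-represented in the test set.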

In [None]:
# Split the data: 70% training, 30% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,      # 30% for testing
    random_state=7     # Fixed seed for reproducibility
)

# Show the split
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"\nTraining set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")


#### Train the Model

We'll use a **Decision Tree Classifier** - a model that learns a series of yes/no questions to classify flowers.

The model will analyze the training data and automatically discover patterns like:
- "If petal length < 2.5cm, then it's Setosa"
- "If petal width > 1.8cm, then it's Virginica"

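We use a maximum depth of 3 as a starting point: it limits the tree to at most three questions per flower, which keeps the rules readable and reduces the risk of overfitting.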
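
To get a feel for what such learned rules look like, the sketch below prints a small tree as plain text with `export_text`. It is an illustration only: the tree here (`tree_demo`, a placeholder name) is fitted on all 150 flowers just to display rules, so its thresholds may differ from those of the model trained in the next cell.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris_demo = load_iris()

# Fit a shallow tree on the full dataset, purely to display its rules
tree_demo = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_demo.fit(iris_demo.data, iris_demo.target)

# export_text renders the learned yes/no questions as an indented outline
rules_text = export_text(tree_demo, feature_names=list(iris_demo.feature_names))
print(rules_text)
```

Each indented line is one question about a measurement, and each `class:` line is the species index predicted at that leaf.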

In [None]:
# Create the model
model = DecisionTreeClassifier(max_depth=3, random_state=42)

# Train the model on our training data
model.fit(X_train, y_train)

print("✓ Model training complete!")
print(f"\nModel type: {type(model).__name__}")
print(f"Max depth: {model.max_depth}")
print(f"Number of input features: {model.n_features_in_}")


#### Test the Model

Now we test how well our trained model performs on the **unseen test data** (the 30% we held back).

The model has never seen these 45 flowers before - this is the real test!

In [None]:
# Make predictions on test data
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

print(f"Test Accuracy: {accuracy:.2%}")
print(f"\nCorrect predictions: {(y_pred == y_test).sum()} out of {len(y_test)}")
print(f"Wrong predictions: {(y_pred != y_test).sum()}")

#### Detailed Classification Report

Let's see how well the model performs for each species:

In [None]:
# Detailed report per species
print(classification_report(y_test, y_pred, target_names=iris.target_names))


#### Predict New Flowers

Now let's use our trained model to classify flowers we've never seen before.

We'll create some example measurements and let the model predict the species.

In [None]:
# Create new flower measurements (manually invented examples)
new_flowers = [
    [5.1, 3.5, 1.4, 0.2],  # Small petals - probably Setosa
    [6.5, 3.0, 5.2, 2.0],  # Large petals - probably Virginica
    [5.7, 2.8, 4.1, 1.3],  # Medium petals - probably Versicolor
]

# Make predictions
predictions = model.predict(new_flowers)

# Display results
print("Predictions for new flowers:\n")
for i, flower in enumerate(new_flowers):
    species = iris.target_names[predictions[i]]
    print(f"Flower {i+1}: {flower}")
    print(f"  → Predicted species: {species}\n")

#### Try Your Own!

You can modify the `new_flowers` list above with your own measurements:
- Sepal length (cm): typically 4.0 - 8.0
- Sepal width (cm): typically 2.0 - 4.5
- Petal length (cm): typically 1.0 - 7.0
- Petal width (cm): typically 0.1 - 2.5
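
If you also want to know how confident the tree is, `predict_proba` returns one probability per species. The sketch below is self-contained and assumes the same split and tree settings as this notebook. `tree_demo` and `my_flower` are placeholder names, and the measurements are example values chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris_demo = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_demo.data, iris_demo.target, test_size=0.3, random_state=7
)
tree_demo = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)

# Example measurements: sepal length, sepal width, petal length, petal width (cm)
my_flower = [[6.1, 2.9, 4.6, 1.4]]

# For a decision tree, the probabilities are the class fractions
# of the training flowers in the leaf the new flower falls into
probabilities = tree_demo.predict_proba(my_flower)[0]
predicted_name = iris_demo.target_names[tree_demo.predict(my_flower)[0]]

print(f"Predicted species: {predicted_name}")
for name, p in zip(iris_demo.target_names, probabilities):
    print(f"  {name}: {p:.2f}")
```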

#### Visualize the Decision Tree

Let's look at the actual decision rules the model learned. Each box shows:
- The decision rule (e.g., "petal width <= 0.8")
- The gini impurity (measure of how mixed the classes are)
- The number of samples at that node
- The predicted class (color-coded)
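
The gini impurity shown in each box can be checked by hand. For class fractions $p_k$ at a node, it is $1 - \sum_k p_k^2$. The helper below (`gini_impurity`, a name introduced here for illustration) is a minimal sketch of that formula.

```python
def gini_impurity(class_counts):
    """Return the Gini impurity 1 - sum(p_k^2) for a list of per-class counts."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# A node holding all 150 flowers, 50 of each species: as mixed as 3 classes can be
mixed_gini = gini_impurity([50, 50, 50])

# A node holding only one species: perfectly pure
pure_gini = gini_impurity([50, 0, 0])

print(f"Evenly mixed node: {mixed_gini:.3f}")  # 0.667
print(f"Pure node:         {pure_gini:.3f}")   # 0.000
```

A value of 0 means every flower at that node belongs to one species, so lower values further down the tree indicate cleaner splits.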

In [None]:
from sklearn.tree import plot_tree

# Create a large figure for better readability
plt.figure(figsize=(20, 10))

# Plot the decision tree
plot_tree(
    model,
    feature_names=iris.feature_names,
    class_names=iris.target_names,
    filled=True,           # Color nodes by majority class
    rounded=True,          # Rounded boxes look nicer
    fontsize=12
)

plt.title("Decision Tree Structure (max_depth=3)", fontsize=16, pad=20)
plt.tight_layout()
plt.show()

#### Compare Different Models

Let's see how different machine learning algorithms perform on the same Iris dataset.

We'll compare:
- **Decision Tree:** Uses yes/no questions (what we've been using)
- **Random Forest:** Multiple decision trees voting together
- **Logistic Regression:** Finds linear boundaries between classes
- **Support Vector Machine (SVM):** Finds optimal separation planes
- **K-Nearest Neighbors (KNN):** Classifies based on similar examples
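
SVM and KNN both depend on distances between measurements, so they are commonly paired with feature scaling. This notebook compares the models on the raw centimetre values. The sketch below shows an optional variation using `StandardScaler` in a pipeline, with `scaled_models` and `scaled_scores` as placeholder names. On Iris, where all features share the same unit, the effect may be small.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7
)

# Each pipeline standardizes the features, then fits the classifier.
# The scaler is fitted on the training data only, so the test set stays unseen.
scaled_models = {
    'SVM (scaled)': make_pipeline(StandardScaler(), SVC(kernel='rbf', random_state=42)),
    'KNN (scaled)': make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

scaled_scores = {}
for name, pipeline in scaled_models.items():
    pipeline.fit(X_tr, y_tr)
    scaled_scores[name] = pipeline.score(X_te, y_te)
    print(f"{name}: test accuracy {scaled_scores[name]:.2%}")
```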

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Define different models to compare
models = {
    'Decision Tree': DecisionTreeClassifier(max_depth=3, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Logistic Regression': LogisticRegression(max_iter=200, random_state=42),
    'SVM': SVC(kernel='rbf', random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5)
}

# Train and evaluate each model
results = {}

print("Training and testing different models...\n")
print("-" * 50)

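for name, clf in models.items():
    # Train the model (the loop variable is named clf so that the decision
    # tree stored in `model` by the earlier cells is not overwritten)
    clf.fit(X_train, y_train)

    # Evaluate on the training data
    train_accuracy = clf.score(X_train, y_train)

    # Evaluate on the held-out test data
    test_accuracy = clf.score(X_test, y_test)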

    # Store results
    results[name] = {
        'train': train_accuracy,
        'test': test_accuracy
    }

    print(f"{name}")
    print(f"  Training Accuracy: {train_accuracy:.2%}")
    print(f"  Test Accuracy:     {test_accuracy:.2%}")
    print("-" * 50)

#### Visual Comparison

In [None]:
import numpy as np

# Prepare data for plotting
model_names = list(results.keys())
train_scores = [results[name]['train'] for name in model_names]
test_scores = [results[name]['test'] for name in model_names]

# Create bar chart
x = np.arange(len(model_names))
width = 0.35

fig, ax = plt.subplots(figsize=(12, 6))
bars1 = ax.bar(x - width/2, train_scores, width, label='Training Accuracy', alpha=0.8)
bars2 = ax.bar(x + width/2, test_scores, width, label='Test Accuracy', alpha=0.8)

# Customize the plot
ax.set_ylabel('Accuracy', fontsize=12)
ax.set_title('Model Comparison: Training vs Test Accuracy', fontsize=14, pad=20)
ax.set_xticks(x)
ax.set_xticklabels(model_names, rotation=45, ha='right')
ax.legend()
ax.set_ylim([0.85, 1.02])  # Focus on the relevant range
ax.grid(axis='y', alpha=0.3)

# Add value labels on bars
for bars in [bars1, bars2]:
    for bar in bars:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height,
                f'{height:.1%}',
                ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()

### Key Observations

**What to look for:**
- **High test accuracy:** The model works well on new data
- **Similar train/test scores:** The model generalizes well (not overfitting)
- **Train >> Test:** Warning sign of overfitting (memorizing training data)
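
One way to see these signs in practice is to vary how deep a decision tree may grow and compare training and test accuracy. The sketch below is self-contained and assumes the same split settings as this notebook. `depth_scores` is a placeholder name.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7
)

# max_depth=None lets the tree grow until its leaves are pure
depth_scores = {}
for depth in [1, 3, None]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_tr, y_tr)
    depth_scores[depth] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))

for depth, (train_acc, test_acc) in depth_scores.items():
    print(f"max_depth={depth}: train {train_acc:.2%}, test {test_acc:.2%}")
```

A training score that keeps rising with depth while the test score stalls or drops is the overfitting pattern described above.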

**Typical results for Iris:**
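- All models typically perform well (roughly 95-100%) because Iris is a small, clean dataset
- Random Forest is often among the most accurate, but with only 45 test flowers a single misclassification changes accuracy by about 2 percentage points, so small differences are not meaningful
- Decision Tree is most interpretable (we can see the rules)
- SVM and Logistic Regression work well because the classes are nearly linearly separable: Setosa separates cleanly, while Versicolor and Virginica overlap slightly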

**For real-world problems:**
- Results vary much more between models
- You'd need to test multiple algorithms
- Consider speed, interpretability, and accuracy trade-offs

## Summary: What We Learned

### Machine Learning Type

**This is Supervised Learning:**
- We have labeled training data (features + known species)
- The model learns the relationship between inputs and outputs
- We "supervise" the learning by providing correct answers

**This is Multi-Class Classification:**
- We predict one of 3 categories (Setosa, Versicolor, Virginica)
- As opposed to:
  - **Binary Classification:** Only 2 classes (e.g., Spam/Not Spam, Fraud/Legitimate)
  - **Regression:** Predicting continuous numbers (e.g., house prices, temperature)

**Other ML types we didn't use:**
- **Unsupervised Learning:** Finding patterns without labels (e.g., customer segmentation, clustering similar flowers without knowing species)
- **Reinforcement Learning:** Learning through trial and error with rewards (e.g., game AI, robotics)
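
For contrast, here is a minimal sketch of the unsupervised case: `KMeans` groups the flowers using the measurements alone, without ever seeing the species labels. The choice of 3 clusters is an assumption made here because we happen to know there are 3 species.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Only the measurements are used; the species labels are discarded
X_demo, _ = load_iris(return_X_y=True)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X_demo)

# Cluster numbers are arbitrary: cluster 0 does not mean "Setosa"
print("Flowers per cluster:", np.bincount(cluster_ids))
```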

### The Machine Learning Workflow

We completed a full ML pipeline from start to finish:

1. **Load Data:** Used the built-in Iris dataset (150 flowers, 4 measurements each)
2. **Explore Data:** Examined the structure and statistics
3. **Split Data:** 70% training, 30% testing (to prevent cheating)
4. **Train Model:** Decision Tree learned patterns from 105 flowers
5. **Test Model:** Evaluated on 45 unseen flowers
6. **Make Predictions:** Classified new flowers based on measurements
7. **Visualize:** Saw the actual decision rules the model uses
8. **Compare Models:** Tested 5 different algorithms

### Key Concepts

**Features (X):** The input data we measure
- Sepal length, sepal width, petal length, petal width

**Target (y):** What we want to predict
- Iris species (Setosa, Versicolor, Virginica)

**Training:** The model learns patterns from labeled examples

**Testing:** We evaluate how well it works on new, unseen data

**Accuracy:** Percentage of correct predictions
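
As a small worked example with made-up labels (not results from this notebook): if 5 of 6 predictions match the true labels, accuracy is 5/6. A confusion matrix additionally shows which classes were mixed up.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels invented for illustration: one class-1 flower is mistaken for class 2
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 0, 1, 2, 2, 2]

toy_accuracy = accuracy_score(y_true_toy, y_pred_toy)
toy_matrix = confusion_matrix(y_true_toy, y_pred_toy)

print(f"Accuracy: {toy_accuracy:.2%}")  # 83.33%
print(toy_matrix)  # rows = true class, columns = predicted class
```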

### Why Split Train/Test?

If we test on the same data we trained on, we're cheating! The model has seen the answers. It's like giving students the exact exam questions before the test.

By holding back 30% for testing, we get an honest measure of how well the model generalizes to new data.

### Decision Tree Advantages

- ✓ Easy to understand and visualize
- ✓ Works with minimal data preprocessing
- ✓ Can explain every prediction (interpretable)
- ✓ Fast training and prediction

### Next Steps

**To improve this project, you could:**
- Try different `max_depth` values (1, 5, 10, None)
- Experiment with other `random_state` values
- Use cross-validation for more robust evaluation
- Test with your own custom flower measurements
- Apply this to a different dataset (wine, digits, etc.)
- Build a Streamlit app to make predictions interactively
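
As a starting point for the cross-validation idea, here is a minimal sketch using `cross_val_score` with 5 folds. The fold count is an assumption chosen here, not something this notebook prescribes. Every flower is used for testing exactly once, so the result depends less on one particular split.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# cv=5 trains and tests the tree 5 times, each time holding out a different fifth
cv_scores = cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=42),
    X_demo, y_demo,
    cv=5,
)

print("Accuracy per fold:", cv_scores.round(3))
print(f"Mean accuracy: {cv_scores.mean():.2%} (std {cv_scores.std():.2%})")
```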

**To learn more:**
- Try regression problems (predicting house prices, temperatures)
- Explore unsupervised learning (clustering without labels)
- Study other algorithms (Neural Networks, Gradient Boosting)
- Learn about feature engineering
- Explore hyperparameter tuning
- Work with real-world messy data
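
For hyperparameter tuning, one common approach is a grid search combined with cross-validation. The sketch below tries several `max_depth` values for the decision tree. The candidate values and the 5 folds are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# Candidate values to try; None means the tree depth is not limited
param_grid = {'max_depth': [1, 2, 3, 5, None]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,  # each candidate is scored with 5-fold cross-validation
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.2%}")
```

For an honest final estimate, you would run the search on the training data only and keep the test set aside until the very end.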