# Logistic Regression Case Study: Iris Flower Classification

## Overview of Machine Learning Pipeline

This notebook demonstrates a complete **Machine Learning Classification Pipeline** using the famous Iris dataset. We'll walk through each stage of the ML process:

1. **Data Loading & Exploration** - Understanding our dataset
2. **Data Visualization** - Exploring patterns and relationships
3. **Data Preparation** - Splitting data for training/testing
4. **Model Training** - Building our classifier
5. **Model Evaluation** - Assessing performance
6. **Feature Analysis** - Understanding what drives predictions

---

## Stage 1: Data Loading & Exploration

In this stage, we:
- Import necessary libraries for data manipulation and ML
- Load the Iris dataset (a classic dataset in ML)
- Examine the structure and basic information about our data

The Iris dataset contains measurements of 150 iris flowers from three different species, with 4 features each.

In [None]:
import numpy as np  # n-dimensional arrays and linear algebra
import pandas as pd  # DataFrames for tabular data (e.g. pd.read_csv())
import matplotlib.pyplot as plt  # plotting
import seaborn as sns  # statistical plotting library built on matplotlib
from sklearn.datasets import load_iris  # scikit-learn ships ML algorithms and a few sample datasets
from sklearn.model_selection import train_test_split  # splits data into training and testing sets
from sklearn.linear_model import LogisticRegression  # ready-made model, so we don't have to implement one
from sklearn.metrics import accuracy_score, confusion_matrix  # evaluation metrics

In [None]:
# Load the Iris dataset
print("=== 1. Loading the Iris Dataset ===")
iris = load_iris()
X = iris.data  # X holds the input features (150 samples x 4 measurements)
y = iris.target  # y holds the output labels (0, 1, 2, one per species)
feature_names = iris.feature_names
target_names = iris.target_names

In [None]:
target_names

In [None]:
# Create a DataFrame for better visualization
df = pd.DataFrame(X, columns=feature_names)  # pandas stores tabular data in DataFrames
df['species'] = [target_names[i] for i in y]

print("\nFirst few rows of the dataset:")
print(df.head())
print("\nDataset shape:", X.shape)
print("Features:", feature_names)
print("Target classes:", target_names)
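
Before moving on, it can help to confirm the basic facts stated above. The following is a minimal sketch that counts samples per species and checks for missing values; the names `df_check`, `class_counts`, and `missing_per_column` are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the DataFrame so this cell runs on its own
iris = load_iris()
df_check = pd.DataFrame(iris.data, columns=iris.feature_names)
df_check["species"] = iris.target_names[iris.target]

# Number of samples per species (a balanced dataset has equal counts)
class_counts = df_check["species"].value_counts()
print(class_counts)

# Number of missing values in each column
missing_per_column = df_check.isna().sum()
print(missing_per_column)
```

Equal class counts matter later: plain accuracy is only a fair summary when no class dominates the data.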

## Stage 2: Data Visualization

**Why visualize data?** Visualization helps us:
- Understand relationships between features
- Identify patterns that might help our model
- Detect potential data quality issues
- Choose appropriate preprocessing steps

The pairplot shows how each feature relates to every other feature, with points colored by species.

In [None]:
# Visualize the data
print("\n=== 2. Visualizing the Data ===")
sns.pairplot(df, hue='species')
plt.show()
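
As a numeric companion to the pairplot, per-species feature means give a rough idea of which measurements differ most between classes. This is a small sketch under the same setup; `df_means` and `species_means` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df_means = pd.DataFrame(iris.data, columns=iris.feature_names)
df_means["species"] = iris.target_names[iris.target]

# Average of each measurement within each species
species_means = df_means.groupby("species").mean()
print(species_means.round(2))
```

Features whose means differ strongly across species are the ones that tend to show clearly separated clusters in the pairplot.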

## Stage 3: Data Preparation

**Why split the data?** We need separate datasets to:
- **Train** our model on one set of data
- **Test** our model on unseen data to get realistic performance estimates
- Detect overfitting (when the model memorizes the training data but fails on new data)

A common split is 80% for training and 20% for testing; there is no single standard ratio.

In [None]:
# Split the data
print("\n=== 3. Preparing Data for Training ===")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")
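
A plain random split does not guarantee that each species is equally represented in the test set. One possible variation, sketched here as an option rather than as what this notebook does, is to pass `stratify=y` so the class proportions are preserved. The names `X_tr`, `X_te`, `y_tr`, `y_te`, and `test_class_counts` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# Count how many test samples belong to each class
test_class_counts = np.bincount(y_te)
print("Test samples per class:", test_class_counts)
```

With small datasets like this one, stratification makes the test-set metrics less dependent on the luck of the draw.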

In [None]:
model = LogisticRegression(max_iter=100)

In [None]:
model.fit(X_train, y_train)

## Stage 4: Model Training

**What is Logistic Regression?**
- A classification algorithm that predicts categorical outcomes
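- Uses a sigmoid function (two classes) or its multiclass generalization, softmax (three classes, as here), to output probabilities between 0 and 1
- Learns the relationship between features and target classes
- Works well when classes can be separated by linear boundaries (in Iris, setosa is cleanly separable, while versicolor and virginica overlap slightly)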

**Training Process:**
- Model learns patterns from training data
- Adjusts internal parameters (coefficients) to minimize prediction errors
- Uses optimization algorithms to find the best parameters
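
To make the probability step concrete: for a three-class problem such as Iris, recent scikit-learn versions turn the per-class linear scores into probabilities with a softmax, the multiclass generalization of the sigmoid. The sketch below recomputes those probabilities by hand and compares them with `predict_proba`. It assumes a scikit-learn version whose default multiclass behaviour is multinomial (older releases used one-vs-rest, where this check would not hold); `clf`, `scores`, `manual_proba`, and `library_proba` are illustrative names, and `max_iter=1000` is chosen here only to give the solver room to converge.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Linear scores (one column per class) for the first five flowers
scores = clf.decision_function(X[:5])

# Softmax: subtract the row maximum for numerical stability, then normalize
shifted = scores - scores.max(axis=1, keepdims=True)
manual_proba = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# Probabilities as reported by scikit-learn
library_proba = clf.predict_proba(X[:5])
print("Manual softmax matches predict_proba:", np.allclose(manual_proba, library_proba))
```

Each row of `manual_proba` sums to 1, and the predicted class is simply the column with the highest probability.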

In [None]:
# The model was already fitted by model.fit(X_train, y_train) above; this cell only reports it
print("\n=== 4. Training the Classifier ===")
print("Using Logistic Regression, which is a simple but powerful classification algorithm.")
print("Model training completed!")

In [None]:
X_train.shape, y_train.shape

In [None]:
y_test_predicted_by_model = model.predict(X_test)

In [None]:
y_test_predicted_by_model

In [None]:
accuracy_score(y_test, y_test_predicted_by_model)  # signature is (y_true, y_pred)

In [None]:
testing_accuracy = np.mean(model.predict(X_test) == y_test) * 100 
training_accuracy = np.mean(model.predict(X_train) == y_train) * 100

In [None]:
training_accuracy, testing_accuracy  # a large gap between the two would suggest overfitting

In [None]:
# Make predictions
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel Accuracy: {accuracy:.2f}")

## Stage 5: Model Evaluation

**Why evaluate models?** We need to understand:
- How well our model performs on unseen data
- Which classes are easier or harder to predict
- Whether our model is biased toward certain predictions
- If we have overfitting or underfitting issues

**Key Metrics:**
- **Accuracy**: Overall percentage of correct predictions
- **Confusion Matrix**: Shows detailed breakdown of predictions vs actual values

In [None]:
# Show confusion matrix
print("\n=== 5. Model Performance ===")
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=target_names,
                yticklabels=target_names)
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.show()
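
The confusion matrix also yields per-class metrics directly: recall is the diagonal divided by the row sums, and precision is the diagonal divided by the column sums. The sketch below refits the same kind of model so it runs on its own; `clf`, `cm_check`, `recall_per_class`, and `precision_per_class` are illustrative names, and `max_iter=1000` is an assumption made here to avoid convergence issues.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.20, random_state=42
)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes
cm_check = confusion_matrix(y_te, clf.predict(X_te))

# Recall: of the flowers that truly belong to a class, how many were found
recall_per_class = np.diag(cm_check) / cm_check.sum(axis=1)

# Precision: of the flowers predicted as a class, how many were correct
# (this assumes every class is predicted at least once, so no column sum is zero)
precision_per_class = np.diag(cm_check) / cm_check.sum(axis=0)

for name, rec, prec in zip(iris.target_names, recall_per_class, precision_per_class):
    print(f"{name}: recall={rec:.2f}, precision={prec:.2f}")
```

These per-class numbers answer the question of which species are easier or harder to predict, which overall accuracy hides.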

## Stage 6: Feature Analysis

**Why analyze feature importance?** This helps us:
- Understand which features drive the model's decisions
- Identify the most predictive characteristics
- Potentially simplify the model by removing less important features
- Gain domain insights about what distinguishes the classes

**Interpretation:**
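- Higher absolute coefficient values = more influential features, provided the features are on comparable scales (here all four are measured in cm)
- Positive coefficients = increasing the feature raises the score (and, other things equal, the probability) of that class
- Negative coefficients = increasing the feature lowers the score of that class
- In a multiclass model there is one row of coefficients per class, so importance depends on which class you look at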

In [None]:
# Show feature importance
print("\n=== 6. Feature Importance ===")
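# model.coef_ has one row per class; coef_[0] alone would describe only setosa,
# so average the absolute coefficients across all three classes
importance = np.abs(model.coef_).mean(axis=0)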
plt.figure(figsize=(10, 6))
plt.bar(feature_names, importance)
plt.title('Feature Importance')
plt.xticks(rotation=45)
plt.ylabel('Importance')
plt.tight_layout()
plt.show()
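
Because a multiclass logistic regression stores one row of coefficients per class, a table view can be easier to read than a single bar chart. The sketch below is one possible way to build it, not the notebook's own method: it standardizes the features first (so coefficient magnitudes are comparable) and then lists coefficients by class. `pipe`, `coef_table`, and `mean_abs` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# Standardize features so that coefficient sizes can be compared directly
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(iris.data, iris.target)

# One row per class, one column per feature
coef_table = pd.DataFrame(
    pipe.named_steps["logisticregression"].coef_,
    index=iris.target_names,
    columns=iris.feature_names,
)
print(coef_table.round(2))

# A single summary per feature: mean absolute coefficient across classes
mean_abs = coef_table.abs().mean(axis=0).sort_values(ascending=False)
print(mean_abs.round(2))
```

Reading the table row by row shows the sign of each feature's effect for each species, which the absolute-value summary discards.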

## Summary

**What we accomplished:**
✅ Loaded and explored the Iris dataset  
✅ Visualized feature relationships  
✅ Split data into training/testing sets  
✅ Trained a Logistic Regression classifier  
✅ Evaluated model performance (100% accuracy on the 30-sample test set)  
✅ Analyzed feature importance  

**Key Takeaways:**
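- The Iris dataset is small and mostly well-separated (setosa cleanly so, versicolor and virginica with slight overlap), making it a good fit for learning ML concepts
- Logistic Regression achieved perfect accuracy on this particular test split; with only 30 test samples, this shows the classes are easy to separate here, not that they are strictly linearly separable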
- Feature importance analysis shows which measurements are most predictive of species
- This pipeline can be applied to other classification problems with appropriate modifications

**Next Steps:**
- Try with more complex datasets
- Experiment with different algorithms (Random Forest, SVM, Neural Networks)
- Add cross-validation for more robust evaluation
- Implement feature scaling and other preprocessing steps
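
As a starting point for the last two items, here is a hedged sketch that combines feature scaling and 5-fold cross-validation in one scikit-learn pipeline. The names `pipe` and `cv_scores`, the choice of `StandardScaler`, and `max_iter=1000` are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Putting the scaler inside the pipeline means it is refit on each training
# fold only, so no information from the held-out fold leaks into preprocessing
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# For classifiers, cv=5 uses stratified 5-fold cross-validation
cv_scores = cross_val_score(pipe, X, y, cv=5)
print("Accuracy per fold:", cv_scores.round(3))
print(f"Mean accuracy: {cv_scores.mean():.3f} (std {cv_scores.std():.3f})")
```

Averaging over several folds gives a steadier estimate than a single 80/20 split, which is especially useful when the test set has only 30 samples.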