# Logistic Regression on Iris (Colab-ready)

A gentle, end-to-end notebook for **Logistic Regression** using the classic **Iris** dataset.

We will:
1. Load the data
2. Explore and visualize
3. Split into train/test
4. Train Logistic Regression
5. Evaluate with accuracy, confusion matrix, classification report
6. Try a few test-time predictions

**Tip:** Run the cells from top to bottom. Section 5 sets `max_iter=1000` so that the solver should not emit a convergence warning.


In [None]:
# 0) Setup: import everything the notebook uses (Colab usually ships with these packages preinstalled)
import sys
print(sys.version)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay, classification_report
%matplotlib inline


## 1) Load the Iris dataset
We'll use `sklearn.datasets.load_iris`. It returns features `data`, labels `target`, and metadata like feature names and target names.

In [None]:
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name='target')
target_names = iris.target_names
feature_names = iris.feature_names
print(f"Shape X: {X.shape}, y: {y.shape}")
X.head()

## 2) Quick EDA & Visualization
- Peek at basic statistics
- Plot histograms
- A simple scatter plot (petal length vs petal width)

> We stick to `matplotlib` to keep it lightweight and portable.

In [None]:
display(X.describe())

# Histograms for each feature
ax = X.hist(bins=15, figsize=(10,6))
plt.suptitle("Feature Distributions", y=1.02)
plt.show()

# Simple scatter: petal length vs petal width colored by species
plt.figure(figsize=(6,4))
for cls in np.unique(y):
    mask = (y == cls)
    plt.scatter(X.loc[mask, feature_names[2]], X.loc[mask, feature_names[3]], label=target_names[cls], alpha=0.8)
plt.xlabel(feature_names[2])
plt.ylabel(feature_names[3])
plt.title("Petal length vs Petal width by species")
plt.legend()
plt.show()
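
To put numbers behind the scatter plot, the per-species feature means can be tabulated as well. This is a minimal sketch that reloads the data with `as_frame=True` so it runs on its own; the names `df` and `per_species_mean` are introduced here for illustration and are not used elsewhere in the notebook.

```python
from sklearn.datasets import load_iris

# Load Iris as a single DataFrame (feature columns plus an integer 'target' column)
iris = load_iris(as_frame=True)
df = iris.frame.copy()

# Map integer labels to readable species names
label_to_name = {i: str(name) for i, name in enumerate(iris.target_names)}
df["species"] = df["target"].map(label_to_name)

# Mean of every feature within each species
per_species_mean = df.groupby("species")[iris.feature_names].mean()
print(per_species_mean.round(2))
```

The petal columns typically differ far more between species than the sepal columns, which is why the scatter plot above uses them.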

## 3) Train/Test Split
We'll do a stratified split so that each class is represented proportionally in both training and testing sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print("Train shape:", X_train.shape, " Test shape:", X_test.shape)
print("Train class counts:\n", y_train.value_counts().sort_index())
print("Test class counts:\n", y_test.value_counts().sort_index())
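
To see what `stratify=y` buys us, the sketch below compares per-class test counts with and without stratification. It is self-contained and uses illustrative names (`strat_counts`, `plain_counts`) that do not appear in the rest of the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Stratified split: class proportions are preserved in the test set
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Plain random split: class proportions may drift by chance
_, _, _, y_test_plain = train_test_split(
    X, y, test_size=0.2, random_state=42
)

strat_counts = np.bincount(y_test_strat, minlength=3)
plain_counts = np.bincount(y_test_plain, minlength=3)
print("Stratified test counts:", strat_counts)
print("Plain test counts:     ", plain_counts)
```

With 50 samples per class and a 20% test set, the stratified split puts exactly 10 samples of each species in the test set; the plain split only guarantees 30 test samples in total.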

## 4) (Optional) Feature Scaling
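Logistic Regression often benefits from scaling: the default L2 penalty treats all coefficients alike, and `lbfgs` tends to converge faster when features share a similar range. We standardize each feature to zero mean and unit variance with `StandardScaler`, fitting it on the training set only and reusing those statistics on the test set so that no test information leaks into training.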

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)
X_train_scaled[:3]
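
For readers who want to see what the scaler computes, here is a small sketch that reproduces `scaler.transform` by hand as `(x - mean) / std`, using training statistics only. It reloads the data as NumPy arrays so it can run independently; `train_mean`, `train_std`, and `X_test_manual` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler().fit(X_train)

# Manual standardization with statistics from the training set only
train_mean = X_train.mean(axis=0)
train_std = X_train.std(axis=0)  # NumPy's default (ddof=0) matches StandardScaler
X_test_manual = (X_test - train_mean) / train_std

matches = np.allclose(X_test_manual, scaler.transform(X_test))
print("Manual result matches StandardScaler:", matches)
```

Note that the scaled test set will generally not have exactly zero mean and unit variance, because it is scaled with the training statistics rather than its own.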

## 5) Train Logistic Regression
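- Recent scikit-learn releases (1.5 and later) deprecate `multi_class`, and setting it explicitly raises a `FutureWarning`, so we leave it unset. With `lbfgs` and three classes, the default fits a multinomial (softmax) model.
- `lbfgs` works well on small, dense datasets like Iris; we set `max_iter=1000` to avoid convergence warnings.


In [None]:
# multi_class is omitted: it is deprecated in scikit-learn >= 1.5, and the default is already multinomial for lbfgs
logreg = LogisticRegression(max_iter=1000, solver='lbfgs', random_state=42)
logreg.fit(X_train_scaled, y_train)
print("Training complete.")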
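
To connect the fitted model to the underlying math, the sketch below recomputes the class probabilities by hand: a linear score per class followed by a softmax. It assumes the multinomial formulation (the default for `lbfgs` with more than two classes in recent scikit-learn versions) and, for brevity, fits on the full scaled dataset rather than the train split; `model`, `scores`, and `manual_probs` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

model = LogisticRegression(max_iter=1000).fit(X_scaled, y)

# One weight vector and one intercept per class
print("coef_ shape:", model.coef_.shape, "intercept_ shape:", model.intercept_.shape)

# Linear scores: one column per class
scores = X_scaled @ model.coef_.T + model.intercept_

# Softmax over classes (subtract the row max for numerical stability)
shifted = scores - scores.max(axis=1, keepdims=True)
exp_scores = np.exp(shifted)
manual_probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)

print("Matches predict_proba:", np.allclose(manual_probs, model.predict_proba(X_scaled)))
```

The predicted class is simply the column with the largest probability, which is the same as the column with the largest linear score.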

## 6) Evaluate on Test Set
- Accuracy
- Confusion Matrix
- Classification Report (precision, recall, f1-score per class)


In [None]:
y_pred = logreg.predict(X_test_scaled)
acc = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {acc:.4f}")

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=target_names)
disp.plot(values_format='d')
plt.title("Confusion Matrix (Test)")
plt.show()

print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=target_names))
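
The numbers in the classification report can all be derived from the confusion matrix. The sketch below shows how, using a small set of hypothetical labels (`y_true` and `y_hat` are made up for illustration and are not the notebook's test results).

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical labels, for illustration only
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_hat = np.array([0, 0, 0, 1, 1, 2, 2, 2, 1, 2])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)

accuracy = np.trace(cm) / cm.sum()         # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)      # per class: correct / all true members
precision = np.diag(cm) / cm.sum(axis=0)   # per class: correct / all predicted members

print("accuracy :", accuracy)
print("recall   :", np.round(recall, 3))
print("precision:", np.round(precision, 3))
```

Reading the matrix this way makes it easier to see which pair of species the model confuses when accuracy is below 1.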

## 7) Try Some Predictions
We'll feed a couple of handmade samples to the trained model.

Each row below lists the features in the same order used for training: `[sepal length (cm), sepal width (cm), petal length (cm), petal width (cm)]`.

In [None]:
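# Build a DataFrame with the training column names; the scaler was fitted on a
# DataFrame, so passing a bare NumPy array would raise a feature-name warning.
X_new = pd.DataFrame([
    [5.1, 3.5, 1.4, 0.2],  # likely setosa
    [6.0, 2.8, 4.5, 1.5],  # likely versicolor
    [6.5, 3.0, 5.5, 2.0],  # likely virginica
], columns=feature_names)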
X_new_scaled = scaler.transform(X_new)
preds = logreg.predict(X_new_scaled)
probs = logreg.predict_proba(X_new_scaled)
for i, (p, pr) in enumerate(zip(preds, probs)):
    print(f"Sample {i}: predicted -> {target_names[p]} | probabilities -> {np.round(pr, 3)}")

## 8) (Optional) Save Model
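Persist the trained model and the fitted scaler with `joblib` so you can reuse them later without retraining. Save both: new inputs must be scaled with the same statistics the model was trained on.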

In [None]:
import joblib
joblib.dump(logreg, "logreg_iris.joblib")
joblib.dump(scaler, "scaler_iris.joblib")
print("Saved: logreg_iris.joblib and scaler_iris.joblib")
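
Loading works the same way in reverse. The sketch below is a self-contained round trip: it trains a small model, saves it to a temporary directory (an assumption made so the example leaves no files behind), reloads both objects, and checks that the reloaded pair predicts the same class. The names `loaded_model` and `loaded_scaler` are illustrative.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
scaler = StandardScaler().fit(X)
logreg = LogisticRegression(max_iter=1000).fit(scaler.transform(X), y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "logreg_iris.joblib")
    scaler_path = os.path.join(tmp_dir, "scaler_iris.joblib")
    joblib.dump(logreg, model_path)
    joblib.dump(scaler, scaler_path)

    # Reload both artifacts while the temporary files still exist
    loaded_model = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)

# Scale with the reloaded scaler, then predict with the reloaded model
sample = np.array([[5.1, 3.5, 1.4, 0.2]])
original_pred = logreg.predict(scaler.transform(sample))[0]
reloaded_pred = loaded_model.predict(loaded_scaler.transform(sample))[0]
same_prediction = bool(original_pred == reloaded_pred)
print("Reloaded model agrees with original:", same_prediction)
```

Because `joblib` files are pickle-based, only load files you trust, and prefer loading them with the same scikit-learn version that saved them.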

## 9) Homework Ideas (for students)
- Remove scaling and compare accuracy.
- Change `test_size` (e.g., 0.3). How does it affect performance?
- Try `penalty='l1'` with solver `'liblinear'` and compare.
- Plot decision boundaries for two features at a time.
- Compute cross-validation accuracy using `cross_val_score`.
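
As a starting point for the cross-validation idea, here is one possible sketch. It wraps the scaler and the classifier in a pipeline so that scaling is refit inside every fold; the pipeline and the choice of `cv=5` are assumptions made for this example, not part of the notebook above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling lives inside the pipeline, so each fold scales with its own training data
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# For classifiers, an integer cv uses stratified folds
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print("Fold accuracies:", scores.round(3))
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

Comparing the spread across folds with the single test-set accuracy from Section 6 gives a feel for how much one split can vary.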
