[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/shang-vikas/series1-coding-exercises/blob/main/exercises/blog-01/exercise-01.ipynb)

In [10]:

# Install required packages using the kernel's Python interpreter
import sys
import subprocess
import importlib

def install_if_missing(package, import_name=None):
    """Install package if it's not already installed."""
    if import_name is None:
        import_name = package
    try:
        importlib.import_module(import_name)
        print(f"✓ {package} is already installed")
    except ImportError:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])
        print(f"✓ {package} installed successfully")

# Install required packages
install_if_missing("numpy")
install_if_missing("scikit-learn", "sklearn")
install_if_missing("pandas")

✓ numpy is already installed
✓ scikit-learn is already installed
Installing pandas...
Collecting pandas
  Using cached pandas-2.3.3-cp310-cp310-macosx_11_0_arm64.whl.metadata (91 kB)
Collecting pytz>=2020.1 (from pandas)
  Using cached pytz-2025.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Using cached tzdata-2025.3-py2.py3-none-any.whl.metadata (1.4 kB)
Using cached pandas-2.3.3-cp310-cp310-macosx_11_0_arm64.whl (10.8 MB)
Using cached pytz-2025.2-py2.py3-none-any.whl (509 kB)
Using cached tzdata-2025.3-py2.py3-none-any.whl (348 kB)
Installing collected packages: pytz, tzdata, pandas
Successfully installed pandas-2.3.3 pytz-2025.2 tzdata-2025.3
✓ pandas installed successfully

In [11]:
import pandas as pd
import numpy as np

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix,
    roc_auc_score
)

# Load dataset
data = load_breast_cancer()

# Convert to DataFrame for readability
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

print("First 5 rows of the dataset:\n")
print(df.head())

print("\nDataset shape:", df.shape)
print("\nClass distribution:\n")
print(df["target"].value_counts())

# Split data
X = df.drop("target", axis=1)
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scale features for the SVM (important); fit the scaler on the training split only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM (Linear)": SVC(kernel="linear", probability=True, random_state=42)
}

for name, model in models.items():
    
    if "SVM" in name:
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
        y_prob = model.predict_proba(X_test_scaled)[:, 1]
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        y_prob = model.predict_proba(X_test)[:, 1]
    
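    # In this dataset 0 = malignant and 1 = benign, so pass pos_label=0
    # to score malignant as the positive class, as the discussion below assumes.
    print(f"\n===== {name} =====")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred, pos_label=0))
    print("Recall:", recall_score(y_test, y_pred, pos_label=0))
    print("F1 Score:", f1_score(y_test, y_pred, pos_label=0))
    # Binary ROC-AUC is the same whichever class is called positive
    print("ROC-AUC:", roc_auc_score(y_test, y_prob))
    # Rows are true labels, columns are predicted labels, in label order [0, 1]
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))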


First 5 rows of the dataset:

   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1           0.07864          0.0869              0.07017         0.1812   
2           0.15990          0.1974              0.12790         0.2069   
3           0.28390          0.2414              0.10520         0.2597   
4           0.13280          0.1980              0.10430         0.1809   

   mean fractal dimension  ...  worst texture  worst perimeter

Understanding Classification Metrics

This is where understanding separates button-pushers from practitioners.

Confusion Matrix

The confusion matrix is the raw truth table of predictions. Everything else is built on top of it.

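It shows four counts. Here “positive” means malignant:

True Positives (TP) – Correctly predicted malignant

True Negatives (TN) – Correctly predicted benign

False Positives (FP) – Predicted malignant but actually benign

False Negatives (FN) – Predicted benign but actually malignant

One caveat: load_breast_cancer encodes malignant as 0 and benign as 1, so scikit-learn’s metric functions treat benign as the positive class unless you pass pos_label=0.

In cancer detection, false negatives are extremely dangerous. That’s not just math. That’s a missed diagnosis.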
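The four counts can be tallied by hand from paired labels. The sketch below uses made-up labels (not results from the notebook) and a hypothetical helper, `confusion_counts`, with malignant encoded as 0 to match `load_breast_cancer`.

```python
# Toy labels for illustration only; 0 = malignant, 1 = benign.
MALIGNANT, BENIGN = 0, 1

y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 1]

def confusion_counts(y_true, y_pred, positive=MALIGNANT):
    """Return (tp, tn, fp, fn), treating `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1   # flagged malignant, and it was
            else:
                fp += 1   # flagged malignant, but it was benign
        else:
            if actual == positive:
                fn += 1   # called benign, but it was malignant
            else:
                tn += 1   # called benign, and it was
    return tp, tn, fp, fn

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")  # TP=3 TN=5 FP=1 FN=1
```

The single false negative here is the fourth patient: malignant, but predicted benign.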

Accuracy
$$\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Predictions}}$$

Accuracy is intuitive. It answers:

“How often is the model right?”

The problem? It can be misleading when classes are imbalanced.

If 95% of patients are healthy, a model that always predicts “healthy” gets 95% accuracy.
That model is useless.

Accuracy ignores what kind of mistakes are being made.
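To make the 95% example concrete, here is a minimal sketch with a hypothetical population of 100 patients and a "model" that always predicts the majority class. The labels and variable names are illustrative assumptions.

```python
# Hypothetical population: 95 healthy patients and 5 sick ones.
y_true = ["healthy"] * 95 + ["sick"] * 5
# A "model" that ignores its input and always predicts "healthy".
y_pred = ["healthy"] * len(y_true)

correct = sum(a == p for a, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Fraction of the sick patients that the model actually flagged.
sick_total = sum(a == "sick" for a in y_true)
sick_caught = sum(a == "sick" and p == "sick" for a, p in zip(y_true, y_pred))
sick_detection_rate = sick_caught / sick_total

print(f"accuracy = {accuracy:.2f}")                        # accuracy = 0.95
print(f"sick patients caught = {sick_detection_rate:.2f}")  # sick patients caught = 0.00
```

The accuracy looks excellent, yet every sick patient is missed.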

Precision
$$\text{Precision} = \frac{TP}{TP + FP}$$

Out of all predicted positives, how many were actually positive?

Precision answers:

“When the model says ‘malignant,’ how often is it correct?”

If precision is low, you’re raising too many false alarms.
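A small worked example, using hypothetical counts. The helper `precision` is not from the notebook, and returning 0.0 when nothing was predicted positive is a convention chosen for this sketch.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP); 0.0 by convention if nothing was predicted positive."""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0

# Hypothetical: the model flags 50 tumours as malignant and 40 of those really are.
print(precision(tp=40, fp=10))  # 0.8

# Same true positives, many more false alarms: precision drops.
print(precision(tp=40, fp=60))  # 0.4
```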

Recall (Sensitivity)
$$\text{Recall} = \frac{TP}{TP + FN}$$

Out of all actual positives, how many did you catch?

Recall answers:

“Of all the malignant cases, how many did we detect?”

In medical diagnosis, recall often matters more than precision.
Missing cancer is worse than ordering an unnecessary follow-up test.
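One common way to trade precision for recall is to lower the decision threshold on the predicted probability. The sketch below uses made-up scores for eight patients. The function name and the data are assumptions for illustration, not output from the models above.

```python
# Made-up data: whether each patient is malignant, and a model's P(malignant).
is_malignant = [True, True, True, True, False, False, False, False]
p_malignant = [0.9, 0.7, 0.45, 0.3, 0.6, 0.4, 0.35, 0.1]

def recall_and_precision(is_malignant, p_malignant, threshold):
    """Flag a patient as malignant when P(malignant) >= threshold."""
    flagged = [p >= threshold for p in p_malignant]
    tp = sum(a and f for a, f in zip(is_malignant, flagged))
    fn = sum(a and not f for a, f in zip(is_malignant, flagged))
    fp = sum((not a) and f for a, f in zip(is_malignant, flagged))
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

for threshold in (0.5, 0.25):
    recall, precision = recall_and_precision(is_malignant, p_malignant, threshold)
    print(f"threshold={threshold}: recall={recall:.2f}, precision={precision:.2f}")
```

At 0.5 the model catches half of the malignant cases. At 0.25 it catches all of them, at the price of more false alarms.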

F1 Score
$$F_1 = \text{Harmonic Mean of Precision and Recall} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

The harmonic mean penalizes extreme imbalance between precision and recall.

If one is high and the other is low, F1 drops sharply.

F1 is useful when you want a balanced trade-off between catching positives and avoiding false alarms.
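To see how sharply F1 drops, compare it with the plain arithmetic mean on two hypothetical precision/recall pairs. The helper `f1` is written for this sketch.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical pairs: one balanced, one lopsided.
for p, r in [(0.9, 0.9), (0.9, 0.1)]:
    arithmetic = (p + r) / 2
    print(f"precision={p}, recall={r}: arithmetic mean={arithmetic:.2f}, F1={f1(p, r):.2f}")
```

For the lopsided pair the arithmetic mean is still 0.50, while F1 falls to 0.18.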

ROC–AUC

ROC stands for Receiver Operating Characteristic.
AUC stands for Area Under the Curve.

ROC–AUC measures how well the model separates classes across all possible thresholds.

1.0 → Perfect separation

0.5 → Random guessing

It is threshold-independent, meaning it evaluates ranking ability, not just final yes/no predictions.

It answers:

“How well does the model distinguish malignant from benign overall?”
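One way to read AUC: it is the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative case, with ties counted as half. The sketch below computes that by brute force on made-up scores. `pairwise_auc` is a hypothetical helper, not a scikit-learn function.

```python
def pairwise_auc(is_positive, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(is_positive, scores) if y]
    neg = [s for y, s in zip(is_positive, scores) if not y]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up scores: one positive (0.4) is ranked below one negative (0.5).
is_positive = [True, True, True, False, False, False]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]

auc = pairwise_auc(is_positive, scores)
print(f"AUC = {auc:.3f}")  # AUC = 0.889 (8 of 9 pairs ranked correctly)

# Only the ranking matters: squaring the scores keeps the order, so AUC is unchanged.
print(pairwise_auc(is_positive, [s ** 2 for s in scores]) == auc)  # True

# A model that gives everyone the same score is no better than guessing.
print(pairwise_auc(is_positive, [0.5] * 6))  # 0.5
```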

Model Characteristics
Decision Tree

Interpretable (like a flowchart)

Easy to explain

Prone to overfitting

Random Forest

Many trees voting together

Reduces variance

Usually strong accuracy and AUC

Harder to interpret
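Why voting helps can be sketched with a simplified calculation. Assume, unrealistically, that every tree is correct 70% of the time and that the trees err independently; then a binomial sum gives the chance that the majority is right. Real forests have correlated trees, so the gain is smaller, and scikit-learn's RandomForestClassifier averages per-tree class probabilities rather than counting hard votes. The function name and the 0.7 figure are assumptions.

```python
from math import comb

def majority_correct_probability(n_trees, p_single):
    """P(more than half of n_trees independent voters are correct); n_trees odd."""
    needed = n_trees // 2 + 1
    return sum(
        comb(n_trees, k) * p_single ** k * (1 - p_single) ** (n_trees - k)
        for k in range(needed, n_trees + 1)
    )

# Assumed per-tree accuracy of 0.7, with independent errors.
for n in (1, 5, 25):
    print(n, round(majority_correct_probability(n, 0.7), 4))
```

Under these assumptions, five trees already beat a single tree (about 0.84 versus 0.70), and adding more trees pushes the majority closer to always being right.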

SVM (Support Vector Machine)

Strong theoretical foundation

Performs well in high-dimensional spaces (the 30 features here are a modest example)

Sensitive to feature scaling

Less interpretable
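The scaling sensitivity is easy to see with two columns from the first five rows printed earlier: mean area is in the hundreds or thousands, mean smoothness is around 0.1. The sketch below standardizes both with the same formula StandardScaler uses (subtract the mean, divide by the population standard deviation) and checks how much of the squared distance between two rows comes from area. Five rows is far too few for real statistics, so treat this purely as an illustration; the helper names are assumptions.

```python
from statistics import mean, pstdev

# Values copied from the first five rows of the dataset shown above.
mean_area = [1001.0, 1326.0, 1203.0, 386.1, 1297.0]
mean_smoothness = [0.11840, 0.08474, 0.10960, 0.14250, 0.10030]

def standardize(values):
    """Z-score each value: (x - mean) / population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def area_share(points, i, j):
    """Share of the squared distance between rows i and j contributed by area."""
    d_area = (points[i][0] - points[j][0]) ** 2
    d_smooth = (points[i][1] - points[j][1]) ** 2
    return d_area / (d_area + d_smooth)

raw = list(zip(mean_area, mean_smoothness))
scaled = list(zip(standardize(mean_area), standardize(mean_smoothness)))

print(f"raw:    area share = {area_share(raw, 0, 3):.4f}")
print(f"scaled: area share = {area_share(scaled, 0, 3):.4f}")
```

Before scaling, area accounts for essentially all of the distance between rows 0 and 3, so smoothness is invisible to any distance-based model. After scaling, both features contribute.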

The Real Point

Machine learning is not about picking “the best model.”

It’s about choosing the right trade-off for the problem.

In cancer detection → optimize for recall

In spam detection → optimize for precision

In finance → false positives and false negatives have different dollar costs
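The finance case can be sketched by attaching a cost to each kind of mistake and choosing the threshold with the lowest total. Everything here is hypothetical: the scores, the dollar costs, and the candidate thresholds.

```python
# Made-up data: whether each transaction is fraud, and a model's P(fraud).
is_fraud = [True, True, True, False, False, False, False, False]
p_fraud = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]

def total_cost(threshold, cost_fn, cost_fp):
    """Total cost of missed fraud (FN) plus wrongly blocked transactions (FP)."""
    cost = 0
    for actual, p in zip(is_fraud, p_fraud):
        flagged = p >= threshold
        if actual and not flagged:
            cost += cost_fn
        elif flagged and not actual:
            cost += cost_fp
    return cost

def best_threshold(cost_fn, cost_fp, thresholds=(0.25, 0.5, 0.75)):
    """Return (cheapest threshold, cost per threshold) for the given costs."""
    costs = {t: total_cost(t, cost_fn, cost_fp) for t in thresholds}
    return min(costs, key=costs.get), costs

# Missing fraud is expensive, false alarms are cheap: a low threshold wins.
print(best_threshold(cost_fn=500, cost_fp=20))
# → (0.25, {0.25: 40, 0.5: 520, 0.75: 1000})

# Flip the costs and the preferred threshold flips too.
print(best_threshold(cost_fn=10, cost_fp=2000))
# → (0.75, {0.25: 4000, 0.5: 2010, 0.75: 20})
```

Same model, same scores, different costs, different decision rule.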

Metrics are not just numbers.

They encode priorities.