# 02 – Baseline Model (Logistic Regression)

In this notebook we:

1. Load the processed data.
2. Build a simple **Logistic Regression** pipeline with `OneHotEncoder` + `StandardScaler`.
3. Evaluate using a hold‑out test split.
4. Save the trained pipeline for later use.
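
The steps above can be sketched with plain scikit-learn. This is an illustration under stated assumptions, not the code in `src.modeling`: the helper name `sketch_logreg_pipeline`, the toy columns, and the rule that every non-numeric column is treated as categorical are all hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def sketch_logreg_pipeline(X: pd.DataFrame) -> Pipeline:
    """Hypothetical stand-in for build_logreg_pipeline (an assumption)."""
    # Assumption: non-numeric columns are categorical, numeric ones are scaled.
    cat_cols = X.select_dtypes(exclude="number").columns.tolist()
    num_cols = X.select_dtypes(include="number").columns.tolist()
    preprocess = ColumnTransformer(
        transformers=[
            ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
            ("num", StandardScaler(), num_cols),
        ]
    )
    return Pipeline(
        steps=[
            ("preprocess", preprocess),
            ("clf", LogisticRegression(max_iter=1000)),
        ]
    )


# Toy data invented for illustration only (not the Telco dataset).
toy = pd.DataFrame(
    {
        "Contract": ["Month-to-month", "Two year", "One year", "Month-to-month"] * 10,
        "tenure": [1, 60, 24, 3] * 10,
        "Churn": [1, 0, 0, 1] * 10,
    }
)
X, y = toy.drop(columns="Churn"), toy["Churn"]

# Stratified hold-out split so both classes appear in the test set.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipe = sketch_logreg_pipeline(X_train)
pipe.fit(X_train, y_train)

# Round-trip the fitted pipeline through joblib, as in step 4.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "baseline.pkl"
    joblib.dump(pipe, path)
    reloaded = joblib.load(path)

holdout_proba = reloaded.predict_proba(X_hold)[:, 1]
```

Because the encoder and scaler live inside the `Pipeline`, they are fitted on the training split only, and the saved file carries the preprocessing together with the classifier.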
# -------------------------------------------------------------------------
# Imports & paths
# -------------------------------------------------------------------------
import pandas as pd
import numpy as np
from pathlib import Path
import joblib

from src.modeling import build_logreg_pipeline, train_and_save
from src.evaluation import get_metrics, plot_confusion, plot_roc_curve, error_analysis

PROJECT_ROOT = Path("..")
PROCESSED_PATH = PROJECT_ROOT / "data" / "processed" / "telco_processed.csv"
MODEL_DIR = PROJECT_ROOT / "models"
# -------------------------------------------------------------------------
# Load processed data
# -------------------------------------------------------------------------
df = pd.read_csv(PROCESSED_PATH)

# Sanity check: re-inspect the class balance of the target
print(df["Churn"].value_counts(normalize=True))
# -------------------------------------------------------------------------
# Build and train the baseline pipeline
# -------------------------------------------------------------------------
logreg_pipe = build_logreg_pipeline(df)

# Fit the pipeline, save it to `models/baseline.pkl`, and keep the hold-out split
X_test, y_test = train_and_save(
    pipe=logreg_pipe,
    df=df,
    model_path=MODEL_DIR / "baseline.pkl",
    test_size=0.2,
    random_state=42,
)

# Predict on hold‑out set
y_pred = logreg_pipe.predict(X_test)
y_proba = logreg_pipe.predict_proba(X_test)[:, 1]

metrics = get_metrics(y_test, y_pred, y_proba)
metrics
# -------------------------------------------------------------------------
# Visualise results
# -------------------------------------------------------------------------
# Confusion matrix
fig_cm = plot_confusion(y_test, y_pred, title="Baseline – Logistic Regression")
fig_cm.show()

# ROC curve
fig_roc = plot_roc_curve(y_test, y_proba, title="Baseline – Logistic Regression")
fig_roc.show()
# -------------------------------------------------------------------------
# Error analysis (outputs are written to `notebooks/plots/baseline/`)
# -------------------------------------------------------------------------
analysis_paths = error_analysis(
    y_true=y_test,
    y_pred=y_pred,
    X_test=X_test,
    out_dir=Path("plots") / "baseline",
    model_name="baseline_logreg",
)

analysis_paths
## Baseline results (quick reference)

| Metric    | Value |
|-----------|-------|
| Accuracy  | 0.78 |
| Precision | 0.65 |
| Recall    | 0.55 |
| F1‑score  | 0.60 |
| ROC‑AUC   | 0.81 |

*The logistic regression already captures the strong effect of `Contract` and `tenure`, but recall is modest: at 0.55, the model misses roughly 45% of actual churners. The next model will aim for higher recall.*
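
To make the table easier to read, the sketch below shows how precision, recall and F1 follow from predictions, and checks that the reported F1 is consistent with the reported precision and recall. It also shows one common way to raise recall, lowering the decision threshold on the predicted probability; this is an illustration, not necessarily the approach taken in the next notebook. The labels and scores are invented, and `precision_recall_f1` is a hypothetical helper, not part of `src.evaluation`.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Consistency check on the table: F1 is the harmonic mean of precision and recall.
table_f1 = 2 * 0.65 * 0.55 / (0.65 + 0.55)
print(round(table_f1, 2))  # 0.6

# Invented labels and churn probabilities, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.6, 0.45, 0.3, 0.55, 0.4, 0.2, 0.2, 0.1, 0.05]

results = {}
for threshold in (0.5, 0.35):
    # Predict churn whenever the score reaches the threshold.
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    results[threshold] = precision_recall_f1(y_true, y_pred)
    p, r, f1 = results[threshold]
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

On this toy data, moving the threshold from 0.5 to 0.35 raises recall from 0.50 to 0.75 while precision drops from 0.67 to 0.60, which is the usual trade-off to weigh when recall matters more than precision.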