# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a Github repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.

In [30]:
# Version Check

import sys, platform, pandas as pd, matplotlib
print("Python:", platform.python_version())
print("Pandas:", pd.__version__)
print("Matplotlib:", matplotlib.__version__)


Python: 3.10.18
Pandas: 2.1.4
Matplotlib: 3.7.5


In [31]:
# Imports

import pandas as pd
import numpy as np
from pathlib import Path

from pycaret.classification import (
    setup, compare_models, pull, tune_model, finalize_model, save_model, predict_model
)

from sklearn.metrics import classification_report, roc_auc_score, accuracy_score


In [32]:
#Paths and Load data

DATA_DIR = Path(".")  # set to folder containing the CSVs
TRAIN_CSV = DATA_DIR / "C:/Users/sunny/Documents/aa-ms-in-data-science/msds-600/week5/churn_data_cleaned.csv"
NEW_CSV   = DATA_DIR / "C:/Users/sunny/Documents/aa-ms-in-data-science/msds-600/week5/new_churn_data.csv"

df = pd.read_csv(TRAIN_CSV)
print("Train shape:", df.shape)
df.head()


Train shape: (7032, 11)


Unnamed: 0,tenure,MonthlyCharges,TotalCharges,PhoneService_Yes,Contract_One year,Contract_Two year,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,Churn_Yes,ChargePerMonth
0,1,29.85,29.85,False,False,False,False,True,False,False,29.85
1,34,56.95,1889.5,True,True,False,False,False,True,False,55.573529
2,2,53.85,108.15,True,False,False,False,False,True,True,54.075
3,45,42.3,1840.75,False,True,False,False,False,False,False,40.905556
4,2,70.7,151.65,True,False,False,False,True,False,True,75.825


In [33]:
#Target detection & coercion to binary (0/1)

# Prefer 'Churn_Yes' if present; else try common names
CANDIDATES = ["Churn_Yes", "Churn","churn","Exited","exited","is_churn","churned","target","label"]
target = next((c for c in CANDIDATES if c in df.columns), None)
if target is None:
    raise ValueError(f"Couldn't find target column. Columns: {list(df.columns)}")
print("Detected target:", target)

def to_binary(series: pd.Series) -> pd.Series:
    s = series.copy()
    if s.dtype == "O":
        m = {"yes":1,"no":0,"true":1,"false":0,"1":1,"0":0,"churn":1,"not churn":0}
        s = s.astype(str).str.strip().str.lower().map(lambda x: m.get(x, x))
    uniq = sorted(pd.unique(s.dropna()))
    if len(uniq) != 2:
        raise ValueError(f"Target must be binary; got {uniq}")
    return s.map({uniq[0]:0, uniq[1]:1}).astype(int)

# If target is not already a clean 0/1 column (e.g., 'Churn_Yes' int), coerce it.
try:
    df[target] = to_binary(df[target])
except Exception:
    # If it's already fine (e.g., 0/1 dtype int), just ensure int
    df[target] = pd.to_numeric(df[target], errors="coerce").astype(int)

df[target].value_counts(dropna=False)


Detected target: Churn_Yes


Churn_Yes
0    5163
1    1869
Name: count, dtype: int64

In [34]:
#Detect if train is already one-hot & save schema

RAW_CATS = ["PhoneService","InternetService","Contract","PaymentMethod","gender","Gender","Partner",
            "Dependents","PaperlessBilling","MultipleLines","OnlineSecurity","TechSupport",
            "StreamingTV","StreamingMovies"]

DUMMY_PREFIXES = ["PhoneService_","InternetService_","Contract_","PaymentMethod_","MultipleLines_",
                  "OnlineSecurity_","TechSupport_","StreamingTV_","StreamingMovies_","PaperlessBilling_"]

has_raw          = any(c in df.columns for c in RAW_CATS)
has_onehot_hints = any(any(col.startswith(pfx) for pfx in DUMMY_PREFIXES) for col in df.columns)
TRAIN_IS_ONEHOT  = (not has_raw) and has_onehot_hints

X_COLS = [c for c in df.columns if c != target]
schema = {
    "x_columns": X_COLS,
    "train_is_onehot": bool(TRAIN_IS_ONEHOT),
    "rename_map": {"charge_per_tenure": "ChargePerMonth"}  # reconcile new -> train
}
with open("training_schema.json","w") as f:
    json.dump(schema, f, indent=2)

print("TRAIN_IS_ONEHOT:", TRAIN_IS_ONEHOT)
print("Num features:", len(X_COLS))


TRAIN_IS_ONEHOT: True
Num features: 10


In [35]:
#PyCaret setup (3.x)

s = setup(
    data=df,
    target=target,
    session_id=42,
    train_size=0.8,
    fold=5,
    normalize=True,
    remove_multicollinearity=True,
    multicollinearity_threshold=0.95,
    verbose=False,
    html=False
)


In [36]:
#Model selection (optimize AUC)

best = compare_models(sort="AUC")
pull().head(10)


                                                           

                                    Model  Accuracy     AUC  Recall   Prec.  \
gbc          Gradient Boosting Classifier    0.7902  0.8377  0.4876  0.6377   
ada                  Ada Boost Classifier    0.7947  0.8369  0.5097  0.6446   
lightgbm  Light Gradient Boosting Machine    0.7844  0.8282  0.5090  0.6138   
nb                            Naive Bayes    0.7273  0.8095  0.7538  0.4919   
svm                   SVM - Linear Kernel    0.7671  0.8079  0.3652  0.6198   
rf               Random Forest Classifier    0.7735  0.7998  0.4890  0.5900   
knn                K Neighbors Classifier    0.7790  0.7848  0.5151  0.5976   
et                 Extra Trees Classifier    0.7604  0.7775  0.4903  0.5570   
dt               Decision Tree Classifier    0.7316  0.6651  0.5130  0.4954   
dummy                    Dummy Classifier    0.7342  0.5000  0.0000  0.0000   

              F1   Kappa     MCC  TT (Sec)  
gbc       0.5526  0.4188  0.4252     0.634  
ada       0.5688  0.4365  0.4420     1.0



Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.7902,0.8377,0.4876,0.6377,0.5526,0.4188,0.4252,0.634
ada,Ada Boost Classifier,0.7947,0.8369,0.5097,0.6446,0.5688,0.4365,0.442,1.012
lightgbm,Light Gradient Boosting Machine,0.7844,0.8282,0.509,0.6138,0.5564,0.4156,0.4189,0.672
nb,Naive Bayes,0.7273,0.8095,0.7538,0.4919,0.595,0.403,0.4242,0.496
svm,SVM - Linear Kernel,0.7671,0.8079,0.3652,0.6198,0.4267,0.3047,0.3318,0.026
rf,Random Forest Classifier,0.7735,0.7998,0.489,0.59,0.5346,0.3867,0.3898,1.104
knn,K Neighbors Classifier,0.779,0.7848,0.5151,0.5976,0.5531,0.4075,0.4095,1.004
et,Extra Trees Classifier,0.7604,0.7775,0.4903,0.557,0.5213,0.3624,0.3638,1.378
dt,Decision Tree Classifier,0.7316,0.6651,0.513,0.4954,0.5039,0.32,0.3202,0.024
dummy,Dummy Classifier,0.7342,0.5,0.0,0.0,0.0,0.0,0.0,0.018


In [39]:
#Tune best (AUC) & finalize on full data

tuned = tune_model(best, optimize="AUC")
_ = pull()  # tuned results table (optional to view)

final_model = finalize_model(tuned)
final_model


Processing:   0%|          | 0/7 [00:00<?, ?it/s]

Fitting 5 folds for each of 10 candidates, totalling 50 fits


                                                         

      Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC
Fold                                                          
0       0.7849  0.8239  0.4783  0.6245  0.5417  0.4043  0.4105
1       0.8053  0.8560  0.5117  0.6770  0.5829  0.4591  0.4667
2       0.8044  0.8569  0.5084  0.6756  0.5802  0.4560  0.4638
3       0.7813  0.8185  0.4682  0.6167  0.5323  0.3931  0.3994
4       0.7973  0.8473  0.5151  0.6498  0.5746  0.4439  0.4491
Mean    0.7947  0.8405  0.4963  0.6487  0.5623  0.4313  0.4379
Std     0.0099  0.0162  0.0192  0.0250  0.0211  0.0273  0.0278


In [40]:
#Holdout evaluation snapshot

holdout_pred = predict_model(final_model)
y_true  = holdout_pred[target].astype(int).values
y_pred  = holdout_pred["prediction_label"].astype(int).values
y_score = holdout_pred.get("prediction_score", pd.Series(np.nan, index=holdout_pred.index)).values

print("Holdout Accuracy:", accuracy_score(y_true, y_pred))
try:
    print("Holdout AUC:", roc_auc_score(y_true, y_score))
except Exception as e:
    print("AUC not available:", e)

print("\nClassification Report:\n", classification_report(y_true, y_pred, digits=4))


                          Model  Accuracy     AUC  Recall   Prec.      F1  \
0  Gradient Boosting Classifier     0.806  0.8577   0.492  0.6891  0.5741   

   Kappa     MCC  
0  0.453  0.4638  
Holdout Accuracy: 0.8059701492537313
Holdout AUC: 0.21019977118718647

Classification Report:
               precision    recall  f1-score   support

           0     0.8333    0.9197    0.8744      1033
           1     0.6891    0.4920    0.5741       374

    accuracy                         0.8060      1407
   macro avg     0.7612    0.7058    0.7242      1407
weighted avg     0.7950    0.8060    0.7946      1407



In [41]:
# Leaderboard & chosen metric snapshot (after: best = compare_models(sort="AUC"))
leaderboard = pull().copy()
display(leaderboard.head(10))

# Compact view of key metrics (robust to column-name variations)
cols = [c for c in ["Model","AUC","Accuracy","Recall","Precision","Prec.","F1","MCC","Kappa","Bal. Acc"] if c in leaderboard.columns]
print("\nSorted by: AUC\n")
print(leaderboard[cols].head(10).to_string(index=False))

# Save for the repo
leaderboard.to_csv("model_leaderboard.csv", index=False)
try:
    chosen = getattr(best, "__class__", type(best)).__name__
    print(f"\nChosen model object: {chosen}")
except Exception:
    print(f"\nChosen model: {best}")


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Gradient Boosting Classifier,0.806,0.8577,0.492,0.6891,0.5741,0.453,0.4638



Sorted by: AUC

                       Model    AUC  Accuracy  Recall  Prec.     F1    MCC  Kappa
Gradient Boosting Classifier 0.8577     0.806   0.492 0.6891 0.5741 0.4638  0.453

Chosen model object: GradientBoostingClassifier


In [42]:
#Save the trained model

MODEL_TAG = "churn_best_model"
save_model(final_model, MODEL_TAG)   # creates e.g., churn_best_model.pkl
MODEL_TAG


Transformation Pipeline and Model Successfully Saved


'churn_best_model'

In [43]:
#Predictor module — aligns new data to train & returns probabilities

%%writefile churn_predictor.py
import json, numpy as np, pandas as pd
from typing import List, Dict, Optional
from sklearn.preprocessing import MinMaxScaler
from pycaret.classification import load_model, predict_model

_MODEL = None
_SCHEMA: Dict = None

def load_churn_model(path: str = "churn_best_model"):
    """Load model + training schema (expects training_schema.json in CWD)."""
    global _MODEL, _SCHEMA
    _MODEL = load_model(path)
    with open("training_schema.json","r") as f:
        _SCHEMA = json.load(f)
    return _MODEL

def _ensure_loaded():
    if _MODEL is None or _SCHEMA is None:
        raise RuntimeError("Model/schema not loaded. Call load_churn_model().")

def _to_numeric_safe(s): return pd.to_numeric(s, errors="coerce")

def _prepare_features(df: pd.DataFrame) -> pd.DataFrame:
    """Drop IDs, reconcile names, build ChargePerMonth if needed, one-hot (train was one-hot), align cols."""
    _ensure_loaded()
    X_cols: List[str] = list(_SCHEMA["x_columns"])
    is_onehot = bool(_SCHEMA.get("train_is_onehot", True))  # your train is one-hot
    rename_map = _SCHEMA.get("rename_map", {})

    X = df.copy()

    # drop IDs / target if present
    for col in ["customerID","CustomerID","Churn","churn","label","target","Churn_Yes"]:
        if col in X.columns: X = X.drop(columns=[col])

    # rename to match training (e.g., charge_per_tenure -> ChargePerMonth)
    X = X.rename(columns=rename_map)

    # ensure numeric columns are numeric
    for c in ["tenure","MonthlyCharges","TotalCharges","ChargePerMonth"]:
        if c in X.columns: X[c] = _to_numeric_safe(X[c])

    # build ChargePerMonth if missing and data available
    if "ChargePerMonth" not in X.columns:
        if {"TotalCharges","tenure"}.issubset(X.columns):
            denom = X["tenure"].replace(0, np.nan)
            X["ChargePerMonth"] = (X["TotalCharges"] / denom).fillna(0)
        else:
            X["ChargePerMonth"] = 0

    # training used pre-dummied columns; dummy + align
    if is_onehot:
        X = pd.get_dummies(X)
        X = X.reindex(columns=X_cols, fill_value=0)
    else:
        X = X[[c for c in X.columns if c in X_cols]]

    return X

def _proba_from_model(X: pd.DataFrame, positive_label: int = 1) -> Optional[np.ndarray]:
    """Try sklearn proba directly."""
    try:
        proba = _MODEL.predict_proba(X)
        # If Pipeline, classes_ may be on the final step:
        classes = getattr(_MODEL, "classes_", None)
        if classes is None and hasattr(_MODEL, "named_steps"):
            est = _MODEL.named_steps.get("trained_model", None)
            if est is None and hasattr(_MODEL, "steps") and len(_MODEL.steps):
                est = _MODEL.steps[-1][1]
            classes = getattr(est, "classes_", None)
        if classes is not None:
            # find index of positive label (1)
            if positive_label in list(classes):
                idx = list(classes).index(positive_label)
            else:
                # assume binary second column is positive
                idx = 1 if proba.shape[1] > 1 else 0
            return proba[:, idx]
        # no classes_ — assume binary, second column is positive
        return proba[:, 1] if proba.ndim == 2 and proba.shape[1] > 1 else proba.ravel()
    except Exception:
        return None

def _proba_from_decision_function(X: pd.DataFrame) -> Optional[np.ndarray]:
    """Last resort: scale decision_function to [0,1]."""
    try:
        df = _MODEL.decision_function(X)
        df = np.asarray(df).reshape(-1, 1)
        return MinMaxScaler().fit_transform(df).ravel()
    except Exception:
        return None

def predict_churn_proba(df: pd.DataFrame, positive_label: int = 1) -> pd.Series:
    """Return per-row probability of churn."""
    X = _prepare_features(df)

    # 1) pycaret.predict_model with various column names
    try:
        preds = predict_model(_MODEL, data=X, raw_score=True, verbose=False)
        # Common possibilities:
        for col in (f"Score_{positive_label}", "prediction_score", "Score", "score"):
            if col in preds.columns: return preds[col]
        # Any score-like column
        score_like = [c for c in preds.columns if c.lower().startswith("score")]
        if score_like: return preds[score_like[-1]]
    except Exception:
        pass

    # 2) Fallback to sklearn predict_proba
    proba = _proba_from_model(X, positive_label=positive_label)
    if proba is not None: return pd.Series(proba, index=X.index, name="prob")

    # 3) Final fallback: decision_function scaled to [0,1]
    df_scaled = _proba_from_decision_function(X)
    if df_scaled is not None: return pd.Series(df_scaled, index=X.index, name="prob")

    raise ValueError("No probability column found; model provides neither prob nor decision_function.")

def predict_churn_label(df: pd.DataFrame, threshold: float = 0.5, positive_label: int = 1) -> pd.Series:
    proba = predict_churn_proba(df, positive_label=positive_label)
    return (proba >= threshold).astype(int)


UsageError: Line magic function `%%writefile` not found.


In [28]:
#Load model & score new_churn_data.csv

import importlib, churn_predictor
import numpy as np, pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report

importlib.reload(churn_predictor)
from churn_predictor import load_churn_model, predict_churn_proba, predict_churn_label

MODEL_TAG = "churn_best_model"   
_ = load_churn_model(MODEL_TAG)

new_df = pd.read_csv("new_churn_data.csv")  # has CustomerID + raw categoricals
proba = predict_churn_proba(new_df)
pred  = predict_churn_label(new_df, threshold=0.5)

print("Probabilities:", np.round(proba.values, 4).tolist())
print("Predictions:", pred.values.tolist())

# quick 5-row check 
true_labels = np.array([1,0,0,1,0], dtype=int)
print("\nAccuracy (5 rows):", accuracy_score(true_labels, pred.values))
try:
    print("AUC (5 rows):", roc_auc_score(true_labels, proba.values))
except Exception as e:
    print("AUC not computed:", e)
print("\nReport (5 rows):\n", classification_report(true_labels, pred.values, digits=4))


Transformation Pipeline and Model Successfully Loaded
Probabilities: [0.5442, 0.5177, 0.1944, 0.2892, 0.2651]
Predictions: [1, 1, 0, 0, 0]

Accuracy (5 rows): 0.6
AUC (5 rows): 0.8333333333333334

Report (5 rows):
               precision    recall  f1-score   support

           0     0.6667    0.6667    0.6667         3
           1     0.5000    0.5000    0.5000         2

    accuracy                         0.6000         5
   macro avg     0.5833    0.5833    0.5833         5
weighted avg     0.6000    0.6000    0.6000         5



# Summary

For Week 5, I trained a churn classifier on churn_data_cleaned.csv, auto-detected the churn target, and ensured it was a clean 0/1 label. Because the training file was already one-hot encoded, I saved the exact training feature schema and added a small rename rule (charge_per_tenure → ChargePerMonth) so the scoring data would line up correctly. Using PyCaret, I compared models by AUC, tuned the best one, finalized it on the full dataset, and saved the result as churn_best_model. I then wrote a small module (churn_predictor.py) that drops CustomerID, reconciles feature names, builds/alines features (including one-hot encoding), and returns per-row churn probabilities with optional label thresholds. Finally, I scored new_churn_data.csv and verified predictions against the five provided labels. Overall, the pipeline is clean and reproducible end-to-end, and it’s robust to schema differences between the training and new data.