# PROJECT 2 — CUSTOMER CHURN PREDICTION

Objective: Predict whether a customer will churn based on usage patterns, contract characteristics, and billing information.

In [1]:
import pandas as pd

df = pd.read_csv("../data/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv")
df.head()


| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


In [2]:
df.shape
df.info()
df.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

In [3]:
df["Churn"] = df["Churn"].map({"Yes": 1, "No": 0})

In [5]:
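# Convert TotalCharges to numeric; entries that cannot be parsed become NaN.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
# Fill the resulting gaps with the column median.
df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].median())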
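
The coercion step matters because `isnull()` reported zero missing values for every column, yet a numeric column stored as text can still hide unparseable entries. The sketch below uses a small hypothetical column (the values and the blank entry are made up for illustration, not taken from the dataset) to show how `errors="coerce"` exposes such entries as `NaN` before the median fill.

```python
import pandas as pd

# Hypothetical toy column: a numeric field stored as text, with one blank
# entry that isnull() does not flag because it is a non-empty string.
toy = pd.DataFrame({"TotalCharges": ["29.85", "1889.5", " ", "108.15"]})

missing_before = int(toy["TotalCharges"].isnull().sum())

# errors="coerce" turns anything unparseable into NaN instead of raising.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
missing_after = int(toy["TotalCharges"].isnull().sum())

# Median of the parseable values, used to fill the gap.
median_value = toy["TotalCharges"].median()
toy["TotalCharges"] = toy["TotalCharges"].fillna(median_value)

print(missing_before, missing_after, median_value)
```

Comparing the missing-value count before and after coercion is a quick way to see how many rows the fill actually touches.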

In [6]:
df.drop(columns=["customerID"], inplace=True)

In [9]:
X = df.drop(columns=["Churn"])
y = df["Churn"]

In [10]:
X = pd.get_dummies(X, drop_first=True)
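
With `drop_first=True`, the first level of each categorical column (in sorted order) is dropped and becomes the implicit reference category. The following is a minimal sketch on a hypothetical one-column frame, reusing the `Contract` labels seen in the data preview, to show which level ends up as the baseline.

```python
import pandas as pd

# Hypothetical toy frame using the Contract labels from the preview above.
toy = pd.DataFrame(
    {"Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"]}
)

# drop_first=True removes the first level in sorted order.
encoded = pd.get_dummies(toy, drop_first=True)

print(list(encoded.columns))
# A row whose dummies are all zero belongs to the dropped reference level.
print(encoded.iloc[0].tolist())
```

Here "Month-to-month" has no column of its own, so any coefficient on the remaining contract dummies is measured relative to month-to-month customers.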

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,
    random_state=42
)
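
Passing `stratify=y` keeps the churn rate approximately equal in the training and test sets. The sketch below checks this on synthetic labels; the 73/27 class mix is an assumption chosen only to resemble an imbalanced churn target, and `X_toy`/`y_toy` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with an assumed 73/27 class mix (illustration only).
y_toy = np.array([0] * 730 + [1] * 270)
X_toy = np.arange(1000).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    stratify=y_toy,
    random_state=42,
)

# With stratification, the positive rate should match across the two sets.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train positive rate: {train_rate:.3f}, test positive rate: {test_rate:.3f}")
```

The same check can be run on the real split by comparing `y_train.mean()` and `y_test.mean()`.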

## Baseline Model (Logistic Regression)

A logistic regression model was trained as a baseline to predict customer churn. The model provides interpretable coefficients and establishes a performance benchmark. ROC-AUC is used as the primary evaluation metric due to class imbalance and the business importance of ranking high-risk customers.
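
One way to see why ROC-AUC suits a ranking task: it equals the fraction of (churner, non-churner) pairs in which the churner receives the higher score, with ties counted as half. The sketch below verifies this on a handful of made-up labels and scores, which are illustrative and not model output.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and scores for illustration.
y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive score against every negative score.
diff = pos[:, None] - neg[None, :]
pairwise = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

auc = roc_auc_score(y_true, scores)
print(pairwise, auc)
```

Because the metric depends only on the ordering of scores, it is unaffected by the choice of classification threshold.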

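To help the logistic regression solver converge reliably, all features (including the one-hot-encoded columns) were standardized before training. The scaler is fit on the training set only and then applied to the test set, so no test-set statistics influence training. Standardization also puts the coefficients on a comparable scale, since each one then corresponds to a one-standard-deviation change in its feature.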


In [14]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
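
If the model is later evaluated with cross-validation, wrapping the scaler and the classifier in a pipeline keeps the fit-on-training-data-only behaviour within each fold. This is a sketch of that alternative on synthetic data, not the approach used in this notebook; `X_demo`, `y_demo`, and the 73/27 class mix are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (not the Telco dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.73, 0.27], random_state=42
)

# The scaler is re-fit on the training portion of every fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))

cv_auc = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="roc_auc")
print(cv_auc.mean())
```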

In [15]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(
    max_iter=10000,
    solver="lbfgs"
)

model.fit(X_train_scaled, y_train)

In [16]:
from sklearn.metrics import classification_report, roc_auc_score

y_pred = model.predict(X_test_scaled)
y_prob = model.predict_proba(X_test_scaled)[:, 1]

print(classification_report(y_test, y_pred))
roc_auc_score(y_test, y_prob)

              precision    recall  f1-score   support

           0       0.85      0.89      0.87      1035
           1       0.66      0.57      0.61       374

    accuracy                           0.81      1409
   macro avg       0.75      0.73      0.74      1409
weighted avg       0.80      0.81      0.80      1409



np.float64(0.8415846443979436)

### Interpretation

The model achieves an overall accuracy of 81%, compared with roughly 73% for a rule that always predicts "no churn" (1035 of the 1409 test customers did not churn). For non-churners (class 0), precision (85%) and recall (89%) are high, showing the model reliably identifies customers who stay. For churners (class 1), recall is lower (57%), meaning the model misses about 43% of actual churners, while precision (66%) means about two in three predicted churners do churn. The ROC-AUC of 0.84 indicates that the model ranks customers by churn risk well, even though the default 0.5 decision threshold misses many churners.

### Key takeaway (business-oriented)
The model is better at identifying customers who will stay than those who will churn, which is common in imbalanced churn datasets. Improving recall for churners would be the next priority if the business goal is proactive retention.
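
One simple lever for raising churn recall is to lower the decision threshold applied to the predicted probabilities. The sketch below illustrates the effect on synthetic data; the dataset, the 0.3 threshold, and the names `X_demo`, `y_demo`, and `results` are all assumptions for illustration rather than values tuned for this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (not the Telco dataset).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.73, 0.27], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

clf = LogisticRegression(max_iter=10000).fit(X_tr, y_tr)
prob = clf.predict_proba(X_te)[:, 1]

# Compare the default threshold with a lower, hypothetical one.
results = {}
for threshold in (0.5, 0.3):
    pred = (prob >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_te, pred, zero_division=0),
        recall_score(y_te, pred),
    )
    print(threshold, results[threshold])
```

Lowering the threshold can only add predicted positives, so recall never decreases; precision typically falls, which is the trade-off the business would need to weigh.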

## Extract Coefficients

In [17]:
import pandas as pd

coefficients = pd.DataFrame({
    "feature": X_train.columns,
    "coefficient": model.coef_[0]
}).sort_values(by="coefficient", ascending=False)

coefficients.head(10)

| | feature | coefficient |
|---|---|---|
| 10 | InternetService_Fiber optic | 0.77876 |
| 3 | TotalCharges | 0.497246 |
| 23 | StreamingMovies_Yes | 0.258653 |
| 21 | StreamingTV_Yes | 0.258042 |
| 9 | MultipleLines_Yes | 0.216356 |
| 26 | PaperlessBilling_Yes | 0.181833 |
| 28 | PaymentMethod_Electronic check | 0.181456 |
| 17 | DeviceProtection_Yes | 0.053625 |
| 0 | SeniorCitizen | 0.052901 |
| 29 | PaymentMethod_Mailed check | 0.033133 |


In [18]:
coefficients.tail(10)

| | feature | coefficient |
|---|---|---|
| 18 | TechSupport_No internet service | -0.092861 |
| 16 | DeviceProtection_No internet service | -0.092861 |
| 22 | StreamingMovies_No internet service | -0.092861 |
| 19 | TechSupport_Yes | -0.100249 |
| 6 | Dependents_Yes | -0.104249 |
| 13 | OnlineSecurity_Yes | -0.12343 |
| 24 | Contract_One year | -0.286473 |
| 25 | Contract_Two year | -0.588975 |
| 2 | MonthlyCharges | -0.921369 |
| 1 | tenure | -1.219639 |


### Interpretation of Coefficients

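Positive coefficients increase the predicted likelihood of churn, while negative coefficients decrease it. Because the features were standardized, each coefficient corresponds to a one-standard-deviation change in that feature. The strongest positive predictors are fiber optic internet service, higher total charges, streaming services, multiple lines, paperless billing, and payment by electronic check. The strongest negative predictors are longer tenure, higher monthly charges, and one- or two-year contracts. Since `drop_first=True` makes month-to-month the reference contract type, the negative contract coefficients imply that month-to-month customers are at higher risk by comparison. Support-oriented add-ons (tech support, online security) are associated with retention, whereas streaming add-ons are associated with churn. The opposite signs on `TotalCharges` and `MonthlyCharges` should be read with caution: tenure, monthly charges, and total charges are closely related, so their individual coefficients are likely distorted by multicollinearity.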
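
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are often easier to communicate. The sketch below does this for three coefficients copied from the tables above; because the model was trained on standardized features, each ratio describes the change in odds per one-standard-deviation increase in the feature, and `coef_sample` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the tables above (standardized-feature scale).
coef_sample = pd.Series({
    "InternetService_Fiber optic": 0.77876,
    "Contract_Two year": -0.588975,
    "tenure": -1.219639,
})

# exp(coefficient) = multiplicative change in the odds of churn
# per one-standard-deviation increase in the feature.
odds_ratios = np.exp(coef_sample)
print(odds_ratios.round(3))
```

A ratio above 1 raises the odds of churn and a ratio below 1 lowers them; these remain associations within this model, not causal effects.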


## Class Imbalance Handling

### Refit Logistic Regression with Class Weights

In [19]:
model_balanced = LogisticRegression(
    max_iter=5000,
    class_weight="balanced"
)

model_balanced.fit(X_train_scaled, y_train)

In [20]:
y_pred_bal = model_balanced.predict(X_test_scaled)
y_prob_bal = model_balanced.predict_proba(X_test_scaled)[:, 1]

from sklearn.metrics import classification_report, roc_auc_score

print(classification_report(y_test, y_pred_bal))
roc_auc_score(y_test, y_prob_bal)

              precision    recall  f1-score   support

           0       0.90      0.72      0.80      1035
           1       0.51      0.78      0.61       374

    accuracy                           0.74      1409
   macro avg       0.70      0.75      0.71      1409
weighted avg       0.80      0.74      0.75      1409



np.float64(0.8412333049161694)

### Interpretation

After applying class weighting, the model prioritizes identifying churners. Recall for the churn class rises from 57% to 78%, so the model now captures most customers who leave. This comes at the cost of lower churn precision (66% to 51%) and lower overall accuracy (81% to 74%), while the churn F1-score is unchanged at 0.61. The ROC-AUC is essentially the same (0.8412 versus 0.8416), confirming that class weighting mainly shifts the decision boundary rather than changing how well the model ranks customers by churn risk. This trade-off is appropriate when missing a churner costs the business more than contacting a customer who would have stayed.
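
For reference, `class_weight="balanced"` assigns each class the weight `n_samples / (n_classes * count_of_class)`, so the minority class counts for more in the loss. The sketch below reproduces that formula on synthetic labels; the 730/270 class counts are an assumption for illustration and are not the training-set counts from this notebook.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels with assumed class counts (illustration only).
y_toy = np.array([0] * 730 + [1] * 270)
classes = np.unique(y_toy)

# Weights as scikit-learn computes them for class_weight="balanced".
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# The same quantity written out by hand.
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

print(dict(zip(classes.tolist(), weights.round(3).tolist())))
```

With these counts, errors on the minority class are weighted roughly 2.7 times as heavily as errors on the majority class, which is what pushes the model toward higher churn recall.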