# Supervised Learning Project - Beta Bank Churn Prediction

Goal: Predict customer churn (whether a customer will exit) and achieve **F1 ≥ 0.59**.


In [1]:
# 1. Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import f1_score, roc_auc_score, classification_report
from sklearn.utils import shuffle 


In [2]:
# 2. Load Data
df = pd.read_csv('/datasets/Churn.csv')
df.head()

| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2.0 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1.0 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8.0 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1.0 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2.0 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |


In [3]:
# Inspect data
print(df.info())
print(df.describe())
print(df['Exited'].value_counts(normalize=True))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB
None
         RowNumber    CustomerId   CreditScore           Age       Tenure  \
count  10000

In [4]:
# 2.5. Handle Missing Values
print("Missing values before handling:")
print(df.isnull().sum())

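# Fill missing values in Tenure with the median.
# Assigning the result back avoids the chained inplace call, which recent pandas versions warn about.
df['Tenure'] = df['Tenure'].fillna(df['Tenure'].median())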

print("\nMissing values after handling:")
print(df.isnull().sum())


Missing values before handling:
RowNumber            0
CustomerId           0
Surname              0
CreditScore          0
Geography            0
Gender               0
Age                  0
Tenure             909
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
dtype: int64

Missing values after handling:
RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64
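
The cell above takes the median from all 10,000 rows, so the test rows contribute to the fill value. A minimal sketch of a stricter variant is shown here, assuming the split happens before imputation. The toy frame and the names `toy`, `train_part`, `test_part`, and `tenure_median` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the churn data (values are illustrative only)
toy = pd.DataFrame({"Tenure": [2.0, 1.0, np.nan, 8.0, np.nan, 5.0, 3.0, np.nan]})

# Hypothetical split: first six rows are "train", last two are "test"
train_part = toy.iloc[:6].copy()
test_part = toy.iloc[6:].copy()

# Learn the median from the training rows only, then reuse it for both parts
tenure_median = train_part["Tenure"].median()
train_part["Tenure"] = train_part["Tenure"].fillna(tenure_median)
test_part["Tenure"] = test_part["Tenure"].fillna(tenure_median)

print(tenure_median)  # median of 1, 2, 5, 8
```

With only 909 missing values and a single column affected, the difference from the notebook's approach is likely small, but fitting on the training rows keeps the test set fully unseen.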


In [5]:
# 3. Preprocessing
# Remove unnecessary columns (RowNumber, CustomerId, Surname)
X = df.drop(['Exited', 'RowNumber', 'CustomerId', 'Surname'], axis=1)
y = df['Exited']

# One-hot encode categoricals
X = pd.get_dummies(X, drop_first=True)

print("Features after preprocessing:")
print(X.columns.tolist())
print(f"Shape: {X.shape}")

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

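# Standardize all 11 columns, including the 0/1 dummy columns.
# The scaler is fit on the training split only; both outputs are NumPy arrays.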
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print(f"\nTraining set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")

Features after preprocessing:
['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary', 'Geography_Germany', 'Geography_Spain', 'Gender_Male']
Shape: (10000, 11)

Training set shape: (7000, 11)
Test set shape: (3000, 11)
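
The preprocessing cell standardizes every column, binary flags included. If only the continuous columns should be scaled, one option is scikit-learn's `ColumnTransformer`. The sketch below assumes a three-column toy frame; `toy_X`, `continuous_cols`, and `preprocess` are hypothetical names, and the notebook itself does not use this approach.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy features: two continuous columns and one 0/1 dummy column
toy_X = pd.DataFrame({
    "CreditScore": [619, 608, 502, 699],
    "Age": [42, 41, 42, 39],
    "Gender_Male": [0, 0, 1, 1],
})
continuous_cols = ["CreditScore", "Age"]

# Scale the continuous columns; pass the remaining columns through unchanged
preprocess = ColumnTransformer(
    transformers=[("scale", StandardScaler(), continuous_cols)],
    remainder="passthrough",
)
scaled = preprocess.fit_transform(toy_X)

# Output column order: scaled columns first, then the passthrough columns
print(scaled.shape)
```

Tree-based models such as Random Forest are insensitive to feature scaling, so this choice mainly matters for the logistic regression baseline.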


In [6]:
# 4. Baseline Model - Logistic Regression
log_reg = LogisticRegression(max_iter=200, random_state=42)
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)

print("Baseline F1:", f1_score(y_test, y_pred))
print("Baseline ROC-AUC:", roc_auc_score(y_test, log_reg.predict_proba(X_test)[:,1]))

Baseline F1: 0.29925187032418954
Baseline ROC-AUC: 0.7883767595478183
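
A ROC-AUC of about 0.79 next to an F1 of about 0.30 suggests the baseline ranks customers reasonably well but that the default 0.5 cutoff flags too few churners. One possible remedy, not applied in this notebook, is to sweep the decision threshold. The sketch below uses made-up labels and probabilities (`y_true`, `proba`) purely for illustration. In practice the threshold should be chosen on a validation split, not on the test set.

```python
import numpy as np
from sklearn.metrics import f1_score

# Illustrative labels and predicted churn probabilities (not from the notebook)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 1])
proba = np.array([0.04, 0.12, 0.17, 0.22, 0.27, 0.32, 0.36, 0.46, 0.56, 0.81])

# F1 at the default cutoff of 0.5
default_f1 = f1_score(y_true, (proba >= 0.5).astype(int))

# F1 at each candidate cutoff from 0.1 to 0.8
thresholds = np.arange(0.1, 0.85, 0.1)
scores = [f1_score(y_true, (proba >= t).astype(int)) for t in thresholds]

best_t = thresholds[int(np.argmax(scores))]
print(f"default F1: {default_f1:.2f}, best F1: {max(scores):.2f} at threshold {best_t:.1f}")
```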


In [7]:
# 5. Class Imbalance: Manual Upsampling + Random Forest
# (numpy, pandas, shuffle, RandomForestClassifier and the metrics are already imported in cell 1)

# StandardScaler returned X_train as a NumPy array, so wrap it in a DataFrame
# (columns become 0..10); y_train is still a pandas Series
if isinstance(X_train, np.ndarray):
    X_train = pd.DataFrame(X_train)
if isinstance(y_train, np.ndarray):
    y_train = pd.Series(y_train)

# Reset indices to avoid alignment issues
X_train = X_train.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)

# Split training data into negative (0) and positive (1) observations
features_zeros = X_train[y_train == 0]
features_ones = X_train[y_train == 1]
target_zeros = y_train[y_train == 0]
target_ones = y_train[y_train == 1]

# Repeat the minority class to balance the dataset
repeat = 3  # churners are ~20% of rows (about 4:1), so 3 copies give roughly 4:3; 4 copies would give about 1:1
features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat, ignore_index=True)
target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat, ignore_index=True)

# Shuffle the data so classes are mixed properly
X_train_sm, y_train_sm = shuffle(features_upsampled, target_upsampled, random_state=42)

# Train model on upsampled data
rf = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42)
rf.fit(X_train_sm, y_train_sm)

# Evaluate model
y_pred = rf.predict(X_test)
print("Random Forest (Manual Upsampling) F1:", f1_score(y_test, y_pred))
print("Random Forest (Manual Upsampling) ROC-AUC:", roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]))

Random Forest (Manual Upsampling) F1: 0.6414790996784565
Random Forest (Manual Upsampling) ROC-AUC: 0.8765255922706294
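
The upsampling steps in the cell above can be wrapped in a small helper so the repeat factor is easy to vary. This is a sketch under the same assumptions as the cell (binary target with 1 as the minority class); the function name `upsample` and the toy data are illustrative.

```python
import pandas as pd
from sklearn.utils import shuffle

def upsample(features, target, repeat, random_state=42):
    """Repeat the positive-class rows `repeat` times in total, then shuffle."""
    features = features.reset_index(drop=True)
    target = target.reset_index(drop=True)
    is_positive = target == 1
    features_up = pd.concat(
        [features[~is_positive]] + [features[is_positive]] * repeat,
        ignore_index=True,
    )
    target_up = pd.concat(
        [target[~is_positive]] + [target[is_positive]] * repeat,
        ignore_index=True,
    )
    # shuffle() permutes both objects with the same row order
    return shuffle(features_up, target_up, random_state=random_state)

# Toy data: four negatives and one positive
toy_X = pd.DataFrame({"Age": [42, 41, 39, 43, 50]})
toy_y = pd.Series([0, 0, 0, 0, 1])

X_up, y_up = upsample(toy_X, toy_y, repeat=3)
print(len(X_up), int(y_up.sum()))  # 7 rows in total, 3 of them positive
```

Only the training split should pass through this helper; the test set keeps its original class ratio so the reported metrics reflect real-world proportions.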


In [8]:
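# 6. Hyperparameter Tuning Example with GridSearchCV (Random Forest)
# Note: the grid search runs on the already-upsampled data, so copies of the same
# churner can land in both the training and validation folds. The cross-validated
# F1 is therefore optimistic and tends to favour deeper trees.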
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [5, 10, 15]
}

grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, scoring='f1', cv=3)
grid.fit(X_train_sm, y_train_sm)

print("Best Params:", grid.best_params_)
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
print("Tuned Random Forest F1:", f1_score(y_test, y_pred))
print("Tuned Random Forest ROC-AUC:", roc_auc_score(y_test, best_model.predict_proba(X_test)[:,1]))

Best Params: {'max_depth': 15, 'n_estimators': 200}
Tuned Random Forest F1: 0.6239015817223199
Tuned Random Forest ROC-AUC: 0.8675811599673628
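
The tuned model scores slightly lower on the test set than the untuned one (0.624 vs. 0.641). One plausible contributor is that cross-validation ran on data that was upsampled beforehand, so duplicated churners appear on both sides of each fold. A minimal sketch of the alternative, upsampling inside each training fold only, is shown below on synthetic data from `make_classification`. The function `cv_f1_with_fold_upsampling` and all variable names are hypothetical, and this is not the procedure the notebook used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced data (about 20% positives) standing in for the churn set
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=42
)

def cv_f1_with_fold_upsampling(X, y, max_depth, repeat=3, n_splits=3):
    """Mean validation F1 when only the training part of each fold is upsampled."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = []
    for train_idx, valid_idx in skf.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        # Add (repeat - 1) extra copies of every positive training row
        positives = np.flatnonzero(y_tr == 1)
        rows = np.concatenate([np.arange(len(y_tr)), np.tile(positives, repeat - 1)])
        model = RandomForestClassifier(
            n_estimators=50, max_depth=max_depth, random_state=42
        )
        model.fit(X_tr[rows], y_tr[rows])
        # The validation fold is left untouched
        scores.append(f1_score(y[valid_idx], model.predict(X[valid_idx])))
    return float(np.mean(scores))

cv_scores = {depth: cv_f1_with_fold_upsampling(X_demo, y_demo, depth) for depth in (5, 10, 15)}
best_depth = max(cv_scores, key=cv_scores.get)
print(cv_scores, best_depth)
```

Passing `class_weight='balanced'` to `RandomForestClassifier` inside `GridSearchCV` is a simpler way to get a similar effect without manual resampling.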


In [10]:
# 7. Final Evaluation Report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.90      0.93      0.91      2389
           1       0.67      0.58      0.62       611

    accuracy                           0.86      3000
   macro avg       0.79      0.75      0.77      3000
weighted avg       0.85      0.86      0.85      3000
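
As a quick consistency check, the churn-class F1 in the report can be recomputed from the rounded precision and recall, since F1 is their harmonic mean.

```python
# Churn-class values as printed in the classification report
precision = 0.67
recall = 0.58

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.62
```

The result matches the reported 0.62; the unrounded figure from `f1_score` in the tuning cell is 0.624.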



## Conclusion

This project built a model to predict customer churn for Beta Bank. The target was an **F1 score ≥ 0.59** on the test set, and both Random Forest variants met it.

### Key Findings:

1. **Data Quality**: The dataset contained 10,000 customer records with 11 features after preprocessing. Missing values in the `Tenure` column (909 of 10,000 rows) were filled with the column median.

2. **Class Imbalance**: The target variable `Exited` is imbalanced, with approximately 20% of customers churning. To compensate, the churn rows in the training set were upsampled (each repeated three times); the test set was left unchanged.

3. **Model Performance**:
   - **Baseline Logistic Regression**: Achieved an F1 score of 0.299 (ROC-AUC 0.788), well below the target threshold
   - **Random Forest with Manual Upsampling**: Achieved an F1 score of **0.641** (ROC-AUC 0.877), exceeding the target
   - **Hyperparameter-Tuned Random Forest**: Achieved an F1 score of **0.624** with the grid-search parameters (max_depth=15, n_estimators=200), slightly below the untuned model on the same test set

4. **Final Model**: The tuned Random Forest classifier, which is the model shown in the final evaluation report, achieved:
   - **F1 Score**: 0.624 (exceeds target of 0.59 ✓)
   - **ROC-AUC Score**: 0.868
   - **Precision**: 0.67 for churn class
   - **Recall**: 0.58 for churn class
   - **Overall Accuracy**: 86%

### Recommendations:

The Random Forest model with upsampling clears the F1 target and could help Beta Bank identify at-risk customers. On the test set the tuned model catches 58% of churning customers (recall), and 67% of the customers it flags actually churn (precision), which makes it a reasonable starting point for proactive retention campaigns. Because the untuned model (max_depth=10) scored slightly higher on the test set (F1 0.641 vs. 0.624), the tuning step is worth revisiting before deployment.

**Goal Achievement**: ✅ **F1 Score = 0.624 ≥ 0.59** (Target Met)
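
To make the recall and precision figures more concrete, they can be translated into approximate customer counts for the 3,000-row test set. This is a rough sketch: the inputs are the two-decimal values from the classification report, so the counts are approximate, and the variable names are illustrative.

```python
# Values taken from the classification report (churn class)
support_churn = 611   # customers in the test set who actually churned
recall = 0.58
precision = 0.67

# Churners the model catches, customers it flags overall, and churners it misses
true_positives = recall * support_churn
flagged = true_positives / precision
missed = support_churn - true_positives

print(round(true_positives), round(flagged), round(missed))  # 354 529 257
```

In other words, of roughly 529 customers the model would flag for retention outreach, about 354 are real churners, while about 257 churners go undetected.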
