## *Spring 2025*
Paul Signorelli, Christopher Sáez, Wisdom Okwen, Austin Campbell, Ethan Byrd, Will Scuria



# 1. Business Problem: Customer Churn


- The dataset is based on real bank data from 2022, slightly modified to:
    - preserve the customers' privacy  
    - preserve the bank's privacy  
    - allow for richer analysis  
- To prevent customers from churning (i.e., to take steps that incentivize them to stay), we need to be able to reliably identify the customers at risk.

>Task: Predict which customers are most likely to churn

# 2. Load, Wrangle, and Clean Data


In [1]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


In [17]:
# load data.json as a dataframe
df = pd.read_json('data/data.json')
df

Unnamed: 0,ClientID,Surname,Firstname,FICOScore,Subsidiary,Gender,Age,Balance,Products,BankCC,Active,RegDeposits,LifeInsur,PlatStatus,Terminated
0,61BOS20150MF65876258487565N,Myles,Fidel,657,Boston,Male,28,64821.12,2,0,0,15330,0,0,0
1,91CHL20170DA95890902611393N,Drenner,Arron,493,Chapel Hill,Male,64,90161.70,1,0,1,5599,0,0,0
2,91CHL20180MC38607441559869N,Muir,Charolette,820,Chapel Hill,Female,46,0.00,1,0,0,15185,0,0,1
3,61BOS20110SH53586596382094N,Schimpf,Herschel,670,Boston,Male,37,230.10,2,1,1,13,1,0,0
4,40ATL20110MK15149165663931P,Montez,Kisha,664,Atlanta,Female,33,76318.32,2,1,1,5278,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23995,40ATL20160TJ2301576765838N,Tondreau,Jeffrey,678,Atlanta,Male,47,88960.56,2,1,0,14963,0,0,1
23996,61BOS20150HB1119146617276N,Hevrin,Brad,627,Boston,Male,28,862.68,1,0,1,15206,0,0,0
23997,40ATL20140RP44230979729440N,Russer,Penney,682,Atlanta,Female,48,76374.48,2,0,0,189,1,0,0
23998,40ATL20140PB19106060056904N,Polizio,Brigette,839,Atlanta,Female,39,112642.93,1,1,1,6696,0,0,1


In [22]:
# NOTE: If the cell above cannot find the file, uncomment the last line of this cell, run it, and then rerun the cell above

# Download the training data to "data/data.json", the path that pd.read_json reads above
# !mkdir -p data && wget -O data/data.json "https://raw.githubusercontent.com/wisdom-okwen/comp-560-project/refs/heads/main/data/Bank_Churn.json"

In [19]:
### Cleaning the Data

# Categorical Columns
categorical_columns = ['Gender', 'Subsidiary', 'BankCC', 'Active', 'LifeInsur', 'PlatStatus', 'Terminated']

# Fixing Categorical Columns
# Fixing typo in 'Gender'
df['Gender'] = df['Gender'].replace('Feale', 'Female')

# Fixing 'BankCC' - assuming binary variable (replace 2 with 1)
df['BankCC'] = df['BankCC'].replace(2, 1)

# Fixing 'Active' - assuming binary variable (replace 2 with 1)
df['Active'] = df['Active'].replace(2, 1)

# Fixing 'PlatStatus' - replacing 'yes' with '1'
df['PlatStatus'] = df['PlatStatus'].replace('yes', '1')

# Converting categorical columns to type 'category'
for col in categorical_columns:
    df[col] = df[col].astype('category')

# Numerical Variables
valid_ranges = {
    'FICOScore': (300, 850),
    'Age': (18, 100),
    'Balance': (-np.inf, np.inf),  # Allowing negative balances
    'Products': (1, 5),
    'RegDeposits': (0, 100000)
}

def fix_numerical_variable(data, column, valid_range):
    lower, upper = valid_range
    # Replace invalid or missing values with the median of valid values
    median = data[column][(data[column] >= lower) & (data[column] <= upper)].median()
    # Cast the median to the same type as the column
    if pd.api.types.is_integer_dtype(data[column]):
        median = int(median)
    elif pd.api.types.is_float_dtype(data[column]):
        median = float(median)

    # Replace invalid or missing values
    data.loc[(data[column] < lower) | (data[column] > upper) | (data[column].isnull()), column] = median

    # Calculate bounds for outlier removal
    mean = data[column].mean()
    std_dev = data[column].std()
    upper_limit = mean + 3 * std_dev
    lower_limit = mean - 3 * std_dev

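    # Remove outliers (values more than 3 standard deviations from the mean).
    # Return a copy so that later in-place fixes do not write into a slice of the original frame.
    return data[(data[column] >= lower_limit) & (data[column] <= upper_limit)].copy()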

# Cleaning numerical variables
for col, range_ in valid_ranges.items():
    df = fix_numerical_variable(df, col, range_)

# After Cleaning: Verify Ranges
cleaned_ranges = {col: (df[col].min(), df[col].max()) for col in valid_ranges.keys()}

print("Cleaned Ranges:", cleaned_ranges)

Cleaned Ranges: {'FICOScore': (350, 850), 'Age': (18, 71), 'Balance': (-128372.66, 260951.61), 'Products': (1, 3), 'RegDeposits': (0, 18833)}
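
To make the cleaning rule concrete, here is a minimal sketch of the same two steps (median replacement of invalid or missing values, then a 3-standard-deviation filter) applied to a small made-up `Age` series. The helper `clean_series` and the toy values are illustrative assumptions, not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd

def clean_series(values: pd.Series, lower: float, upper: float) -> pd.Series:
    """Hypothetical helper mirroring the two cleaning steps on a single Series."""
    is_valid = (values >= lower) & (values <= upper)  # NaN compares as False
    median = values[is_valid].median()
    # Step 1: replace out-of-range and missing entries with the median of valid values
    repaired = values.where(is_valid, median)
    # Step 2: drop entries more than 3 standard deviations from the mean
    mean, std = repaired.mean(), repaired.std()
    keep = (repaired >= mean - 3 * std) & (repaired <= mean + 3 * std)
    return repaired[keep]

# Toy ages: -5 and 250 are out of range, NaN is missing
ages = pd.Series([25, 31, 42, 38, -5, 250, np.nan, 29])
cleaned = clean_series(ages, lower=18, upper=100)
print(cleaned.tolist())
```

On this toy series, the three bad entries are replaced by the median of the valid ages (31), and no value falls outside the 3-standard-deviation band, so no rows are dropped.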


# 3. Build a Predictive Model: Supervised Learning
- It's time to train a machine learning model to predict which customers are likely to churn.
  - We have variables (columns) describing customers: our independent variables or Xs.
  - We have an outcome ("Terminated") for each customer: our dependent variable or Y.
  
Let's create a classification model that can predict whether a customer will churn or not using the *Bank_Churn_Train.json* dataset. These are the steps to train the machine learning model:

1. Separate the independent (X) variables from the dependent (Y) variable.
2. Transform categorical independent variables (X) into binary variables (i.e., dummy coding or one-hot encoding).
3. Split the dataset into a *train* and *test* set:
   - Train the model on the *train* set, and then evaluate its performance on the *test* set (which it has not seen during training) to assess its accuracy.
4. Load and instantiate (i.e., set up) a classification model. This includes selecting the model type and configuring parameters that control how it is trained.
5. Train the model (run the training process).
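
The five steps above can be sketched end to end on a tiny synthetic table. Everything here is an assumption made for illustration: the `toy_*` names, the three made-up feature columns, and the random labels (so the accuracy printed at the end should hover around chance and says nothing about the real data).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bank data (random values, random labels)
rng = np.random.default_rng(0)
n_rows = 200
toy = pd.DataFrame({
    "Age": rng.integers(18, 80, size=n_rows),
    "Balance": rng.normal(50_000, 20_000, size=n_rows).round(2),
    "Gender": rng.choice(["Female", "Male"], size=n_rows),
    "Terminated": rng.integers(0, 2, size=n_rows),
})

# 1. Separate the independent variables (X) from the dependent variable (Y)
toy_X = toy.drop(columns="Terminated")
toy_y = toy["Terminated"]

# 2. One-hot encode the categorical column
toy_X = pd.get_dummies(toy_X, columns=["Gender"], drop_first=True)

# 3. Split into train and test sets (80/20)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

# 4. Instantiate the classifier
toy_model = RandomForestClassifier(n_estimators=50, random_state=42)

# 5. Train it, then score it on the held-out test set
toy_model.fit(toy_X_train, toy_y_train)
toy_accuracy = toy_model.score(toy_X_test, toy_y_test)
print(f"Toy test accuracy: {toy_accuracy:.2f}")
```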



### 3.1 Separate the independent (X) variables from the dependent (Y) variable
- Let's separate the independent (X) variables from the dependent (Y) variable (Terminated) in the dataframe, keeping only variables that are categorical, boolean, or numeric.

In [20]:
# Separate the independent (X) variables from the dependent (Y) variable (Terminated),
# using the cleaned DataFrame 'df' from Section 2.
X = df.drop('Terminated', axis=1)
y = df['Terminated']

# Keep only specified data types in X
X = X.select_dtypes(include=['category', 'bool', 'number'])

print("Independent variables (X):")
print(X.head())
print(X.shape)
print(y.shape)

print("\nDependent variable (Y):")
print(y.head())

Independent variables (X):
   FICOScore   Subsidiary  Gender  Age   Balance  Products BankCC Active  \
0        657       Boston    Male   28  64821.12         2      0      0   
1        493  Chapel Hill    Male   64  90161.70         1      0      1   
2        820  Chapel Hill  Female   46      0.00         1      0      0   
3        670       Boston    Male   37    230.10         2      1      1   
4        664      Atlanta  Female   33  76318.32         2      1      1   

   RegDeposits LifeInsur PlatStatus  
0        15330         0          0  
1         5599         0          0  
2        15185         0          0  
3           13         1          0  
4         5278         1          1  
(23467, 11)
(23467,)

Dependent variable (Y):
0    0
1    0
2    1
3    0
4    0
Name: Terminated, dtype: category
Categories (2, int64): [0, 1]


### 3.2 Transform categorical independent variables (X) into binary variables

In [21]:
# Perform one-hot encoding on categorical features
X = pd.get_dummies(X, columns=X.select_dtypes(include=['category']).columns, drop_first=True)

print("\nIndependent variables (X) after one-hot encoding:")
print(X.head())
X.shape


Independent variables (X) after one-hot encoding:
   FICOScore  Age   Balance  Products  RegDeposits  Subsidiary_Boston  \
0        657   28  64821.12         2        15330               True   
1        493   64  90161.70         1         5599              False   
2        820   46      0.00         1        15185              False   
3        670   37    230.10         2           13               True   
4        664   33  76318.32         2         5278              False   

   Subsidiary_Chapel Hill  Gender_Male  BankCC_1  Active_1  LifeInsur_1  \
0                   False         True     False     False        False   
1                    True         True     False      True        False   
2                    True        False     False     False        False   
3                   False         True      True      True         True   
4                   False        False      True      True         True   

   PlatStatus_1  
0         False  
1         False  
2    

(23467, 12)
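
As a small illustration of what `drop_first=True` does, the sketch below encodes a made-up four-row `Subsidiary` column with and without it. The toy frame is an assumption for demonstration only; the point is that a category with k levels needs only k-1 dummy columns, because the dropped level is implied when all remaining dummies are false.

```python
import pandas as pd

# Toy column with the three subsidiaries that appear in the data
toy = pd.DataFrame(
    {"Subsidiary": pd.Categorical(["Boston", "Atlanta", "Chapel Hill", "Atlanta"])}
)

full = pd.get_dummies(toy, columns=["Subsidiary"])                      # one column per level
reduced = pd.get_dummies(toy, columns=["Subsidiary"], drop_first=True)  # first level dropped

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

In the reduced encoding, the Atlanta rows are the ones where both `Subsidiary_Boston` and `Subsidiary_Chapel Hill` are false, which matches the column names seen in the output above.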

### 3.3 Split the dataset into a training set and a testing set

In [23]:
from sklearn.model_selection import train_test_split

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # 80% training and 20% test

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)


X_train shape: (18773, 12)
X_test shape: (4694, 12)
y_train shape: (18773,)
y_test shape: (4694,)
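
Because churners are the minority class, it can help to keep their share the same in both splits. The sketch below compares a plain split with a stratified one (`stratify=`) on made-up labels in which 24% of the rows are churners; the `toy_*` arrays are assumptions for illustration and this is a possible variation, not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 240 churners (1) and 760 non-churners (0)
toy_labels = np.array([1] * 240 + [0] * 760)
toy_features = np.arange(1000).reshape(-1, 1)

# Plain random split: class shares can drift between train and test
_, _, plain_train, plain_test = train_test_split(
    toy_features, toy_labels, test_size=0.2, random_state=42
)

# Stratified split: class shares are preserved in both parts
_, _, strat_train, strat_test = train_test_split(
    toy_features, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

print("plain      train/test churn share:", plain_train.mean(), plain_test.mean())
print("stratified train/test churn share:", strat_train.mean(), strat_test.mean())
```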


### 3.4 Instantiate the classifier model

In [24]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Initialize the RandomForestClassifier
rf_classifier = RandomForestClassifier(random_state=42)

### 3.5 Train, make predictions, and evaluate the classifier model

In [25]:
# Train the model
rf_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = rf_classifier.predict(X_test)

# Evaluate the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the Random Forest Classifier: {accuracy}")
print(classification_report(y_test, y_pred))

Accuracy of the Random Forest Classifier: 0.9054111631870473
              precision    recall  f1-score   support

           0       0.92      0.96      0.94      3562
           1       0.85      0.74      0.79      1132

    accuracy                           0.91      4694
   macro avg       0.88      0.85      0.87      4694
weighted avg       0.90      0.91      0.90      4694
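
Since the business goal is to catch churners, recall on class 1 matters at least as much as overall accuracy. One common lever is the decision threshold applied to `predict_proba`. The sketch below shows the idea on synthetic data generated with `make_classification`; the data, the `*_demo` names, and the 0.3 threshold are illustrative assumptions, not tuned values for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn data (about 24% positives)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.76, 0.24], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf_demo = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
churn_probability = clf_demo.predict_proba(X_te)[:, 1]  # estimated P(class 1)

# Compare the usual 0.5 cut-off with a lower, more recall-friendly one
threshold_results = {}
for threshold in (0.5, 0.3):
    flagged = (churn_probability >= threshold).astype(int)
    threshold_results[threshold] = (
        precision_score(y_te, flagged, zero_division=0),
        recall_score(y_te, flagged),
    )
    precision, recall = threshold_results[threshold]
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

Lowering the threshold can only flag more customers, so recall never decreases, usually at the cost of precision. Which trade-off is acceptable depends on how expensive a retention offer is compared with losing a customer.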



# 4. Evaluate Performance and Improve the Model


### 4.1 Generalization

Ideally, a model performs equally well on data it has not seen. In other words, it ***generalizes*** well.

In [26]:
from sklearn.model_selection import cross_val_score

# Example using 5-fold cross-validation
cv_scores = cross_val_score(rf_classifier, X, y, cv=5)
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean cross-validation score: {np.mean(cv_scores)}")

Cross-validation scores: [0.90093737 0.90796762 0.90176859 0.89963776 0.89814618]
Mean cross-validation score: 0.901691502024927
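
A complementary check is to compare training and validation scores fold by fold, and to track recall alongside accuracy. The sketch below does this with `cross_validate` on synthetic data; the dataset, the `*_demo` names, and the smaller forest are assumptions chosen to keep the example quick, so the numbers are not comparable to the scores above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for the churn data
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.76, 0.24], random_state=42
)

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_results = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_demo,
    y_demo,
    cv=folds,
    scoring=["accuracy", "recall"],
    return_train_score=True,  # also report scores on the training folds
)

for metric in ("accuracy", "recall"):
    train_mean = np.mean(cv_results[f"train_{metric}"])
    test_mean = np.mean(cv_results[f"test_{metric}"])
    print(f"{metric}: train={train_mean:.3f}, validation={test_mean:.3f}")
```

A large gap between the training and validation columns would suggest overfitting; similar values across folds suggest the model generalizes consistently.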


# 5. Tensor deep learning model


### 5.1 Prepare the data and preprocess the features 

In [28]:
from sklearn.preprocessing import StandardScaler
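# Reload the raw data. The Section 2 cleaning is not reapplied here, so uncleaned
# categories (e.g., 'BankCC_2') appear in the one-hot encoded output.
df = pd.read_json('data/data.json')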

# 1. Feature Engineering: Extract "YearOpened" from ClientID and compute "Tenure"
# Assuming characters at positions 6 to 9 in ClientID represent the account opening year (e.g., "2015")
df['YearOpened'] = df['ClientID'].str[5:9].astype(int)
current_year = 2022  # Alternatively, you could use: pd.Timestamp.now().year
df['Tenure'] = current_year - df['YearOpened']

# 2. Drop columns that are not useful for prediction (e.g., ClientID, Surname, Firstname)
df.drop(['ClientID', 'Surname', 'Firstname'], axis=1, inplace=True)

# 3. Convert appropriate columns to categorical type
# Add "PlatStatus" to the list of categorical columns
categorical_columns = ['Subsidiary', 'Gender', 'BankCC', 'Active', 'LifeInsur', 'PlatStatus', 'Terminated']
for col in categorical_columns:
    df[col] = df[col].astype('category')

# 4. One-Hot Encode the categorical features
df_prepared = pd.get_dummies(df, columns=categorical_columns, drop_first=True)

# 5. Scale features
# NOTE: after get_dummies, the original categorical names (e.g., 'Gender', 'Terminated') no longer
# exist in df_prepared, so difference() below excludes nothing. Every column, including the dummy
# columns and the target 'Terminated_1', is standardized (see the output below).
num_cols = df_prepared.columns.difference(['Terminated'] + categorical_columns)
scaler = StandardScaler()
df_prepared[num_cols] = scaler.fit_transform(df_prepared[num_cols])

print("\nPrepared Data:")
print(df_prepared.head())


Prepared Data:
   FICOScore       Age   Balance  Products  RegDeposits  YearOpened    Tenure  \
0   0.306362 -1.042322 -0.012109  0.675159     0.127654    0.397984 -0.397984   
1  -1.250425  2.303642  0.027026 -0.761477    -0.093466    1.020321 -1.020321   
2   1.853656  0.630660 -0.112217 -0.761477     0.124360    1.331490 -1.331490   
3   0.429766 -0.205831 -0.111862  0.675159    -0.220398   -0.846689  0.846689   
4   0.372810 -0.577605  0.005646  0.675159    -0.100760   -0.846689  0.846689   

   Subsidiary_Boston  Subsidiary_Chapel Hill  Gender_Female  Gender_Male  \
0           1.007276               -0.572603      -0.935571     0.935649   
1          -0.992776                1.746410      -0.935571     0.935649   
2          -0.992776                1.746410       1.068866    -1.068777   
3           1.007276               -0.572603      -0.935571     0.935649   
4          -0.992776               -0.572603       1.068866    -1.068777   

   BankCC_1  BankCC_2  Active_1  Active_
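
An alternative worth considering is to scale only the truly numeric columns and to fit the scaler on the training split alone, so that test-set statistics do not leak into preprocessing. The sketch below shows that pattern on a made-up frame; the `toy_*` names and values are assumptions for illustration, and this is not the preprocessing that produced the output above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up frame: two numeric features, one dummy column, one binary target
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "FICOScore": rng.integers(300, 851, size=500).astype(float),
    "Balance": rng.normal(60_000, 30_000, size=500),
    "Active_1": rng.integers(0, 2, size=500).astype(bool),
    "Terminated": rng.integers(0, 2, size=500),
})
numeric_cols = ["FICOScore", "Balance"]  # dummies and the target are left untouched

# Split first, then fit the scaler on the training rows only
toy_train_X, toy_test_X, toy_train_y, toy_test_y = train_test_split(
    toy.drop(columns="Terminated"), toy["Terminated"], test_size=0.2, random_state=42
)
toy_train_X, toy_test_X = toy_train_X.copy(), toy_test_X.copy()

toy_scaler = StandardScaler().fit(toy_train_X[numeric_cols])
toy_train_X[numeric_cols] = toy_scaler.transform(toy_train_X[numeric_cols])
toy_test_X[numeric_cols] = toy_scaler.transform(toy_test_X[numeric_cols])

print(toy_train_X.head())
```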

### 5.2 Create tensors and split data into training & testing

In [29]:
X_new = df_prepared.drop('Terminated_1', axis=1)
y_new = df_prepared['Terminated_1']
# Convert pandas DataFrames/Series to PyTorch tensors
# The 'ClientID' column has already been dropped during preprocessing.
X_tensor = torch.tensor(X_new.astype(float).values, dtype=torch.float32)

# 'Terminated_1' was standardized along with every other column in 5.1, so it no longer holds 0/1.
# Map it back to class labels (positive -> 1, otherwise 0), then use Long type for CrossEntropyLoss.
y_tensor = torch.tensor((y_new > 0).astype(int).values, dtype=torch.long)

# Split the data into training and testing sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X_tensor, y_tensor, test_size=0.2, random_state=42)

# Create DataLoader for batching during training
train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

### 5.3 Create the deep learning model with loss function and optimizer

In [30]:
# Define the deep learning neural network Model
class ChurnModel(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Dropout(0.3),  # Dropout is a form of regularization
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(32, 2),  # two output logits: class 0 (stays) and class 1 (churns)
        )

    def forward(self, x):
        return self.network(x)

# Get input dimension from X_tensor
input_dim = X_tensor.shape[1]
model = ChurnModel(input_dim)

# Set Loss Function and Optimizer 
criterion = nn.CrossEntropyLoss()  # suitable for multi-class (even if binary)
optimizer = optim.Adam(model.parameters(), lr=0.001)
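
To see why a two-logit output with cross-entropy is a valid choice for a binary target, the NumPy sketch below checks numerically that softmax cross-entropy over two logits equals binary cross-entropy applied to the difference of those logits. The helper functions and the toy logits are assumptions written for this illustration; they are plain re-implementations of the formulas, not PyTorch's internals.

```python
import numpy as np

def softmax_cross_entropy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Mean cross-entropy over rows of two-class logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    return float(-np.log(probs[np.arange(len(labels)), labels]).mean())

def binary_cross_entropy(logit: np.ndarray, labels: np.ndarray) -> float:
    """Mean binary cross-entropy for a single logit per row."""
    p = 1.0 / (1.0 + np.exp(-logit))  # sigmoid
    return float(-(labels * np.log(p) + (1 - labels) * np.log(1 - p)).mean())

# Toy logits for three customers: column 0 = stays, column 1 = churns
toy_logits = np.array([[2.0, -1.0], [0.3, 0.8], [-1.5, 1.0]])
toy_labels = np.array([0, 1, 1])

two_logit_loss = softmax_cross_entropy(toy_logits, toy_labels)
one_logit_loss = binary_cross_entropy(toy_logits[:, 1] - toy_logits[:, 0], toy_labels)
print(two_logit_loss, one_logit_loss)
```

The two losses agree because the softmax probability of class 1 is the sigmoid of the logit difference. A single-output network trained with a binary cross-entropy loss would therefore be an equivalent design, with slightly fewer parameters.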

### 5.4 Train the deep learning model

In [31]:
# Training Loop
num_epochs = 50
for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for batch_x, batch_y in train_loader:
        optimizer.zero_grad()
        outputs = model(batch_x)
        loss = criterion(outputs, batch_y)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()

    avg_loss = running_loss / len(train_loader)
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {avg_loss:.4f}")

Epoch [1/50], Loss: 0.3033
Epoch [2/50], Loss: 0.2552
Epoch [3/50], Loss: 0.2431
Epoch [4/50], Loss: 0.2420
Epoch [5/50], Loss: 0.2373
Epoch [6/50], Loss: 0.2316
Epoch [7/50], Loss: 0.2300
Epoch [8/50], Loss: 0.2338
Epoch [9/50], Loss: 0.2248
Epoch [10/50], Loss: 0.2239
Epoch [11/50], Loss: 0.2241
Epoch [12/50], Loss: 0.2212
Epoch [13/50], Loss: 0.2206
Epoch [14/50], Loss: 0.2171
Epoch [15/50], Loss: 0.2190
Epoch [16/50], Loss: 0.2165
Epoch [17/50], Loss: 0.2136
Epoch [18/50], Loss: 0.2116
Epoch [19/50], Loss: 0.2160
Epoch [20/50], Loss: 0.2140
Epoch [21/50], Loss: 0.2167
Epoch [22/50], Loss: 0.2117
Epoch [23/50], Loss: 0.2091
Epoch [24/50], Loss: 0.2111
Epoch [25/50], Loss: 0.2089
Epoch [26/50], Loss: 0.2081
Epoch [27/50], Loss: 0.2104
Epoch [28/50], Loss: 0.2097
Epoch [29/50], Loss: 0.2105
Epoch [30/50], Loss: 0.2049
Epoch [31/50], Loss: 0.2076
Epoch [32/50], Loss: 0.2061
Epoch [33/50], Loss: 0.2071
Epoch [34/50], Loss: 0.2047
Epoch [35/50], Loss: 0.2072
Epoch [36/50], Loss: 0.2063
...

### 5.5 Evaluate the deep learning model

In [32]:
# Evaluate the Model
model.eval()
with torch.no_grad():
    test_outputs = model(X_test)
    # Get predicted classes by taking the argmax along the class dimension
    _, predicted = torch.max(test_outputs, 1)
    test_accuracy = (predicted == y_test).float().mean()
    print("Test Accuracy:", test_accuracy.item())


Test Accuracy: 0.9114583134651184
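
Accuracy alone hides how well the network finds churners, which is what the random forest's classification report showed per class. A minimal sketch of the missing numbers is given below: the helper `churn_class_metrics` is a hypothetical name, and the toy labels are made up so the arithmetic can be followed by hand.

```python
import numpy as np

def churn_class_metrics(y_true, y_pred) -> dict:
    """Accuracy plus precision and recall for the positive (churn) class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    true_pos = int(((y_pred == 1) & (y_true == 1)).sum())
    false_pos = int(((y_pred == 1) & (y_true == 0)).sum())
    false_neg = int(((y_pred == 0) & (y_true == 1)).sum())
    return {
        "accuracy": float((y_pred == y_true).mean()),
        "precision": true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0,
        "recall": true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0,
    }

# Toy example: 4 churners, of which 2 are caught, plus 1 false alarm
toy_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
toy_metrics = churn_class_metrics(toy_true, toy_pred)
print(toy_metrics)
```

In the notebook, the same helper could be called as `churn_class_metrics(y_test.numpy(), predicted.numpy())`, assuming both tensors live on the CPU, which would make the network directly comparable with the random forest on churn recall.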
