# **BUSI 488 / COMP 488 Data Science in the Business World**
## *Spring 2023* 
Daniel M. Ringel  
Kenan-Flagler Business School  
*The University of North Carolina at Chapel Hill*  
dmr@unc.edu

## Customer Churn - Who to keep and who to let go?

*March 21, 2023*  
Version 2.1


# **Call for Nominations: Recognize a Professor for their Teaching** 

> **Nominate here:** https://tinyurl.com/weatherspoon2023 

![Weatherspoon](https://mapxp.app/BUSI488/Weatherspoon2023.png)



 * The mediocre teacher tells. The good teacher explains. The superior teacher demonstrates. The great teacher inspires.  *William A. Ward*

 * I cannot teach anybody anything; I can only make them think. *Socrates*

 * Tell me and I forget. Teach me and I remember. Involve me and I learn. *Benjamin Franklin*

 > **Nominate here:** https://tinyurl.com/weatherspoon2023

# ***Don't forget to put in YOUR nominations before Monday, March 27th, 2023!***

# Today's Agenda
> Discuss this notebook in your team (Team for TA3) for 40 minutes

1. **What is Customer Churn**
2. **Identify Customers that are at Risk of Churning**
3. **Load and Clean Data**
4. **EDA with Visualization**
5. **Feature Engineering**
6. **Data Preprocessing Pipeline**
7. **Churn Prediction Model**
8. **Making Things Better: Did we overlook something?**
9. **Finalize Model**
10. **Decide Who to Fight for and Who to let Go**
11. **How well did we do?**
12. **What Next?**

> Discuss team findings in class

## Prep-Check:
- Reviewed notebook of class 17
- Ran this entire notebook before class and reflected on the insights it creates

# 1. What is Customer Churn?


![Lab vs Real-World](https://atrium.ai/wp-content/uploads/2021/07/What-stops-customer-churn-Having-a-centralized-data-hub-does-and-heres-why.jpeg)


#### ***Customer churn***
- customer attrition
- customer turnover
- customer defection

***is the loss of clients or customers***

Firms that have subscription or membership business models usually monitor customer churn closely:

- Banks 
- Telephone service companies
- Internet service providers
- Pay TV companies
- Insurance firms
- Gyms
- etc.   

----------------

#### ***Customer churn rates*** are often a key business metric (along with cash flow, EBITDA (earnings before interest, taxes, depreciation, and amortization), etc.)
* Cost of retaining an existing customer is far less than acquiring a new one.
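A churn rate is commonly computed as the number of customers lost during a period divided by the number of customers at the start of that period. The sketch below illustrates this; the function name `churn_rate` and the numbers are hypothetical and are not taken from the bank data.

```python
def churn_rate(customers_at_start, customers_lost):
    """Share of the starting customer base that left during the period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start

# Hypothetical example: 2,000 customers at the start of the year, 300 of them left
rate = churn_rate(2000, 300)
print(f"Annual churn rate: {rate:.1%}")
```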

-----------------

#### Dedicated departments attempt to ***prevent churn*** and ***win back churned customers***
- long-term customers can be worth more than newly acquired customers 

#### ***BUT: Competitors*** may make special offers to entice customers away 
- Customers leave in hope of better service or value for money
- ***Switching cost*** can create hurdles

-----------------
#### Important business activity: ***Customer Retention***
- Can be costly -  *why?*
- To focus retention efforts, must understand ***which customers are at risk of churning***.  

  

*Source: definition adapted from Wikipedia.com*

# 2. **Today's Business Challenge:** How to Identify customers that are at risk of churning?
- We will use a dataset that is based on real bank data, but was slightly modified for the purpose of this case study to 
    - preserve real customers' privacy  
    - preserve the bank's privacy  
    - allow for richer analysis  

### **The Bank's Problem:** Decide on retention measures for the right customers.

- What question(s) are we trying to answer?
- How can we answer it/them?

# 3. Load and Clean Bank Data

The bank provides us with two data sets: 
1. Data on bank customers that previously churned / did not churn (Training Set)
2. Data on bank customers where the bank needs to decide on retention measures (New Customer Set)

Both data sets contain the following variables: 

* ***ClientID:***  unique identifier of the bank customer
* ***Surname:*** surname of customer
* ***Firstname:*** first name of customer
* ***FICOScore:*** the average credit score of the customer in the past year
* ***Subsidiary:*** the bank subsidiary that manages the customer relationship
* ***Gender:*** Female or Male
* ***Age:*** age of customer
* ***Balance:*** total balance across all accounts (if applicable) such as checking, savings and credit
* ***Products:*** number of banking products the customer uses
* ***BankCC:*** whether the customer has a credit card from the bank
* ***Active:*** indicates an active customer with regular transactions in the past 3 months
* ***RegDeposits:*** average monthly deposits that are made to the account across the past year (e.g., salary or pension)
* ***LifeInsur:*** whether the customer has a special life insurance policy from the bank
* ***PlatStatus:*** whether the customer has Platinum status at the bank (receives several perks and better service)

The training data set additionally includes the following variable:

* ***Terminated:*** whether the customer closed their accounts with the bank within the 6 months following the download of the data from the bank's database

The new customer data set, on the other hand, additionally contains the following variable:

* ***BnkRev:*** approximation of how much revenue the bank makes with each customer in a year

## 3.1 Import libraries
Note: putting these at the top helps tell the users what libraries need to be installed. 

In [None]:
# 1. Load what we will need for data wrangling, visualization, and modeling
from google.colab import drive
import numpy as np
import pandas as pd
import pickle
pd.options.display.max_rows = None
pd.options.display.max_columns = None

# For visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Support functions for much later modeling
from sklearn.preprocessing import minmax_scale
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

# Classification Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Scoring Functions
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [None]:
# 2. Add my Google Drive
drive.mount('/content/gdrive')

# 3. Go to folder on Google Drive that contains files
%cd /content/gdrive/MyDrive/MBA742/Class06

# 4. Special shell command to view the files in the home directory of the notebook environment
!ls 

## 3.2 Load Training Data

Let's load the data and describe it to get a first feel for it!

In [None]:
# 1. Read data file (training) into a pandas dataframe
df = pd.read_json("Bank_Churn_Train.json") # read in pandas Dataframe

# 2. Number of rows (i.e., customer records) and columns (i.e., features)
print(f"\n Number of Rows and Columns: {df.shape} \n")

# 3. Take a look at the first 10 rows of the data
df.head(10)

## 3.3 Examine Data

Let's take a first deeper look at our data set

In [None]:
# 1. Get a first impression
df.describe()

#### Do you notice anything?

In [None]:
# 2. What about data types? Do they make sense?
df.info()

In [None]:
# 3. Let's get unique counts for each variable
df.nunique()

#### Does anything strike you as odd?

In [None]:
# 3. Let's take a look at the values that the suspicious columns contain, for example, "Active"
df['Active'].unique()

# 3b. Others?

In [None]:
# 4. Check columns for missing values
df.isnull().sum()

#### When there are no missing values, we might still have to impute some values. **Why?**
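One possible reason, sketched on made-up toy data: a column can be free of `NaN` and still contain values that are invalid under domain rules. The dataframe `toy` and the thresholds below are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data: no NaN anywhere, yet two entries are clearly invalid
toy = pd.DataFrame({"Age": [34, 51, 999], "FICOScore": [710, 0, 640]})

# isnull() finds nothing because every cell holds a number
print(toy.isnull().sum().sum())

# Domain rules (assumed here): age at most 100, FICO score at least 300
invalid = (toy["Age"] > 100) | (toy["FICOScore"] < 300)
print(toy[invalid])
```

Rows flagged this way are candidates for imputation or removal even though `isnull()` reports nothing.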

## 3.4 Data Validity, Anomalies, and Missing Data

We will handle numerical and categorical variables separately.

### 3.4.1 Categorical Variables

In [None]:
# 1. Let's clean-up the categorical columns: Take a look first!

# 1a. Define which columns are obviously categorical
cat_cols = ['Gender','Subsidiary'] 

# 1b. Define which columns must be categorical because they have an indicator value (0,1)
zero_one_cols = ['BankCC','Active','LifeInsur', 'PlatStatus','Terminated'] 

# 1c. Cycle through both types of categorical columns and print values and their frequencies
for col in cat_cols+zero_one_cols:  
  print(col)
  print(df[col].value_counts( ))
  print()

In [None]:
# NOTE, that dtype in the output above does not refer to the variable, but to the counts that are output!
df.PlatStatus.dtype # check data type of PlatStatus

#### **Let's clean up validity violations in the categorical data by:**
- recoding them
- enforcing boundaries

***Let's create a dataframe cleaning function that can be used in a pipeline.***  
- This function will directly modify the dataframe (not a copy)  
  - Keeping this straight is important, because Pandas is very clever about not copying datasets that may be huge.
  - Pandas' *slice* concept: A dataframe may just be pointers to where the data is in a larger dataframe. 
  - You've probably seen the warning:
      
        SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. 
        Try using ```.loc[row_indexer,col_indexer] =``` value instead

     This warns of potential confusion: are you asking to modify the original, large dataframe? Or to copy and modify just the data in the slice, while preserving the original?
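The two intentions can be made explicit, as in the following minimal sketch on a made-up dataframe (`people` and `over_35` are hypothetical names): either modify the original in a single `.loc` step, or take an explicit `.copy()` of the slice and modify only that.

```python
import pandas as pd

people = pd.DataFrame({"Gender": ["F", "Male", "F"], "Age": [30, 40, 50]})

# Case 1: modify the ORIGINAL dataframe in one step with .loc
people.loc[people["Gender"] == "F", "Gender"] = "Female"

# Case 2: work on an independent COPY of a slice; the original stays untouched
over_35 = people[people["Age"] > 35].copy()
over_35["Age"] = 0

print(people)
print(over_35)
```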

In [None]:
# 3. Define a function that handles the cleaning of categorical variables
def clean_BankChurn_categories(df):
  ''' 
  This function is to clean categories in the Bank Churn data.
  Directly modifies the data frame df (no copy is made or returned).
  '''
  # 3a. Identify the two types of categorical data 
  cat_cols = ['Gender','Subsidiary']
  zero_one_cols = ['BankCC','Active','LifeInsur', 'PlatStatus','Terminated']

  # 3b. Clean-up Gender 
  df.loc[df.Gender.str.startswith('F'), 'Gender'] = 'Female'  
  df.loc[df.Gender.str.startswith('M'), 'Gender'] = 'Male'
  
  # 3c. Fix yes/no in zero_one_cols (PlatStatus needs this so others might, too.)
  for col in zero_one_cols:
    if (df[col].dtype == 'object'):  
      df.loc[df[col] == 'yes', col] = '1' # recode "yes" to 1, if present 
      df.loc[df[col] == 'no', col] = '0' # recode "no" to 0, if present
      df[col] = df[col].astype(int) # make it really be 0-1 (replaces the column so the dtype actually changes)

  # 3d. Enforce boundaries for zero/one columns:
  for col in zero_one_cols:
    df[col] = df[col].clip(0, 1) # "clip" assigns values outside boundary to boundary values.

  # 3e. Typecast all categorical and zero/one columns to categorical
  for col in cat_cols+zero_one_cols:
    df[col] = df[col].astype('category')

In [None]:
# 4. Let's clean the categorical variables by calling our function!
clean_BankChurn_categories(df)

# 5. Check the types
df.dtypes

### 3.4.2 Numerical Variables

Let's use some domain knowledge to identify customers where some of the data just don't make sense (i.e., are not valid)!  

Ask yourself:
- what is a valid FICO Score?
- a valid Age?
- a valid Account Balance?
- plausible Regular Deposits?

In [None]:
# Let's filter out all cases that are suspicious
df.loc[(df.FICOScore < 300) | (df.Age > 100) | (df.Balance < -5000) | (df.Balance >5e5) | (df.Products > 10) | (df.RegDeposits < 0) | (df.RegDeposits > 1e5) ]

#### **Let's clean up the numerical data by:**
- imputing invalid (or implausible) values 
- dropping cases where implausibilities are not easily resolved through imputation
- removing outliers



In [None]:
# 1. Define Outlier Detection Function (we will call this function within our numerical cleaning function that we define next)

''' This function can be used on any dataset to return a list of index values for the outliers (based on standard deviation)
Only appropriate for numerical features''' 

def get_outliers(data, columns):
    # we create an empty list
    outlier_idxs = []
    # Number of standard deviations we keep. 
    nsd = 3 
    for col in columns:
        elements = data[col]
        # we get the mean value for each column
        mean = elements.mean()
        # and the standard deviation of the column
        sd = elements.std()
        # we then get the index values of all values higher or lower than the mean +/- nsd standard deviations
        outliers_mask = data[(data[col] > mean + nsd*sd) | (data[col]  < mean  - nsd*sd)].index
        # and add those index values to our list
        outlier_idxs  += [x for x in outliers_mask]
    return list(set(outlier_idxs))
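To see the mean +/- 3 standard deviations rule at work, here is a small standalone sketch that mirrors the logic of `get_outliers` using only the standard library. The helper `three_sigma_outliers` and the toy balances are hypothetical.

```python
from statistics import mean, stdev

def three_sigma_outliers(values, nsd=3):
    """Return positions of values more than nsd sample standard deviations from the mean."""
    m = mean(values)
    sd = stdev(values)  # sample standard deviation, the same default as pandas .std()
    return [i for i, v in enumerate(values) if abs(v - m) > nsd * sd]

# Hypothetical toy data: 20 ordinary balances and one extreme balance
balances = [100.0] * 20 + [10_000.0]
outlier_positions = three_sigma_outliers(balances)
print(outlier_positions)
```

Note that an extreme value inflates the standard deviation it is judged against, so in very small samples this rule may flag nothing at all.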

In [None]:
# 2. Define a function that cleans up the anomalies in the numerical columns
''' 
This function is to clean numeric fields of the Bank Churn data.
Directly modifies the data frame df using .loc and drop inplace.
'''
def clean_BankChurn_numeric(df):
  
  # 2a. Impute invalid data with medians
  df.loc[df.Age > 100,'Age'] = df.Age.median()
  df.loc[df.Products > 10, 'Products'] = df.Products.median()
  
  # 2b. Mark rows with values outside of valid ranges by setting these values to None 
  df.loc[df.FICOScore<=0, 'FICOScore'] = None
  df.loc[(df.Balance < -5000) | (df.Balance > 5e5), 'Balance'] = None
  df.loc[(df.RegDeposits < 0) | (df.RegDeposits > 1e5), 'RegDeposits'] = None
   
  # 2c. Drop rows that contain missing values (and those set to None)
  df.dropna(inplace=True)

  # 2d. Remove outliers for CERTAIN numerical variables only (using function defined above in 1.)
  #numeric_features = df.select_dtypes(include=['int64', 'float64']).columns # this line would select all numerical values... why might that be a bad idea here?
  numeric_features=['Balance','RegDeposits']
  outs = get_outliers(df, numeric_features) # get indices of rows that contain outlier values
  df.drop(outs, axis = 0,inplace=True)

In [None]:
# 3. Let's use our function to clean-up the numeric columns
clean_BankChurn_numeric(df)

# 4. Check our work:
df.describe()

## 3.5 Drop Unnecessary Columns
Not all columns of a data set need (or should!) be included for the purpose of training a machine learning model. 

In [None]:
# 1. Take a look at the data
df.head()

In [None]:
# 2. We won't need all of these variables - let's drop the ones that we think will not help our model
df = df.drop(["ClientID", "Surname", "Firstname"], axis = 1) 

# 3. Check which ones are left
df.tail()

**From the above, a couple of questions linger:**

1. The data appears to be a snapshot at some point in time. E.g. the balance is for a given date, which leaves a lot of questions:
 - What date is it, and of what relevance is this date?
 - Would it be possible to obtain balances over a period of time as opposed to a single date?
2. There are customers marked terminated that still have a balance in their account! What would this mean? Could they have terminated a product and not the bank?
3. What does being an active member mean? Are there different degrees of activity? Might it be better to provide transaction counts, both for credits and debits?
4. A breakdown of the products bought into by a customer could provide more information than the product count.


# 4. Some EDA with Visualization

Exploratory Data Analysis is useful to get a better feeling for the data and get a first glimpse at factors that might drive a certain phenomenon such as churn.

## 4.1 Let's see how many customers actually churned

In [None]:
# 1. Create a pie chart
labels = 'Churned', 'Retained'
sizes = [df.Terminated[df['Terminated']==1].count(), df.Terminated[df['Terminated']==0].count()]
explode = (0, 0.1)
fig1, ax1 = plt.subplots(figsize=(10, 8))
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90,textprops={'fontsize': 18})
ax1.axis('equal')
plt.title("Proportion of customers who exited (churn)", size = 20)
plt.show()

1. So, about 24% of the customers have churned. So the baseline model could be to predict that 24% of the customers will churn.  

2. Given that 24% is a small number, we want to ensure that our chosen model can predict with greater accuracy than 24%

- **Why?**

3. Note, the bank wants to identify and retain customers that would churn (all of them?). The bank may be willing to accept some inaccuracy in predicting the customers that are not going to churn.

- **How would this impact your model?**
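One way to approach these questions is to look at what a naive baseline achieves. The sketch below uses made-up labels with a 24% churn share (matching the rough proportion above) and a baseline that predicts "retained" for everyone; all names and numbers are illustrative assumptions.

```python
# Hypothetical labels: 24 churners (1) and 76 retained customers (0)
y_true = [1] * 24 + [0] * 76

# Naive baseline: predict "retained" for every customer
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_pos = sum(y_true)

accuracy = correct / len(y_true)   # share of all predictions that are right
recall = true_pos / actual_pos     # share of actual churners that were found

print(f"Accuracy: {accuracy:.2f}, Recall on churners: {recall:.2f}")
```

The baseline looks accurate overall but finds none of the churners, which suggests that accuracy alone may not be the metric the bank cares about.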


## 4.2 Let's see whether some categorical variables are related to churn...

In [None]:
# 1. Create Barcharts for key variables that are split by our target variable(Terminated)
fig, axarr = plt.subplots(3, 2, figsize=(20, 18))
sns.set(font_scale = 1.25)
sns.countplot(x='Subsidiary', hue = 'Terminated',data = df, ax=axarr[0][0])
sns.countplot(x='Gender', hue = 'Terminated',data = df, ax=axarr[0][1])
sns.countplot(x='BankCC', hue = 'Terminated',data = df, ax=axarr[1][0])
sns.countplot(x='Active', hue = 'Terminated',data = df, ax=axarr[1][1])
sns.countplot(x='LifeInsur', hue = 'Terminated',data = df, ax=axarr[2][0])
sns.countplot(x='PlatStatus', hue = 'Terminated',data = df, ax=axarr[2][1])
fig.tight_layout(pad=2.0)

**We note the following:**

- The majority of the data is from Boston customers. However, the proportion of churned customers is inversely related to the population of customers, suggesting that the bank may have a problem (maybe not enough customer service resources allocated) in the areas where it has fewer clients.
- The proportion of female customers churning is greater than that of male customers
- Interestingly, the majority of the customers that churned are those with the bank's credit card. This may be a coincidence, since the majority of the customers have the bank's credit card.
- Unsurprisingly, the inactive members churn more. It is a concern that the overall proportion of inactive members is high. Perhaps the bank can implement a program to turn this group into active customers? This would definitely decrease customer churn.
- Customers that have a life insurance plan with the bank tend to stick around. Perhaps this is an opportunity to create a customer retention measure?
- Most customers do not have platinum status; those who do, don't churn.

## 4.3 We can also explore probability distributions of termination by variables such as age 

In [None]:
# 1. Visualize the distributions of numerical variables by status (i.e., our target variable)
sns.set(font_scale = 1.5)
for col in ['Age','FICOScore','Balance','Products','RegDeposits']:
  facet = sns.FacetGrid(df, hue="Terminated",aspect=4)
  facet.map(sns.kdeplot, col, fill=True) # "fill" replaces the deprecated "shade" argument
  facet.set(xlim=(df[col].min(), df[col].max()))
  facet.add_legend()
  plt.show()

**Alternatively, we can use box-plots!**

In [None]:
# 2. Boxplots for numerical variables
sns.set(font_scale = 1.25)
fig, axarr = plt.subplots(2,3, figsize=(20, 12))
sns.boxplot(y='Age',x = 'Terminated', hue = 'Terminated',data = df, ax=axarr[0][0])
sns.boxplot(y='FICOScore',x = 'Terminated', hue = 'Terminated',data = df , ax=axarr[0][1])
sns.boxplot(y='Balance',x = 'Terminated', hue = 'Terminated',data = df, ax=axarr[0][2])
sns.boxplot(y='Products',x = 'Terminated', hue = 'Terminated',data = df, ax=axarr[1][0])
sns.boxplot(y='RegDeposits',x = 'Terminated', hue = 'Terminated',data = df, ax=axarr[1][1])
fig.tight_layout(pad=2.0)

**We note the following:**

- The older customers are churning more than the younger ones, suggesting differences in service preference across the age categories. The bank may need to review their target market or review the strategies for retention for different age groups
- There is no significant difference in the credit score distribution between retained and churned customers.
- The bank is losing customers with higher bank balances which is likely to hit their available capital for lending.
- Neither the number of products nor the regular deposits has a significant effect on the likelihood to churn.

# 5. Feature Engineering

Can we use the data to create some new features that may be predictive of customers leaving the bank?
In finance we often use ratios as meaningful KPIs (key performance indicators). Let's generate the Balance/Deposit ratio for the bank's customers!

In [None]:
# 1. Add a new variable for Balance to Deposit Ratio to our dataframe ('BalanceDepositRatio')
df['BalanceDepositRatio'] = df.Balance/(df.RegDeposits+0.01)

# 2. Visually inspect its relation to our target variable
plt.figure(figsize=(10, 8), dpi=80)
sns.boxplot(y='BalanceDepositRatio',x = 'Terminated', hue = 'Terminated',data = df)
plt.ylim(-100, 100)
sns.set(font_scale = 1.5)

In [None]:
print(f"Median Balance-Deposit-Ratio Overall: {df.BalanceDepositRatio.median()}")
print(f"Median Balance-Deposit-Ratio Retained: {df.BalanceDepositRatio[df['Terminated']==0].median()}")
print(f"Median Balance-Deposit-Ratio Churned: {df.BalanceDepositRatio[df['Terminated']==1].median()}")

- We saw that regular deposits (e.g., salary) have little effect on the chance of a customer churning. 
- However, the bank balance and the regular deposits indicate that customers with a higher balance deposit ratio churn more.
- Our finding could be worrying to the bank because this impacts their source of loan capital

In [None]:
# 3. Check that our data frame includes our new variable
df.head()
df.describe()

# 6. Data Preprocessing Pipeline
Let's put it all together in a pipeline that we can use to:
1. Test different models on the same data
2. Apply to future bank data (e.g., validation data set)

Let's re-load the data to start with a clean slate. 
  - This is also always the starting point for our analysis so that we can be sure that everything runs on the original data.

In [None]:
# 1. Create a function that does all pre-processing steps for us
""" Pre-processing Pipeline directly modifies dataframe df using .loc but then returns a new data frame with dummy variables.
# drop_dummy defaults to True to drop one of each one hot encoded variables and avoid multicollinearity."""

def PrePipe(df, drop_dummy=True):
    # 1a. Clean Categorical and Numeric data with our two Functions:
    clean_BankChurn_categories(df)
    clean_BankChurn_numeric(df)
    
    # 1b Feature Engineering
    df['BalanceDepositRatio'] = df.Balance/(df.RegDeposits+0.0001)
        
    # 1c Reorder the columns, dropping unused
    continuous_vars = ['FICOScore','Age','Balance','Products','RegDeposits', 'BalanceDepositRatio',]
    cat_cols = ['Gender','Subsidiary']
    zero_one_cols = ['BankCC','Active','LifeInsur', 'PlatStatus']

    # 1d Min-max scale the data between 0 and 1
    df.loc[:,continuous_vars] = minmax_scale(df[continuous_vars])
  
    # 1e One-Hot Encode Categorical Variables
    return pd.get_dummies(df[['Terminated'] + continuous_vars + zero_one_cols + cat_cols], columns = cat_cols, drop_first=drop_dummy)

In [None]:
# 2. Load data and Pre-process
df = pd.read_json("Bank_Churn_Train.json")
df = PrePipe(df)
df.head()
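The effect of the `drop_dummy` option can be seen on a tiny made-up column. With both dummy columns present, one is fully determined by the other (they always sum to 1), which is the multicollinearity the pipeline avoids by dropping one of them. The dataframe `toy` is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"Gender": ["Female", "Male", "Female"]})

# Keep every category as its own indicator column
full = pd.get_dummies(toy, columns=["Gender"], drop_first=False)

# Drop the first category; it is implied when the remaining indicator is 0
reduced = pd.get_dummies(toy, columns=["Gender"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```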

# 7. Churn Prediction Model

We are now ready to predict customer churn. At the top, we imported the following from scikit-learn:
- fit models
- support functions
- scoring functions


## 7.1 Classifier Performance
We want to systematically evaluate the performance of our classifier. Here we define functions that output our performance metrics.

In [None]:
# 1. Function to output best model score, parameters and estimator
def best_model(model):
    print(model.best_score_)    
    print(model.best_params_)
    print(model.best_estimator_)

# 2. Function to output accuracy score, visualize the confusion matrix, and print the classification report
def show_results(y_test, y_pred):
  # 2a. Output the accuracy of our prediction
  print(accuracy_score(y_test, y_pred))
  # 2b. Visualize the confusion matrix to make it easier to read
  con_matrix = confusion_matrix(y_test, y_pred)
  confusion_matrix_df = pd.DataFrame(con_matrix, ('Retained', 'Churned'), ('Retained', 'Churned'))
  heatmap = sns.heatmap(confusion_matrix_df, annot=True, annot_kws={"size": 20}, fmt="d", cmap="Blues")
  heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize = 14)
  heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45, ha='right', fontsize = 14)
  plt.ylabel('Actual', fontsize = 14)
  plt.xlabel('Predicted', fontsize = 14)
  # 2c. Print the classification report
  print(classification_report(y_test, y_pred))

## 7.2 Let's use Logistic Regression to predict customer churn

### 7.2.1 Load and Pre-process Data

In [None]:
# 1. Start by loading and pre-processing our data
df = pd.read_json("Bank_Churn_Train.json")
df = PrePipe(df, drop_dummy=True)

# 2. Separate our target and input variables. 
y = df.Terminated
X = df.drop(columns=['Terminated'])

# 3. split sample into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

# 4. Check if data have right shape
print("Train: Response Variable: ",y_train.shape)
print("Train: Feature Variables: ",X_train.shape)
print("Test: Response Variable: ",y_test.shape)
print("Test: Feature Variables: ",X_test.shape)

### 7.2.2 Train a Logistic Regression Model

In [None]:
# 1. Instantiate the classifier: logreg
logreg = LogisticRegression(solver='lbfgs', C=1, tol=.0001, max_iter=1000)

# 2. Fit the classifier to the training data
logreg.fit(X_train, y_train)

# 3. Predict the labels of the test set: y_pred
y_pred = logreg.predict(X_test)

# 4. Call function to evaluate model performance and show results
show_results(y_test, y_pred)

In [None]:
# 5. We can examine the intercept as follows:
print (f'Logistic Regression Intercept: {logreg.intercept_[0]}')

# 6. We can examine the coefficients as follows:
print (f'\nLogistic Regression Coefficients: {logreg.coef_[0]}')

## **Which "Square" of the Confusion Matrix do we care for most?**

**Do you remember what we learned in Class 05 ?**

------------
<p style="text-align: left; font-size:120%; font-weight: normal; font-style: normal;">
$\text{Accuracy} = \frac{t_p + t_n}{t_p + t_n + f_p + f_n}$ <br><br> 
$\text{Precision} = \frac{t_p}{t_p + f_p}$    <br><br>    
$\text{Recall} = \frac{t_p}{t_p + f_n}$   <br><br>    
$F_1 \text{ score} = 2 \times \frac{\textit{precision}\, \times \,\textit{recall}}{\textit{precision}\, + \,\textit{recall}}$ 
</p>

------------


#### **Precision** measures the ability of the classifier not to mislabel a negative sample as positive
#### **Recall** measures the ability of the classifier to find all the positive samples.
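As a worked example of the four formulas, the sketch below computes them from the counts of a confusion matrix. The helper `classification_metrics` and the counts are hypothetical, and the function assumes that none of the denominators is zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts: 100 customers, 24 actual churners,
# 30 customers flagged as churners, 18 of them correctly
accuracy, precision, recall, f1 = classification_metrics(tp=18, tn=64, fp=12, fn=6)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```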

### 7.2.3 Can we improve our results?

Let's tune our hyperparameters using a Grid-Search with Cross-Fold validation

In [None]:
# 1. Define parameter space to test
param_grid = {'C': [.1,1,10], 'tol':[.001, .0001, .00001]}

# 2. Instantiate model: Note that we can define which score (i.e., performance metric such as precision or recall) we want to tune the hyperparameters towards. WHY would we do so?
log_Grid = GridSearchCV(LogisticRegression(solver='lbfgs', max_iter=1000),param_grid, cv=5, refit=True, verbose=0, scoring = 'recall')

# 3. Fit model to data
log_Grid.fit(X_train, y_train);

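# 4. Show best cross-validated score (here: recall) and best parameters (i.e., tuned)
best_model(log_Grid) # best_model prints its results itself; wrapping it in print() would also print "None"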

In [None]:
# 5. Model is already trained with the optimal parameters identified and set
y_pred = log_Grid.predict(X_test)

# 6. Call function to evaluate model performance and show results
show_results(y_test, y_pred)

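# The code below reproduces these results if the grid search selected C=10 and tol=0.001:
# with refit=True, the grid search has already refit the best model on the full training set.

#LR=LogisticRegression(C=10, solver='lbfgs', tol=0.001, max_iter=1000)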
#LR.fit(X_train, y_train);
#y_pred = LR.predict(X_test)
#show_results(y_test, y_pred)


## 7.3 Can a Random Forest Model do better?

### 7.3.1 Load and Pre-process Data

In [None]:
# 1. Start by loading and pre-processing our data
df = pd.read_json("Bank_Churn_Train.json")
df = PrePipe(df, drop_dummy=True)

# 2. Separate our target and input variables. 
y = df.Terminated
X = df.drop(columns=['Terminated'])

# 3. split sample into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

# 4. Check if data have right shape
print("Train: Response Variable: ",y_train.shape)
print("Train: Feature Variables: ",X_train.shape)
print("Test: Response Variable: ",y_test.shape)
print("Test: Feature Variables: ",X_test.shape)

### 7.3.2 Train a Random Forest Model

In [None]:
# 1. Instantiate a Random Forest Classifier (RandomForestClassifier was previously imported from sklearn)
forest = RandomForestClassifier(max_features=6, n_estimators=25, max_depth=9,random_state=21)

# 2. Train the model using the training sets
forest.fit(X_train, y_train)  

# 3. Predict the response for test dataset
y_pred = forest.predict(X_test)

# 4. Call function to evaluate and show model performance
show_results(y_test, y_pred)

### 7.3.3 Let's tune our hyperparameters using a Grid-Search with Cross-Fold validation

In [None]:
# 1. Define grid (i.e., hyperparameter combinations to test for)
param_grid = {'n_estimators': [10, 25, 50], 'max_depth' : [6, 9, 12], 'max_features' : [3, 6, 9]}

# 2. Instantiate the model (do not include parameters from the parameter grid in the classifier that you use; here, RandomForestClassifier())
forest_Grid = GridSearchCV(RandomForestClassifier(random_state=21), param_grid, cv=5, refit=True, verbose=0, scoring = 'recall')

# 3. Fit the model (i.e., train it on training data)
forest_Grid.fit(X_train, y_train);

# 4. Output optimal Hyperparameter combination
best_model(forest_Grid) # best_model prints its results itself; wrapping it in print() would also print "None"

In [None]:
# 5. Model is already trained with the optimal parameters identified and set: Use it to make prediction
y_pred = forest_Grid.predict(X_test)

# 6. Call function to evaluate model performance and show results
show_results(y_test, y_pred)

### 7.3.4 Let's take a look at which features are most important for our prediction

In [None]:
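# 1. Extract features and their importances
#    (taken from the untuned forest of 7.3.2; for the tuned model use forest_Grid.best_estimator_.feature_importances_)
feat_importances = pd.Series(forest.feature_importances_, index=X_train.columns)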

# 2. Sort importances_rf
sorted_importances_rf = feat_importances.sort_values()

# 3. Make a horizontal bar plot
plt.figure(figsize=(12,8))
sorted_importances_rf.plot(kind='barh', color='skyblue'); 
plt.show()

# 8. Making things better: Did we overlook something?

In [None]:
# 1. Start by loading our data
df = pd.read_json("Bank_Churn_Train.json")

In [None]:
# 2. Let's take a close look at the variables again
df.head(10)

**Which variables did we previously drop?**
- ClientID
- Surname
- Firstname

## 8.1 Extract Information from a Variable

In [None]:
# 1. Extract CityCode from ClientID
df['CityCode'] = df.ClientID.str[:5]

# 2. Extract Year from ClientID
df['Year'] = df.ClientID.str[5:9].astype('int')

# 3. Keep only those after 2005 
df = df[df.Year > 2005]

# 4. Take a look
df.head()

In [None]:
# 5. It looks as if CityCode replicates Subsidiary
df[['Subsidiary','CityCode']].value_counts()

**We won't need all of these - let's drop the ones that we think will not help our model**

In [None]:
# 6. Drop replicate columns and those that carry no meaning for our model
df = df.drop(["CityCode"], axis = 1) # we used it just for exploration
df = df.drop(["ClientID", "Surname", "Firstname"], axis = 1) 

# ... And check which ones are left
df.head()

## 8.2 Feature Engineering from "Year"





In [None]:
# 1. Rather than looking at Year, let's look at the number of years the customer has been with the bank.
df['Tenure'] = 2022 - df.Year
df.head()

In [None]:
# 2. We may suspect that older people may have longer Tenure, so let's look at that ratio. 
df['TenureByAge'] = df.Tenure/(df.Age)

# 2a. Let's look at the distributions
plt.figure(figsize=(8, 6), dpi=80)
sns.boxplot(y='TenureByAge',x = 'Terminated', hue = 'Terminated',data = df)
sns.set(font_scale = 1.5)

In [None]:
# 3. Lastly we introduce a variable to capture credit score given age to take into account credit behavior vis-à-vis adult life
df['FICOScoreGivenAge'] = df.Age/df.FICOScore*10

# 3a. Let's look at the distributions
plt.figure(figsize=(8, 6), dpi=80)
sns.boxplot(y='FICOScoreGivenAge',x = 'Terminated', hue = 'Terminated',data = df)
sns.set(font_scale = 1.5)

## 8.3 Update Pre-Processing Pipeline
- Get Year from ClientID
- Feature Engineer Tenure variables
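
Before wiring these steps into the pipeline, it can help to trace them for a single customer. The sketch below does so in plain Python. The ID string, age, and score are made up for illustration; only the slice positions, the reference year 2022, and the formulas come from the cells in 8.1 and 8.2.

```python
# Hypothetical customer: only characters 5..8 of the ID (the year) are used here
demo_client_id = "NC001" + "2015" + "000042"   # made-up ID, layout assumed for illustration
demo_age = 40
demo_fico = 700

demo_year = int(demo_client_id[5:9])            # same slice as df.ClientID.str[5:9]
demo_tenure = 2022 - demo_year                  # years with the bank
demo_tenure_by_age = demo_tenure / demo_age     # tenure relative to age
demo_fico_given_age = demo_age / demo_fico * 10 # same formula as FICOScoreGivenAge

print(demo_year, demo_tenure, demo_tenure_by_age, round(demo_fico_given_age, 4))
```

If the numbers for a hand-picked customer look implausible (for example a negative tenure), that is usually a sign that the ID slice does not match the actual ID layout.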

In [None]:
# 4. Update Pre-Processing Pipeline

''' Updated function that does all of the pre-processing for us.
It directly modifies the dataframe df using .loc and then returns a new data frame with dummy variables.
The optional argument drop_dummy (default: True) drops one dummy per one-hot encoded variable to avoid multicollinearity.'''
def PrePipe(df, drop_dummy=True):  
    # 4a. Clean variables
    clean_BankChurn_categories(df)
    clean_BankChurn_numeric(df)
    
    # 4b. Engineer Features
    df['BalanceDepositRatio'] = df.Balance/(df.RegDeposits+0.0001)

    # 4c. Engineer NEW Tenure Features
    df['Year'] = df.ClientID.str[5:9].astype('int')
    df['Tenure'] = 2022 - df.Year
    df['TenureByAge'] = df.Tenure/(df.Age)
    df['FICOScoreGivenAge'] = (df.Age)/df.FICOScore*10
     
    # 4d. Reorder the columns, dropping unused. NEW: now includes our new tenure-based features
    continuous_vars = ['FICOScore','Age','Balance','Products','RegDeposits', 'Tenure','BalanceDepositRatio','TenureByAge','FICOScoreGivenAge']
    cat_cols = ['Gender','Subsidiary']
    zero_one_cols = ['BankCC','Active','LifeInsur', 'PlatStatus']

    # 4e. Tree-based models do not require variables to be on the same scale; let's skip this step by commenting it out
    # df.loc[:,continuous_vars] = minmax_scale(df[continuous_vars])
  
    # One-Hot Encode Categorical Variables
    return pd.get_dummies(df[['Terminated'] + continuous_vars + zero_one_cols + cat_cols], columns = cat_cols, drop_first=drop_dummy)

## 8.4 Train Model again and Evaluate
- Does Tenure matter?

### 8.4.1 Load and Clean Data

In [None]:
# 1. Start by loading and pre-processing our data
df = pd.read_json("Bank_Churn_Train.json")
df = PrePipe(df, drop_dummy=True)

# 2. Separate our target and input variables. 
y = df.Terminated
X = df.drop(columns=['Terminated'])

# 3. split sample into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

# 4. Check if data have right shape
print("Train: Response Variable: ",y_train.shape)
print("Train: Feature Variables: ",X_train.shape)
print("Test: Response Variable: ",y_test.shape)
print("Test: Feature Variables: ",X_test.shape)

### 8.4.2 Random Forest Classifier
- Hyperparameter tuning
- Cross-Fold Validation
- Model Performance

In [None]:
# 1. Define grid (i.e., hyperparameter combinations to test for)
param_grid = {'n_estimators': [25, 50], 'max_depth' : [10, 20], 'max_features' : [5, 7]}

# 2. Instantiate the model (do not include parameters from the parameter grid in the classifier that you use; here, RandomForestClassifier())
NewForest_Grid = GridSearchCV(RandomForestClassifier(random_state=21), param_grid, cv=5, refit=True, n_jobs=-1, verbose=0, scoring = 'recall')

# 3. Fit the model (i.e., train it on training data)
NewForest_Grid.fit(X_train, y_train);

# 4. Output optimal Hyperparameter combination
print(best_model(NewForest_Grid))

***Use parameters to train and fit model***

In [None]:
# 5. Instantiate a Random Forest Classifier (RandomForestClassifier was previously imported from sklearn)
NewForest = RandomForestClassifier(max_features=7, n_estimators=50, max_depth=20, random_state=21)

# 6. Train the model using the training sets
NewForest.fit(X_train, y_train)  

# 7. Predict the response for test dataset
y_pred = NewForest.predict(X_test)

# 8. Call function to evaluate and show model performance
show_results(y_test, y_pred)

***What about Feature Importance? Did anything change?***

In [None]:
# 9. Extract features and their importances
feat_importances = pd.Series(NewForest.feature_importances_, index=X_train.columns)

# 10. Sort importances_rf
sorted_importances_rf = feat_importances.sort_values()

# 11. Make a horizontal bar plot
plt.figure(figsize=(12,8))
sorted_importances_rf.plot(kind='barh', color='skyblue'); 
plt.show()

# 9. Finalize our Model
- We find that
  - Random Forest beats Logistic Regression in our empirical setting. 
  - Hyperparameter tuning improves performance slightly
  - Feature Engineering "Tenure" dramatically improves model performance
- Let's train a ***Final Random Forest Model*** on all variables:
    - Use entire training data
    - Use Cross-Fold validation
    - Hyperparameter tuning: Randomized Search CV for faster tuning
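
To make "Cross-Fold validation" concrete, here is a minimal sketch of how rows can be split into k folds so that every row is used for validation exactly once. The helper `k_fold_indices` is hypothetical and written only for illustration. Note that for a classifier with `cv=5`, scikit-learn uses stratified folds, which additionally keep the share of churners similar across folds; this sketch does not do that.

```python
def k_fold_indices(n_rows, k=5):
    """Yield (train_idx, valid_idx) pairs; each row lands in exactly one validation fold."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        valid_idx = list(range(start, stop))
        train_idx = [i for i in range(n_rows) if i < start or i >= stop]
        yield train_idx, valid_idx
        start = stop

demo_folds = list(k_fold_indices(n_rows=12, k=5))
for train_idx, valid_idx in demo_folds:
    print(f"train on {len(train_idx)} rows, validate on rows {valid_idx}")
```

Each hyperparameter combination is trained and scored once per fold, and the fold scores are averaged.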

In [None]:
# 1. Start by loading and pre-processing our data
df = pd.read_json("Bank_Churn_Train.json")
df = PrePipe(df, drop_dummy=True)

# 2. Separate our target and input variables. 
y = df.Terminated
X = df.drop(columns=['Terminated'])

**An exhaustive search of the parameter grid can take a long time.** Particularly:
- when there are many parameters to tune
- when these parameters can assume many different values
- on a slow computer (or limited virtual environment, like free CoLab)

In [None]:
# 3. Define grid (i.e., hyperparameter combinations to test for)
#param_grid = {'n_estimators': [25, 50], 'max_depth' : [10, 20], 'max_features' : [5, 7]}

# 4. Instantiate the model (do not include parameters from the parameter grid in the classifier that you use; here, RandomForestClassifier())
#FinalForest_Grid = GridSearchCV(RandomForestClassifier(random_state=21), param_grid, cv=5, refit=True, n_jobs=-1, verbose=0, scoring = 'recall')

# 5. Fit the model (i.e., train it on training data)
#FinalForest_Grid.fit(X, y);

# 6. Output optimal Hyperparameter combination
#print(best_model(FinalForest_Grid))

# 7. Instantiate a Random Forest Classifier (RandomForestClassifier was previously imported from sklearn)
#FinalForest = RandomForestClassifier(max_features=6, n_estimators=100, max_depth=12,random_state=21)

# 8. Train the model using the training sets
#FinalForest.fit(X, y) 

***To speed things up***: Use RandomizedSearchCV. 

In contrast to `GridSearchCV`, not all parameter values are tried, but rather a fixed number of parameter settings is sampled from specified distributions. The number of parameter settings that are tried is given by `n_iter`.
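
The difference in effort can be sketched with the standard library alone. The snippet enumerates every setting an exhaustive grid search would try and then draws `n_iter` of them at random. It illustrates the idea only; scikit-learn's own sampling may differ in detail, and the names `demo_grid` and `demo_n_iter` are placeholders.

```python
import random
from itertools import product

demo_grid = {'n_estimators': [25, 50, 100], 'max_depth': [10, 15, 20], 'max_features': [3, 5, 7]}
demo_n_iter = 15
demo_cv = 5

# Every combination an exhaustive grid search would evaluate
keys = list(demo_grid)
all_settings = [dict(zip(keys, values)) for values in product(*demo_grid.values())]

# A randomized search evaluates only a random subset of them
rng = random.Random(21)
sampled_settings = rng.sample(all_settings, demo_n_iter)

print(f"Grid search:       {len(all_settings)} settings x {demo_cv} folds = {len(all_settings) * demo_cv} fits")
print(f"Randomized search: {len(sampled_settings)} settings x {demo_cv} folds = {len(sampled_settings) * demo_cv} fits")
```

The price of the speed-up is that the best combination may be among the settings that were never sampled.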

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# 1. Define the search space and the randomized search
param_grid = {'n_estimators': [25, 50, 100], 'max_depth' : [10, 15, 20], 'max_features' : [3, 5, 7]}
search = RandomizedSearchCV(RandomForestClassifier(random_state=21), param_grid, n_iter=15, scoring='recall', n_jobs=-1, cv=5, random_state=21)

# 5. Fit the model (i.e., train it on training data)
search.fit(X, y);

# 6. Output optimal Hyperparameter combination
print(best_model(search))

In [None]:
# 7. Instantiate a Random Forest Classifier (RandomForestClassifier was previously imported from sklearn): Set the parameters that we found with RandomizedSearchCV
FinalForest = RandomForestClassifier(max_depth=20, max_features=7, n_estimators=50, random_state=21)

# 8. Train the model using the training sets
FinalForest.fit(X, y)  

In [None]:
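# 9. Predict with our final model
# Note: FinalForest was trained on ALL of X, which includes the rows in X_test (from the split in 8.4.1).
# The results below are therefore in-sample and will look better than performance on truly unseen data.
y_pred = FinalForest.predict(X_test)

# 10. Call function to evaluate and show model performance
show_results(y_test, y_pred)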

In [None]:
# 11. Save our Model in a file
import pickle

filename = 'finalized_model.sav'
with open(filename, 'wb') as f:  # the context manager closes the file for us
    pickle.dump(FinalForest, f)

# 10. Use our Model to Decide on Who to Fight for and Who to Let Go
- Use our model to predict churn
- Determine who to let go and who to keep
  - Need a metric for customer value!

In [None]:
# 1. Load our trained Model from file
filename = 'finalized_model.sav'
with open(filename, 'rb') as f:  # the context manager closes the file for us
    FinalForest = pickle.load(f)

In [None]:
# 1. Load New Customer Data and take a first look
NewCustomers = pd.read_json("Bank_Churn_NewCustomers.json")
NewCustomers.head()

In [None]:
NewCustomers.shape

**How are these data different to our training data?**

## 10.1 Prepare New Customer Data for Prediction
  - Cannot drop customers - Why?
  - No response variable - Why?

### 10.1.1 Update Preprocessing
- Need to update preprocessing
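
Since every new customer needs a prediction, invalid values have to be repaired rather than removed. The sketch below shows the two repairs in plain Python on made-up numbers: replacing an implausible value with the median, and clamping a value to the edge of its valid range. The thresholds (age above 100, balance between -5000 and 5e5) are taken from the cleaning function in this notebook; the helper `clamp` and the data are hypothetical.

```python
from statistics import median

def clamp(value, low, high):
    """Return value limited to the interval [low, high]."""
    return max(low, min(value, high))

demo_ages = [34, 51, 229, 45, 28]                 # 229 is an obviously invalid entry
age_median = median(demo_ages)                    # median over all rows, like df.Age.median()
ages_clean = [age_median if a > 100 else a for a in demo_ages]

demo_balances = [-12000, 2500.0, 750000]
balances_clean = [clamp(b, -5000, 5e5) for b in demo_balances]

print(ages_clean)
print(balances_clean)
```

Both repairs keep the number of rows unchanged, which is the property we need here.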

In [None]:
# 1. Update Function that handles the Cleaning of Categorical Variables: There is no variable "Terminated" in New Customer Data
''' 
This function is to clean categories in the New Customer Bank Churn Data.
Directly modifies the data frame df using .loc 
'''
def clean_BankChurn_categories(df):
  
  # 1a. Identify the two types of categorical data 
  cat_cols = ['Gender','Subsidiary']
  zero_one_cols = ['BankCC','Active','LifeInsur', 'PlatStatus']

  # 1b. Clean-up Gender
  df.loc[df.Gender.str.startswith('F'), 'Gender'] = 'Female'
  df.loc[df.Gender.str.startswith('M'), 'Gender'] = 'Male'
  
  # 1c. Fix PlatStatus
  if (df.PlatStatus.dtype == 'object') and (df.PlatStatus == 'yes').any():  # recode "yes" to 1, if present
    df.loc[df.PlatStatus == 'yes', 'PlatStatus'] = '1'
  if (df.PlatStatus.dtype == 'object') and (df.PlatStatus == 'no').any():   # recode "no" to 0, if present
    df.loc[df.PlatStatus == 'no', 'PlatStatus'] = '0'
  df['PlatStatus'] = df['PlatStatus'].astype(int)

  # 1d. Enforce boundaries for zero/one columns:
  for col in zero_one_cols:
    # "clip" assigns values outside boundary to boundary values.
    # Assign the result back instead of relying on inplace=True on a slice, which may not update df.
    df[col] = df[col].clip(0, 1)

  # 1e. Typecast all categorical and zero/one columns to categorical
  for col in cat_cols+zero_one_cols:
    df[col] = df[col].astype('category')

In [None]:
# 2. Update Function that Cleans the Numerical Variables: Cannot drop customers! Need to predict them all.
''' 
This function is to clean numeric fields of the New Customer Bank Churn data.
Directly modifies the data frame df using .loc. No rows are dropped, because every customer needs a prediction.
'''
def clean_BankChurn_numeric(df):
  
  # 2a. Impute invalid data with medians
  df.loc[df.Age > 100,'Age'] = df.Age.median()
  df.loc[df.Products > 10, 'Products'] = df.Products.median()
  
  # 2b. Set values outside of valid ranges to valid values (at limits)
  df.loc[df.FICOScore<=299, 'FICOScore'] = 300
  df.loc[df.Balance < -5000, 'Balance'] = -5000
  df.loc[df.Balance > 5e5, 'Balance'] = 5e5
  df.loc[df.RegDeposits < 0, 'RegDeposits'] = 0
  df.loc[df.RegDeposits > 1e5, 'RegDeposits'] = 1e5 

In [None]:
# 3. Update Preprocessing Function
# It directly modifies the dataframe df using .loc and then returns a new data frame with dummy variables.
# The optional argument drop_dummy (default: True) drops one dummy per one-hot encoded variable to avoid multicollinearity.
def PrePipe(df, drop_dummy=True):
    clean_BankChurn_categories(df)
    clean_BankChurn_numeric(df)
    
    # Engineer Features
    df['BalanceDepositRatio'] = df.Balance/(df.RegDeposits+0.0001)

    # Engineer NEW Tenure Features
    df['Year'] = df.ClientID.str[5:9].astype('int')
    df['Tenure'] = 2022 - df.Year
    df['TenureByAge'] = df.Tenure/(df.Age)
    df['FICOScoreGivenAge'] = (df.Age)/df.FICOScore*10
     
    # Reorder the columns, dropping unused. NEW: now includes our new tenure-based features
    continuous_vars = ['FICOScore','Age','Balance','Products','RegDeposits', 'Tenure','BalanceDepositRatio','TenureByAge','FICOScoreGivenAge']
    cat_cols = ['Gender','Subsidiary']
    zero_one_cols = ['BankCC','Active','LifeInsur', 'PlatStatus']

    # min-max scale the data between 0 and 1: won't use because we don't need to with a tree model
    #df.loc[:,continuous_vars] = minmax_scale(df[continuous_vars])
  
    # One-Hot Encode Categorical Variables
    return pd.get_dummies(df[continuous_vars + zero_one_cols + cat_cols], columns = cat_cols, drop_first=drop_dummy)

### 10.1.2 Remove Variables that Model was not Trained on
-  Our model has not seen Bank Revenue in its training. Must remove it!
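
The underlying rule is that the model expects exactly the columns it was trained on, in the same order. The sketch below illustrates this with a hypothetical helper `align_to_training` and a shortened, made-up column list. Filling a missing column with 0 is an assumption that is only sensible for dummy variables.

```python
def align_to_training(row, training_columns):
    """Return the row's values in training order.
    Columns the model never saw are ignored; missing dummy columns are filled with 0."""
    return [row.get(col, 0) for col in training_columns]

# Shortened, hypothetical list of training columns
demo_training_columns = ['FICOScore', 'Age', 'Gender_Male', 'Subsidiary_Boston']

# New customer record: contains BnkRev (never seen in training) and lacks Subsidiary_Boston
demo_row = {'Age': 22, 'BnkRev': 310.0, 'FICOScore': 788, 'Gender_Male': 1}

print(align_to_training(demo_row, demo_training_columns))
```

With a pandas DataFrame, a comparable effect can be had with `reindex(columns=..., fill_value=0)`.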

In [None]:
# 1. Create new Dataframe that includes Variables that our Model was trained on. VERY IMPORTANT! Why?
# 1a Drop Bank Revenue
df = NewCustomers.drop(columns=['BnkRev'])

### 10.1.3 Preprocess Data with New Pipeline


In [None]:
# 2. Preprocess data with our Preprocessing Pipeline
X = PrePipe(df, drop_dummy=True)

## 10.2 Use our Trained Model to predict which Customers will Churn

In [None]:
# 1. Predict with our Final Random Forest Model
y_pred = FinalForest.predict(X)

# 2. Add prediction to New Customer Data
NewCustomers['AtRisk']=y_pred

# 3. Check if it worked
NewCustomers.head()

## 10.3 Quantify Financial Risk to Bank
- How much Revenue is at Stake?
  - Depends on Customer Revenue
  - Depends on Probability that customer churns
- Informs how much to spend on customer retention measures!

### 10.3.1 Use Model to estimate Churn Probabilities

In [None]:
# 1. Predict churn probabilities and add directly to New Customer Data
NewCustomers['ChurnProb']=FinalForest.predict_proba(X)[:, 1]

# 2. Check if it worked
NewCustomers.sort_values(by=['ChurnProb'], ascending=False, inplace=True)
NewCustomers.head(10)

In [None]:
NewCustomers.tail(10)

In [None]:
# 3. Predict Churn Probability of an individual customer

# 3a. Describe the customer along the variables that our model was trained on
FICOScore=788
Age=22
Balance=12000
Products=1
RegDeposits=6000
Tenure=2
BalanceDepositRatio=Balance/(RegDeposits+0.0001)
TenureByAge=Tenure/Age
FICOScoreGivenAge=Age/FICOScore*10
BankCC=1
Active=1
LifeInsur=0
PlatStatus=0
Gender_Male=1
Subsidiary_Boston=0
Subsidiary_Chapel_Hill=1

# 3b. Construct a DataFrame for our individual customer that we can pass to our model
x_new = pd.DataFrame([[FICOScore,Age,Balance,Products,RegDeposits,Tenure,BalanceDepositRatio,TenureByAge,FICOScoreGivenAge,BankCC,Active,LifeInsur,PlatStatus,Gender_Male,Subsidiary_Boston,Subsidiary_Chapel_Hill]],
                      columns=['FICOScore','Age','Balance','Products','RegDeposits','Tenure','BalanceDepositRatio','TenureByAge','FICOScoreGivenAge','BankCC','Active','LifeInsur','PlatStatus','Gender_Male','Subsidiary_Boston','Subsidiary_Chapel Hill'])

# 3c. Use our trained model to predict the probability that the customer will churn
Cprop=FinalForest.predict_proba(x_new)[:, 1]
print(f"Predicted Probability to Churn: {Cprop}")



### 10.3.2 Describe Churn Risk


In [None]:
# 1. Describe Churn Probabilities
NewCustomers.ChurnProb.describe()

In [None]:
# 2. Visualize the distribution of Churn Probabilities 
sns.set(font_scale = 1.5)
sns.displot(NewCustomers, x="ChurnProb", height=6, aspect=1.5)

### 10.3.3 Investigate Bank Revenue at Risk
One possibility: For each customer calculate `ChurnProb` $\times$ `BnkRev`
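
A small worked example with three made-up customers shows why this product can rank customers differently than churn probability alone. All values are hypothetical.

```python
demo_customers = [
    {'id': 'A', 'ChurnProb': 0.50, 'BnkRev': 1200.0},
    {'id': 'B', 'ChurnProb': 0.25, 'BnkRev': 4000.0},
    {'id': 'C', 'ChurnProb': 0.75, 'BnkRev': 200.0},
]

# Expected revenue at risk per customer: probability of churn times revenue
for c in demo_customers:
    c['RevAtRisk'] = c['ChurnProb'] * c['BnkRev']

total_at_risk = sum(c['RevAtRisk'] for c in demo_customers)
ranked = sorted(demo_customers, key=lambda c: c['RevAtRisk'], reverse=True)

for c in ranked:
    print(c['id'], c['RevAtRisk'])
print("Total:", total_at_risk)
```

Customer C is the most likely to churn but puts the least revenue at risk, while customer B is the least likely to churn but tops the list.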

In [None]:
# 1. Calculate new Variable RevAtRisk
NewCustomers['RevAtRisk']=NewCustomers['ChurnProb'] * NewCustomers['BnkRev']

# 2. Describe Revenue at Risk
NewCustomers.RevAtRisk.describe()

In [None]:
# 3. What is the total revenue at risk?
NewCustomers.RevAtRisk.sum()

In [None]:
# 4. Visualize the distribution of Revenue at Risk
sns.set(font_scale = 1.5)
sns.displot(NewCustomers, x="RevAtRisk", bins=6, height=6, aspect=1.5)

In [None]:
# 5. Look up the top 10 customers with highest revenue at risk
NewCustomers.sort_values(by=['RevAtRisk'], ascending=False, inplace=True)
NewCustomers.head(10)

## 10.4 Where to from here?

1. **What managerial decisions might our Model inform?**
- 
- 
- 

2. **Are there other Variables to consider?**
- 
- 
- 

3. **What are possible Limitations?**
- 
- 
- 

# 11. How well did we do?
Six months later, we know which customers churned (assuming that the bank did not implement any retention measures). 

***Let's go back and see how well our model predicted the churn of the new customers.***
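
As a reminder of what such an evaluation counts, the sketch below tallies the four possible outcomes for a handful of made-up predictions and derives recall (the metric we tuned for) and precision. The lists `actual` and `predicted` are hypothetical and unrelated to the bank data.

```python
actual    = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = customer churned
predicted = [1, 0, 0, 1, 0, 1, 0, 0]   # 1 = model flagged customer as at risk

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)   # churners we caught
fn = sum(1 for a, p in pairs if a == 1 and p == 0)   # churners we missed
fp = sum(1 for a, p in pairs if a == 0 and p == 1)   # false alarms
tn = sum(1 for a, p in pairs if a == 0 and p == 0)   # correctly left alone

recall = tp / (tp + fn)        # share of actual churners that were flagged
precision = tp / (tp + fp)     # share of flagged customers that actually churned

print(tp, fn, fp, tn, recall, round(precision, 3))
```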

In [None]:
# 1. Load outcome data and take a look
outcomes = pd.read_json("Bank_Churn_NewCustomers_Outcome.json")
outcomes.head()

In [None]:
# 2. Get outcome variable
y = outcomes.Terminated

In [None]:
# 3. Evaluate our Model's predictions from 6 months ago
show_results(y, y_pred)

## 11.1 Where our Model failed to predict Churn


In [None]:
# 1. Sort our Data by its index to ensure that rows for prediction and truth are aligned
NewCustomers.sort_index(inplace=True)

# 2. Add Outcome to our NewCustomers dataframe
NewCustomers['Terminated']=y

# 3. Add variable for PredFailChurn (predicted not at risk, but ultimately churned)
NewCustomers['PredFailChurn']=(NewCustomers['Terminated']==1) & (NewCustomers['AtRisk']==0)

# 4. See where our Model fails:
NewCustomers[NewCustomers['PredFailChurn']==True].head(25)

## 11.2 Impact of Model Failure

Up to how much revenue might the bank lose because our model failed to identify customers that are at risk of churning?
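
The calculation amounts to summing revenue over the false negatives, that is, customers who churned although the model did not flag them. A minimal sketch on made-up records:

```python
demo_records = [
    {'BnkRev': 900.0,  'AtRisk': 0, 'Terminated': 1},   # missed churner
    {'BnkRev': 400.0,  'AtRisk': 1, 'Terminated': 1},   # churner we flagged
    {'BnkRev': 1500.0, 'AtRisk': 0, 'Terminated': 0},   # stayed, not flagged
    {'BnkRev': 250.0,  'AtRisk': 0, 'Terminated': 1},   # missed churner
]

# False negatives: churned (Terminated == 1) but predicted as not at risk (AtRisk == 0)
missed = [r for r in demo_records if r['Terminated'] == 1 and r['AtRisk'] == 0]
lost_revenue = sum(r['BnkRev'] for r in missed)

print(len(missed), lost_revenue)
```

This is an upper bound in the sense that retention measures would not have saved every one of these customers even if they had been flagged.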

In [None]:
# 1. Sum BnkRev for Customers where we failed to predict that they will churn
print(f"Lost revenue of churned customers that we failed to identify: $ {NewCustomers[NewCustomers['PredFailChurn']==True]['BnkRev'].sum()}")

# 12. What Next?
1. How to fix Model failure?  

2. Predictions for new Customers?  

3. Update Predictions?  


# **Looking Ahead:**  

#### **Next Class:** Thursday, March 23, 2023

#### ***Algorithmic Bias*** 

#### **Read before class:** Read Lambrecht, A. and Tucker, C., 2019. [Algorithmic bias? An empirical study of apparent gender-based discrimination in the display of STEM career ads.](https://pubsonline.informs.org/doi/abs/10.1287/mnsc.2018.3093) Management Science, 65(7), pp.2966-2981.






This notebook was inspired by the following:  
https://github.com/soanems/bank-customer-churn-python/blob/master/Bank%20Customer%20Churn_2.ipynb  
https://www.kaggle.com/kmalit/bank-customer-churn-prediction  
https://academy.vertabelo.com/blog/python-customer-churn-prediction/  
http://dataskunkworks.com/2018/06/05/predicting-customer-churn-with-python-logistic-regression-decision-trees-and-random-forests/  
https://www.neuraldesigner.com/learning/examples/bank-churn  
https://www.kaggle.com/nasirislamsujan/bank-customer-churn-prediction?scriptVersionId=5729160  