# Failed Banks ML Project

## Econ 1680: MLTA and Econ

#### Name: Lena Kim

Research Question: Given a failed bank’s balance sheet metrics, how can we predict
whether it is acquired by a national bank, a regional bank, or not acquired at all? Which
metrics contribute the most weight to this decision?

IMPORTANT: This notebook uses multinomial logit regression to answer the classification problem posed above. Shared setup code is imported from the preliminary notebook, ML_BASES.ipynb, with the %run magic command.


In [65]:
#econ1680MLProject
%run ML_BASES.ipynb


In [105]:
import warnings

warnings.filterwarnings("ignore")


## Now that all of the bases are run, let's focus on the MN Logit Regression
### We will implement year fixed effects to account for time series differences

In [66]:
#first get the year of failure from 'Failure Date' column
banks['Year'] = banks['Failure Date'].dt.year
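
As a small illustration of what year fixed effects look like once they are encoded, the sketch below builds year dummies from a toy set of failure dates. The dates are invented for the example. Only the 'Failure Date' column name and the `get_dummies` call mirror the notebook.

```python
import pandas as pd

# Toy failure dates (invented for illustration, not taken from the bank data).
toy = pd.DataFrame({
    "Failure Date": pd.to_datetime(
        ["2008-07-11", "2009-02-13", "2009-08-21", "2010-04-16"]
    )
})

# Same step as in the notebook: pull the calendar year out of the date.
toy["Year"] = toy["Failure Date"].dt.year

# drop_first=True omits the earliest year, which becomes the baseline
# absorbed by the intercept; each remaining column is a 0/1 year indicator.
toy_dummies = pd.get_dummies(toy["Year"], prefix="Year", drop_first=True).astype(int)

print(toy_dummies)
```

The 2008 row is all zeros because 2008 is the omitted baseline year. In the notebook the dummy columns carry a `.0` suffix (for example `Year_2009.0`), which suggests the Year column there is stored as a float, possibly because some failure dates are missing.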


### Create Fixed Effects with Preprocessing:

In [67]:
import statsmodels.api as sm
#create dummy variables for year fixed effects
year_dummies = pd.get_dummies(banks['Year'], prefix='Year', drop_first=True)

#concatenate year dummies with other independent variables:
X = pd.concat([banks[['Cash and Investments', 'Due from FDIC Corp and Receivables', 'Assets in Liquidation', 'Total Assets', 
                    'Administrative Liabilities', 'Total Unpaid Other Claimants', 'Uninsured Deposit Claims',
                    'General Creditor', 'Total Liabilities']], year_dummies], axis=1)

#define target variable:
y = banks['Acquisition Type']

#splitting into training and testing sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state= 1680)

#cast boolean values of year dummies to ints (in both the training and test sets):
year_columns = ['Year_2009.0', 'Year_2010.0', 'Year_2011.0', 'Year_2012.0', 
                'Year_2013.0', 'Year_2014.0', 'Year_2015.0', 'Year_2017.0', 
                'Year_2019.0', 'Year_2020.0']

X_train[year_columns] = X_train[year_columns].astype(int)
X_test[year_columns] = X_test[year_columns].astype(int)

#now impute and scale every column except the binary year columns:
columns_to_impute_scale = [col for col in X_train.columns if col not in year_columns]

#imputer with mean:
imputer = SimpleImputer(strategy='mean')
X_train[columns_to_impute_scale] = imputer.fit_transform(X_train[columns_to_impute_scale])
X_test[columns_to_impute_scale] = imputer.transform(X_test[columns_to_impute_scale])

#scale features:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train[columns_to_impute_scale])
X_test_scaled = scaler.transform(X_test[columns_to_impute_scale])

#convert to df:
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=columns_to_impute_scale, index=X_train.index)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=columns_to_impute_scale, index=X_test.index)

#concatenate scaled columns with year binaries:
X_train_final = pd.concat([X_train_scaled_df, X_train[year_columns]], axis=1)
X_test_final = pd.concat([X_test_scaled_df, X_test[year_columns]], axis=1)
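
The preprocessing above follows a fit-on-train, apply-to-test pattern. A minimal sketch of that pattern on toy data is given here. The values and the `toy_`-prefixed names are hypothetical, and only one balance-sheet column and one year dummy are used to keep it short.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): one continuous column with a gap, one year dummy.
toy_train = pd.DataFrame({"Total Assets": [100.0, np.nan, 300.0, 200.0],
                          "Year_2009.0": [0, 1, 0, 1]})
toy_test = pd.DataFrame({"Total Assets": [np.nan, 400.0],
                         "Year_2009.0": [1, 0]})

num_cols = ["Total Assets"]

# Fit on the training rows only, then reuse the fitted objects on the test
# rows, so no test-set information leaks into the imputed mean or the scaling.
toy_imputer = SimpleImputer(strategy="mean").fit(toy_train[num_cols])
toy_scaler = StandardScaler().fit(toy_imputer.transform(toy_train[num_cols]))

def prepare(df):
    """Impute and scale the numeric columns, pass the year dummy through."""
    scaled = toy_scaler.transform(toy_imputer.transform(df[num_cols]))
    out = pd.DataFrame(scaled, columns=num_cols, index=df.index)
    # Year dummies are left unscaled so they stay 0/1.
    out["Year_2009.0"] = df["Year_2009.0"].astype(int)
    return out

toy_train_final = prepare(toy_train)
toy_test_final = prepare(toy_test)
print(toy_test_final)
```

The missing test value is filled with the training mean (200), so it scales to exactly zero, and both frames end up with the same columns in the same order.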

In [87]:
#fit multinomial logit regression model with year fixed effects:
# y_train.reset_index(drop=True, inplace=True) #make sure the indices are aligned for mn_logit
# X_train.reset_index(drop=True, inplace=True)
model = sm.MNLogit(y_train, sm.add_constant(X_train_final)) #define model based on training set
results = model.fit(method='cg') #conjugate gradient scales well to many predictors, which matters once the year dummies are added
#note: the run below stops after 35 iterations (the default maxiter) with converged: False, so pass a larger maxiter before interpreting coefficients

#print summary of results
print(results.summary())


         Current function value: 0.866690
         Iterations: 35
         Function evaluations: 96
         Gradient evaluations: 96
                          MNLogit Regression Results                          
Dep. Variable:       Acquisition Type   No. Observations:                  103
Model:                        MNLogit   Df Residuals:                       63
Method:                           MLE   Df Model:                           38
Date:                Fri, 29 Mar 2024   Pseudo R-squ.:                  0.1679
Time:                        13:44:22   Log-Likelihood:                -89.269
converged:                      False   LL-Null:                       -107.28
Covariance Type:            nonrobust   LLR p-value:                    0.5613
                Acquisition Type=1       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------------------
const                       
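
The header statistics in the summary above can be reproduced by hand, which helps when reading them. The sketch below assumes three outcome classes (hence two equations relative to the base class) and 19 slopes per equation (9 balance-sheet metrics plus 10 year dummies). Those counts are inferred from the notebook, not printed by statsmodels, and the log-likelihoods are copied from the output.

```python
from scipy.stats import chi2

# Values copied from the summary printed above.
n_obs = 103
ll_model = -89.269
ll_null = -107.28

# Assumed structure: 3 outcome classes (2 equations vs. the base class)
# and 19 slopes per equation (9 balance-sheet metrics + 10 year dummies).
n_equations = 2
n_slopes = 9 + 10

df_model = n_equations * n_slopes                  # slopes only
df_resid = n_obs - n_equations * (n_slopes + 1)    # also counts the intercepts

# McFadden's pseudo R-squared: 1 - LL_model / LL_null.
pseudo_r2 = 1 - ll_model / ll_null

# Likelihood-ratio statistic against the intercept-only model,
# compared with a chi-squared distribution on df_model degrees of freedom.
llr = 2 * (ll_model - ll_null)
llr_p = chi2.sf(llr, df_model)

print(df_model, df_resid, round(pseudo_r2, 4), round(llr, 3), round(llr_p, 3))
```

Under these assumptions the degrees of freedom and pseudo R-squared match the reported 38, 63, and 0.1679. An LLR p-value near 0.56 means the full set of predictors does not improve significantly on the intercept-only model at conventional levels, so individual coefficients should be read cautiously, particularly because the optimizer did not converge.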

In [69]:
X_train.head()

Unnamed: 0,Cash and Investments,Due from FDIC Corp and Receivables,Assets in Liquidation,Total Assets,Administrative Liabilities,Total Unpaid Other Claimants,Uninsured Deposit Claims,General Creditor,Total Liabilities,Year_2009.0,Year_2010.0,Year_2011.0,Year_2012.0,Year_2013.0,Year_2014.0,Year_2015.0,Year_2017.0,Year_2019.0,Year_2020.0
59,2348.0,0.0,0.0,2348.0,0.0,925.0,0.0,925.0,249946.0,0,0,0,0,0,0,0,0,0,0
55,25944.0,0.0,0.0,25944.0,3.0,23018.0,0.0,7875.0,663850.0,0,0,0,0,0,0,0,0,0,0
54,2238.0,0.0,0.0,2238.0,0.0,178.0,0.0,178.0,14380.0,0,0,0,1,0,0,0,0,0,0
16,1195.0,0.0,0.0,1195.0,4.0,583.0,209.0,583.0,72624.0,0,1,0,0,0,0,0,0,0,0
126,7902.44,-0.87,297.86,8092.69,964.78,155959.3,1148.44,36061.02,480190.46,0,0,0,0,0,0,0,0,0,0


## Multinomial LogReg with Imputation and Regularization (No time series)

In [107]:
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold, GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.experimental import enable_iterative_imputer  # must run before importing IterativeImputer
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.compose import ColumnTransformer
from sklearn.metrics import fbeta_score, confusion_matrix, accuracy_score

def ML_Multinomial_LogReg_L1_kfold(X, y, random_state, n_folds):
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=random_state)
    
    # create the pipeline: preprocessor + supervised ML method
    std_ftrs = ['Cash and Investments', 'Due from FDIC Corp and Receivables', 'Assets in Liquidation', 'Total Assets', 
                'Administrative Liabilities', 'Total Unpaid Other Claimants', 'Uninsured Deposit Claims',
                'General Creditor', 'Total Liabilities']
    
    numeric_transformer = Pipeline(steps=[
        ('imputer', IterativeImputer(estimator=LinearRegression(), random_state=random_state, max_iter=1000)),
        ('scaler', StandardScaler())
    ])
    
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, std_ftrs)
        ],
        remainder='passthrough'
    )

    pipe = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', LogisticRegression(max_iter=10000, multi_class='multinomial'))
    ])
    
    # the parameter(s) we want to tune
    param_grid = {
        'classifier__solver': ['saga'],
        'classifier__penalty': ['l1'],
        'classifier__C': np.logspace(-3, 3, 7)
    }
    
    # prepare gridsearch
    grid = GridSearchCV(pipe, 
                        param_grid=param_grid,
                        scoring='accuracy',
                        cv=kf, 
                        return_train_score=True,
                        n_jobs=-1, 
                        verbose=10)
    
    # do kfold CV
    grid_result = grid.fit(X, y)
    
    print()
    means = grid_result.cv_results_['mean_test_score']
    stds = grid_result.cv_results_['std_test_score']
    
    print(f'Best params: {grid.best_params_}')
    print(f"mean CV: {means} +/ {stds}")
    
    return grid

In [108]:
final_models_list = []
test_scores = []

for i in range(5):
    print(f'Random State # {i}')
    print()
    
    # the seed passed to the CV splitter and imputer is 42*i; 4 folds per search
    fin_grid = ML_Multinomial_LogReg_L1_kfold(X, y, 42*i, 4)
    
    final_models_list.append(fin_grid)

Random State # 0

Fitting 4 folds for each of 7 candidates, totalling 28 fits

Best params: {'classifier__C': 0.001, 'classifier__penalty': 'l1', 'classifier__solver': 'saga'}
mean CV: [0.47324811 0.47324811 0.46567235 0.46590909 0.41122159 0.38044508
 0.38044508] +/ [0.05834968 0.05834968 0.06572193 0.09667443 0.12476371 0.09135282
 0.09135282]
Random State # 1

Fitting 4 folds for each of 7 candidates, totalling 28 fits

Best params: {'classifier__C': 0.01, 'classifier__penalty': 'l1', 'classifier__solver': 'saga'}
mean CV: [0.34043561 0.47324811 0.46543561 0.41074811 0.41074811 0.40293561
 0.41051136] +/ [0.14113582 0.12494686 0.11373398 0.06674751 0.02343032 0.03592561
 0.04225703]
Random State # 2

Fitting 4 folds for each of 7 candidates, totalling 28 fits

Best params: {'classifier__C': 0.001, 'classifier__penalty': 'l1', 'classifier__solver': 'saga'}
mean CV: [0.47277462 0.47277462 0.46496212 0.45691288 0.33357008 0.33380682
 0.33404356] +/ [0.05887751 0.05887751 0.04741    0.0
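
The printed arrays are easier to read when each score is paired with its C value. The sketch below does that for the "Random State # 0" output, with the numbers copied from above. The names `C_grid` and `cv_table` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# The grid searched in ML_Multinomial_LogReg_L1_kfold.
C_grid = np.logspace(-3, 3, 7)

# Mean and std of the 4-fold CV accuracy, copied from the
# "Random State # 0" output above.
means = np.array([0.47324811, 0.47324811, 0.46567235, 0.46590909,
                  0.41122159, 0.38044508, 0.38044508])
stds = np.array([0.05834968, 0.05834968, 0.06572193, 0.09667443,
                 0.12476371, 0.09135282, 0.09135282])

cv_table = pd.DataFrame({"C": C_grid, "mean_acc": means, "std_acc": stds})
print(cv_table.to_string(index=False))

# np.argmax returns the first maximum, so the tie between C=0.001 and
# C=0.01 resolves to the smaller C, in line with the reported best params.
best_row = cv_table.iloc[int(np.argmax(means))]
print("best C:", best_row["C"])
```

Inside the notebook the same table can be built directly from the fitted search with `pd.DataFrame(grid.cv_results_)[['param_classifier__C', 'mean_test_score', 'std_test_score']]`. Note that the gap between the best and worst mean accuracy here (about 0.09) is of the same order as the fold-to-fold standard deviations, so the ranking of C values is not sharply determined.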



In [109]:
# Splitting data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing
std_ftrs = ['Cash and Investments', 'Due from FDIC Corp and Receivables', 'Assets in Liquidation', 'Total Assets', 
            'Administrative Liabilities', 'Total Unpaid Other Claimants', 'Uninsured Deposit Claims',
            'General Creditor', 'Total Liabilities']

numeric_transformer = Pipeline(steps=[
    ('imputer', IterativeImputer(estimator=LinearRegression(), random_state=42, max_iter=1000)),
    ('scaler', StandardScaler())
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, std_ftrs)
    ],
    remainder='passthrough'
)

# Creating the pipeline, with C fixed at 0.001 (the value the grid search selected for random states 0 and 2)
pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=10000, multi_class='multinomial', solver='saga', penalty='l1', C=0.001))
])

# Fitting the model
pipe.fit(X_train, y_train)

# Predicting on the test set
y_pred = pipe.predict(X_test)

# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.38461538461538464
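
An accuracy of 0.38 at C=0.001 is worth comparing against a majority-class baseline, because an L1 penalty this strong can shrink every coefficient to zero and leave only the intercepts. The sketch below shows that effect on synthetic three-class data. The data, the class weights, and all names are illustrative assumptions, and it does not verify what happened in the notebook's own fit.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the bank features (purely illustrative).
X_syn, y_syn = make_classification(n_samples=130, n_features=9, n_informative=4,
                                   n_classes=3, weights=[0.5, 0.3, 0.2],
                                   random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42)

# Standardize using training statistics only.
syn_scaler = StandardScaler().fit(Xs_train)
Xs_train = syn_scaler.transform(Xs_train)
Xs_test = syn_scaler.transform(Xs_test)

# Same penalty settings as the final pipeline above.
l1_model = LogisticRegression(penalty="l1", solver="saga", C=0.001, max_iter=10000)
l1_model.fit(Xs_train, ys_train)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(Xs_train, ys_train)

n_nonzero = int(np.count_nonzero(l1_model.coef_))
print("non-zero coefficients:", n_nonzero)
print("L1 model accuracy:", l1_model.score(Xs_test, ys_test))
print("baseline accuracy:", baseline.score(Xs_test, ys_test))
```

In the notebook, the analogous checks would be `np.count_nonzero(pipe.named_steps['classifier'].coef_)` and a `DummyClassifier` fitted on the same `X_train`, `y_train`. If the coefficient count is zero, the model assigns every bank to a single class and the balance-sheet metrics carry no weight in the prediction, which would bear directly on the second half of the research question.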
