## **Predicting Bank Failures Using Machine Learning**

***Project Overview***

The United States Federal Deposit Insurance Corporation (FDIC) maintains public confidence in the banking system by insuring deposits at select financial institutions up to $250,000. This insurance is complemented by in-depth monitoring and risk assessment, which hold financial institutions accountable to just banking standards and help manage turbulent times, all with the end goal of limiting the economic impact of a bank failure.

To assist with risk mitigation, the FDIC assigns a "Bank Score" to all insured institutions, which reflects "forward-looking risk measures" as well as historical performance (*FDIC Risk Based Assessments*, FDIC.gov). The score already comes from a highly complex statistical model, one that relies heavily on reviewer discretion and a rigid scorecard rather than on specific bank data or indicators. This creates an opportunity to explore whether a machine learning model applied to FDIC-collected data can predict bank health and supplement professional skepticism when analyzing these financial institutions.

**Can the FDIC implement machine learning models to perform additional analysis of large financial institutions for riskiness?** The FDIC's 2023 budget alone is $2.409 billion, which includes funding intended to "modernize and enhance the FDIC's information technology infrastructure" (*FDIC Board Approves 2023 Operating Budget*, FDIC.gov). Adopting AI to assist with regulatory practices could save time and improve accuracy for an institution crucial to American financial institutions, while reducing costs in the long run.

This model categorizes large financial institutions (small banks have a separate scoring system) as risky or safe, based on publicly available banking data. The FDIC scores the risk of large financial institutions between 0 and 100; because healthy institutions matter so much to banking, we set the risky/safe cutoff at **75** rather than 50. We use publicly available data provided by the FDIC via the Homeland Infrastructure Foundation. We acknowledge that this data is from 2016, meaning it does not contain more recent banking data or newer institutions. However, the fields are still collected and reported by the FDIC, so with current data one could apply this model to draw meaningful conclusions going forward.
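The risky/safe cutoff described above can be sketched in a few lines of pandas. This is only an illustration under stated assumptions: the column name `fdic_score` and the four example banks are hypothetical, and we assume that a higher score means more risk and that a score exactly at the cutoff counts as risky. Neither assumption is confirmed by the dataset used in this workbook.

```python
import pandas as pd

RISK_CUTOFF = 75  # cutoff chosen in the text (instead of 50)

# Hypothetical scores on the 0 to 100 scale; not taken from the real dataset.
banks = pd.DataFrame({
    "bank": ["A", "B", "C", "D"],
    "fdic_score": [12.0, 50.0, 75.0, 91.5],
})

# Assumption: higher score = more risk, and the cutoff itself counts as risky.
banks["risky"] = (banks["fdic_score"] >= RISK_CUTOFF).astype(int)
print(banks)
```

If the score turns out to run in the other direction (higher meaning healthier), only the comparison operator needs to change.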

***Model Development and Operational Details***

In this workbook, we perform in-depth exploratory data analysis and data wrangling, and finally apply logistic regression to categorize banks as risky or healthy, in line with the FDIC's collected data. As discussed above, we use the Homeland Infrastructure Foundation's 2016 FDIC Insured Banks Data Set.
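As a rough picture of where the workflow is headed, the sketch below fits a logistic regression on synthetic data with scikit-learn. Everything in it is an assumption made for illustration: the two features (`capital_ratio`, `npl_ratio`), the way the risky label is generated, and the train/test split are hypothetical and are not derived from the FDIC data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400

# Hypothetical standardized features; not real bank data.
capital_ratio = rng.normal(size=n)
npl_ratio = rng.normal(size=n)
X = np.column_stack([capital_ratio, npl_ratio])

# Synthetic label: less capital and more bad loans make "risky" (1) more likely.
signal = -2.0 * capital_ratio + 2.0 * npl_ratio + rng.normal(scale=0.5, size=n)
y = (signal > 0).astype(int)

# Hold out a quarter of the rows to estimate out-of-sample accuracy.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = LogisticRegression()
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print("held-out accuracy:", round(accuracy, 3))
print("coefficients:", model.coef_[0])
```

The signs of the fitted coefficients should mirror how the synthetic label was built, which is a quick sanity check worth repeating once real features are in place.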




# Imports

In [1]:
## Import packages
import pandas as pd
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 10)
import numpy as np
import sklearn.tree
import sklearn.metrics
import sklearn.model_selection
import graphviz
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
from IPython.display import Image

# Loading data

## Geographical data

In [2]:
##Loading Data
dfg = pd.read_csv('https://drive.google.com/uc?export=download&id=1vD7uj5Tpz2IvDj49YXR_4Xw2Pebhk_Ix')
dfg



Unnamed: 0,index,FID,ADDRESBR,BRNUM,BRSERTYP,CBSABR,CBSANAMB,CITYBR,CNTRYNAB,CNTYNAMB,DEPSUMBR,GEOCODE_CE,GEOCODE__1,NAMEBR,STALPBR,STCNTYBR,STNAMEBR,UNINUMBR,ZIPBR,CERT,ADDRESS,ASSET,BKCLASS,CITY,CNTRYNA,DENOVO,DEPDOM,NAMEFULL,NAMEHCR,REGAGNT,REPDTE,RSSDID,STALP,STCNTY,STNAME,ZIP,BKMO,LOC_NAME,STATUS,SCORE,x,y,GeocodeSou
0,0,13001,950 Park Street,0,11,14460,"Boston-Cambridge-Newton, MA-NH",Stoughton,United States,Norfolk,46645,2007,456101,Stoughton Co-Operative Bank,MA,25021,Massachusetts,33215,2072,26513,950 Park Street,96326,SM,Stoughton,United States,0,75160,Stoughton Co-operative Bank,,FED,2014-06-30T00:00:00.000Z,164975,MA,25021,MASSACHUSETTS,2072,1,PointAddress,M,100,-71.073321,42.111569,HSIP USA_ZIP4 Composite
1,1,13002,97 Lowell Road,360,11,14460,"Boston-Cambridge-Newton, MA-NH",Concord,United States,Middlesex,154554,1039,361300,Concord Branch,MA,25017,Massachusetts,33217,1742,57957,One Citizens Plaza,100642478,N,Providence,United States,0,68755303,"Citizens Bank, National Association",UK FINANCIAL INVESTMENTS LIMITED,OCC,2014-06-30T00:00:00.000Z,3303298,RI,44007,RHODE ISLAND,2903,0,PointAddress,M,100,-71.352231,42.463779,HSIP USA_ZIP4 Composite
2,2,13003,342 Main Street,0,11,14460,"Boston-Cambridge-Newton, MA-NH",Wakefield,United States,Middlesex,74480,4016,335100,Wakefield Co-Operative Bank,MA,25017,Massachusetts,33218,1880,26516,342 Main Street,174742,SB,Wakefield,United States,0,147124,Wakefield Co-operative Bank,,FDIC,2014-06-30T00:00:00.000Z,330873,MA,25017,MASSACHUSETTS,1880,1,PointAddress,M,100,-71.071241,42.504579,HSIP USA_ZIP4 Composite
3,3,13004,121 Main Street,0,11,12700,"Barnstable Town, MA",Yarmouth Port,United States,Barnstable,144528,2017,11802,Cape Cod Co-Operative Bank,MA,25001,Massachusetts,33219,2675,26517,121 Main Street,745395,SB,Yarmouth Port,United States,0,554956,Cape Cod Co-operative Bank,"COASTAL AFFILIATES, MHC",FDIC,2014-06-30T00:00:00.000Z,119779,MA,25001,MASSACHUSETTS,2675,1,StreetAddress,M,100,-70.252098,41.703698,HSIP USA_ZIP4 Composite
4,4,13005,205 Main St,0,11,0,,Frost,United States,Faribault,34860,1197,460500,Frost State Bank,MN,27043,Minnesota,33221,56033,26519,205 Main St,39881,NM,Frost,United States,0,34860,Frost State Bank,,FDIC,2014-06-30T00:00:00.000Z,943853,MN,27043,MINNESOTA,56033,1,StreetAddress,M,100,-93.923821,43.584759,HSIP USA_ZIP4 Composite
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
94720,94720,93996,9176 Conroy Windermere Rd,7864,11,36740,"Orlando-Kissimmee-Sanford, FL",Windermere,United States,Orange,85637,2046,14809,Isleworth Branch,FL,12095,Florida,552034,34786,628,1111 Polaris Parkway,2002047000,N,Columbus,United States,0,1032549000,"JPMorgan Chase Bank, National Association",JPMORGAN CHASE & CO.,OCC,2014-06-30T00:00:00.000Z,852218,OH,39041,OHIO,43240,0,StreetAddress,M,100,-81.511215,28.493529,HSIP USA_ZIP4 Composite
94721,94721,93997,10975 Tamiami Trail North,7865,11,34940,"Naples-Immokalee-Marco Island, FL",Naples,United States,Collier,4446,3003,10110,US 41 and Immokalee Branch,FL,12021,Florida,552035,34108,628,1111 Polaris Parkway,2002047000,N,Columbus,United States,0,1032549000,"JPMorgan Chase Bank, National Association",JPMORGAN CHASE & CO.,OCC,2014-06-30T00:00:00.000Z,852218,OH,39041,OHIO,43240,0,PointAddress,M,100,-81.801661,26.271220,HSIP USA_ZIP4 Composite
94722,94722,93998,"855 El Camino Real, Bldg 2, Suite 67",7866,11,41940,"San Jose-Sunnyvale-Santa Clara, CA",Palo Alto,United States,San Mateo,2239,1052,511609,Town and Country Palo Alto Branch,CA,6081,California,552036,94301,628,1111 Polaris Parkway,2002047000,N,Columbus,United States,0,1032549000,"JPMorgan Chase Bank, National Association",JPMORGAN CHASE & CO.,OCC,2014-06-30T00:00:00.000Z,852218,OH,39041,OHIO,43240,0,PointAddress,M,100,-122.160481,37.437761,HSIP USA_ZIP4 Composite
94723,94723,93999,13824 Narcoossee Road,7867,11,36740,"Orlando-Kissimmee-Sanford, FL",Orlando,United States,Orange,8050,1167,16704,Narcoossee & Laureate Branch,FL,12095,Florida,552037,32832,628,1111 Polaris Parkway,2002047000,N,Columbus,United States,0,1032549000,"JPMorgan Chase Bank, National Association",JPMORGAN CHASE & CO.,OCC,2014-06-30T00:00:00.000Z,852218,OH,39041,OHIO,43240,0,StreetAddress,M,100,-81.244670,28.365735,HSIP USA_ZIP4 Composite


https://drive.google.com/file/d/1vD7uj5Tpz2IvDj49YXR_4Xw2Pebhk_Ix/view?usp=share_link
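The output above starts with two columns, `Unnamed: 0` and `index`, that look like saved row counters rather than bank attributes. A minimal sketch of dropping them is shown below; it uses a tiny in-memory stand-in for the real CSV (only a few of the real columns), and the assumption that these two columns carry no information should be checked against the full file first.

```python
import io
import pandas as pd

# Tiny stand-in for the real CSV, keeping only a few of its columns.
csv_text = """,index,FID,NAMEBR,STALPBR,ZIPBR
0,0,13001,Stoughton Co-Operative Bank,MA,2072
1,1,13002,Concord Branch,MA,1742
2,2,13003,Wakefield Co-Operative Bank,MA,1880
"""

raw = pd.read_csv(io.StringIO(csv_text))

# A header cell with no name is labeled "Unnamed: 0" by pandas.
redundant = [c for c in ("Unnamed: 0", "index") if c in raw.columns]
sample = raw.drop(columns=redundant)
print(sample.columns.tolist())
```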


Andrew: REVIEW DATA FIELDS, WALK THROUGH CLEANING, WALK THROUGH MODEL BUILD, ACKNOWLEDGE BIASES (Geographical, etc.)

Divya: 
- Early columns are geography; should be less important. We can use state and/or zip code to simplify; no point in going address-level imo. Regarding Andrew's comment, we can do an early check to see how well different geos are represented in the data, but that'll probably be linked to general population and financial metrics in those zip codes. 
- Downloaded the CSV... Where is all the financial data?? I only see geo data...
- Source of labels of Bank Failure: https://banks.data.fdic.gov/bankfind-suite/failures
    - goes from 2012 to 2020
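The geography check suggested in the notes above could look roughly like this. The sketch uses a handful of made-up rows with the real column names `STALPBR` and `ZIPBR`; the helper column `ZIP5` is hypothetical. It also pads ZIP codes back to five digits, since values such as `2072` in the output above suggest the leading zero was lost when the column was read as a number.

```python
import pandas as pd

# Made-up rows that reuse the real column names.
branches = pd.DataFrame({
    "STALPBR": ["MA", "MA", "MA", "MN", "FL", "FL"],
    "ZIPBR": [2072, 1742, 1880, 56033, 34786, 34108],
})

# Share of branches per state, to see how evenly geographies are represented.
state_share = branches["STALPBR"].value_counts(normalize=True)
print(state_share)

# Restore five-digit ZIP codes as strings (ZIP5 is a hypothetical helper column).
branches["ZIP5"] = branches["ZIPBR"].astype(str).str.zfill(5)
print(branches[["ZIPBR", "ZIP5"]])
```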

## Financial data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Bank Failure Data, possible source of labels. Maybe can join on some financial ID
dff = pd.read_csv('/content/drive/MyDrive/BankFailures.csv', on_bad_lines='skip')
dff
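In pandas 1.3 and later, the option for skipping malformed rows is spelled `on_bad_lines='skip'`; the older `error_bad_lines` flag was removed in pandas 2.0. The sketch below shows the behavior on a small in-memory CSV with invented rows, where one line has an extra field. Skipped rows disappear silently, so it is worth comparing row counts against the source file.

```python
import io
import pandas as pd

# Invented rows; the second data line has one field too many.
csv_text = """CERT,NAME,STATE
100,First Example Bank,GA
200,Second Example Bank,IL,unexpected extra field
300,Third Example Bank,FL
"""

failures = pd.read_csv(io.StringIO(csv_text), on_bad_lines="skip")
print(failures)
```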

# Data exploration

In [None]:
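# Masks for looking up branches by name and by FID in the geographical data
mr1 = dfg['NAMEBR'].str.contains('Wakefield')
mr2 = dfg['FID'] == 3983
dfg.loc[mr2, :]

# Check whether an FID value from dfg also appears as a FIN in the failure data
mr1 = dff['FIN'] == 93996
dff.loc[mr1, :]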
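Before merging, it can help to measure how many key values the two datasets actually share. The sketch below does this for `CERT`, which appears in both column lists used in this notebook; the rows themselves are invented, and whether `CERT` is the right join key remains an assumption to verify.

```python
import pandas as pd

# Invented rows; only the CERT column name comes from the notebook.
geo = pd.DataFrame({
    "CERT": [26513, 57957, 26516, 628, 628],
    "NAMEBR": ["Branch 1", "Branch 2", "Branch 3", "Branch 4", "Branch 5"],
})
failed = pd.DataFrame({
    "CERT": [26516, 99999],
    "NAME": ["Failed Bank 1", "Failed Bank 2"],
})

# Distinct key values present in both datasets.
shared = set(geo["CERT"]) & set(failed["CERT"])

# Row-level view: which branches belong to an institution in the failure list.
geo_hit = geo["CERT"].isin(failed["CERT"])
print(len(shared), "shared CERT values;", int(geo_hit.sum()), "matching branch rows")
```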

## Understanding how to merge geographical and financial data

In [36]:
## Reducing datasets to key columns to simplify assessment

### Geographical dataset
g_rc = [
    'FID',
    'CERT',
    'STALPBR',
    'CITY',
    'CITYBR',
    'CNTYNAMB',
    'NAMEFULL',
    'NAMEBR',
    # 'SCORE',
]
dfg1 = dfg.loc[:, g_rc].copy()

### Financial dataset
f_rc = [
    'CERT',
    'ID',
    'FIN',
    'STATE',
    'CITY',
    'NAME_DOUBLEQUOTES',
]
dff1 = dff.loc[:, f_rc].copy()


## Merging datasets and evaluating results
dfx = pd.merge(
    left=dfg1,
    right=dff1,
    how='left',
    left_on='FID',
    right_on='ID',
    suffixes=('_dfg', '_dff'),
)

mr1 = dfx['CERT_dff'].notnull()
dfx.loc[mr1, :]

KeyError: 'CERT_y'
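The `KeyError: 'CERT_y'` above is consistent with a run that looked up pandas' default `_y` suffix: once custom `suffixes` are passed to `pd.merge`, overlapping columns are named with those suffixes instead. The sketch below, on invented rows, shows the resulting column names. It also tries a join on `CERT` with `indicator=True`, which is our assumption about a more suitable key rather than something the notebook has confirmed.

```python
import pandas as pd

# Invented rows that mimic the reduced column sets used above.
geo = pd.DataFrame({
    "FID": [1, 2, 3],
    "CERT": [26513, 26516, 628],
    "CITY": ["Stoughton", "Wakefield", "Columbus"],
})
failed = pd.DataFrame({"ID": [10], "CERT": [26516], "CITY": ["Wakefield"]})

# Same call shape as the notebook: overlapping columns get the custom suffixes.
merged = pd.merge(
    geo, failed, how="left", left_on="FID", right_on="ID",
    suffixes=("_dfg", "_dff"),
)
print(merged.columns.tolist())

# Assumed alternative key: join on CERT and record where each row came from.
by_cert = pd.merge(
    geo, failed, how="left", on="CERT",
    suffixes=("_dfg", "_dff"), indicator=True,
)
print(by_cert["_merge"].value_counts())
```

In this toy example the `FID`/`ID` join matches nothing while the `CERT` join finds one institution, which is the kind of difference the `_merge` column makes easy to spot.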

# Data wrangling

# Feature engineering

# Building model

# Evaluating model

# *Additional notes*

##### Code adjustments to run locally

In [3]:
## Code line to read bank failure data locally
dff = pd.read_csv('BankFailuresCleaned - BankFailures.csv')