## Rohan Bhatt
#### UID: 117942330

##### Research Question: Are bank failures more closely correlated with general economic recessions or with specific financial trends (e.g., the crypto boom, the ML boom)?
##### This question interests me because I have worked in the fintech space for several of my internships, and I have been interested in the stock market since I was 18 (I learned to trade options, and subsequently lost money or broke even). Given that background, the collapse of FTX, SVB, and others caught my attention. Were there failures like FTX within an industry during the dot-com boom? Or do banks usually fail or get sold off during recessions?

##### - Population: All banks in the United States, with a focus on those that have failed since October 1, 2000.
##### - Variables: The dependent variable is bank failure (binary: fail or no fail). The independent variables include economic indicators such as GDP growth rate, unemployment, and inflation, as well as financial trend indicators such as the capitalization of certain markets (e.g., crypto) and investment in industries like AI. Potential confounding variables include bank size (total assets), geographic location, and regulatory changes.
##### - Hypothesis: Bank failures are more strongly correlated with general economic recessions than with specific financial trends; financial trends mainly exacerbate the risk of failure during economic downturns.
##### - Data collection: The primary dataset is the FDIC failed bank list. Economic data will consist of historical GDP growth rates, unemployment rates, and other relevant economic indicators from sources like the Federal Reserve and the Census Bureau. Financial trend data will also be important, such as historical crypto market capitalization and the capitalization of other specific markets over the roughly 24-year period (10/1/2000-2024).

In [3]:
import findspark
findspark.init()

In [17]:
import re
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, col
spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.read.csv("banklist.csv", header=True, inferSchema=True)

df.show(5)
df.printSchema()

+--------------------+-------------+------+-----+----------------------+-------------+-----+
|          Bank Name�|        City�|State�|Cert�|Acquiring Institution�|Closing Date�| Fund|
+--------------------+-------------+------+-----+----------------------+-------------+-----+
|Republic First Ba...| Philadelphia|    PA|27332|  Fulton Bank, Nati...|    26-Apr-24|10546|
|       Citizens Bank|     Sac City|    IA| 8758|  Iowa Trust & Savi...|     3-Nov-23|10545|
|Heartland Tri-Sta...|      Elkhart|    KS|25851|  Dream First Bank,...|    28-Jul-23|10544|
| First Republic Bank|San Francisco|    CA|59017|  JPMorgan Chase Ba...|     1-May-23|10543|
|      Signature Bank|     New York|    NY|57053|   Flagstar Bank, N.A.|    12-Mar-23|10540|
+--------------------+-------------+------+-----+----------------------+-------------+-----+
only showing top 5 rows

root
 |-- Bank Name�: string (nullable = true)
 |-- City�: string (nullable = true)
 |-- State�: string (nullable = true)
 |-- Cert�: inte

In [18]:
# Strip non-ASCII characters from the column names (each header except Fund ends in a stray symbol)
new_column_names = [re.sub(r'[^\x00-\x7F]', '', col_name) for col_name in df.columns]
df = df.toDF(*new_column_names)
df.show(5)

+--------------------+-------------+-----+-----+---------------------+------------+-----+
|           Bank Name|         City|State| Cert|Acquiring Institution|Closing Date| Fund|
+--------------------+-------------+-----+-----+---------------------+------------+-----+
|Republic First Ba...| Philadelphia|   PA|27332| Fulton Bank, Nati...|   26-Apr-24|10546|
|       Citizens Bank|     Sac City|   IA| 8758| Iowa Trust & Savi...|    3-Nov-23|10545|
|Heartland Tri-Sta...|      Elkhart|   KS|25851| Dream First Bank,...|   28-Jul-23|10544|
| First Republic Bank|San Francisco|   CA|59017| JPMorgan Chase Ba...|    1-May-23|10543|
|      Signature Bank|     New York|   NY|57053|  Flagstar Bank, N.A.|   12-Mar-23|10540|
+--------------------+-------------+-----+-----+---------------------+------------+-----+
only showing top 5 rows



### First steps will be to run a time-series analysis on the closing dates and look for commonalities (e.g., do failures cluster around 2008?). From there, I will import more datasets.
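
As a starting point for that time-series step, here is a minimal sketch of parsing the `Closing Date` strings and counting failures per year. It is written in plain Python so the date format can be checked without a Spark session, and it uses only the five closing dates visible in the output above. The helper names `parse_closing_date` and `failures_per_year` are hypothetical, not part of the notebook. On the full DataFrame the same parse would presumably be something like `to_date(col("Closing Date"), "d-MMM-yy")`, which has not been run here.

```python
from collections import Counter
from datetime import datetime

# Closing dates copied from the five rows shown in the table output above.
closing_dates = ["26-Apr-24", "3-Nov-23", "28-Jul-23", "1-May-23", "12-Mar-23"]


def parse_closing_date(text):
    """Parse a date such as '3-Nov-23' (day, month abbreviation, 2-digit year)."""
    # %y maps 00-68 to 2000-2068, which suits a list that starts in October 2000.
    return datetime.strptime(text.strip(), "%d-%b-%y").date()


def failures_per_year(dates):
    """Count failures per calendar year and return a dict ordered by year."""
    counts = Counter(parse_closing_date(d).year for d in dates)
    return dict(sorted(counts.items()))


yearly_counts = failures_per_year(closing_dates)
print(yearly_counts)  # {2023: 4, 2024: 1}
```

Applied to the whole list, yearly counts like these would show whether failures cluster around 2008 or spread across other periods, which is the comparison the hypothesis depends on.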