# Should this SBA Loan be Approved?

The Small Business Administration (SBA) is a United States government agency, formed in 1953, that supports entrepreneurs and small businesses. Much of this support comes as loans made through smaller local banks and guaranteed by the SBA up to 90%. In short, local banks issue SBA loans to small businesses, and if a loan defaults, the SBA covers up to 90% of the remaining charge-off. This mitigates risk for the local banks and helps small businesses get the capital they need. Since the SBA assumes almost all of the risk, it has to be very careful in deciding which loans to approve.
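
To make the guarantee concrete, here is a minimal sketch of how a charged-off balance is split between the SBA and the bank. The `split_loss` helper and the dollar figures are illustrative assumptions, not part of the dataset or of any SBA tooling.

```python
def split_loss(charge_off, guarantee_rate):
    """Split a charged-off balance between the SBA and the lending bank.

    guarantee_rate is the guaranteed fraction of the loan (a hypothetical
    input; in this dataset it could be approximated as SBA_Appv / GrAppv).
    """
    if not 0.0 <= guarantee_rate <= 1.0:
        raise ValueError("guarantee_rate must be between 0 and 1")
    sba_share = charge_off * guarantee_rate
    bank_share = charge_off - sba_share
    return sba_share, bank_share


# Made-up example: a $40,000 charge-off on a loan with a 90% guarantee
sba, bank = split_loss(charge_off=40_000, guarantee_rate=0.90)
print(f"SBA covers ${sba:,.0f}; bank absorbs ${bank:,.0f}")
```

The bank's exposure shrinks as the guarantee rate rises, which is why the approval decision matters so much more to the SBA than to the lender.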

Here I have a historical dataset with about 900,000 SBA loans, and over 20 columns of details surrounding the loans.

**Goal:** Assess the risk factors for Borrowers and build a model to decide whether an SBA loan should be approved.

In [None]:
# import libraries
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.impute import SimpleImputer
plt.style.use('fivethirtyeight')
import os

# **Data Reading and Preprocessing**

Here is an actual glimpse of the dataset.

In [None]:
df = pd.read_csv('/kaggle/input/should-this-loan-be-approved-or-denied/SBAnational.csv',low_memory=False)
df.head()

In this data set, we have the following information about each case:
* LoanNr_ChkDgt - Identification Number for Each Loan Case
* Name - Borrower Name
* City - Borrower City
* State - Borrower State
* Zip - Borrower Zip Code
* Bank - Bank Name
* BankState - Bank State
* NAICS - North American Industry Classification System Code
* ApprovalDate - Date SBA Commitment Issued
* ApprovalFY - Fiscal Year of Commitment
* Term - Loan Term in Months
* NoEmp - Number of Business' Employees
* NewExist - New or Existing Business (1 = Existing, 2 = New)
* CreateJob - Total Jobs Created by Business
* RetainedJob - Number of Jobs Retained
* FranchiseCode - Franchise Code (00000 or 00001 = No Franchise)
* UrbanRural - Location of Business (1 = Urban, 2 = Rural, 0 = Undefined)
* RevLineCr - Revolving Line of Credit (Y = Yes, N = No)
* LowDoc - LowDoc Loan Program (Smaller Loan with shorter application and expedited processing time, Y = Yes, N = No)
* ChgOffDate - The Date when the Loan is Declared to be in Default
* DisbursementDate - Disbursement Date
* DisbursementGross - Amount Disbursed
* BalanceGross - Gross Amount Outstanding
* MIS_Status - Loan Status (CHGOFF = Charged off, PIF = Paid in Full)
* ChgOffPrinGr - Charged-off Amount
* GrAppv - Gross Amount of Loan Approved by bank
* SBA_Appv - Amount guaranteed by SBA

Let's check for missing values in each column. 

In [None]:
df.isnull().sum()

Let's just drop the charge-off date column, since it has a ton of null values (it is only filled in when a loan is charged off) and doesn't provide any information we can use. We can get all the information we need about the timing of the loan from the Fiscal Year and Term columns.

Then, I think we should simply remove the rows with other null values. Our data set is massive with about 900,000 rows, so we can afford to drop a couple thousand without noticeably affecting our sampling error.

In [None]:
df = df.drop(columns = 'ChgOffDate')
df = df.dropna()
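
Before dropping rows, it can be worth confirming how many we actually lose. Here is a minimal sketch on a made-up toy frame; the `toy` data is hypothetical, and on the real data `df` would take its place.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan data (values are made up)
toy = pd.DataFrame({
    "Term": [84, 120, np.nan, 60],
    "State": ["CA", None, "TX", "NY"],
    "ChgOffDate": [None, None, "1-Jan-10", None],
})

before = len(toy)
# Drop the mostly-empty column first so it does not wipe out every row
cleaned = toy.drop(columns="ChgOffDate").dropna()
lost = before - len(cleaned)
print(f"Dropped {lost} of {before} rows ({lost / before:.1%})")
```

Note the order: calling `dropna()` before removing `ChgOffDate` would discard every loan that was never charged off.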

Now let's check if all of the columns are formatted properly for us to start conducting some analysis.

In [None]:
df.dtypes

Some of the columns like BalanceGross and GrAppv are objects, but should be floats in order for us to start comparing them numerically. We can write up a function to fix this and apply it to all the necessary columns.

We can also make a column for whether or not the loan was paid off. We have MIS_Status, but we can just make our own binary variable.

In [None]:
# makes column for Paid in Full (1=Paid in full, 0 = no)
df['Paid'] = df['MIS_Status'].replace({'P I F':'1','CHGOFF':'0'}, regex=True)
df['Paid'] = df['Paid'].astype(float)


#fixes object values in some of our columns
def fix_num(number):
    num = number.replace("$", "")
    num = num.replace(",","")
    num = num.replace(" ","")
    return float(num)

# apply the cleaner to every dollar-amount column
money_cols = ['BalanceGross', 'DisbursementGross', 'ChgOffPrinGr', 'GrAppv', 'SBA_Appv']
for col in money_cols:
    df[col] = df[col].apply(fix_num)
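
As an alternative to a row-by-row function, the same cleanup can be done with pandas string methods. This is only a sketch: the `to_dollars` helper is a hypothetical name, and the sample strings are made up to resemble the raw format. Unlike a plain `float()` call, `errors="coerce"` turns unparseable values into NaN instead of raising.

```python
import pandas as pd


def to_dollars(series):
    """Strip '$', ',' and whitespace from a string column, then convert to float."""
    cleaned = series.astype(str).str.replace(r"[$,\s]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")


# Made-up values in the same style as the raw amount columns
sample = pd.Series(["$60,000.00 ", "$1,250.50 ", "$0.00 "])
print(to_dollars(sample).tolist())
```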

Other columns need some work on their formatting too, so the lines of code below fix that. 

In [None]:
# changes fiscal year of commitment from object to int
df['ApprovalFY'] = df['ApprovalFY'].replace({'A':'','B':''}, regex = True).astype(int)

# changes new vs existing business from 1 and 2 to 1(new) and 0(existing) for interpretability
df['NewExist'] = df['NewExist'].replace(1,0)
df['NewExist'] = df['NewExist'].replace(2,1)

# changes RevLineCR to binary variable
df['RevLineCr'] = df['RevLineCr'].replace({'Y':'1','N':'0'}, regex=True)
valid = ['1', '0']
df = df.loc[df['RevLineCr'].isin(valid)]
df['RevLineCr'] = df['RevLineCr'].astype(int)

# changes LowDoc to binary variable
df['LowDoc'] = df['LowDoc'].replace({'Y':'1', 'N':'0'}, regex=True)
# keep only rows with a valid flag (the unused valid1 list is no longer needed)
df = df.loc[df['LowDoc'].isin(['1', '0'])]
df['LowDoc'] = df['LowDoc'].astype(int)

# makes franchise a binary variable
df['FranchiseCode'] = df['FranchiseCode'].replace(1,0)
df['FranchiseCode'] = np.where((df.FranchiseCode != 0),1,df.FranchiseCode)
df.rename(columns={"FranchiseCode":"Franchise"},inplace=True)

**Feature Engineering**

I'd also like to make some new columns in order to provide more depth to our analysis. For example, SBA loans with over a 20 year Term have to be backed by real estate, so I'd like to create a binary variable for RealEstate. Here are the variables I'll be creating.

* RealEstate - 1 if loan is backed by real estate, 0 if not
* Recession - 1 if Fiscal Year of Approval is during a Recession, 0 if not

In [None]:
# Real Estate: 1 if the term is over 240 months (20 years), else 0
df['RealEstate'] = (df['Term'] > 240).astype(int)

# Recession: 1 if the loan was approved in a recession fiscal year, else 0
rec_years = [1969,1970,1973,1974,1975,1980,1981,1982,1990,1991,2001,2007,2008,2009]
df['Recession'] = df['ApprovalFY'].isin(rec_years).astype(int)

Now we are ready.

# **Exploratory Data Analysis**

Let's start by looking at some of the distributions of our variables.

In [None]:
# makes dataframe with only numeric variables
numeric = ['Paid','ApprovalFY', 'Term', 'NoEmp', 'NewExist', 'CreateJob', 'RetainedJob',
           'UrbanRural', 'RevLineCr', 'LowDoc', 'DisbursementGross', 'BalanceGross', 
            'GrAppv', 'SBA_Appv', 'RealEstate', 'Franchise','Recession']
num_df = df[numeric]
num_df.describe().T

Notable Findings:
* Recession had a mean of .22, meaning 22% of all SBA loans were approved during a Recession
* NewExist had a mean of .28, meaning 28% of all SBA loans were given to new businesses
* Franchise had a mean of only .05, meaning franchises rarely received SBA loans
* RealEstate had a mean of .07, meaning only 7% of all SBA loans were backed by real estate

Now let's make a correlation matrix, so we can start to get an understanding of the relationship between our variables.

In [None]:
corr = num_df.corr()
fig, ax = plt.subplots(figsize=(15,15))
# mask the upper triangle (and diagonal), which only mirrors the lower triangle
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr, annot = True, ax=ax, mask=mask, cmap = "Blues").set(title='Feature Correlations')
plt.show()

Notable Findings:
* The best indicators for Paid were Term, SBA_Appv, and GrAppv (.31, .13 and .12 respectively).
* RealEstate has a .1 correlation with Paid, but I would've expected it to be higher since loans backed by real estate are generally considered less risky and more likely to be paid off.
* GrAppv has a -.26 correlation with RevLineCr, which means that banks generally approve smaller amounts for businesses that have a revolving line of credit, which seems odd.
* RevLineCr has a -.12 correlation with Paid, meaning that businesses with a revolving line of credit were more likely to default. That might explain why banks are less willing to loan them larger amounts.
* NewExist has a -.021 correlation with Paid. While it's only a slightly negative correlation, it means new businesses (NewExist = 1) were slightly more likely to default on their SBA loans than existing ones.
* Recession had a -.16 correlation with Paid, which is a lower magnitude than I would've expected. I would've thought there would be a strong negative correlation between the two variables since businesses are probably less likely to pay off a loan during a recession.
* Term and DisbursementGross have a strong positive correlation (.47), which means that loans with longer terms are usually larger.
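
Reading the target's correlations off a large heatmap is error-prone, so it can help to pull out that one column and rank it. Below is a small sketch on made-up data; the `toy` frame is hypothetical, and on the real data `num_df.corr()` would replace `toy.corr()`.

```python
import pandas as pd

# Made-up rows with the same column names used in the notebook
toy = pd.DataFrame({
    "Paid": [1, 1, 0, 1, 0, 1, 1, 0],
    "Term": [120, 240, 36, 84, 24, 300, 180, 48],
    "RevLineCr": [0, 0, 1, 0, 1, 0, 1, 1],
})

# Correlation of every feature with the target, strongest (by magnitude) first
target_corr = toy.corr()["Paid"].drop("Paid")
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)
print(ranked)
```

Sorting by absolute value keeps strong negative relationships, such as RevLineCr here, from being buried at the bottom of the list.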

Let's take a look at the distributions of our important variables like Term and DisbursementGross.

In [None]:
fig = plt.figure(figsize=(12,7))
# histplot replaces the deprecated distplot and plots counts by default
sns.histplot(x=df['Term'], bins=40)
plt.title('Distribution of Loan Terms')
plt.ylabel('Count')
plt.show()

The distribution of loan terms isn't quite normal, so we might want to consider scaling it later. There seem to be popular loan terms around the 100, 250 and 300 month marks.

In [None]:
fig = plt.figure(figsize=(12,7))
# histplot replaces the deprecated distplot and plots counts by default
sns.histplot(x=df['DisbursementGross'], bins=40)
plt.title('Distribution of Gross Disbursement')
plt.ylabel('Count')
plt.show()

The distribution is not normal, so we'll fix that with a logarithmic transform before we build our model. SBA_Appv and GrAppv are almost perfectly correlated with DisbursementGross, so I'll assume those are also skewed and will take the log of those too.
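
To see what the log transform buys us, here is a small sketch on synthetic right-skewed amounts. The lognormal sample is an assumption standing in for the real loan amounts, and `np.log1p` (log of 1 + x) is used as a variation that stays defined if an amount is exactly 0.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed "loan amounts" (lognormal); not the real data
amounts = pd.Series(rng.lognormal(mean=11, sigma=1, size=10_000))

log_amounts = np.log1p(amounts)  # log(1 + x), safe if an amount is 0
print(f"skew before: {amounts.skew():.2f}, after: {log_amounts.skew():.2f}")
```

A skewness near 0 after the transform indicates a roughly symmetric distribution. Tree-based models are not sensitive to this kind of monotonic rescaling, so the transform mostly helps plots and linear methods.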

Let's see if there are any trends over time for the percentage of loans paid in full.

In [None]:
fig = plt.figure(figsize=(10,10))
sns.lineplot(x="ApprovalFY", y="Paid", data=df)
plt.title('SBA Loan PIF(Paid in Full) Rate Over Time')
plt.show()

We can see periods where the success rate of loans was very high, specifically 1990-2005 and 2010-2014. We can also see dips in success rate during historical recessions like 2008 and the mid 1980s.

The extremely low success rate of loans before 1985 may be attributed to the small sample size of loans during that period. If we look at a histogram of the number of loans during that period, we'll see that very few SBA loans were taken until about 1990.

In [None]:
fig = plt.figure(figsize=(10,10))
# histplot replaces the deprecated distplot and plots counts by default
sns.histplot(x=df['ApprovalFY'], bins=40)
plt.title('Number of SBA Loans Taken Over Time')
plt.ylabel('Count')
plt.xlabel('Year')
plt.show()
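
The small-sample concern can be quantified with the standard error of a proportion, sqrt(p(1 - p) / n). The sketch below uses hypothetical counts, and `pif_interval` is an illustrative helper rather than anything from the notebook.

```python
import math


def pif_interval(paid, total, z=1.96):
    """Approximate 95% normal-approximation interval for a paid-in-full rate."""
    p = paid / total
    se = math.sqrt(p * (1 - p) / total)
    return max(0.0, p - z * se), min(1.0, p + z * se)


# Hypothetical counts: the same 80% rate observed in a thin year and a busy year
print(pif_interval(paid=16, total=20))
print(pif_interval(paid=16_000, total=20_000))
```

With only 20 loans the plausible range spans tens of percentage points, so a sharp dip in an early, sparsely populated year may be mostly noise.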

We can also observe a decrease in the number of SBA loans taken after 2010. Even though fewer loans were taken, the success rate was still very high, which suggests that businesses that have taken out loans in the past few years are much less risky than before. Keep in mind, though, that the most recent loans have also had less time to default.

We can graph the defaulted vs. paid loans over time to get another visual of which periods were more successful than others.

In [None]:
post_1985 = df[df['ApprovalFY'] >= 1986]
fig = plt.figure(figsize=(15,10))
sns.countplot(x="ApprovalFY",data=post_1985,hue="Paid")
plt.title('Number of Defaulted vs Paid SBA Loans over Time')
plt.xticks(rotation=75)
plt.show()

There seems to be a leading effect: loans taken shortly before 2008 defaulted more often, even though the recession didn't really hit until late 2007/2008. I think this is because the recession hit the existing loans harder than the loans taken out during the recession. Lenders probably agreed to more lenient conditions to ensure the loans they issued wouldn't default.

Let's not forget we have a column for the industry of each business, so let's take a look at the differences between industries.

In [None]:
# map the NAICS codes to their actual industries
# the first two digits of a NAICS code identify the sector
df['NAICS'] = df['NAICS'].astype(str).str[:2]

dic = {'11': 'Agriculture, Fishing, Forestry, and Hunting',
      '21': 'Mining, Quarrying, Oil and Gas Extraction',
      '22': 'Utilities', '23': 'Construction', '31':'Manufacturing',
      '32': 'Manufacturing', '33': 'Manufacturing', '42': 'Wholesale Trade',
      '44': 'Retail Trade', '45':'Retail Trade', '48':'Transport and Warehouse',
      '49': 'Transport and Warehouse', '51':'Information',
      '52': 'Finance and Insurance', '53':'Real Estate and Rental Leasing',
      '54':'Profesisonal, Scientific, and Technical Services',
      '55':'Management', '56':'Administrative and Support and Waste Management',
      '61':'Educational Services', '62':'Health Care and Social Assistance',
      '71': 'Arts, Entertainment and Recreation', '72':'Accomodation and Food Services',
      '81': 'Other Services', '92':'Public Administration'}

df['NAICS'] = df['NAICS'].map(dic)
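
One thing to watch with `.map`: any two-digit prefix missing from the dictionary becomes NaN. The sketch below uses made-up codes and a shortened, hypothetical `sector_names` dictionary to show how to measure that, plus one possible way to handle it.

```python
import pandas as pd

# Made-up codes; "0" and "999999" have no matching sector in the dictionary
codes = pd.Series(["236115", "0", "621210", "999999"])
sector_names = {"23": "Construction",
                "62": "Health Care and Social Assistance"}

sectors = codes.str[:2].map(sector_names)
print(f"unmapped share: {sectors.isna().mean():.0%}")

# One option (an assumption, not the notebook's choice): keep them as their own group
sectors = sectors.fillna("Unknown")
print(sectors.value_counts())
```

If the unmapped share is large on the real data, those loans are silently excluded from any per-industry plot, which is worth knowing before comparing industries.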

In [None]:
fig = plt.figure(figsize=(15,10))
sns.barplot(x=df['NAICS'],y = df['Paid'],palette = "deep")
plt.xticks(rotation=80)
plt.title('Paid In Full Rates for Each Industry')
plt.xlabel('Industry')
plt.ylabel('Loan PIF Rate')
plt.show()

Notable Findings:
* We can see here that the least risky industries seem to be Health Care, Mining, Agriculture, and Management.
* On the other hand, Real Estate and Finance were historically the riskiest industries to approve a loan to, with default rates of roughly 28%.

Let's see what factors differentiate these industries. The loan's term seemed to have a big impact on the default rate, so let's see how that varies across industries.

In [None]:
fig = plt.figure(figsize=(15,10))
sns.barplot(x=df['NAICS'],y = df['Term'],palette = "deep")
plt.xticks(rotation=80)
plt.title('Loan Terms for Each Industry')
plt.xlabel('Industry')
plt.ylabel('Loan Term (Months)')
plt.show()

Sure enough, the industries that took out the longest loans were usually the most likely to pay them back. The Management industry took out the longest loans by far, followed by Health Care and Food Services. This evidence supports our original finding that loans with longer Terms are less likely to default. Mining, Quarrying, Oil, and Gas did not follow this trend, however: it was one of the least risky borrowers, but did not take out long term loans.

Now, let's see which Industries are taking out the largest loans, and see if there is any correlation there. 

In [None]:
fig = plt.figure(figsize=(15,10))
sns.barplot(x=df['NAICS'],y = df['DisbursementGross'],palette = "deep")
plt.xticks(rotation=80)
plt.title('Loan Size in Each Industry')
plt.xlabel('Industry')
plt.ylabel('Gross Disbursement(USD)')
plt.show()

Not surprisingly, we see some of the same industries at the top ranks in both Term and Gross Amount Disbursed. However, industries like Manufacturing and Wholesale Trade have relatively large gross disbursements compared to their modest Terms.

Overall, we can see some moderate differences in SBA loan strategies among industries. However, I don't think we'll need to build a separate model for any specific industry, because the differences weren't so extreme that a single model wouldn't be able to handle them.

Now, let's consider differences among states. Let's see each state's default rate along with their total amount of SBA loans.

In [None]:
state_default = df.groupby(['State','Paid'])['State'].count().unstack('Paid')
state_default['Total SBA Loans Taken'] = state_default[1] + state_default[0]
state_default['PIF Percentage'] = state_default[1]/(state_default[1] + state_default[0])
state_default['Default Percentage'] = (1 - state_default['PIF Percentage'])
state_default = state_default.sort_values(by = 'Default Percentage')
state_default

Notable Findings: 
* Florida and DC had the highest default rates (25.7% and 23.5%)
* Montana and Wyoming had the lowest default rates (6.8% and 6.9%)
* There does not seem to be any correlation between the number of loans a state took and its default rate; for example, Montana and DC have a similar number of total loans and they are on opposite ends of the spectrum for default rate.
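
The same per-state table can also be built with a single named aggregation, which stays well defined even when a state has no defaults at all (the unstack approach would leave a NaN count there). A sketch on made-up rows, where `toy` is hypothetical and `df` would take its place on the real data:

```python
import pandas as pd

# Made-up loans: MT and VT have no defaults, FL defaults half the time
toy = pd.DataFrame({
    "State": ["MT", "MT", "MT", "FL", "FL", "FL", "FL", "VT"],
    "Paid":  [1,    1,    1,    1,    0,    0,    1,    1],
})

summary = (
    toy.groupby("State")["Paid"]
       .agg(total_loans="size", pif_rate="mean")   # count and PIF share per state
       .assign(default_rate=lambda t: 1 - t["pif_rate"])
       .sort_values("default_rate")
)
print(summary)
```

Because Paid is coded as 0/1, its mean within each state is exactly the paid-in-full rate.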

Overall, our analysis by state has shown us that there are major differences in default rates among states, which is very important for us to consider when creating our model. We can either build our model for a certain state or region, or make sure we choose a type of model that effectively handles this kind of variance.

Let's take a closer look at loans backed by real estate. I found it very interesting that it only had a .1 correlation with Paid, since Loans backed by real estate have to have very long terms. Term has a .31 correlation with Paid, so I would've expected RealEstate to have a similar value. With that being said let's see how loans backed by real estate perform.

In [None]:
fig = plt.figure(figsize=(7,7))
sns.barplot(x="RealEstate", y="Paid", data=df)
plt.title('PIF Rate for Loans backed by Real Estate')
plt.xlabel('Real Estate')
plt.show()

We can see that SBA loans backed by real estate are very safe, especially compared to average loans. Let's see how the use of real estate changed over time, or if any specific industry uses real estate backed loans more than others.

In [None]:
fig = plt.figure(figsize=(12,7))
sns.lineplot(x="ApprovalFY", y ="RealEstate", data=df)
plt.xlabel('Year')
plt.title('Real Estate Use over Time')

fig = plt.figure(figsize=(12,7))
sns.barplot(x="NAICS",y="RealEstate",data=df)
plt.title('Real Estate Use by Industry')
plt.xticks(rotation = 80)

plt.show()

The use of real estate backed loans seems to be steady throughout the years, besides the spike in the early 1970s.

The management industry is again the shining star with by far the most real estate backed loans. 

**Before we begin to construct our model, let's summarize our main findings:**
* SBA loans are common during recessions like the 2008 housing crisis, although many of the SBA Loans taken during recessions default
* Term is our best indicator of a loan's chances of defaulting
* SBA Loans have been successful recently, having a much lower chance of default than in prior years
* Some industries are much riskier to lend to than others. Mining, Agriculture, Management, and Health Care have historically been the safest industries to lend to, while Real Estate and Finance and Insurance have been the riskiest
* Some states are much riskier to lend to than others. Florida and DC had the highest default rates of any state, while Montana and Wyoming had the lowest default rates
* Loans backed by Real Estate are very safe, having a much lower default rate than standard loans

# **Modeling**

We need to build a classification model in order to predict whether or not a loan will default. I think we should start with a Random Forest as a baseline, and try to improve on the results using a Gradient Boost model. Both are ensembles of decision trees, but they are built differently. The Random Forest grows a large number of trees independently, each on a random sample of the data, and combines their votes at the end. Gradient Boost grows its trees one after another, with each new tree fit to the errors of the ensemble so far, effectively taking small steps toward a better model every time. I expect this to make Gradient Boost more effective. However, Gradient Boost is not always a great choice, as it sometimes overfits if there is a lot of noise within the data. For this reason we have to be careful when considering which variables to choose.
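
The "small steps" idea behind boosting can be sketched in a few lines. This toy version is an assumption-heavy illustration, not what scikit-learn does internally: it uses squared error on a synthetic regression problem and one-split "stumps" instead of full trees, and `fit_stump` is a hypothetical helper.

```python
import numpy as np


def fit_stump(x, target):
    """Best single-split regression stump on a 1-D feature (squared error)."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so both sides are non-empty
        left, right = target[x <= threshold], target[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return lambda z: np.where(z <= threshold, left_value, right_value)


rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)  # synthetic target

# Boosting: each stump is fit to the residuals left by the ensemble so far
learning_rate = 0.3
prediction = np.full_like(y, y.mean())
errors = []
for _ in range(50):
    residual = y - prediction
    stump = fit_stump(x, residual)
    prediction = prediction + learning_rate * stump(x)
    errors.append(np.mean((y - prediction) ** 2))

print(f"MSE after 1 round: {errors[0]:.3f}, after 50 rounds: {errors[-1]:.3f}")
```

A Random Forest would instead fit every tree to the original `y` on a resampled copy of the data and average them, so no tree ever sees the mistakes of another. Chasing residuals is also why boosting can end up fitting noise if it runs for too many rounds.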

Before building any models we have to prepare some of the data. Let's remove the columns we found to be insignificant, like Name and Gross Balance. Also, we can remove GrAppv and DisbursementGross since they were extremely similar to SBA_Appv. Of these amount columns, SBA_Appv had the highest correlation with Paid, so that's the column I'll move forward with.

As it stands, scikit-learn Random Forests and Gradient Boost cannot handle categorical data, so we'll have to format our State and Industry Columns somehow. Rather than one hot encoding, I'd rather simply map the average default rates for each state and industry directly into the data, replacing the names of the states and industries. I think this will still convey the necessary information to our model without using unnecessary storage with one hot encoding.

In [None]:
df = df.drop(['LoanNr_ChkDgt','Bank','GrAppv','DisbursementGross','Name', 'City', 'MIS_Status', 'ApprovalDate', 
              'Zip','BankState', 'DisbursementDate','ChgOffPrinGr','BalanceGross'], axis = 1)

# take the natural log to fix skew
df['SBA_Appv'] = np.log(df['SBA_Appv'])

# mapping state and industry default rates

state_def = {'MT':.068, 'WY':.069,'VT':.073,'ND':.076,'SD':.078,'ME':.096,
            'NH':.105,'NM':.107,'NE':.112,'AK':.114,'IA':.115,'MN':.116,
            'RI':.118,'WI':.121,'MA':.127,'KS':.129,'WA':.133,'CT':.136,
            'ID':.141,'PA':.145,'OR':.149,'MO':.151,'HI':.153,'OK':.154,
            'MS':.157,'WV':.162,'OH':.163,'AL':.165,'AR':.167,'IN':.175,
            'UT':.175,'DE':.175,'CA':.177,'CO':.178,'VA':.180,'LA':.181,
            'NC':.184,'TX':.186,'MD':.191,'KY':.192,'SC':.192,'NY':.195,
            'NJ':.195,'AZ':.203,'TN':.206,'MI':.225,'NV':.225,'IL':.225,
            'GA':.227,'DC':.235,'FL':.257}

df['State'] = df['State'].map(state_def)

ind_def = {'Accomodation and Food Services':.217,'Administrative and Support and Waste Management':.225,
          'Agriculture, Fishing, Forestry, and Hunting':.089,'Arts, Entertainment and Recreation':.202,
          'Construction':.227,'Educational Services':.236,'Finance and Insurance':.276,
          'Health Care and Social Assistance':.101,'Information':.242,'Management':.098,
          'Manufacturing':.149,'Mining, Quarrying, Oil and Gas Extraction':.083,
          'Other Services':.191,'Profesisonal, Scientific, and Technical Services':.184,
          'Public Administration':.155,'Real Estate and Rental Leasing':.279,
          'Retail Trade':.222,'Transport and Warehouse':.258,'Utilities':.137,
          'Wholesale Trade':.187}

df['NAICS'] = df['NAICS'].map(ind_def)
df = df.rename(columns={'State':'State Default Rate','NAICS':'Industry Default Rate'})
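
One caveat with mapping default rates into the features: rates computed from the whole dataset include the outcomes of loans that later land in the test set. A hedged alternative is to learn the rates from the training rows only, as in this sketch. The `train` and `test` frames are made up, and the fallback to the overall training rate for unseen states is an assumption.

```python
import pandas as pd

# Made-up split; WY never appears in the training rows
train = pd.DataFrame({"State": ["FL", "FL", "MT", "MT", "MT"],
                      "Paid":  [0,    1,    1,    1,    1]})
test = pd.DataFrame({"State": ["FL", "MT", "WY"]})

# Default rate per state, learned from the training rows only
state_rate = 1 - train.groupby("State")["Paid"].mean()
overall_rate = 1 - train["Paid"].mean()  # fallback for unseen states

train["State Default Rate"] = train["State"].map(state_rate)
test["State Default Rate"] = test["State"].map(state_rate).fillna(overall_rate)
print(test)
```

The same pattern would apply to the industry column. On a dataset this large the difference is probably small, but it keeps the test accuracy an honest estimate.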

In [None]:
# splitting up our data and getting everything ready
y = df['Paid']
x = df.drop(['Paid'], axis = 1)
train_X, test_X, train_y, test_y = train_test_split(x, y, random_state = 0)
my_imputer = SimpleImputer()
train_X = my_imputer.fit_transform(train_X)
test_X = my_imputer.transform(test_X)

In [None]:
forest = RandomForestClassifier(n_estimators=100)
forest.fit(train_X, train_y)
forest_pred = forest.predict(test_X)
score = accuracy_score(test_y, forest_pred) * 100
print(f'Random Forest Accuracy: {score:.2f}%')

Our Random Forest model had over 94% accuracy! While I'm very satisfied with these results, let's see if we can improve on this using Gradient Boost.

In [None]:
gboost = GradientBoostingClassifier()
gboost.fit(train_X, train_y)
gboost_pred = gboost.predict(test_X)
score = accuracy_score(test_y, gboost_pred) * 100
print(f'Gradient Boost Accuracy: {score:.2f}%')

Gradient Boost actually performed slightly worse than the standard Random Forest, which is surprising to me. I believe that this might be due to the amount of noise within the dataset that we used, or to the fact that we ran Gradient Boost with its default, untuned settings.
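
Accuracy alone can flatter a model when one class dominates, and most loans in this dataset were paid in full. The sketch below uses made-up labels (not the notebook's predictions) to show how a confusion matrix and the recall on the default class give a fuller picture.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels: 1 = paid in full, 0 = charged off (8 paid, 2 defaulted)
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
y_model = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])  # hypothetical model output
y_naive = np.ones_like(y_true)                       # always predict "paid"

print("naive accuracy:", accuracy_score(y_true, y_naive))
print("model accuracy:", accuracy_score(y_true, y_model))
# Recall on the default class: share of true defaults the model flags
print("default recall:", recall_score(y_true, y_model, pos_label=0))
# Rows are true classes, columns are predicted classes, in label order [0, 1]
print(confusion_matrix(y_true, y_model, labels=[0, 1]))
```

On the real data, the equivalent calls would take `test_y` with `forest_pred` or `gboost_pred`. Since a missed default is what costs the SBA money, recall on class 0 is arguably the number to watch.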

# **Conclusion and Next Steps**

It was very interesting to see the various risk factors for an SBA loan. I was most surprised at the variance of default rates across states in the US. Overall, I was very happy with the performance of our models, as both had over 90% accuracy.

However, it is clear that our analysis was missing something very important: financial statements. That our models had such great accuracy without any knowledge of the cash flows of each business was very impressive to me, and showed me the power of machine learning. In the future, I'd love to take a look at more financial data for businesses and determine some of the important metrics when considering how businesses are going to perform in the future.