# Paycheck Protection Program (PPP) Loans

In response to the negative impact of COVID-19 on the economy, the federal government declared a state of emergency and provided relief loans to small businesses to support them during the pandemic.  Following a FOIA request, federal judges decided that the loan data were important and necessary to disclose to the public.  There has been significant controversy over whether these loans were effective in boosting the depressed economy.  Many of the loans were forgiven.  We examine these data and link them with related datasets to find patterns.

In [1]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# dataset sourced from https://data.sba.gov/dataset/ppp-foia
# created March 2, 2021, 4:30 PM (UTC-08:00)
# updated October 3, 2023, 8:04 AM (UTC-07:00)
# split across 13 csv files - 12 files describe loans under 150k
# 1 file contains data on loans made above 150k

ppp_1 = pd.read_csv('../data/public_up_to_150k_1_230930.csv')

In [4]:
ppp_1.head()

Unnamed: 0,LoanNumber,DateApproved,SBAOfficeCode,ProcessingMethod,BorrowerName,BorrowerAddress,BorrowerCity,BorrowerState,BorrowerZip,LoanStatusDate,...,BusinessType,OriginatingLenderLocationID,OriginatingLender,OriginatingLenderCity,OriginatingLenderState,Gender,Veteran,NonProfit,ForgivenessAmount,ForgivenessDate
0,3509338307,01/22/2021,,PPS,Exemption 6,,,,,02/18/2022,...,Non-Profit Organization,,,,,Unanswered,Unanswered,Y,150775.38,01/13/2022
1,5375617707,05/01/2020,101.0,PPP,NOT AVAILABLE,,,,,07/16/2021,...,,9551.0,"Bank of America, National Association",CHARLOTTE,NC,Unanswered,Unanswered,,150083.01,06/11/2021
2,9677497701,05/01/2020,464.0,PPP,NORTH CHARLESTON HOSPITALITY GROUP LLC,192 College Park Rd,Ladson,,29456-3517,09/25/2021,...,Limited Liability Company(LLC),19248.0,Synovus Bank,COLUMBUS,GA,Unanswered,Unanswered,,141920.11,08/25/2021
3,9547167709,05/01/2020,464.0,PPP,Q AND J SERVICES LLC,301 Old Georgetown Road,Manning,,29102-2734,04/20/2021,...,Limited Liability Company(LLC),19248.0,Synovus Bank,COLUMBUS,GA,Unanswered,Unanswered,,137747.78,03/29/2021
4,8885207205,04/28/2020,,PPP,Exemption 6,,,,,05/20/2021,...,Non-Profit Organization,,,,,Unanswered,Unanswered,Y,131876.98,04/27/2021


In [5]:
ppp_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 900000 entries, 0 to 899999
Data columns (total 53 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   LoanNumber                   900000 non-null  int64  
 1   DateApproved                 900000 non-null  object 
 2   SBAOfficeCode                899972 non-null  float64
 3   ProcessingMethod             900000 non-null  object 
 4   BorrowerName                 899999 non-null  object 
 5   BorrowerAddress              899858 non-null  object 
 6   BorrowerCity                 899855 non-null  object 
 7   BorrowerState                899848 non-null  object 
 8   BorrowerZip                  899859 non-null  object 
 9   LoanStatusDate               879679 non-null  object 
 10  LoanStatus                   900000 non-null  object 
 11  Term                         900000 non-null  int64  
 12  SBAGuarantyPercentage        900000 non-null  int64  
 13 

There are 53 features, which include descriptive information such as the borrower's company name, the geographic location of the company, and the type of business.  Several target variables come to mind.  One is the status of the loan: if the loan had concluded by the time the dataset was collected, we can observe whether it was paid in full or charged off, with forgiveness recorded separately in ForgivenessAmount.  We could therefore model the outcome of a loan.

Another potential target is to model what percentage of a loan would be forgiven.
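
As a minimal sketch of how such a target could be derived, assume the forgiven share is defined as ForgivenessAmount divided by CurrentApprovalAmount. This definition, the helper name `forgiven_fraction`, the choice to treat a missing forgiveness amount as 0, and the toy values are all our assumptions, not something stated in the SBA data dictionary.

```python
import numpy as np
import pandas as pd

def forgiven_fraction(df: pd.DataFrame) -> pd.Series:
    """Share of the current approval amount that was forgiven (assumed definition)."""
    # A zero approval amount would cause a division by zero, so mark it as missing.
    amount = df['CurrentApprovalAmount'].replace(0, np.nan)
    # Assumption: a missing ForgivenessAmount means nothing was forgiven.
    forgiven = df['ForgivenessAmount'].fillna(0)
    return (forgiven / amount).rename('ForgivenFraction')

# Toy rows (not real loans) to exercise the helper.
toy = pd.DataFrame({
    'CurrentApprovalAmount': [1000.0, 2000.0, 0.0],
    'ForgivenessAmount': [1000.0, np.nan, 0.0],
})
frac = forgiven_fraction(toy)
print(frac)
```

Whether a missing ForgivenessAmount should count as 0 or be excluded is a modeling decision we would need to revisit once the column is inspected.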

If we can access data about all PPP loan applications, we can build a model to predict the approved loan amount.

We will update this section if other interesting research questions emerge.  We hypothesize that we can ask even more interesting questions if we can find related datasets about pandemic economic outcomes, geographically organized information about COVID disease burden such as hospitalizations, or local government policies on enforced isolation that reduced patronage of small businesses.  (*) One such augmented hypothesis might ask whether disease burden or isolation policies reduced small business revenue.

We took guidance from this source: https://www.jennifer-case.com/featured-projects

In [71]:
# Generate an easy to read data dictionary
dict_df = pd.read_csv('../data/ppp-data-dictionary.csv', index_col=0)
dict_df['dtype'] = ppp_1.dtypes
dict_df.reset_index(inplace=True)

In [72]:
dict_df.iloc[:5, :]

Unnamed: 0,Field Name,Field Description,dtype
0,LoanNumber,Loan Number (unique identifier),int64
1,DateApproved,Loan Funded Date,object
2,SBAOfficeCode,SBA Origination Office Code,float64
3,ProcessingMethod,Loan Delivery Method (PPP for first draw; PPS ...,object
4,BorrowerName,Borrower Name,object


LoanNumber is a unique numeric identifier.

DateApproved is the date the loan was funded.

SBAOfficeCode is an identifier for which Small Business Administration office processed this loan application.

BorrowerName is the company name.

In [73]:
dict_df.iloc[3,1] # Greater detail on ProcessingMethod

'Loan Delivery Method (PPP for first draw; PPS for second draw)'

ProcessingMethod is categorical. It is designated PPP if the loan is a First Draw loan from the original program.

The loan program was revived and extended to March 31, 2021.  Under the extension, only businesses that had previously been granted a First Draw loan could qualify for a Second Draw loan.  A borrower could get a Second Draw loan before their First Draw funds were used up, but it would only be disbursed afterwards. These loans are designated PPS.

(Reference: https://www.cpmlaw.com/understanding-the-difference-between-the-new-and-old-ppp-loan-programs/)

In [75]:
dict_df.iloc[5:10,:]

Unnamed: 0,Field Name,Field Description,dtype
5,BorrowerAddress,Borrower Street Address,object
6,BorrowerCity,Borrower City,object
7,BorrowerState,Borrower State,object
8,BorrowerZip,Borrower Zip Code,object
9,LoanStatusDate,Loan Status Date\n- Loan Status Date is blank...,object


These fields record address and geographic location data for the borrower.  Notably, a geocoded version of these data with latitude and longitude is available (Source: https://www.geocod.io/geocoded-ppp-loan-data/). This can enable augmentation with census data, Congressional districts, and demographic data.

In [76]:
dict_df.iloc[9, 1]

'Loan Status Date\n- Loan Status Date is  blank when the loan is disbursed but not Paid In Full or Charged Off'

LoanStatusDate is the date on which the loan reached a final status, that is, Paid in Full or Charged Off.
When the loan has been disbursed but has reached neither status, the field is blank. This is an important feature to wrangle because it is tied to LoanStatus, a potential target variable for classification.

In [77]:
dict_df.iloc[10:15,:]

Unnamed: 0,Field Name,Field Description,dtype
10,LoanStatus,Loan Status Description\n- Loan Status is repl...,object
11,Term,Loan Maturity in Months,int64
12,SBAGuarantyPercentage,SBA Guaranty Percentage,int64
13,InitialApprovalAmount,Loan Approval Amount(at origination),float64
14,CurrentApprovalAmount,Loan Approval Amount (current),float64


In [78]:
dict_df.iloc[10, 1]

"Loan Status Description\n- Loan Status is replaced by 'Exemption 4' when the loan is disbursed but not Paid in Full or Charged Off"

LoanStatus can be Paid in Full or Charged Off.  However, if the loan is still owed, this will be designated Exemption 4.  We will add a more detailed description of what Exemption 4 includes.
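
To make the classification target concrete, here is a minimal sketch that maps LoanStatus to a binary label and leaves Exemption 4 loans unlabeled. The label coding (0 for Paid in Full, 1 for Charged Off), the helper name `loan_outcome`, and the toy series are our assumptions.

```python
import pandas as pd

# Hypothetical label coding; the choice of 0/1 is ours.
STATUS_TO_LABEL = {'Paid in Full': 0, 'Charged Off': 1}

def loan_outcome(status: pd.Series) -> pd.Series:
    """Map LoanStatus to 0/1; statuses without a final outcome become <NA>."""
    # Values not in the mapping (e.g. 'Exemption 4') map to NaN, then to <NA>.
    return status.map(STATUS_TO_LABEL).astype('Int64').rename('Outcome')

# Toy statuses using the three values observed in the data dictionary.
toy_status = pd.Series(['Paid in Full', 'Charged Off', 'Exemption 4'],
                       name='LoanStatus')
outcome = loan_outcome(toy_status)
print(outcome)
```

Rows with a missing outcome could be dropped from training or held out for later scoring.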

Term is the length of the loan period in months.

SBAGuarantyPercentage - the loans are issued by private lenders; if the small business defaults, the SBA guarantees repayment of this percentage of the loan amount.  The guaranty is intended to encourage lenders to grant credit by acting as a substitute for collateral.  There are some interesting secondary market effects from selling the guaranteed portion of the loan. (Refer here: https://www.wolterskluwer.com/en/expert-insights/sba-loan-guarantees)

InitialApprovalAmount is the amount of the loan.

CurrentApprovalAmount is the currently approved amount of the loan, which may have been increased since the loan was originally extended.  The dataset includes Second Draw loans, so this may describe the loan amount after a second draw.

In [84]:
dict_df.iloc[15:23,:]

Unnamed: 0,Field Name,Field Description,dtype
15,UndisbursedAmount,Undisbursed Amount,float64
16,FranchiseName,Franchise Name,object
17,ServicingLenderLocationID,Lender Location ID (unique identifier),float64
18,ServicingLenderName,Servicing Lender Name,object
19,ServicingLenderAddress,Servicing Lender Street Address,object
20,ServicingLenderCity,Servicing Lender City,object
21,ServicingLenderState,Servicing Lender State,object
22,ServicingLenderZip,Servicing Lender Zip Code,object


UndisbursedAmount - the amount of the loan which has not been disbursed yet.

FranchiseName is the name of franchise if applicable.

ServicingLenderLocationID, ServicingLenderName, ServicingLenderAddress, ServicingLenderCity, ServicingLenderState, ServicingLenderZip - identifying information for the private lender that services the loan.

In [87]:
dict_df.iloc[23:31,:]

Unnamed: 0,Field Name,Field Description,dtype
23,RuralUrbanIndicator,Rural or Urban Indicator (R/U),object
24,HubzoneIndicator,Hubzone Indicator (Y/N),object
25,LMIIndicator,LMI Indicator (Y/N),object
26,BusinessAgeDescription,Business Age Description,object
27,ProjectCity,Project City,object
28,ProjectCountyName,Project County Name,object
29,ProjectState,Project State,object
30,ProjectZip,Project Zip Code,object


RuralUrbanIndicator - identifies the business as located in a rural or urban location

HubzoneIndicator - whether the business is located in a HUBZone (Historically Underutilized Business Zone), that is, a low income, high poverty, or high unemployment community

LMIIndicator - whether the business is located in a low and moderate income (LMI) community

BusinessAgeDescription - how long the company has been in business

ProjectCity, ProjectCountyName, ProjectState, ProjectZip - address information for the project

TODO: we still need to look closer to determine what project this refers to

In [95]:
dict_df.iloc[31:43,:]

Unnamed: 0,Field Name,Field Description,dtype
31,CD,Project Congressional District,object
32,JobsReported,Number of Employees,float64
33,NAICSCode,NAICS 6 digit code,float64
34,Race,Borrower Race Description,object
35,Ethnicity,Borrower Ethnicity Description,object
36,UTILITIES_PROCEED,Note: Proceed data is lender reported at origi...,float64
37,PAYROLL_PROCEED,Note: Proceed data is lender reported at origi...,float64
38,MORTGAGE_INTEREST_PROCEED,Note: Proceed data is lender reported at origi...,float64
39,RENT_PROCEED,Note: Proceed data is lender reported at origi...,float64
40,REFINANCE_EIDL_PROCEED,Note: Proceed data is lender reported at origi...,float64


CD - The congressional district where this project was located

JobsReported - number of employees in this business aka business size

NAICSCode - North American Industry Classification System code describes what kind of business is conducted.  For example, agricultural products, or computer services.

Race, Ethnicity - Race, ethnicity of the borrower.  This information is relevant to the incentives created to prioritize racial and gender minority owned businesses for PPP aid.  These businesses may have qualified for other kinds of aid or benefits.

In [93]:
dict_df.iloc[36,1]

'Note: Proceed data is lender reported at origination.  On the PPP application the proceeds fields were check boxes.  '

UTILITIES_PROCEED, PAYROLL_PROCEED, MORTGAGE_INTEREST_PROCEED, RENT_PROCEED, REFINANCE_EIDL_PROCEED, HEALTH_CARE_PROCEED 

These categories were checkboxes on the loan application. Loan proceeds are the funds disbursed as part of the loan, and these categories describe what the proceeds would be used for, for example covering payroll, mortgage interest, or health care for workers.

In [96]:
dict_df.iloc[43:,:]

Unnamed: 0,Field Name,Field Description,dtype
43,BusinessType,Business Type Description,object
44,OriginatingLenderLocationID,Originating Lender ID (unique identifier),float64
45,OriginatingLender,Originating Lender Name,object
46,OriginatingLenderCity,Originating Lender City,object
47,OriginatingLenderState,Originating Lender State,object
48,Gender,Gender Indicator,object
49,Veteran,Veteran Indicator,object
50,NonProfit,'Yes' if Business Type = Non-Profit Organizati...,object
51,ForgivenessAmount,Forgiveness Amount,float64
52,ForgivenessDate,Forgiveness Paid Date,object


BusinessType - whether the business is a non profit, or other designation

OriginatingLenderLocationID, OriginatingLender, OriginatingLenderCity, OriginatingLenderState - identifying information about the originating lender.  This is the lender which approved and structured the loan.  This stands in contrast with the servicing lender, which administers the disbursement of the funds and loan repayment.  Sometimes, the same lender handles both tasks.

Gender, Veteran - whether the borrower has a special status which the PPP is directed to prioritize for additional benefits

NonProfit - binary category of whether the business is a nonprofit organization

ForgivenessAmount - The amount (full or partial) of the loan which was forgiven.  If a borrower spent the loan proceeds according to the PPP rules, the SBA forgives the corresponding amount of the loan.

ForgivenessDate - The date when the ForgivenessAmount was paid.

### Cleaning up the datatypes

In [103]:
# TODO make a pipeline to process all the separate csvs this way
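
One possible shape for that pipeline is sketched below, assuming every csv shares the same columns and MM/DD/YYYY date strings. The helper names `clean_ppp` and `load_ppp`, the `DATE_COLUMNS` list, and the in-memory toy files are hypothetical; in practice the list passed to `load_ppp` would hold the paths of the 13 downloaded csv files.

```python
import io
import pandas as pd

# Columns we assume hold MM/DD/YYYY date strings in every file.
DATE_COLUMNS = ['DateApproved', 'LoanStatusDate', 'ForgivenessDate']

def clean_ppp(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the same dtype cleanup to one PPP csv chunk."""
    df = df.copy()
    for col in DATE_COLUMNS:
        if col in df.columns:
            # Unparseable or blank values become NaT instead of raising.
            df[col] = pd.to_datetime(df[col], format='%m/%d/%Y', errors='coerce')
    return df

def load_ppp(sources) -> pd.DataFrame:
    """Read, clean, and concatenate csv files (paths or file-like objects)."""
    frames = [clean_ppp(pd.read_csv(src)) for src in sources]
    return pd.concat(frames, ignore_index=True)

# Toy stand-ins for two csv files so the sketch runs without the real data.
part_a = io.StringIO(
    "LoanNumber,DateApproved,LoanStatusDate\n"
    "1,05/01/2020,07/16/2021\n"
    "2,04/28/2020,\n"
)
part_b = io.StringIO(
    "LoanNumber,DateApproved,LoanStatusDate\n"
    "3,01/22/2021,02/18/2022\n"
)
ppp_all = load_ppp([part_a, part_b])
print(ppp_all.dtypes)
```

Keeping the per-file cleanup in one function means any later fix (for example to JobsReported) is applied to every file the same way.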

In [104]:
ppp_1['DateApproved'] = pd.to_datetime(ppp_1['DateApproved'])
ppp_1['DateApproved'].head()

0   2021-01-22
1   2020-05-01
2   2020-05-01
3   2020-05-01
4   2020-04-28
Name: DateApproved, dtype: datetime64[ns]

In [115]:
ppp_1['LoanNumber'].head()

0    3509338307
1    5375617707
2    9677497701
3    9547167709
4    8885207205
Name: LoanNumber, dtype: int64

In [118]:
ppp_1['SBAOfficeCode'].describe()

count   899,972.00
mean        827.96
std         193.93
min         101.00
25%         669.00
50%         914.00
75%         942.00
max       9,030.00
Name: SBAOfficeCode, dtype: float64

In [120]:
ppp_1['ProcessingMethod'].describe()

count     900000
unique         2
top          PPP
freq      653482
Name: ProcessingMethod, dtype: object

In [123]:
ppp_1['BorrowerName'].unique()

array(['Exemption 6', 'NOT AVAILABLE',
       'NORTH CHARLESTON HOSPITALITY GROUP LLC', ...,
       'PRIYANKARA KATUGAMPOLAGE', 'ANTHONY ALFORD',
       'BRYAN KREFT LAW OFFICES'], dtype=object)

In [125]:
ppp_1['LoanStatusDate'] = pd.to_datetime(ppp_1['LoanStatusDate'])
ppp_1[['LoanStatusDate', 'LoanStatus']].head()

Unnamed: 0,LoanStatusDate,LoanStatus
0,2022-02-18,Paid in Full
1,2021-07-16,Paid in Full
2,2021-09-25,Paid in Full
3,2021-04-20,Paid in Full
4,2021-05-20,Paid in Full


In [170]:
# ppp_1 = pd.read_csv('../data/public_up_to_150k_1_230930.csv')
ppp_1['LoanStatusDate'].value_counts()

LoanStatusDate
04/16/2021    20556
03/22/2022    14774
09/28/2021    13955
09/25/2021    13493
10/16/2021    12571
              ...  
05/02/2021        1
06/24/2023        1
10/10/2021        1
10/30/2020        1
01/30/2023        1
Name: count, Length: 925, dtype: int64

In [172]:
np.sum(ppp_1['LoanStatusDate'].isna()) / ppp_1['LoanStatusDate'].size

0.022578888888888888

About 2.26% of loans do not have a recorded Loan Status Date, corresponding to loans which have been disbursed but are neither Paid in Full nor Charged Off.  This may include loans which are partially repaid.

In [175]:
ppp_1[ppp_1['LoanStatusDate'].isna()][['BusinessType', 'BorrowerName', 'CurrentApprovalAmount']]

Unnamed: 0,BusinessType,BorrowerName,CurrentApprovalAmount
65,Self-Employed Individuals,Exemption 6,20468.00
153,Sole Proprietorship,KIANNA MITCHELL,20000.00
296,Corporation,FRONTIER AUTO SALES LLC,143462.00
532,Sole Proprietorship,STACY GOULD,134418.00
558,Trust,"ALASKA 360, LLC",133394.00
...,...,...,...
899891,Sole Proprietorship,CAMERON MOLETTE,20833.00
899899,Independent Contractors,HAIDEE GOMEZ,20833.00
899928,Sole Proprietorship,ALEXIS LEE,20833.00
899979,Self-Employed Individuals,XI LUO,20833.00


In [176]:
ppp_1['LoanStatus'].value_counts()

LoanStatus
Paid in Full    839306
Charged Off      40373
Exemption 4      20321
Name: count, dtype: int64

In [131]:
ppp_1[['SBAGuarantyPercentage', 'Term']].describe()

Unnamed: 0,SBAGuarantyPercentage,Term
count,900000.0,900000.0
mean,100.0,43.25
std,0.0,17.89
min,100.0,0.0
25%,100.0,24.0
50%,100.0,60.0
75%,100.0,60.0
max,100.0,136.0


In [137]:
loan_sz_inc = ppp_1[ppp_1['InitialApprovalAmount'] < ppp_1['CurrentApprovalAmount']]

In [142]:
# By how much were loan amounts increased?
percent_inc = (loan_sz_inc['CurrentApprovalAmount'] - loan_sz_inc['InitialApprovalAmount'])/loan_sz_inc['InitialApprovalAmount']

In [143]:
percent_inc.describe()

count   9,914.00
mean         inf
std          NaN
min         0.00
25%         0.06
50%         0.32
75%         1.21
max          inf
dtype: float64

There are 9,914 loans in this file whose approval amount was increased.  The median increase was about 32% of the original loan size.
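
The infinite mean and maximum in the summary above most likely come from loans whose InitialApprovalAmount is 0, where the ratio divides by zero. A minimal sketch of one way to handle this treats the relative increase as undefined for those rows; the helper name `relative_increase` and the toy values are our assumptions.

```python
import numpy as np
import pandas as pd

def relative_increase(initial: pd.Series, current: pd.Series) -> pd.Series:
    """(current - initial) / initial, with NaN where initial is not positive."""
    # Series.where keeps values where the condition holds and sets the rest to NaN.
    safe_initial = initial.where(initial > 0)
    return (current - initial) / safe_initial

# Toy loans: one normal increase, one starting from 0, one unchanged.
toy = pd.DataFrame({
    'InitialApprovalAmount': [100.0, 0.0, 200.0],
    'CurrentApprovalAmount': [150.0, 500.0, 200.0],
})
inc = relative_increase(toy['InitialApprovalAmount'], toy['CurrentApprovalAmount'])
print(inc.describe())  # NaN rows are skipped, so the mean stays finite
```

Loans that started at 0 could instead be reported separately, since their absolute increase may still be informative.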

In [144]:
ppp_1['UndisbursedAmount'].describe()

count   899,943.00
mean          0.08
std          71.39
min           0.00
25%           0.00
50%           0.00
75%           0.00
max      67,720.00
Name: UndisbursedAmount, dtype: float64

In [147]:
np.sum(ppp_1['UndisbursedAmount'] > 0)  # only one loan was not fully disbursed

1

In [154]:
ppp_1[ppp_1['FranchiseName'].notna()]['FranchiseName'].head(10)

383    Best Western - Membership Agreement
404                          Vision Source
502                               Pita Pit
521                           Ace Hardware
560                                 Subway
608    Best Western - Membership Agreement
744                                Re-Bath
745                                Re-Bath
899                             SportClips
919                Four Points by Sheraton
Name: FranchiseName, dtype: object

Whether a small business is part of a franchise may signal its robustness for loan repayment.

In [166]:
serv_orig_lender = ppp_1[['CurrentApprovalAmount', 'ServicingLenderName', 'OriginatingLender']]
diff_lenders_mask = (ppp_1['ServicingLenderName'] != ppp_1['OriginatingLender']) & ppp_1['OriginatingLender'].notna()
print(serv_orig_lender[diff_lenders_mask]['OriginatingLender'].value_counts()[:10])
print(serv_orig_lender[diff_lenders_mask]['ServicingLenderName'].value_counts()[:10])

OriginatingLender
BSD Capital, LLC dba Lendistry       17088
US Bank National Association         11025
Kabbage, Inc.                         6648
First Source Federal Credit Union     6572
Celtic Bank Corporation               5411
First Republic Bank                   5111
Readycap Lending, LLC                 5028
Northeast Bank                        2839
Cadence Bank                          2303
WebBank                               2297
Name: count, dtype: int64
ServicingLenderName
Harvest Small Business Finance, LLC          14373
Loan Source Incorporated                     11353
U.S. Bank, National Association              11026
Customers Bank                               10416
PNC Bank, National Association                6572
Square Capital, LLC                           5411
JPMorgan Chase Bank, National Association     5111
Quontic Bank                                  3246
Fed – Kabbage                                 3131
Lendistry SBLC, LLC                           2

Originating and servicing lenders may be different entities.

In [167]:
ppp_1['BusinessType'].value_counts()

BusinessType
Corporation                            268729
Sole Proprietorship                    232108
Limited  Liability Company(LLC)        164490
Subchapter S Corporation                97160
Self-Employed Individuals               50255
Independent Contractors                 40116
Non-Profit Organization                 18654
Partnership                             12323
Limited Liability Partnership            5520
Single Member LLC                        3979
Professional Association                 3432
Cooperative                               978
501(c)3 – Non Profit                      856
501(c)6 – Non Profit Membership           422
Trust                                     249
Non-Profit Childcare Center               238
Tenant in Common                          116
Qualified Joint-Venture (spouses)          84
Joint Venture                              79
Employee Stock Ownership Plan(ESOP)        39
Tribal Concerns                            10
501(c)19 – Non Profit

BusinessType describes the legal form of the borrower, such as a corporation, a non-profit organization, or a self-employed individual.

In [181]:
np.sum(ppp_1['RuralUrbanIndicator'].isna())

0

In [189]:
print(np.sum(ppp_1['HubzoneIndicator'].isna()))

0


There are no missing values in RuralUrbanIndicator or HubzoneIndicator. There is a significant fraction of borrowers in HUBZones.
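
The size of that fraction can be checked directly. The sketch below uses a toy series rather than the real column, and the helper name `indicator_share` is hypothetical; applied to `ppp_1['HubzoneIndicator']` it would report the actual shares.

```python
import pandas as pd

def indicator_share(col: pd.Series) -> pd.Series:
    """Fraction of rows per indicator value; missing values count as their own class."""
    return col.value_counts(normalize=True, dropna=False)

# Toy Y/N indicator values, not taken from the dataset.
toy_hubzone = pd.Series(['Y', 'N', 'N', 'N'], name='HubzoneIndicator')
share = indicator_share(toy_hubzone)
print(share)
```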

In [190]:
ppp_1['BusinessAgeDescription'].head()

0    Existing or more than 2 years old
1      New Business or 2 years or less
2    Existing or more than 2 years old
3    Existing or more than 2 years old
4                           Unanswered
Name: BusinessAgeDescription, dtype: object

In [191]:
ppp_1['BusinessAgeDescription'].value_counts()

BusinessAgeDescription
Existing or more than 2 years old         786363
New Business or 2 years or less            76518
Unanswered                                 36762
Startup, Loan Funds will Open Business       234
Change of Ownership                          123
Name: count, dtype: int64

The Business Age Description should be translated to a more intuitive numeric or categorical classification. We can see there are a few classes of businesses with the majority being businesses that have operated for more than 2 years.

In [192]:
# TODO decide how to best encode business age
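
One candidate encoding is an ordered categorical, sketched below. The ordering in `AGE_ORDER`, the decision to treat 'Unanswered' and 'Change of Ownership' as missing, and the helper name `encode_business_age` are our assumptions and should be revisited before modeling.

```python
import pandas as pd

# Hypothetical ordering from youngest to oldest business.
# 'Unanswered' and 'Change of Ownership' carry no clear age, so they are
# left out and end up as missing values.
AGE_ORDER = [
    'Startup, Loan Funds will Open Business',
    'New Business or 2 years or less',
    'Existing or more than 2 years old',
]

def encode_business_age(col: pd.Series) -> pd.Series:
    """Return an ordered categorical; labels outside AGE_ORDER become NaN."""
    known = col.where(col.isin(AGE_ORDER))  # unknown labels -> NaN
    cat = pd.Categorical(known, categories=AGE_ORDER, ordered=True)
    return pd.Series(cat, index=col.index, name=col.name)

# Toy values drawn from the labels seen in the value counts.
toy_age = pd.Series(
    ['Existing or more than 2 years old', 'Unanswered',
     'New Business or 2 years or less'],
    name='BusinessAgeDescription',
)
encoded = encode_business_age(toy_age)
print(encoded.cat.codes)  # integer codes; -1 marks a missing value
```

An alternative is one-hot encoding all five labels, which keeps 'Unanswered' as its own signal at the cost of losing the ordering.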

In [201]:
np.sum(ppp_1['JobsReported'].isna()) # One business does not report their number of employees

1

In [203]:
# Fill only JobsReported; DataFrame.where would overwrite every column of that row with 0
ppp_1['JobsReported'] = ppp_1['JobsReported'].fillna(0)

In [204]:
ppp_1['JobsReported'] = ppp_1['JobsReported'].astype('int32')  # assign so the cast persists
ppp_1['JobsReported']

0          15
1          12
2           3
3         170
4          14
         ... 
899995      1
899996      1
899997      1
899998      1
899999      1
Name: JobsReported, Length: 900000, dtype: int32

In [205]:
ppp_1['NAICSCode'].value_counts()

NAICSCode
722,511.00    31839
812,112.00    28527
621,210.00    28197
621,111.00    21841
541,110.00    19959
              ...  
311,711.00        1
517,212.00        1
313,311.00        1
333,312.00        1
454,112.00        1
Name: count, Length: 1130, dtype: int64
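
NAICSCode is an identifier rather than a quantity, so the float display above (for example 722,511.00) is misleading. A minimal sketch of one way to store it as text, assuming missing codes should stay missing; the helper name `naics_to_str` and the toy series are hypothetical.

```python
import numpy as np
import pandas as pd

def naics_to_str(col: pd.Series) -> pd.Series:
    """Convert float NAICS codes to digit strings, keeping missing values as <NA>."""
    # Nullable Int64 drops the trailing '.0' while preserving missing entries.
    return col.astype('Int64').astype('string')

# Toy codes: two of the frequent codes from the output above and one missing value.
toy_naics = pd.Series([722511.0, np.nan, 812112.0], name='NAICSCode')
naics = naics_to_str(toy_naics)
sector = naics.str[:2]  # the first two digits identify the NAICS sector
print(naics)
```

Grouping by `sector` would reduce the 1,130 distinct codes to a much smaller number of categories for a baseline model.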

We note that there are considerations for how to clean the data that are not yet obvious.  We will first visualize some of the most important features and do some selected EDA, with the goal of building the baseline classification / prediction model.

We also have to sort out the csv backup issue.