# Analyzing PPP Loans Data for Bay Area

*Based on ZIP Codes*

In [1]:
import pandas as pd
import numpy as np
import re
import warnings
import os
from os.path import dirname, abspath

warnings.filterwarnings('ignore')

In [2]:
# Get Current Working Directory and Parent Path (for reading files in different folders)
## Source: https://stackoverflow.com/questions/30218802/get-parent-of-current-directory-from-python-script/30218825
d = dirname(dirname(abspath(os.getcwd())))
d

'C:\\Users\\Trang\\Desktop\\afn'
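
The base path above is later joined to subfolders with `+ '/...'`, which yields mixed separators on Windows. A minimal sketch of an alternative using `pathlib` follows; `project_path`, `base_dir`, and `naics_csv` are hypothetical names that do not appear elsewhere in this notebook.

```python
from pathlib import Path

def project_path(base, *parts):
    """Join components under base; Path picks the separator for the current OS."""
    return Path(base).joinpath(*parts)

# Two levels above the working directory, mirroring
# dirname(dirname(abspath(os.getcwd()))) from the cell above.
base_dir = Path.cwd().parent.parent

# pd.read_csv accepts Path objects as well as strings.
naics_csv = project_path(base_dir, 'ppp-loan-data', 'naics', 'naics_data_rsei_v238.csv')
print(naics_csv.name)
```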

## Part 1: Load Other Tables for PPP Loan Data

The following datasets are external to the PPP Loan Datasets and will be used to match and filter some columns in the PPP Loan files.

`NAICS_codes`: Concordance between 6-digit and 4-digit NAICS codes. It is merged with `PPP_combined` on the 6-digit code to attach the matching 4-digit code.

`bay_area_zip`: New dataset with the zip codes of the whole Bay Area (might need to filter for 6 cities)

In [3]:
#Import NAICS 6 and 4 digit concordance table

NAICS_codes = pd.read_csv(d + '/ppp-loan-data/naics/naics_data_rsei_v238.csv')
NAICS_codes['2017NAICSCode'] = NAICS_codes['2017NAICSCode'].apply(str)
NAICS_codes.head()

Unnamed: 0,2017NAICSCode,LongName,Changed2017,TRIIndustrySector,IndustrySubsector,4DigitNAICS,NewIndustry
0,111110,Soybean Farming,,999 Other,1111 Oilseed and Grain Farming,1111,False
1,111120,Oilseed (except Soybean) Farming,,999 Other,1111 Oilseed and Grain Farming,1111,False
2,111130,Dry Pea and Bean Farming,,999 Other,1111 Oilseed and Grain Farming,1111,False
3,111140,Wheat Farming,,999 Other,1111 Oilseed and Grain Farming,1111,False
4,111150,Corn Farming,,999 Other,1111 Oilseed and Grain Farming,1111,False


In [4]:
#Import Bay Area Zip Codes (New Dataset as of 5/28/2021) to filter from PPP Loan data
bay_area_zip = pd.read_csv(d + '/ppp-loan-data/bay_zipcodes/cb_2018_us_zcta510_500k_BAY.csv')
bay_area_zip = bay_area_zip.rename(columns={'ZCTA5CE10':'zip_code_5'})
bay_area_zip['zip_code_5'] = bay_area_zip['zip_code_5'].apply(str)
bay_area_zip.head()

Unnamed: 0,zip_code_5,AFFGEOID10,GEOID10,ALAND10,AWATER10
0,94952,8600000US94952,94952,466821663,2998248
1,94127,8600000US94127,94127,4585722,9359
2,95363,8600000US95363,95363,305162606,1236641
3,95441,8600000US95441,95441,263937111,4704473
4,94574,8600000US94574,94574,330686479,5409337
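
Converting the zip codes to strings after reading works here because Bay Area zip codes start with 9. As a hedged aside, the sketch below uses made-up inline CSV text (not the real file) to show what happens to a zip code with a leading zero. Passing `dtype` to `pd.read_csv` is one way to keep the column as text from the start.

```python
import io
import pandas as pd

# Toy CSV text for illustration only; 02134 is not a Bay Area zip code.
csv_text = "ZCTA5CE10,ALAND10\n94952,466821663\n02134,1000\n"

# Default parsing infers an integer column, so the leading zero is dropped.
as_int = pd.read_csv(io.StringIO(csv_text))
converted_later = as_int['ZCTA5CE10'].astype(str).tolist()

# Reading the column as str keeps the text exactly as written.
as_str = pd.read_csv(io.StringIO(csv_text), dtype={'ZCTA5CE10': str})
read_as_text = as_str['ZCTA5CE10'].tolist()

print(converted_later)
print(read_as_text)
```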


## Part 2: Sample Analysis on one CSV file

`public_up_to_150k_1.csv` is used to:
1. Analyze NaN values in the PPP Loans data (see Questions and Notes in Part 2a).
2. Split the full dataframe in 2: NaN Zip Codes & Non-NaN Zip Codes.
3. Add a modified zip code column to the Non-NaN dataframe.
4. Extract only Bay Area Zip Codes using the table from Part 1.

**TODO:** 
- Analyze NaN Zips and add Bay Area & Minority-Owned ones to `bay_area_ppp`
- Merge the NAICS 4-digit codes
- Apply analysis to all files to make `PPP_combined`

In [3]:
# Read in one file
ppp_sub_150_1 = pd.read_csv(d + '/ppp-loan-data/ppp_loan_datasets/public_up_to_150k_1.csv')
ppp_sub_150_1.head()

Unnamed: 0,LoanNumber,DateApproved,SBAOfficeCode,ProcessingMethod,BorrowerName,BorrowerAddress,BorrowerCity,BorrowerState,BorrowerZip,LoanStatusDate,...,HEALTH_CARE_PROCEED,DEBT_INTEREST_PROCEED,BusinessType,OriginatingLenderLocationID,OriginatingLender,OriginatingLenderCity,OriginatingLenderState,Gender,Veteran,NonProfit
0,5375617707,05/01/2020,101.0,PPP,NOT AVAILABLE,,,,,,...,,,,9551.0,"Bank of America, National Association",CHARLOTTE,NC,Unanswered,Unanswered,
1,9677497701,05/01/2020,464.0,PPP,NORTH CHARLESTON HOSPITALITY GROUP LLC,192 College Park Rd,Ladson,,29456-3517,,...,,,Limited Liability Company(LLC),19248.0,Synovus Bank,COLUMBUS,GA,Unanswered,Unanswered,
2,9547167709,05/01/2020,464.0,PPP,Q AND J SERVICES LLC,301 Old Georgetown Road,Manning,,29102-2734,04/20/2021,...,,,Limited Liability Company(LLC),19248.0,Synovus Bank,COLUMBUS,GA,Unanswered,Unanswered,
3,6448037706,05/01/2020,515.0,PPP,OPTIMIZED PROCESS SOLUTIONS DBA AAA INDUSTRIES,24500 CAPITOL,REDFORD,,48239-2446,04/16/2021,...,,,Limited Liability Company(LLC),9551.0,"Bank of America, National Association",CHARLOTTE,NC,Male Owned,Non-Veteran,
4,9609017706,05/01/2020,464.0,PPP,"D2 POWER SPORTS, LLC",125 Simuel Dr.,Spartanburg,,29303-2085,,...,,,Limited Liability Company(LLC),19248.0,Synovus Bank,COLUMBUS,GA,Unanswered,Unanswered,


### Part 2a. Analyze NaN Zip Codes

Looking into businesses with little to no location information.

**Questions:**

1. What should we do with NaN zip codes? What if it is minority-owned but area is unknown?

**Notes:**

There are 124 NaN Zip Codes in `ppp_sub_150_1`. Should we drop them all? Or manually filter them?

Idea - Automate Process:

1. In each table, get all NaN Zip Codes -> New DF
2. In new DF, get Race and Ethnicity where it is answered
3. Filter for anything !White and !Not Hispanic or Latino
4. View DF with indices of minority businesses with no Zip Codes -> Manually Check?
5. If 'BorrowerState' or 'ProjectState' == CA: keep, else: drop from table
6. Manually check remaining rows.
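
The automation idea above could be sketched roughly as follows. This is only one reading of steps 1 to 5, run on made-up rows: `triage_nan_zips` is a hypothetical helper, and it assumes "minority" means a Race other than `White`/`Unanswered` or an Ethnicity of `Hispanic or Latino`. Column names follow the PPP files.

```python
import pandas as pd

def triage_nan_zips(df):
    """Return NaN-zip rows that look minority-owned and are tagged as CA."""
    # Step 1: rows without a zip code
    nan_rows = df[df['BorrowerZip'].isna()]

    # Steps 2-3: answered Race other than White, or Hispanic or Latino ethnicity
    race = nan_rows['Race'].fillna('Unanswered')
    minority = (~race.isin(['White', 'Unanswered'])
                | (nan_rows['Ethnicity'] == 'Hispanic or Latino'))
    candidates = nan_rows[minority]

    # Step 5: keep only rows where either state column says CA
    in_ca = ((candidates['BorrowerState'] == 'CA')
             | (candidates['ProjectState'] == 'CA'))
    return candidates[in_ca]

# Made-up rows for illustration only
toy = pd.DataFrame({
    'BorrowerZip':   [None, None, None, '94107', None],
    'Race':          ['White', 'Asian', 'Black or African American', 'Asian', 'Unanswered'],
    'Ethnicity':     ['Not Hispanic or Latino', 'Not Hispanic or Latino',
                      'Not Hispanic or Latino', 'Unknown/NotStated', 'Hispanic or Latino'],
    'BorrowerState': [None, 'CA', None, 'CA', None],
    'ProjectState':  [None, None, 'MI', 'CA', 'CA'],
})

to_check = triage_nan_zips(toy)
print(to_check.index.tolist())
```

Rows with no state in either column are dropped by this reading of step 5, so a looser rule would be needed to keep them for the manual check in step 6.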

In [19]:
# NaN Example: No location information, nothing on Race and Ethnicity
ppp_sub_150_1.loc[0]

LoanNumber                                                5375617707
DateApproved                                              05/01/2020
SBAOfficeCode                                                    101
ProcessingMethod                                                 PPP
BorrowerName                                           NOT AVAILABLE
BorrowerAddress                                                  NaN
BorrowerCity                                                     NaN
BorrowerState                                                    NaN
BorrowerZip                                                      NaN
LoanStatusDate                                                   NaN
LoanStatus                                               Exemption 4
Term                                                              24
SBAGuarantyPercentage                                            100
InitialApprovalAmount                                         148440
CurrentApprovalAmount             

In [17]:
# Count NaN Zip Codes (BorrowerZip) in table
ppp_sub_150_1['BorrowerZip'].isna().sum()

124

In [28]:
# Get rows where BorrowerZip is NaN
nan_zip_df = ppp_sub_150_1[ppp_sub_150_1['BorrowerZip'].isnull()]
nan_zip_df.head()

Unnamed: 0,LoanNumber,DateApproved,SBAOfficeCode,ProcessingMethod,BorrowerName,BorrowerAddress,BorrowerCity,BorrowerState,BorrowerZip,LoanStatusDate,...,HEALTH_CARE_PROCEED,DEBT_INTEREST_PROCEED,BusinessType,OriginatingLenderLocationID,OriginatingLender,OriginatingLenderCity,OriginatingLenderState,Gender,Veteran,NonProfit
0,5375617707,05/01/2020,101.0,PPP,NOT AVAILABLE,,,,,,...,,,,9551.0,"Bank of America, National Association",CHARLOTTE,NC,Unanswered,Unanswered,
5,9789867710,05/01/2020,101.0,PPP,VULCAN MACHINERY CORPORATION,,,,,,...,,,Corporation,57328.0,The Huntington National Bank,COLUMBUS,OH,Male Owned,Non-Veteran,
6,9589997709,05/01/2020,101.0,PPP,"TJK KITCHENS &AMP; BREWPUBS, LLC",,,,,,...,,,Limited Liability Company(LLC),57328.0,The Huntington National Bank,COLUMBUS,OH,Unanswered,Unanswered,
8,9662387700,05/01/2020,101.0,PPP,RON GOLDSTONE,,,,,,...,,,,57328.0,The Huntington National Bank,COLUMBUS,OH,Unanswered,Unanswered,
9,2767027201,04/16/2020,,PPP,Exemption 6,,,,,02/18/2021,...,0.0,0.0,Subchapter S Corporation,,,,,Unanswered,Unanswered,


In [31]:
# Get Race of these NaN rows
nan_zip_df[nan_zip_df['Race'] != 'Unanswered']['Race']

11                            White
15                            White
22                            White
25                            White
26                            White
48                            White
51                            White
67        Black or African American
76                            White
79                            White
83                            White
87                            White
95                            White
102                           White
103                           White
119                           White
120                           White
130                           White
466347                        Asian
Name: Race, dtype: object

In [32]:
# Get Ethnicity of these NaN rows
nan_zip_df[nan_zip_df['Ethnicity'] != 'Unknown/NotStated']['Ethnicity']

11        Not Hispanic or Latino
22        Not Hispanic or Latino
25        Not Hispanic or Latino
26        Not Hispanic or Latino
48        Not Hispanic or Latino
51        Not Hispanic or Latino
67        Not Hispanic or Latino
76        Not Hispanic or Latino
79        Not Hispanic or Latino
83        Not Hispanic or Latino
87        Not Hispanic or Latino
95        Not Hispanic or Latino
102       Not Hispanic or Latino
103       Not Hispanic or Latino
119       Not Hispanic or Latino
120       Not Hispanic or Latino
130       Not Hispanic or Latino
133           Hispanic or Latino
466347    Not Hispanic or Latino
Name: Ethnicity, dtype: object

***Examples of NaN, Minority-owned businesses***

In [39]:
'''
BorrowerName                                 SUNKISSBIZ LLC
BorrowerAddress                                         NaN
BorrowerCity                                            NaN
BorrowerState                                           NaN
BorrowerZip                                             NaN
...
NAICSCode                                            611610
Race                              Black or African American
Ethnicity                            Not Hispanic or Latino
'''
nan_zip_df.loc[67] #Located in MI -> Drop

In [40]:
'''
BorrowerName                                       ART N FUN STUDIO, INC
BorrowerAddress                                                      NaN
BorrowerCity                                                         NaN
BorrowerState                                                         CA
BorrowerZip                                                          NaN
...
NAICSCode                                                         712110
Race                                                               Asian
Ethnicity                                         Not Hispanic or Latino
'''
nan_zip_df.loc[466347] # Located In Santa Clara, CA, 95051 -> Keep?

In [41]:
'''
BorrowerName                           Exemption 6
BorrowerAddress                                NaN
BorrowerCity                                   NaN
BorrowerState                                  NaN
BorrowerZip                                    NaN
...
NAICSCode                                   561720
Race                                    Unanswered
Ethnicity                       Hispanic or Latino
'''
nan_zip_df.loc[133] # No location information

### Part 2b. Split Dataframe - Get Non-NaN Dataframe 


In [88]:
not_nan_df = ppp_sub_150_1[ppp_sub_150_1['BorrowerZip'].notnull()]

not_nan_df.head()

Unnamed: 0,LoanNumber,DateApproved,SBAOfficeCode,ProcessingMethod,BorrowerName,BorrowerAddress,BorrowerCity,BorrowerState,BorrowerZip,LoanStatusDate,...,HEALTH_CARE_PROCEED,DEBT_INTEREST_PROCEED,BusinessType,OriginatingLenderLocationID,OriginatingLender,OriginatingLenderCity,OriginatingLenderState,Gender,Veteran,NonProfit
1,9677497701,05/01/2020,464.0,PPP,NORTH CHARLESTON HOSPITALITY GROUP LLC,192 College Park Rd,Ladson,,29456-3517,,...,,,Limited Liability Company(LLC),19248.0,Synovus Bank,COLUMBUS,GA,Unanswered,Unanswered,
2,9547167709,05/01/2020,464.0,PPP,Q AND J SERVICES LLC,301 Old Georgetown Road,Manning,,29102-2734,04/20/2021,...,,,Limited Liability Company(LLC),19248.0,Synovus Bank,COLUMBUS,GA,Unanswered,Unanswered,
3,6448037706,05/01/2020,515.0,PPP,OPTIMIZED PROCESS SOLUTIONS DBA AAA INDUSTRIES,24500 CAPITOL,REDFORD,,48239-2446,04/16/2021,...,,,Limited Liability Company(LLC),9551.0,"Bank of America, National Association",CHARLOTTE,NC,Male Owned,Non-Veteran,
4,9609017706,05/01/2020,464.0,PPP,"D2 POWER SPORTS, LLC",125 Simuel Dr.,Spartanburg,,29303-2085,,...,,,Limited Liability Company(LLC),19248.0,Synovus Bank,COLUMBUS,GA,Unanswered,Unanswered,
7,6486007709,05/01/2020,914.0,PPP,SEAN T KY DDS INC,200 S EL MOLINO AVE STE 5,PASADENA,,91101-2985,,...,,,Corporation,9551.0,"Bank of America, National Association",CHARLOTTE,NC,Unanswered,Unanswered,


In [89]:
# Check that lengths of DF split matches up
len(ppp_sub_150_1) == len(not_nan_df) + len(nan_zip_df)

True

### Part 2c. Add Modified Zip Code Column

Extract first 5 digits of `BorrowerZip` column

In [94]:
# Source: https://kanoki.org/2019/11/12/how-to-use-regex-in-pandas/
pattern = r'(^\d{5})'  # raw string so that \d is not treated as a string escape
not_nan_df['zip_code_5'] = not_nan_df['BorrowerZip'].str.extract(pattern)
not_nan_df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


Unnamed: 0,LoanNumber,DateApproved,SBAOfficeCode,ProcessingMethod,BorrowerName,BorrowerAddress,BorrowerCity,BorrowerState,BorrowerZip,LoanStatusDate,...,DEBT_INTEREST_PROCEED,BusinessType,OriginatingLenderLocationID,OriginatingLender,OriginatingLenderCity,OriginatingLenderState,Gender,Veteran,NonProfit,zip_code_5
1,9677497701,05/01/2020,464.0,PPP,NORTH CHARLESTON HOSPITALITY GROUP LLC,192 College Park Rd,Ladson,,29456-3517,,...,,Limited Liability Company(LLC),19248.0,Synovus Bank,COLUMBUS,GA,Unanswered,Unanswered,,29456
2,9547167709,05/01/2020,464.0,PPP,Q AND J SERVICES LLC,301 Old Georgetown Road,Manning,,29102-2734,04/20/2021,...,,Limited Liability Company(LLC),19248.0,Synovus Bank,COLUMBUS,GA,Unanswered,Unanswered,,29102
3,6448037706,05/01/2020,515.0,PPP,OPTIMIZED PROCESS SOLUTIONS DBA AAA INDUSTRIES,24500 CAPITOL,REDFORD,,48239-2446,04/16/2021,...,,Limited Liability Company(LLC),9551.0,"Bank of America, National Association",CHARLOTTE,NC,Male Owned,Non-Veteran,,48239
4,9609017706,05/01/2020,464.0,PPP,"D2 POWER SPORTS, LLC",125 Simuel Dr.,Spartanburg,,29303-2085,,...,,Limited Liability Company(LLC),19248.0,Synovus Bank,COLUMBUS,GA,Unanswered,Unanswered,,29303
7,6486007709,05/01/2020,914.0,PPP,SEAN T KY DDS INC,200 S EL MOLINO AVE STE 5,PASADENA,,91101-2985,,...,,Corporation,9551.0,"Bank of America, National Association",CHARLOTTE,NC,Unanswered,Unanswered,,91101
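
The `SettingWithCopyWarning` in the output above appears because `not_nan_df` is a filtered slice of `ppp_sub_150_1`. One common way to avoid it, sketched here on made-up values, is to work on an explicit copy. `with_zip5` is a hypothetical helper, not a function defined in this notebook.

```python
import pandas as pd

def with_zip5(df, zip_col='BorrowerZip'):
    """Return a copy of df with a zip_code_5 column; the input is left untouched."""
    out = df.copy()
    # expand=False returns a Series for a single capture group
    out['zip_code_5'] = out[zip_col].astype(str).str.extract(r'^(\d{5})', expand=False)
    return out

# Made-up values: ZIP+4, plain 5-digit, and a malformed 4-digit entry
toy = pd.DataFrame({'BorrowerZip': ['29456-3517', '95125', '9410']})
result = with_zip5(toy)
print(result)
```

Entries that do not start with five digits come back as NaN, so they fall out of the later `isin` filter.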


In [84]:
bay_area_ppp = not_nan_df[not_nan_df['zip_code_5'].isin(bay_area_zip['zip_code_5'])]
bay_area_ppp.head()

Unnamed: 0,LoanNumber,DateApproved,SBAOfficeCode,ProcessingMethod,BorrowerName,BorrowerAddress,BorrowerCity,BorrowerState,BorrowerZip,LoanStatusDate,...,DEBT_INTEREST_PROCEED,BusinessType,OriginatingLenderLocationID,OriginatingLender,OriginatingLenderCity,OriginatingLenderState,Gender,Veteran,NonProfit,zip_code_5
102272,8585717704,05/01/2020,459.0,PPP,MARKET AUTO TRUCK COLLISION CORPORATION,140 SAN JOSE AVE,SAN JOSE,AL,95125,,...,,Corporation,122043.0,WebBank,SALT LAKE CITY,UT,Unanswered,Unanswered,,95125
392350,1464468601,03/13/2021,912.0,PPS,GOLD RUSH KETTLE KORN LLC,4690 E 2nd St Ste 9,Benicia,CA,94510-1008,,...,,Limited Liability Company(LLC),11822.0,"MUFG Union Bank, National Association",SAN FRANCISCO,CA,Male Owned,Non-Veteran,,94510
392359,2663308509,02/22/2021,912.0,PPS,KALIBER LABS INC,188 King St Unit 307,San Francisco,CA,94107-4903,,...,,Corporation,9551.0,"Bank of America, National Association",CHARLOTTE,NC,Male Owned,Non-Veteran,,94107
392364,4137348607,03/18/2021,912.0,PPS,ROSS MCDONALD COMPANY INC,1154 Stealth St,Livermore,CA,94551-9300,,...,,Corporation,474333.0,First Republic Bank,SAN FRANCISCO,CA,Unanswered,Unanswered,,94551
392365,4315748602,03/18/2021,912.0,PPS,GATES EISENHART DAWSON,125 S Market St Ste 1200,San Jose,CA,95113-2288,,...,,Partnership,474333.0,First Republic Bank,SAN FRANCISCO,CA,Unanswered,Non-Veteran,,95113


In [96]:
#Create new column for year of approval 
bay_area_ppp.loc[:,'YearApproved'] = bay_area_ppp['DateApproved'].str[-4:]

#Create new column for NAICS 6 digit in string format 
bay_area_ppp.loc[:,'NAICS_6'] = bay_area_ppp['NAICSCode'].apply(lambda y: str(y)[:6])

### Part 2d. Add `Minority` Column 

Options: Yes, No, Unanswered

In [118]:
bay_area_ppp[(bay_area_ppp['Ethnicity'] == 'Hispanic or Latino')]['Race']

# Question: Hispanic or Latino +  White? => Yes or No

392397                        White
392410                        White
392530                        White
392858                        White
392868                        White
                    ...            
899256                   Unanswered
899439                   Unanswered
899733    Black or African American
899775                        White
899949                        White
Name: Race, Length: 3274, dtype: object

In [120]:
# Source: https://www.dataquest.io/blog/tutorial-add-column-pandas-dataframe-based-on-if-else-condition/

conditions = [
    (bay_area_ppp['Race']  == 'White'),
    (bay_area_ppp['Race'] == 'Unanswered') & (bay_area_ppp['Ethnicity'] == 'Not Hispanic or Latino'),
    (bay_area_ppp['Race'] == 'Unanswered') & (bay_area_ppp['Ethnicity'] == 'Unknown/NotStated'),
    (bay_area_ppp['Ethnicity'] == 'Hispanic or Latino') ,
    (bay_area_ppp['Race']  != 'White') & (bay_area_ppp['Race'] != 'Unanswered') ,
    ]

values = ['No', 'No', 'Unanswered', 'Yes', 'Yes']

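# Note: '0' in the Minority column indicates a row that matched none of the conditions.
# The default is passed as a string so that it shares a dtype with the string values.
bay_area_ppp['Minority'] = np.select(conditions, values, default='0')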
bay_area_ppp.head()

Unnamed: 0,LoanNumber,DateApproved,SBAOfficeCode,ProcessingMethod,BorrowerName,BorrowerAddress,BorrowerCity,BorrowerState,BorrowerZip,LoanStatusDate,...,OriginatingLender,OriginatingLenderCity,OriginatingLenderState,Gender,Veteran,NonProfit,zip_code_5,YearApproved,NAICS_6,Minority
102272,8585717704,05/01/2020,459.0,PPP,MARKET AUTO TRUCK COLLISION CORPORATION,140 SAN JOSE AVE,SAN JOSE,AL,95125,,...,WebBank,SALT LAKE CITY,UT,Unanswered,Unanswered,,95125,2020,811121,Unanswered
392350,1464468601,03/13/2021,912.0,PPS,GOLD RUSH KETTLE KORN LLC,4690 E 2nd St Ste 9,Benicia,CA,94510-1008,,...,"MUFG Union Bank, National Association",SAN FRANCISCO,CA,Male Owned,Non-Veteran,,94510,2021,722330,No
392359,2663308509,02/22/2021,912.0,PPS,KALIBER LABS INC,188 King St Unit 307,San Francisco,CA,94107-4903,,...,"Bank of America, National Association",CHARLOTTE,NC,Male Owned,Non-Veteran,,94107,2021,541511,No
392364,4137348607,03/18/2021,912.0,PPS,ROSS MCDONALD COMPANY INC,1154 Stealth St,Livermore,CA,94551-9300,,...,First Republic Bank,SAN FRANCISCO,CA,Unanswered,Unanswered,,94551,2021,541420,Unanswered
392365,4315748602,03/18/2021,912.0,PPS,GATES EISENHART DAWSON,125 S Market St Ste 1200,San Jose,CA,95113-2288,,...,First Republic Bank,SAN FRANCISCO,CA,Unanswered,Non-Veteran,,95113,2021,541110,Unanswered
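
On the question of Hispanic or Latino + White: `np.select` takes the value of the first condition that matches, so the ordering of `conditions` decides the answer. The sketch below uses made-up rows, and `classify` is a hypothetical helper that compares the ordering used above with a variant that checks ethnicity first.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Race': ['White', 'White', 'Unanswered', 'Asian', 'Unanswered', 'Unanswered'],
    'Ethnicity': ['Hispanic or Latino', 'Not Hispanic or Latino', 'Hispanic or Latino',
                  'Unknown/NotStated', 'Unknown/NotStated', 'Not Hispanic or Latino'],
})

def classify(df, hispanic_first):
    white = df['Race'] == 'White'
    unanswered = df['Race'] == 'Unanswered'
    hispanic = df['Ethnicity'] == 'Hispanic or Latino'
    rules = [
        (white, 'No'),
        (unanswered & (df['Ethnicity'] == 'Not Hispanic or Latino'), 'No'),
        (unanswered & (df['Ethnicity'] == 'Unknown/NotStated'), 'Unanswered'),
        (hispanic, 'Yes'),
        (~white & ~unanswered, 'Yes'),
    ]
    if hispanic_first:
        # Move the ethnicity rule to the front so it wins over Race == 'White'
        rules.insert(0, rules.pop(3))
    conditions = [cond for cond, _ in rules]
    values = [value for _, value in rules]
    return np.select(conditions, values, default='Uncovered')

as_above = classify(toy, hispanic_first=False)
ethnicity_first = classify(toy, hispanic_first=True)
print(as_above.tolist())
print(ethnicity_first.tolist())
```

With the ordering used in this notebook, a White, Hispanic or Latino borrower is labelled `No`; only the first toy row changes when the ethnicity rule is moved to the front.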


### Part 2e. Add NAICS 4-digit Column

In [126]:
#Merge NAICS 4 digit with PPP data and remove the 4 digit code from IndustrySubsector name
bay_area_ppp_NAICS = pd.merge(bay_area_ppp, NAICS_codes, how='left', left_on='NAICS_6', right_on='2017NAICSCode')
bay_area_ppp_NAICS.loc[:,'NAICS_4'] = bay_area_ppp_NAICS['4DigitNAICS'].apply(lambda y: str(y)[:4])
bay_area_ppp_NAICS = bay_area_ppp_NAICS.drop(['2017NAICSCode','Changed2017','TRIIndustrySector','NewIndustry','4DigitNAICS'], axis=1)
bay_area_ppp_NAICS['IndustrySubsector'] = bay_area_ppp_NAICS['IndustrySubsector'].str[5:]
bay_area_ppp_NAICS.head()

Unnamed: 0,LoanNumber,DateApproved,SBAOfficeCode,ProcessingMethod,BorrowerName,BorrowerAddress,BorrowerCity,BorrowerState,BorrowerZip,LoanStatusDate,...,Gender,Veteran,NonProfit,zip_code_5,YearApproved,NAICS_6,Minority,LongName,IndustrySubsector,NAICS_4
0,8585717704,05/01/2020,459.0,PPP,MARKET AUTO TRUCK COLLISION CORPORATION,140 SAN JOSE AVE,SAN JOSE,AL,95125,,...,Unanswered,Unanswered,,95125,2020,811121,Unanswered,"Automotive Body, Paint, and Interior Repair an...",Automotive Repair and Maintenance,8111
1,1464468601,03/13/2021,912.0,PPS,GOLD RUSH KETTLE KORN LLC,4690 E 2nd St Ste 9,Benicia,CA,94510-1008,,...,Male Owned,Non-Veteran,,94510,2021,722330,No,Mobile Food Services,Special Food Services,7223
2,2663308509,02/22/2021,912.0,PPS,KALIBER LABS INC,188 King St Unit 307,San Francisco,CA,94107-4903,,...,Male Owned,Non-Veteran,,94107,2021,541511,No,Custom Computer Programming Services,Computer Systems Design and Related Services,5415
3,4137348607,03/18/2021,912.0,PPS,ROSS MCDONALD COMPANY INC,1154 Stealth St,Livermore,CA,94551-9300,,...,Unanswered,Unanswered,,94551,2021,541420,Unanswered,Industrial Design Services,Specialized Design Services,5414
4,4315748602,03/18/2021,912.0,PPS,GATES EISENHART DAWSON,125 S Market St Ste 1200,San Jose,CA,95113-2288,,...,Unanswered,Non-Veteran,,95113,2021,541110,Unanswered,Offices of Lawyers,Legal Services,5411
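
Because the merge is a left join, loans whose 6-digit code is missing from `NAICS_codes` stay in the table with empty lookup columns. A small sketch on made-up rows, using the `indicator` option of `merge`, shows how such rows could be listed and what the `str(y)[:4]` step does to them. The names `loans`, `lookup`, and the code `999990` are illustrative only.

```python
import pandas as pd

# Made-up rows; '999990' stands in for a code absent from the lookup table
loans = pd.DataFrame({'LoanNumber': [1, 2, 3],
                      'NAICS_6': ['722511', '541511', '999990']})
lookup = pd.DataFrame({'2017NAICSCode': ['722511', '541511'],
                       '4DigitNAICS': [7225, 5415]})

merged = loans.merge(lookup, how='left', left_on='NAICS_6',
                     right_on='2017NAICSCode', indicator=True)

# Rows that found no partner in the lookup table
unmatched = merged.loc[merged['_merge'] == 'left_only', 'NAICS_6'].tolist()

# Same slicing as in the cell above: a missing code becomes the string 'nan'
naics_4 = merged['4DigitNAICS'].apply(lambda y: str(y)[:4]).tolist()

print(unmatched)
print(naics_4)
```

Unmatched rows therefore carry the literal text `'nan'` in `NAICS_4` rather than a missing value, which is worth keeping in mind when grouping by that column.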


## Part 3. Go through all files to build `PPP_combined`

In [12]:
def process_files():
    '''
    Apply the analysis to all PPP loan data files, one file at a time, without loading them all at once.
    1. Split dataframe in 2 based on whether or not BorrowerZip is null.
    2. Add 5-digit Zip Code, Minority, and 4-digit NAICS code columns of ONLY BAY AREA BUSINESSES
    3. Append split and processed dataframes to list for concatenation
    4. Concatenate all dataframes into PPP_BA_combined (contains Bay Area businesses)
    5. Save comprehensive list to CSV files
    
    Returns two Dataframes
    '''
    filenames = [d + '/ppp-loan-data/ppp_loan_datasets_0621/public_up_to_150k_{}.csv'.format(i) for i in range(1,12)]
    filenames.append(d + '/ppp-loan-data/ppp_loan_datasets_0621/public_150k_plus.csv')
    nan_df_list = list()
    PPP_combined_list = list()
    for file in filenames:
        print('Processing {}...'.format(file))
        df = pd.read_csv(file)
                       
        nan_zip_df, not_nan_df = split_df_on_zip(df)
        
        
        # Add Zip Code, Minority, and NAICS_4 columns
        PPP_bay = add_zip_col(not_nan_df)
        PPP_bay = add_minority_col(PPP_bay) 
        PPP_bay = add_naics_4_col(PPP_bay)
        
        # append nan zips from dataframe
        nan_df_list.append(nan_zip_df)
        
        # append ppp_combined
        PPP_combined_list.append(PPP_bay)
    
    print('Finished reading files.')
    # Join dataframes across all files
    print('Joining dataframes...')
    main_nan_zip = pd.concat(nan_df_list)
    PPP_BA_combined = pd.concat(PPP_combined_list)
    
    
    # Save to files
    print('Saving to file...')
    main_nan_zip.to_csv(d + '/ppp-loan-data/out/nan_zips.csv', index=False)
    PPP_BA_combined.to_csv(d + '/ppp-loan-data/out/bay_bus_from_ppp.csv', index=False)
    
    print('done.')
    return main_nan_zip, PPP_BA_combined

In [13]:
def split_df_on_zip(df):
    '''
    Counts the NaN Zip Code rows in a table and creates 2 dataframes based on whether BorrowerZip is null or not.
    Checks that there are no excluded rows in split.
    
    Input: Dataframe - full table of businesses
    Output: Returns two Dataframes split from input
    '''
    # Count NaN Zip Codes (BorrowerZip) in table
    print('There are {} NaN values in this table.'.format(df['BorrowerZip'].isna().sum()))
    
    # Get rows where BorrowerZip is NaN
    nan_zip_df = df[df['BorrowerZip'].isnull()]
    # Get rows where BorrowerZip is not NaN
    not_nan_df = df[df['BorrowerZip'].notnull()]

    # Check that lengths of DF split matches up
    print('Split length matches total:', len(df) == len(not_nan_df) + len(nan_zip_df))
    
    return nan_zip_df, not_nan_df

In [7]:
def add_zip_col(df):
    '''
    Use a regex to extract the 5-digit Zip Code from the BorrowerZip column into a new column,
    then keep only rows whose Zip Code appears in 'cb_2018_us_zcta510_500k_BAY.csv'.
    
    Input: Dataframe without null Zip codes
    Output: Updated Dataframe filtered for Bay Area Zip Codes
    '''
    # Source: https://kanoki.org/2019/11/12/how-to-use-regex-in-pandas/
    pattern = r'(^\d{5})'  # raw string so that \d is not treated as a string escape
    df['zip_code_5'] = df['BorrowerZip'].str.extract(pattern)
    ppp_df_with_zip = df[df['zip_code_5'].isin(bay_area_zip['zip_code_5'])]
    
    return ppp_df_with_zip

In [8]:
def add_minority_col(df):
    '''
    Create a new column in the dataframe with three options based on the Race and Ethnicity columns:
    {Yes, No, Unanswered}.
    
    Input: Dataframe without null Zip codes
    Output: Updated Dataframe with Minority column.
    '''
    # Source: https://www.dataquest.io/blog/tutorial-add-column-pandas-dataframe-based-on-if-else-condition/
    conditions = [
        (df['Race']  == 'White'),
        (df['Race'] == 'Unanswered') & (df['Ethnicity'] == 'Not Hispanic or Latino'),
        (df['Race'] == 'Unanswered') & (df['Ethnicity'] == 'Unknown/NotStated'),
        (df['Ethnicity'] == 'Hispanic or Latino') ,
        (df['Race']  != 'White') & (df['Race'] != 'Unanswered') ,
        ]

    values = ['No', 'No', 'Unanswered', 'Yes', 'Yes']

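    # Note: '0' in the Minority column indicates a row that matched none of the conditions.
    # The default is passed as a string so that it shares a dtype with the string values.
    df.loc[:,'Minority'] = np.select(conditions, values, default='0')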
    return df

In [9]:
def add_naics_4_col(df):
    '''
    Add YearApproved and NAICS_6 columns. Merge the dataframe with NAICS_codes to attach
    the 4-digit code, and strip the 4-digit code prefix from the IndustrySubsector name.
    
    Input: Dataframe without null Zip codes
    Output: Updated Dataframe with YearApproved, NAICS_6, NAICS_4 column.
    '''
    #Create new column for year of approval 
    df.loc[:,'YearApproved'] = df['DateApproved'].str[-4:]

    #Create new column for NAICS 6 digit in string format 
    df.loc[:,'NAICS_6'] = df['NAICSCode'].apply(lambda y: str(y)[:6])
    
    #Merge NAICS 4 digit with PPP data and remove the 4 digit code from IndustrySubsector name
    bay_area_ppp_NAICS = pd.merge(df, NAICS_codes, how='left', left_on='NAICS_6', right_on='2017NAICSCode')
    bay_area_ppp_NAICS.loc[:,'NAICS_4'] = bay_area_ppp_NAICS['4DigitNAICS'].apply(lambda y: str(y)[:4])
    bay_area_ppp_NAICS = bay_area_ppp_NAICS.drop(['2017NAICSCode','Changed2017','TRIIndustrySector','NewIndustry','4DigitNAICS'], axis=1)
    bay_area_ppp_NAICS.loc[:,'IndustrySubsector'] = bay_area_ppp_NAICS['IndustrySubsector'].str[5:]
    
    return bay_area_ppp_NAICS

In [11]:
# Run to process all the files and get 2 resulting dataframes

nan_zips, PPP_combined = process_files()

Processing C:\Users\Trang\Desktop\afn/ppp-loan-data/ppp_loan_datasets_0621/public_up_to_150k_1.csv...
There are 125 NaN values in this table.
Split length matches total: True
Processing C:\Users\Trang\Desktop\afn/ppp-loan-data/ppp_loan_datasets_0621/public_up_to_150k_2.csv...
There are 8 NaN values in this table.
Split length matches total: True
Processing C:\Users\Trang\Desktop\afn/ppp-loan-data/ppp_loan_datasets_0621/public_up_to_150k_3.csv...
There are 3 NaN values in this table.
Split length matches total: True
Processing C:\Users\Trang\Desktop\afn/ppp-loan-data/ppp_loan_datasets_0621/public_up_to_150k_4.csv...
There are 0 NaN values in this table.
Split length matches total: True
Processing C:\Users\Trang\Desktop\afn/ppp-loan-data/ppp_loan_datasets_0621/public_up_to_150k_5.csv...
There are 7 NaN values in this table.
Split length matches total: True
Processing C:\Users\Trang\Desktop\afn/ppp-loan-data/ppp_loan_datasets_0621/public_up_to_150k_6.csv...
There are 1 NaN values in this 

In [14]:
len(nan_zips)

176

In [15]:
nan_zips.head()

# TODO: Manually go through table to check for Bay Area businesses. Additionally for Minority-owned BA businesses.

Unnamed: 0,LoanNumber,DateApproved,SBAOfficeCode,ProcessingMethod,BorrowerName,BorrowerAddress,BorrowerCity,BorrowerState,BorrowerZip,LoanStatusDate,...,BusinessType,OriginatingLenderLocationID,OriginatingLender,OriginatingLenderCity,OriginatingLenderState,Gender,Veteran,NonProfit,ForgivenessAmount,ForgivenessDate
0,5375617707,05/01/2020,101.0,PPP,NOT AVAILABLE,,,,,0,...,,9551.0,"Bank of America, National Association",CHARLOTTE,NC,Unanswered,Unanswered,,0.0,
5,9789867710,05/01/2020,101.0,PPP,VULCAN MACHINERY CORPORATION,,,,,0,...,Corporation,57328.0,The Huntington National Bank,COLUMBUS,OH,Male Owned,Non-Veteran,,112104.83,03/05/2021
6,9589997709,05/01/2020,101.0,PPP,"TJK KITCHENS &AMP; BREWPUBS, LLC",,,,,0,...,Limited Liability Company(LLC),57328.0,The Huntington National Bank,COLUMBUS,OH,Unanswered,Unanswered,,110881.96,03/01/2021
8,9662387700,05/01/2020,101.0,PPP,RON GOLDSTONE,,,,,05/22/2021,...,,57328.0,The Huntington National Bank,COLUMBUS,OH,Unanswered,Unanswered,,21036.19,04/29/2021
9,2767027201,04/16/2020,,PPP,Exemption 6,,,,,02/18/2021,...,Subchapter S Corporation,,,,,Unanswered,Unanswered,,91638.0,01/19/2021


In [16]:
nan_zips[nan_zips['BorrowerName'] == 'ART N FUN STUDIO, INC']

Unnamed: 0,LoanNumber,DateApproved,SBAOfficeCode,ProcessingMethod,BorrowerName,BorrowerAddress,BorrowerCity,BorrowerState,BorrowerZip,LoanStatusDate,...,BusinessType,OriginatingLenderLocationID,OriginatingLender,OriginatingLenderCity,OriginatingLenderState,Gender,Veteran,NonProfit,ForgivenessAmount,ForgivenessDate
519733,2946407704,05/01/2020,912.0,PPP,"ART N FUN STUDIO, INC",,,CA,,05/22/2021,...,Corporation,48270.0,"JPMorgan Chase Bank, National Association",COLUMBUS,OH,Female Owned,Non-Veteran,,94610.05,04/08/2021


In [17]:
## UPDATED FOR NEW PPP DATA
# Add NaN Zips to PPP_combined after manually going through them.
nans_in_bay_list = [519733]
nans_in_bay = nan_zips.loc[nans_in_bay_list]
nans_in_bay['zip_code_5'] = np.nan
nans_in_bay = add_minority_col(nans_in_bay) 
nans_in_bay = add_naics_4_col(nans_in_bay)
nans_in_bay

Unnamed: 0,LoanNumber,DateApproved,SBAOfficeCode,ProcessingMethod,BorrowerName,BorrowerAddress,BorrowerCity,BorrowerState,BorrowerZip,LoanStatusDate,...,NonProfit,ForgivenessAmount,ForgivenessDate,zip_code_5,Minority,YearApproved,NAICS_6,LongName,IndustrySubsector,NAICS_4
0,2946407704,05/01/2020,912.0,PPP,"ART N FUN STUDIO, INC",,,CA,,05/22/2021,...,,94610.05,04/08/2021,,Yes,2020,712110,Museums,"Museums, Historical Sites, and Similar Institu...",7121


In [18]:
# Append NaN Rows to main dataframe, PPP_combined
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
PPP_combined = pd.concat([PPP_combined, nans_in_bay], ignore_index=True)

In [19]:
PPP_combined.head()

Unnamed: 0,LoanNumber,DateApproved,SBAOfficeCode,ProcessingMethod,BorrowerName,BorrowerAddress,BorrowerCity,BorrowerState,BorrowerZip,LoanStatusDate,...,NonProfit,ForgivenessAmount,ForgivenessDate,zip_code_5,Minority,YearApproved,NAICS_6,LongName,IndustrySubsector,NAICS_4
0,8585717704,05/01/2020,459.0,PPP,MARKET AUTO TRUCK COLLISION CORPORATION,140 SAN JOSE AVE,SAN JOSE,AL,95125,0,...,,15281.29,05/13/2021,95125,Unanswered,2020,811121,"Automotive Body, Paint, and Interior Repair an...",Automotive Repair and Maintenance,8111
1,1464468601,03/13/2021,912.0,PPS,GOLD RUSH KETTLE KORN LLC,4690 E 2nd St Ste 9,Benicia,CA,94510-1008,0,...,,0.0,,94510,No,2021,722330,Mobile Food Services,Special Food Services,7223
2,2663308509,02/22/2021,912.0,PPS,KALIBER LABS INC,188 King St Unit 307,San Francisco,CA,94107-4903,0,...,,0.0,,94107,No,2021,541511,Custom Computer Programming Services,Computer Systems Design and Related Services,5415
3,4137348607,03/18/2021,912.0,PPS,ROSS MCDONALD COMPANY INC,1154 Stealth St,Livermore,CA,94551-9300,0,...,,0.0,,94551,Unanswered,2021,541420,Industrial Design Services,Specialized Design Services,5414
4,4315748602,03/18/2021,912.0,PPS,GATES EISENHART DAWSON,125 S Market St Ste 1200,San Jose,CA,95113-2288,0,...,,0.0,,95113,Unanswered,2021,541110,Offices of Lawyers,Legal Services,5411


### Extract Bay Area, Minority-owned businesses

Filter for businesses that are minority-owned and save them to file.

In [38]:
def extract_bay_minority_bus(df):
    BA_minority_PPP = df[df['Minority'] == 'Yes'].reset_index(drop=True)
    BA_minority_PPP.to_csv(d + '/ppp-loan-data/out/bay_bus_from_ppp_minority.csv', index=False)
    return BA_minority_PPP

PPP_combined_minority = extract_bay_minority_bus(PPP_combined)
PPP_combined_minority.head()

Unnamed: 0,LoanNumber,DateApproved,SBAOfficeCode,ProcessingMethod,BorrowerName,BorrowerAddress,BorrowerCity,BorrowerState,BorrowerZip,LoanStatusDate,...,NonProfit,ForgivenessAmount,ForgivenessDate,zip_code_5,Minority,YearApproved,NAICS_6,LongName,IndustrySubsector,NAICS_4
0,5224738402,02/08/2021,912.0,PPS,ICHINA,70 Valley Ct,Atherton,CA,94027-6472,0,...,,0.0,,94027,Yes,2021,722511,Full-Service Restaurants,Restaurants and Other Eating Places,7225
1,5359188509,02/27/2021,912.0,PPS,BANZAI IZAKAYA INC.,1633 Bonanza St,Walnut Creek,CA,94596-4525,0,...,,0.0,,94596,Yes,2021,722511,Full-Service Restaurants,Restaurants and Other Eating Places,7225
2,5554978603,03/20/2021,912.0,PPS,RANGOON SUPER STARS LLC,542 Freya Way,Pleasant Hill,CA,94523-1712,0,...,,0.0,,94523,Yes,2021,722511,Full-Service Restaurants,Restaurants and Other Eating Places,7225
3,5681669004,05/22/2021,912.0,PPS,VIVA EL ESPANOL SPANISH SCHOOL INC.,3205 Stanley Blvd,Lafayette,CA,94549-3239,05/22/2021,...,,0.0,,94549,Yes,2021,624410,Child Day Care Services,Child Day Care Services,6244
4,9634738305,01/31/2021,912.0,PPS,JBR PARTNERS INC,1333 Evans Ave,San Francisco,CA,94124-1705,0,...,,0.0,,94124,Yes,2021,541820,Public Relations Agencies,"Advertising, Public Relations, and Related Ser...",5418

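Because the `Minority` column holds three labels in the output above (`Yes`, `No`, `Unanswered`), it can help to look at the label counts before filtering, so it is clear how many rows the `== 'Yes'` test leaves out. The sketch below uses a small made-up frame (`toy` and its rows are illustrative assumptions, not PPP records) and skips the CSV write.

```python
import pandas as pd

# Toy stand-in for PPP_combined; the rows are illustrative, not real loans.
toy = pd.DataFrame({
    'BorrowerName': ['A', 'B', 'C', 'D'],
    'Minority': ['Yes', 'No', 'Unanswered', 'Yes'],
})

# How many rows carry each label before filtering
label_counts = toy['Minority'].value_counts()

# Same filter as extract_bay_minority_bus, without writing to disk
minority_only = toy[toy['Minority'] == 'Yes'].reset_index(drop=True)

print(label_counts.to_dict())
print(len(minority_only), 'of', len(toy), 'rows kept')
```

Rows labelled `Unanswered` are excluded along with `No`, so the minority counts later in this notebook are best read as a lower bound under that assumption.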

## Part 4: Summary Data Tables for Bay Area Businesses

*Same code as Laura's below, but applied only to Bay Area businesses.*

In [21]:
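# Per 4-digit NAICS industry: count loan rows with a reported job figure ('Businesses') and sum the jobs
PPP_ind_4 = pd.DataFrame(PPP_combined.groupby(['NAICS_4','IndustrySubsector'])['JobsReported']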
                          .agg(Businesses='count',Jobs='sum').sort_values(by='Businesses', ascending=False))
PPP_ind_4.reset_index(inplace=True)
PPP_ind_4.index = np.arange(1, len(PPP_ind_4) + 1)
PPP_ind_4.head(20)

Unnamed: 0,NAICS_4,IndustrySubsector,Businesses,Jobs
1,7225,Restaurants and Other Eating Places,16679,340855
2,8121,Personal Care Services,15434,46515
3,4853,Taxi and Limousine Service,11950,14725
4,5416,"Management, Scientific, and Technical Consulti...",9272,49674
5,6212,Offices of Dentists,7389,51529
6,2361,Residential Building Construction,7331,49676
7,6213,Offices of Other Health Practitioners,6660,28839
8,5312,Offices of Real Estate Agents and Brokers,6612,15767
9,5419,"Other Professional, Scientific, and Technical ...",6556,36213
10,5411,Legal Services,5943,30459

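The same group, aggregate, sort, and re-index steps are repeated later for the minority and 2021 subsets. One way to avoid copying the cell is a small helper; the sketch below is an assumption about how that could look (`summarize_by_industry` and the `toy` frame are hypothetical names, not part of the original notebook).

```python
import numpy as np
import pandas as pd

def summarize_by_industry(df):
    """Count loan rows and sum reported jobs per 4-digit NAICS industry."""
    summary = (
        df.groupby(['NAICS_4', 'IndustrySubsector'])['JobsReported']
          .agg(Businesses='count', Jobs='sum')
          .sort_values(by='Businesses', ascending=False)
          .reset_index()
    )
    # 1-based index so the row label doubles as the industry's rank
    summary.index = np.arange(1, len(summary) + 1)
    return summary

# Illustrative rows only, not real PPP records
toy = pd.DataFrame({
    'NAICS_4': ['7225', '7225', '8121'],
    'IndustrySubsector': ['Restaurants and Other Eating Places',
                          'Restaurants and Other Eating Places',
                          'Personal Care Services'],
    'JobsReported': [10.0, 5.0, 2.0],
})
toy_summary = summarize_by_industry(toy)
print(toy_summary)
```

Note that `'count'` tallies rows with a non-null `JobsReported`, so a borrower with loans in both 2020 and 2021 contributes two rows to `Businesses`.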

In [22]:
retail_4 = ['4411', '4412', '4413', '4421', '4422', '4431', '4441', '4442', '4451', '4452', '4453', '4461', '4471', '4481', 
            '4482', '4483', '4511', '4512', '4522', '4523', '4531', '4532', '4533', '4539', '4541', '4542', '4543']

In [23]:
PPP_combined['NAICS_4'].dtypes

dtype('O')

In [24]:
PPP_retail_4 = PPP_combined[PPP_combined['NAICS_4'].isin(retail_4)]
PPP_retail_4

Unnamed: 0,LoanNumber,DateApproved,SBAOfficeCode,ProcessingMethod,BorrowerName,BorrowerAddress,BorrowerCity,BorrowerState,BorrowerZip,LoanStatusDate,...,NonProfit,ForgivenessAmount,ForgivenessDate,zip_code_5,Minority,YearApproved,NAICS_6,LongName,IndustrySubsector,NAICS_4
51,4819837700,05/01/2020,912.0,PPP,SAM'S ELECTRO INC,1555 BOTELHO DR. 437,WALNUT CREEK,CA,94596,0,...,,151542.78,05/27/2021,94596,Unanswered,2020,453998,All Other Miscellaneous Store Retailers (excep...,Other Miscellaneous Store Retailers,4539
103,3317428305,01/21/2021,912.0,PPS,PCIFIC FUELS INC.,46494 Mission Blvd,Fremont,CA,94539-7063,0,...,,0.00,,94539,No,2021,447110,Gasoline Stations with Convenience Stores,Gasoline Stations,4471
174,5462957005,04/05/2020,912.0,PPP,TIN RX CASTRO SAN FRANCISCO,2175 Market Street,SAN FRANCISCO,CA,94114-1321,02/24/2021,...,,150614.08,01/07/2021,94114,No,2020,446110,Pharmacies and Drug Stores,Health and Personal Care Stores,4461
200,7119297108,04/14/2020,912.0,PPP,EZRA CONSTRUCTION,1156 Keeler Avenue,Berkeley,CA,94708,0,...,,0.00,,94708,Unanswered,2020,453998,All Other Miscellaneous Store Retailers (excep...,Other Miscellaneous Store Retailers,4539
214,6293057706,05/01/2020,912.0,PPP,STRATEGIC BUILDING PRODUCTS LLC,10920 BIGGE ST,SAN LEANDRO,CA,94577-1121,03/18/2021,...,,150441.04,02/05/2021,94577,Unanswered,2020,444110,Home Centers,Building Material and Supplies Dealers,4441
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
270022,8371687408,05/18/2020,912.0,PPP,NAPA BARREL REPAIR SHOP INC,918 ENTERPRISE WAY STE K,NAPA,CA,94558-6230,0,...,,0.00,,94558,Unanswered,2020,443141,Household Appliance Stores,Electronics and Appliance Stores,4431
270029,8728438403,02/13/2021,912.0,PPS,U.S. HISPANIC VENTURES INC.,2400 Mission St,San Francisco,CA,94110-2415,0,...,,0.00,,94110,No,2021,446191,Food (Health) Supplement Stores,Health and Personal Care Stores,4461
270037,9079687002,04/09/2020,912.0,PPP,GOTFLOOR.COM,2240 DE LA CRUZ BLVD,SANTA CLARA,CA,95050-3008,04/16/2021,...,,151298.63,03/03/2021,95050,Unanswered,2020,444190,Other Building Material Dealers,Building Material and Supplies Dealers,4441
270062,9958578508,03/12/2021,912.0,PPS,CANADIAN-AMERICAN OIL COMPANY,444 Divisadero St # 100,San Francisco,CA,94117-2211,0,...,,0.00,,94117,Unanswered,2021,447110,Gasoline Stations with Convenience Stores,Gasoline Stations,4471


In [25]:
personal_4 = ['8121', '8122', '8123', '8129']

In [26]:
PPP_personal_4 = PPP_combined[PPP_combined['NAICS_4'].isin(personal_4)]
PPP_personal_4

Unnamed: 0,LoanNumber,DateApproved,SBAOfficeCode,ProcessingMethod,BorrowerName,BorrowerAddress,BorrowerCity,BorrowerState,BorrowerZip,LoanStatusDate,...,NonProfit,ForgivenessAmount,ForgivenessDate,zip_code_5,Minority,YearApproved,NAICS_6,LongName,IndustrySubsector,NAICS_4
82,3863827700,05/01/2020,912.0,PPP,KLUB K9 PLAY CENTER,174 COMMERCIAL ST,SUNNYVALE,CA,94086,0,...,,0.00,,94086,Unanswered,2020,812910,Pet Care (except Veterinary) Services,Other Personal Services,8129
83,7010998305,01/27/2021,912.0,PPS,KLUB K9 PLAY CENTER INC,174 Commercial St,Sunnyvale,CA,94086-5201,0,...,,0.00,,94086,Unanswered,2021,812910,Pet Care (except Veterinary) Services,Other Personal Services,8129
127,8598858303,01/29/2021,912.0,PPS,SOFTAP INC,1046 Country Ln,Pleasanton,CA,94588-9515,0,...,,0.00,,94588,Unanswered,2021,812199,Other Personal Care Services,Personal Care Services,8121
128,2147418108,07/11/2020,912.0,PPP,JACQUELINES LLC,3960 Adeline Street 108,Emeryville,CA,94608-3511,0,...,,150932.63,05/13/2021,94608,Unanswered,2020,812990,All Other Personal Services,Other Personal Services,8129
172,5093848301,01/25/2021,912.0,PPS,SPROOS INC,552 San Anselmo Ave,San Anselmo,CA,94960-2621,0,...,,0.00,,94960,Unanswered,2021,812112,Beauty Salons,Personal Care Services,8121
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
269869,3997997809,05/27/2020,912.0,PPP,J'S BARBER SHOP AND HAIR SALON,141 Sunset Avenue,Suisun City,CA,94585-2063,0,...,,0.00,,94585,Unanswered,2020,812990,All Other Personal Services,Other Personal Services,8129
269975,6975237208,04/28/2020,912.0,PPP,MADUSALON INC,300 Divisadero Street,San Francisco,CA,94117-2209,0,...,,0.00,,94117,Unanswered,2020,812112,Beauty Salons,Personal Care Services,8121
270003,7858308403,02/12/2021,912.0,PPS,ARCHIMEDES BANYA SF,748 Innes Ave,San Francisco,CA,94124-3054,0,...,,0.00,,94124,Unanswered,2021,812199,Other Personal Care Services,Personal Care Services,8121
270005,7915248504,03/08/2021,912.0,PPS,SEVEN SALON INC,5358 College Ave,Oakland,CA,94618-1417,0,...,,0.00,,94618,Yes,2021,812112,Beauty Salons,Personal Care Services,8121


In [27]:
food_4 = ['7223', '7224', '7225']

In [28]:
PPP_food_4 = PPP_combined[PPP_combined['NAICS_4'].isin(food_4)]
PPP_food_4

Unnamed: 0,LoanNumber,DateApproved,SBAOfficeCode,ProcessingMethod,BorrowerName,BorrowerAddress,BorrowerCity,BorrowerState,BorrowerZip,LoanStatusDate,...,NonProfit,ForgivenessAmount,ForgivenessDate,zip_code_5,Minority,YearApproved,NAICS_6,LongName,IndustrySubsector,NAICS_4
1,1464468601,03/13/2021,912.0,PPS,GOLD RUSH KETTLE KORN LLC,4690 E 2nd St Ste 9,Benicia,CA,94510-1008,0,...,,0.0,,94510,No,2021,722330,Mobile Food Services,Special Food Services,7223
6,5224738402,02/08/2021,912.0,PPS,ICHINA,70 Valley Ct,Atherton,CA,94027-6472,0,...,,0.0,,94027,Yes,2021,722511,Full-Service Restaurants,Restaurants and Other Eating Places,7225
7,5359188509,02/27/2021,912.0,PPS,BANZAI IZAKAYA INC.,1633 Bonanza St,Walnut Creek,CA,94596-4525,0,...,,0.0,,94596,Yes,2021,722511,Full-Service Restaurants,Restaurants and Other Eating Places,7225
8,5422668907,04/30/2021,912.0,PPS,RIVA CUCINA LLC,800 Heinz Ave Ste 19,Berkeley,CA,94710-2747,0,...,,0.0,,94710,No,2021,722511,Full-Service Restaurants,Restaurants and Other Eating Places,7225
9,5554978603,03/20/2021,912.0,PPS,RANGOON SUPER STARS LLC,542 Freya Way,Pleasant Hill,CA,94523-1712,0,...,,0.0,,94523,Yes,2021,722511,Full-Service Restaurants,Restaurants and Other Eating Places,7225
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
270034,8962538407,02/14/2021,912.0,PPS,PYEONG CHANG TOFU HOUSE INC,4701 Telegraph Ave,Oakland,CA,94609-2023,0,...,,0.0,,94609,Yes,2021,722511,Full-Service Restaurants,Restaurants and Other Eating Places,7225
270044,9360078408,02/16/2021,912.0,PPS,SIRI GROUP INC.,1175 Folsom St,San Francisco,CA,94103-3930,0,...,,0.0,,94103,Yes,2021,722511,Full-Service Restaurants,Restaurants and Other Eating Places,7225
270045,9403758303,01/30/2021,912.0,PPS,AKINAI,2092 3rd St,San Francisco,CA,94107-3122,0,...,,0.0,,94107,Yes,2021,722511,Full-Service Restaurants,Restaurants and Other Eating Places,7225
270046,9479807109,04/15/2020,912.0,PPP,"UNIVERSAL ENTERPRISE (USA), INC.",76 S ABEL ST,MILPITAS,CA,95035-5251,04/16/2021,...,,151362.5,03/11/2021,95035,Yes,2020,722511,Full-Service Restaurants,Restaurants and Other Eating Places,7225

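The retail, personal-services, and food filters above all use the same `isin()` test on `NAICS_4`, which works because that column is stored as strings. As a rough sketch, the code lists could live in one dictionary and be applied in a loop; `sector_codes` and `toy` are hypothetical names, and the lists mirror `personal_4` and `food_4`. Every code in `retail_4` starts with `44` or `45`, so a two-character prefix test is shown as an alternative, with the caveat that it would also match any `44`/`45` code missing from the explicit list.

```python
import pandas as pd

# Hypothetical grouping dict; the lists mirror personal_4 and food_4 above.
sector_codes = {
    'personal': ['8121', '8122', '8123', '8129'],
    'food': ['7223', '7224', '7225'],
}

# Illustrative rows only, with NAICS_4 kept as strings like in PPP_combined
toy = pd.DataFrame({
    'BorrowerName': ['A', 'B', 'C', 'D'],
    'NAICS_4': ['8121', '7225', '4471', '5411'],
})

# One filtered frame per sector, each built with the same isin() test
by_sector = {name: toy[toy['NAICS_4'].isin(codes)]
             for name, codes in sector_codes.items()}

# Prefix-based retail filter: first two characters are '44' or '45'
retail = toy[toy['NAICS_4'].str[:2].isin(['44', '45'])]

for name, frame in by_sector.items():
    print(name, len(frame))
print('retail', len(retail))
```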

In [29]:
PPP_minority_ind_4 = pd.DataFrame(PPP_combined_minority.groupby(['NAICS_4','IndustrySubsector'])['JobsReported']
                          .agg(Businesses='count',Jobs='sum').sort_values(by='Businesses', ascending=False))
PPP_minority_ind_4.reset_index(inplace=True)
PPP_minority_ind_4.index = np.arange(1, len(PPP_minority_ind_4) + 1)
PPP_minority_ind_4.head(20)

Unnamed: 0,NAICS_4,IndustrySubsector,Businesses,Jobs
1,7225,Restaurants and Other Eating Places,5163,88486
2,8121,Personal Care Services,3373,9800
3,4853,Taxi and Limousine Service,1481,1722
4,6212,Offices of Dentists,1381,10105
5,4841,General Freight Trucking,1018,2510
6,8129,Other Personal Services,995,4120
7,5416,"Management, Scientific, and Technical Consulti...",933,4443
8,2361,Residential Building Construction,902,3151
9,7211,Traveler Accommodation,705,7889
10,8111,Automotive Repair and Maintenance,704,5009


In [30]:
# Retail rows of the minority summary; the index keeps each industry's rank among all industries
PPP_minority_retail_4 = PPP_minority_ind_4[PPP_minority_ind_4['NAICS_4'].isin(retail_4)]
PPP_minority_retail_4

Unnamed: 0,NAICS_4,IndustrySubsector,Businesses,Jobs
19,4539,Other Miscellaneous Store Retailers,620,2766
22,4481,Clothing Stores,466,1461
29,4461,Health and Personal Care Stores,307,1107
33,4471,Gasoline Stations,247,2736
39,4451,Grocery Stores,218,3507
41,4452,Specialty Food Stores,192,1902
71,4453,"Beer, Wine, and Liquor Stores",84,822
79,4483,"Jewelry, Luggage, and Leather Goods Stores",77,325
80,4541,Electronic Shopping and Mail-Order Houses,76,254
87,4511,"Sporting Goods, Hobby, and Musical Instrument ...",61,307


In [31]:
PPP_combined['YearApproved'].unique()

array(['2020', '2021'], dtype=object)

In [32]:
# Keep only loans approved in 2021 (YearApproved is stored as a string)
PPP_combined_2021 = PPP_combined[PPP_combined['YearApproved']=='2021']
PPP_combined_2021.head(10)

Unnamed: 0,LoanNumber,DateApproved,SBAOfficeCode,ProcessingMethod,BorrowerName,BorrowerAddress,BorrowerCity,BorrowerState,BorrowerZip,LoanStatusDate,...,NonProfit,ForgivenessAmount,ForgivenessDate,zip_code_5,Minority,YearApproved,NAICS_6,LongName,IndustrySubsector,NAICS_4
1,1464468601,03/13/2021,912.0,PPS,GOLD RUSH KETTLE KORN LLC,4690 E 2nd St Ste 9,Benicia,CA,94510-1008,0,...,,0.0,,94510,No,2021,722330,Mobile Food Services,Special Food Services,7223
2,2663308509,02/22/2021,912.0,PPS,KALIBER LABS INC,188 King St Unit 307,San Francisco,CA,94107-4903,0,...,,0.0,,94107,No,2021,541511,Custom Computer Programming Services,Computer Systems Design and Related Services,5415
3,4137348607,03/18/2021,912.0,PPS,ROSS MCDONALD COMPANY INC,1154 Stealth St,Livermore,CA,94551-9300,0,...,,0.0,,94551,Unanswered,2021,541420,Industrial Design Services,Specialized Design Services,5414
4,4315748602,03/18/2021,912.0,PPS,GATES EISENHART DAWSON,125 S Market St Ste 1200,San Jose,CA,95113-2288,0,...,,0.0,,95113,Unanswered,2021,541110,Offices of Lawyers,Legal Services,5411
5,4337488902,04/28/2021,912.0,PPS,INNOVATIVE CONTROL SOLUTIONS,383 Princeton Ln,Danville,CA,94526-4125,0,...,,0.0,,94526,Unanswered,2021,238210,Electrical Contractors and Other Wiring Instal...,Building Equipment Contractors,2382
6,5224738402,02/08/2021,912.0,PPS,ICHINA,70 Valley Ct,Atherton,CA,94027-6472,0,...,,0.0,,94027,Yes,2021,722511,Full-Service Restaurants,Restaurants and Other Eating Places,7225
7,5359188509,02/27/2021,912.0,PPS,BANZAI IZAKAYA INC.,1633 Bonanza St,Walnut Creek,CA,94596-4525,0,...,,0.0,,94596,Yes,2021,722511,Full-Service Restaurants,Restaurants and Other Eating Places,7225
8,5422668907,04/30/2021,912.0,PPS,RIVA CUCINA LLC,800 Heinz Ave Ste 19,Berkeley,CA,94710-2747,0,...,,0.0,,94710,No,2021,722511,Full-Service Restaurants,Restaurants and Other Eating Places,7225
9,5554978603,03/20/2021,912.0,PPS,RANGOON SUPER STARS LLC,542 Freya Way,Pleasant Hill,CA,94523-1712,0,...,,0.0,,94523,Yes,2021,722511,Full-Service Restaurants,Restaurants and Other Eating Places,7225
10,5681669004,05/22/2021,912.0,PPS,VIVA EL ESPANOL SPANISH SCHOOL INC.,3205 Stanley Blvd,Lafayette,CA,94549-3239,05/22/2021,...,,0.0,,94549,Yes,2021,624410,Child Day Care Services,Child Day Care Services,6244


In [33]:
PPP_ind_4_2021 = pd.DataFrame(PPP_combined_2021.groupby(['NAICS_4','IndustrySubsector'])['JobsReported']
                          .agg(Businesses='count',Jobs='sum').sort_values(by='Businesses', ascending=False))
PPP_ind_4_2021.reset_index(inplace=True)
PPP_ind_4_2021.index = np.arange(1, len(PPP_ind_4_2021) + 1)
PPP_ind_4_2021.head(20)

Unnamed: 0,NAICS_4,IndustrySubsector,Businesses,Jobs
1,4853,Taxi and Limousine Service,10744,11854
2,8121,Personal Care Services,10741,23987
3,7225,Restaurants and Other Eating Places,8233,142491
4,5416,"Management, Scientific, and Technical Consulti...",4261,13359
5,2361,Residential Building Construction,3599,17438
6,6212,Offices of Dentists,3348,23285
7,5312,Offices of Real Estate Agents and Brokers,3325,5204
8,8129,Other Personal Services,3144,9978
9,5419,"Other Professional, Scientific, and Technical ...",3022,9251
10,6213,Offices of Other Health Practitioners,2951,10992


In [34]:
PPP_retail_4_2021 = PPP_ind_4_2021[PPP_ind_4_2021['NAICS_4'].isin(retail_4)]
PPP_retail_4_2021

Unnamed: 0,NAICS_4,IndustrySubsector,Businesses,Jobs
20,4539,Other Miscellaneous Store Retailers,1777,5805
28,4481,Clothing Stores,1420,4270
36,4461,Health and Personal Care Stores,835,2292
54,4471,Gasoline Stations,494,6279
68,4451,Grocery Stores,360,2874
74,4483,"Jewelry, Luggage, and Leather Goods Stores",321,1354
76,4541,Electronic Shopping and Mail-Order Houses,302,1153
78,4411,Automobile Dealers,297,4228
82,4452,Specialty Food Stores,280,1956
83,4511,"Sporting Goods, Hobby, and Musical Instrument ...",274,1453


In [35]:
PPP_combined_minority_2021 = PPP_combined_2021[PPP_combined_2021['Minority']=='Yes']
PPP_combined_minority_2021.head(10)

Unnamed: 0,LoanNumber,DateApproved,SBAOfficeCode,ProcessingMethod,BorrowerName,BorrowerAddress,BorrowerCity,BorrowerState,BorrowerZip,LoanStatusDate,...,NonProfit,ForgivenessAmount,ForgivenessDate,zip_code_5,Minority,YearApproved,NAICS_6,LongName,IndustrySubsector,NAICS_4
6,5224738402,02/08/2021,912.0,PPS,ICHINA,70 Valley Ct,Atherton,CA,94027-6472,0,...,,0.0,,94027,Yes,2021,722511,Full-Service Restaurants,Restaurants and Other Eating Places,7225
7,5359188509,02/27/2021,912.0,PPS,BANZAI IZAKAYA INC.,1633 Bonanza St,Walnut Creek,CA,94596-4525,0,...,,0.0,,94596,Yes,2021,722511,Full-Service Restaurants,Restaurants and Other Eating Places,7225
9,5554978603,03/20/2021,912.0,PPS,RANGOON SUPER STARS LLC,542 Freya Way,Pleasant Hill,CA,94523-1712,0,...,,0.0,,94523,Yes,2021,722511,Full-Service Restaurants,Restaurants and Other Eating Places,7225
10,5681669004,05/22/2021,912.0,PPS,VIVA EL ESPANOL SPANISH SCHOOL INC.,3205 Stanley Blvd,Lafayette,CA,94549-3239,05/22/2021,...,,0.0,,94549,Yes,2021,624410,Child Day Care Services,Child Day Care Services,6244
20,9634738305,01/31/2021,912.0,PPS,JBR PARTNERS INC,1333 Evans Ave,San Francisco,CA,94124-1705,0,...,,0.0,,94124,Yes,2021,541820,Public Relations Agencies,"Advertising, Public Relations, and Related Ser...",5418
21,5004708402,02/07/2021,912.0,PPS,OSAKE INC,2446 Patio Ct,Santa Rosa,CA,95405-6737,0,...,,0.0,,95405,Yes,2021,722511,Full-Service Restaurants,Restaurants and Other Eating Places,7225
23,5174478702,04/02/2021,912.0,PPS,NEW JUMBO SEAFOOD RESTAURANT LLC,1532 Noriega St,San Francisco,CA,94122-4434,0,...,,0.0,,94122,Yes,2021,722511,Full-Service Restaurants,Restaurants and Other Eating Places,7225
45,5147588302,01/25/2021,912.0,PPS,KVS PRODUCTION INC,601 Mission Bay Blvd N,San Francisco,CA,94158-2472,0,...,,0.0,,94158,Yes,2021,722513,Limited-Service Restaurants,Restaurants and Other Eating Places,7225
50,3404988405,02/04/2021,912.0,PPS,RED CHILI GROUP INC,29583 Mission Blvd,Hayward,CA,94544-6129,0,...,,0.0,,94544,Yes,2021,722511,Full-Service Restaurants,Restaurants and Other Eating Places,7225
54,6070528704,04/03/2021,912.0,PPS,SPARX 7 LLC,333 Potrero Ave,San Francisco,CA,94103-4816,0,...,,0.0,,94103,Yes,2021,721310,"Rooming and Boarding Houses, Dormitories, and ...","Rooming and Boarding Houses, Dormitories, and ...",7213


In [36]:
PPP_minority_ind_4_2021 = pd.DataFrame(PPP_combined_minority_2021.groupby(['NAICS_4','IndustrySubsector'])['JobsReported']
                          .agg(Businesses='count',Jobs='sum').sort_values(by='Businesses', ascending=False))
PPP_minority_ind_4_2021.reset_index(inplace=True)
PPP_minority_ind_4_2021.index = np.arange(1, len(PPP_minority_ind_4_2021) + 1)
PPP_minority_ind_4_2021.head(20)

Unnamed: 0,NAICS_4,IndustrySubsector,Businesses,Jobs
1,7225,Restaurants and Other Eating Places,2848,38950
2,8121,Personal Care Services,2755,5559
3,4853,Taxi and Limousine Service,1345,1455
4,4841,General Freight Trucking,814,1554
5,6212,Offices of Dentists,721,4961
6,8129,Other Personal Services,710,2264
7,2361,Residential Building Construction,676,1811
8,4922,Local Messengers and Local Delivery,651,766
9,5416,"Management, Scientific, and Technical Consulti...",587,1537
10,5617,Services to Buildings and Dwellings,550,1817


In [37]:
PPP_minority_retail_4_2021 = PPP_minority_ind_4_2021[PPP_minority_ind_4_2021['NAICS_4'].isin(retail_4)]
PPP_minority_retail_4_2021

Unnamed: 0,NAICS_4,IndustrySubsector,Businesses,Jobs
18,4539,Other Miscellaneous Store Retailers,401,918
20,4481,Clothing Stores,391,821
26,4461,Health and Personal Care Stores,232,541
40,4471,Gasoline Stations,134,1439
52,4451,Grocery Stores,75,549
55,4452,Specialty Food Stores,71,451
60,4541,Electronic Shopping and Mail-Order Houses,64,148
73,4483,"Jewelry, Luggage, and Leather Goods Stores",47,179
83,4411,Automobile Dealers,40,275
84,4543,Direct Selling Establishments,39,44
