# Supporting America's Small Businesses

The Data Science Working Group of San Francisco has partnered with the San Francisco District Office (SFDO) of the Small Business Administration (SBA) to tackle several questions. In particular, we were tasked with identifying areas of low participation in SBA programs to help the SFDO allocate its resources more effectively. We also wanted to visualize the businesses that have used SBA resources, so that city officials can more easily identify which businesses in their region have been helped by the SBA and have become major success stories.

The SBA is a federal government agency that supports the growth and success of small businesses in the United States. It supports businesses in three primary areas:

1. Finance
2. Education
3. Government Contracting

In addition, the SBA provides other services such as counseling, incubator spaces, mentoring, and contracting assistance (helping small businesses win federal government work; by law, 23% of federal government contracts are reserved for small businesses). The agency is unique in that most of its programs are delivered via public/private partnerships with for-profit lenders and non-profit organizations. As a result, business owners rarely interact directly with the SBA and more commonly interact with a partner. For example, a business owner in need of financing might go to their regular bank and be offered a special loan product backed by the SBA, or might get counseling from a non-profit that the SBA funds through a grant. The SBA district office plays a compliance role, making sure partners live up to their agreements. It also makes the case for SBA programs and brings in new partners. Ideally, we want to be able to identify successful, popular businesses that benefited from SBA programs.

**Problem**: Identify areas underserved by the SBA

**Goal**: Help the SFDO identify high-priority areas (counties, types of businesses) to maximize the impact of its loans

**Decision solution will help influence**: Help the SBA allocate budget and headcount to high-priority places

**Deliverable**: The SFDO will have a matrix hierarchy for each level of the SBA. At the end, SBA staff will be equipped with "target profiles" describing where to increase their efforts (e.g. SF county, businesses of x size in y type of industry) and "non-target profiles" describing where to reduce their presence (i.e. cut back on low-impact loans). At the state level, they would want to understand which counties are underserved. They would also want a holistic picture of which types of loans to which businesses make the most impact, and which types of businesses or demographics are underserved. At the county level, they will decide which types of businesses to fund based on highest impact.

**Question Statement**: Where should the SBA allocate its budget for 2018 to have the greatest impact? Ideally the answer is quantified, e.g. "increase presence by 10%".

E.g. the target profiles for the SBA's 2018 expansion should be in SF and Alameda, focused on restaurant-type establishments. Because these two counties are highly populated, loans to businesses there will serve a larger population. Restaurants are also a target because they have historically had low loan disbursements despite a high success rate.

**Approach**: There are many ways to define high-priority (underserved) areas:

- Places where the SBA has historically dedicated few resources, e.g. counties where the SBA has no projects.
- Places where the share of loans is out of proportion to the share of businesses. This raises the question of what level of penetration the SBA wants to reach.
- Places where loans have historically had high impact but that are not getting a proportionate share of loans.
- Types of businesses that receive a smaller percentage of loans.
- Counties where businesses usually thrive due to high demand. For example, Google Places data could show where most good businesses are located, and could help direct support to places where people "hang out" (on Google, the yellow zones mark areas with a high influx of people).
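One of the definitions above, a mismatch between a county's share of loans and its share of businesses, can be sketched as a simple ratio. This is only an illustration: the county names and all counts below are hypothetical placeholders, since the loan data in this notebook does not include business counts per county.

```python
import pandas as pd

# Hypothetical inputs: neither the county names nor the counts come from the SBA data.
counts = pd.DataFrame(
    {
        "loans": [500, 300, 200],          # loans approved per county
        "businesses": [4000, 2000, 4000],  # small businesses per county
    },
    index=["COUNTY_A", "COUNTY_B", "COUNTY_C"],
)

# Each county's share of all loans and of all businesses.
shares = counts / counts.sum()

# Ratio below 1: the county gets a smaller share of loans than its share of businesses.
shares["representation_ratio"] = shares["loans"] / shares["businesses"]

underserved = shares[shares["representation_ratio"] < 1].sort_values("representation_ratio")
print(underserved)
```

A threshold of 1 is an arbitrary starting point; the SFDO might prefer a stricter cutoff or a ranking instead of a yes/no flag.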

We decided to focus on 3 main dimensions: counties, business type, 

**Summary statistics**:
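- There are 14 project counties in the SF district that appear in the SBA loan data. This is out of a total of x project counties that the SBA could be serving. (Opportunity: underserved counties)

- Santa Clara and Alameda account for about 43% of all loans (by count) approved by the SBA from 1991 to 2017. Adding SF brings the total to about 57%.

- About 55% of loans across all years were paid in full (PIF), and 22% carry the EXEMPT status. In the SBA's public loan data, EXEMPT generally means the loan's current status is withheld from disclosure, not that the borrower was exempt from repayment.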
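The combined county shares can be re-derived from per-county loan counts. A minimal sketch, assuming the counts printed by `sba_loans.ProjectCounty.value_counts()` later in this notebook (typed in by hand here so the snippet runs without the CSV):

```python
import pandas as pd

# Loan counts per county, copied from the ProjectCounty value counts in this notebook.
loan_counts = pd.Series({
    "SANTA CLARA": 9668, "ALAMEDA": 9477, "SAN FRANCISCO": 5896,
    "CONTRA COSTA": 4760, "SAN MATEO": 3543, "SONOMA": 2916,
    "SANTA CRUZ": 2050, "MARIN": 1727, "SOLANO": 1710, "NAPA": 910,
    "HUMBOLDT": 776, "MENDOCINO": 424, "LAKE": 202, "DEL NORTE": 70,
})

# Share of all loans per county, largest first, plus the running total.
shares = loan_counts.sort_values(ascending=False) / loan_counts.sum()
cumulative = shares.cumsum()

summary = pd.DataFrame({"share": shares, "cumulative_share": cumulative}).round(3)
print(summary.head(3))
```

These shares are by loan count, not by dollars approved; a dollar-weighted version would sum `GrossApproval` per county instead.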

**example blog**: http://datascience.codeforsanfrancisco.org/what-california-counties-are-arresting-abnormally-many-juveniles/

## Exploring the Data

In [1]:
# This is where we will load the data set and produce some basic summary statistics.
# For example, we might be interested in:
# 1. Counties served by the SFDO
# 2. How many loans per county per year were given out
# 3. Break out loans by loan status (discharged, paid off, etc.)
# 4. Do some basic correlations with demographic characteristics from Census Data
# 5. What's the breakdown of loans by types of businesses (NAICS code)
# 6. What population does each county serve? E.g. lending to businesses in more populated areas could mean higher impact.
# 7. How do we define "successful" businesses? Is it those with good Yelp ratings? Can we break down the characteristics of "successful" businesses?
# 8. Which counties have had the highest rate of loans to successful businesses? 


In [27]:
# 1. Counties served by the SFDO
sba_loans.ProjectCounty.unique()
# How do we compare this list with the full list of counties in the SF District? Are there counties in the district that are NOT on this list?
# It would be interesting to find labor data on the number of businesses in each of these counties. Is an even proportion of businesses served in each county? Why do some counties get higher proportions? Does a lower proportion of loan disbursement mean a county is underserved by the SBA?
# Other interesting questions: what population does each county serve? Wouldn't the SBA have higher impact if it served more populated communities (more population = higher demand)?

array(['NAPA', 'MENDOCINO', nan, 'SAN MATEO', 'SANTA CLARA',
       'CONTRA COSTA', 'SOLANO', 'MARIN', 'SONOMA', 'SANTA CRUZ',
       'HUMBOLDT', 'SAN FRANCISCO', 'ALAMEDA', 'LAKE', 'DEL NORTE'], dtype=object)
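The question raised in the cell comments, whether any district counties are missing from this list, can be checked with a set comparison. A minimal sketch on a small stand-in frame; `district_counties` is a hypothetical reference list that would need to be replaced with the official list of counties in the SF district.

```python
import numpy as np
import pandas as pd

# Stand-in for sba_loans; the real frame is loaded from the CSV.
sample = pd.DataFrame({"ProjectCounty": ["NAPA", "MARIN", np.nan, "NAPA", "LAKE"]})

# Hypothetical reference list; replace with the official list of district counties.
district_counties = {"NAPA", "MARIN", "LAKE", "EXAMPLE COUNTY"}

# Counties that appear in the loan data, ignoring rows with no county recorded.
served = set(sample["ProjectCounty"].dropna().unique())

not_served = sorted(district_counties - served)   # in the district, no loans on record
unexpected = sorted(served - district_counties)   # in the data, not in the reference list
missing_county_rows = int(sample["ProjectCounty"].isna().sum())

print(not_served, unexpected, missing_county_rows)
```

Counting the rows with a missing county is worth doing separately, since the `nan` in the output above shows that some loans have no `ProjectCounty` recorded.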

In [58]:
# Share of all loans by county (fraction of the total loan count)
county = pd.crosstab(sba_loans["ProjectCounty"], columns="count").sort_values("count", ascending=False)
county / county.sum()


ProjectCounty,count
SANTA CLARA,0.219085
ALAMEDA,0.214757
SAN FRANCISCO,0.133608
CONTRA COSTA,0.107866
SAN MATEO,0.080287
SONOMA,0.066079
SANTA CRUZ,0.046455
MARIN,0.039135
SOLANO,0.03875
NAPA,0.020621


In [39]:
# 2. How many loans per county per year were given out
pd.crosstab(sba_loans["ProjectCounty"],sba_loans["ApprovalFiscalYear"],margins=True)

ProjectCounty,1991.0,1992.0,1993.0,1994.0,1995.0,1996.0,1997.0,1998.0,1999.0,2000.0,...,2009.0,2010.0,2011.0,2012.0,2013.0,2014.0,2015.0,2016.0,2017.0,All
ALAMEDA,175,194,211,245,298,323,351,324,293,310,...,248,231,329,296,337,304,455,476,93,9477
CONTRA COSTA,69,108,96,115,170,201,166,174,145,133,...,125,140,147,137,149,163,217,262,50,4760
DEL NORTE,2,2,6,4,8,2,5,5,2,3,...,1,1,1,1,3,2,3,2,0,70
HUMBOLDT,13,24,21,38,52,39,52,36,32,31,...,22,19,10,19,24,14,26,21,14,776
LAKE,5,4,5,11,13,16,12,9,3,4,...,3,7,5,6,7,3,12,6,4,202
MARIN,27,36,39,60,78,67,50,60,48,36,...,48,65,69,43,43,55,71,90,12,1727
MENDOCINO,10,9,16,19,18,25,19,20,9,14,...,15,12,8,10,9,13,22,12,1,424
NAPA,11,12,22,38,46,44,23,22,20,12,...,16,32,29,32,37,33,52,42,9,910
SAN FRANCISCO,71,90,100,135,185,204,224,236,246,177,...,153,166,197,198,190,225,282,256,67,5896
SAN MATEO,43,66,65,103,109,133,155,122,146,116,...,98,103,119,118,134,139,161,163,33,3543


In [42]:
#Percentage breakdown of loans per county per year
def percConvert(ser):
    # Divide each column by its last entry, the "All" row added by margins=True
    return ser / float(ser.iloc[-1])
pd.crosstab(sba_loans["ProjectCounty"], sba_loans["ApprovalFiscalYear"], margins=True).apply(percConvert, axis=0)

ProjectCounty,1991.0,1992.0,1993.0,1994.0,1995.0,1996.0,1997.0,1998.0,1999.0,2000.0,...,2009.0,2010.0,2011.0,2012.0,2013.0,2014.0,2015.0,2016.0,2017.0,All
ALAMEDA,0.225806,0.205945,0.207473,0.188897,0.178764,0.187682,0.210054,0.196126,0.19188,0.237184,...,0.212876,0.181604,0.226272,0.212186,0.235829,0.206522,0.231082,0.241993,0.216783,0.214757
CONTRA COSTA,0.089032,0.11465,0.094395,0.088666,0.10198,0.116793,0.099342,0.105327,0.094957,0.10176,...,0.107296,0.110063,0.1011,0.098208,0.104269,0.110734,0.110208,0.133198,0.11655,0.107866
DEL NORTE,0.002581,0.002123,0.0059,0.003084,0.004799,0.001162,0.002992,0.003027,0.00131,0.002295,...,0.000858,0.000786,0.000688,0.000717,0.002099,0.001359,0.001524,0.001017,0.0,0.001586
HUMBOLDT,0.016774,0.025478,0.020649,0.029298,0.031194,0.022661,0.031119,0.021792,0.020956,0.023718,...,0.018884,0.014937,0.006878,0.01362,0.016795,0.009511,0.013205,0.010676,0.032634,0.017585
LAKE,0.006452,0.004246,0.004916,0.008481,0.007798,0.009297,0.007181,0.005448,0.001965,0.00306,...,0.002575,0.005503,0.003439,0.004301,0.004899,0.002038,0.006094,0.00305,0.009324,0.004577
MARIN,0.034839,0.038217,0.038348,0.046261,0.046791,0.038931,0.029922,0.03632,0.031434,0.027544,...,0.041202,0.051101,0.047455,0.030824,0.030091,0.037364,0.036059,0.045755,0.027972,0.039135
MENDOCINO,0.012903,0.009554,0.015733,0.014649,0.010798,0.014526,0.01137,0.012107,0.005894,0.010712,...,0.012876,0.009434,0.005502,0.007168,0.006298,0.008832,0.011173,0.006101,0.002331,0.009608
NAPA,0.014194,0.012739,0.021632,0.029298,0.027594,0.025567,0.013764,0.013317,0.013098,0.009181,...,0.013734,0.025157,0.019945,0.022939,0.025892,0.022418,0.026409,0.021352,0.020979,0.020621
SAN FRANCISCO,0.091613,0.095541,0.098328,0.104086,0.110978,0.118536,0.134051,0.142857,0.1611,0.135425,...,0.13133,0.130503,0.135488,0.141935,0.13296,0.152853,0.14322,0.130147,0.156177,0.133608
SAN MATEO,0.055484,0.070064,0.063913,0.079414,0.065387,0.077281,0.092759,0.07385,0.095612,0.088753,...,0.08412,0.080975,0.081843,0.084588,0.093772,0.094429,0.081767,0.082867,0.076923,0.080287
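The same per-year shares can also be produced without a helper such as `percConvert`, using the `normalize` argument of `pd.crosstab`. A minimal sketch on a small made-up frame (the rows below are illustrative, not taken from the SBA file):

```python
import pandas as pd

# Illustrative rows only; the real data comes from sba_loans.
sample = pd.DataFrame({
    "ProjectCounty": ["ALAMEDA", "ALAMEDA", "MARIN", "ALAMEDA", "MARIN", "MARIN"],
    "ApprovalFiscalYear": [1991, 1991, 1991, 1992, 1992, 1992],
})

# normalize="columns" divides each year's column by that year's total,
# so every column sums to 1.
year_shares = pd.crosstab(
    sample["ProjectCounty"],
    sample["ApprovalFiscalYear"],
    normalize="columns",
)
print(year_shares)
```

Letting pandas do the normalization avoids relying on the position of the "All" row.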


In [57]:
# 3. Break out loans by loan status (discharged, paid off, etc.)
loans = pd.crosstab(sba_loans["LoanStatus"], columns="count").sort_values("count", ascending=False)
loans / loans.sum()

LoanStatus,count
PIF,0.546571
EXEMPT,0.220414
CHGOFF,0.109935
CANCLD,0.102253
COMMIT,0.015093
NOT FUNDED,0.005734
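To see whether the status mix differs by county, the same breakdown can be computed within each county. A minimal sketch on made-up rows, assuming the `ProjectCounty` and `LoanStatus` columns shown in this notebook; comparing counties by charge-off share is one possible proxy for loan outcomes, not a definition the SFDO has endorsed.

```python
import pandas as pd

# Illustrative rows only; the real data comes from sba_loans.
sample = pd.DataFrame({
    "ProjectCounty": ["NAPA", "NAPA", "NAPA", "NAPA", "LAKE", "LAKE"],
    "LoanStatus": ["PIF", "PIF", "PIF", "CHGOFF", "PIF", "CHGOFF"],
})

# normalize="index" makes each county's row sum to 1,
# giving the share of that county's loans in each status.
status_mix = pd.crosstab(sample["ProjectCounty"], sample["LoanStatus"], normalize="index")

# Counties with the largest charged-off share first.
ranked = status_mix.sort_values("CHGOFF", ascending=False)
print(ranked)
```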


In [2]:
import pandas as pd
sba_loans = pd.read_csv('./Data/SFDO_504_7A-clean.csv')

In [4]:
sba_loans.head()

Unnamed: 0,Column,Program,BorrName,BorrStreet,BorrCity,BorrState,BorrZip,BankName,BankStreet,BankCity,...,BusinessType,LoanStatus,ChargeOffDate,GrossChargeOffAmount,RevolverStatus,JobsSupported,ThirdPartyLender_Name,ThirdPartyLender_City,ThirdPartyLender_State,ThirdPartyDollars
0,3098.0,7A,PER CASO PRODUCTIONS,6600 YOUNT ST #12,YOUNTVILLE,CA,94599.0,Bank of Hope,"3731 Wilshire Blvd, Ste 1000",LOS ANGELES,...,INDIVIDUAL,CHGOFF,6/17/2009,11775.0,0.0,2.0,,,,
1,4752.0,504,WINE GARDEN LLC,6476 WASHINGTON STREET,YOUNTVILLE,CA,94599.0,Bay Area Employment Development Company,1801 Oakland Boulevard,Walnut Creek,...,CORPORATION,PIF,,0.0,,25.0,,,,
2,5338.0,7A,BORDEAUX HOUSE,6600 WASHINGTON STREET,YOUNTVILLE,CA,94599.0,Bank of the West,180 Montgomery St,SAN FRANCISCO,...,INDIVIDUAL,PIF,,0.0,0.0,0.0,,,,
3,5836.0,7A,"VITA PARTNERS, LLC",6725 WASHINGTON STREET,YOUNTVILLE,CA,94599.0,First Republic Bank,111 Pine St,SAN FRANCISCO,...,CORPORATION,CANCLD,,0.0,0.0,85.0,,,,
4,12154.0,7A,NAPA VALLEY BIKE TOURS AND NAP,6488 WASHINGTON ST,YOUNTVILLE,CA,94599.0,"Wells Fargo Bank, National Association",101 N Philips Ave,SIOUX FALLS,...,CORPORATION,PIF,,0.0,0.0,8.0,,,,


In [6]:
sba_loans.columns

Index([u'Column', u'Program', u'BorrName', u'BorrStreet', u'BorrCity',
       u'BorrState', u'BorrZip', u'BankName', u'BankStreet', u'BankCity',
       u'BankState', u'BankZip', u'GrossApproval', u'SBAGuaranteedApproval',
       u'ApprovalDate', u'ApprovalFiscalYear', u'FirstDisbursementDate',
       u'DeliveryMethod', u'subpgmdesc', u'InitialInterestRate',
       u'TermInMonths', u'NaicsCode', u'NaicsDescription', u'FranchiseCode',
       u'FranchiseName', u'ProjectCounty', u'ProjectState',
       u'SBADistrictOffice', u'CongressionalDistrict', u'BusinessType',
       u'LoanStatus', u'ChargeOffDate', u'GrossChargeOffAmount',
       u'RevolverStatus', u'JobsSupported', u'ThirdPartyLender_Name',
       u'ThirdPartyLender_City', u'ThirdPartyLender_State',
       u'ThirdPartyDollars'],
      dtype='object')

In [61]:
# Count loans by business type (NAICS description), most common first
naics_counts = pd.crosstab(sba_loans["NaicsDescription"], columns="count").sort_values("count", ascending=False)
print(naics_counts)

col_0                                               count
NaicsDescription                                         
Full-Service Restaurants                             1532
Offices of Dentists                                  1413
General Automotive Repair                             852
Limited-Service Restaurants                           807
Drycleaning and Laundry Services (except Coin-O...    598
Offices of Lawyers                                    508
Hotels (except Casino Hotels) and Motels              497
Offices of Physicians (except Mental Health Spe...    492
Gasoline Stations with Convenience Stores             490
Beer, Wine, and Liquor Stores                         472
Child Day Care Services                               424
Beauty Salons                                         422
Supermarkets and Other Grocery (except Convenie...    418
Automotive Body, Paint, and Interior Repair and...    409
Fitness and Recreational Sports Centers               365
Engineering Se
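Detailed NAICS descriptions split closely related businesses (for example, full-service and limited-service restaurants) into separate rows. One way to get a coarser view is to group on the first two digits of `NaicsCode`. A minimal sketch on made-up rows, assuming the column is stored as a float with missing values, as other numeric columns in this file appear to be:

```python
import numpy as np
import pandas as pd

# Illustrative rows only: two restaurant codes, one dentist code, one missing value.
sample = pd.DataFrame({"NaicsCode": [722511.0, 722513.0, 621210.0, np.nan, 722511.0]})

# Drop missing codes, then strip the float formatting before slicing digits.
codes = sample["NaicsCode"].dropna().astype(int).astype(str)

# The first two digits give the two-digit NAICS prefix.
sector = codes.str[:2]
sector_counts = sector.value_counts()
print(sector_counts)
```

Some NAICS sectors span several two-digit prefixes (manufacturing is 31 to 33, retail trade is 44 to 45), so a full analysis would map prefixes to sector names before summing.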

In [36]:
# Loan counts per county; pass normalize=True to get the percentage breakout instead
sba_loans.ProjectCounty.value_counts()

SANTA CLARA      9668
ALAMEDA          9477
SAN FRANCISCO    5896
CONTRA COSTA     4760
SAN MATEO        3543
SONOMA           2916
SANTA CRUZ       2050
MARIN            1727
SOLANO           1710
NAPA              910
HUMBOLDT          776
MENDOCINO         424
LAKE              202
DEL NORTE          70
Name: ProjectCounty, dtype: int64
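For the percentage breakout of these counts, `value_counts` accepts `normalize=True`. A minimal sketch on a small made-up frame, which also shows how rows with a missing county change the denominator:

```python
import pandas as pd

# Illustrative rows only; one row has no county recorded.
sample = pd.DataFrame({"ProjectCounty": ["ALAMEDA", "ALAMEDA", "ALAMEDA", "MARIN", None]})

# Shares among rows that have a county (missing values are dropped by default).
pct = sample["ProjectCounty"].value_counts(normalize=True)

# Shares among all rows, keeping the missing-county rows in the denominator.
pct_with_missing = sample["ProjectCounty"].value_counts(normalize=True, dropna=False)

print(pct)
print(pct_with_missing)
```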

In [35]:
import pandas as pd
sba_loans = pd.read_csv('./Data/SFDO_504_7A-clean.csv')

#Number of loans per county per year
sba_loans.groupby(['ApprovalFiscalYear','ProjectCounty']).size()

#sba_loans.ProjectCounty.value_counts()
#.sort_values(ascending=False)
#I want to get the total number of loans that year in the last row
#I want to get the breakout in percentages

ApprovalFiscalYear  ProjectCounty
1991.0              ALAMEDA          175
                    CONTRA COSTA      69
                    DEL NORTE          2
                    HUMBOLDT          13
                    LAKE               5
                    MARIN             27
                    MENDOCINO         10
                    NAPA              11
                    SAN FRANCISCO     71
                    SAN MATEO         43
                    SANTA CLARA      176
                    SANTA CRUZ        64
                    SOLANO            44
                    SONOMA            65
1992.0              ALAMEDA          194
                    CONTRA COSTA     108
                    DEL NORTE          2
                    HUMBOLDT          24
                    LAKE               4
                    MARIN             36
                    MENDOCINO          9
                    NAPA              12
                    SAN FRANCISCO     90
                    SAN

## Analysis and Data Visualization

In [None]:
# This is where we might put some more advanced analysis and data visualizations
# For example, we might be interested in:
# 1. Doing some basic regression analysis on racial demographics? Evidence of discrimination?
# 2. Some nice maps

## Conclusion

In [None]:
# What are some overall conclusions? What did we take away?