# Introduction
## SBA - Small Business Profiles for the States and Territories

The Office of Advocacy’s Small Business Profiles are an annual analysis of each state’s small business activities. Each profile gathers the latest information from key federal data-gathering agencies to provide a snapshot of small business health and economic activity. This year’s profiles report on state economic growth and employment; small business employment, industry composition, and turnover; plus business owner demographics and county-level employment change. 

https://www.sba.gov/

In [1]:
from IPython.display import display, HTML
display(HTML("""<style> .container {width:96% !important;}</style>"""))

from IPython.display import IFrame

In [2]:
import pandas as pd
import multiprocessing
import numpy as np
from multiprocessing.dummy import Pool as ThreadPool
from functools import partial
import math

# Handle s3 or local
import s3fs
from os import listdir
from os.path import isfile, join
import subprocess

## Path to the files

In [3]:
import sys
sys.path.insert(0,'../')
from Tools.paths import *

In [4]:
def list_files(path, ext='pdf'):
    """Return the base names (without extension) of the `ext` files found under `path`."""
    if path.startswith('s3://'):
        # `aws s3 ls` prints one line per object; the name is the last field
        onlyfiles = subprocess.check_output(['aws', 's3', 'ls', path])
        onlyfiles = onlyfiles.split('\n')
        onlyfiles = [f.split(" ")[-1] for f in onlyfiles]
    else:
        onlyfiles = [f for f in listdir(path) if isfile(join(path, f))]
    suffix = '.{}'.format(ext)
    onlyfiles = [f for f in onlyfiles if f.endswith(suffix)]
    # Strip only the trailing extension, not every occurrence of it in the name
    files = [f[:-len(suffix)] for f in onlyfiles]
    return files
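
The S3 branch relies on the layout of the `aws s3 ls` listing, where the object name is the last space-separated field of each line. A minimal sketch of that parsing step on a made-up listing (the dates, sizes, and file names below are hypothetical, and names containing spaces would need different handling):

```python
def names_from_listing(listing, ext='pdf'):
    """Return the base names of the `ext` files in an `aws s3 ls`-style listing."""
    suffix = '.{}'.format(ext)
    names = []
    for line in listing.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        name = fields[-1]  # assumes the name is the last field and has no spaces
        if name.endswith(suffix):
            names.append(name[:-len(suffix)])
    return names


# Hypothetical listing, for illustration only
sample_listing = (
    "2016-06-01 10:15:02     204800 Alabama.pdf\n"
    "2016-06-01 10:15:04     198656 Alaska.pdf\n"
    "2016-06-01 10:15:05       1024 README.txt\n"
)
print(names_from_listing(sample_listing))
```

Keeping the parsing in a pure function like this makes it possible to check the filter without credentials or network access.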

In [5]:
def path(path,name,ext = 'pdf'):
    path_file = '{}{}.{}'.format(path,name,ext)
    return path_file

## Loading the file with PyPDF

In [6]:
import PyPDF2

In [7]:
def load_pdf(path_file):
    """Return one list of whitespace-separated tokens per page of the PDF."""

    def get_content(fp_in):
        content = []
        pdf = PyPDF2.PdfFileReader(fp_in)
        number_of_pages = pdf.getNumPages()
        for i in xrange(number_of_pages):
            page = pdf.getPage(i).extractText().split()
            content.append(page)
        return content

    if path_file.startswith('s3://'):
        fs = s3fs.S3FileSystem()
        with fs.open(path_file, 'rb') as fp_in:
            content = get_content(fp_in)
    else:
        # Use a context manager so the local file handle is closed after reading
        with open(path_file, 'rb') as fp_in:
            content = get_content(fp_in)

    return content

In [8]:
%%time
files = list_files(path_s3)[0]
path_file = path(path_s3,files)
file_pdf = load_pdf(path_file)

CPU times: user 348 ms, sys: 28 ms, total: 376 ms
Wall time: 1.49 s


In [9]:
for fp in file_pdf:
    print fp 
    print '\n'

[u'AlabamaSmallBusiness,2016', u'5', u'SBAofAdvocacy', u'ALABAMA', u'382,524', u'SmallBusinesses', u'765,293', u'SmallBusinessEmployees', u'96.7%', u'ofAlabamaBusinesses', u'47.7%', u'ofAlabamaEmployees', u'EMPLOYMENT', u'5,734', u'netnewjobs', u'1', u'DIVERSITY', u'30.7%', u'increaseinminority', u'ownership', u'2', u'TRADE', u'81.2%', u'ofAlabamaexporters', u'3', u'O', u'VERALL', u'A', u'LABAMA', u'E', u'CONOMY', u'\u0141', u'Inthethirdquarterof2015,Alabamagrewatanannualrateof', u'2.2%', u'whichwasfasterthantheoverallUSgrowthrateof', u'1.9%', u".Bycomparison,Alabama's2014growthof", u'3.6%', u'wasupfromthe2013levelof', u'3.1%', u'.(Source:', u'BEA', u')', u'\u0141', u'Atthecloseof2015,unemploymentwas', u'6.3%', u',upfrom', u'6.1%', u'atthecloseof2014.Thiswasabovethenationalunem-', u'ploymentrateof', u'5.0%', u'.(Source:', u'CPS', u')', u'E', u'MPLOYMENT', u'\u0141', u'Alabamasmallbusinessesemployed', u'765,293', u'people,or', u'47.7%', u'oftheprivateworkforce,in2013.(Source:', u'SUSB',

![alt text](Header.png)

In [10]:
print file_pdf[0]

[u'AlabamaSmallBusiness,2016', u'5', u'SBAofAdvocacy', u'ALABAMA', u'382,524', u'SmallBusinesses', u'765,293', u'SmallBusinessEmployees', u'96.7%', u'ofAlabamaBusinesses', u'47.7%', u'ofAlabamaEmployees', u'EMPLOYMENT', u'5,734', u'netnewjobs', u'1', u'DIVERSITY', u'30.7%', u'increaseinminority', u'ownership', u'2', u'TRADE', u'81.2%', u'ofAlabamaexporters', u'3', u'O', u'VERALL', u'A', u'LABAMA', u'E', u'CONOMY', u'\u0141', u'Inthethirdquarterof2015,Alabamagrewatanannualrateof', u'2.2%', u'whichwasfasterthantheoverallUSgrowthrateof', u'1.9%', u".Bycomparison,Alabama's2014growthof", u'3.6%', u'wasupfromthe2013levelof', u'3.1%', u'.(Source:', u'BEA', u')', u'\u0141', u'Atthecloseof2015,unemploymentwas', u'6.3%', u',upfrom', u'6.1%', u'atthecloseof2014.Thiswasabovethenationalunem-', u'ploymentrateof', u'5.0%', u'.(Source:', u'CPS', u')', u'E', u'MPLOYMENT', u'\u0141', u'Alabamasmallbusinessesemployed', u'765,293', u'people,or', u'47.7%', u'oftheprivateworkforce,in2013.(Source:', u'SUSB',
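
In this output the small-caps section headings arrive split into a capital letter and the rest of the word (`u'O', u'VERALL'`). A minimal sketch of gluing those fragments back together, assuming a lone capital letter followed by an all-uppercase token always belongs to the same word (this is a heuristic of ours, not something PyPDF2 provides):

```python
def merge_small_caps(tokens):
    """Join a single capital letter with the all-uppercase token that follows it."""
    merged = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if (len(tok) == 1 and tok.isalpha() and tok.isupper()
                and nxt is not None and nxt.isalpha() and nxt.isupper()):
            merged.append(tok + nxt)
            i += 2  # both fragments consumed
        else:
            merged.append(tok)
            i += 1
    return merged


# Consecutive tokens copied from the first page printed above
tokens = ['ofAlabamaexporters', '3', 'O', 'VERALL', 'A', 'LABAMA', 'E', 'CONOMY']
print(merge_small_caps(tokens))
```

The missing spaces inside the body text (`Inthethirdquarterof2015,...`) are a separate problem that this heuristic does not address.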

## Loading the file with Tabula

tabula-py is a simple Python wrapper around tabula-java that reads tables from PDF files. It can return those tables as pandas DataFrames or convert a PDF file into a CSV, TSV, or JSON file.

In [11]:
import tabula

In [12]:
def load_pdf(path_file):
    if path_file.startswith('s3://'):  
        fs = s3fs.S3FileSystem()
        with fs.open(path_file, 'rb') as fp_in:
            pdf = tabula.read_pdf(fp_in,multiple_tables=True, pages = 'all')
    else:
        pdf = tabula.read_pdf(path_file,multiple_tables=True, pages = 'all')
    return pdf

In [13]:
%%time
files = list_files(path_local)[0]
path_file = path(path_local,files)
file_pdf = load_pdf(path_file)

CPU times: user 4 ms, sys: 8 ms, total: 12 ms
Wall time: 4.37 s


In [14]:
for fp in file_pdf:
    print fp 
    print '\n'

      0                                                  1        2  \
0   NaN  of small business employment. See Figure 1 for...      NaN   
1   NaN      tails on firms with employees. (Source: SUSB)    1.5 M   
2     •  Private-sector employment increased 1.3% in 20...      NaN   
3   NaN  was below the previous year’s increase of 1.7%...      NaN   
4   NaN                                               CES)    1.0 M   
5     •  The number of proprietors increased in 2014 by...      NaN   
6   NaN           tive to the previous year. (Source: BEA)      NaN   
7   NaN                                                NaN      NaN   
8   NaN                                                NaN  500.0 K   
9   NaN                                                NaN      NaN   
10    •  Small businesses created 5,734 net jobs in 201...      NaN   
11  NaN  the seven BDS size-classes, firms employing 50...      NaN   
12  NaN  ployees experienced the largest gains, adding ...      NaN   

     

![alt text](Table.png)

In [15]:
file_pdf[-1]

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | Retail Trade | 10674 | 9627 | 27992 | 38666 |
| 1 | Other Services (except Public Administration) | 10042 | 9332 | 63575 | 73617 |
| 2 | Professional, Scientific, and Technical Services | 8081 | 7378 | 31099 | 39180 |
| 3 | Health Care and Social Assistance | 7823 | 6670 | 21808 | 29631 |
| 4 | Construction | 7143 | 6373 | 39463 | 46606 |
| 5 | Accommodation and Food Services | 5525 | 4255 | 4889 | 10414 |
| 6 | Wholesale Trade | 3785 | 2974 | 5061 | 8846 |
| 7 | Manufacturing | 3377 | 2349 | 4425 | 7802 |
| 8 | Administrative, Support, and Waste Management | 3355 | 2842 | 37265 | 40620 |
| 9 | Finance and Insurance | 2916 | 2582 | 7842 | 10758 |


## Loading the file with PDFQuery (an alternative to using PDFMiner directly)

PDFQuery is a light wrapper around pdfminer, lxml and pyquery. It's designed to reliably extract data from sets of PDFs with as little code as possible.

In [16]:
import pdfquery

In [17]:
def load_pdf(path_file):
    if path_file.startswith('s3://'):  
        fs = s3fs.S3FileSystem()
        with fs.open(path_file, 'rb') as fp_in:
            pdf = pdfquery.PDFQuery(fp_in)
            pdf.load()
    else:
        pdf = pdfquery.PDFQuery(path_file)
        pdf.load()        
    return pdf

In [18]:
%%time
files = list_files(path_local)[0]
path_file = path(path_local,files)
file_pdf = load_pdf(path_file)

CPU times: user 3.3 s, sys: 96 ms, total: 3.4 s
Wall time: 3.44 s


In [19]:
file_pdf

<pdfquery.pdfquery.PDFQuery at 0x7f5a068b5090>

### Finding some text and retrieving the coordinates

![alt text](Header.png)

In [20]:
def getCoordinates(pdf, query, type_search="Line"):
    """Yield the bounding box, text, and page id of each element containing `query`."""
    name = pdf.pq('LTText%sHorizontal:contains("%s")' % (type_search, query))
    for n in name:
        d = dict()
        # Round outward to 3 decimals so the box still encloses the element
        d["left_corner"] = math.floor(float(n.layout.x0) * 1000) / 1000.0
        d["bottom_corner"] = math.floor(float(n.layout.y0) * 1000) / 1000.0
        d["right_corner"] = math.ceil(float(n.layout.x1) * 1000) / 1000.0
        d["upper_corner"] = math.ceil(float(n.layout.y1) * 1000) / 1000.0
        d["text"] = n.layout.get_text()
        d["pageid"] = int(float(next(n.iterancestors('LTPage')).layout.pageid))
        yield d

In [21]:
g = getCoordinates(file_pdf,'Small Businesses', type_search='Line')
d = next(g,None)
d

{'bottom_corner': 635.368,
 'left_corner': 103.344,
 'pageid': 1,
 'right_corner': 190.135,
 'text': u'Small Businesses\n',
 'upper_corner': 648.985}
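
`getCoordinates` rounds the lower-left corner down and the upper-right corner up, so the 3-decimal box never ends up smaller than the element it describes. A minimal sketch of that outward rounding, using made-up raw coordinates (the unrounded values below are hypothetical, chosen so that they land on the box shown above):

```python
import math


def round_outward(x0, y0, x1, y1, decimals=3):
    """Round a bounding box so that it can only grow, never shrink."""
    scale = 10 ** decimals
    return (math.floor(x0 * scale) / float(scale),
            math.floor(y0 * scale) / float(scale),
            math.ceil(x1 * scale) / float(scale),
            math.ceil(y1 * scale) / float(scale))


# Hypothetical unrounded layout values
raw = (103.3442, 635.3687, 190.1349, 648.9841)
box = round_outward(*raw)
print(box)

# The rounded box encloses the raw one
assert box[0] <= raw[0] and box[1] <= raw[1]
assert box[2] >= raw[2] and box[3] >= raw[3]
```

Rounding to the nearest value instead could shave a fraction of a point off one edge, which matters for selectors that require an element to sit fully inside the box.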

### Retrieving the text around a given set of coordinates

In [22]:
file_pdf.pq(('LTPage[pageid="%s"] LTTextBoxHorizontal:overlaps_bbox("%f,%f,%f,%f")' % (d['pageid'],
                                                                                  d['left_corner'],
                                                                                  d['bottom_corner'],
                                                                                  d['right_corner'],
                                                                                  d['upper_corner']))).text()

'Small Businesses\nof Alabama Businesses'

In [23]:
left_corner = 0
file_pdf.pq(('LTPage[pageid="%s"] LTTextBoxHorizontal:overlaps_bbox("%f,%f,%f,%f")' % (d['pageid'],
                                                                                  left_corner,
                                                                                  d['bottom_corner'],
                                                                                  d['right_corner'],
                                                                                  d['upper_corner']))).text()

'382,524\n96.7% Small Businesses\nof Alabama Businesses'
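
Both queries build the same selector string by hand and differ only in the left edge. A small helper can assemble it. Note that `bbox_selector` is a hypothetical name, not part of PDFQuery, and it only formats the string that is then passed to `file_pdf.pq`:

```python
def bbox_selector(pageid, left, bottom, right, top,
                  element='LTTextBoxHorizontal', relation='overlaps_bbox'):
    """Build a selector for elements on a page related to a bounding box."""
    return 'LTPage[pageid="%s"] %s:%s("%f,%f,%f,%f")' % (
        pageid, element, relation, left, bottom, right, top)


# Coordinates of 'Small Businesses' from the dictionary above
d = {'pageid': 1, 'left_corner': 103.344, 'bottom_corner': 635.368,
     'right_corner': 190.135, 'upper_corner': 648.985}

# Same box, but widened to the left edge of the page
selector = bbox_selector(d['pageid'], 0, d['bottom_corner'],
                         d['right_corner'], d['upper_corner'])
print(selector)
```

With the loaded document, `file_pdf.pq(selector).text()` would then run the same query as the cell above.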

### Reading several fields all at once

In [24]:
KeyFigures = ['EMPLOYMENT',
              'DIVERSITY',
              'TRADE']    
delta_bottom = 30

Info = [('with_formatter', 'text')]

for kf in KeyFigures:
    g = getCoordinates(pdf=file_pdf,query=kf,type_search="Box")
    d = next(g,None)
    # Extend the box downward by delta_bottom to capture the figure under the heading
    Info.append(tuple((kf,'LTPage[pageid="%s"] LTTextBoxHorizontal:overlaps_bbox("%f,%f,%f,%f")'%(d['pageid'],
                                                                                                   d["left_corner"],
                                                                                                   d["bottom_corner"]-delta_bottom,
                                                                                                   d["right_corner"],
                                                                                                   d["upper_corner"]))))
# Run the extraction once, after all selectors have been collected
info = file_pdf.extract(Info)
info

{'DIVERSITY': 'DIVERSITY 30.7% increase in minority ownership2',
 'EMPLOYMENT': 'EMPLOYMENT 5,734 net new jobs1',
 'TRADE': 'TRADE\n81.2% of Alabama exporters3'}
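
Each extracted string still carries the heading and a trailing footnote digit. A minimal sketch of splitting it into a value and a label with a regular expression, assuming every key figure follows the "HEADING value label footnote" pattern seen in the three strings above:

```python
import re

KEY_FIGURE = re.compile(
    r'^(?P<heading>[A-Z]+)\s+(?P<value>[\d,.]+%?)\s+(?P<label>.*?)(?P<footnote>\d)?$')


def parse_key_figure(text):
    """Split 'HEADING value label<footnote>' into its parts, or return None."""
    match = KEY_FIGURE.match(text.strip())
    if match is None:
        return None
    return {'heading': match.group('heading'),
            'value': match.group('value'),
            'label': match.group('label').strip()}


info = {'DIVERSITY': 'DIVERSITY 30.7% increase in minority ownership2',
        'EMPLOYMENT': 'EMPLOYMENT 5,734 net new jobs1',
        'TRADE': 'TRADE\n81.2% of Alabama exporters3'}

for key in sorted(info):
    print(parse_key_figure(info[key]))
```

A label that legitimately ends in a digit would lose it to the footnote group, so the pattern should be checked against each profile before it is trusted.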

### A better example

![alt text](Dictionary.png)

In [25]:
def info1(file_pdf):
    col_right_align = 300
    DemographicGroup = ['American-owned',
                        'Asian-owned',
                        'Islander-owned',
                        'Hispanic-owned',
                        'Alaskan-owned',
                        'Minority-owned',
                        'Nonminority-owned']    
    
    DemographicInfo = [('with_formatter', 'text')]
    
    for dg in DemographicGroup:
        g = getCoordinates(pdf=file_pdf,query=dg,type_search="Line")
        d = next(g,None)
        DemographicInfo.append(tuple((dg,'LTTextLineHorizontal:in_bbox("%f,%f,%f,%f")'%(d["left_corner"],
                                                                                        d["bottom_corner"],
                                                                                        col_right_align,
                                                                                        d["upper_corner"]))))
    info = file_pdf.extract(DemographicInfo)
    return info

In [26]:
info1(file_pdf)

{'Alaskan-owned': 'Native American/Alaskan-owned l 27.0%',
 'American-owned': 'African American-owned l 28.7%',
 'Asian-owned': 'Asian-owned l 35.4%',
 'Hispanic-owned': 'Hispanic-owned l 51.5%',
 'Islander-owned': u'Hawaiian/Paci\ufb01c Islander-owned l -16.9%',
 'Minority-owned': 'Minority-owned l 30.7%',
 'Nonminority-owned': 'Nonminority-owned l -8.6%'}
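
The values come back as one string per group, with a stray `l` between the name and the percentage (presumably a marker glyph from the chart). A minimal sketch of turning each string into a `(group, percentage)` pair, assuming the ` l ` separator is present in every line as it is above:

```python
def parse_demographic(text):
    """Split 'Group name l 12.3%' into (group name, percentage as float)."""
    name, _, pct = text.rpartition(' l ')
    return name.strip(), float(pct.strip().rstrip('%'))


# Three of the strings returned above
info = {'Minority-owned': 'Minority-owned l 30.7%',
        'Nonminority-owned': 'Nonminority-owned l -8.6%',
        'Islander-owned': u'Hawaiian/Paci\ufb01c Islander-owned l -16.9%'}

parsed = dict(parse_demographic(v) for v in info.values())
print(parsed['Minority-owned'])
```

Keying the result on the full group name also sidesteps the partial search terms (`'American-owned'`, `'Alaskan-owned'`) that were only needed to locate the lines.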

### How about a full table?

![alt text](Table.png)

In [27]:
def getTable(file_pdf, col_width, row_space, row_height,title,bottom_corner_dif,headers,col_left_align):
    
    table = list()
    table.append(headers)
    
    # Locate the table title, then start reading bottom_corner_dif points below it
    g = getCoordinates(pdf=file_pdf,query=title,type_search="Line")
    d = next(g,None)
    
    pageid = d['pageid']
    bottom_corner = d['bottom_corner'] - bottom_corner_dif

    while True:
        # One bounding-box selector per column for the current row
        boxes = list()
        for c in xrange(len(headers)):
            boxes.append(tuple(('col_%s' %(c),
                               'LTPage[pageid="%s"] LTTextLineHorizontal:overlaps_bbox("%f,%f,%f,%f")' % (pageid,
                                                                                                          col_left_align[c],
                                                                                                          bottom_corner,
                                                                                                          col_left_align[c]+col_width,
                                                                                                          bottom_corner+row_height))))

        row = file_pdf.extract(boxes)
        columns = [row['col_{}'.format(c)].text() for c in xrange(len(headers))]
        table.append(columns)
        # Stop once the row whose first column contains 'Total' has been read
        if 'Total' in row['col_0'].text():
            break

        # Move down to the next row
        bottom_corner -= row_space
    return table
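
`getTable` walks down the page one row at a time: every row uses the same column offsets, and only the bottom edge moves by `row_space`. A minimal sketch of the boxes it queries, written as a pure function so the geometry can be checked without a PDF (`row_boxes` and the starting coordinate are hypothetical; the column settings are the ones `info2` uses below):

```python
def row_boxes(first_bottom, n_rows, col_left_align, col_width, row_space, row_height):
    """Yield, for each row, one (left, bottom, right, top) box per column."""
    bottom = first_bottom
    for _ in range(n_rows):
        yield [(left, bottom, left + col_width, bottom + row_height)
               for left in col_left_align]
        bottom -= row_space  # the next row sits row_space points lower on the page


# first_bottom=500.0 is a made-up starting point, for illustration only
rows = list(row_boxes(first_bottom=500.0, n_rows=3,
                      col_left_align=[50, 295, 371, 449, 532],
                      col_width=35, row_space=16.78, row_height=14))
for boxes in rows:
    print(boxes[0])
```

Because the boxes use `overlaps_bbox`, a 35-point-wide window is enough to catch a text line that extends well beyond it, which is how the first column picks up the long industry names.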

In [28]:
def info2(file_pdf):
    col_width = 35
    col_left_align = [50,295,371,449,532]
    row_space = 16.78
    row_height = 14
    bottom_corner_dif = 126.91
    headers = ['Industry',
                '1-499 Employees',
                '1-19 Employees',
                'Nonemployer Firms',
                'Total Small Firms'] 

    table = getTable(col_left_align=col_left_align,
                     col_width=col_width,
                     file_pdf=file_pdf,
                     headers=headers,
                     row_height=row_height,
                     row_space = row_space,
                     bottom_corner_dif=bottom_corner_dif,
                     title = "Table 1")
                     
    return table

In [29]:
info2(file_pdf)

[['Industry',
  '1-499 Employees',
  '1-19 Employees',
  'Nonemployer Firms',
  'Total Small Firms'],
 ['Retail Trade', '10,674', '9,627', '27,992', '38,666'],
 ['Other Services (except Public Administration)',
  '10,042',
  '9,332',
  '63,575',
  '73,617'],
 [u'Professional, Scienti\ufb01c, and Technical Services',
  '8,081',
  '7,378',
  '31,099',
  '39,180'],
 ['Health Care and Social Assistance', '7,823', '6,670', '21,808', '29,631'],
 ['Construction', '7,143', '6,373', '39,463', '46,606'],
 ['Accommodation and Food Services', '5,525', '4,255', '4,889', '10,414'],
 ['Wholesale Trade', '3,785', '2,974', '5,061', '8,846'],
 ['Manufacturing', '3,377', '2,349', '4,425', '7,802'],
 ['Administrative, Support, and Waste Management',
  '3,355',
  '2,842',
  '37,265',
  '40,620'],
 ['Finance and Insurance', '2,916', '2,582', '7,842', '10,758'],
 ['Real Estate and Rental and Leasing', '2,799', '2,590', '29,081', '31,880'],
 ['Transportation and Warehousing', '2,197', '1,834', '12,669', '14,8

### Another example

In [30]:
def info3(file_pdf):
    col_width = 35
    col_left_align = [50,325,400,532]
    row_space = 13.6
    row_height = 12.4
    bottom_corner_dif = 115.5

    headers = ['Industry',
               'Small Business Employment',
               'Total Private Employment',
               'Small Business Emp Share']    
    
    table = getTable(col_left_align=col_left_align,
                     col_width=col_width,
                     file_pdf=file_pdf,
                     headers=headers,
                     row_height=row_height,
                     row_space = row_space,
                     bottom_corner_dif=bottom_corner_dif,
                     title = "Table 2")

    return table

In [31]:
df = info3(file_pdf)
df

[['Industry',
  'Small Business Employment',
  'Total Private Employment',
  'Small Business Emp Share'],
 ['Health Care and Social Assistance', '113,580', '240,549', '47.2%'],
 ['Accommodation and Food Services', '89,707', '161,421', '55.6%'],
 ['Retail Trade', '87,257', '222,277', '39.3%'],
 ['Manufacturing', '79,632', '242,093', '32.9%'],
 ['Other Services (except Public Administration)',
  '68,770',
  '80,073',
  '85.9%'],
 ['Construction', '65,147', '78,318', '83.2%'],
 [u'Professional, Scienti\ufb01c, and Technical Services',
  '57,856',
  '92,520',
  '62.5%'],
 ['Administrative, Support, and Waste Management',
  '44,577',
  '133,720',
  '33.3%'],
 ['Wholesale Trade', '44,232', '72,175', '61.3%'],
 ['Finance and Insurance', '24,832', '69,332', '35.8%'],
 ['Transportation and Warehousing', '24,484', '58,471', '41.9%'],
 ['Real Estate and Rental and Leasing', '15,577', '23,257', '67.0%'],
 ['Educational Services', '13,791', '28,969', '47.6%'],
 ['Arts, Entertainment, and Recreation

### To pandas

In [32]:
columns = df.pop(0)
pd.DataFrame(df,columns=columns)

|   | Industry | Small Business Employment | Total Private Employment | Small Business Emp Share |
|---|---|---|---|---|
| 0 | Health Care and Social Assistance | 113,580 | 240,549 | 47.2% |
| 1 | Accommodation and Food Services | 89,707 | 161,421 | 55.6% |
| 2 | Retail Trade | 87,257 | 222,277 | 39.3% |
| 3 | Manufacturing | 79,632 | 242,093 | 32.9% |
| 4 | Other Services (except Public Administration) | 68,770 | 80,073 | 85.9% |
| 5 | Construction | 65,147 | 78,318 | 83.2% |
| 6 | Professional, Scientiﬁc, and Technical Services | 57,856 | 92,520 | 62.5% |
| 7 | Administrative, Support, and Waste Management | 44,577 | 133,720 | 33.3% |
| 8 | Wholesale Trade | 44,232 | 72,175 | 61.3% |
| 9 | Finance and Insurance | 24,832 | 69,332 | 35.8% |
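
All cells are still strings at this point, so `'113,580'` and `'47.2%'` cannot be summed or sorted numerically. A minimal sketch of converting them before building the DataFrame, assuming counts use thousands separators and shares end in `%` as in the rows returned by `info3` (`to_number` is a hypothetical helper):

```python
def to_number(cell):
    """Convert '113,580' to 113580 and '47.2%' to 47.2; leave other text as is."""
    text = cell.strip().replace(',', '')
    try:
        if text.endswith('%'):
            return float(text[:-1])
        return int(text)
    except ValueError:
        return cell


# First two data rows returned by info3 above
rows = [['Health Care and Social Assistance', '113,580', '240,549', '47.2%'],
        ['Accommodation and Food Services', '89,707', '161,421', '55.6%']]

# Keep the industry name untouched and convert the remaining columns
typed_rows = [[row[0]] + [to_number(c) for c in row[1:]] for row in rows]
print(typed_rows)
```

Passing `typed_rows` to `pd.DataFrame` together with the header row would then give numeric columns instead of object columns.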
