We want to look through the IRS's master file of tax-exempt organizations,
find the zip codes with the most nonprofits,
and then see which industries have the most nonprofits.
Along the way we'll calculate some common descriptive statistics
and use a few simple string-cleaning functions.

In [1]:
#import our third-party library
import pandas as pd

In [38]:
#set this option to make your dollar amounts
#display with commas and 2 decimal places in your 
#summary dataframes
pd.set_option('display.float_format', lambda x: '{:,.2f}'.format(x))
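
To see what that format string does before relying on it, here is a minimal sketch. The tiny dataframe is made up for illustration; only the option name and the format string come from the cell above.

```python
import pandas as pd

# The same formatter the option uses, applied to one number by hand
fmt = lambda x: '{:,.2f}'.format(x)
formatted = fmt(16672598896.0)
print(formatted)

# Set the option, then render a small, made-up dataframe as text
pd.set_option('display.float_format', fmt)
demo = pd.DataFrame({'ASSET_AMT': [898463.0, 21577.0]})
rendered = demo.to_string()
print(rendered)
```

The option only changes how floats are displayed; the underlying numbers are unchanged.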

Go to this link: https://www.irs.gov/charities-non-profits/exempt-organizations-business-master-file-extract-eo-bmf and right-click on NY
to get the URL for the csv

In [2]:
#create a new dataframe by reading in the CSV at that URL
#this saves us the trouble of making the request ourselves and
#converting the response into something pandas can work with.
ny = pd.read_csv('https://www.irs.gov/pub/irs-soi/eo_ny.csv')
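
One thing to watch for: by default pandas guesses a type for each column, so ID-like columns such as EIN are read as integers and lose any leading zeros. The sketch below uses a two-row stand-in for the IRS file (an in-memory string rather than the real URL, with EINs padded to nine digits as an assumption) to show how the `dtype` argument of `read_csv` keeps them as text.

```python
import io
import pandas as pd

# A two-row stand-in for the IRS file; the 9-digit zero padding is an assumption
csv_text = (
    "EIN,NAME,ZIP\n"
    "002022084,ST ROSALIAS ROMAN CATHOLIC CHURCH,11219-5614\n"
    "010263908,SKOWHEGAN SCHOOL OF PAINTING AND SCULPTURE INC,10011-2424\n"
)

# Default behavior: EIN is inferred as an integer column
default = pd.read_csv(io.StringIO(csv_text))

# Asking for strings keeps the values exactly as written in the file
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'EIN': str, 'ZIP': str})

print(default.EIN.tolist())   # integers, leading zeros gone
print(as_text.EIN.tolist())   # strings, leading zeros kept
```

Whether this matters depends on what you do with EIN later; it mostly matters when matching against other files that store it as text.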

In [4]:
#use the dataframe's .head() method to see the first
#few rows and columns of data
ny.head()

Unnamed: 0,EIN,NAME,ICO,STREET,CITY,STATE,ZIP,GROUP,SUBSECTION,AFFILIATION,...,ASSET_CD,INCOME_CD,FILING_REQ_CD,PF_FILING_REQ_CD,ACCT_PD,ASSET_AMT,INCOME_AMT,REVENUE_AMT,NTEE_CD,SORT_NAME
0,2022084,ST ROSALIAS ROMAN CATHOLIC CHURCH,,1230 65TH ST,BROOKLYN,NY,11219-5614,928,3,9,...,0,0,6,0,12,,,,,
1,2045409,GENERAL COUNCIL OF ASSEMBLIES OF GOD FULL GOSP...,,3210 SOUTHWESTERN BLVD,ORCHARD PARK,NY,14127-1229,1678,3,9,...,0,0,6,0,8,,,,,
2,10263908,SKOWHEGAN SCHOOL OF PAINTING AND SCULPTURE INC,,136 WEST 22ND STREET,NEW YORK,NY,10011-2424,0,3,3,...,8,6,1,0,12,18839129.0,3515868.0,2503199.0,A250,
3,10284115,MAIN IDEA INC,% SARAH STERN TREASURER,180 EAST PROSPECT AVE 178,MAMARONECK,NY,10543-3709,0,3,3,...,5,4,1,0,12,898463.0,478588.0,312621.0,,
4,10391592,MAINE JAZZ CAMP,,PO BOX 150597,BROOKLYN,NY,11215-0597,0,3,3,...,2,3,1,0,12,21577.0,65887.0,65887.0,B99Z,


In [21]:
#show me all my unique column names
ny.columns

Index([u'EIN', u'NAME', u'ICO', u'STREET', u'CITY', u'STATE', u'ZIP', u'GROUP',
       u'SUBSECTION', u'AFFILIATION', u'CLASSIFICATION', u'RULING',
       u'DEDUCTIBILITY', u'FOUNDATION', u'ACTIVITY', u'ORGANIZATION',
       u'STATUS', u'TAX_PERIOD', u'ASSET_CD', u'INCOME_CD', u'FILING_REQ_CD',
       u'PF_FILING_REQ_CD', u'ACCT_PD', u'ASSET_AMT', u'INCOME_AMT',
       u'REVENUE_AMT', u'NTEE_CD', u'SORT_NAME'],
      dtype='object')

In [10]:
#select the three numeric columns we want by passing a list
#of column names to the ny dataframe, then calculate
#descriptive statistics for them with the .describe() method
ny[['ASSET_AMT','INCOME_AMT','REVENUE_AMT']].describe()

Unnamed: 0,ASSET_AMT,INCOME_AMT,REVENUE_AMT
count,83272.0,83272.0,71396.0
mean,6717461.02,4653583.38,3729685.7
std,136416583.22,98199804.76,71814748.72
min,0.0,-1735647.0,-13754927.0
25%,0.0,0.0,0.0
50%,21658.5,27227.5,8578.5
75%,441835.75,283431.5,194880.5
max,16672598896.0,14087558909.0,6493574288.0


In [15]:
#let's calculate the maximum value in the ASSET_AMT column by using
#the column's .max() method and assign it to the ny_max variable
ny_max = ny.ASSET_AMT.max()

In [12]:
#similarly, we can use the column's .mean() method
ny.ASSET_AMT.mean()

6717461.0158396577

In [13]:
#and the column's .median() method
#all of these descriptive functions take a column of numbers
#and return a single, summary number
ny.ASSET_AMT.median()

21658.5
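
The mean above is in the millions while the median is about $21,000. That gap is what you would expect when a handful of very large organizations pull the average up. A small sketch with made-up numbers shows the effect:

```python
import pandas as pd

# Made-up asset amounts: four small organizations and one very large one
assets = pd.Series([0.0, 10000.0, 20000.0, 30000.0, 1000000000.0])

mean_val = assets.mean()      # pulled far upward by the one large value
median_val = assets.median()  # the middle value, unaffected by the outlier

print(mean_val)
print(median_val)
```

For skewed data like this, the median is usually the safer "typical" number to report.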

In [39]:
#using our ny_max variable, filter our dataframe
#to show just the nonprofit that had the max ASSET_AMT
ny[ny.ASSET_AMT==ny_max]

Unnamed: 0,EIN,NAME,ICO,STREET,CITY,STATE,ZIP,GROUP,SUBSECTION,AFFILIATION,...,FILING_REQ_CD,PF_FILING_REQ_CD,ACCT_PD,ASSET_AMT,INCOME_AMT,REVENUE_AMT,NTEE_CD,SORT_NAME,SHORT_ZIP,SHORT_NTEE
31159,135598093,TRUSTEES OF COLUMBIA UNIVERSITY IN THE CITY OF...,% JULIA SHANAHAN,615 WEST 131 STREET 4TH FLOOR,NEW YORK,NY,10027-7984,8057,3,6,...,1,0,6,16672598896.0,6284244530.0,4708225588.0,B430,,10027,B


In [19]:
#similarly, we can create a filter that shows
#just the rows with ASSET_AMT over $10,000,000
ny[ny.ASSET_AMT>10000000]

Unnamed: 0,EIN,NAME,ICO,STREET,CITY,STATE,ZIP,GROUP,SUBSECTION,AFFILIATION,...,ASSET_CD,INCOME_CD,FILING_REQ_CD,PF_FILING_REQ_CD,ACCT_PD,ASSET_AMT,INCOME_AMT,REVENUE_AMT,NTEE_CD,SORT_NAME
2,10263908,SKOWHEGAN SCHOOL OF PAINTING AND SCULPTURE INC,,136 WEST 22ND STREET,NEW YORK,NY,10011-2424,0,3,3,...,8,6,1,0,12,18839129.00,3515868.00,2503199.00,A250,
17,10553351,DANCING TIDES FOUNDATION INC,% JEFF WILKINS,1475 FRANKLIN AVE,GARDEN CITY,NY,11530-1662,0,3,3,...,8,6,0,1,12,20918833.00,2490358.00,,T22,ARNELL NATHAN E TTEE
90,10603628,LEARNINGSPRING SCHOOL,% MARILYN SIMONS,247 E 20TH ST,NEW YORK,NY,10003-1801,0,3,3,...,8,7,1,0,6,45952763.00,7202443.00,6463225.00,B28,
111,10623055,PINDAROS FOUNDATION INC,% MARKS PANETH & SHRON LLP,685 3RD,NEW YORK,NY,10017-4024,0,3,3,...,8,6,0,1,12,38882811.00,3080407.00,,T22,
240,10708733,DEMOCRACY NOW PRODUCTIONS INC,,207 WEST 25TH ST 11TH FLOOR,NEW YORK,NY,10001-7161,0,3,3,...,8,7,1,0,12,20406857.00,9155773.00,8158144.00,A30,
259,10718949,CIVITELLA RANIERI FOUNDATION INC,% THE FOUNDATION,28 HUBERT ST,NEW YORK,NY,10013-2041,0,3,3,...,8,6,0,1,12,24031526.00,4814547.00,,,
355,10790110,EXCELLENCE ACADEMIES INC,% CAROLYN HACK,826 BROADWAY 9TH FLOOR,NEW YORK,NY,10003-4826,0,3,3,...,9,5,1,0,6,51159617.00,603539.00,603539.00,B192,
441,10880911,THE ROSS INSTITUTE INC,% DUANE RITTER,18 GOODFRIEND DR,EAST HAMPTON,NY,11937-2584,0,3,3,...,9,8,1,0,6,102144099.00,38476680.00,37932416.00,B20,
445,10885377,COMIC RELIEF INC,% RICHARD L SCOTT,488 MADISON AVE FL 10,NEW YORK,NY,10022-5723,0,3,3,...,8,8,1,0,12,12757292.00,38027287.00,38027287.00,T12,
554,16186903,MERCE CUNNINGHAM TRUST,% CUNNINGHAM MERCE TTEE,130 W 56TH ST RM 708,NEW YORK,NY,10019-3962,0,3,3,...,8,7,0,1,12,10602952.00,5206721.00,,T23,CUNNINGHAM MERCE TTEE
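
Both filters above work the same way: the comparison inside the brackets produces a column of True/False values, and the dataframe keeps only the rows where it is True. Here is a minimal sketch on a made-up three-row dataframe (the names and amounts are hypothetical), including `.idxmax()` as an alternative way to find the row with the largest value.

```python
import pandas as pd

# Hypothetical organizations; one has no asset amount on file
df = pd.DataFrame({
    'NAME': ['ORG A', 'ORG B', 'ORG C'],
    'ASSET_AMT': [5000000.0, 25000000.0, None],
})

# The comparison returns one True/False per row (missing values compare as False)
mask = df.ASSET_AMT > 10000000
print(mask.tolist())

# Passing the mask to the dataframe keeps only the True rows
big = df[mask]
print(big)

# .idxmax() returns the index label of the largest value, skipping missing ones
top = df.loc[df.ASSET_AMT.idxmax()]
print(top['NAME'])
```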


In [22]:
#so getting back to one of our first questions
#which zip has the most nonprofits?
#the problem here is that the zip with the +4 number
#is too small a geography, too specific.
ny.ZIP.value_counts()

14203-2309    325
14450-0444    263
10017-0000    203
11210-3714    166
13815-0000    141
10022-0000    141
10018-0000    139
10019-0000    128
11120-0001    123
10016-0000    107
10005-4414    100
14850-0000     96
12212-5014     89
10036-0000     88
10163-5005     84
10001-7604     78
10010-4102     77
10027-0000     75
10001-0000     74
13815-1646     71
10008-1297     70
10005-0000     69
10007-2233     68
10005-4401     67
14221-7748     67
11229-4111     66
10111-0100     64
12866-0860     62
10017-4201     59
10023-0000     59
             ... 
11210-1364      1
14051-1753      1
11375-5734      1
11213-4037      1
11211-0521      1
11213-4038      1
10977-3854      1
10977-3856      1
10977-3857      1
14209-0852      1
12309-6413      1
11705-1500      1
14214-1328      1
11710-2455      1
11201-1595      1
11754-0201      1
13460-0818      1
10960-3122      1
10024-5408      1
11214-3725      1
10011-5434      1
11434-7042      1
10025-2206      1
10036-3464      1
10977-2103

In [25]:
#let's "apply" a function to every row in the ZIP column of the ny dataframe
#we can use lambda to make that function for the variable x 
#(each cell in that column) on the fly
#here we split the full zip+4 string on the dash ('-') and 
#choose the first item in the list that's returned: the 5-digit zip
ny['SHORT_ZIP'] = ny.ZIP.apply(lambda x: x.split('-')[0])

Explaining apply: https://stackoverflow.com/questions/19798153/difference-between-map-applymap-and-apply-methods-in-pandas
"Summing up, apply works on a row / column basis of a DataFrame, applymap works element-wise on a DataFrame, and map works element-wise on a Series."

In [42]:
#look at the first five rows of our new, SHORT_ZIP column
#this looks like what we want
ny.SHORT_ZIP.head()

0    11219
1    14127
2    10011
3    10543
4    11215
Name: SHORT_ZIP, dtype: object
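
To make the apply/map distinction quoted above concrete, here is a small sketch on made-up data. On a single column (a Series), `.apply()` and `.map()` both call the function once per cell, and the `.str` accessor does the same split without a lambda. On a whole dataframe, `.apply()` hands the function one column at a time.

```python
import pandas as pd

zips = pd.Series(['11219-5614', '14127-1229', '10011-2424'])

# Three ways to get the 5-digit zip from a Series of zip+4 strings
via_apply = zips.apply(lambda x: x.split('-')[0])
via_map = zips.map(lambda x: x.split('-')[0])
via_str = zips.str.split('-').str[0]
print(via_str.tolist())

# DataFrame.apply passes each column (a Series) to the function
frame = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})
col_max = frame.apply(lambda col: col.max())
print(col_max.tolist())
```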

In [28]:
#now try value counts on the new column
#that's a very different picture than the full zip+4 value_counts we did before
ny.SHORT_ZIP.value_counts()

10022    1273
11219    1229
10017    1109
10001    1018
10016     977
10018     955
11211     952
10036     859
10019     839
10952     763
10023     722
10003     702
11204     677
11210     633
10025     629
11201     605
10011     603
11230     583
14203     558
10024     555
10010     529
10004     525
10027     501
10977     500
14850     500
10013     492
10021     485
10005     474
14450     442
10128     429
         ... 
12195       1
10060       1
13758       1
10286       1
11453       1
14893       1
13611       1
12240       1
12769       1
12768       1
13450       1
12465       1
11256       1
14261       1
14884       1
12781       1
12785       1
13847       1
12784       1
14827       1
12416       1
14824       1
14477       1
14475       1
12511       1
13129       1
12411       1
13457       1
12417       1
11915       1
Name: SHORT_ZIP, Length: 1970, dtype: int64
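
The full list runs to nearly 2,000 zips, so in practice you may only want the top few, or each zip's share of the total. A minimal sketch on a made-up column:

```python
import pandas as pd

# Made-up 5-digit zips standing in for ny.SHORT_ZIP
short_zip = pd.Series(['10022', '10022', '10022', '11219', '11219', '10017'])

# value_counts() sorts from most to least common, so .head() gives the top rows
counts = short_zip.value_counts()
top2 = counts.head(2)
print(top2)

# normalize=True returns each value's share of all rows instead of a count
shares = short_zip.value_counts(normalize=True)
print(shares['10022'])
```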

Remember to consult our data dictionary: https://www.irs.gov/pub/irs-soi/eo_info.pdf

In [29]:
#now we want to know which industry has the most nonprofits.
#After reading our data dictionary carefully, we see that it's
#the NTEE_CD column that has the codes we want
#but, like our zip codes, this is a little too specific.
ny.NTEE_CD.value_counts()

X20     2702
T20     2514
X30     2435
X21     2129
T22     2125
P20     1628
B82     1318
T30      994
O50      969
D20      837
S41      822
A20      744
M24      740
B90      734
Q33      717
B99      658
A23      646
N60      631
A80      629
A65      587
N50      573
S20      565
X20Z     541
J40      540
B94      532
W30      494
B11      489
X99      469
P99      461
A68      427
        ... 
W60I       1
E50J       1
G038       1
T015       1
K033       1
K032       1
G300       1
P70C       1
E602       1
N00Z       1
B4XZ       1
A55Z       1
B057       1
C030       1
U31C       1
Y220       1
A41Z       1
E87Z       1
A613       1
H70Z       1
M054       1
G320       1
A77J       1
P377       1
K11        1
P114       1
F22M       1
F22I       1
E50L       1
A6EI       1
Name: NTEE_CD, Length: 2314, dtype: int64

In [32]:
#the .describe() method on this string column
#tells us how many rows have a value, how many distinct values there are,
#and which value is most common and how often it appears
ny.NTEE_CD.describe()

count     69572
unique     2314
top         X20
freq       2702
Name: NTEE_CD, dtype: object

In [33]:
#but looking at the .info() method of the whole dataframe
#we see that there are many rows of the NTEE_CD column
#that are empty
ny.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 106607 entries, 0 to 106606
Data columns (total 29 columns):
EIN                 106607 non-null int64
NAME                106607 non-null object
ICO                 68630 non-null object
STREET              106607 non-null object
CITY                106607 non-null object
STATE               106607 non-null object
ZIP                 106607 non-null object
GROUP               106607 non-null int64
SUBSECTION          106607 non-null int64
AFFILIATION         106607 non-null int64
CLASSIFICATION      106607 non-null int64
RULING              106607 non-null int64
DEDUCTIBILITY       106607 non-null int64
FOUNDATION          106607 non-null int64
ACTIVITY            106607 non-null int64
ORGANIZATION        106607 non-null int64
STATUS              106607 non-null int64
TAX_PERIOD          83994 non-null float64
ASSET_CD            106607 non-null int64
INCOME_CD           106607 non-null int64
FILING_REQ_CD       106607 non-null int64
P

In [34]:
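#to allow us to clean up the strings in the NTEE_CD column
#we need to fill in any blanks (or "na's") with the string '0'
#because a missing value is not a string, so you can't
#slice it ("x[0]", for instance).
#assigning the result back to the column is more reliable than inplace=True,
#which newer versions of pandas warn about when called on a single column
ny['NTEE_CD'] = ny.NTEE_CD.fillna('0')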

In [35]:
#like our zips above, let's apply a new function to each cell
#in each row of the NTEE_CD column and assign the result to
#a new column in our dataframe.
#we'll take each string and keep just character 0 (Python counts from 0, so that's the first one)
#in this instance, that's the one-letter code for the industry groups
#according to the IRS's classification
ny['SHORT_NTEE'] = ny.NTEE_CD.apply(lambda x: x[0])

In [36]:
#see the first few rows of the shortened column
ny.SHORT_NTEE.head()

0    0
1    0
2    A
3    0
4    B
Name: SHORT_NTEE, dtype: object
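
As an alternative to filling the blanks first, the `.str` accessor skips missing values on its own. The sketch below, on a made-up column, shows why the plain lambda fails on a missing value and how `.str[0]` leaves it as NaN, which `value_counts(dropna=False)` can still count. This is a different approach from the one used in this notebook, not a correction of it.

```python
import numpy as np
import pandas as pd

# Made-up NTEE codes with two missing values
ntee = pd.Series(['A250', np.nan, 'B99Z', np.nan, 'X20'])

# Indexing a missing value fails, because NaN is a float rather than a string
try:
    ntee.apply(lambda x: x[0])
    failed = False
except TypeError:
    failed = True
print(failed)

# .str[0] takes the first character and passes missing values through as NaN
first_letter = ntee.str[0]
print(first_letter.tolist())

# dropna=False keeps the missing values in the tally
counts = first_letter.value_counts(dropna=False)
print(counts)
```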

In [37]:
#run value counts. At the top are the empty rows that we filled with "0"
#then X for religious organizations and T for philanthropies
ny.SHORT_NTEE.value_counts()

0    37036
X    10723
T     8780
B     8051
A     7909
P     6266
N     4194
S     3467
E     2571
Q     2129
L     2002
O     1746
M     1419
D     1402
C     1254
W     1137
J      993
G      892
Y      814
F      777
I      630
Z      551
K      536
R      494
H      464
U      232
V      138
Name: SHORT_NTEE, dtype: int64
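
Single letters are hard to read in a summary, so one option is to map them to labels with a dictionary. The lookup below is deliberately partial and illustrative: it covers only the codes described in this notebook, and everything else falls into "other". Check the data dictionary before filling in the rest.

```python
import pandas as pd

# Made-up one-letter codes standing in for ny.SHORT_NTEE
short_ntee = pd.Series(['0', 'X', 'T', 'B', 'X'])

# Partial, illustrative lookup: only the labels mentioned in this notebook
labels = {
    '0': 'no code on file',
    'X': 'religious organizations',
    'T': 'philanthropies',
}

# .map() with a dict returns NaN for keys it doesn't find, so fill those in
named = short_ntee.map(labels).fillna('other')
counts = named.value_counts()
print(counts)
```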

In [None]:
#what would you want to know next?