In [None]:
import plotly.express as px
from us_nonprofits.etl import extract

In [1]:
from us_nonprofits.constants import REGIONS

In [2]:
REGIONS.values()

dict_values(['https://www.irs.gov/pub/irs-soi/eo1.csv', 'https://www.irs.gov/pub/irs-soi/eo2.csv', 'https://www.irs.gov/pub/irs-soi/eo3.csv', 'https://www.irs.gov/pub/irs-soi/eo4.csv'])

# EOBMF Dataset

In [None]:
# To look at a specific region, pass a region into this extract function, 
# otherwise it extracts and returns all regions by default
df = extract()

In [None]:
df.columns

# What is the 'grain'?
EOBMF Extract:
- My understanding is this dataset is a cumulative list of every unique tax-exempt business entity (non-profit) in the US. This list is updated on a monthly basis with the most recent tax information from the IRS. 
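
A quick way to test this reading of the grain is to check whether `EIN` uniquely identifies each row. Below is a minimal sketch on a toy frame; the EINs and names are invented for illustration, and on the real extract `df` would take the place of `toy`.

```python
import pandas as pd

# Toy stand-in for the EOBMF extract (values are made up for illustration).
toy = pd.DataFrame({
    "EIN": [10000001, 10000002, 10000002, 10000003],
    "NAME": ["A", "B", "B DUPLICATE", "C"],
})

# If the grain is "one row per entity", EIN should be unique.
ein_is_unique = toy["EIN"].is_unique

# Rows that share an EIN with at least one other row, for inspection.
duplicates = toy[toy["EIN"].duplicated(keep=False)]

print(ein_is_unique)
print(duplicates)
```

If `ein_is_unique` came back `False` on the real data, the `duplicates` frame would be the place to start looking for why.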

# Broad Categories of Dataset

EOBMF Extract has:
- Geographical information
    - `STREET`, `CITY`, `STATE`, `ZIP`
- Nonprofit code/identifier information
    - `EIN`, `NAME`, `ICO`, `GROUP`, `SUBSECTION`, `AFFILIATION`, `CLASSIFICATION`, `RULING`, `DEDUCTIBILITY`, `FOUNDATION`, `ACTIVITY`, `ORGANIZATION`, `STATUS`, `NTEE_CD`, `SORT_NAME`
- Most recent, high level tax filing information
    - `TAX_PERIOD`, `ASSET_CD`, `INCOME_CD`, `FILING_REQ_CD`,
       `PF_FILING_REQ_CD`, `ACCT_PD`, `ASSET_AMT`, `INCOME_AMT`, `REVENUE_AMT`

# Business/Research Problem to Investigate

I am not exactly sure what to make of the EOBMF dataset on its own, but my understanding is that this dataset contains mostly reference data for US nonprofits and so the main data of interest to me is the tax filing information. Revenue, income and assets are fairly interesting.

I would assume the most important information to a nonprofit is its revenue and expenses. Any way to increase the former or decrease the latter would be critical for them. Knowing how similar nonprofits compare with each other, or spotting anomalies in that comparison, could be interesting.

Another thing I am personally interested in is how much of a nonprofit's revenue goes to the cause it serves rather than to salaries, marketing, and infrastructure expenses. I'd be more keen to donate to nonprofits that have a good "return" on money donated.

First, I will try to get a quick handle on how the data is distributed among the different categories. As someone who has very little knowledge of the subject, some basic questions I'd like to know are:
- How are the nonprofits distributed among the different regions/states?
- How are the organizational codes distributed? (corporation, trust, association etc.)
- What does the exemption status look like among the different categories?
- How many haven't filed in the last year?
- Revenue/Income by sector/type, which sectors have most revenue?
- What does the overall data quality of this dataset look like?
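
For the data quality question, one simple first pass is the share of missing values per column. This is only a sketch on a toy frame with invented values; it says nothing about what the real extract looks like.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the extract (values are made up for illustration).
toy = pd.DataFrame({
    "EIN": [1, 2, 3, 4],
    "NTEE_CD": ["B20", None, "E21", None],
    "INCOME_AMT": [0.0, 1500.0, np.nan, 0.0],
})

# Percentage of missing values in each column, worst first.
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Missing values are only one aspect of quality; unexpected codes and inconsistent casing would need separate checks.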

### Exploratory Data Analysis

From reading [the documentation here](https://www.irs.gov/pub/irs-soi/eo_info.pdf), it seems like:
- The `NTEE` code gives a better sense of a nonprofit's sector than `ACTIVITY`, which seems to be a legacy field, so I will ignore the latter for this analysis.
- `SORT_NAME`, `RULING`, and `FOUNDATION` seem less useful as well, so I will ignore them for now.
- Any column with suffix `_M` is a modified column that was added by me and not native to the datasource

In [None]:
df['COMMON_CD_M'] = df['NTEE_CD'].str[:1]

What's nice is that the income/asset values are already categorized by bracket (the income bracket is called `INCOME_CD`). So we can get a quick glance at how many entities are in each group:

In [None]:
fig = px.histogram(df, x="INCOME_CD", histnorm='percent', title="% of Nonprofits in each Income Bracket")
fig.show()

It looks like 63% of nonprofits in this dataset have zero income. There's a segment of them in the middle of the pack and very few at the top (as expected) but also very few between group 0 and group 3 (>$25K in income).

In [None]:
df['INCOME_CD'].value_counts(normalize=True)*100

In fact, looking at just the nonprofits that have zero income, 98% also have zero assets:

In [None]:
df[df['INCOME_CD']==0]['ASSET_CD'].value_counts(normalize=True)

In [None]:
df[df['INCOME_CD']==0]['REVENUE_AMT'].value_counts()

I am not sure why this is. I don't know how nonprofits categorize money that goes towards the cause they are helping. 
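
To look at income and asset brackets together rather than one bracket at a time, a cross-tabulation is one option. The sketch below uses a toy frame with invented codes; on the real data `df` would replace `toy`.

```python
import pandas as pd

# Toy stand-in: bracket codes are made up for illustration.
toy = pd.DataFrame({
    "INCOME_CD": [0, 0, 0, 0, 3, 4],
    "ASSET_CD":  [0, 0, 0, 1, 3, 4],
})

# normalize="index" makes each row sum to 1, i.e. the share of each
# asset bracket within a given income bracket.
joint = pd.crosstab(toy["INCOME_CD"], toy["ASSET_CD"], normalize="index")
print(joint)
```

Reading across the row for `INCOME_CD == 0` gives the same kind of number as the `value_counts` cell above, but for every income bracket at once.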

In [None]:
df['ORGANIZATION'].value_counts(normalize=True)*100

Roughly 95% of entities belong to two organization codes, **Corporation (72.5%)** and **Association (23%)**. Furthermore, it looks like there are two codes, zero and six, that aren't described in the online documentation.
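
One way to surface undocumented codes automatically is to map the codes to labels and see which ones fail to map. The label dictionary below is my reading of the IRS documentation linked above and should be treated as an assumption to verify; the toy values are invented.

```python
import pandas as pd

# Assumed labels, as I read the IRS documentation; worth double-checking.
ORG_LABELS = {
    1: "Corporation",
    2: "Trust",
    3: "Co-operative",
    4: "Partnership",
    5: "Association",
}

# Toy stand-in for df['ORGANIZATION'] (values are made up for illustration).
toy = pd.DataFrame({"ORGANIZATION": [1, 1, 5, 0, 6, 2]})

# Following the `_M` convention for columns that are not native to the source.
toy["ORGANIZATION_LABEL_M"] = toy["ORGANIZATION"].map(ORG_LABELS)

# Codes with no label are the ones the documentation does not describe.
undocumented = toy.loc[toy["ORGANIZATION_LABEL_M"].isna(), "ORGANIZATION"].unique()
print(undocumented)
```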

Let's see what sectors have the most income in aggregate:

In [None]:
sector_agg = df.groupby('COMMON_CD_M', as_index=False).agg({'INCOME_AMT': ['sum', 'mean', 'median', 'count']})

In [None]:
# sector_agg.to_csv('agg.csv', index=False)

In [None]:
fig = px.bar(y=sector_agg['INCOME_AMT']['sum'], 
             x=sector_agg['COMMON_CD_M'], 
             title='Total Income by Common Code (Proxy for Sector)')
fig.update_layout(
    xaxis_title="First Letter of Common Code",
    yaxis_title="Total Income ($)")
fig.show()

The top 5 broad sectors are:
1. Health – General and Rehabilitative (E)
2. Educational Institutions and Related Activities (B)
3. Philanthropy, Voluntarism and Grantmaking Foundations (T)
4. Mutual/Membership Benefit Organizations, Other (Y)
5. Human Services – Multipurpose and Other (P)

These 5 sectors represent just over 80% of all nonprofit income in the US, with the largest (Health) accounting for nearly 40% of the total.
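
The percentages above can be computed directly instead of being read off a chart. A minimal sketch on toy data (the amounts are invented, and I take the top 3 rather than the top 5 only because the toy frame is small):

```python
import pandas as pd

# Toy stand-in: sector letters and amounts are made up for illustration.
toy = pd.DataFrame({
    "COMMON_CD_M": ["E", "E", "B", "T", "P", "A"],
    "INCOME_AMT": [400, 100, 250, 150, 60, 40],
})

# Total income per sector, then each sector's share of the grand total.
sector_sum = toy.groupby("COMMON_CD_M")["INCOME_AMT"].sum()
share_pct = (sector_sum / sector_sum.sum() * 100).sort_values(ascending=False)

# Combined share of the largest sectors.
top_share = share_pct.head(3).sum()
print(share_pct)
print(top_share)
```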

In [None]:
fig = px.pie(values=sector_agg['INCOME_AMT']['sum'], 
             names=sector_agg['COMMON_CD_M'])
fig.show()

In [None]:
fig = px.bar(y=sector_agg['INCOME_AMT']['median'], 
             x=sector_agg['COMMON_CD_M'], 
             title='Median Income by Sector')
fig.show()

I am not quite sure what to make of the 3 EINs that have a lowercase "c" as the first character of their common code. It seems like this should be capitalized.

In [None]:
df[df['COMMON_CD_M']=='c']
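
If the lowercase codes turn out to be data-entry variants of "C" (an assumption I have not confirmed), the derived column could be normalized when it is built. A small sketch with invented values:

```python
import pandas as pd

# Toy stand-in for df['NTEE_CD'] (values are made up for illustration).
toy = pd.DataFrame({"NTEE_CD": ["C30", "c30", "B20", None]})

# Take the first character and upper-case it; missing codes stay missing.
toy["COMMON_CD_M"] = toy["NTEE_CD"].str[:1].str.upper()
print(toy)
```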

In [None]:
state_agg = df.groupby('STATE', as_index=False).agg({'INCOME_AMT': ['sum', 'mean', 'median', 'count']})

In [None]:
state_agg.sort_values([('INCOME_AMT','sum')], ascending=False).head(20)

Some open questions I have:
- Is it expected that many nonprofits have zero/negative income? This seems to be equivalent to breaking even for private business.
- If I sampled this on a regular basis and kept track of historical values for each nonprofit, what kind of trends exist?
- How do I group nonprofit entities that are branches of one big parent nonprofit (e.g. all Scientology churches together)?
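
For the last question, one possible starting point is the `GROUP` column. The sketch below assumes that `GROUP` holds a group exemption number shared by a parent and its subordinates, and that 0 means "not part of a group"; both are assumptions to check against the documentation. The values are invented.

```python
import pandas as pd

# Toy stand-in: EINs, group numbers and amounts are made up for illustration.
toy = pd.DataFrame({
    "EIN": [1, 2, 3, 4, 5],
    "GROUP": [1234, 1234, 0, 5678, 1234],
    "INCOME_AMT": [100, 200, 50, 300, 0],
})

# Assumption: GROUP == 0 means the entity is not covered by a group exemption.
in_group = toy[toy["GROUP"] != 0]

# Roll the members of each group up into one row.
group_agg = in_group.groupby("GROUP").agg(
    members=("EIN", "count"),
    total_income=("INCOME_AMT", "sum"),
)
print(group_agg)
```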

# Form 990 Dataset

In [1]:
import pandas as pd

In [3]:
x = pd.read_excel("https://www.irs.gov/pub/irs-soi/19eofinextractdoc.xlsx", engine='openpyxl')

In [6]:
# df2 = pd.read_excel("https://www.irs.gov/pub/irs-soi/19eoextract990.xlsx", engine='openpyxl')

In [8]:
y = pd.read_excel("https://www.irs.gov/pub/irs-soi/19eoextractez.xlsx", engine='openpyxl')

In [10]:
y.columns

Index(['elf', 'EIN', 'tax_pd', 'subseccd', 'totcntrbs', 'prgmservrev',
       'duesassesmnts', 'othrinvstinc', 'grsamtsalesastothr',
       'basisalesexpnsothr', 'gnsaleofastothr', 'grsincgaming',
       'grsrevnuefndrsng', 'direxpns', 'netincfndrsng', 'grsalesminusret',
       'costgoodsold', 'grsprft', 'othrevnue', 'totrevnue', 'totexpns',
       'totexcessyr', 'othrchgsnetassetfnd', 'networthend', 'totassetsend',
       'totliabend', 'totnetassetsend', 'actvtynotprevrptcd', 'chngsinorgcd',
       'unrelbusincd', 'filedf990tcd', 'contractioncd', 'politicalexpend',
       'filedf1120polcd', 'loanstoofficerscd', 'loanstoofficers',
       'initiationfee', 'grspublicrcpts', 's4958excessbenefcd',
       'prohibtdtxshltrcd', 'nonpfrea', 'totnooforgscnt', 'totsupport',
       'gftgrntsrcvd170', 'txrevnuelevied170', 'srvcsval170',
       'pubsuppsubtot170', 'exceeds2pct170', 'pubsupplesspct170',
       'samepubsuppsubtot170', 'grsinc170', 'netincunreltd170', 'othrinc170',
       'totsupp170'

# What is the grain?

The "grain" for this dataset is more or less the same as the EOBMF dataset except it is more detailed. The EOBMF dump was more like a summary of the nonprofit entities, whereas this dataset has the tax filing data given on nearly every line of the 990 tax form.

Examining the documentation for this data, it looks like the dataset has:
- Compensation data 
- Breakdown of expenses 
- Breakdown of different revenue streams

Because there is much more data here, and it would take me much more time to figure out exactly what all the columns mean and how the different 990 filings differ, I decided this was a good place to stop given the scope of the exercise.

### Next Steps



- I assume the data between the two datasources can be joined on the EIN column. Having said that, I am not entirely clear what group of nonprofits is in the 990-EZ filing vs the regular 990 filing. There is also data for the 990-PF forms but they are not available past 2016.
- The data is dumped on a yearly basis, so you can see interesting YoY changes to revenue by a certain nonprofit or a certain sector of nonprofits. This could allow someone to see larger trends over time or regime changes. Though 8 years of data (2012-2019) might not be long enough to detect anything meaningful. It might only be possible to see extreme YoY changes (e.g. when the ice bucket challenge generated a ton of money for ALS).
- The xlsx files are much bigger and will have to be consumed in a better way than just reading them into memory with `pandas`. Also, some past files are in a `.dat` format, so they would need some additional tweaking for `pandas` to correctly parse them. Furthermore, some are in ZIP files, so that will require more code and probably some caching of the data somewhere.
- Many of the columns are binary (yes/no) data, so they would have to be encoded as 1 or 0.
- There are a lot of columns/features for each EIN in this dataset, so I would try some dimensionality reduction techniques like PCA to determine which loadings contribute most to the revenue.
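
The join, encoding and PCA steps above could be sketched roughly as follows. Everything here is an assumption for illustration: the toy values are invented, I assume yes/no columns are stored as "Y"/"N", and I use NumPy's SVD for the PCA step rather than a dedicated library.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (invented values). Column names follow the 990-EZ listing above.
bmf = pd.DataFrame({
    "EIN": [1, 2, 3, 4, 5],
    "NTEE_CD": ["B20", "E21", "P20", "T30", "D20"],
})
ez = pd.DataFrame({
    "EIN": [1, 2, 3, 4, 6],
    "totrevnue": [100.0, 250.0, 80.0, 400.0, 60.0],
    "totexpns": [90.0, 240.0, 85.0, 300.0, 55.0],
    "unrelbusincd": ["N", "Y", "N", "Y", "N"],
})

# 1) Join on EIN. indicator=True adds a `_merge` column showing whether a row
#    came from the left frame, the right frame, or both.
merged = bmf.merge(ez, on="EIN", how="outer", indicator=True)
matched = merged[merged["_merge"] == "both"].copy()

# 2) Encode a yes/no column as 1/0 (assuming the raw values are "Y"/"N").
matched["unrelbusincd"] = matched["unrelbusincd"].map({"Y": 1, "N": 0})

# 3) PCA via SVD on standardized features.
features = ["totrevnue", "totexpns", "unrelbusincd"]
X = matched[features].to_numpy(dtype=float)
X = (X - X.mean(axis=0)) / X.std(axis=0)
_, singular_values, components = np.linalg.svd(X, full_matrices=False)

# Share of variance explained by each component, and the feature loadings.
explained_ratio = singular_values**2 / np.sum(singular_values**2)
loadings = pd.DataFrame(
    components.T,
    index=features,
    columns=[f"PC{i + 1}" for i in range(len(singular_values))],
)
print(merged["_merge"].value_counts())
print(explained_ratio)
print(loadings)
```

The `_merge` counts show how many EINs appear in only one of the two sources, which speaks to the question of which nonprofits file which form. The loadings describe which features vary together; relating them to revenue specifically would need a further step.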

Another avenue of investigation could be comparing similar nonprofits to each other along certain dimensions to see where some are lacking. For example, if I am operating a nonprofit in the animal sector, it would be good to know how my revenue or expenses compare to similar nonprofits (similar in # of employees or donations received). This way I can determine where I am lacking and make some adjustments.
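
A peer comparison like this could start with a percentile rank within each sector. The sketch below assumes the 990 data has already been joined to a sector letter (`COMMON_CD_M`), uses expenses divided by revenue as the dimension to compare, and runs on invented values.

```python
import pandas as pd

# Toy stand-in (invented values); column names follow the 990-EZ listing above.
toy = pd.DataFrame({
    "EIN": [1, 2, 3, 4, 5, 6],
    "COMMON_CD_M": ["D", "D", "D", "D", "B", "B"],
    "totrevnue": [100.0, 200.0, 300.0, 400.0, 50.0, 150.0],
    "totexpns": [90.0, 150.0, 310.0, 200.0, 40.0, 160.0],
})

# Expenses per dollar of revenue; lower means more revenue is left over.
toy["expense_ratio"] = toy["totexpns"] / toy["totrevnue"]

# Percentile rank of each nonprofit within its own sector (1.0 = highest ratio).
toy["expense_ratio_pct_rank"] = (
    toy.groupby("COMMON_CD_M")["expense_ratio"].rank(pct=True)
)
print(toy)
```

Grouping by sector alone is a crude notion of "similar"; grouping on size as well (for example an asset or income bracket) would probably give fairer peers.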