# Grants data from the National Institutes of Health

The National Institutes of Health ([NIH](https://www.nih.gov/)) is an agency of the U.S. government that funds medical and health research. Information about NIH-funded grants is publicly available [here](https://exporter.nih.gov/ExPORTER_Catalog.aspx). In this notebook we download and clean grant data from the NIH for further analysis.

In [2]:
import requests, zipfile, io
import glob
import os

import pandas as pd
import numpy as np

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

[Local functions](https://github.com/yuwie10/nih-awards/blob/master/cleaning_strings.py) to clean text data.

In [1]:
import cleaning_strings as cln
import nih_functions as nih

Download grants data from years 1985-2016

In [3]:
years = range(1985, 2017) #2.29 GB
for year in years:
    url = 'https://exporter.nih.gov/CSVs/final/RePORTER_PRJ_C_FY{}.zip'.format(year)
    r = requests.get(url)
    r.raise_for_status() #stop early if a download fails
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall()
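
The extraction step can be checked without touching the network. The following is a minimal sketch, not the notebook's own code: the helper names `build_url` and `extract_zip_bytes` are hypothetical, and the archive used in the demonstration is built in memory rather than downloaded.

```python
import io
import os
import tempfile
import zipfile


def build_url(year):
    """Return the ExPORTER zip URL for one fiscal year (same pattern as above)."""
    return 'https://exporter.nih.gov/CSVs/final/RePORTER_PRJ_C_FY{}.zip'.format(year)


def extract_zip_bytes(payload, dest='.'):
    """Extract an in-memory zip archive into dest and return the member names."""
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Demonstration with a tiny archive built in memory instead of a real download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('RePORTER_PRJ_C_FY1985.csv', 'APPLICATION_ID\n3000011\n')

with tempfile.TemporaryDirectory() as tmp_dir:
    extracted = extract_zip_bytes(buffer.getvalue(), dest=tmp_dir)
    extracted_exists = os.path.exists(os.path.join(tmp_dir, extracted[0]))
```

In the download loop, the downloaded bytes could be passed to `extract_zip_bytes` in place of the `ZipFile` and `extractall` lines.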

Import only one year to get column names/dtypes

In [4]:
#import first two rows of grants data from a single year
csv = 'RePORTER_PRJ_C_FY2016.csv'
df_columns = pd.read_csv(csv, encoding = 'latin1', nrows = 2)
pd.set_option('display.max_columns', 50)
df_columns

Unnamed: 0,APPLICATION_ID,ACTIVITY,ADMINISTERING_IC,APPLICATION_TYPE,ARRA_FUNDED,AWARD_NOTICE_DATE,BUDGET_START,BUDGET_END,CFDA_CODE,CORE_PROJECT_NUM,ED_INST_TYPE,FOA_NUMBER,FULL_PROJECT_NUM,FUNDING_ICs,FUNDING_MECHANISM,FY,IC_NAME,NIH_SPENDING_CATS,ORG_CITY,ORG_COUNTRY,ORG_DEPT,ORG_DISTRICT,ORG_DUNS,ORG_FIPS,ORG_NAME,ORG_STATE,ORG_ZIPCODE,PHR,PI_IDS,PI_NAMEs,PROGRAM_OFFICER_NAME,PROJECT_START,PROJECT_END,PROJECT_TERMS,PROJECT_TITLE,SERIAL_NUMBER,STUDY_SECTION,STUDY_SECTION_NAME,SUBPROJECT_ID,SUFFIX,SUPPORT_YEAR,DIRECT_COST_AMT,INDIRECT_COST_AMT,TOTAL_COST,TOTAL_COST_SUB_PROJECT
0,9115627,K23,GM,4,N,7/27/2016,8/1/2016,7/31/2017,859,K23GM104401,SCHOOLS OF MEDICINE,PA-11-009,4K23GM104401-04,NIGMS:194460\,OTHER RESEARCH-RELATED,2016,NATIONAL INSTITUTE OF GENERAL MEDICAL SCIENCES,,NEW YORK,UNITED STATES,GENETICS,13,78861598,US,ICAHN SCHOOL OF MEDICINE AT MOUNT SINAI,NY,100296574,PUBLIC HEALTH RELEVANCE: Antiplatelet response...,10799126;,"SCOTT, STUART ALEXANDER;","LONG, ROCHELLE M.",8/1/2013,7/31/2017,ABCB1 gene; Accounting; acute coronary syndrom...,The Pharmacogenomic Control of Clopidogrel Res...,104401,GHD,Genetics of Health and Disease Study Section,,,4,180500,13960,194460,
1,9128072,R01,NS,4,N,8/15/2016,8/1/2016,7/31/2017,853,R01NS085165,SCHOOLS OF MEDICINE,PA-11-260,4R01NS085165-04,NINDS:335781\,Non-SBIR/STTR RPGs,2016,NATIONAL INSTITUTE OF NEUROLOGICAL DISORDERS A...,,BALTIMORE,UNITED STATES,ANESTHESIOLOGY,7,188435911,US,UNIVERSITY OF MARYLAND BALTIMORE,MD,212011508,PUBLIC HEALTH RELEVANCE: Activation of microgl...,7017365;,"POLSTER, BRIAN M;","MORRIS, JILL A",9/30/2013,7/31/2018,Acute; analog; Antioxidants; attenuation; Bind...,Novel Mechanisms of Microglial Neurotoxicity a...,85165,NOMD,Neural Oxidative Metabolism and Death Study Se...,,,4,218750,117031,335781,


There are three different dtypes in the grant data: strings (the most common), floats and datetimes. Create dictionaries/lists to specify dtypes on import.

In [5]:
#names of columns with dtypes of datetime or floats
dates = 'AWARD_NOTICE_DATE BUDGET_START BUDGET_END PROJECT_START PROJECT_END'.split()
nums = 'DIRECT_COST_AMT INDIRECT_COST_AMT TOTAL_COST TOTAL_COST_SUB_PROJECT'.split()
dtypes = nih.get_dtypes(df_columns, nums)
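
The source of `nih.get_dtypes` is not shown in this notebook. As a rough guess at what such a helper might do, the sketch below maps the numeric columns to `float` and every other non-date column to `str`; the function name `get_dtypes_sketch` and its `date_cols` parameter are assumptions made for illustration.

```python
import pandas as pd


def get_dtypes_sketch(df, numeric_cols, date_cols=()):
    """Map numeric columns to float and all remaining non-date columns to str."""
    dtypes = {}
    for col in df.columns:
        if col in date_cols:
            continue  # date columns are left to parse_dates
        dtypes[col] = float if col in numeric_cols else str
    return dtypes


sample = pd.DataFrame({'APPLICATION_ID': [9115627],
                       'PROJECT_START': ['8/1/2013'],
                       'TOTAL_COST': [194460]})
sketch_dtypes = get_dtypes_sketch(sample, ['TOTAL_COST'], ['PROJECT_START'])
```

Reading identifiers such as `APPLICATION_ID` as strings avoids accidental numeric conversion of codes that only look like numbers.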

Import csvs from all years and concatenate into a single dataframe.

In [7]:
all_csvs = sorted(glob.glob('RePORTER_PRJ_C_FY*.csv')) #sort so years are read in order
#read each year's csv with the same dtypes, then concatenate
frames = []
for csv in all_csvs:
    df = pd.read_csv(csv, index_col = None, header = 0, encoding = 'latin1',
                    dtype = dtypes, parse_dates = dates)
    frames.append(df)
all_grants = pd.concat(frames)
all_grants.to_csv('raw_data.csv', index = False, compression = 'gzip')

Delete the raw csvs downloaded from the NIH website.

In [8]:
for csv in all_csvs:
    os.remove(csv)

# Pre-processing

Re-arrange columns to the original column sequence and convert the column names to lowercase.

In [9]:
#all_grants = pd.read_csv('raw_data.csv', compression = 'gzip', dtype = dtypes, parse_dates = dates)
all_grants = all_grants[df_columns.columns.tolist()]

In [10]:
all_grants.columns = all_grants.columns.str.lower()
all_grants.head(1)

Unnamed: 0,application_id,activity,administering_ic,application_type,arra_funded,award_notice_date,budget_start,budget_end,cfda_code,core_project_num,ed_inst_type,foa_number,full_project_num,funding_ics,funding_mechanism,fy,ic_name,nih_spending_cats,org_city,org_country,org_dept,org_district,org_duns,org_fips,org_name,org_state,org_zipcode,phr,pi_ids,pi_names,program_officer_name,project_start,project_end,project_terms,project_title,serial_number,study_section,study_section_name,subproject_id,suffix,support_year,direct_cost_amt,indirect_cost_amt,total_cost,total_cost_sub_project
0,3000011,A03,AH,1,,NaT,1985-07-01,1986-06-30,,A03AH000859,SCHOOLS OF PUBLIC HEALTH,,1A03AH000859-01,,,1985,"DIVISION OF ASSOCIATED, DENTAL HEALTH PROFESSIONS",,BIRMINGHAM,UNITED STATES,,7,4514360,US,UNIVERSITY OF ALABAMA AT BIRMINGHAM,AL,35294,,3700006;,"BRIDGERS, WILLIAM F;",,1985-07-01,1986-06-30 00:00:00,,PUBLIC HEALTH TRAINEESHIPS,859,STC,,,,1,,,,


Convert string columns to lowercase

In [11]:
for col in all_grants:
    if all_grants[col].dtype == 'O':
        all_grants[col] = all_grants[col].str.lower()
all_grants.head(1)

Unnamed: 0,application_id,activity,administering_ic,application_type,arra_funded,award_notice_date,budget_start,budget_end,cfda_code,core_project_num,ed_inst_type,foa_number,full_project_num,funding_ics,funding_mechanism,fy,ic_name,nih_spending_cats,org_city,org_country,org_dept,org_district,org_duns,org_fips,org_name,org_state,org_zipcode,phr,pi_ids,pi_names,program_officer_name,project_start,project_end,project_terms,project_title,serial_number,study_section,study_section_name,subproject_id,suffix,support_year,direct_cost_amt,indirect_cost_amt,total_cost,total_cost_sub_project
0,3000011,a03,ah,1,,,1985-07-01,1986-06-30,,a03ah000859,schools of public health,,1a03ah000859-01,,,1985,"division of associated, dental health professions",,birmingham,united states,,7,4514360,us,university of alabama at birmingham,al,35294,,3700006;,"bridgers, william f;",,1985-07-01,,,public health traineeships,859,stc,,,,1,,,,


In [12]:
all_grants.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2223292 entries, 0 to 71826
Data columns (total 45 columns):
application_id            object
activity                  object
administering_ic          object
application_type          object
arra_funded               object
award_notice_date         object
budget_start              datetime64[ns]
budget_end                datetime64[ns]
cfda_code                 object
core_project_num          object
ed_inst_type              object
foa_number                object
full_project_num          object
funding_ics               object
funding_mechanism         object
fy                        object
ic_name                   object
nih_spending_cats         object
org_city                  object
org_country               object
org_dept                  object
org_district              object
org_duns                  object
org_fips                  object
org_name                  object
org_state                 object
org_zipcode    

Column 'project_end' was not imported as a datetime because a few grants have project end dates in year 3012 instead of 2012, along with other errors. Change 3012 to 2012 and coerce the remaining errors to NaT. The end dates can be checked against the original data frame if necessary.

In [13]:
all_grants['project_end'] = all_grants['project_end'].replace({'04/30/3012':'04/30/2012', '4/30/3012':'04/30/2012'})
all_grants['project_end'] = pd.to_datetime(all_grants['project_end'], errors = 'coerce')
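
The two-step repair can be tried on a toy series. This is a minimal sketch with made-up values; it assumes the dates follow a month/day/year pattern, which is why an explicit `format` is passed. With nanosecond-resolution timestamps pandas cannot represent dates after 2262-04-11, which is a likely reason the year 3012 entries blocked parsing on import.

```python
import pandas as pd

end_dates = pd.Series(['04/30/3012', '7/31/2017', 'not a date'])

# Correct the known typo first, then parse; unparseable values become NaT.
end_dates = end_dates.replace({'04/30/3012': '04/30/2012'})
parsed = pd.to_datetime(end_dates, format='%m/%d/%Y', errors='coerce')
```

Because `errors='coerce'` silently turns every failure into NaT, counting the NaT values before and after is a cheap way to see how many entries were lost.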

In [14]:
all_grants.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2223292 entries, 0 to 71826
Data columns (total 45 columns):
application_id            object
activity                  object
administering_ic          object
application_type          object
arra_funded               object
award_notice_date         object
budget_start              datetime64[ns]
budget_end                datetime64[ns]
cfda_code                 object
core_project_num          object
ed_inst_type              object
foa_number                object
full_project_num          object
funding_ics               object
funding_mechanism         object
fy                        object
ic_name                   object
nih_spending_cats         object
org_city                  object
org_country               object
org_dept                  object
org_district              object
org_duns                  object
org_fips                  object
org_name                  object
org_state                 object
org_zipcode    

Each fiscal year is indexed independently; reset index.

In [15]:
all_grants.reset_index(drop = True, inplace = True)
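
A small illustration of why the reset is needed, using two made-up frames in place of the yearly csvs: after concatenation the per-file row labels repeat, so label-based lookups would match several rows.

```python
import pandas as pd

fy1985 = pd.DataFrame({'fy': ['1985', '1985']})
fy1986 = pd.DataFrame({'fy': ['1986', '1986', '1986']})

combined = pd.concat([fy1985, fy1986])
labels_before = combined.index.tolist()   # per-file labels repeat

combined = combined.reset_index(drop=True)
labels_after = combined.index.tolist()    # one unique label per row
```

Passing `ignore_index=True` to `pd.concat` would give the same result in one step.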

### Missing PI IDs
Some pi_ids entries are just ';' or '; '. Once these entries are stripped and the data frame is saved and re-imported, they show up as NaNs, which makes it difficult to uniquely identify these PIs. To counteract this problem, we need to fill these entries. First strip the trailing space and semicolon.

In [16]:
all_grants = cln.strip_series(all_grants, ['pi_ids'], strip = ' ')
all_grants = cln.strip_series(all_grants, ['pi_ids'], strip = ';')
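
The effect described above can be reproduced on a toy frame. The sketch below assumes that `cln.strip_series` behaves like pandas' `str.strip` on the named column (its source is not shown here), and the sample IDs are made up.

```python
import io

import pandas as pd

pis = pd.DataFrame({'application_id': ['1', '2', '3'],
                    'pi_ids': ['3700006;', ';', '; ']})

# Assumed equivalent of the two strip_series calls: drop spaces, then semicolons.
pis['pi_ids'] = pis['pi_ids'].str.strip(' ').str.strip(';')
stripped = pis['pi_ids'].tolist()

# Round-trip through CSV: empty strings are read back as NaN.
csv_text = pis.to_csv(index=False)
reloaded = pd.read_csv(io.StringIO(csv_text), dtype=str)
missing_after_reload = int(reloaded['pi_ids'].isnull().sum())
```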

In [20]:
#all_grants.to_csv('all_grants.csv', index = False, compression = 'gzip')

dates = [date.lower() for date in dates]
dtypes = {k.lower(): v for k, v in dtypes.items()}

In [21]:
len(all_grants.loc[all_grants['pi_ids'].isnull()])

12098

In [22]:
fill_values = list(range(len(all_grants)))
fill_values = [str(value) for value in fill_values]
all_grants['fill_values'] = fill_values

In [23]:
all_grants['pi_ids'] = all_grants['pi_ids'].fillna(all_grants['fill_values'])
del all_grants['fill_values']
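
The same fill can be written without a temporary column. This is a sketch on made-up data: each missing ID receives its row position as a string, so every filled entry is distinct from the other filled entries.

```python
import numpy as np
import pandas as pd

pis = pd.DataFrame({'pi_ids': ['3700006', np.nan, '2407264', np.nan]})

# Build one placeholder per row from its position, as strings to match pi_ids.
placeholders = pd.Series(np.arange(len(pis)).astype(str), index=pis.index)
pis['pi_ids'] = pis['pi_ids'].fillna(placeholders)
filled = pis['pi_ids'].tolist()
```

Plain row numbers could in principle coincide with a real PI ID; adding a prefix to the placeholders would be one way to rule that out, if it matters for the analysis.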

In [24]:
len(all_grants.loc[all_grants['pi_ids'].isnull()])

0

Save the data before any further processing for future reference. We will also load this dataset [here](cleaning-pi-info.ipynb) to clean PI information.

In [27]:
all_grants.to_csv('all_grants.csv', index = False, compression = 'gzip')

Let's continue cleaning the data.

There are a total of 45 columns, which may contain redundant or unnecessary information. Descriptions of the data contained in each column were scraped from [here](https://exporter.nih.gov/about.aspx), and the code to scrape this information can be found [here](scrape_grant_info.ipynb). We will first filter out unnecessary columns based on their descriptions.

In [28]:
all_grants = pd.read_csv('all_grants.csv', compression = 'gzip',
                        dtype = dtypes, parse_dates = dates)
cols_info = nih.view_col_info('grant_col_info_all.csv')
cols_info[:9]

Unnamed: 0,column_name,descriptions
0,application_id,A unique identifier of the project record in the ExPORTER database.
1,activity,"A 3-character code identifying the grant, contract, or intramural activity through which a project is supported. Within each funding mechanism , NIH uses 3-character activity codes (e.g., F32, K08, P01, R01, T32, etc.) to differentiate the wide variety of research-related programs NIH supports. A comprehensive list of activity codes for grants and cooperative agreements may be found on the Types of Grant Programs Web page. RePORTER also includes R&D contracts (activity codes beginning with the letter N) and intramural projects (beginning with the letter Z)."
2,administering_ic,"Administering Institute or Center - A two-character code to designate the agency,NIH Institute, or Center administering the grant. See Institute/Center code definitions"
3,application_type,"A one-digit code to identify the type of application funded: 1 = New application 2 = Competing continuation (also, competing renewal) 3 = Application for additional (supplemental) support. There are two kinds of type 3competing revisions (which are peer-reviewed and administrative supplements) 4 = Competing extension for an R37 award or first non-competing year of a Fast Track SBIR/STTR award 5 = Non-competing continuation 7 = Change of grantee institution 9 = Change of NIH awarding Institute or Division (on a competing continuation)"
4,arra_funded,“Y” indicates a project supported by funds appropriated through the American Recovery and Reinvestment Act of 2009.
5,award_notice_date,Award notice date or Notice of Grant Award (NGA) is a legally binding document stating the government has obligated funds and which defines the period of support and the terms and conditions of award.\r\n
6,budget_start,The date when a project’s funding for a particular fiscal year begins.
7,budget_end,The date when a project’s funding for a particular fiscal year ends.
8,cfda_code,"Federal programs are assigned a number in the Catalog of Federal Domestic Assistance (CFDA), which is referred to as the ""CFDA code."" The CFDA database helps the Federal government track all programs it has domestically funded. \r\n"


To remove (initially): columns 5-8, which contain redundant information

In [29]:
to_drop = 'award_notice_date budget_start budget_end cfda_code'.split()
all_grants = all_grants.drop(to_drop, axis = 1)

In [30]:
cols_info[9:18]

Unnamed: 0,column_name,descriptions
9,core_project_num,"An identifier for each research project, used to associate the project with publication and patent records. This identifier is not specific to any particular year of the project. It consists of the project activity code, administering IC, and serial number (a concatenation of Activity, Administering_IC, and Serial_Number). \r\n"
10,ed_inst_type,Generic name for the grouping of components across an institution who has applied for or receives NIH funding. The official name as used by NIH is Major Component Combining Name. \r\n
11,foa_number,"The number of the funding opportunity announcement, if any, under which the project application was solicited. Funding opportunity announcements may be categorized as program announcements, requests for applications, notices of funding availability, solicitations, or other names depending on the agency and type of program. Funding opportunity announcements can be found at Grants.gov/FIND and in the NIH Guide for Grants and Contracts"
12,full_project_num,"Commonly referred to as a grant number, intramural project, or contract number. For grants, this unique identification number is composed of the type code, activity code, Institute/Center code, serial number, support year, and (optional) a suffix code to designate amended applications and supplements."
13,funding_ic(s),"The NIH Institute or Center(s) providing funding for a project are designated by their acronyms (see Institute/Center acronyms ). Each funding IC is followed by a colon (:) and the amount of funding provided for the fiscal year by that IC. Multiple ICs are separated by semicolons (;). Project funding information is available only for NIH, CDC, and FDA projects ."
14,funding_mechanism,"The major mechanism categories used in NIH Budget mechanism tables for the President’s budget. Extramural research awards are divided into three main funding mechanisms: grants, cooperative agreements and contracts. A funding mechanism is the type of funded application or transaction used at the NIH. Within each funding mechanism NIH includes programs. Programs can be further refined by specific activity codes."
15,fy,The fiscal year appropriation from which project funds were obligated.
16,ic_name,"Full name of the administering agency, Institute, or Center."
17,nih_spending_cats,"Congressionally-mandated reporting categories into which NIH projects are categorized. Available for fiscal years 2008 and later. Each project’s spending category designations for each fiscal year are made available the following year as part of the next President’s Budget request. See the Research, Condition, and Disease Categorization System for more information on the categorization process."


To remove: 9 (may add in later if correlating with publications), 10-12, 16; also 2 (redundant with funding_ics)

In [31]:
to_drop2 = 'administering_ic core_project_num ed_inst_type foa_number full_project_num ic_name'.split()
all_grants = all_grants.drop(to_drop2, axis = 1)

In [32]:
cols_info[18:27]

Unnamed: 0,column_name,descriptions
18,org_city,"The city in which the business office of the grantee organization or contractor is located. Note that this may be different from the research performance site. For all NIH intramural projects, Bethesda, MD is used."
19,org_country,The country in which the business office of the grantee organization or contractor is located. Note that this may be different from the research performance site.
20,org_dept,"The departmental affiliation of the contact principal investigator for a project, using a standardized categorization of departments. Names are available only for medical school departments."
21,org_district,The congressional district in which the business office of the grantee organization or contractor is located. Note that this may be different from the research performance site.
22,org_duns,"This field may contain multiple DUNS Numbers separated by a semi-colon. The Data Universal Numbering System is a unique nine-digit number assigned by Dun and Bradstreet Information Services, recognized as the universal standard for identifying and keeping track of business worldwide. \r\n"
23,org_fips,The country code of the grantee organization or contractor as defined in the Federal Information Processing Standard.
24,org_name,"The name of the educational institution, research organization, business, or government agency receiving funding for the grant, contract, cooperative agreement, or intramural project."
25,org_state,The state in which the business office of the grantee organization or contractor is located. Note that this may be different from the research performance site.
26,org_zipcode,The zip code in which the business office of the grantee organization or contractor is located. Note that this may be different from the research performance site.


All redundant: 18, 20-23, 25

Also remove 19, 24, 26; these will be added again later.

In [33]:
to_drop3 = 'org_city org_country org_dept org_district org_duns org_fips org_name org_state org_zipcode'.split()
all_grants = all_grants.drop(to_drop3, axis = 1)

In [34]:
cols_info[27:36]

Unnamed: 0,column_name,descriptions
27,phr,"Submitted as part of a grant application, this statement articulates a project's potential to improve public health."
28,pi_id(s),A unique identifier for each of the project Principal Investigators. Each PI in the RePORTER database has a unique identifier that is constant from project to project and year to year.
29,pi_name(s),The name(s) of the Principal Investigator(s) designated by the organization to direct the research project.
30,program_officer_name,An Institute staff member who coordinates the substantive aspects of a contract from planning the request for proposal to oversight.
31,project_start,"The start date of a project. For subprojects of a multi-project grant, this is the start date of the parent award."
32,project_end,"The current end date of the project, including any future years for which commitments have been made. For subprojects of a multi-project grant, this is the end date of the parent award. Upon competitive renewal of a grant, the project end date is extended by the length of the renewal award."
33,project_terms,"Prior to fiscal year 2008, these were thesaurus terms assigned by NIH CRISP indexers. For projects funded in fiscal year 2008 and later, these are concepts that are mined from the project's title, abstract, and specific aims using an automated text mining tool."
34,project_title,"Title of the funded grant, contract, or intramural (sub)project."
35,serial_number,A six-digit number assigned in serial number order within each administering organization.


In [35]:
to_drop4 = 'pi_names program_officer_name project_title serial_number'.split()
all_grants = all_grants.drop(to_drop4, axis = 1)

In [36]:
cols_info[36:]

Unnamed: 0,column_name,descriptions
36,study_section,A designator of the legislatively-mandated panel of subject matter experts that reviewed the research grant application for scientific and technical merit.
37,study_section_name,The full name of a regular standing Study Section that reviewed the research grant application for scientific and technical merit. Applications reviewed by panels other than regular standing study sections are designated by “Special Emphasis Panel.”
38,subproject_id,A unique numeric designation assigned to subprojects of a “parent” multi-project research grant.
39,suffix,"A suffix to the grant application number that includes the letter ""A"" and a serial number to identify an amended version of an original application and/or the letter ""S"" and serial number indicating a supplement to the project. ."
40,support_year,"The year of support for a project, as shown in the full project number. For example, a project with number 5R01GM0123456-04 is in its fourth year of support."
41,direct_cost_amt,Total indirect cost funding for a project from all NIH Institute and Centers for a given fiscal year. Costs are available only for NIH awards funded in FY 2012 and onward. Indirect cost amounts are not available for SBIR/STTR awards.
42,indirect_cost_amt,Total indirect cost funding for a project from all NIH Institute and Centers for a given fiscal year. Costs are available only for NIH awards funded in FY 2012 and onward. Indirect cost amounts are not available for SBIR/STTR awards.
43,total_cost,"Total project funding from all NIH Institute and Centers for a given fiscal year. Costs are available only for: NIH, CDC, and FDA grant awards (only the parent record of multi-project grants). -NIH intramural projects (activity codes beginning with “Z”) in FY 2007 and later fiscal years. -NIH contracts (activity codes beginning with “N”) . For multi-project grants, Total_Cost includes funding for all of the constituent subprojects. This field will be blank on subproject records; the total cost of each subproject is found in Total_Cost_Sub_Project ."
44,total_cost_sub_project,Applies to subproject records only. Total funding for a subproject from all NIH Institute and Centers for a given fiscal year. Costs are available only for NIH awards.


Column 37 is redundant with 36 (the full name can always be looked up).

In [37]:
#check number of subprojects
all_grants.shape
subproject_cols = 'subproject_id suffix'.split()
all_grants[subproject_cols].isnull().sum()

(2223292, 22)

subproject_id    1727151
suffix           1918047
dtype: int64

For simplicity, we will only investigate projects and not subprojects.

In [38]:
all_grants = all_grants.drop(subproject_cols, axis = 1)

In [39]:
pd.set_option('display.max_colwidth', 50)
all_grants.head()

Unnamed: 0,application_id,activity,application_type,arra_funded,funding_ics,funding_mechanism,fy,nih_spending_cats,phr,pi_ids,project_start,project_end,project_terms,study_section,study_section_name,support_year,direct_cost_amt,indirect_cost_amt,total_cost,total_cost_sub_project
0,3000011,a03,1,,,,1985,,,3700006,1985-07-01,NaT,,stc,,1,,,,
1,3000012,a03,1,,,,1985,,,2407264,1985-07-01,NaT,,stc,,1,,,,
2,3000013,a03,1,,,,1985,,,1871887,1985-07-01,NaT,,stc,,1,,,,
3,3000014,a03,1,,,,1985,,,1877259,1985-07-01,NaT,,stc,,1,,,,
4,3000015,a03,1,,,,1985,,,1957769,1985-07-01,NaT,,stc,,1,,,,


## Extract funding institute information 
The column 'funding_ics' contains information about the agency or agencies that funded the grant as well as the amount awarded by that agency. Isolate this information along with the application_id for separate analysis and remove the column 'funding_ics'.

In [40]:
institute_funds = all_grants.filter(items = 'application_id fy funding_ics'.split())
institute_funds.tail()

Unnamed: 0,application_id,fy,funding_ics
2223287,9119172,2016,nigms:180552\
2223288,9128041,2016,nhlbi:751173\
2223289,9033088,2016,nci:354563\
2223290,9070525,2016,nimh:46182\
2223291,9057001,2016,nci:306063\


In [41]:
all_grants = all_grants.drop(['funding_ics'], axis = 1)

### Institute funds per grant

In [42]:
institute_funds.shape

(2223292, 3)

Split 'funding_ics' column so that every row contains a single institute code associated with the application and the amount of money given by that institute.

In [43]:
institute_funds['funding_ics'] = institute_funds['funding_ics'].str.strip('\\ ') #strip trailing backslash and spaces
institute_funds = cln.split_rows(institute_funds, col_name = 'funding_ics', by = '\\')
institute_funds.reset_index(drop = True, inplace = True)
institute_funds.head()

Unnamed: 0,application_id,fy,funding_ics
0,3000011,1985,
1,3000012,1985,
2,3000013,1985,
3,3000014,1985,
4,3000015,1985,
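
The row split can also be expressed with pandas built-ins. The source of `cln.split_rows` is not shown, so the sketch below is only an assumed equivalent: it splits each entry on the backslash delimiter and uses `DataFrame.explode` to give every piece its own row. The sample values are made up.

```python
import pandas as pd

funds = pd.DataFrame({'application_id': ['1', '2'],
                      'funding_ics': ['nci:100\\nimh:50\\', 'nigms:75\\']})

# Trim the trailing delimiter, split on the backslash, then one row per piece.
funds['funding_ics'] = funds['funding_ics'].str.strip('\\ ').str.split('\\')
funds = funds.explode('funding_ics').reset_index(drop=True)
rows = funds.values.tolist()
```

Grants funded by more than one institute gain rows in this step, which would explain why the row count grows after the split.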


Create a new column, 'funds_awarded', with the amount of money the institute awarded to the particular application.

In [44]:
ics = list(institute_funds['funding_ics'])
for i in range(len(ics)):
    if isinstance(ics[i], float): #missing entries are float NaN
        ics[i] = [np.nan, np.nan]
    else:
        ics[i] = ics[i].split(':')

to_concat = pd.DataFrame(ics, columns = ['institute', 'funds_awarded'])
institute_funds = pd.concat([institute_funds, to_concat], axis = 1)
del institute_funds['funding_ics']
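
A vectorized alternative to the loop, sketched on made-up values: `str.split` with `expand=True` produces the two columns directly and leaves missing entries missing. Converting the amounts with `pd.to_numeric` is an extra step that the cell above does not perform.

```python
import numpy as np
import pandas as pd

ics = pd.Series(['nigms:180552', np.nan, 'nci:354563'])

# expand=True returns one column per piece; missing entries stay missing.
parts = ics.str.split(':', expand=True)
parts.columns = ['institute', 'funds_awarded']
parts['funds_awarded'] = pd.to_numeric(parts['funds_awarded'])
```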

In [45]:
institute_funds.tail()

Unnamed: 0,application_id,fy,institute,funds_awarded
2260239,9119172,2016,nigms,180552
2260240,9128041,2016,nhlbi,751173
2260241,9033088,2016,nci,354563
2260242,9070525,2016,nimh,46182
2260243,9057001,2016,nci,306063


Save to .csv

In [46]:
institute_funds.to_csv('institute_funds.csv', index = False, compression = 'gzip')

## Cost of grants (funds)
There are 4 cost columns. Indirect and direct costs sum to total costs or to total subproject costs. Drop indirect and direct costs columns and combine the total costs into one column (costs are either listed as total cost or total subproject cost).

In [47]:
all_grants.tail(2)

Unnamed: 0,application_id,activity,application_type,arra_funded,funding_mechanism,fy,nih_spending_cats,phr,pi_ids,project_start,project_end,project_terms,study_section,study_section_name,support_year,direct_cost_amt,indirect_cost_amt,total_cost,total_cost_sub_project
2223290,9070525,f30,4,n,"training, individual",2016,,the world health organization estimates that n...,10944221,2012-06-01,2017-05-31,amino acid sequence; anterior; anxiety; axon; ...,zrg1,special emphasis panel,5,46182.0,,46182.0,
2223291,9057001,r01,5,n,non-sbir/sttr rpgs,2016,,public health relevance: trip13 overexpressio...,9288457,2014-05-01,2018-04-30,adaptor signaling protein; address; affect; ag...,cg,cancer genetics study section,3,207500.0,98563.0,306063.0,


In [48]:
all_grants = all_grants.drop(['direct_cost_amt', 'indirect_cost_amt'], axis = 1)
all_grants['total_cost'] = all_grants['total_cost'].fillna(all_grants['total_cost_sub_project'])
del all_grants['total_cost_sub_project']
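
How the two cost columns combine, shown on made-up values: a row keeps its total cost when one is present, falls back to the subproject total otherwise, and stays missing when neither is available.

```python
import numpy as np
import pandas as pd

costs = pd.DataFrame({'total_cost': [194460.0, np.nan, np.nan],
                      'total_cost_sub_project': [np.nan, 52000.0, np.nan]})

# Prefer total_cost; fall back to the subproject total when it is missing.
costs['funds'] = costs['total_cost'].fillna(costs['total_cost_sub_project'])
combined_funds = costs['funds'].tolist()
```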

In [49]:
all_grants.rename(columns = {'total_cost':'funds'}, inplace = True)
all_grants.tail()

Unnamed: 0,application_id,activity,application_type,arra_funded,funding_mechanism,fy,nih_spending_cats,phr,pi_ids,project_start,project_end,project_terms,study_section,study_section_name,support_year,funds
2223287,9119172,p20,4,n,research centers,2016,,mycobacterium bovis is the causative agent of ...,9524770,2016-07-01,2017-06-30,adaptive immunity; animals; arm; biology; bovi...,zrr1,special emphasis panel,5,180552.0
2223288,9128041,u01,5,n,non-sbir/sttr rpgs,2016,,public health relevance: disorders of excitabi...,6490459,2015-09-01,2020-08-31,3-dimensional; academia; adherence; affect; am...,zeb1,special emphasis panel,2,751173.0
2223289,9033088,r01,5,n,non-sbir/sttr rpgs,2016,,public health relevance: hepatocarinoma is a m...,1901669,2015-07-01,2020-06-30,1-phosphatidylinositol 3-kinase; ablation; add...,tcb,tumor cell biology study section,2,354563.0
2223290,9070525,f30,4,n,"training, individual",2016,,the world health organization estimates that n...,10944221,2012-06-01,2017-05-31,amino acid sequence; anterior; anxiety; axon; ...,zrg1,special emphasis panel,5,46182.0
2223291,9057001,r01,5,n,non-sbir/sttr rpgs,2016,,public health relevance: trip13 overexpressio...,9288457,2014-05-01,2018-04-30,adaptor signaling protein; address; affect; ag...,cg,cancer genetics study section,3,306063.0


## Selecting years to analyze
As we are interested in the factors driving grant funding over time, years where funding information is unavailable or sparse are less relevant and can be ignored. However, we do not want to remove every grant without funding information, because certain types of grants are never listed with funding, and it could still be worth investigating how many of these grants are awarded. Therefore we determine the proportion of grants without funding information per year and remove the years where that proportion is large.

In [50]:
years = all_grants['fy'].unique().tolist()
percent_funded = {year: None for year in years}
for year in years:
    total = len(all_grants.loc[all_grants['fy'] == year])
    nulls = len(all_grants.loc[(all_grants['funds'].isnull()) & (all_grants['fy'] == year)])
    percent = nulls/total
    percent_funded[year] = {'proportion null':percent, 'total':total}
percent_funded

{'1985': {'proportion null': 1.0, 'total': 49748},
 '1986': {'proportion null': 1.0, 'total': 42996},
 '1987': {'proportion null': 1.0, 'total': 47294},
 '1988': {'proportion null': 1.0, 'total': 47898},
 '1989': {'proportion null': 1.0, 'total': 48032},
 '1990': {'proportion null': 1.0, 'total': 52000},
 '1991': {'proportion null': 1.0, 'total': 53491},
 '1992': {'proportion null': 1.0, 'total': 51737},
 '1993': {'proportion null': 1.0, 'total': 51404},
 '1994': {'proportion null': 1.0, 'total': 53435},
 '1995': {'proportion null': 1.0, 'total': 54739},
 '1996': {'proportion null': 1.0, 'total': 65224},
 '1997': {'proportion null': 1.0, 'total': 70243},
 '1998': {'proportion null': 1.0, 'total': 71904},
 '1999': {'proportion null': 1.0, 'total': 80081},
 '2000': {'proportion null': 0.18039520958083832, 'total': 83500},
 '2001': {'proportion null': 0.23329846797514306, 'total': 81265},
 '2002': {'proportion null': 0.2331011831269554, 'total': 83423},
 '2003': {'proportion null': 0.0799
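
The same per-year summary can be computed in one pass with a groupby. This is a sketch on made-up data, relying on the fact that the mean of a boolean mask is the proportion of True values.

```python
import numpy as np
import pandas as pd

grants = pd.DataFrame({'fy': ['1999', '1999', '2000', '2000', '2000', '2000'],
                       'funds': [np.nan, np.nan, 100.0, np.nan, 250.0, 75.0]})

# The mean of a boolean mask is the proportion of True values in each group.
summary = (grants['funds'].isnull()
           .groupby(grants['fy'])
           .agg(['mean', 'size'])
           .rename(columns={'mean': 'proportion null', 'size': 'total'}))
```

Each row of `summary` corresponds to one fiscal year, which makes it easy to filter years by a threshold on the null proportion.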

No funding information is available until 2000. Therefore remove grants from 1985-1999.

In [51]:
all_grants = all_grants.loc[all_grants['fy'] >= '2000'] #fy is stored as a string
all_grants.shape
all_grants.head()

(1383066, 16)

Unnamed: 0,application_id,activity,application_type,arra_funded,funding_mechanism,fy,nih_spending_cats,phr,pi_ids,project_start,project_end,project_terms,study_section,study_section_name,support_year,funds
840226,6258248,c06,1,,,2000,,,1860776,2000-09-22,NaT,,strb,scientific and technical review board on biome...,1,1488000.0
840227,6033399,c06,1,,,2000,,,6423558,2000-07-01,NaT,,strb,scientific and technical review board on biome...,1,1000000.0
840228,6039178,c06,1,,,2000,,,1871945,2000-07-01,NaT,,strb,scientific and technical review board on biome...,1,1000000.0
840229,6258225,c06,1,,,2000,,,6522067,2000-09-01,NaT,,strb,scientific and technical review board on biome...,1,1000000.0
840230,6258259,c06,1,,,2000,,,8756854,2000-08-01,NaT,,strb,scientific and technical review board on biome...,1,1999999.0


## Splitting individual PIs when more than one is listed on a grant
Split grants with multiple PIs so that each row only has a single PI listed. This will allow analysis on funding per individual and per institution.

In [52]:
col_to_clean = 'nih_spending_cats pi_ids project_terms'.split()

#strip ';' from columns
all_grants = cln.strip_series(all_grants, col_to_clean, strip = '; ')

#individual PIs are delimited by a ';', so split into rows along the ';'
all_grants = cln.split_rows(all_grants, col_name = 'pi_ids', by = ';')

#strip '(contact)' string and final white space
all_grants = cln.strip_series(all_grants, ['pi_ids'])
all_grants = cln.strip_series(all_grants, ['pi_ids'], strip = ' ')
all_grants.tail()

Unnamed: 0,application_id,activity,application_type,arra_funded,funding_mechanism,fy,nih_spending_cats,phr,project_start,project_end,project_terms,study_section,study_section_name,support_year,funds,pi_ids
2223287,9119172,p20,4,n,research centers,2016,,mycobacterium bovis is the causative agent of ...,2016-07-01,2017-06-30,adaptive immunity; animals; arm; biology; bovi...,zrr1,special emphasis panel,5,180552.0,9524770
2223288,9128041,u01,5,n,non-sbir/sttr rpgs,2016,,public health relevance: disorders of excitabi...,2015-09-01,2020-08-31,3-dimensional; academia; adherence; affect; am...,zeb1,special emphasis panel,2,751173.0,6490459
2223289,9033088,r01,5,n,non-sbir/sttr rpgs,2016,,public health relevance: hepatocarinoma is a m...,2015-07-01,2020-06-30,1-phosphatidylinositol 3-kinase; ablation; add...,tcb,tumor cell biology study section,2,354563.0,1901669
2223290,9070525,f30,4,n,"training, individual",2016,,the world health organization estimates that n...,2012-06-01,2017-05-31,amino acid sequence; anterior; anxiety; axon; ...,zrg1,special emphasis panel,5,46182.0,10944221
2223291,9057001,r01,5,n,non-sbir/sttr rpgs,2016,,public health relevance: trip13 overexpressio...,2014-05-01,2018-04-30,adaptor signaling protein; address; affect; ag...,cg,cancer genetics study section,3,306063.0,9288457


In [53]:
all_grants.shape

(1446254, 16)

## Splitting funds by PI

Split grant totals by number of associated PIs (assumption is that all PIs on a grant receive the same amount of money).

In [54]:
pi_per_grant = pd.DataFrame(all_grants['application_id'].value_counts())
pi_per_grant = pi_per_grant.reset_index()
pi_per_grant.columns = ['application_id', 'num_pis']

#Match application IDs in df of grants and pi_per_grant
#so that every row carries the number of PIs on its grant
all_grants = pd.merge(all_grants, pi_per_grant, on = 'application_id')

In [55]:
#divide funds by the number of PIs on the grant (floor division drops any remainder)
all_grants['funds'] = all_grants['funds'] // all_grants['num_pis']
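
The count-and-divide step can be sketched without the intermediate merge. The example below uses made-up data and `groupby(...).transform('count')` to attach the number of PIs to each row; it assumes pi_ids has no missing values, which holds here because they were filled earlier.

```python
import pandas as pd

grants = pd.DataFrame({'application_id': ['1', '1', '2'],
                       'pi_ids': ['10', '11', '12'],
                       'funds': [101.0, 101.0, 300.0]})

# Count rows per application (one row per PI) and broadcast back to each row.
grants['num_pis'] = grants.groupby('application_id')['pi_ids'].transform('count')

# Floor division, as in the cell above: any remainder is discarded.
grants['funds'] = grants['funds'] // grants['num_pis']
shares = grants['funds'].tolist()
```

With floor division the per-PI shares of a grant may sum to slightly less than the original total, as the first application in the example shows.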

Save to csv

In [56]:
all_grants.to_csv('for_analysis.csv', index = False, compression = 'gzip')

For text mining of grant abstracts, see [here](funding-by-abstracts.ipynb).