# Scraping NIH grant information 

Information about the variables in the NIH Exporter database can be found [here](https://exporter.nih.gov/about.aspx).

In [1]:
import requests, zipfile, io
from bs4 import BeautifulSoup

import pandas as pd
import numpy as np

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
url = 'https://exporter.nih.gov/about.aspx'
r = requests.get(url)

#checks that the request was successful
try:
    r.raise_for_status()
except Exception as exc:
    print('There was a problem {}'.format(exc))

html_doc = r.text
soup = BeautifulSoup(html_doc, 'html.parser')

### Local functions

In [5]:
def join_list(list_of_strings, string_to_join = ' '):
    '''
    Joins a list of strings into a single string.
    '''
    text = string_to_join.join(list_of_strings)
    text = text.strip().replace('\xa0', '')
    return text

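def get_full_desc(soup, index_num, gen_num):
    '''
    Scrape the full column description when the description mixes
    plain text with embedded html (e.g. links).

    index_num: position of the column's <strong> tag on the page
    gen_num: number of sibling nodes after that tag to include
    '''
    gen = soup.find_all('strong')[index_num].next_siblings
    text_list = []
    for index, item in enumerate(gen):
        if index >= gen_num:
            break
        # .string is None for tags with several children; skip those
        if item.string is not None:
            text_list.append(item.string)
    text = join_list(text_list)
    return text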

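def get_desc_uls(soup, subcomponent, string_to_join = ' -'):
    '''
    Collect the text of every `subcomponent` tag (e.g. 'li' or 'p')
    inside any <ul> on the page and join it into a single string.
    '''
    uls = soup.find_all('ul')
    elements = []
    for ul in uls:
        for element in ul.find_all(subcomponent):
            elements.append(element.text.strip())
    return join_list(elements, string_to_join)

def concat_df_strings(df, index, to_concat, rn):
    '''
    Append `to_concat` to the description at row position `index`,
    replacing every occurrence of `rn` with a space.
    '''
    df.iloc[index, 1] = ((str(df.iloc[index, 1]) + to_concat).replace(rn, ' ').strip())
    return df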

Create a dataframe of column names and their descriptions from the `<strong>` tags. Each `<strong>` tag holds a column name, and its `next_sibling` holds the associated description.

In [3]:
cols = []
desc = []
for strong_tag in soup.find_all('strong'):
    cols.append(strong_tag.text)
    desc.append(strong_tag.next_sibling)

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', 5000)

col_info = pd.DataFrame()
col_info['column_name'] = cols
col_info['descriptions'] = desc

col_info.head()

Unnamed: 0,column_name,descriptions
0,Application_ID:,A unique identifier of the project record in the ExPORTER database.
1,Activity:,"A 3-character code identifying the grant, contract, or intramural activity through which a project is supported. Within each"
2,\n\r\n Administering_IC:,"Administering Institute or Center - A two-character code to designate the agency, NIH Institute, or Center administering the grant. See"
3,Application_Type:,A one-digit code to identify the type of application funded:
4,ARRA_Funded:,“Y” indicates a project supported by funds appropriated through the American Recovery and Reinvestment Act of 2009.


Perform basic text cleaning

In [4]:
col_info['column_name'] = (
    col_info['column_name']
    .str.replace('\n\r\n', '', regex=False)
    .str.replace(':', '', regex=False)
    .str.strip()  # drop the space left behind after the line breaks
    .str.lower()
)
col_info = col_info.drop(col_info.index[[8, 37]]) # rows created by <br/> text

#save information about publications in case needed in the future
publication_info = col_info.iloc[46:, :]
publication_info.to_csv('publication_info.csv')
col_info = col_info.drop(publication_info.index)

Some descriptions are incomplete because they contain embedded links (`<a href>` tags), and `next_sibling` returns only the text up to the first link. To get the full descriptions of these columns, including the link titles, see below.

In [5]:
col_gen_pairs = [
    (1, 7),
    (2, 2),
    (12, 4),
    (14, 3),
    (15, 3),
    (18, 3),
    (44, 3),
    (45, 3),
]

full_descs = []
for pair in col_gen_pairs:
    description = get_full_desc(soup, pair[0], pair[1])
    full_descs.append(description)
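The difference between `next_sibling` and `next_siblings` can be seen on a small snippet. This is a minimal sketch, not the notebook's method: the HTML string and the names `soup_demo`, `partial`, and `full` are made up for illustration, and the text is collected with `get_text()` rather than `.string`.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup imitating one entry of the data dictionary
html = ('<p><strong>Activity:</strong> Within each '
        '<a href="#">funding mechanism</a>, NIH uses codes.</p>')
soup_demo = BeautifulSoup(html, 'html.parser')
strong = soup_demo.find('strong')

# next_sibling is only the first node after the tag: the text before the link
partial = str(strong.next_sibling)

# next_siblings yields every following node, text and tags alike
full = ''.join(
    node.get_text() if isinstance(node, Tag) else str(node)
    for node in strong.next_siblings
)

print(repr(partial))
print(repr(full))
```

In `get_full_desc`, the `gen_num` argument plays the role of limiting how many of these sibling nodes are consumed.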

Replace partial descriptions with full descriptions.

In [6]:
# assign by row label; .ix was removed in pandas 1.0
for (i, _), full_desc in zip(col_gen_pairs, full_descs):
    col_info.loc[i, 'descriptions'] = full_desc

In one description, bolded text (`<strong>`) appeared in the body of the paragraph, so the scraper created an extra row for it. Append that text to the appropriate description and remove the extraneous row.

In [7]:
col_info.loc[42, 'descriptions'] = str(col_info.loc[42, 'descriptions']) + '04 is in its fourth year of support.'

col_info = col_info.drop(col_info.index[[41]])
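Once rows have been dropped, a row's label and its position no longer coincide, which matters in a cell like the one above that addresses one row by label and another by position. A minimal sketch on a hypothetical four-row frame (`demo` and its values are made up):

```python
import pandas as pd

# Hypothetical four-row frame standing in for col_info
demo = pd.DataFrame({
    'column_name': ['a', 'b', 'c', 'd'],
    'descriptions': ['A', 'B', 'C', 'D'],
})

# Dropping the row at position 1 leaves the labels 0, 2, 3
demo = demo.drop(demo.index[[1]])

# .loc selects by label: label 2 is still the row for 'c'
by_label = demo.loc[2, 'column_name']

# .iloc selects by position: position 2 is now the row for 'd'
by_position = demo.iloc[2, 0]

# demo.index[n] converts a position into the label that drop() expects
label_at_position_1 = demo.index[1]

print(by_label, by_position, label_at_position_1)
```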

Two descriptions had associated lists. Get the list information and add to the appropriate descriptions.

In [None]:
rn = '\r\n'

application_type = get_desc_uls(soup, 'p')
application_type = application_type.replace(rn, '')
application_type = application_type.replace('\t\t\t\t\t\t\t', '')

total_cost = get_desc_uls(soup, 'li')
total_cost = total_cost + soup.find_all('ul')[-1].next_sibling.string.replace(rn, '')

col_info = concat_df_strings(col_info, -2, total_cost, rn)
col_info = concat_df_strings(col_info, 3, application_type, rn)

Final cleaning

In [11]:
# regex=True replaces rn inside strings; the default only matches whole cells
col_info = col_info.replace(rn, '', regex=True)
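`DataFrame.replace` with a plain string only swaps cells whose entire value equals that string; replacing a substring requires `regex=True`. A minimal sketch on a made-up two-row frame (the `demo` data is hypothetical):

```python
import pandas as pd

rn = '\r\n'
demo = pd.DataFrame({'descriptions': ['first line\r\nsecond line', '\r\n']})

# Default: only the cell that is exactly '\r\n' is replaced
exact = demo.replace(rn, '')

# regex=True: every occurrence inside each string is replaced
substring = demo.replace(rn, '', regex=True)

print(exact['descriptions'].tolist())
print(substring['descriptions'].tolist())
```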

Final table

In [12]:
col_info.head(10)

Unnamed: 0,column_name,descriptions
0,application_id,A unique identifier of the project record in the ExPORTER database.
1,activity,"A 3-character code identifying the grant, contract, or intramural activity through which a project is supported. Within each funding mechanism , NIH uses 3-character activity codes (e.g., F32, K08, P01, R01, T32, etc.) to differentiate the wide variety of research-related programs NIH supports. A comprehensive list of activity codes for grants and cooperative agreements may be found on the Types of Grant Programs Web page. RePORTER also includes R&D contracts (activity codes beginning with the letter N) and intramural projects (beginning with the letter Z)."
2,administering_ic,"Administering Institute or Center - A two-character code to designate the agency,NIH Institute, or Center administering the grant. See Institute/Center code definitions"
3,application_type,"A one-digit code to identify the type of application funded: 1 = New application 2 = Competing continuation (also, competing renewal) 3 = Application for additional (supplemental) support. There are two kinds of type 3competing revisions (which are peer-reviewed and administrative supplements) 4 = Competing extension for an R37 award or first non-competing year of a Fast Track SBIR/STTR award 5 = Non-competing continuation 7 = Change of grantee institution 9 = Change of NIH awarding Institute or Division (on a competing continuation)"
4,arra_funded,“Y” indicates a project supported by funds appropriated through the American Recovery and Reinvestment Act of 2009.
5,award_notice_date,Award notice date or Notice of Grant Award (NGA) is a legally binding document stating the government has obligated funds and which defines the period of support and the terms and conditions of award.
6,budget_start,The date when a project’s funding for a particular fiscal year begins.
7,budget_end,The date when a project’s funding for a particular fiscal year ends.
9,cfda_code,"Federal programs are assigned a number in the Catalog of Federal Domestic Assistance (CFDA), which is referred to as the ""CFDA code."" The CFDA database helps the Federal government track all programs it has domestically funded."
10,core_project_num,"An identifier for each research project, used to associate the project with publication and patent records. This identifier is not specific to any particular year of the project. It consists of the project activity code, administering IC, and serial number (a concatenation of Activity, Administering_IC, and Serial_Number)."


Write to csv

In [13]:
col_info.to_csv('grant_col_info_all.csv', index = False)

## Scrape application type information

The application type descriptions scraped above are incomplete or unclear (for example, types 6 and 8 are not listed). The NIH has a [page](https://grants.nih.gov/grants/how-to-apply-application-guide/prepare-to-apply-and-register/type-of-applications.htm) describing application types in more detail.

In [21]:
url = 'https://grants.nih.gov/grants/how-to-apply-application-guide/prepare-to-apply-and-register/type-of-applications.htm'
r = requests.get(url)

#checks that the request was successful
try:
    r.raise_for_status()
except Exception as exc:
    print('There was a problem {}'.format(exc))

html_doc = r.text
soup = BeautifulSoup(html_doc, 'html.parser')

In [22]:
result = soup.find('div', {'class':'field field--name-body field--type-text-with-summary field--label-hidden'})
result

<div class="field field--name-body field--type-text-with-summary field--label-hidden"><div class="field__items"><div class="field__item even" property="content:encoded"><div class="syndicate"><p>
	Choose the type of application you plan to submit from the chart below. Learn about specific submission requirements, and be sure to follow them, as well as the instructions in the application forms and the funding opportunity announcement.<br/>
<br/>
	The type of application you submit can impact...
</p>
<ul>
<li>
		...your due dates. Some announcements have specific due dates based on application type. For example, the standard R01 due dates for new applications are February/June/October 5, while the standard due dates for resubmission/revision/renewal applications are March/July/November 5.<br/>
		 
	</li>
<li>
		...your ability to submit to a specific announcement. Each FOA indicates the application types allowed for that opportunity.<br/>
		 
	</li>
<li>
		...the business rules enforced 

In [82]:
tags = []
for tag in result.find_all('p'):
    tags.append(tag.get_text())

#get tags associated with application type information
tags = tags[1:35]

#strip auxiliary characters
for idx, i in enumerate(tags):
    tags[idx] = i.strip('\n').strip('\t').strip('\xa0').rstrip('*\n')
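A note on how `str.strip` behaves: its argument is a set of characters, and each call in a chain only sees what the previous call left at the ends of the string. The sketch below, using a made-up input string, shows a case where a chain leaves a character behind that a single call with the combined character set removes. This is an illustration of the string method, not a claim about what the NIH page contains.

```python
# Hypothetical raw text: a tab, then a newline, then the content
raw = '\t\nNew*\n'

# Chained calls: strip('\n') cannot remove the leading newline because
# the tab is in front of it; strip('\t') then removes only the tab
chained = raw.strip('\n').strip('\t').strip('\xa0').rstrip('*\n')

# One call with all characters removes them in any order
combined = raw.strip('\n\t\xa0*')

print(repr(chained))
print(repr(combined))
```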

In [83]:
#strings containing 'Learn' are actually hyperlinks; remove these
#(build a new list: removing items while looping over a list skips elements)
tags = [tag for tag in tags if 'Learn' not in tag]

#also remove the first empty string
tags.remove('')

#join elements 8 and 9 into a single string, as this description was split in two
tags[8:10] = [' '.join(tags[8:10])]

#element 5 has extraneous characters that need to be removed
tags[5] = tags[5].replace('\t', '').replace('\xa0', '').replace('\n', ' ')
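Filtering a list while looping over it is easy to get wrong, because each removal shifts the remaining items under the loop's position. The sketch below uses made-up strings (`sample` is hypothetical) to compare `remove` inside a `for` loop with a list comprehension.

```python
# Hypothetical strings with two adjacent 'Learn' entries
sample = ['New', 'Learn more', 'Learn about renewals', 'Renewal']

# Removing inside the loop: after the first removal the next item
# slides into the current position and is never examined
mutated = list(sample)
for item in mutated:
    if 'Learn' in item:
        mutated.remove(item)

# A comprehension builds a new list and examines every item
filtered = [item for item in sample if 'Learn' not in item]

print(mutated)
print(filtered)
```

Because the later steps address `tags` by fixed positions, it is worth printing the list after filtering to confirm that the indices still point at the intended strings.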

Create dataframe with application type descriptions.

In [105]:
app_types = pd.DataFrame()
app_types['type'] = range(1, 10)
app_types['stage'] = tags[1::3]
app_types['description'] = tags[2::3]
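The strided slices above rely on the list repeating in groups of three. A minimal sketch of the idea with a made-up flat list; the assumption that the first item of each group is the type code is mine, and the descriptions are shortened placeholders.

```python
# Hypothetical flat list: (code, stage, description) repeated
flat = [
    '1', 'New', 'Initial request for support.',
    '2', 'Renewal', 'Request for additional funding.',
]

# A slice start:stop:step with step 3 picks every third item
codes = flat[0::3]
stages = flat[1::3]
descriptions = flat[2::3]

# The slices only line up if the list length is a multiple of 3
assert len(flat) % 3 == 0

rows = list(zip(codes, stages, descriptions))
print(rows)
```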

In [106]:
app_types

Unnamed: 0,type,stage,description
0,1,New,Initial request for support of a project that has not yet been funded.
1,2,Renewal,"Initial request for additional funding for a period subsequent to that provided by a current award. Renewal applications compete for funding with all other peer reviewed applications and must be developed as fully as though the applicant is applying for the first time. (Previously referred to as “competing continuation.”) If your renewal and subsequent resubmission of renewal application are not funded, you must use the ""new"" application type to compete for additional funding and continuity with your previous award will not be retained."
2,3,Competing Revision,"Initial request for (or the award of) additional funds during a current project period to support new or additional activities that are not identified in the current award. This request reflects an expansion of the scope of the grant-approved activities. Competitive revisions require peer review. (Competing revision replaces the previous NIH term, ""competing supplement."") An administrative supplement is a request for (or the award of) additional funds during a current project period to provide for an increase in costs due to unforeseen circumstances. All additional costs must be within the scope of the peer reviewed and approved project."
3,4,Extension,Request for additional years of support beyond the years previously awarded. (Used only for select programs.)
4,5,Noncompeting Continuation,Request or award for a subsequent budget period within a previously approved project for which a recipient does not have to compete with other applications.
5,6,Change of Organization Status (Successor-in-Interest),"Process whereby the rights to and obligations under an NIH grant(s) are acquired incidental to the transfer of all of the assets of the grantee or the transfer of that part of the assets involved in the performance of the grant(s). May result from legislative or other legal action, such as a merger or other corporate change."
6,7,Change of Grantee or Training Institution,Transfer of the legal and administrative responsibility for a grant-supported project or activity from one legal entity to another before the completion date of the approved project period (competitive segment).
7,8,Change of Institute or Center,Change of awarding NIH institute or center for the noncompeting continuation (Type 5).
8,9,Change of Institute or Center,Change of awarding NIH institute or center for the renewal (Type 2).


In [108]:
app_types.to_csv('app_types.csv', index = False)