# Dataset: NSF Awards (1970-2016)

## What is XML?
From Wikipedia: 
*The Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable (i.e. text-based) and machine-readable. XML emphasizes simplicity, generality, and usability across the Internet.*

A basic XML document is organized hierarchically like a tree, with a single base (root) extending into multiple branches. Each branch has a name (its tag) and may extend into further branches (children) nested under the parent branch.
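
To make the tree picture concrete, here is a minimal sketch that parses a small XML string and prints its hierarchy. The document itself is invented for illustration (the root tag `rootTag` and all values are assumptions, not taken from the dataset), and `outline` is a hypothetical helper.

```python
from xml.etree import ElementTree

# Hypothetical XML document: a single root with nested branches.
xml_text = """
<rootTag>
  <Award>
    <AwardTitle>Example Title</AwardTitle>
    <Institution>
      <Name>Example University</Name>
    </Institution>
  </Award>
</rootTag>
""".strip()

def outline(element, depth=0):
    """Return one indented line per element, parents before children."""
    lines = ['  ' * depth + element.tag]
    for child in element:  # iterating an element yields its direct children
        lines.extend(outline(child, depth + 1))
    return lines

root = ElementTree.fromstring(xml_text)
print('\n'.join(outline(root)))
```

Each level of indentation in the printed outline corresponds to one level of nesting in the XML.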

## Parsing XML in Python
Python's standard library supports XML parsing, and the Anaconda distribution includes several additional XML-parsing packages. Here we will use the standard-library package *xml* (specifically its *ElementTree* module) to inspect the contents of the first XML file in our dataset.

In [None]:
import os
from xml.etree import ElementTree

## Use the ElementTree command to parse the XML file.
tree = ElementTree.parse(os.path.join('1970','7000047.xml'))

## Extract the "root" of the XML file from the tree.
root = tree.getroot()

## Inspect the branches of the root.
print('The branches of the root:')
for n, branch in enumerate(root):
    print(n+1, branch.tag)

Manual inspection of the XML document confirms this result: the root has a single branch, Award. Let's now inspect the children of the Award branch.

In [None]:
## Take the first (and only) branch of the root.
Award = root[0]

## Print all of the children of Award.
print('The branches of the Award branch:')
for n, child in enumerate(Award):
    print(n+1, child.tag)

Finally, let's inspect the data stored under the first child, AwardTitle.

In [None]:
## Extract the first child, AwardTitle, from the Award branch.
AwardTitle = Award[0]

## Inspect the data stored under AwardTitle.
print(AwardTitle.tag, AwardTitle.text)
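
The cells above assume the file is well-formed XML. The extraction loop later in this notebook has to cope with files that are not, so here is a minimal sketch of how a parsing failure surfaces. The helper `try_parse` and both input strings are hypothetical and only serve as an illustration.

```python
from xml.etree import ElementTree

def try_parse(xml_text):
    """Return the root element, or None if the text is not well-formed XML."""
    try:
        return ElementTree.fromstring(xml_text)
    except ElementTree.ParseError:
        return None

# A well-formed document parses normally.
good = try_parse('<rootTag><Award/></rootTag>')

# A document with an unclosed <Award> tag raises ParseError internally.
bad = try_parse('<rootTag><Award></rootTag>')

print(good is not None, bad is None)
```

Catching `ElementTree.ParseError` specifically, rather than every exception, keeps unrelated problems (such as a keyboard interrupt) from being silently logged as parsing errors.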

After manually inspecting several of the XML documents, we will extract the following fields for further analysis:
* AwardTitle: Title of the grant.
* AwardAmount: Funds allocated by the grant.
* Directorate: Awarding NSF Organization.
* Division: Specific division of the Directorate.
* AbstractNarration: The grant abstract.
* InstitutionName: Awardee institution.
* InstitutionStateCode: Awardee state.
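
As a rough sketch of how these fields can be pulled from a single record, the example below uses `findtext` with path expressions, which returns a default value instead of failing when an element is absent. The sample record is invented: its nesting follows the lookups used in the extraction cell of this notebook (for example `Institution` containing `Name` and `StateCode`), while the root tag, the values, and the names `FIELD_PATHS` and `extract_fields` are assumptions.

```python
from xml.etree import ElementTree

# Hypothetical record; StateCode and AbstractNarration are deliberately missing.
sample = """<rootTag><Award>
  <AwardTitle>Example Title</AwardTitle>
  <AwardAmount>10000</AwardAmount>
  <Organization>
    <Directorate><LongName>Example Directorate</LongName></Directorate>
    <Division><LongName>Example Division</LongName></Division>
  </Organization>
  <Institution><Name>Example University</Name></Institution>
</Award></rootTag>"""

# Map each output field to its path relative to the Award element.
FIELD_PATHS = {
    'Title': 'AwardTitle',
    'Funds': 'AwardAmount',
    'Directorate': 'Organization/Directorate/LongName',
    'Division': 'Organization/Division/LongName',
    'Institution': 'Institution/Name',
    'State': 'Institution/StateCode',
    'Abstract': 'AbstractNarration',
}

def extract_fields(award):
    """Return a dict of field values; missing or empty elements become ''."""
    return {name: award.findtext(path, default='') or ''
            for name, path in FIELD_PATHS.items()}

award = ElementTree.fromstring(sample)[0]
record = extract_fields(award)
print(record)
```

Because missing elements map to an empty string, a record that lacks a state code or an abstract still yields a complete row.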



In [None]:
import os
import numpy as np
from pandas import DataFrame, Series
from xml.etree import ElementTree

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Setup.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

## Define grant folders (one per year, 1970 through 2016).
grant_dirs = np.arange(1970,2017).astype(str)

## Define column names.
columns = ('Year', 'ID', 'Title', 'Funds', 'Directorate', 'Division',
          'Institution', 'State')

## Open a new file to document all XML files with parsing errors.
errors = open('parsing_errors.txt', 'w')

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Main loop.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

for gdir in grant_dirs:
                
    ## Define year.
    year = gdir    
    
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
    ### Initialize files (separate for each year).
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
    
    ## Initialize empty pandas dataframe.
    df = DataFrame([], columns=columns)
    
    ## Initialize text file for abstracts.
    Abstracts = open(os.path.join(str(year), 'abstracts.txt'), 'w+')
    
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
    ### Locate and loop over XML files for a year.
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
    
    ## Locate all XML files (check if ends with '.xml')
    xml_files = sorted([f for f in os.listdir(gdir) if f.endswith('.xml')])
    
    for xml in xml_files:
        
        #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
        ### Open and parse XML file.
        #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
        
        ## Use the ElementTree command to parse the XML file.
        try: 
            tree = ElementTree.parse(os.path.join(gdir, xml))
        except ElementTree.ParseError:
            errors.write('%s\n' %os.path.join(gdir, xml))
            continue

        ## Extract the "root" of the XML file from the tree.
        root = tree.getroot()
        
        ## Take the first (and only) branch of the root.
        Award = root[0]
        
        #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
        ### Locate and store desired information.
        #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
        
        ## Initialize series.
        series = Series(dtype=object)
        series['Year'] = year
        series['ID'] = xml.replace('.xml','')
        
        ## Locate information.
        series['Title'] = Award.find('AwardTitle').text
        series['Funds'] = Award.find('AwardAmount').text
        
        organization = Award.find('Organization')
        series['Directorate'] = organization.find('Directorate').find('LongName').text
        series['Division'] = organization.find('Division').find('LongName').text
        
        institution = Award.find('Institution')
        series['Institution'] = institution.find('Name').text
        series['State'] = institution.find('StateCode').text
        
        ## Append to DataFrame as a new row.
        df.loc[len(df)] = series
        
        #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
        ### Write abstract to file (if possible).
        #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
                    
        ## Extract abstract.
        abstract = Award.find('AbstractNarration').text
        if abstract is None: abstract = ''
        
        ## Write to file.
        Abstracts.write('%s\t%s\n' %(series['ID'], abstract))

    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
    ### Save dataframe for particular year.
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
    
    ## Save CSV file.
    df.to_csv(os.path.join(year, 'grants.csv'), index=False)
                   
    ## Save abstracts file.
    Abstracts.close()
        
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Finish up.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
        
## Save parsing errors file.
errors.close()

## Count total parsing errors.
with open('parsing_errors.txt', 'r') as errors:
    msg = '%s parsing errors occurred. See error file for details.'
    print(msg %len(errors.readlines()))

print('Done.')
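
The abstracts file written above stores one `ID<TAB>abstract` record per line. If an abstract contains embedded newlines or tabs (an assumption, not something verified against the dataset here), a record would spill across lines or columns. One possible safeguard, sketched below with the hypothetical helpers `flatten_whitespace` and `format_record`, is to collapse all whitespace before writing.

```python
def flatten_whitespace(text):
    """Collapse every run of whitespace (newlines and tabs included) to one space."""
    return ' '.join((text or '').split())

def format_record(award_id, abstract):
    """Build one tab-separated line in the same layout as abstracts.txt."""
    return '%s\t%s\n' % (award_id, flatten_whitespace(abstract))

# Invented abstract text containing a newline and a tab.
line = format_record('7000047', 'line one\nline\ttwo')
print(repr(line))
```

A missing abstract (`None`) is treated as an empty string, matching the behavior of the loop above.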