Documentation.md
Capstone Project Documentation

Data Collection and Preparation

Awards Data

Downloading the data

Award-level data for the SBIR and STTR programs is available through the SBIR.gov awards database. While the database lets users search keywords, topic codes, and companies, and filter by key grant criteria such as award year, program phase, federal agency, and US state, the geographic information is limited to the state level, so users cannot browse the data by locality or region. Additionally, while users can browse proposal solicitations, and each solicitation's information page links out to the full call for proposals, neither the linked topic area codes nor the linked grants are clearly tied to the solicitations page. I investigated the possibility of better linking this important qualitative data by examining the data structure of the awards, solicitations and topics records.

First, after some trouble getting the correct parameters set to make API calls to the awards database, I downloaded the awards data from the database manually. I then downloaded the related solicitations and topics datasets. The data dictionary, including all relevant fields, can be found on the SBIR website here or in this repository here.

Data review and preparation

Award Data Structure

   {
        "Company": "Aclarity LLC",
        "Award_Title": "SBIR Phase I: Performance and Feasibility Evaluation of Electrochemical Advanced Oxidation Technology for Water Purification",
        "Agency": "National Science Foundation",
        "Branch": "",
        "Phase": "Phase I",
        "Program": "SBIR",
        "Agency_Tracking_Number": "1819438",
        "Contract": "1819438",
        "Proposal_Award_Date": "June 15, 2018",
        "Contract_End_Date": "May 31, 2019",
        "Solicitation_Number": "",
        "Solicitation_Year": "2017",
        "Topic_Code": "CT",
        "Award_Year": "2018",
        "Award_Amount": "225000.00",
        "DUNS": "080961763",
        "Hubzone_Owned": "N",
        "Socially_and_Economically_Disadvantaged": "N",
        "Woman_Owned": "Y",
        "Number_Employees": "2",
        "Company_Website": "",
        "Address1": "10 Chestnut Hill Rd.",
        "Address2": "",
        "City": "North Oxford",
        "State": "MA",
        "Zip": "01537",
        "Contact_Name": "First B Last ",
        "Contact_Title": "",
        "Contact_Phone": "(774) xxx-xxxx",
        "Contact_Email": "info@company.com",
        "PI_Name": "First B Last ",
        "PI_Title": "",
        "PI_Phone": "(774) xxx-xxxx",
        "PI_Email": "info@company.com",
        "RI_Name": "",
        "RI_POC_Name": "",
        "RI_POC_Phone": "",
        "Research_Keywords": "",
        "Abstract": "The broader impact\/commercialization potential of this SBIR project is an electrochemical water treatment technology which has the potential for minimizing cost, requiring little to no maintenance, and comprehensively treating harmful contaminants such as pathogens, toxic organics, and metals in drinking water. About 80M U.S. homeowners are seeking a water purification solution for fear of their water quality. They are unsatisfied with the high maintenance, long-term costs, and lack of comprehensive treatment capabilities of existing systems. This SBIR Phase I project proposes to optimize, demonstrate, and scale an electrochemical water purification system for residential point-of-entry application. Existing water purification systems are largely ineffective in comprehensively treating contaminants such as pathogens, toxic organics, and metals, and also require frequent maintenance which contributes to high costs and waste generation. To address these concerns, the treatment effectiveness, cost, and feasibility of treating water contaminated with pathogens and toxic organic compounds by the proposed electrochemical technology will be studied in laboratory and pilot scale applications. Design parameters will be optimized for highest treatment capability and lowest costs and maintenance needs. Prototypes will be scaled for pilot evaluation at flow rates for residential point-of-entry application and evaluated for robustness. The laboratory and pilot units will be evaluated for perfluorinated compound (PFC) removal consistent with NSF\/ANSI P473 and for pathogen disinfection following EPA Purifier Standard Certification. PFC removal and disinfection are keys to proving a comprehensive water purification solution for decentralized treatment. The end result will be a small product automated by sensors that connects directly in-line with a building's plumbing. 
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria."
    }

Solicitation Data Structure

{
    "solicitation_title": "DHS SBIR Solicitation for FY21",
    "solicitation_number": "21.1",
    "program": "SBIR",
    "phase": "Phase I",
    "agency": "Department of Homeland Security",
    "branch": "Science and Technology Directorate",
    "solicitation_year": "2021",
    "release_date": "2020-11-12",
    "open_date": "2020-12-15",
    "close_date": "2021-01-15",
    "application_due_date": ["2021-01-15"],
    "occurrence_number": null,
    "sbir_solicitation_link": "https://www.sbir.gov/node/1824709",
    "solicitation_agency_url": "https://beta.sam.gov/opp/d409cb1d4152463193fdadd11624ad72/view",
    "current_status": "closed",
    "solicitation_topics": [
        {
            "topic_title": "5G & Wi-Fi 6/6E Coexistence for Secure Federal Networks",
            "branch": "Science and Technology Directorate",
            "topic_number": "DHS211-002",
            "topic_description": "Investigate the interoperability and security of 5G and Wi-Fi 6/6E as coexisting technologies to support secure federal networks",
            "sbir_topic_link": "https://www.sbir.gov/node/1836423",
            "subtopics": []
        },
        ...
    ]
}

Topic Data Structure

{
    "TopicTitle": "Sensors to Measure Space Suit Interactions with the Human Body",
    "Description": "Lead Center: JSC\nParticipating Center(s): JSC\nTechnology Area: TA6 Human Health, Life Support and Habitation Systems\nSpace suits can be tested unmanned for range of motion and joint torque in an attempt to quantify and compare space suit joint designs and overall suit architecture. However, this data is irrelevant if humans using the suits aren't effective.  Characterizing human suited performance has continued to be a challenge, partly due to limitations in sensor technology. One concept is to use sensors placed at/on the human body, underneath the pressure garment to obtain knowledge of the human bodies movements.  This data could then be compared against the suit motion.  Various sensors, sensor technologies, and sensor implementations have been attempted over two decades of efforts, but each has had issues.  Previous efforts have used Force Sensitive Resistors (FSR), TouchSense shear sensors, pressure-sensing arrays (Tek scan etc.), piezo-electric sensors, among others but have not met all requirements. Most issues have centered around accuracy when placed on the pliant surface of the skin, and accuracy when placed over curved surfaces of the skin. Accuracy has been sufficient to delineate low, medium or high levels of force but not a reliable quantitative value. This, combined with aberrant readings when the sensor is bent has led to these sensors only providing a rough idea of the interaction between the suit and the skin: while in a controlled environment the sensors are accurate to within 10% or so, the accuracy falls significantly when measuring the skin and being bent or pressed in inconsistent ways; on the order of 50% accuracy or worse. The sensors also are prone to drift (falling out of calibration) quickly during use. Lastly, the sensors, while pliant, are still relatively thick and as such translates to discomfort and loss of tactility. 
This is typical during all previous testing but most notable when sensors are bent along an axis (or worse, along two axis such as required to follow a complex anatomical contour). As such, the effect on the suit/skin interface that is being measured is changed, which adds an additional complication to interpreting data output from these sensors. Much of the work within JSC has improved the integration, comfort, and calibration of these sensors, but the accuracy performance characteristics when in use have not been sufficient to meet requirements. A new sensor technology is warranted for use in our application.\nCurrent critical needs that this technology would enable include the ability to optimize suit design for ergonomics, comfort and fit without the sole reliance on subjective feedback. While subjective feedback is important, developing a method to quantify the amount of force or pressure on a particular anatomical or suit landmark will aid in providing a richer definition of the suit/human interface that can be leveraged to make space suits more comfortable while reducing risk of injury. Taken together, these improvements will enhance EVA performance, reduce overhead and reduce personnel and programmatic risk. This technology implementation would require relatively accurate pressure or force readings in the medium to high range.\nIn the future, alternative space suit architectures such as mechanical counter pressure may be feasible, and a critical ancillary to such an architecture is to verify that necessary physiological pressure requirements are being met to ensure the health and safety of the crew. 
To this end, the technology should be able to accurately measure mechanical pressure on the human skin in the low pressure (< 10 psi) range.\nPerformance targets vary upon application, but the sensing technology should have the following characteristics:\n\nMeasures force and/or mechanical pressure.\nAccurate to within 10%.\nResistant to aberrant readings when under moderate bending, shear or torsion.\nEither sufficiently pliant, or high enough spatial resolution, to follow anatomical curves on the human skin without discomfort or lack of mobility.\nThin profile (~mm).\nPackaged at high spatial resolution (~cm) or sufficiently small to facilitate a custom packaging/substrate solution with a high spatial resolution.\nFree of rigid or sharp points that would cause discomfort.\nLow power (~5V, ~mA).\nCapable of integration to the inside of the pressurized suit surface as well as the human skin (or integrated to conformal garment).\nAt this early stage, a simple digital readout capability to evaluate sensor performance.\n\nFor this SBIR opportunity specifically, we are looking for a single sensor technology that targets the above requirements including readout capability. They should either be packaged into a component level prototype (shoulder or arm segment with multiple sensors) or a flexible packaging option (multiple sensors that could be integrated ad-hoc into a component level prototype through placement of said sensors on the skin or comfort garment).\nThe most attention should be paid to maximizing spatial resolution, accuracy and thinness for this prototype. Lastly, as previous work has demonstrated a relatively high failure rate of these sensor types over time, the individual sensor elements should be replaceable and/or spares should be provided.",
    "Agency": "National Aeronautics and Space Administration",
    "Branch": "",
    "Phase": "Phase I",
    "Program": "SBIR",
    "ReleaseDate": "November 17, 2016",
    "OpenDate": "November 17, 2016",
    "CloseDate": "January 20, 2017",
    "SolicitationAgencyURL": "http://sbir.nasa.gov/solicit-detail/58007",
    "TopicNumber": "H4.03",
    "SolicitationYear": "2017",
    "SBIRTopicLink": "https://www.sbir.gov/node/1227041"
 }

To protect the privacy of the persons mentioned in the award records, and because this information was not relevant to my task, I deleted the data fields with personal identifiers from the awards data. I then joined all downloaded JSON files into master JSON and CSV files to work with later on, using my Python script merge_SBIRjson.py.
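The actual merge_SBIRjson.py script is not reproduced here, but the merge-and-strip step can be sketched as follows. The PII field list is taken from the award data structure above; the function names, and the assumption that each downloaded JSON file holds a list of award records, are illustrative:

```python
import csv
import glob
import json

# Fields with personal identifiers to drop from each award record
# (field names taken from the award data structure shown above).
PII_FIELDS = {
    "Contact_Name", "Contact_Title", "Contact_Phone", "Contact_Email",
    "PI_Name", "PI_Title", "PI_Phone", "PI_Email",
}

def merge_award_files(paths):
    """Load every JSON file, strip PII fields, return one list of awards."""
    merged = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            awards = json.load(f)  # assumption: each file holds a list of award dicts
        for award in awards:
            merged.append({k: v for k, v in award.items() if k not in PII_FIELDS})
    return merged

def write_outputs(awards, json_path, csv_path):
    """Write the merged awards to a master JSON file and a master CSV file."""
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(awards, f, indent=2)
    fieldnames = sorted({k for a in awards for k in a})
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(awards)
```

A call like `merge_award_files(glob.glob("downloads/*.json"))` would then gather every per-agency download into one structure before writing both output formats.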

The resulting dataset here captures the 65,749 grants awarded between 2008 and 2018 by the participating federal departments and agencies.

Before attempting to join the solicitations and topics information onto the awards data, which would have provided an additional wealth of qualitative information, I examined the availability of the data. 81.4% of awards included a Solicitation Code and 94.1% of all awards included a Topic Code. In theory, these codes should allow for linking to the solicitations and topics information. However, while attempting to join the information using the script join_AwardsTopicSolic.py, I realized that the identifiers for solicitations and topics were not consistently unique. While the URLs in SBIRTopicLink were mostly unique where they existed, they resolved to landing pages that also included topics with non-unique Topic Codes. For topics, I attempted to build a composite identifier from the grant program, agency, topic code and year. For solicitations, I attempted to resolve formatting and punctuation differences that appeared between the awards, topics and solicitations datasets and caused mismatches. However, even after these efforts, only 52.1% of awards were linked to solicitations data and 32.2% of awards were linked to topics data. Due to the incompleteness of the linking effort, I decided not to use solicitations and topics as a source of qualitative information about the grants.
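A minimal sketch of the composite-identifier and normalization approach, under the assumption that punctuation, whitespace and case differences were the main sources of mismatch. The helper names are hypothetical, not taken from join_AwardsTopicSolic.py:

```python
import re

def normalize_code(code):
    """Collapse punctuation, whitespace and case differences between datasets,
    e.g. 'DHS 211-002' and 'dhs211.002' both become 'DHS211002'."""
    return re.sub(r"[\s\.\-_/]+", "", (code or "").upper())

def topic_key(program, agency, topic_code, year):
    """Composite topic identifier: program + agency + topic code + year,
    since topic codes alone are not unique across agencies and years."""
    return "|".join([normalize_code(program), normalize_code(agency),
                     normalize_code(topic_code), str(year)])
```

With keys built the same way on both sides, awards and topics can be matched with a plain dictionary lookup; residual failures then reflect genuinely missing or ambiguous identifiers rather than formatting noise.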

If the Topics and Solicitations datasets reliably included consistently unique resource identifiers (URIs), additional analysis of the similarity between grant abstracts, solicitations and topics would be possible. One could also imagine examining the consistency or novelty of solicitations and topics over time as a proxy for examining thematic grant making priorities across agencies, or examining the geographic spread of cleanly defined topics. Given that the coding schemes were not so neat, I decided to use various methods of keyword extraction to enrich the data.


Award making details by Agency and Program

Small Business Innovation Research (SBIR) Program

| Measure | All Agencies | Nat'l Science Found. | Dept. Health and Human Services | Dept. of Defense | Dept. of Education | Dept. of Commerce | Dept. of Agriculture | NASA | Dept. of Transportation | Dept. Homeland Sec. | Dept. of Energy | Environmental Protection Agency |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Awards | 57,182 | 4,121 | 12,369 | 27,425 | 335 | 516 | 1,166 | 5,126 | 307 | 697 | 4,682 | 437 |
| Total Amount ($M) | $24,983 | $1,163 | $7,891 | $11,651 | $102 | $96 | $237 | $1,478 | $93 | $282 | $1,933 | $56 |
| % with Topic Codes | 93.9% | 100.0% | 79.9% | 99.9% | 38.5% | 77.1% | 81.7% | 98.1% | 96.7% | 88.2% | 96.5% | 81.5% |
| % with Research Keywords | 43.5% | 0.0% | 0.4% | 69.5% | 60.0% | 0.4% | 38.7% | 88.6% | 0.3% | 60.5% | 0.4% | 28.1% |
| % to Women Owned Businesses | 12.9% | 14.5% | 12.4% | 14.6% | 18.5% | 9.1% | 9.9% | 10.7% | 22.5% | 10.9% | 6.4% | 7.6% |
| % to Businesses Owned by Socioeconomically Disadvantaged Groups | 59.8% | 9.8% | 4.1% | 6.1% | 0.9% | 5.2% | 2.7% | 8.3% | 19.9% | 7.3% | 4.3% | 5.5% |
| % to Businesses in Hubzones | 2.4% | 7.9% | 0.1% | 1.8% | 3.0% | 5.0% | 7.8% | 1.7% | 3.3% | 3.7% | 6.3% | 2.3% |

Small Business Technology Transfer (STTR) Program

| Measure | All Agencies | Nat'l Science Found. | Dept. Health and Human Services | Dept. of Defense | Dept. of Education | Dept. of Commerce | Dept. of Agriculture | NASA | Dept. of Transportation | Dept. Homeland Sec. | Dept. of Energy | Environmental Protection Agency |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Awards | 8,567 | 700 | 2,090 | 4,435 | 0 | 0 | 0 | 678 | 0 | 1 | 663 | 0 |
| Total Amount ($M) | $3,226 | $189 | $1,034 | $1,543 | — | — | — | $205 | — | $1 | $254 | — |
| % with Topic Codes | 95.7% | 96.0% | 84.3% | 99.9% | — | — | — | 98.7% | — | 100.0% | 99.8% | — |
| % with Research Keywords | 43.8% | 0.0% | 0.0% | 71.5% | — | — | — | 85.1% | — | 0.0% | 0.8% | — |
| % to Women Owned Businesses | 12.3% | 15.3% | 12.3% | 12.6% | — | — | — | 11.4% | — | 0.0% | 8.9% | — |
| % to Businesses Owned by Socioeconomically Disadvantaged Groups | 7.2% | 8.7% | 2.1% | 8.8% | — | — | — | 11.1% | — | 0.0% | 7.4% | — |
| % to Businesses in Hubzones | 2.5% | 8.9% | 0.1% | 2.1% | — | — | — | 1.8% | — | 0.0% | 6.9% | — |

Vocabularies, Taxonomies and Linked Open Data Ontologies


Sources of knowledge organization schemes reviewed or consulted

The sources below reflect a variety of statistical, economic, bibliographic and policy area classifications and specialized vocabularies. Terminology from each source can be extracted and indexed within a corpus to enhance qualitative understanding. Dictionary-based topic modeling using sources like these can potentially provide qualitative analysts with a more readily interpretable form of evidence. To enhance dictionary-based extraction in the future, though, a common set of pre-processing techniques (including lemmatization, stemming, etc.) might be applied to both the dictionary and the source text. Comparing dictionary-based methods of keyword extraction to mathematical methods is one aim of this project.

| Source | Description | Used? |
|---|---|---|
| STW Thesaurus of Economics | Areas of economics | Y |
| Food and Agriculture Organization of the United Nations - AGROVOC Thesaurus | Food systems and agricultural classifications from the UN | Y |
| GESIS Thesaurus of Social Sciences | Areas of social sciences | N |
| EuroSciVoc - European Science Vocabulary | Science related classifications from the European Commission | Y |
| European Institute for Gender Equality (EIGE) Glossary & Thesaurus | Gender equality thesaurus from the European Commission | Y |
| European Environment Agency General Multilingual Environmental Thesaurus (GEMET) | Environmental issue classifications from the European Commission | Y |
| REEGLE Clean Energy Linked Data | Clean energy and environmental area thesaurus from REEP/REEGLE | Y |
| EUROVOC Thesaurus of Activities related to the EU | Governmental, social, political, legal and economic classifications from the European Commission | Y (removed abbrev. < 5 chars) |
| American Economic Association JEL classifications | Areas of economics | N |
| European Commission Skills, Competencies, Qualifications and Occupations | Skill, labor sector and occupational classifications from the European Commission | N |
| EU Statistical classification of products by activity, 2.1 (CPA 2.1) | Statistical classifications of products, from the European Commission | N |
| UN International classifications on economic statistics: Central Product Classification (CPC); International Standard Industrial Classification of All Economic Activities, Revision 4 | Classifications related to economics, industrial areas and products, from the UN | N |
| UNBIS Thesaurus | Thesaurus of issues related to the work of the UN | N |
| Geological Survey of Austria Geological Thesaurus (Minerals, Mineral Resources, Lithology) | Geological classifications from the Geological Survey of Austria | N |
| U.S. Department of Agriculture Agricultural Thesaurus and Glossary | Food systems and agricultural classifications from the United States Dept. of Agriculture | N |
| US Geological Survey linkout to other glossaries/thesauri | Geological classifications and other vocabularies, linked by the United States Geological Survey | N |
| North American Industry Classification System | Industry area and economic classifications from the United States Census | N |
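As a sketch of the shared pre-processing idea mentioned above -- applying the same normalization to both the dictionary terms and the source text before matching -- something like the following could be used. The normalization rules here are simple illustrative stand-ins for full lemmatization or stemming:

```python
import re

def normalize(text):
    """Shared normalization applied to BOTH dictionary terms and source text:
    lowercase, strip punctuation, collapse whitespace. A fuller pipeline might
    add lemmatization or stemming at this step."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def find_terms(doc, vocabulary):
    """Return the vocabulary terms that occur in the document after both
    sides have gone through the same normalization, so that punctuation and
    case differences no longer block a match."""
    norm_doc = " " + normalize(doc) + " "
    hits = []
    for term in vocabulary:
        if " " + normalize(term) + " " in norm_doc:
            hits.append(term)
    return hits
```

The key design point is symmetry: whatever transformation the corpus receives, the dictionary receives as well, so the matcher compares like with like.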

Setting up text mining pipelines

Downloading vocabularies and doing file conversions

The vocabularies and ontologies above were downloaded in RDF/XML, TTL, XLSX or CSV formats, either in full or after running a SPARQL query for the alternate and preferred labels in each scheme. For a few resources, I used the SPARQL endpoint with the following queries to extract English labels.

PREFIX skos: <http://www.w3.org/2004/02/skos/core#> 
PREFIX skosxl: <http://www.w3.org/2008/05/skos-xl#> 
SELECT ?concept ?prefLabel (GROUP_CONCAT ( concat('"',?altLabel,'"@',lang(?altLabel)); separator=", " ) as ?altLabels) 
WHERE { 
  ?concept a skos:Concept . 
  ?concept skosxl:prefLabel/skosxl:literalForm ?prefLabel . 
  BIND("en" AS ?lang)
  FILTER(lang(?prefLabel) = ?lang) 
  OPTIONAL{
    ?concept skosxl:altLabel/skosxl:literalForm ?altLabel . 
    FILTER(lang(?altLabel) = ?lang) 
  }
}
GROUP BY ?concept ?prefLabel

#####################################################

PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
SELECT ?Concept ?prefLabel ?altLabel ?broader
WHERE
{
?Concept ?x skos:Concept .
{ ?Concept skos:prefLabel ?prefLabel . FILTER (regex(lang(?prefLabel), 'en$', 'i'))  }
{ ?Concept skos:altLabel ?altLabel . FILTER (regex(lang(?altLabel), 'en$', 'i'))  }
?Concept skos:broader ?broader .
 } 

For sources which did not have an immediately accessible SPARQL endpoint, I used Protégé to open vocabularies and ontologies in RDF/XML, NT and TTL formats, identify individual entities, and extract relationships into CSV format. Given my novice SPARQL skills, I found it easier to manipulate the CSVs using Excel and Python.
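Once labels are exported to CSV, building a term-to-replacement dictionary is mostly mechanical. This sketch assumes a CSV with `prefLabel` and `altLabels` columns shaped like the output of the first SPARQL query above (alternate labels concatenated as `"label"@en, ...`); the column names and parsing details are assumptions, not the project's actual script:

```python
import csv
import re

def build_replacement_dict(csv_path):
    """Map every preferred and alternate label to the preferred label with
    spaces swapped for underscores, mirroring the replacement scheme used by
    the NLPre-style dictionaries."""
    replacements = {}
    with open(csv_path, encoding="utf-8") as f:
        for row in csv.DictReader(f):
            pref = row["prefLabel"].strip()
            target = pref.replace(" ", "_")
            replacements[pref] = target
            # altLabels arrive from GROUP_CONCAT as: "label one"@en, "label two"@en
            for alt in re.findall(r'"([^"]+)"', row.get("altLabels", "")):
                replacements[alt] = target
    return replacements
```

The resulting mapping can then be fed to the replacer so that, for example, every alternate way of naming a concept collapses to a single underscore-joined token.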

Evaluating NLPre scripts

To decide on a replacement scheme, I reviewed the test files in the NLPre pipeline and the function for making replacements. Under this scheme, documents are scanned for keyterms across multiple dictionaries; the output is the source document with tagged ngrams, along with an index of the entities appearing in each vocabulary or ontology.

Example output from replace_from_dictionary()

doc = "lymphoma survivors in korea . Describe the correlates of unmet needs among non_Hodgkin_lymphoma ( non_Hodgkin_lymphoma ) survivors in Korea and identify non_Hodgkin_lymphoma patients with an abnormal white blood cell count ."

tagged = replace_from_dictionary(prefix="*MeSH*_")(doc)

tagged
> Out[5]: 'lymphoma survivors in korea .Describe the correlates of unmet needs among non_Hodgkin_lymphoma ( non_Hodgkin_lymphoma ) survivors in Korea and identify non_Hodgkin_lymphoma patients with an abnormal *MeSH*_Leukocyte_Count .'

Updating the term index

Based on the replace_from_dictionary function, I created a function that would then check for tagged ngrams and add them to a main index of documents and identified terms from each linked-data dictionary used. The resulting function is shown below:

import logging
import re


class update_term_index(object):

    """
    After replace_from_dictionary is run:
        1. update the user's index of terms found in each document
        2. remove the prefix tag from the previous dictionary so another round of tagging can occur

    Note: this class should be modified to enhance flexibility with respect to prefix/suffix tagging.
    For now, note that tags should appear flanked with asterisks and ending with an underscore,
        like: *MeSH*_
    TO DO: replace the regex with one which is simply r"^" + vocab_key + r"([a-zA-Z0-9]+[\_]*)"
    """

    def __init__(self, prefix=False, suffix=False,vocab_key="",dict_in=None,doc_key=""):
        """
        Initialize the indexer.

        Args:
            prefix: if the replacer used a prefix, set to true to look for prefix tagged terms
            suffix: if the replacer used a suffix, set to true to look for suffix tagged terms
            vocab_key: the tag prefix/suffix used for the currently tagged dictionary
            dict_in: a dictionary object to be used as a document/vocab index
            doc_key: a document id to be used in the document/vocab index
        """
        self.logger = logging.getLogger(__name__)


        self.prefix = prefix
        self.suffix = suffix
        self.vocab_key = vocab_key
        self.doc_key = doc_key
        self.dict_in = dict_in


    def __call__(self, doc):
        """
        Runs the indexer.

        Args:
            doc: the previously tagged document string
        Returns:
            doc_out: a de-tagged document string
            dict_in: the updated document/term index
        """

        
        # for the input document, extract all dictionary hits and add them to the index
        # TO DO: update to include a True/False check for prefix/suffix
        # TO DO: update to use a regex like: r"^" + vocab_key + r"([a-zA-Z0-9]+[\_]*)"
        #   to allow for more flexible tagging with a prefix/suffix
        hits = re.findall(r"\*[\_A-Za-z]+\*[\_a-zA-Z]+", doc)
        
        # add an index entry for the given vocabulary for the document
        vocabDict = {self.vocab_key: hits}
        self.dict_in[self.doc_key].update(vocabDict)
        
        # detag the document
        doc_out = doc
        for h in hits:
            doc_out = doc_out.replace(h,h.replace(self.vocab_key,"").replace("_"," "))
        
        return [doc_out,self.dict_in]

Sample output

The sample below shows a document list object and an example document/vocabulary index. The actual vocabularies were not used to create this output; instead, the default MeSH dictionary was called three times, each time with a new prefix. You can still see that the identified term "white blood cell count"--which was converted to "*MeSH*_Leukocyte_Count" by the replacer--now appears without demarcation, and can potentially be identified again by another vocabulary. This is useful in cases where we want to use several related ontologies or vocabularies to identify terms. In the future, though, to ensure that the replacement scheme of one dictionary doesn't obscure matches for later dictionaries, we should modify either the replacement function or the indexer function to capture swapped terms and revert the text before it is called again with another vocabulary file.

// the docList contains the de-tagged document
docList  
{
    "doc_001": "lymphoma survivors in korea .\nDescribe the correlates of unmet needs among non_Hodgkin_lymphoma ( non_Hodgkin_lymphoma ) survivors in Korea and identify non_Hodgkin_lymphoma patients with an abnormal Leukocyte Count ."
}

// the mainIndex includes a list of identified terms for a given document and each vocabulary used
mainIndex
{
"doc_001": 
    {
        "*MeSH*_": ["*MeSH*_Leukocyte_Count"],
        "*EIGE*_": ["*EIGE*_Leukocyte_Count"],
        "*AGROVOC*_": ["*AGROVOC*_Leukocyte_Count"]
    }
}

Selecting and preparing vocabularies for replacement scheme

The National Library of Medicine's Medical Subject Headings (MeSH) are included with the NLPre package. The dictionary file includes a term column and a replacement column for over 190,000 terms. For the replacement, a given term label may either:

  • simply be swapped for a version where whitespace is replaced with underscores, or
  • be swapped for a label corresponding to a broader taxonomic category to which the entity belongs, or for a preferred label in the case of entities with multiple names

In a linked-data context, co-reference resolution is managed by assigning a URI to a given entity, which can have a single preferred label and many alternate labels. Relations in the ontology usually map terms to broader or narrower concepts, with the multiple labeling schemes aiding machine understanding of the semantic relationships between the referred-to entities and latent concepts. In conversation with one of the package's authors, Travis Hoppe, I confirmed that the packaged MeSH dictionary simply uses all 2+ word terms and applies a degree of semantic "rounding-up" across different label sets. For instance, in the example above, "white blood cell count" is one of 16 labels that are all replaced by the more generic "Leukocyte_Count".

For the resources marked 'Y' in the table above, I ensured that every alternate label appeared as a term to be replaced by a processed preferred label (with punctuation swapped for underscores). Additionally, all preferred labels themselves are included with a processed replacement label. For a few resources, including the STW Thesaurus of Economics and the UN Food and Agriculture Organization AGROVOC thesaurus, it appeared that the alternate and preferred label relations also encoded some level of taxonomic relationship. Besides allowing these replacement schemes to persist, though, I made no additional effort at this time to do any semantic "rounding-up" within taxonomies. In order to make more informed semantic text mining decisions, I'd like to gain more facility navigating and extracting elements from RDF.

Running the pre-processing pipeline

Successfully running the edited pre-processing pipeline took me to the limits of my Python programming knowledge. Being aware of the utility and efficiency of the batched job queuing employed within the NIH word2vec pipeline, yet unable to quite grasp how to edit the package to implement my own additions given time constraints, I took a piecemeal approach.

First, I ran the first half of the processing pipeline using the command line data import and parse steps from the word2vec pipeline. These included steps to identify acronyms and abbreviations in the text, handle unicode decoding issues, handle hyphenated words, remove title caps, replace acronyms with their expanded forms, and expand parenthetical comments into standalone sentences.

Next, using this partially processed output, I ran the replace and update dictionary steps within my iPython console in Spyder. I did this for seven of my vocabularies, excluding the Medical Subject Headings (MeSH) dictionary built into NLPre, which is much larger (600K+ terms) than my other dictionaries [MAX TERMS--specify]. In the future, I need to re-write this script to do batch processing as the word2vec pipeline does; it took upwards of 36 hours to process the 65,000 records and create an index of all found vocabulary terms. At the time, I was pleased that it worked at all and was content to take time away from the computer. When I subsequently ran the MeSH dictionary using the built-in functions in the word2vec pipeline, it took only about 3 minutes to process the records. I then imported the tagged documents into my update_term_index.py script and updated my main document index with the results from the MeSH tagging.
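The batch processing I have in mind can be sketched generically. The `batched` helper below is an illustrative utility (not part of NLPre or the word2vec pipeline) that splits the record stream into chunks, each of which could be dispatched to a worker pool rather than processed in one 36-hour serial pass:

```python
from itertools import islice

def batched(records, batch_size):
    """Yield successive lists of up to batch_size records, so that each batch
    can be handed to a worker (e.g. via multiprocessing.Pool.map) instead of
    tagging all ~65K records in a single serial loop."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```

With this in place, the per-document tagging function could be mapped over each batch and the per-batch term indexes merged at the end.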

Last, I ran the final pipeline steps on the tagged output -- including token_replacement to take care of punctuation and special characters, and a part-of-speech tokenizer to remove words irrelevant to the subject matter of the award.

Geocoding

I used Geocod.io to geocode the Address, City, State, and ZIP code fields for all ~29K unique recipients. The service added 2018 census information, including Census Tracts, Blocks, and Metropolitan Area Divisions, as well as congressional district information for the 117th Congress (2021). In approximately 14 minutes, I could download the file at a cost of ~$40. Even though I would still have to join additional congressional and county level identifiers in later, I thought it was worth the cost to get started. 75 of the ~29,000 recipients (0.25%) required manual validation, which I did using Google Maps.

With the geocoded recipients output, I then used qGIS to join the additional congressional district and county level identifiers. Adding the county and district GEOIDs makes the geographic queries possible in the dashboard tool. Once I had a fully geocoded recipients file, I joined the additional geographic information back into the awards file using a Python script.

The SBIR awards data includes repeat recipients that sometimes have several DUNS numbers or multiple addresses. The qGIS output files included identifiers for the 113th and 116th Congresses, as well as county FIPS identifiers. I joined the output back into the awards data using the recipient DUNS number, name, and address; only 160 of 65,749 awards (< 0.25%) required manual validation. Using the various iterations of geocoded recipients data, I was able to correct these 160 awards by triangulating across the outputs, and only had to manually validate 1 additional recipient to complete this step.
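The join-back step can be sketched as follows. The key fields come from the award data structure above, while the GEOID field names and the helper itself are illustrative rather than my actual script:

```python
def join_geo(awards, geocoded):
    """Attach county/district GEOIDs from the geocoded recipients file to each
    award, keyed on (DUNS, company name, address) since recipients repeat with
    several DUNS numbers and multiple addresses. Awards with no match are
    returned for manual validation. County_GEOID/CD_GEOID names are assumed."""
    def key(rec):
        return (rec.get("DUNS", ""),
                rec.get("Company", "").strip().lower(),
                rec.get("Address1", "").strip().lower())

    geo_index = {key(g): g for g in geocoded}
    unmatched = []
    for award in awards:
        geo = geo_index.get(key(award))
        if geo is None:
            unmatched.append(award)  # flag for manual validation
            continue
        award["County_GEOID"] = geo.get("County_GEOID")
        award["CD_GEOID"] = geo.get("CD_GEOID")
    return unmatched
```

Normalizing case and whitespace inside the key is what keeps minor formatting differences between the two files from inflating the manual-validation pile.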

Spatial clustering and Geo-Spatial Exploratory Data Analysis

One of my goals for exploratory data analysis was to provide relevant contextual geographic data as a way to navigate the map and think about knowledge production patterns; this became an interest of mine while working in the philanthropy sector. In discussions with others in the sector, I asked when and where external data sources are used to analyze geographic funding choices. In some instances, I heard that geographic data was analyzed on an ad-hoc basis to provide insight into the demographic makeup of places and regions, as well as the workforce makeup of regions (e.g. where artists are located). These discussions also brought me in contact with the National Endowment for the Arts' Arts Data Profile Series and the Indiana Business Research Center's StatsAmerica Regional Innovation Indices and profilers. These two curated sets of data and indicators -- related to arts economies and artistic production, and to regional innovation and R&D activity, respectively -- inspired my interest in looking for external contextual data to provide a backdrop for exploration.

Ultimately, the choice to include workforce composition data specifically was influenced by work on R&D knowledge spillovers. This literature examines how the proximity of innovative firms to one another affects the spatial distribution of innovative activity and its associated economic effects (Anselin et al., 2000; Bonaccorsi & Daraio, 2005; Jaffe et al., 1993; Wallsten, 2001). While many researchers in this literature explain the relevance of spatial proximity by pointing to the tacit, often "non-codified" social dimension of knowledge exchange, others suggest that a more complex interplay of labor markets, institutional collaborations and legal arrangements, and other political-economic factors drives the spatial clustering of innovative activity as much as these more rudimentary social and communicative factors do (Breschi & Lissoni, 2001). Since a number of researchers have remarked on the localization of specific industries, and have also disaggregated findings of innovation spillovers or localization across specific types of research domains (Anselin et al., 2000; Audretsch & Feldman, 1996; Boschma et al., 2014), I thought it would be interesting to identify spatial clusters by workforce and offer users this backdrop to guide browsing and investigation into the research activities being funded in different parts of the country. While the Innovation Index from StatsAmerica, for instance, includes a battery of variables and indicators in its platform, I ultimately chose to select one for this application--the workforce composition of each county and congressional district by industrial sector.

I obtained the workforce data from the American Community Survey's 1-year and 5-year estimates and explored it in QGIS and R. Since there were clearly non-random spatial distributions of industry density (e.g., high proportions of manufacturing labor in the Midwest; high proportions of labor in finance and real estate and in the science, management, and technology sectors in particular large metropolitan areas), I proceeded to identify the clusters using geo-spatial statistical techniques. I performed clustering with the Univariate Local Moran's I test, which produces a local indicator of spatial association (LISA) statistic for each unit of space, a clustering strength indicator (Moran's I), and a significance value for the clustering (p-value). I explored the data at both the county and congressional district levels.
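The Local Moran's I statistic at the heart of this step can be sketched in a few lines. The analysis itself was done in R and QGIS on ACS polygons; this is a simplified stand-alone illustration assuming a row-standardized contiguity weights scheme and a toy study area of four units in a line, and it omits the permutation test that produces the p-values.

```python
# Minimal sketch of the Univariate Local Moran's I (LISA) computation,
# assuming row-standardized contiguity weights. The four-unit example
# below is purely illustrative.

def local_morans_i(values, neighbors):
    """Return the LISA statistic I_i for each spatial unit.

    values    -- list of the attribute (e.g. % employed in a sector)
    neighbors -- dict: unit index -> list of contiguous unit indices
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]      # deviations from the mean
    m2 = sum(d * d for d in z) / n      # second moment of the deviations
    lisa = []
    for i in range(n):
        nbrs = neighbors[i]
        # row-standardized spatial lag: average deviation of the neighbors
        lag = sum(z[j] for j in nbrs) / len(nbrs)
        lisa.append(z[i] * lag / m2)
    return lisa

# Two high-value units adjacent to each other, then two low-value units:
values = [10, 10, 1, 1]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(local_morans_i(values, neighbors))  # → [1.0, 0.0, 0.0, 1.0]
```

A positive I_i where the unit's own value is high marks a "hot-spot" (high surrounded by high); a positive I_i where the value is low marks a "cold-spot." Units 1 and 2 sit on the boundary between the two groups, so their lags cancel out.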

I ultimately chose the congressional district as the unit of analysis because it lends itself to questions about how R&D policy agendas may or may not be taken up by representatives. Exploring the labor composition of a district alongside its SBIR/STTR funding might give policy analysts a sense of the interests representatives have in certain future-oriented policy areas. The congressional district level is also interesting because districts have roughly constant populations; it is interesting to compare differences in labor forces across varying population densities and districts of varying size. This choice was also influenced by a conversation with a peer at a funding institution, who highlighted the potential utility, in advocacy and policy contexts, of being able to quickly get summary statistics on funding activity at the district level. The Public Innovations Explorer was ultimately designed to deliver on this premise.

I also produced percent-change measures for both spatial units by comparing the 5-year ACS estimates for 2013 and 2018. The rationale was to identify whether there was any statistically significant spatial association in changes in workforce composition, a relevant question for anyone studying regional industrial change, economic policy, or innovation policy. In this context, the Moran's I value provides evidence of whether the percentage of the population employed in a given sector tended to change more or less depending on whether the percentage employed in that sector in a neighboring spatial unit also underwent some level of change. On a practical level, if this clustering were indeed spatially significant, we could use it to identify hot-spots (strong increases) and cold-spots (strong decreases) of workforce composition change and ask whether certain underlying socio-economic factors or local policy changes in the corresponding places had some effect on the regional labor force changes. However, I ultimately found no significant spatial effects in the percent changes, and left that analysis and line of questioning behind.
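The change measure itself is simple. One plausible reading, sketched below, is the percentage-point change in a sector's workforce share between the two estimate vintages; the function name and the sample figures are illustrative, not real ACS values.

```python
# Percentage-point change in a sector's share of the employed workforce
# between the 2013 and 2018 5-year ACS estimates. Hypothetical figures.

def share_change(est_2013, est_2018):
    """Change in the share (%) of the workforce employed in a sector.
    Each argument is a tuple: (employed_in_sector, total_employed)."""
    share_13 = 100 * est_2013[0] / est_2013[1]
    share_18 = 100 * est_2018[0] / est_2018[1]
    return share_18 - share_13

# e.g. manufacturing employment in a hypothetical county: the sector
# shrinks while total employment grows, so the share change is negative.
print(share_change((12_000, 60_000), (11_000, 62_000)))
```

This per-unit change variable is what the Local Moran's I test was then run on; as noted above, that test turned up no significant spatial clustering.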

I used the following R scripts to produce LISA statistics at the congressional district and county levels. For each spatial unit I created a variable named according to a generic scheme, LSAcl + _ + [industry], so that an area's cluster-group value can be retrieved for any given industry as the user changes industries on the choropleth. As shown in the data here, I found significant clustering across all industries, but the clustering effects were stronger (higher Moran's I values) at the congressional district level than at the county level. This made sense intuitively: on the one hand, congressional districts can integrate large swaths of sparsely populated counties; on the other hand, districts can be relatively small, densely populated, and adjacent to many other small, densely populated districts, where spillover effects are likely more significant than in sparsely populated areas.
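The naming scheme makes the cluster-group lookup a simple string concatenation. A minimal sketch, in which the industry codes and cluster-group labels are assumptions rather than the actual values in the data:

```python
# Sketch of the generic LSAcl_[industry] field-naming scheme: the cluster
# group for whichever industry is selected on the choropleth is looked up
# by constructing the field name. Industry codes and labels are illustrative.

def cluster_field(industry):
    return f"LSAcl_{industry}"

district = {"GEOID": "2502",
            "LSAcl_manufacturing": "High-High",
            "LSAcl_finance": "Not significant"}

print(district[cluster_field("manufacturing")])  # → High-High
```

Keeping one field per industry in each feature's attributes means the map can restyle on an industry change without refetching data.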

When creating the map, I used the LISA cluster group to draw red borders around districts belonging to so-called "hot-spots" and blue borders around districts in "cold-spots." I ultimately chose not to draw the user's attention specifically to these borders, though, for a few reasons: 1) the spatial clustering is of this background contextual data rather than of the funding data, which might confuse users; 2) I didn't want to draw so much attention to the method that it distracted users from the overall utility of the tool; and 3) the clustering data does not lead to a quick, easily digestible conclusion in itself, but rather offers one additional way to browse the map. In line with the third reason, this is also why I chose not to explore a battery of variables for presentation but rather to integrate one.

In future work, I'd like to continue exploring the use of geo-spatial statistics to produce statistically sound features that can guide geo-spatial data exploration. Especially with smaller units of space, these methods can help identify non-intuitive, non-categorical groupings of space. Moreover, I liked how the parallel coordinates plot allows a user to explore many inter-related dimensions at once on a map. In a tool geared specifically toward analysts, having statistically tested clusters accompany choropleths can speed up exploration and hypothesis formation. I hope the design of the map and parallel coordinates interaction in the Public Innovations Explorer might be seen as a reusable and reproducible one, useful across many other domains of geospatial data analysis.