Information about research and other types of projects being conducted within the USGS is scattered and scant. We have a few records about projects on USGS Science Center web sites, which probably represent the most well-curated prose describing projects. However, these pages are not published in a way that lets us easily identify them. We would need to write a web scraper to collect the information, and the pages are variable enough that we would likely end up with blobs of text and few usable structured elements.

There is an internal system used for managing the budgets for projects that contains a fair bit of structure and potentially usable narrative content. This system has long been probed for uses in science planning and activities outside budget and accounting, but it has always proven elusive because of its variable implementation across organizational units and the lack of validation constraints in the data model outside the specific elements used in accounting. It is, however, our best bet if we want to get a handle on what we are doing in USGS as "project" type entities.

We can gain access to some of the information in this internal system via a set of internally accessible web services set up for the purpose of driving science support applications in one of our Mission Areas. These services provide some of the raw data structure that we can access via XML and pull together for our graph with a few caveats, assumptions, and uncertainties. Some of the structured aspects, like identification of personnel involved in the work on a given activity, have to be carefully worked through and validated because we don't have the same identifiers used elsewhere as reference points. The narrative content is incredibly dense and messy. The lack of validation constraints has allowed people to do things like bleed one narrative type over into another, with only variable indicators in the text as to what the content contains. These issues mean that our best bet for characterizing projects and aligning them with defined concepts we are interested in understanding is to use some form of natural language processing and information extraction, which we will pursue in another step.

This notebook handles the process of accessing the internal web services and bringing data together for incorporation into our graph and subsequent analysis. It is really the only exception to our rule in the iSAID work of working with public sources of information, and we are only using it in this case because we have no other choice for reasonably complete and at least somewhat structured information about projects. Access to the internal services is limited and specialized. Basic aspects of that system are hinted at within this code provided for a basic level of transparency and completeness on our process, but all secrets are maintained outside the source code environment.

In [1]:
import requests
import xmltodict
import isaid_helpers
import re
import pickle
import click
from pylinkedcmd import utilities
from collections import Counter
import os
import pandas as pd
import datetime
import copy
from joblib import Parallel, delayed
from tqdm import tqdm
import glob
from nested_lookup import nested_lookup
import dateutil.parser
from urlextract import URLExtract

url_extractor = URLExtract()

# Data Pulls and Prep from SIPP
In the following codeblocks, we provide a basic process for accessing and minimally processing information on projects and tasks from an internal system that we are using to bring this information together for our graph. We direct the process from the standpoint of something we already have in our graph: records of organizational units and their internal business code identifier, which can be used to fetch project/task records for each unit.

In [None]:
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    results = session.run("""
    MATCH (o:Organization)
    WHERE NOT o.sipp_CenterCode IS NULL
    RETURN o
    """)
    science_centers = results.data()

science_center_sipp_codes = list(set([o["o"]["sipp_SubBureauCode"] for o in science_centers]))
cost_centers = list(set([o["o"]["fbms_code"][0:6] for o in science_centers]))

Each source that we want to access is returned as XML that is just a little bit different. The following function takes care of the process of grabbing the output from these services, converting to a dictionary form, and smoothing over a couple of pesky issues to make the data easier to digest. It dumps data into a local cache for further processing. We are not pulling everything from the internal services; just those bits of information that will let us derive a basic description of project-type activities, the narrative content that describes those activities (including information about products that have been delivered), and information about the staff assigned to work on the activities.

In [None]:
def sipp_dumper(source, operation="dump", cost_center=None, project_number=None, return_data=False):
    if source == "ProjectTaskMaster" and cost_center is not None:
        cache_path = f"{isaid_helpers.local_cache_path_sipp}ProjectTaskMaster_{cost_center}.p"
        if operation == "dump":
            data_container = {
                cost_center: None
            }
            r = requests.get(f"{isaid_helpers.api_sipp_project_listing}?CostCenter={cost_center}")
            if r.status_code == 200:
                data = xmltodict.parse(r.text, dict_constructor=dict)
                
                if "Projects" in data and data["Projects"] is not None:
                    if isinstance(data["Projects"]["Project"], dict):
                        data_container[cost_center] = [data["Projects"]["Project"]]
                    elif isinstance(data["Projects"]["Project"], list):
                        data_container[cost_center] = data["Projects"]["Project"]
                        
                if data_container[cost_center] is not None:
                    for project in data_container[cost_center]:
                        if "Tasks" in project and project["Tasks"] is not None:
                            if isinstance(project["Tasks"]["Task"], dict):
                                project.update({"tasks": [project["Tasks"]["Task"]]})
                            elif isinstance(project["Tasks"]["Task"], list):
                                project.update({"tasks": project["Tasks"]["Task"]})
                            del project["Tasks"]
                
                pickle.dump(data_container, open(cache_path, "wb"))

            if return_data:
                return data_container
            else:
                return
        else:
            return pickle.load(open(cache_path, "rb"))

    elif source == "AccountDetail" and cost_center is not None:
        cache_path = f"{isaid_helpers.local_cache_path_sipp}AccountDetail_{cost_center}.p"
        if operation == "dump":
            data_container = {
                cost_center: None
            }
            r = requests.get(f"{isaid_helpers.api_sipp_account_detail}?CostCenter={cost_center}")
            if r.status_code == 200:
                data = xmltodict.parse(r.text, dict_constructor=dict)
                
                if "AccountDetails" in data and data["AccountDetails"] is not None:
                    if isinstance(data["AccountDetails"]["Account"], dict):
                        data_container[cost_center] = [data["AccountDetails"]["Account"]]
                    elif isinstance(data["AccountDetails"]["Account"], list):
                        data_container[cost_center] = data["AccountDetails"]["Account"]
                        
                pickle.dump(data_container, open(cache_path, "wb"))

            if return_data:
                return data_container
            else:
                return
        else:
            return pickle.load(open(cache_path, "rb"))

    elif source == "StaffRequestDetail" and cost_center is not None:
        cache_path = f"{isaid_helpers.local_cache_path_sipp}StaffRequestDetail_{cost_center}.p"
        if operation == "dump":
            data_container = {
                cost_center: None
            }
            r = requests.get(f"{isaid_helpers.api_sipp_project_staffing}?FPPSOrganization={cost_center}")
            if r.status_code == 200:
                data = xmltodict.parse(r.text, dict_constructor=dict)
                
                if "StaffRequestDetails" in data and data["StaffRequestDetails"] is not None:
                    if isinstance(data["StaffRequestDetails"]["StaffRequestDetail"], dict):
                        data_container[cost_center] = [data["StaffRequestDetails"]["StaffRequestDetail"]]
                    elif isinstance(data["StaffRequestDetails"]["StaffRequestDetail"], list):
                        data_container[cost_center] = data["StaffRequestDetails"]["StaffRequestDetail"]
                        
                pickle.dump(data_container, open(cache_path, "wb"))

            if return_data:
                return data_container
            else:
                return
        else:
            return pickle.load(open(cache_path, "rb"))

    elif source == "ProjectXML" and project_number is not None:
        cache_path = f"{isaid_helpers.local_cache_path_sipp}ProjectXML_{project_number}.p"
        if operation == "dump":
            data_container = {
                project_number: None
            }
            r = requests.get(f"{isaid_helpers.api_sipp_project_narratives}?ProjectNumber={project_number}")
            if r.status_code == 200:
                data = xmltodict.parse(r.text, dict_constructor=dict)
                data_container[project_number] = data["Project"]

                if "Tasks" in data["Project"] and data["Project"]["Tasks"] is not None:
                    if isinstance(data["Project"]["Tasks"]["Task"], dict):
                        data_container[project_number]["tasks"] = [data["Project"]["Tasks"]["Task"]]
                    elif isinstance(data["Project"]["Tasks"]["Task"], list):
                        data_container[project_number]["tasks"] = data["Project"]["Tasks"]["Task"]
                    del data_container[project_number]["Tasks"]
                        
                pickle.dump(data_container, open(cache_path, "wb"))

            if return_data:
                return data_container
            else:
                return
        else:
            return pickle.load(open(cache_path, "rb"))

def load_sipp_data(source):
    data_package = dict()
    for file_name in glob.glob(f'{isaid_helpers.local_cache_path_sipp}{source}_*.p'):
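        # Use a context manager so each cache file handle is closed after loading
        with open(file_name, 'rb') as cache_file:
            data_package.update(pickle.load(cache_file))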
    
    return data_package
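
The smoothing that sipp_dumper applies can be shown in isolation. When XML is converted to a dictionary with xmltodict, a repeated element comes back as a list, a single occurrence comes back as a plain dict, and an empty element comes back as None, which is why each branch above checks the type before storing records. The following is a minimal sketch of that normalization using hand-written dictionaries shaped like the parsed output. The `as_list` and `projects_from` helpers and the sample records are hypothetical and not part of the notebook, and unlike sipp_dumper (which stores None for an empty result) this sketch returns an empty list.

```python
def as_list(value):
    """Wrap a value in a list unless it already is one (None becomes an empty list)."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


def projects_from(parsed):
    """Pull the Project records out of a parsed response as a uniform list."""
    container = parsed.get("Projects") or {}
    return as_list(container.get("Project"))


# Hypothetical parsed responses: one project, several projects, and none
single = {"Projects": {"Project": {"ProjectNumber": "A1"}}}
several = {"Projects": {"Project": [{"ProjectNumber": "B1"}, {"ProjectNumber": "B2"}]}}
empty = {"Projects": None}

single_projects = projects_from(single)
several_projects = projects_from(several)
empty_projects = projects_from(empty)
```

With a helper like this, downstream loops can always iterate over a list without first checking whether a cost center returned one record or many.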



Some of the services take a little while to respond and produce a significant amount of content, so we break things up and run queries in parallel to assemble our local cache, stored as pickle files that we later pull together for processing into derived "project" records.

In [None]:
if click.confirm('Are you sure you want to run the process to build the project summary data from SIPP?', default=False):
    Parallel(n_jobs=10, prefer="threads")(
        delayed(sipp_dumper)
        (
            source="ProjectTaskMaster", cost_center=i, 
        ) for i in tqdm(cost_centers)
    )
else:
    print("The following files are available for the ProjectTaskMaster data pull:")
    print()
    display(glob.glob(f'{isaid_helpers.local_cache_path_sipp}ProjectTaskMaster_*.p'))


In [None]:
if click.confirm('Are you sure you want to run the process to build the accounting detail data from SIPP?', default=False):
    Parallel(n_jobs=10, prefer="threads")(
        delayed(sipp_dumper)
        (
            source="AccountDetail", cost_center=i, 
        ) for i in tqdm(cost_centers)
    )
else:
    print("The following files are available for the AccountDetail data pull:")
    print()
    display(glob.glob(f'{isaid_helpers.local_cache_path_sipp}AccountDetail_*.p'))

In [None]:
if click.confirm('Are you sure you want to run the process to build the staff request data from SIPP?', default=False):
    Parallel(n_jobs=10, prefer="threads")(
        delayed(sipp_dumper)
        (
            source="StaffRequestDetail", cost_center=i, 
        ) for i in tqdm(cost_centers)
    )
else:
    print("The following files are available for the StaffRequestDetail data pull:")
    print()
    display(glob.glob(f'{isaid_helpers.local_cache_path_sipp}StaffRequestDetail_*.p'))
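
The fetch-in-parallel, cache-to-pickle, merge-later pattern used in the cells above can be sketched with the standard library alone. This example swaps joblib for `concurrent.futures.ThreadPoolExecutor` and replaces the web service call with a hypothetical `fetch_records` stub, so the function names, the sample cost center codes, and the temporary cache directory are all assumptions for illustration only.

```python
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def fetch_records(cost_center):
    # Hypothetical stand-in for a slow web service request
    return {cost_center: [{"ProjectNumber": f"{cost_center}-001"}]}


def dump_to_cache(cache_dir, source, cost_center):
    """Fetch one cost center and write the result to its own pickle file."""
    cache_path = Path(cache_dir) / f"{source}_{cost_center}.p"
    with open(cache_path, "wb") as cache_file:
        pickle.dump(fetch_records(cost_center), cache_file)
    return cache_path


def load_cache(cache_dir, source):
    """Merge every cached pickle for a source into one dictionary."""
    merged = {}
    for cache_path in sorted(Path(cache_dir).glob(f"{source}_*.p")):
        with open(cache_path, "rb") as cache_file:
            merged.update(pickle.load(cache_file))
    return merged


cache_dir = tempfile.mkdtemp()
example_cost_centers = ["AAAAAA", "BBBBBB", "CCCCCC"]  # made-up codes

# Threads suit this workload because the time is spent waiting on I/O
with ThreadPoolExecutor(max_workers=3) as pool:
    list(pool.map(
        lambda cc: dump_to_cache(cache_dir, "ProjectTaskMaster", cc),
        example_cost_centers,
    ))

merged = load_cache(cache_dir, "ProjectTaskMaster")
```

Writing one file per cost center means a failed or interrupted request only affects its own file, and the merge step can be rerun at any time without touching the services.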

## Project Narratives
The project narratives can be very large, including all of their potential tasks and subtasks. Because we don't necessarily want to incorporate all projects into the graph based on the types of questions we are asking, we use a filtered_projects function on our project summary information (that does include all projects by cost center) to only fetch full project/task records for those that we will process into entities for the graph. At this point, we are filtering to only active projects that are in the Conduct Research or Conduct Assessments categories.

In [None]:
if click.confirm('Are you sure you want to run the process to rebuild the project/task detailed data from SIPP?', default=False):
    for file_name in glob.glob(f'{isaid_helpers.local_cache_path_sipp}ProjectXML_*.p'):
        os.remove(file_name)

    Parallel(n_jobs=10, prefer="threads")(
        delayed(sipp_dumper)
        (
            source="ProjectXML", project_number=i, 
        ) for i in tqdm(filtered_projects(project_summary))
    )
else:
    print("The following files are available for the ProjectXML data pull:")
    print()
    display(glob.glob(f'{isaid_helpers.local_cache_path_sipp}ProjectXML_*.p'))
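
The filtered_projects function used above is not shown in this notebook. The following is only a sketch of what such a filter might look like, based on the description of selecting active projects in the Conduct Research or Conduct Assessments categories. It assumes the summary records carry `ProjectNumber`, `ProjectStatus`, and `ProjectType` keys with those values, which is a guess based on the field names used for the narrative records; the function name and sample data are hypothetical.

```python
def filtered_projects_sketch(
    project_summary,
    statuses=("Active",),
    categories=("Conduct Research", "Conduct Assessments"),
):
    """Return sorted, de-duplicated project numbers that pass the status/category filter."""
    project_numbers = set()
    for projects in project_summary.values():
        # A cost center with no projects is cached as None
        for project in projects or []:
            if (
                project.get("ProjectStatus") in statuses
                and project.get("ProjectType") in categories
            ):
                project_numbers.add(project["ProjectNumber"])
    return sorted(project_numbers)


# Made-up summary data keyed by cost center
example_summary = {
    "AAAAAA": [
        {"ProjectNumber": "P1", "ProjectStatus": "Active", "ProjectType": "Conduct Research"},
        {"ProjectNumber": "P2", "ProjectStatus": "Complete", "ProjectType": "Conduct Research"},
        {"ProjectNumber": "P3", "ProjectStatus": "Active", "ProjectType": "Data Collection"},
    ],
    "BBBBBB": None,
    "CCCCCC": [
        {"ProjectNumber": "P4", "ProjectStatus": "Active", "ProjectType": "Conduct Assessments"},
    ],
}

selected = filtered_projects_sketch(example_summary)
```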

# Load Cached Data

In [None]:
%%time
cache_data = {
    "summary": load_sipp_data("ProjectTaskMaster"),
    "narrative": load_sipp_data("ProjectXML"),
    "accounting": load_sipp_data("AccountDetail"),
    "staffing": load_sipp_data("StaffRequestDetail")
}

# Person Records
Contained within the internal system on projects are records about people that will be an important connecting point into our graph. When examining projects with science assessment and planning questions, we almost always want to know about the people working on the projects. In the internal system, this information is both supplied directly for projects and tasks, indicating the "chiefs" or leaders (with some caveats as to accuracy of that designation), and within the budgeting information that indicates hours requested from staff for work on a given activity. Unfortunately, the internal system uses its own internal identifier for personnel that is in no way connected to any other system of identifiers. A subset of records (project chiefs/task leaders) do have email addresses for some instances, giving us a reference point, but most of the meaningful references to project staffing have only names. In general, these names do align with the official names for staff that we already have in our system from the master source we tap into via the ScienceBase Directory, but we need to treat this carefully and only use cases where we can make a match on a unique name.

The following codeblocks work through the process of gathering all unique personnel identifiers from across the places in the data where they are found and then consulting our existing graph to determine where we can find reasonably certain alignment with existing master data records. These are then teed up for incorporation into the graph as a lead-in step to pulling in project information. With the simple numeric internal identifier used for recordkeeping in the internal system added to the person documents in our graph, we can then establish relationships between people and projects, but only for those where we can establish reasonable alignment between names (and emails in some cases) and more firm records of personnel.

We loaded all of our data into the cache_data dictionary above. We need the project summary and narrative content here and will need the other two datasets in the process below where we build logical project records.

In the next two codeblocks, we pull out all unique combinations of internal ID (FPPSID) and name (and email in the case of project chiefs/task leads). Our purpose is to link the internal identifiers that will be found throughout the data to a known person with other, more widely used identifiers that we've pulled into our graph from other sources.

In [None]:
# Convenience references into the cached data loaded above
project_summary = cache_data["summary"]
narrative_content = cache_data["narrative"]

all_contacts = list()

for fbms_code in project_summary.keys():
    if project_summary[fbms_code] is not None:
        for project in project_summary[fbms_code]:
            if project["ProjectChiefEmail"] is not None and project["ProjectChiefFPPSID"] is not None:
                all_contacts.append((project["ProjectChiefFPPSID"], project["ProjectChiefEmail"], project["ProjectChief"]))

for contact_key in ["AssociateProjectChief", "TaskLeader", "SubtaskLeader"]:
    for contact in nested_lookup(contact_key, narrative_content):
        if contact is not None:
            if isinstance(contact, dict):
                contact = [contact]
            for c in contact:
                if c["Email"] is not None and c["FPPSID"] is not None:
                    all_contacts.append((c["FPPSID"], c["Email"], c["Name"]))

unique_chiefs = list(set(all_contacts))
print(len(unique_chiefs))

In [None]:
staff_requests = load_sipp_data("StaffRequestDetail")

all_staff = list()
for fbms_code in staff_requests.keys():
    if staff_requests[fbms_code] is not None:
        all_staff.extend((sr["FPPSID"], sr["EmployeeName"]) for sr in staff_requests[fbms_code])

unique_staff = list(set(all_staff))
print(len(unique_staff))

In this codeblock, we assemble a data structure for our staff members, including new properties we'll add to our graph, and run a check to determine where we may have duplicate names. Even though the internal project system record shows that these are different people, we don't have enough information about them at this point to disambiguate who the internal identifiers belong to. We could likely track these down later as there aren't many of them, but we can't necessarily do that in an automated way.

In [None]:
basis_staff = list()

for s in unique_staff:
    staff_email = None
    
    chief_record = next((i for i in unique_chiefs if i[0] == s[0]), None)
    if chief_record is not None:
        staff_email = chief_record[1]
    
    name_parts = s[1].split(",")
    if len(name_parts) == 1:
        staff_name = name_parts[0].title()
    else:
        staff_name = f'{name_parts[1].strip().title()} {name_parts[0].strip().title()}'

    person = {
        "id_fpps": s[0],
        "name": staff_name,
        "first_name": staff_name.split(" ")[0],
        "last_name": staff_name.split(" ")[-1],
        "email": staff_email,
        "duplicate_name": False
    }
    basis_staff.append(person)

for extra_s in [i for i in unique_chiefs if i[0] not in [s[0] for s in unique_staff]]:
    name_parts = extra_s[2].split(",")
    if len(name_parts) == 1:
        staff_name = name_parts[0].title()
    else:
        staff_name = f'{name_parts[1].strip().title()} {name_parts[0].strip().title()}'

    person = {
        "id_fpps": extra_s[0],
        "name": staff_name,
        "first_name": staff_name.split(" ")[0],
        "last_name": staff_name.split(" ")[-1],
        "email": extra_s[1],
        "duplicate_name": False
    }
    basis_staff.append(person)

counter_names = Counter([i["name"] for i in basis_staff])
for name in list([n for n in counter_names if counter_names[n]>1]):
    for staff_person in [i for i in basis_staff if i["name"] == name]:
        staff_person.update({"duplicate_name": True})

print("TOTAL STAFF FROM BASIS:", len(basis_staff))
print("STAFF WITH DUPLICATE NAMES:", len([i for i in basis_staff if i["duplicate_name"]]))
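
The name handling in the codeblock above can be seen more easily on a few sample values. This is a minimal sketch of the same idea: source names arrive as "LAST, FIRST" strings, are rearranged into "First Last" form, and are then counted to flag names that occur more than once. The `normalize_name` helper and the sample names are hypothetical; note also that `str.title()` will not preserve mixed-case surnames, so the result is only suitable for case-insensitive matching.

```python
from collections import Counter


def normalize_name(raw_name):
    """Turn 'LAST, FIRST' into 'First Last'; leave comma-free names as a single part."""
    parts = [part.strip() for part in raw_name.split(",")]
    if len(parts) == 1:
        return parts[0].title()
    return f"{parts[1].title()} {parts[0].title()}"


# Made-up names in the style of the source system
raw_names = ["DOE, JANE", "Doe, Jane", "SMITH, JOHN Q", "Single"]
normalized = [normalize_name(name) for name in raw_names]

# Any normalized name seen more than once cannot be matched with confidence
name_counts = Counter(normalized)
duplicate_names = {name for name, count in name_counts.items() if count > 1}
```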

Now, we need to put some information together with what we already know about people from other work to pull them into our graph. In the following codeblocks, we get the current personnel from our graph, check for cases where we can link project chiefs/task leaders on email address, and then run a query to determine where we can make reasonable matches on name.

In [None]:
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    results = session.run("""
    MATCH (p:Person)
    WHERE NOT p.email IS NULL
    OR NOT p.orcid IS NULL
    RETURN p.name, p.email, p.orcid
    """)
    personnel_in_graph = results.data()

for person in personnel_in_graph:
    person.update({"first_name": person["p.name"].split(" ")[0], "last_name": person["p.name"].split(" ")[-1]})
    

In [None]:
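# Build the set of known emails once rather than rebuilding a list for every record
emails_in_graph = {p["p.email"] for p in personnel_in_graph}

for record in [i for i in basis_staff if i["email"] is not None and i["email"] not in emails_in_graph]:
    record.update({"missing_from_graph": "Email not in graph"})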
    
print("EMAILS FROM BASIS NOT IN GRAPH:", len([i for i in basis_staff if "missing_from_graph" in i and i["missing_from_graph"] == "Email not in graph"]))
                                              

In [None]:
for basis_record in basis_staff:
    if not basis_record["duplicate_name"] and "missing_from_graph" not in basis_record and basis_record["name"] is not None:
        email_in_graph = next(
            (
                i["p.email"] for i in personnel_in_graph 
                if i["p.name"].lower() == basis_record["name"].lower()
            ), None)
        if email_in_graph is None:
            names_in_graph = [
                i for i in personnel_in_graph 
                if i["first_name"].lower() == basis_record["first_name"].lower() 
                and i["last_name"].lower() == basis_record["last_name"].lower()
            ]
            if len(names_in_graph) == 1:
                email_in_graph = names_in_graph[0]["p.email"]

        if email_in_graph is not None:
            basis_record.update({"email": email_in_graph})
        else:
            basis_record.update({"missing_from_graph":  "Name not in graph"})
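
The two-stage matching in the loop above can be summarized as a small function: try an exact match on the full name first, then fall back to a first-name and last-name match that is accepted only when exactly one candidate exists. The sketch below uses simplified record keys (`name`, `email`) and made-up people with placeholder email addresses; the `match_email` helper is hypothetical and is not part of the notebook.

```python
def match_email(basis_name, personnel):
    """Return the email of the person matching basis_name, or None if no safe match exists."""
    target = basis_name.lower()

    # Stage 1: exact match on the full name (first hit with that name is used)
    email = next(
        (p["email"] for p in personnel if p["name"].lower() == target), None
    )
    if email is not None:
        return email

    # Stage 2: first/last name match, accepted only when it is unambiguous
    first, last = target.split(" ")[0], target.split(" ")[-1]
    candidates = [
        p for p in personnel
        if p["name"].lower().split(" ")[0] == first
        and p["name"].lower().split(" ")[-1] == last
    ]
    if len(candidates) == 1:
        return candidates[0]["email"]
    return None


# Made-up personnel records standing in for what the graph query returns
personnel = [
    {"name": "Jane A. Doe", "email": "jdoe@example.gov"},
    {"name": "John Smith", "email": "jsmith@example.gov"},
    {"name": "Sam Lee", "email": "slee1@example.gov"},
    {"name": "Sam K. Lee", "email": "slee2@example.gov"},
]

exact = match_email("John Smith", personnel)
by_parts = match_email("Jane Doe", personnel)
ambiguous = match_email("Sam R. Lee", personnel)
missing = match_email("Pat Jones", personnel)
```

The ambiguous case returns nothing on purpose: two people share the same first and last name, so attaching the internal identifier to either one would be a guess.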
    

It turns out that we can match most records from the internal system to something in our master data that we've previously brought into the graph. This gives us a good number of records where we can attach the internal opaque identifier to persons in our graph, giving us the ability to then establish relationships from people to projects.

In [None]:
print("BASIS PERSONNEL MATCHED TO GRAPH:", len([i for i in basis_staff if "missing_from_graph" not in i]))
display(Counter([i["missing_from_graph"] for i in basis_staff if "missing_from_graph" in i]))

We now have a simple dataset that pairs email addresses we already have in our graph from a master data source, which we can use as reasonably unique identifiers within this particular context, with the internal opaque identifier used in the system we're pulling our project information from. We can dump this to a CSV file for processing into our graph so that we'll be able to simply match on those identifiers in establishing relationships to projects.

In [None]:
pd.DataFrame([i for i in basis_staff if "missing_from_graph" not in i]).to_csv(isaid_helpers.f_graphable_sipp_personnel, index=False)

# Graphable Projects
Now we need to create logical project entities we can pull into our graph. The data we're building from is quite messy: the source system is applied and used in variable ways across organizational units in USGS, particularly in the detail and organization of the descriptive narrative information that is most useful for our assessment and cataloging use cases. In the following codeblocks and functional logic, we attempt to simplify a basic project schema, using as many common properties already pulled into our graph as possible. We abstract away from the nested structure of projects, tasks, and subtasks used in the source system since we find that there are activities down within that structure that describe logical "projects" as we would think of them in a science planning and assessment context. We stitch information together from across multiple parts of the underlying data structure to synthesize logical connections from projects to people. Given the messy and highly variable nature of the narrative content, we simply aggregate as much as we can into a "description" property (which is variable in our graph anyway) so that we can run information extraction routines in later work.

The project/task/subtask records in our underlying system do also contain potentially interesting listings of products associated with the projects. These are generally in the form of text strings but can also reference another internal system with another opaque identifier that we may be able to exploit. In the near term, we are checking the citation listings for URLs that sometimes contain DOIs that we can use and attempting to use a citation parser with other query mechanisms to identify citations that can be linked to known registry systems in some way. This gives us some number of linkable products that we can either identify and match to in our graph already or tee up for further items to pull into our graph. There will always be some number of things listed as products that we either can't parse and disambiguate enough to positively identify or are the types of things that won't have linkable citations (e.g., conference abstracts, posters, etc.). These are still interesting products whose titles and descriptive information may contain useful text for extraction algorithms, so we add them into our graph as lists for additional information extraction references.
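
As a rough illustration of pulling DOIs out of free-text citation strings, the sketch below uses a simple regular expression from the standard library. It is an assumption for illustration only and is not the implementation behind `utilities.doi_from_string` or the URL extractor used in this notebook; the pattern, the `find_dois` helper, and the citation strings (with placeholder DOIs) are all made up.

```python
import re

# A DOI starts with "10.", a 4-9 digit registrant code, a slash, then a suffix.
# This loose pattern stops at whitespace, quotes, or angle brackets.
DOI_PATTERN = re.compile(r'10\.\d{4,9}/[^\s"<>]+')


def find_dois(text):
    """Return DOI-like strings found in text, with trailing punctuation trimmed."""
    return [match.rstrip(".,;)") for match in DOI_PATTERN.findall(text)]


# Made-up citation strings with placeholder DOIs
citation_with_url = (
    "Doe, J., 2020, Example report title: https://doi.org/10.1234/example.one. "
    "Poster presented at a meeting."
)
citation_with_prefix = "Smith, J., 2019, Another example (doi:10.5678/sample-two)"
citation_without_doi = "Lee, S., 2018, Conference abstract with no identifier."

dois_from_url = find_dois(citation_with_url)
dois_from_prefix = find_dois(citation_with_prefix)
dois_from_plain = find_dois(citation_without_doi)
```

Citations that yield no DOI, like the last one, are the ones that end up carried into the graph as plain text for later information extraction.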

In this process, we are also applying a crude high level filter for only certain types of projects that are most pertinent to our science assessment and planning use cases. Projects in our internal source system are classified into categories, and we focus initially on Conduct Research and Conduct Assessments categorized projects as those that typically contain the richest information content that is most pertinent to our current line of inquiry. This leaves out other important projects (e.g., Data Collection, Manage and Distribute Data, Technical Assistance) that we will need to investigate further.

There is a LOT of potentially useful content scattered within the data structure of this internal system. The function below that builds the flattened, summarized structure suitable for our graph is dense and perhaps overly complicated. There may be better ways to map and transform the content to what we need, but this does let us work through things in a somewhat logical fashion as we iterate through how it translates into the graph.

In [7]:
isaid_helpers.api_sipp_project_narratives

'https://sipp.cr.usgs.gov/SIPPService/ProjectXML.ashx'

In [None]:
def basis_keywords_to_list(kw_string):
    delim_char = next((char for char in [";", ",", "/"] if char in kw_string), None)
    if delim_char is not None:
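        # Strip before testing so whitespace-only fragments are dropped rather than kept as empty strings
        return [k.strip() for k in kw_string.split(delim_char) if k.strip()]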
    else:
        return list()

def project_nodes_from_project_narrative(project_number, narrative_dataset):
    if project_number not in narrative_dataset.keys():
        return
    p = narrative_dataset[project_number]

    project_nodes = list()
    
    project_record = {
        "id_basis_project": p["ProjectNumber"],
        "url": f"{isaid_helpers.api_sipp_project_narratives}?ProjectNumber={p['ProjectNumber']}",
        "id_cost_center": p["ProjectCostCenter"],
        "fbms_code": p["ProjectCostCenter"][0:6],
        "date_qualifier": str(datetime.datetime.utcnow().isoformat()),
        "source": "BASIS+ Project via SIPP Services",
        "name": p["ProjectTitle"].title().strip(),
        "type": p["ProjectType"],
        "status": p["ProjectStatus"],
        "project_chief_fppsid": p["ProjectChiefFPPSID"],
        "descriptive_texts": [i for i in nested_lookup("#text", {k:v for k,v in p.items() if k != "tasks"}) if i != "TBD"],
        "urls_in_descriptive_texts": list(),
        "dois_in_descriptive_texts": list()
    }
    project_record["descriptive_texts"].extend([i for i in nested_lookup("FiveYearGoalDescription", p) if i != "TBD"])
    
    for item in project_record["descriptive_texts"]:
        if isinstance(item, str):
            project_record["urls_in_descriptive_texts"].extend([i for i in url_extractor.find_urls(item) if "://" in i])
            project_record["dois_in_descriptive_texts"].extend(utilities.doi_from_string(item))
    
    if p["ProjectStartDate"] is not None:
        project_record["date_start"] = str(dateutil.parser.parse(p["ProjectStartDate"]).isoformat())

    if p["ProjectEndDate"] is not None:
        project_record["date_end"] = str(dateutil.parser.parse(p["ProjectEndDate"]).isoformat())
    
    associate_chiefs = nested_lookup("AssociateProjectChief", p)
    if associate_chiefs:
        if isinstance(associate_chiefs[0], list):
            project_record["associate_chief_fppsid"] = ",".join([i["FPPSID"] for i in associate_chiefs[0] if i["FPPSID"] is not None])
        elif isinstance(associate_chiefs[0], dict) and associate_chiefs[0]["FPPSID"] is not None:
            project_record["associate_chief_fppsid"] = associate_chiefs[0]["FPPSID"]
    
    if p["KeywordsNarrative"] is not None and "#text" in p["KeywordsNarrative"]:
        project_record["parsed_keywords"] = ";".join(basis_keywords_to_list(p["KeywordsNarrative"]["#text"]))
        
    delivered_products = nested_lookup("DeliveredProduct", {k:v for k,v in p.items() if k != "tasks"})
    if delivered_products:
        if isinstance(delivered_products[0], list):
            project_record["delivered_products"] = [parse_delivered_products(i) for i in delivered_products[0]]
        else:
            project_record["delivered_products"] = [parse_delivered_products(delivered_products[0])]
        
    project_nodes.append(project_record)
    
    # Task section
    if "tasks" in p and p["tasks"]:
        for t in p["tasks"]:
            task_node = {
                "id_basis_project": project_record["id_basis_project"],
                "id_cost_center": project_record["id_cost_center"],
                "fbms_code": project_record["fbms_code"],
                "type": project_record["type"],
                "id_basis_taskid": t["TaskID"],
                "basis_task_number": t["TaskNumber"],
                "url": f"{isaid_helpers.api_sipp_project_narratives}?ProjectNumber={p['ProjectNumber']}&TaskNumber={t['TaskNumber']}",
                "date_qualifier": project_record["date_qualifier"],
                "source": "BASIS+ Task via SIPP Services",
                "name": t["TaskTitle"],
                "urls_in_descriptive_texts": list(),
                "dois_in_descriptive_texts": list(),
                "keywords": nested_lookup("#text", nested_lookup("Keyword", t)),
                "descriptive_texts": [i for i in nested_lookup("#text", {k:v for k,v in t.items() if k != "Subtasks"}) if i != "TBD"]
            }
            
            task_node["descriptive_texts"].extend([i for i in nested_lookup("Text", {k:v for k,v in t.items() if k != "Subtasks"}) if i != "TBD"])
            
            for item in task_node["descriptive_texts"]:
                if isinstance(item, str):
                    task_node["urls_in_descriptive_texts"].extend([i for i in url_extractor.find_urls(item) if "://" in i])
                    task_node["dois_in_descriptive_texts"].extend(utilities.doi_from_string(item))
    
            if t["TaskStartDate"] is not None:
                task_node["date_start"] = str(dateutil.parser.parse(t["TaskStartDate"]).isoformat())

            if t["TaskEndDate"] is not None:
                TaskEndDate = dateutil.parser.parse(t["TaskEndDate"])
                if project_record["status"] == "Active" and TaskEndDate > datetime.datetime.now():
                    task_node["status"] = "Active"
                else:
                    task_node["status"] = "Complete"
                task_node["date_end"] = str(TaskEndDate.isoformat())
            else:
                if project_record["status"] == "Active":
                    task_node["status"] = "Active"

            task_leaders = nested_lookup("TaskLeader", t)
            if task_leaders:
                if isinstance(task_leaders[0], list):
                    task_node["task_leaders_fppsid"] = [i["FPPSID"] for i in task_leaders[0] if i["FPPSID"] is not None]
                elif isinstance(task_leaders[0], dict) and task_leaders[0]["FPPSID"] is not None:
                    task_node["task_leaders_fppsid"] = [task_leaders[0]["FPPSID"]]
                    
            keywords = nested_lookup("Keyword", {k:v for k,v in t.items() if k != "Subtasks"})
            if keywords:
                if isinstance(keywords[0], dict):
                    task_node["keywords"] = keywords[0]["#text"]
                elif isinstance(keywords[0], list):
                    task_node["keywords"] = ";".join([i["#text"] for i in keywords[0]])
                    
            if t["KeywordsNarrative"] is not None and "#text" in t["KeywordsNarrative"]:
                task_node["parsed_keywords"] = ";".join(basis_keywords_to_list(t["KeywordsNarrative"]["#text"]))

            delivered_products = nested_lookup("DeliveredProduct", {k:v for k,v in t.items() if k != "Subtasks"})
            if delivered_products:
                if isinstance(delivered_products[0], list):
                    project_record["delivered_products"] = [parse_delivered_products(i) for i in delivered_products[0]]
                else:
                    project_record["delivered_products"] = [parse_delivered_products(delivered_products[0])]
        
            project_nodes.append(task_node)
            
            # subtask section
            st_in_t = nested_lookup("Subtask", t)
            if st_in_t:
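                if isinstance(st_in_t[0], dict):
                    subtasks = [st_in_t[0]]
                elif isinstance(st_in_t[0], list):
                    subtasks = st_in_t[0]
                else:
                    # Neither a single subtask nor a list (e.g. None); do not
                    # reuse a stale value left over from a previous task
                    subtasks = list()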
                    
                for st in subtasks:
                    subtask_node = {
                        "id_basis_project": project_record["id_basis_project"],
                        "id_cost_center": project_record["id_cost_center"],
                        "fbms_code": project_record["fbms_code"],
                        "type": project_record["type"],
                        "id_basis_taskid": task_node["id_basis_taskid"],
                        "basis_task_number": task_node["basis_task_number"],
                        "id_basis_subtaskid": st["SubtaskID"],
                        "basis_subtask_number": st["SubtaskNumber"],
                        "url": f"{isaid_helpers.api_sipp_project_narratives}?ProjectNumber={p['ProjectNumber']}&TaskNumber={t['TaskNumber']}",
                        "date_qualifier": project_record["date_qualifier"],
                        "source": "BASIS+ Subtask via SIPP Services",
                        "name": st["SubtaskTitle"],
                        "urls_in_descriptive_texts": list(),
                        "dois_in_descriptive_texts": list(),
                        "keywords": nested_lookup("#text", nested_lookup("Keyword", st)),
                        "descriptive_texts": [i for i in nested_lookup("#text", st) if i != "TBD"]
                    }

                    subtask_node["descriptive_texts"].extend([i for i in nested_lookup("Text", st) if i != "TBD"])

                    for item in subtask_node["descriptive_texts"]:
                        if isinstance(item, str):
                            subtask_node["urls_in_descriptive_texts"].extend([i for i in url_extractor.find_urls(item) if "://" in i])
                            subtask_node["dois_in_descriptive_texts"].extend(utilities.doi_from_string(item))
    
                    if st["SubtaskStartDate"] is not None:
                        subtask_node["date_start"] = str(dateutil.parser.parse(st["SubtaskStartDate"]).isoformat())

                    if st["SubtaskEndDate"] is not None:
                        SubtaskEndDate = dateutil.parser.parse(st["SubtaskEndDate"])
                        if project_record["status"] == "Active" and SubtaskEndDate > datetime.datetime.now():
                            subtask_node["status"] = "Active"
                        else:
                            subtask_node["status"] = "Complete"
                        subtask_node["date_end"] = str(SubtaskEndDate.isoformat())
                    else:
                        if project_record["status"] == "Active":
                            subtask_node["status"] = "Active"

                    subtask_leaders = nested_lookup("SubtaskLeader", st)
                    if subtask_leaders:
                        if isinstance(subtask_leaders[0], list):
                            subtask_node["subtask_leaders_fppsid"] = [i["FPPSID"] for i in subtask_leaders[0] if i["FPPSID"] is not None]
                        elif isinstance(subtask_leaders[0], dict) and subtask_leaders[0]["FPPSID"] is not None:
                            subtask_node["subtask_leaders_fppsid"] = [subtask_leaders[0]["FPPSID"]]

                    keywords = nested_lookup("Keyword", st)
                    if keywords:
                        if isinstance(keywords[0], dict):
                            subtask_node["keywords"] = keywords[0]["#text"]
                        elif isinstance(keywords[0], list):
                            subtask_node["keywords"] = ";".join([i["#text"] for i in keywords[0]])

                    if st["KeywordsNarrative"] is not None and "#text" in st["KeywordsNarrative"]:
                        subtask_node["parsed_keywords"] = ";".join(basis_keywords_to_list(st["KeywordsNarrative"]["#text"]))

                    delivered_products = nested_lookup("DeliveredProduct", st)
                    if delivered_products:
                        if isinstance(delivered_products[0], list):
                            project_record["delivered_products"] = [parse_delivered_products(i) for i in delivered_products[0]]
                        else:
                            project_record["delivered_products"] = [parse_delivered_products(delivered_products[0])]

                    project_nodes.append(subtask_node)
    
    return project_nodes

def parse_delivered_products(product_str):
    product_parts = product_str.split(":")
    product = {
        "status": product_parts[0].split(",")[-1].strip()
    }
    product["type"] = product_parts[0].replace(product["status"], "").strip()
    
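    if len(product_parts) >= 2:
        # Rejoin on ":" so citations that contain colons (e.g. URLs) stay intact
        product["citation"] = ":".join(product_parts[1:]).strip()
    else:
        # No ":" separator found; report the string and fall back to an empty
        # citation so the URL/DOI extraction below does not raise a KeyError
        print(product_str)
        product["citation"] = ""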
        
    product["urls_in_string"] = [i for i in url_extractor.find_urls(product["citation"]) if "://" in i]
    product["dois_in_string"] = utilities.doi_from_string(product["citation"])
    
    return product

def filtered_projects(
    project_summary_data, 
    project_types=["Conduct Research","Conduct Assessments"],
    project_status=["Active"],
    return_type="ProjectNumber"
):
    all_projects = list()
    for fbms_code in project_summary_data.keys():
        if project_summary_data[fbms_code] is not None:
            for p in project_summary_data[fbms_code]:
                if p["ProjectType"] in project_types and p["ProjectStatus"] in project_status:
                    all_projects.append({
                        "ProjectNumber": p["ProjectNumber"],
                        "CostCenter": p["ProjectCostCenter"]
                    })
    
    if return_type == "ProjectNumber":
        return list(set([i["ProjectNumber"] for i in all_projects]))
    else:
        return all_projects
    
def accounting_summary(cost_center, project_number, data=cache_data["accounting"]):
    fbms_code = cost_center[0:6]
    if fbms_code not in data.keys():
        return list()
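    cost_center_accounting = data[fbms_code]
    # Mirror the guard in staffing_summary: a cost center may have no cached data
    if cost_center_accounting is None:
        return list()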

    accounting = list()
    for item in [i for i in cost_center_accounting if i["ProjectNumber"] == project_number]:
        # Use a local name that does not shadow the enclosing function
        summary = {k:v for k,v in item.items() if k in [
                "CostCenterCode",
                "ProjectNumber",
                "TaskNumber",
                "AccountNumber"
            ]}
        summary["fbms_code"] = fbms_code
        accounting.append(summary)

    return accounting

def staffing_summary(project_accounting, data=cache_data["staffing"]):
    fbms_code = project_accounting["fbms_code"]
    if fbms_code not in data.keys():
        return list()
    cost_center_staffing = data[fbms_code]
    if cost_center_staffing is None:
        return list()
    
    staffing = list()
    for staffing_record in [i for i in cost_center_staffing if i["AccountNumber"] == project_accounting["AccountNumber"]]:
        # Use a local name that does not shadow the enclosing function
        summary = {k:v for k,v in staffing_record.items() if k in [
                "PayrollFY",
                "AccountFY",
                "FPPSID",
                "PlannedRegularHours",
                "ActualRegularHours"
            ]}
        summary.update(project_accounting)
        summary.update({"max_hours": max([float(summary["PlannedRegularHours"]), float(summary["ActualRegularHours"])])})
        
        staffing.append(summary)
    
    return staffing


Processing full project records involves a good deal of summarization and decision-making, as seen in the project_nodes_from_project_narrative function above. Because the underlying internal management system is implemented differently across organizational units, logical activities that we would consider distinct science projects show up at multiple levels of the project/task/subtask hierarchy. For now, our process "smooths" this over by pulling the lower-level items up and treating them all as projects, while retaining the linkages so we can express them as relationships in our graph. We also pre-process the information content here, for example by extracting URLs and DOIs from parts of the content such as the descriptive texts, so that we can use them for additional entity identification later on.

Note: I did try parallelizing this processing step, but each run uses enough resources that parallelizing did not make it any more efficient. It's not huge data, but it is just big enough that I need to move this work elsewhere.
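
As a minimal sketch of the "smoothing" described above (not the notebook's actual function), the toy example below flattens a nested project/task/subtask record into a flat list of nodes. Each node keeps the identifier of its parent so the hierarchy can later be rebuilt as relationships. The record layout, the field names, and the `flatten_project` helper are assumptions for illustration only; the underscore-joined identifiers follow the same pattern used for `project_id` in this notebook.

```python
def flatten_project(project):
    """Flatten a nested project record into a list of flat 'project' nodes.

    Hypothetical helper: the input layout is a simplified stand-in for the
    much messier source records.
    """
    nodes = [{
        "project_id": project["id"],
        "parent_id": None,
        "name": project["title"],
        "level": "project",
    }]
    for task in project.get("tasks", []):
        task_id = f"{project['id']}_{task['id']}"
        nodes.append({
            "project_id": task_id,
            "parent_id": project["id"],
            "name": task["title"],
            "level": "task",
        })
        for subtask in task.get("subtasks", []):
            nodes.append({
                "project_id": f"{task_id}_{subtask['id']}",
                "parent_id": task_id,
                "name": subtask["title"],
                "level": "subtask",
            })
    return nodes


# Made-up record with one task and one subtask
example_project = {
    "id": "P1",
    "title": "Example project",
    "tasks": [
        {
            "id": "T1",
            "title": "Example task",
            "subtasks": [{"id": "S1", "title": "Example subtask"}],
        }
    ],
}

flat_nodes = flatten_project(example_project)
for node in flat_nodes:
    print(node["level"], node["project_id"], "<-", node["parent_id"])
```

Every level ends up as a row in the same table, and the `parent_id` column is what would let a later load step connect the rows to each other.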

In [None]:
%%time
all_project_nodes = list()
for pnum in cache_data["narrative"].keys():
    all_project_nodes.extend(project_nodes_from_project_narrative(pnum, cache_data["narrative"]))

In [None]:
for project in all_project_nodes:
    if project["source"] == "BASIS+ Subtask via SIPP Services":
        project.update({"project_id": "_".join([
            project["id_basis_project"],
            project["id_basis_taskid"],
            project["id_basis_subtaskid"]
        ])})
    elif project["source"] == "BASIS+ Task via SIPP Services":
        project.update({"project_id": "_".join([
            project["id_basis_project"],
            project["id_basis_taskid"]
        ])})
    elif project["source"] == "BASIS+ Project via SIPP Services":
        project.update({"project_id": project["id_basis_project"]})

    if "task_leaders_fppsid" in project and project["task_leaders_fppsid"] is not None:
        project.update({"task_leaders_fppsid": ",".join(project["task_leaders_fppsid"])})

    if "subtask_leaders_fppsid" in project and project["subtask_leaders_fppsid"] is not None:
        project.update({"subtask_leaders_fppsid": ",".join(project["subtask_leaders_fppsid"])})
    
    if "keywords" in project and project["keywords"] and isinstance(project["keywords"], list):
        project.update({"keywords": ";".join(project["keywords"])})
        
print("DERIVED PROJECT RECORDS:", len(all_project_nodes))

In [None]:
pd.DataFrame(all_project_nodes).to_csv("data/graphable_sipp_projects.csv", index=False)

# Project Staffing
Because of how accounting and project staffing are structured in the source material, we have granular reference points for the number of hours a staff person is budgeted on a given project/task. That is enough to build these relationships as a separate table for incorporation into the graph. Once the internal opaque identifiers are placed on the Person entities in the graph and our Project entities are pulled in with their identifiers, we can use those identifiers to determine who should be related to which project. We use the same filtering mechanism we used to get narrative content, so we only pull together staffing data for the projects we are incorporating into the graph.
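
The chain of lookups described above can be sketched with toy data: accounting rows tie a project/task to an account number, and staffing rows tie an account number to a person identifier and hours. All values below are made up, and `link_staff_to_projects` is a hypothetical helper rather than part of this notebook. It only illustrates the join on `AccountNumber` and the "larger of planned or actual hours" summary.

```python
accounting_rows = [
    {"ProjectNumber": "P1", "TaskNumber": "1", "AccountNumber": "A100"},
    {"ProjectNumber": "P1", "TaskNumber": "2", "AccountNumber": "A200"},
]

staffing_rows = [
    {"AccountNumber": "A100", "FPPSID": "111",
     "PlannedRegularHours": "40", "ActualRegularHours": "32.5"},
    {"AccountNumber": "A200", "FPPSID": "222",
     "PlannedRegularHours": "10", "ActualRegularHours": "16"},
    # This account belongs to no project we are graphing, so it is dropped
    {"AccountNumber": "A999", "FPPSID": "333",
     "PlannedRegularHours": "8", "ActualRegularHours": "8"},
]


def link_staff_to_projects(accounting, staffing):
    """Join staffing rows to accounting rows on AccountNumber."""
    # Index staffing rows by account so each accounting row is a dict lookup
    by_account = {}
    for row in staffing:
        by_account.setdefault(row["AccountNumber"], []).append(row)

    links = []
    for account in accounting:
        for staff in by_account.get(account["AccountNumber"], []):
            links.append({
                "FPPSID": staff["FPPSID"],
                "ProjectNumber": account["ProjectNumber"],
                "TaskNumber": account["TaskNumber"],
                "max_hours": max(
                    float(staff["PlannedRegularHours"]),
                    float(staff["ActualRegularHours"]),
                ),
            })
    return links


staff_links = link_staff_to_projects(accounting_rows, staffing_rows)
for link in staff_links:
    print(link)
```

Indexing the staffing rows once avoids rescanning the full staffing list for every account, which may matter if the same join is run across many cost centers.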

In [None]:
%%time
graphable_projects = filtered_projects(project_summary, return_type="all")

accounting_summaries = list()
for item in graphable_projects:
    accounting_summaries.extend(
        accounting_summary(
            cost_center=item["CostCenter"],
            project_number=item["ProjectNumber"]
        )
    )
    
staffing_data = list()
for item in accounting_summaries:
    staffing_data.extend(staffing_summary(item))


In [None]:
pd.DataFrame(staffing_data).to_csv("data/graphable_sipp_staffing.csv", index=False)

# Structured Keywords
The BASIS+ task- and subtask-level records contain a structured Keywords property that is nominally supposed to contain only terms from the USGS Thesaurus. This gives us a reasonable basis for building additional relationships to DefinedSubjectMatter terms in the graph. The following section loads the data needed to examine this part of the SIPP data pull against our reference vocabulary and the current records in the graph. It determines which new DefinedSubjectMatter entities need to be pulled in from the reference vocabulary, tees those up for processing into the graph, and sets up a separate load file to establish relationships from the relevant task/subtask projects to those entities.

In [3]:
if "graphable_sipp_projects" in locals():
    print(f"graphable_sipp_projects ready to operate with {len(graphable_sipp_projects)} records")
else:
    if os.path.exists(isaid_helpers.f_graphable_sipp_projects):
        graphable_sipp_projects = pd.read_csv(isaid_helpers.f_graphable_sipp_projects).to_dict(orient="records")
        print(f"graphable_sipp_projects loaded fresh and ready to operate with {len(graphable_sipp_projects)} records")
    else:
        print("Could not load graphable SIPP projects. Need that to proceed.")
    
if "reference_terms" not in locals():
    if os.path.exists(isaid_helpers.f_ner_reference):
        # Load the pickled reference vocabulary and close the file afterwards
        with open(isaid_helpers.f_ner_reference, "rb") as f:
            reference_terms = pickle.load(f)
    else:
        print("Could not load reference vocabulary. Need that to proceed.")

usgs_thesaurus_terms_in_reference = [i for i in reference_terms if i["source"] == "USGS Thesaurus"]
usgs_thesaurus_term_labels = [i["label"] for i in usgs_thesaurus_terms_in_reference]
print(f"reference_terms ready to operate with {len(usgs_thesaurus_terms_in_reference)} terms from USGS Thesaurus")

if "defined_subjects_in_graph" in locals():
    print(f"defined_subjects_in_graph loaded with {len(defined_subjects_in_graph)} reference terms")
else:
    try:
        with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
            defined_subjects_in_graph = session.run("""
            MATCH (ds:DefinedSubjectMatter)
            RETURN ds.name AS name, ds.source AS source, ds.url AS url
            """).data()
    except Exception as e:
        print(f"Problem in retrieving defined_subjects_in_graph from graph DB: {e}")

thesaurus_terms_in_graph = [i["name"] for i in defined_subjects_in_graph if i["source"] == "USGS Thesaurus"]
other_terms_in_graph = [i["name"] for i in defined_subjects_in_graph if i["source"] != "USGS Thesaurus"]

display(Counter([i["source"] for i in defined_subjects_in_graph]))

graphable_sipp_projects ready to operate with 17119 records
reference_terms ready to operate with 1151 terms from USGS Thesaurus


Counter({'Lithologic classification of geologic map units': 29,
         'Alexandria Digital Library Feature Type Thesaurus': 29,
         'Common geographic areas': 61,
         'USGS Thesaurus': 1902,
         'Coastal and Marine Ecological Classification Standard': 12,
         'Marine Realms Information Bank (MRIB) keywords': 87,
         'Data Categories for Marine Planning': 9,
         'The National Map Theme Thesaurus': 9,
         'Thesaurus categories': 2,
         'ISO 19115 Topic Category': 8,
         'Named instances': 5,
         'USGS information products': 2,
         'EPA Climate Change Glossary': 37})

In a previous step, we processed all project/task/subtask records to generate summarized "project" records to add to our graph as entities. In that process, we cleaned up some structural issues with the Keywords data from the source and put a semicolon-delimited list of keywords into the records. Here, we pull all unique keywords together to determine which ones we can use to build relationships to DefinedSubjectMatter terms in the graph.

Note: We do see a handful of terms that cannot be successfully matched to the USGS Thesaurus. These could come from non-preferred terms, which we did not yet bring into our reference vocabulary. For now, we'll ignore these.

In [4]:
all_keywords = list()
for keyword_set in nested_lookup("keywords", graphable_sipp_projects):
    if isinstance(keyword_set, str):
        all_keywords.extend(keyword_set.split(";"))
all_keywords = list(set(all_keywords))
all_keywords.sort()
print("UNIQUE KEYWORDS IN PROJECTS:", len(all_keywords))
print("KEYWORDS NOT IN USGS THESAURUS:", len([i for i in all_keywords if i not in usgs_thesaurus_term_labels]))
display([i for i in all_keywords if i not in usgs_thesaurus_term_labels])

UNIQUE KEYWORDS IN PROJECTS: 883
KEYWORDS NOT IN USGS THESAURUS: 16


['USGS business categories',
 'USGS-EMA-LOW-MR Aridlands',
 'USGS-EMA-LOW-MR Forest',
 'USGS-EMA-LOW-MR Rangeland',
 'USGS-EMA-LOW-MR Riparian/wetland',
 'USGS-EMA-LOW-MR River',
 '[]',
 'atmospheric deposition (chemical  particulate)',
 'habitat alteration',
 'institutional structures and activities',
 'methods',
 'product types',
 'sciences',
 'sexing (plants  animals)',
 'time periods',
 'topics']

To help ensure the integrity of the data we're building asynchronously into the graph, we separate establishing entities from creating relationships. We first determine exactly which entities need to go into the graph and put those into their own load file; a separate workflow then establishes relationships by matching to those entities on an appropriate identifier. Reference terms can be created as DefinedSubjectMatter entities through many different workflows, and we only add a term from the reference vocabulary once the concept is confirmed and linked to some other entity in the graph. So, we keep a running data table of the terms we need to add to the graph and rerun the same build process whenever that table changes.

With the reference terms loaded, we have what we need in the graphable_sipp_projects data to match keywords, where possible, to USGS Thesaurus terms that will be in our graph as soon as we run the load process. The keywords are stored as a semicolon-delimited list on each record where they are found, so we can split them in the Neo4j load script and run each one through a relationship-building process.
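
To make the splitting step concrete, here is a minimal Python sketch of what the relationship-building input could look like: one row per project and matched keyword. This is an illustration only. The split itself is intended to happen in the Neo4j load script, and `keyword_relationship_rows`, the toy records, and the toy term list are all assumptions rather than part of this notebook.

```python
def keyword_relationship_rows(projects, known_terms):
    """Expand semicolon-delimited keywords into (project, term) rows.

    Hypothetical helper: keeps only keywords found in known_terms.
    """
    known = set(known_terms)  # set membership is faster than scanning a list
    rows = []
    for project in projects:
        keywords = project.get("keywords")
        if not isinstance(keywords, str):
            # Records without keywords come back from CSV as NaN (a float)
            continue
        for keyword in keywords.split(";"):
            keyword = keyword.strip()
            if keyword in known:
                rows.append({"project_id": project["project_id"], "term": keyword})
    return rows


# Made-up records and vocabulary
toy_projects = [
    {"project_id": "P1_T1", "keywords": "groundwater;habitat alteration"},
    {"project_id": "P1_T2", "keywords": float("nan")},
]
toy_terms = ["groundwater", "wetlands"]

keyword_rows = keyword_relationship_rows(toy_projects, toy_terms)
print(keyword_rows)
```

Unmatched keywords are simply skipped here, which mirrors the decision above to ignore the handful of terms that cannot be matched to the USGS Thesaurus for now.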

In [5]:
missing_keywords_from_graph = [i for i in all_keywords if i not in thesaurus_terms_in_graph]
additional_reference_terms_to_graph = [i for i in reference_terms if i["source"] == "USGS Thesaurus" and i["label"] in missing_keywords_from_graph]

Counter([i["concept_label"] for i in additional_reference_terms_to_graph])

df_reference_terms_to_graph = pd.DataFrame(additional_reference_terms_to_graph)

if os.path.exists(isaid_helpers.f_graphable_reference_terms):
    existing_graphable_reference_terms = pd.read_csv(isaid_helpers.f_graphable_reference_terms)
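    # DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
    df_reference_terms_to_graph = pd.concat(
        [df_reference_terms_to_graph, existing_graphable_reference_terms],
        ignore_index=True
    )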
    df_reference_terms_to_graph.drop_duplicates(subset="url", inplace=True)

df_reference_terms_to_graph.to_csv(isaid_helpers.f_graphable_reference_terms, index=False)
print(
    isaid_helpers.f_graphable_reference_terms, 
    "CREATED", 
    datetime.datetime.fromtimestamp(os.path.getmtime(isaid_helpers.f_graphable_reference_terms))
)

data/graphable_reference_terms.csv CREATED 2021-07-08 06:33:15.696401
