With our project data now somewhat laboriously built and optimized for mapping into the graph, we can proceed with bringing this new type of entity into the iSAID graph. We do this in three steps: first adding the identifiers to existing people that will let us build relationships to projects, then bringing in the projects themselves, and finally building out the relationships.

In [1]:
import isaid_helpers
import pandas as pd

# Project Staff
In order to get project records fully into our graph and linked up in meaningful ways, we need to introduce an additional internal opaque identifier used for project staffing in the internal system we are tapping for "project" records. In our data-building step for the SIPP records, we did the work of pulling all unique id/name/email combinations and comparing them to what we already have in our graph. So, in this step, we can match on email and add properties/values to the existing entities in the graph. Along with the identifier values, I also include the first_name and last_name properties that we used to help in name disambiguation.

In [3]:
%%time
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    session.run("""
        LOAD CSV WITH HEADERS FROM '%(source_path)s/%(source_file)s' AS row
        WITH row
            MATCH (p:Person {email: row.email})
                SET p.id_fpps = row.id_fpps,
                p.first_name = row.first_name,
                p.last_name = row.last_name
    """ % {
        "source_path": isaid_helpers.local_cache_path,
        "source_file": isaid_helpers.f_graphable_sipp_personnel
    })

CPU times: user 1.05 ms, sys: 4.56 ms, total: 5.62 ms
Wall time: 298 ms
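
To make the behavior of the query above concrete, here is a minimal pure-Python sketch of the same match-and-set logic, using plain dictionaries in place of graph nodes. The function name `apply_staff_identifiers` and the sample records are hypothetical; only the property names (`email`, `id_fpps`, `first_name`, `last_name`) come from the query itself.

```python
def apply_staff_identifiers(people, rows):
    """Mimic `MATCH (p:Person {email: row.email}) SET ...` on plain dicts.

    Rows whose email matches no existing person are skipped, as MATCH
    would do; no new person records are created.
    """
    # Index existing people by exact (case-sensitive) email.
    by_email = {}
    for person in people:
        by_email.setdefault(person.get("email"), []).append(person)

    updated = 0
    for row in rows:
        for person in by_email.get(row["email"], []):
            person["id_fpps"] = row["id_fpps"]
            person["first_name"] = row["first_name"]
            person["last_name"] = row["last_name"]
            updated += 1
    return updated


# Hypothetical sample data, for illustration only.
people = [
    {"email": "jdoe@example.gov", "name": "Jane Doe"},
    {"email": "rroe@example.gov", "name": "Richard Roe"},
]
rows = [
    {"email": "jdoe@example.gov", "id_fpps": "1001",
     "first_name": "Jane", "last_name": "Doe"},
    {"email": "nobody@example.gov", "id_fpps": "1002",
     "first_name": "No", "last_name": "Body"},
]
updated_count = apply_staff_identifiers(people, rows)
```

Two consequences are worth keeping in mind: a row with an unmatched email is silently dropped rather than creating a new Person, and the comparison is an exact string match, so differences in letter case between the two email sources would prevent a match.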


# Project Entities
Project entities are introduced to the graph for the first time here. I tried to stick with the same basic data modeling principles of a high-level, simplified set of properties, though we're not yet able to provide a meaningful URL for projects, and the "descriptions" are currently too dense and heterogeneous to be meaningful when presented as a single description. Projects from our internal system are therefore defined much more by the relationships we can build from them to other entities in our graph, including the defined and undefined subjects we are working to build out.

In [6]:
%%time
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    session.run("""
        LOAD CSV WITH HEADERS FROM '%(source_path)s/%(source_file)s' AS row
        WITH row
            MERGE (pr:Project {project_id: row.project_id})
            ON CREATE
                SET pr.name = row.name,
                pr.id_basis_project = row.id_basis_project,
                pr.id_cost_center = row.id_cost_center,
                pr.fbms_code = row.id_cost_center,
                pr.source = row.source,
                pr.type = row.type,
                pr.status = row.status,
                pr.descriptive_texts = row.descriptive_texts,
                pr.id_basis_taskid = row.id_basis_taskid,
                pr.id_basis_subtaskid = row.id_basis_subtaskid,
                pr.basis_task_number = row.basis_task_number,
                pr.basis_subtask_number = row.basis_subtask_number
            ON MATCH
                SET pr.fbms_code = row.id_cost_center
    """ % {
        "source_path": isaid_helpers.local_cache_path,
        "source_file": isaid_helpers.f_graphable_sipp_projects
    })

CPU times: user 2.99 ms, sys: 3.26 ms, total: 6.25 ms
Wall time: 1min 44s
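
The `MERGE ... ON CREATE ... ON MATCH` pattern in the cell above treats new and existing projects differently. The following is a rough Python sketch of that behavior on a dictionary keyed by `project_id`; the function name `merge_project` and the sample rows are hypothetical, while the property names are copied from the query.

```python
# Properties the query copies only when a Project node is first created.
CREATE_ONLY_PROPS = [
    "name", "id_basis_project", "id_cost_center", "source", "type", "status",
    "descriptive_texts", "id_basis_taskid", "id_basis_subtaskid",
    "basis_task_number", "basis_subtask_number",
]


def merge_project(projects, row):
    """Mimic MERGE on project_id with ON CREATE / ON MATCH branches."""
    pid = row["project_id"]
    if pid not in projects:
        # ON CREATE: copy every descriptive property present in the row.
        node = {"project_id": pid}
        for key in CREATE_ONLY_PROPS:
            if row.get(key) is not None:
                node[key] = row[key]
        node["fbms_code"] = row.get("id_cost_center")
        projects[pid] = node
        return "created"
    # ON MATCH: only fbms_code is refreshed; nothing else changes.
    projects[pid]["fbms_code"] = row.get("id_cost_center")
    return "matched"


# Hypothetical rows representing two successive loads of the same project.
projects = {}
first = merge_project(
    projects, {"project_id": "P1", "name": "Old name", "id_cost_center": "CC1"})
second = merge_project(
    projects, {"project_id": "P1", "name": "New name", "id_cost_center": "CC2"})
```

In this sketch, as in the query, re-running the load against an existing project leaves `name`, `status`, and the other descriptive properties at their originally loaded values. If refreshed values are wanted on later runs, those properties would need to be set in the `ON MATCH` branch as well.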


In [8]:
%%time
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    session.run("""
        LOAD CSV WITH HEADERS FROM '%(source_path)s/%(source_file)s' AS row
        WITH row
            WHERE row.source = "BASIS+ Task via SIPP Services"
            MATCH (t:Project {project_id: row.project_id})
            WITH t, row
                MATCH (p:Project {project_id: row.id_basis_project})
                MERGE (t)-[:TASK_OF]->(p)
    """ % {
        "source_path": isaid_helpers.local_cache_path,
        "source_file": isaid_helpers.f_graphable_sipp_projects
    })

CPU times: user 8.48 ms, sys: 5.32 ms, total: 13.8 ms
Wall time: 7min 15s


In [9]:
%%time
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    session.run("""
        LOAD CSV WITH HEADERS FROM '%(source_path)s/%(source_file)s' AS row
        WITH row
            WHERE row.source = "BASIS+ Subtask via SIPP Services"
            MATCH (st:Project {project_id: row.project_id})
            WITH st, row
                MATCH (t:Project {source: "BASIS+ Task via SIPP Services", id_basis_taskid: row.id_basis_taskid})
                MERGE (st)-[:SUBTASK_OF]->(t)
    """ % {
        "source_path": isaid_helpers.local_cache_path,
        "source_file": isaid_helpers.f_graphable_sipp_projects
    })

CPU times: user 4.74 ms, sys: 4.06 ms, total: 8.8 ms
Wall time: 2min 26s
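
The two cells above build a project / task / subtask hierarchy from a single flat file. As a sketch of the linking rules they encode, the Python below derives the same edges from a list of row dictionaries. It assumes every node referenced was loaded from the same file; the function name, the sample rows, and the source label used for the top-level project are hypothetical.

```python
TASK_SOURCE = "BASIS+ Task via SIPP Services"
SUBTASK_SOURCE = "BASIS+ Subtask via SIPP Services"


def build_hierarchy_edges(rows):
    """Return (child_id, relationship, parent_id) tuples for the hierarchy."""
    by_project_id = {r["project_id"]: r for r in rows}

    # Tasks are looked up by id_basis_taskid, restricted to task records.
    tasks_by_taskid = {}
    for r in rows:
        if r["source"] == TASK_SOURCE:
            tasks_by_taskid.setdefault(r["id_basis_taskid"], []).append(r)

    edges = []
    for r in rows:
        if r["source"] == TASK_SOURCE:
            # A task points at the record whose project_id equals
            # the task's id_basis_project.
            parent = by_project_id.get(r["id_basis_project"])
            if parent is not None:
                edges.append((r["project_id"], "TASK_OF", parent["project_id"]))
        elif r["source"] == SUBTASK_SOURCE:
            # A subtask points at every task sharing its id_basis_taskid.
            for task in tasks_by_taskid.get(r["id_basis_taskid"], []):
                edges.append((r["project_id"], "SUBTASK_OF", task["project_id"]))
    return edges


# Hypothetical sample rows; the top-level source label is made up.
sample_rows = [
    {"project_id": "P1", "source": "Example project source",
     "id_basis_project": "P1", "id_basis_taskid": None},
    {"project_id": "T1", "source": TASK_SOURCE,
     "id_basis_project": "P1", "id_basis_taskid": "10"},
    {"project_id": "S1", "source": SUBTASK_SOURCE,
     "id_basis_project": "P1", "id_basis_taskid": "10"},
]
edges = build_hierarchy_edges(sample_rows)
```

Tasks or subtasks whose parent cannot be found simply produce no edge, which mirrors how an unmatched `MATCH` drops the row in Cypher.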


In [10]:
%%time
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    session.run("""
        LOAD CSV WITH HEADERS FROM '%(source_path)s/%(source_file)s' AS row
        WITH row
            WHERE NOT row.project_chief_fppsid IS NULL
            MATCH (pr:Project {project_id: row.project_id})
            WITH pr, row
                MATCH (p:Person {id_fpps: row.project_chief_fppsid})
                MERGE (p)-[r:PROJECT_CHIEF_OF]->(pr)
                SET r.date_qualifier = row.date_qualifier
    """ % {
        "source_path": isaid_helpers.local_cache_path,
        "source_file": isaid_helpers.f_graphable_sipp_projects
    })

CPU times: user 2.5 ms, sys: 2.5 ms, total: 5 ms
Wall time: 1min 19s


In [13]:
%%time
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    session.run("""
        LOAD CSV WITH HEADERS FROM '%(source_path)s/%(source_file)s' AS row
        WITH row
            WHERE NOT row.task_leaders_fppsid IS NULL
            MATCH (pr:Project {project_id: row.project_id})
            UNWIND split(row.task_leaders_fppsid, ",") AS tl_id
                WITH pr, tl_id, row
                    MATCH (p:Person {id_fpps: tl_id})
                    MERGE (p)-[r:TASK_LEADER_OF]->(pr)
                    SET r.date_qualifier = row.date_qualifier
    """ % {
        "source_path": isaid_helpers.local_cache_path,
        "source_file": isaid_helpers.f_graphable_sipp_projects
    })

CPU times: user 5.72 ms, sys: 3.27 ms, total: 8.98 ms
Wall time: 4min 28s


In [14]:
%%time
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    session.run("""
        LOAD CSV WITH HEADERS FROM '%(source_path)s/%(source_file)s' AS row
        WITH row
            WHERE NOT row.subtask_leaders_fppsid IS NULL
            MATCH (pr:Project {project_id: row.project_id})
            UNWIND split(row.subtask_leaders_fppsid, ",") AS tl_id
                WITH pr, tl_id, row
                    MATCH (p:Person {id_fpps: tl_id})
                    MERGE (p)-[r:SUBTASK_LEADER_OF]->(pr)
                    SET r.date_qualifier = row.date_qualifier
    """ % {
        "source_path": isaid_helpers.local_cache_path,
        "source_file": isaid_helpers.f_graphable_sipp_projects
    })

CPU times: user 2.59 ms, sys: 2.23 ms, total: 4.82 ms
Wall time: 1min 27s
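
The task and subtask leader cells rely on `UNWIND split(..., ",")` to fan a comma-separated field out into one row per identifier. Cypher's `split()` does not trim whitespace, so a value such as `"1001, 1002"` would produce `" 1002"`, which would not match any `id_fpps`. Whether the SIPP export contains such spacing is not something this notebook shows, so the following is only a sketch of how the field could be normalized in the data-building step. Both function names are hypothetical.

```python
def leader_pairs(row, column):
    """Mimic UNWIND split(row[column], ","): one (id, project_id) per part."""
    raw = row.get(column)
    if not raw:
        # LOAD CSV yields null for empty fields; the query filters these out.
        return []
    return [(leader_id, row["project_id"]) for leader_id in raw.split(",")]


def normalize_id_list(raw, delimiter=","):
    """Trim whitespace, drop empty parts and duplicates, keep original order."""
    if not raw:
        return None
    seen = []
    for part in raw.split(delimiter):
        part = part.strip()
        if part and part not in seen:
            seen.append(part)
    return delimiter.join(seen) if seen else None


# Hypothetical row showing the difference normalization makes.
row = {"project_id": "T1", "task_leaders_fppsid": "1001, 1002"}
raw_pairs = leader_pairs(row, "task_leaders_fppsid")
row["task_leaders_fppsid"] = normalize_id_list(row["task_leaders_fppsid"])
clean_pairs = leader_pairs(row, "task_leaders_fppsid")
```

Applying the normalization before writing the CSV keeps the Cypher itself unchanged, at the cost of one more step in the data-building notebook.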


In [15]:
%%time
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    session.run("""
        LOAD CSV WITH HEADERS FROM '%(source_path)s/%(source_file)s' AS row
        WITH row
           MATCH (pr:Project {project_id: row.project_id})
           WITH pr, row
               MATCH (o:Organization {fbms_code: row.id_cost_center})
               MERGE (o)-[:CONDUCTS]->(pr)
    """ % {
        "source_path": isaid_helpers.local_cache_path,
        "source_file": isaid_helpers.f_graphable_sipp_projects
    })

CPU times: user 7.3 ms, sys: 5.81 ms, total: 13.1 ms
Wall time: 5min 25s


# Keywords
There are two types of keywords that show up in the internal project system we are using. One of these is a relatively clean source and comes from a more controlled part of the system model. The other is part of a loosely managed collection of other narratives, and the data in that area is really messy. I made an attempt to clean things up somewhat, but it still needs more work, and named entity recognition (NER) may yield better results from this content anyway. Because the more structured keywords are actually pretty reasonable and manageable in number (882 distinct terms), I go ahead and add them to the graph here as UndefinedSubjectMatter entities with an ADDRESSES_SUBJECT relationship from projects. Some of these terms appear to be specialized codes being used by USGS Programs or Mission Areas, so they may be able to get some type of definition with a little more digging.

In [6]:
%%time
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    session.run("""
        LOAD CSV WITH HEADERS FROM '%(source_path)s/%(source_file)s' AS row
        WITH row
            WHERE NOT row.keywords IS NULL
            MATCH (pr:Project {project_id: row.project_id})
            UNWIND split(row.keywords, ";") AS kw
                WITH pr, kw, row
                    MERGE (t:UndefinedSubjectMatter {name: kw})
                    ON CREATE
                        SET t.source = row.source
                    MERGE (pr)-[r:ADDRESSES_SUBJECT]->(t)
                    SET r.date_qualifier = row.date_qualifier
    """ % {
        "source_path": isaid_helpers.local_cache_path,
        "source_file": isaid_helpers.f_graphable_sipp_projects
    })

CPU times: user 2.5 ms, sys: 2.16 ms, total: 4.65 ms
Wall time: 1min 15s
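
A quick way to check how many distinct terms the keyword load will produce is to split the same column outside the graph. The sketch below uses only the standard library and an in-memory CSV; the function name and sample data are hypothetical. Unlike the query, it trims whitespace around each term, which is an assumption about what a cleaned term should look like.

```python
import csv
import io


def distinct_keywords(csv_text, column="keywords", delimiter=";"):
    """Collect the distinct, whitespace-trimmed terms found in one column."""
    terms = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = row.get(column) or ""
        for term in value.split(delimiter):
            term = term.strip()
            if term:
                terms.add(term)
    return terms


# Hypothetical sample standing in for the graphable projects file.
sample_csv = (
    "project_id,keywords\n"
    "P1,groundwater;climate\n"
    "P2,climate; hydrology\n"
    "P3,\n"
)
keywords = distinct_keywords(sample_csv)
```

Because `MERGE (t:UndefinedSubjectMatter {name: kw})` compares names exactly, terms that differ only in case or surrounding spaces become separate nodes in the graph, so a count taken this way can be lower than the number of nodes the query creates.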


# Project Staffing
Relationships between people and projects are a function of the number of hours someone is budgeted to work on a project. We build here on the previous work we did to match people from the internal system to those we pulled into our graph from master data, adding in the internal opaque identifier used everywhere in the internal system. We then summarize the budget information to produce a simple dataset that gives us the personnel ID, project number, and task number, which we can use to identify the subset of projects where we are able to make a "FUNDED_BY" relationship connection. The number of hours is a fluid figure, since this information changes all the time as accounting is ongoing, but we take the maximum of two different accounting numbers (planned and actual) and put that number into an "hours" property on the relationships.

In [4]:
%%time
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    session.run("""
        LOAD CSV WITH HEADERS FROM '%(source_path)s/%(source_file)s' AS row
        WITH row
           MATCH (pr:Project {id_basis_project: row.ProjectNumber, basis_task_number: row.TaskNumber})
           WITH pr, row
               MATCH (p:Person {id_fpps: row.FPPSID})
               MERGE (p)-[f:FUNDED_BY]->(pr)
               SET f.hours = toFloat(row.max_hours)
    """ % {
        "source_path": isaid_helpers.local_cache_path,
        "source_file": isaid_helpers.f_graphable_sipp_staffing
    })

CPU times: user 5.87 ms, sys: 3.51 ms, total: 9.38 ms
Wall time: 4min 57s
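
The summarization described above happens in the data-building step, not in this notebook, so the following is only a sketch of one way it might look. It assumes budget rows carry `planned_hours` and `actual_hours` columns (hypothetical names), sums each per person/project/task, and then takes the larger of the two totals as `max_hours`. Whether the original step sums before taking the maximum is itself an assumption.

```python
from collections import defaultdict


def summarize_staffing(budget_rows):
    """Reduce budget rows to one record per (FPPSID, project, task)."""
    # totals[key] holds [planned_total, actual_total].
    totals = defaultdict(lambda: [0.0, 0.0])
    for r in budget_rows:
        key = (r["FPPSID"], r["ProjectNumber"], r["TaskNumber"])
        totals[key][0] += float(r["planned_hours"])
        totals[key][1] += float(r["actual_hours"])

    return [
        {
            "FPPSID": key[0],
            "ProjectNumber": key[1],
            "TaskNumber": key[2],
            "max_hours": max(planned, actual),
        }
        for key, (planned, actual) in sorted(totals.items())
    ]


# Hypothetical budget rows, for illustration only.
budget_rows = [
    {"FPPSID": "1001", "ProjectNumber": "P1", "TaskNumber": "1",
     "planned_hours": "10", "actual_hours": "12"},
    {"FPPSID": "1001", "ProjectNumber": "P1", "TaskNumber": "1",
     "planned_hours": "5", "actual_hours": "1"},
    {"FPPSID": "1002", "ProjectNumber": "P1", "TaskNumber": "2",
     "planned_hours": "0", "actual_hours": "8"},
]
staffing = summarize_staffing(budget_rows)
```

The output columns match the names the query reads (`FPPSID`, `ProjectNumber`, `TaskNumber`, `max_hours`). Note that `LOAD CSV` delivers every field as a string, so `max_hours` arrives in Cypher as text unless it is converted there.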
