# ScienceBase People

The ScienceBase Directory is our most comprehensive source of current and former USGS staff because it synchronizes regularly with the internal Active Directory. For the current USGS-centric graph exercise, we have narrowed the source data to current and former USGS personnel, although the ScienceBase Directory also holds records for people well beyond that scope. These records introduce a number of new or confirming properties for people, along with new related nodes (organizations, job titles, locations, and supervisors) that are handled incrementally in the processes below.

For building our graph of everything USGS, the ScienceBase Directory serves as a basic point of reference that should give us the vast majority of person records to connect to other information sources. It provides two key identifiers: email addresses (unique only within a given time period, and sometimes our only reference point) and ORCID identifiers (not available for all staff).

In [1]:
import isaid_helpers
import pandas as pd

In [10]:
df_sb_people = pd.read_csv(isaid_helpers.f_graphable_sb_people)
df_sb_people.head()

Unnamed: 0,name,last_name,url,email,source_id_sb_directory,fbms_code,active,last_updated,first_name,middle_name,...,address_line_2,city,state,zip,country,string_address,supervisor_name,supervisor_email,supervisor_uri,orcid
0,Hailey (Contractor) M Alspaugh,Alspaugh,,halspaugh@contractor.usgs.gov,https://www.sciencebase.gov/directory/person/7...,GGENLM0000,False,2020-02-12T07:00:00Z,Hailey (Contractor),M,...,,Richmond,VA,23228,US,"1730 East Parham Road, Richmond, VA 23228",Douglas L Moyer,dlmoyer@usgs.gov,https://www.sciencebase.gov/directory/person/7252,
1,Annika G Bollesen,Bollesen,,abollesen@contractor.usgs.gov,https://www.sciencebase.gov/directory/person/7...,GGEMNN0000,False,2019-10-17T06:00:00Z,Annika,G,...,,Jamestown,ND,58401,US,"8711 37Th Street SE, Jamestown, ND 58401",,,,
2,Adam C Cole,Cole,,accole@usgs.gov,https://www.sciencebase.gov/directory/person/7...,GGESMR0000,False,2020-05-21T06:00:00Z,Adam,C,...,,Lafayette,LA,70506,,"700 Cajundome Blvd., Lafayette, LA 70506",Jacoby Carter,carterj@usgs.gov,https://www.sciencebase.gov/directory/person/1617,
3,William (Contractor) D. Twiner,Twiner,,wtwiner@contractor.usgs.gov,https://www.sciencebase.gov/directory/person/7...,GGHWDJ3100,False,2020-12-03T07:00:00Z,William (Contractor),D,...,,Bay St Louis,MS,39529,US,"Buildings 2101 2204, Bay St Louis, MS 39529",Teri N Snazelle,tsnazelle@usgs.gov,https://www.sciencebase.gov/directory/person/5...,
4,Heather N Adams,Adams,,hadams@usgs.gov,https://www.sciencebase.gov/directory/person/7...,GGENLQ0000,False,2019-09-26T06:00:00Z,Heather,N,...,,Woods Hole,MA,2543,,"384 Woods Hole Road, Woods Hole, MA 02543",Janet L Paquette,jpaquette@usgs.gov,https://www.sciencebase.gov/directory/person/7847,
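
Because email addresses are only unique within a time period and ORCIDs are incomplete, it can help to look at identifier coverage before loading. The following is a minimal sketch of such a check, assuming a small made-up DataFrame in place of `df_sb_people`; `identifier_summary` is a hypothetical helper and is not part of `isaid_helpers`.

```python
import pandas as pd

# Toy records standing in for the directory cache; all values are made up.
df = pd.DataFrame(
    {
        "email": ["a@usgs.gov", "b@usgs.gov", "b@usgs.gov", None],
        "orcid": ["0000-0001-2345-6789", None, None, None],
        "active": [True, True, False, True],
    }
)


def identifier_summary(people: pd.DataFrame) -> dict:
    """Count records lacking each identifier and list emails that repeat."""
    emails = people["email"].dropna()
    return {
        "records": len(people),
        "missing_email": int(people["email"].isna().sum()),
        "missing_orcid": int(people["orcid"].isna().sum()),
        "duplicated_emails": sorted(emails[emails.duplicated()].unique()),
    }


summary = identifier_summary(df)
print(summary)
```

Run against the real cache, the same function would take `df_sb_people` as its argument. A repeated email is not necessarily an error, since the same address can belong to a former and a current record.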


In [None]:
%%time
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    session.run("""
        LOAD CSV WITH HEADERS FROM '%(source_path)s/%(source_file)s' AS row
        WITH row
            WHERE row.active = "True" AND NOT row.email IS NULL
                MERGE (p:Person {email: row.email})
                // SET directly after MERGE runs on both create and match,
                // keeping node properties in sync with the latest cache.
                SET p.orcid = row.orcid,
                    p.name = row.name,
                    p.url = row.url,
                    p.active = row.active,
                    p.source_id_sb_directory = row.source_id_sb_directory,
                    p.last_name = row.last_name,
                    p.first_name = row.first_name,
                    p.middle_name = row.middle_name,
                    p.fbms_code = row.fbms_code,
                    p.organization_name = row.organization_name,
                    p.job_title = row.job_title,
                    p.last_updated = row.last_updated
        """ % {
        "source_path": isaid_helpers.local_cache_path,
        "source_file": isaid_helpers.f_graphable_sb_people
    })


In [3]:
%%time
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    session.run("""
        LOAD CSV WITH HEADERS FROM '%(source_path)s/%(source_file)s' AS row
        WITH row
            WHERE row.active = "True" AND NOT row.email IS NULL
                MATCH (p:Person {email: row.email})
                    
        WITH p, row
            WHERE NOT row.fbms_code IS NULL
                MATCH (o:Organization {fbms_code: row.fbms_code})
                MERGE (p)-[e:EMPLOYED_BY]->(o)
                    SET e.date_qualifier = row.last_updated,
                    e.reference = row.source_id_sb_directory
        """ % {
        "source_path": isaid_helpers.local_cache_path,
        "source_file": isaid_helpers.f_graphable_sb_people
    })


CPU times: user 1.75 ms, sys: 1.6 ms, total: 3.35 ms
Wall time: 31.3 s


In [4]:
%%time
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    session.run("""
        LOAD CSV WITH HEADERS FROM '%(source_path)s/%(source_file)s' AS row
        WITH row
            WHERE row.active = "True"
                MATCH (p:Person {email: row.email})
                    
        WITH p, row
            WHERE NOT row.job_title IS NULL
                MERGE (t:JobTitle {name: row.job_title})
                MERGE (p)-[jt:JOB_TITLE]->(t)
                    SET jt.date_qualifier = row.last_updated,
                    jt.reference = row.source_id_sb_directory
        """ % {
        "source_path": isaid_helpers.local_cache_path,
        "source_file": isaid_helpers.f_graphable_sb_people
    })


CPU times: user 1.03 ms, sys: 2.17 ms, total: 3.19 ms
Wall time: 502 ms


In [5]:
%%time
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    session.run("""
        LOAD CSV WITH HEADERS FROM '%(source_path)s/%(source_file)s' AS row
        WITH row
            WHERE row.active = "True"
                MATCH (p:Person {email: row.email})
                    
        WITH p, row
            WHERE NOT row.location_name IS NULL
                MERGE (l:Location {name: row.location_name})
                ON CREATE
                    SET l.description = row.location_description,
                    l.building_code = row.building_code,
                    l.address_line_1 = row.address_line_1,
                    l.address_line_2 = row.address_line_2,
                    l.city = row.city,
                    l.state = row.state,
                    l.zip = row.zip,
                    l.country = row.country,
                    l.string_address = row.string_address
                MERGE (p)-[loc:LOCATED_IN]->(l)
                    SET loc.date_qualifier = row.last_updated,
                    loc.reference = row.source_id_sb_directory
        """ % {
        "source_path": isaid_helpers.local_cache_path,
        "source_file": isaid_helpers.f_graphable_sb_people
    })


CPU times: user 1.27 ms, sys: 1.41 ms, total: 2.68 ms
Wall time: 566 ms


In [6]:
%%time
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    session.run("""
        LOAD CSV WITH HEADERS FROM '%(source_path)s/%(source_file)s' AS row
        WITH row
            WHERE row.active = "True" AND NOT row.supervisor_email IS NULL
                MATCH (p:Person {email: row.email})

        WITH p, row
                MATCH (s:Person {email: row.supervisor_email})
                MERGE (p)-[sup:SUPERVISED_BY]->(s)
                    SET sup.date_qualifier = row.last_updated,
                    sup.reference = row.source_id_sb_directory
        """ % {
        "source_path": isaid_helpers.local_cache_path,
        "source_file": isaid_helpers.f_graphable_sb_people
    })

CPU times: user 1.04 ms, sys: 1.3 ms, total: 2.35 ms
Wall time: 622 ms
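
The load cells above all repeat the same `LOAD CSV` preamble and string substitution. As a minimal sketch, that repetition could be factored into a small helper. `load_csv_query` is a hypothetical name, and the path and file name below are placeholders rather than the real values of `isaid_helpers.local_cache_path` and `isaid_helpers.f_graphable_sb_people`.

```python
def load_csv_query(source_path: str, source_file: str, body: str) -> str:
    """Prefix a Cypher body with the LOAD CSV preamble used in the cells above."""
    preamble = "LOAD CSV WITH HEADERS FROM '%s/%s' AS row" % (source_path, source_file)
    return preamble + "\n" + body.strip()


# Hypothetical path and file name, for illustration only.
query = load_csv_query(
    "file:///cache",
    "sb_people.csv",
    """
    WITH row
        WHERE row.active = "True" AND NOT row.email IS NULL
            MATCH (p:Person {email: row.email})
    RETURN count(p)
    """,
)
print(query)
```

The resulting string could then be passed to `session.run(...)` in the same way as the inline queries.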


In [17]:
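# Make sure we got all active people from the cache: list active records
# whose email is not on any Person node (an empty result means none were missed)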
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    results = session.run("""
    MATCH (p:Person)
    RETURN p.name, p.email, p.orcid
    """)
    persons_in_graph = results.data()

emails_in_graph = [i["p.email"] for i in persons_in_graph if "p.email" in i and i["p.email"] is not None]
df_sb_people.loc[df_sb_people.active & ~df_sb_people.email.isin(emails_in_graph)]
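
The check above also returns active records that have no email, because a null email is never found in `emails_in_graph` even though the load step skips those rows on purpose. The following is a minimal sketch of a variant that reports only records that should have been loaded, assuming the `active` column is boolean (as `pd.read_csv` infers when every value is `True` or `False`). The function name and sample data are hypothetical.

```python
import pandas as pd


def active_people_missing_from_graph(people, emails_in_graph):
    """Return active rows that have an email the graph does not contain."""
    has_email = people["email"].notna()
    in_graph = people["email"].isin(list(emails_in_graph))
    # Assumes people["active"] is a boolean column, not the strings "True"/"False".
    return people.loc[people["active"] & has_email & ~in_graph]


# Made-up records: one loaded, one missed, one without email, one inactive.
people = pd.DataFrame(
    {
        "email": ["a@usgs.gov", "b@usgs.gov", None, "d@usgs.gov"],
        "active": [True, True, True, False],
    }
)
missing = active_people_missing_from_graph(people, ["a@usgs.gov"])
print(missing)
```

Keeping the null-email rows in a separate report may still be useful, since those people cannot be merged on email at all.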