In processing USGS Profiles, we currently count on having already processed a primary source of personnel records from the ScienceBase Directory into our graph. Therefore, in working through the scraped profiles, we only match on email address and then process whatever we find. We could well end up with cases where some Profile Pages do not represent people we already know about, or where they contain erroneous information that can't be linked to the firmer source, the ScienceBase Directory.
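
Because unmatched profiles are skipped silently by the email match, it can help to list them before loading. The following is a minimal sketch using plain Python collections; `unmatched_emails` is a hypothetical helper (not part of `isaid_helpers`), and the optional normalization (trimming and lowercasing) is an assumption that goes beyond the exact string comparison used in the graph match.

```python
def unmatched_emails(profile_emails, known_emails, normalize=False):
    """Return profile emails with no counterpart among the known emails.

    With normalize=False the comparison is an exact string match.
    With normalize=True both sides are trimmed and lowercased first
    (an assumption, useful for spotting near misses).
    """
    def prepare(value):
        return value.strip().lower() if normalize else value

    # Ignore missing values (for example NaN floats read from a CSV)
    known = {prepare(e) for e in known_emails if isinstance(e, str)}

    missing = []
    seen = set()
    for email in profile_emails:
        if not isinstance(email, str):
            continue
        key = prepare(email)
        if key not in known and key not in seen:
            seen.add(key)
            missing.append(email)
    return missing
```

In practice the first argument could be the `email` column of the profiles CSV and the second the emails returned by a query over `Person` nodes. Comparing the exact and normalized results gives a quick view of profiles that fail to match only because of case or stray whitespace.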

In [22]:
import isaid_helpers
import pandas as pd

In [23]:
pd.read_csv(isaid_helpers.f_graphable_profiles).head()

Unnamed: 0,profile,_date_cached,description,profile_image_url,email,orcid
0,https://www.usgs.gov/staff-profiles/esther-d-s...,2021-04-14T10:31:35.191741,,https://prd-wret.s3.us-west-2.amazonaws.com/as...,estroh@usgs.gov,0000-0003-4291-4647
1,https://www.usgs.gov/staff-profiles/morgan-b-w...,2021-04-14T13:59:25.432326,,https://prd-wret.s3.us-west-2.amazonaws.com/as...,mbwallace@usgs.gov,
2,https://www.usgs.gov/staff-profiles/erin-coenen,2021-05-09T21:41:38.687056,,https://prd-wret.s3.us-west-2.amazonaws.com/as...,ecoenen@usgs.gov,0000-0003-2470-3854
3,https://www.usgs.gov/staff-profiles/jessica-dyke,2021-05-14T15:34:09.722198,,https://prd-wret.s3.us-west-2.amazonaws.com/as...,jldyke@usgs.gov,
4,https://www.usgs.gov/staff-profiles/ian-michae...,2021-05-19T16:45:03.507388,,https://prd-wret.s3.us-west-2.amazonaws.com/as...,irogers@usgs.gov,0000-0001-8492-5358


In [24]:
%%time
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    session.run("""
        LOAD CSV WITH HEADERS FROM '%(source_path)s/%(source_file)s' AS row
        WITH row
            MATCH (p:Person {email: row.email}) 
                SET p.source_id_usgs_profiles = row.profile,
                p.description = row.description,
                p.image = row.profile_image_url

        WITH p, row
        WHERE NOT row.title IS NULL
            MERGE (t:JobTitle {name: row.title})
            MERGE (p)-[rt:JOB_TITLE]->(t)
                ON CREATE
                    SET rt.date_qualifier = row._date_cached,
                    rt.reference = row.profile
            """ % {
        "source_path": isaid_helpers.local_cache_path,
        "source_file": isaid_helpers.f_graphable_profiles
    })

CPU times: user 2.46 ms, sys: 7.41 ms, total: 9.87 ms
Wall time: 203 ms
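
Cypher returns null for a map key that does not exist, so a reference such as `row.title` to a column that is absent from the CSV makes the `WHERE NOT row.title IS NULL` filter drop every row without raising an error. A minimal sketch of a header check that could be run before a load, using only the standard library; `read_header` and `missing_columns` are hypothetical helpers, not part of `isaid_helpers`:

```python
import csv


def read_header(csv_path):
    """Return the column names from the first line of a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle), [])


def missing_columns(header, required):
    """Return the required column names that are not in the header."""
    present = set(header)
    return [name for name in required if name not in present]
```

For the statement above, the required list would be `["email", "profile", "description", "profile_image_url", "title", "_date_cached"]`; an empty result means every column the query references is present.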


In [25]:
%%time
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    session.run("""
        LOAD CSV WITH HEADERS FROM '%(source_path)s/%(source_file)s' AS row
        WITH row
            MATCH (p:Person {email: row.email}) 

        WITH p, row
        WHERE NOT row.expertise_term IS NULL
            MERGE (e:Expertise {name: row.expertise_term})
            ON CREATE
                SET e.source = "USGS Profile Pages"
            MERGE (p)-[pe:HAS_EXPERTISE]->(e)
                ON CREATE
                    SET pe.date_qualifier = row.date_qualifier,
                    pe.reference = row.reference
    """ % {
        "source_path": isaid_helpers.local_cache_path,
        "source_file": isaid_helpers.f_graphable_profile_expertise
    })

CPU times: user 1.33 ms, sys: 2.13 ms, total: 3.46 ms
Wall time: 239 ms


From some profiles, we have clues about creative works that a person has contributed to, via links they placed in the body of their pages and that the scraper pulled in. Some of these have already shown up via other means, so we first need to check what the graph already knows, so that we don't overwrite what is probably more thorough information collected elsewhere about any work for which we determined a DOI. The following checks the graph, flags profile creative works that are already in the graph, and re-saves the CSV file for processing.

In [26]:
works_in_profiles = pd.read_csv(isaid_helpers.f_graphable_profile_creative_works)

with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    results = session.run("""
    MATCH (w:CreativeWork)
    WHERE NOT w.doi IS NULL
    RETURN w
    """)
    works_in_graph = results.data()
    
# Collect the DOIs already in the graph once, so each lookup is a set membership test
dois_in_graph = {i["w"]["doi"] for i in works_in_graph}

def work_in_graph(doi):
    # pandas reads an empty doi cell as NaN rather than None
    if pd.isna(doi):
        return False
    return doi in dois_in_graph

works_in_profiles["work_in_graph"] = works_in_profiles["doi"].apply(work_in_graph)
# index=False keeps another "Unnamed: 0" column from being added on each re-save
works_in_profiles.to_csv(isaid_helpers.f_graphable_profile_creative_works, index=False)
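
The flagging step compares DOIs as exact strings. DOI names are case-insensitive, and scraped links often carry a resolver prefix, so the same work can appear under different spellings. A minimal sketch of a normalizer that could be applied to both sides before comparing; `normalize_doi` and the prefix list are assumptions for illustration, and it only helps if the DOIs in the graph are normalized the same way.

```python
# Resolver and scheme prefixes commonly found in front of a DOI (assumed list)
_DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(value):
    """Return a lowercased DOI without a resolver prefix, or None if missing."""
    # Missing values read by pandas arrive as NaN floats, not strings
    if not isinstance(value, str):
        return None
    doi = value.strip().lower()
    for prefix in _DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    return doi or None
```

For example, `normalize_doi("http://dx.doi.org/10.1002/JWMG.1073")` and `normalize_doi("10.1002/jwmg.1073")` yield the same string, so the two spellings compare as equal.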

In [27]:
works_in_profiles.head()

Unnamed: 0.1,Unnamed: 0,email,orcid,url,doi,title,date_qualifier,reference,work_in_graph
0,0,dacox@usgs.gov,0000-0001-8302-3643,http://peer.berkeley.edu/news/2018/04/haywired...,,"HayWired Scenario Rollout: April 18, 2018",2021-06-06T14:29:37.920533,https://www.usgs.gov/staff-profiles/dale-alan-cox,False
1,1,eglenn@usgs.gov,0000-0001-9573-5410,https://journals.plos.org/plosone/article?id=1...,10.1371/journal.pone.0210643,Conservation planning for species recovery und...,2021-06-06T14:38:51.007528,https://www.usgs.gov/staff-profiles/elizabeth-...,True
2,2,dhaukos@usgs.gov,,http://dx.doi.org/10.1002/jwmg.1073,10.1002/jwmg.1073,Lesser prairie-chicken fence collision rates f...,2021-06-06T15:52:34.113621,https://www.usgs.gov/staff-profiles/dave-haukos,True
3,3,mrubenstein@usgs.gov,0000-0001-8569-781X,https://www.tandfonline.com/doi/abs/10.1080/20...,10.1080/20430779.2013.874260,A new tool to quantify carbon dioxide emission...,2021-06-06T16:47:00.736531,https://www.usgs.gov/staff-profiles/madeleine-...,True
4,4,burkardtn@usgs.gov,,http://www.aspanet.org/public/ASPADocs/PAR/T2P...,,Mathematical models frame environmental disput...,2021-06-07T14:01:20.513001,https://www.usgs.gov/staff-profiles/nina-burkardt,False


In [28]:
%%time
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    session.run("""
        LOAD CSV WITH HEADERS FROM '%(source_path)s/%(source_file)s' AS row
        WITH row
            MATCH (p:Person {email: row.email}) 

        WITH p, row
            WHERE row.work_in_graph = "True" AND NOT row.doi IS NULL
                MATCH (cw:CreativeWork {doi: row.doi})
                MERGE (p)-[c:CONTRIBUTED_TO]->(cw)
                    ON CREATE
                        SET c.date_qualifier = row.date_qualifier,
                        c.reference = row.reference

        WITH p, row
            WHERE row.work_in_graph = "True" AND row.doi IS NULL
                MATCH (cw:CreativeWork {name: row.title})
                MERGE (p)-[c:CONTRIBUTED_TO]->(cw)
                    ON CREATE
                        SET c.date_qualifier = row.date_qualifier,
                        c.reference = row.reference

        WITH p, row
            WHERE row.work_in_graph = "False" AND NOT row.doi IS NULL
                MERGE (cw:CreativeWork {doi: row.doi})
                    ON CREATE
                        SET cw.name = row.title,
                        cw.url = row.url,
                        cw.original_source = "USGS Profile Pages"
                MERGE (p)-[c:CONTRIBUTED_TO]->(cw)
                    ON CREATE
                        SET c.date_qualifier = row.date_qualifier,
                        c.reference = row.reference

        WITH p, row
            WHERE row.work_in_graph = "False" AND row.doi IS NULL
                MERGE (cw:CreativeWork {name: row.title})
                    ON CREATE
                        SET cw.url = row.url,
                        cw.original_source = "USGS Profile Pages"
                MERGE (p)-[c:CONTRIBUTED_TO]->(cw)
                    ON CREATE
                        SET c.date_qualifier = row.date_qualifier,
                        c.reference = row.reference
    """ % {
        "source_path": isaid_helpers.local_cache_path,
        "source_file": isaid_helpers.f_graphable_profile_creative_works
    })

CPU times: user 1.03 ms, sys: 1.18 ms, total: 2.21 ms
Wall time: 109 ms
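
In Cypher, a `WITH ... WHERE` clause filters the rows handed to every later clause, so several `WITH p, row WHERE ...` blocks in one statement act as a cumulative chain: a row dropped by the first filter never reaches the later ones. A minimal sketch of one way to keep the four cases independent is to issue one statement per case. The names `PREFIX`, `LINK`, `BRANCHES`, `build_statements`, and `run_statements` are hypothetical, and passing the CSV location as the `$csv_url` query parameter is a choice made for this sketch rather than something taken from the notebook.

```python
# Shared opening: load rows, match the person, then apply one case filter
PREFIX = """
LOAD CSV WITH HEADERS FROM $csv_url AS row
MATCH (p:Person {email: row.email})
WITH p, row
WHERE row.work_in_graph = "%(flag)s" AND %(doi_test)s
"""

# Shared ending: link the person to the work found or created by the case
LINK = """
MERGE (p)-[c:CONTRIBUTED_TO]->(cw)
    ON CREATE SET c.date_qualifier = row.date_qualifier,
                  c.reference = row.reference
"""

# (flag value, DOI test, clause that finds or creates the work) per case
BRANCHES = [
    ("True", "NOT row.doi IS NULL", "MATCH (cw:CreativeWork {doi: row.doi})"),
    ("True", "row.doi IS NULL", "MATCH (cw:CreativeWork {name: row.title})"),
    ("False", "NOT row.doi IS NULL",
     "MERGE (cw:CreativeWork {doi: row.doi})\n"
     "    ON CREATE SET cw.name = row.title, cw.url = row.url,\n"
     '                  cw.original_source = "USGS Profile Pages"'),
    ("False", "row.doi IS NULL",
     "MERGE (cw:CreativeWork {name: row.title})\n"
     "    ON CREATE SET cw.url = row.url,\n"
     '                  cw.original_source = "USGS Profile Pages"'),
]


def build_statements():
    """Return one complete Cypher statement per case."""
    return [
        PREFIX % {"flag": flag, "doi_test": doi_test} + work_clause + LINK
        for flag, doi_test, work_clause in BRANCHES
    ]


def run_statements(session, csv_url):
    """Run each case as its own statement so no filter affects another."""
    for statement in build_statements():
        session.run(statement, {"csv_url": csv_url}).consume()
```

With the session opened as in the cell above, this could be called as `run_statements(session, f"{isaid_helpers.local_cache_path}/{isaid_helpers.f_graphable_profile_creative_works}")`. Each statement reads the CSV again, which costs a little time but keeps every case's filter separate.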
