This notebook explores running a SPARQL query against the GeoKB to pull out a graph for analysis and loading it into Neo4j, which offers better analytical possibilities for certain inquiries. There are many ways to load data into Neo4j and many ways to organize the nodes, properties, and edges. This particular example was prompted by a question from folks in the Midcontinent Region who wanted to explore co-author connections across Science Centers.

We start with a query that pulls all Midcontinent organizational units, the affiliations of people with those organizations, and the articles those people have authored with a publication year after 2017. We load all three entity types as nodes with a few attributes and then create the relationships between them. I retain the full URL identifiers for the QIDs in the GeoKB and mostly stick with the same basic semantics for entity attributes and relationship properties. In the final code cell, I run a query in Neo4j that produces a table of the basic interconnections between authors working across different Science Centers.

In [1]:
from neo4j import GraphDatabase

import sys
sys.path.append('../harvesters/')

from wbmaker import WikibaseConnection

In [2]:
geokb = WikibaseConnection('GEOKB_CLOUD')

In [3]:
neo4j_driver = GraphDatabase.driver("bolt://localhost:7687")
neo4j_session = neo4j_driver.session()
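
Because the loading cells in this notebook MERGE nodes on an `id` property, a uniqueness constraint per label may help keep those lookups indexed and guard against duplicate nodes. The following is a minimal sketch, assuming a Neo4j version that accepts the `FOR ... REQUIRE` constraint syntax (4.4 or later, as far as I know). The `constraint_statements` helper and the constraint names are hypothetical and not part of the original notebook.

```python
# Labels used by the loading cells in this notebook.
NODE_LABELS = ["Organization", "Person", "Article"]


def constraint_statements(labels):
    """Build one uniqueness-constraint statement per node label (hypothetical helper)."""
    return [
        f"CREATE CONSTRAINT {label.lower()}_id_unique IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.id IS UNIQUE"
        for label in labels
    ]


statements = constraint_statements(NODE_LABELS)
for statement in statements:
    print(statement)
    # With a live session, each statement could be run as:
    # neo4j_session.run(statement)
```

Running the statements before loading any nodes means later `MERGE` calls match against an indexed property.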

In [10]:
query_midcontinent = """
PREFIX ge: <https://geokb.wikibase.cloud/entity/>
PREFIX gp: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?org ?orgLabel ?orgAltLabel ?orgDescription ?subsidiary_of ?org_instance_ofLabel ?org_website
?person ?personLabel ?personAltLabel ?personDescription ?affiliated_with ?person_website
?article ?articleLabel ?articleDescription (YEAR(?publication_date) AS ?pub_year) ?doi ?article_instance_ofLabel
WHERE {
  ?org gp:P190* ge:Q44363 ;
       gp:P190 ?subsidiary_of ;
       gp:P1 ?org_instance_of .
  OPTIONAL {
    ?org gp:P145 ?org_website .
  }
  ?person gp:P108 ?org ;
          gp:P108 ?affiliated_with .
  OPTIONAL {
    ?person gp:P145 ?person_website .
  }
  ?article gp:P102 ?person ;
           gp:P1 ?article_instance_of ;
           gp:P7 ?publication_date .
  OPTIONAL {
    ?article gp:P74 ?doi .
  }
  FILTER (YEAR(?publication_date) > 2017)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

midcontinent_pub_record = geokb.sparql_query(query_midcontinent)


In [28]:
org_nodes = midcontinent_pub_record[['org', 'orgLabel', 'orgAltLabel', 'orgDescription', 'org_website', 'org_instance_ofLabel']].drop_duplicates().reset_index(drop=True)
person_nodes = midcontinent_pub_record[['person', 'personLabel', 'personAltLabel', 'personDescription', 'person_website']].drop_duplicates().reset_index(drop=True)
article_nodes = midcontinent_pub_record[['article', 'articleLabel', 'articleDescription', 'pub_year', 'doi', 'article_instance_ofLabel']].drop_duplicates().reset_index(drop=True)

org_relationships = midcontinent_pub_record[['org', 'subsidiary_of']].drop_duplicates().reset_index(drop=True)
person_relationships = midcontinent_pub_record[['person', 'affiliated_with']].drop_duplicates().reset_index(drop=True)
article_relationships = midcontinent_pub_record[['article', 'person']].drop_duplicates().reset_index(drop=True)

In [22]:
org_nodes.head()

Unnamed: 0,org,orgLabel,orgAltLabel,orgDescription,org_website,org_instance_ofLabel
0,https://geokb.wikibase.cloud/entity/Q44222,Northern Rocky Mountain Science Center,,A USGS Science Center,https://www.usgs.gov/centers/norock,USGS Science Center
1,https://geokb.wikibase.cloud/entity/Q44228,Great Lakes Science Center,GLSC,A USGS Science Center,https://www.usgs.gov/centers/great-lakes-scien...,USGS Science Center
2,https://geokb.wikibase.cloud/entity/Q44248,Earth Resources Observation and Science Center,Earth Resources Observation and Science (EROS)...,A USGS Science Center,https://www.usgs.gov/centers/eros,USGS Science Center
3,https://geokb.wikibase.cloud/entity/Q44256,Northern Prairie Wildlife Research Center,,A USGS Science Center,https://www.usgs.gov/centers/northern-prairie-...,USGS Science Center
4,https://geokb.wikibase.cloud/entity/Q44246,Dakota Water Science Center,"North Dakota Water Science Center, South Dakot...",A USGS Water Science Center,https://www.usgs.gov/centers/dakota-water,USGS Water Science Center


In [24]:
for _, row in org_nodes.iterrows():
    neo4j_session.run("""
        MERGE (n:Organization {id: $id})
        ON CREATE SET n.label = $label,
                      n.altLabel = $altLabel,
                      n.description = $description,
                      n.website = $website,
                      n.instance_of = $instance_of
        """, 
        id=row['org'], 
        label=row['orgLabel'], 
        altLabel=row['orgAltLabel'], 
        description=row['orgDescription'],
        website=row['org_website'], 
        instance_of=row['org_instance_ofLabel'])
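
The loop above issues one query per row and passes along whatever missing-value marker the DataFrame holds (typically NaN). As an alternative sketch, the rows can be cleaned first and sent in a single `UNWIND` query. The helper names (`clean_value`, `to_rows`, `load_organizations`) are hypothetical, and the sketch assumes missing values arrive as `None` or float NaN. With the notebook's DataFrame, the input list could come from `org_nodes.to_dict('records')`.

```python
import math

UNWIND_ORGS = """
UNWIND $rows AS row
MERGE (n:Organization {id: row.id})
ON CREATE SET n.label = row.label,
              n.altLabel = row.altLabel,
              n.description = row.description,
              n.website = row.website,
              n.instance_of = row.instance_of
"""

# DataFrame column -> node property, mirroring the loop above.
ORG_COLUMNS = {
    "org": "id",
    "orgLabel": "label",
    "orgAltLabel": "altLabel",
    "orgDescription": "description",
    "org_website": "website",
    "org_instance_ofLabel": "instance_of",
}


def clean_value(value):
    """Turn float NaN into None so the property is left unset in Neo4j."""
    if isinstance(value, float) and math.isnan(value):
        return None
    return value


def to_rows(records, columns):
    """Rename columns to property names and clean missing values."""
    return [
        {prop: clean_value(record.get(col)) for col, prop in columns.items()}
        for record in records
    ]


def load_organizations(session, records):
    """Send all organization rows in one query; `session` is a Neo4j session."""
    rows = to_rows(records, ORG_COLUMNS)
    session.run(UNWIND_ORGS, rows=rows)
    return rows


# One record taken from the org_nodes output above; its alt label is missing.
sample = [{
    "org": "https://geokb.wikibase.cloud/entity/Q44222",
    "orgLabel": "Northern Rocky Mountain Science Center",
    "orgAltLabel": float("nan"),
    "orgDescription": "A USGS Science Center",
    "org_website": "https://www.usgs.gov/centers/norock",
    "org_instance_ofLabel": "USGS Science Center",
}]
rows = to_rows(sample, ORG_COLUMNS)
print(rows[0]["altLabel"])  # None
```

The same pattern would apply to the person and article nodes with their own column mappings and labels.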

In [23]:
person_nodes.head()

Unnamed: 0,person,personLabel,personAltLabel,personDescription,person_website
0,https://geokb.wikibase.cloud/entity/Q44429,"Robert Al-Chokhachy, Ph.D.","Robert Al-Chokhachy, Ph.D.",Research Fish Biologist,https://www.usgs.gov/staff-profiles/robert-al-...
1,https://geokb.wikibase.cloud/entity/Q44747,"Erik Beever, Ph.D.","Erik Beever, Ph.D.",Research Ecologist,https://www.usgs.gov/staff-profiles/erik-beever
2,https://geokb.wikibase.cloud/entity/Q45150,"Muruleedhara (Murulee) N Byappanahalli, PhD",,Research Microbiologist,https://www.usgs.gov/staff-profiles/muruleedha...
3,https://geokb.wikibase.cloud/entity/Q45150,"Muruleedhara (Murulee) N Byappanahalli, PhD",,Research Microbiologist,https://www.usgs.gov/staff-profiles/muruleedha...
4,https://geokb.wikibase.cloud/entity/Q45372,"Joseph Clark, Ph.D.",,Supervisory Research Ecologist,https://www.usgs.gov/staff-profiles/joseph-clark


In [25]:
for _, row in person_nodes.iterrows():
    neo4j_session.run("""
        MERGE (n:Person {id: $id})
        ON CREATE SET n.label = $label,
                      n.altLabel = $altLabel,
                      n.description = $description,
                      n.website = $website
        """, 
        id=row['person'], 
        label=row['personLabel'], 
        altLabel=row['personAltLabel'], 
        description=row['personDescription'],
        website=row['person_website']
    )

In [26]:
article_nodes.head()

Unnamed: 0,article,articleLabel,articleDescription,pub_year,doi,article_instance_ofLabel
0,https://geokb.wikibase.cloud/entity/Q144919,Proactive Rainbow Trout suppression reduces th...,a Article (Journal Article) published by Ameri...,2018,10.1002/nafm.10177,scholarly article
1,https://geokb.wikibase.cloud/entity/Q145083,Long-term population dynamics and conservation...,a Article (Journal Article) published by Canad...,2018,10.1139/cjfas-2017-0466,scholarly article
2,https://geokb.wikibase.cloud/entity/Q145302,Vegetation responses to sagebrush-reduction tr...,a Article (Journal Article) published by Elsev...,2018,10.1016/j.ecolind.2017.12.033,scholarly article
3,https://geokb.wikibase.cloud/entity/Q145370,Environmental DNA (eDNA): A tool for quantifyi...,a Article (Journal Article) published by PLOS ...,2018,10.1371/journal.pone.0191720,scholarly article
4,https://geokb.wikibase.cloud/entity/Q145445,Estimating population extinction thresholds wi...,a Article (Journal Article) published by Publi...,2018,10.1371/journal.pone.0191435,scholarly article


In [27]:
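import pandas as pd

for _, row in article_nodes.iterrows():
    # The DOI comes from an OPTIONAL clause in the SPARQL query, so it can be
    # missing; pass None rather than storing a URL like "https://doi.org/nan".
    doi_url = f"https://doi.org/{row['doi']}" if pd.notna(row['doi']) else None
    neo4j_session.run("""
        MERGE (n:Article {id: $id})
        ON CREATE SET n.label = $label,
                      n.description = $description,
                      n.published = $published,
                      n.doi = $doi,
                      n.instance_of = $instance_of
        """, 
        id=row['article'], 
        label=row['articleLabel'], 
        description=row['articleDescription'],
        published=row['pub_year'],
        doi=doi_url,
        instance_of=row['article_instance_ofLabel']
    )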

In [29]:
org_relationships.head()

Unnamed: 0,org,subsidiary_of
0,https://geokb.wikibase.cloud/entity/Q44222,https://geokb.wikibase.cloud/entity/Q44363
1,https://geokb.wikibase.cloud/entity/Q44228,https://geokb.wikibase.cloud/entity/Q44363
2,https://geokb.wikibase.cloud/entity/Q44248,https://geokb.wikibase.cloud/entity/Q44363
3,https://geokb.wikibase.cloud/entity/Q44256,https://geokb.wikibase.cloud/entity/Q44363
4,https://geokb.wikibase.cloud/entity/Q44246,https://geokb.wikibase.cloud/entity/Q44363


In [30]:
for _, row in org_relationships.iterrows():
    neo4j_session.run("MATCH (a:Organization {id: $org}) "
                      "MATCH (b:Organization {id: $subsidiary_of}) "
                      "MERGE (a)-[:SUBSIDIARY_OF]->(b)",
                      org=row['org'], subsidiary_of=row['subsidiary_of'])

In [31]:
person_relationships.head()

Unnamed: 0,person,affiliated_with
0,https://geokb.wikibase.cloud/entity/Q44429,https://geokb.wikibase.cloud/entity/Q44222
1,https://geokb.wikibase.cloud/entity/Q44747,https://geokb.wikibase.cloud/entity/Q44222
2,https://geokb.wikibase.cloud/entity/Q45150,https://geokb.wikibase.cloud/entity/Q44228
3,https://geokb.wikibase.cloud/entity/Q45372,https://geokb.wikibase.cloud/entity/Q44222
4,https://geokb.wikibase.cloud/entity/Q44845,https://geokb.wikibase.cloud/entity/Q44228


In [34]:
for _, row in person_relationships.iterrows():
    neo4j_session.run("MATCH (a:Person {id: $person}) "
                      "MATCH (b:Organization {id: $affiliated_with}) "
                      "MERGE (a)-[:AFFILIATED_WITH]->(b)",
                      person=row['person'], affiliated_with=row['affiliated_with'])

In [35]:
article_relationships.head()

Unnamed: 0,article,person
0,https://geokb.wikibase.cloud/entity/Q144919,https://geokb.wikibase.cloud/entity/Q44429
1,https://geokb.wikibase.cloud/entity/Q145083,https://geokb.wikibase.cloud/entity/Q44429
2,https://geokb.wikibase.cloud/entity/Q145302,https://geokb.wikibase.cloud/entity/Q44747
3,https://geokb.wikibase.cloud/entity/Q145370,https://geokb.wikibase.cloud/entity/Q45150
4,https://geokb.wikibase.cloud/entity/Q145445,https://geokb.wikibase.cloud/entity/Q45372


In [36]:
for _, row in article_relationships.iterrows():
    neo4j_session.run("MATCH (a:Article {id: $article}) "
                      "MATCH (b:Person {id: $person}) "
                      "MERGE (a)-[:HAS_AUTHOR]->(b)",
                      article=row['article'], person=row['person'])
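
The relationship cells above use `MATCH` on both endpoints, so a row whose target node was never loaded (for example, an affiliation with an organization outside the queried set) creates nothing and raises no error. A small check like the following sketch could list such rows before loading. The `missing_endpoints` helper is hypothetical, and the last relationship in the sample uses a made-up identifier purely for illustration.

```python
def missing_endpoints(relationships, node_ids):
    """Return (source, target) pairs whose target has no loaded node."""
    known = set(node_ids)
    return [(source, target) for source, target in relationships if target not in known]


# Organization ids taken from the org_nodes output above.
org_ids = [
    "https://geokb.wikibase.cloud/entity/Q44222",
    "https://geokb.wikibase.cloud/entity/Q44228",
]

# (person, affiliated_with) pairs; the first two come from the output above,
# the third points at a made-up organization that was never loaded.
affiliations = [
    ("https://geokb.wikibase.cloud/entity/Q44429", "https://geokb.wikibase.cloud/entity/Q44222"),
    ("https://geokb.wikibase.cloud/entity/Q45150", "https://geokb.wikibase.cloud/entity/Q44228"),
    ("https://geokb.wikibase.cloud/entity/Q44429", "https://example.org/entity/NOT_LOADED"),
]

skipped = missing_endpoints(affiliations, org_ids)
print(len(skipped))
```

With the notebook's DataFrames, the inputs could be built from `person_relationships.itertuples(index=False, name=None)` and `org_nodes['org']`.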

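# Demonstration Query
The following query runs in the Neo4j instance and lists cross-Center co-authorship in the Midcontinent Region: pairs of authors on the same article who are affiliated with different organizations. Because the match pattern is symmetric, each pair of authors appears twice, once in each direction.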

In [47]:
result = neo4j_session.run("MATCH (ar:Article)-[:HAS_AUTHOR]->(a1:Person)-[:AFFILIATED_WITH]->(o1:Organization), "
    "(ar)-[:HAS_AUTHOR]->(a2:Person)-[:AFFILIATED_WITH]->(o2:Organization) "
    "WHERE a1 <> a2 AND o1 <> o2 "
    "RETURN a1.label, o1.label, ar.label, a2.label, o2.label"
)

midcontinent_coauthors = result.to_df()
midcontinent_coauthors

Unnamed: 0,a1.label,o1.label,ar.label,a2.label,o2.label
0,Craig A Jackson,Upper Midwest Environmental Sciences Center,A round-robin evaluation of the repeatability ...,"Adam Sepulveda, Ph.D.",Northern Rocky Mountain Science Center
1,"Jon Amberg, PhD",Upper Midwest Environmental Sciences Center,A round-robin evaluation of the repeatability ...,"Adam Sepulveda, Ph.D.",Northern Rocky Mountain Science Center
2,"Jon Amberg, PhD",Upper Midwest Environmental Sciences Center,Adding invasive species bio-surveillance to th...,"Adam Sepulveda, Ph.D.",Northern Rocky Mountain Science Center
3,Sara LeAnn (Caldwell) Eldridge,Wyoming-Montana Water Science Center,Integrating environmental DNA results with div...,"Adam Sepulveda, Ph.D.",Northern Rocky Mountain Science Center
4,Elliott Barnhart,Wyoming-Montana Water Science Center,Integrating environmental DNA results with div...,"Adam Sepulveda, Ph.D.",Northern Rocky Mountain Science Center
...,...,...,...,...,...
241,James Larson,Upper Midwest Environmental Sciences Center,Associations between cyanobacteria and indices...,JC Nelson,Midcontinent Region
242,"Keith A Loftin, Ph.D.",Kansas Water Science Center,Associations between cyanobacteria and indices...,JC Nelson,Midcontinent Region
243,"Zachary R Laughrey, Ph.D.",Kansas Water Science Center,Associations between cyanobacteria and indices...,JC Nelson,Midcontinent Region
244,Sean Bailey,Upper Midwest Environmental Sciences Center,Associations between cyanobacteria and indices...,JC Nelson,Midcontinent Region
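
Since the original question was about connections across Science Centers, the author-level rows can be rolled up to center-level pairs. The following is a minimal sketch in plain Python using a few rows copied from the output above (article labels are truncated there and are kept as shown). The `center_pair_counts` helper is hypothetical. It counts distinct articles per unordered pair of centers, so both directions of a pair collapse into one entry. Keying on the article `id` rather than its label would be more robust if the query returned it.

```python
from collections import defaultdict

# (a1.label, o1.label, ar.label, a2.label, o2.label), copied from the table above.
rows = [
    ("Craig A Jackson", "Upper Midwest Environmental Sciences Center",
     "A round-robin evaluation of the repeatability ...",
     "Adam Sepulveda, Ph.D.", "Northern Rocky Mountain Science Center"),
    ("Jon Amberg, PhD", "Upper Midwest Environmental Sciences Center",
     "A round-robin evaluation of the repeatability ...",
     "Adam Sepulveda, Ph.D.", "Northern Rocky Mountain Science Center"),
    ("Jon Amberg, PhD", "Upper Midwest Environmental Sciences Center",
     "Adding invasive species bio-surveillance to th...",
     "Adam Sepulveda, Ph.D.", "Northern Rocky Mountain Science Center"),
    ("Sara LeAnn (Caldwell) Eldridge", "Wyoming-Montana Water Science Center",
     "Integrating environmental DNA results with div...",
     "Adam Sepulveda, Ph.D.", "Northern Rocky Mountain Science Center"),
]


def center_pair_counts(rows):
    """Count distinct articles shared by each unordered pair of centers."""
    articles_by_pair = defaultdict(set)
    for _author1, org1, article, _author2, org2 in rows:
        pair = tuple(sorted((org1, org2)))  # ignore direction
        articles_by_pair[pair].add(article)
    return {pair: len(articles) for pair, articles in articles_by_pair.items()}


for (center_a, center_b), n_articles in sorted(center_pair_counts(rows).items()):
    print(f"{center_a} <-> {center_b}: {n_articles}")
```

With the real result, the input could come from `midcontinent_coauthors.itertuples(index=False, name=None)`.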
