This notebook works through adding links from GeoKB items representing publications to their contributors (authors, editors, compilers) that are also entities in the knowledge graph. It is built as a secondary process, run after a first pass in which a subset of publication items was introduced from the USGS Publications Warehouse. It represents an iterative approach to building the knowledge graph, which is likely how much of this work will play out over time: we start with an initial representation for some type of entity, containing the properties we are using at the time, and later come back to add information from the original or a related source to create new linkages and add further capability to the items.

I worked here with a cache of USGS Publications Warehouse (PW) records that I'd made previously for the iSAID work. I need to revisit how that process works, particularly what I think is a new GraphQL-based interface for the PW that should be considerably more efficient than the REST API.

In [1]:
import os
import pandas as pd
import numpy as np
from wbmaker import WikibaseConnection
import json

In [2]:
geokb = WikibaseConnection('GEOKB_CLOUD')

# Using a temporary cache of PW items from a database source for efficiency
isaid = geokb.pg_cnxn(
    db="isaid",
    db_user=os.environ["rds_username"],
    db_pass=os.environ["rds_password"],
    db_host=os.environ["rds_host"],
    db_port=os.environ["rds_port"]
)

In [3]:
# Get PW cache and the pubs we will be building on from the GeoKB
pw_cache = pd.read_sql_query(
    """
        SELECT "indexId", contributors
        FROM source_pw
    """, 
    con=isaid
)

geokb_pw_pubs = geokb.url_sparql_query(
    sparql_url="https://geokb.wikibase.cloud/query/sparql?query=PREFIX%20wd%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3Fpw_index_id%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP1%2Fwdt%3AP2*%20wd%3AQ11%20.%0A%20%20%3Fitem%20wdt%3AP114%20%3Fpw_index_id%20.%0A%7D",
    output_format="dataframe"
)
geokb_pw_pubs["qid"] = geokb_pw_pubs["item"].apply(lambda x: x.split("/")[-1])
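
The percent-encoded `sparql_url` above is hard to read and edit. As a minimal sketch using only the standard library's `urllib.parse`, the same query can be kept as readable SPARQL and encoded when needed. The variable names here are illustrative and not part of `wbmaker`; the query text is simply the decoded form of the URL above.

```python
from urllib.parse import quote, unquote

SPARQL_ENDPOINT = "https://geokb.wikibase.cloud/query/sparql"

# Decoded form of the publication query used above
pubs_query = """PREFIX wd: <https://geokb.wikibase.cloud/entity/>
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?item ?pw_index_id
WHERE {
  ?item wdt:P1/wdt:P2* wd:Q11 .
  ?item wdt:P114 ?pw_index_id .
}"""

# Percent-encode everything outside the unreserved set so the query is safe in a URL
pubs_query_url = f"{SPARQL_ENDPOINT}?query={quote(pubs_query, safe='')}"

# Decoding the query string recovers the readable SPARQL
decoded = unquote(pubs_query_url.split("?query=", 1)[1])
```

Read this way, the query selects items whose P1 value is Q11 or reachable from it through a chain of P2 links, and that carry a P114 value, which the notebook treats as the PW index ID.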

In [4]:
# Put identifiers together so we can build out the structure we need and break out contributor types
geokb_pubs_contributors = pd.merge(
    left=geokb_pw_pubs[["qid","pw_index_id"]],
    right=pw_cache.rename(columns={"indexId": "pw_index_id"}),
    how="left",
    on="pw_index_id"
)

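# The left merge leaves NaN where a publication has no cached PW record, so only parse string values
geokb_pubs_contributors["contributors"] = geokb_pubs_contributors["contributors"].apply(
    lambda x: json.loads(x) if isinstance(x, str) else {}
)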

contributors_normalized = pd.json_normalize(geokb_pubs_contributors["contributors"])
geokb_pubs_contributors_combined = pd.concat([
    geokb_pubs_contributors.drop("contributors", axis=1),
    contributors_normalized
], axis=1)
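
To make the reshaping in this cell concrete, here is a small sketch on made-up data. It assumes, as the code above implies, that `contributors` is a JSON string keyed by contributor type with a list of contributor dicts under each key. The records and the `toy` names are hypothetical.

```python
import json
import pandas as pd

# Hypothetical rows standing in for the merged publication/cache table
toy = pd.DataFrame({
    "qid": ["Q1", "Q2"],
    "contributors": [
        '{"authors": [{"email": "a@example.org", "orcid": "https://orcid.org/0000-0000-0000-0001"}]}',
        '{"authors": [{"email": "b@example.org"}], "editors": [{"email": "c@example.org"}]}',
    ],
})

toy["contributors"] = toy["contributors"].apply(json.loads)

# One column per contributor type; each cell holds a list of contributor dicts or NaN
by_type = pd.json_normalize(toy["contributors"].tolist())
toy_combined = pd.concat([toy.drop("contributors", axis=1), by_type], axis=1)
```

The result has one row per publication and one column per contributor type, which is why the next cell can filter on `notnull()` for each type.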

In [5]:
# Break out each contributor type
geokb_pw_authors = geokb_pubs_contributors_combined[geokb_pubs_contributors_combined["authors"].notnull()][["qid","authors"]].reset_index(drop=True)
geokb_pw_editors = geokb_pubs_contributors_combined[geokb_pubs_contributors_combined["editors"].notnull()][["qid","editors"]].reset_index(drop=True)
geokb_pw_compilers = geokb_pubs_contributors_combined[geokb_pubs_contributors_combined["compilers"].notnull()][["qid","compilers"]].reset_index(drop=True)

# Normalize the properties in the contributors structure for each contributor type
geokb_pw_authors = geokb_pw_authors.explode("authors").reset_index(drop=True)

geokb_pw_author_props = pd.json_normalize(geokb_pw_authors["authors"])
geokb_pw_authors = pd.concat([
    geokb_pw_authors.drop("authors", axis=1),
    geokb_pw_author_props
], axis=1)

geokb_pw_authors_identified = geokb_pw_authors[geokb_pw_authors.email.notnull() | geokb_pw_authors.orcid.notnull()][["qid","email","orcid"]].reset_index(drop=True)
geokb_pw_authors_identified["orcid"] = geokb_pw_authors_identified["orcid"].apply(lambda x: x.replace("https://orcid.org/", "") if isinstance(x, str) else None)


geokb_pw_editors = geokb_pw_editors.explode("editors").reset_index(drop=True)

geokb_pw_editor_props = pd.json_normalize(geokb_pw_editors["editors"])
geokb_pw_editors = pd.concat([
    geokb_pw_editors.drop("editors", axis=1),
    geokb_pw_editor_props
], axis=1)

geokb_pw_editors_identified = geokb_pw_editors[geokb_pw_editors.email.notnull() | geokb_pw_editors.orcid.notnull()][["qid","email","orcid"]].reset_index(drop=True)
geokb_pw_editors_identified["orcid"] = geokb_pw_editors_identified["orcid"].apply(lambda x: x.replace("https://orcid.org/", "") if isinstance(x, str) else None)


geokb_pw_compilers = geokb_pw_compilers.explode("compilers").reset_index(drop=True)

geokb_pw_compiler_props = pd.json_normalize(geokb_pw_compilers["compilers"])
geokb_pw_compilers = pd.concat([
    geokb_pw_compilers.drop("compilers", axis=1),
    geokb_pw_compiler_props
], axis=1)

geokb_pw_compilers_identified = geokb_pw_compilers[geokb_pw_compilers.email.notnull() | geokb_pw_compilers.orcid.notnull()][["qid","email","orcid"]].reset_index(drop=True)
geokb_pw_compilers_identified["orcid"] = geokb_pw_compilers_identified["orcid"].apply(lambda x: x.replace("https://orcid.org/", "") if isinstance(x, str) else None)
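
The three blocks above repeat the same steps for authors, editors, and compilers. One possible consolidation is sketched below. `identified_contributors` is a hypothetical helper, not something from `wbmaker`, the toy data (including the `text` key) is invented, and the `reindex` and `dropna` guards are additions that the original cell does not have.

```python
import pandas as pd

def identified_contributors(combined, contributor_type):
    """Return one row per contributor of the given type that has an email or ORCID."""
    # Keep only publications that list this contributor type, one row per contributor
    subset = combined.loc[combined[contributor_type].notnull(), ["qid", contributor_type]]
    exploded = (
        subset.explode(contributor_type)
        .dropna(subset=[contributor_type])
        .reset_index(drop=True)
    )

    # Expand each contributor dict into columns; reindex guards against a missing key
    props = pd.json_normalize(exploded[contributor_type].tolist()).reindex(columns=["email", "orcid"])
    flat = pd.concat([exploded[["qid"]], props], axis=1)

    identified = flat[flat["email"].notnull() | flat["orcid"].notnull()].reset_index(drop=True)
    # Reduce ORCID URLs to the bare identifier
    identified["orcid"] = identified["orcid"].apply(
        lambda x: x.replace("https://orcid.org/", "") if isinstance(x, str) else None
    )
    return identified

# Hypothetical data in the shape produced by the previous cell
toy_combined = pd.DataFrame({
    "qid": ["Q1", "Q2"],
    "authors": [
        [{"email": "a@example.org", "orcid": "https://orcid.org/0000-0000-0000-0001"}, {"text": "No Identifier"}],
        [{"email": "b@example.org"}],
    ],
    "editors": [float("nan"), [{"email": "c@example.org"}]],
})

toy_authors = identified_contributors(toy_combined, "authors")
toy_editors = identified_contributors(toy_combined, "editors")
```

With a helper like this, each contributor type becomes a single call, and a type whose records never include an `orcid` key no longer raises on the attribute lookup.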

In [6]:
# Get the ORCIDs and emails we have in the GeoKB source
geokb_person_email = geokb.url_sparql_query(
    sparql_url="https://geokb.wikibase.cloud/query/sparql?query=PREFIX%20wd%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fperson%20%3Femail%0AWHERE%20%7B%0A%20%20%3Fperson%20wdt%3AP1%20wd%3AQ3%20.%0A%20%20%3Fperson%20wdt%3AP109%20%3Femail%20.%0A%7D",
    output_format="dataframe"
)
geokb_person_email["person_qid"] = geokb_person_email["person"].apply(lambda x: x.split("/")[-1])
geokb_person_email["email"] = geokb_person_email["email"].apply(lambda x: x.split(":")[-1])

geokb_person_orcid = geokb.url_sparql_query(
    sparql_url="https://geokb.wikibase.cloud/query/sparql?query=PREFIX%20wd%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fperson%20%3Forcid%0AWHERE%20%7B%0A%20%20%3Fperson%20wdt%3AP1%20wd%3AQ3%20.%0A%20%20%3Fperson%20wdt%3AP106%20%3Forcid%20.%0A%7D",
    output_format="dataframe"
)
geokb_person_orcid["person_qid"] = geokb_person_orcid["person"].apply(lambda x: x.split("/")[-1])
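
The `split(":")[-1]` call above strips a scheme prefix such as `mailto:` from the email value; that the GeoKB stores emails this way is an assumption inferred from the code. Because the merges in the next cell compare strings exactly, it may also be worth normalizing case and whitespace on both sides before joining. The helper names below are hypothetical.

```python
def normalize_email(value):
    """Strip a 'mailto:' style prefix, surrounding whitespace, and case differences."""
    if not isinstance(value, str):
        return None
    return value.split(":")[-1].strip().lower()

def normalize_orcid(value):
    """Reduce an ORCID URL to the bare identifier; case is kept because the check digit can be 'X'."""
    if not isinstance(value, str):
        return None
    return value.replace("https://orcid.org/", "").strip()
```

Applying the same function to the PW side and the GeoKB side keeps the join keys comparable.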


In [7]:
# Combine document and person identifiers for each contributor type for unique identifier (orcid or email) connections
linked_authors_on_orcid = pd.merge(
    left=geokb_pw_authors_identified[geokb_pw_authors_identified.orcid.notnull()][["qid","orcid"]],
    right=geokb_person_orcid[["person_qid","orcid"]],
    how="inner",
    on="orcid"
)

linked_authors_on_email = pd.merge(
    left=geokb_pw_authors_identified[geokb_pw_authors_identified.email.notnull()][["qid","email"]],
    right=geokb_person_email[["person_qid","email"]],
    how="inner",
    on="email"
)

linked_authors = np.unique(
    np.concatenate((
        linked_authors_on_orcid[["qid","person_qid"]].to_records(index=False),
        linked_authors_on_email[["qid","person_qid"]].to_records(index=False)
    ))
)

df_linked_authors = pd.DataFrame.from_records(linked_authors)


linked_editors_on_orcid = pd.merge(
    left=geokb_pw_editors_identified[geokb_pw_editors_identified.orcid.notnull()][["qid","orcid"]],
    right=geokb_person_orcid[["person_qid","orcid"]],
    how="inner",
    on="orcid"
)

linked_editors_on_email = pd.merge(
    left=geokb_pw_editors_identified[geokb_pw_editors_identified.email.notnull()][["qid","email"]],
    right=geokb_person_email[["person_qid","email"]],
    how="inner",
    on="email"
)

linked_editors = np.unique(
    np.concatenate((
        linked_editors_on_orcid[["qid","person_qid"]].to_records(index=False),
        linked_editors_on_email[["qid","person_qid"]].to_records(index=False)
    ))
)

df_linked_editors = pd.DataFrame.from_records(linked_editors)


linked_compilers_on_orcid = pd.merge(
    left=geokb_pw_compilers_identified[geokb_pw_compilers_identified.orcid.notnull()][["qid","orcid"]],
    right=geokb_person_orcid[["person_qid","orcid"]],
    how="inner",
    on="orcid"
)

linked_compilers_on_email = pd.merge(
    left=geokb_pw_compilers_identified[geokb_pw_compilers_identified.email.notnull()][["qid","email"]],
    right=geokb_person_email[["person_qid","email"]],
    how="inner",
    on="email"
)

linked_compilers = np.unique(
    np.concatenate((
        linked_compilers_on_orcid[["qid","person_qid"]].to_records(index=False),
        linked_compilers_on_email[["qid","person_qid"]].to_records(index=False)
    ))
)

df_linked_compilers = pd.DataFrame.from_records(linked_compilers)
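
The cell above takes the union of ORCID and email matches by round-tripping through NumPy record arrays. A sketch of an alternative that stays in pandas follows; `union_links` is a hypothetical helper and the toy frames are invented. The sort is there only to mirror the ordered output of `np.unique`.

```python
import pandas as pd

def union_links(*link_frames):
    """Stack link tables and keep each (qid, person_qid) pair once."""
    return (
        pd.concat([df[["qid", "person_qid"]] for df in link_frames], ignore_index=True)
        .drop_duplicates()
        .sort_values(["qid", "person_qid"])
        .reset_index(drop=True)
    )

# Hypothetical matches: the (Q1, Q10) pair was found by both ORCID and email
on_orcid = pd.DataFrame({"qid": ["Q1", "Q2"], "person_qid": ["Q10", "Q20"]})
on_email = pd.DataFrame({"qid": ["Q1", "Q3"], "person_qid": ["Q10", "Q30"]})

toy_links = union_links(on_orcid, on_email)
```

A person matched by both identifiers on the same publication appears once in the result, so only one claim would be written for that pair.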

In [8]:
# Add the appropriate property identifier for each contributor type
# Note: editor and compiler were just added and the indexing process in the Wikibase instance is still behind,
# so their property IDs are hard-coded here instead of being read from prop_lookup
df_linked_authors["predicate"] = geokb.prop_lookup["author"]
df_linked_editors["predicate"] = "P115"
df_linked_compilers["predicate"] = "P116"

df_linked_contributors = pd.concat([
    df_linked_authors,
    df_linked_editors,
    df_linked_compilers
])

At this point we have a structure we can work through to pull existing publication items and add links to their contributors.

In [9]:
refs = geokb.models.References()
refs.add(
    geokb.datatypes.Item(
        prop_nr=geokb.prop_lookup["data source"],
        value="Q54915"
    )
)

# I already ran through authors here, so I'm filtering those out for this run (P102 is the author property)
df_linked_contributors = df_linked_contributors[df_linked_contributors.predicate != "P102"]

for (pub_qid, contributor_pid), contributor_qids in df_linked_contributors.groupby(["qid","predicate"])["person_qid"].agg(list).items():
    item = geokb.wbi.item.get(pub_qid)

    contributor_claims = []
    for qid in contributor_qids:
        contributor_claims.append(
            geokb.datatypes.Item(
                prop_nr=contributor_pid,
                value=qid,
                references=refs
            )
        )
    item.claims.add(contributor_claims)

    response = item.write(
        summary="Added linkable contributors to publication record"
    )
    print(response.id, contributor_qids)


Q54967 ['Q50723']
Q55021 ['Q46517', 'Q54742']
Q55039 ['Q46517']
Q55043 ['Q45644', 'Q46517', 'Q54742', 'Q54870']
Q55044 ['Q46517', 'Q54742']
Q55049 ['Q46830', 'Q48043', 'Q49590']
Q55055 ['Q49216', 'Q50599']
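
The write loop above depends on how `groupby(...).agg(list).items()` batches the links. The sketch below shows that behavior on invented data without touching the Wikibase. The QIDs are placeholders, and the property IDs are the editor and compiler values hard-coded earlier.

```python
import pandas as pd

toy_contributors = pd.DataFrame({
    "qid": ["Q1", "Q1", "Q1", "Q2"],
    "person_qid": ["Q10", "Q11", "Q12", "Q10"],
    "predicate": ["P115", "P115", "P116", "P115"],
})

# One entry per (publication, property) pair, holding the list of linked people
grouped = toy_contributors.groupby(["qid", "predicate"])["person_qid"].agg(list)

batches = [(pub_qid, pid, people) for (pub_qid, pid), people in grouped.items()]
```

Each batch corresponds to one `item.write()` call, so a publication that has both editors and compilers is fetched and written twice. Grouping on `qid` alone and building claims for every predicate inside the loop would reduce that to one write per publication.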
