This notebook works through an initial baseline representation of a subset of Pubs Warehouse catalog records for publications and USGS/other gov reports. Initially, we are pulling all USGS Numbered Series reports and all other items (journal articles, etc.) that have at least one contributor listed with an ORCID identifier (meaning we have the ability to link the product to a person and use the information in capacity assessment use cases).

Within this process, we also add new representations of people whose ORCID IDs turn up as contributors but where we didn't already turn them up via personnel profiles. This lets us link to those people as authors, editors, or compilers when we build out the product items.

In addition to the barebones information on each product (title, derived description, publication year, identifiers, contributor links), we also write any abstracts and tables of contents available in the PW Catalog to the item discussion page.

Once a baseline is established, we can use the dates written to the [Pubs Warehouse Catalog](https://geokb.wikibase.cloud/wiki/Item:Q54915) source item to determine the parameter (mod_x_days) we need to send to the PW web service to retrieve changes since our last processing run. That will be a slightly different process, set up to run on a schedule, that reuses most of the functionality started here and works out how to handle the changes introduced.
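
As a rough sketch of that idea, the `mod_x_days` value could be derived from the date of the last processing run. The function name and the one-day overlap are assumptions for illustration only; nothing in this notebook computes the value yet.

```python
from datetime import date

def mod_x_days_since(last_processed: date, today: date, overlap_days: int = 1) -> int:
    """Number of days to request so that changes since last_processed are covered."""
    elapsed = (today - last_processed).days
    if elapsed < 0:
        raise ValueError("last_processed is in the future")
    # Pad the window (assumed policy) so records modified on the boundary day are not missed
    return elapsed + overlap_days

# Query parameters that would be sent to the PW web service
params = {"mod_x_days": mod_x_days_since(date(2024, 1, 1), date(2024, 1, 11))}
print(params)
```

Re-reading a day of overlap is harmless as long as the update process checks for existing items before writing.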

In [1]:
import pickle
import pandas as pd
import re
import pypandoc  # used by summary_text() to convert HTML to MediaWiki markup
import os
import requests
import json
from urllib.parse import quote
from wbmaker import WikibaseConnection
from joblib import Parallel, delayed
from tqdm import tqdm

In [2]:
geokb = WikibaseConnection('GEOKB_CLOUD')

In [9]:
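def get_pwid_values(qid, pid):
    # Fetch the claims for a single property on a single item from the Wikibase API
    r = requests.get(f"https://geokb.wikibase.cloud/w/api.php?action=wbgetclaims&format=json&entity={qid}&property={pid}").json()
    # Read claims under the requested property (not a hard-coded P114); an empty result yields None
    claims = r.get('claims') or {}
    if pid in claims:
        return [{'qid': qid, 'pid': pid, 'claim_id': i['id'], 'claim_value': i['mainsnak']['datavalue']['value']} for i in claims[pid]]
    return None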

def redo_claim(wb, qid, pid, claim_value, refs):
    item = wb.wbi.item.get(qid)
    item.claims.remove(pid)
    r = item.write(summary=f"Removed {pid} claim")

    item = wb.wbi.item.get(qid)
    item.claims.add(
        wb.datatypes.ExternalID(
            prop_nr=pid,
            value=claim_value,
            references=refs
        )
    )
    r = item.write(summary=f"Added {pid} claim")

    return r.id

def entity_data(qid):
    return requests.get(f"https://geokb.wikibase.cloud/wiki/Special:EntityData/{qid}.json").json()


In [3]:
pwid_fixes = pd.read_parquet('./data/pid_fixes.parquet')
pwid_fixes.head()

Unnamed: 0,qid,pid,claim_id,claim_value
0,Q144957,P114,Q144957$AFBE17A5-442C-47AF-8C74-DD1FEAB97946,70202469
1,Q144958,P114,Q144958$C05DAE73-14E7-4BF3-ABEA-D7104EF7875F,70196687
2,Q145343,P114,Q145343$2F648780-E3C4-4314-AE5C-FACD374385B8,70217681
3,Q145344,P114,Q145344$E31EA774-C890-4508-A58C-98773C3058C7,70227916
4,Q145374,P114,Q145374$7A88DED2-6F81-49AB-8C80-A236B49D838B,70195145


# Get GeoKB PW Records
I have a problem here in that the SPARQL service is not returning all items for this query. Until that gets resolved, I ran a few queries on PW indexId identifiers and stashed the results in files so that I get a complete set of pubs that already have a representation. There is a basic safeguard in Wikibase in that we can't introduce two items with exactly the same label and classification; trying to do so throws an exception from the API.
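
The query URLs in the cells that follow are stored fully percent-encoded, which makes them hard to read or adjust. A minimal sketch of building such a URL from readable SPARQL text is shown here; the helper name `build_sparql_url` is hypothetical, and the endpoint and query are the ones already used in this notebook.

```python
from urllib.parse import quote, urlsplit, parse_qs

SPARQL_ENDPOINT = "https://geokb.wikibase.cloud/query/sparql"

def build_sparql_url(query: str, endpoint: str = SPARQL_ENDPOINT) -> str:
    """Percent-encode a SPARQL query and attach it as the 'query' parameter."""
    return f"{endpoint}?query={quote(query, safe='')}"

# Readable form of the indexId (P114) query
INDEX_ID_QUERY = """PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?item ?indexId
WHERE {
  ?item wdt:P114 ?indexId .
}"""

url = build_sparql_url(INDEX_ID_QUERY)

# Decoding the URL gives back the original query text
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == INDEX_ID_QUERY)
```

Keeping the readable query next to the call makes it easier to split a query into smaller batches when the service does not return a full result set.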

In [None]:
geokb_pw_ids = geokb.url_sparql_query(
    sparql_url="https://geokb.wikibase.cloud/query/sparql?query=PREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3FindexId%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP114%20%3FindexId%20.%0A%7D",
    output_format="dataframe"
)
geokb_pw_ids['qid'] = geokb_pw_ids['item'].apply(lambda x: x.split('/')[-1])
geokb_pw_ids.drop(columns="item", inplace=True)
#geokb_pw_ids.drop_duplicates(inplace=True)

geokb_doi_ids = geokb.url_sparql_query(
    sparql_url="https://geokb.wikibase.cloud/query/sparql?query=PREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3Fdoi%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP74%20%3Fdoi%20.%0A%7D",
    output_format="dataframe"
)
geokb_doi_ids['qid'] = geokb_doi_ids['item'].apply(lambda x: x.split('/')[-1])
geokb_doi_ids.drop(columns="item", inplace=True)
#geokb_doi_ids.drop_duplicates(inplace=True)

In [3]:
geokb_pub_ids = pd.read_csv('./data/geokb_indexid_doi.csv')
geokb_pub_ids['qid'] = geokb_pub_ids['item'].apply(lambda x: x.split('/')[-1])
geokb_pub_ids.drop(columns="item", inplace=True)

In [5]:
pub_entities = geokb.url_sparql_query(
    sparql_url="https://geokb.wikibase.cloud/query/sparql?query=PREFIX%20wd%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3FitemLabel%0AWHERE%20%7B%0A%20%20%3Fpub_classes%20wdt%3AP2*%20wd%3AQ6%20.%0A%20%20%3Fitem%20wdt%3AP1%20%3Fpub_classes%20.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20.%20%7D%0A%7D",
    output_format="dataframe"
)

report_entities = geokb.url_sparql_query(
    sparql_url="https://geokb.wikibase.cloud/query/sparql?query=PREFIX%20wd%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3FitemLabel%0AWHERE%20%7B%0A%20%20%3Fpub_classes%20wdt%3AP2*%20wd%3AQ8%20.%0A%20%20%3Fitem%20wdt%3AP1%20%3Fpub_classes%20.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20.%20%7D%0A%7D",
    output_format="dataframe"
)

all_docs = pd.concat([pub_entities, report_entities]).reset_index(drop=True)
all_docs['qid'] = all_docs['item'].apply(lambda x: x.split('/')[-1])

In [10]:
unidentified_pub_qids = all_docs[~all_docs['qid'].isin(geokb_pub_ids['qid'])]['qid'].to_list()

In [11]:
item_dump = Parallel(n_jobs=8, prefer="threads")(delayed(entity_data)(i) for i in tqdm(unidentified_pub_qids))

 71%|███████▏  | 11568/16212 [15:05<05:48, 13.31it/s]

In [None]:
geokb_missing_pwid = all_docs[~all_docs['qid'].isin(geokb_pw_ids['qid'])].reset_index(drop=True)
geokb_missing_pwid.head()
# missing_qid_list = missing_pwid['qid'].to_list()
# len(missing_qid_list)

In [None]:
geokb_doi_indexid = pd.read_csv('./data/geokb_doi_indexid.csv')
geokb_doi_indexid['qid'] = geokb_doi_indexid['item'].apply(lambda x: x.split('/')[-1])

In [None]:
check_qids = geokb_doi_indexid[
    (geokb_doi_indexid['indexId'].isnull())
    &
    (~geokb_doi_indexid['doi'].str.startswith('10.5066'))
]['qid'].to_list()

In [None]:
claim_check = Parallel(n_jobs=8, prefer="threads")(delayed(get_pwid_values)(i, 'P114') for i in tqdm(check_qids))

In [None]:
claims = []
for l in claim_check:
    if l is not None:
        claims.extend(l)

df_claims = pd.DataFrame(claims)

In [None]:
df_claims

In [None]:
# Inspect the P114 claims on a single item
get_pwid_values('Q144957', 'P114')

# Reference Entities
The process I'm using in building a big baseline of records like this is to pull in the different reference sources I need to use in linking new items into the GeoKB. For publications, this includes the type classification for instance of claims, USGS organizations that give us ownership and funding relationships based on metadata, place names for some of the categories of places in PW records, and the ORCID mapping to GeoKB items for contributors.
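
The lookups built below all follow one pattern: every label and every alias for an entity becomes a name that resolves to the entity's QID. A plain-Python sketch of that pattern follows, with hypothetical QIDs and names; the notebook itself does this with pandas dataframes.

```python
def build_name_lookup(rows):
    """Map each label and alias to its QID.

    rows: iterable of dicts with 'qid', 'label', and an optional 'alt_label'.
    """
    lookup = {}
    for row in rows:
        for name in (row.get("label"), row.get("alt_label")):
            # Skip missing aliases; the first QID seen for a name wins
            if isinstance(name, str) and name.strip():
                lookup.setdefault(name.strip(), row["qid"])
    return lookup

# Hypothetical rows shaped like SPARQL results (one row per label/alias pair)
rows = [
    {"qid": "Q_ORG_1", "label": "Example Science Center", "alt_label": "Example Sci Ctr"},
    {"qid": "Q_ORG_1", "label": "Example Science Center", "alt_label": None},
    {"qid": "Q_ORG_2", "label": "Example Program", "alt_label": None},
]
name_to_qid = build_name_lookup(rows)
print(name_to_qid)
```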

In [None]:
# USGS report types
usgs_report_types = geokb.url_sparql_query(
    sparql_url="https://geokb.wikibase.cloud/query/sparql?query=PREFIX%20wd%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3FitemLabel%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP2%20wd%3AQ11%20.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20.%20%7D%0A%7D",
    output_format="dataframe"
)

usgs_report_types['instance_of_qid'] = usgs_report_types['item'].apply(lambda x: x.split('/')[-1])
usgs_report_types['series_title'] = usgs_report_types['itemLabel'].apply(lambda x: x.replace('USGS ', ''))

usgs_report_types.drop(columns=['item','itemLabel'], inplace=True)

# USGS organizations
usgs_orgs = geokb.url_sparql_query(
    sparql_url="https://geokb.wikibase.cloud/query/sparql?query=PREFIX%20wd%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3FitemLabel%20%3Fitem_alt_label%20%3Finstance_ofLabel%0AWHERE%20%7B%0A%20%20%3Forg_types%20wdt%3AP2*%20wd%3AQ50862%20.%0A%20%20%3Fitem%20wdt%3AP1%20%3Forg_types%20.%0A%20%20%3Fitem%20wdt%3AP1%20%3Finstance_of%20.%0A%20%20OPTIONAL%20%7B%0A%20%20%20%20%3Fitem%20skos%3AaltLabel%20%3Fitem_alt_label%20.%0A%20%20%20%20FILTER%20(lang(%3Fitem_alt_label)%3D'en')%0A%20%20%7D%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20.%20%7D%0A%7D",
    output_format="dataframe"
)

usgs_orgs['org_qid'] = usgs_orgs['item'].apply(lambda x: x.split('/')[-1])

usgs_org_lookup = pd.concat([
    usgs_orgs[['org_qid','itemLabel','instance_ofLabel']].drop_duplicates().rename(columns={'itemLabel': 'org_name'}),
    # Drop rows without an alias so null names never enter the lookup
    usgs_orgs[['org_qid','item_alt_label','instance_ofLabel']].dropna(subset=['item_alt_label']).rename(columns={'item_alt_label': 'org_name'})
])

# Countries, states, and counties
geokb_countries = geokb.url_sparql_query(
    sparql_url="https://geokb.wikibase.cloud/query/sparql?query=PREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3FitemLabel%20%3Fitem_alt_label%20%3Fiso_country_code%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP38%20%3Fiso_country_code%20.%0A%20%20OPTIONAL%20%7B%0A%20%20%20%20%3Fitem%20skos%3AaltLabel%20%3Fitem_alt_label%20.%0A%20%20%20%20FILTER%20(lang(%3Fitem_alt_label)%3D'en')%0A%20%20%7D%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20.%20%7D%0A%7D",
    output_format="dataframe"
)
geokb_countries["object"] = geokb_countries['item'].apply(lambda x: x.split('/')[-1])
country_lookup = pd.concat([
    geokb_countries[['object','itemLabel']].drop_duplicates().rename(columns={'itemLabel': 'place_name'}),
    geokb_countries[['object','item_alt_label']].dropna().rename(columns={'item_alt_label': 'place_name'}),
])
country_lookup.drop_duplicates(inplace=True)

geokb_states = geokb.url_sparql_query(
    sparql_url="https://geokb.wikibase.cloud/query/sparql?query=PREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3FitemLabel%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP12%20%3Fiso_code%20.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20.%20%7D%0A%7D",
    output_format="dataframe"
)
geokb_states["object"] = geokb_states['item'].apply(lambda x: x.split('/')[-1])
geokb_states.rename(columns={'itemLabel': 'place_name'}, inplace=True)
geokb_states.drop(columns="item", inplace=True)

geokb_counties = geokb.url_sparql_query(
    sparql_url="https://geokb.wikibase.cloud/query/sparql?query=PREFIX%20wd%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3FitemLabel%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP1%20wd%3AQ481%20.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20.%20%7D%0A%7D",
    output_format="dataframe"
)
geokb_counties["object"] = geokb_counties['item'].apply(lambda x: x.split('/')[-1])
geokb_counties.rename(columns={'itemLabel': 'place_name'}, inplace=True)
geokb_counties.drop(columns="item", inplace=True)

# PW type classifiers
geokb_pub_classes = geokb.url_sparql_query(
    sparql_url="https://geokb.wikibase.cloud/query/sparql?query=PREFIX%20wd%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3FitemLabel%20%3Fsubclass_of%20%3Fsubclass_ofLabel%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP2*%20%3Ftop_classes%20.%0A%20%20VALUES%20%3Ftop_classes%20%7B%20wd%3AQ8%20wd%3AQ6%20%7D%0A%20%20%3Fitem%20wdt%3AP2%20%3Fsubclass_of%20.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20.%20%7D%0A%7D",
    output_format="dataframe"
)

geokb_pub_classes['instance_of_qid'] = geokb_pub_classes['item'].apply(lambda x: x.split("/")[-1])
geokb_pub_classes.drop(columns="item", inplace=True)

# Personnel ORCID IDs
geokb_orcid = geokb.url_sparql_query(
    sparql_url="https://geokb.wikibase.cloud/query/sparql?query=PREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3Forcid%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP106%20%3Forcid%20.%0A%7D",
    output_format="dataframe"
)

geokb_orcid['person_qid'] = geokb_orcid['item'].apply(lambda x: x.split("/")[-1])
geokb_orcid.drop(columns=['item'], inplace=True)
geokb_orcid.drop_duplicates(inplace=True)

# Load from cache
The initial baseline for the Pubs Warehouse catalog is somewhat large at 175K+ items. I ran an initial set of scripts to pull all raw catalog records into a cache and then ran some minimal processing to produce parquet files to work on.

In [None]:
pw_core = pd.read_parquet('./data/pw_cache/pw_core.parquet')
pw_contributors = pd.read_parquet('./data/pw_cache/pw_contributors.parquet')
pw_texts = pd.read_parquet('./data/pw_texts.parquet')

# Missing

In [None]:
pw_core_matching_titles = pw_core[pw_core['title'].isin(geokb_missing_pwid['itemLabel'])][['indexId','title']].rename(columns={'title': 'itemLabel'}).reset_index(drop=True)

In [None]:
pw_core_matching_titles[pw_core_matching_titles.duplicated('itemLabel', keep=False)].sort_values('itemLabel')

In [None]:
def redo_p114(geokb, qid, refs):
    item = geokb.wbi.item.get(qid)
    item_json = item.get_json()
    if "P114" in item_json['claims']:
        current_pwid = [i['mainsnak']['datavalue']['value'] for i in item_json['claims']['P114']]
        item.claims.remove('P114')
        response = item.write(summary="Removed P114 claim")

        item = geokb.wbi.item.get(qid)
        new_claims = []
        for pwid in current_pwid:
            new_claims.append(
                geokb.datatypes.ExternalID(
                    prop_nr="P114",
                    value=pwid,
                    references=refs
                )
            )
        item.claims.add(new_claims)
        response = item.write(summary="Re-added P114 claim")
        return response.id

In [None]:
refs = geokb.models.References()
refs.add(
    geokb.datatypes.Item(
        prop_nr=geokb.prop_lookup["data source"],
        value="Q54915"
    )
)

redo_p114(geokb, 'Q145467', refs)

In [None]:
redo_qids = geokb_missing_pwid[geokb_missing_pwid['itemLabel'].isin(pw_core['title'])]['qid'].to_list()

In [None]:
pwid_redo = Parallel(n_jobs=8, prefer="threads")(delayed(redo_p114)(geokb, i, refs) for i in tqdm(redo_qids))

In [None]:
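# Identifiers already in the GeoKB. Assumption: the dataframe columns are named after the
# SPARQL variables (?indexId, ?doi) used in the queries above.
geokb_index_ids = geokb_pw_ids['indexId']

pw_contributors_new = pw_contributors[~pw_contributors['indexId'].isin(geokb_index_ids)]
pw_core_new = pw_core[
    (~pw_core['indexId'].isin(geokb_index_ids))
    &
    (~pw_core['doi'].isin(geokb_doi_ids['doi']))
    &
    (pw_core['indexId'].isin(pw_contributors_new['indexId']))
]
pw_texts_new = pw_texts[~pw_texts['indexId'].isin(geokb_index_ids)]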

In [None]:
missing_index_id = geokb.url_sparql_query(
    sparql_url="https://geokb.wikibase.cloud/query/sparql?query=PREFIX%20wd%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3FitemLabel%0AWHERE%20%7B%0A%20%20%3Fpub_classes%20wdt%3AP2*%20wd%3AQ6%20.%0A%20%20%3Fitem%20wdt%3AP1%20%3Fpub_classes%20.%0A%20%20MINUS%20%7B%20%3Fitem%20wdt%3AP114%20%3FindexId%20.%20%7D%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20.%20%7D%0A%7D",
    output_format="dataframe"
)
missing_index_id['qid'] = missing_index_id['item'].apply(lambda x: x.split('/')[-1])
missing_index_id.drop(columns="item", inplace=True)

In [None]:
pw_core_new_missing = pw_core_new[pw_core_new['title'].isin(missing_index_id['itemLabel'])].reset_index(drop=True)

pw_core_new_missing_qid = pd.merge(
    left=pw_core_new_missing,
    right=missing_index_id.rename(columns={'itemLabel': 'title'}),
    how="left",
    on="title"
)

In [None]:
pw_core_new_missing_qid[pw_core_new_missing_qid.duplicated(subset="qid", keep=False)].sort_values('title')

# Transformation Functions

In [None]:
def usgs_org_links(x):
    cost_center_orgs = [i.strip() for i in x['cost_centers']]
    if isinstance(x['programNote'], str):
        cost_center_orgs.append(x['programNote'])
    return list(set(cost_center_orgs))

def summary_text(row):
    page_lines = [f"= {row['title']} ="]
    if isinstance(row['docAbstract'], str):
        page_lines.append(pypandoc.convert_text(row['docAbstract'], 'mediawiki', format='html'))
    if isinstance(row['tableOfContents'], str):
        page_lines.append("== Table of Contents ==")
        page_lines.append(pypandoc.convert_text(row['tableOfContents'], 'mediawiki', format='html'))

    return '\n'.join(page_lines)

def split_place_names(x):
    if not isinstance(x, str):
        return None
    delim = ","
    if ";" in x:
        delim = ";"
    return [i.strip() for i in x.split(delim)]

def county_state(row):
    qualified_counties = []
    if row['state'] and row['county']:
        for st in row['state']:
            for ct in row['county']:
                qualified_counties.append(f"{ct}, {st}")
    if qualified_counties:
        return qualified_counties
    return

def pw_item_description(row):
    descriptive_lines = []
    if isinstance(row['pub_subtype'], str):
        descriptive_lines.append(f"a {row['pub_type']} ({row['pub_subtype']})")
    else:
        descriptive_lines.append(f"a {row['pub_type']}")

    if isinstance(row['publisher'], str):
        descriptive_lines.append(f"published by {row['publisher']}")

    if isinstance(row['series_title'], str):
        descriptive_lines.append(f"as part of series - {row['series_title']}")

    description_str = " ".join(descriptive_lines)
    if len(description_str) > 250:
        description_str = f"{description_str[:247]}..."
    return description_str

def chunk_list(lst, chunk_size):
    for i in range(0, len(lst), chunk_size):
        yield lst[i:i + chunk_size]
        
def search_orcid_list(orcid_list):
    # The bearer token goes in a standard Authorization header
    orcid_headers = {
        "Authorization": f"Bearer {os.environ['ORCID_ACCESS_TOKEN']}",
        "Accept": "application/vnd.orcid+json"
    }

    # Pass the query through params so requests handles the URL encoding
    r = requests.get(
        "https://pub.orcid.org/v3.0/expanded-search/",
        params={"q": f"orcid:({' OR '.join(orcid_list)})"},
        headers=orcid_headers
    )
    if r.status_code == 200:
        results = r.json().get('expanded-result')
        if results:
            return pd.DataFrame(results)

    return None

## Items to Process

Right now, I'm building a representation in the GeoKB for the following from the PW Catalog:
* All USGS Numbered Series (because so many of the historic documents contain unique geoscientific information that we want to be able to work against)
* All journal articles and other items that have ORCID-identified contributors (because of our capacity assessment use case)

This leaves out some number of PW items we may want to revisit.
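
The selection rule above can be sketched in plain Python as follows. The record shape and the function name are assumptions for illustration; the notebook applies the same rule with pandas filters, and it is assumed here that the set of contributor indexId values only covers ORCID-identified contributors.

```python
def select_for_baseline(records, orcid_index_ids, existing_index_ids):
    """Return indexIds to process: Numbered Series or ORCID-linked, and not yet in the GeoKB."""
    selected = []
    for rec in records:
        if rec["indexId"] in existing_index_ids:
            continue  # already represented in the GeoKB
        is_numbered_series = rec.get("pub_subtype") == "USGS Numbered Series"
        has_orcid_contributor = rec["indexId"] in orcid_index_ids
        if is_numbered_series or has_orcid_contributor:
            selected.append(rec["indexId"])
    return selected

# Hypothetical catalog records
records = [
    {"indexId": "a", "pub_subtype": "USGS Numbered Series"},
    {"indexId": "b", "pub_subtype": "Journal Article"},
    {"indexId": "c", "pub_subtype": "Journal Article"},
    {"indexId": "d", "pub_subtype": "USGS Numbered Series"},
]
to_process = select_for_baseline(records, orcid_index_ids={"b"}, existing_index_ids={"d"})
print(to_process)
```

Record "c" is the kind of item that gets left out: not a Numbered Series report and no ORCID-identified contributor.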

In [None]:
missing_usgs_reports = pw_core[
    (~pw_core['indexId'].isin(geokb_index_ids))
    &
    (pw_core['pub_subtype'] == 'USGS Numbered Series')
].reset_index(drop=True)

usgs_series_ref = geokb_pub_classes[geokb_pub_classes['subclass_ofLabel'] == 'USGS Numbered Series'].reset_index(drop=True)
usgs_series_ref['series_title'] = usgs_series_ref['itemLabel'].str.replace('USGS ', '')

missing_usgs_reports = pd.merge(
    left=missing_usgs_reports,
    right=usgs_series_ref[['instance_of_qid','series_title']],
    how="left",
    on="series_title"
)

# All other PW items
pw_class_mapping = {
    'Report': 'report',
    'Article': 'scholarly article',
    'Book chapter': 'conference paper',
    'Conference Paper': 'conference paper',
    'Book': 'book',
    'Thesis': 'thesis',
    'Newsletter': 'newsletter',
    'Extramural-Authored Publication Paper': 'scholarly article'
}

df_pw_class_mapping = pd.DataFrame.from_dict(pw_class_mapping, orient='index', columns=['geokb_class_name'])
df_pw_class_mapping.reset_index(inplace=True)
df_pw_class_mapping.columns = ['pub_type', 'geokb_class_name']

pw_other = pw_core[
    (pw_core['indexId'].isin(pw_contributors['indexId']))
    &
    (pw_core['pub_subtype'] != 'USGS Numbered Series')
    &
    (~pw_core['indexId'].isin(geokb_index_ids))
].reset_index(drop=True)

pw_other = pd.merge(
    left=pw_other,
    right=df_pw_class_mapping,
    how="left",
    on="pub_type"
)

other_pub_ref = geokb_pub_classes[geokb_pub_classes['itemLabel'].isin(pw_class_mapping.values())].reset_index(drop=True)

pw_other = pd.merge(
    left=pw_other,
    right=other_pub_ref[['instance_of_qid','itemLabel']],
    how="left",
    left_on="geokb_class_name",
    right_on="itemLabel"
)

pw_other.drop(columns=['geokb_class_name','itemLabel'], inplace=True)

# Combine the two for processing
pw_core_combined = pd.concat([missing_usgs_reports, pw_other])

In [None]:
missing_pubs_on_ids = pw_core_combined[
    (~pw_core_combined['doi'].isin(geokb_doi_items['doi']))
    &
    (~pw_core_combined['indexId'].isin(geokb_doi_items['indexId']))
].reset_index(drop=True)

In [None]:
missing_pubs_on_ids[missing_pubs_on_ids['title'].isin(geokb_doi_items['itemLabel'])][['indexId','doi','title','publicationYear']]['title'].to_list()

## Missing ORCIDs
Pubs Warehouse records introduce a whole new set of person references who are not yet represented in the GeoKB, where our initial source of people was the USGS personnel profile pages. Most of these are non-USGS personnel who are co-authors on publications. Some of them are either current or former USGS personnel who do not have a personnel profile page, or at least not any more. ORCID does provide email addresses in some cases. In our processing, we record those that are usgs.gov addresses as a potential avenue to tie together additional information for people, including former staff in some cases. Depending on when people left the USGS, they may have a record retained in our internal directory.
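
A small sketch of the email filtering described above. It is slightly stricter than the `endswith('usgs.gov')` test used in the processing cell further down, in that it compares the domain part of the address; the function name is hypothetical.

```python
def usgs_emails(emails):
    """Return mailto: URLs for addresses whose domain is usgs.gov or a subdomain of it."""
    if not isinstance(emails, (list, tuple)):
        return []  # ORCID records without email values
    keep = []
    for email in emails:
        if not isinstance(email, str) or "@" not in email:
            continue
        cleaned = email.strip()
        domain = cleaned.lower().rsplit("@", 1)[1]
        if domain == "usgs.gov" or domain.endswith(".usgs.gov"):
            keep.append(f"mailto:{cleaned}")
    return keep

# Made-up addresses for illustration
print(usgs_emails(["jdoe@usgs.gov", "x@notusgs.gov", " a@contractor.usgs.gov ", None]))
```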

In [None]:
missing_orcids = list(pw_contributors[
    (pw_contributors['indexId'].isin(pw_core_combined['indexId']))
    &
    (~pw_contributors['orcid'].isin(geokb_orcid['orcid']))
]['orcid'].unique())
print(len(missing_orcids))

In [None]:
orcid_check_dfs = []
for orcid_set in chunk_list(missing_orcids, 100):
    check_orcid = search_orcid_list(orcid_set)
    if check_orcid is not None:
        orcid_check_dfs.append(check_orcid)

In [None]:
harvested_orcids = pd.concat(orcid_check_dfs)

In [None]:
harvested_orcids = harvested_orcids.dropna(subset=["given-names","family-names","credit-name"], how="all")
harvested_orcids['label'] = harvested_orcids.apply(lambda x: f"{x['given-names'].split(',')[0]} {x['family-names']}", axis=1)

In [None]:
harvested_orcids.head()

Note: The process of discovering new ORCIDs and determining what to do with them could be a separate operation. I'm not doing anything with the remaining missing ORCID identifiers here. Most of these are cases where we already have a person with exactly the same name/label, and the Wikibase API throws an exception on trying to write these records. A few of these are likely actual duplicate ORCID records, meaning that a person had more than one ORCID registered for one reason or another. Most of them likely need to be disambiguated at some point. It's not vital that we have all linkages captured, so we'll leave these off for now.

In [None]:
refs_orcid = geokb.models.References()
refs_orcid.add(
    geokb.datatypes.Item(
        prop_nr=geokb.prop_lookup['data source'],
        value="Q138415"
    )
)

orcids_added = []

for index, row in harvested_orcids.iterrows():
    item = geokb.wbi.item.new()

    item.labels.set('en', row['label'])
    item.descriptions.set('en', 'a record for a person added from ORCID based on a contributor to a USGS publication')
    if isinstance(row['other-name'], list) and row['other-name']:
        item.aliases.set('en', row['other-name'])

    item.claims.add(
        geokb.datatypes.Item(
            prop_nr=geokb.prop_lookup['instance of'],
            value="Q3",
            references=refs_orcid
        )
    )

    item.claims.add(
        geokb.datatypes.ExternalID(
            prop_nr=geokb.prop_lookup['ORCID iD'],
            value=row['orcid-id'],
            references=refs_orcid
        )
    )

    email_claims = []
    for usgs_email in [i for i in (row['email'] if isinstance(row['email'], list) else []) if i.endswith('usgs.gov')]:
        email_claims.append(
            geokb.datatypes.URL(
                prop_nr=geokb.prop_lookup['email address'],
                value=f"mailto:{usgs_email.strip()}",
                references=refs_orcid
            )
        )
    if email_claims:
        item.claims.add(email_claims)

    try:
        response = item.write(
            summary="Person record added from ORCID registry based on ORCID ID found as author/contributor to USGS publication"
        )
        orcid_added = (row['orcid-id'],response.id)
        orcids_added.append(orcid_added)
        print(orcid_added)
    except Exception as e:
        print("FAILURE:", row['orcid-id'], e)

## Core Transformations

In [None]:
pw_core_combined['label'] = pw_core_combined['title'].apply(lambda x: x if len(x) <= 250 else f"{x[:247]}...")
pw_core_combined['description'] = pw_core_combined.apply(pw_item_description, axis=1)

## USGS Organization Mapping
We have a combination of a list of "USGS Cost Centers" and a separate field called programNote that contain links between publications and USGS organizations. Both of these contain some names that can be reasonably matched to our best master representation of USGS organizational units. In preparation, I did introduce a number of aliases into the GeoKB for the slight deviation on names used in the PW Catalog. We ultimately need to get this information better nailed down with some type of persistent, resolvable identifier for organizations (e.g., ROR should work in most cases).

I use the classification of organizations introduced in the GeoKB to distinguish between USGS Programs that fund the work, putting these into a funder linkage, and the other organizational types that I loosely claimed as "owner" of the products (meaning that the organizational unit was and is logically responsible for the product).
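
The funder/owner split can be sketched as a single rule on the organization type. The property identifiers and the "Science Center" type label below are placeholders; in the notebook the real identifiers come from `geokb.prop_lookup`.

```python
# Placeholder property ids standing in for geokb.prop_lookup values
PROP_LOOKUP = {"funder": "P_FUNDER", "owner": "P_OWNER"}

def org_predicate(org_type_label, prop_lookup=PROP_LOOKUP):
    """USGS Programs fund the work; every other organization type is treated as owner."""
    return prop_lookup["funder"] if org_type_label == "USGS Program" else prop_lookup["owner"]

# Hypothetical (indexId, organization name, organization type) links
links = [
    ("pub-1", "Example Program", "USGS Program"),
    ("pub-1", "Example Science Center", "Science Center"),
]
claims = [(index_id, org_predicate(org_type), org_name) for index_id, org_name, org_type in links]
print(claims)
```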

Given the fuzziness of the text values used in the PW Catalog (e.g., obvious misspellings, etc.), the information on cost centers/programs is a little bit suspect. However, I'm accepting it at face value for the time being.

The other piece of information we need to revisit here is the organizational affiliation for contributors. Presumably, this is the affiliation at the time of the specific item being published (or at least submitted for publication). We could potentially use this, at least from first authors, as a further indication of the organizational connection for the publication itself. For now, I'm leaving that relationship alone. We can also potentially use contributor affiliations in pub records for the people themselves, adding additional affiliation claims qualified by publication dates. However, given the questions about just how good or accurate this information is, I'm leaving that alone for now as well.

In [None]:
org_mapping_props = [
    "indexId",
    "cost_centers",
    "programNote"
]

pw_org_links = pw_core_combined[org_mapping_props].reset_index(drop=True)

pw_org_links['org_name'] = pw_org_links.apply(usgs_org_links, axis=1)

pw_org_mapping = pd.merge(
    left=pw_org_links[pw_org_links['org_name'].str.len() > 0][['indexId','org_name']].explode('org_name'),
    right=usgs_org_lookup[usgs_org_lookup['org_qid'] != 'Q44210'],
    how="inner",
    on="org_name"
)
pw_org_mapping['predicate'] = pw_org_mapping['instance_ofLabel'].apply(lambda x: geokb.prop_lookup['funder'] if x == "USGS Program" else geokb.prop_lookup['owner'])

pw_org_mapping.head()

## Contributors

In [None]:
pw_contributions = pd.merge(
    left=pw_contributors[pw_contributors['indexId'].isin(pw_core_combined['indexId'])][['indexId','orcid','pub_role']],
    right=geokb_orcid,
    how="inner",
    on="orcid"
)
pw_contributions['predicate'] = pw_contributions['pub_role'].apply(lambda x: geokb.prop_lookup[x[:-1]])

pw_contributions.head()

## Geo Names
The PW Catalog stores geo references in a couple of different ways. There are place names in country, state, county, city, and otherGeospatial fields. There is also a structure containing FeatureCollection GeoJSON. Both of these are somewhat suspect in terms of data quality when we find so many obvious misspellings in place names. For now, I decided to use country, state, and county references. I deal with the vagary of having counties listed without their states by putting together county and state myself from available information and only accepting cases where I get an exact match on name to how we represented these entities in the GeoKB (e.g., "Mesa County, Colorado").
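
A sketch of the county handling described above: every county is paired with every state listed on the record, and only pairs that exactly match a known GeoKB name are kept. The lookup contents and QID are placeholders, and unlike the notebook's `split_place_names` this version returns an empty list for missing values.

```python
def split_place_names(value):
    """Split a delimited place-name string; semicolons take precedence over commas."""
    if not isinstance(value, str):
        return []
    delim = ";" if ";" in value else ","
    return [part.strip() for part in value.split(delim) if part.strip()]

def qualified_counties(state_value, county_value, known_counties):
    """Build 'County, State' candidates and keep only exact matches to known names."""
    candidates = [
        f"{county}, {state}"
        for state in split_place_names(state_value)
        for county in split_place_names(county_value)
    ]
    return [name for name in candidates if name in known_counties]

# Placeholder lookup of county names as represented in the GeoKB
known = {"Mesa County, Colorado": "Q_COUNTY_1"}
print(qualified_counties("Colorado, Utah", "Mesa County", known))
```

The cross product produces false pairs such as "Mesa County, Utah"; the exact-match requirement discards them.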

The otherGeospatial property contains what are likely more interesting place names that we'll want to deal with as we further flesh out place name references in the GeoKB. The actual geometry in the feature collections might also be quite interesting, but there does appear to be a mishmash of very specific footprints, perhaps providing the study site from a given paper, and things that are more likely boundaries of entities that may as well be names.

In [None]:
pw_geo_props = [
    "indexId",
    "country",
    "state",
    "county"
]

pw_geonames = pw_core_combined[pw_geo_props].reset_index(drop=True)
pw_geonames['country'] = pw_geonames['country'].apply(split_place_names)
pw_geonames['state'] = pw_geonames['state'].apply(split_place_names)
pw_geonames['county'] = pw_geonames['county'].apply(split_place_names)
pw_geonames['county'] = pw_geonames.apply(county_state, axis=1)

place_name_dfs = []
for place_type in [i for i in pw_geo_props if i != "indexId"]:
    df = pw_geonames[['indexId',place_type]].explode(place_type).dropna()
    df['name_type'] = place_type
    place_name_dfs.append(df.rename(columns={place_type: 'place_name'})[['indexId','name_type','place_name']])

pw_place_names = pd.concat(place_name_dfs)

pw_place_name_claims = pd.concat([
    pd.merge(
        left=pw_place_names[pw_place_names['name_type'] == 'country'],
        right=country_lookup,
        how="inner",
        on="place_name"
    ),
    pd.merge(
        left=pw_place_names[pw_place_names['name_type'] == 'state'],
        right=geokb_states,
        how="inner",
        on="place_name"
    ),
    pd.merge(
        left=pw_place_names[pw_place_names['name_type'] == 'county'],
        right=geokb_counties,
        how="inner",
        on="place_name"
    )
])
pw_place_name_claims['predicate'] = geokb.prop_lookup['addresses place']

pw_place_name_claims.head()

## Descriptive texts
It is potentially quite useful to bring the larger text "blobs" we have for some PW Catalog items into the GeoKB itself. We take advantage of the wiki page functionality of "item discussions" to store these texts in wiki markup. This makes the text available as part of Wikibase/MediaWiki search indexing, which may have its own benefits for UI and API uses. It also puts the texts somewhere resolvable so that we can reference them effectively in provenance for anything we derive through text processing.
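
A sketch of how a discussion page might be assembled from an abstract and a table of contents. The `convert` argument stands in for the HTML-to-MediaWiki conversion (the notebook uses `pypandoc.convert_text(..., 'mediawiki', format='html')`); an identity function is used here so the example runs without pandoc. The function name and the `None` return for empty pages are assumptions.

```python
def discussion_page_text(title, abstract=None, toc=None, convert=lambda html: html, min_abstract_len=75):
    """Assemble wiki text for an item discussion page, or None if there is nothing to write."""
    lines = [f"= {title} ="]
    # Very short abstracts are treated as placeholders and skipped
    if isinstance(abstract, str) and len(abstract) > min_abstract_len:
        lines.append(convert(abstract))
    if isinstance(toc, str):
        lines.append("== Table of Contents ==")
        lines.append(convert(toc))
    return "\n".join(lines) if len(lines) > 1 else None

page = discussion_page_text("Example title", abstract="x" * 80, toc="<p>1. Introduction</p>")
print(page)
```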

In [None]:
summary_text_props = [
    "indexId",
    "title",
    "docAbstract",
    "tableOfContents"
]

pw_text_content = pw_core_combined[summary_text_props].reset_index(drop=True)

pw_text_content = pw_text_content[
    (pw_text_content['docAbstract'].notnull())
    |
    (pw_text_content['tableOfContents'].notnull())
].reset_index(drop=True)

pw_text_content['docAbstract'] = pw_text_content['docAbstract'].apply(lambda x: x if isinstance(x, str) and len(x) > 75 else pd.NA)
pw_text_content.dropna(subset=['docAbstract','tableOfContents'], how="all", inplace=True)

# Converting HTML to MediaWiki markup takes a while
pw_text_content['summary_text'] = pw_text_content.apply(summary_text, axis=1)
pw_text_content = pw_text_content[['indexId','summary_text']].reset_index(drop=True)

pw_text_content.head()

# Write to GeoKB
With pre-processing done, we can now build new product items in the GeoKB with all of the following:
* labels (titles) and descriptions
* classification into one of the publication/report categories
* publication year
* author/editor/compiler contributor links
* organization links (funder or owner)
* place name links
* summary text on item discussion wiki pages (abstract and/or table of contents)
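
Most of the links in this list reduce to simple subject/predicate/object claims, and they are easier to write when grouped per item and property first. The following is a minimal sketch of that grouping on a toy table. The indexId, property, and QID values are made up for illustration and do not correspond to real GeoKB identifiers.

```python
import pandas as pd

# Toy long-form claims table; every identifier here is hypothetical
claims = pd.DataFrame(
    [
        ("sir20205001", "P1", "Q10"),
        ("sir20205001", "P102", "Q200"),
        ("sir20205001", "P102", "Q201"),
        ("ofr20191010", "P1", "Q11"),
    ],
    columns=["subject", "predicate", "object"],
)

# One row per (subject, predicate) with all objects collected into a list
grouped = claims.groupby(["subject", "predicate"])["object"].agg(list).reset_index()

# Pull the claim groups for a single item, keyed by predicate
one_item = grouped[grouped["subject"] == "sir20205001"]
item_claims = dict(zip(one_item["predicate"], one_item["object"]))
print(item_claims)
```

Grouping this way means each item can be written with one pass over its predicates rather than one lookup per claim.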

Before committing, I looked to see if we had duplicate labels. It turns out there are quite a few, many of which appear to be duplicates in the PW Catalog itself. They have separate indexId values, but other aspects of the records look the same. It's hard to say whether these are cataloging mistakes or something else. To deal with it, I drop duplicates on the label, keeping the record that has a DOI value where one exists as the most valuable for our purposes.
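
As a small illustration of why sorting on DOI before dropping duplicates keeps the DOI-bearing record, here is a sketch on made-up rows (the identifiers and DOI are placeholders, not real catalog values). It relies on the pandas default of sorting missing values last.

```python
import pandas as pd

# Hypothetical records: two share a label, only one of them has a DOI
records = pd.DataFrame({
    "indexId": ["70000001", "70000002", "70000003"],
    "label": ["Same title", "Same title", "Another title"],
    "doi": [None, "10.0000/example", None],
})

# sort_values places missing values last by default (na_position="last"),
# so within each label the first row has a DOI whenever one exists
unique_records = (
    records
    .sort_values("doi")
    .drop_duplicates(subset="label", keep="first")
)
print(unique_records[["indexId", "label", "doi"]])
```

When none of the duplicates has a DOI, this approach still keeps exactly one of them, but which one is not meaningful.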

In [None]:
pw_items = pw_core_combined[['indexId','doi','publicationYear','label','description','instance_of_qid','pub_subtype']].reset_index(drop=True)
# Missing DOIs sort last by default, so keep='first' prefers the record with a DOI for each label
pw_items_sorted = pw_items.sort_values('doi')
pw_unique_items = pw_items_sorted.drop_duplicates(subset='label', keep='first')
pw_unique_items

In [None]:
instance_of_claims = pw_unique_items[['indexId','instance_of_qid']].reset_index(drop=True).rename(columns={'indexId': 'subject', 'instance_of_qid': 'object'})
instance_of_claims['predicate'] = 'P1'

claims = pd.concat([
    instance_of_claims[['subject','predicate','object']],
    pw_contributions[pw_contributions['indexId'].isin(pw_unique_items['indexId'])][['indexId','predicate','person_qid']].rename(columns={'indexId': 'subject', 'person_qid': 'object'}),
    pw_org_mapping[pw_org_mapping['indexId'].isin(pw_unique_items['indexId'])][['indexId','predicate','org_qid']].rename(columns={'indexId': 'subject', 'org_qid': 'object'}),
    pw_place_name_claims[pw_place_name_claims['indexId'].isin(pw_unique_items['indexId'])][['indexId','predicate','object']].rename(columns={'indexId': 'subject'})
])

grouped_claims = claims.groupby(["subject","predicate"])['object'].agg(list).reset_index()

grouped_claims.head(10)

In [None]:
pw_text_content.head()

In [None]:
refs = geokb.models.References()
refs.add(
    geokb.datatypes.Item(
        prop_nr=geokb.prop_lookup["data source"],
        value="Q54915"
    )
)

pw_items = []
failed_items = []

for index, row in commit_items.iterrows():
    item_claims = grouped_claims[grouped_claims['subject'] == row['indexId']]
    # Named item_summary so the summary_text() function used earlier is not overwritten
    item_summary = pw_text_content[pw_text_content['indexId'] == row['indexId']]

    item = geokb.wbi.item.get(row['qid'])

    item.labels.set('en', row['label'])
    item.descriptions.set('en', row['description'])

    # Numbered series reports are commonly referred to by their indexId
    if row['pub_subtype'] == "USGS Numbered Series":
        item.aliases.set('en', row['indexId'])

    # Add item-valued claims one property at a time
    for idx, claim_group in item_claims.iterrows():
        # Named property_claims so the claims dataframe built earlier is not overwritten
        property_claims = []
        for obj_qid in claim_group['object']:
            property_claims.append(
                geokb.datatypes.Item(
                    prop_nr=claim_group['predicate'],
                    value=obj_qid,
                    references=refs
                )
            )
        item.claims.add(property_claims)

    if isinstance(row['doi'], str):
        item.claims.add(
            geokb.datatypes.ExternalID(
                prop_nr=geokb.prop_lookup['DOI'],
                value=row['doi'],
                references=refs
            )
        )

    item.claims.add(
        geokb.datatypes.ExternalID(
            prop_nr=geokb.prop_lookup['USGS Publications Warehouse IndexID'],
            value=row['indexId'],
            references=refs
        )
    )

    # Only the year is known, so use January 1 with year-level precision (9)
    if isinstance(row['publicationYear'], str):
        item.claims.add(
            geokb.datatypes.Time(
                prop_nr=geokb.prop_lookup['publication date'],
                time=f"+{row['publicationYear']}-01-01T00:00:00Z",
                precision=9,
                references=refs
            )
        )

    try:
        response = item.write(
            summary="Updated publication item from USGS Pubs Warehouse"
        )
        item_tuple = (row['indexId'], response.id)
        pw_items.append(item_tuple)
        print(item_tuple)

        # Write the abstract and/or table of contents to the item's discussion page
        if not item_summary.empty:
            page = geokb.mw_site.pages[f"Item_talk:{response.id}"]
            response_summary = page.edit(
                item_summary.iloc[0]['summary_text'],
                summary="Added abstract and other texts to publication item's discussion page for reference"
            )
            print(response_summary['title'])
    except Exception as e:
        # Record the failure and move on to the next item
        failure_packet = (row['indexId'], str(e))
        failed_items.append(failure_packet)
        print(failure_packet)

# Interpub Relationships
I need to work through the relationships once I have most catalog items represented in the GeoKB. This will establish "part of" and "superseded by" relationships.
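
One possible shape for that step is sketched below, under stated assumptions. The relationship triples, the column names, and the `relationship_claims` helper are all hypothetical, and how the relationships are actually expressed in the PW Catalog records still needs to be worked out. The point is only that a relationship can be written once both ends resolve to items already in the GeoKB.

```python
import pandas as pd

# Hypothetical inputs: relationship triples between indexId values, and an
# indexId-to-QID lookup for items that already exist in the knowledge base
relationships = pd.DataFrame(
    [
        ("ofr20191010", "superseded by", "sir20205001"),
        ("sir20205001A", "part of", "sir20205001"),
        ("ofr19990001", "superseded by", "ofr20000002"),
    ],
    columns=["subject_id", "relationship", "object_id"],
)
qid_lookup = {"ofr20191010": "Q100", "sir20205001": "Q101", "sir20205001A": "Q102"}


def relationship_claims(relationships, qid_lookup):
    """Map both ends of each relationship to QIDs, keeping only resolvable pairs."""
    claims = pd.DataFrame({
        "subject": relationships["subject_id"].map(qid_lookup),
        "predicate": relationships["relationship"],
        "object": relationships["object_id"].map(qid_lookup),
    })
    # Rows where either end is not yet in the lookup are left for a later pass
    return claims.dropna(subset=["subject", "object"]).reset_index(drop=True)


interpub_claims = relationship_claims(relationships, qid_lookup)
print(interpub_claims)
```

The resulting subject/predicate/object rows have the same form as the other claims in this notebook, so they could be grouped and written in the same way.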

# Pub Links
I need to decide what to do with the links to content. We may want to selectively store links in the knowledge graph that will ultimately serve as source material for processing that results in new claims. This would give a better provenance trace on those claims without having to navigate another system.