Many Pubs Warehouse Catalog items provide a docAbstract property containing the abstract text, and some provide a separate tableOfContents field. Both can contain useful descriptive information that we can exploit through natural language processing of one kind or another. Both fields are stored as HTML.

I thought it would be interesting to put the item discussion wiki pages to further use by storing these larger text blobs there. That keeps the texts on board within our knowledgebase environment, where other people can see them in context with their item metadata. I believe the pages are included in the MediaWiki search index, but I will need to experiment to confirm. As we run operations against the texts using LLMs or other techniques, we will be able to point directly to the text that was used to derive claims in the graph.

In [70]:
import pickle
import pandas as pd
import pypandoc
from wbmaker import WikibaseConnection
import re


In [40]:
geokb = WikibaseConnection('GEOKB_CLOUD')

In [3]:
with open('data/pw_usgs_reports_dump.pickle', 'rb') as f:
    pw_dump = pd.DataFrame(pickle.load(f))

In [41]:
geokb_pw_ids = geokb.url_sparql_query(
    sparql_url="https://geokb.wikibase.cloud/query/sparql?query=PREFIX%20wd%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3FindexId%0AWHERE%20%7B%0A%20%20%3Fclasses%20wdt%3AP2%20wd%3AQ11%20.%0A%20%20%3Fitem%20wdt%3AP1%20%3Fclasses%20.%0A%20%20%3Fitem%20wdt%3AP114%20%3FindexId%20.%0A%7D",
    output_format="dataframe"
)

geokb_pw_ids["qid"] = geokb_pw_ids['item'].apply(lambda x: x.split('/')[-1])

id_dfs = [
    geokb_pw_ids.drop(columns=['item'])
]

# Workaround for a partial failure state in SPARQL: supplement the query
# results with the item/indexId records stored under data/extra_pwids
import json
from glob import glob

for fn in glob('./data/extra_pwids/*'):
    with open(fn) as f:
        cached_ids = pd.DataFrame(json.load(f))
    cached_ids['qid'] = cached_ids['item'].apply(lambda x: x.split('/')[-1])
    id_dfs.append(cached_ids.drop(columns=['item']))

id_map = pd.concat(id_dfs)

id_map.drop_duplicates(inplace=True)
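
The percent-encoded query string passed to `url_sparql_query` above is hard to read and harder to edit. As a minimal sketch, the same URL can be built from readable SPARQL text with the standard library. The query text below is simply the decoded form of the URL in the cell above; the helper names `build_sparql_url` and `qid_from_uri` are hypothetical and are not part of wbmaker.

```python
from urllib.parse import quote, unquote

SPARQL_ENDPOINT = "https://geokb.wikibase.cloud/query/sparql"

# Decoded form of the query embedded in the URL used above
PW_ID_QUERY = """PREFIX wd: <https://geokb.wikibase.cloud/entity/>
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?item ?indexId
WHERE {
  ?classes wdt:P2 wd:Q11 .
  ?item wdt:P1 ?classes .
  ?item wdt:P114 ?indexId .
}"""


def build_sparql_url(query, endpoint=SPARQL_ENDPOINT):
    """Percent-encode the query text and attach it as the `query` parameter."""
    return f"{endpoint}?query={quote(query, safe='')}"


def qid_from_uri(uri):
    """Take the last path segment of an entity URI as the QID."""
    return uri.split('/')[-1]


sparql_url = build_sparql_url(PW_ID_QUERY)
print(sparql_url)
print(qid_from_uri("https://geokb.wikibase.cloud/entity/Q55218"))
```

Keeping the query as plain text makes it easier to review which properties drive the identifier mapping, and the resulting `sparql_url` could be passed as the `sparql_url` argument shown above.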

In [65]:
pw_text_content = pw_dump[
    (pw_dump['docAbstract'].notnull())
    |
    (pw_dump['tableOfContents'].notnull())
][['indexId','title','docAbstract','tableOfContents']].reset_index(drop=True)

# Blank out any line containing the "No abstract available" placeholder,
# then turn the resulting empty strings into missing values
pattern = r'.*No\s+abstract\s+available.*'
regex_pattern = re.compile(pattern)
pw_text_content['docAbstract'] = pw_text_content['docAbstract'].str.replace(regex_pattern, '', regex=True)
pw_text_content['docAbstract'] = pw_text_content['docAbstract'].replace('', pd.NA)
pw_text_content.dropna(subset=['docAbstract','tableOfContents'], how="all", inplace=True)
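
To see what the placeholder pattern in the cell above does to individual values, here is a small sketch using only the standard library `re` module. The function name `clean_abstract` and the sample strings are made up for illustration; the pattern is the one used above. Because `.` does not match newlines by default, only the line containing the placeholder is blanked, and the match is case-sensitive.

```python
import re

PLACEHOLDER = re.compile(r'.*No\s+abstract\s+available.*')


def clean_abstract(value):
    """Blank out placeholder lines; return None when no text remains."""
    if not isinstance(value, str):
        return None  # missing values stay missing
    cleaned = PLACEHOLDER.sub('', value)
    return cleaned if cleaned else None


# Hypothetical sample values, not taken from the Pubs Warehouse dump
samples = [
    "<p>No abstract available.</p>",
    "<p>Real abstract.</p>",
    "Line one\nNo  abstract available\nLine three",
    None,
]

for sample in samples:
    print(repr(sample), "->", repr(clean_abstract(sample)))
```

A value that consists only of the placeholder ends up missing, which is what lets the later `dropna` call discard records that have neither an abstract nor a table of contents.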

In [81]:
commit_texts = pd.merge(
    left=pw_text_content,
    right=id_map,
    how="inner",
    on="indexId"
)
commit_texts.head()

,indexId,title,docAbstract,tableOfContents,qid
0,ofr20231045,LANDFIRE technical documentation,<h1>Executive Summary</h1><p>LANDFIRE (LF) com...,<ul><li>Executive Summary</li><li>Chapter A. I...,Q55218
1,fs20233027,Assessment of undiscovered conventional oil an...,<p>Using a geology-based assessment methodolog...,<ul><li>Introduction</li><li>Total Petroleum S...,Q55219
2,ofr20231052,Status of spectacled eiders (Somateria fischer...,<p>The nesting biology and demography of spect...,<ul><li>Acknowledgments</li><li>Abstract</li><...,Q55220
3,sir20235079,Techniques for estimating the magnitude and fr...,<p>Annual peak-flow data collected at U.S. Geo...,<ul><li>Abstract</li><li>Introduction</li><li>...,Q55221
4,ofr20231033,Guidelines for calibration of uncrewed aircraf...,<h1>Executive Summary</h1><p>This report outli...,<ul><li>Executive Summary</li><li>Introduction...,Q55222
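
Before writing to the wiki, it can help to preview the page text without touching the live site. The sketch below factors the page assembly done in the next cell into a pure function that takes the HTML-to-wikitext converter as an argument. The names `build_talk_page` and `fake_converter` are hypothetical, and the stand-in converter only tags its input; in the notebook the converter would be the `pypandoc.convert_text` call used in the next cell.

```python
from typing import Callable


def build_talk_page(title, abstract_html, toc_html,
                    to_wikitext: Callable[[str], str]) -> str:
    """Assemble wikitext for an item talk page from optional HTML fragments."""
    page_lines = [f"= {title} ="]
    # Missing values arrive from pandas as NaN/NA rather than str, so they are skipped
    if isinstance(abstract_html, str):
        page_lines.append(to_wikitext(abstract_html))
    if isinstance(toc_html, str):
        page_lines.append("== Table of Contents ==")
        page_lines.append(to_wikitext(toc_html))
    return "\n".join(page_lines)


# Stand-in converter for a dry run (hypothetical; it does no real conversion).
# In the notebook the real conversion would be:
#   lambda html: pypandoc.convert_text(html, 'mediawiki', format='html')
def fake_converter(html: str) -> str:
    return f"[converted] {html}"


preview = build_talk_page(
    "LANDFIRE technical documentation",
    "<p>Example abstract</p>",
    None,
    fake_converter,
)
print(preview)
```

Separating assembly from the `page.edit` call makes it possible to inspect a few pages first, or to compare the new text with the existing page content before committing an edit.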


In [84]:
for _, row in commit_texts.iterrows():
    # Start each page with the publication title as a top-level heading
    page_lines = [f"= {row['title']} ="]
    # Missing values are not str instances, so absent fields are skipped
    if isinstance(row['docAbstract'], str):
        page_lines.append(pypandoc.convert_text(row['docAbstract'], 'mediawiki', format='html'))
    if isinstance(row['tableOfContents'], str):
        page_lines.append("== Table of Contents ==")
        page_lines.append(pypandoc.convert_text(row['tableOfContents'], 'mediawiki', format='html'))

    page_content = '\n'.join(page_lines)

    # Write the converted text to the item's discussion (talk) page
    page = geokb.mw_site.pages[f"Item_talk:{row['qid']}"]
    response = page.edit(page_content, summary="Added abstract and other texts to publication item's discussion page for reference")
    print(response['title'])


Item talk:Q55218
Item talk:Q55219
Item talk:Q55220
Item talk:Q55221
Item talk:Q55222
Item talk:Q55223
Item talk:Q55224
Item talk:Q55225
Item talk:Q55226
Item talk:Q55227
Item talk:Q55228
Item talk:Q55229
Item talk:Q55230
Item talk:Q55231
Item talk:Q55232
Item talk:Q55233
Item talk:Q55234
Item talk:Q55235
Item talk:Q55236
Item talk:Q55237
Item talk:Q55238
Item talk:Q55239
Item talk:Q55240
Item talk:Q55241
Item talk:Q55242
Item talk:Q55243
Item talk:Q55244
Item talk:Q55245
Item talk:Q55246
Item talk:Q55247
Item talk:Q55248
Item talk:Q55249
Item talk:Q55250
Item talk:Q55251
Item talk:Q55252
Item talk:Q55253
Item talk:Q55254
Item talk:Q55255
Item talk:Q55256
Item talk:Q55257
Item talk:Q55258
Item talk:Q55259
Item talk:Q55260
Item talk:Q55261
Item talk:Q55262
Item talk:Q55263
Item talk:Q55264
Item talk:Q55265
Item talk:Q55266
Item talk:Q55267
Item talk:Q55268
Item talk:Q55269
Item talk:Q55270
Item talk:Q55271
Item talk:Q55272
Item talk:Q55273
Item talk:Q55274
Item talk:Q55275
Item talk:Q552