This notebook works through the process of pulling document references in from the Zotero collections that we are using as part of our GeoArchive efforts. These documents provide reference materials on mineral exploration history and other details important to our use cases. The GeoKB serves as the point where all of the pertinent assertions derived from these documents come together and link to other information for research and analysis.

We use Zotero because it provides a platform for effectively managing the documents and their metadata, giving scientists and science support personnel a reasonable set of reference management tools. It also provides a solid API to operate against when we need to pull reference details into other systems.

### Pre-cached Inventory

The Zotero Python API (pyzotero) provides a number of methods for searching and working with libraries, collections, and items. We use Group Libraries as logical containers that are considered "GeoArchive Collections." Zotero libraries can be further organized into collections (folders) that can be hierarchical as a convenient management mechanism.

Once a library grows to potentially thousands of items, it can be time consuming to pull everything into another platform like the GeoKB (the .everything() method in pyzotero handles this politely but slowly). For this reason, we built a utility function that periodically refreshes a cached inventory with all metadata for the library, storing this as a file within a specialty item in the library. When we want to grab everything for some purpose, we can simply get the inventory and work with it from there.
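
The caching idea can be sketched roughly as follows. This is not our actual utility function: it assumes a local JSON file rather than a file attached to a Zotero item, and the names `load_inventory`, `fetch_all`, and `max_age_seconds` are hypothetical. The slow pull is passed in as a callable so the sketch runs without network access.

```python
import json
import os
import tempfile
import time


def load_inventory(cache_path, fetch_all, max_age_seconds=86400):
    """Return the cached inventory, refreshing it when missing or stale."""
    is_fresh = (
        os.path.exists(cache_path)
        and time.time() - os.path.getmtime(cache_path) < max_age_seconds
    )
    if is_fresh:
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f)

    # Cache is missing or too old: make the slow call and store the result.
    items = fetch_all()
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(items, f)
    return items


# Stand-in for the slow library pull; records how often it is invoked.
calls = []


def fake_fetch_all():
    calls.append(1)
    # Hypothetical item record, not real library content.
    return [{"key": "ABCD1234", "data": {"title": "Example report"}}]


cache_file = os.path.join(tempfile.mkdtemp(), "inventory.json")
first = load_inventory(cache_file, fake_fetch_all)   # fetches and writes the cache
second = load_inventory(cache_file, fake_fetch_all)  # served from the cache file
```

In a live session the callable would wrap the real library pull (for example, something like `lambda: zot.everything(zot.top())` with a pyzotero client named `zot`), so the expensive request only happens when the cache has aged out.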

### Entity Identification and Extraction

One of the value-added services we layer on top of GeoArchive collections is a process to pipe contents through one or more engines for identifying important entities from within the archived materials and extracting them for further evaluation and use in research and assessments. Currently, we pipe all Zotero-based collections through the xDD (GeoDeepDive) engine at the University of Wisconsin-Madison. xDD processing includes a combination of search indexing and natural language processing (NLP) methods along with identification/extraction of tables, figures, and equations.

We're still developing the methodology to leverage the xDD processing pipelines to best effect. This will include pulling out key concepts from specific "dictionaries" and feeding those back to Zotero as tags as well as pulling document-specific extractions and dropping those as additional file attachments. These actions will roundtrip extracted information into the platform where the documents are being managed so that they are readily available in context for use.

The GeoKB piece of the architecture provides the platform where these extractions can be dynamically curated within a larger context of every other piece of related/linked information. We still need to work out where all the pieces of technology and processing software need to sit in order to handle this reliably and efficiently, but this notebook provides a start to how it all makes its way into the GeoKB.

In [25]:
import os
from pyzotero import zotero
import pandas as pd
from utils import sparql_query
from zipfile import ZipFile
import requests
from io import BytesIO

In [2]:
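# Load library metadata from a local Parquet file.
# Note: the name ni43101_library is rebound to the pyzotero client in a later cell.
ni43101_library = pd.read_parquet('ni43101.p')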

In [12]:
ni43101_library["place_project"] = ni43101_library.apply(lambda x: (x.project_name, x.place) if x.project_name != '' else None, axis=1)

In [31]:
ni43101_projects = list(ni43101_library.project_name.unique())

In [17]:
query_wd_mine = """
SELECT ?mine ?mineLabel ?country ?countryLabel ?admin_area ?admin_areaLabel WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
  ?mine wdt:P31 wd:Q820477.
  OPTIONAL {
    ?mine wdt:P17 ?country .
    ?mine wdt:P131 ?admin_area .
  }
}
"""

wd_mines = sparql_query(
    endpoint='https://query.wikidata.org/sparql',
    query=query_wd_mine,
    output='dataframe'
)

In [20]:
wd_mines["place_mine"] = wd_mines.apply(lambda x: (x.mineLabel, x.countryLabel), axis=1)

In [23]:
wd_mines[wd_mines.place_mine.isin(ni43101_library.place_project)]

Unnamed: 0,mine,mineLabel,country,countryLabel,admin_area,admin_areaLabel,place_mine
433,http://www.wikidata.org/entity/Q6119734,Lomero-Poyatos,http://www.wikidata.org/entity/Q29,Spain,http://www.wikidata.org/entity/Q1445364,El Cerro de Andévalo,"(Lomero-Poyatos, Spain)"
445,http://www.wikidata.org/entity/Q119701,Kombat,http://www.wikidata.org/entity/Q1030,Namibia,http://www.wikidata.org/entity/Q876506,Otjozondjupa Region,"(Kombat, Namibia)"
446,http://www.wikidata.org/entity/Q119701,Kombat,http://www.wikidata.org/entity/Q1030,Namibia,http://www.wikidata.org/entity/Q3711701,Otavi Constituency,"(Kombat, Namibia)"
1188,http://www.wikidata.org/entity/Q7240266,Premier,http://www.wikidata.org/entity/Q16,Canada,http://www.wikidata.org/entity/Q1973,British Columbia,"(Premier, Canada)"
1189,http://www.wikidata.org/entity/Q7240266,Premier,http://www.wikidata.org/entity/Q16,Canada,http://www.wikidata.org/entity/Q2138250,Regional District of Kitimat-Stikine,"(Premier, Canada)"
2685,http://www.wikidata.org/entity/Q21833858,San Francisco,http://www.wikidata.org/entity/Q414,Argentina,http://www.wikidata.org/entity/Q44803,Salta Province,"(San Francisco, Argentina)"
3778,http://www.wikidata.org/entity/Q4956061,Bralorne,http://www.wikidata.org/entity/Q16,Canada,http://www.wikidata.org/entity/Q1973,British Columbia,"(Bralorne, Canada)"
3779,http://www.wikidata.org/entity/Q4956061,Bralorne,http://www.wikidata.org/entity/Q16,Canada,http://www.wikidata.org/entity/Q132115,Squamish-Lillooet Regional District,"(Bralorne, Canada)"
4264,http://www.wikidata.org/entity/Q22398272,Salvadora,http://www.wikidata.org/entity/Q298,Chile,http://www.wikidata.org/entity/Q2118,Antofagasta Region,"(Salvadora, Chile)"
4265,http://www.wikidata.org/entity/Q22398294,Salvadora,http://www.wikidata.org/entity/Q298,Chile,http://www.wikidata.org/entity/Q2120,Atacama Region,"(Salvadora, Chile)"


In [26]:
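# Download the GNIS National File (zipped, pipe-delimited text)
gnis_national_file = 'https://geonames.usgs.gov/docs/stategaz/NationalFile.zip'
r_gnis_national_file = requests.get(gnis_national_file)
# Fail early on an HTTP error rather than trying to unzip an error page
r_gnis_national_file.raise_for_status()
z = ZipFile(BytesIO(r_gnis_national_file.content))
# We know it's the first/only file and it's delimited with pipe
gnis_national = pd.read_csv(z.open(z.namelist()[0]), sep='|')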

In [27]:
mine_features = gnis_national[gnis_national.FEATURE_CLASS == 'Mine']
mine_features.head(10)

Unnamed: 0,FEATURE_ID,FEATURE_NAME,FEATURE_CLASS,STATE_ALPHA,STATE_NUMERIC,COUNTY_NAME,COUNTY_NUMERIC,PRIMARY_LAT_DMS,PRIM_LONG_DMS,PRIM_LAT_DEC,PRIM_LONG_DEC,SOURCE_LAT_DMS,SOURCE_LONG_DMS,SOURCE_LAT_DEC,SOURCE_LONG_DEC,ELEV_IN_M,ELEV_IN_FT,MAP_NAME,DATE_CREATED,DATE_EDITED
44,444,Yucca Mine,Mine,AZ,4,Mohave,15.0,343909N,1142231W,34.652509,-114.375235,,,,,451.0,1480.0,Topock,02/08/1980,05/01/2006
70,470,Abe Lincoln Mine,Mine,AZ,4,Yavapai,25.0,340244N,1123232W,34.045586,-112.542118,,,,,1192.0,3911.0,Morgan Butte,02/08/1980,
73,473,Abril Mine,Mine,AZ,4,Cochise,3.0,315429N,1095929W,31.90814,-109.991459,,,,,2031.0,6663.0,Cochise Stronghold,02/08/1980,
83,483,Adams Mine,Mine,AZ,4,Mohave,15.0,345808N,1142335W,34.968892,-114.393014,,,,,659.0,2162.0,Boundary Cone,02/08/1980,
118,519,Aguinaldo Mine,Mine,AZ,4,Pima,19.0,315508N,1111712W,31.918971,-111.286767,,,,,1139.0,3737.0,Stevens Mountain,02/08/1980,
146,547,Alabama Mine,Mine,AZ,4,Mohave,15.0,352027N,1133603W,35.340831,-113.600772,,,,,1501.0,4924.0,Valentine SE,02/08/1980,
162,563,Alaska Mine,Mine,AZ,4,Maricopa,13.0,334342N,1131854W,33.728366,-113.314918,,,,,572.0,1877.0,Weldon Hill,02/08/1980,
163,564,Ajax Mine,Mine,AZ,4,Cochise,3.0,320048N,1091243W,32.013422,-109.212006,,,,,1473.0,4833.0,Blue Mountain,02/08/1980,
174,575,Alcyone Mine,Mine,AZ,4,Mohave,15.0,345934N,1142425W,34.992781,-114.406904,,,,,664.0,2178.0,Boundary Cone,02/08/1980,
187,588,Alice Mine,Mine,AZ,4,Pinal,21.0,330756N,1105525W,33.132284,-110.923722,,,,,935.0,3068.0,Hot Tamale Peak,02/08/1980,


Unnamed: 0,FEATURE_ID,FEATURE_NAME,FEATURE_CLASS,STATE_ALPHA,STATE_NUMERIC,COUNTY_NAME,COUNTY_NUMERIC,PRIMARY_LAT_DMS,PRIM_LONG_DMS,PRIM_LAT_DEC,PRIM_LONG_DEC,SOURCE_LAT_DMS,SOURCE_LONG_DMS,SOURCE_LAT_DEC,SOURCE_LONG_DEC,ELEV_IN_M,ELEV_IN_FT,MAP_NAME,DATE_CREATED,DATE_EDITED
2752,3181,Contact,Mine,AZ,4,Cochise,3.0,312448N,1095547W,31.413433,-109.929797,,,,,1684.0,5525.0,Bisbee,02/08/1980,
2800,3229,Copper King,Mine,AZ,4,Cochise,3.0,312615N,1095411W,31.437600,-109.903130,,,,,1616.0,5302.0,Bisbee,02/08/1980,
4462,4906,Galena,Mine,AZ,4,Cochise,3.0,312450N,1095336W,31.413989,-109.893407,,,,,1591.0,5220.0,Bisbee,02/08/1980,
11475,11977,Sunrise,Mine,AZ,4,Cochise,3.0,312610N,1095510W,31.436211,-109.919519,,,,,1833.0,6014.0,Bisbee,02/08/1980,
12518,13031,Uncle Sam,Mine,AZ,4,Cochise,3.0,312602N,1095512W,31.433988,-109.920075,,,,,1783.0,5850.0,Bisbee,02/08/1980,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1318950,1430409,Molly Gibson,Mine,UT,49,Juab,23.0,395437N,1120542W,39.910228,-112.094945,,,,,2014.0,6608.0,Eureka,12/31/1979,
1319397,1430861,North Star,Mine,UT,49,Juab,23.0,395515N,1120625W,39.920783,-112.106890,,,,,2107.0,6913.0,Eureka,12/31/1979,
1319741,1431211,Phoenix,Mine,UT,49,Juab,23.0,395537N,1120638W,39.926895,-112.110501,,,,,2225.0,7300.0,Eureka,12/31/1979,
1322956,1434491,Yankee,Mine,UT,49,Utah,49.0,395652N,1120552W,39.947728,-112.097723,,,,,2163.0,7096.0,Eureka,12/31/1979,


In [None]:
ni43101_library_id = '4530692'
ni43101_api_key = os.environ['NI43101_KEY']

### Connecting to Zotero

An API key and a group library identifier are required parameters. Our seminal use case is a collection of [National Instrument (NI) 43-101 Technical Reports](https://www.zotero.org/groups/4530692/usgs_ni_43-101_reports/library) that has been developed over many years and was recently brought into a Zotero group library for continued development, management, and use.

In [None]:
ni43101_library = zotero.Zotero(
    ni43101_library_id,
    'group', 
    ni43101_api_key
)

### Collections

Larger libraries in Zotero often benefit from being organized into collections (or folders). Collections aid in management operations by creating logical groupings to operate within, and they can help in navigating through materials. From a knowledgebase perspective, collections may provide a useful piece of information about the documents within the collections that we need to parse and bring into the knowledgebase context.

For our seminal use case with the NI 43-101 Technical Reports, we are organizing them into collections based on the geographic region for the mine or prospect documented within the report.

In [None]:
ni43101_collections = ni43101_library.all_collections()

In [None]:
df_ni43101_collections = pd.DataFrame([i['data'] for i in ni43101_collections])
df_ni43101_collections.head(8)
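
Because collections can be nested, a flat table of names loses the region/country/state context. The sketch below resolves each collection to a full path, assuming each record carries `key`, `name`, and `parentCollection` (a parent key, or `False` for top-level collections), which is our understanding of the collection `data` returned by the Zotero API. The function name and the sample records are hypothetical.

```python
def collection_paths(collections):
    """Map each collection key to its full 'Parent / Child' path."""
    by_key = {c["key"]: c for c in collections}

    def path_for(key):
        names = []
        seen = set()  # guards against an accidental parent cycle
        while key and key in by_key and key not in seen:
            seen.add(key)
            names.append(by_key[key]["name"])
            key = by_key[key].get("parentCollection")
        return " / ".join(reversed(names))

    return {key: path_for(key) for key in by_key}


# Hypothetical records shaped like the 'data' dicts pulled above.
sample_collections = [
    {"key": "AAAA1111", "name": "North America", "parentCollection": False},
    {"key": "BBBB2222", "name": "United States", "parentCollection": "AAAA1111"},
    {"key": "CCCC3333", "name": "Arizona", "parentCollection": "BBBB2222"},
]

paths = collection_paths(sample_collections)
print(paths["CCCC3333"])  # North America / United States / Arizona
```

With the real data, the same function could be fed `df_ni43101_collections.to_dict("records")` to add a path column for review.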

#### Notes

We will want to think through whether to incorporate these place names into the GeoKB. Ultimately, we want to have one or more point coordinate locations associated with a given document, representing the mining property or properties documented in a report. If we can get all the way down to that point through data extraction processing, we could conceivably build up additional place name identification, including these same high-level orientation tags used for document organization. Since we don't have that right now, or at least not comprehensively, we might at least start with these to model out how the information goes into the GeoKB. We'll need to add the higher-level grouping at continental scale (e.g., South America) to the work we started on country identification in the Place Name Reference notebook.

### Item Types and Information Models

Zotero uses a couple of different standards developed in the library community over many years to inform its model of item types and property schemas. Our goal in bringing these references into the GeoKB is not to duplicate everything from Zotero. We need to have something recognizable as a label and description for document items along with a clear pointer into the Zotero web interface to get further information, but we are mostly focused on the information derived from the documents pertinent to our mineral assessment use cases and the linkages to other things in the knowledgebase (e.g., identification of mineral occurrences and their documented characteristics).

We do need to go through the schemas we will be employing from Zotero and work out applicable mapping to GeoKB properties, building out additional reference sources as needed.

In [None]:
zotero_item_types = ni43101_library.item_types()
[i['localized'] for i in zotero_item_types]

#### All Fields

In [None]:
zotero_item_properties = ni43101_library.item_fields()
for x in [i['localized'] for i in zotero_item_properties]:
    print(x)

#### Report Fields

In [None]:
zotero_report_properties = ni43101_library.item_type_fields('report')
for x in [i['localized'] for i in zotero_report_properties]:
    print(x)

#### Notes

* Institution - For NI 43-101 Reports, we may establish a convention that this field applies to the commercial company owning the mining property described in the report. This would align with the exploration we are conducting on the similar SEC S-K 1300 reports. If this links by name to a uniquely named "commercial company" entity in the GeoKB, we can exploit additional information there (CIK identifiers, LEI identifiers, etc.) to go after further information sources such as other legal filings, news releases, or others that we can exploit for further intelligence gathering.
* Source identifiers/access points - Wherever possible, we should follow a "shortest path" principle in bringing document-type items into the GeoKB such that interacting with the knowledge graph includes being able to get to source material without unnecessary "clicks." For things like the NI 43-101 reports where we are using online Zotero storage for the original documents, we should incorporate those links into the GeoKB items (even if they may require authentication to access).
* Place - We will need to put some thought into how best to handle place names in our model. We are currently organizing documents into place name collections at the level of continental region, country name, and U.S. State name. Tags could provide another useful place to store place names (probably more functional than collections). Since they do help in organizing and navigating a library, storing at least some types of place names as metadata makes sense within Zotero, but we could also store detailed geographic area information within attached files (e.g., GeoJSON). In any case, we need to work out the logical mapping from however we do this to the GeoKB representation of these documents.
* Authors (part of the common schema in Zotero) - Authors in Zotero are identified by either full name or name parts in typical citation fashion. Better identifiers available in some citation metadata schemas, such as ORCID, are not currently part of the Zotero schemas. We will have a representation for many people in the GeoKB, at least in terms of USGS publishing authors, dataset contributors, etc. We'll have to determine whether it is useful to establish explicit linkages to author items in the GeoKB from document items sourced from the GeoArchive. This is somewhat moot for NI 43-101 reports unless and until we get better author information into the system.
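
To make the "recognizable label and description" goal concrete, here is a minimal sketch of deriving both from a report item's `data` dict. The field names (`title`, `institution`, `place`, `date`) are Zotero report fields; the function name, the description wording, and the sample values are assumptions for illustration, not a settled GeoKB convention.

```python
def describe_report(item_data):
    """Build a label and description from a Zotero report's 'data' dict."""
    label = (item_data.get("title") or "").strip() or "Untitled report"
    parts = [
        item_data.get("institution"),
        item_data.get("place"),
        item_data.get("date"),
    ]
    # Keep only the fields that are actually filled in.
    details = ", ".join(p.strip() for p in parts if p and p.strip())
    description = "NI 43-101 Technical Report"
    if details:
        description += f" ({details})"
    return {"label": label, "description": description}


# Hypothetical item data; real values come from the library.
sample_report = {
    "itemType": "report",
    "title": "Technical Report on the Example Project",
    "institution": "Example Mining Corp.",
    "place": "British Columbia, Canada",
    "date": "2015",
}

stub = describe_report(sample_report)
```

Missing fields are simply skipped, so sparse records still produce a usable label and a generic description.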

### Tags

Tags in a Zotero library provide a ready means to identify key aspects of the library contents with the ability to filter and browse. We are experimenting with a piece of software to pull useful tags from the xDD pipelines into the source Zotero library to help make that extracted information readily usable in the primary document management and access context. Tags can also be added by individual authorized users with write access to the Zotero library.

Some or all of these tags will be useful as linked entities within the GeoKB. We could tune the same "middleware" operating against the xDD API to feed information into both Zotero libraries and the GeoKB, but because of the multi-modal (human and robot) dynamic of tagging in Zotero, we may want to use Zotero as the initial foundation and then move tags from there into linked items in the GeoKB. One way or the other, we need to develop the conventions and rules on how the tags are built and processed to give us all the functionality we need.

In [None]:
ni43101_tags = ni43101_library.tags()
ni43101_tags[:20]

In [None]:
# Distinct prefixes from tags that follow a 'prefix:value' pattern
tag_prefixes = sorted({i.split(':')[0] for i in ni43101_tags if ':' in i})
tag_prefixes
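
Going one step past listing prefixes, the sketch below groups tag values under their prefix so each group can be reviewed as a candidate set of linked entities. It assumes tags are plain strings in a `prefix:value` pattern; the function name and the sample tags (including the prefixes) are hypothetical and do not reflect the actual conventions in the library.

```python
from collections import defaultdict


def group_tags_by_prefix(tags):
    """Group 'prefix:value' tags by prefix; unprefixed tags go under ''."""
    grouped = defaultdict(list)
    for tag in tags:
        # partition splits on the first colon only, so values may contain colons
        prefix, sep, value = tag.partition(":")
        if sep:
            grouped[prefix.strip()].append(value.strip())
        else:
            grouped[""].append(tag.strip())
    return dict(grouped)


# Hypothetical tags for illustration only.
sample_tags = ["commodity:gold", "commodity:copper", "country:Canada", "reviewed"]
grouped = group_tags_by_prefix(sample_tags)
```

Keeping unprefixed tags in their own bucket makes it easy to see which human-entered tags fall outside whatever conventions we settle on.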

### Attachments

Zotero items can have any number of file attachments of any kind. For the NI 43-101 library, primary files are the raw PDF report content pulled from the Canadian Government portal or some other site. As we work through pulling the most useful extractions together from the xDD pipelines or other processing mechanisms, we may store additional files, including structured data files such as JSON or GeoJSON.

We are also working on protocols for individual document annotation and highlighting. Zotero libraries/items can be processed using the ZotFile extension to extract annotations and highlights to Markdown files that are attached to the items. These attachments can then be processed with code to introduce linked entities or other types of claims to the GeoKB.

In [None]:
top_items = ni43101_library.top()
top_items

In [None]:
top_items[0]['links']

#### Notes

* Direct document links - As discussed above, we may want to identify the primary link to a document and pull it into the GeoKB representation for these reports so that applications built on the GeoKB can link directly to the source in addition to going to or through the library metadata item.
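
A minimal sketch of pulling those links out of an item follows. It assumes the `links` dict on an item includes an `alternate` entry (the web page for the item) and, where a file is attached, an `attachment` entry, which matches our reading of the Zotero API responses but should be checked against real items. The function name and all URLs and keys in the sample are hypothetical.

```python
def source_links(item):
    """Return the web page and attachment links for an item, if present."""
    links = item.get("links", {})
    return {
        "library_page": links.get("alternate", {}).get("href"),
        "attachment": links.get("attachment", {}).get("href"),
    }


# Hypothetical item shaped like an entry in top_items; URLs are placeholders.
sample_item = {
    "key": "ABCD1234",
    "links": {
        "self": {
            "href": "https://api.zotero.org/groups/4530692/items/ABCD1234",
            "type": "application/json",
        },
        "alternate": {
            "href": "https://www.zotero.org/groups/4530692/items/ABCD1234",
            "type": "text/html",
        },
        "attachment": {
            "href": "https://api.zotero.org/groups/4530692/items/EFGH5678",
            "type": "application/json",
            "attachmentType": "application/pdf",
        },
    },
}

result = source_links(sample_item)
```

Items without an attachment simply yield `None` for that entry, which lets downstream code decide whether to fall back to the library page.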