<a href="https://colab.research.google.com/github/skybristol/experiments/blob/dev/GeoArchive_pilot_with_USGS_leveraged_NI_43_101_Reports.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Prospectus
*Exploring the structure and function of a GeoArchive in the USGS by developing prototype capability with the NI 43-101 Technical Reports*

# Background and Goal
The geosciences across USGS Missions and Programs leverage a broad array of both structured and unstructured data and information sources. Many of these are critical as sources of knowledge of geologic processes, recorded observations and measurements, identification of geologic features, and other details that make their way into our research and synthesized data products, from geologic maps to energy and mineral assessments. In this project, we will explore the establishment of an active archive for an array of digital objects that are static from the point in time when they were created or acquired but need to be continually built upon and serve as viable and accessible source material.

Historically, many of these data and information source materials have been managed as collections of files on local machines or network file servers, often accompanied by structured or loose-form inventory information and other documentation. Data and information extracted from the original source material take many different forms, often with poorly documented provenance or processing steps between the original source material and the information used in scientific analysis and interpretation. The methods used to evaluate these sources, especially unstructured or poorly standardized ones, are rarely aided by more advanced text and data mining technologies.

This project aims to address a foundational part of this problem by establishing a digital archive and source materials management solution. We are proposing to give it a name, identity, and presence in our overall architecture but build it entirely on existing infrastructure and with tools readily available in the USGS. The solution will meet the following major criteria:

1. Provide an online presence for digital files of various kinds, allowing for download access through various protocols on the web and access from cloud resources for efficient processing through various types of tools.
2. Provide a persistent, resolvable identifier for each unique artifact in the system that can be visited by humans, referenced/linked to effectively by other artifacts and data systems, and interfaced with by software code.
3. Provide a flexible system for the best available metadata about artifacts in the archive that enables the documentation to be improved and made more robust through time.
4. Provide methods for search and retrieval of archive materials through various means, including map interfaces for georeferenced materials.
5. Provide a capacity to house and provide both open, publicly available archive materials as well as sensitive materials requiring authenticated and authorized access controls.

# Approach
While an exhaustive development and examination of system requirements could indicate other and better technology options, there is an overall need to move ahead with a capability that introduces necessary functionality in support of ongoing work in the Mineral Resources Program. As such, we plan to leverage existing capability readily available to us via the ScienceBase catalog and digital repository platform but conduct the work in a way that does not lock in that particular solution for the long term. We will make design and implementation decisions toward flexibility to migrate between alternate solutions through time.

We will focus development initially on the NI 43-101 technical reports that have been collected by USGS Mineral Resources Program staff over some time to support mineral resource assessments and contributions to structured data assets. These reports are all single PDF documents, sometimes versioned through time for the same target mineral development asset, retrieved through an online portal system or alternate location on the web. The reports are essentially "web transients," meaning they are not already archived and maintained by any other single entity as a sustained resource, and so USGS staff have created their own internal archive for reference purposes. This current archive consists of two different folder/file structures on internal network servers that have adopted a relatively consistent file-naming convention that provides some useful identification information along with a spreadsheet for part of the collection containing additional metadata, including geospatial coordinates representing the site of the subject mineral development asset.

A small portion of the necessary metadata for the files is inherent in how the files have been named and organized in their respective folder structures or in the digital file header information. However, the majority of meaningful descriptive content comes from the file contents themselves. Some of the more important project/site identification information from the document contents has been pulled into an inventory spreadsheet that we will build a meta-model around. Other useful content that could be extracted from the document content will be explored through further work, starting with this project but extending beyond, in text mining/extraction.

Within the scope of this project we will attempt to get started on this exploration by teaming with the [GeoDeepDive](https://geodeepdive.org) project at U. Wisconsin-Madison, working with them to feed a representative sample of the NI 43-101 reports through their processing engines. GeoDeepDive and some related capabilities are evolving as a platform for operationally extracting useful data content out of "print" document sources, including several overall areas of important functionality for our purposes:

1.   Conversion of PDF text content, which is really only a "picture" of the text, into readable text that can be further parsed and processed. If we don't have the GeoDeepDive engine to run this process, we would want to employ similar methods on our own.
2.   Full text indexing of the extracted/converted PDF text content into an Elasticsearch engine that supports different types of search and discovery methods through the various GeoDeepDive APIs.
3.   Processing through the search index to identify where key terms of interest to much of our work (e.g., geologic formations, rock types, etc.) are found within the texts and an API that returns the snippets of text around these words. This helps us both home in on a set of documents of interest within the larger corpus and bring certain discovered terms of interest back into our own metadata documentation for categorization purposes (e.g., identifying the mineral commodities described within a particular document).
4.   Pre-processing the full text of documents into a data structure optimized for natural language processing against which NLP models such as named entity recognition and others can be developed and run at scale. Some routines that are of interest to USGS geoscience work have been developed already and can be built upon. This will be important as we look to do more complex things like tease out structured data from narrative sections in disclosure reports documenting the geologic settings of mineral projects.
5.   A newer processing engine called COSMOS is in development with promising early results. COSMOS extracts tables, figures, and equations from documents and is able to parse many of those into readable digital data structures. While this is a sensitive part of the project for proprietary, copyrighted materials, we may be able to employ it to good effect for the public domain documents we will be dealing with in this project to provide more robust, usable access to important tables and figures from the reports. Further work is needed in this part of the project to build methods to build out a usable metamodel for the extracted artifacts, something we may be able to contribute to for things like the fairly well structured and consistent disclosure reports.

# Project Team
We are deliberately working to develop a small project team with the majority of the resources contributed through a dedicated pilot effort funded by the Energy and Minerals Mission Area (EMMA). The project will be conducted fully in the open with regular opportunities for input welcome from anyone with an interest.

* Architect: Sky Bristol (on detail with EMMA designing a future USGS Geoscience Data System)
* Developer: Jay Shah (contractor from the USGS Cloud Hosting Solutions group, dedicated to working on EMMA development tasks)
* Subject Matter Experts: Jane Hammarstrom and Mike Zientek are providing expertise and insights on how the disclosure reports are used in the minerals assessment process; Carma San Juan, E.G. Boyce, Nick Karl, and Peter Schweitzer are all providing expertise on how the overall archive needs to be structured and managed
* CSS Collaborators: Lindsay Powers and Mikki Johnson will be working to make logical connections from the GeoArchive concept to broader USGS needs under the Data Preservation Program, specifically toward documenting methods and practices learned in this project
* Student Interns: Interns from the Geo-Launchpad Program at UNAVCO will be working through a summer program and will assist with transferring materials into the archive and working through to enhance metadata

# Timeline and Project Process
Much of the groundwork has been laid for the GeoArchive with capability already inherent in ScienceBase. This should allow us to build out a basic working solution fairly quickly with at least basic functionality. With that groundwork laid, we anticipate that the minimum viable products can be fully completed in 45 days or less from project start. All specific tasks, issues, and work of the project will be conducted within an open and transparent code project so that any stakeholder can check in on work in progress and weigh in on issues in need of subject matter expertise. The small project team will check in on video at least once weekly to work through any outstanding issues. Where we do run into challenges based on ScienceBase as a core dependency, we will work to ensure these are documented as backlog items for consideration within that project or otherwise resolved through some other means.

# Dependencies
To demonstrate a few things in this notebook, we need to install a few packages that are not inherent in the Google Colab environment. In our project, we will attempt to keep the technical dependencies to a minimum to maximize our chances of producing a sustainable solution. There will, however, be a number of things we need to work out to get our optimal target information content into the kind of shape it needs to be in for the functionality we want.

In [None]:
!pip install --upgrade geopandas
!pip install --upgrade pyshp
!pip install --upgrade shapely
!pip install --upgrade descartes
!pip install sciencebasepy
!pip install https://github.com/matplotlib/basemap/archive/master.zip

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

import pandas as pd
import geopandas as gpd
import re
import os
import fnmatch

from sciencebasepy import SbSession
import json

import requests
import datetime
import pickle
from collections import Counter
import uuid

# GeoArchive in ScienceBase
ScienceBase operates as a digital repository in USGS, containing all manner of scientific data and information. It is operated and maintained as a major portfolio element of the Core Science Systems Mission Area, Science Synthesis and Analytics Research Program. It is open for use by any organization in USGS and generally does not require any funding outlay from other Programs. For our purposes, it provides a reasonable platform upon which to build out an initial iteration of the GeoArchive.

For purposes of the project prospectus, we have stubbed out a top-level [collection item](https://www.sciencebase.gov/catalog/item/607ef112d34e8564d6) along with a next level collection for the [Mineral Development Project Disclosure Reports](https://www.sciencebase.gov/catalog/item/607efd7fd34e8564d6809ebc).

ScienceBase is organized in a hierarchical fashion where items can be "parented" by another item. This serves some purposes such as we've employed in this scenario so far, but the concept should not be conflated with a folder hierarchy structure in a file system. To accommodate organizing concepts such as the geopolitical boundary context used in the inventory spreadsheet for USMIN, we will employ the tagging structure and other metadata properties that produce facets within the ScienceBase Catalog system. The functionality that the hierarchy of collections provides does include the following:

* Different logical "sub-archives" within the GeoArchive may have different requirements for access permissions, both for adding/editing items within the archive and for read access. Having a separate collection for sub-archives like the disclosure reports facilitates that configuration.
* ScienceBase creates a logical set of geospatial services (WMS/KML) for collections that contain geospatially referenced items. Organizing into sub-archive collections facilitates using these as logical "layers" in mapping applications.

We have also set up a couple of example items within the disclosure reports sub-archive to iterate through as we work out the best ways to document the individual files. These items will really serve as the heart of the archive. They will contain the digital files and available metadata to aid in discovery and use of the archived materials.
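
As a rough sketch of what one of these items might look like before it is sent to ScienceBase, the function below assembles a plain Python dictionary for a single report. The field names (`title`, `parentId`, `tags`, and `spatial.representationalPoint`) reflect my understanding of the ScienceBase item JSON and should be checked against the live catalog. The tag scheme string, the function name, and the example title are hypothetical placeholders; the parent identifier and coordinates are taken from this notebook.

```python
def draft_report_item(record, parent_id):
    """Assemble a draft item dictionary for one archived report (no network calls)."""
    item = {
        "title": record["title"],
        "parentId": parent_id,
        "tags": [
            # The scheme string is a made-up placeholder for a project-name tag scheme
            {"type": "Label", "scheme": "geoarchive/project_name", "name": record["project_name"]}
        ],
    }
    # Only add a point location when both coordinates are available
    if record.get("latitude") is not None and record.get("longitude") is not None:
        # Assumed ordering is [longitude, latitude]
        item["spatial"] = {"representationalPoint": [record["longitude"], record["latitude"]]}
    return item

example_item = draft_report_item(
    {
        "title": "Example report title",
        "project_name": "Dokis",
        "latitude": 48.368795,
        "longitude": -79.555252,
    },
    parent_id="607efd7fd34e8564d6809ebc",
)
print(example_item)
```

Keeping this step free of any API calls lets us review and adjust the draft structure before anything is written to the repository.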

# Inventory Spreadsheet
The team working on the USMIN project has been building and using a spreadsheet of the NI 43-101 reports they've been working with as part of building the mine inventory. The spreadsheet contains some useful metadata for a portion of the total NI 43-101 report archive that we will be able to use to build and/or add to items in ScienceBase.

The following codeblocks load the spreadsheet from a local file reference (you will need to upload the file to your Google Colab runtime to work with this) and work through how we can leverage the contents.

In [None]:
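def latest_tracking_sheet():
    # Return the first matching workbook in the working directory, or None if there is none
    matches = fnmatch.filter(os.listdir('.'), 'NI 43-101 Tracking sheet_*.xlsx')
    if not matches:
        print("File not found")
        return None
    return matches[0]

def get_tracking_sheet(worksheet=0):
    # worksheet is passed to pandas as sheet_name; the default of 0 reads the first sheet
    tracking_file = latest_tracking_sheet()
    if tracking_file is None:
        raise FileNotFoundError("Upload an 'NI 43-101 Tracking sheet_*.xlsx' file to the runtime first")
    return pd.read_excel(tracking_file, sheet_name=worksheet)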

In [None]:
tracking_sheet = get_tracking_sheet()

# Inventory Cleanup and Tagging
There are a couple of things in the spreadsheet that we need to deal with in order to effectively process it for our use. It's very easy to introduce inconsistencies and pesky things like special characters and encodings into string values in spreadsheet cells, especially working in something like Excel that allows for lots of "creativity." This all gets exposed and can cause problems when trying to use the data outside that original environment.

In our case here, we are going to extract a number of string values out of the spreadsheet and use those to create tags on our ScienceBase Items for organizing items into meaningful bins. Many of these bins actually represent larger concepts that are meaningful and could relate to or be found in other parts of our virtual data system. In pursuing the linked data principle, we will attempt to assign meaningful identifiers to these tags so that we can start rallying other data around the linked concepts.
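
To illustrate the kind of cleanup involved, here is a minimal sketch of a general-purpose string normalizer for spreadsheet cells. The function name `clean_cell_text` is my own and not part of the project code; it assumes we are happy to fold compatibility characters (such as non-breaking spaces) into plain equivalents and collapse all whitespace, which may be too aggressive for some columns.

```python
import unicodedata

def clean_cell_text(value):
    """Normalize a spreadsheet cell to a tidy string, or None for empty/non-string cells."""
    if not isinstance(value, str):
        # Empty cells usually arrive from pandas as NaN, which is a float
        return None
    # NFKC folds compatibility characters (e.g., non-breaking space U+00A0) into plain equivalents
    normalized = unicodedata.normalize("NFKC", value)
    # Collapse runs of whitespace, including newlines and tabs, into single spaces
    cleaned = " ".join(normalized.split())
    return cleaned or None

print(clean_cell_text("Loulo-Gounkoto\xa0\n"))
```

Returning `None` for anything that is not a usable string gives us one consistent "no value" signal to test for when building tags.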

In [None]:
# Remove rows with null filename, we can't do anything with these at this time
tracking_sheet.dropna(subset=["Filename"], inplace=True)

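# Deal with extra cruft in the dates so we can turn these into real dates in the items
def dates_from_string(date_string):
    # Empty cells arrive as NaN (a float), which re.findall cannot process
    if not isinstance(date_string, str):
        return []
    # Require digits on both sides of the hyphen so a stray "-" is not returned as a date
    return re.findall(pattern=r'\d+-\d+', string=date_string)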

tracking_sheet["effective_dates"] = tracking_sheet["Effective Date"].apply(dates_from_string)

# add UUID as a unique identifier reference
tracking_sheet['uuid'] = tracking_sheet.apply(lambda _: str(uuid.uuid4()), axis=1)

tracking_sheet.head()

Unnamed: 0,Region,Country,State,Name,Commodities,Effective Date,Document type,Latitude,Longitude,Filename,Root Directory,extension,Path,Notes,Location,Notes on attempt to plot given data in ArcMap (first fifty rows),Calculated Decimal Latitude,Calculated Decimal Longitude,project_name,country_label,country_identifier,us_state_label,us_state_identifier,document_title,part_of_series,effective_dates,uuid
0,International,Canada,,Dokis,Au Ag,9-2018,,,,Dokis Au Ag 9-2018,W:\NI 43-101 Reports,.pdf,W:\NI 43-101 Reports\International\Canada\Doki...,,It is geographically centered at UTM NAD 83 Zo...,Plots in the middle of Dokis Lake.,48.368795,-79.555252,Dokis,Canada,http://www.wikidata.org/entity/Q16,,,NI 43-101 Technical Report for the Dokis proje...,,[9-2018],de8b6b5b-26f2-40bc-af4c-77fa347159c4
1,International,Mexico,,Ixtaca,Au Ag,1-2019,FS,19° 40',-97° 51',Ixtaca Au Ag 1-2019 FS,W:\NI 43-101 Reports,.pdf,W:\NI 43-101 Reports\International\Mexico\Ixta...,,,Plots at Tuligtic in ArcMap ~8-10km northwest ...,19.699721,-97.85501,Ixtaca,Mexico,http://www.wikidata.org/entity/Q96,,,NI 43-101 Technical Report (feasibility study)...,,[1-2019],e95ae82f-2552-47d9-bfdb-657a096e207f
2,Domestic,United States,Nevada,TLC,Li Cly,12-2018,,,,TLC Li Cly 12-2018,W:\NI 43-101 Reports,.pdf,W:\NI 43-101 Reports\Domestic\Nevada\TLC Li Cl...,Clay not listed on the Commodities sheet (CLY ...,"Zone 11 475650E / 4222560N, (centre) NAD 27",Plots ~12 km NNE of Tonopah vs. the stated 10...,38.15059,-117.277912,TLC,United States of America,http://www.wikidata.org/entity/Q30,Nevada,http://www.wikidata.org/entity/Q1227,NI 43-101 Technical Report for the TLC project...,(1 of 2),[12-2018],1c1896fb-03f1-4683-be43-6bf0b785e0a5
3,International,Canada,,Premier,Au Ag,1-2019,,56° 7',-130° 1',Premier Au Ag 1-2019,W:\NI 43-101 Reports,.pdf,W:\NI 43-101 Reports\International\Canada\Prem...,,,Plots about 19 km NNE of Stewart. Report state...,56.033056,-130.016667,Premier,Canada,http://www.wikidata.org/entity/Q16,,,NI 43-101 Technical Report for the Premier pro...,(1 of 2),[1-2019],d2c75909-5d4a-4c91-86ac-529a98a927f9
4,International,Mali,,Loulo-Gounkoto,Au,12-2017,,,,Loulo-Gounkoto Au 12-2017,W:\NI 43-101 Reports,.pdf,W:\NI 43-101 Reports\International\Mali\Loulo-...,,Detailed latitude longitude coordinates for th...,Plots in the middle of two pits viewed with i...,13.08637,-11.41318,Loulo-Gounkoto,Mali,http://www.wikidata.org/entity/Q912,,,NI 43-101 Technical Report for the Loulo-Gounk...,,[12-2017],d7e815f9-dcd0-4c19-ace0-9cabf96a1547
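
The `effective_dates` values above are still month-year strings such as `9-2018`. As one possible next step toward "real dates," the sketch below converts that pattern to an ISO-style `YYYY-MM` string. The function name `month_year_to_iso` is hypothetical, and it assumes the month always comes first, which matches the sample rows but has not been verified for the whole spreadsheet.

```python
import re

def month_year_to_iso(month_year):
    """Convert an 'M-YYYY' string such as '9-2018' to 'YYYY-MM'; return None if it does not fit."""
    match = re.fullmatch(r"(\d{1,2})-(\d{4})", month_year.strip())
    if match is None:
        return None
    month, year = int(match.group(1)), int(match.group(2))
    # Reject impossible months rather than guessing at the intent
    if not 1 <= month <= 12:
        return None
    return f"{year:04d}-{month:02d}"

print([month_year_to_iso(d) for d in ["9-2018", "12-2017", "13-2019"]])
# prints ['2018-09', '2017-12', None]
```

Zero-padded `YYYY-MM` strings sort chronologically as plain text, which is convenient when ordering several reports for the same project.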


# GeoArchive Linked Data
Many of the concepts that are going to be used to organize and work with our GeoArchive are "things" that are going to be commonly used and found in other parts of our data system. As such, this is an opportunity for us to start working out how we logically link information from across our data systems together. An increasingly important way of doing this is to align key concepts in data with some type of reference source (via a persistent, resolvable identifier) such that the same concept in another data collection can be unambiguously identified because it shares the same reference identifier.

What we are essentially working to do in this exercise is to create functionally linked data between logically independent databases, actively examining each dataset for anything that can be linked to common concepts or other datasets. Much of the time, the linking elements are found within metadata, which can be represented in the data themselves or housed in some adjacent documentation.

In our particular case, we're dealing with something fairly simple in terms of the inventory spreadsheet we pulled in above, but it still contains several key pieces of information that will likely be found in many other parts of our data system. There are a couple of basic geographic location properties that help us orient where mining properties are found (Country and State (US States only)). The Commodities property points to mineral and other names, which are also listed out in an included sheet. It is also likely that the Name property contains either some type of linkable location name or well-established mining property name.

## Commodity Name Exploration
After initially experimenting with the USGS Thesaurus as a potential point of reference for the commodities that are an aspect of the NI 43-101 reports (the subject of the particular mineral/material exploration project), we talked about needing to use a different primary source. The authoritative source for mineral names used in the geoscience community is the International Mineralogical Association [List of Minerals](https://www.ima-mineralogy.org/Minlist.htm). All we need is a more "online actionable" representation of these names that we can link to individually with persistent, resolvable identifiers.

I am exploring one [potential source](https://catalogue.linked.data.gov.au/index.php/resource/160) maintained by the [Geological Survey of Queensland](https://vocabs.gsq.digital/object?uri=http://linked.data.gov.au/def/minerals). This may be operated in a way that we would consider a reliable resource, but the underlying infrastructure there appears to be somewhat unstable or at least not always reliably accessible.

## Wikidata as Clearinghouse Source
An option for many concepts that we might leverage is Wikidata, essentially the largest open commons for a linked data enabled knowledge graph. Many different organizations who are doing anything with linked open data on the web are actively developing and operating code that populates Wikidata with key pieces of information needed to make other information systems work better. It can serve as a reference source for many different use cases like ours, providing a global clearinghouse of concepts and everything they can be linked to.

Wikidata has a [mineral species](https://www.wikidata.org/wiki/Q12089225) class with a defined set of properties that reference the IMA list of minerals. There are 5,694 items that are instances of the mineral species class, meaning that there may be 6 mineral species missing or misclassified (a subject for further exploration). Looking at our list of "NONFUEL MINERAL" commodities from the NI 43-101 inventory, there are some things in there that are not going to be listed as IMA minerals. A little bit of exploration through other classes in Wikidata indicates that we might be able to pull together a reasonable reference vocabulary that could serve our purposes here and elsewhere.

## Assembling a Logical Vocabulary Reference
While we could hit the Wikidata API live with queries on individual terms, it is useful to assemble a master set of information with a little bit of our own pre-processing and organization to operate against with subsequent queries. This gives us a level of abstraction away from a very large and complex reference source, giving us some stability and freedom of operation.

For simplicity, we build an array of objects here that contains a label and reference URL for the source of terms we are going to consult. We include the term label and its Wikidata identifier (URL) and process the list of alternate labels for some of the terms where we are likely to encounter additional terms that we can match against. This information gets dumped to a local cache file here in the Colab notebook, but we might pull this into some other type of online, shared cache in a future iteration.

The Wikidata reference source we are building is very simple and tailored for quick term label comparison purposes. As such, we can put all of the logical vocabularies we want to consult together into one structure. The following codeblock provides a configuration structure containing the logical label we want to apply to each source, the reference URL and identifier for that source, the type of relationship individual terms/concepts have to that source (P31 = instance of; P279 = subclass of), and an indication of whether or not we should include the alternate names/labels for that source. I also created a function to handle the process of querying the Wikidata SPARQL API and putting our list of terms together.

Finally, I included a function that will take a few different parameters and execute a term match against our reference source once that is established. We can use this to create additional properties in our inventory dataframe that will then be used to generate ScienceBase Items and populate them with tags that include explicit references to the associated terms/concepts from Wikidata for further reference.

In [None]:
wikidata_reference = [
    {
        "source_label": "Wikidata Mineral Species",
        "source_reference": "https://www.wikidata.org/wiki/Q12089225",
        "source_rel": "P31",
        "include_alt_names": True
    },
    {
        "source_label": "Wikidata Chemical Elements",
        "source_reference": "https://www.wikidata.org/wiki/Q11344",
        "source_rel": "P31",
        "include_alt_names": True
    },
    {
        "source_label": "Wikidata Sedimentary Rocks",
        "source_reference": "https://www.wikidata.org/wiki/Q82480",
        "source_rel": "P279",
        "include_alt_names": True
    },
    {
        "source_label": "Wikidata Clastic Sediments",
        "source_reference": "https://www.wikidata.org/wiki/Q12372934",
        "source_rel": "P279",
        "include_alt_names": True
    },
    {
        "source_label": "Wikidata Sovereign States",
        "source_reference": "https://www.wikidata.org/wiki/Q3624078",
        "source_rel": "P31",
        "include_alt_names": True
    },
    {
        "source_label": "Wikidata US States",
        "source_reference": "https://www.wikidata.org/wiki/Q35657",
        "source_rel": "P31",
        "include_alt_names": False
    },
    {
        "source_label": "Wikidata Additional Commodities",
        "identifier_list": [
            "Q83437",
            "Q223995",
            "Q10564271",
            "Q190444"
        ],
        "include_alt_names": True
    }
]

wd_api = 'https://query.wikidata.org/sparql'

def get_wd_concepts(wd_source, wd_reference=wikidata_reference, limit=10000):
    source_config = next((i for i in wd_reference if i["source_label"] == wd_source), None)
    if source_config is None:
        return list()

    if "source_reference" in source_config and "source_rel" in source_config:
        wd_query = """
        SELECT ?item ?itemLabel ?itemAltLabel WHERE {
          SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
          ?item wdt:%(pid)s wd:%(qid)s.
        }
        LIMIT %(limit)s
        """ % {
            "pid": source_config["source_rel"],
            "qid": source_config["source_reference"].split("/")[-1],
            "limit": limit
        }
        source_reference = source_config["source_reference"]

    elif "identifier_list" in source_config:
        wd_query = """
        SELECT ?item ?itemLabel ?itemAltLabel WHERE {
          SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
          VALUES ?item {%(qid_list)s}.
        }
        """ % {
            "qid_list": " ".join([f"wd:{i}" for i in source_config["identifier_list"]])
        }
        source_reference = None

    wd_results = requests.get(
        wd_api, 
        params = {'format': 'json', 'query': wd_query}
    ).json()

    concept_list = [
        {
            "_date_cached": datetime.datetime.utcnow().isoformat(),
            "source": source_config["source_label"],
            "source_reference": source_reference, 
            "label": i["itemLabel"]["value"],
            "wikidata_uri": i["item"]["value"],
            "label_source": "preferred"
        } for i in wd_results["results"]["bindings"]
    ]

    if source_config["include_alt_names"]:
        for item in [i for i in wd_results["results"]["bindings"] if "itemAltLabel" in i]:
            for alt_label in item["itemAltLabel"]["value"].split(","):
                concept_list.append(
                    {
                        "_date_cached": datetime.datetime.utcnow().isoformat(),
                        "source": source_config["source_label"],
                        "source_reference": source_reference, 
                        "label": alt_label.strip(),
                        "wikidata_uri": item["item"]["value"],
                        "label_source": "alternate"
                    }
                )

    return concept_list

def wd_concept_tag(wd_source, source_data_label, wd_search_list, return_var="label", return_preferred=True):
    wd_item = next((i for i in wd_search_list if i["source"] == wd_source and i["label"] == source_data_label), None)
    if wd_item is None:
        if return_var == "label":
            return source_data_label
        else:
            return

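    if return_preferred and wd_item["label_source"] != "preferred":
        # Swap an alternate-label match for the preferred-label record of the same concept,
        # keeping the alternate match if no preferred record is cached
        preferred_item = next((i for i in wd_search_list if i["label_source"] == "preferred" and i["wikidata_uri"] == wd_item["wikidata_uri"]), None)
        if preferred_item is not None:
            wd_item = preferred_item

    return wd_item[return_var]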
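
The matching function above scans the whole concept list for every lookup and compares labels exactly. As an alternative sketch (not what the notebook currently does), we could build a dictionary index once and look labels up case-insensitively. The function name `build_label_index` and the sample records are hypothetical; the records are only shaped like the output of `get_wd_concepts`, and the "CA" alternate label is made up for illustration.

```python
def build_label_index(concepts):
    """Map (source, case-folded label) to a cached concept record for fast lookups."""
    index = {}
    for concept in concepts:
        key = (concept["source"], concept["label"].casefold())
        # Keep the first record seen, but let a preferred label displace an alternate one
        if key not in index or concept["label_source"] == "preferred":
            index[key] = concept
    return index

# Hypothetical cached records shaped like the output of get_wd_concepts
sample_concepts = [
    {"source": "Wikidata Sovereign States", "label": "Canada",
     "wikidata_uri": "http://www.wikidata.org/entity/Q16", "label_source": "preferred"},
    {"source": "Wikidata Sovereign States", "label": "CA",
     "wikidata_uri": "http://www.wikidata.org/entity/Q16", "label_source": "alternate"},
]

label_index = build_label_index(sample_concepts)
match = label_index.get(("Wikidata Sovereign States", "canada"))
print(match["wikidata_uri"] if match else None)
```

Case-insensitive matching is a judgment call: it forgives inconsistent capitalization in the spreadsheet but could also merge labels that differ only by case, such as short chemical symbols.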


## Wikidata Concept Cache
For ease of use, we can put all our terms together into one structure sitting in memory called wikidata_concept_reference. We can then check this against the various terms we have in our source inventory to bring those references into our data for use. The following codeblock will check the local system for a cache file and load that if it exists. Otherwise, it will run a loop over the set of Wikidata collections from our configuration and build it from live data.

In [None]:
%%time
cache_file = "wikidata_concept_reference.p"
if os.path.exists(cache_file):
    # Context managers make sure the file handle is closed after reading or writing
    with open(cache_file, "rb") as f:
        wikidata_concept_reference = pickle.load(f)
else:
    wikidata_concept_reference = list()
    for wd_concept in wikidata_reference:
        wikidata_concept_reference.extend(
            get_wd_concepts(
                wd_concept["source_label"]
            )
        )
    with open(cache_file, "wb") as f:
        pickle.dump(wikidata_concept_reference, f)

display(Counter(concept['source'] for concept in wikidata_concept_reference))

Counter({'Wikidata Additional Commodities': 10,
         'Wikidata Chemical Elements': 670,
         'Wikidata Clastic Sediments': 7,
         'Wikidata Mineral Species': 10314,
         'Wikidata Sedimentary Rocks': 109,
         'Wikidata Sovereign States': 1404,
         'Wikidata US States': 50})

CPU times: user 76.1 ms, sys: 8.43 ms, total: 84.6 ms
Wall time: 90.5 ms
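
Because each cached record carries a `_date_cached` timestamp, we could eventually decide when the cache should be rebuilt from live data. The following is a minimal sketch of such a check; the function name `cache_is_stale` and the 30-day threshold are assumptions for illustration, not settled project decisions.

```python
import datetime

def cache_is_stale(concepts, max_age_days=30, now=None):
    """Return True if the cache is empty or its oldest record is older than max_age_days."""
    if not concepts:
        return True
    if now is None:
        # Naive UTC timestamp, comparable with the naive ISO strings stored in the cache
        now = datetime.datetime.now(datetime.timezone.utc).replace(tzinfo=None)
    oldest = min(datetime.datetime.fromisoformat(c["_date_cached"]) for c in concepts)
    return (now - oldest) > datetime.timedelta(days=max_age_days)

# Hypothetical single-record cache and a fixed reference time for a repeatable result
sample_cache = [{"_date_cached": "2021-04-20T12:00:00"}]
reference_time = datetime.datetime(2021, 6, 1)
print(cache_is_stale(sample_cache, max_age_days=30, now=reference_time))
# prints True
```

Passing `now` explicitly keeps the check easy to test; in normal use it can be left out so the current time is used.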


# Project as a "new" kind of entity
The main subject of the NI 43-101 technical reports is quite specifically a mineral exploration or development project. The USGS perspective on these reports is somewhat laser focused on the mineral occurrences that the documentation associated with the projects helps us map, analyze, and understand, but the reports themselves are about the human activity, carried out through some institutions, to develop mineral resources. In our various data systems like the Mineral Resources Data System and USMIN Mineral Deposit Database, we're dealing with different kinds of entities like sites or geographic features associated with mining or mineral occurrences. Those entities have different meaning and scope with associated descriptive and data properties.

Because of how the nomenclature and semantics work in this community, we can sometimes end up with a bit of confusion based on the idea of locality, what it signifies in geology and related fields, and how many different kinds of things can become associated with a geographic location. For this project, we do need to take a slight foray into getting the semantics right for the idea of a "project" as it relates to a time-bound mineral development activity that becomes the subject of the reports we're dealing with for this particular archive. The "Name" property in the USMIN inventory spreadsheet pulled in above is essentially a shorthand name for the project that has a longer name in the files themselves and can refer to some common vernacular name for a location or a specific mine. For instance, the code block below shows the inventory records for two different files associated with the "Charlie Uranium Project" in Wyoming, simply referenced as "Charlie" in the spreadsheet. We need a lightweight schema that lets us persistently and consistently identify the projects themselves as something that exists independently of the report documents. That will let us do expedient things like link multiple documents in the archive to the same project, indicating by date and/or other means that each report applies to a different period in the lifecycle of the project.

Taking an iterative and very lightweight approach, we may well start with the expedient of simply putting a tag (keyword) on items in the archive that's part of a particular "scheme" of tags for project names. That lets us filter/search on project names to see multiple documents associated with a project without overloading the concept. However, as soon as we set up something like a "GeoArchive" as a real going concern that many more people can access than a simple spreadsheet shared among a few, shorthand references like "Charlie" become a little awkward and need some explanation. We are ultimately going to want to build on the simple shorthand reference to projects by assigning additional information about the projects that's needed in our information architecture, reasonable to assemble, and based on accepted standards and conventions (e.g., even just a few elements from the schema.org Event type).
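
As a rough illustration of the two levels described above, the sketch below builds a simple project tag and a slightly richer project description using a few properties from the schema.org Event type (`name`, `alternateName`, `location`). The scheme name `GeoArchive NI 43-101 Projects`, the `Project` tag type, and both function names are assumptions made for this example, not established ScienceBase vocabulary.

```python
import json

# Hypothetical tag scheme name; not an established ScienceBase vocabulary
PROJECT_TAG_SCHEME = "GeoArchive NI 43-101 Projects"


def build_project_tag(short_name):
    """Return a simple tag for the shorthand project name."""
    return {"name": short_name, "scheme": PROJECT_TAG_SCHEME, "type": "Project"}


def build_project_record(short_name, full_name, place_name=None):
    """Return a minimal schema.org Event-style description of a project."""
    record = {
        "@context": "https://schema.org",
        "@type": "Event",
        "name": full_name,
        "alternateName": short_name,
    }
    if place_name:
        record["location"] = {"@type": "Place", "name": place_name}
    return record


tag = build_project_tag("Charlie")
project = build_project_record("Charlie", "Charlie Uranium Project", "Wyoming")
print(json.dumps(project, indent=2))
```

The tag alone supports filtering; the record is one possible shape for the fuller definition that a shorthand like "Charlie" could eventually point to.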

In [None]:
# clean up project name strings
def set_project_name(project_name):
    if isinstance(project_name, str):
        return project_name.replace(u'\xa0', u' ').replace(u'\n', '').strip()

tracking_sheet["project_name"] = tracking_sheet.Name.apply(set_project_name)

In [None]:
list(tracking_sheet.project_name.unique())[:10]

['Dokis',
 'Ixtaca',
 'TLC',
 'Premier',
 'Loulo-Gounkoto',
 'Kibali',
 'Blanco',
 'Aspen',
 'Gallowai Bul-River',
 'TV Tower']

# States and Countries
While it's not a major improvement to incorporate explicit identifiers for countries and states in our documentation for documents in the GeoArchive, these are still important linkable pieces of information that add value when we know we have a specific and unambiguous reference point established in our data. For one thing, pulling from a master list helps keep us out of trouble with simple transcription errors where someone might "fat-finger" a particular name. More importantly, though, the additional information and connections let us execute important work within a data system where the same concepts are unambiguously linked. For instance, what happens when we need to make a wholesale reference change for a country name (e.g., Union of Burma to the Union of Myanmar to the Republic of the Union of Myanmar), or simply switch between a full country name and its shorthand in some circumstances?
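
To make the renaming scenario concrete, here is a minimal sketch of why storing an identifier alongside (or instead of) a label helps. The identifier URI, file names, and the `current_label` helper are placeholders invented for this example.

```python
# Hypothetical reference list keyed by identifier (the URI is a placeholder)
country_reference = {
    "http://example.org/entity/country-1": "Union of Myanmar",
}

# Records carry the identifier rather than a hand-typed country name
records = [
    {"file": "report_a.pdf", "country_identifier": "http://example.org/entity/country-1"},
    {"file": "report_b.pdf", "country_identifier": "http://example.org/entity/country-1"},
]


def current_label(record, reference):
    """Look up the label currently associated with a record's identifier."""
    return reference.get(record["country_identifier"])


# A wholesale name change becomes a single edit to the reference list
country_reference["http://example.org/entity/country-1"] = "Republic of the Union of Myanmar"

labels = [current_label(r, country_reference) for r in records]
print(labels)
```

Every record picks up the new label without being edited individually, which is the practical payoff of the unambiguous link.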

The following code block applies our wd_concept_tag function to add labels and URIs to rows in our dataframe for countries and states where we can make a match to our Wikidata reference source.

In [None]:
tracking_sheet["country_label"] = tracking_sheet.Country.apply(
    lambda x: wd_concept_tag(
        wd_source="Wikidata Sovereign States",
        source_data_label=x,
        return_var="label",
        wd_search_list=wikidata_concept_reference
    )
)
tracking_sheet["country_identifier"] = tracking_sheet.Country.apply(
    lambda x: wd_concept_tag(
        wd_source="Wikidata Sovereign States",
        source_data_label=x,
        return_var="wikidata_uri",
        wd_search_list=wikidata_concept_reference
    )
)
tracking_sheet["us_state_label"] = tracking_sheet.State.apply(
    lambda x: wd_concept_tag(
        wd_source="Wikidata US States",
        source_data_label=x,
        return_var="label",
        wd_search_list=wikidata_concept_reference
    )
)
tracking_sheet["us_state_identifier"] = tracking_sheet.State.apply(
    lambda x: wd_concept_tag(
        wd_source="Wikidata US States",
        source_data_label=x,
        return_var="wikidata_uri",
        wd_search_list=wikidata_concept_reference
    )
)

display(tracking_sheet.loc[~tracking_sheet.country_label.isnull()][["country_label","country_identifier"]])
display(tracking_sheet.loc[~tracking_sheet.us_state_label.isnull()][["us_state_label","us_state_identifier"]])

Unnamed: 0,country_label,country_identifier
0,Canada,http://www.wikidata.org/entity/Q16
1,Mexico,http://www.wikidata.org/entity/Q96
2,United States of America,http://www.wikidata.org/entity/Q30
3,Canada,http://www.wikidata.org/entity/Q16
4,Mali,http://www.wikidata.org/entity/Q912
...,...,...
1185,Canada,http://www.wikidata.org/entity/Q16
1186,Brazil,http://www.wikidata.org/entity/Q155
1187,Peru,http://www.wikidata.org/entity/Q419
1188,Peru,http://www.wikidata.org/entity/Q419


Unnamed: 0,us_state_label,us_state_identifier
2,Nevada,http://www.wikidata.org/entity/Q1227
14,Alaska,http://www.wikidata.org/entity/Q797
30,New Mexico,http://www.wikidata.org/entity/Q1522
44,Wyoming,http://www.wikidata.org/entity/Q1214
46,Nevada,http://www.wikidata.org/entity/Q1227
...,...,...
1137,Nevada,http://www.wikidata.org/entity/Q1227
1139,Nevada,http://www.wikidata.org/entity/Q1227
1148,California,http://www.wikidata.org/entity/Q99
1160,Nevada,http://www.wikidata.org/entity/Q1227


## Decision Point
We do need to work out how and where we want to store and manage the "project" concept in our model and determine whether this is something specific to the GeoArchive or a higher level concept that we'll want to use elsewhere. It's likely that the shorthand references we have here in the inventory aren't quite sufficient for use outside of a narrow context where "people in the know" know what things like "Charlie" mean.

The expedient thing in the ScienceBase approach is to stick the shorthand names for projects into a local vocabulary (specific to the GeoArchive collection) of simple tags classified with a "project" type. The primary users will know what these terms mean without any further reference; anyone else may be confused, but it might not matter in the immediate term. We could kick the can down the road a little bit until we have more clarity and a few more use cases on the "project" concept, at which point we can come back and improve the project references in the GeoArchive to incorporate references to some source of definition and further information.

# Commodity Vocabulary
One of the other pieces of information in the inventory spreadsheet is another take on "commodities" as used in this particular context. This is, of course, a commonly used listing/vocabulary that we need throughout the program. As we move toward building out shared, public (at least partially), and fully online assets like the GeoArchive, we really need to align all of these individual cases with a larger platform for terms, definitions, references, and other details. We need a vocabulary resource that is founded on persistent, resolvable identifiers such that we can always refer easily to the "single point of truth" for terms used within all of our systems (and argue constructively about any differences that arise).

In [None]:
commodity_vocab = pd.read_excel(latest_tracking_sheet(), sheet_name="Commodities")
commodity_vocab

Unnamed: 0,NONFUEL MINERALS,Commodity,Notes
0,Aluminum,Al,
1,Antimony,Sb,
2,Arsenic,As,
3,Barite,barite,
4,Bauxite,bauxite,
...,...,...,...
58,Tungsten,W,
59,Vanadium,V,
60,Yttrium,Y,
61,Zinc,Zn,


# Missing Commodity Definitions
Not all of the commodities from the space-delimited lists on actual inventory records can be found in the accompanying "Commodities" worksheet. We need to verify these values and get that updated.

In [None]:
commodities_from_inventory = list()
for commodity_list in list(tracking_sheet.Commodities):
    commodity_list = commodity_list.replace(u'\xa0', u' ')
    commodities_from_inventory.extend([i.strip() for i in commodity_list.split(" ") if len(i.strip()) > 0])

commodities_from_inventory = list(set(commodities_from_inventory))
commodities_from_inventory.sort()

missing_commodity_names = list()
for commodity in commodities_from_inventory:
    commodity_in_vocab = commodity_vocab.loc[(commodity_vocab["Commodity "] == commodity) | (commodity_vocab["NONFUEL MINERALS"] == commodity)]
    if commodity_in_vocab.empty:
        missing_commodity_names.append({"Commodity ": commodity})
        print(commodity)

A
Age
Ba
C
Ce
Cly
Coal
K
LI
LWA
La
Nd
P
PGM
Pge
Pt
Sd
Sil
Sm
U
gemstone,
halloysite
kaolin
quartzite
sand
silica
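
Some of the unmatched values above may be formatting variants rather than new terms (the trailing comma on `gemstone,` is one visible case). The sketch below shows one way to flag such candidates for review. The `known_terms` list is a toy stand-in, not the actual "Commodities" worksheet, and both function names are assumptions for this example.

```python
import string


def normalize_commodity_token(token):
    """Trim whitespace, non-breaking spaces, and stray punctuation from a token."""
    return token.replace("\xa0", " ").strip().strip(string.punctuation)


def suggest_matches(tokens, known_terms):
    """Suggest a known term for tokens that differ only by case or punctuation."""
    known_by_lower = {term.lower(): term for term in known_terms}
    suggestions = {}
    for token in tokens:
        match = known_by_lower.get(normalize_commodity_token(token).lower())
        if match is not None and match != token:
            suggestions[token] = match
    return suggestions


# Toy inputs for illustration only
known_terms = ["Li", "gemstone"]
unmatched = ["LI", "gemstone,", "kaolin"]

suggestions = suggest_matches(unmatched, known_terms)
print(suggestions)
```

Suggestions like these would still need a human check before the inventory or worksheet is updated, since a case difference could also indicate a genuinely different abbreviation.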


# Finding Reference Terms
Now that we have a reasonable data structure to work against, we can check our list of commodities from the NI 43-101 inventory and figure out what we can use. In the following code block, I take care of a couple of nuances in how the information is structured and apply one piece of business logic: where the same term matches in more than one context, we assume that we are probably talking about the "mineral species" rather than the "chemical element."

In [None]:
inventory_commodities = commodity_vocab.to_dict(orient="records")
inventory_commodities.extend(missing_commodity_names)

commodity_sources = [
    'Wikidata Mineral Species',
    'Wikidata Additional Commodities',
    'Wikidata Chemical Elements',
    'Wikidata Clastic Sediments',
    'Wikidata Sedimentary Rocks'
]

for inventory_commodity in inventory_commodities:
    if "NONFUEL MINERALS" in inventory_commodity:
        if "(" in inventory_commodity["NONFUEL MINERALS"]:
            search_term = inventory_commodity["NONFUEL MINERALS"].split("(")[0].strip().lower()
        else:
            search_term = inventory_commodity["NONFUEL MINERALS"].lower()
    else:
        search_term = inventory_commodity["Commodity "].lower()
    
    wd_references = [i for i in wikidata_concept_reference if i["source"] in commodity_sources and i["label"].lower() == search_term]

    if len(wd_references) == 0 and "NONFUEL MINERALS" in inventory_commodity:
        wd_references = [i for i in wikidata_concept_reference if i["source"] in commodity_sources and i["label"] == inventory_commodity["Commodity "]]

    if len(wd_references) > 1:
        mineral_species_ref = next((i for i in wd_references if i["source"] in commodity_sources[0:2]), None)
        if mineral_species_ref is not None:
            inventory_commodity.update(mineral_species_ref)
        else:
            preferred_label_ref = next((i for i in wd_references if i["label_source"] == "preferred"), None)
            if preferred_label_ref is not None:
                # Use the reference whose label is the preferred one
                inventory_commodity.update(preferred_label_ref)
            else:
                display(inventory_commodity)
                display(wd_references)
    elif len(wd_references) == 1:
        inventory_commodity.update(wd_references[0])
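
The tie-breaking rule for terms with more than one match can be isolated as a small function, which makes it easier to test on its own. This is a sketch of the multi-match branch only, using toy candidate records; the `choose_reference` name and the `example term` label are assumptions for illustration.

```python
def choose_reference(candidates, preferred_sources):
    """Pick one candidate: preferred sources first, then preferred labels.

    Returns None when neither rule applies, signaling a need for review.
    """
    if not candidates:
        return None
    by_source = next((c for c in candidates if c["source"] in preferred_sources), None)
    if by_source is not None:
        return by_source
    return next((c for c in candidates if c.get("label_source") == "preferred"), None)


# Toy candidates: the same label found in two reference sources
candidates = [
    {"source": "Wikidata Chemical Elements", "label": "example term", "label_source": "preferred"},
    {"source": "Wikidata Mineral Species", "label": "example term", "label_source": "alternate"},
]
preferred_sources = ["Wikidata Mineral Species", "Wikidata Additional Commodities"]

chosen = choose_reference(candidates, preferred_sources)
print(chosen["source"])
```

Here the mineral species record wins even though the chemical element record carries the preferred label, which mirrors the assumption stated in the text.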


# Commodity Tags
When we take this forward into ScienceBase, we will connect the dots between the references we can establish here and the multiple commodity tags on ScienceBase Items housing the files.

In [None]:
pd.DataFrame([i for i in inventory_commodities if "label" in i])

Unnamed: 0,NONFUEL MINERALS,Commodity,Notes,_date_cached,source,source_reference,label,wikidata_uri,label_source
0,Aluminum,Al,,2021-06-22T11:51:33.027956,Wikidata Chemical Elements,https://www.wikidata.org/wiki/Q11344,aluminum,http://www.wikidata.org/entity/Q663,alternate
1,Antimony,Sb,,2021-06-22T11:51:33.027618,Wikidata Chemical Elements,https://www.wikidata.org/wiki/Q11344,antimony,http://www.wikidata.org/entity/Q1099,preferred
2,Arsenic,As,,2021-06-22T11:51:33.027561,Wikidata Chemical Elements,https://www.wikidata.org/wiki/Q11344,arsenic,http://www.wikidata.org/entity/Q871,preferred
3,Barite,barite,,2021-06-22T11:51:30.752284,Wikidata Mineral Species,https://www.wikidata.org/wiki/Q12089225,barite,http://www.wikidata.org/entity/Q184196,alternate
4,Bauxite,bauxite,,2021-06-22T11:51:33.410883,Wikidata Sedimentary Rocks,https://www.wikidata.org/wiki/Q82480,bauxite,http://www.wikidata.org/entity/Q102078,preferred
...,...,...,...,...,...,...,...,...,...
72,,Pge,,2021-06-22T11:51:36.172064,Wikidata Additional Commodities,,PGE,http://www.wikidata.org/entity/Q223995,alternate
73,,Pt,,2021-06-22T11:51:33.028187,Wikidata Chemical Elements,https://www.wikidata.org/wiki/Q11344,Pt,http://www.wikidata.org/entity/Q880,alternate
74,,Sm,,2021-06-22T11:51:33.028662,Wikidata Chemical Elements,https://www.wikidata.org/wiki/Q11344,Sm,http://www.wikidata.org/entity/Q1819,alternate
75,,U,,2021-06-22T11:51:33.028341,Wikidata Chemical Elements,https://www.wikidata.org/wiki/Q11344,U,http://www.wikidata.org/entity/Q1098,alternate


## Still Missing Commodity Names
The following are commodity names in the inventory spreadsheet that are not covered in the "Commodities" reference worksheet and are not currently found in our reference list assembled from Wikidata sources.

In [None]:
[i for i in inventory_commodities if "label" not in i]

[{'Commodity ': 'A'},
 {'Commodity ': 'Age'},
 {'Commodity ': 'Cly'},
 {'Commodity ': 'LWA'},
 {'Commodity ': 'PGM'},
 {'Commodity ': 'Sd'},
 {'Commodity ': 'Sil'},
 {'Commodity ': 'gemstone,'},
 {'Commodity ': 'halloysite'},
 {'Commodity ': 'kaolin'},
 {'Commodity ': 'quartzite'},
 {'Commodity ': 'silica'}]

## Decision Point
How vital is it that we develop some method for bringing together a reference capability for our data systems (including and beyond the GeoArchive) that has this level of rigor?

Our current situation is essentially that each and every instance of some database or information system provides its own vocabularies or enumerated lists of values that each one is using. Some of these might be more robust than others and include definitions or references. Others simply explain what a data field contains and then leave it to the user to decide if there is a relationship to anything else. Sometimes descriptive metadata will describe processing steps that may give some clues as to source material, but this is almost always written for humans and may have missing steps or implicit assumptions that make it difficult to track down specifics. Within a given community like mineral geology, there is perhaps an unspoken assumption that if someone mentions a mineral, then they, of course, must mean the mineral species from the IMA List of Minerals.

The real reason that all of this becomes important is that we are increasingly releasing our data "into the wild" where it can be picked up, pulled apart, and reassembled in myriad ways - sometimes by humans and increasingly by artificial intelligences and other robots. To the extent that we can embed explicit semantics into our data streams, we both mitigate misuse and promote accelerated appropriate use of our data.

Embedding explicit semantics into our data streams is functionally accomplished in the way we are exploring here - including a persistent, resolvable identifier to the agreed upon location of reference for a term along with the label for the term we are using in our context. The hard part is agreeing on what the point(s) of reference should be, and the really hard part can be finding the point of reference to use that is trustworthy and technologically robust enough to be scalable and sustainable.

The Wikidata model is actually a really interesting one to consider for our purposes. On the one hand, it can seem risky to rely on something that is maintained in the global commons where nearly anyone has the potential to influence the content of the system. However, that is also its strength - democratizing the knowledge development space and scaling out to thousands of participants who each home in on their particular area of expertise and contribute. The vast majority of contributions are made by bots or other bulk processes that turn facts and information from other source material into the Wikidata structure. There are also fairly rigorous safeguards put in place based on a meritocratic system that controls many aspects of the system by debating and carefully crafting the properties of items (anyone can create an item, but properties are determined through a review and debate process).

Items (entities) in Wikidata are first classified into some logical bucket of like things (e.g., mineral species, chemical elements), which are themselves entities that are classified and described. Each entity can then have any number of claims (or statements) made about it. A claim is based on a property, which is another type of entity with parameters and constraints. Claims can have any number of references and qualifiers, and many communities and algorithms base logic on these attributes of claims (e.g., non-referenced claims should be ignored completely in some cases). A particular type of claim can be recorded as related identifiers from other systems (e.g., mineral species items contain relevant identifiers from MinDat). These essentially mean that someone determined that "this thing in this system is the same as that thing in that system." Wikidata also keeps a detailed history of every transaction that has resulted in a change, from the creation of an item to each claim, with accessible versions at every change, allowing for trust-based choices. (I once got in an interesting argument with an individual who was conducting gender-based research and decided to use a list of common male/female names as their basis for assigning gender values to the Wikidata items for USGS scientists.)
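
As a sketch of the "ignore non-referenced claims" idea, the function below walks an entity dictionary shaped like Wikidata's JSON (claims grouped by property ID, each statement with a `mainsnak` and an optional `references` list) and keeps only referenced statements. The toy entity is hand-built and heavily trimmed for illustration; its statements are not copied from Wikidata, and `Q_PLACEHOLDER` is not a real item. `P31` is the "instance of" property.

```python
def referenced_values(entity, property_id):
    """Return item IDs from statements on property_id that cite a reference."""
    values = []
    for statement in entity.get("claims", {}).get(property_id, []):
        if not statement.get("references"):
            continue  # skip claims with no supporting reference
        mainsnak = statement.get("mainsnak", {})
        if mainsnak.get("snaktype") != "value":
            continue  # skip "no value" / "unknown value" snaks
        value = mainsnak["datavalue"]["value"]
        if isinstance(value, dict) and "id" in value:
            values.append(value["id"])
    return values


# Toy entity for illustration only
toy_entity = {
    "id": "Q5309",
    "claims": {
        "P31": [
            {
                "mainsnak": {
                    "snaktype": "value",
                    "property": "P31",
                    "datavalue": {"type": "wikibase-entityid", "value": {"id": "Q12089225"}},
                },
                "references": [{"snaks": {}}],
            },
            {
                "mainsnak": {
                    "snaktype": "value",
                    "property": "P31",
                    "datavalue": {"type": "wikibase-entityid", "value": {"id": "Q_PLACEHOLDER"}},
                },
            },
        ]
    },
}

trusted = referenced_values(toy_entity, "P31")
print(trusted)
```

A stricter policy could also inspect what each reference points to (for example, requiring a specific source) rather than only checking that one exists.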

We could ignore this problem for now and simply put tags on items in ScienceBase containing our NI 43-101 files, setting them up for navigation and organization purposes but not explicitly tying the terms to any reference source. However, as things tend to go, if we keep kicking the can down the road, we will just continue to compound our problems in the long run. We could figure out a solution to the "online actionable" problem of the various logical authorities we already do count on like the IMA list that is operated by some organization that is closer to our domain (government, geoscience, etc.). We could set up our own capability for anything that we need to rely on as long as we can absorb the near and long term costs and risks.

I'm interested in further exploring the Wikidata option. What specific "containers" of items might we trust for our use? What characteristics of items would we want to check to make sure we can trust them (e.g., [graphite](https://www.wikidata.org/wiki/Q5309) is a mineral species that has a reference indicating this was stated in the IMA list from 2019)? Are there other useful properties for entities we might use as reference sources? Would we want to push our own edits and additions to Wikidata as a resource for ourselves and others?

# Archive item schema
For our initial working proof of concept, we'll be leveraging the ScienceBase Item model for each NI 43-101 report in the archive. This gives us a very basic metadata schema for documentation of the documents, while still letting us do things like stub out more complex metadata such as ISO 19115 if we need to.

The most fundamental starting requirement for a metamodel on these documents is that we have the right information to generate a compliant reference string (author, date, title, publisher, etc.). That's a low bar for entry, consistent with the minimal information we have in many cases, and it meets the most basic need of being able to cite and reference these reports as assets. Information about people and organizations is stored in ScienceBase via "contacts" on items. These are classified with a type designation that can give them particular significance (e.g., author or publisher). We will try to work with the existing type classification scheme and add to it as needed. Dates are similarly classified, and we may need to add a new date type for the particular meaning that "effective date" has in the context of the NI 43-101 reports.
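
A minimal sketch of assembling a reference string from typed contacts and dates is shown below. The item is a hand-built dictionary, not a real ScienceBase record; the contact type values `Author` and `Publisher`, the contact names, and the output format are assumptions for illustration, while the `EffectiveDate` date type follows the one used in this notebook's `build_effective_date` function.

```python
def build_reference_string(item):
    """Assemble 'Authors. Year. Title. Publisher.' from typed contacts and dates."""
    contacts = item.get("contacts", [])
    authors = [c["name"] for c in contacts if c.get("type") == "Author"]
    publisher = next((c["name"] for c in contacts if c.get("type") == "Publisher"), None)
    date_string = next(
        (d["dateString"] for d in item.get("dates", []) if d.get("type") == "EffectiveDate"),
        None,
    )

    parts = []
    if authors:
        parts.append(", ".join(authors))
    if date_string:
        parts.append(date_string[:4])  # year portion of a 'YYYY-M' style string
    parts.append(item["title"])
    if publisher:
        parts.append(publisher)
    return ". ".join(parts) + "."


# Toy item for illustration only; names are placeholders
toy_item = {
    "title": "NI 43-101 Technical Report for the Dokis project in Canada",
    "contacts": [
        {"name": "Example Author", "type": "Author"},
        {"name": "Example Mining Company", "type": "Publisher"},
    ],
    "dates": [{"label": "Effective Date", "type": "EffectiveDate", "dateString": "2018-9"}],
}

citation = build_reference_string(toy_item)
print(citation)
```

Missing pieces are simply skipped, which matches the reality that many inventory records carry only minimal information.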

Access to the digital files themselves will be secured as an inherent part of the underlying repository behind each item. This means that there will be an HTTP link to each file and various ways of reading, downloading, and working with its contents.

Explicit geospatial information is discussed in a succeeding section. We will also leverage tagging (associated with specific tag schemes) to add geopolitical boundary context to the items to assist with categorization and discovery.

Wherever possible, we will include either a link to the original source for a document via the web links part of the item schema (using the classification scheme and title to categorize these links as to what they represent) or a statement in the provenance annotation part of the schema about the origin of the document.

Note: In some code to follow here, I will tap the initial archive collection and item proving ground structure created in ScienceBase to show what the schema looks like at this stage.

In [None]:
display(tracking_sheet.columns)
display(tracking_sheet.head())

Index(['Region', 'Country', 'State', 'Name', 'Commodities', 'Effective Date',
       'Document type', 'Latitude', 'Longitude', 'Filename', 'Root Directory',
       'extension', 'Path', 'Notes', 'Location',
       'Notes on attempt to plot given data in ArcMap (first fifty rows)',
       'Calculated Decimal Latitude', 'Calculated Decimal Longitude'],
      dtype='object')

Unnamed: 0,Region,Country,State,Name,Commodities,Effective Date,Document type,Latitude,Longitude,Filename,Root Directory,extension,Path,Notes,Location,Notes on attempt to plot given data in ArcMap (first fifty rows),Calculated Decimal Latitude,Calculated Decimal Longitude
0,International,Canada,,Dokis,Au Ag,9-2018,,,,Dokis Au Ag 9-2018,W:\NI 43-101 Reports,.pdf,W:\NI 43-101 Reports\International\Canada\Doki...,,It is geographically centered at UTM NAD 83 Zo...,Plots in the middle of Dokis Lake.,48.368795,-79.555252
1,International,Mexico,,Ixtaca,Au Ag,1-2019,FS,19° 40',-97° 51',Ixtaca Au Ag 1-2019 FS,W:\NI 43-101 Reports,.pdf,W:\NI 43-101 Reports\International\Mexico\Ixta...,,,Plots at Tuligtic in ArcMap ~8-10km northwest ...,19.699721,-97.85501
2,Domestic,United States,Nevada,TLC,Li Cly,12-2018,,,,TLC Li Cly 12-2018,W:\NI 43-101 Reports,.pdf,W:\NI 43-101 Reports\Domestic\Nevada\TLC Li Cl...,Clay not listed on the Commodities sheet (CLY ...,"Zone 11 475650E / 4222560N, (centre) NAD 27",Plots ~12 km NNE of Tonopah vs. the stated 10...,38.15059,-117.277912
3,International,Canada,,Premier,Au Ag,1-2019,,56° 7',-130° 1',Premier Au Ag 1-2019,W:\NI 43-101 Reports,.pdf,W:\NI 43-101 Reports\International\Canada\Prem...,,,Plots about 19 km NNE of Stewart. Report state...,56.033056,-130.016667
4,International,Mali,,Loulo-Gounkoto,Au,12-2017,,,,Loulo-Gounkoto Au 12-2017,W:\NI 43-101 Reports,.pdf,W:\NI 43-101 Reports\International\Mali\Loulo-...,,Detailed latitude longitude coordinates for th...,Plots in the middle of two pits viewed with i...,13.08637,-11.41318


In [None]:
df_document_types = pd.read_excel(
    latest_tracking_sheet(), 
    sheet_name="Domains", 
    usecols="G:I", 
    skiprows=0, 
    nrows=12,
    header=None,
    names=["document_type","document_name","document_description"]
).dropna()
df_document_types

Unnamed: 0,document_type,document_name,document_description
1,PEA,preliminary economic assessment,A study that includes an economic analysis of ...
5,PFS,preliminary feasibility study,"A PFS is undertaken to determine, analyze, and..."
9,FS,feasibility study,A FS is an economic study assessing whether a ...


In [None]:
def build_item_title(record):
    document_title_parts = ["NI 43-101 Technical Report"]

    if isinstance(record["Document type"], str):
        if "," in record["Document type"]:
            doc_types = [i.strip() for i in record["Document type"].split(",")]
        else:
            doc_types = [record["Document type"]]
        document_type_names = list(df_document_types.loc[df_document_types.document_type.isin(doc_types)].document_name)
        document_type_names_str = " and ".join(document_type_names)
        document_title_parts.append(f"({document_type_names_str})")

    document_title_parts.append(f"for the {record.project_name} project")
    
    if isinstance(record.us_state_label, str):
        document_title_parts.append(f"in {record.us_state_label}, United States")
    elif isinstance(record.country_label, str):
        document_title_parts.append(f"in {record.country_label}")

    return " ".join(document_title_parts)
    
def build_effective_date(record_dates):
    if not isinstance(record_dates, list):
        return

    item_dates = list()
    for record_date in record_dates:
        date_parts = record_date.split("-")
        if len(date_parts) == 2:
            item_dates.append({
                "label": "Effective Date",
                "type": "EffectiveDate",
                "dateString": f"{date_parts[1]}-{date_parts[0]}"
            })

    return item_dates

def build_place_tags(record):
    place_tags = list()
    if isinstance(record.country_label, str):
        place_tags.append({
            "name": record.country_label,
            "scheme": record.country_identifier,
            "type": "Place"
        })

    if isinstance(record.us_state_label, str):
        place_tags.append({
            "name": record.us_state_label,
            "scheme": record.us_state_identifier,
            "type": "Place"
        })

    return place_tags

def build_commodity_tags(commodities):
    if not commodities:
        return
    
    commodity_tags = list()
    for commodity in [i.strip() for i in commodities.split(" ")]:
        commodity_ref = next((i for i in inventory_commodities if i["Commodity "] == commodity and "label" in i and len(i["label"]) > 0), None)
        if commodity_ref:
            commodity_tags.append({
                "name": commodity_ref["label"],
                "scheme": commodity_ref["wikidata_uri"],
                "type": "Commodity"
            })
        else:
            commodity_tags.append({
                "name": commodity,
                "type": "Commodity"
            })

    return [i for i in commodity_tags if len(i["name"]) > 0]

def build_document_type_tags(record):
    doc_type_tags = list()
    if isinstance(record["Document type"], str):
        if "," in record["Document type"]:
            doc_types = [i.strip() for i in record["Document type"].split(",")]
        else:
            doc_types = [record["Document type"]]
        for doc_type_name in list(df_document_types.loc[df_document_types.document_type.isin(doc_types)].document_name):
            doc_type_tags.append({
                "name": f"NI 43-101 {doc_type_name}",
                "type": "Document Type"
            })

    return doc_type_tags

def parse_coord_string(coord_string, coord_type=None):
    direction = next((i for i in coord_string if i in ["E","W","N","S"]), None)
    if direction is None and coord_type is None:
        return
    elif direction is None and coord_type is not None:
        if coord_type == "latitude":
            if "-" in coord_string:
                direction = "S"
            else:
                direction = "N"
        elif coord_type == "longitude":
            if "-" in coord_string:
                direction = "W"
            else:
                direction = "E"

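    # Split on non-digit characters and drop the empty strings that re.split
    # leaves at the edges, keeping only the numeric D/M/S components
    dms_components = [i for i in re.split(r'\D+', coord_string) if len(i) > 0]

    # Bail out when the string contains no numeric components
    if len(dms_components) == 0:
        return

    dd = 0
    for index, item in enumerate(dms_components):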
        if index == 0:
            dd += float(item)
        elif index == 1:
            dd += float(item)/60
        elif index == 2:
            dd += float(item)/(60*60)

    if direction in ['S','W']:
        dd *= -1

    return dd

def shape_from_coords(coords, feature_name="feature", crs_epsg=4326, return_type="dataframe"):
    if not isinstance(coords, list):
        return

    if not len(coords) == 2:
        return

    if isinstance(coords[0], float) and isinstance(coords[1], float):
        dms_coords = coords
    else:
        dms_coords = [
            parse_coord_string(coords[0], "latitude"),
            parse_coord_string(coords[1], "longitude")
        ]

    if None in dms_coords:
        return

    lat = next((i for i in dms_coords if abs(i) <= 90), None)
    lon = next((i for i in dms_coords if i != lat and abs(i) <= 180), None)

    if lat is None or lon is None:
        return

    df = pd.DataFrame(
        {'name': [feature_name],
        'Latitude': [lat],
        'Longitude': [lon]})

    gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.Longitude, df.Latitude))
    gdf.set_crs(epsg=crs_epsg, inplace=True)

    if return_type == "dataframe":
        return gdf
    elif return_type == "dict":
        return json.loads(gdf.to_json())

def get_features(geographic_element):
    """Get geospatial features from geojson geographic element
    :param geographic_element: Geographic element
    :return: Features found in the geographic element
    """
    ret = []
    features = []
    if geographic_element['type'] == 'Feature':
        features.append(geographic_element)
    elif geographic_element['type'] == 'Polygon' or geographic_element['type'] == 'Point' or geographic_element['type'] == 'LineString':
        features.append({'type': 'Feature', 'properties': {}, 'geometry': geographic_element})
    elif geographic_element['type'] == 'FeatureCollection':
        features.extend(geographic_element['features'])
    for extent in features:
        if 'id' in extent:
            extent['properties']['name'] = extent['id']
            del(extent['id'])
        if extent['geometry']['type'] != 'GeometryCollection':
            ret.append(extent)
    return ret

# Document Titles
Any time we put items like these files into a catalog or repository or database of some kind, it is useful to give every record a distinct name. This creates a useful label or reference point when listing out the contents of a collection. If possible, a truly distinct name that carries along any parts from an underlying context is useful as it means we can pull together all or a subset of items from one place, put them together with different items from other places, and still look at the label as humans and make some sense of what the item represents.

There are actual titles for many of the NI 43-101 reports, but these have not yet been captured in the inventory. As we put the items into ScienceBase, we want to give them unique and useful labels/titles (even though ScienceBase does not strictly require this). The following code block takes a few pieces of information to create a title and then checks to see if that title needs to have a "part_of_series" appended to the end.

In [None]:
tracking_sheet["document_title"] = tracking_sheet.apply(lambda x: build_item_title(x), axis=1)
tracking_sheet["part_of_series"] = ""

df_title_counts = tracking_sheet.groupby(["document_title"]).project_name.agg("count").to_frame('c').reset_index()
for index, row in df_title_counts.loc[df_title_counts.c > 1].iterrows():
    doc_number_of = 1
    for doc_index, doc_row in tracking_sheet.loc[tracking_sheet.document_title == row.document_title].iterrows():
        tracking_sheet.at[doc_index, 'part_of_series'] = f"({doc_number_of} of {row.c})"
        doc_number_of += 1

list(tracking_sheet.document_title)[:15]

['NI 43-101 Technical Report for the Dokis project in Canada',
 'NI 43-101 Technical Report (feasibility study) for the Ixtaca project in Mexico',
 'NI 43-101 Technical Report for the TLC project in Nevada, United States',
 'NI 43-101 Technical Report for the Premier project in Canada',
 'NI 43-101 Technical Report for the Loulo-Gounkoto project in Mali',
 'NI 43-101 Technical Report for the Kibali project in Congo, Democratic Republic of (DRC)',
 'NI 43-101 Technical Report (feasibility study) for the Kibali project in Congo, Democratic Republic of (DRC)',
 'NI 43-101 Technical Report for the Blanco project in Chile',
 'NI 43-101 Technical Report for the Aspen project in Canada',
 'NI 43-101 Technical Report for the Gallowai Bul-River project in Canada',
 'NI 43-101 Technical Report for the TV Tower project in Canada',
 'NI 43-101 Technical Report for the La Chesnaye Lake project in Canada',
 'NI 43-101 Technical Report for the Cozamin project in Mexico',
 'NI 43-101 Technical Report 

# Geospatial Information for Projects
The spatial location of the projects reported on through the NI 43-101 reports is important for search and discovery in the archive when someone is looking for information applicable to a certain area. The spreadsheet inventory used as input for the initial archive contains latitude/longitude properties holding the original spatial coordinates found within the files, along with the results of some work on validating and improving these coordinate references. Future work will go into automated processes for extracting geospatial feature information from the documents.

Within the ScienceBase context for our prototype catalog, we can add point locations (or another type of footprint) to items for each file. This will enable geospatial search within the catalog and will also set up collection-level geospatial services such that the locations of files within an archive collection can be shown on a map. In the long run, we can work on strategies for where and how to store more robust extracted geospatial feature information from text and data mining operations on the files.

The following codeblocks present some exploratory work on the geospatial part of building out the archive from existing materials. One of the challenges will be in taking the relatively messy information identifying where projects are located and turning that into information that can be used to effectively create the spatial footprint for the projects. The functions included above and called here provide one part of this, taking a sometimes variable string format for degrees/minutes/seconds and getting it into an actual spatial data format. Once we confirm that we're able to build a valid geometry from the variable provided information, we add this as GeoJSON to our dataframe so that we can call it up later with another specialized function that creates what ScienceBase expects as "extents."

This crude approach will be improved in future iterations. We may explore additional functionality, such as validating a point location against other contextual information from the report. For instance, if the report lists an approximate point coordinate for the center of the project and then states in text that it is some distance from a population center or some landmark, we could introduce a function to geolocate the landmark, check the distance, and make sure the point is within some reasonable distance of that reference area. As we get toward automated processing, that's the kind of utility our overall system will need to help produce more reasonable information.
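
The distance check described above might look something like the following. This is only a sketch: the function names, the tolerance value, and the sample coordinates are all hypothetical, and it assumes the landmark has already been geolocated by some other means. It uses the haversine great-circle formula on a spherical Earth, which should be adequate for a coarse plausibility check.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two decimal-degree points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def point_is_plausible(point, landmark, stated_km, tolerance_km=25.0):
    """True if the point-to-landmark distance is within tolerance of the stated distance.

    point and landmark are (latitude, longitude) tuples in decimal degrees.
    """
    actual_km = haversine_km(point[0], point[1], landmark[0], landmark[1])
    return abs(actual_km - stated_km) <= tolerance_km


# Made-up example: a report says the project is about 110 km from a landmark.
project_point = (0.0, 0.0)
landmark_point = (0.0, 1.0)
print(point_is_plausible(project_point, landmark_point, stated_km=110))
```

The tolerance would need tuning, since reports often give road distances or rounded figures rather than straight-line distances.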

In [None]:
%%time
confirmed_coordinates = list()

for index, record in tracking_sheet.loc[tracking_sheet['Calculated Decimal Latitude'].notnull()].iterrows():
    doc_footprint = shape_from_coords(
        coords=[record["Calculated Decimal Latitude"],record["Calculated Decimal Longitude"]],
        feature_name=record.document_title
    )
    confirmed_coordinates.append({
        "uuid": record.uuid,
        "doc_footprint": doc_footprint.to_json(),
        "decimal_latitude": doc_footprint.Latitude[0],
        "decimal_longitude": doc_footprint.Longitude[0]
    })

for index, record in tracking_sheet.loc[(tracking_sheet['Calculated Decimal Latitude'].isnull()) & (tracking_sheet['Latitude'].notnull())].iterrows():
    doc_footprint = None
    try:
        doc_footprint = shape_from_coords(
            coords=[record.Latitude,record.Longitude],
            feature_name=record.document_title
        )
    except Exception:
        # Fall back to treating the values as decimal degrees, stripping any stray characters.
        if isinstance(record.Latitude, float):
            decimal_latitude = record.Latitude
        else:
            decimal_latitude = float(re.sub(r'[^0-9.\-]+', '', record.Latitude))

        if isinstance(record.Longitude, float):
            decimal_longitude = record.Longitude
        else:
            decimal_longitude = float(re.sub(r'[^0-9.\-]+', '', record.Longitude))

        doc_footprint = shape_from_coords(
            coords=[decimal_latitude,decimal_longitude],
            feature_name=record.document_title
        )

    if doc_footprint is not None:
        confirmed_coordinates.append({
            "uuid": record.uuid,
            "doc_footprint": doc_footprint.to_json(),
            "decimal_latitude": doc_footprint.Latitude[0],
            "decimal_longitude": doc_footprint.Longitude[0]
        })

df_confirmed_coordinates = pd.DataFrame(confirmed_coordinates)

df_tracking_sheet_final = pd.merge(tracking_sheet, df_confirmed_coordinates, how="left", on="uuid")

CPU times: user 25.4 s, sys: 524 ms, total: 25.9 s
Wall time: 25.8 s
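
One thing to watch in the fallback branch: if a degrees/minutes/seconds string ever reaches it, stripping everything except digits, dots, and minus signs runs the numbers together and drops the hemisphere letter. A more explicit parser is sketched below. It is a hypothetical helper (`dms_to_decimal` is not one of the functions defined earlier in this notebook) and it assumes values are either plain decimals or up to three numbers with an optional leading or trailing N/S/E/W.

```python
import re


def dms_to_decimal(value):
    """Convert a decimal or degrees/minutes/seconds coordinate to signed decimal degrees."""
    if isinstance(value, (int, float)):
        return float(value)

    text = value.strip().upper()
    numbers = re.findall(r"\d+(?:\.\d+)?", text)
    if not 1 <= len(numbers) <= 3:
        raise ValueError(f"Cannot interpret coordinate: {value!r}")

    # Pad missing minutes/seconds with zero, then combine into decimal degrees.
    degrees, minutes, seconds = [float(n) for n in numbers] + [0.0] * (3 - len(numbers))
    decimal = degrees + minutes / 60 + seconds / 3600

    # A leading minus sign, or a leading/trailing S or W, marks a negative coordinate.
    hemisphere = re.search(r"^[NSEW]|[NSEW]$", text)
    if text.startswith("-") or (hemisphere and hemisphere.group() in "SW"):
        decimal = -decimal
    return decimal


# Made-up sample values in a few of the formats we might encounter.
for sample in ["48° 22' 7.66\" N", "79 33 18.9 W", "-97.855", 19.7]:
    print(repr(sample), "->", round(dms_to_decimal(sample), 5))
```

Raising an error on anything it cannot interpret, rather than guessing, would let us flag those records in the tracking sheet for manual review.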


# Authenticated Access to ScienceBase
Write access into ScienceBase requires user credentials for authentication. The following codeblock handles setting up the authenticated session we need to use here. We'll be working on the best strategy to handle this dynamic within our project. One approach to managing the archive would be to do some or all of the work with code, continuing to manage the inventory in a separate spreadsheet and then having one or more people able to run a script to add/update items in the collection. In that case, they would be doing something like this. There is also some built-in functionality in ScienceBase to process a spreadsheet from a root collection item into child items within that collection. Whether or not that tool can support all the things we would need to do in maintaining the archive will have to be explored.

In [None]:
sb = SbSession()

username = input("Username:  ")
sb.loginc(str(username))

Username:  sbristol@usgs.gov
··········


<sciencebasepy.SbSession.SbSession at 0x7fb41de5a990>
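
If we go the route of having people run a script to add or update items, prompting interactively may not always be practical. The sketch below shows one possible pattern: read credentials from environment variables and fall back to prompts. The variable names `SB_USERNAME` and `SB_PASSWORD` and both helper functions are hypothetical, and the sketch assumes the `login(username, password)` method of `SbSession` is available in the installed version of `sciencebasepy`.

```python
import os
from getpass import getpass


def resolve_credentials(env=os.environ, prompt_user=input, prompt_password=getpass):
    """Return (username, password) from the environment, prompting for anything missing."""
    username = env.get("SB_USERNAME") or prompt_user("Username:  ")
    password = env.get("SB_PASSWORD") or prompt_password("Password:  ")
    return username, password


def open_session(username, password):
    """Create an authenticated ScienceBase session (assumes sciencebasepy is installed)."""
    from sciencebasepy import SbSession  # imported here so the helper above has no dependency

    session = SbSession()
    session.login(username, password)
    return session


# In a script this would be used as:
#   sb = open_session(*resolve_credentials())
```

Storing a password in an environment variable has its own security tradeoffs, so this is something to work out as part of the overall strategy.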

# Archive documentation principles/conventions
What we work through in this initial effort focused on a single, fairly well structured type of document will have an impact on future efforts within the GeoArchive concept. We'll be establishing some degree of precedent and convention for other "sub-archives" to follow or build upon. As part of this project, we will collaborate with the National Geological and Geophysical Data Preservation Program (NGGDPP) to document principles and processes for design of the metamodels, use of vocabulary sources, and other dynamics for future GeoArchive implementation. In that context, we may start stubbing out a limited number of additional archival areas that are on the near horizon but will keep our focus on meeting NI 43-101 requirements in the context of critical mineral assessment methods.

In the following codeblocks, we take our final dataframe where we've done some work on project titles, location references, spatial coordinates, and other details and generate the item structure for each file that ScienceBase expects. This generates a list of dictionaries that we can then send to ScienceBase in a bulk create operation to establish the initial items in our GeoArchive collection.

In [None]:
archive_id = "607efd7fd34e8564d6809ebc"
archive_items = list()

for index, record in df_tracking_sheet_final.iterrows():
    item = {
        "parentId": archive_id,
        "title": record.document_title,
        "tags": build_place_tags(record)
    }

    if len(record.part_of_series) > 0:
        item["title"] = f"{record.document_title} {record.part_of_series}"

    body_components = list()
    if isinstance(record.Notes, str):
        body_components.append(f"<h1>Notes</h1><p>{record.Notes}</p>")
    if isinstance(record.Location, str):
        body_components.append(f"<h1>Location</h1><p>{record.Location}</p>")

    if body_components:
        item["body"] = "".join(body_components)

    if isinstance(record.effective_dates, str):
        item_dates = build_effective_date(record.effective_dates)
        if item_dates:
            item["dates"] = item_dates

    if isinstance(record.Commodities, str):
        item["tags"].extend(build_commodity_tags(record.Commodities))

    if isinstance(record["Document type"], str):
        item["tags"].extend(build_document_type_tags(record))

    item["tags"].append({
        "name": record.project_name,
        "type": "Mineral Development Project"
    })

    if isinstance(record.doc_footprint, str):
        item["extents"] = get_features(json.loads(record.doc_footprint))

    item["identifiers"] = [{
        "type": "usmin_filename",
        "scheme": "usmin_lan",
        "key": record.Filename
    }]

    archive_items.append(item)

archive_items[:5]

[{'body': '<h1>Location</h1><p>It is geographically centered at UTM NAD 83 Zone\n17 U 607000 mE 5358300 mN</p>',
  'extents': [{'geometry': {'coordinates': [-79.5552524902, 48.3687950252],
     'type': 'Point'},
    'properties': {'Latitude': 48.3687950252,
     'Longitude': -79.5552524902,
     'name': '0'},
    'type': 'Feature'}],
  'identifiers': [{'key': 'Dokis Au Ag 9-2018',
    'scheme': 'usmin_lan',
    'type': 'usmin_filename'}],
  'parentId': '607efd7fd34e8564d6809ebc',
  'tags': [{'name': 'Canada',
    'scheme': 'http://www.wikidata.org/entity/Q16',
    'type': 'Place'},
   {'name': 'gold',
    'scheme': 'http://www.wikidata.org/entity/Q897',
    'type': 'Commodity'},
   {'name': 'silver',
    'scheme': 'http://www.wikidata.org/entity/Q1057174',
    'type': 'Commodity'},
   {'name': 'Dokis', 'type': 'Mineral Development Project'}],
  'title': 'NI 43-101 Technical Report for the Dokis project in Canada'},
 {'extents': [{'geometry': {'coordinates': [-97.8550096, 19.6997209],
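
The `extents` entries in the output above are GeoJSON point features, with coordinates in longitude, latitude order. For reference, a minimal sketch of building one such feature directly from a decimal coordinate pair follows. The `point_feature` helper is hypothetical and is not the `get_features` function used in the codeblock; the property names simply mirror the ones visible in the output.

```python
def point_feature(latitude, longitude, name):
    """Build a GeoJSON Feature dict for a single point location."""
    if not -90 <= latitude <= 90:
        raise ValueError(f"Latitude out of range: {latitude}")
    if not -180 <= longitude <= 180:
        raise ValueError(f"Longitude out of range: {longitude}")

    return {
        "type": "Feature",
        # GeoJSON positions are ordered [longitude, latitude].
        "geometry": {"type": "Point", "coordinates": [longitude, latitude]},
        "properties": {"name": name, "Latitude": latitude, "Longitude": longitude},
    }


feature = point_feature(48.3687950252, -79.5552524902, name="0")
print(feature["geometry"]["coordinates"])
```

The range checks would catch swapped latitude and longitude values only when the longitude falls outside ±90, so they are a partial safeguard at best.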
 

## Commit the Items
The following codeblock deletes existing child items that we might have created previously and commits all the new items we just generated from our dataframe. At this stage, this is meant to be a one-time process, designed to give us an initial population of the archive for the inventory of NI 43-101 reports that have been managed, to date, via an internal, offline process. From the point where we start using ScienceBase in a live fashion, we will want to use either update processes via the API (if we need to do some types of bulk work) or the edit form user interface that ScienceBase provides to create new individual items or edit certain aspects of existing items.

In [None]:
%%time
sb.delete_items(sb.get_child_ids(archive_id))

new_items = sb.create_items(archive_items)
display(new_items[:5])

[{'body': '<h1>Location</h1><p>It is geographically centered at UTM NAD 83 Zone\n17 U 607000 mE 5358300 mN</p>',
  'browseTypes': ['Map Service',
   'OGC WMS Service',
   'OGC WFS Layer',
   'OGC WMS Layer'],
  'distributionLinks': [{'files': '',
    'name': '',
    'rel': 'alternate',
    'title': 'KML Service',
    'type': 'kml',
    'typeLabel': 'KML Download',
    'uri': 'https://www.sciencebase.gov/catalogMaps/mapping/ows/60d2097ed34e86b938ad990a?mode=download&request=kml&service=wms&layers=footprint'},
   {'files': '',
    'name': '',
    'rel': 'alternate',
    'title': 'ScienceBase WMS Service',
    'type': 'serviceCapabilitiesUrl',
    'typeLabel': 'OGC Service Capabilities URL',
    'uri': 'https://www.sciencebase.gov/catalogMaps/mapping/ows/60d2097ed34e86b938ad990a?service=wms&request=getcapabilities&version=1.3.0'}],
  'extents': [3456930],
  'hasChildren': False,
  'id': '60d2097ed34e86b938ad990a',
  'identifiers': [{'key': 'Dokis Au Ag 9-2018',
    'scheme': 'usmin_lan',


CPU times: user 1.85 s, sys: 241 ms, total: 2.09 s
Wall time: 7min 5s
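
Given the long wall time and the service disruptions we have run into, one option for future bulk work is to send items in smaller batches and retry a batch when a call fails. The following sketch shows the idea with a hypothetical `create_in_batches` helper; the batch size, attempt count, and delays are arbitrary placeholders. The create function is passed in, so `sb.create_items` could be supplied in practice while a stand-in is used here.

```python
import time


def chunked(items, size):
    """Yield successive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def create_in_batches(create_fn, items, batch_size=50, max_attempts=3,
                      base_delay=1.0, sleep=time.sleep):
    """Call create_fn on each batch, retrying failed calls with exponential backoff."""
    results = []
    for batch in chunked(items, batch_size):
        for attempt in range(1, max_attempts + 1):
            try:
                results.extend(create_fn(batch))
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # give up on this batch after the final attempt
                sleep(base_delay * 2 ** (attempt - 1))
    return results


def fake_create(batch):
    """Stand-in for a bulk create call; echoes each item back with an id."""
    return [{"id": f"item-{item['title']}", **item} for item in batch]


demo_items = [{"title": str(n)} for n in range(7)]
created = create_in_batches(fake_create, demo_items, batch_size=3)
print(len(created))
```

A retried batch could create duplicates if the failed call had partially succeeded on the server side, so a real implementation would need to check what already exists before resending.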


In [None]:
[i for i in new_items if "id" not in i]

[{'errors': [{'field': 'tags[0].name',
    'message': 'Property [name] of class [class gov.sciencebase.Tag] cannot be null',
    'objectName': 'class gov.sciencebase.catalog.item.Item'}]},
 {'errors': [{'field': 'tags[0].name',
    'message': 'Property [name] of class [class gov.sciencebase.Tag] cannot be null',
    'objectName': 'class gov.sciencebase.catalog.item.Item'}]},
 {'errors': [{'field': 'tags[0].name',
    'message': 'Property [name] of class [class gov.sciencebase.Tag] cannot be null',
    'objectName': 'class gov.sciencebase.catalog.item.Item'}]},
 {'errors': [{'field': 'tags[0].name',
    'message': 'Property [name] of class [class gov.sciencebase.Tag] cannot be null',
    'objectName': 'class gov.sciencebase.catalog.item.Item'}]},
 {'errors': [{'field': 'tags[0].name',
    'message': 'Property [name] of class [class gov.sciencebase.Tag] cannot be null',
    'objectName': 'class gov.sciencebase.catalog.item.Item'}]},
 {'errors': [{'field': 'tags[0].name',
    'message': '
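
These errors indicate that the first tag on each rejected item has no name. One way to guard against this before sending items is to drop any tag whose name is missing. The helpers below are a hypothetical sketch of that check, assuming tags are dictionaries with a `name` key as in the item structure built earlier; a missing value read from a spreadsheet typically shows up as `None` or a float NaN, neither of which is a string.

```python
def has_valid_name(tag):
    """True if the tag has a non-empty string name."""
    name = tag.get("name")
    return isinstance(name, str) and name.strip() != ""


def drop_unnamed_tags(item):
    """Return a copy of the item with unnamed tags removed."""
    cleaned = dict(item)
    cleaned["tags"] = [tag for tag in item.get("tags", []) if has_valid_name(tag)]
    return cleaned


# Made-up item with one good tag and two tags lacking a usable name.
sample_item = {
    "title": "Example report",
    "tags": [
        {"name": None, "type": "Place"},
        {"name": "gold", "type": "Commodity"},
        {"name": float("nan"), "type": "Place"},
    ],
}
print(drop_unnamed_tags(sample_item)["tags"])
```

Dropping the tag lets the item be created, but the underlying gap in the place information would still need to be fixed in the tracking sheet.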

# Eventual Production Infrastructure
In the GeoArchive project, we're working to get to a fully fledged working prototype in ScienceBase as quickly as we can so that we can figure out if the platform is going to work for us or not. Some early exploration points to some potential snags, mostly in the areas of performance and stability of the ScienceBase platform. There have been multiple challenges in starting to scale an API-based process of building out an initial archive collection for the NI 43-101 reports in the inventory spreadsheet (service disruptions, etc.). We've also encountered challenges with the user interface in terms of basic authentication access and stability in trying to add and edit items using the ScienceBase Item form. Both of these are considered vital areas of functionality that our system will require. Without these routes being stable and fully functional, we really cannot proceed with practical use. We are communicating about our issues with the ScienceBase team and are working toward solutions if they can be achieved.

A more distant and less critical problem is the lack of full production use of cloud storage via ScienceBase for the file contents we'll have in our archive. We would ideally like the actual file content to live in the USGS CHS cloud space, where we can work on methods and techniques that exploit this content with text and data mining technologies. In the near term, we're going to route the content of the NI 43-101 reports to a third-party platform (GeoDeepDive and COSMOS engines through our collaboration with UW-Madison). This can happen through simple HTTP access to the file content, driven through the ScienceBase API and irrespective of where the files sit. But if we want to do something beyond what these engines provide on their own, even something as simple as building a basic full-text index from PDF and other content, we will want the files in the cloud.