This notebook takes a look at the official data release product metadata flagged to the Science Centers that are most closely affiliated with the Energy and Minerals Mission Area. I was interested in exploring where we seem to have very similar or identical processing step information between data products, potentially indicating areas ripe for exploration of a new process or way of pursuing continuous data release.

To do this, I focused on the data release products officially cataloged via the USGS "public data listing" - our Science Data Catalog (SDC). These all should have standard metadata in one of a couple possible formats, and the SDC is accessible via an API that should allow me to access metadata programmatically and work with the content.

In [1]:
import requests
import xmltodict
import pickle
from joblib import Parallel, delayed
from tqdm import tqdm
import pandas as pd
from glob import glob
import os

Filtering to the items of interest here is a little challenging. There are currently 5 Science Centers which receive the bulk of funding from the Mineral Resources Program and/or Energy Resources Program. This effectively puts those Centers "in" the Energy and Minerals Mission Area (but not exactly). Science Center names change over time, and not all USGS systems share the same "understanding" of what our Science Centers are. I ended up crafting a hard list of the Center names as the SDC has recorded them in its own attempt to standardize. This ends up looping in more records than actually apply, especially in the case of an "interdisciplinary science center" like Alaska. But it suffices to enable us to grab most of the records we want to evaluate in this process.

There's also a specific API route needed to get at the original raw metadata files that are pulled into the SDC harvest process, which is where I need to go in order to get at deeper level metadata like lineage and entity/attribute information.

In [2]:
emma_orgs = [
    "Geology, Geophysics, and Geochemistry Science Center",
    "Central Energy Resources Science Center",
    "Geology, Mineral, Energy, and Geophysics Science Center",
    "Geology, Energy, & Minerals Science Center",
    "Florence Bascom Geoscience Center",
    "Alaska Science Center",
    "Mineral Resources Program",
    "Energy Resources Program"
]

files_api = "https://data.usgs.gov/datacatalog/api/harvest/files"

I threw in a couple functions to handle the basic API queries and processing of metadata content. While this isn't a massive haul, I'm still fiddling with the best way to store and process everything, so it made sense to store a cache of raw XML content in a local directory and then some additional derivatives for analysis.

In [3]:
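def get_org_md(org):
    # Pass the query as params so names containing "&" or "," get URL-encoded
    r = requests.get(files_api, params={"data_source_name": org, "size": 10000})
    r.raise_for_status()
    items = r.json()["items"]
    return [(i["id"], i["metadata_pid"]) for i in items]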

def cache_md_record(item):
    # item is expected to carry "md_pid" and "md_url" keys
    path = f"data/emma_md/{item['md_pid']}.xml"
    r = requests.get(item['md_url'])
    with open(path, "w") as f:
        f.write(r.text)

def meta_to_dict(pid):
    path = f"data/emma_md/{pid}.xml"
    with open(path) as f:
        return xmltodict.parse(f.read(), dict_constructor=dict)
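
One thing to watch with these organization names is the commas and the ampersand, which need percent-encoding when they go into a query string. Here's a minimal sketch using the standard library's `urlencode` to show the difference. `build_org_query` is a hypothetical helper name, and nothing here actually calls the API.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint and parameter names come from the cells above; no request is sent here.
files_api = "https://data.usgs.gov/datacatalog/api/harvest/files"

def build_org_query(org, size=10000):
    """Hypothetical helper: build the query URL with the name percent-encoded."""
    query = urlencode({"data_source_name": org, "size": size})
    return f"{files_api}?{query}"

org = "Geology, Energy, & Minerals Science Center"
naive_url = f"{files_api}?data_source_name={org}&size=10000"
encoded_url = build_org_query(org)

# Parse both URLs back to see which name a server would receive.
naive_name = parse_qs(urlsplit(naive_url).query)["data_source_name"][0]
encoded_name = parse_qs(urlsplit(encoded_url).query)["data_source_name"][0]
print(repr(naive_name))    # name is cut off at the ampersand
print(repr(encoded_name))  # full name survives
```

Passing a `params` dictionary to `requests.get` does the same encoding without building the string by hand.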

This is not a great way to go get metadata, but I've had issues in the past trying to operate in parallel or run a bunch of requests against data.usgs.gov endpoints because of WAF (web application firewall) restrictions or application limitations. This block runs sequential queries on the list of organization names and gives us the SDC-specific identifiers for all relevant records. We save that to a table for later reference.

In [None]:
%%time
org_meta = []
for o in emma_orgs:
    o_md = get_org_md(o)
    for item in o_md:
        org_meta.append({
            "org": o,
            "md_id": item[0],
            "md_pid": item[1]
        })
pd.DataFrame(org_meta).to_parquet("data/emma_meta_inventory.parquet")
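
If the sequential loop still trips rate limits, one option is to pause between calls and back off after a failure. This is a minimal sketch under my own assumptions: `fetch_all` is a hypothetical helper, the pause and retry values are guesses, and the fetch function is passed in so the logic can be tried without touching the network.

```python
import time

def fetch_all(keys, fetch, pause=1.0, retries=3, sleep=time.sleep):
    """Call fetch(key) for each key in order, pausing between calls.

    A failed call is retried with a doubling delay; the last failure is re-raised.
    """
    results = {}
    for key in keys:
        for attempt in range(retries):
            try:
                results[key] = fetch(key)
                break
            except Exception:
                if attempt == retries - 1:
                    raise
                sleep(pause * 2 ** attempt)  # back off before the next try
        sleep(pause)  # be polite between successive keys
    return results

# Offline demonstration with a stand-in fetch function and no real waiting.
demo = fetch_all(["a", "b"], str.upper, pause=0.5, sleep=lambda seconds: None)
print(demo)
```

In this notebook the call could look like `fetch_all(emma_orgs, get_org_md, pause=2.0)`, with the pause tuned to whatever the service tolerates.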

I know, working with dataframes and Pandas is slow and there are better ways I need to learn. But it's convenient for now.

In [4]:
df_org_meta = pd.read_parquet("data/emma_meta_inventory.parquet")

I much prefer dealing with JSON/dictionaries over XML, so I grab everything from the cache, transform it, and add it to the dataframe for use.

In [5]:
%%time
df_org_meta["meta"] = df_org_meta.md_pid.apply(meta_to_dict)

CPU times: user 9.13 s, sys: 437 ms, total: 9.57 s
Wall time: 37.6 s


I do a little bit of digestion of the metadata structures to tee up some information for examination. Right now, I'm interested in the lineage information and processing steps, in particular, so I summarize that a bit with some new properties in the dataframe.

In [6]:
# Top-level sections present in each record (None when there is no "metadata" root)
df_org_meta["meta_keys"] = df_org_meta.meta.apply(lambda x: list(x["metadata"].keys()) if "metadata" in x else None)
# Lineage section and its processing steps, when the record has them
df_org_meta["lineage"] = df_org_meta.meta.apply(lambda x: x["metadata"]["dataqual"]["lineage"] if "metadata" in x and "dataqual" in x["metadata"] and "lineage" in x["metadata"]["dataqual"] else None)
df_org_meta["procstep"] = df_org_meta.lineage.apply(lambda x: x["procstep"] if x is not None and "procstep" in x else None)
# xmltodict gives a dict for a single procstep and a list for several, so count accordingly
df_org_meta["num_procstep"] = df_org_meta.procstep.apply(lambda x: 0 if x is None else len(x) if isinstance(x, list) else 1)
# String form of the steps, used to detect exact duplicates
df_org_meta["procstep_s"] = df_org_meta.procstep.apply(str)
df_org_meta["dup_procstep"] = df_org_meta.duplicated(subset="procstep_s")
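
The chained `in` checks above get long quickly. Here's a minimal sketch of two small helpers that do the same kind of digging; `dig` and `as_list` are hypothetical names of my own, and the sample record is made up to mirror the structure used above. The list handling matters because xmltodict returns a single dict when an element occurs once and a list when it repeats.

```python
def dig(d, *keys):
    """Walk nested dicts by key, returning None as soon as a level is missing."""
    for key in keys:
        if not isinstance(d, dict) or key not in d:
            return None
        d = d[key]
    return d

def as_list(value):
    """Normalize None, a single item, or a list of items to a list."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]

# Made-up record with a single processing step, shaped like a parsed metadata file.
record = {
    "metadata": {
        "dataqual": {
            "lineage": {
                "procstep": {"procdesc": "Reformatted data.", "procdate": "1980"}
            }
        }
    }
}

steps = as_list(dig(record, "metadata", "dataqual", "lineage", "procstep"))
print(len(steps))                      # number of processing steps
print(dig(record, "metadata", "eainfo"))  # missing section comes back as None
```

With the steps normalized to a list, counting and comparing them no longer depends on whether a record happened to have one step or many.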

At a really crude level, exactly duplicated processing steps in the metadata for what seem like they should be discrete data release products seem odd. Maybe these are really more of a serial data product that could be structured and handled in a different way in terms of review and other FSP process. Or maybe something else is going on. In the next few code blocks, I pull out this piece, figure out where I have duplicate processing step metadata, show the highest number of those cases, and take a look.

In [7]:
proc_steps = df_org_meta[df_org_meta.num_procstep > 0][["md_pid","procstep_s"]].copy()

In [8]:
dup_procsteps = proc_steps.groupby("procstep_s").agg(list).reset_index("procstep_s").rename(columns={"md_pid":"procstep_pids"})
dup_procsteps["num_pids"] = dup_procsteps.procstep_pids.apply(lambda x: len(x))
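
The grouping above keys on the `str()` form of the steps. As an alternative, here's a minimal sketch that keys on a hash of a canonical JSON form instead, so that key order inside a step does not affect the match. This is a swapped-in technique, not what the cells above do: `fingerprint` and `group_by_procstep` are hypothetical names, the PIDs are made up, and unlike the dataframe version it keeps only groups with more than one PID.

```python
import hashlib
import json
from collections import defaultdict

def fingerprint(procstep):
    """Hash a canonical JSON form of the steps (sorted keys, UTF-8)."""
    canonical = json.dumps(procstep, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def group_by_procstep(records):
    """Map each fingerprint to the PIDs sharing it, keeping groups of two or more."""
    groups = defaultdict(list)
    for pid, procstep in records:
        if procstep:  # skip records with no processing steps
            groups[fingerprint(procstep)].append(pid)
    return {fp: pids for fp, pids in groups.items() if len(pids) > 1}

# Made-up records: pid-1 and pid-3 share the same step, with keys in different order.
records = [
    ("pid-1", [{"procdesc": "Reformatted data.", "procdate": "1980"}]),
    ("pid-2", [{"procdesc": "Digitized maps.", "procdate": "1995"}]),
    ("pid-3", [{"procdate": "1980", "procdesc": "Reformatted data."}]),
    ("pid-4", None),
]

shared = group_by_procstep(records)
print(list(shared.values()))
```

A fingerprint column like this is also much shorter to store and compare than the full string form of the steps.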

In [10]:
dup_procsteps.num_pids.max()

186

This is an interesting example for further exploration. Examining the PIDs where this duplication in processing steps occurred showed a somewhat different dynamic, though one still within the continuous data release problem area. These all appear to be aeromagnetic survey data collected through contracts, received by USGS in a particular format, and processed in a particular way for our own use. These particular metadata records originated with MRData, are likely generated dynamically from MRData, and pre-date the current data release process, with MRData serving as the repository.

We do need to dig a bit into what the current state is on processing geophysical survey data being collected now through things like EarthMRI contracts. What kind of review and approval process are these going through? Where are they being housed in terms of a repository (I believe the answer is ScienceBase to some extent)? Are we handling them in the most efficient way we can? Is a survey-by-survey organization of data products serving the community of use? Is there some larger data system these data could become part of? What's the relationship to other USGS assets where we are striving for national coverage, like the National Map, and should we be looking toward more collaboration on processes and techniques?

In [9]:
import ast
ast.literal_eval(dup_procsteps[dup_procsteps.num_pids == 186].iloc[0].procstep_s)

[{'procdesc': "Conversion of measured values to geographic position and\nmagnetic and radiometric values was performed by the contractor\nusing industry standard practices.\nDetails are found under Attribute Accuracy Report,\nHorizontal_Position_Accuracy_Report, and\nVertical_Position_Accuracy_Report\nConversion processes, if reported, may be found in the\nU.S. Department of Energy's published GJO- or GJBX- reports for the\nquadrangle or group of quadrangles.  Unpublished products generated\nby the contractor included magnetic tapes and perhaps some\nwritten documentation.",
  'procdate': '1980'},
 {'procdesc': "USGS reformatting of contractor data to standard format.\nUSGS personnel used the software package Oasis Montaj version 6.3 by\nGeosoft, Inc., to read in the original contractor's data.  Positioning\nand magnetic values were checked for obvious errors or spikes. Values\nof -9999.9,-999.9, -99.9, etc., were given where the value could not be\nreasonably corrected or, in some cas