The FWS Environmental Conservation Online System (ECOS) contains information about threatened and endangered (T&E) species, among other things. USGS EMA staff (or others?) went through the work plan species list, determined the appropriate links to ECOS species web pages, and recorded them in one of the tables provided in the source inventory. In reviewing the various systems and access points, we found that some information on ECOS species web pages is not accessible through the ECOS TESS web services. We also found that the identifiers used on the ECOS species web pages do not appear anywhere in the other accessible TESS interfaces. We therefore decided, as a first step, to run a rudimentary web scraping tool that gathers a few usable pieces of information from the linked ECOS pages and caches them in a file for use in later work.

One of the main things extracted here is FWS' own determination of the appropriate ITIS species to link to. Whenever it is available, we use this identifier instead of running a species name search, as one avenue for establishing a linkage and retrieving information for later use.
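
As a rough illustration of pulling an ITIS identifier out of a page, the sketch below collects anchor links from an HTML fragment and reads the TSN from the `search_value` query parameter of an ITIS report URL. This is not the `bispy` scraper: the function and class names are hypothetical, the sample HTML is invented, and the assumption that ECOS pages link to ITIS in this form has not been checked against the live pages.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class LinkCollector(HTMLParser):
    """Collect the href of every anchor tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def find_itis_tsn(html_text):
    """Return the TSN from the first ITIS report link found, or None."""
    collector = LinkCollector()
    collector.feed(html_text)
    for href in collector.hrefs:
        parsed = urlparse(href)
        if "itis.gov" not in parsed.netloc:
            continue
        # Assumed link form: ...?search_topic=TSN&search_value=<tsn>
        values = parse_qs(parsed.query).get("search_value", [])
        if values and values[0].isdigit():
            return values[0]
    return None


# Hypothetical page fragment, not copied from an ECOS page
sample_html = """
<div>
  <a href="/ecp/species/9798">Species profile</a>
  <a href="https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&amp;search_value=650405">ITIS</a>
</div>
"""
print(find_itis_tsn(sample_html))
```

Returning `None` when no ITIS link is present lets later steps fall back to a name-based search.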

In writing the ECOS web scraper, we found that the ECOS pages are quite hard to work with. They appear to be assembled dynamically from various sources, and where and how the information is output to HTML/JavaScript on the pages is somewhat inconsistent. This first scraper is crude, and we'll revisit it as needed down the road.

# Data prep
The ECOS links were contained in the "FWS 7 Year Workplan Species" worksheet from the original "Prelisting Science USGS Master_19Mar2018" spreadsheet used as source material for this exercise. The links were embedded as hyperlinks on the species "Scientific Name" field using Excel's proprietary methods, so we had to use a simple VBA script to extract the links into their own field. We did this by copying the scientific name fields over to another Excel file, running the VBA macro there, and then including that file as an intermediary for processing.

The ECOS links and the information we scrape from them use the Scientific Name string to link back to the main source file. In this added file we also created a "Lookup Name" field, where we corrected a few issues with name strings that make subsequent lookup steps more difficult. It's common practice to encode additional meaning or information about a species in a name string, usually in fairly inconsistent ways. There are ways to tease this out programmatically, but with a list this short, and a need to provide this additional file generated in a desktop application anyway, it made sense to simply take care of the few cases by hand.
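
For a longer list, the programmatic route might look something like the sketch below. It is not what we did here (the few cases were fixed by hand). The patterns, the function name `lookup_name`, and the placeholder name strings are all hypothetical; a real version would be built from the qualifiers actually seen in the data.

```python
import re

# Hypothetical patterns for extra information packed into a name string
_PARENTHETICAL = re.compile(r"\s*\([^)]*\)")  # e.g. "(northern population)"
_POPULATION_NOTE = re.compile(r"\s+(?:pop\.|population\b).*$", re.IGNORECASE)
_QUALIFIERS = re.compile(r"\b(?:sp|spp|cf|aff|nr)\.(?=\s|$)", re.IGNORECASE)


def lookup_name(name_string):
    """Return a simplified name string better suited to a name search."""
    cleaned = _PARENTHETICAL.sub("", name_string)
    cleaned = _POPULATION_NOTE.sub("", cleaned)
    cleaned = _QUALIFIERS.sub("", cleaned)
    # Collapse any whitespace left behind by the removals
    return " ".join(cleaned.split())


# Placeholder strings, not taken from the work plan list
for raw in [
    "Genus species",
    "Genus species (northern population)",
    "Genus species pop. 1",
    "Genus cf. species",
    "Genus sp.",
]:
    print(repr(raw), "->", repr(lookup_name(raw)))
```

Keeping the original Scientific Name alongside the cleaned Lookup Name preserves the link back to the source file.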

In [1]:
import pandas as pd
import bispy
from IPython.display import display
import json
from joblib import Parallel, delayed
import jsonschema

ecos = bispy.ecos.Ecos()
bis_utils = bispy.bis.Utils()

import helperfunctions

In [2]:
# Retrieve the extraction of ECOS links with scientific names from an excel file
spp_ecos_links = pd.read_excel(
    "../sources/AdditionalSourceData.xlsx",
    sheet_name="Extracted Species ECOS Links"
)
# Put just the links into a list for processing
ecos_link_list = spp_ecos_links["ECOS Link"].dropna().tolist()

In [5]:
%%time
# Use joblib to run the scraper function in parallel
ecos_cache = Parallel(n_jobs=8)(delayed(ecos.scrape_ecos)(url) for url in ecos_link_list)

CPU times: user 1.02 s, sys: 108 ms, total: 1.13 s
Wall time: 18.1 s


In [6]:
# Cache the array of retrieved documents and return/display a random sample for verification
display(bis_utils.doc_cache("../cache/ecos.json", ecos_cache))

{'Doc Cache File': 'cache/ecos.json',
 'Number of Documents in Cache': 341,
 'Document Number 46': {'processing_metadata': {'status': 'success',
   'date_processed': '2019-07-25T20:59:19.252721',
   'api': 'https://ecos.fws.gov/ecp/species/9798'},
  'ecos_species_summary': [{'Scientific Name': 'Cambarus jezerinaci',
    'Common Name': 'Spiny Scale crayfish',
    'Current Listing Status Summary': [{'Status': 'Under Review',
      'Date Listed': '',
      'Lead Region': 'Northeast Region (Region 5)',
      'document_link': 'http://www.fws.gov/northeast/',
      'Where Listed': 'Wherever found'}],
    'itis_tsn': '650405'}]}}
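
The caching step above can be approximated with the standard library alone. The sketch below is a guess at the behavior suggested by the displayed output (write the list to a JSON file, then return the file path, the document count, and one randomly chosen document). It is not the `bispy` implementation, and `cache_documents` is a hypothetical name.

```python
import json
import random
import tempfile
from pathlib import Path


def cache_documents(cache_path, documents):
    """Write documents to a JSON file and return a summary with one random sample."""
    path = Path(cache_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        json.dump(documents, f, indent=2)

    summary = {
        "Doc Cache File": str(path),
        "Number of Documents in Cache": len(documents),
    }
    if documents:
        # Include one randomly selected document for a quick visual check
        index = random.randrange(len(documents))
        summary[f"Document Number {index}"] = documents[index]
    return summary


# Usage with made-up documents, written to a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    example_docs = [{"id": 1}, {"id": 2}, {"id": 3}]
    print(cache_documents(Path(tmp_dir) / "cache" / "example.json", example_docs))
```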

# Schema Validation
In this instance of developing a JSON Schema document, I noted a few interesting dynamics:
* I've been experimenting with a number of methods for generating an initial JSON Schema stub by examining the actual data generated. I wrote a function for this based on the genson package, which works reasonably well, though it proved challenging to modify to fit the current draft specification (genson stopped at draft 4). In this case, I experimented again with the jsonschema.net tool, which is pretty good. However, I found that it did not pick up all of the document listing properties when I threw the full ECOS dataset at the inference tool, so I had to build a representative sample of records by selecting those that had the particular document sets. I could then stub out a schema to take elsewhere and complete.
* I continued to work through some common properties that should be applicable across many different datasets. That process is iteratively growing the "definitions" section of the JSON Schema, which is essentially a way to reference an "external" set of property definitions within any given schema. According to the documentation for JSON Schema draft 7, I should be able to pull these from a common file, referencing something like `{"$ref": "common_properties.json#/document_title"}`. However, I found that the jsonschema package doesn't currently follow that type of notation, only internal references like `#/definitions/document_title`. Eventually, we will want these to be HTTP-resolvable references to a master property registry. Of course, we also want to build an intelligent system that is constantly introspecting newly registered data and dynamically suggesting where common properties may be emerging.
* In working through some issues in the data, I noticed that the jsonschema validation report identifies the record in the list of JSON documents/records where an issue is found. This allows for an efficient iterative process of getting the schema laid out correctly to validate a given dataset.
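
One possible workaround for the external reference limitation noted above is to copy the shared definitions into the schema and rewrite the references as internal ones before validating. The sketch below uses only the standard library. `inline_external_refs` is a hypothetical helper (not part of `helperfunctions` or jsonschema), it assumes references of the form `common_properties.json#/definitions/<name>`, and it does not follow references nested inside the copied definitions.

```python
import copy


def inline_external_refs(schema, external_schemas):
    """Return a copy of schema with external definition refs made internal.

    external_schemas maps a file name (e.g. "common_properties.json") to the
    parsed JSON Schema document stored under that name.
    """
    resolved = copy.deepcopy(schema)
    definitions = resolved.setdefault("definitions", {})
    prefix = "/definitions/"

    def rewrite(node):
        if isinstance(node, dict):
            ref = node.get("$ref")
            if isinstance(ref, str) and "#" in ref and not ref.startswith("#"):
                file_name, pointer = ref.split("#", 1)
                source = external_schemas.get(file_name, {}).get("definitions", {})
                name = pointer[len(prefix):] if pointer.startswith(prefix) else None
                if name in source:
                    # Copy the shared definition locally and point the ref at it
                    definitions.setdefault(name, copy.deepcopy(source[name]))
                    node["$ref"] = "#/definitions/" + name
            # Iterate over a snapshot because definitions may grow while walking
            for value in list(node.values()):
                rewrite(value)
        elif isinstance(node, list):
            for item in node:
                rewrite(item)

    rewrite(resolved)
    return resolved


# Made-up schemas for illustration
common = {"definitions": {"processing_metadata": {"type": "object"}}}
dataset_schema = {
    "type": "array",
    "items": {
        "properties": {
            "processing_metadata": {
                "$ref": "common_properties.json#/definitions/processing_metadata"
            }
        }
    },
}
print(inline_external_refs(dataset_schema, {"common_properties.json": common}))
```

The returned schema contains only internal `#/definitions/...` references, so it could then be passed to `jsonschema.validate` as usual.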

In [9]:
ecos_schema = helperfunctions.load_schema('ecos')
display(ecos_schema)

jsonschema.validate(ecos_cache, ecos_schema)

{'definitions': {'items': {'$id': '#/items',
   'type': ['object', 'array'],
   'title': 'Generic container for items in a dataset',
   'description': 'A JSON array or object property containing one or more items in a dataset or data structure within a dataset.'}},
 '$schema': 'http://json-schema.org/draft-07/schema#',
 '$id': 'http://data.usgs.gov/property_registry/',
 'type': ['array', 'object'],
 'title': 'ECOS Species Web Scrape Collection',
 'description': 'A dataset containing species information scraped from the USFWS Ecological Conservation Online System web pages. Certain properties on species from the USFWS databases could not be accessed through any other open API. Data were assembled using a function build into the experimental bispy software package (https://github.com/usgs-bcb/bispy).',
 'items': {'$ref': '#/definitions/items',
  'properties': {'processing_metadata': {'$ref': 'common_properties.json#/definitions/processing_metadata'},
   'ecos_species_summary': {'$id': '#