The FWS Environmental Conservation Online System (ECOS) contains information about threatened and endangered (T&E) species, along with other information. USGS EMA staff (and possibly others) went through the work plan species list, determined the appropriate links to ECOS species web pages, and recorded them in one of the tables provided in the source inventory. In reviewing the various systems and access points, we found that the ECOS species web pages carry information that is not accessible through the ECOS TESS web services. We also found that the identifiers used on the ECOS species web pages do not appear in any of the other accessible TESS interfaces. We therefore decided, as a first step, to run a rudimentary web scraping tool that gathers a few usable pieces of information from the linked ECOS pages, caches that information in a file, and makes it available for later work.

One of the main things extracted here is FWS's own determination of the appropriate ITIS species to link to (the "ITIS TSN" value in the cached documents). Whenever that value is available, we use it in preference to running a species name search, as one avenue for establishing a linkage and retrieving information for later use.
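
The preference described above can be sketched as a small helper. This is only an illustration and not part of bispy: the function name `choose_itis_lookup` is hypothetical, the "ITIS TSN" key is taken from the cached documents shown later in this notebook, and the name and TSN values in the usage lines are placeholders.

```python
from typing import Tuple


def choose_itis_lookup(ecos_record: dict, lookup_name: str) -> Tuple[str, str]:
    """Return ("tsn", value) when ECOS supplied a TSN, else ("name", lookup_name)."""
    tsn = ecos_record.get("ITIS TSN")  # key as it appears in the cached ECOS documents
    if tsn is not None and str(tsn).strip():
        # FWS already identified the ITIS record, so use it directly
        return ("tsn", str(tsn).strip())
    # No TSN on the ECOS page: fall back to searching by name
    return ("name", lookup_name)


# Placeholder values, not real identifiers
print(choose_itis_lookup({"ITIS TSN": None}, "Genus species"))
print(choose_itis_lookup({"ITIS TSN": "123456"}, "Genus species"))
```

Returning the strategy alongside the value lets a later step record which avenue produced each linkage.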

In writing the ECOS web scraper, we found that the ECOS pages are difficult to parse. They appear to be assembled dynamically from several sources, and the placement and form of the information in the HTML and JavaScript output is inconsistent from page to page. This first scraper is crude, and we will revisit it as needed.
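
To give a sense of the kind of parsing involved, here is a minimal sketch that turns an HTML table into a list of dictionaries using only the standard library. It is not the bispy scraper: the class name `TableRowParser` and the sample HTML are made up for illustration, and the `_link` suffix simply mirrors the key naming visible in the cached output. It assumes a single, well-formed table with a header row, which the real ECOS pages do not reliably provide.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect table rows as dicts keyed by header text (assumes one simple table)."""

    def __init__(self):
        super().__init__()
        self.headers = []   # text of each <th>
        self.rows = []      # one dict per data row
        self._cells = None  # (text, href) pairs for the row being read
        self._text = None   # text fragments of the cell being read
        self._href = None   # first link found in the cell being read
        self._tag = None    # "th" or "td" while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag in ("th", "td"):
            self._tag, self._text, self._href = tag, [], None
        elif tag == "a" and self._tag and self._href is None:
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._tag:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._tag == tag:
            text = " ".join("".join(self._text).split())  # collapse whitespace
            if tag == "th":
                self.headers.append(text)
            elif self._cells is not None:
                self._cells.append((text, self._href))
            self._tag = None
        elif tag == "tr" and self._cells:
            row = {}
            for header, (text, href) in zip(self.headers, self._cells):
                row[header] = text
                if href:
                    row[header + "_link"] = href  # keep the link next to its text
            self.rows.append(row)
            self._cells = None


# Made-up fragment, loosely modeled on values seen in the cached output
sample = """
<table>
  <tr><th>Status</th><th>Lead Region</th></tr>
  <tr><td>Resolved Taxon</td>
      <td><a href="http://www.fws.gov/mountain-prairie/">Mountain-Prairie Region (Region 6)</a></td></tr>
</table>
"""
parser = TableRowParser()
parser.feed(sample)
print(parser.rows)
```

A parser this simple breaks down as soon as values are written by inline JavaScript rather than plain markup, which is one reason the scraped output needs review.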

# Data prep
The ECOS links were contained in the "FWS 7 Year Workplan Species" worksheet of the original "Prelisting Science USGS Master_19Mar2018" spreadsheet used as source material for this exercise. The links were embedded as hyperlinks on the species "Scientific Name" field using Excel's proprietary hyperlink feature, so we used a simple VBA script to extract the links into their own field. We did this by copying the scientific name fields to another Excel file, running the VBA macro there, and then including that file as an intermediary for processing.

The ECOS links, and the information we scrape from them, use the Scientific Name string to link back to the main source file. In the same added file we also created a "Lookup Name" field, in which we corrected a few issues with name strings that make subsequent lookup steps more difficult. It is common practice to encode additional meaning or information about a species in its name string, usually in fairly inconsistent ways. There are ways to tease this out programmatically, but with a list this short, and with this additional file being generated in a desktop application anyway, it made sense to handle the few cases by hand.
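
For reference, the programmatic route mentioned above might look something like the following sketch. It was not used in this notebook. The function name `lookup_name` and both patterns are assumptions about the kinds of qualifiers that show up in name strings, and the example names are placeholders rather than entries from the work plan list.

```python
import re

# Hypothetical patterns for extra information embedded in name strings
_PARENTHETICAL = re.compile(r"\s*\([^)]*\)")  # e.g. "(=Oldgenus species)"
_QUALIFIER = re.compile(
    r"\b(?:sp|spp|ssp|subsp|var|cf|nr)\.?(?=\s|$)", re.IGNORECASE
)  # rank or uncertainty markers such as "ssp." or "var."


def lookup_name(name: str) -> str:
    """Strip parenthetical notes and rank markers, then normalize whitespace."""
    cleaned = _PARENTHETICAL.sub("", name)
    cleaned = _QUALIFIER.sub("", cleaned)
    return " ".join(cleaned.split())


# Placeholder names for illustration only
for raw in [
    "Genus species (=Oldgenus species)",
    "Genus species ssp. subspecies",
    "  Genus   species var. variety ",
]:
    print(repr(raw), "->", repr(lookup_name(raw)))
```

Rules like these tend to accumulate exceptions, so for a short list a hand-edited column is easier to audit.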

In [1]:
import pandas as pd
import bispy
from IPython.display import display
import json
from joblib import Parallel, delayed

ecos = bispy.tess.Ecos()
bis_utils = bispy.bis.Utils()

In [3]:
# Read the ECOS links and scientific names extracted into the intermediary Excel file
spp_ecos_links = pd.read_excel(
    "sources/AdditionalSourceData.xlsx",
    sheet_name="Extracted Species ECOS Links"
)
# Keep only the non-null links, as a plain list for processing
ecos_link_list = spp_ecos_links["ECOS Link"].dropna().tolist()

In [5]:
%%time
# Use joblib to run the scraper function in parallel
ecos_cache = Parallel(n_jobs=8)(delayed(ecos.scrape_ecos)(url) for url in ecos_link_list)

CPU times: user 888 ms, sys: 95.1 ms, total: 983 ms
Wall time: 33.6 s
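
The wall time above is far larger than the CPU time, which suggests the run spends most of its time waiting on the network. As an alternative to the joblib call used here, the sketch below uses a standard-library thread pool and records per-URL failures so that one unparseable page does not stop the whole run. The names `scrape_all` and `fake_scrape` and the example URLs are hypothetical; in practice the `scrape` argument would be a real scraper function.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_all(urls, scrape, max_workers=8):
    """Run scrape(url) for every URL in a thread pool, capturing errors per URL."""

    def safe_scrape(url):
        try:
            return {"url": url, "result": scrape(url), "error": None}
        except Exception as exc:  # keep going; record what failed
            return {"url": url, "result": None, "error": repr(exc)}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe_scrape, urls))  # map preserves input order


def fake_scrape(url):
    """Stand-in scraper for illustration; fails on URLs containing 'bad'."""
    if "bad" in url:
        raise ValueError("unparseable page")
    return {"Common Name": "example"}


results = scrape_all(
    ["https://example.org/ok", "https://example.org/bad"], fake_scrape
)
for item in results:
    print(item["url"], "->", item["error"] or "ok")
```

Keeping the URL with each result also makes it straightforward to retry only the failures later.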


In [7]:
# Cache the array of retrieved documents and return/display a random sample for verification
display(bis_utils.doc_cache("cache/ecos.json", ecos_cache))

{'Doc Cache File': 'cache/ecos.json',
 'Document Number 269': {'Common Name': 'bifid duct Pyrg',
  'Current Listing Status Summary': [{'Date Listed': '',
    'Lead Region': 'Mountain-Prairie Region (Region 6)',
    'Lead Region_link': 'http://www.fws.gov/mountain-prairie/',
    'Status': 'displayListingStatus("Resolved Taxon")',
    'Where Listed': 'Wherever found'}],
  'Federal Register Documents': [{'Citation Page': '82 FR 46618 46645',
    'Date': '2017-10-05',
    'Title': '12-Month Findings on Petitions To List 25 Species as Endangered or Threatened Species',
    'Title_link': 'https://www.govinfo.gov/link/fr/82/46618?link-type=pdf'},
   {'Citation Page': '76 FR 56608 56630',
    'Date': '2011-09-13',
    'Title': '90-Day Finding on a Petition To List 42 Great Basin and Mojave Desert Springsnails as Threatened or Endangered With Critical Habitat',
    'Title_link': 'https://www.govinfo.gov/link/fr/76/56608?link-type=pdf'}],
  'ITIS TSN': None,
  'Processing Metadata': {'Date Proce