One of the most important aspects of building a continuous stream of new information about species in the FWS work plan, or species of interest in other contexts, is the ability to identify and access newly published literature. Many of the structured databases assembled and organized into this collection are based wholly or partly on literature references and information extracted from literature. We are currently working with the team developing the xDD Digital Library on tools and techniques for a) identifying literature potentially applicable to species-based research and b) using natural language processing to pull specific data from those sources. This is an ongoing effort that we expect to improve production capabilities over time.

In the near term, we take advantage of basic and enhanced search functionality to identify articles of potential interest in the xDD library, which holds millions of documents and grows daily. The xdd module in the bispy package contains search and packaging functionality that interfaces with the xDD REST API.
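
To make the REST interaction more concrete, the following minimal sketch builds a snippets query URL by hand using only the standard library. The base URL `https://xdd.wisc.edu/api/snippets`, the `term` parameter name, and the helper names `build_snippets_url` and `fetch_snippets` are assumptions for illustration; the bispy xdd module may construct its requests differently, so check the xDD API documentation before relying on any of them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; confirm the current route in the xDD API documentation.
XDD_SNIPPETS_URL = "https://xdd.wisc.edu/api/snippets"


def build_snippets_url(term, base_url=XDD_SNIPPETS_URL, **params):
    """Return a query URL for one search term (hypothetical helper)."""
    query = {"term": term}
    query.update(params)  # any extra query parameters the caller supplies
    return f"{base_url}?{urlencode(query)}"


def fetch_snippets(term, timeout=30):
    """Request the URL for one search term and decode the JSON response."""
    with urlopen(build_snippets_url(term), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping URL construction separate from the network call makes the query easy to inspect or log before any request is sent.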

# Data Management Considerations
The xDD Library is likely the most dynamic part of the system we are building: new literature comes out all the time, and the xDD Library continues to expand rapidly as it indexes and processes as many articles as possible. Through our partnership with the University of Wisconsin-Madison and the xDD team, we can also collaborate on incorporating new sources of papers and meaningful gray literature as we dig further into reference sources that have not yet found a route into the xDD processing engine.

In [1]:
import requests
import bispy
from IPython.display import display
from joblib import Parallel, delayed

xdd = bispy.xdd.Xdd()
bis_utils = bispy.bis.Utils()

import helperfunctions

In [2]:
name_list = helperfunctions.workplan_species()

In [3]:
# Use joblib to request xDD snippets in parallel, one search per scientific name
xdd_results = Parallel(n_jobs=8)(delayed(xdd.snippets)(name) for name, name_source in name_list)
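
Because each request spends most of its time waiting on the network, a thread pool from the standard library is one possible alternative to joblib. The sketch below is not what this notebook runs: `fake_snippets` is a hypothetical stand-in for `xdd.snippets` so the example works offline, `search_names` is an illustrative helper, and the name sources in the sample list are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_snippets(name):
    """Hypothetical stand-in for xdd.snippets; makes no network request."""
    return {"Processing Metadata": {"Search Term": name}, "Data": []}


def search_names(name_list, search=fake_snippets, max_workers=8):
    """Run one search per (name, name_source) pair, preserving input order."""
    names = [name for name, _source in name_list]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input names
        return list(pool.map(search, names))


# Placeholder name sources, for illustration only
sample_names = [
    ("Pyrgulopsis coloradensis", "example source"),
    ("Genus species", "example source"),
]
results = search_names(sample_names)
```

Passing the real search function as the `search` argument would swap the stand-in out without changing the rest of the helper.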


Since the xDD snippets function used here is not specific to a scientific name search, we loop through the results and insert two additional metadata attributes, the scientific name and the name source, so that results can be compared across collections of available information.

In [4]:
xdd_common_structure = list()

for result in xdd_results:
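    # Match the result back to the (name, source) pair that produced it
    name_used = next((r for r in name_list if r[0] == result["Processing Metadata"]["Search Term"]), None)
    if name_used is None:
        continue  # no matching name; skip rather than fail on a None lookup
    result["Processing Metadata"]["Scientific Name"] = name_used[0]
    result["Processing Metadata"]["Name Source"] = name_used[1]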
    xdd_common_structure.append(result)
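
As the name list grows, scanning it once per result becomes quadratic. One hedged alternative is to build a lookup dictionary first. The function name `annotate_results` and the sample records below are hypothetical; the dictionary keys mirror those used in the cell above, and results whose search term is not in the name list are skipped.

```python
def annotate_results(results, name_list):
    """Add scientific name and name source to each result's metadata."""
    # Build the lookup once: search term -> name source
    source_by_name = {name: source for name, source in name_list}
    annotated = []
    for result in results:
        meta = result["Processing Metadata"]
        term = meta["Search Term"]
        if term not in source_by_name:
            continue  # skip results that cannot be matched to a name
        meta["Scientific Name"] = term
        meta["Name Source"] = source_by_name[term]
        annotated.append(result)
    return annotated


# Illustrative inputs only; real results come from the xDD search step
demo_names = [("Pyrgulopsis coloradensis", "example source")]
demo_results = [
    {"Processing Metadata": {"Search Term": "Pyrgulopsis coloradensis"}},
    {"Processing Metadata": {"Search Term": "unmatched term"}},
]
demo_annotated = annotate_results(demo_results, demo_names)
```

Like the loop above, this sketch modifies each result's metadata in place and collects the matched results in a new list.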
    

In [5]:
# Cache the array of retrieved documents and return/display a random sample for verification
display(bis_utils.doc_cache("cache/xdd.json", xdd_common_structure))

{'Doc Cache File': 'cache/xdd.json',
 'Document Number 331': {'Data': [{'URL': 'http://www.tandfonline.com/doi/abs/10.1080/03632415.2013.808501',
    '_gddid': '5954d6b7cf58f173914c46ea',
    'authors': '',
    'coverDate': '2013 06',
    'doi': '10.1080/03632415.2013.808501',
    'highlight': ['Valley Cave, Kentucky; Alabama. Bottom Row (L-R) Pyrgulopsis coloradensis, USNM',
     'Bottom Row (L-R) Pyrgulopsis coloradensis, USNM 854641, Blue Point Spring, Nevada;',
     'Frest, Liu, and Ash Valley Pyrg Johannes, 2003  E  Pyrgulopsis coloradensis Hershler,',
     'Johannes, 2003  E  Pyrgulopsis coloradensis Hershler, 1998  Blue Point Pyrg  E  Pyrgulopsis'],
    'publisher': 'Taylor and Francis',
    'pubname': 'Fisheries',
    'title': 'Full Issue PDF Volume 38, Issue 6'},
   {'URL': 'http://www.tandfonline.com/doi/abs/10.1080/03632415.2013.785396',
    '_gddid': '5954d6cdcf58f173914c4757',
    'authors': 'Johnson, Paul D.; Bogan, Arthur E.; Brown, Kenneth M.; Burkhead, Noel M.; Cordeir