# Pipeline Stage 5
The ITIS stage in the SGCN pipeline kicks off all of the other stages, which can then operate independently. Because ITIS provides a number of additional search vectors and is the primary determinant for placing SGCN species on the National List, we run this stage first. It assembles additional names to go after in other sources, along with identifiers that are used in at least one case (the USFWS Environmental Conservation Online System).

All of the sppin information processors use a single function, process_sppin_source_search_term(), built to handle the only slightly different operations in each case. It can operate against a message queue, retrieving a single message and processing it through to completion. When running the process locally, I first retrieve all messages from a given queue and then run them in parallel. How many messages we can process at once is limited by both the number of concurrent connections we can make to our database and the number of HTTP requests we want to send to the ITIS (or any other) API. When moving to a Lambda environment, the database connection issue should no longer be a factor, but we will still need to throttle the number of concurrent connections we send out to third-party APIs.
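As a minimal sketch of the throttling concern, the example below caps the number of in-flight calls with a semaphore, independent of how many workers are running. The `Throttle` class and `fake_api_lookup` function are hypothetical illustrations and are not part of pysgcn; the sleep stands in for an HTTP request to ITIS.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Throttle:
    """Allow at most max_concurrent wrapped calls to be in flight at once."""

    def __init__(self, max_concurrent):
        self._semaphore = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self.in_flight = 0
        self.peak = 0  # highest concurrency observed, for inspection

    def call(self, func, *args, **kwargs):
        with self._semaphore:  # blocks while max_concurrent calls are running
            with self._lock:
                self.in_flight += 1
                self.peak = max(self.peak, self.in_flight)
            try:
                return func(*args, **kwargs)
            finally:
                with self._lock:
                    self.in_flight -= 1


def fake_api_lookup(name):
    # Hypothetical stand-in for an HTTP request to a third-party API
    time.sleep(0.01)
    return f"looked up {name}"


throttle = Throttle(max_concurrent=3)
names = [f"species {i}" for i in range(20)]

# Ten workers are available, but the throttle only lets three call out at once
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(lambda n: throttle.call(fake_api_lookup, n), names))
```

In a Lambda setting the same cap would more likely be applied through the platform's concurrency settings, but the idea of separating worker count from outbound request count is the same.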

Note: When I ran the 50K or so unique name/source messages on the ITIS queue in parallel locally against a SQLite database instance, the run was interrupted a number of times by either database lock issues or ITIS HTTP service connection problems, and it had to be restarted each time. This shouldn't be as big a problem in a Lambda environment, but we will need to set up dead letter queue handling for messages that fail to process entirely.
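One way to picture dead letter handling is the sketch below, which retries each message a fixed number of times and then sets it aside with its error instead of halting the run. It uses plain Python lists in place of real queues, and `process_with_dead_letter` and `flaky_handler` are hypothetical names for illustration, not pysgcn functions.

```python
def process_with_dead_letter(messages, handler, dead_letters, max_attempts=3):
    """Run handler on each message; park messages that keep failing."""
    processed = []
    for message in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                processed.append(handler(message))
                break  # success, move on to the next message
            except Exception as error:
                if attempt == max_attempts:
                    # Give up on this message but keep the run going
                    dead_letters.append(
                        {"message": message, "error": str(error), "attempts": attempt}
                    )
    return processed


def flaky_handler(message):
    # Hypothetical handler: message 2 always hits a simulated connection problem
    if message["id"] == 2:
        raise ConnectionError("simulated ITIS connection problem")
    return "MESSAGE PROCESSED: " + message["body"]


demo_messages = [
    {"id": 1, "body": "Scientific Name:Oeneis jutta"},
    {"id": 2, "body": "Scientific Name:Lycia ypsilon"},
    {"id": 3, "body": "Scientific Name:Hetaerina americana"},
]
dead_letters = []
processed = process_with_dead_letter(demo_messages, flaky_handler, dead_letters)
```

Messages left in `dead_letters` can then be inspected or replayed once the underlying problem is resolved.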

The ITIS process fans out many new messages onto other queues. It sends all of the names it encounters, both the original name from the SGCN source and any additional names in the ITIS records (not necessarily all synonyms at this point, just the names encountered through search or by following through to valid taxonomic records), to the message queues for the other SppIn information gatherers. It also sends any names that it does not find to a WoRMS queue.
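The routing described above might look roughly like the following sketch. The function name, the queue names, and the message body shape (a dict with a "Scientific Name" key) are all assumptions made for illustration; the actual pysgcn routing logic may differ.

```python
def route_names(original_name, itis_names, found, downstream_queues,
                worms_queue="mq_worms_check"):
    """Return (queue, body) pairs to send after one ITIS message is processed.

    All queue names here are hypothetical placeholders.
    """
    if not found:
        # Names ITIS cannot resolve are handed to the WoRMS processor instead
        return [(worms_queue, {"Scientific Name": original_name})]

    # De-duplicate the original name plus any names seen in the ITIS records
    unique_names = sorted({original_name, *itis_names})
    return [
        (queue, {"Scientific Name": name})
        for queue in downstream_queues
        for name in unique_names
    ]


found_routes = route_names(
    "Oeneis jutta",
    ["Oeneis jutta", "Oeneis jutta ascerta"],
    found=True,
    downstream_queues=["mq_source_a", "mq_source_b"],
)
missing_routes = route_names(
    "Various species of invertebrates", [], found=False,
    downstream_queues=["mq_source_a", "mq_source_b"],
)
```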

Both the ITIS and WoRMS processors also send messages containing taxonomic authority summary information when they encounter a usable valid record. These messages are used to add properties to the SGCN master table indicating whether or not a scientific name should be placed onto the "SGCN National List."

In [1]:
import pysgcn
sgcn = pysgcn.sgcn.Sgcn()

from joblib import Parallel, delayed
from tqdm import tqdm

mq = "mq_itis_check"
sppin_source = "itis"

In [2]:
messages = sgcn.sql_mq.get_all_records("mq", mq)

In [3]:
%%time
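# Process every queued message using 7 worker threads; tqdm shows progress
# as messages are handed off to the workers.
Parallel(n_jobs=7, prefer="threads")(
    delayed(sgcn.process_sppin_source_search_term)(
        message_queue=mq,
        sppin_source=sppin_source,
        message_id=message["id"],
        message_body=message["body"]
    ) for message in tqdm(messages)
)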

100%|██████████| 15084/15084 [06:06<00:00, 41.12it/s]


CPU times: user 4min 13s, sys: 7min 31s, total: 11min 45s
Wall time: 6min 36s


['MESSAGE PROCESSED: Scientific Name:Sistrurus c. catenatus',
 'MESSAGE PROCESSED: Scientific Name:Various species of invertebrates',
 'MESSAGE PROCESSED: Scientific Name:Hemileuca maia ssp',
 'MESSAGE PROCESSED: Scientific Name:Synedoida adumbrata',
 'MESSAGE PROCESSED: Scientific Name:Planorbella pilsbryi',
 'MESSAGE PROCESSED: Scientific Name:Catocala jair ssp',
 'ALREADY CACHED: Scientific Name:Oeneis jutta',
 'ALREADY CACHED: Scientific Name:Sphoeroides maculatus',
 'MESSAGE PROCESSED: Scientific Name:Coregonus reighardi',
 'ALREADY CACHED: Scientific Name:Hylocichla mustelina',
 'ALREADY CACHED: Scientific Name:Brevoortia tyrannus',
 'ALREADY CACHED: Scientific Name:Stylurus spiniceps',
 'MESSAGE PROCESSED: Scientific Name:Ammocrypta pellucidum',
 'MESSAGE PROCESSED: Scientific Name:Monoleuca semifascia',
 'MESSAGE PROCESSED: Scientific Name:Plauditus gloveri',
 'ALREADY CACHED: Scientific Name:Lycia ypsilon',
 'ALREADY CACHED: Scientific Name:Hetaerina americana',
 'ALREADY CACH

In assembling this workflow and thinking about it from the standpoint of asynchronous, message-based processing, I ran into a number of interesting dynamics. We end up pushing out a lot of essentially duplicate messages that have to be dealt with in the processors by checking some API to determine whether or not a given check needs to run. For instance, in processing through something like the SGCN lists or any list of species for any purpose, we're going to end up with the same species name from many different sources on one or more logical message queues triggering taxonomic lookups to ITIS or any of our other sources. We don't need to run lookup processes every single time, but our information gathering algorithms need to be "smart" enough to know when they should run something again to refresh data, and our receiving information systems need to have business rules established that understand how to absorb changed values from third party sources.

ITIS lookups for SGCN present a number of different dynamics that deserve further scrutiny. The refresh cycle for ITIS itself is nominally monthly, based on how the group that operates the system works. A monthly data update for ITIS generally includes taxonomic treatments for one or more entire taxonomic groups along with other incidental corrections and changes in the data discovered by the maintainers over time. That establishes one constraining end member: cached ITIS data should not need refreshing more often than about once a month.

We may set up a central cyberinfrastructure that receives messages for looking up ITIS taxonomy from many different inbound vectors, processes those messages in a standardized way, and stores information in a dynamic cache. (In this local processing scheme, I've essentially set this up as the "sppin" database, containing cached data structures for each species information source I'm using here.) Every time a new message shows up, we check the cache first, look at dates for existing records, and re-run an ITIS lookup from source any time a 30 day threshold is exceeded.

At any point, most records will probably not have changed, but those that have changed will probably need to trigger an "itis_changed_record" message somewhere, signaling that something needs to be evaluated for one or more dependent information systems. What exactly that change means is going to vary for those dependent systems, and that needs to be thought through as part of data management planning. In the case of SGCN, this could trigger a number of different things:

* The total number of "National List" species could go up or down with changes in taxonomic understanding as species names are combined or split.
* New names could need to be checked against other systems where additional information can be returned or new values derived.
* Additional states could be shown to be listing the same species as synonyms are identified and established.
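The cache-first check with a 30 day threshold and the "itis_changed_record" signal could be sketched as follows. This uses an in-memory dict as the cache, and the function names, record fields, and message shape are assumptions for illustration rather than the actual sppin database structure.

```python
from datetime import datetime, timedelta, timezone

REFRESH_AFTER = timedelta(days=30)  # threshold taken from the monthly ITIS cycle


def needs_refresh(cache, name, now):
    """True if the name has never been cached or its record is older than 30 days."""
    record = cache.get(name)
    if record is None:
        return True
    return now - record["date_cached"] > REFRESH_AFTER


def absorb_lookup(cache, name, fresh_data, now):
    """Store a fresh lookup; return a change message if the data differs."""
    previous = cache.get(name)
    changed = previous is not None and previous["data"] != fresh_data
    cache[name] = {"data": fresh_data, "date_cached": now}
    if changed:
        return {"type": "itis_changed_record", "name": name}
    return None


now = datetime(2020, 3, 1, tzinfo=timezone.utc)
cache = {
    "Oeneis jutta": {
        "data": {"usage": "valid"},
        "date_cached": datetime(2020, 1, 15, tzinfo=timezone.utc),  # 46 days old
    },
    "Lycia ypsilon": {
        "data": {"usage": "valid"},
        "date_cached": datetime(2020, 2, 20, tzinfo=timezone.utc),  # 10 days old
    },
}

stale = needs_refresh(cache, "Oeneis jutta", now)
fresh = needs_refresh(cache, "Lycia ypsilon", now)
change_message = absorb_lookup(cache, "Oeneis jutta", {"usage": "invalid"}, now)
```

What a dependent system does with the change message is the part that needs business rules; this sketch only shows where such a message would originate.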

All of these dynamics will impact applications using the data and reports being generated from the API, and keeping track of when specific records change in some way will likely be important. It is easy enough to use a device like the one in this local system, which uses a hash of a given record's content to generate its primary key. This lets us use lazy processes to throw only unique records into a ledger, but we do have to determine what to do when multiple records respond to the same logical query, how best to report what changed between those records in our APIs, and how different systems should determine whether or not the change was significant to their particular business rules.
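A minimal sketch of the content-hash idea, assuming records are JSON-serializable dicts: the key is a SHA-256 digest of a canonical serialization, so identical content always produces the same key and duplicates drop out on insert. The function names and the choice of SHA-256 are illustrative assumptions, not necessarily what the local system uses.

```python
import hashlib
import json


def record_key(record):
    """Derive a primary key from record content, independent of key order."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def add_to_ledger(ledger, record):
    """Insert the record only if its content has not been seen; return True if added."""
    key = record_key(record)
    if key in ledger:
        return False
    ledger[key] = record
    return True


ledger = {}
first = add_to_ledger(ledger, {"name": "Oeneis jutta", "usage": "valid"})
# Same content with keys in a different order hashes to the same key
duplicate = add_to_ledger(ledger, {"usage": "valid", "name": "Oeneis jutta"})
# Changed content produces a new ledger entry for the same logical query
changed = add_to_ledger(ledger, {"name": "Oeneis jutta", "usage": "invalid"})
```

After the third call the ledger holds two records for the same name, which is exactly the situation where rules are needed for reporting what changed between them.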