# Pipeline Stage 1
Source data for Species of Greatest Conservation Need (SGCN) consists of two generations of state/territory information that is nominally updated every ten years as part of State Wildlife Action Plans (SWAPs). The first generation of this dataset that the USGS brought together came from the 2005 SWAP reports. Species names were generally extracted manually from the reports, put into a combined database, and checked against ITIS for taxonomic authority alignment. In the second generation, states and territories submitted spreadsheets based on a provided template. Both generations of the source files are contained within a single [ScienceBase collection](https://www.sciencebase.gov/catalog/item/56d720ece4b015c306f442d5) with individual items for each state/territory and year of reporting.

The ScienceBase repository forms the start and foundation of the data management process for the SGCN lists. Items contain original source files and final files for processing, along with some artifacts that represent any work that was needed to smooth out the rough edges of submitted text files (e.g., OpenRefine projects). Final source files are all titled "Process File" to mark them as the files to process. Each source item also carries a "Date Collected" year and a place tag with the state name, which are the key pieces of classification metadata.

This first step in the workflow runs a function from the pysgcn package (get_processable_items), which takes a raw list of all items in the ScienceBase collection, checks them against a processing log (still to be developed on infrastructure), and returns a simplified data packet for each item that is ready for processing. These packets are what go onto a message queue.
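The selection logic described above can be sketched roughly as follows. This is not the pysgcn implementation: the function name select_processable, the flat field names on the raw items, and the shape of the processing log (a set of item ID and file date pairs) are all assumptions made for illustration.

```python
def select_processable(raw_items, processed_log):
    """Return simplified packets for items that are not yet in the processing log.

    raw_items: list of dicts with hypothetical keys (id, state, year, files).
    processed_log: set of (item_id, file_date) tuples already handled (assumed shape).
    """
    packets = []
    for item in raw_items:
        # Only the file titled "Process File" is meant to be processed.
        process_file = next(
            (f for f in item.get("files", []) if f.get("title") == "Process File"),
            None,
        )
        if process_file is None:
            continue  # nothing to process on this item

        # Skip anything already recorded for this item and file date.
        if (item["id"], process_file["date"]) in processed_log:
            continue

        packets.append({
            "sciencebase_item_id": item["id"],
            "state": item["state"],
            "year": item["year"],
            "source_file_url": process_file["url"],
            "source_file_date": process_file["date"],
        })
    return packets


# Made-up inputs, purely to exercise the sketch.
raw = [
    {"id": "a1", "state": "Connecticut", "year": "2015",
     "files": [{"title": "Process File", "url": "https://example.org/a1", "date": "2017-09-18"}]},
    {"id": "b2", "state": "Ohio", "year": "2005",
     "files": [{"title": "Original Report", "url": "https://example.org/b2", "date": "2016-01-01"}]},
    {"id": "c3", "state": "Utah", "year": "2015",
     "files": [{"title": "Process File", "url": "https://example.org/c3", "date": "2017-01-01"}]},
]
log = {("c3", "2017-01-01")}
result = select_processable(raw, log)
```

Keying the log on both the item and the file date means a re-uploaded process file would be picked up again, which seems like the behavior a processing log would want, but that is a design guess.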

In this workflow step, I use a temporary stand-in for the message queue (functions that work with a cached SQLite database) to show what should happen on infrastructure.

In [1]:
import random

import pysgcn
sgcn = pysgcn.sgcn.Sgcn()

In [2]:
%%time
processable_items = sgcn.get_processable_items()

CPU times: user 38.4 ms, sys: 14 ms, total: 52.5 ms
Wall time: 375 ms


This is an example of what the message body looks like for these items. It contains the source ScienceBase item identifier (given as the item URL), the state and year (both checked later against the contents of the file itself), the source file URL, and the source file date. These are all the pieces of information from the source items that are needed at this point to process an item. Once these messages are on a queue, each one should be processable independently by a lambda that handles the file contents.

In [3]:
random.choice(processable_items)

{'sciencebase_item_id': 'https://www.sciencebase.gov/catalog/item/59bffe09e4b091459a5e09bf',
 'state': 'Connecticut',
 'year': '2015',
 'source_file_url': 'https://www.sciencebase.gov/catalog/file/get/59bffe09e4b091459a5e09bf?f=__disk__01%2F35%2Fbd%2F0135bd12fc8edfb598061262f60b503dbf817065',
 'source_file_date': '2017-09-18T17:09:55.000Z'}
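As a rough sketch of how a downstream consumer might unpack a message like this one, the following checks that the expected keys are present and converts the string fields into typed values. The function name parse_message and the output key item_id are hypothetical; only the input keys and the sample values come from the message shown above.

```python
from datetime import datetime, timezone

# Keys taken from the example message body above.
REQUIRED_KEYS = {
    "sciencebase_item_id",
    "state",
    "year",
    "source_file_url",
    "source_file_date",
}


def parse_message(message):
    """Validate a message body and return typed values (hypothetical helper)."""
    missing = REQUIRED_KEYS - set(message)
    if missing:
        raise ValueError(f"message is missing keys: {sorted(missing)}")

    # The timestamp ends in a literal "Z" (UTC), so parse it and attach UTC.
    file_date = datetime.strptime(
        message["source_file_date"], "%Y-%m-%dT%H:%M:%S.%fZ"
    ).replace(tzinfo=timezone.utc)

    return {
        # The bare identifier is the last path segment of the item URL.
        "item_id": message["sciencebase_item_id"].rstrip("/").rsplit("/", 1)[-1],
        "state": message["state"],
        "year": int(message["year"]),
        "source_file_url": message["source_file_url"],
        "source_file_date": file_date,
    }


example = {
    "sciencebase_item_id": "https://www.sciencebase.gov/catalog/item/59bffe09e4b091459a5e09bf",
    "state": "Connecticut",
    "year": "2015",
    "source_file_url": "https://www.sciencebase.gov/catalog/file/get/59bffe09e4b091459a5e09bf?f=__disk__01%2F35%2Fbd%2F0135bd12fc8edfb598061262f60b503dbf817065",
    "source_file_date": "2017-09-18T17:09:55.000Z",
}
parsed = parse_message(example)
```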

For local processing, I use a local data caching path and SQLite databases for message queues and data storage. This code block puts one message on a queue for each SGCN source item to be processed. This will be replaced with an actual message queue approach on infrastructure. The identifiers here are hashes of the message contents, generated using the handy hash_id option in the sqlite_utils package I'm using.

In [4]:
for item in processable_items:
    sgcn.queue_message("mq_sgcn_items", item)
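To illustrate the idea of a SQLite-backed queue whose identifiers are hashes of the message contents, here is a minimal sketch using only the standard library. It is not the pysgcn queue_message method and does not reproduce the exact hashing that sqlite_utils performs; the table layout, the SHA-1 choice, and the helper names are assumptions for illustration.

```python
import hashlib
import json
import sqlite3


def message_id(body):
    """Hash a canonical JSON form of the body so equal content gives equal IDs."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def queue_message(conn, queue_name, body):
    """Insert a message into a queue table, ignoring exact duplicates."""
    # queue_name is interpolated into SQL, so it must come from trusted code.
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{queue_name}" '
        "(id TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    msg_id = message_id(body)
    conn.execute(
        f'INSERT OR IGNORE INTO "{queue_name}" (id, body) VALUES (?, ?)',
        (msg_id, json.dumps(body, sort_keys=True)),
    )
    conn.commit()
    return msg_id


conn = sqlite3.connect(":memory:")  # a file path would be used for a real cache
item = {"state": "Connecticut", "year": "2015"}

first_id = queue_message(conn, "mq_sgcn_items", item)
# Same content with the keys in a different order hashes to the same ID.
second_id = queue_message(conn, "mq_sgcn_items", {"year": "2015", "state": "Connecticut"})
row_count = conn.execute('SELECT COUNT(*) FROM "mq_sgcn_items"').fetchone()[0]
```

Because the ID is derived from the content, queuing the same item twice leaves a single row, so rerunning this step does not pile up duplicate messages.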