In [1]:
import configparser
import io

import pandas as pd
import requests
from IPython.display import display

In [2]:
# Get API keys and any other config details from a file that is external to the code.
config = configparser.RawConfigParser()
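# Use a context manager so the config file handle is closed after reading.
with open(r'../config/stuff.py') as configFile:
    config.read_file(configFile)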

In [3]:
# Build base URL with API key using input from the external config.
def getBaseURL():
    gc2APIKey = config.get('apiKeys','apiKey_GC2_BCB').replace('"','')
    apiBaseURL = "https://gc2.mapcentia.com/api/v1/sql/bcb?key="+gc2APIKey
    return apiBaseURL

Because I was running into encoding errors in the SWAP/SGCN data I originally brought into the GC2 system, and because I've figured out a bunch of stuff since then, I am building out a process here that reads data from the original files into a new sgcn table in the sgcn schema in our GC2 instance. This lets us re-run the entire thing as necessary from source data in ScienceBase and should set us up for processing new data with some checks on time or existing content.

The process below is specific to the 2005 SWAP files. First, I found some issues in the existing files and had to clean up a few problems in the sources. Ten of the text files had no header row. The files were also in different original encodings, so I had to put in exception handling to deal with that (text encodings are a pain in the butt!). It seemed like the easiest way to validate the text files, tee them up, and process them into the SGCN table was to read them into Pandas dataframes and then iterate over the rows, inserting each one into GC2 via the SQL API. There is probably a better bulk data load method we can figure out eventually, but this lets us validate everything along the way.
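As a rough illustration of the encoding and header handling described above, here is a minimal sketch. The function name `read_state_list` and the column labels "Common Name" and "Taxonomic Group" are assumptions for illustration (only "Scientific Name" appears in the code below), and the sketch assumes the files are tab-delimited with the three fields in that order.

```python
import io

import pandas as pd

# Assumed column labels; only "Scientific Name" is confirmed by the notebook.
EXPECTED_COLUMNS = ["Scientific Name", "Common Name", "Taxonomic Group"]


def read_state_list(raw_bytes):
    """Decode a state species list and load it into a dataframe."""
    # Try UTF-8 first; ISO-8859-1 can decode any byte sequence,
    # so it works as a last-resort fallback.
    try:
        text = raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        text = raw_bytes.decode("iso-8859-1")

    # Some files have no header row, so check the first field of the first line.
    first_field = text.split("\n", 1)[0].split("\t")[0].strip()
    has_header = first_field == EXPECTED_COLUMNS[0]

    return pd.read_csv(
        io.StringIO(text),
        sep="\t",
        header=0 if has_header else None,  # header=0 replaces the file's own header
        names=EXPECTED_COLUMNS,
        usecols=[0, 1, 2],
        dtype=str,
        keep_default_na=False,  # keep empty cells as "" rather than NaN
    )
```

With `keep_default_na=False`, blank cells arrive as empty strings, which might make the float checks in the loading loop unnecessary, though that has not been tested against the real files.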

Putting all the data from the source files into a common data table means we can then select unique names from there for processing against taxonomic authorities and retrieving related data. We can then build that information back into the common data table with code.
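To make the "select unique names" step concrete, here is a small sketch that works on an in-memory dataframe standing in for the common table. The helper name `unique_submitted_names` is hypothetical, and the rows are illustrative placeholders, not real records from any state list.

```python
import pandas as pd


def unique_submitted_names(df, column="scientificname_submitted"):
    """Return the sorted, de-duplicated, non-empty names in the given column."""
    names = df[column].dropna().astype(str).str.strip()
    return sorted(names[names != ""].unique())


# Illustrative rows only: the same name submitted by two placeholder states,
# once with stray trailing whitespace, plus one missing name.
rows = pd.DataFrame({
    "sgcn_state": ["State A", "State B", "State B", "State B"],
    "scientificname_submitted": [
        "Acipenser oxyrhinchus desotoi",
        "Acipenser oxyrhinchus desotoi ",
        "Actinonaias ligamentina",
        None,
    ],
})

names = unique_submitted_names(rows)
```

Against the live table, a `SELECT DISTINCT scientificname_submitted FROM sgcn.sgcn` through the SQL API would presumably serve the same purpose, though it would not trim whitespace the way this sketch does.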

The ScienceBase query is pretty straightforward, but one thing we should do is add the year that a given data file is associated with, to make that explicit instead of relying on the file name to get the SGCN year. The query can be modified here if we need to process some specific part of the 2005 collection separately again.
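One way to keep the query easy to modify is to build it from a parameter dictionary rather than a single long string. This is only a sketch: the function name `build_sb_query` is hypothetical, while the parameter names and values are the ones used in the query in the next cell.

```python
from urllib.parse import urlencode

SB_ITEMS_URL = "https://www.sciencebase.gov/catalog/items"


def build_sb_query(year, parent_id="56d720ece4b015c306f442d5", max_items=100):
    """Build the ScienceBase catalog query URL for one SGCN year."""
    params = {
        "q": str(year),          # the year doubles as the search term
        "parentId": parent_id,   # parent item used in this notebook
        "format": "json",
        "fields": "title,files,tags",
        "max": max_items,
    }
    # urlencode percent-encodes the commas in "fields" as %2C.
    return SB_ITEMS_URL + "?" + urlencode(params)


sb_url = build_sb_query(2005)
```

Passing the year in as an argument means the same value could also be used for `sgcn_year` when loading rows, so the year stays explicit in one place.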

In [4]:
# Query ScienceBase for the 2005 states, returning the files structure along with tags (where we get state name)
sbQ = "https://www.sciencebase.gov/catalog/items?q=2005&parentId=56d720ece4b015c306f442d5&format=json&fields=title,files,tags&max=100"
sbR = requests.get(sbQ).json()

In [8]:
totalRecords = 0
sgcn_year = 2005

for item in sbR['items']:
    sgcn_state = item['tags'][0]['name']
    sourceid = "https://www.sciencebase.gov/catalog/item/"+item['id']
    for file in item['files']:
        if file['name'][-25:] == 'Species_Original_List.txt':
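            stateList = requests.get(file['url']).content
            # Source files come in different encodings: try UTF-8 first, then
            # fall back to ISO-8859-1, which can decode any byte sequence.
            # (Previously the ISO-8859-1 read always ran last and overwrote
            # a successful UTF-8 read.)
            try:
                stateListText = stateList.decode('utf-8')
            except UnicodeDecodeError:
                stateListText = stateList.decode('iso-8859-1')

            # Read the list as tab-delimited text.
            stateListPD = pd.read_csv(io.StringIO(stateListText), sep='\t')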

    for ir in stateListPD.itertuples():
        if type(ir[1]) is float:
            scientificname_submitted = ""
        else:
            scientificname_submitted = ir[1].replace("'","''")
        
        if scientificname_submitted == "Scientific Name":
            break
        
        if type(ir[2]) is float:
            commonname_submitted = ""
        else:
            commonname_submitted = ir[2].replace("'","''")

        taxonomicgroup_submitted = ir[3]

        try:
            q = "INSERT INTO sgcn.sgcn \
                (sourceid,sgcn_year,sgcn_state,scientificname_submitted,commonname_submitted,taxonomicgroup_submitted) \
                VALUES ('"+sourceid+"',"+str(sgcn_year)+",'"+sgcn_state+"','"+scientificname_submitted+"','"+commonname_submitted+"','"+taxonomicgroup_submitted+"')"
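            # Pass the SQL through params so requests URL-encodes characters
            # such as & and + that would otherwise break the query string.
            r = requests.get(getBaseURL(), params={'q': q}).json()
            print (r)
            totalRecords = totalRecords+1
        except Exception:
            # Show the row that failed so it can be checked against the source file.
            display (ir)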

print ("Total Records Processed: "+str(totalRecords))

https://gc2.mapcentia.com/api/v1/sql/bcb?key=1c95cdb240f82acedec84299103e6d4e&q=INSERT INTO sgcn.sgcn                 (sourceid,sgcn_year,sgcn_state,scientificname_submitted,commonname_submitted,taxonomicgroup_submitted)                 VALUES ('https://www.sciencebase.gov/catalog/item/5787cd0ae4b0d27deb3754f2',2005,'Louisiana','Acipenser oxyrhinchus desotoi','Gulf Sturgeon','Fish')
https://gc2.mapcentia.com/api/v1/sql/bcb?key=1c95cdb240f82acedec84299103e6d4e&q=INSERT INTO sgcn.sgcn                 (sourceid,sgcn_year,sgcn_state,scientificname_submitted,commonname_submitted,taxonomicgroup_submitted)                 VALUES ('https://www.sciencebase.gov/catalog/item/5787cd0ae4b0d27deb3754f2',2005,'Louisiana','Actinonaias ligamentina','Mucket','Bivalves')
https://gc2.mapcentia.com/api/v1/sql/bcb?key=1c95cdb240f82acedec84299103e6d4e&q=INSERT INTO sgcn.sgcn                 (sourceid,sgcn_year,sgcn_state,scientificname_submitted,commonname_submitted,taxonomicgroup_submitted)                 