# SGCN National List

The full [national list](https://www1.usgs.gov/csas/swap/national_list.html) of SGCN species across 2005 and 2015 takes a relatively complex query, because it has to count the states reporting each species in each year. There may be some way to drive everything from a feature of the Elasticsearch index on the full original data that I haven't figured out yet, but so far I have only been able to do it with the following SQL statement.
```sql
SELECT concat('http://services.itis.gov/?q=tsn:', t.itis->'tsn') AS taxonomicauthorityid,
(array_agg(t.itis->'nameWInd'))[1] AS scientificname,
(array_agg(Coalesce(t.itis->'vernacular:English',s.commonname_submitted)))[1] AS commonname,
(array_agg(t.itis->'rank'))[1] AS taxonomicrank,
(array_agg(s.taxonomicgroup_submitted ORDER BY sgcn_year ASC))[1] AS taxonomicgroup,
array_to_string(array_agg(CASE WHEN s.sgcn_year=2005 THEN s.sgcn_state ELSE NULL END), ',') statelist_2005,
array_to_string(array_agg(CASE WHEN s.sgcn_year=2015 THEN s.sgcn_state ELSE NULL END), ',') statelist_2015,
sum(((s.sgcn_year = 2005))::integer) AS sgcn2005,
sum(((s.sgcn_year = 2015))::integer) AS sgcn2015
FROM sgcn.sgcn s
JOIN tir.tir2 t ON
t.registration->'SGCN_ScientificName_Submitted' = s.scientificname_submitted
WHERE t.itis->'itisMatchMethod' NOT LIKE 'NotMatched%'
GROUP BY t.itis->'tsn'
```  
Running that live is far too costly on the system, so I built a view in GC2 from this select statement and indexed it in Elasticsearch as sgcn_nationallist, which makes queries much more responsive. The view includes only records with an accepted taxonomic authority ID (the WHERE clause drops anything whose ITIS match method starts with NotMatched), which is the basic definition of what ends up on the national list.
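
To make the grouping logic of the SQL easier to follow, here is a rough Python sketch of the same per-TSN rollup on in-memory rows. The function name `summarize_national_list` and the `(tsn, state, year)` tuple layout are my own assumptions for illustration; the real work happens in the database view. The sample rows mirror two entries from the results shown later in this notebook.

```python
from collections import defaultdict


def summarize_national_list(records):
    """Roll up (tsn, state, year) rows into per-TSN state lists and counts.

    Hypothetical helper that mirrors the GROUP BY in the SQL: one entry per
    TSN, a comma-separated state list per year, and a count per year.
    """
    states = defaultdict(lambda: {2005: [], 2015: []})
    for tsn, state, year in records:
        by_year = states[tsn]      # every TSN gets a group, as in GROUP BY
        if year in by_year:        # years other than 2005/2015 are ignored
            by_year[year].append(state)

    summary = {}
    for tsn, by_year in states.items():
        summary[tsn] = {
            "statelist_2005": ",".join(by_year[2005]),
            "statelist_2015": ",".join(by_year[2015]),
            "sgcn2005": len(by_year[2005]),
            "sgcn2015": len(by_year[2015]),
        }
    return summary


# Sample rows copied from the notebook output (TSNs 100623 and 100564)
rows = [
    ("100623", "Pennsylvania", 2005),
    ("100623", "New York", 2005),
    ("100623", "Pennsylvania", 2015),
    ("100564", "Washington", 2015),
]
for tsn, entry in summarize_national_list(rows).items():
    print(tsn, entry)
```

Like the SQL, this keeps a species that was reported in only one year and gives it a count of 0 and an empty state list for the other year.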

## UPDATE
The whole SGCN system has been completely reengineered, but I tried to keep the final output close to what the SWAP app has been built against so far. The sgcn_nationallist view and Elasticsearch index should be identical to what they were before, but the underlying data are all new. Here are some caveats:

* The query starts from the standpoint of the Taxonomic Information Registry joining to the SGCN table on the submitted/registered name.
* Common Name comes from the ITIS vernacular English name if it exists or else uses the submitted common name from one of the states.
* The taxonomic group still comes from what the states originally submitted, so it is blank for some entries. This will be improved once Abby provides a mapping from ITIS taxonomic levels to some logical grouping that we want to put the national list into.
* The underlying data from the states are also all new here. I built a whole new process that reads directly from the source data repository in ScienceBase and processes source files into records in the new sgcn.sgcn table (new sgcn schema in the GC2 instance). Those are then processed using a different method of checking taxonomy against name authorities. Currently, the final data only include the most solid matches on ITIS. WoRMS taxonomic checks have not been completed to fill in some of the blanks, and the ITIS matching algorithm can be improved to find additional matches. I took a fairly conservative approach to the matching process, so additional matches will likely be found in the future to expand the "SGCN National List."
* The taxonomic authority ID is concatenated to a URL string that can serve as a usable link for machine access if we want to put more ITIS information together. This ID is always the final accepted TSN for the name matched to ITIS and from which we draw taxonomy, common names, and other properties.
* This query will have to be redone once I get the WoRMS matching service running and we need to account for taxonomic matches on more than one authority.
* The new query adds taxonomic rank to the data. This indicates the level of name match that we made when building the national list from taxonomic authorities. We will want to add this in as a search facet to the SWAP app where we have stakeholders interested in filtering down to just those cases where species or subspecies were identified.
* I added the list of states as a string to the sgcn_nationallist view and index so that this information can be displayed behind the numbers. Note that I made this a simple comma-separated list with no spaces. In the current SWAP app, the states are displayed with a mouseover function on the number, so the app would need to add a space after each comma if desired. Personally, I feel that this could be a more robust feature if a click event were used to display the state list as a set of clickable items that link to each state page. This could be done in the table by expanding the given row and showing the states in a bulleted list.
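
As a small illustration of the state list point above, here is a sketch of how a client could turn the comma-separated string into either a nicely spaced label or a list of individual states (for example, to build one link per state). The helper names `split_state_list` and `format_state_list` are made up for this example; splitting on commas assumes no state name contains one.

```python
def split_state_list(statelist):
    """Split a value like 'Pennsylvania,New York' into a list of state names.

    An empty or missing value gives an empty list.
    """
    return [name.strip() for name in (statelist or "").split(",") if name.strip()]


def format_state_list(statelist):
    """Rejoin the states with a space after each comma for display."""
    return ", ".join(split_state_list(statelist))


print(split_state_list("Pennsylvania,New York"))   # ['Pennsylvania', 'New York']
print(format_state_list("Pennsylvania,New York"))  # Pennsylvania, New York
print(split_state_list(""))                        # []
```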

In [1]:
import requests
from IPython.display import display

In [2]:
# Class that renders a list of rows as an HTML table in the notebook
class ListTable(list):
    def _repr_html_(self):
        html = ["<table>"]
        for row in self:
            html.append("<tr>")
            
            for col in row:
                html.append("<td>{0}</td>".format(col))
            
            html.append("</tr>")
        html.append("</table>")
        return ''.join(html)

This query returns results from the Elasticsearch index for the sgcn_nationallist view. It requests only the first 25 results, so the SWAP online app will need to paginate. I included the taxonomic authority ID as a reference. Those ITIS or WoRMS IDs return a machine-readable response and do not support content negotiation, so if we want to include them in the UI, we would need to translate the ID into something meant for humans.

In [3]:
sgcnNationalListURL = "https://gc2.mapcentia.com/api/v1/elasticsearch/search/bcb/sgcn/sgcn_nationallist?size=25"
sgcnNationalList = requests.get(sgcnNationalListURL).json()

tableNationalList = ListTable()
tableNationalList.append(['Scientific Name', 'Common Name', '2005', '2005 State List', '2015', '2015 State List', 'Taxonomic Group', 'Taxonomic Rank', 'Taxonomic Authority ID/Link'])

# Property names in the same order as the header row above
fields = ['scientificname', 'commonname', 'sgcn2005', 'statelist_2005', 'sgcn2015', 'statelist_2015', 'taxonomicgroup', 'taxonomicrank', 'taxonomicauthorityid']

for hit in sgcnNationalList['hits']['hits']:
    props = hit['_source']['properties']
    tableNationalList.append([props[field] for field in fields])

display(tableNationalList)

| Scientific Name | Common Name | 2005 | 2005 State List | 2015 | 2015 State List | Taxonomic Group | Taxonomic Rank | Taxonomic Authority ID/Link |
|---|---|---|---|---|---|---|---|---|
| Cyanea konahuanuiensis | No Common Name | 0 |  | 1 | Hawaii | Plants | Species | http://services.itis.gov/?q=tsn:1000352 |
| Neanura | Swamp River Cave Neanura | 1 | Tennessee | 0 |  |  | Genus | http://services.itis.gov/?q=tsn:100226 |
| Sminthuridae |  | 0 |  | 1 | Florida | Insects | Family | http://services.itis.gov/?q=tsn:100258 |
| Acalypta susanae | Lace Bug | 1 | Arkansas | 1 | Arkansas |  | Species | http://services.itis.gov/?q=tsn:1003617 |
| Ephemeroptera | mayflies | 1 | Alaska | 0 |  | Insects | Order | http://services.itis.gov/?q=tsn:100502 |
| Cinygmula gartrelli | A mayfly | 0 |  | 1 | Washington | Insects | Species | http://services.itis.gov/?q=tsn:100564 |
| Rhithrogena flavianula | Flathead Mayfly | 0 |  | 1 | Colorado | Insects | Species | http://services.itis.gov/?q=tsn:100574 |
| Rhithrogena impersonata | no common name | 1 | Wisconsin | 0 |  |  | Species | http://services.itis.gov/?q=tsn:100586 |
| Heptagenia culacantha | A mayfly | 2 | Pennsylvania,New York | 1 | Pennsylvania | Insects | Species | http://services.itis.gov/?q=tsn:100623 |
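
Since the request above only asks for the first 25 results, here is a minimal sketch of how page URLs could be built for the app. `size` and `from` are standard Elasticsearch URI search parameters, but I have not confirmed that the GC2 endpoint passes `from` through, so treat that as an assumption. The helper name `page_url` is made up for this example.

```python
from urllib.parse import urlencode

BASE_URL = "https://gc2.mapcentia.com/api/v1/elasticsearch/search/bcb/sgcn/sgcn_nationallist"


def page_url(page, page_size=25):
    """Build the search URL for a zero-based page number.

    Assumes the endpoint honors the Elasticsearch 'from' offset parameter.
    """
    query = urlencode({"size": page_size, "from": page * page_size})
    return "{}?{}".format(BASE_URL, query)


# Second page of 25 results (records 26 through 50)
print(page_url(1))
```

Each URL can then be fetched the same way as in the cell above, for example `requests.get(page_url(1)).json()`.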


## Aggregations (facets)
The ES index for the national list is set up to support aggregations on taxonomicrank and taxonomicgroup for faceted searching in the system. The aggregations are added to the query DSL using the following:
```json
{
  "aggs": {
    "taxrank": {
      "terms": {
        "field": "properties.taxonomicrank"
      }
    },
    "taxgroup": {
      "terms": {
        "field": "properties.taxonomicgroup"
      }
    }
  }
}
```
See the [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html) on aggregations for more details.

In [4]:
queryWithAggs = "https://gc2.mapcentia.com/api/v1/elasticsearch/search/bcb/sgcn/sgcn_nationallist?q={%22aggs%22:%20{%22taxrank%22:%20{%22terms%22:%20{%22field%22:%20%22properties.taxonomicrank%22}},%22taxgroup%22:%20{%22terms%22:%20{%22field%22:%20%22properties.taxonomicgroup%22}}}}"
rAggs = requests.get(queryWithAggs).json()

print("Taxonomic Rank")
for bucket in rAggs["aggregations"]["taxrank"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
print("----")
print("Taxonomic Group")
for bucket in rAggs["aggregations"]["taxgroup"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])


Taxonomic Rank
species 12170
subspecies 1136
variety 346
genus 243
family 190
order 24
class 3
phylum 3
subfamily 3
suborder 3
----
Taxonomic Group
insects 2082
plants 1970
vascular 898
fish 767
birds 586
gastropods 378
mammals 325
crustaceans 312
reptiles 263
plant 202
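
The hand-encoded URL in the cell above could also be generated from a Python dict, which is easier to edit than a percent-encoded string. The sketch below is one way to do that; the helper name `build_aggs_url` is made up, and passing the query body through the `q` parameter simply follows the URL used above. Both result lists stop at ten entries, which is consistent with the default `size` of 10 for a terms aggregation, so the sketch exposes `size` in case more buckets are wanted.

```python
import json
from urllib.parse import quote, unquote

BASE_URL = "https://gc2.mapcentia.com/api/v1/elasticsearch/search/bcb/sgcn/sgcn_nationallist"


def build_aggs_url(fields, size=10):
    """Build a search URL carrying one terms aggregation per entry in fields.

    fields maps an aggregation name to the indexed field it buckets on.
    """
    aggs = {
        name: {"terms": {"field": field, "size": size}}
        for name, field in fields.items()
    }
    body = json.dumps({"aggs": aggs})
    # Percent-encode the JSON so it is safe to place in the query string
    return "{}?q={}".format(BASE_URL, quote(body))


url = build_aggs_url(
    {
        "taxrank": "properties.taxonomicrank",
        "taxgroup": "properties.taxonomicgroup",
    },
    size=50,
)
# Decode the query string again to check what will be sent
print(unquote(url.split("?q=", 1)[1]))
```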
