# SGCN National List
The full [national list](https://www1.usgs.gov/csas/swap/national_list.html) of SGCN species across 2005 and 2015 is produced by a relatively complex query that has to count the states reporting each species in each year.
## UPDATE
The whole SGCN system has been completely reengineered, but I tried to keep the final output close to what the SWAP app has been built against so far. The sgcn_nationallist view and ElasticSearch index should be identical in structure to what they were before, but the underlying data are all new. Here are some caveats:

* The query starts from the standpoint of the Taxonomic Information Registry joining to the SGCN table on the submitted/registered name.
* Common Name comes from the ITIS vernacular English name if it exists or else uses the submitted common name from one of the states.
* The taxonomic group still comes from what the states originally submitted, so it is blank for some entries. This will be improved once Abby provides a mapping from ITIS taxonomic levels to some logical grouping that we want to put the national list into.
* The underlying data from the states are also all new. I built a new process that reads directly from the source data repository in ScienceBase and processes source files into records in the new sgcn.sgcn table (new sgcn schema in the GC2 instance). Those records are then run through a different method of checking taxonomy against name authorities. Currently, the final data include only the most solid ITIS matches. WoRMS taxonomic checks have not yet been completed to fill in some of the blanks, and the ITIS matching algorithm can be improved to find additional matches. I took a fairly conservative approach to matching, so additional matches will likely be found in the future to expand the "SGCN National List."
* The taxonomic authority ID is concatenated to a URL string that can serve as a usable link for machine access if we want to put more ITIS information together. This ID is always the final accepted TSN for the name matched to ITIS and from which we draw taxonomy, common names, and other properties.
* This query will have to be redone once I get the WoRMS matching service running and we need to account for taxonomic matches on more than one authority.
* The new query adds taxonomic rank to the data. This indicates the level of name match that we made when building the national list from taxonomic authorities. We will want to add this in as a search facet to the SWAP app where we have stakeholders interested in filtering down to just those cases where species or subspecies were identified.
* I added the list of states as a string to the sgcn_nationallist view and index to accommodate the need to display this information behind the numbers. Note that I made this a simple comma-separated list with no spaces. In the current SWAP app, the states are displayed with a mouseover function on the number, so the app would need to add a space after each comma if desired. Personally, I feel that this could be a more robust feature if a click event were used to display the state list as a set of clickable items that link to each state page. This could be done in the table by expanding the given row and showing the states in a bulleted list.
* I changed the query that builds the National List to use a new taxonomic group designation from the TIR that is built from code to align submitted taxonomic groups to a more simplified set of values. This will ultimately be changed again to map authority taxonomy to the logical groups we want to present, but the data structure will remain the same.
* Data have now been built out on the gc2.datadistillery.org instance with improved processes that have picked up some additional taxa for the National List.
* The query has been changed again to union on three different methods - ITIS match, WoRMS match, and legacy annotation indicating that certain taxa were explicitly added to the 2005 SWAP list. This is done by unioning three different SELECT statements that get to the common set of attributes in different ways. There's probably a more efficient way to build that query with conditional logic, but the slow view gets piped to a fast ElasticSearch index for use anyway.
* There is a new aggregation possibility on "matchmethod," allowing for an ability to filter on how taxa ended up on the National List. We need to fiddle with this some more as ElasticSearch is analyzing these as different words. I need to work on the settings in GC2 ElasticSearch or the query itself.
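
The comma-separated state list described above can be cleaned up for display with a few lines of Python. This is only a sketch: the function name `format_state_list` is hypothetical, and dropping repeated state names is my assumption about what the UI would want (the sample output further down shows "West Virginia" twice for one taxon). It does not change the counts stored in the index.

```python
def format_state_list(state_list):
    """Turn a raw statelist string into a display string.

    Splits on commas, trims whitespace, drops repeated names while
    keeping the original order, and joins with a comma and a space.
    Dropping repeats is an assumption, not something the view does.
    """
    if not state_list:
        return ""
    seen = []
    for name in state_list.split(","):
        name = name.strip()
        if name and name not in seen:
            seen.append(name)
    return ", ".join(seen)


# Raw value taken from the Arrhopalites row in the sample output
print(format_state_list("Tennessee,Maryland,West Virginia,West Virginia"))
```

The same list of names (before joining) could feed the clickable, per-state items suggested in the bullet above.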

In [1]:
import requests
from IPython.display import display

In [2]:
# Class that renders a list of rows as an HTML table in the notebook
class ListTable(list):
    def _repr_html_(self):
        html = ["<table>"]
        for row in self:
            html.append("<tr>")
            
            for col in row:
                html.append("<td>{0}</td>".format(col))
            
            html.append("</tr>")
        html.append("</table>")
        return ''.join(html)

This query returns results from the ElasticSearch index for the sgcn_nationallist view. It requests only the first 25 results, so the SWAP online app will need to paginate. I included the taxonomic authority ID as a reference. The ITIS or WoRMS URLs built from those IDs return a machine-readable response and do not support content negotiation, so if we want to include them in the UI, we would need to translate the ID into a link meant for humans.

In [3]:
sgcnNationalListURL = "https://gc2.datadistillery.org/api/v1/elasticsearch/search/bcb/sgcn/sgcn_nationallist?size=25"
sgcnNationalList = requests.get(sgcnNationalListURL).json()

tableNationalList = ListTable()
tableNationalList.append(['Scientific Name', 'Common Name', '2005', '2005 State List', '2015', '2015 State List', 'Taxonomic Group', 'Taxonomic Rank', 'Match Method', 'Taxonomic Authority ID/Link'])

# Property names in the same order as the header row above
fields = ['scientificname', 'commonname', 'sgcn2005', 'statelist_2005', 'sgcn2015', 'statelist_2015', 'taxonomicgroup', 'taxonomicrank', 'matchmethod', 'taxonomicauthorityid']

for hit in sgcnNationalList['hits']['hits']:
    properties = hit['_source']['properties']
    tableNationalList.append([properties[field] for field in fields])

display(tableNationalList)

| Scientific Name | Common Name | 2005 | 2005 State List | 2015 | 2015 State List | Taxonomic Group | Taxonomic Rank | Match Method | Taxonomic Authority ID/Link |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cyanea konahuanuiensis | No Common Name | 0 |  | 1 | Hawaii | Plants | Species | Exact Match | http://services.itis.gov/?q=tsn:1000352 |
| Elymus churchii | Church's wild rye | 0 |  | 1 | Missouri | Plants | Species | Exact Match | http://services.itis.gov/?q=tsn:1000353 |
| Neanura | Swamp River Cave Neanura | 1 | Tennessee | 1 | Tennessee | Insects | Genus | Exact Match | http://services.itis.gov/?q=tsn:100226 |
| Arrhopalites | (a cave obligate springtail) | 4 | Tennessee,Maryland,West Virginia,West Virginia | 2 | Maryland,Tennessee | Other Invertebrates | Genus | Exact Match | http://services.itis.gov/?q=tsn:100443 |
| Rhithrogena anomala | A mayfly | 2 | Virginia,New York | 1 | Virginia | Insects | Species | Exact Match | http://services.itis.gov/?q=tsn:100578 |
| Rhithrogena jejuna | no common name | 1 | Wisconsin | 0 |  | Insects | Species | Exact Match | http://services.itis.gov/?q=tsn:100587 |
| Heptagenia flavescens | A Mayfly | 0 |  | 1 | Florida | Insects | Species | Exact Match | http://services.itis.gov/?q=tsn:100610 |
| Heptagenia julia | A mayfly | 1 | New York | 0 |  | Insects | Species | Exact Match | http://services.itis.gov/?q=tsn:100612 |
| Leucrocuta maculipennis | no common name | 1 | Wisconsin | 0 |  | Insects | Species | Exact Match | http://services.itis.gov/?q=tsn:100679 |
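
Since the request above asks for only 25 results, here is a minimal sketch of how the SWAP app might build URLs for later pages. The `size` parameter is the one already used in the query. The `from` offset is standard in ElasticSearch URI searches, but I am only assuming that the GC2 API passes it through, so this needs to be checked against the live service. The names `BASE_URL` and `page_url` are made up for this example.

```python
from urllib.parse import urlencode

BASE_URL = "https://gc2.datadistillery.org/api/v1/elasticsearch/search/bcb/sgcn/sgcn_nationallist"


def page_url(page, page_size=25):
    """Build the search URL for a zero-based page number.

    Assumes the GC2 endpoint honors the ElasticSearch 'from' offset
    alongside 'size'; that has not been verified here.
    """
    params = {"size": page_size, "from": page * page_size}
    return BASE_URL + "?" + urlencode(params)


# First three pages of 25 records each
for page in range(3):
    print(page_url(page))
```

If the offset is honored, `requests.get(page_url(1)).json()` would return records 26 through 50 in the same structure as the first page.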


## Aggregations (facets)
The ES index for the national list is set up to support aggregations on taxonomicrank, taxonomicgroup, and matchmethod for faceted searching in the system. The aggregations are added to the query DSL using the following:
```json
{
  "aggs": {
    "taxrank": {
      "terms": {
        "field": "properties.taxonomicrank"
      }
    },
    "taxgroup": {
      "terms": {
        "field": "properties.taxonomicgroup"
      }
    },
    "matchmethod": {
      "terms": {
        "field": "properties.matchmethod"
      }
    }
  }
}
```
See the [ElasticSearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html) on aggregations for more details.
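
Rather than hand-encoding that JSON into a URL, the aggregation body can be built from a dictionary and percent-encoded with the standard library. This is a sketch only: `terms_aggs`, `aggs_url`, `SEARCH_URL`, and `FACETS` are names made up for the example, and the only thing assumed about the GC2 API is what the notebook already relies on, namely that the query DSL is passed in the `q` parameter.

```python
import json
from urllib.parse import quote

SEARCH_URL = "https://gc2.datadistillery.org/api/v1/elasticsearch/search/bcb/sgcn/sgcn_nationallist"

# Aggregation name -> indexed field, matching the JSON shown above
FACETS = {
    "taxrank": "properties.taxonomicrank",
    "taxgroup": "properties.taxonomicgroup",
    "matchmethod": "properties.matchmethod",
}


def terms_aggs(fields):
    """Build an 'aggs' body with one terms aggregation per named field."""
    return {
        "aggs": {
            name: {"terms": {"field": field}}
            for name, field in fields.items()
        }
    }


def aggs_url(fields):
    """Serialize the aggregation body compactly and percent-encode it into 'q'."""
    body = json.dumps(terms_aggs(fields), separators=(",", ":"))
    return SEARCH_URL + "?q=" + quote(body, safe="")


print(aggs_url(FACETS))
```

Adding or removing a facet then only means editing the `FACETS` dictionary.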

There is currently a problem with the aggregations caused by the ElasticSearch analysis settings described [here](https://www.elastic.co/guide/en/elasticsearch/guide/current/aggregations-and-analysis.html). I haven't yet been able to figure out how to translate what the ES API allows for these settings into the UI that GC2 provides. As a result, the multi-word string fields, Taxonomic Group and Match Method, are analyzed into individual words instead of being treated as whole strings.

In [4]:
queryWithAggs = "https://gc2.datadistillery.org/api/v1/elasticsearch/search/bcb/sgcn/sgcn_nationallist?q={%22aggs%22:{%22taxrank%22:{%22terms%22:{%22field%22:%22properties.taxonomicrank%22}},%22taxgroup%22:{%22terms%22:{%22field%22:%22properties.taxonomicgroup%22}},%22matchmethod%22:{%22terms%22:{%22field%22: %22properties.matchmethod%22}}}}"
rAggs = requests.get(queryWithAggs).json()

print("Taxonomic Rank")
for bucket in rAggs["aggregations"]["taxrank"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
print("----")
print("Taxonomic Group")
for bucket in rAggs["aggregations"]["taxgroup"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
print("----")
print("Match Method")
for bucket in rAggs["aggregations"]["matchmethod"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])


Taxonomic Rank
species 12493
subspecies 1153
variety 344
genus 253
family 195
order 26
subfamily 5
class 4
phylum 3
suborder 3
----
Taxonomic Group
plants 3613
insects 3246
other 1710
fish 1501
mollusks 1400
invertebrates 1091
birds 955
mammals 561
reptiles 481
unknown 435
----
Match Method
match 13322
exact 12953
accepted 1159
followed 1159
tsn 1090
fuzzy 369
aphiaid 69
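
The buckets above show the analysis problem: "exact", "fuzzy", and "match" are counted as separate words. Until the index settings are sorted out, one possible stopgap is to count whole string values client-side from the `_source` of the returned hits. This is a different technique from an ElasticSearch aggregation and only covers the records actually fetched. The function name `count_whole_values` is hypothetical, and the sample hits below are made-up records that mimic the structure of the real response ("Fuzzy Match" is inferred from the tokens above, not confirmed).

```python
from collections import Counter


def count_whole_values(hits, field):
    """Count full, unanalyzed values of one property across search hits."""
    return Counter(
        hit["_source"]["properties"].get(field) or "(blank)"
        for hit in hits
    )


# Illustrative records only, shaped like sgcnNationalList['hits']['hits']
sample_hits = [
    {"_source": {"properties": {"matchmethod": "Exact Match"}}},
    {"_source": {"properties": {"matchmethod": "Exact Match"}}},
    {"_source": {"properties": {"matchmethod": "Fuzzy Match"}}},
    {"_source": {"properties": {"matchmethod": None}}},
]

for value, count in count_whole_values(sample_hits, "matchmethod").most_common():
    print(value, count)
```

The same function would work for `taxonomicgroup`, where values such as "Other Invertebrates" are currently split into separate buckets.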
