This notebook shows the essential query for individual SGCN state lists from the structure created in the sgcn schema of the experimental GC2 instance. The whole system has been completely rebuilt from the source repository. The "Process SGCN repository source files.ipynb" notebook (and its corresponding py script) in this repo executes that process, starting with all the source files in the ScienceBase Repository.

We are also tweaking the design of the SWAP application while sticking with the overall philosophy that the state pages of the app always show exactly what the states submitted. The National List shows what we add to the process by aligning with taxonomic authorities and making judgments on how we group the information. To aid in this process and provide full transparency, we added a few properties to the sgcn.sgcn table so that each record traces back to its original source: sourcid (ScienceBase item URI/URL), along with sourcefileurl and sourcefilename (the actual file processed by the code to produce the data for a given state and year).

Data for the states can be pulled from a database view or its corresponding ElasticSearch index. The view uses the following SQL:
```sql
SELECT s.sgcn_state, s.scientificname_submitted AS scientificname,
(array_agg(s.commonname_submitted ORDER BY s.sgcn_year DESC))[1] AS commonname,
(array_agg(s.taxonomicgroup_submitted))[1] AS taxonomicgroup,
sum(((s.sgcn_year = 2005))::integer) AS sgcn2005,
sum(((s.sgcn_year = 2015))::integer) AS sgcn2015,
(array_agg(split_part(t.itis->'itisMatchMethod', ':', 1)))[1] AS itismatchmethod,
(array_agg(split_part(t.itis->'itisMatchMethod', ':', 2)))[1] AS itismatchmethod_searchstring,
(array_agg(t.itis->'cacheDate'))[1] AS itis_cachedate,
(array_agg(t.itis->'nameWInd'))[1] AS namewind,
(array_agg(t.itis->'tsn'))[1] AS tsn,
(array_agg(t.itis->'discoveredTSN'))[1] AS discoveredtsn,
(array_agg(t.itis->'acceptedTSN'))[1] AS acceptedtsn
FROM sgcn.sgcn s
JOIN tir.tir2 t ON
s.scientificname_submitted = t.registration->'SGCN_ScientificName_Submitted'
GROUP BY sgcn_state,scientificname_submitted
```  
This view makes the following choices:
* Group on the scientific name submitted by the state
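* Use the latest common name provided for a species (2015 over 2005). The taxonomic group is meant to follow the same rule, but its array_agg has no ORDER BY, so the value returned is not guaranteed to come from the latest year.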
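
The grouping choices above can be sketched in plain Python. This is only an illustration of the logic, not the code behind the view: the `summarize` function and the sample rows (including the species names) are hypothetical, and only the field names come from the SQL above.

```python
from collections import defaultdict

# Hypothetical rows shaped like sgcn.sgcn records; names and values are made up.
rows = [
    {"sgcn_state": "Wyoming", "scientificname_submitted": "Examplus primus",
     "commonname_submitted": "Old Common Name", "sgcn_year": 2005},
    {"sgcn_state": "Wyoming", "scientificname_submitted": "Examplus primus",
     "commonname_submitted": "New Common Name", "sgcn_year": 2015},
    {"sgcn_state": "Wyoming", "scientificname_submitted": "Examplus secundus",
     "commonname_submitted": "Second Species", "sgcn_year": 2015},
]


def summarize(rows):
    """Group by (state, submitted name) and keep the latest common name."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["sgcn_state"], row["scientificname_submitted"])
        groups[key].append(row)

    summary = []
    for (state, name), members in sorted(groups.items()):
        # Equivalent of array_agg(... ORDER BY sgcn_year DESC)[1]
        latest = max(members, key=lambda r: r["sgcn_year"])
        summary.append({
            "sgcn_state": state,
            "scientificname": name,
            "commonname": latest["commonname_submitted"],
            # Equivalent of sum((sgcn_year = 2005)::integer)
            "sgcn2005": sum(r["sgcn_year"] == 2005 for r in members),
            "sgcn2015": sum(r["sgcn_year"] == 2015 for r in members),
        })
    return summary


summary = summarize(rows)
for record in summary:
    print(record)
```

A name submitted in both years collapses into one record with both the sgcn2005 and sgcn2015 flags set, which is what the state pages need in order to show list membership by year.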

In [7]:
import requests
from IPython.display import display

In [8]:
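# Available states (DISTINCT already removes duplicates, so no GROUP BY is needed)
q = "SELECT DISTINCT sgcn_state FROM sgcn.sgcn"

# Pass the SQL through params so that requests URL-encodes it
r = requests.get("https://gc2.mapcentia.com/api/v1/sql/bcb", params={"q": q}).json()

display(r)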

{'_execution_time': 0.117,
 'auth_check': {'auth_level': 'Write',
  'checked_relations': ['sgcn.sgcn'],
  'session': None,
  'success': True},
 'features': [{'properties': {'sgcn_state': 'Alabama'}, 'type': 'Feature'},
  {'properties': {'sgcn_state': 'Indiana'}, 'type': 'Feature'},
  {'properties': {'sgcn_state': 'Minnesota'}, 'type': 'Feature'},
  {'properties': {'sgcn_state': 'South Carolina'}, 'type': 'Feature'},
  {'properties': {'sgcn_state': 'Louisiana'}, 'type': 'Feature'},
  {'properties': {'sgcn_state': 'California'}, 'type': 'Feature'},
  {'properties': {'sgcn_state': 'New Mexico'}, 'type': 'Feature'},
  {'properties': {'sgcn_state': 'New Hampshire'}, 'type': 'Feature'},
  {'properties': {'sgcn_state': 'American Samoa'}, 'type': 'Feature'},
  {'properties': {'sgcn_state': 'Connecticut'}, 'type': 'Feature'},
  {'properties': {'sgcn_state': 'Alaska'}, 'type': 'Feature'},
  {'properties': {'sgcn_state': 'Nevada'}, 'type': 'Feature'},
  {'properties': {'sgcn_state': 'Oklahoma'}, 

In [10]:
# ElasticSearch API query for a state, paginated with size (page length) and from (offset)
import json

stateName = 'Wyoming'

stateListUrl = "https://gc2.mapcentia.com/api/v1/elasticsearch/search/bcb/sgcn/sgcn_statelists"
stateListQuery = {"query": {"match": {"properties.sgcn_state": stateName}}}

# requests URL-encodes the JSON query; size=25 with from=50 skips the first 50 records
stateList = requests.get(stateListUrl, params={"q": json.dumps(stateListQuery), "size": 25, "from": 50}).json()

display(stateList)


{'_shards': {'failed': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': 'AVxabwk0UuPNezaKDbJa',
    '_index': 'bcb_sgcn_sgcn_statelists',
    '_score': 3.840837,
    '_source': {'properties': {'acceptedtsn': None,
      'commonname': 'Playa Lovegrass',
      'discoveredtsn': None,
      'itis_cachedate': '2017-05-12T17:27:07.153890',
      'itismatchmethod': 'ExactMatch',
      'itismatchmethod_searchstring': 'Eragrostis pilosa var. perplexa',
      'namewind': 'Eragrostis pilosa var. perplexa',
      'scientificname': 'Eragrostis pilosa var. perplexa',
      'sgcn2005': 0,
      'sgcn2015': 1,
      'sgcn_state': 'Nebraska',
      'taxonomicgroup': 'Plants',
      'tsn': '527901'},
     'type': 'Feature'},
    '_type': 'sgcn_statelists'},
   {'_id': 'AVxabwk0UuPNezaKDbJj',
    '_index': 'bcb_sgcn_sgcn_statelists',
    '_score': 3.840837,
    '_source': {'properties': {'acceptedtsn': None,
      'commonname': 'Short-ray Fleabane',
      'discoveredtsn': None,
      'itis_cac
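
To walk through a whole state list rather than a single page, the size/from parameters shown above can be stepped in a loop. The following is a minimal sketch under a few assumptions: `page_params` and `iter_state_records` are hypothetical helper names, and `fetch` is any callable that takes a URL and a params dict and returns the parsed JSON response (for example a thin wrapper around `requests.get`). The response shape (`hits.hits[]._source.properties`) is taken from the output above.

```python
import json

BASE_URL = "https://gc2.mapcentia.com/api/v1/elasticsearch/search/bcb/sgcn/sgcn_statelists"


def page_params(state_name, page, size=25):
    """Build the query parameters for one zero-based page of a state list."""
    query = {"query": {"match": {"properties.sgcn_state": state_name}}}
    return {"q": json.dumps(query), "size": size, "from": page * size}


def iter_state_records(state_name, fetch, size=25):
    """Yield the properties of every record, requesting pages until one is empty."""
    page = 0
    while True:
        response = fetch(BASE_URL, page_params(state_name, page, size))
        hits = response["hits"]["hits"]
        if not hits:
            break
        for hit in hits:
            yield hit["_source"]["properties"]
        page += 1
```

With requests, `fetch` could be `lambda url, params: requests.get(url, params=params).json()`. The cell above (size=25, from=50) corresponds to `page_params(stateName, 2)`.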

## Aggregations (facets)
The ES index for the state lists is set up to support aggregations on taxonomicgroup for faceted searching in the system. The aggregations are added to the query DSL using the following:
```json
{
  "aggs": {
    "taxgroup": {
      "terms": {
        "field": "properties.taxonomicgroup"
      }
    }
  }
}
```
See the [ElasticSearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html) on aggregations for more details.

In [11]:
# Query for the specified state name and add in the aggregations
import json

aggsUrl = "https://gc2.mapcentia.com/api/v1/elasticsearch/search/bcb/sgcn/sgcn_statelists"
queryWithAggs = {
    "query": {"match": {"properties.sgcn_state": stateName}},
    "aggs": {"taxgroup": {"terms": {"field": "properties.taxonomicgroup"}}},
}
rAggs = requests.get(aggsUrl, params={"q": json.dumps(queryWithAggs)}).json()

print("Taxonomic Group")
for bucket in rAggs["aggregations"]["taxgroup"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])

Taxonomic Group
plants 572
vascular 243
insects 140
birds 113
mammals 36
fish 33
reptiles 28
mollusks 12
bivalves 6
amphibians 4


### Taxonomic Authority Check

To derive the SGCN National List, the USGS processes the names submitted by states against lookup services from taxonomic authorities. The results of this check are recorded in the data and then used to build the sgcn_nationallist index that drives the web app and other uses. The sgcn_statelists view/index contains the necessary properties to show exactly what the matching process consisted of and resulted in.

In the web app, this information has been shown as a set of "name changes" on a separate tab. This should change to a full report for the state, by unique submitted name, using the information directly from the index for each record. The web app can still use a similar device of showing a separate table, but it is essentially just an extension of several new properties onto the existing information.

Definition of the properties:

* itismatchmethod - One of four values describing the results of checking the species name against ITIS. In the sgcn_statelists ES index, this property can be used to aggregate the data, allowing for facets/filters on the 4 values.
    * ExactMatch - Found an exact match of one taxon using the name string shown in itismatchmethod_searchstring
    * FuzzyMatch - Found a fuzzy match using the ITIS Solr service for a single taxon using the name string in itismatchmethod_searchstring
    * AcceptedTSNFollow - Found a match using the name string, but followed the accepted TSN because the discovered name was changed in ITIS
    * NotMatched - The provided name could not be matched at this time. Taxa with "NotMatched" values are implicitly not on the National List, whereas taxa in any of the other three categories of itismatchmethod are on the National List.
* itismatchmethod_searchstring - The exact name string used to search against ITIS. When this string and the scientificname are not exactly the same, it indicates that one or more rules to clean up the names for processing were triggered and applied in the process. These rules are defined in the "stringCleaning" function of the ITIS processing script and should probably be documented somewhere for the SWAP web app to pull from.
* itis_cachedate - The exact date/time that the search was conducted for the record.
* namewind - The name with indicators for the accepted taxon as provided by ITIS. This is the name that is shown in the National List as opposed to the submitted name from the state.
* tsn - The ITIS Taxonomic Serial Number (TSN) for the valid taxonomy. This can be put together with one or another of the ITIS URLs to provide a meaningful link for users. For instance, human-readable web pages in ITIS currently look like https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=983598. The presence of a tsn value in a state list record implicitly indicates that the taxon is on the "National List."
* discoveredtsn - The ITIS TSN at the point of discovery where an accepted TSN was followed to bring back the current taxonomy for the record. This can also be linked to an ITIS web page.
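
As a rough illustration of how the web app might use these properties, the sketch below derives National List membership, a "name was cleaned" flag, and an ITIS link from one record. The `describe_match` function is a hypothetical helper, not part of the existing code. The URL pattern is the one quoted above, and the sample record is the Nebraska record from the ElasticSearch output earlier in this notebook.

```python
ITIS_REPORT_URL = (
    "https://www.itis.gov/servlet/SingleRpt/SingleRpt"
    "?search_topic=TSN&search_value={tsn}"
)


def describe_match(props):
    """Summarize the ITIS check for one sgcn_statelists record (hypothetical helper)."""
    method = props.get("itismatchmethod")
    search = props.get("itismatchmethod_searchstring")
    tsn = props.get("tsn")
    return {
        # Every match method other than NotMatched puts the taxon on the National List
        "on_national_list": method is not None and method != "NotMatched",
        # A search string that differs from the submitted name means cleaning rules fired
        "name_was_cleaned": search is not None and search != props.get("scientificname"),
        "itis_url": ITIS_REPORT_URL.format(tsn=tsn) if tsn else None,
    }


example = {
    "scientificname": "Eragrostis pilosa var. perplexa",
    "itismatchmethod": "ExactMatch",
    "itismatchmethod_searchstring": "Eragrostis pilosa var. perplexa",
    "tsn": "527901",
    "discoveredtsn": None,
}
result = describe_match(example)
print(result)
```

The same URL pattern could be applied to discoveredtsn when it is present, to link the originally discovered name alongside the accepted one.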

Abby will need to weigh in on exactly how to present this information. We may want to do things like turn the somewhat cryptic values into better descriptors and provide some derivation of the above explanation on the web site.

In [12]:
# Aggregations query on the itismatchmethod showing an overview of the results for a given state
import json

matchMethodUrl = "https://gc2.mapcentia.com/api/v1/elasticsearch/search/bcb/sgcn/sgcn_statelists"
q_itismatchmethodaggs = {
    "query": {"match": {"properties.sgcn_state": stateName}},
    "aggs": {"itismatchmethod": {"terms": {"field": "properties.itismatchmethod"}}},
}
r_itismatchmethodaggs = requests.get(matchMethodUrl, params={"q": json.dumps(q_itismatchmethodaggs)}).json()

print("ITIS Match Method")
for bucket in r_itismatchmethodaggs["aggregations"]["itismatchmethod"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])

ITIS Match Method
exactmatch 788
acceptedtsnfollow 107
notmatched 31
fuzzymatch 19
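
Because every match method other than NotMatched puts a taxon on the National List, these buckets can be rolled up into a simple summary for the state. The sketch below hard-codes the bucket counts printed above so that it runs without the API; note that the bucket keys come back lowercased, as in the output. The variable names are illustrative only.

```python
# Bucket counts copied from the aggregation output above
buckets = [
    {"key": "exactmatch", "doc_count": 788},
    {"key": "acceptedtsnfollow", "doc_count": 107},
    {"key": "notmatched", "doc_count": 31},
    {"key": "fuzzymatch", "doc_count": 19},
]

counts = {bucket["key"]: bucket["doc_count"] for bucket in buckets}
total = sum(counts.values())
# Everything except notmatched is implicitly on the National List
matched = total - counts.get("notmatched", 0)

print(f"{matched} of {total} submitted names matched ({matched / total:.1%})")
```

In a live notebook, `buckets` would instead come from `r_itismatchmethodaggs["aggregations"]["itismatchmethod"]["buckets"]`.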
