# SGCN Search

To accommodate full-text searches (and directed, property-specific searches if we want them), we've developed a specialized view and index based on the following SQL. It brings together multiple properties and values for each unique scientific name in the originally submitted data. This gives us both taxa that have been matched to the taxonomic authorities (not on "National List") and taxa that have not been matched but are still available from individual state lists.

20170602: SQL updates from Abby and Daniel add scientificname_display (for display purposes; it pulls the scientific name from ITIS if available and uses the submitted name if not) and remove the field scientific_submitted. The updates also remove duplicates of species that share the same TSN, and add a coalesce of scientific name in a format similar to common name.

```sql
(SELECT t.itis->'nameWInd' AS scientificname_display,
(array_agg(t.itis->'nameWInd'))[1] AS scientificname_accepted,
(array_agg(t.itis->'tsn'))[1] AS tsn,
(array_agg(t.sgcn->'taxonomicgroup'))[1] AS taxonomicgroup,
(array_agg(t.itis->'Superdivision'))[1] AS superdivision,
(array_agg(t.itis->'Division'))[1] AS division,
(array_agg(t.itis->'Subdivision'))[1] AS subdivision,
(array_agg(t.itis->'Kingdom'))[1] AS kingdom,
(array_agg(t.itis->'Subkingdom'))[1] AS subkingdom,
(array_agg(t.itis->'Infrakingdom'))[1] AS infrakingdom,
(array_agg(t.itis->'Superphylum'))[1] AS superphylum,
(array_agg(t.itis->'Phylum'))[1] AS phylum,
(array_agg(t.itis->'Subphylum'))[1] AS subphylum,
(array_agg(t.itis->'Infraphylum'))[1] AS infraphylum,
(array_agg(t.itis->'Superclass'))[1] AS superclass,
(array_agg(t.itis->'Class'))[1] AS class,
(array_agg(t.itis->'Subclass'))[1] AS subclass,
(array_agg(t.itis->'Infraclass'))[1] AS infraclass,
(array_agg(t.itis->'Superorder'))[1] AS superorder,
(array_agg(t.itis->'Order'))[1] AS order,
(array_agg(t.itis->'Suborder'))[1] AS suborder,
(array_agg(t.itis->'Infraorder'))[1] AS infraorder,
(array_agg(t.itis->'Superfamily'))[1] AS superfamily,
(array_agg(t.itis->'Family'))[1] AS family,
(array_agg(t.itis->'Subfamily'))[1] AS subfamily,
(array_agg(t.itis->'Genus'))[1] AS genus,
(array_agg(t.itis->'Subgenus'))[1] AS subgenus,
(array_agg(t.itis->'Species'))[1] AS species,
(array_agg(t.itis->'Subspecies'))[1] AS subspecies,
(array_agg(t.itis->'Tribe'))[1] AS tribe,
(array_agg(t.itis->'Subtribe'))[1] AS subtribe,
(array_agg(t.itis->'Variety'))[1] AS variety,
(array_agg(Coalesce(t.itis->'vernacular:English',s.commonname_submitted)))[1] AS commonname_preferred,
array_to_string(array_agg(Coalesce(t.itis->'vernacular:English',s.commonname_submitted)), ',') AS commonnamelist,
array_to_string(array_agg(Coalesce(t.registration->'SGCN_ScientificName_Submitted',t.itis->'nameWInd')), ',') AS scientificnamelist,
array_to_string(array_agg(CASE WHEN s.sgcn_year=2005 THEN s.sgcn_state ELSE NULL END), ',') statelist_2005,
array_to_string(array_agg(CASE WHEN s.sgcn_year=2015 THEN s.sgcn_state ELSE NULL END), ',') statelist_2015,
(array_agg(t.tess->'SpeciesCode'))[1] AS tess_speciescode,
(array_agg(t.tess->'entityId'))[1] AS tess_entityid,
(array_agg(t.tess->'cacheDate'))[1] AS tess_cachedate,
(array_agg(t.tess->'StatusText'))[1] AS tess_listingstatus,
(array_agg(t.natureserve->'elementGlobalID'))[1] AS natureserve_elementglobalid,
-- Assumes the NatureServe cache date is stored under 'cacheDate', as it is for TESS
(array_agg(t.natureserve->'cacheDate'))[1] AS natureserve_cachedate,
(array_agg(t.natureserve->'GlobalStatusRank'))[1] AS natureserve_globalstatusrank,
(array_agg(t.natureserve->'roundedGlobalStatusRankDescription'))[1] AS natureserve_roundedglobalstatusrankdescription,
(array_agg(t.natureserve->'globalStatusLastReviewed'))[1] AS natureserve_globalstatuslastreviewed,
(array_agg(t.natureserve->'usNationalStatusRankCode'))[1] AS natureserve_usnationalstatusrankcode,
(array_agg(t.natureserve->'usNationalStatusLastReviewed'))[1] AS natureserve_usnationalstatuslastreviewed
FROM sgcn.sgcn s
JOIN tir.tir2 t ON
t.registration->'SGCN_ScientificName_Submitted' = s.scientificname_submitted
GROUP BY t.itis->'nameWInd', t.itis->'tsn'
HAVING t.itis->'tsn' IS NOT NULL

UNION

SELECT t.registration->'SGCN_ScientificName_Submitted' AS scientificname_display,
(array_agg(t.itis->'nameWInd'))[1] AS scientificname_accepted,
(array_agg(t.itis->'tsn'))[1] AS tsn,
(array_agg(t.sgcn->'taxonomicgroup'))[1] AS taxonomicgroup,
(array_agg(t.itis->'Superdivision'))[1] AS superdivision,
(array_agg(t.itis->'Division'))[1] AS division,
(array_agg(t.itis->'Subdivision'))[1] AS subdivision,
(array_agg(t.itis->'Kingdom'))[1] AS kingdom,
(array_agg(t.itis->'Subkingdom'))[1] AS subkingdom,
(array_agg(t.itis->'Infrakingdom'))[1] AS infrakingdom,
(array_agg(t.itis->'Superphylum'))[1] AS superphylum,
(array_agg(t.itis->'Phylum'))[1] AS phylum,
(array_agg(t.itis->'Subphylum'))[1] AS subphylum,
(array_agg(t.itis->'Infraphylum'))[1] AS infraphylum,
(array_agg(t.itis->'Superclass'))[1] AS superclass,
(array_agg(t.itis->'Class'))[1] AS class,
(array_agg(t.itis->'Subclass'))[1] AS subclass,
(array_agg(t.itis->'Infraclass'))[1] AS infraclass,
(array_agg(t.itis->'Superorder'))[1] AS superorder,
(array_agg(t.itis->'Order'))[1] AS order,
(array_agg(t.itis->'Suborder'))[1] AS suborder,
(array_agg(t.itis->'Infraorder'))[1] AS infraorder,
(array_agg(t.itis->'Superfamily'))[1] AS superfamily,
(array_agg(t.itis->'Family'))[1] AS family,
(array_agg(t.itis->'Subfamily'))[1] AS subfamily,
(array_agg(t.itis->'Genus'))[1] AS genus,
(array_agg(t.itis->'Subgenus'))[1] AS subgenus,
(array_agg(t.itis->'Species'))[1] AS species,
(array_agg(t.itis->'Subspecies'))[1] AS subspecies,
(array_agg(t.itis->'Tribe'))[1] AS tribe,
(array_agg(t.itis->'Subtribe'))[1] AS subtribe,
(array_agg(t.itis->'Variety'))[1] AS variety,
(array_agg(Coalesce(t.itis->'vernacular:English',s.commonname_submitted)))[1] AS commonname_preferred,
array_to_string(array_agg(Coalesce(t.itis->'vernacular:English',s.commonname_submitted)), ',') AS commonnamelist,
array_to_string(array_agg(Coalesce(t.registration->'SGCN_ScientificName_Submitted',t.itis->'nameWInd')), ',') AS scientificnamelist,
array_to_string(array_agg(CASE WHEN s.sgcn_year=2005 THEN s.sgcn_state ELSE NULL END), ',') statelist_2005,
array_to_string(array_agg(CASE WHEN s.sgcn_year=2015 THEN s.sgcn_state ELSE NULL END), ',') statelist_2015,
(array_agg(t.tess->'SpeciesCode'))[1] AS tess_speciescode,
(array_agg(t.tess->'entityId'))[1] AS tess_entityid,
(array_agg(t.tess->'cacheDate'))[1] AS tess_cachedate,
(array_agg(t.tess->'StatusText'))[1] AS tess_listingstatus,
(array_agg(t.natureserve->'elementGlobalID'))[1] AS natureserve_elementglobalid,
-- Assumes the NatureServe cache date is stored under 'cacheDate', as it is for TESS
(array_agg(t.natureserve->'cacheDate'))[1] AS natureserve_cachedate,
(array_agg(t.natureserve->'GlobalStatusRank'))[1] AS natureserve_globalstatusrank,
(array_agg(t.natureserve->'roundedGlobalStatusRankDescription'))[1] AS natureserve_roundedglobalstatusrankdescription,
(array_agg(t.natureserve->'globalStatusLastReviewed'))[1] AS natureserve_globalstatuslastreviewed,
(array_agg(t.natureserve->'usNationalStatusRankCode'))[1] AS natureserve_usnationalstatusrankcode,
(array_agg(t.natureserve->'usNationalStatusLastReviewed'))[1] AS natureserve_usnationalstatuslastreviewed
FROM sgcn.sgcn s
JOIN tir.tir2 t ON
t.registration->'SGCN_ScientificName_Submitted' = s.scientificname_submitted
GROUP BY t.registration->'SGCN_ScientificName_Submitted', t.itis->'tsn'
HAVING t.itis->'tsn' IS NULL)

```
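
As a rough illustration of what the two halves of the UNION do, the sketch below reproduces the grouping in plain Python. The rows are hypothetical stand-ins for the joined sgcn.sgcn and tir.tir2 records (the state values are placeholders, not real listings), and `build_search_rows` is a name made up for this example rather than part of the system.

```python
from collections import defaultdict

# Hypothetical joined rows; "State A"/"State B" are placeholders, not real listings.
rows = [
    {"submitted": "Anas rubripes", "itis_name": "Anas rubripes", "tsn": "175068",
     "state": "State A", "year": 2005},
    {"submitted": "Anas rubripes", "itis_name": "Anas rubripes", "tsn": "175068",
     "state": "State B", "year": 2015},
    {"submitted": "Unmatched taxon", "itis_name": None, "tsn": None,
     "state": "State A", "year": 2015},
]

def build_search_rows(rows):
    """Group rows the way the view does: by ITIS name when a TSN exists,
    otherwise by the submitted scientific name."""
    groups = defaultdict(list)
    for row in rows:
        display = row["itis_name"] if row["tsn"] is not None else row["submitted"]
        groups[(display, row["tsn"])].append(row)

    result = []
    for (display, tsn), members in groups.items():
        result.append({
            "scientificname_display": display,
            "tsn": tsn,
            "scientificnamelist": ",".join(m["submitted"] for m in members),
            # Rows from other years are left out, mirroring how array_to_string
            # drops the NULLs produced by the CASE expressions.
            "statelist_2005": ",".join(m["state"] for m in members if m["year"] == 2005),
            "statelist_2015": ",".join(m["state"] for m in members if m["year"] == 2015),
        })
    return result

for record in build_search_rows(rows):
    print(record)
```
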
Searches against the Elasticsearch index for this data layout should be fairly powerful. An open text search across everything in the index will match anything from exact scientific names to state names and their associated species. That breadth can produce unintended results, so it may be best to start by searching only the scientific name and common name fields (each of which holds a list).
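
One way to narrow a search to the name fields is Elasticsearch's `multi_match` query, sketched below. The field names are assumptions: they take the view's scientificnamelist and commonnamelist columns and put them under the `properties` prefix seen elsewhere in this notebook. The helper name `name_query` is made up for this example, and how the GC2 endpoint should receive the body is not covered here.

```python
import json

# Assumed field names: the view's list columns under the "properties" prefix.
DEFAULT_NAME_FIELDS = ("properties.scientificnamelist", "properties.commonnamelist")

def name_query(text, fields=DEFAULT_NAME_FIELDS, size=25):
    """Build a query body that searches only the given name fields."""
    return {
        "size": size,
        "query": {"multi_match": {"query": text, "fields": list(fields)}},
    }

print(json.dumps(name_query("rainbow trout"), indent=2))
```

Passing a different `fields` tuple would be the place to adjust if the indexed names turn out to differ.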

In [1]:
import requests
from IPython.display import display

In [2]:
# Minimal list subclass that Jupyter renders as an HTML table
class ListTable(list):
    def _repr_html_(self):
        # Each item in the list is a row; each item in a row is a cell
        html = ["<table>"]
        for row in self:
            html.append("<tr>")

            for col in row:
                html.append("<td>{0}</td>".format(col))

            html.append("</tr>")
        html.append("</table>")
        return ''.join(html)

This query returns results from the Elasticsearch index for the sgcn_nationallist view. It requests only the first 25 results, so results will need to be paginated for the SWAP online app. I included the taxonomic authority ID as a reference. Those ITIS or WoRMS IDs return a machine-readable response and are not content negotiable, so if we want to include them in the UI, we would need to translate each ID into a link meant for human readers.

In [7]:
# Request the first 25 documents from the index
sgcnNationalListURL = "https://gc2.datadistillery.org/api/v1/elasticsearch/search/bcb/tir/tir?size=25"
sgcnNationalList = requests.get(sgcnNationalListURL).json()

# First row holds the column headers
tableNationalList = ListTable()
tableNationalList.append(['Scientific Name', 'Common Name', 'Taxonomic Group', 'Taxonomic Rank', 'Taxonomic Authority ID/Link'])

# One table row per search hit
for hit in sgcnNationalList['hits']['hits']:
    props = hit['_source']['properties']
    tableNationalList.append([props['scientificname'], props['commonname'], props['taxonomicgroup'], props['rank'], props['authorityid']])

display(tableNationalList)

| Scientific Name | Common Name | Taxonomic Group | Taxonomic Rank | Taxonomic Authority ID/Link |
|---|---|---|---|---|
| Phyllostegia wawrana | fuzzystem phyllostegia | Plants | Species | http://services.itis.gov/?q=tsn:196179 |
| Plantago princeps var. laxifolia | Kuahiwi Laukahi | Plants | Variety | http://services.itis.gov/?q=tsn:834291 |
| Auchenorrhyncha | hoppers | Insects | Suborder | http://services.itis.gov/?q=tsn:109167 |
| Satyrium acadica | Acadian Hairstreak | Insects | Species | http://services.itis.gov/?q=tsn:777814 |
| Carex lupuliformis | false hop sedge | Plants | Species | http://services.itis.gov/?q=tsn:39412 |
| Anas rubripes | American Black Duck | Birds | Species | http://services.itis.gov/?q=tsn:175068 |
| Oncorhynchus mykiss | rainbow trout | Fish | Species | http://services.itis.gov/?q=tsn:161989 |
| Eremophila alpestris strigata | streaked horned lark | Birds | Subspecies | http://services.itis.gov/?q=tsn:178412 |
| Scincella lateralis | Ground Skink | Reptiles | Species | http://services.itis.gov/?q=tsn:174008 |
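
Two follow-ups from the notes above can be sketched together: paging past the first 25 hits, and turning an ITIS authority ID into a link for human readers. Both helpers (`page_url`, `itis_report_url`) are hypothetical names. The paging assumes the GC2 endpoint passes Elasticsearch's `from` parameter through the same way it does `size`, and the ITIS report URL pattern is an assumption that should be checked before it goes into the UI. WoRMS IDs are not handled because their format is not shown here.

```python
from urllib.parse import parse_qs, urlencode, urlparse

SEARCH_URL = "https://gc2.datadistillery.org/api/v1/elasticsearch/search/bcb/tir/tir"

def page_url(page, size=25, base=SEARCH_URL):
    """URL for a zero-based page, using Elasticsearch's size/from parameters.
    Assumes the GC2 endpoint passes "from" through like it does "size"."""
    return base + "?" + urlencode({"size": size, "from": page * size})

def itis_report_url(authority_id):
    """Turn an ITIS authority ID into a link to the human-readable ITIS report.
    Returns None for anything that is not a tsn:<digits> ID (e.g. WoRMS)."""
    query = parse_qs(urlparse(authority_id).query).get("q", [""])[0]
    prefix, _, tsn = query.partition(":")
    if prefix != "tsn" or not tsn.isdigit():
        return None
    # Assumed URL pattern for the ITIS standard report page.
    return ("https://www.itis.gov/servlet/SingleRpt/SingleRpt?"
            + urlencode({"search_topic": "TSN", "search_value": tsn}))

print(page_url(2))
print(itis_report_url("http://services.itis.gov/?q=tsn:196179"))
```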


## Aggregations (facets)
The ES index for the national list is set up to support aggregations on taxonomic rank and taxonomic group, which enables faceted searching in the system. The aggregations are added to the query DSL as follows:
```json
{
  "aggs": {
    "taxrank": {
      "terms": {
        "field": "properties.rank"
      }
    },
    "taxgroup": {
      "terms": {
        "field": "properties.taxonomicgroup"
      }
    }
  }
}
```
See the [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html) on terms aggregations for more details.
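
This notebook passes the aggregation body to the search endpoint as URL-encoded JSON in the `q` parameter. Rather than encoding that string by hand, the URL could be built from a dictionary, roughly as sketched here. The helper name `aggs_url` is made up for this example, and the field names are copied from the request used in this notebook.

```python
import json
from urllib.parse import quote, unquote

SEARCH_URL = "https://gc2.datadistillery.org/api/v1/elasticsearch/search/bcb/tir/tir"

def aggs_url(fields, base=SEARCH_URL):
    """Build a search URL whose q parameter carries one terms aggregation per field.
    `fields` maps an aggregation name to the indexed field it buckets on."""
    body = {"aggs": {name: {"terms": {"field": field}} for name, field in fields.items()}}
    return base + "?q=" + quote(json.dumps(body))

url = aggs_url({"taxrank": "properties.rank", "taxgroup": "properties.taxonomicgroup"})
print(url)
# Decoding the q parameter gives back the JSON body.
print(unquote(url.split("?q=", 1)[1]))
```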

In [4]:
# The q parameter carries the URL-encoded aggregation JSON
queryWithAggs = "https://gc2.datadistillery.org/api/v1/elasticsearch/search/bcb/tir/tir?q={%22aggs%22:%20{%22taxrank%22:%20{%22terms%22:%20{%22field%22:%20%22properties.rank%22}},%22taxgroup%22:%20{%22terms%22:%20{%22field%22:%20%22properties.taxonomicgroup%22}}}}"
rAggs = requests.get(queryWithAggs).json()

# Print each bucket's key and document count, one aggregation at a time
print("Taxonomic Rank")
for bucket in rAggs["aggregations"]["taxrank"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
print("----")
print("Taxonomic Group")
for bucket in rAggs["aggregations"]["taxgroup"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])


Taxonomic Rank
species 15982
rank 1438
taxonomic 1438
unknown 1438
subspecies 1369
genus 510
variety 373
family 196
order 33
class 5
----
Taxonomic Group
plants 4307
insects 4142
other 2076
fish 1929
birds 1898
mollusks 1791
mammals 1282
invertebrates 1243
reptiles 987
amphibians 694
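
For a faceted UI, the buckets printed above would probably be easier to work with as plain dictionaries. The sketch below shows one way to reshape an aggregation response. `facet_counts` is a hypothetical helper, and the sample is a trimmed stand-in for the real response that reuses a few counts from the output above.

```python
def facet_counts(response):
    """Map each aggregation name to a {bucket key: document count} dict."""
    return {
        name: {bucket["key"]: bucket["doc_count"] for bucket in agg.get("buckets", [])}
        for name, agg in response.get("aggregations", {}).items()
    }

# Trimmed sample shaped like the response, using counts from the printed output.
sample = {
    "aggregations": {
        "taxrank": {"buckets": [{"key": "species", "doc_count": 15982},
                                {"key": "subspecies", "doc_count": 1369}]},
        "taxgroup": {"buckets": [{"key": "plants", "doc_count": 4307}]},
    }
}
print(facet_counts(sample))
```

The rank, taxonomic, and unknown buckets all share a count of 1438, which may mean a multi-word value is being split into separate tokens by an analyzed field. That is a guess from the output, not something confirmed here.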


