# SGCN Search

We changed pretty much everything about how this search functions under the new GC2 instance on the DataDistillery. Because we put the logic into building out TIR Common Properties (see the tir repo), we are now trying to simply search on an index created from the TIR core table. SGCN searches can facet on taxonomic group, taxonomic rank, and match method. The main thing that has to be done first is to limit the search to the SGCN registrants (source=SGCN).

```json
{
  "query": {
    "match": {
      "properties.source": "SGCN"
    }
  }
}
```

All the other information that we had previously chunked out into separate fields from the HStore properties is still there, but it now lives in JSONB data fields in the tir table. Initially, we are just piping all of that into Elasticsearch via GC2 to see how searches behave. We will likely need to parse the properties we care about out again in some fashion to make them more usable, because the JSONB data structures are thrown into Elasticsearch as escaped text strings and are likely to give weird results. Alternatively, we may want to either a) pull the plug on the GC2 way of piping to Elasticsearch and go to a different architecture, or b) look at the GC2 codebase again to see whether we could contribute some new thinking about kinds of PostgreSQL-stored data that they had not considered.
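
As a rough illustration of parsing escaped JSONB values back out on the client side, the sketch below decodes any property value that looks like a JSON document stored as a string. The helper name `unescape_json_properties` and the `itis` property are hypothetical; the real property names and structures in the index may differ.

```python
import json

def unescape_json_properties(properties):
    """Return a copy of properties with JSON-encoded string values decoded.

    Values that are not strings, or that do not parse as JSON objects or
    arrays, are left unchanged.
    """
    decoded = {}
    for key, value in properties.items():
        decoded[key] = value
        if isinstance(value, str) and value.strip()[:1] in ("{", "["):
            try:
                decoded[key] = json.loads(value)
            except ValueError:
                pass  # not valid JSON after all; keep the raw string
    return decoded

# Hypothetical hit properties: "itis" stands in for a JSONB column that
# arrived in the index as an escaped text string.
example = {
    "scientificname": "Taricha granulosa",
    "itis": "{\"tsn\": \"173620\", \"rank\": \"Species\"}",
}
parsed = unescape_json_properties(example)
print(parsed["itis"]["tsn"])  # prints 173620
```

This only helps for display after a search; it does not make the nested values searchable or usable in aggregations, which would still require parsing them out before indexing.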

A couple of property name changes to watch for here:

- taxonomicauthorityid changed to authorityid
- taxonomicrank changed to rank

In [1]:
import requests
from IPython.display import display

In [2]:
# Class to render a list of rows as an HTML table in the notebook
class ListTable(list):
    def _repr_html_(self):
        html = ["<table>"]
        for row in self:
            html.append("<tr>")
            
            for col in row:
                html.append("<td>{0}</td>".format(col))
            
            html.append("</tr>")
        html.append("</table>")
        return ''.join(html)

This query returns results from the Elasticsearch index for the tir.tir table. It requests only the first 25 results, so the SWAP online app will need to paginate. I included the taxonomic authority ID as a reference. Those ID links to ITIS or WoRMS return a machine-readable response and do not support content negotiation, so if we want to include them in the UI, we would need to translate each ID into something for humans.

In [10]:
sgcnNationalListURL = 'https://gc2.datadistillery.org/api/v1/elasticsearch/search/bcb/tir/tir?size=25&q={"query":{"match":{"properties.source":"SGCN"}}}'
sgcnNationalList = requests.get(sgcnNationalListURL).json()

tableNationalList = ListTable()
tableNationalList.append(['Scientific Name', 'Common Name', 'Taxonomic Group', 'Taxonomic Rank', 'Taxonomic Authority ID/Link'])

for hit in sgcnNationalList['hits']['hits']:
    # Each hit carries its fields under _source.properties
    props = hit['_source']['properties']
    tableNationalList.append([props['scientificname'], props['commonname'], props['taxonomicgroup'], props['rank'], props['authorityid']])

display(tableNationalList)

| Scientific Name | Common Name | Taxonomic Group | Taxonomic Rank | Taxonomic Authority ID/Link |
| --- | --- | --- | --- | --- |
| Goniopsis cruentata | mangrove root crab | Insects | Species | http://services.itis.gov/?q=tsn:99068 |
| Arbacia punctulata | purple-spined sea urchin | Other Invertebrates | Species | http://services.itis.gov/?q=tsn:157906 |
| Opisthonema oglinum | Atlantic thread herring | Fish | Species | http://services.itis.gov/?q=tsn:161748 |
| Pseudacris illinoensis | Illinois Chorus Frog | Amphibians | Species | http://services.itis.gov/?q=tsn:662726 |
| Spirinchus thaleichthys | longfin smelt | Fish | Species | http://services.itis.gov/?q=tsn:162049 |
| Taricha granulosa | Rough-skinned Newt | Amphibians | Species | http://services.itis.gov/?q=tsn:173620 |
| Apamea plutonia | no common name | Insects | Species | http://services.itis.gov/?q=tsn:937849 |
| Corvula sanctaeluciae | striped croaker | Fish | Species | http://services.itis.gov/?q=tsn:646588 |
| Etheostoma | smoothbelly darters | Fish | Genus | http://services.itis.gov/?q=tsn:168357 |
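
The pagination the SWAP online app needs could be sketched along these lines. It keeps `size` as a URL parameter, as in the request above, and puts the standard Elasticsearch `from` offset inside the query body. Whether GC2 passes `from` through untouched is an assumption that would need testing, and `page_params` and `SEARCH_URL` are hypothetical names.

```python
import json

SEARCH_URL = "https://gc2.datadistillery.org/api/v1/elasticsearch/search/bcb/tir/tir"

def page_params(page, page_size=25):
    """Build request parameters for one zero-based page of SGCN results."""
    body = {
        "from": page * page_size,  # offset of the first hit on this page
        "query": {"match": {"properties.source": "SGCN"}},
    }
    return {"size": page_size, "q": json.dumps(body)}

# Parameters for the third page (hits 50 through 74); to fetch it:
#   requests.get(SEARCH_URL, params=page_params(2)).json()
params = page_params(2)
print(json.loads(params["q"])["from"])  # prints 50
```

Elasticsearch responses normally report the total number of matches under `hits.total`, which the app could use to work out how many pages exist.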


## Aggregations (facets)
The ES index for the national list is set up to support aggregations on taxonomicgroup, rank, and matchmethod for faceted searching in the system. The aggregations are added to the query DSL using the following:
```json
{
  "query": {
    "match": {
      "properties.source": "SGCN"
    }
  },
  "aggs": {
    "taxrank": {
      "terms": {
        "field": "properties.rank"
      }
    },
    "taxgroup": {
      "terms": {
        "field": "properties.taxonomicgroup"
      }
    },
    "matchmethod": {
      "terms": {
        "field": "properties.matchmethod"
      }
    }
  }
}
```
See the [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html) on terms aggregations for more details.
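
If more facets are added later, the same query body could be generated rather than written by hand. The sketch below builds one terms aggregation per facet; `build_facet_query` and `SGCN_FACETS` are hypothetical names, and the field names are the three from the query above.

```python
import json

def build_facet_query(facets, source="SGCN"):
    """Build a query body with one terms aggregation per facet.

    facets maps an aggregation name to the index field it buckets on.
    """
    return {
        "query": {"match": {"properties.source": source}},
        "aggs": {
            name: {"terms": {"field": field}} for name, field in facets.items()
        },
    }

SGCN_FACETS = {
    "taxrank": "properties.rank",
    "taxgroup": "properties.taxonomicgroup",
    "matchmethod": "properties.matchmethod",
}

# Serialize the body so it can be passed as the q parameter of the search URL
print(json.dumps(build_facet_query(SGCN_FACETS), indent=2))
```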

### NOTE:
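We still have the problem where the not_analyzed flag in the Elasticsearch GUI from GC2 does not seem to keep the aggregation properties from being split into separate words. In the output below, for example, the rank, taxonomic, and unknown buckets all have the same count (1413), as do the not and matched buckets (665), which suggests that single multi-word values are being broken into one bucket per word.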

In [15]:
queryWithAggs = 'https://gc2.datadistillery.org/api/v1/elasticsearch/search/bcb/tir/tir?q={"query": {"match": {"properties.source": "SGCN"}},"aggs": {"taxrank": {"terms": {"field": "properties.rank"}},"taxgroup": {"terms": {"field": "properties.taxonomicgroup"}},"matchmethod": {"terms": {"field": "properties.matchmethod"}}}}'
rAggs = requests.get(queryWithAggs).json()

# Print the bucket keys and document counts for each aggregation
aggLabels = [("taxrank", "Taxonomic Rank"), ("taxgroup", "Taxonomic Group"), ("matchmethod", "Match Method")]
for i, (aggName, label) in enumerate(aggLabels):
    if i > 0:
        print("----")
    print(label)
    for bucket in rAggs["aggregations"][aggName]["buckets"]:
        print(bucket["key"], bucket["doc_count"])


Taxonomic Rank
species 14316
rank 1413
taxonomic 1413
unknown 1413
subspecies 1345
genus 506
variety 373
family 196
order 33
class 5
----
Taxonomic Group
plants 4307
insects 4142
other 2076
fish 1929
mollusks 1791
birds 1249
invertebrates 1243
mammals 823
reptiles 660
unknown 473
----
Match Method
match 15874
exact 14482
accepted 1663
followed 1663
tsn 1592
legacy 748
matched 665
not 665
fuzzy 644
aphiaid 71
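
The split buckets above are consistent with the fields still being analyzed. The sketch below imitates the effect in plain Python, approximating the analyzer with a lowercase whitespace split (the real analyzer also handles punctuation and other cases). The sample values are hypothetical and are not taken from the index.

```python
from collections import Counter

def bucket_whole_values(values):
    """Count each value as a single term, as a not_analyzed field should."""
    return Counter(values)

def bucket_tokens(values):
    """Count each value once per distinct lowercased word, roughly as an analyzed field does."""
    counts = Counter()
    for value in values:
        counts.update(set(value.lower().split()))
    return counts

# Hypothetical sample of match method values, not taken from the index
sample = ["Exact Match", "Exact Match", "Fuzzy Match", "Not Matched"]

print(bucket_whole_values(sample))
print(bucket_tokens(sample))
```

In the tokenized counts, "match" is counted for three of the four values while "not" and "matched" share a count of one, the same pattern of merged and duplicated counts seen in the real aggregation output.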
