# Querying in Lucene

The Elasticsearch query language (built on top of Lucene) can be squirrely, but it starts off simple. The documentation starts here (keep clicking "next" to read through):

https://www.elastic.co/guide/en/elasticsearch/reference/5.5/_introducing_the_query_language.html

In [1]:
import requests

In [2]:
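def execute_es_query(index, query):
    """Run `query` against `index`; return the parsed JSON, or None on error."""
    r = requests.get("http://elasticsearch:9200/{}/_search".format(index),
                     json=query)
    if r.status_code != 200:
        # show the status code and response body so a failing query can be debugged
        print("Error executing query ({}): {}".format(r.status_code, r.text))
        return None
    return r.json()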

In [6]:
# this query matches everything in the index,
# but only returns <= 10 results by default
query = {
    "query": { "match_all": {} }
}

res = execute_es_query('recipes', query)
res

{'took': 3,
 'timed_out': False,
 '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
 'hits': {'total': 2,
  'max_score': 1.0,
  'hits': [{'_index': 'recipes',
    '_type': 'recipe',
    '_id': 'AWpmMZUCKa_s2tz80C3k',
    '_score': 1.0,
    '_source': {'name': 'chocolate chip cookies',
     'ingredients': ['flour', 'water', 'sugar', 'chocolate chips']}},
   {'_index': 'recipes',
    '_type': 'recipe',
    '_id': 'AWpmMHlWKa_s2tz80C3j',
    '_score': 1.0,
    '_source': {'name': 'pizza',
     'ingredients': ['flour', 'water', 'yeast', 'cheese', 'tomato sauce']}}]}}

In [8]:
# this query grabs everything, but only returns 1 result
query = {
    "size": 1,
    "query": { "match_all": {} }
}

res = execute_es_query('recipes', query)
res

{'took': 4,
 'timed_out': False,
 '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
 'hits': {'total': 2,
  'max_score': 1.0,
  'hits': [{'_index': 'recipes',
    '_type': 'recipe',
    '_id': 'AWpmMZUCKa_s2tz80C3k',
    '_score': 1.0,
    '_source': {'name': 'chocolate chip cookies',
     'ingredients': ['flour', 'water', 'sugar', 'chocolate chips']}}]}}

In [9]:
# we can sort on a field in our query
query = {
    "size": 1,
    "query": { "match_all": {} },
    "sort": { "name": { "order": "desc" } }
}

res = execute_es_query('recipes', query)
res

{'took': 6,
 'timed_out': False,
 '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
 'hits': {'total': 2,
  'max_score': None,
  'hits': [{'_index': 'recipes',
    '_type': 'recipe',
    '_id': 'AWpmMHlWKa_s2tz80C3j',
    '_score': None,
    '_source': {'name': 'pizza',
     'ingredients': ['flour', 'water', 'yeast', 'cheese', 'tomato sauce']},
    'sort': ['pizza']}]}}

## Matching on strings

In [10]:
# "match" finds documents whose `ingredients` field matches the given text
query = {
    "query": { "match": {"ingredients": "flour"} },
}

res = execute_es_query('recipes', query)
res

{'took': 13,
 'timed_out': False,
 '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
 'hits': {'total': 2,
  'max_score': 0.2876821,
  'hits': [{'_index': 'recipes',
    '_type': 'recipe',
    '_id': 'AWpmMZUCKa_s2tz80C3k',
    '_score': 0.2876821,
    '_source': {'name': 'chocolate chip cookies',
     'ingredients': ['flour', 'water', 'sugar', 'chocolate chips']}},
   {'_index': 'recipes',
    '_type': 'recipe',
    '_id': 'AWpmMHlWKa_s2tz80C3j',
    '_score': 0.2876821,
    '_source': {'name': 'pizza',
     'ingredients': ['flour', 'water', 'yeast', 'cheese', 'tomato sauce']}}]}}

In [11]:
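# "chocolate" on its own finds nothing, while "chocolate chips" (next cell)
# does.  This suggests `ingredients` is indexed as exact, un-analyzed values
# (an assumption; check the index mapping to confirm).
query = {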
    "query": { "match": {"ingredients": "chocolate"} },
}

res = execute_es_query('recipes', query)
res

{'took': 4,
 'timed_out': False,
 '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
 'hits': {'total': 0, 'max_score': None, 'hits': []}}

In [12]:
query = {
    "query": { "match": {"ingredients": "chocolate chips"} },
}

res = execute_es_query('recipes', query)
res

{'took': 5,
 'timed_out': False,
 '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
 'hits': {'total': 1,
  'max_score': 0.2876821,
  'hits': [{'_index': 'recipes',
    '_type': 'recipe',
    '_id': 'AWpmMZUCKa_s2tz80C3k',
    '_score': 0.2876821,
    '_source': {'name': 'chocolate chip cookies',
     'ingredients': ['flour', 'water', 'sugar', 'chocolate chips']}}]}}

There is MUCH more to querying. For example, the `bool` query combines several conditions into one query. See the documentation for more details:

https://www.elastic.co/guide/en/elasticsearch/reference/5.5/_executing_searches.html
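
As a minimal sketch of combining conditions, the helper below builds a `bool` query that requires one ingredient and excludes another. The name `build_bool_query` is hypothetical (it is not part of any library), and the function only constructs the request body, which could then be passed to `execute_es_query`.

```python
def build_bool_query(field, required, excluded):
    """Build a bool query body: `required` must match, `excluded` must not.

    Hypothetical helper for illustration; it only constructs the dict.
    """
    return {
        "query": {
            "bool": {
                "must": [{"match": {field: required}}],
                "must_not": [{"match": {field: excluded}}],
            }
        }
    }


# recipes that contain flour but not yeast
query = build_bool_query("ingredients", "flour", "yeast")
print(query)
```

Against the two sample recipes above, this would be expected to keep the cookies and drop the pizza, though it has not been run against the index here.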

## Filtering

Notice that the `match` queries above computed a `_score` for each matching document. The score indicates how closely the document matches the query (higher means a better match).

If we don't care about the score, we can save a LOT of computation by using "filtering" instead: a filter only decides whether a document matches and skips the scoring step. Here is a reference:

https://www.elastic.co/guide/en/elasticsearch/reference/5.5/_executing_filters.html
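
Scoring and filtering can also be mixed in one `bool` query. The sketch below, under the assumption that only the `match` clause should influence `_score`, puts the scored condition under `must` and the yes/no conditions under `filter`. The helper name `scored_with_filter` is hypothetical.

```python
def scored_with_filter(match_clause, filter_terms):
    """Score on `match_clause` only; `filter_terms` just include or exclude documents.

    Hypothetical helper: clauses under "filter" run in filter context and
    do not contribute to `_score`.
    """
    return {
        "query": {
            "bool": {
                "must": [{"match": match_clause}],
                "filter": [
                    {"term": {field: value}}
                    for field, value in filter_terms.items()
                ],
            }
        }
    }


# score on "flour", but only consider recipes that contain "chocolate chips"
query = scored_with_filter(
    {"ingredients": "flour"},
    {"ingredients": "chocolate chips"},
)
print(query)
```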

In [26]:
# A simple filter query.  The syntax is a bit clunky
query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"ingredients": "chocolate chips"}},
            ]
        }
    },
}

res = execute_es_query('recipes', query)
res

{'took': 9,
 'timed_out': False,
 '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
 'hits': {'total': 1,
  'max_score': 0.0,
  'hits': [{'_index': 'recipes',
    '_type': 'recipe',
    '_id': 'AWpmMZUCKa_s2tz80C3k',
    '_score': 0.0,
    '_source': {'name': 'chocolate chip cookies',
     'ingredients': ['flour', 'water', 'sugar', 'chocolate chips']}}]}}

## Aggregations

The final piece of the puzzle is the ability to aggregate. Usually we run a query to narrow down the matching documents, then aggregate over some field of those documents.

Here is some documentation to get you started:

https://www.elastic.co/guide/en/elasticsearch/reference/5.5/_executing_aggregations.html
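
A minimal sketch of that filter-then-aggregate pattern, using a plain `terms` aggregation (which simply counts documents per distinct value). The helper name `terms_agg_query` and the default aggregation name `top_terms` are hypothetical choices for illustration.

```python
def terms_agg_query(filter_field, filter_value, agg_field, agg_name="top_terms"):
    """Build a request that filters documents, then counts values of `agg_field`.

    Hypothetical helper; it only constructs the request body.
    """
    return {
        "size": 0,  # skip the hits; only the aggregation result is wanted
        "query": {
            "bool": {
                "filter": [{"term": {filter_field: filter_value}}],
            }
        },
        "aggs": {
            agg_name: {"terms": {"field": agg_field}},
        },
    }


# count ingredients across recipes that contain flour
query = terms_agg_query("ingredients", "flour", "ingredients")
print(query)
```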

In [30]:
# This is a so-called "bucket" aggregation.  Here we filter for
# documents that contain the ingredient "chocolate chips", then look
# inside those documents for other ingredients that co-occur with
# "chocolate chips".
# Some ingredients are in many recipes (e.g. "water"), so they are not
# very informative.  So we use the "significant_terms" aggregation, which
# ranks terms that are unusually frequent in the matched documents
# (relative to the whole index) ahead of terms that are common everywhere.
query = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"term": {"ingredients": "chocolate chips"}},
            ]
        }
    },
    "aggs": {
        "highly_correlated_words": {
            "significant_terms": {
                "field": "ingredients",
                "exclude": "chocolate chips",
                "min_doc_count": 1,
            }
        }
    }
}

res = execute_es_query('recipes', query)
res

{'took': 12,
 'timed_out': False,
 '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
 'hits': {'total': 1, 'max_score': 0.0, 'hits': []},
 'aggregations': {'highly_correlated_words': {'doc_count': 1,
   'bg_count': 2,
   'buckets': [{'key': 'sugar', 'doc_count': 1, 'score': 1.0, 'bg_count': 1},
    {'key': 'water', 'doc_count': 1, 'score': 1.0, 'bg_count': 1},
    {'key': 'flour', 'doc_count': 1, 'score': 1.0, 'bg_count': 1}]}}}
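
To read the aggregation result programmatically, the buckets can be pulled out of the nested response. This is a minimal sketch: `bucket_summary` is a hypothetical helper, and `sample` is a trimmed copy of the response shown above so the snippet runs without a live cluster.

```python
def bucket_summary(response, agg_name):
    """Return (key, doc_count, score) tuples for each bucket of one aggregation."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [(b["key"], b["doc_count"], b["score"]) for b in buckets]


# trimmed copy of the response printed above
sample = {
    "aggregations": {
        "highly_correlated_words": {
            "doc_count": 1,
            "bg_count": 2,
            "buckets": [
                {"key": "sugar", "doc_count": 1, "score": 1.0, "bg_count": 1},
                {"key": "water", "doc_count": 1, "score": 1.0, "bg_count": 1},
                {"key": "flour", "doc_count": 1, "score": 1.0, "bg_count": 1},
            ],
        }
    }
}

summary = bucket_summary(sample, "highly_correlated_words")
print(summary)
```

With a real response, `res` from `execute_es_query` would take the place of `sample`.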