# Setup

Before we start developing with Elasticsearch, let's make sure that

- we can connect to the cluster,
- some data are already available,
- and we are able to perform a query on the index.

In [1]:
from elasticsearch import Elasticsearch

# Test Connection
client = Elasticsearch(hosts="http://localhost:9200",
                       basic_auth=('elastic', 'changeme'))
client.indices.put_settings(
    settings={
        "index.number_of_replicas": 0
    })

assert client.cluster.health()['status'] == 'green', 'Cluster not healthy'


The connection should be established and the cluster health green - which means we're good to go!

In [2]:
# Create an index

assert client.indices.create(index='my-first-index')['acknowledged'], 'Index could not be created'
assert client.indices.exists(index='my-first-index')

In [3]:
# add a sample document

client.index(index='my-first-index',id=23,document={
    'summary': 'Michael Jeffrey Jordan (born February 17, 1963), also known by his initials MJ, is an American businessman and former professional basketball player. He played fifteen seasons in the National Basketball Association (NBA), winning six NBA championships with the Chicago Bulls. Jordan is the principal owner and chairman of the Charlotte Hornets of the NBA and of 23XI Racing in the NASCAR Cup Series. His biography on the official NBA website states: "By acclamation, Michael Jordan is the greatest basketball player of all time." He was integral in popularizing the NBA around the world in the 1980s and 1990s, becoming a global cultural icon in the process.'
})
assert client.exists(index='my-first-index',id=23)

Now we've created an index and indexed one document - not bad, but we're just getting started! Let's try to search our data.

In [4]:
# this should be easy

client.search(index='my-first-index', query={
    'match': {
        'summary': 'Michael'
    }},
    highlight={'fields': {'summary': {}}}
).body


{'took': 2,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 0, 'relation': 'eq'},
  'max_score': None,
  'hits': []}}

In [5]:
# maybe Elasticsearch has even more answers?

client.search(query={
    'match': {
        'summary': 'GOAT'
    }},
    highlight={'fields': {'summary': {}}}
).body


{'took': 2,
 'timed_out': False,
 '_shards': {'total': 2, 'successful': 2, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 0, 'relation': 'eq'},
  'max_score': None,
  'hits': []}}

## Growing the set

We can already explore the search API with this minimal example. To make things more interesting, however, we need a bit more data. While setting up the cluster, some articles from Wikipedia were already ingested into it.

In [6]:
# count the indexed documents

client.count(index='logstash-articles').body

{'count': 1001,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}}

Within the prepared index, around a thousand documents are stored - that's more than one, and most are a bit longer than one paragraph.

When dealing with new data, it's usually a good idea to get a feel for the format of the data. Feel free to use the launched Kibana app to explore the dataset! (Running on port 5601.) Here, we're content with a quick look at the fields containing data:

- content
- links
- summary
- tags
- title
- url

In [7]:
# let's have a look at the mapping

client.indices.get_mapping(index='logstash-articles').body

{'logstash-articles': {'mappings': {'dynamic_templates': [{'message_field': {'path_match': 'message',
      'match_mapping_type': 'string',
      'mapping': {'norms': False, 'type': 'text'}}},
    {'string_fields': {'match': '*',
      'match_mapping_type': 'string',
      'mapping': {'fields': {'keyword': {'ignore_above': 256,
         'type': 'keyword'}},
       'norms': False,
       'type': 'text'}}}],
   'properties': {'@timestamp': {'type': 'date'},
    '@version': {'type': 'keyword'},
    'content': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}},
     'norms': False},
    'geoip': {'dynamic': 'true',
     'properties': {'ip': {'type': 'ip'},
      'latitude': {'type': 'half_float'},
      'location': {'type': 'geo_point'},
      'longitude': {'type': 'half_float'}}},
    'links': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}},
     'norms': False},
    'summary': {'type': 'text',
     'fields': {'keywo

And let's see how many results we get for the same query as above ...

In [8]:
# Michael appears to be a fairly common name!

client.search(index='logstash-articles', query={
    'match': {
        'summary': 'michael'
    }}, 
    highlight={'fields': {'summary': {}}},
    source=['summary'],
).body


{'took': 55,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 14, 'relation': 'eq'},
  'max_score': 8.063337,
  'hits': [{'_index': 'logstash-articles',
    '_id': 'https://en.wikipedia.org/wiki/James_Dean_(2001_film)',
    '_score': 8.063337,
    '_ignored': ['content.keyword', 'summary.keyword'],
    '_source': {'summary': "James Dean is a 2001 American made-for-television biographical drama film based on the life of the American actor James Dean. James Franco plays the title role under the direction of Mark Rydell, who chronicles Dean's rise from a struggling actor to an A-list movie star in 1950s Hollywood. The film's supporting roles included Michael Moriarty, Valentina Cervi, Enrico Colantoni, and Edward Herrmann.\nThe biopic began development at Warner Bros. in the early 1990s. At one point, Michael Mann was contracted to direct with Leonardo DiCaprio starring in the lead role. After Mann's departure, Des McA

Elasticsearch indexes data using an [inverted index](https://en.wikipedia.org/wiki/Inverted_index), allowing fast searches across massive textual datasets. By default, strings are indexed as type [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html) with an additional [`keyword`](https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html) sub-field. So far, we have been searching on fields of type `text`, using the default [standard analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html), which tokenizes and lowercases text and works well in most generic use cases. However, it reaches its limits once typos, a Google-like search-as-you-type feel or domain-specific normalization come into play.
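To make the idea of an inverted index more tangible, here is a minimal pure-Python sketch. It is only an illustration of the concept, not how Elasticsearch (or Lucene) implements it; the three mini-documents and the function name `build_inverted_index` are made up for this example.

```python
from collections import defaultdict

# Hypothetical mini-corpus; IDs and texts are invented for illustration.
docs = {
    1: "Michael Jordan is an American businessman",
    2: "Michael Mann was contracted to direct",
    3: "Jordan is a country in Western Asia",
}

def build_inverted_index(docs):
    """Map each lowercased token to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

inverted = build_inverted_index(docs)
print(sorted(inverted["michael"]))             # [1, 2]
print(sorted(inverted.get("micheal", set())))  # [] - the typo has no entry
```

A lookup is a single dictionary access per term, which is why exact matches are fast, and also why a misspelled term finds nothing: it simply is not a key in the index.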

In [19]:
# a fairly standard typo yields no results

client.search(index='logstash-articles', query={
    'match': {
        'summary': 'micheal'
    }},
    highlight={'fields': {'summary': {}}},
    source=['summary'],
).body

{'took': 1,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 0, 'relation': 'eq'},
  'max_score': None,
  'hits': []}}

# Analytical features

To address such use cases, Elasticsearch offers an array of instruments that can be used individually or in conjunction to match feature requests and improve the accuracy of search results. We will explore a couple of features next.

## Ngrams

Ngrams are a well-known concept in linguistics and Natural Language Processing (NLP). The idea is to split each word into overlapping character sequences (tokens) of a bounded length. With a maximum length of 3, for example, a word such as _length_ is split into tokens of at most three characters each.

Let's examine the ngrams yielded by [the `ngram` tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html) with its default settings.

In [20]:
[t['token'] for t in client.indices.analyze(
    text='lengthy',
    tokenizer='ngram'
).body['tokens']]

['l', 'le', 'e', 'en', 'n', 'ng', 'g', 'gt', 't', 'th', 'h', 'hy', 'y']

As we see, the word `lengthy` is split into tokens with a minimum size of 1 and a maximum size of 2. Let's add a field to our index that uses ngrams!
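The splitting itself is easy to reproduce without a cluster. The following is a rough pure-Python sketch of character ngrams; the helper name `char_ngrams` is hypothetical, and it ignores tokenizer options such as `token_chars`.

```python
def char_ngrams(word, min_gram=1, max_gram=2):
    """Return all character ngrams of `word`, ordered by start position, then length."""
    tokens = []
    for start in range(len(word)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(word):
                tokens.append(word[start:start + size])
    return tokens

print(char_ngrams("lengthy"))
# ['l', 'le', 'e', 'en', 'n', 'ng', 'g', 'gt', 't', 'th', 'h', 'hy', 'y']
print(char_ngrams("lengthy", min_gram=2, max_gram=3))
# ['le', 'len', 'en', 'eng', 'ng', 'ngt', 'gt', 'gth', 'th', 'thy', 'hy']
```

The first call mirrors the default settings observed above; the second uses a minimum of 2 and a maximum of 3.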

In [21]:
# first, we need to define an analyzer that uses the ngram tokenizer
assert client.indices.close(index='logstash-articles').body['acknowledged'], 'Technical error' # technically, we need to close the index temporarily - don't do this on a live system unless you know no new data will be added while index is closed

client.indices.put_settings(
    index='logstash-articles',
    settings={
        'analysis': {
            'analyzer': {
                'ngram_analyzer': {
                    'tokenizer': 'ngram_tokenizer'
                }
            },
            'tokenizer': {
                'ngram_tokenizer': {
                    'type': 'ngram',
                    'min_gram': 2,
                    'max_gram': 3,
                    'token_chars': [
                        'letter',
                        'digit'
                    ]
                }
            }
        }
    }
)
assert client.indices.open(index='logstash-articles').body['acknowledged'], 'Technical error' # reopen index

In [22]:
# test the analyzer
[t['token'] for t in client.indices.analyze(
    index='logstash-articles',
    text='lengthy',
    analyzer='ngram_analyzer'
).body['tokens']]

['le', 'len', 'en', 'eng', 'ng', 'ngt', 'gt', 'gth', 'th', 'thy', 'hy']

In [23]:
# test the tokenizer
[t['token'] for t in client.indices.analyze(
    index='logstash-articles',
    text='lengthy',
    tokenizer='ngram_tokenizer'
).body['tokens']]

['le', 'len', 'en', 'eng', 'ng', 'ngt', 'gt', 'gth', 'th', 'thy', 'hy']

In [24]:
# and define a field that uses the analyzer

assert client.indices.put_mapping(
    index='logstash-articles',
    properties={
        'title': {
            'type': 'text',
            'norms': 'false',
            'fields': {
                'ngram': {
                    'type': 'text',
                    'analyzer': 'ngram_analyzer'
                }
            }
        },
        'summary': {
            'type': 'text',
            'norms': 'false',
            'fields': {
                'ngram': {
                    'type': 'text',
                    'analyzer': 'ngram_analyzer'
                }
            }
        },
        'content': {
            'type': 'text',
            'norms': 'false',
            'fields': {
                'ngram': {
                    'type': 'text',
                    'analyzer': 'ngram_analyzer'
                }
            }
        }
    }
).body['acknowledged'], 'Technical error'

assert client.update_by_query(index='logstash-articles').body['updated'] == 1001, 'Technical error' # re-index all documents in place so the new sub-fields get populated

In [25]:
# now we should be able to find something despite the typo

client.search(index='logstash-articles', query={
    'match': {
        'summary.ngram': 'micheal'
    }},
    highlight={'fields': {'summary.ngram': {}}},
    source=['summary'],
).body

{'took': 34,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 992, 'relation': 'eq'},
  'max_score': 15.79492,
  'hits': [{'_index': 'logstash-articles',
    '_id': 'https://en.wikipedia.org/wiki/Gabriel_Filippelli',
    '_score': 15.79492,
    '_ignored': ['content.keyword', 'summary.keyword'],
    '_source': {'summary': 'Gabriel Filippelli is an American biogeochemist and professor of Earth sciences at Indiana University-Purdue University Indianapolis (IUPUI). His research interests include biogeochemical cycling in the environment, and the links between environmental processes and human health.'},
    'highlight': {'summary.ngram': ['Gabriel Filippelli is an Amer<em>ic</em>an biogeo<em>che</em><em>mi</em>st and professor of Earth sciences at Indiana University-Purdue',
      'His res<em>ea</em>r<em>ch</em> interests include biogeo<em>che</em><em>mic</em><em>al</em> cycling in t<em>he</em> environment, and t<em>he

However, the results are certainly not yet what we want. Remember, the `ngram` tokenizer does nothing more than split words into tokens, so we should not be surprised that the results appear a bit random. After all, let's have a look at the ngrams derived from _Micheal_:

    ['mi', 'mic', 'ic', 'ich', 'ch', 'che', 'he', 'hea', 'ea', 'eal', 'al']

Though some ngrams would also appear in the tokenization of _Michael_, there is nothing special about them - Elasticsearch will simply match with the documents that contain the tokens most often. In general, it takes a little testing and trial-and-error to find the best minimum and maximum values for ngram length (the longer, the more specific the matches, but less error-tolerant). Let's dig deeper.
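To see how much of the typo actually overlaps with the intended word, we can compare the two ngram lists directly. This is a small offline sketch with a hypothetical helper `char_ngrams` (minimum 2, maximum 3, as configured above); it says nothing about how Elasticsearch scores the matches.

```python
def char_ngrams(word, min_gram=2, max_gram=3):
    """Character ngrams of `word`, ordered by start position, then length."""
    return [
        word[i:i + n]
        for i in range(len(word))
        for n in range(min_gram, max_gram + 1)
        if i + n <= len(word)
    ]

typo = char_ngrams("micheal")
intended = char_ngrams("michael")
shared = [t for t in typo if t in intended]

print(shared)  # ['mi', 'mic', 'ic', 'ich', 'ch']
print(f"{len(shared)} of {len(typo)} query ngrams also occur in 'michael'")
```

Short tokens such as `he` or `al` are among the remaining ngrams, and they occur in a great many English words, which helps explain why almost every document matches.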

## Search-as-you-type

The [search-as-you-type functionality](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-as-you-type.html) utilizes ngrams in a slightly different way compared to what we have done so far. Instead of tokenizing the entire word, it focuses on the beginning of words in order to return results quickly while a user is still typing the first few letters.

In [26]:
assert client.indices.put_mapping(
    index='logstash-articles',
    properties={
        'title': {
            'type': 'text',
            'norms': 'false',
            'fields': {
                'search_as_you_type': {
                    'type': 'search_as_you_type'
                }
            }
        }
    }
).body['acknowledged'], 'Technical error'

assert client.update_by_query(index='logstash-articles').body['updated'] == 1001, 'Technical error' # re-index all documents in place so the new sub-field gets populated

In [27]:
# now we can query the sub-fields that are generated automatically for us
# and simulate the results an incremental user input would yield

user_input = 'micheal'  # named user_input to avoid shadowing the built-in input()

for i in range(2, len(user_input)):
    response = client.search(index='logstash-articles', size=100, query={
        "multi_match": {
            "query": user_input[:i],
            "type": "bool_prefix",
            "fields": [
                "title.search_as_you_type",
                "title.search_as_you_type._2gram",
                "title.search_as_you_type._3gram",
                "title.search_as_you_type._index_prefix"
            ]
        }},
        source=['title'],
    ).body
    print(f"Input {user_input[:i]} yields a total {response['hits']['total']['value']} results with titles {[r['_source']['title'] for r in response['hits']['hits']]}")

Input mi yields a total 33 results with titles ['1974 United States Senate election in Missouri', '103 Mile Lake', 'Peter Milliman', 'Micropterix lambesiella', 'Middle nasal concha', 'Middletown, Indiana', 'Diving at the 2018 European Aquatics Championships – Mixed 3 m springboard synchro', 'Spring Fork, Missouri', '1968–69 Midland Football League', '1922 Minneapolis Marines season', 'Federico Millacet', 'Federal Ministry of Health (Germany)', 'Grand Haven, Michigan', 'Smock mill, Wolin', 'Kendriya Vidyalaya 9th Mile', '246th Mixed Brigade', 'Micellar solubilization', 'Michael Jordan', 'Mike Gallagher (political commentator)', 'Stephen V. R. Trowbridge (Michigan legislator)', 'The Dazzling Miss Davison', 'Ministry Trax! Box', 'Highway 25 Bridge (Minnesota)', 'Michael Paul Fleischer', 'Mikhail Makagonov', 'Mildred Ratcliffe', 'Jefferson County School District (Mississippi)', 'Michiyoshi', 'Midnattsrocken', 'Miguel Tavares (footballer)', 'Palfrey, West Midlands', 'John Miles (fl. 1404)',

As the user input approaches the typo, the result set size is already narrowed down and could be used to navigate the user to the intended article. However, this functionality does not yet _fix_ the typo.
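The narrowing effect can be imitated offline with plain prefix matching. This is a simplified sketch, not the actual `bool_prefix` implementation: the helper `prefix_matches` is hypothetical, and the five titles are a hand-picked subset of the results shown above.

```python
# A handful of titles taken from the result list above.
titles = [
    "Michael Jordan",
    "Mikhail Makagonov",
    "Middle nasal concha",
    "Peter Milliman",
    "Micellar solubilization",
]

def prefix_matches(prefix, titles):
    """Return the titles in which at least one word starts with `prefix`."""
    prefix = prefix.lower()
    return [t for t in titles if any(w.lower().startswith(prefix) for w in t.split())]

user_input = "micheal"
for i in range(2, len(user_input) + 1):
    print(f"{user_input[:i]!r}: {prefix_matches(user_input[:i], titles)}")
```

Up to `mich` the candidate list shrinks to the intended article; from `miche` onwards nothing matches any more, which is exactly the point where the typo kicks in.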

## Phonetic analysis

As Elasticsearch is an open source project, it is no surprise that contributors have built on the project to address their use cases. Extensions to Elasticsearch can take the shape of [plugins](https://www.elastic.co/guide/en/elasticsearch/plugins/current/intro.html), which are relatively easy to install. Here we take a look at the plugin for phonetic analysis, which enables us to get phonetic representations of input tokens.

In [28]:
# first, we need to define an analyzer that uses the phonetic token filter
assert client.indices.close(index='logstash-articles').body['acknowledged'], 'Technical error' # technically, we need to close the index temporarily - don't do this on a live system unless you know no new data will be added while index is closed

client.indices.put_settings(
    index='logstash-articles',
    settings={
        'analysis': {
            'analyzer': {
                'phonetic_analyzer': {
                    'tokenizer': 'standard',
                    'filter': [
                        'lowercase',
                        'beider_morse'
                    ]
                }
            },
            'filter': {
                'beider_morse': {
                    'type': 'phonetic',
                    'encoder': 'beider_morse',
                    'replace': False
                }
            }
        }
    }
)
assert client.indices.open(index='logstash-articles').body['acknowledged'], 'Technical error' # reopen index

In [29]:
# test the analyzer and compare tokens yielded by 'Michael' and 'Micheal'

michael_phonetic_tokens = [t['token'] for t in client.indices.analyze(
    index='logstash-articles',
    text=['michael'],
    analyzer='phonetic_analyzer'
).body['tokens']]
print('"Michael" yields:', michael_phonetic_tokens)

micheal_phonetic_tokens = [t['token'] for t in client.indices.analyze(
    index='logstash-articles',
    text=['micheal'],
    analyzer='phonetic_analyzer'
).body['tokens']]
print('"Micheal" yields:', micheal_phonetic_tokens)

print('Overlap:', [t for t in michael_phonetic_tokens if t in micheal_phonetic_tokens])

"Michael" yields: ['mQxYl', 'mQxail', 'mQxoil', 'mitsDl', 'mitsail', 'mitsoil', 'mixDl', 'mixYl', 'mixail', 'mixoil']
"Micheal" yields: ['mDxDl', 'mDxal', 'mDxil', 'mQxDl', 'mQxal', 'mQxil', 'miDl', 'mikDl', 'mikal', 'mikial', 'mikil', 'mikiol', 'misDl', 'misal', 'misil', 'mitsDl', 'mitsal', 'mitsil', 'mixDl', 'mixal', 'mixial', 'mixil', 'mixiol']
Overlap: ['mitsDl', 'mixDl']


We can already see that the phonetic representations of 'Michael' and 'Micheal' overlap. That could help us fix that pesky typo.
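Beider-Morse is too elaborate to reproduce here, but the underlying idea, mapping similar-sounding spellings to the same code, can be illustrated with the much simpler classic Soundex algorithm. Note that this sketch swaps in a different encoder than the one configured above, and the function `soundex` is a hand-written illustration rather than the plugin's code.

```python
# Consonant groups of American Soundex; vowels, h, w and y carry no code.
CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}

def soundex(word):
    """Return the four-character Soundex code of `word` (empty string for empty input)."""
    word = word.lower()
    if not word:
        return ""
    encoded = []
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded.append(code)
        prev = code
    return (word[0].upper() + "".join(encoded) + "000")[:4]

for name in ("michael", "micheal", "mikhail", "jordan"):
    print(name, soundex(name))
```

All three spellings of the first name collapse to `M240`, while `jordan` yields `J635`. A coarse code like this forgives the typo, but it also conflates names that a user may consider different.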

In [30]:
# define a field that uses the phonetic analyzer

assert client.indices.put_mapping(
    index='logstash-articles',
    properties={
        'title': {
            'type': 'text',
            'norms': 'false',
            'fields': {
                'phonetic': {
                    'type': 'text',
                    'analyzer': 'phonetic_analyzer'
                }
            }
        },
        'summary': {
            'type': 'text',
            'norms': 'false',
            'fields': {
                'phonetic': {
                    'type': 'text',
                    'analyzer': 'phonetic_analyzer'
                }
            }
        },
        'content': {
            'type': 'text',
            'norms': 'false',
            'fields': {
                'phonetic': {
                    'type': 'text',
                    'analyzer': 'phonetic_analyzer'
                }
            }
        }
    }
).body['acknowledged'], 'Technical error'


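# re-index all documents in place so the new sub-fields get populated
# (run as a background task and poll until it completes)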
import time

task_id = client.update_by_query(index='logstash-articles', wait_for_completion=False).body['task']

print('waiting for update', end='')
while not client.tasks.get(task_id=task_id).body['completed']:
    print('.', end='')
    time.sleep(1)
    
print(' update done')

waiting for update.............................................. update done


In [31]:
# let's see if we can find something now

client.search(index='logstash-articles', query={
    'match': {
        'title.phonetic': 'micheal'
    }},
    highlight={'fields': {'title.phonetic': {}}},
    source=['title'],
).body

{'took': 19,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 4, 'relation': 'eq'},
  'max_score': 11.736343,
  'hits': [{'_index': 'logstash-articles',
    '_id': 'https://en.wikipedia.org/wiki/Mikhail_Makagonov',
    '_score': 11.736343,
    '_ignored': ['content.keyword'],
    '_source': {'title': 'Mikhail Makagonov'},
    'highlight': {'title.phonetic': ['<em>Mikhail</em> Makagonov']}},
   {'_index': 'logstash-articles',
    '_id': 'https://en.wikipedia.org/wiki/Michael_Jordan',
    '_score': 10.53585,
    '_ignored': ['content.keyword', 'summary.keyword'],
    '_source': {'title': 'Michael Jordan'},
    'highlight': {'title.phonetic': ['<em>Michael</em> Jordan']}},
   {'_index': 'logstash-articles',
    '_id': 'https://en.wikipedia.org/wiki/Lars_van_Meijel',
    '_score': 10.398344,
    '_ignored': ['content.keyword'],
    '_source': {'title': 'Lars van Meijel'},
    'highlight': {'title.phonetic': ['Lars van <

If everything went well, we should now have found some results that might be acceptable to a user base. So far, however, we have focussed on finding suitable field types and analyzers to prepare the data we want to query, while the queries themselves have been quite basic and don't take advantage of Elasticsearch's query DSL.

## Fuzziness

Fuzziness is a fairly simple concept that is available in many query types of the [Elasticsearch query DSL](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html). Applied to our earlier example, it fixes the problem by expanding the query terms to similar terms within a given edit distance, e.g. by changing, removing or inserting a character, or by transposing two adjacent ones.
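The notion of edit distance can be sketched in a few lines of Python. The function below is a textbook implementation (Levenshtein distance, optionally counting an adjacent transposition as a single edit); it is meant to illustrate the concept, and the name `edit_distance` and its `transpositions` flag are made up for this example.

```python
def edit_distance(a, b, transpositions=True):
    """Edit distance between `a` and `b`; a swap of adjacent characters costs 1 if enabled."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all characters of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all characters of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]

print(edit_distance("micheal", "michael"))                        # 1
print(edit_distance("micheal", "michael", transpositions=False))  # 2
```

With transpositions counted as one edit, _micheal_ is only one step away from _michael_, so a fuzziness of 1 is enough; without them, the same typo would require a fuzziness of 2.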

In [32]:
# let's see if a fuzzy query fixes the typo problem

client.search(index='logstash-articles', query={
    'match': {
        'title': {
            'query': 'micheal',
            'fuzziness': 1,
            'fuzzy_transpositions': True
        }
    }},
    highlight={'fields': {'title': {}}},
    source=['title'],
).body

{'took': 17,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 2, 'relation': 'eq'},
  'max_score': 7.235198,
  'hits': [{'_index': 'logstash-articles',
    '_id': 'https://en.wikipedia.org/wiki/Michael_Jordan',
    '_score': 7.235198,
    '_ignored': ['content.keyword', 'summary.keyword'],
    '_source': {'title': 'Michael Jordan'},
    'highlight': {'title': ['<em>Michael</em> Jordan']}},
   {'_index': 'logstash-articles',
    '_id': 'https://en.wikipedia.org/wiki/Michael_Paul_Fleischer',
    '_score': 7.235198,
    '_ignored': ['content.keyword'],
    '_source': {'title': 'Michael Paul Fleischer'},
    'highlight': {'title': ['<em>Michael</em> Paul Fleischer']}}]}}

Fuzzy queries are tempting, but they have two disadvantages: their relatively high computational cost may slow down query execution on larger datasets, and they tend to grow the result set (see [precision and recall problem](https://en.wikipedia.org/wiki/Precision_and_recall)).

## Compounding queries

It is time to combine multiple queries into one more complex query that allows more fine-grained control over our result set and its ordering. A very popular compound query is the Boolean query, which does exactly what the name suggests: it combines multiple queries into a boolean construct (think `and`, `or`, and `not`).

Each Boolean query consists of up to four parts:

    "bool": {
        "must": {}, // define what a result must look like
        "filter": {}, // define what a result must look like - but do not score positive results (thus faster and applicable for caching)
        "should": {}, // define what a result should look like - matches raise the score; without "must" or "filter", at least one clause has to match (tunable via minimum_should_match)
        "must_not": {} // define what a result must not look like
    }

In [33]:
client.search(index='logstash-articles', query={
    'bool': {
        'filter': [
            {
                "multi_match": {
                    "query": "micheal",
                    'fields': [
                        'title.ngram',
                        'summary.ngram',
                        'content.ngram'
                    ]
                }
            }
        ],
        'must': [
            {
                "multi_match": {
                    "query": "micheal",
                    'fuzziness': 2,
                    'fuzzy_transpositions': True,
                    'fields': [
                        'title',
                        'summary',
                        'content'
                    ]
                }
            }
        ],
        'should': [
            {
                'match': {
                    'title.phonetic': {
                        'query': 'micheal'
                    }
                }
            },
            {
                'match': {
                    'title': {
                        'query': 'micheal'
                    }
                }
            }
        ]
    }},
    highlight={'fields': {'title': {}, 'summary': {}, 'content': {}}},
    source=['title'],
).body

{'took': 65,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 74, 'relation': 'eq'},
  'max_score': 22.729939,
  'hits': [{'_index': 'logstash-articles',
    '_id': 'https://en.wikipedia.org/wiki/Michael_Jordan',
    '_score': 22.729939,
    '_ignored': ['content.keyword', 'summary.keyword'],
    '_source': {'title': 'Michael Jordan'},
    'highlight': {'summary': ['<em>Michael</em> Jeffrey Jordan OLY (born February 17, 1963), also known by his initials MJ, is an American businessman',
      'His biography on the official NBA website states: "By acclamation, <em>Michael</em> Jordan is the greatest basketball'],
     'title': ['<em>Michael</em> Jordan'],
     'content': ['Jordan; the marketing of <em>Michael</em> Jordan.',
      'Rare Air: <em>Michael</em> on <em>Michael</em>, with Mark Vancil and Walter Iooss (Harper San Francisco, 1993).',
      "Jordan's Restaurant\n<em>Michael</em> Jordan: Chaos in the Windy City

We are taking a multi-layered approach to our query:

- First, we filter for any articles that contain ngrams as related to the search term in any of the relevant fields we have looked at so far. This reduces the number of documents Elasticsearch has to consider during the scoring phase and speeds up the query execution time while not discarding any eligible documents;
- Second, we set a baseline score for any documents that contain a fuzzy variant of the search term in the same fields. Because we have already reduced the number of candidates, we can afford a fairly greedy approach with an allowed edit distance of 2;
- Then, we score documents higher that match one or both of two matching queries, either with a phonetic equivalence or a direct lexical match.

Now we have a query that takes advantage of multiple text analysis features, is to an extent tolerant to typing mistakes and scans several relevant fields. To improve user experience, we could further incorporate the `search_as_you_type`-fields. Feel free to try your hand at it!
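One possible way to approach this, sketched under assumptions: wrap the layered query in a small helper and add a `bool_prefix` clause on the `search_as_you_type` sub-fields as a further `should` clause. The function name `build_layered_query` is hypothetical, and the clause placement is only one of several reasonable designs.

```python
def build_layered_query(term):
    """Build the layered bool query from above for an arbitrary search term,
    plus an extra 'should' clause on the search_as_you_type sub-fields."""
    text_fields = ['title', 'summary', 'content']
    return {
        'bool': {
            # cheap pre-selection on the ngram sub-fields (not scored)
            'filter': [{'multi_match': {
                'query': term,
                'fields': [f'{field}.ngram' for field in text_fields],
            }}],
            # baseline score: fuzzy match on the plain text fields
            'must': [{'multi_match': {
                'query': term,
                'fuzziness': 2,
                'fuzzy_transpositions': True,
                'fields': text_fields,
            }}],
            # optional boosts: phonetic, exact and prefix matches on the title
            'should': [
                {'match': {'title.phonetic': {'query': term}}},
                {'match': {'title': {'query': term}}},
                {'multi_match': {
                    'query': term,
                    'type': 'bool_prefix',
                    'fields': [
                        'title.search_as_you_type',
                        'title.search_as_you_type._2gram',
                        'title.search_as_you_type._3gram',
                    ],
                }},
            ],
        }
    }

query = build_layered_query('micheal')
# With a running cluster, the query could be passed to the client like this:
# client.search(index='logstash-articles', query=query, source=['title']).body
```

Keep in mind that the `must` clause still has to match, so for very short partial inputs it may be worth relaxing it or moving the prefix clause out of `should`.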

# Further down the rabbit hole

While we now have a good starting point, we have barely scratched the surface of the text analysis tools included in Elasticsearch. To round things off, let's have a cursory glance at some further features we could use to make our query design better.

## Regression & Testing

As said earlier, Elasticsearch's query DSL offers various ways to match your query design to the user's expectations. To test whether an adjusted query returns the results you expect, it is a good idea to build a test suite, e.g. using [the ranking evaluation API](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-rank-eval.html). This allows defining a set of documents that you either want or do not want to appear in the search result list, depending on the query and its parameters.

    POST logstash-articles/_rank_eval
    {
        "requests": [
            {
                "id": "micheal_query",
                "request": { // the request to be tested
                    "query": { "match": { "title.phonetic": "micheal" } }
                },
                "ratings": [
                    { "_index": "logstash-articles", "_id": "https://en.wikipedia.org/wiki/Michael_Jordan", "rating": 3 },
                    { "_index": "logstash-articles", "_id": "https://en.wikipedia.org/wiki/Michael_Paul_Fleischer", "rating": 1 },
                    { "_index": "logstash-articles", "_id": "https://en.wikipedia.org/wiki/Scheldeprijs", "rating": 0 }
                ]
            },
            {
                "id": "michael_query",
                "request": {
                    "query": { "match": { "title": "michael" } }
                },
                "ratings": [
                    { "_index": "logstash-articles", "_id": "https://en.wikipedia.org/wiki/Michael_Jordan", "rating": 3 }
                ]
            }
        ],
        "metric": {
            "precision": {
                "k": 20,
                "relevant_rating_threshold": 1,
                "ignore_unlabeled": false
            }
        }
    }

This is especially useful if we are in constant conversation with our testers and want to further develop our query design, but also keep what has worked well so far.
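To get a feeling for what the `precision` metric in the request above measures, here is a simplified offline sketch. It is an approximation written for illustration, not the API's implementation; the function `precision_at_k` and the shortened document IDs are hypothetical, and the hit list is just the first three titles from the phonetic query earlier.

```python
def precision_at_k(hit_ids, ratings, k=20, relevant_rating_threshold=1, ignore_unlabeled=False):
    """Share of relevant documents among the top-k hits.

    Unrated hits count as irrelevant unless `ignore_unlabeled` is True,
    in which case they are left out of the calculation altogether.
    """
    top = hit_ids[:k]
    if ignore_unlabeled:
        top = [doc_id for doc_id in top if doc_id in ratings]
    if not top:
        return 0.0
    relevant = sum(1 for doc_id in top if ratings.get(doc_id, 0) >= relevant_rating_threshold)
    return relevant / len(top)

# Ratings as in the "micheal_query" request (IDs shortened for readability).
ratings = {'Michael_Jordan': 3, 'Michael_Paul_Fleischer': 1, 'Scheldeprijs': 0}
# Assumed hit order for illustration.
hits = ['Mikhail_Makagonov', 'Michael_Jordan', 'Lars_van_Meijel']

print(f"{precision_at_k(hits, ratings):.2f}")                         # 0.33
print(f"{precision_at_k(hits, ratings, ignore_unlabeled=True):.2f}")  # 1.00
```

The difference between the two numbers shows why rating the documents you do not want to see matters just as much as rating the ones you do.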

## Runtime fields



Another very useful feature for development purposes is [`runtime_fields`](https://www.elastic.co/guide/en/elasticsearch/reference/current/runtime.html). Normally, as we have done so far, the fields of an index are declared before the _ingest_ of data, that is, before any data are loaded into Elasticsearch. Alternatively, fields can be added to an existing index, but they are not usable until the existing documents have been re-indexed so that the defined analysis is applied to them. `runtime_fields` take a different approach: the value of the field is determined at `runtime`, i.e. while the data are being queried or analyzed. The value of the `runtime_field` can be set by providing a script through the search API or by adding it manually through Kibana. Let's have a look.

    GET logstash-articles/_search
    {
        "runtime_mappings": {
            "#links": {
                "type": "long",
                "script": "emit(doc['links.keyword'].length);"
            }
        },
        "query": {
            "range": {
                "#links": {
                    "gte": 1000
                }
            }
        },
        "_source": ["title"]
    }

`runtime_fields` are useful during development or for fringe use cases. Once you find that a field is useful to your use case, you have to decide whether to keep computing it at runtime or to convert it into a permanently indexed field: runtime fields save storage, while indexed fields are faster to query.
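The same request can also be expressed with the Python client used throughout this notebook. The sketch below only assembles the keyword arguments so that it can be inspected without a cluster; the helper name `links_count_search` is hypothetical, and it assumes a client version whose `search` method accepts `runtime_mappings`, `query` and `source` as keyword arguments.

```python
def links_count_search(min_links):
    """Keyword arguments for a search on documents with at least `min_links` links."""
    return {
        'index': 'logstash-articles',
        'runtime_mappings': {
            '#links': {
                'type': 'long',
                # computed per document at query time
                'script': "emit(doc['links.keyword'].length);",
            }
        },
        'query': {'range': {'#links': {'gte': min_links}}},
        'source': ['title'],
    }

search_kwargs = links_count_search(1000)
# With a running cluster:
# client.search(**search_kwargs).body
```

Wrapping the request in a function makes it easy to experiment with different thresholds before deciding whether the field deserves a permanent mapping.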

## NLP

Now let's get a little more fancy. As stated above, plugins and extensions are a part of the Elastic stack as well. [OpenNLP](https://opennlp.apache.org/) is a Machine Learning (ML)-based framework for the processing of natural language text and can be utilized for various tasks, including named entity recognition (NER). Elasticsearch can be configured to load trained ML models to perform analysis on ingested text. To do this, you first need to [load the plugin](https://github.com/spinscale/elasticsearch-ingest-opennlp) and the necessary ML models into Elasticsearch by downloading both and then configuring the service to utilize the model files.

The setup steps are provided for you in this workspace, and you can try out the NER models using the code below.

In [34]:
# create ingest pipeline that uses the OpenNLP processor
assert client.ingest.put_pipeline(
    id='opennlp-pipeline',
    processors=[
    {
      "opennlp" : {
        "field" : "text",
        "annotated_text_field" : "annotated_text"
      }
    }
  ]
).body['acknowledged'], 'Ingest pipeline could not be created'

# get an example
example = client.search(index='logstash-articles', query={
    'match': {
        'title': 'Michael Jordan'
    }
}).body
content = example['hits']['hits'][0]['_source']['content']

# index the example text
client.index(
    index='annotated-index',
    id='23',
    document={
        'text': content
    },
    pipeline='opennlp-pipeline',
    timeout="600s"
)

# NOTE this may cause a timeout - don't worry, the data is still processed

ObjectApiResponse({'_index': 'annotated-index', '_id': '23', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 0, '_primary_term': 1})

In [35]:
entities = client.get(
    id='23',
    index='annotated-index',
    source=['entities.*']
).body['_source']['entities']

print('Mentioned persons:', entities['persons'])
print('Mentioned locations:', entities['locations'])
print('Mentioned dates:', entities['dates'])

Mentioned persons: ['Condor', 'Sandro Miller', 'Marv Albert', 'Chris Mullin', 'Simon', 'Michael', 'Karl Malone', 'Jason Kidd', 'Story', 'Bryant', 'Harper San Francisco', 'Charles Oakley', 'Mark Vancil', 'Mike', 'James R . Jordan Sr .', 'Bob Knight', 'Russell Westbrook', 'Minor League Baseball', 'Ed Bradley', 'Jordan', 'John R . Wooden', 'Daniel Green', 'Ben', 'Rodman', 'Kobe Bryant', 'With Scottie Pippen', 'Ron Harper', 'Rod Strickland', 'After Jordan', 'Nick Anderson', "Shaquille O ' Neal", 'Stu Inman', 'Jordan Brand', 'Brand', 'Marcus', 'Sam Perkins', 'Wade', 'Babe Ruth', 'Rip " Hamilton', 'David L .', 'Kwame Brown', 'Jerry West', 'Jordan "', 'Abe Pollin', 'James Harden', 'Dean Smith', 'Abdul - Jabbar', 'Steve Alford', 'Jordan Rules', 'Derek Jeter', 'Richard Esquinas', 'Michael Jordan Celebrity Invitational', 'Bugs Bunny', 'Victoria', 'Jason Hehir', 'Travis Scott', 'David Steward', 'Larry Hughes', 'Johnson', 'James', 'David Thompson', 'George Shinn', 'Barack Obama', 'Jerry Reinsdorf'