# Setup

Before we start developing with Elasticsearch, let's make sure that

- we can connect to the cluster,
- some data are already available,
- and we are able to perform a query on the index.

In [None]:
from elasticsearch import Elasticsearch

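# Test Connection
client = Elasticsearch(hosts="http://localhost:9200",
                       basic_auth=('elastic', 'changeme'))

# no index is given, so this applies to all indices; without replicas,
# the cluster can report green health (assuming a single-node setup)
client.indices.put_settings(
    settings={
        "index.number_of_replicas": 0
    })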

assert client.cluster.health()['status'] == 'green', 'Cluster not healthy'


The connection should be established and the cluster health green - which means we're good to go!

In [None]:
# Create an index

response = client.indices.create(index='my-first-index')

assert response['acknowledged'], 'Index could not be created'
assert client.indices.exists(index='my-first-index')

In [None]:
# Add a sample document

client.index(index='my-first-index', id=23, document={
    'summary': 'Michael Jeffrey Jordan (born February 17, 1963), also known by his initials MJ, is an American businessman and former professional basketball player. He played fifteen seasons in the National Basketball Association (NBA), winning six NBA championships with the Chicago Bulls. Jordan is the principal owner and chairman of the Charlotte Hornets of the NBA and of 23XI Racing in the NASCAR Cup Series. His biography on the official NBA website states: "By acclamation, Michael Jordan is the greatest basketball player of all time." He was integral in popularizing the NBA around the world in the 1980s and 1990s, becoming a global cultural icon in the process.'
})
assert client.exists(index='my-first-index', id=23)

Now we've created an index and indexed one document - not bad, but we're just getting started! Let's try to search our data.

In [None]:
# this should be easy

client.search(index='my-first-index', query={
    'match': {
        'summary': 'Michael'
    }},
    highlight={'fields': {'summary': {}}}
).body


In [None]:
# maybe Elasticsearch has even more answers?

client.search(query={
    'match': {
        'summary': 'GOAT'
    }},
    highlight={'fields': {'summary': {}}}
).body


## Growing the set

We can already explore the search API with this minimal example. However, to make things more interesting, we should accumulate a bit more data. While the cluster was being set up, some articles from Wikipedia were already ingested into it.

In [None]:
# count the indexed documents

client.count(index='logstash-articles').body

Within the prepared index, around a thousand documents are stored - that's more than one, and most are a bit longer than one paragraph.

When dealing with new data, it's usually a good idea to get a feel for the format of the data. Feel free to use the launched Kibana app to explore the dataset! (Running on port 5601.) Here, we're content with a quick look at the fields containing data:

- content
- links
- summary
- tags
- title
- url
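
Such a field list can also be read from a mapping response. The following is a minimal sketch: `top_level_fields` is a hypothetical helper, and `sample_mapping` is a hand-written stand-in shaped like a mapping response (index name, then `mappings`, then `properties`), not the actual mapping of the prepared index.

```python
def top_level_fields(mapping_body, index):
    """Return the sorted top-level field names of one index from a mapping response."""
    properties = mapping_body[index]['mappings'].get('properties', {})
    return sorted(properties)


# hand-written sample for illustration only; the field types are assumptions
sample_mapping = {
    'logstash-articles': {
        'mappings': {
            'properties': {
                'title': {'type': 'text'},
                'summary': {'type': 'text'},
                'url': {'type': 'keyword'},
            }
        }
    }
}

print(top_level_fields(sample_mapping, 'logstash-articles'))
# ['summary', 'title', 'url']
```

Against a running cluster, the same helper could be fed the `.body` of `client.indices.get_mapping(index='logstash-articles')`.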

In [None]:
# let's have a look at the mapping

client.indices.get_mapping(index='logstash-articles').body

And let's see how many results we get for the same query as above ...
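
The number of matches is reported under `hits.total.value` in the response body, and the highlighted snippets sit next to each hit. Here is a minimal sketch of how to pull both out. `summarize_hits` is a hypothetical helper, and `sample_body` is a hand-written example of a response body, not real output from the cluster.

```python
def summarize_hits(body):
    """Return the total hit count and the highlight snippets of a search response body."""
    total = body['hits']['total']['value']
    highlights = [hit.get('highlight', {}) for hit in body['hits']['hits']]
    return total, highlights


# hand-written sample for illustration only
sample_body = {
    'hits': {
        'total': {'value': 2, 'relation': 'eq'},
        'hits': [
            {'_id': 'a', '_source': {'summary': 'Michael ...'},
             'highlight': {'summary': ['<em>Michael</em> ...']}},
            {'_id': 'b', '_source': {'summary': '... Michael'}},
        ],
    }
}

total, highlights = summarize_hits(sample_body)
print(total)       # 2
print(highlights)  # [{'summary': ['<em>Michael</em> ...']}, {}]
```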

In [None]:
# Michael appears to be a fairly common name!

client.search(index='logstash-articles', query={
    'match': {
        'summary': 'michael'
    }},
    highlight={'fields': {'summary': {}}},
    source=['summary'],
).body


## Limitations

Elasticsearch indexes data using an [inverted index](https://en.wikipedia.org/wiki/Inverted_index), which allows fast searches across massive textual datasets. By default, strings are indexed both as type [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html) and as type [`keyword`](https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html). So far, we have been searching on fields of type `text` with the default [standard analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html), which tokenizes text and works well in most generic use cases. However, it reaches its limits once typos, a Google-like search-as-you-type feel, or domain-specific normalization come into play.
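
To see why a typo yields nothing, it helps to picture the inverted index as a lookup table from tokens to documents. The following is a toy sketch in plain Python, not how Elasticsearch is implemented. The `tokenize` function is a crude stand-in for the standard analyzer, and the two documents are made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (crude stand-in for an analyzer)."""
    return [token for token in re.split(r'[^a-z0-9]+', text.lower()) if token]


def build_inverted_index(docs):
    """Map every token to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


# made-up documents for illustration
docs = {
    1: 'Michael Jordan played for the Chicago Bulls.',
    2: 'The Chicago River flows through the city.',
}
inverted_index = build_inverted_index(docs)

print(sorted(inverted_index.get('chicago', set())))  # [1, 2]
print(sorted(inverted_index.get('michael', set())))  # [1]
print(sorted(inverted_index.get('micheal', set())))  # []
```

A lookup only succeeds on an exact token, so the misspelled `micheal` finds no entry at all.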

In [None]:
# a fairly standard typo yields no results

client.search(index='logstash-articles', query={
    'match': {
        'summary': 'micheal'
    }},
    highlight={'fields': {'summary': {}}},
    source=['summary'],
).body

# Analytical features

To address such use cases, Elasticsearch offers an array of instruments that can be used individually or in conjunction to match feature requests and improve the accuracy of search results. We will explore a couple of features next.

## Ngrams

Ngrams are a well-known concept in linguistics and Natural Language Processing (NLP). The idea is to split a word into overlapping sequences of letters up to a defined maximum length, e.g. 3. A word such as _length_ would then be split into tokens like _len_, _eng_, _ngt_ and _gth_, plus any shorter ngrams the minimum length allows.

Let's examine the ngrams yielded by [the ngram tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html) with its default settings.
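
The splitting itself is easy to reproduce in plain Python. The sketch below is for illustration only: `char_ngrams` is a hypothetical helper, and its parameter names merely mirror the `min_gram` and `max_gram` tokenizer settings. It walks over every start position and emits each allowed length.

```python
def char_ngrams(word, min_gram=1, max_gram=2):
    """Return all character ngrams of `word`, ordered by start position, then by length."""
    tokens = []
    for start in range(len(word)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(word):
                tokens.append(word[start:start + size])
    return tokens


print(char_ngrams('length', min_gram=3, max_gram=3))
# ['len', 'eng', 'ngt', 'gth']
print(char_ngrams('length', min_gram=2, max_gram=3))
# ['le', 'len', 'en', 'eng', 'ng', 'ngt', 'gt', 'gth', 'th']
```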

In [None]:
[t['token'] for t in client.indices.analyze(
    text='lengthy',
    tokenizer='ngram'
).body['tokens']]

As we see, the word `lengthy` is split into tokens with a minimum size of 1 and a maximum size of 2. Let's add a field to our index that uses ngrams!

In [None]:
# first, we need to define an analyzer that uses the ngram tokenizer
# technically, we need to close the index temporarily to change its analysis settings
# (don't do this on a live system unless you know no new data will be added while the index is closed)
assert client.indices.close(index='logstash-articles').body['acknowledged'], 'Technical error'

client.indices.put_settings(
    index='logstash-articles',
    settings={
        'analysis': {
            'analyzer': {
                'ngram_analyzer': {
                    'tokenizer': 'ngram_tokenizer'
                }
            },
            'tokenizer': {
                'ngram_tokenizer': {
                    'type': 'ngram',
                    'min_gram': 2,
                    'max_gram': 3,
                    'token_chars': [
                        'letter',
                        'digit'
                    ]
                }
            }
        }
    }
)
assert client.indices.open(index='logstash-articles').body['acknowledged'], 'Technical error' # reopen index

In [None]:
# test the analyzer
[t['token'] for t in client.indices.analyze(
    index='logstash-articles',
    text='lengthy',
    analyzer='ngram_analyzer'
).body['tokens']]

In [None]:
# test the tokenizer
[t['token'] for t in client.indices.analyze(
    index='logstash-articles',
    text='lengthy',
    tokenizer='ngram_tokenizer'
).body['tokens']]

In [None]:
# and define a field that uses the analyzer

assert client.indices.put_mapping(
    index='logstash-articles',
    properties={
        'summary': {
            'type': 'text',
            'norms': 'false',
            'fields': {
                'ngram': {
                    'type': 'text',
                    'analyzer': 'ngram_analyzer'
                }
            }
        }
    }
).body['acknowledged'], 'Technical error'

# re-index all documents in place so that the new sub-field gets populated
assert client.update_by_query(index='logstash-articles').body['updated'] == 1000, 'Technical error'

In [None]:
# now we should be able to find something despite the typo

client.search(index='logstash-articles', query={
    'match': {
        'summary.ngram': 'micheal'
    }},
    highlight={'fields': {'summary.ngram': {}}},
    source=['summary'],
).body

However, the results are certainly not yet what we want. Remember, the `ngram` tokenizer does nothing more than split words into tokens, so we should not be surprised that the results appear a bit random. Let's have a look at the ngrams derived from _Micheal_:

    ['mi', 'mic', 'ic', 'ich', 'ch', 'che', 'he', 'hea', 'ea', 'eal', 'al']

Though some ngrams would also appear in the tokenization of _Michael_, there is nothing special about them - Elasticsearch will simply match with the documents that contain the tokens most often. In general, it takes a little testing and trial-and-error to find the best minimum and maximum values for ngram length (the longer, the more specific the matches, but less error-tolerant). Let's dig deeper.
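
The overlap between the typo and the correct spelling can be checked in plain Python. This is a sketch for illustration: `char_ngrams` is a hypothetical helper that imitates the tokenizer settings used above (`min_gram` 2, `max_gram` 3), and it says nothing about how Elasticsearch scores the matches.

```python
def char_ngrams(word, min_gram=2, max_gram=3):
    """Return all character ngrams of `word` with lengths between min_gram and max_gram."""
    return [word[i:i + n]
            for i in range(len(word))
            for n in range(min_gram, max_gram + 1)
            if i + n <= len(word)]


typo = set(char_ngrams('micheal'))
correct = set(char_ngrams('michael'))
shared = sorted(typo & correct)

print(shared)                         # ['ch', 'ic', 'ich', 'mi', 'mic']
print(len(shared), 'of', len(typo))   # 5 of 11
```

Fewer than half of the query's ngrams point at the intended word, and short ngrams such as `ch` or `al` occur in many unrelated words as well.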

## Search-as-you-type

The [search-as-you-type functionality](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-as-you-type.html) utilizes ngrams in a slightly different way than what we have done so far. Instead of tokenizing the entire word, it focuses on the beginning of words, which enables a fast search that returns results while a user is still typing the first few letters.
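
The idea of anchoring ngrams at the start of a word can be sketched in plain Python. This is a simplified illustration, not the actual implementation of the `search_as_you_type` field type. `edge_ngrams` and `matches_prefix` are hypothetical helpers, and the `max_gram` default of 10 is an arbitrary assumption.

```python
def edge_ngrams(word, min_gram=1, max_gram=10):
    """Return the prefixes of `word` with lengths between min_gram and max_gram."""
    upper = min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, upper + 1)]


def matches_prefix(text, typed):
    """Check whether any word in `text` starts with the letters typed so far."""
    typed = typed.lower()
    return any(typed in edge_ngrams(word) for word in text.lower().split())


print(edge_ngrams('michael', max_gram=3))      # ['m', 'mi', 'mic']
print(matches_prefix('Michael Jordan', 'mi'))  # True
print(matches_prefix('Michael Jordan', 'jo'))  # True
print(matches_prefix('Michael Jordan', 'ic'))  # False
```

Unlike the plain ngrams above, `ic` no longer matches, because only the beginnings of words are indexed.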

In [None]:
assert client.indices.put_mapping(
    index='logstash-articles',
    properties={
        'summary': {
            'type': 'text',
            'norms': 'false',
            'fields': {
                'search_as_you_type': {
                    'type': 'search_as_you_type'
                }
            }
        }
    }
).body['acknowledged'], 'Technical error'

# re-index all documents in place so that the new sub-field gets populated
assert client.update_by_query(index='logstash-articles').body['updated'] == 1000, 'Technical error'

In [None]:
# now we can actually query on multiple fields that are generated automatically for us

client.search(index='logstash-articles', query={
    "multi_match": {
        "query": "mi",
        "type": "bool_prefix",
        "fields": [
            "summary.search_as_you_type",
            "summary.search_as_you_type._2gram",
            "summary.search_as_you_type._3gram",
            "summary.search_as_you_type._index_prefix"
        ]
    }},
    source=['summary'],
).body


## Phonetic analysis

## NLP

# Further down the rabbit hole