# Setup

Before we start developing with Elasticsearch, let's make sure that

- we can connect to the cluster,
- some data are already available,
- and we are able to perform a query on the index.

In [None]:
from elasticsearch import Elasticsearch

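# Test connection
client = Elasticsearch(hosts="http://localhost:9200",
                       basic_auth=('elastic', 'changeme'))
# Disable replicas on all existing indices: on a single-node cluster,
# replica shards cannot be allocated and the health would stay 'yellow'
client.indices.put_settings(
    settings={
        "index.number_of_replicas": 0
    })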

assert client.cluster.health()['status'] == 'green', 'Cluster not healthy'


If the assertion passes, the connection is established and the cluster health is green, which means we're good to go!

In [None]:
# Create an index

response = client.indices.create(index='my-first-index')

assert response['acknowledged'], 'Index could not be created'
assert client.indices.exists(index='my-first-index')

In [None]:
# Add a sample document

client.index(index='my-first-index', id=23, document={
    'content': 'Michael Jeffrey Jordan (born February 17, 1963), also known by his initials MJ, is an American businessman and former professional basketball player. He played fifteen seasons in the National Basketball Association (NBA), winning six NBA championships with the Chicago Bulls. Jordan is the principal owner and chairman of the Charlotte Hornets of the NBA and of 23XI Racing in the NASCAR Cup Series. His biography on the official NBA website states: "By acclamation, Michael Jordan is the greatest basketball player of all time." He was integral in popularizing the NBA around the world in the 1980s and 1990s, becoming a global cultural icon in the process.'
}, refresh=True)  # refresh so the document is searchable right away
assert client.exists(index='my-first-index', id=23)

Now we've created an index and indexed one document - not bad, but we're just getting started! Let's try to search our data.

In [None]:
# this should be easy

client.search(index='my-first-index', query={
    'match': {
        'content': 'Michael'
    }
}, highlight={'fields':{'content':{}}}).body
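
The response body nests the highlighted fragments of each hit under `hits.hits[].highlight`, keyed by field name, with matches wrapped in `<em>` tags by default. As a minimal sketch, the hypothetical helper `highlighted_fragments` below collects them. The `sample_response` dict is an illustrative stand-in for a response body, not actual output of the cluster.

```python
def highlighted_fragments(response, field):
    """Collect all highlight fragments for `field` from a search response body."""
    fragments = []
    for hit in response["hits"]["hits"]:
        # Hits without a match in `field` carry no highlight entry for it.
        fragments.extend(hit.get("highlight", {}).get(field, []))
    return fragments


# Illustrative stand-in for a response body (assumption, not real cluster output).
sample_response = {
    "hits": {
        "hits": [
            {"_id": "23",
             "highlight": {"content": ["<em>Michael</em> Jeffrey Jordan (born February 17, 1963)"]}},
            {"_id": "42"},  # a hit without highlights
        ]
    }
}

print(highlighted_fragments(sample_response, "content"))
```

With a running cluster, the same helper could be applied to the `.body` of the search above.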

In [None]:
# maybe Elasticsearch has even more answers?

client.search(query={
    'match': {
        'content': 'GOAT'
    }
}).body

## Growing the set

We can already explore the search API with this minimal example. To make things more interesting, however, we need a bit more data. While the cluster was being set up, some Wikipedia articles were already ingested into it.

In [None]:
# count the indexed documents

client.count(index='logstash-articles-*').body

Within the prepared index, around a thousand documents are stored - that's more than one, and most are a bit longer than one paragraph.

When dealing with new data, it's usually a good idea to get a feel for its format. Feel free to explore the dataset in the Kibana app that was launched alongside the cluster (running on port 5601)! Here, we're content with a quick look at the fields containing data:

- content
- links
- summary
- tags
- title
- url

In [None]:
# let's have a look at the mapping

client.indices.get_mapping(index='logstash-articles-*').body
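
The mapping response is keyed by index name, with the fields listed under `mappings.properties`. A small sketch of how the field names could be pulled out of it follows; `field_names` is a hypothetical helper, and both the index name and the field types in `sample_mapping` are assumptions made up for illustration.

```python
def field_names(mapping_response):
    """Return the sorted top-level field names across all indices in a mapping response."""
    names = set()
    for index_mapping in mapping_response.values():
        properties = index_mapping.get("mappings", {}).get("properties", {})
        names.update(properties)
    return sorted(names)


# Illustrative stand-in (assumption): index name and field types are invented.
sample_mapping = {
    "logstash-articles-example": {
        "mappings": {
            "properties": {
                "title": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
                "content": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
                "url": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
            }
        }
    }
}

print(field_names(sample_mapping))  # ['content', 'title', 'url']
```

With a running cluster, `field_names(client.indices.get_mapping(index='logstash-articles-*').body)` should produce the list of fields for the prepared data.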

And let's see how many results we get for the same query as above ...

In [None]:
# Michael appears to be a fairly common name!

client.search(index='logstash-articles-*', query={
    'match': {
        'content': 'michael'
    }
}).body
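
By default, a search returns only the ten best-scoring hits, while the total number of matches is reported under `hits.total`. The sketch below reads that total; `total_hits` is a hypothetical helper and the numbers in `sample_response` are invented for illustration.

```python
def total_hits(response):
    """Return (value, relation) of the total hit count in a search response body."""
    total = response["hits"]["total"]
    # relation is 'eq' for an exact count and 'gte' when the count is a lower bound
    # (by default, totals are tracked exactly only up to 10,000 hits).
    return total["value"], total["relation"]


# Illustrative stand-in for a response body (assumption, not real cluster output).
sample_response = {"hits": {"total": {"value": 137, "relation": "eq"}, "hits": []}}

value, relation = total_hits(sample_response)
print(value, relation)  # 137 eq
```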

## Limitations

Elasticsearch indexes data using an [inverted index](https://en.wikipedia.org/wiki/Inverted_index), allowing fast searches across massive textual datasets. By default, dynamic mapping indexes strings as type [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html) with an additional [`keyword`](https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html) sub-field. So far, we have been searching on fields of type `text`, which use the default [standard analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html). It tokenizes and lowercases the text and works well in most generic use cases. However, it reaches its limits once typos, a Google-like search-as-you-type feel, or domain-specific normalization come into play.
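
To make the idea of an inverted index more tangible, here is a toy version in pure Python. It is only a sketch: `tokenize` is a very rough stand-in for the standard analyzer (lowercasing and splitting on non-alphanumeric characters), and the two sample documents are invented for illustration. It does not reflect how Elasticsearch stores its index internally.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Rough stand-in for an analyzer: lowercase, then split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each token to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


# Invented sample documents (assumption).
docs = {
    1: "Michael Jordan played for the Chicago Bulls.",
    2: "The Chicago Bulls won six NBA championships.",
}
index = build_inverted_index(docs)

print(sorted(index.get("chicago", set())))  # [1, 2]
print(sorted(index.get("michael", set())))  # [1]
print(sorted(index.get("micheal", set())))  # []
```

A lookup is an exact match on the token, which is why a query term that differs by a single swapped letter finds nothing.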

In [None]:
# a fairly standard typo yields no results

client.search(index='logstash-articles-*', query={
    'match': {
        'content': 'micheal'
    }
}).body
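
One way to tolerate such typos at query time, without changing how the data is analyzed, is the `fuzziness` option of the `match` query. The sketch below only builds the query body; `fuzzy_match` is a hypothetical helper, and whether fuzzy matching fits a given use case is a judgment call.

```python
def fuzzy_match(field, text, fuzziness="AUTO"):
    """Build a match query that tolerates small edit distances between terms."""
    return {"match": {field: {"query": text, "fuzziness": fuzziness}}}


query = fuzzy_match("content", "micheal")
print(query)

# With a running cluster, the query could be passed to the client used above:
# client.search(index='logstash-articles-*', query=query).body
```

With `AUTO`, the allowed edit distance depends on the term length, so longer terms such as `micheal` are permitted more edits than short ones.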

# Analytical features

To address such use cases, Elasticsearch offers an array of instruments that can be used individually or in conjunction to match feature requests and improve the accuracy of search results.

## Ngrams
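
As a first intuition, a character n-gram is a substring of fixed length taken from a sliding window over a term. The pure-Python sketch below is an illustration of the concept only, not of the `ngram` tokenizer or token filter that Elasticsearch provides; `char_ngrams` is a hypothetical helper.

```python
def char_ngrams(term, n=3):
    """Return all character n-grams of `term` using a sliding window of size n."""
    term = term.lower()
    return [term[i:i + n] for i in range(len(term) - n + 1)]


correct = char_ngrams("michael")
typo = char_ngrams("micheal")

print(correct)                           # ['mic', 'ich', 'cha', 'hae', 'ael']
print(typo)                              # ['mic', 'ich', 'che', 'hea', 'eal']
print(sorted(set(correct) & set(typo)))  # ['ich', 'mic']
```

Although the two spellings differ as whole tokens, they still share some trigrams, which is what gives n-gram based matching a chance to connect them.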

## Search-as-you-type

## Phonetic analysis

## NLP

# Further down the rabbit hole