# Elasticsearch data model:  Lucene

Thus far we have used Kibana to visualize data stored in Elasticsearch.  Most of the time this is good enough for whatever insights you are seeking.

Under the hood, Kibana queries Elasticsearch, which is built on a powerful search library named **Lucene** and accepts queries written in Lucene's query syntax.

Lucene queries are *string*-oriented (much like an internet search engine).  Somewhat like SQL, they can be combined with filtering, grouping, and aggregations.  The stringy focus, however, gives Lucene some unique capabilities (and limitations).
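To give a feel for what a string-oriented query looks like, here is a minimal sketch that builds a URI search request carrying a Lucene query string in the `q` parameter. The index name `recipes` and the field `ingredients` are placeholders chosen for illustration, and the helper `build_search_url` is hypothetical, not part of any library.

```python
from urllib.parse import urlencode

ES_URL = "http://elasticsearch:9200"  # same host used elsewhere in this notebook


def build_search_url(index, lucene_query):
    """Return a URI-search URL that passes a Lucene query string via `q`."""
    return "{}/{}/_search?{}".format(ES_URL, index, urlencode({"q": lucene_query}))


# "field:value" is the basic Lucene query-string form
url = build_search_url("recipes", "ingredients:flour")
print(url)
```

With a running cluster, `requests.get(url).json()` would return the matching documents under `hits`.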

In [1]:
import requests

In [7]:
def get_es_indices():
    r = requests.get("http://elasticsearch:9200/_cat/indices?format=json")
    if r.status_code != 200:
        print("Error listing indices")
        return [], []  # keep the return shape so callers can always unpack two values
    indices_full = r.json()  # full metadata: a list with one dict per index
    indices = [i['index'] for i in indices_full]  # extract just the names
    return indices, indices_full
        
indices, indices_full = get_es_indices()
print(indices)

['.kibana', 'testindex']


In [8]:
def create_es_index(index, index_config):
    r = requests.put("http://elasticsearch:9200/{}".format(index),
                     json=index_config)
    if r.status_code != 200:
        print("Error creating index")
    else:
        print("Index created")
        

def delete_es_index(index):
    r = requests.delete("http://elasticsearch:9200/{}".format(index))
    if r.status_code != 200:
        print("Error deleting index")
    else:
        print("Index deleted")

Let's delete and recreate a "recipes" index so that we can demonstrate some Lucene fundamentals:

In [22]:
# delete if exists
indices, indices_full = get_es_indices()
if 'recipes' in indices:
    delete_es_index('recipes')
    
index_config = {
    "mappings": {
        "recipe": {  # document TYPE
            "properties": {
                "name": {"type": "string"},
                "ingredients": {"type": "string"}
            }
        }
    }
}

create_es_index('recipes', index_config)

Index deleted
Index created


Let's load some example recipes:

In [15]:
def fling_message(index, doctype, msg):
    r = requests.post("http://elasticsearch:9200/{}/{}".format(index, doctype),
                      json=msg)
    if r.status_code != 201:
        print("Error sending message")
    else:
        print("message sent")

In [23]:
msg1 = {
    "name": "Pizza",
    "ingredients": "Flour WATER yeast cheese tomato sauce"
}

fling_message('recipes', 'recipe', msg1)

message sent


In [24]:
msg2 = {
    "name": "Chocolate chip cookies",
    "ingredients": "flour water sugar chocolate chips"
}

fling_message('recipes', 'recipe', msg2)

message sent


Whenever Lucene encounters a string field, it "analyzes" it (i.e. breaks it down into terms) and builds an inverted ("reverse") index that maps each term back to the documents containing it.  There are various "analyzers" that can be used, but the Standard Analyzer is good for most things:

https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-standard-analyzer.html

Roughly, it splits the string on word boundaries (whitespace and most punctuation), lowercases everything, and produces a reverse lookup from each term to the documents that contain it.
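The analysis step can be approximated in a few lines of plain Python. This is only a rough sketch: the function name `rough_standard_analyze` is made up for illustration, and the real Standard Analyzer follows Unicode text segmentation rules that a simple `\w+` pattern does not fully reproduce.

```python
import re


def rough_standard_analyze(text):
    """Approximate the Standard Analyzer: split into word tokens, then lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


tokens = rough_standard_analyze("Flour WATER yeast cheese tomato sauce")
print(tokens)
# ['flour', 'water', 'yeast', 'cheese', 'tomato', 'sauce']
```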

Let's concentrate on the `name` field.  The reverse index looks like this:
```
pizza -> msg1
chocolate -> msg2
chip -> msg2
cookies -> msg2
```

For the `ingredients` field the reverse index looks like this:
```
flour -> msg1, msg2
water -> msg1, msg2
yeast -> msg1
sugar -> msg2
cheese -> msg1
tomato -> msg1
sauce -> msg1
chocolate -> msg2
chips -> msg2
```
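The same reverse index can be reproduced outside Elasticsearch to see how it is assembled. The sketch below is a simplified stand-in, assuming that lowercasing and splitting on whitespace is close enough to the analyzer for these two documents; `build_inverted_index` is a hypothetical helper.

```python
from collections import defaultdict

# The `ingredients` values of the two example documents
docs = {
    "msg1": "Flour WATER yeast cheese tomato sauce",
    "msg2": "flour water sugar chocolate chips",
}


def build_inverted_index(docs):
    """Map each lowercased term to the list of document ids containing it."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            if doc_id not in index[term]:
                index[term].append(doc_id)
    return dict(index)


ingredients_index = build_inverted_index(docs)
print(ingredients_index["flour"])   # ['msg1', 'msg2']
print(ingredients_index["tomato"])  # ['msg1']
```

Note that `"tomato sauce"` is not a key in this index: the phrase was split into two independent terms.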

## Turn off analyzer

The above examples have some problems: namely, "tomato sauce" should not be analyzed (split), nor should "chocolate chips".  We can partially solve this by recreating our index with the analyzer turned off:

In [26]:
# delete if exists
indices, indices_full = get_es_indices()
if 'recipes' in indices:
    delete_es_index('recipes')
    
index_config = {
    "mappings": {
        "recipe": {  # document TYPE
            "properties": {
                "name": {"type": "string", "index": "not_analyzed"},
                "ingredients": {"type": "string", "index": "not_analyzed"}
            }
        }
    }
}

create_es_index('recipes', index_config)

Index deleted
Index created


In [27]:
msg1 = {
    "name": "pizza",
    "ingredients": "tomato sauce"
}

fling_message('recipes', 'recipe', msg1)

message sent


In [28]:
msg2 = {
    "name": "chocolate chip cookies",
    "ingredients": "chocolate chips"
}

fling_message('recipes', 'recipe', msg2)

message sent


Now the reverse index looks like this (for the `name` field):
```
pizza -> msg1
chocolate chip cookies -> msg2
```
and similarly for the `ingredients` field:
```
tomato sauce -> msg1
chocolate chips -> msg2
```
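One consequence of skipping analysis is that lookups become exact matches on the whole stored string. The following sketch imitates that behavior with a plain dictionary; `build_exact_index` is a hypothetical helper, not an Elasticsearch API.

```python
docs = {
    "msg1": {"name": "pizza", "ingredients": "tomato sauce"},
    "msg2": {"name": "chocolate chip cookies", "ingredients": "chocolate chips"},
}


def build_exact_index(docs, field):
    """Index each field value as a single, unmodified term."""
    index = {}
    for doc_id, doc in docs.items():
        index.setdefault(doc[field], []).append(doc_id)
    return index


ingredients_index = build_exact_index(docs, "ingredients")
print(ingredients_index.get("tomato sauce", []))  # ['msg1']
print(ingredients_index.get("tomato", []))        # [] : partial strings do not match
print(ingredients_index.get("Tomato Sauce", []))  # [] : case must match exactly
```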

For more details on the various field configurations we can use, see this documentation:

https://www.elastic.co/guide/en/elasticsearch/guide/current/mapping-intro.html

## String fields can be lists!

Obviously we still have a problem: each recipe now holds only a single ingredient.  What about the others?  Answer:  a string field can actually hold a list of values, and each value is indexed as its own term.

In [29]:
# delete if exists
indices, indices_full = get_es_indices()
if 'recipes' in indices:
    delete_es_index('recipes')
    
index_config = {
    "mappings": {
        "recipe": {  # document TYPE
            "properties": {
                "name": {"type": "string", "index": "not_analyzed"},
                "ingredients": {"type": "string", "index": "not_analyzed"}
            }
        }
    }
}

create_es_index('recipes', index_config)

Index deleted
Index created


In [30]:
msg1 = {
    "name": "pizza",
    "ingredients": ["flour", "water", "yeast", "cheese", "tomato sauce"]
}

fling_message('recipes', 'recipe', msg1)

message sent


In [None]:
msg2 = {
    "name": "chocolate chip cookies",
    "ingredients": ["flour", "water", "sugar", "chocolate chips"]
}

fling_message('recipes', 'recipe', msg2)

Now the reverse index looks like this for `ingredients`:
```
flour -> msg1, msg2
water -> msg1, msg2
yeast -> msg1
sugar -> msg2
cheese -> msg1
tomato sauce -> msg1
chocolate chips -> msg2
```
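With one term per list element, questions such as "which recipes use both flour and water?" reduce to set intersections over the reverse index. Here is a small sketch of that idea in plain Python; `recipes_with_all` is a hypothetical helper and says nothing about how Lucene evaluates such queries internally.

```python
from collections import defaultdict

docs = {
    "msg1": ["flour", "water", "yeast", "cheese", "tomato sauce"],
    "msg2": ["flour", "water", "sugar", "chocolate chips"],
}

# Each list element becomes its own term in the reverse index
index = defaultdict(set)
for doc_id, ingredients in docs.items():
    for ingredient in ingredients:
        index[ingredient].add(doc_id)


def recipes_with_all(*ingredients):
    """Return ids of documents that contain every requested ingredient."""
    matches = [index.get(i, set()) for i in ingredients]
    return sorted(set.intersection(*matches)) if matches else []


print(recipes_with_all("flour", "water"))         # ['msg1', 'msg2']
print(recipes_with_all("flour", "tomato sauce"))  # ['msg1']
```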

## Custom analyzers

This is still not quite satisfactory because, by turning off the analyzer, we lost some of the powerful string normalizing capabilities of Elasticsearch (for example, I had to manually lowercase everything above because the analyzer is no longer doing that work for me).

In fact, Elasticsearch comes with several built-in analyzers that behave differently for various applications, and you can even build your own custom ones by combining character filters, a tokenizer, and token filters (some of which accept regex patterns).

I will leave this more advanced topic for you to explore on your own.
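As one possible starting point for that exploration, the sketch below defines a custom analyzer that keeps each value as a single token (the `keyword` tokenizer) but still lowercases it (the `lowercase` token filter). The analyzer name `lowercase_keyword` is made up for illustration, and the mapping reuses the `string` type from the cells above, which is an assumption about your Elasticsearch version; newer releases use different field types.

```python
index_config = {
    "settings": {
        "analysis": {
            "analyzer": {
                "lowercase_keyword": {  # hypothetical analyzer name
                    "type": "custom",
                    "tokenizer": "keyword",   # emit the whole value as one token
                    "filter": ["lowercase"],  # then lowercase that token
                }
            }
        }
    },
    "mappings": {
        "recipe": {  # document TYPE
            "properties": {
                "name": {"type": "string", "analyzer": "lowercase_keyword"},
                "ingredients": {"type": "string", "analyzer": "lowercase_keyword"},
            }
        }
    },
}
```

Passing this dictionary to `create_es_index('recipes', index_config)` should let you send `"Tomato Sauce"` and still match it as `tomato sauce`, without lowercasing by hand.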