# Intro to ElasticSearch

- Real time distributed search and analytics engine
- Open source (Apache 2 license)
- Based on Apache Lucene - full-text search library
- RESTful API - like Solr - but simple and coherent
- First public release - February 2010 
- Stable version as of September 2015 - ElasticSearch 1.7 


## Features

- distributed real time document store 
- every field is indexed and searchable 
- distributed search engine 
- provides real time analytics
- scales to hundreds of servers and petabytes of data

## Use cases

- Full text search 
- Structured search 
- Analytics
- Log indexing

## Ecosystem

- Logstash - log collection in a central repo  
- Kibana - search and analytics interface 
- Marvel - cluster monitoring  
- Shield - secure cluster

## Concepts 
  - Documents 
  - Fields 
  - Index


### Document oriented
- stores objects as documents: tree structures that may contain dates, geo locations, other objects, or arrays of values
- indexes the content of each document to make it searchable
- notation for documents: JSON 


- all fields in a document are indexed by default
- documents do not need to be homogeneous
- ES does not require schemas


### Example: Document

<pre>
{
    "email":      "john@smith.com",
    "first_name": "John",
    "last_name":  "Smith",
    "info": {
        "bio":         "Eco-warrior and defender of the weak",
        "age":         25,
        "interests": [ "dolphins", "whales" ]
    },
    "join_date": "2014/05/01"
}
</pre>
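
To make the "tree structure" idea concrete, the sketch below parses the document above and lists its leaf fields under dotted paths such as `info.age`, which is roughly how fields of inner objects are addressed in queries. The `flatten` helper is hypothetical and written only for this illustration; it is not part of any Elasticsearch client.

```python
import json

# The example document from above, parsed into a Python dict
doc = json.loads("""
{
    "email":      "john@smith.com",
    "first_name": "John",
    "last_name":  "Smith",
    "info": {
        "bio":         "Eco-warrior and defender of the weak",
        "age":         25,
        "interests": [ "dolphins", "whales" ]
    },
    "join_date": "2014/05/01"
}
""")


def flatten(obj, prefix=""):
    """Return a dict mapping dotted field paths to leaf values.

    Hypothetical helper: inner objects are walked recursively,
    arrays are kept as they are.
    """
    fields = {}
    for key, value in obj.items():
        path = prefix + key
        if isinstance(value, dict):
            fields.update(flatten(value, path + "."))
        else:
            fields[path] = value
    return fields


flat = flatten(doc)
for path in sorted(flat):
    print(path, "=>", flat[path])
```

The inner object `info` does not appear as a field of its own; only its leaves (`info.bio`, `info.age`, `info.interests`) do.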


### Analogy to relational databases 

<pre>
Relational DB  ⇒ Databases ⇒ Tables ⇒ Rows      ⇒ Columns
Elasticsearch  ⇒ Indices   ⇒ Types  ⇒ Documents ⇒ Fields
</pre>

An index plays a role similar to a collection in other document databases (e.g. MongoDB).
These analogies hold only up to a point.


### Index vs Index vs Index 

- Index (noun) - the place to store related documents. It is analogous to a database in an RDB. The plural of index is indices or indexes. 
- Index (verb) - to index a document is to store it in an index (noun) so that it can be queried and retrieved.  
- Inverted index - relational DBs create an index (e.g. a B-tree index) to speed up access to data. Text search engines do the same with a structure called an inverted index, which maps each term to the documents that contain it.
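
A minimal sketch of an inverted index, assuming a naive lowercase tokenizer and three made-up sample sentences. The names `tokenize` and `build_inverted_index` are hypothetical; Lucene's real structure also stores positions and frequencies and is far more compact.

```python
import re
from collections import defaultdict

# Sample texts, assumed for illustration only
docs = {
    1: "I love to go rock climbing",
    2: "I like to collect rock albums",
    3: "I like to build cabinets",
}


def tokenize(text):
    """Naive analysis step: lowercase, then split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map every term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


index = build_inverted_index(docs)
print(sorted(index["rock"]))  # ids of documents containing "rock"
print(sorted(index["like"]))  # ids of documents containing "like"
```

Looking up a term is a single dictionary access, which is why searching by term is fast no matter how long the documents are.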


### Field values

- String 
- Number 
- Boolean 
- another object 
- an array of values 
- Date - a string representing a date 
- Geolocation - an object representing a geolocation 


### Document metadata 

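- <code>_index</code> - the index where the document lives 
- <code>_type</code> - the class of object that the document represents
- <code>_id</code> - the identifier of the document within its index and type; it can be autogenerated by sending a POST to the type instead of a PUT to a specific id
- <code>_version</code> - a number that is incremented every time the document changes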



## Talking to ES 
- REST API - JSON over HTTP protocol on port 9200
- Java (native) - native protocol on port 9300 
  - Node client - joins the cluster as a non-data node
  - Transport client - sends requests to the cluster over the native protocol
- [almost any language](https://www.elastic.co/guide/en/elasticsearch/client/index.html): Python, R, PHP, Perl, Groovy, .NET, Ruby - through the REST API


## Installation

Basic installation instructions: https://www.elastic.co/guide/en/elasticsearch/guide/current/_installing_elasticsearch.html

## Running Elasticsearch

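1. Run <code>./bin/elasticsearch</code> from the installation directory.

   The node listens for HTTP requests on port 9200 (default).

2. Open http://localhost:9200 to check that it is running.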

In [None]:
import requests

r = requests.get('http://localhost:9200')
print(r.text)

# Creating documents

In [None]:
employee = """
{
    "first_name" : "John",
    "last_name" :  "Smith",
    "age" :        25,
    "about" :      "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}
"""

r = requests.put('http://localhost:9200/megacorp/employee/1?pretty', data = employee)
print(r.text)


In [None]:
employee2 = """
{
    "first_name" :  "Jane",
    "last_name" :   "Smith",
    "age" :         32,
    "about" :       "I like to collect rock albums",
    "interests":  [ "music" ]
}
"""

employee3 = """
{
    "first_name" :  "Douglas",
    "last_name" :   "Fir",
    "age" :         35,
    "about":        "I like to build cabinets",
    "interests":  [ "forestry" ]
}
"""

r = requests.put('http://localhost:9200/megacorp/employee/2', data = employee2)
r = requests.put('http://localhost:9200/megacorp/employee/3', data = employee3)


### Retrieving documents
Get a complete document by id

In [None]:
r = requests.get('http://localhost:9200/megacorp/employee/2?pretty')
print(r.text)

In [None]:
r = requests.get('http://localhost:9200/megacorp/employee/200?pretty')
print('Status: {0}\n'.format(r.status_code))
print(r.text)

In [None]:
r = requests.get('http://localhost:9200/megacorp/employee/2?_source=first_name,last_name&pretty')
print(r.text)

In [None]:
r = requests.get('http://localhost:9200/megacorp/employee/2/_source')
print(r.text)

### Other operations on documents

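- Delete a document - DELETE 
- Check whether a document exists - HEAD  
- Replace a whole document - PUT a new document to the same id
- Partially update a document - POST to <code>/{index}/{type}/{id}/_update</code>
   - can add new fields 
   - merges the changes into the existing document
- Create a document with an autogenerated unique id - POST to <code>/{index}/{type}</code>
- Create only if the document does not exist yet - POST to <code>/{index}/{type}/{id}/_create</code>
- Retrieve multiple documents - GET <code>/_mget</code> or <code>/{index}/{type}/_mget</code>
- Execute actions in bulk - bulk API (<code>/_bulk</code>)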
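
The operations listed above can be sketched as plain `(method, url, body)` tuples, so the example runs without a cluster. All helper names (`doc_url`, `partial_update`, `bulk_body`, ...) are hypothetical, and `BASE` assumes the local node used in the other cells. The `{"doc": ...}` wrapper for partial updates, the `{"ids": ...}` body for `_mget` and the newline-delimited bulk format follow the Elasticsearch 1.x REST API.

```python
import json

BASE = "http://localhost:9200"  # assumed local node


def doc_url(index, doc_type, doc_id=None, suffix=None):
    """Build /{index}/{type}[/{id}][/{suffix}] under BASE."""
    parts = [BASE, index, doc_type]
    if doc_id is not None:
        parts.append(str(doc_id))
    if suffix:
        parts.append(suffix)
    return "/".join(parts)


def delete_doc(index, doc_type, doc_id):
    return ("DELETE", doc_url(index, doc_type, doc_id), None)


def doc_exists(index, doc_type, doc_id):
    # HEAD returns only a status code: 200 if found, 404 if not
    return ("HEAD", doc_url(index, doc_type, doc_id), None)


def partial_update(index, doc_type, doc_id, fields):
    # The changed fields are wrapped in a "doc" object
    body = json.dumps({"doc": fields})
    return ("POST", doc_url(index, doc_type, doc_id, "_update"), body)


def create_with_auto_id(index, doc_type, document):
    # POST to the type; Elasticsearch generates the id
    return ("POST", doc_url(index, doc_type), json.dumps(document))


def multi_get(index, doc_type, ids):
    url = doc_url(index, doc_type, suffix="_mget")
    return ("GET", url, json.dumps({"ids": ids}))


def bulk_body(actions):
    """Serialize (action_metadata, source_or_None) pairs for /_bulk.

    One JSON object per line; the body must end with a newline.
    """
    lines = []
    for meta, source in actions:
        lines.append(json.dumps(meta))
        if source is not None:
            lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"


method, url, body = partial_update("megacorp", "employee", 1, {"age": 26})
print(method, url, body)
```

Each tuple can be sent with `requests.request(method, url, data=body)`.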

### Distributed document store
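- ES uses optimistic concurrency control: instead of locking a document, a write can state the version it is based on and is rejected if the document has changed in the meantime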
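
The idea can be simulated in a few lines of plain Python. `TinyStore` and `VersionConflict` are hypothetical names for an in-memory stand-in, not Elasticsearch code. In Elasticsearch itself the check is requested with the `version` query parameter on a write, and a mismatch is reported as an HTTP 409 Conflict.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version of a document."""


class TinyStore(object):
    """In-memory stand-in that keeps a version number per document."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def get(self, doc_id):
        return self._docs[doc_id]

    def put(self, doc_id, source, expected_version=None):
        current_version = self._docs.get(doc_id, (0, None))[0]
        # Reject the write if the caller read an older version
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                "expected version %d, current is %d"
                % (expected_version, current_version))
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, source)
        return new_version


store = TinyStore()
store.put(1, {"views": 0})       # creates the document at version 1
version, source = store.get(1)   # two clients both read version 1

# The first client writes successfully and bumps the version to 2
store.put(1, {"views": 1}, expected_version=version)

# The second client still holds version 1, so its write is rejected
try:
    store.put(1, {"views": 1}, expected_version=version)
    conflict = False
except VersionConflict as exc:
    conflict = True
    print("conflict:", exc)
```

No lock is ever taken; the losing client decides what to do next, typically re-reading the document and retrying.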



## Search documents

In [None]:
r = requests.get('http://localhost:9200/megacorp/employee/_search?pretty')
print(r.text)

In [None]:
r = requests.get('http://localhost:9200/megacorp/employee/_search?q=last_name:Smith&pretty')
print(r.text)

In [None]:
r = requests.get('http://localhost:9200/megacorp/employee/_search?q=about:rock&pretty')
print(r.text)

### Search with Query DSL

- Query DSL - Domain Specific Language for querying document databases

We can rewrite the simple query as a __match__ query 

In [None]:
payload = """
{
    "query" : {
        "match" : {
            "last_name" : "Smith"
        }
    }
}
"""

r = requests.get('http://localhost:9200/megacorp/employee/_search?pretty', data = payload)
print(r.text)

### Search - apply a filter

Select only those employees with _age > 30_ and _last_name = "Smith"_.
The range filter narrows down the documents, and the match query is then evaluated on the ones that pass.


In [None]:
payload = """
{
    "query" : {
        "filtered" : {
            "filter" : {
                "range" : {
                    "age" : { "gt" : 30 } 
                }
            },
            "query" : {
                "match" : {
                    "last_name" : "smith" 
                }
            }
        }
    }
}
"""

r = requests.get('http://localhost:9200/megacorp/employee/_search?pretty', data = payload)
print(r.text)
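
Writing the request body by hand gets error-prone as queries grow. A possible alternative, sketched here, is to build it as a Python dict and serialize it with `json.dumps`. The `filtered_query` helper and its parameter names are hypothetical; the `filtered` / `filter` / `query` structure is the one used in the cell above.

```python
import json


def filtered_query(match_field, match_value, range_field, gt):
    """Build a 'filtered' search body: a range filter plus a match query."""
    return {
        "query": {
            "filtered": {
                # Exact yes/no condition, no relevance scoring involved
                "filter": {"range": {range_field: {"gt": gt}}},
                # Full-text condition, scored by relevance
                "query": {"match": {match_field: match_value}},
            }
        }
    }


payload = json.dumps(filtered_query("last_name", "smith", "age", 30), indent=4)
print(payload)
```

The resulting string can be passed as `data = payload` exactly like the hand-written one.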

### Full text search 
- Complex queries
- Relevance

In [None]:
payload = """
{
    "query" : {
        "match" : {
            "about" : "rock climbing"
        }
    }
}
"""

r = requests.get('http://localhost:9200/megacorp/employee/_search?pretty', data = payload)
print(r.text)

### Phrase search

In [None]:
payload = """
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    }
}
"""

r = requests.get('http://localhost:9200/megacorp/employee/_search?pretty', data = payload)
print(r.text)
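
The difference between `match` and `match_phrase` can be illustrated offline on the `about` values of the three employees indexed earlier. This is a simplified simulation with a naive tokenizer and hypothetical function names; it ignores relevance scoring entirely and only decides whether a document matches.

```python
import re

# "about" fields of the three employees indexed earlier
about = {
    1: "I love to go rock climbing",
    2: "I like to collect rock albums",
    3: "I like to build cabinets",
}


def tokenize(text):
    """Naive analysis step: lowercase, then split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match(query, text):
    """True if the text contains at least one of the query terms."""
    terms = set(tokenize(text))
    return any(term in terms for term in tokenize(query))


def match_phrase(query, text):
    """True if the query terms appear consecutively and in order."""
    q, t = tokenize(query), tokenize(text)
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))


match_hits = sorted(i for i, text in about.items()
                    if match("rock climbing", text))
phrase_hits = sorted(i for i, text in about.items()
                     if match_phrase("rock climbing", text))
print("match:       ", match_hits)
print("match_phrase:", phrase_hits)
```

Employee 2 matches the `match` query because of the single term "rock", but only employee 1 contains the exact phrase. In the real `match` response both are returned, with employee 1 ranked higher.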

### Aggregations / Faceting

In [None]:
payload = """
{
  "aggs": {
    "all_interests": {
      "terms": { "field": "interests" }
    }
  }
}
"""

r = requests.get('http://localhost:9200/megacorp/employee/_search?pretty', data = payload)
print(r.text)

### Aggregations allow hierarchical rollup

In [None]:
payload = """
{
    "aggs" : {
        "all_interests" : {
            "terms" : { "field" : "interests" },
            "aggs" : {
                "avg_age" : {
                    "avg" : { "field" : "age" }
                }
            }
        }
    }
}"""

r = requests.get('http://localhost:9200/megacorp/employee/_search?pretty', data = payload)
print(r.text)
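
What the nested aggregation computes can be reproduced locally on the three employees indexed earlier: one bucket per distinct interest, with a document count and the average age inside each bucket. The function `terms_with_avg` is a hypothetical stand-in, and the alphabetical tie-break between equally sized buckets is an assumption of this sketch.

```python
from collections import defaultdict

# The three employees indexed earlier (only the fields needed here)
employees = [
    {"first_name": "John", "age": 25, "interests": ["sports", "music"]},
    {"first_name": "Jane", "age": 32, "interests": ["music"]},
    {"first_name": "Douglas", "age": 35, "interests": ["forestry"]},
]


def terms_with_avg(docs, terms_field, avg_field):
    """Bucket documents by each value of terms_field, then average avg_field."""
    buckets = defaultdict(list)
    for doc in docs:
        for term in doc[terms_field]:
            buckets[term].append(doc[avg_field])
    result = [
        {"key": term,
         "doc_count": len(values),
         "avg": sum(values) / float(len(values))}
        for term, values in buckets.items()
    ]
    # Largest buckets first; ties broken alphabetically (assumption)
    return sorted(result, key=lambda b: (-b["doc_count"], b["key"]))


for bucket in terms_with_avg(employees, "interests", "age"):
    print(bucket["key"], bucket["doc_count"], bucket["avg"])
```

A document with several interests lands in several buckets, which is why John is counted under both "sports" and "music".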

# Search features

- A structured query on concrete fields like gender or age, sorted by a field like join_date, similar to the type of query that you could construct in SQL
- A full-text query, which finds all documents matching the search keywords, and returns them sorted by relevance
- A combination of the two

## Components 
 - Mapping 
 - Analysis
 - Query DSL
 - Relevance
 
 
## Multi index, multi type queries
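
A search can target several indices and types at once by listing them, comma-separated, in the URL path; leaving the index out searches all indices. The sketch below only builds such URLs. The helper `search_url` and the names `othercorp` and `manager` are hypothetical, and `BASE` assumes the local node used in the other cells.

```python
BASE = "http://localhost:9200"  # assumed local node


def search_url(indices=None, types=None):
    """Build a _search URL for any combination of indices and types."""
    parts = [BASE]
    if indices:
        parts.append(",".join(indices))
    elif types:
        # Types without an explicit index: search them in all indices
        parts.append("_all")
    if types:
        parts.append(",".join(types))
    parts.append("_search")
    return "/".join(parts)


print(search_url())                                          # everything
print(search_url(["megacorp"]))                              # one index
print(search_url(["megacorp", "othercorp"], ["employee"]))   # two indices, one type
print(search_url(types=["employee", "manager"]))             # two types, all indices
```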


# Counting documents

In [None]:
url = 'http://localhost:9200/_count?pretty'

payload = """
{
  "query": { 
      "match_all": {}
  }
}
""" 

r = requests.get(url, data = payload)
print(r.text)


## ElasticSearch administration 

- node 
- cluster 
- shard 




## ElasticSearch Users 

* Wikipedia 
* The Guardian
* Stack Overflow
* GitHub
* IFTTT
* NASA

## Bibliography

 * [ElasticSearch: The Definitive Guide](https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html) by Clinton Gormley and Zachary Tong (O’Reilly). Copyright 2015 Elasticsearch BV, 978-1-449-35854-9.

    * [git repo](https://github.com/elastic/elasticsearch-definitive-guide) 
 
    
### Reference

 * [ElasticSearch Reference](https://www.elastic.co/guide/index.html)