# Elasticsearch
## Document Store and Search engine

## Elasticsearch REST API

<pre>
curl -X&lt;VERB&gt; '&lt;PROTOCOL&gt;://&lt;HOST&gt;:&lt;PORT&gt;/&lt;PATH&gt;?&lt;QUERY_STRING&gt;' -d '&lt;BODY&gt;'
</pre>

 * **VERB**: The appropriate HTTP method or verb: GET, POST, PUT, HEAD, or DELETE.
 * **PROTOCOL**: Either http or https (if you have an https proxy in front of Elasticsearch).
 * **HOST**: The hostname of any node in your Elasticsearch cluster, or localhost for a node on your local machine.
 * **PORT**: The port running the Elasticsearch HTTP service, which defaults to 9200.
 * **PATH**: The API endpoint (for example, _count returns the number of documents in the cluster). The path may contain multiple components, such as _cluster/stats or _nodes/stats/jvm.
 * **QUERY_STRING**: Any optional query-string parameters (for example, ?pretty pretty-prints the JSON response to make it easier to read).
 * **BODY**: A JSON-encoded request body (if the request needs one).
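
As a rough illustration of how these parts combine, the sketch below assembles a request URL from them. It is written for Python 3; the helper name `build_url` and its default values are assumptions made for this example, not part of any Elasticsearch client.

```python
from urllib.parse import urlencode


def build_url(path, query=None, protocol="http", host="localhost", port=9200):
    """Assemble <PROTOCOL>://<HOST>:<PORT>/<PATH>?<QUERY_STRING>."""
    url = "{}://{}:{}/{}".format(protocol, host, port, path.lstrip("/"))
    if query:
        # urlencode takes care of escaping the parameter values
        url += "?" + urlencode(query)
    return url


count_url = build_url("_count", {"pretty": "true"})
stats_url = build_url("_nodes/stats/jvm")
print(count_url)
print(stats_url)
```

The VERB and BODY are not part of the URL; they would be supplied by the HTTP call itself, for example `requests.get(count_url)`.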

## Elasticsearch features

 * Document oriented: stores complex objects or documents.
 * Document: the root object, serialized into JSON and stored in Elasticsearch under a unique ID.
 * Indexed documents: ES indexes all fields by default.
 * Documents may have types.
 * Each type has its own mapping.
 * Documents are immutable. To change or replace a document, it has to be indexed again, which increments its version.


## Creating documents

In [None]:
import requests

In [39]:
employee = """
{
    "first_name" : "John",
    "last_name" :  "Smith",
    "age" :        25,
    "about" :      "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}
"""

r = requests.put('http://localhost:9200/megacorp/employee/1?pretty', 
                 data = employee)
print(r.text)


{
  "_index" : "megacorp",
  "_type" : "employee",
  "_id" : "1",
  "_version" : 9,
  "created" : false
}



[TODO] Explain the meaning of each part of the URL
[TODO] Explain the different field types: text, numeric, lists, datetimes

If they do not exist yet, ES creates:

 * an **index** (*megacorp*)  
 * a **document type** (*employee*)

### Document metadata

A document has metadata information:

 - **_index**: Where the document lives
 - **_type**:  The class of object that the document represents
 - **_id**:    The unique identifier for the document. 
   - It is always a string.
   - We can provide our own ids or let ES generate them for us. Autogenerated ids are 20-character, URL-safe, Base64-encoded universally unique identifiers (UUIDs).

Together, _index, _type and _id uniquely identify a document.
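
To make the link between the metadata and the document's address concrete, here is a small sketch that parses the indexing response shown earlier and rebuilds the URL path from it. The helper name `document_path` is made up for this example.

```python
import json

# Response body copied from the indexing example above.
index_response = """
{
  "_index" : "megacorp",
  "_type" : "employee",
  "_id" : "1",
  "_version" : 9,
  "created" : false
}
"""


def document_path(meta):
    """Return the URL path that addresses the document described by `meta`."""
    return "/{_index}/{_type}/{_id}".format(**meta)


meta = json.loads(index_response)
path = document_path(meta)
print(path)               # /megacorp/employee/1
print(meta["_version"])   # 9
```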


#### Other metadata

 - **_version**
  

In [None]:
employee2 = """
{
    "first_name" :  "Jane",
    "last_name" :   "Smith",
    "age" :         32,
    "about" :       "I like to collect rock albums",
    "interests":  [ "music" ]
}
"""

employee3 = """
{
    "first_name" :  "Douglas",
    "last_name" :   "Fir",
    "age" :         35,
    "about":        "I like to build cabinets",
    "interests":  [ "forestry" ]
}
"""

r = requests.put('http://localhost:9200/megacorp/employee/2', data = employee2)
r = requests.put('http://localhost:9200/megacorp/employee/3', data = employee3)


### Retrieving documents by id
Get a complete document by id

In [None]:
r = requests.get('http://localhost:9200/megacorp/employee/2?pretty')
print(r.text)

### Retrieving a document that does not exist

In [None]:
r = requests.get('http://localhost:9200/megacorp/employee/200?pretty')
print('Status: {}\n'.format(r.status_code))
print(r.text)
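
A caller will usually want to turn the "not found" case into something easy to test for. The sketch below shows one possible way, assuming the response body carries a `found` flag alongside the 404 status, as it does in Elasticsearch versions that support document types. The function name and the two sample bodies are hand-written for illustration.

```python
import json


def source_or_none(status_code, body_text):
    """Return the document's _source, or None if ES reported it missing."""
    body = json.loads(body_text)
    if status_code == 404 or not body.get("found", False):
        return None
    return body.get("_source")


# Hand-written sample bodies shaped like GET responses (illustrative only).
missing = '{"_index": "megacorp", "_type": "employee", "_id": "200", "found": false}'
present = ('{"_index": "megacorp", "_type": "employee", "_id": "2", '
           '"found": true, "_source": {"first_name": "Jane"}}')

print(source_or_none(404, missing))
print(source_or_none(200, present))
```

With a live request, the same function could be called as `source_or_none(r.status_code, r.text)`.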

### Retrieving part of a document

Select only certain fields...

In [None]:
r = requests.get('http://localhost:9200/megacorp/employee/2?_source=first_name,last_name&pretty')
print(r.text)

### Retrieving document content - no metadata

In [None]:
r = requests.get('http://localhost:9200/megacorp/employee/2/_source')
print(r.text)

### Other operations on documents

- Delete a document: DELETE
- Check whether a document exists: HEAD
- Update a whole document: just PUT another document with the same id
- Update part of a document: POST /<index>/<type>/<id>/_update
   - allows adding fields
   - merges the changes into the existing document
- Create a document and let ES assign a unique id: POST /<index>/<type>
- Create only if the document does not exist: PUT /<index>/<type>/<id>/_create
- Retrieve multiple documents: GET /_mget or /<index>/<type>/_mget
- Execute actions in bulk: the _bulk API
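
The bulk API is the least obvious of these, because its body is not a single JSON object but a sequence of newline-delimited lines: an action line followed, for index actions, by the document itself. The following is a minimal sketch of building such a body; the helper name `bulk_index_body` is an assumption for this example.

```python
import json


def bulk_index_body(index, doc_type, docs):
    """Build a newline-delimited _bulk body that indexes `docs` (id -> document)."""
    lines = []
    for doc_id, doc in sorted(docs.items()):
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))   # action line
        lines.append(json.dumps(doc))      # document line
    # The bulk API expects every line, including the last, to end with a newline.
    return "\n".join(lines) + "\n"


body = bulk_index_body("megacorp", "employee", {
    "2": {"first_name": "Jane", "last_name": "Smith"},
    "3": {"first_name": "Douglas", "last_name": "Fir"},
})
print(body)
# The body would then be sent with something like:
# requests.post('http://localhost:9200/_bulk', data=body)
```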


### Delete a document

- Deleting doesn’t immediately remove the document from disk.
- It only marks the document as deleted.
- ES cleans up deleted documents in the background.

### Update a document 

- Marks the old version as deleted and indexes a new version
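
The "mark as deleted, clean up later" behaviour of deletes and updates can be pictured with a toy in-memory model. This is only a conceptual sketch: the class `ToySegmentStore` and its methods are invented here and do not reflect how Elasticsearch stores data internally.

```python
class ToySegmentStore(object):
    """Toy model: append-only entries plus a set of 'deleted' markers."""

    def __init__(self):
        self.entries = []      # (doc_id, version, doc); never edited in place
        self.deleted = set()   # positions in `entries` marked as deleted

    def _live_position(self, doc_id):
        for pos, (entry_id, _, _) in enumerate(self.entries):
            if entry_id == doc_id and pos not in self.deleted:
                return pos
        return None

    def index(self, doc_id, doc):
        """Index a document; an existing copy is marked deleted, not overwritten."""
        pos = self._live_position(doc_id)
        version = 1
        if pos is not None:
            version = self.entries[pos][1] + 1
            self.deleted.add(pos)
        self.entries.append((doc_id, version, doc))
        return version

    def delete(self, doc_id):
        """Mark the live copy as deleted; nothing is removed yet."""
        pos = self._live_position(doc_id)
        if pos is not None:
            self.deleted.add(pos)

    def merge(self):
        """Background cleanup: physically drop entries marked as deleted."""
        self.entries = [e for pos, e in enumerate(self.entries)
                        if pos not in self.deleted]
        self.deleted = set()


store = ToySegmentStore()
store.index("1", {"age": 25})
new_version = store.index("1", {"age": 26})   # update = mark old + index new
store.delete("1")
entries_before_merge = len(store.entries)     # both copies still occupy space
store.merge()
entries_after_merge = len(store.entries)      # space reclaimed
```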

### Partial updates
 - Add fields 
 - Modify values 
 - Add values to collection fields
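
A minimal sketch of the first two cases, assuming the `doc` form of the update API, in which the request body wraps the changed fields in a `doc` object. The local `apply_partial` function only imitates the merge for flat documents, and the `city` field is made up for the example. Adding values to a collection field is typically done with a scripted update, which is not covered here.

```python
import json


def partial_update_body(changes):
    """Wrap the fields to change in the `doc` envelope used by _update."""
    return json.dumps({"doc": changes})


def apply_partial(existing, changes):
    """Local illustration of the merge: new fields are added, others overwritten."""
    merged = dict(existing)
    merged.update(changes)
    return merged


employee = {"first_name": "John", "last_name": "Smith", "age": 25}
changes = {"age": 26, "city": "Madrid"}   # "city" is a made-up field

body = partial_update_body(changes)
merged = apply_partial(employee, changes)
print(body)
print(merged)
# The body would be sent with something like:
# requests.post('http://localhost:9200/megacorp/employee/1/_update', data=body)
```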

### Distributed document store
- ES uses **optimistic concurrency control**
- It assumes that conflicts are rare
- Usually, ES is not your primary data store. If it is, the application has to detect and resolve conflicting updates itself.

- Documents are also replicated to several nodes in a cluster.
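
The idea behind optimistic concurrency control can be shown with a toy in-memory store: a write states which version it was based on and is rejected if the document has changed since. The classes below are invented for this sketch. In Elasticsearch versions of this era, the comparable mechanism is passing a `version` query parameter on the write, which fails with a 409 Conflict response when the version is stale.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


class VersionedStore(object):
    def __init__(self):
        self.docs = {}  # doc_id -> (version, doc)

    def get(self, doc_id):
        return self.docs[doc_id]

    def put(self, doc_id, doc, expected_version=None):
        current_version = self.docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict("expected version {}, found {}".format(
                expected_version, current_version))
        self.docs[doc_id] = (current_version + 1, doc)
        return current_version + 1


store = VersionedStore()
store.put("1", {"stock": 100})

# Two clients read the same version...
version_a, doc_a = store.get("1")
version_b, doc_b = store.get("1")

# ...client A writes first and succeeds.
store.put("1", {"stock": doc_a["stock"] - 1}, expected_version=version_a)

# Client B's write is based on stale data and is rejected instead of silently lost.
try:
    store.put("1", {"stock": doc_b["stock"] - 1}, expected_version=version_b)
    conflict_detected = False
except VersionConflict:
    conflict_detected = True

final_version, final_doc = store.get("1")
```

On a conflict, the application would typically re-read the document, reapply its change, and retry.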



### Problems with optimistic concurrency control

<img src="https://www.elastic.co/guide/en/elasticsearch/guide/master/images/elas_0301.png" height="300" width="300">



## Search documents

Search all documents in an index of the given type

In [None]:
r = requests.get('http://localhost:9200/megacorp/employee/_search?pretty')
print(r.text)

 - Hits: a list of the documents that match the query criteria
 - Score: how well each hit matches the query
 - Metadata about the search process (time taken, shards queried, whether it timed out)
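
As a sketch of how these parts might be pulled out of a parsed response, the function below returns the total and one row per hit. The sample dictionary is a trimmed-down, hand-written stand-in for a real response, and the code assumes `hits.total` is a plain number, as in the Elasticsearch version used in this notebook.

```python
def summarize_hits(response):
    """Return (total, [(id, score, source), ...]) from a parsed search response."""
    hits = response["hits"]
    rows = [(h["_id"], h["_score"], h["_source"]) for h in hits["hits"]]
    return hits["total"], rows


# Trimmed-down response, shaped like the output of a _search request.
sample = {
    "took": 5,
    "timed_out": False,
    "_shards": {"total": 1, "successful": 1, "failed": 0},
    "hits": {
        "total": 2,
        "max_score": 1.0,
        "hits": [
            {"_id": "2", "_score": 1.0, "_source": {"first_name": "Jane"}},
            {"_id": "1", "_score": 1.0, "_source": {"first_name": "John"}},
        ],
    },
}

total, rows = summarize_hits(sample)
for doc_id, score, source in rows:
    print(doc_id, score, source["first_name"])
```

With a live request, `summarize_hits(r.json())` would do the same on the real response.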

Search all the documents that match a query: "Last name is Smith"

- This is a simple example of structured search: it combines full-text search with a constraint on a field

[TODO] Explain the URL: 
  - *_search* 
  - parameter *q* 
  - field queries: *last_name:*   
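
As a partial answer to the TODO, the sketch below builds such a URL piece by piece: the index and type select where to search, `_search` is the endpoint, and the `q` parameter carries a `field:value` expression. It is written for Python 3, and the helper name `search_url` is an assumption for this example.

```python
from urllib.parse import urlencode


def search_url(index, doc_type, field, value, base="http://localhost:9200"):
    """Build a query-string search URL of the form .../_search?q=field:value."""
    # urlencode escapes the colon as %3A; the server decodes it again
    params = urlencode({"q": "{}:{}".format(field, value)})
    return "{}/{}/{}/_search?{}".format(base, index, doc_type, params)


url = search_url("megacorp", "employee", "last_name", "Smith")
print(url)
```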

In [None]:
r = requests.get('http://localhost:9200/megacorp/employee/_search?q=last_name:Smith&pretty')
print(r.text)

Search a text field using keywords
 - What happens with the score?   

In [None]:
r = requests.get('http://localhost:9200/megacorp/employee/_search?q=about:rock&pretty')
print(r.text)

### Search with Query DSL

- Query DSL: a JSON-based domain-specific language for expressing queries

We can rewrite the simple query as a __match__ query 

In [41]:
payload = """
{
    "query" : {
        "match" : {
            "last_name" : "Smith"
        }
    }
}
"""

r = requests.get('http://localhost:9200/megacorp/employee/_search?pretty', data = payload)
print(r.text)

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "megacorp",
      "_type" : "employee",
      "_id" : "2",
      "_score" : 1.0,
      "_source":
{
    "first_name" :  "Jane",
    "last_name" :   "Smith",
    "age" :         32,
    "about" :       "I like to collect rock albums",
    "interests":  [ "music" ]
}

    }, {
      "_index" : "megacorp",
      "_type" : "employee",
      "_id" : "1",
      "_score" : 1.0,
      "_source":
{
    "first_name" : "John",
    "last_name" :  "Smith",
    "age" :        25,
    "about" :      "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}

    } ]
  }
}



### Search DSL - apply a filter

Select only those employees with **age > 30** and **last_name = "Smith"**

The filter is applied to the documents first, and the query is then matched against the documents that pass it.


In [None]:
payload = """
{
    "query" : {
        "filtered" : {
            "filter" : {
                "range" : {
                    "age" : { "gt" : 30 } 
                }
            },
            "query" : {
                "match" : {
                    "last_name" : "smith" 
                }
            }
        }
    }
}
"""

r = requests.get('http://localhost:9200/megacorp/employee/_search?pretty', data = payload)
print(r.text)
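
The `filtered` query used here was deprecated in Elasticsearch 2.0 and removed in 5.0, where the same intent is expressed as a `bool` query with a `filter` clause. The sketch below builds both payloads side by side; the two helper function names are made up for this example, and which form applies depends on the Elasticsearch version you run.

```python
import json


def filtered_query(query, filter_):
    """The form used above: a `filtered` query wrapping a query and a filter."""
    return {"query": {"filtered": {"filter": filter_, "query": query}}}


def bool_query(query, filter_):
    """Equivalent `bool` form for Elasticsearch versions without `filtered`."""
    return {"query": {"bool": {"must": query, "filter": filter_}}}


age_filter = {"range": {"age": {"gt": 30}}}
name_query = {"match": {"last_name": "smith"}}

old_payload = json.dumps(filtered_query(name_query, age_filter))
new_payload = json.dumps(bool_query(name_query, age_filter))
print(old_payload)
print(new_payload)
```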

### Full text search 
- Complex queries
- Relevance: each matching document gets a score that represents how relevant it is to the query
   

In [None]:
payload = """
{
    "query" : {
        "match" : {
            "about" : "rock climbing"
        }
    }
}
"""

r = requests.get('http://localhost:9200/megacorp/employee/_search?pretty', data = payload)
print(r.text)

### Phrase search

-  **match: rock climbing** - matches *rock* or *climbing*; documents that contain both terms score higher
-  **match_phrase: "rock climbing"** - matches the phrase *"rock climbing"*; both terms must appear in the document, adjacent and in that order


In [None]:
payload = """
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    }
}
"""

r = requests.get('http://localhost:9200/megacorp/employee/_search?pretty', data = payload)
print(r.text)
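
The difference between the two query types can be imitated locally on the `about` fields of the three employees. This is a toy model only: it splits on whitespace and ignores scoring and Elasticsearch's analyzers, and the function names are invented for this sketch.

```python
abouts = {
    "1": "I love to go rock climbing",
    "2": "I like to collect rock albums",
    "3": "I like to build cabinets",
}


def tokens(text):
    """Very rough stand-in for analysis: lowercase and split on whitespace."""
    return text.lower().split()


def toy_match(text, query):
    """True if any query term occurs in the text."""
    return any(term in tokens(text) for term in tokens(query))


def toy_match_phrase(text, query):
    """True if the query terms occur adjacently and in order."""
    words, phrase = tokens(text), tokens(query)
    return any(words[i:i + len(phrase)] == phrase
               for i in range(len(words) - len(phrase) + 1))


match_ids = sorted(i for i, t in abouts.items() if toy_match(t, "rock climbing"))
phrase_ids = sorted(i for i, t in abouts.items() if toy_match_phrase(t, "rock climbing"))
print(match_ids)    # ['1', '2']
print(phrase_ids)   # ['1']
```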