## Tutorial on Elasticsearch


### Introduction

This tutorial introduces [**Elasticsearch**](https://www.elastic.co/), a highly scalable open-source full-text search and analytics engine.

Elasticsearch is built on top of [Apache Lucene™](http://lucene.apache.org/), a popular Java search engine library, and exposes Lucene's features through a JSON-based REST API. It lets you store, search, and analyze large volumes of data quickly and in near real time (there is only a slight delay before a new document becomes searchable). It is generally used by applications with complex search requirements. Here are several examples of its usage:
* **Wikipedia** uses Elasticsearch to provide full-text search with highlighted search snippets, and search-as-you-type and did-you-mean suggestions.
* **The Guardian** uses Elasticsearch to combine visitor logs with social-network data to provide real-time feedback to its editors about the public’s response to new articles.
* **Stack Overflow** combines full-text search with geolocation queries and uses more-like-this to find related questions and answers.
* **GitHub** uses Elasticsearch to query 130 billion lines of code.

###  Elasticsearch terminology

Elasticsearch uses its own terminology, which in some cases differs from that of typical database systems.
Here are some common terms we will meet in this tutorial:
* **Node**: A node is a single running Elasticsearch instance that is part of your cluster, stores your data, and participates in the cluster’s indexing and search capabilities. 
* **Cluster**: A cluster is a collection of one or more nodes (servers) that together hold your entire data and provide federated indexing and search capabilities across all nodes.
* **Document**: A document is a basic unit of information stored in Elasticsearch. 
* **Index**: An index is a collection of documents that have somewhat similar characteristics. We can define as many indexes as we want in a single cluster. You can think of it as a table in a traditional relational database.
* **Shards & Replicas**: Since the data in an index can exceed the hardware limits of a single node, Elasticsearch can subdivide an index into multiple pieces called shards, which are distributed across the nodes of the cluster. Elasticsearch can also replicate each shard so that the data survives node failures. By default, each index in Elasticsearch 6.x (the version used in this tutorial) is allocated 5 primary shards and 1 replica of each.
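To make the shard and replica arithmetic concrete, here is a minimal sketch. The helper name `total_shard_copies` is hypothetical and only illustrates the counting; `number_of_shards` and `number_of_replicas` are the index settings that control these values.

```python
def total_shard_copies(number_of_shards, number_of_replicas):
    """Return how many shard copies (primaries plus replicas) an index needs."""
    return number_of_shards * (1 + number_of_replicas)

# Settings body as it could be passed when creating an index
# (the values mirror the 6.x defaults described above).
settings = {"settings": {"number_of_shards": 5, "number_of_replicas": 1}}

index_settings = settings["settings"]
print(total_shard_copies(index_settings["number_of_shards"],
                         index_settings["number_of_replicas"]))  # 10
```

A replica is never placed on the same node as its primary, so on a single-node cluster only the 5 primaries of such an index can be assigned.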

### Why choose Elasticsearch?

* #### Fast search responses
    
    Unlike a traditional relational database, Elasticsearch builds an inverted index over the documents you provide, which makes searches faster. An inverted index is a data structure designed to allow very fast full-text searches. It consists of a list of all the unique words that appear in any document and, for each word, a list of the documents in which it appears.
    
    For example, let’s say we have two documents, each with a content field containing the following:

        a. The quick brown fox jumped over the lazy dog
        b. Quick brown foxes leap over lazy dogs in summer
        
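   Elasticsearch stores these two documents in a structure like the one below. The terms are shown after normalization: they are lowercased and reduced to a root form, and `leap` is treated as a synonym of `jump`.
   With the inverted index, Elasticsearch can look up a term in the index instead of scanning the text directly.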
   
| Term    | Doc_1 | Doc_2 |
| :------ |:------|:------|
| brown   |   X   |  X    |
| dog     |   X   |  X    |
| fox     |   X   |  X    |
| in      |       |  X    |
| jump    |   X   |  X    |
| lazy    |   X   |  X    |
| over    |   X   |  X    |
| quick   |   X   |  X    |
| summer  |       |  X    |
| the     |   X   |       |
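The following is a minimal sketch of how such an inverted index could be built for the two example documents. It is a toy illustration, not how Elasticsearch or Lucene implement it: the `ROOT_FORMS` table, `analyze`, `build_inverted_index`, and `lookup` are hypothetical names, and the hand-written root-form table stands in for a real analyzer.

```python
import re
from collections import defaultdict

# Toy normalization table (an assumption for this example, not a real analyzer).
ROOT_FORMS = {"jumped": "jump", "leap": "jump", "foxes": "fox", "dogs": "dog"}

def analyze(text):
    """Lowercase the text, split it into words, and map each word to its root form."""
    words = re.findall(r"[a-z]+", text.lower())
    return [ROOT_FORMS.get(word, word) for word in words]

def build_inverted_index(docs):
    """Map every term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def lookup(index, query):
    """Return the ids of the documents that contain every term of the query."""
    postings = [index.get(term, set()) for term in analyze(query)]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}
index = build_inverted_index(docs)

print(sorted(index["quick"]))             # [1, 2]
print(sorted(index["summer"]))            # [2]
print(sorted(lookup(index, "lazy dogs"))) # [1, 2]
```

Answering a query only touches the entries for the query terms, which is why the lookup stays fast however long the documents are.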

    

* #### Easy to scale

    Another reason is that Elasticsearch is also a distributed storage system. It can add resources and balance the load between the nodes in a cluster. It also replicates the data automatically to prevent data loss if a node fails. It can run on hundreds of servers and handle petabytes of documents.
    
    Moreover, a cluster is very easy to build. Just bring up multiple Elasticsearch nodes on the same network and assign them the same cluster name. Elasticsearch handles many things automatically, including node discovery and master election.

* #### Excellent Query DSL

    Elasticsearch has a powerful JSON-based DSL (domain-specific language), which lets development teams construct complex queries and fine-tune them to get the most precise results from a search. We will use the Query DSL in the example later in this tutorial.

* #### Using RESTful API and JSON

    Everything in Elasticsearch can be controlled through its RESTful API. From creating indexes to changing the number of replicas per index, everything can be done on the fly with simple RESTful API calls. Requests and responses are in JSON format, which is both machine and human readable. For developers, these API calls make it very convenient to manipulate documents.
    
* #### Variety of useful Plugins

    Elasticsearch provides some functionality out of the box, such as tokenizers and analyzers. It also allows plugins to be installed to enhance this basic functionality. Plugins range from custom mapping types and custom analyzers (in a more built-in fashion) to custom script engines, custom discovery, and more.
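As a concrete illustration of the RESTful API point above, here is a small sketch that builds the HTTP request for changing the replica count of an index, using only the Python standard library. The helper name `build_replica_request`, the index name `my_index`, and the host are assumptions made for this example; the request is built but not sent, so the snippet runs without a live node.

```python
import json
import urllib.request

def build_replica_request(index, replicas, host="http://localhost:9200"):
    """Build (but do not send) a PUT request that changes an index's replica count."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host}/{index}/_settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

request = build_replica_request("my_index", 0)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To actually send it against a running node:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Both the request body and the response are plain JSON, so the same call can be made from `curl` or any HTTP client.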


### Building a simple Yelp Search Engine using Elasticsearch

To get a better understanding of Elasticsearch, let's build a simple Yelp-style search engine for the restaurants near Carnegie Mellon University. The engine will provide the following functionality:

    a. If the user searches for a restaurant name, it returns the matching restaurants.
    b. If the user searches for a food name, it returns all the restaurants that serve that food.
    c. The results are sorted by score.



### 1. Download and run Elasticsearch
1. Elasticsearch is built using Java, and requires at least Java 8 in order to run.
2. Download Elasticsearch from https://www.elastic.co/downloads/elasticsearch.
3. Run `bin/elasticsearch` (or `bin\elasticsearch.bat` on Windows). 

    By default, Elasticsearch listens on http://localhost:9200/ and runs with a 1 GB JVM heap. You can change the basic settings in `config/elasticsearch.yml` and the JVM settings in `config/jvm.options`.
    
4. Then, open a terminal and run `curl http://localhost:9200/` (or use the port you configured) to check that Elasticsearch has started.

    You should get a response like this:
    ```json
    {
      "name" : "node-1",      // node name
      "cluster_name" : "15-688-tutorial",      // cluster name
      "cluster_uuid" : "tEAzUdf1QFmbgYHoBAW-9A",
      "version" : {
        "number" : "6.2.2",
        "build_hash" : "10b1edd",
        "build_date" : "2018-02-16T19:01:30.685723Z",
        "build_snapshot" : false,
        "lucene_version" : "7.2.1",
        "minimum_wire_compatibility_version" : "5.6.0",
        "minimum_index_compatibility_version" : "5.0.0"
      },
      "tagline" : "You Know, for Search"
    }
    ```
    
    
### 2. Download the Python Elasticsearch Client
In this tutorial, we will use the Python Elasticsearch client [*elasticsearch-py*](https://github.com/elastic/elasticsearch-py) to call the Elasticsearch APIs.

Install the elasticsearch-py package with pip:
```
pip install elasticsearch
```
After the installation, we can connect to Elasticsearch from Python.

In [95]:
# make sure you have installed these packages
import json
from elasticsearch import Elasticsearch

In [96]:
es = Elasticsearch() # create connection to the Elasticsearch

print(es.cluster.health()) # check the cluster status

{'cluster_name': '15-688-tutorial', 'status': 'yellow', 'timed_out': False, 'number_of_nodes': 1, 'number_of_data_nodes': 1, 'active_primary_shards': 15, 'active_shards': 15, 'relocating_shards': 0, 'initializing_shards': 0, 'unassigned_shards': 15, 'delayed_unassigned_shards': 0, 'number_of_pending_tasks': 0, 'number_of_in_flight_fetch': 0, 'task_max_waiting_in_queue_millis': 0, 'active_shards_percent_as_number': 50.0}
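The health response above reports `'status': 'yellow'` together with 15 unassigned shards. The sketch below shows one way to read such a response. `explain_health` is a hypothetical helper written for this tutorial, not part of elasticsearch-py, and the dictionary is a trimmed copy of the output above.

```python
def explain_health(health):
    """Turn a cluster health dict into a one-line, human-readable summary."""
    status = health["status"]
    if status == "green":
        reason = "all primary and replica shards are assigned"
    elif status == "yellow":
        reason = "all primaries are assigned, but {} replica shard(s) are not".format(
            health["unassigned_shards"])
    else:
        reason = "at least one primary shard is not assigned"
    return "{}: {} ({} node(s))".format(status, reason, health["number_of_nodes"])

# Trimmed copy of the health response printed above.
health = {"status": "yellow", "number_of_nodes": 1,
          "active_primary_shards": 15, "unassigned_shards": 15}
print(explain_health(health))
```

A replica is never allocated on the same node as its primary, so a yellow status is expected on a single-node setup like this one and does not indicate data loss.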


### 3. Modeling the data
We now have a clean Elasticsearch instance up and running. The next step is to add documents to it. In this example, we use data on five restaurants near Carnegie Mellon University, available in JSON format in *data.json*. Each restaurant has the following fields:

|  Field   | Description                         | 
| :------- |:----------------------------------- |
| Name     | Restaurant Name                     |
| Score    | The score in the Yelp review        |
| Address  | Restaurant Address                  |
| Phone    | Restaurant Phone                    |
| Price    | Average spending in this restaurant | 
| Foods    | The dishes this restaurant provided |     


|  Name              | Score | Address                    | Phone          |Price | Foods  |
| :----------------- |:-----:| :------------------------- |:-------------- |:----:| :----- |
| Lucca              |  3.5  | 317 S Craig St, Pittsburgh | (412) 682-3310 | 30   | Gluten Free Pasta, Goat Cheese Ravioli, Chicken Romano, Seafood Mix Grill, Lamb Chops |
| Crepes Parisiennes |    4  | 207 S Craig St, Pittsburgh | (412) 683-1912 | 30   | Egg & Cheese, Breakfast Crepe,  Smoked Icelandic Salmon, Ham & Cheese |
| Union Grill        |    4  | 413 S Craig St, Pittsburgh | (412) 681-8620 | 30   | Grilled Capicola & Cheese, Fish Sandwich, Baked Cheese Steak, Chicken Focaccia, Hamburger |
| Legume             |  4.5  | 214 N Craig St, Pittsburgh | (412) 621-2700 | 60   | Grilled Escarole Salad, Mixed Greens Salad, Chicken Paprikash, Hamburger |
| Lulu’s Noodles     |  2.5  | 400 S Craig St, Pittsburgh | (412) 687-7777 | 10   | Teriyaki Chicken Salad, Sauteed Chicken, Thai Curry Chicken, Lu Lu's Pan Fried Noodle |
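The contents of *data.json* are not shown in this tutorial. The sketch below assumes a layout inferred from the loading code in this section: a top-level `restaurant` list in which `foods` is a single comma-separated string. The excerpt is hypothetical and shortened to three dishes; only the shape matters.

```python
import json

# Hypothetical excerpt of data.json; the layout is an assumption.
raw = '''
{
  "restaurant": [
    {
      "name": "Lucca",
      "score": 3.5,
      "address": "317 S Craig St, Pittsburgh",
      "phone": "(412) 682-3310",
      "price": 30,
      "foods": "Gluten Free Pasta, Goat Cheese Ravioli, Chicken Romano"
    }
  ]
}
'''

record = json.loads(raw)["restaurant"][0]
# Split the comma-separated string into one object per dish,
# which is the shape a nested field expects.
record["foods"] = [{"name": food.strip()} for food in record["foods"].split(",")]
print(record["foods"])
```

Turning each dish into its own object lets every dish be indexed and matched separately instead of as one long string.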

Let's create an index in Elasticsearch and upload this data.

In [115]:
def get_restaurant_data():
    '''
    get the data from 'data.json'
    '''
    with open('data.json') as f:  # close the file once it has been read
        data = json.load(f)['restaurant']
    for d in data:
        foods = []
        for food in d['foods'].split(', '):
            foods.append({'name': food})
        d['foods'] = foods
    return data

def create_index(index, name):
    '''
    create the index in Elasticsearch.
    '''
    query = {
      "mappings": {
          name: {
              "properties": {
                  "name": {
                      "type": "text"
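                  },
                  "score": {
                      "type": "float"  # Yelp scores such as 3.5 are not integers
                  },
                  "address": {
                      "type": "text"
                  },
                  "phone": {
                      "type": "text"
                  },
                  "price": {
                      "type": "integer"
                  },
                  "foods": {
                      "type": "nested", 
                      "properties": {
                          "name": {  # must match the 'name' key used in the data and in the queries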
                              "type": "text"
                          }
                      }
                  }
              }
          }
      }
    }
    es.indices.create(index=index, body=query)

def upload_data(index, name, data):
    '''
    upload the data into Elasticsearch
    '''
    for d in data:
        log = es.index(index=index, doc_type=name, body=d)
        print(log) # check the log to see whether the data was uploaded successfully


index = "688_tutorial"
name = "restaurant"

create_index(index, name)
# check if the index exists after creating
print("All indices:",  list(es.indices.get_alias().keys()))

upload_data(index, name, get_restaurant_data())

All indices: ['my_index', 'test', '688_tutorial']
{'_index': '688_tutorial', '_type': 'restaurant', '_id': '0FmIdGIBi3HC1m0dSiEO', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 0, '_primary_term': 1}
{'_index': '688_tutorial', '_type': 'restaurant', '_id': '0VmIdGIBi3HC1m0dSiE5', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 0, '_primary_term': 1}
{'_index': '688_tutorial', '_type': 'restaurant', '_id': '0lmIdGIBi3HC1m0dSiFG', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 1, '_primary_term': 1}
{'_index': '688_tutorial', '_type': 'restaurant', '_id': '01mIdGIBi3HC1m0dSiFP', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 0, '_primary_term': 1}
{'_index': '688_tutorial', '_type': 'restaurant', '_id': '1FmIdGIBi3HC1m0dSiFa', '_version': 1, 'result': 'created', '_shards'

### 4. Search in Elasticsearch
So far, our restaurant data has been stored in Elasticsearch successfully. Next, we want to search the restaurants by restaurant name or food name, just as we would on [Yelp](https://www.yelp.com/). To do so, we need to build a search query in the [Query DSL (Domain Specific Language)](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html) and pass it to Elasticsearch through the search API.

Results can come from two fields: the restaurant name and the food name. We therefore add both fields to the search query, which then matches restaurants whose name, or whose food names, contain the keyword. We construct the query in the `get_search_query()` function and pass it to the `search()` function.

In [109]:
def get_search_query(keyword, show_explain):
    '''
    get the search query
    @param:
        keyword: search by this keyword
        show_explain: whether to show how the relevance score was computed for each hit
    '''
    query = {
      "explain": show_explain, 
      "query": {
        "bool": {
          "should": [
            {
              "match": {
                "name": keyword
              }
            },
            {
              "nested": {
                "path": "foods",
                "query": {
                  "match": {
                    "foods.name": keyword
                  }
                }
              }
            }
          ]
        }
      }
    }
    return query

def search(index, query):
    '''
    use query to get the search result from Elasticsearch
    '''
    res = es.search(index=index, body=query)
    return res
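To see which restaurants the `bool`/`should` query above is expected to match, here is a rough offline emulation on a shortened copy of the data. It is only a sketch: `tokens`, `matches`, and `emulate_search` are hypothetical helpers, the tokenizer is a simple stand-in for the real analyzer, and relevance scoring is ignored entirely.

```python
import re

def tokens(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matches(restaurant, keyword):
    """True if any keyword token occurs in the name or in any dish name."""
    wanted = tokens(keyword)
    in_name = bool(wanted & tokens(restaurant["name"]))
    in_foods = any(wanted & tokens(food["name"]) for food in restaurant["foods"])
    return in_name or in_foods  # "should": either clause is enough

# Shortened copy of the restaurant data used in this tutorial.
restaurants = [
    {"name": "Lucca", "foods": [{"name": "Chicken Romano"}, {"name": "Lamb Chops"}]},
    {"name": "Legume", "foods": [{"name": "Mixed Greens Salad"}, {"name": "Hamburger"}]},
    {"name": "Lulu's Noodles", "foods": [{"name": "Teriyaki Chicken Salad"}]},
]

def emulate_search(keyword):
    """Return the names of the restaurants that the emulated query matches."""
    return [r["name"] for r in restaurants if matches(r, keyword)]

print(emulate_search("Lucca"))  # ['Lucca']
print(emulate_search("Salad"))  # ['Legume', "Lulu's Noodles"]
```

The real query additionally ranks the matching documents by relevance, which this emulation does not attempt.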

### 5. What Elasticsearch does when you search
We have now finished our basic search function. When we search by a restaurant name, such as 'Lucca', it returns the information for the restaurant 'Lucca'. When we search by a food name, such as 'salad', it returns all the restaurants that serve salad.

In [116]:
# search for restaurant name
res = search(index, get_search_query("Lucca", False))
print(json.dumps(res, sort_keys = True, indent = 4, ensure_ascii=False))

# search for a food name that a restaurant serves
res = search(index, get_search_query("Salad", False))
print(json.dumps(res, sort_keys = True, indent = 4, ensure_ascii=False))

{
    "_shards": {
        "failed": 0,
        "skipped": 0,
        "successful": 5,
        "total": 5
    },
    "hits": {
        "hits": [
            {
                "_id": "0FmIdGIBi3HC1m0dSiEO",
                "_index": "688_tutorial",
                "_score": 0.80259144,
                "_source": {
                    "address": "317 S Craig St, Pittsburgh",
                    "foods": [
                        {
                            "name": "Gluten Free Pasta"
                        },
                        {
                            "name": "Goat Cheese Ravioli"
                        },
                        {
                            "name": "Chicken Romano"
                        },
                        {
                            "name": "Seafood Mix Grill"
                        },
                        {
                            "name": "Lamb Chops"
                        }
                    ],
                    "name": "Lucca

* **The Response**

Elasticsearch returns the response as JSON. By default, the hits are sorted by the relevance score in the `_score` field. The response also contains other information, such as how long the search took and the total number of matching documents. Here is the response for the search 'Lucca':
```json
{
    // this part tells how many shards were searched, and how many of them succeeded or failed
    "_shards": {
        "failed": 0,
        "skipped": 0,
        "successful": 5,
        "total": 5
    },
    // this part is search results
    "hits": { 
        "hits": [ // the array of search results 
            {
                "_id": "slmmb2IBi3HC1m0dviGC", // id of this data
                "_index": "688_tutorial", // index name
                "_score": 1.4599355, // the relevance score, which is a measure of how well the document matches the query.
                "_source": {  // our restaurant data
                    "address": "317 S Craig St, Pittsburgh",
                    "foods": [
                        " Gluten Free Pasta",
                        " Goat Cheese Ravioli",
                        " Chicken Romano",
                        " Seafood Mix Grill",
                        " Lamb Chops"
                    ],
                    "name": "Lucca",
                    "phone": "(412) 682-3310",
                    "price": 30,
                    "score": 3.5
                },
                "_type": "restaurant"
            }
        ],
        "max_score": 1.4599355, // the highest relevance score of all result
        "total": 1 // total number of documents matching our search criteria 
    },
    "timed_out": false, // tells if the search timed out or not
    "took": 4 // time in milliseconds for Elasticsearch to execute the search
}
```
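Once the structure of the response is clear, pulling out the interesting parts takes only a few lines. The sketch below works on a trimmed copy of the response above; `summarize` is a hypothetical helper. It assumes the 6.x response format used in this tutorial, where `hits.total` is a plain integer (from 7.0 on it is an object with a `value` field).

```python
def summarize(response):
    """Return (total, took_ms, [(name, relevance), ...]) from a search response dict."""
    hits = response["hits"]
    rows = [(hit["_source"]["name"], hit["_score"]) for hit in hits["hits"]]
    return hits["total"], response["took"], rows

# Trimmed copy of the response shown above.
response = {
    "took": 4,
    "timed_out": False,
    "hits": {
        "total": 1,
        "max_score": 1.4599355,
        "hits": [{"_score": 1.4599355, "_source": {"name": "Lucca", "score": 3.5}}],
    },
}
total, took_ms, rows = summarize(response)
print(total, took_ms, rows)
```

Note the two different scores: `_score` is the relevance computed by Elasticsearch, while `_source.score` is the Yelp score stored in the document.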

* **Relevance score**

Since version 5.0, the default similarity algorithm in Elasticsearch is Okapi BM25, a refinement of the classic term frequency/inverse document frequency (TF/IDF) approach. Elasticsearch can also explain how the score was computed for a query and a specific document. We can enable this by adding the `explain` field to our search query:
```json
{
  "explain": true,
  "query":{
      ...
  }
}
```
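The explain output in this tutorial describes the inverse document frequency part of the score as `log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))`. As a small worked example, the sketch below evaluates that formula for two hypothetical term counts; the function name `bm25_idf` and the counts are assumptions chosen for illustration, not values taken from the index.

```python
import math

def bm25_idf(doc_count, doc_freq):
    """idf as printed by explain: log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical counts: a term found in 1 of 5 documents vs. in 4 of 5 documents.
rare = bm25_idf(doc_count=5, doc_freq=1)
common = bm25_idf(doc_count=5, doc_freq=4)
print(round(rare, 4), round(common, 4))  # 1.3863 0.2877
```

A rare term therefore contributes more to the score than a common one. These counts are gathered per shard by default, so on a small index spread over several shards the same document can receive noticeably different scores.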

In [111]:
res = search(index, get_search_query("Lucca", True)) # enable the explain
print(json.dumps(res, sort_keys = True, indent = 4, ensure_ascii=False))

{
    "_shards": {
        "failed": 0,
        "skipped": 0,
        "successful": 5,
        "total": 5
    },
    "hits": {
        "hits": [
            {
                "_explanation": {
                    "description": "sum of:",
                    "details": [
                        {
                            "description": "weight(name:lucca in 5) [PerFieldSimilarity], result of:",
                            "details": [
                                {
                                    "description": "score(doc=5,freq=1.0 = termFreq=1.0\n), product of:",
                                    "details": [
                                        {
                                            "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                                            "details": [
                                                {
                                                    "description": "docFreq",
   

### 6. Sort the restaurants by score

When we search for the food 'Salad', the response shows that two restaurants serve it: Lulu's Noodles and Legume. However, the two restaurants in the response are ordered by relevance score. Intuitively, the first hit should be the most popular restaurant. In other words, we want to sort the results by each restaurant's Yelp score (the `score` field).

To enable sorting in the response, we add a `sort` clause to our search query. The results are then sorted by the Yelp score.

```json
{
  "query":{
      ...
  },
  "sort": {  // the sort query 
    "score": "desc"
  }
}
```
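Instead of copying the whole query body just to add the sort clause, the clause can also be attached to an existing query. The sketch below shows one possible approach: `with_sort` is a hypothetical helper, not part of elasticsearch-py, and the `_score` tie-breaker is an optional addition on top of the sort shown above.

```python
import copy

def with_sort(query, field, order="desc", tie_break_on_relevance=True):
    """Return a copy of `query` with a sort clause; the input dict is left untouched."""
    sorted_query = copy.deepcopy(query)
    sort = [{field: {"order": order}}]
    if tie_break_on_relevance:
        sort.append("_score")  # equal field values fall back to relevance order
    sorted_query["sort"] = sort
    return sorted_query

base = {"query": {"match": {"name": "Salad"}}}
print(with_sort(base, "score"))
```

When results are sorted by a field only, Elasticsearch does not compute the relevance and reports `_score` as `null`; listing `_score` as a secondary sort key keeps it available.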

In [117]:
def get_sort_search_query(keyword, show_explain):
    '''
    get the search query; the results will be sorted by score
    @param:
        keyword: search by this keyword
        show_explain: whether to show how the relevance score was computed for each hit
    '''
    query = {
      "explain": show_explain, 
      "query": {
        "bool": {
          "should": [
            {
              "match": {
                "name": keyword
              }
            },
            {
              "nested": {
                "path": "foods",
                "query": {
                  "match": {
                    "foods.name": keyword
                  }
                }
              }
            }
          ]
        }
      },
      "sort": {  
          "score": "desc"
      }
    }
    return query

# check the result again
res = search(index, get_sort_search_query("Salad", False))
print(json.dumps(res, sort_keys = True, indent = 4, ensure_ascii=False))

{
    "_shards": {
        "failed": 0,
        "skipped": 0,
        "successful": 5,
        "total": 5
    },
    "hits": {
        "hits": [
            {
                "_id": "01mIdGIBi3HC1m0dSiFP",
                "_index": "688_tutorial",
                "_score": null,
                "_source": {
                    "address": "214 N Craig St, Pittsburgh",
                    "foods": [
                        {
                            "name": "Grilled Escarole Salad"
                        },
                        {
                            "name": "Mixed Greens Salad"
                        },
                        {
                            "name": "Chicken Paprikash"
                        },
                        {
                            "name": "Hamburger"
                        }
                    ],
                    "name": "Legume",
                    "phone": "(412) 621-2700",
                    "price": 60,
                    "scor
