# Elasticsearch

Elasticsearch is a token-based search engine. Queries and documents are analyzed into tokens, and matching documents are ranked by a relevance score. The default scoring algorithm is [BM25](https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables). More expressive queries can be written with the [query string syntax](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-syntax).
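
To make the scoring step concrete, here is a minimal sketch of BM25 in plain Python. It follows the textbook formula described in the linked post, with `k1=1.2` and `b=0.75` (Elasticsearch's default values). The whitespace tokenization and the three sample documents are assumptions made for illustration, and the real Lucene implementation may differ in constant factors, so treat the numbers as indicative only.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score each tokenized document against the query with textbook BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs_tokens for term in set(doc))
    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term frequency saturates with k1 and is normalized by document length via b.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

# Made-up sample documents, tokenized by whitespace.
docs = [
    "dogs fight in the park".split(),
    "a cat sleeps on the sofa".split(),
    "two dogs chase a cat".split(),
]
scores = bm25_scores("dogs fight".split(), docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

The document containing both query terms ranks first, the one containing only `dogs` second, and the document with no matching term scores zero.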

In [None]:
%%capture

!pip install elasticsearch==7.14.0
!apt install default-jdk > /dev/null

In [None]:
import os
import elasticsearch
from elasticsearch import Elasticsearch
import numpy as np
import pandas as pd
import sys
import json
from ast import literal_eval
from tqdm import tqdm
import datetime
from elasticsearch import helpers
from pathlib import Path
import urllib.request

In [None]:
# Download & extract Elasticsearch 7.0.0

!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.0.0-linux-x86_64.tar.gz -q
!tar -xzf elasticsearch-7.0.0-linux-x86_64.tar.gz
!chown -R daemon:daemon elasticsearch-7.0.0

Running Elasticsearch as a daemon means starting the server as a background process, so that it keeps running independently of any terminal or user session.

### What is a Daemon?

A daemon is a background process that runs continuously and typically performs system-level tasks. Daemons do not require user interaction and are usually started at system boot and run until the system is shut down. They operate silently in the background, handling tasks like logging, scheduling, or, in this case, search indexing and querying.

### Why Run Elasticsearch as a Daemon?

Running Elasticsearch as a daemon has several advantages:

1. **Continuous Operation**: Elasticsearch can run continuously, handling requests and managing indexes without interruption, even if users log out or the terminal session ends.
2. **Non-blocking Execution**: The server runs in the background, so the terminal or notebook that launched it stays free to run further commands while Elasticsearch serves requests.


In [None]:
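# Start Elasticsearch as a background process owned by the `daemon` user
import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.0.0/bin/elasticsearch'],
                  stdout=PIPE, stderr=STDOUT,
                  # Elasticsearch refuses to run as root, so switch to UID 1
                  # (the `daemon` user on Debian/Ubuntu) before launching it
                  preexec_fn=lambda: os.setuid(1)
                 )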

In [None]:
# Wait for the instance to finish starting; this takes a little while
import time
time.sleep(20)
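
A fixed sleep can be too short on a slow machine and needlessly long on a fast one. One alternative is to poll the HTTP endpoint until it answers. The sketch below is a hedged example: `http_probe` and `wait_until_ready` are hypothetical helper names, and the URL assumes the default `localhost:9200` address used elsewhere in this notebook.

```python
import http.client
import time
import urllib.error
import urllib.request

def http_probe(url="http://localhost:9200/", timeout=2):
    """Return True if the URL answers with HTTP 200, False if it cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts, and HTTP errors all mean "not ready yet".
        return False

def wait_until_ready(probe, timeout=60.0, interval=1.0):
    """Call probe() repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False
```

Calling `wait_until_ready(http_probe)` in place of `time.sleep(20)` returns `True` as soon as the server responds, or `False` if it has not come up within the timeout.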

In [None]:
%%bash
# If the output shows one root process and two daemon processes, Elasticsearch has started successfully
ps -ef | grep elasticsearch

In [None]:
# Check if elasticsearch is running
!curl -sX GET "localhost:9200/"

In [None]:
es = Elasticsearch(hosts = [{"host":"localhost", "port":9200}])
# Check if Python is connected to Elasticsearch
es.ping()

In [None]:
import tarfile

def load_data():
    # Download and extract the dataset only if the tarball is not already present
    tarball_path = Path("/content/datasets/articles_es.tgz")
    if not tarball_path.is_file():
        tarball_path.parent.mkdir(parents=True, exist_ok=True)
        url = "https://github.com/tarekhaledai/elasticsearch/raw/main/data/articles_es.tgz"
        urllib.request.urlretrieve(url, tarball_path)
        with tarfile.open(tarball_path) as dataset_tarball:
            dataset_tarball.extractall(path=tarball_path.parent)
    return pd.read_csv(tarball_path.parent / "articles_es.csv")

dataset = load_data()
dataset.head()

In [None]:
# Define settings & mappings of Elasticsearch index
Settings = {
    "settings":{
        "number_of_shards":1,
        "number_of_replicas":0
    },
    "mappings":{
        "properties":{
            "article":{
                "type":"text"
            },
            "highlights":{
                "type":"text"
            }
        }
    }
}

The `Settings` dictionary defines the configuration of the Elasticsearch index: the number of shards and replicas, and the mappings that specify the structure of the documents in the index. Each component is described below.

### Settings

The `settings` key in the dictionary configures how Elasticsearch will handle the index in terms of distribution and redundancy.

- `"number_of_shards": 1`
  - **Number of Shards**: Shards are subdivisions of an index that allow for distributed storage and parallel processing of data. Setting this to 1 means the index will have a single shard. This is suitable for small datasets or development environments where high scalability is not required.

- `"number_of_replicas": 0`
  - **Number of Replicas**: Replicas are copies of the primary shards and provide redundancy and fault tolerance. Setting this to 0 means there will be no replicas, which implies no redundancy. This setting might be used in a development environment where data loss is acceptable, or to save resources.
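
To illustrate how shards distribute documents, the sketch below mimics the routing rule in its simplest documented form, `hash(_routing) % number_of_primary_shards`, where the routing value defaults to the document `_id`. It is an approximation under stated assumptions: `zlib.crc32` stands in for Elasticsearch's internal hash function, `route_to_shard` is a hypothetical helper, and the document IDs are made up.

```python
import zlib

def route_to_shard(routing_value, number_of_primary_shards):
    """Pick a shard number for a routing value (by default the document _id)."""
    # zlib.crc32 is only a stand-in; Elasticsearch uses its own hash function.
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_primary_shards

doc_ids = ["a1", "b2", "c3", "d4"]  # made-up document IDs

# With a single shard, as in the settings above, every document lands on shard 0.
single_shard = {doc_id: route_to_shard(doc_id, 1) for doc_id in doc_ids}

# With three shards, documents are spread over shards 0, 1, and 2.
three_shards = {doc_id: route_to_shard(doc_id, 3) for doc_id in doc_ids}

print(single_shard)
print(three_shards)
```

Because the shard is derived from the number of primary shards, that number is fixed when the index is created, which is why it appears in the index settings.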

### Mappings

The `mappings` key defines the structure and data types of the documents stored in the index. This is crucial for Elasticsearch to understand how to index and search the data.

- `"properties"`: This defines the fields within the documents and their respective data types.

  - `"article": {"type": "text"}`
    - **Article Field**: This field will store the main content of the document. The type is set to `text`, which means Elasticsearch will analyze the field and create an inverted index to support full-text search capabilities. This is suitable for fields where you want to perform search queries on the content.

  - `"highlights": {"type": "text"}`
    - **Highlights Field**: Like the `article` field, this field is of type `text`, so it is analyzed and indexed for full-text search in the same way.
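
As a rough illustration of what analyzing a `text` field and building an inverted index means, here is a small pure-Python sketch. The `analyze` function (lowercasing and splitting on non-word characters) is a simplified stand-in for Elasticsearch's analyzer, not its actual implementation, and the two documents are made up.

```python
import re
from collections import defaultdict

def analyze(text):
    """Lowercase the text and split it into word tokens (a simplified analyzer)."""
    return re.findall(r"\w+", text.lower())

def build_inverted_index(docs):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index

# Made-up documents keyed by ID.
docs = {
    1: "Dogs fight in the park.",
    2: "The cat watched the dogs.",
}
index = build_inverted_index(docs)
print(sorted(index["dogs"]))  # documents containing the token "dogs"
print(sorted(index["cat"]))   # documents containing the token "cat"
```

A search for a token is then a dictionary lookup rather than a scan of every document, which is what makes full-text search on `text` fields fast.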



In [None]:
def json_formatter(dataset, index_name, index_type='_doc'):
    """
    Build a list of bulk API action dictionaries, one per row of the dataset.

    Args:
      dataset: The pandas DataFrame to convert.
      index_name: Name of the target index in Elasticsearch.
      index_type: Mapping type of the documents.
      Note: Keep index_type as '_doc'; custom mapping types are deprecated in Elasticsearch 7.x.
      Note: Every column of the dataset is copied into the document. To index only selected
      columns, replace the inner loop with your own fields.
    """
    try:
        actions = []
        for idx, row in dataset.iterrows():
            action = {}
            action['_index'] = index_name
            action['_type'] = index_type
            # Copy every column of the row into the document body
            source = {}
            for column in dataset.columns:
                source[column] = row[column]
            action['_source'] = source
            actions.append(action)
        return actions

    except Exception as e:
        # Report the problem, then re-raise so the caller does not silently receive None
        print("There is a problem: {}".format(e))
        raise
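
For larger datasets, building the whole list in memory can be avoided by yielding actions one at a time, since the bulk helper accepts any iterable of actions. The following is a minimal sketch, not the approach used in this notebook: `generate_actions` is a hypothetical helper that takes plain dictionaries (for example from `dataset.to_dict("records")`), and the sample records are made up. It also omits `_type`, relying on the typeless API of Elasticsearch 7.

```python
def generate_actions(records, index_name):
    """Yield one bulk action per record instead of building a full list."""
    for record in records:
        yield {"_index": index_name, "_source": record}

# Made-up records standing in for rows of the dataset.
records = [
    {"article": "Dogs fight in the park.", "highlights": "Dogs fight."},
    {"article": "A cat sleeps on the sofa.", "highlights": "Cat sleeps."},
]
actions = list(generate_actions(records, "news_index"))
print(actions[0])
```

With a live connection, this could be used as `helpers.bulk(es, generate_actions(dataset.to_dict("records"), "news_index"))`.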

In [None]:
MY_INDEX = es.indices.create(index="news_index", ignore=[400,404], body=Settings)
MY_INDEX

In [None]:
json_Formatted_dataset = json_formatter(dataset=dataset, index_name='news_index', index_type='_doc')
json_Formatted_dataset[0]

In [None]:
# To import the data into Elasticsearch, use the bulk helper from elasticsearch.helpers
try:
    res = helpers.bulk(es, json_Formatted_dataset)
    print("successfully imported to elasticsearch.")
except Exception as e:
    print(f"error: {e}")

In [None]:
# Fetch 10 sample documents
query = es.search(
    index="news_index",
    body={
      "size":10,
      "query": {
        "match_all":{}
      }
    }
)

output = pd.json_normalize((query['hits']['hits']))
output

In [None]:
# Bool query: documents must match "dogs fight" in `article`; a match for "cat" in `highlights` raises the score
query = es.search(
    index="news_index",
    body={
        "size":20,
        "query":{
            "bool":{
                "must":[
                        {"match":{"article":"dogs fight"}}
                ],
                "should":[
                        {"match":{"highlights":"cat"}}
                ]
            }
        }
    }
)

output = pd.json_normalize((query['hits']['hits']))
output
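
In a `bool` query that has a `must` clause, the `should` clauses are optional: they do not filter documents out, but documents that match them score higher. The sketch below wraps that structure in a small builder so the two roles are explicit. `build_bool_query` is a hypothetical helper, not part of the Elasticsearch client; it only assembles the request body.

```python
def build_bool_query(must_field, must_text, should_field=None, should_text=None, size=20):
    """Assemble a bool query body with one required and one optional match clause."""
    # Documents have to match the must clause to be returned at all.
    bool_clause = {"must": [{"match": {must_field: must_text}}]}
    if should_field is not None and should_text is not None:
        # Matching the should clause is optional and only increases the score.
        bool_clause["should"] = [{"match": {should_field: should_text}}]
    return {"size": size, "query": {"bool": bool_clause}}

body = build_bool_query("article", "dogs fight", "highlights", "cat")
print(body)
```

The resulting dictionary is identical to the body used in the query above, so it could be passed as `es.search(index="news_index", body=body)`.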

In [None]:
# multi_match query: search for the same text in both the `article` and `highlights` fields
query = es.search(
    index="news_index",
    body={
        "size":20,
        "query":{
            "bool":{
                "must":[
                        {"multi_match":{
                            "query":"Football manchester united",
                            "fields":["article","highlights"]
                        }}
                ]
            }
        }
    }
)

output = pd.json_normalize((query['hits']['hits']))
output