# Elasticsearch indexing tutorial
## Part 1: Intro to Elasticsearch

Welcome to the Elasticsearch indexing tutorial!

By the end of this workshop, you will be able to:

- [understand the basics of Elasticsearch](#Understanding-the-basics-of-Elasticsearch)
- get a high-level understanding of the architecture of Elasticsearch
- perform basic CRUD (Create, Read, Update, and Delete) operations with Elasticsearch


### Understanding the basics of Elasticsearch

<base target="_blank">
#### `What is Elasticsearch?`

<center><b><i>You know, for search (and analysis)</i></b></center>
<center><img src="./img/ELK.png" alt="ES_logo" style="width:400px;"/></center>

First released in February 2010 by Shay Banon ([see wiki article](https://fr.wikipedia.org/wiki/Elasticsearch)), Elasticsearch (ES) exchanges JSON over HTTP, so the search engine can be used from any programming language.  
As a result, client libraries exist for many programming languages, such as Java, Python, JavaScript (Node.js), Ruby, PHP, .NET, Perl, Go...  

[Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/current/elasticsearch-intro.html) is:
- a distributed, open-source<a name="cite_ref-1"></a>[<sup>[1]</sup>](#cite_note-1) search and analytics engine, developed in Java

- an engine that ingests and returns JSON data, suitable for many data types (text, numbers, dates, geospatial data, structured or unstructured data, nested data)

- a document-oriented NoSQL engine capable of indexing and searching document files, built on Apache Lucene inverted indexes

- running in [near real-time](https://www.elastic.co/guide/en/elasticsearch/reference/current/near-real-time.html) (a document is typically searchable within about one second of being indexed)

- provided with extensive REST APIs for storing and searching the data

<small><a name="cite_note-1"></a>1. [^](#cite_ref-1) Elasticsearch is no longer a fully open-source component. In January 2021, Elastic announced that Elasticsearch and Kibana (as of the 7.11 release) would move away from the open-source Apache 2.0 license to a dual license (the Server Side Public License, SSPL, and the Elastic License). This prompted AWS to fork Elasticsearch and Kibana into OpenSearch and OpenSearch Dashboards, which cover the same use cases as the ELK Stack under the open-source Apache 2.0 license.</small>


**Elasticsearch allows you to store, search, and analyze huge volumes of data quickly, and returns answers in milliseconds.**
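
Because every operation is JSON sent over HTTP, any HTTP client can talk to a cluster. The sketch below only builds two requests with Python's standard library and does not send them. The base URL, the `teams` index name, and the `build_request` helper are assumptions made for illustration.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: a local, unsecured dev cluster


def build_request(method, path, body=None):
    """Build (but do not send) an HTTP request carrying an optional JSON body."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


# Index (create or replace) a document with id 1 in a hypothetical "teams" index.
index_req = build_request("PUT", "/teams/_doc/1", {"teamNickname": "Dream Team"})

# Full-text search on the same index.
search_req = build_request(
    "POST", "/teams/_search", {"query": {"match": {"teamNickname": "dream"}}}
)

for req in (index_req, search_req):
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))

# To actually send a request against a running cluster:
#     with urllib.request.urlopen(index_req) as resp:
#         print(json.load(resp))
```

The same two calls could be written with `curl` or any other HTTP client; only the method, the path, and the JSON body matter.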


##### `Use cases`

Elasticsearch is well suited to storing unstructured data and retrieving it quickly when needed, thanks to its search engine capabilities built on Apache Lucene. This makes it a good fit for these types of systems:

- Logging and Log Analysis
- Scraping and Combining Public Data
- Full-Text Search (e-commerce search, enterprise search, etc.)
- Data and Metrics
- Data Visualization
- System Observability
- Security (threat hunting and prevention)

Many systems and applications benefit from Elasticsearch and the [ELK stack](https://www.elastic.co/industries) in areas such as business data analytics, security and fraud detection, geospatial applications, cybersecurity, public safety and emergency response, logistics, scientific data analysis, machine learning/artificial intelligence, IoT...

Elasticsearch is a widely used enterprise search engine, adopted by companies like Uber, Netflix, Medium, LinkedIn, Stack Overflow, etc. for a variety of use cases.


##### `How does it work?`

To help understand how Elasticsearch works, let’s cover some basic concepts of how it organizes data and its backend components.

- **Logical concepts** from the bottom up:

    - **Fields**: the smallest individual units of data in a record, similar to the columns of a table in a relational database. These are the key/value pairs of the input JSON.  
      Each field has a defined datatype: core datatypes (strings, numbers, dates, booleans), complex datatypes (object and nested), geo datatypes (geo_point and geo_shape), and specialized datatypes (token count, join, rank feature, dense vector, flattened, etc.). See the full list [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html).

    - **Documents**: records in an index, just like rows of a table in a relational database.
      Each document is a JSON object (JSON being a widely used data interchange format), has a unique *_id*, and belongs to a specific *index* (and follows that index's mapping).  
      Document example:
      ```
      {
          "teamName": "Mission projets numérique",
          "teamNickname": "Dream Team",
          "members": [
              {
                  "firstName": "Vincent",
                  "position": "Chef"
              },
              {
                  "firstName": "Lucas",
                  "position": "Beautiful AI"
              },
              {
                  "firstName": "Philippe",
                  "position": "DoTSerisator"
              },
              {
                  "firstName": "Victor",
                  "position": "ES guru"
              }
          ]
      }
      ```

    - **Index / Indices**: documents are grouped into indices, similar to tables in a relational database, based on their characteristics. In an e-commerce context, you could have a *products* index, a *customers* index and an *orders* index.  
      An index is the highest-level entity that you can query against in ES. Indices are identified by lowercase names that are used when performing various actions (such as searching and deleting) against the indexed documents.

    - **Mapping(s)**: an index mapping defines how a document and its fields are indexed and stored, by declaring each field's datatype and how Elasticsearch should handle it.
      Unlike table schemas in relational databases, the data's structure does not have to be defined beforehand: if no mapping is provided, Elasticsearch automatically infers the datatypes from the input (this is called [Dynamic mapping](https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-mapping.html)).
      However, declaring explicit mappings for your indices is important to increase performance and/or save disk space.  
      Fields can (and often should) be indexed in more than one way for different purposes (full text search, aggregations, sorting). For instance, an input string field could be mapped as a text field for full-text search, and as a keyword field for sorting or aggregations.

    Following our analogy with traditional relational databases, the data structure used by [Elasticsearch](https://logz.io/blog/10-elasticsearch-concepts/) would be:

<center><img src="./img/4.png" alt="ES_RDB_concepts" style="width:400px;"/></center>  
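
The index, mapping, and document concepts above can be sketched with the official Python client. This is a minimal sketch, not a reference setup: the `teams` index name, the `raw` sub-field name, the choice of datatypes, and the `create_index_and_add_document` helper are assumptions, and `es` is assumed to be an `Elasticsearch` client using the 8.x keyword-argument API.

```python
# Explicit mapping for a team document (index name and field types are assumptions).
TEAM_INDEX = "teams"

TEAM_MAPPINGS = {
    "properties": {
        # Multi-field: analyzed text for full-text search,
        # plus an exact-value "keyword" sub-field for sorting and aggregations.
        "teamName": {
            "type": "text",
            "fields": {"raw": {"type": "keyword"}},
        },
        "teamNickname": {"type": "keyword"},
        # "nested" keeps each member object independent when querying.
        "members": {
            "type": "nested",
            "properties": {
                "firstName": {"type": "keyword"},
                "position": {"type": "text"},
            },
        },
    }
}

# A shortened version of the example document.
TEAM_DOC = {
    "teamName": "Mission projets numérique",
    "teamNickname": "Dream Team",
    "members": [
        {"firstName": "Vincent", "position": "Chef"},
        {"firstName": "Victor", "position": "ES guru"},
    ],
}


def create_index_and_add_document(es):
    """Create the index with its explicit mapping, then store one document.

    `es` is assumed to be an `elasticsearch.Elasticsearch` client (8.x API).
    """
    es.indices.create(index=TEAM_INDEX, mappings=TEAM_MAPPINGS)
    return es.index(index=TEAM_INDEX, id="1", document=TEAM_DOC)
```

With such a mapping, a full-text query would target `teamName`, while sorting or aggregating would target `teamName.raw`.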

   This is a summarized view of the logical layout of ES, but it doesn't tell how Elasticsearch handles your data in the background, which is what determines its performance, scalability, and availability.  

- **Back-end components in Elasticsearch**:

    - **Lucene** search engines: behind the scenes, ES (like other search products such as Apache Solr or MongoDB Atlas Search) relies on the Lucene search library. Read more about [Lucene](https://en.wikipedia.org/wiki/Apache_Lucene).  
      An ES index is sort of an abstraction because Elasticsearch partitions indices into smaller units called [shards](#shards), allowing data to be distributed across multiple servers for scalability.  
      These shards can also be replicated to ensure data reliability and availability in case of node failure.  
      Shards are the “real” search engines. Queries to an index’s contents are routed to its shards, each of which is actually a Lucene instance or Lucene index.  
      All the data in Elasticsearch is internally stored in sharded Apache Lucene indexes. ***Although data is stored in Apache Lucene, Elasticsearch is what makes it distributed*** and provides the easy-to-use APIs.
      
    - <a name="shards"></a>**Shards**: an instance of Lucene holding a subset of the documents of an index. An index can be divided into many shards.  
      Sharding is the key aspect that enables Elasticsearch's horizontal scaling: data is distributed across nodes, and searches and analyses run quickly over smaller Lucene indices.
        - **shard segments**: each shard is further divided into segments, where the data is indexed using inverted indexes.  
          As a shard grows, its segments are merged into fewer, larger segments.<br/><br/>

    - **Replica Shard**: a replica shard is a copy of a primary shard that prevents data loss in case of hardware failure. Its main purpose is failover: if the node holding a primary shard dies, a replica is promoted to the role of primary.  
    <center><img src="./img/elasticsearchArchitecture.png" alt="ES_Backend" style="width:800px;"/></center>  
      
    - **Node**: a node is simply one Elasticsearch instance that holds some data and participates in the cluster’s indexing and querying. A node belongs to a single cluster. A single cluster can have as many nodes as we want.
    
    - **Cluster**: a collection of one or more connected nodes (servers) that together hold the entire data and provide federated indexing and search capabilities across all of them. In relational database terms, a node corresponds to a database instance. A cluster can contain any number of nodes, all configured with the same cluster name.

    Elasticsearch’s distributed architecture enables rapid search and analysis of massive amounts of data with near real-time performance. But it is not the only aspect supporting these capabilities: the way documents are indexed matters just as much.
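
The way shards spread documents across a cluster can be illustrated with a simplified routing function. This is only a sketch of the principle: Elasticsearch uses its own hash function and an extra routing factor, so the CRC32 hash, the `pick_shard` helper, and the document ids below are stand-ins chosen for illustration.

```python
import zlib

# Index settings as they could be given when creating an index.
INDEX_SETTINGS = {"number_of_shards": 3, "number_of_replicas": 1}


def pick_shard(routing_value, number_of_primary_shards):
    """Map a routing value (by default the document _id) to a primary shard.

    Simplified: CRC32 is only a stand-in for the hash Elasticsearch really uses.
    """
    digest = zlib.crc32(routing_value.encode("utf-8"))
    return digest % number_of_primary_shards


# Spread a few hypothetical document ids over the primary shards.
shards = {n: [] for n in range(INDEX_SETTINGS["number_of_shards"])}
for doc_id in ("doc-1", "doc-2", "doc-3", "doc-4", "doc-5", "doc-6"):
    shards[pick_shard(doc_id, INDEX_SETTINGS["number_of_shards"])].append(doc_id)

for shard_number, doc_ids in shards.items():
    print(f"primary shard {shard_number}: {doc_ids}")

# Each primary has `number_of_replicas` copies, so the cluster holds this many shards:
total_shards = INDEX_SETTINGS["number_of_shards"] * (
    1 + INDEX_SETTINGS["number_of_replicas"]
)
print("total shard copies:", total_shards)
```

Because the shard is derived from the routing value and the number of primary shards, the same document id always lands on the same shard, which is why the number of primary shards is fixed when the index is created.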

- **Indexing process in Elasticsearch**:  
  Elasticsearch uses inverted indices, a data structure that maps words to the documents that contain them, for efficient search.   
    
    - **Data ingestion analysis**: when we index documents in Elasticsearch, a preliminary analysis is performed by [Analyzers](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-overview.html).  
      By default, Elasticsearch uses the [standard analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html), but we can also choose a different one among the [built-in analyzers](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html) or create a [custom analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html) through the index settings and the [analyzer parameter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer.html).  
      Here's a basic example:      
      <center><img src="./img/elasticsearchAnalyzers.png" alt="ES_Analyzers" style="width:800px;"/></center>  
      
      **NB**: in our context, alongside the built-in *html_strip* character filter, we often use a stop filter with the built-in *\_french_* stopword list, a custom *french_elision* filter, as well as the *icu_folding* [plugin filter](https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu.html).  

    - **Inverted index**: we briefly mentioned that an index's data is turned into inverted indexes in the segments of each shard (Lucene index).
      Here's how:  
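
A toy version of the two steps, analysis and then inverted index construction, fits in a few lines of plain Python. This is a sketch of the idea only, not how Lucene is implemented: the `analyze`, `fold`, `build_inverted_index`, and `search` helpers, the tiny elision and stopword lists, and the sample sentences are all made up for illustration.

```python
import re
import unicodedata
from collections import defaultdict

# Tiny illustrative lists; real French filters use much longer ones.
ELISIONS = ("l", "d", "j", "qu", "n", "s", "c", "m", "t")
STOPWORDS = {"le", "la", "les", "de", "des", "du", "un", "une", "et", "a"}


def fold(text):
    """Lowercase and strip accents (a rough stand-in for icu_folding)."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text):
    """Turn raw text into index terms: tokenize, drop elisions, fold, drop stopwords."""
    terms = []
    for token in re.findall(r"[\w'’]+", text):
        prefix, apostrophe, rest = token.replace("’", "'").partition("'")
        if apostrophe and prefix.lower() in ELISIONS:
            token = rest  # "l'archive" -> "archive"
        term = fold(token).strip("'")
        if term and term not in STOPWORDS:
            terms.append(term)
    return terms


def build_inverted_index(documents):
    """Map each term to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in analyze(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(inverted_index, query):
    """Return the ids of the documents containing every term of the analyzed query."""
    term_hits = [set(inverted_index.get(term, [])) for term in analyze(query)]
    return sorted(set.intersection(*term_hits)) if term_hits else []


DOCS = {
    1: "L'école des chartes",
    2: "Les archives de l'École",
    3: "Une charte et des archives",
}

index = build_inverted_index(DOCS)
print(index["ecole"])                        # [1, 2]
print(index["archives"])                     # [2, 3]
print(search(index, "Archives de l'école"))  # [2]
```

The query goes through the same analysis as the documents, so `l'école` and `École` both resolve to the term `ecole`. Since this toy analyzer has no stemmer, `charte` and `chartes` remain two distinct terms.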


#### `Table of contents`
- [Elasticsearch Introduction](#what-is-elastic-search)
- Elasticsearch Architecture
  - [Indices](#indices)
  - Types
  - Documents
  - Fields
  - Cluster
  - Shard
  - Replica Shards
- [Elasticsearch Queries](#elasticsearch-queries)
- APIs
- [Elastic Stack](#elastic-stack)  
  -  Kibana
  -  Beats
  -  Logstash
- Books
- Certifications
- Elasticsearch developer tools and utilities
- Elasticsearch Use cases


#### `Elasticsearch Architecture`

##### `Indices`
Indices, the largest unit of data in Elasticsearch, are logical partitions of documents and can be compared to a database in the world of relational databases.

Continuing our e-commerce app example, you could have one index containing all of the data related to the products and another with all of the data related to the customers.
You can define as many indices in Elasticsearch as you want. Each index holds its own documents. Indices are identified by lowercase names, which are used when performing actions (such as searching and deleting) on the documents inside each index.
For a list of best practices in handling indices, check out the blog post linked below. Another key element in understanding how Elasticsearch’s indices work is to get a handle on shards.

 - [Best Practices for Managing Elasticsearch Indices](https://logz.io/blog/managing-elasticsearch-indices/) - Understanding indices
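
The actions performed on the documents of an index boil down to four operations: create, read, update, and delete. The sketch below chains them with the official Python client. It assumes `es` is an `Elasticsearch` client using the 8.x keyword-argument API; the `crud_roundtrip` helper, the `products` index, and the document itself are made up for illustration.

```python
def crud_roundtrip(es, index="products"):
    """Create, read, update, then delete one document.

    `es` is assumed to be an `elasticsearch.Elasticsearch` client (8.x API);
    the index name and the document are hypothetical.
    """
    # Create: index a JSON document under an explicit id.
    es.index(index=index, id="1", document={"name": "Desk lamp", "price": 25})

    # Read: fetch it back by id; the original JSON is under "_source".
    created = es.get(index=index, id="1")["_source"]

    # Update: partial update that merges the given fields into the document.
    es.update(index=index, id="1", doc={"price": 19})
    updated = es.get(index=index, id="1")["_source"]

    # Delete: remove the document from the index.
    es.delete(index=index, id="1")

    return created, updated
```

Against a running cluster, this could be called as `crud_roundtrip(Elasticsearch("http://localhost:9200"))`.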




#### APIs


#### Elasticsearch Queries
Elasticsearch provides a full Query DSL (Domain Specific Language) based on JSON to define queries. Think of the Query DSL as an AST (Abstract Syntax Tree) of queries, consisting of two types of clauses, leaf query clauses and compound query clauses:

<center><img src="./img/5-queries.png" alt="ES_Queries" style="width:800px;"/></center>
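
As a minimal sketch of how clauses nest, the query below combines three leaf clauses (`match`, `term`, `range`) inside one compound `bool` clause. The field names, the `products` index, and the `search_products` helper are assumptions for illustration, and `es` is assumed to be an `Elasticsearch` client using the 8.x keyword-argument API.

```python
# Leaf clauses: each one looks for a value in a single (hypothetical) field.
match_name = {"match": {"name": "wireless headphones"}}
term_brand = {"term": {"brand": "acme"}}
price_range = {"range": {"price": {"gte": 20, "lte": 100}}}

# Compound clause: "bool" combines other clauses.
bool_query = {
    "bool": {
        # "must" clauses have to match and contribute to the relevance score.
        "must": [match_name],
        # "filter" clauses have to match but do not affect the score.
        "filter": [term_brand, price_range],
    }
}


def search_products(es, index="products", size=10):
    """Run the compound query and return the matching documents.

    `es` is assumed to be an `elasticsearch.Elasticsearch` client (8.x API).
    """
    response = es.search(index=index, query=bool_query, size=size)
    return [hit["_source"] for hit in response["hits"]["hits"]]
```

The nesting of dictionaries mirrors the tree structure of the query: the `bool` node is the root, and the leaf clauses are its children.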



In [1]:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    print(es.info().body)

Output:

    {'name': 'PORT-RECH01', 'cluster_name': 'elasticsearch', 'cluster_uuid': 'vGCaiZbkSg2xEDTmljOIRQ', 'version': {'number': '8.13.0', 'build_flavor': 'default', 'build_type': 'deb', 'build_hash': '09df99393193b2c53d92899662a8b8b3c55b45cd', 'build_date': '2024-03-22T03:35:46.757803203Z', 'build_snapshot': False, 'lucene_version': '9.10.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'}


In [2]:

    print(es.cat.indices(index="*", s='index', pri=True, v=True))

Output:

    health status index                              uuid                   pri rep docs.count docs.deleted store.size pri.store.size dataset.size
    yellow open   dicotopo__development__places      iQFTZk_rQlGR07i_T3yABA   1   1    1207647        66305    247.8mb        247.8mb      247.8mb
    yellow open   encpos_document                    M0U4tb5EQgOvTySt8YSo1w   1   1       2996            0     71.6mb         71.6mb       71.6mb
    yellow open   lettres__development__collections  mpJH2xwtTjyHPghL8XgFog   1   1          4            0      7.3kb          7.3kb        7.3kb
    yellow open   lettres__development__documents    ZYsbdt7CSM-B3DwgCh24VA   1   1      12820            4     43.5mb         43.5mb       43.5mb
    yellow open   lettres__development__institutions Njmoh7qTTTSjxQsGf_0Uzw   1   1          3            0      6.1kb          6.1kb        6.1kb
    yellow open   lettres__development__languages    0bhHX_gMSLmnolm1HiuCTA   1   1          3            0      6.5kb          6.5kb        6.5kb

In [2]:

    import os

    current_directory = os.getcwd()
    print("The current working directory is:", current_directory)

Output:

    The current working directory is: /home/jboby/Documents/Documentation/elasticsearch_tutorial


### Test the basics of Elasticsearch