# es basic

## what is es
es is a distributed search and analytics engine, scalable data store, and vector database built on apache lucene.
### what is es stack
- ingest: fleet and elastic agent, apm (application performance monitoring), beats, elasticsearch ingest pipelines, logstash
- store: index, store, search, analyze
- consume: kibana, elasticsearch clients 
### search
- full-text search: fast, relevant full-text search using inverted indexes, tokenization, and text analysis
- vector database: store and search vectorized data, and create vector embeddings with built-in and third-party nlp models. <i>please note the three terms in the context of NLP (token, embedding, vector)</i>
- semantic search: understand the intent and contextual meaning behind search queries using tools like synonyms, dense vector embeddings, and learned sparse query-document expansion 
- hybrid search: combine full-text search with vector search using state-of-the-art ranking algorithms. 
- build search experiences: add hybrid search capabilities to apps or websites, or build enterprise search engines over an org's internal data sources.
- RAG: use es as a RAG engine to supplement generative AI models with more relevant, up-to-date, or proprietary data for a range of use cases.
- geospatial search: search for locations and calculate spatial relationships using geospatial queries.
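
For the hybrid search item above, one widely used way to merge a full-text ranking with a vector ranking is reciprocal rank fusion (RRF). A minimal sketch in plain python, assuming two hypothetical ranked lists of doc ids and the commonly used constant k=60; it only illustrates the idea and is not a description of es internals:

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Fuse several ranked lists of doc ids with reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # each list contributes 1 / (k + rank) for the docs it contains
            scores[doc_id] += 1.0 / (k + rank)
    # highest fused score first; ties broken by doc id for determinism
    return sorted(scores, key=lambda d: (-scores[d], d))

# hypothetical results from a full-text query and a vector query
text_hits = ["doc1", "doc2", "doc3"]
vector_hits = ["doc3", "doc1", "doc4"]
fused = rrf([text_hits, vector_hits])
print(fused)
```

docs that appear near the top of both lists (doc1, doc3) end up ahead of docs that appear in only one.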

## run es
- quick start: local development with docker
- hosted options: elastic cloud hosted / elastic cloud serverless
- advanced: self-managed, elastic cloud enterprise, elastic cloud on k8s 

## indices, documents and fields
### index
an index is the fundamental unit of storage in es, a logical namespace for storing data that share similar characteristics. 
an index is a collection of docs uniquely identified by a name or an alias. 
### document and fields
es serializes and stores data in the form of JSON docs. a doc is a set of fields, which are key-value pairs that contain data. each doc has a unique id, which the user can supply or have es auto-generate.
### metadata fields
an indexed doc contains data and metadata. metadata fields are system fields that store info about the docs. in es, metadata fields are prefixed with an underscore. 
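
A small sketch of how data and metadata sit side by side in a returned doc. The hit below is hypothetical (index name, id, and field values are made up); `_source` is itself a metadata field that carries the original JSON body:

```python
# hypothetical doc as returned by es; all values are illustrative
hit = {
    "_index": "my-index",
    "_id": "1",
    "_source": {"title": "hello", "status": "active"},
}

# metadata fields are the ones prefixed with an underscore
metadata_names = sorted(name for name in hit if name.startswith("_"))
# the user's own fields live inside _source
fields = hit["_source"]

print(metadata_names)
print(fields)
```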
### mappings and data types
each <u>index</u> has a mapping, or schema, that describes how the fields in its docs are indexed. a mapping defines the data type for each field, how the field should be indexed, and how it should be stored. 
- dynamic mapping: let es automatically detect the data types and create mappings.
- explicit mapping: define the mappings up front by specifying data types for each field. <u>recommended for prod</u> use cases, because you have full control over how your data is indexed to suit your specific use case.
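
A minimal sketch of what an explicit mapping can look like, written as a python dict. The field names are hypothetical, and `undeclared_fields` is an illustrative helper, not part of any es client:

```python
import json

# hypothetical explicit mapping; field names are illustrative only
mappings = {
    "dynamic": "strict",  # reject docs that contain undeclared fields
    "properties": {
        "title": {"type": "text"},       # analyzed for full-text search
        "status": {"type": "keyword"},   # exact-value matching
        "created_at": {"type": "date"},
        "views": {"type": "integer"},
    },
}

def undeclared_fields(doc, mappings):
    """Return the top-level doc fields that the mapping does not declare."""
    return sorted(set(doc) - set(mappings["properties"]))

print(json.dumps(mappings, indent=2))
print(undeclared_fields({"title": "hi", "author": "me"}, mappings))
```

a dict like this is what gets passed as `mappings=` when creating an index with the python client.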

## add data to es
### general content 
- use the api to index docs; when developing an app, use the es client for your programming language
- use the kibana file uploader
- web crawler: extract and index web page content into es docs; [detail at github](https://github.com/elastic/crawler)
- connectors: sync docs from various third-party data sources to create searchable, read-only replicas
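
When indexing through the api, many docs are usually sent in one request with the bulk api, whose body is newline-delimited JSON: an action line followed by the doc itself. A sketch of building such a body, assuming a hypothetical index name and docs; `to_bulk_body` is an illustrative helper:

```python
import json

def to_bulk_body(index, docs):
    """Build a newline-delimited JSON body for the _bulk api."""
    lines = []
    for doc_id, source in docs.items():
        # action line: which index and id the next line belongs to
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        # source line: the doc itself
        lines.append(json.dumps(source))
    # the bulk body must end with a newline
    return "\n".join(lines) + "\n"

body = to_bulk_body("my-index", {"1": {"title": "first"}, "2": {"title": "second"}})
print(body)
```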
### timestamped data (ignored)

## search and analyze data
### REST API
use the REST API to manage an es cluster, and to index and search data from the command line, dev tools, or a client
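
A sketch of what a raw REST call looks like when built by hand with the python standard library. It assumes an unsecured local node at `http://localhost:9200` and a hypothetical index name; the request is only constructed here, not sent:

```python
import json
import urllib.request

# assumption: an unsecured local node on the default port; adjust for your cluster
ES_URL = "http://localhost:9200"

def build_search_request(index, query):
    """Build (but do not send) a search request against the REST API."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{ES_URL}/{index}/_search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_search_request("my-index", {"match": {"content": "hello"}})
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would send it to a running cluster
```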
### query language
- query DSL, primary
- ES|QL, a new piped query language and compute engine. [ignore]
### query DSL
query DSL is a full-featured JSON-style query language that enables complex searching, filtering, and aggregations. 
### search and filter with query DSL 
- full-text search
```json
{
  "query": { 
    "match": {
      "content": "your search term here" // field and query string
    }
  }
}
```
- keyword search
```json
{
  "query": {
    "term": {
      "status": "active" 
    }
  }
}
```
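- semantic search; an embedding model must be deployed in advance. note: the `more_like_this` query shown here matches on shared terms rather than embeddings, so it only approximates semantic similarity
```json
{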
  "query": {
    "more_like_this": {
      "fields": ["content"],  // Field to search
      "like": "Sample text to find similar documents",  // Text to find similarity
      "min_term_freq": 1,  // Minimum term frequency in the document
      "max_query_terms": 12  // Maximum number of query terms
    }
  }
}
```
- vector search
```json
{
  "query": {
    "knn": {
      "field": "embedding_vector",  // The field containing the vector embeddings
      "query_vector": [0.1, 0.2, 0.3, 0.4, 0.5],  // Your query vector
      "k": 10,  // Number of nearest neighbors to retrieve
      "num_candidates": 100  // Number of candidates considered per shard
      // Note: the similarity metric (e.g., cosine, l2_norm) is set in the dense_vector field mapping, not in the query
    }
  }
}
```
- geospatial search (ignore)
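
To make the vector search item above concrete, here is a brute-force sketch of what "k nearest neighbors by cosine similarity" means. The 3-dimensional embeddings are hypothetical, and es uses approximate search over an index structure rather than this exhaustive scan:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn(query_vector, docs, k):
    """Return the ids of the k docs whose vectors are most similar to the query."""
    ranked = sorted(docs, key=lambda doc_id: cosine(query_vector, docs[doc_id]), reverse=True)
    return ranked[:k]

# hypothetical 3-dimensional embeddings
docs = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.0, 1.0, 0.0],
}
nearest = knn([1.0, 0.0, 0.0], docs, k=2)
print(nearest)
```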

<i>common query types in es
- match: for full-text search; analyzes the text and looks for relevant docs
- term: exact match on specific fields; does not analyze the text
- match_phrase: similar to match, but looks for exact phrases
- bool: combine multiple queries using logical operators (must, should, must_not)
- range: searches for docs with values within a specific range
- wildcard: wildcard characters (* and ?) for pattern matching
- fuzzy: search for terms that are similar to the specific term, allowing for typos
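- nested: searches nested objects within documents
- exists: checks for the existence of a field in docs
- dis_max: runs several queries and scores each doc by its best-matching clause (plus an optional tie_breaker); useful when searching multiple fields
- script: filters docs with a custom script; use script_score for custom scripting that determines doc relevance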
</i>
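
Several of the query types listed above are usually combined inside a `bool` query. A sketch of such a request body as a python dict; the field names (`content`, `status`, `created_at`, `deleted_at`) are hypothetical. Besides must, should, and must_not, `bool` also accepts a `filter` clause for conditions that should not affect the score:

```python
import json

# hypothetical field names: content, status, created_at, deleted_at
query = {
    "bool": {
        # scored full-text clause
        "must": [{"match": {"content": "search engine"}}],
        # exact conditions that restrict results without affecting the score
        "filter": [
            {"term": {"status": "active"}},
            {"range": {"created_at": {"gte": "2024-01-01"}}},
        ],
        # docs matching these clauses are excluded
        "must_not": [{"exists": {"field": "deleted_at"}}],
    }
}
request_body = {"query": query}
print(json.dumps(request_body, indent=2))
```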

### analyze with query dsl
aggregations are the primary tool for analyzing elasticsearch data using query dsl. 
- build complex summaries of data
- gain insight into key metrics, patterns, and trends

available aggregation types:
- metric: calculate metrics, such as a sum or average, from field values
- bucket: group docs into buckets based on field values, ranges, or other criteria
- pipeline: run aggregations on the results of other aggregations
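
The three aggregation types above can be nested in one request. The sketch below shows a request body (bucket `terms`, metric `avg`, sibling pipeline `max_bucket`) and then repeats the same computation locally on hypothetical docs to show what each stage produces; the field names `status` and `views` are assumptions:

```python
from collections import defaultdict

# request body sketch: bucket (terms) > metric (avg), plus a sibling pipeline (max_bucket)
agg_body = {
    "size": 0,  # return only aggregation results, no hits
    "aggs": {
        "by_status": {
            "terms": {"field": "status"},
            "aggs": {"avg_views": {"avg": {"field": "views"}}},
        },
        "best_status": {"max_bucket": {"buckets_path": "by_status>avg_views"}},
    },
}

# the same computation done locally on hypothetical docs
docs = [
    {"status": "active", "views": 10},
    {"status": "active", "views": 30},
    {"status": "archived", "views": 5},
]
buckets = defaultdict(list)
for doc in docs:
    buckets[doc["status"]].append(doc["views"])               # bucket stage
avg_views = {s: sum(v) / len(v) for s, v in buckets.items()}  # metric stage
best_status = max(avg_views, key=avg_views.get)               # pipeline stage
print(avg_views, best_status)
```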

## ES|QL (ignore)
elasticsearch query language (ES|QL) is a piped query language for filtering, transforming, and analyzing data. 


# index module
### index settings
- static: can only be set at index creation time, on a closed index, or by using the update-index-settings api with the `reopen` query parameter set to `true` (which closes and reopens the index automatically)
- dynamic: can be changed on a live index using the update-index-settings api
### static index settings
- index.number_of_shards: number of primary shards that an index should have. defaults to 1.
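- index.number_of_routing_shards: integer used together with index.number_of_shards to route docs to a primary shard; es also uses this value when splitting an index
- index.codec: compression for stored data; `default` uses LZ4, while `best_compression` gives a higher compression ratio at the cost of slower stored-field reads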
- index.mode: standard (default), time_series, logsdb
- others, ignore
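
How number_of_shards and number_of_routing_shards work together can be sketched with the routing formula described in the es docs for the `_routing` field: `routing_factor = num_routing_shards / num_primary_shards` and `shard_num = (hash(_routing) % num_routing_shards) / routing_factor`. The hash below is a stand-in (`zlib.crc32`); es uses its own hash function, so real shard numbers will differ:

```python
import zlib

def shard_for(routing_value, num_primary_shards, num_routing_shards):
    """Pick a primary shard using the documented routing formula."""
    # stand-in hash, NOT the one es uses; only the arithmetic is the point here
    h = zlib.crc32(routing_value.encode("utf-8"))
    routing_factor = num_routing_shards // num_primary_shards
    return (h % num_routing_shards) // routing_factor

for doc_id in ["1", "2", "3"]:
    print(doc_id, shard_for(doc_id, num_primary_shards=5, num_routing_shards=30))
```

because the routing shard count stays fixed, splitting 5 primaries into 10 sends every doc from old shard `s` to new shard `2s` or `2s + 1`, which is why the index can be split without rehashing across unrelated shards.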
### dynamic index settings
- index.number_of_replicas 

## analysis
the index analysis module acts as a configurable registry of analyzers that convert a string field into individual terms, which are
- added to the inverted index to make the doc searchable
- used by high level queries such as the match query to generate search terms
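
Both uses of analysis listed above can be shown with a toy analyzer and inverted index. This is a deliberately simplified sketch: the stop list and tokenization rule are assumptions, and real analyzers are configurable chains of character filters, a tokenizer, and token filters:

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "is"}  # tiny illustrative stop list

def analyze(text):
    """Toy analyzer: lowercase, split on non-alphanumerics, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def build_inverted_index(docs):
    """Map each term to the sorted list of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "The quick brown fox", 2: "A quick search engine", 3: "Fox is quick"}
index = build_inverted_index(docs)

def search(query):
    """Analyze the query the same way, then return docs containing every term.

    AND semantics are used for simplicity; the match query defaults to OR.
    """
    postings = [set(index.get(term, [])) for term in analyze(query)]
    return sorted(set.intersection(*postings)) if postings else []

print(search("Quick FOX"))
```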
## index shard allocation (ignore)
## index blocks
index blocks limit the kind of operations that are available on a certain index.
### index block settings
- index.blocks.read_only; applies to both the index and the index metadata
- index.blocks.read_only_allow_delete; 
- index.blocks.read
- index.blocks.write
- index.blocks.metadata
## mapper
mapper module acts as a registry for the type mapping definitions added to an index either when creating it or by using the update mapping api
## merge
a shard in es is a lucene index, and a lucene index is broken down into segments. segments are the internal storage elements where index data is stored, and they are immutable. smaller segments are periodically merged into larger segments to keep the index size at bay and to expunge deleted docs. 
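
A toy model of the merge step, assuming segments are represented as tuples of docs and deletes as a set of ids. It only illustrates the idea that a merge writes a new segment and drops deleted docs; it says nothing about lucene's actual merge policy:

```python
def merge_segments(segments, deleted_ids):
    """Merge immutable segments into one new segment, dropping deleted docs."""
    return tuple(
        doc
        for segment in segments
        for doc in segment
        if doc["id"] not in deleted_ids
    )

# segments are modeled as tuples to mirror their immutability
seg_a = ({"id": 1, "title": "a"}, {"id": 2, "title": "b"})
seg_b = ({"id": 3, "title": "c"},)
deleted = {2}  # the deleted doc still occupies space until a merge rewrites the data

merged = merge_segments([seg_a, seg_b], deleted)
merged_ids = [doc["id"] for doc in merged]
print(merged_ids)
```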
## similarity module
a similarity (scoring / ranking model) defines how matching docs are scored. 
### config a similarity 
```python
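from elasticsearch import Elasticsearch

# assumption: an unsecured local cluster; adjust the URL and auth for your setup
client = Elasticsearch("http://localhost:9200")

resp = client.indices.create(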
    index="index",
    settings={
        "index": {
            "similarity": {
                "my_similarity": {
                    "type": "DFR",
                    "basic_model": "g",
                    "after_effect": "l",
                    "normalization": "h2",
                    "normalization.h2.c": "3.0"
                }
            }
        }
    },
)
print(resp)
```

then we can refer to my_similarity in a field mapping:
```python
resp = client.indices.put_mapping(
    index="index",
    properties={
        "title": {
            "type": "text",
            "similarity": "my_similarity"
        }
    },
)
print(resp)
```
### available similarities
- BM25 similarity: TF/IDF-based similarity with built-in tf normalization; three parameters: k1, b, and discount_overlaps
- DFR: implements the divergence from randomness framework; parameters: basic_model, after_effect, and normalization
- DFI similarity
- IB: information-based model, built on the idea that the information content in any symbolic distribution sequence is primarily determined by the repetitive usage of its basic elements
- LM Dirichlet / LM Jelinek Mercer / scripted (ignore)
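
To show what the BM25 parameters k1 and b do, here is the textbook form of the per-term BM25 score with the defaults k1=1.2 and b=0.75. This is a sketch of the formula, not lucene's code, which may differ in constant factors; the collection statistics in the example are made up:

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 score of one term in one doc."""
    # idf: rarer terms score higher
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # tf normalization: k1 caps term-frequency growth, b scales the length penalty
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / norm

# same term frequency, but one doc is much longer than average
short_doc = bm25_term_score(tf=2, doc_len=50, avg_doc_len=100, n_docs=1000, doc_freq=10)
long_doc = bm25_term_score(tf=2, doc_len=400, avg_doc_len=100, n_docs=1000, doc_freq=10)
print(short_doc, long_doc)
```

with b=0 the length penalty disappears and both docs would score the same.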

the default similarity of an index can be changed after the index is created: close the index, update the similarity setting, and reopen it.

## slow log
- search slow log: a shard-level log that records slow searches
- indexing slow log: the same idea for slow indexing operations
## store
different storage type:
- fs
- simplefs
- niofs
- mmapfs
- hybridfs
## translog
## history retention
## index sorting
when creating a new index in es, it is possible to configure how the segments inside each shard will be sorted.
sample on how to define a sort on a single field:
```python
resp = client.indices.create(
    index="my-index-000001",
    settings={
        "index": {
            "sort.field": "date",
            "sort.order": "desc"
        }
    },
    mappings={
        "properties": {
            "date": {
                "type": "date"
            }
        }
    },
)
print(resp)

# also support multiple fields
```
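by default in es, a search request must visit every doc that matches a query to retrieve the top docs sorted by a specific sort. when the index sort and the search sort are the same, es can limit the number of docs visited per segment to retrieve the top N docs globally.

please note the parameter track_total_hits: set it to false when the total number of matching docs is not needed, so the search can stop collecting a segment early.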
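
Why a matching index sort helps can be sketched in plain python: if every segment is already sorted the way the search wants, only the first N entries of each segment can reach the global top N. The segment contents below are hypothetical dates encoded as ints, and `top_n_sorted` is an illustrative helper:

```python
import heapq
from itertools import islice

def top_n_sorted(segments, n):
    """Top n values across segments that are each already sorted descending."""
    # only the first n entries of each segment can reach the global top n
    heads = [list(islice(segment, n)) for segment in segments]
    merged = heapq.merge(*heads, reverse=True)
    return list(islice(merged, n))

# hypothetical per-segment dates, each segment pre-sorted desc by index sorting
segments = [[20240105, 20240103, 20240101], [20240104, 20240102]]
top = top_n_sorted(segments, n=2)
print(top)

# matching search body: same sort as the index, total hit count not tracked
search_body = {"size": 2, "sort": [{"date": "desc"}], "track_total_hits": False}
```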

## indexing pressure
indexing docs into es introduces system load (memory and CPU); each indexing operation includes coordinating, primary, and replica stages; these stages can be performed across multiple nodes in a cluster. 

# index template

# mapping
## dynamic mapping
## explicit mapping
## runtime fields
## field data types
## mapping parameters
## mapping limit settings
## removal of mapping types

# text analysis

# data streams

# ingest pipelines

# aliases

# search data

# re-ranking

# query dsl

# aggregations

# geospatial analysis 