## Elasticsearch Tutorial

In [14]:
from pprint import pprint
from elasticsearch import Elasticsearch

# Constants (these belong in environment variables; hardcoded here only for the tutorial)
CONNECTION_URL = "https://localhost:9200"
USERNAME = "elastic"
PASSWORD = "csK=aB81_-zSctJTLxZ3"
TLS_ENABLED = False                     # Enable in production mode
SSL_WARNINGS = False                    # Enable in production mode
CLOUD_ID = None                         # If you are using Elastic Cloud
API_KEY = None                          # If you are using Elastic Cloud

# Connect to elastic search

es = Elasticsearch(CONNECTION_URL, basic_auth=[USERNAME, PASSWORD], verify_certs=TLS_ENABLED, ssl_show_warn=SSL_WARNINGS)

# Get Client info
client_info = es.info()

print("Connected to Elastic Search")
pprint(client_info.body)

Connected to Elastic Search
{'cluster_name': 'docker-cluster',
 'cluster_uuid': '_WHZ-oGpSHe5R6WMtacRwQ',
 'name': '1d99da101863',
 'tagline': 'You Know, for Search',
 'version': {'build_date': '2024-12-11T12:08:05.663969764Z',
             'build_flavor': 'default',
             'build_hash': '2b6a7fed44faa321997703718f07ee0420804b41',
             'build_snapshot': False,
             'build_type': 'docker',
             'lucene_version': '9.12.0',
             'minimum_index_compatibility_version': '7.0.0',
             'minimum_wire_compatibility_version': '7.17.0',
             'number': '8.17.0'}}
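
The comment in the connection cell notes that these constants belong in environment variables. A minimal sketch of that approach follows. The variable names (`ES_URL`, `ES_USERNAME`, `ES_PASSWORD`, `ES_API_KEY`, `ES_VERIFY_CERTS`) and the helper `load_es_settings` are hypothetical, chosen for illustration only. The function just builds a dictionary of keyword arguments, so it runs without a cluster.

```python
import os


def load_es_settings(env=None):
    """Collect client settings from environment variables.

    The variable names used here are assumptions made for this sketch,
    not names defined by Elasticsearch or its Python client.
    """
    env = os.environ if env is None else env
    settings = {
        "hosts": env.get("ES_URL", "https://localhost:9200"),
        # Verify certificates unless explicitly switched off.
        "verify_certs": env.get("ES_VERIFY_CERTS", "true").lower() == "true",
    }
    api_key = env.get("ES_API_KEY")
    if api_key:
        settings["api_key"] = api_key
    else:
        # A missing password raises KeyError instead of falling back to a default.
        settings["basic_auth"] = (env.get("ES_USERNAME", "elastic"), env["ES_PASSWORD"])
    return settings


# Example with an explicit mapping instead of the real process environment.
example_settings = load_es_settings({"ES_PASSWORD": "example-only"})
print(sorted(example_settings))
```

The resulting dictionary could then be passed to the client as `Elasticsearch(**example_settings)`.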


## Indices in Elasticsearch
Reference: [Indices](https://www.elastic.co/guide/en/elasticsearch/reference/current/documents-indices.html)

The __index__ is the fundamental unit of storage in Elasticsearch: a __logical namespace__ for documents that share similar characteristics.
An index is a collection of documents uniquely identified by a name or an alias. This name matters because it is used to target the index in search queries and other operations.
Conceptually, an index plays a role similar to a database in a relational database management system (RDBMS).

In [17]:
# How to delete index
es.indices.delete(index="my_index", ignore_unavailable=True)

# How to create index
es.indices.create(
    index="my_index",
    settings={
        "index": {
            "number_of_shards": 3,      # Number of primary shards the data is split into
            "number_of_replicas": 2     # Number of replica copies of each primary shard
        }
    }
)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'my_index'})
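
Because `number_of_replicas` counts copies per primary shard, the two settings multiply. The arithmetic can be sketched as below; `shard_copies` is a hypothetical helper written for this tutorial, not part of the client.

```python
def shard_copies(number_of_shards, number_of_replicas):
    """Return how many shard copies an index with these settings asks for."""
    copies_per_shard = 1 + number_of_replicas  # one primary plus its replicas
    return {
        "primaries": number_of_shards,
        "replicas": number_of_shards * number_of_replicas,
        "total": number_of_shards * copies_per_shard,
        # A single document lives in one primary shard and that shard's replicas.
        "copies_per_document": copies_per_shard,
    }


layout = shard_copies(number_of_shards=3, number_of_replicas=2)
print(layout)
```

With the settings used here, each document is written to 1 primary and 2 replica copies, which matches the `'_shards': {'total': 3, ...}` field in the indexing responses later in this tutorial. If the cluster has only one node (an assumption about this setup), the replicas cannot be allocated, since a replica is never placed on the same node as its primary. That would be consistent with `'successful': 1` in those responses.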

## Documents
Elasticsearch serializes and stores data in the form of JSON documents. A document is a set of fields, which are key-value pairs that contain your data. 

Each document has a unique ID. You can supply your own through the `id` parameter of the `index()` method, or omit it and let Elasticsearch generate one.

### Inserting Single Document

In [23]:
document = {
  "id": 1,
  "name": "Wireless Mouse",
  "price": 25.99,
  "description": "Ergonomic wireless mouse with adjustable DPI settings.",
  "category": "Electronics",
  "available": True,
  "release_date": "2024-12-01"
}

response = es.index(index="my_index", id=1, document=document)
pprint(response)

ObjectApiResponse({'_index': 'my_index', '_id': '1', '_version': 5, 'result': 'created', '_shards': {'total': 3, 'successful': 1, 'failed': 0}, '_seq_no': 4, '_primary_term': 1})


In [24]:
print(response['result'])       # Outcome of the operation, e.g. 'created' or 'updated'

created


In [26]:
pprint(response.body)

{'_id': '1',
 '_index': 'my_index',
 '_primary_term': 1,
 '_seq_no': 4,
 '_shards': {'failed': 0, 'successful': 1, 'total': 3},
 '_version': 5,
 'result': 'created'}


### Inserting Multiple Documents

The bulk API in Elasticsearch allows you to perform multiple indexing (insert/update/delete) operations in a single request.

- Create a list of documents you want to insert.
- Format the documents as a list of bulk actions.
- Use the `bulk()` helper function from the Elasticsearch Python client.

In [33]:
documents = [
    {
        "_index": "my_index",
        "_id": 2,
        "_source": {
            "id": 2,
            "name": "Mechanical Keyboard",
            "price": 75.50,
            "description": "Durable mechanical keyboard with customizable RGB lighting.",
            "category": "Electronics",
            "available": True,
            "release_date": "2023-11-15"
        }
    },
    {
        "_index": "my_index",
        "_id": 3,
        "_source": {
            "id": 3,
            "name": "Gaming Monitor",
            "price": 299.99,
            "description": "27-inch gaming monitor with 144Hz refresh rate and 1ms response time.",
            "category": "Electronics",
            "available": True,
            "release_date": "2022-09-20"
        }
    },
    {
        "_index": "my_index",
        "_id": 4,
        "_source": {
            "id": 4,
            "name": "Webcam",
            "price": 49.99,
            "description": "HD webcam with built-in microphone and 1080p resolution.",
            "category": "Electronics",
            "available": True,
            "release_date": "2021-05-12"
        }
    }
]

In [34]:
from elasticsearch.helpers import bulk

# Use the bulk API to insert all documents
response = bulk(es, documents)

# bulk() returns a tuple: (number of successful actions, list of errors)
print("Bulk Insert Response:", response)

Bulk Insert Response: (3, [])
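
Writing the `_index`, `_id`, and `_source` wrapper by hand gets repetitive for larger datasets. A minimal sketch of generating bulk actions from plain dictionaries follows. The helper `to_bulk_actions` and the sample records are hypothetical and exist only for illustration.

```python
def to_bulk_actions(index_name, records, id_field="id"):
    """Yield one bulk action per record, in the same shape as the list above."""
    for record in records:
        yield {
            "_index": index_name,
            "_id": record[id_field],  # reuse the record's own id as the document ID
            "_source": record,
        }


# Made-up sample records, shaped like the product documents in this tutorial.
sample_records = [
    {"id": 5, "name": "USB-C Hub", "price": 39.99, "category": "Electronics"},
    {"id": 6, "name": "Laptop Stand", "price": 29.50, "category": "Electronics"},
]

actions = list(to_bulk_actions("my_index", sample_records))
print(len(actions), actions[0]["_id"])
```

Since `bulk()` accepts any iterable of actions, the generator could also be passed directly, as in `bulk(es, to_bulk_actions("my_index", sample_records))`, without building the full list in memory first.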


## Getting Documents

In [45]:
fetch_doc = es.get(index="my_index", id=1)
pprint(fetch_doc.body)

{'_id': '1',
 '_index': 'my_index',
 '_primary_term': 1,
 '_seq_no': 4,
 '_source': {'available': True,
             'category': 'Electronics',
             'description': 'Ergonomic wireless mouse with adjustable DPI '
                            'settings.',
             'id': 1,
             'name': 'Wireless Mouse',
             'price': 25.99,
             'release_date': '2024-12-01'},
 '_version': 5,
 'found': True}


## Searching Documents
You can search an index by passing a query to the `query` parameter of the `search()` method.

In [53]:
response = es.search(
    index="my_index",               # Specify the index name
    query={"match_all": {}}         # Match all documents
)

for hit in response['hits']['hits']:
    pprint(hit['_source'], indent=3)  

{  'available': True,
   'category': 'Electronics',
   'description': 'Durable mechanical keyboard with customizable RGB lighting.',
   'id': 2,
   'name': 'Mechanical Keyboard',
   'price': 75.5,
   'release_date': '2023-11-15'}
{  'available': True,
   'category': 'Electronics',
   'description': '27-inch gaming monitor with 144Hz refresh rate and 1ms '
                  'response time.',
   'id': 3,
   'name': 'Gaming Monitor',
   'price': 299.99,
   'release_date': '2022-09-20'}
{  'available': True,
   'category': 'Electronics',
   'description': 'HD webcam with built-in microphone and 1080p resolution.',
   'id': 4,
   'name': 'Webcam',
   'price': 49.99,
   'release_date': '2021-05-12'}
{  'available': True,
   'category': 'Electronics',
   'description': 'Ergonomic wireless mouse with adjustable DPI settings.',
   'id': 1,
   'name': 'Wireless Mouse',
   'price': 25.99,
   'release_date': '2024-12-01'}
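
`match_all` returns everything, but the same `query` parameter accepts any Query DSL body. As an illustration, the sketch below builds a `bool` query that combines a full-text `match` with optional filters. The helper `build_product_query` is hypothetical, and the field names assume the product documents indexed above.

```python
def build_product_query(text, max_price=None, only_available=False):
    """Build a bool query: full-text match on description plus optional filters."""
    filters = []
    if max_price is not None:
        # Range filter on the numeric price field.
        filters.append({"range": {"price": {"lte": max_price}}})
    if only_available:
        # Exact-value filter on the boolean field.
        filters.append({"term": {"available": True}})
    return {
        "bool": {
            "must": [{"match": {"description": text}}],
            "filter": filters,
        }
    }


product_query = build_product_query("wireless", max_price=100, only_available=True)
print(product_query)
```

The result could be passed as `es.search(index="my_index", query=product_query)`. Clauses under `filter` narrow the result set without contributing to the relevance score, which is why the price and availability conditions go there and not under `must`.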


## Deleting Documents and Indexes

In [54]:
es.delete(index="my_index", id=1)       # Deletes Document specified by id

ObjectApiResponse({'_index': 'my_index', '_id': '1', '_version': 6, 'result': 'deleted', '_shards': {'total': 3, 'successful': 1, 'failed': 0}, '_seq_no': 5, '_primary_term': 1})

In [56]:
es.indices.delete(index="my_index")             # Deletes Index

ObjectApiResponse({'acknowledged': True})

## Mappings in Elasticsearch

A mapping in Elasticsearch defines the schema of an index: which fields its documents contain and how each field is stored, indexed, and queried. It is similar to a table schema in a relational database, where you define fields and their data types, but more flexible, since Elasticsearch can also infer field types dynamically when none are declared.

Reference:
- [Field data types](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html)
- [Video Reference](https://youtu.be/a4HBKEda_F8?si=_xCPkDemPwdByV53&t=1334)


### Common Data Types

1. __Binary__:
   Accepts a binary value as a Base64-encoded string.
   By default the field is neither searchable nor stored separately.
   Read the value back from `_source` (i.e. the original document).

2. __Boolean__:
   Stores `true` and `false` values.

3. __Numbers__:
   Numeric types such as `long`, `integer`, `short`, `byte`, `float`, and `double`.

4. __Dates__:
   Date types, including `date` and `date_nanos`.

5. __Keywords__:
   For structured values that you want to sort, filter, or match exactly.
   Example: IDs, email addresses, status codes, zip codes, etc.
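
To make these types concrete, here is one way the product documents from the earlier sections could be mapped explicitly. This is an illustrative sketch, not the mapping the tutorial actually used (`my_index` was created without one), and the `text` type it uses for free-form fields is covered in the Text Data Type section further down.

```python
# Hypothetical explicit mapping for the product documents used earlier.
product_mapping = {
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "text"},             # free-form, searched by words
        "price": {"type": "float"},
        "description": {"type": "text"},
        "category": {"type": "keyword"},      # exact values for filtering and sorting
        "available": {"type": "boolean"},
        "release_date": {"type": "date"},
    }
}

# Group the field names by type to get a quick overview of the schema.
fields_by_type = {}
for field_name, field_config in product_mapping["properties"].items():
    fields_by_type.setdefault(field_config["type"], []).append(field_name)

print(fields_by_type)
```

Such a dictionary would be passed as `mappings=product_mapping` when creating the index, in the same way as the mappings in the steps that follow.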

### Step 1: Encode the image in Base64

In [93]:
import base64

image_path = "./public/elasticsearch-logo.jpg"
with open(image_path, "rb") as image_file:
    image_bytes = image_file.read()
    image_base64 = base64.b64encode(image_bytes).decode("utf-8")
    print(image_base64[:100])

/9j/4AAQSkZJRgABAQAAAQABAAD/4QBiRXhpZgAATU0AKgAAAAgABQESAAMAAAABAAEAAAEaAAUAAAABAAAASgEbAAUAAAABAAAA
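
Base64 encoding is lossless but makes the data larger: every 3 bytes of input become 4 characters of output. The sketch below shows this with in-memory stand-in bytes in place of the image file, so it runs without any file on disk.

```python
import base64
import math

# Stand-in binary data (1024 bytes) used in place of a real image file.
raw_bytes = bytes(range(256)) * 4

encoded = base64.b64encode(raw_bytes).decode("ascii")
decoded = base64.b64decode(encoded)

# Base64 emits 4 characters for every 3 input bytes, rounded up with padding.
expected_length = 4 * math.ceil(len(raw_bytes) / 3)

print("original bytes:", len(raw_bytes))
print("encoded characters:", len(encoded))
print("round trip intact:", decoded == raw_bytes)
```

The length of the encoded string is therefore about a third larger than the file itself, which matters when recording a file size alongside the encoded data.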


### Step 2: Create a mapping for the image metadata and its content

In [94]:
# Let's first create an index with a mapping
mapping = {
    "properties" : {
        "file_name": {"type": "keyword"},       # For exact search
        "file_size": {"type": "integer"},       # File size in bytes
        "image_type": {"type": "keyword"},      # Image type (e.g., png, jpeg, gif)
        "image_data" : {
            "type" : "binary"                   # Binary data (Base64-encoded)
        }
    }
}

es.indices.create(index="image_data", mappings=mapping)


ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'image_data'})

### Step 3: Store the document inside the index

In [95]:
document = {
    "file_name": "elastic-search",
    "file_size": len(image_bytes),      # Size of the original file in bytes (not of the Base64 string)
    "image_type": "jpg",
    "image_data": image_base64
}

es.index(index="image_data", document=document)

ObjectApiResponse({'_index': 'image_data', '_id': 'au_UxpMB3baMLLZ1hiyU', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 0, '_primary_term': 1})

### Step 4: Retrieve the document and decode it from Base64. Optionally save the file

In [99]:
response = es.search(
    index="image_data",
    query={"match": {"file_name": "elastic-search"}}
)

for hit in response['hits']['hits']:
    binary_data = hit['_source']['image_data']
    print(binary_data[:100])
    binary_content = base64.b64decode(binary_data)
    with open("public/retrieved_example.jpg", "wb") as file:
        file.write(binary_content)

/9j/4AAQSkZJRgABAQAAAQABAAD/4QBiRXhpZgAATU0AKgAAAAgABQESAAMAAAABAAEAAAEaAAUAAAABAAAASgEbAAUAAAABAAAA


### Text Data Type
In Elasticsearch, the `text` data type is used for fields that contain full-text content (like paragraphs, titles, descriptions, etc.). The text type is __analyzed__ by Elasticsearch to enable full-text search capabilities, which means Elasticsearch breaks the text into tokens (terms) and indexes those tokens for efficient searching.

Key features of the `text` type:
- Full-text search: The text type is used for fields that require full-text search, where Elasticsearch breaks down the text into terms (words) and indexes them.
- Analyzers: By default, Elasticsearch uses a standard analyzer that breaks the text into tokens (words) and lowercases them. You can customize the analyzer for specific use cases.
- Not for exact matching: If you need exact matching (e.g., for keywords or IDs), you should use the `keyword` type instead.
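
To give a feel for what analysis does, the sketch below imitates the standard analyzer very roughly by lowercasing the text and splitting it into runs of word characters. This is a deliberate simplification and an assumption of this example: the real analyzer follows Unicode text segmentation rules and handles many cases differently. The function `rough_standard_analyze` is hypothetical.

```python
import re


def rough_standard_analyze(text):
    """Rough stand-in for the standard analyzer: lowercase, then split into words."""
    return re.findall(r"\w+", text.lower())


doc_tokens = rough_standard_analyze(
    "Elasticsearch is a powerful distributed search and analytics engine."
)
query_tokens = rough_standard_analyze("Search Engine")

# A query token can only match if the identical token was indexed for the document.
shared_tokens = [token for token in query_tokens if token in doc_tokens]

print(doc_tokens)
print(shared_tokens)
```

Note that `elasticsearch` is indexed as a single token, so in this simplified model a query for `search` matches the document through the separate word `search`, not through the product name.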

### Step 1: Create a mapping with `text` fields for full-text search

In [100]:
mapping = {
    "properties": {
        "title": {
            "type": "text"  # Full-text search field
        },
        "description": {
            "type": "text"  # Another full-text search field
        },
        "author": {
            "type": "keyword"  # Non-analyzed field for exact matching (e.g., author's name)
        },
        "publish_date": {
            "type": "date"  # Date field for storing publication date
        }
    }
}
es.indices.create(index="books", mappings=mapping)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'books'})

### Step 2: Add a document in the index

In [101]:
document = {
    "title": "Learning Elasticsearch",
    "description": "Elasticsearch is a powerful distributed search and analytics engine.",
    "author": "John Doe",
    "publish_date": "2024-01-01"
}

es.index(index="books", id=1, document=document)

ObjectApiResponse({'_index': 'books', '_id': '1', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 0, '_primary_term': 1})

### Step 3: Query the document based on text

In [109]:
response = es.search(
    index="books",
    query={
        "match": {
            "description": "search engine" 
        }
    }
)

for hit in response['hits']['hits']:
    pprint(hit['_source'])

{'author': 'John Doe',
 'description': 'Elasticsearch is a powerful distributed search and analytics '
                'engine.',
 'publish_date': '2024-01-01',
 'title': 'Learning Elasticsearch'}
