To read more about analyzers, check out the docs [here](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-analyze).

## Connect to Elasticsearch

In [1]:
from pprint import pprint
from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')
client_info = es.info()
print('Connected to Elasticsearch!')
pprint(client_info.body)

Connected to Elasticsearch!
{'cluster_name': 'docker-cluster',
 'cluster_uuid': 'FQt-ffZfTpeh0Snf3pUAQw',
 'name': 'ae8b5b4be42b',
 'tagline': 'You Know, for Search',
 'version': {'build_date': '2024-08-05T10:05:34.233336849Z',
             'build_flavor': 'default',
             'build_hash': '1a77947f34deddb41af25e6f0ddb8e830159c179',
             'build_snapshot': False,
             'build_type': 'docker',
             'lucene_version': '9.11.1',
             'minimum_index_compatibility_version': '7.0.0',
             'minimum_wire_compatibility_version': '7.17.0',
             'number': '8.15.0'}}


## 1. Character filters

Read more about them [here](https://www.elastic.co/docs/reference/text-analysis/character-filter-reference).

### 1.1. HTML strip character filter

In [2]:
from pprint import pprint

response = es.indices.analyze(
    char_filter=[
        "html_strip"
    ],
    text="I&apos;m so happy</b>!</p>",
)

pprint(response.body)

{'tokens': [{'end_offset': 26,
             'position': 0,
             'start_offset': 0,
             'token': "I'm so happy!\n",
             'type': 'word'}]}


### 1.2. Mapping character filter

In [3]:
response = es.indices.analyze(
    tokenizer="keyword",
    char_filter=[
        {
            "type":"mapping",
            "mappings":[
                "٠ => 0",
                "١ => 1",
                "٢ => 2",
                "٣ => 3",
                "٤ => 4",
                "٥ => 5",
                "٦ => 6",
                "٧ => 7",
                "٨ => 8",
                "٩ => 9"
            ]
        }
    ],
    text="I saw comet Tsuchinshan Atlas in ٢٠٢٤",
)

pprint(response.body)


{'tokens': [{'end_offset': 37,
             'position': 0,
             'start_offset': 0,
             'token': 'I saw comet Tsuchinshan Atlas in 2024',
             'type': 'word'}]}
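
As a rough illustration of what the `mapping` character filter does to the text, the same digit substitution can be sketched in plain Python with `str.translate`. This is only a stand-in for the idea, not how Elasticsearch implements it; the names `digit_map` and `apply_mapping` are made up for this example.

```python
# Build the same ten mappings as in the cell above:
# Arabic-Indic digits U+0660..U+0669 -> ASCII digits 0..9.
digit_map = str.maketrans({chr(0x0660 + i): str(i) for i in range(10)})


def apply_mapping(text: str) -> str:
    """Replace each mapped character and leave everything else untouched."""
    return text.translate(digit_map)


# "\u0662\u0660\u0662\u0664" is the Arabic-Indic spelling of 2024 used above.
sample = "I saw comet Tsuchinshan Atlas in \u0662\u0660\u0662\u0664"
print(apply_mapping(sample))
```

Like the real filter, the substitution runs on the raw character stream before any tokenizer sees the text.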


## 2. Tokenizer

Read more about tokenizers [here](https://www.elastic.co/docs/reference/text-analysis/tokenizer-reference).

### 2.1. Standard

In [4]:
response = es.indices.analyze(
    tokenizer="standard",
    text="The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
)

tokens = response.body["tokens"]
for token in tokens:
    print(f"Token: '{token['token']}', Type: {token['type']}")

Token: 'The', Type: <ALPHANUM>
Token: '2', Type: <NUM>
Token: 'QUICK', Type: <ALPHANUM>
Token: 'Brown', Type: <ALPHANUM>
Token: 'Foxes', Type: <ALPHANUM>
Token: 'jumped', Type: <ALPHANUM>
Token: 'over', Type: <ALPHANUM>
Token: 'the', Type: <ALPHANUM>
Token: 'lazy', Type: <ALPHANUM>
Token: 'dog's', Type: <ALPHANUM>
Token: 'bone', Type: <ALPHANUM>


### 2.2. Lowercase

In [5]:
response = es.indices.analyze(
    tokenizer="lowercase",
    text="The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
)

tokens = response.body['tokens']
for token in tokens:
    print(f"Token: '{token['token']}', Type: {token['type']}")

Token: 'the', Type: word
Token: 'quick', Type: word
Token: 'brown', Type: word
Token: 'foxes', Type: word
Token: 'jumped', Type: word
Token: 'over', Type: word
Token: 'the', Type: word
Token: 'lazy', Type: word
Token: 'dog', Type: word
Token: 's', Type: word
Token: 'bone', Type: word


### 2.3. Whitespace

In [6]:
response = es.indices.analyze(
    tokenizer="whitespace",
    text="The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
)

tokens = response.body["tokens"]
for token in tokens:
    print(f"Token: '{token['token']}', Type: {token['type']}")

Token: 'The', Type: word
Token: '2', Type: word
Token: 'QUICK', Type: word
Token: 'Brown-Foxes', Type: word
Token: 'jumped', Type: word
Token: 'over', Type: word
Token: 'the', Type: word
Token: 'lazy', Type: word
Token: 'dog's', Type: word
Token: 'bone.', Type: word
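
The `lowercase` and `whitespace` tokenizers are simple enough to approximate in plain Python, which may help explain the outputs above. The functions below are hypothetical helpers written for this comparison, not part of the Elasticsearch client, and they only mimic the splitting rules (no offsets, positions, or token types).

```python
import re

text = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."


def whitespace_tokenize(value: str) -> list[str]:
    """Split on whitespace only; punctuation and case are preserved."""
    return value.split()


def lowercase_tokenize(value: str) -> list[str]:
    """Keep runs of letters only (digits and punctuation split), then lowercase."""
    return [run.lower() for run in re.findall(r"[^\W\d_]+", value)]


print(whitespace_tokenize(text))
print(lowercase_tokenize(text))
```

The letter-only rule is why `2` disappears and `dog's` becomes `dog` and `s` in the `lowercase` output, while `whitespace` keeps `Brown-Foxes` and `bone.` intact.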


## 3. Token filter

Read more about token filters [here](https://www.elastic.co/docs/reference/text-analysis/token-filter-reference).

### 3.1. Apostrophe

In [7]:
response = es.indices.analyze(
    tokenizer="standard",
    filter=[
        "apostrophe",
    ],
    text="The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
)
tokens = response.body["tokens"]
for token in tokens:
    print(f"Token: '{token['token']}'")

Token: 'The'
Token: '2'
Token: 'QUICK'
Token: 'Brown'
Token: 'Foxes'
Token: 'jumped'
Token: 'over'
Token: 'the'
Token: 'lazy'
Token: 'dog'
Token: 'bone'


### 3.2. Decimal digit

In [8]:
response = es.indices.analyze(
    tokenizer="standard",
    filter=[
        "decimal_digit"
    ],
    text="I saw comet Tsuchinshan Atlas in ٢٠٢٤",
)
tokens = response.body["tokens"]
for token in tokens:
    print(f"Token: '{token['token']}'")

Token: 'I'
Token: 'saw'
Token: 'comet'
Token: 'Tsuchinshan'
Token: 'Atlas'
Token: 'in'
Token: '2024'
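
Unlike the mapping character filter in section 1.2, `decimal_digit` needs no explicit table. A comparable effect can be sketched in Python with `unicodedata.decimal`, which returns the numeric value of any Unicode decimal digit. The helper name `fold_decimal_digits` is an assumption for this example, and the sketch is not the filter's actual implementation.

```python
import unicodedata


def fold_decimal_digits(text: str) -> str:
    """Replace every Unicode decimal digit with its ASCII equivalent."""
    folded = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None for non-digit characters
        folded.append(ch if value is None else str(value))
    return "".join(folded)


print(fold_decimal_digits("\u0662\u0660\u0662\u0664"))  # Arabic-Indic 2024
print(fold_decimal_digits("\u0967\u0968\u0969"))        # Devanagari 123
```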


### 3.3. Reverse

In [9]:
response = es.indices.analyze(
    tokenizer="standard",
    filter=[
        "reverse"
    ],
    text="I saw comet Tsuchinshan Atlas in ٢٠٢٤",
)
tokens = response.body["tokens"]
for token in tokens:
    print(f"Token: '{token['token']}'")

Token: 'I'
Token: 'was'
Token: 'temoc'
Token: 'nahsnihcusT'
Token: 'saltA'
Token: 'ni'
Token: '٤٢٠٢'


## 4. Built-in analyzers

Read more about built-in analyzers [here](https://www.elastic.co/docs/reference/text-analysis/analyzer-reference).



### 4.1. Standard

In [10]:
response = es.indices.analyze(
    analyzer="standard",
    text="I saw comet Tsuchinshan Atlas in ٢٠٢٤",
)
tokens = response.body["tokens"]
for token in tokens:
    print(f"Token: '{token['token']}'")

Token: 'i'
Token: 'saw'
Token: 'comet'
Token: 'tsuchinshan'
Token: 'atlas'
Token: 'in'
Token: '٢٠٢٤'


### 4.2. Stop

In [11]:

response = es.indices.analyze(
    analyzer="stop",
    text="I saw comet Tsuchinshan Atlas in ٢٠٢٤",
)
tokens = response.body["tokens"]
for token in tokens:
    print(f"Token: '{token['token']}'")

Token: 'i'
Token: 'saw'
Token: 'comet'
Token: 'tsuchinshan'
Token: 'atlas'


### 4.3. Keyword

In [12]:
response = es.indices.analyze(
    analyzer="keyword",
    text="I saw comet Tsuchinshan Atlas in ٢٠٢٤",
)
tokens = response.body["tokens"]
for token in tokens:
    print(f"Token: '{token['token']}'")

Token: 'I saw comet Tsuchinshan Atlas in ٢٠٢٤'


## 5. Index time vs. search time analysis

### 5.1. Index time

Index-time analysis transforms text before it's stored in the index. In this example, let's create an index with an analyzer that lowercases text, removes HTML tags, and replaces ampersands (&) with the word "and."

In [14]:
index_name = "index_time_example"
settings = {
    "settings": {
        "analysis": {
            "char_filter": {
                "ampersand_replacement": {
                    "type": "mapping",
                    "mappings": ["& => and"]
                }
            },
            "analyzer": {
                "custom_index_analyzer": {
                    "type": "custom",
                    "char_filter": ["html_strip", "ampersand_replacement"],
                    "tokenizer": "standard",
                    "filter": ["lowercase"]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "custom_index_analyzer"
            }
        }
    }
}

es.indices.delete(index=index_name, ignore_unavailable=True)
es.indices.create(index=index_name, body=settings)

document = {
    "content": "Visit my website https://myuniversehub.com/ & like some images!"
}
response = es.index(index=index_name, id=1, body=document)
pprint(response.body)

{'_id': '1',
 '_index': 'index_time_example',
 '_primary_term': 1,
 '_seq_no': 0,
 '_shards': {'failed': 0, 'successful': 1, 'total': 2},
 '_version': 1,
 'result': 'created'}


When searching for the document, you'll notice that the content appears unchanged. This is expected because Elasticsearch stores the transformed tokens in an inverted index for searching purposes, while keeping the original document intact in the `_source` field.

In [16]:
response = es.search(index=index_name, body={"query":{"match_all":{}}})
hits = response.body["hits"]["hits"]

for hit in hits:
    print(hit["_source"])

{'content': 'Visit my website https://myuniversehub.com/ & like some images!'}
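
The split between the stored `_source` and the searchable tokens can be pictured with a toy inverted index. The sketch below is a deliberately simplified stand-in, not Elasticsearch's implementation: `ToyIndex` and its regex-based `analyze` method are hypothetical and only approximate the custom analyzer defined earlier.

```python
import re
from collections import defaultdict


class ToyIndex:
    """Keeps the original document and a separate token -> doc ids map."""

    def __init__(self):
        self.sources = {}                 # doc id -> original document
        self.postings = defaultdict(set)  # token -> ids of docs containing it

    @staticmethod
    def analyze(text):
        # Crude approximation: "&" -> "and", lowercase, then keep runs of
        # word characters, allowing dots inside a token (myuniversehub.com).
        text = text.replace("&", "and").lower()
        return re.findall(r"\w+(?:\.\w+)*", text)

    def index(self, doc_id, doc):
        self.sources[doc_id] = doc        # stored unchanged
        for token in self.analyze(doc["content"]):
            self.postings[token].add(doc_id)

    def lookup(self, token):
        # Exact token lookup; returns the untouched original documents.
        return [self.sources[i] for i in sorted(self.postings.get(token, ()))]


toy = ToyIndex()
toy.index(1, {"content": "Visit my website https://myuniversehub.com/ & like some images!"})

print(sorted(toy.postings))  # analyzed tokens used for searching
print(toy.lookup("and"))     # the stored document still contains "&"
```

Looking up the token `and` finds the document even though the stored text only contains `&`, which mirrors why the search result above shows the original content.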


We can verify that the custom analyzer is working by running the same text through the analyzer of the `content` field.

In [18]:
response = es.indices.analyze(
    index=index_name,
    body={
        "field": "content",
        "text": "Visit my website https://myuniversehub.com/ & like some images!"
    }
)

tokens = response.body["tokens"]
for token in tokens:
    print(f"Token: '{token['token']}'")

Token: 'visit'
Token: 'my'
Token: 'website'
Token: 'https'
Token: 'myuniversehub.com'
Token: 'and'
Token: 'like'
Token: 'some'
Token: 'images'


### 5.2. Search time

Search-time analysis transforms the query text when a search is executed, not when data is indexed. In this example, we'll run a `match` query that overrides the field's analyzer with the `standard` analyzer, which tokenizes and lowercases the query text (it does not remove stop words by default).

In [19]:
response = es.search(
    index=index_name,
    body={
        "query":{
            "match": { # match is used for full-text search
                "content": {
                    "query": "myuniversehub.com",
                    "analyzer": "standard" # Using a different analyzer than the one used at index time
                }
            }
        }
    }
)

hits = response["hits"]["hits"]
for hit in hits:
    print(hit["_source"])

{'content': 'Visit my website https://myuniversehub.com/ & like some images!'}


You can also use a `term` query to match exact terms. A `term` query does not analyze its value; it compares the value directly against the tokens in the inverted index. Since the token `myuniversehub.com` was produced at index time, this query returns the document.

In [20]:
response = es.search(
    index=index_name,
    body={
        "query":{
            "term": {  # term is used for exact matches
                "content": {
                    "value": "myuniversehub.com",
                }
            }
        }
    }
)

hits = response["hits"]["hits"]
for hit in hits:
    print(hit["_source"])

{'content': 'Visit my website https://myuniversehub.com/ & like some images!'}


In this case, the value `MYUNIVERSEHUB.com` is not lowercased before the lookup, and the index only contains the lowercased token `myuniversehub.com`, so no results are returned.

In [21]:
response = es.search(index=index_name, body={
    "query": {
        "term": {  # term is used for exact matches
            "content": {
                "value": "MYUNIVERSEHUB.com",
            }
        }
    }
})

hits = response["hits"]["hits"]
for hit in hits:
    print(hit["_source"])
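
The difference between the two query types comes down to whether the query text is analyzed before the lookup. The following is a minimal sketch of that idea in plain Python, not Elasticsearch code: `term_lookup` and `match_lookup` are hypothetical helpers, and the query-side analysis is reduced to lowercasing and whitespace splitting.

```python
# Tokens produced at index time for the document
# (copied from the analyze output in section 5.1).
indexed_tokens = {
    "visit", "my", "website", "https", "myuniversehub.com",
    "and", "like", "some", "images",
}


def term_lookup(value: str) -> bool:
    """Like a term query: the value is compared as-is, with no analysis."""
    return value in indexed_tokens


def match_lookup(query: str) -> bool:
    """Like a match query: analyze the query first, then match any token."""
    query_tokens = query.lower().split()  # simplified query-time analysis
    return any(token in indexed_tokens for token in query_tokens)


print(term_lookup("myuniversehub.com"))   # True
print(term_lookup("MYUNIVERSEHUB.com"))   # False
print(match_lookup("MYUNIVERSEHUB.com"))  # True
```

Under these assumptions, the uppercase value only matches once it passes through analysis, which is why `match` is the usual choice for `text` fields.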