## Documentation

To read more about ingest processors, check out the [Elasticsearch ingest processor reference](https://www.elastic.co/guide/en/elasticsearch/reference/8.15/processors.html).

![ingest_processor_docs](../images/ingest_processor_docs.png)

## Connect to Elasticsearch

In [1]:
from pprint import pprint
from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')
client_info = es.info()
print('Connected to Elasticsearch!')
pprint(client_info.body)

Connected to Elasticsearch!
{'cluster_name': 'docker-cluster',
 'cluster_uuid': 'DlYG5m9gR3upn7qgaYyAJA',
 'name': '3d37442d2591',
 'tagline': 'You Know, for Search',
 'version': {'build_date': '2024-08-05T10:05:34.233336849Z',
             'build_flavor': 'default',
             'build_hash': '1a77947f34deddb41af25e6f0ddb8e830159c179',
             'build_snapshot': False,
             'build_type': 'docker',
             'lucene_version': '9.11.1',
             'minimum_index_compatibility_version': '7.0.0',
             'minimum_wire_compatibility_version': '7.17.0',
             'number': '8.15.0'}}


## Common ingest processors

Here’s a look at some frequently used ingest processors:

1. **Convert**: Changes the data type of a field.
2. **Rename**: Changes the name of a field.
3. **Set**: Assigns a specified value to a field.
4. **HTML Strip**: Strips HTML tags from a field's content.
5. **Lowercase**: Transforms the text in a field to lowercase.
6. **Uppercase**: Transforms the text in a field to uppercase.
7. **Trim**: Removes whitespace from the beginning and end of a field's value.
8. **Split**: Divides the field content into an array on a separator (a comma `,` in this example).
9. **Remove**: Deletes a field from the document.
10. **Append**: Adds one or more values to an array field, creating the field if it does not exist.
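
As a rough mental model, each processor in the list above has a plain-Python counterpart. The sketch below is an analogy only: `emulate` and the `sample` document are hypothetical, the code mimics the visible effect of each processor rather than Elasticsearch's implementation, and the `html_strip` stand-in simply drops tags, so its whitespace handling may differ from the real processor.

```python
import re

def emulate(doc):
    """Apply plain-Python stand-ins for the ten processors to a copy of doc."""
    out = dict(doc)
    out["count"] = int(out["count"])                    # convert (to integer)
    out["label"] = out.pop("name")                      # rename
    out["source"] = "demo"                              # set
    out["body"] = re.sub(r"<[^>]*>", "", out["body"])   # html_strip (simplified)
    out["email"] = out["email"].lower()                 # lowercase
    out["code"] = out["code"].upper()                   # uppercase
    out["note"] = out["note"].strip()                   # trim
    out["letters"] = out["letters"].split(",")          # split on ","
    del out["scratch"]                                  # remove
    out["letters"] = out["letters"] + ["z"]             # append
    return out

# Hypothetical input, made up for this illustration
sample = {
    "count": "7", "name": "widget", "body": "<b>hi</b>", "email": "A@B.C",
    "code": "xy", "note": "  ok  ", "letters": "a,b", "scratch": "tmp",
}
result = emulate(sample)
print(result)
```

The order of the statements matters in the same way the order of processors matters in a pipeline: `append` runs after `split`, so it receives a list rather than a string.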

### 1. Creating the document

We'll apply all these common processors to this document to demonstrate how each one works.

In [2]:
document = {
    "price": "100.50",
    "old_name": "old_value",
    "description": "<p>This is a description with HTML.</p>",
    "username": "UserNAME",
    "category": "books",
    "title": "   Example Title with Whitespace   ",
    "tags": "tag1,tag2,tag3",
    "temporary_field": "This field should be removed"
}

### 2. Creating the pipeline

In [3]:
pipeline_body = {
    "description": "Pipeline to demonstrate various ingest processors",
    "processors": [
        {
            "convert": {
                "field": "price",
                "type": "float",
                "ignore_missing": True
            }
        },
        {
            "rename": {
                "field": "old_name",
                "target_field": "new_name"
            }
        },
        {
            "set": {
                "field": "status",
                "value": "active"
            }
        },
        {
            "html_strip": {
                "field": "description"
            }
        },
        {
            "lowercase": {
                "field": "username"
            }
        },
        {
            "uppercase": {
                "field": "category"
            }
        },
        {
            "trim": {
                "field": "title"
            }
        },
        {
            "split": {
                "field": "tags",
                "separator": ","
            }
        },
        {
            "remove": {
                "field": "temporary_field"
            }
        },
        {
            "append": {
                "field": "tags",
                "value": ["new_tag"]
            }
        }
    ]
}

pipeline_id = "multi_steps_pipeline"
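# Pass the pipeline fields as keyword arguments
es.ingest.put_pipeline(
    id=pipeline_id,
    description=pipeline_body["description"],
    processors=pipeline_body["processors"],
)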
print(f"Pipeline '{pipeline_id}' created successfully!")

Pipeline 'multi_steps_pipeline' created successfully!
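
In the pipeline above only `convert` sets `ignore_missing`. Processors such as `rename` and `remove` fail when their field is absent unless that option is set, so a document lacking `old_name` or `temporary_field` would be rejected. Here is a minimal sketch of one way to harden a processor list. The helper name `with_ignore_missing` is hypothetical, and the set of processor types assumed to accept `ignore_missing` should be checked against the processor reference for your Elasticsearch version.

```python
import copy

# Assumption: these processor types accept "ignore_missing". "set" and
# "append" are left out because they create the field when it is absent.
SUPPORTS_IGNORE_MISSING = {
    "convert", "rename", "html_strip", "lowercase",
    "uppercase", "trim", "split", "remove",
}

def with_ignore_missing(processors):
    """Return a deep copy of processors with ignore_missing=True where supported."""
    hardened = copy.deepcopy(processors)
    for processor in hardened:
        for name, options in processor.items():
            if name in SUPPORTS_IGNORE_MISSING:
                # setdefault keeps any value the author already chose
                options.setdefault("ignore_missing", True)
    return hardened

# Three processors taken from the pipeline definition above
example = [
    {"rename": {"field": "old_name", "target_field": "new_name"}},
    {"set": {"field": "status", "value": "active"}},
    {"remove": {"field": "temporary_field"}},
]
hardened = with_ignore_missing(example)
print(hardened)
```

To use it, replace `pipeline_body["processors"]` with `with_ignore_missing(pipeline_body["processors"])` before calling `put_pipeline`. Whether silently skipping a missing field is desirable depends on the data; failing loudly is sometimes the better choice.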


In [4]:
from pprint import pprint

es.indices.delete(index='my_index', ignore_unavailable=True)
es.indices.create(index='my_index')

response = es.index(index="my_index", document=document, pipeline=pipeline_id)
pprint(response.body)

{'_id': 'pEac5pIB_ipHjM1Sh7kj',
 '_index': 'my_index',
 '_primary_term': 1,
 '_seq_no': 0,
 '_shards': {'failed': 0, 'successful': 1, 'total': 2},
 '_version': 1,
 'result': 'created'}
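
The response above only confirms that a document was created; it does not show what the pipeline did to it. To preview the transformed document without indexing anything, the simulate pipeline API can run the stored pipeline against sample documents. This is a sketch under assumptions: `simulate_sources` is a hypothetical helper, `client` is expected to be an `Elasticsearch` instance such as `es`, and the response is assumed to have the documented `docs[].doc._source` shape.

```python
def simulate_sources(client, pipeline_id, documents):
    """Dry-run documents through a stored pipeline and return the transformed sources."""
    response = client.ingest.simulate(
        id=pipeline_id,
        docs=[{"_source": doc} for doc in documents],
    )
    # Each successful entry is expected to look like {"doc": {"_source": {...}}}.
    # A document that fails in the pipeline carries an "error" key instead,
    # which would raise a KeyError here.
    return [entry["doc"]["_source"] for entry in response["docs"]]
```

For example, `simulate_sources(es, pipeline_id, [document])` should return a one-element list holding the processed version of `document`.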


In [5]:
# Refresh so the newly indexed document is visible to search
es.indices.refresh(index='my_index')
response = es.search(index='my_index')
hits = response.body['hits']['hits']

for hit in hits:
    pprint(hit['_source'])

{'category': 'BOOKS',
 'description': '\nThis is a description with HTML.\n',
 'new_name': 'old_value',
 'price': 100.5,
 'status': 'active',
 'tags': ['tag1', 'tag2', 'tag3', 'new_tag'],
 'title': 'Example Title with Whitespace',
 'username': 'username'}
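
To see at a glance what the pipeline changed, the original document can be compared with the stored `_source`. The helper `diff_documents` below is a hypothetical, plain-Python sketch; the `before` and `after` values are copied from the cells above.

```python
def diff_documents(before, after):
    """Summarize which top-level fields were added, removed, or changed."""
    added = sorted(after.keys() - before.keys())
    removed = sorted(before.keys() - after.keys())
    changed = sorted(
        key for key in before.keys() & after.keys() if before[key] != after[key]
    )
    return {"added": added, "removed": removed, "changed": changed}

# The document as it was sent to Elasticsearch
before = {
    "price": "100.50",
    "old_name": "old_value",
    "description": "<p>This is a description with HTML.</p>",
    "username": "UserNAME",
    "category": "books",
    "title": "   Example Title with Whitespace   ",
    "tags": "tag1,tag2,tag3",
    "temporary_field": "This field should be removed",
}

# The _source returned by the search
after = {
    "category": "BOOKS",
    "description": "\nThis is a description with HTML.\n",
    "new_name": "old_value",
    "price": 100.5,
    "status": "active",
    "tags": ["tag1", "tag2", "tag3", "new_tag"],
    "title": "Example Title with Whitespace",
    "username": "username",
}

summary = diff_documents(before, after)
print(summary)
```

The summary reports `new_name` and `status` as added (from `rename` and `set`), `old_name` and `temporary_field` as removed (from `rename` and `remove`), and the six remaining fields as changed by their respective processors.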
