Add support for data streams #8

eyadmba · 2023-10-15T22:57:59Z

PR for issue #7

This PR demonstrates what I think should change to support data streams. I made the changes quickly in GitHub's online code editor, so I may have forgotten a comma somewhere. It's only a demonstration. If we like it, I'll pull it down, run it properly, and test it.

vduseev · 2023-10-16T02:16:25Z

Thanks a lot for the PR @eyadmba 👏

I agree with your observations on this issue. To be honest, I had no idea data streams require explicit action/op-type specification for bulk actions.

I can't come up with a better way to add this than the is_data_stream flag. I thought about adding a new enum called IndexType with two values: Normal and DataStream. But that would confuse old-school Elasticsearch users, because "index type" used to mean something else. Another option is to add a value called DATA_STREAM to RotateFrequency. But that looks weird, since "Data Stream" is not a rotation frequency. I'm just trying to figure out a good way to account for "what if they add another index mode called Data Hose in the future". A boolean flag seems to be the best choice right now.

Inside the code I would keep using the self._get_index() function, but modify it to check the is_data_stream variable, so that when is_data_stream is True it returns self._get_never_index_name().

In that case the code will look like this:

    index = self._get_index()
    actions = [
        {
            '_index': index,
            '_source': record,
            # op_type must be explicitly set to 'create' for bulk operations
            # on data streams. See issue #7.
            '_op_type': 'create' if self.is_data_stream else 'index',
        }
        for record in logs_buffer
    ]
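
For reference, here is a minimal self-contained sketch of the idea above. Only is_data_stream, _get_index(), _get_never_index_name() and the shape of the action dicts come from this thread. The class name SketchHandler, its constructor, the build_actions helper and the date-suffixed rotation are hypothetical stand-ins, not the handler's actual code.

```python
import datetime


class SketchHandler:
    """Hypothetical stand-in for the logging handler; not the library's real class."""

    def __init__(self, index_name, is_data_stream=False):
        self.index_name = index_name
        self.is_data_stream = is_data_stream

    def _get_never_index_name(self):
        # Assumption: with no rotation, the configured name is used unchanged.
        return self.index_name

    def _get_daily_index_name(self):
        # Assumption: a date suffix stands in for whatever rotation is configured.
        return f"{self.index_name}-{datetime.date.today():%Y.%m.%d}"

    def _get_index(self):
        # A data stream is addressed by its own name, never by a rotated one.
        if self.is_data_stream:
            return self._get_never_index_name()
        return self._get_daily_index_name()

    def build_actions(self, logs_buffer):
        index = self._get_index()
        # Bulk writes to a data stream need op_type 'create' (see issue #7).
        op_type = 'create' if self.is_data_stream else 'index'
        return [
            {'_index': index, '_source': record, '_op_type': op_type}
            for record in logs_buffer
        ]
```

With is_data_stream=True every action targets the plain index name with '_op_type': 'create'. With the default, the name gets the stand-in date suffix and the op type stays 'index'.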

If you could please make these changes and update the main README as well I would appreciate it a lot. I will add a test that explicitly covers this scenario before the end of the year.

eyadmba · 2023-10-16T15:28:32Z

Thanks for the quick response! I did incorporate the suggestions, and I ran the tests locally to make sure I didn't affect existing functionality.
I also tested it against a data stream, but only manually. I don't know how you want to go about automating the data stream integration tests, because they require creating an index template specific to data streams.

Here's what I did. I created an index template that will create a data stream when a document is indexed using it:

PUT /_index_template/org-logs
Content-Type: application/json

{
  "index_patterns": "org-logs-*",
  "data_stream": {
    // this data_stream object here is what differentiates a 
    // data stream-specific index template from a regular one.
    "timestamp_field": {
      "name": "@timestamp"
    }
  },
  "priority": 200,
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    }
  }
}

and then I set the logging handler's index_name="org-logs-prod" and confirmed that the log record does get indexed into that data stream.
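
A quick local check that an index name falls under the template's pattern can be sketched as below. The helper matches_template is hypothetical, and it uses Python's fnmatchcase only as an approximation of Elasticsearch's `*` wildcard matching. It does not talk to a cluster.

```python
from fnmatch import fnmatchcase


def matches_template(index_name, index_patterns):
    """Return True if index_name matches any of the given wildcard patterns.

    Accepts a single pattern string or a list of patterns.
    Approximation only: fnmatchcase also treats '?' and '[...]' as special.
    """
    if isinstance(index_patterns, str):
        patterns = [index_patterns]
    else:
        patterns = list(index_patterns)
    return any(fnmatchcase(index_name, pattern) for pattern in patterns)


# The handler's index_name from the comment above, against the template's pattern.
print(matches_template("org-logs-prod", "org-logs-*"))
```

A name that does not match the pattern, such as "app-logs-prod", would not be covered by the data stream template.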

vduseev · 2023-10-22T16:23:18Z

Thank you for your contribution @eyadmba! 🎉 I appreciate it a lot!

add is_data_stream flag and set op_type to create

0fb8974

fix and refactor

ef9975d

vduseev merged commit ccd29c5 into vduseev:main Oct 22, 2023
