
Proposal: add timeout and key options to Stream.partition() #375

@roveo


I've been using a custom stream for some time. It works the same way as partition, but with two differences:

  1. You can specify a timeout option: the number of seconds after which the stream force-flushes the buffer, even if the number of elements in it is below the limit.

    This can be useful when you have some data delivery time requirement, but data flow rate is not constant.

  2. You can specify a key option: a callable that returns a partitioning key from a message, effectively splitting the buffer into several independent buffers.

    This is used when you need your batches to be grouped in some manner. I use this to make sure messages in each batch belong to the same source before passing them on to the sink that writes them to an appropriate database table. It's similar to the PartitionRecord processor in NiFi.
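
To make the two options concrete, here is a rough standalone sketch of the buffering logic, not the actual custom stream and not streamz code. The class name `KeyedBatcher`, its `add`/`poll` methods, and the injectable `clock` argument are all hypothetical names chosen for illustration; a real operator would drive the timeout from the event loop instead of an explicit `poll()` call.

```python
import time
from collections import defaultdict


class KeyedBatcher:
    """Hypothetical stand-in for partition(n) with timeout and key options."""

    def __init__(self, n, timeout=None, key=None, clock=time.monotonic):
        self.n = n                              # batch size limit
        self.timeout = timeout                  # seconds before a forced flush
        self.key = key or (lambda item: None)   # no key: one shared buffer
        self.clock = clock                      # injectable for testing
        self._buffers = defaultdict(list)       # one buffer per partitioning key
        self._started = {}                      # key -> arrival time of first item

    def add(self, item):
        """Buffer an item; return the batches (tuples) that became full."""
        k = self.key(item)
        buf = self._buffers[k]
        if not buf:
            self._started[k] = self.clock()
        buf.append(item)
        if len(buf) >= self.n:
            return [self._flush(k)]
        return []

    def poll(self):
        """Flush every buffer that has been waiting for at least `timeout`."""
        if self.timeout is None:
            return []
        now = self.clock()
        expired = [k for k, t0 in self._started.items()
                   if now - t0 >= self.timeout]
        return [self._flush(k) for k in expired]

    def _flush(self, k):
        self._started.pop(k, None)
        return tuple(self._buffers.pop(k))


# Usage with a fake clock so the timeout is deterministic.
fake_now = [0.0]
batcher = KeyedBatcher(n=2, timeout=5, key=lambda m: m["source"],
                       clock=lambda: fake_now[0])
batcher.add({"source": "a", "v": 1})         # nothing ready yet
batcher.add({"source": "b", "v": 2})         # goes to a separate buffer
full = batcher.add({"source": "a", "v": 3})  # buffer "a" reaches n=2
fake_now[0] = 6.0
late = batcher.poll()                        # buffer "b" is flushed by timeout
```

Each batch contains messages from a single key only, and a slow key cannot hold its messages longer than roughly `timeout` seconds, provided `poll()` is called often enough.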

I can make a PR, but the questions are:

  • Should I? Does it make sense to add this to core?
  • If yes, should this go into partition or a new operator? The standard behavior won't change, so it isn't a breaking change, but implementation details will. Someone might rely on partition._buffer being a list, not a defaultdict.
