Current
βββββββββββββββ
β β
βββββββββββ β β
β β βββββββββββββββ β β
β source ββββββββββ--ββββΆβ BatcherNode ββββββΆβ Persistent β
β β βββββββββββββββ β store β
βββββββββββ β β
β β
βββββββββββββββ
Goal of this project to implement certain functioanlities defined by the Dataflow paper (foundation for apache Beam) to build an iceberg writer/persis
- Time based windows (adjustable)
- Window rotation
- Processing the window based on partitioned Values
- Handling CDC data (sub paritionining the batches based on cdc timestamp to maintain order)
- Backpressure
- Adjusts batch sizes based on processing performance
Where sample data looked like
{"name": "A", "size": "small", "count": 2}
Total number of events in 1 batch = 10,000
partitioned on count
and randomness was limit to 10 in data gen
Arrow conversion took: 170.850ms
Partitioning took: 34.328ms
Partitioned length: 10
Ideally
βββββββββββββββ βββββββββββββββ
βββββΆβ BatcherNode ββββββΆβ β
βββββββββββ β βββββββββββββββ β β
β β β βββββββββββββββ β β
β source βββββββββββΆβββββΆβ BatcherNode ββββββΆβ Persistent β
β β β βββββββββββββββ β store β
βββββββββββ β βββββββββββββββ β β
βββββΆβ BatcherNode ββββββΆβ β
βββββββββββββββ βββββββββββββββ
dataflow paper https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43864.pdf