Skip to content

Shreyas220/Dataflow

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Current

                                              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                              β”‚             β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                   β”‚             β”‚
β”‚         β”‚               β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚             β”‚
β”‚  source │─────────--───▢│ BatcherNode │────▢│  Persistent β”‚
β”‚         β”‚               β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚    store    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                   β”‚             β”‚
                                              β”‚             β”‚
                                              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Goal

Goal of this project to implement certain functioanlities defined by the Dataflow paper (foundation for apache Beam) to build an iceberg writer/persis

Functioanlity (ToDo)

  • Time based windows (adjustable)
  • Window rotation
  • Processing the window based on partitioned Values
  • Handling CDC data (sub paritionining the batches based on cdc timestamp to maintain order)
  • Backpressure
  • Adjusts batch sizes based on processing performance

Benchmark

Where sample data looked like {"name": "A", "size": "small", "count": 2}

Total number of events in 1 batch = 10,000

partitioned on count and randomness was limit to 10 in data gen

Arrow conversion took: 170.850ms
Partitioning took: 34.328ms
Partitioned length: 10

Ideally


                          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                     β”Œβ”€β”€β”€β–Άβ”‚ BatcherNode │────▢│             β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚             β”‚
β”‚         β”‚          β”‚    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚             β”‚
β”‚  source │─────────▢│───▢│ BatcherNode │────▢│  Persistent β”‚
β”‚         β”‚          β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚    store    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β”‚    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚             β”‚
                     └───▢│ BatcherNode │────▢│             β”‚
                          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Ref

dataflow paper https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43864.pdf

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages