[FR] loop (sub-)chains #405

Open
xunxky opened this issue Jun 9, 2021 · 1 comment


xunxky commented Jun 9, 2021

Thanks for submitting an issue!

Thanks for the generator-based approach of bonobo. We have hit several limitations with bonobo, most of which I could work around. The most recent one, however, looks to me like an important feature that I have not yet seen implemented natively in any ETL tool.

  • If this is a feature request, please make sure you explain the context, the goal, and why it is something that would go into bonobo core. Drafting some bits of spec is a good idea too, even if it's very draft-y.
    We are processing JSON documents read from JSON Lines files. The documents have the following structure:
{
  "name":"some name",
  "items":[
     {"value":"sub-document 1"},
     {"value":"sub-document 2"},
     {"value":"sub-document 3"}
  ]
}

However, we need to process the sub-documents in the items array.
Imagine one node that adds the date to each sub-document and another node that adds an id built from the full document's name field and the sub-document's position in the array.

Put simply, we could do this (and for now we will):

    from datetime import datetime

    for i, item in enumerate(doc["items"]):
        item["date"] = datetime.now()      # what the "add date" node would do
        item["id"] = doc["name"] + str(i)  # what the "add id" node would do

But it would be much more valuable if we could separate these responsibilities into different nodes.

The ultimate goal is to be able to loop through the nodes of a chain once per item in the document.
Of course, I am not asking bonobo to inspect the data. I am asking for a step-in/step-out approach, similar to the visitor pattern, for controlling looping (more generally, for controlling the flow of a chain or node from a different node's point of view).

chain

digraph G {
	subgraph cluster {
		node [style=filled];
		"add date" -> "add id";
		label = "loop until split yields EOD";
		color=blue
	}
	"document split" -> "add date"
	"add id" -> "document unsplit"
}
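
A minimal sketch of the loop drawn above, written in plain Python rather than against the bonobo API. All function names (`document_split`, `add_date`, `add_id`, `document_unsplit`, `loop_subchain`) are hypothetical, and passing a small context dict (`name`, `index`) alongside each sub-document is an assumption about how the id node would learn the parent's name and the item's position.

```python
from datetime import datetime


def document_split(doc):
    """Step in: yield one (context, sub-document) pair per item."""
    for index, item in enumerate(doc["items"]):
        # Copy the item so the input document is left untouched.
        yield {"name": doc["name"], "index": index}, dict(item)


def add_date(context, item):
    """Node 1: stamp the sub-document with the current time."""
    item["date"] = datetime.now()
    return item


def add_id(context, item):
    """Node 2: build an id from the parent name and the list position."""
    item["id"] = context["name"] + str(context["index"])
    return item


def document_unsplit(doc, items):
    """Step out: put the processed sub-documents back into a document."""
    return {**doc, "items": list(items)}


def loop_subchain(doc, nodes):
    """Run every node on each sub-document until the split is exhausted."""
    def processed():
        for context, item in document_split(doc):
            for node in nodes:
                item = node(context, item)
            yield item
    return document_unsplit(doc, processed())


doc = {
    "name": "some name",
    "items": [
        {"value": "sub-document 1"},
        {"value": "sub-document 2"},
        {"value": "sub-document 3"},
    ],
}
result = loop_subchain(doc, [add_date, add_id])
```

Here the exhaustion of the `document_split` generator plays the role of "split yields EOD": the inner chain never inspects the whole document, it only sees one sub-document and its context at a time.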


xunxky commented Mar 10, 2022

Hi, a little feedback.
I have implemented this, in both sync and async variants, at least for our needs; the implementation is incompatible with the bonobo library.
It is only a straightforward chain without any branching (although it could be nested).
The more I look into this, the more I believe that the basic approach of assuming some "graph" is too academic.

Please have a look at GStreamer, which uses sources and sinks to redirect data flow.
In some cases (e.g. grouping, counting, ...) a sink needs to know when the last element has been sent, so that its adjacent source can emit the computed result.
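
To illustrate why an aggregating node needs an explicit end-of-data signal, here is a rough sketch using plain Python generators. It is not GStreamer or bonobo code: the `EOD` marker and the `source` and `counting_node` names are hypothetical, chosen only to show the idea.

```python
from collections import Counter

# Hypothetical end-of-data marker, not a bonobo or GStreamer object.
EOD = object()


def source(values):
    """Emit every value, then signal that no more data will follow."""
    yield from values
    yield EOD


def counting_node(stream):
    """Collect on the sink side; emit the result only once EOD arrives."""
    counts = Counter()
    for element in stream:
        if element is EOD:
            # The last element has been seen, so the result is complete.
            yield dict(counts)
            yield EOD  # pass the marker on to downstream nodes
            return
        counts[element] += 1


results = [r for r in counting_node(source(["a", "b", "a"])) if r is not EOD]
```

Without the marker, `counting_node` could not tell a pause in the stream from its end, so it would have no safe point at which to emit the counts.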
