This repository has been archived by the owner on Jul 3, 2023. It is now read-only.

Better querying for available nodes #159

Closed
elijahbenizzy opened this issue Jul 19, 2022 · 6 comments

Comments

@elijahbenizzy
Collaborator

elijahbenizzy commented Jul 19, 2022

We make it easy to attach metadata (tags) to nodes, but we don't yet have a natural API for querying nodes by that metadata. This could be useful if:

(1) You want to look up a set of nodes with specific tags for reporting purposes
(2) You want to look up nodes used in data quality operators (see the motivating use-case below)
(3) You want to run a subset of the DAG selected by how its nodes are tagged.

Is your feature request related to a problem? Please describe.
When using data quality (DQ), we have to query by tags, which is really ugly. E.g.:

all_validator_variables = [
    var.name for var in dr.list_available_variables() if
    var.tags.get('hamilton.data_quality.contains_dq_results')]

We should be able to have some utility functions here.

Describe the solution you'd like
Some combo of the following:

dr.query(tag_match={...}, name_match=r"...", module_match=r"...")
hamilton_utils.get_dq_validators(...)

or something like that. Note that this would be valuable for more than just data quality, e.g. querying nodes by tag in general.
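
A minimal sketch of what such a query helper could look like, written as free functions over the objects returned by `dr.list_available_variables()`. Everything here is an assumption for illustration: `Variable` is a hypothetical stand-in exposing `name`, `tags`, and a `module` attribute (the snippet above only confirms `name` and `tags`), and `query` / `get_dq_validators` are the proposed names, not an existing Hamilton API.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class Variable:
    """Hypothetical stand-in for items from dr.list_available_variables()."""
    name: str
    tags: Dict[str, str] = field(default_factory=dict)
    module: str = ""  # assumption: the defining module's name is available


def query(
    variables: Iterable[Variable],
    tag_match: Optional[Dict[str, str]] = None,
    name_match: Optional[str] = None,
    module_match: Optional[str] = None,
) -> List[Variable]:
    """Return variables that satisfy every filter that was supplied."""
    name_re = re.compile(name_match) if name_match else None
    module_re = re.compile(module_match) if module_match else None
    results = []
    for var in variables:
        # Every requested tag must be present with exactly the requested value.
        if tag_match and any(var.tags.get(k) != v for k, v in tag_match.items()):
            continue
        if name_re and not name_re.search(var.name):
            continue
        if module_re and not module_re.search(var.module):
            continue
        results.append(var)
    return results


DQ_TAG = "hamilton.data_quality.contains_dq_results"


def get_dq_validators(variables: Iterable[Variable]) -> List[str]:
    """Names of variables whose DQ tag is set to any truthy value."""
    return [var.name for var in variables if var.tags.get(DQ_TAG)]
```

With helpers like these, the list comprehension above would reduce to `get_dq_validators(dr.list_available_variables())`, and the same `query` function would cover name and module filters.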

Describe alternatives you've considered
See above

Additional context
Writing out gitbook docs...

@elijahbenizzy elijahbenizzy changed the title Better querying for available decorators Better querying for available nodes Jul 25, 2022
@elijahbenizzy
Collaborator Author

OK, I think this one is going to become necessary as we do more with data quality, etc... Don't have the bandwidth to do it quite yet but I'm putting it on the queue.

@gravesee

The query API looks great. My team would find this useful to, for example, perform post-hoc analysis on projects to determine whether any model features or their inputs were tagged with particular attributes. In my industry (financial services) we often have to be able to answer: were any sensitive features used in the development of this model? This would be very easy to answer with this kind of query interface.

@elijahbenizzy
Collaborator Author

Awesome. Will bump this up on the queue. Would love some sample queries you'd like to perform for your use-case if you can provide them!

@gravesee

Right now it looks like queries would be performed on all nodes in a graph. Usually I want to know characteristics about upstream and downstream nodes for a given set of nodes. For example:

for the set of final nodes, which ones have upstream nodes with a tag of usage=='preview' ?

Or, conversely

for a set of nodes with tag of 'type' == 'input', find the set of downstream nodes (or final nodes, even)

The second one could be useful to determine downstream impacts of source dataset changes when, for example, one of our data providers has an issue or gets turned off.

In both cases, the missing piece seems to be applying the query API to a subset of nodes rather than to all of the nodes in the graph. I also haven't checked whether nodes are hashable, but supporting set operations on node collections could be helpful here as well. I'm going to be demoing the use of Hamilton to our data governance team in the coming weeks, and that will help me come up with more usage patterns.
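
The two queries above can be sketched with plain graph traversal. This is an illustration under stated assumptions, not Hamilton's API: the graph is a hypothetical dict mapping each node name to the set of names it depends on, tags live in a separate dict keyed by node name, and all node names and tag values are made up.

```python
from typing import Dict, Set


def upstream(deps: Dict[str, Set[str]], start: Set[str]) -> Set[str]:
    """All nodes reachable from `start` by following dependency edges."""
    seen: Set[str] = set()
    stack = list(start)
    while stack:
        node = stack.pop()
        for parent in deps.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def downstream(deps: Dict[str, Set[str]], start: Set[str]) -> Set[str]:
    """All nodes that transitively depend on any node in `start`."""
    children: Dict[str, Set[str]] = {}
    for node, parents in deps.items():
        for parent in parents:
            children.setdefault(parent, set()).add(node)
    # Walking the inverted graph "upstream" is a downstream walk.
    return upstream(children, start)


# Hypothetical graph: node name -> names of the nodes it depends on.
deps = {
    "raw_a": set(),
    "raw_b": set(),
    "feat_1": {"raw_a"},
    "feat_2": {"raw_b"},
    "model": {"feat_1", "feat_2"},
    "report": {"feat_2"},
}
# Hypothetical tags per node.
tags = {
    "raw_a": {"type": "input", "usage": "preview"},
    "raw_b": {"type": "input"},
}

# Query 1: which final nodes have an upstream node tagged usage == 'preview'?
final_nodes = {"model", "report"}
flagged = {
    n for n in final_nodes
    if any(tags.get(u, {}).get("usage") == "preview" for u in upstream(deps, {n}))
}

# Query 2: which nodes sit downstream of inputs, and which of those are final?
inputs = {n for n, t in tags.items() if t.get("type") == "input"}
impacted = downstream(deps, {"raw_b"})
impacted_finals = impacted & final_nodes
```

Working with node names rather than node objects sidesteps the hashability question: names are strings, so ordinary set operations such as `&`, `|`, and `-` apply directly to the results.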

@skrawcz
Collaborator

skrawcz commented Dec 22, 2022

@gravesee awesome, thanks for the information. Mind if we set up time to chat? I sent you a LinkedIn request to get your contact details. We'd love to help.

@elijahbenizzy
Collaborator Author

We are moving repositories! Please see the new version of this issue at DAGWorks-Inc/hamilton#37. Also, please give us a star/update any of your internal links.

Note that everything else (Slack community, PyPI packages, etc.) will not change at all.
