This repository has been archived by the owner on Jul 3, 2023. It is now read-only.

Enhancement: Add capability to use a DataFrame as a template for the target output. #19

Closed
straun opened this issue Oct 23, 2021 · 2 comments
Labels
enhancement (New feature or request), product idea

Comments


straun commented Oct 23, 2021

Currently, if the caller has a DataFrame structure that they are targeting, then they need to ensure they match the names of the columns correctly and manually convert the Series types. If output_columns, or another parameter of the execute function, took a DataFrame as a template, then the output columns would match the template's columns and each Series could be delivered using an astype conversion.

You will probably need something like a DictionaryError for the scenario where there is a column in the DataFrame template that is not in the data columns available.
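
To make the request concrete, here is a minimal sketch of the template idea in plain pandas, outside Hamilton. The helper name conform_to_template is hypothetical, and the built-in KeyError stands in for the suggested DictionaryError; neither is part of Hamilton's API.

```python
import pandas as pd


def conform_to_template(data: pd.DataFrame, template: pd.DataFrame) -> pd.DataFrame:
    """Return the template's columns from `data`, cast to the template's dtypes."""
    missing = [col for col in template.columns if col not in data.columns]
    if missing:
        # Stand-in for the "DictionaryError" suggested above (hypothetical name).
        raise KeyError(f"Template columns not found in the data: {missing}")
    # Select the columns in template order, then cast each one to the template's dtype.
    return data[list(template.columns)].astype(template.dtypes.to_dict())


# An empty DataFrame is enough to carry the target schema.
template = pd.DataFrame({
    "spend": pd.Series(dtype="float64"),
    "signups": pd.Series(dtype="int64"),
})
data = pd.DataFrame({"signups": [1.0, 10.0], "spend": [10, 20], "extra": ["a", "b"]})
result = conform_to_template(data, template)
```

Here result holds only spend and signups, in that order, with the template's dtypes; the extra column is dropped.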

There is also the option to be able to process compound column names from the DataFrame to map into a more structured DataFrame. This would involve having a join character e.g. _.
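
One possible reading of the compound-name idea, sketched in plain pandas: split each flat column name on the join character and use the parts as a two-level column index. The helper name split_compound_columns and the sample column names are assumptions for illustration, not something Hamilton provides.

```python
import pandas as pd


def split_compound_columns(df: pd.DataFrame, sep: str = "_") -> pd.DataFrame:
    """Turn flat names like 'spend_mean' into two-level columns ('spend', 'mean')."""
    pairs = []
    for name in df.columns:
        # Split on the first occurrence of the join character only.
        head, found, tail = str(name).partition(sep)
        if not found:
            raise ValueError(f"Column {name!r} does not contain the separator {sep!r}")
        pairs.append((head, tail))
    out = df.copy()
    out.columns = pd.MultiIndex.from_tuples(pairs)
    return out


flat = pd.DataFrame({"spend_mean": [1.5], "spend_std": [0.2], "signups_mean": [7.0]})
nested = split_compound_columns(flat)
```

With this layout, nested["spend"] selects the mean and std columns as a group.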

Collaborator

skrawcz commented Oct 24, 2021

@straun can you provide some code/context for a little more motivation for this? Some questions to help with that:

they need to ensure they match the names of the columns correctly and manually convert the Series types

What is causing this mismatch in name and/or type to be the case? Can't the functions be named appropriately and return the right types?

You will probably need something like a DictionaryError for the scenario where there is a column in the DataFrame template that is not in the data columns available.

Currently Hamilton throws a ValueError if you request a column that isn't defined in the function DAG. I think that should suffice for your needs?

There is also the option to be able to process compound column names from the DataFrame to map into a more structured DataFrame. This would involve having a join character e.g. _.

I'm not sure I'm following. Could you provide an example of what you mean here?

Otherwise we've advised users that in cases that require a bit more massaging of inputs/outputs, the easier thing to do is to create a "Wrapper" driver, using a delegation pattern, and tell people to use that as the interface. I think that might be a better solution for you with my current understanding. E.g.

from types import ModuleType
from typing import Any, Dict

import pandas as pd
from hamilton.driver import Driver


class MyCustomDriver(object):

    def __init__(self, config: Dict[str, Any], *modules: ModuleType):
        self.h_driver = Driver(config, *modules)  # delegation pattern

    def match_types(self, actual_df: pd.DataFrame, schema_df: pd.DataFrame) -> pd.DataFrame:
        """Code to make sure the actual DF matches the intended schema."""
        return ...

    def execute(self, wanted_df: pd.DataFrame) -> pd.DataFrame:
        # The template carries only column names and dtypes, so it must have no rows.
        assert wanted_df.empty, "Wanted DF must be empty"
        output_columns = list(wanted_df.columns)
        df = self.h_driver.execute(output_columns)
        return self.match_types(df, wanted_df)  # function to convert column types

Thoughts?

skrawcz added the enhancement and product idea labels Oct 24, 2021
Collaborator

skrawcz commented Mar 24, 2022

Going to close this issue for now. Hamilton now supports more than creating dataframes, so supporting this feature probably doesn't make sense for now.

skrawcz closed this as completed Mar 24, 2022