This repository has been archived by the owner on Jul 3, 2023. It is now read-only.

Enhancement: Add capability to use a DataFrame as the basis for Driver initial_columns #18

Closed
straun opened this issue Oct 23, 2021 · 4 comments
Labels
enhancement New feature or request product idea

Comments

@straun

straun commented Oct 23, 2021

Currently the Driver initial_columns accepts a Dict which maps the column names to a Series. If you already have a DataFrame then why not support a Driver constructor that accepts the DataFrame?

Without this, the caller is required to create a Dict themselves by iterating over the column names and using squeeze to generate a Series for each.
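A minimal sketch of the workaround described above, assuming a small made-up DataFrame (the column names and values are illustrative only). It builds the Dict by iterating over the column names and squeezing each one-column slice into a Series:

```python
import pandas as pd

# Illustrative data only; any DataFrame with named columns would do.
df = pd.DataFrame({"signups": [1, 10, 50], "spend": [10.0, 20.0, 40.0]})

# df[[name]] is a one-column DataFrame; squeeze(axis=1) collapses it to a Series.
initial_columns = {name: df[[name]].squeeze(axis=1) for name in df.columns}

print(list(initial_columns))
```

The resulting dict has one Series per column, which is the shape the Driver currently expects for initial_columns.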

@skrawcz skrawcz added enhancement New feature or request product idea labels Oct 24, 2021
@skrawcz
Collaborator

skrawcz commented Oct 24, 2021

Just to spec this out a little more:

Rather than:

import importlib

from hamilton import driver

my_df = loaded_somehow(...)  # previously loaded
initial_columns = {
    'signups': my_df.signups,  # I think squeeze() is a no-op here and isn't required?
    'spend': my_df.spend,  # I think squeeze() is a no-op here and isn't required?
}
# we need to tell hamilton where to load function definitions from
module_name = 'my_functions'
module = importlib.import_module(module_name)
dr = driver.Driver(initial_columns, module)

you're suggesting:

initial_df = loaded_somehow(...) # previously loaded
# we need to tell hamilton where to load function definitions from
module_name = 'my_functions'
module = importlib.import_module(module_name)
dr = driver.Driver(initial_df, module)   # rather than pass a dict, pass a dataframe.

Where the assumption is that initial_df contains the appropriately named columns that we can use to pull data from?

@chmp
Contributor

chmp commented Oct 25, 2021

FYI: A simple fix that would implement this change is to manually iterate over the incoming object and pull out the columns.
The following code will accept a DataFrame or a dict of Series and always result in a dict of Series:

initial_df = {key: initial_df[key] for key in initial_df}

For example:

>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> {key: df[key] for key in df}
{'a': 0    1
1    2
2    3
Name: a, dtype: int64, 'b': 0    4
1    5
2    6
Name: b, dtype: int64}
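As a rough sketch, the same comprehension could be wrapped in a small helper to show that both input types end up in the same shape. The name `to_series_dict` is hypothetical and not part of Hamilton; the sketch relies only on the fact that iterating over a DataFrame or a dict yields its keys:

```python
import pandas as pd


def to_series_dict(columns):
    """Hypothetical helper: accept a DataFrame or a dict of Series, return a dict of Series."""
    # Iterating over a DataFrame yields column labels; iterating over a dict yields keys.
    return {key: columns[key] for key in columns}


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

from_df = to_series_dict(df)
from_dict = to_series_dict({"a": df["a"], "b": df["b"]})

# Both inputs end up as plain dicts with the same keys.
print(type(from_df).__name__, list(from_df) == list(from_dict))
```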

@skrawcz
Collaborator

skrawcz commented Feb 3, 2022

This idea would simplify "driver chaining" a little, so just to flesh things out for a minute.

Motivation

You need a separate driver either because you want to do something in between, like a wide-to-long transform, or because you have conflicting module function names.

Proposal

Right now you'd have to do this

dr = driver.Driver(config_or_data, modules)
long_df = dr.execute(['col1', ..., 'col3']).melt(...)
dr2 = driver.Driver({key: long_df[key] for key in long_df}, modules)  # <-- this line is what I'm talking about
df2 = dr2.execute(['colX', ..., 'colY'])

Proposing this becomes:

dr = driver.Driver(config_or_data, modules)
long_df = dr.execute(['col1', ..., 'col3']).melt(...)
dr2 = driver.Driver(long_df, modules) # < -- this line is now simplified
df2 = dr2.execute(['colX', ..., 'colY'])
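To make the step between the two drivers concrete, here is a pandas-only sketch of the wide-to-long transform followed by the dict conversion that the second driver currently needs. The frame `wide` and the names `id`, `metric` and `value` are made up for illustration, and the driver calls themselves are left out:

```python
import pandas as pd

# Stand-in for the output of the first driver's execute(); values are illustrative.
wide = pd.DataFrame({"id": [1, 2], "col1": [0.1, 0.2], "col2": [1.0, 2.0]})

# Wide-to-long: one row per (id, metric) pair.
long_df = wide.melt(id_vars="id", var_name="metric", value_name="value")

# The conversion the second driver needs today: one Series per column.
inputs_for_second_driver = {key: long_df[key] for key in long_df}

print(list(inputs_for_second_driver))
```

With the proposal, the last conversion would go away and long_df would be passed directly.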

Edge cases

@config.when likely won't work

Since @config.when operates on scalars, you wouldn't be able to exercise functions decorated with this, as the dataframe has only columns... We could adjust this to have some defined behavior over "vectors", but that requires more thought...
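A small sketch of why this matters, using plain pandas rather than Hamilton itself. The helper `split_inputs` and the `region` key are hypothetical, chosen only to illustrate the point: a dict can carry scalar configuration next to columns, while a dict derived from a DataFrame carries columns only.

```python
import pandas as pd


def split_inputs(inputs):
    """Hypothetical helper: separate scalar values from Series-valued column inputs."""
    scalars = {k: v for k, v in inputs.items() if not isinstance(v, pd.Series)}
    columns = {k: v for k, v in inputs.items() if isinstance(v, pd.Series)}
    return scalars, columns


df = pd.DataFrame({"signups": [1, 10, 50], "spend": [10.0, 20.0, 40.0]})

# A dict input can mix a scalar (usable for configuration) with columns.
scalars_from_dict, _ = split_inputs({"region": "US", "spend": df["spend"]})

# Inputs derived from a DataFrame contain columns only, so no scalars remain.
scalars_from_df, columns_from_df = split_inputs({key: df[key] for key in df})

print(scalars_from_dict, scalars_from_df)
```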

Alternative

We instead push people to @chmp's solution -- which means that we never confuse users with how to provide inputs to the driver.

@skrawcz
Collaborator

skrawcz commented Mar 24, 2022

Closing this issue for now, since we haven't received much feedback on it. Also, Hamilton now supports more than just creating dataframes, so this could couple the driver too tightly to dataframes when we don't want that.

@skrawcz skrawcz closed this as completed Mar 24, 2022