Enhancement: Add capability to use a DataFrame as the basis for Driver initial_columns
#18
Comments
|
Just to spec this out a little more. Rather than:

my_df = loaded_somehow(...)  # previously loaded
initial_columns = {
'signups': my_df.signups, # I think squeeze() is a no-op here and isn't required?
'spend': my_df.spend, # I think squeeze() is a no-op here and isn't required?
}
# we need to tell hamilton where to load function definitions from
module_name = 'my_functions'
module = importlib.import_module(module_name)
dr = driver.Driver(initial_columns, module)

you're suggesting:

initial_df = loaded_somehow(...)  # previously loaded
# we need to tell hamilton where to load function definitions from
module_name = 'my_functions'
module = importlib.import_module(module_name)
dr = driver.Driver(initial_df, module)  # rather than pass a dict, pass a dataframe.

Where the assumption is that
|
FYI: A simple fix that would implement this change is to manually iterate over the incoming object and pull out the columns:

initial_df = {key: initial_df[key] for key in initial_df}
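To make the comprehension above concrete, here is a minimal sketch using pandas only. The helper name `dataframe_to_columns` and the sample data are hypothetical and not part of Hamilton; the sketch relies only on the fact that iterating over a DataFrame yields its column labels.

```python
import pandas as pd


def dataframe_to_columns(df: pd.DataFrame) -> dict:
    """Map each column label to its Series (hypothetical helper, not Hamilton API)."""
    # Iterating over a DataFrame yields its column labels, so df[name] is a Series.
    return {name: df[name] for name in df}


# Made-up sample data standing in for a previously loaded DataFrame.
my_df = pd.DataFrame({"signups": [1, 10, 50], "spend": [10.0, 20.0, 40.0]})
initial_columns = dataframe_to_columns(my_df)
```

The resulting dict has the same shape as the hand-written `initial_columns` dict earlier in the thread, so it could presumably be passed to the Driver in its place.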
|
This idea would simplify "driver chaining" a little, so just to flesh things out for a minute.

Motivation

You either have to have a separate driver because you want to do something, like a wide-to-long transform, or because you have conflicting module function names.

Proposal

Right now you'd have to do this:

dr = driver.Driver(config_or_data, modules)
long_df = dr.execute(['col1', ..., 'col3']).melt(...)
dr2 = driver.Driver({key: long_df[key] for key in long_df}, modules)  # <-- this line is what I'm talking about
df2 = dr2.execute(['colX', ..., 'colY'])

Proposing this becomes:

dr = driver.Driver(config_or_data, modules)
long_df = dr.execute(['col1', ..., 'col3']).melt(...)
dr2 = driver.Driver(long_df, modules)  # <-- this line is now simplified
df2 = dr2.execute(['colX', ..., 'colY'])

Edge cases

@config.when likely won't work

Since @config.when operates on scalars, you wouldn't be able to exercise functions decorated with it, as the dataframe has only columns. We could adjust this to have some defined behavior over "vectors", but that requires more thought.

Alternative

We instead push people to @chmp's solution, which means that we never confuse users with how to provide inputs to the driver.
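The intermediate step of the chaining example (wide-to-long, then back to a dict of columns) can be sketched with pandas alone. The column names and data below are made up for illustration, and no Hamilton calls are used; `wide_df` stands in for whatever the first driver's `execute` would return.

```python
import pandas as pd

# Stand-in for the output of the first driver (hypothetical data).
wide_df = pd.DataFrame({
    "region": ["east", "west"],
    "col1": [1, 2],
    "col2": [3, 4],
})

# Wide-to-long transform: one row per (region, original column) pair.
long_df = wide_df.melt(id_vars=["region"], value_vars=["col1", "col2"])

# What the caller currently has to build by hand for the second driver.
columns_for_second_driver = {name: long_df[name] for name in long_df}
```

After `melt`, the columns are `region`, `variable`, and `value`, so functions in the second driver would need parameters with those names (or the columns would need renaming first).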
|
Closing this issue for now, since we haven't received much feedback on it. Also, Hamilton now supports more than just creating dataframes, so perhaps this would result in overcoupling to dataframes when we don't want there to be such tight coupling.
Currently the Driver initial_columns accepts a Dict which maps the column names to a Series. If you already have a DataFrame, then why not support a Driver constructor that accepts the DataFrame?

Without this, the caller is required to create a Dict themselves by iterating over the column names and using squeeze to generate a Series for each.
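On whether `squeeze` is needed at all: a small pandas-only check, with made-up data, suggests that selecting a column by label already returns a Series, and that `squeeze` only changes the result when an axis has length 1.

```python
import pandas as pd

my_df = pd.DataFrame({"signups": [1, 10, 50], "spend": [10.0, 20.0, 40.0]})

# Selecting by a single label already gives a Series.
by_label = my_df["signups"]

# squeeze() on a Series with more than one element leaves it unchanged.
squeezed_series = my_df["signups"].squeeze()

# squeeze() matters for a single-column DataFrame: it collapses to a Series.
squeezed_frame = my_df[["signups"]].squeeze()

# Caveat: a one-element Series squeezes down to a scalar, not a Series.
single_row = my_df.head(1)["signups"].squeeze()
```

So for `my_df.signups` style access, `squeeze` looks unnecessary, and with a single-row DataFrame it would turn the column into a scalar.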