Adds code to make concatenating dataframes columnwise in the result builder default #321

skrawcz · 2023-02-19T08:00:58Z

Why this change? because if you are doing indicator variable stuff, maybe you don't want to be explicit about them, instead you want to concatentate the columns created without knowing about them. Seems like a reasonable pattern on first glance.

This code is backwards compatible with current users, and just enables one to have dataframes in the output.
All it does is that it flattens the dataframe, and then ensures there's no duplicate column definitions. I added tests for this for this case -- this is all on the assumption that the indexes can be merged/used for the outer join (which is the current
behavior).

For the future, we probably want some flexibility with respect
to "joins" and how things are stitched together since you could
envisage some use cases where people want to inner join between
things --- but until there's a common pattern/need, I think
we should just keep thinking about what should be first class within Hamilton,
and what we get people to do themselves.

Changes

base pandas dataframe result builder

How I tested this

unit tests

Notes

should be 100% backwards compatible with current users, but now instead of erroring on a dataframe, it will columnwise concatenate it.

Checklist

PR has an informative and human-readable title (this will be pulled into the release notes)
Changes are limited to a single goal (no scope creep)
Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
Any change in functionality is tested
New functions are documented (with a description, list of inputs, and expected output)
Placeholder code is flagged / future TODOs are captured in comments
Project documentation has been updated if adding/changing functionality.

Why this change? because if you are doing indicator variable stuff, maybe you don't want to be explicit about them, instead you want to concatentate the columns created without knowing about them. Seems like a reasonable pattern on first glance. Otherwise this isn't pretty code. Still thinking it through, but essentially: 1. want to handle column concat of dataframes and series. 2. problem is that pd.DataFrame({ dict }) behavior essentially does an outer join on indexes if present, and handles scalars. 3. also we need to preserve "column" i.e. output order. 4. So question -- is this doing too much for the normal case of not having a dataframe in the output? Should I structure the code differently? Otherwise need to add unit tests for this new function: - like ensuring order is preserved. -

This code is simpler because it delegates to pandas on how to construct everything -- all it does is that it flattens the dataframe, and then ensures there's no duplicate column definitions. I added tests for this for this case. For the future, we probably want some flexibility with respect to "joins" and how things are stitched together since you could envisage some use cases where people want to inner join between things --- but until there's a common pattern/need, I think we should just keep thinking about what should be first class, and what we get people to do themselves.

skrawcz · 2023-02-19T19:33:00Z

This follows conversation in #46 where I created a result builder for a user; thanks to @Arnechos for the push/idea.

elijahbenizzy

A minor point. Also let's add to documentation. Otherwise 👍

hamilton/base.py

No need for a separate data structure. Can just use one.

skrawcz added 2 commits February 18, 2023 23:56

skrawcz requested a review from elijahbenizzy February 19, 2023 19:31

skrawcz marked this pull request as ready for review February 19, 2023 19:33

skrawcz changed the title ~~Adds code to make concatenating dataframes in the result builder default~~ Adds code to make concatenating dataframes columnwise in the result builder default Feb 19, 2023

elijahbenizzy reviewed Feb 19, 2023

View reviewed changes

hamilton/base.py Show resolved Hide resolved

hamilton/base.py Show resolved Hide resolved

hamilton/base.py Outdated Show resolved Hide resolved

elijahbenizzy approved these changes Feb 20, 2023

View reviewed changes

Simplifies duplicate column check in dataframe builder

f3eee93

No need for a separate data structure. Can just use one.

skrawcz merged commit 915efdd into main Feb 20, 2023

skrawcz deleted the make_df_concat_default branch February 20, 2023 05:38

skrawcz mentioned this pull request Feb 20, 2023

Capturing wide to long transformations of entire dataframes using Hamilton #46

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adds code to make concatenating dataframes columnwise in the result builder default #321

Adds code to make concatenating dataframes columnwise in the result builder default #321

skrawcz commented Feb 19, 2023 •

edited

skrawcz commented Feb 19, 2023

elijahbenizzy left a comment

Adds code to make concatenating dataframes columnwise in the result builder default #321

Adds code to make concatenating dataframes columnwise in the result builder default #321

Conversation

skrawcz commented Feb 19, 2023 • edited

Changes

How I tested this

Notes

Checklist

skrawcz commented Feb 19, 2023

elijahbenizzy left a comment

Choose a reason for hiding this comment

skrawcz commented Feb 19, 2023 •

edited