Skip to content
This repository has been archived by the owner on Jul 3, 2023. It is now read-only.

Adds code to make concatenating dataframes columnwise in the result builder default #321

Merged
merged 3 commits into from
Feb 20, 2023

Conversation

skrawcz
Copy link
Collaborator

@skrawcz skrawcz commented Feb 19, 2023

Why this change? because if you are doing indicator variable stuff, maybe you don't want to be explicit about them, instead you want to concatentate the columns created without knowing about them. Seems like a reasonable pattern on first glance.

This code is backwards compatible with current users, and just enables one to have dataframes in the output.
All it does is that it flattens the dataframe, and then ensures there's no duplicate column definitions. I added tests for this for this case -- this is all on the assumption that the indexes can be merged/used for the outer join (which is the current
behavior).

For the future, we probably want some flexibility with respect
to "joins" and how things are stitched together since you could
envisage some use cases where people want to inner join between
things --- but until there's a common pattern/need, I think
we should just keep thinking about what should be first class within Hamilton,
and what we get people to do themselves.

Changes

  • base pandas dataframe result builder

How I tested this

  • unit tests

Notes

  • should be 100% backwards compatible with current users, but now instead of erroring on a dataframe, it will columnwise concatenate it.

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

Why this change? because if you are doing indicator variable stuff,
maybe you don't want to be explicit about them, instead you want
to concatentate the columns created without knowing about them. Seems
like a reasonable pattern on first glance.

Otherwise this isn't pretty code. Still thinking it through, but essentially:

1. want to handle column concat of dataframes and series.
2. problem is that pd.DataFrame({ dict }) behavior essentially does
an outer join on indexes if present, and handles scalars.
3. also we need to preserve "column" i.e. output order.
4. So question -- is this doing too much for the normal case
of not having a dataframe in the output? Should I structure
the code differently?

Otherwise need to add unit tests for this new function:
 - like ensuring order is preserved.
 -
This code is simpler because it delegates to pandas on how
to construct everything -- all it does is that it flattens
the dataframe, and then ensures there's no duplicate column
definitions. I added tests for this for this case.

For the future, we probably want some flexibility with respect
to "joins" and how things are stitched together since you could
envisage some use cases where people want to inner join between
things --- but until there's a common pattern/need, I think
we should just keep thinking about what should be first class,
and what we get people to do themselves.
@skrawcz
Copy link
Collaborator Author

skrawcz commented Feb 19, 2023

This follows conversation in #46 where I created a result builder for a user; thanks to @Arnechos for the push/idea.

@skrawcz skrawcz marked this pull request as ready for review February 19, 2023 19:33
@skrawcz skrawcz changed the title Adds code to make concatenating dataframes in the result builder default Adds code to make concatenating dataframes columnwise in the result builder default Feb 19, 2023
Copy link
Collaborator

@elijahbenizzy elijahbenizzy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A minor point. Also let's add to documentation. Otherwise 👍

hamilton/base.py Show resolved Hide resolved
hamilton/base.py Show resolved Hide resolved
hamilton/base.py Outdated Show resolved Hide resolved
No need for a separate data structure. Can just use one.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants