Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[EHN] Add jointly option for min_max_scale #1112

Merged
merged 19 commits into from
Jun 14, 2022

Conversation

Zeroto521
Copy link
Member

@Zeroto521 Zeroto521 commented Jun 2, 2022

PR Description

Please describe the changes proposed in the pull request:

  • Add an option for min_max_scale support to transform each column values or entire values
  • Default transform each column, similar behavior to sklearn.preprocessing.MinMaxScaler

This PR resolves #1067.

PR Checklist

Please ensure that you have done the following:

  1. PR in from a fork off your branch. Do not PR from <your_username>:dev, but rather from <your_username>:<feature-branch_name>.
  1. If you're not on the contributors list, add yourself to AUTHORS.md.
  1. Add a line to CHANGELOG.md under the latest version header (i.e. the one that is "on deck") describing the contribution.
    • Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

Automatic checks

There will be automatic checks run on the PR. These include:

  • Building a preview of the docs on Netlify
  • Automatically linting the code
  • Making sure the code is documented
  • Making sure that all tests are passed
  • Making sure that code coverage doesn't go down.

Relevant Reviewers

Please tag maintainers to review.

@codecov
Copy link

codecov bot commented Jun 2, 2022

Codecov Report

Merging #1112 (eff218b) into dev (2450124) will increase coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##              dev    #1112      +/-   ##
==========================================
+ Coverage   97.95%   97.98%   +0.03%     
==========================================
  Files          77       77              
  Lines        3175     3180       +5     
==========================================
+ Hits         3110     3116       +6     
+ Misses         65       64       -1     

@@ -23,23 +23,18 @@ def min_max_scale(
df: pd.DataFrame,
feature_range: tuple[int | float, int | float] = (0, 1),
column_name: str | int | list[str | int] | pd.Index = None,
entire_data: bool = False,
Copy link
Member Author

@Zeroto521 Zeroto521 Jun 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think entire_data is a better name.
Need to be added to the changelog file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, though I also need a bit of time and space to think of a better name. I wonder if others on the dev team have ideas? @pyjanitor-devs/core-devs

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure of a good name to use. maybe keep like in Pandas?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean 'keep' its value could be 'column' or 'all'?

But this variable is better as a boolean type.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, horrible parameter name. maybe scale_all or scale_all_columns?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, @Zeroto521, on second thought, I think we need a bit better definition of the semantics. min_max_scale currently operates with the assumption of operating on one column, so the use of column_name makes sense here. Below is my attempt at reasoning through the multiple ways we could use min_max_scale.

  • Scale one column.
  • Scale multiple columns independently.
  • Scale multiple columns, but jointly (so they are all scaled to the same min and max)

entire_data is a special case of scaling multiple columns jointly.

Since we're working on this function, it's a good chance to change the API to be flexible yet also sensible. What if the API was, instead the following?

def min_max_scale(df, feature_range, column_names: Iterable[Hashable] | Callable, jointly: bool):

If jointly is True, then the column_names provided are jointly scaled; otherwise, they are not.

I wanted to point out a new behaviour that we might be able to support across the rest of the API -- by making column_names accept a Callable that has the signature:

def column_names_callable(df) -> Iterable[Hashable]:

we can enable min_max_scale on all columns by doing:

df.min_max_scale(feature_range=(0, 1), column_names = lambda df: df.columns, jointly=True)

This is pretty concise without resorting to needing to maintain string mappings for special-case behaviour.

I'm glad we didn't rush to merge this PR, giving us the time and space to think clearly about the semantics of the API.

@samukweku and @Zeroto521 what do you think about this?

Copy link
Member Author

@Zeroto521 Zeroto521 Jun 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to point out a new behaviour that we might be able to support across the rest of the API -- by making column_names accept a Callable that has the signature:

def column_names_callable(df) -> Iterable[Hashable]:

I'm sorry I still can't get the point why column_names could accept a callback type argument.

The usage of column_names is to get dataframe's columns like df[column_names].
So if column_names has some column names which is not in df.columns.
There will raise an error. Whatever column_names is Iterable[Hashable] or callback type could return Iterable[Hashable].

we can enable min_max_scale on all columns by doing:

df.min_max_scale(feature_range=(0, 1), column_names = lambda df: df.columns, jointly=True)

To scale all columns, I thought we could use the default argument as None for column_names without no more inputting.

df.min_max_scale(feature_range=(0, 1), column_names=None, jointly=True)

Are there more examples to show the importance of the callback type?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments, @Zeroto521!

Regarding why we might want to allow column_names to be a Callable, I had the idea that it helps support being explicit over implicit, which is in the Zen of Python. Setting column_names=None makes selecting all columns implicit, whereas setting column_names=lambda df: df.columns makes selecting all columns explicit. In addition, it allows the selection of arbitrary subsets of column names programmatically, without needing to hard-code those names.

On further thought, I can see how column_names=None actually follows the pattern established in other places in the library, so I think, for now, we can:

  1. Use column_names=None to imply selection of all columns, and
  2. Talk more about column_names: Callable in the issue tracker, deferring the implementation till later.

What do you all think about jointly=True as the keyword for triggering whether to independently scale each column or to jointly scale all columns specified in column_names? @Zeroto521 if you're in agreement with the keyword argument, then I think, let's get that specified in this PR, then we can close out the PR!

Copy link
Member Author

@Zeroto521 Zeroto521 Jun 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally agree with using jointly.

About column_names whether could receive a callable type or not.
I can understand now. column_names=lambda df: df.columns is an implicit style and also a trick.

Select columns Using callable type Using Iterable type df.columns
Select the first three columns lambda df: df.columns[:3] ['a', 'b', 'c'] pd.Index(list('abcde'))
Select str type columns lambda df: [i for i in df.columns if instance(i, str)] ['a', 'b', 'c'] pd.Index(['a', 'b', 'c', 1])

As you said, we can put it aside at present.
Once the parameter column_names of min_max_scale could receive, other functions also need to do the same thing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More discussions move to #1115

@Zeroto521
Copy link
Member Author

Zeroto521 commented Jun 2, 2022

Well, GitHub Action style checking is failing but pre-commit.ci app is okay.
I thought it was time to cut the duplicate style checkings #1113

image

@ericmjl
Copy link
Member

ericmjl commented Jun 2, 2022

@Zeroto521 looks like we're good. The action didn't fail b/c of your code; it failed b/c of a system error.

Copy link
Member

@ericmjl ericmjl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Zeroto521 I'm going to pre-approve the PR. Regarding whether to use jointly as the kwarg name, please take a decision and we'll run with it.

@Zeroto521 Zeroto521 changed the title [EHN] Add entire_data option for min_max_scale [EHN] Add jointly option for min_max_scale Jun 13, 2022

Changed in version 0.24.0: Deleted "old_min", "old_max", "new_min", and "new_max"
options.
Changed in version 0.24.0: Added "feature_range", and "jointly" options.
Copy link
Member Author

@Zeroto521 Zeroto521 Jun 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this renders well, then we can merge this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[EHN] min_max_scale transforms each column values not entire values
3 participants