[EHN] Add `jointly` option for `min_max_scale` #1112

Zeroto521 · 2022-06-02T02:56:19Z

PR Description

Please describe the changes proposed in the pull request:

Add an option for min_max_scale to transform either each column's values separately or all values together.
By default each column is transformed separately, similar to the behavior of sklearn.preprocessing.MinMaxScaler.

This PR resolves #1067.

PR Checklist

Please ensure that you have done the following:

PR in from a fork off your branch. Do not PR from <your_username>:dev, but rather from <your_username>:<feature-branch_name>.

If you're not on the contributors list, add yourself to AUTHORS.md.

Add a line to CHANGELOG.md under the latest version header (i.e. the one that is "on deck") describing the contribution.
- Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

Automatic checks

There will be automatic checks run on the PR. These include:

Building a preview of the docs on Netlify
Automatically linting the code
Making sure the code is documented
Making sure that all tests are passed
Making sure that code coverage doesn't go down.

Relevant Reviewers

Please tag maintainers to review.

@ericmjl

codecov · 2022-06-02T03:07:57Z

Codecov Report

Merging #1112 (eff218b) into dev (2450124) will increase coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##              dev    #1112      +/-   ##
==========================================
+ Coverage   97.95%   97.98%   +0.03%     
==========================================
  Files          77       77              
  Lines        3175     3180       +5     
==========================================
+ Hits         3110     3116       +6     
+ Misses         65       64       -1

Zeroto521 · 2022-06-02T03:13:04Z

janitor/functions/min_max_scale.py

@@ -23,23 +23,18 @@ def min_max_scale(
    df: pd.DataFrame,
    feature_range: tuple[int | float, int | float] = (0, 1),
    column_name: str | int | list[str | int] | pd.Index = None,
+    entire_data: bool = False,


I don't think entire_data is a good name.
This change also needs to be added to the changelog file.

I agree, though I also need a bit of time and space to think of a better name. I wonder if others on the dev team have ideas? @pyjanitor-devs/core-devs

not sure of a good name to use. maybe keep like in Pandas?

You mean a parameter named 'keep' whose value could be 'column' or 'all'?

But this variable is better as a boolean type.

yea, horrible parameter name. maybe scale_all or scale_all_columns?

Hmmm, @Zeroto521, on second thought, I think we need a bit better definition of the semantics. min_max_scale currently operates with the assumption of operating on one column, so the use of column_name makes sense here. Below is my attempt at reasoning through the multiple ways we could use min_max_scale.

Scale one column.

Scale multiple columns independently.

Scale multiple columns, but jointly (so they are all scaled to the same min and max)

entire_data is a special case of scaling multiple columns jointly.
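To make the distinction between independent and joint scaling concrete, here is a minimal sketch in plain pandas. The helper names `scale_independently` and `scale_jointly` are hypothetical and are not pyjanitor's implementation; they only illustrate the arithmetic under discussion.

```python
import pandas as pd


def scale_independently(df, columns, feature_range=(0, 1)):
    """Scale each column with its own min and max (hypothetical helper)."""
    new_min, new_max = feature_range
    sub = df[columns]
    old_min, old_max = sub.min(), sub.max()  # one value per column
    out = df.copy()
    out[columns] = (sub - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
    return out


def scale_jointly(df, columns, feature_range=(0, 1)):
    """Scale all columns with one shared min and max (hypothetical helper)."""
    new_min, new_max = feature_range
    sub = df[columns]
    old_min, old_max = sub.min().min(), sub.max().max()  # single scalars
    out = df.copy()
    out[columns] = (sub - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
    return out


df = pd.DataFrame({"a": [1, 2, 3], "b": [0, 5, 10]})
independent = scale_independently(df, ["a", "b"])  # each column spans 0..1
joint = scale_jointly(df, ["a", "b"])  # only the overall min and max hit 0 and 1
```

With independent scaling both columns cover the full target range; with joint scaling column `a` keeps its position relative to the shared minimum and maximum of both columns.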

Since we're working on this function, it's a good chance to change the API to be flexible yet also sensible. What if the API were the following instead?

def min_max_scale(df, feature_range, column_names: Iterable[Hashable] | Callable, jointly: bool):

If jointly is True, then the column_names provided are jointly scaled; otherwise, they are not.

I wanted to point out a new behaviour that we might be able to support across the rest of the API -- by making column_names accept a Callable that has the signature:

def column_names_callable(df) -> Iterable[Hashable]:

we can enable min_max_scale on all columns by doing:

df.min_max_scale(feature_range=(0, 1), column_names = lambda df: df.columns, jointly=True)

This is pretty concise without resorting to needing to maintain string mappings for special-case behaviour.

I'm glad we didn't rush to merge this PR, giving us the time and space to think clearly about the semantics of the API.

@samukweku and @Zeroto521 what do you think about this?

I wanted to point out a new behaviour that we might be able to support across the rest of the API -- by making column_names accept a Callable that has the signature:

def column_names_callable(df) -> Iterable[Hashable]:

I'm sorry, I still don't see why column_names should accept a callable argument.

column_names is used to select the dataframe's columns, as in df[column_names].
So if column_names contains names that are not in df.columns, an error is raised,
whether column_names is an Iterable[Hashable] or a callable that returns an Iterable[Hashable].
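As a small illustration of this point, the sketch below shows that plain pandas raises a `KeyError` for a missing label regardless of whether the names came from a list or from a callable. The helper `select_or_report` is hypothetical and exists only to capture the error for display.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})


def select_or_report(frame, names):
    """Return frame[names], or a short message if a label is missing."""
    try:
        return frame[names]
    except KeyError as err:
        return f"KeyError: {err}"


# Names given directly as a list.
from_list = select_or_report(df, ["a", "missing"])

# Names produced by a callable that receives the dataframe.
names_callable = lambda frame: ["a", "missing"]
from_callable = select_or_report(df, names_callable(df))

# A valid selection still works as usual.
valid = select_or_report(df, ["a", "b"])
```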

we can enable min_max_scale on all columns by doing:

df.min_max_scale(feature_range=(0, 1), column_names = lambda df: df.columns, jointly=True)

To scale all columns, I thought we could rely on the default column_names=None, with no further input needed.

df.min_max_scale(feature_range=(0, 1), column_names=None, jointly=True)

Are there more examples to show the importance of the callback type?

Thanks for the comments, @Zeroto521!

Regarding why we might want to allow column_names to be a Callable, I had the idea that it helps support being explicit over implicit, which is in the Zen of Python. Setting column_names=None makes selecting all columns implicit, whereas setting column_names=lambda df: df.columns makes selecting all columns explicit. In addition, it allows the selection of arbitrary subsets of column names programmatically, without needing to hard-code those names.

On further thought, I can see how column_names=None actually follows the pattern established in other places in the library, so I think, for now, we can:

Use column_names=None to imply selection of all columns, and

Talk more about column_names: Callable in the issue tracker, deferring the implementation till later.
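For readers following the design discussion, here is one way the three input styles could be normalized into a list of labels. This is only a sketch: `resolve_column_names` is a hypothetical helper, not a function in pyjanitor, and the callable branch reflects the deferred proposal rather than merged behaviour.

```python
import pandas as pd


def resolve_column_names(df, column_names=None):
    """Turn None, a single label, an iterable, or a callable into a list of labels.

    Hypothetical helper for illustration only.
    """
    if column_names is None:
        # Implicit selection: None means every column.
        return list(df.columns)
    if callable(column_names):
        # Explicit, programmatic selection: the callable receives the dataframe.
        return list(column_names(df))
    if isinstance(column_names, (str, int)):
        # A single label is wrapped so callers always get a list.
        return [column_names]
    return list(column_names)


df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})
all_columns = resolve_column_names(df)
first_two = resolve_column_names(df, lambda frame: frame.columns[:2])
single = resolve_column_names(df, "a")
```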

What do you all think about jointly=True as the keyword for triggering whether to independently scale each column or to jointly scale all columns specified in column_names? @Zeroto521 if you're in agreement with the keyword argument, then I think, let's get that specified in this PR, then we can close out the PR!

I totally agree with using jointly.

As for whether column_names could accept a callable type, I understand now.
column_names=lambda df: df.columns is an implicit style and also a trick.

| Select columns | Using callable type | Using Iterable type | `df.columns` |
| --- | --- | --- | --- |
| Select the first three columns | `lambda df: df.columns[:3]` | `['a', 'b', 'c']` | `pd.Index(list('abcde'))` |
| Select str type columns | `lambda df: [i for i in df.columns if isinstance(i, str)]` | `['a', 'b', 'c']` | `pd.Index(['a', 'b', 'c', 1])` |
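The two selections in the comparison above can be checked directly in pandas. The sketch below builds empty dataframes with the stated column indexes (the variable names are illustrative) and uses the built-in `isinstance` for the type check.

```python
import pandas as pd

# Dataframe whose columns are 'a' through 'e'.
df_letters = pd.DataFrame(columns=pd.Index(list("abcde")))
select_first_three = lambda df: df.columns[:3]
first_three = list(select_first_three(df_letters))

# Dataframe with mixed label types: three strings and one integer.
df_mixed = pd.DataFrame(columns=pd.Index(["a", "b", "c", 1]))
select_str_columns = lambda df: [i for i in df.columns if isinstance(i, str)]
str_columns = select_str_columns(df_mixed)
```

Both callables produce the same labels as the hard-coded list `['a', 'b', 'c']`, without naming the columns in advance.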

As you said, we can put it aside for now.
Once the column_names parameter of min_max_scale can accept a callable, other functions will need to do the same.

Further discussion has moved to #1115.

Zeroto521 · 2022-06-02T08:14:05Z

Well, the GitHub Actions style check is failing, but the pre-commit.ci app is fine.
I think it's time to cut the duplicate style checks (#1113).

ericmjl · 2022-06-02T11:36:00Z

@Zeroto521 looks like we're good. The action didn't fail b/c of your code; it failed b/c of a system error.

ericmjl

@Zeroto521 I'm going to pre-approve the PR. Regarding whether to use jointly as the kwarg name, please take a decision and we'll run with it.

Zeroto521 · 2022-06-13T13:22:14Z

janitor/functions/min_max_scale.py

+
+    Changed in version 0.24.0: Deleted "old_min", "old_max", "new_min", and "new_max"
+    options.
+    Changed in version 0.24.0: Added "feature_range", and "jointly" options.


If this renders well, then we can merge this PR.

Zeroto521 added 7 commits June 2, 2022 10:11

[EHN] Add entire_data for min_max_scale

dbc66f2

to transform each column

Update the description of function

572673a

highlight the keywords

ffe85bb

Update examples

5ea9511

Rename function

bfebf21

Update test suitcases

6c8d242

Ignore darglint error

ec1d956

Update test results

45bff67

Zeroto521 added 3 commits June 2, 2022 11:19

correct variable name

efd2439

Miss data

7097b25

Update example result

3052b9a

ericmjl approved these changes Jun 13, 2022

Zeroto521 and others added 4 commits June 13, 2022 20:39

entire_data -> jointly

06a8405

Update description

3e3ee53

[pre-commit.ci] auto fixes from pre-commit.com hooks

b9d47f6

for more information, see https://pre-commit.ci

Add changelog section

db2a4f3

Zeroto521 changed the title ~~[EHN] Add entire_data option for min_max_scale~~ [EHN] Add jointly option for min_max_scale Jun 13, 2022

Zeroto521 added 3 commits June 13, 2022 21:19

Update CHANGELOG.md

2e18941

Merge branch 'min-max-scale-entire-data-option' of https://github.com…

174e7c7

…/Zeroto521/pyjanitor into min-max-scale-entire-data-option

lint codes

0247ec9

lint codes

eff218b

Zeroto521 mentioned this pull request Jun 13, 2022

[EHN] let column_name or column_names support callback type #1115

Open

ericmjl merged commit 63c075e into pyjanitor-devs:dev Jun 14, 2022

This was referenced Jun 15, 2022

[DOC] Fix min_max_scale docstrings rendering #1123

Merged

[DOC] Adding minimal working examples to docstrings; a checklist #972

Closed
