Skip to content

Adding a threshold to DropColumnIfNull#1149

Merged
jeromedockes merged 81 commits intoskrub-data:mainfrom
rcap107:drop_null_columns
Nov 25, 2024
Merged

Adding a threshold to DropColumnIfNull#1149
jeromedockes merged 81 commits intoskrub-data:mainfrom
rcap107:drop_null_columns

Conversation

@rcap107
Copy link
Member

@rcap107 rcap107 commented Nov 21, 2024

Quick PR to address #1147

Notes:

  • I am using None to specify "Do not drop anything", rather than 0.0.
  • Do we want to change the name of the object? I don't think it's necessary.

(I also added fixed my account in the changelog)

Copy link
Member

@jeromedockes jeromedockes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for starting this @rcap107 !


def __init__(self, threshold=1.0):
if threshold is not None:
assert 0.0 <= threshold <= 1.0, "Invalid value for the threshold."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we show the value of the threshold in the error message? also use a valueerror rather than an assertionerror. assertions would be used to stop execution if a bug in skrub code has led to a situation that shouldn't be possible. valueerror is how we communicate to the user that the value they provided is inappropriate

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in other words assertions are more of a debugging tool, not a way to validate user input. for example if the script is run with python -O or PYTHONOPTIMIZE=1 they are not executed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted, thanks for pointing that out. I changed it to a ValueError.

@jeromedockes jeromedockes added this to the 0.3.2 milestone Nov 21, 2024
Copy link
Member

@jeromedockes jeromedockes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @rcap107 !! looks like we're close :)

CHANGES.rst Outdated

* Added a `DropColumnIfNull` transformer that drops columns that contain only null
values. :pr:`1115` by :user: `Riccardo Cappuzzo <riccardocappuzzo>`
* Added a `DropColumnIfNull` transformer that drops columns based on how many
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually DropColumnIfNull has not been added to skrub's public API (and I'm not sure if it should before we decide what to do with selectors) so here we should document the added parameter to TableVectorizer rather than DroopColumnIfNull

@@ -8,8 +8,22 @@


class DropColumnIfNull(SingleColumnTransformer):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we change the name to something like NullProportionThreshold ? 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe DropColumnWithNullThreshold? DropIfNullAboveThreshold?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe!
DropIfTooManyNulls?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for DropIfTooManyNulls

if self.threshold is not None:
if (
not (
isinstance(self.threshold, float) or isinstance(self.threshold, int)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use

import numbers
isinstance(self.threshold, numbers.Number)

instead.

it will work with numpy numeric types whereas

>>> isinstance(np.float32(0.5), float)
False

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I knew that check didn't look right, but I wasn't sure how to write it properly

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're welcome! also next time you want a disjunction of isinstance checks you can put the types in a tuple

>>> isinstance(2, (str, int))
True
>>> isinstance(2.5, (str, int))
False

Copy link
Member

@jeromedockes jeromedockes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @rcap107 !!
apart from maybe finalizing the choice of the name (of the transformer class and more importantly of the TableVectorizer parameter), this LGTM!


drop_null_columns : bool, default=True
If set to `True`, columns that contain only null values are dropped.
null_threshold : float or None, default=1.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Vincent-Maladiere @GaelVaroquaux any opinions on this parameter's name?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop_null_threshold maybe?

@jeromedockes
Copy link
Member

glad that we will be able to include it in the next release. @rcap107 could you either mark it as "ready for review" or mention what you still plan to do before removing the "draft" status? thanks

Co-authored-by: Jérôme Dockès <jerome@dockes.org>
@rcap107
Copy link
Member Author

rcap107 commented Nov 22, 2024

Looks good to me! I have nothing else to add now.

@rcap107 rcap107 marked this pull request as ready for review November 22, 2024 13:19
Copy link
Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -8,8 +8,22 @@


class DropColumnIfNull(SingleColumnTransformer):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for DropIfTooManyNulls


drop_null_columns : bool, default=True
If set to `True`, columns that contain only null values are dropped.
null_threshold : float or None, default=1.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop_null_threshold maybe?

@Vincent-Maladiere
Copy link
Member

That is extremely cool :) let's merge this

@jeromedockes jeromedockes merged commit 1967d51 into skrub-data:main Nov 25, 2024
@rcap107 rcap107 deleted the drop_null_columns branch November 25, 2024 15:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants