Adding a threshold to DropColumnIfNull #1149
Co-authored-by: Théo Jolivet <57430673+TheooJ@users.noreply.github.com>
…into drop_null_columns
…into drop_null_columns
…efault" This reverts commit 98b6c10.
jeromedockes
left a comment
thanks for starting this @rcap107 !
skrub/_drop_column_if_null.py
Outdated
def __init__(self, threshold=1.0):
    if threshold is not None:
        assert 0.0 <= threshold <= 1.0, "Invalid value for the threshold."
can we show the value of the threshold in the error message? also use a ValueError rather than an AssertionError. assertions are meant to stop execution if a bug in skrub code has led to a situation that shouldn't be possible. ValueError is how we communicate to the user that the value they provided is inappropriate
in other words assertions are more of a debugging tool, not a way to validate user input. for example if the script is run with python -O or PYTHONOPTIMIZE=1 they are not executed
Noted, thanks for pointing that out. I changed it to a ValueError.
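For reference, a minimal sketch of the kind of check discussed in this thread: raise a ValueError that includes the offending value, instead of relying on an assert. The helper name `check_threshold` is hypothetical and is not taken from the skrub code base; the PR's actual implementation may differ.

```python
def check_threshold(threshold=1.0):
    """Validate a user-supplied threshold (hypothetical helper, not skrub API).

    None is passed through unchanged; any other value must lie in [0.0, 1.0].
    """
    if threshold is None:
        return None
    if not 0.0 <= threshold <= 1.0:
        # ValueError is raised even under `python -O`, unlike an assert,
        # and the message shows the value the user actually passed.
        raise ValueError(
            f"Threshold must be between 0.0 and 1.0, got {threshold!r}."
        )
    return threshold
```

Calling `check_threshold(1.5)` raises a ValueError whose message contains `1.5`, which is what makes the error actionable for the user.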
jeromedockes
left a comment
thanks @rcap107 !! looks like we're close :)
CHANGES.rst
Outdated
* Added a `DropColumnIfNull` transformer that drops columns that contain only null
  values. :pr:`1115` by :user: `Riccardo Cappuzzo <riccardocappuzzo>`
* Added a `DropColumnIfNull` transformer that drops columns based on how many
actually DropColumnIfNull has not been added to skrub's public API (and I'm not sure if it should be before we decide what to do with selectors) so here we should document the added parameter to TableVectorizer rather than DropColumnIfNull
skrub/_drop_column_if_null.py
Outdated
@@ -8,8 +8,22 @@
class DropColumnIfNull(SingleColumnTransformer):
should we change the name to something like NullProportionThreshold ? 🤔
Maybe DropColumnWithNullThreshold? DropIfNullAboveThreshold?
maybe!
DropIfTooManyNulls?
+1 for DropIfTooManyNulls
skrub/_drop_column_if_null.py
Outdated
if self.threshold is not None:
    if (
        not (
            isinstance(self.threshold, float) or isinstance(self.threshold, int)
you can use
import numbers
isinstance(self.threshold, numbers.Number)
instead.
it will work with numpy numeric types whereas
>>> isinstance(np.float32(0.5), float)
False
Thanks, I knew that check didn't look right, but I wasn't sure how to write it properly
you're welcome! also next time you want a disjunction of isinstance checks you can put the types in a tuple
>>> isinstance(2, (str, int))
True
>>> isinstance(2.5, (str, int))
False
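A small sketch of the type check suggested in this thread, using the abstract base classes from the standard `numbers` module. Two choices here are assumptions of this example and not part of the review: it uses `numbers.Real` rather than `numbers.Number` so that complex values (which cannot be compared with `<=`) are rejected, and it excludes `bool` explicitly because `bool` is a subclass of `int`. The function name `is_real_number` is hypothetical.

```python
import numbers
from fractions import Fraction


def is_real_number(value):
    """Return True for real-valued numbers, False for anything else.

    Hypothetical helper for illustration. numbers.Real matches int, float
    and Fraction, and also numeric types that register themselves with the
    numbers ABCs (numpy's scalar types do this).
    """
    # bool is a subclass of int, so it would pass the check unless excluded.
    if isinstance(value, bool):
        return False
    return isinstance(value, numbers.Real)


# A single isinstance call with a tuple replaces a chain of `or` checks.
accepts_str_or_int = isinstance(2, (str, int))
```

With this helper, `is_real_number(0.5)` and `is_real_number(Fraction(1, 2))` are accepted, while `is_real_number("0.5")` and `is_real_number(1j)` are rejected.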
jeromedockes
left a comment
Thanks again @rcap107 !!
apart from maybe finalizing the choice of the name (of the transformer class and more importantly of the TableVectorizer parameter), this LGTM!
skrub/_table_vectorizer.py
Outdated
drop_null_columns : bool, default=True
    If set to `True`, columns that contain only null values are dropped.
null_threshold : float or None, default=1.0
@Vincent-Maladiere @GaelVaroquaux any opinions on this parameter's name?
drop_null_threshold maybe?
glad that we will be able to include it in the next release. @rcap107 could you either mark it as "ready for review" or mention what you still plan to do before removing the "draft" status? thanks
Co-authored-by: Jérôme Dockès <jerome@dockes.org>
Looks good to me! I have nothing else to add now.
That is extremely cool :) let's merge this
Quick PR to address #1147
Notes:
None to specify "Do not drop anything", rather than 0.0. (I also fixed my account in the changelog)
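To illustrate the note about None, here is a rough sketch of a null-fraction threshold on a plain Python list. Everything in it is an assumption made for illustration: the function names `null_fraction` and `should_drop` are hypothetical, and the comparison rule (drop when the fraction of nulls is at least the threshold) is not taken from the PR and may not match what skrub does. Under that assumed rule, a threshold of 1.0 drops only all-null columns, 0.0 would drop every column, and None is the only way to say "never drop".

```python
import math


def null_fraction(values):
    """Fraction of entries that are None or NaN; 0.0 for an empty column."""
    if not values:
        return 0.0
    n_null = sum(
        1
        for v in values
        if v is None or (isinstance(v, float) and math.isnan(v))
    )
    return n_null / len(values)


def should_drop(values, threshold=1.0):
    """Decide whether a column should be dropped (assumed rule, see above)."""
    if threshold is None:
        # None disables dropping entirely.
        return False
    # Assumption: drop when the null fraction reaches the threshold.
    return null_fraction(values) >= threshold
```

For example, `should_drop([1, None], 0.5)` is true under this rule, while `should_drop([None, None], None)` is false because None turns the check off.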