Added DropNullColumn transformer to remove columns that contain only nulls #1115
TheooJ
left a comment
Hi @rcap107! I made a first pass and have a few comments:
- Personally, I like the name `DropNullColumn`; I think it's clear what it does!
- I would rename the file `_drop_null.py`.
- Make sure you run `pre-commit run --all-files` before pushing; it seems to be what's breaking the CI for you here.
- I think `is_all_null` could be placed in the `DropNullColumn` file if it's only used there for now, but I could also see it being in `_common.py`.
        return make_dataframe_like(df, cols)


    @dispatch
    def drop(obj, col):
I don’t know if drop is necessary, you could directly use skrub selectors:
df = s.select(df, ~s.cols(col))
        similar functionality to what is offered by scikit-learn's
        :class:`~sklearn.compose.ColumnTransformer`.

        drop_null_columns : bool, default=False
Do we want it to be True by default ?
That should be discussed with others I think
I vote for true by default -- there's nothing we can learn from a completely empty column.
if it is False by default, I think it should be set to True in the tabular_learner
        main_table_dropped = ns.drop(main_table_dropped, "value_nan")

        # Don't drop null columns
        tv = TableVectorizer(drop_null_columns=False)
This test needs to go in the TV test file IMO
Hi @TheooJ, thanks a lot for the comments! I'll address them and update the PR 👍
Co-authored-by: Théo Jolivet <57430673+TheooJ@users.noreply.github.com>
…into drop_null_columns
…into drop_null_columns
        # assert_array_equal(
        #     sbd.to_numpy(sbd.col(drop_null_table, "value_almost_null")),
        #     np.array(["almost", None, None]),
Not sure how to write this check so that it works with either pandas or polars
You could use df_module as a fixture in the test by adding it to the arguments, then comparing series instead of numpy arrays:
    df_module.assert_column_equal(
        sbd.col(drop_null_table, "value_almost_null"),
        df_module.make_column("value_almost_null", ["almost", None, None]),
    )
The test would look like:
    def test_single_column(drop_null_table, df_module):
        """Check that null columns are dropped and non-null columns are kept."""
        dn = DropNullColumn()
        assert dn.fit_transform(drop_null_table["value_nan"]) == []
        assert dn.fit_transform(drop_null_table["value_null"]) == []
        df_module.assert_column_equal(
            sbd.col(drop_null_table, "idx"), df_module.make_column("idx", [1, 2, 3])
        )
        df_module.assert_column_equal(
            sbd.col(drop_null_table, "value_almost_nan"),
            df_module.make_column("value_almost_nan", [2.5, np.nan, np.nan]),
        )
        df_module.assert_column_equal(
            sbd.col(drop_null_table, "value_almost_null"),
            df_module.make_column("value_almost_null", ["almost", None, None]),
        )
This also sidesteps the fact that null values are not treated the same way across pandas versions.
the failure in the min-deps environment is not related to this pr; the fix is in #1122
        self._preprocessors = [CheckInputDataFrame()]
        if self.drop_null_columns:
            add_step(self._preprocessors, DropNullColumn(), cols, allow_reject=True)
- we may want to insert it after `CleanNullStrings`? so that if the column becomes full of nulls after converting `"N/A"` to `null` it will be dropped.
- also it's not important but your transformer never raises a `RejectColumn` exception so `allow_reject` has no effect; you don't need it here and can leave the default
I added it after CleanNullStrings, but I think I did it in an ugly way, maybe it can be fixed
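To illustrate why the ordering matters, here is a minimal plain-Python sketch. It is not skrub's code: `NULL_STRINGS`, `clean_null_strings` and this `is_all_null` are hypothetical stand-ins operating on lists, and the set of placeholder strings is an assumption made for the example.

```python
# Hypothetical placeholder strings; the real CleanNullStrings list may differ.
NULL_STRINGS = {"N/A", "NA", ""}


def clean_null_strings(values):
    """Replace placeholder strings with None, leaving other values alone."""
    return [None if v in NULL_STRINGS else v for v in values]


def is_all_null(values):
    """Return True when every entry is None."""
    return all(v is None for v in values)


column = ["N/A", "N/A", "N/A"]

# Checking before cleaning misses the column: placeholders still look like data.
dropped_if_checked_first = is_all_null(column)

# Cleaning first turns the placeholders into nulls, so the check fires.
dropped_if_cleaned_first = is_all_null(clean_null_strings(column))
```

With the null check placed after the cleaning step, a column made only of placeholder strings is recognized as empty; placed before, it would be kept.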
Never mind, I'll just raise the warning by default. I was thinking that maybe it would be possible to only raise a warning in verbose mode (if it's even a thing), but in the end I just went with a different solution.
We also need to decide whether it's "warn and drop" or "warn and keep", and explain the behavior in the documentation.
    @@ -778,11 +777,7 @@ def test_drop_null_column():

        # Raise exception if a null column is found
        with pytest.raises(
This test is still failing because the TableVectorizer is not raising the correct exception and I don't know how to make it do that.
RejectColumn is a way to signal to the TableVectorizer "I'm not the right transformer for this column, don't apply me here".
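A rough sketch of that "reject" protocol, in plain Python. Everything here is a hypothetical stand-in written for illustration: this `RejectColumn` class only mimics the idea, and `ToFloat` and `apply_or_passthrough` are invented names, not skrub's API.

```python
class RejectColumn(Exception):
    """Stand-in exception: 'this transformer does not apply to this column'."""


class ToFloat:
    """Toy transformer that only accepts columns it can convert to floats."""

    def fit_transform(self, column):
        try:
            return [float(v) for v in column]
        except (TypeError, ValueError) as exc:
            # Signal "not for me" instead of failing the whole pipeline.
            raise RejectColumn("column is not numeric") from exc


def apply_or_passthrough(transformer, column):
    """Apply the transformer, or return the column unchanged if rejected."""
    try:
        return transformer.fit_transform(column)
    except RejectColumn:
        return column


converted = apply_or_passthrough(ToFloat(), ["1", "2.5"])
untouched = apply_or_passthrough(ToFloat(), ["a", "b"])
```

Under this reading, a rejection is not an error reported to the user; it only tells the caller to leave the column alone, which is why it is a poor fit for reporting null columns.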
I have updated the code to have "warn and keep" as the default behavior, I think it's the version that makes the most sense. At the moment the only problem I have is that I don't know how to raise the proper exception from TableVectorizer. In DropNull I am raising RejectColumn, but then I don't know how to propagate it correctly.
I find it a bit weird to have a DropColumnIfNull that does not drop the column and just raises an exception. maybe it should be named something like CheckNulls?
I'm not sure I understand -- in any case the TableVectorizer chooses the estimators during fit so the schema of the output can change every week in this scenario. eg if a column has one more unique value than the previous week it can change from a one-hot encoding to a gap encoding. or for example if you had
so if you want consistent output schema (number of columns, names and types) across training the TableVectorizer is not what you want anyway
Yes, I agree this sounds weird. I recognize that dropping NaN columns, which are usually useless, looks good. What about having the transformer drop by default, but allowing the user to pass other options as arguments in the
Right, I was pointing out that even outside of the
It's true though that this issue broadly applies to the
Following today's discussion with @Vincent-Maladiere and @jeromedockes, I reverted the changes back to the old version (with the simple flag)
jeromedockes
left a comment
LGTM! thanks again @rcap107
(the codecov report is bogus)
I'll let @Vincent-Maladiere do a final review & merge
Vincent-Maladiere
left a comment
In our IRL discussion, haven't we agreed on a threshold for the null ratio in the column (which is 1.0 by default)?
Apart from that, and to keep in mind for later, the items we discussed were:
- The ability to freeze TableVectorizer column-to-transformer mapping. It would help to obtain consistent results for retraining in an automated environment, and having more sensible errors to debug.
- Decoupling the check/cleaning part of the TableVectorizer (which comes before the vectorizing part) so that it can be used as a standalone object wherever.
we definitely did; I had understood it would be tackled in a separate PR
+1 for points 1. and 2. -- let's open a separate issue for 1. and there is #925 for 2.
I also understood that the threshold would be added in a separate issue
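For reference, the null-ratio threshold mentioned in this thread (deferred to a separate issue, so not part of this PR) could look roughly like the plain-Python sketch below. The function names, the `threshold` parameter and the handling of empty columns are all assumptions made for illustration.

```python
import math


def null_fraction(values):
    """Fraction of entries that are None or a float NaN (0.0 if empty)."""
    if not values:
        return 0.0
    n_null = sum(
        v is None or (isinstance(v, float) and math.isnan(v)) for v in values
    )
    return n_null / len(values)


def should_drop(values, threshold=1.0):
    """Drop once the null fraction reaches `threshold`; 1.0 means all null."""
    return null_fraction(values) >= threshold


all_null = [None, float("nan"), None]
mostly_null = ["almost", None, None]
```

With the default of 1.0 this reproduces the all-null rule: `all_null` is dropped and `mostly_null` is kept, while a lower threshold such as 0.5 would also drop `mostly_null`.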
Vincent-Maladiere
left a comment
LGTM then, thanks @rcap107 :)
yay 🎉 !! thanks @rcap107 !
🎉
fixes #1110
`DropNullColumn` (provisional name) takes a column as input and drops it if all of its values are nulls or NaNs. `TableVectorizer` was also updated with a `drop_null_columns` flag set to `False` by default; if the flag is set to `True`, `DropNullColumn` is added as a processing step for all columns.
I've also added `drop` and `is_all_null` to `_common.py`, though I am not sure if they should go there. Maybe `is_all_null` can stay in the `DropNullColumn` file.
The test I wrote passes, but I'm not sure if it's good enough.
The documentation is still missing.
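As a rough illustration of the rule described above, here is a plain-Python sketch that works on a dict of lists rather than a dataframe. It is not the PR's implementation: this `is_all_null` and `drop_null_columns` are hypothetical stand-ins that only borrow the names used in the description.

```python
import math


def is_all_null(values):
    """Return True when every entry is None or a float NaN."""
    return all(
        v is None or (isinstance(v, float) and math.isnan(v)) for v in values
    )


def drop_null_columns(table):
    """Keep only the columns of `table` (a dict of lists) that hold some data."""
    return {name: col for name, col in table.items() if not is_all_null(col)}


table = {
    "idx": [1, 2, 3],
    "value_nan": [float("nan")] * 3,
    "value_null": [None, None, None],
    "value_almost_null": ["almost", None, None],
}
kept = drop_null_columns(table)
```

Only the columns that contain at least one non-null value survive, so `value_almost_null` is kept even though most of its entries are missing.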