ENH Add support for np.nan values in SplineTransformer #28043

StefanieSenger · 2024-01-02T12:05:59Z

Reference Issues/PRs

Closes #26793

What does this implement/fix? Explain your changes.

Adds support for np.nan values in SplineTransformer.

adds param handle_missing : {'error', 'constant'} to init, where error preserves the previous behaviour and constant handles nan values by setting their spline values to all 0s
adds new test (but needs to be extended)

Yet to solve:

I believe in _get_base_knot_positions I have to prepare _weighted_percentile for excluding nan values similarity to how np.nanpercentile excludes nan values for the calculation of the base knots. I tried, but it was quite tricky. Edit: Just found that np.nanpercentile will have a sample_weight option soon: PR 24254 in numpy
Should an error also be raised in case the SplineTransformer was instantiated with (handle_missing="error"), then fitted without missing values and the X then contains missing values in transform?

github-actions · 2024-01-02T12:07:14Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

_{Generated for commit: c1278a6. Link to the linter CI: here}

ogrisel

The PR looks very good but it needs to be merged with main (there are conflicts in the changelog).

Also, I think the get_output_feature_names() method needs to be updated. The tests should be expanded accordingly, maybe to also include a test with .set_output(transform="pandas") (this is how I found out that there was a problem with the output feature names).

ogrisel · 2024-02-29T09:59:47Z

So, should sparse_output=True and handle_missing="indicator" be prevented from being used together (and show an explicit error message) ,

I think we should add support for using those two options together.

ogrisel

Here is a more in depth pass of review. There is indeed a fundamental problem with the current code: the missingness indicators from the training set (when calling .fit or .fit_transform) should not be stored as an estimator attribute and reapplied to the test set (when calling .transform). Instead the missingness pattern from the test set should be extracted.

See more details below:

sklearn/preprocessing/_polynomial.py

ogrisel · 2024-03-01T10:22:41Z

sklearn/preprocessing/_polynomial.py

        if self.include_bias:
-            return XBS
+            return self._concatenate_indicator(XBS)


The missingness indicators computed from the X passed to .transform (which can be a test set) should be passed as argument to _concatenate_indicator instead of reusing the mask extracted from the training set.

sklearn/preprocessing/tests/test_polynomial.py

StefanieSenger

Hey @ogrisel, thanks for reviewing and your help.
I went through your comments and could resolve most of the issues.

I've named the new option handle_missing="constant", but that's just an idea. I found that indicator doesn't fit so well anymore, if we don't add an indicator column to X. Though with constant as well as with zeros I feel that it's not quite clear from the naming, where in the process the nans become something else (before or after calculating the splines). Maybe we can find a name, that conveys that info.

There are quite a few things, I am a bit confused about:
Generally, I don't know if we want SplineTransformer to change or keep behaviour if nan values are present.

If we want it to keep behaviour, instead of having this test data for comparing equality:

    X_nan = np.array([[1, 1], [2, 2], [3, 3], [np.nan, 4], [4, 4]])
    X = np.array([[1, 1], [2, 2], [3, 3], [4, 4]])

it should maybe rather be

    X_nan = np.array([[1, 1], [2, 2], [3, 3], [np.nan, 4], [4, 4]])
    X = np.array([[1, 1], [2, 2], [3, 3], [99, 4], [4, 4]])

and in this case, the current implementation is wrong. Maybe you can shed a light on this so that I know how to go on.

I will check the issue with the feature names next.

sklearn/preprocessing/_polynomial.py

sklearn/preprocessing/tests/test_polynomial.py

StefanieSenger · 2024-03-15T13:41:50Z

I was trying to find about the problem with the feature names, that you have mentioned here, @ogrisel, but I cannot recreate it. Maybe it's been resolved when I worked on the other issues?

This is what I tried (using the code from the existing feature_name_out test):

def test_spline_transformer_feature_names_with_nans():
    """Test that SplineTransformer generates correct feature names if nan values are present."""
    X_nan = np.array([[1, 1], [2, 2], [3, 3], [np.nan, 4], [4, 4]])
    splt = SplineTransformer(
        degree=3,
        n_knots=3,
        handle_missing="constant",
        include_bias=True).fit(X_nan)
    feature_names = splt.get_feature_names_out()
    assert_array_equal(
        feature_names,
        [
            "x0_sp_0",
            "x0_sp_1",
            "x0_sp_2",
            "x0_sp_3",
            "x0_sp_4",
            "x1_sp_0",
            "x1_sp_1",
            "x1_sp_2",
            "x1_sp_3",
            "x1_sp_4",
        ],
    )

    splt = SplineTransformer(
    degree=3,
    n_knots=3,
    handle_missing="constant",
    include_bias=False).fit(X_nan)
    feature_names = splt.get_feature_names_out(["a", "b"])
    assert_array_equal(
        feature_names,
        [
            "a_sp_0",
            "a_sp_1",
            "a_sp_2",
            "a_sp_3",
            "b_sp_0",
            "b_sp_1",
            "b_sp_2",
            "b_sp_3",
        ],
    )

    splt.set_output(transform="pandas")
    X_transformed = splt.transform(X_nan)
    feature_names = splt.get_feature_names_out(["a", "b"])
    assert_array_equal(
        feature_names,
        [
            "a_sp_0",
            "a_sp_1",
            "a_sp_2",
            "a_sp_3",
            "b_sp_0",
            "b_sp_1",
            "b_sp_2",
            "b_sp_3",
        ],
    )

Everything behaves as it should, I believe. But also maybe I didn't understand what you exactly ran into.

ogrisel · 2024-03-15T17:42:34Z

The get_feature_names problem only happens when appending extra output features for the missing indicators (they would then need to be named in that case).

EDIT: I will try to answer your other questions/comments early enough next week.

ogrisel · 2024-03-15T17:45:32Z

About handle_missing="constant", I really prefer handle_missing="zero" or handle_missing="zeros" or handle_missing="ignore" to either convey that missing values are encoded as zero spline feature values or are encoded in a value such that a simple downstream linear model would basically "ignore" them.

StefanieSenger · 2024-04-18T14:05:42Z

Hey @ogrisel, can you give me some feedback?

My current understanding is that if we introduce new 0-values in X_transformed (due to nan values in X), then we also expect different stats for the transformer compared to when no nan values are present.

This would mean, that we expect (and test for)

a different output format in case of nans (I have already written the tests like this)
that the B-splines won't sum up to one * n_features (that I need to define for the test)

sklearn/preprocessing/_polynomial.py

ogrisel

Here is another pass of feedback with some suggestion to extend testing for more cases, but overall it looks good.

I still need to investigate more to understand the interactions between the scipy version and the sparse output format and the code coverage warning.

doc/whats_new/v1.5.rst

sklearn/preprocessing/_polynomial.py

ogrisel · 2024-05-07T15:32:27Z

sklearn/preprocessing/tests/test_polynomial.py

+
+    # prepare mask for nan values
+    mask = _get_mask(X_nan, np.nan)
+    extended_mask = np.repeat(mask, spline.bsplines_[0].c.shape[1], axis=1)


I would rename mask and extended to nan_mask and encoded_nan_mask respectively to be more explicit.

Or alternatively missing_output_mask and missing_output_mask. In which case the code in transform could be updated accordingly.

I have changed into nan_mask and encoded_nan_mask. I didn't get your second paragraph, though. Were you saying I should replace extended_nan_indicator as well in transform()?

ogrisel · 2024-05-07T15:57:24Z

sklearn/preprocessing/tests/test_polynomial.py

+    assert_allclose_dense_sparse(
+        X_transformed_same_shape, X_transformed_different_shapes
+    )
+


We could further extend this test to cover the following:

Suggested change

# Check that nan values are always encoded as zeros, even in columns where

# no missing values were observed at training time.

all_missing_row_encoded = spline.transform([[np.nan, np.nan]])

if sparse_output:

all_missing_row_encoded = all_missing_row_encoded.toarray()

assert_allclose(all_missing_row_encoded, 0)

Do we want this instead of raising for values outside range with extrapolation == "error"?

That's a good question. I would say so: meaning, I would expect 0 encoded nans even in columns that have never seen a nan at fit time for the sake of consistency when running cross-validation with rare nan values.

I'm afraid that I don't understand your intention here. Sorry for this.

If we test if there can be zero-encoded values after transform, even if SplineTransformer hasn't seen any nans in fit, then we could test it with a less critical input like fitting without any nans and transforming an X with only one nan value. And input with a whole nan column (see my comment below for why not all nans) could be tested below in a separate test. Would that make sense to pull those test cases apart this way?

then we could test it with a less critical input like fitting without any nans and transforming an X with only one nan value

I agree, this is compatible with what I suggested in:

https://github.com/scikit-learn/scikit-learn/pull/28043/files#r1592724271

in this snippet, spline has been fit on one column with some nan value and another column without any nan values. In both cases the resulting encoding should be zero.

The all-nan case has been moved in your recent commits to a dedicated test and I find it indeed helps separating concerns and make the tests easier to follow.

Okay, so do I understand correctly, that you don't want me to change, add or do anything here? This is resolved?

sklearn/preprocessing/_polynomial.py

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

StefanieSenger

Thanks @ogrisel for reviewing and your suggestions. I have applied those changes or commented. Do you want to have another look?

sklearn/preprocessing/_polynomial.py

StefanieSenger · 2024-05-08T09:41:31Z

sklearn/preprocessing/tests/test_polynomial.py

+    assert_allclose_dense_sparse(
+        X_transformed_same_shape, X_transformed_different_shapes
+    )
+


Do we want this instead of raising for values outside range with extrapolation == "error"?

StefanieSenger · 2024-05-08T10:14:48Z

sklearn/preprocessing/tests/test_polynomial.py

+
+    # prepare mask for nan values
+    mask = _get_mask(X_nan, np.nan)
+    extended_mask = np.repeat(mask, spline.bsplines_[0].c.shape[1], axis=1)


I have changed into nan_mask and encoded_nan_mask. I didn't get your second paragraph, though. Were you saying I should replace extended_nan_indicator as well in transform()?

sklearn/preprocessing/_polynomial.py

ogrisel

Another pass of feedback below:

sklearn/preprocessing/_polynomial.py

ogrisel · 2024-05-10T13:36:08Z

sklearn/preprocessing/_polynomial.py

            else:  # extrapolation in ("constant", "linear")
                xmin, xmax = spl.t[degree], spl.t[-degree - 1]
                # spline values at boundaries
                f_min, f_max = spl(xmin), spl(xmax)
+                # values outside of the feature space during `fit` and nan values get
+                # masked out:
                mask = (xmin <= X[:, i]) & (X[:, i] <= xmax)


To make the code more straightforward to follow, I think mask could be renamed to something more explicit such as inside_range_mask and mask_inv to outside_range_mask.

Yes, that's a good initiative.

sklearn/preprocessing/_polynomial.py

ogrisel · 2024-05-10T14:14:20Z

sklearn/preprocessing/tests/test_polynomial.py

+    assert_allclose_dense_sparse(
+        X_transformed_same_shape, X_transformed_different_shapes
+    )
+


That's a good question. I would say so: meaning, I would expect 0 encoded nans even in columns that have never seen a nan at fit time for the sake of consistency when running cross-validation with rare nan values.

sklearn/preprocessing/tests/test_polynomial.py

ogrisel · 2024-05-10T14:16:47Z

sklearn/preprocessing/tests/test_polynomial.py

+    )
+
+    # check that if X has a feature of all nans SplineTransformer works as usual
+    spline.transform(X_nan_full_column)


The result of this call should be checked to see that the first column is all zero valued.

EDIT: I see that spline.transform(X_nan_full_column) is called again and its output checked a few lines below.

To avoid confusion, I think the all-nan column case should be checked in a separate test function that would call SplineTransformer on X_allmissing = np.asarray([[np.nan], [np.nan], [np.nan]]) for instance.

Oh yes, I will make in into a separate test.

I'll be keeping X_nan_full_column = np.array([[np.nan, np.nan], [np.nan, 1]]) for now.

The reason: X_allmissing = np.asarray([[np.nan], [np.nan], [np.nan]]) is a very special case, because in this case np.nanmin(x) is also nan and BSpline.design_matrix() raises. There is no way to avoid this, I believe, because we cannot set the existing nan values in x to any valid value that is within the feature space.

Reproducable:

X = np.array([[1, 1], [2, 2], [3, 3], [4, 5], [4, 4]]) X_allmissing = np.array([[np.nan, np.nan], [np.nan, np.nan]]) spline = SplineTransformer( degree=2, n_knots=3, handle_missing="zeros", ) spline.fit(X) all_missing_column_encoded = spline.transform(X_allmissing)

So for now, this case fails with an error from scipy, but I think it's a quite a readable error message.

ogrisel · 2024-05-10T14:22:01Z

sklearn/preprocessing/tests/test_polynomial.py

+    assert (X_nan_transform[encoded_nan_mask] == 0).all()
+
+    # check that nan values are always encoded as zeros, even in columns where
+    # no missing values were observed at training time.


This comment does not seem to reflect what is being tested below: here X_nan_full_column has 2 columns, the fist one is all-nan and the second column has a mix of missing an non-missing values.

The comment seem to refer to the test case I proposed in https://github.com/scikit-learn/scikit-learn/pull/28043/files#r1592724271 which is different: check how nan values are encoded at test time in columns that have no missing value at training time (e.g. the second column of X_nan).

I can see how this is confusing. I will try to re-formulate this in the new test below (test_spline_transformer_handles_all_nans), as discussed above. The attempt is to pull those two test groups apart. Is it clearer and consistent like this?

(The history of this here is that I have taken the test you have proposed and modified it.)

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

StefanieSenger

I went through your review @ogrisel, thanks for the comments. I hope the tests are clearer separated now.

sklearn/preprocessing/_polynomial.py

StefanieSenger · 2024-05-13T09:17:08Z

sklearn/preprocessing/_polynomial.py

            else:  # extrapolation in ("constant", "linear")
                xmin, xmax = spl.t[degree], spl.t[-degree - 1]
                # spline values at boundaries
                f_min, f_max = spl(xmin), spl(xmax)
+                # values outside of the feature space during `fit` and nan values get
+                # masked out:
                mask = (xmin <= X[:, i]) & (X[:, i] <= xmax)


Yes, that's a good initiative.

sklearn/preprocessing/_polynomial.py

StefanieSenger · 2024-05-13T10:08:34Z

sklearn/preprocessing/tests/test_polynomial.py

+    )
+
+    # check that if X has a feature of all nans SplineTransformer works as usual
+    spline.transform(X_nan_full_column)


Oh yes, I will make in into a separate test.

I'll be keeping X_nan_full_column = np.array([[np.nan, np.nan], [np.nan, 1]]) for now.

The reason: X_allmissing = np.asarray([[np.nan], [np.nan], [np.nan]]) is a very special case, because in this case np.nanmin(x) is also nan and BSpline.design_matrix() raises. There is no way to avoid this, I believe, because we cannot set the existing nan values in x to any valid value that is within the feature space.

Reproducable:

X = np.array([[1, 1], [2, 2], [3, 3], [4, 5], [4, 4]]) X_allmissing = np.array([[np.nan, np.nan], [np.nan, np.nan]]) spline = SplineTransformer( degree=2, n_knots=3, handle_missing="zeros", ) spline.fit(X) all_missing_column_encoded = spline.transform(X_allmissing)

So for now, this case fails with an error from scipy, but I think it's a quite a readable error message.

StefanieSenger · 2024-05-13T10:20:02Z

sklearn/preprocessing/tests/test_polynomial.py

+    assert (X_nan_transform[encoded_nan_mask] == 0).all()
+
+    # check that nan values are always encoded as zeros, even in columns where
+    # no missing values were observed at training time.


I can see how this is confusing. I will try to re-formulate this in the new test below (test_spline_transformer_handles_all_nans), as discussed above. The attempt is to pull those two test groups apart. Is it clearer and consistent like this?

(The history of this here is that I have taken the test you have proposed and modified it.)

StefanieSenger · 2024-05-13T10:29:28Z

sklearn/preprocessing/tests/test_polynomial.py

+    assert_allclose_dense_sparse(
+        X_transformed_same_shape, X_transformed_different_shapes
+    )
+


I'm afraid that I don't understand your intention here. Sorry for this.

If we test if there can be zero-encoded values after transform, even if SplineTransformer hasn't seen any nans in fit, then we could test it with a less critical input like fitting without any nans and transforming an X with only one nan value. And input with a whole nan column (see my comment below for why not all nans) could be tested below in a separate test. Would that make sense to pull those test cases apart this way?

sklearn/preprocessing/_polynomial.py

ogrisel · 2024-05-15T15:39:04Z

sklearn/preprocessing/_polynomial.py

+                        # The column is all np.nan valued. Replace it by a constant
+                        # column with an arbitrary non-nan value inside: the minimum
+                        # value within the whole feature space:
+                        x[:] = np.nanmin(X)


We don't need to depend on the content of other input columns to encode the current all-nan column. Any constant value will be spline-encoded to zero, so using a constant 0 input is perfectly fine and much cheaper (it does not requires to rescan the full input array).

Suggested change

# The column is all np.nan valued. Replace it by a constant

# column with an arbitrary non-nan value inside: the minimum

# value within the whole feature space:

x[:] = np.nanmin(X)

# The column is all np.nan valued. Replace it by a constant

# column with an arbitrary non-nan value inside.

x[:] = 0

Note that change might require us to ensure that there are no nan-knots in the output of the np.percentile call in the fit method (I gave more details inline as inline comment in _get_base_knot_positions).

The comment for li. 768-769 also applies here.

sklearn/preprocessing/_polynomial.py

ogrisel · 2024-05-15T16:17:21Z

sklearn/preprocessing/_polynomial.py

@@ -750,8 +762,15 @@ def _get_base_knot_positions(X, n_knots=10, knots="uniform", sample_weight=None)
            )

            if sample_weight is None:
-                knots = np.percentile(X, percentiles, axis=0)
+                knots = np.nanpercentile(X, percentiles, axis=0)


It's possible that might need to replace np.nan by 0.0 to be able to encode missing values in all-nan columns without raising an error.

Since any constant column (all percentiles are the same) will be spline encoded to 0.0, it does not matter which value we use to represent the constant knots values derived from an all-nan training data column.

Note that I am not 100% sure whether this is needed or not.

The comment for li. 768-769 also applies here.

ogrisel · 2024-05-15T16:25:50Z

sklearn/preprocessing/_polynomial.py

@@ -765,8 +784,8 @@ def _get_base_knot_positions(X, n_knots=10, knots="uniform", sample_weight=None)
            # `else` is therefore safe.
            # Disregard observations with zero weight.
            mask = slice(None, None, 1) if sample_weight is None else sample_weight > 0
-            x_min = np.amin(X[mask], axis=0)
-            x_max = np.amax(X[mask], axis=0)


Similarly, if np.isnan(x_min) (which means we have an all-nan column), then replacing x_min and x_max by 0 might be helpful to treat this column as a regular constant column and always encode the output to 0.0, whatever the input.

I have tried this but actually, it is not necessary to replace nan values here, because np.amin(X[mask], axis=0) and np.amax(X[mask], axis=0) are calculated from the whole input data (X) as well. So as long as not the whole input data is all nans, then those two values will be non-nans and valid.

However, in my experiment on a separate branch pytest sklearn/preprocessing/tests/test_polynomial.py -k test_spline_transformer_handles_all_nans still fails for extrapolation="error" and sparse output, because obviously making x[:] = 0 in li. 1072 might be outside the feature range. This is in fact the reason why I had previously figured out that x[:] = np.nanmin(X) is a working solution and the only working solution I could find so far.

But maybe I did something wrong and this is not what you meant. Anyways, if x[:] = np.nanmin(X) is going to be made less computationally heavy, then it needs to work for the extrapolation="error" and sparse output case as well.

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

StefanieSenger

@ogrisel Thank you for going through it again. I appreciate it. :)

I did apply your change suggestions, but I am again struggling to apply this suggestion, because as it is it breaks the the extrapolation=error and sparse_output=True test if a whole column is nans and I am not sure if you were aware of this when you suggested it and also I tend to think that it's not possible any other way, is it?

Would you maybe have another look at this concern specifically?

sklearn/preprocessing/_polynomial.py

StefanieSenger · 2024-05-23T12:52:47Z

sklearn/preprocessing/_polynomial.py

@@ -765,8 +784,8 @@ def _get_base_knot_positions(X, n_knots=10, knots="uniform", sample_weight=None)
            # `else` is therefore safe.
            # Disregard observations with zero weight.
            mask = slice(None, None, 1) if sample_weight is None else sample_weight > 0
-            x_min = np.amin(X[mask], axis=0)
-            x_max = np.amax(X[mask], axis=0)


I have tried this but actually, it is not necessary to replace nan values here, because np.amin(X[mask], axis=0) and np.amax(X[mask], axis=0) are calculated from the whole input data (X) as well. So as long as not the whole input data is all nans, then those two values will be non-nans and valid.

However, in my experiment on a separate branch pytest sklearn/preprocessing/tests/test_polynomial.py -k test_spline_transformer_handles_all_nans still fails for extrapolation="error" and sparse output, because obviously making x[:] = 0 in li. 1072 might be outside the feature range. This is in fact the reason why I had previously figured out that x[:] = np.nanmin(X) is a working solution and the only working solution I could find so far.

But maybe I did something wrong and this is not what you meant. Anyways, if x[:] = np.nanmin(X) is going to be made less computationally heavy, then it needs to work for the extrapolation="error" and sparse output case as well.

StefanieSenger · 2024-05-23T12:53:22Z

sklearn/preprocessing/_polynomial.py

@@ -750,8 +762,15 @@ def _get_base_knot_positions(X, n_knots=10, knots="uniform", sample_weight=None)
            )

            if sample_weight is None:
-                knots = np.percentile(X, percentiles, axis=0)
+                knots = np.nanpercentile(X, percentiles, axis=0)


The comment for li. 768-769 also applies here.

StefanieSenger · 2024-05-23T12:53:31Z

sklearn/preprocessing/_polynomial.py

+                        # The column is all np.nan valued. Replace it by a constant
+                        # column with an arbitrary non-nan value inside: the minimum
+                        # value within the whole feature space:
+                        x[:] = np.nanmin(X)


The comment for li. 768-769 also applies here.

StefanieSenger · 2024-05-23T12:56:31Z

sklearn/preprocessing/tests/test_polynomial.py

+    assert_allclose_dense_sparse(
+        X_transformed_same_shape, X_transformed_different_shapes
+    )
+


Okay, so do I understand correctly, that you don't want me to change, add or do anything here? This is resolved?

StefanieSenger · 2024-05-26T11:25:22Z

Ah, I just stumbled over Imputation for missing values from the docs. I think we should include "allow_nan" in _more_tags() here, correct?

lorentzenchr · 2024-05-27T16:14:18Z

Can you ping me once the a reviewer approved?

StefanieSenger added 2 commits December 20, 2023 13:32

added indicator

4255470

added splines full of 0 for nan input in transform

812ae00

github-actions bot added the module:preprocessing label Jan 2, 2024

StefanieSenger and others added 2 commits January 8, 2024 11:39

Merge branch 'main' into nan_SplineTransformer

0ad4f5e

fixed extrapolation='error' related error

f0f7d5e

StefanieSenger changed the title ~~ENH Adds support for np.nan values in SplineTransformer~~ ENH Add support for np.nan values in SplineTransformer Jan 8, 2024

StefanieSenger added 4 commits January 9, 2024 14:14

fixed _get_base_knot_positions so spline parameters are build correctly

a38ee69

exclude knots='quantile' from camparison check

4c9a867

fixed issues from test_common

f7f7ca4

little things

4bbdd4a

ogrisel mentioned this pull request Feb 28, 2024

Handle np.nan / missing values in SplineTransformer #26793

Open

ogrisel reviewed Feb 28, 2024

View reviewed changes

ogrisel mentioned this pull request Feb 28, 2024

Implement SplineTransformer.inverse_transform #28551

Open

Merge branch 'main' into nan_SplineTransformer

092eca0

ogrisel reviewed Mar 1, 2024

View reviewed changes

StefanieSenger and others added 3 commits March 13, 2024 16:22

calculate indicator based on X passed into transform

5165c6b

Merge branch 'main' into nan_SplineTransformer

1a7d134

add and adjust tests after review

708ae53

StefanieSenger commented Mar 15, 2024

View reviewed changes

sklearn/preprocessing/_polynomial.py Show resolved Hide resolved

sklearn/preprocessing/_polynomial.py Outdated Show resolved Hide resolved

sklearn/preprocessing/tests/test_polynomial.py Show resolved Hide resolved

sklearn/preprocessing/tests/test_polynomial.py Outdated Show resolved Hide resolved

wording and typo

1dc3b3c

Merge branch 'main' into nan_SplineTransformer

dce1620

adrinjalali mentioned this pull request Apr 15, 2024

WIP KBinsDiscretizer supports NaN as input values #19928

Open

Merge branch 'main' into nan_SplineTransformer

63ea8aa

StefanieSenger marked this pull request as ready for review April 18, 2024 13:51

delete unused code

2c261c9

adrinjalali reviewed Apr 29, 2024

View reviewed changes

sklearn/preprocessing/_polynomial.py Show resolved Hide resolved

raise if nan and sample_weight, facilitate tests

e9d1c53

StefanieSenger commented Apr 30, 2024

View reviewed changes

sklearn/preprocessing/_polynomial.py Outdated Show resolved Hide resolved

calculate bsplines columnwise again

e78b12e

StefanieSenger commented May 2, 2024

View reviewed changes

sklearn/preprocessing/_polynomial.py Outdated Show resolved Hide resolved

ogrisel reviewed May 7, 2024

View reviewed changes

StefanieSenger and others added 4 commits May 8, 2024 11:42

Apply suggestions from code review

45b19f7

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

changes after review

1fe01ba

Merge branch 'main' into nan_SplineTransformer

9a0b7d0

update changelog

e25f3a8

StefanieSenger commented May 8, 2024

View reviewed changes

support edge case with whole columns of nans

ac276fa

ogrisel reviewed May 10, 2024

View reviewed changes

StefanieSenger and others added 4 commits May 13, 2024 09:26

Apply suggestions from code review

ffb1012

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

get nan_mask feature-wise

71fe537

Apply suggestions from code review

2344988

refactoring tests

25dce72

StefanieSenger commented May 13, 2024

View reviewed changes

StefanieSenger and others added 2 commits May 13, 2024 13:00

Merge branch 'main' into nan_SplineTransformer

2d77a27

re-put fix for older scipy versions outside of loop through features

a4129fc

ogrisel reviewed May 15, 2024

View reviewed changes

StefanieSenger and others added 2 commits May 23, 2024 09:20

Merge branch 'main' into nan_SplineTransformer

9307816

Apply suggestions from code review

c840f0a

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

StefanieSenger commented May 23, 2024

View reviewed changes

satisfy ruff

c1278a6

StefanieSenger mentioned this pull request May 24, 2024

MNT _weighted_percentile supports np.nan values #29034

Open

adrinjalali mentioned this pull request Jun 5, 2024

sklearn.neighbors.NearestNeighbors allow processing nan values #29085

Open

+    # Check that nan values are always encoded as zeros, even in columns where
+    # no missing values were observed at training time.
+    all_missing_row_encoded = spline.transform([[np.nan, np.nan]])
+    if sparse_output:
+        all_missing_row_encoded = all_missing_row_encoded.toarray()
+    assert_allclose(all_missing_row_encoded, 0)

ENH Add support for np.nan values in SplineTransformer #28043

Are you sure you want to change the base?

ENH Add support for np.nan values in SplineTransformer #28043

Conversation

StefanieSenger commented Jan 2, 2024 • edited Loading

Reference Issues/PRs

What does this implement/fix? Explain your changes.

github-actions bot commented Jan 2, 2024 • edited Loading

✔️ Linting Passed

ogrisel left a comment • edited Loading

Choose a reason for hiding this comment

ogrisel commented Feb 29, 2024

ogrisel left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

StefanieSenger left a comment

Choose a reason for hiding this comment

StefanieSenger commented Mar 15, 2024

ogrisel commented Mar 15, 2024 • edited Loading

ogrisel commented Mar 15, 2024

StefanieSenger commented Apr 18, 2024 • edited Loading

ogrisel left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

StefanieSenger left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ogrisel left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

StefanieSenger left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

StefanieSenger left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

StefanieSenger commented May 26, 2024 • edited Loading

lorentzenchr commented May 27, 2024

StefanieSenger commented Jan 2, 2024 •

edited

Loading

github-actions bot commented Jan 2, 2024 •

edited

Loading

ogrisel left a comment •

edited

Loading

ogrisel commented Mar 15, 2024 •

edited

Loading

StefanieSenger commented Apr 18, 2024 •

edited

Loading

StefanieSenger commented May 26, 2024 •

edited

Loading