
ENH add support to missing values in NMF #8474

Open
wants to merge 31 commits into main from nmf_missing
Conversation

TomDLT
Member

@TomDLT TomDLT commented Feb 28, 2017

Fixes #8447
Fixes #25229

Here I update NMF's multiplicative update solver to handle missing values (np.nan).
The solver simply doesn't take into account the loss at missing values, and optimizes the factorization on the rest of the data. This can be useful for imputation, recommendation (a bit out of scope for sklearn), or cross-validation.

Not sure this is within scikit-learn's scope, though.


The solver update is simple. Except for beta = 1, both the numerator and the denominator in the multiplicative update involve either X or W * H, so we can mask them and sum the numerator and the denominator over the observed (non-missing) elements only. This is referred to in the literature as "weighted NMF".
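
To make this concrete, here is a NumPy-only sketch of the masked update for the simplest case (beta = 2, Frobenius loss). It is only an illustration of the principle, not the code in this PR:

import numpy as np

def masked_mu_step(X, W, H, eps=1e-12):
    # One multiplicative update of W and H for the Frobenius loss (beta = 2),
    # ignoring entries of X that are NaN (weighted NMF with a 0/1 mask).
    M = ~np.isnan(X)                 # True where X is observed
    Xf = np.where(M, X, 0.0)         # missing entries contribute nothing

    WH = W @ H
    # numerator and denominator are summed over observed entries only
    W *= (Xf @ H.T) / np.maximum((M * WH) @ H.T, eps)

    WH = W @ H
    H *= (W.T @ Xf) / np.maximum(W.T @ (M * WH), eps)
    return W, H

rng = np.random.default_rng(0)
X = rng.random((6, 5))
X[0, 1] = X[3, 4] = np.nan           # simulate missing values
W, H = rng.random((6, 2)) + 0.1, rng.random((2, 5)) + 0.1
for _ in range(200):
    W, H = masked_mu_step(X, W, H)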

For the coordinate descent solver, the method uses the following factorization, to avoid recomputing W * H for each coordinate update of W: (X - W * H) * H.T = X * H.T - W * (H * H.T)
During the entire W-update, X * H.T and H * H.T do not change, so the optimization is fast.
But if we want to handle missing values, H * H.T changes for each row of W, which heavily increases the computation.
So I did not update the coordinate descent solver to handle missing values.
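
Concretely, the issue is that the Gram matrix has to be restricted to the observed features of each row, so it can no longer be precomputed once per W-update. A small illustration (not code from this PR):

import numpy as np

rng = np.random.default_rng(0)
H = rng.random((3, 8))               # (n_components, n_features)
M = rng.random((5, 8)) < 0.7         # observed-entry mask, one row per sample

# Without missing values, one Gram matrix serves every row of W:
G_full = H @ H.T

# With missing values, every row i of W needs its own Gram matrix,
# restricted to the features observed in that row:
G_per_row = [H[:, M[i]] @ H[:, M[i]].T for i in range(M.shape[0])]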

For the multiplicative update solver, the approximation W * H is used for the update of all coordinates, so we can compute it only once per update of W. This is still more costly than the W * (H * H.T) formulation used with beta = 2 without missing values, but it is already what we do for the other beta losses (beta != 2).

I skipped the sparse case for now.
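
Usage would look roughly like this (dense input only, and only with the multiplicative update solver; this assumes the branch from this PR is installed):

import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
X = np.abs(rng.randn(100, 20))
X[rng.rand(100, 20) < 0.1] = np.nan   # 10% missing entries

# only solver='mu' handles NaN in this PR; 'cd' and sparse inputs do not
model = NMF(n_components=5, solver='mu', init='random', max_iter=500)
W = model.fit_transform(X)
H = model.components_
X_reconstructed = W @ H               # can be used to impute the missing entries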


TODO:

@raghavrv
Member

raghavrv commented Feb 28, 2017

I think @GaelVaroquaux was very interested in this long back... (I can't remember the exact thread / e-mail, I could be wrong ;) )

@raghavrv raghavrv added this to Would be very nice to have (MID) in Raghav's *personal* review-priority listing Mar 2, 2017
@TomDLT TomDLT force-pushed the nmf_missing branch 2 times, most recently from 315c1ee to 759b970 on June 6, 2017 08:21
@raghavrv raghavrv removed this from Would be very nice to have (MID) in Raghav's *personal* review-priority listing Jun 29, 2017
@TomDLT TomDLT added the Stalled label Jul 19, 2017
@dburkhardt

Is this still under development? I see multiple open issues referencing this functionality, but cannot tell if there have been any updates or if this is scheduled for a future release. I would really like to see this functionality!

@jnothman
Member

jnothman commented Oct 8, 2017

I think it's open for someone to complete

@TomDLT
Member Author

TomDLT commented Oct 9, 2017

We talked briefly about this PR during the previous sprint in June, and it was considered a bit out of scope for scikit-learn. Indeed, no other estimator supports missing values, and it will probably not work very well with meta-estimators such as Pipeline, GridSearchCV, etc.

However, the code is working (though not reviewed), and could be moved to a separate, properly referenced repository. I plan to do it at some point, but I am ok if someone wants to do it.

@Cristianasp

Thanks! Got this code and it is working with NaN values.

@hongkahjun
Contributor

Used it with no problems!

@jnothman
Member

I'd be interested in having an instance of estimating a latent representation to account for missing values as an alternative to imputation in a classification pipeline. Is that one of the applications of this? Could we add it to plot_missing_values.py and see how it compares?

@scikit-learn scikit-learn deleted a comment from codecov bot Nov 13, 2017
@TomDLT
Member Author

TomDLT commented Nov 13, 2017

Yes, imputation is one application of this, though one can also be interested in the NMF decomposition itself, and not the derived imputation.
To compare with Imputer, I added an ImputerNMF class which implements an imputation transformer based on NMF, and added it to plot_missing_values.

The performance is not terrific, though:

Score with the entire dataset = 0.56
Score without the samples containing missing values = 0.48
Score after imputation of the missing values (mean) = 0.57
Score after imputation of the missing values (NMF) = 0.51
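
The idea is roughly the following (a simplified sketch of the concept, not the actual ImputerNMF code; it assumes an NMF whose 'mu' solver accepts NaN, as in this PR):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import NMF

class SimpleNMFImputer(BaseEstimator, TransformerMixin):
    # Hypothetical sketch: fill NaNs with the NMF reconstruction.

    def __init__(self, n_components=10):
        self.n_components = n_components

    def fit(self, X, y=None):
        self.nmf_ = NMF(n_components=self.n_components, solver='mu',
                        init='random', max_iter=500)
        self.nmf_.fit(X)
        return self

    def transform(self, X):
        W = self.nmf_.transform(X)                 # masked fit of W on new data
        X_hat = W @ self.nmf_.components_
        X = np.array(X, dtype=float, copy=True)
        missing = np.isnan(X)
        X[missing] = X_hat[missing]                # keep observed values, fill the holes
        return X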

@jnothman
Member

jnothman commented Nov 13, 2017 via email

@TomDLT TomDLT marked this pull request as ready for review January 5, 2023 02:40
@sammymans

I know this is an old thread, but has this been added to the main scikit-learn repo, or can it only be found in @TomDLT's forked repo? How can I use nmf_missing to handle missing NaN values in a rating matrix, for example?

@TomDLT
Member Author

TomDLT commented Apr 20, 2023

Not merged yet, waiting for reviews.

In the meantime, you can check out the code with:

git fetch https://github.com/scikit-learn/scikit-learn pull/8474/head:nmf_missing
git checkout nmf_missing

@sammymans

Okay, awesome, thanks for getting back to me. Just to confirm, do I just need to change the solver to 'mu' to be able to run with NaN values? Also, does this work with sparse matrices?

@PolarBean

This is great! Incredibly useful for my project, thank you for developing it. One thing I have noticed is that running NMF with the 'mu' solver is much faster when NaN values have been replaced by zeros than when the NaN values are left in. Is this expected behaviour? Thank you in advance.

@PolarBean

Ah, it seems this is not yet implemented for the partial_fit method of MiniBatchNMF.

@PolarBean

Here is an updated partial_fit method that accepts NaNs:

    def partial_fit(self, X, y=None, W=None, H=None):
        """Update the model using the data in `X` as a mini-batch.

        This method is expected to be called several times consecutively
        on different chunks of a dataset so as to implement out-of-core
        or online learning.

        This is especially useful when the whole dataset is too big to fit in
        memory at once (see :ref:`scaling_strategies`).

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            Data matrix to be decomposed.

        y : Ignored
            Not used, present here for API consistency by convention.

        W : array-like of shape (n_samples, n_components), default=None
            If `init='custom'`, it is used as initial guess for the solution.
            Only used for the first call to `partial_fit`.

        H : array-like of shape (n_components, n_features), default=None
            If `init='custom'`, it is used as initial guess for the solution.
            Only used for the first call to `partial_fit`.

        Returns
        -------
        self
            Returns the instance itself.
        """
        has_components = hasattr(self, "components_")

        if not has_components:
            self._validate_params()

        X = self._validate_data(
            X,
            accept_sparse=("csr", "csc"),
            dtype=[np.float64, np.float32],
            reset=not has_components,
            force_all_finite=False,  # let NaN pass input validation
        )
        # Handle NaNs for non-sparse X by creating a masked array
        if not sp.issparse(X):
            X_mask = np.isnan(X)
            if np.any(X_mask):
                X = np.ma.masked_array(X, mask=X_mask)
        
        if not has_components:
            # This instance has not been fitted yet (fit or partial_fit)
            self._check_params(X)
            _, H = self._check_w_h(X, W=W, H=H, update_H=True)

            self._components_numerator = H.copy()
            self._components_denominator = np.ones(H.shape, dtype=H.dtype)
            self.n_steps_ = 0
        else:
            H = self.components_

        self._minibatch_step(X, None, H, update_H=True)

        self.n_components_ = H.shape[0]
        self.components_ = H
        self.n_steps_ += 1

        return self
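
With the patch above (on top of this PR's branch), streaming usage would look something like this sketch:

import numpy as np
from sklearn.decomposition import MiniBatchNMF

rng = np.random.default_rng(0)
X = rng.random((1000, 30))
X[rng.random(X.shape) < 0.05] = np.nan   # 5% missing entries

nmf = MiniBatchNMF(n_components=8, init='random')
for batch in np.array_split(X, 10):      # feed the data in mini-batches
    nmf.partial_fit(batch)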

@PolarBean

The implementation of _safe_dot below may be useful, as NumPy's operations on boolean matrices are not highly optimised. In my understanding, this means that most of the computation in np.ma.dot goes into computing the mask of the result. Since the return value is not used as a masked array, computing that mask is unnecessary, so the version below should give the same result while being much faster. I may be mistaken, but in my tests NMF appears to work well with this. Am I wrong in thinking this will result in an identical model?

import numpy as np
from sklearn.utils.extmath import safe_sparse_dot


def _safe_dot(X, Ht):
    # If an operand is a masked array, replace its masked entries with 0
    # and drop the mask, so that a plain (much faster) dot product is used.
    if isinstance(X, np.ma.masked_array):
        X_array = X.data.copy()
        X_array[X.mask] = 0
        X = X_array
    if isinstance(Ht, np.ma.masked_array):
        Ht_array = Ht.data.copy()
        Ht_array[Ht.mask] = 0
        Ht = Ht_array
    return safe_sparse_dot(X, Ht)
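
As a quick check of that claim (if I read numpy.ma's dot correctly, the default strict=False fills masked entries with zero before the product, and the extra cost is the boolean dot that builds the result mask):

import numpy as np
from numpy.testing import assert_allclose

rng = np.random.default_rng(0)
X = rng.random((200, 50))
mask = rng.random(X.shape) < 0.3
Xm = np.ma.masked_array(X, mask=mask)
Ht = rng.random((50, 10))

slow = np.ma.dot(Xm, Ht)                    # also computes a result mask
fast = np.dot(np.where(mask, 0.0, X), Ht)   # zero-fill, then plain dot
assert_allclose(slow.data, fast)            # same numbers, without the mask work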


Successfully merging this pull request may close these issues.

Handling NaNs in NMF
NMF with missing data