
ENH add support to missing values in NMF #8474

Open
wants to merge 31 commits into main from nmf_missing
Conversation

TomDLT
Member

@TomDLT TomDLT commented Feb 28, 2017

Fixes #8447
Fixes #25229

Here I update NMF's multiplicative update solver to handle missing values (np.nan).
The solver simply doesn't take into account the loss at missing values, and optimizes the factorization on the rest of the data. This can be useful for imputation, recommendation (a bit out of scope for sklearn), or cross-validation.

Not sure this is within scikit-learn's scope, though.


The solver update is simple. Except for beta = 1, both the numerator and the denominator in the multiplicative update involve either X or W * H, so we can mask them and sum the numerator and the denominator over the observed (non-missing) elements only. This is referred to in the literature as "weighted NMF".
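
To make this concrete, here is a NumPy-only sketch of the masked update for the simplest case (beta = 2, Frobenius loss). It is only an illustration of the principle, not the code in this PR:

import numpy as np

def masked_mu_step(X, W, H, eps=1e-12):
    # One multiplicative update of W and H for the Frobenius loss (beta = 2),
    # ignoring entries of X that are NaN (weighted NMF with a 0/1 mask).
    M = ~np.isnan(X)                 # True where X is observed
    Xf = np.where(M, X, 0.0)         # missing entries contribute nothing

    WH = W @ H
    # numerator and denominator are summed over observed entries only
    W *= (Xf @ H.T) / np.maximum((M * WH) @ H.T, eps)

    WH = W @ H
    H *= (W.T @ Xf) / np.maximum(W.T @ (M * WH), eps)
    return W, H

rng = np.random.default_rng(0)
X = rng.random((6, 5))
X[0, 1] = X[3, 4] = np.nan           # simulate missing values
W, H = rng.random((6, 2)) + 0.1, rng.random((2, 5)) + 0.1
for _ in range(200):
    W, H = masked_mu_step(X, W, H)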

For the coordinate descent solver, the method uses the following factorization, to avoid recomputing W * H for each coordinate update of W: (X - W * H) * H.T = X * H.T - W * (H * H.T)
During the entire W-update, X * H.T and H * H.T do not change, so the optimization is fast.
But if we want to handle missing values, H * H.T changes for each row of W, which heavily increases the computation.
So I did not update the coordinate descent solver to handle missing values.
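
Concretely, the issue is that the Gram matrix has to be restricted to the observed features of each row, so it can no longer be precomputed once per W-update. A small illustration (not code from this PR):

import numpy as np

rng = np.random.default_rng(0)
H = rng.random((3, 8))               # (n_components, n_features)
M = rng.random((5, 8)) < 0.7         # observed-entry mask, one row per sample

# Without missing values, one Gram matrix serves every row of W:
G_full = H @ H.T

# With missing values, every row i of W needs its own Gram matrix,
# restricted to the features observed in that row:
G_per_row = [H[:, M[i]] @ H[:, M[i]].T for i in range(M.shape[0])]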

For the multiplicative update solver, the approximation W * H is used for the update of all coordinates, so we can compute it only once per update of W. This is still more costly than the W * (H * H.T) formulation used with beta = 2 without missing values, but it is already what we do for the other beta losses (beta != 2).

I skipped the sparse case for now.
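
Usage would look roughly like this (dense input only, and only with the multiplicative update solver; this assumes the branch from this PR is installed):

import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
X = np.abs(rng.randn(100, 20))
X[rng.rand(100, 20) < 0.1] = np.nan   # 10% missing entries

# only solver='mu' handles NaN in this PR; 'cd' and sparse inputs do not
model = NMF(n_components=5, solver='mu', init='random', max_iter=500)
W = model.fit_transform(X)
H = model.components_
X_reconstructed = W @ H               # can be used to impute the missing entries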


TODO:

@raghavrv
Member

raghavrv commented Feb 28, 2017

I think @GaelVaroquaux was very interested in this long back... (I can't remember the exact thread / e-mail, I could be wrong ;) )

@raghavrv raghavrv added this to Would be very nice to have (MID) in Raghav's *personal* review-priority listing Mar 2, 2017
@TomDLT TomDLT force-pushed the nmf_missing branch 2 times, most recently from 315c1ee to 759b970 on June 6, 2017 08:21
@raghavrv raghavrv removed this from Would be very nice to have (MID) in Raghav's *personal* review-priority listing Jun 29, 2017
@TomDLT TomDLT added the Stalled label Jul 19, 2017
@dburkhardt

Is this still under development? I see multiple open issues referencing this functionality, but cannot tell if there have been any updates or if this is scheduled for a future release. I would really like to see this functionality!

@jnothman
Member

jnothman commented Oct 8, 2017

I think it's open for someone to complete

@TomDLT
Member Author

TomDLT commented Oct 9, 2017

We talked briefly about this PR during the previous sprint in June, and it was considered a bit out of scope for scikit-learn. Indeed, no other estimator supports missing values, and it will probably not work very well with meta-estimators such as Pipeline, GridSearchCV, etc.

However, the code is working (though not reviewed), and could be moved to a separate, properly referenced repository. I plan to do it at some point, but I am ok if someone wants to do it.

@Cristianasp

Thanks! Got this code and it is working with NaN values.

@hongkahjun
Contributor

Used it with no problems!

@jnothman
Member

I'd be interested in having an instance of estimating a latent representation to account for missing values as an alternative to imputation in a classification pipeline. Is that one of the applications of this? Could we add it to plot_missing_values.py and see how it compares?

@scikit-learn scikit-learn deleted a comment from codecov bot Nov 13, 2017
@TomDLT
Member Author

TomDLT commented Nov 13, 2017

Yes, imputation is one application of this, though one can also be interested in the NMF decomposition itself, and not the derived imputation.
To compare with Imputer, I added an ImputerNMF class which implements an imputation transformer based on NMF, and added it to plot_missing_values.

The performance is not terrific, though:

Score with the entire dataset = 0.56
Score without the samples containing missing values = 0.48
Score after imputation of the missing values (mean) = 0.57
Score after imputation of the missing values (NMF) = 0.51
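
The idea is roughly the following (a simplified sketch of the concept, not the actual ImputerNMF code; it assumes an NMF whose 'mu' solver accepts NaN, as in this PR):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import NMF

class SimpleNMFImputer(BaseEstimator, TransformerMixin):
    # Hypothetical sketch: fill NaNs with the NMF reconstruction.

    def __init__(self, n_components=10):
        self.n_components = n_components

    def fit(self, X, y=None):
        self.nmf_ = NMF(n_components=self.n_components, solver='mu',
                        init='random', max_iter=500)
        self.nmf_.fit(X)
        return self

    def transform(self, X):
        W = self.nmf_.transform(X)                 # masked fit of W on new data
        X_hat = W @ self.nmf_.components_
        X = np.array(X, dtype=float, copy=True)
        missing = np.isnan(X)
        X[missing] = X_hat[missing]                # keep observed values, fill the holes
        return X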

@jnothman
Member

jnothman commented Nov 13, 2017 via email

@TomDLT TomDLT marked this pull request as ready for review January 5, 2023 02:40
@sammymans

I know this is an old thread, but has this been added to the main scikit-learn repo, or can it only be found in @TomDLT's forked repo? How can I use nmf_missing to handle missing NaN values in a rating matrix, for example?

@TomDLT
Member Author

TomDLT commented Apr 20, 2023

Not merged yet, waiting for reviews.

In the meantime, you can check out the code with:

git fetch https://github.com/scikit-learn/scikit-learn pull/8474/head:nmf_missing
git checkout nmf_missing

@sammymans

Okay, awesome, thanks for getting back to me. Just to confirm, do I just need to change the solver to 'mu' to be able to run with NaN values? Also, does this work with sparse matrices?

@PolarBean

This is great! Incredibly useful for my project, thank you for developing it. One thing I have noticed is that running NMF with the 'mu' solver is much faster when NaN values have been replaced by zeros than when the NaN values are left in. Is this expected behaviour? Thank you in advance.

@PolarBean

Ah, it seems this is not yet implemented for the partial_fit method of MiniBatchNMF.

@PolarBean

Here is an updated partial_fit method that accepts NaNs:

    def partial_fit(self, X, y=None, W=None, H=None):
        """Update the model using the data in `X` as a mini-batch.

        This method is expected to be called several times consecutively
        on different chunks of a dataset so as to implement out-of-core
        or online learning.

        This is especially useful when the whole dataset is too big to fit in
        memory at once (see :ref:`scaling_strategies`).

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            Data matrix to be decomposed.

        y : Ignored
            Not used, present here for API consistency by convention.

        W : array-like of shape (n_samples, n_components), default=None
            If `init='custom'`, it is used as initial guess for the solution.
            Only used for the first call to `partial_fit`.

        H : array-like of shape (n_components, n_features), default=None
            If `init='custom'`, it is used as initial guess for the solution.
            Only used for the first call to `partial_fit`.

        Returns
        -------
        self
            Returns the instance itself.
        """
        has_components = hasattr(self, "components_")

        if not has_components:
            self._validate_params()

        X = self._validate_data(
            X,
            accept_sparse=("csr", "csc"),
            dtype=[np.float64, np.float32],
            reset=not has_components,
            force_all_finite=False,  # let NaN pass input validation
        )
        # Handle NaNs for non-sparse X by creating a masked array
        if not sp.issparse(X):
            X_mask = np.isnan(X)
            if np.any(X_mask):
                X = np.ma.masked_array(X, mask=X_mask)
        
        if not has_components:
            # This instance has not been fitted yet (fit or partial_fit)
            self._check_params(X)
            _, H = self._check_w_h(X, W=W, H=H, update_H=True)

            self._components_numerator = H.copy()
            self._components_denominator = np.ones(H.shape, dtype=H.dtype)
            self.n_steps_ = 0
        else:
            H = self.components_

        self._minibatch_step(X, None, H, update_H=True)

        self.n_components_ = H.shape[0]
        self.components_ = H
        self.n_steps_ += 1

        return self
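
With the patch above (on top of this PR's branch), streaming usage would look something like this sketch:

import numpy as np
from sklearn.decomposition import MiniBatchNMF

rng = np.random.default_rng(0)
X = rng.random((1000, 30))
X[rng.random(X.shape) < 0.05] = np.nan   # 5% missing entries

nmf = MiniBatchNMF(n_components=8, init='random')
for batch in np.array_split(X, 10):      # feed the data in mini-batches
    nmf.partial_fit(batch)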

@PolarBean

The implementation of _safe_dot below may be useful, as NumPy's operations on boolean matrices are not highly optimised. In my understanding, this means that most of the computation in np.ma.dot goes into computing the mask of the result. Since the return value is not used as a masked array, computing that mask is unnecessary, so the version below should give the same result while being much faster. I may be mistaken, but in my tests NMF appears to work well with this. Am I wrong in thinking this will result in an identical model?

import numpy as np
from sklearn.utils.extmath import safe_sparse_dot


def _safe_dot(X, Ht):
    # If an operand is a masked array, replace its masked entries with 0
    # and drop the mask, so that a plain (much faster) dot product is used.
    if isinstance(X, np.ma.masked_array):
        X_array = X.data.copy()
        X_array[X.mask] = 0
        X = X_array
    if isinstance(Ht, np.ma.masked_array):
        Ht_array = Ht.data.copy()
        Ht_array[Ht.mask] = 0
        Ht = Ht_array
    return safe_sparse_dot(X, Ht)
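
As a quick check of that claim (if I read numpy.ma's dot correctly, the default strict=False fills masked entries with zero before the product, and the extra cost is the boolean dot that builds the result mask):

import numpy as np
from numpy.testing import assert_allclose

rng = np.random.default_rng(0)
X = rng.random((200, 50))
mask = rng.random(X.shape) < 0.3
Xm = np.ma.masked_array(X, mask=mask)
Ht = rng.random((50, 10))

slow = np.ma.dot(Xm, Ht)                    # also computes a result mask
fast = np.dot(np.where(mask, 0.0, X), Ht)   # zero-fill, then plain dot
assert_allclose(slow.data, fast)            # same numbers, without the mask work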


Successfully merging this pull request may close these issues.

Handling NaNs in NMF
NMF with missing data