ENH add support for missing values in NMF #8474
base: main
Conversation
I think @GaelVaroquaux was very interested in this a long time back... (I can't remember the exact thread / e-mail, I could be wrong ;) )
Is this still under development? I see multiple open issues referencing this functionality, but cannot tell if there have been any updates or if this is scheduled for a future release. I would really like to see this functionality!
I think it's open for someone to complete
We talked briefly about this PR during the previous sprint in June, and it was considered a bit out of scope for scikit-learn. Indeed, no other estimator supports missing values, and it would probably not work very well with meta-estimators such as Pipeline, GridSearchCV, etc. However, the code is working (though not reviewed), and could be moved to a separate, properly referenced repository. I plan to do it at some point, but I am OK if someone else wants to do it.
Thanks! Got this code and it is working with NaN values.
Used it with no problems!
I'd be interested in estimating a latent representation that accounts for missing values, as an alternative to imputation in a classification pipeline. Is that one of the applications of this? Could we add it to
Yes, imputation is one application of this, though one can also be interested in the NMF decomposition itself, and not the derived imputation. The performance is not terrific, though.
I was not talking about imputation in the input source, but about deriving a latent representation in which missing values are accounted for, and using that to train... I know something similar is done with autoencoders.
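For what it's worth, here is a minimal sketch of that use case, assuming the branch's `NMF` accepts NaNs with `solver='mu'` (stock scikit-learn rejects them): fit the factorization directly on data with missing entries, then train a classifier on the latent features.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.abs(rng.standard_normal((200, 30)))
X[rng.random(X.shape) < 0.15] = np.nan   # simulate missing entries
y = rng.integers(0, 2, size=200)

# The branch's NMF optimizes the factorization on observed entries only,
# so the latent representation W accounts for the missing values.
nmf = NMF(n_components=5, solver='mu', init='random', max_iter=300)
W = nmf.fit_transform(X)

clf = LogisticRegression().fit(W, y)     # train on the latent features
```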
I know this is an old thread, but has this been added to the main scikit-learn repo, or can it only be found in @TomDLT's forked repo? How can I use nmf_missing to handle missing NaN values in a rating matrix, for example?
Not merged yet, waiting for reviews. In the meantime, you can check out the code with:

```bash
git fetch https://github.com/scikit-learn/scikit-learn pull/8474/head:nmf_missing
git checkout nmf_missing
```
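For reference, a minimal usage sketch once on that branch; this assumes `solver='mu'` is the path that handles NaNs (the only solver updated in this PR):

```python
import numpy as np
from sklearn.decomposition import NMF

X = np.abs(np.random.default_rng(0).standard_normal((6, 4)))
X[0, 1] = np.nan                  # a missing rating, for example

model = NMF(n_components=2, solver='mu', init='random', max_iter=500)
W = model.fit_transform(X)        # the loss ignores the NaN cell
H = model.components_

X_hat = W @ H                     # the reconstruction fills in the gap
```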
Okay, awesome, thanks for getting back to me. Just to confirm, do I just need to change the solver to 'mu' to be able to run with the NaN values? Also, does this work with sparse matrices?
This is great! Incredibly useful for my project, thank you for developing it. One thing I have noticed is that running NMF with the mu solver is much faster when NaN values have been replaced by zeroes than when the NaN values are left in. Is this expected behaviour? Thank you in advance.
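For anyone wanting to quantify that difference, here is a rough timing harness, again assuming the branch's `NMF`; results will vary with data size and missing rate. Some overhead is expected on the NaN path, since it goes through masked arrays.

```python
import time
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X_nan = np.abs(rng.standard_normal((500, 200)))
X_nan[rng.random(X_nan.shape) < 0.1] = np.nan
X_zero = np.nan_to_num(X_nan)     # same data, NaNs replaced by zeros

for name, data in [("zero-filled", X_zero), ("with NaNs", X_nan)]:
    tic = time.perf_counter()
    NMF(n_components=10, solver='mu', init='random', max_iter=100).fit(data)
    print(name, round(time.perf_counter() - tic, 2), "s")
```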
Ah, it seems like this is not yet implemented for the partial_fit method of MiniBatchNMF.
Here is an updated `partial_fit` method that accepts NaNs:

```python
def partial_fit(self, X, y=None, W=None, H=None):
    """Update the model using the data in `X` as a mini-batch.

    This method is expected to be called several times consecutively
    on different chunks of a dataset so as to implement out-of-core
    or online learning.

    This is especially useful when the whole dataset is too big to fit in
    memory at once (see :ref:`scaling_strategies`).

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        Data matrix to be decomposed.

    y : Ignored
        Not used, present here for API consistency by convention.

    W : array-like of shape (n_samples, n_components), default=None
        If `init='custom'`, it is used as initial guess for the solution.
        Only used for the first call to `partial_fit`.

    H : array-like of shape (n_components, n_features), default=None
        If `init='custom'`, it is used as initial guess for the solution.
        Only used for the first call to `partial_fit`.

    Returns
    -------
    self
        Returns the instance itself.
    """
    has_components = hasattr(self, "components_")

    if not has_components:
        self._validate_params()

    # force_all_finite=False lets NaN values through validation.
    X = self._validate_data(
        X,
        accept_sparse=("csr", "csc"),
        dtype=[np.float64, np.float32],
        reset=not has_components,
        force_all_finite=False,
    )

    # Handle NaNs for non-sparse X by creating a masked array.
    if not sp.issparse(X):
        X_mask = np.isnan(X)
        if np.any(X_mask):
            X = np.ma.masked_array(X, mask=X_mask)

    if not has_components:
        # This instance has not been fitted yet (fit or partial_fit).
        self._check_params(X)
        _, H = self._check_w_h(X, W=W, H=H, update_H=True)

        self._components_numerator = H.copy()
        self._components_denominator = np.ones(H.shape, dtype=H.dtype)
        self.n_steps_ = 0
    else:
        H = self.components_

    self._minibatch_step(X, None, H, update_H=True)

    self.n_components_ = H.shape[0]
    self.components_ = H
    self.n_steps_ += 1

    return self
```
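And a quick sketch of how the patched method might be exercised, assuming it is applied to `MiniBatchNMF` on this branch:

```python
import numpy as np
from sklearn.decomposition import MiniBatchNMF

rng = np.random.default_rng(0)
X = np.abs(rng.standard_normal((100, 8)))
X[rng.random(X.shape) < 0.1] = np.nan    # missing entries in every chunk

model = MiniBatchNMF(n_components=4, init='random')
for chunk in np.array_split(X, 10):       # online / out-of-core updates
    model.partial_fit(chunk)
```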
The implementation of `_safe_dot` below may be useful, since NumPy's operations on masked matrices (which manipulate boolean mask arrays) are not highly optimised. In my understanding, the majority of the computation in `np.ma.dot` goes into computing the mask of the result; since the return value is not used as a masked array, computing that mask is unnecessary. Zero-filling the masked entries and calling the regular dot product should then give the same result and be much faster. I may be mistaken, but in my tests NMF appears to work well with this. Am I wrong in thinking this will result in an identical model?

```python
def _safe_dot(X, Ht):
    # Replace masked entries by 0 so they contribute nothing to the
    # product, then use the regular (fast) dot path.
    if isinstance(X, np.ma.masked_array):
        X_array = X.data.copy()
        X_array[X.mask] = 0
        X = X_array
    if isinstance(Ht, np.ma.masked_array):
        Ht_array = Ht.data.copy()
        Ht_array[Ht.mask] = 0
        Ht = Ht_array
    return safe_sparse_dot(X, Ht)
```
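Here is a quick check of that equivalence claim; it relies on `np.ma.dot`'s documented default (`strict=False`) of replacing masked entries by 0, so the data of both results should match:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.ma.masked_array(
    rng.standard_normal((50, 30)),
    mask=rng.random((50, 30)) < 0.2,
)
Ht = rng.standard_normal((30, 10))

masked = np.ma.dot(X, Ht)        # also computes a result mask (the slow part)
fast = X.filled(0.0) @ Ht        # zero-fill, then a regular dot product

np.testing.assert_allclose(np.asarray(masked), fast)
```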
Fixes #8447
Fixes #25229
Here I update NMF's multiplicative update solver to handle missing values (`np.nan`). The solver simply doesn't take the loss at missing values into account, and optimizes the factorization on the rest of the data. This can be useful for imputation, recommendation (a bit out of scope for sklearn), or cross-validation. Not sure this is in scikit-learn's scope though.

The solver update is simple. Except for beta = 1, both the denominator and the numerator (in the multiplicative update) contain either `X` or `W * H`, so we can mask them and sum the denominator and the numerator over the valid elements only. This is referred to in the literature as "weighted NMF".
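A minimal NumPy sketch of that masked update for beta = 2 (the Frobenius loss); this illustrates the principle, not the PR's actual code path:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.abs(rng.standard_normal((20, 10)))
X[rng.random(X.shape) < 0.2] = np.nan

M = ~np.isnan(X)              # True where the entry is observed
Xf = np.where(M, X, 0.0)      # zero-fill so missing cells drop out of sums

W = np.abs(rng.standard_normal((20, 3)))
H = np.abs(rng.standard_normal((3, 10)))
eps = np.finfo(float).eps

for _ in range(200):
    # Numerator and denominator are summed over observed entries only.
    W *= (Xf @ H.T) / ((M * (W @ H)) @ H.T + eps)
    H *= (W.T @ Xf) / (W.T @ (M * (W @ H)) + eps)

loss = 0.5 * np.sum(((Xf - W @ H) * M) ** 2)   # loss on observed cells only
```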
For the coordinate descent solver, the method uses the following factorization to avoid recomputing `W * H` for each coordinate update of `W`:

`(X - W * H) * H.T = X * H.T - W * (H * H.T)`

During the entire W-update, `X * H.T` and `H * H.T` do not change, so the optimization is fast. But if we want to handle missing values, `H * H.T` changes for each row of `W`, which heavily increases the computation. So I did not update the coordinate descent solver to handle missing values.
For the multiplicative update solver, the approximation `W * H` is used for the update of all coordinates, so we can compute it only once per update of `W`. This is still more costly than `W * (H * H.T)` as used with beta = 2 without missing values, but this is already what we do for the other beta losses (!= 2). I skipped the sparse case for now.
TODO: