Handling NaNs in NMF #25229

OlenaBugaiova · 2022-12-23T22:13:35Z

Describe the workflow you want to enable

Motivation:
Sparse matrices are very common in recommender system problems, and matrix factorization is one of the most popular approaches for this task. But recommender system data is often sparse and has a large number of missing values.
Problem:
In principle, non-negative matrix factorization can work with sparse matrices and optimize the solution based only on the values that are present. In the scikit-learn implementation, the validation doesn't allow the fit_transform method of NMF to accept sparse matrices with NaN values.

Describe your proposed solution

In this file: https://github.com/scikit-learn/scikit-learn/blob/98cf537f5/sklearn/decomposition/_nmf.py
the fit_transform method contains this code:
X = self._validate_data(
X, accept_sparse=("csr", "csc"), dtype=[np.float64, np.float32]
)
The argument accept_sparse=("csr", "csc") is passed on to the check_array function in https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/validation.py, so validation accepts sparse input.
However, check_array has a parameter force_all_finite that defaults to True, so it rejects NaN values even when the sparse formats ("csr", "csc") are allowed.
As a result, the function _assert_all_finite in validation.py raises the following error: ValueError("Input contains NaN")
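
To make the behaviour above concrete, here is a minimal sketch that calls check_array directly rather than going through NMF. The toy matrix is made up for illustration. The accepted string value is "allow-nan" (with a hyphen), and as far as I know newer scikit-learn releases renamed the keyword to ensure_all_finite, so the sketch picks whichever name the installed version exposes.

```python
import inspect

import numpy as np
from scipy import sparse
from sklearn.utils import check_array

# Toy CSR matrix with one explicitly stored NaN (illustrative data only).
X = sparse.csr_matrix(np.array([[1.0, 0.0, np.nan], [0.0, 2.0, 3.0]]))

# Default validation: the sparse format is accepted, but the NaN is rejected.
try:
    check_array(X, accept_sparse=("csr", "csc"))
    rejected = False
    message = ""
except ValueError as exc:
    rejected = True
    message = str(exc)

# The keyword is force_all_finite in older releases and ensure_all_finite in
# newer ones; pick the one available in the installed version.
params = inspect.signature(check_array).parameters
finite_kw = "ensure_all_finite" if "ensure_all_finite" in params else "force_all_finite"

# With "allow-nan", NaN passes validation (infinite values are still rejected).
X_checked = check_array(X, accept_sparse=("csr", "csc"), **{finite_kw: "allow-nan"})

print(rejected, message)
print(np.isnan(X_checked.data).sum())
```

Note that this only shows the validation step. Letting NaN through does not by itself make the NMF solvers handle missing values.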

Could the fit_transform method of NMF make this configurable, so that force_all_finite='allow-nan' can be passed to _validate_data?

Describe alternatives you've considered, if relevant

No response

Additional context

No response

glemaitre · 2022-12-30T10:07:47Z

Letting the matrix through validation might not be sufficient, because a sparse matrix needs to be handled with care and might require the use of specific scipy.sparse methods.

ping @jeremiedbb, who knows the internal implementation better and can tell whether it would be more or less feasible.

jeremiedbb · 2022-12-30T12:53:46Z

The issue seems to be about handling nan rather than supporting sparse matrices (which is already the case). Let me rename the issue.

Regarding NaN handling, there is PR #8474. It would be interesting to revive this PR, but we need to take a look at the literature to check whether the PR implements a common approach.

OlenaBugaiova · 2023-01-02T22:40:07Z

It makes sense, thanks for looking at this request. I agree, it is about handling missing values rather than sparse matrices.
I see there was work going on in PR #8474.

TomDLT · 2023-01-09T23:26:44Z

I revived the PR adding support for missing values in NMF (#8474). Note that for now, the PR only adds support for missing values in dense datasets. Tell me if you have any questions.
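
For readers wondering what fitting NMF only on the present values can look like, below is a minimal NumPy sketch of masked multiplicative updates for the Frobenius loss on a dense array. It is an illustration under stated assumptions, not the implementation in PR #8474: the function name masked_nmf, its parameters, and the synthetic data are all hypothetical.

```python
import numpy as np


def masked_nmf(X, n_components, n_iter=500, eps=1e-10, seed=0):
    """Hypothetical helper: factorize X ~= W @ H using only non-NaN entries."""
    rng = np.random.default_rng(seed)
    mask = ~np.isnan(X)                 # True where a value is observed
    X_filled = np.where(mask, X, 0.0)   # missing entries contribute nothing
    n_samples, n_features = X.shape
    # Random positive initialization (no SVD-based init, which needs full data).
    W = rng.random((n_samples, n_components)) + 0.1
    H = rng.random((n_components, n_features)) + 0.1
    for _ in range(n_iter):
        # Multiplicative updates with the reconstruction masked to observed cells.
        WH = mask * (W @ H)
        W *= (X_filled @ H.T) / (WH @ H.T + eps)
        WH = mask * (W @ H)
        H *= (W.T @ X_filled) / (W.T @ WH + eps)
    return W, H


# Synthetic low-rank non-negative data with roughly 20% of entries removed.
rng = np.random.default_rng(42)
X_true = rng.random((30, 3)) @ rng.random((3, 20))
X_missing = X_true.copy()
X_missing[rng.random(X_true.shape) < 0.2] = np.nan

W, H = masked_nmf(X_missing, n_components=3)
X_hat = W @ H  # dense reconstruction, including the missing cells

observed = ~np.isnan(X_missing)
print(np.sqrt(np.mean((X_hat[observed] - X_true[observed]) ** 2)))
```

The product W @ H has no NaN, so it can be read as an imputation of the missing cells. Whether this weighting scheme matches the approach in the PR or in the literature is exactly the point raised above and is not settled by this sketch.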

OlenaBugaiova · 2023-02-03T00:48:25Z

That's wonderful! I am trying it locally.
You added a test that the NMF imputation is better than SimpleImputer+NMF. That's why I was looking for it.

Please let me know if I can help with writing more tests or documentation.

OlenaBugaiova · 2023-02-18T02:23:40Z

Tested this functionality locally on data having a lot of NaNs. It worked perfectly.
Notes:

The NMF 'cd' solver doesn't handle missing values.
The init method cannot be 'nndsvd', 'nndsvda', or 'nndsvdar' when the data has missing values.

Closing this feature request

OlenaBugaiova · 2023-02-19T02:16:26Z

It would be nice to add NMF to this page after the PR is merged: https://scikit-learn.org/stable/modules/impute.html

OlenaBugaiova added the Needs Triage and New Feature labels Dec 23, 2022

glemaitre added the Needs Decision label and removed the Needs Triage label Dec 30, 2022

jeremiedbb changed the title ~~Accepting sparse matrixes in non-negative matrix factorization~~ Handling NaNs in NMF Dec 30, 2022

TomDLT linked a pull request Jan 5, 2023 that will close this issue

ENH add support to missing values in NMF #8474

Open

3 tasks

cmarmo added the module:decomposition label Jan 17, 2023

OlenaBugaiova closed this as completed Feb 18, 2023

adrinjalali reopened this Apr 17, 2024
