Weights are being normalized using number of samples as opposed to sum in GaussianMixture #24085
Comments
I can work on this if it's still available.
Hi @kshitijgoel007, thank you for reporting this. Do you by chance have a minimal reproducible example where this incorrect weighting causes a problem?
I can work on this, if it hasn't been finished yet.
Please see my review on the PR but I thought I would write my comments coherently here.
To that end, it would be good to know the range of output weights currently possible from a normalized responsibilities matrix. I have a PR #24812 for some requested features: passing in user-provided responsibilities, or a callable initializer that creates responsibilities. I would be happy if the normalization changes here keep that use case in mind, since both additions also just boil down to passing in a normalized responsibilities matrix.

What can help here are the responsibilities helper and validation functions that are part of #24812. I debugged the various initial states provided by the initialization options and looked at the responsibility matrices they create. (Surprisingly to me, some are continuous in [0, 1], while others are binary 0/1.) Overall, though, the range is always [0, 1] and the normalization is such that each row (the responsibilities of one sample across the components) always sums to 1, since for each sample with n features we have a PDF over the components. See below.

(The PR also adds basic tests. Since some here are familiar with the Mixture codebase, it would be great to have a second pair of eyes.)

Helper to calculate normalized responsibilities:

```python
import numpy as np

from sklearn.mixture._base import _check_shape


def get_responsibilities(n_samples, n_components, indices=None, labels=None):
    """Create correctly shaped responsibilities from an array of `indices` or `labels`.

    Responsibilities are 1 for the component of the corresponding index or label,
    and 0 otherwise. Note that `n_components` is not the number of features
    in the data, but the number of components in the mixture model. Responsibilities
    can be used to calculate initial weights, means, and precisions of the mixture
    model components. Either `indices` or `labels` must be provided.

    Parameters
    ----------
    n_samples : int
        Number of samples.

    n_components : int
        Number of components.

    indices : array-like of shape (n_components,), default=None
        The index locations of the chosen components (centers) in the data
        array X of shape (n_samples, n_features); the corresponding entries
        will be set to 1 in the output.
        Either `indices` or `labels` must be provided.

    labels : array-like of shape (n_samples,), default=None
        Used over `indices` if not `None`. For each sample, the column of its
        label (0 to n_components - 1) will be set to 1 in the output.
        Either `indices` or `labels` must be provided.

    Returns
    -------
    responsibilities : array, shape (n_samples, n_components)
        Responsibilities of each sample in each component.
    """
    resp = np.zeros((n_samples, n_components))
    if labels is not None:
        _check_shape(labels, (n_samples,), "labels")  # will raise if incompatible
        resp[np.arange(n_samples), labels] = 1
    elif indices is not None:
        resp[indices, np.arange(n_components)] = 1
    else:
        raise ValueError(
            "Either `indices` or `labels` must be provided, both were `None`."
        )
    return resp
```

Validator for normalized responsibilities:

```python
import numpy as np

from sklearn.mixture._base import _check_shape
from sklearn.utils import check_array


def _check_responsibilities(resp, n_components, n_samples):
    """Check the user-provided 'resp'.

    Parameters
    ----------
    resp : array-like of shape (n_samples, n_components)
        The responsibilities for each data sample in terms of each component.

    n_components : int
        Number of components.

    n_samples : int
        Number of samples.

    Returns
    -------
    resp : array, shape (n_samples, n_components)
    """
    resp = check_array(resp, dtype=[np.float64, np.float32], ensure_2d=False)
    _check_shape(resp, (n_samples, n_components), "responsibilities")

    # Check range: row sums must lie in [0, 1] within floating point tolerance.
    axis_sum = resp.sum(axis=1)
    less_1 = np.allclose(axis_sum[axis_sum > 1], 1)
    positive = np.allclose(axis_sum[axis_sum < 0] + 1, 1)
    in_domain = positive and less_1
    if not in_domain:
        raise ValueError(
            "The parameter 'responsibilities' should be normalized in "
            "the range [0, 1] within floating point tolerance, but got: "
            f"max value {np.max(resp)}, min value {np.min(resp)}"
        )

    # Check that a proper normalization exists.
    nrows_1 = np.sum(np.isclose(axis_sum[axis_sum >= 1], 1))
    if nrows_1 < n_components:
        raise ValueError(
            "The parameter 'responsibilities' should be normalized, "
            f"with at least `n_components`={n_components} rows summing to 1, "
            f"but got {nrows_1}."
        )
    return resp
```
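For illustration, here is a possible usage of the two helpers above (a hypothetical example of mine, not taken from the PR; the labels are made up):

```python
import numpy as np

# Build 0/1 responsibilities from (hypothetical) cluster labels, then validate
# that each row sums to 1 before handing them to the mixture initialization.
labels = np.array([0, 2, 1, 1, 0, 2])
resp = get_responsibilities(n_samples=6, n_components=3, labels=labels)
resp = _check_responsibilities(resp, n_components=3, n_samples=6)
print(resp.sum(axis=1))  # [1. 1. 1. 1. 1. 1.]
```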
@emirkmo I kind of agree with the principle of not renormalizing user-defined weights. What I am wondering, however, is that it makes little sense to have unnormalized weights at init that will get normalized in the next M-step. Intuitively, I would expect the same normalization (or non-normalization) across the full iterative algorithm.
@glemaitre The init step really is a way to fine-tune the path the algorithm will take. For some use cases it barely matters at all, but for many others it can be really important, as the EM algorithm is highly sensitive to its initial state. (This can be especially important for sparse multi-dimensional data.) So ultimately I don't think the rest of the algorithm has much to do with the init state. In fact, the initial state does not even have to be super well defined, as the M-steps seem capable of handling that in my testing.

What is most important is allowing the init state to be specified precisely (without creating bugs like overflow that can lead to negative normalization in the M-steps), and providing tools for users to do this easily and intelligently as well (I'll shamelessly link to #24812). Otherwise, it is like trying to stack magnetic blocks: as you go to place one just so, the system shifts by itself and the end result isn't what you want. Very frustrating.
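For reference, a small sketch (mine, not from the PR) of pinning the initial state with the hooks `GaussianMixture` already exposes; note that `weights_init` must already sum to 1:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Two well-separated blobs in 2D.
X = np.vstack([rng.normal(-3, 1, (50, 2)), rng.normal(3, 1, (50, 2))])

# Pin the initial weights and means so EM starts from a known state.
gm = GaussianMixture(
    n_components=2,
    weights_init=np.array([0.5, 0.5]),
    means_init=np.array([[-3.0, 0.0], [3.0, 0.0]]),
    random_state=0,
).fit(X)
print(gm.weights_)
```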
Describe the bug
Weights are being normalized at https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/mixture/_gaussian_mixture.py#L718 using `n_samples`. It should be done using `weights.sum()`, as done in `_m_step()` here: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/mixture/_gaussian_mixture.py#L756.
Steps/Code to Reproduce
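A minimal sketch (added for illustration, not from the original report) of when the two normalizations differ. For a responsibilities matrix whose rows each sum to 1, `nk.sum()` equals `n_samples` and the two coincide; the discrepancy shows up for unnormalized (e.g. user-provided) responsibilities:

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_components = 6, 3

# A responsibilities matrix whose rows do NOT sum to 1, e.g. a
# user-provided initialization that was never normalized.
resp = rng.uniform(size=(n_samples, n_components))
nk = resp.sum(axis=0)

weights_by_n_samples = nk / n_samples  # normalization at the cited L718
weights_by_sum = nk / nk.sum()         # normalization used in _m_step (L756)

print(weights_by_n_samples.sum())  # != 1.0 in general
print(weights_by_sum.sum())        # == 1.0 by construction
```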
Expected Results
Correct weights.
Actual Results
Incorrect weights.
Versions