
Weights are being normalized using number of samples as opposed to sum in GaussianMixture #24085

Open
kshitijgoel007 opened this issue Aug 2, 2022 · 7 comments · May be fixed by #24119

Labels: Bug · Moderate (anything that requires some knowledge of conventions and best practices) · module:mixture

Comments

@kshitijgoel007

Describe the bug

Weights are normalized using n_samples at https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/mixture/_gaussian_mixture.py#L718. They should instead be normalized using weights.sum(), as is done in _m_step() here: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/mixture/_gaussian_mixture.py#L756.

Steps/Code to Reproduce

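See the description above. Below is a minimal, illustrative sketch of where the two normalizations can diverge numerically (the float32 setup and variable names are assumptions, not taken from the codebase):

  import numpy as np

  rng = np.random.default_rng(0)
  n_samples, n_components = 1_000_000, 3

  # A row-normalized responsibility matrix: each row sums to 1, so in exact
  # arithmetic the column sums add up to n_samples and both normalizations
  # would agree.
  resp = rng.random((n_samples, n_components)).astype(np.float32)
  resp /= resp.sum(axis=1, keepdims=True)

  weights = resp.sum(axis=0)        # unnormalized mixture weights

  by_n = weights / n_samples        # normalization in _initialize (#L718)
  by_sum = weights / weights.sum()  # normalization in _m_step (#L756)

  # by_sum always sums to 1 up to a few ulps, while by_n inherits whatever
  # accumulation error is in the float32 column sums.
  print(by_n.sum(), by_sum.sum())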

Expected Results

Initial weights normalized by weights.sum(), consistent with _m_step().

Actual Results

Initial weights normalized by n_samples, which need not sum exactly to 1.

Versions

System:
    python: 3.9.13 (main, May 24 2022, 21:28:31)  [Clang 13.1.6 (clang-1316.0.21.2)]
executable: /Users/kshitijgoel/Documents/main/code.nosync/self_organizing_gmm/.venv/bin/python
   machine: macOS-12.4-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.1.1
          pip: 22.2.1
   setuptools: 62.3.2
        numpy: 1.23.1
        scipy: 1.8.1
       Cython: 0.29.30
       pandas: 1.4.3
   matplotlib: 3.5.2
       joblib: 1.1.0
threadpoolctl: 3.1.0

Built with OpenMP: False

threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /Users/kshitijgoel/Documents/main/code.nosync/self_organizing_gmm/.venv/lib/python3.9/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
        version: 0.3.20
threading_layer: pthreads
   architecture: Haswell
    num_threads: 8

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /Users/kshitijgoel/Documents/main/code.nosync/self_organizing_gmm/.venv/lib/python3.9/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.17
threading_layer: pthreads
   architecture: Haswell
    num_threads: 8
@kshitijgoel007 added the Bug and Needs Triage (issue requires triage) labels on Aug 2, 2022
@kasmith11
Contributor

I can work on this if it is still available.

@Micky774
Contributor

Micky774 commented Aug 5, 2022

Seems like a simple fix. Note that a similar change was made in PR #23034, addressing an error reported in #23032.

@Micky774 added the Easy (well-defined and straightforward way to resolve) label and removed the Needs Triage label on Aug 5, 2022
@kasmith11 linked a pull request on Aug 5, 2022 that will close this issue
@jjerphan
Member

jjerphan commented Aug 8, 2022

Hi @kshitijgoel007, thank you for reporting this.

Do you by chance have a minimal reproducible example where this incorrect weighting causes a problem?

@jsawalha

I can work on this, if it has not been finished yet.

@emirkmo
Contributor

emirkmo commented Nov 14, 2022

Please see my review on the PR, but I thought I would write my comments coherently here.

  1. I would not re-normalize a user-provided weights_init; otherwise we take away the ability to precisely define an initial state for the EM algorithm. It is up to the user to provide a properly defined and normalized weights_init.
  2. Since all the current initialization options boil down to passing a normalized responsibilities matrix through _estimate_gaussian_parameters (I tested this, see below), it is worth making sure we stay as generic as possible for this task.

To that end, it would be good to know the range of output weights currently possible from a normalized responsibilities matrix.
I guess the simple sum is the best way to go about this, to avoid floating point errors when the row sums are not exactly 1. (Although we could consider allowing a normalization callable to be specified, defaulting to sum(); a hypothetical sketch follows below. I guess this would also apply to the normalization at each _m_step, but there are no discussions or feature requests about it.)
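A minimal sketch of that normalization-callable idea, purely hypothetical (the helper name _normalize_weights and its signature are made up, not existing scikit-learn API):

  from typing import Callable

  import numpy as np

  def _normalize_weights(weights: np.ndarray, norm: Callable = np.sum) -> np.ndarray:
      # Hypothetical helper: divide raw mixture weights by a user-chosen
      # normalizer; the default reproduces the weights.sum() behaviour.
      return weights / norm(weights)

  print(_normalize_weights(np.array([2.0, 1.0, 1.0])))  # [0.5  0.25 0.25]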

I have a PR, #24812, for some requested features: passing in user-provided responsibilities, or a callable initializer that creates a responsibilities matrix. I would be happy if the normalization changes here keep that use case in mind, but since both additions also just boil down to passing a normalized responsibilities matrix as resp, it should not be a problem.

What can help, though, are the responsibilities helper and validation functions that are part of #24812. I debugged the various initial states provided by the initialization options and looked at the responsibility matrices they create. (Surprisingly to me, some are continuous in [0, 1], while others are binary 0/1.) Overall the range is always [0, 1], and the normalization is such that each row (the responsibilities of one sample across all components) sums to 1 (or, for the index-based option, at least n_components rows do, which is what the validator below checks), since the responsibilities of each sample form a probability distribution over the components. See below. (The PR also adds basic tests. Since some here are familiar with the Mixture codebase, it would be great to have a second pair of eyes.)

Helper to calculate normalized responsibilities:

  import numpy as np
  from sklearn.mixture._base import _check_shape

  def get_responsibilities(n_samples, n_components, indices=None, labels=None):
      """Create correct shaped responsibilities from array of `indices` or `labels`.
  
      Responsibilities are 1 for the component of the corresponding index or label,
      and 0 otherwise. Note that `n_components` is not the number of features
      in the data, but the number of components in the mixture model. Responsibilities
      can be used to calculate initial weights, means, and precisions of the mixture
      model components. Either `indices` or `labels` must be provided.
  
      Parameters
      ----------
      n_samples : int
          Number of samples.
  
      n_components : int
          Number of components.
  
      indices : array-like of shape (n_components,), default=None
          The index locations of the chosen components (centers) in the data
          array X of shape (n_samples, n_features); the corresponding entries
          will be set to 1 in the output.
          Either `indices` or `labels` must be provided.
  
      labels : array-like of shape (n_samples,), default=None
          Will be used over `indices` if not `None`. Each label takes a value
          i = 0 to n_components - 1, and entry i will be set to 1 for the
          corresponding sample in the output.
          Either `indices` or `labels` must be provided.
  
      Returns
      -------
      responsibilities : array, shape (n_samples, n_components)
          Responsibilities of each sample in each component.
      """
      resp = np.zeros((n_samples, n_components))
      if labels is not None:
          _check_shape(labels, (n_samples,), "labels")  # will raise if incompatible
          resp[np.arange(n_samples), labels] = 1
      elif indices is not None:
          resp[indices, np.arange(n_components)] = 1
      else:
          raise ValueError(
              "Either `indices` or `labels` must be provided, both were `None`."
          )
      return resp
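For example, mirroring the index-based initialization (the values below are made up for illustration), unchosen samples get all-zero rows, which is why the validator below only requires at least n_components rows to sum to 1:

  # Samples 0 and 3 become the initial centers of components 0 and 1.
  resp = get_responsibilities(n_samples=5, n_components=2, indices=np.array([0, 3]))
  print(resp)
  # [[1. 0.]
  #  [0. 0.]
  #  [0. 0.]
  #  [0. 1.]
  #  [0. 0.]]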

Validator for normalized responsibilities:

  import numpy as np
  from sklearn.mixture._base import _check_shape
  from sklearn.utils import check_array

  def _check_responsibilities(resp, n_components, n_samples):
      """Check the user provided 'resp'.
  
      Parameters
      ----------
      resp : array-like of shape (n_samples, n_components)
          The responsibilities for each data sample in terms of each component.
  
      n_components : int
          Number of components.
  
      n_samples : int
          Number of samples.
  
      Returns
      -------
      resp : array, shape (n_samples, n_components)
      """
      resp = check_array(resp, dtype=[np.float64, np.float32], ensure_2d=False)
  
      _check_shape(resp, (n_samples, n_components), "responsibilities")
  
      # check that the row sums lie in [0, 1] within floating point tolerance
      axis_sum = resp.sum(axis=1)
      less_1 = np.allclose(axis_sum[axis_sum > 1], 1)
      positive = np.allclose(axis_sum[axis_sum < 0] + 1, 1)
      in_domain = positive and less_1
      if not in_domain:
          raise ValueError(
              "The parameter 'responsibilities' should be normalized in "
              "the range [0, 1] within floating point tolerance, but got: "
              f"max value {np.max(resp)}, min value {np.min(resp)}"
          )
  
      # check proper normalization exists
      nrows_1 = np.sum(np.isclose(axis_sum[axis_sum >= 1], 1))
      if nrows_1 < n_components:
          raise ValueError(
              "The parameter 'responsibilities' should be normalized, "
              f"with at least `n_components`={n_components} rows summing to 1, "
              f"but got {nrows_1}."
          )
      return resp
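And a quick usage sketch tying the two functions together (assuming both are in scope; the labels are made up for illustration):

  # One-hot responsibilities from cluster labels (e.g. a k-means style init)
  labels = np.array([0, 2, 1, 0])
  resp = get_responsibilities(n_samples=4, n_components=3, labels=labels)
  resp = _check_responsibilities(resp, n_components=3, n_samples=4)  # passes

  # Column sums give the raw mixture weights; dividing by their sum (rather
  # than by n_samples) guarantees they end up summing to 1, which is the
  # point of this issue.
  weights = resp.sum(axis=0)  # array([2., 1., 1.])
  weights /= weights.sum()    # array([0.5 , 0.25, 0.25])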

@glemaitre
Member

@emirkmo I kind of agree with the principle of not renormalizing user-defined weights.

What I am wondering, however, is whether it makes sense to have unnormalized weights at init that will then get normalized in the next M-step. Intuitively, I would expect the same normalization (or non-normalization) throughout the full iterative algorithm.

@emirkmo
Contributor

emirkmo commented Jan 11, 2023

@glemaitre The init step really is a way to fine-tune the path the algorithm will take. For some use cases it barely matters at all, but for many others it can be really important, as the EM algorithm is highly sensitive to its initial state. (This can be especially important for sparse multi-dimensional data.)

So ultimately I don't think the rest of the algorithm has much to do with the init state. In fact, the initial state does not even have to be super well defined, as the M-steps seem capable of handling that in my testing.

What is most important is allowing the init state to be specified precisely (without creating bugs like the overflow that can lead to negative normalization in the M-step), and providing tools for users to do this easily and intelligently as well. (I'll shamelessly link to #24812.)

Otherwise, it is like trying to stack magnetic blocks: as you go to place one just so, the system shifts by itself and the end result isn't what you want. Very frustrating.

@thomasjpfan added the Moderate (anything that requires some knowledge of conventions and best practices) label and removed the Easy (well-defined and straightforward way to resolve) label on Jul 27, 2023