
Weights are being normalized using number of samples as opposed to sum in GaussianMixture #24085

Open
kshitijgoel007 opened this issue Aug 2, 2022 · 7 comments · May be fixed by #24119

Labels: Bug · Moderate (anything that requires some knowledge of conventions and best practices) · module:mixture

Comments

@kshitijgoel007

Describe the bug

Weights are normalized using n_samples at https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/mixture/_gaussian_mixture.py#L718. They should instead be normalized using weights.sum(), as is done in _m_step() here: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/mixture/_gaussian_mixture.py#L756.

Steps/Code to Reproduce

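See the description above. Below is a minimal, illustrative sketch of where the two normalizations can diverge numerically (the float32 setup and variable names are assumptions, not taken from the codebase):

  import numpy as np

  rng = np.random.default_rng(0)
  n_samples, n_components = 1_000_000, 3

  # A row-normalized responsibility matrix: each row sums to 1, so in exact
  # arithmetic the column sums add up to n_samples and both normalizations
  # would agree.
  resp = rng.random((n_samples, n_components)).astype(np.float32)
  resp /= resp.sum(axis=1, keepdims=True)

  weights = resp.sum(axis=0)        # unnormalized mixture weights

  by_n = weights / n_samples        # normalization in _initialize (#L718)
  by_sum = weights / weights.sum()  # normalization in _m_step (#L756)

  # by_sum always sums to 1 up to a few ulps, while by_n inherits whatever
  # accumulation error is in the float32 column sums.
  print(by_n.sum(), by_sum.sum())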

Expected Results

Initial weights normalized by weights.sum(), consistent with _m_step().

Actual Results

Initial weights normalized by n_samples, which need not sum exactly to 1.

Versions

System:
    python: 3.9.13 (main, May 24 2022, 21:28:31)  [Clang 13.1.6 (clang-1316.0.21.2)]
executable: /Users/kshitijgoel/Documents/main/code.nosync/self_organizing_gmm/.venv/bin/python
   machine: macOS-12.4-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.1.1
          pip: 22.2.1
   setuptools: 62.3.2
        numpy: 1.23.1
        scipy: 1.8.1
       Cython: 0.29.30
       pandas: 1.4.3
   matplotlib: 3.5.2
       joblib: 1.1.0
threadpoolctl: 3.1.0

Built with OpenMP: False

threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /Users/kshitijgoel/Documents/main/code.nosync/self_organizing_gmm/.venv/lib/python3.9/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
        version: 0.3.20
threading_layer: pthreads
   architecture: Haswell
    num_threads: 8

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /Users/kshitijgoel/Documents/main/code.nosync/self_organizing_gmm/.venv/lib/python3.9/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.17
threading_layer: pthreads
   architecture: Haswell
    num_threads: 8
@kshitijgoel007 added the Bug and Needs Triage (issue requires triage) labels on Aug 2, 2022
@kasmith11
Contributor

I can work on this if it is still available.

@Micky774
Contributor

Micky774 commented Aug 5, 2022

Seems like a simple fix. Note that a similar change was made in PR #23034, addressing an error reported in #23032.

@Micky774 added the Easy (well-defined and straightforward way to resolve) label and removed the Needs Triage label on Aug 5, 2022
@kasmith11 linked a pull request on Aug 5, 2022 that will close this issue
@jjerphan
Member

jjerphan commented Aug 8, 2022

Hi @kshitijgoel007, thank you for reporting this.

Do you by chance have a minimal reproducible example where this incorrect weighting causes a problem?

@jsawalha

I can work on this, if it has not been finished yet.

@emirkmo
Contributor

emirkmo commented Nov 14, 2022

Please see my review on the PR, but I thought I would write my comments coherently here.

  1. I would not re-normalize a user-provided weights_init; otherwise we take away the ability to precisely define an initial state for the EM algorithm. It is up to the user to provide a properly defined and normalized weights_init.
  2. Since all the current initialization options boil down to passing a normalized responsibilities matrix through _estimate_gaussian_parameters (I tested this, see below), it is worth making sure we stay as generic as possible for this task.

To that end, it would be good to know the range of output weights currently possible from a normalized responsibilities matrix.
I guess the simple sum is the best way to go about this, to avoid floating point errors when the row sums are not exactly 1. (Although we could consider allowing a normalization callable to be specified, defaulting to sum(); a hypothetical sketch follows below. I guess this would also apply to the normalization at each _m_step, but there are no discussions or feature requests about it.)
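A minimal sketch of that normalization-callable idea, purely hypothetical (the helper name _normalize_weights and its signature are made up, not existing scikit-learn API):

  from typing import Callable

  import numpy as np

  def _normalize_weights(weights: np.ndarray, norm: Callable = np.sum) -> np.ndarray:
      # Hypothetical helper: divide raw mixture weights by a user-chosen
      # normalizer; the default reproduces the weights.sum() behaviour.
      return weights / norm(weights)

  print(_normalize_weights(np.array([2.0, 1.0, 1.0])))  # [0.5  0.25 0.25]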

I have a PR, #24812, for some requested features: passing in user-provided responsibilities, or a callable initializer that creates a responsibilities matrix. I would be happy if the normalization changes here keep that use case in mind, but since both additions also just boil down to passing a normalized responsibilities matrix as resp, it should not be a problem.

What can help, though, are the responsibilities helper and validation functions that are part of #24812. I debugged the various initial states provided by the initialization options and looked at the responsibility matrices they create. (Surprisingly to me, some are continuous in [0, 1], while others are binary 0/1.) Overall the range is always [0, 1], and the normalization is such that each row (the responsibilities of one sample across all components) sums to 1 (or, for the index-based option, at least n_components rows do, which is what the validator below checks), since the responsibilities of each sample form a probability distribution over the components. See below. (The PR also adds basic tests. Since some here are familiar with the Mixture codebase, it would be great to have a second pair of eyes.)

Helper to calculate normalized responsibilities:

  import numpy as np
  from sklearn.mixture._base import _check_shape

  def get_responsibilities(n_samples, n_components, indices=None, labels=None):
      """Create correct shaped responsibilities from array of `indices` or `labels`.
  
      Responsibilities are 1 for the component of the corresponding index or label,
      and 0 otherwise. Note that `n_components` is not the number of features
      in the data, but the number of components in the mixture model. Responsibilities
      can be used to calculate initial weights, means, and precisions of the mixture
      model components. Either `indices` or `labels` must be provided.
  
      Parameters
      ----------
      n_samples : int
          Number of samples.
  
      n_components : int
          Number of components.
  
      indices : array-like of shape (n_components,), default=None
          The index locations of the chosen components (centers) in the data
          array X of shape (n_samples, n_features); the corresponding entries
          will be set to 1 in the output.
          Either `indices` or `labels` must be provided.
  
      labels : array-like of shape (n_samples,), default=None
          Will be used over `indices` if not `None`. Each label takes a value
          i = 0 to n_components - 1, and entry i will be set to 1 for the
          corresponding sample in the output.
          Either `indices` or `labels` must be provided.
  
      Returns
      -------
      responsibilities : array, shape (n_samples, n_components)
          Responsibilities of each sample in each component.
      """
      resp = np.zeros((n_samples, n_components))
      if labels is not None:
          _check_shape(labels, (n_samples,), "labels")  # will raise if incompatible
          resp[np.arange(n_samples), labels] = 1
      elif indices is not None:
          resp[indices, np.arange(n_components)] = 1
      else:
          raise ValueError(
              "Either `indices` or `labels` must be provided, both were `None`."
          )
      return resp
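For example, mirroring the index-based initialization (the values below are made up for illustration), unchosen samples get all-zero rows, which is why the validator below only requires at least n_components rows to sum to 1:

  # Samples 0 and 3 become the initial centers of components 0 and 1.
  resp = get_responsibilities(n_samples=5, n_components=2, indices=np.array([0, 3]))
  print(resp)
  # [[1. 0.]
  #  [0. 0.]
  #  [0. 0.]
  #  [0. 1.]
  #  [0. 0.]]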

Validator for normalized responsibilities:

  import numpy as np
  from sklearn.mixture._base import _check_shape
  from sklearn.utils import check_array

  def _check_responsibilities(resp, n_components, n_samples):
      """Check the user provided 'resp'.
  
      Parameters
      ----------
      resp : array-like of shape (n_samples, n_components)
          The responsibilities for each data sample in terms of each component.
  
      n_components : int
          Number of components.
  
      n_samples : int
          Number of samples.
  
      Returns
      -------
      resp : array, shape (n_samples, n_components)
      """
      resp = check_array(resp, dtype=[np.float64, np.float32], ensure_2d=False)
  
      _check_shape(resp, (n_samples, n_components), "responsibilities")
  
      # check that the row sums lie in [0, 1] within floating point tolerance
      axis_sum = resp.sum(axis=1)
      less_1 = np.allclose(axis_sum[axis_sum > 1], 1)
      positive = np.allclose(axis_sum[axis_sum < 0] + 1, 1)
      in_domain = positive and less_1
      if not in_domain:
          raise ValueError(
              "The parameter 'responsibilities' should be normalized in "
              "the range [0, 1] within floating point tolerance, but got: "
              f"max value {np.max(resp)}, min value {np.min(resp)}"
          )
  
      # check proper normalization exists
      nrows_1 = np.sum(np.isclose(axis_sum[axis_sum >= 1], 1))
      if nrows_1 < n_components:
          raise ValueError(
              "The parameter 'responsibilities' should be normalized, "
              f"with at least `n_components`={n_components} rows summing to 1, "
              f"but got {nrows_1}."
          )
      return resp
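And a quick usage sketch tying the two functions together (assuming both are in scope; the labels are made up for illustration):

  # One-hot responsibilities from cluster labels (e.g. a k-means style init)
  labels = np.array([0, 2, 1, 0])
  resp = get_responsibilities(n_samples=4, n_components=3, labels=labels)
  resp = _check_responsibilities(resp, n_components=3, n_samples=4)  # passes

  # Column sums give the raw mixture weights; dividing by their sum (rather
  # than by n_samples) guarantees they end up summing to 1, which is the
  # point of this issue.
  weights = resp.sum(axis=0)  # array([2., 1., 1.])
  weights /= weights.sum()    # array([0.5 , 0.25, 0.25])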

@glemaitre
Member

@emirkmo I kind of agree with the principle of not renormalizing user-defined weights.

What I am wondering, however, is whether it makes sense to have unnormalized weights at init that will then get normalized in the next M-step. Intuitively, I would expect the same normalization (or non-normalization) throughout the full iterative algorithm.

@emirkmo
Contributor

emirkmo commented Jan 11, 2023

@glemaitre The init step really is a way to fine-tune the path the algorithm will take. For some use cases it barely matters at all, but for many others it can be really important, as the EM algorithm is highly sensitive to its initial state. (This can be especially important for sparse multi-dimensional data.)

So ultimately I don't think the rest of the algorithm has much to do with the init state. In fact, the initial state does not even have to be super well defined, as the M-steps seem capable of handling that in my testing.

What is most important is allowing the init state to be specified precisely (without creating bugs like the overflow that can lead to negative normalization in the M-step), and providing tools for users to do this easily and intelligently as well. (I'll shamelessly link to #24812.)

Otherwise, it is like trying to stack magnetic blocks: as you go to place one just so, the system shifts by itself and the end result isn't what you want. Very frustrating.

@thomasjpfan added the Moderate (anything that requires some knowledge of conventions and best practices) label and removed the Easy (well-defined and straightforward way to resolve) label on Jul 27, 2023