DOC Add links to KMeans examples in docstrings and the user guide #27799

Merged on Jan 6, 2024 (14 commits)

@marenwestermann marenwestermann commented Nov 17, 2023

Reference Issues/PRs

towards #26927

What does this implement/fix? Explain your changes.

Adds links to examples in the docstrings and the user guide which demonstrate how to use K-Means.

Any other comments?

I started with the example plot_cluster_iris.py and then realised that it probably makes sense to group all the links related to K-Means examples in one PR. So I will keep working on adding links to examples which show how to use K-Means.

Edit: the examples are

  • plot_cluster_iris.py
  • plot_color_quantization.py
  • plot_kmeans_assumptions.py
  • plot_kmeans_digits.py
  • plot_kmeans_silhouette_analysis.py
  • plot_mini_batch_kmeans.py
  • plot_document_clustering.py

Note: there can be more than one PR per example script because an example might be referenced in different locations. For example, there is an existing open PR for plot_document_clustering.py which links this example in the docs of another estimator.

github-actions bot commented Nov 17, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: ca33b93.

@marenwestermann marenwestermann changed the title DOC [WIP] Add links to KMeans examples in docstrings and the user guide DOC Add links to KMeans examples in docstrings and the user guide Nov 24, 2023

@ArturoAmorQ ArturoAmorQ left a comment

Thanks for the PR @marenwestermann! Here is a batch of comments :)

@@ -218,7 +222,9 @@ initializations of the centroids. One method to help address this issue is the
 k-means++ initialization scheme, which has been implemented in scikit-learn
 (use the ``init='k-means++'`` parameter). This initializes the centroids to be
 (generally) distant from each other, leading to probably better results than
-random initialization, as shown in the reference.
+random initialization, as shown in the reference. For a detailed example of
+comaparing different initialization schemes refer to

Suggested change
-comaparing different initialization schemes refer to
+comparing different initialization schemes, refer to
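
For readers following the thread, here is a minimal sketch of what comparing the two initialization schemes mentioned in this hunk might look like. The synthetic dataset, its parameters, and the choice of `n_init=1` are illustrative assumptions and are not taken from the linked example script.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data; the blob parameters are illustrative assumptions.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

inertias = {}
for init in ("k-means++", "random"):
    # n_init=1 so that each fit reflects a single initialization.
    km = KMeans(n_clusters=4, init=init, n_init=1, random_state=0).fit(X)
    inertias[init] = km.inertia_

for init, inertia in inertias.items():
    print(f"{init}: inertia={inertia:.2f}")
```

A single run per scheme says little on its own; repeating the loop over several `random_state` values gives a fairer picture of how the two schemes differ.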

@@ -231,7 +237,17 @@ weight of 2 to a sample is equivalent to adding a duplicate of that sample
 to the dataset :math:`X`.

 K-means can be used for vector quantization. This is achieved using the
-transform method of a trained model of :class:`KMeans`.
+transform method of a trained model of :class:`KMeans`. For an example of

Suggested change
-transform method of a trained model of :class:`KMeans`. For an example of
+`transform` method of a trained model of :class:`KMeans`. For an example of
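
As a rough illustration of the `transform` method discussed in this hunk, the sketch below fits `KMeans` on random data (an assumption made only for the example) and shows the distance-to-centroid representation alongside a codebook-style reconstruction built from `predict` and `cluster_centers_`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Random data stands in for real features; this is an illustrative assumption.
rng = np.random.RandomState(0)
X = rng.rand(100, 3)

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# transform maps each sample to its distances to the 5 cluster centers.
distances = km.transform(X)

# predict returns the index of the assigned centroid (the "code"), and
# replacing each sample by that centroid gives a quantized version of X.
codes = km.predict(X)
X_quantized = km.cluster_centers_[codes]

print(distances.shape, X_quantized.shape)
```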

using the iris dataset

* :ref:`sphx_glr_auto_examples_text_plot_document_clustering.py`: Document clustering
using KMeans and MiniBatchKMeans based on sparse data

Suggested change
-using KMeans and MiniBatchKMeans based on sparse data
+using :class:`KMeans` and :class:`MiniBatchKMeans` based on sparse data


.. topic:: Examples:

* :ref:`sphx_glr_auto_examples_cluster_plot_cluster_iris.py`: Example usage of K-Means

Suggested change
-* :ref:`sphx_glr_auto_examples_cluster_plot_cluster_iris.py`: Example usage of K-Means
+* :ref:`sphx_glr_auto_examples_cluster_plot_cluster_iris.py`: Example usage of :class:`KMeans`

Comment on lines 310 to 311
* :ref:`sphx_glr_auto_examples_cluster_plot_mini_batch_kmeans.py`: Comparison of KMeans and
MiniBatchKMeans

Suggested change
-* :ref:`sphx_glr_auto_examples_cluster_plot_mini_batch_kmeans.py`: Comparison of KMeans and
-MiniBatchKMeans
+* :ref:`sphx_glr_auto_examples_cluster_plot_mini_batch_kmeans.py`: Comparison of
+:class:`KMeans` and :class:`MiniBatchKMeans`

-* :ref:`sphx_glr_auto_examples_text_plot_document_clustering.py`: Document clustering using sparse
-MiniBatchKMeans
+* :ref:`sphx_glr_auto_examples_text_plot_document_clustering.py`: Document clustering
+using KMeans and MiniBatchKMeans based on sparse data

Suggested change
-using KMeans and MiniBatchKMeans based on sparse data
+using :class:`KMeans` and :class:`MiniBatchKMeans` based on sparse data

- top right: What the effect of a bad initialization is
- top right: What using three clusters would deliver.

- bottom left: What the effect of a bad initialization is

Maybe this can be done in another PR, but currently it seems that the initialization is good. I would rather pass a fixed `random_state` to `KMeans` instead of setting a global `np.random.seed`.
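
A minimal sketch of the pattern suggested in this comment, assuming synthetic blob data and an arbitrary seed value of 5 (neither comes from the example script): the estimator is seeded through its own `random_state` parameter rather than through the global NumPy seed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data; not the dataset used in the example under review.
X, _ = make_blobs(n_samples=200, centers=3, random_state=42)

# Instead of calling np.random.seed(...) globally, seed the estimator itself.
# init="random" with n_init=1 makes the result depend on that single seed.
km_a = KMeans(n_clusters=3, init="random", n_init=1, random_state=5).fit(X)
km_b = KMeans(n_clusters=3, init="random", n_init=1, random_state=5).fit(X)

# Two estimators built with the same random_state reach the same centers.
same = np.allclose(km_a.cluster_centers_, km_b.cluster_centers_)
print(same)
```

Seeding the estimator keeps the randomness local, so other code in the same script is unaffected by the seed.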

Comment on lines 102 to 103
# using the model results itself. In that case, the :ref:`Silhouette Coefficient
# <sphx_glr_auto_examples_cluster_plot_kmeans_silhouette_analysis.py>` comes in handy.

I would rather say something similar to
"In that case the Silhouette analysis comes in handy. See sphx_glr_auto_examples_cluster_plot_kmeans_silhouette_analysis.py for an example on how to do it."
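
For context, a silhouette-based check of the number of clusters could be sketched as follows. The dataset and the candidate range of `k` are assumptions made for illustration; the linked silhouette analysis example goes further than this.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Illustrative data with a known cluster structure.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

scores = {}
for k in (2, 3, 4, 5, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Mean silhouette coefficient over all samples, in the range [-1, 1].
    scores[k] = silhouette_score(X, labels)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
```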

@@ -41,7 +41,7 @@
 china = load_sample_image("china.jpg")

 # Convert to floats instead of the default 8 bits integer coding. Dividing by
-# 255 is important so that plt.imshow behaves works well on float data (need to
+# 255 is important so that plt.imshow works well on float data (need to

Nice catch!

@ArturoAmorQ ArturoAmorQ left a comment

Now it does LGTM, thanks @marenwestermann and sorry for taking so long to answer! (I was/still am off on holidays)

@ArturoAmorQ ArturoAmorQ merged commit 056864d into scikit-learn:main Jan 6, 2024
27 checks passed
@marenwestermann marenwestermann deleted the kmeans-examples branch January 6, 2024 12:34
jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Jan 17, 2024
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Feb 10, 2024