# [MRG] Ways to compute center_shift_total were different in "full" and "elkan" algorithms.#15930

Merged
merged 4 commits into from Dec 20, 2019

## Conversation

### inder128 commented Dec 19, 2019 • edited

Fixes #15831

Changes lines 249 and 250 of sklearn/cluster/_k_means_elkan.pyx.

center_shift_total was computed differently in the "full" and "elkan" algorithms.
center_shift_total is compared with tol to decide when to stop iterating.
That is why inertia and n_iter differed between the two algorithms.

In "full" (lines 442 and 443 of sklearn/cluster/_k_means.py)
it is computed as:
center_shift_total = squared_norm(centers_old - centers)
if center_shift_total <= tol:  # comparison with tol

While in "elkan" (lines 248, 249 and 250 of sklearn/cluster/_k_means_elkan.pyx)
it was computed as:
center_shift = np.sqrt(np.sum((centers_ - new_centers) ** 2, axis=1))
center_shift_total = np.sum(center_shift)
if center_shift_total ** 2 < tol:  # comparison with tol

I changed the "elkan" code above so that it computes center_shift_total the same way as the "full" algorithm.
It is now computed as:
center_shift = np.sqrt(np.sum((centers_ - new_centers) ** 2, axis=1))
center_shift_total = np.sum(center_shift ** 2)
if center_shift_total < tol:  # comparison with tol

With this change, the "elkan" code gives the same value as "full".
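
To make the mismatch concrete, here is a minimal NumPy sketch of the three quantities discussed above. The center arrays are hypothetical values chosen for illustration, and `np.sum(... ** 2)` stands in for scikit-learn's `squared_norm` helper. It is not the library code itself.

```python
import numpy as np

# Hypothetical centers before and after one update (3 clusters, 2 features).
centers_old = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]])
centers_new = np.array([[0.3, 0.4], [1.0, 1.0], [4.0, 0.5]])

# "full": squared Frobenius norm of the difference between the center arrays.
shift_full = np.sum((centers_old - centers_new) ** 2)

# Per-center Euclidean shift, as computed in the "elkan" code.
center_shift = np.sqrt(np.sum((centers_old - centers_new) ** 2, axis=1))

# "elkan" before the fix: square of the sum of the per-center shifts.
shift_elkan_old = np.sum(center_shift) ** 2

# "elkan" after the fix: sum of the squared per-center shifts.
shift_elkan_new = np.sum(center_shift ** 2)

print(f"full:          {shift_full:.4f}")
print(f"elkan (old):   {shift_elkan_old:.4f}")
print(f"elkan (fixed): {shift_elkan_new:.4f}")
```

Because the per-center shifts are non-negative, the square of their sum is never smaller than the sum of their squares. The old "elkan" criterion was therefore stricter than the "full" one for the same tol.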

 changes in sklearn/cluster/_k_means_elkan.pyx in line 249 and 250. 
 cb10786 
closed this Dec 19, 2019
reopened this Dec 19, 2019
changed the title ways to compute center_shift_total were different if "full" and "alkan" algorithims. Ways to compute center_shift_total were different if "full" and "alkan" algorithims. Dec 19, 2019
changed the title Ways to compute center_shift_total were different if "full" and "alkan" algorithims. Ways to compute center_shift_total were different in "full" and "alkan" algorithims. Dec 19, 2019
reviewed

### ogrisel left a comment • edited

Nice catch. Could you please add a non-regression test similar to the example provided in the #15831 bug report?

requested a review from jeremiedbb Dec 19, 2019

### jeremiedbb commented Dec 20, 2019

 I noticed this issue in #11950 and came up with the same fix.

### inder128 commented Dec 20, 2019

 I ran the tests. Results were as expected.

CODE:
from sklearn.cluster import KMeans
import numpy as np
data = np.array(range(1000)).reshape(-1, 1) / 1000.
k = 10
kwargs = {"n_clusters": k, "init": data[:k], "n_init": 1, "tol": 1e-4}
km1 = KMeans(algorithm="elkan", **kwargs); km1.fit(data)
km2 = KMeans(algorithm="full", **kwargs); km2.fit(data)
print(f"elkan: inertia= {km1.inertia_:10.7f} n_iter= {km1.n_iter_}")
print(f"full: inertia= {km2.inertia_:10.7f} n_iter= {km2.n_iter_}")

Output, output (when tol = 1e-8), and output (when tol = 1e-2): [screenshots not preserved]
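
The effect of the stopping rule on n_iter can also be illustrated without scikit-learn. The sketch below is a toy Lloyd loop in plain NumPy, run on the same data and initial centers as the manual check. It is an assumption-laden illustration, not the library's implementation: `toy_lloyd`, `old_elkan_criterion` and `fixed_criterion` are hypothetical names, and tol is used as-is here, whereas scikit-learn rescales it by the variance of the data.

```python
import numpy as np

def toy_lloyd(X, init, tol, converged, max_iter=300):
    """Plain Lloyd iterations; `converged(center_shift, tol)` decides when to stop."""
    centers = init.astype(float).copy()
    for n_iter in range(1, max_iter + 1):
        # Assign each sample to its nearest center (squared Euclidean distance).
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Recompute centers; an empty cluster keeps its previous center.
        new_centers = centers.copy()
        for j in range(centers.shape[0]):
            members = X[labels == j]
            if members.shape[0] > 0:
                new_centers[j] = members.mean(axis=0)
        # Per-center Euclidean shift between consecutive iterations.
        center_shift = np.sqrt(((centers - new_centers) ** 2).sum(axis=1))
        centers = new_centers
        if converged(center_shift, tol):
            break
    return centers, n_iter

def old_elkan_criterion(center_shift, tol):
    # Square of the summed shifts, as in the "elkan" code before the fix.
    return center_shift.sum() ** 2 < tol

def fixed_criterion(center_shift, tol):
    # Sum of the squared shifts, matching the "full" computation.
    return (center_shift ** 2).sum() <= tol

data = np.arange(1000).reshape(-1, 1) / 1000.0
k = 10
init = data[:k]

_, n_iter_old = toy_lloyd(data, init, 1e-4, old_elkan_criterion)
_, n_iter_fixed = toy_lloyd(data, init, 1e-4, fixed_criterion)
print(f"old elkan criterion: n_iter = {n_iter_old}")
print(f"fixed criterion:     n_iter = {n_iter_fixed}")
```

Both runs follow the same sequence of centers, so the stricter old criterion can only stop at the same iteration or later. An automated test in the real test suite would instead fit KMeans with both algorithms and compare inertia_ and n_iter_.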

 Add non-regression test 
 c41e2ed 

### ogrisel commented Dec 20, 2019

 @inder128 what I meant was to include your manual tests as an automated test in the scikit-learn test suite, as I just did in c41e2ed.

added 2 commits Dec 20, 2019
 Merge remote-tracking branch 'origin/master' into my_feature 
 1b75662 
 Add whatsnew for 0.22.1 
 c2b9690 
approved these changes

### ogrisel left a comment

I added the whatsnew entry. LGTM.

added this to the 0.22.1 milestone Dec 20, 2019
changed the title Ways to compute center_shift_total were different in "full" and "alkan" algorithims. Ways to compute center_shift_total were different in "full" and "elkan" algorithims. Dec 20, 2019
changed the title Ways to compute center_shift_total were different in "full" and "elkan" algorithims. Ways to compute center_shift_total were different in "full" and "elkan" algorithms. Dec 20, 2019
changed the title Ways to compute center_shift_total were different in "full" and "elkan" algorithms. [MRG] Ways to compute center_shift_total were different in "full" and "elkan" algorithms. Dec 20, 2019

### inder128 commented Dec 20, 2019

 @ogrisel Thank you very much. I don't know much yet. I have just started with open source. This was my first PR.

### ogrisel commented Dec 20, 2019 • edited

 I believe that the failure on the pylatest_conda_mkl Continuous Integration job is random and unrelated (I am pretty sure I had already seen this SpectralClustering doctest fail in the past). Let me re-run this job to see if it's the case.

approved these changes

### jeremiedbb left a comment

lgtm. let's merge

merged commit 1b55e2f into scikit-learn:master Dec 20, 2019
21 checks passed

### ogrisel commented Dec 20, 2019

 Thank you very much for the fix @inder128! Welcome to the scikit-learn contributors' community.

### gittar commented Dec 20, 2019

 Thanks to all of you (@ogrisel, @jeremiedbb, @inder128) for picking this up and resolving it so quickly. Great job, @inder128!

deleted the my_feature branch Dec 23, 2019
pushed a commit to ogrisel/scikit-learn that referenced this issue Dec 31, 2019
 [MRG] Ways to compute center_shift_total were different in "full" and… 
 296d077 
… "elkan" algorithms. (scikit-learn#15930)
pushed a commit to ogrisel/scikit-learn that referenced this issue Jan 2, 2020
 [MRG] Ways to compute center_shift_total were different in "full" and… 
 d35a8da 
… "elkan" algorithms. (scikit-learn#15930)
added a commit that referenced this issue Jan 2, 2020
 0.22.1 release (#15998) 
 e5698bd 
* DOC fixed default values in dbscan (#15753)

* DOC fix incorrect branch reference in contributing doc (#15779)

* DOC relabel Feature -> Efficiency in change log (#15770)

* DOC fixed Birch default value (#15780)

* STY Minior change on code padding in website theme (#15768)

* DOC Fix yticklabels order in permutation importances example (#15799)

* Fix yticklabels order in permutation importances example

* STY Update wrapper width (#15793)

* DOC Long sentence was hard to parse and ambiguous in _classification.py (#15769)

* DOC Removed duplicate 'classes_' attribute in Naive Bayes classifiers (#15811)

* BUG Fixes pandas dataframe bug with boolean dtypes (#15797)

* BUG Returns only public estimators in all_estimators (#15380)

* DOC improve doc for multiclass and types_of_target (#15333)

* TST Increases tol for check_pca_float_dtype_preservation assertion (#15775)

* update _alpha_grid class in _coordinate_descent.py (#15835)

* FIX Explicit conversion of ndarray to object dtype. (#15832)

* BLD Parallelize sphinx builds on circle ci (#15745)

* DOC correct url for preprocessing (#15853)

* MNT avoid generating too many cross links in examples (#15844)

* DOC Correct wrong doc in precision_recall_fscore_support (#15833)

Documenting the changes in #15775

* DOC correct indents in docstring _split.py (#15843)

* DOC fix docstring of KMeans based on sklearn guideline (#15754)

* DOC fix docstring of AgglomerativeClustering based on sklearn guideline (#15764)

* DOC fix docstring of AffinityPropagation based on sklearn guideline (#15777)

* DOC fixed SpectralCoclustering and SpectralBiclustering docstrings following sklearn guideline (#15778)

* DOC fix FeatureAgglomeration and MiniBatchKMeans docstring following sklearn guideline (#15809)

* TST Specify random_state in test_cv_iterable_wrapper (#15829)

* DOC Include LinearSV{C, R} in models that support sample_weights (#15871)

* DOC correct some indents (#15875)

* DOC Fix documentation of default values in tree classes (#15870)

* DOC fix typo in docstring (#15887)

* DOC FIX default value for xticks_rotation in plot_confusion_matrix (#15890)

* Fix imports in pip3 ubuntu by suffixing affected files (#15891)

* MNT Raise erorr when normalize is invalid in confusion_matrix (#15888)

* [MRG] DOC Increases search results for API object results (#15574)

* MNT Ignores warning in pyamg for deprecated scipy.random (#15914)

* DOC Instructions to troubleshoot Windows path length limit (#15916)

* DOC clarify doc-string of roc_auc_score and add references (#15293)

* MNT Adds skip lint to azure pipeline CI (#15904)

* BLD Fixes bug when building with NO_MATHJAX=1 (#15892)

* [MRG] BUG Checks to number of axes in passed in ax more generically (#15760)

* EXA Minor fixes in plot_sparse_logistic_regression_20newsgroups.py (#15925)

* BUG Do not shadow public functions with deprecated modules (#15846)

* Import sklearn._distributor_init first (#15929)

* DOC Fix typos, via a Levenshtein-style corrector (#15923)

* DOC in canned comment, mention that PR title becomes commit me… (#15935)

* DOC/EXA Correct spelling of "Classification" (#15938)

* BUG fix pip3 ubuntu update by suffixing file (#15928)

* [MRG] Ways to compute center_shift_total were different in "full" and "elkan" algorithms. (#15930)

* TST Fixes integer test for train and test indices (#15941)

* BUG ensure that parallel/sequential give the same permutation importances (#15933)

* Formatting fixes in changelog (#15944)

* MRG FIX: order of values of self.quantiles_ in QuantileTransformer (#15751)

* [MRG] BUG Fixes constrast in plot_confusion_matrix (#15936)

* BUG use zero_division argument in classification_report (#15879)

* DOC change logreg solver in plot_logistic_path (#15927)

* DOC fix whats new ordering (#15961)

* COSMIT use np.iinfo to define the max int32 (#15960)

* DOC Apply numpydoc validation to VotingRegressor methods (#15969)

* DOC improve naive_bayes.py documentation (#15943)

* DOC Fix default values in Perceptron documentation (#15965)

* DOC Improve default values in logistic documentation (#15966)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

* DOC Improve documentation of default values for imputers (#15964)

* EXA/MAINT Simplify code in manifold learning example (#15949)

* DOC Improve default values in SGD documentation (#15967)

* DOC Improve defaults in neural network documentation (#15968)

* FIX use safe_sparse_dot for callable kernel in LabelSpreading (#15868)

* BUG Adds attributes back to check_is_fitted (#15947)

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

* DOC update check_is_fitted what's new

* DOC change python-devel to python3-devel for yum. (#15986)

* DOC Correct the default value of values_format in plot_confusion_matrix (#15981)

* [MRG] MNT Updates pypy to use 7.2.0 (#15954)

* FIX Add missing 'values_format' param to disp.plot() in plot_confusion_matrix (#15937)

* FIX support scalar values in fit_params in SearchCV (#15863)

* support a scalar fit param

* pep8

* TST add test for desired behavior

* FIX introduce _check_fit_params to validate parameters

* DOC update whats new

* TST tests both grid-search and randomize-search

* PEP8

* DOC revert unecessary change

* TST add test for _check_fit_params

* TST fixes

* DOC whats new

* DOC whats new

* TST revert type of error

* PEP8

* TST fix test by passing X

* avoid to call twice tocsr

* add case column/row sparse in check_fit_param

* provide optional indices

* TST check content when indexing params

* PEP8

* TST update tests to check identity

* stupid fix

* use a distribution in RandomizedSearchCV

* MNT add lightgbm to one of the CI build

* move to another build

* do not install dependencies lightgbm

* MNT comments on the CI setup

* Test fit_params compat without dependency on lightgbm

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

* Remove abstractmethod that silently brake downstream packages (#15996)

* FIX restore BaseNB._check_X without abstractmethod decoration (#15997)

* Update v0.22 changelog for 0.22.1 (#16002)

- set the date
- move entry for quantile transformer to the 0.22.1 section
- fix alphabetical ordering of modules

* STY Removes hidden scroll bar (#15999)

* Flake8 fixes

* Fix: remove left-over lines that should have been deleted during conflict resolution when rebasing

* Fix missing imports

* Update version

* Fix test_check_is_fitted

* Make test_sag_regressor_computed_correctly deterministic (#16003)

Fix #15818.

Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
Co-authored-by: Thomas J Fan <thomasjpfan@gmail.com>
Co-authored-by: Matt Hall <matt@agilegeoscience.com>
Co-authored-by: Kathryn Poole <kathryn.poole2@gmail.com>
Co-authored-by: lucyleeow <jliu176@gmail.com>
Co-authored-by: JJmistry <jayminm22@gmail.com>
Co-authored-by: Juan Carlos Alfaro Jiménez <JuanCarlos.Alfaro@uclm.es>
Co-authored-by: SylvainLan <sylvain.s.lannuzel@gmail.com>
Co-authored-by: Nicolas Hug <contact@nicolas-hug.com>
Co-authored-by: Hanmin Qin <qinhanmin2005@sina.com>
Co-authored-by: Sambhav Kothari <sambhavs.email@gmail.com>
Co-authored-by: shivamgargsya <shivam.gargshya@gmail.com>
Co-authored-by: Reshama Shaikh <rs2715@stern.nyu.edu>
Co-authored-by: Loïc Estève <loic.esteve@ymail.com>
Co-authored-by: Brian Wignall <BrianWignall@gmail.com>
Co-authored-by: Ritchie Ng <ritchieng@u.nus.edu>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Tirth Patel <tirthasheshpatel@gmail.com>
Co-authored-by: Bibhash Chandra Mitra <bibhashm220896@gmail.com>
Co-authored-by: Alexandre Gramfort <alexandre.gramfort@m4x.org>
Co-authored-by: Pulkit Mehta <pulkit_mehta_work@yahoo.com>
Co-authored-by: Niklas <niklas.sm+github@gmail.com>
Co-authored-by: Windber <guolipengyeah@126.com>
Co-authored-by: Brigitta Sipőcz <b.sipocz@gmail.com>
pushed a commit to panpiort8/scikit-learn that referenced this issue Mar 3, 2020
 [MRG] Ways to compute center_shift_total were different in "full" and… 
 b3135e3 
… "elkan" algorithms. (scikit-learn#15930)
pushed a commit to BorgwardtLab/scikit-learn that referenced this issue Apr 24, 2020
 0.22.1 release (scikit-learn#15998) 
 2e162b2 