KMeans not running in parallel when init='random' #12949

Closed
@fwillo

Description

Dear all,

I see different behaviour from sklearn.cluster.KMeans depending on whether init='random' or init='k-means++' is used together with n_jobs=-1 (or any n_jobs other than 1). With init='random', n_jobs=-1 and n_clusters>1, not all CPUs are used; I monitored this with htop. With init='k-means++' all CPUs are used. This happens only on Linux (tested on Red Hat and Ubuntu; the Linux entry in the Versions section is the Ubuntu machine). On my Windows machine, monitored with Task Manager, the behaviour does not occur.

Steps/Code to Reproduce

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
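from tqdm import tqdm  # progress bar while looping over the number of clusters

# make_blobs returns (X, y); only X = A[0] is used below
A = make_blobs(n_samples=60000, n_features=48, centers=8)

# init='random': all cores are busy for i=1, but only one core for i > 1 (monitored with htop)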
for i in tqdm(range(1, 10)):
    model = KMeans(n_clusters=i, n_jobs=-1, n_init=200, max_iter=500, init='random').fit(A[0])

# init='k-means++': all cores are busy for every i
for i in tqdm(range(1, 10)):
    model = KMeans(n_clusters=i, n_jobs=-1, n_init=200, max_iter=500, init='k-means++').fit(A[0])

Expected Results

No difference regarding usage of cores between 'random' and 'k-means++'.

Actual Results

With init='random', all cores are used only when n_clusters=1; for any larger value only one core is used. With init='k-means++', all cores are used for every value of n_clusters.
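
For anyone who wants to record the core usage without watching htop, here is a minimal sketch that is not part of the original report. It assumes a Linux system and reads the per-core counters in /proc/stat before and after a call. The helper names (parse_proc_stat, count_busy_cores, busy_cores_during) and the 50% threshold are hypothetical choices for illustration. The counters are system-wide, so other load on the machine will also be counted.

```python
def parse_proc_stat(text):
    """Return {cpu_name: (busy_jiffies, total_jiffies)} for each per-core line."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Per-core lines look like "cpu0 user nice system idle iowait ...".
        # Skip the aggregate "cpu" line and everything that is not a cpu line.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        values = [int(v) for v in fields[1:]]
        # Treat idle and iowait (4th and 5th columns) as "not busy".
        idle = values[3] + (values[4] if len(values) > 4 else 0)
        # Sum the first eight columns; guest time is already part of user/nice.
        total = sum(values[:8])
        stats[fields[0]] = (total - idle, total)
    return stats


def count_busy_cores(before, after, threshold=0.5):
    """Count cores whose busy fraction between two snapshots is >= threshold."""
    busy = 0
    for cpu, (busy_after, total_after) in after.items():
        busy_before, total_before = before.get(cpu, (0, 0))
        elapsed = total_after - total_before
        if elapsed > 0 and (busy_after - busy_before) / elapsed >= threshold:
            busy += 1
    return busy


def busy_cores_during(func, threshold=0.5, path="/proc/stat"):
    """Run func() and return how many cores were mostly busy meanwhile (Linux only)."""
    with open(path) as f:
        before = parse_proc_stat(f.read())
    func()
    with open(path) as f:
        after = parse_proc_stat(f.read())
    return count_busy_cores(before, after, threshold)
```

With the reproduction script above, something like busy_cores_during(lambda: KMeans(n_clusters=2, n_jobs=-1, n_init=200, max_iter=500, init='random').fit(A[0])) could be compared against the same call with init='k-means++'.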

Versions

Windows:

Could not locate executable g77
Could not locate executable f77
Could not locate executable ifort
Could not locate executable ifl
Could not locate executable f90
Could not locate executable DF
Could not locate executable efl

System:
    python: 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
executable: C:\ProgramData\Miniconda3\pythonw.exe
   machine: Windows-7-6.1.7601-SP1

BLAS:
    macros: 
  lib_dirs: 
cblas_libs: cblas

Python deps:
       pip: 18.1
setuptools: 39.0.1
   sklearn: 0.20.2
     numpy: 1.15.4
     scipy: 1.1.0
    Cython: 0.28.2
    pandas: 0.23.4
C:\ProgramData\Miniconda3\lib\site-packages\numpy\distutils\system_info.py:625: UserWarning: 
    Atlas (http://math-atlas.sourceforge.net/) libraries not found.
    Directories to search for the libraries can be specified in the
    numpy/distutils/site.cfg file (section [atlas]) or by setting
    the ATLAS environment variable.
  self.calc_info()
C:\ProgramData\Miniconda3\lib\site-packages\numpy\distutils\system_info.py:625: UserWarning: 
    Blas (http://www.netlib.org/blas/) libraries not found.
    Directories to search for the libraries can be specified in the
    numpy/distutils/site.cfg file (section [blas]) or by setting
    the BLAS environment variable.
  self.calc_info()
C:\ProgramData\Miniconda3\lib\site-packages\numpy\distutils\system_info.py:625: UserWarning: 
    Blas (http://www.netlib.org/blas/) sources not found.
    Directories to search for the sources can be specified in the
    numpy/distutils/site.cfg file (section [blas_src]) or by setting
    the BLAS_SRC environment variable.
  self.calc_info()

Linux:

System:
    python: 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)  [GCC 7.2.0]
executable: /cluster/programs/miniconda/envs/miniconda-36/bin/python
   machine: Linux-4.4.0-87-generic-x86_64-with-debian-stretch-sid

BLAS:
    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None
  lib_dirs: /cluster/programs/miniconda/envs/miniconda-36/lib
cblas_libs: mkl_rt, pthread

Python deps:
       pip: 18.0
setuptools: 38.4.0
   sklearn: 0.20.2
     numpy: 1.14.2
     scipy: 1.1.0
    Cython: 0.27.3
    pandas: 0.23.4
