[WIP] Number of threads in KMeans should not be bigger than number of chunks #17210
Conversation
Curious, how much of a gain is this? And should it go in 0.23.1?
It is the regression reported in #17208. It seems like a ~10x slowdown on very small datasets.
Then if you've tested and this is fixing the issue, I'm happy. I don't think we can easily write a test for it.
It's an attempt to fix #17208, but it's still WIP. I can reproduce the slowdown on my laptop and this PR fixes it, but it seems not to work for the person who opened the issue. I need to investigate further with him.
The fix sounds like a good improvement in any case. Though he has 4 cores, so I would have imagined spawning 8 threads shouldn't be too costly performance-wise?
I agree
In the reproducible snippet, there are only 150 samples, which means there will only be one chunk. On my laptop with 4 cores, it spawns 4 threads and it makes a huge difference. The thing is that it only concerns very small datasets, for which the whole fit time is ~0.005 sec. So I guess that the overhead of thread creation becomes non-negligible.
Let's merge this as is; if necessary you could create a new PR with more improvements. Please add a what's new entry. BTW, can this affect other parts of the code that do parallel chunking, where we could also apply this fix?
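To make the "only one chunk" point concrete, here is a minimal sketch of the chunk-count arithmetic. The chunk size of 256 and the `n_chunks` helper are illustrative assumptions, not scikit-learn's actual internals:

```python
import math

# Assumed chunk size, for illustration only.
CHUNK_SIZE = 256

def n_chunks(n_samples, chunk_size=CHUNK_SIZE):
    # Number of fixed-size chunks needed to cover all samples.
    return math.ceil(n_samples / chunk_size)

# With 150 samples there is a single chunk of work, so spawning
# 4 threads leaves 3 of them idle while still paying spawn overhead.
print(n_chunks(150))  # -> 1
```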
Not that I can think of.
It can't hurt, and it's still an improvement.
Some profiling showed that it's threadpoolctl that takes 90% of the time on these very small problems. It's called at each iteration. I'll make a PR to move it outside of the loop.
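The hoisting idea above can be sketched with a mock context manager (the real call would be threadpoolctl's `threadpool_limits`; the sleep is a stand-in for its per-call inspection cost, and all names here are hypothetical):

```python
import contextlib
import time

@contextlib.contextmanager
def mock_threadpool_limits():
    # Stand-in for threadpoolctl's fixed per-call overhead.
    time.sleep(0.001)
    yield

def fit_per_iteration(n_iter):
    # Limits re-applied inside the loop: overhead paid n_iter times.
    for _ in range(n_iter):
        with mock_threadpool_limits():
            pass  # one cheap Lloyd iteration

def fit_hoisted(n_iter):
    # Limits applied once, outside the loop: overhead paid once.
    with mock_threadpool_limits():
        for _ in range(n_iter):
            pass
```

When each iteration is only microseconds long, the per-iteration variant is dominated by the fixed overhead, which matches the 90% profiling figure quoted above.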
I guess @rth is also happy to have this merged. Merging; hopefully the other PR improving the threadpoolctl overhead will get in quickly too.
…hunks (scikit-learn#17210) * num threads not bigger than num chunks * what's new
Related to #17208.
When the number of chunks is smaller than the number of cores (i.e. very small datasets), KMeans launches as many threads as there are cores anyway. It should use n_chunks threads instead.
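A minimal sketch of the proposed fix, clamping the thread count to the number of chunks of work. The `effective_n_threads` helper and the chunk size of 256 are illustrative assumptions, not the PR's actual code:

```python
import math
import os

# Assumed chunk size, for illustration only.
CHUNK_SIZE = 256

def effective_n_threads(n_samples, n_threads=None, chunk_size=CHUNK_SIZE):
    if n_threads is None:
        n_threads = os.cpu_count() or 1
    n_chunks = math.ceil(n_samples / chunk_size)
    # Never spawn more threads than there are chunks of work.
    return min(n_threads, n_chunks)

print(effective_n_threads(150, n_threads=4))     # -> 1 (tiny dataset, 1 chunk)
print(effective_n_threads(10_000, n_threads=4))  # -> 4 (plenty of chunks)
```

On large datasets the clamp is a no-op, which is why the change is safe: it only kicks in when there is less work than there are cores.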