bryanyang0528
Contributor

@bryanyang0528 bryanyang0528 commented Jul 6, 2017

Reference Issue

Fixes #9287, Fixes #9784

What does this implement/fix? Explain your changes.

Regardless of the value of n_jobs, the random_state passed to kmeans_single should be the same.
So I added seeds = random_state.randint(np.iinfo(np.int32).max, size=n_init) for the n_jobs=1 case,
and each iteration of the for loop now uses its own seed instead of the original random_state.
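
A minimal sketch of the idea described above, not the actual scikit-learn code. It assumes a hypothetical `run_once` function standing in for kmeans_single, whose result depends only on the seed it receives:

```python
import numpy as np


def run_once(seed):
    # Hypothetical stand-in for kmeans_single: the result depends only on the seed.
    return float(np.random.RandomState(seed).uniform())


n_init = 4
random_state = np.random.RandomState(42)

# One integer seed per initialization, drawn up front from the user's random_state.
seeds = random_state.randint(np.iinfo(np.int32).max, size=n_init)

# Each run gets its own seed instead of the shared random_state object.
in_order = [run_once(seed) for seed in seeds]

# Because a run depends only on its seed, the execution order no longer matters.
reversed_order = [run_once(seed) for seed in seeds[::-1]][::-1]
print(in_order == reversed_order)
```

Since every run is tied to its own seed, the same set of runs is produced whether they execute one after another or are handed to separate workers.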

Any other comments?

I haven't revised the test cases for this change yet. I'll update them if you think this change is good.

@jnothman
Member

jnothman commented Jul 6, 2017

This seems reasonable except insofar as KMeans with a fixed random state might have been returning the same model for a long time. I'm not sure it's worth breaking users' clusterings.

@@ -338,12 +338,13 @@ def k_means(X, n_clusters, init='k-means++', precompute_distances='auto',
     if n_jobs == 1:
         # For a single thread, less memory is needed if we just store one set
         # of the best results (as opposed to one set per run per thread).
         for it in range(n_init):
         seeds = random_state.randint(np.iinfo(np.int32).max, size=n_init)
Member

You can move the seeds assignment outside of the if clause, since it is used for both n_jobs == 1 and n_jobs != 1.
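
As a rough illustration of this suggestion (a sketch under assumptions, not the code in this PR): `run_once` and `run_all` are hypothetical names, and a thread pool stands in for whatever parallel backend the real function uses.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def run_once(seed):
    # Hypothetical stand-in for kmeans_single: the result depends only on the seed.
    return float(np.random.RandomState(seed).uniform())


def run_all(random_state, n_init, n_jobs):
    # Seeds are drawn once, before branching on n_jobs, so both paths share them.
    seeds = random_state.randint(np.iinfo(np.int32).max, size=n_init)
    if n_jobs == 1:
        return [run_once(seed) for seed in seeds]
    # Assumption: a thread pool is used here only to keep the sketch self-contained.
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(run_once, seeds))


sequential = run_all(np.random.RandomState(0), n_init=5, n_jobs=1)
parallel = run_all(np.random.RandomState(0), n_init=5, n_jobs=2)
print(sequential == parallel)
```

With the assignment hoisted, both branches consume the same seeds, so the two code paths cannot drift apart.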

@bryanyang0528
Contributor Author

@jnothman Even with seeds created for n_jobs=1, a fixed random_state produces the same seeds on every call, so it should still return the same model each time. However, that model might not be the same as the one generated by the current method.
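
The distinction can be sketched as follows. This is an illustration only: `run_once`, `old_sequential`, and `new_sequential` are hypothetical names, and the "old" path assumes, as described in this thread, that the for loop previously passed the original random_state to every run.

```python
import numpy as np


def run_once(rng):
    # Hypothetical stand-in for kmeans_single: consumes random numbers from rng.
    return float(rng.uniform())


def old_sequential(random_state, n_init):
    # Assumed previous n_jobs=1 behavior: every run shares one RandomState.
    return [run_once(random_state) for _ in range(n_init)]


def new_sequential(random_state, n_init):
    # Proposed behavior: one integer seed per run, drawn up front.
    seeds = random_state.randint(np.iinfo(np.int32).max, size=n_init)
    return [run_once(np.random.RandomState(seed)) for seed in seeds]


old_a = old_sequential(np.random.RandomState(0), 3)
old_b = old_sequential(np.random.RandomState(0), 3)
new_a = new_sequential(np.random.RandomState(0), 3)
new_b = new_sequential(np.random.RandomState(0), 3)

print(old_a == old_b)  # each approach is reproducible on its own
print(new_a == new_b)
print(old_a == new_a)  # but the two approaches draw different random streams
```

Both approaches are deterministic for a fixed random_state; the backward-compatibility concern is only that results from before and after the change need not match.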

@amueller
Member

amueller commented Aug 5, 2019

some duplication with #9785

@adrinjalali
Member

This looks good to me. It can take the test from the other PR and I'd say it's almost good to go.

@jnothman you still worried about backward compatibility here?

@amueller
Member

amueller commented Aug 6, 2019

I think it's a good fix.

@adrinjalali
Member

@bryanyang0528 would you have time to address the comments, and rebase on the latest master here?

@bryanyang0528
Contributor Author

@adrinjalali no problem. Thanks!

@jnothman
Member

Any chance you can add a test?

@bryanyang0528
Contributor Author

@jnothman no problem.

@adrinjalali
Member

@bryanyang0528 tests failing :)

@bryanyang0528 bryanyang0528 reopened this Aug 12, 2019
@bryanyang0528
Contributor Author

bryanyang0528 commented Aug 12, 2019

@adrinjalali I'm not sure why the tests failed only on py35_conda_openblas and pylatest_conda_mkl_pandas. Also, the error No module named 'sklearn.__check_build._check_build' occurred in circleci:doc. Are there any suggestions or hints for figuring out these issues?

P.S. I notice that other recent PRs in sklearn are also failing at these test steps.

@thomasjpfan
Member

Merging with master should fix the issue.

@adrinjalali
Member

@bryanyang0528 please avoid force pushing. The errors are not related to your changes; you can ignore the ones that fail to create the environment.

@adrinjalali
Member

or merge master as @thomasjpfan suggests.

@bryanyang0528
Contributor Author

bryanyang0528 commented Aug 12, 2019

@adrinjalali @thomasjpfan Thank you for the suggestions.

@bryanyang0528
Contributor Author

@adrinjalali Thank you for the help; all tests pass now. What should I do as the next step?

Member

@adrinjalali adrinjalali left a comment

This LGTM, ping @jnothman since I know he had some reservations about this solution. To me this is a fix, and therefore I wouldn't mind the change.

@amueller amueller changed the title from "[WIP] add seeds when n_jobs=1 and use seed as random_state" to "[MRG] add seeds when n_jobs=1 and use seed as random_state" Aug 13, 2019
Member

@amueller amueller left a comment

happy to wait for @jnothman but looks good to me.

Member

@NicolasHug NicolasHug left a comment

LGTM as a bug-fix, thanks @bryanyang0528

@jnothman
Member

I'm happy with the fix.
Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

Please also note the change at the top of that file under Changed Models.

bryanyang0528 and others added 2 commits August 16, 2019 00:16
@amueller
Member

thanks!

@bryanyang0528
Contributor Author

Thanks!

Development

Successfully merging this pull request may close these issues.

KMeans gives slightly different result for n_jobs=1 vs. n_jobs > 1
Inconsistence results of Kmeans between n_job = 1 and n_jobs > 1