
refactor KMeans #3166

Merged: 1 commit merged into shogun-toolbox:develop on Jun 14, 2016
Conversation

@Saurabh7 (Contributor)

No description provided.

@Saurabh7 (Contributor Author)

Refactor KMeans into separate classes: KMeans (Lloyd's algorithm) and KMeansMiniBatch, both inheriting from KMeansBase.

So basically:

KMeans.cpp -> KMeansBase.cpp
KMeansLloydImpl.cpp -> KMeans.cpp
KMeansMiniBatchImpl.cpp -> KMeansMiniBatch.cpp

This also adds some helper functions for training, moves some methods around, and makes some documentation changes.

GitHub's comparison view is making this look messy; it doesn't detect renames :(


void CKMeans::compute_cluster_variances()
@Saurabh7 (Contributor Author)

I cannot make sense of this compute_cluster_variances. It is not part of the KMeans algorithm, but it leads to additional computation since it is called at the end of train_machine, and it causes a performance hit in some cases, e.g. for the dataset corel-histogram, shape=(68040, 32), with k=10:

2.77 s with compute_cluster_variances
1.31 s without compute_cluster_variances
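
A minimal sketch of one way to avoid that hit, assuming the variances are only needed by some callers: compute them lazily on first request instead of unconditionally at the end of training. KMeansModel and its members here are stand-ins, not shogun's actual class or API.

```cpp
#include <vector>

// Hypothetical lazy wrapper: pay for the expensive variance computation only
// when a caller actually asks for it, not at the end of every train call.
struct KMeansModel
{
	bool variances_valid=false;          // hypothetical dirty flag, reset by training
	std::vector<double> variances;       // hypothetical cached result

	void compute_cluster_variances() { /* the costly step timed above */ }

	const std::vector<double>& get_cluster_variances()
	{
		if (!variances_valid)
		{
			compute_cluster_variances();
			variances_valid=true;
		}
		return variances;
	}
};
```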

@karlnapf (Member)

It is impossible to review this. A few high-level comments:

  • The impl classes are made to hide C++ code from SWIG, which will only see the main interface class. Are you sure you want to remove them? See how some other methods do this, e.g. metric learning.
  • Similarly, the base class for sure would not be exposed via SWIG.
  • A class diagram would help; can you draw one so that we can discuss these changes?
  • Make sure that things remain working, i.e. the notebook, the examples, etc.
  • Are there any speed implications coming from this?
  • If you want, I can put up a feature branch for you. There you can develop a bit more freely, and then we can merge the whole thing... Thoughts?

We need another dev to give an opinion here before merging. @vigsterkr @lisitsyn ?

@Saurabh7 (Contributor Author)

This idea was discussed a bit here: #2558

  • Both KMeans variants have one protected method with the algorithm implementation, say Lloyd_KMeans(). I can still keep them in impl files if it is going to slow down SWIG, but then we need an impl file for each algorithm.
  • I tried removing the base class from clustering_includes.i, but it doesn't work since we are inheriting from it. Other base classes, like distance.h, are also exposed.
  • The class diagram is simple (see the sketch after this list):

                         CKMeansBase
               /               |                \
           CKMeans    CKMeansMiniBatch    CKMeansElkan
                                          (or something else in the future, maybe)

  • I haven't included speedups or other changes for now; this is just restructuring.
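
A minimal header-level sketch of that hierarchy, assuming a virtual training entry point in the base class; the signatures are simplified illustrations, not shogun's actual API:

```cpp
#include <cstdio>

// Base class: shared state and helpers (centers, k, initialization) live here.
class CKMeansBase
{
public:
	virtual ~CKMeansBase() {}
	bool train() { return train_machine(); }
protected:
	virtual bool train_machine() = 0;   // each subclass supplies its algorithm
};

// Classic Lloyd's algorithm.
class CKMeans : public CKMeansBase
{
protected:
	virtual bool train_machine() { Lloyd_KMeans(); return true; }
	void Lloyd_KMeans() { std::puts("assign points, recompute centers, repeat"); }
};

// Mini-batch variant: updates centers from sampled batches.
class CKMeansMiniBatch : public CKMeansBase
{
protected:
	virtual bool train_machine() { minibatch_KMeans(); return true; }
	void minibatch_KMeans() { std::puts("sample a batch, update centers online"); }
};
```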

@karlnapf (Member)

Thanks for the comments. I think it should be OK to merge then...

@karlnapf (Member)

@vigsterkr from my side this can be merged. What do you think? Shall we put this in a feature branch to allow for some more checks (on buildbot, for example), or merge from here? I think the new structure is fine, and Travis seems to be fine as well...

@karlnapf (Member)

(It has to be rebased against #3217 before merging, though.)

@karlnapf (Member)

Can you rebase? There will be some minor conflicts from #3217.
I will ping @vigsterkr to review so that we can merge soon.

@Saurabh7 (Contributor Author)

OK, I resolved the conflicts. Let's see Travis again.


for (int32_t j=0; j<lhs_size; j++)
	for (int32_t i=0; i<num_centers; i++)
(Member)

Couldn't we OMP pragma this?
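
For reference, a sketch of how such a pragma could look on the loop above; lhs_size, num_centers, and compute_distance are stand-ins for the surrounding shogun code, not its actual API:

```cpp
#include <cstdint>
#include <vector>

// The (j, i) iterations are independent, so the outer loop can be split
// across threads with a single OpenMP pragma (compile with -fopenmp).
void fill_distances(std::vector<double>& dists, int32_t lhs_size,
	int32_t num_centers, double (*compute_distance)(int32_t, int32_t))
{
	#pragma omp parallel for
	for (int32_t j=0; j<lhs_size; j++)
		for (int32_t i=0; i<num_centers; i++)
			dists[j*num_centers+i]=compute_distance(j, i);   // disjoint writes
}
```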

@karlnapf (Member), Jun 6, 2016

I think it is not worth it as the loop is ultra cheap (but maybe investigate!)

@vigsterkr (Member)

Let's address the comments, rebase with the latest develop, and see how Travis behaves :)

@karlnapf (Member)

Can we push this in parallel to the other stuff you are doing, @Saurabh7?

@Saurabh7 (Contributor Author)

Ok updated :)

lhs->free_feature_vector(vec, cluster_center_i);
}
/* Weights : Number of points in each cluster */
SGVector<int32_t> weights_set(num_centers);
(Member)

Is 2^31 = 2147483648, i.e. 2.1 billion points, enough? :) I mean, in the case of a big dataset it could be more than that. I'm just suggesting that maybe uint32_t, or rather int64_t or uint64_t, would be the better choice, or?

@Saurabh7 (Contributor Author)

Yes, in the case that 2.1 billion points happen to be assigned to one center, it would fall short :) Should I change it to int64_t then?

(Member)

Isn't weights_set[i] the number of elements assigned to the i-th centre?

@Saurabh7 (Contributor Author), Jun 13, 2016

Yes, it is. But it starts off with all points assigned to center 0, so I guess this should definitely be increased...

(Member)

Ah, OK, then rather use int64_t.
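
The resulting change would presumably be the one-line widening of the counter from the snippet above (assuming shogun's SGVector instantiates for int64_t just as it does for int32_t):

```cpp
/* Weights : number of points in each cluster. int64_t, so the initial state,
   where every point is counted against center 0, cannot overflow. */
SGVector<int64_t> weights_set(num_centers);
```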

for (j=0; j<dim; j++)
{
	centers(j, min_cluster)+=
		(vec[j]-centers(j, min_cluster)) / weights_set[min_cluster];
(Member)

Only one thing: the / operator (division) is usually much more costly than multiplication, so precompute x = 1.0/weights_set[min_cluster] and then just do

centers(j, min_cluster) += x*(vec[j]-centers(j, min_cluster));

Of course, it would be good to check what the compiler does when you use -O3.
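
Applied to the loop above, the suggestion would read roughly as follows; float64_t and the surrounding variables are assumed from the quoted snippet:

```cpp
// Hoist the reciprocal out of the inner loop: one division in place of
// `dim` divisions per assigned point.
const float64_t inv_weight=1.0/weights_set[min_cluster];
for (j=0; j<dim; j++)
{
	centers(j, min_cluster)+=
		inv_weight*(vec[j]-centers(j, min_cluster));
}
```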

@Saurabh7 (Contributor Author)

Updated

@vigsterkr (Member)

ALL GREEN! Good job @Saurabh7, let's merge it!

@vigsterkr merged commit e541a92 into shogun-toolbox:develop on Jun 14, 2016.
karasikov pushed a commit to karasikov/shogun that referenced this pull request on Apr 15, 2017.