
refactor KMeans #3166

Merged: 1 commit merged into shogun-toolbox:develop on Jun 14, 2016
Conversation

@Saurabh7 (Contributor)

No description provided.

@Saurabh7 (Contributor Author)

Refactor KMeans into separate classes: KMeans (Lloyd's algorithm) and KMeansMiniBatch, both inheriting from KMeansBase.

So basically:

KMeans.cpp -> KMeansBase.cpp
KMeansLloydImpl.cpp -> KMeans.cpp
KMeansMiniBatchImpl.cpp -> KMeansMiniBatch.cpp

This also adds some helper functions for training, moves some methods around, and makes some documentation changes.

GitHub's comparison view is making this look messy; it doesn't detect renames :(


void CKMeans::compute_cluster_variances()
@Saurabh7 (Contributor Author)

I cannot make sense of this compute_cluster_variances. It is not part of the KMeans algorithm, but it leads to additional computation since it is called at the end of train_machine, and it causes a performance hit in some cases, e.g. for the dataset corel-histogram, shape=(68040, 32), with k=10:

2.77 s with compute_cluster_variances
1.31 s without compute_cluster_variances
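
A minimal sketch of one way to avoid that hit, assuming the variances are only needed by some callers: compute them lazily on first request instead of unconditionally at the end of training. KMeansModel and its members here are stand-ins, not shogun's actual class or API.

```cpp
#include <vector>

// Hypothetical lazy wrapper: pay for the expensive variance computation only
// when a caller actually asks for it, not at the end of every train call.
struct KMeansModel
{
	bool variances_valid=false;          // hypothetical dirty flag, reset by training
	std::vector<double> variances;       // hypothetical cached result

	void compute_cluster_variances() { /* the costly step timed above */ }

	const std::vector<double>& get_cluster_variances()
	{
		if (!variances_valid)
		{
			compute_cluster_variances();
			variances_valid=true;
		}
		return variances;
	}
};
```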

@karlnapf (Member)

It is impossible to review this. A few high-level comments:

  • The impl classes are made to hide C++ code from SWIG, which will only see the main interface class. Are you sure you want to remove them? See how some other methods do this, e.g. metric learning.
  • Similarly, the base class for sure would not be exposed via SWIG.
  • A class diagram would help; can you draw one so that we can discuss these changes?
  • Make sure that things remain working, i.e. the notebook, the examples, etc.
  • Are there any speed implications coming from this?
  • If you want, I can put up a feature branch for you. There you can develop a bit more freely, and then we can merge the whole thing... Thoughts?

We need another dev to give an opinion here before merging. @vigsterkr @lisitsyn ?

@Saurabh7 (Contributor Author)

This idea was discussed a bit here: #2558

  • Both KMeans variants have one protected method with the algorithm implementation, say Lloyd_KMeans(). I can still keep them in impl files if it is going to slow down SWIG, but then we need an impl file for each algorithm.
  • I tried removing the base class from clustering_includes.i, but it doesn't work since we are inheriting from it. Other base classes, like distance.h, are also exposed.
  • The class diagram is simple (see the sketch after this list):

                         CKMeansBase
               /               |                \
           CKMeans    CKMeansMiniBatch    CKMeansElkan
                                          (or something else in the future, maybe)

  • I haven't included speedups or other changes for now; this is just restructuring.
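
A minimal header-level sketch of that hierarchy, assuming a virtual training entry point in the base class; the signatures are simplified illustrations, not shogun's actual API:

```cpp
#include <cstdio>

// Base class: shared state and helpers (centers, k, initialization) live here.
class CKMeansBase
{
public:
	virtual ~CKMeansBase() {}
	bool train() { return train_machine(); }
protected:
	virtual bool train_machine() = 0;   // each subclass supplies its algorithm
};

// Classic Lloyd's algorithm.
class CKMeans : public CKMeansBase
{
protected:
	virtual bool train_machine() { Lloyd_KMeans(); return true; }
	void Lloyd_KMeans() { std::puts("assign points, recompute centers, repeat"); }
};

// Mini-batch variant: updates centers from sampled batches.
class CKMeansMiniBatch : public CKMeansBase
{
protected:
	virtual bool train_machine() { minibatch_KMeans(); return true; }
	void minibatch_KMeans() { std::puts("sample a batch, update centers online"); }
};
```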

@karlnapf (Member)

Thanks for the comments. I think it should be OK to merge then...

@karlnapf (Member)

@vigsterkr from my side this can be merged. What do you think? Shall we put this in a feature branch to allow for some more checks (on buildbot, for example), or merge from here? I think the new structure is fine, and Travis seems to be fine as well...

@karlnapf (Member)

(It has to be rebased against #3217 before merging, though.)

@karlnapf (Member)

Can you rebase? There will be some minor conflicts from #3217.
I will ping @vigsterkr to review so that we can merge soon.

@Saurabh7 (Contributor Author)

OK, I resolved the conflicts. Let's see Travis again.


for (int32_t j=0; j<lhs_size; j++)
	for (int32_t i=0; i<num_centers; i++)
(Member)

Couldn't we OMP pragma this?
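
For reference, a sketch of how such a pragma could look on the loop above; lhs_size, num_centers, and compute_distance are stand-ins for the surrounding shogun code, not its actual API:

```cpp
#include <cstdint>
#include <vector>

// The (j, i) iterations are independent, so the outer loop can be split
// across threads with a single OpenMP pragma (compile with -fopenmp).
void fill_distances(std::vector<double>& dists, int32_t lhs_size,
	int32_t num_centers, double (*compute_distance)(int32_t, int32_t))
{
	#pragma omp parallel for
	for (int32_t j=0; j<lhs_size; j++)
		for (int32_t i=0; i<num_centers; i++)
			dists[j*num_centers+i]=compute_distance(j, i);   // disjoint writes
}
```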

@karlnapf (Member), Jun 6, 2016

I think it is not worth it as the loop is ultra cheap (but maybe investigate!)

@vigsterkr (Member)

Let's address the comments, rebase with the latest develop, and see how Travis behaves :)

@karlnapf (Member)

Can we push this in parallel to the other stuff you are doing, @Saurabh7?

@Saurabh7 (Contributor Author)

Ok updated :)

lhs->free_feature_vector(vec, cluster_center_i);
}
/* Weights : Number of points in each cluster */
SGVector<int32_t> weights_set(num_centers);
(Member)

Is 2^31 = 2147483648, i.e. 2.1 billion points, enough? :) I mean, in the case of a big dataset it could be more than that. I'm just suggesting that maybe uint32_t, or rather int64_t or uint64_t, would be the better choice, or?

@Saurabh7 (Contributor Author)

Yes, in the case that 2.1 billion points happen to be assigned to one center, it would fall short :) Should I change it to int64_t then?

(Member)

Isn't weights_set[i] the number of elements assigned to the i-th centre?

@Saurabh7 (Contributor Author), Jun 13, 2016

Yes, it is. But it starts off with all points assigned to center 0, so I guess this should definitely be increased...

(Member)

Ah, OK, then rather use int64_t.
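
The resulting change would presumably be the one-line widening of the counter from the snippet above (assuming shogun's SGVector instantiates for int64_t just as it does for int32_t):

```cpp
/* Weights : number of points in each cluster. int64_t, so the initial state,
   where every point is counted against center 0, cannot overflow. */
SGVector<int64_t> weights_set(num_centers);
```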

for (j=0; j<dim; j++)
{
	centers(j, min_cluster)+=
		(vec[j]-centers(j, min_cluster)) / weights_set[min_cluster];
(Member)

Only one thing: the / operator (division) is usually much more costly than multiplication, so precompute x = 1.0/weights_set[min_cluster] and then just do

centers(j, min_cluster) += x*(vec[j]-centers(j, min_cluster));

Of course, it would be good to check what the compiler does when you use -O3.
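
Applied to the loop above, the suggestion would read roughly as follows; float64_t and the surrounding variables are assumed from the quoted snippet:

```cpp
// Hoist the reciprocal out of the inner loop: one division in place of
// `dim` divisions per assigned point.
const float64_t inv_weight=1.0/weights_set[min_cluster];
for (j=0; j<dim; j++)
{
	centers(j, min_cluster)+=
		inv_weight*(vec[j]-centers(j, min_cluster));
}
```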

@Saurabh7 (Contributor Author)

Updated

@vigsterkr (Member)

ALL GREEN! Good job @Saurabh7, let's merge it!

@vigsterkr merged commit e541a92 into shogun-toolbox:develop on Jun 14, 2016.
karasikov pushed a commit to karasikov/shogun that referenced this pull request on Apr 15, 2017.