Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up KMeans #2558

Closed
karlnapf opened this issue Oct 25, 2014 · 9 comments
Closed

Clean up KMeans #2558

karlnapf opened this issue Oct 25, 2014 · 9 comments

Comments

@karlnapf
Copy link
Member

The class is very messy. Since the mini batch extension, things got worse

  • Do a proper object oriented distinction between kmeans and mini batch kmeans rather than handling everything in one class. Its two different algorithms, and therefore should be in different classes (where one inherits from the other and only specialises some internals)
  • Fix error messages and requires, see here for a guide on how to write proper require messages
  • Update documentation and describe the different methods properly, mini batch is not mentioned in class docs
  • Write unit tests (if not there yet)
@karlnapf
Copy link
Member Author

@mazumdarparijat this is yours! :)
@iglesias can maybe help?

@mallikarjun26
Copy link

I would like to take up this task.
Brief inputs for the same please?

@iglesias
Copy link
Collaborator

iglesias commented Nov 4, 2014

Mini-batch k-means is actually mentioned in the documentation briefly. Anyway, I agree this class should be designed in a better way.

@mallikarjun26, the description above by @karlnapf gives a pretty good idea of what needs to be done. Basically, the implementations of the two different k-means variations available in Shogun are all right (each one has its own implementation file, KMeansLloydImpl and KMeansMiniBatchImpl). However, the interface is maybe too cohesive (CKMeans is responsible for both Lloyd and mini-batch training). It would be better to have two different classes.

A way to go might be to use inheritance. For instance, a class CKMeans could implement the standard algorithm while a subclass CKMeansMiniBatch would override the training method. There may better designs though. This was just an example. It is up to you to come up with a good solution and implement it :-)

@abhinavagarwalla
Copy link
Contributor

In my view, the best way to approach this problem would be through using factory method pattern, i.e. by encapsulating the derived CKMeansMiniBatch and CKMeansLloydImpl into the base Kmeans.
By doing so, we would have the two different algorithms in two different classes, plus we can also use it by creating objects of KMeans only. And right now, that is used, which is I think is good when considering ease of the user.
One thing that can be done is that instead of using enum, we can encapsulate the derived class for better design.

@karlnapf
Copy link
Member Author

karlnapf commented Dec 2, 2014

I think we should have all methods that are shared in the base class and subclasses only differ in a couple of lines how to compute the udpates

@abhinavagarwalla
Copy link
Contributor

Sorry, but I didn't get what you are trying to say. Would you please mind saying it again.

@karlnapf
Copy link
Member Author

karlnapf commented Dec 3, 2014

I mean that I do not like the idea of having this factory pattern, a single class should have a fixed behaviour. Different behaviours/algorithms should be implemented in sub-classes.

Put all functionality that is used by both KMeans algorithms into the base class.
Have the subclass implement the updates only, using the base class functionality.

@iglesias
Copy link
Collaborator

iglesias commented Dec 4, 2014

I agree with @karlnapf because I do not quite see the advantage of using the factory method pattern for our use case. With the pattern we would do stuff like:

CKMeans* mbkmeans =  CKMeans::make_kmeans(KMeansType::MB, ...);
CKMeans* lloydkmeans = CKMeans::make_kmeans(KMeansType::Lloyd, ...);

whereas using just inheritance,

CKMeans* mbkmeans = new CMBKMeans(...);
CKMeans* lloydkmeans = new CLloydKMeans(...);

Above, CKMeans stands for a base class that contains common functionality and CMBKMeans and CLloydKMeans stand, respectively, for children classes with the particularities of mini-batch and Llody k-means.

Again, I am not sure what would be the advantages of using this design pattern here. Do you see any, @abhinavagarwalla?

@Saurabh7
Copy link
Contributor

Hi @karlnapf @iglesias ! Have a look at #2648 regarding this isssue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants