Detecting/identifying number of clusters present in dataset #19

Open
leebrooks0 opened this issue Sep 30, 2013 · 3 comments
Comments

@leebrooks0

Does AI4r have any way to detect the number of clusters inherent in a dataset, for use with a clustering algorithm? For example, I would like to be able to say something like:

Best fit: 5 clusters, 99% suitable
Second best fit: 4 clusters, 88% suitable.

Currently I require the user to enter the number of clusters, but that is just guesswork.

@agarie
Contributor

agarie commented Dec 11, 2013

You can try successive numbers of clusters and measure their performance (basically an error curve). Why not create a pull request implementing this?
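
A minimal sketch of such an error curve, written in plain Ruby so it runs without ai4r. Everything here is an assumption for illustration: the helper names (`sq_dist`, `simple_kmeans`, `error_curve`), the deterministic initialization (evenly spaced picks from the input, so `k` must not exceed the number of points), and the toy data. With ai4r the same loop would call the library's own clusterer instead of `simple_kmeans`.

```ruby
# Squared Euclidean distance between two numeric vectors.
def sq_dist(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# Minimal Lloyd's k-means; returns the final centroids.
# Initialization is a deterministic simplification: evenly spaced input points.
def simple_kmeans(points, k, max_iter = 100)
  step = points.size / k # integer division; assumes k <= points.size
  centroids = Array.new(k) { |i| points[i * step].dup }
  max_iter.times do
    # Group every point under the index of its nearest centroid.
    groups = points.group_by { |p| centroids.each_index.min_by { |i| sq_dist(p, centroids[i]) } }
    updated = centroids.each_index.map do |i|
      members = groups[i]
      next centroids[i] if members.nil? # an empty cluster keeps its old centroid
      members.transpose.map { |col| col.sum.to_f / members.size }
    end
    break if updated == centroids
    centroids = updated
  end
  centroids
end

# Error curve: k => total squared distance of each point to its nearest centroid.
def error_curve(points, max_k)
  (1..max_k).to_h do |k|
    centroids = simple_kmeans(points, k)
    [k, points.sum { |p| centroids.map { |c| sq_dist(p, c) }.min }]
  end
end

# Toy data: three well-separated groups of three points each.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
          [20.0, 0.0], [20.0, 1.0], [21.0, 0.0]]
error_curve(points, 4).each { |k, err| puts format('k=%d  error=%.2f', k, err) }
```

On this toy data the error falls steeply up to k=3 and barely changes at k=4, which is the "elbow" one would look for. Real data rarely gives such a clean bend, and k-means with random initialization may need several runs per k.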

@leebrooks0
Author

@agarie What I did a while ago was run an ANOVA for successive numbers of clusters to see how well the clusters explain the variance.
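
One simple way to express "how well the clusters explain the variance" is the ratio of between-cluster to total sum of squares. The sketch below is an assumption about what such a measure could look like, not the analysis described above: it computes only that ratio (no F-test), and the helper names and sample data are hypothetical.

```ruby
# Column-wise mean of a list of numeric vectors.
def mean_vector(rows)
  rows.transpose.map { |col| col.sum.to_f / rows.size }
end

# Squared Euclidean distance between two numeric vectors.
def sq_dist(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# Fraction of the total sum of squares explained by the cluster labels:
# 1 - (within-cluster sum of squares / total sum of squares).
def explained_variance(points, labels)
  grand_mean = mean_vector(points)
  total = points.sum { |p| sq_dist(p, grand_mean) }
  return 0.0 if total.zero?

  groups = points.zip(labels).group_by { |_point, label| label }.values
  within = groups.sum do |pairs|
    members = pairs.map(&:first)
    center = mean_vector(members)
    members.sum { |p| sq_dist(p, center) }
  end
  1.0 - within / total
end

points = [[0, 0], [0, 2], [10, 0], [10, 2]]
puts explained_variance(points, [0, 0, 1, 1]) # two clusters split along x
puts explained_variance(points, [0, 0, 0, 0]) # a single cluster explains nothing
```

Computing this ratio for each candidate number of clusters gives a curve that rises toward 1.0; the point where it stops rising noticeably plays the same role as the elbow in an error curve.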

@awe2m2n2s

awe2m2n2s commented May 8, 2016

Hi all,
I had the same issue today and did not find a solution anywhere, so here is what I came up with so far. I guess it's not perfect, but it's better than nothing :)
Assumptions: the input data is already z-normalized, nobody wants more than 10 clusters, and nobody wants an additional cluster that doesn't improve the fitness by more than 5%.

require 'ai4r'

# Builds KMeans for k = 1..10 and returns the clusterer rebuilt with the chosen k.
def self.create_best_cluster(data)
  cluster = Ai4r::Clusterers::KMeans.new
  dataset = Ai4r::Data::DataSet.new
  dataset.set_data_items(data)
  cluster_fitness = []

  10.times do |t|
    cluster.build(dataset, t + 1)
    # Fitness for k = t + 1: summed distance of every item to its assigned centroid.
    cluster_fitness[t] = 0.0
    dataset.data_items.each do |item|
      cluster_fitness[t] += cluster.distance(cluster.centroids[cluster.eval(item)], item)
    end
  end

  # Relative improvement gained by each additional cluster (0.05 means 5%).
  gains = cluster_fitness.each_cons(2).map { |x, y| x.zero? ? 0.0 : (x - y) / x }
  # gains[i] is the step from k = i + 1 to k = i + 2, so the last step that still
  # improved the fitness by at least 5% gives k = index + 2. Fall back to a single
  # cluster when no step reached the threshold.
  last_useful = gains.each_index.select { |i| gains[i] >= 0.05 }.max
  cluster.build(dataset, last_useful ? last_useful + 2 : 1)
end
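
The selection step can also be tried on its own, without ai4r, on a made-up fitness curve. The sketch below uses a hypothetical `pick_k` helper and invented numbers to show that a threshold on raw differences depends on the scale of the summed distances, while a threshold on relative improvement does not.

```ruby
# fitness[i] is the total error for k = i + 1 clusters.
# Returns the largest k whose step from k - 1 still met the threshold,
# or 1 when no step did.
def pick_k(fitness, threshold, relative:)
  gains = fitness.each_cons(2).map do |prev, cur|
    drop = prev - cur
    if relative
      prev.zero? ? 0.0 : drop / prev.to_f # share of the previous error removed
    else
      drop                                # raw difference, scale dependent
    end
  end
  last = gains.each_index.select { |i| gains[i] >= threshold }.max
  last ? last + 2 : 1
end

fitness = [100.0, 40.0, 10.0, 9.8, 9.7] # invented values for k = 1..5

puts pick_k(fitness, 0.05, relative: false) # => 5 (every raw drop exceeds 0.05)
puts pick_k(fitness, 0.05, relative: true)  # => 3 (steps to k = 4 and 5 gain under 5%)
```

Because k-means results vary with the initial centroids, a measured curve is not guaranteed to decrease at every step; averaging several runs per k before applying the threshold is one way to smooth that out.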
