Detecting/identifying number of clusters present in dataset #19

leebrooks0 · 2013-09-30T13:48:14Z

Does AI4r have any way to detect the number of clusters inherent in a dataset (for use with a clustering algorithm)? For example, I would like to be able to report something like:

Best fit: 5 clusters, 99% suitable
Second best fit: 4 clusters, 88% suitable.

Currently I require the user to enter the number of clusters, but that is just guesswork.

agarie · 2013-12-11T00:10:34Z

You can try successive numbers of clusters and measure their performance (basically an error curve). Why not create a pull request implementing this?
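A minimal sketch of such an error curve, written without AI4r so that it runs on its own. `tiny_kmeans` and `error_curve` are hypothetical helper names, and the fixed seed and iteration count are assumptions made only to keep the example deterministic; a real implementation would call the library's clusterer instead.

```ruby
# Squared Euclidean distance between two numeric vectors.
def sq_dist(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# Very small k-means (Lloyd's algorithm), seeded so runs are repeatable.
# Hypothetical helper for illustration, not part of AI4r.
def tiny_kmeans(points, k, iterations: 20, seed: 42)
  centroids = points.sample(k, random: Random.new(seed))
  iterations.times do
    # Assign every point to the index of its nearest centroid.
    groups = points.group_by do |point|
      centroids.each_index.min_by { |i| sq_dist(point, centroids[i]) }
    end
    # Move each centroid to the mean of its members.
    centroids = centroids.each_with_index.map do |centroid, i|
      members = groups[i]
      next centroid if members.nil? # an empty cluster keeps its old centroid
      members.transpose.map { |coords| coords.sum / coords.size.to_f }
    end
  end
  centroids
end

# Error for k = 1..max_k: sum of squared distances to the nearest centroid.
def error_curve(points, max_k)
  (1..max_k).map do |k|
    centroids = tiny_kmeans(points, k)
    points.sum { |point| centroids.map { |c| sq_dist(point, c) }.min }
  end
end
```

Plotting or inspecting the returned array shows where adding another cluster stops paying off; the curve flattens once k passes the number of natural groups.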

leebrooks0 · 2013-12-14T20:51:48Z

@agarie what I did a while ago was run an ANOVA for successive numbers of clusters to see how well the clusters explain the variance.
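One possible reading of that check is the share of total variance explained by cluster membership (1 - SS_within / SS_total, often called eta squared). The sketch below assumes that reading; `explained_variance` and its `labels` argument (one cluster id per point) are hypothetical names, not AI4r API.

```ruby
# Mean vector of a non-empty list of numeric vectors.
def mean_vector(points)
  points.transpose.map { |coords| coords.sum / coords.size.to_f }
end

# Sum of squared Euclidean distances from each point to a centre.
def sum_of_squares(points, centre)
  points.sum { |point| point.zip(centre).sum { |x, c| (x - c)**2 } }
end

# Fraction of the total variance explained by the grouping.
# labels[i] is the cluster id assigned to points[i] (assumed input).
def explained_variance(points, labels)
  total = sum_of_squares(points, mean_vector(points))
  return 0.0 if total.zero? # all points identical, nothing to explain

  groups = points.zip(labels).group_by { |_point, label| label }
  within = groups.values.sum do |pairs|
    members = pairs.map(&:first)
    sum_of_squares(members, mean_vector(members))
  end
  1.0 - within / total
end

points = [[0.0], [2.0], [10.0], [12.0]]
explained_variance(points, [0, 0, 1, 1]) # close to 0.96: two tight groups
explained_variance(points, [0, 0, 0, 0]) # 0.0: one cluster explains nothing
```

Computing this value for each candidate number of clusters gives a score between 0 and 1 that can be compared across runs.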

awe2m2n2s · 2016-05-08T05:26:48Z

Hi all,
I had the same issue today, and as I did not find a solution anywhere, here is what I came up with so far. I guess it's not perfect, but better than nothing :)
Assumptions: the input data is already z-normalized && nobody wants more than 10 clusters && nobody wants a cluster that didn't improve the fitness by more than 5%

def self.create_best_cluster(data)
  cluster = Ai4r::Clusterers::KMeans.new
  dataset = Ai4r::Data::DataSet.new
  dataset.set_data_items(data)
  cluster_fitness = []

  # Fitness for k = 1..10: total distance of every item to its assigned centroid.
  10.times do |t|
    cluster.build(dataset, t + 1)
    cluster_fitness[t] = 0
    dataset.data_items.each do |item|
      cluster_fitness[t] += cluster.distance(cluster.centroids[cluster.eval(item)], item)
    end
  end

  # Change in fitness for each additional cluster (index 0 is the step from 1 to 2 clusters).
  deltas = cluster_fitness.each_cons(2).map { |x, y| y - x }
  # Last step that still changed the fitness by at least 0.05. The array starts
  # at 0 but clustering starts at 1, so the best cluster count is index + 2.
  last_useful = deltas.each_index.select { |index| deltas[index].abs >= 0.05 }.max
  # Fall back to a single cluster when no step reached the threshold.
  cluster.build(dataset, last_useful ? last_useful + 2 : 1)
end
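The code above thresholds the absolute change in fitness at 0.05. If the 5% in the assumptions is read as a relative improvement instead, the selection step might look like the sketch below. `pick_cluster_count` and `min_gain` are hypothetical names, and `errors[i]` is assumed to hold the fitness measured with i + 1 clusters.

```ruby
# Return the smallest k after which one more cluster improves the error by
# less than `min_gain`, measured relative to the error at k.
# errors[i] is the error measured with i + 1 clusters (assumed input).
def pick_cluster_count(errors, min_gain: 0.05)
  errors.each_cons(2).with_index do |(current, following), index|
    # A zero error cannot be improved further, so treat the gain as zero.
    gain = current.zero? ? 0.0 : (current - following) / current.to_f
    return index + 1 if gain < min_gain
  end
  errors.size # every extra cluster still helped, so keep the largest k tried
end

pick_cluster_count([100.0, 40.0, 10.0, 9.8, 9.7]) # => 3
pick_cluster_count([100.0, 50.0, 20.0])           # => 3
pick_cluster_count([10.0, 9.9])                   # => 1
```

This stops at the first step that falls below the threshold, so a late, isolated drop in the error does not pull the choice toward a larger k.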
