clustering accuracy added to KMeans notebook
mazumdarparijat committed Feb 15, 2014
1 parent ea01646 commit f72577d
Showing 1 changed file with 64 additions and 5 deletions.
69 changes: 64 additions & 5 deletions doc/ipython-notebooks/clustering/KMeans.ipynb
@@ -578,7 +578,7 @@
"<ul><li>Iris Sensosa</li><li>Iris Versicolour</li><li>Iris Virginica</li></ul>\n",
"The Iris dataset enlists 4 features that can be used to segregate these varieties, namely\n",
"<ul><li>sepal length</li><li>sepal width</li><li>petal length</li><li>petal width</li></ul>\n",
"It is additionally acknowledged that petal length and petal width are the 2 most important features(ie. features with very high class correlations)[refer to <a href='http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.names'>summary statistics</a>]. Since, the entire feature vector is impossible to plot, we only plot these two most important features in order to understand the dataset (atleast partially). Note that we could have extracted the 2 most important features by applying PCA(or any one of the many dimensionality reduction methods available in Shogun) as well."
"It is additionally acknowledged that petal length and petal width are the 2 most important features (ie. features with very high class correlations)[refer to <a href='http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.names'>summary statistics</a>]. Since the entire feature vector is impossible to plot, we only plot these two most important features in order to understand the dataset (at least partially). Note that we could have extracted the 2 most important features by applying PCA (or any one of the many dimensionality reduction methods available in Shogun) as well."
]
},
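{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick aside, the next cell is a minimal numpy-only sketch of the PCA idea mentioned above; it does not use Shogun's own dimensionality reduction classes, and the 4 x 150 matrix X below is just a random placeholder standing in for the real Iris observation matrix built later in this notebook."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from numpy import cov, random\n",
"from numpy.linalg import eigh\n",
"\n",
"# placeholder 4 x 150 observation matrix (one column per sample); swap in the real Iris data to use this\n",
"X = random.rand(4,150)\n",
"\n",
"# centre each feature (row) around its mean\n",
"Xc = X - X.mean(axis=1).reshape(-1,1)\n",
"\n",
"# eigen-decomposition of the 4 x 4 covariance matrix; eigh returns eigenvalues in ascending order\n",
"eigvals, eigvecs = eigh(cov(Xc))\n",
"\n",
"# the eigenvectors belonging to the 2 largest eigenvalues span the projection plane\n",
"top2 = eigvecs[:, eigvals.argsort()[::-1][:2]]\n",
"\n",
"# 2 x 150 matrix of projected points that could be plotted instead of petal length/width\n",
"X2d = top2.T.dot(Xc)\n",
"print(X2d.shape)"
],
"language": "python",
"metadata": {},
"outputs": []
},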
{
@@ -658,15 +658,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let us create a 2-D plot of the clusters formed making use of the two most important features(petal length and petal width) and compare it with the earlier plot depicting the actual labels of data points."
"Now let us create a 2-D plot of the clusters formed making use of the two most important features (petal length and petal width) and compare it with the earlier plot depicting the actual labels of data points."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"\n",
"\n",
"# plot the clusters over the original points in 2 dimensions\n",
"figure,axis = pyplot.subplots(1,1)\n",
"for i in xrange(150):\n",
@@ -690,7 +688,68 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that the Iris Sentosa plants are perfectly clustered without error. The Iris Versicolour plants and Iris Virginica plants are also clustered with high accuracy, but there are some plant samples of either class that have been clustered with the wrong class. This happens near the boundary of the 2 classes in the plot and was well expected. "
"From the above plot, it can be inferred that the accuracy of KMeans algorithm is very high for Iris dataset. Don't believe me? Alright, then let us make use of one of Shogun's clustering evaluation techniques to formally validate the claim. But before that, we have to label each sample in the dataset with a label corresponding to the class to which it belongs. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from numpy import ones, zeros\n",
"\n",
"# first 50 are iris sensosa labelled 0, next 50 are iris versicolour labelled 1 and so on\n",
"labels = concatenate((zeros(50),ones(50),2.*ones(50)),1)\n",
"\n",
"# bind labels assigned to Shogun multiclass labels\n",
"ground_truth = MulticlassLabels(array(labels,dtype='float64'))"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can compute clustering accuracy making use of the <em><b>ClusteringAccuracy</b></em> class in Shogun"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from numpy import nonzero\n",
"# shogun object for clustering accuracy\n",
"AccuracyEval = ClusteringAccuracy()\n",
"\n",
"# changes the labels of result (keeping clusters intact) to produce a best match with ground truth\n",
"AccuracyEval.best_map(result, ground_truth)\n",
"\n",
"# evaluates clustering accuracy\n",
"print 'Accuracy : ' + str(AccuracyEval.evaluate(result, ground_truth))\n",
"\n",
"# find out which sample points differ from actual labels (or ground truth)\n",
"compare = result.get_labels()-labels\n",
"diff = nonzero(compare)\n",
"\n",
"# plot the difference between ground truth and predicted clusters\n",
"figure,axis = pyplot.subplots(1,1)\n",
"axis.plot(obsmatrix[2,:],obsmatrix[3,:],'x',color='black', markersize=5)\n",
"axis.plot(obsmatrix[2,diff],obsmatrix[3,diff],'x',color='r', markersize=7)\n",
"axis.set_xlim(-1,8)\n",
"axis.set_ylim(-1,3)\n",
"axis.set_title('Difference')\n",
"pyplot.show()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the above plot, wrongly clustered data points are marked in red. We see that the Iris Sentosa plants are perfectly clustered without error. The Iris Versicolour plants and Iris Virginica plants are also clustered with high accuracy, but there are some plant samples of either class that have been clustered with the wrong class. This happens near the boundary of the 2 classes in the plot and was well expected. "
]
},
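{
"cell_type": "markdown",
"metadata": {},
"source": [
"For intuition only, the next cell is a small pure-Python/numpy sketch (not part of Shogun's API) of what the best_map plus evaluate pair used above does conceptually: since cluster ids are arbitrary, it tries every permutation of the 3 labels and reports the best fraction of agreement with the ground truth."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from itertools import permutations\n",
"from numpy import array, mean\n",
"\n",
"def clustering_accuracy(predicted, truth):\n",
"    # try every relabelling of the clusters and keep the best agreement with the ground truth\n",
"    best = 0.0\n",
"    for perm in permutations(set(int(t) for t in truth)):\n",
"        relabelled = array([perm[int(p)] for p in predicted])\n",
"        best = max(best, mean(relabelled == truth))\n",
"    return best\n",
"\n",
"# toy example: cluster ids are arbitrary, so this clustering is actually perfect\n",
"truth     = array([0,0,0,1,1,1,2,2,2])\n",
"predicted = array([2,2,2,0,0,0,1,1,1])\n",
"print(clustering_accuracy(predicted, truth))   # expected output: 1.0"
],
"language": "python",
"metadata": {},
"outputs": []
},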
{
