diff --git a/doc/ipython-notebooks/clustering/KMeans.ipynb b/doc/ipython-notebooks/clustering/KMeans.ipynb
index 2d9f4e712cc..c85cdaf68cf 100644
--- a/doc/ipython-notebooks/clustering/KMeans.ipynb
+++ b/doc/ipython-notebooks/clustering/KMeans.ipynb
@@ -774,7 +774,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "KMeans is highly affected by the curse of dimensionality. So, dimension reduction becomes an important preprocessing step. Shogun offers a variety of dimension reduction techniques to choose from. Since our data is not very high dimensional, PCA is a good choice for dimension reduction. We have already seen the accuracy of KMeans when all four dimensions are used. In the following exercise we shall see how the accuracy varies as one chooses lower dimensions to represent data. "
+ "KMeans is highly affected by the curse of dimensionality. So, dimension reduction becomes an important preprocessing step. Shogun offers a variety of [dimension reduction techniques](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CDimensionReductionPreprocessor.html) to choose from. Since our data is not very high dimensional, PCA is a good choice for dimension reduction. We have already seen the accuracy of KMeans when all four dimensions are used. In the following exercise we shall see how the accuracy varies as one chooses lower dimensions to represent data. "
 ]
 },
 {
@@ -797,6 +797,7 @@
 "collapsed": false,
 "input": [
 "from numpy import dot\n",
+ "\n",
 "def apply_pca_to_data(target_dims):\n",
 "    train_features = RealFeatures(obsmatrix)\n",
 "    submean = PruneVarSubMean(False)\n",
@@ -808,6 +809,7 @@
 "    pca_transform = preprocessor.get_transformation_matrix()\n",
 "    new_features = dot(pca_transform.T, train_features)\n",
 "    return new_features\n",
+ "\n",
 "oneD_matrix = apply_pca_to_data(1)"
 ],
 "language": "python",
@@ -981,7 +983,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "Again, we follow the same steps, but skip plotting data (because plotting 3-D is not possible)."
+ "Again, we follow the same steps, but skip plotting data."
 ]
 },
 {
@@ -1023,7 +1025,7 @@
 "metadata": {},
 "source": [
 "STEP 3: Get accuracy of results. In this step, the 'difference' plot positions data points based petal length \n",
- " and petal width in the original data. This will enable us to visually campare these results with that of KMeans applied\n",
+ " and petal width in the original data. This will enable us to visually compare these results with that of KMeans applied\n",
 " to 4-Dimensional data (ie. our first result on Iris dataset)"
 ]
 },
 {
@@ -1060,6 +1062,7 @@
 "input": [
 "from scipy.interpolate import interp1d\n",
 "from numpy import linspace\n",
+ "\n",
 "x = array([1, 2, 3, 4])\n",
 "y = array([accuracy_1d, accuracy_2d, accuracy_3d, accuracy_4d])\n",
 "f = interp1d(x, y)\n",
@@ -1079,7 +1082,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "The above plot is not very intuitive. The accuracy obtained by using just one latent dimension is much more than that obtained by taking all four features features. This shows the importance of PCA. Not only does it reduce the complexity of running KMeans, it also enhances results."
+ "The above plot is counter-intuitive at first glance. The accuracy obtained by using just one latent dimension is much higher than that obtained by using all four features. A plausible explanation is that the mixing of data points from Iris Versicolour and Iris Virginica is least along the single principal dimension chosen by PCA. Additional dimensions only aggravate this inter-mixing, resulting in poorer clustering accuracy. While there could be other explanations for the observed results, our small experiment has highlighted the importance of PCA: not only does it reduce the complexity of running KMeans, it can also enhance results."
 ]
 },
 {