# KMeans
## 1 聚类算法
对于"监督学习"(supervised learning)，其训练样本是带有标记信息的，并且监督学习的目的是：对带有标记的数据集进行模型学习，从而便于对新的样本进行分类。而在“无监督学习”(unsupervised learning)中，训练样本的标记信息是未知的，目标是通过对无标记训练样本的学习来揭示数据的内在性质及规律，为进一步的数据分析提供基础。对于无监督学习，应用最广的便是"聚类"(clustering)。
  “聚类算法”试图将数据集中的样本划分为若干个通常是不相交的子集，每个子集称为一个“簇”(cluster)，通过这样的划分，每个簇可能对应于一些潜在的概念或类别。
  我们可以通过下面这个图来理解：
![title](pic/22.jpg)
上图是未做标记的样本集，通过他们的分布，我们很容易对上图中的样本做出以下几种划分。
  当需要将其划分为两个簇时，即 k=2 时：
![title](pic/23.jpg) ![title](pic/24.jpg)
当需要将其划分为四个簇时，即 k=4 时：
![title](pic/25.jpg)

## 2 kmeans算法
kmeans算法又名k均值算法。其算法思想大致为：先从样本集中随机选取 k 个样本作为簇中心，并计算所有样本与这 k 个“簇中心”的距离，对于每一个样本，将其划分到与其距离最近的“簇中心”所在的簇中，对于新的簇计算各个簇的新的“簇中心”。
  根据以上描述，我们大致可以猜测到实现kmeans算法的主要三点：
  
  （1）簇个数 k 的选择
  
  （2）各个样本点到“簇中心”的距离
  
  （3）根据新划分的簇，更新“簇中心”
### 2.1 kmeans算法要点
（1） k 值的选择

     k 的选择一般是按照实际需求进行决定，或在实现算法时直接给定 k 值。
     
（2） 距离的度量

     给定样本
     $x^{(i)}=\left\{x_{1}^{(i)}, x_{2}^{(i)}, \ldots, x_{n}^{(i)},\right\} \stackrel{(i)}{=}\left\{x_{1}^{(j)}, x_{2}^{(j)}, \ldots, x_{n}^{(j)},\right\}$,其中i,j=1,2,...,m，表示样本数，n表示特征数  。距离的度量方法主要分为以下几种：
     ![title](pic/26.png)

(3） 更新“簇中心”

     对于划分好的各个簇，计算各个簇中的样本点均值，将其均值作为新的簇中心。
### 2.2 kmeans算法过程

![title](pic/27.png)
![title](pic/26.jpg)

In [1]:
import numpy as  np

In [16]:
def loadDataSet(filename):
    dataMat = []
    with open(filename) as fr:
        for line in fr.readlines():
            curline = line.strip().split()
            fltline = list(map(float, curline))
            dataMat.append(fltline)
    return np.array(dataMat)

In [17]:
dataMat=loadDataSet('data/Kmeans/testSet.txt')

In [4]:
def distEclud(vecA, vecB):
    return np.sqrt(np.power(vecA - vecB, 2).sum())  # not sum(mat) but mat.sum()

In [5]:
def randCent(dataSet, k):
    n = np.shape(dataSet)[1]  
    centroids = np.mat(np.zeros((k, n)))
    for j in range(n):
        minj = min(dataSet[:, j])
        rangej = float(max(dataSet[:, j]) - minj)
        centroids[:, j] = minj + rangej * np.random.rand(k, 1)
    return centroids

In [18]:
randCent(dataMat,2)

matrix([[ 2.0087042 , -0.75674176],
        [-2.80963583,  4.9478131 ]])

In [7]:
def KMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    m = np.shape(dataSet)[0]
    clusterAssment = np.mat(np.zeros((m, 2)))
    centroids = randCent(dataSet, k)
    clusterchanged = True
    while clusterchanged:
        clusterchanged = False
        for i in range(m):
            cluster_i = clusterAssment[i, 0];
            dismax = np.inf
            for j in range(k):
                curdis = distEclud(centroids[j, :], dataSet[i, :])
                if curdis < dismax:
                    dismax = curdis
                    clusterAssment[i, :] = j, dismax
            if cluster_i != clusterAssment[i, 0]: clusterchanged = True
        print(centroids)
        for cent in range(k):
            ptsInClust = dataSet[np.nonzero(clusterAssment[:, 0].A == cent)[0]]
            centroids[cent, :] = np.mean(ptsInClust, axis=0)
    return centroids, clusterAssment

In [8]:
KMeans(np.array(dataMat),2)

[[ 1.64738477 -3.89985624]
 [ 2.14955832  2.37770554]]
[[-0.43856939 -2.95461287]
 [ 0.19944238  2.77665202]]
[[-0.2897198  -2.83942545]
 [ 0.08249337  2.94802785]]


(matrix([[-0.2897198 , -2.83942545],
         [ 0.08249337,  2.94802785]]), matrix([[1.        , 2.06716812],
         [1.        , 3.5681125 ],
         [0.        , 5.39850778],
         [0.        , 5.1167591 ],
         [1.        , 0.89039257],
         [1.        , 3.91557751],
         [0.        , 0.8730819 ],
         [0.        , 3.3862195 ],
         [1.        , 2.91888366],
         [1.        , 3.24808913],
         [0.        , 3.64487896],
         [0.        , 2.51060892],
         [1.        , 4.12585863],
         [1.        , 2.2058353 ],
         [0.        , 2.56070545],
         [0.        , 1.12895497],
         [1.        , 3.07341157],
         [1.        , 0.95792926],
         [0.        , 3.27439238],
         [0.        , 2.95877748],
         [1.        , 2.25513093],
         [1.        , 1.9098742 ],
         [0.        , 2.6496711 ],
         [0.        , 3.11424737],
         [1.        , 1.93527435],
         [1.        , 1.91077281],
         [0.   

In [9]:
def biKmeans(dataSet, k, distMeas=distEclud):
    m = np.shape(dataSet)[0]
    clusterAssment = np.mat(np.zeros((m, 2)))
    centroid0 = np.mean(dataSet, axis=0).tolist()[0]
    cenList = [centroid0]
    for j in range(m):
        clusterAssment[j, 1] = distMeas(np.mat(centroid0), dataSet[j, :]) ** 2
    while len(cenList) < k:
        lowestSSE = np.inf
        for i in range(len(cenList)):
            ptscurrCluster = dataSet[np.nonzero(clusterAssment[:, 0].A == i)[0], :]
            centroidMat, splitClustAss = KMeans(ptscurrCluster, 2)
            ssesplit = splitClustAss[:, 1].sum()
            ssenotsplit = clusterAssment[np.nonzero(clusterAssment[:, 0].A != i), 1].sum()
            print(ssesplit, ssenotsplit)
            if ssesplit + ssenotsplit < lowestSSE:
                lowestSSE = ssenotsplit + ssesplit
                bestnewCent = centroidMat
                bestClustAss = splitClustAss.copy()
                bestCentToSplit = i
        bestClustAss[np.nonzero(bestClustAss[:, 0].A == 1)[0], 0] = len(cenList)
        bestClustAss[np.nonzero(bestClustAss[:, 0].A == 0)[0], 0] = bestCentToSplit
        print('the bestcenttosplit: ', bestCentToSplit)
        print('len bestclustass: ', len(bestClustAss))
        cenList[bestCentToSplit] = bestnewCent[0, :]
        cenList.append(bestnewCent[1, :].tolist()[0])
        clusterAssment[np.nonzero(clusterAssment[:, 0].A == bestCentToSplit)[0],
        :] = bestClustAss  # reassign new clusters, and SSE
    return np.mat(cenList), clusterAssment


In [10]:
biKmeans(np.array(dataMat),2)

[[ 0.32909744 -2.2394072 ]
 [-4.64394088 -3.03177339]]
[[ 1.46190604  0.77760244]
 [-3.54775556 -1.53696152]]
[[ 1.92304273  0.70709257]
 [-3.30703713 -0.97753032]]
[[ 2.35560655  0.474688  ]
 [-3.10932625 -0.45950489]]
[[ 2.62924541  0.26552329]
 [-2.97661844 -0.16775279]]
[[ 2.71473038  0.18858278]
 [-2.9219568  -0.07998038]]
248.1507528392998 0.0
the bestcenttosplit:  0
len bestclustass:  80


(matrix([[matrix([[2.71473038, 0.18858278]]),
          list([-2.9219568000000002, -0.07998037500000002])]], dtype=object),
 matrix([[0.        , 4.23040738],
         [1.        , 3.54441323],
         [0.        , 2.51093336],
         [1.        , 4.10035377],
         [0.        , 3.24316536],
         [1.        , 1.7362298 ],
         [0.        , 4.16075955],
         [1.        , 1.73885412],
         [0.        , 1.40701044],
         [1.        , 3.27951404],
         [0.        , 4.21260813],
         [1.        , 3.02239548],
         [0.        , 3.1701662 ],
         [1.        , 3.12704603],
         [0.        , 2.1186188 ],
         [1.        , 4.63488064],
         [0.        , 1.39036144],
         [1.        , 4.03266264],
         [0.        , 3.55303008],
         [1.        , 2.22090339],
         [0.        , 2.71302541],
         [1.        , 2.86858385],
         [0.        , 2.1593047 ],
         [1.        , 2.74091587],
         [0.        , 3.80919097],
 