# Clustering

## Algorithms

* K-Means clustering
    - Mini Batch K-Means clustering
* Affinity Propagation
* Mean Shift
* Spectral Clustering
* Ward Hierarchical Clustering
* Agglomerative Clustering
* DBSCAN
* Gaussian Mixtures
* Birch

## Clustering performance evaluation

* Adjusted Rand index
* Mutual Information based scores
* Homogeneity, Completeness, V-measure
* Fowlkes-Mallows scores
* Silhouette Coefficient
* Calinski-Harabasz Index
* Davies-Bouldin Index
* Contingency Matrix
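
As a rough illustration of the metrics listed above, the sketch below computes most of them with `sklearn.metrics` on a tiny dataset. The data and both label arrays are hypothetical, invented only for this example, and the function names assume a reasonably recent scikit-learn release (for instance `calinski_harabasz_score`, which had a different spelling in older versions).

```python
import numpy as np
from sklearn import metrics
from sklearn.metrics.cluster import contingency_matrix

# Hypothetical toy data: two well-separated groups of three points each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels_true = np.array([0, 0, 0, 1, 1, 1])
# Same partition as labels_true, but with the cluster ids swapped.
labels_pred = np.array([1, 1, 1, 0, 0, 0])

# Scores that compare predicted labels against ground-truth labels.
ari = metrics.adjusted_rand_score(labels_true, labels_pred)
ami = metrics.adjusted_mutual_info_score(labels_true, labels_pred)
homogeneity, completeness, v_measure = metrics.homogeneity_completeness_v_measure(
    labels_true, labels_pred)
fmi = metrics.fowlkes_mallows_score(labels_true, labels_pred)
cm = contingency_matrix(labels_true, labels_pred)

# Scores that only need the data and the predicted labels.
silhouette = metrics.silhouette_score(X, labels_pred)
calinski_harabasz = metrics.calinski_harabasz_score(X, labels_pred)
davies_bouldin = metrics.davies_bouldin_score(X, labels_pred)

print("ARI:", ari, "AMI:", ami, "V-measure:", v_measure, "FMI:", fmi)
print("Silhouette:", silhouette)
print("Contingency matrix:\n", cm)
```

The label-based scores do not depend on how the cluster ids are numbered, which is why swapping `0` and `1` in `labels_pred` still counts as a perfect match.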

## APIs

In [5]:
# classes
from sklearn.cluster import AffinityPropagation
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import Birch
from sklearn.cluster import DBSCAN
from sklearn.cluster import FeatureAgglomeration
from sklearn.cluster import KMeans
from sklearn.cluster import MiniBatchKMeans
from sklearn.cluster import MeanShift
from sklearn.cluster import SpectralClustering
from sklearn.cluster import SpectralBiclustering
from sklearn.cluster import SpectralCoclustering

# functions
from sklearn.cluster import affinity_propagation
from sklearn.cluster import dbscan
from sklearn.cluster import estimate_bandwidth
from sklearn.cluster import k_means
from sklearn.cluster import mean_shift
from sklearn.cluster import spectral_clustering
from sklearn.cluster import ward_tree

## API structure

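Each clustering algorithm comes in two variants:

* class
    - method `.fit`: learns the clusters on the training data
    - attribute `.labels_`: after fitting, holds the integer cluster label of each training sample
* function
    - given the training data, returns an array of integer labels corresponding to the different clusters (often together with other values such as the cluster centers)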
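
A minimal sketch of the two variants, using `KMeans` (class) and `k_means` (function) on hypothetical toy data. The explicit `n_init = 10` is an assumption made here so that both calls behave the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans, k_means

# Hypothetical toy data: two well-separated groups.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# Class variant: fit the estimator, then read the labels_ attribute.
est = KMeans(n_clusters = 2, n_init = 10, random_state = 0).fit(X)
class_labels = est.labels_

# Function variant: the labels are returned directly,
# here together with the cluster centers and the inertia.
centers, func_labels, inertia = k_means(X, n_clusters = 2, n_init = 10, random_state = 0)

print('Labels from the class:   ', class_labels)
print('Labels from the function:', func_labels)
```

The class variant is usually the more convenient one, since the fitted estimator can later be reused for `predict` or `transform`.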


## Input data formats

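* Standard data matrix of shape `[n_samples, n_features]`
    - accepted by all algorithms; can be obtained from `sklearn.feature_extraction`
* Similarity (or distance) matrix of shape `[n_samples, n_samples]`
    - accepted by `AffinityPropagation`, `SpectralClustering` and `DBSCAN`; can be obtained from `sklearn.metrics.pairwise`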
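
The two input formats can be sketched with `DBSCAN`, which accepts either a feature matrix or, with `metric = 'precomputed'`, a pairwise distance matrix. The toy data and the `eps` / `min_samples` values are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import pairwise_distances

# Hypothetical toy data of shape [n_samples, n_features] = [6, 2].
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# Input format 1: the feature matrix itself.
labels_from_features = DBSCAN(eps = 1.5, min_samples = 2).fit(X).labels_

# Input format 2: a precomputed [n_samples, n_samples] distance matrix.
D = pairwise_distances(X, metric = 'euclidean')
labels_from_distances = DBSCAN(eps = 1.5, min_samples = 2,
                               metric = 'precomputed').fit(D).labels_

print('Distance matrix shape:', D.shape)
print('Labels from features: ', labels_from_features)
print('Labels from distances:', labels_from_distances)
```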


## Algorithm comparison
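
One possible way to compare algorithms is to run several estimators on the same data and score each result against known labels. The sketch below does this on hypothetical, well-separated blobs; the choice of estimators, the parameter values (for example `eps = 3.0`) and the use of the Adjusted Rand index are all assumptions, not a benchmark.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, Birch, DBSCAN
from sklearn.metrics import adjusted_rand_score

# Hypothetical data: two Gaussian blobs of 50 points each, far apart.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 0.5, size = (50, 2)),
               rng.normal(10.0, 0.5, size = (50, 2))])
y_true = np.repeat([0, 1], 50)

estimators = {
    'KMeans': KMeans(n_clusters = 2, n_init = 10, random_state = 0),
    'AgglomerativeClustering': AgglomerativeClustering(n_clusters = 2),
    'Birch': Birch(n_clusters = 2),
    'DBSCAN': DBSCAN(eps = 3.0, min_samples = 5),
}

# fit_predict returns the cluster label of every sample for each estimator.
scores = {}
for name, est in estimators.items():
    labels = est.fit_predict(X)
    scores[name] = adjusted_rand_score(y_true, labels)

for name, score in scores.items():
    print(name, 'ARI:', score)
```

On data this easy every estimator should recover the two blobs; differences between algorithms only show up on harder shapes, noise, or unequal cluster sizes.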

## K-Means and MiniBatch K-Means clustering

In [None]:
# K-Means API
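class sklearn.cluster.KMeans(n_clusters = 8,               # number of clusters to form (and centroids to generate)
                             init = 'k-means++',           # initialization method for the cluster centers: 'k-means++', 'random', or an np.array of shape (n_clusters, n_features)
                             n_init = 10,                  # number of runs with different centroid seeds; the run with the lowest inertia is kept
                             max_iter = 300,               # maximum number of iterations for a single run
                             tol = 0.0001,                 # relative tolerance with regard to inertia to declare convergence
                             precompute_distances = 'auto',# whether to precompute distances ('auto': do not precompute if n_samples * n_clusters > 12 million); older releases only, removed in scikit-learn 1.0
                             verbose = 0,                  # verbosity mode
                             random_state = None,          # seed for the random centroid initialization
                             copy_x = True,                # if True, the original data is not modified when it is centered for the distance computation
                             n_jobs = None,                # number of parallel jobs used for the n_init runs; None means 1, -1 means all processors; older releases only, removed in scikit-learn 1.0
                             algorithm = 'auto')           # 'full': classical EM-style algorithm; 'elkan': variant using the triangle inequality; 'auto': 'elkan' for dense data, 'full' for sparse data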

# attributes
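clr.cluster_centers_ # coordinates of the final cluster centers (if the algorithm stops before fully converging, they may be inconsistent with labels_)
clr.labels_          # cluster label of each sample
clr.inertia_         # sum of squared distances of the samples to their closest cluster center
clr.n_iter_          # number of iterations run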

# methods
clr.fit(X, sample_weight)
clr.fit_predict(X, sample_weight)
clr.fit_transform(X, sample_weight)
clr.get_params(deep)
clr.predict(X, sample_weight)
clr.score(X, y, sample_weight)
clr.set_params(**params)
clr.transform(X)
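
To make the attributes and methods above more concrete, here is a minimal sketch on hypothetical toy data. It passes only parameters that exist in both older and newer scikit-learn releases, and checks how `transform`, `labels_`, `inertia_` and `score` relate to each other.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated groups of three points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

clr = KMeans(n_clusters = 2, init = 'k-means++', n_init = 10,
             max_iter = 300, tol = 1e-4, random_state = 0)
clr.fit(X)

# transform returns the distance of each sample to each cluster center,
# so the result has shape (n_samples, n_clusters).
distances = clr.transform(X)

# The label of a sample is the index of its nearest center.
nearest = distances.argmin(axis = 1)

# inertia_ is the sum of squared distances to the nearest center.
manual_inertia = (distances.min(axis = 1) ** 2).sum()

print('labels_:       ', clr.labels_)
print('nearest center:', nearest)
print('inertia_:      ', clr.inertia_)
print('manual inertia:', manual_inertia)
print('score(X):      ', clr.score(X))  # opposite of the K-Means objective on X
```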

In [None]:
# MiniBatch K-Means API
class sklearn.cluster.MiniBatchKMeans(n_clusters = 8, 
                                      init = 'k-means++',
                                      max_iter = 100, 
                                      batch_size = 100,     # number of samples in each mini batch
                                      verbose = 0, 
                                      compute_labels = True, 
                                      random_state = None, 
                                      tol = 0.0,
                                      max_no_improvement = 10, 
                                      init_size = None, 
                                      n_init = 3,
                                      reassignment_ratio = 0.01)

# attributes
mbclr.cluster_centers_
mbclr.labels_
mbclr.inertia_

# methods
mbclr.fit()
mbclr.fit_predict()
mbclr.fit_transform()
mbclr.get_params()
mbclr.partial_fit()
mbclr.predict()
mbclr.score()
mbclr.set_params()
mbclr.transform()
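
The method that sets `MiniBatchKMeans` apart from `KMeans` is `partial_fit`, which updates the centers from one chunk of data at a time. The sketch below feeds hypothetical blob data in chunks of 100 samples; the data, the chunk size and the shuffling step are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Hypothetical data: two Gaussian blobs of 200 points each, far apart.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 0.5, size = (200, 2)),
               rng.normal(10.0, 0.5, size = (200, 2))])
y = np.repeat([0, 1], 200)

# Shuffle so that every chunk contains samples from both blobs.
perm = rng.permutation(len(X))
X, y = X[perm], y[perm]

mbclr = MiniBatchKMeans(n_clusters = 2, batch_size = 100, n_init = 3, random_state = 0)

# Incremental learning: each call updates the centers with one chunk.
chunk_size = 100
for start in range(0, len(X), chunk_size):
    mbclr.partial_fit(X[start:start + chunk_size])

labels = mbclr.predict(X)
print('Cluster centers:\n', mbclr.cluster_centers_)
print('Cluster sizes:', np.bincount(labels))
```

When the whole dataset fits in memory, calling `mbclr.fit(X)` is simpler; `partial_fit` is mainly useful for streaming or out-of-core data.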

### K-Means example 1

In [9]:
from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 2],
             [1, 4],
             [1, 0],
             [4, 2],
             [4, 4],
             [4, 0]])
kmeans = KMeans(n_clusters = 2, random_state = 0).fit(X = X)
print('Train data cluster labels:', kmeans.labels_)

X_test = np.array([[0, 0], 
                   [4, 4]])
pred = kmeans.predict(X = X_test)
print('Test data cluster labels:', pred)

print('Final cluster centers:', '\n', kmeans.cluster_centers_)

Train data cluster labels: [0 0 0 1 1 1]
Test data cluster labels: [0 1]
Final cluster centers: 
 [[1. 2.]
 [4. 2.]]
