# Topics in Cloud and Environment (2019)

# Week 02-2: Clustering with Air Pollution Time Series

In the previous stage, we used the EPA's 2015 PM2.5 observations to build more than 8,000 time series, each 240 hours long, of an air pollution index. Next, we use these data for cluster analysis.

Broadly speaking, cluster analysis algorithms fall into a few major categories:

-	[Connectivity-based clustering (hierarchical clustering)](https://en.wikipedia.org/wiki/Cluster_analysis#Connectivity-based_clustering_(hierarchical_clustering))
-	[Centroid-based clustering](https://en.wikipedia.org/wiki/Cluster_analysis#Centroid-based_clustering)
-	[Distribution-based clustering](https://en.wikipedia.org/wiki/Cluster_analysis#Distribution-based_clustering)
-	[Density-based clustering](https://en.wikipedia.org/wiki/Cluster_analysis#Density-based_clustering)

For now we will not go into the details of each category. We only introduce the basic concepts and a few commonly used algorithms.
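To make the four categories concrete, here is a minimal sketch that runs one scikit-learn estimator from each family. The data are synthetic 2-D blobs, not the EPA series, and the choice of estimator per family and all parameter values (`eps=1.5`, `min_samples=5`, three clusters) are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data (NOT the EPA series): three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# One representative estimator per family (illustrative choices).
hier = AgglomerativeClustering(n_clusters=3).fit_predict(X)              # connectivity-based
cent = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)    # centroid-based
dist = GaussianMixture(n_components=3, random_state=0).fit_predict(X)    # distribution-based
dens = DBSCAN(eps=1.5, min_samples=5).fit_predict(X)                     # density-based

for name, lab in [("hierarchical", hier), ("k-means", cent),
                  ("gaussian mixture", dist), ("dbscan", dens)]:
    # DBSCAN marks noise points with the label -1, so exclude it from the count.
    n_found = len(set(lab.tolist()) - {-1})
    print(f"{name}: {n_found} clusters")
```

Note that k-means, the Gaussian mixture, and the agglomerative model are told how many clusters to find, while DBSCAN infers the number from the density parameters.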


## Similarity / Distance Metrics

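Every clustering algorithm essentially does one thing: it puts similar items together and keeps dissimilar items apart. The definition of "similarity" or "distance" therefore largely determines the clustering result.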
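As a small illustration of how much the metric matters, the sketch below compares Euclidean distance with correlation distance (1 minus the Pearson correlation) on three synthetic series. The series `a`, `b`, and `c` are made up for this example: `b` has the same shape as `a` but a different amplitude and offset, while `c` has a similar range but a shifted phase.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three synthetic 240-step series (illustrative, not the EPA data).
t = np.linspace(0, 4 * np.pi, 240)
a = np.sin(t)            # reference series
b = 5 * np.sin(t) + 20   # same shape as a, different amplitude and offset
c = np.cos(t)            # similar range to a, shifted phase
series = np.vstack([a, b, c])

# Pairwise distance matrices under two metrics.
d_euclid = squareform(pdist(series, metric="euclidean"))
d_corr = squareform(pdist(series, metric="correlation"))  # 1 - Pearson r

print("Euclidean:   a-b =", d_euclid[0, 1], " a-c =", d_euclid[0, 2])
print("Correlation: a-b =", d_corr[0, 1], " a-c =", d_corr[0, 2])
```

Under Euclidean distance `a` is closer to `c`, whereas under correlation distance `a` is closer to `b`, so the two metrics would group these series differently.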

In this exercise, we demonstrate a few common distance metrics together with three algorithms: [k-means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html), [spectral clustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html#sklearn.cluster.SpectralClustering), and [affinity propagation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html#sklearn.cluster.AffinityPropagation).

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# For Chinese font
from matplotlib.font_manager import FontProperties
font = FontProperties(fname="../data/NotoSansCJKtc-Regular.otf", size=10)
# Load data
data = pd.read_csv('./apts.csv')
data.shape

(240, 355)

In [2]:
# Pairwise Pearson correlation between the columns of `data`
dcor = data.corr()
dcor.head()

Unnamed: 0,2015/01/11:h23,2015/01/12:h23,2015/01/13:h23,2015/01/14:h23,2015/01/15:h23,2015/01/16:h23,2015/01/17:h23,2015/01/18:h23,2015/01/19:h23,2015/01/20:h23,...,2015/12/22:h23,2015/12/23:h23,2015/12/24:h23,2015/12/25:h23,2015/12/26:h23,2015/12/27:h23,2015/12/28:h23,2015/12/29:h23,2015/12/30:h23,2015/12/31:h23
2015/01/11:h23,1.0,-0.360782,0.16142,0.041876,0.015155,0.007985,-0.080874,-0.272181,-0.032627,0.144268,...,0.022186,0.361026,-0.160426,-0.258252,0.082439,-0.064767,-0.059526,0.478201,-0.386873,0.225043
2015/01/12:h23,-0.360782,1.0,-0.244681,0.302211,0.108117,0.109245,-0.017095,-0.361428,-0.22532,-0.104866,...,-0.073532,0.044374,0.40089,-0.123128,-0.126958,-0.2905,-0.019735,-0.001914,0.532304,-0.280389
2015/01/13:h23,0.16142,-0.244681,1.0,-0.092882,0.490378,0.187301,0.062759,-0.275348,-0.330736,-0.265284,...,0.335796,-0.081911,0.05365,0.393895,0.084026,-0.029015,-0.243999,-0.0028,0.014446,0.627722
2015/01/14:h23,0.041876,0.302211,-0.092882,1.0,-0.052773,0.53507,-0.019186,-0.273082,-0.268698,-0.347034,...,-0.116369,0.386525,0.006283,0.118244,0.460001,-0.295559,0.062445,-0.130385,0.099035,0.028185
2015/01/15:h23,0.015155,0.108117,0.490378,-0.052773,1.0,-0.001741,0.302279,-0.279994,-0.287015,-0.308939,...,0.521727,-0.041478,0.513617,0.09928,-0.017453,-0.04412,-0.224692,0.194021,-0.009932,0.131474


In [3]:
# k-means on the transposed data, so that each row (sample) is one 240-hour time series
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=10, random_state=0).fit(data.T)
print(kmeans.labels_)

[7 7 7 7 7 7 7 4 8 8 2 5 1 1 6 9 8 2 5 1 1 6 3 3 3 7 4 8 8 2 5 1 1 6 9 8 8
 8 2 5 1 1 6 3 3 3 3 3 0 0 0 0 7 7 0 0 0 7 7 0 0 0 7 7 7 7 2 7 7 7 7 8 8 2
 2 1 1 3 3 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 7 7 7 2 2 7 7 7 3 3 7 7 0 0 0 0 0 0 0 0 0 7 7 0 0 0 7 7 7 0 0 0
 0 0 0 0 0 0 4 4 8 2 5 1 1 6 3 3 3 7 7 7 0 0]
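A long label array is easier to read as per-cluster counts. The sketch below tallies only the first printed row of the k-means output above (37 of the 355 labels), copied by hand; with the fitted model in memory, the same call on `kmeans.labels_` would cover all samples.

```python
import numpy as np

# First printed row of the k-means labels above (37 of 355 values).
first_row = np.array([
    7, 7, 7, 7, 7, 7, 7, 4, 8, 8, 2, 5, 1, 1, 6, 9, 8, 2, 5,
    1, 1, 6, 3, 3, 3, 7, 4, 8, 8, 2, 5, 1, 1, 6, 9, 8, 8,
])

# counts[k] is the number of samples assigned to cluster k.
counts = np.bincount(first_row, minlength=10)
for k, n in enumerate(counts):
    print(f"cluster {k}: {n}")
```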


In [4]:
# Spectral Clustering
from sklearn.cluster import SpectralClustering

spc = SpectralClustering(n_clusters=10, assign_labels="discretize", random_state=0)
spc.fit(data.T)
print(spc.labels_)

[5 5 5 5 2 5 0 5 5 2 5 5 5 5 0 0 2 5 5 5 5 5 2 5 2 5 0 5 5 0 0 5 2 5 2 5 5
 5 5 5 0 2 5 2 5 5 0 2 2 0 2 0 5 0 0 2 5 0 2 2 0 2 5 2 2 2 5 2 5 0 5 0 5 5
 5 2 0 2 0 0 2 2 2 0 0 4 3 1 1 1 1 1 1 1 1 1 1 2 5 2 5 2 2 2 2 5 5 1 1 1 1
 1 1 1 1 1 5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 5 5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 5 5 5 0 2 2 2 0
 0 4 3 3 1 1 1 1 1 1 1 1 1 1 1 1 5 5 5 5 2 2 0 0 6 2 0 2 2 0 0 6 0 5 2 5 2
 2 5 5 5 5 5 2 2 2 2 2 0 5 2 5 2 0 2 5 0 0 5 5 5 2 0 0 2 2 0 5 5 2 0 2 5 0
 0 2 2 3 1 5 0 0 2 0 2 5 0 5 0 5 5 5 2 2 5 5]
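The cell above relies on the default affinity of `SpectralClustering` (an RBF kernel on the raw features). A different similarity can be supplied through `affinity="precomputed"`. The following is a minimal sketch on synthetic series, not the EPA data; rescaling the correlation to `[0, 1]` with `(corr + 1) / 2` is an assumption made here so that the affinity matrix is non-negative.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Synthetic stand-in: 10 noisy sine series and 10 noisy cosine series, 240 steps each.
rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 240)
group_a = np.sin(t) + rng.normal(scale=0.1, size=(10, 240))
group_b = np.cos(t) + rng.normal(scale=0.1, size=(10, 240))
series = np.vstack([group_a, group_b])

# Pearson correlation between rows, rescaled from [-1, 1] to [0, 1]
# (illustrative choice; spectral clustering expects non-negative affinities).
corr = np.corrcoef(series)
affinity = (corr + 1.0) / 2.0

spc_pre = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0)
labels = spc_pre.fit_predict(affinity)
print(labels)
```

With real data, `data.corr()` could play the role of `corr`, provided any NaN entries are handled first.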




In [5]:
# Affinity Propagation
from sklearn.cluster import AffinityPropagation

apc = AffinityPropagation().fit(data.T)
print(apc)
print(apc.labels_)

AffinityPropagation(affinity='euclidean', convergence_iter=15, copy=True,
          damping=0.5, max_iter=200, preference=None, verbose=False)
[ 0  1  2  3 62 63  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
 56 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 71
 44 61 61 62 59 59 60 60 60 61 61 62 63 63 59 64 64 65 66 45 67 46 47 48
 49 50 51 52 53 54 55 71 56 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
 58 59 58 58 60 57 58 61 62 63 58 58 58 58 58 58 58 58 58 58 58 58 58 58
 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
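Unlike k-means and spectral clustering, affinity propagation is not given a number of clusters. It selects exemplars itself, and the `preference` parameter influences how many it picks, which is why the default run above produces labels running up to at least 71. The sketch below counts the exemplars for two preference values on synthetic blobs. The data, the value `-500.0`, and `max_iter=1000` are illustrative assumptions, and `random_state` assumes a recent scikit-learn release.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Synthetic stand-in data (NOT the EPA series): three 2-D blobs of 20 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

results = {}
for preference in (None, -500.0):  # None uses the median of the input similarities
    ap = AffinityPropagation(damping=0.9, preference=preference,
                             max_iter=1000, random_state=0).fit(X)
    # One exemplar index per cluster; the array is empty if the fit did not converge.
    results[preference] = len(ap.cluster_centers_indices_)
    print(f"preference={preference}: {results[preference]} clusters")
```

Lower (more negative) preferences generally lead to fewer exemplars, so this is the main knob to turn when the default produces too many clusters.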

In [6]:
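# Affinity propagation on a precomputed similarity matrix:
# the pairwise correlation between the columns of `data`, with NaN replaced by 0
apc_cor = AffinityPropagation(damping=0.9, affinity='precomputed', preference=0.1).fit(data.corr().fillna(0))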
print(apc_cor)
print(apc_cor.labels_)

AffinityPropagation(affinity='precomputed', convergence_iter=15, copy=True,
          damping=0.9, max_iter=200, preference=0.1, verbose=False)
[  0   5  14  15  24   1   8 116 120 124   2  25 125   3  10   4  26   5
  23 126  12   6  19   7   7   8 116 120 119   2  25   9 111  10 112   7
  11 119   2  25  97  12  13  19  14  15  19  99   7  16  32  10   4  11
 119  17  23  21 111  18  19  20 115 116 120 119  17  25  21 115  22  11
  11  17  23  21 111  24 112 122  26   2  25  21 111  10 122  26 100  33
  27  83  28  29 123  30 120  30 120 124  31  25 125  32  98 122  26 100
  33  34  43  35  36 119 121  42 115 110 124 118  17  97  37  98  18  99
 100  38  36  39  46   3  40  41  84  42  43  23  40  44 106  17  45  46
  80  47  48  49  50  51  52  53  54  14  48  55  24  56  36  37  98  57
  58  59  60  61  16  20 112  62  75  63 114  64  88  65  83  66 117  45
  34  67  78  68  69  70  71  73  58  72  73 105  15  74  16  75  76  77
  50  78  79  80  81  58  82  83  84  85  86  87  92 