In [1]:
dataset = []
with open('../dataset/location.csv') as file:
    next(file)  # skip the header row
    for line in file:
        fields = line.split(',')
        name = fields[1]
        longitude = float(fields[3])
        latitude = float(fields[4])
        # Store as (latitude, longitude, name); folium expects latitude first.
        dataset.append((latitude, longitude, name))

In [2]:
len(dataset)

In [3]:
dataset[:5] + dataset[-5:]

[(25.0712, 121.7812, '五分山雷達站'),
 (24.9976, 121.442, '板橋'),
 (25.1649, 121.4489, '淡水'),
 (25.1826, 121.5297, '鞍部'),
 (25.0377, 121.5149, '臺北'),
 (23.297, 121.2716, '卓樂'),
 (23.4931, 121.3388, '紅葉'),
 (23.4434, 121.3274, '立山'),
 (23.8709, 121.5081, '壽豐'),
 (23.9657, 121.4928, '銅門')]

Use [folium](https://github.com/python-visualization/folium) to draw maps.

In [4]:
import folium
m1 = folium.Map(location=[23.96639, 120.96952], zoom_start=7)
for latitude, longitude, name in dataset:
    folium.Circle(
        location=[latitude, longitude],
        radius=0,
        popup=name,
        color='crimson',
        fill=False
    ).add_to(m1)

In [5]:
m1

In [6]:
import numpy as np
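# Keep only the numeric (latitude, longitude) columns as the feature matrix.
X = np.array(
    [(latitude, longitude) for latitude, longitude, _ in dataset],
    dtype=np.float64,
)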

In [7]:
X

array([[  25.0712,  121.7812],
       [  24.9976,  121.442 ],
       [  25.1649,  121.4489],
       ..., 
       [  23.4434,  121.3274],
       [  23.8709,  121.5081],
       [  23.9657,  121.4928]])

Use the `k-means` clustering algorithm from [scikit-learn](http://scikit-learn.org/).

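+ Partitioning-based clustering
+ Not suited to clusters of **different sizes, different densities, or non-spherical shapes**
+ Results are sensitive to the initial centroids -> use bisecting k-means
+ Results are sensitive to outliers or noise -> use k-medoids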
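
The non-spherical limitation can be illustrated with a small sketch. The data here is a made-up pair of long parallel stripes, not the notebook's dataset, and the variable names are only for illustration. Since k-means minimizes squared distance to the centroids, it tends to cut the stripes in half across their length instead of separating one stripe from the other.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two elongated, parallel "stripes": 50 points each along y = 0 and y = 1.
# (Synthetic data, assumed only for this illustration.)
xs = np.linspace(0.0, 10.0, 50)
stripes = np.vstack([
    np.column_stack([xs, np.zeros_like(xs)]),
    np.column_stack([xs, np.ones_like(xs)]),
])
true_labels = np.repeat([0, 1], 50)

pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(stripes)

# Cluster ids are arbitrary, so score both possible matchings and keep the better one.
match = np.mean(pred == true_labels)
agreement = max(match, 1.0 - match)
print(agreement)
```

An agreement far below 1.0 means the two clusters found by k-means do not correspond to the two stripes.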

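Split the dataset into 8 clusters, leaving all other parameters at their defaults.  
Note: by default, scikit-learn's k-means does not pick the initial centroids purely at random (it uses `init='k-means++'`), and it also runs the algorithm several times and keeps the best run (controlled by `n_init`, whose default differs between releases), so the result is less sensitive to initialization.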
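
To get a feel for how much initialization matters, the sketch below compares single runs with purely random initial centroids against a k-means++ run that keeps the best of 10 attempts. It uses synthetic blobs as a stand-in for `X`, and the names `points`, `single_run_sse`, and `multi_run_sse` are assumptions made for this example. `inertia_` is the SSE that scikit-learn reports for a fitted model.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Synthetic stand-in for X: three Gaussian blobs (not the notebook's data).
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# One run per seed, each starting from purely random initial centroids.
single_run_sse = [
    KMeans(n_clusters=3, init='random', n_init=1, random_state=seed)
    .fit(points)
    .inertia_
    for seed in range(5)
]

# k-means++ seeding, keeping the best of 10 runs.
multi_run_sse = (
    KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
    .fit(points)
    .inertia_
)

print(single_run_sse)
print(multi_run_sse)
```

If the single-run values differ from one another, those runs converged to different local optima. The best-of-several value is typically at least as low as the worst of them.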

In [8]:
from sklearn.cluster import KMeans
labels = KMeans(n_clusters=8).fit_predict(X)

In [9]:
# From Material Design Color Palette
colors = [
    '#F44336', '#673AB7', '#03A9F4', '#4CAF50', '#FFEB3B',
    '#009688', '#9C27B0', '#607D8B', '#FF5722', '#CDDC39',
    '#FF9800', '#2196F3', '#E91E63', '#3F51B5', '#00BCD4',
    '#8BC34A', '#FFC107', '#795548', '#9E9E9E', '#000000',
]

In [10]:
m2 = folium.Map(location=[23.96639, 120.96952], zoom_start=7)
for i, (latitude, longitude, name) in enumerate(dataset):
    folium.Circle(
        location=[latitude, longitude],
        radius=0,
        popup=name,
        color=colors[labels[i]],
        fill=False
    ).add_to(m2)

In [11]:
m2

As expected, this dataset is not well suited to k-means clustering.

Use `Agglomerative Clustering`, a kind of hierarchical clustering algorithm, from [scikit-learn](http://scikit-learn.org/).

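Split the dataset into 8 clusters, leaving all other parameters at their defaults.  
Note: scikit-learn's Agglomerative Clustering uses ward linkage by default, that is, at each step it merges the pair of clusters whose merge increases the SSE (within-cluster sum of squared errors) the least.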
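
As a rough sketch of what SSE measures, the helper below (the name `sse` and the toy points are assumptions for this example, not part of the notebook) sums the squared distance from every point to the mean of its cluster. It then applies ward linkage to four toy points that form two tight pairs.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def sse(points, labels):
    """Sum of squared distances from each point to its cluster mean."""
    total = 0.0
    for label in np.unique(labels):
        members = points[labels == label]
        total += np.sum((members - members.mean(axis=0)) ** 2)
    return total

# Four toy points forming two tight pairs (not the notebook's data).
toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
ward_labels = AgglomerativeClustering(n_clusters=2, linkage='ward').fit_predict(toy)

print(ward_labels)
print(sse(toy, ward_labels))
```

Ward linkage keeps each pair together here. Every point then lies 1 unit from its pair's mean, so the SSE is 4. The same helper could be applied to `X` with the labels from either algorithm to compare them on this criterion.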

In [12]:
from sklearn.cluster import AgglomerativeClustering
labels = AgglomerativeClustering(n_clusters=8).fit_predict(X)

In [13]:
m3 = folium.Map(location=[23.96639, 120.96952], zoom_start=7)
for i, (latitude, longitude, name) in enumerate(dataset):
    folium.Circle(
        location=[latitude, longitude],
        radius=0,
        popup=name,
        color=colors[labels[i]],
        fill=False
    ).add_to(m3)

In [14]:
m3

The clusters now appear to follow the outline of the data's distribution more closely.