Preparing `taipower` dataset:  
+ electricity consumption in northern Taiwan
+ 273 days (2016/10/01–2017/06/30)
+ select 10 hours for each day (9:00–18:00)
+ Note: each DATETIME is rounded to its nearest hour (e.g. 2017/10/16 14:37:52 -> 2017/10/16 15:00:00)

```sql
SELECT
  DATE_FORMAT(
    DATE_ADD(Power.updateTime, INTERVAL 30 MINUTE),
    '%Y-%m-%d %H:00:00'
  ) AS time,
  Power.northUsage
FROM Power
WHERE (DATE(Power.updateTime) BETWEEN '2016-10-01' AND '2017-06-30')
  AND (TIME(Power.updateTime) BETWEEN '08:30:00' AND '18:29:59')
```
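
The rounding rule in the query (add 30 minutes, then truncate to the hour) can be checked outside the database. Below is a minimal Python sketch of the same rule; the helper name `round_to_nearest_hour` is illustrative and not part of the original notebook.

```python
from datetime import datetime, timedelta

def round_to_nearest_hour(dt):
    """Add 30 minutes, then drop minutes and seconds (mirrors the SQL above)."""
    shifted = dt + timedelta(minutes=30)
    return shifted.replace(minute=0, second=0, microsecond=0)

print(round_to_nearest_hour(datetime(2017, 10, 16, 14, 37, 52)))  # 2017-10-16 15:00:00
print(round_to_nearest_hour(datetime(2017, 10, 16, 14, 29, 59)))  # 2017-10-16 14:00:00
```

This also shows why the `WHERE` clause filters on 08:30:00 to 18:29:59: those are exactly the timestamps that round to the hours 09 through 18.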

In [1]:
from datetime import date, timedelta
start = date(2016, 10, 1)
end = date(2017, 6, 30)
delta = end - start
hours = ['09', '10', '11', '12', '13', '14', '15', '16', '17', '18']
delta.days + 1

273

In [2]:
taipower_dict = {}
for i in range(delta.days + 1):
    for j in hours:
        key = str(start + timedelta(days=i)) + ' ' + j + ':00:00'
        taipower_dict[key] = -1

In [3]:
with open('taipower.csv') as file:
    for line in file:
        fields = line.split(',')
        key = fields[0]
        if key in taipower_dict:
            if taipower_dict[key] != -1:
                print('DUPLICATE', key)
            else:
                taipower_dict[key] = float(fields[1])
        else:
            print('ERROR')

DUPLICATE 2017-05-20 13:00:00


In [4]:
missing = [key for key in taipower_dict if taipower_dict[key] == -1]
len(missing)

867

In [5]:
taipower = []
for i in range(delta.days + 1):
    record = []
    for j in hours:
        key = str(start + timedelta(days=i)) + ' ' + j + ':00:00'
        record.append(taipower_dict[key])
    taipower.append(record)

Use [scikit-learn](http://scikit-learn.org/) to impute missing values using the mean of columns.

In [6]:
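# Imputer was removed in scikit-learn 0.22; SimpleImputer is its replacement.
from sklearn.impute import SimpleImputer
taipower = SimpleImputer(missing_values=-1, strategy='mean').fit_transform(taipower)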
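
To make the column-mean imputation concrete, here is a small pure-Python sketch on made-up numbers. The function name `impute_column_mean` and the toy values are assumptions for illustration; it is not the scikit-learn implementation, only the same idea written out by hand.

```python
MISSING = -1  # same sentinel the notebook uses for "no reading"

def impute_column_mean(rows, missing=MISSING):
    """Return a copy of rows with each missing entry replaced by its column mean."""
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        observed = [row[c] for row in rows if row[c] != missing]
        means.append(sum(observed) / len(observed))  # assumes at least one observed value
    return [[means[c] if row[c] == missing else row[c] for c in range(n_cols)]
            for row in rows]

toy = [
    [10.0, 20.0],
    [-1,   40.0],
    [30.0, -1],
]
print(impute_column_mean(toy))
# [[10.0, 20.0], [20.0, 40.0], [30.0, 30.0]]
```

Here each column is one hour of the day, so a missing 13:00 reading is filled with the average of all observed 13:00 readings. Using -1 as the sentinel is only safe as long as -1 is never a genuine reading.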

In [7]:
len(taipower)

273

Preparing `temperature` dataset:  
+ temperature in Banqiao, Taiwan
+ 273 days (2016/10/01–2017/06/30)
+ select 10 hours for each day (9:00–18:00)
+ Note: each DATETIME is rounded to its nearest hour (e.g. 2017/10/16 14:37:52 -> 2017/10/16 15:00:00)

```sql
SELECT
  DATE_FORMAT(
    DATE_ADD(逐時觀測.時間, INTERVAL 30 MINUTE),
    '%Y-%m-%d %H:00:00'
  ) AS time,
  逐時觀測.溫度
FROM 逐時觀測
WHERE (DATE(逐時觀測.時間) BETWEEN '2016-10-01' AND '2017-06-30')
  AND (TIME(逐時觀測.時間) BETWEEN '08:30:00' AND '18:29:59')
  AND (逐時觀測.測站 = 'BANQIAO,板橋')
```

In [8]:
temperature_dict = {}
for i in range(delta.days + 1):
    for j in hours:
        key = str(start + timedelta(days=i)) + ' ' + j + ':00:00'
        temperature_dict[key] = -1

In [9]:
with open('temperature.csv') as file:
    for line in file:
        fields = line.split(',')
        key = fields[0]
        if key in temperature_dict:
            if temperature_dict[key] != -1:
                print('DUPLICATE', key)
            else:
                temperature_dict[key] = float(fields[1])
        else:
            print('ERROR')

In [10]:
missing = [key for key in temperature_dict if temperature_dict[key] == -1]
len(missing)

0

In [11]:
temperature = []
for i in range(delta.days + 1):
    record = []
    for j in hours:
        key = str(start + timedelta(days=i)) + ' ' + j + ':00:00'
        record.append(temperature_dict[key])
    temperature.append(record)

In [12]:
import numpy as np
temperature = np.array(temperature)

In [13]:
len(temperature)

273

In [14]:
# From Material Design Color Palette
colors = [
    '#F44336', '#673AB7', '#03A9F4', '#4CAF50', '#FFEB3B',
    '#009688', '#9E9E9E', '#795548', '#CDDC39', '#FF5722',
    '#E91E63', '#2196F3', '#3F51B5', '#00BCD4', '#8BC34A',
    '#FFC107', '#607D8B', '#9C27B0', '#FF9800', '#000000',
]

Use `Agglomerative Clustering`, a kind of hierarchical clustering algorithm, from [scikit-learn](http://scikit-learn.org/).  
The dataset is split into 3 clusters; all other parameters use their default values.

In [15]:
from sklearn.cluster import AgglomerativeClustering
labels = AgglomerativeClustering(n_clusters=3).fit_predict(taipower)
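
As a sanity check of what this call does, the sketch below runs the same estimator on a tiny made-up dataset with three obvious groups. The toy values are assumptions for illustration. With default parameters scikit-learn uses Ward linkage on Euclidean distances, repeatedly merging the two closest clusters until `n_clusters` remain.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Six made-up "days", each with three hourly readings, in three obvious groups.
toy = np.array([
    [1.0, 1.1, 0.9], [1.2, 1.0, 1.1],
    [5.0, 5.2, 4.9], [5.1, 4.8, 5.0],
    [9.0, 9.1, 8.8], [9.2, 9.0, 9.1],
])

toy_labels = AgglomerativeClustering(n_clusters=3).fit_predict(toy)

# Cluster ids are arbitrary, so compare pairs of rows rather than exact values.
same_pairs = [toy_labels[0] == toy_labels[1],
              toy_labels[2] == toy_labels[3],
              toy_labels[4] == toy_labels[5]]
print(all(same_pairs), len(set(toy_labels.tolist())))
```

Each row is treated as one point in a 10-dimensional space in the notebook (one dimension per hour), so days end up in the same cluster when their whole 9:00 to 18:00 profiles are close.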

In [16]:
from IPython.display import HTML, display
html = '<table><tr><td></td>'
for i in range(1, 32):
    html += '<td>' + str(i) + '</td>'
html += '</tr>'
cells = ['<td>10</td>']
for i in range(delta.days + 1):
    d = start + timedelta(days=i)
    if d.day == 1:
        if len(cells) == 32:
            html += '<tr>' + ''.join(cells) + '</tr>'
            cells[:] = ['<td>' + str(d.month) + '</td>']
        elif len(cells) == 31:
            cells.append('<td style="background-color: #FFFFFF;">　　</td>')
            html += '<tr>' + ''.join(cells) + '</tr>'
            cells[:] = ['<td>' + str(d.month) + '</td>']
        elif len(cells) == 29:
            cells.append('<td style="background-color: #FFFFFF;">　　</td>')
            cells.append('<td style="background-color: #FFFFFF;">　　</td>')
            cells.append('<td style="background-color: #FFFFFF;">　　</td>')
            html += '<tr>' + ''.join(cells) + '</tr>'
            cells[:] = ['<td>' + str(d.month) + '</td>']
        elif d != start:
            print('ERROR', d, len(cells))
    s = '<td style="background-color: ' + colors[labels[i]] + ';">　　</td>'
    cells.append(s)
cells.append('<td style="background-color: #FFFFFF;">　　</td>')
html += '<tr>' + ''.join(cells) + '</tr></table>'
display(HTML(html))

[Figure: calendar grid with one row per month (10, 11, 12, 1, ..., 6) and one column per day (1–31); each cell is colored by its cluster label.]


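Clustering result for electricity consumption. The figure shows which cluster each day was assigned to.  
Different colors denote different clusters; each row represents a month and each column represents a day of the month.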
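
The calendar layout (one row per month, one column per day) is rebuilt by hand in every plotting cell. The sketch below is one possible way to factor it into a reusable helper; the name `calendar_table` and its signature are assumptions, and it pads short months to 31 columns instead of special-casing row lengths.

```python
from datetime import date, timedelta

def calendar_table(start, labels, colors):
    """Build an HTML table: one row per month, one cell per day, colored by label."""
    blank = '<td style="background-color: #FFFFFF;">&nbsp;</td>'
    header = '<tr><td></td>' + ''.join(f'<td>{d}</td>' for d in range(1, 32)) + '</tr>'
    rows = {}  # (year, month) -> list of day cells, in insertion order
    for offset, label in enumerate(labels):
        day = start + timedelta(days=offset)
        cell = f'<td style="background-color: {colors[label]};">&nbsp;</td>'
        # A month that starts mid-way gets leading blanks so days stay column-aligned.
        rows.setdefault((day.year, day.month), [blank] * (day.day - 1)).append(cell)
    body = ''
    for (year, month), cells in rows.items():
        cells = cells + [blank] * (31 - len(cells))  # pad short months to 31 columns
        body += f'<tr><td>{month}</td>' + ''.join(cells) + '</tr>'
    return '<table>' + header + body + '</table>'

# February 2017 (28 days) followed by the first three days of March.
demo_html = calendar_table(date(2017, 2, 1), [0] * 28 + [1] * 3, ['#F44336', '#673AB7'])
print(demo_html.count('<tr>'))  # 3: header, February, March
```

In the notebook the result would be shown with `display(HTML(calendar_table(start, labels, colors)))`.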

In [17]:
from sklearn.cluster import AgglomerativeClustering
labels = AgglomerativeClustering(n_clusters=3).fit_predict(temperature)

In [18]:
from IPython.display import HTML, display
html = '<table><tr><td></td>'
for i in range(1, 32):
    html += '<td>' + str(i) + '</td>'
html += '</tr>'
cells = ['<td>10</td>']
for i in range(delta.days + 1):
    d = start + timedelta(days=i)
    if d.day == 1:
        if len(cells) == 32:
            html += '<tr>' + ''.join(cells) + '</tr>'
            cells[:] = ['<td>' + str(d.month) + '</td>']
        elif len(cells) == 31:
            cells.append('<td style="background-color: #FFFFFF;">　　</td>')
            html += '<tr>' + ''.join(cells) + '</tr>'
            cells[:] = ['<td>' + str(d.month) + '</td>']
        elif len(cells) == 29:
            cells.append('<td style="background-color: #FFFFFF;">　　</td>')
            cells.append('<td style="background-color: #FFFFFF;">　　</td>')
            cells.append('<td style="background-color: #FFFFFF;">　　</td>')
            html += '<tr>' + ''.join(cells) + '</tr>'
            cells[:] = ['<td>' + str(d.month) + '</td>']
        elif d != start:
            print('ERROR', d, len(cells))
    s = '<td style="background-color: ' + colors[labels[i]] + ';">　　</td>'
    cells.append(s)
cells.append('<td style="background-color: #FFFFFF;">　　</td>')
html += '<tr>' + ''.join(cells) + '</tr></table>'
display(HTML(html))

[Figure: calendar grid with one row per month (10, 11, 12, 1, ..., 6) and one column per day (1–31); each cell is colored by its cluster label.]


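Clustering result for temperature. The figure shows which cluster each day was assigned to.  
Different colors denote different clusters; each row represents a month and each column represents a day of the month.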

(Explanation to be added.)

Use `DBSCAN`, a kind of density-based clustering algorithm, from [scikit-learn](http://scikit-learn.org/).  
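The number of clusters does not need to be specified. To make the density criterion more lenient, I tried raising `eps`; all other parameters use their default values.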

In [19]:
from sklearn.cluster import DBSCAN
labels = DBSCAN(eps=50, min_samples=5).fit_predict(taipower)
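
To illustrate how `eps` loosens or tightens the density criterion, the sketch below runs DBSCAN on a made-up one-dimensional layout with two `eps` values. The toy data and the helper name `summarize` are assumptions for illustration. A point is a core point when at least `min_samples` points (itself included) lie within distance `eps`.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Ten made-up points on a line, spaced 1.0 apart, plus one far-away outlier.
toy = np.array([[float(i), 0.0] for i in range(10)] + [[100.0, 0.0]])

def summarize(eps, min_samples=5):
    """Return (number of clusters, number of noise points) for a given eps."""
    toy_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(toy)
    n_clusters = len(set(toy_labels.tolist()) - {-1})  # -1 is DBSCAN's noise label
    n_noise = int((toy_labels == -1).sum())
    return n_clusters, n_noise

print(summarize(0.5))  # (0, 11): no point has 5 neighbors within 0.5, so all are noise
print(summarize(2.5))  # (1, 1): the line forms one cluster, only the outlier is noise
```

The right scale for `eps` depends on the units of the data, which is why the electricity and temperature datasets need very different values.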

In [20]:
from IPython.display import HTML, display
html = '<table><tr><td></td>'
for i in range(1, 32):
    html += '<td>' + str(i) + '</td>'
html += '</tr>'
cells = ['<td>10</td>']
for i in range(delta.days + 1):
    d = start + timedelta(days=i)
    if d.day == 1:
        if len(cells) == 32:
            html += '<tr>' + ''.join(cells) + '</tr>'
            cells[:] = ['<td>' + str(d.month) + '</td>']
        elif len(cells) == 31:
            cells.append('<td style="background-color: #FFFFFF;">　　</td>')
            html += '<tr>' + ''.join(cells) + '</tr>'
            cells[:] = ['<td>' + str(d.month) + '</td>']
        elif len(cells) == 29:
            cells.append('<td style="background-color: #FFFFFF;">　　</td>')
            cells.append('<td style="background-color: #FFFFFF;">　　</td>')
            cells.append('<td style="background-color: #FFFFFF;">　　</td>')
            html += '<tr>' + ''.join(cells) + '</tr>'
            cells[:] = ['<td>' + str(d.month) + '</td>']
        elif d != start:
            print('ERROR', d, len(cells))
    s = '<td style="background-color: ' + colors[labels[i]] + ';">　　</td>'
    cells.append(s)
cells.append('<td style="background-color: #FFFFFF;">　　</td>')
html += '<tr>' + ''.join(cells) + '</tr></table>'
display(HTML(html))

[Figure: calendar grid with one row per month (10, 11, 12, 1, ..., 6) and one column per day (1–31); each cell is colored by its cluster label, with noise in black.]


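Clustering result for electricity consumption. The figure shows which cluster each day was assigned to.  
Different colors denote different clusters and black denotes noise; each row represents a month and each column represents a day of the month.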
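
Noise shows up as black because DBSCAN labels noise points `-1`, and indexing a Python list with `-1` returns its last element, which is `'#000000'` in the palette. A minimal sketch with a shortened palette and made-up labels:

```python
colors = ['#F44336', '#673AB7', '#03A9F4', '#000000']  # shortened palette, black last
toy_labels = [0, 1, -1, 2, -1]  # made-up DBSCAN-style labels; -1 marks noise

cell_colors = [colors[label] for label in toy_labels]
print(cell_colors)
# ['#F44336', '#673AB7', '#000000', '#03A9F4', '#000000']
```

One caveat of this trick: with the full 20-color palette, a cluster with index 19 would also be drawn in black and become indistinguishable from noise.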

In [21]:
from sklearn.cluster import DBSCAN
labels = DBSCAN(eps=2.75, min_samples=5).fit_predict(temperature)

In [22]:
from IPython.display import HTML, display
html = '<table><tr><td></td>'
for i in range(1, 32):
    html += '<td>' + str(i) + '</td>'
html += '</tr>'
cells = ['<td>10</td>']
for i in range(delta.days + 1):
    d = start + timedelta(days=i)
    if d.day == 1:
        if len(cells) == 32:
            html += '<tr>' + ''.join(cells) + '</tr>'
            cells[:] = ['<td>' + str(d.month) + '</td>']
        elif len(cells) == 31:
            cells.append('<td style="background-color: #FFFFFF;">　　</td>')
            html += '<tr>' + ''.join(cells) + '</tr>'
            cells[:] = ['<td>' + str(d.month) + '</td>']
        elif len(cells) == 29:
            cells.append('<td style="background-color: #FFFFFF;">　　</td>')
            cells.append('<td style="background-color: #FFFFFF;">　　</td>')
            cells.append('<td style="background-color: #FFFFFF;">　　</td>')
            html += '<tr>' + ''.join(cells) + '</tr>'
            cells[:] = ['<td>' + str(d.month) + '</td>']
        elif d != start:
            print('ERROR', d, len(cells))
    s = '<td style="background-color: ' + colors[labels[i]] + ';">　　</td>'
    cells.append(s)
cells.append('<td style="background-color: #FFFFFF;">　　</td>')
html += '<tr>' + ''.join(cells) + '</tr></table>'
display(HTML(html))

[Figure: calendar grid with one row per month (10, 11, 12, 1, ..., 6) and one column per day (1–31); each cell is colored by its cluster label, with noise in black.]


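Clustering result for temperature. The figure shows which cluster each day was assigned to.  
Different colors denote different clusters and black denotes noise; each row represents a month and each column represents a day of the month.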

(Explanation to be added.)

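Based on the experiments above, I think these two datasets are not well suited to density-based clustering, because it is hard to find densely distributed data points in them.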
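
One way to look at how dense a dataset is before settling on `eps` is the sorted k-distance list: for every point, the distance to its k-th nearest neighbor (counting the point itself, to match how `min_samples` is counted). This was not part of the experiments above; the sketch below is an illustrative addition on made-up data, and the helper name `k_distances` is an assumption.

```python
import numpy as np

def k_distances(X, k=5):
    """Sorted distance from each point to its k-th nearest neighbor (self included)."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # full pairwise Euclidean distances
    dist.sort(axis=1)                         # column 0 is each point's distance to itself
    return np.sort(dist[:, k - 1])

# Ten made-up points on a line, spaced 1.0 apart, plus one far-away outlier.
toy = [[float(i), 0.0] for i in range(10)] + [[100.0, 0.0]]
kd = k_distances(toy, k=5)
print(kd)
```

A point is a core point for a given `eps` exactly when its k-distance is at most `eps`, so `(kd <= eps).mean()` gives the fraction of core points. If the sorted values rise steadily with no clear jump, no single `eps` separates dense regions from sparse ones, which is consistent with the difficulty described above.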