# **Load 'train.csv'**
train.csv covers 12 months, with 20 days taken from each month and 24 hourly records per day (each hourly record has 18 features).

In [73]:
import pandas as pd
import numpy as np
# Read train.csv; it contains Traditional Chinese text, so decode it as big5
data = pd.read_csv('./train.csv', encoding = 'big5')
#data.head(10)

# **Preprocessing** 
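Keep only the numeric part of the table and fill every 'NR' value of the 'RAINFALL' feature with 0.
If you re-run this cell in Colab, start again from the top (re-run every cell above) to avoid unintended results. A standalone script does not have this problem, but each repeated run of this cell in Colab drops columns again: the first run drops the first three columns of the original data, the second run drops three more columns from the result of the first run, and so on.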

In [74]:
data = data.iloc[:, 3:]
data[data == 'NR'] = 0
#data.head(6)
raw_data = data.to_numpy()
#print(type(raw_data))
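
The re-run problem described above comes from overwriting `data` in place. As a minimal sketch (the toy table, the helper name `preprocess`, and its column names are illustrative assumptions, not part of the homework data), keeping the loaded frame untouched and returning a new array makes the step safe to repeat:

```python
import pandas as pd

# Toy stand-in for train.csv: 3 metadata columns followed by hourly columns.
raw = pd.DataFrame({
    'date': ['2014/1/1', '2014/1/1'],
    'station': ['S', 'S'],
    'item': ['PM2.5', 'RAINFALL'],
    '0': ['26', 'NR'],
    '1': ['39', '0.2'],
})

def preprocess(frame):
    """Drop the 3 metadata columns and turn 'NR' into 0 without mutating frame."""
    values = frame.iloc[:, 3:].replace('NR', '0')
    return values.to_numpy().astype(float)

first = preprocess(raw)
second = preprocess(raw)  # running it again gives the same result
print(first)
```

Because `raw` is never reassigned, calling `preprocess` any number of times always starts from the full table.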

# **Extract Features (1)**
<img src="./Extract_Features.png">
<img src="./Extract_Features2.png">
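Regroup the original 4320 * 24 data (4320 rows = 12 months * 20 days * 18 features, one column per hour) by month into 12 arrays of shape 18 (features) * 480 (hours).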

In [75]:
month_data = {}
for month in range(12):
    sample = np.empty([18, 480])
    for day in range(20):
        sample[:, day * 24: (day + 1) * 24] = raw_data[18 * (20 * month + day): 18 * (20 * month + day + 1), :]
    month_data[month] = sample
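
The same regrouping can be sketched with a single reshape and transpose. The example below runs on random data of the same shape (the names `raw`, `loop`, and `months` are assumptions made for this illustration) and checks the result against the loop version:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for raw_data: 12 months * 20 days * 18 features rows, 24 hours.
raw = rng.random((12 * 20 * 18, 24))

# Loop version, as in the cell above.
loop = {}
for month in range(12):
    sample = np.empty([18, 480])
    for day in range(20):
        start = 18 * (20 * month + day)
        sample[:, day * 24:(day + 1) * 24] = raw[start:start + 18, :]
    loop[month] = sample

# Vectorized version: axes are (month, day, feature, hour) after the first
# reshape; swapping day and feature lets the last two axes merge into 480 hours.
months = raw.reshape(12, 20, 18, 24).transpose(0, 2, 1, 3).reshape(12, 18, 480)

print(all(np.array_equal(months[m], loop[m]) for m in range(12)))
```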

# **Extract Features (2)**
<img src="./Extract_Features3.png">
<img src="./Extract_Features4.png">

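Each month has 480 hours. Every 9 consecutive hours form one sample, so each month yields 471 samples and the total is 471 * 12 samples. Each sample has 9 * 18 features (18 features per hour * 9 hours).

There are correspondingly 471 * 12 targets (the PM2.5 value of the 10th hour).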

In [76]:
x = np.empty([12 * 471, 18 * 9], dtype = float)
y = np.empty([12 * 471, 1], dtype = float)
for month in range(12):
    for day in range(20):
        for hour in range(24):
            # Skip windows that start in the last 9 hours of the last day; they would run past the end of the month
            if day == 19 and hour > 14:
                continue
            x[month * 471 + day * 24 + hour, :] = month_data[month][:, day * 24 + hour: day * 24 + hour + 9].reshape(1, -1)
            y[month * 471 + day * 24 + hour, 0] = month_data[month][9, day * 24 + hour + 9]  # PM2.5 (row 9) of the 10th hour
#print(x)
#print(y)
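
The 9-hour windows can also be built without the nested day/hour loops. This is a sketch under the assumption that NumPy 1.20 or later is available (for `sliding_window_view`); the helper name `make_windows` and the random month are illustrative only:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(month, window=9, target_row=9):
    """Turn one 18 * 480 month into (471, 162) inputs and (471, 1) targets."""
    n = month.shape[1] - window                            # 471 usable start hours
    windows = sliding_window_view(month, window, axis=1)   # shape (18, 472, 9)
    # Put the start hour first, then flatten each 18 * 9 block row by row.
    x = windows[:, :n, :].transpose(1, 0, 2).reshape(n, -1)
    # Target: the value in row target_row one hour after each window ends.
    y = month[target_row, window:].reshape(-1, 1)
    return x, y

rng = np.random.default_rng(0)
month = rng.random((18, 480))
x_month, y_month = make_windows(month)
print(x_month.shape, y_month.shape)
```

Each row `s` of `x_month` equals `month[:, s:s + 9].reshape(-1)`, the same layout the loop above produces.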

# **Normalize (1)**
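$x = \frac{x - \mu}{\sigma}$, where $\mu$ is the sample mean and $\sigma$ is the sample standard deviation, both computed per feature.

Standardization:

- turns quantities that carry units into dimensionless ones, so features measured in different units become comparable
- gives every feature zero mean and unit variance, which puts all features on a similar scale for gradient descent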

In [77]:
mean_x = np.mean(x, axis = 0) # 18 * 9
std_x = np.std(x, axis = 0) # 18 * 9
for i in range(len(x)): # 12 * 471
    for j in range(len(x[0])): # 18 * 9
        if std_x[j] != 0:
            x[i][j] = (x[i][j] - mean_x[j]) / std_x[j]
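
The double loop above can be written as one broadcast expression. The sketch below uses a hypothetical helper `standardize` on a tiny array; like the loop, it leaves columns with zero standard deviation unchanged:

```python
import numpy as np

def standardize(x, mean=None, std=None):
    """Standardize columns; reuse mean/std when given (e.g. for test data)."""
    if mean is None:
        mean = x.mean(axis=0)
    if std is None:
        std = x.std(axis=0)
    safe_std = np.where(std == 0, 1.0, std)   # avoid division by zero
    scaled = np.where(std != 0, (x - mean) / safe_std, x)
    return scaled, mean, std

demo = np.array([[1.0, 5.0, 10.0],
                 [3.0, 5.0, 20.0],
                 [5.0, 5.0, 30.0]])
scaled, demo_mean, demo_std = standardize(demo)
print(scaled)
```

Returning `mean` and `std` makes it straightforward to apply the training statistics to other data later by passing them back in.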

## Shuffling Data

In [78]:
per = np.random.permutation(x.shape[0])  # shuffled row indices
x = x[per, :]
y = y[per]
print(x.shape)
print(y.shape)

(5652, 162)
(5652, 1)
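
Indexing `x` and `y` with the same permutation keeps every sample paired with its target. A small sketch on toy arrays, using a seeded generator so the shuffle is reproducible (the seed and the `*_demo` names are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
x_demo = np.arange(10, dtype=float).reshape(5, 2)
y_demo = x_demo.sum(axis=1, keepdims=True)   # target derived from each row

order = rng.permutation(len(x_demo))         # shuffled row indices
x_shuf, y_shuf = x_demo[order], y_demo[order]

# Every shuffled row still matches its own target.
print(np.array_equal(x_shuf.sum(axis=1, keepdims=True), y_shuf))
```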


# **Split Training Data Into "train_set" and "validation_set"**
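This part is a simple demonstration for questions 2 and 3 of the homework report. It produces a train_set used for training and a validation_set that is never used in training and serves only for validation.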

In [69]:
import math
# First 80% of the (already shuffled) rows for training
x_train_set = x[: math.floor(len(x) * 0.8), :]
y_train_set = y[: math.floor(len(y) * 0.8), :]

# Remaining 20% for validation
x_validation = x[math.floor(len(x) * 0.8):, :]
y_validation = y[math.floor(len(y) * 0.8):, :]
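
An 80/20 split of this kind can be wrapped in a small helper so that inputs and targets are always cut at the same row. The function name `split_train_validation` and the zero-filled arrays are placeholders for this sketch:

```python
import math
import numpy as np

def split_train_validation(x, y, train_fraction=0.8):
    """Cut x and y at the same row: the first part trains, the rest validates."""
    cut = math.floor(len(x) * train_fraction)
    return x[:cut], y[:cut], x[cut:], y[cut:]

x_all = np.zeros((5652, 162))
y_all = np.zeros((5652, 1))
x_tr, y_tr, x_val, y_val = split_train_validation(x_all, y_all)
print(x_tr.shape, y_tr.shape, x_val.shape, y_val.shape)
```

With 5652 samples this gives 4521 training rows and 1131 validation rows.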

# **Training**
<img src="./Implement linear regression.png">
<img src="./Adagrad.png">
<img src="./Adagrad2.png">

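(Difference from the figures above: the code below reports Root Mean Square Error as the loss. The gradient it uses is that of the sum of squared errors.)

Because of the constant (bias) term, the dimension (dim) needs one extra column. The eps term is a tiny value added so that the Adagrad denominator is never 0.

Each dimension (dim) has its own gradient and weight (w), which are learned over successive iterations (iter_time).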

In [94]:
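# Initialization (commented out in this run; dim and w must already exist,
# so run these lines once before the first training run).
# dim = 18 * 9 + 1
# w = np.zeros([dim, 1])
# # Prepend a column of ones to the data; it acts as the bias term
# x = np.concatenate((np.ones([12 * 471, 1]), x), axis = 1).astype(float)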
learning_rate = 8
iter_time = 30000
adagrad = np.zeros([dim, 1])
eps = 0.00000000001
for t in range(iter_time):
    loss = np.sqrt(np.sum(np.power(np.dot(x, w) - y, 2)) / 471 / 12) # rmse
    if (t % 1000 == 0):
        print(str(t) + ':' + str(loss))
    gradient = 2 * np.dot(x.transpose(), np.dot(x, w) - y) # dim * 1
    adagrad += gradient ** 2
    w = w - learning_rate * gradient / np.sqrt(adagrad + eps)
np.save('weight.npy', w)

0:5.71288274972583
1000:6.78817220283491
2000:6.106513466180976
3000:5.896024867460641
4000:5.815196055138195
5000:5.777643843671305
6000:5.756952477078474
7000:5.743842130966937
8000:5.734635170351382
9000:5.727689726283957
10000:5.722182165278554
11000:5.717654830918245
12000:5.713832312551483
13000:5.710538912567283
14000:5.707657562123286
15000:5.705107442853523
16000:5.702830939239398
17000:5.700785671341705
18000:5.698939481916697
19000:5.697267219157243
20000:5.695748641536597
21000:5.6943670361072
22000:5.6931082955993295
23000:5.691960293031954
24000:5.6909124506543325
25000:5.689955436771514
26000:5.689080947452337
27000:5.688281545171783
28000:5.6875505361438
29000:5.686881874375611
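
To see the update rule in isolation, here is a self-contained sketch of the same Adagrad loop on small synthetic data. The function name, the noise-free data, the learning rate of 1.0, and the 5000 iterations are assumptions chosen for the illustration, not the settings used above:

```python
import numpy as np

def adagrad_linear_regression(x, y, learning_rate=1.0, iterations=5000, eps=1e-11):
    """Fit w in y ~ x @ w with Adagrad; x must already contain the bias column."""
    dim = x.shape[1]
    w = np.zeros((dim, 1))
    adagrad = np.zeros((dim, 1))            # running sum of squared gradients
    for _ in range(iterations):
        gradient = 2 * x.T @ (x @ w - y)    # gradient of the sum of squared errors
        adagrad += gradient ** 2
        w -= learning_rate * gradient / np.sqrt(adagrad + eps)
    rmse = np.sqrt(np.mean((x @ w - y) ** 2))
    return w, rmse

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 3))
x_demo = np.concatenate((np.ones((200, 1)), features), axis=1)  # bias column first
w_true = np.array([[1.0], [2.0], [-3.0], [0.5]])
y_demo = x_demo @ w_true                    # noise-free targets

w_fit, rmse_fit = adagrad_linear_regression(x_demo, y_demo)
print(w_fit.ravel(), rmse_fit)
```

Each weight gets its own effective step size, `learning_rate / sqrt(adagrad)`, which shrinks as that weight's squared gradients accumulate.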


# **Testing**
<img src="./Predict PM2.5.png">


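Load the test data and process it with preprocessing and feature extraction similar to those used for the training data, so that the test data becomes 240 samples of dimension 18 * 9 + 1.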

In [95]:
testdata = pd.read_csv('./test.csv', header = None, encoding = 'big5')
# Take an explicit copy so the assignment below does not trigger SettingWithCopyWarning
test_data = testdata.iloc[:, 2:].copy()
test_data[test_data == 'NR'] = 0
test_data = test_data.to_numpy()
test_x = np.empty([240, 18 * 9], dtype = float)
# Each block of 18 rows (features) * 9 columns (hours) is one test sample
for i in range(240):
    test_x[i, :] = test_data[18 * i: 18 * (i + 1), :].reshape(1, -1)
# Normalize with the mean and std computed on the training data
for i in range(len(test_x)):
    for j in range(test_x.shape[1]):
        if std_x[j] != 0:
            test_x[i][j] = (test_x[i][j] - mean_x[j]) / std_x[j]
# Prepend the bias column of ones
test_x = np.concatenate((np.ones([240, 1]), test_x), axis = 1).astype(float)
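
Since every test sample occupies 18 consecutive rows of 9 hourly values, the feature extraction can be sketched as a single reshape. The example runs on random data of the same shape (`test_blocks` and the other names are assumptions) and compares the result with the loop:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the numeric part of test.csv: 240 samples * 18 rows, 9 hours.
test_blocks = rng.random((240 * 18, 9))

# Loop version, one 18 * 9 block per sample.
loop_x = np.empty([240, 18 * 9])
for i in range(240):
    loop_x[i, :] = test_blocks[18 * i: 18 * (i + 1), :].reshape(1, -1)

# Reshape version: 18 consecutive rows of 9 values flatten into one row of 162.
fast_x = test_blocks.reshape(240, 18 * 9)

# Bias column of ones in front, giving 18 * 9 + 1 = 163 columns.
with_bias = np.concatenate((np.ones([240, 1]), fast_x), axis=1)
print(np.array_equal(loop_x, fast_x), with_bias.shape)
```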


# **Prediction**
The illustration is the same as above.

<img src="./Predict PM2.5.png">

With the weights and the test data, the target can be predicted.

In [96]:
w = np.load('weight.npy')
ans_y = np.dot(test_x, w)
ans_y


array([[  6.01835856],
       [ 18.38643375],
       [ 27.38316968],
       [  7.76177247],
       [ 27.14854747],
       [ 22.19149405],
       [ 23.4504508 ],
       [ 31.05252821],
       [ 17.00244251],
       [ 59.01026946],
       [ 11.66340696],
       [  9.34857275],
       [ 63.66725787],
       [ 53.20547218],
       [ 22.35911587],
       [ 12.20534704],
       [ 32.38325348],
       [ 67.04147388],
       [ -0.67927939],
       [ 17.11116268],
       [ 41.46320825],
       [ 71.64654178],
       [  9.26770534],
       [ 17.8836054 ],
       [ 14.6513    ],
       [ 38.17938052],
       [ 14.64470294],
       [ 66.95815553],
       [  7.17357017],
       [ 55.40128333],
       [ 24.47706776],
       [  8.41714   ],
       [  2.76529561],
       [ 18.6395756 ],
       [ 27.69677269],
       [ 37.76203134],
       [ 43.73771253],
       [ 29.50919054],
       [ 41.9970245 ],
       [ 35.05005417],
       [  7.62181208],
       [ 41.09607631],
       [ 30.22020891],
       [ 50

# Data processing

In [97]:
for i in range(240):
    if(ans_y[i][0] < 0):
        ans_y[i][0] = 0
    else:
        ans_y[i][0] = np.round(ans_y[i][0])
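
The same clean-up (negative predictions set to 0, everything else rounded) can be sketched without a loop. The helper name `postprocess` and the sample values are illustrative:

```python
import numpy as np

def postprocess(predictions):
    """Clamp negative predictions to 0, then round to the nearest integer value."""
    return np.round(np.clip(predictions, 0, None))

demo = np.array([[-0.68], [6.02], [27.38], [71.65]])
cleaned = postprocess(demo)
print(cleaned.ravel())
```

Clipping first is safe here because rounding never turns a non-negative value negative. Note that `np.round` rounds exact halves to the nearest even number, which is also what the loop above does.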

# **Save Prediction to CSV File**

In [98]:
import csv
with open('submit.csv', mode='w', newline='') as submit_file:
    csv_writer = csv.writer(submit_file)
    header = ['id', 'value']
    print(header)
    csv_writer.writerow(header)
    for i in range(240):
        row = ['id_' + str(i), ans_y[i][0]]
        csv_writer.writerow(row)
        print(row)

['id', 'value']
['id_0', 6.0]
['id_1', 18.0]
['id_2', 27.0]
['id_3', 8.0]
['id_4', 27.0]
['id_5', 22.0]
['id_6', 23.0]
['id_7', 31.0]
['id_8', 17.0]
['id_9', 59.0]
['id_10', 12.0]
['id_11', 9.0]
['id_12', 64.0]
['id_13', 53.0]
['id_14', 22.0]
['id_15', 12.0]
['id_16', 32.0]
['id_17', 67.0]
['id_18', 0.0]
['id_19', 17.0]
['id_20', 41.0]
['id_21', 72.0]
['id_22', 9.0]
['id_23', 18.0]
['id_24', 15.0]
['id_25', 38.0]
['id_26', 15.0]
['id_27', 67.0]
['id_28', 7.0]
['id_29', 55.0]
['id_30', 24.0]
['id_31', 8.0]
['id_32', 3.0]
['id_33', 19.0]
['id_34', 28.0]
['id_35', 38.0]
['id_36', 44.0]
['id_37', 30.0]
['id_38', 42.0]
['id_39', 35.0]
['id_40', 8.0]
['id_41', 41.0]
['id_42', 30.0]
['id_43', 51.0]
['id_44', 17.0]
['id_45', 35.0]
['id_46', 25.0]
['id_47', 10.0]
['id_48', 27.0]
['id_49', 32.0]
['id_50', 20.0]
['id_51', 8.0]
['id_52', 20.0]
['id_53', 53.0]
['id_54', 16.0]
['id_55', 36.0]
['id_56', 33.0]
['id_57', 21.0]
['id_58', 57.0]
['id_59', 23.0]
['id_60', 15.0]
['id_61', 42.0]
['id_62', 13