<a href="https://colab.research.google.com/github/yuyangweng/NLP/blob/main/NLP_RNN_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

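# Learning Alphabet Order with a Recurrent Neural Network: LSTM (Long Short-Term Memory)

Even after reading an explanation of how RNNs or LSTMs work, many people find it hard to see how a recurrent network learns from sequence data and how it is applied. In this article we build and compare several recurrent network models (the code uses the Keras `SimpleRNN` and `LSTM` layers).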

![lstm-abc](https://pro.guidesocial.be/images/thumbs/580x387/arton24101.jpg?fct=1456434296)

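We will use deep learning to learn the order of the 26 letters of the English alphabet. That is, given one letter of the alphabet, the network predicts the letter that comes next.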

> ABCDEFGHIJKLMNOPQRSTUVWXYZ

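> For example:

> Given J -> predict K

> Given X -> predict Y

This is a simple sequence prediction problem. Once it is understood, it generalizes to other sequence prediction problems such as time series forecasting and sequence classification.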

![lstm-many-to-one](https://i.stack.imgur.com/QCnpU.jpg)

## Model 1. Learning a one-character to one-character mapping with an LSTM

### STEP1. Import Keras and related modules

In [None]:
import numpy
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM, SimpleRNN
from tensorflow.keras import utils
from tensorflow.keras.preprocessing.sequence import pad_sequences

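# Fix the random seeds so that everyone's results are as repeatable as possible.
# NumPy's seed alone does not control TensorFlow's weight initialization, so seed both.
numpy.random.seed(7)
tf.random.set_seed(7)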

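### STEP2. Prepare the data

We can now define our dataset, the alphabet. For readability, we define the alphabet with uppercase letters.

We need to map each letter of the alphabet to a number so that a neural network can train on it. This is easily done by building a dictionary from each character to its index.
We also build a reverse lookup so that predictions can be converted back to characters later.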

In [None]:
# Define the sequence dataset
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# Build the character-to-integer (0 - 25) mapping and the reverse lookup dictionary
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))

In [None]:
# Print the mappings to inspect them
print("Letter to integer index: \n", char_to_int)
print("\n")

print("Integer index to letter: \n", int_to_char)

Letter to integer index: 
 {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6, 'H': 7, 'I': 8, 'J': 9, 'K': 10, 'L': 11, 'M': 12, 'N': 13, 'O': 14, 'P': 15, 'Q': 16, 'R': 17, 'S': 18, 'T': 19, 'U': 20, 'V': 21, 'W': 22, 'X': 23, 'Y': 24, 'Z': 25}


Integer index to letter: 
 {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G', 7: 'H', 8: 'I', 9: 'J', 10: 'K', 11: 'L', 12: 'M', 13: 'N', 14: 'O', 15: 'P', 16: 'Q', 17: 'R', 18: 'S', 19: 'T', 20: 'U', 21: 'V', 22: 'W', 23: 'X', 24: 'Y', 25: 'Z'}


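### STEP3. Prepare the training data

Now we need to create the inputs (X) and outputs (y) used to train the network. We do this by choosing an input sequence length and then reading sequences from the alphabet.
For example, with an input length of 1, we start at the beginning of the raw data, read the first letter "A" as the input and the next letter "B" as the target. We then slide along by one character and repeat until the target is "Z".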

In [None]:
# Prepare the input dataset
seq_length = 1  # number of time steps
dataX = []
dataY = []
for i in range(0, len(alphabet) - seq_length, 1):
    seq_in = alphabet[i:i + seq_length]
    seq_out = alphabet[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
    print(seq_in, '->', seq_out)

A -> B
B -> C
C -> D
D -> E
E -> F
F -> G
G -> H
H -> I
I -> J
J -> K
K -> L
L -> M
M -> N
N -> O
O -> P
P -> Q
Q -> R
R -> S
S -> T
T -> U
U -> V
V -> W
W -> X
X -> Y
Y -> Z


In [None]:
print(dataX)  # integer-encoded inputs (plain indices, not learned embeddings such as word2vec)
print(dataY)

[[0], [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24]]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]


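### STEP4. Preprocess the data
We need to reshape the NumPy array into the format the recurrent layer expects: (samples, time_steps, features).
We also normalize the data so that its values lie between 0 and 1, and one-hot encode the labels.


> ABCDEFGHIJKLMNOPQRSTUVWXYZ

> For example:

> Given J -> predict K

> Given X -> predict Y


Target training tensor shape: (samples, time_steps, features) -> (n, **1**, **1**)

Note that each character here becomes a "feature" vector with 1 element inside 1 time step.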

In [None]:
# Reshape X to (samples, time_steps, features)
X = numpy.reshape(dataX, (len(dataX), seq_length, 1))

# Normalize to the range [0, 1)
X = X / float(len(alphabet))  # Uses 0, 1, 2, ... as the vectors for A, B, C, ... with no one-hot encoding. Is that reasonable?

# One-hot encode the output variable
y = utils.to_categorical(dataY)

print("X shape: ", X.shape)  # (25 samples, 1 time step, 1 feature), i.e. (batch, timesteps, features)
print("y shape: ", y.shape)

X shape:  (25, 1, 1)
y shape:  (25, 26)
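
The comment in the cell above asks whether feeding scaled integer indices is reasonable. One common alternative, which this notebook does not use, is to one-hot encode the inputs as well, so that no artificial ordering or distance between letters is implied. The sketch below only builds such an input tensor with NumPy; the name `X_onehot` is introduced here for illustration.

```python
import numpy as np

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
char_to_int = {c: i for i, c in enumerate(alphabet)}

# Integer-encode the 25 input letters A..Y, as in the notebook.
indices = np.array([char_to_int[c] for c in alphabet[:-1]])

# One-hot rows: row i has a single 1 at column indices[i].
one_hot = np.eye(len(alphabet), dtype="float32")[indices]

# Shape (samples, time_steps, features) becomes (25, 1, 26) instead of (25, 1, 1).
X_onehot = one_hot.reshape(len(indices), 1, len(alphabet))
print(X_onehot.shape)  # (25, 1, 26)
```

With this layout the recurrent layer would need `input_shape=(1, 26)`, and no division by `len(alphabet)` is required.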


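### STEP5. Build the model

In [None]:
# Build the model: recurrent layer (units, input_shape=(timesteps, features)).
# Note: this cell uses SimpleRNN; replace it with LSTM(32, ...) to train an actual LSTM.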
model = Sequential()
model.add(SimpleRNN(32, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(y.shape[1], activation='softmax'))
model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
simple_rnn (SimpleRNN)       (None, 32)                1088      
_________________________________________________________________
dense_1 (Dense)              (None, 26)                858       
Total params: 1,946
Trainable params: 1,946
Non-trainable params: 0
_________________________________________________________________
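
The parameter counts in the summary above can be checked by hand. A simple recurrent layer has an input kernel, a recurrent kernel, and a bias; a dense layer has a weight matrix and a bias. The helper names below are illustrative and not part of Keras.

```python
def simple_rnn_params(units, features):
    # input kernel (features x units) + recurrent kernel (units x units) + bias (units)
    return units * features + units * units + units


def dense_params(inputs, outputs):
    # weight matrix (inputs x outputs) + bias (outputs)
    return inputs * outputs + outputs


rnn = simple_rnn_params(units=32, features=1)
dense = dense_params(inputs=32, outputs=26)
print(rnn, dense, rnn + dense)  # 1088 858 1946
```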


### STEP6. Compile and train the model

In [None]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=500, batch_size=1, verbose=2)

Epoch 1/500
25/25 - 1s - loss: 2.4969 - accuracy: 0.0800
Epoch 2/500
25/25 - 0s - loss: 2.4808 - accuracy: 0.1600
Epoch 3/500
25/25 - 0s - loss: 2.4754 - accuracy: 0.1200
Epoch 4/500
25/25 - 0s - loss: 2.4721 - accuracy: 0.1600
Epoch 5/500
25/25 - 0s - loss: 2.4678 - accuracy: 0.1600
Epoch 6/500
25/25 - 0s - loss: 2.4646 - accuracy: 0.1600
Epoch 7/500
25/25 - 0s - loss: 2.4602 - accuracy: 0.1600
Epoch 8/500
25/25 - 0s - loss: 2.4564 - accuracy: 0.2000
Epoch 9/500
25/25 - 0s - loss: 2.4534 - accuracy: 0.1600
Epoch 10/500
25/25 - 0s - loss: 2.4503 - accuracy: 0.1600
Epoch 11/500
25/25 - 0s - loss: 2.4451 - accuracy: 0.1600
Epoch 12/500
25/25 - 0s - loss: 2.4425 - accuracy: 0.1600
Epoch 13/500
25/25 - 0s - loss: 2.4389 - accuracy: 0.1600
Epoch 14/500
25/25 - 0s - loss: 2.4355 - accuracy: 0.2000
Epoch 15/500
25/25 - 0s - loss: 2.4313 - accuracy: 0.1600
Epoch 16/500
25/25 - 0s - loss: 2.4275 - accuracy: 0.1600
Epoch 17/500
25/25 - 0s - loss: 2.4244 - accuracy: 0.1600
Epoch 18/500
25/25 - 0s

<tensorflow.python.keras.callbacks.History at 0x7f088654a6d0>

### STEP7. Evaluate model accuracy

In [None]:
# Evaluate the model's performance
scores = model.evaluate(X, y, verbose=0)
print("Model Accuracy: %.2f%%" % (scores[1]*100))

Model Accuracy: 88.00%


### STEP8. Predictions

In [None]:
# Demonstrate the model's predictions
for pattern in dataX:
    # Feed each of the 25 input letters (A-Y) to the model and predict the next letter
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    x = x / float(len(alphabet))

    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)  # index with the highest probability
    result = int_to_char[index]  # look up the predicted letter
    seq_in = [int_to_char[value] for value in pattern]
    print(seq_in, "->", result)  # print the result

['A'] -> B
['B'] -> C
['C'] -> D
['D'] -> E
['E'] -> F
['F'] -> G
['G'] -> H
['H'] -> I
['I'] -> J
['J'] -> K
['K'] -> L
['L'] -> M
['M'] -> N
['N'] -> O
['O'] -> P
['P'] -> Q
['Q'] -> R
['R'] -> S
['S'] -> T
['T'] -> U
['U'] -> V
['V'] -> Y
['W'] -> Z
['X'] -> Z
['Y'] -> Z
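
The 88.00% reported in STEP7 can be reproduced from the predictions printed above. The sketch below copies the predicted letters by hand into the string `predicted` (a name introduced here) and counts the mismatches.

```python
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

inputs = alphabet[:-1]    # A..Y
expected = alphabet[1:]   # B..Z
# Predicted letters copied from the output above, in input order A..Y.
predicted = "BCDEFGHIJKLMNOPQRSTUV" + "YZZZ"

# Collect (input, predicted, expected) for every wrong prediction.
misses = [(i, p, e) for i, p, e in zip(inputs, predicted, expected) if p != e]
accuracy = 1 - len(misses) / len(inputs)

print(misses)              # [('V', 'Y', 'W'), ('W', 'Z', 'X'), ('X', 'Z', 'Y')]
print(f"{accuracy:.2%}")   # 88.00%
```

All three errors fall at the end of the alphabet.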


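We can see that this "sequence prediction" problem is indeed hard for the network to learn.
The reason is that the recurrent units in the example above have no context to work with (there is only "1" time step). Each input-output pattern is presented to the network in random (shuffled) order, and by default Keras resets the internal state of a recurrent layer after every batch (here each batch holds a single pattern), so nothing is carried over from one pattern to the next.

Next, let's try giving the LSTM more sequence information to learn from.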

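## Model 2. LSTM learning a three-character feature window (Three-Char Feature Window) to one-character mapping


### STEP1. Prepare the training data

In [None]:
# Prepare the input dataset
seq_length = 3  # window of three characters
dataX = []
dataY = []
for i in range(0, len(alphabet) - seq_length, 1):
    seq_in = alphabet[i:i + seq_length]  # three characters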
    seq_out = alphabet[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
    print(seq_in, '->', seq_out)

ABC -> D
BCD -> E
CDE -> F
DEF -> G
EFG -> H
FGH -> I
GHI -> J
HIJ -> K
IJK -> L
JKL -> M
KLM -> N
LMN -> O
MNO -> P
NOP -> Q
OPQ -> R
PQR -> S
QRS -> T
RST -> U
STU -> V
TUV -> W
UVW -> X
VWX -> Y
WXY -> Z


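### STEP2. Preprocess the data


> ABCDEFGHIJKLMNOPQRSTUVWXYZ

> For example:

> Given HIJ -> predict K

> Given EFG -> predict H

Target training tensor shape: (samples, time_steps, features) -> (n, **1**, **3**)

Note that the three characters here become a single "feature" vector with 3 elements. So when the training set is prepared, each training sample has only "1" time step, which holds a "features" vector with the data for "3" characters.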

In [None]:
# Reshape X to (samples, time_steps, features)
X = numpy.reshape(dataX, (len(dataX), 1, seq_length))  # <-- note this: 1 time step, 3 features

# Normalize
X = X / float(len(alphabet))

# One-hot encode the y values
y = utils.to_categorical(dataY)

print("X shape: ", X.shape)
print("y shape: ", y.shape)

X shape:  (23, 1, 3)
y shape:  (23, 26)


### STEP3. Build the model

In [None]:
# Build the model
model = Sequential()
model.add(LSTM(32, input_shape=(X.shape[1], X.shape[2])))  # <-- note this: input_shape is (1, 3)
model.add(Dense(y.shape[1], activation='softmax'))
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 32)                4608      
_________________________________________________________________
dense_1 (Dense)              (None, 26)                858       
Total params: 5,466
Trainable params: 5,466
Non-trainable params: 0
_________________________________________________________________
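
The LSTM parameter count in the summary above can also be checked by hand. An LSTM layer holds four sets of weights (input gate, forget gate, output gate, and cell candidate), each with an input kernel, a recurrent kernel, and a bias. The helper name `lstm_params` is illustrative and not part of Keras.

```python
def lstm_params(units, features):
    # One gate: input kernel + recurrent kernel + bias.
    per_gate = units * features + units * units + units
    # Four gates/candidates in a standard LSTM cell.
    return 4 * per_gate


lstm = lstm_params(units=32, features=3)
dense = 32 * 26 + 26  # dense weights + bias
print(lstm, dense, lstm + dense)  # 4608 858 5466
```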


### STEP4. Compile and train the model

In [None]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=500, batch_size=1, verbose=2)

Epoch 1/500
23/23 - 1s - loss: 3.2622 - accuracy: 0.0000e+00
Epoch 2/500
23/23 - 0s - loss: 3.2506 - accuracy: 0.0435
Epoch 3/500
23/23 - 0s - loss: 3.2443 - accuracy: 0.0435
Epoch 4/500
23/23 - 0s - loss: 3.2388 - accuracy: 0.0435
Epoch 5/500
23/23 - 0s - loss: 3.2322 - accuracy: 0.0435
Epoch 6/500
23/23 - 0s - loss: 3.2257 - accuracy: 0.0435
Epoch 7/500
23/23 - 0s - loss: 3.2186 - accuracy: 0.0435
Epoch 8/500
23/23 - 0s - loss: 3.2115 - accuracy: 0.0435
Epoch 9/500
23/23 - 0s - loss: 3.2041 - accuracy: 0.0435
Epoch 10/500
23/23 - 0s - loss: 3.1953 - accuracy: 0.0435
Epoch 11/500
23/23 - 0s - loss: 3.1876 - accuracy: 0.0435
Epoch 12/500
23/23 - 0s - loss: 3.1778 - accuracy: 0.0435
Epoch 13/500
23/23 - 0s - loss: 3.1680 - accuracy: 0.0435
Epoch 14/500
23/23 - 0s - loss: 3.1584 - accuracy: 0.0435
Epoch 15/500
23/23 - 0s - loss: 3.1481 - accuracy: 0.0435
Epoch 16/500
23/23 - 0s - loss: 3.1378 - accuracy: 0.0435
Epoch 17/500
23/23 - 0s - loss: 3.1269 - accuracy: 0.0435
Epoch 18/500
23/23 

<tensorflow.python.keras.callbacks.History at 0x7f6700142950>

### STEP5. Evaluate model accuracy

In [None]:
# Evaluate the model's performance
scores = model.evaluate(X, y, verbose=0)
print("Model Accuracy: %.2f%%" % (scores[1]*100))

Model Accuracy: 86.96%


### STEP6. Predictions

In [None]:
# Show some model predictions
for pattern in dataX:
    x = numpy.reshape(pattern, (1, 1, len(pattern)))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    result = int_to_char[index]
    seq_in = [int_to_char[value] for value in pattern]
    print(seq_in, "->", result)

['A', 'B', 'C'] -> D
['B', 'C', 'D'] -> E
['C', 'D', 'E'] -> F
['D', 'E', 'F'] -> G
['E', 'F', 'G'] -> H
['F', 'G', 'H'] -> I
['G', 'H', 'I'] -> J
['H', 'I', 'J'] -> K
['I', 'J', 'K'] -> L
['J', 'K', 'L'] -> M
['K', 'L', 'M'] -> N
['L', 'M', 'N'] -> O
['M', 'N', 'O'] -> P
['N', 'O', 'P'] -> Q
['O', 'P', 'Q'] -> R
['P', 'Q', 'R'] -> S
['Q', 'R', 'S'] -> T
['R', 'S', 'T'] -> U
['S', 'T', 'U'] -> V
['T', 'U', 'V'] -> X
['U', 'V', 'W'] -> Z
['V', 'W', 'X'] -> Z
['W', 'X', 'Y'] -> Z


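We can see that "Model 2" does not improve on "Model 1": in the runs shown, accuracy is 86.96% versus 88.00%. Even with the window approach, the LSTM still fails to learn the correct letter order for this simple problem.

The example above also uses a poor tensor layout that misuses the LSTM. The letters in the window are really successive time steps of a single feature, not separate features within a single time step. We have given the network more context, but not more sequential context.

In the next example, we provide the additional context in the form of time steps.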
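
To make the difference between the two layouts concrete, the small NumPy sketch below reshapes one window ("HIJ" as integer indices) both ways. The variable names are introduced here for illustration.

```python
import numpy as np

window = [[7, 8, 9]]  # "HIJ" as integer indices

# Model 2 layout: 1 time step holding 3 features.
as_features = np.reshape(window, (1, 1, 3))
# Model 3 layout: 3 time steps holding 1 feature each.
as_timesteps = np.reshape(window, (1, 3, 1))

print(as_features.tolist())   # [[[7, 8, 9]]]
print(as_timesteps.tolist())  # [[[7], [8], [9]]]
```

In the first layout the recurrent layer performs a single step, so its recurrence is never exercised; in the second it steps through the three letters in order.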

## Model 3. LSTM learning a three-character time step window (Three-Char Time Step Window) to one-character mapping

### STEP1. Prepare the training data

In [None]:
seq_length = 3
dataX = []
dataY = []
for i in range(0, len(alphabet) - seq_length, 1):
    seq_in = alphabet[i:i + seq_length]
    seq_out = alphabet[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
    print(seq_in, '->', seq_out)

ABC -> D
BCD -> E
CDE -> F
DEF -> G
EFG -> H
FGH -> I
GHI -> J
HIJ -> K
IJK -> L
JKL -> M
KLM -> N
LMN -> O
MNO -> P
NOP -> Q
OPQ -> R
PQR -> S
QRS -> T
RST -> U
STU -> V
TUV -> W
UVW -> X
VWX -> Y
WXY -> Z


In [None]:
dataX

[[0, 1, 2],
 [1, 2, 3],
 [2, 3, 4],
 [3, 4, 5],
 [4, 5, 6],
 [5, 6, 7],
 [6, 7, 8],
 [7, 8, 9],
 [8, 9, 10],
 [9, 10, 11],
 [10, 11, 12],
 [11, 12, 13],
 [12, 13, 14],
 [13, 14, 15],
 [14, 15, 16],
 [15, 16, 17],
 [16, 17, 18],
 [17, 18, 19],
 [18, 19, 20],
 [19, 20, 21],
 [20, 21, 22],
 [21, 22, 23],
 [22, 23, 24]]

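### STEP2. Preprocess the data


> ABCDEFGHIJKLMNOPQRSTUVWXYZ

> For example:

> Given HIJ -> predict K

> Given EFG -> predict H

Target training tensor shape: (samples, time_steps, features) -> (n, **3**, **1**)

When preparing the training set, reshape the data so that each training sample has "3" time steps, each holding a "features" vector with the data for "1" character.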

In [None]:
# Reshape X to (samples, time_steps, features)
X = numpy.reshape(dataX, (len(dataX), seq_length, 1))  # <-- note this: 3 time steps, 1 feature

# Normalize
X = X / float(len(alphabet))

# One-hot encode the y values
y = utils.to_categorical(dataY)

print("X shape: ", X.shape)
print("y shape: ", y.shape)

X shape:  (23, 3, 1)
y shape:  (23, 26)


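### STEP3. Build the model

In [None]:
# Build the model (this cell uses SimpleRNN; replace it with LSTM(32, ...) to train an actual LSTM)
model = Sequential()
model.add(SimpleRNN(32, input_shape=(X.shape[1], X.shape[2])))  # <-- note this: input_shape is (3, 1)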
model.add(Dense(y.shape[1], activation='softmax'))

model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
simple_rnn_1 (SimpleRNN)     (None, 32)                1088      
_________________________________________________________________
dense_2 (Dense)              (None, 26)                858       
Total params: 1,946
Trainable params: 1,946
Non-trainable params: 0
_________________________________________________________________


### STEP4. Compile and train the model

In [None]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=500, batch_size=1, verbose=2)

Epoch 1/500
23/23 - 1s - loss: 0.2962 - accuracy: 0.9130
Epoch 2/500
23/23 - 0s - loss: 0.2746 - accuracy: 0.9130
Epoch 3/500
23/23 - 0s - loss: 0.2678 - accuracy: 0.9565
Epoch 4/500
23/23 - 0s - loss: 0.2668 - accuracy: 0.9565
Epoch 5/500
23/23 - 0s - loss: 0.2682 - accuracy: 0.9130
Epoch 6/500
23/23 - 0s - loss: 0.2646 - accuracy: 0.9130
Epoch 7/500
23/23 - 0s - loss: 0.2692 - accuracy: 0.9565
Epoch 8/500
23/23 - 0s - loss: 0.2609 - accuracy: 0.9130
Epoch 9/500
23/23 - 0s - loss: 0.2615 - accuracy: 0.9565
Epoch 10/500
23/23 - 0s - loss: 0.2615 - accuracy: 0.9130
Epoch 11/500
23/23 - 0s - loss: 0.2603 - accuracy: 0.9565
Epoch 12/500
23/23 - 0s - loss: 0.2586 - accuracy: 0.9565
Epoch 13/500
23/23 - 0s - loss: 0.2563 - accuracy: 1.0000
Epoch 14/500
23/23 - 0s - loss: 0.2570 - accuracy: 1.0000
Epoch 15/500
23/23 - 0s - loss: 0.2527 - accuracy: 0.9130
Epoch 16/500
23/23 - 0s - loss: 0.2586 - accuracy: 0.9565
Epoch 17/500
23/23 - 0s - loss: 0.2574 - accuracy: 0.9565
Epoch 18/500
23/23 - 0s

<tensorflow.python.keras.callbacks.History at 0x7f08861e62d0>

### STEP5. Evaluate model accuracy

In [None]:
# Evaluate the model's performance
scores = model.evaluate(X, y, verbose=0)
print("Model Accuracy: %.2f%%" % (scores[1]*100))

Model Accuracy: 100.00%


### STEP6. Predictions

In [None]:
# Take 3 characters at a time, reshape them into a tensor of shape (1, 3, 1), and run inference
for pattern in dataX:
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    result = int_to_char[index]
    seq_in = [int_to_char[value] for value in pattern]
    print(seq_in, "->", result)

['A', 'B', 'C'] -> D
['B', 'C', 'D'] -> E
['C', 'D', 'E'] -> F
['D', 'E', 'F'] -> G
['E', 'F', 'G'] -> H
['F', 'G', 'H'] -> I
['G', 'H', 'I'] -> J
['H', 'I', 'J'] -> K
['I', 'J', 'K'] -> L
['J', 'K', 'L'] -> M
['K', 'L', 'M'] -> N
['L', 'M', 'N'] -> O
['M', 'N', 'O'] -> P
['N', 'O', 'P'] -> Q
['O', 'P', 'Q'] -> R
['P', 'Q', 'R'] -> S
['Q', 'R', 'S'] -> T
['R', 'S', 'T'] -> U
['S', 'T', 'U'] -> V
['T', 'U', 'V'] -> W
['U', 'V', 'W'] -> X
['V', 'W', 'X'] -> Y
['W', 'X', 'Y'] -> Z
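
The prediction loop above can be wrapped in two small helpers for trying arbitrary three-letter windows. This is a minimal sketch: `encode_window` and `decode_prediction` are hypothetical names introduced here, and a hand-made probability row stands in for the trained model so that the block runs without TensorFlow.

```python
import numpy as np

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
char_to_int = {c: i for i, c in enumerate(alphabet)}
int_to_char = {i: c for i, c in enumerate(alphabet)}


def encode_window(text):
    """Turn e.g. "KLM" into a scaled array of shape (1, len(text), 1)."""
    ids = [char_to_int[c] for c in text]
    x = np.reshape(ids, (1, len(ids), 1))
    return x / float(len(alphabet))


def decode_prediction(probs):
    """Map a (1, 26) probability array back to the most likely letter."""
    return int_to_char[int(np.argmax(probs))]


x = encode_window("KLM")
print(x.shape)  # (1, 3, 1)

# With the trained model from this section one would call:
#   decode_prediction(model.predict(x, verbose=0))
# Here a hand-made probability row stands in for the model output.
fake_probs = np.zeros((1, len(alphabet)))
fake_probs[0, char_to_int["N"]] = 1.0
print(decode_prediction(fake_probs))  # N
```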


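The results of "Model 3" show that once the additional context is supplied in the form of time steps, a recurrent neural network can learn effectively from sequence data.

On this very simple task of predicting the order of the 26 letters, "Model 3" reaches 100% prediction accuracy on the data it was trained on!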

### References:
* Jason Brownlee - "[Understanding Stateful LSTM Recurrent Neural Networks in Python with Keras](https://machinelearningmastery.com/understanding-stateful-lstm-recurrent-neural-networks-python-keras/)"

* Keras official site - [Recurrent Layer](https://keras.io/layers/recurrent/)

* https://github.com/erhwenkuo