Keras: commonly used datasets (Datasets): https://keras.io/zh/datasets/

In [1]:
from __future__ import print_function

from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding
from tensorflow.keras.layers import LSTM
from tensorflow.keras.datasets import imdb 

max_features = 20000
# Keep only the 20,000 most frequent words

maxlen = 80
batch_size = 32

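### Workflow for building a model on the IMDB movie review dataset:
##### Step 0. Load the dataset
##### Step 1. Truncate and pad the data
##### Step 2. Build the model
##### Step 3. Train and evaluate

### Step 0. Load the dataset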

In [3]:
print('Loading data...')

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

Loading data...
25000 train sequences
25000 test sequences


### Step 1. Truncate and pad the data

In [4]:
print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

Pad sequences (samples x time)
x_train shape: (25000, 80)
x_test shape: (25000, 80)


### pad_sequences 

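- tensorflow.keras.preprocessing.sequence.pad_sequences
- `sequence` is a utility module in the Keras `preprocessing` package.
- `pad_sequences` is a function provided by `sequence`. It brings every sequence to the same length (`maxlen`) by padding short ones and truncating long ones.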

https://keras.io/zh/preprocessing/sequence/
<!-- <img src='images/pad_sequence.PNG'/> -->
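
To make the effect of `pad_sequences` concrete, here is a plain-Python sketch of what it does under its default settings (`padding='pre'`, `truncating='pre'`, padding value 0). The helper `pad_pre` is a hypothetical name written only for illustration; it is not the Keras implementation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Illustrative stand-in for pad_sequences with its default settings."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # truncating='pre': keep the LAST maxlen items
        # padding='pre': fill the FRONT with `value` up to maxlen
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

reviews = [[1, 2, 3], [4, 5, 6, 7, 8, 9]]
result = pad_pre(reviews, maxlen=4)
print(result)  # [[0, 1, 2, 3], [6, 7, 8, 9]]
```

With `maxlen = 80`, reviews shorter than 80 tokens are left-padded with zeros and longer reviews keep only their last 80 tokens.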

In [5]:
print(x_train.shape)

(25000, 80)


In [6]:
x_train[0]

array([   15,   256,     4,     2,     7,  3766,     5,   723,    36,
          71,    43,   530,   476,    26,   400,   317,    46,     7,
           4, 12118,  1029,    13,   104,    88,     4,   381,    15,
         297,    98,    32,  2071,    56,    26,   141,     6,   194,
        7486,    18,     4,   226,    22,    21,   134,   476,    26,
         480,     5,   144,    30,  5535,    18,    51,    36,    28,
         224,    92,    25,   104,     4,   226,    65,    16,    38,
        1334,    88,    12,    16,   283,     5,    16,  4472,   113,
         103,    32,    15,    16,  5345,    19,   178,    32],
      dtype=int32)

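#### - Notice that the data are numbers, not text
    Although IMDB is a movie review dataset, the text has already been preprocessed into integers.

- Each integer stands for one word, but the array alone does not tell us which word it is.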
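
The integers can be mapped back to words. The sketch below assumes the default `imdb.load_data` conventions (word ranks shifted by `index_from=3`, with 0 for padding, 1 for the start marker, and 2 for out-of-vocabulary words). It uses a tiny stand-in dictionary, `toy_word_index`, in place of the real one that `imdb.get_word_index()` returns. `decode_review` is a hypothetical helper name.

```python
def decode_review(ids, word_index, index_from=3):
    """Turn a sequence of IMDB integer ids back into a readable string."""
    # Reserved ids under the default load_data settings (an assumption here).
    reverse = {0: '<pad>', 1: '<start>', 2: '<oov>'}
    # word_index maps word -> rank; load_data shifts every rank by index_from.
    for word, rank in word_index.items():
        reverse[rank + index_from] = word
    return ' '.join(reverse.get(i, '<oov>') for i in ids)

# Miniature stand-in vocabulary; the real one comes from imdb.get_word_index().
toy_word_index = {'the': 1, 'and': 2, 'a': 3}
decoded = decode_review([0, 1, 4, 6, 2, 5], toy_word_index)
print(decoded)  # <pad> <start> the a <oov> and
```

With the real data, `decode_review(x_train[0], imdb.get_word_index())` should print the 80 tokens kept from the first training review.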

### Step 2. 建立模型

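The Embedding layer creates a lookup table with room for max_features distinct tokens as the model's input, and maps each token to an output vector of length 128.
- Embedding can only be used as the first layer of the model; it turns each input integer into a vector.
- **In effect, 128 dimensions are used to describe max_features (20,000) distinct integers.**
- Without an Embedding layer, the RNN could still accept the sequence as input, but the raw integer ids would give no meaningful way to compute the distance between words.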
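
Conceptually, an embedding is a trainable lookup table with one row per word id. The sketch below uses plain Python with a toy vocabulary of 5 ids and 3 dimensions (the notebook uses 20000 and 128). The random initial values and the helper name `embed` are assumptions for illustration, not how Keras initializes or stores its weights.

```python
import random

vocab_size, embed_dim = 5, 3  # notebook: max_features=20000, output size 128
random.seed(0)
# One row of embed_dim floats per word id; training would adjust these values.
table = [[random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
         for _ in range(vocab_size)]

def embed(ids):
    """Replace each integer id with its row from the lookup table."""
    return [table[i] for i in ids]

vectors = embed([4, 0, 4])
print(len(vectors), len(vectors[0]))  # sequence length, embedding size

# Number of trainable weights in the notebook's Embedding(20000, 128) layer.
n_params = 20000 * 128
print(n_params)  # 2560000
```

Because the same id always maps to the same row, words that behave alike can end up with nearby vectors after training, which is the notion of distance that raw ids lack.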


<!-- <img src='images/IMDB_LSTM_model.PNG'/> -->
<img src='images/IMDB LSTM model.png'/>

In [7]:
print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

Build model...


### Step 3. Train and evaluate

In [56]:
from sklearn.metrics import f1_score, confusion_matrix
print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=15)

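# predict_classes is not available in recent TensorFlow releases;
# threshold the sigmoid output at 0.5 to get 0/1 class labels instead.
testing = (model.predict(x_test) > 0.5).astype('int32').ravel()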

print(f1_score(y_true=y_test, y_pred=testing))
print(confusion_matrix(y_true=y_test, y_pred=testing))

Train...
Train on 25000 samples
0.8025535049310472
[[10369  2131]
 [ 2694  9806]]
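
The F1 score can be recomputed by hand from the confusion matrix printed above. scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]` for labels 0 and 1; the variable names below are chosen for this illustration.

```python
# Confusion matrix from the output above: rows = true label, cols = predicted.
tn, fp = 10369, 2131
fn, tp = 2694, 9806

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 4), round(recall, 4), round(f1, 4), round(accuracy, 4))
```

The hand-computed F1 agrees with the `f1_score` output, and the test accuracy works out to about 0.807.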
