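The previous section showed how to classify vector inputs into two mutually exclusive classes with a densely connected neural network. In this section you will build a network that classifies Reuters newswires into 46 mutually exclusive topics. Because there are more than two classes, this is a multiclass classification problem.
- Because each data point is assigned to exactly one category, this is more specifically a single-label, multiclass classification problem.
- If each data point could belong to several categories (topics) at once, it would instead be a multilabel, multiclass classification problem.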
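
To make the distinction concrete, here is a minimal sketch of how the targets differ in the two settings. It assumes a toy problem with 4 topics instead of 46, and the helper names `single_label_target` and `multi_label_target` are hypothetical.

```python
# Toy illustration with 4 hypothetical topics (the real dataset has 46).
NUM_TOPICS = 4

def single_label_target(topic, num_topics=NUM_TOPICS):
    """One-hot target: exactly one position is 1."""
    return [1 if i == topic else 0 for i in range(num_topics)]

def multi_label_target(topics, num_topics=NUM_TOPICS):
    """Multi-hot target: any number of positions may be 1."""
    return [1 if i in topics else 0 for i in range(num_topics)]

single = single_label_target(2)
multi = multi_label_target({0, 2})
print(single)  # [0, 0, 1, 0]
print(multi)   # [1, 0, 1, 0]
```

The Reuters task in this section is the first case: every newswire carries exactly one topic.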

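### The Reuters dataset
The Reuters dataset is a collection of short newswires and their topics, published by Reuters in 1986. It is a simple, widely used dataset for text classification. It covers 46 different topics; some topics have more samples than others, but every topic has at least 10 samples in the training set.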

#### Loading the dataset

In [1]:
from keras.datasets import reuters

(train_data,train_labels),(test_data,test_labels) = reuters.load_data(num_words=10000)

In [4]:
print(len(train_data),len(train_labels),len(test_data),len(test_labels))

8982 8982 2246 2246


In [22]:
print(train_data[0],len(train_data),sep='===')

[1, 2, 2, 8, 43, 10, 447, 5, 25, 207, 270, 5, 3095, 111, 16, 369, 186, 90, 67, 7, 89, 5, 19, 102, 6, 19, 124, 15, 90, 67, 84, 22, 482, 26, 7, 48, 4, 49, 8, 864, 39, 209, 154, 6, 151, 6, 83, 11, 15, 22, 155, 11, 15, 7, 48, 9, 4579, 1005, 504, 6, 258, 6, 272, 11, 15, 22, 134, 44, 11, 15, 16, 8, 197, 1245, 90, 67, 52, 29, 209, 30, 32, 132, 6, 109, 15, 17, 12]===8982


In [16]:
# Each sample's label is an integer between 0 and 45: the index of its topic
print(train_labels[0])
print(max(train_labels),min(train_labels))

3
45 0
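
Since some topics have more samples than others, it can help to count samples per topic before training. The following is a small sketch using only the standard library; `samples_per_topic` and `toy_labels` are hypothetical names, and the toy labels stand in for `train_labels`.

```python
from collections import Counter

def samples_per_topic(labels):
    """Return a Counter mapping topic index -> number of samples."""
    return Counter(int(label) for label in labels)

# Hypothetical toy labels; with the real data you would pass train_labels.
toy_labels = [3, 4, 3, 3, 1, 4]
counts = samples_per_topic(toy_labels)
print(counts.most_common())   # [(3, 3), (4, 2), (1, 1)]
print(min(counts.values()))   # 1
```

Running `samples_per_topic(train_labels)` on the real labels should show how unevenly the 46 topics are represented.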


#### Decoding the indices back to text

In [14]:
word_index = reuters.get_word_index()
index_word = dict([(value,key) for (key,value) in word_index.items()])
# Decode one newswire; note the offset of 3 (i - 3),
# because indices 0, 1 and 2 are reserved for "padding", "start of sequence" and "unknown"
decoded_newswire = ' '.join(index_word.get(i-3,'?') for i in train_data[5])
decoded_newswire

"? the u s agriculture department estimated canada's 1986 87 wheat crop at 31 85 mln tonnes vs 31 85 mln tonnes last month it estimated 1985 86 output at 24 25 mln tonnes vs 24 25 mln last month canadian 1986 87 coarse grain production is projected at 27 62 mln tonnes vs 27 62 mln tonnes last month production in 1985 86 is estimated at 24 95 mln tonnes vs 24 95 mln last month canadian wheat exports in 1986 87 are forecast at 19 00 mln tonnes vs 18 00 mln tonnes last month exports in 1985 86 are estimated at 17 71 mln tonnes vs 17 72 mln last month reuter 3"

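The offset of 3 is easy to get wrong, so here is a self-contained sketch of the same decoding logic on a made-up three-word vocabulary. The function `decode` and the toy data are assumptions for illustration, not part of the Keras API.

```python
def decode(sequence, word_index, offset=3, unknown="?"):
    """Map encoded indices back to words, undoing the reserved-index offset."""
    index_word = {index: word for word, index in word_index.items()}
    return " ".join(index_word.get(i - offset, unknown) for i in sequence)

# Hypothetical vocabulary in the same shape as reuters.get_word_index().
toy_word_index = {"the": 1, "wheat": 2, "crop": 3}
# 1 = start of sequence, 2 = unknown word, everything else = word index + 3.
toy_sequence = [1, 4, 5, 2, 6]
decoded = decode(toy_sequence, toy_word_index)
print(decoded)  # ? the wheat ? crop
```

This also explains the leading `?` in the decoded newswire above: every sequence begins with the reserved "start of sequence" index 1, which has no word attached.
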
### Preparing the data
#### Vectorizing the data

In [18]:
import numpy as np

def vectorize_sequences(sequences,dimension=10000):
    results = np.zeros((len(sequences),dimension))
    for i,sequence in enumerate(sequences):
        results[i,sequence] = 1
    return results

# Vectorize the training and test data
x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

In [20]:
x_train[0]

array([0., 1., 1., ..., 0., 0., 0.])

In [25]:
print(len(x_train))

8982
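
To see what this encoding keeps and what it discards, the sketch below runs the same `vectorize_sequences` function on tiny, made-up sequences with `dimension=5`.

```python
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # One row per sequence; column j is 1 if word index j occurs in it.
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1
    return results

toy = vectorize_sequences([[1, 3, 3], [0, 2]], dimension=5)
print(toy.tolist())
# [[0.0, 1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 1.0, 0.0, 0.0]]

# Word order and repetition are lost: [3, 1] encodes the same as [1, 3, 3].
same = vectorize_sequences([[3, 1]], dimension=5)
print(np.array_equal(same[0], toy[0]))  # True
```

This is also why `x_train[0]` starts with `0., 1., 1.`: the first newswire contains the indices 1 and 2 but never the padding index 0.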


#### Vectorizing the labels
There are two ways to vectorize the labels: cast the label list to an integer tensor, or use one-hot encoding.

In [19]:
# One-hot encoding
def to_one_hot(labels,dimension=46):
    results = np.zeros((len(labels),dimension))
    for i,label in enumerate(labels):
        results[i,label] = 1
    return results

# One-hot encode the training and test labels
one_hot_train_labels = to_one_hot(train_labels)
one_hot_test_labels = to_one_hot(test_labels)

In [21]:
one_hot_test_labels[0]

array([0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [27]:
len(one_hot_train_labels)

8982
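
For comparison, here is a brief sketch of both label formats on made-up labels. It uses NumPy's identity matrix as a compact equivalent of a hand-written one-hot loop; `toy_labels` is a hypothetical stand-in for `train_labels`.

```python
import numpy as np

# Hypothetical labels standing in for train_labels.
toy_labels = np.array([3, 0, 45])

# Option 1: integer tensor, one class index per sample.
int_labels = np.asarray(toy_labels)

# Option 2: one-hot encoding. Row k of the 46x46 identity matrix
# is the one-hot vector for class k.
one_hot = np.eye(46)[toy_labels]

print(int_labels.shape)               # (3,)
print(one_hot.shape)                  # (3, 46)
print(one_hot.argmax(axis=1))         # [ 3  0 45]
```

Keras also ships a built-in helper for the second option, `keras.utils.to_categorical`. If you choose the integer format instead, the matching Keras loss is `sparse_categorical_crossentropy` rather than `categorical_crossentropy`.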

### Building the network

In [23]:
from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(64,activation='relu',input_shape=(10000,)))
model.add(layers.Dense(64,activation='relu'))
model.add(layers.Dense(46,activation='softmax'))
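
The final layer has 46 units with a softmax activation, so for each input the network outputs a probability distribution over the 46 topics. The NumPy sketch below illustrates what softmax does to a vector of raw scores; it is a simplified stand-in, not Keras's implementation, and the random scores are made up.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum first for numerical stability.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 46))   # made-up raw scores for 2 samples
probs = softmax(logits)

print(probs.shape)                        # (2, 46)
print(np.allclose(probs.sum(axis=1), 1))  # True
```

The predicted topic for a sample is then the index of the largest probability, for example `probs.argmax(axis=1)`.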



### Compiling the model

In [24]:
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['acc'])
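
The `categorical_crossentropy` loss compares the predicted distribution with the one-hot target. As a rough sketch of the idea (a simplified NumPy version, not Keras's exact implementation, with made-up predictions over 3 classes):

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip to avoid log(0), then take -sum(target * log(prediction)).
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

y_true = np.array([[0.0, 1.0, 0.0]])           # true class is index 1
confident = np.array([[0.05, 0.90, 0.05]])     # hypothetical good prediction
unsure = np.array([[0.40, 0.30, 0.30]])        # hypothetical poor prediction

loss_confident = float(categorical_crossentropy(y_true, confident)[0])
loss_unsure = float(categorical_crossentropy(y_true, unsure)[0])
print(round(loss_confident, 3))  # 0.105
print(round(loss_unsure, 3))     # 1.204
```

The loss shrinks as the probability assigned to the true class approaches 1, which is what training tries to achieve.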

### Setting aside a validation set

In [28]:
x_val = x_train[:1000]
partial_x_train = x_train[1000:]

y_val = one_hot_train_labels[:1000]
partial_y_train = one_hot_train_labels[1000:]

### Training the model

In [29]:
history = model.fit(partial_x_train,
                    partial_y_train,
                    batch_size=512,
                    epochs=20,
                    validation_data=(x_val,y_val))

Train on 7982 samples, validate on 1000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
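
Once training finishes, `history.history` is a dictionary of per-epoch metrics, including `val_loss` when validation data is passed to `fit`. One common use is to find the epoch where validation loss is lowest. The sketch below assumes a hypothetical helper `best_epoch`, and the numbers are made up for illustration; they are not results from this run.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1

# Made-up values for illustration only, not output from the model above.
toy_val_loss = [1.8, 1.3, 1.1, 1.0, 0.95, 0.97, 1.02]
print(best_epoch(toy_val_loss))  # 5
```

With the real run you would call `best_epoch(history.history['val_loss'])`; validation loss rising after that epoch is a typical sign of overfitting.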


