## Experimental Method
We solve the classification problem in the following steps:

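1. Convert every news sample into a sequence of word indices. A word index is simply an integer ID assigned to each word. Across all news texts we keep only the 20,000 most common words, and each text is limited to at most 1,000 words.
2. Build a word-embedding matrix in which row i holds the vector of the word whose index is i.
3. Load the embedding matrix into a Keras Embedding layer and freeze its weights, so that the word vectors no longer change during the subsequent training.
4. Follow the Embedding layer with a 1D convolutional layer, and use a fully connected softmax layer to output the news category.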
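
Step 1 can be illustrated without Keras. The sketch below is a simplified stand-in, not the library's implementation: `build_word_index` and `texts_to_padded` are hypothetical helper names, the toy texts are invented, and tokenization is a plain whitespace split (Keras's Tokenizer also lowercases and strips punctuation). It does follow the same conventions as the Keras utilities used later: ID 0 is reserved for padding, only IDs below `num_words` are kept, and sequences are truncated and padded on the left.

```python
from collections import Counter

def build_word_index(texts):
    """Assign integer IDs by descending frequency; ID 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common() sorts by count; words with equal counts keep first-seen order.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_padded(texts, word_index, num_words, maxlen):
    """Drop words with ID >= num_words, then left-truncate or left-pad with 0."""
    padded = []
    for text in texts:
        ids = [word_index[w] for w in text.split()
               if w in word_index and word_index[w] < num_words]
        ids = ids[-maxlen:]                        # keep the last maxlen IDs
        padded.append([0] * (maxlen - len(ids)) + ids)
    return padded

toy_texts = ["stocks rise as markets rally",
             "markets fall",
             "team wins final as fans cheer"]
word_index = build_word_index(toy_texts)
padded = texts_to_padded(toy_texts, word_index, num_words=4, maxlen=2)
print(word_index["markets"], padded)
```

With `num_words=4` only the three most frequent words survive, which mirrors how the 20,000-word limit discards rare words before padding.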

In [6]:
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers import Dense, Input, Flatten
from keras.layers import Conv1D, MaxPooling1D, Embedding
from keras.models import Model

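MAX_SEQUENCE_LENGTH = 1000  # keep at most 1,000 words per article
MAX_NB_WORDS = 20000  # vocabulary size: the 20,000 most frequent words
EMBEDDING_DIM = 200  # word-vector dimensionality (200)
VALIDATION_SPLIT = 0.3  # validation set size: 30% of all data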

### Data preprocessing

In [None]:
# Build a tab-separated test file from the sample test data.
import json

with open('test_data_sample.json', 'r') as f:
    test = json.load(f)

# Map passage_id -> second field of the matching line in submit_sample.txt.
dic = {}
with open('submit_sample.txt', 'r') as f:
    for line in f:
        pro = line.strip().split(',')
        dic[pro[0]] = pro[1]

# One output line per (question, passage) pair.
rows = []
for item in test:
    for passage in item['passages']:
        rows.append('\t'.join([item['question'], passage['content'],
                               dic[str(passage['passage_id'])]]))

with open('test.data', 'w') as f:
    f.write(''.join(row + '\n' for row in rows))

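Next, we convert the news samples into the tensors used to train the neural network. The Keras utilities involved are keras.preprocessing.text.Tokenizer and keras.preprocessing.sequence.pad_sequences. The code below expects texts (a list of news strings) and labels (the matching integer class IDs) to be loaded already.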

In [None]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=MAX_NB_WORDS)  # Keras 2 name for the old nb_words argument
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

labels = to_categorical(np.asarray(labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

# split the data into a training set and a validation set
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

x_train = data[:-nb_validation_samples]
y_train = labels[:-nb_validation_samples]
x_val = data[-nb_validation_samples:]
y_val = labels[-nb_validation_samples:]
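
The negative-index slicing used for the split has one edge case worth knowing: if the computed validation count is 0, `data[:-0]` is the same as `data[:0]` and the training set comes out empty. The following sketch reproduces the shuffle-and-split logic on plain lists so the behaviour is easy to check; `shuffled_split` and its `seed` parameter are hypothetical names introduced here, and the explicit zero check is an addition that the cell above does not have.

```python
import random

def shuffled_split(samples, labels, validation_split, seed=0):
    """Shuffle samples and labels together, then hold out the tail for validation."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    samples = [samples[i] for i in indices]
    labels = [labels[i] for i in indices]
    n_val = int(validation_split * len(samples))
    if n_val == 0:
        # samples[:-0] equals samples[:0], i.e. empty, so handle this case explicitly.
        return samples, labels, [], []
    return samples[:-n_val], labels[:-n_val], samples[-n_val:], labels[-n_val:]

x = list(range(10))
y = [i % 2 for i in x]
x_train, y_train, x_val, y_val = shuffled_split(x, y, validation_split=0.3)
print(len(x_train), len(x_val))
```

Shuffling one index list and applying it to both sequences is what keeps each sample aligned with its label, exactly as the `indices` array does in the NumPy version.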

### Embedding layer setup

Next, we parse each word and its corresponding vector from the word2vec file and store them in a dictionary.

In [4]:
embeddings_index = {}  # word -> vector (float32 NumPy array)
with open('wiki.vector') as f:  # the file is closed automatically
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 782240 word vectors.
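
Files saved in the word2vec text format often begin with a header line of the form "vocab_size dim", which a plain split-based parser would store as if it were a word. Whether wiki.vector has such a header is not stated here, so the sketch below is only an assumption-laden variant of the loader: `load_vectors` and `expected_dim` are hypothetical names, the sample data is invented, and vectors are kept as plain float lists to avoid any dependency.

```python
import io

def load_vectors(lines, expected_dim=None):
    """Parse 'word v1 v2 ...' lines into a dict mapping word -> list of floats."""
    vectors = {}
    for line_no, line in enumerate(lines):
        if not line.strip():
            continue  # ignore blank lines
        parts = line.rstrip().split(' ')
        if line_no == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue  # word2vec text header: "<vocab_size> <dim>"
        word, values = parts[0], parts[1:]
        if expected_dim is not None and len(values) != expected_dim:
            continue  # skip malformed rows instead of failing later
        vectors[word] = [float(v) for v in values]
    return vectors

sample = io.StringIO("3 2\nnews 0.1 0.2\nsport 0.3 0.4\nbroken 0.5\n")
vectors = load_vectors(sample, expected_dim=2)
print(sorted(vectors))
```

Passing the expected dimensionality makes a mismatch between the file and EMBEDDING_DIM visible at load time rather than when the matrix is filled.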


We can now use this dictionary to build the embedding matrix defined above.

In [None]:
EMBEDDING_DIM = 100  # word-vector dimensionality; must equal the vector length in wiki.vector (replaces the 200 set earlier)

embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
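
Because words missing from the pretrained vectors silently become all-zero rows, it can help to measure how much of the vocabulary is actually covered. The sketch below repeats the matrix construction on plain lists and reports the missing words; `build_embedding_matrix`, the toy vocabulary, and the toy vectors are all invented for illustration, and the dimension check is an extra safeguard that the cell above does not perform.

```python
def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with ID i; row 0 and unknown words stay zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    missing = []
    for word, i in word_index.items():
        vector = vectors.get(word)
        if vector is None:
            missing.append(word)
            continue
        if len(vector) != dim:
            raise ValueError(f"{word!r} has {len(vector)} values, expected {dim}")
        matrix[i] = list(vector)
    return matrix, missing

toy_index = {"news": 1, "sport": 2, "zzz": 3}
toy_vectors = {"news": [0.1, 0.2], "sport": [0.3, 0.4]}
matrix, missing = build_embedding_matrix(toy_index, toy_vectors, dim=2)
coverage = 1 - len(missing) / len(toy_index)
print(matrix, missing, round(coverage, 2))
```

The matrix has one more row than the vocabulary because word IDs start at 1 and row 0 is left for the padding value.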

We now load this embedding matrix into the Embedding layer. Note that we set trainable=False so that the layer is frozen and is not updated during training.

In [5]:
from keras.layers import Embedding

MAX_SEQUENCE_LENGTH = 200  # maximum sequence length; must equal the maxlen passed to pad_sequences (replaces the 1000 set earlier)
embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

# embedding_layer = Embedding(len(word_index) + 1,
#                             EMBEDDING_DIM,
#                             input_length=MAX_SEQUENCE_LENGTH)
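
To make the last step of the outline concrete, here is a toy forward pass for a single padded sequence: frozen embedding lookup, one 1D convolution filter with ReLU, max pooling over time, and a softmax over two classes. It is a hand-written illustration with made-up weights, not the Keras model. The helper names, the single filter, the use of global max pooling, and the two-class "dense layer" are all assumptions chosen to keep the arithmetic easy to follow.

```python
import math

def embed(sequence, matrix):
    """Frozen embedding lookup: replace each word ID with its row of the matrix."""
    return [matrix[i] for i in sequence]

def conv1d_relu(vectors, kernel, bias=0.0):
    """Slide one filter of width len(kernel) over the sequence ('valid' padding)."""
    width = len(kernel)
    outputs = []
    for start in range(len(vectors) - width + 1):
        window = vectors[start:start + width]
        total = bias + sum(w * x
                           for k_row, v_row in zip(kernel, window)
                           for w, x in zip(k_row, v_row))
        outputs.append(max(0.0, total))  # ReLU
    return outputs

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

matrix = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]    # row 0 is the padding vector
features = conv1d_relu(embed([0, 1, 2, 1], matrix),
                       kernel=[[1.0, 0.0], [0.0, 1.0]])
pooled = max(features)                            # max pooling over time
probs = softmax([2.0 * pooled, -1.0 * pooled])    # toy dense layer, two classes
print(features, pooled, probs)
```

The filter responds only where word 1 is immediately followed by word 2, which is the sense in which a 1D convolution detects short word patterns regardless of their position in the text.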
