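This project uses the `contrib.learn` toolkit in TensorFlow to train word embeddings and classify text. Compared with earlier TensorFlow releases, version 1.0 offers many convenient high-level interfaces. The data used here is the 20 Newsgroups corpus; for the prototype code, see TensorFlow's [text classification example](https://github.com/tensorflow/tensorflow/blob/r1.0/tensorflow/examples/learn/text_classification.py).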

## Loading the packages

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import sys

import numpy as np
import pandas
from sklearn import metrics
import tensorflow as tf

learn = tf.contrib.learn
from tensorflow.contrib.layers.python.layers import encoders

## Reading the 20 Newsgroups data

In [59]:
from sklearn.datasets import fetch_20newsgroups
import string
import re
# `remove` expects a tuple, so a one-element tuple needs a trailing comma.
news_train = fetch_20newsgroups(subset='train', shuffle=True,
                                remove=('headers',), random_state=11)
news_test = fetch_20newsgroups(subset='test', shuffle=True,
                               remove=('headers',), random_state=11)

In [98]:
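# Remove digits and newlines; treat each punctuation mark as its own token.
def cleanText(corpus):
    # Lowercase once, rather than once per punctuation character.
    corpus = [z.lower() for z in corpus]
    for c in string.punctuation:
        # Surround each punctuation mark with spaces so it splits off as a token.
        corpus = [z.replace(c, ' %s ' % c) for z in corpus]
    # Drop all digits.
    corpus = [re.sub('[' + string.digits + ']', '', z) for z in corpus]
    # Replace newlines with a space so that words on adjacent lines are not
    # glued together (the original version replaced them with '').
    corpus = [z.replace('\n', ' ') for z in corpus]
    return corpus
X_train = cleanText(news_train.data)
X_test = cleanText(news_test.data)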
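
To make the effect of the cleaning steps concrete, here is a minimal sketch on a single made-up sentence. The helper name `clean_text_demo` and the sample string are assumptions for illustration only; replacing newlines with a space (rather than an empty string) is also an assumption of this sketch.

```python
import re
import string

def clean_text_demo(doc):
    """Hypothetical single-document version of the cleaning steps."""
    doc = doc.lower()
    for c in string.punctuation:
        # Pad each punctuation mark with spaces so it becomes its own token.
        doc = doc.replace(c, ' %s ' % c)
    # Remove digits.
    doc = re.sub('[' + string.digits + ']', '', doc)
    # Replace newlines with a space (assumption of this sketch).
    return doc.replace('\n', ' ')

sample = "Hello, World!\nCall 911 now."
tokens = clean_text_demo(sample).split()
print(tokens)
```

Splitting on whitespace afterwards yields lowercase words and punctuation marks as separate tokens, with the digits gone.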

In [99]:
y_train, y_test = news_train.target, news_test.target

## Reading the DBpedia files

In [17]:
train = pandas.read_csv('dbpedia_csv/train.csv', header=None)
X_train, y_train = train[2], train[0]
test = pandas.read_csv('dbpedia_csv/test.csv', header=None)
X_test, y_test = test[2], test[0]

In [18]:
print(X_train.shape)

(560000,)


In [21]:
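# Randomly select a subset of the samples to keep the computation manageable
index = np.random.choice(len(X_train), size=160000, replace=False)
# Use positional indexing so the selection does not depend on the Series index labels.
X_train = X_train.iloc[index]
y_train = y_train.iloc[index]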
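
The subsampling step can be illustrated on toy arrays. This is only a sketch: the array contents, the sizes, and the fixed seed are assumptions chosen so the example is small and repeatable.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed is an assumption, for repeatability
texts = np.array(['doc%d' % i for i in range(10)])
labels = np.arange(10) % 3

# Draw 4 distinct positions; replace=False guarantees no document is picked twice.
index = rng.choice(len(texts), size=4, replace=False)

# Index texts and labels with the same positions so the pairs stay aligned.
texts_sub = texts[index]
labels_sub = labels[index]
print(index, texts_sub, labels_sub)
```

The important detail is that the same `index` is applied to both the texts and the labels, so each sampled document keeps its own label.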

### Building the vocabulary and the document matrix

In [30]:
MAX_DOCUMENT_LENGTH = 100
EMBEDDING_SIZE = 30
n_words = 0

In [31]:
# Build the vocabulary and map each document to a fixed-length sequence of word ids
vocab_processor = learn.preprocessing.VocabularyProcessor(MAX_DOCUMENT_LENGTH)
x_train = np.array(list(vocab_processor.fit_transform(X_train)))
x_test = np.array(list(vocab_processor.transform(X_test)))
n_words = len(vocab_processor.vocabulary_)
print('Total words: %d' % n_words)

Total words: 363501
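
As a rough picture of what this preprocessing produces, the sketch below maps words to integer ids and pads or truncates every document to a fixed length. It is not the `VocabularyProcessor` implementation: the helper name `fit_transform_sketch`, whitespace tokenization, and the toy documents are all assumptions.

```python
import numpy as np

def fit_transform_sketch(docs, max_len):
    """Hypothetical helper: build a word -> id map, then pad/truncate each doc."""
    vocab = {'<UNK>': 0}  # id 0 is reserved; it is also used for padding here
    rows = []
    for doc in docs:
        # Assign the next free id to every word not seen before.
        ids = [vocab.setdefault(w, len(vocab)) for w in doc.split()]
        # Truncate long documents and right-pad short ones with 0.
        ids = ids[:max_len] + [0] * max(0, max_len - len(ids))
        rows.append(ids)
    return np.array(rows), vocab

x, vocab = fit_transform_sketch(
    ['the cat sat', 'the dog sat on the mat'], max_len=4)
print(x)
```

Each row has exactly `max_len` entries, which is what lets the models below treat the input as a dense `[batch_size, MAX_DOCUMENT_LENGTH]` integer matrix.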


## Bag-of-words model

In [7]:
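def bag_of_words_model(features, target):
  """Bag-of-words model: treat a document as a collection of words, ignoring word order."""
  # 15 output slots: DBpedia labels run from 1 to 14, so index 0 goes unused.
  # The 20 Newsgroups labels (0 to 19) would need a depth of 20 instead.
  target = tf.one_hot(target, 15, 1, 0)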
  features = encoders.bow_encoder(
      features, vocab_size=n_words, embed_dim=EMBEDDING_SIZE)
  logits = tf.contrib.layers.fully_connected(features, 15, activation_fn=None)
  loss = tf.contrib.losses.softmax_cross_entropy(logits, target)
  train_op = tf.contrib.layers.optimize_loss(
      loss,
      tf.contrib.framework.get_global_step(),
      optimizer='Adam',
      learning_rate=0.01)
  return ({
      'class': tf.argmax(logits, 1),
      'prob': tf.nn.softmax(logits)
  }, loss, train_op)
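
The idea behind a bag-of-words encoding can be sketched in NumPy: look up an embedding vector for every word id and average them over the document, so that word order no longer matters. This is an illustration under stated assumptions, not the `bow_encoder` implementation; the function name `bow_encode`, the toy sizes, and the random embedding table are made up.

```python
import numpy as np

rng = np.random.RandomState(0)
vocab_size, embed_dim = 6, 3  # toy sizes, not the notebook's values
embeddings = rng.randn(vocab_size, embed_dim)

def bow_encode(word_ids, embeddings):
    """Average the embedding vectors of all word ids in each document."""
    # embeddings[word_ids] has shape [batch, doc_length, embed_dim].
    return embeddings[word_ids].mean(axis=1)

doc = np.array([[1, 2, 3, 0]])
shuffled = np.array([[3, 0, 2, 1]])  # same words, different order

encoded = bow_encode(doc, embeddings)
same = np.allclose(encoded, bow_encode(shuffled, embeddings))
print(encoded.shape, same)
```

Because the average is order-independent, the original and the shuffled document get the same feature vector, which is exactly the information a bag-of-words model discards.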

In [8]:
model_fn = bag_of_words_model
classifier = learn.Estimator(model_fn=model_fn)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_environment': 'local', '_tf_random_seed': None, '_save_checkpoints_steps': None, '_evaluation_master': '', '_keep_checkpoint_every_n_hours': 10000, '_task_id': 0, '_keep_checkpoint_max': 5, '_save_summary_steps': 100, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1
}
, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x000001E70F210940>, '_is_chief': True, '_master': '', '_save_checkpoints_secs': 600, '_num_ps_replicas': 0}


In [11]:
# Train and predict
classifier.fit(x_train, y_train, steps=100)
y_predicted = [
    p['class'] for p in classifier.predict(
        x_test, as_iterable=True)]
score = metrics.accuracy_score(y_test, y_predicted)
print('Accuracy: {0:f}'.format(score))

Instructions for updating:
Estimator is decoupled from Scikit Learn interface by moving into
separate class SKCompat. Arguments x, y and batch_size are only
available in the SKCompat class, Estimator will only accept input_fn.
Example conversion:
  est = Estimator(...) -> est = SKCompat(Estimator(...))
Instructions for updating:
Use tf.losses.softmax_cross_entropy instead.


Instructions for updating:
Use tf.losses.compute_weighted_loss instead.
Instructions for updating:
Use tf.losses.add_loss instead.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 101 into C:\Users\ADMINI~1\AppData\Local\Temp\2\tmpz6y0m1a2\model.ckpt.
INFO:tensorflow:loss = 0.0851872, step = 101
INFO:tensorflow:Saving checkpoints for 133 into C:\Users\ADMINI~1\AppData\Local\Temp\2\tmpz6y0m1a2\model.ckpt.
INFO:tensorflow:Saving checkpoints for 165 into C:\Users\ADMINI~1\AppData\Local\Temp\2\tmpz6y0m1a2\model.ckpt.
INFO:tensorflow:Saving checkpoints for 196 into C:\Users\ADMINI~1\AppData\Local\Temp\2\tmpz6y0m1a2\model.ckpt.
INFO:tensorflow:Saving checkpoints for 200 into C:\Users\ADMINI~1\AppData\Local\Temp\2\tmpz6y0m1a2\model.ckpt.
INFO:tensorflow:Loss for final step: 0.0390103.

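All three models in this notebook share the same last steps: one-hot encode the labels, compute a softmax cross-entropy loss on the logits, and predict with `argmax`. The NumPy sketch below walks through those steps on made-up logits. The helper names and the example values are assumptions, and the depth of 15 assumes labels in the range 1 to 14 as in the DBpedia CSV files.

```python
import numpy as np

def one_hot(labels, depth):
    """Turn integer labels into rows with a single 1 at the label index."""
    out = np.zeros((len(labels), depth))
    out[np.arange(len(labels)), labels] = 1.0
    return out

def softmax(logits):
    # Subtract the row maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def softmax_cross_entropy(logits, onehot):
    """Mean over the batch of -sum(onehot * log(softmax(logits)))."""
    return float(-(onehot * np.log(softmax(logits))).sum(axis=1).mean())

labels = np.array([1, 14, 3])
logits = np.zeros((3, 15))
logits[np.arange(3), [1, 14, 2]] = 5.0  # third example is predicted wrongly

loss = softmax_cross_entropy(logits, one_hot(labels, 15))
predicted = logits.argmax(axis=1)
accuracy = float((predicted == labels).mean())
print(loss, predicted, accuracy)
```

The accuracy computed this way corresponds to what `metrics.accuracy_score` reports for the `'class'` predictions.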
## RNN model

In [24]:
def rnn_model(features, target):
  """RNN model."""
  # First convert each word id into its embedding vector,
  # so that every document becomes a sequence of word vectors.
  word_vectors = tf.contrib.layers.embed_sequence(
      features, vocab_size=n_words, embed_dim=EMBEDDING_SIZE, scope='words')

  # Split into a list of per-word embeddings, removing the doc length dim.
  # word_list is a list of tensors of shape [batch_size, EMBEDDING_SIZE].
  word_list = tf.unstack(word_vectors, axis=1)

  # Create a Gated Recurrent Unit cell with hidden size of EMBEDDING_SIZE.
  cell = tf.contrib.rnn.GRUCell(EMBEDDING_SIZE)

  # Create an unrolled recurrent neural network of length
  # MAX_DOCUMENT_LENGTH and pass word_list as the inputs, one per step.
  _, encoding = tf.contrib.rnn.static_rnn(cell, word_list, dtype=tf.float32)

  # Take the RNN state after the last step (a vector of size EMBEDDING_SIZE)
  # and pass it as features to a linear softmax classifier over the output
  # classes.
  target = tf.one_hot(target, 15, 1, 0)
  logits = tf.contrib.layers.fully_connected(encoding, 15, activation_fn=None)
  loss = tf.contrib.losses.softmax_cross_entropy(logits, target)

  # Create a training op.
  train_op = tf.contrib.layers.optimize_loss(
      loss,
      tf.contrib.framework.get_global_step(),
      optimizer='Adam',
      learning_rate=0.01)

  return ({
      'class': tf.argmax(logits, 1),
      'prob': tf.nn.softmax(logits)
  }, loss, train_op)
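
For intuition about what the recurrent part computes, here is a small NumPy sketch of a GRU unrolled over a sequence, keeping only the final state. It follows a common formulation of the GRU equations and is not TensorFlow's implementation; the parameter names, the random weights, and the toy sizes are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, p):
    """One GRU step for a batch: x is [batch, in_dim], h is [batch, hidden]."""
    xh = np.concatenate([x, h], axis=1)
    r = sigmoid(np.dot(xh, p['W_r']) + p['b_r'])   # reset gate
    u = sigmoid(np.dot(xh, p['W_u']) + p['b_u'])   # update gate
    xrh = np.concatenate([x, r * h], axis=1)
    c = np.tanh(np.dot(xrh, p['W_c']) + p['b_c'])  # candidate state
    # Blend the previous state with the candidate state.
    return u * h + (1.0 - u) * c

rng = np.random.RandomState(0)
batch, steps, in_dim, hidden = 2, 4, 3, 5  # toy sizes
params = {}
for name in ('r', 'u', 'c'):
    params['W_' + name] = 0.1 * rng.randn(in_dim + hidden, hidden)
    params['b_' + name] = np.zeros(hidden)

word_vectors = rng.randn(batch, steps, in_dim)
h = np.zeros((batch, hidden))
for t in range(steps):
    # Feed one word vector per step, carrying the state forward.
    h = gru_step(word_vectors[:, t, :], h, params)
print(h.shape)
```

The final `h` plays the role of `encoding` in the model above: one fixed-size vector per document that is then fed to the classification layer.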

In [25]:
model_fn = rnn_model
classifier = learn.Estimator(model_fn=model_fn)
# Train and predict
classifier.fit(x_train, y_train, steps=100)
y_predicted = [
    p['class'] for p in classifier.predict(
        x_test, as_iterable=True)]
score = metrics.accuracy_score(y_test, y_predicted)
print('Accuracy: {0:f}'.format(score))

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_keep_checkpoint_every_n_hours': 10000, '_task_type': None, '_task_id': 0, '_master': '', '_environment': 'local', '_keep_checkpoint_max': 5, '_num_ps_replicas': 0, '_save_summary_steps': 100, '_is_chief': True, '_evaluation_master': '', '_save_checkpoints_secs': 600, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1
}
, '_save_checkpoints_steps': None, '_tf_random_seed': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x000001DB4212D208>}
Instructions for updating:
Estimator is decoupled from Scikit Learn interface by moving into
separate class SKCompat. Arguments x, y and batch_size are only
available in the SKCompat class, Estimator will only accept input_fn.
Example conversion:
  est = Estimator(...) -> est = SKCompat(Estimator(...))
Instructions for updating:
Use tf.losses.softmax_cross_entropy instead.
Instructions for updating:
Use tf.losses.compute_weighted_loss instead.
Instructions for updating:
Use tf.losses.add_loss instead.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into C:\Users\ADMINI~1\AppData\Local\Temp\2\tmp5mw1f5gy\model.ckpt.
INFO:tensorflow:step = 1, loss = 2.70814
INFO:tensorflow:Saving checkpoints for 35 into C:\Users\ADMINI~1\AppData\Local\Temp\2\tmp5mw1f5gy\model.ckpt.
INFO:tensorflow:Saving checkpoints for 68 into C:\Users\ADMINI~1\AppData\Local\Temp\2\tmp5mw1f5gy\model.ckpt.
INFO:tensorflow:Saving checkpoints for 100 into C:\Users\ADMINI~1\AppData\Local\Temp\2\tmp5mw1f5gy\model.ckpt.
INFO:tensorflow:Loss for final step: 0.00187004.

## CNN model

In [32]:
N_FILTERS = 10
WINDOW_SIZE = 20
FILTER_SHAPE1 = [WINDOW_SIZE, EMBEDDING_SIZE]
FILTER_SHAPE2 = [WINDOW_SIZE, N_FILTERS]
POOLING_WINDOW = 4
POOLING_STRIDE = 2
def cnn_model(features, target):
  """2 layer ConvNet to predict from sequence of words to a class."""
  # Convert indexes of words into embeddings.
  # This creates embeddings matrix of [n_words, EMBEDDING_SIZE] and then
  # maps word indexes of the sequence into [batch_size, sequence_length,
  # EMBEDDING_SIZE].
  target = tf.one_hot(target, 15, 1, 0)
  word_vectors = tf.contrib.layers.embed_sequence(
      features, vocab_size=n_words, embed_dim=EMBEDDING_SIZE, scope='words')
  word_vectors = tf.expand_dims(word_vectors, 3)
  with tf.variable_scope('CNN_Layer1'):
    # Apply Convolution filtering on input sequence.
    conv1 = tf.contrib.layers.convolution2d(
        word_vectors, N_FILTERS, FILTER_SHAPE1, padding='VALID')
    # Add a RELU for non linearity.
    conv1 = tf.nn.relu(conv1)
    # Max pooling across output of Convolution+Relu.
    pool1 = tf.nn.max_pool(
        conv1,
        ksize=[1, POOLING_WINDOW, 1, 1],
        strides=[1, POOLING_STRIDE, 1, 1],
        padding='SAME')
    # Transpose matrix so that n_filters from convolution becomes width.
    pool1 = tf.transpose(pool1, [0, 1, 3, 2])
  with tf.variable_scope('CNN_Layer2'):
    # Second level of convolution filtering.
    conv2 = tf.contrib.layers.convolution2d(
        pool1, N_FILTERS, FILTER_SHAPE2, padding='VALID')
    # Max across each filter to get useful features for classification.
    pool2 = tf.squeeze(tf.reduce_max(conv2, 1), squeeze_dims=[1])

  # Apply regular WX + B and classification.
  logits = tf.contrib.layers.fully_connected(pool2, 15, activation_fn=None)
  loss = tf.contrib.losses.softmax_cross_entropy(logits, target)

  train_op = tf.contrib.layers.optimize_loss(
      loss,
      tf.contrib.framework.get_global_step(),
      optimizer='Adam',
      learning_rate=0.01)

  return ({
      'class': tf.argmax(logits, 1),
      'prob': tf.nn.softmax(logits)
  }, loss, train_op)
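
The tensor sizes along the sequence axis can be checked with a little arithmetic. The sketch below uses the standard output-length formulas for 'VALID' and 'SAME' padding and assumes a stride of 1 for both convolutions; the helper names `valid_out` and `same_out` are made up for this illustration.

```python
import math

def valid_out(length, window, stride=1):
    """Output length along one axis with 'VALID' padding."""
    return (length - window) // stride + 1

def same_out(length, stride):
    """Output length along one axis with 'SAME' padding."""
    return int(math.ceil(length / float(stride)))

MAX_DOCUMENT_LENGTH, WINDOW_SIZE, POOLING_STRIDE = 100, 20, 2

conv1_len = valid_out(MAX_DOCUMENT_LENGTH, WINDOW_SIZE)  # after the CNN_Layer1 convolution
pool1_len = same_out(conv1_len, POOLING_STRIDE)          # after max pooling
conv2_len = valid_out(pool1_len, WINDOW_SIZE)            # after the CNN_Layer2 convolution
print(conv1_len, pool1_len, conv2_len)
```

Under these assumptions the sequence axis shrinks from 100 to 81, then 41, then 22 positions; `tf.reduce_max` over that axis followed by the squeeze leaves one value per filter, that is, a `[batch_size, N_FILTERS]` feature matrix for the final layer.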


In [None]:
model_fn = cnn_model
classifier = learn.Estimator(model_fn=model_fn)
# Train and predict
classifier.fit(x_train, y_train, steps=100)
y_predicted = [
    p['class'] for p in classifier.predict(
        x_test, as_iterable=True)]
score = metrics.accuracy_score(y_test, y_predicted)
print('Accuracy: {0:f}'.format(score))

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_keep_checkpoint_every_n_hours': 10000, '_task_type': None, '_task_id': 0, '_master': '', '_environment': 'local', '_keep_checkpoint_max': 5, '_num_ps_replicas': 0, '_save_summary_steps': 100, '_is_chief': True, '_evaluation_master': '', '_save_checkpoints_secs': 600, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1
}
, '_save_checkpoints_steps': None, '_tf_random_seed': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x000001DB03745780>}
Instructions for updating:
Estimator is decoupled from Scikit Learn interface by moving into
separate class SKCompat. Arguments x, y and batch_size are only
available in the SKCompat class, Estimator will only accept input_fn.
Example conversion:
  est = Estimator(...) -> est = SKCompat(Estimator(...))

Instructions for updating:
Use tf.losses.add_loss instead.
INFO:tensorflow:Create CheckpointSaverHook.
