# Text Classification

Authors: Victor Zhong, Kelvin Guu

We are going to tackle a relatively straightforward text classification problem with Stanza and TensorFlow.

## Dataset

First, we'll grab the 20 Newsgroups data, which `sklearn` conveniently downloads for us. We restrict ourselves to two of its categories.

In [1]:
from sklearn.datasets import fetch_20newsgroups
classes = ['alt.atheism', 'soc.religion.christian']
newsgroups_train = fetch_20newsgroups(subset='train', categories=classes)

from collections import Counter
Counter([classes[t] for t in newsgroups_train.target])

Counter({'alt.atheism': 480, 'soc.religion.christian': 599})

In [2]:
print newsgroups_train.data[0]

From: nigel.allen@canrem.com (Nigel Allen)
Subject: library of congress to host dead sea scroll symposium april 21-22
Lines: 96


 Library of Congress to Host Dead Sea Scroll Symposium April 21-22
 To: National and Assignment desks, Daybook Editor
 Contact: John Sullivan, 202-707-9216, or Lucy Suddreth, 202-707-9191
          both of the Library of Congress

   WASHINGTON, April 19  -- A symposium on the Dead Sea 
Scrolls will be held at the Library of Congress on Wednesday,
April 21, and Thursday, April 22.  The two-day program, cosponsored
by the library and Baltimore Hebrew University, with additional
support from the Project Judaica Foundation, will be held in the
library's Mumford Room, sixth floor, Madison Building.
   Seating is limited, and admission to any session of the symposium
must be requested in writing (see Note A).
   The symposium will be held one week before the public opening of a
major exhibition, "Scrolls from the Dead Sea: The Ancient Library of
Qumran and Modern

## Annotating using CoreNLP

If you do not have CoreNLP, download it from here:

http://stanfordnlp.github.io/CoreNLP/index.html#download

We are going to use the Java server feature of CoreNLP to annotate data in Python. In the CoreNLP directory, run the server:

```bash
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
```

Next, we'll annotate an example to see how the server works.

In [3]:
from stanza.corenlp.client import Client

client = Client()
annotation = client.annotate(newsgroups_train.data[0], properties={'annotators': 'tokenize,ssplit,pos'})
annotation['sentences'][0]

{u'index': 0,
 u'parse': u'SENTENCE_SKIPPED_OR_UNPARSABLE',
 u'tokens': [{u'after': u'',
   u'before': u'',
   u'characterOffsetBegin': 0,
   u'characterOffsetEnd': 4,
   u'index': 1,
   u'originalText': u'From',
   u'pos': u'IN',
   u'word': u'From'},
  {u'after': u' ',
   u'before': u'',
   u'characterOffsetBegin': 4,
   u'characterOffsetEnd': 5,
   u'index': 2,
   u'originalText': u':',
   u'pos': u':',
   u'word': u':'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 6,
   u'characterOffsetEnd': 28,
   u'index': 3,
   u'originalText': u'nigel.allen@canrem.com',
   u'pos': u'NNP',
   u'word': u'nigel.allen@canrem.com'},
  {u'after': u'',
   u'before': u' ',
   u'characterOffsetBegin': 29,
   u'characterOffsetEnd': 30,
   u'index': 4,
   u'originalText': u'(',
   u'pos': u'-LRB-',
   u'word': u'-LRB-'},
  {u'after': u' ',
   u'before': u'',
   u'characterOffsetBegin': 30,
   u'characterOffsetEnd': 35,
   u'index': 5,
   u'originalText': u'Nigel',
   u'pos': u'NNP'

That was rather long, but the gist is that the annotation is organized into sentences, which are in turn organized into tokens. Each token carries a number of annotations (we've only asked for the POS tags).

In [4]:
for token in annotation['sentences'][0]['tokens']:
    print token['word'], token['pos']

From IN
: :
nigel.allen@canrem.com NNP
-LRB- -LRB-
Nigel NNP
Allen NNP
-RRB- -RRB-
Subject NNP
: :
library NN
of IN
congress NN
to TO
host NN
dead JJ
sea NN
scroll NN
symposium NN
april NNP
21-22 CD
Lines NNPS
: :
96 CD
Library NNP
of IN
Congress NNP
to TO
Host NNP
Dead NNP
Sea NNP
Scroll NNP
Symposium NNP
April NNP
21-22 CD
To TO
: :
National NNP
and CC
Assignment NNP
desks NNS
, ,
Daybook NNP
Editor NNP
Contact NN
: :
John NNP
Sullivan NNP
, ,
202-707-9216 CD
, ,
or CC
Lucy NNP
Suddreth NNP
, ,
202-707-9191 CD
both DT
of IN
the DT
Library NNP
of IN
Congress NNP
WASHINGTON NNP
, ,
April NNP
19 CD
-- :
A DT
symposium NN
on IN
the DT
Dead NNP
Sea NNP
Scrolls NNP
will MD
be VB
held VBN
at IN
the DT
Library NNP
of IN
Congress NNP
on IN
Wednesday NNP
, ,
April NNP
21 CD
, ,
and CC
Thursday NNP
, ,
April NNP
22 CD
. .
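To get a feel for this structure without a running server, here is a minimal sketch that walks a hand-built dictionary. The `mock_annotation` value and the `tagged_words` helper are made up for illustration; they mirror only the `sentences`, `tokens`, `word` and `pos` keys shown above, not the full server response.

```python
from collections import Counter

# Hand-built stand-in for a server response (an assumption, not real output);
# it copies only the 'sentences' -> 'tokens' -> {'word', 'pos'} layout.
mock_annotation = {
    'sentences': [
        {'index': 0,
         'tokens': [
             {'index': 1, 'word': 'From', 'pos': 'IN'},
             {'index': 2, 'word': ':', 'pos': ':'},
             {'index': 3, 'word': 'nigel.allen@canrem.com', 'pos': 'NNP'},
         ]},
    ],
}


def tagged_words(annotation):
    """Flatten an annotation into a list of (word, pos) pairs."""
    return [(token['word'], token['pos'])
            for sentence in annotation['sentences']
            for token in sentence['tokens']]


pairs = tagged_words(mock_annotation)
pos_counts = Counter(pos for _, pos in pairs)
print(pairs)
print(pos_counts)
```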


For our purposes, we're going to treat each document as one long sequence of words rather than a sequence of sequences (e.g. a list of sentences, each a list of words). We'll do this by passing in the `ssplit.isOneSentence` flag.

In [5]:
docs = []
labels = []
for doc, label in zip(newsgroups_train.data, newsgroups_train.target)[:100]:
    try:
        annotation = client.annotate(doc, properties={'annotators': 'tokenize,ssplit', 'ssplit.isOneSentence': True})
        docs.append([t['word'] for t in annotation['sentences'][0]['tokens']])
        labels.append(label)
    except Exception:
        pass  # punt: skip any document that fails to annotate (e.g. unicode errors)
print len(docs), len(labels)

99 99


We'll create a lightweight dataset object out of this. A `Dataset` is really a glorified dictionary of fields, where each field corresponds to an attribute of the examples in the dataset.

In [6]:
from stanza.text.dataset import Dataset
dataset = Dataset({'X': docs, 'Y': labels})

# dataset supports, amongst other functionalities, shuffling:
dataset.shuffle()

Dataset(Y, X)

In [7]:
# indexing of a single element
print dataset[0].keys()

['Y', 'X']


In [8]:
# indexing of multiple elements
n_train = int(0.7 * len(dataset))
train = Dataset(dataset[:n_train])
test = Dataset(dataset[n_train:])

print 'train: {}, test: {}'.format(len(train), len(test))

train: 69, test: 30
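To make the "glorified dictionary of fields" idea concrete, here is a rough sketch of such a container. `TinyDataset` is a hypothetical class written for this tutorial, not Stanza's implementation; it only imitates the behaviour used above (length, indexing by int or slice, and shuffling all fields together).

```python
import random
from collections import OrderedDict


class TinyDataset(object):
    """Hypothetical stand-in: a dict of equal-length lists, one per field."""

    def __init__(self, fields):
        self.fields = OrderedDict(fields)
        lengths = set(len(values) for values in self.fields.values())
        assert len(lengths) == 1, 'all fields must have the same length'

    def __len__(self):
        return len(next(iter(self.fields.values())))

    def __getitem__(self, key):
        # An int selects one example and a slice selects several; either way
        # the result is a dict keyed by field name.
        return OrderedDict((name, values[key])
                           for name, values in self.fields.items())

    def shuffle(self, seed=None):
        # Apply the same permutation to every field so examples stay aligned.
        order = list(range(len(self)))
        random.Random(seed).shuffle(order)
        for name, values in list(self.fields.items()):
            self.fields[name] = [values[i] for i in order]
        return self


toy = TinyDataset({'X': [['a'], ['b', 'c'], ['d'], ['e']], 'Y': [0, 1, 0, 1]})
toy.shuffle(seed=0)
n_toy_train = int(0.75 * len(toy))
toy_train = TinyDataset(toy[:n_toy_train])
toy_test = TinyDataset(toy[n_toy_train:])
print('train: {}, test: {}'.format(len(toy_train), len(toy_test)))
```

Slicing returns a plain dictionary of fields, which is why the split above wraps each slice in a new dataset object, just as the cell with `Dataset(dataset[:n_train])` does.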


## Creating vocabulary and mapping to vector space

Stanza provides vocabulary objects that map words to indices and back. We also provide convenient means of loading pretrained embeddings such as `Senna` and `Glove`.

In [9]:
from stanza.text.vocab import Vocab
vocab = Vocab('***UNK***')
vocab

OrderedDict([('***UNK***', 0)])

We'll try our hands at some conversions:

In [13]:
sents = ['I like cats and dogs', 'I like nothing', 'I like cats and nothing else']
inds = []

# `vocab.update` adds the list of words to the Vocab object.
# It also returns the list of words as ints.
for s in sents[:2]:
    inds.append(vocab.update(s.split()))

# `vocab.words2indices` converts the list of words to ints (but does not update the vocab)
inds.append(vocab.words2indices(sents[2].split()))

for s, ind in zip(sents, inds):
    print '{:50}{}\nrecovered: {}'.format(s, ind, vocab.indices2words(ind))
    print

I like cats and dogs                              [1, 2, 3, 4, 5]
recovered: ['I', 'like', 'cats', 'and', 'dogs']

I like nothing                                    [1, 2, 6]
recovered: ['I', 'like', 'nothing']

I like cats and nothing else                      [1, 2, 3, 4, 6, 0]
recovered: ['I', 'like', 'cats', 'and', 'nothing', '***UNK***']
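The behaviour above can be imitated in a few lines. The following is a minimal sketch, not Stanza's code: `TinyVocab` is a hypothetical class that assumes index 0 is reserved for the unknown token and that new words receive the next free index in order of first appearance.

```python
from collections import Counter, OrderedDict


class TinyVocab(object):
    """Hypothetical word <-> index map with a reserved unknown token."""

    def __init__(self, unk='***UNK***'):
        self.unk = unk
        self.word2index = OrderedDict([(unk, 0)])
        self.index2word = [unk]
        self.counts = Counter()

    def add(self, word):
        # Register the word if it is new, then count this occurrence.
        if word not in self.word2index:
            self.word2index[word] = len(self.index2word)
            self.index2word.append(word)
        self.counts[word] += 1
        return self.word2index[word]

    def update(self, words):
        # Grows the vocabulary and returns the words as indices.
        return [self.add(word) for word in words]

    def words2indices(self, words):
        # Read-only lookup: unseen words map to the unknown index.
        unk_index = self.word2index[self.unk]
        return [self.word2index.get(word, unk_index) for word in words]

    def indices2words(self, indices):
        return [self.index2word[i] for i in indices]


tiny = TinyVocab()
first = tiny.update('I like cats and dogs'.split())
second = tiny.update('I like nothing'.split())
third = tiny.words2indices('I like cats and nothing else'.split())
print(third)
print(tiny.indices2words(third))
```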



A common operation on vocabulary objects is to replace rare words with UNKNOWN tokens. We'll convert words that occurred fewer than some number of times.

In [14]:
vocab.counts

Counter({'***UNK***': 0,
         'I': 6,
         'and': 3,
         'cats': 3,
         'dogs': 3,
         'like': 6,
         'nothing': 3})

In [15]:
# this is actually a copy operation, because indices change when words are removed from the vocabulary
vocab = vocab.prune_rares(cutoff=6)
for s in sents:
    inds = vocab.words2indices(s.split())
    print vocab.indices2words(inds)

['I', 'like', '***UNK***', '***UNK***', '***UNK***']
['I', 'like', '***UNK***']
['I', 'like', '***UNK***', '***UNK***', '***UNK***', '***UNK***']
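As a rough sketch of what pruning amounts to, the snippet below rebuilds a word-to-index map from a `Counter`, keeping only words seen at least `cutoff` times. The `prune_rares` function here is a standalone illustration and an assumption about the semantics, chosen to agree with the output above, where words with a count equal to the cutoff survive. Its counts come from a single pass over the first two sentences, so they are smaller than the ones printed above.

```python
from collections import Counter


def prune_rares(counts, cutoff, unk='***UNK***'):
    """Return a fresh word -> index map for words seen at least `cutoff` times."""
    word2index = {unk: 0}
    # Sort for a deterministic index assignment; indices are reassigned from
    # scratch, which is why pruning yields a new mapping rather than an edit.
    for word in sorted(counts):
        if word != unk and counts[word] >= cutoff:
            word2index[word] = len(word2index)
    return word2index


toy_sents = ['I like cats and dogs', 'I like nothing', 'I like cats and nothing else']
toy_counts = Counter(w for s in toy_sents[:2] for w in s.split())
pruned = prune_rares(toy_counts, cutoff=2)

replaced = [[w if w in pruned else '***UNK***' for w in s.split()] for s in toy_sents]
for words in replaced:
    print(words)
```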


Now, we'll convert the entire dataset. The `convert` function applies a transform to the specified field of the dataset. We'll apply a transform using the vocabulary.

In [16]:
from stanza.text.vocab import SennaVocab
vocab = SennaVocab()

# we'll actually just use the first 200 tokens of the document
max_len = 200
train = train.convert({'X': lambda x: x[:max_len]}, in_place=True)
test = test.convert({'X': lambda x: x[:max_len]}, in_place=True)
    
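# keep references to the truncated, not-yet-indexed datasets
train_orig = train
test_orig = test

# first pass: only here to populate the vocabulary and its counts
train = train_orig.convert({'X': vocab.update}, in_place=False)
# pruning returns a new vocabulary with new indices, so re-convert both splits
vocab = vocab.prune_rares(cutoff=3)
train = train_orig.convert({'X': vocab.words2indices}, in_place=False)
test = test_orig.convert({'X': vocab.words2indices}, in_place=False)
# padding token, used later to fill batches out to a fixed length
pad_index = vocab.add('***PAD***', count=100)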

max_len = max([len(x) for x in train.fields['X'] + test.fields['X']])

print 'train: {}, test: {}'.format(len(train), len(test))
print 'vocab size: {}'.format(vocab)
print 'sequence max len: {}'.format(max_len)
print
print test[:2]

train: 69, test: 30
vocab size: Vocab(668 words)
sequence max len: 200

OrderedDict([('Y', [0, 1]), ('X', [[1, 2, 339, 4, 340, 341, 342, 7, 8, 2, 9, 2, 0, 0, 0, 80, 2, 346, 347, 12, 348, 39, 349, 14, 2, 0, 350, 2, 0, 320, 0, 0, 16, 2, 182, 69, 70, 41, 481, 96, 136, 578, 153, 39, 58, 126, 138, 355, 182, 0, 43, 38, 0, 124, 39, 111, 69, 0, 67, 37, 77, 49, 69, 70, 257, 269, 257, 12, 31, 182, 0, 138, 89, 151, 73, 0, 39, 28, 58, 37, 399, 596, 90, 43, 49, 17, 0, 194, 0, 0, 0, 0, 195, 37, 73, 0, 131, 31, 0, 39, 28, 31, 0, 17, 48, 73, 0, 12, 31, 0, 39, 0, 48, 0, 17, 0, 283, 33, 0, 245, 111, 0, 49, 479, 39, 51, 481, 0, 31, 0, 26, 226, 149, 147, 0, 12, 31, 0, 26, 17, 238, 0, 39, 69, 52, 425, 67, 133, 61, 0, 66, 0, 49, 17, 0, 39, 142, 323, 0, 0, 131, 95, 455, 39, 159, 67, 37, 231, 142, 17, 481, 149, 52, 0, 33, 590, 49, 104, 37, 0, 367, 148, 26, 10, 0, 0, 26, 0, 75, 89, 0, 0, 26, 107, 0, 0, 26, 10, 542, 217], [1, 2, 0, 4, 0, 486, 7, 8, 2, 0, 2, 124, 85, 61, 73, 639, 26, 4, 105, 9, 2, 409, 0, 0, 0, 
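The conversion step used above boils down to mapping a function over one field while leaving the others alone. Here is a minimal sketch under that assumption; `convert_fields` and the toy `word2index` map are hypothetical names for illustration and are not part of Stanza.

```python
def convert_fields(fields, transforms):
    """Apply each transform to every example of its field; return a new dict."""
    converted = dict(fields)  # shallow copy: untouched fields are shared
    for name, transform in transforms.items():
        converted[name] = [transform(example) for example in fields[name]]
    return converted


# Toy vocabulary and dataset, made up for this example.
word2index = {'***UNK***': 0, 'I': 1, 'like': 2, 'cats': 3}
toy_fields = {'X': [['I', 'like', 'cats', 'and', 'dogs'], ['I', 'like']],
              'Y': [0, 1]}

# Step 1: truncate each document, here to 3 tokens instead of 200.
truncated = convert_fields(toy_fields, {'X': lambda x: x[:3]})
# Step 2: map words to indices, sending unseen words to index 0.
indexed = convert_fields(
    truncated, {'X': lambda x: [word2index.get(w, 0) for w in x]})
print(indexed['X'])
```

Because the sketch returns a new dictionary each time, it corresponds to calling `convert` with `in_place=False`.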

## Training a model

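At this point, you're welcome to use whatever program, model, or package you like to run your experiments. We'll go with TensorFlow. In particular, we'll define an LSTM classifier. Note that the code below uses an early TensorFlow API (for example `tensorflow.models.rnn` and `tf.initialize_all_variables`), so later releases will need adjustments.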

### Model definition

We'll define a lookup table, an LSTM, and a linear classifier.

In [17]:
import tensorflow as tf    
from tensorflow.models.rnn import rnn    
from tensorflow.models.rnn.rnn_cell import LSTMCell
from stanza.ml.tensorflow_utils import labels_to_onehots
import numpy as np

np.random.seed(42)      
embedding_size = 50
hidden_size = 100
seq_len = max_len
vocab_size = len(vocab)
class_size = len(classes)

# symbolic variable for word indices
indices = tf.placeholder(tf.int32, [None, seq_len])
# symbolic variable for labels
labels = tf.placeholder(tf.float32, [None, class_size])

# lookup table
with tf.device('/cpu:0'), tf.name_scope("embedding"):
    E = tf.Variable(
        tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
        name="emb")
    embeddings = tf.nn.embedding_lookup(E, indices)
    embeddings_list = [tf.squeeze(t, [1]) for t in tf.split(1, seq_len, embeddings)]

# rnn
cell = LSTMCell(hidden_size, embedding_size)  
outputs, states = rnn.rnn(cell, embeddings_list, dtype=tf.float32)
final_output = outputs[-1]

# classifier
def weights(shape):
    return tf.Variable(tf.random_normal(shape, stddev=0.01))
scores = tf.matmul(final_output, weights((hidden_size, class_size)))

# objective
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(scores, labels))

# operations
train_op = tf.train.AdamOptimizer(0.001, 0.9).minimize(cost)
predict_op = tf.argmax(scores, 1)
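To see what the lookup table, classifier and objective compute, here is a small NumPy sketch of the same forward pass on toy shapes. It is an illustration under stated assumptions, not a port of the graph above: the LSTM is replaced by a plain mean over time steps (so the classifier consumes `embedding_size` features instead of `hidden_size`), and all sizes and inputs are made up.

```python
import numpy as np

rng = np.random.RandomState(42)
toy_vocab_size, toy_embedding_size, toy_class_size = 10, 4, 2

# Lookup table: one row per word index, as with tf.nn.embedding_lookup.
E_toy = rng.uniform(-1.0, 1.0, size=(toy_vocab_size, toy_embedding_size))
toy_indices = np.array([[1, 2, 3], [4, 0, 0]])  # batch of 2, seq_len of 3
toy_embeddings = E_toy[toy_indices]             # shape (2, 3, 4)

# Stand-in for the LSTM (an assumption): average the embeddings over time.
pooled = toy_embeddings.mean(axis=1)            # shape (2, 4)

# Linear classifier with small random weights, as in weights() above.
W_toy = rng.normal(scale=0.01, size=(toy_embedding_size, toy_class_size))
toy_scores = pooled.dot(W_toy)                  # shape (2, 2)


def softmax_cross_entropy(scores, onehots):
    """Per-example cross entropy between softmax(scores) and one-hot labels."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(onehots * log_probs).sum(axis=1)


toy_onehots = np.eye(toy_class_size)[[0, 1]]
toy_cost = softmax_cross_entropy(toy_scores, toy_onehots).mean()
toy_predictions = toy_scores.argmax(axis=1)
print(toy_embeddings.shape, toy_scores.shape)
```

With weights this small the scores are close to zero, so for two classes the cost starts out near ln 2 ≈ 0.693.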

### Training

We'll train the network for a fixed number of epochs and then evaluate on the test set. This is a deliberately simple procedure, with no hyperparameter tuning, regularization, or early stopping.

In [18]:
from sklearn.metrics import accuracy_score
from time import time
batch_size = 128
num_epochs = 10

def run_epoch(split, train=False):
    epoch_cost = 0
    epoch_pred = []
    for i in xrange(0, len(split), batch_size):
        batch = split[i: i+batch_size]
        n = len(batch['Y'])
        X = Dataset.pad(batch['X'], pad_index, seq_len)
        Y = np.zeros((n, class_size))
        Y[np.arange(n), np.array(batch['Y'])] = 1
        if train:
            batch_cost, batch_pred, _ = session.run(
                [cost, predict_op, train_op], {indices: X, labels: Y})
        else:
            batch_cost, batch_pred = session.run(
                [cost, predict_op], {indices: X, labels: Y})
        epoch_cost += batch_cost * n
        epoch_pred += batch_pred.flatten().tolist()
    return epoch_cost, epoch_pred

def train_eval(session):
    for epoch in xrange(num_epochs):
        start = time()
        print 'epoch: {}'.format(epoch)
        epoch_cost, epoch_pred = run_epoch(train, True)
        print 'train cost: {}, acc: {}'.format(epoch_cost/len(train),
                                               accuracy_score(train.fields['Y'], epoch_pred))
        print 'time elapsed: {}'.format(time() - start)
    
    test_cost, test_pred = run_epoch(test, False)
    print '-' * 20
    print 'test cost: {}, acc: {}'.format(test_cost/len(test),
                                          accuracy_score(test.fields['Y'], test_pred))

with tf.Session() as session:
    tf.set_random_seed(123)
    session.run(tf.initialize_all_variables())
    train_eval(session)

epoch: 0
train cost: 0.693798243999, acc: 0.463768115942
time elapsed: 2.13884997368
epoch: 1
train cost: 0.689658164978, acc: 0.608695652174
time elapsed: 0.947463989258
epoch: 2
train cost: 0.685618042946, acc: 0.608695652174
time elapsed: 0.931604862213
epoch: 3
train cost: 0.681350648403, acc: 0.594202898551
time elapsed: 0.989146947861
epoch: 4
train cost: 0.676672458649, acc: 0.608695652174
time elapsed: 0.967782974243
epoch: 5
train cost: 0.6715965271, acc: 0.608695652174
time elapsed: 0.938482046127
epoch: 6
train cost: 0.666440963745, acc: 0.594202898551
time elapsed: 1.01319694519
epoch: 7
train cost: 0.661608576775, acc: 0.594202898551
time elapsed: 0.951257944107
epoch: 8
train cost: 0.656547665596, acc: 0.594202898551
time elapsed: 0.969254016876
epoch: 9
train cost: 0.64949887991, acc: 0.594202898551
time elapsed: 0.979185819626
--------------------
test cost: 0.700526297092, acc: 0.566666666667
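Inside `run_epoch`, each batch is padded to a fixed length and its labels are turned into one-hot rows before being fed to the graph. The sketch below shows one plausible way to do both with NumPy. `pad_sequences` and `to_onehots` are hypothetical helpers; in particular, padding on the right is an assumption, and `Dataset.pad` may behave differently.

```python
import numpy as np


def pad_sequences(seqs, pad_index, seq_len):
    """Right-pad (or truncate) each sequence to exactly seq_len entries."""
    out = np.full((len(seqs), seq_len), pad_index, dtype=np.int32)
    for row, seq in enumerate(seqs):
        trimmed = seq[:seq_len]
        out[row, :len(trimmed)] = trimmed
    return out


def to_onehots(label_ids, num_classes):
    """Turn integer class labels into rows with a single 1."""
    onehots = np.zeros((len(label_ids), num_classes))
    onehots[np.arange(len(label_ids)), np.array(label_ids)] = 1
    return onehots


X_toy = pad_sequences([[1, 2, 3], [4], [5, 6, 7, 8, 9]], pad_index=99, seq_len=4)
Y_toy = to_onehots([0, 1, 1], num_classes=2)
print(X_toy)
print(Y_toy)
```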


Remember how we used `SennaVocab`? Let's see what happens if we initialize our embeddings from the vocabulary's pretrained vectors instead of random values:

In [19]:
preinit_op = E.assign(vocab.get_embeddings())
with tf.Session() as session:
    tf.set_random_seed(123)
    session.run(tf.initialize_all_variables())
    session.run(preinit_op)
    train_eval(session)

epoch: 0
train cost: 0.691315352917, acc: 0.536231884058
time elapsed: 2.2313709259
epoch: 1
train cost: 0.685225009918, acc: 0.565217391304
time elapsed: 0.93435382843
epoch: 2
train cost: 0.679915368557, acc: 0.594202898551
time elapsed: 0.934975862503
epoch: 3
train cost: 0.675073981285, acc: 0.594202898551
time elapsed: 0.968421220779
epoch: 4
train cost: 0.670495569706, acc: 0.594202898551
time elapsed: 0.991052865982
epoch: 5
train cost: 0.666101515293, acc: 0.594202898551
time elapsed: 0.95667886734
epoch: 6
train cost: 0.661866605282, acc: 0.594202898551
time elapsed: 0.931576013565
epoch: 7
train cost: 0.657688558102, acc: 0.594202898551
time elapsed: 0.932205915451
epoch: 8
train cost: 0.653176009655, acc: 0.594202898551
time elapsed: 1.01794791222
epoch: 9
train cost: 0.647635579109, acc: 0.594202898551
time elapsed: 0.996000051498
--------------------
test cost: 0.698161303997, acc: 0.566666666667
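For `E.assign` to work, the assigned array has to have the same shape as `E`, that is one row per vocabulary entry. The following sketch shows one way such a matrix could be assembled. It is not how `get_embeddings` is implemented: `build_embedding_matrix`, the toy `pretrained` dictionary, and the choice to leave words without a pretrained vector at random values are all assumptions made for illustration.

```python
import numpy as np


def build_embedding_matrix(index2word, pretrained, embedding_size, seed=0):
    """Stack pretrained vectors by index; unknown words keep random rows."""
    rng = np.random.RandomState(seed)
    matrix = rng.uniform(-1.0, 1.0, size=(len(index2word), embedding_size))
    for index, word in enumerate(index2word):
        vector = pretrained.get(word)
        if vector is not None:
            matrix[index] = vector
    return matrix


# Toy vocabulary and toy "pretrained" vectors, both made up for this example.
toy_index2word = ['***UNK***', 'cats', 'dogs', 'zyzzyva']
toy_pretrained = {'cats': [0.1, 0.2, 0.3], 'dogs': [0.4, 0.5, 0.6]}
E_init = build_embedding_matrix(toy_index2word, toy_pretrained, embedding_size=3)
print(E_init.shape)
```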
