# Text Classification

Author: Victor Zhong

We are going to tackle a relatively straightforward text classification problem with Stanza and TensorFlow.

## Dataset

First, we'll grab the 20 Newsgroups data, which `sklearn` conveniently downloads for us.

In [1]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

Unsurprisingly, the 20 Newsgroups data contains newsgroup text from 20 topics. The topics are as follows:

In [2]:
classes = list(newsgroups_train.target_names)
classes

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

We'll limit ourselves to two classes for the sake of simplicity:

In [3]:
classes = ['alt.atheism', 'soc.religion.christian']
newsgroups_train = fetch_20newsgroups(subset='train', categories=classes)
from collections import Counter
Counter([classes[t] for t in newsgroups_train.target])

Counter({'alt.atheism': 480, 'soc.religion.christian': 599})

Here's an example from the dataset:

In [4]:
newsgroups_train.data[0]

u'From: nigel.allen@canrem.com (Nigel Allen)\nSubject: library of congress to host dead sea scroll symposium april 21-22\nLines: 96\n\n\n Library of Congress to Host Dead Sea Scroll Symposium April 21-22\n To: National and Assignment desks, Daybook Editor\n Contact: John Sullivan, 202-707-9216, or Lucy Suddreth, 202-707-9191\n          both of the Library of Congress\n\n   WASHINGTON, April 19  -- A symposium on the Dead Sea \nScrolls will be held at the Library of Congress on Wednesday,\nApril 21, and Thursday, April 22.  The two-day program, cosponsored\nby the library and Baltimore Hebrew University, with additional\nsupport from the Project Judaica Foundation, will be held in the\nlibrary\'s Mumford Room, sixth floor, Madison Building.\n   Seating is limited, and admission to any session of the symposium\nmust be requested in writing (see Note A).\n   The symposium will be held one week before the public opening of a\nmajor exhibition, "Scrolls from the Dead Sea: The Ancient Librar

In [5]:
newsgroups_train.target[0]

1

Notice that the target is already converted into a class index. Namely, in this case the text belongs to the class:

In [6]:
classes[newsgroups_train.target[0]]

'soc.religion.christian'

## Annotating using CoreNLP

If you do not have CoreNLP, download it from here:

http://stanfordnlp.github.io/CoreNLP/index.html#download

We are going to use the Java server feature of CoreNLP to annotate data in Python. In the CoreNLP directory, run the server:

```bash
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
```

Next, we'll annotate an example to see how the server works.

In [7]:
from stanza.corenlp.client import Client

client = Client()
annotation = client.annotate(newsgroups_train.data[0], properties={'annotators': 'tokenize,ssplit,pos'})
annotation['sentences'][0]

{u'index': 0,
 u'parse': u'SENTENCE_SKIPPED_OR_UNPARSABLE',
 u'tokens': [{u'after': u'',
   u'before': u'',
   u'characterOffsetBegin': 0,
   u'characterOffsetEnd': 4,
   u'index': 1,
   u'originalText': u'From',
   u'pos': u'IN',
   u'word': u'From'},
  {u'after': u' ',
   u'before': u'',
   u'characterOffsetBegin': 4,
   u'characterOffsetEnd': 5,
   u'index': 2,
   u'originalText': u':',
   u'pos': u':',
   u'word': u':'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 6,
   u'characterOffsetEnd': 28,
   u'index': 3,
   u'originalText': u'nigel.allen@canrem.com',
   u'pos': u'NNP',
   u'word': u'nigel.allen@canrem.com'},
  {u'after': u'',
   u'before': u' ',
   u'characterOffsetBegin': 29,
   u'characterOffsetEnd': 30,
   u'index': 4,
   u'originalText': u'(',
   u'pos': u'-LRB-',
   u'word': u'-LRB-'},
  {u'after': u' ',
   u'before': u'',
   u'characterOffsetBegin': 30,
   u'characterOffsetEnd': 35,
   u'index': 5,
   u'originalText': u'Nigel',
   u'pos': u'NNP'

That was rather long, but the gist is that the annotation is organized into sentences, which are in turn organized into tokens. Each token carries a number of annotations (we've only asked for the POS tags).

In [8]:
for token in annotation['sentences'][0]['tokens']:
    print token['word'], token['pos']

From IN
: :
nigel.allen@canrem.com NNP
-LRB- -LRB-
Nigel NNP
Allen NNP
-RRB- -RRB-
Subject NNP
: :
library NN
of IN
congress NN
to TO
host NN
dead JJ
sea NN
scroll NN
symposium NN
april NNP
21-22 CD
Lines NNPS
: :
96 CD
Library NNP
of IN
Congress NNP
to TO
Host NNP
Dead NNP
Sea NNP
Scroll NNP
Symposium NNP
April NNP
21-22 CD
To TO
: :
National NNP
and CC
Assignment NNP
desks NNS
, ,
Daybook NNP
Editor NNP
Contact NN
: :
John NNP
Sullivan NNP
, ,
202-707-9216 CD
, ,
or CC
Lucy NNP
Suddreth NNP
, ,
202-707-9191 CD
both DT
of IN
the DT
Library NNP
of IN
Congress NNP
WASHINGTON NNP
, ,
April NNP
19 CD
-- :
A DT
symposium NN
on IN
the DT
Dead NNP
Sea NNP
Scrolls NNP
will MD
be VB
held VBN
at IN
the DT
Library NNP
of IN
Congress NNP
on IN
Wednesday NNP
, ,
April NNP
21 CD
, ,
and CC
Thursday NNP
, ,
April NNP
22 CD
. .


For our purposes, we'll just treat the document as one long sequence of words rather than a sequence of sequences (e.g. a list of sentences of words). We'll do this by passing in the `ssplit.isOneSentence` flag.

In [9]:
docs = []
labels = []
for doc, label in zip(newsgroups_train.data, newsgroups_train.target):
    try:
        annotation = client.annotate(doc, properties={'annotators': 'tokenize,ssplit', 'ssplit.isOneSentence': True})
        docs.append([t['word'] for t in annotation['sentences'][0]['tokens']])
        labels.append(label)
    except Exception:
        pass  # we're going to punt and skip documents that fail to annotate (e.g. unicode errors)
print len(docs), len(labels)

1074 1074
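
If you would rather keep sentence splitting on, the same flat token list can be built on the Python side. The snippet below is only a sketch: `flatten_tokens` is a hypothetical helper, and `fake_annotation` is a hand-written dictionary that mimics the `sentences`/`tokens` layout shown earlier rather than real server output.

```python
def flatten_tokens(annotation):
    """Concatenate the words of every sentence into one flat list."""
    return [token['word']
            for sentence in annotation['sentences']
            for token in sentence['tokens']]

# Hand-written stand-in for a server response (not real CoreNLP output).
fake_annotation = {
    'sentences': [
        {'index': 0, 'tokens': [{'word': 'Hello'}, {'word': '.'}]},
        {'index': 1, 'tokens': [{'word': 'Bye'}, {'word': '.'}]},
    ]
}

flat = flatten_tokens(fake_annotation)
```

Here `flat` holds the four words in document order, which is the same shape of data that each entry of `docs` has above.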


We'll create a lightweight dataset object out of this. A `Dataset` is really a glorified dictionary of fields, where each field corresponds to an attribute of the examples in the dataset.

In [10]:
from stanza.text.dataset import Dataset
from pprint import pprint
dataset = Dataset({'X': docs, 'Y': labels})

# dataset supports, amongst other functionalities, shuffling:
print dataset.shuffle()

# indexing of a single element
pprint(dataset[0])

# indexing of multiple elements
pprint(dataset[:2])

n_train = int(0.7 * len(dataset))
train = Dataset(dataset[:n_train])
test = Dataset(dataset[n_train:])

print 'train: {}, test: {}'.format(len(train), len(test))

Dataset(Y, X)
OrderedDict([('Y', 1), ('X', [u'From', u':', u'seanna@bnr.ca', u'-LRB-', u'Seanna', u'-LRB-', u'S.M.', u'-RRB-', u'Watson', u'-RRB-', u'Subject', u':', u'Re', u':', u'``', u'Accepting', u'Jeesus', u'in', u'your', u'heart', u'...', u"''", u'Organization', u':', u'Bell-Northern', u'Research', u',', u'Ottawa', u',', u'Canada', u'Lines', u':', u'38', u'-LCB-', u'Dan', u'Johnson', u'asked', u'for', u'evidence', u'that', u'the', u'most', u'effective', u'abuse', u'recovery', u'programs', u'involve', u'meeting', u'people', u"'s", u'spiritual', u'needs', u'.', u'I', u'responded', u':', u'In', u'12-step', u'programs', u'-LRB-', u'like', u'Alcoholics', u'Anonymous', u'-RRB-', u',', u'one', u'of', u'the', u'steps', u'involves', u'acknowleding', u'a', u'``', u'higher', u'power', u"''", u'.', u'AA', u'and', u'other', u'12-step', u'abuse', u'-', u'recovery', u'programs', u'are', u'acknowledged', u'as', u'being', u'among', u'the', u'most', u'effective', u'.', u'-RCB-', u'Dan', u'Johnson'
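
To make the "glorified dictionary of fields" idea concrete, here is a rough standalone sketch. `TinyDataset` is a hypothetical class written for illustration and is not Stanza's `Dataset`; it only mimics the integer and slice indexing shown above.

```python
from collections import OrderedDict

class TinyDataset(object):
    """Hypothetical stand-in for a field-dictionary dataset (not Stanza's class)."""

    def __init__(self, fields):
        self.fields = OrderedDict(fields)
        lengths = set(len(values) for values in self.fields.values())
        assert len(lengths) == 1, 'all fields need the same number of examples'

    def __len__(self):
        # Every field has the same length, so any one of them will do.
        return len(next(iter(self.fields.values())))

    def __getitem__(self, item):
        # An int yields one example; a slice yields a dict of sub-lists.
        return OrderedDict((name, values[item])
                           for name, values in self.fields.items())

tiny = TinyDataset([('X', [['a', 'b'], ['c'], ['d', 'e', 'f']]),
                    ('Y', [0, 1, 0])])
first = tiny[0]   # one example: X is a token list, Y is a single label
head = tiny[:2]   # two examples: X and Y are both lists
```

Because a slice returns a plain dictionary of lists, it can be fed straight back into the constructor, which is the pattern used for the train/test split above.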

## Creating vocabulary and mapping to vector space

Stanza provides means to convert words to vocabularies (e.g. map to indices and back). We also provide convenient means of loading pretrained embeddings such as `Senna` and `Glove`.

In [11]:
from stanza.text.vocab import Vocab
vocab = Vocab('***UNK***')
vocab

OrderedDict([('***UNK***', 0)])

We'll try our hand at some conversions:

In [12]:
sents = ['I like cats and dogs', 'I like nothing', 'I like cats and nothing else']
inds = []
for s in sents[:2]:
    inds.append(vocab.update(s.split()))
inds.append(vocab.words2indices(sents[2].split()))

for s, ind in zip(sents, inds):
    print 'read {}, which got mapped to indices {}\nrecovered:{}'.format(s, ind, vocab.indices2words(ind))

read I like cats and dogs, which got mapped to indices [1, 2, 3, 4, 5]
recovered:['I', 'like', 'cats', 'and', 'dogs']
read I like nothing, which got mapped to indices [1, 2, 6]
recovered:['I', 'like', 'nothing']
read I like cats and nothing else, which got mapped to indices [1, 2, 3, 4, 6, 0]
recovered:['I', 'like', 'cats', 'and', 'nothing', '***UNK***']
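
The behaviour above (`update` grows the vocabulary, `words2indices` only looks words up) can be sketched in a few lines. `TinyVocab` is a hypothetical class for illustration, not Stanza's `Vocab`; it assumes the unknown token always sits at index 0, as in the output above.

```python
class TinyVocab(object):
    """Hypothetical minimal vocabulary (not Stanza's implementation)."""

    def __init__(self, unk='***UNK***'):
        self.word2index = {unk: 0}
        self.index2word = [unk]

    def update(self, words):
        """Add unseen words, then return the indices of all words."""
        for word in words:
            if word not in self.word2index:
                self.word2index[word] = len(self.index2word)
                self.index2word.append(word)
        return self.words2indices(words)

    def words2indices(self, words):
        # Words that were never added fall back to the unknown index 0.
        return [self.word2index.get(word, 0) for word in words]

    def indices2words(self, indices):
        return [self.index2word[i] for i in indices]

tiny_vocab = TinyVocab()
seen_a = tiny_vocab.update('I like cats and dogs'.split())
seen_b = tiny_vocab.update('I like nothing'.split())
unseen = tiny_vocab.words2indices('I like cats and nothing else'.split())
```

The word `else` was never passed to `update`, so it maps to index 0 and comes back as the unknown token.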


A common operation on vocabulary objects is to replace rare words with UNKNOWN tokens. We'll convert words that occurred fewer than 2 times.

In [13]:
# this is actually a copy operation, because indices change when words are removed from the vocabulary
vocab = vocab.prune_rares(cutoff=2)
for s in sents:
    inds = vocab.words2indices(s.split())
    print vocab.indices2words(inds)

['I', 'like', '***UNK***', '***UNK***', '***UNK***']
['I', 'like', '***UNK***']
['I', 'like', '***UNK***', '***UNK***', '***UNK***', '***UNK***']
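
Pruning needs word counts, and it has to rebuild the index map because removing words leaves gaps. The sketch below assumes counts are kept in a `collections.Counter`; `prune_rares` here is a hypothetical free function, not the Stanza method, and it sorts words only to make the new indices reproducible.

```python
from collections import Counter

def prune_rares(counts, cutoff, unk='***UNK***'):
    """Return a fresh word->index map keeping words seen at least `cutoff` times."""
    word2index = {unk: 0}
    for word in sorted(counts):
        if word != unk and counts[word] >= cutoff:
            word2index[word] = len(word2index)
    return word2index

counts = Counter()
for sentence in ['I like cats and dogs', 'I like nothing']:
    counts.update(sentence.split())

pruned = prune_rares(counts, cutoff=2)
encoded = [pruned.get(word, 0) for word in 'I like cats and nothing else'.split()]
```

Only `I` and `like` occur twice, so every other word falls back to index 0, matching the output above.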


Now, we'll convert the entire dataset. The `convert` function applies a transform to the specified field of the dataset. We'll apply a transform using the vocabulary.

In [14]:
from stanza.text.vocab import SennaVocab
vocab = SennaVocab()

# we'll actually just use the first 200 tokens of the document
max_len = 200
train = train.convert({'X': lambda x: x[:max_len]}, in_place=True)
test = test.convert({'X': lambda x: x[:max_len]}, in_place=True)
    
# make a backup
train_orig = train
test_orig = test

train = train_orig.convert({'X': vocab.update}, in_place=False)
vocab = vocab.prune_rares(cutoff=3)
train = train_orig.convert({'X': vocab.words2indices}, in_place=False)
test = test_orig.convert({'X': vocab.words2indices}, in_place=False)
pad_index = vocab.add('***PAD***', count=100)

max_len = max([len(x) for x in train.fields['X'] + test.fields['X']])

print 'train: {}, test: {}'.format(len(train), len(test))
print 'vocab size: {}'.format(vocab)
print 'sequence max len: {}'.format(max_len)
print
print test[:2]

train: 751, test: 323
vocab size: Vocab(4217 words)
sequence max len: 200

OrderedDict([('Y', [0, 1]), ('X', [[1, 2, 824, 4, 825, 826, 7, 9, 2, 10, 2, 757, 129, 828, 295, 10, 2, 2585, 2586, 302, 19, 2, 831, 252, 832, 240, 25, 2, 1622, 48, 122, 0, 3627, 4, 3628, 3629, 7, 121, 2, 71, 825, 826, 4, 824, 7, 480, 2, 124, 806, 289, 808, 373, 14, 283, 202, 34, 1009, 763, 192, 14, 15, 2096, 3960, 124, 11, 34, 0, 0, 54, 391, 0, 18, 275, 158, 1009, 763, 192, 14, 256, 580, 3960, 124, 389, 232, 0, 46, 71, 528, 187, 2244, 86, 2400, 58, 1725, 0, 1311, 0, 3486, 156, 71, 0, 46, 1, 283, 47, 430, 22, 1010, 4163, 2674, 75, 931, 86, 71, 0, 76, 3935, 46, 2012, 1747, 54, 0, 62, 283, 208, 289, 1290, 331, 275, 825], [1, 2, 0, 9, 2, 236, 14, 34, 0, 0, 19, 2, 608, 25, 2, 2097, 0, 62, 0, 222, 47, 249, 50, 86, 615, 14, 4071, 250, 42, 330, 4, 58, 7, 108, 1319, 236, 4, 289, 2324, 105, 7, 22, 62, 4, 0, 7, 65, 14, 34, 0, 0, 1247, 3553, 31, 1063, 2, 64, 15, 335, 0, 0, 64, 15, 210, 54, 34, 1564, 145, 146, 62, 15, 0, 64,
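
Conceptually, `convert` maps a function over every example of one field. Here is a rough sketch of that idea on a plain dictionary of lists; `convert_fields` and `toy_vocab` are hypothetical names, and this is not Stanza's implementation. It follows the same truncate-then-index order used above.

```python
def convert_fields(fields, transforms):
    """Return a new dict in which each named field has its transform applied per example."""
    converted = dict(fields)
    for name, fn in transforms.items():
        converted[name] = [fn(example) for example in fields[name]]
    return converted

toy = {'X': [['a', 'b', 'c'], ['b']], 'Y': [0, 1]}
toy_vocab = {'***UNK***': 0, 'a': 1, 'b': 2}

# Step 1: keep only the first two tokens of every document.
truncated = convert_fields(toy, {'X': lambda x: x[:2]})
# Step 2: map the remaining tokens to indices, using 0 for unknown words.
indexed = convert_fields(truncated, {'X': lambda x: [toy_vocab.get(w, 0) for w in x]})
```

Returning a new dictionary corresponds to `in_place=False`: the original `toy` fields stay untouched, which is why the notebook can convert `train_orig` more than once.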

## Training a model

At this point, you're welcome to use whatever program/model/package you like to run your experiments. We'll try our hand at TensorFlow. In particular, we'll define an LSTM classifier.

### Model definition

We'll define a lookup table, an LSTM, and a linear classifier.

In [15]:
import tensorflow as tf    
from tensorflow.models.rnn import rnn    
from tensorflow.models.rnn.rnn_cell import LSTMCell
from stanza.ml.tensorflow_utils import labels_to_onehots
import numpy as np

np.random.seed(42)      
embedding_size = 50
hidden_size = 100
seq_len = max_len
vocab_size = len(vocab)
class_size = len(classes)

# symbolic variable for word indices
indices = tf.placeholder(tf.int32, [None, seq_len])
# symbolic variable for labels
labels = tf.placeholder(tf.float32, [None, class_size])

In [16]:
# lookup table
with tf.device('/cpu:0'), tf.name_scope("embedding"):
    E = tf.Variable(
        tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
        name="emb")
    embeddings = tf.nn.embedding_lookup(E, indices)
    embeddings_list = [tf.squeeze(t, [1]) for t in tf.split(1, seq_len, embeddings)]

In [17]:
# rnn
cell = LSTMCell(hidden_size, embedding_size)  
outputs, states = rnn.rnn(cell, embeddings_list, dtype=tf.float32)
final_output = outputs[-1]

In [18]:
# classifier
def weights(shape):
    return tf.Variable(tf.random_normal(shape, stddev=0.01))
scores = tf.matmul(final_output, weights((hidden_size, class_size)))
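
To see what the three pieces compute, here is a NumPy-only sketch of the forward pass: embedding lookup, a recurrent loop, and a linear layer on the last hidden state. It assumes a plain LSTM cell (no peepholes or projection) with an arbitrary gate ordering, so it will not reproduce the exact numbers of TensorFlow's `LSTMCell`; the toy sizes and the function name are made up for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_classifier_forward(indices, E, W, b, W_out):
    """indices: (batch, seq_len) ints; returns (batch, num_classes) scores."""
    batch = indices.shape[0]
    hidden = W_out.shape[0]
    h = np.zeros((batch, hidden))  # hidden state
    c = np.zeros((batch, hidden))  # cell state
    for t in range(indices.shape[1]):
        x = E[indices[:, t]]                               # lookup: (batch, emb)
        gates = np.concatenate([x, h], axis=1).dot(W) + b  # (batch, 4 * hidden)
        i, f, o, g = np.split(gates, 4, axis=1)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h.dot(W_out)  # classify from the final hidden state

rng = np.random.RandomState(42)
vocab_size, embedding_size, hidden_size, class_size = 10, 4, 5, 2
E = rng.uniform(-1.0, 1.0, (vocab_size, embedding_size))
W = rng.normal(0.0, 0.1, (embedding_size + hidden_size, 4 * hidden_size))
b = np.zeros(4 * hidden_size)
W_out = rng.normal(0.0, 0.01, (hidden_size, class_size))

toy_indices = rng.randint(0, vocab_size, size=(3, 6))
toy_scores = lstm_classifier_forward(toy_indices, E, W, b, W_out)
toy_predictions = toy_scores.argmax(axis=1)
```

Only the last hidden state feeds the classifier, which mirrors `final_output = outputs[-1]` above.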

In [19]:
# objective
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(scores, labels))
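
For reference, the objective is the mean over the batch of the softmax cross-entropy between the scores and the one-hot labels. A NumPy sketch of the per-example loss (the function name is ours; the max is subtracted only for numerical stability):

```python
import numpy as np

def softmax_cross_entropy(scores, onehot_labels):
    """Per-example cross-entropy between softmax(scores) and one-hot labels."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(onehot_labels * log_probs).sum(axis=1)

uninformed_scores = np.zeros((4, 2))  # a model with no preference yet
onehot = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
uninformed_cost = softmax_cross_entropy(uninformed_scores, onehot).mean()
```

With two classes and equal scores the cost is ln 2, roughly 0.693, which is a useful sanity check against the first-epoch training cost reported below.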

We'll optimize the network with Adam:

In [20]:
# operations
train_op = tf.train.AdamOptimizer(0.001, 0.9).minimize(cost)
predict_op = tf.argmax(scores, 1)
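
The two positional arguments to `AdamOptimizer` are the learning rate and `beta1`. As a rough illustration of what one update does to a single parameter, here is a NumPy sketch of the standard Adam step; `adam_step` is a hypothetical helper, and `beta2=0.999` and `eps=1e-8` are assumed to match TensorFlow's defaults.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `t` is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Minimize f(x) = x^2 starting from x = 1; the gradient is 2x.
x, m, v = 1.0, 0.0, 0.0
x, m, v = adam_step(x, 2.0 * x, m, v, t=1)
```

After bias correction the very first step has magnitude of about `lr` whatever the gradient scale, so `x` moves from 1.0 to roughly 0.999 here.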

### Training

We'll train the network for a fixed number of epochs and then evaluate on the test set. This is a relatively simple procedure without tuning, regularization, or early stopping.

In [27]:
from sklearn.metrics import accuracy_score
from time import time
batch_size = 128
num_epochs = 10

def run_epoch(split, train=False):
    epoch_cost = 0
    epoch_pred = []
    for i in xrange(0, len(split), batch_size):
        batch = split[i: i+batch_size]
        n = len(batch['Y'])
        X = Dataset.pad(batch['X'], pad_index, seq_len)
        Y = np.zeros((n, class_size))
        Y[np.arange(n), np.array(batch['Y'])] = 1
        if train:
            batch_cost, batch_pred, _ = session.run([cost, predict_op, train_op], {indices: X, labels: Y})
        else:
            batch_cost, batch_pred = session.run([cost, predict_op], {indices: X, labels: Y})
        epoch_cost += batch_cost * n
        epoch_pred += batch_pred.flatten().tolist()
    return epoch_cost, epoch_pred

def train_eval(session):
    for epoch in xrange(num_epochs):
        start = time()
        print 'epoch: {}'.format(epoch)
        epoch_cost, epoch_pred = run_epoch(train, True)
        print 'train cost: {}, acc: {}'.format(epoch_cost/len(train), accuracy_score(train.fields['Y'], epoch_pred))
        print 'time elapsed: {}'.format(time() - start)
    
    test_cost, test_pred = run_epoch(test, False)
    print '-' * 20
    print 'test cost: {}, acc: {}'.format(test_cost/len(test), accuracy_score(test.fields['Y'], test_pred))

with tf.Session() as session:
    tf.set_random_seed(123)
    session.run(tf.initialize_all_variables())
    train_eval(session)

epoch: 0
train cost: 0.69150376931, acc: 0.533954727031
time elapsed: 11.6190190315
epoch: 1
train cost: 0.68147453257, acc: 0.587217043941
time elapsed: 9.51137089729
epoch: 2
train cost: 0.662719958632, acc: 0.589880159787
time elapsed: 9.50747179985
epoch: 3
train cost: 0.629683734098, acc: 0.688415446072
time elapsed: 9.8141450882
epoch: 4
train cost: 0.611709104159, acc: 0.709720372836
time elapsed: 9.49769997597
epoch: 5
train cost: 0.582100759651, acc: 0.684420772304
time elapsed: 10.240678072
epoch: 6
train cost: 0.570877154404, acc: 0.737683089214
time elapsed: 9.94308805466
epoch: 7
train cost: 0.564803322447, acc: 0.720372836218
time elapsed: 9.75270009041
epoch: 8
train cost: 0.542043169631, acc: 0.757656458056
time elapsed: 10.0136928558
epoch: 9
train cost: 0.490948782978, acc: 0.78828229028
time elapsed: 12.185503006
--------------------
test cost: 0.591299379758, acc: 0.693498452012
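
Inside `run_epoch`, each batch is padded to a fixed length and its labels are turned into one-hot rows. A NumPy-only sketch of those two steps follows. `pad_batch` and `one_hot` are hypothetical helpers, not Stanza's `Dataset.pad`; this sketch pads on the right, which is an assumption, since the source does not show which side Stanza pads.

```python
import numpy as np

def pad_batch(seqs, pad_index, seq_len):
    """Right-pad (or truncate) index sequences into a (batch, seq_len) int array."""
    out = np.full((len(seqs), seq_len), pad_index, dtype=np.int32)
    for row, seq in enumerate(seqs):
        trimmed = seq[:seq_len]
        out[row, :len(trimmed)] = trimmed
    return out

def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows."""
    out = np.zeros((len(labels), num_classes))
    out[np.arange(len(labels)), np.asarray(labels)] = 1
    return out

X_toy = pad_batch([[1, 2, 3], [4]], pad_index=9, seq_len=2)
Y_toy = one_hot([0, 1], num_classes=2)
```

Note also that `run_epoch` weights each batch cost by its size `n` before summing, so dividing by the dataset length gives a true per-example average even when the last batch is smaller.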


Remember how we used `SennaVocab`? Let's see what happens if we preinitialize our embeddings:

In [28]:
preinit_op = E.assign(vocab.get_embeddings())
with tf.Session() as session:
    tf.set_random_seed(123)
    session.run(tf.initialize_all_variables())
    session.run(preinit_op)
    train_eval(session)

epoch: 0
train cost: 0.688563662267, acc: 0.539280958722
time elapsed: 13.9362518787
epoch: 1
train cost: 0.674707842254, acc: 0.584553928096
time elapsed: 11.8684880733
epoch: 2
train cost: 0.663795230312, acc: 0.607190412783
time elapsed: 11.9107489586
epoch: 3
train cost: 0.641969507131, acc: 0.645805592543
time elapsed: 12.0843448639
epoch: 4
train cost: 0.619178395138, acc: 0.660452729694
time elapsed: 12.1125848293
epoch: 5
train cost: 0.591043990636, acc: 0.711051930759
time elapsed: 12.0441889763
epoch: 6
train cost: 0.568309741633, acc: 0.712383488682
time elapsed: 11.8249480724
epoch: 7
train cost: 0.52520389722, acc: 0.772303595206
time elapsed: 11.7157990932
epoch: 8
train cost: 0.501435582949, acc: 0.756324900133
time elapsed: 11.6853508949
epoch: 9
train cost: 0.439889284647, acc: 0.809587217044
time elapsed: 11.5280079842
--------------------
test cost: 0.598627277203, acc: 0.699690402477
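
Pre-initialization amounts to overwriting rows of the embedding matrix with pretrained vectors. The sketch below shows one way to build such a matrix from a word-to-vector dictionary. `build_embedding_matrix` and the toy vectors are hypothetical, and leaving missing words at their random initialization is an assumption; the source does not show how `vocab.get_embeddings()` treats them.

```python
import numpy as np

def build_embedding_matrix(index2word, pretrained, dim, seed=0):
    """Rows for known words come from `pretrained`; the rest stay random."""
    rng = np.random.RandomState(seed)
    matrix = rng.uniform(-1.0, 1.0, size=(len(index2word), dim))
    for i, word in enumerate(index2word):
        if word in pretrained:
            matrix[i] = pretrained[word]
    return matrix

toy_words = ['***UNK***', 'cats', 'dogs']
toy_pretrained = {'cats': [0.1, 0.2], 'dogs': [0.3, 0.4]}  # made-up vectors
toy_matrix = build_embedding_matrix(toy_words, toy_pretrained, dim=2)
```

A matrix like this has the same shape as `E`, so it could be passed to `E.assign(...)` in the same way as above.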
