This example shows how to use an LSTM sentiment classification model, trained with Keras, inside spaCy. spaCy splits each document into sentences, and the LSTM classifies each sentence. The sentence scores are then aggregated to give a document score. This kind of hierarchical model is awkward to build in "pure" Keras or TensorFlow, but it is very effective. The Keras example on this dataset performs quite poorly because it truncates every document to a fixed size. Truncation hurts accuracy on reviews a lot, because people often summarise their rating in the final sentence.

Prerequisites:
python -m spacy download en_core_web_lg
pip install keras==2.0.9

Compatible with: spaCy v2.0.0+

# Process

In [1]:
import plac
import random
import pathlib
import cytoolz
import numpy as np

In [2]:
from keras.models import Sequential, model_from_json
from keras.layers import LSTM, Dense, Embedding, Bidirectional
from keras.layers import TimeDistributed
from keras.optimizers import Adam

Using TensorFlow backend.


In [3]:
import thinc.extra.datasets
from spacy.compat import pickle
import spacy

In [4]:
imdb_data = thinc.extra.datasets.imdb()

In [5]:
train_texts, train_labels = zip(*imdb_data[0])

In [6]:
dev_texts, dev_labels = zip(*imdb_data[1])

In [7]:
len(train_texts)

25000

In [8]:
len(dev_texts)

25000

In [9]:
train_texts[0]

"As I write this review in 2008, we are mired in a remake culture. Movie studios seem determined to ruin as many classic films as they can with thoroughly pointless updates including 'King Kong, 'The Wicker Man' and practically every film that ever starred Michael Caine. This lazy remake mentality is not a new phenomenon, however, as 'Dough for the Do-Do' proves. An entirely pointless colorized version of Bob Clampett's surreal masterpiece 'Porky in Wackyland', 'Dough for the Do-Do' sucks the life out of the original by splashing colour all over Clampett's original footage and adding some lame new footage overseen by Friz Freleng. Freleng was an entirely unsuitable director to be tampering with Clampett's source material, although in truth no director could hope to come close to Clampett's inspired insanity. Inevitably, then, 'Dough for the Do-Do' is nothing more than the raping of a classic with an appalling new title attached. For cartoon fans like myself, its equivalent to a coloriz

In [10]:
train_labels[0]

0

In [11]:
train_labels = np.asarray(train_labels, dtype='int32')
dev_labels = np.asarray(dev_labels, dtype='int32')

In [12]:
nlp = spacy.load('en_core_web_lg')

In [13]:
nlp.add_pipe(nlp.create_pipe('sentencizer'))

In [14]:
def get_embeddings(vocab):
    return vocab.vectors.data

In [15]:
embeddings = get_embeddings(nlp.vocab)

In [16]:
len(train_texts)

25000

In [19]:
train_docs = list(nlp.pipe(train_texts, batch_size=10000, n_threads=48))
dev_docs = list(nlp.pipe(dev_texts, batch_size=10000, n_threads=48))

In [20]:
def get_labelled_sentences(docs, doc_labels):
    labels = []
    sentences = []
    for doc, y in zip(docs, doc_labels):
        for sent in doc.sents:
            sentences.append(sent)
            labels.append(y)
    return sentences, np.asarray(labels, dtype='int32')

In [21]:
train_docs, train_labels = get_labelled_sentences(train_docs, train_labels)
dev_docs, dev_labels = get_labelled_sentences(dev_docs, dev_labels)
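
Note that get_labelled_sentences does not record which document each sentence came from, and that mapping is needed later to turn sentence scores back into document scores. A minimal sketch of a variant that keeps it is shown here. The function name, the doc_ids array, and the get_sents parameter are hypothetical additions, not part of the original notebook; get_sents exists only so the toy example can run on plain lists instead of spaCy Doc objects.

```python
import numpy as np


def get_labelled_sentences_with_ids(docs, doc_labels,
                                    get_sents=lambda doc: doc.sents):
    """Like get_labelled_sentences, but also return each sentence's document index."""
    sentences, labels, doc_ids = [], [], []
    for doc_id, (doc, y) in enumerate(zip(docs, doc_labels)):
        for sent in get_sents(doc):
            sentences.append(sent)
            labels.append(y)          # every sentence inherits the document label
            doc_ids.append(doc_id)    # remember which document it belongs to
    return (sentences,
            np.asarray(labels, dtype='int32'),
            np.asarray(doc_ids, dtype='int32'))


# Toy stand-in: each "document" is a plain list of sentence strings.
toy_docs = [["Slow start.", "Great ending."], ["Awful."]]
toy_labels = [1, 0]
sents, labels, doc_ids = get_labelled_sentences_with_ids(
    toy_docs, toy_labels, get_sents=lambda doc: doc)
```

With real spaCy documents the default get_sents would be used, so the call looks the same as for get_labelled_sentences with one extra return value.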

In [22]:
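def get_features(docs, max_length):
    # Build one fixed-length row of vector-table IDs per sentence.
    # Rows start as zeros, so short sentences are padded with 0.
    docs = list(docs)
    Xs = np.zeros((len(docs), max_length), dtype='int32')
    for i, doc in enumerate(docs):
        j = 0
        for token in doc:
            vector_id = token.vocab.vectors.find(key=token.orth)
            if vector_id >= 0:
                Xs[i, j] = vector_id
            else:
                # Tokens without a vector (negative row) fall back to 0.
                Xs[i, j] = 0
            j += 1
            # Sentences longer than max_length are truncated.
            if j >= max_length:
                break
    return Xs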
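
To make the shape of the feature matrix concrete, here is a small sketch of the same pad-or-truncate logic on plain integer lists. The helper name pad_or_truncate and the ID values are made up for illustration; in the notebook the IDs come from the spaCy vectors table.

```python
import numpy as np


def pad_or_truncate(id_sequences, max_length):
    """Pack variable-length ID sequences into a zero-padded (n, max_length) matrix."""
    Xs = np.zeros((len(id_sequences), max_length), dtype='int32')
    for i, ids in enumerate(id_sequences):
        ids = ids[:max_length]      # cut sequences that are too long
        Xs[i, :len(ids)] = ids      # the remaining columns stay 0 (padding)
    return Xs


# One short sequence (padded) and one long sequence (truncated).
toy = pad_or_truncate([[7, 3], [5, 9, 2, 8, 4]], max_length=4)
print(toy)
```

The first row ends in two padding zeros and the second row loses its fifth ID. Because the Embedding layer in the training section is created with mask_zero=True, positions holding 0 are masked out, which also applies to tokens that were mapped to 0 for lack of a vector.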

In [24]:
lstm_shape = {'nr_hidden': 64, 'max_length': 100, 'nr_class': 1}
lstm_settings = {'dropout': 0.5, 'lr': 0.001}

In [25]:
train_X = get_features(train_docs, lstm_shape['max_length'])
dev_X = get_features(dev_docs, lstm_shape['max_length'])

In [27]:
import pickle
def save_object(obj, filename):
    with open(filename, 'wb') as output:  # Overwrites any existing file.
        pickle.dump(obj, output, pickle.HIGHEST_PROTOCOL)

In [31]:
save_object((train_texts, train_labels, train_X), 'train.pkl')

In [32]:
save_object((dev_texts, dev_labels, dev_X), 'dev.pkl')

# Training

In [1]:
import plac
import random
import pathlib
import cytoolz
import numpy as np

In [2]:
from keras.models import Sequential, model_from_json
from keras.layers import LSTM, Dense, Embedding, Bidirectional
from keras.layers import TimeDistributed
from keras.optimizers import Adam

Using TensorFlow backend.


In [3]:
import thinc.extra.datasets
from spacy.compat import pickle
import spacy

In [4]:
with open('train.pkl', 'rb') as f:
    train_texts, train_labels, train_X = pickle.load(f)
with open('dev.pkl', 'rb') as f:
    dev_texts, dev_labels, dev_X = pickle.load(f)

In [5]:
lstm_shape = {'nr_hidden': 64, 'max_length': 100, 'nr_class': 1}
lstm_settings = {'dropout': 0.5, 'lr': 0.001}

In [6]:
nlp = spacy.load('en_core_web_lg')
nlp.add_pipe(nlp.create_pipe('sentencizer'))
def get_embeddings(vocab):
    return vocab.vectors.data
embeddings = get_embeddings(nlp.vocab)

In [7]:
model = Sequential()
model.add(
    Embedding(
        embeddings.shape[0],
        embeddings.shape[1],
        input_length=lstm_shape['max_length'],
        trainable=False,
        weights=[embeddings],
        mask_zero=True
    )
)
model.add(TimeDistributed(Dense(lstm_shape['nr_hidden'],
                                use_bias=False)))
model.add(Bidirectional(LSTM(lstm_shape['nr_hidden'],
                             recurrent_dropout=lstm_settings['dropout'],
                             dropout=lstm_settings['dropout'])))
model.add(Dense(lstm_shape['nr_class'],
                activation='sigmoid'))

In [8]:
model.compile(optimizer=Adam(lr=lstm_settings['lr']),
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [9]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 100, 300)          205449300 
_________________________________________________________________
time_distributed_1 (TimeDist (None, 100, 64)           19200     
_________________________________________________________________
bidirectional_1 (Bidirection (None, 128)               66048     
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 129       
=================================================================
Total params: 205,534,677
Trainable params: 85,377
Non-trainable params: 205,449,300
_________________________________________________________________
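
The parameter counts in the summary can be reproduced by hand, which is a useful check that the layers are wired as intended. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias. The number of vectors (684831) is not stated anywhere in the notebook; it is inferred here by dividing the embedding parameter count by the vector width of 300.

```python
n_vectors, vector_dim = 684831, 300   # inferred: 205,449,300 / 300
nr_hidden = 64

embedding = n_vectors * vector_dim    # frozen lookup table (non-trainable)
projection = vector_dim * nr_hidden   # TimeDistributed Dense, use_bias=False

# One LSTM direction: 4 gates x (input weights + recurrent weights + bias).
# The LSTM input is the 64-dimensional projection, not the 300-d vectors.
lstm_one_direction = 4 * (nr_hidden * nr_hidden + nr_hidden * nr_hidden + nr_hidden)
bilstm = 2 * lstm_one_direction       # forward and backward directions

output = 2 * nr_hidden * 1 + 1        # 128 concatenated features plus one bias

trainable = projection + bilstm + output
total = embedding + trainable
print(embedding, projection, bilstm, output, trainable, total)
```

Almost all of the parameters sit in the frozen embedding table; only about 85 thousand are actually updated during training.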


In [10]:
model.fit(train_X,
          train_labels,
          validation_data=(dev_X, dev_labels),
          epochs=5,
          batch_size=10000,
          verbose=2)

Train on 310254 samples, validate on 304112 samples
Epoch 1/5
 - 107s - loss: 0.6560 - acc: 0.6072 - val_loss: 0.6185 - val_acc: 0.6588
Epoch 2/5
 - 103s - loss: 0.6156 - acc: 0.6623 - val_loss: 0.6033 - val_acc: 0.6745
Epoch 3/5
 - 103s - loss: 0.6040 - acc: 0.6727 - val_loss: 0.5951 - val_acc: 0.6803
Epoch 4/5
 - 103s - loss: 0.5969 - acc: 0.6777 - val_loss: 0.5955 - val_acc: 0.6791
Epoch 5/5
 - 103s - loss: 0.5937 - acc: 0.6794 - val_loss: 0.5872 - val_acc: 0.6848


<keras.callbacks.History at 0x7f36caf0c518>
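
The accuracies above are per sentence. The introduction describes aggregating sentence scores into a document score, a step this notebook does not include. A minimal sketch of one way to do it is to average the sentence probabilities within each document. Averaging is an assumption here, not necessarily the aggregation the author used, and the function name and the doc_ids array (the document index of each sentence) are hypothetical: the notebook does not store that mapping.

```python
import numpy as np


def aggregate_doc_scores(sentence_scores, doc_ids, n_docs):
    """Average sentence-level probabilities within each document."""
    # ravel() accepts both flat arrays and (n, 1) prediction arrays.
    sentence_scores = np.asarray(sentence_scores, dtype='float64').ravel()
    doc_ids = np.asarray(doc_ids, dtype='int64')
    sums = np.bincount(doc_ids, weights=sentence_scores, minlength=n_docs)
    counts = np.bincount(doc_ids, minlength=n_docs)
    # Documents with no sentences get a neutral score of 0.5.
    return np.where(counts > 0, sums / np.maximum(counts, 1), 0.5)


# Toy example: two sentences each for documents 0 and 1, none for document 2.
toy_scores = [0.2, 0.4, 0.9, 0.7]
toy_doc_ids = [0, 0, 1, 1]
doc_scores = aggregate_doc_scores(toy_scores, toy_doc_ids, n_docs=3)
doc_predictions = (doc_scores > 0.5).astype('int32')
```

With the trained model, sentence_scores would be the output of model.predict(dev_X), and document-level accuracy could then be measured by comparing doc_predictions against the original 25000 document labels.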