## Predicting movie review sentiment - word embeddings with CNN

Continuing my experience with NLP after studying through many tutorials: the IMDB movie review dataset has 1000 positive and 1000 negative reviews. The text has already been cleaned up to a good degree, and reviews are stored one per file, named cv000 to cv999 in each of the neg and pos folders. The last 100 reviews in each folder (cv900 to cv999) are held out for testing, which leaves 90% for training.

Dataset here - www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz
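The train/test split relies only on the file name prefix, so it can be sketched without touching the data. This is a minimal illustration, assuming the `cvNNN_XXXXX.txt` naming pattern; `is_test_file` is a hypothetical helper and all file names except the first are made up.

```python
def is_test_file(filename):
    """Return True for files in the held-out block cv900 to cv999."""
    return filename.startswith('cv9')

# Only the first name comes from the dataset; the rest are invented
# to follow the same cvNNN_XXXXX.txt pattern.
names = ['cv000_29590.txt', 'cv450_00001.txt', 'cv900_00002.txt', 'cv999_00003.txt']

train = [n for n in names if not is_test_file(n)]
test = [n for n in names if is_test_file(n)]
print(train)
print(test)
```

The same prefix check appears later inside `process_docs`, where the `is_train` flag decides which side of the split is kept.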

### Cleaning our text data - taking care of punctuation, removing stopwords...

In [1]:
from nltk.corpus import stopwords
import string
def load_doc(filename):
    #Read the whole file into memory as a single string
    with open(filename, 'r') as file:
        return file.read()

def clean_doc(doc):
    #Split into tokens by white space
    tokens = doc.split()
    #Remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    #Keep only purely alphabetic tokens
    tokens = [word for word in tokens if word.isalpha()]
    #Filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    #Drop single-character tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

#Loading document
filename = 'txt_sentoken/pos/cv000_29590.txt'
text = load_doc(filename)
tokens = clean_doc(text)
print(tokens)

['films', 'adapted', 'comic', 'books', 'plenty', 'success', 'whether', 'theyre', 'superheroes', 'batman', 'superman', 'spawn', 'geared', 'toward', 'kids', 'casper', 'arthouse', 'crowd', 'ghost', 'world', 'theres', 'never', 'really', 'comic', 'book', 'like', 'hell', 'starters', 'created', 'alan', 'moore', 'eddie', 'campbell', 'brought', 'medium', 'whole', 'new', 'level', 'mid', 'series', 'called', 'watchmen', 'say', 'moore', 'campbell', 'thoroughly', 'researched', 'subject', 'jack', 'ripper', 'would', 'like', 'saying', 'michael', 'jackson', 'starting', 'look', 'little', 'odd', 'book', 'graphic', 'novel', 'pages', 'long', 'includes', 'nearly', 'consist', 'nothing', 'footnotes', 'words', 'dont', 'dismiss', 'film', 'source', 'get', 'past', 'whole', 'comic', 'book', 'thing', 'might', 'find', 'another', 'stumbling', 'block', 'hells', 'directors', 'albert', 'allen', 'hughes', 'getting', 'hughes', 'brothers', 'direct', 'seems', 'almost', 'ludicrous', 'casting', 'carrot', 'top', 'well', 'anythi

## Creating a bag of words - from our clean list of tokens (to filter out low occurring words)

The bag of words approach helps us build a well-defined vocabulary: a dictionary mapping of the tokens/words to their counts. Each document is parsed and added to our vocabulary using the Counter class. From the [docs](https://docs.python.org/2/library/collections.html), "A Counter is a dict subclass for counting hashable objects. It is an unordered collection where elements are stored as dictionary keys and their counts are stored as dictionary values."
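As a quick illustration of how counts accumulate across documents, here is a minimal sketch with two tiny, made-up token lists standing in for cleaned reviews.

```python
from collections import Counter

vocab = Counter()
# Two tiny, made-up "documents" that are already tokenized
vocab.update(['great', 'film', 'great', 'cast'])
vocab.update(['dull', 'film'])

print(vocab['film'])         # count accumulated across both updates
print(vocab['missing'])      # unseen keys count as 0 instead of raising
print(vocab.most_common(2))  # the two highest counts
```

Calling `update()` once per document is what lets the vocabulary grow file by file without holding every review in memory at once.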

In [2]:
from collections import Counter
from os import listdir
#Load document and add it to vocab
def add_doc_to_vocab(filename, vocab):
    doc = load_doc(filename)
    tokens = clean_doc(doc)
    #Update counts
    vocab.update(tokens)

def process_docs(directory, vocab, is_train):
    for filename in listdir(directory):
        #Skip files that do not belong to the requested split
        if is_train and filename.startswith('cv9'):
            continue
        if not is_train and not filename.startswith('cv9'):
            continue
        #Creating full path of the file
        path = directory + '/' + filename
        add_doc_to_vocab(path, vocab)


vocab = Counter()
process_docs('txt_sentoken/neg', vocab, True)
process_docs('txt_sentoken/pos', vocab, True)
print(len(vocab))
#Top words in the vocab
print(vocab.most_common(50))

44276
[('film', 7983), ('one', 4946), ('movie', 4826), ('like', 3201), ('even', 2262), ('good', 2080), ('time', 2041), ('story', 1907), ('films', 1873), ('would', 1844), ('much', 1824), ('also', 1757), ('characters', 1735), ('get', 1724), ('character', 1703), ('two', 1643), ('first', 1588), ('see', 1557), ('way', 1515), ('well', 1511), ('make', 1418), ('really', 1407), ('little', 1351), ('life', 1334), ('plot', 1288), ('people', 1269), ('bad', 1248), ('could', 1248), ('scene', 1241), ('movies', 1238), ('never', 1201), ('best', 1179), ('new', 1140), ('scenes', 1135), ('man', 1131), ('many', 1130), ('doesnt', 1118), ('know', 1092), ('dont', 1086), ('hes', 1024), ('great', 1014), ('another', 992), ('action', 985), ('love', 977), ('us', 967), ('go', 952), ('director', 948), ('end', 946), ('something', 945), ('still', 936)]


Let's get rid of low-occurring words...

In [3]:
min_occur = 2
tokens = [k for k,c in vocab.items() if c >= min_occur]
print(len(tokens))

25767


The vocabulary can be saved to a new file called vocab.txt that we can load later and use to filter movie reviews prior to encoding them for modeling.

In [4]:
def save_list(lines, filename):
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()
#Vocabulary file
save_list(tokens, 'vocab.txt')

## Word embeddings for a neural network

A word embedding is a way of representing text where each word in the vocabulary is represented by a real valued vector in a high-dimensional space. The vectors are learned in such a way that words that have similar meanings will have similar representation in the vector space. The real valued vector representation for words can be learned while training the neural network. We can do this in the Keras deep learning library using the embedding layer.
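To make "similar meanings, similar vectors" concrete, here is a small sketch that compares word vectors with cosine similarity. The three vectors are invented for illustration and are not learned values; a trained embedding layer would learn them, and would use more dimensions (100 in the model later in this post).

```python
import math

# Made-up 3-dimensional vectors, purely for illustration
embedding = {
    'good':  [0.9, 0.1, 0.3],
    'great': [0.8, 0.2, 0.4],
    'awful': [-0.7, 0.1, 0.2],
}

def cosine(u, v):
    """Cosine similarity: 1 means same direction, -1 means opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_close = cosine(embedding['good'], embedding['great'])
sim_far = cosine(embedding['good'], embedding['awful'])
print(sim_close > sim_far)
```

With these hand-picked numbers, 'good' sits much closer to 'great' than to 'awful', which is the kind of geometry training is expected to produce.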

Reworking the previous section's functions...

In [5]:
def load_doc(filename):
    file = open(filename, 'r')
    #Read all text
    text = file.read()
    file.close()
    return text

vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)
from string import punctuation
def clean_doc(doc, vocab):
    #Split into tokens by white space
    tokens = doc.split()
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    #Filter out tokens not in vocab
    tokens = [w for w in tokens if w in vocab]
    tokens = ' '.join(tokens)
    return tokens

#Load all docs in a directory
def process_docs(directory, vocab, is_train):
    documents = list()
    for filename in listdir(directory):
        if is_train and filename.startswith('cv9'):
            continue
        if not is_train and not filename.startswith('cv9'):
            continue
        path = directory + '/' + filename
        doc = load_doc(path)
        tokens = clean_doc(doc, vocab)
        #Add to list
        documents.append(tokens)
    return documents

#Load all training reviews
positive_docs = process_docs('txt_sentoken/pos', vocab, True)
negative_docs = process_docs('txt_sentoken/neg', vocab, True)
train_docs = negative_docs + positive_docs

## Encoding the training documents as sequences of integers using the Tokenizer class in the Keras API.

The Keras function pad_sequences() pads every sequence with 0 values at the end so that all reviews match the length of the longest review in the training dataset. This ensures all documents have the same length.
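The padding step can be sketched in plain Python. `pad_post` below is a hypothetical stand-in, not the Keras implementation: it pads at the end like `padding='post'`, and for sequences that are too long it keeps the last `maxlen` items, which is my understanding of the default `truncating='pre'` behaviour. The integer sequences are made up.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad with `value` at the end; keep the last `maxlen` items if too long.

    Assumes maxlen is a positive integer.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # drop leading items when too long
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

encoded = [[4, 7, 1], [9, 2], [3, 8, 6, 5, 1]]  # made-up integer sequences
max_length = max(len(s) for s in encoded)

print(pad_post(encoded, max_length))  # every row now has 5 entries
print(pad_post(encoded, 4))           # the longest row loses its first item
```

The second call matters for the test set: a test review longer than the longest training review would be cut down to `max_length` rather than padded.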

In [6]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from numpy import array
tokenizer = Tokenizer()
#Fit the tokenizer on the documents
tokenizer.fit_on_texts(train_docs)

#Sequence encode
encoded_docs = tokenizer.texts_to_sequences(train_docs)
#Pad sequences
max_length = max([len(s.split()) for s in train_docs])
Xtrain = pad_sequences(encoded_docs, maxlen=max_length, padding='post')

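#Define training labels (negative reviews first, then positive)
ytrain = array([0 for _ in range(900)] + [1 for _ in range(900)])
#Load the held-out test reviews and encode them with the same tokenizer
positive_docs = process_docs('txt_sentoken/pos', vocab, False)
negative_docs = process_docs('txt_sentoken/neg', vocab, False)
test_docs = negative_docs + positive_docs
encoded_docs = tokenizer.texts_to_sequences(test_docs)
#Pad to the training max_length so train and test inputs share a shape
Xtest = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
#Define test labels (negative reviews first, then positive)
ytest = array([0 for _ in range(100)] + [1 for _ in range(100)])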

Using TensorFlow backend.


## Neural network - the good stuff!

The model uses an embedding layer as its first layer. The embedding layer requires the vocabulary size, the size of the real-valued vector space, and the maximum length of input documents. The vocabulary size is the number of words in the tokenizer's word index plus one, because the Tokenizer assigns word indices starting at 1 and reserves index 0, which is the value used here for padding.
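The extra slot in the vocabulary size follows from how words are numbered. The sketch below is a simplified stand-in for the Keras Tokenizer (`build_word_index` is a hypothetical helper and the documents are made up): words are numbered from 1 by descending frequency, so the largest index equals the number of words and the embedding table needs one more row than that.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

docs = ['good film good cast', 'bad film']  # made-up documents
word_index = build_word_index(docs)

# Index 0 is never assigned to a word, so the table needs len + 1 rows
vocab_size = len(word_index) + 1

# Words missing from the index are simply skipped when encoding
encoded = [[word_index[w] for w in d.split() if w in word_index] for d in docs]
print(word_index)
print(vocab_size)
print(encoded)
```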

The next layer in the model is a one-dimensional convolutional neural network layer, which employs 32 filters to perform parallel feature extraction over the text data. Each filter has a kernel size of 8 and uses the rectified linear unit (ReLU) activation function. This convolutional layer is followed by a max-pooling layer, which downsamples the output from the convolutional layer to capture the most salient features.
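What one filter and the pooling step do can be sketched in plain Python. This is a simplified single-channel version with made-up numbers: the real layer convolves over 100 embedding channels with 32 learned filters, while `conv1d_valid` and `max_pool1d` here are hypothetical helpers that show only the sliding-window mechanics.

```python
def conv1d_valid(signal, kernel, bias=0.0):
    """Slide `kernel` over `signal` without padding, then apply ReLU."""
    k = len(kernel)
    out = []
    for start in range(len(signal) - k + 1):
        window = signal[start:start + k]
        total = sum(x * w for x, w in zip(window, kernel)) + bias
        out.append(max(0.0, total))  # ReLU clips negatives to zero
    return out

def max_pool1d(values, pool_size=2):
    """Keep the largest value in each non-overlapping window."""
    return [max(values[i:i + pool_size])
            for i in range(0, len(values) - pool_size + 1, pool_size)]

signal = [1.0, -2.0, 3.0, 0.5, -1.0, 2.0]  # made-up single-channel input
kernel = [1.0, 0.0, -1.0]                  # made-up filter weights

features = conv1d_valid(signal, kernel)
pooled = max_pool1d(features)
print(features)
print(pooled)
```

The output of the convolution has `len(signal) - len(kernel) + 1` entries, and pooling with size 2 halves that, which is the same pattern the full model follows at a larger scale.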

The output from the CNN part of the model is flattened to one long vector to represent the features. The output layer uses a sigmoid activation function to output a value between 0 and 1 for the negative and positive sentiment in the review.
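The sigmoid output can be read as a score that is then turned into a class label. The sketch below assumes a 0.5 cut-off, which is a common convention rather than something stated in this post; `to_label` is a hypothetical helper and the input values are made up.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def to_label(probability, threshold=0.5):
    """Map a sigmoid output to 0 (negative) or 1 (positive)."""
    return 1 if probability >= threshold else 0

for z in (-2.0, 0.0, 3.0):  # made-up pre-activation values
    p = sigmoid(z)
    print(round(p, 3), to_label(p))
```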

In [None]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D

vocab_size = len(tokenizer.word_index) + 1
#Defining model
model = Sequential()
model.add(Embedding(vocab_size, 100, input_length=max_length))
model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(10, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
#summary() prints the table itself and returns None, so no print() is needed
model.summary()
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
#Fit network
model.fit(Xtrain, ytrain, epochs=5, verbose=2, validation_steps=None)
loss, acc = model.evaluate(Xtest, ytest, verbose=2)
print('Test Accuracy: %f' % (acc*100))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 1317, 100)         2576800   
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 1310, 32)          25632     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 655, 32)           0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 20960)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 10)                209610    
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 11        
Total params: 2,812,053
Trainable params: 2,812,053
Non-trainable params: 0
_________________________________________________________________
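The shapes and parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic with the layer settings from the model definition; the vocabulary size of 25768 is inferred from the embedding parameter count (2,576,800 weights over 100 dimensions) rather than printed anywhere in the notebook.

```python
max_length, embed_dim = 1317, 100
vocab_size = 25768  # inferred: 2,576,800 embedding weights / 100 dimensions
filters, kernel_size, pool_size, dense_units = 32, 8, 2, 10

# Output lengths along the sequence axis
conv_len = max_length - kernel_size + 1  # convolution without padding
pool_len = conv_len // pool_size         # non-overlapping max pooling
flat_len = pool_len * filters            # flatten (length x filters)

# Weights plus biases per layer
params = {
    'embedding': vocab_size * embed_dim,
    'conv1d': kernel_size * embed_dim * filters + filters,
    'dense_1': flat_len * dense_units + dense_units,
    'dense_2': dense_units * 1 + 1,
}
print(conv_len, pool_len, flat_len)
print(params)
print(sum(params.values()))
```

Most of the weights sit in the embedding layer, so the vocabulary size and embedding dimension dominate the size of this model.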


This is where my GPU choked and gave a slew of TensorFlow errors in Jupyter. The following is the output from Spyder, which was able to fit and evaluate the model with ease.

model.fit(Xtrain, ytrain, epochs=5, verbose=2, validation_steps=None)
loss, acc = model.evaluate(Xtest, ytest, verbose=2)
print('Test Accuracy: %f' % (acc*100))


Epoch 1/5 - 2s - loss: 0.2770 - acc: 0.9983
Epoch 2/5 - 2s - loss: 0.2638 - acc: 0.9989
Epoch 3/5 - 2s - loss: 0.2519 - acc: 0.9989
Epoch 4/5 - 2s - loss: 0.2410 - acc: 0.9989
Epoch 5/5 - 2s - loss: 0.2308 - acc: 0.9989
Test Accuracy: 81.500000