<a href="https://colab.research.google.com/github/rahiakela/natural-language-processing-in-action/blob/7-text-with-convolutional-neural-networks/convolutional_neural_network_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Convolutional Neural Network for Sentiment Analysis

Let’s take a look at convolution in Python with the example convolutional neural network classifier provided in the Keras documentation. The Keras authors crafted a one-dimensional convolutional net to examine the IMDB movie review dataset.

## Setup

In [1]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from tensorflow.keras import backend as keras_backend
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Conv1D, GlobalMaxPooling1D
from tensorflow.keras.preprocessing import sequence

import os
import tarfile
import re
import tqdm

import requests

TensorFlow 2.x selected.


In [2]:
! pip install pugnlp

Collecting pugnlp
  Downloading https://files.pythonhosted.org/packages/a3/c6/17a0ef5af34e20b595f1983be0468efa9f903a17a5af2a0a980ecaf9c411/pugnlp-0.2.5-py2.py3-none-any.whl (706kB)
Collecting python-Levenshtein
  Downloading https://files.pythonhosted.org/packages/42/a9/d1785c85ebf9b7dfacd08938dd028209c34a0ea3b1bcdb895208bd40a67d/python-Levenshtein-0.12.0.tar.gz (48kB)
Collecting pypandoc
  Downloading https://files.pythonhosted.org/packages/71/81/00184643e5a10a456b4118fc12c96780823adb8ed974eb2289f29703b29b/pypandoc-1.4.tar.gz
Collecting fuzzywuzzy
  Downloading https://files.pythonhosted.org/packages/d8/f1/5a267addb30ab7eaa1beab2b9323073815da4551076554ecc890a3595ec9/fuzzywuzzy-0.17.0-py2.py3-none-any.whl
Building wheels for collected packages: python-Levenshtein, pypandoc
  Building wheel for python-Levenshtein (setup.py) ... done
  Created wheel f

In [0]:
from pugnlp.futil import path_status, find_files

## Data Preparation

Each data point is prelabeled with a 0 (negative sentiment) or a 1 (positive sentiment). You’re going to swap out the Keras example’s preprocessed IMDB movie review dataset for one in raw text, so you can get your hands dirty with the preprocessing of the text as well. Then you’ll see whether you can use the trained network to classify text it has never seen before.

### Downloading data

In [0]:
BIG_URLS = {
    'w2v': ('https://www.dropbox.com/s/965dir4dje0hfi4/GoogleNews-vectors-negative300.bin.gz?dl=1', 1647046227),
    'slang': ('https://www.dropbox.com/s/43c22018fbfzypd/slang.csv.gz?dl=1', 117633024),
    'tweets': ('https://www.dropbox.com/s/5gpb43c494mc8p0/tweets.csv.gz?dl=1', 311725313),
    'lsa_tweets': ('https://www.dropbox.com/s/rpjt0d060t4n1mr/lsa_tweets_5589798_2003588x200.tar.gz?dl=1', 3112841563),  # 3112841312
    'imdb': ('https://www.dropbox.com/s/yviic64qv84x73j/aclImdb_v1.tar.gz?dl=1', 84125825),  # size as reported by the server when downloaded
}

In [0]:
# These functions are part of the nlpia package which can be pip installed and run from there.
def dropbox_basename(url):
    filename = os.path.basename(url)
    match = re.findall(r'\?dl=[0-9]$', filename)
    if match:
        return filename[:-len(match[0])]
    return filename

def download_file(url, data_path='.', filename=None, size=None, chunk_size=4096, verbose=True):
    """Uses stream=True and a reasonable chunk size to be able to download large (GB) files over https"""
    if filename is None:
        filename = dropbox_basename(url)
    file_path = os.path.join(data_path, filename)
    if url.endswith('?dl=0'):
        url = url[:-1] + '1'  # noninteractive download
    if verbose:
        print('requesting URL: {}'.format(url))
    r = requests.get(url, stream=True, allow_redirects=True)
    # Content-Length arrives as a string; cast it to int so it can be compared with the local file size
    if size is None:
        size = r.headers.get('Content-Length', None)
    size = int(size) if size is not None else None
    print('remote size: {}'.format(size))

    stat = path_status(file_path)
    print('local size: {}'.format(stat.get('size', None)))
    if stat['type'] == 'file' and stat['size'] == size:  # TODO: check md5 or get the right size of remote file
        r.close()
        return file_path

    print('Downloading to {}'.format(file_path))

    with open(file_path, 'wb') as f:
        for chunk in r.iter_content(chunk_size=chunk_size):
            if chunk:  # filter out keep-alive chunks
                f.write(chunk)

    r.close()
    return file_path

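def untar(fname):
    """Extract a .tar.gz archive into the current working directory."""
    if fname.endswith("tar.gz"):
        # Named `tar` rather than `tf` to avoid confusion with the TensorFlow alias
        with tarfile.open(fname) as tar:
            tar.extractall()
    else:
        print("Not a tar.gz file: {}".format(fname))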

In [6]:
download_file(BIG_URLS['w2v'][0])

requesting URL: https://www.dropbox.com/s/965dir4dje0hfi4/GoogleNews-vectors-negative300.bin.gz?dl=1
remote size: 1647046227
local size: None
Downloading to ./GoogleNews-vectors-negative300.bin.gz


'./GoogleNews-vectors-negative300.bin.gz'

In [7]:
untar(download_file(BIG_URLS['imdb'][0]))

requesting URL: https://www.dropbox.com/s/yviic64qv84x73j/aclImdb_v1.tar.gz?dl=1
remote size: 84125825
local size: None
Downloading to ./aclImdb_v1.tar.gz


### Preprocessing the loaded documents

The reviews in the train folder are broken up into text files in either the pos or neg folders. You’ll first need to read those in Python with their appropriate label and then shuffle the deck so the samples aren’t all positive and then all negative. Training with the sorted labels will skew training toward whatever comes last, especially when you use certain hyperparameters, such as momentum.

In [8]:
import glob
from random import shuffle

def pre_process_data(filepath):
  '''
  This is dependent on your training data source but we will try to generalize it as best as possible.
  '''
  positive_path = os.path.join(filepath, 'pos')
  negative_path = os.path.join(filepath, 'neg')

  pos_label = 1
  neg_label = 0

  dataset = []

  for filename in glob.glob(os.path.join(positive_path, '*.txt')):
    with open(filename, 'r') as f:
      dataset.append((pos_label, f.read()))

  for filename in glob.glob(os.path.join(negative_path, '*.txt')):
    with open(filename, 'r') as f:
      dataset.append((neg_label, f.read()))

  shuffle(dataset)

  return dataset

dataset = pre_process_data('./aclImdb/train')
print(dataset[0])

(1, "I saw this film at the Toronto International Film Festival. Not as salacious as it sounds, this is a three-part documentary (each episode is 50 minutes) featuring Slovenian superstar philosopher/psychoanalyst Slavoj Zizek. Zizek takes us on a journey through many classic films, exploring themes of sexuality, fantasy, morality and mortality. It was directed by Sophie Fiennes, of the multi-talented Fiennes clan (she's sister to actors Ralph and Joseph).<br /><br />I enjoyed this quite a bit, although I think it will be even more enjoyable on DVD, since there is such a stew of ideas to be digested. Freudian and Lacanian analysis can be pretty heavy going and seeing the whole series all at once became a bit disorienting by the end of two and a half hours. It didn't help that an ill-advised coffee and possession of a bladder led me to some discomfort for the last hour or so.<br /><br />My only real issue with this is that Zizek picked films that were quite obviously filled with Freudia

### Data tokenization and vectorization

The next step is to tokenize and vectorize the data. You’ll use the Google News pretrained Word2vec vectors, which you downloaded earlier.

You’ll use gensim to unpack the vectors. You can experiment with the limit argument to the load_word2vec_format method; a higher number gets you more vectors to play with, but memory quickly becomes an issue, and the return on investment drops off at really high values of limit.

In [9]:
from nltk.tokenize import TreebankWordTokenizer
from gensim.models import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True, limit=200000)

def tokenize_and_vectorize(dataset):
  tokenizer = TreebankWordTokenizer()
  vectorized_data = []

  for sample in dataset:
    tokens = tokenizer.tokenize(sample[1])
    sample_vecs = []
    for token in tokens:
      try:
        sample_vecs.append(word_vectors[token])
      except KeyError:
        pass    # No matching token in the Google w2v vocab

    vectorized_data.append(sample_vecs)

  return vectorized_data
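
To see what this helper does to a single review, here is a minimal self-contained sketch. It assumes a tiny hand-made embedding table (`toy_vectors`, hypothetical, with 3 dimensions instead of 300) and plain `str.split` in place of `TreebankWordTokenizer`, so it runs without downloading anything. The point it illustrates is that tokens missing from the vocabulary are silently skipped, so the vectorized sample can be shorter than the original text.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for the 300-dimensional Word2vec vectors.
toy_vectors = {
    'great': np.array([0.9, 0.1, 0.0]),
    'movie': np.array([0.2, 0.7, 0.1]),
    'boring': np.array([-0.8, 0.0, 0.3]),
}

def toy_tokenize_and_vectorize(dataset, vectors):
    """Map each (label, text) pair to a list of word vectors, dropping unknown tokens."""
    vectorized = []
    for _label, text in dataset:
        tokens = text.split()  # simplification: whitespace tokenizer
        sample_vecs = [vectors[token] for token in tokens if token in vectors]
        vectorized.append(sample_vecs)
    return vectorized

toy_dataset = [(1, 'a great movie'), (0, 'boring zzzz movie')]
toy_vectorized = toy_tokenize_and_vectorize(toy_dataset, toy_vectors)
print([len(sample) for sample in toy_vectorized])  # [2, 2]
```

Both three-token reviews end up with two vectors, because `a` and `zzzz` are not in the toy vocabulary.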



You also need to collect the target values—0 for a negative review, 1 for a positive review—in the same order as the training samples.

In [0]:
def collect_expected(dataset):
  '''Peel off the target values from the dataset'''
  expected = []
  for sample in dataset:
    expected.append(sample[0])
  
  return expected

And then you simply pass your data into those functions:

In [0]:
vectorized_data = tokenize_and_vectorize(dataset)
expected = collect_expected(dataset)

### Train/Test splitting

Next you’ll split the prepared data into a training set and a test set. Here you simply split the imported training data 80/20; note that this leaves the dataset’s separate folder of test data unused.

In [0]:
split_point = int(len(vectorized_data) * .8)

x_train = vectorized_data[:split_point]
y_train = expected[:split_point]
x_test = vectorized_data[split_point:]
y_test = expected[split_point:]

### CNN parameters

The next cell sets most of the hyperparameters for the net.

In [0]:
maxlen = 400          # holds the maximum review length
batch_size = 32       # How many samples to show the net before backpropagating the error and updating the weights
embedding_dims = 300  # Length of the token vectors you’ll create for passing into the convnet
filters = 250         # Number of filters you’ll train
kernel_size = 3       # Filter width; each filter will be a matrix of weights of size embedding_dims x kernel_size, or 300 x 3 in your case
hidden_dims = 250     # Number of neurons in the plain feed forward net at the end of the chain
epochs = 2            # Number of times we will pass the entire training dataset through the network
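
These numbers determine the size of the convolutional layer. As a quick sanity check, the following sketch works out how many positions each filter visits and how many weights the layer learns. It assumes a stride of 1 and no padding at the sequence edges, which is how the convolutional layer is configured later in this notebook; the names `conv_output_len` and `conv_params` are introduced here only for illustration.

```python
maxlen = 400
embedding_dims = 300
filters = 250
kernel_size = 3
strides = 1  # assumption: one token per shift

# Without edge padding the filter never hangs over the end of the sequence,
# so it visits this many positions:
conv_output_len = (maxlen - kernel_size) // strides + 1

# Each filter has kernel_size x embedding_dims weights plus one bias term.
conv_params = filters * (kernel_size * embedding_dims + 1)

print(conv_output_len)  # 398
print(conv_params)      # 225250
```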

### Padding and truncating token sequences (sequences of vectors)

Keras has a preprocessing helper method, pad_sequences, that in theory could be
used to pad your input data, but unfortunately it works only with sequences of scalars, and you have sequences of vectors. 

Let’s write a helper function of your own to pad your input data.

In [0]:
def pad_trunc(data, maxlen):
  '''For a given dataset pad with zero vectors or truncate to maxlen'''
  new_data = []

  # Create a vector of 0's the length of our word vectors
  # (assumes the first sample contains at least one token vector)
  zero_vector = [0.0 for _ in range(len(data[0][0]))]

  for sample in data:
    if len(sample) > maxlen:
        temp = sample[:maxlen]
    elif len(sample) < maxlen:
        temp = list(sample)  # copy so the caller's list is not modified in place
        additional_elems = maxlen - len(sample)
        for _ in range(additional_elems):
            temp.append(zero_vector)
    else:
        temp = sample
    new_data.append(temp)
  
  return new_data
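
As an alternative, the same padding and truncation can be done while filling a preallocated NumPy array, which skips the intermediate nested lists. This is only a sketch under the assumption that every token vector has the same length `dims`; the function name `pad_trunc_array` is hypothetical and not part of the notebook or of Keras.

```python
import numpy as np

def pad_trunc_array(data, maxlen, dims):
    """Return an array of shape (len(data), maxlen, dims), zero-padded or truncated."""
    out = np.zeros((len(data), maxlen, dims), dtype='float32')
    for i, sample in enumerate(data):
        trimmed = sample[:maxlen]  # truncate long samples
        if len(trimmed) > 0:       # samples with no known tokens stay all zeros
            out[i, :len(trimmed)] = np.asarray(trimmed, dtype='float32')
    return out

toy = [
    [[1.0, 1.0]],                                       # 1 token: padded with zeros
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]],   # 4 tokens: truncated to 3
    [],                                                 # no known tokens at all
]
padded = pad_trunc_array(toy, maxlen=3, dims=2)
print(padded.shape)  # (3, 3, 2)
```

The result already has the (number of samples, sequence length, word vector length) layout, so no separate reshape step is needed.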

Then you need to pass your train and test data into the padder/truncator. After that you can convert it to numpy arrays to make Keras happy. This is a tensor with the shape (number of samples, sequence length, word vector length) that you need for your CNN.

In [0]:
x_train = pad_trunc(x_train, maxlen)
keras_backend.clear_session()

In [0]:
x_test = pad_trunc(x_test, maxlen)
keras_backend.clear_session()

In [0]:
x_train = np.reshape(x_train, (len(x_train), maxlen, embedding_dims))
y_train = np.array(y_train)
keras_backend.clear_session()
x_test = np.reshape(x_test, (len(x_test), maxlen, embedding_dims))
y_test = np.array(y_test)

Phew; finally you’re ready to build a neural network.

## Convolutional neural network architecture

Sequential is one of the base classes for neural networks in Keras. From here you can start to layer on the magic.

The first piece you add is a convolutional layer. In this case, you assume that it’s okay that the output is of smaller dimension than the input, and you set the padding to 'valid'. Each filter will start its pass with its leftmost edge at the start of the sentence and stop with its rightmost edge on the last token.

Each shift (stride) in the convolution will be one token. The kernel (window width) you already set to three tokens, and you’re using the 'relu' activation function. At each step, you’ll multiply the filter weights by the values in the three tokens it’s looking at (element-wise), sum up those products, and pass the sum through if it’s greater than 0; otherwise you output 0. That last passthrough of positive values and 0s is the rectified linear unit (ReLU) activation function.

```python
model = Sequential()
# Add one Conv1D layer, which will learn word group filters of size kernel_size.
model.add(Conv1D(filters, kernel_size, padding='valid', activation='relu', strides=1, input_shape=(maxlen, embedding_dims)))
```
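
The arithmetic described above can be sketched in plain NumPy. This is an illustration of the multiply, sum, and ReLU steps on toy shapes (10 tokens, 4-dimensional embeddings, 5 filters), not the way Keras implements the layer; the function name `conv1d_valid_relu` and the random toy inputs are assumptions made for the example.

```python
import numpy as np

def conv1d_valid_relu(x, kernels, bias):
    """x: (steps, dims); kernels: (kernel_size, dims, n_filters); bias: (n_filters,)."""
    kernel_size, _, n_filters = kernels.shape
    out_len = x.shape[0] - kernel_size + 1   # no padding: the window stays inside the sequence
    out = np.zeros((out_len, n_filters))
    for start in range(out_len):             # stride of one token
        window = x[start:start + kernel_size]  # (kernel_size, dims)
        # Multiply element-wise and sum over window positions and embedding dimensions.
        out[start] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)              # ReLU: keep positives, zero out the rest

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 4))          # 10 tokens, 4-dimensional toy embeddings
kernels = rng.normal(size=(3, 4, 5))  # 5 filters, each 3 tokens wide
bias = np.zeros(5)
features = conv1d_valid_relu(x, kernels, bias)
print(features.shape)  # (8, 5)
```

Ten tokens and a window of three give eight output positions per filter, which is why the output is shorter than the input.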



### Pooling

Pooling is the convolutional neural network’s path to dimensionality reduction. In some ways, you’re speeding up the process by allowing for parallelization of the computation.

The key idea is you’re going to evenly divide the output of each filter into subsections. Then for each of those subsections, you’ll select or compute a representative value. And then you set the original output aside and use the collections of representative values as the input to the next layers.

Usually, discarding data wouldn’t be the best course of action. But it turns out, it’s a path toward learning higher order representations of the source data. The filters are being trained to find patterns. The patterns are revealed in relationships between words and their neighbors! Just the kind of subtle information you set out to find.

In image processing, the first layers will tend to learn to be edge detectors, places where pixel densities rapidly shift from one side to the other. Later layers learn concepts like shape and texture. And layers after that may learn “content” or “meaning.” Similar processes will happen with text.

<img src='https://github.com/rahiakela/img-repo/blob/master/pooling-layers.PNG?raw=1' width='800'/>

You have two choices for pooling:
* Average pooling: the more intuitive of the two, in that by taking the average of the subset of values you would in theory retain the most data.
* Max pooling: has an interesting property, in that by taking the largest activation value for the given region, the network sees that subsection’s most prominent feature. The network has a path toward learning what it should look at, regardless of exact pixel-level position!

In addition to dimensionality reduction and the computational savings that come
with it, you gain something else special: **location invariance**. If an original input element is jostled slightly in position in a similar but distinct input sample, the max pooling layer will still output something similar. This is a huge boon in the image recognition world, and it serves a similar purpose in natural language processing.
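
A small worked example may make the two pooling choices and the location-invariance property concrete. The sketch below pools a single filter’s output over non-overlapping windows of two values; the helper `pool1d` and the toy activation lists are hypothetical, chosen so that the second input is the first with its values jostled inside each window.

```python
import numpy as np

def pool1d(values, size, mode='max'):
    """Split a 1D sequence into non-overlapping windows of `size` and reduce each one."""
    usable = len(values) - len(values) % size  # drop a ragged tail, if any
    windows = np.asarray(values[:usable], dtype=float).reshape(-1, size)
    return windows.max(axis=1) if mode == 'max' else windows.mean(axis=1)

original = [0.0, 9.0, 0.0, 0.0, 1.0, 2.0]
jostled = [9.0, 0.0, 0.0, 0.0, 2.0, 1.0]  # same activations, shifted within each window

print(pool1d(original, 2, 'max'))  # [9. 0. 2.]
print(pool1d(jostled, 2, 'max'))   # [9. 0. 2.]
print(pool1d(original, 2, 'avg'))  # [4.5 0.  1.5]
```

Max pooling reports the strongest activation in each window no matter where in the window it occurred, so both inputs pool to the same values.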

In Keras, you’re using the GlobalMaxPooling1D layer.

```python
model.add(GlobalMaxPooling1D())
```
Now for each input sample you have a 1D vector that the network thinks is a good representation of that input sample. This is a semantic representation of the input—a crude one to be sure. And it will only be semantic in the context of the training target, which is sentiment. There won’t be an encoding of the content of the movie being reviewed, say, just an encoding of its sentiment.
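
Global max pooling is the extreme case in which the whole sequence is one window: for every filter, only the single largest activation across all positions survives. A minimal NumPy sketch of that reduction, using a made-up batch of two samples, four positions, and three filters (the function name `global_max_pool_1d` is an assumption for illustration):

```python
import numpy as np

def global_max_pool_1d(batch):
    """batch: (samples, steps, filters) -> (samples, filters)."""
    return batch.max(axis=1)

# Two toy samples, four convolution steps, three filters.
batch = np.array([
    [[0.1, 0.0, 0.3], [0.9, 0.0, 0.0], [0.2, 0.5, 0.1], [0.0, 0.4, 0.0]],
    [[0.0, 0.7, 0.0], [0.0, 0.0, 0.0], [0.6, 0.1, 0.2], [0.3, 0.0, 0.8]],
])
pooled = global_max_pool_1d(batch)
print(pooled.shape)  # (2, 3)
print(pooled)        # [[0.9 0.5 0.3]
                     #  [0.6 0.7 0.8]]
```

With the hyperparameters in this notebook, the same reduction turns each review into a vector of 250 values, one per filter.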

### Dropout

Dropout is a special technique developed to prevent overfitting in neural networks. It isn’t specific to natural language processing, but it does work well here.

The idea is that on each training pass, if you “turn off” a certain percentage of the input going to the next layer, randomly chosen on each pass, the model will be less likely to learn the specifics of the training set, “overfitting,” and instead learn more nuanced representations of the patterns in the data and thereby be able to generalize and make accurate predictions when it sees completely novel data.

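The parameter passed into the Dropout layer in Keras is the fraction of the inputs to randomly turn off. In this example, with a rate of 0.2, about 80% of the values feeding the next layer, randomly chosen for each training sample, will pass through (Keras scales them up by 1/(1 - rate) during training so the expected sum is unchanged). The rest will go in as 0s. Dropout is active only during training; at inference every value passes through untouched. A 20% dropout setting is common, but a dropout of up to 50% can have good results.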

```python
# You start with a vanilla fully connected hidden layer and then tack on dropout and ReLU.
model.add(Dense(hidden_dims))
model.add(Dropout(0.2))
model.add(Activation('relu'))
```
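
As a rough illustration of the idea, here is a minimal NumPy sketch of "inverted" dropout, the variant in which the surviving values are scaled up by 1/(1 - rate) during training. The function name `dropout` and the toy `hidden` array are assumptions for the example; this is not the Keras implementation.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Zero a random `rate` fraction of the values and rescale the survivors."""
    if not training or rate == 0.0:
        return activations  # at inference nothing is dropped
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(42)
hidden = np.ones((4, 250))  # stand-in for the output of a 250-unit hidden layer
dropped = dropout(hidden, rate=0.2, rng=rng)

print((dropped == 0).mean())  # fraction of zeroed values, close to 0.2
print(dropped.max())          # 1.25, i.e. 1 / (1 - 0.2)
print(dropout(hidden, rate=0.2, rng=rng, training=False).mean())  # 1.0
```

Because a different random mask is drawn on every training pass, no single unit can be relied on, which pushes the network toward more redundant, general features.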

In [0]:
model = Sequential()

# Add a Conv1D layer, which will learn word group filters of size kernel_size
model.add(Conv1D(filters, kernel_size, padding='valid', activation='relu', strides=1, input_shape=(maxlen, embedding_dims)))
model.add(GlobalMaxPooling1D())

# vanilla hidden layer
model.add(Dense(hidden_dims))
model.add(Dropout(0.2))
model.add(Activation('relu'))

# We project onto a single unit output layer, and squash it with a sigmoid
model.add(Dense(1))
model.add(Activation('sigmoid'))

# compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# train the model
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_data=(x_test, y_test))

Train on 20000 samples, validate on 5000 samples
Epoch 1/2


You would like to save the model state after training.
Because you aren’t going to hold the model in memory for now, you can grab its
structure in a JSON file and save the trained weights in another file for later reinstantiation.

In [0]:
model_structure = model.to_json()   # Note that this doesn’t save the weights of the network, only the structure.

# Save your trained model before you lose it!
with open('cnn_model.json', 'w') as json_file:
  json_file.write(model_structure)
model.save_weights('cnn_weights.h5')

Now your trained model will be persisted on disk; should it converge, you won’t have to train it again.

## Loading saved model

After you have a trained model, you can then pass in a novel sample and see what the network thinks. This could be an incoming chat message or tweet to your bot; in your case, it’ll be a made-up example.

First, reinstate your trained model, if it’s no longer in memory.

In [0]:
from tensorflow.keras.models import model_from_json

with open('cnn_model.json', 'r') as json_file:
  json_string = json_file.read()

model = model_from_json(json_string)
model.load_weights('cnn_weights.h5')

## Prediction

Let’s make up a sentence with an obvious negative sentiment and see what the network has to say about it.

In [0]:
sample_1 = """
I hate that the dismal weather had me down for so long, when will it break! Ugh, when does happiness return?  
The sun is blinding and the puffy clouds are too thin.  I can't wait for the weekend.
"""

With the model pretrained, testing a new sample is quick. There are still thousands and
thousands of calculations to do, but for each sample you only need one forward pass
and no backpropagation to get a result.

In [0]:
# You pass a dummy value in the first element of the tuple just because
# your helper expects it from the way you processed the initial data.
# That value won’t ever see the network, so it can be anything.
vec_list = tokenize_and_vectorize([(1, sample_1)])

# Tokenize returns a list of the data (length 1 here)
test_vec_list = pad_trunc(vec_list, maxlen)

test_vec = np.reshape(test_vec_list, (len(test_vec_list), maxlen, embedding_dims))
model.predict(test_vec)

In [0]:
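# Sequential.predict_classes was removed in newer TensorFlow releases (2.6 and later);
# thresholding the sigmoid output at 0.5 gives the same class labels.
(model.predict(test_vec) > 0.5).astype('int32')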
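
The network’s final sigmoid unit emits a single number between 0 and 1, and the class label is simply that number compared against a threshold. A minimal sketch of this last step, using made-up pre-activation values (`logits` is hypothetical) and an assumed threshold of 0.5:

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def to_class(probabilities, threshold=0.5):
    """Label a sample positive (1) only if its probability exceeds the threshold."""
    return (np.asarray(probabilities) > threshold).astype('int32')

logits = np.array([[-2.0], [0.0], [3.0]])  # toy outputs of the last Dense layer
probs = sigmoid(logits)
print(to_class(probs).ravel())  # [0 0 1]
```

A value of exactly 0.5 is labeled negative here because the comparison is strict; values near the threshold indicate that the network is unsure either way.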

## Conclusion

We touched briefly on the output of the convolutional layers (before you step into
the feedforward layer). This semantic representation is an important artifact. It’s in many
ways a numerical representation of the thought and details of the input text. Specifically
in this case, it’s a representation of the thought and details through the lens of sentiment
analysis, as all the “learning” that happened was in response to whether the
sample was labeled as a positive or negative sentiment. The vector that was generated
by training on a set that was labeled for another specific topic and classified as such
would contain much different information. Using the intermediary vector directly
from a convolutional neural net isn’t common, but there are other neural network architectures where the details of that intermediary
vector become important, and in some cases are the end goal itself.

Why would you choose a CNN for your NLP classification task? The main benefit it
provides is efficiency. In many ways, because of the pooling layers and the limits created
by filter size (though you can make your filters large if you wish), you’re throwing
away a good deal of information. But that doesn’t mean they aren’t useful models. As
you’ve seen, they were able to efficiently detect and predict sentiment over a relatively
large dataset, and even though you relied on the Word2vec embeddings, CNNs can
perform on much less rich embeddings without mapping the entire language.

