<a href="https://colab.research.google.com/github/rahiakela/deep-learning-for-nlp-by-jason-brownlee/blob/part-4-text-classi-cation/2_develop_ngram_cnn_model_for_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Develop an n-gram CNN Model for Sentiment Analysis

A standard deep learning model for text classification and sentiment analysis uses a word embedding layer and a one-dimensional convolutional neural network. The model can be expanded by using multiple parallel convolutional neural networks that read the source document using different kernel sizes. This, in effect, creates a multichannel convolutional neural network that reads text with different n-gram sizes (groups of words).

## Setup

In [1]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from tensorflow.keras import backend as keras_backend
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Input, Dense, Embedding, Flatten, Conv1D, Dropout, MaxPool1D, Concatenate

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import plot_model

TensorFlow 2.x selected.


## Movie Review Dataset

The Movie Review Data is a collection of movie reviews retrieved from the imdb.com website in the early 2000s by Bo Pang and Lillian Lee. The reviews were collected and made available as part of their research on natural language processing.

The dataset is comprised of 1,000 positive and 1,000 negative movie reviews drawn from an archive of the rec.arts.movies.reviews newsgroup hosted at IMDB. The authors refer to this dataset as the polarity dataset.

The data has been cleaned up somewhat, for example:
* The dataset is comprised of only English reviews.
* All text has been converted to lowercase.
* There is white space around punctuation like periods, commas, and brackets.
* Text has been split into one sentence per line.

The data has been used for a few related natural language processing tasks. For classification, the performance of classical models (such as Support Vector Machines) on the data is in the range of high 70% to low 80% (e.g. 78%-to-82%). More sophisticated data preparation may see results as high as 86% with 10-fold cross-validation.


After unzipping the file, you will have a directory called `txt_sentoken` with two sub-directories, `neg` and `pos`, containing the negative and positive reviews respectively. Reviews are stored one per file, with a naming convention from cv000 to cv999 within each of `neg` and `pos`.


## Load Text Data

We will look at loading individual text files, then processing the directories of files. We fetch the data from a GitHub repository where this Movie Review Polarity Dataset is stored. After cloning, it is available in the current working directory under `machine-learning-datasets/movie-review-polarity-dataset/txt_sentoken`.

We can load an individual text file by opening it, reading
in the ASCII text, and closing the file. This is standard file handling stuff.
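
As a minimal sketch of the same open, read, close pattern, the file can be read inside a `with` block, which closes the file even if reading fails. The helper name `read_text_file`, the explicit `encoding` argument, and the throwaway sample file are assumptions made for illustration; they are not part of the dataset or the notebook.

```python
import os
import tempfile

def read_text_file(path, encoding='utf-8'):
  # the with-block closes the file automatically, even on error
  with open(path, 'r', encoding=encoding) as handle:
    return handle.read()

# demonstrate on a temporary file rather than the real dataset
with tempfile.TemporaryDirectory() as tmp_dir:
  sample_path = os.path.join(tmp_dir, 'sample_review.txt')
  with open(sample_path, 'w', encoding='utf-8') as handle:
    handle.write('a fine film .\n')
  sample_text = read_text_file(sample_path)

print(repr(sample_text))
```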

In [2]:
# fetch dataset from github
! git clone https://github.com/rahiakela/machine-learning-datasets -b movie-review-polarity-dataset

fatal: destination path 'machine-learning-datasets' already exists and is not an empty directory.


## Data Preparation

In this section, we will look at 3 things:

1.   Separation of data into training and test sets.
2.   Loading and cleaning the data to remove punctuation and numbers.
3.   Defining a vocabulary of preferred words.



### Split into Train and Test Sets

We are pretending that we are developing a system that can predict the sentiment of a textual movie review as either positive or negative. This means that after the model is developed, we will need to make predictions on new textual reviews. This will require all of the same data preparation to be performed on those new reviews as is performed on the training data for the model.

We will ensure that this constraint is built into the evaluation of our models by splitting the training and test datasets prior to any data preparation. This means that any knowledge in the
test set that could help us better prepare the data (e.g. the words used) is unavailable during the preparation of data and the training of the model. 

That being said, we will use the last 100 positive reviews and the last 100 negative reviews as a test set (200 reviews) and the remaining 1,800 reviews as the training dataset. This is a 90% train, 10% test split of the data. The split can be imposed easily by using the filenames of the reviews, where reviews named 000 to 899 are used for training and reviews named 900 onwards are used for testing the model.
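
The filename rule can be sketched as a small predicate. This is only an illustration: the helper name `is_test_file` is hypothetical, and every filename in the list except `cv000_29590.txt` is made up to follow the `cvNNN_` pattern. For three-digit review numbers, the check is equivalent to testing whether the name starts with `cv9`.

```python
import re

def is_test_file(filename):
  # pull the three-digit review number out of names like 'cv900_00002.txt'
  match = re.match(r'cv(\d{3})_', filename)
  if match is None:
    raise ValueError(f'unexpected filename: {filename}')
  # reviews numbered 900 and above are held out for testing
  return int(match.group(1)) >= 900

# hypothetical filenames (only the first one appears in this notebook)
names = ['cv000_29590.txt', 'cv899_00001.txt', 'cv900_00002.txt', 'cv999_00003.txt']
split = ['test' if is_test_file(name) else 'train' for name in names]
print(split)
```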



### Loading and Cleaning Reviews

The text data is already pretty clean, so not much preparation is required. Without getting too much into the details, we will prepare the data using the following method:

* Split tokens on white space.
* Remove all punctuation from words.
* Remove all words that are not purely comprised of alphabetical characters.
* Remove all words that are known stop words.
* Remove all words that have a length <= 1 character.



In [3]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
from nltk.corpus import stopwords
import string
import re

# load doc into memory
def load_doc(filename):
  # open the file as read only
  file = open(filename, 'r')
  # read all text
  text = file.read()
  # close the file
  file.close()

  return text

# turn a doc into clean tokens
def clean_doc(doc):
  # split into tokens by white space
  tokens = doc.split()
  # prepare regex for char filtering
  re_punc = re.compile('[%s]' % re.escape(string.punctuation))
  # remove punctuation from each word
  tokens = [re_punc.sub('', w) for w in tokens]
  # remove remaining tokens that are not alphabetic
  tokens = [word for word in tokens if word.isalpha()]
  # filter out stop words
  stop_words = set(stopwords.words('english'))
  tokens = [w for w in tokens if not w in stop_words]
  # filter out short tokens
  tokens = [word for word in tokens if len(word) > 1]
  return tokens

# define base path for dataset
base_path = 'machine-learning-datasets/movie-review-polarity-dataset/txt_sentoken'

# load one file
filename = base_path + '/pos/cv000_29590.txt'
text = load_doc(filename)

# clean doc
tokens = clean_doc(text)
print(tokens)

['films', 'adapted', 'comic', 'books', 'plenty', 'success', 'whether', 'theyre', 'superheroes', 'batman', 'superman', 'spawn', 'geared', 'toward', 'kids', 'casper', 'arthouse', 'crowd', 'ghost', 'world', 'theres', 'never', 'really', 'comic', 'book', 'like', 'hell', 'starters', 'created', 'alan', 'moore', 'eddie', 'campbell', 'brought', 'medium', 'whole', 'new', 'level', 'mid', 'series', 'called', 'watchmen', 'say', 'moore', 'campbell', 'thoroughly', 'researched', 'subject', 'jack', 'ripper', 'would', 'like', 'saying', 'michael', 'jackson', 'starting', 'look', 'little', 'odd', 'book', 'graphic', 'novel', 'pages', 'long', 'includes', 'nearly', 'consist', 'nothing', 'footnotes', 'words', 'dont', 'dismiss', 'film', 'source', 'get', 'past', 'whole', 'comic', 'book', 'thing', 'might', 'find', 'another', 'stumbling', 'block', 'hells', 'directors', 'albert', 'allen', 'hughes', 'getting', 'hughes', 'brothers', 'direct', 'seems', 'almost', 'ludicrous', 'casting', 'carrot', 'top', 'well', 'anythi

### Clean All Reviews and Save

We can now apply the cleaning function to all reviews. To do this, we will develop a new function named `process_docs()` that walks through all reviews in a directory, cleans them, and returns them as a list. The function also takes an argument indicating whether it is processing train or test reviews, so that the filenames can be filtered and only the requested train or test reviews are cleaned and returned.

```python
from os import listdir

# load all docs in a directory
def process_docs(directory, is_train):
  documents = list()
  # walk through all files in the folder
  for filename in listdir(directory):
    # skip any reviews in the test set
    if is_train and filename.startswith('cv9'):
      continue
    if not is_train and not filename.startswith('cv9'):
      continue
    # create the full path of the file to open
    path = directory + '/' + filename
    # load the doc
    doc = load_doc(path)
    # clean doc
    tokens = clean_doc(doc)
    # add to list
    documents.append(tokens)
  return documents
```

We can call this function on the negative and positive review directories. We also need labels for the train and test documents. Each class contributes 900 training documents and 100 test documents, so we can use a Python list comprehension to create the labels for the negative (0) and positive (1) reviews for both the train and test sets.

```python
# load and clean a dataset
def load_clean_dataset(is_train):
  # load documents
  neg = process_docs('txt_sentoken/neg', is_train)
  pos = process_docs('txt_sentoken/pos', is_train)
  docs = neg + pos
  # prepare labels
  labels = [0 for _ in range(len(neg))] + [1 for _ in range(len(pos))]
  return docs, labels
```

Finally, we want to save the prepared train and test sets to file so that we can load them later for modeling and model evaluation.

```python
# save a dataset to file
def save_dataset(dataset, filename):
  dump(dataset, open(filename, 'wb'))
  print('Saved: %s' % filename)
```
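
A quick way to convince yourself that nothing is lost when saving is to write a small dataset and read it straight back. The sketch below uses toy documents and a temporary file, both of which are assumptions for illustration rather than the real `train.pkl`.

```python
import os
import tempfile
from pickle import dump, load

# toy stand-ins for the cleaned reviews and their labels
docs = ['good fun film', 'dull plot']
labels = [1, 0]

with tempfile.TemporaryDirectory() as tmp_dir:
  path = os.path.join(tmp_dir, 'toy.pkl')
  # write the [documents, labels] pair in binary mode
  with open(path, 'wb') as handle:
    dump([docs, labels], handle)
  # read it back and unpack in the same order
  with open(path, 'rb') as handle:
    loaded_docs, loaded_labels = load(handle)

print(loaded_docs, loaded_labels)
```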

In [5]:
from os import listdir
from pickle import dump

# turn a doc into clean tokens
def clean_doc(doc):
  # split into tokens by white space
  tokens = doc.split()
  # prepare regex for char filtering
  re_punc = re.compile('[%s]' % re.escape(string.punctuation))
  # remove punctuation from each word
  tokens = [re_punc.sub('', w) for w in tokens]
  # remove remaining tokens that are not alphabetic
  tokens = [word for word in tokens if word.isalpha()]
  # filter out stop words
  stop_words = set(stopwords.words('english'))
  tokens = [w for w in tokens if not w in stop_words]
  # filter out short tokens
  tokens = [word for word in tokens if len(word) > 1]
  tokens = ' '.join(tokens)

  return tokens

# load all docs in a directory
def process_docs(directory, is_train):
  documents = list()
  # walk through all files in the folder
  for filename in listdir(directory):
    # skip any reviews in the test set
    if is_train and filename.startswith('cv9'):
      continue
    if not is_train and not filename.startswith('cv9'):
      continue 
    # create the full path of the file to open
    path = directory + '/' + filename
    # load the doc
    doc = load_doc(path)
    # clean doc
    tokens = clean_doc(doc)
    # add to list
    documents.append(tokens)

  return documents

# load and clean a dataset
def load_clean_dataset(is_train):
  # load documents
  neg = process_docs(base_path + '/neg', is_train)
  pos = process_docs(base_path + '/pos', is_train)

  docs = neg + pos

  # prepare labels
  labels = [0 for _ in range(len(neg))] + [1 for _ in range(len(pos))]

  return docs, labels

# save a dataset to file
def save_dataset(dataset, filename):
  dump(dataset, open(filename, 'wb'))
  print(f'Saved: {filename}')

# load and clean all reviews
train_docs, ytrain = load_clean_dataset(True)
test_docs, ytest = load_clean_dataset(False)

# save training datasets
save_dataset([train_docs, ytrain], 'train.pkl')
save_dataset([test_docs, ytest], 'test.pkl')

Saved: train.pkl
Saved: test.pkl


We are now ready to develop our model.

## Develop Multichannel Model

In this section, we will develop a multichannel convolutional neural network for the sentiment analysis prediction problem. This section is divided into 3 parts:

1.  Encode Data
2.  Define Model.
3.  Complete Example.

### Encode Data

The first step is to load the cleaned training dataset. The function below, named `load_dataset()`, can be called to load the pickled training dataset.


```python
from pickle import load

# load a clean dataset
def load_dataset(filename):
  return load(open(filename, 'rb'))

trainLines, trainLabels = load_dataset('train.pkl')
```

Next, we must fit a Keras Tokenizer on the training dataset. We will use this tokenizer to both define the vocabulary for the Embedding layer and encode the review documents as integers.

```python
# fit a tokenizer
def create_tokenizer(lines):
  tokenizer = Tokenizer()
  tokenizer.fit_on_texts(lines)
  return tokenizer
```

We also need to know the maximum length of input sequences as input for the model and to pad all sequences to the fixed length.

The function `max_length()` below calculates the maximum length (number of words) across all reviews in the training dataset.

```python
# calculate the maximum document length
def max_length(lines):
  return max([len(s.split()) for s in lines])
```

We also need to know the size of the vocabulary for the Embedding layer. This can be calculated from the prepared Tokenizer.

```python
# calculate vocabulary size
vocab_size = len(tokenizer.word_index) + 1
```
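
To make the integer encoding and the `+ 1` concrete, here is a simplified pure-Python stand-in for what the tokenizer does. It is a sketch under stated assumptions, not the Keras implementation: the helper names `build_word_index` and `encode` are hypothetical, and the real `Tokenizer` additionally lowercases text and filters punctuation by default.

```python
from collections import Counter

def build_word_index(lines):
  # count how often each whitespace-separated word occurs
  counts = Counter(word for line in lines for word in line.split())
  # most frequent word gets index 1; index 0 is left unused
  return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(lines, word_index):
  # replace each known word with its integer; unknown words are dropped
  return [[word_index[w] for w in line.split() if w in word_index] for line in lines]

toy_lines = ['great film great cast', 'great film']
toy_index = build_word_index(toy_lines)
toy_encoded = encode(toy_lines, toy_index)
# one extra slot because index 0 is reserved (it is used for padding)
toy_vocab_size = len(toy_index) + 1

print(toy_index)       # {'great': 1, 'film': 2, 'cast': 3}
print(toy_encoded)     # [[1, 2, 1, 3], [1, 2]]
print(toy_vocab_size)  # 4
```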

Finally, we can integer encode and pad the clean movie review text. The function below, named `encode_text()`, both encodes and pads text data to the maximum review length.

```python
# encode a list of lines
def encode_text(tokenizer, lines, length):
  # integer encode
  encoded = tokenizer.texts_to_sequences(lines)
  # pad encoded sequences
  padded = pad_sequences(encoded, maxlen=length, padding='post')
  return padded
```
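
The effect of `padding='post'` can be illustrated without Keras. The helper `pad_post` below is a hypothetical, simplified stand-in: it appends zeros to short sequences and, like the `pad_sequences` default of `truncating='pre'`, drops items from the front of sequences that are too long. It assumes `length` is at least 1.

```python
def pad_post(sequences, length, value=0):
  padded = []
  for seq in sequences:
    # keep at most `length` items, dropping from the front if too long
    seq = seq[-length:]
    # append the padding value until the sequence reaches `length`
    padded.append(seq + [value] * (length - len(seq)))
  return padded

toy_padded = pad_post([[1, 2, 1, 3], [1, 2]], length=4)
toy_truncated = pad_post([[5, 6, 7]], length=2)

print(toy_padded)     # [[1, 2, 1, 3], [1, 2, 0, 0]]
print(toy_truncated)  # [[6, 7]]
```

Truncation never triggers for the training data here, because the padding length is the longest training review, but it would matter for any new review longer than that.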








### Define Model

A standard model for document classification is to use an Embedding layer as input, followed by a one-dimensional convolutional neural network, pooling layer, and then a prediction output layer. The kernel size in the convolutional layer defines the number of words to consider as the convolution is passed across the input text document, providing a grouping parameter. 
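
To see how the kernel size acts as an n-gram window, the sketch below lists the word groups that a kernel of size 4 would cover as it slides across a short token sequence. It assumes no padding of the input (the Keras `Conv1D` default, `padding='valid'`) and a stride of 1; the helper name `ngram_windows` is hypothetical.

```python
def ngram_windows(tokens, kernel_size):
  # one window per valid kernel position: len(tokens) - kernel_size + 1 in total
  return [tokens[i:i + kernel_size] for i in range(len(tokens) - kernel_size + 1)]

# first six tokens of the cleaned review printed earlier
tokens = ['films', 'adapted', 'comic', 'books', 'plenty', 'success']
windows = ngram_windows(tokens, kernel_size=4)

for window in windows:
  print(window)
```

With six tokens and a kernel size of 4 there are three windows; a larger kernel sees more context per position but produces fewer positions.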

A multi-channel convolutional neural network for document classification involves using multiple versions of the standard model with different sized kernels. This allows the document to be processed at different resolutions or different n-grams (groups of words) at a time, whilst the model learns how to best integrate these interpretations.

In Keras, a multiple-input model can be defined using the functional API. We will define a model with three input channels for processing 4-grams, 6-grams, and 8-grams of movie review text. Each channel is comprised of the following elements:

* **Input layer** that defines the length of input sequences.
* **Embedding layer** set to the size of the vocabulary and 100-dimensional real-valued representations.
* **Conv1D layer** with 32 filters and a kernel size set to the number of words to read at once.
* **Dropout layer** with a rate of 0.5 applied to the convolutional output.
* **MaxPooling1D layer** to consolidate the output from the convolutional layer.
* **Flatten layer** to reduce the three-dimensional output to two dimensional for concatenation.

The outputs from the three channels are concatenated into a single vector and processed by a Dense layer and an output layer.

```python
from tensorflow.keras.layers import MaxPooling1D, concatenate

# define the model
def define_model(length, vocab_size):

  # channel 1
  inputs1 = Input(shape=(length,))
  embedding1 = Embedding(vocab_size, 100)(inputs1)
  conv1 = Conv1D(filters=32, kernel_size=4, activation='relu')(embedding1)
  drop1 = Dropout(0.5)(conv1)
  pool1 = MaxPooling1D(pool_size=2)(drop1)
  flat1 = Flatten()(pool1)

  # channel 2
  inputs2 = Input(shape=(length,))
  embedding2 = Embedding(vocab_size, 100)(inputs2)
  conv2 = Conv1D(filters=32, kernel_size=6, activation='relu')(embedding2)
  drop2 = Dropout(0.5)(conv2)
  pool2 = MaxPooling1D(pool_size=2)(drop2)
  flat2 = Flatten()(pool2)

  # channel 3
  inputs3 = Input(shape=(length,))
  embedding3 = Embedding(vocab_size, 100)(inputs3)
  conv3 = Conv1D(filters=32, kernel_size=8, activation='relu')(embedding3)
  drop3 = Dropout(0.5)(conv3)
  pool3 = MaxPooling1D(pool_size=2)(drop3)
  flat3 = Flatten()(pool3)

  # merge
  merged = concatenate([flat1, flat2, flat3])

  # interpretation
  dense1 = Dense(10, activation='relu')(merged)
  outputs = Dense(1, activation='sigmoid')(dense1)
  model = Model(inputs=[inputs1, inputs2, inputs3], outputs=outputs)

  # compile
  model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
  
  # summarize
  model.summary()
  plot_model(model, show_shapes=True, to_file='multichannel.png')
  return model
```
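
It can help to work out how large each channel's flattened output is before the merge. The arithmetic below is a sketch that assumes the settings used in the model definition (32 filters, `pool_size=2`, no input padding, stride 1); the input length of 10 is a toy value chosen for readability, and the helper name `channel_output_size` is hypothetical.

```python
def channel_output_size(length, kernel_size, filters=32, pool_size=2):
  # Conv1D with padding='valid' and stride 1
  conv_steps = length - kernel_size + 1
  # MaxPooling1D with stride equal to pool_size; any leftover step is discarded
  pooled_steps = conv_steps // pool_size
  # Flatten turns (pooled_steps, filters) into a single vector
  return pooled_steps * filters

length = 10  # toy input length, not the real maximum review length
sizes = [channel_output_size(length, k) for k in (4, 6, 8)]
merged_size = sum(sizes)

print(sizes)        # [96, 64, 32]
print(merged_size)  # 192
```

The larger the kernel, the shorter that channel's output, so the three channels contribute vectors of different lengths to the concatenation.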



We can put all of this together in a single example.

In [0]:
from tensorflow.keras.preprocessing.text import Tokenizer
from pickle import load


# load dataset
def load_dataset(filename):
  # load dataset
  return load(open(filename, 'rb'))

# fit a tokenizer
def create_tokenizer(lines):
  tokenizer = Tokenizer()
  tokenizer.fit_on_texts(lines)
  return tokenizer

# calculate the maximum document length
def max_length(lines):
  return max([len(line.split()) for line in lines])

# encode a list of lines
def encode_text(tokenizer, lines, max_length):
  # integer encode
  encoded = tokenizer.texts_to_sequences(lines)
  # pad sequences
  padded = pad_sequences(encoded, maxlen=max_length, padding='post')

  return padded

# define the model
def define_model(max_length, vocab_size):

  print('Creating channel 1.......')
  # channel 1
  inputs1 = Input(shape=(max_length, ))
  embedding1 = Embedding(vocab_size, 100)(inputs1)
  conv1 = Conv1D(filters=32, kernel_size=4, activation='relu')(embedding1)
  drop1 = Dropout(0.5)(conv1)
  pool1 = MaxPool1D(pool_size=2)(drop1)
  flat1 = Flatten()(pool1)

  print('Creating channel 2.......')
  # channel 2
  inputs2 = Input(shape=(max_length, ))
  embedding2 = Embedding(vocab_size, 100)(inputs2)
  conv2 = Conv1D(filters=32, kernel_size=6, activation='relu')(embedding2)
  drop2 = Dropout(0.5)(conv2)
  pool2 = MaxPool1D(pool_size=2)(drop2)
  flat2 = Flatten()(pool2)

  print('Creating channel 3.......')
  # channel 3
  inputs3 = Input(shape=(max_length, ))
  embedding3 = Embedding(vocab_size, 100)(inputs3)
  conv3 = Conv1D(filters=32, kernel_size=8, activation='relu')(embedding3)
  drop3 = Dropout(0.5)(conv3)
  pool3 = MaxPool1D(pool_size=2)(drop3)
  flat3 = Flatten()(pool3)

  print('Merging all channels.......')
  # merge all channels: Concatenate is a layer, so instantiate it, then call it on the tensors
  merged_layer = Concatenate()([flat1, flat2, flat3])

  # interpretation
  dense_layer = Dense(10, activation='relu')(merged_layer)
  output_layer = Dense(1, activation='sigmoid')(dense_layer)

  print('Creating model.......')
  # create model
  model = Model(inputs=[inputs1, inputs2, inputs3], outputs=output_layer)

  # compile model
  model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

  # summarize defined model
  model.summary()

  # plot model architecture
  plot_model(model, to_file='model.png', show_shapes=True)

  return model

print('Loading dataset.......')
# load training dataset
trainLines, trainLabels = load_dataset('train.pkl')
# convert to array
trainLines = np.array(trainLines)
trainLabels = np.array(trainLabels)

# create the tokenizer
tokenizer = create_tokenizer(trainLines)

# calculate the maximum document length
max_length = max_length(trainLines)
print(f'Maximum document length: {str(max_length)}')

# calculate vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print(f'Vocabulary size: {str(vocab_size)}')

# encode data
trainX = encode_text(tokenizer, trainLines, max_length)

print('Creating model.......')
# define model
model = define_model(max_length, vocab_size)

print('Training model.......')
# fit model
model.fit([trainX, trainX, trainX], trainLabels, epochs=7, batch_size=16, verbose=1)

# save the model
model.save('model.h5')

## Evaluate Model