# Practical 5.1 Modeling Text

## Basic data preprocessing for modeling text sequences

In [1]:
from __future__ import print_function

## 1. Data description

We will use the IMDB review data set to train a Recurrent Neural Network (RNN) model, using two types of text sequences as model input: characters and words. The data can be downloaded from https://storage.googleapis.com/trl_data/imdb_dataset.zip. The training set contains 25,000 reviews labeled 0 for "negative" sentiment and 1 for "positive" sentiment. The validation set has no label column; its binary labels (0 and 1) must be derived from the attribute "id" of the data set, where the number after the character '\_' is the rating score. If the rating is 7 or greater, the label is 1 ("positive" sentiment). Otherwise, the label is 0 ("negative" sentiment).

Example of (part of) the original text in the data set:

```
id	sentiment	review

"7759_3"	0	"The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to Primal Park . A secret project mutating a primal animal using fossilized DNA, like ¨Jurassik Park¨, and some scientists resurrect one of nature's most fearsome predators, the Sabretooth tiger or Smilodon . Scientific ambition turns deadly, however, and when the high voltage fence is opened the creature escape and begins savagely stalking its prey - the human visitors , tourists and scientific.Meanwhile some youngsters enter in the restricted area of the security center and are attacked by a pack of large pre-historical animals which are deadlier and bigger ."

```
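
The rule for deriving a label from the "id" attribute can be sketched as a small function. This is only an illustration: the name `label_from_id` is hypothetical, and the threshold (a rating of 7 or more counts as positive) is assumed to match the one used in the validation-set cleaning code of section 3.2.

```python
def label_from_id(doc_id):
    """Derive a binary sentiment label from an id such as '7759_3'."""
    # The part after the last underscore is the rating score.
    rating = int(doc_id.rsplit('_', 1)[1])
    # Assumed threshold: a rating of 7 or more is positive, anything else negative.
    return 1 if rating >= 7 else 0

print(label_from_id('7759_3'))    # rating 3  -> 0
print(label_from_id('12311_10'))  # rating 10 -> 1
```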

## 2. Problem Definition

Given a text (e.g. a movie review), we need to predict whether the review is positive (class label = 1) or negative (class label = 0). We will work with two types of preprocessing to create the sequences used as model input: character-level and word-level.

## 3. Data Preprocessing

Basic data preprocessing for text sequences:

* Cleaning raw text data
    - remove HTML tags
    - remove non-informative characters
* Tokenizing raw text into an array of word tokens (for word-level sequences)
* Creating vocabulary indices: character-based and word-based lookup dictionaries
* Transforming tokenized text into integer sequences (based on the lookup vocabulary index)
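
Before working with the full data set, the steps above can be tried on a single made-up review. The sketch below is a simplified stand-in for the cells that follow: the sample string and the variable names are assumptions chosen for illustration, and the vocabulary is built in order of first appearance rather than by frequency.

```python
import re

raw = "A great film.<br /><br />Truly a GREAT cast!"  # made-up review

# 1. Clean: drop HTML tags and non-ASCII characters, then lowercase.
text = re.sub(r'<.*?>', '', raw)
text = re.sub(r'[^\x00-\x7f]', '', text).lower()

# 2. Tokenize: split on anything that is not a lowercase letter or a digit.
tokens = re.sub(r'[^a-z0-9]+', ' ', text).split()

# 3. Vocabulary index: index 0 is reserved for padding.
vocab = {'<pad>': 0}
for tok in tokens:
    vocab.setdefault(tok, len(vocab))

# 4. Integer sequence based on the lookup vocabulary index.
sequence = [vocab[tok] for tok in tokens]

print(tokens)    # ['a', 'great', 'film', 'truly', 'a', 'great', 'cast']
print(sequence)  # [1, 2, 3, 4, 1, 2, 5]
```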

In [2]:
import os
import sys
import numpy as np
import pandas as pd
pd.options.display.max_colwidth = 100
import re
import nltk

DATA_PATH = 'data'
EMBEDDING_PATH = 'embedding'
MODEL_PATH = 'model'

Create the directories above under your current working directory. Download the provided data set and place it in the 'data' directory.

### 3.1. Read data

In [3]:
# functions to clean raw text data

def striphtml(html):
    # remove HTML tags such as <br />
    p = re.compile(r'<.*?>')
    return p.sub('', html)

def clean(s):
    # remove non-ASCII characters
    return re.sub(r'[^\x00-\x7f]', r'', s)
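
To see what the two cleaning functions do, it helps to run them on a short string. The sketch below repeats the functions in condensed form so that it runs on its own; the sample sentence is made up. Note that tags are removed without inserting a space, so the sentences on either side of a `<br />` end up joined, which is why the example in section 1 contains "scientific.Meanwhile".

```python
import re

def striphtml(html):
    # Remove anything that looks like an HTML tag.
    return re.sub(r'<.*?>', '', html)

def clean(s):
    # Remove every character outside the ASCII range.
    return re.sub(r'[^\x00-\x7f]', r'', s)

sample = "Like \u00a8Jurassik Park\u00a8.<br /><br />Meanwhile, caf\u00e9 scenes."
cleaned = clean(striphtml(sample))
print(cleaned)  # Like Jurassik Park.Meanwhile, caf scenes.
```

For word-level sequences this joining is harmless, because the tokenizer used later splits on punctuation. For character-level sequences, replacing tags with a single space instead of an empty string would be one possible variation.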

In [4]:
train_data = pd.read_csv(os.path.join(DATA_PATH,"trainingData.tsv"), header=0, delimiter="\t")

In [5]:
valid_data = pd.read_csv(os.path.join(DATA_PATH,"validationData.tsv"), header=0, delimiter="\t")

In [6]:
train_data[:5]

| | id | sentiment | review |
|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment with MJ i've started listening to his music, watchi... |
| 1 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hines is a very entertaining film that obviously goe... |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to Pr... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this film (\the greatest filmed opera ever,\" didn't I... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 80's exploitation, hooray! The pre-credits opening ... |


In [7]:
valid_data[:5]

| | id | review |
|---|---|---|
| 0 | 12311_10 | Naturally in a film who's main themes are of mortality, nostalgia, and loss of innocence it is p... |
| 1 | 8348_2 | This movie is a disaster within a disaster film. It is full of great action scenes, which are on... |
| 2 | 5828_4 | All in all, this is a movie for kids. We saw it tonight and my child loved it. At one point my k... |
| 3 | 7186_2 | Afraid of the Dark left me with the impression that several different screenplays were written, ... |
| 4 | 12128_7 | A very accurate depiction of small time mob life filmed in New Jersey. The story, characters and... |


### 3.2. Clean data

### Cleaning training set

In [8]:
# create a cleaned version of the training set

train_docs = []
train_labels = []
for cont, sentiment in zip(train_data.review, train_data.sentiment):
    
    doc = clean(striphtml(cont))
    doc = doc.lower() 
    train_docs.append(doc)
    train_labels.append(sentiment)

### Cleaning validation set

In [9]:
# create a cleaned version of the validation set
# we also need to extract labels from attribute 'id'

valid_docs = []
valid_labels = []
for docid,cont in zip(valid_data.id, valid_data.review):
    
    id_label = docid.split('_')
    # if rating >= 7, then assign 1 (positive sentiment) as label
    if(int(id_label[1]) >= 7):
        valid_labels.append(1)
    # else, assign 0 (negative sentiment) as label
    else:
        valid_labels.append(0)         
    doc = clean(striphtml(cont))
    doc = doc.lower() 
    valid_docs.append(doc)

### 3.3. Build vocabulary index

### Character-level vocabulary index

Notice that to generate the lookup vocabulary index for character-level text sequences, we use characters from both the training and the validation set, in contrast to the preprocessing of word sequences (later), where only the training set is used. This is feasible because the number of unique characters is much smaller than the number of unique words in the corresponding document corpus.

In [10]:
# concatenate all documents from both sets into one string
txt = ''.join(train_docs) + ''.join(valid_docs)

In [11]:
chars = set(txt)
print('total chars:', len(chars))

total chars: 71


In [12]:
# pairs of character - index of character in look up vocabulary
char_indices = dict((c, i) for i, c in enumerate(chars))

# pairs of index of character - character in look up vocabulary
indices_char = dict((i, c) for i, c in enumerate(chars))
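
Two properties of the index above are worth keeping in mind. The iteration order of a `set` of strings can differ between Python sessions, so the same character may receive a different index when the notebook is re-run. In addition, `enumerate` starts at 0, so one real character shares index 0 with the zero padding used in section 3.4. The following is a sketch of one possible variation, not what this notebook does: it sorts the characters and starts counting at 1. The toy documents are assumptions, and applying this to the real data would change the indices shown in the outputs.

```python
docs = ["good film", "bad film!"]  # toy stand-in for train_docs + valid_docs

# Sorting makes the index independent of set iteration order.
chars = sorted(set(''.join(docs)))

# Start at 1 so that index 0 stays free for padding.
char_indices = {c: i for i, c in enumerate(chars, start=1)}
indices_char = {i: c for c, i in char_indices.items()}

print(char_indices)
```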

In [13]:
list(char_indices.items())[:5]

[('a', 0), ('0', 1), ('\x08', 57), ('8', 3), ('c', 70)]

In [14]:
list(indices_char.items())[:5]

[(0, 'a'), (1, '0'), (2, '!'), (3, '8'), (4, ')')]

In [15]:
# save vocabulary index

np.save(os.path.join(DATA_PATH,'char_indices.npy'), char_indices)
np.save(os.path.join(DATA_PATH,'indices_char.npy'), indices_char)
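
Because `np.save` stores a Python dictionary as a pickled zero-dimensional object array, reading it back takes two extra steps. The sketch below uses a toy dictionary and a temporary directory (both assumptions for illustration) to show the round trip.

```python
import os
import tempfile
import numpy as np

char_indices = {'a': 0, 'b': 1}  # toy dictionary for illustration

path = os.path.join(tempfile.mkdtemp(), 'char_indices.npy')
np.save(path, char_indices)      # stored as a pickled 0-d object array

# allow_pickle=True is required to read object arrays back;
# .item() unwraps the 0-d array into the original dict.
loaded = np.load(path, allow_pickle=True).item()
print(loaded == char_indices)    # True
```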

### Word-level vocabulary index

In [16]:
# FUNCTION to tokenize documents into array list of words
# you may also use nltk tokenizer, sklearn tokenizer, or keras tokenizer - 
# but for the tutorial in text modeling, we will use below function: 

def tokenizeWords(text):
    
    tokens = re.sub(r"[^a-z0-9]+", " ", text.lower()).split()
    return [str(strtokens) for strtokens in tokens]

# FUNCTION to create word-level vocabulary index

def indexingVocabulary(array_of_words):

    wordIndex = list(array_of_words)
    
    # we will later pad our sequences to a fixed length, so
    # we reserve index 0 for the padding token
    wordIndex.insert(0,'<pad>')
    
    # word token '<start>' marks the start of a sequence. We won't use it for this model,
    # but for a later model (sequence-to-sequence model)
    wordIndex.append('<start>')
    
    # word token '<end>' marks the end of a sequence. We won't use it for this model,
    # but for a later model (sequence-to-sequence model)
    wordIndex.append('<end>')
    
    # word token '<unk>' stands for unknown (out-of-vocabulary) words
    wordIndex.append('<unk>')
    
    vocab=dict([(i,wordIndex[i]) for i in range(len(wordIndex))])
    
    return vocab
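
A quick check on toy input shows what the two functions return. The sketch below restates them in condensed form so that it runs on its own; the sample sentence and the two-word vocabulary are made up. Note that the returned dictionary maps index to word, with '<pad>' first and the other special tokens last.

```python
import re

def tokenizeWords(text):
    # Condensed equivalent of the function above.
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()

def indexingVocabulary(array_of_words):
    # Condensed equivalent: '<pad>' first, the other special tokens last.
    wordIndex = ['<pad>'] + list(array_of_words) + ['<start>', '<end>', '<unk>']
    return dict(enumerate(wordIndex))

tokens = tokenizeWords("It's a 10/10 film!")
vocab = indexingVocabulary(['the', 'film'])

print(tokens)  # ['it', 's', 'a', '10', '10', 'film']
print(vocab)   # {0: '<pad>', 1: 'the', 2: 'film', 3: '<start>', 4: '<end>', 5: '<unk>'}
```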

### Tokenization (for word sequences as model input)

Create a list of tokenized documents and a merged list of all word tokens, which is used to generate the vocabulary index. Notice that we only use the 10,000 most frequent words from the training set. Out-of-vocabulary (OOV) words will be represented as '<unk>' (unknown word).
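
The effect of limiting the vocabulary can be seen on a toy corpus. The sketch below uses `collections.Counter` from the standard library instead of `nltk.FreqDist` (which is itself a `Counter` subclass) and keeps only the 2 most frequent words; the corpus, the value of `top_k`, and the name `word_to_index` are assumptions made for illustration.

```python
from collections import Counter

corpus = [['good', 'film'], ['bad', 'film'], ['good', 'good', 'cast']]
top_k = 2  # the notebook uses 10000; 2 keeps the toy example readable

# Count every token in the corpus and keep the top_k most frequent words.
counts = Counter(tok for doc in corpus for tok in doc)
common = [w for w, _ in counts.most_common(top_k)]

# Index 0 is padding, frequent words follow, '<unk>' comes last.
word_to_index = {'<pad>': 0}
for w in common:
    word_to_index[w] = len(word_to_index)
word_to_index['<unk>'] = len(word_to_index)

# Words outside the vocabulary all collapse to the '<unk>' index.
unk = word_to_index['<unk>']
encoded = [[word_to_index.get(t, unk) for t in doc] for doc in corpus]

print(common)   # ['good', 'film']
print(encoded)  # [[1, 2], [3, 2], [1, 1, 3]]
```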

In [17]:
# tokenize text from training set

train_str_tokens = []
all_tokens = []
for text in train_docs:
    # tokenize each document only once
    tokens = tokenizeWords(text)
    
    # this will create our training corpus
    train_str_tokens.append(tokens)
    
    # this will be our merged array to create vocabulary index
    all_tokens.extend(tokens)

In [18]:
# likewise, tokenize text from validation set

valid_str_tokens = []
for i, text in enumerate(valid_docs):

    valid_str_tokens.append(tokenizeWords(text))

In [19]:
# use nltk to count word frequencies and keep the 10,000 most frequent words for the vocabulary index

tf = nltk.FreqDist(all_tokens)
common_words = tf.most_common(10000)
arr_common = np.array(common_words)
words = arr_common[:,0]

# create vocabulary index

# index - word pairs (note: despite its name, words_indices maps index -> word)
words_indices = indexingVocabulary(words)

# word - index pairs (indices_words maps word -> index and is used for encoding later)
indices_words = dict((v,k) for (k,v) in words_indices.items())

In [20]:
list(words_indices.items())[:5]

[(0, '<pad>'), (1, 'the'), (2, 'and'), (3, 'a'), (4, 'of')]

In [21]:
list(indices_words.items())[:5]

[('order', 652),
 ('stirring', 8397),
 ('elementary', 9980),
 ('duke', 3430),
 ('unfortunately', 469)]

In [22]:
# save vocabulary index

np.save(os.path.join(DATA_PATH,'words_indices.npy'), words_indices)
np.save(os.path.join(DATA_PATH,'indices_words.npy'), indices_words)

### 3.4. Preparing model input - output

### Character-level sequences

We set the maximum length of the character sequences used as model input to 500 characters. Longer reviews are truncated, and shorter ones are padded with zeros. Using the vocabulary index as our lookup dictionary, we transform the character sequences into integer sequences.

In [23]:
# define maximum length of input sequence for the model 
maxlen = 500 # 500 characters length

# initialize the sequences as a numpy array of zeros;
# the zeros act as padding when the text is shorter than maxlen characters
X_train = np.zeros((len(train_docs), maxlen), dtype=np.int32)
y_train = np.array(train_labels)

# transform sequence of characters into their integer format of sequence (based on look up vocabulary index)
for i, doc in enumerate(train_docs):
    len_doc = len(doc)
    if len_doc > maxlen:
        txt = doc[:maxlen]
    else:
        txt = doc
    for j, char in enumerate(txt):
        X_train[i, j] = char_indices[char]
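
Since the same loop is needed for both the training and the validation set, it could also be wrapped in a helper. The function name `encode_chars` and the toy index below are hypothetical; the logic is meant to mirror the cell above, truncating long documents and leaving zeros on the right of short ones.

```python
import numpy as np

def encode_chars(docs, char_indices, maxlen):
    """Return a (len(docs), maxlen) int32 array, truncated and zero-padded on the right."""
    X = np.zeros((len(docs), maxlen), dtype=np.int32)
    for i, doc in enumerate(docs):
        # Slicing covers both cases: long documents are cut, short ones kept whole.
        for j, char in enumerate(doc[:maxlen]):
            X[i, j] = char_indices[char]
    return X

toy_index = {'a': 1, 'b': 2, 'c': 3}  # hypothetical index, 0 left for padding
X = encode_chars(['abc', 'cabbab'], toy_index, maxlen=4)
print(X.tolist())  # [[1, 2, 3, 0], [3, 1, 2, 2]]
```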

Repeat the same steps for the validation set.

In [24]:
X_valid = np.zeros((len(valid_docs), maxlen), dtype=np.int32) 
y_valid = np.array(valid_labels)

for i, doc in enumerate(valid_docs):
    len_doc = len(doc)
    if len_doc > maxlen:
        txt = doc[:maxlen]
    else:
        txt = doc
    for j, char in enumerate(txt):
        X_valid[i, j] = char_indices[char]

In [25]:
# save files

np.save(os.path.join(DATA_PATH,'X_train_char.npy'), X_train)
np.save(os.path.join(DATA_PATH,'y_train_char.npy'), y_train)

np.save(os.path.join(DATA_PATH,'X_valid_char.npy'), X_valid)
np.save(os.path.join(DATA_PATH,'y_valid_char.npy'), y_valid)

### Word-level sequences

In [27]:
# integer format of training input 
train_int_input = []
for i, text in enumerate(train_str_tokens):
    int_tokens = [indices_words[w] if w in indices_words.keys() else indices_words['<unk>'] for w in text ]
    train_int_input.append(int_tokens)

In [28]:
# integer format of validation input 
valid_int_input = []
for i, text in enumerate(valid_str_tokens):
    int_tokens = [indices_words[w] if w in indices_words.keys() else indices_words['<unk>'] for w in text ]
    valid_int_input.append(int_tokens)
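
A useful sanity check is to map an encoded review back to words and see which tokens became '<unk>'. The sketch below uses a tiny made-up vocabulary with descriptive names; in this notebook the word-to-index dictionary is `indices_words` and the index-to-word dictionary is `words_indices`.

```python
index_to_word = {0: '<pad>', 1: 'the', 2: 'film', 3: '<unk>'}  # toy index -> word
word_to_index = {w: i for i, w in index_to_word.items()}       # toy word -> index

tokens = ['the', 'film', 'bored', 'the', 'audience']
unk = word_to_index['<unk>']

# Encode with a fallback to '<unk>', then decode again.
encoded = [word_to_index.get(w, unk) for w in tokens]
decoded = [index_to_word[i] for i in encoded]

print(encoded)  # [1, 2, 3, 1, 3]
print(decoded)  # ['the', 'film', '<unk>', 'the', '<unk>']
```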

In [29]:
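# reviews have different lengths, so we build object arrays of lists
# (recent NumPy versions refuse to create ragged arrays implicitly)
X_train_arr = np.array(train_int_input, dtype=object)
y_train = np.array(train_labels)

X_valid_arr = np.array(valid_int_input, dtype=object)
y_valid = np.array(valid_labels)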

#### Padding word sequences

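We use a fixed input length of 500 words. Here, we use the Keras padding function pad_sequences, which by default pads with zeros at the start of a sequence and truncates longer sequences from the start; you may also define your own padding function.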

In [30]:
from keras.preprocessing import sequence

max_review_length = 500
X_train = sequence.pad_sequences(X_train_arr, maxlen=max_review_length)
X_valid = sequence.pad_sequences(X_valid_arr, maxlen=max_review_length)
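
As a rough idea of what a hand-written padding function could look like, the sketch below imitates the default behavior of `pad_sequences` (zeros added on the left, longer sequences cut from the left). The name `pad_pre` is hypothetical, and the function is a simplified illustration rather than a drop-in replacement for the Keras implementation.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` items of longer sequences."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for i, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # an empty sequence stays all padding
        trunc = seq[-maxlen:]
        out[i, -len(trunc):] = trunc
    return out

padded = pad_pre([[1, 2], [3, 4, 5, 6, 7]], maxlen=4)
print(padded.tolist())  # [[0, 0, 1, 2], [4, 5, 6, 7]]
```

Note that this differs from the character-level sequences above, where the zeros end up on the right of short documents.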

In [31]:
# save files

np.save(os.path.join(DATA_PATH,'X_train_word.npy'), X_train)
np.save(os.path.join(DATA_PATH,'y_train_word.npy'), y_train)

np.save(os.path.join(DATA_PATH,'X_valid_word.npy'), X_valid)
np.save(os.path.join(DATA_PATH,'y_valid_word.npy'), y_valid)