Text classification involves the use of word embedding for representing words and a Convolutional Neural Netword (CNN). The model can be described as following:   
* **Word Embedding** A distributed representation of words where different words that have a similar meaning (based on their usage) also have a similar representation.   
* **Convolutional Model**: A feature extraction model that learns to extract salient feature from documents represented using a word embedding.   
* **Fully connected model**: The intepretation of extracted features in term of a predictive output.  

Initial step, we do some preprocessing data.

In [18]:
import re
import numpy as np
import string
from os import listdir
from collections import Counter
from nltk.corpus import stopwords
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


In [7]:
# load doc into mem
def load_doc(filename):
    file = open(filename,'r')
    text = file.read()
    file.close()
    return text

# generate token from a doc
def clean_doc(doc):
    tokens = doc.split()
    # prepare regex for char filtering
    re_punc = re.compile('[%s]'% re.escape(string.punctuation))
    # remove punctuation from each word
    tokens = [re_punc.sub('',w) for w in tokens]
    # remove non alphabetic token
    tokens =[word for word in tokens if word.isalpha()]

    # filter out stop words
    stop_words =set(stopwords.words('english'))
    tokens = [w for w in tokens if not w in stop_words]
    
    
    # filter out short token
    tokens = [w for w in tokens if len(w) >1]
    
    return tokens

def add_doc_to_vocab(filename,vocab):
    doc = load_doc(filename)
    # clean doc
    tokens = clean_doc(doc)
    # update count
    vocab.update(tokens)
    
def process_docs(directory, vocab):
    # walk through all files
    for filename in listdir(directory):
        # skip any reviews in the test set
        if filename.startswith('cv9'):
            continue
        path = directory +'/'+filename
        
        # add doc to vocab
        add_doc_to_vocab(path,vocab)
        
        
def save_list(lines,filename):
    # convert lines to a single blob of text
    data = '\n'.join(lines)
    # open file
    file = open(filename,'w')
    file.write(data)
    file.close() 

    
#define vocab
vocab = Counter()
process_docs('/home/tri/Downloads/txt_sentoken/pos',vocab)
process_docs('/home/tri/Downloads/txt_sentoken/neg',vocab)

print(len(vocab))
# keep tokens with a min occurences
min_occurrence =2

tokens = [k for k,c in vocab.items() if c >- min_occurrence]
print(len(tokens))

# save token to vocabulary
save_list(tokens,'/home/tri/Downloads/vocab.txt')

44276
44276


## Training CNN with Embedding layer   

Remind that **word embedding** illustrates text where each word in the vocabulary is represented by a real valued vector in a high dimensional space. This is better than loosing a relationship between words in the sentence as **bag-of-word**. This representation can be used while training the neural netword. Keras supports it with **Embedding** layer.   
First load file pre_contructed **vocab.txt** with one word per line. Then, load all the training data movies review and generate a clean tokens with updated clean_doc(doc,vocab). 

In [11]:
# generate clean tokens from a doc
def clean_doc(doc, vocab):
    # split into tokens
    tokens =  doc.split()
    # create regex for filtering
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    # remove punctuation
    tokens = [re_punc.sub('',w) for w in tokens]
    tokens = [w for w in tokens if w in vocab]
    tokens = ' '.join(tokens)
    return tokens

# rewrite process_docs
def process_docs(directory,vocab,is_train):
    documents  = list()
    
    for filename in listdir(directory):
        # skip any review in test set
        if is_train and filename.startswith('cv9'):
            continue
        if is_train and not filename.startswith('cv9'):
            continue
        # get full path
        path = directory +'/'+ filename
        # load the doc
        doc = load_doc(path)
        # clean doc
        tokens = clean_doc(doc,vocab)
        # add to list
        documents.append(tokens)
    return documents

def load_clean_dataset(vocab, is_train):
    neg = process_docs('/home/tri/Downloads/txt_sentoken/neg',vocab,is_train)
    pos = process_docs('/home/tri/Downloads/txt_sentoken/pos',vocab,is_train)
    docs = neg+ pos
    
    labels  =np.array([0 for _ in range(len(neg))] + [1 for _ in range(len(pos))])
    return docs,labels
        

We need to encode each document as a sequence of integer. Keras Embedding layer requires integer inputs where each integer maps to a single token that has a specific real-valued vector within th embedding. These vectors are random at the beginning of training, but get more meaningful later in training. We use API Keras Tokenizer to do this as  followings

In [20]:
# fit a tokenizer
def create_tokenizer(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer


After mapping word to integers, we use **texts_to_sequences()** function on Tokenizer. We also need all documents have the same length by padding with **pad_sequence()**. One way is to pad all reviews to the length of the longest review in the training set. This max value can be computed from the max()function: 
>max_length = max([len(s.split()) for s in train_docs])     
The max value is used in pad_sequences

To create neural network model, we use Embedding as the first hidden layer. There parameters of Embedding layer are vocabulary size, the size of the real-valued vector space, and the maximum length of input documents.   
* The **vocabulary size** is the total number of words in our vocabulary plus one for unknown word. This could be the vocab set length or the size of the vocab within tokenizer used to integer encode the document, for example:   
>vocab_size =len(tokenizer.word_index)+1


We will use 100 dimensional vector space but other values can be used. The maximum docment length is calculated above in the max_length variable used during padding.

In [21]:
# integer encode and pad doc
def encode_docs(tokenizer,max_length, docs):
    encoded = tokenizer.texts_to_sequences(docs)
    padded = pad_sequences(encoded, maxlen=max_length, padding='post')
    return padded
    

# create model
def create_model(vocab_size, max_length):
    model =Sequential()
    model.add(Embedding(vocab_size, 100, input_length = max_length))
    model.add(Conv1D(filters=32,kernel_size=8,activation ='relu'))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Flatten())
    model.add(Dense(10,activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # compile
    model.cmpile(loss='binary_crossentropy',optimizer='adam', metrics=['acc'])
    # summary
    model.summary()
    
    return model

In [22]:
# load the vocabulary
vocab_filename ='/home/tri/Downloads/vocab.txt'
vocab = load_doc(vocab_filename)
vocab = set(vocab.split())

# load training data
train_docs, ytrain = load_clean_dataset(vocab,True)

#create the tokenizer
tokenizer = create_tokenizer(train_docs)

# define v0cabulary size
vocab_size = len(tokenizer.word_index)+1
print('Vocabulary size: %d' % vocab_size)

# calculate the maximum sequence length
max_length = max([len(s.split()) for s in train_docs])
print('Maximum length: %d' % max_length)

# encode data
Xtrain = encode_docs(tokenizer, max_length, train_docs)


#define model
model = create_model(vocab_size,max_length)

# fit 
model.fit(Xtrain,ytrain,epochs =10, verbose=2)

# save model
modl.save('model.h5')

Vocabulary size: 1


ValueError: max() arg is an empty sequence