## Convolutional Neural Networks for Sentence Classification

#### Members Names: Gagandip Chane, Devika Kabe

#### Members Emails: gchane@ryerson.ca, dkabe@ryerson.ca

# Introduction:

#### Problem Description:

Perform sentence-level classification tasks using CNNs with only one layer of convolution.

#### Context of the Problem:

This problem matters because it highlights the value of pre-trained word vectors. The paper shows that they act as general-purpose feature extractors that can be reused across classification tasks, e.g. sentence classification, question classification and polarity detection. It also suggests that CNNs are computationally efficient and that, for tasks such as sentiment analysis, the sequential nature of the data is often not needed, since the main takeaway is whether something was "good" or "bad". Thus, a simple CNN can be sufficient in comparison to traditional approaches such as Naive Bayes, Logistic Regression, and Support Vector Machines.

#### Limitation About other Approaches:

Kalchbrenner et al. (2014) use a max Time Delay Neural Network (max-TDNN) as a comparison. They report that the size of the filters is limited by the span of the weights: increasing the span widens the range of the filters, which in turn raises the minimum sentence length required. Zhang et al. (2016) implement a simple multi-group norm constraint convolutional neural network (MGNC-CNN) that aims for a short run time, but run time grows as more word embeddings are added. Yin and Schutze (2015) propose a Multichannel Variable-Size CNN (MVCNN), a CNN architecture for sentence classification. The model is complex both in terms of implementation and run time, and it requires that all input word embeddings have the same dimensionality.

#### Solution:

Train a simple CNN with one layer of convolution on top of word vectors. Concatenate the word vectors and then generate features by applying a filter to a window of h words. This produces a feature map. Apply max-pooling over the feature map and take the maximum value as the feature corresponding to this filter. Repeat this with multiple filters to obtain multiple features. These features are passed through a fully connected softmax layer whose output is the probability distribution over labels.

# Background



| Reference | Explanation | Dataset/Input | Weakness |
| --- | --- | --- | --- |
| Kalchbrenner et al. [1] | Applies max-TDNN to multiple datasets. The sentence is viewed as having a time dimension and the convolution is applied over the time dimension. | SST1, TREC, Twitter sentiment | Max-TDNN reaches only 37.4% accuracy, and the size of the feature detectors (filters) is limited. Increasing it also increases the minimum sentence length required. |
| Zhang et al. [2] | They implement a simple multi-group norm constraint CNN. What this means is it treats each word embedding as a distinct group and applies CNNs independently to each one. | SST1, SST2, Subj, TREC, Irony | High accuracy but must tune norm constraint hyperparameter for all word embeddings, run time will increase as word embeddings increase. |
| Yin and Schutze [3] | Multichannel Variable-Size CNN for sentence classification. This means that the input is a 3 dimensional array of size c x d x s where c is the number of word embeddings, d is the dimension of word embeddings and s is the sentence length. | SST1, Sentiment140, Subj | Requires that input word embeddings have the same dimensionality and model is complex in terms of implementation and run time. |
| Kim, Y. [4] | Train CNN on top of word vectors with one convolution layer and apply max over-time pooling. | MR, SST1, SST2, Subj, TREC, CR, MPQA | Further work on regularizing the fine-tuning process is warranted. Multichannel architecture did not prevent overfitting as hoped. |


# Methodology

#### The paper uses several variants of the model, namely:
#### CNN-rand: 
This is used as a baseline. All words are initialized at random and then modified during training. <br>
#### CNN-static: 
A model with pre-trained vectors from word2vec. All word vectors, including those of unknown words that are randomly initialized, are kept static and only the other parameters of the model are learned. <br>
#### CNN-non-static: 
Same as above but the pretrained vectors are fine-tuned for each task. <br>
#### CNN-multichannel: 
A model with two sets of word vectors. Each set of vectors is treated as a ‘channel’ and each filter is applied to both channels, but gradients are backpropagated only through one of the channels.   
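
The variants differ only in how the embedding vectors are initialized and whether they are updated during training. The following is a rough summary in code, not part of the paper or the implementation: `VARIANTS` and `trainable_channels` are illustrative names, and it is assumed that both multichannel channels start from the pre-trained vectors.

```python
# Illustrative summary of the model variants described above.
# Each channel is a (pretrained, trainable) pair.
# Assumption: both multichannel channels start from pre-trained vectors.
VARIANTS = {
    "CNN-rand": [(False, True)],
    "CNN-static": [(True, False)],
    "CNN-non-static": [(True, True)],
    "CNN-multichannel": [(True, False), (True, True)],
}

def trainable_channels(variant):
    """Count the embedding channels that receive gradient updates."""
    return sum(1 for _pretrained, trainable in VARIANTS[variant] if trainable)

for name, channels in VARIANTS.items():
    print(name, "channels:", len(channels), "trainable:", trainable_channels(name))
```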

#### Step 1: Process the input data by creating word embeddings for all words in the sentence and concatenating them

Let $x_{i}$ ∈ $R^{k}$ be the k-dimensional word vector corresponding to the i-th word in the sentence. <br>
A sentence of length n (padded where necessary) is represented as: <br>
$x_{1:n}$ = $x_{1}$ ⊕ $x_{2}$ ⊕ . . . ⊕ $x_{n}$ <br>
⊕ is the concatenation operator, so it will look like:
![Alternate text ](image3.png "Max pooling applied")
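
As a minimal sketch of Step 1, the snippet below builds the sentence representation with NumPy. The vocabulary, the random embedding table and the sizes `k` and `n` are toy assumptions chosen for illustration, not values from the paper.

```python
import numpy as np

k = 4            # embedding dimension (toy value)
n = 6            # padded sentence length (toy value)
rng = np.random.default_rng(0)

# Hypothetical toy vocabulary; index 0 is reserved for padding.
vocab = {"<PAD/>": 0, "the": 1, "movie": 2, "was": 3, "great": 4}
embeddings = rng.normal(size=(len(vocab), k))

tokens = ["the", "movie", "was", "great"]
ids = [vocab[t] for t in tokens]
ids = ids + [vocab["<PAD/>"]] * (n - len(ids))   # pad to length n

# Row i of sentence_matrix is x_i; flattening the rows gives the
# concatenation of x_1, ..., x_n.
sentence_matrix = embeddings[ids]        # shape (n, k)
x_1_to_n = sentence_matrix.reshape(-1)   # shape (n * k,)

print(sentence_matrix.shape, x_1_to_n.shape)
```

In practice the sentence is usually kept as an (n, k) matrix, since a window of h words is then simply h consecutive rows.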

#### Step 2: Apply a filter to a sliding window of words (similar to that of image classification) to create a feature map

A filter $w$ ∈ $R^{hk}$ is applied to a window of $h$ words to produce a new feature, e.g.: <br>
$c_{i}$ = $f$($w$·$x_{i:i+h-1}$ + $b$) <br>
$b$ is a bias term and $f$ is a non-linear function. This filter is applied to each possible window of words in the sentence {$x_{1:h}$, $x_{2:h+1}$,..., $x_{n-h+1:n}$} to produce a feature map: <br>
$c$ = [$c_{1}$, $c_{2}$,...,$c_{n-h+1}$] with c ∈ $R^{n-h+1}$ <br>
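
The feature map computation can be sketched directly from the formula. This is an illustrative NumPy version with toy sizes; `feature_map` is a hypothetical helper, and ReLU is assumed for $f$ to match the activation used in the Keras implementation.

```python
import numpy as np

def feature_map(x, w, b, h):
    """Compute c_i = f(w . x_{i:i+h-1} + b) for every window, with f = ReLU.

    x: (n, k) sentence matrix, w: (h * k,) filter, b: scalar bias.
    """
    n = x.shape[0]
    c = np.empty(n - h + 1)
    for i in range(n - h + 1):
        window = x[i:i + h].reshape(-1)          # concatenation of h word vectors
        c[i] = max(0.0, float(w @ window) + b)   # ReLU as the non-linearity f
    return c

rng = np.random.default_rng(0)
n, k, h = 7, 4, 3                 # toy sizes
x = rng.normal(size=(n, k))
w = rng.normal(size=h * k)
c = feature_map(x, w, b=0.1, h=h)
print(c.shape)   # (n - h + 1,) = (5,)
```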

#### Step 3: Using max over-time pooling

Apply a max over-time pooling operation over the feature map and take the maximum value: <br>
$\hat{c}$ = max{$c$} <br>
The idea here is to capture the most important feature. 

Max over-time pooling is slightly different from max pooling in that it simply returns one number which is the maximum over the entire vector, as opposed to a matrix of new dimensions.
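
The difference between the two pooling operations is easy to see on a toy feature map. The values below are made up for illustration.

```python
import numpy as np

c = np.array([0.2, 1.5, 0.0, 0.7, 0.9, 0.3])   # toy feature map from one filter

# Max over-time pooling: a single number for the whole feature map.
c_hat = c.max()

# Ordinary max pooling with pool size 2 and stride 2: one number per window.
pooled = c.reshape(-1, 2).max(axis=1)

print(c_hat)    # 1.5
print(pooled)   # [1.5 0.7 0.9]
```

Because max over-time pooling always yields one value per filter, it also handles sentences of different lengths without changing the size of the next layer.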

This step shows it only for one filter; the model uses multiple filters to obtain multiple features, which would look like:
![Alternate text ](image2.png "Max pooling applied")

#### Step 4: Flatten the dimensions and feed through final layer

These features form the final layer and are passed to a fully connected softmax layer whose output is the probability over labels.
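
A minimal sketch of this final layer, assuming toy sizes and randomly drawn weights (in the real model `W` and `b` are learned):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
m, num_labels = 6, 3               # toy sizes: 6 pooled features, 3 labels
z = rng.normal(size=m)             # pooled features, one per filter
W = rng.normal(size=(num_labels, m))
b = np.zeros(num_labels)

probs = softmax(W @ z + b)         # probability distribution over labels
print(probs, probs.sum())
```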

### Regularization: Using dropout

Employ dropout on the final layer. Given $z$ = [$\hat{c}_{1}$,...,$\hat{c}_{m}$], instead of computing $y$ = $w$·$z$ + $b$ for the output unit $y$ in forward propagation, dropout uses $y$ = $w$·($z$◦$r$) + $b$, where ◦ is the element-wise multiplication operator and $r$ ∈ $R^{m}$ is a ‘masking’ vector of Bernoulli random variables with probability p of being 1.
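
The masked forward pass can be sketched as follows. The sizes and values are toy assumptions; the test-time rescaling of the weights by p follows the original paper rather than the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 8
p = 0.5                                # probability that a mask entry is 1

z = rng.normal(size=m)                 # pooled features
w = rng.normal(size=m)                 # output weights
b = 0.0

r = rng.binomial(1, p, size=m)         # Bernoulli masking vector
y_train = w @ (z * r) + b              # training-time forward pass

# At test time no units are masked; the weights are scaled by p instead.
y_test = (p * w) @ z + b
print(r, y_train, y_test)
```

Note that the `rate` argument of the Keras `Dropout` layer is the fraction of units to drop, whereas p here is the probability of keeping a unit.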

## Result

![Alternate text ](image1.jpg "Model architecture with two channels for an example sentence")

# Implementation

The implementation is from:
https://github.com/alexander-rakhlin/CNN-for-Sentence-Classification-in-Keras

Install the gensim package, either in the notebook or through the Anaconda prompt. The w2v.py helper file, which loads the word vectors, depends on it.

In [None]:
#pip install gensim

In [2]:
import numpy as np
from w2v import train_word2vec # separate w2v.py file to load the pre-trained vectors from word2vec

from keras.models import Sequential, Model
from keras.layers import Dense, Dropout, Flatten, Input, MaxPooling1D, Convolution1D, Embedding
from keras.layers import Concatenate # keras.layers.merge is not available in newer Keras releases
from keras.datasets import imdb
from keras.preprocessing import sequence
import warnings

warnings.filterwarnings('ignore')
np.random.seed(0)

The parameters of the model are defined below. 
There are some things that are different in this implementation compared to the original paper:
- Embedding dimension for the word vectors from word2vec is 50 instead of 300.
- Filter sizes are (3, 8) instead of (3, 4, 5)
- For each filter size, there are 10 filters, instead of original 100. Experiments showed that 3-10 filters are enough.
- There are two elements in the dropout_prob tuple. The first is used after the embedding layer (p=0.5) and the second after the convolutional layer (p=0.8). The original paper uses dropout only once, with p = 0.5, after the convolutional layer. I tried both dropout methods, and the one shown in this implementation (0.5, 0.8) yields the same validation accuracy with training accuracy closer to the validation accuracy, hence it prevents overfitting.
- The original implementation shows training on two datasets; this notebook trains on only one of them:
    - polarity dataset v1.0, found at the following link: https://www.cs.cornell.edu/people/pabo/movie-review-data/
    - IMDB 25k reviews dataset (shown in this notebook)

As stated above, there are multiple variations of the model. They can be changed by adjusting the model_type variable. 


In [23]:
# ---------------------- Parameters section -------------------
#
# Model type. See Yoon Kim's Convolutional Neural Networks for Sentence Classification, Section 3
model_type = "CNN-non-static"  # CNN-rand|CNN-non-static|CNN-static

# Model Hyperparameters
embedding_dim = 50 # dimension of word vectors in the embedding layer
filter_sizes = (3, 8)  # kernel size in convolutional layer
num_filters = 10
dropout_prob = (0.5, 0.8)
hidden_dims = 50 # hidden neurons in the fully connected layer

# Training parameters
batch_size = 64 # training batch size
num_epochs = 10

# Preprocessing parameters
max_words = 5000 # vocabulary size: only the max_words most frequent words are kept -- used for pulling data
sequence_length = 400 # maximum length of each input -- used for padding

# Word2Vec parameters (see train_word2vec)
min_word_count = 1 # minimum count of words to consider -- if a word occurs less than this, ignore it 

# window size to consider for each word in the input when training word vectors
# 10 means 10 words before and 10 words after
context = 10 

#
# ---------------------- Parameters end -----------------------

The code below loads the IMDB data and preprocesses it. Specific comments are provided inline.

In [24]:
# Data Preparation
print("Load data...")

# load the imdb reviews dataset, keeping only the 5000 most frequent words
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_words, start_char=None,
                                                            oov_char=None, index_from=None)
        
# pad or truncate each review to sequence_length (defined in the parameters section)
x_train = sequence.pad_sequences(x_train, maxlen=sequence_length, padding="post", truncating="post")
x_test = sequence.pad_sequences(x_test, maxlen=sequence_length, padding="post", truncating="post")

vocabulary = imdb.get_word_index() # get index of all vocabulary
vocabulary_inv = dict((v, k) for k, v in vocabulary.items()) # flip the key and values to put word index as key
vocabulary_inv[0] = "<PAD/>" # add <PAD/> to vocabulary for padding

In [25]:
print("x_train shape:", x_train.shape)
print("x_test shape:", x_test.shape)
print("Vocabulary Size: {:d}".format(len(vocabulary_inv)))

Load data...
x_train shape: (25000, 400)
x_test shape: (25000, 400)
Vocabulary Size: 88585
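
To make the padding step concrete, here is a small pure-Python sketch of what `padding="post"` and `truncating="post"` do to a single sequence. `pad_post` is a hypothetical helper written for illustration, not a Keras function.

```python
def pad_post(seq, maxlen, value=0):
    """Truncate at the end and pad at the end, like padding="post", truncating="post"."""
    seq = list(seq)[:maxlen]
    return seq + [value] * (maxlen - len(seq))

print(pad_post([5, 8, 2], maxlen=5))           # [5, 8, 2, 0, 0]
print(pad_post([5, 8, 2, 9, 4, 7], maxlen=5))  # [5, 8, 2, 9, 4]
```

This is why every row of x_train and x_test has exactly 400 entries, with index 0 standing for the `<PAD/>` token.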


train_word2vec is a function in the w2v.py file that retrieves the pre-trained word2vec vectors to use as the embedding layer weights. It receives x_train and x_test stacked together, the vocabulary (words and their indices), the embedding dimension defined in the parameters section, and the minimum word count and context length.
Some things to note:
- If the model_type is CNN-static, the retrieved word embedding weights are used throughout the training process and are not trained (hence static). For this reason, in the nested condition below, each word index in the training and test sets is replaced by its embedding vector, and the vectors are stacked per sentence.
- If the model_type is CNN-non-static, the pre-trained embedding weights are passed into the embedding layer in the model code blocks and then further trained.
- If the model_type is CNN-rand (baseline model), there are no pre-trained embedding weights; the embeddings are randomly initialized and then trained with the model.
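
The CNN-static conversion can be illustrated on toy data. The sketch assumes, as the code below suggests, that the embedding weights behave like a dictionary from word index to vector; `toy_weights` and the small arrays are made up for illustration.

```python
import numpy as np

embedding_dim = 3
# Toy stand-in for the embedding weights (assumed: word index -> vector).
toy_weights = {i: np.full(embedding_dim, float(i)) for i in range(5)}

x = np.array([[1, 4, 0, 0],
              [2, 3, 1, 0]])      # two padded sentences of word indices

# CNN-static: replace every index by its fixed vector before training.
x_static = np.stack([np.stack([toy_weights[word] for word in sentence]) for sentence in x])
print(x.shape, "->", x_static.shape)   # (2, 4) -> (2, 4, 3)
```

The input therefore changes from shape (sentences, sequence_length) to (sentences, sequence_length, embedding_dim), which is why the static model needs no embedding layer.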

In [27]:
# Prepare embedding layer weights and convert inputs for static model
print("Model type is", model_type)
# model_type from parameters section
if model_type in ["CNN-non-static", "CNN-static"]:
    embedding_weights = train_word2vec(np.vstack((x_train, x_test)), vocabulary_inv, num_features=embedding_dim,
                                       min_word_count=min_word_count, context=context)
    
    # stacking embedding weights for all words in all sentences which will stay static throughout training
    if model_type == "CNN-static":
        x_train = np.stack([np.stack([embedding_weights[word] for word in sentence]) for sentence in x_train])
        x_test = np.stack([np.stack([embedding_weights[word] for word in sentence]) for sentence in x_test])
        print("x_train static shape:", x_train.shape)
        print("x_test static shape:", x_test.shape)

# no pre-trained embeddings if CNN-rand
elif model_type == "CNN-rand":
    embedding_weights = None
else:
    raise ValueError("Unknown model type")

Model type is CNN-non-static
Load existing Word2Vec model '50features_1minwords_10context'


In [9]:
# Build model

# setting input shape based on model type
if model_type == "CNN-static":
    input_shape = (sequence_length, embedding_dim)
else:
    input_shape = (sequence_length,)

model_input = Input(shape=input_shape)

# static model does not have embedding layer
if model_type == "CNN-static":
    z = model_input
else:
    # embedding layer added if model type is CNN-rand or CNN-non-static
    z = Embedding(len(vocabulary_inv), embedding_dim, input_length=sequence_length, name="embedding")(model_input)

# dropout with probability 0.5 (pre-defined) after embedding layer
z = Dropout(dropout_prob[0])(z)

# convolutional layer with 10 filters (pre-defined), (3, 8) as filter sizes (pre-defined)
conv_blocks = []
# iterates through all filter sizes and adds to convolutional layer
for sz in filter_sizes:
    conv = Convolution1D(filters=num_filters,
                         kernel_size=sz,
                         padding="valid",
                         activation="relu",
                         strides=1)(z)
    conv = MaxPooling1D(pool_size=2)(conv) # maxpool layer (pool size 2, not the paper's max over-time pooling)
    conv = Flatten()(conv) # flatten to pass into fully connected layer
    conv_blocks.append(conv)
z = Concatenate()(conv_blocks) if len(conv_blocks) > 1 else conv_blocks[0] # all filters are concatenated if more than 1

# dropout probability of 0.8 (pre-defined) after convolutional layer
z = Dropout(dropout_prob[1])(z)

# hidden neurons (pre-defined) in fully connected layer
z = Dense(hidden_dims, activation="relu")(z) 

# output layer with sigmoid activation for positive/negative
model_output = Dense(1, activation="sigmoid")(z) 

# create model instance
model = Model(model_input, model_output)

# compile model with binary cross-entropy as the loss function; the adam optimizer is used to
# iteratively minimize the objective function
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# if model type is CNN-non-static, the embedding layer is initialized with the embedding weights from word2vec
if model_type == "CNN-non-static":
    weights = np.array([v for v in embedding_weights.values()])
    print("Initializing embedding layer with word2vec weights, shape", weights.shape)
    embedding_layer = model.get_layer("embedding")
    embedding_layer.set_weights([weights])

Initializing embedding layer with word2vec weights, shape (88585, 50)
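
As a quick sanity check on the model built above, the number of features reaching the dense layer can be worked out by hand. The helper `conv_block_features` is a hypothetical name used only for this sketch; it assumes "valid" padding, stride 1, and non-overlapping pooling windows, as in the code above.

```python
def conv_block_features(sequence_length, kernel_size, num_filters, pool_size=2):
    """Flattened feature count of one conv -> max-pool -> flatten block."""
    conv_len = sequence_length - kernel_size + 1   # "valid" padding, stride 1
    pooled_len = conv_len // pool_size             # non-overlapping pooling windows
    return pooled_len * num_filters

sizes = [conv_block_features(400, sz, num_filters=10) for sz in (3, 8)]
print(sizes, sum(sizes))   # [1990, 1960] 3950
```

With max over-time pooling, as in the paper, each block would instead contribute only num_filters values, i.e. 20 features in total for these settings.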


The model is fitted below. Note that if the model type is CNN-static, x_train and x_test are the sentences whose words have been replaced by their pre-trained word vectors. This is done in the code block above that prepares the embedding weights.

In [10]:
# Train the model
model.fit(x_train, y_train, batch_size=batch_size, epochs=num_epochs,
          validation_data=(x_test, y_test), verbose=1)

Train on 25000 samples, validate on 25000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.callbacks.History at 0x1a43a191d0>
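
The quantity minimized during training is the binary cross-entropy between the sigmoid output and the 0/1 labels. A minimal NumPy sketch on made-up predictions follows; the function is written for illustration and is not the Keras implementation.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))))

y_true = np.array([1.0, 0.0, 1.0, 0.0])
confident = np.array([0.9, 0.1, 0.8, 0.2])   # mostly correct predictions
unsure = np.array([0.5, 0.5, 0.5, 0.5])      # uninformative predictions

print(binary_crossentropy(y_true, confident))
print(binary_crossentropy(y_true, unsure))   # log(2), about 0.693
```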

# Conclusion and Future Direction

The paper describes a series of experiments with convolutional neural networks built on top of word2vec. Despite little tuning of hyperparameters, a simple CNN with one layer of convolution performs remarkably well. This adds to the evidence that unsupervised pre-training of word vectors is an important aspect of deep learning for NLP. The CNN models discussed in the paper improve upon the state of the art on 4 out of 7 tasks, which include sentiment analysis and question classification. As a future direction, the multichannel architecture and the regularization of fine-tuning need further study to prevent overfitting.

# References:

[1]: N. Kalchbrenner, E. Grefenstette, P. Blunsom. 2014. A Convolutional Neural Network for Modelling Sentences. In Proceedings of ACL 2014.

[2]: Y. Zhang, S. Roller, B. Wallace. 2016. MGNC-CNN: A Simple Approach to Exploiting Multiple Word Embeddings for Sentence Classification. In Proceedings of NAACL 2016.

[3]: W. Yin, H. Schutze. 2015. Multichannel Variable-Size Convolution for Sentence Classification. In Proceedings of the Conference on Computational Natural Language Learning, pages 204–214.

[4]: Y. Kim. 2014. Convolutional Neural Networks for Sentence Classification. In Proceedings of EMNLP 2014.