<h3>What are Convolutional Neural Networks?</h3>
<p>Now you know what convolutions are. But what about CNNs? CNNs are basically just several layers of convolutions with <em>nonlinear activation functions</em> like <a href="https://en.wikipedia.org/wiki/Rectifier_(neural_networks)">ReLU</a> or <a href="https://reference.wolfram.com/language/ref/Tanh.html">tanh</a> applied to the results. In a traditional feedforward neural network we connect each input neuron to each output neuron in the next layer. That&#8217;s also called a fully connected layer, or affine layer. In CNNs we don&#8217;t do that. Instead, we use convolutions over the input layer to compute the output. This results in local connections, where each region of the input is connected to a neuron in the output. <span style="line-height: 1.5;">Each layer applies different filters, typically hundreds or thousands like the ones shown above, and combines their results. There are also pooling (subsampling) layers, but I&#8217;ll get into that later. During the training phase, </span><strong style="line-height: 1.5;">a CNN</strong> <strong style="line-height: 1.5;">automatically learns the values of its filters</strong><span style="line-height: 1.5;"> based on the task you want to perform. For example, in Image Classification a CNN may learn to detect edges from raw pixels in the first layer, then use the edges to detect simple shapes in the second layer, and then use these shapes to detect higher-level features, such as facial shapes, in higher layers. The last layer is then a classifier that uses these high-level features.</span></p>
<p><a href="http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-07-at-7.26.20-AM.png"><img class="alignnone size-large wp-image-424" src="http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-07-at-7.26.20-AM-1024x279.png" alt="Convolutional Neural Network (Clarifai)" width="1024" height="279" srcset="http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-07-at-7.26.20-AM-1024x279.png 1024w, http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-07-at-7.26.20-AM-300x82.png 300w, http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-07-at-7.26.20-AM.png 1558w" sizes="(max-width: 767px) 89vw, (max-width: 1000px) 54vw, (max-width: 1071px) 543px, 580px" /></a></p>
<p>There are two aspects of this computation worth paying attention to: <strong>Location Invariance</strong> and <strong>Compositionality</strong>. Let&#8217;s say you want to classify whether or not there&#8217;s an elephant in an image. Because you are sliding your filters over the whole image you don&#8217;t really care <em>where</em> the elephant occurs. In practice, <em>pooling</em> also gives you a degree of invariance to small translations, rotations and changes in scale, but more on that later. The second key aspect is (local) compositionality. Each filter <em>composes</em> a local patch of lower-level features into a higher-level representation. That&#8217;s why CNNs are so powerful in Computer Vision. It makes intuitive sense that you build edges from pixels, shapes from edges, and more complex objects from shapes.</p>
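<p>To make the location-invariance idea concrete, here is a minimal NumPy sketch. The 1D signal, the pattern and the helper name <code>max_filter_response</code> are made up for illustration. It slides one filter over two inputs that contain the same pattern at different offsets and keeps only the strongest response.</p>

```python
import numpy as np


def max_filter_response(signal, kernel):
    """Slide `kernel` over `signal` (no padding) and return the largest response."""
    responses = np.correlate(signal, kernel, mode="valid")
    return responses.max()


# Hypothetical 1D "image": the same pattern placed at two different offsets.
pattern = np.array([1.0, 2.0, 1.0])
left = np.zeros(10)
left[1:4] = pattern
right = np.zeros(10)
right[6:9] = pattern

# The strongest response is identical no matter where the pattern sits.
print(max_filter_response(left, pattern))
print(max_filter_response(right, pattern))
```

<p>Taking the maximum over all positions discards <em>where</em> the filter fired and keeps only <em>how strongly</em> it fired, which is the intuition behind combining sliding filters with pooling.</p>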
<h4>So, how does any of this apply to NLP?</h4>
<p>Instead of image pixels, the input to most NLP tasks is a sentence or document represented as a matrix. Each row of the matrix corresponds to one token, typically a word, but it could be a character. That is, each row is a vector that represents a word. Typically, these vectors are <em>word embeddings</em> (low-dimensional representations) like <a href="https://code.google.com/p/word2vec/">word2vec</a> or <a href="http://nlp.stanford.edu/projects/glove/">GloVe</a>, but they could also be one-hot vectors that index the word into a vocabulary. For a 10 word sentence using a 100-dimensional embedding we would have a 10&#215;100 matrix as our input. That&#8217;s our &#8220;image&#8221;.</p>
<p>In vision, our filters slide over local patches of an image, but in NLP we typically use filters that slide over full rows of the matrix (words). Thus, the &#8220;width&#8221; of our filters is usually the same as the width of the input matrix. The height, or <em>region size</em>, may vary, but sliding windows over 2-5 words at a time is typical. Putting all the above together, a Convolutional Neural Network for NLP may look like this (take a few minutes and try to understand this picture and how the dimensions are computed. You can ignore the pooling for now; we&#8217;ll explain that later):</p>
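<p>As a rough sketch of how such an input matrix could be assembled, the snippet below looks up one row per word in an embedding table. The sentence, the toy vocabulary and the randomly initialised table are assumptions for illustration; in practice the rows would come from trained vectors such as word2vec or GloVe.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 10-word sentence and a toy vocabulary built from it.
sentence = "the movie was far better than i had expected it".split()
vocab = {word: idx for idx, word in enumerate(sorted(set(sentence)))}

# Stand-in for a trained embedding table: one 100-dimensional row per vocabulary word.
embedding_dim = 100
embeddings = rng.normal(size=(len(vocab), embedding_dim))

# Map words to integer ids, then look up one embedding row per token.
word_ids = np.array([vocab[word] for word in sentence])
sentence_matrix = embeddings[word_ids]

print(sentence_matrix.shape)  # (10, 100)
```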
<figure id="attachment_420" style="max-width: 1024px" class="wp-caption alignnone"><a href="http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-12.05.40-PM.png"><img class="size-large wp-image-420" src="http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-12.05.40-PM-1024x937.png" alt="Illustration of a Convolutional Neural Network (CNN) architecture for sentence classification. Here we depict three filter region sizes: 2, 3 and 4, each of which has 2 filters. Every filter performs convolution on the sentence matrix and generates (variable-length) feature maps. Then 1-max pooling is performed over each map, i.e., the largest number from each feature map is recorded. Thus a univariate feature vector is generated from all six maps, and these 6 features are concatenated to form a feature vector for the penultimate layer. The final softmax layer then receives this feature vector as input and uses it to classify the sentence; here we assume binary classification and hence depict two possible output states. Source: Zhang, Y., &amp; Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification" width="1024" height="937" srcset="http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-12.05.40-PM-1024x937.png 1024w, http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-12.05.40-PM-300x274.png 300w, http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-12.05.40-PM.png 1504w" sizes="(max-width: 767px) 89vw, (max-width: 1000px) 54vw, (max-width: 1071px) 543px, 580px" /></a><figcaption class="wp-caption-text">Illustration of a Convolutional Neural Network (CNN) architecture for sentence classification. Here we depict three filter region sizes: 2, 3 and 4, each of which has 2 filters. Every filter performs convolution on the sentence matrix and generates (variable-length) feature maps.
Then 1-max pooling is performed over each map, i.e., the largest number from each feature map is recorded. Thus a univariate feature vector is generated from all six maps, and these 6 features are concatenated to form a feature vector for the penultimate layer. The final softmax layer then receives this feature vector as input and uses it to classify the sentence; here we assume binary classification and hence depict two possible output states. Source: Zhang, Y., &amp; Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification.</figcaption></figure>
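<p>The computation in the figure can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the weights are random rather than learned, ReLU is picked as the activation, the input is a random 10&#215;100 sentence matrix, and the helper names are made up.</p>

```python
import numpy as np

rng = np.random.default_rng(0)


def conv_feature_map(sentence_matrix, filt, bias=0.0):
    """Narrow convolution of one full-width filter over the rows, followed by ReLU."""
    n_words = sentence_matrix.shape[0]
    region_size = filt.shape[0]
    out = np.empty(n_words - region_size + 1)
    for i in range(out.shape[0]):
        window = sentence_matrix[i:i + region_size]  # region_size x embedding_dim
        out[i] = np.sum(window * filt) + bias
    return np.maximum(out, 0.0)


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


sentence_matrix = rng.normal(size=(10, 100))  # stand-in for 10 embedded words
region_sizes = [2, 3, 4]
filters_per_size = 2

# Two random filters per region size, each spanning the full embedding width.
feature_maps = [
    conv_feature_map(sentence_matrix, rng.normal(size=(h, 100)) * 0.1)
    for h in region_sizes
    for _ in range(filters_per_size)
]
map_lengths = [len(m) for m in feature_maps]  # variable-length feature maps

# 1-max pooling: keep the largest value of each map, giving 6 features.
features = np.array([m.max() for m in feature_maps])

# Softmax layer for binary classification (random weights, for shape only).
W = rng.normal(size=(2, features.shape[0])) * 0.1
probs = softmax(W @ features)

print(map_lengths, features.shape, probs.shape)
```

<p>Note how the feature maps have different lengths for different region sizes, while 1-max pooling always yields exactly one number per filter. That is what makes the final feature vector a fixed size regardless of sentence length.</p>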
<p>What about the nice intuitions we had for Computer Vision? Location Invariance and local Compositionality made intuitive sense for images, but not so much for NLP. You probably do care a lot where in the sentence a word appears. Pixels close to each other are likely to be semantically related (part of the same object), but the same isn&#8217;t always true for words. In many languages, parts of phrases could be separated by several other words. The compositional aspect isn&#8217;t obvious either. Clearly, words compose in some ways, like an adjective modifying a noun, but how exactly this works, and what the higher-level representations actually &#8220;mean&#8221;, isn&#8217;t as obvious as in the Computer Vision case.</p>

<p>Given all this, it seems like CNNs wouldn&#8217;t be a good fit for NLP tasks. <a href="http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/">Recurrent Neural Networks</a> make more intuitive sense. They resemble how we process language (or at least how we think we process language): reading sequentially from left to right. Fortunately, this doesn&#8217;t mean that CNNs don&#8217;t work. <a href="https://en.wikipedia.org/wiki/All_models_are_wrong">All models are wrong, but some are useful</a>. It turns out that CNNs applied to NLP problems perform quite well. The simple <a href="https://en.wikipedia.org/wiki/Bag-of-words_model">Bag of Words model</a> is an obvious oversimplification with incorrect assumptions, but it has nonetheless been the standard approach for years and has led to pretty good results.</p>
<p><span style="line-height: 1.5;">A big argument for CNNs is that they are fast. Very fast. Convolutions are a central part of computer graphics and implemented on a hardware level on GPUs. Compared to something like <a href="https://en.wikipedia.org/wiki/N-gram">n-grams</a>, CNNs are also <em>efficient</em> in terms of representation. With a large vocabulary, computing anything more than 3-grams can quickly become expensive. Even Google doesn&#8217;t provide anything beyond 5-grams. Convolutional Filters learn good representations automatically, without needing to represent the whole vocabulary. It&#8217;s completely reasonable to have filters of size larger than 5. I like to think that many of the learned filters in the first layer are capturing features quite similar (but not limited) to n-grams, but represent them in a more compact way.</span></p>
<h3>CNN Hyperparameters</h3>
<p>Before explaining how CNNs are applied to NLP tasks, let&#8217;s look at some of the choices you need to make when building a CNN. Hopefully this will help you better understand the literature in the field.</p>
<h4>Narrow vs. Wide convolution</h4>
<p>When I explained convolutions above I neglected a little detail of how we apply the filter. Applying a 3&#215;3 filter at the center of the matrix works fine, but what about the edges? How would you apply the filter to the first element of a matrix that doesn&#8217;t have any neighboring elements to the top and left? You can use <em>zero-padding</em>. All elements that would fall outside of the matrix are taken to be zero. By doing this you can apply the filter to every element of your input matrix, and get a larger or equally sized output. Adding zero-padding is also called <em>wide convolution</em>, and not using zero-padding would be a <em>narrow convolution</em>. An example in 1D looks like this:</p>

<figure id="attachment_407" style="max-width: 1024px" class="wp-caption alignnone"><a href="http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-05-at-9.47.41-AM.png"><img class="wp-image-407 size-large" src="http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-05-at-9.47.41-AM-1024x261.png" alt="Narrow vs. Wide Convolution. Source: A Convolutional Neural Network for Modelling Sentences (2014)" width="1024" height="261" srcset="http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-05-at-9.47.41-AM-1024x261.png 1024w, http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-05-at-9.47.41-AM-300x77.png 300w, http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-05-at-9.47.41-AM.png 1536w" sizes="(max-width: 767px) 89vw, (max-width: 1000px) 54vw, (max-width: 1071px) 543px, 580px" /></a><figcaption class="wp-caption-text">Narrow vs. Wide Convolution. Filter size 5, input size 7. Source: A Convolutional Neural Network for Modelling Sentences (2014)</figcaption></figure>
<p>You can see how wide convolution is useful, or even necessary, when you have a large filter relative to the input size. In the above, the narrow convolution yields  an output of size <img src="//s0.wp.com/latex.php?latex=%287-5%29+%2B+1%3D3&#038;bg=ffffff&#038;fg=000&#038;s=0" alt="(7-5) + 1=3" title="(7-5) + 1=3" class="latex" />, and a wide convolution an output of size <img src="//s0.wp.com/latex.php?latex=%287%2B2%2A4+-+5%29+%2B+1+%3D11&#038;bg=ffffff&#038;fg=000&#038;s=0" alt="(7+2*4 - 5) + 1 =11" title="(7+2*4 - 5) + 1 =11" class="latex" />. More generally, the formula for the output size is <img src="//s0.wp.com/latex.php?latex=n_%7Bout%7D%3D%28n_%7Bin%7D+%2B+2%2An_%7Bpadding%7D+-+n_%7Bfilter%7D%29+%2B+1+&#038;bg=ffffff&#038;fg=000&#038;s=0" alt="n_{out}=(n_{in} + 2*n_{padding} - n_{filter}) + 1 " title="n_{out}=(n_{in} + 2*n_{padding} - n_{filter}) + 1 " class="latex" />.</p>
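<p>The output-size formula is easy to check numerically. The sketch below uses a made-up helper, <code>conv_output_size</code>, and compares it with what NumPy produces for an input of size 7 and a filter of size 5, with and without 4 zeros of padding on each side.</p>

```python
import numpy as np


def conv_output_size(n_in, n_filter, n_padding=0):
    """Output length of a stride-1 convolution: (n_in + 2*n_padding - n_filter) + 1."""
    return (n_in + 2 * n_padding - n_filter) + 1


x = np.arange(7, dtype=float)  # input of size 7
w = np.ones(5)                 # filter of size 5

# Narrow convolution: no padding, the filter must fit entirely inside the input.
narrow = np.convolve(x, w, mode="valid")

# Wide convolution: zero-pad 4 elements on each side, then convolve as before.
wide = np.convolve(np.pad(x, 4), w, mode="valid")

print(len(narrow), conv_output_size(7, 5))              # 3 3
print(len(wide), conv_output_size(7, 5, n_padding=4))   # 11 11
```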

<h2>Implementing a CNN for Text Classification</h2>
<h3>Data and Preprocessing</h3>
<p>The dataset we&#8217;ll use in this post is the <a href="http://www.cs.cornell.edu/people/pabo/movie-review-data/">Movie Review data from Rotten Tomatoes</a> &#8211; one of the data sets also used in the original paper. The dataset contains 10,662 example review sentences, half positive and half negative. The dataset has a vocabulary of size around 20k. Note that since this data set is pretty small we&#8217;re likely to overfit with a powerful model. Also, the dataset doesn&#8217;t come with an official train/test split, so we simply use 10% of the data as a dev set. The original paper reported results for 10-fold cross-validation on the data.</p>
<p>I won&#8217;t go over the data pre-processing code in this post, but it is <a href="https://github.com/dennybritz/cnn-text-classification-tf/blob/master/data_helpers.py">available on Github</a> and does the following:</p>
<ol>
<li>Load positive and negative sentences from the raw data files.</li>
<li>Clean the text data using the <a href="https://github.com/yoonkim/CNN_sentence">same code</a> as the original paper.</li>
<li>Pad each sentence to the maximum sentence length, which turns out to be 59. We append special <code>&lt;PAD&gt;</code> tokens to all other sentences to make them 59 words. Padding sentences to the same length is useful because it allows us to efficiently batch our data since each example in a batch must be of the same length.</li>
<li>Build a vocabulary index and map each word to an integer between 0 and 18,765 (the vocabulary size). Each sentence becomes a vector of integers.</li>
</ol>
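<p>Steps 3 and 4 above can be sketched as follows. This is not the code from the linked repository; the helper names and the three toy sentences are assumptions made for illustration.</p>

```python
import numpy as np

PAD = "<PAD>"


def pad_sentences(tokenized, pad_token=PAD):
    """Append pad tokens so that every sentence has the length of the longest one."""
    max_len = max(len(s) for s in tokenized)
    return [s + [pad_token] * (max_len - len(s)) for s in tokenized]


def build_vocab(padded):
    """Map every distinct token (including the pad token) to an integer id."""
    words = sorted({w for s in padded for w in s})
    return {w: i for i, w in enumerate(words)}


# Hypothetical tokenized review sentences.
sentences = [["a", "great", "movie"], ["boring"], ["not", "a", "great", "plot"]]

padded = pad_sentences(sentences)
vocab = build_vocab(padded)

# Each sentence becomes a vector of integers; equal lengths allow simple batching.
x = np.array([[vocab[w] for w in s] for s in padded])
print(x.shape)  # (3, 4)
```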

<h3>The Model</h3>
<p>The network we will build in this post looks roughly as follows:</p>
<p><a href="http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-8.03.47-AM.png" rel="attachment wp-att-415"><img class="alignnone size-large wp-image-415" src="http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-8.03.47-AM-1024x413.png" alt="Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification" width="1024" height="413" srcset="http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-8.03.47-AM-1024x413.png 1024w, http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-8.03.47-AM-300x121.png 300w" sizes="(max-width: 767px) 89vw, (max-width: 1000px) 54vw, (max-width: 1071px) 543px, 580px" /></a></p>
<p>The first layer embeds words into low-dimensional vectors. The next layer performs convolutions over the embedded word vectors using multiple filter sizes, for example sliding over 3, 4 or 5 words at a time. Next, we max-pool the result of the convolutional layer into a long feature vector, add dropout regularization, and classify the result using a softmax layer.</p>
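<p>To get a feel for the sizes involved, here is a rough shape walk-through of the layers after the embedding, followed by dropout and softmax on the pooled vector. The number of filters per size (128), the keep probability (0.5) and the random stand-in values are assumptions chosen for illustration, not values taken from the text above.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

sequence_length = 59      # padded sentence length from the preprocessing step
filter_sizes = [3, 4, 5]
num_filters = 128         # assumption: filters per filter size
num_classes = 2           # positive vs. negative

# Narrow convolution: a filter covering h words yields a map of length 59 - h + 1.
map_lengths = [sequence_length - h + 1 for h in filter_sizes]

# Max-pooling keeps one value per filter, so the pooled vector has
# len(filter_sizes) * num_filters entries. Random values stand in for real features.
pooled = rng.normal(size=len(filter_sizes) * num_filters)


def dropout(x, keep_prob, rng):
    """Inverted dropout: zero out units at random and rescale the survivors."""
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


# Softmax classifier on top of the (dropped-out) pooled features.
W = rng.normal(size=(num_classes, pooled.shape[0])) * 0.01
b = np.zeros(num_classes)
probs = softmax(W @ dropout(pooled, keep_prob=0.5, rng=rng) + b)

print(map_lengths, pooled.shape, probs.shape)
```

<p>Dropout is only applied during training; at evaluation time the keep probability would be set to 1.</p>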
<p>Because this is an educational post I decided to simplify the model from the original paper a little:</p>
<ul>
<li>We will not use pre-trained <a href="https://code.google.com/p/word2vec/">word2vec</a> vectors for our word embeddings. Instead, we learn embeddings from scratch.</li>
<li>We will not enforce L2 norm constraints on the weight vectors. <a href="http://arxiv.org/abs/1510.03820">A Sensitivity Analysis of (and Practitioners&#8217; Guide to) Convolutional Neural Networks for Sentence Classification</a> found that the constraints had little effect on the end result.</li>
<li>The original paper experiments with two input data channels &#8211; static and non-static word vectors. We use only one channel.</li>
</ul>
<p>It is relatively straightforward (a few dozen lines of code) to add the above extensions to the code here. Take a look at the exercises at the end of the post.</p>
<p>Let&#8217;s get started!</p>

import re
import numpy as np


TOKENIZER_RE = re.compile(r"[A-Z]{2,}(?![a-z])|[A-Z][a-z]+(?=[A-Z])|[\'\w\-]+",
                          re.UNICODE)


def tokenizer(iterator):
  """Tokenizer generator.
  Args:
    iterator: Input iterator with strings.
  Yields:
    array of tokens per each value in the input.
  """
  for value in iterator:
    yield TOKENIZER_RE.findall(value)
    
    
class VocabularyProcessor(object):
  """
  Maps documents to sequences of word ids.
  """
  def __init__(self, max_document_length, min_frequency=0, vocabulary=None,
               tokenizer_fn=None):
    """Initializes a VocabularyProcessor instance.
    Args:
      max_document_length: Maximum length of documents.
        If documents are longer, they will be trimmed; if shorter, padded.
      min_frequency: Minimum frequency of words in the vocabulary.
      vocabulary: CategoricalVocabulary object.
      tokenizer_fn: Optional tokenizer function; defaults to `tokenizer`.
    Attributes:
      vocabulary_: CategoricalVocabulary object.
    """
    self.max_document_length = max_document_length
    self.min_frequency = min_frequency
    if vocabulary:
      self.vocabulary_ = vocabulary
    else:
      self.vocabulary_ = CategoricalVocabulary()
    if tokenizer_fn:
      self._tokenizer = tokenizer_fn
    else:
      self._tokenizer = tokenizer

  def fit(self, raw_documents, unused_y=None):
    """Learn a vocabulary dictionary of all tokens in the raw documents.
    Args:
      raw_documents: An iterable which yield either str or unicode.
      unused_y: to match fit format signature of estimators.
    Returns:
      self
    """
    for tokens in self._tokenizer(raw_documents):
      for token in tokens:
        self.vocabulary_.add(token)
    if self.min_frequency > 0:
      self.vocabulary_.trim(self.min_frequency)
    self.vocabulary_.freeze()
    return self

  def fit_transform(self, raw_documents, unused_y=None):
    """Learn the vocabulary dictionary and return indices of words.
    Args:
      raw_documents: An iterable which yield either str or unicode.
      unused_y: to match fit_transform signature of estimators.
    Returns:
      x: iterable, [n_samples, max_document_length]. Word-id matrix.
    """
    self.fit(raw_documents)
    return self.transform(raw_documents)

  def transform(self, raw_documents):
    """Transform documents to word-id matrix.
    Convert words to ids with vocabulary fitted with fit or the one
    provided in the constructor.
    Args:
      raw_documents: An iterable which yield either str or unicode.
    Yields:
      x: iterable, [n_samples, max_document_length]. Word-id matrix.
    """
    for tokens in self._tokenizer(raw_documents):
      word_ids = np.zeros(self.max_document_length, np.int64)
      for idx, token in enumerate(tokens):
        if idx >= self.max_document_length:
          break
        word_ids[idx] = self.vocabulary_.get(token)
      yield word_ids

  def reverse(self, documents):
    """Reverses output of vocabulary mapping to words.
    Args:
      documents: iterable, list of class ids.
    Yields:
      Iterator over mapped in words documents.
    """
    for item in documents:
      output = []
      for class_id in item:
        output.append(self.vocabulary_.reverse(class_id))
      yield ' '.join(output)
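
<p>The snippet above refers to a <code>CategoricalVocabulary</code> class that is neither defined nor imported. The following is a hypothetical minimal stand-in, not the original implementation. It assumes only the interface that <code>VocabularyProcessor</code> actually calls (<code>add</code>, <code>trim</code>, <code>freeze</code>, <code>get</code>, <code>reverse</code>) and reserves id 0 for unknown or padded positions.</p>

```python
import collections


class CategoricalVocabulary(object):
  """Hypothetical minimal stand-in: maps tokens to integer ids (0 = unknown)."""

  def __init__(self, unknown_token="<UNK>"):
    self._unknown_token = unknown_token
    self._mapping = {unknown_token: 0}
    self._reverse_mapping = [unknown_token]
    self._freq = collections.Counter()
    self._frozen = False

  def __len__(self):
    return len(self._mapping)

  def add(self, token, count=1):
    """Registers a token (if new and not frozen) and increases its count."""
    if token not in self._mapping:
      if self._frozen:
        return
      self._mapping[token] = len(self._mapping)
      self._reverse_mapping.append(token)
    self._freq[token] += count

  def trim(self, min_frequency):
    """Drops tokens seen fewer than min_frequency times and renumbers the rest."""
    kept = [t for t in self._reverse_mapping[1:]
            if self._freq[t] >= min_frequency]
    self._reverse_mapping = [self._unknown_token] + kept
    self._mapping = {t: i for i, t in enumerate(self._reverse_mapping)}
    self._freq = collections.Counter({t: self._freq[t] for t in kept})

  def freeze(self, freeze=True):
    """Stops (or re-allows) the addition of new tokens."""
    self._frozen = freeze

  def get(self, token):
    """Returns the id of a token; unknown tokens map to 0."""
    return self._mapping.get(token, 0)

  def reverse(self, class_id):
    """Returns the token stored under the given id."""
    return self._reverse_mapping[class_id]
```

<p>With this class in scope, something like <code>VocabularyProcessor(max_document_length=59)</code> can be instantiated and fitted. Note that <code>fit_transform</code> iterates over <code>raw_documents</code> twice, so it should be given a list rather than a one-shot generator.</p>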