<h3>What are Convolutional Neural Networks?</h3>
<p>Now you know what convolutions are. But what about CNNs? CNNs are basically just several layers of convolutions with <em>nonlinear activation functions</em> like <a href="https://en.wikipedia.org/wiki/Rectifier_(neural_networks)">ReLU</a> or <a href="https://reference.wolfram.com/language/ref/Tanh.html">tanh</a> applied to the results. In a traditional feedforward neural network we connect each input neuron to each output neuron in the next layer. That&#8217;s also called a fully connected layer, or affine layer. In CNNs we don&#8217;t do that. Instead, we use convolutions over the input layer to compute the output. This results in local connections, where each region of the input is connected to a neuron in the output. <span style="line-height: 1.5;">Each layer applies different filters, typically hundreds or thousands like the ones shown above, and combines their results. There are also pooling (subsampling) layers, but I&#8217;ll get into that later. During the training phase, </span><strong style="line-height: 1.5;">a CNN</strong> <strong style="line-height: 1.5;">automatically learns the values of its filters</strong><span style="line-height: 1.5;"> based on the task you want to perform. For example, in Image Classification a CNN may learn to detect edges from raw pixels in the first layer, then use the edges to detect simple shapes in the second layer, and then use these shapes to detect higher-level features, such as facial shapes, in higher layers. The last layer is then a classifier that uses these high-level features.</span></p>
<p><a href="http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-07-at-7.26.20-AM.png"><img class="alignnone size-large wp-image-424" src="http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-07-at-7.26.20-AM-1024x279.png" alt="Convolutional Neural Network (Clarifai)" width="1024" height="279" srcset="http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-07-at-7.26.20-AM-1024x279.png 1024w, http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-07-at-7.26.20-AM-300x82.png 300w, http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-07-at-7.26.20-AM.png 1558w" sizes="(max-width: 767px) 89vw, (max-width: 1000px) 54vw, (max-width: 1071px) 543px, 580px" /></a></p>
<p>There are two aspects of this computation worth paying attention to: <strong>Location Invariance</strong> and <strong>Compositionality</strong>. Let&#8217;s say you want to classify whether or not there&#8217;s an elephant in an image. Because you are sliding your filters over the whole image you don&#8217;t really care <em>where</em> the elephant occurs. In practice, <em>pooling</em> also gives you a degree of invariance to translation, rotation and scaling, but more on that later. The second key aspect is (local) compositionality. Each filter <em>composes</em> a local patch of lower-level features into a higher-level representation. That&#8217;s why CNNs are so powerful in Computer Vision. It makes intuitive sense that you build edges from pixels, shapes from edges, and more complex objects from shapes.</p>
<h4>So, how does any of this apply to NLP?</h4>
<p>Instead of image pixels, the input to most NLP tasks is a sentence or document represented as a matrix. Each row of the matrix corresponds to one token, typically a word, but it could be a character. That is, each row is a vector that represents a word. Typically, these vectors are <em>word embeddings</em> (low-dimensional representations) like <a href="https://code.google.com/p/word2vec/">word2vec</a> or <a href="http://nlp.stanford.edu/projects/glove/">GloVe</a>, but they could also be one-hot vectors that index the word into a vocabulary. For a 10-word sentence using a 100-dimensional embedding we would have a 10&#215;100 matrix as our input. That&#8217;s our &#8220;image&#8221;.</p>
<p>In vision, our filters slide over local patches of an image, but in NLP we typically use filters that slide over full rows of the matrix (words). Thus, the &#8220;width&#8221; of our filters is usually the same as the width of the input matrix. The height, or <em>region size</em>, may vary, but sliding windows over 2-5 words at a time is typical. Putting all the above together, a Convolutional Neural Network for NLP may look like this (take a few minutes and try to understand this picture and how the dimensions are computed. You can ignore the pooling for now, we&#8217;ll explain that later):</p>
<figure id="attachment_420" style="max-width: 1024px" class="wp-caption alignnone"><a href="http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-12.05.40-PM.png"><img class="size-large wp-image-420" src="http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-12.05.40-PM-1024x937.png" alt="Illustration of a Convolutional Neural Network (CNN) architecture for sentence classification. Here we depict three filter region sizes: 2, 3 and 4, each of which has 2 filters. Every filter performs convolution on the sentence matrix and generates (variable-length) feature maps. Then 1-max pooling is performed over each map, i.e., the largest number from each feature map is recorded. Thus a univariate feature vector is generated from all six maps, and these 6 features are concatenated to form a feature vector for the penultimate layer. The final softmax layer then receives this feature vector as input and uses it to classify the sentence; here we assume binary classification and hence depict two possible output states. Source: Zhang, Y., &amp; Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification" width="1024" height="937" srcset="http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-12.05.40-PM-1024x937.png 1024w, http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-12.05.40-PM-300x274.png 300w, http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-12.05.40-PM.png 1504w" sizes="(max-width: 767px) 89vw, (max-width: 1000px) 54vw, (max-width: 1071px) 543px, 580px" /></a><figcaption class="wp-caption-text">Illustration of a Convolutional Neural Network (CNN) architecture for sentence classification. Here we depict three filter region sizes: 2, 3 and 4, each of which has 2 filters. Every filter performs convolution on the sentence matrix and generates (variable-length) feature maps. 
Then 1-max pooling is performed over each map, i.e., the largest number from each feature map is recorded. Thus a univariate feature vector is generated from all six maps, and these 6 features are concatenated to form a feature vector for the penultimate layer. The final softmax layer then receives this feature vector as input and uses it to classify the sentence; here we assume binary classification and hence depict two possible output states. Source: Zhang, Y., &amp; Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification.</figcaption></figure>
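<p>To make the dimension bookkeeping concrete, here is a minimal NumPy sketch of the same pipeline applied to the 10&#215;100 sentence matrix from the text. The random values stand in for real embeddings and learned filters, the helper name <code>convolve_rows</code> is made up for illustration, and the activation function and bias are left out for brevity.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# A 10-word sentence with 100-dimensional embeddings (random stand-ins).
sentence = rng.standard_normal((10, 100))


def convolve_rows(sentence, filt):
    """Slide a full-width filter down the sentence matrix, one word at a time."""
    region_size = filt.shape[0]
    n_positions = sentence.shape[0] - region_size + 1
    return np.array([
        np.sum(sentence[i:i + region_size] * filt)
        for i in range(n_positions)
    ])


map_lengths = []
features = []
for region_size in (2, 3, 4):
    for _ in range(2):  # two filters per region size, as in the figure
        filt = rng.standard_normal((region_size, sentence.shape[1]))
        feature_map = convolve_rows(sentence, filt)
        map_lengths.append(len(feature_map))
        features.append(feature_map.max())  # 1-max pooling: keep the largest value

features = np.array(features)
print(map_lengths)     # feature map length is 10 - region_size + 1
print(features.shape)  # one pooled feature per filter
```

<p>Each region size produces feature maps of a different length, but after 1-max pooling every filter contributes exactly one number, so the classifier always sees a fixed-size vector.</p>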
<p>What about the nice intuitions we had for Computer Vision? Location Invariance and local Compositionality made intuitive sense for images, but not so much for NLP. You probably do care a lot where in the sentence a word appears. Pixels close to each other are likely to be semantically related (part of the same object), but the same isn&#8217;t always true for words. In many languages, parts of phrases could be separated by several other words. The compositional aspect isn&#8217;t obvious either. Clearly, words compose in some ways, like an adjective modifying a noun, but how exactly this works and what higher-level representations actually &#8220;mean&#8221; isn&#8217;t as obvious as in the Computer Vision case.</p>

<p>Given all this, it seems like CNNs wouldn&#8217;t be a good fit for NLP tasks. <a href="http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/">Recurrent Neural Networks</a> make more intuitive sense. They resemble how we process language (or at least how we think we process language): Reading sequentially from left to right. Fortunately, this doesn&#8217;t mean that CNNs don&#8217;t work. <a href="https://en.wikipedia.org/wiki/All_models_are_wrong">All models are wrong, but some are useful</a>. It turns out that CNNs applied to NLP problems perform quite well. The simple <a href="https://en.wikipedia.org/wiki/Bag-of-words_model">Bag of Words model</a> is an obvious oversimplification with incorrect assumptions, but has nonetheless been the standard approach for years and led to pretty good results.</p>
<p><span style="line-height: 1.5;">A big argument for CNNs is that they are fast. Very fast. Convolutions are a central part of computer graphics and implemented on a hardware level on GPUs. Compared to something like <a href="https://en.wikipedia.org/wiki/N-gram">n-grams</a>, CNNs are also <em>efficient</em> in terms of representation. With a large vocabulary, computing anything more than 3-grams can quickly become expensive. Even Google doesn&#8217;t provide anything beyond 5-grams. Convolutional Filters learn good representations automatically, without needing to represent the whole vocabulary. It&#8217;s completely reasonable to have filters of size larger than 5. I like to think that many of the learned filters in the first layer are capturing features quite similar (but not limited) to n-grams, but represent them in a more compact way.</span></p>
<h3>CNN Hyperparameters</h3>
<p>Before explaining how CNNs are applied to NLP tasks, let&#8217;s look at some of the choices you need to make when building a CNN. Hopefully this will help you better understand the literature in the field.</p>
<h4>Narrow vs. Wide convolution</h4>
<p>When I explained convolutions above I neglected a little detail of how we apply the filter. Applying a 3&#215;3 filter at the center of the matrix works fine, but what about the edges? How would you apply the filter to the first element of a matrix that doesn&#8217;t have any neighboring elements to the top and left? You can use <em>zero-padding</em>. All elements that would fall outside of the matrix are taken to be zero. By doing this you can apply the filter to every element of your input matrix, and get a larger or equally sized output. Adding zero-padding is also called <em>wide convolution</em>, and not using zero-padding would be a <em>narrow convolution</em>. An example in 1D looks like this:</p>

<figure id="attachment_407" style="max-width: 1024px" class="wp-caption alignnone"><a href="http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-05-at-9.47.41-AM.png"><img class="wp-image-407 size-large" src="http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-05-at-9.47.41-AM-1024x261.png" alt="Narrow vs. Wide Convolution. Source: A Convolutional Neural Network for Modelling Sentences (2014)" width="1024" height="261" srcset="http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-05-at-9.47.41-AM-1024x261.png 1024w, http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-05-at-9.47.41-AM-300x77.png 300w, http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-05-at-9.47.41-AM.png 1536w" sizes="(max-width: 767px) 89vw, (max-width: 1000px) 54vw, (max-width: 1071px) 543px, 580px" /></a><figcaption class="wp-caption-text">Narrow vs. Wide Convolution. Filter size 5, input size 7. Source: A Convolutional Neural Network for Modelling Sentences (2014)</figcaption></figure>
<p>You can see how wide convolution is useful, or even necessary, when you have a large filter relative to the input size. In the above, the narrow convolution yields  an output of size <img src="//s0.wp.com/latex.php?latex=%287-5%29+%2B+1%3D3&#038;bg=ffffff&#038;fg=000&#038;s=0" alt="(7-5) + 1=3" title="(7-5) + 1=3" class="latex" />, and a wide convolution an output of size <img src="//s0.wp.com/latex.php?latex=%287%2B2%2A4+-+5%29+%2B+1+%3D11&#038;bg=ffffff&#038;fg=000&#038;s=0" alt="(7+2*4 - 5) + 1 =11" title="(7+2*4 - 5) + 1 =11" class="latex" />. More generally, the formula for the output size is <img src="//s0.wp.com/latex.php?latex=n_%7Bout%7D%3D%28n_%7Bin%7D+%2B+2%2An_%7Bpadding%7D+-+n_%7Bfilter%7D%29+%2B+1+&#038;bg=ffffff&#038;fg=000&#038;s=0" alt="n_{out}=(n_{in} + 2*n_{padding} - n_{filter}) + 1 " title="n_{out}=(n_{in} + 2*n_{padding} - n_{filter}) + 1 " class="latex" />.</p>
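<p>The output-size formula is easy to check in code. The sketch below assumes a stride of 1, as the formula does, and the helper name <code>conv_output_size</code> is made up for illustration.</p>

```python
def conv_output_size(n_in, n_filter, n_padding=0):
    """Output length of a stride-1 1D convolution.

    n_padding is the number of zeros added to EACH side of the input.
    """
    return (n_in + 2 * n_padding - n_filter) + 1


# Values from the figure: input size 7, filter size 5.
narrow = conv_output_size(n_in=7, n_filter=5)             # no padding
wide = conv_output_size(n_in=7, n_filter=5, n_padding=4)  # n_filter - 1 zeros per side
print(narrow, wide)  # 3 11
```

<p>Padding each side with <code>n_filter - 1</code> zeros is what lets the filter touch every input element from every offset, which is why the wide output is longer than the input.</p>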

## Filters

At the core of CNNs are filters (also called weights or kernels) which convolve (slide) across our input to extract relevant features. The filters are initialized randomly but learn to pick up meaningful features from the input that help optimize the objective. We're going to teach CNNs in an unorthodox way, focusing entirely on applying them to 2D text data. Each input is composed of words, and we will represent each word as a one-hot encoded vector, which gives us our 2D input. The intuition here is that each filter represents a feature, and we reuse the same filter across all positions and all inputs to capture that feature. This is known as parameter sharing.

<img src="https://raw.githubusercontent.com/GokuMohandas/practicalAI/master/images/conv.gif" width=400>

In [1]:
import torch
import torch.nn as nn

Our inputs are a batch of 2D text data. Let's make an input with 64 samples, where each sample has 8 words and each word is represented by an array of 10 values (one-hot encoded with a vocab size of 10). This gives our inputs the size (64, 8, 10). The [PyTorch CNN modules](https://pytorch.org/docs/stable/nn.html#convolution-functions) expect the channel dim (the one-hot vector dim in our case) to be in the second position, so our inputs are of shape (64, 10, 8).

<img src="https://raw.githubusercontent.com/GokuMohandas/practicalAI/master/images/cnn_text1.png" width=400>

In [3]:
# Assume all our inputs have the same # of words
batch_size = 64
sequence_size = 8 # words per input
one_hot_size = 10 # vocab size (num_input_channels)
x = torch.randn(batch_size, one_hot_size, sequence_size)
print("Size: {}".format(x.shape))

Size: torch.Size([64, 10, 8])
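The cell above uses random values as a stand-in. As a rough sketch of how a real one-hot input of the same shape could be built, assuming the words have already been mapped to integer ids (the `token_ids` tensor below is made up for illustration):

```python
import torch
import torch.nn.functional as F

batch_size, sequence_size, vocab_size = 64, 8, 10

# Hypothetical token ids: one integer in [0, vocab_size) per word
token_ids = torch.randint(low=0, high=vocab_size, size=(batch_size, sequence_size))

# One-hot encode -> (batch, words, vocab)
x = F.one_hot(token_ids, num_classes=vocab_size).float()
print("Size: {}".format(x.shape))

# Move the one-hot (channel) dim to the second position -> (batch, vocab, words)
x = x.permute(0, 2, 1)
print("Size: {}".format(x.shape))
```

The `permute` call is the step that turns the natural (64, 8, 10) layout into the (64, 10, 8) layout that `nn.Conv1d` expects.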


We want to convolve on this input using filters. For simplicity we will use just 5 filters, each of size (1, 2) and with the same depth as the number of channels (one_hot_size). This gives our filters a shape of (5, 2, 10), but recall that PyTorch CNN modules expect the channel dim (the one-hot vector dim in our case) to be in the second position, so the filters are of shape (5, 10, 2).

<img src="https://raw.githubusercontent.com/GokuMohandas/practicalAI/master/images/cnn_text2.png" width=400>

In [10]:
# Create filters for a conv layer
out_channels = 5 # of filters
kernel_size = 2 # filters size 2
conv1 = nn.Conv1d(in_channels=one_hot_size, out_channels=out_channels, kernel_size=kernel_size)
print("Size: {}".format(conv1.weight.shape))
print("Filter size: {}".format(conv1.kernel_size[0]))
print("Padding: {}".format(conv1.padding[0]))
print("Stride: {}".format(conv1.stride[0]))

Size: torch.Size([5, 10, 2])
Filter size: 2
Padding: 0
Stride: 1
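Because the same filters are reused at every position (parameter sharing), the number of parameters in the conv layer does not depend on the sequence length. The sketch below compares it with a fully connected layer that would produce the same number of outputs for an 8-word input; the `count_params` helper and the dense layer are illustrative assumptions, not part of the model built later.

```python
import torch.nn as nn


def count_params(module):
    """Total number of learnable values in a module."""
    return sum(p.numel() for p in module.parameters())


# 5 filters, each spanning 2 positions of a 10-channel input
conv = nn.Conv1d(in_channels=10, out_channels=5, kernel_size=2)

# Fully connected alternative: 8 words * 10 channels in, 7 positions * 5 filters out
dense = nn.Linear(in_features=10 * 8, out_features=5 * 7)

print(count_params(conv))   # 5*10*2 weights + 5 biases = 105
print(count_params(dense))  # 80*35 weights + 35 biases = 2835
```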


When we apply this filter on our inputs, we receive an output of shape (64, 5, 7). We get 64 for the batch size, 5 for the channel dim because we used 5 filters and 7 for the conv outputs because:

$\frac{W - F + 2P}{S} + 1 = \frac{8 - 2 + 2(0)}{1} + 1 = 7$

where:
  * W: width of each input
  * F: filter size
  * P: padding
  * S: stride
    
<img src="https://raw.githubusercontent.com/GokuMohandas/practicalAI/master/images/cnn_text3.png" width=400>

In [5]:
# Convolve using filters
conv_output = conv1(x)
print("Size: {}".format(conv_output.shape))

Size: torch.Size([64, 5, 7])
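To see what a single entry of this output contains, we can recompute one value by hand. This is a small sketch with its own randomly initialized layer and input (so it runs on its own); the indices `b`, `f` and `t` are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(64, 10, 8)
conv = nn.Conv1d(in_channels=10, out_channels=5, kernel_size=2)
out = conv(x)

# Recompute one output value: sample 0, filter 3, window starting at word 4
b, f, t = 0, 3, 4
window = x[b, :, t:t + 2]  # all 10 channels, 2 adjacent words -> (10, 2)
manual = (conv.weight[f] * window).sum() + conv.bias[f]

matches = torch.allclose(manual, out[b, f, t], atol=1e-5)
print("Manual value matches layer output: {}".format(matches))
```

Each output value is just the elementwise product of one filter with one window of the input, summed, plus that filter's bias.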


## Pooling

The result of convolving filters on an input is a feature map. Due to the nature of convolution and overlaps, our feature map will have lots of redundant information. Pooling is a way to summarize a high-dimensional feature map into a lower dimensional one for simplified downstream computation. The pooling operation can be the max value, average, etc. in a certain receptive field.

<img src="https://raw.githubusercontent.com/GokuMohandas/practicalAI/master/images/pool.jpeg" width=450>

In [6]:
# Max pooling
kernel_size = 2
pool1 = nn.MaxPool1d(kernel_size=kernel_size, stride=2, padding=0)
pool_output = pool1(conv_output)
print("Size: {}".format(pool_output.shape))

Size: torch.Size([64, 5, 3])


$\left\lfloor \frac{W - F}{S} \right\rfloor + 1 = \left\lfloor \frac{7 - 2}{2} \right\rfloor + 1 = \lfloor 2.5 \rfloor + 1 = 3$
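The floor in this formula means the last position is dropped when the window does not fit evenly. The sketch below checks that on a random tensor of the same shape as our conv output, and also shows max pooling over the whole sequence, which keeps a single value per filter regardless of input length.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
conv_output = torch.randn(64, 5, 7)

# Window 2, stride 2: windows cover positions (0,1), (2,3), (4,5); position 6 is dropped
pooled = F.max_pool1d(conv_output, kernel_size=2, stride=2)
print("Size: {}".format(pooled.shape))

# Max over the entire sequence dim: one value per filter
global_max = conv_output.max(dim=2).values
print("Size: {}".format(global_max.shape))
```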

## Implementing a CNN for Text Classification
We're going to use convolutional neural networks on text data, which typically involves convolving over the character-level representation of the text to capture meaningful n-grams.

You can easily use this setup for [time series](https://arxiv.org/abs/1807.10707) data or [combine it](https://arxiv.org/abs/1808.04928) with other networks. For text data, we will create filters of varying kernel sizes (1,2), (1,3), and (1,4) which act as feature selectors of varying n-gram sizes. The outputs are concatenated and fed into a fully-connected layer for class predictions. In our example, we will be applying 1D convolutions on letters in a word. In the [embeddings notebook](https://colab.research.google.com/github/GokuMohandas/practicalAI/blob/master/notebooks/12_Embeddings.ipynb), we will apply 1D convolutions on words in a sentence.

**Word embeddings**: capture the temporal correlations among adjacent tokens so that similar words have similar representations. Ex. "New Jersey" is close to "NJ" is close to "Garden State", etc.

**Char embeddings**: create representations that map words at a character level. Ex. "toy" and "toys" will be close to each other.
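As a rough sketch of the architecture described above (parallel convolutions with kernel sizes 2, 3 and 4, max pooling over the sequence, concatenation, then a fully-connected layer), something like the following would work. The class name `TextCNN`, the ReLU activation, the input channel count of 28 and the sequence length of 12 are assumptions for illustration, not the notebook's actual model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextCNN(nn.Module):
    def __init__(self, in_channels, num_filters, num_classes,
                 kernel_sizes=(2, 3, 4), dropout_p=0.1):
        super().__init__()
        # One conv layer per n-gram size
        self.convs = nn.ModuleList(
            [nn.Conv1d(in_channels, num_filters, kernel_size=k) for k in kernel_sizes]
        )
        self.dropout = nn.Dropout(dropout_p)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, x):
        # x: (batch, in_channels, sequence_length)
        # Max over the sequence dim gives (batch, num_filters) per kernel size
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        features = torch.cat(pooled, dim=1)  # (batch, num_filters * len(kernel_sizes))
        return self.fc(self.dropout(features))


# Hypothetical sizes: 28 one-hot character channels, names padded to 12 characters
model = TextCNN(in_channels=28, num_filters=100, num_classes=18)
x = torch.randn(64, 28, 12)
logits = model(x)
print("Size: {}".format(logits.shape))
```

Pooling over the whole sequence is what lets filters of different sizes, which produce feature maps of different lengths, be concatenated into one fixed-size vector.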

In [13]:
import os
from argparse import Namespace
import collections
import copy
import json
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re
import urllib.request  # the request submodule must be imported explicitly

In [12]:
# Set Numpy and PyTorch seeds
def set_seeds(seed, cuda):
    np.random.seed(seed)
    torch.manual_seed(seed)
    if cuda:
        torch.cuda.manual_seed_all(seed)
        
# Creating directories
def create_dirs(dirpath):
    if not os.path.exists(dirpath):
        os.makedirs(dirpath)

In [21]:
# Arguments
args = Namespace(
    seed=1234,
    cuda=False,
    shuffle=True,
    data_file="names.csv",
    vectorizer_file="vectorizer.json",
    model_state_file="model.pth",
    save_dir="data",
    train_size=0.7,
    val_size=0.15,
    test_size=0.15,
    num_epochs=20,
    early_stopping_criteria=5,
    learning_rate=1e-3,
    batch_size=64,
    num_filters=100,
    dropout_p=0.1,
)

# Set seeds
set_seeds(seed=args.seed, cuda=args.cuda)

# Create save dir
create_dirs(args.save_dir)

# Expand filepaths
args.vectorizer_file = os.path.join(args.save_dir, args.vectorizer_file)
args.model_state_file = os.path.join(args.save_dir, args.model_state_file)

# Check CUDA
if not torch.cuda.is_available():
    args.cuda = False
args.device = torch.device("cuda" if args.cuda else "cpu")
print("Using CUDA: {}".format(args.cuda))

Using CUDA: False


In [22]:
# Download data from GitHub to the notebook's local drive
url = "https://raw.githubusercontent.com/GokuMohandas/practicalAI/master/data/surnames.csv"
response = urllib.request.urlopen(url)
content = response.read()  # raw CSV bytes
with open(args.data_file, 'wb') as fp:
    fp.write(content)

# Raw data
df = pd.read_csv(args.data_file, header=0)
df.head()

| | surname | nationality |
|---|---|---|
| 0 | Woodford | English |
| 1 | Coté | French |
| 2 | Kore | English |
| 3 | Koury | Arabic |
| 4 | Lebzak | Russian |


In [23]:
# Split by nationality
by_nationality = collections.defaultdict(list)
for _, row in df.iterrows():
    by_nationality[row.nationality].append(row.to_dict())
for nationality in by_nationality:
    print ("{0}: {1}".format(nationality, len(by_nationality[nationality])))

English: 2972
French: 229
Arabic: 1603
Russian: 2373
Japanese: 775
Chinese: 220
Italian: 600
Czech: 414
Irish: 183
German: 576
Greek: 156
Spanish: 258
Polish: 120
Dutch: 236
Vietnamese: 58
Korean: 77
Portuguese: 55
Scottish: 75


In [24]:
# Create split data
final_list = []
for _, item_list in sorted(by_nationality.items()):
    if args.shuffle:
        np.random.shuffle(item_list)
    n = len(item_list)
    n_train = int(args.train_size*n)
    n_val = int(args.val_size*n)
    # The test split takes whatever remains after train and val

    # Give each data point a split attribute
    for item in item_list[:n_train]:
        item['split'] = 'train'
    for item in item_list[n_train:n_train+n_val]:
        item['split'] = 'val'
    for item in item_list[n_train+n_val:]:
        item['split'] = 'test'  

    # Add to final list
    final_list.extend(item_list)

In [25]:
# df with split datasets
split_df = pd.DataFrame(final_list)
split_df["split"].value_counts()

train    7680
test     1660
val      1640
Name: split, dtype: int64

In [26]:
# Preprocessing
def preprocess_text(text):
    text = ' '.join(word.lower() for word in text.split(" "))
    text = re.sub(r"([.,!?])", r" \1 ", text)
    text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
    return text
    
split_df.surname = split_df.surname.apply(preprocess_text)
split_df.head()

| | nationality | split | surname |
|---|---|---|---|
| 0 | Arabic | train | bishara |
| 1 | Arabic | train | nahas |
| 2 | Arabic | train | ghanem |
| 3 | Arabic | train | tannous |
| 4 | Arabic | train | mikhail |
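It is worth checking what `preprocess_text` does to surnames that contain characters outside `a-z`. The sketch below repeats the function so it runs on its own and applies it to two names from the raw data plus one made-up example ("O'Neill"). Note that the final regex replaces accented letters and apostrophes with a space.

```python
import re


def preprocess_text(text):
    text = ' '.join(word.lower() for word in text.split(" "))
    text = re.sub(r"([.,!?])", r" \1 ", text)     # pad punctuation with spaces
    text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)  # replace everything else with a space
    return text


# "Cot\u00e9" is "Coté" with a precomposed e-acute
examples = ["Woodford", "O'Neill", "Cot\u00e9"]
cleaned = [preprocess_text(name) for name in examples]
for name, result in zip(examples, cleaned):
    print("{0!r} -> {1!r}".format(name, result))
```

If accented characters carry signal for the nationality task, it may be preferable to normalize them (for example to their ASCII base letters) before this step rather than dropping them.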
