# Paper: Character-level Convolutional Networks for Text Classification

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import math

# Inline plotting
%matplotlib inline

## Summary

The paper proposes character-level convolutional networks (ConvNets) as a method for text classification.

Like traditional ConvNets for images, the character-based ConvNet consists of convolutional layers followed by pooling layers. In a convolutional layer, the rows of the input matrix are convolved with one-dimensional filters and the results are summed into a single vector per output feature. The pooling layer performs non-overlapping max-pooling. Before the text documents are fed to the ConvNet, they are quantised: each character is converted to a one-hot vector, and the vectors are concatenated into a matrix.

This new method is compared to different traditional methods and deep learning methods (two word-based ConvNets and an LSTM) using several large datasets. The authors conclude that although the character-based ConvNet is an effective method, it is not a silver bullet. The performance depends on factors like the size of the dataset, whether the documents are curated and the choice of the alphabet.

## Quantisation

1. Define an alphabet of characters
  - In the article, the alphabet consists of 26 English letters, 10 digits, 33 other characters and the newline character, so 70 characters in total.
2. Quantise each character as a one-hot vector
  - Each character vector has 70 dimensions.
  - Spaces and characters outside the alphabet are zero vectors.
  - Characters are quantised in backward (reverse) order, so the most recently read characters end up near the start of the output.
3. Quantise each document by concatenating the corresponding character vectors.
  - Document size is fixed to $l_0$, e.g. $l_0 = 1014$.
  - Documents with more than $l_0$ characters are truncated.
  - Documents with fewer than $l_0$ characters are padded with zero vectors.
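
The quantisation steps above can be sketched in a few lines of NumPy. This is only an illustration: the function name `quantise`, the tiny three-letter alphabet in the usage example, the lowercasing step and the `(alphabet size, l0)` orientation of the matrix are all assumptions made for the sketch, and the backward ordering is left out to keep it short.

```python
import numpy as np

def quantise(text, alphabet, l0):
    """One-hot encode `text` into a matrix of shape (len(alphabet), l0)."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    matrix = np.zeros((len(alphabet), l0))
    # Slicing truncates long documents; short documents simply leave
    # the remaining columns as zero vectors (padding).
    for pos, ch in enumerate(text.lower()[:l0]):  # lowercasing is an assumption
        row = index.get(ch)
        if row is not None:  # spaces and unknown characters stay all-zero
            matrix[row, pos] = 1.0
    return matrix

# Toy alphabet (hypothetical), not the 70-character alphabet from the paper
m = quantise("ab c!", alphabet="abc", l0=6)
print(m.shape)  # (3, 6)
print(m[:, 0])  # [1. 0. 0.]
```

Columns 2, 4 and 5 of `m` are all zeros here: a space, a character outside the alphabet, and padding.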


<img src="figures/p0/quantisation-example.png" width="300" />











## Temporal Convolutional Module

<img src="figures/p0/convolutional-module.png" width="600" />
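
For reference, here is the definition that the code in this section follows, written in the paper's notation (input $g$ of length $l$, kernel $f$ of length $k$, stride $d$). It is a transcription, so it is worth checking against the figure above:

$$
h(y) = \sum_{x=1}^{k} f(x) \cdot g(y \cdot d - x + c), \qquad c = k - d + 1
$$

Both $x$ and $y$ start at 1 here. With zero-based loop variables $y' = y - 1$ and $x' = x - 1$, the zero-based index into $g$ becomes

$$
(y' + 1) \cdot d - (x' + 1) + c - 1 = (y' + 1) \cdot d - x' + c - 2 .
$$

For $d = 1$ this reduces to $y' + k - 1 - x'$, so output $y'$ covers $g[y'], \dots, g[y' + k - 1]$ with the filter applied in reverse order.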










In [2]:
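def convolve(g, f, d=1):
    """
    Computes the 1D convolution h(y) = sum_{x=1..k} f(x) * g(y*d - x + c)

    g is the input array
    f is the 1D filter
    d is the stride
    """
    l = g.shape[0] # Size of the input array
    k = f.shape[0] # Size of the filter
    c = k - d + 1  # Offset constant

    # Number of windows of size k that fit in the input with stride d.
    # For d = 1 this equals the paper's floor((l - k + 1)/d); for d > 1
    # the paper's formula can drop the last valid window.
    output_size = (l - k) // d + 1
    h = np.zeros(output_size)

    # Naive implementation
    for y in range(output_size):
        for x in range(k):
            # y and x are zero-based here while the formula is one-based:
            # (y+1)*d - (x+1) + c, minus 1 for the zero-based index into g
            gi = (y+1) * d - x + c - 2
            h[y] += f[x] * g[gi]

    return h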

In [3]:
g = np.array([1, 0, 0, 1, 1, 1, 0, 0, 1, 1])
f = np.array([0, 0, 1])
h = convolve(g, f)
h

array([1., 0., 0., 1., 1., 1., 0., 0.])

In [4]:
expected = np.array([1, 0, 0, 1, 1, 1, 0, 0])
(h == expected).all()

True
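
As a cross-check that does not depend on the loop above, NumPy's `np.convolve` in `"valid"` mode produces the same stride-1 result for this input, and slicing its output with `[::d]` keeps every `d`-th window. The helper name `strided_convolve` is made up for this sketch.

```python
import numpy as np

def strided_convolve(g, f, d=1):
    # "valid" keeps only positions where the filter fully overlaps the input;
    # taking every d-th value corresponds to moving the window d steps at a time.
    return np.convolve(g, f, mode="valid")[::d]

g = np.array([1, 0, 0, 1, 1, 1, 0, 0, 1, 1])
f = np.array([0, 0, 1])
print(strided_convolve(g, f))       # [1 0 0 1 1 1 0 0]
print(strided_convolve(g, f, d=3))  # [1 1 0]
```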

In a convolutional layer, each row of the input matrix is convolved with its own filter, and the resulting rows are then summed column-wise to give a single output vector. Here is an example:

<img src="figures/p0/1d-convolutional-example.png" width="900" />
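
The column-wise sum can also be sketched in code. The example below assumes an input `G` with one row per input feature and a filter bank `F` with one filter per row; the function name and the toy numbers are made up for illustration and are not taken from the figure.

```python
import numpy as np

def conv_layer_single_output(G, F):
    """G: (m, l) input features, F: (m, k) filters -> output of length l - k + 1."""
    # Convolve each input row with its own filter (stride 1, no padding) ...
    rows = np.array([np.convolve(g_i, f_i, mode="valid") for g_i, f_i in zip(G, F)])
    # ... then sum the m resulting rows column-wise into a single vector.
    return rows.sum(axis=0)

G = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0]])
F = np.array([[1, 0],
              [0, 1]])
print(conv_layer_single_output(G, F))  # [0 2 1]
```

Producing several output features means repeating this with a different filter bank `F` for each one.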












## Temporal Max-Pooling Module

<img src="figures/p0/max-pool-module.png" width="600" />










In [5]:
def maxPool(g, k, d=1):
    """
    Computes 1D max-pooling

    g is the input array
    k is the pooling size
    d is the stride
    """
    l = g.shape[0] # Size of the input array
    c = k - d + 1  # Offset constant

    # Note: using ceil() even though the paper uses floor().
    # With floor() the last complete window can be dropped when d > 1,
    # e.g. l=9, k=3, d=3 would give 2 outputs instead of 3.
    output_size = int(np.ceil((l - k + 1)/d))
    h = np.full(output_size, -np.inf)

    for y in range(output_size):
        for x in range(k):
            # Compensate for zero indexing
            gi = (y+1) * d - x + c - 2
            if g[gi] > h[y]:
                h[y] = g[gi]

    return h
    

In [6]:
g = np.array([3, 4, 3, 5, 6, 11, 4, 3, 42])
h = maxPool(g, k=3)
h

array([ 4.,  5.,  6., 11., 11., 11., 42.])

In [7]:
expected = np.array([ 4, 5, 6, 11, 11, 11, 42])
(h == expected).all()

True

In the paper, the authors mention that the pooling layers are non-overlapping, i.e. the pooling size and the stride are the same.

In [8]:
g = np.array([3, 4, 3, 5, 6, 11, 4, 3, 42])
h = maxPool(g, k=3, d=3)
h

array([ 4., 11., 42.])

In [9]:
expected = np.array([ 4, 11, 42])
(h == expected).all()

True
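
For the non-overlapping case there is a compact vectorised alternative: reshape the input into rows of `k` values and take the maximum of each row. This is a sketch with a made-up function name, and it assumes that any leftover values that do not fill a complete window are discarded.

```python
import numpy as np

def max_pool_non_overlapping(g, k):
    n = g.shape[0] // k  # number of complete windows
    # Drop the incomplete tail, put each window on its own row, take row maxima
    return g[:n * k].reshape(n, k).max(axis=1)

g = np.array([3, 4, 3, 5, 6, 11, 4, 3, 42])
print(max_pool_non_overlapping(g, k=3))  # [ 4 11 42]
```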

Here is an example of the max pooling operation:

<img src="figures/p0/max-pooling-example.png" width="600" />

















## Model Design

The authors tested two character-based ConvNet configurations:

<img src="figures/p0/network-configurations.png" width="400" />
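
To see how the input length shrinks through the convolutional part, the frame length can be traced layer by layer. The sketch below assumes the layer settings of Table 1 in the paper as I recall them (kernel sizes 7, 7, 3, 3, 3, 3, with pooling of size 3 after layers 1, 2 and 6), stride-1 convolutions without padding, and non-overlapping pooling; check the values against the table above before relying on them.

```python
def frame_length(l0, layers):
    """Propagate an input length through (kernel size, pool size) pairs."""
    l = l0
    for kernel, pool in layers:
        l = l - kernel + 1  # stride-1 convolution without padding
        if pool is not None:
            l = l // pool   # non-overlapping pooling of size `pool`
    return l

# (kernel size, pool size) per convolutional layer; assumed from the paper's Table 1
LAYERS = [(7, 3), (7, 3), (3, None), (3, None), (3, None), (3, 3)]

print(frame_length(1014, LAYERS))  # 34
```

Under these assumptions the result agrees with the closed form $l_6 = (l_0 - 96) / 27$, which gives 34 for $l_0 = 1014$.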






























Figure 1 from the paper:

<img src="figures/p0/figure-1.png" width="600" />










## Methodology

- Network weights are initialised using a Gaussian distribution.
  - The (mean, standard deviation) used for initialisation is (0, 0.02) for the large model and (0, 0.05) for the small model.
- The network is trained with stochastic gradient descent (SGD) with momentum 0.9 and a minibatch size of 128.
- The initial step size is 0.01, and it is halved every 3 epochs, 10 times in total.
- Each epoch takes a fixed number of random training samples, uniformly sampled across classes.
- Data is augmented by replacing words or phrases with their synonyms.
  - An English thesaurus derived from WordNet is used.
  - The synonyms of a word are ranked by semantic closeness to its most frequently seen meaning.
  - A random number $r$ of words in a document is replaced, and the synonym for each replaced word is also chosen at random.
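
The step size schedule from the list above can be written as a small function. This is a sketch under two assumptions: epochs are counted from 0, and the step size stays constant once it has been halved 10 times. The function and parameter names are hypothetical.

```python
def step_size(epoch, initial=0.01, halve_every=3, max_halvings=10):
    # Number of halvings applied so far, capped at max_halvings
    halvings = min(epoch // halve_every, max_halvings)
    return initial * 0.5 ** halvings

for epoch in (0, 3, 6, 30, 100):
    print(epoch, step_size(epoch))
```

With these settings the step size is 0.01 for epochs 0 to 2, 0.005 for epochs 3 to 5, and it bottoms out at 0.01 / 1024 from epoch 30 onwards.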

## Methods for Comparison

- Multinomial logistic regression using hand-crafted features:
  - Bag-of-words and its TFIDF
  - Bag-of-ngrams and its TFIDF
  - Bag-of-means on word embeddings
- Word-based ConvNets
  - using pretrained word2vec embeddings
  - using lookup tables
- Long short-term memory (LSTM) using pretrained word2vec embeddings
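
As a rough illustration of the first group of baselines, here is a minimal bag-of-words and TFIDF sketch. It uses a common textbook definition (raw counts as term frequency, `log(N / document frequency)` as inverse document frequency), whitespace tokenisation and a hand-picked vocabulary. These are assumptions for the example, not necessarily the exact feature pipeline of the paper.

```python
import numpy as np
from collections import Counter

def bag_of_words(docs, vocabulary):
    """Count matrix of shape (number of documents, vocabulary size)."""
    index = {w: i for i, w in enumerate(vocabulary)}
    counts = np.zeros((len(docs), len(vocabulary)))
    for row, doc in enumerate(docs):
        for word, n in Counter(doc.lower().split()).items():
            if word in index:  # words outside the vocabulary are ignored
                counts[row, index[word]] = n
    return counts

def tfidf(counts):
    n_docs = counts.shape[0]
    doc_freq = np.count_nonzero(counts, axis=0)     # documents containing each word
    idf = np.log(n_docs / np.maximum(doc_freq, 1))  # guard against division by zero
    return counts * idf

docs = ["the cat sat", "the dog sat down", "the cat ran"]
counts = bag_of_words(docs, vocabulary=["the", "cat", "dog"])
features = tfidf(counts)
print(counts)
print(features)
```

Because "the" occurs in every document, its inverse document frequency is `log(1) = 0` and its TFIDF column is all zeros, while the rarer "dog" gets the largest weight.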


## Datasets

<img src="figures/p0/datasets.png" width="600" />














## Experiment Results

Table 4 shows the testing errors of all the models. Numbers are percentages.

The following abbreviations are used:
- "Lg" stands for "large"
- "Sm" stands for "small"
- "w2v" is an abbreviation for "word2vec"
- "Lk" stands for "lookup table"
- "Th" stands for "thesaurus"
- ConvNets labeled "Full" are those that distinguish between lowercase and uppercase letters. The authors observe that making this distinction usually (but not always) gives worse results.

<img src="figures/p0/experiment-results.png" width="600" />














## Discussion

- The character-level ConvNet is an effective method.
- Character-level ConvNets start to do better than the traditional methods once the dataset reaches the scale of several million samples.
- ConvNets may work well for user-generated data.
- Distinguishing between uppercase and lowercase letters could make a difference. For million-scale datasets, it seems that not making such a distinction usually works better.
- The semantics of the task (sentiment analysis versus topic classification) do not seem to determine which method is better.
- Bag-of-means is a misuse of word2vec.
- There is no single machine learning model that works for all kinds of datasets.