## Neural Models for Document Classification

#### 1. Word Embeddings + CNN = Text Classification

A common approach to text classification uses a word embedding to represent words and a Convolutional Neural Network (CNN) to learn how to discriminate between documents.

ConvNets are effective at document classification mainly because they can pick out salient features (tokens or sequences of tokens) in a way that is invariant to their position within the input sequence.
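
Position invariance can be illustrated with a toy sketch. The function below is not a learned filter: it assumes a hand-written bigram "filter" that scores exact token matches, then applies max-over-time pooling. The name `conv1d_max` and the example sentences are made up for illustration.

```python
def conv1d_max(tokens, pattern):
    """Slide a fixed-width 'filter' over tokens and return the strongest match score."""
    width = len(pattern)
    scores = [
        # Score of one window: number of positions where token and pattern agree.
        sum(1.0 for t, p in zip(tokens[i:i + width], pattern) if t == p)
        for i in range(len(tokens) - width + 1)
    ]
    # Max-over-time pooling keeps the best score and discards where it occurred.
    return max(scores)


early = "not good at all , the plot drags".split()
late = "the plot drags and the acting is not good".split()

print(conv1d_max(early, ["not", "good"]))  # 2.0
print(conv1d_max(late, ["not", "good"]))   # 2.0
```

The phrase produces the same pooled feature whether it appears at the start or the end of the document, which is the property the paragraph above describes.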

The architecture is thus described as:
    
    1. Word Embedding. A distributed representation of words where different words that have a similar meaning (based on their usage) also have a similar representation.
    2. Convolutional Model. A feature extraction model that learns to extract salient features from documents represented using a word embedding.
    3. Fully-Connected Model. The interpretation of extracted features in terms of a predictive output.
    
**The CNN is a feature-extracting architecture: it is meant to be integrated into a larger network and trained in tandem with it to produce an end result.**
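
The three parts can be sketched as a single forward pass. This is a minimal NumPy illustration with random, untrained weights; the sizes (`vocab_size`, `embed_dim`, `kernel_size`, `n_filters`, `n_classes`) and the function name `forward` are assumptions chosen to keep the example small, not values from any paper.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim = 20, 8      # hypothetical toy sizes
kernel_size, n_filters = 3, 4
n_classes = 2

# 1. Word embedding: one dense vector per vocabulary index.
embedding = rng.normal(size=(vocab_size, embed_dim))
# 2. Convolutional model: each filter spans kernel_size consecutive word vectors.
filters = rng.normal(size=(n_filters, kernel_size, embed_dim))
# 3. Fully connected model: maps pooled features to class scores.
dense_w = rng.normal(size=(n_filters, n_classes))
dense_b = np.zeros(n_classes)


def forward(token_ids):
    x = embedding[token_ids]                                  # (seq_len, embed_dim)
    windows = np.stack([x[i:i + kernel_size]
                        for i in range(len(token_ids) - kernel_size + 1)])
    feature_maps = np.einsum("wke,fke->wf", windows, filters)  # (n_windows, n_filters)
    feature_maps = np.maximum(feature_maps, 0.0)               # ReLU
    pooled = feature_maps.max(axis=0)                          # max over positions
    logits = pooled @ dense_w + dense_b
    exp = np.exp(logits - logits.max())                        # numerically stable softmax
    return exp / exp.sum()


probs = forward(np.array([3, 7, 1, 12, 5, 9]))
print(probs.shape, probs.sum())
```

In practice all three sets of weights would be trained jointly by backpropagation in a deep learning framework; the sketch only shows how data flows through the stages.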

#### 2. Use a Single Layer CNN Architecture

The CNN text classification architecture comes from the paper by Yoon Kim:

    He found that pre-trained word vectors do very well on classification tasks, which suggests using embeddings that were pre-trained on very large text corpora.
    
    Transfer function: Rectified Linear.
    Kernel Sizes: 3, 4, 5
    Number of Filters: 100
    Dropout rate: 0.5
    Weight constraint (max L2 norm): 3
    Batch size: 50
    Update Rule: Adadelta
    
These configurations can be used to inspire a starting point for your own experiments.
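
The L2 value of 3 in this setup is, in Kim's paper, a cap on the norm of each weight vector (applied after a gradient step) rather than a penalty coefficient. A minimal sketch of that rescaling, assuming each row of the matrix is one weight vector; the function name `apply_max_norm` is made up for illustration:

```python
import numpy as np


def apply_max_norm(weights, s=3.0):
    """Rescale each row of `weights` so that its L2 norm is at most s."""
    norms = np.linalg.norm(weights, axis=1, keepdims=True)
    # Rows already within the limit get scale 1.0; longer rows are shrunk to norm s.
    scale = np.minimum(1.0, s / np.maximum(norms, 1e-12))
    return weights * scale


w = np.array([[3.0, 4.0],    # norm 5, rescaled to norm 3
              [1.0, 0.0]])   # norm 1, left unchanged
print(apply_max_norm(w))
```

Most frameworks expose an equivalent built-in constraint, so a hand-written version like this is rarely needed outside of illustration.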

#### 3. Dial in CNN Hyperparameters

When using convolutional neural networks for sentence classification, the following findings from Ye Zhang and Byron Wallace are useful steps toward building a model.

    1. The better choice between pretrained Word2Vec and GloVe embeddings differs from problem to problem, and both performed better than one-hot encoded word vectors.
    2. The size of the kernel is important and should be tuned for each problem.
    3. The number of feature maps is also important and should be tuned.
    4. The 1-max pooling generally outperformed other types of pooling.
    5. Dropout has little effect on the model performance.
    
Specific Heuristics:
    
    1. Use Word2Vec or GloVe word embeddings as a starting point and tune them while fitting the model.
    2. Grid search across different kernel sizes in the range 1-10 to find the optimal configuration for your problem.
    3. Search the number of filters from 100 to 600 and explore a dropout rate of 0.0 to 0.5 as part of the same search.
    4. Explore using tanh, relu, and linear activation functions.
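
The search ranges in these heuristics can be written down as an explicit grid. The sketch below only enumerates configurations; the step sizes (filters in steps of 100, dropout in steps of 0.1) and the names `search_space` and `iter_configs` are assumptions, and the training and validation routine is left to the reader.

```python
import itertools
import random

search_space = {
    "kernel_size": list(range(1, 11)),            # 1 to 10
    "n_filters": [100, 200, 300, 400, 500, 600],  # 100 to 600
    "dropout": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],    # 0.0 to 0.5
    "activation": ["tanh", "relu", "linear"],
}


def iter_configs(space):
    """Yield one dict per point in the Cartesian product of the search space."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


configs = list(iter_configs(search_space))
print(len(configs))  # 10 * 6 * 6 * 3 = 1080

# A full grid is expensive, so sampling a subset is a common compromise.
random.seed(0)
for cfg in random.sample(configs, 3):
    print(cfg)  # pass each cfg to your own train-and-validate routine
```

With over a thousand combinations, it is often more practical to tune kernel size first and then search the remaining settings around the best value.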

#### 4. Consider Character-Level CNNs

Text documents can be modeled at the character level using convnets that are capable of learning the relevant hierarchical structure of words, sentences, paragraphs, and more.

The promise of this method is that all of the effort required to prepare text could be avoided if a CNN can learn to abstract the salient details on its own.

In this method:
    
    * The model reads in one-hot encoded characters from a fixed-size alphabet.
    * Encoded characters are read in blocks or sequences of 1014 characters.
    * A stack of 6 convolutional layers with pooling follows, with 3 fully connected layers at the output end of the network to make a prediction.
    
This method performs better on problems that offer a larger corpus of text.
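
The input encoding for a character-level model can be sketched as follows. The alphabet here is a small hypothetical one and `max_len` is set to a toy value for readability; a real model would use its own alphabet and the fixed block length described above. Characters outside the alphabet and padding positions are left as all-zero rows, which is one common convention and an assumption of this sketch.

```python
import string

import numpy as np

ALPHABET = string.ascii_lowercase + string.digits + " .,;!?"   # hypothetical alphabet
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def one_hot_chars(text, max_len):
    """Return a (max_len, len(ALPHABET)) matrix with one row per character position."""
    encoded = np.zeros((max_len, len(ALPHABET)), dtype=np.float32)
    # Lowercase, then truncate to the fixed length; shorter texts stay zero-padded.
    for pos, ch in enumerate(text.lower()[:max_len]):
        idx = CHAR_TO_INDEX.get(ch)
        if idx is not None:          # unknown characters remain all-zero
            encoded[pos, idx] = 1.0
    return encoded


x = one_hot_chars("Great film!", max_len=16)
print(x.shape)        # (16, 42)
print(int(x.sum()))   # 11
```

No tokenization, stemming, or vocabulary building is involved, which is the preparation effort this approach aims to remove.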

#### 5. Consider Deeper CNNs for Classification

The hope is that networks deeper than those described above, or in the NLP text by Jason Brownlee, can achieve better performance.

Most neural NLP models have used only about 5 to 6 layers, in sharp contrast to computer vision models, which sometimes have up to 152 layers. This suggests that NLP problems may also benefit from very deep architectures rather than shallow ones.

This approach is called VDCNN (Very Deep Convolutional Neural Networks). The key to the approach is an embedding of individual characters rather than a word embedding. The main findings are:

    * The very deep architecture worked well on small and large datasets.
    * Deeper networks decrease classification error.
    * Maxpooling achieves better results than other, more sophisticated types of pooling.
    * Beyond a certain depth, going even deeper degrades accuracy; the shortcut connections used in the architecture help reduce this degradation.
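
A shortcut connection adds a block's input to its output, so a block can fall back to passing its input through unchanged. The sketch below is a toy illustration rather than the VDCNN architecture: `conv_block` stands in for a convolutional layer using a per-position linear map, and all names and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)


def conv_block(x, w):
    """Stand-in for a convolutional layer: a per-position linear map followed by ReLU."""
    return np.maximum(x @ w, 0.0)


def residual_block(x, w1, w2):
    """Two stacked layers whose output is added back to the input (the shortcut)."""
    return x + conv_block(conv_block(x, w1), w2)


seq_len, channels = 10, 16                       # hypothetical toy sizes
x = rng.normal(size=(seq_len, channels))
w_zero = np.zeros((channels, channels))

# With zero weights the stacked layers output zeros, so the block returns x unchanged.
out = residual_block(x, w_zero, w_zero)
print(np.allclose(out, x))  # True
```

Because the identity path is always available, adding more such blocks is less likely to make the signal harder to propagate, which is one intuition for why shortcuts matter as depth grows.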