# CNN-based Text Classification

## Colab Setup

You can skip this section if you are not running on Google Colab.

If you are running with GPUs, first check that they are enabled.

In [None]:
!nvidia-smi

In [2]:
import torch
torch.cuda.is_available()

True

This should be True. If it is not, debug before continuing. (Note: the version of PyTorch I used is not compatible with the CUDA drivers on Colab. Follow these instructions here explicitly.)

In [None]:
!pwd

This should be "/content" on Colab.

First, if you are running from Colab, you must install the package. (You may skip this step if you have already installed it.)

In [None]:
!git clone --single-branch --branch colab https://github.com/will-thompson-k/deeplearning-nlp-models.git
%cd deeplearning-nlp-models

In [None]:
!pip install datasets torchtext==0.8.0

In [None]:
!python setup.py install

## Imports

Here are the packages we need to import.

In [5]:
from nlpmodels.models import text_cnn
from nlpmodels.utils import train,utils
from nlpmodels.utils.elt import text_cnn_dataset
from argparse import Namespace
utils.set_seed_everywhere()


## Sentiment Analysis with CNNs

Following the logic in Kim's paper, we use an embedding layer followed by a convolutional layer
to perform sentiment analysis.
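
To make the architecture concrete, here is a minimal PyTorch sketch of an embedding + convolution classifier in the spirit of Kim's paper. It is not the `nlpmodels` implementation: the class name `TextCNNSketch`, the `padding_idx` default, and the use of `nn.Conv1d` with ReLU and max-over-time pooling are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextCNNSketch(nn.Module):
    """Illustrative Kim-style CNN (hypothetical; not the nlpmodels class)."""

    def __init__(self, vocab_size, dim_model, num_filters, window_sizes,
                 num_classes, dropout, padding_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim_model, padding_idx=padding_idx)
        # One 1-D convolution per window size, each sliding along the token axis.
        self.convs = nn.ModuleList(
            [nn.Conv1d(dim_model, num_filters, kernel_size=w) for w in window_sizes]
        )
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(num_filters * len(window_sizes), num_classes)

    def forward(self, token_ids):
        # (batch, seq_len) -> (batch, seq_len, dim_model) -> (batch, dim_model, seq_len)
        embedded = self.embedding(token_ids).transpose(1, 2)
        # Max-over-time pooling keeps one value per filter for each window size.
        pooled = [F.relu(conv(embedded)).max(dim=2).values for conv in self.convs]
        features = torch.cat(pooled, dim=1)
        return self.classifier(self.dropout(features))


# Dummy batch: 4 sequences of 50 token ids drawn from a toy vocabulary of 1000.
model_sketch = TextCNNSketch(vocab_size=1000, dim_model=128, num_filters=128,
                             window_sizes=[3, 5, 7], num_classes=2, dropout=0.5)
dummy_batch = torch.randint(0, 1000, (4, 50))
logits = model_sketch(dummy_batch)  # one row of class scores per sequence
```

Each window size acts like an n-gram detector, and pooling over the token axis makes the classifier input independent of sequence length.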

### Hyper-parameters

These are the data processing and model training hyper-parameters for this run.

In [6]:
args = Namespace(
        # Model hyper-parameters
        max_sequence_length=50, # Often you choose this so that there is minimal padding. 95th percentile=582
        dim_model=128, # Embedding size controls the projection of a vocabulary.
        num_filters=128, # output filters from convolution
        window_sizes=[3,5,7], # different filter sizes; total number of filters = len(window_sizes)*num_filters
        num_classes=2, # binary classification problem
        dropout=0.5, # 0.5 from original implementation, kind of high compared to other papers (usually 0.1)
        # Training hyper-parameters
        num_epochs=30,
        learning_rate=1.e-6, # choosing the LR is important; it is often paired with a scheduler that changes it during training
        batch_size=64
)
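
As a quick sanity check on these settings, the arithmetic below works out how many positions each filter size sees and how many features reach the classifier. It assumes an unpadded, stride-1 convolution followed by max-over-time pooling, which may differ from what `text_cnn.TextCNN` does internally.

```python
max_sequence_length = 50
num_filters = 128
window_sizes = [3, 5, 7]

# An unpadded, stride-1 convolution with window w over L tokens
# yields L - w + 1 positions per filter.
conv_positions = {w: max_sequence_length - w + 1 for w in window_sizes}

# After max-over-time pooling, each filter contributes a single feature.
total_features = len(window_sizes) * num_filters

print(conv_positions)  # {3: 48, 5: 46, 7: 44}
print(total_features)  # 384
```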

In [None]:
train_loader, vocab = text_cnn_dataset.TextCNNDataset.get_training_dataloader(args)
model = text_cnn.TextCNN(vocab_size = len(vocab),
                        dim_model = args.dim_model,
                        num_filters = args.num_filters,
                        window_sizes =  args.window_sizes,
                        num_classes = args.num_classes,
                        dropout = args.dropout)

trainer = train.TextCNNTrainer(args, vocab.mask_index, model, train_loader, vocab)

Let's run this.

In [None]:
trainer.run()

### Review

The goal is just to show how this works; you can adjust the hyper-parameters as you see fit.
Ideally, we would evaluate the model on a held-out validation or test set to diagnose performance.
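
A held-out check could look roughly like the sketch below. The helpers `train_val_split` and `accuracy` are hypothetical and not part of `nlpmodels`; the toy examples and the constant baseline only stand in for real data and real model predictions.

```python
import random


def train_val_split(examples, val_fraction=0.1, seed=0):
    """Shuffle a copy of the examples and hold out a fraction for validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    num_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[num_val:], shuffled[:num_val]


def accuracy(predicted_labels, true_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)


# Toy (text, label) pairs standing in for real reviews.
examples = [(f"review {i}", i % 2) for i in range(100)]
train_examples, val_examples = train_val_split(examples, val_fraction=0.2)

# A trivial "always predict 1" baseline, scored on the held-out split only.
val_labels = [label for _, label in val_examples]
baseline_accuracy = accuracy([1] * len(val_labels), val_labels)
```

Comparing the trained model against a trivial baseline like this on the same held-out split shows whether it has learned anything beyond the class balance.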

#### Parameter importance

While experimenting with the model, I noted a few things:

- *L2 regularization*: Unlike the original paper, I didn't end up using L2 regularization.
- *dictionary pruning*: The original dictionary had 75k tokens. I pruned any token with a frequency below 0.1%, bringing it down
to under 20k.
- *max_sequence_length*: Generally, you want to avoid truncating sentences, which means setting this to the length of the longest sequence.
Here, however, the longest sequence was ~2k tokens while the 95th percentile was ~500, so I chose to truncate some sentences.
- *learning_rate*: I kept the learning rate static,
but it often makes sense to use a scheduler that allows larger parameter changes initially and then fine-tunes with smaller updates.
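
The pruning, truncation, and scheduling ideas above can be sketched in plain Python. All three helpers (`prune_vocab`, `length_percentile`, `step_decay`) are hypothetical and not part of `nlpmodels`. The sketch also assumes that "frequency" means a token's share of all token occurrences, that the percentile uses the nearest-rank method, and that the schedule is a simple step decay; the notebook itself does not specify any of these.

```python
import math
from collections import Counter


def prune_vocab(tokenized_texts, min_relative_freq=0.001):
    """Keep tokens whose share of all token occurrences meets the threshold."""
    counts = Counter(token for text in tokenized_texts for token in text)
    total = sum(counts.values())
    return {tok for tok, c in counts.items() if c / total >= min_relative_freq}


def length_percentile(tokenized_texts, pct=95):
    """Nearest-rank percentile of sequence lengths (in tokens)."""
    lengths = sorted(len(text) for text in tokenized_texts)
    rank = max(1, math.ceil(pct / 100 * len(lengths)))
    return lengths[rank - 1]


def step_decay(initial_lr, epoch, decay_factor=0.5, step_size=10):
    """Multiply the learning rate by decay_factor every step_size epochs."""
    return initial_lr * decay_factor ** (epoch // step_size)


# Toy corpus: "rare" appears once in 12 tokens and falls below a 10% threshold.
corpus = [["good"] * 5, ["bad"] * 4, ["good", "bad", "rare"]]
kept_tokens = prune_vocab(corpus, min_relative_freq=0.1)

# Pick a cut-off from the length distribution, then truncate longer texts.
max_len = length_percentile(corpus, pct=95)
truncated = [text[:max_len] for text in corpus]

# Learning rate at epochs 0, 10, and 20 for a schedule starting at 1e-3.
lrs = [step_decay(1e-3, epoch) for epoch in (0, 10, 20)]
```

In a real run the thresholds would come from the corpus statistics, and the step decay could be replaced by one of the schedulers in `torch.optim.lr_scheduler`.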

