# N-Gram detection with 1D Convolution

In the previous examples, we ran tasks on the embedding of each individual word, but sometimes we need to handle a sequence of ordered words.<br>
For instance, "hot dog" is not a kind of "dog"; it is a kind of food. "Paris Hilton" is also far from "Paris" in meaning. Even when you find "good" in a sentence, it might signal negative sentiment in the context "not good".<br>
The same is true not only for bi-grams, but also for tri-grams and generic N-grams.

The convolutional neural network (CNN) is a widely used model in computer vision today (for image classification, object detection, segmentation, etc.). In NLP, this convolutional architecture can also be applied to N-gram detection.<br>
In computer vision, 2D convolution (convolution over the 2 dimensions of width and height) is generally used, but in N-gram detection, 1D convolution is applied as follows.

![Bi-gram CNN](images/bigram_convolution.png)
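As a minimal sketch of this idea (the sizes below are arbitrary assumptions, not values used in this tutorial), a `torch.nn.Conv1d` with `kernel_size=2` slides over each pair of adjacent word embeddings. `Conv1d` expects the sequence on the last axis, so the embedding axis is moved to the channel position first.

```python
import torch
import torch.nn as nn

# Assumed toy sizes, for illustration only
batch_size, seq_len, embedding_dim, out_channels = 1, 5, 4, 3

# A batch of word embeddings: [batch_size, seq_len, embedding_dim]
embeddings = torch.randn(batch_size, seq_len, embedding_dim)

# kernel_size=2 means each output looks at 2 adjacent words (a bi-gram)
conv = nn.Conv1d(in_channels=embedding_dim, out_channels=out_channels, kernel_size=2)

# Conv1d expects [batch_size, channels, seq_len], so swap the last two axes
features = conv(embeddings.transpose(1, 2))

# Without padding there are seq_len - 1 bi-grams
print(features.shape)  # torch.Size([1, 3, 4])
```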

There are several variations of convolution-based N-gram detection in NLP.<br>
**Hierarchical convolutions** can capture patterns with gaps, such as "not --- good" or "see --- little", where "---" stands for a short sequence of words.<br>
As in image processing, **multiple channels** can also be used in NLP convolution. For instance, when each word has multiple embeddings (such as a word embedding, a POS-tag embedding, a position-wise word embedding, etc.), these embeddings can be treated as multiple channels.<br>
Alternatively, after applying convolutions for multiple N-grams (such as 2-gram, 4-gram, and 6-gram), the results can also be treated as multiple channels.
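To see why stacked (hierarchical) convolutions can cover patterns with gaps, the sketch below counts how many input words a single output position can see after several stacked layers. It assumes a stride of 1 and no dilation, and the helper name `receptive_field` is hypothetical.

```python
def receptive_field(kernel_sizes):
    """Number of input words seen by one output position after stacking
    stride-1, undilated 1D convolutions with the given window sizes."""
    field = 1
    for k in kernel_sizes:
        # each layer extends the visible span by (k - 1) words
        field += k - 1
    return field

print(receptive_field([2]))        # 2
print(receptive_field([2, 2]))     # 3
print(receptive_field([3, 3, 3]))  # 7
```

For example, two stacked bi-gram layers already see 3 consecutive words, so a pattern such as "not --- good" with a one-word gap fits into a single output position.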

In this example, I'll simply apply bi-gram detection using 1D convolution (with a single channel) as a starting point.

*back to [index](https://github.com/tsmatz/nlp-tutorials/)*

## Install required packages

In [None]:
!pip install torch pandas numpy nltk

In [None]:
import nltk
nltk.download("popular")

## Prepare data

As in the [previous example](./03_word2vec.ipynb), here I also use text from a news dataset. (However, in this example, we use 2 columns, "headline" and "short description".)

Before starting, please download [News_Category_Dataset_v2.json](https://www.kaggle.com/datasets/rmisra/news-category-dataset) (collected by HuffPost) from Kaggle.

In [1]:
import pandas as pd

data = pd.read_json("News_Category_Dataset_v2.json",lines=True)

In this example, we'll perform a text classification task.

Words appearing in the earlier part of a text tend to be more indicative (topical) than those in the later part. For this reason, in practical text classification, a long text is often separated into **regions**. Convolution (with pooling) is applied in each region, and the results are concatenated. (See below.)<br>
For instance, with the RCV1 (Reuters Corpus Volume I) dataset, 20 equally sized regions give better performance in category classification. (See [Johnson and Zhang (2015)](https://arxiv.org/abs/1504.01255).)

![region separation](images/region_separation.png)
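Splitting a tokenized text into equally sized regions could be sketched as follows. This is one simple way to do it, not necessarily the exact procedure used in the paper, and the helper name `split_into_regions` is hypothetical.

```python
def split_into_regions(tokens, num_regions):
    """Split a token list into num_regions contiguous, nearly equal parts."""
    base, extra = divmod(len(tokens), num_regions)
    regions, start = [], 0
    for i in range(num_regions):
        # the first `extra` regions get one more token
        size = base + (1 if i < extra else 0)
        regions.append(tokens[start:start + size])
        start += size
    return regions

tokens = "a b c d e f g h i j".split()
print(split_into_regions(tokens, 3))
# [['a', 'b', 'c', 'd'], ['e', 'f', 'g'], ['h', 'i', 'j']]
```

Each region would then go through its own convolution and pooling before the pooled vectors are concatenated.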

In this example, ```headline``` and ```short_description``` are both short texts, so we treat these two features as regions instead of separating a single text into regions.

In [2]:
text_data = data[["headline", "short_description"]]
label_data = data["category"]
text_data

Unnamed: 0,headline,short_description
0,There Were 2 Mass Shootings In Texas Last Week...,She left her husband. He killed their children...
1,Will Smith Joins Diplo And Nicky Jam For The 2...,Of course it has a song.
2,Hugh Grant Marries For The First Time At Age 57,The actor and his longtime girlfriend Anna Ebe...
3,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,The actor gives Dems an ass-kicking for not fi...
4,Julianna Margulies Uses Donald Trump Poop Bags...,"The ""Dietland"" actress said using the bags is ..."
...,...,...
200848,RIM CEO Thorsten Heins' 'Significant' Plans Fo...,Verizon Wireless and AT&T are already promotin...
200849,Maria Sharapova Stunned By Victoria Azarenka I...,"Afterward, Azarenka, more effusive with the pr..."
200850,"Giants Over Patriots, Jets Over Colts Among M...","Leading up to Super Bowl XLVI, the most talked..."
200851,Aldon Smith Arrested: 49ers Linebacker Busted ...,CORRECTION: An earlier version of this story i...


To get better performance (accuracy), we standardize the input text as follows.
- Convert all words to lowercase in order to reduce the number of distinct words
- Replace "-" (hyphen) with a space
- Remove stop words
- Remove all punctuation

> Note : Here I have removed stop words, but please take care when you train a model for other tasks (such as sentiment detection), since the stop words might include words that are important for n-gram detection (such as "not", "don't", "isn't", etc.).

> Note : Lemmatization (normalizing forms such as "have", "had", or "having") should also be handled, but here I have skipped this pre-processing.<br>
> In strict pre-processing, we should also take care of polysemy. (Different meanings of the same word should have different tokens.)

In [3]:
import nltk
from nltk.corpus import stopwords
import re
import string

# to lowercase
text_data = text_data.apply(lambda x: x.str.lower())

# replace hyphen
text_data = text_data.apply(lambda x: x.str.replace("-"," "))

# remove stop words (only when it includes punctuation)
for w in stopwords.words("english"):
    if re.match(r"(^|\w+)[%s](\w+|$)" % re.escape(string.punctuation), w):
        text_data = text_data.apply(lambda x: x.str.replace(r"(^|\s+)%s(\s+|$)" % re.escape(w)," ",regex=True))
text_data = text_data.apply(lambda x: x.str.strip())

# remove punctuation
text_data = text_data.apply(lambda x: x.str.replace("[%s]" % re.escape(string.punctuation),"",regex=True))
text_data = text_data.apply(lambda x: x.str.strip())

# remove stop words (only when it doesn't include punctuation)
for w in stopwords.words("english"):
    if not re.match(r"(^|\w+)[%s](\w+|$)" % re.escape(string.punctuation), w):
        text_data = text_data.apply(lambda x: x.str.replace(r"(^|\s+)%s(\s+|$)" % re.escape(w)," ",regex=True))
text_data = text_data.apply(lambda x: x.str.strip())
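To see the effect of these steps on a single sentence, here is a small standalone sketch in plain Python. It follows the same order of steps, but the stop-word set `STOP_WORDS` is a tiny hypothetical list for illustration (the cell above uses NLTK's English stop words), and the function name `standardize` is also an assumption.

```python
import re
import string

# Hypothetical, tiny stop-word list for illustration only
STOP_WORDS = {"the", "a", "is", "of", "it's"}

def standardize(text):
    text = text.lower()
    text = text.replace("-", " ")
    # drop stop words first, so that forms such as "it's" are still matched
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    text = " ".join(tokens)
    # remove punctuation
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    # drop stop words again, now that attached punctuation is gone
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

print(standardize("It's the State-of-the-Art Model, is it?"))
# state art model it
```

The second stop-word pass matters for words that were followed by punctuation, such as a sentence-final "is."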

In [4]:
text_data

Unnamed: 0,headline,short_description
0,2 mass shootings texas last week 1 tv,left husband killed children another day america
1,smith joins diplo nicky jam 2018 world cups of...,course song
2,hugh grant marries first time age 57,actor longtime girlfriend anna eberstein tied ...
3,jim carrey blasts castrato adam schiff democra...,actor gives dems ass kicking fighting hard eno...
4,julianna margulies uses donald trump poop bags...,dietland actress said using bags really cathar...
...,...,...
200848,rim ceo thorsten heins significant plans black...,verizon wireless att already promoting lte dev...
200849,maria sharapova stunned victoria azarenka aust...,afterward azarenka effusive press normal credi...
200850,giants patriots jets colts among improbable su...,leading super bowl xlvi talked game could end ...
200851,aldon smith arrested 49ers linebacker busted dui,correction earlier version story incorrectly s...


Next we convert each category name (e.g., "WORLD NEWS") to a label ID (e.g., 2).<br>
First we build the mappings for these conversions.

In [5]:
category_array = label_data.unique()
category_dic = {c: i for i, c in enumerate(category_array)}
itoc = list(category_dic.keys())
ctoi = category_dic
category_dic

{'CRIME': 0,
 'ENTERTAINMENT': 1,
 'WORLD NEWS': 2,
 'IMPACT': 3,
 'POLITICS': 4,
 'WEIRD NEWS': 5,
 'BLACK VOICES': 6,
 'WOMEN': 7,
 'COMEDY': 8,
 'QUEER VOICES': 9,
 'SPORTS': 10,
 'BUSINESS': 11,
 'TRAVEL': 12,
 'MEDIA': 13,
 'TECH': 14,
 'RELIGION': 15,
 'SCIENCE': 16,
 'LATINO VOICES': 17,
 'EDUCATION': 18,
 'COLLEGE': 19,
 'PARENTS': 20,
 'ARTS & CULTURE': 21,
 'STYLE': 22,
 'GREEN': 23,
 'TASTE': 24,
 'HEALTHY LIVING': 25,
 'THE WORLDPOST': 26,
 'GOOD NEWS': 27,
 'WORLDPOST': 28,
 'FIFTY': 29,
 'ARTS': 30,
 'WELLNESS': 31,
 'PARENTING': 32,
 'HOME & LIVING': 33,
 'STYLE & BEAUTY': 34,
 'DIVORCE': 35,
 'WEDDINGS': 36,
 'FOOD & DRINK': 37,
 'MONEY': 38,
 'ENVIRONMENT': 39,
 'CULTURE & ARTS': 40}

In [6]:
ctoi["WORLD NEWS"]

2

In [7]:
itoc[2]

'WORLD NEWS'

Now we convert all labels to label IDs.

In [8]:
label_data = label_data.apply(lambda y: ctoi[y])
label_data

0          0
1          1
2          1
3          1
4          1
          ..
200848    14
200849    10
200850    10
200851    10
200852    10
Name: category, Length: 200853, dtype: int64

## Build data loader

As in the previous examples, we tokenize each text, converting it into a sequence of word indices as follows.<br>
In this example, each text is padded with the padding index (here, 50000) when the length of the text is smaller than 140.

![Index vectorize](images/index_vectorize2.png)
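The conversion and padding can be sketched on a toy scale as follows. The vocabulary, the maximum length of 6, and the helper name `encode` are assumptions made only for this illustration; the tutorial itself uses 50000 tokens and a length of 140.

```python
# Hypothetical toy vocabulary; index 0 is reserved for unknown words
toy_vocab = ["<unk>", "hot", "dog", "tastes", "good"]
toy_stoi = {word: i for i, word in enumerate(toy_vocab)}
toy_pad_index = len(toy_vocab)   # padding uses the next free index
toy_max_len = 6

def encode(text):
    # map words to indices; unknown words fall back to 0
    ids = [toy_stoi.get(word, 0) for word in text.split(" ")]
    # truncate long texts, then pad short ones
    ids = ids[:toy_max_len]
    return ids + [toy_pad_index] * (toy_max_len - len(ids))

print(encode("hot dog tastes very good"))  # [1, 2, 3, 0, 4, 5]
```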

First we create a list of vocabulary (```vocab```).

In [9]:
from nltk.tokenize import SpaceTokenizer

###
# define Vocab
###
class Vocab:
    def __init__(self, list_of_sentence, tokenization, special_token, max_tokens=None):
        # count vocab frequency
        vocab_freq = {}
        tokens = tokenization(list_of_sentence)
        for t in tokens:
            for vocab in t:
                if vocab not in vocab_freq:
                    vocab_freq[vocab] = 0 
                vocab_freq[vocab] += 1
        # sort by frequency
        vocab_freq = {k: v for k, v in sorted(vocab_freq.items(), key=lambda i: i[1], reverse=True)}
        # create vocab list
        self.vocabs = [special_token] + list(vocab_freq.keys())
        if max_tokens:
            self.vocabs = self.vocabs[:max_tokens]
        self.stoi = {v: i for i, v in enumerate(self.vocabs)}

    def _get_tokens(self, list_of_sentence):
        for sentence in list_of_sentence:
            tokens = tokenizer.tokenize(sentence)
            yield tokens

    def get_itos(self):
        return self.vocabs

    def get_stoi(self):
        return self.stoi

    def append_token(self, token):
        self.vocabs.append(token)
        self.stoi = {v: i for i, v in enumerate(self.vocabs)}

    def __call__(self, list_of_tokens):
        def get_token_index(token):
            if token in self.stoi:
                return self.stoi[token]
            else:
                return 0
        return [get_token_index(t) for t in list_of_tokens]

    def __len__(self):
        return len(self.vocabs)

###
# generate Vocab
###
vocab_size = 50000
max_seq_len = 140

# create tokenizer
tokenizer = SpaceTokenizer()

# define tokenization function
def yield_tokens(text_data):
    for text in text_data:
        tokens = tokenizer.tokenize(text)
        tokens = tokens[:max_seq_len]
        yield tokens

# union headline and short_description
text_all = pd.concat([text_data["headline"], text_data["short_description"]])

# build vocabulary list
vocab = Vocab(
    text_all,
    tokenization=yield_tokens,
    special_token="<unk>",
    max_tokens=vocab_size,
)

The generated token indices are ```0, 1, ... , vocab_size - 1```.<br>
Now I use ```vocab_size``` (here 50000) as the token ID for padded positions.

In [10]:
pad_index = len(vocab)
vocab.append_token("<pad>")

Get the index-to-word list and the word-to-index dictionary.

In [11]:
itos = vocab.get_itos()
stoi = vocab.get_stoi()

In [12]:
# test
print("The number of token index is {}.".format(len(vocab)))
print("The padded index is {}.".format(stoi["<pad>"]))

The number of token index is 50001.
The padded index is 50000.


Now we build a collator function, which is used for pre-processing in the data loader.

In [13]:
import torch
from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_batch(batch):
    label_list, head_token_list, desc_token_list = [], [], []
    for label, head, desc in batch:
        # 1. skip None data
        if head is None or desc is None:
            continue
        # 2. generate word's index vector
        head_token = vocab(tokenizer.tokenize(head))
        desc_token = vocab(tokenizer.tokenize(desc))
        # 3. limit token length to max_seq_len
        head_token = head_token[:max_seq_len]
        desc_token = desc_token[:max_seq_len]
        # 4. pad sequence
        head_token += [pad_index] * (max_seq_len - len(head_token))
        desc_token += [pad_index] * (max_seq_len - len(desc_token))
        # add to list
        label_list.append(label)
        head_token_list.append(head_token)
        desc_token_list.append(desc_token)
    # convert to tensor
    label_list = torch.tensor(label_list, dtype=torch.int64).to(device)
    head_token_list = torch.tensor(head_token_list, dtype=torch.int64).to(device)
    desc_token_list = torch.tensor(desc_token_list, dtype=torch.int64).to(device)
    return label_list, head_token_list, desc_token_list

dataloader = DataLoader(
    list(zip(label_data, text_data["headline"], text_data["short_description"])),
    batch_size=512,
    shuffle=True,
    collate_fn=collate_batch
)

In [14]:
# test
for labels, heads, descs in dataloader:
    break

print("label shape in batch : {}".format(labels.size()))
print("headline token shape in batch : {}".format(heads.size()))
print("short_desc token shape in batch : {}".format(descs.size()))
print("***** label sample *****")
print(labels[0])
print("***** headline token sample *****")
print(heads[0])
print("***** short_desc token sample *****")
print(descs[0])

label shape in batch : torch.Size([512])
headline token shape in batch : torch.Size([512, 140])
short_desc token shape in batch : torch.Size([512, 140])
***** label sample *****
tensor(9, device='cuda:0')
***** headline token sample *****
tensor([14793, 23265,  1982,  2237,  8315,   312,  1229,   332,  1387,  1622,
        50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000,
        50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000,
        50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000,
        50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000,
        50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000,
        50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000,
        50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000,
        50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000,
        50000, 50000, 50000, 50000, 50000, 50000, 50000, 50

## Build network

Now let's build the network.<br>
In this neural network,

(1) As you saw in previous examples, we build embedding vectors (dense vectors) $ \{ \mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_m \} $ from text for both ```headline``` and ```short_description``` respectively.

(2) To these embedding vectors, we apply the 1D convolution $ \mathbf{p}_i = g(U \mathbf{x}_i + \mathbf{b}) $ where $ \mathbf{x}_i = [\mathbf{w}_i, \mathbf{w}_{i+1}] $, $U$ is a weight matrix, $\mathbf{b}$ is a bias vector, and $ g() $ is the ReLU activation. (i.e., the window size of the convolution is 2 (bi-gram) and the stride is 1.)<br>
In this example, we apply half-padding convolution (i.e., we apply $ \mathbf{x}_i = [\mathbf{w}_i, \mathbf{w}_{i+1}] $ for $ i=1,\ldots,m $ where $\mathbf{w}_{m+1}$ is zero), so the number of outputs is also $m$.<br>
I assume that the results are $n$-dimensional vectors $ \mathbf{p}_1, \mathbf{p}_2, \ldots, \mathbf{p}_m $.

(3) Next we get the $n$-dimensional vector $\mathbf{c}$ by applying $\mathbf{c}_{[j]} = \max_{1 \leq i \leq m} \mathbf{p}_{i [j]} \;\; \forall j \in [1,n]$ (i.e., max pooling).<br>
Here I denote the $j$-th element of the vector $\mathbf{p}_i$ by $\mathbf{p}_{i [j]}$. ($i \in [1,m], j \in [1,n]$)

(4) We concatenate the resulting vectors $\mathbf{c}$ and $\mathbf{d}$, which correspond to ```headline``` and ```short_description``` respectively.

(5) Finally, we apply a fully-connected feed-forward network (i.e., a dense network) to predict the class label.

![composing network](images/1d_conv_net.png)
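Steps (2) to (4) can be written out directly in NumPy as a minimal sketch. The sizes, the random stand-in values, and the helper name `bigram_conv_maxpool` are assumptions for illustration; the actual model below uses trained `nn.Embedding` and `nn.Conv1d` layers instead.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, n = 3, 5   # assumed toy sizes: embedding dimension and output dimension n

def bigram_conv_maxpool(W, U, b):
    """Steps (2) and (3) for one region; W has shape [m, emb_dim]."""
    # append one zero vector so that x_m = [w_m, 0] is defined
    padded = np.vstack([W, np.zeros((1, W.shape[1]))])
    # x_i = [w_i, w_{i+1}]   --> [m, 2 * emb_dim]
    X = np.hstack([padded[:-1], padded[1:]])
    # p_i = ReLU(U x_i + b)  --> [m, n]
    P = np.maximum(X @ U.T + b, 0.0)
    # c_[j] = max_i p_i[j]   --> [n]
    return P.max(axis=0)

# random stand-ins for the embeddings and the weights of each region
head_W, desc_W = rng.normal(size=(4, emb_dim)), rng.normal(size=(6, emb_dim))
U1, b1 = rng.normal(size=(n, 2 * emb_dim)), rng.normal(size=n)
U2, b2 = rng.normal(size=(n, 2 * emb_dim)), rng.normal(size=n)

c = bigram_conv_maxpool(head_W, U1, b1)
d = bigram_conv_maxpool(desc_W, U2, b2)

# Step (4): concatenate both pooled vectors --> [2 * n]
features = np.concatenate([c, d])
print(features.shape)  # (10,)
```

Note that the pooled vector has the same size regardless of the number of words $m$ in each region, which is what allows a fixed-size dense layer to follow.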

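Before max pooling, the values at padded positions (where the token index is ```pad_index```) are set to zero. Because the ReLU output is never negative, these zeros cannot exceed the values at real positions in max pooling, so padded positions have little influence on training.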
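The effect of this masking can be checked on a tiny hand-made tensor. In this sketch the padding index `toy_pad_index` and all values are made up for illustration, and the mask is applied by broadcasting, which is a shorter equivalent of the explicit `expand` used in the model below.

```python
import torch

toy_pad_index = 9                               # assumed toy padding index
tokens = torch.tensor([[3, 5, 9, 9]])           # [batch_size, seq_len]
conv_out = torch.tensor([[[-1.0,  2.0],
                          [ 0.5, -3.0],
                          [ 4.0,  4.0],
                          [ 7.0, -2.0]]])       # [batch_size, seq_len, channels]

# 1.0 at real tokens, 0.0 at padded positions; broadcast over channels
mask = (tokens != toy_pad_index).unsqueeze(-1).float()
masked = torch.relu(conv_out * mask)

# max over the sequence axis --> [batch_size, channels]
pooled = masked.max(dim=1).values
print(pooled)  # tensor([[0.5000, 2.0000]])
```

Without the mask, the large values 4.0 and 7.0 at the two padded positions would have been selected by max pooling.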

In [15]:
import torch.nn as nn

embedding_dim = 200

class BigramClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, class_num, padding_idx, conv_channel=256, hidden_dim=128):
        super().__init__()

        self.padding_idx = padding_idx
        self.conv_channel = conv_channel

        self.embedding01 = nn.Embedding(
            vocab_size,
            embedding_dim,
            padding_idx=padding_idx,
        )
        self.embedding02 = nn.Embedding(
            vocab_size,
            embedding_dim,
            padding_idx=padding_idx,
        )
        self.conv01 = torch.nn.Conv1d(
            in_channels=embedding_dim,
            out_channels=conv_channel,
            kernel_size=2,
            stride=1,
            padding="same",
        )
        self.conv02 = torch.nn.Conv1d(
            in_channels=embedding_dim,
            out_channels=conv_channel,
            kernel_size=2,
            stride=1,
            padding="same",
        )
        self.relu = nn.ReLU()
        self.max_pool = torch.nn.MaxPool1d(
            kernel_size=max_seq_len,
        )
        self.hidden = nn.Linear(conv_channel*2, hidden_dim)
        self.classify = nn.Linear(hidden_dim, class_num)

    def forward(self, region01, region02):
        # Get padding masks (in which, element is 0.0 when it's in padded position, otherwise 1.0)
        mask01 = torch.ones(region01.size()).to(device)
        mask01 = mask01.masked_fill(region01 == self.padding_idx, 0)
        mask02 = torch.ones(region02.size()).to(device)
        mask02 = mask02.masked_fill(region02 == self.padding_idx, 0)
        # Embedding
        #   --> [batch_size, max_seq_len, embedding_dim]
        out01 = self.embedding01(region01)
        out02 = self.embedding02(region02)
        # Apply convolution on dimension=1
        #   --> [batch_size, max_seq_len, conv_channel]
        out01 = self.conv01(out01.transpose(1,2)).transpose(1,2)
        out02 = self.conv02(out02.transpose(1,2)).transpose(1,2)
        # Apply masking (In padded position, it will then be 0.0)
        extend_mask01 = mask01.unsqueeze(dim=2)
        extend_mask01 = extend_mask01.expand(-1, -1, self.conv_channel)
        out01 = out01 * extend_mask01
        extend_mask02 = mask02.unsqueeze(dim=2)
        extend_mask02 = extend_mask02.expand(-1, -1, self.conv_channel)
        out02 = out02 * extend_mask02
        # Apply relu
        out01 = self.relu(out01)
        out02 = self.relu(out02)
        # Apply max pooling on dimension=1
        #   --> [batch_size, 1, conv_channel]
        out01 = self.max_pool(out01.transpose(1,2)).transpose(1,2)
        out02 = self.max_pool(out02.transpose(1,2)).transpose(1,2)
        # Flatten
        #   --> [batch_size, conv_channel]
        out01 = out01.squeeze(dim=1)
        out02 = out02.squeeze(dim=1)

        # Concat outputs of head and short_description
        #   --> [batch_size, conv_channel * 2]
        out = torch.concat((out01, out02), dim=-1)

        # Apply classification head
        #   --> [batch_size, hidden_dim]
        out = self.hidden(out)
        out = self.relu(out)
        #   --> [batch_size, class_num]
        logits = self.classify(out)
        
        return logits

model = BigramClassifier(len(vocab), embedding_dim, len(ctoi), pad_index).to(device)

## Train model

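Now let's train our network.<br>
Note that the loss and accuracy shown for each epoch are those of the last mini-batch of training data in that epoch, not an evaluation on held-out data.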

In [16]:
from torch.nn import functional as F

num_epochs = 15

optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)
for epoch in range(num_epochs):
    for labels, heads, descs in dataloader:
        # optimize
        optimizer.zero_grad()
        logits = model(heads, descs)
        loss = F.cross_entropy(logits, labels)
        loss.backward()
        optimizer.step()
        # calculate accuracy
        pred_labels = logits.argmax(dim=1)
        num_correct = (pred_labels == labels).float().sum()
        accuracy = num_correct / len(labels)
        print("Epoch {} - loss: {:2.4f} - accuracy: {:2.4f}".format(epoch+1, loss.item(), accuracy), end="\r")
    print("")



Epoch 1 - loss: 2.0183 - accuracy: 0.4295
Epoch 2 - loss: 1.6093 - accuracy: 0.5235
Epoch 3 - loss: 1.2186 - accuracy: 0.6443
Epoch 4 - loss: 0.9033 - accuracy: 0.7181
Epoch 5 - loss: 0.6881 - accuracy: 0.8188
Epoch 6 - loss: 0.4761 - accuracy: 0.8792
Epoch 7 - loss: 0.4410 - accuracy: 0.8792
Epoch 8 - loss: 0.3817 - accuracy: 0.8658
Epoch 9 - loss: 0.2463 - accuracy: 0.9195
Epoch 10 - loss: 0.1037 - accuracy: 0.9664
Epoch 11 - loss: 0.0286 - accuracy: 1.0000
Epoch 12 - loss: 0.0230 - accuracy: 1.0000
Epoch 13 - loss: 0.0252 - accuracy: 0.9933
Epoch 14 - loss: 0.0098 - accuracy: 1.0000
Epoch 15 - loss: 0.0063 - accuracy: 1.0000


## Classify text

Now we classify 3 texts about "```Michael Jackson```", "```Michael Avenatti```", and "```Ronny Jackson```".<br>
Each of them is assigned to its category with a high probability. (See the logits output.)

In this experiment, "```Michael Jackson```" is strongly categorized as ```ENTERTAINMENT```. However, neither "```Michael```" nor "```Jackson```" alone determines this result, because both "```Michael Avenatti```" and "```Ronny Jackson```" are classified into another category.
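To inspect the probabilities behind a prediction, the logits can be passed through a softmax. The following is a minimal sketch with made-up logits and category names; the helper name `top_categories` is hypothetical and is not part of this tutorial's code.

```python
import torch

def top_categories(logits, index_to_category, k=3):
    """Return the k most probable (category, probability) pairs for one sample."""
    probs = torch.softmax(logits, dim=-1)
    top_probs, top_ids = torch.topk(probs, k)
    return [(index_to_category[i], p) for i, p in zip(top_ids.tolist(), top_probs.tolist())]

# Made-up logits and category names, for illustration only
toy_categories = ["CRIME", "ENTERTAINMENT", "WORLD NEWS", "IMPACT"]
toy_logits = torch.tensor([0.1, 3.0, 1.2, -0.5])

for name, prob in top_categories(toy_logits, toy_categories, k=2):
    print("{}: {:.3f}".format(name, prob))
```

With the trained model, the same helper could presumably be applied to one row of the model's output together with the ```itoc``` list.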

In [17]:
import numpy as np

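def classify_text(headline, description):
    # collate_batch expects (label, headline, description);
    # the label 1 is a dummy value and is ignored here
    test_list = [
        [1, headline, description],
    ]
    _, test_heads, test_descs = collate_batch(test_list)
    # gradients are not needed for inference
    with torch.no_grad():
        pred_logits = model(test_heads, test_descs)
    pred_index = pred_logits.argmax()
    return itoc[pred_index.item()]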

print(classify_text(
    "report about michael jackson",
    "michael jackson is wise and honest"
))
print(classify_text(
    "report about michael avenatti",
    "michael avenatti is wise and honest"
))
print(classify_text(
    "report about ronny jackson",
    "ronny jackson is wise and honest"
))

ENTERTAINMENT
MEDIA
MEDIA
