# N-Gram detection with 1D Convolution

In the previous examples, we ran tasks on the embedding of each individual word, but sometimes we need to handle a sequence of ordered words.<br>
For instance, "hot dog" is not a kind of "dog" but a kind of food. "Paris Hilton" is also far from "Paris" in meaning. Even when you find "good" in a sentence, it may signal negative sentiment in the context "not good".<br>
The same is true not only for bi-grams, but also for tri-grams and general N-grams.

The convolutional neural network (CNN) is a widely used model in computer vision today (for image classification, object detection, segmentation, etc). In NLP, this convolutional architecture can also be applied to N-gram detection.<br>
In computer vision, 2D convolution (convolution over the 2 dimensions of width and height) is generally used, but in N-gram detection, 1D convolution (convolution over word positions) is applied as follows.

![Bi-gram CNN](images/bigram_convolution.png?raw=true)
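
As a minimal NumPy sketch of this idea (not the code used later in this notebook), a bi-gram convolution slides a window of 2 words over the embeddings, and each filter computes one score per window. The embedding values and `filter_weights` below are made-up numbers, and I assume a zero vector is appended at the end so that the number of windows equals the number of words.

```python
import numpy as np

def bigram_windows(embeddings):
    """Return [w_i, w_{i+1}] for every position, with a zero vector after the last word."""
    m, dim = embeddings.shape
    padded = np.vstack([embeddings, np.zeros((1, dim))])
    # concatenate each word vector with its right neighbour: shape (m, 2 * dim)
    return np.hstack([padded[:-1], padded[1:]])

# toy sentence of 3 words with 2-dimensional embeddings (made-up values)
words = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
windows = bigram_windows(words)

# one hypothetical filter: a dot product with each bi-gram window
filter_weights = np.array([1.0, 0.0, 0.0, 1.0])
scores = windows @ filter_weights
print(windows)
print(scores)  # one score per bi-gram
```

A real convolution layer learns many such filters at once and adds a bias and an activation to each score.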

There are several variations of N-gram detection with convolutions.<br>
Hierarchical (stacked) convolutions can capture patterns with gaps, such as "not --- good" or "see --- little", where "---" stands for a short sequence of words.<br>
As in image processing, multiple channels can also be used in NLP convolution. For instance, when each word has multiple embeddings (such as a word embedding, a POS-tag embedding, a position-wise word embedding, etc), these embeddings can be treated as multiple channels. Alternatively, after applying convolutions for multiple N-grams (such as 2-gram, 4-gram, and 6-gram), the results can also be treated as multiple channels.
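
To see why stacked convolutions can reach across a gap, it helps to count how many consecutive words a single output position can see. The helper below is a hypothetical illustration (`receptive_field` is my own name), assuming every layer uses stride 1 and no dilation.

```python
def receptive_field(window_sizes):
    """Number of consecutive words covered by one output of stacked stride-1 convolutions."""
    field = 1
    for size in window_sizes:
        # each additional layer widens the view by (window size - 1) words
        field += size - 1
    return field

print(receptive_field([2]))        # a single bi-gram layer
print(receptive_field([2, 2]))     # two stacked bi-gram layers
print(receptive_field([2, 2, 2]))  # three stacked bi-gram layers
```

With two stacked bi-gram layers, one output already covers 3 words, which is enough for a pattern like "not --- good" with a one-word gap.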

In this example, to keep things simple for beginners, I'll apply bi-gram detection using 1D convolution with a single channel.

*back to [index](https://github.com/tsmatz/nlp-tutorials/)*

## Install required packages

In [None]:
!pip install tensorflow==2.6.2 pandas numpy nltk

In [None]:
import nltk
nltk.download("popular")

## Prepare data

As in the [previous example](./03_word2vec.ipynb), I use the news text dataset here. (However, in this example, we use the 2 columns "headline" and "short description".)

Before starting, please download [News_Category_Dataset_v2.json](https://www.kaggle.com/datasets/rmisra/news-category-dataset) (collected by HuffPost) from Kaggle.

In [1]:
import pandas as pd

data = pd.read_json("News_Category_Dataset_v2.json",lines=True)

In this example, we'll perform a text classification task.

In a long text, words that appear early tend to be more indicative (topical) than the others. In practical text classification, a long text is therefore separated into **regions**. Convolution (with pooling) is applied to each region, and the results are then concatenated. (See below.)<br>
For instance, on the RCV1 (Reuters Corpus Volume I) dataset, using 20 equally sized regions gives better performance in category classification. (See [Johnson and Zhang (2015)](https://arxiv.org/abs/1504.01255).)

![region separation](images/region_separation.png?raw=true)
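
The region separation above can be sketched as follows. This is only an illustration in plain Python (`split_into_regions` is a hypothetical helper, not part of this notebook), assuming the text is already tokenized and that any leftover tokens are given to the first regions.

```python
def split_into_regions(tokens, num_regions):
    """Split a token list into num_regions contiguous, nearly equal-sized chunks."""
    base, extra = divmod(len(tokens), num_regions)
    regions, start = [], 0
    for r in range(num_regions):
        # the first `extra` regions receive one additional token
        end = start + base + (1 if r < extra else 0)
        regions.append(tokens[start:end])
        start = end
    return regions

tokens = "a b c d e f g h i j".split()
regions = split_into_regions(tokens, 3)
print(regions)
```

Each region would then go through its own convolution and pooling, and the pooled vectors would be concatenated.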

In this example, both ```headline``` and ```short_description``` are short texts, so we treat each of these features as a region, instead of separating the text into regions.

In [2]:
train_data = data[["headline", "short_description"]]
train_data

Unnamed: 0,headline,short_description
0,There Were 2 Mass Shootings In Texas Last Week...,She left her husband. He killed their children...
1,Will Smith Joins Diplo And Nicky Jam For The 2...,Of course it has a song.
2,Hugh Grant Marries For The First Time At Age 57,The actor and his longtime girlfriend Anna Ebe...
3,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,The actor gives Dems an ass-kicking for not fi...
4,Julianna Margulies Uses Donald Trump Poop Bags...,"The ""Dietland"" actress said using the bags is ..."
...,...,...
200848,RIM CEO Thorsten Heins' 'Significant' Plans Fo...,Verizon Wireless and AT&T are already promotin...
200849,Maria Sharapova Stunned By Victoria Azarenka I...,"Afterward, Azarenka, more effusive with the pr..."
200850,"Giants Over Patriots, Jets Over Colts Among M...","Leading up to Super Bowl XLVI, the most talked..."
200851,Aldon Smith Arrested: 49ers Linebacker Busted ...,CORRECTION: An earlier version of this story i...


To get better performance (accuracy), we standardize the input text as follows.
- Convert all words to lowercase in order to reduce the vocabulary
- Replace "-" (hyphen) with a space
- Remove stop words
- Remove all punctuation

> Note : Lemmatization (normalizing inflected forms such as "have", "had" or "having") should also be applied, but I have skipped this pre-processing here.<br>
> In strict pre-processing, we should also take care of polysemy. (Different meanings of the same word should have different tokens.)
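
To make these steps concrete on a single string, here is a minimal sketch using only the standard library. It is a simplified stand-in for the pandas code in this notebook: `standardize` is a hypothetical helper, stop word removal is left out, and collapsing repeated spaces is my own addition.

```python
import re
import string

def standardize(text):
    """Lowercase, turn hyphens into spaces, and strip punctuation from one string."""
    text = text.lower()
    text = text.replace("-", " ")
    # remove every ASCII punctuation character
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    # collapse repeated whitespace (an extra step, not in the list above)
    return re.sub(r"\s+", " ", text).strip()

print(standardize("Hugh Grant Marries For The First-Time, At Age 57!"))
# hugh grant marries for the first time at age 57
```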

In [3]:
import nltk
from nltk.corpus import stopwords
import re
import string

# to lowercase
train_data = train_data.apply(lambda x: x.str.lower())

# replace hyphen
train_data = train_data.apply(lambda x: x.str.replace("-"," "))

# remove stop words (only when it includes punctuation)
for w in stopwords.words("english"):
    if re.match(r"(^|\w+)[%s](\w+|$)" % re.escape(string.punctuation), w):
        train_data = train_data.apply(lambda x: x.str.replace(r"(^|\s+)%s(\s+|$)" % re.escape(w)," ",regex=True))
train_data = train_data.apply(lambda x: x.str.strip())

# remove punctuation
train_data = train_data.apply(lambda x: x.str.replace("[%s]" % re.escape(string.punctuation),"",regex=True))
train_data = train_data.apply(lambda x: x.str.strip())

# remove stop words (only when it doesn't include punctuation)
for w in stopwords.words("english"):
    if not re.match(r"(^|\w+)[%s](\w+|$)" % re.escape(string.punctuation), w):
        train_data = train_data.apply(lambda x: x.str.replace(r"(^|\s+)%s(\s+|$)" % re.escape(w)," ",regex=True))
train_data = train_data.apply(lambda x: x.str.strip())

# drop Nan
train_data = train_data.dropna()

In [4]:
train_data

Unnamed: 0,headline,short_description
0,2 mass shootings texas last week 1 tv,left husband killed children another day america
1,smith joins diplo nicky jam 2018 world cups of...,course song
2,hugh grant marries first time age 57,actor longtime girlfriend anna eberstein tied ...
3,jim carrey blasts castrato adam schiff democra...,actor gives dems ass kicking fighting hard eno...
4,julianna margulies uses donald trump poop bags...,dietland actress said using bags really cathar...
...,...,...
200848,rim ceo thorsten heins significant plans black...,verizon wireless att already promoting lte dev...
200849,maria sharapova stunned victoria azarenka aust...,afterward azarenka effusive press normal credi...
200850,giants patriots jets colts among improbable su...,leading super bowl xlvi talked game could end ...
200851,aldon smith arrested 49ers linebacker busted dui,correction earlier version story incorrectly s...


## Build network

As in the previous examples, we create a vectorizer, which converts each text into a sequence of word indices (a vector of length 140) as follows.<br>
When the text has fewer than 140 words, the vector is padded with zeros.

![Index vectorize](images/index_vectorize.png?raw=true)
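
The following plain-Python sketch mimics what such a vectorizer does. It is not the Keras implementation: `build_vocab` and `vectorize` are hypothetical helpers, and I assume index 0 is reserved for padding and index 1 for unknown words.

```python
from collections import Counter

def build_vocab(texts, max_tokens):
    """Map the most frequent tokens to indices; 0 = padding, 1 = unknown."""
    counts = Counter(token for text in texts for token in text.split())
    vocab = {"": 0, "[UNK]": 1}
    for token, _ in counts.most_common(max_tokens - 2):
        vocab[token] = len(vocab)
    return vocab

def vectorize(text, vocab, seq_len):
    """Convert text to a fixed-length list of indices, truncating or zero-padding."""
    ids = [vocab.get(token, 1) for token in text.split()][:seq_len]
    return ids + [0] * (seq_len - len(ids))

vocab = build_vocab(["report about paris", "paris hilton"], max_tokens=10)
print(vectorize("paris is brilliant", vocab, seq_len=5))
# [2, 1, 1, 0, 0]
```

Here "paris" is the most frequent word and gets index 2, the two unseen words map to 1, and the rest is padding.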

In [5]:
import tensorflow as tf

vocab_size = 50000
max_seq_len = 140

# Set up vectorizer
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size,
    output_sequence_length=max_seq_len,
    output_mode="int",
    pad_to_max_tokens=False,
    trainable=False)

# concat columns (headline and short_description)
text_all = pd.concat([train_data["headline"], train_data["short_description"]])

# create vocabulary list
# (UNK is automatically included)
vectorizer.adapt(text_all)

Now let's build the network.<br>
This neural network works as follows.

1. As you saw in the previous examples, we build embedding vectors (dense vectors) $ \{ \mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_m \} $ from the text, for ```headline``` and ```short_description``` respectively.
2. To these embedding vectors, we apply 1D convolution $ \mathbf{p}_i = g(U \mathbf{x}_i + \mathbf{b}) $ where $ \mathbf{x}_i = [\mathbf{w}_i, \mathbf{w}_{i+1}] $, $U$ is a weight matrix, $\mathbf{b}$ is a bias vector, and $ g() $ is the ReLU activation. (i.e., the window size of the convolution is 2 (bi-gram) and the stride is 1.)<br>
In this example, we apply half padding convolution (i.e., we compute $ \mathbf{x}_i = [\mathbf{w}_i, \mathbf{w}_{i+1}] $ for $ i=1,\ldots,m $ where $\mathbf{w}_{m+1}$ is zero), so the number of outputs is also $m$.<br>
I assume that the result is a sequence of $n$-dimensional vectors $ \mathbf{p}_1, \mathbf{p}_2, \ldots, \mathbf{p}_m $ .
3. Next we apply max pooling, $\mathbf{c}_{[j]} = \max_{1 \leq i \leq m} \mathbf{p}_{i [j]}$ for all $j \in [1,n]$, and get an $n$-dimensional vector $\mathbf{c}$.<br>
Here I denote the $j$-th element of vector $\mathbf{p}_i$ by $\mathbf{p}_{i [j]}$. ($i \in [1,m], j \in [1,n]$)
4. We concatenate the resulting vectors $\mathbf{c}$ and $\mathbf{d}$, which correspond to ```headline``` and ```short_description``` respectively.
5. Finally, we apply a fully-connected feed-forward network (i.e., dense layers) to predict the class, which is compared with the one-hot label.

![composing network](images/1d_conv_net.png?raw=true)
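
Before the Keras implementation, the convolution, max pooling, and concatenation steps can be sketched in NumPy. This is an illustration under assumptions, not the model itself: the sizes are tiny, the weights and embeddings are random, and `conv_maxpool` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_maxpool(embeddings, U, b):
    """Bi-gram convolution with ReLU, then max pooling over all positions."""
    m, emb_dim = embeddings.shape
    padded = np.vstack([embeddings, np.zeros((1, emb_dim))])  # w_{m+1} = 0
    x = np.hstack([padded[:-1], padded[1:]])   # x_i = [w_i, w_{i+1}], shape (m, 2 * emb_dim)
    p = np.maximum(x @ U.T + b, 0.0)           # p_i = ReLU(U x_i + b), shape (m, n)
    return p.max(axis=0)                       # c_[j] = max_i p_i[j], shape (n,)

m, emb_dim, n = 5, 4, 3  # toy sizes: words, embedding dimension, filters
U1, b1 = rng.normal(size=(n, 2 * emb_dim)), np.zeros(n)
U2, b2 = rng.normal(size=(n, 2 * emb_dim)), np.zeros(n)

c_vec = conv_maxpool(rng.normal(size=(m, emb_dim)), U1, b1)  # stands in for headline
d_vec = conv_maxpool(rng.normal(size=(m, emb_dim)), U2, b2)  # stands in for short_description
feature = np.concatenate([c_vec, d_vec])  # input to the dense layers
print(feature.shape)
```

Because max pooling keeps only the strongest response of each filter, the pooled vector does not depend on where in the text a bi-gram appears.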

In [6]:
class BigramClassificationModel(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim):
        super(BigramClassificationModel, self).__init__()

        #
        # input definition (shape : (batch_size, 1, ))
        #

        input1 = tf.keras.layers.Input(
            dtype=tf.string,
            shape=(1, ),
            name="text1")
        input2 = tf.keras.layers.Input(
            dtype=tf.string,
            shape=(1, ),
            name="text2")

        #
        # Apply convolution for input1
        #

        # vectorize (shape : (batch_size, 140))
        vec1 = vectorizer(input1)
        # word's embedding (shape : (batch_size, 140, 200))
        emb1 = tf.keras.layers.Embedding(
            vocab_size,
            embedding_dim,
            trainable=True,
            name="embedding1")(vec1)
        # apply convolution (shape : (batch_size, 140, 256) - because the number of bi-gram segments is same as word's count.)
        conv1 = tf.keras.layers.Conv1D(
            256,
            2,
            strides=1,
            padding="same",
            activation="relu",
            trainable=True)(emb1)
        # apply maxpool (shape : (batch_size, 1, 256))
        pool1 = tf.keras.layers.MaxPool1D(pool_size=max_seq_len)(conv1)
        # reshape : (batch_size, 256)
        flat1 = tf.keras.layers.Flatten()(pool1)

        #
        # Apply convolution for input2
        #

        # vectorize (shape : (batch_size, 140))
        vec2 = vectorizer(input2)
        # word's embedding (shape : (batch_size, 140, 200))
        emb2 = tf.keras.layers.Embedding(
            vocab_size,
            embedding_dim,
            trainable=True,
            name="embedding2")(vec2)
        # apply convolution (shape : (batch_size, 140, 256) - because the number of bi-gram segments is same as word's count.)
        conv2 = tf.keras.layers.Conv1D(
            256,
            2,
            strides=1,
            padding="same",
            activation="relu",
            trainable=True)(emb2)
        # apply maxpool (shape : (batch_size, 1, 256))
        pool2 = tf.keras.layers.MaxPool1D(pool_size=max_seq_len)(conv2)
        # reshape : (batch_size, 256)
        flat2 = tf.keras.layers.Flatten()(pool2)

        #
        # concatenate each pool (shape : (batch_size, 512))
        #

        news_feature = tf.keras.layers.Concatenate(axis=-1)(
            [flat1, flat2])

        #
        # classify by fully-connected feed forward network
        #

        hidden1 = tf.keras.layers.Dense(
            128,
            activation="relu",
            trainable=True)(news_feature)
        outputs = tf.keras.layers.Dense(
            41,
            activation=None,
            trainable=True)(hidden1)

        #
        # Generate model
        #

        self.base_model = tf.keras.Model(
            inputs=[input1, input2],
            outputs=outputs)

    def call(self, inputs):
        input1, input2 = inputs
        return self.base_model({
            "text1": input1,
            "text2": input2
        })

#
# define and generate model
#
embedding_dim = 200
model = BigramClassificationModel(vocab_size, embedding_dim)
model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])

## Train model

We use the ```category``` column as the label in training.<br>
The following output shows the one-hot label of each row in the dataset. (There are 41 categories.)

In [7]:
y_df = pd.get_dummies(data["category"])
y_df

Unnamed: 0,ARTS,ARTS & CULTURE,BLACK VOICES,BUSINESS,COLLEGE,COMEDY,CRIME,CULTURE & ARTS,DIVORCE,EDUCATION,...,TASTE,TECH,THE WORLDPOST,TRAVEL,WEDDINGS,WEIRD NEWS,WELLNESS,WOMEN,WORLD NEWS,WORLDPOST
0,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200848,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
200849,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
200850,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
200851,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
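
As a rough plain-Python sketch of what this one-hot encoding does (`one_hot` is a hypothetical helper, and I assume the columns are the sorted category names, as in the output above):

```python
def one_hot(labels):
    """Return the sorted class names and one 0/1 row per label."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    rows = [
        [1 if index[label] == i else 0 for i in range(len(classes))]
        for label in labels
    ]
    return classes, rows

classes, rows = one_hot(["CRIME", "ENTERTAINMENT", "CRIME", "TECH"])
print(classes)
print(rows)

# the position of the 1 maps back to the category name
print(classes[rows[3].index(1)])
```

The same index-to-name mapping is what turns the model's output back into a category name at prediction time.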


Now we create a TensorFlow dataset for training.

In [8]:
# generate one-hot list
y = y_df.values.tolist()

# generate tensorflow dataset (X, y)
train_tf_data = tf.data.Dataset.from_tensor_slices((
    (train_data["headline"], train_data["short_description"]),
    y))

Now train the model!

In [9]:
model.fit(
    train_tf_data.shuffle(10000).batch(512),
    epochs=15)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x7fe5718fde10>

## Classify text

Now we classify texts containing "```Paris```", "```Hilton Hotel```", and "```Paris Hilton```".<br>
Only the text with "```Paris Hilton```" is categorized as ```ENTERTAINMENT```, because the bi-gram "```Paris Hilton```" frequently occurs in ```ENTERTAINMENT``` articles.

In [11]:
import numpy as np

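def classify_text(headline, description):
    # shape (2, 1) : row 0 is the headline, row 1 is the description
    test_X = tf.convert_to_tensor([
        [headline],
        [description]
    ])
    # the model outputs logits (one score per category), not one-hot values
    test_y_logits = model(test_X)
    # pick the index of the largest logit and map it to the category name
    test_y_idx = np.argmax(test_y_logits, axis=-1)
    test_y = [y_df.columns[i] for i in test_y_idx]
    return test_y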

print(classify_text(
    "report about paris",
    "paris is brilliant"
))
print(classify_text(
    "report about hilton hotel",
    "hilton hotel is brilliant"
))
print(classify_text(
    "report about paris hilton",
    "paris hilton is brilliant"
))

['TRAVEL']
['TRAVEL']
['ENTERTAINMENT']
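
The helper above returns only the top category. If you also want to see how close the other categories are, one option is to apply softmax to the model's output scores. The following is a sketch with made-up scores and only 3 categories (`top_k_from_logits` is a hypothetical helper, not part of this notebook).

```python
import numpy as np

def top_k_from_logits(logits, class_names, k=3):
    """Turn raw scores into softmax probabilities and return the k most likely classes."""
    logits = np.asarray(logits, dtype=np.float64)
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    probs = exp / exp.sum()
    order = np.argsort(probs)[::-1][:k]
    return [(class_names[i], float(probs[i])) for i in order]

# made-up scores for illustration only
example = top_k_from_logits([2.0, 0.5, -1.0], ["ENTERTAINMENT", "TRAVEL", "TECH"], k=2)
print(example)
```

With the trained model, the scores would come from `model(test_X)` and the names from `y_df.columns`.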


The next example classifies texts containing "```Michael Jackson```", "```Michael Avenatti```", and "```Ronny Jackson```".<br>
Each text includes "```Michael```", "```Jackson```", or both. But the results differ, because these bi-grams occur in different categories in the source text.

In [12]:
print(classify_text(
    "report about michael jackson",
    "michael jackson is wise and honest"
))
print(classify_text(
    "report about michael avenatti",
    "michael avenatti is wise and honest"
))
print(classify_text(
    "report about ronny jackson",
    "ronny jackson is wise and honest"
))

['WELLNESS']
['MEDIA']
['ENTERTAINMENT']
