## This File Explains: learning and practicing NLP with TensorFlow.

We will see how to gain insights from text data, then get hands-on with using those insights to train NLP models that perform some human-like tasks.

**Tokenization:**

Tokenization is the process of representing words in a form that a computer can process, with a view to later training a neural network that can understand their meaning.

Let’s look at how we can tokenize sentences using TensorFlow tools.

In [2]:
# Importing required libraries
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer


# List of sample sentences that we want to tokenize
sentences = ['I love my dog.',
             'I love my cat?',
             ]

# initializing a tokenizer that can build a word index
# num_words is the maximum number of words to keep
# the tokenizer keeps the most frequent words
tokenizer = Tokenizer(num_words = 100)

# fitting the tokenizer on the sentences
tokenizer.fit_on_texts(sentences)

# the full list of words is available as the tokenizer's word index
word_index = tokenizer.word_index

# the result will be a dictionary, key being the words and the values being the token for that word
print(word_index)

{'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}


The tokenizer also normalizes the text: by default it lowercases every word and strips punctuation. In the next example we add the word “DOG!”, and the tokenizer does not create a new token for it; it reuses the token for “dog”.



In [3]:
sentences = ['I love my dog',
             'I love my cat',
             'you love my DOG!'
             ]

tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

# Expected: resulting dictionary without a new token for "dog!"
print(word_index)

{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
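
The normalization described above can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: the helper name `build_word_index` is hypothetical, and it assumes that stripping everything in `string.punctuation` is close enough to the tokenizer's default filters for these sentences.

```python
import string
from collections import Counter


def build_word_index(sentences):
    """Rank words by frequency after lowercasing and stripping punctuation."""
    # Replace every punctuation character with a space (an assumption that
    # roughly mirrors the tokenizer's default filtering).
    strip_punct = str.maketrans({ch: " " for ch in string.punctuation})
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().translate(strip_punct).split())
    # most_common() sorts by count; ties keep first-seen order.
    ranked = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


sentences = ['I love my dog', 'I love my cat', 'you love my DOG!']
word_index = build_word_index(sentences)
print(word_index)
```

Because “DOG!” is lowercased and loses its exclamation mark before counting, it is counted as a second occurrence of “dog” rather than as a new word.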


**Sequencing:**

Now that our words are represented as tokens, we next need to represent each sentence as a sequence of numbers in the correct order. The data will then be ready for a neural network to process, whether to understand text or even to generate new text. Let’s look at how we can manage this sequencing using TensorFlow tools.

In [4]:
sentences = ['I love my dog',
             'I love my cat',
             'you love my dog!',
             'Do you think my dog is amazing?',
             ]

tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

# this creates a sequence of tokens for each sentence
sequences = tokenizer.texts_to_sequences(sentences)

print(word_index)
print()
print(sequences)

{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}

[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]


Now we have basic tokenization done, but there is a catch. This is all very well for getting data ready to train a neural network, but what happens when that network has to classify text containing words it has never seen before? Such words can confuse the network. Let’s look at how to handle that next.

Let’s try sequencing sentences that contain words the tokenizer has not seen yet.

In [5]:
test_data = ['i really love my dog',
             'my dog loves my manatee',
             ]

test_seq = tokenizer.texts_to_sequences(test_data)

print(test_seq)

[[4, 2, 1, 3], [1, 3, 1]]


**Unseen words:**

`‘i really love my dog’ = [4, 2, 1, 3]`, i.e. a 5-word sentence ends up as a 4-number sequence. Why? Because the word “really” is not in the word index: the corpus used to build the index doesn’t contain it.

Similarly, `‘my dog loves my manatee’ = [1, 3, 1]`, i.e. a 5-word sentence ends up as a 3-number sequence, equivalent to “my dog my”, because “loves” and “manatee” are not in the word index.

So we can imagine that we need a huge word index to handle sentences that are not in the training set. But in order not to lose the length of the sequence, there is also a little trick that we can use. Let’s take a look at that.

**OOV (out of vocabulary):**

We can use the tokenizer’s `oov_token` argument and set it to something we would not expect to see in the corpus, like “<OOV>”. The tokenizer then creates a token for it and replaces every word it doesn’t recognize with this out-of-vocabulary token. It’s simple but effective. Let’s look at an example.


In [6]:
sentences = ['I love my dog',
             'I love my cat',
             'you love my dog!',
             'Do you think my dog is amazing?',
             ]

# adding an "out of vocabulary" token to the tokenizer
tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)

test_data = ['i really love my dog',
             'my dog loves my manatee',
             ]

test_seq = tokenizer.texts_to_sequences(test_data)

print(word_index)
print(test_seq)

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]


Now we can see that the length of each sentence has been retained and the unseen words are replaced by the “<OOV>” token. The resulting sentences are:

`‘i really love my dog’ = [5, 1, 3, 2, 4] = ‘i <OOV> love my dog’`

`‘my dog loves my manatee’ = [2, 4, 1, 2, 1] =‘my dog <OOV> my <OOV>’`
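
The lookup behind these results can be sketched in plain Python. The helper `texts_to_ids` is hypothetical and only illustrates the idea (known words map to their index, unknown words map to the OOV index or are dropped); it is not how Keras implements `texts_to_sequences`, and it assumes the input has no punctuation.

```python
def texts_to_ids(texts, word_index, oov_token=None):
    """Map each word to its index; unknown words become the OOV index or are dropped."""
    oov_id = word_index.get(oov_token) if oov_token is not None else None
    sequences = []
    for text in texts:
        ids = []
        for word in text.lower().split():
            if word in word_index:
                ids.append(word_index[word])
            elif oov_id is not None:
                ids.append(oov_id)
            # with no OOV token, the unknown word is silently skipped
        sequences.append(ids)
    return sequences


# word index copied from the output above
word_index = {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6,
              'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
test_data = ['i really love my dog', 'my dog loves my manatee']

with_oov = texts_to_ids(test_data, word_index, oov_token='<OOV>')
without_oov = texts_to_ids(test_data, word_index)
print(with_oov)     # [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]
print(without_oov)  # [[5, 3, 2, 4], [2, 4, 2]]
```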

We still lose some meaning, but a lot less, and the sequences at least have the correct lengths. Keeping each sequence the same length as its sentence raises another question: when it comes to training a neural network, how can it handle sentences of different lengths?

Images are usually all the same size, but sentences are not. So how do we solve that problem?

**Padding the sequences:**

A simple solution is padding. For this, we will use pad_sequences, imported from the sequence module of **`tensorflow.keras.preprocessing`**. As the name suggests, it pads our sequences: we just pass the sequences to the pad_sequences function and the rest is done for us.

In [7]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = ['I love my dog',
             'I love my cat',
             'you love my dog!',
             'Do you think my dog is amazing?',
             ]

tokenizer = Tokenizer(num_words = 100,oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)

# padding sequences 
padded = pad_sequences(sequences)

print(word_index)
print()
print(sequences)
print()
print(padded)

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

[[ 0  0  0  5  3  2  4]
 [ 0  0  0  5  3  2  7]
 [ 0  0  0  6  3  2  4]
 [ 8  6  9  2  4 10 11]]


So our first example, [5, 3, 2, 4], is preceded by 3 zeros in the padded sequence. Why 3 zeros? Because our longest sentence has 7 words in it: when we pass the sequences to pad_sequences, it measures the longest one and makes all sequences the same length by padding the shorter ones with zeros at the front. Note that 0 is the padding value, not the OOV token; “<OOV>” has index 1 in the word index.

Now we might want the zeros after the sentence instead of in front of it. That’s easy: we can just set the padding parameter to “post”, i.e. **padding = “post”**.

If we don’t want the padded sentences to be as long as the longest sentence, we can specify the desired length with the “maxlen” parameter. But what happens if a sentence is longer than “maxlen”?

In that case we can specify how to truncate it with the “truncating” parameter: either chop words off the end with a post-truncation, or off the beginning with a pre-truncation. Please refer to the pad_sequences documentation for other options.

The call to pad_sequences might then look like:

`padded = pad_sequences(sequences, maxlen = 5, padding = 'post', truncating = 'post')`
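
To make the effect of these options concrete, here is a small pure-Python sketch of the padding and truncation rules described above. The function `pad` is a hypothetical stand-in written for illustration; it returns plain lists rather than a NumPy array and is not the Keras implementation.

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Pad or truncate every sequence to the same length."""
    if maxlen is None:
        # default: use the length of the longest sequence
        maxlen = max(len(seq) for seq in sequences)
    result = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # "pre" drops tokens from the front, "post" from the end
            seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        result.append(fill + seq if padding == "pre" else seq + fill)
    return result


sequences = [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

default_padded = pad(sequences)
post_padded = pad(sequences, maxlen=5, padding="post", truncating="post")
pre_truncated = pad(sequences, maxlen=5, padding="post", truncating="pre")

print(default_padded[0])   # [0, 0, 0, 5, 3, 2, 4]
print(post_padded)         # [[5, 3, 2, 4, 0], [5, 3, 2, 7, 0], [6, 3, 2, 4, 0], [8, 6, 9, 2, 4]]
print(pre_truncated[3])    # [9, 2, 4, 10, 11]
```

With `maxlen = 5`, the three short sentences gain one trailing zero, while the 7-token sentence loses its last two tokens under post-truncation or its first two under pre-truncation.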

So far we have seen how to tokenize text into numeric values and use TensorFlow tools to turn that text into padded sequences of equal length. Now that we’ve gotten pre-processing out of the way, we can look at how to build a classifier that recognizes sentiment in text.

We’ll start with a [dataset of news headlines](https://www.kaggle.com/datasets/rmisra/news-headlines-dataset-for-sarcasm-detection) in which each headline has been labeled as sarcastic or not. We’ll train a classifier on it, which can then tell us whether a new piece of text looks sarcastic.

**This dataset has 3 fields:**

`is_sarcastic: 1 if the headline is sarcastic, 0 otherwise.`

`headline: the headline of the news article.`

`article_link: link to the original news article.`

The data is stored in JSON format, and we will load it into a pandas DataFrame for training.

In [8]:
from google.colab import drive

drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [12]:
path = "/content/gdrive/MyDrive/Deep Learning/Jupyter Notebook/Natural Language Processing/Sarcasm_Headlines_Dataset.json"

In [13]:
import pandas as pd
# import os
# path = os.getcwd()

data = pd.read_json(path, lines=True)
data.head()

| | article_link | headline | is_sarcastic |
|---|---|---|---|
| 0 | https://www.huffingtonpost.com/entry/versace-b... | former versace store clerk sues over secret 'b... | 0 |
| 1 | https://www.huffingtonpost.com/entry/roseanne-... | the 'roseanne' revival catches up to our thorn... | 0 |
| 2 | https://local.theonion.com/mom-starting-to-fea... | mom starting to fear son's web series closest ... | 1 |
| 3 | https://politics.theonion.com/boehner-just-wan... | boehner just wants wife to listen, not come up... | 1 |
| 4 | https://www.huffingtonpost.com/entry/jk-rowlin... | j.k. rowling wishes snape happy birthday in th... | 0 |


In [14]:
from sklearn.model_selection import train_test_split

# headlines are the inputs, sarcasm flags are the labels
X, y = data['headline'], data['is_sarcastic']
# hold out 25% of the rows for testing
training_sentences, testing_sentences, training_labels, testing_labels = train_test_split(X, y, test_size=0.25)


In [15]:
training_sentences.shape

(20031,)

In [16]:
training_labels.shape[0]

20031

In [19]:
vocab_size = 10000
max_length = 100
trunc_type='post'
padding_type='post'
oov_token = "<OOV>"
training_size = training_labels.shape[0]

In [18]:
training_sentences.shape

(20031,)

In [20]:
import numpy as np

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_token)
# fitting the tokenizer only on the training set
tokenizer.fit_on_texts(training_sentences)

word_index = tokenizer.word_index

# creating training sequences and padding them
training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_sequences, maxlen=max_length,
                                padding=padding_type,
                                truncating=trunc_type)

# creating testing sequences and padding them using the same tokenizer
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length,
                               padding=padding_type,
                               truncating=trunc_type)

print(training_padded.shape, testing_padded.shape)

# converting all variables to NumPy arrays so they work with TensorFlow 2
training_padded = np.array(training_padded)
training_labels = np.array(training_labels)
testing_padded = np.array(testing_padded)
testing_labels = np.array(testing_labels)

(20031, 100) (6678, 100)


In [21]:
training_padded.shape

(20031, 100)

In [22]:
training_labels.shape

(20031,)

In [25]:
word_index["<OOV>"]

1

In [23]:
training_padded[0]

array([2620, 2850,   32,    1,   17,   99,  213,    1, 1366,  702,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0], dtype=int32)

**Word Embeddings:**

We’ve turned our sentences into numbers, with the numbers being tokens that represent the words. But how do we get meaning from that? How do we determine whether something is sarcastic just from the numbers?

This is where the concept of embeddings comes in.

Let’s consider the most basic of sentiments, good and bad. We often see these as opposites, so we can plot them as vectors pointing in opposite directions, as shown in the image below.

What happens with a word like “meh”? It’s not particularly good and not particularly bad, but probably a little more bad than good, so we can plot it somewhere near the bad direction. The phrase “not bad” usually describes something with a little bit of goodness, though not necessarily very good, so it can be plotted leaning towards the good direction.

[Image: good and bad sentiment plotted as opposite directions](https://drive.google.com/file/d/13RD8Lj9gqDFDH4kjhXcPqt99WeDZ7jdq/view?usp=share_link)

Now imagine plotting this on the X and Y axes. We can then express the good or bad sentiment as X and Y coordinates, as shown in the image (not to scale). Similarly, we can represent “meh” and “good” as points in the XY plane.

By looking at the direction of the vector, we can start to determine the meaning of the word. What if we extend that into multiple dimensions instead of just two? What if words that are labeled with sentiments, like sarcastic and not sarcastic, are plotted in a multi-dimensional space? As we train, we try to learn what the directions in this multi-dimensional space should look like. Words that appear only in sarcastic sentences will have a strong component in the sarcastic direction, and vice versa.

As we load more and more sentences into the network for training, these directions can change. And when we have a fully trained network and give it a set of words, it could look up the vectors for these words, sum them up, and thus give us an idea for the sentiment. This concept is known as embedding.
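
The lookup-and-combine idea can be sketched with a tiny hand-written table. Everything here is made up for illustration: the 2-D vectors, the word list, and the helper `average_embedding` are assumptions, whereas a real embedding layer learns its vectors during training. The sketch averages the vectors rather than summing them, which points in the same direction.

```python
# Hypothetical 2-D embeddings: the first coordinate is positive for "good"
# and negative for "bad". These numbers are invented, not learned.
embedding_table = {
    "good": [1.0, 0.0],
    "bad": [-1.0, 0.0],
    "meh": [-0.25, 0.5],
}


def average_embedding(words, table):
    """Look up each word's vector and average them component-wise."""
    vectors = [table[word] for word in words]
    dims = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dims)]


good_meh = average_embedding(["good", "meh"], embedding_table)
bad_meh = average_embedding(["bad", "meh"], embedding_table)

print(good_meh)  # [0.375, 0.25]  -> leans towards "good"
print(bad_meh)   # [-0.625, 0.25] -> leans towards "bad"
```

The sign of the first coordinate of the averaged vector gives a rough sentiment for the whole word set, which is the intuition the trained network builds on.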

Now let’s take a look at how we can do this using the TensorFlow embedding layer.

In [26]:
embedding_dim = 16

# creating a model for sentiment analysis
model  = tf.keras.Sequential([
                # adding an Embedding layer so the network can learn a vector for each word
                tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length = max_length),
                # global average pooling averages the word vectors, similar to adding them up
                tf.keras.layers.GlobalAveragePooling1D(),
                tf.keras.layers.Dense(24, activation = 'relu'),
                tf.keras.layers.Dense(1, activation = 'sigmoid')
])

model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])


num_epochs = 10

history = model.fit(training_padded,training_labels, epochs = num_epochs,
                    validation_data = (testing_padded,testing_labels))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


We reach good accuracy on the training data, but val_accuracy decreases as training continues, which is classic overfitting. To counter it, we can either lower the learning rate or train for fewer epochs.
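
One way to decide how many epochs are enough is to stop once the validation loss has not improved for a few epochs in a row. The sketch below shows only that stopping rule in plain Python. The helper `best_epoch` and the loss values are made up for illustration and are not taken from the run above. Keras ships an `EarlyStopping` callback that automates this idea during `model.fit`.

```python
def best_epoch(val_losses, patience=2):
    """Return the 1-based epoch with the lowest validation loss,
    stopping once the loss fails to improve `patience` times in a row."""
    best_loss = float("inf")
    best_at = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_at, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` epochs: stop here
    return best_at


# Invented validation losses that improve and then get worse (overfitting).
val_losses = [0.45, 0.38, 0.35, 0.36, 0.39, 0.43]
stop_at = best_epoch(val_losses, patience=2)
print(stop_at)  # 3
```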

**Establishing Sentiment:**

Now let us see how we can use this model to establish sentiment for unseen sentences.

In [27]:
# forming new sentences for testing, feel free to experiment
# sentence 1 is a bit sarcastic, whereas sentence 2 is a general statement.
new_sentence = [
                "granny starting to fear spider in the garden might be real",
                "game of thrones season finale showing this sunday night"]

# Converting the sentences to sequences using tokenizer
new_sequences = tokenizer.texts_to_sequences(new_sentence)
# padding the new sequences to make them have same dimensions
new_padded = pad_sequences(new_sequences, maxlen = max_length,
                           padding = padding_type,
                           truncating = trunc_type)

new_padded = np.array(new_padded )

print(model.predict(new_padded))

[[0.28417525]
 [0.2538942 ]]


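*Each value is the model’s estimated probability that the sentence is sarcastic: values close to 1 indicate a sarcastic sentence and values close to 0 a non-sarcastic one. In the run shown above the scores are about 0.28 and 0.25, so with a 0.5 threshold both sentences are classified as non-sarcastic, even though the first one was written to be a bit sarcastic. The scores change from run to run because the train/test split and the weight initialization are random.*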
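
Turning the sigmoid outputs into labels only needs a threshold. The sketch below applies a 0.5 cut-off to the two scores printed above. Both the helper `to_labels` and the 0.5 value are assumptions made for illustration; a different threshold may suit a different application.

```python
def to_labels(predictions, threshold=0.5):
    """Convert sigmoid scores (one per row) into readable class labels."""
    return ["sarcastic" if row[0] >= threshold else "not sarcastic"
            for row in predictions]


# scores copied from the model.predict output above
predictions = [[0.28417525], [0.2538942]]
labels = to_labels(predictions)
print(labels)  # ['not sarcastic', 'not sarcastic']
```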