In [4]:
import tensorflow as tf
Tokenizer = tf.keras.preprocessing.text.Tokenizer

In [5]:
# initialise sentences 
sentences = ["I love my dog", "I LOVE MY CAT!", "You love my dog", "Do you think my dog is amazing?"]

`num_words` caps the vocabulary: when texts are converted to sequences, only the `num_words - 1` most frequent words are kept. You then fit the tokenizer on the training data and check which token each word was given through the `word_index` attribute. The tokenizer lowercases the text and strips punctuation, but it does not lemmatise: `dog` and `dogs` would get separate tokens.

In [6]:
tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
word_index

{'my': 1,
 'love': 2,
 'dog': 3,
 'i': 4,
 'you': 5,
 'cat': 6,
 'do': 7,
 'think': 8,
 'is': 9,
 'amazing': 10}
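To make the ranking above less of a black box, here is a minimal pure-Python sketch of how such a word index can be built: lowercase the text, replace punctuation with spaces, count the words, then number them from most to least frequent. `build_word_index` is a hypothetical helper written for illustration, not the Keras implementation, and its punctuation handling is a simplification.

```python
import string
from collections import Counter


def build_word_index(texts):
    """Simplified imitation of fitting a tokenizer and reading word_index."""
    # Replace punctuation with spaces (keeping apostrophes is an assumption).
    to_space = str.maketrans({c: " " for c in string.punctuation if c != "'"})
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(to_space).split())
    # Most frequent first; sorted() is stable, so ties keep first-seen order.
    ranked = sorted(counts, key=counts.get, reverse=True)
    # Numbering starts at 1, which leaves 0 free to act as a padding value.
    return {word: i for i, word in enumerate(ranked, start=1)}


sentences = ["I love my dog", "I LOVE MY CAT!", "You love my dog",
             "Do you think my dog is amazing?"]
word_index = build_word_index(sentences)
print(word_index)
```

On these four sentences the sketch gives the same mapping as the tokenizer output above: `my` appears four times so it gets token 1, and words that tie on frequency are numbered in the order they were first seen.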

Applying these tokens to the sentences converts each one into a list of integers ready to be processed. Because the sentences contain different numbers of words, the lists have different lengths.

In [7]:
sequences = tokenizer.texts_to_sequences(sentences)
sequences

[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]

Now what happens when the test data contains words that the tokenizer did not see during fitting? By default they are simply dropped from the sequence:

In [8]:
test_data = ["i really love my dog", "my dog loves my shoes"]
test_seq = tokenizer.texts_to_sequences(test_data)
test_seq

[[4, 2, 1, 3], [1, 3, 1]]

One way to tackle this is to reserve a token for out-of-vocabulary (OOV) words and assign it to every word the tokenizer has not seen. The sequences are now the same length as the original sentences, although we have still lost some meaning.

In [9]:
tokenizer_oov = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer_oov.fit_on_texts(sentences)
word_index = tokenizer_oov.word_index
word_index

{'<OOV>': 1,
 'my': 2,
 'love': 3,
 'dog': 4,
 'i': 5,
 'you': 6,
 'cat': 7,
 'do': 8,
 'think': 9,
 'is': 10,
 'amazing': 11}

In [10]:
test_seq = tokenizer_oov.texts_to_sequences(test_data)
test_seq

[[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]
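The difference between the two tokenizers can be reproduced with a small pure-Python sketch. `to_sequences` is a hypothetical helper that only imitates the lookup step (it assumes the text has no punctuation), and the two dictionaries are copied from the `word_index` outputs above.

```python
def to_sequences(texts, word_index, num_words=None, oov_token=None):
    """Simplified imitation of converting texts to integer sequences."""
    oov_id = word_index.get(oov_token) if oov_token is not None else None
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            index = word_index.get(word)
            # Unknown words, and words ranked at or beyond num_words, count as OOV.
            if index is None or (num_words is not None and index >= num_words):
                index = oov_id
            if index is not None:
                seq.append(index)  # with no OOV token the word is dropped
        sequences.append(seq)
    return sequences


plain_index = {"my": 1, "love": 2, "dog": 3, "i": 4, "you": 5,
               "cat": 6, "do": 7, "think": 8, "is": 9, "amazing": 10}
# With an OOV token, "<OOV>" takes index 1 and every word shifts up by one.
index_with_oov = {"<OOV>": 1, **{w: i + 1 for w, i in plain_index.items()}}

test_data = ["i really love my dog", "my dog loves my shoes"]
without_oov = to_sequences(test_data, plain_index)
with_oov = to_sequences(test_data, index_with_oov, oov_token="<OOV>")
print(without_oov)  # [[4, 2, 1, 3], [1, 3, 1]]
print(with_oov)     # [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]

# num_words=3 keeps only the two most frequent words ("my" and "love").
print(to_sequences(["i love my dog"], plain_index, num_words=3))  # [[2, 1]]
```

Note that `loves` maps to the OOV token even though `love` is in the vocabulary, since the words are matched exactly.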

We are getting close to building the neural network, which processes fixed-size matrices, so we need to think about what to do when the sentences are not all the same length. One method is a *"ragged tensor"*, which we will not look at. Another method is padding.

In [11]:
pad_sequences = tf.keras.preprocessing.sequence.pad_sequences

In [12]:
# optional keyword arguments:
#   padding="post" pads on the right (the default, "pre", pads on the left)
#   maxlen=N pads every sequence to a specific length N
#   truncating="post"/"pre" chooses which end is cut when a sequence is longer than maxlen
padded_sequences = pad_sequences(sequences) 
padded_sequences

array([[ 0,  0,  0,  4,  2,  1,  3],
       [ 0,  0,  0,  4,  2,  1,  6],
       [ 0,  0,  0,  5,  2,  1,  3],
       [ 7,  5,  8,  1,  3,  9, 10]])
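The padding and truncating options are easier to see side by side. The following is a minimal pure-Python sketch of the same idea; `pad` is a hypothetical stand-in that returns nested lists rather than a NumPy array, and it is not the Keras implementation.

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Simplified imitation of padding integer sequences to a common length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # "pre" cuts tokens from the start, "post" cuts them from the end.
            seq = seq[len(seq) - maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # "pre" puts the zeros on the left, "post" puts them on the right.
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded


sequences = [[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]

print(pad(sequences)[0])                               # [0, 0, 0, 4, 2, 1, 3]
print(pad(sequences, padding="post")[0])               # [4, 2, 1, 3, 0, 0, 0]
print(pad(sequences, maxlen=5)[3])                     # [8, 1, 3, 9, 10]
print(pad(sequences, maxlen=5, truncating="post")[3])  # [7, 5, 8, 1, 3]
```

With the defaults, the first call matches the first row of the array above: zeros go in front and nothing is cut.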

# Sarcastic news titles

I'm going to use the Kaggle API to download the dataset as JSON, then train on one part of the data and test on the rest... the usual.

In [13]:
import json
import numpy as np
from kaggle.api.kaggle_api_extended import KaggleApi

In [14]:
api = KaggleApi()
api.authenticate()

In [15]:
# copied from kaggle the API command to download the dataset
# kaggle datasets download -d rmisra/news-headlines-dataset-for-sarcasm-detection

# Download all files of a dataset
# Signature: dataset_download_files(dataset, path=None, force=False, quiet=True, unzip=False)

#api.dataset_download_files("rmisra/news-headlines-dataset-for-sarcasm-detection")

In [16]:
with open("data/Sarcasm_Headlines_Dataset.json", "r") as file:
    # the file holds one JSON object per line, so parse it line by line
    data = {"root": [json.loads(line) for line in file]}

sentences = [item["headline"] for item in data["root"]]
labels = [item["is_sarcastic"] for item in data["root"]]
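The file is read line by line because each line is its own JSON object, so the file as a whole is not one valid JSON document. A minimal sketch of that parsing step, using an in-memory file with two made-up records (the headlines here are hypothetical; only the `headline` and `is_sarcastic` field names come from the notebook):

```python
import io
import json

# Two invented records standing in for the real dataset file.
fake_file = io.StringIO(
    '{"headline": "local man wins award", "is_sarcastic": 0}\n'
    '{"headline": "area cat declares itself mayor", "is_sarcastic": 1}\n'
)

# Parse one JSON object per line, skipping any blank lines.
records = [json.loads(line) for line in fake_file if line.strip()]

sentences = [record["headline"] for record in records]
labels = [record["is_sarcastic"] for record in records]
print(sentences)
print(labels)  # [0, 1]

# Parsing the whole text as a single document fails, which is why we go line by line.
try:
    json.loads(fake_file.getvalue())
    whole_file_parses = True
except json.JSONDecodeError:
    whole_file_parses = False
print(whole_file_parses)  # False
```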

In [17]:
training_size = 20000

train_sentences = sentences[:training_size]
train_labels = labels[:training_size]

test_sentences = sentences[training_size:]
test_labels = labels[training_size:]

In [18]:
vocab_size = 10000
tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(train_sentences)

train_word_index = tokenizer.word_index

train_sequences = tokenizer.texts_to_sequences(train_sentences)
train_padded = pad_sequences(train_sequences, padding="post")

test_sequences = tokenizer.texts_to_sequences(test_sentences)
test_padded = pad_sequences(test_sequences, padding="post")

In [19]:
# Need this block to get it to work with TensorFlow 2.x
train_padded = np.array(train_padded)
train_labels = np.array(train_labels)
test_padded = np.array(test_padded)
test_labels = np.array(test_labels)

In [20]:
train_padded.shape
# 20000 sentences, each padded to 40 tokens, the length of the longest training headline

(20000, 40)

Now it comes to building the neural network. `Sequential` is a model class that stacks layers so that the output of each layer feeds the next.
- The first layer is an `Embedding`. It maps each token to a vector of `embedding_dim` numbers, and these vectors (loosely, the direction or tone of each word) are learned epoch by epoch.
- Next, `GlobalAveragePooling1D` averages the word vectors across the sentence, giving one vector per sentence.
- `Dense` is a layer of fully connected neurons. Here we use two: the first has 24 nodes and the second has a single node whose sigmoid output is the predicted probability that the headline is sarcastic.
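The three steps in the list can be sketched as a forward pass in plain Python. This is an illustration under stated assumptions, not how Keras implements the layers: the sizes are toy values, the weights are random rather than learned, and `forward`, `dense` and `param_count` are hypothetical helper names.

```python
import math
import random

random.seed(0)  # arbitrary seed so the sketch is repeatable

vocab_size, embedding_dim, units = 12, 4, 3  # toy sizes, not the notebook's


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


embedding = rand_matrix(vocab_size, embedding_dim)  # one row (vector) per token
w1, b1 = rand_matrix(embedding_dim, units), [0.0] * units
w2, b2 = rand_matrix(units, 1), [0.0]


def dense(x, w, b):
    # y_j = sum_i x_i * w[i][j] + b_j
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]


def forward(token_ids):
    vectors = [embedding[t] for t in token_ids]                  # embedding lookup
    pooled = [sum(col) / len(vectors) for col in zip(*vectors)]  # global average
    hidden_out = [max(0.0, v) for v in dense(pooled, w1, b1)]    # dense + ReLU
    logit = dense(hidden_out, w2, b2)[0]                         # dense with 1 node
    return 1.0 / (1.0 + math.exp(-logit))                        # sigmoid


prob = forward([5, 1, 3, 2, 4, 0, 0])  # a padded sequence; 0 is the padding token
print(prob)


def param_count(vocab, emb, n_units):
    # embedding table + first dense (weights, biases) + output node (weights, bias)
    return vocab * emb + (emb * n_units + n_units) + (n_units + 1)


print(param_count(10000, 16, 24))  # 160433
```

Two things follow from the averaging step. Word order is discarded, so a sentence and its reverse produce the same prediction. And in this sketch the padding positions are averaged in along with the real words, so the amount of padding influences the pooled vector; to my understanding `GlobalAveragePooling1D` behaves the same way unless a mask is supplied.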

In [21]:
embedding_dim = 16
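# max_length is used for the Embedding input_length and for padding new sentences below;
# train_padded above is 40 wide because pad_sequences was called without maxlen=max_length
max_length = 100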

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid")
])

model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

In [22]:
history = model.fit(train_padded, train_labels, epochs=30, validation_data=(test_padded, test_labels), verbose=2)

Epoch 1/30
625/625 - 2s - loss: 0.5793 - accuracy: 0.6914 - val_loss: 0.4080 - val_accuracy: 0.8243
Epoch 2/30
625/625 - 1s - loss: 0.3242 - accuracy: 0.8676 - val_loss: 0.3437 - val_accuracy: 0.8535
Epoch 3/30
625/625 - 1s - loss: 0.2437 - accuracy: 0.9040 - val_loss: 0.3401 - val_accuracy: 0.8556
Epoch 4/30
625/625 - 1s - loss: 0.1987 - accuracy: 0.9233 - val_loss: 0.3605 - val_accuracy: 0.8538
Epoch 5/30
625/625 - 1s - loss: 0.1659 - accuracy: 0.9367 - val_loss: 0.3797 - val_accuracy: 0.8530
Epoch 6/30
625/625 - 1s - loss: 0.1415 - accuracy: 0.9488 - val_loss: 0.4115 - val_accuracy: 0.8481
Epoch 7/30
625/625 - 1s - loss: 0.1214 - accuracy: 0.9584 - val_loss: 0.4547 - val_accuracy: 0.8416
Epoch 8/30
625/625 - 1s - loss: 0.1065 - accuracy: 0.9635 - val_loss: 0.4861 - val_accuracy: 0.8410
Epoch 9/30
625/625 - 1s - loss: 0.0936 - accuracy: 0.9691 - val_loss: 0.5188 - val_accuracy: 0.8395
Epoch 10/30
625/625 - 1s - loss: 0.0815 - accuracy: 0.9733 - val_loss: 0.5648 - val_accuracy: 0.8357
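In the ten epochs shown, the training loss keeps falling while the validation loss reaches its minimum early and then climbs, a pattern usually read as overfitting. A small sketch that finds the best epoch from numbers copied out of the log above (the same values should also be available in `history.history["val_loss"]`):

```python
# (epoch, val_loss, val_accuracy) copied from the training log above
log = [
    (1, 0.4080, 0.8243), (2, 0.3437, 0.8535), (3, 0.3401, 0.8556),
    (4, 0.3605, 0.8538), (5, 0.3797, 0.8530), (6, 0.4115, 0.8481),
    (7, 0.4547, 0.8416), (8, 0.4861, 0.8410), (9, 0.5188, 0.8395),
    (10, 0.5648, 0.8357),
]

# The epoch with the lowest validation loss is the natural place to stop.
best_epoch, best_val_loss, best_val_acc = min(log, key=lambda row: row[1])
print(best_epoch, best_val_loss, best_val_acc)  # 3 0.3401 0.8556

# Check that validation loss rises at every epoch after the best one.
later_losses = [val_loss for epoch, val_loss, _ in log if epoch >= best_epoch]
rises_after_best = all(a < b for a, b in zip(later_losses, later_losses[1:]))
print(rises_after_best)  # True
```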

In [23]:
new_sentences = ["granny starting to fear spiders in the garden might be real", "game of thrones season finale showing this sunday night"]

new_sequences = tokenizer.texts_to_sequences(new_sentences)
new_pads = pad_sequences(new_sequences, maxlen=max_length, padding="post", truncating="post")

model.predict(new_pads)

array([[7.6833975e-01],
       [1.0817187e-04]], dtype=float32)

What we see is that the model gives the first sentence a probability of about 77% of being sarcastic, whereas the second sentence is very unlikely to be sarcastic.
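To turn these probabilities into labels you pick a cut-off. A minimal sketch, assuming a threshold of 0.5 (the notebook does not specify one) and using the two values copied from the output above:

```python
# Probabilities copied from the model.predict output above
predictions = [[7.6833975e-01], [1.0817187e-04]]

threshold = 0.5  # assumed cut-off, not taken from the notebook

labels = ["sarcastic" if row[0] >= threshold else "not sarcastic" for row in predictions]
percentages = [round(row[0] * 100, 1) for row in predictions]

print(labels)       # ['sarcastic', 'not sarcastic']
print(percentages)  # [76.8, 0.0]
```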

# ... TBC