# Synopsis

This project performs sentiment analysis using natural language processing to detect sarcasm in English news headlines. The data is available at https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection. The news headlines are converted to sequences of integer tokens and padded to a uniform length. A recurrent neural network with bidirectional long short-term memory (LSTM) layers is trained on the training set and validated against the validation set. The final out-of-sample accuracy of the model is then reported on the test set.

# Setup

Import the libraries and methods needed for the project.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import json

# Load the Data

Load the text corpus from the JSON file containing the dataset. The file stores one JSON object per line, so it is parsed line by line.

In [2]:
def parse_data(file):
    # Use a context manager so the file is closed once parsing finishes
    with open(file, "r") as f:
        for line in f:
            yield json.loads(line)

corpus = list(parse_data("../input/news-headlines-dataset-for-sarcasm-detection/Sarcasm_Headlines_Dataset.json"))
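The line-by-line parsing can be tried without the Kaggle file. The sketch below is illustrative only: the helper name `parse_json_lines` and the two sample records are made up, and only the field names `headline` and `is_sarcastic` come from the dataset.

```python
import io
import json

def parse_json_lines(handle):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)

# Hypothetical two-record sample standing in for the real dataset file
sample = io.StringIO(
    '{"headline": "example headline one", "is_sarcastic": 0}\n'
    '{"headline": "example headline two", "is_sarcastic": 1}\n'
)

records = list(parse_json_lines(sample))
for record in records:
    print(record["is_sarcastic"], record["headline"])
```

Skipping blank lines is an extra safeguard in this sketch; the notebook's own loader assumes every line holds a valid JSON object.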

Instantiate lists to store the desired variables of the dataset.

In [3]:
sentences = []
labels = []

Fill the lists with the corresponding data in the JSON object.

In [4]:
for item in corpus:
    sentences.append(item["headline"])
    labels.append(item["is_sarcastic"])

# Data Partitioning

Obtain the total number of sentences in the dataset.

In [5]:
n_sentences = len(sentences)

Partition the data into a training set, a validation set and a test set using an 80-10-10 split. The split is sequential: the data is sliced in its original order, without shuffling.

In [6]:
training_last_index = int(0.8 * n_sentences)
validation_last_index = int(0.9 * n_sentences)

training_sentences = sentences[0 : training_last_index]
validation_sentences = sentences[training_last_index : validation_last_index]
test_sentences = sentences[validation_last_index : ]

training_labels = labels[0 : training_last_index]
validation_labels = labels[training_last_index : validation_last_index]
test_labels = labels[validation_last_index : ]
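The index arithmetic behind the split can be checked on a small list. This is a minimal sketch with a hypothetical helper `split_80_10_10`; it mirrors the slicing above but is not part of the original notebook.

```python
def split_80_10_10(items):
    """Split a list sequentially into 80% / 10% / 10% slices."""
    n = len(items)
    train_end = int(0.8 * n)   # int() truncates toward zero
    val_end = int(0.9 * n)
    return items[:train_end], items[train_end:val_end], items[val_end:]

toy = list(range(25))
train, val, test = split_80_10_10(toy)
print(len(train), len(val), len(test))
```

For 25 items this yields slices of 20, 2 and 3 items. Because both boundaries are truncated, any remainder ends up in the last slice, and the three slices always cover the whole list without overlap.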

# Tokenization

Set hyperparameters for tokenization.

In [7]:
vocab_size = 10000
# Set the maximum length of padded sequences equal to the number of words in the longest training sentence
max_length = max([len(training_sentence.split(" ")) for training_sentence in training_sentences])
oov_token = "<OOV>"
padding_type = "post"
trunc_type = "post"

Instantiate the tokenizer.

In [8]:
tokenizer = Tokenizer(num_words = vocab_size,
                      oov_token = oov_token)

Fit the tokenizer to the training sentences.

In [9]:
tokenizer.fit_on_texts(training_sentences)

Obtain the word index of the tokenizer.

In [10]:
word_index = tokenizer.word_index

Create sequences of tokens representing the sentences.

In [11]:
training_sequences = tokenizer.texts_to_sequences(training_sentences)
validation_sequences = tokenizer.texts_to_sequences(validation_sentences)
test_sequences = tokenizer.texts_to_sequences(test_sentences)
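To make the word index and the resulting sequences more concrete, here is a simplified plain-Python sketch of the idea. It is not the Keras implementation: it only lowercases and splits on whitespace (no punctuation filtering), and the helper names `build_word_index` and `encode` are hypothetical. It assumes, as in Keras, that index 1 is reserved for the out-of-vocabulary token, that more frequent words receive smaller indices, and that words whose index is not below `num_words` are mapped to the out-of-vocabulary index.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by descending frequency; index 1 is reserved for OOV."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def encode(text, word_index, num_words, oov_token="<OOV>"):
    """Map words to indices; unknown or too-rare words become the OOV index."""
    oov_index = word_index[oov_token]
    encoded = []
    for word in text.lower().split():
        index = word_index.get(word)
        if index is None or index >= num_words:
            index = oov_index
        encoded.append(index)
    return encoded

toy_texts = ["the cat sat", "the dog sat", "the bird flew"]
toy_index = build_word_index(toy_texts)
toy_encoded = encode("the cat barked", toy_index, num_words=4)
print(toy_index)
print(toy_encoded)
```

In this toy example "cat" is in the word index but falls outside the `num_words` cutoff, and "barked" was never seen, so both are encoded as the out-of-vocabulary index. The same thing happens to validation and test headlines containing words the tokenizer did not see, or ranked too low, during fitting.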

Pad the sequences to ensure they're all the same length.

In [12]:
training_padded = pad_sequences(training_sequences,
                                maxlen = max_length,
                                padding = padding_type,
                                truncating = trunc_type)
validation_padded = pad_sequences(validation_sequences,
                                  maxlen = max_length,
                                  padding = padding_type,
                                  truncating = trunc_type)
test_padded = pad_sequences(test_sequences,
                            maxlen = max_length,
                            padding = padding_type,
                            truncating = trunc_type)
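The effect of the "post" settings for padding and truncation can be illustrated with a plain-Python sketch. The helper `pad_post` is hypothetical and only mimics the behaviour for lists of integers; the notebook itself relies on `pad_sequences`.

```python
def pad_post(sequences, maxlen, value=0):
    """Truncate at the end, then pad at the end, so every row has length maxlen."""
    padded = []
    for seq in sequences:
        kept = list(seq[:maxlen])                  # "post" truncation drops trailing tokens
        kept += [value] * (maxlen - len(kept))     # "post" padding appends zeros
        padded.append(kept)
    return padded

toy_sequences = [[5, 8], [3, 1, 4, 1, 5, 9], [7]]
toy_padded = pad_post(toy_sequences, maxlen=4)
for row in toy_padded:
    print(row)
```

Since `max_length` is taken from the longest training sentence, truncation should mainly matter for validation or test headlines that happen to be longer than anything in the training set.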

Convert the labels to NumPy arrays to ensure they're compatible with TensorFlow. The padded sequences returned by pad_sequences are already NumPy arrays, so converting them again is harmless but not strictly required.

In [13]:
training_padded = np.array(training_padded)
training_labels = np.array(training_labels)

validation_padded = np.array(validation_padded)
validation_labels = np.array(validation_labels)

test_padded = np.array(test_padded)
test_labels = np.array(test_labels)

# Sanity Check

Observe the first 10 elements of the word index of the tokenizer.

In [14]:
for word, index in list(word_index.items())[:10]:
    print(f"{word}: {index}")

Observe the raw sequence created from the first training sentence.

In [15]:
training_sequences[0]

Observe the padded sequence created from the first training sentence.

In [16]:
training_padded[0]

Observe the shape of the object containing the padded sequences for the training set.

In [17]:
training_padded.shape

The training set contains 21367 sequences, each padded to a length of 39.

# Define the Model

Set the size of the word embeddings.

In [18]:
embedding_dim = 64

Define the layers of the natural language processing model.

In [19]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim = vocab_size,
                              output_dim = embedding_dim,
                              input_length = max_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(units = 64,
                                                       return_sequences = True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(units = 32)),
    tf.keras.layers.Dense(units = 64,
                          activation = "relu"),
    tf.keras.layers.Dense(units = 1, 
                          activation = "sigmoid")
])

Obtain an overview of the model.

In [20]:
model.summary()
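The parameter counts in the summary can be cross-checked by hand. The sketch below assumes the standard formulas for embedding, LSTM (four gates, one bias vector per gate) and dense layers, and assumes the bidirectional wrapper concatenates the two directions, which is the Keras default. The helper names are hypothetical.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary entry
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

vocab_size, embedding_dim = 10000, 64
layer_params = {
    "embedding": embedding_params(vocab_size, embedding_dim),
    "bidirectional_lstm_64": 2 * lstm_params(embedding_dim, 64),
    "bidirectional_lstm_32": 2 * lstm_params(2 * 64, 32),   # input is 2 x 64 after concatenation
    "dense_64": dense_params(2 * 32, 64),                   # input is 2 x 32 after concatenation
    "dense_1": dense_params(64, 1),
}
total_params = sum(layer_params.values())

for name, count in layer_params.items():
    print(f"{name}: {count}")
print(f"total: {total_params}")
```

Under these assumptions the embedding layer accounts for 640000 of the 751489 parameters, so the vocabulary size and embedding dimension dominate the model size.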

Compile the model with a loss function, optimizer and metric.

In [21]:
model.compile(loss = "binary_crossentropy",
              optimizer = "adam",
              metrics = ["accuracy"])
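For intuition about the loss function, binary cross-entropy can be computed by hand on a few toy predictions. This is a minimal sketch, not the TensorFlow implementation; the function name, the clipping constant `eps` and the toy values are all assumptions made for illustration.

```python
import math

def binary_cross_entropy(labels, probabilities, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

toy_labels = [1, 0, 1, 0]
confident = binary_cross_entropy(toy_labels, [0.9, 0.1, 0.8, 0.2])
uncertain = binary_cross_entropy(toy_labels, [0.5, 0.5, 0.5, 0.5])
print(f"confident and correct: {confident:.4f}")
print(f"always 0.5:            {uncertain:.4f}")
```

A model that always outputs 0.5 scores a loss of ln 2, roughly 0.693, which is a useful reference point when reading the loss values reported during training.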

# Train the Model

Choose a desired number of epochs.

In [22]:
n_epochs = 30

Train the model for the given number of epochs.

In [23]:
history = model.fit(x = training_padded,
                    y = training_labels,
                    epochs = n_epochs,
                    validation_data = (validation_padded, validation_labels),
                    verbose = 2)

Create a helper function to plot training and validation curves.

In [24]:
def PlotGraphs(history, metric):
    plt.plot(history.history[metric])
    plt.plot(history.history["val_" + metric])
    plt.xlabel("Epochs")
    plt.ylabel(metric)
    plt.legend([metric, "val_" + metric])

Visualize the respective plots.

In [25]:
plt.figure(figsize = [16, 8])
plt.subplot(1, 2, 1)
PlotGraphs(history, "accuracy")
plt.ylim(None, 1)
plt.subplot(1, 2, 2)
PlotGraphs(history, "loss")
plt.ylim(0, None)

# Evaluate the Model

Perform a final evaluation of the model on the test set.

In [26]:
test_loss, test_accuracy = model.evaluate(x = test_padded,
                                          y = test_labels)

print(f"Loss on the test set: {test_loss}")
print(f"Accuracy on the test set: {test_accuracy}")
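The reported accuracy is the fraction of headlines whose predicted class matches the label. The sketch below assumes the sigmoid output is converted to a class with a threshold of 0.5, which is understood to be the Keras default for binary accuracy; the helper name and the toy values are hypothetical.

```python
def binary_accuracy(labels, probabilities, threshold=0.5):
    """Fraction of predictions that match the labels after thresholding."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(pred == y) for pred, y in zip(predictions, labels))
    return correct / len(labels)

toy_labels = [1, 0, 1, 0, 1]
toy_probabilities = [0.91, 0.12, 0.40, 0.65, 0.77]
toy_accuracy = binary_accuracy(toy_labels, toy_probabilities)
print(f"Accuracy on the toy data: {toy_accuracy}")
```

Here three of the five toy predictions fall on the correct side of the threshold. Accuracy ignores how confident each prediction is, which is why it is worth reading alongside the loss.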