About Dataset

Context
** Please cite the dataset using the BibTex provided in one of the following sections if you are using it in your research, thank you! **

Past studies in Sarcasm Detection mostly make use of Twitter datasets collected using hashtag-based supervision, but such datasets are noisy in terms of labels and language. Furthermore, many tweets are replies to other tweets, and detecting sarcasm in these requires the availability of the contextual tweets.

To overcome the limitations related to noise in Twitter datasets, this News Headlines dataset for Sarcasm Detection is collected from two news websites. TheOnion aims at producing sarcastic versions of current events, and we collected all the headlines from its News in Brief and News in Photos categories (which are sarcastic). We collected real (and non-sarcastic) news headlines from HuffPost.

This new dataset has the following advantages over the existing Twitter datasets:

Since news headlines are written by professionals in a formal manner, they contain no spelling mistakes or informal usage. This reduces sparsity and also increases the chance of finding pre-trained embeddings.

Furthermore, since the sole purpose of TheOnion is to publish sarcastic news, we get high-quality labels with much less noise than in Twitter datasets.

Unlike tweets that are replies to other tweets, the news headlines we obtained are self-contained. This helps us tease apart the real sarcastic elements.

Content
Each record consists of three attributes:

is_sarcastic: 1 if the record is sarcastic, otherwise 0

headline: the headline of the news article

article_link: link to the original news article; useful for collecting supplementary data

General statistics of the data, instructions on how to read the data in Python, and a basic exploratory analysis can be found at this GitHub repo. A hybrid NN architecture trained on this dataset can be found at this GitHub repo.
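As a minimal sketch of how records with these three attributes can be read in Python: the file is assumed to hold one JSON object per line, which is also what the loading code later in this notebook assumes. The two sample records reuse headlines, labels, and links from the notebook output; the key order, the helper name `read_records`, and the temporary file are illustrative assumptions.

```python
import json
import os
import tempfile

# Two sample records built from values shown in this notebook's output.
SAMPLE_LINES = [
    '{"article_link": "https://www.theonion.com/thirtysomething-scientists-unveil-doomsday-clock-of-hai-1819586205", '
    '"headline": "thirtysomething scientists unveil doomsday clock of hair loss", "is_sarcastic": 1}',
    '{"article_link": "https://www.huffingtonpost.com/entry/eat-your-veggies-9-delici_b_8899742.html", '
    '"headline": "eat your veggies: 9 deliciously different recipes", "is_sarcastic": 0}',
]


def read_records(path):
    """Yield one dict per non-empty line of a JSON-lines file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Write the samples to a temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False, encoding="utf-8") as tmp:
    tmp.write("\n".join(SAMPLE_LINES) + "\n")
    sample_path = tmp.name

records = list(read_records(sample_path))
os.remove(sample_path)

for record in records:
    print(record["is_sarcastic"], record["headline"])
```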

Citation
If you're using this dataset for your work, please cite the following articles:

Citation in text format:

1. Misra, Rishabh and Prahal Arora. "Sarcasm Detection using News Headlines Dataset." AI Open (2023).
2. Misra, Rishabh and Jigyasa Grover. "Sculpting Data for ML: The first act of Machine Learning." ISBN 9798585463570 (2021).


In [401]:
import json
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

warnings.filterwarnings('ignore')

In [402]:
gpu = tf.config.list_physical_devices('GPU')

if gpu:
    print('GPU is available: {}'.format(gpu))
else:
    print('GPU Not Found!!')

GPU is available: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


In [403]:
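vocab_size = 30000      # input_dim of the Embedding layer
embedded_dim = 16       # size of each embedding vector
max_length = 1000       # length every sequence is padded or truncated to
trunc_type = 'post'     # cut overly long sequences at the end
pad_type = 'post'       # add padding zeros at the end
oov_token = 'OOV'       # placeholder for out-of-vocabulary words
training_size = 20000   # number of records used for training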

In [404]:
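def parse_data(file):
    """Yield one record (dict) per line of a JSON-lines file."""
    with open(file, 'r') as f:  # the context manager closes the file when done
        for line in f:
            yield json.loads(line)

data = list(parse_data('Sarcasm_Headlines_Dataset.json'))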

In [405]:
sentences = []
labels = []
urls = []

for item in parse_data('Sarcasm_Headlines_Dataset.json'):
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
    urls.append(item['article_link'])

In [406]:
print(sentences[:5])
print(labels[:5])
print(urls[:5])

['thirtysomething scientists unveil doomsday clock of hair loss', 'dem rep. totally nails why congress is falling short on gender, racial equality', 'eat your veggies: 9 deliciously different recipes', 'inclement weather prevents liar from getting to work', "mother comes pretty close to using word 'streaming' correctly"]
[1, 0, 0, 1, 1]
['https://www.theonion.com/thirtysomething-scientists-unveil-doomsday-clock-of-hai-1819586205', 'https://www.huffingtonpost.com/entry/donna-edwards-inequality_us_57455f7fe4b055bb1170b207', 'https://www.huffingtonpost.com/entry/eat-your-veggies-9-delici_b_8899742.html', 'https://local.theonion.com/inclement-weather-prevents-liar-from-getting-to-work-1819576031', 'https://www.theonion.com/mother-comes-pretty-close-to-using-word-streaming-cor-1819575546']


In [407]:
training_sentences = sentences[:training_size]
training_labels = labels[:training_size]

testing_sentences = sentences[training_size:]
testing_labels = labels[training_size:]
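The split above takes the first `training_size` records in file order. If the file were ordered in some systematic way, the two parts could differ in distribution. A minimal sketch of a seeded, shuffled split follows; the helper name `shuffled_split`, the seed, and the toy data are assumptions made for illustration, not part of the original notebook.

```python
import random


def shuffled_split(items, labels, train_size, seed=42):
    """Shuffle items and labels together, then split at train_size."""
    if len(items) != len(labels):
        raise ValueError("items and labels must have the same length")
    order = list(range(len(items)))
    random.Random(seed).shuffle(order)  # seeded, so the split is reproducible
    train_idx, test_idx = order[:train_size], order[train_size:]
    train = ([items[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([items[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test


# Toy stand-ins for `sentences` and `labels`.
toy_sentences = [f"headline {i}" for i in range(10)]
toy_labels = [i % 2 for i in range(10)]

(train_x, train_y), (test_x, test_y) = shuffled_split(toy_sentences, toy_labels, train_size=7)
print(len(train_x), len(test_x))
```

Shuffling a list of indices rather than the data itself keeps each sentence paired with its label.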

In [408]:
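# num_words is set to max_length (1000) here, not vocab_size, so texts_to_sequences
# keeps only word indices below 1000 and maps every other word to the OOV token.
tokenizer = Tokenizer(oov_token=oov_token, num_words=max_length)
tokenizer.fit_on_texts(training_sentences)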

In [409]:
# Printing the Vocabulary List
print(tokenizer.word_index)



In [410]:
# The largest index in word_index is the highest index assigned during fitting
max_value = max(tokenizer.word_index.values())
print(f'The max word index in the Vocabulary list is {max_value}')

The max word index in the Vocabulary list is 25898
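The index printed above reaches 25898 even though the tokenizer was created with `num_words` set to 1000. That is consistent with the cap being applied only when texts are converted to sequences, not when the word index is built. The sketch below imitates that behaviour in plain Python. It is a simplified stand-in, not the Keras implementation: it splits on whitespace only and skips punctuation filtering, and the helper names `build_word_index` and `to_sequence` are hypothetical.

```python
from collections import Counter


def build_word_index(texts, oov_token="OOV"):
    """Rank words by frequency; index 1 is reserved for the OOV token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate([oov_token] + ranked, start=1)}


def to_sequence(text, word_index, num_words=None, oov_token="OOV"):
    """Map words to indices; unknown or too-rare words become the OOV index."""
    oov_index = word_index[oov_token]
    sequence = []
    for word in text.lower().split():
        index = word_index.get(word)
        if index is None or (num_words is not None and index >= num_words):
            index = oov_index
        sequence.append(index)
    return sequence


toy_texts = ["the cat sat", "the cat ran", "the dog sat"]
word_index = build_word_index(toy_texts)

uncapped = to_sequence("the dog ran", word_index)
capped = to_sequence("the dog ran", word_index, num_words=4)

print(len(word_index))  # the index covers every word, whatever num_words is
print(uncapped)         # every word keeps its own index
print(capped)           # words ranked at index 4 or above collapse to OOV
```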


In [411]:
training_sequences = tokenizer.texts_to_sequences(training_sentences)
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)

In [412]:
max_length_sentence = 0
longest_training_sentence = ""
indice = 0

# Track the longest headline, measured in characters
for index, sentence in enumerate(training_sentences):
    if len(sentence) > max_length_sentence:
        max_length_sentence = len(sentence)
        longest_training_sentence = sentence
        indice = index

print('The longest sentence is the following: \n\n{}'.format(longest_training_sentence))
print(f'\n\nThe longest sentence has {max_length_sentence} characters')
print(f'\n\nThe longest sentence is located at index {indice}')


In [413]:
training_padded_sequences = pad_sequences(training_sequences,padding = pad_type, maxlen = max_length, truncating = trunc_type)
testing_padded_sequences = pad_sequences(testing_sequences,padding = pad_type, maxlen = max_length, truncating = trunc_type)
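With both `padding` and `truncating` set to 'post', each sequence is cut at the end if it is too long and filled with zeros at the end if it is too short. The plain-Python sketch below mirrors that rule for a single sequence. The helper `pad_post` is hypothetical and only illustrates the idea; it is not the Keras function.

```python
def pad_post(sequence, maxlen, value=0):
    """Truncate from the end, then pad at the end, to exactly maxlen items."""
    trimmed = list(sequence[:maxlen])                    # 'post' truncation drops the tail
    return trimmed + [value] * (maxlen - len(trimmed))   # 'post' padding appends zeros


# Token ids of the first training headline, as printed in this notebook.
short = [1, 325, 1, 1, 1, 3, 655, 993]
long = list(range(1, 15))

print(pad_post(short, maxlen=12))
print(pad_post(long, maxlen=12))
```

Because headlines are short, a `maxlen` of 1000 leaves most of every row as padding, which the printed example in this notebook also shows.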

In [414]:
print(training_sentences[0])
print('\n\n',training_padded_sequences[0])
print('\n\n', training_labels[0])

thirtysomething scientists unveil doomsday clock of hair loss


 [  1 325   1   1   1   3 655 993   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   

In [415]:
print('The number of sentences in the training corpus {}'.format(len(training_sentences)))
print('The shape of the padded training sequences {}'.format(training_padded_sequences.shape))

The number of sentences in the training corpus 20000
The shape of the padded training sequences (20000, 1000)


In [416]:
print('The number of sentences in the testing corpus {}'.format(len(testing_sentences)))
print('The shape of the padded testing sequences {}'.format(testing_padded_sequences.shape))

The number of sentences in the testing corpus 8619
The shape of the padded testing sequences (8619, 1000)


In [417]:
model = tf.keras.Sequential(
    [tf.keras.layers.Embedding(input_dim=vocab_size, input_length=max_length, output_dim=embedded_dim),
     tf.keras.layers.GlobalAveragePooling1D(),
     tf.keras.layers.Dense(units=24, activation='relu'),
     tf.keras.layers.Dense(units=1, activation='sigmoid')
     ]
)
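The model above does not pass `mask_zero` to the Embedding layer, so the pooling layer averages over all `max_length` positions, padding included. The sketch below shows with toy numbers how that differs from averaging over the real tokens only. The embedding table and the helper `mean_pool` are hypothetical, and in a trained model the padding row would be a learned vector rather than zeros.

```python
def mean_pool(token_ids, table, skip_padding=False, pad_id=0):
    """Average the embedding rows of token_ids, optionally ignoring pad_id."""
    ids = [t for t in token_ids if not (skip_padding and t == pad_id)]
    if not ids:
        raise ValueError("no tokens left to average")
    dim = len(table[0])
    return [sum(table[t][d] for t in ids) / len(ids) for d in range(dim)]


# Hypothetical 2-dimensional embedding table; row 0 belongs to the padding id.
toy_table = [
    [0.0, 0.0],
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]
padded = [1, 2, 3, 0, 0, 0]  # three real tokens followed by three padding positions

pooled_all = mean_pool(padded, toy_table)
pooled_real = mean_pool(padded, toy_table, skip_padding=True)

print(pooled_all)   # average over all 6 positions
print(pooled_real)  # average over the 3 real tokens only
```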

In [418]:
model.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 1000, 16)          480000    
                                                                 
 global_average_pooling1d_3   (None, 16)               0         
 (GlobalAveragePooling1D)                                        
                                                                 
 dense_6 (Dense)             (None, 24)                408       
                                                                 
 dense_7 (Dense)             (None, 1)                 25        
                                                                 
Total params: 480,433
Trainable params: 480,433
Non-trainable params: 0
_________________________________________________________________
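The parameter counts in the summary can be checked by hand: an embedding layer stores one vector per vocabulary entry, and a dense layer stores one weight per input-unit pair plus one bias per unit. A short sketch of that arithmetic, with hypothetical helper names:

```python
def embedding_params(vocab_size, embedding_dim):
    """One embedding_dim-sized vector per vocabulary entry."""
    return vocab_size * embedding_dim


def dense_params(inputs, units):
    """A weight per input-unit pair plus one bias per unit."""
    return inputs * units + units


layer_params = {
    "embedding": embedding_params(30000, 16),
    "global_average_pooling1d": 0,  # pooling has no trainable weights
    "dense (24 units)": dense_params(16, 24),
    "dense (1 unit)": dense_params(24, 1),
}
total = sum(layer_params.values())

for name, count in layer_params.items():
    print(f"{name}: {count}")
print(f"total: {total}")
```

Almost all of the weights sit in the embedding table, so the vocabulary size dominates the model size.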


In [419]:
training_padded_sequences = np.array(training_padded_sequences)
testing_padded_sequences = np.array(testing_padded_sequences)

training_labels = np.array(training_labels)
testing_labels = np.array(testing_labels)

In [420]:
history = model.fit(x=training_padded_sequences,
                    y=training_labels,
                    epochs=100,
                    verbose=2,
                    batch_size=32,
                    validation_data=(testing_padded_sequences,testing_labels))

Epoch 1/100
625/625 - 3s - loss: 0.6921 - accuracy: 0.5215 - val_loss: 0.6912 - val_accuracy: 0.5249 - 3s/epoch - 5ms/step
Epoch 2/100
625/625 - 3s - loss: 0.6890 - accuracy: 0.5327 - val_loss: 0.6809 - val_accuracy: 0.5255 - 3s/epoch - 4ms/step
Epoch 3/100
625/625 - 3s - loss: 0.6465 - accuracy: 0.6625 - val_loss: 0.5907 - val_accuracy: 0.7154 - 3s/epoch - 4ms/step
Epoch 4/100
625/625 - 3s - loss: 0.5444 - accuracy: 0.7431 - val_loss: 0.4982 - val_accuracy: 0.7727 - 3s/epoch - 4ms/step
Epoch 5/100
625/625 - 3s - loss: 0.4834 - accuracy: 0.7689 - val_loss: 0.4762 - val_accuracy: 0.7742 - 3s/epoch - 4ms/step
Epoch 6/100
625/625 - 3s - loss: 0.4476 - accuracy: 0.7918 - val_loss: 0.4455 - val_accuracy: 0.7815 - 3s/epoch - 4ms/step
Epoch 7/100
625/625 - 2s - loss: 0.4311 - accuracy: 0.7955 - val_loss: 0.4313 - val_accuracy: 0.7902 - 2s/epoch - 4ms/step
Epoch 8/100
625/625 - 3s - loss: 0.4176 - accuracy: 0.8056 - val_loss: 0.4195 - val_accuracy: 0.8057 - 3s/epoch - 4ms/step
Epoch 9/100
625/

In [421]:
print(history.history)

{'loss': [0.6920522451400757, 0.6890435218811035, 0.6464536786079407, 0.5444061160087585, 0.4833928346633911, 0.4476032257080078, 0.431108295917511, 0.41759467124938965, 0.40838414430618286, 0.3994889259338379, 0.39592811465263367, 0.39318186044692993, 0.38388925790786743, 0.3862280249595642, 0.38233116269111633, 0.3771722614765167, 0.37692078948020935, 0.3774590492248535, 0.3723208010196686, 0.3762192130088806, 0.37210118770599365, 0.36847883462905884, 0.3705820143222809, 0.36803871393203735, 0.3653112053871155, 0.3695344626903534, 0.36188504099845886, 0.362967848777771, 0.3642965257167816, 0.36318060755729675, 0.36483871936798096, 0.366569846868515, 0.3596910238265991, 0.3608953356742859, 0.363165944814682, 0.3632674217224121, 0.3599258363246918, 0.35921627283096313, 0.3604520559310913, 0.3622017502784729, 0.3624069392681122, 0.35924118757247925, 0.35903212428092957, 0.35818833112716675, 0.3597429096698761, 0.3608834445476532, 0.35798317193984985, 0.35752248764038086, 0.3598664402961
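The history dictionary holds one list per metric, so the best epoch can be located with a few lines of Python. The sketch below uses only the validation values of the first eight epochs, copied from the training log in this notebook; the helper `best_epoch` is a hypothetical name.

```python
def best_epoch(history, metric, mode="max"):
    """Return (1-based epoch, value) of the best entry for metric."""
    values = history[metric]
    pick = max if mode == "max" else min
    index = pick(range(len(values)), key=values.__getitem__)
    return index + 1, values[index]


# Validation metrics for epochs 1 to 8, taken from the training log.
partial_history = {
    "val_loss": [0.6912, 0.6809, 0.5907, 0.4982, 0.4762, 0.4455, 0.4313, 0.4195],
    "val_accuracy": [0.5249, 0.5255, 0.7154, 0.7727, 0.7742, 0.7815, 0.7902, 0.8057],
}

best_acc = best_epoch(partial_history, "val_accuracy", mode="max")
best_loss = best_epoch(partial_history, "val_loss", mode="min")
print(best_acc)
print(best_loss)
```

The same function can be applied to the full `history.history` once training has finished.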

In [1]:
import matplotlib.pyplot as plt  # re-imported: the cell counter shows the kernel was restarted

# `history` comes from the model.fit cell, which must be rerun first in a fresh kernel
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12,9))

# Plotting the training and validation accuracy curves
ax1.plot(history.history['accuracy'], label = 'Training Accuracy')
ax1.plot(history.history['val_accuracy'], label = "Validation Accuracy")
ax1.set_title('Training Vs Validation Accuracy Plot')
ax1.set_xlabel('# Epochs')
ax1.set_ylabel('Accuracy')
ax1.legend()


# Plotting the training and validation loss curves
ax2.plot(history.history['loss'], label = 'Training Loss')
ax2.plot(history.history['val_loss'], label = "Validation Loss")
ax2.set_title('Training Vs Validation Loss Plot')
ax2.set_xlabel('# Epochs')
ax2.set_ylabel('Loss')
ax2.legend()

plt.show()