<a href="https://colab.research.google.com/github/vishnudas-raveendran/PGP-AIML/blob/master/NLP/Project%202/NLP_Project_2_Part_B_Sarcasm_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Project 2: Part B

The goal is to build a model to detect whether a sentence is sarcastic or not, using Bidirectional LSTMs.

## 1. Read and Explore Dataset

In [2]:
!wget https://github.com/rishabhmisra/News-Headlines-Dataset-For-Sarcasm-Detection/raw/master/Sarcasm_Headlines_Dataset.json

--2022-03-22 18:51:15--  https://github.com/rishabhmisra/News-Headlines-Dataset-For-Sarcasm-Detection/raw/master/Sarcasm_Headlines_Dataset.json
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/rishabhmisra/News-Headlines-Dataset-For-Sarcasm-Detection/master/Sarcasm_Headlines_Dataset.json [following]
--2022-03-22 18:51:16--  https://raw.githubusercontent.com/rishabhmisra/News-Headlines-Dataset-For-Sarcasm-Detection/master/Sarcasm_Headlines_Dataset.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6057046 (5.8M) [text/plain]
Saving to: ‘Sarcasm_Headlines_Dataset.json’


2022-03-22 18:51:17 (176 

In [3]:
import json

def parseJson(fname):
    with open(fname, 'r') as f:
        for line in f:
            yield json.loads(line)  # each line is one standalone JSON object

In [4]:
data = list(parseJson('./Sarcasm_Headlines_Dataset.json'))

In [1]:
import pandas as pd

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, Flatten, Bidirectional, GlobalMaxPool1D
from tensorflow.keras.models import Model, Sequential
from sklearn.model_selection import train_test_split
import numpy as np
import pickle
import warnings
import logging
logging.basicConfig(level=logging.INFO)


In [5]:
df = pd.DataFrame(data)

In [12]:
df = pd.read_json("Sarcasm_Headlines_Dataset.json", lines=True)

In [6]:
df.shape

(28619, 3)

In [7]:
df.head()

| | is_sarcastic | headline | article_link |
|---|---|---|---|
| 0 | 1 | thirtysomething scientists unveil doomsday clo... | https://www.theonion.com/thirtysomething-scien... |
| 1 | 0 | dem rep. totally nails why congress is falling... | https://www.huffingtonpost.com/entry/donna-edw... |
| 2 | 0 | eat your veggies: 9 deliciously different recipes | https://www.huffingtonpost.com/entry/eat-your-... |
| 3 | 1 | inclement weather prevents liar from getting t... | https://local.theonion.com/inclement-weather-p... |
| 4 | 1 | mother comes pretty close to using word 'strea... | https://www.theonion.com/mother-comes-pretty-c... |


## 2. Retain relevant columns

Since only the headline text is used to detect sarcasm, the `article_link` column can be dropped from the dataframe.

In [6]:
df = df.drop(columns=['article_link'])

In [9]:
df.head()

| | is_sarcastic | headline |
|---|---|---|
| 0 | 1 | thirtysomething scientists unveil doomsday clo... |
| 1 | 0 | dem rep. totally nails why congress is falling... |
| 2 | 0 | eat your veggies: 9 deliciously different recipes |
| 3 | 1 | inclement weather prevents liar from getting t... |
| 4 | 1 | mother comes pretty close to using word 'strea... |


## 3. Get length of each sentence

In [10]:
length = df['headline'].apply(len)  # length in characters, not words

In [11]:
length

0        61
1        79
2        49
3        52
4        61
         ..
28614    44
28615    87
28616    71
28617    61
28618    34
Name: headline, Length: 28619, dtype: int64

In [12]:
print(f"\nHeadline Characteristics: \n\nMin length: {length.min()}\nMax Length: {length.max()}\nStdDev: {length.std()}\nMean: {length.mean()}")


Headline Characteristics: 

Min length: 7
Max Length: 926
StdDev: 20.726483379171803
Mean: 62.30857122890387
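The statistics above are character counts, whereas the `MAX_SEQUENCE_LENGTH` parameter defined in the next section counts tokens. A minimal sketch of how token counts could be summarised to sanity-check that cutoff is shown here. The helper name `token_length_summary`, the whitespace tokenisation, and the toy headlines are assumptions for illustration and are not part of the original notebook.

```python
import math

def token_length_summary(headlines, percentile=0.99):
    """Return (max, mean, percentile cutoff) of whitespace token counts."""
    counts = sorted(len(h.split()) for h in headlines)
    # Nearest-rank percentile: smallest count that covers `percentile` of the headlines.
    rank = max(1, math.ceil(percentile * len(counts)))
    return counts[-1], sum(counts) / len(counts), counts[rank - 1]

# Toy data for illustration only; in the notebook this would be df['headline'].
sample = [
    "eat your veggies: 9 deliciously different recipes",
    "new report says nothing",
    "area man wins",
]
longest, mean, cutoff = token_length_summary(sample)
print(longest, round(mean, 2), cutoff)
```

If the percentile cutoff on the real data is far below 100, a shorter `MAX_SEQUENCE_LENGTH` would likely lose little information while reducing padding.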


## 4. Define Parameters

In [18]:
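MAX_NB_WORDS = 10000          # tokenizer keeps only the most frequent words
MAX_SEQUENCE_LENGTH = 100     # headlines are padded/truncated to this many tokens
TOKENIZER_MODEL_FILE = "tokenizer.pkl"
RANDOM_SEED = 42
word_index = None             # populated after the tokenizer is fitted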

## 5. Get Indices for words

In [26]:
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(df['headline'])

In [24]:
tokenizer.word_index

{'to': 1,
 'of': 2,
 'the': 3,
 'in': 4,
 'for': 5,
 'a': 6,
 'on': 7,
 'and': 8,
 'with': 9,
 'is': 10,
 'new': 11,
 'trump': 12,
 'man': 13,
 'at': 14,
 'from': 15,
 'about': 16,
 'by': 17,
 'after': 18,
 'you': 19,
 'this': 20,
 'out': 21,
 'up': 22,
 'be': 23,
 'as': 24,
 'that': 25,
 'it': 26,
 'how': 27,
 'not': 28,
 'he': 29,
 'his': 30,
 'are': 31,
 'your': 32,
 'just': 33,
 'what': 34,
 'all': 35,
 'who': 36,
 'has': 37,
 'will': 38,
 'report': 39,
 'into': 40,
 'more': 41,
 'one': 42,
 'have': 43,
 'year': 44,
 'over': 45,
 'why': 46,
 'day': 47,
 'u': 48,
 'area': 49,
 'woman': 50,
 'can': 51,
 's': 52,
 'says': 53,
 'donald': 54,
 'time': 55,
 'first': 56,
 'like': 57,
 'no': 58,
 'her': 59,
 'get': 60,
 'off': 61,
 'old': 62,
 "trump's": 63,
 'life': 64,
 'now': 65,
 'people': 66,
 "'": 67,
 'an': 68,
 'house': 69,
 'still': 70,
 'obama': 71,
 'white': 72,
 'back': 73,
 'make': 74,
 'was': 75,
 'than': 76,
 'women': 77,
 'if': 78,
 'down': 79,
 'when': 80,
 'i': 81,
 'my':

In [27]:
# Save the fitted tokenizer so the same word indices can be reused at prediction time
with open(TOKENIZER_MODEL_FILE, 'wb') as f:
    pickle.dump(tokenizer, f)
sequences = tokenizer.texts_to_sequences(df['headline'])
text = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
print(text.shape)

(28619, 100)
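To make the two steps above more concrete, the following pure-Python sketch mimics what the tokenizer and `pad_sequences` do with their default settings: words are ranked by frequency starting at index 1, words ranked at or beyond `num_words` are dropped, and sequences are truncated and zero-padded at the front. It is a simplified illustration, not the Keras implementation. The helper names `build_word_index` and `to_padded` are hypothetical, and the sketch splits on whitespace only, whereas the Keras `Tokenizer` also filters punctuation.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent, 0 is reserved for padding."""
    counts = Counter(word for t in texts for word in t.lower().split())
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

def to_padded(texts, word_index, num_words, maxlen):
    """Convert texts to fixed-length index lists (front truncation, front zero padding)."""
    rows = []
    for t in texts:
        seq = [word_index[w] for w in t.lower().split()
               if w in word_index and word_index[w] < num_words]
        seq = seq[-maxlen:]                              # keep the last `maxlen` tokens
        rows.append([0] * (maxlen - len(seq)) + seq)     # pad on the left with zeros
    return rows

toy_texts = ["the man saw the dog", "the dog ran"]
toy_index = build_word_index(toy_texts)
toy_padded = to_padded(toy_texts, toy_index, num_words=4, maxlen=3)
print(toy_index)
print(toy_padded)
```

One consequence worth noting: although `word_index` lists every word seen, only indices below `num_words` ever appear in the padded sequences.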


In [38]:
word_index = tokenizer.word_index

## 6. Create features and labels

In [33]:
X = text  #headline
y = df['is_sarcastic']  #label

Split into training (70%) and testing (30%) sets

In [35]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=RANDOM_SEED)

In [36]:
print(f"Training size: {y_train.shape[0]} Testing size: {y_test.shape[0]}")

Training size: 20033 Testing size: 8586


## 7. Vocabulary Size

In [34]:
print('Found %s unique tokens.' % len(tokenizer.word_index))

Found 30884 unique tokens.


## 8. Create weight matrix using GloVe embedding

In [37]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove*.zip

--2022-03-22 19:13:11--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2022-03-22 19:13:12--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2022-03-22 19:13:12--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2022-0

In [49]:
def get_embedding_dictionary():
  embeddings_dict = {}
  # Each line holds a word followed by its 50 vector components
  with open("glove.6B.50d.txt", encoding="utf8") as f:
    for line in f:
      values = line.split()
      word = values[0]
      try:
          coefs = np.asarray(values[1:], dtype='float32')
      except ValueError:
          # Skip malformed lines instead of reusing the previous word's vector
          continue
      embeddings_dict[word] = coefs
  print('Total %s word vectors.' % len(embeddings_dict))
  return embeddings_dict

In [55]:
def make_embedding_matrix(word_index, glove_dict, EMBEDDING_DIM=50):
  # Rows start as random values in [0, 1); words missing from GloVe keep them
  embedding_matrix = np.random.random((len(word_index) + 1, EMBEDDING_DIM))
  for word, i in word_index.items():
    embedding_vector = glove_dict.get(word)
    if embedding_vector is not None:
        if len(embedding_vector) != EMBEDDING_DIM:
            raise ValueError(
                f"GloVe vector for {word!r} has length {len(embedding_vector)}, "
                f"expected EMBEDDING_DIM={EMBEDDING_DIM}")
        embedding_matrix[i] = embedding_vector
  return embedding_matrix

In [52]:
glove_dict = get_embedding_dictionary()


Total 400000 word vectors.


In [56]:
EMBEDDING_DIM = 50
embedding_matrix = make_embedding_matrix(word_index, glove_dict, EMBEDDING_DIM)




Weight Matrix

In [57]:
embedding_matrix.shape

(30885, 50)
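The matrix has one row per entry in `word_index` (plus the padding row), but the tokenizer was created with `num_words=MAX_NB_WORDS`, so indices of 10000 and above never appear in the padded sequences. The sketch below shows one possible way to cap the matrix at the indices that are actually used and to report how many of those words have a GloVe vector. The function name `make_capped_matrix`, the small uniform initialisation for missing words, and the toy vectors are assumptions for illustration, not the notebook's method.

```python
import numpy as np

def make_capped_matrix(word_index, vectors, num_words, dim, seed=42):
    """Build an embedding matrix for indices < num_words and return it with the coverage ratio."""
    rng = np.random.default_rng(seed)
    n_rows = min(num_words, len(word_index) + 1)
    # Words without a pretrained vector keep a small random initialisation.
    matrix = rng.uniform(-0.05, 0.05, size=(n_rows, dim)).astype("float32")
    matrix[0] = 0.0  # index 0 is the padding token
    hits = 0
    for word, i in word_index.items():
        if i >= n_rows:
            continue  # this index is never emitted by the tokenizer
        vec = vectors.get(word)
        if vec is not None:
            matrix[i] = vec
            hits += 1
    return matrix, hits / max(n_rows - 1, 1)

# Toy inputs for illustration only.
toy_word_index = {"the": 1, "dog": 2, "zzzq": 3, "ran": 4}
toy_vectors = {"the": [0.1, 0.2], "dog": [0.3, 0.4], "ran": [0.5, 0.6]}
toy_matrix, toy_coverage = make_capped_matrix(toy_word_index, toy_vectors, num_words=4, dim=2)
print(toy_matrix.shape, round(toy_coverage, 2))
```

With a capped matrix, the first argument of the `Embedding` layer would be the number of rows in the matrix rather than `len(word_index) + 1`.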

## 9. Define and Compile Bidirectional LSTM Model

In [58]:
model = Sequential()
# Embedding layer initialised with the GloVe weights and fine-tuned during training
model.add(Embedding(len(word_index) + 1,
                    EMBEDDING_DIM,
                    weights=[embedding_matrix],
                    input_length=MAX_SEQUENCE_LENGTH,
                    trainable=True))
# Two stacked bidirectional LSTM layers, each followed by dropout
model.add(Bidirectional(LSTM(32, return_sequences=True, recurrent_dropout=0.2)))
model.add(Dropout(0.5))
model.add(Bidirectional(LSTM(32, return_sequences=True, recurrent_dropout=0.2)))
model.add(Dropout(0.5))
# Sigmoid output for the binary sarcastic / not sarcastic label
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
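Because the last LSTM layer above uses `return_sequences=True`, the final `Dense(1)` layer is applied at every timestep, so the model output has shape `(batch, 100, 1)` rather than one value per headline. A possible variant that pools over the time axis with `GlobalMaxPool1D` (already imported above) to produce a single probability per headline is sketched below. This is an illustrative alternative under stated assumptions, not the model that produced the results reported in this notebook. The function name `build_single_output_model` and the toy matrix are hypothetical, and the weights are passed through `embeddings_initializer` with an explicit `Input` layer instead of the `weights=` and `input_length=` arguments.

```python
import numpy as np
from tensorflow.keras import Input, Sequential
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import (Bidirectional, Dense, Dropout,
                                     Embedding, GlobalMaxPool1D, LSTM)

def build_single_output_model(embedding_matrix, max_len):
    """Bidirectional LSTM classifier that emits one probability per input sequence."""
    vocab_size, embedding_dim = embedding_matrix.shape
    model = Sequential([
        Input(shape=(max_len,)),
        Embedding(vocab_size, embedding_dim,
                  embeddings_initializer=Constant(embedding_matrix),
                  trainable=True),
        Bidirectional(LSTM(32, return_sequences=True, recurrent_dropout=0.2)),
        Dropout(0.5),
        Bidirectional(LSTM(32, return_sequences=True, recurrent_dropout=0.2)),
        GlobalMaxPool1D(),   # collapse the time axis: (batch, max_len, 64) -> (batch, 64)
        Dropout(0.5),
        Dense(1, activation='sigmoid'),
    ])
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# Toy embedding matrix for illustration only (20 words, 8 dimensions).
toy_matrix = np.random.random((20, 8)).astype("float32")
toy_model = build_single_output_model(toy_matrix, max_len=10)
print(toy_model.output_shape)
```

In the notebook this would be called as `build_single_output_model(embedding_matrix, MAX_SEQUENCE_LENGTH)`. Accuracy would need to be re-measured, since the 83% figure below applies to the original architecture.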

## 10. Fit model and check validation accuracy

In [60]:
model.fit(X_train, y_train, batch_size=256, epochs=10, validation_data=(X_test, y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f02bca5f350>

In [61]:
score,acc = model.evaluate(X_test, y_test, verbose = 2, batch_size = 32)
print("score: %.2f" % (score))
print("acc: %.2f" % (acc))

269/269 - 12s - loss: 0.4148 - accuracy: 0.8321 - 12s/epoch - 44ms/step
score: 0.41
acc: 0.83


The model reaches an accuracy of 83% in detecting sarcasm on the held-out test set.
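Accuracy alone does not show whether errors are mostly missed sarcasm or false alarms. As a rough sketch, precision and recall could be derived from predicted probabilities as follows. The helper `binary_metrics`, the 0.5 threshold, and the toy labels and probabilities are assumptions for illustration and are not results from this notebook.

```python
def binary_metrics(y_true, y_prob, threshold=0.5):
    """Compute accuracy, precision and recall for binary labels and predicted probabilities."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Toy labels and probabilities for illustration only.
toy_true = [1, 0, 1, 1, 0, 0]
toy_prob = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1]
toy_metrics = binary_metrics(toy_true, toy_prob)
print(toy_metrics)
```

For a model that outputs one probability per headline, `y_prob` would come from flattening the result of `model.predict(X_test)`.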