<img src="http://drive.google.com/uc?export=view&id=1tpOCamr9aWz817atPnyXus8w5gJ3mIts" width=500px>

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.

### Package Version:
- tensorflow==2.2.0
- pandas==1.0.5
- numpy==1.18.5
- google==2.0.3

In [None]:
# All import statements.
import tensorflow as tf
import os
import math
import numpy as np
import pandas as pd
from scipy import stats
from sklearn import metrics
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dense
from tensorflow.keras.layers import LSTM, Bidirectional, Embedding, Dropout, Flatten
from tensorflow.keras import callbacks
from tensorflow.keras import optimizers
from sklearn.metrics import mean_squared_error
import json
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

# Sarcasm Detection

### Dataset

#### Acknowledgement
Misra, Rishabh, and Prahal Arora. "Sarcasm Detection using Hybrid Neural Network." arXiv preprint arXiv:1908.07414 (2019).

**The required files are available at the link below.**

https://drive.google.com/drive/folders/1xUnF35naPGU63xwRDVGc-DkZ3M8V5mMk

### Load Data 

In [None]:
# Run this cell to mount Google Drive if you are using Google Colab
from google.colab import drive
drive.mount('/content/drive')
project_path = '/content/drive/My Drive/assignments/'

Mounted at /content/drive


In [None]:
f = open(project_path+'Sarcasm_Headlines_Dataset.json', "r")

In [None]:
dataArr = []

In [None]:
for val in f.readlines():
  dataArr.append(json.loads(val))

In [None]:
data = pd.DataFrame(dataArr)

In [None]:
data.shape

(26709, 3)

There are 26709 records 
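
The file is in JSON Lines format (one JSON object per line), which is why the cells above parse it line by line. Below is a minimal sketch of the same step that iterates over an open handle, so it can sit inside a `with` block that closes the file. `load_json_lines` is a hypothetical helper name, and the two-record sample is made up to stand in for the real file.

```python
import io
import json

def load_json_lines(handle):
    """Parse one JSON object per non-empty line of an open text handle."""
    return [json.loads(line) for line in handle if line.strip()]

# In-memory stand-in for Sarcasm_Headlines_Dataset.json (records are made up).
sample = io.StringIO(
    '{"article_link": "https://example.com/a", "headline": "first headline", "is_sarcastic": 0}\n'
    '{"article_link": "https://example.com/b", "headline": "second headline", "is_sarcastic": 1}\n'
)
records = load_json_lines(sample)
print(len(records), records[1]["is_sarcastic"])

# With the real file, the handle would come from a `with` block, e.g.:
# with open(project_path + 'Sarcasm_Headlines_Dataset.json', 'r') as f:
#     dataArr = load_json_lines(f)
```

`pd.read_json(path, lines=True)` is another way to read this format directly into a DataFrame.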

In [None]:
data.head()

Unnamed: 0,article_link,headline,is_sarcastic
0,https://www.huffingtonpost.com/entry/versace-b...,former versace store clerk sues over secret 'b...,0
1,https://www.huffingtonpost.com/entry/roseanne-...,the 'roseanne' revival catches up to our thorn...,0
2,https://local.theonion.com/mom-starting-to-fea...,mom starting to fear son's web series closest ...,1
3,https://politics.theonion.com/boehner-just-wan...,"boehner just wants wife to listen, not come up...",1
4,https://www.huffingtonpost.com/entry/jk-rowlin...,j.k. rowling wishes snape happy birthday in th...,0


### Drop `article_link` from dataset 

In [None]:
data.drop('article_link', axis=1, inplace=True)

In [None]:
data.shape

(26709, 2)

### Get length of each headline and add a column for that 

In [None]:
# Character length of each headline
headline_len = [len(headline) for headline in data.headline]

In [None]:
data['headline_len'] = headline_len

In [None]:
data.shape

(26709, 3)

In [None]:
data.head()

Unnamed: 0,headline,is_sarcastic,headline_len
0,former versace store clerk sues over secret 'b...,0,78
1,the 'roseanne' revival catches up to our thorn...,0,84
2,mom starting to fear son's web series closest ...,1,79
3,"boehner just wants wife to listen, not come up...",1,84
4,j.k. rowling wishes snape happy birthday in th...,0,64


The full dataset is too large to train on in Colab without the session crashing, so only the first 5000 records are used to train the model.

In [None]:
# Stashing the remaining records, which can be used for testing
stachedData = data.iloc[5001:]

In [None]:
# Taking first 5000 records
data = data.iloc[:5000]

In [None]:
data.shape

(5000, 3)
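
The two slices above do not quite partition the data: `iloc[:5000]` keeps positions 0 to 4999 and `iloc[5001:]` starts at position 5001, so the row at position 5000 lands in neither set. A small sketch with a plain list of positions (standing in for the DataFrame rows) makes this visible.

```python
n_rows = 26709                     # number of records reported by data.shape
positions = list(range(n_rows))    # stand-in for DataFrame row positions

train = positions[:5000]           # same bounds as data.iloc[:5000]
stashed = positions[5001:]         # same bounds as data.iloc[5001:]

# Positions that appear in neither slice
missing = sorted(set(positions) - set(train) - set(stashed))
print(len(train), len(stashed), missing)   # 5000 21708 [5000]
```

Slicing the stash with `iloc[5000:]` would include that row. The outputs in this notebook were produced with `5001`.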

### Initialize parameter values
- Set values for max_features, maxlen, & embedding_size
- max_features: Number of words to take from tokenizer(most frequent words)
- maxlen: Maximum length of each sentence to be limited to 25
- embedding_size: size of embedding vector

In [None]:
max_features = 10000
maxlen = 25
embedding_size = 200

### Apply `tensorflow.keras` Tokenizer and get indices for words 
- Initialize Tokenizer object with number of words as 10000
- Fit the tokenizer object on headline column
- Convert the text to sequence
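
To make the steps above concrete, here is a simplified pure-Python sketch of the idea behind the tokenizer: rank words by frequency, give the most frequent word index 1, then replace each word by its index. It is an illustration only, not the Keras implementation, and it skips the punctuation filtering that `Tokenizer` applies by default. `fit_word_index` and `texts_to_ids` are hypothetical names, and the headlines are made up.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

def texts_to_ids(texts, word_index, num_words=None):
    """Map words to indices, dropping unknown words and indices >= num_words."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences

toy_headlines = ["the cat sat", "the dog sat", "the bird"]   # made-up examples
word_index = fit_word_index(toy_headlines)
print(word_index)   # {'the': 1, 'sat': 2, 'cat': 3, 'dog': 4, 'bird': 5}
print(texts_to_ids(toy_headlines, word_index, num_words=4))   # [[1, 3, 2], [1, 2], [1]]
```

With `num_words=4` only indices 1 to 3 survive in the sequences, while the word index still lists every word. The same holds for `Tokenizer(num_words=10000)`: the sequences only contain indices below 10000, even though `tokenizer.word_index` is larger.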


In [None]:
tokenizer = Tokenizer(num_words=max_features, split=' ')

In [None]:
tokenizer.fit_on_texts(data.headline)
headline_seq = tokenizer.texts_to_sequences(data.headline.values)

### Pad sequences 
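- Pad each example to a common length; `pad_sequences` is called without `maxlen` here, so every sequence is padded to the length of the longest headline (26 tokens, as the shape below shows) rather than to `maxlen`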
- Convert target column into numpy array
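
By default `pad_sequences` pads with zeros at the front, truncates from the front, and uses the longest sequence as the target length when `maxlen` is not given. The sketch below reproduces that default behaviour in plain Python for illustration. `pad_pre` is a hypothetical name and the token ids are made up.

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad (and left-truncate) integer sequences to a common length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:] if maxlen > 0 else []   # keep the last maxlen ids
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

toy_sequences = [[5, 8, 2], [7]]          # made-up token ids
print(pad_pre(toy_sequences))             # [[5, 8, 2], [0, 0, 7]]
print(pad_pre(toy_sequences, maxlen=2))   # [[8, 2], [0, 7]]
```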

In [None]:
headline_seq = pad_sequences(headline_seq)

In [None]:
headline_seq.shape

(5000, 26)

### Vocab mapping
- Index 0 is reserved for padding, so no word is mapped to it

In [None]:
tokenizer.word_index

{'to': 1,
 'of': 2,
 'the': 3,
 'in': 4,
 'for': 5,
 'a': 6,
 'on': 7,
 'and': 8,
 'with': 9,
 'is': 10,
 'new': 11,
 'trump': 12,
 'man': 13,
 'from': 14,
 'at': 15,
 'you': 16,
 'about': 17,
 'after': 18,
 'out': 19,
 'by': 20,
 'this': 21,
 'it': 22,
 'up': 23,
 'be': 24,
 'that': 25,
 'as': 26,
 'how': 27,
 'not': 28,
 'are': 29,
 'what': 30,
 'all': 31,
 'your': 32,
 'will': 33,
 'who': 34,
 'just': 35,
 'has': 36,
 'over': 37,
 'his': 38,
 'he': 39,
 'year': 40,
 'one': 41,
 'have': 42,
 'u': 43,
 'into': 44,
 'more': 45,
 'first': 46,
 'time': 47,
 'woman': 48,
 'old': 49,
 'day': 50,
 'area': 51,
 'no': 52,
 's': 53,
 'says': 54,
 'why': 55,
 'an': 56,
 'donald': 57,
 'report': 58,
 'her': 59,
 'can': 60,
 'get': 61,
 'off': 62,
 'obama': 63,
 'still': 64,
 'like': 65,
 'make': 66,
 'people': 67,
 "trump's": 68,
 'women': 69,
 'life': 70,
 'their': 71,
 'when': 72,
 'my': 73,
 'family': 74,
 'way': 75,
 'back': 76,
 'i': 77,
 'they': 78,
 'do': 79,
 'white': 80,
 'now': 81,
 'c

In [None]:
len(tokenizer.word_index)

12052

### Set number of words
- Since the above 0th index doesn't have a word, add 1 to the length of the vocabulary

In [None]:
num_words = len(tokenizer.word_index) + 1
print(num_words)

12053


### Load GloVe Word Embeddings



### Create embedding matrix

In [None]:
EMBEDDING_FILE = 'glove.6B.200d.txt'

# Map each GloVe word to its 200-dimensional vector
embeddings = {}
with open(project_path + EMBEDDING_FILE, encoding='utf-8') as glove_file:
    for line in glove_file:
        values = line.rstrip().split(" ")
        embeddings[values[0]] = np.asarray(values[1:], dtype='float32')

# Create a weight matrix for the words in the training headlines;
# words without a GloVe vector keep an all-zero row
embedding_matrix = np.zeros((num_words, embedding_size))

for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
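
Words that have no GloVe vector keep an all-zero row in `embedding_matrix`, so it can be worth checking how much of the vocabulary is actually covered. The following is a minimal sketch of the line parsing and a coverage check in plain Python. `parse_glove_line` and `vocabulary_coverage` are hypothetical helper names, and the 3-dimensional vectors and tiny vocabulary are made up for illustration.

```python
import io

def parse_glove_line(line):
    """Split 'word v1 v2 ...' into the word and a list of floats."""
    parts = line.rstrip().split(" ")
    return parts[0], [float(x) for x in parts[1:]]

def vocabulary_coverage(word_index, embeddings):
    """Fraction of vocabulary words that have a pre-trained vector."""
    found = sum(1 for word in word_index if word in embeddings)
    return found / len(word_index)

# Made-up 3-dimensional vectors standing in for glove.6B.200d.txt.
toy_glove = io.StringIO("the 0.1 0.2 0.3\ntrump -0.4 0.5 0.6\n")
toy_embeddings = dict(parse_glove_line(line) for line in toy_glove)

toy_word_index = {"the": 1, "trump": 2, "trump's": 3}   # made-up vocabulary
coverage = vocabulary_coverage(toy_word_index, toy_embeddings)
print(toy_embeddings["trump"])   # [-0.4, 0.5, 0.6]
print(coverage)
```

In the notebook, the same check could be run as `vocabulary_coverage(tokenizer.word_index, embeddings)`.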

### Define model 
- Hint: Use Sequential model instance and then add Embedding layer, Bidirectional(LSTM) layer, flatten it, then dense and dropout layers as required. 
In the end add a final dense layer with sigmoid activation for binary classification.

In [None]:
model = Sequential()

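# Embedding layer. As written, it learns a new embedding of width num_words (12053)
# for token ids below max_features; the GloVe embedding_matrix is not passed in (weights=None).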
model.add(Embedding(input_dim=max_features, output_dim=num_words, input_length = headline_seq.shape[1], weights=None, trainable=True))

forward_layer = LSTM(64, return_sequences=True)
backward_layer = LSTM(64, activation='relu', return_sequences=True,
                       go_backwards=True)

# Recurrent layer Bidirectional LSTM
model.add(Bidirectional(forward_layer, backward_layer=backward_layer, input_shape=(num_words, 1)))

# Flatten the model
model.add(Flatten())

# Fully connected layer
model.add(Dense(64, activation='relu'))

# Dropout for regularization
model.add(Dropout(0.5))

# Output layer for 2 class classification
model.add(Dense(2, activation='sigmoid'))



### Compile the model 

In [None]:
# Compile the model
model.compile(
    optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

In [None]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 26, 12053)         120530000 
_________________________________________________________________
bidirectional (Bidirectional (None, 26, 128)           6204416   
_________________________________________________________________
flatten (Flatten)            (None, 3328)              0         
_________________________________________________________________
dense (Dense)                (None, 64)                213056    
_________________________________________________________________
dropout (Dropout)            (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 130       
Total params: 126,947,602
Trainable params: 126,947,602
Non-trainable params: 0
__________________________________________
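
The parameter counts in the summary can be reproduced by hand with the standard formulas for embedding, LSTM, and dense layers. The sketch below does that arithmetic and, for comparison only, repeats it for a hypothetical configuration with one 200-dimensional row per vocabulary entry. That alternative is an assumption used for illustration, not a model trained in this notebook.

```python
def embedding_params(input_dim, output_dim):
    # One output_dim-wide row per input index
    return input_dim * output_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

def total_params(vocab_rows, embed_dim, seq_len=26, lstm_units=64):
    return (embedding_params(vocab_rows, embed_dim)
            + 2 * lstm_params(embed_dim, lstm_units)       # forward + backward
            + dense_params(seq_len * 2 * lstm_units, 64)   # after Flatten
            + dense_params(64, 2))

# As defined above: input_dim=max_features, output_dim=num_words
print(total_params(10000, 12053))   # 126947602
# Hypothetical alternative: one 200-d row per vocabulary entry
print(total_params(12053, 200))     # 2759466
```

Almost all of the 126,947,602 parameters sit in the embedding layer (120,530,000), because its output width is the vocabulary size rather than `embedding_size`.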

### Fit the model 

In [None]:
X = headline_seq
Y = pd.get_dummies(data.is_sarcastic).values

# Splitting the data into training and testing sets in a 70:30 ratio.
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.30, random_state = 42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

(3500, 26) (3500, 2)
(1500, 26) (1500, 2)
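
The training split has 3500 rows. With the batch size of 32 used in the next cell, the number of steps per epoch follows from a ceiling division, which is where the `110/110` in the training log comes from. A quick sketch of that arithmetic:

```python
import math

n_train = 3500      # rows in X_train, from the shapes printed above
batch_size = 32     # value used in the training cell

steps_per_epoch = math.ceil(n_train / batch_size)
last_batch = n_train - (steps_per_epoch - 1) * batch_size   # size of the final, partial batch

print(steps_per_epoch, last_batch)   # 110 12
```

Since `fit` is not given any validation data, the loss and accuracy in the training log are measured on these 3500 training rows only.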


In [None]:
batch_size = 32
model.fit(X_train, Y_train, epochs = 10, batch_size=batch_size, verbose = 2)

Epoch 1/10
110/110 - 132s - loss: 0.5193 - accuracy: 0.7286
Epoch 2/10
110/110 - 121s - loss: 0.1570 - accuracy: 0.9466
Epoch 3/10
110/110 - 122s - loss: 0.0282 - accuracy: 0.9923
Epoch 4/10
110/110 - 122s - loss: 0.0057 - accuracy: 0.9980
Epoch 5/10
110/110 - 121s - loss: 0.0017 - accuracy: 0.9994
Epoch 6/10
110/110 - 122s - loss: 0.0029 - accuracy: 0.9989
Epoch 7/10
110/110 - 122s - loss: 0.0086 - accuracy: 0.9980
Epoch 8/10
110/110 - 121s - loss: 0.0199 - accuracy: 0.9969
Epoch 9/10
110/110 - 122s - loss: 0.0073 - accuracy: 0.9989
Epoch 10/10
110/110 - 122s - loss: 0.0070 - accuracy: 0.9991


<tensorflow.python.keras.callbacks.History at 0x7f1b20053208>

In [None]:
# Predict whether the headline in a single-row DataFrame is sarcastic
def predictSarcasm(val):
  testHeadline = [val.headline.values[0]]
  # Vectorize the headline with the already-fitted tokenizer instance
  testHeadline = tokenizer.texts_to_sequences(testHeadline)
  # Pad the headline to the same length (26) as the sequences used for training
  testHeadline = pad_sequences(testHeadline, maxlen=26, dtype='int32', value=0)
  sentiment = model.predict(testHeadline, batch_size=1, verbose = 2)[0]
  print(val)

  # Original sarcasm value in the record
  if(val.is_sarcastic.values[0] == 0):
      print("\n Original non-sarcastic")
  elif (val.is_sarcastic.values[0] == 1):
      print("\n Original sarcastic")

  # Predicted sarcasm value
  if(np.argmax(sentiment) == 0):
      print("\n Predicted non-sarcastic")
  elif (np.argmax(sentiment) == 1):
      print("\n Predicted sarcastic")

Predicting values for headlines from the stashed data

In [None]:
predictSarcasm(stachedData.iloc[0:1])

1/1 - 0s
                                               headline  ...  headline_len
5001  museum staff braces for large group wearing sa...  ...            56

[1 rows x 3 columns]

 Original sarcastic

 Predicted sarcastic


In [None]:
predictSarcasm(stachedData.iloc[1:2])

1/1 - 0s
                                               headline  ...  headline_len
5002  relieved scott walker narrowly avoids acknowle...  ...            95

[1 rows x 3 columns]

 Original sarcastic

 Predicted non-sarcastic


In [None]:
predictSarcasm(stachedData.iloc[2:3])

1/1 - 0s
                                               headline  ...  headline_len
5003  beyond the classroom: experiencing technology ...  ...            86

[1 rows x 3 columns]

 Original non-sarcastic

 Predicted non-sarcastic


In [None]:
predictSarcasm(stachedData.iloc[3:4])

1/1 - 0s
                                               headline  ...  headline_len
5004  north carolina elects someone to run out for c...  ...            55

[1 rows x 3 columns]

 Original sarcastic

 Predicted sarcastic


In [None]:
predictSarcasm(stachedData.iloc[4:5])

1/1 - 0s
                                               headline  ...  headline_len
5005  the justice department pledge to prosecute whi...  ...            95

[1 rows x 3 columns]

 Original non-sarcastic

 Predicted non-sarcastic


In [None]:
predictSarcasm(stachedData.iloc[5:6])

1/1 - 0s
                                               headline  ...  headline_len
5006  death penalty and redemption: thoughts on tsar...  ...            76

[1 rows x 3 columns]

 Original non-sarcastic

 Predicted non-sarcastic


As the examples show, the model does reasonably well for one trained on so few records: it classifies five of the six stashed headlines above correctly. Training on more data and for more epochs could yield better results.
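
Six spot checks are a small sample, so a single accuracy figure over many headlines is a more reliable summary. The sketch below shows the calculation in plain Python, using the labels read off the six outputs above. `argmax` and `accuracy` are hypothetical helper names, and the two-column scores are made-up values of the kind `model.predict` returns.

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the two label lists agree."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

# Labels read off the six predictSarcasm outputs above (1 = sarcastic)
original = [1, 1, 0, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 0]
print(accuracy(original, predicted))   # 0.8333333333333334

# Two-column scores can be decoded to class labels with argmax
toy_scores = [[0.9, 0.2], [0.1, 0.7]]        # made-up values
print([argmax(row) for row in toy_scores])   # [0, 1]
```

The same function could be applied to the decoded predictions for `X_test`, which the notebook splits off but does not evaluate.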