# Sarcasm Detection
 **Acknowledgement**

Misra, Rishabh, and Prahal Arora. "Sarcasm Detection using Hybrid Neural Network." arXiv preprint arXiv:1908.07414 (2019).

**Required Files given in below link.**

https://drive.google.com/drive/folders/1xUnF35naPGU63xwRDVGc-DkZ3M8V5mMk

## Install `Tensorflow2.0` 

In [1]:
!!pip uninstall tensorflow
!pip install tensorflow==2.0.0

Collecting tensorflow==2.0.0
  Using cached https://files.pythonhosted.org/packages/46/0f/7bd55361168bb32796b360ad15a25de6966c9c1beb58a8e30c01c8279862/tensorflow-2.0.0-cp36-cp36m-manylinux2010_x86_64.whl
Installing collected packages: tensorflow
Successfully installed tensorflow-2.0.0


## Get Required Files from Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

In [4]:
#Set your project path 
project_path =  '/content/drive/My Drive/Colab Notebooks/NLP/'

#**## Reading and Exploring Data**

## Read Data "Sarcasm_Headlines_Dataset.json". Explore the data and get  some insights about the data. ( 4 marks)
Hint - As its in json format you need to use pandas.read_json function. Give paraemeter lines = True.

In [5]:
import pandas as pd
df = pd.read_json(project_path + 'Sarcasm_Headlines_Dataset.json', lines=True)
print(df.head())
print(df.shape)

#Lets see how many sarcastic posts we have
dfs = df["is_sarcastic"] == 1
dfns = df["is_sarcastic"] == 0
print(df[dfs].shape)
print(df[dfns].shape)

                                        article_link  ... is_sarcastic
0  https://www.huffingtonpost.com/entry/versace-b...  ...            0
1  https://www.huffingtonpost.com/entry/roseanne-...  ...            0
2  https://local.theonion.com/mom-starting-to-fea...  ...            1
3  https://politics.theonion.com/boehner-just-wan...  ...            1
4  https://www.huffingtonpost.com/entry/jk-rowlin...  ...            0

[5 rows x 3 columns]
(26709, 3)
(11724, 3)
(14985, 3)


## Drop `article_link` from dataset. ( 2 marks)
As we only need headline text data and is_sarcastic column for this project. We can drop artical link column here.

In [6]:
df.drop(["article_link"], axis=1, inplace=True)
df.head()

Unnamed: 0,headline,is_sarcastic
0,former versace store clerk sues over secret 'b...,0
1,the 'roseanne' revival catches up to our thorn...,0
2,mom starting to fear son's web series closest ...,1
3,"boehner just wants wife to listen, not come up...",1
4,j.k. rowling wishes snape happy birthday in th...,0


## Get the Length of each line and find the maximum length. ( 4 marks)
As different lines are of different length. We need to pad the our sequences using the max length.

In [7]:
df.headline.map(len).max()

254

#**## Modelling**

## Import required modules required for modelling.

In [8]:
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, Flatten, Bidirectional, GlobalMaxPool1D, GRU
from tensorflow.keras.models import Model, Sequential

# Set Different Parameters for the model. ( 2 marks)

In [9]:
max_features = 10000
maxlen = 254
embedding_size = 200

## Apply Keras Tokenizer of headline column of your data.  ( 4 marks)
Hint - First create a tokenizer instance using Tokenizer(num_words=max_features) 
And then fit this tokenizer instance on your data column df['headline'] using .fit_on_texts()

In [10]:
tokenizer = Tokenizer(num_words=maxlen)
tokenizer.fit_on_texts(df['headline'])

# Define X and y for your model.

In [11]:
X = tokenizer.texts_to_sequences(df['headline'])
X = pad_sequences(X, maxlen = maxlen)
y = np.asarray(df['is_sarcastic'])

print("Number of Samples:", len(X))
print(X[0])
print("Number of Labels: ", len(y))
print(y[0])

Number of Samples: 26709
[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0 47  5]
Number of Labels:  26709
0


## Get the Vocabulary size ( 2 marks)
Hint : You can use tokenizer.word_index.

In [12]:
vocab_size = tokenizer.word_index.items()
num_words = len(vocab_size) + 1
print(num_words)

29657


#**## Word Embedding**

## Get Glove Word Embeddings

In [13]:
glove_file = project_path + "glove.6B.zip"

In [14]:
#Extract Glove embedding zip file
from zipfile import ZipFile
with ZipFile(glove_file, 'r') as z:
  z.extractall()

# Get the Word Embeddings using Embedding file as given below.

In [15]:
EMBEDDING_FILE = './glove.6B.200d.txt'

embeddings = {}
for o in open(EMBEDDING_FILE):
    word = o.split(" ")[0]
    #print(word)
    embd = o.split(" ")[1:]
    embd = np.asarray(embd, dtype='float32')
    #print(embd)
    embeddings[word] = embd



# Create a weight matrix for words in training docs

In [16]:
embedding_matrix = np.zeros((num_words, 200))

for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

len(embeddings.values())

400000

## Create and Compile your Model  ( 7 marks)
Hint - Use Sequential model instance and then add Embedding layer, Bidirectional(LSTM) layer, then dense and dropout layers as required. 
In the end add a final dense layer with sigmoid activation for binary classification.


In [24]:
### Embedding layer for hint 
## model.add(Embedding(num_words, embedding_size, weights = [embedding_matrix]))
### Bidirectional LSTM layer for hint 
## model.add(Bidirectional(LSTM(128, return_sequences = True)))

#Create the model
model = Sequential()
model.add(Embedding(num_words, 200, weights = [embedding_matrix]))
model.add(Bidirectional(LSTM(128, return_sequences = True)))
model.add(Bidirectional(GRU(64 , recurrent_dropout = 0.1 , dropout = 0.1)))
model.add(Dense(1, activation='sigmoid'))
print(model.summary())

# Compile the model
model.compile(optimizer = "adam", loss = "binary_crossentropy", metrics = ["accuracy"])

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, None, 200)         5931400   
_________________________________________________________________
bidirectional_4 (Bidirection (None, None, 256)         336896    
_________________________________________________________________
bidirectional_5 (Bidirection (None, 128)               123648    
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 129       
Total params: 6,392,073
Trainable params: 6,392,073
Non-trainable params: 0
_________________________________________________________________
None


# Fit your model with a batch size of 100 and validation_split = 0.2. and state the validation accuracy ( 5 marks)


In [25]:
from sklearn.model_selection import train_test_split

batch_size = 100
epochs = 5

x_train, x_test, y_train, y_test = train_test_split(X, y , test_size = 0.2 , random_state = 0)

#Fit the model
model.fit(x_train, y_train, validation_data = (x_test, y_test), epochs=epochs, batch_size=batch_size, verbose=2)

Epoch 1/5
214/214 - 556s - loss: 0.4827 - accuracy: 0.7560 - val_loss: 0.4526 - val_accuracy: 0.7767
Epoch 2/5
214/214 - 560s - loss: 0.4267 - accuracy: 0.7931 - val_loss: 0.4387 - val_accuracy: 0.7872
Epoch 3/5
214/214 - 554s - loss: 0.4200 - accuracy: 0.7975 - val_loss: 0.4306 - val_accuracy: 0.7873
Epoch 4/5
214/214 - 555s - loss: 0.4068 - accuracy: 0.8039 - val_loss: 0.4289 - val_accuracy: 0.7896
Epoch 5/5
214/214 - 555s - loss: 0.3990 - accuracy: 0.8087 - val_loss: 0.4221 - val_accuracy: 0.7943


<tensorflow.python.keras.callbacks.History at 0x7f624e1a31d0>

In [2]:
#Hence the model accuracy is 0.7943