<a href="https://colab.research.google.com/github/sandheepgopinath/NLP/blob/master/NLP_Project_Sarcasm_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sarcasm Detection
 **Acknowledgement**

Misra, Rishabh, and Prahal Arora. "Sarcasm Detection using Hybrid Neural Network." arXiv preprint arXiv:1908.07414 (2019).

**The required files are available at the link below.**

https://drive.google.com/drive/folders/1xUnF35naPGU63xwRDVGc-DkZ3M8V5mMk

## Install `TensorFlow 2.0`

In [1]:
!pip uninstall -y tensorflow
!pip install tensorflow==2.0.0

Collecting tensorflow==2.0.0
  Using cached https://files.pythonhosted.org/packages/46/0f/7bd55361168bb32796b360ad15a25de6966c9c1beb58a8e30c01c8279862/tensorflow-2.0.0-cp36-cp36m-manylinux2010_x86_64.whl
Installing collected packages: tensorflow
Successfully installed tensorflow-2.0.0


## Get Required Files from Drive

In [2]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [0]:
path='/content/drive/My Drive/Colab Notebooks/Data'


In [4]:
%cd /content/drive/My Drive/Colab Notebooks/Data

/content/drive/.shortcut-targets-by-id/1s37-DVvRWfOcv59ojc6DnwHzPFkA6e8M/Data


In [5]:
%ls

glove.6B.100d.txt  glove.6B.300d.txt  glove.6B.zip
glove.6B.200d.txt  glove.6B.50d.txt   Sarcasm_Headlines_Dataset.json


# **Reading and Exploring Data**

## Read the data from "Sarcasm_Headlines_Dataset.json". Explore the data and get some insights about it. (4 marks)
Hint - As it is in JSON format, you need to use the `pandas.read_json` function with the parameter `lines=True`.

In [6]:
import pandas as pd
import re
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split
j=pd.read_json('Sarcasm_Headlines_Dataset.json',lines=True)

Using TensorFlow backend.
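
The `lines=True` flag tells pandas that the file is in JSON Lines format, that is, one JSON object per line rather than a single JSON array. A minimal sketch of this, using two made-up records in an in-memory buffer instead of the real file:

```python
import io
import pandas as pd

# Two made-up records in JSON Lines format (one object per line).
# They are illustrative only and are not taken from the dataset.
raw = (
    '{"headline": "local man wins award", "is_sarcastic": 0}\n'
    '{"headline": "area cat declares itself mayor", "is_sarcastic": 1}\n'
)

sample = pd.read_json(io.StringIO(raw), lines=True)
print(sample.shape)
print(sorted(sample.columns))
```

Without `lines=True`, pandas would try to parse the whole buffer as one JSON document and fail on the second record.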


In [30]:
j.head()

| | article_link | headline | is_sarcastic |
|---|---|---|---|
| 0 | https://www.huffingtonpost.com/entry/versace-b... | former versace store clerk sues over secret 'b... | 0 |
| 1 | https://www.huffingtonpost.com/entry/roseanne-... | the 'roseanne' revival catches up to our thorn... | 0 |
| 2 | https://local.theonion.com/mom-starting-to-fea... | mom starting to fear son's web series closest ... | 1 |
| 3 | https://politics.theonion.com/boehner-just-wan... | boehner just wants wife to listen, not come up... | 1 |
| 4 | https://www.huffingtonpost.com/entry/jk-rowlin... | j.k. rowling wishes snape happy birthday in th... | 0 |


In [32]:
#Identifying the number of different cases in 'is_sarcastic' column
set(j['is_sarcastic'])

{0, 1}


The `is_sarcastic` column has two distinct values, 0 and 1.

In [60]:
print('The file has ',j[j['is_sarcastic']==1].count()[0],' sarcastic entries and ',j[j['is_sarcastic']==0].count()[0],' non sarcastic entries and a total of',len(j['headline'].values),' entries')

The file has  11724  sarcastic entries and  14985  non sarcastic entries and a total of 26709  entries
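
The two classes are not perfectly balanced, which matters when reading accuracy figures later. As a rough sketch, the share of the larger class gives the accuracy a trivial classifier would reach by always predicting that class. The counts are copied from the output above; the variable names are chosen here for illustration.

```python
# Counts reported by the cell above.
n_sarcastic = 11724
n_non_sarcastic = 14985
total = n_sarcastic + n_non_sarcastic

# Accuracy of a classifier that always predicts the majority class.
majority_share = max(n_sarcastic, n_non_sarcastic) / total

print(f"total entries: {total}")
print(f"majority-class share: {majority_share:.3f}")
```

Any trained model should be judged against this baseline of roughly 56% rather than against 50%.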


In [51]:
# Confirming what 1 and 0 stands for

print(j['headline'][0],'   ' ,j['is_sarcastic'][0])
print(j['headline'][5],'   ' ,j['is_sarcastic'][5])
print(j['headline'][2138],'   ' ,j['is_sarcastic'][2138])
print(j['headline'][3827],'   ' ,j['is_sarcastic'][3827])
print(j['headline'][1002],'   ' ,j['is_sarcastic'][1002])


former versace store clerk sues over secret 'black code' for minority shoppers     0
advancing the world's women     0
why these men bucked tradition and wore an engagement ring     0
japanese prime minister resigns to seek revenge on man who killed his family     1
labor secretary letting 8 million unemployed americans crash at his place until they get back on their feet     1


So 1 is Sarcastic and 0 is Non-Sarcastic

## Drop `article_link` from dataset. ( 2 marks)
We only need the headline text and the `is_sarcastic` column for this project, so we can drop the article link column here.

In [0]:
data=j.drop('article_link',axis=1)

## Get the Length of each line and find the maximum length. ( 4 marks)
Different lines have different lengths, so we need to pad our sequences to the maximum length.

In [0]:
data['headline']=data['headline'].apply(lambda x: x.lower())
# Keep only letters, digits and whitespace
data['headline']=data['headline'].apply((lambda x: re.sub(r'[^a-zA-Z0-9\s]','',x)))


In [9]:
# DataFrame.size counts cells (rows x 2 columns), so each value is twice the number of rows
print(data[ data['is_sarcastic'] == 1].size)
print(data[ data['is_sarcastic'] == 0].size)

23448
29970


In [0]:
# Length of each headline in characters
data['length']=data['headline'].str.len()

**Identifying the maximum length**

In [11]:
data['length'].max()

237

**Identifying the entry with maximum length**

In [12]:
data[data['length']==237]

| | headline | is_sarcastic | length |
|---|---|---|---|
| 19868 | maya angelou poet author civil rights activist... | 1 | 237 |


**Dropping the length column as it is not useful for further analysis**

In [0]:
data.drop('length',axis=1,inplace=True)
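
Note that the `length` column measured characters, whereas `pad_sequences` pads to a number of tokens. A word-based length can be sketched as follows. The two headlines are made up for illustration, and splitting on whitespace is only an approximation of what the Keras tokenizer does.

```python
import pandas as pd

# Made-up headlines, for illustration only.
toy = pd.DataFrame({"headline": ["mom starts web series", "area man sighs"]})

toy["n_chars"] = toy["headline"].str.len()              # what the length column measured
toy["n_words"] = toy["headline"].str.split().str.len()  # closer to what padding needs

print(toy[["n_chars", "n_words"]])
print("max words:", toy["n_words"].max())
```

Padding to the maximum character count still works, because it is an upper bound on the word count, but it makes every input sequence much longer than necessary.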

# **Modelling**

## Import the modules required for modelling.

In [0]:
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, Flatten, Bidirectional, GlobalMaxPool1D,TimeDistributed,SpatialDropout1D
from tensorflow.keras.models import Model, Sequential

# Set Different Parameters for the model. ( 2 marks)

In [0]:
max_features = 10000
maxlen = 237
embedding_size = 200

## Apply the Keras Tokenizer to the headline column of your data. (4 marks)
Hint - First create a tokenizer instance using Tokenizer(num_words=max_features),
and then fit this tokenizer instance on your data column df['headline'] using .fit_on_texts()

In [0]:
tokenizer = Tokenizer(num_words=max_features, split=' ')
tokenizer.fit_on_texts(data['headline'].values)
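
To make the role of `num_words` concrete, here is a simplified pure-Python sketch of a frequency-ranked word index. It is not the Keras implementation, and the helper names `build_word_index` and `to_sequence` are hypothetical. It mirrors two behaviours of the Keras `Tokenizer`: index 0 is reserved, so indices start at 1, and only words with an index below `num_words` are kept when text is converted to sequences. Breaking ties alphabetically is a choice made for this sketch.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequence(text, word_index, num_words=None):
    """Map words to indices, dropping unknown words and those ranked >= num_words."""
    seq = []
    for word in text.split():
        idx = word_index.get(word)
        if idx is None:
            continue
        if num_words is not None and idx >= num_words:
            continue
        seq.append(idx)
    return seq

texts = ["the cat sat", "the cat ran", "the dog"]
word_index = build_word_index(texts)
print(word_index)
print(to_sequence("the dog sat", word_index, num_words=3))
```

This is why `len(tokenizer.word_index)` can exceed `max_features`: the index holds every word seen, and the cap is applied only when texts are converted to sequences.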


# Define X and y for your model.

In [17]:
X = tokenizer.texts_to_sequences(data['headline'].values)
X = pad_sequences(X,maxlen=maxlen)
y = np.asarray(data['is_sarcastic'])

print("Number of Samples:", len(X))
print(X[0])
print("Number of Labels: ", len(y))
print(y[0])

Number of Samples: 26709
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0  
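
The long run of zeros is the padding. By default `pad_sequences` pads (and truncates) at the front of each sequence, so the real token ids sit at the end of the row. A rough NumPy sketch of that default behaviour, with a hypothetical helper `pad_pre` and made-up token ids:

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if not seq:
            continue                         # an empty sequence stays all padding
        trimmed = seq[-maxlen:]              # keep the last maxlen tokens
        out[row, -len(trimmed):] = trimmed   # write them at the right edge
    return out

padded = pad_pre([[5, 8], [1, 2, 3, 4, 9, 7]], maxlen=4)
print(padded)
```

The short sequence gains two leading zeros, while the long one loses its first two tokens.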

## Get the Vocabulary size ( 2 marks)
Hint : You can use tokenizer.word_index.

In [18]:
len(tokenizer.word_index)

28398

# **Word Embedding**

## Get Glove Word Embeddings

In [19]:
%cd /content/drive/My Drive/Colab Notebooks/

/content/drive/My Drive/Colab Notebooks


In [0]:
glove_file ="glove.6B.zip"

In [0]:
#Extract Glove embedding zip file
from zipfile import ZipFile
with ZipFile(glove_file, 'r') as z:
  z.extractall()

# Get the Word Embeddings using Embedding file as given below.

In [0]:
import numpy as np
EMBEDDING_FILE = 'glove.6B.200d.txt'

# Map each token to its 200-dimensional GloVe vector
embeddings = {}
with open(EMBEDDING_FILE, encoding='utf-8') as f:
    for line in f:
        values = line.rstrip().split(" ")
        word = values[0]
        embeddings[word] = np.asarray(values[1:], dtype='float32')



In [23]:
len(embeddings)

400000
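
Each line of a GloVe text file holds a token followed by its vector components, separated by single spaces. A small sketch of parsing one such line follows. The line shown is made up and shortened to four dimensions, and `parse_glove_line` is a hypothetical helper.

```python
import numpy as np

def parse_glove_line(line):
    """Split 'word v1 v2 ...' into (word, float32 vector)."""
    parts = line.rstrip().split(" ")
    return parts[0], np.asarray(parts[1:], dtype="float32")

word, vec = parse_glove_line("cat 0.1 -0.2 0.3 0.4\n")
print(word, vec.shape, vec.dtype)
```

With the real `glove.6B.200d.txt` file, every vector should have 200 components, matching `embedding_size`.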

# Create a weight matrix for words in training docs

In [24]:
embedding_matrix = np.zeros((len(tokenizer.word_index)+1, embedding_size))

for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

len(embeddings.values())

400000
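
Words that are missing from GloVe keep an all-zero row in `embedding_matrix`. A quick way to see how much of the vocabulary is covered is sketched below with toy stand-ins for `tokenizer.word_index` and `embeddings` (both made up, with 3-dimensional vectors).

```python
import numpy as np

# Toy stand-ins; the real objects come from the cells above.
toy_word_index = {"the": 1, "cat": 2, "zzyzx": 3}
toy_embeddings = {
    "the": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "cat": np.array([0.4, 0.5, 0.6], dtype="float32"),
}

dim = 3
matrix = np.zeros((len(toy_word_index) + 1, dim))  # row 0 is reserved for padding
for word, i in toy_word_index.items():
    vector = toy_embeddings.get(word)
    if vector is not None:
        matrix[i] = vector

# Rows 1..N that are still all zeros correspond to words without a vector.
missing = int((~matrix[1:].any(axis=1)).sum())
coverage = 1 - missing / len(toy_word_index)
print(f"missing: {missing}, coverage: {coverage:.2f}")
```

A low coverage figure on the real data would suggest that the cleaning step and the GloVe vocabulary disagree, for example on how punctuation is handled.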

## Create and Compile your Model  ( 7 marks)
Hint - Use Sequential model instance and then add Embedding layer, Bidirectional(LSTM) layer, then dense and dropout layers as required. 
In the end add a final dense layer with sigmoid activation for binary classification.


In [25]:
### Embedding layer for hint 
## model.add(Embedding(num_words, embedding_size, weights = [embedding_matrix]))
### Bidirectional LSTM layer for hint 
## model.add(Bidirectional(LSTM(128, return_sequences = True)))
num_words=len(tokenizer.word_index)
lstm_out=196
model = Sequential()
model.add(Embedding(num_words+1,200,weights=[embedding_matrix]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 200)         5679800   
_________________________________________________________________
spatial_dropout1d (SpatialDr (None, None, 200)         0         
_________________________________________________________________
lstm (LSTM)                  (None, 196)               311248    
_________________________________________________________________
dense (Dense)                (None, 2)                 394       
Total params: 5,991,442
Trainable params: 5,991,442
Non-trainable params: 0
_________________________________________________________________
None
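
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and a dense layer. The variable names are chosen here for illustration.

```python
vocab_rows = 28398 + 1   # vocabulary size plus the reserved index 0
embed_dim = 200
lstm_units = 196
n_classes = 2

embedding_params = vocab_rows * embed_dim
lstm_params = 4 * ((embed_dim + lstm_units + 1) * lstm_units)
dense_params = lstm_units * n_classes + n_classes

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
```

Almost all of the parameters sit in the embedding layer, which is initialised from GloVe and, according to the summary, remains trainable.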


# Fit your model with a batch size of 100 and validation_split = 0.2, and state the validation accuracy (5 marks)


In [26]:
batch_size = 100
epochs = 5
Y = pd.get_dummies(data['is_sarcastic']).values
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.25, random_state = 42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)


(20031, 237) (20031, 2)
(6678, 237) (6678, 2)
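
`pd.get_dummies` turns the 0/1 label into two columns, which is the shape a two-unit softmax output with `categorical_crossentropy` expects. The same encoding, and the `argmax` that inverts it, can be sketched in NumPy with made-up labels:

```python
import numpy as np

labels = np.array([0, 1, 1, 0])             # made-up labels
one_hot = np.eye(2, dtype="int32")[labels]  # row i of the identity matrix encodes class i
print(one_hot)

recovered = one_hot.argmax(axis=1)          # back to 0/1 class ids
print(recovered)
```

Column 0 therefore stands for the non-sarcastic class and column 1 for the sarcastic class.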


In [27]:
model.fit(X_train, Y_train, epochs = epochs, batch_size=batch_size, verbose = 2)

Train on 20031 samples
Epoch 1/5
20031/20031 - 293s - loss: 0.5409 - accuracy: 0.7168
Epoch 2/5
20031/20031 - 291s - loss: 0.3832 - accuracy: 0.8288
Epoch 3/5
20031/20031 - 295s - loss: 0.3225 - accuracy: 0.8586
Epoch 4/5
20031/20031 - 297s - loss: 0.2712 - accuracy: 0.8875
Epoch 5/5
20031/20031 - 295s - loss: 0.2401 - accuracy: 0.8978


<tensorflow.python.keras.callbacks.History at 0x7f4a094235c0>

In [28]:
score,acc = model.evaluate(X_test, Y_test, verbose = 2, batch_size = batch_size)
print("score: %.2f" % (score))
print("acc: %.2f" % (acc))

6678/1 - 29s - loss: 0.2924 - accuracy: 0.8750
score: 0.30
acc: 0.87
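
Accuracy alone hides how the errors split between the two classes. A minimal sketch of a confusion matrix plus precision and recall for the sarcastic class follows. It uses made-up label arrays in place of `Y_test.argmax(axis=1)` and `model.predict(X_test).argmax(axis=1)`.

```python
import numpy as np

# Made-up true and predicted class ids (1 = sarcastic).
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

tp = int(((y_true == 1) & (y_pred == 1)).sum())  # sarcastic, predicted sarcastic
fp = int(((y_true == 0) & (y_pred == 1)).sum())  # non-sarcastic, predicted sarcastic
fn = int(((y_true == 1) & (y_pred == 0)).sum())  # sarcastic, predicted non-sarcastic
tn = int(((y_true == 0) & (y_pred == 0)).sum())  # non-sarcastic, predicted non-sarcastic

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"tp={tp} fp={fp} fn={fn} tn={tn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

On the real test set, these figures would show whether the model misses sarcastic headlines more often than it mislabels ordinary ones.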


In [68]:
twt = ['Great effort by the team in managing the project so well']
twt = tokenizer.texts_to_sequences(twt)
# Pad to the same length that was used for training
twt = pad_sequences(twt, maxlen=maxlen, dtype='int32', value=0)
sentiment = model.predict(twt,batch_size=1,verbose = 2)[0]
# Index 0 is the non-sarcastic class, index 1 the sarcastic class
if np.argmax(sentiment) == 0:
    print("Non Sarcastic")
else:
    print("Sarcastic")

1/1 - 0s
Non Sarcastic
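
New text should go through the same cleaning as the training headlines before it is tokenized. A small sketch of such a helper follows. `clean_headline` is a hypothetical name, and the pattern keeps only ASCII letters, digits and whitespace, matching the intent of the cleaning step earlier in the notebook.

```python
import re

def clean_headline(text):
    """Lowercase and strip everything except letters, digits and whitespace."""
    return re.sub(r"[^a-z0-9\s]", "", text.lower())

print(clean_headline("Great effort by the team, managing the project SO well!"))
```

The cleaned string can then be passed to `tokenizer.texts_to_sequences` in a one-element list, as in the cell above.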


## The model predicts sarcastic and non-sarcastic headlines with about 87% accuracy on the held-out test set