<a href="https://colab.research.google.com/github/sanjeevr5/NLP/blob/main/CNN_For_Text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Textual Sarcasm Detection Using CNN

- Architecture Used : CNN With BPEmb embeddings(https://github.com/bheinzerling/bpemb)
- Referring : https://chriskhanhtran.github.io/posts/cnn-sentence-classification/ including the CNN architecture image
- @article{misra2019sarcasm,
  title={Sarcasm Detection using Hybrid Neural Network},
  author={Misra, Rishabh and Arora, Prahal},
  journal={arXiv preprint arXiv:1908.07414},
  year={2019}
}
- Given data is in json format and is_sarcastic will be our label and let us try to predict using the "headline" only

**This notebook is intended not to get good results from the architecture but a way to demonstrate how CNNs can be used with text data**

In [1]:
!head -5 ./Sarcasm_Headlines_Dataset.json

{"article_link": "https://www.huffingtonpost.com/entry/versace-black-code_us_5861fbefe4b0de3a08f600d5", "headline": "former versace store clerk sues over secret 'black code' for minority shoppers", "is_sarcastic": 0}
{"article_link": "https://www.huffingtonpost.com/entry/roseanne-revival-review_us_5ab3a497e4b054d118e04365", "headline": "the 'roseanne' revival catches up to our thorny political mood, for better and worse", "is_sarcastic": 0}
{"article_link": "https://local.theonion.com/mom-starting-to-fear-son-s-web-series-closest-thing-she-1819576697", "headline": "mom starting to fear son's web series closest thing she will have to grandchild", "is_sarcastic": 1}
{"article_link": "https://politics.theonion.com/boehner-just-wants-wife-to-listen-not-come-up-with-alt-1819574302", "headline": "boehner just wants wife to listen, not come up with alternative debt-reduction ideas", "is_sarcastic": 1}
{"article_link": "https://www.huffingtonpost.com/entry/jk-rowling-wishes-snape-happy-bir

In [2]:
!head -5 ./Sarcasm_Headlines_Dataset_v2.json

{"is_sarcastic": 1, "headline": "thirtysomething scientists unveil doomsday clock of hair loss", "article_link": "https://www.theonion.com/thirtysomething-scientists-unveil-doomsday-clock-of-hai-1819586205"}
{"is_sarcastic": 0, "headline": "dem rep. totally nails why congress is falling short on gender, racial equality", "article_link": "https://www.huffingtonpost.com/entry/donna-edwards-inequality_us_57455f7fe4b055bb1170b207"}
{"is_sarcastic": 0, "headline": "eat your veggies: 9 deliciously different recipes", "article_link": "https://www.huffingtonpost.com/entry/eat-your-veggies-9-delici_b_8899742.html"}
{"is_sarcastic": 1, "headline": "inclement weather prevents liar from getting to work", "article_link": "https://local.theonion.com/inclement-weather-prevents-liar-from-getting-to-work-1819576031"}
{"is_sarcastic": 1, "headline": "mother comes pretty close to using word 'streaming' correctly", "article_link": "https://www.theonion.com/mother-comes-pretty-close-to-using-word-strea

In [3]:
import json

def read_file(path):
  for line in open(path, 'r'):
    yield json.loads(line)

train_data = list(read_file('./Sarcasm_Headlines_Dataset.json'))
test_data = list(read_file('./Sarcasm_Headlines_Dataset_v2.json'))

In [4]:
print('Train Data\n\n')
print(train_data[:5])
print('Test Data\n\n')
print(test_data[:5])

Train Data


[{'article_link': 'https://www.huffingtonpost.com/entry/versace-black-code_us_5861fbefe4b0de3a08f600d5', 'headline': "former versace store clerk sues over secret 'black code' for minority shoppers", 'is_sarcastic': 0}, {'article_link': 'https://www.huffingtonpost.com/entry/roseanne-revival-review_us_5ab3a497e4b054d118e04365', 'headline': "the 'roseanne' revival catches up to our thorny political mood, for better and worse", 'is_sarcastic': 0}, {'article_link': 'https://local.theonion.com/mom-starting-to-fear-son-s-web-series-closest-thing-she-1819576697', 'headline': "mom starting to fear son's web series closest thing she will have to grandchild", 'is_sarcastic': 1}, {'article_link': 'https://politics.theonion.com/boehner-just-wants-wife-to-listen-not-come-up-with-alt-1819574302', 'headline': 'boehner just wants wife to listen, not come up with alternative debt-reduction ideas', 'is_sarcastic': 1}, {'article_link': 'https://www.huffingtonpost.com/entry/jk-rowling-wishes-s

In [5]:
import numpy as np
SEED = 43
np.random.seed(SEED)
train_data = np.array([(row['headline'], row['is_sarcastic']) for row in train_data])
test_data = np.array([(row['headline'], row['is_sarcastic']) for row in test_data])
print(f'Train shape is {train_data.shape} and test shape is {test_data.shape}')

Train shape is (26709, 2) and test shape is (28619, 2)


In [6]:
from collections import Counter
print(f'Train label distribution : {Counter(train_data[:,1])}')

Train label distribution : Counter({'0': 14985, '1': 11724})


- The labels distribution are not so skewed
- We will the use BPEmb as our embeddings

In [7]:
%%capture
!pip install bpemb
!pip install ftfy
from bpemb import BPEmb
bpemb_en = BPEmb(lang="en", dim=300, vs = 10000) #vs will be the voacb size

In [8]:
embeddings = bpemb_en.vectors
embeddings = np.insert(embeddings, 0, [0] * 300, axis = 0) #Adding 300d zero vector to the embeddings
print(embeddings.shape) 

(10001, 300)


In [9]:
word2idx = {key : index+1 for index, key in enumerate(bpemb_en.emb.vocab.keys())}
word2idx['<pad>'] = 0
idx2word = {index+1 : key for index, key in enumerate(bpemb_en.emb.vocab.keys())}
idx2word[0] = '<pad>'

In [10]:
import ftfy #for fixing encoding issues

def prepare_data(sentence, max_len = 65):
  encoded = bpemb_en.encode(ftfy.fix_text(sentence))
  encoded = [word2idx[token] if word2idx.get(token,0) else 1 for token in encoded]
  encoded += [0] * (max_len - len(encoded))
  return encoded[:max_len]

train_labels = train_data[:,1].astype(int)
test_labels = test_data[:,1].astype(int)

train_encoded, test_encoded = [], []

for headline, label in train_data:
  train_encoded.append(prepare_data(headline))
for headline, label in test_data:
  test_encoded.append(prepare_data(headline))

In [11]:
import torch
from torch.utils.data import (TensorDataset, DataLoader, RandomSampler, SequentialSampler)

torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

train_dataset = TensorDataset(torch.tensor(train_encoded), torch.from_numpy(train_labels))
test_dataset = TensorDataset(torch.tensor(test_encoded), torch.from_numpy(test_labels))

train_loader = DataLoader(train_dataset, sampler = RandomSampler(train_dataset), batch_size = 32)
test_loader = DataLoader(test_dataset, sampler = SequentialSampler(test_dataset), batch_size = 32)

## CNN Architecture

- We use 1D convolutions here

<b> What are the differences between 1D and 2D convs? </b>

- The direction matters the 2D conv kernel can travel along both x-axis and y-axis while the 1D kernel can travel along x-axis
- Conv2D == Conv1D when kernel size has the same width as the input's width
- Generally, 1D convs are used in text and signal processing

![](https://github.com/chriskhanhtran/CNN-Sentence-Classification-PyTorch/blob/master/cnn-architecture.JPG?raw=true)



In [12]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

device = ('cuda' if torch.cuda.is_available() else 'cpu')
print(f'This runs on {device}')

class CNN_Text(nn.Module):

  def __init__(self, embed, filter_sizes, num_filters, classes = 1, freeze_emb = True, drop_rate = 0.5):
    super(CNN_Text, self).__init__()
    self.vocab_size, self.emb_size = embed.shape
    self.embedding = nn.Embedding.from_pretrained(embed, freeze = freeze_emb)
    self.conv1d = nn.ModuleList([nn.Conv1d(in_channels = self.emb_size, out_channels = num_filters[i], kernel_size = filter_sizes[i])
                                 for i in range(len(filter_sizes))
    ])
    self.fc = nn.Linear(sum(num_filters), classes)
    self.drp = nn.Dropout(drop_rate)

  def forward(self, batch):

    embed = self.embedding(batch)
    embed = embed.permute(0, 2, 1) #Conv1D expects the data to be of batch, width, height
    convs = [F.relu(conv(embed)) for conv in self.conv1d]
    pooled = [F.max_pool1d(conv, kernel_size = conv.shape[2]) for conv in convs] #OP:  (b, num_filters[i], 1)
    logits = self.fc(self.drp(torch.cat([pool.squeeze(dim=2) for pool in pooled], dim = 1)))
    return logits

model = CNN_Text(torch.tensor(embeddings), [2, 3, 4, 5], [5] * 4 )
print('The number of trainable parameters are :', sum(p.numel() for p in model.parameters() if p.requires_grad))
model.to(device)
optimizer = optim.Adam(model.parameters())
criterion = nn.BCEWithLogitsLoss()
criterion = criterion.to(device)

This runs on cuda
The number of trainable parameters are : 21041


In [13]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

def train_m(model, iterator, optimizer, l):
  e_loss = 0
  e_acc = 0
  model.train()

  for inputs, labels in iterator:
    optimizer.zero_grad()
    inputs, labels = inputs.to(device), labels.to(device)
    preds = model(inputs)
    acc = ((preds.ge(0.5).view(-1)) == labels).sum().float() / len(preds)
    loss = l(preds.squeeze(1), labels.float())
    loss.backward()
    optimizer.step()
    e_loss += loss.item()
    e_acc += acc.item()
  return e_loss/len(iterator), e_acc/len(iterator)

def evaluate_m(model, iterator, l):
  e_loss = 0
  e_acc = 0
  model.eval()
  with torch.no_grad():
    for inputs, labels in iterator:
      inputs, labels = inputs.to(device), labels.to(device)
      preds = model(inputs)
      loss = l(preds.squeeze(1), labels.float())
      acc = ((preds.ge(0.5).view(-1)) == labels).sum().float() / len(preds)
      e_loss += loss.item()
      e_acc += acc.item()
  return e_loss/len(iterator), e_acc/len(iterator)

In [14]:
N_EPOCHS = 10

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train_m(model, train_loader, optimizer, criterion)
    valid_loss, valid_acc = evaluate_m(model, test_loader, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    print(f'Epoch: {epoch+1:02} / {N_EPOCHS} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 / 10 | Epoch Time: 0m 5s
	Train Loss: 0.564 | Train Acc: 66.05%
	 Val. Loss: 0.466 |  Val. Acc: 74.14%
Epoch: 02 / 10 | Epoch Time: 0m 5s
	Train Loss: 0.484 | Train Acc: 74.83%
	 Val. Loss: 0.411 |  Val. Acc: 78.64%
Epoch: 03 / 10 | Epoch Time: 0m 5s
	Train Loss: 0.454 | Train Acc: 76.93%
	 Val. Loss: 0.380 |  Val. Acc: 81.28%
Epoch: 04 / 10 | Epoch Time: 0m 5s
	Train Loss: 0.434 | Train Acc: 78.21%
	 Val. Loss: 0.357 |  Val. Acc: 83.98%
Epoch: 05 / 10 | Epoch Time: 0m 5s
	Train Loss: 0.411 | Train Acc: 79.84%
	 Val. Loss: 0.333 |  Val. Acc: 84.23%
Epoch: 06 / 10 | Epoch Time: 0m 5s
	Train Loss: 0.403 | Train Acc: 80.35%
	 Val. Loss: 0.324 |  Val. Acc: 84.69%
Epoch: 07 / 10 | Epoch Time: 0m 5s
	Train Loss: 0.395 | Train Acc: 80.42%
	 Val. Loss: 0.310 |  Val. Acc: 85.29%
Epoch: 08 / 10 | Epoch Time: 0m 5s
	Train Loss: 0.378 | Train Acc: 81.39%
	 Val. Loss: 0.297 |  Val. Acc: 86.09%
Epoch: 09 / 10 | Epoch Time: 0m 5s
	Train Loss: 0.374 | Train Acc: 82.14%
	 Val. Loss: 0.290 |  

## Inference

In [15]:
LABEL_DICT =  {0: 'Normal', 1: 'Sarcastic'}

def predict(sentence):
  input_tensor = torch.tensor([prepare_data(sentence)]).to(device)
  pred = model(input_tensor)
  if torch.sigmoid(pred).ge(0.5).item():
    print(LABEL_DICT[1])
  else:
    print(LABEL_DICT[0])

In [16]:
predict('I am lucky not to have this')

Normal
