# **Sarcasm Detection using DistilBERT**

The goal of this project is to show how to replace the head of a pre-trained language model so that it performs a specific task.

The task here is a **classification** task.

The custom model performs sarcasm detection on a news-headlines dataset from Kaggle. Get the dataset [here](https://www.kaggle.com/datasets/rmisra/news-headlines-dataset-for-sarcasm-detection/data).

The model used here is **DistilBERT**, downloaded with the **HuggingFace Transformers** package.

## **Prepare the dataset**

In [61]:
# INSTALL ALL THE REQUIRED PACKAGES

! pip install -q datasets==2.19.2

In [89]:
# IMPORT ALL THE REQUIRED PACKAGES

import os

import numpy as np
import pandas as pd

import datasets
import transformers
import torch

from tqdm.auto import tqdm

In [63]:
# INIT REQUIRED PATH.

data_path = "/content/Sarcasm_Headlines_Dataset_v2.json"

In [64]:
# READ DATA INTO A PANDAS DATAFRAME.

df = pd.read_json(data_path, lines=True)
df.head()

| | is_sarcastic | headline | article_link |
|---|---|---|---|
| 0 | 1 | thirtysomething scientists unveil doomsday clo... | https://www.theonion.com/thirtysomething-scien... |
| 1 | 0 | dem rep. totally nails why congress is falling... | https://www.huffingtonpost.com/entry/donna-edw... |
| 2 | 0 | eat your veggies: 9 deliciously different recipes | https://www.huffingtonpost.com/entry/eat-your-... |
| 3 | 1 | inclement weather prevents liar from getting t... | https://local.theonion.com/inclement-weather-p... |
| 4 | 1 | mother comes pretty close to using word 'strea... | https://www.theonion.com/mother-comes-pretty-c... |


In [65]:
# WORK WITH HUGGINGFACE DATASETS PACKAGE.

dataset_hf = datasets.load_dataset("json", data_files=data_path)
dataset_hf

DatasetDict({
    train: Dataset({
        features: ['is_sarcastic', 'headline', 'article_link'],
        num_rows: 28619
    })
})

In [66]:
type(dataset_hf)

datasets.dataset_dict.DatasetDict

In [67]:
# REMOVE UNWANTED COLUMNS.
dataset_hf = dataset_hf.remove_columns(['article_link'])

# SET FORMAT AS PANDAS.
dataset_hf.set_format('pandas')
dataset_hf = dataset_hf['train'][:]
dataset_hf

| | is_sarcastic | headline |
|---|---|---|
| 0 | 1 | thirtysomething scientists unveil doomsday clo... |
| 1 | 0 | dem rep. totally nails why congress is falling... |
| 2 | 0 | eat your veggies: 9 deliciously different recipes |
| 3 | 1 | inclement weather prevents liar from getting t... |
| 4 | 1 | mother comes pretty close to using word 'strea... |
| ... | ... | ... |
| 28614 | 1 | jews to celebrate rosh hashasha or something |
| 28615 | 1 | internal affairs investigator disappointed con... |
| 28616 | 0 | the most beautiful acceptance speech this week... |
| 28617 | 1 | mars probe destroyed by orbiting spielberg-gat... |


In [68]:
# RENAME COLS.

dataset_hf = dataset_hf.rename({"is_sarcastic":"label"}, axis=1)
dataset_hf.head()

| | label | headline |
|---|---|---|
| 0 | 1 | thirtysomething scientists unveil doomsday clo... |
| 1 | 0 | dem rep. totally nails why congress is falling... |
| 2 | 0 | eat your veggies: 9 deliciously different recipes |
| 3 | 1 | inclement weather prevents liar from getting t... |
| 4 | 1 | mother comes pretty close to using word 'strea... |


In [69]:
# REMOVE DUPLICATES

dataset_hf = dataset_hf.drop_duplicates(subset=["headline"], keep="first")
dataset_hf = dataset_hf.reset_index(drop=True)
dataset_hf.shape

(28503, 2)
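
To make the `keep="first"` behaviour concrete, the following is a minimal pure-Python sketch of the same idea: walk the rows in order and keep a row only if its headline has not been seen before. The rows are made up for illustration, and `drop_duplicate_headlines` is a hypothetical helper, not a pandas function.

```python
def drop_duplicate_headlines(rows):
    """Keep only the first row seen for each distinct headline."""
    seen = set()
    unique_rows = []
    for row in rows:
        if row["headline"] not in seen:
            seen.add(row["headline"])
            unique_rows.append(row)
    return unique_rows


# Made-up rows for illustration only.
rows = [
    {"label": 1, "headline": "a"},
    {"label": 0, "headline": "b"},
    {"label": 0, "headline": "a"},  # duplicate headline, dropped
]

unique_rows = drop_duplicate_headlines(rows)
print(len(rows) - len(unique_rows))  # number of rows removed
```

In the notebook the same difference is 28619 - 28503 = 116 removed rows.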

In [70]:
dataset_hf.tail()

| | label | headline |
|---|---|---|
| 28498 | 1 | jews to celebrate rosh hashasha or something |
| 28499 | 1 | internal affairs investigator disappointed con... |
| 28500 | 0 | the most beautiful acceptance speech this week... |
| 28501 | 1 | mars probe destroyed by orbiting spielberg-gat... |
| 28502 | 1 | dad clarifies this not a food stop |


In [71]:
dataset_hf = datasets.Dataset.from_pandas(dataset_hf)

In [72]:
# SPLIT DATASET.
# 80-20 SPLIT: 80% FOR TRAIN, 20% HELD OUT FOR TEST + VAL.

train_test_split = dataset_hf.train_test_split(test_size=0.2)
train_test_split

DatasetDict({
    train: Dataset({
        features: ['label', 'headline'],
        num_rows: 22802
    })
    test: Dataset({
        features: ['label', 'headline'],
        num_rows: 5701
    })
})

In [73]:
# NOW SPLIT THE HELD-OUT 20% EVENLY INTO TEST & VAL.

test_val_split = train_test_split["test"].train_test_split(test_size=0.5)
test_val_split

DatasetDict({
    train: Dataset({
        features: ['label', 'headline'],
        num_rows: 2850
    })
    test: Dataset({
        features: ['label', 'headline'],
        num_rows: 2851
    })
})

In [74]:
dataset_hf = datasets.DatasetDict({
    'train': train_test_split['train'],
    'test': test_val_split['train'],
    'val': test_val_split['test'],
})

dataset_hf

DatasetDict({
    train: Dataset({
        features: ['label', 'headline'],
        num_rows: 22802
    })
    test: Dataset({
        features: ['label', 'headline'],
        num_rows: 2850
    })
    val: Dataset({
        features: ['label', 'headline'],
        num_rows: 2851
    })
})
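
The split sizes printed above can be reproduced with a little arithmetic. The sketch below assumes the test side of a fractional split is rounded up and the train side takes the remainder (the convention scikit-learn uses); `split_sizes` is a hypothetical helper, and the assumption is only checked here against the counts in this notebook.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test side is rounded up and the train side takes
    whatever is left.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# First split: 80% train, 20% held out.
n_train, n_holdout = split_sizes(28503, 0.2)
# Second split: the notebook maps the "train" half to 'test' and the "test" half to 'val'.
n_test, n_val = split_sizes(n_holdout, 0.5)
print(n_train, n_test, n_val)
```

Neither call to `train_test_split` in the notebook sets a seed, so the rows assigned to each split change between runs; passing the method's `seed` argument makes the split reproducible.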

## **Tokenization**

The model used here is **'distilbert-base-uncased'**.  
See the [model card on HuggingFace](https://huggingface.co/distilbert/distilbert-base-uncased) for more details.

In [75]:
# INIT THE TOKENIZER.

tokenizer = transformers.DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
tokenizer.model_max_length = 512

In [76]:
# DEFINE A FUNCTION THAT TOKENIZES A BATCH OF HEADLINES.

def _tokenize(input_batch):
    return tokenizer(input_batch["headline"], truncation=True, max_length=512)

In [77]:
# MAP THE DATASET WITH THE TOKENIZER.

tokenized_dataset_hf = dataset_hf.map(_tokenize, batched=True)
tokenized_dataset_hf

Map:   0%|          | 0/22802 [00:00<?, ? examples/s]

Map:   0%|          | 0/2850 [00:00<?, ? examples/s]

Map:   0%|          | 0/2851 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['label', 'headline', 'input_ids', 'attention_mask'],
        num_rows: 22802
    })
    test: Dataset({
        features: ['label', 'headline', 'input_ids', 'attention_mask'],
        num_rows: 2850
    })
    val: Dataset({
        features: ['label', 'headline', 'input_ids', 'attention_mask'],
        num_rows: 2851
    })
})

In [78]:
# SET THE FORMAT TO TORCH.

tokenized_dataset_hf.set_format('torch', columns=["input_ids", "attention_mask", "label"])

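**DataCollatorWithPadding**
- Forms a batch from a list of dataset examples.
- Pads dynamically. By default (`padding=True`), each batch is padded to the length of the longest example in that batch, not to a fixed global length.
- Renames a `label` column to `labels`, which is the argument name the model's `forward` expects.
- It does not mask or augment the data; random masking is the job of `DataCollatorForLanguageModeling`.

Read more about data collators [here](https://huggingface.co/docs/transformers/en/main_classes/data_collator).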
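
The padding step can be sketched in plain Python, assuming each batch is right-padded to its own longest sequence. `pad_batch` is a hypothetical helper written for illustration, not the library's implementation; the pad id 0 is the `[PAD]` id of this tokenizer, and the other token ids are made up.

```python
PAD_TOKEN_ID = 0  # id of [PAD] in the distilbert-base-uncased vocabulary


def pad_batch(examples, pad_token_id=PAD_TOKEN_ID):
    """Right-pad every example to the longest sequence in this batch."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": []}
    for ex in examples:
        n_pad = max_len - len(ex["input_ids"])
        batch["input_ids"].append(ex["input_ids"] + [pad_token_id] * n_pad)
        # Padding positions get mask 0 so attention ignores them.
        batch["attention_mask"].append(ex["attention_mask"] + [0] * n_pad)
    return batch


# Made-up token ids; only 101 ([CLS]) and 102 ([SEP]) are real special ids.
examples = [
    {"input_ids": [101, 2001, 2002, 2003, 102], "attention_mask": [1, 1, 1, 1, 1]},
    {"input_ids": [101, 2004, 102], "attention_mask": [1, 1, 1]},
]

batch = pad_batch(examples)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because headlines are short, padding per batch keeps tensors far smaller than padding every example to the 512-token maximum.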

In [79]:
# INIT THE DATA COLLATOR (PADS EACH BATCH DYNAMICALLY).

collator = transformers.DataCollatorWithPadding(tokenizer=tokenizer)
collator

DataCollatorWithPadding(tokenizer=DistilBertTokenizer(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}, padding=True, max_length=None, pad_to_multiple_of=None, return_tensors='pt')

## **Defining a Custom Model**

In [91]:
class SarcasmDetectionCustomModel(torch.nn.Module):

    def __init__(self, model, num_labels):

        super(SarcasmDetectionCustomModel, self).__init__()

        self.num_labels = num_labels
        self.model = model

        # NEW LAYER.
        self.dropout = torch.nn.Dropout(0.1)
        self.classifier = torch.nn.Linear(768, num_labels)

    def forward(self, input_ids=None, attention_mask=None, labels=None):

        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)

        # SHAPE: (batch_size, seq_len, 768).
        last_hidden_state = outputs[0]

        sequence_outputs = self.dropout(last_hidden_state)

        # CLASSIFY FROM THE HIDDEN STATE OF THE FIRST ([CLS]) TOKEN.
        logits = self.classifier(sequence_outputs[:, 0, :].view(-1, 768))

        loss=None
        if labels is not None:
            loss_func = torch.nn.CrossEntropyLoss()
            loss = loss_func(logits.view(-1, self.num_labels), labels.view(-1))

        return transformers.modeling_outputs.SequenceClassifierOutput(loss=loss, logits=logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions)
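
To see the shapes flowing through the new head without downloading the pre-trained encoder, here is a small sketch that feeds a random tensor in place of the encoder output. `ClassificationHead` is a hypothetical stand-alone version of the dropout and linear layers above, and the batch of 4 sequences with 12 tokens each is made up.

```python
import torch


class ClassificationHead(torch.nn.Module):
    """Dropout + linear layer applied to the first token's hidden state."""

    def __init__(self, hidden_size, num_labels, dropout=0.1):
        super().__init__()
        self.dropout = torch.nn.Dropout(dropout)
        self.classifier = torch.nn.Linear(hidden_size, num_labels)

    def forward(self, last_hidden_state):
        cls_state = last_hidden_state[:, 0, :]  # (batch, hidden_size)
        return self.classifier(self.dropout(cls_state))  # (batch, num_labels)


# Random tensor standing in for the encoder output: 4 sequences of 12 tokens.
last_hidden_state = torch.randn(4, 12, 768)
labels = torch.tensor([0, 1, 1, 0])

head = ClassificationHead(hidden_size=768, num_labels=2)
logits = head(last_hidden_state)
loss = torch.nn.CrossEntropyLoss()(logits, labels)

print(logits.shape)  # torch.Size([4, 2])
print(loss.item())
```

Only the first position of each sequence reaches the linear layer, so the head's cost does not depend on sequence length.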

## **Training**

In [92]:
tokenized_dataset_hf

DatasetDict({
    train: Dataset({
        features: ['label', 'headline', 'input_ids', 'attention_mask'],
        num_rows: 22802
    })
    test: Dataset({
        features: ['label', 'headline', 'input_ids', 'attention_mask'],
        num_rows: 2850
    })
    val: Dataset({
        features: ['label', 'headline', 'input_ids', 'attention_mask'],
        num_rows: 2851
    })
})

In [93]:
train_dl = torch.utils.data.DataLoader(
    tokenized_dataset_hf["train"],
    shuffle=True,
    batch_size=32,
    collate_fn=collator
)

# NO batch_size IS SET, SO VALIDATION USES THE DEFAULT BATCH SIZE OF 1.
val_dl = torch.utils.data.DataLoader(
    tokenized_dataset_hf["val"],
    shuffle=False,
    collate_fn=collator
)

In [100]:
len(tokenized_dataset_hf["train"])

22802

In [101]:
len(tokenized_dataset_hf["train"])/32

712.5625

In [99]:
len(train_dl)

713
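
The batch count is the example count divided by the batch size, rounded up, because `DataLoader` keeps the final, smaller batch by default (`drop_last=False`). A short sketch of that arithmetic, with `num_batches` as a hypothetical helper:

```python
import math


def num_batches(num_examples, batch_size):
    """Number of batches when the last, smaller batch is kept."""
    return math.ceil(num_examples / batch_size)


train_batches = num_batches(22802, 32)
val_batches = num_batches(2851, 1)  # DataLoader's default batch_size is 1
print(train_batches, val_batches)
```

The last training batch holds 22802 - 712 * 32 = 18 examples. Over 3 epochs this gives 3 * 713 = 2139 training steps and 3 * 2851 = 8553 validation steps, which are the totals shown by the two progress bars in the training cell.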

In [94]:
# INIT MODEL & CONFIG.

config = transformers.DistilBertConfig.from_pretrained("distilbert-base-uncased", output_attentions=True, output_hidden_states=True)

model = transformers.DistilBertModel.from_pretrained("distilbert-base-uncased", config=config)



In [95]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

custom_model_obj = SarcasmDetectionCustomModel(model=model, num_labels=2).to(device)

In [96]:
# DEFINE OPTIMIZER (PYTORCH'S AdamW IMPLEMENTATION).
optimizer = torch.optim.AdamW(custom_model_obj.parameters(), lr=5e-5)

# DEFINE THE NUMBER OF TRAINING EPOCHS.
num_epoch = 3

# DEFINE THE NUMBER OF TRAINING STEPS (BASED ON BATCH SIZE)
num_training_steps = num_epoch * len(train_dl)

# DEFINE A LEARNING RATE SCHEDULER.
lr_scheduler = transformers.get_scheduler('linear', optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)
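
With zero warmup steps, a linear schedule simply decays the learning rate from its starting value to 0 over the training steps. The sketch below writes that rule out as a plain function; `linear_lr` is a hypothetical helper that reflects my understanding of the `'linear'` schedule, not the library's code.

```python
def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)


num_training_steps = 3 * 713  # epochs * batches per epoch = 2139

# Learning rate at the start of each epoch and at the very end.
for step in (0, 713, 1426, 2139):
    print(step, linear_lr(step, 5e-5, num_training_steps))
```

The rate starts at 5e-5, falls to two thirds of that after the first epoch, and reaches 0 at step 2139.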



In [97]:
# DEFINE EVAL METRIC.

metric = datasets.load_metric("f1")

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


In [98]:
progress_bar_train = tqdm(range(num_training_steps))
progress_bar_eval = tqdm(range(num_epoch * len(val_dl) ))


for epoch in range(num_epoch):
    custom_model_obj.train()
    for batch in train_dl:
        batch = {k:v.to(device) for k, v in batch.items()}
        outputs = custom_model_obj(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar_train.update(1)

    custom_model_obj.eval()
    for batch in val_dl:
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = custom_model_obj(**batch)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim = -1 )
        metric.add_batch(predictions = predictions, references = batch['labels'] )
        progress_bar_eval.update(1)

    print(metric.compute())

  0%|          | 0/2139 [00:00<?, ?it/s]

  0%|          | 0/8553 [00:00<?, ?it/s]

{'f1': 0.9231878831590336}
{'f1': 0.925253991291727}
{'f1': 0.9242367131549191}
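
For reference, the F1 score reported here combines precision and recall for the positive class: F1 = 2·TP / (2·TP + FP + FN). The sketch below computes it by hand, assuming label 1 (sarcastic) is the positive class; `binary_f1` is a hypothetical helper and the predictions are made up.

```python
def binary_f1(predictions, references, positive_label=1):
    """F1 = 2*TP / (2*TP + FP + FN) for the positive class."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive_label and r == positive_label for p, r in pairs)
    fp = sum(p == positive_label and r != positive_label for p, r in pairs)
    fn = sum(p != positive_label and r == positive_label for p, r in pairs)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Made-up predictions and labels for illustration only.
predictions = [1, 0, 1, 1, 0, 0]
references = [1, 0, 0, 1, 1, 0]

score = binary_f1(predictions, references)
print(score)
```

In this toy example there are 2 true positives, 1 false positive and 1 false negative, so F1 = 4 / 6.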


## **Evaluation**

In [102]:
# LOAD THE TEST DATA USING TORCH DATA LOADER.

test_dl = torch.utils.data.DataLoader(
    tokenized_dataset_hf["test"],
    batch_size=32,
    collate_fn=collator
)

len(test_dl)

90

In [None]:
# REDEFINE EVAL METRIC.

metric = datasets.load_metric("f1")

In [105]:
# RUN EVALUATION

custom_model_obj.eval()
for batch in test_dl:
    batch = {k:v.to(device) for k,v in batch.items()}
    with torch.no_grad():
        outputs = custom_model_obj(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch['labels'])

metric.compute()

{'f1': 0.9182058047493403}