# Introduction

Sarcasm is a sophisticated linguistic phenomenon that causes considerable confusion for existing sentiment classification systems.     
Sarcasm detection, the task of predicting whether a given text contains sarcasm, has therefore received much research attention.     

Many methods have recently been proposed for sarcasm detection; they can be broadly classified into two categories.     
The first is the text-only approach, which concentrates on the utterance itself, for example by exploiting incongruous expressions to detect sarcastic text.     
The second relies on extra information, exploiting external knowledge such as user history and common-sense knowledge to assist the detection procedure.

We propose an unsupervised sarcasm detection method.     

First, we leverage external sentiment knowledge to mask prominent tokens. The masked texts are then fed into a pre-trained generation model, which follows the remaining logical structure to generate text.     
There is a good chance that these reborn texts are no longer sarcastic or make more sense.     

Second, after obtaining the similarity scores between the generated sentences and the original one, we extract features from those scores to decide whether a sentence is sarcastic.     

Finally, we construct several unsupervised baselines and conduct experiments on the IAC-V2 dataset.

# Imports and Reading Data

In [1]:
!pip install senticnet

Collecting senticnet
  Downloading senticnet-1.6-py3-none-any.whl (51.9 MB)
Installing collected packages: senticnet
Successfully installed senticnet-1.6


In [2]:
import numpy as np
import pandas as pd

from senticnet.senticnet import SenticNet

import nltk
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from transformers import AutoTokenizer, AutoModel
from transformers import BartTokenizer, BartForConditionalGeneration
import torch

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import accuracy_score, precision_score, f1_score
from sklearn.metrics import confusion_matrix

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
df = pd.read_csv("/content/drive/My Drive/AlifResearch/sarcasm_v2/HYP-sarc-notsarc.csv")

In [5]:
df= df.drop('id', axis= 1)
df

Unnamed: 0,class,text
0,notsarc,have no predators to fear? check. who said we ...
1,notsarc,2 hours? damn! that book took me a good 2 day...
2,notsarc,you never played myst? damn!!! i must be reall...
3,notsarc,"Well, if Genesis was in fact true, then we wou..."
4,notsarc,Just making sure that everybody is aware of hi...
...,...,...
1159,sarc,you really believed me? wow! i never knew i ha...
1160,sarc,please tell me you're kidding. these bowling b...
1161,sarc,you're kidding. just because your life is 'a f...
1162,sarc,the evidence that is provided to you is not en...


# Understanding Data

In [6]:
df.dtypes

class    object
text     object
dtype: object

In [7]:
df.columns

Index(['class', 'text'], dtype='object')

In [8]:
text_data_original = list(df['text'])
text_data = [x.lower() for x in text_data_original]
print(*text_data, sep = "\n")

have no predators to fear? check. who said we don't have predators? we have predators. a human's predator is another human. they may not haunt you for food but loads of them will be more than happy to kill you for many other reasons.  in the wild you can also be eaten by all sorts of other predators. you see, when you look at it from the other side, our animal predators have no problem at all eating us... your response to this so far has been 'yes but animals don't understand that it is wrong'. in other words, animals don't have the mental capacity to see death from the same 'morally unacceptable' perspective you see it from. and yet you are still trying to project your hung ups about death onto them.
2 hours? damn!  that book took me a good 2 days, it was a good book but i found it extremely slow. then again i never like books i have to read for school =)
you never played myst? damn!!! i must be really old ;)
well, if genesis was in fact true, then we would have to assume that god is 

In [9]:
label_data = list(df['class'])
print(*label_data, sep = "\n")

notsarc
notsarc
notsarc
notsarc
notsarc
...


# Overview

The proposed framework contains three main components:     

1) Sentences mask and generation.     
This procedure first recognizes the main components of a sentence, which are masked so as to have more impact on the original sentence, and then carries out text generation;     

2) Sentences representation.     
It computes dense vector representations of the sentences;     

3) Sarcastic utterances detection.     
It leverages the similarity scores between original and regenerated sentences to detect whether an utterance is sarcastic.

# Sentences Mask and Generation
## 1)
"First, we use the sentiment common knowledge retrieved from SenticNet to recognize affective words in the sentence 𝑥,     
and split those words into two sets according to its sentiment polarities:    
PW = {pw1, pw2, ..., pwh} and    
NW = {nw1, nw2, ..., nwk},     
h + k <= n."
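
As a small illustration of the split described above, the sketch below partitions the tokens of one toy sentence into PW and NW. The `TOY_POLARITY` dictionary and the `split_by_polarity` helper are hypothetical stand-ins for the SenticNet lookup, so the labels are assumptions made for this example, not SenticNet's actual entries.

```python
# Hypothetical polarity lexicon standing in for SenticNet (assumed labels).
TOY_POLARITY = {
    "love": "positive",
    "great": "positive",
    "stuck": "negative",
    "traffic": "negative",
}

def split_by_polarity(tokens, lexicon):
    """Return (PW, NW): the positive and negative affective words among tokens."""
    pw, nw = set(), set()
    for tok in tokens:
        label = lexicon.get(tok.lower(), "neutral")
        if label == "positive":
            pw.add(tok.lower())
        elif label == "negative":
            nw.add(tok.lower())
    return pw, nw

tokens = "i love being stuck in traffic".split()
PW, NW = split_by_polarity(tokens, TOY_POLARITY)
print(sorted(PW))  # ['love']
print(sorted(NW))  # ['stuck', 'traffic']

# h + k <= n holds because every token lands in at most one of the two sets.
assert len(PW) + len(NW) <= len(tokens)
```

Words the lexicon does not know are treated as neutral and end up in neither set, which is why h + k can be strictly smaller than n.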

In [10]:
def tokenize_sentence(sentence):
    tokens = word_tokenize(sentence)

    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens

In [11]:
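# Load SenticNet once; rebuilding it for every word lookup is unnecessarily slow.
_SENTICNET = SenticNet()

def get_sentiment_polarity_from_senticnet(word):
    word = word.lower()

    try:
        return _SENTICNET.polarity_label(word)
    except Exception:
        # Words that cannot be looked up in SenticNet are treated as neutral.
        return "neutral"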

In [12]:
def analyze_sentiment(sentences):
    positive_words = []
    negative_words = []

    for sentence in sentences:
        words = tokenize_sentence(sentence)

        PW = set()
        NW = set()

        for word in words:
            sentiment_polarity = get_sentiment_polarity_from_senticnet(word)
            if sentiment_polarity == "positive":
                PW.add(word.lower())
            elif sentiment_polarity == "negative":
                NW.add(word.lower())

        positive_words.append(PW)
        negative_words.append(NW)

    return positive_words, negative_words

In [13]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [14]:
positive_words, negative_words = analyze_sentiment(text_data)

for i, sentence in enumerate(text_data):
    print(f"Sentence: {sentence}")
    print(f"Positive Words: {positive_words[i]}")
    print(f"Negative Words: {negative_words[i]}")
    print("- - - - - - - - - -")

Sentence: have no predators to fear? check. who said we don't have predators? we have predators. a human's predator is another human. they may not haunt you for food but loads of them will be more than happy to kill you for many other reasons.  in the wild you can also be eaten by all sorts of other predators. you see, when you look at it from the other side, our animal predators have no problem at all eating us... your response to this so far has been 'yes but animals don't understand that it is wrong'. in other words, animals don't have the mental capacity to see death from the same 'morally unacceptable' perspective you see it from. and yet you are still trying to project your hung ups about death onto them.
Positive Words: {'perspective', 'mental', 'understand', 'food', 'capacity', 'eating', 'project', 'happy', 'reason', 'wild', 'check'}
Negative Words: {'predator', 'problem', 'kill', 'unacceptable', 'fear', 'wrong', 'death', 'load', 'response', 'haunt'}
- - - - - - - - - -
Sentenc

In [15]:
df["PW"] = positive_words
df["NW"] = negative_words
df

Unnamed: 0,class,text,PW,NW
0,notsarc,have no predators to fear? check. who said we ...,"{perspective, mental, understand, food, capaci...","{predator, problem, kill, unacceptable, fear, ..."
1,notsarc,2 hours? damn! that book took me a good 2 day...,{good},"{slow, damn}"
2,notsarc,you never played myst? damn!!! i must be reall...,{played},"{old, damn}"
3,notsarc,"Well, if Genesis was in fact true, then we wou...","{vengeful, diety, changing, realize, compass, ...","{torment, stretch, damn, interpretation, qualify}"
4,notsarc,Just making sure that everybody is aware of hi...,"{good, indicator, fairness, willfully, supersp...","{dishonesty, unfortunately, evasion, creationi..."
...,...,...,...,...
1159,sarc,you really believed me? wow! i never knew i ha...,"{power, wow}",{}
1160,sarc,please tell me you're kidding. these bowling b...,{bowling},{flat}
1161,sarc,you're kidding. just because your life is 'a f...,"{breezy, breeze, doe}",{}
1162,sarc,the evidence that is provided to you is not en...,"{lol, nice, particular}","{ethnocentrism, buddy}"


## 2)
"Second, we analyze the sentence to get its syntax information to identify non-stop words     
     𝑆𝑊 = {𝑠𝑤1, 𝑠𝑤2, ..., 𝑠𝑤𝑚, 𝑚 ≤ 𝑛}.     
Intuitively, these words are the main components of sentences. Then we split 𝑆𝑊 into two sets which satisfy :     
     𝑆𝑊1 ∪ 𝑆𝑊2 = 𝑆𝑊 ,     
     |𝑆𝑊1| = |𝑆𝑊2|."
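
The split into 𝑆𝑊1 and 𝑆𝑊2 can be sketched on a toy sentence. The example below is only an illustration under stated assumptions: `TOY_STOP_WORDS` is a tiny hand-written list rather than NLTK's, and the helper names are hypothetical. It deduplicates the non-stop words before halving them, so that the two halves are disjoint and their sizes differ by at most one.

```python
# Tiny stop-word list used only for this illustration (assumption; a real
# run would use a full English stop-word list).
TOY_STOP_WORDS = {"i", "a", "the", "to", "in", "is", "it", "was", "but", "me"}

def non_stop_words(sentence, stop_words):
    """Lower-cased alphabetic non-stop words, in order of first appearance."""
    seen, result = set(), []
    for raw in sentence.lower().split():
        word = raw.strip(".,!?;:'\"()")
        if word.isalpha() and word not in stop_words and word not in seen:
            seen.add(word)
            result.append(word)
    return result

def split_in_half(words):
    """Split an ordered list of unique words into two disjoint halves SW1, SW2."""
    mid = len(words) // 2
    return set(words[:mid]), set(words[mid:])

sw = non_stop_words("The book took me two long days, but it was a good book.", TOY_STOP_WORDS)
SW1, SW2 = split_in_half(sw)
print(sw)           # ['book', 'took', 'two', 'long', 'days', 'good']
print(sorted(SW1))  # ['book', 'took', 'two']
print(sorted(SW2))  # ['days', 'good', 'long']
```

If each half were converted to a set without deduplicating first, a word repeated in both halves of the sentence would appear in both sets, and the equal-size condition would no longer be guaranteed.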

In [16]:
def extract_non_stop_words(sentence):
    words = nltk.word_tokenize(sentence)

    stop_words = set(stopwords.words("english"))

    non_stop_words = [word.lower() for word in words if word.lower() not in stop_words and word.isalpha()]

    return non_stop_words

In [17]:
def split_non_stop_words(non_stop_words):
    m = len(non_stop_words)
    m1 = m // 2
    SW1 = set(non_stop_words[:m1])
    SW2 = set(non_stop_words[m1:])
    return SW1, SW2

In [18]:
def analyze_sentences(sentences):
    all_SW1 = []
    all_SW2 = []

    for sentence in sentences:
        non_stop_words = extract_non_stop_words(sentence)
        SW1, SW2 = split_non_stop_words(non_stop_words)
        all_SW1.append(SW1)
        all_SW2.append(SW2)

    return all_SW1, all_SW2

In [19]:
all_SW1, all_SW2 = analyze_sentences(text_data)

for i, sentence in enumerate(text_data):
    print(f"Sentence: {sentence}")
    print(f"SW1: {all_SW1[i]}")
    print(f"SW2: {all_SW2[i]}")
    print("- - - - - - - - - -")

Sentence: have no predators to fear? check. who said we don't have predators? we have predators. a human's predator is another human. they may not haunt you for food but loads of them will be more than happy to kill you for many other reasons.  in the wild you can also be eaten by all sorts of other predators. you see, when you look at it from the other side, our animal predators have no problem at all eating us... your response to this so far has been 'yes but animals don't understand that it is wrong'. in other words, animals don't have the mental capacity to see death from the same 'morally unacceptable' perspective you see it from. and yet you are still trying to project your hung ups about death onto them.
SW1: {'wild', 'check', 'loads', 'human', 'many', 'eaten', 'food', 'said', 'another', 'side', 'may', 'predator', 'reasons', 'predators', 'also', 'see', 'look', 'kill', 'sorts', 'happy', 'fear', 'haunt'}
SW2: {'far', 'perspective', 'mental', 'words', 'unacceptable', 'problem', 'an

In [20]:
df["SW1"] = all_SW1
df["SW2"] = all_SW2
df

Unnamed: 0,class,text,PW,NW,SW1,SW2
0,notsarc,have no predators to fear? check. who said we ...,"{perspective, mental, understand, food, capaci...","{predator, problem, kill, unacceptable, fear, ...","{wild, check, loads, human, many, eaten, food,...","{far, perspective, mental, words, unacceptable..."
1,notsarc,2 hours? damn! that book took me a good 2 day...,{good},"{slow, damn}","{book, damn, hours, took, good, days}","{slow, found, school, extremely, books, never,..."
2,notsarc,you never played myst? damn!!! i must be reall...,{played},"{old, damn}","{myst, played, never}","{must, old, really, damn}"
3,notsarc,"Well, if Genesis was in fact true, then we wou...","{vengeful, diety, changing, realize, compass, ...","{torment, stretch, damn, interpretation, qualify}","{one, interpretation, explains, well, torment,...","{exactly, actual, demonstrated, stretch, venge..."
4,notsarc,Just making sure that everybody is aware of hi...,"{good, indicator, fairness, willfully, supersp...","{dishonesty, unfortunately, evasion, creationi...","{time, illustration, good, stop, sure, would, ...","{keeps, seen, anyway, uneducated, willfully, s..."
...,...,...,...,...,...,...
1159,sarc,you really believed me? wow! i never knew i ha...,"{power, wow}",{},"{believed, wow, really}","{never, power, knew}"
1160,sarc,please tell me you're kidding. these bowling b...,{bowling},{flat},"{tell, please, kidding}","{must, flat, balls, bowling}"
1161,sarc,you're kidding. just because your life is 'a f...,"{breezy, breeze, doe}",{},"{breeze, fricken, people, breezy, mean, life, ...","{lately, live, either, even, life, news, anoth..."
1162,sarc,the evidence that is provided to you is not en...,"{lol, nice, particular}","{ethnocentrism, buddy}","{provided, nice, ethnocentrism, evidence, budd...","{assuming, right, wrote, lol, ranged, kidding,..."


## 3)
"Here, 𝑃𝑊 ∪ 𝑆𝑊1 and 𝑁𝑊 ∪ 𝑆𝑊2 are used to mask original sentence respectively. So, we will obtain two masked sentences     
𝑥𝑚1 = { [𝑚]1, 𝑥2, ..., [𝑚]𝑛} and     
𝑥𝑚2 = {𝑥1, [𝑚]2, ..., 𝑥𝑛}."
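
To make the two masked views concrete, here is a minimal sketch on a toy sentence. `mask_tokens` is a hypothetical helper written for this example, and the two word sets are made up. Stripping surrounding punctuation before matching is an assumption of this sketch, not something the quoted description specifies.

```python
import string

def mask_tokens(sentence, mask_words, mask_token="<mask>", max_masks=5):
    """Mask up to max_masks whitespace tokens whose core word is in mask_words."""
    out, used = [], 0
    for tok in sentence.split():
        # Compare the word without its leading/trailing punctuation.
        core = tok.strip(string.punctuation).lower()
        if used < max_masks and core and core in mask_words:
            # Keep the punctuation around the word and mask only the word itself.
            out.append(tok.lower().replace(core, mask_token, 1))
            used += 1
        else:
            out.append(tok)
    return " ".join(out)

sentence = "i love being stuck in traffic, great!"
pw_sw1 = {"love", "great", "being"}   # toy stand-in for PW ∪ SW1
nw_sw2 = {"stuck", "traffic"}         # toy stand-in for NW ∪ SW2

x_m1 = mask_tokens(sentence, pw_sw1)
x_m2 = mask_tokens(sentence, nw_sw2)
print(x_m1)  # i <mask> <mask> stuck in traffic, <mask>!
print(x_m2)  # i love being <mask> in <mask>, great!
```

Each view hides a different part of the sentence, so the generation model has to reconstruct the positive side from the negative context and vice versa.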

In [21]:
def construct_union(sentences, PW, NW, all_SW1, all_SW2):
    union_PW_SW1 = []
    union_NW_SW2 = []

    for i, sentence in enumerate(sentences):
        SW1 = all_SW1[i]
        SW2 = all_SW2[i]

        union_PW_SW1.append(PW[i].union(SW1))
        union_NW_SW2.append(NW[i].union(SW2))

    return union_PW_SW1, union_NW_SW2

In [22]:
union_PW_SW1, union_NW_SW2 = construct_union(text_data, positive_words, negative_words, all_SW1, all_SW2)
print(union_PW_SW1)
print(union_NW_SW2)



In [23]:
df["union_PW_SW1"] = union_PW_SW1
df["union_NW_SW2"] = union_NW_SW2
df

Unnamed: 0,class,text,PW,NW,SW1,SW2,union_PW_SW1,union_NW_SW2
0,notsarc,have no predators to fear? check. who said we ...,"{perspective, mental, understand, food, capaci...","{predator, problem, kill, unacceptable, fear, ...","{wild, check, loads, human, many, eaten, food,...","{far, perspective, mental, words, unacceptable...","{perspective, mental, wild, check, loads, huma...","{far, perspective, mental, words, unacceptable..."
1,notsarc,2 hours? damn! that book took me a good 2 day...,{good},"{slow, damn}","{book, damn, hours, took, good, days}","{slow, found, school, extremely, books, never,...","{book, days, damn, hours, good, took}","{slow, found, damn, school, like, extremely, b..."
2,notsarc,you never played myst? damn!!! i must be reall...,{played},"{old, damn}","{myst, played, never}","{must, old, really, damn}","{myst, played, never}","{must, old, damn, really}"
3,notsarc,"Well, if Genesis was in fact true, then we wou...","{vengeful, diety, changing, realize, compass, ...","{torment, stretch, damn, interpretation, qualify}","{one, interpretation, explains, well, torment,...","{exactly, actual, demonstrated, stretch, venge...","{one, interpretation, actual, explains, well, ...","{history, diety, damn, effect, exactly, interp..."
4,notsarc,Just making sure that everybody is aware of hi...,"{good, indicator, fairness, willfully, supersp...","{dishonesty, unfortunately, evasion, creationi...","{time, illustration, good, stop, sure, would, ...","{keeps, seen, anyway, uneducated, willfully, s...","{dishonesty, time, illustration, evasion, damn...","{dishonesty, plenty, evasion, keeps, creationi..."
...,...,...,...,...,...,...,...,...
1159,sarc,you really believed me? wow! i never knew i ha...,"{power, wow}",{},"{believed, wow, really}","{never, power, knew}","{really, believed, power, wow}","{never, power, knew}"
1160,sarc,please tell me you're kidding. these bowling b...,{bowling},{flat},"{tell, please, kidding}","{must, flat, balls, bowling}","{tell, please, kidding, bowling}","{must, flat, balls, bowling}"
1161,sarc,you're kidding. just because your life is 'a f...,"{breezy, breeze, doe}",{},"{breeze, fricken, people, breezy, mean, life, ...","{lately, live, either, even, life, news, anoth...","{breeze, fricken, people, breezy, mean, life, ...","{lately, live, either, even, life, news, anoth..."
1162,sarc,the evidence that is provided to you is not en...,"{lol, nice, particular}","{ethnocentrism, buddy}","{provided, nice, ethnocentrism, evidence, budd...","{assuming, right, wrote, lol, ranged, kidding,...","{provided, nice, lol, ethnocentrism, evidence,...","{assuming, right, wrote, lol, ethnocentrism, r..."


In [24]:
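def mask_sentence(sentence, mask_words, max_mask_count = 5):
    """Replace up to max_mask_count whitespace-separated tokens found in mask_words with <mask>.

    Tokens are compared exactly as split, so a word with attached punctuation
    (for example "fear?") does not match the entry "fear".
    """
    masked_sentence = []

    for word in sentence.split():
        if word in mask_words and max_mask_count > 0:
            masked_sentence.append("<mask>")
            max_mask_count -= 1
        else:
            masked_sentence.append(word)

    return " ".join(masked_sentence)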

In [25]:
def construct_masked_sentences(sentences, union_PW_SW1, union_NW_SW2):
    masked_pos_sentences = []
    masked_neg_sentences = []

    for i, sentence in enumerate(sentences):

        masked_pos_sentence = mask_sentence(sentence, union_PW_SW1[i])
        masked_pos_sentences.append(masked_pos_sentence)

        masked_neg_sentence = mask_sentence(sentence, union_NW_SW2[i])
        masked_neg_sentences.append(masked_neg_sentence)

    return masked_pos_sentences, masked_neg_sentences

In [26]:
masked_pos_sentences, masked_neg_sentences = construct_masked_sentences(text_data, union_PW_SW1, union_NW_SW2)

for i, sentence in enumerate(text_data):
    print(f"Original Sentence: {sentence}")
    print(f"Masked Positive Sentence: {masked_pos_sentences[i]}")
    print(f"Masked Negative Sentence: {masked_neg_sentences[i]}")
    print("- - - - - - - - - -")

Original Sentence: have no predators to fear? check. who said we don't have predators? we have predators. a human's predator is another human. they may not haunt you for food but loads of them will be more than happy to kill you for many other reasons.  in the wild you can also be eaten by all sorts of other predators. you see, when you look at it from the other side, our animal predators have no problem at all eating us... your response to this so far has been 'yes but animals don't understand that it is wrong'. in other words, animals don't have the mental capacity to see death from the same 'morally unacceptable' perspective you see it from. and yet you are still trying to project your hung ups about death onto them.
Masked Positive Sentence: have no <mask> to fear? check. who <mask> we don't have predators? we have predators. a human's <mask> is <mask> human. they <mask> not haunt you for food but loads of them will be more than happy to kill you for many other reasons. in the wild

In [27]:
dfnew = pd.DataFrame({"text": text_data_original, "maskedPosSentence": masked_pos_sentences, "maskedNegSentence": masked_neg_sentences})
dfnew

Unnamed: 0,text,maskedPosSentence,maskedNegSentence
0,have no predators to fear? check. who said we ...,have no <mask> to fear? check. who <mask> we d...,have no <mask> to fear? check. who said we don...
1,2 hours? damn! that book took me a good 2 day...,2 hours? damn! that <mask> <mask> me a <mask> ...,2 hours? damn! that book took me a good 2 days...
2,you never played myst? damn!!! i must be reall...,you <mask> <mask> myst? damn!!! i must be real...,you never played myst? damn!!! i <mask> be <ma...
3,"Well, if Genesis was in fact true, then we wou...","well, if <mask> was in <mask> true, then we <m...","well, if genesis was in fact true, then we <ma..."
4,Just making sure that everybody is aware of hi...,just <mask> <mask> that <mask> is <mask> of hi...,just making sure that everybody is aware of hi...
...,...,...,...
1159,you really believed me? wow! i never knew i ha...,you <mask> <mask> me? wow! i never knew i had ...,you really believed me? wow! i <mask> <mask> i...
1160,please tell me you're kidding. these bowling b...,<mask> <mask> me you're kidding. these <mask> ...,please tell me you're kidding. these <mask> <m...
1161,you're kidding. just because your life is 'a f...,you're kidding. just because your <mask> is 'a...,you're kidding. just because your <mask> is 'a...
1162,the evidence that is provided to you is not en...,the <mask> that is <mask> to you is not enough...,the evidence that is provided to you is not en...


## 4)
"These two masked sentences are fed into the pre-trained generation model to fulfill the generation procedure.     
𝑨{𝑎1, ..., 𝑥2, ..., 𝑥𝑛−1, ..., 𝑎𝑜 } = 𝐵𝐴𝑅𝑇 ( [𝑚]1, 𝑥2, ..., 𝑥𝑛−1, [𝑚]𝑛 )----(1)  
Thus, we will obtain two reborn sentences     
𝐴 = {𝑎1, 𝑎2, ..., 𝑎𝑜 } and     
𝐵 = {𝑏1, 𝑏2, ..., 𝑏𝑝 }."
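
The generation step in equation (1) can be sketched as below. This is a hedged illustration, not the paper's implementation: `generation_kwargs` and `regenerate` are hypothetical helpers, and the `slack` and `num_beams` values are arbitrary assumptions. The idea is to let the output grow to roughly the length of the masked input, since a short default length limit would otherwise cut reborn sentences off early.

```python
def generation_kwargs(num_input_tokens, slack=10, num_beams=4):
    """Generation settings that allow an output about as long as the input.

    The slack and beam values are illustrative assumptions, not values
    taken from the paper.
    """
    return {"max_new_tokens": num_input_tokens + slack, "num_beams": num_beams}

def regenerate(masked_sentence, tokenizer, model):
    """Fill the <mask> tokens of one sentence with a BART model loaded by the caller."""
    import torch  # imported lazily so generation_kwargs works without torch installed

    inputs = tokenizer(masked_sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,  # passes input_ids together with attention_mask
            **generation_kwargs(inputs["input_ids"].shape[1]),
        )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]

print(generation_kwargs(12))  # {'max_new_tokens': 22, 'num_beams': 4}

# Usage (downloads the model, so it is left commented out here):
# from transformers import BartTokenizer, BartForConditionalGeneration
# tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
# model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
# print(regenerate("you never played myst? damn!!! i <mask> be <mask> old ;)", tokenizer, model))
```

Loading the tokenizer and model once and passing them in avoids reloading the weights for every batch of sentences.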

In [28]:
%pip install transformers



In [29]:
def generate_reborn_sentences(masked_sentences):
    """Fill the <mask> tokens of each sentence with BART (facebook/bart-base).

    generate() runs with its default settings, so the length of each reborn
    sentence is bounded by the default generation limit.
    """
    tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

    reborn_sentences = []
    for i, masked_sentence in enumerate(masked_sentences, start=1):
        inputs = tokenizer(masked_sentence, return_tensors="pt")
        generated_encoded = model.generate(inputs['input_ids'])
        reborn_sentence = tokenizer.batch_decode(generated_encoded, skip_special_tokens=True)[0]
        reborn_sentences.append(reborn_sentence)
        # Report progress every 100 sentences.
        if i % 100 == 0:
            print(f'Processed {i} sentences')

    return reborn_sentences

In [31]:
from google.colab import userdata
# Assign the secret to a variable so the token is not echoed in the cell output.
hf_token = userdata.get('HF_TOKEN')

In [32]:
import os
# Never hard-code the token; reuse the value read from Colab secrets.
os.environ["HF_TOKEN"] = hf_token

In [33]:
reborn_pos_sentences = generate_reborn_sentences(masked_pos_sentences)

reborn_neg_sentences = generate_reborn_sentences(masked_neg_sentences)

Processed 100 sentences
Processed 200 sentences
Processed 300 sentences
Processed 400 sentences
Processed 500 sentences
Processed 600 sentences
Processed 700 sentences
Processed 800 sentences
Processed 900 sentences
Processed 1000 sentences
Processed 1100 sentences
Processed 100 sentences
Processed 200 sentences
Processed 300 sentences
Processed 400 sentences
Processed 500 sentences
Processed 600 sentences
Processed 700 sentences
Processed 800 sentences
Processed 900 sentences
Processed 1000 sentences
Processed 1100 sentences


In [34]:
print("Reborn Sentences for Masked Positive Sentences:")
for i, reborn_sentence in enumerate(reborn_pos_sentences):
    print(f"Reborn Sentence {i + 1}: {reborn_sentence}")

Reborn Sentences for Masked Positive Sentences:
Reborn Sentence 1: have no reason to fear? check. who cares if we don't have predators?
Reborn Sentence 2: 2 hours? damn! that book took me a good 2 days, it was a
Reborn Sentence 3: you think i'm myst? damn!!! i must be really old ;)
Reborn Sentence 4: well, if what was in the bible was true, then we don't have to
Reborn Sentence 5: just because supersport isn't a creationist doesn't mean that he is a creation
Reborn Sentence 6: that's is not a good argument for logic. just because the us government makes mistakes
Reborn Sentence 7: well, i have always believed that if you don't believe in a god's genesis
Reborn Sentence 8: believe me, marc, we all can. the only problem is that it
Reborn Sentence 9: darling, i'm in the 10th year of my current relationship. and i
Reborn Sentence 10: anyone who has served in the military is a hero. once you put your life
Reborn Sentence 11: his lies, his lies and his fake status, my post above answers your

In [35]:
print("\nReborn Sentences for Masked Negative Sentences:")
for i, reborn_sentence in enumerate(reborn_neg_sentences):
    print(f"Reborn Sentence {i + 1}: {reborn_sentence}")


Reborn Sentences for Masked Negative Sentences:
Reborn Sentence 1: have no one to fear? check. who said we don't have predators? we
Reborn Sentence 2: 2 hours? damn! that book took me a good 2 days, it was a
Reborn Sentence 3: you never played myst? damn!!! i should be ashamed of myself ;)
Reborn Sentence 4: well, if genesis was in fact true, then we don't have to assume that
Reborn Sentence 5: just making sure that everybody is aware of his evasion, misrepresentation, and dishonesty
Reborn Sentence 6: that's is ridiculously stupid logic. simply becuase the church is guilty of all
Reborn Sentence 7: well, many have argued that if you don't except a literal genesis, you're
Reborn Sentence 8: believe me, marc, we certainly can. the only problem is that it
Reborn Sentence 9: darling, i'm in the 10th year of my committed relationship. and since
Reborn Sentence 10: anyone who serves in the military is a hero. once you sign on the dotted
Reborn Sentence 11: his lies, his media darling statu

In [36]:
dfnew["rebornPosSentence"] = reborn_pos_sentences
dfnew["rebornNegSentence"] = reborn_neg_sentences
dfnew

Unnamed: 0,text,maskedPosSentence,maskedNegSentence,rebornPosSentence,rebornNegSentence
0,have no predators to fear? check. who said we ...,have no <mask> to fear? check. who <mask> we d...,have no <mask> to fear? check. who said we don...,have no reason to fear? check. who cares if we...,have no one to fear? check. who said we don't ...
1,2 hours? damn! that book took me a good 2 day...,2 hours? damn! that <mask> <mask> me a <mask> ...,2 hours? damn! that book took me a good 2 days...,2 hours? damn! that book took me a good 2 days...,2 hours? damn! that book took me a good 2 days...
2,you never played myst? damn!!! i must be reall...,you <mask> <mask> myst? damn!!! i must be real...,you never played myst? damn!!! i <mask> be <ma...,you think i'm myst? damn!!! i must be really o...,you never played myst? damn!!! i should be ash...
3,"Well, if Genesis was in fact true, then we wou...","well, if <mask> was in <mask> true, then we <m...","well, if genesis was in fact true, then we <ma...","well, if what was in the bible was true, then ...","well, if genesis was in fact true, then we don..."
4,Just making sure that everybody is aware of hi...,just <mask> <mask> that <mask> is <mask> of hi...,just making sure that everybody is aware of hi...,just because supersport isn't a creationist do...,just making sure that everybody is aware of hi...
...,...,...,...,...,...
1159,you really believed me? wow! i never knew i ha...,you <mask> <mask> me? wow! i never knew i had ...,you really believed me? wow! i <mask> <mask> i...,you know what that means to me? wow! i never k...,you really believed me? wow! i can't believe i...
1160,please tell me you're kidding. these bowling b...,<mask> <mask> me you're kidding. these <mask> ...,please tell me you're kidding. these <mask> <m...,don't tell me you're kidding. these balls must...,please tell me you're kidding. these things wi...
1161,you're kidding. just because your life is 'a f...,you're kidding. just because your <mask> is 'a...,you're kidding. just because your <mask> is 'a...,you're kidding. just because your planet is 'a...,you're kidding. just because your planet is 'a...
1162,the evidence that is provided to you is not en...,the <mask> that is <mask> to you is not enough...,the evidence that is provided to you is not en...,the fact that is given to you is not enough. n...,the evidence that is provided to you is not en...


# Sentences Representation
"We embed the original sentence 𝑥 and its corresponding reborn texts 𝐴 and 𝐵     
into 𝑑-dimensional embedding 𝑯𝑡 ∈ ℝ^𝑑     
via pre-trained BERT-base:     
𝑯𝑥, 𝑯𝐴, 𝑯𝐵 = 𝐵𝐸𝑅𝑇 (𝑥), 𝐵𝐸𝑅𝑇 (𝐴), 𝐵𝐸𝑅𝑇 (𝐵)."
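
One common way to turn the per-token vectors produced by an encoder into a single sentence vector 𝑯 is mean pooling. The sketch below shows the idea in plain Python on made-up numbers; `mean_pool` is a hypothetical helper, and skipping padded positions via the attention mask is an assumption that only matters when sentences are embedded in padded batches.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose attention-mask entry is 1."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [t / count for t in total]

# Three token positions with 2-dimensional vectors; the last one is padding.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
attention_mask = [1, 1, 0]

H_x = mean_pool(token_vectors, attention_mask)
print(H_x)  # [2.0, 3.0]
```

Averaging over all three positions instead would pull the sentence vector toward the padding vector, which is why the mask is applied.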

In [37]:
def embed_sentences(sentences):
    tokenizer = AutoTokenizer.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")
    model = AutoModel.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")

    i = 0
    embeddings = []
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            outputs = model(**inputs).last_hidden_state.mean(dim=1)
        embeddings.append(outputs)
        i = i + 1
        if (i % 100 == 0):
            print(f'Processed {i} sentences')

    return torch.stack(embeddings)

In [38]:
x_embeddings = embed_sentences(text_data)

A_embeddings = embed_sentences(reborn_pos_sentences)

B_embeddings = embed_sentences(reborn_neg_sentences)

Processed 100 sentences
Processed 200 sentences
Processed 300 sentences
Processed 400 sentences
Processed 500 sentences
Processed 600 sentences
Processed 700 sentences
Processed 800 sentences
Processed 900 sentences
Processed 1000 sentences
Processed 1100 sentences
Processed 100 sentences
Processed 200 sentences
Processed 300 sentences
Processed 400 sentences
Processed 500 sentences
Processed 600 sentences
Processed 700 sentences
Processed 800 sentences
Processed 900 sentences
Processed 1000 sentences
Processed 1100 sentences
Processed 100 sentences
Processed 200 sentences
Processed 300 sentences
Processed 400 sentences
Processed 500 sentences
Processed 600 sentences
Processed 700 sentences
Processed 800 sentences
Processed 900 sentences
Processed 1000 sentences
Processed 1100 sentences


In [None]:
for i, sentence in enumerate(text_data):
    print(f"Embedding for Original Lowercase Sentence {i + 1} ({sentence}):")
    print(x_embeddings[i])
    print("- - - - - - - - - -")

In [None]:
for i, sentence in enumerate(reborn_pos_sentences):
    print(f"Embedding for Reborn Positive Sentence {i + 1} ({sentence}):")
    print(A_embeddings[i])
    print("- - - - - - - - - -")

In [None]:
for i, sentence in enumerate(reborn_neg_sentences):
    print(f"Embedding for Reborn Negative Sentence {i + 1} ({sentence}):")
    print(B_embeddings[i])
    print("- - - - - - - - - -")

In [None]:
dfnew["xEmbedding"] = x_embeddings.tolist()
dfnew["AEmbedding"] = A_embeddings.tolist()
dfnew["BEmbedding"] = B_embeddings.tolist()
dfnew

# Sarcastic Utterances Detection
## 1)
"We utilize cosine similarity to measure the similarity between representations of original sentence 𝐻𝑥     
and generation texts 𝐻𝐴/𝐻𝐵.

Then we use the following equation to calculate a difference score of each sentence:     
diff = sim(𝐻𝑥, 𝐻𝐴) < 𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑 || sim(𝐻𝑥, 𝐻𝐵) < 𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑     
where || means "or" logical operator."
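
The rule above can be worked through on toy vectors. The sketch below is a plain-Python illustration with hypothetical helper names; the 2-dimensional vectors and the threshold of 0.5 are made up for the example and are not values from the experiments.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def diff(h_x, h_a, h_b, threshold):
    """True when either reborn text falls below the similarity threshold."""
    return cosine(h_x, h_a) < threshold or cosine(h_x, h_b) < threshold

h_x = [1.0, 0.0]   # original sentence
h_a = [1.0, 0.0]   # reborn text identical in direction: similarity 1.0
h_b = [0.0, 1.0]   # reborn text orthogonal to the original: similarity 0.0

flag_changed = diff(h_x, h_a, h_b, threshold=0.5)
flag_same = diff(h_x, h_a, h_a, threshold=0.5)
print(flag_changed)  # True
print(flag_same)     # False
```

A single reborn text that drifts away from the original is enough to raise the flag, which is what the "or" in the equation expresses.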

In [None]:
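def calculate_difference_scores(x_embeddings, A_embeddings, B_embeddings, threshold):
    """Flag a sentence when either reborn text falls below the similarity threshold.

    Each embedding has shape (1, d), so cosine_similarity returns a 1x1 array
    and every entry of the result is a 1x1 boolean array.
    """
    diff_scores = []
    for i, (x_emb, A_emb, B_emb) in enumerate(zip(x_embeddings, A_embeddings, B_embeddings), start=1):
        sim_Hx_HA = cosine_similarity(x_emb, A_emb)
        sim_Hx_HB = cosine_similarity(x_emb, B_emb)

        diff = (sim_Hx_HA < threshold) or (sim_Hx_HB < threshold)
        diff_scores.append(diff)
        # Report progress every 100 embeddings.
        if i % 100 == 0:
            print(f'Processed {i} embeddings')

    return diff_scores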

In [44]:
threshold = 0.755

diff_scores = calculate_difference_scores(x_embeddings, A_embeddings, B_embeddings, threshold)
diff_scores

Processed 100 embeddings
Processed 200 embeddings
Processed 300 embeddings
Processed 400 embeddings
Processed 500 embeddings
Processed 600 embeddings
Processed 700 embeddings
Processed 800 embeddings
Processed 900 embeddings
Processed 1000 embeddings
Processed 1100 embeddings


[array([[ True]]),
 array([[False]]),
 array([[False]]),
 array([[ True]]),
 array([[ True]]),
 array([[False]]),
 array([[False]]),
 array([[False]]),
 array([[ True]]),
 array([[False]]),
 array([[False]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[False]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[False]]),
 array([[ True]]),
 array([[False]]),
 array([[ True]]),
 array([[ True]]),
 array([[False]]),
 array([[False]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[False]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[False]]),
 array([[False]]),
 array([[False]]),
 array([[Fal

## 2)
"Since the sarcastic utterances are influenced more than normal texts during the masking and generation procedure,     
the difference score of sarcastic texts should be greater than a non-sarcastic one.

If we have a threshold value which separates sarcastic texts and normal texts,     
we can yield the prediction 𝑦 by:     
𝑦 = I(diff)."

In [None]:
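# Each diff is a 1x1 boolean array; .item() extracts the scalar before converting to 0/1
predicted_labels = [int(np.asarray(diff).item()) for diff in diff_scores]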
print(predicted_labels)
print(sum(predicted_labels))

In [None]:
labels = ["sarc" if diff else "notsarc" for diff in diff_scores]
print(labels)

In [None]:
dffinal = pd.DataFrame({"text": text_data, "class": label_data, "prediction": labels})
dffinal

# Main Experiment Results

In [48]:
# Encode the gold labels: 1 = sarcastic ("sarc"), 0 = not sarcastic
true_labels = [1 if label == "sarc" else 0 for label in df["class"]]
print(true_labels)
print(predicted_labels)

accuracy = accuracy_score(true_labels, predicted_labels)
precision = precision_score(true_labels, predicted_labels)
f1 = f1_score(true_labels, predicted_labels)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("F1 Score:", f1)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

In [49]:
conf_matrix = confusion_matrix(true_labels, predicted_labels)

print("Confusion Matrix:")
print(conf_matrix)

Confusion Matrix:
[[229 353]
 [221 361]]
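
The metrics can be cross-checked directly from the confusion matrix above. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and that class 1 ("sarc") is the positive class, as in the label encoding used in this section. It also derives recall, which the cell above does not print.

```python
# Counts copied from the confusion matrix printed above
# (assumed layout: rows = true class, columns = predicted class).
tn, fp = 229, 353   # true "notsarc": predicted notsarc / predicted sarc
fn, tp = 221, 361   # true "sarc":    predicted notsarc / predicted sarc

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Samples:   {total}")          # 1164
print(f"Accuracy:  {accuracy:.4f}")   # 0.5069
print(f"Precision: {precision:.4f}")  # 0.5056
print(f"Recall:    {recall:.4f}")     # 0.6203
print(f"F1 Score:  {f1:.4f}")         # 0.5571
```

With 582 examples in each class, an accuracy of about 0.507 is close to chance level. The predictions lean toward the "sarc" class: 714 of the 1164 sentences are flagged as sarcastic.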
