# Introduction

Sarcasm is a sophisticated language phenomenon, which would cause much confusion to exist sentiment classification systems.     
So sarcasm detection, a task of predicting whether a given text contains sarcasm, has received much research attention.     

Recently, many methods have been proposed for sarcasm detection, which could be broadly classified into two categories.     
One is the text-only method which only concentrate on the utterance itself, such as exploiting incongruity expressions to detect the sarcasm text.     
Another direction is based on extra information, which exploits external knowledge to assist the detection procedure, such as user history, and common sense knowledge.

We propose an unsupervised sarcasm detection method.     

First, we leverage the external sentiment knowledge to mask prominent tokens. Then the masked texts are fed into the pre-trained generation model, which follows the remaining logic structure to generate texts.     
There is a good chance that these reborn texts would not be sarcastic or make more sense.     

Second, after obtaining the similarity score between the generated sentence and the original one, features beneath the scores will be extracted to decide whether a sentence is sarcasm.     

Then, we construct several unsupervised baselines and conduct experiments on IAC-V2 dataset.

# Imports and Reading Data

In [1]:
# !pip install senticnet

In [1]:
import numpy as np
import pandas as pd

# from senticnet.senticnet import SenticNet

import nltk
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from transformers import AutoTokenizer, AutoModel
from transformers import BartTokenizer, BartForConditionalGeneration
import torch

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import accuracy_score, precision_score, f1_score
from sklearn.metrics import confusion_matrix

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
df = pd.read_csv("/content/drive/My Drive/AlifResearch/twitter_indonesia_sarcastic/data/test.csv")

In [4]:
df['class'] = df['label'].map({0: 'notsarc', 1: 'sarc'})

In [5]:
# Drop the old 'label' column and rename 'tweet' to 'text'
df = df.drop(columns=['label'])
df = df.rename(columns={'tweet': 'text'})

In [6]:
df = df[['class', 'text']]
df

Unnamed: 0,class,text
0,sarc,saraswati pintar sekali tida belajar dari peng...
1,notsarc,Menag bicara sola celana cingkrang gak bisa ik...
2,notsarc,Koruptor dihukum berat bukannya dikasih kering...
3,notsarc,Belum jadi pemimpin udah bikin hoax gmn kalau ...
4,sarc,Karena JUJUR membuatku malu dan merasa bersala...
...,...,...
533,sarc,Ciyee petani bawang nih :D :D Udahlah yaa sand...
534,notsarc,<username> <username> yg gue tangkep itu antar...
535,sarc,<username> Wuih pak capres memang keren kalau ...
536,sarc,<username> Welcome to Indonesia!!!! Sedih bang...


In [7]:
import re
# Function to remove text inside <>
def remove_brackets(text):
    return re.sub(r'<.*?>', '', text).strip()

# Apply the function to the 'text' column
df['text'] = df['text'].apply(remove_brackets)
df

Unnamed: 0,class,text
0,sarc,saraswati pintar sekali tida belajar dari peng...
1,notsarc,Menag bicara sola celana cingkrang gak bisa ik...
2,notsarc,Koruptor dihukum berat bukannya dikasih kering...
3,notsarc,Belum jadi pemimpin udah bikin hoax gmn kalau ...
4,sarc,Karena JUJUR membuatku malu dan merasa bersala...
...,...,...
533,sarc,Ciyee petani bawang nih :D :D Udahlah yaa sand...
534,notsarc,yg gue tangkep itu antara apart-nya serem dia ...
535,sarc,Wuih pak capres memang keren kalau pakai jas ....
536,sarc,Welcome to Indonesia!!!! Sedih bangetttt!!! Si...


In [8]:
def clean_text(text, encoding):
    try:
        # Decode text using the specified encoding with error handling
        return text.encode(encoding, 'ignore').decode('utf-8', 'ignore')
    except Exception as e:
        # Handle any errors during encoding/decoding
        print(f"Error processing text with {encoding}: {e}")
        return text

# List of encodings to try
encodings = ['utf-8', 'latin1', 'cp1252', 'ascii']

# Apply cleaning function with different encodings
for encoding in encodings:
    print(f"Trying encoding: {encoding}")
    df['text'] = df['text'].apply(lambda x: clean_text(x, encoding))
    print(df)

# Optionally, remove non-ASCII characters after trying different encodings
def remove_non_ascii(text):
    return re.sub(r'[^\x00-\x7F]', '', text)

# Apply non-ASCII removal
df['text'] = df['text'].apply(remove_non_ascii)

# Display the cleaned DataFrame
print(df)

Trying encoding: utf-8
       class                                               text
0       sarc  saraswati pintar sekali tida belajar dari peng...
1    notsarc  Menag bicara sola celana cingkrang gak bisa ik...
2    notsarc  Koruptor dihukum berat bukannya dikasih kering...
3    notsarc  Belum jadi pemimpin udah bikin hoax gmn kalau ...
4       sarc  Karena JUJUR membuatku malu dan merasa bersala...
..       ...                                                ...
533     sarc  Ciyee petani bawang nih :D :D Udahlah yaa sand...
534  notsarc  yg gue tangkep itu antara apart-nya serem dia ...
535     sarc  Wuih pak capres memang keren kalau pakai jas ....
536     sarc  Welcome to Indonesia!!!! Sedih bangetttt!!! Si...
537     sarc           terimakasih sudah dikasih liburan gratis

[538 rows x 2 columns]
Trying encoding: latin1
       class                                               text
0       sarc  saraswati pintar sekali tida belajar dari peng...
1    notsarc  Menag bicara sola c

# Understanding Data

In [9]:
df.dtypes

class    object
text     object
dtype: object

In [10]:
df.columns

Index(['class', 'text'], dtype='object')

In [11]:
text_data_original = list(df['text'])
text_data = [x.lower() for x in text_data_original]
print(*text_data, sep = "\n")

saraswati pintar sekali tida belajar dari pengalaman hufffffffffffff
menag bicara sola celana cingkrang gak bisa ikut aturan keluar .. nah lhoo
koruptor dihukum berat bukannya dikasih keringanan
belum jadi pemimpin udah bikin hoax gmn kalau diberi jabatan ?? bisa lebih parah
karena jujur membuatku malu dan merasa bersalah . itulah kenapa jkw sulit melakukannya.
oi tolong lah itu yg pada demo suruh pulang .
astagfirullah medina zein make narkoba ..saya tidak kaget make narkoba nya saya kaget karena saya tidak tahu siapa pula medina zein
bapak ngantornya di jalan tol saja bapak .. balaikota yang bikin bukan masa pemerintahan jokowi lho bapak .. :d :d
tolong dong bapak dpr kasih kuliah khusus dulu buat menag ini biar tidak bikin ribut mulu .
ah tolol         
mana nih klarifikasi nya yg lem aibon cc
anaknya pinter mak nya tolol nih
selamat menikmati berkah warge jakarte ... terimakasih juge ame bapak   oke tidak tuh ??? wkwkwkwkwkw ... kalau aye mah ogah kebanjiran . biar negara utang aye

In [12]:
label_data = list(df['class'])
print(*label_data, sep = "\n")

sarc
notsarc
notsarc
notsarc
sarc
notsarc
sarc
sarc
sarc
notsarc
notsarc
notsarc
sarc
notsarc
notsarc
sarc
sarc
notsarc
notsarc
notsarc
notsarc
sarc
notsarc
sarc
notsarc
notsarc
notsarc
notsarc
sarc
notsarc
notsarc
sarc
notsarc
sarc
notsarc
notsarc
sarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
sarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
sarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
sarc
notsarc
notsarc
sarc
notsarc
sarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
sarc
notsarc
notsarc
sarc
notsarc
notsarc
notsarc
notsarc
notsarc
sarc
notsarc
notsarc
notsarc
sarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
sarc
notsarc
notsarc
sarc
notsarc
notsarc
sarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
sarc
sarc
notsarc
notsarc
notsarc
sarc
notsarc
notsarc
notsarc
sarc
notsarc
sarc
notsarc
notsarc
notsarc
notsarc
notsarc
sarc


# Overview

The proposed framework contains three main components:     

1) Sentences mask and generation.     
This procedure first recognizes main components of sentences which will be properly masked to cause more impact on original sentences, and then fulfills the texts generation work;     

2) Sentences representation.     
It is expected to calculate dense vectors of sentences;     

3) Sarcastic utterances detection leverages.     
the similarity scores between original and regenerated sentences to detect whether an utterance is sarcastic.

# Sentences Mask and Generation
## 1)
"First, we use the sentiment common knowledge retrieved from SenticNet to recognize affective words in the sentence 𝑥,     
and split those words into two sets according to its sentiment polarities:    
PW = {pw1, pw2, ..., pwh} and    
NW = {nw1, nw2, ..., nwk},     
h + k <= n."

In [13]:
# def tokenize_sentence(sentence):
#     tokens = word_tokenize(sentence)

#     lemmatizer = WordNetLemmatizer()

#     clean_tokens = []
#     for tok in tokens:
#         clean_tok = lemmatizer.lemmatize(tok).lower().strip()
#         clean_tokens.append(clean_tok)

#     return clean_tokens

In [14]:
!pip install nlp-id



In [15]:
from nlp_id import Tokenizer, Lemmatizer

def tokenize_sentence(sentence):
    # Initialize the tokenizer and lemmatizer
    tokenizer = Tokenizer()
    lemmatizer = Lemmatizer()

    # Tokenize the sentence
    tokens = tokenizer.tokenize(sentence)

    # Lemmatize each token
    clean_tokens = [lemmatizer.lemmatize(token).lower().strip() for token in tokens]

    return clean_tokens

In [None]:
# def get_sentiment_polarity_from_senticnet(word):
#     sn = SenticNet()

#     word = word.lower()

#     try:
#         return sn.polarity_label(word)
#     except:
#         return "neutral"

In [19]:
!pip install sentiws

[31mERROR: Could not find a version that satisfies the requirement sentiws (from versions: none)[0m[31m
[0m[31mERROR: No matching distribution found for sentiws[0m[31m
[0m

In [18]:
from sentiws import SentiWS
from nlp_id.tokenizer import Tokenizer
from nlp_id.lemmatizer import Lemmatizer

# Initialize SentiWS sentiment analyzer
senti_ws = SentiWS()

def get_sentiment_polarity(word):
    word = word.lower()

    try:
        sentiment_result = senti_ws.get_sentiment(word)
        # Determine the sentiment based on the compound score
        if sentiment_result > 0:
            return "positive"
        elif sentiment_result < 0:
            return "negative"
        else:
            return "neutral"
    except:
        return "neutral"

ModuleNotFoundError: No module named 'sentiws'

In [None]:
# def analyze_sentiment(sentences):
#     positive_words = []
#     negative_words = []

#     for sentence in sentences:
#         words = tokenize_sentence(sentence)

#         PW = set()
#         NW = set()

#         for word in words:
#             sentiment_polarity = get_sentiment_polarity_from_senticnet(word)
#             if sentiment_polarity == "positive":
#                 PW.add(word.lower())
#             elif sentiment_polarity == "negative":
#                 NW.add(word.lower())

#         positive_words.append(PW)
#         negative_words.append(NW)

#     return positive_words, negative_words

In [None]:
def analyze_sentiment(sentences):
    positive_words = []
    negative_words = []

    for sentence in sentences:
        words = tokenize_sentence(sentence)

        PW = set()
        NW = set()

        for word in words:
            sentiment_polarity = get_sentiment_polarity(word)
            if sentiment_polarity == "positive":
                PW.add(word.lower())
            elif sentiment_polarity == "negative":
                NW.add(word.lower())

        positive_words.append(PW)
        negative_words.append(NW)

    return positive_words, negative_words

In [None]:
# import nltk
# nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
# positive_words, negative_words = analyze_sentiment(text_data)

# for i, sentence in enumerate(text_data):
#     print(f"Sentence: {sentence}")
#     print(f"Positive Words: {positive_words[i]}")
#     print(f"Negative Words: {negative_words[i]}")
#     print("- - - - - - - - - -")

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
Negative Words: {'catholic'}
- - - - - - - - - -
Sentence: yet according to pp themselves, they provide far more abortions and ec services than any of their other services. :pppfa services (2002-2003)surgical abortions: 227,385 emergency contraception: 633,756 prenatal care: 15,860 adoption referrals: 1,963 (source: www.plannedparenthood.org)
Positive Words: {'surgical', 'adoption', 'contraception', 'provide', 'care', 'referral'}
Negative Words: {'emergency', 'prenatal', 'abortion'}
- - - - - - - - - -
Sentence: well, if burglars want to steal guns then it makes sense that they would be likely to target the homes of gun owners. and the source is from the recently published book evaluating gun policy which you can read and review. the source is listed at the bottom.
Positive Words: {'published', 'review'}
Negative Words: {'steal', 'bottom', 'target', 'listed', 'burglar', 'policy'}
- - - - - - - - - -
Sentence: but the fact

In [None]:
positive_words, negative_words = analyze_sentiment(text_data)

for i, sentence in enumerate(text_data):
    print(f"Sentence: {sentence}")
    print(f"Positive Words: {positive_words[i]}")
    print(f"Negative Words: {negative_words[i]}")
    print("- - - - - - - - - -")

In [None]:
df["PW"] = positive_words
df["NW"] = negative_words
df

Unnamed: 0,class,text,PW,NW
0,notsarc,"If that's true, then Freedom of Speech is doom...","{freedom, true, subjective}","{harassment, doomed}"
1,notsarc,Neener neener - is it time to go in from the p...,{playground},{}
2,notsarc,"Just like the plastic gun fear, the armour pie...",{},"{fear, myth, misinformation, plastic}"
3,notsarc,So geology is a religion because we weren't he...,"{rock, religion, formed}",{geology}
4,notsarc,Well done Monty. Mark that up as your first ev...,"{honest, accurate}",{}
...,...,...,...,...
6515,sarc,depends on when the baby bird died. run alon...,"{bird, baby, adult}",{died}
6516,sarc,"ok, sheesh, to clarify, women who arent aborti...","{baby, clarify, ok}","{pregnant, sheesh}"
6517,sarc,so.. eh?? hows this sound? will it fly w...,"{fly, conservative, progressive}",{}
6518,sarc,"I think we should put to a vote, the right of ...",{majority},{extremist}


## 2)
"Second, we analyze the sentence to get its syntax information to identify non-stop words     
     𝑆𝑊 = {𝑠𝑤1, 𝑠𝑤2, ..., 𝑠𝑤𝑚, 𝑚 ≤ 𝑛}.     
Intuitively, these words are the main components of sentences. Then we split 𝑆𝑊 into two sets which satisfy :     
     𝑆𝑊1 ∪ 𝑆𝑊2 = 𝑆𝑊 ,     
     |𝑆𝑊1| = |𝑆𝑊2|."

In [None]:
def extract_non_stop_words(sentence):
    words = nltk.word_tokenize(sentence)

    stop_words = set(stopwords.words("english"))

    non_stop_words = [word.lower() for word in words if word.lower() not in stop_words and word.isalpha()]

    return non_stop_words

In [None]:
def split_non_stop_words(non_stop_words):
    m = len(non_stop_words)
    m1 = m // 2
    SW1 = set(non_stop_words[:m1])
    SW2 = set(non_stop_words[m1:])
    return SW1, SW2

In [None]:
def analyze_sentences(sentences):
    all_SW1 = []
    all_SW2 = []

    for sentence in sentences:
        non_stop_words = extract_non_stop_words(sentence)
        SW1, SW2 = split_non_stop_words(non_stop_words)
        all_SW1.append(SW1)
        all_SW2.append(SW2)

    return all_SW1, all_SW2

In [None]:
all_SW1, all_SW2 = analyze_sentences(text_data)

for i, sentence in enumerate(text_data):
    print(f"Sentence: {sentence}")
    print(f"SW1: {all_SW1[i]}")
    print(f"SW2: {all_SW2[i]}")
    print("- - - - - - - - - -")

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
SW2: {'biased', 'catholic', 'pope'}
- - - - - - - - - -
Sentence: yet according to pp themselves, they provide far more abortions and ec services than any of their other services. :pppfa services (2002-2003)surgical abortions: 227,385 emergency contraception: 633,756 prenatal care: 15,860 adoption referrals: 1,963 (source: www.plannedparenthood.org)
SW1: {'services', 'ec', 'abortions', 'provide', 'according', 'pp', 'far', 'pppfa', 'yet'}
SW2: {'services', 'surgical', 'adoption', 'contraception', 'abortions', 'source', 'care', 'referrals', 'emergency', 'prenatal'}
- - - - - - - - - -
Sentence: well, if burglars want to steal guns then it makes sense that they would be likely to target the homes of gun owners. and the source is from the recently published book evaluating gun policy which you can read and review. the source is listed at the bottom.
SW1: {'burglars', 'steal', 'makes', 'well', 'target', 'want', 'gun', 'likely'

In [None]:
df["SW1"] = all_SW1
df["SW2"] = all_SW2
df

Unnamed: 0,class,text,PW,NW,SW1,SW2
0,notsarc,"If that's true, then Freedom of Speech is doom...","{freedom, true, subjective}","{harassment, doomed}","{true, speech, harassment, doomed, freedom}","{subjective, book, like, claim, harassing, ban..."
1,notsarc,Neener neener - is it time to go in from the p...,{playground},{},"{neener, time}","{go, playground, yet}"
2,notsarc,"Just like the plastic gun fear, the armour pie...",{},"{fear, myth, misinformation, plastic}","{armour, piercing, gun, fear, like, plastic}","{bullet, built, upon, fear, myth, misinformation}"
3,notsarc,So geology is a religion because we weren't he...,"{rock, religion, formed}",{geology},"{see, geology, religion}","{rock, x, formed}"
4,notsarc,Well done Monty. Mark that up as your first ev...,"{honest, accurate}",{},"{monty, well, done, mark}","{post, accurate, honest, ever, first}"
...,...,...,...,...,...,...
6515,sarc,depends on when the baby bird died. run alon...,"{bird, baby, adult}",{died},"{baby, run, bird, died, depends}","{along, boy, adults, little, debate, let}"
6516,sarc,"ok, sheesh, to clarify, women who arent aborti...","{baby, clarify, ok}","{pregnant, sheesh}","{babys, sheesh, aborting, arent, women, clarif...","{year, women, times, pregnant, several, oftan,..."
6517,sarc,so.. eh?? hows this sound? will it fly w...,"{fly, conservative, progressive}",{},"{sound, hows, fly, eh}","{conservatives, thezion, progressives, think}"
6518,sarc,"I think we should put to a vote, the right of ...",{majority},{extremist},"{majority, vote, right, put, extremists, say, ...","{anything, vote, steeeeeve, put, say, way, any..."


## 3)
"Here, 𝑃𝑊 ∪ 𝑆𝑊1 and 𝑁𝑊 ∪ 𝑆𝑊2 are used to mask original sentence respectively. So, we will obtain two masked sentences     
𝑥𝑚1 = { [𝑚]1, 𝑥2, ..., [𝑚]𝑛} and     
𝑥𝑚2 = {𝑥1, [𝑚]2, ..., 𝑥𝑛}."

In [None]:
def construct_union(sentences, PW, NW, all_SW1, all_SW2):
    union_PW_SW1 = []
    union_NW_SW2 = []

    for i, sentence in enumerate(sentences):
        SW1 = all_SW1[i]
        SW2 = all_SW2[i]

        union_PW_SW1.append(PW[i].union(SW1))
        union_NW_SW2.append(NW[i].union(SW2))

    return union_PW_SW1, union_NW_SW2

In [None]:
union_PW_SW1, union_NW_SW2 = construct_union(text_data, positive_words, negative_words, all_SW1, all_SW2)
print(union_PW_SW1)
print(union_NW_SW2)



In [None]:
df["union_PW_SW1"] = union_PW_SW1
df["union_NW_SW2"] = union_NW_SW2
df

Unnamed: 0,class,text,PW,NW,SW1,SW2,union_PW_SW1,union_NW_SW2
0,notsarc,"If that's true, then Freedom of Speech is doom...","{freedom, true, subjective}","{harassment, doomed}","{true, speech, harassment, doomed, freedom}","{subjective, book, like, claim, harassing, ban...","{true, speech, subjective, harassment, doomed,...","{subjective, book, like, harassment, doomed, c..."
1,notsarc,Neener neener - is it time to go in from the p...,{playground},{},"{neener, time}","{go, playground, yet}","{playground, neener, time}","{go, playground, yet}"
2,notsarc,"Just like the plastic gun fear, the armour pie...",{},"{fear, myth, misinformation, plastic}","{armour, piercing, gun, fear, like, plastic}","{bullet, built, upon, fear, myth, misinformation}","{armour, piercing, fear, like, plastic, gun}","{bullet, built, upon, fear, myth, plastic, mis..."
3,notsarc,So geology is a religion because we weren't he...,"{rock, religion, formed}",{geology},"{see, geology, religion}","{rock, x, formed}","{rock, formed, see, religion, geology}","{rock, x, geology, formed}"
4,notsarc,Well done Monty. Mark that up as your first ev...,"{honest, accurate}",{},"{monty, well, done, mark}","{post, accurate, honest, ever, first}","{well, monty, honest, mark, accurate, done}","{honest, post, ever, first, accurate}"
...,...,...,...,...,...,...,...,...
6515,sarc,depends on when the baby bird died. run alon...,"{bird, baby, adult}",{died},"{baby, run, bird, died, depends}","{along, boy, adults, little, debate, let}","{baby, run, adult, bird, died, depends}","{died, along, adults, little, debate, boy, let}"
6516,sarc,"ok, sheesh, to clarify, women who arent aborti...","{baby, clarify, ok}","{pregnant, sheesh}","{babys, sheesh, aborting, arent, women, clarif...","{year, women, times, pregnant, several, oftan,...","{baby, babys, sheesh, aborting, arent, women, ...","{year, sheesh, women, times, pregnant, several..."
6517,sarc,so.. eh?? hows this sound? will it fly w...,"{fly, conservative, progressive}",{},"{sound, hows, fly, eh}","{conservatives, thezion, progressives, think}","{sound, conservative, progressive, fly, hows, eh}","{conservatives, thezion, progressives, think}"
6518,sarc,"I think we should put to a vote, the right of ...",{majority},{extremist},"{majority, vote, right, put, extremists, say, ...","{anything, vote, steeeeeve, put, say, way, any...","{majority, religious, vote, right, put, say, y...","{anything, vote, steeeeeve, put, say, way, any..."


In [None]:
def mask_sentence(sentence, mask_words, max_mask_count = 5):
    masked_sentence = []

    for word in sentence.split():
        if word in mask_words and max_mask_count > 0:
            masked_sentence.append("<mask>")
            max_mask_count -= 1
        else:
            masked_sentence.append(word)

    return " ".join(masked_sentence)

In [None]:
def construct_masked_sentences(sentences, union_PW_SW1, union_NW_SW2):
    masked_pos_sentences = []
    masked_neg_sentences = []

    for i, sentence in enumerate(sentences):

        masked_pos_sentence = mask_sentence(sentence, union_PW_SW1[i])
        masked_pos_sentences.append(masked_pos_sentence)

        masked_neg_sentence = mask_sentence(sentence, union_NW_SW2[i])
        masked_neg_sentences.append(masked_neg_sentence)

    return masked_pos_sentences, masked_neg_sentences

In [None]:
masked_pos_sentences, masked_neg_sentences = construct_masked_sentences(text_data, union_PW_SW1, union_NW_SW2)

for i, sentence in enumerate(text_data):
    print(f"Original Sentence: {sentence}")
    print(f"Masked Positive Sentence: {masked_pos_sentences[i]}")
    print(f"Masked Negative Sentence: {masked_neg_sentences[i]}")
    print("- - - - - - - - - -")

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
Masked Negative Sentence: i think her point was that the <mask> is <mask> because he's catholic.
- - - - - - - - - -
Original Sentence: yet according to pp themselves, they provide far more abortions and ec services than any of their other services. :pppfa services (2002-2003)surgical abortions: 227,385 emergency contraception: 633,756 prenatal care: 15,860 adoption referrals: 1,963 (source: www.plannedparenthood.org)
Masked Positive Sentence: <mask> <mask> to <mask> themselves, they <mask> <mask> more abortions and ec services than any of their other services. :pppfa services (2002-2003)surgical abortions: 227,385 emergency contraception: 633,756 prenatal care: 15,860 adoption referrals: 1,963 (source: www.plannedparenthood.org)
Masked Negative Sentence: yet according to pp themselves, they provide far more <mask> and ec <mask> than any of their other services. :pppfa <mask> (2002-2003)surgical abortions: 227,385 <mask> 

In [None]:
dfnew = pd.DataFrame({"text": text_data_original, "maskedPosSentence": masked_pos_sentences, "maskedNegSentence": masked_neg_sentences})
dfnew

Unnamed: 0,text,maskedPosSentence,maskedNegSentence
0,"If that's true, then Freedom of Speech is doom...","if that's true, then <mask> of <mask> is doome...","if that's true, then freedom of speech is doom..."
1,Neener neener - is it time to go in from the p...,<mask> <mask> - is it <mask> to go in from the...,neener neener - is it time to <mask> in from t...
2,"Just like the plastic gun fear, the armour pie...","just <mask> the <mask> <mask> fear, the <mask>...","just like the <mask> gun fear, the armour pier..."
3,So geology is a religion because we weren't he...,so <mask> is a <mask> because we weren't here ...,so <mask> is a religion because we weren't her...
4,Well done Monty. Mark that up as your first ev...,<mask> <mask> monty. <mask> that up as your fi...,well done monty. mark that up as your <mask> <...
...,...,...,...
6515,depends on when the baby bird died. run alon...,<mask> on when the <mask> <mask> died. <mask> ...,depends on when the baby bird died. run <mask>...
6516,"ok, sheesh, to clarify, women who arent aborti...","ok, sheesh, to clarify, <mask> who <mask> <mas...","ok, sheesh, to clarify, <mask> who arent abort..."
6517,so.. eh?? hows this sound? will it fly w...,so.. eh?? <mask> this sound? will it <mask> wi...,so.. eh?? hows this sound? will it fly with <m...
6518,"I think we should put to a vote, the right of ...","i <mask> we should <mask> to a vote, the <mask...","i think we should <mask> to a vote, the right ..."


## 4)
"These two masked sentences are fed into the pre-trained generation model to fulfill the generation procedure.     
𝑨{𝑎1, ..., 𝑥2, ..., 𝑥𝑛−1, ..., 𝑎𝑜 } = 𝐵𝐴𝑅𝑇 ( [𝑚]1, 𝑥2, ..., 𝑥𝑛−1, [𝑚]𝑛 )----(1)  
Thus, we will obtain two reborn sentences     
𝐴 = {𝑎1, 𝑎2, ..., 𝑎𝑜 } and     
𝐵 = {𝑏1, 𝑏2, ..., 𝑏𝑝 }."

In [None]:
%pip install transformers



In [None]:
def generate_reborn_sentences(masked_sentences):
    tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

    i = 0
    reborn_sentences = []
    for masked_sentence in masked_sentences:
        inputs = tokenizer(masked_sentence, return_tensors="pt")
        generated_encoded = model.generate(inputs['input_ids'])
        reborn_sentence = tokenizer.batch_decode(generated_encoded, skip_special_tokens=True)[0]
        reborn_sentences.append(reborn_sentence)
        i = i + 1
        if (i % 100 == 0):
            print(f'Processed {i} sentences')

    return reborn_sentences

In [None]:
from google.colab import userdata
userdata.get('huggingface')

'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'

In [None]:
import os
os.environ["HF_TOKEN"] = "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"

In [None]:
reborn_pos_sentences = generate_reborn_sentences(masked_pos_sentences)

reborn_neg_sentences = generate_reborn_sentences(masked_neg_sentences)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.72k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/558M [00:00<?, ?B/s]



Processed 100 sentences
Processed 200 sentences
Processed 300 sentences
Processed 400 sentences
Processed 500 sentences
Processed 600 sentences
Processed 700 sentences
Processed 800 sentences
Processed 900 sentences
Processed 1000 sentences
Processed 1100 sentences
Processed 1200 sentences
Processed 1300 sentences
Processed 1400 sentences
Processed 1500 sentences
Processed 1600 sentences
Processed 1700 sentences
Processed 1800 sentences
Processed 1900 sentences
Processed 2000 sentences
Processed 2100 sentences
Processed 2200 sentences
Processed 2300 sentences
Processed 2400 sentences
Processed 2500 sentences
Processed 2600 sentences
Processed 2700 sentences
Processed 2800 sentences
Processed 2900 sentences
Processed 3000 sentences
Processed 3100 sentences
Processed 3200 sentences
Processed 3300 sentences
Processed 3400 sentences
Processed 3500 sentences
Processed 3600 sentences
Processed 3700 sentences
Processed 3800 sentences
Processed 3900 sentences
Processed 4000 sentences
Processed

In [None]:
print("Reborn Sentences for Masked Positive Sentences:")
for i, reborn_sentence in enumerate(reborn_pos_sentences):
    print(f"Reborn Sentence {i + 1}: {reborn_sentence}")

In [None]:
print("\nReborn Sentences for Masked Negative Sentences:")
for i, reborn_sentence in enumerate(reborn_neg_sentences):
    print(f"Reborn Sentence {i + 1}: {reborn_sentence}")

In [None]:
dfnew["rebornPosSentence"] = reborn_pos_sentences
dfnew["rebornNegSentence"] = reborn_neg_sentences
dfnew

# Sentences Representation
"We embed the original sentence 𝑥 and its corresponding reborn texts 𝐴 and 𝐵     
into 𝑑-dimentional embedding 𝑯𝑡 ∈ R𝑑     
via pre-trained BERT-base:     
𝑯𝑥, 𝑯𝐴, 𝑯𝐵 = 𝐵𝐸𝑅𝑇 (𝑥), 𝐵𝐸𝑅𝑇 (𝐴), 𝐵𝐸𝑅𝑇 (𝐵)."

In [None]:
def embed_sentences(sentences):
    tokenizer = AutoTokenizer.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")
    model = AutoModel.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")

    i = 0
    embeddings = []
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            outputs = model(**inputs).last_hidden_state.mean(dim=1)
        embeddings.append(outputs)
        i = i + 1
        if (i % 100 == 0):
            print(f'Processed {i} sentences')

    return torch.stack(embeddings)

In [None]:
x_embeddings = embed_sentences(text_data)

A_embeddings = embed_sentences(reborn_pos_sentences)

B_embeddings = embed_sentences(reborn_neg_sentences)

In [None]:
for i, sentence in enumerate(text_data):
    print(f"Embedding for Original Lowercase Sentence {i + 1} ({sentence}):")
    print(x_embeddings[i])
    print("- - - - - - - - - -")

In [None]:
for i, sentence in enumerate(reborn_pos_sentences):
    print(f"Embedding for Reborn Positive Sentence {i + 1} ({sentence}):")
    print(A_embeddings[i])
    print("- - - - - - - - - -")

In [None]:
for i, sentence in enumerate(reborn_neg_sentences):
    print(f"Embedding for Reborn Negative Sentence {i + 1} ({sentence}):")
    print(B_embeddings[i])
    print("- - - - - - - - - -")

In [None]:
dfnew["xEmbedding"] = x_embeddings.tolist()
dfnew["AEmbedding"] = A_embeddings.tolist()
dfnew["BEmbedding"] = B_embeddings.tolist()
dfnew

# Sarcastic Utterances Detection
## 1)
"We utilize cosine similarity to measure the similarity between representations of original sentence 𝐻𝑥     
and generation texts 𝐻𝐴/𝐻𝐵.

Then we use the following equation to calculate a difference score of each sentence:     
diff = sim(𝐻𝑥, 𝐻𝐴) < 𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑 || sim(𝐻𝑥, 𝐻𝐵) < 𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑     
where || means "or" logical operator."

In [None]:
def calculate_difference_scores(x_embeddings, A_embeddings, B_embeddings, threshold):
    i = 0
    diff_scores = []
    for x_emb, A_emb, B_emb in zip(x_embeddings, A_embeddings, B_embeddings):
        sim_Hx_HA = cosine_similarity(x_emb, A_emb)
        sim_Hx_HB = cosine_similarity(x_emb, B_emb)

        diff = (sim_Hx_HA < threshold) or (sim_Hx_HB < threshold)
        diff_scores.append(diff)
        i = i + 1
        if (i % 100 == 0):
            print(f'Processed {i} embeddings')

    return diff_scores

In [None]:
threshold = 0.755

diff_scores = calculate_difference_scores(x_embeddings, A_embeddings, B_embeddings, threshold)
diff_scores

## 2)
"Since the sarcastic utterances are influenced more than normal texts during the masking and generation procedure,     
the difference score of sarcastic texts should be greater than a non-sarcastic one.

If we have a threshold value which separates sarcastic texts and normal texts,     
we can yield the prediction 𝑦 by:     
𝑦 = I(diff)."

In [None]:
predicted_labels = [int(diff) for diff in diff_scores]
print(predicted_labels)
print(sum(predicted_labels))

In [None]:
labels = ["sarc" if diff else "notsarc" for diff in diff_scores]
print(labels)

In [None]:
dffinal = pd.DataFrame({"text": text_data, "class": label_data, "prediction": labels})
dffinal

# Main Experiment Results

In [None]:
true_labels = [1 if pred == "sarc" else 0 for pred in df["class"]]
print(true_labels)
print(predicted_labels)

accuracy = accuracy_score(true_labels, predicted_labels)
precision = precision_score(true_labels, predicted_labels)
f1 = f1_score(true_labels, predicted_labels)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("F1 Score:", f1)

In [None]:
conf_matrix = confusion_matrix(true_labels, predicted_labels)

print("Confusion Matrix:")
print(conf_matrix)