# Introduction

Sarcasm is a sophisticated linguistic phenomenon that can seriously confuse existing sentiment classification systems.  
Sarcasm detection, the task of predicting whether a given text contains sarcasm, has therefore received considerable research attention.  

Recently, many methods have been proposed for sarcasm detection. They can be broadly classified into two categories.  
The first is the text-only approach, which concentrates on the utterance itself, for example by exploiting incongruous expressions to detect sarcastic text.  
The second relies on extra information, exploiting external knowledge such as user history and common-sense knowledge to assist detection.

We propose an unsupervised sarcasm detection method.  

First, we leverage external sentiment knowledge to mask prominent tokens. The masked texts are then fed into a pre-trained generation model, which follows the remaining logical structure to generate new text.  
There is a good chance that these regenerated ("reborn") texts are no longer sarcastic, or at least make more sense.  

Second, after computing the similarity score between each generated sentence and the original one, we extract features from these scores to decide whether a sentence is sarcastic.  

Finally, we construct several unsupervised baselines and conduct experiments on the IAC-V2 dataset.

# Imports and Reading Data

In [None]:
!pip install senticnet

Collecting senticnet
  Using cached senticnet-1.6-py3-none-any.whl.metadata (2.6 kB)
Downloading senticnet-1.6-py3-none-any.whl (51.9 MB)
Installing collected packages: senticnet
Successfully installed senticnet-1.6


In [None]:
import numpy as np
import pandas as pd

from senticnet.senticnet import SenticNet

import nltk
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from transformers import AutoTokenizer, AutoModel
from transformers import BartTokenizer, BartForConditionalGeneration
import torch

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import accuracy_score, precision_score, f1_score
from sklearn.metrics import confusion_matrix

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
df = pd.read_csv("/content/drive/My Drive/AlifResearch/iSarcasm/train.csv")
df1 = pd.read_csv("/content/drive/My Drive/AlifResearch/iSarcasm/test.csv")

In [None]:
df

Unnamed: 0,id,text,label,emoji,tag
0,1,why do small shouldered tiny guys wear huge t ...,0,,
1,2,"good morning , please go and vote ! <repeated>...",0,🙅,"<url>,</hashtag>,<hashtag>,<number>,<repeated>"
2,3,is it even christmas if there isn ’ t a fight ...,1,,
3,4,helping mum with her maths work for the course...,0,,
4,5,<hashtag> dear customer </hashtag> i am sorry ...,0,,"<hashtag>,</hashtag>"
...,...,...,...,...,...
3111,3112,fcking hate being let down . don ’ t get my ho...,0,,
3112,3113,last day in my twenties 😫,0,😫,
3113,3114,who ’ s dick do i have to suck for some dominos,0,,
3114,3115,<user> yet if you threw cold water it would st...,0,,<user>


In [None]:
df1

Unnamed: 0,id,text,label,emoji,tag
0,3464,saw poppin fresh in the macy ' s parade . my d...,1,,
1,3465,i knew as soon as i heard doing ford was cutti...,0,,"<url>,<percent>"
2,3466,great advice from well established individuals...,0,,<user>
3,3467,"eating apple sauce , chicken thighs , broccoli...",0,,"<hashtag>,</hashtag>"
4,3468,<user> ur not a real smiler if ur not expectin...,1,,<user>
...,...,...,...,...,...
882,4346,imagine that it ' s going to cost me <number> ...,0,,<number>
883,4347,people really out here tryna argue you do not ...,0,,<url>
884,4348,"<user> and their relentless running game , on ...",0,,"<number>,<user>"
885,4349,why is it that whether i get out of bed at <nu...,0,,"<number>,<allcaps>,<repeated>,</allcaps>"


In [None]:
# Concatenate vertically
df = pd.concat([df, df1], ignore_index=True)
df

Unnamed: 0,id,text,label,emoji,tag
0,1,why do small shouldered tiny guys wear huge t ...,0,,
1,2,"good morning , please go and vote ! <repeated>...",0,🙅,"<url>,</hashtag>,<hashtag>,<number>,<repeated>"
2,3,is it even christmas if there isn ’ t a fight ...,1,,
3,4,helping mum with her maths work for the course...,0,,
4,5,<hashtag> dear customer </hashtag> i am sorry ...,0,,"<hashtag>,</hashtag>"
...,...,...,...,...,...
3998,4346,imagine that it ' s going to cost me <number> ...,0,,<number>
3999,4347,people really out here tryna argue you do not ...,0,,<url>
4000,4348,"<user> and their relentless running game , on ...",0,,"<number>,<user>"
4001,4349,why is it that whether i get out of bed at <nu...,0,,"<number>,<allcaps>,<repeated>,</allcaps>"


In [None]:
df = df.drop(columns=['id', 'emoji', 'tag'])
df

Unnamed: 0,text,label
0,why do small shouldered tiny guys wear huge t ...,0
1,"good morning , please go and vote ! <repeated>...",0
2,is it even christmas if there isn ’ t a fight ...,1
3,helping mum with her maths work for the course...,0
4,<hashtag> dear customer </hashtag> i am sorry ...,0
...,...,...
3998,imagine that it ' s going to cost me <number> ...,0
3999,people really out here tryna argue you do not ...,0
4000,"<user> and their relentless running game , on ...",0
4001,why is it that whether i get out of bed at <nu...,0


In [None]:
df['class'] = df['label'].map({0: 'notsarc', 1: 'sarc'})
df

Unnamed: 0,text,label,class
0,why do small shouldered tiny guys wear huge t ...,0,notsarc
1,"good morning , please go and vote ! <repeated>...",0,notsarc
2,is it even christmas if there isn ’ t a fight ...,1,sarc
3,helping mum with her maths work for the course...,0,notsarc
4,<hashtag> dear customer </hashtag> i am sorry ...,0,notsarc
...,...,...,...
3998,imagine that it ' s going to cost me <number> ...,0,notsarc
3999,people really out here tryna argue you do not ...,0,notsarc
4000,"<user> and their relentless running game , on ...",0,notsarc
4001,why is it that whether i get out of bed at <nu...,0,notsarc


In [None]:
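# Drop the numeric 'label' column; 'class' now carries the string labels.
df = df.drop(columns=['label'])
# The text column is already called 'text' here, so this rename is a no-op.
# It only matters for input files whose text column is named 'tweet'.
df = df.rename(columns={'tweet': 'text'})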

In [None]:
df

Unnamed: 0,text,class
0,why do small shouldered tiny guys wear huge t ...,notsarc
1,"good morning , please go and vote ! <repeated>...",notsarc
2,is it even christmas if there isn ’ t a fight ...,sarc
3,helping mum with her maths work for the course...,notsarc
4,<hashtag> dear customer </hashtag> i am sorry ...,notsarc
...,...,...
3998,imagine that it ' s going to cost me <number> ...,notsarc
3999,people really out here tryna argue you do not ...,notsarc
4000,"<user> and their relentless running game , on ...",notsarc
4001,why is it that whether i get out of bed at <nu...,notsarc


In [None]:
import re
# Function to remove text inside <>
def remove_brackets(text):
    return re.sub(r'<.*?>', '', text).strip()

# Apply the function to the 'text' column
df['text'] = df['text'].apply(remove_brackets)
df

Unnamed: 0,text,class
0,why do small shouldered tiny guys wear huge t ...,notsarc
1,"good morning , please go and vote ! it only t...",notsarc
2,is it even christmas if there isn ’ t a fight ...,sarc
3,helping mum with her maths work for the course...,notsarc
4,dear customer i am sorry that the mobile phon...,notsarc
...,...,...
3998,imagine that it ' s going to cost me pound to...,notsarc
3999,people really out here tryna argue you do not ...,notsarc
4000,"and their relentless running game , on the bri...",notsarc
4001,why is it that whether i get out of bed at or...,notsarc
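Removing the `<...>` placeholders leaves runs of spaces behind, which is visible in the cleaned rows (for example `vote !  it only takes  minutes`). A minimal sketch of one way to tidy this up is shown below. The helper name `remove_brackets_collapsed` and the sample string are illustrative assumptions, not part of the original notebook.

```python
import re

def remove_brackets_collapsed(text):
    """Strip <...> placeholders, then collapse the whitespace they leave behind."""
    without_tags = re.sub(r'<.*?>', '', text)
    # Replace any run of whitespace with a single space and trim the ends.
    return re.sub(r'\s+', ' ', without_tags).strip()

# Illustrative input, loosely modelled on the placeholder style of the dataset.
sample = "it only takes <number> minutes <hashtag> e uelections 2019 </hashtag>"
print(remove_brackets_collapsed(sample))
```

Whether the extra spaces matter depends on the later steps: `str.split()` and `word_tokenize` both ignore repeated whitespace, so this is mostly a readability improvement for the printed text.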


In [None]:
!pip install emoji

Collecting emoji
  Downloading emoji-2.12.1-py3-none-any.whl.metadata (5.4 kB)
Downloading emoji-2.12.1-py3-none-any.whl (431 kB)
Installing collected packages: emoji
Successfully installed emoji-2.12.1


In [None]:
import emoji
# Function to remove emojis
def remove_emojis(text):
    return emoji.replace_emoji(text, replace='')

# Apply the function to the 'text' column
df['text'] = df['text'].apply(remove_emojis)
df

Unnamed: 0,text,class
0,why do small shouldered tiny guys wear huge t ...,notsarc
1,"good morning , please go and vote ! it only t...",notsarc
2,is it even christmas if there isn ’ t a fight ...,sarc
3,helping mum with her maths work for the course...,notsarc
4,dear customer i am sorry that the mobile phon...,notsarc
...,...,...
3998,imagine that it ' s going to cost me pound to...,notsarc
3999,people really out here tryna argue you do not ...,notsarc
4000,"and their relentless running game , on the bri...",notsarc
4001,why is it that whether i get out of bed at or...,notsarc


# Understanding Data

In [None]:
df.dtypes

text     object
class    object


In [None]:
df.columns

Index(['text', 'class'], dtype='object')

In [None]:
text_data_original = list(df['text'])
text_data = [x.lower() for x in text_data_original]
print(*text_data, sep = "\n")

why do small shouldered tiny guys wear huge t shirts ?
good morning , please go and vote !  it only takes  minutes and a low turnout will hand victory to the brexit party   e uelections 2019
is it even christmas if there isn ’ t a fight with neighbours and a broken wrist ?
helping mum with her maths work for the course she ’ s taking and i ’ m slowly realising i am not great at maths
dear customer  i am sorry that the mobile phone reseller in the mall fucked you over . we all are not a bunch of sheisters . i hope your other life issues gets better and that i earned your future business .
anyone fancy writing my lit review for me ? can not . be . arsed .
so the  episode about ladonna was one of the most poignant and sad investigations of abuse of power and discrimination in institutions . v hard to listen to but so important .
middle aged women are bitchier than most people i know
baby tobias has arrived ! i ’ ll be taking a spot of leave but back into the swing of things for spring / s

In [None]:
label_data = list(df['class'])
print(*label_data, sep = "\n")

notsarc
notsarc
sarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
sarc
sarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
sarc
notsarc
notsarc
notsarc
notsarc
sarc
sarc
sarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
sarc
notsarc
sarc
notsarc
notsarc
notsarc
notsarc
sarc
sarc
notsarc
sarc
notsarc
notsarc
notsarc
notsarc
notsarc
sarc
notsarc
notsarc
notsarc
sarc
notsarc
notsarc
notsarc
notsarc
sarc
notsarc
notsarc
notsarc
sarc
notsarc
notsarc
notsarc
notsarc
notsarc
sarc
notsarc
sarc
notsarc
notsarc
sarc
sarc
notsarc
notsarc
notsarc
notsarc
notsarc
sarc
notsarc
notsarc
notsarc
sarc
notsarc
notsarc
notsarc
notsarc
notsarc
sarc
sarc
notsarc
notsarc
notsarc
notsarc
sarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
not

# Overview

The proposed framework contains three main components:  

1) Sentence masking and generation.  
This step first identifies the main components of a sentence, masks them so that the change has a strong impact on the original sentence, and then regenerates the masked text;  

2) Sentence representation.  
This step computes dense vector representations of the sentences;  

3) Sarcastic utterance detection.  
This step leverages the similarity scores between the original and regenerated sentences to decide whether an utterance is sarcastic.
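To make the third component more concrete, the sketch below scores an original sentence against a regenerated one using cosine similarity. It is only an illustration: the bag-of-words counts are an assumed stand-in for the dense sentence vectors of component 2, and the names `bow_vector` and `cosine` are hypothetical.

```python
import math
from collections import Counter

def bow_vector(sentence):
    """Toy sentence representation: lowercase token counts (stand-in for a dense encoder)."""
    return Counter(sentence.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Illustrative sentence pair: an original utterance and a possible regeneration.
original = "is it even christmas if there is no fight"
reborn = "is it worth it if there is no problem"
score = cosine(bow_vector(original), bow_vector(reborn))
print(f"similarity = {score:.3f}")
```

The intuition is that a low score means the generator had to change the sentence substantially once the affective words were masked, which is the kind of signal the detection step can build features on.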

# Sentences Mask and Generation
## 1)
"First, we use the sentiment common knowledge retrieved from SenticNet to recognize affective words in the sentence 𝑥,     
and split those words into two sets according to its sentiment polarities:    
PW = {pw1, pw2, ..., pwh} and    
NW = {nw1, nw2, ..., nwk},     
h + k <= n."
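The split into PW and NW can be illustrated with a toy example. The mini-lexicon `TOY_POLARITY` below is a hypothetical stand-in for SenticNet's polarity labels, and `split_by_polarity` is an illustrative helper rather than the notebook's own function.

```python
# Hypothetical mini-lexicon standing in for SenticNet polarity labels.
TOY_POLARITY = {
    "great": "positive",
    "hope": "positive",
    "sorry": "negative",
    "broken": "negative",
}

def split_by_polarity(tokens, lexicon):
    """Return (PW, NW): the positive and negative affective words among the tokens."""
    pw = {t for t in tokens if lexicon.get(t) == "positive"}
    nw = {t for t in tokens if lexicon.get(t) == "negative"}
    return pw, nw

tokens = "i am sorry but i hope the broken phone works great".split()
PW, NW = split_by_polarity(tokens, TOY_POLARITY)
print(sorted(PW), sorted(NW))

# Words without a polarity label fall into neither set, so h + k <= n holds.
assert len(PW) + len(NW) <= len(tokens)
```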

In [None]:
def tokenize_sentence(sentence):
    tokens = word_tokenize(sentence)

    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens

In [None]:
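# Load SenticNet once; rebuilding it for every word is needlessly slow.
sn = SenticNet()

def get_sentiment_polarity_from_senticnet(word):
    word = word.lower()

    try:
        return sn.polarity_label(word)
    except Exception:
        # Words that SenticNet does not cover are treated as neutral.
        return "neutral"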

In [None]:
def analyze_sentiment(sentences):
    positive_words = []
    negative_words = []

    for sentence in sentences:
        words = tokenize_sentence(sentence)

        PW = set()
        NW = set()

        for word in words:
            sentiment_polarity = get_sentiment_polarity_from_senticnet(word)
            if sentiment_polarity == "positive":
                PW.add(word.lower())
            elif sentiment_polarity == "negative":
                NW.add(word.lower())

        positive_words.append(PW)
        negative_words.append(NW)

    return positive_words, negative_words

In [None]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
positive_words, negative_words = analyze_sentiment(text_data)

for i, sentence in enumerate(text_data):
    print(f"Sentence: {sentence}")
    print(f"Positive Words: {positive_words[i]}")
    print(f"Negative Words: {negative_words[i]}")
    print("- - - - - - - - - -")

Streaming output truncated to the last 5000 lines.
Sentence: the one thing i wish i ’ d brought travelling with me is multi vitamins ! budget backpacking is not great for getting a balanced diet .
Positive Words: {'multi', 'great', 'brought', 'backpacking', 'travelling'}
Negative Words: {'vitamin'}
- - - - - - - - - -
Sentence: win a cyberpower  gaming pc plus  fifa   on pc .  ,  ,
Positive Words: {'plus', 'fifa', 'gaming', 'win'}
Negative Words: set()
- - - - - - - - - -
Sentence: i cannot wait for halloween        
Positive Words: set()
Negative Words: {'wait'}
- - - - - - - - - -
Sentence: thought i ' d be a sheep . it ' s yanny for me
Positive Words: set()
Negative Words: set()
- - - - - - - - - -
Sentence: bored of all these manager rumors .  viera , arteta and gerrard ? are these the guys to replace rafa or mo diame ?  nufc
Positive Words: {'rumor'}
Negative Words: {'bored'}
- - - - - - - - - -
Sentence: note to self .  the cold air is not good with asthma , bronchi

In [None]:
df["PW"] = positive_words
df["NW"] = negative_words
df

Unnamed: 0,text,class,PW,NW
0,why do small shouldered tiny guys wear huge t ...,notsarc,"{wear, huge, tiny, shirt}",{}
1,"good morning , please go and vote ! it only t...",notsarc,"{party, good, victory, turnout}",{low}
2,is it even christmas if there isn ’ t a fight ...,sarc,"{wrist, christmas, fight}",{broken}
3,helping mum with her maths work for the course...,notsarc,"{great, slowly, math, work}",{mum}
4,dear customer i am sorry that the mobile phon...,notsarc,"{hope, better}","{sorry, fucked}"
...,...,...,...,...
3998,imagine that it ' s going to cost me pound to...,notsarc,"{pound, travel, imagine}",{cost}
3999,people really out here tryna argue you do not ...,notsarc,{},{argue}
4000,"and their relentless running game , on the bri...",notsarc,"{army, running}","{relentless, dangerous}"
4001,why is it that whether i get out of bed at or...,notsarc,{},{}


## 2)
"Second, we analyze the sentence to get its syntax information to identify non-stop words     
     𝑆𝑊 = {𝑠𝑤1, 𝑠𝑤2, ..., 𝑠𝑤𝑚, 𝑚 ≤ 𝑛}.     
Intuitively, these words are the main components of sentences. Then we split 𝑆𝑊 into two sets which satisfy :     
     𝑆𝑊1 ∪ 𝑆𝑊2 = 𝑆𝑊 ,     
     |𝑆𝑊1| = |𝑆𝑊2|."

In [None]:
def extract_non_stop_words(sentence):
    words = nltk.word_tokenize(sentence)

    stop_words = set(stopwords.words("english"))

    non_stop_words = [word.lower() for word in words if word.lower() not in stop_words and word.isalpha()]

    return non_stop_words

In [None]:
def split_non_stop_words(non_stop_words):
    m = len(non_stop_words)
    m1 = m // 2
    SW1 = set(non_stop_words[:m1])
    SW2 = set(non_stop_words[m1:])
    return SW1, SW2
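A quick check of how the floor-division split above behaves may be useful, because the condition |SW1| = |SW2| can only hold exactly when the number of non-stop words is even. The function is repeated here so the example runs on its own; the two word lists are taken from sentences in the dataset.

```python
def split_non_stop_words(non_stop_words):
    # Same logic as the cell above: first half goes to SW1, the rest to SW2.
    m = len(non_stop_words)
    m1 = m // 2
    return set(non_stop_words[:m1]), set(non_stop_words[m1:])

even_case = ["even", "christmas", "fight", "neighbours", "broken", "wrist"]
odd_case = ["thought", "sheep", "yanny"]

sw1_even, sw2_even = split_non_stop_words(even_case)
sw1_odd, sw2_odd = split_non_stop_words(odd_case)

print(len(sw1_even), len(sw2_even))  # equal sizes for an even count
print(len(sw1_odd), len(sw2_odd))    # SW2 is one larger for an odd count
```

Because the halves are stored as sets, a word that occurs in both halves of a sentence (such as `maths` or `get` in the rows shown later) appears in both SW1 and SW2, and repeated words within one half collapse to a single entry.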

In [None]:
def analyze_sentences(sentences):
    all_SW1 = []
    all_SW2 = []

    for sentence in sentences:
        non_stop_words = extract_non_stop_words(sentence)
        SW1, SW2 = split_non_stop_words(non_stop_words)
        all_SW1.append(SW1)
        all_SW2.append(SW2)

    return all_SW1, all_SW2

In [None]:
all_SW1, all_SW2 = analyze_sentences(text_data)

for i, sentence in enumerate(text_data):
    print(f"Sentence: {sentence}")
    print(f"SW1: {all_SW1[i]}")
    print(f"SW2: {all_SW2[i]}")
    print("- - - - - - - - - -")

Streaming output truncated to the last 5000 lines.
Sentence: the one thing i wish i ’ d brought travelling with me is multi vitamins ! budget backpacking is not great for getting a balanced diet .
SW1: {'multi', 'wish', 'thing', 'one', 'brought', 'travelling'}
SW2: {'getting', 'great', 'balanced', 'diet', 'vitamins', 'budget', 'backpacking'}
- - - - - - - - - -
Sentence: win a cyberpower  gaming pc plus  fifa   on pc .  ,  ,
SW1: {'cyberpower', 'gaming', 'win'}
SW2: {'plus', 'fifa', 'pc'}
- - - - - - - - - -
Sentence: i cannot wait for halloween        
SW1: {'wait'}
SW2: {'halloween'}
- - - - - - - - - -
Sentence: thought i ' d be a sheep . it ' s yanny for me
SW1: {'thought'}
SW2: {'sheep', 'yanny'}
- - - - - - - - - -
Sentence: bored of all these manager rumors .  viera , arteta and gerrard ? are these the guys to replace rafa or mo diame ?  nufc
SW1: {'bored', 'gerrard', 'manager', 'arteta', 'rumors', 'viera'}
SW2: {'rafa', 'replace', 'nufc', 'mo', 'guys', 'diame'}
- 

In [None]:
df["SW1"] = all_SW1
df["SW2"] = all_SW2
df

Unnamed: 0,text,class,PW,NW,SW1,SW2
0,why do small shouldered tiny guys wear huge t ...,notsarc,"{wear, huge, tiny, shirt}",{},"{tiny, shouldered, small}","{wear, huge, guys, shirts}"
1,"good morning , please go and vote ! it only t...",notsarc,"{party, good, victory, turnout}",{low},"{go, vote, please, minutes, good, morning, takes}","{hand, brexit, e, uelections, low, party, vict..."
2,is it even christmas if there isn ’ t a fight ...,sarc,"{wrist, christmas, fight}",{broken},"{even, christmas, fight}","{neighbours, broken, wrist}"
3,helping mum with her maths work for the course...,notsarc,"{great, slowly, math, work}",{mum},"{mum, helping, maths, course, work}","{slowly, great, taking, maths, realising}"
4,dear customer i am sorry that the mobile phon...,notsarc,"{hope, better}","{sorry, fucked}","{dear, bunch, mall, sorry, customer, reseller,...","{business, gets, earned, issues, better, hope,..."
...,...,...,...,...,...,...
3998,imagine that it ' s going to cost me pound to...,notsarc,"{pound, travel, imagine}",{cost},"{going, cost, imagine}","{year, university, travel, pound}"
3999,people really out here tryna argue you do not ...,notsarc,{},{argue},"{really, people, tryna}","{soap, argue, need, sanitary}"
4000,"and their relentless running game , on the bri...",notsarc,"{army, running}","{relentless, dangerous}","{relentless, brink, running, upsetting, game}","{knows, well, game, army, dangerous, ground}"
4001,why is it that whether i get out of bed at or...,notsarc,{},{},"{get, bed, always, whether}","{get, kitchen, downstairs, clock}"


## 3)
"Here, 𝑃𝑊 ∪ 𝑆𝑊1 and 𝑁𝑊 ∪ 𝑆𝑊2 are used to mask original sentence respectively. So, we will obtain two masked sentences     
𝑥𝑚1 = { [𝑚]1, 𝑥2, ..., [𝑚]𝑛} and     
𝑥𝑚2 = {𝑥1, [𝑚]2, ..., 𝑥𝑛}."

In [None]:
def construct_union(sentences, PW, NW, all_SW1, all_SW2):
    union_PW_SW1 = []
    union_NW_SW2 = []

    for i, sentence in enumerate(sentences):
        SW1 = all_SW1[i]
        SW2 = all_SW2[i]

        union_PW_SW1.append(PW[i].union(SW1))
        union_NW_SW2.append(NW[i].union(SW2))

    return union_PW_SW1, union_NW_SW2

In [None]:
union_PW_SW1, union_NW_SW2 = construct_union(text_data, positive_words, negative_words, all_SW1, all_SW2)
print(union_PW_SW1)
print(union_NW_SW2)



In [None]:
df["union_PW_SW1"] = union_PW_SW1
df["union_NW_SW2"] = union_NW_SW2
df

Unnamed: 0,text,class,PW,NW,SW1,SW2,union_PW_SW1,union_NW_SW2
0,why do small shouldered tiny guys wear huge t ...,notsarc,"{wear, huge, tiny, shirt}",{},"{tiny, shouldered, small}","{wear, huge, guys, shirts}","{small, shirt, tiny, wear, shouldered, huge}","{wear, huge, guys, shirts}"
1,"good morning , please go and vote ! it only t...",notsarc,"{party, good, victory, turnout}",{low},"{go, vote, please, minutes, good, morning, takes}","{hand, brexit, e, uelections, low, party, vict...","{go, vote, please, minutes, good, morning, tak...","{hand, brexit, uelections, low, party, e, vict..."
2,is it even christmas if there isn ’ t a fight ...,sarc,"{wrist, christmas, fight}",{broken},"{even, christmas, fight}","{neighbours, broken, wrist}","{even, wrist, christmas, fight}","{neighbours, broken, wrist}"
3,helping mum with her maths work for the course...,notsarc,"{great, slowly, math, work}",{mum},"{mum, helping, maths, course, work}","{slowly, great, taking, maths, realising}","{math, slowly, great, maths, helping, mum, cou...","{maths, slowly, great, mum, realising, taking}"
4,dear customer i am sorry that the mobile phon...,notsarc,"{hope, better}","{sorry, fucked}","{dear, bunch, mall, sorry, customer, reseller,...","{business, gets, earned, issues, better, hope,...","{dear, bunch, mobile, mall, sorry, customer, p...","{business, gets, sorry, earned, issues, better..."
...,...,...,...,...,...,...,...,...
3998,imagine that it ' s going to cost me pound to...,notsarc,"{pound, travel, imagine}",{cost},"{going, cost, imagine}","{year, university, travel, pound}","{travel, pound, going, imagine, cost}","{year, university, travel, pound, cost}"
3999,people really out here tryna argue you do not ...,notsarc,{},{argue},"{really, people, tryna}","{soap, argue, need, sanitary}","{really, people, tryna}","{need, argue, soap, sanitary}"
4000,"and their relentless running game , on the bri...",notsarc,"{army, running}","{relentless, dangerous}","{relentless, brink, running, upsetting, game}","{knows, well, game, army, dangerous, ground}","{army, relentless, running, brink, upsetting, ...","{relentless, knows, ground, well, army, danger..."
4001,why is it that whether i get out of bed at or...,notsarc,{},{},"{get, bed, always, whether}","{get, kitchen, downstairs, clock}","{get, bed, always, whether}","{get, kitchen, downstairs, clock}"


In [None]:
def mask_sentence(sentence, mask_words, max_mask_count = 5):
    masked_sentence = []

    for word in sentence.split():
        if word in mask_words and max_mask_count > 0:
            masked_sentence.append("<mask>")
            max_mask_count -= 1
        else:
            masked_sentence.append(word)

    return " ".join(masked_sentence)
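Two properties of `mask_sentence` are easy to miss: it matches whitespace-separated surface forms exactly, and it stops after `max_mask_count` replacements. The sketch below repeats the function so that it runs on its own; the example sentence and mask sets are illustrative.

```python
def mask_sentence(sentence, mask_words, max_mask_count=5):
    # Same logic as the cell above, repeated so the example is self-contained.
    masked_sentence = []
    for word in sentence.split():
        if word in mask_words and max_mask_count > 0:
            masked_sentence.append("<mask>")
            max_mask_count -= 1
        else:
            masked_sentence.append(word)
    return " ".join(masked_sentence)

sentence = "multi vitamins ! budget backpacking is not great"

# 'vitamin' is a lemma, so it does not match the surface form 'vitamins'.
lemma_case = mask_sentence(sentence, {"vitamin", "great"})
print(lemma_case)

# With a cap of 2, only the first two matching words are masked.
capped_case = mask_sentence(sentence, {"multi", "budget", "great"}, max_mask_count=2)
print(capped_case)
```

This explains why lemmatized entries in PW or NW sometimes leave the corresponding word unmasked, and why words late in a long sentence stay intact once the first five matches have been replaced.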

In [None]:
def construct_masked_sentences(sentences, union_PW_SW1, union_NW_SW2):
    masked_pos_sentences = []
    masked_neg_sentences = []

    for i, sentence in enumerate(sentences):

        masked_pos_sentence = mask_sentence(sentence, union_PW_SW1[i])
        masked_pos_sentences.append(masked_pos_sentence)

        masked_neg_sentence = mask_sentence(sentence, union_NW_SW2[i])
        masked_neg_sentences.append(masked_neg_sentence)

    return masked_pos_sentences, masked_neg_sentences

In [None]:
masked_pos_sentences, masked_neg_sentences = construct_masked_sentences(text_data, union_PW_SW1, union_NW_SW2)

for i, sentence in enumerate(text_data):
    print(f"Original Sentence: {sentence}")
    print(f"Masked Positive Sentence: {masked_pos_sentences[i]}")
    print(f"Masked Negative Sentence: {masked_neg_sentences[i]}")
    print("- - - - - - - - - -")

Streaming output truncated to the last 5000 lines.
Original Sentence: the one thing i wish i ’ d brought travelling with me is multi vitamins ! budget backpacking is not great for getting a balanced diet .
Masked Positive Sentence: the <mask> <mask> i <mask> i ’ d <mask> <mask> with me is multi vitamins ! budget backpacking is not great for getting a balanced diet .
Masked Negative Sentence: the one thing i wish i ’ d brought travelling with me is multi <mask> ! <mask> <mask> is not <mask> for <mask> a balanced diet .
- - - - - - - - - -
Original Sentence: win a cyberpower  gaming pc plus  fifa   on pc .  ,  ,
Masked Positive Sentence: <mask> a <mask> <mask> pc <mask> <mask> on pc . , ,
Masked Negative Sentence: win a cyberpower gaming <mask> <mask> <mask> on <mask> . , ,
- - - - - - - - - -
Original Sentence: i cannot wait for halloween        
Masked Positive Sentence: i cannot <mask> for halloween
Masked Negative Sentence: i cannot <mask> for <mask>
- - - - - - - - - -

In [None]:
dfnew = pd.DataFrame({"text": text_data_original, "maskedPosSentence": masked_pos_sentences, "maskedNegSentence": masked_neg_sentences})
dfnew

Unnamed: 0,text,maskedPosSentence,maskedNegSentence
0,why do small shouldered tiny guys wear huge t ...,why do <mask> <mask> <mask> guys <mask> <mask>...,why do small shouldered tiny <mask> <mask> <ma...
1,"good morning , please go and vote ! it only t...","<mask> <mask> , <mask> <mask> and <mask> ! it ...","good morning , please go and vote ! it only ta..."
2,is it even christmas if there isn ’ t a fight ...,is it <mask> <mask> if there isn ’ t a <mask> ...,is it even christmas if there isn ’ t a fight ...
3,helping mum with her maths work for the course...,<mask> <mask> with her <mask> <mask> for the <...,helping <mask> with her <mask> work for the co...
4,dear customer i am sorry that the mobile phon...,<mask> <mask> i am <mask> that the <mask> <mas...,dear customer i am <mask> that the mobile phon...
...,...,...,...
3998,imagine that it ' s going to cost me pound to...,<mask> that it ' s <mask> to <mask> me <mask> ...,imagine that it ' s going to <mask> me <mask> ...
3999,people really out here tryna argue you do not ...,<mask> <mask> out here <mask> argue you do not...,people really out here tryna <mask> you do not...
4000,"and their relentless running game , on the bri...","and their <mask> <mask> <mask> , on the <mask>...","and their <mask> running <mask> , on the brink..."
4001,why is it that whether i get out of bed at or...,why is it that <mask> i <mask> out of <mask> a...,why is it that whether i <mask> out of bed at ...


## 4)
"These two masked sentences are fed into the pre-trained generation model to fulfill the generation procedure.     
𝑨{𝑎1, ..., 𝑥2, ..., 𝑥𝑛−1, ..., 𝑎𝑜 } = 𝐵𝐴𝑅𝑇 ( [𝑚]1, 𝑥2, ..., 𝑥𝑛−1, [𝑚]𝑛 )----(1)  
Thus, we will obtain two reborn sentences     
𝐴 = {𝑎1, 𝑎2, ..., 𝑎𝑜 } and     
𝐵 = {𝑏1, 𝑏2, ..., 𝑏𝑝 }."

In [None]:
%pip install transformers



In [None]:
def generate_reborn_sentences(masked_sentences):
    # Load the pre-trained BART tokenizer and model used for mask infilling.
    tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

    reborn_sentences = []
    for i, masked_sentence in enumerate(masked_sentences, start=1):
        inputs = tokenizer(masked_sentence, return_tensors="pt")
        # Default generation settings are used for every sentence.
        generated_encoded = model.generate(inputs['input_ids'])
        reborn_sentence = tokenizer.batch_decode(generated_encoded, skip_special_tokens=True)[0]
        reborn_sentences.append(reborn_sentence)
        # Report progress every 100 sentences.
        if i % 100 == 0:
            print(f'Processed {i} sentences')

    return reborn_sentences

In [None]:
from google.colab import userdata
# Read the token from Colab secrets and keep it out of the cell output.
hf_token = userdata.get('HF_TOKEN')

In [None]:
import os
# Never hard-code the token in the notebook; pass the secret through instead.
os.environ["HF_TOKEN"] = hf_token

In [None]:
reborn_pos_sentences = generate_reborn_sentences(masked_pos_sentences)

reborn_neg_sentences = generate_reborn_sentences(masked_neg_sentences)


Processed 100 sentences
Processed 200 sentences
Processed 300 sentences
Processed 400 sentences
Processed 500 sentences
Processed 600 sentences
Processed 700 sentences
Processed 800 sentences
Processed 900 sentences
Processed 1000 sentences
Processed 1100 sentences
Processed 1200 sentences
Processed 1300 sentences
Processed 1400 sentences
Processed 1500 sentences
Processed 1600 sentences
Processed 1700 sentences
Processed 1800 sentences
Processed 1900 sentences
Processed 2000 sentences
Processed 2100 sentences
Processed 2200 sentences
Processed 2300 sentences
Processed 2400 sentences
Processed 2500 sentences
Processed 2600 sentences
Processed 2700 sentences
Processed 2800 sentences
Processed 2900 sentences
Processed 3000 sentences
Processed 3100 sentences
Processed 3200 sentences
Processed 3300 sentences
Processed 3400 sentences
Processed 3500 sentences
Processed 3600 sentences
Processed 3700 sentences
Processed 3800 sentences
Processed 3900 sentences
Processed 4000 sentences
Processed

In [None]:
print("Reborn Sentences for Masked Positive Sentences:")
for i, reborn_sentence in enumerate(reborn_pos_sentences):
    print(f"Reborn Sentence {i + 1}: {reborn_sentence}")

Reborn Sentences for Masked Positive Sentences:
Reborn Sentence 1: why do you guys wear t shirts?
Reborn Sentence 2: In the end, it is all about turnout and turnout! it only takes minutes and
Reborn Sentence 3: is it worth it if there isn ’ t a problem with neighbours and a broken
Reborn Sentence 4: i ’ m struggling with her math skills for the classes she ’ s taking
Reborn Sentence 5: i hope you get better. i am sorry that the guy who was the reseller
Reborn Sentence 6: What is going on in my lit. for me? can not. be. ar
Reborn Sentence 7: so the most important thing about this report was that it was one of the most comprehensive
Reborn Sentence 8: I know people who are bitchier than most people i know
Reborn Sentence 9: The summer has come and gone! i ’ ll be taking a bit of leave
Reborn Sentence 10: do not have to lose when re retroing a colorway that's so limited
Reborn Sentence 11: ... way too much...
Reborn Sentence 12: for those of you who can not wait for xmas to arrive in the 

In [None]:
print("\nReborn Sentences for Masked Negative Sentences:")
for i, reborn_sentence in enumerate(reborn_neg_sentences):
    print(f"Reborn Sentence {i + 1}: {reborn_sentence}")


Reborn Sentences for Masked Negative Sentences:
Reborn Sentence 1: why do small shouldered tiny shoes look like t-shirts?
Reborn Sentence 2: good morning, please go and vote! it only takes minutes and a few clicks and
Reborn Sentence 3: is it even christmas if there isn ’ t a fight with Santa and a
Reborn Sentence 4: helping her with her work for the course she ’ s taking and i �
Reborn Sentence 5: dear customer i am sorry to say that the mobile phone reseller in the mall
Reborn Sentence 6: anyone fancy writing my name for me? can not. be...
Reborn Sentence 7: so the episode about ladonna was one of the most poignant and poignant investigations of the
Reborn Sentence 8: middle aged women are older than most men i know
Reborn Sentence 9: baby tobias has arrived! i ’ ll be taking a spot of them,
Reborn Sentence 10: do not understand decision to make atmos'so... what do they have
Reborn Sentence 11: ... too much...
Reborn Sentence 12: for anybody who can not find cheese footballs in sout

In [None]:
dfnew["rebornPosSentence"] = reborn_pos_sentences
dfnew["rebornNegSentence"] = reborn_neg_sentences
dfnew

Unnamed: 0,text,maskedPosSentence,maskedNegSentence,rebornPosSentence,rebornNegSentence
0,why do small shouldered tiny guys wear huge t ...,why do <mask> <mask> <mask> guys <mask> <mask>...,why do small shouldered tiny <mask> <mask> <ma...,why do you guys wear t shirts?,why do small shouldered tiny shoes look like t...
1,"good morning , please go and vote ! it only t...","<mask> <mask> , <mask> <mask> and <mask> ! it ...","good morning , please go and vote ! it only ta...","In the end, it is all about turnout and turnou...","good morning, please go and vote! it only take..."
2,is it even christmas if there isn ’ t a fight ...,is it <mask> <mask> if there isn ’ t a <mask> ...,is it even christmas if there isn ’ t a fight ...,is it worth it if there isn ’ t a problem with...,is it even christmas if there isn ’ t a fight ...
3,helping mum with her maths work for the course...,<mask> <mask> with her <mask> <mask> for the <...,helping <mask> with her <mask> work for the co...,i ’ m struggling with her math skills for the ...,helping her with her work for the course she ’...
4,dear customer i am sorry that the mobile phon...,<mask> <mask> i am <mask> that the <mask> <mas...,dear customer i am <mask> that the mobile phon...,i hope you get better. i am sorry that the guy...,dear customer i am sorry to say that the mobil...
...,...,...,...,...,...
3998,imagine that it ' s going to cost me pound to...,<mask> that it ' s <mask> to <mask> me <mask> ...,imagine that it ' s going to <mask> me <mask> ...,I found out that it's going to cost me to get ...,imagine that it's going to take me a while to ...
3999,people really out here tryna argue you do not ...,<mask> <mask> out here <mask> argue you do not...,people really out here tryna <mask> you do not...,There are many people out here who argue you d...,people really out here tryna say you do not wa...
4000,"and their relentless running game , on the bri...","and their <mask> <mask> <mask> , on the <mask>...","and their <mask> running <mask> , on the brink...","and their army, on the other hand, of course. ...","and their team are running into, on the brink ..."
4001,why is it that whether i get out of bed at or...,why is it that <mask> i <mask> out of <mask> a...,why is it that whether i <mask> out of bed at ...,why is it that when i go out of the house at o...,"why is it that whether i get out of bed at or,..."


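# Sentence Representation
"We embed the original sentence $x$ and its corresponding reborn texts $A$ and $B$  
into $d$-dimensional embeddings $\mathbf{H}_t \in \mathbb{R}^d$  
via pre-trained BERT-base:  
$\mathbf{H}_x, \mathbf{H}_A, \mathbf{H}_B = \mathrm{BERT}(x), \mathrm{BERT}(A), \mathrm{BERT}(B)$."

In the cell below, the encoder is the `princeton-nlp/sup-simcse-bert-base-uncased` checkpoint, and each sentence vector is the mean of the last-layer token vectors.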

In [None]:
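import torch
from transformers import AutoModel, AutoTokenizer

def embed_sentences(sentences):
    """Return mean-pooled embeddings with shape (len(sentences), 1, hidden_size)."""
    model_name = "princeton-nlp/sup-simcse-bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # make sure dropout is off for inference

    embeddings = []
    for i, sentence in enumerate(sentences, start=1):
        # One sentence per forward pass, so no padding tokens enter the mean below
        inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            # Average the last-layer token vectors -> shape (1, hidden_size)
            outputs = model(**inputs).last_hidden_state.mean(dim=1)
        embeddings.append(outputs)
        if i % 100 == 0:
            print(f'Processed {i} sentences')

    return torch.stack(embeddings)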
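
The function above embeds one sentence per forward pass, so the plain `mean(dim=1)` never sees padding tokens. If sentences were batched instead, padded positions would have to be excluded from the average. The sketch below illustrates that idea with NumPy on made-up arrays; `masked_mean_pool` and the toy values are hypothetical and not part of the notebook.

```python
import numpy as np

def masked_mean_pool(hidden, mask):
    """Average token vectors per sentence, ignoring padded positions.

    hidden: (batch, seq_len, dim) token vectors
    mask:   (batch, seq_len) with 1 for real tokens and 0 for padding
    """
    hidden = np.asarray(hidden, dtype=float)
    mask = np.asarray(mask, dtype=float)[:, :, None]   # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)               # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)     # (batch, 1), avoids division by zero
    return summed / counts

# Toy batch: the third token of the first sentence is padding
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
print(pooled.tolist())  # [[2.0, 3.0], [2.0, 2.0]]
```

An unmasked `hidden.mean(axis=1)` would pull the first row toward the padding values, which is why the mask matters once batching is introduced.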

In [None]:
x_embeddings = embed_sentences(text_data)

A_embeddings = embed_sentences(reborn_pos_sentences)

B_embeddings = embed_sentences(reborn_neg_sentences)


In [None]:
for i, sentence in enumerate(text_data):
    print(f"Embedding for Original Lowercase Sentence {i + 1} ({sentence}):")
    print(x_embeddings[i])
    print("- - - - - - - - - -")

Streaming output truncated to the last 5000 lines.
          3.4725e-01, -2.4786e-01, -2.7578e-01, -2.6902e-01, -6.9727e-02,
         -1.0435e-01,  2.9424e-01,  2.7911e-01, -6.7370e-02,  1.4964e-02,
         -8.9923e-02, -1.1077e-01, -6.1528e-02, -3.2078e-02,  1.0491e-01,
         -2.3540e-01, -6.5370e-02,  1.7637e-02,  7.9000e-02,  2.2984e-01,
         -2.9960e-01, -2.3310e-01,  1.8909e-01,  2.2909e-01, -6.2361e-01,
         -2.5710e-01,  4.4839e-02,  9.2926e-02, -1.2253e-01,  7.8538e-02,
         -6.5198e-02,  3.0073e-02, -1.7900e-01]])
- - - - - - - - - -
Embedding for Original Lowercase Sentence 3972 (spent the morning nursing a very hungover daughter , after she came home at some ungodly hour .  oh how times have changed ):
tensor([[ 5.3629e-02,  5.9410e-02,  5.4324e-01, -5.0224e-01,  2.3215e-02,
         -1.9734e-01,  2.9660e-01,  6.2760e-01,  2.0810e-01,  2.5437e-01,
          3.5356e-01, -2.3373e-01, -2.4814e-01,  6.0689e-01, -3.9613e-01,
          2.0445e-01,  5.

In [None]:
for i, sentence in enumerate(reborn_pos_sentences):
    print(f"Embedding for Reborn Positive Sentence {i + 1} ({sentence}):")
    print(A_embeddings[i])
    print("- - - - - - - - - -")

Streaming output truncated to the last 5000 lines.
         -3.9696e-01, -1.6536e-01,  4.5898e-01,  1.8173e-01,  1.1948e-01,
         -2.0990e-01, -3.9816e-02, -2.4545e-01, -4.6324e-02, -2.0375e-03,
          3.2038e-01, -3.7396e-01,  1.2510e-01, -3.5947e-01,  5.3571e-01,
          7.6657e-01, -3.1868e-01,  5.5204e-02, -3.3219e-01,  6.1572e-01,
          7.0459e-01, -3.9263e-01,  2.9200e-02, -2.9348e-01, -3.8205e-01,
          4.5781e-01,  2.9722e-01, -2.3898e-01,  3.9874e-01, -4.9511e-01,
          2.6286e-02, -2.3081e-01,  2.3719e-01,  1.4884e-01, -2.6199e-02,
         -2.0297e-01,  1.3258e-02, -1.5921e-01,  4.1456e-01, -4.1036e-01,
          2.6274e-01,  7.6720e-02,  3.0448e-02, -1.2010e-01, -4.4252e-01,
         -1.8509e-01, -1.9594e-01,  3.7691e-01, -2.3564e-02, -1.7320e-01,
         -3.8880e-01, -4.0927e-01,  2.6244e-01, -1.0163e-01,  2.2600e-01,
         -4.1847e-01,  3.2648e-02, -5.7476e-01, -1.7836e-01,  4.5830e-02,
          1.2182e-01,  5.3445e-01, -3.3001e-02,

In [None]:
for i, sentence in enumerate(reborn_neg_sentences):
    print(f"Embedding for Reborn Negative Sentence {i + 1} ({sentence}):")
    print(B_embeddings[i])
    print("- - - - - - - - - -")

Streaming output truncated to the last 5000 lines.
          4.6132e-01, -5.3686e-01, -1.4633e-01, -1.2558e-01, -4.3290e-01,
          2.2939e-03,  2.6573e-01,  3.2299e-01,  2.9147e-01,  1.8500e-01,
         -3.2247e-01,  1.9400e-01, -1.0467e-02,  2.3563e-01, -4.1244e-02,
         -1.4074e-01, -2.3866e-01,  1.2857e-01,  2.4971e-02,  2.3071e-01,
         -5.0418e-01, -2.1720e-01, -1.4548e-01,  1.6598e-01, -7.6404e-01,
         -2.7170e-01,  2.0507e-02,  2.0171e-02,  2.0353e-01, -1.7305e-01,
          1.3793e-01, -9.8941e-03, -2.1236e-01]])
- - - - - - - - - -
Embedding for Reborn Negative Sentence 3972 (spent the morning nursing a very sick daughter, after she came home at some point):
tensor([[-3.6182e-01, -1.0838e-01,  2.2165e-01, -2.8312e-01, -2.2319e-01,
          2.6743e-01,  5.3862e-01,  2.1642e-01,  1.8862e-01,  6.3483e-02,
          3.2252e-01, -1.8719e-01,  6.4171e-02,  6.1693e-01, -1.3156e-01,
          1.0946e-01,  3.2220e-01,  7.6148e-02, -3.6017e-01, -2.5184e-

In [None]:
dfnew["xEmbedding"] = x_embeddings.tolist()
dfnew["AEmbedding"] = A_embeddings.tolist()
dfnew["BEmbedding"] = B_embeddings.tolist()
dfnew

Unnamed: 0,text,maskedPosSentence,maskedNegSentence,rebornPosSentence,rebornNegSentence,xEmbedding,AEmbedding,BEmbedding
0,why do small shouldered tiny guys wear huge t ...,why do <mask> <mask> <mask> guys <mask> <mask>...,why do small shouldered tiny <mask> <mask> <ma...,why do you guys wear t shirts?,why do small shouldered tiny shoes look like t...,"[[0.7081333994865417, 0.12105370312929153, -0....","[[1.6330150365829468, -0.06324655562639236, -0...","[[0.8604000210762024, -0.3459759056568146, -0...."
1,"good morning , please go and vote ! it only t...","<mask> <mask> , <mask> <mask> and <mask> ! it ...","good morning , please go and vote ! it only ta...","In the end, it is all about turnout and turnou...","good morning, please go and vote! it only take...","[[-0.1458599716424942, -0.6124595403671265, 1....","[[0.24227996170520782, -0.38228973746299744, 0...","[[0.04123152047395706, -0.028487255796790123, ..."
2,is it even christmas if there isn ’ t a fight ...,is it <mask> <mask> if there isn ’ t a <mask> ...,is it even christmas if there isn ’ t a fight ...,is it worth it if there isn ’ t a problem with...,is it even christmas if there isn ’ t a fight ...,"[[0.0810900628566742, -0.2309497594833374, 0.3...","[[0.4334094822406769, -0.3625483512878418, 0.3...","[[-0.13637270033359528, -0.23558928072452545, ..."
3,helping mum with her maths work for the course...,<mask> <mask> with her <mask> <mask> for the <...,helping <mask> with her <mask> work for the co...,i ’ m struggling with her math skills for the ...,helping her with her work for the course she ’...,"[[0.20817327499389648, 0.2446015179157257, 0.4...","[[0.08149339258670807, -0.08096496760845184, -...","[[0.3286980092525482, -0.12290207296609879, 0...."
4,dear customer i am sorry that the mobile phon...,<mask> <mask> i am <mask> that the <mask> <mas...,dear customer i am <mask> that the mobile phon...,i hope you get better. i am sorry that the guy...,dear customer i am sorry to say that the mobil...,"[[0.4123024642467499, 0.09793899208307266, 0.5...","[[-0.008014368824660778, 0.07397935539484024, ...","[[0.4495949149131775, 0.4749864339828491, 0.61..."
...,...,...,...,...,...,...,...,...
3998,imagine that it ' s going to cost me pound to...,<mask> that it ' s <mask> to <mask> me <mask> ...,imagine that it ' s going to <mask> me <mask> ...,I found out that it's going to cost me to get ...,imagine that it's going to take me a while to ...,"[[0.370595246553421, 0.17534072697162628, 0.46...","[[0.27342572808265686, -0.038232725113630295, ...","[[0.39162731170654297, -0.28266990184783936, 0..."
3999,people really out here tryna argue you do not ...,<mask> <mask> out here <mask> argue you do not...,people really out here tryna <mask> you do not...,There are many people out here who argue you d...,people really out here tryna say you do not wa...,"[[0.6794639825820923, 0.3844420313835144, -0.2...","[[0.7550011873245239, 0.27998435497283936, 0.0...","[[0.5378870964050293, -0.0463699996471405, -0...."
4000,"and their relentless running game , on the bri...","and their <mask> <mask> <mask> , on the <mask>...","and their <mask> running <mask> , on the brink...","and their army, on the other hand, of course. ...","and their team are running into, on the brink ...","[[-0.20612291991710663, 0.10531312227249146, -...","[[-0.1339244842529297, 0.43760520219802856, 0....","[[-0.45246705412864685, -0.14730283617973328, ..."
4001,why is it that whether i get out of bed at or...,why is it that <mask> i <mask> out of <mask> a...,why is it that whether i <mask> out of bed at ...,why is it that when i go out of the house at o...,"why is it that whether i get out of bed at or,...","[[0.6105951070785522, 0.4410526752471924, 0.49...","[[0.2320578247308731, 0.3032695949077606, 0.40...","[[0.5307813882827759, 0.00047325657214969397, ..."


# Sarcastic Utterances Detection
## 1)
"We utilize cosine similarity to measure the similarity between representations of original sentence $\mathbf{H}_x$  
and generation texts $\mathbf{H}_A$/$\mathbf{H}_B$.

Then we use the following equation to calculate a difference score of each sentence:  
$\mathrm{diff} = \mathrm{sim}(\mathbf{H}_x, \mathbf{H}_A) < \mathit{threshold} \;\|\; \mathrm{sim}(\mathbf{H}_x, \mathbf{H}_B) < \mathit{threshold}$  
where $\|$ means "or" logical operator."
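
To make the rule concrete, here is a minimal NumPy sketch on toy 2-D vectors. It computes row-wise cosine similarity and flags a sentence when either reborn text falls below the threshold. The helper names `cosine_rows` and `difference_flags` and the toy vectors are assumptions for illustration only; rows are assumed to be non-zero.

```python
import numpy as np

def cosine_rows(U, V):
    """Row-wise cosine similarity between two (n, d) arrays of non-zero rows."""
    U = np.asarray(U, dtype=float)
    V = np.asarray(V, dtype=float)
    dots = np.sum(U * V, axis=1)
    norms = np.linalg.norm(U, axis=1) * np.linalg.norm(V, axis=1)
    return dots / norms

def difference_flags(Hx, HA, HB, threshold):
    """True where sim(Hx, HA) < threshold or sim(Hx, HB) < threshold."""
    sim_A = cosine_rows(Hx, HA)
    sim_B = cosine_rows(Hx, HB)
    return (sim_A < threshold) | (sim_B < threshold)

# Two toy "sentences" with 2-D embeddings
Hx = np.array([[1.0, 0.0], [1.0, 0.0]])
HA = np.array([[1.0, 0.0], [0.0, 1.0]])  # similarities to Hx: 1.0 and 0.0
HB = np.array([[1.0, 1.0], [1.0, 0.0]])  # similarities to Hx: about 0.707 and 1.0

flags = difference_flags(Hx, HA, HB, threshold=0.6)
print(flags.tolist())  # [False, True]
```

The first sentence stays close to both reborn texts and is not flagged; the second drifts far from its positive-masked rewrite, so the "or" fires.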

In [None]:
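def calculate_difference_scores(x_embeddings, A_embeddings, B_embeddings, threshold):
    """Return one boolean per sentence: True if either reborn text is below the threshold."""
    # cosine_similarity is assumed to be imported earlier in the notebook
    diff_scores = []
    for i, (x_emb, A_emb, B_emb) in enumerate(zip(x_embeddings, A_embeddings, B_embeddings), start=1):
        # Each embedding has shape (1, hidden_size); .item() unwraps the single similarity value
        sim_Hx_HA = cosine_similarity(x_emb, A_emb).item()
        sim_Hx_HB = cosine_similarity(x_emb, B_emb).item()

        diff = (sim_Hx_HA < threshold) or (sim_Hx_HB < threshold)
        diff_scores.append(diff)
        if i % 100 == 0:
            print(f'Processed {i} embeddings')

    return diff_scores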
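
The threshold sweep that follows scores each threshold with accuracy, precision, and F1, and prints a confusion matrix. As a sanity check, these metrics can be re-derived from the four confusion-matrix counts. The sketch below does that for the counts the notebook reports at threshold 0.57, reading the matrix in scikit-learn's layout `[[TN, FP], [FN, TP]]`. The helper `metrics_from_confusion` is a hypothetical name, and recall is added here even though the notebook does not print it.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Derive accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Counts reported in the notebook output for threshold 0.57: [[3084, 213], [656, 50]]
m = metrics_from_confusion(tn=3084, fp=213, fn=656, tp=50)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
# accuracy: 0.7829
# precision: 0.1901
# recall: 0.0708
# f1: 0.1032
```

Accuracy is dominated by the 3084 true negatives, while recall shows that only 50 of the 706 sarcastic sentences are caught at this threshold.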

In [None]:
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, f1_score, confusion_matrix

# Sweep thresholds 0.55, 0.56, ..., 0.90, stored as integer percentages
threshold_range = range(55, 91)

# Initialize a dictionary to store the results for each threshold
results = {}

# Iterate over the threshold range
for threshold in threshold_range:
    # Convert the integer percentage to a cosine-similarity threshold between 0 and 1
    normalized_threshold = threshold / 100

    # Calculate the difference scores for the current threshold
    diff_scores = calculate_difference_scores(x_embeddings, A_embeddings, B_embeddings, normalized_threshold)

    # Convert difference scores to predicted labels
    predicted_labels = [int(diff) for diff in diff_scores]

    # Generate label names based on predictions
    labels = ["sarc" if diff else "notsarc" for diff in diff_scores]

    # Create a DataFrame with text data, true labels, and predictions
    dffinal = pd.DataFrame({"text": text_data, "class": label_data, "prediction": labels})

    # Convert true labels to binary format (1 = "sarc", 0 = anything else)
    true_labels = [1 if label == "sarc" else 0 for label in dffinal["class"]]

    # Calculate performance metrics
    accuracy = accuracy_score(true_labels, predicted_labels)
    precision = precision_score(true_labels, predicted_labels)
    f1 = f1_score(true_labels, predicted_labels)

    # Store the metrics in the results dictionary
    results[threshold] = {
        "accuracy": accuracy,
        "precision": precision,
        "f1_score": f1
    }

    # Print the metrics for the current threshold
    print(f"Threshold: {threshold / 100}")
    print(f"Accuracy: {accuracy}")
    print(f"Precision: {precision}")
    print(f"F1 Score: {f1}")

    # Print the confusion matrix
    conf_matrix = confusion_matrix(true_labels, predicted_labels)
    print("Confusion Matrix:")
    print(conf_matrix)
    print("-" * 50)  # Separator for readability

# Optionally, convert the results dictionary to a DataFrame for easier analysis
results_df = pd.DataFrame(results).T
results_df


Threshold: 0.57
Accuracy: 0.7829128153884587
Precision: 0.19011406844106463
F1 Score: 0.10319917440660474
Confusion Matrix:
[[3084  213]
 [ 656   50]]
--------------------------------------------------
Threshold: 0.59
Accuracy: 0.766674993754684
Precision: 0.19518716577540107
F1 Score: 0.13518518518518519
Confusion Matrix:
[[2996  301]
 [ 633   73]]
--------------------------------------------------
Threshold: 0.6
Accuracy: 0.7559330502123407
Precision: 0.19821826280623608
F1 Score: 0.15411255411255412
Confusion Matrix:
[[2937  360]
 [ 617   89]]
--------------------------------------------------
Processed 2800 embeddings
Processed 2900 embeddings
Processed 3000 embeddings
Processed 3100 embeddings
Processed 3200 embeddings
Processed 3300 embeddings
Processed 3400 embeddings
Processed 3500 embeddings
Processed 3600 embeddings
Processed 3700 embeddings
Processed 3800 embeddings
Processed 3900 embeddings
Processed 4000 embed

  predicted_labels = [int(diff) for diff in diff_scores]


Confusion Matrix:
[[1789 1508]
 [ 353  353]]
--------------------------------------------------
Threshold: 0.77
Accuracy: 0.4024481638770922
Precision: 0.188470066518847
F1 Score: 0.2989449003516999
Confusion Matrix:
[[1101 2196]
 [ 196  510]]
--------------------------------------------------
Threshold: 0.78
Accuracy: 0.37721708718461155
Precision: 0.18986463033668866
F1 Score: 0.3049902425425146
Confusion Matrix:
[[ 963 2334]
 [ 159  547]]
--------------------------------------------------
Threshold: 0.79
Accuracy: 0.35048713464901327
Precision: 0.18848684210526315
F1 Score: 0.30592632140950343
Confusion Matrix:
[[ 830 2467]
 [ 133  573]]
--------------------------------------------------
Threshold: 0.82
Accuracy: 0.28253809642767924
Precision: 0.18535735037768739
F1 Score: 0.3076181292189007
Confusion Matrix:
[[ 493 2804]
 [  68  638]]
--------------------------------------------------
Threshold: 0.86
Accuracy: 0.21533849612790407
Precision: 0.1793521200948117
F1 Score: 0.30246502331778813
Confusion Matrix:
[[ 181 3116]
 [  25  681]]
--------------------------------------------------
Threshold: 0.87
Accuracy: 0.20484636522608043
Precision: 0.178060826618144
F1 Score: 0.3009005051614321
Confusion Matrix:
[[ 135 3162]
 [  21  685]]
--------------------------------------------------
Threshold: 0.88
Accuracy: 0.19735198601049214
Precision: 0.17734877734877735
F1 Score: 0.3001524722282727
Confusion Matrix:
[[ 101 3196]
 [  17  689]]
--------------------------------------------------
Threshold: 0.89
Accuracy: 0.18860854359230578
Precision: 0.1762608252674478
F1 Score: 0.29879101899827293
Confusion Matrix:
[[  63 3234]
 [  14  692]]
--------------------------------------------------
Unnamed: 0,accuracy,precision,f1_score
55,0.791406,0.172589,0.075305
56,0.785661,0.169565,0.083333
57,0.782913,0.190114,0.103199
58,0.776418,0.190164,0.114738
59,0.766675,0.195187,0.135185
60,0.755933,0.198218,0.154113
61,0.748439,0.19596,0.161532
62,0.736198,0.192982,0.172414
63,0.718461,0.186289,0.181554
64,0.705721,0.187003,0.193151


In [None]:
# threshold = 0.755

# diff_scores = calculate_difference_scores(x_embeddings, A_embeddings, B_embeddings, threshold)
# diff_scores

## 2)
"Since the sarcastic utterances are influenced more than normal texts during the masking and generation procedure,     
the difference score of sarcastic texts should be greater than a non-sarcastic one.

If we have a threshold value which separates sarcastic texts and normal texts,     
we can yield the prediction 𝑦 by:     
𝑦 = I(diff)."
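
The indicator rule quoted above can be sketched as follows. This is only an illustration under stated assumptions: `predict_labels` is a hypothetical helper, the example scores are made up, and the comparison direction (score strictly greater than the threshold) is assumed from the quoted text. The notebook's own `calculate_difference_scores` may compare differently; in the sweep output the number of predicted positives grows as the threshold rises, which suggests it does.

```python
import numpy as np

def predict_labels(diff_scores, threshold):
    """Hypothetical helper: return 1 (sarcastic) where the difference
    score is strictly greater than the threshold, otherwise 0."""
    scores = np.asarray(diff_scores, dtype=float).ravel()
    return (scores > threshold).astype(int).tolist()

# Made-up difference scores, for illustration only.
example_scores = [0.91, 0.42, 0.78, 0.80]
example_labels = predict_labels(example_scores, threshold=0.78)
print(example_labels)
```

Flattening with `ravel()` turns each score into a plain scalar before the comparison, so the labels come out as ordinary Python integers.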

In [None]:
# predicted_labels = [int(diff) for diff in diff_scores]
# print(predicted_labels)
# print(sum(predicted_labels))

In [None]:
# labels = ["sarc" if diff else "notsarc" for diff in diff_scores]
# print(labels)

In [None]:
# dffinal = pd.DataFrame({"text": text_data, "class": label_data, "prediction": labels})
# dffinal

# Main Experiment Results

In [None]:
# true_labels = [1 if pred == "sarc" else 0 for pred in df["class"]]
# print(true_labels)
# print(predicted_labels)

# accuracy = accuracy_score(true_labels, predicted_labels)
# precision = precision_score(true_labels, predicted_labels)
# f1 = f1_score(true_labels, predicted_labels)

# print("Accuracy:", accuracy)
# print("Precision:", precision)
# print("F1 Score:", f1)

In [None]:
# conf_matrix = confusion_matrix(true_labels, predicted_labels)

# print("Confusion Matrix:")
# print(conf_matrix)
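
As a sanity check, the accuracy, precision and F1 values printed in the sweep can be recomputed from a confusion matrix alone. The sketch below assumes the scikit-learn layout `[[TN, FP], [FN, TP]]` (rows are true labels, columns are predictions, with label 1 meaning "sarc"). `metrics_from_confusion` is a hypothetical helper, not part of the notebook. The input is the matrix printed for `Threshold: 0.77`.

```python
def metrics_from_confusion(conf_matrix):
    """Hypothetical helper: derive metrics from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = conf_matrix
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # Guard against empty denominators when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Confusion matrix reported in the output for Threshold: 0.77.
m = metrics_from_confusion([[1101, 2196], [196, 510]])
print(m)
```

The recomputed accuracy, precision and F1 agree with the values logged for that threshold. Recall, which the notebook does not print, comes out at 510/706, roughly 0.72. Accuracy stays low because most non-sarcastic texts are flagged as sarcastic.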