# Introduction

Sarcasm is a sophisticated linguistic phenomenon that can badly confuse existing sentiment classification systems.     
As a result, sarcasm detection, the task of predicting whether a given text contains sarcasm, has received considerable research attention.     

Many methods have recently been proposed for sarcasm detection; they fall broadly into two categories.     
The first is the text-only approach, which concentrates on the utterance itself, for example by exploiting incongruous expressions to detect sarcastic text.     
The second relies on extra information, exploiting external knowledge such as user history and common-sense knowledge to assist detection.

We propose an unsupervised sarcasm detection method.     

First, we leverage external sentiment knowledge to mask prominent tokens. The masked texts are then fed into a pre-trained generation model, which follows the remaining logical structure to generate new text.     
There is a good chance that these reborn texts will no longer be sarcastic, or will make more sense.     

Second, after computing the similarity scores between the generated sentences and the original one, we extract features from those scores to decide whether a sentence is sarcastic.     

Finally, we construct several unsupervised baselines and conduct experiments on the IAC-V2 dataset.

# Imports and Reading Data

In [1]:
!pip install senticnet

Collecting senticnet
  Downloading senticnet-1.6-py3-none-any.whl.metadata (2.6 kB)
Downloading senticnet-1.6-py3-none-any.whl (51.9 MB)
Installing collected packages: senticnet
Successfully installed senticnet-1.6


In [2]:
import numpy as np
import pandas as pd

from senticnet.senticnet import SenticNet

import nltk
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from transformers import AutoTokenizer, AutoModel
from transformers import BartTokenizer, BartForConditionalGeneration
import torch

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import accuracy_score, precision_score, f1_score
from sklearn.metrics import confusion_matrix

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
df = pd.read_csv("/content/drive/My Drive/AlifResearch/sarcasm_v2/GEN-sarc-notsarc.csv")
df1 = pd.read_csv("/content/drive/My Drive/AlifResearch/sarcasm_v2/HYP-sarc-notsarc.csv")
df2 = pd.read_csv("/content/drive/My Drive/AlifResearch/sarcasm_v2/RQ-sarc-notsarc.csv")

In [5]:
df

Unnamed: 0,class,id,text
0,notsarc,1,"If that's true, then Freedom of Speech is doom..."
1,notsarc,2,Neener neener - is it time to go in from the p...
2,notsarc,3,"Just like the plastic gun fear, the armour pie..."
3,notsarc,4,So geology is a religion because we weren't he...
4,notsarc,5,Well done Monty. Mark that up as your first ev...
...,...,...,...
6515,sarc,6516,depends on when the baby bird died. run alon...
6516,sarc,6517,"ok, sheesh, to clarify, women who arent aborti..."
6517,sarc,6518,so.. eh?? hows this sound? will it fly w...
6518,sarc,6519,"I think we should put to a vote, the right of ..."


In [6]:
df1

Unnamed: 0,class,id,text
0,notsarc,1,have no predators to fear? check. who said we ...
1,notsarc,2,2 hours? damn! that book took me a good 2 day...
2,notsarc,3,you never played myst? damn!!! i must be reall...
3,notsarc,4,"Well, if Genesis was in fact true, then we wou..."
4,notsarc,5,Just making sure that everybody is aware of hi...
...,...,...,...
1159,sarc,1160,you really believed me? wow! i never knew i ha...
1160,sarc,1161,please tell me you're kidding. these bowling b...
1161,sarc,1162,you're kidding. just because your life is 'a f...
1162,sarc,1163,the evidence that is provided to you is not en...


In [7]:
df2

Unnamed: 0,class,id,text
0,notsarc,1,"Archie, the ONLY issue that gays don't have a ..."
1,notsarc,2,"No, not really. All that is different is the n..."
2,notsarc,3,It's ashame that everyone keeps looking for th...
3,notsarc,4,"Almost? Usually, that is true, and it involves..."
4,notsarc,5,And so have animals. Plants have been wiped ou...
...,...,...,...
1697,sarc,1698,"Tell me genius, how is me accurately and corre..."
1698,sarc,1699,So you think it is a good idea for public scho...
1699,sarc,1700,"Now settle down charlie, and try to think rati..."
1700,sarc,1701,The VPC has a political agenda. The FBI? That ...


In [8]:
# Concatenate vertically
df = pd.concat([df, df1, df2], ignore_index=True)
df

Unnamed: 0,class,id,text
0,notsarc,1,"If that's true, then Freedom of Speech is doom..."
1,notsarc,2,Neener neener - is it time to go in from the p...
2,notsarc,3,"Just like the plastic gun fear, the armour pie..."
3,notsarc,4,So geology is a religion because we weren't he...
4,notsarc,5,Well done Monty. Mark that up as your first ev...
...,...,...,...
9381,sarc,1698,"Tell me genius, how is me accurately and corre..."
9382,sarc,1699,So you think it is a good idea for public scho...
9383,sarc,1700,"Now settle down charlie, and try to think rati..."
9384,sarc,1701,The VPC has a political agenda. The FBI? That ...


In [9]:
df = df.drop('id', axis=1)
df

Unnamed: 0,class,text
0,notsarc,"If that's true, then Freedom of Speech is doom..."
1,notsarc,Neener neener - is it time to go in from the p...
2,notsarc,"Just like the plastic gun fear, the armour pie..."
3,notsarc,So geology is a religion because we weren't he...
4,notsarc,Well done Monty. Mark that up as your first ev...
...,...,...
9381,sarc,"Tell me genius, how is me accurately and corre..."
9382,sarc,So you think it is a good idea for public scho...
9383,sarc,"Now settle down charlie, and try to think rati..."
9384,sarc,The VPC has a political agenda. The FBI? That ...


# Understanding Data

In [10]:
df.dtypes

class    object
text     object
dtype: object


In [11]:
df.columns

Index(['class', 'text'], dtype='object')

In [12]:
text_data_original = list(df['text'])
text_data = [x.lower() for x in text_data_original]
print(*text_data, sep="\n")

Streaming output truncated to the last 5000 lines.
i am willing to look at you, me and anyone and tell them if they are not prepared to love and nurture what is produced by sex, don't have it.   if you are, get it on.
well, in the case of a creationist, the bias determines what evidence will be discarded, what evidence will be twisted, what evidence will simply be ignored.
find me a mass shooting in utah where concealed carry on campus is legal.
even joe the plumber abandoned the garbage scow? say it aint so!
and that's why i think you are a troll. there really aren't any experts here. there are several who are knowledgable in a variety of topics, but that knowledge seems to me to be the result of being both educated and widely read. it also indicates an ability to reason and an open mind.
look what is talking about insane, immune to reason, a prisnor of his own giant ignorance, slavery fetishist and one of the comic relief 3 stooges. you might have a dialogue but then in

In [13]:
label_data = list(df['class'])
print(*label_data, sep="\n")

Streaming output truncated to the last 5000 lines.
sarc
sarc
sarc
...


# Overview

The proposed framework contains three main components:     

1) Sentence masking and generation.     
This step first identifies the main components of each sentence, masks them so that the change has a strong impact on the original sentence, and then generates new text from the masked input;     

2) Sentence representation.     
This step computes dense vector representations of the sentences;     

3) Sarcastic utterance detection.     
This step leverages the similarity scores between the original and regenerated sentences to decide whether an utterance is sarcastic.
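
The third component can be pictured with a small, dependency-free sketch. It assumes the sentence vectors are already available as plain Python lists. The decision rule (flag an utterance when either regenerated sentence drifts below a similarity threshold), the name `looks_sarcastic`, and the `threshold` value are illustrative assumptions, not the rule evaluated in this notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

def looks_sarcastic(h_x, h_a, h_b, threshold=0.8):
    """Hypothetical rule: flag x when a regenerated sentence moves far from it."""
    return min(cosine(h_x, h_a), cosine(h_x, h_b)) < threshold

# Toy 3-dimensional "embeddings", made up for illustration.
h_x = [1.0, 0.0, 0.0]
h_a = [1.0, 0.1, 0.0]   # regenerated text stays close to the original
h_b = [0.0, 1.0, 0.0]   # regenerated text drifts away

print(looks_sarcastic(h_x, h_a, h_a))  # both regenerations stay close
print(looks_sarcastic(h_x, h_a, h_b))  # one regeneration drifts
```

In practice the threshold would have to be chosen from the score distribution, since no labels are used for training.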

# Sentences Mask and Generation
## 1)
"First, we use the sentiment common knowledge retrieved from SenticNet to recognize affective words in the sentence 𝑥,     
and split those words into two sets according to their sentiment polarities:    
PW = {pw1, pw2, ..., pwh} and    
NW = {nw1, nw2, ..., nwk},     
h + k <= n."

In [14]:
def tokenize_sentence(sentence):
    tokens = word_tokenize(sentence)

    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens

In [15]:
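# Create the SenticNet lookup once instead of on every call.
sn = SenticNet()

def get_sentiment_polarity_from_senticnet(word):
    """Return the SenticNet polarity label of a word, or "neutral" if it is unknown."""
    try:
        return sn.polarity_label(word.lower())
    except KeyError:
        # The word is not a SenticNet concept.
        return "neutral"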

In [16]:
def analyze_sentiment(sentences):
    positive_words = []
    negative_words = []

    for sentence in sentences:
        words = tokenize_sentence(sentence)

        PW = set()
        NW = set()

        for word in words:
            sentiment_polarity = get_sentiment_polarity_from_senticnet(word)
            if sentiment_polarity == "positive":
                PW.add(word.lower())
            elif sentiment_polarity == "negative":
                NW.add(word.lower())

        positive_words.append(PW)
        negative_words.append(NW)

    return positive_words, negative_words
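
To make the PW/NW split easy to check by hand, here is a minimal sketch on a single sentence. It swaps SenticNet for a tiny hand-written lexicon: `TOY_POLARITY` and `split_by_polarity` are hypothetical names introduced only for this illustration, and the polarity labels in the dictionary are made up rather than taken from SenticNet.

```python
# Hypothetical mini-lexicon standing in for SenticNet polarity labels.
TOY_POLARITY = {
    "love": "positive",
    "great": "positive",
    "honest": "positive",
    "fear": "negative",
    "doomed": "negative",
}

def split_by_polarity(tokens, lexicon):
    """Return (PW, NW): the positive and negative words found in tokens."""
    pw, nw = set(), set()
    for tok in tokens:
        label = lexicon.get(tok.lower(), "neutral")  # unknown words count as neutral
        if label == "positive":
            pw.add(tok.lower())
        elif label == "negative":
            nw.add(tok.lower())
    return pw, nw

tokens = "i love how honest this doomed plan is".split()
pw, nw = split_by_polarity(tokens, TOY_POLARITY)
print(sorted(pw), sorted(nw))  # ['honest', 'love'] ['doomed']
```

Because both results are sets, a word that occurs several times is counted once, so h + k <= n holds by construction.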

In [17]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [18]:
positive_words, negative_words = analyze_sentiment(text_data)

for i, sentence in enumerate(text_data):
    print(f"Sentence: {sentence}")
    print(f"Positive Words: {positive_words[i]}")
    print(f"Negative Words: {negative_words[i]}")
    print("- - - - - - - - - -")

Streaming output truncated to the last 5000 lines.
Sentence: well, if one must make a choice, then let's make the question more pertinent to this discussion. from a public health perspective regarding aids, which would be better: all sex outside a momogamous relationship between tested individual protected, whether promiscuous or not or only promiscuous sex protected and non-promiscuous sex unprotected? (leaving aside a common definition of 'promiscuous' -- you use your mysterious definition, i'll use mine.) i'd vote for the former. you? well, if you want gay people to be less promiscuous (still without a determined definition), support our full integration into the larger culture and stop calling us immoral. in the long run, we'll end up acting just as randy -- or not -- as anyone else.
Positive Words: {'gay', 'health', 'determined', 'perspective', 'choice', 'aid', 'better', 'individual', 'definition', 'support', 'pertinent', 'mysterious', 'culture', 'larger', 'sex', 'in

In [19]:
df["PW"] = positive_words
df["NW"] = negative_words
df

Unnamed: 0,class,text,PW,NW
0,notsarc,"If that's true, then Freedom of Speech is doom...","{freedom, subjective, true}","{doomed, harassment}"
1,notsarc,Neener neener - is it time to go in from the p...,{playground},{}
2,notsarc,"Just like the plastic gun fear, the armour pie...",{},"{plastic, misinformation, myth, fear}"
3,notsarc,So geology is a religion because we weren't he...,"{religion, formed, rock}",{geology}
4,notsarc,Well done Monty. Mark that up as your first ev...,"{accurate, honest}",{}
...,...,...,...,...
9381,sarc,"Tell me genius, how is me accurately and corre...","{accurately, genius, meaning, correctly}","{inaccurate, misdirect, ignorant, mistake, dis..."
9382,sarc,So you think it is a good idea for public scho...,"{determine, good, disciplinary, board, new, su...","{bad, catholic, investigate}"
9383,sarc,"Now settle down charlie, and try to think rati...","{link, doe, appreciate, rationally, fetus, set...","{occurrence, tumor}"
9384,sarc,The VPC has a political agenda. The FBI? That ...,"{taste, believe, saying, better}","{pepsi, political}"


## 2)
"Second, we analyze the sentence to get its syntax information to identify non-stop words     
     𝑆𝑊 = {𝑠𝑤1, 𝑠𝑤2, ..., 𝑠𝑤𝑚, 𝑚 ≤ 𝑛}.     
Intuitively, these words are the main components of sentences. Then we split 𝑆𝑊 into two sets which satisfy :     
     𝑆𝑊1 ∪ 𝑆𝑊2 = 𝑆𝑊 ,     
     |𝑆𝑊1| = |𝑆𝑊2|."

In [20]:
def extract_non_stop_words(sentence):
    words = nltk.word_tokenize(sentence)

    stop_words = set(stopwords.words("english"))

    non_stop_words = [word.lower() for word in words if word.lower() not in stop_words and word.isalpha()]

    return non_stop_words

In [21]:
def split_non_stop_words(non_stop_words):
    m = len(non_stop_words)
    m1 = m // 2
    SW1 = set(non_stop_words[:m1])
    SW2 = set(non_stop_words[m1:])
    return SW1, SW2
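
One thing to keep in mind with the split above: the list is halved before it is turned into sets, so a repeated word can land in both SW1 and SW2, and the two sets can differ in size by more than one. The sketch below shows one possible variant that deduplicates first; `split_unique_in_half` is a hypothetical helper, not part of the original method.

```python
def split_unique_in_half(words):
    """Deduplicate while keeping first-occurrence order, then split in two."""
    unique = list(dict.fromkeys(words))  # dicts preserve insertion order
    half = len(unique) // 2
    return set(unique[:half]), set(unique[half:])

# "fear" occurs twice; after deduplication it belongs to exactly one half.
words = ["plastic", "gun", "fear", "armour", "piercing", "fear", "myth"]
sw1, sw2 = split_unique_in_half(words)
print(sorted(sw1), sorted(sw2))
```

With this variant the two sets are disjoint, their union is still the full set of non-stop words, and their sizes differ by at most one.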

In [22]:
def analyze_sentences(sentences):
    all_SW1 = []
    all_SW2 = []

    for sentence in sentences:
        non_stop_words = extract_non_stop_words(sentence)
        SW1, SW2 = split_non_stop_words(non_stop_words)
        all_SW1.append(SW1)
        all_SW2.append(SW2)

    return all_SW1, all_SW2

In [23]:
all_SW1, all_SW2 = analyze_sentences(text_data)

for i, sentence in enumerate(text_data):
    print(f"Sentence: {sentence}")
    print(f"SW1: {all_SW1[i]}")
    print(f"SW2: {all_SW2[i]}")
    print("- - - - - - - - - -")

Streaming output truncated to the last 5000 lines.
Sentence: well, if one must make a choice, then let's make the question more pertinent to this discussion. from a public health perspective regarding aids, which would be better: all sex outside a momogamous relationship between tested individual protected, whether promiscuous or not or only promiscuous sex protected and non-promiscuous sex unprotected? (leaving aside a common definition of 'promiscuous' -- you use your mysterious definition, i'll use mine.) i'd vote for the former. you? well, if you want gay people to be less promiscuous (still without a determined definition), support our full integration into the larger culture and stop calling us immoral. in the long run, we'll end up acting just as randy -- or not -- as anyone else.
SW1: {'let', 'question', 'public', 'better', 'outside', 'sex', 'one', 'regarding', 'health', 'well', 'choice', 'aids', 'individual', 'aside', 'make', 'tested', 'must', 'whether', 'momogam

In [24]:
df["SW1"] = all_SW1
df["SW2"] = all_SW2
df

Unnamed: 0,class,text,PW,NW,SW1,SW2
0,notsarc,"If that's true, then Freedom of Speech is doom...","{freedom, subjective, true}","{doomed, harassment}","{freedom, speech, harassment, doomed, true}","{subjective, banned, harassing, claim, book, l..."
1,notsarc,Neener neener - is it time to go in from the p...,{playground},{},"{time, neener}","{yet, go, playground}"
2,notsarc,"Just like the plastic gun fear, the armour pie...",{},"{plastic, misinformation, myth, fear}","{armour, fear, piercing, gun, plastic, like}","{myth, fear, misinformation, bullet, upon, built}"
3,notsarc,So geology is a religion because we weren't he...,"{religion, formed, rock}",{geology},"{religion, geology, see}","{formed, rock, x}"
4,notsarc,Well done Monty. Mark that up as your first ev...,"{accurate, honest}",{},"{mark, well, done, monty}","{first, ever, accurate, honest, post}"
...,...,...,...,...,...,...
9381,sarc,"Tell me genius, how is me accurately and corre...","{accurately, genius, meaning, correctly}","{inaccurate, misdirect, ignorant, mistake, dis...","{accurately, misdirect, pointing, tell, using,...","{lying, penfold, statement, inaccurate, words,..."
9382,sarc,So you think it is a good idea for public scho...,"{determine, good, disciplinary, board, new, su...","{bad, catholic, investigate}","{new, public, school, field, hampshire, local,...","{organizations, gets, article, necessary, boar..."
9383,sarc,"Now settle down charlie, and try to think rati...","{link, doe, appreciate, rationally, fetus, set...","{occurrence, tumor}","{heard, tumor, ever, try, charlie, think, seco...","{link, appreciate, documenting, tumor, form, s..."
9384,sarc,The VPC has a political agenda. The FBI? That ...,"{taste, believe, saying, better}","{pepsi, political}","{fbi, vpc, saying, political, believe, like, a...","{coke, commericial, better, pepsi, taste, says}"


## 3)
"Here, 𝑃𝑊 ∪ 𝑆𝑊1 and 𝑁𝑊 ∪ 𝑆𝑊2 are used to mask original sentence respectively. So, we will obtain two masked sentences     
𝑥𝑚1 = { [𝑚]1, 𝑥2, ..., [𝑚]𝑛} and     
𝑥𝑚2 = {𝑥1, [𝑚]2, ..., 𝑥𝑛}."

In [25]:
def construct_union(sentences, PW, NW, all_SW1, all_SW2):
    union_PW_SW1 = []
    union_NW_SW2 = []

    for i, sentence in enumerate(sentences):
        SW1 = all_SW1[i]
        SW2 = all_SW2[i]

        union_PW_SW1.append(PW[i].union(SW1))
        union_NW_SW2.append(NW[i].union(SW2))

    return union_PW_SW1, union_NW_SW2

In [26]:
union_PW_SW1, union_NW_SW2 = construct_union(text_data, positive_words, negative_words, all_SW1, all_SW2)
print(union_PW_SW1)
print(union_NW_SW2)

Output hidden; open in https://colab.research.google.com to view.

In [27]:
df["union_PW_SW1"] = union_PW_SW1
df["union_NW_SW2"] = union_NW_SW2
df

Unnamed: 0,class,text,PW,NW,SW1,SW2,union_PW_SW1,union_NW_SW2
0,notsarc,"If that's true, then Freedom of Speech is doom...","{freedom, subjective, true}","{doomed, harassment}","{freedom, speech, harassment, doomed, true}","{subjective, banned, harassing, claim, book, l...","{freedom, subjective, speech, harassment, doom...","{subjective, banned, harassment, harassing, cl..."
1,notsarc,Neener neener - is it time to go in from the p...,{playground},{},"{time, neener}","{yet, go, playground}","{playground, time, neener}","{yet, go, playground}"
2,notsarc,"Just like the plastic gun fear, the armour pie...",{},"{plastic, misinformation, myth, fear}","{armour, fear, piercing, gun, plastic, like}","{myth, fear, misinformation, bullet, upon, built}","{like, armour, fear, piercing, plastic, gun}","{myth, fear, misinformation, bullet, upon, bui..."
3,notsarc,So geology is a religion because we weren't he...,"{religion, formed, rock}",{geology},"{religion, geology, see}","{formed, rock, x}","{geology, formed, see, religion, rock}","{geology, formed, rock, x}"
4,notsarc,Well done Monty. Mark that up as your first ev...,"{accurate, honest}",{},"{mark, well, done, monty}","{first, ever, accurate, honest, post}","{mark, monty, accurate, well, honest, done}","{first, accurate, ever, honest, post}"
...,...,...,...,...,...,...,...,...
9381,sarc,"Tell me genius, how is me accurately and corre...","{accurately, genius, meaning, correctly}","{inaccurate, misdirect, ignorant, mistake, dis...","{accurately, misdirect, pointing, tell, using,...","{lying, penfold, statement, inaccurate, words,...","{accurately, misdirect, pointing, tell, using,...","{statement, inaccurate, words, mistake, emotic..."
9382,sarc,So you think it is a good idea for public scho...,"{determine, good, disciplinary, board, new, su...","{bad, catholic, investigate}","{new, public, school, field, hampshire, local,...","{organizations, gets, article, necessary, boar...","{board, new, public, school, organization, loc...","{organizations, gets, article, necessary, boar..."
9383,sarc,"Now settle down charlie, and try to think rati...","{link, doe, appreciate, rationally, fetus, set...","{occurrence, tumor}","{heard, tumor, ever, try, charlie, think, seco...","{link, appreciate, documenting, tumor, form, s...","{heard, tumor, ever, try, think, rationally, s...","{link, appreciate, tumor, documenting, form, s..."
9384,sarc,The VPC has a political agenda. The FBI? That ...,"{taste, believe, saying, better}","{pepsi, political}","{fbi, vpc, saying, political, believe, like, a...","{coke, commericial, better, pepsi, taste, says}","{fbi, better, vpc, saying, like, political, be...","{coke, commericial, better, pepsi, political, ..."


In [28]:
def mask_sentence(sentence, mask_words, max_mask_count = 5):
    masked_sentence = []

    for word in sentence.split():
        if word in mask_words and max_mask_count > 0:
            masked_sentence.append("<mask>")
            max_mask_count -= 1
        else:
            masked_sentence.append(word)

    return " ".join(masked_sentence)
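
Note that `sentence.split()` keeps punctuation attached to words, so a token such as `choice,` never matches `choice` in the mask set. A possible punctuation-aware variant is sketched below. `mask_sentence_tokens` and its `mask_token` parameter are hypothetical, and matching on runs of ASCII letters is an assumption that also splits contractions such as `let's` into two pieces.

```python
import re

def mask_sentence_tokens(sentence, mask_words, max_mask_count=5, mask_token="<mask>"):
    """Mask up to max_mask_count words, matching on the letters of each token."""
    remaining = max_mask_count

    def replace(match):
        nonlocal remaining
        word = match.group(0)
        if remaining > 0 and word.lower() in mask_words:
            remaining -= 1
            return mask_token
        return word

    # [A-Za-z]+ matches each run of letters, so "choice," is seen as "choice" plus ",".
    return re.sub(r"[A-Za-z]+", replace, sentence)

print(mask_sentence_tokens("well, a choice, then.", {"well", "choice"}))
# <mask>, a <mask>, then.
```

Punctuation is left in place, which gives the generation model the same sentence structure to fill in.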

In [29]:
def construct_masked_sentences(sentences, union_PW_SW1, union_NW_SW2):
    masked_pos_sentences = []
    masked_neg_sentences = []

    for i, sentence in enumerate(sentences):

        masked_pos_sentence = mask_sentence(sentence, union_PW_SW1[i])
        masked_pos_sentences.append(masked_pos_sentence)

        masked_neg_sentence = mask_sentence(sentence, union_NW_SW2[i])
        masked_neg_sentences.append(masked_neg_sentence)

    return masked_pos_sentences, masked_neg_sentences

In [30]:
masked_pos_sentences, masked_neg_sentences = construct_masked_sentences(text_data, union_PW_SW1, union_NW_SW2)

for i, sentence in enumerate(text_data):
    print(f"Original Sentence: {sentence}")
    print(f"Masked Positive Sentence: {masked_pos_sentences[i]}")
    print(f"Masked Negative Sentence: {masked_neg_sentences[i]}")
    print("- - - - - - - - - -")

Streaming output truncated to the last 5000 lines.
Original Sentence: well, if one must make a choice, then let's make the question more pertinent to this discussion. from a public health perspective regarding aids, which would be better: all sex outside a momogamous relationship between tested individual protected, whether promiscuous or not or only promiscuous sex protected and non-promiscuous sex unprotected? (leaving aside a common definition of 'promiscuous' -- you use your mysterious definition, i'll use mine.) i'd vote for the former. you? well, if you want gay people to be less promiscuous (still without a determined definition), support our full integration into the larger culture and stop calling us immoral. in the long run, we'll end up acting just as randy -- or not -- as anyone else.
Masked Positive Sentence: well, if <mask> <mask> <mask> a choice, then let's <mask> the <mask> more pertinent to this discussion. from a public health perspective regarding aids,

In [31]:
dfnew = pd.DataFrame({"text": text_data_original, "maskedPosSentence": masked_pos_sentences, "maskedNegSentence": masked_neg_sentences})
dfnew

Unnamed: 0,text,maskedPosSentence,maskedNegSentence
0,"If that's true, then Freedom of Speech is doom...","if that's true, then <mask> of <mask> is doome...","if that's true, then freedom of speech is doom..."
1,Neener neener - is it time to go in from the p...,<mask> <mask> - is it <mask> to go in from the...,neener neener - is it time to <mask> in from t...
2,"Just like the plastic gun fear, the armour pie...","just <mask> the <mask> <mask> fear, the <mask>...","just like the <mask> gun fear, the armour pier..."
3,So geology is a religion because we weren't he...,so <mask> is a <mask> because we weren't here ...,so <mask> is a religion because we weren't her...
4,Well done Monty. Mark that up as your first ev...,<mask> <mask> monty. <mask> that up as your fi...,well done monty. mark that up as your <mask> <...
...,...,...,...
9381,"Tell me genius, how is me accurately and corre...","<mask> me genius, how is me <mask> and <mask> ...","tell me genius, how is me accurately and corre..."
9382,So you think it is a good idea for public scho...,so you <mask> it is a <mask> <mask> for <mask>...,so you <mask> it is a good idea for public sch...
9383,"Now settle down charlie, and try to think rati...","now <mask> down charlie, and <mask> to <mask> ...","now settle down charlie, and try to think rati..."
9384,The VPC has a political agenda. The FBI? That ...,the <mask> has a <mask> agenda. the fbi? that ...,the vpc has a <mask> agenda. the fbi? that is ...


## 4)
"These two masked sentences are fed into the pre-trained generation model to fulfill the generation procedure.     
𝐴 = {𝑎1, ..., 𝑥2, ..., 𝑥𝑛−1, ..., 𝑎𝑜} = 𝐵𝐴𝑅𝑇([𝑚]1, 𝑥2, ..., 𝑥𝑛−1, [𝑚]𝑛)    (1)  
Thus, we will obtain two reborn sentences     
𝐴 = {𝑎1, 𝑎2, ..., 𝑎𝑜 } and     
𝐵 = {𝑏1, 𝑏2, ..., 𝑏𝑝 }."

In [32]:
%pip install transformers



In [33]:
def generate_reborn_sentences(masked_sentences):
    tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

    reborn_sentences = []
    for i, masked_sentence in enumerate(masked_sentences, start=1):
        # Truncate to the tokenizer's maximum length so very long posts do not fail.
        inputs = tokenizer(masked_sentence, return_tensors="pt", truncation=True)
        # generate() uses the model's default generation length, which is why
        # long reborn sentences are cut short; pass max_new_tokens to change that.
        generated_encoded = model.generate(
            inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
        reborn_sentence = tokenizer.batch_decode(generated_encoded, skip_special_tokens=True)[0]
        reborn_sentences.append(reborn_sentence)
        if i % 100 == 0:
            print(f'Processed {i} sentences')

    return reborn_sentences
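
The loop above sends one sentence at a time to the model, which is slow for roughly 9,000 posts. One common way to speed this up is to process the sentences in batches. The sketch below shows only the batching logic and assumes a callable `generate_fn` that maps a list of masked strings to a list of generated strings. `chunked`, `generate_in_batches`, and `fake_generate` are hypothetical names, and `fake_generate` is a stand-in that does not call BART.

```python
def chunked(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

def generate_in_batches(masked_sentences, generate_fn, batch_size=16):
    """Apply generate_fn to each batch and return the outputs in input order.

    generate_fn is a hypothetical callable: list[str] -> list[str].
    """
    reborn = []
    for batch in chunked(masked_sentences, batch_size):
        outputs = generate_fn(batch)
        if len(outputs) != len(batch):
            raise ValueError("generate_fn must return one output per input")
        reborn.extend(outputs)
    return reborn

# Stand-in generator that fills every mask with a fixed word (illustration only).
def fake_generate(batch):
    return [s.replace("<mask>", "something") for s in batch]

demo = ["so <mask> is a religion", "well done <mask>", "<mask> <mask> yet?"]
print(generate_in_batches(demo, fake_generate, batch_size=2))
```

With a real model, `generate_fn` would tokenize the batch with padding, call `model.generate`, and decode the results.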

In [34]:
from google.colab import userdata
hf_token = userdata.get('HF_TOKEN')  # assigned to a variable so the secret is not echoed as cell output

In [35]:
import os
os.environ["HF_TOKEN"] = hf_token  # never paste the token itself into the notebook

In [None]:
reborn_pos_sentences = generate_reborn_sentences(masked_pos_sentences)

reborn_neg_sentences = generate_reborn_sentences(masked_neg_sentences)

Processed 100 sentences
Processed 200 sentences
Processed 300 sentences
...
Processed

In [None]:
print("Reborn Sentences for Masked Positive Sentences:")
for i, reborn_sentence in enumerate(reborn_pos_sentences):
    print(f"Reborn Sentence {i + 1}: {reborn_sentence}")

Streaming output truncated to the last 5000 lines.
Reborn Sentence 4387: no - but they do have a bias. as do we all.
Reborn Sentence 4388: don't you think it's kind of passing of just-so statement as argument?
Reborn Sentence 4389: all of them. i don't know what you are referring to at all. i
Reborn Sentence 4390: first he can tell irs that god will provide his cell number. irs can
Reborn Sentence 4391: i don't know what its all about, and i don't think it's that
Reborn Sentence 4392: i.e., it was the product of newspapers, indeed!
Reborn Sentence 4393: so if that sounds familiar to you, you'd be right about it if only the
Reborn Sentence 4394: really? can you find any of those sources? i am sure that you will be
Reborn Sentence 4395: i was on the phone at the time. i'll pick you up and get back
Reborn Sentence 4396: it seems to me that peddler is simply being facetious... he is trying to
Reborn Sentence 4397: no, no, no matthew! we just have this idea that we are going
R

In [None]:
print("\nReborn Sentences for Masked Negative Sentences:")
for i, reborn_sentence in enumerate(reborn_neg_sentences):
    print(f"Reborn Sentence {i + 1}: {reborn_sentence}")

Streaming output truncated to the last 5000 lines.
Reborn Sentence 4387: no - but they certainly have a bias. as do we all.
Reborn Sentence 4388: don't you get the point of using a bunch of just-so arguments as argument
Reborn Sentence 4389: all of them since i don't know what quotes you are referring to at all.
Reborn Sentence 4390: well he can defend his tax difficulties by claiming he is a mathematician and thus cannot be
Reborn Sentence 4391: i notice its typos and mispellings that attract the attention of jim.
Reborn Sentence 4392: i suppose piltdown man was the inventor of newspapers, and he was indeed!
Reborn Sentence 4393: so according to you, you'd prefer it if only the bad guys are armed,
Reborn Sentence 4394: really? can you cite those sources? i am sure that you will be citing the
Reborn Sentence 4395: i was eight at the time. i'll read up and give it to you.
Reborn Sentence 4396: it appears that he is finally open to learning new things! (unless, of course
Re

In [None]:
dfnew["rebornPosSentence"] = reborn_pos_sentences
dfnew["rebornNegSentence"] = reborn_neg_sentences
dfnew

Unnamed: 0,text,maskedPosSentence,maskedNegSentence,rebornPosSentence,rebornNegSentence
0,"If that's true, then Freedom of Speech is doom...","if that's true, then <mask> of <mask> is doome...","if that's true, then freedom of speech is doom...","if that's true, then the whole idea of banning...","if that's true, then freedom of speech is doom..."
1,Neener neener - is it time to go in from the p...,<mask> <mask> - is it <mask> to go in from the...,neener neener - is it time to <mask> in from t...,So - is it time to go in from the airport yet?,neener neener - is it time to step in from the...
2,"Just like the plastic gun fear, the armour pie...","just <mask> the <mask> <mask> fear, the <mask>...","just like the <mask> gun fear, the armour pier...","just like the first bullet fear, the second bu...","just like the gun fear, the armour piercing fe..."
3,So geology is a religion because we weren't he...,so <mask> is a <mask> because we weren't here ...,so <mask> is a religion because we weren't her...,so this is a joke because we weren't here to s...,so what is a religion because we weren't here ...
4,Well done Monty. Mark that up as your first ev...,<mask> <mask> monty. <mask> that up as your fi...,well done monty. mark that up as your <mask> <...,This is a monty. You can pick that up as your ...,well done monty. mark that up as your own. 100...
...,...,...,...,...,...
9381,"Tell me genius, how is me accurately and corre...","<mask> me genius, how is me <mask> and <mask> ...","tell me genius, how is me accurately and corre...","you are calling me genius, how is me? and how ...","tell me genius, how is me accurately and corre..."
9382,So you think it is a good idea for public scho...,so you <mask> it is a <mask> <mask> for <mask>...,so you <mask> it is a good idea for public sch...,so you think it is a good idea for the school ...,so you don't think it is a good idea for publi...
9383,"Now settle down charlie, and try to think rati...","now <mask> down charlie, and <mask> to <mask> ...","now settle down charlie, and try to think rati...","now i have to sit down charlie, and try to thi...","now settle down charlie, and try to think rati..."
9384,The VPC has a political agenda. The FBI? That ...,the <mask> has a <mask> agenda. the fbi? that ...,the vpc has a <mask> agenda. the fbi? that is ...,the fbi has a political agenda. the fbi? that ...,the vpc has a different agenda. the fbi? that ...


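# Sentence Representation
"We embed the original sentence $x$ and its corresponding reborn texts $A$ and $B$     
into $d$-dimensional embeddings $H_t \in \mathbb{R}^d$     
via pre-trained BERT-base:     
$H_x, H_A, H_B = \mathrm{BERT}(x), \mathrm{BERT}(A), \mathrm{BERT}(B)$."

In this notebook the encoder is the SimCSE checkpoint `princeton-nlp/sup-simcse-bert-base-uncased`, and the token vectors of the last layer are averaged into one vector per sentence.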

In [None]:
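import torch
from transformers import AutoModel, AutoTokenizer

def embed_sentences(sentences):
    """Return one mean-pooled embedding per sentence, stacked as (N, 1, hidden_size)."""
    model_name = "princeton-nlp/sup-simcse-bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    embeddings = []
    for i, sentence in enumerate(sentences, start=1):
        # One sentence per forward pass, so no padding tokens enter the mean
        inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            # Average the last-layer token vectors: shape (1, hidden_size)
            outputs = model(**inputs).last_hidden_state.mean(dim=1)
        embeddings.append(outputs)
        if i % 100 == 0:
            print(f'Processed {i} sentences')

    return torch.stack(embeddings)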
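
`embed_sentences` encodes one sentence at a time, so a plain mean over the token axis never includes padding. If the sentences were encoded in batches instead, padded positions would have to be excluded from the average. The following is a minimal NumPy sketch of that idea on toy arrays; `masked_mean_pool` is a hypothetical helper and is not part of this notebook's pipeline.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    token_embeddings: array of shape (batch, seq_len, dim)
    attention_mask:   array of shape (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Guard against a row that contains only padding
    counts = np.clip(mask.sum(axis=1), 1.0, None)
    return summed / counts

# Toy batch: 2 sentences, up to 3 tokens each, 2-dimensional token vectors
tokens = np.array([
    [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],  # third position is padding
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(tokens, mask)
print(pooled)
```

With a real encoder, the mask would be the `attention_mask` returned by the tokenizer, under the assumption that the batch is padded to a common length.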

In [None]:
x_embeddings = embed_sentences(text_data)

A_embeddings = embed_sentences(reborn_pos_sentences)

B_embeddings = embed_sentences(reborn_neg_sentences)


In [None]:
for i, sentence in enumerate(text_data):
    print(f"Embedding for Original Lowercase Sentence {i + 1} ({sentence}):")
    print(x_embeddings[i])
    print("- - - - - - - - - -")

Streaming output truncated to the last 5000 lines.
          2.8928e-01, -2.7054e-01,  2.5134e-01, -4.1064e-02,  1.2650e-01,
         -4.5400e-01,  1.3122e-01, -4.9693e-01, -5.9639e-01,  3.9484e-01,
         -1.0214e-01, -1.9991e-01,  1.2501e-01,  5.3332e-01,  1.8895e-01,
          9.2860e-03,  1.7296e-01,  3.3529e-02, -5.9547e-02,  4.3366e-01,
         -2.3004e-01, -4.9007e-02,  4.8700e-02,  2.6306e-01, -7.6661e-02,
          4.4054e-02, -1.0815e-01, -4.5487e-02, -8.5874e-02, -1.3641e-01,
         -1.0791e-01, -7.5112e-02, -7.9894e-02]])
- - - - - - - - - -
Embedding for Original Lowercase Sentence 9355 (no more of a poison than alcohol (actually far less), should we start lacing that with strychnine? ***i can see it now****excuse me bartender...what are the drink specials?-we got bud light bottles for $4.00, jagerbombs for $7.50, and this new drink called convulsions on the beach. first one is on the house!):
tensor([[ 1.2838e-01,  2.3714e-01,  6.3039e-01, -3.6950e-01, 

In [None]:
for i, sentence in enumerate(reborn_pos_sentences):
    print(f"Embedding for Reborn Positive Sentence {i + 1} ({sentence}):")
    print(A_embeddings[i])
    print("- - - - - - - - - -")

Streaming output truncated to the last 5000 lines.
          1.1994e-01, -3.7928e-01,  1.3793e-02, -2.1081e-02, -4.4088e-01,
         -1.7168e-01,  2.0685e-01, -2.9854e-02, -6.4462e-01, -2.2299e-03,
         -1.3455e-01, -2.6773e-01, -3.2710e-01,  3.9880e-01, -5.9600e-01,
         -6.7302e-01,  2.6783e-01, -1.2413e-01, -2.6501e-01,  1.4360e-01,
         -3.3590e-01,  1.6064e-02,  1.8389e-01,  2.0150e-01, -2.4643e-01,
         -1.2896e-01, -3.6428e-01, -7.2519e-02,  1.7863e-01,  1.0935e-01,
          1.1178e-01,  3.7606e-01, -2.7353e-01]])
- - - - - - - - - -
Embedding for Reborn Positive Sentence 9355 (no more of a drink than that (actually it's a lot less), should we):
tensor([[ 1.4541e-01, -5.2658e-02,  3.2920e-01, -8.7392e-02, -1.0531e-01,
         -5.2562e-01,  2.7335e-01,  1.7266e-01,  4.5363e-01,  4.1177e-01,
          6.0299e-02,  3.4734e-01, -1.7574e-01,  6.8035e-02, -8.2805e-02,
          4.1740e-01,  3.6877e-01,  1.3437e-01,  6.6787e-01, -5.1811e-04,
          7

In [None]:
for i, sentence in enumerate(reborn_neg_sentences):
    print(f"Embedding for Reborn Negative Sentence {i + 1} ({sentence}):")
    print(B_embeddings[i])
    print("- - - - - - - - - -")

Streaming output truncated to the last 5000 lines.
          3.5742e-01, -1.4797e-01, -4.6431e-02, -1.9732e-01, -2.8702e-01,
         -3.2802e-01,  4.7924e-01, -3.9624e-02, -6.6906e-01,  1.0075e-01,
          7.9532e-02, -4.3159e-01, -3.4822e-01,  3.9283e-01, -2.6643e-01,
         -2.9612e-01,  1.0541e-01, -1.9892e-01, -6.0857e-03,  3.5499e-01,
         -2.0595e-01, -7.9037e-02, -1.1404e-01,  2.5401e-01, -7.6607e-02,
         -5.3799e-02, -5.0172e-01, -1.6233e-01,  2.3692e-01,  2.6061e-02,
          8.9898e-02,  1.0130e-01, -1.4812e-01]])
- - - - - - - - - -
Embedding for Reborn Negative Sentence 9355 (no more of a pain in the ass than alcohol (actually far less), should we):
tensor([[-1.8610e-01,  3.3666e-01,  1.7728e-02, -1.7969e-01, -5.1006e-01,
         -1.0709e-01,  2.3799e-01,  3.0864e-01,  4.9011e-01,  2.4840e-01,
          6.7531e-02,  2.9730e-01, -2.7904e-01,  2.2921e-02, -5.7304e-02,
          6.9024e-01,  2.9603e-01,  2.3382e-02,  3.2016e-01, -6.4098e-03,
     

In [None]:
dfnew["xEmbedding"] = x_embeddings.tolist()
dfnew["AEmbedding"] = A_embeddings.tolist()
dfnew["BEmbedding"] = B_embeddings.tolist()
dfnew

Unnamed: 0,text,maskedPosSentence,maskedNegSentence,rebornPosSentence,rebornNegSentence,xEmbedding,AEmbedding,BEmbedding
0,"If that's true, then Freedom of Speech is doom...","if that's true, then <mask> of <mask> is doome...","if that's true, then freedom of speech is doom...","if that's true, then the whole idea of banning...","if that's true, then freedom of speech is doom...","[[0.28837844729423523, 0.4140242040157318, 0.5...","[[0.40408673882484436, 0.48163169622421265, 0....","[[0.3530767560005188, 0.35502880811691284, 0.2..."
1,Neener neener - is it time to go in from the p...,<mask> <mask> - is it <mask> to go in from the...,neener neener - is it time to <mask> in from t...,So - is it time to go in from the airport yet?,neener neener - is it time to step in from the...,"[[0.5522841215133667, -0.7339645028114319, 0.0...","[[0.3821914792060852, -0.635523796081543, 0.43...","[[0.3627055883407593, -0.5871938467025757, -0...."
2,"Just like the plastic gun fear, the armour pie...","just <mask> the <mask> <mask> fear, the <mask>...","just like the <mask> gun fear, the armour pier...","just like the first bullet fear, the second bu...","just like the gun fear, the armour piercing fe...","[[0.6633196473121643, 0.4716658592224121, -0.0...","[[0.4542853534221649, 0.25347092747688293, -0....","[[0.6167881488800049, 0.3169938325881958, -0.1..."
3,So geology is a religion because we weren't he...,so <mask> is a <mask> because we weren't here ...,so <mask> is a religion because we weren't her...,so this is a joke because we weren't here to s...,so what is a religion because we weren't here ...,"[[0.6581794619560242, 0.4820736348628998, -0.1...","[[0.7056750655174255, 0.03841923549771309, -0....","[[0.5304409265518188, 0.20461611449718475, -0...."
4,Well done Monty. Mark that up as your first ev...,<mask> <mask> monty. <mask> that up as your fi...,well done monty. mark that up as your <mask> <...,This is a monty. You can pick that up as your ...,well done monty. mark that up as your own. 100...,"[[0.1503063440322876, 0.3441377282142639, 0.26...","[[-0.07571136951446533, 0.027956988662481308, ...","[[0.22033721208572388, 0.5279561281204224, 0.4..."
...,...,...,...,...,...,...,...,...
9381,"Tell me genius, how is me accurately and corre...","<mask> me genius, how is me <mask> and <mask> ...","tell me genius, how is me accurately and corre...","you are calling me genius, how is me? and how ...","tell me genius, how is me accurately and corre...","[[0.3213426172733307, 0.12743927538394928, -0....","[[0.5475573539733887, 0.15864616632461548, -0....","[[0.12852323055267334, 0.31853243708610535, -0..."
9382,So you think it is a good idea for public scho...,so you <mask> it is a <mask> <mask> for <mask>...,so you <mask> it is a good idea for public sch...,so you think it is a good idea for the school ...,so you don't think it is a good idea for publi...,"[[0.08587020635604858, -0.2503293454647064, -0...","[[0.1419232189655304, -0.5703044533729553, -0....","[[0.08905086666345596, -0.4892028868198395, -0..."
9383,"Now settle down charlie, and try to think rati...","now <mask> down charlie, and <mask> to <mask> ...","now settle down charlie, and try to think rati...","now i have to sit down charlie, and try to thi...","now settle down charlie, and try to think rati...","[[0.45978158712387085, -0.1993301808834076, 0....","[[0.1915624886751175, 0.2985902726650238, 0.21...","[[0.02997184544801712, 0.07484742999076843, -0..."
9384,The VPC has a political agenda. The FBI? That ...,the <mask> has a <mask> agenda. the fbi? that ...,the vpc has a <mask> agenda. the fbi? that is ...,the fbi has a political agenda. the fbi? that ...,the vpc has a different agenda. the fbi? that ...,"[[0.17703638970851898, 0.2844843566417694, 0.1...","[[0.04596323147416115, -0.22533215582370758, -...","[[0.155222550034523, -0.007009867113083601, -0..."


# Sarcastic Utterances Detection
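## 1)
"We utilize cosine similarity to measure the similarity between the representation of the original sentence, $H_x$,     
and those of the generated texts, $H_A$ and $H_B$.

Then we use the following equation to calculate a difference score for each sentence:     
$\mathrm{diff} = \mathrm{sim}(H_x, H_A) < \mathit{threshold} \;\|\; \mathrm{sim}(H_x, H_B) < \mathit{threshold}$     
where $\|$ is the logical "or" operator."

A sentence whose diff is true is labelled `sarc`; otherwise it is labelled `notsarc`.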
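
To make the rule concrete, here is a small pure-Python sketch on two-dimensional toy vectors. The helper names `cosine_sim` and `is_flagged` and the threshold of 0.7 are illustrative assumptions, not values taken from the notebook.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def is_flagged(h_x, h_a, h_b, threshold):
    """diff = sim(H_x, H_A) < threshold  or  sim(H_x, H_B) < threshold."""
    return cosine_sim(h_x, h_a) < threshold or cosine_sim(h_x, h_b) < threshold

h_x = [1.0, 0.0]
h_a = [1.0, 0.0]  # same direction as h_x
h_b = [0.0, 1.0]  # orthogonal to h_x

# One reborn text drifted away from the original, so the sentence is flagged
flag_far = is_flagged(h_x, h_a, h_b, threshold=0.7)
# Both reborn texts stay close to the original, so it is not flagged
flag_near = is_flagged(h_x, h_a, h_a, threshold=0.7)
print(flag_far, flag_near)
```

A single low similarity is enough to flag the sentence, which is why the rule uses "or" rather than "and".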

In [None]:
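# Assumed import: scikit-learn's pairwise cosine similarity, which returns a (1, 1) array here
from sklearn.metrics.pairwise import cosine_similarity

def calculate_difference_scores(x_embeddings, A_embeddings, B_embeddings, threshold):
    """Flag a sentence when either reborn text falls below the similarity threshold."""
    diff_scores = []
    for i, (x_emb, A_emb, B_emb) in enumerate(zip(x_embeddings, A_embeddings, B_embeddings), start=1):
        # Each embedding has shape (1, hidden_size); read the single entry as a float
        sim_Hx_HA = float(cosine_similarity(x_emb, A_emb)[0, 0])
        sim_Hx_HB = float(cosine_similarity(x_emb, B_emb)[0, 0])

        # Plain Python bool, so a later int(diff) needs no array-to-scalar conversion
        diff = (sim_Hx_HA < threshold) or (sim_Hx_HB < threshold)
        diff_scores.append(diff)
        if i % 100 == 0:
            print(f'Processed {i} embeddings')

    return diff_scores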
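
The similarities do not depend on the threshold, so they could be computed once and reused for every threshold in a sweep. The sketch below shows one way to do this in NumPy. It assumes the embeddings have been reshaped to 2-D arrays of shape (N, hidden_size); `rowwise_cosine` and `sweep_thresholds` are hypothetical names.

```python
import numpy as np

def rowwise_cosine(X, Y):
    """Cosine similarity between matching rows of two (N, dim) arrays."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    num = (X * Y).sum(axis=1)
    den = np.linalg.norm(X, axis=1) * np.linalg.norm(Y, axis=1)
    return num / den

def sweep_thresholds(sim_a, sim_b, thresholds):
    """Map each threshold to a boolean array of flagged sentences."""
    # sim_a < t or sim_b < t  is equivalent to  min(sim_a, sim_b) < t
    lowest = np.minimum(sim_a, sim_b)
    return {float(t): lowest < t for t in thresholds}

# Toy data: 3 sentences with 2-dimensional embeddings
X = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
A = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
B = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 0.0]])

sim_a = rowwise_cosine(X, A)
sim_b = rowwise_cosine(X, B)
flags = sweep_thresholds(sim_a, sim_b, [0.5, 0.8])
for t, flagged in flags.items():
    print(t, flagged.tolist())
```

Because only the comparison against the threshold changes between iterations, this avoids repeating the per-sentence similarity calls for each threshold.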

In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, f1_score, confusion_matrix

# Sweep thresholds from 0.55 to 0.90 in steps of 0.01 (stored as integers 55..90)
threshold_range = range(55, 91)

# True labels in binary form (1 = sarcastic); they do not depend on the threshold
true_labels = [1 if label == "sarc" else 0 for label in label_data]

# Metrics for each threshold, keyed by the integer threshold
results = {}

for threshold in threshold_range:
    # Convert the integer threshold to the cosine-similarity scale
    normalized_threshold = threshold / 100

    # Flag each sentence whose reborn texts fall below the threshold
    diff_scores = calculate_difference_scores(x_embeddings, A_embeddings, B_embeddings, normalized_threshold)

    # np.any collapses a one-element array or a plain bool to a single truth value
    predicted_labels = [int(np.any(diff)) for diff in diff_scores]

    # Human-readable label names for the predictions
    labels = ["sarc" if pred else "notsarc" for pred in predicted_labels]

    # Text, true class, and prediction side by side for inspection
    dffinal = pd.DataFrame({"text": text_data, "class": label_data, "prediction": labels})

    # Calculate performance metrics
    accuracy = accuracy_score(true_labels, predicted_labels)
    precision = precision_score(true_labels, predicted_labels)
    f1 = f1_score(true_labels, predicted_labels)

    # Store the metrics in the results dictionary
    results[threshold] = {
        "accuracy": accuracy,
        "precision": precision,
        "f1_score": f1
    }

    # Print the metrics for the current threshold
    print(f"Threshold: {normalized_threshold}")
    print(f"Accuracy: {accuracy}")
    print(f"Precision: {precision}")
    print(f"F1 Score: {f1}")

    # Print the confusion matrix (rows: true class, columns: predicted class)
    conf_matrix = confusion_matrix(true_labels, predicted_labels)
    print("Confusion Matrix:")
    print(conf_matrix)
    print("-" * 50)  # Separator for readability

# Collect the per-threshold metrics in a DataFrame for easier analysis
results_df = pd.DataFrame(results).T
results_df


Threshold: 0.56
Accuracy: 0.49936075005327085
Precision: 0.48770491803278687
F1 Score: 0.04820741340895281
Confusion Matrix:
[[4568  125]
 [4574  119]]
--------------------------------------------------
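
As a sanity check, the metrics printed for threshold 0.56 can be recomputed from its confusion matrix alone. The sketch below assumes scikit-learn's layout `[[TN, FP], [FN, TP]]` for binary labels 0 and 1; `metrics_from_confusion` is a hypothetical helper.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 from the four confusion-matrix cells."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1

# Confusion matrix reported for threshold 0.56: [[4568, 125], [4574, 119]]
accuracy, precision, recall, f1 = metrics_from_confusion(tn=4568, fp=125, fn=4574, tp=119)
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
```

Each row of the matrix sums to 4693, so the two classes are balanced, and the very low recall at this threshold explains why F1 is small while accuracy stays close to 0.5.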
Threshold: 0.58
Accuracy: 0.5008523332623056
Precision: 0.5099502487562189
F1 Score: 0.08047105004906771
Confusion Matrix:
[[4496  197]
 [4488  205]]
--------------------------------------------------
Threshold: 0.59
Accuracy: 0.5002130833155763
Precision: 0.5020408163265306
F1 Score: 0.09492571869573607
Confusion Matrix:
[[4449  244]
 [4447  246]]
--------------------------------------------------
Threshold: 0.6
Accuracy: 0.5017046665246111
Precision: 0.5132013201320133
F1 Score: 0.1173806378561993
Confusion Matrix:
[[4398  295]
 [4382  311]]
--------------------------------------------------
Threshold: 0.68
Accuracy: 0.5037289580225869
Precision: 0.5071457737852184
F1 Score: 0.3478017362083449
Confusion Matrix:
[[3486 1207]
 [3451 1242]]
--------------------------------------------------
Threshold: 0.69
Accuracy: 0.5
Precision: 0.5
F1 Score: 0.3684564661552954
Confusion Matrix:
[[3324 1369]
 [3324 1369]]
--------------------------------------------------
Threshold: 0.71
Accuracy: 0.49936075005327085
Precision: 0.4991181657848324
F1 Score: 0.4195182211241507
Confusion Matrix:
[[2989 1704]
 [2995 1698]]
--------------------------------------------------
Threshold: 0.72
Accuracy: 0.49936075005327085
Precision: 0.4992109416096791
F1 Score: 0.44685108887580927
Confusion Matrix:
[[2789 1904]
 [2795 1898]]
--------------------------------------------------
Threshold: 0.73
Accuracy: 0.5
Precision: 0.5
F1 Score: 0.4719252841228761
Confusion Matrix:
[[2596 2097]
 [2596 2097]]
--------------------------------------------------
Threshold: 0.74
Accuracy: 0.4982953334753889
Precision: 0.4982532751091703
F1 Score: 0.49218160250188725
Confusion Matrix:
[[2395 2298]
 [2411 2282]]
--------------------------------------------------
Threshold: 0.75
Accuracy: 0.49722991689750695
Precision: 0.4973843058350101
F1 Score: 0.5116423470971748
Confusion Matrix:
[[2195 2498]
 [2221 2472]]
--------------------------------------------------
Threshold: 0.76
Accuracy: 0.5021308331557639
Precision: 0.5018608113137328
F1 Score: 0.5358100725141552
Confusion Matrix:
[[2016 2677]
 [1996 2697]]
--------------------------------------------------
Threshold: 0.77
Accuracy: 0.5
Precision: 0.5
F1 Score: 0.5512955349459796
Confusion Matrix:
[[1810 2883]
 [1810 2883]]
--------------------------------------------------
Threshold: 0.78
Accuracy: 0.5017046665246111
Precision: 0.5012978585334199
F1 Score: 0.5692180160265267
Confusion Matrix:
[[1619 3074]
 [1603 3090]]
--------------------------------------------------
Threshold: 0.79
Accuracy: 0.5001065416577882
Precision: 0.5000763242253091
F1 Score: 0.582710779082177
Confusion Matrix:
[[1418 3275]
 [1417 3276]]
--------------------------------------------------
Threshold: 0.8
Accuracy: 0.4981887918176007
Precision: 0.49876722262509066
F1 Score: 0.593545046599931
Confusion Matrix:
[[1237 3456]
 [1254 3439]]
--------------------------------------------------
Threshold: 0.82
Accuracy: 0.4969102919241423
Precision: 0.4980694980694981
F1 Score: 0.6130776794493608
Confusion Matrix:
[[ 923 3770]
 [ 952 3741]]
--------------------------------------------------
Threshold: 0.83
Accuracy: 0.49584487534626037
Precision: 0.49750224157807094
F1 Score: 0.62144
Confusion Matrix:
[[ 770 3923]
 [ 809 3884]]
--------------------------------------------------
Threshold: 0.85
Accuracy: 0.49403366716386105
Precision: 0.4966191741125332
F1 Score: 0.6339884393063584
Confusion Matrix:
[[ 524 4169]
 [ 580 4113]]
--------------------------------------------------
Threshold: 0.87
Accuracy: 0.495099083741743
Precision: 0.4973410404624277
F1 Score: 0.6448324964400809
Confusion Matrix:
[[ 345 4348]
 [ 391 4302]]
--------------------------------------------------
Confusion Matrix:
[[ 290 4403]
 [ 305 4388]]
--------------------------------------------------
Threshold: 0.89
Accuracy: 0.4984018751331771
Precision: 0.4991606043648573
F1 Score: 0.6545347813325506
Confusion Matrix:
[[ 218 4475]
 [ 233 4460]]
--------------------------------------------------

| Unnamed: 0 | accuracy | precision | f1_score |
|---|---|---|---|
| 55 | 0.500213 | 0.505208 | 0.039713 |
| 56 | 0.499361 | 0.487705 | 0.048207 |
| 57 | 0.500213 | 0.503086 | 0.064979 |
| 58 | 0.500852 | 0.50995 | 0.080471 |
| 59 | 0.500213 | 0.502041 | 0.094926 |
| 60 | 0.501705 | 0.513201 | 0.117381 |
| 61 | 0.500746 | 0.504636 | 0.139868 |
| 62 | 0.501065 | 0.505365 | 0.167467 |
| 63 | 0.498722 | 0.494545 | 0.187813 |
| 64 | 0.500852 | 0.50304 | 0.220336 |


In [None]:
# threshold = 0.755

# diff_scores = calculate_difference_scores(x_embeddings, A_embeddings, B_embeddings, threshold)
# diff_scores

## 2)
"Since the sarcastic utterances are influenced more than normal texts during the masking and generation procedure,     
the difference score of sarcastic texts should be greater than a non-sarcastic one.

If we have a threshold value which separates sarcastic texts and normal texts,     
we can yield the prediction 𝑦 by:     
𝑦 = I(diff)."
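
The rule quoted above can be illustrated with a minimal sketch. It reads the rule literally: each text has one scalar difference value, and the indicator returns 1 (sarcastic) when that value is strictly greater than the threshold. The function name `predict_labels` and the example values are hypothetical, and the notebook's own `calculate_difference_scores` may apply the comparison differently.

```python
from typing import List, Sequence


def predict_labels(diff_values: Sequence[float], threshold: float) -> List[int]:
    """Indicator-style prediction (assumed form): 1 if diff > threshold, else 0."""
    return [1 if diff > threshold else 0 for diff in diff_values]


# Hypothetical difference values, for illustration only.
example_diffs = [0.91, 0.40, 0.76, 0.75]
example_predictions = predict_labels(example_diffs, threshold=0.75)
print(example_predictions)  # [1, 0, 1, 0]
```

Because the comparison is strict, a value exactly equal to the threshold is labelled non-sarcastic in this sketch; whether ties should go the other way is a design choice the quoted passage does not settle.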

In [None]:
# predicted_labels = [int(diff) for diff in diff_scores]
# print(predicted_labels)
# print(sum(predicted_labels))

In [None]:
# labels = ["sarc" if diff else "notsarc" for diff in diff_scores]
# print(labels)

In [None]:
# dffinal = pd.DataFrame({"text": text_data, "class": label_data, "prediction": labels})
# dffinal

# Main Experiment Results

In [None]:
# true_labels = [1 if pred == "sarc" else 0 for pred in df["class"]]
# print(true_labels)
# print(predicted_labels)

# accuracy = accuracy_score(true_labels, predicted_labels)
# precision = precision_score(true_labels, predicted_labels)
# f1 = f1_score(true_labels, predicted_labels)

# print("Accuracy:", accuracy)
# print("Precision:", precision)
# print("F1 Score:", f1)

In [None]:
# conf_matrix = confusion_matrix(true_labels, predicted_labels)

# print("Confusion Matrix:")
# print(conf_matrix)
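
As a sanity check on the reported numbers, the metrics can be recomputed directly from a printed confusion matrix. The sketch below assumes the scikit-learn layout `[[TN, FP], [FN, TP]]` (rows are true labels, columns are predicted labels, with `1` meaning "sarc"); the helper name `metrics_from_confusion` is hypothetical and is not part of the notebook. It uses the matrix printed for the `Threshold: 0.73` run.

```python
def metrics_from_confusion(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN).
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts taken from the matrix [[2596 2097], [2596 2097]] printed for threshold 0.73.
metrics_073 = metrics_from_confusion(tn=2596, fp=2097, fn=2596, tp=2097)
for name, value in metrics_073.items():
    print(f"{name}: {value:.6f}")
```

The recomputed accuracy, precision, and F1 agree with the values printed for that run (0.5, 0.5, and about 0.4719). Accuracy stays near 0.5 across the sweep while F1 rises with the threshold, which reflects more texts being predicted as sarcastic (higher recall) without the two classes being separated any better.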