# Introduction

Sarcasm is a sophisticated linguistic phenomenon that can confuse existing sentiment classification systems.
Sarcasm detection, the task of predicting whether a given text contains sarcasm, has therefore received much research attention.

Many methods have recently been proposed for sarcasm detection. They can be broadly classified into two categories.
The first is the text-only approach, which concentrates on the utterance itself, for example by exploiting incongruous expressions to detect sarcastic text.
The second relies on extra information, exploiting external knowledge such as user history and common-sense knowledge to assist detection.

We propose an unsupervised sarcasm detection method.

First, we leverage external sentiment knowledge to mask prominent tokens. The masked texts are then fed into a pre-trained generation model, which follows the remaining logical structure to generate new texts.
There is a good chance that these reborn texts are no longer sarcastic, or that they make more sense than the original.

Second, after obtaining the similarity score between each generated sentence and the original one, we extract features from the scores to decide whether a sentence is sarcastic.

Finally, we construct several unsupervised baselines and conduct experiments on the IAC-V2 dataset.

# Imports and Reading Data

In [1]:
!pip install senticnet

Collecting senticnet
  Downloading senticnet-1.6-py3-none-any.whl.metadata (2.6 kB)
Downloading senticnet-1.6-py3-none-any.whl (51.9 MB)
Installing collected packages: senticnet
Successfully installed senticnet-1.6


In [2]:
import numpy as np
import pandas as pd

from senticnet.senticnet import SenticNet

import nltk
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from transformers import AutoTokenizer, AutoModel
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import accuracy_score, precision_score, f1_score
from sklearn.metrics import confusion_matrix

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
df = pd.read_csv("/content/drive/My Drive/AlifResearch/iSarcasm/train.csv")
df1 = pd.read_csv("/content/drive/My Drive/AlifResearch/iSarcasm/test.csv")

In [5]:
df

Unnamed: 0,id,text,label,emoji,tag
0,1,why do small shouldered tiny guys wear huge t ...,0,,
1,2,"good morning , please go and vote ! <repeated>...",0,🙅,"<url>,</hashtag>,<hashtag>,<number>,<repeated>"
2,3,is it even christmas if there isn ’ t a fight ...,1,,
3,4,helping mum with her maths work for the course...,0,,
4,5,<hashtag> dear customer </hashtag> i am sorry ...,0,,"<hashtag>,</hashtag>"
...,...,...,...,...,...
3111,3112,fcking hate being let down . don ’ t get my ho...,0,,
3112,3113,last day in my twenties 😫,0,😫,
3113,3114,who ’ s dick do i have to suck for some dominos,0,,
3114,3115,<user> yet if you threw cold water it would st...,0,,<user>


In [6]:
df1

Unnamed: 0,id,text,label,emoji,tag
0,3464,saw poppin fresh in the macy ' s parade . my d...,1,,
1,3465,i knew as soon as i heard doing ford was cutti...,0,,"<url>,<percent>"
2,3466,great advice from well established individuals...,0,,<user>
3,3467,"eating apple sauce , chicken thighs , broccoli...",0,,"<hashtag>,</hashtag>"
4,3468,<user> ur not a real smiler if ur not expectin...,1,,<user>
...,...,...,...,...,...
882,4346,imagine that it ' s going to cost me <number> ...,0,,<number>
883,4347,people really out here tryna argue you do not ...,0,,<url>
884,4348,"<user> and their relentless running game , on ...",0,,"<number>,<user>"
885,4349,why is it that whether i get out of bed at <nu...,0,,"<number>,<allcaps>,<repeated>,</allcaps>"


In [7]:
# Concatenate vertically
df = pd.concat([df, df1], ignore_index=True)
df

Unnamed: 0,id,text,label,emoji,tag
0,1,why do small shouldered tiny guys wear huge t ...,0,,
1,2,"good morning , please go and vote ! <repeated>...",0,🙅,"<url>,</hashtag>,<hashtag>,<number>,<repeated>"
2,3,is it even christmas if there isn ’ t a fight ...,1,,
3,4,helping mum with her maths work for the course...,0,,
4,5,<hashtag> dear customer </hashtag> i am sorry ...,0,,"<hashtag>,</hashtag>"
...,...,...,...,...,...
3998,4346,imagine that it ' s going to cost me <number> ...,0,,<number>
3999,4347,people really out here tryna argue you do not ...,0,,<url>
4000,4348,"<user> and their relentless running game , on ...",0,,"<number>,<user>"
4001,4349,why is it that whether i get out of bed at <nu...,0,,"<number>,<allcaps>,<repeated>,</allcaps>"


In [8]:
df = df.drop(columns=['id', 'emoji', 'tag'])
df

Unnamed: 0,text,label
0,why do small shouldered tiny guys wear huge t ...,0
1,"good morning , please go and vote ! <repeated>...",0
2,is it even christmas if there isn ’ t a fight ...,1
3,helping mum with her maths work for the course...,0
4,<hashtag> dear customer </hashtag> i am sorry ...,0
...,...,...
3998,imagine that it ' s going to cost me <number> ...,0
3999,people really out here tryna argue you do not ...,0
4000,"<user> and their relentless running game , on ...",0
4001,why is it that whether i get out of bed at <nu...,0


In [9]:
df['class'] = df['label'].map({0: 'notsarc', 1: 'sarc'})
df

Unnamed: 0,text,label,class
0,why do small shouldered tiny guys wear huge t ...,0,notsarc
1,"good morning , please go and vote ! <repeated>...",0,notsarc
2,is it even christmas if there isn ’ t a fight ...,1,sarc
3,helping mum with her maths work for the course...,0,notsarc
4,<hashtag> dear customer </hashtag> i am sorry ...,0,notsarc
...,...,...,...
3998,imagine that it ' s going to cost me <number> ...,0,notsarc
3999,people really out here tryna argue you do not ...,0,notsarc
4000,"<user> and their relentless running game , on ...",0,notsarc
4001,why is it that whether i get out of bed at <nu...,0,notsarc


In [10]:
# Drop the numeric 'label' column. The rename is a no-op here because the column is already called 'text'.
df = df.drop(columns=['label'])
df = df.rename(columns={'tweet': 'text'})

In [11]:
df

Unnamed: 0,text,class
0,why do small shouldered tiny guys wear huge t ...,notsarc
1,"good morning , please go and vote ! <repeated>...",notsarc
2,is it even christmas if there isn ’ t a fight ...,sarc
3,helping mum with her maths work for the course...,notsarc
4,<hashtag> dear customer </hashtag> i am sorry ...,notsarc
...,...,...
3998,imagine that it ' s going to cost me <number> ...,notsarc
3999,people really out here tryna argue you do not ...,notsarc
4000,"<user> and their relentless running game , on ...",notsarc
4001,why is it that whether i get out of bed at <nu...,notsarc


In [12]:
import re
# Function to remove text inside <>
def remove_brackets(text):
    return re.sub(r'<.*?>', '', text).strip()

# Apply the function to the 'text' column
df['text'] = df['text'].apply(remove_brackets)
df

Unnamed: 0,text,class
0,why do small shouldered tiny guys wear huge t ...,notsarc
1,"good morning , please go and vote ! it only t...",notsarc
2,is it even christmas if there isn ’ t a fight ...,sarc
3,helping mum with her maths work for the course...,notsarc
4,dear customer i am sorry that the mobile phon...,notsarc
...,...,...
3998,imagine that it ' s going to cost me pound to...,notsarc
3999,people really out here tryna argue you do not ...,notsarc
4000,"and their relentless running game , on the bri...",notsarc
4001,why is it that whether i get out of bed at or...,notsarc


In [13]:
!pip install emoji

Collecting emoji
  Downloading emoji-2.12.1-py3-none-any.whl.metadata (5.4 kB)
Downloading emoji-2.12.1-py3-none-any.whl (431 kB)
Installing collected packages: emoji
Successfully installed emoji-2.12.1


In [14]:
import emoji
# Function to remove emojis
def remove_emojis(text):
    return emoji.replace_emoji(text, replace='')

# Apply the function to the 'text' column
df['text'] = df['text'].apply(remove_emojis)
df

Unnamed: 0,text,class
0,why do small shouldered tiny guys wear huge t ...,notsarc
1,"good morning , please go and vote ! it only t...",notsarc
2,is it even christmas if there isn ’ t a fight ...,sarc
3,helping mum with her maths work for the course...,notsarc
4,dear customer i am sorry that the mobile phon...,notsarc
...,...,...
3998,imagine that it ' s going to cost me pound to...,notsarc
3999,people really out here tryna argue you do not ...,notsarc
4000,"and their relentless running game , on the bri...",notsarc
4001,why is it that whether i get out of bed at or...,notsarc


In [15]:
# df= df.drop('id', axis= 1)
# df

# Understanding Data

In [16]:
df.dtypes

Unnamed: 0,0
text,object
class,object


In [17]:
df.columns

Index(['text', 'class'], dtype='object')

In [18]:
text_data_original = list(df['text'])
text_data = [x.lower() for x in text_data_original]
print(*text_data, sep = "\n")

why do small shouldered tiny guys wear huge t shirts ?
good morning , please go and vote !  it only takes  minutes and a low turnout will hand victory to the brexit party   e uelections 2019
is it even christmas if there isn ’ t a fight with neighbours and a broken wrist ?
helping mum with her maths work for the course she ’ s taking and i ’ m slowly realising i am not great at maths
dear customer  i am sorry that the mobile phone reseller in the mall fucked you over . we all are not a bunch of sheisters . i hope your other life issues gets better and that i earned your future business .
anyone fancy writing my lit review for me ? can not . be . arsed .
so the  episode about ladonna was one of the most poignant and sad investigations of abuse of power and discrimination in institutions . v hard to listen to but so important .
middle aged women are bitchier than most people i know
baby tobias has arrived ! i ’ ll be taking a spot of leave but back into the swing of things for spring / s

In [19]:
label_data = list(df['class'])
print(*label_data, sep = "\n")

notsarc
notsarc
sarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
sarc
sarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
sarc
notsarc
notsarc
notsarc
notsarc
sarc
sarc
sarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
sarc
notsarc
sarc
notsarc
notsarc
notsarc
notsarc
sarc
sarc
notsarc
sarc
notsarc
notsarc
notsarc
notsarc
notsarc
sarc
notsarc
notsarc
notsarc
sarc
notsarc
notsarc
notsarc
notsarc
sarc
notsarc
notsarc
notsarc
sarc
notsarc
notsarc
notsarc
notsarc
notsarc
sarc
notsarc
sarc
notsarc
notsarc
sarc
sarc
notsarc
notsarc
notsarc
notsarc
notsarc
sarc
notsarc
notsarc
notsarc
sarc
notsarc
notsarc
notsarc
notsarc
notsarc
sarc
sarc
notsarc
notsarc
notsarc
notsarc
sarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
notsarc
not

# Overview

The proposed framework contains three main components:

1) Sentence masking and generation.
This step first identifies the main components of each sentence, masks them so that the change has the greatest effect on the original sentence, and then generates new text from the masked input;

2) Sentence representation.
This step computes dense vector representations of the sentences;

3) Sarcastic utterance detection.
This step uses the similarity scores between the original and regenerated sentences to decide whether an utterance is sarcastic.
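
The third component compares sentence vectors. The notebook imports `cosine_similarity` from scikit-learn, so cosine similarity is presumably the intended score; the sketch below spells out that computation in plain Python under that assumption. The vectors are made-up toy values, not real sentence embeddings, and the zero-vector convention is a choice made for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "sentence vectors" (illustrative values only).
original = [1.0, 2.0, 0.0]
reborn_a = [2.0, 4.0, 0.0]  # same direction as the original
reborn_b = [0.0, 0.0, 3.0]  # orthogonal to the original

score_a = cosine_similarity(original, reborn_a)
score_b = cosine_similarity(original, reborn_b)
print(score_a, score_b)
```

A reborn sentence that keeps the meaning of the original should score close to 1, while one that drifts away scores lower. How the scores are turned into a sarcasm decision is left to the detection step.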

# Sentences Mask and Generation
## 1)
"First, we use the sentiment common knowledge retrieved from SenticNet to recognize affective words in the sentence 𝑥,     
and split those words into two sets according to its sentiment polarities:    
PW = {pw1, pw2, ..., pwh} and    
NW = {nw1, nw2, ..., nwk},     
h + k <= n."
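
As a small worked illustration of the definition above, the following sketch splits tokens into PW and NW using a tiny hand-made lexicon. `TOY_POLARITY` and `split_by_polarity` are hypothetical stand-ins for the SenticNet lookup; the labels in the dictionary are invented for the example and need not match what SenticNet returns.

```python
# Hypothetical stand-in for SenticNet: word -> polarity label.
TOY_POLARITY = {
    "love": "positive",
    "great": "positive",
    "fight": "negative",
    "broken": "negative",
}

def split_by_polarity(tokens, lexicon):
    """Return (PW, NW): the positive and negative words found in tokens."""
    pw, nw = set(), set()
    for tok in tokens:
        label = lexicon.get(tok.lower())  # None for words not in the lexicon
        if label == "positive":
            pw.add(tok.lower())
        elif label == "negative":
            nw.add(tok.lower())
    return pw, nw

tokens = "i love a great fight over a broken wrist".split()
PW, NW = split_by_polarity(tokens, TOY_POLARITY)
print(PW, NW)
```

Words without an entry fall into neither set, which is why h + k can be smaller than n.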

In [20]:
def tokenize_sentence(sentence):
    tokens = word_tokenize(sentence)

    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens

In [21]:
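# Create the SenticNet lookup once and reuse it across calls.
sn = SenticNet()

def get_sentiment_polarity_from_senticnet(word):
    word = word.lower()

    try:
        return sn.polarity_label(word)
    except KeyError:
        # Words that SenticNet does not contain are treated as neutral.
        return "neutral"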

In [22]:
def analyze_sentiment(sentences):
    positive_words = []
    negative_words = []

    for sentence in sentences:
        words = tokenize_sentence(sentence)

        PW = set()
        NW = set()

        for word in words:
            sentiment_polarity = get_sentiment_polarity_from_senticnet(word)
            if sentiment_polarity == "positive":
                PW.add(word.lower())
            elif sentiment_polarity == "negative":
                NW.add(word.lower())

        positive_words.append(PW)
        negative_words.append(NW)

    return positive_words, negative_words

In [23]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [24]:
positive_words, negative_words = analyze_sentiment(text_data)

for i, sentence in enumerate(text_data):
    print(f"Sentence: {sentence}")
    print(f"Positive Words: {positive_words[i]}")
    print(f"Negative Words: {negative_words[i]}")
    print("- - - - - - - - - -")

Streaming output truncated to the last 5000 lines.
Sentence: the one thing i wish i ’ d brought travelling with me is multi vitamins ! budget backpacking is not great for getting a balanced diet .
Positive Words: {'backpacking', 'multi', 'brought', 'great', 'travelling'}
Negative Words: {'vitamin'}
- - - - - - - - - -
Sentence: win a cyberpower  gaming pc plus  fifa   on pc .  ,  ,
Positive Words: {'fifa', 'plus', 'win', 'gaming'}
Negative Words: set()
- - - - - - - - - -
Sentence: i cannot wait for halloween        
Positive Words: set()
Negative Words: {'wait'}
- - - - - - - - - -
Sentence: thought i ' d be a sheep . it ' s yanny for me
Positive Words: set()
Negative Words: set()
- - - - - - - - - -
Sentence: bored of all these manager rumors .  viera , arteta and gerrard ? are these the guys to replace rafa or mo diame ?  nufc
Positive Words: {'rumor'}
Negative Words: {'bored'}
- - - - - - - - - -
Sentence: note to self .  the cold air is not good with asthma , bronchi

In [25]:
df["PW"] = positive_words
df["NW"] = negative_words
df

Unnamed: 0,text,class,PW,NW
0,why do small shouldered tiny guys wear huge t ...,notsarc,"{wear, shirt, tiny, huge}",{}
1,"good morning , please go and vote ! it only t...",notsarc,"{victory, good, turnout, party}",{low}
2,is it even christmas if there isn ’ t a fight ...,sarc,"{fight, wrist, christmas}",{broken}
3,helping mum with her maths work for the course...,notsarc,"{math, work, slowly, great}",{mum}
4,dear customer i am sorry that the mobile phon...,notsarc,"{better, hope}","{fucked, sorry}"
...,...,...,...,...
3998,imagine that it ' s going to cost me pound to...,notsarc,"{pound, imagine, travel}",{cost}
3999,people really out here tryna argue you do not ...,notsarc,{},{argue}
4000,"and their relentless running game , on the bri...",notsarc,"{army, running}","{relentless, dangerous}"
4001,why is it that whether i get out of bed at or...,notsarc,{},{}


## 2)
"Second, we analyze the sentence to get its syntax information to identify non-stop words     
     𝑆𝑊 = {𝑠𝑤1, 𝑠𝑤2, ..., 𝑠𝑤𝑚, 𝑚 ≤ 𝑛}.     
Intuitively, these words are the main components of sentences. Then we split 𝑆𝑊 into two sets which satisfy :     
     𝑆𝑊1 ∪ 𝑆𝑊2 = 𝑆𝑊 ,     
     |𝑆𝑊1| = |𝑆𝑊2|."
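
The two conditions can only both hold exactly when the number of non-stop words is even. One possible reading, sketched below, removes repeated words first and then cuts the list in the middle, so the halves are disjoint and differ in size by at most one. `split_in_half` is a hypothetical helper written for this illustration.

```python
def split_in_half(words):
    """Order-preserving split into two lists whose sizes differ by at most one."""
    unique = list(dict.fromkeys(words))  # drop repeats, keep first-seen order
    mid = len(unique) // 2
    return unique[:mid], unique[mid:]

# Non-stop words of one example sentence; 'maths' occurs twice.
words = ["helping", "mum", "maths", "work", "course",
         "taking", "slowly", "realising", "great", "maths"]
SW1, SW2 = split_in_half(words)
print(SW1)
print(SW2)
```

Without the de-duplication step a repeated word such as `maths` can land in both halves, and converting each half to a set afterwards can make the sizes drift further apart.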

In [26]:
def extract_non_stop_words(sentence):
    words = nltk.word_tokenize(sentence)

    stop_words = set(stopwords.words("english"))

    non_stop_words = [word.lower() for word in words if word.lower() not in stop_words and word.isalpha()]

    return non_stop_words

In [27]:
def split_non_stop_words(non_stop_words):
    m = len(non_stop_words)
    m1 = m // 2
    SW1 = set(non_stop_words[:m1])
    SW2 = set(non_stop_words[m1:])
    return SW1, SW2

In [28]:
def analyze_sentences(sentences):
    all_SW1 = []
    all_SW2 = []

    for sentence in sentences:
        non_stop_words = extract_non_stop_words(sentence)
        SW1, SW2 = split_non_stop_words(non_stop_words)
        all_SW1.append(SW1)
        all_SW2.append(SW2)

    return all_SW1, all_SW2

In [29]:
all_SW1, all_SW2 = analyze_sentences(text_data)

for i, sentence in enumerate(text_data):
    print(f"Sentence: {sentence}")
    print(f"SW1: {all_SW1[i]}")
    print(f"SW2: {all_SW2[i]}")
    print("- - - - - - - - - -")

Streaming output truncated to the last 5000 lines.
Sentence: the one thing i wish i ’ d brought travelling with me is multi vitamins ! budget backpacking is not great for getting a balanced diet .
SW1: {'brought', 'multi', 'travelling', 'thing', 'one', 'wish'}
SW2: {'backpacking', 'getting', 'budget', 'great', 'vitamins', 'balanced', 'diet'}
- - - - - - - - - -
Sentence: win a cyberpower  gaming pc plus  fifa   on pc .  ,  ,
SW1: {'win', 'cyberpower', 'gaming'}
SW2: {'pc', 'fifa', 'plus'}
- - - - - - - - - -
Sentence: i cannot wait for halloween        
SW1: {'wait'}
SW2: {'halloween'}
- - - - - - - - - -
Sentence: thought i ' d be a sheep . it ' s yanny for me
SW1: {'thought'}
SW2: {'yanny', 'sheep'}
- - - - - - - - - -
Sentence: bored of all these manager rumors .  viera , arteta and gerrard ? are these the guys to replace rafa or mo diame ?  nufc
SW1: {'rumors', 'bored', 'gerrard', 'arteta', 'viera', 'manager'}
SW2: {'mo', 'rafa', 'diame', 'replace', 'nufc', 'guys'}
- 

In [30]:
df["SW1"] = all_SW1
df["SW2"] = all_SW2
df

Unnamed: 0,text,class,PW,NW,SW1,SW2
0,why do small shouldered tiny guys wear huge t ...,notsarc,"{wear, shirt, tiny, huge}",{},"{shouldered, tiny, small}","{wear, huge, guys, shirts}"
1,"good morning , please go and vote ! it only t...",notsarc,"{victory, good, turnout, party}",{low},"{please, good, vote, morning, takes, minutes, go}","{victory, turnout, brexit, low, party, e, hand..."
2,is it even christmas if there isn ’ t a fight ...,sarc,"{fight, wrist, christmas}",{broken},"{fight, even, christmas}","{neighbours, wrist, broken}"
3,helping mum with her maths work for the course...,notsarc,"{math, work, slowly, great}",{mum},"{course, helping, maths, work, mum}","{slowly, realising, great, maths, taking}"
4,dear customer i am sorry that the mobile phon...,notsarc,"{better, hope}","{fucked, sorry}","{fucked, sorry, phone, mall, bunch, dear, rese...","{future, business, life, issues, better, earne..."
...,...,...,...,...,...,...
3998,imagine that it ' s going to cost me pound to...,notsarc,"{pound, imagine, travel}",{cost},"{cost, going, imagine}","{pound, university, travel, year}"
3999,people really out here tryna argue you do not ...,notsarc,{},{argue},"{really, tryna, people}","{sanitary, argue, soap, need}"
4000,"and their relentless running game , on the bri...",notsarc,"{army, running}","{relentless, dangerous}","{upsetting, brink, running, relentless, game}","{ground, dangerous, knows, army, well, game}"
4001,why is it that whether i get out of bed at or...,notsarc,{},{},"{bed, get, whether, always}","{kitchen, get, clock, downstairs}"


## 3)
"Here, 𝑃𝑊 ∪ 𝑆𝑊1 and 𝑁𝑊 ∪ 𝑆𝑊2 are used to mask original sentence respectively. So, we will obtain two masked sentences     
𝑥𝑚1 = { [𝑚]1, 𝑥2, ..., [𝑚]𝑛} and     
𝑥𝑚2 = {𝑥1, [𝑚]2, ..., 𝑥𝑛}."
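
One practical detail when masking: the PW and NW sets hold lemmatized words, while the sentence holds surface forms, so a direct string comparison can miss a match such as `vitamins` versus `vitamin`. The sketch below compares a normalized form of each token against the mask set. `mask_tokens` and `toy_normalize` are hypothetical names; `toy_normalize` is a crude stand-in for a real lemmatizer, and the cap of five masks mirrors the default used by the notebook's `mask_sentence`.

```python
MASK = "<mask>"

def mask_tokens(tokens, mask_words, normalize=str.lower, max_masks=5):
    """Replace up to max_masks tokens whose normalized form is in mask_words."""
    masked, used = [], 0
    for tok in tokens:
        if used < max_masks and normalize(tok) in mask_words:
            masked.append(MASK)
            used += 1
        else:
            masked.append(tok)
    return masked

def toy_normalize(tok):
    """Hypothetical normalizer: lowercase and strip a trailing plural 's'."""
    tok = tok.lower()
    return tok[:-1] if tok.endswith("s") and len(tok) > 3 else tok

tokens = "multi vitamins are not great".split()
mask_words = {"vitamin", "great"}

plain = mask_tokens(tokens, mask_words)                           # misses 'vitamins'
normalized = mask_tokens(tokens, mask_words, normalize=toy_normalize)
print(" ".join(plain))
print(" ".join(normalized))
```

In the notebook the normalizer would be the same lemmatization applied when the word sets were built, so both sides of the comparison use the same form.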

In [31]:
def construct_union(sentences, PW, NW, all_SW1, all_SW2):
    union_PW_SW1 = []
    union_NW_SW2 = []

    for i, sentence in enumerate(sentences):
        SW1 = all_SW1[i]
        SW2 = all_SW2[i]

        union_PW_SW1.append(PW[i].union(SW1))
        union_NW_SW2.append(NW[i].union(SW2))

    return union_PW_SW1, union_NW_SW2

In [32]:
union_PW_SW1, union_NW_SW2 = construct_union(text_data, positive_words, negative_words, all_SW1, all_SW2)
print(union_PW_SW1)
print(union_NW_SW2)



In [33]:
df["union_PW_SW1"] = union_PW_SW1
df["union_NW_SW2"] = union_NW_SW2
df

Unnamed: 0,text,class,PW,NW,SW1,SW2,union_PW_SW1,union_NW_SW2
0,why do small shouldered tiny guys wear huge t ...,notsarc,"{wear, shirt, tiny, huge}",{},"{shouldered, tiny, small}","{wear, huge, guys, shirts}","{wear, huge, shouldered, tiny, shirt, small}","{wear, huge, guys, shirts}"
1,"good morning , please go and vote ! it only t...",notsarc,"{victory, good, turnout, party}",{low},"{please, good, vote, morning, takes, minutes, go}","{victory, turnout, brexit, low, party, e, hand...","{victory, please, turnout, good, vote, party, ...","{victory, turnout, low, brexit, party, e, hand..."
2,is it even christmas if there isn ’ t a fight ...,sarc,"{fight, wrist, christmas}",{broken},"{fight, even, christmas}","{neighbours, wrist, broken}","{christmas, wrist, even, fight}","{neighbours, wrist, broken}"
3,helping mum with her maths work for the course...,notsarc,"{math, work, slowly, great}",{mum},"{course, helping, maths, work, mum}","{slowly, realising, great, maths, taking}","{course, slowly, work, great, helping, maths, ...","{taking, slowly, great, maths, realising, mum}"
4,dear customer i am sorry that the mobile phon...,notsarc,"{better, hope}","{fucked, sorry}","{fucked, sorry, phone, mall, bunch, dear, rese...","{future, business, life, issues, better, earne...","{fucked, sorry, phone, mall, better, bunch, de...","{fucked, future, business, sorry, life, issues..."
...,...,...,...,...,...,...,...,...
3998,imagine that it ' s going to cost me pound to...,notsarc,"{pound, imagine, travel}",{cost},"{cost, going, imagine}","{pound, university, travel, year}","{pound, going, imagine, travel, cost}","{pound, university, travel, year, cost}"
3999,people really out here tryna argue you do not ...,notsarc,{},{argue},"{really, tryna, people}","{sanitary, argue, soap, need}","{really, tryna, people}","{argue, sanitary, soap, need}"
4000,"and their relentless running game , on the bri...",notsarc,"{army, running}","{relentless, dangerous}","{upsetting, brink, running, relentless, game}","{ground, dangerous, knows, army, well, game}","{upsetting, army, brink, relentless, running, ...","{ground, dangerous, knows, army, relentless, w..."
4001,why is it that whether i get out of bed at or...,notsarc,{},{},"{bed, get, whether, always}","{kitchen, get, clock, downstairs}","{bed, get, whether, always}","{kitchen, get, clock, downstairs}"


In [34]:
def mask_sentence(sentence, mask_words, max_mask_count = 5):
    masked_sentence = []

    for word in sentence.split():
        if word in mask_words and max_mask_count > 0:
            masked_sentence.append("<mask>")
            max_mask_count -= 1
        else:
            masked_sentence.append(word)

    return " ".join(masked_sentence)

In [35]:
def construct_masked_sentences(sentences, union_PW_SW1, union_NW_SW2):
    masked_pos_sentences = []
    masked_neg_sentences = []

    for i, sentence in enumerate(sentences):

        masked_pos_sentence = mask_sentence(sentence, union_PW_SW1[i])
        masked_pos_sentences.append(masked_pos_sentence)

        masked_neg_sentence = mask_sentence(sentence, union_NW_SW2[i])
        masked_neg_sentences.append(masked_neg_sentence)

    return masked_pos_sentences, masked_neg_sentences

In [36]:
masked_pos_sentences, masked_neg_sentences = construct_masked_sentences(text_data, union_PW_SW1, union_NW_SW2)

for i, sentence in enumerate(text_data):
    print(f"Original Sentence: {sentence}")
    print(f"Masked Positive Sentence: {masked_pos_sentences[i]}")
    print(f"Masked Negative Sentence: {masked_neg_sentences[i]}")
    print("- - - - - - - - - -")

Streaming output truncated to the last 5000 lines.
Original Sentence: the one thing i wish i ’ d brought travelling with me is multi vitamins ! budget backpacking is not great for getting a balanced diet .
Masked Positive Sentence: the <mask> <mask> i <mask> i ’ d <mask> <mask> with me is multi vitamins ! budget backpacking is not great for getting a balanced diet .
Masked Negative Sentence: the one thing i wish i ’ d brought travelling with me is multi <mask> ! <mask> <mask> is not <mask> for <mask> a balanced diet .
- - - - - - - - - -
Original Sentence: win a cyberpower  gaming pc plus  fifa   on pc .  ,  ,
Masked Positive Sentence: <mask> a <mask> <mask> pc <mask> <mask> on pc . , ,
Masked Negative Sentence: win a cyberpower gaming <mask> <mask> <mask> on <mask> . , ,
- - - - - - - - - -
Original Sentence: i cannot wait for halloween        
Masked Positive Sentence: i cannot <mask> for halloween
Masked Negative Sentence: i cannot <mask> for <mask>
- - - - - - - - - -

In [37]:
dfnew = pd.DataFrame({"text": text_data_original, "maskedPosSentence": masked_pos_sentences, "maskedNegSentence": masked_neg_sentences})
dfnew

Unnamed: 0,text,maskedPosSentence,maskedNegSentence
0,why do small shouldered tiny guys wear huge t ...,why do <mask> <mask> <mask> guys <mask> <mask>...,why do small shouldered tiny <mask> <mask> <ma...
1,"good morning , please go and vote ! it only t...","<mask> <mask> , <mask> <mask> and <mask> ! it ...","good morning , please go and vote ! it only ta..."
2,is it even christmas if there isn ’ t a fight ...,is it <mask> <mask> if there isn ’ t a <mask> ...,is it even christmas if there isn ’ t a fight ...
3,helping mum with her maths work for the course...,<mask> <mask> with her <mask> <mask> for the <...,helping <mask> with her <mask> work for the co...
4,dear customer i am sorry that the mobile phon...,<mask> <mask> i am <mask> that the <mask> <mas...,dear customer i am <mask> that the mobile phon...
...,...,...,...
3998,imagine that it ' s going to cost me pound to...,<mask> that it ' s <mask> to <mask> me <mask> ...,imagine that it ' s going to <mask> me <mask> ...
3999,people really out here tryna argue you do not ...,<mask> <mask> out here <mask> argue you do not...,people really out here tryna <mask> you do not...
4000,"and their relentless running game , on the bri...","and their <mask> <mask> <mask> , on the <mask>...","and their <mask> running <mask> , on the brink..."
4001,why is it that whether i get out of bed at or...,why is it that <mask> i <mask> out of <mask> a...,why is it that whether i <mask> out of bed at ...


## 4)
"These two masked sentences are fed into the pre-trained generation model to fulfill the generation procedure.     
𝑨{𝑎1, ..., 𝑥2, ..., 𝑥𝑛−1, ..., 𝑎𝑜 } = 𝐵𝐴𝑅𝑇 ( [𝑚]1, 𝑥2, ..., 𝑥𝑛−1, [𝑚]𝑛 )----(1)  
Thus, we will obtain two reborn sentences     
𝐴 = {𝑎1, 𝑎2, ..., 𝑎𝑜 } and     
𝐵 = {𝑏1, 𝑏2, ..., 𝑏𝑝 }."
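
Equation (1) is written for BART, whose mask token is `<mask>`. The notebook loads T5 instead, and T5 marks gaps with sentinel tokens (`<extra_id_0>`, `<extra_id_1>`, ...) and generates only the missing spans, each introduced by its sentinel. The sketch below shows one way the masked sentences could be adapted for T5 and the generated spans merged back. Both helper names are hypothetical, and the `generated` string is hand-written for illustration, not real model output.

```python
import re

MASK = "<mask>"
SENTINEL = re.compile(r"(<extra_id_\d+>)")

def to_t5_sentinels(masked_sentence):
    """Replace each run of consecutive <mask> tokens with one T5 sentinel."""
    out, n, prev_was_mask = [], 0, False
    for tok in masked_sentence.split():
        if tok == MASK:
            if not prev_was_mask:
                out.append(f"<extra_id_{n}>")
                n += 1
            prev_was_mask = True
        else:
            out.append(tok)
            prev_was_mask = False
    return " ".join(out)

def fill_sentinels(sentinel_sentence, generated):
    """Insert the spans from a T5-style output back into the sentence."""
    # generated looks like "<extra_id_0> span0 <extra_id_1> span1 <extra_id_2>"
    parts = SENTINEL.split(generated)
    spans = {parts[i]: parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}
    filled = SENTINEL.sub(lambda m: spans.get(m.group(1), ""), sentinel_sentence)
    return " ".join(filled.split())  # collapse any doubled spaces

masked = "i <mask> <mask> to the <mask> ."
t5_input = to_t5_sentinels(masked)
generated = "<extra_id_0> want to go <extra_id_1> park <extra_id_2>"  # hand-written
reborn = fill_sentinels(t5_input, generated)
print(t5_input)
print(reborn)
```

For this to work with real T5 output, the sentinels have to survive decoding, for example by decoding with `skip_special_tokens=False` and removing `<pad>` and `</s>` afterwards. That wiring is an assumption about how the pieces would fit together, not something the notebook does.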

In [38]:
%pip install transformers



In [45]:
def generate_reborn_sentences(masked_sentences):
    # Load the tokenizer and model once per call.
    tokenizer = T5Tokenizer.from_pretrained("google-t5/t5-base")
    model = T5ForConditionalGeneration.from_pretrained("google-t5/t5-base")

    reborn_sentences = []
    for i, masked_sentence in enumerate(masked_sentences, start=1):
        # Note: "<mask>" is not one of T5's sentinel tokens, so the tokenizer
        # treats it as ordinary text rather than as a gap to fill.
        inputs = tokenizer(masked_sentence, return_tensors="pt")
        # No length argument is passed, so generate() uses the default
        # length limit from the generation config.
        generated_encoded = model.generate(**inputs)
        reborn_sentence = tokenizer.batch_decode(generated_encoded, skip_special_tokens=True)[0]
        reborn_sentences.append(reborn_sentence)
        if i % 100 == 0:
            print(f'Processed {i} sentences')

    return reborn_sentences

In [40]:
from google.colab import userdata
# Keep the token in a variable; displaying it would leak the secret into the notebook output.
hf_token = userdata.get('HF_TOKEN')

In [41]:
import os
os.environ["HF_TOKEN"] = hf_token

In [None]:
reborn_pos_sentences = generate_reborn_sentences(masked_pos_sentences)

reborn_neg_sentences = generate_reborn_sentences(masked_neg_sentences)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.




Processed 100 sentences
Processed 200 sentences
Processed 300 sentences
Processed 400 sentences
Processed 500 sentences
Processed 600 sentences
Processed 700 sentences
Processed 800 sentences
Processed 900 sentences
Processed 1000 sentences
Processed 1100 sentences
Processed 1200 sentences
Processed 1300 sentences
Processed 1400 sentences
Processed 1500 sentences
Processed 1600 sentences
Processed 1700 sentences
Processed 1800 sentences
Processed 1900 sentences
Processed 2000 sentences
Processed 2100 sentences
Processed 2200 sentences
Processed 2300 sentences
Processed 2400 sentences
Processed 2500 sentences
Processed 2600 sentences
Processed 2700 sentences
Processed 2800 sentences
Processed 2900 sentences
Processed 3000 sentences
Processed 3100 sentences
Processed 3200 sentences
Processed 3300 sentences
Processed 3400 sentences
Processed 3500 sentences
Processed 3600 sentences
Processed 3700 sentences
Processed 3800 sentences
Processed 3900 sentences
Processed 4000 sentences


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Processed 100 sentences
Processed 200 sentences
Processed 300 sentences
Processed 400 sentences
Processed 500 sentences
Processed 600 sentences
Processed 700 sentences
Processed 800 sentences
Processed 900 sentences
Processed 1000 sentences
Processed 1100 sentences
Processed 1200 sentences
Processed 1300 sentences
Processed 1400 sentences
Processed 1500 sentences
Processed 1600 sentences
Processed 1700 sentences
Processed 1800 sentences
Processed 1900 sentences
Processed 2000 sentences
Processed 2100 sentences
Processed 2200 sentences
Processed 2300 sentences
Processed 2400 sentences
Processed 2500 sentences
Processed 2600 sentences
Processed 2700 sentences
Processed 2800 sentences
Processed 2900 sentences
Processed 3000 sentences
Processed 3100 sentences
Processed 3200 sentences
Processed 3300 sentences
Processed 3400 sentences
Processed 3500 sentences
Processed 3600 sentences
Processed 3700 sentences
Processed 3800 sentences
Processed 3900 sentences
Processed 4000 sentences


In [None]:
print("Reborn Sentences for Masked Positive Sentences:")
for i, reborn_sentence in enumerate(reborn_pos_sentences):
    print(f"Reborn Sentence {i + 1}: {reborn_sentence}")

Reborn Sentences for Masked Positive Sentences:
Reborn Sentence 1: > mask> mask> mask>
Reborn Sentence 2: mask> mask>, mask>
Reborn Sentence 3: it mask> mask> mask>
Reborn Sentence 4: mask> mask> mask> mas
Reborn Sentence 5: mask> mask> i am mask>
Reborn Sentence 6: mask> mask> mask> mask
Reborn Sentence 7: v hard to listen to but so important.
Reborn Sentence 8: mask> mask> mask> mas
Reborn Sentence 9: > mask>! mask>!!
Reborn Sentence 10: ''''' so limited.
Reborn Sentence 11: mask>. way too mask>.
Reborn Sentence 12: ........
Reborn Sentence 13: >> mask> mask> mask>
Reborn Sentence 14: mask> mask>, mask>
Reborn Sentence 15: mask> mask> mask> mas
Reborn Sentence 16: mask> mask> mask> mas
Reborn Sentence 17: >> mask> a spider mask> across my
Reborn Sentence 18: mask> mask> mask> mas
Reborn Sentence 19: , mask>.. mask>
Reborn Sentence 20: > mask> mask> mask>
Reborn Sentence 21: ! mask> mask>!!
Reborn Sentence 22: . i mask> mask>. i
Reborn Sentence 23: mask> mask> mask>? or
Reborn Sentenc

In [None]:
print("\nReborn Sentences for Masked Negative Sentences:")
for i, reborn_sentence in enumerate(reborn_neg_sentences):
    print(f"Reborn Sentence {i + 1}: {reborn_sentence}")


Reborn Sentences for Masked Negative Sentences:
Reborn Sentence 1: > mask> mask> mask>
Reborn Sentence 2: ,, good morning,! mask>
Reborn Sentence 3: mask> mask>? mask>
Reborn Sentence 4: > mask>> mask> mask
Reborn Sentence 5: i am mask> that the mobile phone reseller in the mall
Reborn Sentence 6: mask> mask>? can not. be
Reborn Sentence 7: v hard to listen to but so important.
Reborn Sentence 8: > mask> than most mask> i
Reborn Sentence 9: !!!>> / summer mas
Reborn Sentence 10: ' so mask>. what do they have to mas
Reborn Sentence 11: . mask> too mask>.
Reborn Sentence 12: .........
Reborn Sentence 13: > mask> mask> mask>
Reborn Sentence 14: , mask> you will be mask> in a
Reborn Sentence 15: mask> mask> mask> mask
Reborn Sentence 16: > mask> mask> mask> to
Reborn Sentence 17: i i i mask> mask
Reborn Sentence 18: mask> mask>.
Reborn Sentence 19: ........
Reborn Sentence 20: mask> mask> mask> mas
Reborn Sentence 21: ! i ’ m a big fan of so ’ ve
Reborn Sentence 22: . i mask> to play your

In [None]:
dfnew["rebornPosSentence"] = reborn_pos_sentences
dfnew["rebornNegSentence"] = reborn_neg_sentences
dfnew

Unnamed: 0,text,maskedPosSentence,maskedNegSentence,rebornPosSentence,rebornNegSentence
0,why do small shouldered tiny guys wear huge t ...,why do <mask> <mask> <mask> guys <mask> <mask>...,why do small shouldered tiny <mask> <mask> <ma...,> mask> mask> mask>,> mask> mask> mask>
1,"good morning , please go and vote ! it only t...","<mask> <mask> , <mask> <mask> and <mask> ! it ...","good morning , please go and vote ! it only ta...","mask> mask>, mask>",",, good morning,! mask>"
2,is it even christmas if there isn ’ t a fight ...,is it <mask> <mask> if there isn ’ t a <mask> ...,is it even christmas if there isn ’ t a fight ...,it mask> mask> mask>,mask> mask>? mask>
3,helping mum with her maths work for the course...,<mask> <mask> with her <mask> <mask> for the <...,helping <mask> with her <mask> work for the co...,mask> mask> mask> mas,> mask>> mask> mask
4,dear customer i am sorry that the mobile phon...,<mask> <mask> i am <mask> that the <mask> <mas...,dear customer i am <mask> that the mobile phon...,mask> mask> i am mask>,i am mask> that the mobile phone reseller in t...
...,...,...,...,...,...
3998,imagine that it ' s going to cost me pound to...,<mask> that it ' s <mask> to <mask> me <mask> ...,imagine that it ' s going to <mask> me <mask> ...,' s mask> to mask> me,> me mask> to mask> me mask
3999,people really out here tryna argue you do not ...,<mask> <mask> out here <mask> argue you do not...,people really out here tryna <mask> you do not...,> mask> mask> mask>,> mask> mask> mask>
4000,"and their relentless running game , on the bri...","and their <mask> <mask> <mask> , on the <mask>...","and their <mask> running <mask> , on the brink...","mask> mask> mask>, on",mask> mask> mask> mask
4001,why is it that whether i get out of bed at or...,why is it that <mask> i <mask> out of <mask> a...,why is it that whether i <mask> out of bed at ...,", i mask> out of mask> at",' s always on the mask> mask> when


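# Sentence Representation
"We embed the original sentence $x$ and its corresponding reborn texts $A$ and $B$     
into $d$-dimensional embeddings $H_t \in \mathbb{R}^d$     
via pre-trained BERT-base:     
$H_x, H_A, H_B = \mathrm{BERT}(x), \mathrm{BERT}(A), \mathrm{BERT}(B)$."

In this notebook, BERT is instantiated with the `princeton-nlp/sup-simcse-bert-base-uncased` checkpoint, and each sentence vector is the mean of the last hidden states over its tokens.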
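
The quoted passage does not say how token vectors are collapsed into a single sentence vector; the notebook's `embed_sentences` averages them. The following is a minimal pure-Python sketch of that mean pooling on toy vectors. The helper `mean_pool` and its optional `attention_mask` argument are hypothetical names used only for illustration. The mask matters only if sentences are padded in a batch, which the notebook avoids by encoding one sentence at a time.

```python
def mean_pool(token_vectors, attention_mask=None):
    """Average token vectors into one sentence vector.

    token_vectors: list of equal-length lists, one per token.
    attention_mask: optional list of 0/1 flags; positions marked 0
    (padding) are skipped. Hypothetical helper for illustration.
    """
    if attention_mask is None:
        attention_mask = [1] * len(token_vectors)
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("no unmasked tokens to pool")
    dim = len(kept[0])
    return [sum(vec[j] for vec in kept) / len(kept) for j in range(dim)]


# Toy 2-dimensional "hidden states": three real tokens and one pad token
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
mask = [1, 1, 1, 0]

print(mean_pool(tokens[:3]))    # unpadded sentence
print(mean_pool(tokens, mask))  # padded sentence, pad position ignored
print(mean_pool(tokens))        # pad position wrongly included in the mean
```

The first two calls give the same vector, while the third is pulled toward zero by the pad token. This is why a plain `.mean(dim=1)` is only safe when no padding is present.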

In [None]:
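import torch
from transformers import AutoModel, AutoTokenizer

def embed_sentences(sentences):
    """Return one mean-pooled embedding per sentence, stacked to shape (N, 1, hidden)."""
    model_name = "princeton-nlp/sup-simcse-bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    embeddings = []
    for i, sentence in enumerate(sentences, start=1):
        # One sentence per call, so padding=True adds no pad tokens
        inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            # Average the last hidden states over the token axis -> shape (1, hidden)
            outputs = model(**inputs).last_hidden_state.mean(dim=1)
        embeddings.append(outputs)
        if i % 100 == 0:
            print(f'Processed {i} sentences')

    return torch.stack(embeddings)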

In [None]:
x_embeddings = embed_sentences(text_data)

A_embeddings = embed_sentences(reborn_pos_sentences)

B_embeddings = embed_sentences(reborn_neg_sentences)

Processed 100 sentences
Processed 200 sentences
Processed 300 sentences
Processed 400 sentences
Processed 500 sentences
Processed 600 sentences
Processed 700 sentences
Processed 800 sentences
Processed 900 sentences
Processed 1000 sentences
Processed 1100 sentences
Processed 1200 sentences
Processed 1300 sentences
Processed 1400 sentences
Processed 1500 sentences
Processed 1600 sentences
Processed 1700 sentences
Processed 1800 sentences
Processed 1900 sentences
Processed 2000 sentences
Processed 2100 sentences
Processed 2200 sentences
Processed 2300 sentences
Processed 2400 sentences
Processed 2500 sentences
Processed 2600 sentences
Processed 2700 sentences
Processed 2800 sentences
Processed 2900 sentences
Processed 3000 sentences
Processed 3100 sentences
Processed 3200 sentences
Processed 3300 sentences
Processed 3400 sentences
Processed 3500 sentences
Processed 3600 sentences
Processed 3700 sentences
Processed 3800 sentences
Processed 3900 sentences
Processed 4000 sentences
Processed

In [None]:
for i, sentence in enumerate(text_data):
    print(f"Embedding for Original Lowercase Sentence {i + 1} ({sentence}):")
    print(x_embeddings[i])
    print("- - - - - - - - - -")

Streaming output truncated to the last 5000 lines.
          3.4725e-01, -2.4786e-01, -2.7578e-01, -2.6902e-01, -6.9727e-02,
         -1.0435e-01,  2.9424e-01,  2.7911e-01, -6.7370e-02,  1.4964e-02,
         -8.9923e-02, -1.1077e-01, -6.1527e-02, -3.2078e-02,  1.0491e-01,
         -2.3540e-01, -6.5371e-02,  1.7637e-02,  7.8999e-02,  2.2984e-01,
         -2.9960e-01, -2.3310e-01,  1.8909e-01,  2.2909e-01, -6.2361e-01,
         -2.5710e-01,  4.4839e-02,  9.2926e-02, -1.2252e-01,  7.8538e-02,
         -6.5198e-02,  3.0073e-02, -1.7900e-01]])
- - - - - - - - - -
Embedding for Original Lowercase Sentence 3972 (spent the morning nursing a very hungover daughter , after she came home at some ungodly hour .  oh how times have changed ):
tensor([[ 5.3628e-02,  5.9410e-02,  5.4324e-01, -5.0224e-01,  2.3215e-02,
         -1.9734e-01,  2.9660e-01,  6.2760e-01,  2.0810e-01,  2.5437e-01,
          3.5356e-01, -2.3374e-01, -2.4814e-01,  6.0689e-01, -3.9613e-01,
          2.0445e-01,  5.

In [None]:
for i, sentence in enumerate(reborn_pos_sentences):
    print(f"Embedding for Reborn Positive Sentence {i + 1} ({sentence}):")
    print(A_embeddings[i])
    print("- - - - - - - - - -")

Streaming output truncated to the last 5000 lines.
          4.6970e-01, -2.1066e-01,  2.6024e-01,  3.7450e-03,  4.8736e-01,
         -7.1611e-01,  1.1222e-02,  2.1218e-01, -3.1238e-01, -6.0678e-01,
         -3.0788e-01, -2.9487e-01,  2.8677e-01,  3.6373e-01, -2.7598e-02,
          1.5538e-01, -1.6646e-01,  1.3638e-01, -4.6714e-01,  1.3645e-01,
         -3.1833e-01, -1.4581e-01,  1.3949e-01,  4.6062e-01, -4.7526e-01,
         -1.6157e-02, -4.1444e-01, -3.3451e-01, -7.1627e-01, -6.1077e-02,
          3.8786e-01,  4.0358e-01, -5.2673e-01]])
- - - - - - - - - -
Embedding for Reborn Positive Sentence 3972 (mask> mask> a very mask>):
tensor([[ 1.1604e-01, -5.2917e-02,  4.4367e-01, -8.6581e-02,  2.3980e-01,
         -8.1914e-02,  8.0531e-01,  2.0096e-02,  2.0539e-01, -5.4780e-01,
         -3.3426e-01,  4.4581e-01, -4.2311e-01,  6.2868e-02,  2.1672e-01,
          3.0373e-02,  2.3610e-01,  5.0545e-01,  7.5827e-02,  2.0467e-01,
          3.2977e-01,  1.1926e-01,  3.9128e-01, -1.81

In [None]:
for i, sentence in enumerate(reborn_neg_sentences):
    print(f"Embedding for Reborn Negative Sentence {i + 1} ({sentence}):")
    print(B_embeddings[i])
    print("- - - - - - - - - -")

Streaming output truncated to the last 5000 lines.
          5.7298e-01, -2.6685e-01,  3.4633e-01,  3.5437e-03,  7.5080e-01,
         -6.5385e-01, -1.6136e-02,  3.2366e-01, -2.6881e-01, -6.2420e-01,
         -2.5544e-01, -3.9253e-01,  5.0797e-02,  3.7867e-01, -3.6885e-02,
          1.5215e-01, -1.1498e-01, -5.9871e-02, -4.8185e-01,  2.8071e-01,
         -3.4583e-01, -3.0175e-01,  2.2011e-01,  6.3074e-01, -3.7255e-01,
          2.0399e-02, -5.1510e-01, -5.6142e-01, -5.2280e-01,  2.2951e-02,
          4.6002e-01,  4.8866e-01, -4.4088e-01]])
- - - - - - - - - -
Embedding for Reborn Negative Sentence 3972 (>> mask>. mask>.):
tensor([[ 1.5737e-01,  2.2824e-02,  6.5238e-01, -2.0199e-02,  1.8171e-01,
         -2.2847e-01,  8.4549e-01, -1.1466e-02,  1.2933e-01, -8.1784e-01,
         -4.7711e-01,  3.9951e-01, -4.9786e-01,  8.7177e-02,  2.4508e-01,
          3.3717e-01,  2.6670e-01,  5.9214e-01,  1.2850e-02,  9.7946e-03,
          2.9470e-01, -4.0166e-02,  4.2546e-01, -2.9668e-01, 

In [None]:
dfnew["xEmbedding"] = x_embeddings.tolist()
dfnew["AEmbedding"] = A_embeddings.tolist()
dfnew["BEmbedding"] = B_embeddings.tolist()
dfnew

Unnamed: 0,text,maskedPosSentence,maskedNegSentence,rebornPosSentence,rebornNegSentence,xEmbedding,AEmbedding,BEmbedding
0,why do small shouldered tiny guys wear huge t ...,why do <mask> <mask> <mask> guys <mask> <mask>...,why do small shouldered tiny <mask> <mask> <ma...,> mask> mask> mask>,> mask> mask> mask>,"[[0.7081335783004761, 0.1210532933473587, -0.5...","[[0.21780623495578766, 0.013179474510252476, 0...","[[0.21780623495578766, 0.013179474510252476, 0..."
1,"good morning , please go and vote ! it only t...","<mask> <mask> , <mask> <mask> and <mask> ! it ...","good morning , please go and vote ! it only ta...","mask> mask>, mask>",",, good morning,! mask>","[[-0.14586037397384644, -0.6124593019485474, 1...","[[0.2296123504638672, 0.07393606007099152, 0.5...","[[0.07062423229217529, 0.4247998595237732, 0.5..."
2,is it even christmas if there isn ’ t a fight ...,is it <mask> <mask> if there isn ’ t a <mask> ...,is it even christmas if there isn ’ t a fight ...,it mask> mask> mask>,mask> mask>? mask>,"[[0.08108999580144882, -0.23094923794269562, 0...","[[0.26175302267074585, -0.018684716895222664, ...","[[0.2684263288974762, -0.13360415399074554, 0...."
3,helping mum with her maths work for the course...,<mask> <mask> with her <mask> <mask> for the <...,helping <mask> with her <mask> work for the co...,mask> mask> mask> mas,> mask>> mask> mask,"[[0.20817291736602783, 0.24460144340991974, 0....","[[-0.029310442507267, -0.012421253137290478, 0...","[[0.2157040238380432, 0.0886165201663971, 0.56..."
4,dear customer i am sorry that the mobile phon...,<mask> <mask> i am <mask> that the <mask> <mas...,dear customer i am <mask> that the mobile phon...,mask> mask> i am mask>,i am mask> that the mobile phone reseller in t...,"[[0.4123021364212036, 0.09793900698423386, 0.5...","[[0.24663789570331573, -0.0609426274895668, 0....","[[0.6709597110748291, 0.09430757910013199, 0.7..."
...,...,...,...,...,...,...,...,...
3998,imagine that it ' s going to cost me pound to...,<mask> that it ' s <mask> to <mask> me <mask> ...,imagine that it ' s going to <mask> me <mask> ...,' s mask> to mask> me,> me mask> to mask> me mask,"[[0.37059521675109863, 0.1753413826227188, 0.4...","[[0.3873536288738251, 0.029146766290068626, 0....","[[0.3250868618488312, 0.02695026807487011, 0.4..."
3999,people really out here tryna argue you do not ...,<mask> <mask> out here <mask> argue you do not...,people really out here tryna <mask> you do not...,> mask> mask> mask>,> mask> mask> mask>,"[[0.6794640421867371, 0.38444259762763977, -0....","[[0.21780623495578766, 0.013179474510252476, 0...","[[0.21780623495578766, 0.013179474510252476, 0..."
4000,"and their relentless running game , on the bri...","and their <mask> <mask> <mask> , on the <mask>...","and their <mask> running <mask> , on the brink...","mask> mask> mask>, on",mask> mask> mask> mask,"[[-0.20612402260303497, 0.10531388223171234, -...","[[0.21137547492980957, 0.10225281864404678, 0....","[[0.12185078114271164, 0.010523257777094841, 0..."
4001,why is it that whether i get out of bed at or...,why is it that <mask> i <mask> out of <mask> a...,why is it that whether i <mask> out of bed at ...,", i mask> out of mask> at",' s always on the mask> mask> when,"[[0.6105952858924866, 0.4410528838634491, 0.49...","[[0.027566026896238327, 0.044235702604055405, ...","[[0.05024776980280876, -0.20555396378040314, 0..."


# Sarcastic Utterances Detection
## 1)
"We utilize cosine similarity to measure the similarity between the representation of the original sentence $H_x$     
and those of the generated texts $H_A$/$H_B$.

Then we use the following equation to calculate a difference score for each sentence:     
$\mathrm{diff} = \big(\mathrm{sim}(H_x, H_A) < \mathit{threshold}\big) \lor \big(\mathrm{sim}(H_x, H_B) < \mathit{threshold}\big)$     
where $\lor$ is the logical "or" operator."
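
As a worked illustration of the rule quoted above, here is a minimal sketch that computes cosine similarity by hand and applies the two-sided threshold test. The helpers `cosine_similarity_1d` and `difference_flag` are hypothetical names, the vectors are toy values, and 0.755 is simply the threshold used later in this notebook.

```python
import math


def cosine_similarity_1d(u, v):
    """Cosine similarity between two equal-length vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def difference_flag(h_x, h_a, h_b, threshold):
    """True if either reborn text is less similar to the original than the threshold."""
    sim_a = cosine_similarity_1d(h_x, h_a)
    sim_b = cosine_similarity_1d(h_x, h_b)
    return sim_a < threshold or sim_b < threshold


h_x = [3.0, 4.0]      # toy embedding of the original sentence
h_close = [4.0, 3.0]  # similarity 24 / 25 = 0.96
h_far = [-4.0, 3.0]   # orthogonal to h_x, similarity 0.0

print(difference_flag(h_x, h_close, h_close, 0.755))  # both above threshold
print(difference_flag(h_x, h_close, h_far, 0.755))    # one below threshold
```

Because the two comparisons are joined with "or", a single dissimilar reborn text is enough to flag the sentence.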

In [None]:
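# Assumed to be scikit-learn's implementation (the 1x1 array outputs below match it)
from sklearn.metrics.pairwise import cosine_similarity

def calculate_difference_scores(x_embeddings, A_embeddings, B_embeddings, threshold):
    """Flag a sentence when either reborn text falls below the similarity threshold."""
    diff_scores = []
    triples = zip(x_embeddings, A_embeddings, B_embeddings)
    for i, (x_emb, A_emb, B_emb) in enumerate(triples, start=1):
        # Each embedding has shape (1, hidden), so each similarity is a 1x1 array
        sim_Hx_HA = cosine_similarity(x_emb, A_emb)
        sim_Hx_HB = cosine_similarity(x_emb, B_emb)

        # Python `or` works here only because both operands hold a single element
        diff = (sim_Hx_HA < threshold) or (sim_Hx_HB < threshold)
        diff_scores.append(diff)
        if i % 100 == 0:
            print(f'Processed {i} embeddings')

    return diff_scores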

In [None]:
threshold = 0.755

diff_scores = calculate_difference_scores(x_embeddings, A_embeddings, B_embeddings, threshold)
diff_scores

Processed 100 embeddings
Processed 200 embeddings
Processed 300 embeddings
Processed 400 embeddings
Processed 500 embeddings
Processed 600 embeddings
Processed 700 embeddings
Processed 800 embeddings
Processed 900 embeddings
Processed 1000 embeddings
Processed 1100 embeddings
Processed 1200 embeddings
Processed 1300 embeddings
Processed 1400 embeddings
Processed 1500 embeddings
Processed 1600 embeddings
Processed 1700 embeddings
Processed 1800 embeddings
Processed 1900 embeddings
Processed 2000 embeddings
Processed 2100 embeddings
Processed 2200 embeddings
Processed 2300 embeddings
Processed 2400 embeddings
Processed 2500 embeddings
Processed 2600 embeddings
Processed 2700 embeddings
Processed 2800 embeddings
Processed 2900 embeddings
Processed 3000 embeddings
Processed 3100 embeddings
Processed 3200 embeddings
Processed 3300 embeddings
Processed 3400 embeddings
Processed 3500 embeddings
Processed 3600 embeddings
Processed 3700 embeddings
Processed 3800 embeddings
Processed 3900 embedd

[array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ Tr

## 2)
"Since the sarcastic utterances are influenced more than normal texts during the masking and generation procedure,     
the difference score of a sarcastic text should be greater than that of a non-sarcastic one.

If we have a threshold value which separates sarcastic texts and normal texts,     
we can yield the prediction $y$ by:     
$y = \mathbb{I}(\mathrm{diff})$."
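
The indicator rule depends heavily on where the threshold sits. A minimal sketch, using made-up similarity pairs rather than values from this notebook, shows how the share of sentences predicted as sarcastic grows with the threshold. `predict_labels` and `flag_rate` are hypothetical helper names.

```python
def predict_labels(sim_pairs, threshold):
    """Apply y = I(diff): 1 if either similarity is below the threshold, else 0."""
    return [int(sim_a < threshold or sim_b < threshold) for sim_a, sim_b in sim_pairs]


def flag_rate(labels):
    """Fraction of sentences predicted as sarcastic."""
    return sum(labels) / len(labels)


# Made-up (sim(H_x, H_A), sim(H_x, H_B)) pairs, for illustration only
toy_pairs = [(0.90, 0.85), (0.70, 0.80), (0.40, 0.30), (0.78, 0.76)]

for threshold in (0.5, 0.755, 0.95):
    labels = predict_labels(toy_pairs, threshold)
    print(threshold, labels, flag_rate(labels))
```

A threshold set above most of the observed similarities flags nearly every sentence, so it is worth inspecting the distribution of similarity scores before fixing a value.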

In [None]:
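# Each diff is a 1x1 boolean array; .item() extracts the scalar before casting,
# which avoids NumPy's deprecated implicit array-to-int conversion
predicted_labels = [int(np.asarray(diff).item()) for diff in diff_scores]
print(predicted_labels)
print(sum(predicted_labels))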

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 


In [None]:
labels = ["sarc" if diff else "notsarc" for diff in diff_scores]
print(labels)

['sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'notsarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sar

In [None]:
dffinal = pd.DataFrame({"text": text_data, "class": label_data, "prediction": labels})
dffinal

Unnamed: 0,text,class,prediction
0,why do small shouldered tiny guys wear huge t ...,notsarc,sarc
1,"good morning , please go and vote ! it only t...",notsarc,sarc
2,is it even christmas if there isn ’ t a fight ...,sarc,sarc
3,helping mum with her maths work for the course...,notsarc,sarc
4,dear customer i am sorry that the mobile phon...,notsarc,sarc
...,...,...,...
3998,imagine that it ' s going to cost me pound to...,notsarc,sarc
3999,people really out here tryna argue you do not ...,notsarc,sarc
4000,"and their relentless running game , on the bri...",notsarc,sarc
4001,why is it that whether i get out of bed at or...,notsarc,sarc


# Main Experiment Results

In [None]:
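from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score

# Encode the gold labels the same way as the predictions: 1 = "sarc", 0 = "notsarc"
true_labels = [1 if label == "sarc" else 0 for label in df["class"]]
print(true_labels)
print(predicted_labels)

# precision_score and f1_score treat label 1 ("sarc") as the positive class by default
accuracy = accuracy_score(true_labels, predicted_labels)
precision = precision_score(true_labels, predicted_labels)
f1 = f1_score(true_labels, predicted_labels)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("F1 Score:", f1)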

[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

In [None]:
conf_matrix = confusion_matrix(true_labels, predicted_labels)

print("Confusion Matrix:")
print(conf_matrix)

Confusion Matrix:
[[  21 3276]
 [   6  700]]
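
The metrics can also be recovered by hand from the confusion matrix printed above. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels, in the order 0 = "notsarc", 1 = "sarc"), so the matrix reads `[[TN, FP], [FN, TP]]`. `binary_metrics` is a hypothetical helper, not a library function.

```python
def binary_metrics(conf_matrix):
    """Compute accuracy, precision, recall and F1 from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = conf_matrix
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Values copied from the confusion matrix printed above
metrics = binary_metrics([[21, 3276], [6, 700]])
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# accuracy: 0.180
# precision: 0.176
# recall: 0.992
# f1: 0.299
```

With 3976 of the 4003 sentences predicted as "sarc", recall is close to 1 while precision and accuracy stay low. This is the pattern expected when the threshold flags almost every sentence.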
