# Introduction

Sarcasm is a sophisticated linguistic phenomenon that can badly confuse existing sentiment classification systems.
Sarcasm detection, the task of predicting whether a given text contains sarcasm, has therefore received considerable research attention.

Many methods have recently been proposed for sarcasm detection; they fall broadly into two categories.
The first is text-only: it concentrates on the utterance itself, for example by exploiting incongruous expressions to detect sarcastic text.
The second relies on extra information, exploiting external knowledge such as user history and common-sense knowledge to assist detection.

We propose an unsupervised sarcasm detection method.

First, we leverage external sentiment knowledge to mask prominent tokens. The masked texts are then fed into a pre-trained generation model, which follows the remaining logical structure to generate new texts.
There is a good chance that these reborn texts are no longer sarcastic or make more sense.

Second, after computing the similarity score between each generated sentence and the original one, we extract features from these scores to decide whether a sentence is sarcastic.

Finally, we construct several unsupervised baselines and conduct experiments on the IAC-V2 dataset.

# Imports and Reading Data

In [1]:
!pip install senticnet

Collecting senticnet
  Downloading senticnet-1.6-py3-none-any.whl.metadata (2.6 kB)
Downloading senticnet-1.6-py3-none-any.whl (51.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51.9/51.9 MB 26.7 MB/s eta 0:00:00
Installing collected packages: senticnet
Successfully installed senticnet-1.6


In [2]:
import numpy as np
import pandas as pd

from senticnet.senticnet import SenticNet

import nltk
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from transformers import AutoTokenizer, AutoModel
from transformers import AutoModelForCausalLM
import torch

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import accuracy_score, precision_score, f1_score
from sklearn.metrics import confusion_matrix

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
df_sarcasm = pd.read_csv("/content/drive/My Drive/AlifResearch/sarcasm_v1/sarcasm_v1_sarc.csv")
df_notsarcasm = pd.read_csv("/content/drive/My Drive/AlifResearch/sarcasm_v1/sarcasm_v1_notsarc.csv")

In [5]:
df_sarcasm

Unnamed: 0,class,text
0,sarc,"Actually, they didn't. The whole tragedy was c..."
1,sarc,What if a 13 year old girl comes up to you and...
2,sarc,"In my lifetime, we've made huge strides, but t..."
3,sarc,"HOLY SH*T, marc. You're doing EXACTLY WHAT THE..."
4,sarc,"But on the other hand, Genesis isn't a scienti..."
...,...,...
986,sarc,and
987,sarc,"Ha, that is just an idiotic perspective. We'd ..."
988,sarc,So you are saying that despite the majority of...
989,sarc,"depends on your definition of ""human being."""


In [6]:
df_notsarcasm

Unnamed: 0,class,text
0,notsarc,"This is a pretty touchy issue, and I agree wit..."
1,notsarc,"In a way, taking rights away is an American va..."
2,notsarc,A perfect example of why Christian fundamental...
3,notsarc,"I know, Chloe's misuse of the word strikes again."
4,notsarc,No. We don't agree. For one thing your faith i...
...,...,...
992,notsarc,"Man, these guys can't even get into the scienc..."
993,notsarc,What do you mean by this? Could we not have th...
994,notsarc,And the answer is: we don't know. Maybe it cam...
995,notsarc,And what would make them separate species? How...


In [7]:
# Concatenate vertically
df = pd.concat([df_sarcasm, df_notsarcasm], ignore_index=True)
df

Unnamed: 0,class,text
0,sarc,"Actually, they didn't. The whole tragedy was c..."
1,sarc,What if a 13 year old girl comes up to you and...
2,sarc,"In my lifetime, we've made huge strides, but t..."
3,sarc,"HOLY SH*T, marc. You're doing EXACTLY WHAT THE..."
4,sarc,"But on the other hand, Genesis isn't a scienti..."
...,...,...
1983,notsarc,"Man, these guys can't even get into the scienc..."
1984,notsarc,What do you mean by this? Could we not have th...
1985,notsarc,And the answer is: we don't know. Maybe it cam...
1986,notsarc,And what would make them separate species? How...


In [8]:
# df= df.drop('id', axis= 1)
# df

# Understanding Data

In [9]:
df.dtypes

Unnamed: 0,0
class,object
text,object


In [10]:
df.columns

Index(['class', 'text'], dtype='object')

In [11]:
text_data_original = list(df['text'])
text_data = [x.lower() for x in text_data_original]
print(*text_data, sep = "\n")

actually, they didn't. the whole tragedy was caused by gun control. if even one student was packing when that occured, 33 lives could have been saved. but no, more victims of botched laws and corrupt politicians.
what if a 13 year old girl comes up to you and asks for sex, and you agree? are you forcing yourself on to her?
in my lifetime, we've made huge strides, but there's a lot more to learn.
holy sh*t, marc. you're doing exactly what the review says you people do. you're making the claim without giving any explanation. "omg teh sceinces is athiestic becuz they just aer!!"   thanks for making it extremely clear. refuting a point by demonstrating it isn't very effective.   seriously, how disconnected from reality does one have to be in order to do something like this?
but on the other hand, genesis isn't a scientific document. no wonder. why would a bronze age nomad from the levant be aware of at least 3 tailless species of primates, or even the tailed primates from africa, south ame

In [12]:
label_data = list(df['class'])
print(*label_data, sep = "\n")

sarc
sarc
sarc
sarc
sarc
...


# Overview

The proposed framework contains three main components:

1) Sentence masking and generation.
This step first identifies the main components of each sentence, masks them so that the change has a strong effect on the original sentence, and then generates new text from the masked input;

2) Sentence representation.
This step computes dense vector representations of sentences;

3) Sarcastic utterance detection.
This step leverages the similarity scores between the original and regenerated sentences to decide whether an utterance is sarcastic.
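
As a rough illustration of the similarity score used in component 3, the sketch below computes the cosine similarity between two vectors in plain Python. The function name `cosine` and the toy vectors are assumptions made for this example only; in the notebook the vectors would be the sentence representations from component 2, and scikit-learn's `cosine_similarity` (imported above) is the natural tool for that.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors standing in for sentence embeddings.
original = [1.0, 0.0, 1.0]
reborn = [1.0, 1.0, 0.0]
score = cosine(original, reborn)
print(round(score, 3))
```

A score close to 1 means the regenerated sentence stayed close to the original; how such scores are turned into a sarcasm decision is left to the detection step.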

# Sentences Mask and Generation
## 1)
"First, we use the sentiment common knowledge retrieved from SenticNet to recognize affective words in the sentence 𝑥,     
and split those words into two sets according to its sentiment polarities:    
PW = {pw1, pw2, ..., pwh} and    
NW = {nw1, nw2, ..., nwk},     
h + k <= n."
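
The split into PW and NW can be sketched without SenticNet by using a tiny hand-made lexicon. Everything here (`TOY_LEXICON`, `split_by_polarity`, the example tokens) is hypothetical and only illustrates the set construction; the real polarity labels come from SenticNet.

```python
# Hypothetical mini-lexicon standing in for SenticNet polarity labels.
TOY_LEXICON = {
    "trust": "positive",
    "great": "positive",
    "tragedy": "negative",
    "corrupt": "negative",
}

def split_by_polarity(tokens, lexicon):
    """Return (PW, NW): the tokens the lexicon labels positive and negative."""
    pw, nw = set(), set()
    for tok in tokens:
        label = lexicon.get(tok.lower())  # None means no polarity is known
        if label == "positive":
            pw.add(tok.lower())
        elif label == "negative":
            nw.add(tok.lower())
    return pw, nw

tokens = ["the", "tragedy", "was", "caused", "by", "corrupt", "politicians", "i", "trust"]
pw, nw = split_by_polarity(tokens, TOY_LEXICON)

# Words without a polarity land in neither set, so h + k <= n holds.
assert len(pw) + len(nw) <= len(tokens)
print(pw, nw)
```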

In [13]:
def tokenize_sentence(sentence):
    tokens = word_tokenize(sentence)

    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens

In [14]:
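# Load the SenticNet lexicon once; re-creating it for every word is slow.
sn = SenticNet()

def get_sentiment_polarity_from_senticnet(word):
    """Return SenticNet's polarity label for `word`, or "neutral" if it is not in the lexicon."""
    try:
        return sn.polarity_label(word.lower())
    except KeyError:
        # The word is missing from SenticNet.
        return "neutral"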

In [15]:
def analyze_sentiment(sentences):
    positive_words = []
    negative_words = []

    for sentence in sentences:
        words = tokenize_sentence(sentence)

        PW = set()
        NW = set()

        for word in words:
            sentiment_polarity = get_sentiment_polarity_from_senticnet(word)
            if sentiment_polarity == "positive":
                PW.add(word.lower())
            elif sentiment_polarity == "negative":
                NW.add(word.lower())

        positive_words.append(PW)
        negative_words.append(NW)

    return positive_words, negative_words

In [16]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [17]:
positive_words, negative_words = analyze_sentiment(text_data)

for i, sentence in enumerate(text_data):
    print(f"Sentence: {sentence}")
    print(f"Positive Words: {positive_words[i]}")
    print(f"Negative Words: {negative_words[i]}")
    print("- - - - - - - - - -")

Streaming output truncated to the last 5000 lines.
Sentence: yes. it's called global dimming, and yes, i trust mainstream scientific organizations.   if you knew anything about science, scientists usuallyspeak in terms of coulds, possibliites, and likely, because science isn't 100% and scientific facts aren't absolute. they are largely evidence-supported conjectures. you can't prove anything 100%. they are giving reasonable estimates. that's how climatology works.
Positive Words: {'conjecture', 'science', 'trust', 'work', 'estimate', 'prove', 'organization'}
Negative Words: {'dimming', 'reasonable'}
- - - - - - - - - -
Sentence: so you yourself could have gone gay but chose to be attracted to women?
Positive Words: {'gay'}
Negative Words: set()
- - - - - - - - - -
Sentence: so where will you be moving to?
Positive Words: {'moving'}
Negative Words: set()
- - - - - - - - - -
Sentence: oh yeah, like that hasn't been tried before /sarcasm it darn sure isn't about crime contro

In [18]:
df["PW"] = positive_words
df["NW"] = negative_words
df

Unnamed: 0,class,text,PW,NW
0,sarc,"Actually, they didn't. The whole tragedy was c...","{control, student}","{botched, corrupt, tragedy, victim}"
1,sarc,What if a 13 year old girl comes up to you and...,"{sex, agree}",{old}
2,sarc,"In my lifetime, we've made huge strides, but t...","{huge, learn}","{stride, lifetime}"
3,sarc,"HOLY SH*T, marc. You're doing EXACTLY WHAT THE...","{thanks, doe, holy, effective, review, reality...",{omg}
4,sarc,"But on the other hand, Genesis isn't a scienti...","{aware, bible, age, document, wonder, bronze, ...",{}
...,...,...,...,...
1983,notsarc,"Man, these guys can't even get into the scienc...",{science},{outright}
1984,notsarc,What do you mean by this? Could we not have th...,"{obtain, decide, involvement, eligible, juveni...","{penalty, conflict, response, abortion, parent..."
1985,notsarc,And the answer is: we don't know. Maybe it cam...,{},{nowhere}
1986,notsarc,And what would make them separate species? How...,"{fetus, specie, great, viable}","{artificially, separate}"


## 2)
"Second, we analyze the sentence to get its syntax information to identify non-stop words     
     𝑆𝑊 = {𝑠𝑤1, 𝑠𝑤2, ..., 𝑠𝑤𝑚, 𝑚 ≤ 𝑛}.     
Intuitively, these words are the main components of sentences. Then we split 𝑆𝑊 into two sets which satisfy :     
     𝑆𝑊1 ∪ 𝑆𝑊2 = 𝑆𝑊 ,     
     |𝑆𝑊1| = |𝑆𝑊2|."
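
One way to satisfy both conditions is to deduplicate the non-stop words first and then cut the list in half, as in the sketch below. The helper name `split_evenly` and the example words are assumptions for illustration. With an odd number of unique words the two halves necessarily differ in size by one.

```python
def split_evenly(words):
    """Deduplicate while preserving order, then cut the list in half."""
    unique = list(dict.fromkeys(words))  # dict keys keep first-seen order
    half = len(unique) // 2
    return unique[:half], unique[half:]

# Hypothetical non-stop words; "old" appears twice on purpose.
sw = ["old", "girl", "comes", "year", "asks", "forcing", "sex", "agree", "old"]
sw1, sw2 = split_evenly(sw)

print(sw1)
print(sw2)
```

Deduplicating before the split keeps the two halves disjoint, so a repeated word cannot end up in both SW1 and SW2.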

In [19]:
def extract_non_stop_words(sentence):
    words = nltk.word_tokenize(sentence)

    stop_words = set(stopwords.words("english"))

    non_stop_words = [word.lower() for word in words if word.lower() not in stop_words and word.isalpha()]

    return non_stop_words

In [20]:
def split_non_stop_words(non_stop_words):
    m = len(non_stop_words)
    m1 = m // 2
    SW1 = set(non_stop_words[:m1])
    SW2 = set(non_stop_words[m1:])
    return SW1, SW2

In [21]:
def analyze_sentences(sentences):
    all_SW1 = []
    all_SW2 = []

    for sentence in sentences:
        non_stop_words = extract_non_stop_words(sentence)
        SW1, SW2 = split_non_stop_words(non_stop_words)
        all_SW1.append(SW1)
        all_SW2.append(SW2)

    return all_SW1, all_SW2

In [22]:
all_SW1, all_SW2 = analyze_sentences(text_data)

for i, sentence in enumerate(text_data):
    print(f"Sentence: {sentence}")
    print(f"SW1: {all_SW1[i]}")
    print(f"SW2: {all_SW2[i]}")
    print("- - - - - - - - - -")

Streaming output truncated to the last 5000 lines.
Sentence: yes. it's called global dimming, and yes, i trust mainstream scientific organizations.   if you knew anything about science, scientists usuallyspeak in terms of coulds, possibliites, and likely, because science isn't 100% and scientific facts aren't absolute. they are largely evidence-supported conjectures. you can't prove anything 100%. they are giving reasonable estimates. that's how climatology works.
SW1: {'mainstream', 'science', 'global', 'trust', 'usuallyspeak', 'scientists', 'terms', 'coulds', 'dimming', 'anything', 'scientific', 'knew', 'organizations', 'called', 'yes'}
SW2: {'science', 'ca', 'estimates', 'climatology', 'likely', 'giving', 'works', 'largely', 'anything', 'scientific', 'possibliites', 'absolute', 'prove', 'reasonable', 'conjectures', 'facts'}
- - - - - - - - - -
Sentence: so you yourself could have gone gay but chose to be attracted to women?
SW1: {'gone', 'could', 'gay'}
SW2: {'women', 

In [23]:
df["SW1"] = all_SW1
df["SW2"] = all_SW2
df

Unnamed: 0,class,text,PW,NW,SW1,SW2
0,sarc,"Actually, they didn't. The whole tragedy was c...","{control, student}","{botched, corrupt, tragedy, victim}","{actually, even, gun, whole, control, tragedy,...","{lives, saved, corrupt, could, laws, politicia..."
1,sarc,What if a 13 year old girl comes up to you and...,"{sex, agree}",{old},"{old, girl, comes, year}","{asks, forcing, sex, agree}"
2,sarc,"In my lifetime, we've made huge strides, but t...","{huge, learn}","{stride, lifetime}","{made, huge, lifetime}","{lot, strides, learn}"
3,sarc,"HOLY SH*T, marc. You're doing EXACTLY WHAT THE...","{thanks, doe, holy, effective, review, reality...",{omg},"{exactly, making, claim, omg, says, teh, scein...","{making, thanks, refuting, aer, like, extremel..."
4,sarc,"But on the other hand, Genesis isn't a scienti...","{aware, bible, age, document, wonder, bronze, ...",{},"{aware, nomad, tailless, genesis, age, documen...","{mentioned, primates, america, exist, even, bi..."
...,...,...,...,...,...,...
1983,notsarc,"Man, these guys can't even get into the scienc...",{science},{outright},"{ca, man, guys, even}","{outright, lying, get, science}"
1984,notsarc,What do you mean by this? Could we not have th...,"{obtain, decide, involvement, eligible, juveni...","{penalty, conflict, response, abortion, parent...","{current, capital, could, responsibility, mean...","{decide, eligible, conflict, parents, abortion..."
1985,notsarc,And the answer is: we don't know. Maybe it cam...,{},{nowhere},"{maybe, answer, know, came}","{created, maybe, nowhere, know}"
1986,notsarc,And what would make them separate species? How...,"{fetus, specie, great, viable}","{artificially, separate}","{dogs, separate, wolves, species, make, would}","{interbreed, even, danes, great, circumstances..."


## 3)
"Here, 𝑃𝑊 ∪ 𝑆𝑊1 and 𝑁𝑊 ∪ 𝑆𝑊2 are used to mask original sentence respectively. So, we will obtain two masked sentences     
𝑥𝑚1 = { [𝑚]1, 𝑥2, ..., [𝑚]𝑛} and     
𝑥𝑚2 = {𝑥1, [𝑚]2, ..., 𝑥𝑛}."
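
The masking step can be sketched with a regular expression that replaces whole words while leaving punctuation in place. The helper `mask_tokens`, its `mask_token` parameter, and the example sentence are hypothetical; this is one possible way to build the two masked sentences, not necessarily the one used in the quoted description.

```python
import re

def mask_tokens(sentence, mask_words, mask_token="<mask>"):
    """Replace every alphabetic word found in mask_words, keeping punctuation."""
    def repl(match):
        word = match.group(0)
        return mask_token if word.lower() in mask_words else word
    return re.sub(r"[A-Za-z]+", repl, sentence)

sentence = "it's called global dimming, and yes, i trust science."

# Two different word sets give the two masked variants of the same sentence.
x_m1 = mask_tokens(sentence, {"global", "trust"})
x_m2 = mask_tokens(sentence, {"dimming", "science"})

print(x_m1)
print(x_m2)
```

Matching on alphabetic runs means a word followed by a comma or period is still masked, which a plain whitespace split would miss.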

In [24]:
def construct_union(sentences, PW, NW, all_SW1, all_SW2):
    union_PW_SW1 = []
    union_NW_SW2 = []

    for i, sentence in enumerate(sentences):
        SW1 = all_SW1[i]
        SW2 = all_SW2[i]

        union_PW_SW1.append(PW[i].union(SW1))
        union_NW_SW2.append(NW[i].union(SW2))

    return union_PW_SW1, union_NW_SW2

In [25]:
union_PW_SW1, union_NW_SW2 = construct_union(text_data, positive_words, negative_words, all_SW1, all_SW2)
print(union_PW_SW1)
print(union_NW_SW2)



In [26]:
df["union_PW_SW1"] = union_PW_SW1
df["union_NW_SW2"] = union_NW_SW2
df

Unnamed: 0,class,text,PW,NW,SW1,SW2,union_PW_SW1,union_NW_SW2
0,sarc,"Actually, they didn't. The whole tragedy was c...","{control, student}","{botched, corrupt, tragedy, victim}","{actually, even, gun, whole, control, tragedy,...","{lives, saved, corrupt, could, laws, politicia...","{actually, even, gun, whole, control, tragedy,...","{lives, tragedy, victim, saved, could, laws, p..."
1,sarc,What if a 13 year old girl comes up to you and...,"{sex, agree}",{old},"{old, girl, comes, year}","{asks, forcing, sex, agree}","{old, girl, comes, sex, year, agree}","{old, asks, sex, forcing, agree}"
2,sarc,"In my lifetime, we've made huge strides, but t...","{huge, learn}","{stride, lifetime}","{made, huge, lifetime}","{lot, strides, learn}","{huge, learn, made, lifetime}","{strides, stride, learn, lot, lifetime}"
3,sarc,"HOLY SH*T, marc. You're doing EXACTLY WHAT THE...","{thanks, doe, holy, effective, review, reality...",{omg},"{exactly, making, claim, omg, says, teh, scein...","{making, thanks, refuting, aer, like, extremel...","{exactly, thanks, claim, omg, says, marc, sh, ...","{thanks, omg, refuting, like, extremely, somet..."
4,sarc,"But on the other hand, Genesis isn't a scienti...","{aware, bible, age, document, wonder, bronze, ...",{},"{aware, nomad, tailless, genesis, age, documen...","{mentioned, primates, america, exist, even, bi...","{aware, document, levant, scientific, would, h...","{mentioned, primates, america, exist, even, bi..."
...,...,...,...,...,...,...,...,...
1983,notsarc,"Man, these guys can't even get into the scienc...",{science},{outright},"{ca, man, guys, even}","{outright, lying, get, science}","{guys, science, even, ca, man}","{outright, lying, get, science}"
1984,notsarc,What do you mean by this? Could we not have th...,"{obtain, decide, involvement, eligible, juveni...","{penalty, conflict, response, abortion, parent...","{current, capital, could, responsibility, mean...","{decide, eligible, conflict, parents, abortion...","{decide, eligible, juvenile, current, capital,...","{obtain, penalty, decide, eligible, conflict, ..."
1985,notsarc,And the answer is: we don't know. Maybe it cam...,{},{nowhere},"{maybe, answer, know, came}","{created, maybe, nowhere, know}","{maybe, answer, know, came}","{created, nowhere, maybe, know}"
1986,notsarc,And what would make them separate species? How...,"{fetus, specie, great, viable}","{artificially, separate}","{dogs, separate, wolves, species, make, would}","{interbreed, even, danes, great, circumstances...","{dogs, species, great, wolves, fetus, separate...","{interbreed, even, danes, great, circumstances..."


In [27]:
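def mask_sentence(sentence, mask_words, max_mask_count = 5):
    """Replace up to `max_mask_count` whitespace-separated tokens found in `mask_words` with "<mask>".

    Tokens are compared verbatim, so a token with attached punctuation (e.g. "dimming,") is not masked.
    """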
    masked_sentence = []

    for word in sentence.split():
        if word in mask_words and max_mask_count > 0:
            masked_sentence.append("<mask>")
            max_mask_count -= 1
        else:
            masked_sentence.append(word)

    return " ".join(masked_sentence)

In [28]:
def construct_masked_sentences(sentences, union_PW_SW1, union_NW_SW2):
    masked_pos_sentences = []
    masked_neg_sentences = []

    for i, sentence in enumerate(sentences):

        masked_pos_sentence = mask_sentence(sentence, union_PW_SW1[i])
        masked_pos_sentences.append(masked_pos_sentence)

        masked_neg_sentence = mask_sentence(sentence, union_NW_SW2[i])
        masked_neg_sentences.append(masked_neg_sentence)

    return masked_pos_sentences, masked_neg_sentences

In [29]:
masked_pos_sentences, masked_neg_sentences = construct_masked_sentences(text_data, union_PW_SW1, union_NW_SW2)

for i, sentence in enumerate(text_data):
    print(f"Original Sentence: {sentence}")
    print(f"Masked Positive Sentence: {masked_pos_sentences[i]}")
    print(f"Masked Negative Sentence: {masked_neg_sentences[i]}")
    print("- - - - - - - - - -")

Streaming output truncated to the last 5000 lines.
Original Sentence: yes. it's called global dimming, and yes, i trust mainstream scientific organizations.   if you knew anything about science, scientists usuallyspeak in terms of coulds, possibliites, and likely, because science isn't 100% and scientific facts aren't absolute. they are largely evidence-supported conjectures. you can't prove anything 100%. they are giving reasonable estimates. that's how climatology works.
Masked Positive Sentence: yes. it's <mask> <mask> dimming, and yes, i <mask> <mask> <mask> organizations. if you knew anything about science, scientists usuallyspeak in terms of coulds, possibliites, and likely, because science isn't 100% and scientific facts aren't absolute. they are largely evidence-supported conjectures. you can't prove anything 100%. they are giving reasonable estimates. that's how climatology works.
Masked Negative Sentence: yes. it's called global dimming, and yes, i trust mainstr

In [30]:
dfnew = pd.DataFrame({"text": text_data_original, "maskedPosSentence": masked_pos_sentences, "maskedNegSentence": masked_neg_sentences})
dfnew

Unnamed: 0,text,maskedPosSentence,maskedNegSentence
0,"Actually, they didn't. The whole tragedy was c...","actually, they didn't. the <mask> <mask> was <...","actually, they didn't. the whole <mask> was ca..."
1,What if a 13 year old girl comes up to you and...,what if a 13 <mask> <mask> <mask> <mask> up to...,what if a 13 year <mask> girl comes up to you ...
2,"In my lifetime, we've made huge strides, but t...","in my lifetime, we've <mask> <mask> strides, b...","in my lifetime, we've made huge strides, but t..."
3,"HOLY SH*T, marc. You're doing EXACTLY WHAT THE...","<mask> sh*t, marc. you're doing <mask> what th...","holy sh*t, marc. you're doing exactly what the..."
4,"But on the other hand, Genesis isn't a scienti...","but on the other hand, <mask> isn't a <mask> d...","but on the other hand, genesis isn't a scienti..."
...,...,...,...
1983,"Man, these guys can't even get into the scienc...","man, these <mask> can't <mask> get into the <m...","man, these guys can't even <mask> into the <ma..."
1984,What do you mean by this? Could we not have th...,what do you <mask> by this? <mask> we not have...,what do you mean by this? could we not have th...
1985,And the answer is: we don't know. Maybe it cam...,and the <mask> is: we don't know. <mask> it <m...,and the answer is: we don't know. <mask> it ca...
1986,And what would make them separate species? How...,and what <mask> <mask> them <mask> species? ho...,and what would make them <mask> species? how a...


## 4)
"These two masked sentences are fed into the pre-trained generation model to fulfill the generation procedure.     
𝐴 = {𝑎1, ..., 𝑥2, ..., 𝑥𝑛−1, ..., 𝑎𝑜} = 𝐵𝐴𝑅𝑇([𝑚]1, 𝑥2, ..., 𝑥𝑛−1, [𝑚]𝑛)    (1)  
Thus, we will obtain two reborn sentences     
𝐴 = {𝑎1, 𝑎2, ..., 𝑎𝑜 } and     
𝐵 = {𝑏1, 𝑏2, ..., 𝑏𝑝 }."

In [31]:
%pip install transformers



In [56]:
from transformers import AutoTokenizer, AutoModelForCausalLM

def generate_reborn_sentences(masked_sentences):
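    # Note: GPT-2 (a causal LM) is used here rather than the BART model named in the
    # quoted description, so it continues the masked text instead of filling the <mask> slots.
    tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
    model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")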

    # Set the pad token to be the same as the eos token
    tokenizer.pad_token = tokenizer.eos_token

    i = 0
    reborn_sentences = []
    for masked_sentence in masked_sentences:
        # Tokenize input sentence with attention mask and padding
        inputs = tokenizer(masked_sentence, return_tensors="pt", truncation=True, padding=True, max_length=model.config.max_position_embeddings - 50)

        # Generate new tokens, ensuring total length doesn't exceed max_position_embeddings
        generated_encoded = model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'], max_new_tokens=50)

        # Decode the generated tokens to form the reborn sentence
        reborn_sentence = tokenizer.batch_decode(generated_encoded, skip_special_tokens=True)[0]
        reborn_sentences.append(reborn_sentence)

        i += 1
        if i % 100 == 0:
            print(f'Processed {i} sentences')

    return reborn_sentences

In [57]:
from google.colab import userdata
# Read the token from Colab secrets without echoing it in the cell output.
hf_token = userdata.get('HF_TOKEN')

In [58]:
import os
# Never hard-code the token in the notebook; pass the stored secret instead.
os.environ["HF_TOKEN"] = hf_token

In [None]:
reborn_pos_sentences = generate_reborn_sentences(masked_pos_sentences)

reborn_neg_sentences = generate_reborn_sentences(masked_neg_sentences)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end gene

Processed 100 sentences


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end gene

Processed 200 sentences


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end gene

Processed 300 sentences


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end gene

Processed 400 sentences


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end gene

Processed 500 sentences


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end gene

Processed 600 sentences


Processed 700 sentences


Processed 800 sentences


Processed 900 sentences


Processed 1000 sentences


Processed 1100 sentences


Processed 1200 sentences


Processed 1300 sentences


Processed 1400 sentences


Processed 1500 sentences


Processed 1600 sentences


Processed 1700 sentences


Processed 1800 sentences


Processed 1900 sentences


Processed 100 sentences


Processed 200 sentences


Processed 300 sentences


Processed 400 sentences


Processed 500 sentences


Processed 600 sentences


Processed 700 sentences


Processed 800 sentences


Processed 900 sentences


Processed 1000 sentences


Processed 1100 sentences


Processed 1200 sentences


Processed 1300 sentences


Processed 1400 sentences


Processed 1500 sentences


Processed 1600 sentences


Processed 1700 sentences


Processed 1800 sentences


Processed 1900 sentences


In [None]:
print("Reborn Sentences for Masked Positive Sentences:")
for i, reborn_sentence in enumerate(reborn_pos_sentences):
    print(f"Reborn Sentence {i + 1}: {reborn_sentence}")

Streaming output truncated to the last 5000 lines.

#define mask 0 #define mask_mask 0 #define mask_mask_1 0 #define mask_mask_2 0 #define mask_mask_
Reborn Sentence 528: <mask> dear. have you <mask> <mask> at a <mask> spine? it is curved: it's bent. a <mask> spine is a knackered spine. as to the hands bit, what are you on about? you are aware i hope that the story that the nosey elephant got it's trunk because a crocodile grabbed the end and stretched it is just a story? i hope that the story that the crocodile got it's trunk because a crocodile grabbed the end and stretched it is just a story? i hope that the story that the crocodile got it's trunk because a crocodile grabbed the end and stretched
Reborn Sentence 529: ** the "methodology" of <mask> is not materialism/mechanism/darwinism, but <mask> <mask> <mask> inductively/abductively to <mask> regarding regularities in causation in nature that can be tested via the predictive experimentation of the scientific method. 

In [None]:
print("\nReborn Sentences for Masked Negative Sentences:")
for i, reborn_sentence in enumerate(reborn_neg_sentences):
    print(f"Reborn Sentence {i + 1}: {reborn_sentence}")

Streaming output truncated to the last 5000 lines.

<mask> english (spe
Reborn Sentence 136: so the establishment <mask> was included to <mask> atheism, eh? <mask> and <mask> atheism is not a religion. <mask> and <mask> atheism is not a religion. <mask> and <mask> atheism is not a religion. <mask> and <mask> atheism is not a
Reborn Sentence 137: and for you mit stat <mask> freaks: http://web.mit.edu/~noto/public/bk_v8.pdf

The following is a list of the most common mitigations for the following mitigations:

The following is a list of the most common mitigations for the following mitigations:

The following is a list of
Reborn Sentence 138: just as being anti-gun won't <mask> for being <mask> with a <mask> brain. <mask> is a brain that is not a brain. <mask> is a brain that is not a brain. <mask> is a brain that is not a brain. <mask> is a brain that is not a brain. <mask
Reborn Sentence 139: oh, it's <mask> that non-darwinists have no affect on you guys. <mask> will ever

In [None]:
dfnew["rebornPosSentence"] = reborn_pos_sentences
dfnew["rebornNegSentence"] = reborn_neg_sentences
dfnew

Unnamed: 0,text,maskedPosSentence,maskedNegSentence,rebornPosSentence,rebornNegSentence
0,"Actually, they didn't. The whole tragedy was c...","actually, they didn't. the <mask> <mask> was <...","actually, they didn't. the whole <mask> was ca...","actually, they didn't. the <mask> <mask> was <...","actually, they didn't. the whole <mask> was ca..."
1,What if a 13 year old girl comes up to you and...,what if a 13 <mask> <mask> <mask> <mask> up to...,what if a 13 year <mask> girl comes up to you ...,what if a 13 <mask> <mask> <mask> <mask> up to...,what if a 13 year <mask> girl comes up to you ...
2,"In my lifetime, we've made huge strides, but t...","in my lifetime, we've <mask> <mask> strides, b...","in my lifetime, we've made huge strides, but t...","in my lifetime, we've <mask> <mask> strides, b...","in my lifetime, we've made huge strides, but t..."
3,"HOLY SH*T, marc. You're doing EXACTLY WHAT THE...","<mask> sh*t, marc. you're doing <mask> what th...","holy sh*t, marc. you're doing exactly what the...","<mask> sh*t, marc. you're doing <mask> what th...","holy sh*t, marc. you're doing exactly what the..."
4,"But on the other hand, Genesis isn't a scienti...","but on the other hand, <mask> isn't a <mask> d...","but on the other hand, genesis isn't a scienti...","but on the other hand, <mask> isn't a <mask> d...","but on the other hand, genesis isn't a scienti..."
...,...,...,...,...,...
1983,"Man, these guys can't even get into the scienc...","man, these <mask> can't <mask> get into the <m...","man, these guys can't even <mask> into the <ma...","man, these <mask> can't <mask> get into the <m...","man, these guys can't even <mask> into the <ma..."
1984,What do you mean by this? Could we not have th...,what do you <mask> by this? <mask> we not have...,what do you mean by this? could we not have th...,what do you <mask> by this? <mask> we not have...,what do you mean by this? could we not have th...
1985,And the answer is: we don't know. Maybe it cam...,and the <mask> is: we don't know. <mask> it <m...,and the answer is: we don't know. <mask> it ca...,and the <mask> is: we don't know. <mask> it <m...,and the answer is: we don't know. <mask> it ca...
1986,And what would make them separate species? How...,and what <mask> <mask> them <mask> species? ho...,and what would make them <mask> species? how a...,and what <mask> <mask> them <mask> species? ho...,and what would make them <mask> species? how a...


# Sentences Representation
"We embed the original sentence $x$ and its corresponding reborn texts $A$ and $B$     
into $d$-dimensional embeddings $H_t \in \mathbb{R}^d$     
via pre-trained BERT-base:     
$H_x, H_A, H_B = \mathrm{BERT}(x), \mathrm{BERT}(A), \mathrm{BERT}(B)$."
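
The embedding cell in this notebook turns per-token hidden states into one sentence vector with `.last_hidden_state.mean(dim=1)`. As a rough illustration of what that pooling step does, here is a dependency-free sketch. The helper name `mean_pool` and the toy numbers are hypothetical and are not part of the notebook; a real BERT-base encoder produces 768-dimensional token vectors.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    # Average each coordinate over all tokens.
    return [sum(vec[j] for vec in token_vectors) / count for j in range(dim)]


# Toy "hidden states": 3 tokens, d = 4 (assumed values for illustration only).
hidden_states = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]

sentence_vector = mean_pool(hidden_states)
print(sentence_vector)
```

Every token contributes equally under this scheme, which is why long, repetitive generated text can pull the sentence vector away from the original.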

In [None]:
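import torch
from transformers import AutoModel, AutoTokenizer

def embed_sentences(sentences):
    """Embed each sentence as the mean of its last-layer token states.

    Returns a tensor of shape (num_sentences, 1, hidden_size).
    """
    # SimCSE checkpoint built on BERT-base (uncased)
    model_name = "princeton-nlp/sup-simcse-bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # inference only

    embeddings = []
    for i, sentence in enumerate(sentences, start=1):
        inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            # Mean-pool over the token dimension -> shape (1, hidden_size)
            outputs = model(**inputs).last_hidden_state.mean(dim=1)
        embeddings.append(outputs)
        if i % 100 == 0:
            print(f'Processed {i} sentences')

    return torch.stack(embeddings)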

In [None]:
x_embeddings = embed_sentences(text_data)

A_embeddings = embed_sentences(reborn_pos_sentences)

B_embeddings = embed_sentences(reborn_neg_sentences)

Processed 100 sentences
Processed 200 sentences
Processed 300 sentences
Processed 400 sentences
Processed 500 sentences
Processed 600 sentences
Processed 700 sentences
Processed 800 sentences
Processed 900 sentences
Processed 1000 sentences
Processed 1100 sentences
Processed 1200 sentences
Processed 1300 sentences
Processed 1400 sentences
Processed 1500 sentences
Processed 1600 sentences
Processed 1700 sentences
Processed 1800 sentences
Processed 1900 sentences
Processed 100 sentences
Processed 200 sentences
Processed 300 sentences
Processed 400 sentences
Processed 500 sentences
Processed 600 sentences
Processed 700 sentences
Processed 800 sentences
Processed 900 sentences
Processed 1000 sentences
Processed 1100 sentences
Processed 1200 sentences
Processed 1300 sentences
Processed 1400 sentences
Processed 1500 sentences
Processed 1600 sentences
Processed 1700 sentences
Processed 1800 sentences
Processed 1900 sentences
Processed 100 sentences
Processed 200 sentences
Processed 300 senten

In [None]:
for i, sentence in enumerate(text_data):
    print(f"Embedding for Original Lowercase Sentence {i + 1} ({sentence}):")
    print(x_embeddings[i])
    print("- - - - - - - - - -")

Streaming output truncated to the last 5000 lines.
          3.7519e-01, -5.0051e-01,  4.5642e-01, -4.8025e-01, -1.2884e-01,
         -1.4069e-01,  2.4108e-01,  1.2320e-01, -5.4757e-01, -3.3583e-01,
         -3.8787e-01, -7.1007e-01, -3.1350e-01,  3.7211e-01,  2.6616e-02,
          1.5156e-01, -7.0895e-03,  5.2632e-01, -2.1169e-01, -1.6031e-01,
         -3.0236e-01, -9.6829e-01, -1.2808e-01,  3.5004e-01, -7.1103e-02,
          2.3128e-01,  1.4258e-01, -1.3040e-01, -2.9768e-02, -2.7812e-01,
          2.2810e-01,  1.5741e-01,  1.1149e-01]])
- - - - - - - - - -
Embedding for Original Lowercase Sentence 1957 (when will it stop? never. they will take every gun, every toy gun, disarm the law enforcment community, give us nerf batons so we don't actually hurt the criminals, and say that this kinder gentler way of doing things will help. in the meantime murders will go through the roof, like in australia, rape will escalate, breaking and entering will become common place, and cop

In [None]:
for i, sentence in enumerate(reborn_pos_sentences):
    print(f"Embedding for Reborn Positive Sentence {i + 1} ({sentence}):")
    print(A_embeddings[i])
    print("- - - - - - - - - -")

Streaming output truncated to the last 5000 lines.
          3.6483e-02, -2.1899e-01,  5.2699e-01,  3.5222e-01, -9.6135e-02,
         -2.2902e-01,  9.3594e-03,  3.5691e-02,  2.0728e-01,  1.9265e-04,
          3.2740e-01,  1.6797e-01, -1.9315e-01, -3.6417e+00, -2.3747e-01,
         -7.1933e-02, -2.4104e-01,  3.1265e-01,  2.0131e-01,  1.4631e-01,
         -3.2314e-01, -4.4547e-02,  3.9387e-02,  3.2826e-01, -1.1460e-01,
          2.5417e-01, -1.4787e-01, -3.4719e-02, -3.6704e-01,  3.2787e-01,
         -2.8368e-01, -1.9380e-01,  3.8557e-02, -2.0151e-01, -2.6669e-01,
          5.8954e-01,  2.3411e-01,  4.1981e-01,  2.6227e-01, -2.7676e-02,
         -4.2719e-01, -1.0418e-01,  4.8615e-02, -2.9314e-01, -6.2224e-01,
          5.1055e-02, -1.7937e-01,  1.0417e-01,  2.9884e-02, -4.3342e-02,
         -1.1546e-01, -6.3782e-01, -2.9860e-01, -3.1342e-01, -4.0339e-01,
          8.4259e-02, -4.9066e-02,  7.0212e-01, -3.0803e-01,  1.9782e-01,
          1.4985e-01, -3.1906e-01,  3.5466e-01,

In [None]:
for i, sentence in enumerate(reborn_neg_sentences):
    print(f"Embedding for Reborn Negative Sentence {i + 1} ({sentence}):")
    print(B_embeddings[i])
    print("- - - - - - - - - -")

Streaming output truncated to the last 5000 lines.
         -1.8838e-02, -3.4402e-01,  2.2784e-01, -2.8488e-01, -9.1361e-02,
          4.3127e-01,  5.1043e-01, -1.3233e-01, -4.7836e-01,  8.5876e-02,
         -5.3757e-02, -8.8990e-02,  1.9763e-01,  1.7550e-01, -6.8992e-02,
          2.0051e-01, -4.8932e-01,  5.5156e-01, -1.9603e-02, -3.0573e-02,
         -2.8728e-01, -3.5576e-01, -4.6194e-01,  1.8214e-01,  1.2309e-01,
          2.8875e-01,  3.7168e-01, -1.0826e-02, -6.6618e-01, -3.1767e-02,
          2.5525e-01,  7.5561e-02,  3.2627e-01, -4.6109e-01,  1.6428e-01,
          1.6414e-01, -8.3911e-02, -9.1707e-02, -6.9886e-02,  2.0154e-01,
         -1.1367e-01,  1.5076e-01,  7.0496e-01, -4.7142e-02,  1.1926e-01,
         -3.7098e-01,  1.2123e-01,  6.3075e-01, -5.2369e-02,  2.3364e-01,
          2.7616e-02,  8.8145e-02,  2.4218e-01, -1.5674e-01, -4.0548e-02,
         -9.6250e-01,  4.5155e-01,  1.3230e-01, -1.5729e-01, -1.8512e-01,
          5.4765e-02,  7.8119e-02, -2.0837e-01,

In [None]:
dfnew["xEmbedding"] = x_embeddings.tolist()
dfnew["AEmbedding"] = A_embeddings.tolist()
dfnew["BEmbedding"] = B_embeddings.tolist()
dfnew

Unnamed: 0,text,maskedPosSentence,maskedNegSentence,rebornPosSentence,rebornNegSentence,xEmbedding,AEmbedding,BEmbedding
0,"Actually, they didn't. The whole tragedy was c...","actually, they didn't. the <mask> <mask> was <...","actually, they didn't. the whole <mask> was ca...","actually, they didn't. the <mask> <mask> was <...","actually, they didn't. the whole <mask> was ca...","[[0.33568713068962097, -0.1276787519454956, -0...","[[-0.05913911014795303, -0.22624805569648743, ...","[[0.14492402970790863, 0.044240355491638184, 0..."
1,What if a 13 year old girl comes up to you and...,what if a 13 <mask> <mask> <mask> <mask> up to...,what if a 13 year <mask> girl comes up to you ...,what if a 13 <mask> <mask> <mask> <mask> up to...,what if a 13 year <mask> girl comes up to you ...,"[[0.21681912243366241, -0.5336976647377014, 0....","[[0.3441527783870697, -0.5100955963134766, 0.0...","[[0.4545100927352905, -0.6294618844985962, -0...."
2,"In my lifetime, we've made huge strides, but t...","in my lifetime, we've <mask> <mask> strides, b...","in my lifetime, we've made huge strides, but t...","in my lifetime, we've <mask> <mask> strides, b...","in my lifetime, we've made huge strides, but t...","[[0.39199742674827576, 0.21301516890525818, 0....","[[-0.10166621208190918, -0.09767168760299683, ...","[[0.0471426360309124, -0.08261749893426895, 0...."
3,"HOLY SH*T, marc. You're doing EXACTLY WHAT THE...","<mask> sh*t, marc. you're doing <mask> what th...","holy sh*t, marc. you're doing exactly what the...","<mask> sh*t, marc. you're doing <mask> what th...","holy sh*t, marc. you're doing exactly what the...","[[0.4560452103614807, 0.3644334375858307, 0.32...","[[0.37358614802360535, 0.2490326464176178, 0.2...","[[0.47779059410095215, 0.3466264307498932, 0.2..."
4,"But on the other hand, Genesis isn't a scienti...","but on the other hand, <mask> isn't a <mask> d...","but on the other hand, genesis isn't a scienti...","but on the other hand, <mask> isn't a <mask> d...","but on the other hand, genesis isn't a scienti...","[[0.17085641622543335, 0.6901278495788574, -0....","[[0.2736916244029999, 0.44377055764198303, -0....","[[0.2619611322879791, 0.6709150671958923, -0.5..."
...,...,...,...,...,...,...,...,...
1983,"Man, these guys can't even get into the scienc...","man, these <mask> can't <mask> get into the <m...","man, these guys can't even <mask> into the <ma...","man, these <mask> can't <mask> get into the <m...","man, these guys can't even <mask> into the <ma...","[[0.8729156851768494, 0.7266660928726196, 0.04...","[[0.2820708155632019, 0.294691264629364, 0.212...","[[0.774412214756012, 0.4799511730670929, 0.124..."
1984,What do you mean by this? Could we not have th...,what do you <mask> by this? <mask> we not have...,what do you mean by this? could we not have th...,what do you <mask> by this? <mask> we not have...,what do you mean by this? could we not have th...,"[[0.44388970732688904, -0.16068144142627716, -...","[[0.38173604011535645, -0.08468043804168701, -...","[[0.203124538064003, -0.13702347874641418, -0...."
1985,And the answer is: we don't know. Maybe it cam...,and the <mask> is: we don't know. <mask> it <m...,and the answer is: we don't know. <mask> it ca...,and the <mask> is: we don't know. <mask> it <m...,and the answer is: we don't know. <mask> it ca...,"[[0.5926541686058044, -0.10631008446216583, 0....","[[0.4560041129589081, -0.01774139516055584, -0...","[[0.48886793851852417, -0.03851227089762688, -..."
1986,And what would make them separate species? How...,and what <mask> <mask> them <mask> species? ho...,and what would make them <mask> species? how a...,and what <mask> <mask> them <mask> species? ho...,and what would make them <mask> species? how a...,"[[0.671820342540741, 0.32235777378082275, -0.3...","[[0.25018009543418884, 0.18032698333263397, -0...","[[0.2744423747062683, 0.23088446259498596, -0...."


# Sarcastic Utterances Detection
## 1)
"We utilize cosine similarity to measure the similarity between the representation of the original sentence $H_x$     
and those of the generated texts $H_A$/$H_B$.

Then we use the following equation to calculate a difference score of each sentence:     
$\mathrm{diff} = \mathrm{sim}(H_x, H_A) < \mathit{threshold} \;\|\; \mathrm{sim}(H_x, H_B) < \mathit{threshold}$     
where $\|$ means the logical "or" operator."
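
To make the rule concrete, the following is a minimal pure-Python sketch of cosine similarity and the "either side below threshold" test on toy 2-dimensional vectors. The names `cosine` and `difference_flag` and the vectors are hypothetical; the notebook itself calls a library `cosine_similarity` on the real embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch only
    return dot / (norm_u * norm_v)


def difference_flag(h_x, h_a, h_b, threshold):
    """True when the original is dissimilar to at least one reborn text."""
    return cosine(h_x, h_a) < threshold or cosine(h_x, h_b) < threshold


h_x = [1.0, 0.0]
h_same = [1.0, 0.0]   # identical direction, similarity 1.0
h_orth = [0.0, 1.0]   # orthogonal, similarity 0.0

flag_mixed = difference_flag(h_x, h_same, h_orth, threshold=0.755)
flag_close = difference_flag(h_x, h_same, h_same, threshold=0.755)
print(flag_mixed, flag_close)
```

Because the two comparisons are joined with "or", a single dissimilar reborn text is enough to flag the sentence.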

In [None]:
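def calculate_difference_scores(x_embeddings, A_embeddings, B_embeddings, threshold):
    """Flag a sentence when either reborn text falls below the similarity threshold."""
    diff_scores = []
    triples = zip(x_embeddings, A_embeddings, B_embeddings)
    for i, (x_emb, A_emb, B_emb) in enumerate(triples, start=1):
        # Each embedding has shape (1, hidden_size), so cosine_similarity
        # returns a (1, 1) array rather than a plain float.
        sim_Hx_HA = cosine_similarity(x_emb, A_emb)
        sim_Hx_HB = cosine_similarity(x_emb, B_emb)

        # True if the original differs enough from either reborn text
        diff = (sim_Hx_HA < threshold) or (sim_Hx_HB < threshold)
        diff_scores.append(diff)
        if i % 100 == 0:
            print(f'Processed {i} embeddings')

    return diff_scores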

In [None]:
threshold = 0.755

diff_scores = calculate_difference_scores(x_embeddings, A_embeddings, B_embeddings, threshold)
diff_scores

Processed 100 embeddings
Processed 200 embeddings
Processed 300 embeddings
Processed 400 embeddings
Processed 500 embeddings
Processed 600 embeddings
Processed 700 embeddings
Processed 800 embeddings
Processed 900 embeddings
Processed 1000 embeddings
Processed 1100 embeddings
Processed 1200 embeddings
Processed 1300 embeddings
Processed 1400 embeddings
Processed 1500 embeddings
Processed 1600 embeddings
Processed 1700 embeddings
Processed 1800 embeddings
Processed 1900 embeddings


[array([[ True]]),
 array([[False]]),
 array([[ True]]),
 array([[False]]),
 array([[False]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[False]]),
 array([[False]]),
 array([[ True]]),
 array([[ True]]),
 array([[False]]),
 array([[ True]]),
 array([[False]]),
 array([[False]]),
 array([[ True]]),
 array([[False]]),
 array([[ True]]),
 array([[ True]]),
 array([[False]]),
 array([[ True]]),
 array([[False]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[False]]),
 array([[False]]),
 array([[False]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[False]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[False]]),
 array([[ True]]),
 array([[False]]),
 array([[ True]]),
 array([[False]]),
 array([[False]]),
 array([[ True]]),
 array([[False]]),
 array([[ True]]),
 array([[False]]),
 array([[ True]]),
 array([[ True]]),
 array([[ True]]),
 array([[ Tr

## 2)
"Since sarcastic utterances are influenced more than normal texts during the masking and generation procedure,     
the difference score of sarcastic texts should be greater than that of non-sarcastic ones.

If we have a threshold value which separates sarcastic texts and normal texts,     
we can obtain the prediction $y$ by:     
$y = \mathbb{I}(\mathrm{diff})$."
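
As a small illustration of how the indicator turns similarity scores into labels, and how the choice of threshold changes the number of sentences flagged, consider the sketch below. The function `predict` and the similarity pairs are made up for this example; they are not values from the notebook run.

```python
def predict(sim_pairs, threshold):
    """Apply y = I(diff) to (sim_xA, sim_xB) pairs; 1 means 'sarc'."""
    return [int(sim_a < threshold or sim_b < threshold) for sim_a, sim_b in sim_pairs]


# Hypothetical (sim(H_x, H_A), sim(H_x, H_B)) pairs for four sentences.
toy_sims = [(0.90, 0.85), (0.70, 0.95), (0.80, 0.60), (0.76, 0.78)]

results = {}
for threshold in (0.65, 0.755, 0.85):
    preds = predict(toy_sims, threshold)
    results[threshold] = preds
    print(threshold, preds, sum(preds))
```

Raising the threshold can only add positive predictions, so the threshold directly trades false negatives for false positives.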

In [None]:
# Each diff is a (1, 1) boolean array; .item() extracts the scalar before int()
predicted_labels = [int(np.asarray(diff).item()) for diff in diff_scores]
print(predicted_labels)
print(sum(predicted_labels))

[1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 


In [None]:
labels = ["sarc" if diff else "notsarc" for diff in diff_scores]
print(labels)

['sarc', 'notsarc', 'sarc', 'notsarc', 'notsarc', 'sarc', 'sarc', 'sarc', 'notsarc', 'notsarc', 'sarc', 'sarc', 'notsarc', 'sarc', 'notsarc', 'notsarc', 'sarc', 'notsarc', 'sarc', 'sarc', 'notsarc', 'sarc', 'notsarc', 'sarc', 'sarc', 'sarc', 'sarc', 'notsarc', 'notsarc', 'notsarc', 'sarc', 'sarc', 'sarc', 'sarc', 'notsarc', 'sarc', 'sarc', 'sarc', 'sarc', 'notsarc', 'sarc', 'notsarc', 'sarc', 'notsarc', 'notsarc', 'sarc', 'notsarc', 'sarc', 'notsarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'sarc', 'notsarc', 'notsarc', 'notsarc', 'sarc', 'notsarc', 'notsarc', 'sarc', 'notsarc', 'sarc', 'sarc', 'notsarc', 'sarc', 'sarc', 'sarc', 'notsarc', 'notsarc', 'sarc', 'notsarc', 'sarc', 'sarc', 'notsarc', 'notsarc', 'notsarc', 'sarc', 'sarc', 'notsarc', 'notsarc', 'notsarc', 'sarc', 'notsarc', 'sarc', 'notsarc', 'notsarc', 'sarc', 'sarc', 'notsarc', 'notsarc', 'sarc', 'notsarc', 'sarc', 'notsarc', 'sarc', 'sarc', 'sarc', 'notsarc', 'sarc', 'sarc', 'notsarc', 'notsarc', 's

In [None]:
dffinal = pd.DataFrame({"text": text_data, "class": label_data, "prediction": labels})
dffinal

Unnamed: 0,text,class,prediction
0,"actually, they didn't. the whole tragedy was c...",sarc,sarc
1,what if a 13 year old girl comes up to you and...,sarc,notsarc
2,"in my lifetime, we've made huge strides, but t...",sarc,sarc
3,"holy sh*t, marc. you're doing exactly what the...",sarc,notsarc
4,"but on the other hand, genesis isn't a scienti...",sarc,notsarc
...,...,...,...
1983,"man, these guys can't even get into the scienc...",notsarc,sarc
1984,what do you mean by this? could we not have th...,notsarc,notsarc
1985,and the answer is: we don't know. maybe it cam...,notsarc,notsarc
1986,and what would make them separate species? how...,notsarc,notsarc


# Main Experiment Results

In [None]:
# Encode the gold labels the same way as the predictions: 1 = "sarc", 0 = "notsarc"
true_labels = [1 if label == "sarc" else 0 for label in df["class"]]
print(true_labels)
print(predicted_labels)

accuracy = accuracy_score(true_labels, predicted_labels)
precision = precision_score(true_labels, predicted_labels)
f1 = f1_score(true_labels, predicted_labels)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("F1 Score:", f1)

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

In [None]:
conf_matrix = confusion_matrix(true_labels, predicted_labels)

print("Confusion Matrix:")
print(conf_matrix)

Confusion Matrix:
[[562 435]
 [436 555]]
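
The headline metrics can be recomputed by hand from the confusion matrix printed above. The sketch below assumes the usual scikit-learn layout (rows are true labels, columns are predicted labels, label order `[0, 1]`), so the entries read as `[[TN, FP], [FN, TP]]`. Recall is included for completeness even though the notebook does not print it.

```python
# Confusion matrix copied from the output above: [[TN, FP], [FN, TP]]
conf_matrix = [[562, 435], [436, 555]]
(tn, fp), (fn, tp) = conf_matrix

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted "sarc" that is truly "sarc"
recall = tp / (tp + fn)      # share of true "sarc" that was detected
f1 = 2 * precision * recall / (precision + recall)

print(f"Total:     {total}")
print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
```

All four metrics come out close to 0.56 for this matrix.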
