In [2]:
from IPython.display import HTML, display

# Motivation
Social media are now filled with figurative and creative language, including irony. Understanding irony is part of human reasoning and communication, and it has accordingly become a topic of machine learning research. Irony has important implications for natural language processing (NLP) tasks, which aim to understand and produce human language. Potential applications of automatic irony detection include tasks that require semantic analysis, author profiling, online harassment detection, and sentiment analysis. Understanding irony also supports better analysis of social media and of people's opinions. We chose this task for our project because it is interesting, useful, and well defined, and because it let us apply what we learned in the course. Beyond the course material, we wanted to learn a newer sequence modeling method, the Temporal Convolutional Network (TCN), and compare it with the models covered in the course.

# Approach
### Embedding
The training set is rather small (3834 sentences), so we used pre-trained embeddings as the input to our models. We compared two pre-trained models. First, we used the last four hidden layers of the pre-trained BERT model as the embeddings of the words in each sentence. We used the BERT tokenizer to split each sentence into tokens and then extracted an embedding for each token. For example, the sentence "Here is the sentence I want embeddings for." is split into the tokens 'here', 'is', 'the', 'sentence', 'i', 'want', 'em', '##bed', '##ding', '##s', 'for', '.'. Each BERT layer has 768 dimensions, so concatenating the last four hidden layers gives embeddings of dimension 3072.

Second, we used the pre-trained Glove model trained on 2B tweets, which covers 27B tokens and an uncased vocabulary of 1.2M words. These embeddings are available in four dimensionalities; we used the 100-dimensional version. Before extracting word embeddings, we applied the Glove pre-processing script for Twitter data and then inserted spaces between words and punctuation marks. We tokenized the sentences by splitting on spaces.
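
The sub-word split in the BERT example above ('em', '##bed', '##ding', '##s') follows the greedy longest-match-first rule of WordPiece tokenization. The following is a minimal sketch of that rule, not the BERT tokenizer itself: `TOY_VOCAB` and the function name `wordpiece` are hypothetical, and the toy vocabulary contains only the pieces needed for this one sentence.

```python
# Toy vocabulary (hypothetical); the real bert-base-uncased vocabulary has about 30,000 entries.
TOY_VOCAB = {"here", "is", "the", "sentence", "i", "want", "for", ".",
             "em", "##bed", "##ding", "##s"}


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercased word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shorten it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


tokens = []
for word in "here is the sentence i want embeddings for .".split():
    tokens.extend(wordpiece(word, TOY_VOCAB))
print(tokens)
```

With this toy vocabulary, "embeddings" is the only word that is split, which is why a sentence usually yields more BERT tokens than words.
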
 
### Model
We tried two types of models to classify the tweets. First, we used a TCN-based model. We tried different numbers of convolutional layers, kernel sizes, strides, and hidden sizes. The reported results come from the best configuration: 4 convolutional layers, kernel size 3, stride 1, and hidden sizes of 600 and 100 for the BERT and Glove embeddings, respectively. The convolutions in layer i use a dilation of 2^i. After the convolutional layers, the model selects the channels of the last hidden layer at the last word. We then apply a linear layer with a sigmoid activation to obtain the probability that the input tweet is ironic.
![tcn](figures/tcnmodel.png "TCN based model")
Second, we used different types of RNN models to extract a hidden representation of the sentence and, as in the previous case, applied a linear layer and a sigmoid to obtain the probability of irony. We tried LSTM, bidirectional LSTM, GRU, and bidirectional GRU as the RNN model. The reported results use 1 RNN layer, because increasing the number of layers did not improve the results. The hidden sizes are again 600 and 100 for the BERT and Glove embeddings, respectively.
![rnn](figures/rnn.png "RNN based model")
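
To make the TCN configuration above concrete, the following sketch computes how many input tokens the last output position can see. It assumes the layout of the TemporalBlock code later in this notebook: two causal convolutions per level, dilation 2^i at level i with i starting at 0, and stride 1. The function name is hypothetical.

```python
def tcn_receptive_field(num_levels, kernel_size, convs_per_level=2):
    """Number of input positions that influence the last output position."""
    field = 1
    for level in range(num_levels):
        dilation = 2 ** level
        # Each causal convolution extends the view by (kernel_size - 1) * dilation steps.
        field += convs_per_level * (kernel_size - 1) * dilation
    return field


print(tcn_receptive_field(num_levels=4, kernel_size=3))  # prints 61
```

Under these assumptions the reported configuration covers 61 tokens, so for most tweets the last position can see the whole sequence.
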

# Data

This project addresses SemEval-2018 Task 3. Given a dataset of tweets, the goal is to train a model that distinguishes ironic tweets from non-ironic ones. It is therefore a binary classification task with the classes Irony and Non-Irony. We use the official dataset of the task.

The primary dataset was built by crawling Twitter for hashtags such as #sarcasm, #irony, and #not, which yielded 3,000 tweets. Human annotators checked these tweets and found that 2,396 of them were actually ironic and the remaining 604 were not. For the contest, the irony-related hashtags were removed from the tweets. The organizers also added 1,792 other non-ironic tweets to balance the classes, so the dataset contains 2,396 ironic and 2,396 non-ironic tweets. The data were then divided into a training set (80%, 3,833 samples) and a test set (20%, 958 samples). In an additional cleaning step, some ambiguous samples were removed from the test data, so the final test file contains 784 samples. There were two subtasks: detecting irony and detecting the type of irony. We used only the irony detection task (Subtask A) for this project.

Some sample tweets (the irony-related hashtags are removed in the final dataset):

    1. Go ahead drop me hate, I'm looking forward to it. (Irony)
    2. I love waking up with migraines #not (Irony)
    3. Is Obamacare Slowing Health Care Spending? #NOT (Non-Irony)
    4. And I then my sister should be home from college by time I get home from babysitting. And it's payday. THIS IS A GOOD FRIDAY (Non-Irony)

Train file is:
data/datasets/train/SemEval2018-T3-train-taskA.txt

Test file is:
ref/SemEval2018-T3_gold_test_taskA_emoji.txt
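
As a rough illustration of how such a file can be read, the sketch below parses a tab-separated layout with a header line followed by index, label, and tweet text, which is the column order the training loader later in this notebook relies on. The two sample rows, the header names, and the function name `read_task_a` are made up for the example.

```python
import io

# Two made-up rows in the assumed layout: header, then "index<TAB>label<TAB>tweet text".
SAMPLE = (
    "Tweet index\tLabel\tTweet text\n"
    "1\t1\tI love waking up with migraines\n"
    "2\t0\tTHIS IS A GOOD FRIDAY\n"
)


def read_task_a(handle):
    """Return (labels, tweets) from a tab-separated Subtask A training file."""
    labels, tweets = [], []
    next(handle)  # skip the header line
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # ignore blank or malformed lines
        labels.append(int(fields[1]))
        tweets.append(fields[2])
    return labels, tweets


labels, tweets = read_task_a(io.StringIO(SAMPLE))
print(labels, tweets)
```

For a real file, the `io.StringIO` object would be replaced by `open(path, encoding="utf8")`.
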



## Code
For the TCN model, we reused parts of the code from the paper "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling" by Shaojie Bai, J. Zico Kolter and Vladlen Koltun, specifically the code for the word-level language modeling task, which can be found [here](https://github.com/locuslab/TCN/tree/master/TCN/word_cnn). We explain some important parts of the code below:

First, the lmodel argument specifies which type of embedding to use (BERT or Glove). For the BERT embeddings, data_generator extracts all BERT embeddings for the train, validation, and test data. It first checks whether the embeddings have already been saved in a Corpus object; if not, it creates them by running create_embeddings on the train and test files. We had the original train and test files as well as versions in which emojis are converted to their names. We used the first to extract the labels and the second to extract the words of the sentences. For each sentence, we first ran the preprocessing and then obtained the BERT embeddings with get_bert_embedding.

In [None]:
import os
import pickle

import torch


def data_generator(args):
    # Reuse the cached corpus if it exists, unless args.corpus is set.
    corpus_path = args.data + "/corpus"
    if os.path.exists(corpus_path) and not args.corpus:
        with open(corpus_path, 'rb') as f:
            corpus = pickle.load(f)
    else:
        corpus = Corpus(args.data,
                        '/train/SemEval2018-T3-train-taskA.txt',
                        '/goldtest_TaskA/SemEval2018-T3_gold_test_taskA_emoji.txt',
                        '/test_TaskA/SemEval2018-T3_input_test_taskA.txt')
        with open(corpus_path, 'wb') as f:
            pickle.dump(corpus, f)
    return corpus


class Corpus:
    def __init__(self, path, trainfile, testfile_emoji, testfile_without_emoji):
        self.train_embeddings, self.train_labels = self.create_embeddings(path, trainfile, trainfile)
        self.test_embeddings, self.test_labels = self.create_embeddings(path, testfile_emoji, testfile_without_emoji, is_test=True)
        self.valid_embeddings = self.test_embeddings[:int(len(self.test_embeddings)*0.1)]
        self.valid_labels = self.test_labels[:int(len(self.test_labels)*0.1)]

    def create_embeddings(self, path, file_emoji, file_witout_emoji, is_test=False):
        embeddings = []
        labels = []
        # NOTE: debugging limit; with break_point = 10 only the first 11 lines of each file are read.
        break_point = 10
        with open(path + file_witout_emoji, encoding="utf8") as fp:
            lines = fp.readlines()
            lines = lines[1:len(lines)]
            for i, l in enumerate(lines):
                if i > break_point:
                    break
                line = l.split("\t")
                sent_index = 1 if is_test else 2
                sentence = line[sent_index]
                sentence = pre_process(sentence)
                sent_embed = get_bert_embedding(sentence)
                sent_embed = torch.stack(sent_embed)
                embeddings.append(sent_embed)

        with open(path + file_emoji, encoding="utf8") as fp:
            lines = fp.readlines()
            lines = lines[1:len(lines)]
            for i, l in enumerate(lines):
                if i > break_point:
                    break
                line = l.split("\t")
                labels.append(torch.FloatTensor([float(line[1])]))

        return embeddings, labels


In the preprocessing phase, we changed links to '< quote >' and mentions to '@somebody'.

In [None]:
def change_links(text_input):
    words = text_input.split()
    new_words = []
    for word in words:
        if 'http' not in word:
            new_words.append(word)
        else:
            new_words.append('<quote>')
    return ' '.join(new_words)


def change_mentions(text_input):
    words = text_input.split()
    new_words = []
    for word in words:
        if '@' != word[0]:
            new_words.append(word)
        else:
            new_words.append('@somebody')
    return ' '.join(new_words)


def pre_process(text_input):
    text_input = text_input.lower()
    text_input = change_links(text_input)
    text_input = change_mentions(text_input)
    return text_input
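
As a quick check of what this preprocessing produces, here is a compact single-pass sketch of the same three steps with a sample input. The function name `pre_process_sketch` and the sample tweet are made up for the example; the replacement strings are the ones used in the code above.

```python
def pre_process_sketch(text):
    """Lowercase, then replace links with <quote> and mentions with @somebody."""
    words = []
    for word in text.lower().split():
        if "http" in word:
            words.append("<quote>")      # any word containing "http" counts as a link
        elif word.startswith("@"):
            words.append("@somebody")    # any word starting with "@" counts as a mention
        else:
            words.append(word)
    return " ".join(words)


print(pre_process_sketch("@Bob Check THIS out http://t.co/abc #not"))
# prints: @somebody check this out <quote> #not
```

Hashtags are left untouched at this stage, and runs of whitespace collapse to single spaces because the text is split and re-joined.
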

In the get_bert_embedding function, we first add the [CLS] and [SEP] tokens to the beginning and end of the sentence. We then extract the tokens with the tokenize function of the BERT tokenizer, load the pre-trained bert-base-uncased model, and take its last 4 layers as the embeddings of all tokens.

In [None]:
def get_bert_embedding(text_input):
    marked_text = "[CLS] " + text_input + " [SEP]"
    text_tokens = tokenizer.tokenize(marked_text)
    indexed_tokens = tokenizer.convert_tokens_to_ids(text_tokens)
    segments_ids = [1] * len(text_tokens)

    tokens_tensor = torch.tensor([indexed_tokens]).to(device)
    segments_tensors = torch.tensor([segments_ids]).to(device)

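    # NOTE: the model is reloaded on every call; loading it once outside this function avoids repeated work.
    model = BertModel.from_pretrained('bert-base-uncased').to(device)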
    model.eval()

    with torch.no_grad():
        encoded_layers, _ = model(tokens_tensor, segments_tensors)

    token_embeddings = torch.stack(encoded_layers, dim=0)
    token_embeddings = torch.squeeze(token_embeddings, dim=1)
    token_embeddings = token_embeddings.permute(1,0,2)

    token_vecs_cat = []

    for token in token_embeddings:
        cat_vec = torch.cat((token[-1], token[-2], token[-3], token[-4]), dim=0)
        token_vecs_cat.append(cat_vec)

    return token_vecs_cat
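
The concatenation step at the end of get_bert_embedding can be illustrated without BERT. The sketch below uses plain Python lists and a toy input with 5 layers, 2 tokens, and hidden size 3; the function name and the toy values are hypothetical, and only the order of concatenation (last layer first) follows the code above.

```python
def concat_last_four(layers):
    """layers[l][t] is the hidden vector of token t at layer l (plain lists)."""
    num_tokens = len(layers[0])
    return [
        layers[-1][t] + layers[-2][t] + layers[-3][t] + layers[-4][t]
        for t in range(num_tokens)
    ]


# Toy input: 5 layers, 2 tokens, hidden size 3; every value equals its layer index.
toy_layers = [[[float(layer)] * 3 for _ in range(2)] for layer in range(5)]
vectors = concat_last_four(toy_layers)
print(len(vectors), len(vectors[0]))  # 2 tokens, 4 * 3 = 12 values per token
print(vectors[0])
```

With the BERT-base hidden size of 768, the same step gives 4 * 768 = 3072 values per token, which is the embedding dimension quoted in the Embedding section.
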


For the Glove embeddings, we first load the Glove embedding file in load_glove_embedding. Then, in the create_embeddings function, we extract the tokens with the tokenize function, which uses a Python version of the Ruby pre-processing script that Glove provides for Twitter data. We split some punctuation marks from their neighbouring words and then tokenize the sentences on spaces. Finally, we look up the 100-dimensional Glove embedding of each token in the get_glove_embedding function.

In [None]:

class GloveCorpus:

    def __init__(self, path, trainfile, testfile_emoji, testfile_without_emoji):
        self.glove_embeddings = self.load_glove_embedding()
        self.train_embeddings, self.train_labels = self.create_embeddings(path, trainfile, trainfile)
        self.test_embeddings, self.test_labels = self.create_embeddings(path, testfile_emoji, testfile_without_emoji, is_test=True)
        self.valid_embeddings = self.test_embeddings[:int(len(self.test_embeddings)*0.1)]
        self.valid_labels = self.test_labels[:int(len(self.test_labels)*0.1)]

    def create_embeddings(self, path, file_emoji, file_witout_emoji, is_test=False):
        embeddings = []
        labels = []
        with open(path + file_witout_emoji, encoding="utf8") as fp:
            lines = fp.readlines()
            lines = lines[1:len(lines)]
            for i, l in enumerate(lines):
                line = l.split("\t")
                sent_index = 1 if is_test else 2
                sentence = line[sent_index]
                tokens = self.tokenize(sentence)
                sent_embed = self.get_glove_embedding(tokens)
                sent_embed = torch.stack(sent_embed)
                embeddings.append(sent_embed)

        with open(path + file_emoji, encoding="utf8") as fp:
            lines = fp.readlines()
            lines = lines[1:len(lines)]
            for i, l in enumerate(lines):
                line = l.split("\t")
                labels.append(torch.FloatTensor([float(line[1])]))

        return embeddings, labels

    def hashtag(self, text):
        FLAGS = re.MULTILINE | re.DOTALL
        text = text.group()
        hashtag_body = text[1:]
        if hashtag_body.isupper():
            result = " <hashtag> {} <allcaps> ".format(hashtag_body)
        else:
            result = " ".join([" <hashtag> "] + re.split(r"(?=[A-Z])", hashtag_body, flags=FLAGS))
        return result

    def allcaps(self, text):
        text = text.group()
        return text.lower() + " <allcaps> "

    def tokenize(self, text):
        FLAGS = re.MULTILINE | re.DOTALL
        # Different regex parts for smiley faces
        eyes = r"[8:=;]"
        nose = r"['`\-]?"

         # function so code less repetitive
        def re_sub(pattern, repl):
            return re.sub(pattern, repl, text, flags=FLAGS)

        text = re_sub(r"https?:\/\/\S+\b|www\.(\w+\.)+\S*", " <url> ")
        text = re_sub(r"/"," / ")
        text = re_sub(r"@\w+", " <user> ")
        text = re_sub(r"{}{}[)dD]+|[)dD]+{}{}".format(eyes, nose, nose, eyes), " <smile> ")
        text = re_sub(r"{}{}p+".format(eyes, nose), " <lolface> ")
        text = re_sub(r"{}{}\(+|\)+{}{}".format(eyes, nose, nose, eyes), " <sadface> ")
        text = re_sub(r"{}{}[\/|l*]".format(eyes, nose), " <neutralface> ")
        text = re_sub(r"<3"," <heart> ")
        text = re_sub(r"[-+]?[.\d]*[\d]+[:,.\d]*", " <number> ")
        text = re_sub(r"#\S+", self.hashtag)
        text = re_sub(r"([!?.]){2,}", r"\1 <repeat> ")
        text = re_sub(r"\b(\S*?)(.)\2{2,}\b", r"\1\2 <elong> ")

        ## -- I just don't understand why the Ruby script adds <allcaps> to everything so I limited the selection.
        # text = re_sub(r"([^a-z0-9()<>'`\-]){2,}", allcaps)
        text = re_sub(r"([A-Z]){2,}", self.allcaps)

        text = text.lower()
        text = self.pre_process(text)
        
        return text.split()

    def pre_process(self, text):
        changed = True
        before_signs = ['.', ',', '!', '?', ')', ':', '#', '"', '*', '(', '|', '=',]
        after_signs = ['(', '.', ':', '"', '*', ')', '|', '=']
        while(changed):
            changed = False
            tokens = text.split()
            for i, token in enumerate(tokens):
                if changed:
                    break
                for sign in before_signs:
                    if token.find(sign, 1) > -1:
                        ind = token.find(sign,1)
                        tokens = tokens[:i] + [token[:ind], token[ind:]] + tokens[i+1:]
                        text = ' '.join(tokens)
                        changed = True
                if changed:
                    break
                for sign in after_signs:
                    if token.find(sign) > -1 and token.find(sign) < len(token)-1:
                        ind = token.find(sign)
                        tokens = tokens[:i] + [token[:ind+1], token[ind+1:]] + tokens[i+1:]
                        text = ' '.join(tokens)
                        changed = True
                if changed:
                    break
                if token.find("n't") > -1 and len(token) > 3:
                    ind = token.find("n't")
                    tokens = tokens[:i] + [token[:ind], " n't ", token[ind+3:]] + tokens[i+1:]
                    text = ' '.join(tokens)
                    changed = True
                if changed:
                    break
                if token.find("'", 1) > -1 and token.find("n't") == -1:
                    ind = token.find("'", 1)
                    tokens = tokens[:i] + [token[:ind], token[ind:]] + tokens[i+1:]
                    text = ' '.join(tokens)
                    changed = True
                if token.find('_') > -1 and len(token) > 1:
                    ind = token.find('_')
                    tokens = tokens[:i] + [token[:ind], token[ind+1:]] + tokens[i+1:]
                    text = ' '.join(tokens)
                    changed = True
        return text

    def load_glove_embedding(self):
        glove_embeddings = {}
        with open('data/embeddings/glove.twitter.27B/glove.twitter.27B.100d.txt', encoding="utf8") as f:
            for i, l in enumerate(f):
                if i % 100000 == 0:
                    print('load glove line', i)
                tokens = l.split()
                glove_embeddings[tokens[0]] = torch.FloatTensor([float(x) for x in tokens[1:]])
        return glove_embeddings

    def get_glove_embedding(self, tokens):
        embeddings = []
        for token in tokens:
            if token in self.glove_embeddings:
                embeddings.append(self.glove_embeddings[token])
            else:
                print(token)
        return embeddings
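
The lookup logic of load_glove_embedding and get_glove_embedding can be tried on a small in-memory sample. This is a sketch under stated assumptions: the three 4-dimensional vectors are made up, the function names are hypothetical, and plain lists stand in for the tensors used above.

```python
import io

# Three made-up 4-dimensional vectors in the Glove text layout: "token v1 v2 ...".
SAMPLE = (
    "<user> 0.1 0.2 0.3 0.4\n"
    "love 0.5 0.1 0.0 0.2\n"
    "<hashtag> 0.3 0.3 0.1 0.9\n"
)


def load_vectors(handle):
    """Map each token to its vector (a list of floats)."""
    table = {}
    for line in handle:
        parts = line.split()
        table[parts[0]] = [float(x) for x in parts[1:]]
    return table


def lookup(tokens, table):
    """Return the vectors of known tokens and the list of skipped tokens."""
    found = [table[t] for t in tokens if t in table]
    missing = [t for t in tokens if t not in table]
    return found, missing


table = load_vectors(io.StringIO(SAMPLE))
found, missing = lookup(["<user>", "love", "migraines"], table)
print(len(found), missing)  # prints: 2 ['migraines']
```

Out-of-vocabulary tokens are dropped, as in the class above, so a sentence can end up shorter than its token list. A sentence with no known token would yield an empty list, which torch.stack rejects, so such inputs may need special handling.
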

Then the model argument selects which neural network to use.

In [None]:
if args.model == 0:
    model = TCN(args.emsize, 1, num_chans, dropout=dropout, kernel_size=k_size)
if args.model == 1:
    model = LSTM_classifier(input_size = args.emsize, output_size = 1, hidden_size = args.nhid)
if args.model == 2:
    model = LSTM_classifier_bidirectional(input_size = args.emsize, output_size = 1, hidden_size = args.nhid)
if args.model == 3:
    model = GRU_classifier(input_size = args.emsize, output_size = 1, hidden_size = args.nhid)
if args.model == 4:
    model = GRU_classifier_bidirectional(input_size = args.emsize, output_size = 1, hidden_size = args.nhid)
if args.model == 5:
    model = RNN_classifier(args.emsize, 1, hidden_size=args.nhid)

Here are the definitions of the models (TCN, LSTM, BiLSTM, GRU and BiGRU) we used in this project. The TCN uses a TemporalConvNet block, a linear layer and a sigmoid. The others use an LSTM, bidirectional LSTM, GRU or bidirectional GRU followed by a linear layer and a sigmoid activation. The code also defines a multi-layer GRU classifier and a plain RNN classifier.

In [None]:
class TCN(nn.Module):

    def __init__(self, input_size
                 , output_size, num_channels,
                 kernel_size=2, dropout=0.3):
        super(TCN, self).__init__()
        self.tcn = TemporalConvNet(input_size, num_channels, kernel_size, dropout=dropout)

        self.decoder = nn.Linear(num_channels[-1], output_size)
        self.sigmoid = nn.Sigmoid()

        self.init_weights()

    def init_weights(self):
        self.decoder.bias.data.fill_(0)
        self.decoder.weight.data.normal_(0, 0.01)

    def forward(self, input):
        """Input ought to have dimension (N, C_in, L_in), where L_in is the seq_len; here the input is (N, L, C)"""
        y = self.tcn(input.transpose(1, 2)).transpose(1, 2)
        y = self.decoder(y[:,-1,:])
        y = self.sigmoid(y)
        return y



class LSTM_classifier(nn.Module):

    def __init__(self, input_size, output_size, hidden_size):
        super(LSTM_classifier, self).__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first = True)
        self.decoder = nn.Linear(hidden_size, output_size)
        self.sigmoid = nn.Sigmoid()


    def forward(self, input):
        #print(input.shape)
        output,(_, _) = self.lstm(input)
        y = self.decoder(output[:,-1])
        y = self.sigmoid(y)
        return y

class LSTM_classifier_bidirectional(nn.Module):

    def __init__(self, input_size, output_size, hidden_size):
        super(LSTM_classifier_bidirectional, self).__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first = True, bidirectional = True)
        self.decoder = nn.Linear(2  * hidden_size, output_size)
        self.sigmoid = nn.Sigmoid()


    def forward(self, input):
        #print(input.shape)
        output,(_, _) = self.lstm(input)
        y = self.decoder(output[:,-1])
        y = self.sigmoid(y)
        return y


class GRU_classifier(nn.Module):

    def __init__(self, input_size, output_size, hidden_size):
        super(GRU_classifier, self).__init__()
        self.GRU = nn.GRU(input_size, hidden_size, batch_first = True)
        self.decoder = nn.Linear(hidden_size, output_size)
        self.sigmoid = nn.Sigmoid()


    def forward(self, input):
        #print(input.shape)
        output,_ = self.GRU(input)
        y = self.decoder(output[:,-1])
        y = self.sigmoid(y)
        return y

class GRU_classifier_bidirectional(nn.Module):

    def __init__(self, input_size, output_size, hidden_size):
        super(GRU_classifier_bidirectional, self).__init__()
        self.GRU = nn.GRU(input_size, hidden_size, batch_first = True, bidirectional = True)
        self.decoder = nn.Linear(hidden_size * 2, output_size)
        self.sigmoid = nn.Sigmoid()


    def forward(self, input):
        #print(input.shape)
        output,_ = self.GRU(input)
        y = self.decoder(output[:,-1])
        y = self.sigmoid(y)
        return y

class GRU_classifier_mlayers(nn.Module):

    def __init__(self, input_size, output_size, hidden_size, num_layers):
        super(GRU_classifier_mlayers, self).__init__()
        self.GRU = nn.GRU(input_size, hidden_size, batch_first = True, num_layers = num_layers)
        self.decoder = nn.Linear(hidden_size, output_size)
        self.sigmoid = nn.Sigmoid()


    def forward(self, input):
        #print(input.shape)
        output,_ = self.GRU(input)
        y = self.decoder(output[:,-1])
        y = self.sigmoid(y)
        return y

class RNN_classifier(nn.Module):
    def __init__(self, input_size, output_size, hidden_size):
        super(RNN_classifier, self).__init__()
        self.RNN = nn.RNN(input_size, hidden_size, batch_first = True)
        self.decoder = nn.Linear(hidden_size, output_size)
        self.sigmoid = nn.Sigmoid()

    def forward(self, input):
        #input = input.permute(1, 0, 2)
        #print(input.shape)
        output, _ = self.RNN(input)
        y = self.decoder(output[:,-1])
        y = self.sigmoid(y)
        return y


Here is the definition of TemporalConvNet, which we took from the code of the paper mentioned above:

In [None]:
class Chomp1d(nn.Module):
    def __init__(self, chomp_size):
        super(Chomp1d, self).__init__()
        self.chomp_size = chomp_size

    def forward(self, x):
        return x[:, :, :-self.chomp_size].contiguous()


class TemporalBlock(nn.Module):
    def __init__(self, n_inputs, n_outputs, kernel_size, stride, dilation, padding, dropout=0.2):
        super(TemporalBlock, self).__init__()
        self.conv1 = weight_norm(nn.Conv1d(n_inputs, n_outputs, kernel_size,
                                           stride=stride, padding=padding, dilation=dilation))
        self.chomp1 = Chomp1d(padding)
        self.relu1 = nn.ReLU()
        self.dropout1 = nn.Dropout(dropout)

        self.conv2 = weight_norm(nn.Conv1d(n_outputs, n_outputs, kernel_size,
                                           stride=stride, padding=padding, dilation=dilation))
        self.chomp2 = Chomp1d(padding)
        self.relu2 = nn.ReLU()
        self.dropout2 = nn.Dropout(dropout)

        self.net = nn.Sequential(self.conv1, self.chomp1, self.relu1, self.dropout1,
                                 self.conv2, self.chomp2, self.relu2, self.dropout2)
        self.downsample = nn.Conv1d(n_inputs, n_outputs, 1) if n_inputs != n_outputs else None
        self.relu = nn.ReLU()
        self.init_weights()

    def init_weights(self):
        self.conv1.weight.data.normal_(0, 0.01)
        self.conv2.weight.data.normal_(0, 0.01)
        if self.downsample is not None:
            self.downsample.weight.data.normal_(0, 0.01)

    def forward(self, x):
        out = self.net(x)
        res = x if self.downsample is None else self.downsample(x)
        return self.relu(out + res)


class TemporalConvNet(nn.Module):
    def __init__(self, num_inputs, num_channels, kernel_size=2, dropout=0.2):
        super(TemporalConvNet, self).__init__()
        layers = []
        num_levels = len(num_channels)
        for i in range(num_levels):
            dilation_size = 2 ** i
            in_channels = num_inputs if i == 0 else num_channels[i-1]
            out_channels = num_channels[i]
            layers += [TemporalBlock(in_channels, out_channels, kernel_size, stride=1, dilation=dilation_size,
                                     padding=(kernel_size-1) * dilation_size, dropout=dropout)]

        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)
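
The combination of padding and Chomp1d above makes each convolution causal: an output at position t depends only on inputs at positions up to t, and the output has the same length as the input. The pure-Python sketch below reproduces this for a single channel. It is an illustration rather than the library code, and the function name is hypothetical.

```python
def causal_dilated_conv(x, weights, dilation):
    """y[t] = sum_j weights[j] * x[t - (k - 1 - j) * dilation], with zeros before t = 0."""
    k = len(weights)
    pad = (k - 1) * dilation
    padded = [0.0] * pad + list(x)  # left padding only, so no access to future inputs
    return [
        sum(weights[j] * padded[t + j * dilation] for j in range(k))
        for t in range(len(x))
    ]


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = causal_dilated_conv(x, weights=[1.0, 1.0, 1.0], dilation=2)
print(y)  # prints [1.0, 2.0, 4.0, 6.0, 9.0]

# Changing the last input leaves all earlier outputs unchanged.
y_changed = causal_dilated_conv(x[:-1] + [100.0], weights=[1.0, 1.0, 1.0], dilation=2)
print(y_changed[:-1] == y[:-1])  # prints True
```

Padding both sides by (kernel_size - 1) * dilation and then removing the same number of positions from the right end, as TemporalBlock does, gives the same result as left padding alone.
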

In the train function, for each sentence in train_data we compute the output of the model and then optimize the model parameters using the binary cross-entropy loss.

In [None]:
def train():
    # Turn on training mode which enables dropout.
    global train_data
    model.train()
    total_loss = 0
    start_time = time.time()
    cor_num = 0
    indices = torch.randperm(len(train_data))
    for batch_idx, ind in enumerate(indices):
        data, label = train_data[ind][0], train_data[ind][1]
        data = data.view(1, data.size(0), data.size(1))

        label = label.view(1, label.size(0))
        if args.cuda:
            data = data.cuda()
            label = label.cuda()
        optimizer.zero_grad()
        output = model(data)

        if torch.abs(output - label).item() < 0.5:
            cor_num += 1
        loss = criterion(output, label)

        loss.backward()
        if args.clip > 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip)
        optimizer.step()

        total_loss += loss.item()
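
For a single sentence, the loss and the correctness check used in the loop above reduce to two short formulas. The sketch below assumes that criterion is a standard binary cross-entropy (for example nn.BCELoss, which the excerpt does not show); the function names are hypothetical.

```python
import math


def binary_cross_entropy(p, y, eps=1e-12):
    """Loss for predicted probability p and gold label y (0.0 or 1.0)."""
    p = min(max(p, eps), 1.0 - eps)  # keep log() away from zero
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def is_correct(p, y):
    """Same rule as the loop above: correct if the probability is within 0.5 of the label."""
    return abs(p - y) < 0.5


print(round(binary_cross_entropy(0.9, 1.0), 4))  # prints 0.1054
print(round(binary_cross_entropy(0.9, 0.0), 4))  # prints 2.3026
print(is_correct(0.9, 1.0), is_correct(0.5, 0.0))  # prints True False
```

Note the edge case: a probability of exactly 0.5 is counted as incorrect for either label under this rule.
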

To evaluate the model on a dataset, we call the evaluate function, which runs the model on that dataset and returns the loss and accuracy. If it is called with save_output set to True, it also thresholds the predicted probability of irony for each sentence and writes the resulting labels to the predictions-taskA.txt file.

In [None]:
def evaluate(data_source, save_output=False):
    model.eval()
    total_loss = 0
    processed_data_size = 0
    cor_num = 0
    outputs = []
    with torch.no_grad():
        for batch_idx, (data, label) in enumerate(data_source):
            data = data.view(1, data.size(0), data.size(1))
            label = label.view(1, label.size(0))
            if args.cuda:
                data = data.cuda()
                label = label.cuda()
            output = model(data)
            outputs.append(output)

            if torch.abs(output - label).item() < 0.5:
                cor_num += 1
            loss = criterion(output, label)

            total_loss += loss.item()
            processed_data_size += data.size(1)

        if save_output:
            with open('res/predictions-taskA.txt', 'w+') as f:
                for output in outputs:
                    if output <= 0.5:
                        f.write('0\n')
                    else:
                        f.write('1\n')

        return total_loss / len(data_source), (float)(cor_num)/len(data_source)

Finally, we save the model with the best accuracy on the validation dataset. After all epochs, we load the best saved model, run it on the test data, and report its loss and accuracy.

In [None]:
if val_acc >= best_vacc:
    with open("model.pt", 'wb') as f:
        print('Save model!\n')
        torch.save(model, f)
    best_vacc = val_acc
    
# Load the best saved model.
with open("model.pt", 'rb') as f:
    model = torch.load(f)

# Run on test data.
test_loss, test_acc = evaluate(test_data, save_output=True)
print('=' * 89)
print('| End of training | test loss {:5.2f} | test acc {:1.2f}'.format(
    test_loss, test_acc))
print('=' * 89)
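
The selection rule behind this checkpointing can be sketched independently of the training loop. The validation accuracies below are made-up numbers and the function name is hypothetical. The comparison uses >=, as in the excerpt above, so a later epoch wins a tie.

```python
def select_best_epoch(val_accuracies):
    """Return (epoch, accuracy) of the checkpoint that would be kept."""
    best_acc, best_epoch = float("-inf"), None
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc >= best_acc:  # ties overwrite the earlier checkpoint
            best_acc, best_epoch = acc, epoch
    return best_epoch, best_acc


print(select_best_epoch([0.61, 0.66, 0.64, 0.66, 0.63]))  # prints (4, 0.66)
```

Only the kept checkpoint is evaluated on the test data, so the test score reflects the epoch chosen on the validation set.
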

# Experimental Setup

We used the official metrics of the contest. Accuracy and F1 score are the main metrics for comparing models. The script evaluate.py computes all of these metrics.

$$\operatorname{accuracy}=\frac{\text {true positives }+\text {true negatives}}{\text {total number of instances}}$$
$$\text { precision }=\frac{\text { true positives }}{\text { true positives }+\text { false positives }}$$

$$\text {recall}=\frac{\text {true positives}}{\text {true positives }+\text { false negatives }}$$

$$F_{1}=2 \cdot \frac{\text { precision } \cdot \text {recall}}{\text { precision }+\text {recall}}$$
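
The four formulas above translate directly into code. The following sketch computes them from gold and predicted labels, with irony (label 1) as the positive class. It is not the contest's evaluate.py; the function name and the toy labels are made up for illustration.

```python
def binary_metrics(gold, predicted):
    """Return (accuracy, precision, recall, f1) with label 1 as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    tn = sum(1 for g, p in pairs if g == 0 and p == 0)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


gold = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
print(binary_metrics(gold, predicted))  # prints (0.75, 0.75, 0.75, 0.75)
```

In the toy example there are 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, so all four metrics equal 0.75.
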

We have implemented and tested some models:

<ol>
  <li>TCN</li>
  <li>LSTM/Bidirectional LSTM</li>
  <li>GRU/Bidirectional GRU</li>
</ol>

For each of these models we have used two types of pretrained embeddings:

<ol>
  <li>BERT</li>
  <li>Glove</li>
</ol>

Computing BERT embeddings for the whole dataset takes a long time, because the embeddings are context dependent and we cannot use a precomputed vector for each word. The code therefore saves the embeddings the first time they are computed, and each later training run reuses the saved embeddings.

We compared our results with the two contest baselines and the models of the top 5 competitors. By F1 score, the GRU with pre-trained BERT embeddings would have placed second in the contest.

The two contest baselines are a random binary model, which assigns classes at random, and an SVM model using tf-idf vectors. The top 5 competitors used various models and several techniques to improve their results; a short description of their approaches is given in the "SemEval-2018 Task 3: Irony Detection in English Tweets" report.

We used a GTX 1080 Ti for training our models.



# Results

We ran each method 5 times and report the best run.
Our methods are TCN, GRU, bidirectional GRU, LSTM, and bidirectional LSTM, each evaluated with both BERT and Glove embeddings. For comparison, we also include the top 5 models and the baseline methods from the competition.
The competition ranked methods by F1 score, and we follow the same convention for our methods.

In [30]:
data = [['method', 'acc', 'precision', 'recall', 'F1'],
        ['THU_NGN', '0.735', '0.630', '0.801', '0.705'],
        ['NTUA-SLP', '0.732' , '0.654', '0.691', '0.672'],
        
        ['WLV' ,'0.643', '0.532', '0.836', '0.650'],
        
        ['NLRPL-IITBHU', '0.661', '0.551', '0.788', '0.648'],
        
        ['NIHRIO', '0.702', '0.609', '0.691', '0.648'],
        
        ['GRU_BERT', '0.737', '0.653', '0.702' ,'0.685'],
        ['Bi_GRU_BERT', '0.731' ,'0.663', '0.563', '0.658'],
        ['LSTM_BERT' ,'0.739' , '0.702', '0.591' , '0.642'],
        ['TCN_BERT' ,'0.673' , '0.571' , '0.714' , '0.634'],
        ['Bi_LSTM_BERT', '0.726', '0.692', '0.556', '0.617'],
        
        ['GRU_Glove', '0.666', '0.562', '0.711', '0.628'],
        ['LSTM_Glove', '0.682', '0.591', '0.650', '0.619'],
        ['Bi_GRU_Glove', '0.681', '0.588' ,'0.653', '0.619'],
        ['Bi_LSTM_Glove', '0.684', '0.603', '0.592' ,'0.597'],
        ['TCN_Glove', '0.634', '0.531', '0.662', '0.589'],
        ['Unigram SVM', '0.635', '0.532', '0.659', '0.589'],
        ['Random', '0.503', '0.373', '0.373', '0.373'],
       ]
display(HTML(
    #'<html><head><style>table, th, td {border: 1px solid black;}</style></head><body>'
   "<table border = 1> <tr>{}</tr></table>".format(
       '</tr><tr>'.join(
           '<td>{}</td>'.format('</td><td>'.join(str(_) for _ in row)) for row in data)
       )
))

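| method | acc | precision | recall | F1 |
|---|---|---|---|---|
| THU_NGN | 0.735 | 0.630 | 0.801 | 0.705 |
| NTUA-SLP | 0.732 | 0.654 | 0.691 | 0.672 |
| WLV | 0.643 | 0.532 | 0.836 | 0.650 |
| NLRPL-IITBHU | 0.661 | 0.551 | 0.788 | 0.648 |
| NIHRIO | 0.702 | 0.609 | 0.691 | 0.648 |
| GRU_BERT | 0.737 | 0.653 | 0.702 | 0.685 |
| Bi_GRU_BERT | 0.731 | 0.663 | 0.563 | 0.658 |
| LSTM_BERT | 0.739 | 0.702 | 0.591 | 0.642 |
| TCN_BERT | 0.673 | 0.571 | 0.714 | 0.634 |
| Bi_LSTM_BERT | 0.726 | 0.692 | 0.556 | 0.617 |
| GRU_Glove | 0.666 | 0.562 | 0.711 | 0.628 |
| LSTM_Glove | 0.682 | 0.591 | 0.650 | 0.619 |
| Bi_GRU_Glove | 0.681 | 0.588 | 0.653 | 0.619 |
| Bi_LSTM_Glove | 0.684 | 0.603 | 0.592 | 0.597 |
| TCN_Glove | 0.634 | 0.531 | 0.662 | 0.589 |
| Unigram SVM | 0.635 | 0.532 | 0.659 | 0.589 |
| Random | 0.503 | 0.373 | 0.373 | 0.373 |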


# Analysis of the Results

As the table above shows, our GRU and LSTM models with BERT embeddings achieve the best accuracy among the competition models. Moreover, the LSTM with BERT embeddings has the best precision. We did not beat the best model on recall or F1 score, but our F1 score would have been enough for second place in the competition.
The lower F1 score is probably caused by the low recall. Better recall seems feasible with more parameter tuning, which we did not have time to do thoroughly.

# Future Work

In this project, we used pre-trained embeddings and compared a TCN with several RNN variants for detecting ironic tweets. In future work, we could augment the data by adding or removing unimportant words. We could also use attention to weight the hidden units used to generate the output. Another option is to concatenate embeddings from different language models to capture different contexts and sentence types, but this requires a tokenizer that works across language models, which seems hard and time consuming, and we did not have the opportunity to explore it. Finally, as mentioned, parameter tuning could also improve performance, but we did not have enough time to investigate it fully.
