# Twitter Sarcasm Detection
This project classifies Twitter responses as sarcastic (`SARCASM`) or not sarcastic (`NOT_SARCASM`), using the response text together with its conversation context.

Dataset Source:

*   "Classification Competition" from CS410 at UIUC
*   https://github.com/CS410Fall2020/ClassificationCompetition

### Setup Google Drive

Google Drive is mounted so that the Colab runtime, which provides the GPU, can read the dataset and save model checkpoints between sessions.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd /content/drive/My Drive/Colab Notebooks/TextClassification

/content/drive/My Drive/Colab Notebooks/TextClassification


In [3]:
!pip install pytorch-transformers
!pip install transformers==3

Collecting pytorch-transformers
  Downloading https://files.pythonhosted.org/packages/a3/b7/d3d18008a67e0b968d1ab93ad444fc05699403fa662f634b2f2c318a508b/pytorch_transformers-1.2.0-py3-none-any.whl (176kB)

In [4]:
# Import Libraries for the project
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import precision_recall_fscore_support

import seaborn as sns
import pandas as pd

import torch
from torchtext.data import Field, TabularDataset, BucketIterator, Iterator
from torch.utils.data import DataLoader

from transformers import RobertaTokenizer, RobertaModel, AdamW, get_linear_schedule_with_warmup, AlbertTokenizer, AlbertModel

import warnings
warnings.filterwarnings('ignore')

import logging
logging.getLogger("transformers.tokenization_utils_base").setLevel(logging.ERROR)

import csv

In [5]:
## Cuda availability
print(torch.cuda.is_available())

True


### Read JSONL Data and Preprocessing
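After the JSONL files are read into pandas DataFrames, we add derived columns: a string built from the most recent context tweets, a numeric target, and a lowercased text field that combines the response with that context.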

In [6]:
train_raw = pd.read_json("data/train.jsonl", lines=True, encoding="utf-8")
test_raw = pd.read_json("data/test.jsonl", lines=True, encoding="utf-8")

In [7]:
train_raw['conext_string'] = train_raw.context.apply(lambda x: ' '.join(x[::-1][:3]))
test_raw['conext_string'] = test_raw.context.apply(lambda x: ' '.join(x[::-1][:3]))
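
To make the slicing concrete, here is a minimal sketch of what the lambda does to a single `context` list. The helper name `last_turns` and the sample turns are made up for illustration; only the `x[::-1][:3]` expression comes from the notebook, and the sketch assumes that `context` is ordered from oldest to newest tweet.

```python
def last_turns(context, n=3):
    """Return the n most recent context tweets, newest first, as one string."""
    # Assumption: context is ordered oldest -> newest, so reversing it and
    # keeping the first n items selects the n most recent turns.
    return ' '.join(context[::-1][:n])


# Hypothetical conversation with four turns (oldest first).
sample_context = ["turn 1", "turn 2", "turn 3", "turn 4"]
context_string = last_turns(sample_context)
print(context_string)  # turn 4 turn 3 turn 2
```

Conversations with fewer than three turns are kept whole, since slicing past the end of a list is not an error in Python.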

In [8]:
train_raw.head(10)

Unnamed: 0,label,response,context,conext_string
0,SARCASM,@USER @USER @USER I don't get this .. obviousl...,[A minor child deserves privacy and should be ...,@USER If your child isn't named Barron ... #Be...
1,SARCASM,@USER @USER trying to protest about . Talking ...,[@USER @USER Why is he a loser ? He's just a P...,@USER @USER having to make up excuses of why y...
2,SARCASM,@USER @USER @USER He makes an insane about of ...,[Donald J . Trump is guilty as charged . The e...,@USER I ’ ll remember to not support you at th...
3,SARCASM,@USER @USER Meanwhile Trump won't even release...,[Jamie Raskin tanked Doug Collins . Collins lo...,@USER But not half as stupid as Schiff looks ....
4,SARCASM,@USER @USER Pretty Sure the Anti-Lincoln Crowd...,[Man ... y ’ all gone “ both sides ” the apoca...,@USER They already did . Obama said many times...
5,SARCASM,@USER @USER @USER -> per your tag line : never...,[Donald Trump tapped into voters ’ populist sh...,@USER because these privileged white boys are ...
6,SARCASM,@USER @USER he does ! It excites him then he k...,[@USER @USER Coo-Coo . Keep on supporting fema...,@USER @USER do you masturbate to these videos ...
7,SARCASM,"Oh look , it's the #racist @USER offering soli...","[Hi , I'm Dennis , I'll be looking after lily'...",@USER Dennis please pass on my love and solida...
8,SARCASM,@USER @USER @USER As they are the biggest bull...,[Tips for children and young people from @USER...,@USER @USER @USER Please forward on to the Soc...
9,SARCASM,@USER @USER @USER responds to facts by tossing...,[The response of Sanders ' team to his quote f...,"@USER Careful , Bernie ’ s supporters get trig..."


In [9]:
encode_label = {'NOT_SARCASM' : 0, 'SARCASM' : 1}

train_raw['target'] = train_raw['label'].map(encode_label)
train_raw['all_string'] = train_raw['response'] + ". " + train_raw['conext_string']
test_raw['all_string'] = test_raw['response'] + ". " + test_raw['conext_string']

In [10]:
train_raw['all_string'] = train_raw['all_string'].str.lower()
test_raw['all_string'] = test_raw['all_string'].str.lower()

In [11]:
train_raw.head(10)

Unnamed: 0,label,response,context,conext_string,target,all_string
0,SARCASM,@USER @USER @USER I don't get this .. obviousl...,[A minor child deserves privacy and should be ...,@USER If your child isn't named Barron ... #Be...,1,@user @user @user i don't get this .. obviousl...
1,SARCASM,@USER @USER trying to protest about . Talking ...,[@USER @USER Why is he a loser ? He's just a P...,@USER @USER having to make up excuses of why y...,1,@user @user trying to protest about . talking ...
2,SARCASM,@USER @USER @USER He makes an insane about of ...,[Donald J . Trump is guilty as charged . The e...,@USER I ’ ll remember to not support you at th...,1,@user @user @user he makes an insane about of ...
3,SARCASM,@USER @USER Meanwhile Trump won't even release...,[Jamie Raskin tanked Doug Collins . Collins lo...,@USER But not half as stupid as Schiff looks ....,1,@user @user meanwhile trump won't even release...
4,SARCASM,@USER @USER Pretty Sure the Anti-Lincoln Crowd...,[Man ... y ’ all gone “ both sides ” the apoca...,@USER They already did . Obama said many times...,1,@user @user pretty sure the anti-lincoln crowd...
5,SARCASM,@USER @USER @USER -> per your tag line : never...,[Donald Trump tapped into voters ’ populist sh...,@USER because these privileged white boys are ...,1,@user @user @user -> per your tag line : never...
6,SARCASM,@USER @USER he does ! It excites him then he k...,[@USER @USER Coo-Coo . Keep on supporting fema...,@USER @USER do you masturbate to these videos ...,1,@user @user he does ! it excites him then he k...
7,SARCASM,"Oh look , it's the #racist @USER offering soli...","[Hi , I'm Dennis , I'll be looking after lily'...",@USER Dennis please pass on my love and solida...,1,"oh look , it's the #racist @user offering soli..."
8,SARCASM,@USER @USER @USER As they are the biggest bull...,[Tips for children and young people from @USER...,@USER @USER @USER Please forward on to the Soc...,1,@user @user @user as they are the biggest bull...
9,SARCASM,@USER @USER @USER responds to facts by tossing...,[The response of Sanders ' team to his quote f...,"@USER Careful , Bernie ’ s supporters get trig...",1,@user @user @user responds to facts by tossing...


In [12]:
test_raw.head(10)

Unnamed: 0,id,response,context,conext_string,all_string
0,twitter_1,"@USER @USER @USER My 3 year old , that just fi...","[Well now that ’ s problematic AF <URL>, @USER...",@USER @USER @USER No .. he actually in the gif...,"@user @user @user my 3 year old , that just fi..."
1,twitter_2,@USER @USER How many verifiable lies has he to...,[Last week the Fake News said that a section o...,@USER The mainstream media doesn't report the ...,@user @user how many verifiable lies has he to...
2,twitter_3,@USER @USER @USER Maybe Docs just a scrub of a...,[@USER Let ’ s Aplaud Brett When he deserves i...,@USER @USER He did try keep korkmaz in in the ...,@user @user @user maybe docs just a scrub of a...
3,twitter_4,@USER @USER is just a cover up for the real ha...,[Women generally hate this president . What's ...,@USER I've hated him before he was placed in o...,@user @user is just a cover up for the real ha...
4,twitter_5,@USER @USER @USER The irony being that he even...,"[Dear media Remoaners , you excitedly sharing ...",@USER @USER Quite an articulate and considered...,@user @user @user the irony being that he even...
5,twitter_6,@USER @USER Doesn't matter . Those guys weren'...,[Wilt Chamberlain rejects the skyhook twice in...,@USER plus he ’ s around 34 years old at that ...,@user @user doesn't matter . those guys weren'...
6,twitter_7,"@USER @USER @USER So , my #kindnesscascade are...",[I want to start something magical . I don ’ t...,@USER @USER @USER It really was . I'm packing ...,"@user @user @user so , my #kindnesscascade are..."
7,twitter_8,@USER @USER @USER They need to be an MSP to be...,[He ’ s finished . If true this is grooming an...,@USER @USER I think it will be Cherry & I susp...,@user @user @user they need to be an msp to be...
8,twitter_9,@USER @USER @USER In which Constitution is it ...,[Now students can ’ t bring stones in librarie...,@USER this one ? @USER aap to bahut logical ha...,@user @user @user in which constitution is it ...
9,twitter_10,@USER @USER ... he says while the GOP is overw...,[One of these things is not like the others . ...,@USER It's more diverse than the Democratic de...,@user @user ... he says while the gop is overw...


In [13]:
train_raw.to_csv("data/train_new.csv")
test_raw.to_csv("data/test_new.csv")

References:

*   https://towardsdatascience.com/fine-tuning-bert-and-roberta-for-high-accuracy-text-classification-in-pytorch-c9e63cf64646
*   https://github.com/aramakus/ML-and-Data-Analysis/blob/master/RoBERTa%20for%20text%20classification.ipynb

In [14]:
# Fix the random seed so that runs are repeatable
torch.manual_seed(17)

# Use cuda if available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

print(device)

cuda:0


### Create Dataset & Iterators
From the preprocessed CSV files we build torchtext datasets and iterators that feed fixed-length batches of token ids to the model during training, validation, and prediction.
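
The fixed-length padding that the text field applies, and the pad-based attention mask that the training loop later derives from it, can be sketched in plain Python. This is only an illustration under assumptions: it does not use torchtext, the helper names and token ids are invented, and it assumes padding and truncation both happen on the right.

```python
def pad_or_truncate(token_ids, length, pad_id):
    """Cut token_ids to `length`, or right-pad it with pad_id up to `length`."""
    clipped = token_ids[:length]
    return clipped + [pad_id] * (length - len(clipped))


def attention_mask(token_ids, pad_id):
    """1 for real tokens, 0 for padding positions."""
    return [int(t != pad_id) for t in token_ids]


PAD_ID = 0  # assumed pad id, for illustration only
short_seq = pad_or_truncate([2, 45, 67, 3], length=6, pad_id=PAD_ID)
long_seq = pad_or_truncate([2, 45, 67, 89, 12, 34, 56, 3], length=6, pad_id=PAD_ID)

print(short_seq)                          # [2, 45, 67, 3, 0, 0]
print(long_seq)                           # [2, 45, 67, 89, 12, 34]
print(attention_mask(short_seq, PAD_ID))  # [1, 1, 1, 1, 0, 0]
```

The second example also shows a side effect worth keeping in mind: when a sequence is longer than the fixed length, its final token is cut off, which for a tokenizer-encoded sequence would normally be the closing special token.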

In [16]:
# Load the pretrained tokenizer; its encode() method maps text to token ids
tokenizer = AlbertTokenizer.from_pretrained("albert-base-v1")

# Ids of the padding and unknown tokens, plus sequence length and batch size
pad = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
unk = tokenizer.convert_tokens_to_ids(tokenizer.unk_token)
seq_len = 256
size_of_batch = 16

# Field definitions: token ids padded or truncated to seq_len, and integer labels
text_field = Field(use_vocab=False, 
                   tokenize=tokenizer.encode, 
                   include_lengths=False, 
                   batch_first=True,
                   fix_length=seq_len, 
                   pad_token=pad, 
                   unk_token=unk)
label_field = Field(sequential=False, use_vocab=False, batch_first=True)

allstring_tuple = ('all_string', text_field)
target_tuple = ('target', label_field)

fields = {'all_string' : allstring_tuple, 'target' : target_tuple}


train_data, valid_data = TabularDataset(path="data/train_new.csv", 
                                        format='CSV', 
                                        fields=fields, 
                                        skip_header=False).split(split_ratio=[0.80, 0.2], 
                                        stratified=True, 
                                        strata_field='target')

train_iter, valid_iter = BucketIterator.splits((train_data, valid_data),
                                               batch_size=size_of_batch,
                                               device=device,
                                               shuffle=True,
                                               sort_key=lambda x: len(x.all_string), 
                                               sort=True, 
                                               sort_within_batch=False)
id_field = Field(use_vocab=True, sequential=False)
fields2 = {'all_string' : allstring_tuple}
test_data = TabularDataset(path="data/test_new.csv", format='CSV', 
                           fields=fields2, skip_header=False)

test_iter = Iterator(test_data, batch_size=size_of_batch, device=device, 
                     train=False, shuffle=False, sort=False)

### Albert Architecture
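Here we use a modified ALBERT architecture: a pretrained `albert-base-v1` encoder with a small classification head on top. ALBERT applies two parameter-reduction techniques, factorized embedding parameterization and cross-layer parameter sharing, to lower memory use and speed up training relative to BERT.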
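
As a rough illustration of one of the two techniques, factorized embedding parameterization, the sketch below counts embedding parameters with and without the factorization. The helper `embedding_params` is hypothetical, biases are ignored, and the sizes (vocabulary 30,000, hidden size 768, embedding size 128) are our assumption of the base configuration rather than values read from the loaded model.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Parameter count of the token-embedding block (biases ignored).

    Without factorization the table is vocab_size x hidden_size. With
    factorization it is a vocab_size x embedding_size table followed by an
    embedding_size x hidden_size projection.
    """
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size


VOCAB, HIDDEN, EMBED = 30_000, 768, 128  # assumed base-size configuration

full = embedding_params(VOCAB, HIDDEN)
factorized = embedding_params(VOCAB, HIDDEN, EMBED)
print(full)        # 23040000
print(factorized)  # 3938304
```

Under these assumed sizes the factorized embedding block is roughly six times smaller. The second technique, cross-layer parameter sharing, reuses one set of layer weights across the encoder stack and is not shown here.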

In [35]:
# ALBERT encoder with a small classification head (model kept as is from the tutorial)
class AlbertClassifier(torch.nn.Module):
    def __init__(self, dropout_rate=0.3):
        super(AlbertClassifier, self).__init__()
        
        self.albert = AlbertModel.from_pretrained('albert-base-v1')
        self.d1 = torch.nn.Dropout(dropout_rate)
        self.l1 = torch.nn.Linear(768, 64)   # 768 is the encoder's hidden size
        self.bn1 = torch.nn.LayerNorm(64)    # layer normalization, despite the name
        self.d2 = torch.nn.Dropout(dropout_rate)
        self.l2 = torch.nn.Linear(64, 2)     # two logits: NOT_SARCASM and SARCASM
        
    def forward(self, input_ids, attention_mask):
        # With transformers==3 the encoder returns a tuple; the second item is the pooled output
        _, x = self.albert(input_ids=input_ids, attention_mask=attention_mask)
        x = self.d1(x)
        x = self.l1(x)
        x = self.bn1(x)
        x = torch.nn.ReLU()(x)
        x = self.d2(x)
        x = self.l2(x)
        
        # Raw logits; CrossEntropyLoss applies the softmax internally
        return x

model_criterion = torch.nn.CrossEntropyLoss()

### Training Function

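This section defines the function that fine-tunes the ALBERT classifier, evaluates it on the validation set at the end of every epoch, and saves the checkpoint with the lowest validation loss.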

In [32]:
# train function for training the model

def train(model, criterion, optimizer, training, validation, scheduler, num_epochs, output_path = '/content/drive/My Drive/Colab Notebooks/TextClassification'):
    
    # Initialize variables
    max_loss = float('Inf')
    loss_training_list = []
    loss_validation_list = []
    training_loss = 0.0
    validation_loss = 0.0
    train_size = len(training)
    validation_size = len(validation)
    step = 0
    step_list = []
    model.train()

    for epoch in range(num_epochs):
        for (source, target), _ in training:
            mask = (source != pad).type(torch.uint8)
            y_pred = model(input_ids=source, attention_mask=mask)
            
            loss = criterion(y_pred, target)
            loss.backward()

            optimizer.step()    
            scheduler.step()    
            optimizer.zero_grad()
            
            step += 1
            training_loss += loss.item()
            if step % train_size == 0:
                model.eval()
                pred = []
                actual = []

                with torch.no_grad():                    
                    for (source, target), _ in validation:
                        mask = (source != pad).type(torch.uint8)
                        y_pred = model(input_ids=source, attention_mask=mask)
                        
                        loss = criterion(y_pred, target)
                        validation_loss += loss.item()
                        pred.extend(torch.argmax(y_pred, axis=-1).tolist())
                        actual.extend(target.tolist())

                # Store summary data
                step_list.append(step)
                training_loss = training_loss / train_size
                validation_loss = validation_loss / validation_size
                loss_training_list.append(training_loss)
                loss_validation_list.append(validation_loss)

                # print summary
                print('Epoch [{}/{}], global step [{}/{}], Train Loss: {:.4f}, Valid Loss: {:.4f}, precision, recall, f1:'
                      .format(epoch+1, num_epochs, step, num_epochs*train_size,
                              training_loss, validation_loss), precision_recall_fscore_support(actual, pred, average='macro'))
                
                if validation_loss < max_loss:
                    max_loss = validation_loss
                    save_checkpoint(output_path + '/model.pkl', model, max_loss)
                        
                training_loss = 0.0                
                validation_loss = 0.0
                model.train()
    
    print('Training complete')
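
The summary line printed by `train` reports macro-averaged precision, recall, and F1: each metric is computed per class and then averaged with equal weight. The following is a plain-Python sketch of that averaging, not scikit-learn's implementation. The helper `macro_prf` and the sample labels are hypothetical, and undefined ratios are set to 0, which we understand to be scikit-learn's default behaviour.

```python
def macro_prf(actual, predicted, labels=(0, 1)):
    """Macro-averaged precision, recall and F1, with 0 for undefined ratios."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(actual, predicted))
        tp = sum(a == label and p == label for a, p in pairs)
        fp = sum(a != label and p == label for a, p in pairs)
        fn = sum(a == label and p != label for a, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical balanced validation set where every example is predicted as class 1.
actual = [0] * 500 + [1] * 500
predicted = [1] * 1000
precision, recall, f1 = macro_prf(actual, predicted)
print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.25 0.5 0.3333
```

These values have the same shape as a summary of `(0.25, 0.5, 0.333..., None)`, which is what one would expect while a model still predicts a single class on a balanced split. The trailing `None` in the printed tuple is the support field, which scikit-learn leaves empty when an average is requested.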

In [23]:
# Functions for saving and loading checkpoints
def save_checkpoint(path, model, validation_loss):
    torch.save({'model_state_dict': model.state_dict(), 'validation_loss': validation_loss}, path)

    
def load_checkpoint(path, model):    
    state_dict = torch.load(path, map_location=device)
    model.load_state_dict(state_dict['model_state_dict'], strict=False)
    return state_dict['validation_loss']

In [36]:
NUM_EPOCHS = 12
steps_per_epoch = len(train_iter)

model = AlbertClassifier(0.3)
model = model.to(device)

print("======================= Start training =================================")

optimizer = AdamW(model.parameters(), lr=1e-5)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=steps_per_epoch*2, num_training_steps=steps_per_epoch*NUM_EPOCHS)

train(model=model, criterion=model_criterion, training=train_iter, validation=valid_iter, optimizer=optimizer, scheduler=scheduler, num_epochs=NUM_EPOCHS)

Epoch [1/12], global step [250/3000], Train Loss: 0.6812, Valid Loss: 0.7983, precision, recall, f1: (0.25, 0.5, 0.3333333333333333, None)
Epoch [2/12], global step [500/3000], Train Loss: 0.6247, Valid Loss: 0.5748, precision, recall, f1: (0.7143180641821947, 0.712, 0.7112191365452183, None)
Epoch [3/12], global step [750/3000], Train Loss: 0.5223, Valid Loss: 0.4993, precision, recall, f1: (0.7530819985675359, 0.753, 0.7529799913793017, None)
Epoch [4/12], global step [1000/3000], Train Loss: 0.4643, Valid Loss: 0.4756, precision, recall, f1: (0.780448717948718, 0.78, 0.7799119647859143, None)
Epoch [5/12], global step [1250/3000], Train Loss: 0.4123, Valid Loss: 0.4980, precision, recall, f1: (0.7733947820653022, 0.773, 0.7729180234064497, None)
Epoch [6/12], global step [1500/3000], Train Loss: 0.3546, Valid Loss: 0.5137, precision, recall, f1: (0.7660042560680971, 0.766, 0.765999063996256, None)
Epoch [7/12], global step [1750/3000], Train Loss: 0.3009, Valid Loss: 0.5701, precisi
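
The learning rate during this run is not constant: the scheduler warms it up over the first two epochs and then decays it linearly to zero. Below is a sketch of the multiplier we understand `get_linear_schedule_with_warmup` to apply to the base learning rate. The function `linear_warmup_factor` is our own illustration, not the library code, and the step counts are read off the training cell and its log.

```python
def linear_warmup_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to 1 during warmup.
        return step / max(1, warmup_steps)
    # Then decay linearly from 1 down to 0 at total_steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


BASE_LR = 1e-5
STEPS_PER_EPOCH = 250          # from the "global step [250/3000]" log line
WARMUP = STEPS_PER_EPOCH * 2   # two warmup epochs, as in the training cell
TOTAL = STEPS_PER_EPOCH * 12   # twelve epochs in total

for step in (0, 250, 500, 1750, 3000):
    factor = linear_warmup_factor(step, WARMUP, TOTAL)
    print(step, factor, BASE_LR * factor)
```

Under these settings the full learning rate of 1e-5 is reached only at the end of epoch 2, and by the end of epoch 7 it has already dropped back to half of that.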

### Prediction Results

This section defines the evaluation function that predicts labels for the test set and shows how the predictions are written to a text file.

In [37]:
# Evaluation Function

def evaluate(model, test_loader):
    y_pred = []

    model.eval()
    with torch.no_grad():
        for (source), _ in test_loader:
            mask = (source != pad).type(torch.uint8)

            output = model(source, attention_mask=mask)

            y_pred.extend(torch.argmax(output, axis=-1).tolist())

    
    output = pd.DataFrame()
    output['Pred'] = y_pred

    return output

In [39]:


load_checkpoint('/content/drive/My Drive/Colab Notebooks/TextClassification/model.pkl', model)

prediction=evaluate(model, test_iter)

In [40]:
prediction

Unnamed: 0,Pred
0,1
1,1
2,1
3,1
4,1
...,...
1795,1
1796,1
1797,1
1798,0


In [41]:
# Map numeric predictions back to the competition's label strings
decode_label = {0 : 'NOT_SARCASM', 1 : 'SARCASM'}

test_raw['Pred'] = prediction['Pred'].map(decode_label)
test_raw[['id', 'Pred']].to_csv('answer.txt', header=None, index=None, sep=',', quoting=csv.QUOTE_NONE, escapechar = ' ')
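
The resulting file layout, one `id,label` row per test example with no header, can be sketched without pandas. The helper `write_answers` and the sample ids and predictions are made up for illustration, and the rows go to an in-memory buffer rather than to `answer.txt`.

```python
import csv
import io

DECODE_LABEL = {0: 'NOT_SARCASM', 1: 'SARCASM'}


def write_answers(ids, predictions, handle):
    """Write one `id,label` row per test example, with no header."""
    writer = csv.writer(handle, lineterminator='\n')
    for example_id, pred in zip(ids, predictions):
        writer.writerow([example_id, DECODE_LABEL[pred]])


# Made-up ids and predictions, written to an in-memory buffer instead of a file.
buffer = io.StringIO()
write_answers(['twitter_1', 'twitter_2', 'twitter_3'], [1, 0, 1], buffer)
answer_text = buffer.getvalue()
print(answer_text)
```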