# Classification Competition: Twitter Sarcasm Detection
### CS 410 Final Project Source Code by Yang Yang (yangy19@illinois.edu)

## Note for Reviewers

- Please open this notebook in Google Colab.

- Go to **Runtime** -> **Change runtime type**, and make sure **GPU** is selected as the hardware accelerator and **High-RAM** as the runtime shape.

- Go to **Runtime** -> **Run all**. The run takes approximately 5 minutes to complete.

- Before downloading `answer.txt`, you can check the validation F1 score, which is usually ~0.83. Use the **Table of contents** toolbar on the left to navigate to section **7. Evaluation**.

- In the **Files** toolbar on the left, go to **outputs** -> **Twitter_Sarcasm_Detection** to find `answer.txt`.

- Thank you for reviewing my project. Please feel free to contact me if you have any questions.

## Introduction 

In this classification competition, the task is to detect sarcasm in Twitter responses, given the conversation context. To beat the baseline F1 score, the main model used in this project is BERT, one of the state-of-the-art NLP models.

I follow the procedure below to fine-tune the model and improve its performance:

(1) Adapt the BERT classifier code from Hugging Face Transformers and the BERT example from Google Research, and modify the BERT model code so that it also works with other BERT-based models;

(2) Compare the performance of BERT, ALBERT, DistilBERT, SqueezeBERT and XLNet with the same hyperparameters;

(3) Optimize BERT hyperparameters;

(4) Compare BERT model performance when:

- Response and Context are passed separately, as a sequence-pair task.
- Response and Context are concatenated and treated as a single-sequence text classification task.
- Only the Response is used.

Finally, my best model reaches an F1 score of **0.763** on the test dataset, beating the baseline (F1 = **0.723**).

The whole project is developed with the PyTorch framework in the Google Colab environment. The project documentation, voiced presentation and this source code are all available in the [CourseProject GitHub Repo](https://github.com/yangyangsquare/CourseProject).

### Sections

1. [Library Setup](#1.-Library-Setup)

2. [Dataset Loading](#2.-Dataset-Loading)

3. [Parameter Setting](#3.-Parameter-Setting)

4. [Classes and Functions](#4.-Classes-and-Functions)

5. [Data Preparation](#5.-Data-Preparation)

6. [Model Training](#6.-Model-Training)

7. [Evaluation](#7.-Evaluation)

8. [Prediction](#8.-Prediction)

9. [Hyperparameter Tuning](#9.-Hyperparameter-Tuning)

10. [References](#10.-References)


## 1. Library Setup

In [2]:
!pip install datasets
!pip install transformers
!pip install jsonlines
!pip install pytorch_pretrained_bert

In [3]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import os
import csv
import sys
import logging
import random
import jsonlines

from sklearn.model_selection import train_test_split
from sklearn.metrics import matthews_corrcoef, confusion_matrix
from tqdm import tqdm, trange
from tqdm.notebook import tqdm as tqdm_notebook  # replaces the deprecated tqdm.tqdm_notebook
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler, TensorDataset
from transformers import AutoConfig, AutoModel, AutoTokenizer, AutoModelForSequenceClassification
from pytorch_pretrained_bert.optimization import BertAdam, WarmupLinearSchedule
from multiprocessing import Pool, cpu_count

In [4]:
# Check that the full GPU memory is available before running the code
# from https://github.com/patrickvonplaten/notebooks/blob/master/PyTorch_Reformer.ipynb
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip -q install gputil
!pip -q install psutil
!pip -q install humanize
import psutil
import humanize
import os
import GPUtil as GPU

GPUs = GPU.getGPUs()
gpu = GPUs[0]  # use the first GPU reported by GPUtil


def printm():
    """Print free system RAM, process size, and GPU memory statistics."""
    process = psutil.Process(os.getpid())
    print("Gen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available), " | Proc size: " + humanize.naturalsize(process.memory_info().rss))
    print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB".format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))


printm()

Gen RAM Free: 26.2 GB  | Proc size: 543.5 MB
GPU RAM Free: 16270MB | Used: 10MB | Util   0% | Total 16280MB


If GPU utilisation (Util) is not at 0%, uncomment and run the following line to kill all running processes and free the full GPU. Comment the line out again afterwards, so that the notebook does not crash itself on every run.

In [5]:
# !kill -9 -1

## 2. Dataset Loading

In [6]:
# Copy data files from github to Google Colab
!git clone https://github.com/yangyangsquare/CourseProject.git

Cloning into 'CourseProject'...
remote: Enumerating objects: 53, done.
remote: Total 53 (delta 0), reused 0 (delta 0), pack-reused 53
Unpacking objects: 100% (53/53), done.


In [7]:
# Convert the .jsonl files to dataframes, then split off a validation set.
# Rows are collected in one pass because DataFrame.append is slow and was
# removed in pandas 2.0. Columns are ordered explicitly because later cells
# index the .tsv columns by position.
with jsonlines.open('CourseProject/data/train.jsonl') as f:
    df_train = pd.DataFrame(list(f))[['context', 'label', 'response']]

with jsonlines.open('CourseProject/data/test.jsonl') as f:
    df_test = pd.DataFrame(list(f))[['context', 'id', 'response']]

df_train, df_val = train_test_split(df_train, test_size = 0.2, random_state=2020)

df_train['idx'] = df_train.index
df_val['idx'] = df_val.index
df_test['idx'] = df_test.index

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)

In [8]:
print(df_train.shape)
print(df_val.shape)
print(df_test.shape)

print(len(df_train[df_train['label'] == 'NOT_SARCASM']))
print(len(df_train[df_train['label'] == 'SARCASM']))

(4000, 4)
(1000, 4)
(1800, 4)
2010
1990


In [9]:
df_train.head()

|   | context | label | response | idx |
|---|---|---|---|---|
| 0 | [Continuing to map out the general whereabouts... | NOT_SARCASM | @USER @USER @USER I like it as well ! We did s... | 3966 |
| 1 | [I ’ m confused as to how you can be in this i... | NOT_SARCASM | @USER @USER Also , the type of guy who wants h... | 4268 |
| 2 | [@USER stated " The Majority of Americans love... | SARCASM | @USER @USER I can only imagine the fun KingGeo... | 2048 |
| 3 | [Progressives Admit They're Just Not Used To K... | NOT_SARCASM | @USER @USER @USER I don't agree with killing t... | 4935 |
| 4 | [And you better believe I ain't going to follo... | SARCASM | @USER please don't call your chihuahua " bear ... | 1005 |


In [10]:
df_test.head()

|   | context | id | response | idx |
|---|---|---|---|---|
| 0 | [Well now that ’ s problematic AF \<URL>, @USER... | twitter_1 | @USER @USER @USER My 3 year old , that just fi... | 0 |
| 1 | [Last week the Fake News said that a section o... | twitter_2 | @USER @USER How many verifiable lies has he to... | 1 |
| 2 | [@USER Let ’ s Aplaud Brett When he deserves i... | twitter_3 | @USER @USER @USER Maybe Docs just a scrub of a... | 2 |
| 3 | [Women generally hate this president . What's ... | twitter_4 | @USER @USER is just a cover up for the real ha... | 3 |
| 4 | [Dear media Remoaners , you excitedly sharing ... | twitter_5 | @USER @USER @USER The irony being that he even... | 4 |


In [11]:
df_train.to_csv('CourseProject/data/train.tsv', sep='\t', index=False, header=False)
df_val.to_csv('CourseProject/data/val.tsv', sep='\t', index=False, header=False)
df_test.to_csv('CourseProject/data/test.tsv', sep='\t', index=False, header=False)

## 3. Parameter Setting

In [12]:
# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
logging.basicConfig(level=logging.INFO)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The method of utilizing "Response" and "Context" data. Select from the list:
# 1) "ResponseContext_Separate": text_a = Response, text_b = Context
# 2) "ResponseContext_Connect": text_a = Response + Context, text_b = None
# 3) "Response_Only": text_a = Response, text_b = None
METHOD = "ResponseContext_Separate"

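# Supported pre-trained checkpoints, grouped by model family. BERT_MODEL below
# must be one of these names, because the training and evaluation loops check
# list membership to decide which inputs to pass to the model.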
BERT_PRETRAINED = ['bert-base-uncased', 'bert-large-uncased', 'bert-base-cased', 'bert-large-cased']
distilBERT_PRETRAINED = ['distilbert-base-uncased', 'distilbert-base-uncased-distilled-squad', 'distilbert-base-cased', 'distilbert-base-cased-distilled-squad']
alBERT_PRETRAINED = ['albert-base-v1', 'albert-base-v2', 'albert-large-v1', 'albert-large-v2']
XLNet_PRETRAINED = ['xlnet-base-cased', 'xlnet-large-cased']
SqueezeBERT_PRETRAINED = ['squeezebert/squeezebert-uncased', 'squeezebert/squeezebert-mnli', 'squeezebert/squeezebert-mnli-headless']

BERT_MODEL = 'bert-base-uncased'

# The name of the task to train.
TASK_NAME = "Twitter_Sarcasm_Detection"

# The input data dir. contains the .tsv files (or other data files) for the task.
DATA_DIR = 'CourseProject/data/'

# The output directory is where the fine-tuned model and checkpoints will be written.
OUTPUT_DIR = f'outputs/{TASK_NAME}/'

# The reports directory is where the evaluation reports will be written to.
REPORTS_DIR = f'reports/{TASK_NAME}_evaluation_report/'

# This is where BERT will look for pre-trained models to load parameters from.
CACHE_DIR = 'cache/'

# The maximum total input sequence length after WordPiece tokenization.
# Sequences longer than this will be truncated, and sequences shorter than this will be padded.
MAX_SEQ_LENGTH = 128

TRAIN_BATCH_SIZE = 16
VAL_BATCH_SIZE = 32
LEARNING_RATE = 2e-5
NUM_TRAIN_EPOCHS = 2
RANDOM_SEED = 2020
GRADIENT_ACCUMULATION_STEPS = 2
WARMUP_PROPORTION = 0.1

OUTPUT_MODE = "classification"
CONFIG_NAME = "config.json"
WEIGHTS_NAME = "pytorch_model.bin"
output_mode = OUTPUT_MODE
cache_dir = CACHE_DIR

# Write each run's evaluation report to a fresh numbered sub-directory.
os.makedirs(REPORTS_DIR, exist_ok=True)
REPORTS_DIR = os.path.join(REPORTS_DIR, f'report_{len(os.listdir(REPORTS_DIR))}')
os.makedirs(REPORTS_DIR)

# Refuse to overwrite the outputs of an earlier run.
if os.path.exists(OUTPUT_DIR) and os.listdir(OUTPUT_DIR):
    raise ValueError("Output directory ({}) already exists and is not empty.".format(OUTPUT_DIR))
os.makedirs(OUTPUT_DIR, exist_ok=True)
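
With gradient accumulation, the optimizer only steps once every `GRADIENT_ACCUMULATION_STEPS` mini-batches, so the effective batch size and the number of optimizer updates follow from the values above. The arithmetic sketch below reuses the parameter values from this cell and assumes the 4,000 training examples reported in Section 2; the variable names `effective_batch_size`, `total_updates` and `warmup_updates` are illustrative only.

```python
# Values copied from the parameter cell. The training-set size is the one
# printed in Section 2 and is an assumption if the split changes.
TRAIN_BATCH_SIZE = 16
GRADIENT_ACCUMULATION_STEPS = 2
NUM_TRAIN_EPOCHS = 2
WARMUP_PROPORTION = 0.1
num_train_examples = 4000

# Gradients from this many examples are accumulated before each optimizer step.
effective_batch_size = TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS

# Optimizer updates per epoch, times the number of epochs.
updates_per_epoch = int(num_train_examples / TRAIN_BATCH_SIZE / GRADIENT_ACCUMULATION_STEPS)
total_updates = updates_per_epoch * NUM_TRAIN_EPOCHS

# Number of updates spent ramping the learning rate up.
warmup_updates = round(total_updates * WARMUP_PROPORTION)

print(effective_batch_size, total_updates, warmup_updates)
```

Under these assumptions the model sees 32 examples per update and is updated 250 times in total.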

## 4. Classes and Functions

In [13]:
class InputExample(object):
    """A single training/test example for simple sequence classification."""

    def __init__(self, guid, text_a, text_b=None, label=None):
        """Constructs a InputExample.
        Args:
            guid: Unique id for the example.
            text_a: string. The untokenized text of the first sequence. For single
            sequence tasks, only this sequence must be specified.
            text_b: (Optional) string. The untokenized text of the second sequence.
            Only must be specified for sequence pair tasks.
            label: (Optional) string. The label of the example. This should be
            specified for train and dev examples, but not for test examples.
        """
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label


class DataProcessor(object):
    """Base class for data converters for sequence classification data sets."""

    def get_train_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the train set."""
        raise NotImplementedError()

    def get_val_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the val set."""
        raise NotImplementedError()

    def get_test_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the test set."""
        raise NotImplementedError()

    def get_labels(self):
        """Gets the list of labels for this data set."""
        raise NotImplementedError()

    @classmethod
    def _read_tsv(cls, input_file, quotechar=None):
        """Reads a tab-separated value file into a list of rows."""
        with open(input_file, "r", encoding="utf-8") as f:
            reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
            return list(reader)


class BinaryClassificationProcessor(DataProcessor):
    """Processor for binary classification dataset."""

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_val_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "val.tsv")), "val")

    def get_test_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

    def get_labels(self):
        """See base class."""
        return ["NOT_SARCASM", "SARCASM"]

    def _create_examples(self, lines, set_type):
        """Creates examples for the training, validation and test sets."""
        # .tsv column order: 0 = context, 1 = label (or id for test), 2 = response, 3 = idx
        examples = []
        for (i, line) in enumerate(lines):
            guid = "%s-%s" % (set_type, i)

            if METHOD == "ResponseContext_Separate":
                text_a = line[2]
                text_b = line[0]
            elif METHOD == "ResponseContext_Connect":
                text_a = line[2] + line[0]
                text_b = None
            elif METHOD == "Response_Only":
                text_a = line[2]
                text_b = None
            else:
                raise ValueError("Unknown METHOD: {}".format(METHOD))

            if set_type == "test":
                # The test set is unlabeled; use a placeholder label.
                label = "SARCASM"
            else:
                label = line[1]
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples
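
The three `METHOD` options differ only in how the Response and Context strings are assigned to `text_a` and `text_b`. The sketch below isolates that mapping in a hypothetical helper, `build_text_pair`, which is not part of the notebook; the two sample strings are made up for illustration.

```python
from typing import Optional, Tuple


def build_text_pair(method: str, response: str, context: str) -> Tuple[str, Optional[str]]:
    """Hypothetical helper mirroring the METHOD branches in _create_examples."""
    if method == "ResponseContext_Separate":
        return response, context          # sequence pair: response first, context second
    if method == "ResponseContext_Connect":
        return response + context, None   # one sequence holding both texts
    if method == "Response_Only":
        return response, None             # context is ignored
    raise ValueError(f"Unknown METHOD: {method}")


# Made-up sample strings; in the .tsv files the context is a stringified list.
response = "@USER sure , great idea"
context = "['We should cancel the weekend']"

for method in ("ResponseContext_Separate", "ResponseContext_Connect", "Response_Only"):
    text_a, text_b = build_text_pair(method, response, context)
    print(f"{method}: text_a={text_a!r}, text_b={text_b!r}")
```

Only the first option produces a non-empty `text_b`, so it is the only one for which the second segment of the model input is populated.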

In [14]:
class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self, input_ids, input_mask, segment_ids, label_id):
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.label_id = label_id

def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Truncates a sequence pair in place to the maximum length."""

    # This is a simple heuristic which will always truncate the longer sequence
    # one token at a time. This makes more sense than truncating an equal percent
    # of tokens from each, since if one sequence is very short then each token
    # that's truncated likely contains more information than a longer sequence.
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_length:
            break
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()

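def convert_example_to_feature(example_row):
    """Converts one example into fixed-length BERT inputs.

    `example_row` is a tuple (example, label_map, max_seq_length, tokenizer,
    output_mode), so that the function can be mapped over a process pool.
    """
    example, label_map, max_seq_length, tokenizer, output_mode = example_row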

    tokens_a = tokenizer.tokenize(example.text_a)
    tokens_b = None
    if example.text_b:
        tokens_b = tokenizer.tokenize(example.text_b)
        # Modifies `tokens_a` and `tokens_b` in place so that the total
        # length is less than the specified length.
        # Account for [CLS], [SEP], [SEP] with "- 3"
        _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
    else:
        # Account for [CLS] and [SEP] with "- 2"
        if len(tokens_a) > max_seq_length - 2:
            tokens_a = tokens_a[:(max_seq_length - 2)]

    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)

    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)

    input_ids = tokenizer.convert_tokens_to_ids(tokens)

    # The mask has 1 for real tokens and 0 for padding tokens. Only real
    # tokens are attended to.
    input_mask = [1] * len(input_ids)

    # Zero-pad up to the sequence length.
    padding = [0] * (max_seq_length - len(input_ids))
    input_ids += padding
    input_mask += padding
    segment_ids += padding

    assert len(input_ids) == max_seq_length
    assert len(input_mask) == max_seq_length
    assert len(segment_ids) == max_seq_length

    if output_mode == "classification":
        label_id = label_map[example.label]
    elif output_mode == "regression":
        label_id = float(example.label)
    else:
        raise KeyError(output_mode)

    return InputFeatures(input_ids=input_ids,
                         input_mask=input_mask,
                         segment_ids=segment_ids,
                         label_id=label_id)
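
To make the layout of the three feature lists concrete, here is a toy sketch that skips the real tokenizer and works on already-split tokens. `toy_features` is a hypothetical stand-in for the pair branch of `convert_example_to_feature`, and it returns token strings rather than vocabulary ids so the result is readable; the sample tokens are invented.

```python
def toy_features(tokens_a, tokens_b, max_seq_length):
    """Build BERT-style pair inputs from pre-split tokens (illustrative only)."""
    tokens_a, tokens_b = list(tokens_a), list(tokens_b)

    # Reserve room for [CLS], [SEP], [SEP]; always trim the longer sequence.
    while len(tokens_a) + len(tokens_b) > max_seq_length - 3:
        (tokens_a if len(tokens_a) > len(tokens_b) else tokens_b).pop()

    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    input_mask = [1] * len(tokens)  # 1 marks real tokens

    # Pad all three lists to the fixed length; padding gets mask 0.
    pad = max_seq_length - len(tokens)
    return tokens + ["[PAD]"] * pad, segment_ids + [0] * pad, input_mask + [0] * pad


response = ["oh", "great", ",", "another", "monday"]
context = ["meeting", "at", "8am"]

# Short enough to fit: one padding position is appended.
padded = toy_features(response, context, max_seq_length=12)
# Too long: the longer sequence (the response) loses its last token.
truncated = toy_features(response, context, max_seq_length=10)

for name, (tokens, segment_ids, input_mask) in (("padded", padded), ("truncated", truncated)):
    print(name)
    for row in zip(tokens, segment_ids, input_mask):
        print("  ", row)
```

Segment id 0 covers `[CLS]`, the response and the first `[SEP]`; segment id 1 covers the context and the final `[SEP]`.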

In [15]:
def set_seed(seed):
    """ Set all seeds to make results reproducible """
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    np.random.seed(seed)
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)

In [16]:
def get_eval_report(task_name, labels, preds, batch_size, lr, ep, gas):
    mcc = matthews_corrcoef(labels, preds)
    # Fix the label order so the matrix is always 2x2, even if one class is absent.
    tn, fp, fn, tp = confusion_matrix(labels, preds, labels=[0, 1]).ravel()
    return {
        "task": task_name,
        "Batch Size": batch_size,
        "Learning Rate": lr,
        "Epochs": ep,
        "Gradient Accumulation Steps": gas,
        "mcc": mcc,
        "tp": tp,
        "tn": tn,
        "fp": fp,
        "fn": fn,
        "Precision": tp / (tp+fp),
        "Recall": tp / (tp+fn),
        "F1": 2*tp / (2*tp + fp + fn)
    }

def compute_metrics(task_name, labels, preds, batch_size, lr, ep, gas):
    assert len(preds) == len(labels)
    return get_eval_report(task_name, labels, preds, batch_size, lr, ep, gas)
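
The report computes F1 directly from the confusion counts as `2*tp / (2*tp + fp + fn)`, which is the harmonic mean of precision and recall written in a different form. The sketch below checks this on a small hand-made example in plain Python; the labels and predictions are invented, and `binary_counts` is a hypothetical helper. Label 1 is treated as the positive class, matching `SARCASM` in the label map.

```python
def binary_counts(labels, preds):
    """Count tp, tn, fp, fn for 0/1 labels, with 1 as the positive class."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    return tp, tn, fp, fn


# Invented toy data, not taken from the competition.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
preds = [1, 1, 0, 0, 1, 1, 1, 0]

tp, tn, fp, fn = binary_counts(labels, preds)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

f1_from_counts = 2 * tp / (2 * tp + fp + fn)
f1_harmonic = 2 * precision * recall / (precision + recall)

print(tp, tn, fp, fn)
print(precision, recall, f1_from_counts, f1_harmonic)
```

Both expressions give the same value, so the shorter count-based form used in `get_eval_report` loses nothing.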

In [17]:
class CSVLogger():
    """Minimal CSV logger that writes a header row, then one row per call."""

    def __init__(self, filename='log.csv', fieldnames=('epoch',)):

        self.filename = filename
        # newline='' lets the csv module control line endings.
        self.csv_file = open(filename, 'w', newline='')

        self.writer = csv.DictWriter(self.csv_file, fieldnames=list(fieldnames))
        self.writer.writeheader()

        self.csv_file.flush()

    def writerow(self, row):
        self.writer.writerow(row)
        self.csv_file.flush()

    def close(self):
        self.csv_file.close()
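
`CSVLogger` is a thin wrapper around `csv.DictWriter`: the header is written once on construction, and every later call appends one row keyed by field name. The sketch below shows the same pattern with an in-memory buffer instead of a file. The field names are a subset of the report keys, and the row values are placeholders, not results from this project.

```python
import csv
import io

fieldnames = ["Epochs", "Learning Rate", "F1"]  # subset of the report keys
buffer = io.StringIO()                          # stands in for the log file

writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
# Placeholder values only, to show the row format.
writer.writerow({"Epochs": 2, "Learning Rate": 2e-5, "F1": 0.5})

log_lines = buffer.getvalue().splitlines()
print(log_lines)
```

Keys missing from `fieldnames` raise a `ValueError` in `DictWriter.writerow` by default, so the field list has to cover every key in the logged dictionary.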

## 5. Data Preparation

In [18]:
set_seed(RANDOM_SEED)

processor = BinaryClassificationProcessor()
train_examples = processor.get_train_examples(DATA_DIR)
train_examples_len = len(train_examples)
val_examples = processor.get_val_examples(DATA_DIR)
val_examples_len = len(val_examples)
test_examples = processor.get_test_examples(DATA_DIR)
test_examples_len = len(test_examples)

label_list = processor.get_labels()
num_labels = len(label_list)

num_train_optimization_steps = int(
    train_examples_len / TRAIN_BATCH_SIZE / GRADIENT_ACCUMULATION_STEPS) * NUM_TRAIN_EPOCHS

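# Load pre-trained model tokenizer (vocabulary).
# Note: do_lower_case=False asks the tokenizer not to lowercase its input, even
# though the default BERT_MODEL is an uncased checkpoint. Remove the argument to
# fall back to the checkpoint's own setting.
tokenizer = AutoTokenizer.from_pretrained(BERT_MODEL, do_lower_case=False)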

label_map = {label: i for i, label in enumerate(label_list)}
train_examples_for_processing = [(example, label_map, MAX_SEQ_LENGTH, tokenizer, OUTPUT_MODE) for example in train_examples]
val_examples_for_processing = [(example, label_map, MAX_SEQ_LENGTH, tokenizer, OUTPUT_MODE) for example in val_examples]
test_examples_for_processing = [(example, label_map, MAX_SEQ_LENGTH, tokenizer, OUTPUT_MODE) for example in test_examples]

In [19]:
process_count = max(1, cpu_count() - 1)  # always keep at least one worker


def convert_examples(name, examples_for_processing):
    """Tokenizes one data split in parallel and returns its list of features."""
    total = len(examples_for_processing)
    print(f'Preparing to convert {total} {name} examples..')
    print(f'Spawning {process_count} processes..')
    with Pool(process_count) as p:
        return list(tqdm(p.imap(convert_example_to_feature, examples_for_processing), total=total))


if __name__ == '__main__':
    train_features = convert_examples('train', train_examples_for_processing)
    print()
    val_features = convert_examples('validation', val_examples_for_processing)
    print()
    test_features = convert_examples('test', test_examples_for_processing)

Preparing to convert 4000 train examples..
Spawning 3 processes..


100%|██████████| 4000/4000 [00:36<00:00, 110.05it/s]



Preparing to convert 1000 validation examples..
Spawning 3 processes..


100%|██████████| 1000/1000 [00:09<00:00, 108.84it/s]



Preparing to convert 1800 test examples..
Spawning 3 processes..


100%|██████████| 1800/1800 [00:16<00:00, 109.48it/s]


In [20]:
# Load pre-trained weights and attach a classification head with one output per label
model = AutoModelForSequenceClassification.from_pretrained(BERT_MODEL, num_labels=num_labels)
model.to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [21]:
logger = logging.getLogger()

param_optimizer = list(model.named_parameters())

# Apply weight decay to all parameters except biases and LayerNorm parameters.
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
    ]

# Adam with learning-rate warmup over the first WARMUP_PROPORTION of the updates.
optimizer = BertAdam(optimizer_grouped_parameters,
                     lr=LEARNING_RATE,
                     warmup=WARMUP_PROPORTION,
                     t_total=num_train_optimization_steps)

logger.info("  Num examples = %d", train_examples_len)
logger.info("  Batch size = %d", TRAIN_BATCH_SIZE)
logger.info("  Num steps = %d", num_train_optimization_steps)
all_input_ids = torch.tensor([f.input_ids for f in train_features], dtype=torch.long)
all_input_mask = torch.tensor([f.input_mask for f in train_features], dtype=torch.long)
all_segment_ids = torch.tensor([f.segment_ids for f in train_features], dtype=torch.long)

val_input_ids = torch.tensor([f.input_ids for f in val_features], dtype=torch.long)
val_input_mask = torch.tensor([f.input_mask for f in val_features], dtype=torch.long)
val_segment_ids = torch.tensor([f.segment_ids for f in val_features], dtype=torch.long)

test_input_ids = torch.tensor([f.input_ids for f in test_features], dtype=torch.long)
test_input_mask = torch.tensor([f.input_mask for f in test_features], dtype=torch.long)
test_segment_ids = torch.tensor([f.segment_ids for f in test_features], dtype=torch.long)

if OUTPUT_MODE == "classification":
    all_label_ids = torch.tensor([f.label_id for f in train_features], dtype=torch.long)
    val_label_ids = torch.tensor([f.label_id for f in val_features], dtype=torch.long)
elif OUTPUT_MODE == "regression":
    all_label_ids = torch.tensor([f.label_id for f in train_features], dtype=torch.float)
    val_label_ids = torch.tensor([f.label_id for f in val_features], dtype=torch.float)

INFO:root:  Num examples = 4000
INFO:root:  Batch size = 16
INFO:root:  Num steps = 250
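
The `warmup` and `t_total` arguments control how the learning rate changes over the 250 updates. The sketch below assumes that `BertAdam` uses its default linear-warmup schedule, that is, a linear ramp from zero during the warmup fraction followed by a linear decay back to zero. The function `warmup_linear` here is a hypothetical re-implementation for illustration, not the library's code.

```python
def warmup_linear(progress: float, warmup: float = 0.1) -> float:
    """Learning-rate multiplier: linear ramp-up, then linear decay to zero.

    `progress` is the fraction of total updates already done (0.0 to 1.0).
    Assumed to approximate BertAdam's default schedule.
    """
    if progress < warmup:
        return progress / warmup
    return max((1.0 - progress) / (1.0 - warmup), 0.0)


LEARNING_RATE = 2e-5   # from the parameter cell
total_updates = 250    # matches "Num steps" logged above

for step in (0, 12, 25, 125, 250):
    lr = LEARNING_RATE * warmup_linear(step / total_updates)
    print(f"update {step:3d}: lr = {lr:.2e}")
```

Under this assumption the peak learning rate of 2e-5 is reached after 25 updates and then shrinks steadily until training ends.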


In [22]:
train_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=TRAIN_BATCH_SIZE)

val_data = TensorDataset(val_input_ids, val_input_mask, val_segment_ids, val_label_ids)
val_sampler = SequentialSampler(val_data)
val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=VAL_BATCH_SIZE)

test_data = TensorDataset(test_input_ids, test_input_mask, test_segment_ids)
test_sampler = SequentialSampler(test_data)
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=VAL_BATCH_SIZE)

## 6. Model Training

In [23]:
model.train()
global_step = 0

for _ in trange(int(NUM_TRAIN_EPOCHS), desc="Epoch"):
    tr_loss = 0
    nb_tr_examples = 0
    nb_tr_steps = 0
    for step, batch in enumerate(tqdm_notebook(train_dataloader, desc="Iteration")):
        batch = tuple(t.to(device) for t in batch)
        input_ids, input_mask, segment_ids, label_ids = batch

        # BERT, ALBERT and SqueezeBERT receive segment ids; DistilBERT and
        # XLNet are called without them here.
        if BERT_MODEL in (BERT_PRETRAINED + alBERT_PRETRAINED + SqueezeBERT_PRETRAINED):
            outputs = model(input_ids=input_ids, attention_mask=input_mask, token_type_ids=segment_ids, labels=label_ids)
        elif BERT_MODEL in (distilBERT_PRETRAINED + XLNet_PRETRAINED):
            outputs = model(input_ids=input_ids, attention_mask=input_mask, labels=label_ids)
        else:
            raise ValueError(f"Unsupported BERT_MODEL: {BERT_MODEL}")

        loss = outputs.loss

        # Scale the loss so that accumulated gradients average over the mini-batches.
        if GRADIENT_ACCUMULATION_STEPS > 1:
            loss = loss / GRADIENT_ACCUMULATION_STEPS

        print(loss)
        loss.backward()

        tr_loss += loss.item()
        nb_tr_examples += input_ids.size(0)
        nb_tr_steps += 1

        # Update the weights once every GRADIENT_ACCUMULATION_STEPS mini-batches.
        if (step + 1) % GRADIENT_ACCUMULATION_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()
            global_step += 1

model_to_save = model.module if hasattr(model, 'module') else model  

output_model_file = os.path.join(OUTPUT_DIR, WEIGHTS_NAME)
output_config_file = os.path.join(OUTPUT_DIR, CONFIG_NAME)

torch.save(model_to_save.state_dict(), output_model_file)
model_to_save.config.to_json_file(output_config_file)
tokenizer.save_vocabulary(OUTPUT_DIR)

tensor(0.3633, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.3206, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.3241, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.3431, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.3531, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.3479, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.3547, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.3285, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.3589, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.3661, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.3605, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.3626, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.3469, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.3377, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.3281, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.3522, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.3504, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.3257, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.3395, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.3677, device='cuda:0', grad_fn=<DivBack

Epoch:  50%|█████     | 1/2 [00:59<00:59, 59.70s/it]

tensor(0.1462, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.1922, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.1652, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.1783, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.1109, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.1228, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.1461, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.0726, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.2542, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.1184, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.0699, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.1923, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.0509, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.1060, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.0943, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.1277, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.1257, device='cuda:0', grad_fn=<DivBackward0>)
tensor(0.2398, device='cuda:0', grad_fn=<DivBack

Epoch: 100%|██████████| 2/2 [01:59<00:00, 59.78s/it]







('outputs/Twitter_Sarcasm_Detection/vocab.txt',)

## 7. Evaluation

In [24]:
model.eval()
eval_loss = 0
nb_eval_steps = 0
preds = []

for input_ids, input_mask, segment_ids, label_ids in tqdm_notebook(val_dataloader, desc="Evaluating"):
    input_ids = input_ids.to(device)
    input_mask = input_mask.to(device)
    segment_ids = segment_ids.to(device)
    label_ids = label_ids.to(device)

    with torch.no_grad():
        if BERT_MODEL in (BERT_PRETRAINED + alBERT_PRETRAINED + SqueezeBERT_PRETRAINED):
            outputs = model(input_ids=input_ids, attention_mask=input_mask, token_type_ids=segment_ids, labels=label_ids)
        elif BERT_MODEL in (distilBERT_PRETRAINED + XLNet_PRETRAINED):
            outputs = model(input_ids=input_ids, attention_mask=input_mask, labels=label_ids)

    logits = outputs.logits
    tmp_eval_loss = outputs.loss

    eval_loss += tmp_eval_loss.mean().item()
    nb_eval_steps += 1
    # Collect each batch's logits; they are concatenated once after the loop.
    preds.append(logits.detach().cpu().numpy())

eval_loss = eval_loss / nb_eval_steps
preds = np.concatenate(preds, axis=0)

if OUTPUT_MODE == "classification":
    preds = np.argmax(preds, axis=1)
elif OUTPUT_MODE == "regression":
    preds = np.squeeze(preds)
result = compute_metrics(TASK_NAME, val_label_ids.numpy(), preds,
                         TRAIN_BATCH_SIZE, LEARNING_RATE, NUM_TRAIN_EPOCHS, GRADIENT_ACCUMULATION_STEPS)

result['eval_loss'] = eval_loss

output_eval_file = os.path.join(REPORTS_DIR, "eval_results.txt")
with open(output_eval_file, "w") as writer:
    logger.info("***** Eval results *****")
    for key in (result.keys()):
        logger.info("  %s = %s", key, str(result[key]))
        writer.write("%s = %s\n" % (key, str(result[key])))





INFO:root:***** Eval results *****
INFO:root:  task = Twitter_Sarcasm_Detection
INFO:root:  Batch Size = 16
INFO:root:  Learning Rate = 2e-05
INFO:root:  Epochs = 2
INFO:root:  Gradient Accumulation Steps = 2
INFO:root:  mcc = 0.643385626563087
INFO:root:  tp = 455
INFO:root:  tn = 364
INFO:root:  fp = 126
INFO:root:  fn = 55
INFO:root:  Precision = 0.7831325301204819
INFO:root:  Recall = 0.8921568627450981
INFO:root:  F1 = 0.8340971585701191
INFO:root:  eval_loss = 0.4368853932246566
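
The precision, recall, F1 and MCC values in the log above follow from the four confusion-matrix counts. The sketch below checks them by hand using the standard textbook definitions. It is not the notebook's own `compute_metrics` (defined earlier, and possibly implemented differently, for example via scikit-learn); `confusion_metrics` is a hypothetical helper introduced only for this illustration, and it does not guard against division by zero.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Derive precision, recall, F1 and MCC from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"Precision": precision, "Recall": recall, "F1": f1, "mcc": mcc}


# Counts copied from the evaluation log above.
metrics = confusion_metrics(tp=455, tn=364, fp=126, fn=55)
for name, value in metrics.items():
    print(f"{name} = {value:.4f}")
# Precision = 0.7831
# Recall = 0.8922
# F1 = 0.8341
# mcc = 0.6434
```

With 126 false positives against 55 false negatives, the model leans towards predicting `SARCASM`, which is why recall is noticeably higher than precision.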


## 8. Prediction

In [25]:
test_preds = []

for input_ids, input_mask, segment_ids in tqdm_notebook(test_dataloader, desc="Predicting"):
    input_ids = input_ids.to(device)
    input_mask = input_mask.to(device)
    segment_ids = segment_ids.to(device)

    with torch.no_grad():
        if BERT_MODEL in (BERT_PRETRAINED + alBERT_PRETRAINED + SqueezeBERT_PRETRAINED):
            outputs = model(input_ids=input_ids, attention_mask=input_mask, token_type_ids=segment_ids, labels=None)
        elif BERT_MODEL in (distilBERT_PRETRAINED + XLNet_PRETRAINED):
            outputs = model(input_ids=input_ids, attention_mask=input_mask, labels=None)

    logits = outputs.logits

    # Collect each batch's logits; they are concatenated once after the loop.
    test_preds.append(logits.detach().cpu().numpy())

test_preds = np.concatenate(test_preds, axis=0)

if OUTPUT_MODE == "classification":
    test_preds = np.argmax(test_preds, axis=1)
elif OUTPUT_MODE == "regression":
    test_preds = np.squeeze(test_preds)

# Write one "id,label" line per test tweet; the context manager closes the file even on error.
with open(f'outputs/{TASK_NAME}/answer.txt', 'w') as w:
    for i in range(df_test.shape[0]):
        pred = 'SARCASM' if test_preds[i] == 1 else 'NOT_SARCASM'
        w.write(df_test.at[i, 'id'] + ',' + pred + '\n')
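
The path from per-batch logits to lines in `answer.txt` can be shown on toy data. This is a minimal pure-Python sketch, not the notebook's implementation: the logits and tweet ids are made up, `to_answer_lines` is a hypothetical helper, and the `max(...)` expression stands in for `np.argmax` over the two class scores. The only details taken from the cell above are the label mapping (class index 1 is `SARCASM`) and the `id,label` line format.

```python
# Hypothetical logits for two batches: one [not_sarcasm, sarcasm] score pair per tweet.
batches = [
    [[1.2, -0.7], [-0.3, 2.1]],
    [[0.4, 0.9]],
]
ids = ["twitter_1", "twitter_2", "twitter_3"]  # made-up ids for illustration


def to_answer_lines(ids, batches):
    """Flatten batched logits, pick the higher-scoring class, format 'id,label'."""
    rows = [row for batch in batches for row in batch]
    lines = []
    for tweet_id, row in zip(ids, rows):
        # Index of the largest score, i.e. what np.argmax(row) would return.
        label_index = max(range(len(row)), key=row.__getitem__)
        label = "SARCASM" if label_index == 1 else "NOT_SARCASM"
        lines.append(f"{tweet_id},{label}")
    return lines


for line in to_answer_lines(ids, batches):
    print(line)
# twitter_1,NOT_SARCASM
# twitter_2,SARCASM
# twitter_3,SARCASM
```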





## 9. Hyperparameter Tuning

Please **DO NOT** run the following cell when reviewing or validating the project results; with it enabled, the whole notebook takes approximately 2 hours to finish.

The cell is for hyperparameter tuning **ONLY** and is therefore left commented out.

In [26]:
# csv_logger = CSVLogger(filename='/content/outputs/para_tune.csv',
#                        fieldnames=['batch_size', 'iters_to_accumulate', 'learning_rate', 'epoch', \
#                                    'train_loss', 'train_precision', 'train_recall', 'train_f1', \
#                                    'val_loss', 'val_precision', 'val_recall', 'val_f1'])
# set_seed(2020)
# all_lrs = [2e-5, 5e-5, 1e-4]
# all_eps = [1, 2, 3]
# all_bs = [8, 16, 32]
# all_gas = [1, 2, 3]

# for bs in [16, 32]: # all_bs:
#     train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=bs)

#     for gas in all_gas:
         
#         for lr in all_lrs:
            
#             for ep in all_eps:
#                 num_train_optimization_steps = int(train_examples_len / bs / gas) * ep
#                 optimizer = BertAdam(optimizer_grouped_parameters,
#                                      lr=lr,
#                                      warmup=WARMUP_PROPORTION,
#                                      t_total=num_train_optimization_steps)  
#                 print("Batch Size: ",bs,", Gradient Accumu Steps: ",gas,", Learning Rate: ",lr,", Epochs: ",ep)

#                 model.train()
#                 global_step = 0

#                 for i in trange(int(ep), desc="Epoch"):
                    
#                     tr_loss = 0
#                     nb_tr_examples = 0
#                     nb_tr_steps = 0
#                     train_preds = []
#                     train_label_ids = []

#                     for step, batch in enumerate(tqdm_notebook(train_dataloader, desc="Iteration")):
#                         batch = tuple(t.to(device) for t in batch)
#                         input_ids, input_mask, segment_ids, label_ids = batch

#                         if BERT_MODEL in (BERT_PRETRAINED + alBERT_PRETRAINED + SqueezeBERT_PRETRAINED):
#                             outputs = model(input_ids=input_ids, attention_mask=input_mask, token_type_ids=segment_ids, labels=label_ids)
#                         elif BERT_MODEL in (distilBERT_PRETRAINED + XLNet_PRETRAINED):
#                             outputs = model(input_ids=input_ids, attention_mask=input_mask, labels=label_ids)

#                         logits = outputs.logits
#                         loss = outputs.loss

#                         if gas > 1:
#                             loss = loss / gas

#                         loss.backward()
                        
#                         tr_loss += loss.item()
#                         nb_tr_examples += input_ids.size(0)
#                         nb_tr_steps += 1
#                         if (step + 1) % gas == 0:
#                             optimizer.step()
#                             optimizer.zero_grad()
#                             global_step += 1

#                         if len(train_preds) == 0:
#                             train_preds.append(logits.detach().cpu().numpy())
#                         else:
#                             train_preds[0] = np.append(
#                                 train_preds[0], logits.detach().cpu().numpy(), axis=0)

#                         if len(train_label_ids) == 0:
#                             train_label_ids.append(label_ids.detach().cpu().numpy())
#                         else:
#                             train_label_ids[0] = np.append(
#                                 train_label_ids[0], label_ids.detach().cpu().numpy(), axis=0)

#                     tr_loss = tr_loss / nb_tr_steps
#                     train_preds = train_preds[0]
#                     train_label_ids = train_label_ids[0]

#                     if OUTPUT_MODE == "classification":
#                         train_preds = np.argmax(train_preds, axis=1)
#                     elif OUTPUT_MODE == "regression":
#                         train_preds = np.squeeze(train_preds)

#                     train_result = compute_metrics(TASK_NAME, train_label_ids, train_preds, bs, lr, ep, gas)
#                     train_result['train_loss'] = tr_loss

#                 model.eval()
#                 eval_loss = 0
#                 nb_eval_steps = 0
#                 preds = []

#                 for input_ids, input_mask, segment_ids, label_ids in tqdm_notebook(val_dataloader, desc="Evaluating"):
#                     input_ids = input_ids.to(device)
#                     input_mask = input_mask.to(device)
#                     segment_ids = segment_ids.to(device)
#                     label_ids = label_ids.to(device)

#                     with torch.no_grad():
#                         if BERT_MODEL in (BERT_PRETRAINED + alBERT_PRETRAINED + SqueezeBERT_PRETRAINED):
#                             outputs = model(input_ids=input_ids, attention_mask=input_mask, token_type_ids=segment_ids, labels=label_ids)
#                         elif BERT_MODEL in (distilBERT_PRETRAINED + XLNet_PRETRAINED):
#                             outputs = model(input_ids=input_ids, attention_mask=input_mask, labels=label_ids)

#                     logits = outputs.logits
#                     tmp_eval_loss = outputs.loss

#                     eval_loss += tmp_eval_loss.mean().item()
#                     nb_eval_steps += 1
#                     if len(preds) == 0:
#                         preds.append(logits.detach().cpu().numpy())
#                     else:
#                         preds[0] = np.append(
#                             preds[0], logits.detach().cpu().numpy(), axis=0)

#                 eval_loss = eval_loss / nb_eval_steps
#                 preds = preds[0]

#                 if OUTPUT_MODE == "classification":
#                     preds = np.argmax(preds, axis=1)
#                 elif OUTPUT_MODE == "regression":
#                     preds = np.squeeze(preds)
                
#                 result = compute_metrics(TASK_NAME, val_label_ids.numpy(), preds, bs, lr, ep, gas)
#                 result['val_loss'] = eval_loss

#                 row = {'batch_size': str(bs), 'iters_to_accumulate': str(gas), 'learning_rate': str(lr), 'epoch': str(ep), \
#                        'train_loss': str(train_result['train_loss']), 'train_precision': str(train_result['Precision']), 'train_recall': str(train_result['Recall']), 'train_f1': str(train_result['F1']), \
#                        'val_loss': str(result['val_loss']), 'val_precision': str(result['Precision']), 'val_recall': str(result['Recall']), 'val_f1': str(result['F1'])}
#                 csv_logger.writerow(row)
# csv_logger.close()
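
To get a feel for the size of this search, the sketch below enumerates the same grid with `itertools.product` and applies the step formula from the cell above, `int(train_examples_len / bs / gas) * ep`. It is only an illustration under stated assumptions: `optimization_steps` is a hypothetical helper, and the training-set size of 4,000 is an assumption inferred from the 250 iterations per epoch at batch size 16 shown in the training output.

```python
from itertools import product


def optimization_steps(train_examples_len, batch_size, grad_accum_steps, epochs):
    """Number of optimizer updates, using the same formula as the tuning cell."""
    return int(train_examples_len / batch_size / grad_accum_steps) * epochs


TRAIN_EXAMPLES_LEN = 4000  # assumption: 250 iterations x batch size 16

batch_sizes = [16, 32]           # the cell loops over [16, 32], not all of all_bs
grad_accum_steps = [1, 2, 3]
learning_rates = [2e-5, 5e-5, 1e-4]
epochs = [1, 2, 3]

grid = list(product(batch_sizes, grad_accum_steps, learning_rates, epochs))
print(len(grid))  # 54 configurations

# Steps for the configuration reported in section 7 (batch 16, accumulation 2, 2 epochs).
print(optimization_steps(TRAIN_EXAMPLES_LEN, 16, 2, 2))  # 250
```

Gradient accumulation multiplies the effective batch size, so batch size 16 with 2 accumulation steps updates the weights once per 32 examples. Note also that, as written, the commented-out cell does not appear to reload the pretrained weights between configurations, so later runs may start from an already fine-tuned model.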


## 10. References

- Huggingface Transformers Community Example: Fine-tune ALBERT for sentence-pair classification [`Fine_tune_ALBERT_sentence_pair_classification.ipynb`](https://github.com/NadirEM/nlp-notebooks/blob/master/Fine_tune_ALBERT_sentence_pair_classification.ipynb)

- Huggingface Transformers Example: Movement Pruning Adaptive Sparsity by Fine-Tuning [`masked_run_glue.py`](https://github.com/huggingface/transformers/blob/67ff1c314a61a2d5949b3bb48fa3ec7e9b697d7e/examples/movement-pruning/masked_run_glue.py)

- [Huggingface Transformers Notebooks](https://github.com/huggingface/transformers/tree/master/notebooks)

- [Huggingface Transformers Quick Start](https://huggingface.co/transformers/quickstart.html)

- Huggingface Transformers Tutorial: Fine Tuning Transformer for MultiLabel Text Classification [`transformers_multi_label_classification.ipynb`](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_multi_label_classification.ipynb)

- Off-the-shelf BERT classifier by Google Research [`run_classifier.py`](https://github.com/google-research/bert/blob/master/run_classifier.py)

- [Pre-trained Models](https://huggingface.co/transformers/pretrained_models.html)

- [Simple Guide On Using BERT for Binary Text Classification](https://medium.com/swlh/a-simple-guide-on-using-bert-for-text-classification-bbf041ac8d04)

- Check GPU memory: [`PyTorch_Reformer.ipynb`](https://github.com/patrickvonplaten/notebooks/blob/master/PyTorch_Reformer.ipynb)