# Classification Competition: Twitter Sarcasm Detection with BERT

## Instructions
To run the full code, please follow these steps:

1. In the Colab "Edit" menu above, go to "Notebook Settings" and select "GPU" from the hardware accelerator dropdown.

2. Upload the train and test data files provided with the competition, named exactly `train.jsonl` and `test.jsonl`: open "Files" in the left-hand sidebar, click the upload icon, and select both files.

3. To run the code, select the "Runtime" menu at the top and click "Run all". 
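
Before running everything, it can help to confirm that both files were uploaded under the expected names. The sketch below is one possible check, not part of the notebook itself: `check_uploaded_files` is a hypothetical helper, and the field names are the ones the notebook reads further down (`response`, `context`, and, for the training data, `label`).

```python
import json
from pathlib import Path

# Field names taken from how the notebook reads the data later on.
# The helper itself is hypothetical and only inspects the first line of each file.
REQUIRED_FIELDS = {
    "train.jsonl": {"response", "context", "label"},
    "test.jsonl": {"response", "context"},
}


def check_uploaded_files(directory="."):
    """Return a list of human-readable problems; an empty list means no problems found."""
    problems = []
    for name, fields in REQUIRED_FIELDS.items():
        path = Path(directory) / name
        if not path.is_file():
            problems.append(f"{name}: file not found")
            continue
        with path.open(encoding="utf-8") as handle:
            first_line = handle.readline()
        try:
            record = json.loads(first_line)
        except json.JSONDecodeError:
            problems.append(f"{name}: first line is not valid JSON")
            continue
        if not isinstance(record, dict):
            problems.append(f"{name}: first line is not a JSON object")
            continue
        missing = fields - record.keys()
        if missing:
            problems.append(f"{name}: missing fields {sorted(missing)}")
    return problems


for problem in check_uploaded_files():
    print(problem)
```

If nothing is printed, both files are present and their first records contain the expected fields.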

## Runtime and Output

The first few cells should run quite quickly; the final cell, which trains the model and predicts labels for the test data, takes much longer (in my experience, 10 to 12 minutes). After a minute or so, you should start to see output tracking the model's training progress.

After the code finishes running, the output file, `answer.txt`, should be visible under "Files" in the left-hand sidebar. If you'd like to save this file, make sure to download it before the runtime disconnects.

## Note on Non-Determinism

Despite setting random seeds and using the same train/validation split on every run, training the BERT model appears to be non-deterministic. Each generated `answer.txt` is slightly different, and I'm not sure what percentage of runs pass the baseline. Across 5 consecutive runs today, each generating `answer.txt` and submitting it to LiveDataLab, my resulting F1 scores were:

* 0.7315 (passing baseline)
* 0.7274 (passing baseline)
* 0.7326 (passing baseline)
* 0.7111 (NOT passing baseline)
* 0.7327 (passing baseline)

Overall, the model usually, but not always, passes the baseline (4 of these 5 runs).
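
To put rough numbers on that variation, the five scores listed above can be summarized as follows. This is only a small illustration using the standard library; with five runs, the pass rate is a coarse estimate rather than a reliable probability.

```python
from statistics import mean, stdev

# (F1 score, passed baseline) pairs copied from the five runs listed above.
runs = [
    (0.7315, True),
    (0.7274, True),
    (0.7326, True),
    (0.7111, False),
    (0.7327, True),
]

scores = [score for score, _ in runs]
pass_rate = sum(passed for _, passed in runs) / len(runs)

print(f"mean F1:   {mean(scores):.4f}")
print(f"stdev F1:  {stdev(scores):.4f}")
print(f"min / max: {min(scores):.4f} / {max(scores):.4f}")
print(f"pass rate: {pass_rate:.0%} of {len(runs)} runs")
```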

## References

To complete this project, I found TensorFlow's tutorial ["Classify Text with BERT"](https://www.tensorflow.org/tutorials/text/classify_text_with_bert) extremely helpful, and directly used some of the tutorial's code. This is also noted in the docstrings of the relevant functions below.

### 1. Install `tensorflow_text` and `tf-models-official` dependencies

In [1]:
%%capture
!pip install tensorflow_text

In [2]:
%%capture
!pip install tf-models-official

### 2. Import required packages

In [3]:
import numpy as np
from official.nlp import optimization
import pandas as pd
import random
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text
from sklearn.model_selection import train_test_split

### 3. Define functions for creating the TensorFlow model, training/making predictions, and creating the final output file

In [4]:
def get_bert_classifier():
    """Builds the classifier model we'll use to classify tweets as SARCASM or
       NOT_SARCASM.

       SOURCE: This function borrows heavily from TensorFlow's "Classify Text
       with BERT" tutorial at the URL below:
       https://www.tensorflow.org/tutorials/text/classify_text_with_bert.
    """
    # Named text_input to avoid shadowing Python's built-in input()
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='input')

    # Add a text preprocessing layer to convert input into proper format for
    # BERT (see URL on the next line for more details)
    preprocess_url = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/1'
    preprocessing_layer = hub.KerasLayer(preprocess_url)
    encoder_inputs = preprocessing_layer(text_input)

    # Use BERT encoder (see URL on the next line for more details);
    # trainable=True fine-tunes the encoder weights along with the classifier
    encoder_url = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3'
    encoder = hub.KerasLayer(encoder_url,
                             trainable=True)
    outputs = encoder(encoder_inputs)

    # Add dropout and a single-unit dense layer that outputs a raw logit
    # (no activation)
    net = outputs['pooled_output']
    net = tf.keras.layers.Dropout(0.1)(net)
    net = tf.keras.layers.Dense(1, activation=None, name='classifier')(net)
    model = tf.keras.Model(text_input, net)
    return model

def train_and_predict_bert(train_x, train_y, val_x, val_y, test_x):
    """Compiles and trains the BERT model, then returns its predictions
       (raw logits) for the test data.

       SOURCE: This function borrows heavily from TensorFlow's "Classify Text
       with BERT" tutorial at the URL below:
       https://www.tensorflow.org/tutorials/text/classify_text_with_bert.
    """
    model = get_bert_classifier()

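    # Use binary cross-entropy loss on raw logits (the final dense layer has
    # no activation). Track binary accuracy with a threshold of 0 on the
    # logits, which corresponds to a probability of 0.5 and matches the
    # cutoff used in create_output_file.
    # TODO: experiment with different metrics and test reproducibility
    loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
    metrics = tf.metrics.BinaryAccuracy(threshold=0.0)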

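    # These step counts only size the optimizer's learning-rate schedule;
    # model.fit below is not passed steps_per_epoch and uses its default
    # batch size of 32.
    epochs = 5
    steps_per_epoch = 80
    num_train_steps = steps_per_epoch * epochs
    # Warm up over the first 10% of the scheduled steps
    num_warmup_steps = int(0.1 * num_train_steps)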

    init_lr = 3e-5
    optimizer = optimization.create_optimizer(init_lr=init_lr,
                                              num_train_steps=num_train_steps,
                                              num_warmup_steps=num_warmup_steps,
                                              optimizer_type='adamw')

    # Compile the model
    model.compile(optimizer=optimizer,
                  loss=loss,
                  metrics=metrics)

    # Fit the model to the training data, tracking binary accuracy on the 
    # validation set
    print('Fitting classifier...')
    model.fit(x=train_x, y=train_y, 
              validation_data=(val_x, val_y), 
              epochs=epochs)
    
    # Predict labels for the test data
    print('Making predictions...')
    predictions = model.predict(test_x)
    return predictions

def create_output_file(predictions, filename='answer.txt'):
    """Write predictions to a headerless CSV file of `twitter_<n>,<label>`
       rows, the file format used for submission.

       A logit above 0 (probability above 0.5) is labeled SARCASM.
    """
    # Flatten the (n, 1) array returned by model.predict into n logits
    logits = np.ravel(predictions)
    tweet_ids = [f"twitter_{i}" for i in range(1, len(logits) + 1)]
    predicted_labels = ["SARCASM" if logit > 0 else "NOT_SARCASM" for logit in logits]
    predictions_df = pd.DataFrame(data={"id": tweet_ids, "label": predicted_labels})
    predictions_df.to_csv(filename, header=False, index=False)
    print(f'Done. Remember to download {filename} before session disconnects.')
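
The step counts in `train_and_predict_bert` are hard-coded, while `model.fit` is called without a batch size and so falls back to the Keras default of 32. The sketch below simply redoes that arithmetic to show how the two relate. The helper names are hypothetical and not part of the notebook.

```python
import math

BATCH_SIZE = 32  # Keras default for model.fit when batch_size is not passed


def schedule_lengths(epochs=5, steps_per_epoch=80, warmup_fraction=0.1):
    """Reproduce the step counts handed to the optimizer in the notebook."""
    num_train_steps = steps_per_epoch * epochs
    num_warmup_steps = int(warmup_fraction * num_train_steps)
    return num_train_steps, num_warmup_steps


def actual_steps_per_epoch(num_examples, batch_size=BATCH_SIZE):
    """Number of batches Keras runs per epoch for a given training-set size."""
    return math.ceil(num_examples / batch_size)


num_train_steps, num_warmup_steps = schedule_lengths()
print(f"scheduled steps: {num_train_steps}, warmup steps: {num_warmup_steps}")

# Largest training split for which 80 steps still covers a full epoch
max_examples = 80 * BATCH_SIZE
print(f"80 steps per epoch covers at most {max_examples} training examples")
```

If the training split is larger than that, each epoch runs more than 80 steps and the schedule is sized for fewer steps than training actually performs. In that case, passing `actual_steps_per_epoch(len(train_x))` instead of the constant is one option, though it changes the schedule and therefore the results.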

### 4. Run the full pipeline of reading in train and test data, training the model, and making predictions (outputting to `answer.txt`)

In [5]:
# Set random seeds for reproducibility (as noted above, results still vary
# slightly between runs)
random.seed(123)
np.random.seed(123)
tf.random.set_seed(123)

# Train and test file (make sure to upload these!!!)
train_file = 'train.jsonl'
test_file = 'test.jsonl'

# For both train and test data, join the list of 'context' tweets into one
# comma-separated string, then combine 'response' and 'context' into a single
# long string column named 'feature'
train = pd.read_json(train_file, lines=True)
train['context'] = [','.join(map(str, turns)) for turns in train['context']]
train['feature'] = train['response'] + ' ' + train['context']

test = pd.read_json(test_file, lines=True)
test['context'] = [','.join(map(str, turns)) for turns in test['context']]
test['feature'] = test['response'] + ' ' + test['context']

# Split training data into train and validation set (setting random seed for
# reproducibility)
train_x, val_x, train_y, val_y = train_test_split(train['feature'].values,
                                                  train['label'].values,
                                                  test_size=0.2,
                                                  random_state=42)

# Use newly created 'feature' column (response + context) as test feature
test_x = test['feature'].values

# Reshape all feature/label vectors for input into tf
train_x = train_x.reshape((-1, 1))
val_x = val_x.reshape((-1, 1))
test_x = test_x.reshape((-1, 1))
train_y = train_y.reshape((-1, 1))
val_y = val_y.reshape((-1, 1))

# Represent train and validation labels as 1 (sarcasm) and 0 (not sarcasm)
get_labels = np.vectorize(lambda x: 1 if x == 'SARCASM' else 0)
train_y = get_labels(train_y)
val_y = get_labels(val_y)

# Train our BERT model and output final prediction file for test data
predictions = train_and_predict_bert(train_x, train_y, val_x, val_y, test_x)
create_output_file(predictions, "answer.txt")

Fitting classifier...
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Making predictions...
Done. Remember to download answer.txt before session disconnects.
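
Once `answer.txt` has been written, a quick format check can catch problems before submitting. The sketch below checks only the layout that `create_output_file` produces (headerless `twitter_<n>,<label>` rows with ids counting up from 1). It does not check against the competition's official specification, and `find_format_errors` is a hypothetical helper.

```python
import re

# One row per tweet: id "twitter_<n>", a comma, then one of the two labels.
LINE_PATTERN = re.compile(r"twitter_(\d+),(SARCASM|NOT_SARCASM)")


def find_format_errors(lines):
    """Return a list of problems found in an iterable of output-file lines."""
    errors = []
    for expected_id, line in enumerate(lines, start=1):
        match = LINE_PATTERN.fullmatch(line.rstrip("\r\n"))
        if match is None:
            errors.append(f"line {expected_id}: unexpected format: {line!r}")
        elif int(match.group(1)) != expected_id:
            errors.append(f"line {expected_id}: expected id twitter_{expected_id}")
    return errors


# Small in-memory sample in the same layout as answer.txt
sample = ["twitter_1,SARCASM\n", "twitter_2,NOT_SARCASM\n"]
print(find_format_errors(sample))
```

To check the real file, pass an open file object instead of the sample, for example `find_format_errors(open("answer.txt"))`.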
