We start by installing an earlier version of NumPy so that all the packages run correctly. Please restart the runtime after running this command.

In [None]:
!pip install numpy==1.23.1

# Lab 7 - Social Media Processing

This notebook shows how to use HuggingFace's package to import and train regression models to assess humor rating in social media posts in English (SemEval2021: HaHackathon: Detecting and Rating Humor and Offense https://competitions.codalab.org/competitions/27446, **Task-1b**). 

Detecting humour, especially in social media posts, is a hard linguistic problem for NLP because of noise, figurative language, contextuality and subjectivity. You will therefore try different methods to address these challenges, such as preprocessing, data augmentation, ensembling and multi-task learning.

We will download and unzip the data from here: http://smash.inf.ed.ac.uk/hahackathon_data/hahackathon_data.zip. 


We recommend running this lab on a Colab TPU provided by Google.

In [None]:
!wget http://smash.inf.ed.ac.uk/hahackathon_data/hahackathon_data.zip
!unzip '/content/hahackathon_data.zip'

First, we need to install Hugging Face [transformers](https://huggingface.co/transformers/index.html) and the [SentencePiece](https://github.com/google/sentencepiece) tokenizer library, as well as some helper libraries, with the following commands.

In [None]:
!pip install -q transformers
!pip install -q sentencepiece
!pip install -q  ipywidgets
!jupyter nbextension enable --py widgetsnbextension

In [None]:
import keras
import numpy as np
import random
import matplotlib.pyplot as plt

from keras.layers import Dense, Input, GlobalAveragePooling1D
from keras.models import Model
from keras import backend as K

We define a helper that fixes the random seeds. Calling it later with different seeds introduces variety between the models of an ensemble.

In [None]:
def set_random_seed(seed=123):
  random.seed(seed)
  np.random.seed(seed)

set_random_seed()
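
As a quick illustration of why fixing the seed matters, the sketch below re-declares the helper so that it runs on its own and checks that the same seed reproduces the same draws, while a different seed does not. The `draw` function is a hypothetical helper that exists only for this demonstration.

```python
import random
import numpy as np

def set_random_seed(seed=123):
    random.seed(seed)
    np.random.seed(seed)

def draw():
    # Hypothetical helper: one draw from each seeded generator
    return random.randint(0, 500), float(np.random.rand())

set_random_seed(123)
first = draw()
set_random_seed(123)
second = draw()   # same seed, so the same values
set_random_seed(7)
third = draw()    # different seed, so different values

print(first == second)
print(first == third)
```

Note that the helper seeds Python's `random` and NumPy only. TensorFlow keeps its own generator, so fully reproducible weight initialisation may additionally require `tf.random.set_seed`.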

## Regression with BERT

We will use the [DistilBert](https://arxiv.org/abs/1910.01108v4) model and its Tokeniser following the preprocessing code from Lab 6.

In [None]:
from transformers import DistilBertTokenizer
import tqdm

# we will pad to 128 subword tokens
PAD_LENGTH = 128
bert = 'distilbert-base-uncased'
BATCH_SIZE = 512
EPOCHS = 10

# Defining DistilBERT tokenizer
tokenizer = DistilBertTokenizer.from_pretrained(bert, do_lower_case=True, add_special_tokens=True, 
                                                max_length=PAD_LENGTH, padding='max_length', truncation=True)

def tokenize(sentences, tokenizer, pad_length=PAD_LENGTH):
    if isinstance(sentences, str):
        inputs = tokenizer.encode_plus(sentences, add_special_tokens=True, max_length=pad_length, padding='max_length', truncation=True,  
                                             return_attention_mask=True, return_token_type_ids=True)
        return np.asarray(inputs['input_ids'], dtype='int32'), np.asarray(inputs['attention_mask'], dtype='int32'), np.asarray(inputs['token_type_ids'], dtype='int32')
    input_ids, input_masks, input_segments = [],[],[]
    for sentence in sentences:
        inputs = tokenizer.encode_plus(sentence, add_special_tokens=True, max_length=pad_length, padding='max_length', truncation=True, 
                                             return_attention_mask=True, return_token_type_ids=True)
        input_ids.append(inputs['input_ids'])
        input_masks.append(inputs['attention_mask'])
        input_segments.append(inputs['token_type_ids'])        
 
    return (np.asarray(input_ids, dtype='int32'), np.asarray(input_masks, dtype='int32'), np.asarray(input_segments, dtype='int32'))
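
To make the role of padding and the attention mask concrete, here is a minimal pure-Python sketch. It is not the DistilBERT tokenizer: the `pad_and_mask` function and the token ids are illustrative assumptions, and the naive truncation shown here simply cuts the list, whereas the real tokenizer keeps the final special token.

```python
def pad_and_mask(token_ids, pad_length, pad_id=0):
    """Truncate or right-pad a list of token ids and build the matching attention mask."""
    ids = token_ids[:pad_length]
    mask = [1] * len(ids)            # 1 marks a real token
    n_pad = pad_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad   # 0 marks padding

# Toy ids standing in for real subword ids
ids, mask = pad_and_mask([101, 2023, 2003, 6057, 102], pad_length=8)
print(ids)
print(mask)
```

The model receives both arrays: the ids select the embeddings, and the mask tells the attention layers (and any pooling on top) which positions to ignore.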


Let's read the data using Pandas.

In [None]:
import pandas as pd

# Load data with only the necessary columns
train_df = pd.read_csv('hahackathon data/train.csv', usecols = ['text','humor_rating','offense_rating'])
test_df = pd.read_csv('hahackathon data/test.csv', usecols = ['text','humor_rating','offense_rating'])

# Drop rows with missing values (NaNs)
train_df = train_df.dropna()
test_df = test_df.dropna()

Let's check a couple of examples.

In [None]:
train_df

In [None]:
# Get the post text
train_examples_list = train_df['text'].tolist()
test_examples_list = test_df['text'].tolist()

# Get the humour rating for the regression task (ratings lie between 0 and 5, so we divide by 5 to normalise them to [0, 1])
train_targets_list = (train_df['humor_rating']/5).tolist()
test_targets_list = (test_df['humor_rating']/5).tolist()
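
The normalisation and its inverse can be sketched as follows. The helper names and the example ratings are made up for illustration; only the division by 5 comes from the cell above.

```python
MAX_RATING = 5.0  # ratings in the dataset lie between 0 and 5

def normalise(ratings):
    return [r / MAX_RATING for r in ratings]

def denormalise(scores):
    return [s * MAX_RATING for s in scores]

raw = [0.0, 2.5, 4.0]            # made-up ratings
scaled = normalise(raw)          # values in [0, 1], matching a sigmoid output
restored = denormalise(scaled)   # back on the original 0 to 5 scale
print(scaled)
print(restored)
```

Because the targets are divided by 5, an MSE measured on the normalised scale is 25 times smaller than the same error measured on the original scale.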

In [None]:
def get_bert_inputs(examples_list, targets):
  # Tokenise the texts; keep only the token ids and attention masks
  input_ids, attention_masks, _ = tokenize(examples_list, tokenizer)

  targets = np.array(targets)

  return input_ids, attention_masks, targets

train_input_ids, train_attention_masks, train_targets = get_bert_inputs(train_examples_list, train_targets_list)
test_input_ids, test_attention_masks, test_targets = get_bert_inputs(test_examples_list, test_targets_list)


**Task 1: Build a neural bag of words model using DistilBERT embeddings and the sigmoid activation on the output layer for the regression task.**

Investigate its performance using the Mean Squared Error (MSE) metric. We will use this metric as the loss function as well. We will also use the Adam optimiser with `learning_rate=2e-5`. This code is already provided.

*Hint*: You can reuse the code from Lab 6 on Transfer Learning (Model 2).
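
For intuition only, the idea behind a neural bag of words with a sigmoid output can be sketched in NumPy: average the token embeddings over the real tokens, then pass the result through a single dense unit with a sigmoid. This is not the Keras solution to the task; the embeddings, the weights `w` and `b`, and the function names are all made up.

```python
import numpy as np

def masked_mean_pool(embeddings, attention_mask):
    """Average token embeddings over real (unmasked) tokens only.

    embeddings: (batch, seq_len, hidden); attention_mask: (batch, seq_len) of 0/1.
    """
    mask = attention_mask[..., None].astype(embeddings.dtype)
    return (embeddings * mask).sum(axis=1) / mask.sum(axis=1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy batch: 1 sentence, 3 positions (the last one is padding), hidden size 2
emb = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = masked_mean_pool(emb, mask)   # the padded position is ignored
w, b = np.array([0.5, -0.5]), 0.0      # made-up dense-layer parameters
score = sigmoid(pooled @ w + b)        # a value strictly between 0 and 1
print(pooled, score)
```

The sigmoid keeps every prediction in (0, 1), which matches the normalised ratings used as targets.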

In [None]:
from transformers import TFDistilBertModel, DistilBertConfig
import tensorflow as tf
 

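class GlobalAveragePooling1DMasked(GlobalAveragePooling1D):
    def call(self, x, mask=None):
        if mask is not None:
            # Average over real tokens only: zero out the padded positions,
            # then divide by the number of unmasked tokens per sequence.
            mask = K.expand_dims(K.cast(mask, K.dtype(x)), axis=-1)
            return K.sum(x * mask, axis=1) / K.sum(mask, axis=1)
        return super().call(x)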

def get_BERT_layer():
  distil_bert = 'distilbert-base-uncased'
  config = DistilBertConfig(dropout=0.2, attention_dropout=0.2)
  config.output_hidden_states = False
  
  return TFDistilBertModel.from_pretrained(distil_bert, config = config)

def create_regression_BoW_bert():
  # Your code goes here

  return tf.keras.Model(inputs=[input_ids, input_masks_ids], outputs=[out_reg])

def get_model(use_tpu=True, use_gpu=False, learning_rate=2e-5):
  if use_tpu:
    # Create distribution strategy
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.TPUStrategy(tpu)

    # Create model on TPU:
    with strategy.scope():
      model = create_regression_BoW_bert()
      optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
      model.compile(optimizer=optimizer, loss='mse', metrics=[tf.keras.losses.MeanSquaredError()])
  elif use_gpu:
    device_name = tf.test.gpu_device_name()
    print(device_name)
    with tf.device('/device:GPU:0'):
      model = create_regression_BoW_bert()
      optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
      model.compile(optimizer=optimizer, loss='mse', metrics=[tf.keras.losses.MeanSquaredError()])
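  else:
    model = create_regression_BoW_bert()
    # Use the same optimiser settings as the TPU and GPU branches
    optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
    model.compile(optimizer=optimizer, loss='mse', metrics=[tf.keras.losses.MeanSquaredError()])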
  return model


model = get_model()

In [None]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_token (InputLayer)       [(None, 128)]        0           []                               
                                                                                                  
 masked_token (InputLayer)      [(None, 128)]        0           []                               
                                                                                                  
 tf_distil_bert_model (TFDistil  TFBaseModelOutput(l  66362880   ['input_token[0][0]',            
 BertModel)                     ast_hidden_state=(N               'masked_token[0][0]']           
                                one, 128, 768),                                                   
                                 hidden_states=None                                           

In [None]:
history = model.fit([train_input_ids, train_attention_masks],
                    train_targets,
                    epochs=EPOCHS,
                    batch_size=BATCH_SIZE,
                    verbose=1)

We evaluate our model on the test set.

In [None]:
results = model.evaluate([test_input_ids,test_attention_masks], test_targets)
print('Test loss:', results[0])
print('Test MSE:', results[1])
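
As a reminder of what the reported number means, here is a small worked example of the MSE on made-up values. The `mean_squared_error` function is written out by hand for illustration.

```python
def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [0.2, 0.5, 0.8]   # made-up normalised ratings
y_pred = [0.3, 0.5, 0.6]   # made-up predictions
mse = mean_squared_error(y_true, y_pred)

# Root mean squared error, converted back to the original 0 to 5 scale
rmse_original_scale = 5 * mse ** 0.5
print(mse, rmse_original_scale)
```

Since the model is compiled with MSE as both the loss and the metric, the two values printed by the evaluation cell should coincide.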

Get the array of predictions here so that you can plot the outputs later.

In [None]:
# All remaining arguments are left at their defaults
preds = model.predict([test_input_ids, test_attention_masks])

preds = np.array(preds).flatten()

## Predictive Distribution

We compute the min, max and mean of the gold and predicted humour ratings.

In [None]:
min(preds), max(preds), preds.mean()

In [None]:
min(test_targets), max(test_targets), test_targets.mean()

In [None]:
pd.Series(preds).hist()

In [None]:
pd.Series(test_targets).hist()

Next, we plot the true vs predicted humour grade for our model.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
def get_pred_true_plot(preds, labels, title):
    limits = [labels.min(), labels.max()]
    fig, ax = plt.subplots()
    fig.set_dpi(150)
    ax.set_title(title)
    ax.scatter(labels, preds, marker='.')
    ax.plot(limits, limits, color="gray", linestyle=":")
    ax.set_xlabel('True Humour Grade')
    ax.set_ylabel('Predicted Humour Grade')
    sns.regplot(x=labels, y=preds, ax=ax, scatter_kws={"s": 5})
    plt.show()

get_pred_true_plot(preds, test_targets, 'True vs Predicted Humour Grade for DistilBERT Model')

Our regressor tends to pull extreme ratings towards the mean: high ratings are under-predicted and low ratings are over-predicted.
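
One way to quantify this effect is to compare the spread of the predictions with the spread of the targets, and to look at the slope of the fitted regression line. The sketch below uses synthetic data rather than real model outputs, and `shrinkage_summary` is a hypothetical helper. A standard-deviation ratio and a slope below 1 both indicate predictions compressed towards the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for test_targets and preds, not real model outputs
targets = rng.uniform(0.0, 1.0, size=1000)
preds = 0.5 * targets + 0.25 + rng.normal(0.0, 0.02, size=1000)

def shrinkage_summary(preds, targets):
    # np.polyfit returns coefficients from the highest degree down,
    # so index 0 is the slope of the fitted line
    slope = np.polyfit(targets, preds, deg=1)[0]
    return {"std_ratio": preds.std() / targets.std(), "slope": slope}

summary = shrinkage_summary(preds, targets)
print(summary)
```

With the real arrays, the same function could be called as `shrinkage_summary(preds, test_targets)`.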

# Feature Engineering & Data Augmentation

**Task 2: Preprocess the textual data with the Ekphrasis library following the standard pipeline https://github.com/cbaziotis/ekphrasis#text-pre-processing-pipeline. How does this affect the performance?**

*Hint*: You may prefer not to annotate terms, so that the input sentences keep the same length (for this, do not use the parameter `annotate={"hashtag", "allcaps", "elongated", "repeated", 'emphasis', 'censored'}`).
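
To give a feel for what this kind of normalisation does, here is a toy regex-based sketch. It is not Ekphrasis and does not solve the task: the patterns and the `toy_normalise` function are simplified assumptions, and the real library uses far more thorough rules.

```python
import re

# Toy patterns written for this illustration only
URL = re.compile(r"https?://\S+")
USER = re.compile(r"@\w+")
ELONGATED = re.compile(r"(\w)\1{2,}")   # three or more repeats of one character

def toy_normalise(text):
    text = URL.sub("<url>", text)       # replace links with a placeholder
    text = USER.sub("<user>", text)     # replace user mentions with a placeholder
    text = ELONGATED.sub(r"\1", text)   # collapse elongated characters
    return text.lower()

print(toy_normalise("@bob Sooooo funny!!! https://example.com"))
```

Replacing rare surface forms with shared placeholders reduces noise, although a subword tokenizer such as DistilBERT's already copes with many of these variations.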

In [None]:
!pip install -q ekphrasis
!pip3 install -q emoji==0.6.0

In [None]:
# Your code goes here


new_train_examples_list = [" ".join(text_processor.pre_process_doc(example)) for example in train_examples_list]
new_test_examples_list = [" ".join(text_processor.pre_process_doc(example)) for example in test_examples_list]


In [None]:
print("Original Text:", train_examples_list[0])
print("Preprocessed Text:", new_train_examples_list[0])

In [None]:
train_input_ids, train_attention_masks, train_targets = get_bert_inputs(new_train_examples_list, train_targets_list)
test_input_ids, test_attention_masks, test_targets = get_bert_inputs(new_test_examples_list, test_targets_list)
 
model = get_model()
history = model.fit([train_input_ids,train_attention_masks],
                    train_targets,
                    epochs=EPOCHS,
                    batch_size=BATCH_SIZE,
                    verbose=1)

In [None]:
results = model.evaluate([test_input_ids,test_attention_masks], test_targets)
print('Test loss:', results[0])
print('Test MSE:', results[1])

Your results may differ depending on the implementation, but special preprocessing typically does not change the performance drastically for this task.

**Task 3: Augment the training data twice by changing the original data via two methods from the Nlpaug (https://github.com/makcedward/nlpaug) library: (a) synonym replacement from WordNet; (b) deletion of random words. Comment on which method gives the best performance.**

*Hint*: Use the Synonym Augmenter and Random Word Augmenter (Delete word randomly) classes as follows:
```
aug = naw.SynonymAug(aug_src='wordnet')
augmented_text = aug.augment(text)

aug = naw.RandomWordAug()
augmented_text = aug.augment(text)
```





For more examples check https://github.com/makcedward/nlpaug/blob/master/example/textual_augmenter.ipynb
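
When augmenting, remember that every new text needs a target as well. The sketch below shows the bookkeeping with a pure-Python stand-in for random word deletion. It does not use Nlpaug, and the function names, the deletion probability `p` and the example posts are assumptions made for illustration.

```python
import random

def delete_random_words(text, p=0.3, rng=random):
    """Drop each word with probability p, always keeping at least one word."""
    words = text.split()
    kept = [w for w in words if rng.random() >= p]
    return " ".join(kept if kept else words[:1])

def augment_dataset(texts, targets, augment_fn):
    """Return the original examples followed by one augmented copy of each."""
    new_texts = texts + [augment_fn(t) for t in texts]
    new_targets = targets + targets   # an augmented post keeps its rating
    return new_texts, new_targets

random.seed(0)
texts = ["why did the chicken cross the road", "knock knock"]   # made-up posts
targets = [0.4, 0.2]                                            # made-up ratings
aug_texts, aug_targets = augment_dataset(texts, targets, delete_random_words)
print(len(aug_texts), len(aug_targets))
```

Applying a second augmenter to the original texts in the same way would triple the size of the training set.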

In [None]:
!pip install nlpaug

In [None]:
# Get the data again to apply augmentation
train_examples_list = train_df['text'].tolist()
train_targets_list = (train_df['humor_rating']/5).tolist()

In [None]:
import nlpaug.augmenter.word as naw

# Your code goes here


In [None]:
train_input_ids, train_attention_masks, train_targets = get_bert_inputs(train_examples_list, train_targets_list)
test_input_ids, test_attention_masks, test_targets = get_bert_inputs(test_examples_list, test_targets_list)


We have now augmented the original data twice:

In [None]:
print("Training examples before augmentation:")
print(len(train_df['text'].tolist()))
print("Training examples after augmentation:")
print(len(train_examples_list))

In [None]:
model = get_model()
history = model.fit([train_input_ids,train_attention_masks],
                    train_targets,
                    epochs=EPOCHS,
                    batch_size=BATCH_SIZE,
                    verbose=1)

In [None]:
results = model.evaluate([np.asarray(test_input_ids),np.asarray(test_attention_masks)], test_targets)
print('Test loss:', results[0])
print('Test MSE:', results[1])

Your results may differ depending on the implementation, but typically there are no drastic differences between the augmentation setups.

# Ensembled BERT Model

In this section you will train and evaluate an **ensemble** of BERT models. 

We define the hyperparameters, including the number of models we want to ensemble (RERUNS=3, i.e., 3 models). 

**Task 4: Train three DistilBERT models, get their predictions on the test set, take the mean of those predictions and evaluate this ensembled prediction. Comment on the resulting performance.**

We create three models in a loop, set a new random seed before creating each of them (`set_random_seed(seed=random.randint(0, 500))`) and accumulate predictions per model in a list.

In [None]:
# Get the train data again to avoid any confusion
train_examples_list = train_df['text'].tolist()
train_targets_list = (train_df['humor_rating']/5).tolist()
 
train_input_ids, train_attention_masks, train_targets = get_bert_inputs(train_examples_list, train_targets_list)


In [None]:
RERUNS = 3
# We save the predictions of each model to the list
all_model_preds = list()

for i in range(RERUNS):

  set_random_seed(seed=random.randint(0, 500))

  # Your code goes here


In [None]:
from sklearn.metrics import mean_squared_error

mean_preds = np.mean(np.array(all_model_preds), axis=0)
ensemble_mse = mean_squared_error(test_targets, mean_preds)

print('Ensemble Test MSE : {:.4f}'.format(ensemble_mse))

Your results may differ depending on the implementation, but ensembling typically improves the performance slightly for this task.
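
The effect of averaging can be illustrated on synthetic predictions. Here the three "models" are simply the targets plus independent noise, which is an idealised assumption. Real models fine-tuned from the same checkpoint make correlated errors, so the gain is usually smaller.

```python
import numpy as np

rng = np.random.default_rng(0)
targets = rng.uniform(0.0, 1.0, size=500)   # synthetic gold ratings
# Three synthetic "models": the targets plus independent noise
all_model_preds = [targets + rng.normal(0.0, 0.1, size=500) for _ in range(3)]

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

individual = [mse(targets, p) for p in all_model_preds]
mean_preds = np.mean(np.array(all_model_preds), axis=0)
ensemble = mse(targets, mean_preds)
print(individual, ensemble)
```

Because the squared error is convex, the MSE of the averaged prediction can never exceed the average of the individual MSEs.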

# Multi-task Learning with BERT

**Task 6: Train a multi-task (MTL) model with the additional regression task of predicting the offense rating. Re-train the single-task model from Task 1 with half of the initial training data. The code to fetch these data is provided below. Comment on the resulting performance on the regression task under data sparsity for the two models (single-task and MTL).**

*Hint*: The MTL model will have two identical output layers (one for predicting humour rating, the other to predict offense rating).

We specify two losses and two metrics to compile the model `loss={'out_reg1': 'mse', 'out_reg2': 'mse'}, metrics={'out_reg1': 'mse', 'out_reg2': 'mse'}`. We increase the epoch count to 25 due to the reduced training data.
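
When a Keras model has several outputs, the quantity minimised during training is the sum of the per-output losses, optionally weighted through the `loss_weights` argument of `compile`. The arithmetic can be sketched in plain Python on made-up values. The `multitask_loss` function is a hypothetical helper, not part of Keras.

```python
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def multitask_loss(humour, offense, weights=(1.0, 1.0)):
    """Weighted sum of the two per-task MSEs; each argument is a (y_true, y_pred) pair."""
    return weights[0] * mse(*humour) + weights[1] * mse(*offense)

humour = ([0.4, 0.6], [0.5, 0.5])    # made-up (targets, predictions)
offense = ([0.0, 0.2], [0.1, 0.2])   # made-up (targets, predictions)
total = multitask_loss(humour, offense)
print(total)
```

With equal weights both tasks contribute equally to the gradient, so the shared DistilBERT encoder is pushed to represent information useful for both ratings.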

In [None]:
set_random_seed()

from transformers import TFDistilBertModel, DistilBertConfig

def create_TFBertMultitask():

  # Your code goes here

  # comment to run a single-task model
  return tf.keras.Model(inputs=[input_ids, input_masks_ids], outputs = [out_reg1, out_reg2])
  # uncomment to run a single-task model
  # return tf.keras.Model(inputs=[input_ids, input_masks_ids], outputs = [out_reg1])

use_tpu = True
use_gpu = False
if use_tpu:
  # Create distribution strategy
  tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
  tf.config.experimental_connect_to_cluster(tpu)
  tf.tpu.experimental.initialize_tpu_system(tpu)
  strategy = tf.distribute.TPUStrategy(tpu)

  # Create model on TPU:
  with strategy.scope():
    model = create_TFBertMultitask()
    optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
    # comment to run a single-task model
    model.compile(optimizer=optimizer, loss={'out_reg1': 'mse', 
                         'out_reg2': 'mse'}, metrics={'out_reg1': 'mse', 
                          'out_reg2': 'mse'})
    # uncomment to run a single-task model
    # model.compile(optimizer=optimizer, loss={'out_reg1': 'mse'}, metrics={'out_reg1': 'mse'})
  
elif use_gpu:
  device_name = tf.test.gpu_device_name()
  print(device_name)
  with tf.device('/device:GPU:0'):
    model = create_TFBertMultitask()
    optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
    model.compile(optimizer=optimizer, loss={'out_reg1': 'mse', 
                         'out_reg2': 'mse'}, metrics={'out_reg1': 'mse', 
                          'out_reg2': 'mse'})
else:
  model = create_TFBertMultitask()
  optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
  model.compile(optimizer=optimizer, loss={'out_reg1': 'mse', 
                         'out_reg2': 'mse'}, metrics={'out_reg1': 'mse', 
                          'out_reg2': 'mse'})


In [None]:
model.summary()

Model: "model_6"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_token (InputLayer)       [(None, 128)]        0           []                               
                                                                                                  
 masked_token (InputLayer)      [(None, 128)]        0           []                               
                                                                                                  
 tf_distil_bert_model_6 (TFDist  TFBaseModelOutput(l  66362880   ['input_token[0][0]',            
 ilBertModel)                   ast_hidden_state=(N               'masked_token[0][0]']           
                                one, 128, 768),                                                   
                                 hidden_states=None                                         

In [None]:
# Get half of the training data
train_examples_list = train_df['text'].tolist()[2500:]
train_targets_list = (train_df['humor_rating']/5).tolist()[2500:]

train_input_ids, train_attention_masks, train_targets = get_bert_inputs(train_examples_list, train_targets_list)

# Get the offense ratings for the second regression task (we normalise them as well)
train_targets2_list = (train_df['offense_rating']/5).tolist()[2500:]
test_targets2_list = (test_df['offense_rating']/5).tolist()
 
train_targets2 = np.array(train_targets2_list)
test_targets2 = np.array(test_targets2_list)

In [None]:
# comment to run a single-task model

history = model.fit([train_input_ids, train_attention_masks],
                    [train_targets, train_targets2],
                    epochs=25,
                    batch_size=BATCH_SIZE,
                    verbose=1)

# uncomment to run a single-task model
# history = model.fit([train_input_ids, train_attention_masks],
#                    [train_targets],
#                    epochs=25,
#                    batch_size=BATCH_SIZE,
#                    verbose=1)

In [None]:
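# For the MTL model, evaluate() returns the combined loss followed by the per-output
# values; model.metrics_names gives the exact order. For the single-task model, pass only [test_targets].
results = model.evaluate([test_input_ids,test_attention_masks], [test_targets, test_targets2])
print('Test loss:', results[0])
print('Test MSE:', results[1])
print(dict(zip(model.metrics_names, results)))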

Your results may differ depending on the implementation, but MTL typically gives a slight improvement over the single-task model trained under similar conditions.