###This notebook:
+ ktrain
+ Hugging Face transformers: distilbert-base-cased
+ input is normalised
+ emojis are kept
+ autofit policy for training

###Check Requirements/imports

In [1]:
import tensorflow as tf
print(tf.version.VERSION)

2.5.0


In [2]:
import pandas as pd


In [None]:
pip install emoji

In [None]:
pip install contractions

In [None]:
!pip3 install -q ktrain 

In [None]:
pip install -U scikit-learn

In [None]:
pip install parse_version

In [9]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
pip install https://github.com/amaiya/eli5/archive/refs/heads/tfkeras_0_10_1.zip

###Load data

In [10]:
# Load train data
train_path = '/content/drive/MyDrive/TeamLab/data/semeval_taskA_corrected.csv'

df_train = pd.read_csv(train_path, header=0, names=['index',
                                                    'irony_label',
                                                    'tweet'])
                                                

In [11]:
df_train.head()

Unnamed: 0,index,irony_label,tweet
0,1,1,Sweet United Nations video. Just in time for C...
1,2,1,@mrdahl87 We are rumored to have talked to Erv...
2,3,1,Hey there! Nice to see you Minnesota/ND Winter...
3,4,0,3 episodes left I'm dying over here
4,5,1,I can't breathe! was chosen as the most notabl...


In [12]:
# Check if dataset is balanced

# Classes are 1 and 0. Tweet can either be ironic or non-ironic -> binary classification
classes = df_train.irony_label.unique()

print((df_train.irony_label == 0).sum())
print((df_train.irony_label == 1).sum())

# => Roughly balanced

1923
1911


In [13]:
# Load test data
test_path = '/content/drive/MyDrive/TeamLab/data/semeval_taskA_test.csv'

df_test = pd.read_csv(test_path, sep='\t', header=0, names=['index',
                                                            'irony_label',
                                                            'tweet'])

print((df_test.irony_label == 0).sum())
print((df_test.irony_label == 1).sum())

df_test.head()

473
311


Unnamed: 0,index,irony_label,tweet
0,1,0,@Callisto1947 Can U Help?||More conservatives ...
1,2,1,"Just walked in to #Starbucks and asked for a ""..."
2,3,0,#NOT GONNA WIN http://t.co/Mc9ebqjAqj
3,4,0,@mickymantell He is exactly that sort of perso...
4,5,1,So much #sarcasm at work mate 10/10 #boring 10...
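
The counts printed above show that the training set is close to balanced while the test set is not. As a rough sketch in plain Python, using only the counts reported in this notebook (the helper name `class_shares` is illustrative and not part of the notebook), the class shares and the accuracy of a majority-class guesser on the test set can be computed like this:

```python
def class_shares(n_label_0, n_label_1):
    """Return the fraction of each class given the raw counts."""
    total = n_label_0 + n_label_1
    return n_label_0 / total, n_label_1 / total

# Counts printed by the cells above (label 0 first, label 1 second)
train_shares = class_shares(1923, 1911)
test_shares = class_shares(473, 311)

# Always predicting the most frequent test class would reach this accuracy
majority_baseline = max(test_shares)

print(f"train: {train_shares[0]:.3f} / {train_shares[1]:.3f}")
print(f"test:  {test_shares[0]:.3f} / {test_shares[1]:.3f}")
print(f"majority-class baseline on test: {majority_baseline:.3f}")
```

Because the test split leans towards label 0, accuracy on it is best read next to this baseline and the per-class scores.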


In [14]:
x_train = df_train['tweet'].to_numpy()
y_train = df_train['irony_label'].to_numpy()

x_test = df_test['tweet'].to_numpy()
y_test = df_test['irony_label'].to_numpy()

In [15]:
x_train[0:3]

array(['Sweet United Nations video. Just in time for Christmas. #imagine #NoReligion  http://t.co/fej2v3OUBR',
       "@mrdahl87 We are rumored to have talked to Erv's agent... and the Angels asked about Ed Escobar... that's hardly nothing    ;)",
       'Hey there! Nice to see you Minnesota/ND Winter Weather'],
      dtype=object)

###Normalisation of input

Normalise:
+ hashtags
+ tagged users
+ emoji (demojize)
+ urls 
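
The steps listed above can be illustrated with the standard library alone. This is a simplified stand-in, not the notebook's own function: emoji handling and contraction expansion are left out so that the snippet needs no extra packages, and the name `normalise_sketch`, the example tweet and its link are made up for illustration.

```python
import re

def normalise_sketch(tweet):
    """Simplified stand-in: handles URLs, mentions, hashtags and numbers only."""
    text = re.sub(r"https?://\S+", "url", tweet)  # replace each link with a placeholder
    text = re.sub(r"@\w+", "tagged_user", text)   # replace each mention with a placeholder
    text = text.replace("#", " ")                 # keep the hashtag word, drop the symbol
    tokens = ["digit" if tok.isnumeric() else tok for tok in text.split()]
    return " ".join(tokens)

# Hypothetical tweet, used only to show the effect of each rule
example = "@user Only 3 days left #Christmas http://t.co/abc"
print(normalise_sketch(example))
```

Replacing user names, links and numbers with fixed placeholders keeps the signal that such an element was present while removing values the model cannot generalise from.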

In [16]:
import emoji
from nltk.tokenize import TweetTokenizer
import re
import contractions
import numpy as np


def normalise_tweet(tweet):
    norm_tweet = re.sub("&", "and", tweet)
    norm_tweet = re.sub(r"[<>]", "", norm_tweet)
    # Replace each link (http or https) with a placeholder and keep the rest of the tweet
    norm_tweet = re.sub(r"https?://\S+", "url", norm_tweet)
    norm_tweet = re.sub("@", " @", norm_tweet)
    norm_tweet = re.sub("#", " ", norm_tweet)

    norm_tweet = emoji.demojize(norm_tweet)
    norm_tweet = re.sub(r"[-()/_;:{}=~|,\[\]]", " ", norm_tweet)

    norm_tweet = contractions.fix(norm_tweet)

    tokenizer = TweetTokenizer()
    final_tweet = ''

    for token in tokenizer.tokenize(norm_tweet):
        if token.startswith("@"):
            token = "tagged_user"
        if token.isnumeric():
            token = "digit"

        final_tweet += token + " "
        
    return final_tweet.strip()

In [17]:
# Normalise every tweet in both splits
x_train_norm = np.array([normalise_tweet(tweet) for tweet in x_train])
x_test_norm = np.array([normalise_tweet(tweet) for tweet in x_test])

##Model (ktrain)

In [18]:
import ktrain
from ktrain import text

categories = [0, 1]

MODEL_NAME = 'distilbert-base-cased'

# Transformer is a wrapper around the Hugging Face transformers library for text classification.
t = text.Transformer(MODEL_NAME, maxlen=100, classes=categories)

# Using normalised input data
trn = t.preprocess_train(x_train_norm, y_train)
val = t.preprocess_test(x_test_norm, y_test)

model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)



preprocessing train...
language: en
train sequence lengths:
	mean : 16
	95percentile : 28
	99percentile : 31



Is Multi-Label? False
preprocessing test...
language: en
test sequence lengths:
	mean : 17
	95percentile : 28
	99percentile : 36



###Estimate LR

Run the following to let ktrain estimate a good learning rate (LR):

learner.lr_find(show_plot=True, max_epochs=4)
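
A learning-rate finder of this kind typically trains briefly while increasing the LR and records the loss at each step. The sketch below only illustrates an exponentially growing LR schedule. The start value, end value and step count are assumptions chosen for illustration, not ktrain's internal defaults, and `exponential_lr_schedule` is a hypothetical helper.

```python
def exponential_lr_schedule(start_lr, end_lr, num_steps):
    """Return num_steps learning rates growing geometrically from start_lr to end_lr."""
    ratio = (end_lr / start_lr) ** (1 / (num_steps - 1))
    return [start_lr * ratio ** i for i in range(num_steps)]

# Assumed illustrative range; a real run would record the training loss at each rate
lrs = exponential_lr_schedule(start_lr=1e-7, end_lr=1e-1, num_steps=7)
for lr in lrs:
    print(f"{lr:.0e}")
```

A common heuristic is to pick a rate from the region of the resulting plot where the loss is still falling, before it starts to diverge.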

###Train

In [19]:
best_lr = 5e-5

In [20]:
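# Train with the autofit policy, using best_lr (5e-5) as the maximum LR.
# No epoch count is passed, so early stopping decides when training ends.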
learner.autofit(lr=best_lr, checkpoint_folder='/my_models', verbose=1)

early_stopping automatically enabled at patience=5
reduce_on_plateau automatically enabled at patience=2


begin training using triangular learning rate policy with max lr of 5e-05...
Epoch 1/1024
Epoch 2/1024
Epoch 3/1024
Epoch 4/1024

Epoch 00004: Reducing Max LR on Plateau: new max lr will be 2.5e-05 (if not early_stopping).
Epoch 5/1024
Epoch 6/1024

Epoch 00006: Reducing Max LR on Plateau: new max lr will be 1.25e-05 (if not early_stopping).
Epoch 7/1024
Restoring model weights from the end of the best epoch.
Epoch 00007: early stopping
Weights from best epoch have been loaded into model.


<tensorflow.python.keras.callbacks.History at 0x7f5da607e0d0>
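
The log above shows the maximum LR being halved each time training plateaus. The following sketch reproduces that arithmetic and shows one simple triangular cycle (linear ramp up, then down). It is an illustration under assumptions: the halving factor is read off the log, and the function names and exact cycle shape are not taken from ktrain's source.

```python
def triangular_lr(step, steps_per_cycle, max_lr, base_lr=0.0):
    """Linear ramp from base_lr to max_lr at mid-cycle, then back down."""
    half = steps_per_cycle / 2
    position = step % steps_per_cycle
    distance_from_peak = abs(position - half) / half  # 1 at the ends, 0 at the peak
    return base_lr + (max_lr - base_lr) * (1 - distance_from_peak)

def reduce_on_plateau(max_lr, factor=2):
    """Shrink the maximum LR by the given factor (assumed to be 2, as in the log)."""
    return max_lr / factor

max_lr = 5e-5
after_first = reduce_on_plateau(max_lr)        # first reduction reported in the log
after_second = reduce_on_plateau(after_first)  # second reduction reported in the log
print(after_first, after_second)

# One illustrative cycle of 8 steps at the initial maximum LR
print([triangular_lr(s, 8, max_lr) for s in range(8)])
```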

###Evaluate/Inspect model

In [21]:
learner.validate(class_names=t.get_classes())

              precision    recall  f1-score   support

           0       0.74      0.73      0.73       473
           1       0.60      0.61      0.60       311

    accuracy                           0.68       784
   macro avg       0.67      0.67      0.67       784
weighted avg       0.68      0.68      0.68       784



array([[343, 130],
       [120, 191]])
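
The figures in the report can be recomputed from the confusion matrix by hand. The sketch below assumes the usual scikit-learn layout (rows are true labels, columns are predicted labels), which agrees with the support values of 473 and 311. The helper `per_class_metrics` is illustrative.

```python
# Confusion matrix from the output above: rows = true label, columns = predicted label
cm = [[343, 130],
      [120, 191]]

def per_class_metrics(cm, k):
    """Precision, recall and F1 for class k of a square confusion matrix."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: the support of class k
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))
for k in (0, 1):
    p, r, f = per_class_metrics(cm, k)
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```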

In [23]:
# The test examples the model got most wrong (highest loss)
learner.view_top_losses(n=5, preproc=t)

----------
id:676 | loss:2.34 | true:0 | pred:1)

----------
id:506 | loss:2.32 | true:0 | pred:1)

----------
id:330 | loss:2.25 | true:0 | pred:1)

----------
id:66 | loss:2.25 | true:1 | pred:0)

----------
id:186 | loss:2.2 | true:0 | pred:1)



In [24]:
# Print these instances to see why they were misclassified
print(x_test_norm[676])
print(x_test_norm[506])
print(x_test_norm[330])
print(x_test_norm[66])
print(x_test_norm[186])

So glad I am off work tonite person raising hand
This time last year ... shiid was hella funny ... unforgettable khwaaaa
Sarcasm makes you mentally stronger . Which is very effective when dealing with emotional stress and fustration . funfact WhatIfISay
Produce Mobile Apps not url
Just realized my last final is tomorrow . FAAAAA
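
To get a feel for the loss values reported by `view_top_losses` above, they can be turned back into probabilities. This sketch assumes the reported loss is the per-example cross-entropy, i.e. loss = -ln(p), where p is the probability the model gave to the true label. That assumption is not verified here, and `prob_of_true_class` is an illustrative helper.

```python
import math

# Losses reported by view_top_losses above, keyed by test-set id
top_losses = {676: 2.34, 506: 2.32, 330: 2.25, 66: 2.25, 186: 2.2}

def prob_of_true_class(loss):
    """Invert loss = -ln(p) to recover the probability given to the true label."""
    return math.exp(-loss)

for idx, loss in top_losses.items():
    print(f"id {idx}: loss {loss} -> p(true label) ~ {prob_of_true_class(loss):.3f}")
```

Under that assumption, each of these tweets received only about a tenth of the probability mass for its true label, so the model was confidently wrong rather than undecided.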


###Make predictions on new data

In [25]:
predictor = ktrain.get_predictor(learner.model, preproc=t)

In [26]:
test_sent = 'Cool it is raining again'

In [27]:
predictor.predict(test_sent)

1

In [30]:
# Ask for explanation
predictor.explain(test_sent)

Contribution?,Feature
1.121,Highlighted in text (sum)
0.168,<BIAS>
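
The two rows of the table add up to a single score for the predicted class. The sketch below does that sum and, as an assumption, maps it to a probability with a logistic function, which is how a logistic-regression-style surrogate model would be read. Whether the explainer used here reports exactly this probability is not confirmed by the output above.

```python
import math

# Values from the explanation table above
highlighted_sum = 1.121  # summed contribution of the highlighted tokens
bias = 0.168             # intercept term shown as <BIAS>

score = highlighted_sum + bias

def sigmoid(x):
    """Logistic function mapping a real-valued score to the range (0, 1)."""
    return 1 / (1 + math.exp(-x))

print(f"score = {score:.3f}")
print(f"sigmoid(score) = {sigmoid(score):.3f}")
```

A positive bias means the surrogate leans towards the predicted class even before any token is taken into account.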
