###This notebook:
+ ktrain
+ Hugging Face transformers: distilbert-base-cased
+ lower LR 5e-5
+ remove emojis
+ autofit policy for training


###Check Requirements/imports

In [1]:
import tensorflow as tf
print(tf.version.VERSION)

2.5.0


In [2]:
import pandas as pd


In [None]:
pip install emoji

In [None]:
pip install contractions

In [None]:
!pip3 install -q ktrain 

In [None]:
pip install -U scikit-learn

In [None]:
pip install parse_version

In [None]:
pip install git+https://github.com/amaiya/eli5@tfkeras_0_10_1

In [9]:
import os

import numpy as np
import pandas as pd

import tensorflow as tf
import tensorflow_hub as hub

from keras.utils import np_utils

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

import matplotlib.pyplot as plt

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    # Currently, memory growth needs to be the same across GPUs
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Memory growth must be set before GPUs have been initialized
    print(e)

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")

1 Physical GPUs, 1 Logical GPUs
Version:  2.5.0
Eager mode:  True
Hub version:  0.12.0
GPU is available


In [10]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


###Load data

In [11]:
# Load train data
train_path = '/content/drive/MyDrive/TeamLab/data/semeval_taskA_corrected.csv'

df_train = pd.read_csv(train_path, header=0, names=['index',
                                                    'irony_label',
                                                    'tweet'])
                                                

In [12]:
df_train.head()

Unnamed: 0,index,irony_label,tweet
0,1,1,Sweet United Nations video. Just in time for C...
1,2,1,@mrdahl87 We are rumored to have talked to Erv...
2,3,1,Hey there! Nice to see you Minnesota/ND Winter...
3,4,0,3 episodes left I'm dying over here
4,5,1,I can't breathe! was chosen as the most notabl...


In [13]:
# Check if dataset is balanced

# Classes are 1 and 0. Tweet can either be ironic or non-ironic -> binary classification
classes = df_train.irony_label.unique()

print((df_train.irony_label == 0).sum())
print((df_train.irony_label == 1).sum())

# => Balanced

1923
1911


In [14]:
# Load test data
test_path = '/content/drive/MyDrive/TeamLab/data/semeval_taskA_test.csv'

df_test = pd.read_csv(test_path, sep='\t', header=0, names=['index',
                                                            'irony_label',
                                                            'tweet'])

print((df_test.irony_label == 0).sum())
print((df_test.irony_label == 1).sum())

df_test.head()

473
311


Unnamed: 0,index,irony_label,tweet
0,1,0,@Callisto1947 Can U Help?||More conservatives ...
1,2,1,"Just walked in to #Starbucks and asked for a ""..."
2,3,0,#NOT GONNA WIN http://t.co/Mc9ebqjAqj
3,4,0,@mickymantell He is exactly that sort of perso...
4,5,1,So much #sarcasm at work mate 10/10 #boring 10...
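
The training split is close to 50/50, but the test split is not (473 non-ironic vs. 311 ironic tweets), so test accuracy is best read against a majority-class baseline. A minimal sketch using only the counts printed above; the helper name `majority_baseline` is illustrative and not part of the notebook:

```python
def majority_baseline(counts):
    """Accuracy obtained by always predicting the most frequent class."""
    total = sum(counts.values())
    return max(counts.values()) / total

train_counts = {0: 1923, 1: 1911}  # counts printed for the training split
test_counts = {0: 473, 1: 311}     # counts printed for the test split

print(f"train baseline: {majority_baseline(train_counts):.3f}")
print(f"test baseline:  {majority_baseline(test_counts):.3f}")
```

On the test split the baseline is roughly 0.60, so a classifier has to beat that figure rather than 0.50.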


In [15]:
x_train = df_train['tweet'].to_numpy()
y_train = df_train['irony_label'].to_numpy()

x_test = df_test['tweet'].to_numpy()
y_test = df_test['irony_label'].to_numpy()

In [16]:
x_train[0:3]

array(['Sweet United Nations video. Just in time for Christmas. #imagine #NoReligion  http://t.co/fej2v3OUBR',
       "@mrdahl87 We are rumored to have talked to Erv's agent... and the Angels asked about Ed Escobar... that's hardly nothing    ;)",
       'Hey there! Nice to see you Minnesota/ND Winter Weather'],
      dtype=object)

###Normalisation of input

Normalise:
+ hashtags
+ tagged users
+ emoji (removed)
+ urls 

In [17]:
import emoji
from nltk.tokenize import TweetTokenizer
import re
import contractions
import numpy as np


def normalise_tweet(tweet):
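    norm_tweet = re.sub("&", "and", tweet)
    norm_tweet = re.sub(r"[<>]", "", norm_tweet)
    # Replace everything from the first "http:" to the end of the tweet with "url".
    # The pattern is greedy and does not match "https:" links.
    norm_tweet = re.sub("http:.*", "url", norm_tweet)
    # Detach user tags from preceding text; drop the hash sign but keep the hashtag word
    norm_tweet = re.sub("@", " @", norm_tweet)
    norm_tweet = re.sub("#", " ", norm_tweet)

    # Convert emojis to their ":name:" form, then remove single-word names (e.g. ":fire:").
    # Multi-word names such as ":pouting_face:" do not match and survive as plain words.
    norm_tweet = emoji.demojize(norm_tweet)
    norm_tweet = re.sub(":[a-z][a-z]+:", "", norm_tweet)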
    
    norm_tweet = re.sub(r"[-()/_;:{}=~|,\[\]]", " ", norm_tweet)

    norm_tweet = contractions.fix(norm_tweet)

    tokenizer = TweetTokenizer()
    final_tweet = ''

    for token in tokenizer.tokenize(norm_tweet):
        if token.startswith("@"):
            token = "tagged_user"
        if token.isnumeric():
            token = "digit"

        final_tweet += token + " "
        
    return final_tweet.strip()
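
The URL rule in `normalise_tweet` is worth looking at in isolation, because `http:.*` consumes the rest of the tweet. The sketch below compares it with a token-level pattern. `TOKEN_URL`, `replace_urls` and the sample tweet are assumptions made for illustration; they are not what the notebook was trained with.

```python
import re

GREEDY_URL = re.compile(r"http:.*")      # pattern used in normalise_tweet
TOKEN_URL = re.compile(r"https?://\S+")  # hypothetical alternative, not the notebook's

def replace_urls(text, pattern):
    """Replace every match of `pattern` in `text` with the placeholder 'url'."""
    return pattern.sub("url", text)

# Made-up tweet with text after the first link and a second https link
sample = "Wow http://t.co/abc123 so funny https://t.co/xyz789 indeed"

print(replace_urls(sample, GREEDY_URL))  # everything after the first link is dropped
print(replace_urls(sample, TOKEN_URL))   # only the links themselves are replaced
```

Switching patterns would change the model input, so the results reported further down apply to the greedy version only.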

In [18]:
x_train_norm = []
for tweet in x_train:
    x_train_norm.append(normalise_tweet(tweet))

x_test_norm = []
for tweet in x_test:
    x_test_norm.append(normalise_tweet(tweet))

x_train_norm = np.array(x_train_norm)
x_test_norm = np.array(x_test_norm)

In [19]:
x_train_norm[10:20]

array(['Oh thank GOD our entire office email system is down ... the day of a big event . Santa you know JUST what to get me for xmas .',
       'But instead I am scrolling through Facebook Instagram and Twitter for hours on end accomplishing nothing .',
       'tagged_user pouting face no he bloody is not I was upstairs getting changed !',
       "Cold or warmth both suffuse one's cheeks with pink colour tone ... Do you understand the underlying difference and its texture ?",
       'Just great when you are mobile bill arrives by text',
       'crushes are great until you realize they will never be interested in you . p',
       'Buffalo sports media is smarter than all of us . Where else can you get the quality insight offered by Harrington and Busgaglia .',
       'I guess my cat also lost digit pounds when she went to the vet after I have been feeding her a few times a day . Eating food WorkingOut',
       'tagged_user tagged_user Rosenthal trading a SP for a defense only SS ? Brill
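
The "pouting face" left in the third tweet above shows that the emoji rule only removes names without underscores. A small sketch of why, using a hand-written stand-in for `emoji.demojize` output. `WIDER_EMOJI` and `strip_emoji_names` are hypothetical and not part of the notebook; the helper also collapses whitespace, which the original substitution does not do.

```python
import re

ORIGINAL_EMOJI = re.compile(r":[a-z][a-z]+:")  # pattern used in normalise_tweet
WIDER_EMOJI = re.compile(r":[A-Za-z0-9_]+:")   # hypothetical alternative

def strip_emoji_names(text, pattern):
    """Remove ':name:' spans matching `pattern`, then collapse whitespace."""
    return " ".join(pattern.sub(" ", text).split())

# Stand-in for demojized text: one multi-word and one single-word emoji name
demojized = "no he bloody is not :pouting_face: :fire:"

print(strip_emoji_names(demojized, ORIGINAL_EMOJI))  # ':pouting_face:' survives
print(strip_emoji_names(demojized, WIDER_EMOJI))     # both names are removed
```

The wider pattern is not free of side effects: it would also match colon-delimited text that is not an emoji name, such as the middle of a timestamp like 10:30:00.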

##Model (ktrain)

In [20]:
import ktrain
from ktrain import text

categories = [0, 1]

MODEL_NAME = 'distilbert-base-cased'

# Transformer is a wrapper around the Hugging Face transformers library for text classification.
t = text.Transformer(MODEL_NAME, maxlen=100, class_names=categories)

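# Using normalised input data. The test split also serves as validation data here,
# so early stopping during training picks the best epoch on the test set.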
trn = t.preprocess_train(x_train_norm, y_train)
val = t.preprocess_test(x_test_norm, y_test)

model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)

preprocessing train...
language: en
train sequence lengths:
	mean : 16
	95percentile : 28
	99percentile : 31


Is Multi-Label? False
preprocessing test...
language: en
test sequence lengths:
	mean : 17
	95percentile : 28
	99percentile : 36


###Estimate LR

Run the following to let ktrain estimate a good LR:

learner.lr_find(show_plot=True, max_epochs=4)

###Train

In [21]:
best_lr = 5e-5

In [22]:
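# Train with the autofit policy at LR 5e-5.
# No epoch count is given, so ktrain enables early stopping and
# reduce-on-plateau automatically (see the log below).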
learner.autofit(lr=best_lr, checkpoint_folder='/my_models', verbose=1)

early_stopping automatically enabled at patience=5
reduce_on_plateau automatically enabled at patience=2


begin training using triangular learning rate policy with max lr of 5e-05...
Epoch 1/1024
Epoch 2/1024
Epoch 3/1024
Epoch 4/1024

Epoch 00004: Reducing Max LR on Plateau: new max lr will be 2.5e-05 (if not early_stopping).
Epoch 5/1024
Epoch 6/1024

Epoch 00006: Reducing Max LR on Plateau: new max lr will be 1.25e-05 (if not early_stopping).
Epoch 7/1024
Restoring model weights from the end of the best epoch.
Epoch 00007: early stopping
Weights from best epoch have been loaded into model.


<tensorflow.python.keras.callbacks.History at 0x7f45d7524790>
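
The log above can be read as two patience counters running side by side: reduce-on-plateau (patience 2) halves the max LR at epochs 4 and 6, and early stopping (patience 5) fires at epoch 7. That pattern is consistent with the best epoch being epoch 2. The sketch below replays this logic in plain Python. It is a simplified re-implementation for illustration, not ktrain's code, and the validation losses are hypothetical because the notebook does not print them.

```python
def simulate_patience(val_losses, max_lr, stop_patience=5, reduce_patience=2, factor=2.0):
    """Replay early stopping and reduce-on-plateau over a list of validation losses.

    Returns (best_epoch, reductions, stopped_at); epochs are 1-based.
    """
    best_loss = float("inf")
    best_epoch = None
    stop_wait = reduce_wait = 0
    reductions = []   # (epoch, new max LR) for every plateau reduction
    stopped_at = None
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember it and reset both counters
            best_loss, best_epoch = loss, epoch
            stop_wait = reduce_wait = 0
            continue
        stop_wait += 1
        reduce_wait += 1
        if reduce_wait >= reduce_patience:
            max_lr /= factor
            reductions.append((epoch, max_lr))
            reduce_wait = 0
        if stop_wait >= stop_patience:
            stopped_at = epoch
            break
    return best_epoch, reductions, stopped_at

# Hypothetical validation losses with the minimum at epoch 2
losses = [0.66, 0.61, 0.68, 0.74, 0.83, 0.95, 1.10]
best_epoch, reductions, stopped_at = simulate_patience(losses, max_lr=5e-5)
print(best_epoch, reductions, stopped_at)
```

With these made-up losses the replay reproduces the schedule in the log: reductions to 2.5e-05 and 1.25e-05 at epochs 4 and 6, and a stop at epoch 7.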

###Evaluate/Inspect model

In [23]:
learner.validate(class_names=t.get_classes())

              precision    recall  f1-score   support

           0       0.72      0.69      0.71       473
           1       0.56      0.59      0.58       311

    accuracy                           0.65       784
   macro avg       0.64      0.64      0.64       784
weighted avg       0.66      0.65      0.65       784



array([[327, 146],
       [126, 185]])
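
The figures in the classification report follow directly from the confusion matrix, read with rows as true labels and columns as predictions (the row sums 473 and 311 match the supports). A minimal sketch that recomputes them; `per_class_metrics` is an illustrative helper, not a ktrain or scikit-learn function:

```python
def per_class_metrics(cm):
    """Precision, recall and F1 per class from a square confusion matrix
    (rows = true labels, columns = predicted labels)."""
    n = len(cm)
    metrics = {}
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = tp / predicted_k
        recall = tp / actual_k
        f1 = 2 * precision * recall / (precision + recall)
        metrics[k] = (precision, recall, f1)
    return metrics

cm = [[327, 146], [126, 185]]  # confusion matrix printed above
accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))

for label, (p, r, f1) in per_class_metrics(cm).items():
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Accuracy comes out at 512/784, about 0.65, which is only a few points above the 473/784 (about 0.60) obtained by always predicting the non-ironic class.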

In [24]:
# the ones that we got most wrong
learner.view_top_losses(n=10, preproc=t)

----------
id:618 | loss:2.53 | true:0 | pred:1)

----------
id:700 | loss:2.43 | true:0 | pred:1)

----------
id:591 | loss:2.36 | true:0 | pred:1)

----------
id:676 | loss:2.28 | true:0 | pred:1)

----------
id:330 | loss:2.15 | true:0 | pred:1)

----------
id:506 | loss:2.08 | true:0 | pred:1)

----------
id:552 | loss:2.04 | true:0 | pred:1)

----------
id:170 | loss:1.98 | true:0 | pred:1)

----------
id:5 | loss:1.97 | true:0 | pred:1)

----------
id:373 | loss:1.97 | true:0 | pred:1)



In [25]:
# print out instance to see why...
print(x_test_norm[506])
print(x_test_norm[217])
print(x_test_norm[446])
print(x_test_norm[552])
print(x_test_norm[295])

This time last year ... shiid was hella funny ... unforgettable khwaaaa
Wow Look what the NFL Rams player who did " Hands Up do not Shoot " pose has been arrested for url
SMILES when there is MONEY SCOWLS when there is NOT RECOGNITION when there is MONEY IGNORANCE when there is NOT dollar banknote money bag police car oncoming fist " ' broken heart GOD $ broken heart "
Yeah you are a grown up and at times feel very nostalgic towards your Bachpan . Iife's handiwork ! !
If you know people who could talk power dressing and social media marketing please let me know . Thanks


###Make predictions on new data

In [26]:
predictor = ktrain.get_predictor(learner.model, preproc=t)

In [27]:
test_sent = 'Cool it is raining again'

In [28]:
predictor.predict(test_sent)

1

In [29]:
# Ask for explanation
predictor.explain(test_sent)

Contribution?,Feature
0.438,Highlighted in text (sum)
0.159,<BIAS>
