## Fine-tuning BERT for Emotion Classification on an Emotion-Labelled Novel Dataset

In [3]:
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit
import numpy as np
import ktrain



### Import, Split, and Encode Dataset

Load the novel dataset, in which each sentence carries an emotion label

In [4]:
df_full = pd.read_csv('./data/novel_sentences_emotionally_labelled.csv', encoding='utf-8')

In [5]:
df_full.sample(10)

| Unnamed: 0 | NovelId | SentenceId | Emotion | Text |
|---|---|---|---|---|
| 3887 | 56 | 85 | Neutral | Then he went out and sent away the company, an... |
| 7440 | 105 | 54 | Neutral | The tree first recovered itself while being un... |
| 14256 | 167 | 17 | Neutral | He let himself down the hedge with a long thin... |
| 7817 | 108 | 82 | Neutral | I know not whether the clear, blue atmosphere ... |
| 12279 | 147 | 6 | Neutral | "Stork, stork, fly away, Stand not on one leg,... |
| 9362 | 120 | 61 | Sadness | These were wonderful and happy moments for the... |
| 3917 | 57 | 25 | Surprise | It is my husband!" she quickly hid the roast m... |
| 11784 | 141 | 92 | Sadness | "It was sunshine here yesterday," said the lit... |
| 7548 | 106 | 5 | Happiness | How the sunshine cheers me, and how sweet and ... |
| 7670 | 107 | 41 | Neutral | The head of the house obtained a situation as ... |


In [6]:
len(df_full)

15302

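Split the dataset by novel, so that all sentences from one novel stay in the same split. `GroupShuffleSplit` does not stratify on the labels, so the emotion distribution across the splits is checked separately after the split.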

In [7]:
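# A single group-wise split; test_size defaults to 0.2, i.e. 20% of the novels (groups), not of the sentences.
splitter = GroupShuffleSplit(n_splits=1, random_state=12345)
# The labels are passed as y, but GroupShuffleSplit does not stratify on them.
train_idx, test_idx = next(splitter.split(X=df_full, y=df_full["Emotion"], groups=df_full["NovelId"]))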

In [8]:
len(train_idx), len(test_idx)

(11300, 4002)
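
The idea behind a group-wise split can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: the helper `group_shuffle_split`, its parameters, and the toy `novel_ids` list are assumptions made for the example. It shuffles the novel IDs rather than the sentences, so no novel can land in both sets, and it never looks at the labels, which is why the emotion distribution has to be checked afterwards.

```python
import math
import random


def group_shuffle_split(groups, test_fraction=0.2, seed=0):
    """Split row indices so that every group lands wholly in train or in test.

    Hypothetical helper for illustration only.
    """
    unique_groups = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)
    # The fraction applies to the number of groups, not the number of rows.
    n_test = max(1, math.ceil(test_fraction * len(unique_groups)))
    test_groups = set(unique_groups[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Toy data: one novel ID per sentence (made up for the example).
novel_ids = [0, 0, 0, 1, 1, 2, 2, 2, 3, 4, 4]
train_idx, test_idx = group_shuffle_split(novel_ids, test_fraction=0.4, seed=12345)

train_novels = {novel_ids[i] for i in train_idx}
test_novels = {novel_ids[i] for i in test_idx}
print(sorted(train_novels), sorted(test_novels))
```

Because whole novels are assigned to a side, the share of sentences in the test set depends on how long the selected novels are and generally differs from `test_fraction`.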

In [9]:
df_train = df_full.iloc[train_idx]
df_test = df_full.iloc[test_idx]

In [10]:
df_train.head(10)

| Unnamed: 0 | NovelId | SentenceId | Emotion | Text |
|---|---|---|---|---|
| 0 | 0 | 0 | Neutral | In a certain mill lived an old miller who had ... |
| 1 | 0 | 1 | Neutral | As they had been with him several years, he on... |
| 2 | 0 | 2 | Disgust | The third of the boys was, however, the drudge... |
| 3 | 0 | 3 | Neutral | Then all three went out together, and when the... |
| 4 | 0 | 4 | Neutral | Hans, however, went with them, and when it was... |
| 5 | 0 | 5 | Neutral | The two sharp ones waited until Hans had falle... |
| 6 | 0 | 6 | Neutral | And they thought they had done a very clever t... |
| 7 | 0 | 7 | Neutral | When the sun arose, and Hans woke up, he was l... |
| 8 | 0 | 8 | Surprise | He looked around on every side and exclaimed, ... |
| 9 | 0 | 9 | Sadness | Then he got up and clambered out of the cave, ... |


In [11]:
df_test.head(10)

| Unnamed: 0 | NovelId | SentenceId | Emotion | Text |
|---|---|---|---|---|
| 113 | 2 | 0 | Sadness | Little brother took his little sister by the h... |
| 114 | 2 | 1 | Sadness | Our meals are the hard crusts of bread that ar... |
| 115 | 2 | 2 | Sadness | May Heaven pity us. |
| 116 | 2 | 3 | Sadness | If our mother only knew! |
| 117 | 2 | 4 | Neutral | Come, we will go forth together into the wide ... |
| 118 | 2 | 5 | Sadness | They walked the whole day over meadows, fields... |
| 119 | 2 | 6 | Sadness | In the evening they came to a large forest, an... |
| 120 | 2 | 7 | Neutral | The next day when they awoke, the sun was alre... |
| 121 | 2 | 8 | Neutral | Then the brother said, "Sister, I am thirsty; ... |
| 122 | 2 | 9 | Neutral | The brother got up and took the little sister ... |


Confirm that the split respects novel boundaries: no `NovelId` should appear in both sets

In [12]:
df_train["NovelId"].unique()

array([  0,   1,   3,   4,   5,   6,   7,   8,   9,  11,  12,  13,  14,
        15,  18,  19,  20,  21,  22,  23,  25,  26,  27,  28,  29,  30,
        31,  32,  34,  36,  37,  38,  39,  41,  43,  44,  45,  46,  47,
        49,  50,  51,  52,  53,  54,  56,  57,  58,  59,  60,  61,  62,
        63,  64,  65,  66,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        79,  80,  81,  82,  83,  84,  85,  87,  88,  90,  91,  93,  94,
        96,  97,  98,  99, 101, 102, 103, 105, 106, 107, 108, 109, 110,
       112, 113, 114, 115, 116, 117, 118, 119, 123, 124, 125, 126, 127,
       129, 130, 131, 132, 133, 134, 135, 137, 138, 139, 142, 143, 144,
       145, 148, 149, 152, 153, 154, 155, 156, 157, 159, 160, 161, 163,
       164, 166, 167, 168, 169, 171, 172, 173, 174, 175], dtype=int64)

In [13]:
df_test["NovelId"].unique()

array([  2,  10,  16,  17,  24,  33,  35,  40,  42,  48,  55,  67,  68,
        78,  86,  89,  92,  95, 100, 104, 111, 120, 121, 122, 128, 136,
       140, 141, 146, 147, 150, 151, 158, 162, 165, 170], dtype=int64)
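
Comparing the two printed arrays by eye is error-prone, so the same check can be written as a set intersection. The sketch below is a minimal, dependency-free version: `shared_groups` is a hypothetical helper, the test IDs are copied from the output above, and only the first printed row of training IDs is included to keep the example short. With the notebook's frames, one would pass `df_train["NovelId"]` and `df_test["NovelId"]` instead.

```python
def shared_groups(train_groups, test_groups):
    """Return the sorted group IDs that occur in both collections."""
    return sorted(set(train_groups) & set(test_groups))


# First printed row of training novel IDs (a sample, not the full list).
train_ids_sample = [0, 1, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14]

# Test novel IDs as printed above.
test_ids = [2, 10, 16, 17, 24, 33, 35, 40, 42, 48, 55, 67, 68,
            78, 86, 89, 92, 95, 100, 104, 111, 120, 121, 122, 128, 136,
            140, 141, 146, 147, 150, 151, 158, 162, 165, 170]

shared = shared_groups(train_ids_sample, test_ids)
print(shared)  # prints: []
```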

Confirm that the emotion distributions of the train and test splits are broadly similar (values are percentages per class)

In [14]:
100*df_train["Emotion"].value_counts()/len(df_train)

Neutral      60.362832
Happiness    12.752212
Sadness       6.752212
Surprise      6.424779
Fear          5.371681
Anger         5.194690
Disgust       3.141593
Name: Emotion, dtype: float64

In [15]:
100*df_test["Emotion"].value_counts()/len(df_test)

Neutral      64.317841
Happiness    11.044478
Sadness       5.622189
Surprise      5.472264
Anger         5.347326
Fear          4.797601
Disgust       3.398301
Name: Emotion, dtype: float64
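
To put a number on "broadly similar", the per-class gap between the two splits can be computed directly. The sketch below copies the percentages printed above into two dictionaries; the variable names are chosen for this example and are not part of the notebook.

```python
# Percentages copied from the two outputs above.
train_pct = {"Neutral": 60.362832, "Happiness": 12.752212, "Sadness": 6.752212,
             "Surprise": 6.424779, "Fear": 5.371681, "Anger": 5.194690,
             "Disgust": 3.141593}
test_pct = {"Neutral": 64.317841, "Happiness": 11.044478, "Sadness": 5.622189,
            "Surprise": 5.472264, "Anger": 5.347326, "Fear": 4.797601,
            "Disgust": 3.398301}

# Signed gap in percentage points (test minus train) for every class.
gaps = {label: test_pct[label] - train_pct[label] for label in train_pct}

# The class whose share differs most between the two splits.
largest = max(gaps, key=lambda label: abs(gaps[label]))
print(largest, round(gaps[largest], 2))  # prints: Neutral 3.96
```

Every class differs by less than four percentage points, with Neutral showing the largest gap.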

Extract X and y

In [16]:
X_train = df_train["Text"].tolist()
X_test = df_test["Text"].tolist()
y_train = df_train["Emotion"].tolist()
y_test = df_test["Emotion"].tolist()

In [17]:
class_names = df_full["Emotion"].unique().tolist()

In [18]:
encoding = {cn: i for i, cn in enumerate(class_names)}

In [19]:
encoding

{'Neutral': 0,
 'Disgust': 1,
 'Surprise': 2,
 'Sadness': 3,
 'Happiness': 4,
 'Anger': 5,
 'Fear': 6}

In [20]:
y_train = [encoding[lbl] for lbl in y_train]
y_test = [encoding[lbl] for lbl in y_test]

In [21]:
y_train[:10]

[0, 0, 1, 0, 0, 0, 0, 0, 2, 3]
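
Model outputs will be integer class indices, so the inverse mapping is useful for reading them back as emotion names. The sketch below rebuilds the `encoding` dictionary shown above and inverts it; `decoding` is a name introduced for this example.

```python
# The label-to-index mapping printed above.
encoding = {"Neutral": 0, "Disgust": 1, "Surprise": 2, "Sadness": 3,
            "Happiness": 4, "Anger": 5, "Fear": 6}

# Invert it: index -> label. Safe because every index is unique.
decoding = {index: label for label, index in encoding.items()}

# The first ten encoded training labels shown above.
first_ten = [0, 0, 1, 0, 0, 0, 0, 0, 2, 3]
decoded = [decoding[i] for i in first_ten]
print(decoded)
```

The decoded labels match the `Emotion` column of the first ten training rows shown earlier.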

### Preprocessing Data for BERT

In [22]:
(X_train_pp, y_train_pp), (X_test_pp, y_test_pp), preproc = ktrain.text.texts_from_array(
    x_train=X_train, y_train=y_train,
    x_test=X_test, y_test=y_test,
    class_names=class_names,
    preprocess_mode='bert',
    maxlen=100,
)

task: text classification
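
Sentences longer than `maxlen` are truncated, so it is worth checking how many fit. The sketch below is a rough estimate only: `share_within_limit` is a hypothetical helper, it counts whitespace-separated words rather than BERT's WordPiece tokens (which are usually more numerous, and the special tokens take up positions too), and the third sample text is synthetic.

```python
def share_within_limit(texts, maxlen):
    """Return the fraction of texts with at most `maxlen` whitespace tokens."""
    if not texts:
        raise ValueError("texts must not be empty")
    within = sum(1 for text in texts if len(text.split()) <= maxlen)
    return within / len(texts)


sample_texts = [
    "May Heaven pity us.",        # from the test rows shown earlier
    "If our mother only knew!",   # from the test rows shown earlier
    " ".join(["word"] * 150),     # synthetic 150-word sentence
]

share = share_within_limit(sample_texts, maxlen=100)
print(f"{share:.0%} of the sample fits within 100 words")
```

On the real data one would pass `X_train` instead of `sample_texts` and treat the result as an optimistic upper bound.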


### Training and Validation

In [23]:
model = ktrain.text.text_classifier('bert', train_data=(X_train_pp, y_train_pp), preproc=preproc)

Is Multi-Label? False
maxlen is 100
done.


In [24]:
# The held-out test split doubles as the validation set here.
learner = ktrain.get_learner(
    model,
    train_data=(X_train_pp, y_train_pp),
    val_data=(X_test_pp, y_test_pp),
    batch_size=32,  # default batch size = 32
)

In [25]:
#learner.lr_find() # simulates training to identify a good learning rate
#learner.lr_plot()

In [26]:
learner.freeze()  # with no argument, ktrain freezes every layer except the final Dense layer

In [None]:
learner.fit_onecycle(1e-5, epochs=3)
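
When reading the validation results, note that about 64% of the test sentences are Neutral, so a model that always predicts Neutral would already reach roughly that accuracy. Macro-averaged F1 weights every class equally and exposes this. The sketch below is a dependency-free illustration on made-up labels; `accuracy` and `macro_f1` are hypothetical helpers, not ktrain functions.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up, imbalanced labels: class 0 plays the role of Neutral.
y_true = [0, 0, 0, 0, 1, 2]

# Majority-class baseline: always predict the most frequent label.
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

baseline_acc = accuracy(y_true, y_pred)
baseline_f1 = macro_f1(y_true, y_pred)
print(f"accuracy={baseline_acc:.3f}  macro-F1={baseline_f1:.3f}")
```

A fine-tuned model is only clearly useful if it beats this baseline on both measures.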