# [Emotion Classification in short texts with BERT](https://github.com/lukasgarbas/nlp-text-emotion/blob/master/bert.ipynb)
 
This notebook applies BERT to multiclass text classification. Our dataset consists of written dialogs, messages, and short stories. Each dialog utterance or message is labeled with one of five emotion categories: joy, anger, sadness, fear, or neutral.

## Workflow: 
1. Import Data
2. Data preprocessing and downloading BERT
3. Training and validation
4. Saving the model

Multiclass text classification with BERT and [ktrain](https://github.com/amaiya/ktrain).
👋  **Let's start** 

In [1]:
# install ktrain on Google Colab
!pip3 install ktrain

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting ktrain
  Downloading ktrain-0.31.10.tar.gz (25.3 MB)
Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
Collecting cchardet
  Downloading cchardet-2.1.7-cp37-cp37m-manylinux2010_x86_64.whl (263 kB)
Collecting syntok>1.3.3
  Downloading syntok-1.4.4-py3-none-any.whl (24 kB)
Collecting transformers==4.17.0
  Downloading transformers-4.17.0-py3-none-any.whl (3.8 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting keras_bert>=0.86.0
  Downloading keras-bert-0.89.0.

In [2]:
import pandas as pd
import numpy as np

import ktrain
from ktrain import text

## 1. Import Data

In [3]:
from google.colab import drive
drive.mount('/content/drive')
data_train = pd.read_csv('/content/drive/MyDrive/nlp-text-emotion-master/data/data_train.csv', encoding='utf-8')
data_test = pd.read_csv('/content/drive/MyDrive/nlp-text-emotion-master/data/data_test.csv', encoding='utf-8')

X_train = data_train.Text.tolist()
X_test = data_test.Text.tolist()

y_train = data_train.Emotion.tolist()
y_test = data_test.Emotion.tolist()

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
data = pd.concat([data_train, data_test], ignore_index=True)

class_names = ['joy', 'sadness', 'fear', 'anger', 'neutral']

print('size of training set: %s' % (len(data_train['Text'])))
print('size of validation set: %s' % (len(data_test['Text'])))
print(data.Emotion.value_counts())

data.head(10)

Mounted at /content/drive
size of training set: 7934
size of validation set: 3393
joy        2326
sadness    2317
anger      2259
neutral    2254
fear       2171
Name: Emotion, dtype: int64


| | Emotion | Text |
|---|---|---|
| 0 | neutral | There are tons of other paintings that I thin... |
| 1 | sadness | Yet the dog had grown old and less capable , a... |
| 2 | fear | When I get into the tube or the train without ... |
| 3 | fear | This last may be a source of considerable disq... |
| 4 | anger | She disliked the intimacy he showed towards so... |
| 5 | sadness | When my family heard that my Mother's cousin w... |
| 6 | joy | Finding out I am chosen to collect norms for C... |
| 7 | anger | A spokesperson said : ` Glen is furious that t... |
| 8 | neutral | Yes . |
| 9 | sadness | When I see people with burns I feel sad, actua... |


In [4]:
encoding = {
    'joy': 0,
    'sadness': 1,
    'fear': 2,
    'anger': 3,
    'neutral': 4
}

# Integer values for each class
y_train = [encoding[x] for x in y_train]
y_test = [encoding[x] for x in y_test]
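
The integer labels only make sense if their order matches `class_names`, because index `i` in that list is treated as class `i` later on. A minimal sketch of deriving both mappings from the list itself, so the two cannot drift apart (`encode_labels` and `decode_labels` are hypothetical helper names, not part of the notebook):

```python
# Class order taken from the notebook; the position in this list is the integer label.
class_names = ['joy', 'sadness', 'fear', 'anger', 'neutral']

# Build the forward and reverse mappings from a single source of truth.
encoding = {name: idx for idx, name in enumerate(class_names)}
decoding = {idx: name for name, idx in encoding.items()}


def encode_labels(labels):
    """Map emotion names to integers, failing loudly on unknown labels."""
    unknown = sorted(set(labels) - set(encoding))
    if unknown:
        raise ValueError(f"unexpected labels: {unknown}")
    return [encoding[label] for label in labels]


def decode_labels(indices):
    """Map integer class ids back to emotion names."""
    return [decoding[i] for i in indices]


print(encode_labels(['fear', 'joy', 'neutral']))  # [2, 0, 4]
print(decode_labels([3, 1]))                      # ['anger', 'sadness']
```

Raising on unknown labels catches typos or extra categories in the CSV before they reach training.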

## 2. Data preprocessing

* The text must be preprocessed in a specific way for use with BERT. This is done by setting `preprocess_mode` to `'bert'`. The BERT model and vocabulary are downloaded automatically.

* BERT can handle a maximum sequence length of 512 tokens, but let's use fewer to reduce memory use and improve speed.
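
One way to choose a shorter `maxlen` is to look at how long the texts actually are. The sketch below is only a rough heuristic: it counts whitespace-separated words, whereas BERT's WordPiece tokenizer usually produces more tokens than words, so treat the result as a lower bound. `suggest_maxlen`, its `coverage` parameter, and the sample texts are assumptions made for illustration.

```python
import math


def suggest_maxlen(texts, coverage=0.95, cap=512, special_tokens=2):
    """Return a length that covers `coverage` of the texts, capped at BERT's limit.

    Lengths are whitespace word counts (an approximation of token counts);
    `special_tokens` leaves room for the [CLS] and [SEP] markers.
    """
    lengths = sorted(len(text.split()) for text in texts)
    if not lengths:
        raise ValueError("texts must not be empty")
    # Index of the smallest length that covers the requested share of texts.
    idx = max(0, min(len(lengths) - 1, math.ceil(coverage * len(lengths)) - 1))
    return min(cap, lengths[idx] + special_tokens)


# Made-up sample texts; in the notebook this would be called on X_train.
sample = ["Yes .", "I cannot sleep", "She disliked the intimacy he showed"]
print(suggest_maxlen(sample))
```

If the suggested value is far below the chosen `maxlen`, a smaller setting would pad less and train faster.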

In [5]:
#bert => distilbert / max_features=350000
(x_train,  y_train), (x_test, y_test), preproc = text.texts_from_array(x_train=X_train, y_train=y_train,
                                                                       x_test=X_test, y_test=y_test,
                                                                       class_names=class_names,
                                                                       preprocess_mode='bert',
                                                                       maxlen=200,
                                                                       ngram_range=2)

downloading pretrained BERT model (uncased_L-12_H-768_A-12.zip)...
[██████████████████████████████████████████████████]
extracting pretrained BERT model...
done.

cleanup downloaded zip...
done.

preprocessing train...
language: en


Is Multi-Label? False
preprocessing test...
language: en


task: text classification


## 3. Training and validation


Loading the pretrained BERT for text classification 

In [6]:
model = text.text_classifier('bert', train_data=(x_train, y_train), preproc=preproc)

Is Multi-Label? False
maxlen is 200
done.


Wrap it in a Learner object

In [7]:
learner = ktrain.get_learner(model, train_data=(x_train, y_train), 
                             val_data=(x_test, y_test),
                             batch_size=6)

Train the model. More about tuning learning rates [here](https://github.com/amaiya/ktrain/blob/master/tutorial-02-tuning-learning-rates.ipynb)

In [8]:
learner.fit_onecycle(2e-5, 2)



begin training using onecycle policy with max lr of 2e-05...
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f2e1c29df10>
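
To give a feel for what a one-cycle policy does with the maximum learning rate of `2e-5`, here is a simplified triangular schedule: the rate rises linearly to the maximum during the first half of training and falls back during the second half. This is a sketch of the general idea, not ktrain's exact implementation; `onecycle_lr` and the `div_factor` value are assumptions. The step counts use the training-set size and batch size from this notebook.

```python
import math


def onecycle_lr(step, total_steps, max_lr, div_factor=10.0):
    """Triangular schedule: rise linearly to max_lr, then fall back.

    div_factor (an assumed value) sets the starting rate to max_lr / div_factor.
    """
    base_lr = max_lr / div_factor
    half = total_steps / 2
    if step <= half:
        return base_lr + (max_lr - base_lr) * (step / half)
    return max_lr - (max_lr - base_lr) * ((step - half) / half)


# 7934 training examples, batch size 6, 2 epochs (values from the notebook).
train_size, batch_size, epochs = 7934, 6, 2
steps_per_epoch = math.ceil(train_size / batch_size)  # 1323
total_steps = steps_per_epoch * epochs                # 2646

# Learning rate at the start, the midpoint, and the end of training.
for step in (0, total_steps // 2, total_steps):
    print(step, f"{onecycle_lr(step, total_steps, 2e-5):.2e}")
```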

Validation

In [9]:
learner.validate(val_data=(x_test, y_test), class_names=class_names)

              precision    recall  f1-score   support

         joy       0.85      0.86      0.85       707
     sadness       0.79      0.82      0.81       676
        fear       0.88      0.84      0.86       679
       anger       0.80      0.77      0.79       693
     neutral       0.81      0.83      0.82       638

    accuracy                           0.82      3393
   macro avg       0.83      0.82      0.82      3393
weighted avg       0.83      0.82      0.83      3393



array([[605,  16,  12,  14,  60],
       [ 20, 555,  27,  55,  19],
       [ 17,  36, 570,  43,  13],
       [ 18,  68,  34, 537,  36],
       [ 52,  25,   6,  23, 532]])
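
The report and the confusion matrix carry the same information. Each row sums to that class's support, so rows are true classes and columns are predicted classes. A short sketch recomputing the per-class scores from the matrix (`per_class_metrics` is a hypothetical helper written for this illustration):

```python
class_names = ['joy', 'sadness', 'fear', 'anger', 'neutral']

# Confusion matrix copied from the validation output above:
# rows are true classes, columns are predicted classes.
cm = [
    [605,  16,  12,  14,  60],
    [ 20, 555,  27,  55,  19],
    [ 17,  36, 570,  43,  13],
    [ 18,  68,  34, 537,  36],
    [ 52,  25,   6,  23, 532],
]


def per_class_metrics(cm, names):
    """Return {class: (precision, recall, f1)} computed from a confusion matrix."""
    metrics = {}
    for i, name in enumerate(names):
        true_pos = cm[i][i]
        predicted = sum(row[i] for row in cm)  # column sum: times class i was predicted
        actual = sum(cm[i])                    # row sum: support of class i
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[name] = (precision, recall, f1)
    return metrics


accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))

for name, (p, r, f1) in per_class_metrics(cm, class_names).items():
    print(f"{name:>8}  precision={p:.2f}  recall={r:.2f}  f1={f1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Reading the off-diagonal cells shows where the model struggles: the largest confusions are anger predicted as sadness (68) and joy predicted as neutral (60).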

#### Testing with other inputs

In [10]:
predictor = ktrain.get_predictor(learner.model, preproc)
predictor.get_classes()

['joy', 'sadness', 'fear', 'anger', 'neutral']

In [11]:


message = 'I cannot sleep'

# %time must be on the same line as the statement it measures
%time prediction = predictor.predict(message)

print(prediction)

sadness
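
A single label hides how confident the model is. ktrain's `predictor.predict` accepts `return_proba=True` to return class probabilities instead of a label; the sketch below assumes such a probability vector, ordered like `predictor.get_classes()`, and ranks the emotions. `rank_emotions` is a hypothetical helper and the probabilities are invented for illustration, not real model output.

```python
class_names = ['joy', 'sadness', 'fear', 'anger', 'neutral']


def rank_emotions(probs, class_names, top_k=3):
    """Pair each class with its probability and return the top_k, highest first."""
    if len(probs) != len(class_names):
        raise ValueError("expected one probability per class")
    ranked = sorted(zip(class_names, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Hypothetical probabilities, e.g. what predict(message, return_proba=True) might return.
example_probs = [0.05, 0.55, 0.25, 0.10, 0.05]
print(rank_emotions(example_probs, class_names, top_k=2))
# [('sadness', 0.55), ('fear', 0.25)]
```

A close second place, as between sadness and fear here, is a hint that the message is ambiguous.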


In [12]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 4. Saving the BERT model


In [13]:
# %time must be on the same line as the statement it measures
%time predictor1 = ktrain.load_predictor('/content/drive/MyDrive/nlp-text-emotion-master/for_trained_bert')
message = 'I cannot sleep'

%time prediction = predictor1.predict(message)
print(prediction)

neutral


In [14]:
# let's save the predictor for later use
predictor.save("/content/drive/MyDrive/nlp-text-emotion-master/for_trained_bert")

Done! To reload the predictor, use `ktrain.load_predictor`.

To use the saved model later, just run this:

In [15]:
import ktrain
from ktrain import text

predictor1 = ktrain.load_predictor('/content/drive/MyDrive/nlp-text-emotion-master/for_trained_bert')
message = 'I cannot sleep'
prediction = predictor1.predict(message)
print(prediction)

sadness
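
When the reload happens in a fresh session, a missing Drive mount or a mistyped path is the most likely failure. A small sketch of a wrapper that checks the path before handing it to `ktrain.load_predictor` (`load_emotion_predictor` is a hypothetical helper; ktrain is imported lazily so the path check runs first):

```python
from pathlib import Path


def load_emotion_predictor(path):
    """Load a saved ktrain predictor, with a clear error if the path is missing."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(
            f"no saved predictor at {path}; is Google Drive mounted?"
        )
    # Imported here so the path check above runs before the heavy import.
    import ktrain
    return ktrain.load_predictor(str(path))
```

In this notebook it would be called with the same path that was passed to `predictor.save`.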
