## BERT Classifier using ktrain package

#### Original Notebooks:
https://github.com/amaiya/ktrain  

### Credit: Nick Kadochnikov

In [1]:
import warnings
# warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import re

from sklearn import preprocessing
from sklearn.model_selection import train_test_split

import tensorflow as tf
from keras.preprocessing.text import Tokenizer, text_to_word_sequence

In [2]:
#Verify we got CPU + GPU or only CPU
tf.config.list_physical_devices()

[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
 PhysicalDevice(name='/physical_device:XLA_CPU:0', device_type='XLA_CPU'),
 PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'),
 PhysicalDevice(name='/physical_device:XLA_GPU:0', device_type='XLA_GPU')]

In [3]:
!nvidia-smi

Thu Nov 26 14:34:03 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   70C    P0    31W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [4]:
tf.__version__

'2.3.1'

In [None]:
# !pip install ktrain --upgrade

In [8]:
#pip install --user ktrain
import ktrain
from ktrain import text

In [4]:
ktrain.__version__

'0.37.0'

#### Check available text classifiers in ktrain

In [5]:
text.print_text_classifiers()

fasttext: a fastText-like model [http://arxiv.org/pdf/1607.01759.pdf]
logreg: logistic regression using a trainable Embedding layer
nbsvm: NBSVM model [http://www.aclweb.org/anthology/P12-2018]
bigru: Bidirectional GRU with pretrained fasttext word vectors [https://fasttext.cc/docs/en/crawl-vectors.html]
standard_gru: simple 2-layer GRU with randomly initialized embeddings
bert: Bidirectional Encoder Representations from Transformers (BERT) from keras_bert [https://arxiv.org/abs/1810.04805]
distilbert: distilled, smaller, and faster BERT from Hugging Face transformers [https://arxiv.org/abs/1910.01108]


### Load Data

In [2]:
from google.cloud import storage
import os
import io
client = storage.Client()
bucket = client.get_bucket('nlp_final_data')

blob = bucket.blob('top_songs_final.csv')
content = blob.download_as_string()

df = pd.read_csv(io.BytesIO(content))

In [3]:
df['data_tokenized'] = df['data_tokenized'].apply(lambda x: x.strip('[]').replace('\'', '').split(', '))
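
The cell above rebuilds token lists from their string form by stripping brackets and quotes. As a hedged alternative, the standard-library `ast.literal_eval` can parse such strings directly and keeps apostrophes that belong to a token. The sample string below is hypothetical and only mimics how a Python list looks after a CSV round trip; it is not taken from the dataset.

```python
import ast

# Hypothetical example value: a token list as it would appear after being
# written to CSV and read back as a plain string.
raw = "['i', 'love', 'rock', \"n'\", 'roll']"

# Approach used in the notebook cell: strip brackets, drop every single quote, split.
tokens_split = raw.strip('[]').replace('\'', '').split(', ')

# Alternative: let Python parse the list literal.
tokens_eval = ast.literal_eval(raw)

print(tokens_split)  # the apostrophe in "n'" is lost and stray double quotes remain
print(tokens_eval)   # tokens are recovered as originally stored
```

With a DataFrame this would become `df['data_tokenized'].apply(ast.literal_eval)`, assuming every cell holds a valid Python list literal.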

In [4]:
df['data'] = df['data_tokenized'].apply(lambda x: ' '.join(x))
df = df.rename(columns={'tag':'target'})

In [5]:
df_train = df[['data','target']].groupby('target').sample(5000)

In [9]:
len(df_train)

25000

## STEP 1:  Load and Preprocess the Dataset

Preprocess the data with the `texts_from_df` function, since the data resides in a pandas DataFrame. If your documents are stored in folders, a CSV file, or in-memory arrays, you can use the `texts_from_folder`, `texts_from_csv`, or `texts_from_array` functions, respectively.

In [10]:
maxLen = 200  # each document can contain at most maxLen words; 0 is used as the padding ID
nGramRange = 1  # size of multi-word phrases to consider (defined here but not passed to texts_from_df below)
preprocessMode = 'bert'  # either 'standard' (normal tokenization) or 'bert' (tokenization and preprocessing for the BERT text classification model)
sampleSize = 0.3  # proportion of the training data to hold out for validation

(x_train, y_train), (x_test, y_test), preproc = text.texts_from_df(train_df=df_train,
                                                                   text_column='data',
                                                                   label_columns=['target'],
                                                                   val_pct=sampleSize,
                                                                   preprocess_mode=preprocessMode,  # text must be preprocessed in a specific way for use with BERT
                                                                   maxlen=maxLen)

['country', 'pop', 'rap', 'rb', 'rock']
        country  pop  rap   rb  rock
95110       0.0  0.0  0.0  0.0   1.0
65053       0.0  0.0  1.0  0.0   0.0
82197       0.0  0.0  1.0  0.0   0.0
148530      0.0  0.0  0.0  1.0   0.0
46406       0.0  1.0  0.0  0.0   0.0
['country', 'pop', 'rap', 'rb', 'rock']
        country  pop  rap   rb  rock
68252       0.0  0.0  1.0  0.0   0.0
27516       0.0  1.0  0.0  0.0   0.0
79683       0.0  0.0  1.0  0.0   0.0
3002        0.0  1.0  0.0  0.0   0.0
125315      0.0  0.0  0.0  0.0   1.0
preprocessing train...
language: en


Is Multi-Label? False
preprocessing test...
language: en
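
To make the role of `maxlen` and the padding ID concrete, here is a minimal sketch of fixed-length sequence preparation. It is a simplified illustration, not ktrain's implementation: real BERT preprocessing also applies WordPiece tokenization and adds special tokens. The function name `pad_or_truncate` and the token IDs are hypothetical.

```python
def pad_or_truncate(token_ids, maxlen, pad_id=0):
    """Return exactly `maxlen` IDs: cut long inputs, right-pad short ones with `pad_id`."""
    clipped = token_ids[:maxlen]
    return clipped + [pad_id] * (maxlen - len(clipped))


short_doc = [101, 2023, 2003, 102]   # hypothetical token IDs for a short document
long_doc = list(range(1, 251))       # hypothetical 250-token document

print(pad_or_truncate(short_doc, maxlen=8))        # padded with zeros up to length 8
print(len(pad_or_truncate(long_doc, maxlen=200)))  # truncated to 200 tokens
```

Every document ends up with the same length, which is what lets the model consume the data in fixed-shape batches.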


## STEP 2:  Load a pretrained BERT model and wrap it in a `ktrain.Learner` object

This step can be condensed into a single line of code, but we execute it as two lines for clarity. (Deprecation warnings may appear depending on your Keras and TensorFlow versions and can be ignored; this notebook was run with TensorFlow 2.3.1.)

In [11]:
model = text.text_classifier('bert', (x_train, y_train), preproc=preproc)

Is Multi-Label? False
maxlen is 200
done.


In [12]:
batchSize = 16 ### Check best size

learner = ktrain.get_learner(model, 
                             train_data=(x_train, y_train), 
                             val_data=(x_test, y_test), 
                             batch_size=batchSize)

## STEP 3: Train the Model

We train using one of the three learning rates recommended in the BERT paper: *5e-5*, *3e-5*, or *2e-5*.
Alternatively, the ktrain Learning Rate Finder can be used to find a good learning rate by invoking `learner.lr_find()` and then `learner.lr_plot()` prior to training. Note that `lr_plot` only works after `lr_find` has run to completion.
Training here uses `learner.autofit`, which applies a triangular learning rate policy. The `learner.fit_onecycle` method is an alternative that employs a [1cycle learning rate policy](https://arxiv.org/pdf/1803.09820.pdf).
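
As a rough illustration of what a triangular learning rate policy looks like, the sketch below ramps the rate linearly up to the maximum and back down over a single cycle. This is an assumption-laden simplification: the function `triangular_lr` is hypothetical, the lower bound of `max_lr / 10` is chosen only for illustration, and ktrain's actual schedule may differ in cycle length and bounds.

```python
def triangular_lr(step, total_steps, max_lr, base_lr=None):
    """Learning rate at `step` for one triangular cycle spanning `total_steps`."""
    if base_lr is None:
        base_lr = max_lr / 10  # illustrative lower bound, not ktrain's value
    half = total_steps / 2
    if step <= half:
        frac = step / half                  # linear warm-up
    else:
        frac = (total_steps - step) / half  # linear decay
    return base_lr + (max_lr - base_lr) * frac


# Print the rate at a few points of a hypothetical 100-step run.
for step in (0, 25, 50, 75, 100):
    print(step, triangular_lr(step, total_steps=100, max_lr=5e-5))
```

The rate peaks at the midpoint and returns to its starting value at the end of the cycle.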

In [13]:
# briefly simulate training to find good learning rate

learner.lr_find(max_epochs=3)

simulating training for different learning rates... this may take a few moments...
Epoch 1/3
 81/438 [====>.........................] - ETA: 34:21 - loss: 1.7163 - accuracy: 0.2091

In [14]:
learner.lr_plot()

ValueError: Please call lr_find first.

In [13]:
learningRate = 5e-5
numEpoch = 2

# learner.fit_onecycle(learningRate, numEpoch)
learner.autofit(learningRate, numEpoch)



begin training using triangular learning rate policy with max lr of 5e-05...
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f138c6e4a10>

We can use the `learner.validate` method to test our model against the validation set.

In [13]:
learner.validate(val_data=(x_test, y_test))

              precision    recall  f1-score   support

           0       0.65      0.75      0.70      1496
           1       0.41      0.39      0.40      1476
           2       0.81      0.85      0.83      1502
           3       0.57      0.57      0.57      1527
           4       0.57      0.49      0.53      1499

    accuracy                           0.61      7500
   macro avg       0.60      0.61      0.60      7500
weighted avg       0.60      0.61      0.60      7500



array([[1117,  136,    2,  117,  124],
       [ 205,  579,   80,  318,  294],
       [  11,   53, 1279,  117,   42],
       [ 130,  274,  169,  866,   88],
       [ 247,  361,   54,  106,  731]])
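
The classification report can be cross-checked against the confusion matrix. The sketch below recomputes per-class precision and recall from the matrix printed above, assuming rows are true labels and columns are predictions (the row sums match the reported support) and assuming class indices 0 to 4 follow the order printed during preprocessing.

```python
# Class order as printed during preprocessing (assumed to match indices 0-4).
class_names = ['country', 'pop', 'rap', 'rb', 'rock']

# Confusion matrix copied from the validation output: rows = true, columns = predicted.
cm = [
    [1117, 136, 2, 117, 124],
    [205, 579, 80, 318, 294],
    [11, 53, 1279, 117, 42],
    [130, 274, 169, 866, 88],
    [247, 361, 54, 106, 731],
]

n = len(cm)
accuracy = sum(cm[i][i] for i in range(n)) / sum(sum(row) for row in cm)

precision, recall = {}, {}
for i, name in enumerate(class_names):
    predicted_total = sum(cm[r][i] for r in range(n))  # column sum
    actual_total = sum(cm[i])                          # row sum
    precision[name] = cm[i][i] / predicted_total
    recall[name] = cm[i][i] / actual_total
    print(f"{name:8s} precision={precision[name]:.2f} recall={recall[name]:.2f}")
print(f"accuracy={accuracy:.2f}")

# Largest off-diagonal cell: the most frequent confusion.
count, true_name, pred_name = max(
    (cm[i][j], class_names[i], class_names[j])
    for i in range(n) for j in range(n) if i != j
)
print(f"most common error: true={true_name}, predicted={pred_name} ({count} cases)")
```

Under these assumptions, pop is the weakest class, and the most frequent single error is rock lyrics being labeled as pop.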

## STEP 4: Making predictions

We can call the `ktrain.get_predictor` function, passing it the trained model and the preprocessor, to obtain a Predictor object capable of making predictions on new raw data.

In [17]:
predictor = ktrain.get_predictor(learner.model, preproc)

In [26]:
predictor.get_classes()

['Negative', 'Positive']

In [27]:
print(df.data.iloc[0])

I love Deagan's. I do. I really do. The atmosphere is cozy and festive. The shrimp tacos and house fries are my standbys. The fries are sometimes good and sometimes great, and the spicy dipping sauce they come with is to die for. The beer list is amazing and the cocktails are great. The prices are mid-level, so it's not a cheap dive you can go to every week, but rather a treat when you do. Try it out. You won't be disappointed!


In [28]:
print(predictor.predict(df.data.iloc[0]))

Positive


In [29]:
print(df.target.iloc[0])

Positive


In [30]:
NumRecs = 1000

target = df.target.iloc[0:NumRecs]
predicted = predictor.predict(df.data.iloc[0:NumRecs].tolist())
data = df.data.iloc[0:NumRecs]

results = pd.DataFrame(list(zip(target, predicted, data)), 
               columns =['target', 'predicted', 'data']) 
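
Once targets and predictions sit side by side, a quick summary helps more than scanning rows. The following sketch uses plain Python lists with hypothetical labels standing in for the `target` and `predicted` columns; it computes the share of matching rows and counts each kind of mismatch.

```python
from collections import Counter

# Hypothetical labels standing in for the `target` and `predicted` columns.
target = ['rock', 'rap', 'pop', 'rap', 'country', 'pop']
predicted = ['rock', 'rap', 'rock', 'rap', 'country', 'rb']

# Fraction of rows where the prediction equals the target.
matches = sum(t == p for t, p in zip(target, predicted))
accuracy = matches / len(target)

# Count each (true, predicted) pair among the mismatches.
errors = Counter((t, p) for t, p in zip(target, predicted) if t != p)

print(f"accuracy: {accuracy:.2f}")
print(errors.most_common())
```

With the DataFrame itself, `(results['target'] == results['predicted']).mean()` yields the same ratio.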

In [31]:
pd.set_option('max_colwidth', 100)

results.head(10)

| | target | predicted | data |
|---|---|---|---|
| 0 | Positive | Positive | I love Deagan's. I do. I really do. The atmosphere is cozy and festive. The shrimp tacos and hou... |
| 1 | Positive | Positive | I love the classes at this gym. Zumba and. Radio Hip Hop are my favorite. This is such a great f... |
| 2 | Negative | Negative | The tables and floor were dirty. I was the only customer on a Saturday nite and the person work... |
| 3 | Negative | Negative | I had an oil change at the 15515 N Scottsdale Road location. When the car was delivered to me, ... |
| 4 | Negative | Negative | The absolute WORST apartment complex I have ever lived in. Moved here from out of state. Hoped t... |
| 5 | Positive | Positive | I saw Big Jim's on Diners, Drive-ins and Dives. Went there when in Pittsburgh with my son visiti... |
| 6 | Positive | Positive | THE TACOS. I got pastor, asada, and chorizo. Authentic. Delicious. Nice topping bar with cilantr... |
| 7 | Negative | Negative | After being a regular at Veggie House i thought it be cool to try Chef Kenny's. We went there a ... |
| 8 | Negative | Negative | With many eateries to chose from on queen street we decided to give this place a try seeing how ... |
| 9 | Negative | Negative | DO NOT USE THESE PEOPLE!!!!! I wish I would have looked on here first as now I can to attest to ... |


In [32]:
!rm -r '/home/jupyter/data/ktrain/bert/'

In [18]:
# save to the same directory that is listed and reloaded in the cells below
predictor.save('/home/jupyter/data/ktrain/bert/my_predictor')

In [34]:
!ls -l '/home/jupyter/data/ktrain/bert/'

total 4
drwxr-xr-x 2 root root 4096 Nov 26 15:12 my_predictor


In [35]:
# reload the predictor

reloaded_predictor = ktrain.load_predictor('/home/jupyter/data/ktrain/bert/my_predictor')

In [36]:
# make a prediction on the same document to verify it still works

print(reloaded_predictor.predict(df.data.iloc[0]))

Positive


In [37]:
!nvidia-smi

Thu Nov 26 15:12:33 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   71C    P0    31W /  70W |  14332MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [38]:
import datetime
import pytz

datetime.datetime.now(pytz.timezone('US/Central')).strftime("%a, %d %B %Y %H:%M:%S")

'Thu, 26 November 2020 09:12:33'