<a href="https://colab.research.google.com/github/sciencecw/AI4Code/blob/BERT/ref/public_IMDb_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis of IMDb Movie Reviews Using BERT

In this notebook, we build a binary text classifier that labels movie reviews as positive or negative using [BERT](https://arxiv.org/abs/1810.04805), a pretrained language model that can be fine-tuned on text data via transfer learning. We will use the [*ktrain* library](https://github.com/amaiya/ktrain), a lightweight wrapper around Keras that helps train (and deploy) neural networks. For more information on *ktrain*, see [this Medium post](https://towardsdatascience.com/ktrain-a-lightweight-wrapper-for-keras-to-help-train-neural-networks-82851ba889c).

We will begin by installing *ktrain* and importing the required *ktrain* modules.

In [None]:
# install ktrain
!pip3 install ktrain

In [None]:
# import ktrain
import ktrain
from ktrain import text

Using TensorFlow backend.


In [None]:
ktrain.__version__

'0.2.2'

Next, we will fetch and extract the IMDb movie review dataset.

In [None]:
# download IMDb movie review dataset
import tensorflow as tf
dataset = tf.keras.utils.get_file(
    fname="aclImdb.tar.gz", 
    origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz", 
    extract=True,
)

Downloading data from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


In [None]:
# set path to dataset
import os.path
# `dataset` is the path of the downloaded archive; the extracted
# 'aclImdb' folder sits next to it (e.g. /root/.keras/datasets/aclImdb)
IMDB_DATADIR = os.path.join(os.path.dirname(dataset), 'aclImdb')
print(IMDB_DATADIR)

/root/.keras/datasets/aclImdb
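
Before loading the data, it can help to confirm that the extracted folder has the layout the loading step later in this notebook expects: a `train` and a `test` subfolder, each containing `pos` and `neg` class folders. The following is a minimal sketch of such a check; the helper name `check_imdb_layout` is hypothetical and not part of *ktrain*, and the demonstration runs on a throwaway directory rather than the real dataset.

```python
import os
import tempfile

def check_imdb_layout(datadir, splits=("train", "test"), classes=("pos", "neg")):
    """Return the expected split/class subfolders that are missing under datadir."""
    missing = []
    for split in splits:
        for cls in classes:
            if not os.path.isdir(os.path.join(datadir, split, cls)):
                missing.append(os.path.join(split, cls))
    return missing

# Demonstration on a temporary directory that deliberately lacks the 'neg' folders.
with tempfile.TemporaryDirectory() as demo_dir:
    for split in ("train", "test"):
        os.makedirs(os.path.join(demo_dir, split, "pos"))
    demo_missing = check_imdb_layout(demo_dir)

print(demo_missing)  # lists the two missing 'neg' folders
```

On the real dataset, `check_imdb_layout(IMDB_DATADIR)` should return an empty list if the archive was extracted correctly.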


## STEP 1:  Load and Preprocess the Dataset

The `texts_from_folder` function loads the training and validation data from the specified folder (here, the `train` and `test` subfolders, respectively) and automatically preprocesses it according to BERT's requirements. In doing so, it also downloads the pretrained BERT model and vocabulary.

In [None]:
trn, val, preproc = text.texts_from_folder(IMDB_DATADIR, 
                                          maxlen=500, 
                                          preprocess_mode='bert',
                                          train_test_names=['train', 
                                                            'test'],
                                          classes=['pos', 'neg'])

downloading pretrained BERT model and vocabulary...
[██████████████████████████████████████████████████]
extracting pretrained BERT model and vocabulary...
done.

cleanup downloaded zip...
done.

preprocessing train...


preprocessing test...
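
To give a rough idea of what preprocessing "according to BERT's requirements" involves: BERT expects each text as a fixed-length sequence of token ids that begins with a `[CLS]` token, ends with a `[SEP]` token, and is truncated or padded to `maxlen`, together with a matching sequence of segment ids. The sketch below illustrates only this formatting step. It is not *ktrain*'s implementation: the vocabulary and its ids are made up, and a whitespace split stands in for BERT's WordPiece tokenizer.

```python
# Toy vocabulary; these ids are made up for illustration only.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "a": 4, "great": 5, "movie": 6}

def toy_bert_encode(text, maxlen):
    """Encode text as [CLS] tokens [SEP], truncated and right-padded to maxlen."""
    tokens = text.lower().split()        # real BERT uses WordPiece here
    tokens = tokens[: maxlen - 2]        # leave room for [CLS] and [SEP]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    ids += [TOY_VOCAB["[PAD]"]] * (maxlen - len(ids))  # pad on the right
    segments = [0] * maxlen              # single-text input: all segment ids are 0
    return ids, segments

ids, segments = toy_bert_encode("A great movie overall", maxlen=8)
print(ids)       # [2, 4, 5, 6, 1, 3, 0, 0]
print(segments)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

With `maxlen=500`, as used above, reviews longer than 498 tokens would be cut off under this scheme, while shorter ones are padded.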


## STEP 2:  Load a pretrained BERT model and wrap it in a `ktrain.Learner` object

This step can be condensed into a single line of code, but we split it across two lines for clarity. (The deprecation warnings in the output arise from running Keras 2.2.4 on TensorFlow 1.14.0 and can be safely ignored.)

In [None]:
model = text.text_classifier('bert', trn, preproc=preproc)
learner = ktrain.get_learner(model,train_data=trn, val_data=val, batch_size=6)

W0817 15:50:35.441357 140312423020416 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

W0817 15:50:35.481600 140312423020416 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

W0817 15:50:35.535026 140312423020416 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:131: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

W0817 15:50:35.535914 140312423020416 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:133: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

W0817 15:50:35.546988

Is Multi-Label? False
maxlen is 500


W0817 15:50:55.827762 140312423020416 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.



done.
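
The `batch_size=6` passed to `get_learner` determines how many weight updates happen per epoch. As a quick back-of-the-envelope check, assuming the 25,000 training reviews that the training log in the next step reports, one epoch works out as follows (the helper name `steps_per_epoch` is ours, not a *ktrain* function):

```python
import math

def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once (last batch may be partial)."""
    return math.ceil(num_samples / batch_size)

n_steps = steps_per_epoch(25000, 6)
print(n_steps)  # 4167
```

A larger batch size would reduce the number of steps but needs more GPU memory, which is a real constraint for BERT at a sequence length of 500.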


## STEP 3:  Train and Fine-Tune the Model on the IMDb dataset

We use the `learner.fit_onecycle` method in *ktrain*, which applies a [1cycle learning rate schedule](https://arxiv.org/pdf/1803.09820.pdf). We set the maximum learning rate to 2e-5, based on the recommendations in [the original BERT paper](https://arxiv.org/abs/1810.04805).

In this run, we achieve a **93.92% validation accuracy** in a single epoch.

In [None]:
learner.fit_onecycle(2e-5, 1)



begin training using onecycle policy with max lr of 2e-05...


W0817 15:50:59.892966 140312423020416 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_grad.py:1250: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


Train on 25000 samples, validate on 25000 samples
Epoch 1/1


<keras.callbacks.History at 0x7f9cb4d02da0>
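
For intuition about what a 1cycle schedule does, the sketch below computes a simplified triangular version: the learning rate rises linearly to the maximum over the first half of training and falls back linearly over the second half. This is an illustration only, not *ktrain*'s implementation. The function name `onecycle_lr` and the starting rate of one tenth of the maximum (`div_factor=10.0`) are assumptions, and real schedules typically also vary momentum.

```python
def onecycle_lr(step, total_steps, max_lr, div_factor=10.0):
    """Simplified triangular 1cycle: linear warm-up to max_lr, then linear decay."""
    if total_steps < 2:
        return max_lr
    min_lr = max_lr / div_factor          # assumed starting/ending learning rate
    half = (total_steps - 1) / 2.0        # position of the peak
    if step <= half:
        frac = step / half                # rising phase: 0 -> 1
    else:
        frac = (total_steps - 1 - step) / half  # falling phase: 1 -> 0
    return min_lr + frac * (max_lr - min_lr)

# A tiny 5-step schedule with the max learning rate used in this notebook.
schedule = [onecycle_lr(s, 5, max_lr=2e-5) for s in range(5)]
for step, lr in enumerate(schedule):
    print(step, f"{lr:.1e}")
```

The peak sits at the middle step, and the first and last steps share the same low learning rate.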

Since the model does not yet appear to be overfitting, we could train for another epoch or two to reach even higher accuracy.

Feel free to try it out on your own.
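
If you do train longer, one simple way to watch for overfitting is to compare training and validation loss across epochs, for example from the `history` dictionary of the Keras `History` object returned by training. The sketch below is a rough heuristic under that assumption: the helper `looks_overfit` is hypothetical, and the loss values are made up purely for illustration, not results from this notebook.

```python
def looks_overfit(history, patience=1):
    """Heuristic: validation loss rose over the last `patience` epochs
    while training loss kept falling."""
    loss, val_loss = history["loss"], history["val_loss"]
    if len(val_loss) <= patience:
        return False  # not enough epochs to compare
    recent = range(len(val_loss) - patience, len(val_loss))
    return all(val_loss[i] > val_loss[i - 1] and loss[i] < loss[i - 1]
               for i in recent)

# Made-up per-epoch losses for illustration only.
still_improving = {"loss": [0.30, 0.18], "val_loss": [0.20, 0.17]}
overfitting = {"loss": [0.30, 0.18, 0.09], "val_loss": [0.20, 0.17, 0.21]}

print(looks_overfit(still_improving))  # False
print(looks_overfit(overfitting))      # True
```

When the check starts returning `True`, additional epochs are unlikely to improve validation accuracy.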