### Hugo Englund | 2021-01-06

# Laboration 5: Part C

## Outline
In this part of the laboration, our aim is to classify IMDB reviews as either positive or negative. This will be done by:
1. Importing and fine-tuning the pretrained BERT (Bidirectional Encoder Representations from Transformers) model from ```ktrain```.

2. Testing the best model on the following text sequences and analyzing the results:


```
test_sentences = [
    "That movie was absolutely awful",
    "The acting was a bit lacking",
    "The film was creative and surprising",
    "Absolutely fantastic!",
    "This movie is not worth the money",
    "The only positive thing with this movie is the music"
]
```

### Setup
Install ```ktrain``` and import relevant packages for the given task:

In [None]:
!pip3 install ktrain

Collecting ktrain
  Downloading https://files.pythonhosted.org/packages/84/c3/3147b9a86f236585eb7bb9b115e024a6f3acfcd92df1520f491853c6f6e0/ktrain-0.25.3.tar.gz (25.3MB)
Collecting scikit-learn==0.23.2
  Downloading https://files.pythonhosted.org/packages/5c/a1/273def87037a7fb010512bbc5901c31cfddfca8080bc63b42b26e3cc55b3/scikit_learn-0.23.2-cp36-cp36m-manylinux1_x86_64.whl (6.8MB)
Collecting langdetect
  Downloading https://files.pythonhosted.org/packages/56/a3/8407c1e62d5980188b4acc45ef3d94b933d14a2ebc9ef3505f22cf772570/langdetect-1.0.8.tar.gz (981kB)
Collecting cchardet
  Downloading https://files.pythonhosted.org/packages/a0/e5/a0b9edd8664ea3b0d3270c451ebbf86655ed9fc4c3e4c45b9afae9c2e382/cchardet-2.1.7-cp36-cp36m-manylinux2010_x86_64.whl (263kB)

In [None]:
import os

import tensorflow as tf  # used to download the raw IMDB archive
import ktrain
from ktrain import text  # text preprocessing and classifier helpers

### Import dataset
We download the raw IMDB review archive from Stanford using the Keras file utility:

In [None]:
dataset = tf.keras.utils.get_file(
    fname="aclImdb.tar.gz", 
    origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz", 
    extract=True,
)

Downloading data from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


In [None]:
path_imdb = os.path.join(os.path.dirname(dataset), 'aclImdb')
print(path_imdb)

/root/.keras/datasets/aclImdb
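
As a quick sanity check before preprocessing, the sketch below counts the review files per split and class. It assumes the extracted archive follows the `train/{pos,neg}` and `test/{pos,neg}` folder layout that `text.texts_from_folder` reads; the helper name `count_reviews` is made up for this illustration.

```python
import os

def count_reviews(root, splits=("train", "test"), classes=("pos", "neg")):
    """Count the .txt files in every <root>/<split>/<class> folder."""
    counts = {}
    for split in splits:
        for cls in classes:
            folder = os.path.join(root, split, cls)
            if os.path.isdir(folder):
                n_files = sum(name.endswith(".txt") for name in os.listdir(folder))
            else:
                n_files = 0  # missing folder: report zero instead of failing
            counts[(split, cls)] = n_files
    return counts
```

Calling `count_reviews(path_imdb)` on the extracted dataset should report 12,500 reviews in each of the four folders, since the IMDB dataset is balanced with 25,000 reviews per split.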


### Preprocess data
We preprocess the IMDB data so that it suits the BERT model. Note that the ```test``` folder is loaded as the validation set:

In [None]:
# preprocess data for BERT
train_data, val_data, preproc_data = text.texts_from_folder(
    path_imdb, 
    maxlen=500, 
    preprocess_mode='bert',
    train_test_names=['train', 'test'],
    classes=['pos', 'neg']
)

detected encoding: utf-8
downloading pretrained BERT model (uncased_L-12_H-768_A-12.zip)...
[██████████████████████████████████████████████████]
extracting pretrained BERT model...
done.

cleanup downloaded zip...
done.

preprocessing train...
language: en


Is Multi-Label? False
preprocessing test...
language: en
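
The `maxlen=500` argument caps every review at 500 tokens, which fits within the 512 positions that BERT supports. The sketch below shows the idea with whitespace tokens instead of BERT's WordPiece vocabulary. It is a simplification: `to_fixed_length` and the `[PAD]` string are illustrative names and not part of `ktrain`.

```python
def to_fixed_length(review, maxlen=500, pad_token="[PAD]"):
    """Wrap a review in [CLS]/[SEP] and force it to exactly `maxlen` tokens."""
    # Toy tokenizer: BERT itself uses WordPiece sub-word units.
    tokens = review.lower().split()
    # Reserve two positions for the special tokens and drop any overflow.
    tokens = ["[CLS]"] + tokens[: maxlen - 2] + ["[SEP]"]
    # Pad short reviews on the right up to the fixed length.
    return tokens + [pad_token] * (maxlen - len(tokens))

print(to_fixed_length("Absolutely fantastic!", maxlen=6))
# -> ['[CLS]', 'absolutely', 'fantastic!', '[SEP]', '[PAD]', '[PAD]']
```

Long reviews therefore lose everything beyond the cap, which is worth keeping in mind when a review states its verdict only at the very end.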


### Load pretrained BERT model
We load the pretrained BERT model from ```ktrain```:

In [None]:
# build a BERT text classifier
bert_model = text.text_classifier('bert', train_data, preproc=preproc_data)

# wrap the BERT classifier in a ktrain learner
bert_k_train = ktrain.get_learner(
    bert_model,
    train_data=train_data, 
    val_data=val_data, 
    batch_size=6
)

Is Multi-Label? False
maxlen is 500
done.


#### Performance evaluation
We fine-tune our BERT model for one epoch. Since the IMDB test split was loaded as validation data, the validation metrics also serve as our test results:

In [None]:
# train and validate the BERT model
history = bert_k_train.fit_onecycle(2e-5, 1)



begin training using onecycle policy with max lr of 2e-05...
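
For intuition about `fit_onecycle`, the sketch below implements a simplified triangular one-cycle schedule: the learning rate ramps up linearly to the maximum during the first half of training and back down during the second half. This illustrates the general policy and is not `ktrain`'s exact implementation; the function `one_cycle_lr` and the `min_ratio` of 0.1 are assumptions.

```python
import math

def one_cycle_lr(step, total_steps, max_lr=2e-5, min_ratio=0.1):
    """Learning rate at `step` (0-based) of a triangular one-cycle schedule."""
    min_lr = max_lr * min_ratio
    half = (total_steps - 1) / 2  # the peak sits in the middle of training
    if half == 0:
        return max_lr  # degenerate single-step schedule
    # Distance from the peak, scaled to [0, 1]: 1 at both ends, 0 at the peak.
    distance = abs(step - half) / half
    return max_lr - (max_lr - min_lr) * distance

# 25,000 training reviews with batch_size=6, trained for one epoch
total_steps = math.ceil(25000 / 6)
for step in (0, total_steps // 2, total_steps - 1):
    print(step, one_cycle_lr(step, total_steps))
```

With a batch size of 6, one epoch amounts to 4,167 weight updates, which is part of the reason a single epoch already takes a long time.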


#### Model conclusion
We obtain a final training accuracy and loss of 84% and 0.35, respectively. On the validation set, which here serves as the test set, we obtain 93.8% and 0.16. This is considerably better than the test results for the best model in part B, with an increase of around 5 percentage points in accuracy and the loss roughly halved. Since we only trained for one epoch, we cannot study how the fit develops over time. However, there are no signs of overfitting, as the validation loss is lower than the training loss. Note that the training metrics are averaged over the whole epoch with dropout active, so they tend to understate how well the model performs at the end of the epoch.

Based on these results, we can conclude that BERT is a powerful classifier that achieves high accuracy after only one epoch of fine-tuning. However, fine-tuning is time-consuming, so the hyperparameters have to be chosen carefully before training.

### Test sentence prediction
Lastly, we try to predict the given test sentences:

In [None]:
# test sentences to predict
test_sentences = [
  "That movie was absolutely awful",
  "The acting was a bit lacking",
  "The film was creative and surprising",
  "Absolutely fantastic!",
  "This movie is not worth the money",
  "The only positive thing with this movie is the music"
]

In [None]:
# fetch the BERT predictor
predictor = ktrain.get_predictor(bert_k_train.model, preproc_data)

# predict the test sentences
res = predictor.predict(test_sentences)

In [None]:
# print the predicted labels with corresponding sentence
for sentence, pred_label in zip(test_sentences, res):
  print("Sentence:", sentence)
  # map the class name to a readable rating
  rating = "negative" if pred_label == 'neg' else "positive"
  print("Rating:", rating + "\n")

Sentence: That movie was absolutely awful
Rating: negative

Sentence: The acting was a bit lacking
Rating: negative

Sentence: The film was creative and surprising
Rating: positive

Sentence: Absolutely fantastic!
Rating: positive

Sentence: This movie is not worth the money
Rating: negative

Sentence: The only positive thing with this movie is the music
Rating: negative
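
The predicted labels hide how confident the model is. Assuming class probabilities are available (in `ktrain`, `predictor.predict` accepts `return_proba=True`), a small helper such as the hypothetical `label_with_confidence` below can pair each label with its probability. The probabilities and the class order in the example are made up for illustration and are not output from the trained model.

```python
def label_with_confidence(probs, class_names):
    """Return (label, probability) for the most likely class of one sentence."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return class_names[best], probs[best]

# Made-up probabilities for illustration only, not output from the trained model.
example_probs = [[0.97, 0.03], [0.40, 0.60]]
for probs in example_probs:
    print(label_with_confidence(probs, ["neg", "pos"]))
```

A probability close to 0.5 would flag borderline sentences, such as mildly negative or mixed reviews, that deserve a closer look.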



### Discussion
As expected, BERT confirmed its strong classification capacity by predicting all six test sentences correctly. This is a significant improvement over the previous best model, which classified only 4 out of 6 correctly.