<a href="https://colab.research.google.com/github/sho-kawano/profit-gain/blob/master/Training_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# install ktrain on Google Colab
!pip3 install ktrain


In [None]:
# import ktrain and the ktrain.text modules
import ktrain
from ktrain import text

In [None]:
ktrain.__version__

'0.21.4'

# Multiclass Text Classification Using BERT and Keras
In this example, we will use ***ktrain*** ([a lightweight wrapper around Keras](https://github.com/amaiya/ktrain)) to build a model using the dataset employed in the **scikit-learn** tutorial: [Working with Text Data](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html).  Unlike in the tutorial, we will sample 10 of the 20 newsgroups to create a relatively small multiclass text classification dataset.  The objective is to accurately classify each document into one of these 10 newsgroup topic categories.  This will provide us an opportunity to see **BERT** in action on a relatively small training set.  Let's fetch the [20newsgroups dataset](http://qwone.com/~jason/20Newsgroups/) using scikit-learn.


In [None]:
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'comp.sys.ibm.pc.hardware', 'rec.autos',
              'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey',
              'sci.med', 'sci.space', 'talk.politics.guns',
              'talk.politics.mideast']

train_b = fetch_20newsgroups(subset='train',
   categories=categories, shuffle=True, random_state=42)
test_b = fetch_20newsgroups(subset='test',
   categories=categories, shuffle=True, random_state=42)

print('size of training set: %s' % (len(train_b['data'])))
print('size of validation set: %s' % (len(test_b['data'])))
print('classes: %s' % (train_b.target_names))

x_train = train_b.data
y_train = train_b.target
x_test = test_b.data
y_test = test_b.target

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


size of training set: 5756
size of validation set: 3831
classes: ['alt.atheism', 'comp.sys.ibm.pc.hardware', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.med', 'sci.space', 'talk.politics.guns', 'talk.politics.mideast']


## STEP 1:  Load and Preprocess the Data
Preprocess the data using the `texts_from_array` function (since the data resides in an array).
If your documents are stored in folders or a CSV file, you can use the `texts_from_folder` or `texts_from_csv` functions, respectively.
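If you would rather keep using `texts_from_array` with CSV data, one option is to load the columns into lists first. The sketch below uses only the standard library; the helper name `load_texts_and_labels` and the column names `text` and `label` are hypothetical and not part of ktrain.

```python
import csv
import io


def load_texts_and_labels(csv_file, text_column, label_column):
    """Read a CSV file object into (texts, integer labels, class names)."""
    texts, raw_labels = [], []
    for row in csv.DictReader(csv_file):
        texts.append(row[text_column])
        raw_labels.append(row[label_column])
    # Map each distinct label string to an integer index, sorted for stability.
    class_names = sorted(set(raw_labels))
    label_to_index = {name: i for i, name in enumerate(class_names)}
    labels = [label_to_index[name] for name in raw_labels]
    return texts, labels, class_names


# Hypothetical in-memory CSV standing in for a real file on disk.
sample_csv = io.StringIO(
    "text,label\n"
    "The shuttle launch was delayed,sci.space\n"
    "My car needs new brakes,rec.autos\n"
    "Orbit insertion went well,sci.space\n"
)
texts, labels, class_names = load_texts_and_labels(sample_csv, "text", "label")
print(texts[0], labels, class_names)
```

The three return values play the same roles as `x_train`, `y_train`, and `train_b.target_names` in this notebook. Wrap the lists in `numpy.array` if an array type is needed.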

In [None]:
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_array(
    x_train=x_train, y_train=y_train,
    x_test=x_test, y_test=y_test,
    class_names=train_b.target_names,
    preprocess_mode='bert',
    ngram_range=1,
    maxlen=350,
    random_state=3,
    max_features=35000)

downloading pretrained BERT model (uncased_L-12_H-768_A-12.zip)...
[██████████████████████████████████████████████████]
extracting pretrained BERT model...
done.

cleanup downloaded zip...
done.

preprocessing train...
language: en


Is Multi-Label? False
preprocessing test...
language: en


task: text classification


## STEP 2:  Load the BERT Model and Instantiate a Learner object

In [None]:
# you can disregard the deprecation warnings arising from using Keras 2.2.4 with TensorFlow 1.14.
model = text.text_classifier('bert', train_data=(x_train, y_train), preproc=preproc)
learner = ktrain.get_learner(model, train_data=(x_train, y_train), batch_size=6)

Is Multi-Label? False
maxlen is 350
done.


## STEP 3: Train the Model

We train using one of the three learning rates recommended in the BERT paper: *5e-5*, *3e-5*, or *2e-5*.
Alternatively, the ktrain Learning Rate Finder can be used to find a good learning rate by invoking `learner.lr_find()` and `learner.lr_plot()`, prior to training.
The `learner.fit_onecycle` method employs a [1cycle learning rate policy](https://arxiv.org/pdf/1803.09820.pdf).
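To give a feel for what a 1cycle policy does, here is a minimal sketch of a triangular schedule: the learning rate rises linearly to the maximum at the midpoint of training, then falls back. This is a generic illustration and not ktrain's implementation; the function `one_cycle_lr` and the `div_factor` of 10 are assumptions made for the example. The step counts use the training-set size, batch size, and epoch count from this notebook.

```python
import math


def one_cycle_lr(step, total_steps, max_lr, div_factor=10.0):
    """Triangular 1cycle sketch: linear ramp up to max_lr, then linear ramp down.

    div_factor is an assumed ratio between the peak and the starting rate.
    """
    if total_steps < 2 or not 0 <= step < total_steps:
        raise ValueError("need total_steps >= 2 and 0 <= step < total_steps")
    base_lr = max_lr / div_factor
    peak = (total_steps - 1) / 2          # midpoint of training
    distance = abs(step - peak) / peak    # 1.0 at both ends, 0.0 at the peak
    return base_lr + (max_lr - base_lr) * (1.0 - distance)


# 5756 training documents, batch_size=6, 4 epochs (values from this notebook).
steps_per_epoch = math.ceil(5756 / 6)
total_steps = 4 * steps_per_epoch
schedule = [one_cycle_lr(s, total_steps, 2e-5) for s in range(total_steps)]
print(steps_per_epoch, total_steps)
print(schedule[0], max(schedule), schedule[-1])
```

The point of the shape is that early and late updates are small while the middle of training uses the full rate passed to `fit_onecycle`.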



In [None]:
learner.fit_onecycle(2e-5, 4)



begin training using onecycle policy with max lr of 2e-05...
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<tensorflow.python.keras.callbacks.History at 0x7fd25cb69dd8>

We can use the `learner.validate` method to test our model against the validation set.
As the classification report below shows, BERT achieves **95%** accuracy on the 3,831 validation documents.

In [None]:
learner.validate(val_data=(x_test, y_test), class_names=train_b.target_names)

                          precision    recall  f1-score   support

             alt.atheism       0.90      0.93      0.92       319
comp.sys.ibm.pc.hardware       0.97      0.99      0.98       392
               rec.autos       0.92      0.96      0.94       396
         rec.motorcycles       0.93      0.90      0.91       398
      rec.sport.baseball       0.96      0.97      0.97       397
        rec.sport.hockey       0.99      0.96      0.98       399
                 sci.med       0.97      0.95      0.96       396
               sci.space       0.96      0.96      0.96       394
      talk.politics.guns       0.93      0.96      0.94       364
   talk.politics.mideast       0.99      0.93      0.96       376

                accuracy                           0.95      3831
               macro avg       0.95      0.95      0.95      3831
            weighted avg       0.95      0.95      0.95      3831



array([[298,   0,   1,   3,   1,   0,   5,   9,   2,   0],
       [  0, 387,   1,   1,   0,   0,   0,   1,   1,   1],
       [  0,   3, 381,   8,   0,   0,   1,   2,   1,   0],
       [  2,   3,  24, 359,   0,   0,   1,   0,   9,   0],
       [  2,   0,   1,   0, 385,   5,   1,   0,   3,   0],
       [  0,   0,   3,   1,   8, 385,   0,   0,   2,   0],
       [  3,   3,   0,   7,   0,   0, 378,   0,   3,   2],
       [  5,   4,   1,   2,   1,   0,   1, 377,   3,   0],
       [  4,   0,   1,   5,   2,   0,   1,   1, 350,   0],
       [ 18,   1,   0,   2,   2,   0,   0,   1,   4, 348]])
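The figures in the report can be recomputed from the confusion matrix above. The sketch below assumes the scikit-learn convention that rows are true labels and columns are predicted labels, which is consistent with the row sums matching the `support` column. It also finds the most frequent single confusion.

```python
class_names = ['alt.atheism', 'comp.sys.ibm.pc.hardware', 'rec.autos',
               'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey',
               'sci.med', 'sci.space', 'talk.politics.guns',
               'talk.politics.mideast']

# Confusion matrix copied from the output above (rows: true, columns: predicted).
confusion = [
    [298,   0,   1,   3,   1,   0,   5,   9,   2,   0],
    [  0, 387,   1,   1,   0,   0,   0,   1,   1,   1],
    [  0,   3, 381,   8,   0,   0,   1,   2,   1,   0],
    [  2,   3,  24, 359,   0,   0,   1,   0,   9,   0],
    [  2,   0,   1,   0, 385,   5,   1,   0,   3,   0],
    [  0,   0,   3,   1,   8, 385,   0,   0,   2,   0],
    [  3,   3,   0,   7,   0,   0, 378,   0,   3,   2],
    [  5,   4,   1,   2,   1,   0,   1, 377,   3,   0],
    [  4,   0,   1,   5,   2,   0,   1,   1, 350,   0],
    [ 18,   1,   0,   2,   2,   0,   0,   1,   4, 348],
]

n = len(confusion)
total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(n))
accuracy = correct / total

# Recall divides by the row sum (true count); precision by the column sum.
recall = {class_names[i]: confusion[i][i] / sum(confusion[i]) for i in range(n)}
precision = {class_names[j]: confusion[j][j] / sum(row[j] for row in confusion)
             for j in range(n)}

# Largest off-diagonal cell: (count, true class, predicted class).
worst = max((confusion[i][j], class_names[i], class_names[j])
            for i in range(n) for j in range(n) if i != j)

print(f"accuracy: {accuracy:.4f} ({correct}/{total})")
print("most common confusion:", worst)
```

The largest off-diagonal cell is `rec.motorcycles` documents predicted as `rec.autos`, two closely related topics.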

## How to Use Our Trained BERT Model

We can call the `ktrain.get_predictor` function with the trained model and the `preproc` object to obtain a Predictor object capable of making predictions on new raw data.

In [None]:
predictor = ktrain.get_predictor(learner.model, preproc)

In [None]:
predictor.get_classes()

['alt.atheism',
 'comp.sys.ibm.pc.hardware',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.med',
 'sci.space',
 'talk.politics.guns',
 'talk.politics.mideast']

In [None]:
predictor.predict(test_b.data[0:1])

['rec.autos']
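When the top label alone is not enough, it can help to rank all classes by probability. The sketch below assumes that `predictor.predict(..., return_proba=True)` returns one probability per class in the same order as `predictor.get_classes()`; the probability values used here are made up for illustration and are not model output. The helper `top_k` is hypothetical.

```python
def top_k(probabilities, class_names, k=3):
    """Return the k (class, probability) pairs with the highest probability."""
    ranked = sorted(zip(class_names, probabilities),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


class_names = ['alt.atheism', 'comp.sys.ibm.pc.hardware', 'rec.autos',
               'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey',
               'sci.med', 'sci.space', 'talk.politics.guns',
               'talk.politics.mideast']

# Hypothetical probabilities for one document (NOT real model output).
# With a trained predictor, this vector would come from something like:
#   predictor.predict(test_b.data[0:1], return_proba=True)[0]   # assumed API
example_probs = [0.01, 0.01, 0.90, 0.04, 0.005, 0.005, 0.01, 0.005, 0.01, 0.005]

for name, prob in top_k(example_probs, class_names):
    print(f"{name}: {prob:.3f}")
```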

In [None]:
# we can visually verify that our prediction of 'rec.autos' for this document is correct
print(test_b.data[0])

From: oaddab@stdvax (DIRK BROER)
Subject: Re: Questions about insurance companies (esp. Geico)
News-Software: VAX/VMS VNEWS 1.4-b1  
Organization: Goddard Space Flight Center - Robotics Lab
Distribution: usa
Lines: 32

In article <66758@mimsy.umd.edu>, davew@cs.umd.edu (David G. Wonnacott) writes...
>I'm considering switching to Geico insurance, but have heard that
>they do not assign a specific agent for each policy or claim.  I was
>worried that this might be a real pain when you make a claim.  I have
>also heard that they try to get rid of you if you have an accident.
> 
>I'm interestend in determining whether or not these things are true.
>Has anyone out there with Geico made a claim?  I'd be interested in
>hearing whether or not you were satisfied with the service and whether
>you then had trouble renewing your policy.
> 
>I'm also interested in any good or bad stories about Liberty Mutual or
>State Farm.
> 
Geico supports (reads gives to police) Laser Radars.  They have been know

In [None]:
# we predicted the correct label
print(test_b.target_names[test_b.target[0]])

rec.autos


The `predictor.save` method and the `ktrain.load_predictor` function can be used to save the Predictor object to disk and reload it later to make predictions on new data.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
# let's save the predictor for later use
predictor.save('/content/gdrive/My Drive/Colab Notebooks/ML_Profit-Gain/10NewsGroups/BERT_predictor')

In [None]:
# reload the predictor
reloaded_predictor = ktrain.load_predictor('/content/gdrive/My Drive/Colab Notebooks/ML_Profit-Gain/10NewsGroups/BERT_predictor')

In [None]:
# make a prediction on the same document to verify it still works
reloaded_predictor.predict(test_b.data[0:1])

['rec.autos']