<a href="https://colab.research.google.com/github/sourcecode369/deep-natural-language-processing/blob/master/huggingface/public_transformers_in_ktrain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### A Simplified Interface to Text Classification With Hugging Face Transformers in TensorFlow Using [ktrain](https://github.com/amaiya/ktrain)

*ktrain* requires TensorFlow 2.

In [0]:
!pip3 install -q "tensorflow_gpu>=2.0"

In [0]:
import tensorflow as tf
print(tf.__version__)

2.1.0


Next, we install the *ktrain* library using pip.

In [0]:
!pip3 install -q ktrain

  Building wheel for ktrain (setup.py) ... done


### Load a Dataset Into Arrays

In [0]:
from sklearn.datasets import fetch_20newsgroups

# Restrict the 20 Newsgroups corpus to four categories.
categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
train_b = fetch_20newsgroups(subset='train', categories=categories,
                             shuffle=True, random_state=42)
test_b = fetch_20newsgroups(subset='test', categories=categories,
                            shuffle=True, random_state=42)

# The 'test' split serves as our validation set throughout this notebook.
print('size of training set: %s' % (len(train_b['data'])))
print('size of validation set: %s' % (len(test_b['data'])))
print('classes: %s' % (train_b.target_names))

# Raw texts and integer labels (indices into target_names).
x_train = train_b.data
y_train = train_b.target
x_test = test_b.data
y_test = test_b.target

size of training set: 2257
size of validation set: 1502
classes: ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
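
Note that `target_names` comes back sorted alphabetically rather than in the order of the `categories` list we passed in, and the integer labels in `train_b.target` index into that sorted list. As a minimal sketch of mapping labels back to names and counting examples per class (the `toy_targets` list is made up for illustration and is not the real label array):

```python
from collections import Counter

# Class names in the order printed above (sorted, not the order we requested).
target_names = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

def class_counts(targets, names):
    """Count how many examples carry each label, keyed by class name."""
    counts = Counter(int(t) for t in targets)
    return {name: counts.get(i, 0) for i, name in enumerate(names)}

# Hypothetical labels standing in for train_b.target (illustration only).
toy_targets = [3, 1, 1, 0, 2, 3, 3]
toy_counts = class_counts(toy_targets, target_names)
print(toy_counts)
```

In the notebook itself, `class_counts(y_train, train_b.target_names)` should give the real training distribution.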


## STEP 1:  Preprocess Data and Create a Transformer Model

We will use [DistilBERT](https://arxiv.org/abs/1910.01108).

In [0]:
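import ktrain
from ktrain import text

MODEL_NAME = 'distilbert-base-uncased'
# Wrap the pretrained model; maxlen caps each document at 500 tokens.
t = text.Transformer(MODEL_NAME, maxlen=500, classes=train_b.target_names)
trn = t.preprocess_train(x_train, y_train)  # tokenize the training texts
val = t.preprocess_test(x_test, y_test)     # tokenize the validation texts
model = t.get_classifier()                  # DistilBERT with a classification head
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)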

using Keras version: 2.2.4-tf
preprocessing train...
language: en


preprocessing test...
language: en
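
The `maxlen=500` argument fixes the sequence length the model sees: longer documents are cut off and shorter ones are padded so every example in a batch has the same shape. The following is only a rough sketch of that idea, not ktrain's or the tokenizer's actual code; the token ids are made up, `pad_id=0` is an assumption, and real tokenizers also take care to keep special tokens in place when truncating.

```python
def pad_or_truncate(token_ids, maxlen, pad_id=0):
    """Return exactly `maxlen` ids: cut long inputs, right-pad short ones."""
    clipped = list(token_ids[:maxlen])
    return clipped + [pad_id] * (maxlen - len(clipped))

def attention_mask(token_ids, maxlen):
    """1 for real tokens, 0 for padding, so the model can ignore the padding."""
    real = min(len(token_ids), maxlen)
    return [1] * real + [0] * (maxlen - real)

# Hypothetical token ids, with a tiny maxlen so the effect is visible.
short_doc = [11, 12, 13]
long_doc = [21, 22, 23, 24, 25, 26, 27]

print(pad_or_truncate(short_doc, 5))   # padded on the right
print(attention_mask(short_doc, 5))
print(pad_or_truncate(long_doc, 5))    # truncated
```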


## STEP 2:  Train the Model

In [0]:
learner.fit_onecycle(5e-5, 4)  # maximum learning rate of 5e-5, 4 epochs



begin training using onecycle policy with max lr of 5e-05...
Train for 377 steps, validate for 251 steps
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<tensorflow.python.keras.callbacks.History at 0x7f1c74313da0>
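
The step counts in the training log follow from the dataset sizes and `batch_size=6`, and under a one-cycle policy the learning rate rises to its maximum and then falls again over the run. The sketch below illustrates both, assuming a plain triangular schedule that starts and ends at one tenth of the peak. The `div_factor` parameter and the exact shape are assumptions for illustration and may not match ktrain's implementation.

```python
import math

def steps_per_epoch(n_examples, batch_size):
    """Number of batches needed to see every example once."""
    return math.ceil(n_examples / batch_size)

def triangular_onecycle_lr(step, total_steps, max_lr, div_factor=10.0):
    """Linear warm-up from max_lr/div_factor to max_lr, then linear decay back."""
    base_lr = max_lr / div_factor
    half = total_steps / 2
    frac = step / half if step <= half else (total_steps - step) / half
    return base_lr + frac * (max_lr - base_lr)

train_steps = steps_per_epoch(2257, 6)   # 377, as in "Train for 377 steps"
val_steps = steps_per_epoch(1502, 6)     # 251, as in "validate for 251 steps"
total_steps = train_steps * 4            # 4 epochs

for step in (0, total_steps // 2, total_steps):
    print(step, triangular_onecycle_lr(step, total_steps, 5e-5))
```

With these assumptions the learning rate peaks at 5e-5 halfway through training, at the end of the second epoch.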

## STEP 3: Evaluate and Inspect the Model

In [0]:
learner.validate(class_names=t.get_classes())

                        precision    recall  f1-score   support

           alt.atheism       0.94      0.90      0.92       319
         comp.graphics       0.96      0.97      0.96       389
               sci.med       0.98      0.96      0.97       396
soc.religion.christian       0.94      0.98      0.96       398

              accuracy                           0.96      1502
             macro avg       0.95      0.95      0.95      1502
          weighted avg       0.96      0.96      0.96      1502



array([[286,   8,   5,  20],
       [  9, 377,   2,   1],
       [  4,   7, 381,   4],
       [  5,   1,   1, 391]])
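
The classification report and the confusion matrix describe the same predictions. As a quick check, the per-class figures can be recomputed from the matrix in plain Python. Rows are true classes and columns are predicted classes, which is consistent with the row sums matching the `support` column.

```python
classes = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
cm = [[286,   8,   5,  20],
      [  9, 377,   2,   1],
      [  4,   7, 381,   4],
      [  5,   1,   1, 391]]

def per_class_metrics(matrix):
    """Return (precision, recall, f1, support) for each class."""
    results = []
    for k in range(len(matrix)):
        tp = matrix[k][k]
        support = sum(matrix[k])                    # row sum: true examples
        predicted = sum(row[k] for row in matrix)   # column sum: predictions
        precision = tp / predicted
        recall = tp / support
        f1 = 2 * precision * recall / (precision + recall)
        results.append((precision, recall, f1, support))
    return results

metrics = per_class_metrics(cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))

for name, (p, r, f1, n) in zip(classes, metrics):
    print(f"{name:>22}  {p:.2f}  {r:.2f}  {f1:.2f}  {n}")
print(f"accuracy: {accuracy:.2f}")
```

For example, `alt.atheism` has 286 correct predictions out of 319 true examples (recall 0.90) and out of 304 predictions of that class (precision 0.94).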

Let's examine the validation example the model got most wrong, that is, the one with the highest loss.

In [0]:
learner.view_top_losses(n=1, preproc=t)

----------
id:371 | loss:7.01 | true:alt.atheism | pred:comp.graphics)



In [0]:
print(x_test[371])

From: kempmp@phoenix.oulu.fi (Petri Pihko)
Subject: Re: Consciousness part II - Kev Strikes Back!
Organization: University of Oulu, Finland
X-Newsreader: TIN [version 1.1 PL9]
Lines: 30

Scott D. Sauyet (SSAUYET@eagle.wesleyan.edu) wrote:
> In <1993Apr21.163848.8099@cs.nott.ac.uk> 
> Kevin Anthony (kax@cs.nott.ac.uk) writes:

> > Firstly, I'm not impressed with the ability of algorithms. They're
> > great at solving problems once the method has been worked out, but not
> > at working out the method itself.
>   [ .. crossword example deleted ... ]

> Have you heard of neural networks?  I've read a little about them, and
> they seems to overcome most of your objections.

I'm sure there are many people who work with neural networks and
read this newsgroup. Please tell Kevin what you've achieved, and
what you expect.

> I am not saying that NNs will solve all such problems, but I think
> they show that it is not as hard as you think to come up with
> mechanical models of consciousness.

In

This post is more about computing than atheism (its true category is `alt.atheism`), so our model placed it into the only computing category available to it: `comp.graphics`.
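
To get a feel for how wrong a loss of 7.01 is, we can convert it back into a probability. This sketch assumes the reported value is the per-example categorical cross-entropy with a natural logarithm, i.e. loss = -ln(p), where p is the probability the model gave to the true class; that assumption is not stated in the output above.

```python
import math

def prob_from_loss(loss):
    """Invert loss = -ln(p): the probability assigned to the true class."""
    return math.exp(-loss)

p_true = prob_from_loss(7.01)
print(f"probability given to the true class: {p_true:.4f}")
```

Under that assumption, the model gave `alt.atheism` a probability of roughly 0.0009 for this post.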

## STEP 4: Making Predictions on New Data in Deployment

In [0]:
predictor = ktrain.get_predictor(learner.model, preproc=t)

In [0]:
predictor.predict('Jesus Christ is the central figure of Christianity.')

'soc.religion.christian'

In [0]:
# predicted probability scores for each category
predictor.predict_proba('Jesus Christ is the central figure of Christianity.')

array([2.9704000e-03, 5.0002872e-04, 6.5480877e-04, 9.9587470e-01],
      dtype=float32)

In [0]:
predictor.get_classes()

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

As expected, `soc.religion.christian` is assigned the highest probability.
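
The scores from `predict_proba` line up with the list from `get_classes()`, so the predicted label is the class at the position of the largest score. A small sketch using the values printed above (the helper names `top_class` and `ranked` are ours, not part of ktrain):

```python
classes = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
probs = [2.9704000e-03, 5.0002872e-04, 6.5480877e-04, 9.9587470e-01]

def top_class(scores, names):
    """Return the (name, score) pair with the highest score."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return names[best], scores[best]

def ranked(scores, names):
    """Return (name, score) pairs sorted from most to least likely."""
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)

label, score = top_class(probs, classes)
print(label, score)
for name, s in ranked(probs, classes):
    print(f"{name:>22}  {s:.4f}")
```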

Let's invoke the `explain` method to see which words contribute most to the classification.

We will need a forked version of the **eli5** library that supports TensorFlow Keras, so let's install it first.

In [0]:
!pip3 install -q git+https://github.com/amaiya/eli5@tfkeras_0_10_1


  Building wheel for eli5 (setup.py) ... done


In [0]:
predictor.explain('Jesus Christ is the central figure in Christianity.')

| Contribution? | Feature |
|---|---|
| 8.967 | Highlighted in text (sum) |
| -0.101 | `<BIAS>` |


The words in the darkest shade of green contribute most to the classification and agree with what you would expect for this example.
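
To build some intuition for word-level contributions, here is a much simpler occlusion-style sketch: remove one word at a time and see how much a score drops. This is not what `explain` does internally, and `toy_score` with its weights is a made-up stand-in for a real model's score for one class.

```python
def toy_score(text):
    """Hypothetical scorer: sums made-up weights for a few keywords."""
    weights = {"jesus": 3.0, "christ": 2.5, "christianity": 3.5}
    return sum(weights.get(w.strip(".,").lower(), 0.0) for w in text.split())

def leave_one_out_importance(text, score_fn):
    """Score drop caused by deleting each word in turn."""
    words = text.split()
    full = score_fn(text)
    importances = []
    for i, word in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        importances.append((word, full - score_fn(reduced)))
    return importances

sentence = 'Jesus Christ is the central figure in Christianity.'
importances = leave_one_out_importance(sentence, toy_score)
for word, drop in sorted(importances, key=lambda pair: pair[1], reverse=True):
    print(f"{word:>15}  {drop:.1f}")
```

With a real model, `score_fn` could wrap `predictor.predict_proba` and return the probability of the class of interest, at the cost of one prediction per word.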

We can save and reload our predictor for later deployment.

In [0]:
predictor.save('/tmp/my_distilbert_predictor')

In [0]:
reloaded_predictor = ktrain.load_predictor('/tmp/my_distilbert_predictor')

In [0]:
reloaded_predictor.predict('My computer monitor is really blurry.')

'comp.graphics'