# Sentiment Classification Using BERT

We will use the IMDB Movie Reviews dataset.

In [1]:
!pip install ktrain

Collecting ktrain
  Downloading ktrain-0.41.3.tar.gz (25.3 MB)
  Preparing metadata (setup.py) ... done
Collecting langdetect (from ktrain)
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Collecting syntok>1.3.3 (from ktrain)
  Downloading syntok-1.4.4-py3-none-any.whl (24 kB)
Collecting tika (from ktrain)
  Downloading tika-2.6.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Collecting keras_bert>=0.86.0 (from ktrain)
  Downloading keras-bert-0.89.0.tar.gz (25 kB)
  Preparing metadata (setup.py) ... done
Collecting whoosh (from ktrain)
  Downloading Whoosh-2.7.4-py2.py3-none-any.whl (468 kB)

In [2]:
import tensorflow as tf

In [3]:
tf.__version__

'2.15.0'

In [4]:
!git clone https://github.com/tharani001/IMDB-Dataset.git

Cloning into 'IMDB-Dataset'...
remote: Enumerating objects: 4, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 4 (delta 0), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (4/4), 25.78 MiB | 25.53 MiB/s, done.


In [5]:
import pandas as pd
import numpy as np
import ktrain
from ktrain import text

In [6]:
data_train = pd.read_excel('/content/IMDB-Dataset/train.xlsx', dtype = str)

In [7]:
data_test = pd.read_excel('/content/IMDB-Dataset/test.xlsx', dtype = str)

In [8]:
data_train.tail()

Unnamed: 0,Reviews,Sentiment
24995,Everyone plays their part pretty well in this ...,pos
24996,It happened with Assault on Prescient 13 in 20...,neg
24997,My God. This movie was awful. I can't complain...,neg
24998,"When I first popped in Happy Birthday to Me, I...",neg
24999,"So why does this show suck? Unfortunately, tha...",neg


In [9]:
data_test.head()

Unnamed: 0,Reviews,Sentiment
0,Who would have thought that a movie about a ma...,pos
1,After realizing what is going on around us ......,pos
2,I grew up watching the original Disney Cindere...,neg
3,David Mamet wrote the screenplay and made his ...,pos
4,"Admittedly, I didn't have high expectations of...",neg


In [10]:
data_test.shape, data_train.shape

((25000, 2), (25000, 2))

# Data Preprocessing and Train-Test Split

In [12]:
(X_train, y_train), (X_test, y_test), preproc = text.texts_from_df(train_df=data_train,
                                                                   text_column = 'Reviews',
                                                                   label_columns = 'Sentiment',
                                                                   val_df = data_test,
                                                                   maxlen = 500,
                                                                   preprocess_mode = 'bert')

['neg', 'pos']
   neg  pos
0  1.0  0.0
1  1.0  0.0
2  1.0  0.0
3  1.0  0.0
4  1.0  0.0
['neg', 'pos']
   neg  pos
0  0.0  1.0
1  0.0  1.0
2  1.0  0.0
3  0.0  1.0
4  1.0  0.0
preprocessing train...
language: en


Is Multi-Label? False
preprocessing test...
language: en
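
The two small tables printed above are the one-hot encoded labels: each review gets a 1.0 in the column of its class. A minimal sketch of that encoding in plain Python follows, assuming the class order `['neg', 'pos']` shown in the output. The helper `one_hot` is hypothetical and is not part of ktrain.

```python
def one_hot(labels, classes):
    """Return one row per label with 1.0 in the column of its class."""
    index = {name: i for i, name in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0.0] * len(classes)
        row[index[label]] = 1.0
        rows.append(row)
    return rows

classes = ['neg', 'pos']
# First five validation labels, read off data_test.head() above.
encoded = one_hot(['pos', 'pos', 'neg', 'pos', 'neg'], classes)
print(encoded)
```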


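In `texts_from_df()`, the `maxlen` argument sets the maximum sequence length, in tokens, for each review: longer reviews are truncated to 500 tokens and shorter ones are padded (BERT itself accepts at most 512). The model predicts the label from this fixed-length input.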
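
To make the effect of `maxlen` concrete, here is a simplified sketch of truncating and padding a token sequence to a fixed length. It is an illustration only: the helper `pad_or_truncate` and the padding id `0` are assumptions, and ktrain's BERT preprocessing additionally performs WordPiece tokenization and adds special tokens.

```python
def pad_or_truncate(token_ids, maxlen, pad_id=0):
    """Force a sequence to exactly `maxlen` items (hypothetical helper)."""
    clipped = token_ids[:maxlen]                          # drop tokens beyond the limit
    return clipped + [pad_id] * (maxlen - len(clipped))   # fill up short sequences

short_review = [101, 2023, 3185, 102]   # made-up token ids
long_review = list(range(1, 701))       # 700 made-up token ids

print(len(pad_or_truncate(short_review, 500)))
print(len(pad_or_truncate(long_review, 500)))
```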

# Model Training

`text_classifier` builds the BERT model and compiles it.

Compiling a model means configuring how it will be trained, namely the loss function, the optimizer, and the metrics to report, before it is fit on a dataset. This step determines how the model learns from the data.
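
As a rough illustration of what a compiled-in loss measures, the sketch below computes categorical cross-entropy for a single prediction in plain Python. That ktrain uses this particular loss for single-label classification is an assumption here, and `categorical_cross_entropy` is a hypothetical helper, not a ktrain or Keras function.

```python
import math

def categorical_cross_entropy(y_true, y_prob, eps=1e-7):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_prob))

target = [1.0, 0.0]  # one-hot label for 'neg'
confident = categorical_cross_entropy(target, [0.99, 0.01])
unsure = categorical_cross_entropy(target, [0.60, 0.40])
wrong = categorical_cross_entropy(target, [0.05, 0.95])

# The loss grows as the probability assigned to the true class shrinks.
print(confident, unsure, wrong)
```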

In [13]:
model = text.text_classifier(name = 'bert',
                             train_data = (X_train, y_train),
                             preproc = preproc)

Is Multi-Label? False
maxlen is 500




done.


`get_learner` wraps the model and the data in a learner instance, which is then used to fine-tune the pretrained model.

In [14]:
learner = ktrain.get_learner(model=model, train_data=(X_train, y_train),
                   val_data = (X_test, y_test),
                   batch_size = 6)

No training has taken place yet. The learner bundles the BERT model with the training and validation data, and the small batch size helps keep memory use manageable with 500-token sequences.

# Best Learning Rate

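ktrain can help estimate a good learning rate with its learning rate finder:
learner.lr_find()
learner.lr_plot()

With a model and dataset of this size, the search can take a very long time (possibly days) the first time it is run, so this notebook does not run it.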
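
A learning rate finder of this kind typically trains for a short while, raising the learning rate step by step and recording the loss, so that a rate just before the loss starts to diverge can be read off the plot. The sketch below only generates such an exponentially increasing schedule. The function name and the start and end values are illustrative assumptions, not ktrain defaults.

```python
def lr_range_schedule(start_lr, end_lr, num_steps):
    """Exponentially spaced learning rates from start_lr to end_lr."""
    ratio = (end_lr / start_lr) ** (1.0 / (num_steps - 1))
    return [start_lr * ratio ** step for step in range(num_steps)]

schedule = lr_range_schedule(start_lr=1e-7, end_lr=1e-1, num_steps=7)
for lr in schedule:
    print(f"{lr:.0e}")
```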

## Based on prior research, lr=2e-5 works well for fine-tuning BERT on sentiment classification

In [None]:
learner.fit_onecycle(lr = 2e-5, epochs = 1)



begin training using onecycle policy with max lr of 2e-05...
Train on 25000 samples, validate on 25000 samples


<tensorflow.python.keras.callbacks.History at 0x7fce51ef4940>
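
The one-cycle policy mentioned in the log varies the learning rate during training: it rises to the maximum and then falls again. A simplified sketch with a linear ramp up and down follows. The function `one_cycle_lr`, the starting fraction of `max_lr / 10`, and the symmetric shape are assumptions for illustration, and ktrain's exact schedule may differ.

```python
def one_cycle_lr(step, total_steps, max_lr, start_frac=0.1):
    """Triangular one-cycle schedule: ramp up to max_lr, then back down."""
    base_lr = max_lr * start_frac
    half = total_steps / 2
    if step <= half:
        progress = step / half                    # 0 -> 1 during the first half
    else:
        progress = (total_steps - step) / half    # 1 -> 0 during the second half
    return base_lr + (max_lr - base_lr) * progress

total = 10
lrs = [one_cycle_lr(s, total, max_lr=2e-5) for s in range(total + 1)]
print([f"{lr:.1e}" for lr in lrs])
```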

In [None]:
predictor = ktrain.get_predictor(learner.model, preproc)

In [None]:
data = ['this movie was horrible, the plot was really boring. acting was okay',
        'the fild is really sucked. there is not plot and acting was bad',
        'what a beautiful movie. great plot. acting was good. will see it again']

In [None]:
predictor.predict(data)

['neg', 'neg', 'pos']

In [None]:
predictor.predict(data, return_proba=True)

array([[0.99797565, 0.00202436],
       [0.99606663, 0.00393336],
       [0.00292433, 0.9970757 ]], dtype=float32)
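
Assuming the columns follow the class order `['neg', 'pos']`, each row holds the probability of `neg` followed by the probability of `pos`. The predicted label is the class with the highest probability, which can be reproduced from the numbers above:

```python
classes = ['neg', 'pos']
# Probabilities copied from the output above.
probabilities = [
    [0.99797565, 0.00202436],
    [0.99606663, 0.00393336],
    [0.00292433, 0.9970757],
]

# Pick the index of the largest probability in each row.
labels = [classes[row.index(max(row))] for row in probabilities]
print(labels)
```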

In [None]:
predictor.get_classes()

['neg', 'pos']

In [None]:
predictor.save('/content/bert')

In [None]:
!zip -r /content/bert.zip /content/bert

  adding: content/bert/ (stored 0%)
  adding: content/bert/tf_model.h5 (deflated 11%)
  adding: content/bert/tf_model.preproc (deflated 52%)


In [None]:
predictor_load = ktrain.load_predictor('/content/bert')

In [None]:
predictor_load.get_classes()

['neg', 'pos']

In [None]:
predictor_load.predict(data)

['neg', 'neg', 'pos']