<a href="https://colab.research.google.com/github/tanishq252/NLP_using_BERT/blob/main/NLP_using_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**NLP with BERT for Sentiment Analysis**

* This notebook uses the **ktrain** library, a wrapper around deep learning libraries such as TensorFlow Keras that makes building and training models much easier. ktrain includes BERT (Bidirectional Encoder Representations from Transformers), a pre-trained deep bidirectional transformer that can be fine-tuned with one additional output layer to create models for many different tasks.
* For a proper introduction to BERT, see the original paper: [BERT](https://arxiv.org/pdf/1810.04805.pdf)



##Importing Libraries

In [None]:
!pip install ktrain

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os.path
import tensorflow as tf
import ktrain
from ktrain import text

##Loading the IMDB dataset from Stanford AI

In [None]:
# download and extract the dataset directly from its URL instead of downloading it manually and uploading it
dataset = tf.keras.utils.get_file(fname = "aclImdb_v1.tar.gz",
                                  origin = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
                                  extract = True)

Downloading data from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


In [None]:
os.path.dirname(dataset)

'/root/.keras/datasets'

In [None]:
IMDB_DATADIR = os.path.join(os.path.dirname(dataset), 'aclImdb')

In [None]:
IMDB_DATADIR

'/root/.keras/datasets/aclImdb'
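
Before splitting the data, it can help to confirm that the extracted folder has the layout the next step relies on: a `train` and a `test` directory, each containing `pos` and `neg` subfolders of `.txt` reviews. The following is a minimal sketch for that check; `count_reviews` is a hypothetical helper written for this illustration and is not part of ktrain.

```python
import os

def count_reviews(datadir, splits=("train", "test"), classes=("pos", "neg")):
    """Return {(split, class): number of .txt files} for a folder-per-class dataset."""
    counts = {}
    for split in splits:
        for cls in classes:
            folder = os.path.join(datadir, split, cls)
            if os.path.isdir(folder):
                # count only the review files, ignoring anything else in the folder
                counts[(split, cls)] = sum(
                    name.endswith(".txt") for name in os.listdir(folder)
                )
            else:
                # a missing folder is reported as zero reviews
                counts[(split, cls)] = 0
    return counts
```

Calling `count_reviews(IMDB_DATADIR)` after the download should report the same number of reviews for each of the four folders if the archive was extracted correctly, since the dataset is balanced between positive and negative reviews.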

##Splitting the dataset into training and testing sets

In [None]:
# don't be intimidated by the code below
# texts_from_folder reads the reviews from the data directory and splits them into training and test sets
# `classes` lists the class subfolders and `train_test_names` gives the names of the training and test directories
# finally, preprocess_mode = 'bert' selects BERT-specific preprocessing

(x_train, y_train), (x_test, y_test) , preproc = text.texts_from_folder(datadir = IMDB_DATADIR,
                                                                        classes = ['pos', 'neg'],
                                                                        maxlen = 500,
                                                                        train_test_names = ['train', 'test'],
                                                                        preprocess_mode = 'bert')

detected encoding: utf-8
downloading pretrained BERT model (uncased_L-12_H-768_A-12.zip)...
[██████████████████████████████████████████████████]
extracting pretrained BERT model...
done.

cleanup downloaded zip...
done.

preprocessing train...
language: en


Is Multi-Label? False
preprocessing test...
language: en


##Building BERT model

In [None]:
# this is the power of the ktrain library:
# a single call is enough to build a very powerful model
model = text.text_classifier(
    name  = 'bert',
    train_data = (x_train, y_train),
    preproc = preproc,
)

Is Multi-Label? False
maxlen is 500
done.


In [None]:
# the output above indicates that this is a single-label (not multi-label) classification task
# and that each review is limited to a maximum length (maxlen) of 500 tokens
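
The `maxlen is 500` line in the output above refers to the length of each input sequence, not to the number of reviews. As a rough illustration of what a fixed sequence length means, here is a minimal sketch that clips or pads a list of token ids. It is a simplification: `pad_or_truncate` is a hypothetical helper, the ids are arbitrary numbers, and ktrain's actual BERT preprocessing (WordPiece tokenization, special tokens) is more involved.

```python
def pad_or_truncate(token_ids, maxlen, pad_id=0):
    """Cut a sequence to maxlen items, or right-pad it with pad_id up to maxlen."""
    clipped = token_ids[:maxlen]
    return clipped + [pad_id] * (maxlen - len(clipped))

# a short review is padded, a long one is cut off
short = pad_or_truncate([7, 8, 9], maxlen=5)
long = pad_or_truncate(list(range(10)), maxlen=5)
print(short)  # [7, 8, 9, 0, 0]
print(long)   # [0, 1, 2, 3, 4]
```

The practical consequence is that anything beyond the limit in a long review is not seen by the model, so a larger `maxlen` keeps more text at the cost of more memory and slower training.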

##Training the BERT model

In [None]:
# here we create a learner instance; as with any deep learning workflow, we give it the model together with the training and validation data
# as the name suggests, the learner trains the model on the provided data so that it can make good predictions
learner = ktrain.get_learner(model = model, 
                             train_data = (x_train, y_train),
                             val_data = (x_test, y_test), 
                             batch_size= 6)

In [None]:
# fit_onecycle() trains the model using Leslie Smith's 1cycle policy; this method can be used with any optimizer
# a learning rate of 0.00002 (2e-5) and 1 epoch are recommended, since even a single epoch takes a long time
learner.fit_onecycle(lr = 0.00002, epochs = 1)



begin training using onecycle policy with max lr of 2e-05...
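
To give an intuition for the "max lr" in the message above, a one-cycle schedule raises the learning rate to its maximum during the first part of training and lowers it again afterwards. The sketch below shows a simplified triangular version of that idea. It is not ktrain's implementation: `one_cycle_lr` and the `div_factor` parameter (how far below the maximum the schedule starts and ends) are assumptions made for this illustration.

```python
import math

def one_cycle_lr(step, total_steps, max_lr, div_factor=10.0):
    """Triangular schedule: linear warm-up to max_lr, then linear decay back."""
    base_lr = max_lr / div_factor
    half = total_steps / 2
    if step <= half:
        frac = step / half                  # rising phase: 0 -> 1
    else:
        frac = (total_steps - step) / half  # falling phase: 1 -> 0
    return base_lr + (max_lr - base_lr) * frac

# 25,000 training reviews with batch_size = 6 give this many steps per epoch
steps_per_epoch = math.ceil(25000 / 6)

for step in (0, steps_per_epoch // 2, steps_per_epoch):
    print(step, one_cycle_lr(step, steps_per_epoch, max_lr=2e-5))
```

The learning rate is small at the start and end of the epoch and close to `2e-5` around the midpoint, which is the general shape the one-cycle policy is named after.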

In [None]:
# overall, we get an accuracy of more than 90% using BERT
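
Accuracy here means the fraction of validation reviews whose predicted class matches the true class. A minimal sketch of that calculation follows; `accuracy` is a hypothetical helper and the labels are made-up toy values, not results from this notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty label list")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# toy example: 4 of 5 predictions match the true labels
toy_true = ["pos", "neg", "pos", "neg", "pos"]
toy_pred = ["pos", "neg", "neg", "neg", "pos"]
print(accuracy(toy_true, toy_pred))  # 0.8
```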