# NLP with BERT for Sentiment Analysis

### Importing the libraries

In [1]:
!pip3 install ktrain

Collecting ktrain
  Downloading https://files.pythonhosted.org/packages/26/94/dbee68fd93255f7604092f9cd3c60d3aa550808099443a28a36d79cd7632/ktrain-0.18.4.tar.gz (25.2MB)
Collecting tensorflow==2.1.0
  Downloading https://files.pythonhosted.org/packages/85/d4/c0cd1057b331bc38b65478302114194bd8e1b9c2bbc06e300935c0e93d90/tensorflow-2.1.0-cp36-cp36m-manylinux2010_x86_64.whl (421.8MB)
Collecting scikit-learn==0.21.3
  Downloading https://files.pythonhosted.org/packages/a0/c5/d2238762d780dde84a20b8c761f563fe882b88c5a5fb03c056547c442a19/scikit_learn-0.21.3-cp36-cp36m-manylinux1_x86_64.whl (6.7MB)
Collecting keras_bert>=0.81.0
  Downloading https://files.pythonhosted.org/packages/06/fc/45434550f446e6e1c9cac715e5a2f34a6fa599abd98563f728c2d6df4332/keras-bert-0.85.0.tar.gz
Collecting langdetect

In [2]:
import pandas as pd
import numpy as np
import tensorflow as tf
import ktrain
from ktrain import text

## Part 1: Data Preprocessing

In [3]:
train = pd.read_csv('train_E6oV3lV.csv')
test = pd.read_csv('test_tweets_anuFYb8.csv')

In [4]:
train.shape

(31962, 3)

In [5]:
test.shape

(17197, 2)

In [6]:
train.head()

|   | id | label | tweet |
|---|----|-------|-------|
| 0 | 1 | 0 | @user when a father is dysfunctional and is s... |
| 1 | 2 | 0 | @user @user thanks for #lyft credit i can't us... |
| 2 | 3 | 0 | bihday your majesty |
| 3 | 4 | 0 | #model i love u take with u all the time in ... |
| 4 | 5 | 0 | factsguide: society now #motivation |


In [7]:
test.head()

|   | id | tweet |
|---|----|-------|
| 0 | 31963 | #studiolife #aislife #requires #passion #dedic... |
| 1 | 31964 | @user #white #supremacists want everyone to s... |
| 2 | 31965 | safe ways to heal your #acne!! #altwaystohe... |
| 3 | 31966 | is the hp and the cursed child book up for res... |
| 4 | 31967 | 3rd #bihday to my amazing, hilarious #nephew... |


### Creating the training and test sets

In [8]:
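# No val_df is passed, so ktrain holds out a random 10% of `train` (val_pct defaults to 0.1).
# x_test/y_test are therefore a labeled validation split, not the unlabeled `test` file.
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_df(train,
                                                                   text_column='tweet',
                                                                   label_columns=['label'],
                                                                   maxlen=500,
                                                                   preprocess_mode='bert')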

downloading pretrained BERT model (uncased_L-12_H-768_A-12.zip)...
[██████████████████████████████████████████████████]
extracting pretrained BERT model...
done.

cleanup downloaded zip...
done.

preprocessing train...
language: en


Is Multi-Label? False
preprocessing test...
language: en


## Part 2: Building the BERT model

In [9]:
model = text.text_classifier(name='bert',
                             train_data=(x_train, y_train),
                             preproc=preproc)

Is Multi-Label? False
maxlen is 500
done.


## Part 3: Training the BERT model

In [10]:
learner = ktrain.get_learner(model=model,
                             train_data=(x_train, y_train),
                             val_data=(x_test, y_test),
                             batch_size=6)

In [11]:
learner.fit_onecycle(lr=2e-5,
                     epochs=1)



begin training using onecycle policy with max lr of 2e-05...
Train on 28765 samples, validate on 3197 samples


<tensorflow.python.keras.callbacks.History at 0x7fba322839e8>

In [12]:
learner.validate(val_data=(x_test, y_test))

              precision    recall  f1-score   support

           0       0.97      0.99      0.98      2964
           1       0.87      0.63      0.73       233

    accuracy                           0.97      3197
   macro avg       0.92      0.81      0.85      3197
weighted avg       0.96      0.97      0.96      3197



array([[2942,   22],
       [  87,  146]])
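
The per-class figures in the report can be reproduced from the confusion matrix. The sketch below assumes the scikit-learn convention of rows as true labels and columns as predicted labels, which fits this output because the row sums (2964 and 233) match the `support` column. It uses plain Python only; the helper name `per_class_metrics` is introduced here for illustration and is not part of ktrain.

```python
# Confusion matrix copied from the output above.
# Assumed layout: rows = true class, columns = predicted class.
confusion = [[2942, 22],
             [87, 146]]


def per_class_metrics(cm, k):
    """Return (precision, recall, f1) for class k of a square confusion matrix."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything truly k
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


for k in range(len(confusion)):
    p, r, f = per_class_metrics(confusion, k)
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")

total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[k][k] for k in range(len(confusion))) / total
print(f"accuracy={accuracy:.2f}")
```

Read this way, the 0.97 accuracy is driven mostly by the majority class: for class 1 the model finds 146 of the 233 validation examples and misses 87, which is where the 0.63 recall comes from.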

## How to Use Our Trained BERT Model

We can call the `ktrain.get_predictor` function with the trained model and the preprocessor to obtain a Predictor object capable of making predictions on new raw text.

In [13]:
predictor = ktrain.get_predictor(learner.model, preproc)

In [14]:
predictor.get_classes()

['0', '1']

In [15]:
test_tweets = test['tweet'].values
test_tweets

array(['#studiolife #aislife #requires #passion #dedication #willpower   to find #newmaterialsâ\x80¦ ',
       ' @user #white #supremacists want everyone to see the new â\x80\x98  #birdsâ\x80\x99 #movie â\x80\x94 and hereâ\x80\x99s why  ',
       'safe ways to heal your #acne!!    #altwaystoheal #healthy   #healing!! ',
       ...,
       '#hillary #campaigned today in #ohio((omg)) &amp; used words like "assets&amp;liability" never once did #clinton say thee(word) #radicalization   ',
       'happy, at work conference: right mindset leads to culture-of-development organizations    #work #mindset',
       'my   song "so glad" free download!  #shoegaze #newmusic #newsong'],
      dtype=object)

In [16]:
test_tweets[0]

'#studiolife #aislife #requires #passion #dedication #willpower   to find #newmaterialsâ\x80¦ '

In [17]:
predictor.predict(test_tweets[0])

'0'

In [18]:
test_tweets[1]

' @user #white #supremacists want everyone to see the new â\x80\x98  #birdsâ\x80\x99 #movie â\x80\x94 and hereâ\x80\x99s why  '

In [19]:
predictor.predict(test_tweets[1])

'1'

In [20]:
test_tweets[179]

'seeing war craft in imax 3d ð\x9f\x98\x84 #warcraftmovie   #imax #woohoo '

In [21]:
predictor.predict(test_tweets[179])

'0'

In [22]:
test_pred = predictor.predict(test_tweets)

In [23]:
test_pred

['0',
 '1',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '1',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '1',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '1',
 '1',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '1',
 '0',
 '0',
 '0',
 '0',
 '1',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '1',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '1',
 '1',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '1',
 '1',
 '0',
 '0',
 '0',
 '1',
 '0',
 '0'

In [24]:
test['label'] = test_pred

submission = test[['id','label']]

submission.to_csv('test_predictions.csv', index=False)

In [25]:
res = pd.read_csv('test_predictions.csv')

In [26]:
res

|   | id | label |
|---|----|-------|
| 0 | 31963 | 0 |
| 1 | 31964 | 1 |
| 2 | 31965 | 0 |
| 3 | 31966 | 0 |
| 4 | 31967 | 0 |
| ... | ... | ... |
| 17192 | 49155 | 1 |
| 17193 | 49156 | 0 |
| 17194 | 49157 | 1 |
| 17195 | 49158 | 0 |


The `predictor.save` and `ktrain.load_predictor` methods can be used to save the Predictor object to disk and reload it at a later time to make predictions on new data.
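
A minimal sketch of that round trip, assuming the `predictor` object created earlier in this notebook. The helper name `save_and_reload` and the path `'tweet_predictor'` are illustrative choices, not part of ktrain. `ktrain` is imported inside the function only so that the definition can be written without the library already loaded.

```python
def save_and_reload(predictor, path):
    """Save a ktrain Predictor to `path`, then load it back from disk."""
    import ktrain  # lazy import; assumes ktrain is installed as in In [1]

    predictor.save(path)                # persist the model together with its preprocessor
    return ktrain.load_predictor(path)  # rebuild a Predictor from the saved files


# Example usage in this notebook (hypothetical path):
# reloaded = save_and_reload(predictor, 'tweet_predictor')
# reloaded.predict(test_tweets[0])
```

The reloaded object exposes the same `predict` and `get_classes` calls used above, so a later session can classify new tweets without retraining.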