<a href="https://colab.research.google.com/github/sahug/ds-bert/blob/main/BERT%20NLP%20-%20Sentiment140%20-%20Sentiment%20Analysis%20using%20Tensorflow%20Hub%20and%20BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**BERT NLP - Sentiment140 - Sentiment Analysis using BERT**

**Dataset**

- URL - http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip

The data is a CSV file with emoticons removed. Each row has 6 fields:

- 0 - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
- 1 - the id of the tweet (2087)
- 2 - the date of the tweet (Sat May 16 23:58:44 UTC 2009)
- 3 - the query (lyx). If there is no query, then this value is NO_QUERY.
- 4 - the user that tweeted (robotickilldozr)
- 5 - the text of the tweet (Lyx is cool)

Optionally rename the extracted file to **sentiment140** so the file path contains no extra "." characters. The code below uses the original file name.

**Load Dataset**

In [2]:
import tensorflow as tf
from tensorflow import keras
import pandas as pd
import numpy as np

In [3]:
ds = pd.read_csv("/content/training.1600000.processed.noemoticon.csv", encoding = "ISO-8859-1")

In [4]:
ds.columns

# The file has no header row, so pandas used the first data row as column names. We will select the columns we need by position.

Index(['0', '1467810369', 'Mon Apr 06 22:19:45 PDT 2009', 'NO_QUERY',
       '_TheSpecialOne_',
       '@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D'],
      dtype='object')
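
Because the file has no header row, an alternative is to tell pandas so and supply names. The sketch below is only illustrative: the column names are hypothetical labels for the six documented fields, the polarity of the second row is assumed, and a two-row in-memory CSV stands in for the real file.

```python
import io
import pandas as pd

# Hypothetical names for the six documented fields; the file ships without a header.
COLUMNS = ["polarity", "id", "date", "query", "user", "text"]

# Two rows in the documented layout, standing in for the real file.
sample_csv = io.StringIO(
    '"0","1467811372","Mon Apr 06 22:20:00 PDT 2009","NO_QUERY","joy_wolf","@Kwesidei not the whole crew"\n'
    '"4","2087","Sat May 16 23:58:44 UTC 2009","lyx","robotickilldozr","Lyx is cool"\n'
)

# header=None keeps the first row as data; names= labels the columns.
labeled = pd.read_csv(sample_csv, header=None, names=COLUMNS)
print(labeled[["polarity", "text"]])
```

With named columns, later cells could select `labeled["polarity"]` and `labeled["text"]` directly instead of indexing by position.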

In [5]:
ds.head()

Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew


**Data Preprocessing**

In [17]:
df = ds.copy()

The full dataset is too large to train on quickly, so let's randomly sample 10K records.

In [18]:
df = df.sample(n=10000)
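
Without a seed, `sample` returns different rows on every run. A minimal sketch of a repeatable sample, assuming a toy frame in place of the real data and an arbitrary seed of 42:

```python
import pandas as pd

# Toy frame standing in for the 1.6M-row dataset.
toy = pd.DataFrame({
    "polarity": [0, 4] * 50,
    "tweets": [f"tweet {i}" for i in range(100)],
})

# random_state fixes the random draw, so both calls pick the same rows.
sample_a = toy.sample(n=10, random_state=42)
sample_b = toy.sample(n=10, random_state=42)

print(sample_a.index.equals(sample_b.index))
```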

Remove Unwanted Features

In [19]:
polarity = df[df.columns[0]]
tweets = df[df.columns[-1]]

df = pd.DataFrame({"polarity": polarity, "tweets": tweets})
df

Unnamed: 0,polarity,tweets
1118973,4,Might get called back to my job crosses fing...
1324315,4,@syzygy it will be live on 42 inches of hd in ...
868721,4,it's tobi's wedding day and the 1 week till ou...
634104,0,@arifandi bu2? Huh? Iseeennnggg yahhh mrn!! Da...
862862,4,Heading To Sleep With A Cottonball Drenched In...
...,...,...
1042861,4,so much fun at the beach today! and delicious...
1168970,4,I had an awesome day today!!!!! Thanks rach &l...
743356,0,As the Sunday night grows each one of us becom...
104222,0,@nicolerichie you need to replace your track b...


**Train and Test Split**

In [20]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(df["tweets"], df["polarity"], test_size=0.2, stratify=df["polarity"])

In [10]:
x_train.head(), y_train.head()

(1073246                @AshlyJBew you bet your bippy i am!  
 1589469    @Blade21292 Illustration Friday? On Tuesday? O...
 685333                   @CrazyCamel123 what about Christy  
 162226                 @TylerHarrell do you really smoke??? 
 707333     so jealous of every fan going to the jonas bro...
 Name: tweets, dtype: object, 1073246    4
 1589469    4
 685333     0
 162226     0
 707333     0
 Name: polarity, dtype: int64)
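
The `stratify` argument keeps the class proportions roughly equal in both splits. A small sketch that checks this, assuming a toy frame with a 70/30 class balance and an arbitrary `random_state`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with 70 negative (0) and 30 positive (4) labels.
toy = pd.DataFrame({
    "tweets": [f"tweet {i}" for i in range(100)],
    "polarity": [0] * 70 + [4] * 30,
})

x_tr, x_te, y_tr, y_te = train_test_split(
    toy["tweets"], toy["polarity"],
    test_size=0.2, stratify=toy["polarity"], random_state=0,
)

# Share of each class in the train and test labels.
train_share = y_tr.value_counts(normalize=True).sort_index()
test_share = y_te.value_counts(normalize=True).sort_index()
print(train_share)
print(test_share)
```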

In [21]:
# Cast the labels to int32 before one-hot encoding.
y_train = pd.Series(y_train, dtype='int32')
y_test = pd.Series(y_test, dtype='int32')
y_train, y_test

(31707      0
 1495769    4
 764428     0
 38009      0
 662163     0
           ..
 1588356    4
 1287433    4
 991887     4
 677162     0
 1511858    4
 Name: polarity, Length: 8000, dtype: int32, 1001146    4
 326847     0
 1458671    4
 1517974    4
 237866     0
           ..
 584021     0
 404208     0
 1005907    4
 235019     0
 396296     0
 Name: polarity, Length: 2000, dtype: int32)

**One-Hot Encoding (OHE)**

In [22]:
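# Polarity values are 0, 2 and 4, but tf.one_hot expects class indices in [0, depth).
# Integer-divide by 2 to map them to 0, 1 and 2; otherwise label 4 encodes as all zeros.
y_train = tf.one_hot(y_train // 2, depth=3)
y_test = tf.one_hot(y_test // 2, depth=3)

In [24]:
y_train

<tf.Tensor: shape=(8000, 3), dtype=float32, numpy=
array([[1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       ...,
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.]], dtype=float32)>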
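
`tf.one_hot` treats each label as a class index and returns an all-zero row for any index outside `[0, depth)`, which is what happens to the raw polarity value 4 when `depth=3`. A NumPy sketch of mapping the polarity values to valid indices first, using a few made-up labels:

```python
import numpy as np

# Made-up polarity labels covering all three documented values.
polarity = np.array([0, 4, 0, 2, 4])

# Map polarity values 0, 2, 4 to class indices 0, 1, 2.
class_index = polarity // 2

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(3, dtype=np.float32)[class_index]
print(one_hot)
```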

**Preprocessing using Tensorflow Hub**

In [25]:
%pip install -U -qq tensorflow_text

In [26]:
import tensorflow_hub as hub
import tensorflow_text as text  # registers the ops the preprocessing model needs
tf_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")

In [27]:
def preprocess_text(sentences):
  return tf_preprocess(sentences)

preprocess_text(["This is an example of preprocessing"])  

{'input_mask': <tf.Tensor: shape=(1, 128), dtype=int32, numpy=
 array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
       dtype=int32)>,
 'input_type_ids': <tf.Tensor: shape=(1, 128), dtype=int32, numpy=
 array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 
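
In the output above, `input_mask` starts with eleven 1s followed by 0s: 1 marks a real token position and 0 marks padding up to the fixed length of 128. A NumPy sketch of the same idea, with the token count of 11 taken from that mask:

```python
import numpy as np

SEQ_LENGTH = 128   # padded length seen in the output above
num_tokens = 11    # number of 1s in the mask above, special tokens included

# 1 marks a real token position, 0 marks padding.
input_mask = np.zeros((1, SEQ_LENGTH), dtype=np.int32)
input_mask[0, :num_tokens] = 1

# Summing the mask along the sequence axis recovers the unpadded length.
lengths = input_mask.sum(axis=1)
print(lengths)
```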

**Encoding using Tensorflow Hub**

In [28]:
tf_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")

In [29]:
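def encode_inputs(preprocessed):
  # `preprocessed` is the dict returned by preprocess_text; the parameter is
  # named differently so it does not shadow that function.
  return tf_encoder(preprocessed)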

prep_text = preprocess_text(["This is an example of preprocessing"])  
encode_inputs(prep_text)["pooled_output"]

<tf.Tensor: shape=(1, 768), dtype=float32, numpy=
array([[-9.14899647e-01, -3.51275384e-01, -3.77727330e-01,
         6.53948069e-01,  5.14929831e-01, -2.83837289e-01,
         8.55841339e-01,  1.37585402e-01, -1.67199478e-01,
        -9.99985635e-01, -2.43355542e-01,  7.03699648e-01,
         9.87010181e-01, -3.57761830e-01,  9.07458663e-01,
        -6.02542102e-01, -6.13499023e-02, -4.69316512e-01,
         3.09828222e-01, -4.74200100e-01,  5.66560566e-01,
         9.99737382e-01,  5.50420046e-01,  1.48632824e-01,
         2.93268114e-01,  8.26646328e-01, -6.14027441e-01,
         9.41486061e-01,  9.61444438e-01,  8.02598476e-01,
        -6.44454241e-01,  1.90562084e-01, -9.92072225e-01,
        -1.81639388e-01, -6.21678174e-01, -9.89679754e-01,
         3.47001135e-01, -7.99247146e-01,  4.16931957e-02,
        -1.04804583e-01, -9.31580842e-01,  3.57742071e-01,
         9.99939561e-01, -5.04451096e-01,  3.62645864e-01,
        -2.56940871e-01, -9.99998868e-01,  2.00827181e-01,
      

**Build Model**

In [30]:
# Input and Preprocess
# Named text_input to avoid shadowing Python's built-in input().
text_input = keras.layers.Input(shape=(), dtype=tf.string, name="input")
preprocess = preprocess_text(text_input)
encoder = encode_inputs(preprocess)

# Classification head: dropout, then a 3-way softmax over the pooled BERT output
nnl = tf.keras.layers.Dropout(0.1, name="dropout")(encoder["pooled_output"])
nnl = tf.keras.layers.Dense(3, activation="softmax", name="output")(nnl)

# Construct Final Model
model = tf.keras.Model(inputs=[text_input], outputs=[nnl])
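
The head on top of BERT is a single dense layer with softmax applied to the 768-dimensional pooled output. A NumPy sketch of that computation, assuming random stand-ins for the pooled vector and for the untrained weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins: one pooled vector and an untrained 768 x 3 weight matrix.
pooled = rng.standard_normal((1, 768)).astype(np.float32)
weights = (rng.standard_normal((768, 3)) * 0.01).astype(np.float32)
bias = np.zeros(3, dtype=np.float32)

# Dense layer: one logit per class.
logits = pooled @ weights + bias

# Softmax: subtract the max for numerical stability, exponentiate, normalize.
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
print(probs.shape, probs.sum(axis=1))
```

Each row of `probs` is a probability distribution over the three classes.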

In [32]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input (InputLayer)             [(None,)]            0           []                               
                                                                                                  
 keras_layer (KerasLayer)       {'input_mask': (Non  0           ['input[0][0]']                  
                                e, 128),                                                          
                                 'input_word_ids':                                                
                                (None, 128),                                                      
                                 'input_type_ids':                                                
                                (None, 128)}                                                  

In [31]:
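# Compile Model
# The output layer is a 3-way softmax with one-hot targets, so use the categorical
# loss and accuracy rather than their binary counterparts.
METRICS = [
    tf.keras.metrics.CategoricalAccuracy(name="accuracy"),
    tf.keras.metrics.Precision(name="precision"),
    tf.keras.metrics.Recall(name="recall"),
]

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=METRICS)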
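
For a softmax output with one-hot targets, categorical cross-entropy reduces to the negative log of the probability assigned to the true class. A NumPy sketch with made-up predictions:

```python
import numpy as np

y_true = np.array([[0.0, 0.0, 1.0]])   # one-hot target
y_pred = np.array([[0.1, 0.2, 0.7]])   # hypothetical softmax output

# Clip to avoid log(0), then average -sum(y_true * log(y_pred)) over the batch.
eps = 1e-7
loss = -np.sum(y_true * np.log(np.clip(y_pred, eps, 1.0)), axis=1).mean()
print(loss)
```

Only the true class contributes to the sum, so the loss falls as the model puts more probability on it.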

In [None]:
model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5