This notebook is meant to run in Colab, with the tweets CSV (read below as twitter_training.csv) uploaded to the content folder. Reference: https://www.tensorflow.org/tutorials/keras/text_classification_with_hub

Installing TensorFlow Hub and importing the required packages

In [1]:
pip install tensorflow-hub



In [2]:
import keras
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub

from tensorflow.keras import layers, losses, Sequential, optimizers, metrics

In [13]:
hub.__version__

'0.16.0'

Loading the data

In [5]:
columns = ["id", "country", "Label", "Text"]

tweets_data = pd.read_csv("twitter_training.csv", names = columns)

tweets_data.sample(5)

Unnamed: 0,id,country,Label,Text
66248,6944,johnson&johnson,Neutral,The Missouri Court of Appeals on Tuesday order...
12538,8554,NBA2K,Positive,. and wow
21187,4027,CS-GO,Positive,Probably the best time to say goodbye and play...
73594,9006,Nvidia,Neutral,2009 Those of you that play CS:GO and having s...
32025,7496,LeagueOfLegends,Positive,WIP of my favorite skin brushes from


Dropping irrelevant columns, NAs and duplicates

In [6]:
tweets_data = tweets_data.drop(columns = ["id", "country"])

tweets_data.dropna(inplace = True, axis = 0 )

tweets_data = tweets_data.drop_duplicates()

tweets_data.shape

(69769, 2)

Converting the labels to numeric form

In [7]:
tweets_data["Label"] = tweets_data["Label"].replace({"Negative": 0, "Neutral": 1, "Positive": 2, "Irrelevant": 3})

tweets_data.sample(5)

Unnamed: 0,Label,Text
38723,0,.@PlayHearthstone jumped right into the game t...
2109,2,I FINALLY finished Borderlands 3! It's taken a...
62426,3,The people online are crazy.
55255,1,"The FiNN Damascus is now moved to the Sks, whe..."
54181,1,I won 5 achievements in Call of Duty: Modern W...
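
Since the labels are now integers, it helps to keep the mapping in a named dictionary so that predicted class ids can be decoded later. A minimal sketch, assuming the same four class names as the cell above; `LABEL_TO_ID`, `ID_TO_LABEL`, and the sample labels are illustrative names that do not appear in the original notebook:

```python
# Hypothetical names; the class-to-integer mapping mirrors the cell above.
LABEL_TO_ID = {"Negative": 0, "Neutral": 1, "Positive": 2, "Irrelevant": 3}

# Inverse mapping, used to turn predicted class ids back into readable labels.
ID_TO_LABEL = {idx: name for name, idx in LABEL_TO_ID.items()}

sample_labels = ["Positive", "Negative", "Irrelevant"]
encoded = [LABEL_TO_ID[name] for name in sample_labels]
decoded = [ID_TO_LABEL[idx] for idx in encoded]

print(encoded)
print(decoded)
```

The same dictionary could be passed to `replace` (or `map`) on the Label column, which keeps the encoding and decoding in one place.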


The data is split into training, validation, and test sets

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test = train_test_split(
    tweets_data, test_size = 0.2, stratify = tweets_data["Label"], random_state = 123)
X_train, X_val = train_test_split(
    X_train, test_size = 0.1, stratify = X_train["Label"], random_state = 123)

X_train.shape, X_val.shape, X_test.shape

((50233, 2), (5582, 2), (13954, 2))
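
The two-stage split can be checked by hand: 20% is held out for testing, then 10% of the remainder for validation, which works out to roughly 72% / 8% / 20% of the cleaned data. The sketch below reproduces the row counts, assuming (as I understand scikit-learn's behaviour for a float test_size) that the held-out count is rounded up and the rest goes to the other split:

```python
import math

n_total = 69769  # rows left after cleaning, from the shape printed earlier

# First split: 20% held out for testing, rounded up.
n_test = math.ceil(0.2 * n_total)
n_train_val = n_total - n_test

# Second split: 10% of the remaining rows held out for validation.
n_val = math.ceil(0.1 * n_train_val)
n_train = n_train_val - n_val

print(n_train, n_val, n_test)

# Effective share of the full dataset in each split.
shares = [round(n / n_total, 3) for n in (n_train, n_val, n_test)]
print(shares)
```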

Creating training, validation, and test datasets from the corresponding pandas DataFrames

In [9]:
BATCH_SIZE = 128

raw_train_ds = tf.data.Dataset.from_tensor_slices(
    (X_train["Text"].values, X_train["Label"].values)).shuffle(10000).batch(batch_size = BATCH_SIZE)

raw_val_ds = tf.data.Dataset.from_tensor_slices(
    (X_val["Text"].values, X_val["Label"].values)).batch(batch_size = BATCH_SIZE)

raw_test_ds = tf.data.Dataset.from_tensor_slices(
    (X_test["Text"].values, X_test["Label"].values)).batch(batch_size = BATCH_SIZE)
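
As a quick sanity check on the batching, the number of batches per epoch follows from the split sizes and BATCH_SIZE. This is a small arithmetic sketch rather than part of the pipeline; the split sizes are copied from the shapes printed above, and `batch_plan` is an illustrative name:

```python
import math

BATCH_SIZE = 128
split_sizes = {"train": 50233, "val": 5582, "test": 13954}

batch_plan = {}
for name, n_rows in split_sizes.items():
    # batch() keeps the final partial batch unless drop_remainder is set.
    n_batches = math.ceil(n_rows / BATCH_SIZE)
    last_batch = n_rows - (n_batches - 1) * BATCH_SIZE
    batch_plan[name] = (n_batches, last_batch)
    print(f"{name}: {n_batches} batches, last batch has {last_batch} rows")
```

Note that the shuffle buffer of 10000 is smaller than the training set, so the shuffling is approximate rather than a full permutation of the data.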

In [10]:
train_examples_batch, train_labels_batch = next(iter(raw_train_ds))

train_examples_batch[:5]

<tf.Tensor: shape=(5,), dtype=string, numpy=
array([b'@CallofDuty @Blizzard_Ent seriously. I want to fix this issue. please let me know who to contact.',
       b'Clip that sentence.',
       b'In simulating a phishing attack, more than 10% of students clicked on a "nasty" link. Young students are often easy targets for phishing attacks, putting entire universities at risk.',
       b'Enjoy the dystopian corporatocratic future with the refreshing taste our Rockstar\xe2\x84\xa2 Energy!',
       b'... I just... earned the [ Mythic : What Dragons of Nightmare ] Achievement!'],
      dtype=object)>

In [11]:
train_labels_batch[:5]

<tf.Tensor: shape=(5,), dtype=int64, numpy=array([0, 0, 2, 2, 1])>

Model Building
The neural network is created by stacking layers, which requires three main architectural decisions:

- How to represent the text?
- How many layers to use in the model?
- How many hidden units to use for each layer?

In this example, the input data consists of sentences. The labels to predict are 0, 1, 2, or 3.

One way to represent the text is to convert sentences into embedding vectors. Using a pre-trained text embedding as the first layer has three advantages:

- You don't have to worry about text preprocessing.
- You benefit from transfer learning.
- The embedding has a fixed size, so it's simpler to process.

For this example you use a pre-trained text embedding model from TensorFlow Hub called google/nnlm-en-dim50/2.

There are many other pre-trained text embeddings from TFHub that can be used in this tutorial:

- google/nnlm-en-dim128/2 - trained with the same NNLM architecture on the same data as google/nnlm-en-dim50/2, but with a larger embedding dimension. Larger embeddings can improve results on your task, but the model may take longer to train.
- google/nnlm-en-dim128-with-normalization/2 - the same as google/nnlm-en-dim128/2, but with additional text normalization such as removing punctuation. This can help if the text in your task contains additional characters or punctuation.
- google/universal-sentence-encoder/4 - a much larger model yielding 512 dimensional embeddings trained with a deep averaging network (DAN) encoder.


Creating a Keras layer that uses a TensorFlow Hub model to embed the sentences. Note that no matter the length of the input text, the output shape of the embeddings is (num_examples, embedding_dimension).

The best test accuracy, around 0.87, is obtained with google/nnlm-en-dim128-with-normalization/2. With the other two embeddings, google/nnlm-en-dim50/2 and google/nnlm-en-dim128/2, the test accuracy is around 0.82.
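
To see why the embedding output shape does not depend on sentence length, here is a toy NumPy sketch that pools per-token vectors into a single sentence vector. The tiny vocabulary, the random table, and the use of mean pooling are all assumptions made for illustration; the actual hub module has its own vocabulary and its own way of combining token vectors.

```python
import numpy as np

EMBEDDING_DIM = 128
rng = np.random.default_rng(0)

# Toy vocabulary and random embedding table (illustrative only).
vocab = {"<unk>": 0, "i": 1, "love": 2, "this": 3, "game": 4, "wow": 5}
table = rng.normal(size=(len(vocab), EMBEDDING_DIM))

def embed(sentence: str) -> np.ndarray:
    """Average the token vectors so any sentence maps to one fixed-size vector."""
    tokens = sentence.lower().split()
    # Unknown words and empty sentences fall back to the <unk> row.
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens] or [vocab["<unk>"]]
    return table[ids].mean(axis=0)

sentences = ["wow", "I love this game", "I love this game and every other game too"]
batch = np.stack([embed(s) for s in sentences])
print(batch.shape)  # (3, 128)
```

Because each sentence collapses to one vector, the Dense layers that follow always see a (batch_size, 128) input regardless of tweet length.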

In [25]:
# embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"

# embedding = "https://tfhub.dev/google/nnlm-en-dim128/2"
embedding = "https://tfhub.dev/google/nnlm-en-dim128-with-normalization/2"

hub_layer = hub.KerasLayer(
    embedding, input_shape = [],
    dtype = tf.string, trainable = True
)

hub_layer(train_examples_batch[:3]).shape

TensorShape([3, 128])

Building the full model

In [26]:
model = Sequential()

model.add(hub_layer)
model.add(layers.Dense(32, activation = "relu"))
model.add(layers.Dense(4))

model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 keras_layer_2 (KerasLayer)  (None, 128)               124642688 
                                                                 
 dense_4 (Dense)             (None, 32)                4128      
                                                                 
 dense_5 (Dense)             (None, 4)                 132       
                                                                 
Total params: 124646948 (475.49 MB)
Trainable params: 124646948 (475.49 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
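
The parameter counts in the summary can be reproduced by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The sketch below redoes the arithmetic; the embedding count is copied from the summary, and reading it as a single (rows x 128) table is an assumption about the hub module's internals.

```python
embedding_params = 124_642_688  # KerasLayer row of the summary above
embedding_dim = 128

def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden = dense_params(embedding_dim, 32)
output = dense_params(32, 4)
total = embedding_params + hidden + output

print(hidden, output, total)
print(f"{total * 4 / 1024**2:.2f} MB as float32")

# Rows in the table, if the hub layer is one (rows x 128) matrix (an assumption).
print(embedding_params // embedding_dim)
```

Almost all of the parameters sit in the embedding, which is why setting trainable = True makes the whole 475 MB trainable rather than just the two small Dense layers.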


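The model is compiled with the Adam optimizer and the sparse categorical cross-entropy loss. Because the last Dense layer has no activation, the model outputs logits, hence from_logits = True.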

In [27]:
model.compile(
    optimizer = keras.optimizers.Adam(learning_rate = 0.001),
    loss = losses.SparseCategoricalCrossentropy(from_logits = True),
    metrics = ['accuracy']
)
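
For intuition about what this loss computes, here is a NumPy sketch of sparse categorical cross-entropy on logits: apply a log-softmax to each row, pick the entry for the true class, and average over the batch. It is a hand-written illustration, not the Keras implementation; the function name and the example logits are made up.

```python
import numpy as np

def sparse_categorical_crossentropy_from_logits(logits, labels):
    """Mean cross-entropy for integer labels and unnormalized scores."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    # Log-softmax with the row maximum subtracted for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the true class in each row.
    return -log_probs[np.arange(len(labels)), labels].mean()

logits = np.array([[2.0, 0.5, 0.1, -1.0],   # favours class 0
                   [0.0, 0.0, 0.0, 0.0]])   # uniform over the 4 classes
labels = np.array([0, 3])

loss = sparse_categorical_crossentropy_from_logits(logits, labels)
print(loss)
```

A row of identical logits gives a loss of ln(4), about 1.386, which is a useful baseline: a four-class model doing no better than chance sits around that value.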

We are using a list of callbacks here:
EarlyStopping: interrupts training when the monitored metric is no longer improving. As configured below it does not restore or save the best weights; that would need restore_best_weights = True or a separate ModelCheckpoint callback.

ReduceLROnPlateau: dynamically lowers the optimizer's learning rate during training when the monitored metric plateaus.

Callbacks are passed to the model during training via the callbacks argument of the fit() method, which takes a list of callbacks. Any number of callbacks can be passed to it.

The monitor argument of the EarlyStopping callback selects the metric to watch, here the model's validation accuracy. The patience argument is the number of epochs without improvement in that metric after which training is interrupted (in this case 3).

The ReduceLROnPlateau callback reduces the learning rate when the validation loss (val_loss, its default monitor) has stopped improving, here after 1 epoch without improvement. This is often an effective way to keep making progress once training stalls, for example near a local minimum. The factor argument is a float by which the learning rate is multiplied when the callback triggers, so factor = 0.1 cuts the learning rate to a tenth.
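
The interaction between the two patience settings can be sketched in plain Python. This is a simplified replay of the two rules, not the Keras source: it ignores min_delta, cooldown, and min_lr, and the validation metrics are made-up numbers used only to trigger both callbacks.

```python
def simulate(val_acc, val_loss, lr=1e-3, stop_patience=3, lr_patience=1, factor=0.1):
    """Replay per-epoch validation metrics through simplified versions of both rules."""
    best_acc, stop_wait = float("-inf"), 0
    best_loss, lr_wait = float("inf"), 0
    for epoch, (acc, loss) in enumerate(zip(val_acc, val_loss), start=1):
        # ReduceLROnPlateau-style rule: multiply lr by factor after
        # lr_patience epochs without a lower validation loss.
        if loss < best_loss:
            best_loss, lr_wait = loss, 0
        else:
            lr_wait += 1
            if lr_wait >= lr_patience:
                lr, lr_wait = lr * factor, 0
        # EarlyStopping-style rule: stop after stop_patience epochs
        # without a higher validation accuracy.
        if acc > best_acc:
            best_acc, stop_wait = acc, 0
        else:
            stop_wait += 1
            if stop_wait >= stop_patience:
                return epoch, lr
    return len(val_acc), lr

# Made-up validation curves: both metrics peak at epoch 3.
val_acc = [0.70, 0.80, 0.85, 0.84, 0.85, 0.83, 0.82]
val_loss = [0.80, 0.60, 0.50, 0.52, 0.51, 0.55, 0.56]

stopped_epoch, final_lr = simulate(val_acc, val_loss)
print(stopped_epoch, final_lr)
```

With these numbers the learning rate is cut after each of epochs 4, 5, and 6, and training stops at epoch 6, three epochs after the best validation accuracy. Note that the two callbacks watch different metrics, so they do not necessarily agree on when progress has stalled.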

In [28]:
callback_list = [
    keras.callbacks.EarlyStopping(
        patience = 3,
        monitor = "val_accuracy"
    ),

    keras.callbacks.ReduceLROnPlateau(
        patience = 1,
        factor = 0.1,
    )
]

The model is trained for up to 20 epochs; in the run below, early stopping ends training after epoch 9.

In [29]:
EPOCHS = 20

history = model.fit(
    raw_train_ds,
    validation_data = raw_val_ds,
    epochs = EPOCHS,
    callbacks = callback_list
)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20


Evaluating the model on the test set to obtain the test loss and accuracy

In [30]:
loss, accuracy = model.evaluate(raw_test_ds)

print("Loss: ", loss)
print("Accuracy: ", accuracy)

Loss:  0.39083293080329895
Accuracy:  0.8806793689727783
