
Reference notebook:
https://keras.io/examples/nlp/fnet_classification_with_keras_nlp/

In [1]:
!pip install tensorflow-text

Collecting tensorflow-text
  Downloading tensorflow_text-2.15.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.2 MB)
Installing collected packages: tensorflow-text
Successfully installed tensorflow-text-2.15.0


In [2]:
!pip install keras-nlp

Collecting keras-nlp
  Downloading keras_nlp-0.7.0-py3-none-any.whl (415 kB)
Collecting keras-core (from keras-nlp)
  Downloading keras_core-0.1.7-py3-none-any.whl (950 kB)
Collecting namex (from keras-core->keras-nlp)
  Downloading namex-0.0.7-py3-none-any.whl (5.8 kB)
Installing collected packages: namex, keras-core, keras-nlp
Successfully installed keras-core-0.1.7 keras-nlp-0.7.0 namex-0.0.7


In [3]:
import keras
import keras_nlp

import pandas as pd
import tensorflow as tf

from tensorflow.keras import layers, losses, optimizers

Using TensorFlow backend


Loading the data

In [4]:
columns = ["id", "country", "Label", "Text"]

tweets_data = pd.read_csv("twitter_training.csv", names = columns)

tweets_data.sample(5)

Unnamed: 0,id,country,Label,Text
39936,1256,Battlefield,Neutral,Got a back blast kill .
45567,11822,Verizon,Irrelevant,I have AT&T but my service ended working every...
45887,11876,Verizon,Negative,honestly can’t wait for the description on thi...
4891,41,Amazon,Neutral,ve played this interesting quiz on Amazon - Tr...
39561,5590,Hearthstone,Negative,Me:oh I've created 2 fantastic deck that syner...


Dropping irrelevant columns, rows with NAs, and duplicates

In [5]:
tweets_data = tweets_data.drop(columns = ["id", "country"])

tweets_data.dropna(inplace = True, axis = 0)

tweets_data = tweets_data.drop_duplicates()

tweets_data.shape

(69769, 2)

Converting text labels to numeric form

In [6]:
tweets_data["Label"] = tweets_data["Label"].replace({"Negative": 0, "Neutral": 1, "Positive": 2, "Irrelevant": 3})

tweets_data.sample(5)

Unnamed: 0,Label,Text
11890,0,News update And the token market for fuck sake...
23892,0,"please copy, rt & spread! . . Hi @Google . We ..."
43403,0,I hate that one
74641,2,"Today I searched for new GPU drivers, went to ..."
34855,2,These @ fortnitegame integrations with @ deadp...


In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test = train_test_split(
    tweets_data, test_size = 0.2, stratify = tweets_data["Label"], random_state = 123)
X_train, X_val = train_test_split(
    X_train, test_size = 0.1, stratify = X_train["Label"], random_state = 123)

X_train.shape, X_val.shape, X_test.shape

((50233, 2), (5582, 2), (13954, 2))
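The three shapes above can be checked by hand. The sketch below assumes that scikit-learn rounds a fractional test_size up to the next whole row and gives the remaining rows to the other side of the split; the helper name split_sizes is made up for illustration.

```python
import math


def split_sizes(n_rows, test_fraction):
    """Return (n_kept, n_held_out), assuming the held-out share is rounded up."""
    n_held_out = math.ceil(n_rows * test_fraction)
    return n_rows - n_held_out, n_held_out


n_total = 69769  # rows left after dropping NAs and duplicates

# First split: 20% of all rows go to the test set.
n_train_val, n_test = split_sizes(n_total, 0.2)

# Second split: 10% of the remaining rows go to the validation set.
n_train, n_val = split_sizes(n_train_val, 0.1)

print(n_train, n_val, n_test)
```

So the test set holds 20% of all rows, while the validation set holds 10% of the remaining 80%, i.e. roughly 8% of the full data.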

Creating training, validation, and test datasets from the corresponding pandas DataFrames

In [8]:
BATCH_SIZE = 128

train_ds = tf.data.Dataset.from_tensor_slices(
    (X_train["Text"].values, X_train["Label"].values)).shuffle(10000).batch(batch_size = BATCH_SIZE)

val_ds = tf.data.Dataset.from_tensor_slices(
    (X_val["Text"].values, X_val["Label"].values)).batch(batch_size = BATCH_SIZE)

test_ds = tf.data.Dataset.from_tensor_slices(
    (X_test["Text"].values, X_test["Label"].values)).batch(batch_size = BATCH_SIZE)

In [9]:
train_ds = train_ds.map(lambda x, y: (tf.strings.lower(x), y))

val_ds = val_ds.map(lambda x, y: (tf.strings.lower(x), y))

test_ds = test_ds.map(lambda x, y: (tf.strings.lower(x), y))

In [10]:
for text_batch, label_batch in train_ds.take(1):
    for i in range(3):
        print(text_batch.numpy()[i])
        print(label_batch.numpy()[i])

b"i'm screaming!"
0
b'photo: pic.wikimedia.org / aldzjwades'
0
b'when i search for a game and a map comes up that i don\xe2\x80\x99t want to play and will hate we should be black listed from my que. sbmm is dumb and we hate it @callofduty'
0


#### Tokenizing the data

WordPiece tokenization

Purpose: WordPiece splits text into a set of common subword units, or tokens. This keeps the vocabulary at a manageable size and lets the model handle rare or out-of-vocabulary (OOV) words by composing them from known pieces.

How It Works: WordPiece starts with a base vocabulary of individual characters and then incrementally learns a larger vocabulary by combining these characters into frequently occurring substrings or subwords. The algorithm iteratively adds the best subword (the one that minimizes the language model's loss function) to the vocabulary until it reaches a specified vocabulary size.

Subword Tokenization: The resulting vocabulary consists of full words, subwords, and characters. Full words are common words that appear frequently in the training corpus. Subwords are parts of words that are less common but still occur frequently enough to be included. Characters are included to ensure any word can be tokenized (e.g., rare words are broken down into individual characters).
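To make the idea concrete, here is a minimal sketch of how a trained WordPiece vocabulary is typically applied to a single word: greedy longest-match-first, with word-internal pieces marked by a "##" prefix. The toy vocabulary and the function name wordpiece_tokenize are made up for illustration, and this is not the KerasNLP implementation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", suffix_prefix="##"):
    """Greedy longest-match-first split of one word into subword tokens."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = suffix_prefix + piece  # mark a word-internal piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # nothing fits, so the whole word is unknown
        tokens.append(match)
        start = end
    return tokens


# Hypothetical toy vocabulary: full words, subwords, and single characters.
toy_vocab = {"[UNK]", "play", "##ing", "##ed", "un", "##play", "##able", "s", "##b", "##m"}

print(wordpiece_tokenize("playing", toy_vocab))
print(wordpiece_tokenize("unplayable", toy_vocab))
print(wordpiece_tokenize("sbmm", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

A common word is split into a few large pieces, a rare word such as "sbmm" falls back to single characters, and a word with characters missing from the vocabulary becomes "[UNK]".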


We'll be using the keras_nlp.tokenizers.WordPieceTokenizer layer to tokenize the text. keras_nlp.tokenizers.WordPieceTokenizer takes a WordPiece vocabulary and has functions for tokenizing the text, and detokenizing sequences of tokens.

Before we define the tokenizer, we first need to train it on the dataset we have. The WordPiece tokenization algorithm is a subword tokenization algorithm; training it on a corpus gives us a vocabulary of subwords. A subword tokenizer is a compromise between word tokenizers (word tokenizers need very large vocabularies for good coverage of input words), and character tokenizers (characters don't really encode meaning like words do). Luckily, KerasNLP makes it very simple to train WordPiece on a corpus with the keras_nlp.tokenizers.compute_word_piece_vocabulary utility.

Note: The official implementation of FNet uses the SentencePiece Tokenizer.

In [11]:
def train_word_piece(ds, vocab_size, reserved_tokens):

    word_piece_ds = ds.unbatch().map(lambda x, y: x)

    vocab = keras_nlp.tokenizers.compute_word_piece_vocabulary(
        word_piece_ds.batch(1000).prefetch(2),
        vocabulary_size = vocab_size,
        reserved_tokens = reserved_tokens,
    )

    return vocab

Every vocabulary has a few special, reserved tokens. We have two such tokens:

- "[PAD]" - Padding token. Padding tokens are appended to the input sequence when it is shorter than the maximum sequence length.
- "[UNK]" - Unknown token. It stands in for any piece of text that cannot be matched against the vocabulary.

In [12]:
vocab_size = 10000

reserved_tokens = ["[PAD]", "[UNK]"]

# train_sentences = [element[0] for element in train_ds]
vocab = train_word_piece(train_ds, vocab_size, reserved_tokens)

Checking the length of the vocabulary and viewing the whole vocabulary (the result can be smaller than the requested vocab_size)

In [13]:
len(vocab)

9358

In [14]:
vocab

['[PAD]',
 '[UNK]',
 '!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z',
 '{',
 '|',
 '}',
 '~',
 '\xa0',
 '¡',
 '¢',
 '£',
 '§',
 '¨',
 '©',
 '«',
 '®',
 '¯',
 '°',
 '±',
 '²',
 '³',
 '´',
 '¶',
 '·',
 '¹',
 'º',
 '»',
 '½',
 '¿',
 '×',
 'á',
 'ç',
 'é',
 'í',
 'ï',
 'ó',
 'ú',
 'ğ',
 'ə',
 'ʊ',
 'ʌ',
 'ˈ',
 'θ',
 'υ',
 'ω',
 'А',
 'Б',
 'В',
 'Г',
 'Д',
 'Е',
 'З',
 'И',
 'К',
 'Л',
 'М',
 'Н',
 'О',
 'П',
 'Р',
 'С',
 'Т',
 'У',
 'Ф',
 'Ц',
 'Ь',
 'Э',
 'а',
 'б',
 'в',
 'г',
 'д',
 'е',
 'ж',
 'з',
 'и',
 'й',
 'к',
 'л',
 'м',
 'н',
 'о',
 'п',
 'р',
 'с',
 'т',
 'у',
 'ф',
 'х',
 'ц',
 'ч',
 'ш',
 'ы',
 'ь',
 'э',
 'ю',
 'я',
 'ا',
 'ب',
 'ت',
 'ح',
 'خ'

Now, let's define the tokenizer. We will configure the tokenizer with the vocabulary trained above. We will also define a maximum sequence length: sequences shorter than this length are padded up to it, and longer sequences are truncated, so every sequence ends up the same length.

In [15]:
max_sequence_length = 64

tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary = vocab,
    lowercase = False,
    sequence_length = max_sequence_length,
)
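The effect of sequence_length can be pictured with a small sketch in plain Python. It assumes that padding uses id 0, which matches "[PAD]" being the first entry of the vocabulary above; the helper name pad_or_truncate is made up for illustration and is not part of KerasNLP.

```python
def pad_or_truncate(token_ids, sequence_length, pad_id=0):
    """Cut a list of ids to sequence_length, or right-pad it with pad_id."""
    clipped = token_ids[:sequence_length]
    return clipped + [pad_id] * (sequence_length - len(clipped))


short = [318, 364, 3498]      # fewer ids than the target length
long = list(range(100, 110))  # more ids than the target length

print(pad_or_truncate(short, 6))
print(pad_or_truncate(long, 6))
```

With sequence_length = 64, every tokenized tweet therefore becomes a vector of exactly 64 ids, which is what lets the tweets be stacked into fixed-shape batches.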

Let's try and tokenize a sample from our dataset! To verify whether the text has been tokenized correctly, we can also detokenize the list of tokens back to the original text.



In [18]:
input_sentence_ex = train_ds.take(1).get_single_element()[0][1]
input_tokens_ex = tokenizer(input_sentence_ex)

print("Sentence: ", input_sentence_ex)
print("Tokens: ", input_tokens_ex)
print("Recovered text after detokenizing: ", tokenizer.detokenize(input_tokens_ex))

Sentence:  tf.Tensor(b'in an interview, best later said, when \xe2\x80\x9c we will document & investigate every reported hate crime. even perfectly racist name - changed calling reports should be reported to all police. we take this information very seriously. \xe2\x80\x9d when lin asked the officer what those police were instructed to do, he was immediately told \xe2\x80\x9c while there \xe2\x80\x99 s no language protocol "', shape=(), dtype=string)
Tokens:  tf.Tensor(
[ 318  364 3498   13  382  927  637   13  361  204  342  360 5797    7
 4679 2857 1679  467 1882  576 2019   15  393 2348 1123  850   14 1570
 1439 2451  525  330 1882  313  334 1288   15  342  564  320 1179  438
  869   15  205  361 1341  532 1798  312 3895  359  639 1288  537 8827
 4617 4988 1320  313  367   13  408  339], shape=(64,), dtype=int32)
Recovered text after detokenizing:  tf.Tensor(b'in an interview , best later said , when \xe2\x80\x9c we will document & investigate every reported hate crime . even perfec

Next, we'll format our datasets in the form that will be fed to the models. We need to tokenize the text.

In [19]:
def format_dataset(sentence, label):

    sentence = tokenizer(sentence)

    return ({"input_ids": sentence}, label)


def make_dataset(dataset):

    dataset = dataset.map(format_dataset, num_parallel_calls = tf.data.AUTOTUNE)

    # Cache before shuffling so each epoch gets a fresh order; prefetch last.
    return dataset.cache().shuffle(10000).prefetch(512)


train_ds = make_dataset(train_ds)
val_ds = make_dataset(val_ds)
test_ds = make_dataset(test_ds)

In [21]:
epochs = 5

embedding_dim = 128
intermediate_dim = 256

Now, let's move on to the exciting part - defining our model! We first need an embedding layer, i.e., a layer that maps every token in the input sequence to a vector. This embedding layer can be initialised randomly. We also need a positional embedding layer which encodes the word order in the sequence. The convention is to add, i.e., sum, these two embeddings. KerasNLP has a keras_nlp.layers.TokenAndPositionEmbedding layer which does all of the above steps for us.
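As a rough illustration of what such a layer computes, the sketch below builds two randomly initialised lookup tables in plain Python and sums the token vector and the position vector at every step. The sizes mirror the values used in this notebook; the helper names random_table and embed are made up for illustration, and this is not the KerasNLP implementation.

```python
import random

random.seed(0)

vocab_size, sequence_length, embedding_dim = 10000, 64, 128


def random_table(rows, cols):
    """A randomly initialised lookup table with one row per index."""
    return [[random.uniform(-0.05, 0.05) for _ in range(cols)] for _ in range(rows)]


token_table = random_table(vocab_size, embedding_dim)          # one vector per token id
position_table = random_table(sequence_length, embedding_dim)  # one vector per position


def embed(token_ids):
    """Sum the token vector and the position vector at every step."""
    return [
        [t + p for t, p in zip(token_table[token_id], position_table[position])]
        for position, token_id in enumerate(token_ids)
    ]


embedded = embed([318, 364, 3498, 13])
n_params = vocab_size * embedding_dim + sequence_length * embedding_dim

print(len(embedded), len(embedded[0]))
print(n_params)
```

The parameter count of the two tables, 10000 × 128 + 64 × 128 = 1288192, agrees with the figure reported for the embedding layer in the model summary further down.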

Our FNet classification model consists of three keras_nlp.layers.FNetEncoder layers with a keras.layers.Dense layer on top.

Note: For FNet, masking the padding tokens has a minimal effect on results. In the official implementation, the padding tokens are not masked.
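For context on what an FNet encoder block does with the embedded sequence: the FNet design replaces self-attention with a parameter-free Fourier mixing step that keeps the real part of a 2D discrete Fourier transform, taken along the hidden dimension and along the sequence dimension. The sketch below illustrates only that step on a tiny matrix in plain Python. The function names dft and fourier_mix are made up for illustration, and this is not the keras_nlp implementation, which also wraps the mixing in residual connections, layer normalisation, and a feed-forward block.

```python
import cmath


def dft(values):
    """Naive discrete Fourier transform of a 1D list (O(n^2))."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(values))
        for k in range(n)
    ]


def fourier_mix(x):
    """Real part of a 2D DFT over a (sequence, hidden) matrix."""
    # Transform each token vector along the hidden dimension.
    along_hidden = [dft(row) for row in x]
    # Transform each hidden column along the sequence dimension.
    columns = [dft(list(col)) for col in zip(*along_hidden)]
    # Transpose back to (sequence, hidden) and keep only the real part.
    return [[value.real for value in row] for row in zip(*columns)]


x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 tokens, hidden size 2
mixed = fourier_mix(x)

for row in mixed:
    print([round(v, 6) for v in row])
```

Every output entry depends on every input entry, so information is mixed across all tokens without any learned attention weights.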

In [22]:
input_ids = keras.Input(shape = (None,), dtype = "int64", name = "input_ids")

x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size = vocab_size,
    sequence_length = max_sequence_length,
    embedding_dim = embedding_dim,
    mask_zero = True,
)(input_ids)

x = keras_nlp.layers.FNetEncoder(intermediate_dim = intermediate_dim)(inputs = x)
x = keras_nlp.layers.FNetEncoder(intermediate_dim = intermediate_dim)(inputs = x)
x = keras_nlp.layers.FNetEncoder(intermediate_dim = intermediate_dim)(inputs = x)

x = keras.layers.GlobalAveragePooling1D()(x)
x = keras.layers.Dropout(0.1)(x)

outputs = keras.layers.Dense(4, activation = "softmax")(x)

fnet_classifier = keras.Model(input_ids, outputs, name = "fnet_classifier")

We'll use accuracy to monitor training progress on the validation data. Let's train our model for 5 epochs.

In [23]:
fnet_classifier.summary()

fnet_classifier.compile(
    optimizer = optimizers.Adam(learning_rate = 0.001),
    loss = losses.SparseCategoricalCrossentropy(from_logits = False),
    metrics = ["accuracy"],
)

fnet_classifier.fit(train_ds, epochs = epochs, validation_data = val_ds)

Model: "fnet_classifier"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_ids (InputLayer)      [(None, None)]            0         
                                                                 
 token_and_position_embeddi  (None, None, 128)         1288192   
 ng (TokenAndPositionEmbedd                                      
 ing)                                                            
                                                                 
 f_net_encoder (FNetEncoder  (None, None, 128)         66432     
 )                                                               
                                                                 
 f_net_encoder_1 (FNetEncod  (None, None, 128)         66432     
 er)                                                             
                                                                 
 f_net_encoder_2 (FNetEncod  (None, None, 128)     

<keras.src.callbacks.History at 0x79d87b50ead0>
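The loss used above, sparse categorical cross-entropy with from_logits = False, is the mean negative log-probability that the softmax output assigns to the true class. A minimal sketch in plain Python follows; the probabilities are hypothetical, and the small clipping that Keras applies for numerical stability is left out.

```python
import math


def sparse_categorical_crossentropy(probabilities, labels):
    """Mean negative log-probability assigned to the true class."""
    losses = [-math.log(row[label]) for row, label in zip(probabilities, labels)]
    return sum(losses) / len(losses)


# Hypothetical softmax outputs for 2 tweets over the 4 classes
# (0 = Negative, 1 = Neutral, 2 = Positive, 3 = Irrelevant).
probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.40, 0.10],
]
labels = [0, 2]  # integer class ids, not one-hot vectors

loss_value = sparse_categorical_crossentropy(probs, labels)
print(round(loss_value, 4))
```

The "sparse" variant takes integer class ids directly, which is why the labels were converted to 0-3 earlier instead of being one-hot encoded.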

Checking performance on the test data

In [24]:
loss, accuracy = fnet_classifier.evaluate(test_ds)

print("Loss: ", loss)
print("Accuracy: ", accuracy)

Loss:  0.6244869232177734
Accuracy:  0.8247097730636597
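To turn the model's softmax outputs back into readable labels, one option is to take the argmax of each row and invert the label mapping defined earlier. The sketch below uses hypothetical probability rows shaped like the output of fnet_classifier.predict(...); the helper names decode_predictions and accuracy are made up for illustration.

```python
LABEL_TO_ID = {"Negative": 0, "Neutral": 1, "Positive": 2, "Irrelevant": 3}
ID_TO_LABEL = {index: name for name, index in LABEL_TO_ID.items()}


def decode_predictions(probabilities):
    """Pick the most likely class per row and return its text label."""
    return [
        ID_TO_LABEL[max(range(len(row)), key=row.__getitem__)]
        for row in probabilities
    ]


def accuracy(predicted, expected):
    """Fraction of predictions that equal the expected label."""
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)


# Hypothetical probability rows, one per tweet, over the 4 classes.
probs = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.2, 0.5, 0.1],
    [0.1, 0.6, 0.2, 0.1],
]

predicted = decode_predictions(probs)
print(predicted)
print(accuracy(predicted, ["Negative", "Positive", "Irrelevant"]))
```

This is the same quantity that the "accuracy" metric reports: the share of tweets whose most probable class equals the true label.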
