<a href="https://colab.research.google.com/github/random-words/colab-notebooks/blob/main/08__introduction_to_NLP_in_tensorflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fundamentals

In [1]:
!nvidia-smi

/bin/bash: line 1: nvidia-smi: command not found


## Get Helper Functions

In [2]:
!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/refs/heads/main/extras/helper_functions.py

--2025-02-11 15:25:08--  https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/refs/heads/main/extras/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10246 (10K) [text/plain]
Saving to: ‘helper_functions.py’


2025-02-11 15:25:09 (72.8 MB/s) - ‘helper_functions.py’ saved [10246/10246]



In [3]:
from helper_functions import unzip_data, create_tensorboard_callback, plot_loss_curves, compare_historys

## Get text dataset

In [4]:
!wget https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip

--2025-02-11 15:25:24--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 142.250.152.207, 173.194.64.207, 108.177.121.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.250.152.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip’


2025-02-11 15:25:25 (46.5 MB/s) - ‘nlp_getting_started.zip’ saved [607343/607343]



In [5]:
unzip_data("nlp_getting_started.zip")

## Visualizing dataset

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

In [7]:
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

In [8]:
train_df.head()

index,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [9]:
train_df_shuffled = train_df.sample(frac=1, random_state=42)
train_df_shuffled.head()

index,id,keyword,location,text,target
2644,3796,destruction,,So you have a new weapon that can cause un-ima...,1
2227,3185,deluge,,The f$&amp;@ing things I do for #GISHWHES Just...,0
5448,7769,police,UK,DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe...,1
132,191,aftershock,,Aftershock back to school kick off was great. ...,0
6845,9810,trauma,"Montgomery County, MD",in response to trauma Children of Addicts deve...,0


In [10]:
test_df.head()

index,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [11]:
# Number of examples of each class
train_df["target"].value_counts()

target,count
0,4342
1,3271


In [12]:
# Or with attribute access
train_df.target.value_counts()

target,count
0,4342
1,3271


In [13]:
len(train_df), len(test_df)

(7613, 3263)

In [14]:
# Let's visualize some random training examples
import random
random_index = random.randint(0, len(train_df)-5) # create random indexes not higher than the total number of samples
for row in train_df_shuffled[["text", "target"]][random_index:random_index+5].itertuples():
  _, text, target = row
  print(f"Target: {target}", "(real disaster)" if target > 0 else "(not real disaster)")
  print(f"Text:\n{text}\n")
  print("---\n")

Target: 1 (real disaster)
Text:
@Cyberdemon531 i hope that mountain dew erodes your throat and floods your lungs leaving you to drown to death

---

Target: 1 (real disaster)
Text:
Cuban leader extends sympathy to Vietnam over flooding at http://t.co/QcyXwr2rdv

---

Target: 0 (not real disaster)
Text:
@alexbelloli I do It just seemed like the pages were out of order

---

Target: 1 (real disaster)
Text:
They've come back!! &gt;&gt; Flying ant day: Capital deluged by annual swarm of winged insects http://t.co/mNkoYZ76Cp

---

Target: 1 (real disaster)
Text:
FedEx no longer to transport bioterror germs in wake of anthrax lab mishaps http://t.co/pWAMG8oZj4

---



### Split data into training and validation sets

In [15]:
from sklearn.model_selection import train_test_split

In [16]:
train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df_shuffled["text"].to_numpy(),
                                                                            train_df_shuffled["target"].to_numpy(),
                                                                            test_size=0.1,
                                                                            random_state=42)

In [17]:
# Check split lengths
len(train_sentences), len(train_labels), len(val_sentences), len(val_labels)

(6851, 6851, 762, 762)

In [18]:
# check first 10 samples
train_sentences[:10], train_labels[:10]

(array(['@mogacola @zamtriossu i screamed after hitting tweet',
        'Imagine getting flattened by Kurt Zouma',
        '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....',
        "@shakjn @C7 @Magnums im shaking in fear he's gonna hack the planet",
        'Somehow find you and I collide http://t.co/Ee8RpOahPk',
        '@EvaHanderek @MarleyKnysh great times until the bus driver held us hostage in the mall parking lot lmfao',
        'destroy the free fandom honestly',
        'Weapons stolen from National Guard Armory in New Albany still missing #Gunsense http://t.co/lKNU8902JE',
        '@wfaaweather Pete when will the heat wave pass? Is it really going to be mid month? Frisco Boy Scouts have a canoe trip in Okla.',
        'Patient-reported outcomes in long-term survivors of metastatic colorectal cancer - British Journal of Surgery http://t.co/5Yl4DC1Tqt'],
       dtype=object),
 array([0,

## Converting text into numbers

There are a few ways to do this:
* Tokenization - a direct mapping between each token and a number:
i love tensorflow -> {0: i, 1: love, 2: tensorflow}
* Embedding - a matrix with one feature vector per token: i love tensorflow ->
[[0.125, 0.856, 0.091],
 [0.123, 0.643, 0.723],
 [0.188, 0.116, 0.901]]
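
The two ideas above can be sketched in plain Python. This is only an illustration, not how Keras implements either step: `build_vocab`, `tokenize` and `embed` are hypothetical helpers, and the embedding values are random placeholders (in a real model they are learned during training).

```python
import random

def build_vocab(sentences):
    """Map each unique lowercase word to an integer index (hypothetical helper)."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def tokenize(sentence, vocab):
    """Tokenization: replace each word with its integer index."""
    return [vocab[word] for word in sentence.lower().split()]

def embed(token_ids, embedding_matrix):
    """Embedding: look up one feature vector (a matrix row) per token."""
    return [embedding_matrix[token_id] for token_id in token_ids]

vocab = build_vocab(["I love TensorFlow"])
token_ids = tokenize("I love TensorFlow", vocab)

# One random 3-dimensional vector per vocabulary entry (placeholder values).
rng = random.Random(42)
embedding_matrix = [[rng.random() for _ in range(3)] for _ in vocab]
embedded = embed(token_ids, embedding_matrix)

print(vocab)                            # {'i': 0, 'love': 1, 'tensorflow': 2}
print(token_ids)                        # [0, 1, 2]
print(len(embedded), len(embedded[0]))  # 3 3
```

Tokenization alone gives every word an arbitrary integer, while the embedding gives every word a vector whose values can be adjusted during training.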

### Text vectorization (tokenization)

In [19]:
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

In [20]:
train_sentences[:3]

array(['@mogacola @zamtriossu i screamed after hitting tweet',
       'Imagine getting flattened by Kurt Zouma',
       '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....'],
      dtype=object)

In [21]:
# Use default TextVectorization parameters
text_vectorizer = TextVectorization(max_tokens=None, # max number of words in the vocabulary; None - no limit
                                    standardize="lower_and_strip_punctuation",
                                    split="whitespace",
                                    ngrams=None, # create groups of n words
                                    output_mode="int", # how to map tokens to numbers
                                    output_sequence_length=None, # pad or truncate each sample to this many tokens; None - no fixed length
                                    # per the Keras docs, pad_to_max_tokens only applies to the "multi_hot", "count" and "tf_idf"
                                    # output modes, where it pads the feature axis to max_tokens; it does not pad "int" sequences
                                    # pad_to_max_tokens=True
)

In [22]:
len(train_sentences)

6851

In [23]:
train_sentences[0].split(), len(train_sentences[0].split())

(['@mogacola', '@zamtriossu', 'i', 'screamed', 'after', 'hitting', 'tweet'], 7)

In [24]:
# Find the average number of tokens (words) in training tweets
round(sum([len(sentence.split()) for sentence in train_sentences])/len(train_sentences))

15

In [25]:
# Setup text vectorization variables
max_vocab_lenght = 10000 # max number of words to have in our vocabulary
max_length = 15 # max length our sequences will be (how many words of a tweet the model will see)

text_vectorizer = TextVectorization(max_tokens=max_vocab_lenght,
                                    output_mode="int",
                                    output_sequence_length=max_length, # pads with zeros or truncates to max_length
                                    pad_to_max_tokens=True # requires max_tokens; per the Keras docs it only matters for "multi_hot", "count" and "tf_idf" modes
                                    )

In [26]:
# Fit text_vectorizer to the training data
text_vectorizer.adapt(train_sentences)

In [27]:
# Create a sample sentence and tokenize it
sample_sentence = "There's a flood in my street!"
text_vectorizer([sample_sentence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[264,   3, 232,   4,  13, 698,   0,   0,   0,   0,   0,   0,   0,
          0,   0]])>
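
As a rough mental model of what an integer-mode vectorizer does, the steps can be sketched in plain Python. This is an assumption-laden simplification, not the Keras implementation: `simple_vectorize` and `toy_vocab` are hypothetical, and the sketch only assumes that index 0 is reserved for padding and index 1 for out-of-vocabulary words.

```python
import string

PAD_ID, UNK_ID = 0, 1  # assumed reserved indices: padding and unknown word

def simple_vectorize(sentence, vocab, sequence_length):
    """Hypothetical stand-in for an int-mode text vectorizer."""
    # Lowercase and strip punctuation.
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace and map each word to its index (or UNK_ID).
    ids = [vocab.get(word, UNK_ID) for word in cleaned.split()]
    # Truncate long sequences, then pad short ones with zeros.
    ids = ids[:sequence_length]
    return ids + [PAD_ID] * (sequence_length - len(ids))

toy_vocab = {"a": 2, "flood": 3, "in": 4, "my": 5}  # made-up vocabulary
print(simple_vectorize("There's a flood in my street!", toy_vocab, sequence_length=8))
# [1, 2, 3, 4, 5, 1, 0, 0]
print(simple_vectorize("There's a flood in my street!", toy_vocab, sequence_length=3))
# [1, 2, 3]
```

Words missing from the toy vocabulary ("theres", "street") become 1, and the trailing zeros are padding up to the requested sequence length.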

In [28]:
# Try on train_sentences
random_sentence = random.choice(train_sentences)
print(f"Original text:\n{random_sentence}")
vectorized_sentence = text_vectorizer([random_sentence])
print(f"Vectorized version:\n{vectorized_sentence}")

Original text:
Trial Date Set for Man Charged with Arson Burglary http://t.co/WftCrLz32P
Vectorized version:
[[2645 1089  284   10   89  333   14  612    1    1    0    0    0    0
     0]]


In [29]:
# Get the unique words in vocabulary
words_in_vocab = text_vectorizer.get_vocabulary() # get all of the unique words in training data
print(f"Number of words: {len(words_in_vocab)}")
print(f"Most common words: {words_in_vocab[:5]}")
print(f"Least common words: {words_in_vocab[-5:]}")

Number of words: 10000
Most common words: ['', '[UNK]', 'the', 'a', 'in']
Least common words: ['pages', 'paeds', 'pads', 'padres', 'paddytomlinson1']


### Creating an Embedding

* input_dim - The size of the vocabulary (e.g. len(text_vectorizer.get_vocabulary())).
* output_dim - The size of the output embedding vector; for example, a value of 100 outputs a feature vector of size 100 for each word.
* embeddings_initializer - How to initialize the embedding matrix. The default is "uniform", which randomly initializes the embedding matrix with a uniform distribution. This can be changed to use pre-learned embeddings.
* input_length - Length of the sequences being passed to the embedding layer.
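
To make these parameters concrete, here is a small arithmetic sketch of how they determine the size of the embedding matrix and the shape of the layer's output. It assumes a 10,000-word vocabulary, 128-dimensional vectors and 15-token sequences, which are the values used in this notebook; the variable names are illustrative only.

```python
vocab_size = 10000     # input_dim: one row per vocabulary entry
embedding_dim = 128    # output_dim: length of each token's vector
sequence_length = 15   # tokens per tweet after vectorization
batch_size = 1         # a single sentence

# The layer stores one embedding_dim-sized vector per vocabulary entry.
num_weights = vocab_size * embedding_dim

# Each token id in the input is replaced by its vector.
output_shape = (batch_size, sequence_length, embedding_dim)

print(num_weights)   # 1280000
print(output_shape)  # (1, 15, 128)
```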

In [30]:
from tensorflow.keras import layers

embedding = layers.Embedding(input_dim=max_vocab_lenght, # size of the vocabulary
                             output_dim=128, # size of each token's embedding vector
                             embeddings_initializer="uniform",
                             input_length=max_length, # length of each input sequence
                             )



In [31]:
# Get random sentence from training dataset
random_sentence = random.choice(train_sentences)
print(f"Original text:\n{random_sentence}\n")
print("Embedded version:")
# Embed the random_sentence (turn it into vectors of a set size)
sample_embed = embedding(text_vectorizer([random_sentence]))
sample_embed

Original text:
@Collapsed thank u

Embedded version:


<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[ 0.02486279,  0.00290193,  0.02400252, ..., -0.01188427,
         -0.04039618,  0.00191604],
        [-0.00586569,  0.02015797, -0.02155587, ...,  0.04775889,
         -0.00583676,  0.00221105],
        [ 0.02961234,  0.00271244, -0.01955608, ..., -0.03513717,
          0.01406911, -0.004888  ],
        ...,
        [ 0.029728  ,  0.02018246,  0.03855381, ...,  0.01656593,
         -0.03555902,  0.03275954],
        [ 0.029728  ,  0.02018246,  0.03855381, ...,  0.01656593,
         -0.03555902,  0.03275954],
        [ 0.029728  ,  0.02018246,  0.03855381, ...,  0.01656593,
         -0.03555902,  0.03275954]]], dtype=float32)>

In [32]:
# Check out a single token's embedding
sample_embed[0][0], sample_embed[0][0].shape, random_sentence

(<tf.Tensor: shape=(128,), dtype=float32, numpy=
 array([ 2.48627923e-02,  2.90193409e-03,  2.40025185e-02, -3.86243574e-02,
         4.69923057e-02,  4.26291935e-02,  3.02706696e-02, -8.49477947e-05,
         3.97467874e-02, -4.48606983e-02, -1.67577043e-02, -4.88535650e-02,
         2.08204277e-02, -3.91445309e-03, -4.34069410e-02,  2.03595050e-02,
         4.81251813e-02,  3.28909196e-02,  3.56102921e-02, -4.51433770e-02,
         1.81324147e-02,  4.00821827e-02, -9.92230326e-03,  4.41757925e-02,
         4.57537882e-02,  3.82833593e-02, -1.41196959e-02,  2.62498967e-02,
        -3.21881771e-02,  4.19645049e-02, -1.12681165e-02, -8.28135759e-04,
        -3.02336365e-03,  4.51062955e-02, -4.12427187e-02,  4.71319072e-02,
        -6.33094460e-03,  2.68423669e-02, -8.87802988e-03,  4.22240831e-02,
        -4.45832014e-02, -2.12428104e-02,  3.15022357e-02,  4.83728386e-02,
        -4.72211353e-02,  2.71241553e-02, -2.46182326e-02,  4.05676104e-02,
         2.33565643e-03,  2.31268071e-0

## Modelling a text dataset

### Model 0: Baseline

In [33]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

In [34]:
# Create tokenization and modelling pipeline
model_0 = Pipeline([
    ("tfidf", TfidfVectorizer()), # turn text into numbers
    ("clf", MultinomialNB()) # create a model
])

# Fit the pipeline to the training data
model_0.fit(train_sentences, train_labels)
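
For intuition about the "turn text into numbers" step, here is a simplified TF-IDF sketch in plain Python. It is an approximation under stated assumptions: it uses a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation, which to my understanding match scikit-learn's defaults, but it splits on whitespace only and ignores scikit-learn's own tokenization rules. The `tfidf` function and the example sentences are hypothetical.

```python
import math
from collections import Counter

def tfidf(documents):
    """TF-IDF with smoothed IDF and L2 normalisation (simplified sketch)."""
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({word for doc in tokenized for word in doc})
    n_docs = len(tokenized)
    # Number of documents each word appears in.
    doc_freq = Counter(word for doc in tokenized for word in set(doc))
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        row = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows

vocab, rows = tfidf(["flood in the street", "party in the street", "flood warning"])
print(vocab)  # ['flood', 'in', 'party', 'street', 'the', 'warning']
print([round(v, 3) for v in rows[1]])
```

In the second document, "party" (which appears in one document) gets a larger weight than "in" (which appears in two), and this is the kind of signal the Naive Bayes classifier works with.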

In [35]:
# Evaluate baseline model
baseline_score = model_0.score(val_sentences, val_labels)
print(f"Score: {baseline_score*100:.2f}%")

Score: 79.27%


In [36]:
# Make predictions
baseline_preds = model_0.predict(val_sentences)
baseline_preds[:10]

array([1, 1, 1, 0, 0, 1, 1, 1, 1, 0])

#### Create an evaluation function

In [37]:
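from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def calculate_results(y_true, y_pred):
  """
  Calculate accuracy, precision, recall and F1-score of a classification model.

  Accuracy is returned as a percentage (0-100); precision, recall and F1
  are support-weighted averages on a 0-1 scale.
  """
  # Calculate model accuracy (as a percentage)
  model_accuracy = accuracy_score(y_true, y_pred) * 100
  # Calculate precision, recall and f1-score using a weighted average
  model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred,
                                                                               average="weighted")
  model_results = {"accuracy":model_accuracy,
                   "precision":model_precision,
                   "recall":model_recall,
                   "f1":model_f1}

  return model_results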
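
To show what `average="weighted"` means, here is a minimal hand-rolled version of support-weighted precision, recall and F1. It is a sketch for intuition, not a replacement for scikit-learn: `weighted_scores` is a hypothetical helper and the labels are made up.

```python
def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall and F1 (hypothetical helper)."""
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        class_precision = tp / (tp + fp) if tp + fp else 0.0
        class_recall = tp / (tp + fn) if tp + fn else 0.0
        denom = class_precision + class_recall
        class_f1 = 2 * class_precision * class_recall / denom if denom else 0.0
        weight = (tp + fn) / total  # share of true samples with this label
        precision += weight * class_precision
        recall += weight * class_recall
        f1 += weight * class_f1
    return precision, recall, f1

y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]
precision, recall, f1 = weighted_scores(y_true, y_pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.6 0.6 0.6
```

Each class's score is weighted by how many true samples it has. One consequence is that weighted recall equals plain accuracy (3 of 5 correct here), so the two numbers always match apart from the percentage scaling.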

In [38]:
baseline_results = calculate_results(y_true=val_labels,
                                     y_pred=baseline_preds)
baseline_results

{'accuracy': 79.26509186351706,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706,
 'f1': 0.7862189758049549}

### Model 1: A simple dense model

In [40]:
# Create a tensorboard_callback
from helper_functions import create_tensorboard_callback

# Create dir to save logs
SAVE_DIR = "model_logs"

In [42]:
text_vectorizer

<TextVectorization name=text_vectorization_1, built=True>

In [43]:
embedding

<Embedding name=embedding, built=True>

In [50]:
# Build model with Functional API
from tensorflow.keras import layers

inputs = layers.Input(shape=(1,), dtype=tf.string) # inputs are 1-d strings
x = text_vectorizer(inputs) # turn input text into integers
x = embedding(x) # create an embedding of the numberized inputs
x = layers.GlobalAveragePooling1D()(x)  # condense the per-token feature vectors into one vector by averaging over the sequence
outputs = layers.Dense(1, activation="sigmoid")(x) # create binary output layer
model_1 = tf.keras.Model(inputs, outputs, name="model_1_dense")
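
The pooling step in this model is what turns a (sequence length, embedding size) matrix into a single vector the dense layer can use. A minimal sketch of that averaging in plain Python, with a hypothetical `global_average_pool_1d` helper and made-up token vectors:

```python
def global_average_pool_1d(sequence):
    """Average token vectors over the sequence axis (hypothetical helper)."""
    num_tokens = len(sequence)
    # zip(*sequence) groups the values of each feature across all tokens.
    return [sum(values) / num_tokens for values in zip(*sequence)]

token_vectors = [      # 3 tokens, 4 features each
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [2.0, 2.0, 2.0, 2.0],
]
pooled = global_average_pool_1d(token_vectors)
print(pooled)  # [2.0, 2.0, 2.0, 2.0]
```

The output has one value per feature, regardless of how many tokens went in.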

In [51]:
model_1.summary()

In [52]:
# Compile model
model_1.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [53]:
# Fit the model
model_1_history = model_1.fit(x=train_sentences,
                              y=train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(SAVE_DIR,
                                                                     "model_1_dense")])

Saving TensorBoard log files to: model_logs/model_1_dense/20250211-154703
Epoch 1/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 6s 16ms/step - accuracy: 0.6390 - loss: 0.6490 - val_accuracy: 0.7756 - val_loss: 0.5319
Epoch 2/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 3s 14ms/step - accuracy: 0.8175 - loss: 0.4601 - val_accuracy: 0.7822 - val_loss: 0.4772
Epoch 3/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 6s 20ms/step - accuracy: 0.8611 - loss: 0.3495 - val_accuracy: 0.7822 - val_loss: 0.4675
Epoch 4/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 3s 15ms/step - accuracy: 0.8833 - loss: 0.2925 - val_accuracy: 0.7927 - val_loss: 0.4662
Epoch 5/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 5s 15ms/step - accuracy: 0.9171 - loss: 0.2326 - val_accuracy: 0.7782 - val_loss: 0.4780


In [54]:
# Check the results
model_1.evaluate(val_sentences, val_labels)

24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.7740 - loss: 0.5116


[0.4779578447341919, 0.778215229511261]

In [56]:
# make predictions and evaluate them
model_1_pred_probs = model_1.predict(val_sentences)
model_1_pred_probs, model_1_pred_probs.shape

24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step


(array([[3.04633588e-01],
        [7.05856681e-01],
        [9.98088062e-01],
        [1.33858979e-01],
        [1.33117527e-01],
        [9.36511099e-01],
        [9.28333998e-01],
        [9.94243443e-01],
        [9.62470055e-01],
        [3.17607045e-01],
        [1.43010095e-01],
        [7.00074375e-01],
        [4.69518863e-02],
        [2.25473985e-01],
        [4.59400937e-03],
        [1.31574631e-01],
        [2.65179817e-02],
        [7.55184367e-02],
        [2.42035538e-01],
        [4.83180851e-01],
        [9.06174302e-01],
        [4.44659144e-02],
        [4.76768821e-01],
        [7.74266645e-02],
        [9.53632653e-01],
        [9.99053359e-01],
        [4.15530801e-02],
        [6.49306253e-02],
        [3.00971083e-02],
        [1.92716688e-01],
        [5.55278957e-01],
        [2.82864988e-01],
        [4.86602128e-01],
        [1.99498311e-01],
        [4.73394513e-01],
        [5.44048622e-02],
        [9.95448291e-01],
        [1.65755898e-01],
        [3.8

In [57]:
# single prediction
model_1_pred_probs[0]

array([0.3046336], dtype=float32)

In [58]:
# first 10 preds
model_1_pred_probs[:10]

array([[0.3046336 ],
       [0.7058567 ],
       [0.99808806],
       [0.13385898],
       [0.13311753],
       [0.9365111 ],
       [0.928334  ],
       [0.99424344],
       [0.96247005],
       [0.31760705]], dtype=float32)

In [59]:
# convert model prediction probabilities to label format
model_1_preds = tf.squeeze(tf.round(model_1_pred_probs))
model_1_preds[:20]

<tf.Tensor: shape=(20,), dtype=float32, numpy=
array([0., 1., 1., 0., 0., 1., 1., 1., 1., 0., 0., 1., 0., 0., 0., 0., 0.,
       0., 0., 0.], dtype=float32)>
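
Rounding the sigmoid outputs is equivalent to applying a 0.5 decision threshold. A plain-Python sketch of the same conversion, using the first four probabilities printed above; `probs_to_labels` is a hypothetical helper, and the `threshold` parameter is an addition that `tf.round` itself does not offer.

```python
def probs_to_labels(pred_probs, threshold=0.5):
    """Turn [[p], [p], ...] sigmoid outputs into a flat list of 0/1 labels."""
    return [1 if row[0] > threshold else 0 for row in pred_probs]

sample_probs = [[0.3046336], [0.7058567], [0.99808806], [0.13385898]]
print(probs_to_labels(sample_probs))                 # [0, 1, 1, 0]
print(probs_to_labels(sample_probs, threshold=0.8))  # [0, 0, 1, 0]
```

An explicit threshold makes it easy to trade precision against recall, for example by requiring more confidence before flagging a tweet as a disaster.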

In [61]:
# Calculate model_1 results
model_1_results = calculate_results(y_true=val_labels,
                                    y_pred=model_1_preds)
model_1_results

{'accuracy': 77.82152230971128,
 'precision': 0.7825342114649019,
 'recall': 0.7782152230971129,
 'f1': 0.7751716074860721}

In [62]:
baseline_results

{'accuracy': 79.26509186351706,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706,
 'f1': 0.7862189758049549}

In [63]:
import numpy as np
np.array(list(model_1_results.values())) > np.array(list(baseline_results.values()))

array([False, False, False, False])
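
The boolean comparison shows that the dense model loses on every metric, but not by how much. A small sketch that reports the per-metric difference instead; `compare_results` is a hypothetical helper, and the dictionaries below are the notebook's results rounded by hand.

```python
def compare_results(baseline, new):
    """Return new - baseline for every metric in the baseline (hypothetical helper)."""
    return {metric: new[metric] - baseline[metric] for metric in baseline}

# Values rounded from the results printed above.
baseline = {"accuracy": 79.27, "precision": 0.811, "recall": 0.793, "f1": 0.786}
dense = {"accuracy": 77.82, "precision": 0.783, "recall": 0.778, "f1": 0.775}

differences = compare_results(baseline, dense)
for metric, diff in differences.items():
    print(f"{metric}: {diff:+.3f}")
```

Keep in mind that accuracy is on a 0-100 scale while the other metrics are on a 0-1 scale, so the raw differences are not directly comparable across metrics.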

## Visualizing learned embeddings

In [64]:
words_in_vocab = text_vectorizer.get_vocabulary()
len(words_in_vocab), words_in_vocab[:10]

(10000, ['', '[UNK]', 'the', 'a', 'in', 'to', 'of', 'and', 'i', 'is'])

In [65]:
model_1.summary()

In [68]:
# Get the weight matrix from embedding layer
# (numerical representations of each token in training data)
embed_weights = model_1.get_layer("embedding").get_weights()[0]
embed_weights, embed_weights.shape

(array([[ 0.012587  ,  0.00811847,  0.0210819 , ...,  0.03176046,
         -0.05392036,  0.01304741],
        [-0.02372814, -0.02279401,  0.00786959, ...,  0.01749483,
         -0.03787345, -0.00357797],
        [ 0.03809747,  0.00077309, -0.01918387, ..., -0.00539755,
         -0.02528224,  0.01230901],
        ...,
        [-0.03037528,  0.01532621,  0.00243112, ...,  0.02928963,
          0.023206  ,  0.04671231],
        [-0.02584529, -0.07766714, -0.0576438 , ...,  0.02956907,
         -0.08186322, -0.04061326],
        [-0.03639252, -0.10603592, -0.07609297, ...,  0.10523008,
         -0.02742541, -0.01794127]], dtype=float32),
 (10000, 128))

In [69]:
# Create embedding files
import io
out_v = io.open('vectors.tsv', 'w', encoding='utf-8')
out_m = io.open('metadata.tsv', 'w', encoding='utf-8')

for index, word in enumerate(words_in_vocab):
  if index == 0:
    continue  # skip 0, it's padding.
  vec = embed_weights[index]
  out_v.write('\t'.join([str(x) for x in vec]) + "\n")
  out_m.write(word + "\n")
out_v.close()
out_m.close()

In [70]:
# Download files from Colab to upload to projector
try:
  from google.colab import files
  files.download('vectors.tsv')
  files.download('metadata.tsv')
except Exception:
  pass


## Recurrent Neural Networks (RNNs)

📖 **Resources:**

* MIT Deep Learning Lecture on Recurrent Neural Networks - explains the background of recurrent neural networks and introduces LSTMs.
* The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy - demonstrates the power of RNNs with examples generating various sequences.
* Understanding LSTMs by Chris Olah - an in-depth (and technical) look at the mechanics of the LSTM cell, possibly the most popular RNN building block.

### Model 2: LSTM
* LSTM = long short-term memory

In [76]:
# Create an LSTM model
from tensorflow.keras import layers

inputs = layers.Input(shape=(1,), dtype=tf.string)
x = text_vectorizer(inputs)
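x = embedding(x) # note: this reuses the `embedding` layer already trained with model_1, so its weights do not start from scratch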
# print(x.shape)
# x = layers.LSTM(units=64, return_sequences=True)(x) # return vector for each word in the Tweet (you can stack RNN cells as long as return_sequences=True)
# print(x.shape)
x = layers.LSTM(units=64)(x)
# print(x.shape)
# x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model_2 = tf.keras.Model(inputs, outputs, name="model_2_LSTM")

In [77]:
# Get a summary
model_2.summary()

* We want the output shape of the LSTM to be (None, n) because we want to make a prediction for the *whole* sentence, not for each word.
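
The difference between one output per sentence and one output per word can be sketched with a toy recurrent layer. This is not an LSTM: `simple_rnn` is a hypothetical function with fixed, made-up weights, and it only illustrates how a `return_sequences`-style switch changes the output shape.

```python
import math

def simple_rnn(sequence, units, return_sequences=False):
    """Toy recurrent layer with fixed weights (hypothetical, not an LSTM)."""
    state = [0.0] * units
    outputs = []
    for token_vector in sequence:
        token_sum = sum(token_vector)
        # The new state mixes the current token with the previous state.
        state = [math.tanh(0.1 * token_sum + 0.5 * s) for s in state]
        outputs.append(state)
    # Either every intermediate state, or only the final one.
    return outputs if return_sequences else outputs[-1]

sequence = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 tokens, 2 features each
per_token = simple_rnn(sequence, units=4, return_sequences=True)
last_only = simple_rnn(sequence, units=4)

print(len(per_token), len(per_token[0]))  # 3 4
print(len(last_only))                     # 4
```

With a batch dimension added, `per_token` corresponds to a (None, 3, 4) output and `last_only` to (None, 4). The final state is the one that summarises the whole sentence, and the per-token form is what a stacked recurrent layer needs as input.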

In [78]:
# Compile the model
model_2.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [79]:
# Fit the model
model_2_history = model_2.fit(train_sentences,
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(SAVE_DIR,
                                                                     "model_2_LSTM")])

Saving TensorBoard log files to: model_logs/model_2_LSTM/20250211-170944
Epoch 1/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 13s 32ms/step - accuracy: 0.8804 - loss: 0.3040 - val_accuracy: 0.7730 - val_loss: 0.5999
Epoch 2/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 6s 29ms/step - accuracy: 0.9396 - loss: 0.1594 - val_accuracy: 0.7822 - val_loss: 0.6165
Epoch 3/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 11s 32ms/step - accuracy: 0.9560 - loss: 0.1169 - val_accuracy: 0.7861 - val_loss: 0.6026
Epoch 4/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 10s 48ms/step - accuracy: 0.9589 - loss: 0.1006 - val_accuracy: 0.7769 - val_loss: 0.7773
Epoch 5/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 11s 51ms/step - accuracy: 0.9719 - loss: 0.0779 - val_accuracy: 0.7756 - val_loss: 0.9027


In [80]:
# Make predictions with the LSTM model
model_2_pred_probs = model_2.predict(val_sentences)
model_2_pred_probs[:10]

24/24 ━━━━━━━━━━━━━━━━━━━━ 1s 34ms/step


array([[2.8508157e-03],
       [7.9175371e-01],
       [9.9973571e-01],
       [8.0089338e-02],
       [7.2302466e-04],
       [9.9902445e-01],
       [9.1015470e-01],
       [9.9983412e-01],
       [9.9971133e-01],
       [4.1128042e-01]], dtype=float32)

In [82]:
# Convert model probs to labels
model_2_preds = tf.squeeze(tf.round(model_2_pred_probs))
model_2_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 0., 1., 1., 1., 1., 0.], dtype=float32)>

In [84]:
# Calculate model_2 results
model_2_results = calculate_results(val_labels, model_2_preds)
model_2_results

{'accuracy': 77.55905511811024,
 'precision': 0.7777490986405654,
 'recall': 0.7755905511811023,
 'f1': 0.7733619560087616}

In [85]:
baseline_results

{'accuracy': 79.26509186351706,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706,
 'f1': 0.7862189758049549}

### Model 3: GRU

In [92]:
# Build an RNN using GRU cell
from tensorflow.keras import layers

inputs = layers.Input(shape=(1,), dtype=tf.string)
x = text_vectorizer(inputs)
x = embedding(x)
x = layers.GRU(64)(x)
# x = layers.GRU(64, return_sequences=True)(x) # if need to stack recurrent layers (cells) on top of each other, then use return_sequences=True
# x = layers.LSTM(64, return_sequences=True)(x)
# x = layers.GRU(64)(x)
# x = layers.Dense(64, activation="relu")(x)
# x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model_3 = tf.keras.Model(inputs, outputs, name="model_3_GRU")

In [93]:
model_3.summary()

In [94]:
# Compile the model
model_3.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [95]:
# Fit the model
model_3_history = model_3.fit(train_sentences,
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(SAVE_DIR,
                                                                     "model_3_GRU")])

Saving TensorBoard log files to: model_logs/model_3_GRU/20250211-174149
Epoch 1/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 13s 28ms/step - accuracy: 0.8680 - loss: 0.2823 - val_accuracy: 0.7690 - val_loss: 0.7679
Epoch 2/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 10s 27ms/step - accuracy: 0.9697 - loss: 0.0857 - val_accuracy: 0.7795 - val_loss: 0.8175
Epoch 3/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 7s 32ms/step - accuracy: 0.9711 - loss: 0.0720 - val_accuracy: 0.7795 - val_loss: 0.9528
Epoch 4/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 6s 27ms/step - accuracy: 0.9759 - loss: 0.0615 - val_accuracy: 0.7769 - val_loss: 0.9729
Epoch 5/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 7s 33ms/step - accuracy: 0.9801 - loss: 0.0478 - val_accuracy: 0.7717 - val_loss: 1.1865


In [97]:
# Make predictions with GRU model
model_3_pred_probs = model_3.predict(val_sentences)
model_3_pred_probs[:5]

24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step


array([[6.9833070e-04],
       [5.3987348e-01],
       [9.9966979e-01],
       [2.8540654e-02],
       [9.9795296e-05]], dtype=float32)

In [98]:
# Convert pred_probs
model_3_preds = tf.squeeze(tf.round(model_3_pred_probs))
model_3_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 0., 1., 1., 1., 1., 1.], dtype=float32)>

In [99]:
# Calculate results
model_3_results = calculate_results(val_labels,
                                    model_3_preds)
model_3_results

{'accuracy': 77.16535433070865,
 'precision': 0.7741380916586217,
 'recall': 0.7716535433070866,
 'f1': 0.7691811868378113}

### Model 4: Bidirectional RNN

In [102]:
# Build a Bidirectional RNN in TensorFlow
from tensorflow.keras import layers

inputs = layers.Input(shape=(1,), dtype=tf.string)
x = text_vectorizer(inputs)
x = embedding(x)
# x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
# x = layers.Bidirectional(layers.GRU(64))(x)
x = layers.Bidirectional(layers.LSTM(64))(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model_4 = tf.keras.Model(inputs, outputs, name="model_4_bidirectional")


In [103]:
# Summary
model_4.summary()

In [104]:
# Compile model
model_4.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [106]:
# Fit the model
model_4_history = model_4.fit(train_sentences,
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(SAVE_DIR, "model_4_bidirectional")])

Saving TensorBoard log files to: model_logs/model_4_bidirectional/20250211-184256
Epoch 1/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 19s 54ms/step - accuracy: 0.9357 - loss: 0.1982 - val_accuracy: 0.7717 - val_loss: 0.9716
Epoch 2/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 12s 56ms/step - accuracy: 0.9794 - loss: 0.0495 - val_accuracy: 0.7756 - val_loss: 1.0770
Epoch 3/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 23s 66ms/step - accuracy: 0.9789 - loss: 0.0458 - val_accuracy: 0.7795 - val_loss: 1.3746
Epoch 4/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 22s 72ms/step - accuracy: 0.9782 - loss: 0.0426 - val_accuracy: 0.7625 - val_loss: 1.2990
Epoch 5/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 15s 72ms/step - accuracy: 0.9816 - loss: 0.0431 - val_accuracy: 0.7769 - val_loss: 1.3102


In [109]:
# Make prediction probabilities
model_4_pred_probs = model_4.predict(val_sentences)
model_4_pred_probs[:10]

24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 17ms/step


array([[9.1958418e-02],
       [8.2413739e-01],
       [9.9998432e-01],
       [1.8683755e-01],
       [2.2971582e-05],
       [9.9988961e-01],
       [9.4727600e-01],
       [9.9999040e-01],
       [9.9997586e-01],
       [4.7365695e-01]], dtype=float32)

In [110]:
# Convert them to classes
model_4_preds = tf.squeeze(tf.round(model_4_pred_probs))
model_4_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 0., 1., 1., 1., 1., 0.], dtype=float32)>

In [112]:
model_4_results = calculate_results(val_labels,
                                    model_4_preds)
model_4_results

{'accuracy': 77.69028871391076,
 'precision': 0.7809693289921038,
 'recall': 0.7769028871391076,
 'f1': 0.7739165030429329}

## Convolutional Neural Networks for Text (and other types of sequences)

### Model 5: Conv1D

In [116]:
# Test embedding layer + Conv1D + pooling layer
from tensorflow.keras import layers
embedding_test = embedding(text_vectorizer(["this is a test sentence"])) # turn the test sentence into an embedding
conv_1d = layers.Conv1D(filters=32,
                        kernel_size=5, # looks at 5 words at a time
                        strides=1, # moves by 1 word
                        activation="relu",
                        padding="same") # default is "valid"; "same" pads the input so the output keeps the input's sequence length
conv_1d_output = conv_1d(embedding_test) # pass test data through conv1d
max_pool = layers.GlobalMaxPool1D()
max_pool_output = max_pool(conv_1d_output) # keeps the highest value of each filter across the sequence

embedding_test.shape, conv_1d_output.shape, max_pool_output.shape

(TensorShape([1, 15, 128]), TensorShape([1, 15, 32]), TensorShape([1, 32]))
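
The shapes above follow from two simple rules, sketched here in plain Python. The helper names are hypothetical; the length formula is the standard one for 1D convolutions without dilation, and the pooling example uses made-up values.

```python
import math

def conv1d_output_length(input_length, kernel_size, strides=1, padding="valid"):
    """Sequence length after a 1D convolution (no dilation)."""
    if padding == "same":
        return math.ceil(input_length / strides)
    return (input_length - kernel_size) // strides + 1

def global_max_pool_1d(sequence):
    """Keep the largest value of each filter across all positions."""
    return [max(values) for values in zip(*sequence)]

print(conv1d_output_length(15, kernel_size=5, padding="same"))   # 15
print(conv1d_output_length(15, kernel_size=5, padding="valid"))  # 11
print(global_max_pool_1d([[0.1, 0.9], [0.4, 0.2], [0.3, 0.5]]))  # [0.4, 0.9]
```

With `padding="same"` the 15 token positions are preserved and only the feature axis changes (128 embedding values become 32 filter outputs); global max pooling then collapses the 15 positions into one value per filter.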

In [117]:
embedding_test

<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[-0.03619039, -0.00014613, -0.01179297, ...,  0.05799352,
         -0.00135278, -0.02944383],
        [ 0.00872051, -0.02883054,  0.02369179, ..., -0.02973634,
          0.03212951,  0.01354277],
        [ 0.04843905,  0.06579537, -0.03091072, ...,  0.09463909,
         -0.04827674,  0.00580479],
        ...,
        [-0.02845112, -0.00865676,  0.03318251, ...,  0.03639423,
         -0.05816198,  0.01210994],
        [-0.02845112, -0.00865676,  0.03318251, ...,  0.03639423,
         -0.05816198,  0.01210994],
        [-0.02845112, -0.00865676,  0.03318251, ...,  0.03639423,
         -0.05816198,  0.01210994]]], dtype=float32)>

In [118]:
conv_1d_output

<tf.Tensor: shape=(1, 15, 32), dtype=float32, numpy=
array([[[0.08943889, 0.0114113 , 0.        , 0.        , 0.        ,
         0.0126847 , 0.        , 0.03228597, 0.02863748, 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.00309169,
         0.05084584, 0.06952943, 0.00160123, 0.00855024, 0.        ,
         0.        , 0.04311544, 0.        , 0.02495317, 0.        ,
         0.        , 0.        ],
        [0.        , 0.06041734, 0.        , 0.02415749, 0.08496326,
         0.00964421, 0.        , 0.        , 0.04531755, 0.        ,
         0.        , 0.00081942, 0.05974492, 0.03892682, 0.03449061,
         0.07716691, 0.        , 0.02804714, 0.05713614, 0.04343976,
         0.0297568 , 0.        , 0.03680851, 0.00691593, 0.        ,
         0.        , 0.        , 0.0102655 , 0.03147547, 0.        ,
         0.        , 0.        ],
        [0.02141793, 0.        , 0.        , 0.    

In [119]:
max_pool_output

<tf.Tensor: shape=(1, 32), dtype=float32, numpy=
array([[0.08943889, 0.06041734, 0.02595568, 0.06231223, 0.08496326,
        0.04834133, 0.05429848, 0.09074842, 0.04531755, 0.03663497,
        0.0512967 , 0.04102965, 0.05974492, 0.07130976, 0.03449061,
        0.07716691, 0.04817514, 0.04688785, 0.05713614, 0.06706955,
        0.05084584, 0.07770487, 0.03680851, 0.05626641, 0.04292484,
        0.02469029, 0.04311544, 0.05257078, 0.06321305, 0.09141448,
        0.02530191, 0.06700905]], dtype=float32)>