<a href="https://colab.research.google.com/github/random-words/colab-notebooks/blob/main/08__introduction_to_NLP_in_tensorflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fundamentals

In [1]:
!nvidia-smi

/bin/bash: line 1: nvidia-smi: command not found


## Get Helper Functions

In [2]:
!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/refs/heads/main/extras/helper_functions.py

--2025-02-12 09:00:37--  https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/refs/heads/main/extras/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10246 (10K) [text/plain]
Saving to: ‘helper_functions.py’


2025-02-12 09:00:37 (56.5 MB/s) - ‘helper_functions.py’ saved [10246/10246]



In [3]:
from helper_functions import unzip_data, create_tensorboard_callback, plot_loss_curves, compare_historys

## Get text dataset

In [4]:
!wget https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip

--2025-02-12 09:00:53--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.212.207, 173.194.215.207, 108.177.12.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.212.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip’


2025-02-12 09:00:53 (39.5 MB/s) - ‘nlp_getting_started.zip’ saved [607343/607343]



In [5]:
unzip_data("nlp_getting_started.zip")

## Visualizing dataset

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

In [7]:
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

In [8]:
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [9]:
train_df_shuffled = train_df.sample(frac=1, random_state=42)
train_df_shuffled.head()

Unnamed: 0,id,keyword,location,text,target
2644,3796,destruction,,So you have a new weapon that can cause un-ima...,1
2227,3185,deluge,,The f$&amp;@ing things I do for #GISHWHES Just...,0
5448,7769,police,UK,DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe...,1
132,191,aftershock,,Aftershock back to school kick off was great. ...,0
6845,9810,trauma,"Montgomery County, MD",in response to trauma Children of Addicts deve...,0


In [10]:
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [11]:
# number of examples of each class
train_df["target"].value_counts()

target,count
0,4342
1,3271


In [12]:
# same counts, using attribute access
train_df.target.value_counts()

target,count
0,4342
1,3271


In [13]:
len(train_df), len(test_df)

(7613, 3263)
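To put the class counts above in proportion, here is a small sketch that turns them into percentages. The counts are copied from the `value_counts()` output; the name `class_counts` is only illustrative.

```python
# Class counts copied from the value_counts() output above.
class_counts = {0: 4342, 1: 3271}
total = sum(class_counts.values())  # 7613, the length of train_df

# Share of each class in the training data.
class_shares = {label: count / total for label, count in class_counts.items()}
for label, share in class_shares.items():
    print(f"target={label}: {share:.1%}")
```

The classes are roughly 57% / 43%, so the dataset is only mildly imbalanced.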

In [14]:
# Let's visualize some random training examples
import random
random_index = random.randint(0, len(train_df)-5) # create random indexes not higher than the total number of samples
for row in train_df_shuffled[["text", "target"]][random_index:random_index+5].itertuples():
  _, text, target = row
  print(f"Target: {target}", "(real disaster)" if target > 0 else "(not real disaster)")
  print(f"Text:\n{text}\n")
  print("---\n")

Target: 0 (not real disaster)
Text:
One Direction Is my pick for http://t.co/q2eBlOKeVE Fan Army #Directioners http://t.co/eNCmhz6y34 x1386

---

Target: 0 (not real disaster)
Text:
13 reasons why we love women in the military   - lulgzimbestpicts http://t.co/XKMLQ99SjY http://t.co/a3RGQuCUgo

---

Target: 0 (not real disaster)
Text:
God bless you and your mudslide cake Dorret ????

---

Target: 0 (not real disaster)
Text:
REVEALED: Everton hijack United bid for 14-year-old WONDERKID! - http://t.co/nb1E7mNcE5

---

Target: 1 (real disaster)
Text:
Experts in France begin examining airplane debris found on Reunion Island: French air accident experts on Wednesday began examining t...

---



### Split data into training and validation sets

In [15]:
from sklearn.model_selection import train_test_split

In [16]:
train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df_shuffled["text"].to_numpy(),
                                                                            train_df_shuffled["target"].to_numpy(),
                                                                            test_size=0.1,
                                                                            random_state=42)

In [17]:
# Check split lengths
len(train_sentences), len(train_labels), len(val_sentences), len(val_labels)

(6851, 6851, 762, 762)
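The split sizes above can be reproduced by hand. This sketch assumes that scikit-learn rounds the test-set size up and gives the remainder to the training set, which matches the printed lengths; the helper name `split_sizes` is made up for illustration.

```python
import math

def split_sizes(n_samples, test_size):
    """Hypothetical helper: sample counts for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)  # assumption: test count is rounded up
    n_train = n_samples - n_test               # the rest goes to training
    return n_train, n_test

print(split_sizes(7613, 0.1))  # (6851, 762)
```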

In [18]:
# check first 10 samples
train_sentences[:10], train_labels[:10]

(array(['@mogacola @zamtriossu i screamed after hitting tweet',
        'Imagine getting flattened by Kurt Zouma',
        '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....',
        "@shakjn @C7 @Magnums im shaking in fear he's gonna hack the planet",
        'Somehow find you and I collide http://t.co/Ee8RpOahPk',
        '@EvaHanderek @MarleyKnysh great times until the bus driver held us hostage in the mall parking lot lmfao',
        'destroy the free fandom honestly',
        'Weapons stolen from National Guard Armory in New Albany still missing #Gunsense http://t.co/lKNU8902JE',
        '@wfaaweather Pete when will the heat wave pass? Is it really going to be mid month? Frisco Boy Scouts have a canoe trip in Okla.',
        'Patient-reported outcomes in long-term survivors of metastatic colorectal cancer - British Journal of Surgery http://t.co/5Yl4DC1Tqt'],
       dtype=object),
 array([0,

## Converting text into numbers

There are a few ways to do this:
* Tokenization - a direct mapping from each token to a number:
i love tensorflow -> {i: 0, love: 1, tensorflow: 2}
* Embedding - a matrix with one feature vector per token, which can be learned during training: i love tensorflow ->
[[0.125, 0.856, 0.091],
 [0.123, 0.643, 0.723],
 [0.188, 0.116, 0.901]]
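To make the two ideas concrete, here is a minimal pure-Python sketch. The names `token_to_index` and `embedding_matrix` are illustrative, and the vectors are the example numbers from the list above, not learned values.

```python
# Tokenization: map each unique token to an integer index.
sentence = "i love tensorflow"
tokens = sentence.split()
token_to_index = {token: index for index, token in enumerate(tokens)}
token_ids = [token_to_index[token] for token in tokens]
print(token_ids)  # [0, 1, 2]

# Embedding: one feature vector (row) per token index.
# These are the illustrative numbers from the text; a real embedding is learned.
embedding_matrix = [
    [0.125, 0.856, 0.091],  # "i"
    [0.123, 0.643, 0.723],  # "love"
    [0.188, 0.116, 0.901],  # "tensorflow"
]
embedded = [embedding_matrix[token_id] for token_id in token_ids]
print(embedded[1])  # [0.123, 0.643, 0.723]
```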

### Text vectorization (tokenization)

In [19]:
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

In [20]:
train_sentences[:3]

array(['@mogacola @zamtriossu i screamed after hitting tweet',
       'Imagine getting flattened by Kurt Zouma',
       '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....'],
      dtype=object)

In [21]:
# Use default TextVectorization parameters
text_vectorizer = TextVectorization(max_tokens=None, # maximum vocabulary size; None means no limit
                                    standardize="lower_and_strip_punctuation",
                                    split="whitespace",
                                    ngrams=None, # create groups of n words
                                    output_mode="int", # how to map tokens to numbers
                                    output_sequence_length=None, # pad or truncate each sample to this many tokens
                                    # pad_to_max_tokens is documented for the "multi_hot", "count" and "tf_idf" output modes only,
                                    # where it pads the feature axis to max_tokens; it does not pad sequences in "int" mode
                                    # pad_to_max_tokens=True
)

In [22]:
len(train_sentences)

6851

In [23]:
train_sentences[0].split(), len(train_sentences[0].split())

(['@mogacola', '@zamtriossu', 'i', 'screamed', 'after', 'hitting', 'tweet'], 7)

In [24]:
# Find the average number of tokens (words) in training tweets
round(sum([len(sentence.split()) for sentence in train_sentences])/len(train_sentences))

15

In [25]:
# Setup text vectorization variables
max_vocab_lenght = 10000 # max number of words to have in our vocabulary
max_length = 15 # max length of our sequences (how many tokens of a tweet the model will see)

text_vectorizer = TextVectorization(max_tokens=max_vocab_lenght,
                                    output_mode="int",
                                    output_sequence_length=max_length,
                                    pad_to_max_tokens=True # documented for "multi_hot", "count" and "tf_idf" modes only; should have no effect with output_mode="int"
                                    )

In [26]:
# Fit text_vectorizer to the training data
text_vectorizer.adapt(train_sentences)

In [27]:
# Create a sample sentence and tokenize it
sample_sentence = "There's a flood in my street!"
text_vectorizer([sample_sentence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[264,   3, 232,   4,  13, 698,   0,   0,   0,   0,   0,   0,   0,
          0,   0]])>
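The output above shows the sentence mapped to integers and padded with zeros up to 15 tokens. The following pure-Python sketch imitates those steps on a toy vocabulary. The function `vectorize` and the dictionary `toy_vocab` are hypothetical stand-ins, not the Keras implementation, and the indices differ from the real adapted vocabulary.

```python
import string

def vectorize(text, vocab, sequence_length):
    """Hypothetical stand-in for TextVectorization in "int" mode."""
    # Lowercase and strip punctuation (roughly "lower_and_strip_punctuation").
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace and look up each token; unknown tokens map to 1 ("[UNK]").
    token_ids = [vocab.get(token, 1) for token in cleaned.split()]
    # Truncate to sequence_length, then pad with 0 (the padding index).
    token_ids = token_ids[:sequence_length]
    return token_ids + [0] * (sequence_length - len(token_ids))

# Toy vocabulary: indices 0 and 1 are reserved for padding and "[UNK]".
toy_vocab = {"a": 2, "flood": 3, "in": 4, "my": 5, "street": 6}
toy_vector = vectorize("There's a flood in my street!", toy_vocab, sequence_length=15)
print(toy_vector)  # [1, 2, 3, 4, 5, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

"theres" is not in the toy vocabulary, so it becomes index 1, and sentences longer than 15 tokens would be cut off rather than padded.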

In [28]:
# Try on train_sentences
random_sentence = random.choice(train_sentences)
print(f"Original text:\n{random_sentence}")
vectorized_sentence = text_vectorizer([random_sentence])
print(f"Vectorized version:\n{vectorized_sentence}")

Original text:
'Your love will surely come find us
Like blazing wild fires singing Your name'
Vectorized version:
[[  33  110   38 3369  220  653   69   25  556  250  109 2295   33  735
     0]]


In [29]:
# Get the unique words in vocabulary
words_in_vocab = text_vectorizer.get_vocabulary() # get all of the unique words in training data
print(f"Number of words: {len(words_in_vocab)}")
print(f"Most common words: {words_in_vocab[:5]}")
print(f"Least common words: {words_in_vocab[-5:]}")

Number of words: 10000
Most common words: ['', '[UNK]', 'the', 'a', 'in']
Least common words: ['pages', 'paeds', 'pads', 'padres', 'paddytomlinson1']


### Creating an Embedding

* input_dim - The size of the vocabulary (e.g. len(text_vectorizer.get_vocabulary())).
* output_dim - The size of the output embedding vector; for example, a value of 100 produces a feature vector of size 100 for each word.
* embeddings_initializer - How to initialize the embedding matrix. The default is "uniform", which randomly initializes the matrix from a uniform distribution. This can be changed to use pre-learned embeddings.
* input_length - The length of the sequences being passed to the embedding layer. Recent Keras versions treat this argument as deprecated, so it can usually be omitted.
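A quick way to see how these parameters interact is to work out the shapes by hand. The sketch below uses the values from this notebook (10000 tokens, 128 dimensions, sequences of 15 tokens); `embedding_shapes` is a made-up helper for illustration, not a Keras function.

```python
def embedding_shapes(input_dim, output_dim, batch_size, sequence_length):
    """Return (weight matrix shape, output shape) for an embedding lookup."""
    weights_shape = (input_dim, output_dim)                    # one row per vocabulary entry
    output_shape = (batch_size, sequence_length, output_dim)   # one vector per token
    return weights_shape, output_shape

weights_shape, output_shape = embedding_shapes(
    input_dim=10000, output_dim=128, batch_size=1, sequence_length=15
)
print(weights_shape)                        # (10000, 128)
print(output_shape)                         # (1, 15, 128)
print(weights_shape[0] * weights_shape[1])  # 1280000 trainable values
```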

In [30]:
from tensorflow.keras import layers

embedding = layers.Embedding(input_dim=max_vocab_lenght, # size of the vocabulary
                             output_dim=128, # size of each token's embedding vector
                             embeddings_initializer="uniform",
                             input_length=max_length, # length of each input sequence
                             )



In [31]:
# Get random sentence from training dataset
random_sentence = random.choice(train_sentences)
print(f"Original text:\n{random_sentence}\n")
print("Embedded version:")
# Embed the random_sentence (turn each token into a vector of fixed size)
sample_embed = embedding(text_vectorizer([random_sentence]))
sample_embed

Original text:
Why did God order obliteration of ancient Canaanites? http://t.co/NLk1DYD2tP

Embedded version:


<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[ 0.00620412,  0.00918945, -0.03635547, ..., -0.01942796,
         -0.02671578,  0.0289293 ],
        [ 0.00123564,  0.03882645,  0.00448789, ..., -0.03708577,
         -0.0301629 , -0.02360076],
        [ 0.02439226,  0.00677563, -0.00221785, ...,  0.04839071,
         -0.03470694,  0.02238549],
        ...,
        [-0.04232172, -0.03272851, -0.02306457, ...,  0.03842727,
         -0.00625166,  0.02245761],
        [-0.04232172, -0.03272851, -0.02306457, ...,  0.03842727,
         -0.00625166,  0.02245761],
        [-0.04232172, -0.03272851, -0.02306457, ...,  0.03842727,
         -0.00625166,  0.02245761]]], dtype=float32)>

In [32]:
# Check out a single token's embedding
sample_embed[0][0], sample_embed[0][0].shape, random_sentence

(<tf.Tensor: shape=(128,), dtype=float32, numpy=
 array([ 6.20411709e-03,  9.18944925e-03, -3.63554731e-02, -3.45421806e-02,
         1.10664256e-02,  2.72605903e-02, -2.44349129e-02, -1.52238123e-02,
        -1.62229426e-02, -4.34215069e-02, -6.30294159e-03,  1.78543217e-02,
         2.08247416e-02,  4.14705016e-02,  2.50656717e-02, -4.25217301e-03,
         1.24538764e-02,  3.58538963e-02, -2.81621106e-02,  7.95148313e-04,
         3.22525762e-02,  2.83181332e-02,  4.93369363e-02,  2.76679136e-02,
        -2.16524359e-02,  4.28991653e-02,  7.97070190e-03, -4.22878750e-02,
         3.80688645e-02, -3.04390918e-02,  1.18087903e-02,  1.43629350e-02,
        -2.69249678e-02,  1.73657201e-02, -4.58763912e-03,  1.83748342e-02,
         3.94917391e-02, -2.34889276e-02, -1.25126019e-02, -2.53338944e-02,
         1.74611695e-02, -2.78614759e-02, -8.46657902e-03, -3.55744362e-02,
        -2.98244841e-02, -3.25885192e-02, -8.88290256e-03,  2.27365233e-02,
        -3.57256532e-02,  4.85439561e-0

## Modelling a text dataset

### Model 0: Baseline

In [33]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

In [34]:
# Create tokenization and modelling pipeline
model_0 = Pipeline([
    ("tfidf", TfidfVectorizer()), # turn text into numbers
    ("clf", MultinomialNB()) # create a model
])

# Fit the pipeline to the training data
model_0.fit(train_sentences, train_labels)

In [35]:
# Evaluate baseline model
baseline_score = model_0.score(val_sentences, val_labels)
print(f"Score: {baseline_score*100:.2f}%")

Score: 79.27%


In [36]:
# Make predictions
baseline_preds = model_0.predict(val_sentences)
baseline_preds[:10]

array([1, 1, 1, 0, 0, 1, 1, 1, 1, 0])

#### Create an evaluation function

In [37]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

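def calculate_results(y_true, y_pred):
  """
  Calculate accuracy, precision, recall and F1-score of a classification model.

  Accuracy is returned as a percentage (0-100); precision, recall and F1 are
  weighted averages returned as fractions (0-1).
  """
  # Calculate model accuracy
  model_accuracy = accuracy_score(y_true, y_pred) * 100
  # Calculate precision, recall, f1-score
  model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred,
                                                                               average="weighted")
  model_results = {"accuracy":model_accuracy,
                   "precision":model_precision,
                   "recall":model_recall,
                   "f1":model_f1}

  return model_results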

In [38]:
baseline_results = calculate_results(y_true=val_labels,
                                     y_pred=baseline_preds)
baseline_results

{'accuracy': 79.26509186351706,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706,
 'f1': 0.7862189758049549}
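The `average="weighted"` option used in `calculate_results` averages the per-class scores, weighting each class by its number of true samples. Here is a minimal pure-Python sketch of that idea. The function name `weighted_precision_recall_f1` and the toy labels are made up for illustration, and scikit-learn's own implementation handles more edge cases.

```python
def weighted_precision_recall_f1(y_true, y_pred):
    """Support-weighted average of per-class precision, recall and F1."""
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        class_precision = tp / (tp + fp) if tp + fp else 0.0
        class_recall = tp / (tp + fn) if tp + fn else 0.0
        denom = class_precision + class_recall
        class_f1 = 2 * class_precision * class_recall / denom if denom else 0.0
        weight = (tp + fn) / total  # share of true samples belonging to this class
        precision += weight * class_precision
        recall += weight * class_recall
        f1 += weight * class_f1
    return precision, recall, f1

toy_true = [0, 0, 0, 1, 1]
toy_pred = [0, 0, 1, 1, 0]
toy_precision, toy_recall, toy_f1 = weighted_precision_recall_f1(toy_true, toy_pred)
print(toy_precision, toy_recall, toy_f1)
```

With this weighting, recall works out to the plain accuracy, which is why the recall and accuracy values in the results above agree (0.7927 and 79.27%).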

### Model 1: A simple dense model

In [39]:
# Create a tensorboard_callback
from helper_functions import create_tensorboard_callback

# Create dir to save logs
SAVE_DIR = "model_logs"

In [40]:
text_vectorizer

<TextVectorization name=text_vectorization_1, built=True>

In [41]:
embedding

<Embedding name=embedding, built=True>

In [42]:
# Build model with Functional API
from tensorflow.keras import layers

inputs = layers.Input(shape=(1,), dtype=tf.string) # inputs are 1-d strings
x = text_vectorizer(inputs) # turn input text into integers
x = embedding(x) # create an embedding of the numberized inputs
x = layers.GlobalAveragePooling1D()(x)  # condense the per-token feature vectors into one vector per sentence
outputs = layers.Dense(1, activation="sigmoid")(x) # create binary output layer
model_1 = tf.keras.Model(inputs, outputs, name="model_1_dense")
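The pooling step is what turns a (15, 128) block of token vectors into a single 128-value vector per sentence. A rough pure-Python sketch of the averaging, on a toy input with 3 tokens and 2 features (the function name is illustrative and this is not the Keras code):

```python
def global_average_pooling_1d(sequence):
    """Average a (sequence_length, features) list of vectors over the token axis."""
    num_tokens = len(sequence)
    num_features = len(sequence[0])
    return [
        sum(token_vector[feature] for token_vector in sequence) / num_tokens
        for feature in range(num_features)
    ]

# Toy input: 3 tokens, each with a 2-dimensional embedding.
toy_embedded_sentence = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = global_average_pooling_1d(toy_embedded_sentence)
print(pooled)  # [3.0, 4.0]
```

Note that padded positions are presumably averaged in as well, since the embedding layer above is created without masking.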

In [43]:
model_1.summary()

In [44]:
# Compile model
model_1.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [45]:
# Fit the model
model_1_history = model_1.fit(x=train_sentences,
                              y=train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(SAVE_DIR,
                                                                     "model_1_dense")])

Saving TensorBoard log files to: model_logs/model_1_dense/20250212-090056
Epoch 1/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 13s 34ms/step - accuracy: 0.6438 - loss: 0.6503 - val_accuracy: 0.7493 - val_loss: 0.5335
Epoch 2/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 8s 28ms/step - accuracy: 0.8108 - loss: 0.4671 - val_accuracy: 0.7874 - val_loss: 0.4737
Epoch 3/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 8s 16ms/step - accuracy: 0.8615 - loss: 0.3531 - val_accuracy: 0.7979 - val_loss: 0.4628
Epoch 4/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 3s 15ms/step - accuracy: 0.8959 - loss: 0.2796 - val_accuracy: 0.7900 - val_loss: 0.4660
Epoch 5/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 6s 19ms/step - accuracy: 0.9133 - loss: 0.2369 - val_accuracy: 0.7913 - val_loss: 0.4860


In [46]:
# Check the results
model_1.evaluate(val_sentences, val_labels)

24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.7849 - loss: 0.5218


[0.4860159456729889, 0.7913385629653931]

In [47]:
# make predictions and evaluate them
model_1_pred_probs = model_1.predict(val_sentences)
model_1_pred_probs, model_1_pred_probs.shape

24/24 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step


(array([[3.04037124e-01],
        [7.76587784e-01],
        [9.97417927e-01],
        [1.11498989e-01],
        [1.01927027e-01],
        [9.33831990e-01],
        [8.96601319e-01],
        [9.91480708e-01],
        [9.54850912e-01],
        [2.44666368e-01],
        [1.07832223e-01],
        [6.90426469e-01],
        [3.91390175e-02],
        [1.84103310e-01],
        [5.41232107e-03],
        [1.12468585e-01],
        [2.53940411e-02],
        [5.90743273e-02],
        [2.04065293e-01],
        [4.53190744e-01],
        [8.86836350e-01],
        [3.16713490e-02],
        [3.90613019e-01],
        [7.62752071e-02],
        [9.56001937e-01],
        [9.98683035e-01],
        [2.81606372e-02],
        [5.52193932e-02],
        [2.48091780e-02],
        [1.70252889e-01],
        [5.10192573e-01],
        [2.43863419e-01],
        [4.66054231e-01],
        [1.63289487e-01],
        [4.11866635e-01],
        [5.16866259e-02],
        [9.94490385e-01],
        [1.22615762e-01],
        [2.6

In [48]:
# single prediction
model_1_pred_probs[0]

array([0.30403712], dtype=float32)

In [49]:
# first 10 preds
model_1_pred_probs[:10]

array([[0.30403712],
       [0.7765878 ],
       [0.9974179 ],
       [0.11149899],
       [0.10192703],
       [0.933832  ],
       [0.8966013 ],
       [0.9914807 ],
       [0.9548509 ],
       [0.24466637]], dtype=float32)

In [50]:
# convert model prediction probabilities to label format
model_1_preds = tf.squeeze(tf.round(model_1_pred_probs))
model_1_preds[:20]

<tf.Tensor: shape=(20,), dtype=float32, numpy=
array([0., 1., 1., 0., 0., 1., 1., 1., 1., 0., 0., 1., 0., 0., 0., 0., 0.,
       0., 0., 0.], dtype=float32)>
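Rounding a sigmoid output is the same as applying a 0.5 decision threshold. The sketch below makes that threshold explicit on the first ten probabilities printed above (rounded to three decimals). `probs_to_labels` is a hypothetical helper, not part of TensorFlow.

```python
def probs_to_labels(pred_probs, threshold=0.5):
    """Convert sigmoid outputs to 0/1 class labels with an explicit threshold."""
    return [1 if prob > threshold else 0 for prob in pred_probs]

# First ten validation probabilities from the output above (rounded).
first_ten_probs = [0.304, 0.777, 0.997, 0.111, 0.102, 0.934, 0.897, 0.991, 0.955, 0.245]
first_ten_labels = probs_to_labels(first_ten_probs)
print(first_ten_labels)  # [0, 1, 1, 0, 0, 1, 1, 1, 1, 0]
```

An explicit threshold is also easier to adjust later, for example to trade precision against recall.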

In [51]:
# Calculate model_1 results
model_1_results = calculate_results(y_true=val_labels,
                                    y_pred=model_1_preds)
model_1_results

{'accuracy': 79.13385826771653,
 'precision': 0.7997458316766562,
 'recall': 0.7913385826771654,
 'f1': 0.7874035967950923}

In [52]:
baseline_results

{'accuracy': 79.26509186351706,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706,
 'f1': 0.7862189758049549}

In [53]:
import numpy as np
np.array(list(model_1_results.values())) > np.array(list(baseline_results.values()))

array([False, False, False,  True])
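The boolean array says which metrics improved, but not by how much. A small sketch that prints the per-metric differences instead, using the result values copied from the outputs above (the names `baseline_metrics`, `dense_metrics` and `compare_results` are illustrative):

```python
# Values copied from the baseline_results and model_1_results outputs above.
baseline_metrics = {"accuracy": 79.26509186351706, "precision": 0.8111390004213173,
                    "recall": 0.7926509186351706, "f1": 0.7862189758049549}
dense_metrics = {"accuracy": 79.13385826771653, "precision": 0.7997458316766562,
                 "recall": 0.7913385826771654, "f1": 0.7874035967950923}

def compare_results(baseline, new):
    """Return new - baseline for every metric (positive means improvement)."""
    return {metric: new[metric] - baseline[metric] for metric in baseline}

differences = compare_results(baseline_metrics, dense_metrics)
for metric, diff in differences.items():
    print(f"{metric}: {diff:+.4f}")
```

Keep in mind that accuracy is on a 0-100 scale while the other three metrics are on a 0-1 scale, so the differences are not directly comparable across rows.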

## Visualizing learned embeddings

In [54]:
words_in_vocab = text_vectorizer.get_vocabulary()
len(words_in_vocab), words_in_vocab[:10]

(10000, ['', '[UNK]', 'the', 'a', 'in', 'to', 'of', 'and', 'i', 'is'])

In [55]:
model_1.summary()

In [56]:
# Get the weight matrix from embedding layer
# (numerical representations of each token in training data)
embed_weights = model_1.get_layer("embedding").get_weights()[0]
embed_weights, embed_weights.shape

(array([[-0.02010073, -0.01231418, -0.03552063, ...,  0.02115062,
          0.01106762,  0.00441056],
        [ 0.03347975, -0.00274766, -0.00770955, ...,  0.01152676,
          0.02452042,  0.00603615],
        [ 0.0138639 ,  0.019765  , -0.07143044, ..., -0.01079245,
          0.0063863 , -0.01466797],
        ...,
        [-0.01604689,  0.0466293 , -0.00550137, ...,  0.01683042,
         -0.03320412, -0.02729634],
        [-0.08475117,  0.05251949, -0.05090337, ..., -0.0360081 ,
          0.09078246, -0.03221019],
        [-0.12288488,  0.06789774, -0.1175648 , ..., -0.07226835,
          0.04321422, -0.09918924]], dtype=float32),
 (10000, 128))

In [57]:
# Create embedding files
import io

with io.open('vectors.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('metadata.tsv', 'w', encoding='utf-8') as out_m:
  for index, word in enumerate(words_in_vocab):
    if index == 0:
      continue  # skip index 0, it's the padding token
    vec = embed_weights[index]
    out_v.write('\t'.join([str(x) for x in vec]) + "\n")
    out_m.write(word + "\n")

In [58]:
# Download files from Colab to upload to projector
try:
  from google.colab import files
  files.download('vectors.tsv')
  files.download('metadata.tsv')
except Exception:
  pass

## Recurrent Neural Networks (RNNs)

📖 **Resources:**

* MIT Deep Learning Lecture on Recurrent Neural Networks - explains the background of recurrent neural networks and introduces LSTMs.
* The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy - demonstrates the power of RNNs with examples generating various sequences.
* Understanding LSTMs by Chris Olah - an in-depth (and technical) look at the mechanics of the LSTM cell, possibly the most popular RNN building block.

### Model 2: LSTM
* LSTM = long short-term memory

In [59]:
# Create an LSTM model
from tensorflow.keras import layers

inputs = layers.Input(shape=(1,), dtype=tf.string)
x = text_vectorizer(inputs)
x = embedding(x) # note: this reuses the embedding layer already trained by model_1
# print(x.shape)
# x = layers.LSTM(units=64, return_sequences=True)(x) # return vector for each word in the Tweet (you can stack RNN cells as long as return_sequences=True)
# print(x.shape)
x = layers.LSTM(units=64)(x)
# print(x.shape)
# x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model_2 = tf.keras.Model(inputs, outputs, name="model_2_LSTM")

In [60]:
# Get a summary
model_2.summary()

* We want the LSTM output shape to be (None, n), i.e. one vector per sentence, because the prediction is made for the *whole* sentence rather than for each word. The default `return_sequences=False` gives exactly that.
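The difference between returning every step and returning only the last one can be shown with a toy recurrent loop. This is deliberately NOT an LSTM: `simple_rnn` is a made-up, heavily simplified recurrence used only to show what `return_sequences` changes about the output.

```python
import math

def simple_rnn(sequence, units, return_sequences=False):
    """Toy recurrent layer (not an LSTM): the hidden state is updated once per token."""
    hidden = [0.0] * units
    outputs = []
    for token_vector in sequence:
        # Mix the current token with the previous hidden state.
        signal = sum(token_vector) / len(token_vector) + sum(hidden) / units
        hidden = [math.tanh(signal)] * units
        outputs.append(hidden)
    # return_sequences=True: one hidden state per token; otherwise only the last one.
    return outputs if return_sequences else hidden

toy_sentence = [[0.1] * 128 for _ in range(15)]  # 15 tokens, 128 features each
per_token = simple_rnn(toy_sentence, units=64, return_sequences=True)
final_only = simple_rnn(toy_sentence, units=64)
print(len(per_token), len(per_token[0]))  # 15 64
print(len(final_only))                    # 64
```

The per-token output is what a stacked recurrent layer needs as input, while the final-only output is what the sigmoid `Dense` layer needs to classify the whole sentence.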

In [61]:
# Compile the model
model_2.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [62]:
# Fit the model
model_2_history = model_2.fit(train_sentences,
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(SAVE_DIR,
                                                                     "model_2_LSTM")])

Saving TensorBoard log files to: model_logs/model_2_LSTM/20250212-090140
Epoch 1/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 15s 43ms/step - accuracy: 0.8771 - loss: 0.3061 - val_accuracy: 0.7874 - val_loss: 0.5106
Epoch 2/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 9s 43ms/step - accuracy: 0.9435 - loss: 0.1548 - val_accuracy: 0.7822 - val_loss: 0.5815
Epoch 3/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 7s 27ms/step - accuracy: 0.9536 - loss: 0.1223 - val_accuracy: 0.7913 - val_loss: 0.6425
Epoch 4/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 15s 50ms/step - accuracy: 0.9573 - loss: 0.1043 - val_accuracy: 0.7822 - val_loss: 0.8299
Epoch 5/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 10s 44ms/step - accuracy: 0.9705 - loss: 0.0846 - val_accuracy: 0.7756 - val_loss: 0.9481


In [63]:
# Make predictions with the LSTM model
model_2_pred_probs = model_2.predict(val_sentences)
model_2_pred_probs[:10]

24/24 ━━━━━━━━━━━━━━━━━━━━ 1s 18ms/step


array([[2.0844117e-03],
       [7.4483633e-01],
       [9.9932951e-01],
       [4.6748340e-02],
       [3.9064864e-04],
       [9.9651778e-01],
       [8.7007093e-01],
       [9.9963659e-01],
       [9.9935180e-01],
       [6.2287283e-01]], dtype=float32)

In [64]:
# Convert model probs to labels
model_2_preds = tf.squeeze(tf.round(model_2_pred_probs))
model_2_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 0., 1., 1., 1., 1., 1.], dtype=float32)>

In [65]:
# Calculate model_2 results
model_2_results = calculate_results(val_labels, model_2_preds)
model_2_results

{'accuracy': 77.55905511811024,
 'precision': 0.7774694686899193,
 'recall': 0.7755905511811023,
 'f1': 0.7734917819402004}

In [66]:
baseline_results

{'accuracy': 79.26509186351706,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706,
 'f1': 0.7862189758049549}

### Model 3: GRU

In [67]:
# Build an RNN using GRU cell
from tensorflow.keras import layers

inputs = layers.Input(shape=(1,), dtype=tf.string)
x = text_vectorizer(inputs)
x = embedding(x)
x = layers.GRU(64)(x)
# x = layers.GRU(64, return_sequences=True)(x) # if need to stack recurrent layers (cells) on top of each other, then use return_sequences=True
# x = layers.LSTM(64, return_sequences=True)(x)
# x = layers.GRU(64)(x)
# x = layers.Dense(64, activation="relu")(x)
# x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model_3 = tf.keras.Model(inputs, outputs, name="model_3_GRU")

In [68]:
model_3.summary()

In [69]:
# Compile the model
model_3.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [70]:
# Fit the model
model_3_history = model_3.fit(train_sentences,
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(SAVE_DIR,
                                                                     "model_3_GRU")])

Saving TensorBoard log files to: model_logs/model_3_GRU/20250212-090237
Epoch 1/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 10s 32ms/step - accuracy: 0.8896 - loss: 0.2757 - val_accuracy: 0.7743 - val_loss: 0.8561
Epoch 2/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 5s 25ms/step - accuracy: 0.9698 - loss: 0.0861 - val_accuracy: 0.7717 - val_loss: 0.7764
Epoch 3/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 10s 45ms/step - accuracy: 0.9691 - loss: 0.0746 - val_accuracy: 0.7703 - val_loss: 0.9678
Epoch 4/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 12s 57ms/step - accuracy: 0.9789 - loss: 0.0519 - val_accuracy: 0.7743 - val_loss: 1.0145
Epoch 5/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 12s 56ms/step - accuracy: 0.9788 - loss: 0.0486 - val_accuracy: 0.7743 - val_loss: 1.1257


In [71]:
# Make predictions with GRU model
model_3_pred_probs = model_3.predict(val_sentences)
model_3_pred_probs[:5]

24/24 ━━━━━━━━━━━━━━━━━━━━ 1s 21ms/step


array([[2.4359652e-03],
       [8.9964104e-01],
       [9.9983168e-01],
       [1.8251564e-01],
       [1.3044092e-04]], dtype=float32)

In [72]:
# Convert pred_probs
model_3_preds = tf.squeeze(tf.round(model_3_pred_probs))
model_3_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 0., 1., 1., 1., 1., 1.], dtype=float32)>

In [73]:
# Calculate results
model_3_results = calculate_results(val_labels,
                                    model_3_preds)
model_3_results

{'accuracy': 77.42782152230971,
 'precision': 0.7752857985262857,
 'recall': 0.7742782152230971,
 'f1': 0.7725974162749719}

### Model 4: Bidirectional RNN

In [74]:
# Build a Bidirectional RNN in TensorFlow
from tensorflow.keras import layers

inputs = layers.Input(shape=(1,), dtype=tf.string)
x = text_vectorizer(inputs)
x = embedding(x)
# x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
# x = layers.Bidirectional(layers.GRU(64))(x)
x = layers.Bidirectional(layers.LSTM(64))(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model_4 = tf.keras.Model(inputs, outputs, name="model_4_bidirectional")


In [75]:
# Summary
model_4.summary()

In [76]:
# Compile model
model_4.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [77]:
# Fit the model
model_4_history = model_4.fit(train_sentences,
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(SAVE_DIR, "model_4_bidirectional")])

Saving TensorBoard log files to: model_logs/model_4_bidirectional/20250212-090337
Epoch 1/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 19s 55ms/step - accuracy: 0.9608 - loss: 0.1874 - val_accuracy: 0.7717 - val_loss: 0.8450
Epoch 2/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 14s 66ms/step - accuracy: 0.9752 - loss: 0.0600 - val_accuracy: 0.7730 - val_loss: 1.1020
Epoch 3/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 8s 36ms/step - accuracy: 0.9788 - loss: 0.0495 - val_accuracy: 0.7730 - val_loss: 1.3420
Epoch 4/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 9s 44ms/step - accuracy: 0.9827 - loss: 0.0402 - val_accuracy: 0.7756 - val_loss: 1.4151
Epoch 5/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 14s 63ms/step - accuracy: 0.9808 - loss: 0.0379 - val_accuracy: 0.7756 - val_loss: 1.4886


In [78]:
# Make prediction probabilities
model_4_pred_probs = model_4.predict(val_sentences)
model_4_pred_probs[:10]

24/24 ━━━━━━━━━━━━━━━━━━━━ 1s 28ms/step


array([[2.4069771e-04],
       [6.3709480e-01],
       [9.9996454e-01],
       [3.6550742e-02],
       [1.4389761e-05],
       [9.9927181e-01],
       [3.0756420e-01],
       [9.9998420e-01],
       [9.9997330e-01],
       [9.9730951e-01]], dtype=float32)

In [79]:
# Convert them to classes
model_4_preds = tf.squeeze(tf.round(model_4_pred_probs))
model_4_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 0., 1., 0., 1., 1., 1.], dtype=float32)>

In [80]:
model_4_results = calculate_results(val_labels,
                                    model_4_preds)
model_4_results

{'accuracy': 77.55905511811024,
 'precision': 0.7780461459912817,
 'recall': 0.7755905511811023,
 'f1': 0.7732287214395843}

## Convolutional Neural Networks for Text (and other types of sequences)

### Model 5: Conv1D

In [81]:
# Test embedding layer + Conv1D + pooling layer
from tensorflow.keras import layers
embedding_test = embedding(text_vectorizer(["this is a test sentence"])) # turn the test sentence into an embedding
conv_1d = layers.Conv1D(filters=32,
                        kernel_size=5, # look at 5 tokens at a time
                        strides=1, # slide the window 1 token at a time
                        activation="relu",
                        padding="same") # default is "valid"; "same" pads so the output has the same length as the input
conv_1d_output = conv_1d(embedding_test) # pass test data through conv1d
max_pool = layers.GlobalMaxPool1D()
max_pool_output = max_pool(conv_1d_output) # keep the highest value of each filter across all positions

embedding_test.shape, conv_1d_output.shape, max_pool_output.shape

(TensorShape([1, 15, 128]), TensorShape([1, 15, 32]), TensorShape([1, 32]))
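
The sequence length stays at 15 because of `padding="same"`. A small sketch of the usual output-length arithmetic for a 1D convolution (the helper name `conv1d_output_length` is hypothetical) shows what `padding="valid"` would give for the same input:

```python
import math

def conv1d_output_length(input_length, kernel_size, stride=1, padding="valid"):
    """Number of output timesteps of a 1D convolution (no dilation)."""
    if padding == "same":
        # Input is padded so every stride position yields an output
        return math.ceil(input_length / stride)
    if padding == "valid":
        # Only windows that fit entirely inside the input are used
        return (input_length - kernel_size) // stride + 1
    raise ValueError(f"unknown padding: {padding!r}")

same_len = conv1d_output_length(15, kernel_size=5, stride=1, padding="same")
valid_len = conv1d_output_length(15, kernel_size=5, stride=1, padding="valid")
print(same_len, valid_len)
```

The last dimension of the output (32 above) is the number of filters and does not depend on the padding mode.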

In [82]:
embedding_test

<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[ 0.01568591,  0.05885512,  0.05171633, ..., -0.06279838,
          0.00145842,  0.04424084],
        [ 0.03612487,  0.00231235, -0.07101807, ..., -0.0023768 ,
         -0.03405015, -0.02100527],
        [ 0.09796873,  0.03493737, -0.00954776, ..., -0.06569791,
         -0.0063347 , -0.02681907],
        ...,
        [-0.01313421,  0.02544015,  0.00934834, ...,  0.03825232,
          0.00150464, -0.03104128],
        [-0.01313421,  0.02544015,  0.00934834, ...,  0.03825232,
          0.00150464, -0.03104128],
        [-0.01313421,  0.02544015,  0.00934834, ...,  0.03825232,
          0.00150464, -0.03104128]]], dtype=float32)>

In [83]:
conv_1d_output

<tf.Tensor: shape=(1, 15, 32), dtype=float32, numpy=
array([[[0.00786657, 0.        , 0.        , 0.01283585, 0.        ,
         0.        , 0.03179513, 0.0352383 , 0.        , 0.01100552,
         0.        , 0.02086826, 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.03709872, 0.        , 0.        ,
         0.        , 0.01483095, 0.        , 0.        , 0.01692111,
         0.01564269, 0.02387404, 0.00857786, 0.        , 0.        ,
         0.        , 0.02529726],
        [0.        , 0.        , 0.05988023, 0.        , 0.        ,
         0.        , 0.        , 0.06045369, 0.00328922, 0.0723854 ,
         0.        , 0.        , 0.        , 0.02463978, 0.        ,
         0.05240652, 0.        , 0.        , 0.        , 0.0280173 ,
         0.06634452, 0.08328752, 0.00468639, 0.        , 0.05218075,
         0.07523426, 0.03297751, 0.        , 0.00379812, 0.06884678,
         0.        , 0.08287168],
        [0.        , 0.        , 0.02023985, 0.    

In [84]:
max_pool_output

<tf.Tensor: shape=(1, 32), dtype=float32, numpy=
array([[0.00786657, 0.05773418, 0.05988023, 0.01503959, 0.03977615,
        0.03229179, 0.04966332, 0.06045369, 0.05850874, 0.09154747,
        0.11040293, 0.02086826, 0.02061229, 0.13240817, 0.08342717,
        0.05240652, 0.07059032, 0.03709872, 0.04579554, 0.05209113,
        0.06634452, 0.08328752, 0.03919273, 0.        , 0.05218075,
        0.07523426, 0.03297751, 0.02445159, 0.03750114, 0.06884678,
        0.        , 0.08287168]], dtype=float32)>
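
Global max pooling collapses the sequence axis by keeping, for each filter, its largest activation over all timesteps. A plain-Python sketch on a toy feature map (4 timesteps, 3 filters; the values and the helper name are made up for illustration):

```python
def global_max_pool_1d(sequence):
    """sequence: list of timesteps, each a list of per-filter activations."""
    # zip(*sequence) regroups the values by filter instead of by timestep
    return [max(filter_values) for filter_values in zip(*sequence)]

feature_map = [
    [0.1, 0.0, 0.3],
    [0.4, 0.2, 0.0],
    [0.0, 0.5, 0.1],
    [0.2, 0.1, 0.0],
]
pooled = global_max_pool_1d(feature_map)
print(pooled)  # [0.4, 0.5, 0.3]
```

This is why the shape goes from `(1, 15, 32)` to `(1, 32)`: one value per filter, regardless of sentence length.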

In [91]:
# Create a 1D convolutional model for sequences
from tensorflow.keras import layers

inputs = layers.Input(shape=(1,), dtype=tf.string)
x = text_vectorizer(inputs)
x = embedding(x)
x = layers.Conv1D(filters=64,
                  kernel_size=5,
                  activation="relu",
                  padding="valid",
                  strides=1)(x)
x = layers.GlobalMaxPool1D()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model_5 = tf.keras.Model(inputs, outputs, name="model_5_Conv1D")

# Compile model
model_5.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

# Get summary
model_5.summary()

In [92]:
# Fit the model
model_5_history = model_5.fit(train_sentences,
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(SAVE_DIR, "Conv1D")])

Saving TensorBoard log files to: model_logs/Conv1D/20250212-091336
Epoch 1/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 9s 20ms/step - accuracy: 0.9291 - loss: 0.1916 - val_accuracy: 0.7625 - val_loss: 0.9129
Epoch 2/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 4s 18ms/step - accuracy: 0.9684 - loss: 0.0777 - val_accuracy: 0.7638 - val_loss: 1.1212
Epoch 3/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 5s 17ms/step - accuracy: 0.9806 - loss: 0.0528 - val_accuracy: 0.7585 - val_loss: 1.1609
Epoch 4/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 5s 16ms/step - accuracy: 0.9779 - loss: 0.0542 - val_accuracy: 0.7677 - val_loss: 1.1835
Epoch 5/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 5s 21ms/step - accuracy: 0.9777 - loss: 0.0548 - val_accuracy: 0.7559 - val_loss: 1.2270


In [93]:
# Make predictions with Conv1D
model_5_pred_probs = model_5.predict(val_sentences)
model_5_pred_probs[:10]

24/24 ━━━━━━━━━━━━━━━━━━━━ 1s 15ms/step


array([[4.2430183e-01],
       [4.9020964e-01],
       [9.9995488e-01],
       [2.9453598e-02],
       [1.9030706e-07],
       [9.9961901e-01],
       [9.8930651e-01],
       [9.9998796e-01],
       [9.9999982e-01],
       [9.0671307e-01]], dtype=float32)

In [95]:
# Convert to labels
model_5_preds = tf.squeeze(tf.round(model_5_pred_probs))
model_5_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 0., 1., 0., 0., 1., 1., 1., 1., 1.], dtype=float32)>

In [96]:
# Evaluate model_5
model_5_results = calculate_results(val_labels, model_5_preds)
model_5_results

{'accuracy': 75.59055118110236,
 'precision': 0.7560615681399698,
 'recall': 0.7559055118110236,
 'f1': 0.7544511527713377}

## Model 6: Pre-trained Sentence Encoder

In [98]:
import tensorflow_hub as hub
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
# embed = hub.load("https://www.kaggle.com/models/google/universal-sentence-encoder/TensorFlow2/cmlm-en-base/1")
embed_samples = embed([sample_sentence,
                       "when you call USE on a sentence, it turns it into numbers."])
print(embed_samples[0][:50])

tf.Tensor(
[-0.01157025  0.02485911  0.02878051 -0.012715    0.03971541  0.08827761
  0.02680988  0.05589838 -0.01068731 -0.00597293  0.00639321 -0.01819516
  0.00030816  0.09105889  0.05874645 -0.03180629  0.01512474 -0.05162925
  0.00991366 -0.06865345 -0.04209306  0.0267898   0.03011009  0.00321065
 -0.00337968 -0.04787356  0.0226672  -0.00985927 -0.04063615 -0.01292093
 -0.04666382  0.05630299 -0.03949255  0.00517682  0.02495827 -0.07014439
  0.0287151   0.0494768  -0.00633978 -0.08960193  0.02807119 -0.00808364
 -0.01360601  0.05998649 -0.10361788 -0.05195372  0.00232958 -0.02332531
 -0.03758106  0.03327729], shape=(50,), dtype=float32)


In [99]:
embed_samples

<tf.Tensor: shape=(2, 512), dtype=float32, numpy=
array([[-0.01157025,  0.02485911,  0.02878051, ..., -0.00186124,
         0.02315822, -0.01485021],
       [ 0.02665618, -0.09856853,  0.03531171, ...,  0.00939861,
         0.02583151,  0.01637121]], dtype=float32)>

In [100]:
# Create a Keras Layer with USE
sentence_encoder_layer = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",
                                        input_shape=[], # empty shape: each input is a single string of any length
                                        dtype=tf.string,
                                        trainable=False,
                                        name="USE")

In [118]:
import tf_keras
import datetime

def create_tuned_tensorboard_callback(dir_name, experiment_name):
  """
  Creates a TensorBoard callback instance to store log files.

  Stores log files with the filepath:
    "dir_name/experiment_name/current_datetime/"

  Args:
    dir_name: target directory to store TensorBoard log files
    experiment_name: name of experiment directory (e.g. efficientnet_model_1)
  """
  log_dir = dir_name + "/" + experiment_name + "/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
  tensorboard_callback = tf_keras.callbacks.TensorBoard(
      log_dir=log_dir
  )
  print(f"Saving TensorBoard log files to: {log_dir}")
  return tensorboard_callback
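
The log directory is just three path segments joined by `/`, with a timestamp as the last one. A minimal sketch of that string construction without TensorFlow; `build_log_dir` and its `now` parameter are hypothetical, added here so the timestamp can be fixed for the example:

```python
import datetime

def build_log_dir(dir_name, experiment_name, now=None):
    """Return "dir_name/experiment_name/YYYYmmdd-HHMMSS"."""
    now = now or datetime.datetime.now()
    return f"{dir_name}/{experiment_name}/{now.strftime('%Y%m%d-%H%M%S')}"

# Fixed timestamp chosen to match the run logged later in this notebook
log_dir = build_log_dir("model_logs", "tf_hub_sentence_encoder",
                        now=datetime.datetime(2025, 2, 12, 10, 32, 16))
print(log_dir)  # model_logs/tf_hub_sentence_encoder/20250212-103216
```

Because the timestamp has one-second resolution, each new call to `fit` normally writes to its own directory, which keeps TensorBoard runs separate.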

In [139]:
# Create model using Sequential API
import tf_keras

model_6 = tf_keras.Sequential([
    sentence_encoder_layer,
    tf_keras.layers.Dense(64, activation="relu"),
    tf_keras.layers.Dense(1, activation="sigmoid")
], name="model_6_USE")

# compile
model_6.compile(loss="binary_crossentropy",
                optimizer=tf_keras.optimizers.Adam(),
                metrics=["accuracy"])

model_6.summary()

Model: "model_6_USE"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 USE (KerasLayer)            (None, 512)               256797824 
                                                                 
 dense_11 (Dense)            (None, 64)                32832     
                                                                 
 dense_12 (Dense)            (None, 1)                 65        
                                                                 
Total params: 256830721 (979.73 MB)
Trainable params: 32897 (128.50 KB)
Non-trainable params: 256797824 (979.61 MB)
_________________________________________________________________
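
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. A short sketch (the helper name `dense_params` is illustrative; the USE parameter count is copied from the summary):

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

use_params = 256_797_824          # frozen USE layer, from the summary above
hidden = dense_params(512, 64)    # 512-dim sentence embedding -> 64 units
output = dense_params(64, 1)      # 64 units -> 1 sigmoid output

trainable = hidden + output
total = use_params + trainable
trainable_kb = trainable * 4 / 1024  # float32 = 4 bytes per parameter

print(hidden, output, trainable, total, round(trainable_kb, 2))
```

Only the two dense layers are updated during training, since the encoder was created with `trainable=False`.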


In [140]:
# Train a classifier model
model_6_history = model_6.fit(train_sentences,
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tuned_tensorboard_callback(SAVE_DIR,
                                                                           "tf_hub_sentence_encoder")])

Saving TensorBoard log files to: model_logs/tf_hub_sentence_encoder/20250212-103216
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [141]:
model_6.evaluate(val_sentences, val_labels)



[0.4254055917263031, 0.8215222954750061]

In [142]:
# Make preds with USE TF Hub model
model_6_pred_probs = model_6.predict(val_sentences)
model_6_pred_probs[:10]



array([[0.19232267],
       [0.77840394],
       [0.98938805],
       [0.22888856],
       [0.72507215],
       [0.7397318 ],
       [0.9820527 ],
       [0.98399913],
       [0.9402001 ],
       [0.09925094]], dtype=float32)

In [143]:
# Convert to labels
model_6_preds = tf.squeeze(tf.round(model_6_pred_probs))
model_6_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 1., 1., 1., 1., 1., 0.], dtype=float32)>

In [144]:
# Calculate result
model_6_results = calculate_results(val_labels,
                                    model_6_preds)
model_6_results

{'accuracy': 82.1522309711286,
 'precision': 0.8241317499642585,
 'recall': 0.821522309711286,
 'f1': 0.82000293386527}

In [126]:
baseline_results

{'accuracy': 79.26509186351706,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706,
 'f1': 0.7862189758049549}

In [145]:
len(train_df)

7613

## Model 7: TF Hub Pretrained USE model but only 10% training data

In [146]:
# Create subsets of 10% of the training data
train_df_shuffled.head()

Unnamed: 0,id,keyword,location,text,target
2644,3796,destruction,,So you have a new weapon that can cause un-ima...,1
2227,3185,deluge,,The f$&amp;@ing things I do for #GISHWHES Just...,0
5448,7769,police,UK,DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe...,1
132,191,aftershock,,Aftershock back to school kick off was great. ...,0
6845,9810,trauma,"Montgomery County, MD",in response to trauma Children of Addicts deve...,0


In [165]:
# # Sampling 10% from the full shuffled DataFrame leaks data: some sampled rows are also in the validation set
# # (with this split, model_7 trained on 10% of the data outperformed model_6 trained on 100%)
# # DO NOT MAKE SPLITS THAT LEAK VALIDATION/TEST DATA INTO THE TRAINING SET

# train_10_percent = train_df_shuffled[["text", "target"]].sample(frac=0.1, random_state=42)
# train_sentences_10_percent = train_10_percent["text"].to_list()
# train_labels_10_percent = train_10_percent["target"].to_list()

In [169]:
# Make a better split (no data leakage): take the first 10% of the training split only
train_10_percent_split = int(0.1 * len(train_sentences))
train_sentences_10_percent = train_sentences[:train_10_percent_split]
train_labels_10_percent = train_labels[:train_10_percent_split]
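
The difference between the two splitting strategies can be reproduced with row indices alone. The sketch below assumes a 90/10 train/validation split of the 7,613 rows (consistent with the 685 training labels counted in the next cell); the variable names and random seeds are arbitrary choices for illustration:

```python
import random

n_rows = 7613                       # len(train_df) reported above
row_ids = list(range(n_rows))
random.Random(42).shuffle(row_ids)  # stand-in for the shuffled DataFrame

# Assumption: 90% of the rows were used for training, 10% for validation
n_train = int(0.9 * n_rows)
train_ids = row_ids[:n_train]
val_ids = set(row_ids[n_train:])

# Leak-free: slice the first 10% of the training split only
subset_from_train = train_ids[:int(0.1 * len(train_ids))]

# Leaky: sample 10% of all rows, which can pick up validation rows
subset_from_all = random.Random(0).sample(row_ids, int(0.1 * n_rows))

overlap_clean = len(val_ids.intersection(subset_from_train))
overlap_leaky = len(val_ids.intersection(subset_from_all))
print(len(subset_from_train), overlap_clean)
print(len(subset_from_all), overlap_leaky)
```

Any non-zero overlap means the model is evaluated on examples it was trained on, so its validation score overstates how well it generalizes.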

In [171]:
pd.Series(train_labels_10_percent).value_counts()

Unnamed: 0,count
0,406
1,279


In [152]:
train_10_percent["target"].value_counts()

target,count
0,413
1,348


In [153]:
train_df_shuffled["target"].value_counts()

target,count
0,4342
1,3271
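
To compare class balance across the three sets, the counts printed above can be converted into the fraction of positive (`target == 1`) examples. A small sketch using those counts; the dictionary keys are descriptive labels, not notebook variables:

```python
def positive_fraction(n_negative, n_positive):
    """Share of examples labelled 1."""
    return n_positive / (n_negative + n_positive)

# (count of target 0, count of target 1), copied from the outputs above
splits = {
    "first 10% of training split": (406, 279),
    "10% sampled from all rows": (413, 348),
    "full dataset": (4342, 3271),
}

fractions = {name: positive_fraction(*counts) for name, counts in splits.items()}
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} positive")
```

The leak-free subset is within a few percentage points of the full dataset's balance, so a plain slice of the already shuffled training data is a reasonable 10% sample here.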


In [156]:
# # clone_model cannot be used here because the model contains a hub.KerasLayer
# # model_7 = tf.keras.models.clone_model(model_6)

# # Compile model
# model_7.compile(loss="binary_crossentropy",
#                 optimizer=tf.keras.optimizers.Adam(),
#                 metrics=["accuracy"])

# # Summary
# model_7.summary()

In [172]:
# Create model using Sequential API
import tf_keras

model_7 = tf_keras.Sequential([
    sentence_encoder_layer,
    tf_keras.layers.Dense(64, activation="relu"),
    tf_keras.layers.Dense(1, activation="sigmoid")
], name="model_7_USE_10_percent")

# compile
model_7.compile(loss="binary_crossentropy",
                optimizer=tf_keras.optimizers.Adam(),
                metrics=["accuracy"])

model_7.summary()

Model: "model_7_USE_10_percent"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 USE (KerasLayer)            (None, 512)               256797824 
                                                                 
 dense_15 (Dense)            (None, 64)                32832     
                                                                 
 dense_16 (Dense)            (None, 1)                 65        
                                                                 
Total params: 256830721 (979.73 MB)
Trainable params: 32897 (128.50 KB)
Non-trainable params: 256797824 (979.61 MB)
_________________________________________________________________


In [173]:
# Train a classifier model
model_7_history = model_7.fit(train_sentences_10_percent,
                              train_labels_10_percent,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tuned_tensorboard_callback(SAVE_DIR,
                                                                           "tf_hub_sentence_encoder_10_percent_correct_split")])

Saving TensorBoard log files to: model_logs/tf_hub_sentence_encoder_10_percent_correct_split/20250212-122759
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [174]:
# Make preds
model_7_pred_probs = model_7.predict(val_sentences)
model_7_pred_probs[:10]



array([[0.23507833],
       [0.6064389 ],
       [0.9002722 ],
       [0.36477625],
       [0.5474697 ],
       [0.7048228 ],
       [0.8859475 ],
       [0.8175506 ],
       [0.85086656],
       [0.16813658]], dtype=float32)

In [175]:
# Turn into labels
model_7_preds = tf.squeeze(tf.round(model_7_pred_probs))
model_7_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 1., 1., 1., 1., 1., 0.], dtype=float32)>

In [176]:
# Evaluate
model_7_results = calculate_results(val_labels,
                                    model_7_preds)
model_7_results

{'accuracy': 78.4776902887139,
 'precision': 0.7849494067766415,
 'recall': 0.7847769028871391,
 'f1': 0.7837862430999359}

In [163]:
model_6_results

{'accuracy': 82.1522309711286,
 'precision': 0.8241317499642585,
 'recall': 0.821522309711286,
 'f1': 0.82000293386527}
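
For a quick side-by-side view, the F1 scores reported so far in this section can be ranked in plain Python. The values are copied from the result dictionaries above; the `f1_scores` dictionary itself is only an illustration and not a variable from the notebook:

```python
# F1 scores copied from the result dictionaries printed above
f1_scores = {
    "baseline": 0.7862189758049549,
    "model_4": 0.7732287214395843,
    "model_5": 0.7544511527713377,
    "model_6": 0.82000293386527,
    "model_7": 0.7837862430999359,
}

# Sort from best to worst F1
ranked = sorted(f1_scores.items(), key=lambda item: item[1], reverse=True)
for name, f1 in ranked:
    print(f"{name:<10} {f1:.4f}")
```

On these numbers the USE model trained on all the data leads, and the same architecture trained on the leak-free 10% subset lands just below the baseline.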