<a href="https://colab.research.google.com/github/ucheokechukwu/ml_tensorflow_deeplearning/blob/main/08_introduction_to_nlp_in_tensorflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to NLP fundamentals in TensorFlow

NLP (natural language processing) aims to derive information from natural language, which may be a sequence of text or speech.

Many NLP problems are also framed as sequence-to-sequence (seq2seq) problems.

In [1]:
## check for GPU
!nvidia-smi -L

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.



In [2]:
# get helper functions
!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py
from helper_functions import unzip_data, create_tensorboard_callback, plot_loss_curves, compare_historys

--2023-03-07 16:27:01--  https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10246 (10K) [text/plain]
Saving to: ‘helper_functions.py.1’


2023-03-07 16:27:01 (67.0 MB/s) - ‘helper_functions.py.1’ saved [10246/10246]



## Get a text dataset
Kaggle's introduction to NLP dataset: text samples of tweets labelled as disaster or not disaster.
- Task: binary classification
- Source: https://www.kaggle.com/c/nlp-getting-started

In [3]:
!wget https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
unzip_data("nlp_getting_started.zip")


--2023-03-07 16:27:11--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.203.128, 142.251.107.128, 173.194.214.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.203.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip.1’


2023-03-07 16:27:11 (96.7 MB/s) - ‘nlp_getting_started.zip.1’ saved [607343/607343]



## Visualizing a text dataset

To visualize our text samples, we first have to read them in. We can do so using pandas.

In [4]:
import pandas as pd
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [5]:
train_df["text"][20]

'this is ridiculous....'

In [6]:
train_df_shuffled = train_df.sample(frac=1, random_state=42)

In [7]:
train_df_shuffled.head()

Unnamed: 0,id,keyword,location,text,target
2644,3796,destruction,,So you have a new weapon that can cause un-ima...,1
2227,3185,deluge,,The f$&amp;@ing things I do for #GISHWHES Just...,0
5448,7769,police,UK,DT @georgegalloway: RT @Galloway4Mayor: ¬â√õ√èThe...,1
132,191,aftershock,,Aftershock back to school kick off was great. ...,0
6845,9810,trauma,"Montgomery County, MD",in response to trauma Children of Addicts deve...,0


In [8]:
# what does the test dataframe look like?
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [9]:
# how many examples of each class are there?
train_df.target.value_counts()

0    4342
1    3271
Name: target, dtype: int64

In [10]:
# how many total samples
len(train_df), len(test_df)

(7613, 3263)

In [11]:
# let's visualize some random training examples
import random
random_index = random.randint(0,len(train_df)-5)
for row in train_df_shuffled[["text", "target"]][random_index:random_index+5].itertuples():
  _, text, target = row
  print(f"Target: {target}", "(real disaster)" if target > 0 else "(not real disaster)")
  print(f"Text:\n{text}\n")
  print("---\n")

Target: 0 (not real disaster)
Text:
If you're slating @gpaulista5 for @JackWilshere's injury then you're a disgrace to the #AFC fan base. Injuries happen you cunts!

---

Target: 0 (not real disaster)
Text:
I liked a @YouTube video from @jeromekem http://t.co/Nq89drydbU DJ Hazard - Death Sport

---

Target: 0 (not real disaster)
Text:
@nalathekoala As a health care professional that deals all gun violence sequalae I consider suicides injuries accidents and homicides

---

Target: 1 (real disaster)
Text:
Wreckage 'Conclusively Confirmed' as From MH370: Malaysia PM: Investigators and the families of those who were... http://t.co/5EBpYbFH4D

---

Target: 1 (real disaster)
Text:
#deai #??? #??? #??? Suicide bomber kills 15 in Saudi security site mosque - Reuters  http://t.co/SqydkslFzp

---



### Split data into training and validation sets

In [12]:
from sklearn.model_selection import train_test_split
train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df_shuffled["text"].to_numpy(),
                                                                             train_df_shuffled["target"].to_numpy(),
                                                                             test_size=0.1,
                                                                             random_state=42)
len(train_sentences), len(val_sentences), len(train_labels), len(val_labels)

(6851, 762, 6851, 762)

In [13]:
# Check the first ten examples
train_sentences[:10], train_labels[:10]

(array(['@mogacola @zamtriossu i screamed after hitting tweet',
        'Imagine getting flattened by Kurt Zouma',
        '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....',
        "@shakjn @C7 @Magnums im shaking in fear he's gonna hack the planet",
        'Somehow find you and I collide http://t.co/Ee8RpOahPk',
        '@EvaHanderek @MarleyKnysh great times until the bus driver held us hostage in the mall parking lot lmfao',
        'destroy the free fandom honestly',
        'Weapons stolen from National Guard Armory in New Albany still missing #Gunsense http://t.co/lKNU8902JE',
        '@wfaaweather Pete when will the heat wave pass? Is it really going to be mid month? Frisco Boy Scouts have a canoe trip in Okla.',
        'Patient-reported outcomes in long-term survivors of metastatic colorectal cancer - British Journal of Surgery http://t.co/5Yl4DC1Tqt'],
       dtype=object), array([0, 

## Converting text into numbers

When dealing with a text problem, one of the first things you need to do is numerically encode the text.

Methods:

1. Tokenization - a direct mapping from each token (word or character) to a number, or a one-hot encoding.

2. Embedding - a matrix of feature vectors, one per token. The vector size can be chosen, and the embedding, which is essentially a matrix of weights, can be learned.
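
To make the two methods concrete, here is a minimal sketch on a toy corpus. The sentences, the `vocab` dictionary and the random `embedding_matrix` are made up for illustration; they are not the notebook's `TextVectorization` or `Embedding` layers.

```python
import numpy as np

# Toy corpus (hypothetical, not taken from the tweet dataset)
corpus = ["there is a flood", "there is a fire"]

# 1. Tokenization: map each unique word to an integer index.
#    Index 0 is left free for padding.
vocab = {}
for sentence in corpus:
    for word in sentence.split():
        if word not in vocab:
            vocab[word] = len(vocab) + 1

token_ids = [[vocab[w] for w in s.split()] for s in corpus]
print(token_ids)  # [[1, 2, 3, 4], [1, 2, 3, 5]]

# One-hot encoding: each token becomes a vector with a single 1.
vocab_size = len(vocab) + 1  # +1 for the padding index
one_hot = np.eye(vocab_size)[token_ids[0]]
print(one_hot.shape)  # (4, 6)

# 2. Embedding: each token index selects one row of a weight matrix.
#    Here the matrix is random; in a model it would be learned.
rng = np.random.default_rng(42)
embedding_matrix = rng.normal(size=(vocab_size, 3))
embedded = embedding_matrix[token_ids[0]]
print(embedded.shape)  # (4, 3)
```

One-hot vectors grow with the vocabulary size, while the embedding vectors keep whatever length you choose (3 here).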

## Text vectorization (tokenization)

In [14]:
import tensorflow as tf
# from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.layers import TextVectorization

In [15]:
# Use the default TextVectorization parameters
text_vectorizer = TextVectorization(max_tokens=None,
                                    standardize="lower_and_strip_punctuation",
                                    split="whitespace",
                                    ngrams=None,
                                    output_mode="int",
                                    output_sequence_length=None
                                    )

In [16]:
# find the average number of tokens (words) in the training tweets

In [17]:
len(train_sentences[0].split())


7

In [18]:

round(sum([len(i.split()) for i in train_sentences])/len(train_sentences))

15

In [19]:
# set up text vectorization variables
max_vocab_length = 10000 #max number of words to have in our vocabulary
max_length = 15 # max length our sequences will be (e.g. how many words from a tweet does our model see?)

text_vectorizer = TextVectorization(max_tokens=max_vocab_length,
                                    output_mode="int",
                                    output_sequence_length=max_length)

# fit the text vectorizer to the training text
text_vectorizer.adapt(train_sentences)

In [20]:
sample_sentence="there is a flood in my street!"
text_vectorizer([sample_sentence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[ 74,   9,   3, 232,   4,  13, 698,   0,   0,   0,   0,   0,   0,
          0,   0]])>

* Note that the output shape is (1, 15): **1** because we passed in one sequence, and **15** because `max_length` is 15. Shorter sequences are padded with zeros at the end.
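
The fixed output length can be mimicked in plain Python. This is a rough sketch, not how `TextVectorization` is implemented: `pad_or_truncate` is a hypothetical helper, and it assumes that sequences longer than the limit keep their leading tokens.

```python
def pad_or_truncate(token_ids, max_length, pad_value=0):
    """Return a list of exactly max_length token ids."""
    trimmed = token_ids[:max_length]  # drop tokens beyond the limit
    padding = [pad_value] * (max_length - len(trimmed))  # fill the rest
    return trimmed + padding

# Token ids of the sample sentence shown above
short = [74, 9, 3, 232, 4, 13, 698]
print(pad_or_truncate(short, 15))
# [74, 9, 3, 232, 4, 13, 698, 0, 0, 0, 0, 0, 0, 0, 0]

# A made-up sequence of 20 tokens gets cut down to 15
long_sequence = list(range(1, 21))
print(len(pad_or_truncate(long_sequence, 15)))  # 15
```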

In [21]:
text_vectorizer(["there is a man in my backyard!"])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[  74,    9,    3,   89,    4,   13, 6143,    0,    0,    0,    0,
           0,    0,    0,    0]])>

In [22]:
random_sentence = random.choice(train_sentences)
print(f"Original text: \n{random_sentence}\n\n\nVectorized Version: {text_vectorizer([random_sentence])}")

Original text: 
The bomb was so appropriate ?? seen as my family and most Jamaicans love shout bullets !


Vectorized Version: [[   2  108   23   28    1  834   26   13  302    7  230    1  110 4632
  6053]]


In [23]:
# get the unique words in the vocabulary
words_in_vocab = text_vectorizer.get_vocabulary() # get all the unique words in our training data
top_10_words = words_in_vocab[:10]
bottom_10_words = words_in_vocab[-10:]
print(f"Number of words in vocab: {len(words_in_vocab)} \n\n10 most common words: \n{top_10_words}\n\n10 least common words: \n{bottom_10_words}")
# '' is the padding token; [UNK] stands for unknown words, i.e. words outside the 10,000-word vocabulary

Number of words in vocab: 10000 

10 most common words: 
['', '[UNK]', 'the', 'a', 'in', 'to', 'of', 'and', 'i', 'is']

10 least common words: 
['painthey', 'painful', 'paine', 'paging', 'pageshi', 'pages', 'paeds', 'pads', 'padres', 'paddytomlinson1']


## Text vectorization (embedding)
`tf.keras.layers.Embedding` turns positive integers (token indices) into dense vectors of fixed size.
See https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding

The parameters we care most about for our embedding layer:
* `input_dim` - the size of our vocabulary
* `output_dim` - the size of the output embedding vector e.g. a value of 100 means each token gets represented by a vector of length 100
* `input_length` - the length of sequences passed into the embedding layer (in this case, it's 15)
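
As a sanity check on these parameters, the sketch below treats an embedding as a plain lookup table in NumPy. The small `weights` matrix and the `token_ids` are made up for illustration; only `input_dim=10000` and `output_dim=128` come from this notebook.

```python
import numpy as np

# Sizes used in this notebook: the layer stores one vector per vocabulary entry
input_dim, output_dim = 10000, 128
print(input_dim * output_dim)  # 1280000 trainable weights

# Small made-up example: 6 tokens in the vocabulary, vectors of length 4
rng = np.random.default_rng(0)
weights = rng.normal(size=(6, 4))
token_ids = np.array([2, 5, 0])

# An embedding lookup simply selects rows of the weight matrix...
lookup = weights[token_ids]

# ...which gives the same result as multiplying one-hot vectors by the matrix
one_hot = np.eye(6)[token_ids]
via_matmul = one_hot @ weights

print(lookup.shape)                     # (3, 4)
print(np.allclose(lookup, via_matmul))  # True
```

The product 10000 x 128 = 1,280,000 matches the parameter count that `model.summary()` reports for the embedding layer later in this notebook.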

In [24]:
from tensorflow.keras import layers
embedding = layers.Embedding(input_dim=max_vocab_length, # size of the vocabulary
                             output_dim=128, # length of each embedding vector (multiples of 8 tend to map well onto GPU hardware)
                             input_length=max_length # how long is each input
)

In [25]:
# test on random sentences from the training set
random_sentence = random.choice(train_sentences)
print(f"Original text: \n{random_sentence}\n\nEmbedded version:")
# embed the random sentence (turn it into dense vectors of fixed size)
sample_embed = embedding(text_vectorizer(random_sentence))
sample_embed

Original text: 
(?EudryLantiqua?) Hollywood Movie About Trapped Miners Released in Chile: 'The 33' Holly... http://t.co/us1DMdXZVb (?EudryLantiqua?)

Embedded version:


<tf.Tensor: shape=(15, 128), dtype=float32, numpy=
array([[ 0.00416181, -0.00838618, -0.01397897, ...,  0.04617992,
        -0.01138272, -0.03373402],
       [-0.00469732, -0.04819163, -0.03242251, ...,  0.03079052,
         0.01849658, -0.00578275],
       [-0.03696836, -0.02616271, -0.01970369, ...,  0.0153378 ,
         0.01843512, -0.04855213],
       ...,
       [-0.02105641,  0.02023664,  0.01888357, ...,  0.03418375,
         0.0192186 ,  0.04007771],
       [ 0.00416181, -0.00838618, -0.01397897, ...,  0.04617992,
        -0.01138272, -0.03373402],
       [-0.03183977,  0.0232077 , -0.00210217, ..., -0.01860523,
        -0.02940828,  0.01409498]], dtype=float32)>

In [26]:
sample_embed = tf.expand_dims(sample_embed, axis=0)

In [27]:
# check out a single token's embedding
sample_embed[0][0], sample_embed[0][0].shape, random_sentence

(<tf.Tensor: shape=(128,), dtype=float32, numpy=
 array([ 0.00416181, -0.00838618, -0.01397897,  0.03895357, -0.0374434 ,
        -0.03676909,  0.02437769,  0.0207624 ,  0.0278406 ,  0.01251978,
        -0.03917879, -0.04530257,  0.03253689,  0.01345695,  0.04433126,
        -0.01718793,  0.01765994,  0.04658303,  0.00276833,  0.0042421 ,
        -0.01448188, -0.02567239,  0.00630822, -0.0424589 ,  0.000763  ,
         0.01364473, -0.01194667, -0.00724568, -0.01163367, -0.01058085,
         0.01532737,  0.00688541, -0.0463438 ,  0.04546429, -0.0437851 ,
        -0.03969776,  0.04931102, -0.02464147, -0.01794597, -0.01419248,
        -0.04149314,  0.0492383 ,  0.04737112, -0.0404408 ,  0.04246825,
         0.02090292, -0.01060996,  0.04182508, -0.02270931,  0.0471014 ,
         0.03794774,  0.01774218, -0.03800415,  0.04126353,  0.01133889,
         0.0384771 ,  0.01233964,  0.04254862,  0.04316625, -0.04364331,
         0.01401236, -0.03913574, -0.02006226,  0.04055745, -0.03081501,
  

# Modelling our text dataset - running a series of experiments

It's time to start building a series of modelling experiments, starting with a baseline and moving on from there:

* Model 0: Naive Bayes (baseline)
* Model 1: feed-forward neural network (dense model)
* Model 2: LSTM model (long short-term memory) (RNN)
* Model 3: GRU model (RNN)
* Model 4: Bidirectional LSTM model (RNN)
* Model 5: 1D convolutional neural network
* Model 6: TensorFlow Hub pretrained feature extractor (using transfer learning for NLP)
* Model 7: same as Model 6, with 10% of the dataset

Approach: the standard steps for modelling with TensorFlow:
- prepare data -> build -> compile -> fit -> evaluate -> experiment and improve

## Model 0 - getting a baseline
This will be our baseline model, a benchmark for future experiments to build upon. We're going to use `sklearn`'s Multinomial Naive Bayes, with the TF-IDF formula converting our words to numbers.

* 🔑 It's common practice to use a non-DL algorithm as a baseline because of its speed, and later use DL to see whether you can improve upon it.
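
To show what "converting words to numbers with TF-IDF" means, here is a hand-rolled sketch in plain Python. It assumes the smoothed variant, `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is my understanding of scikit-learn's documented defaults. The three documents and the helper `tfidf_vector` are made up for illustration.

```python
import math
from collections import Counter

# Toy documents (hypothetical, not from the tweet dataset)
docs = ["forest fire near town", "fire drill at school", "school picnic near lake"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
df = Counter(w for doc in tokenized for w in set(doc))

# Smoothed inverse document frequency: rarer words get larger weights
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_vector(tokens):
    """Term count times idf for every vocabulary word, scaled to unit length."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

vectors = [tfidf_vector(doc) for doc in tokenized]

# "fire" appears in two documents, "forest" in one, so "fire" is weighted lower
print(idf["fire"] < idf["forest"])  # True
print(len(vectors), len(vectors[0]) == len(vocab))  # 3 True
```

Words that appear in many documents carry less weight than words that are specific to a few, which is the signal the Naive Bayes classifier then works with.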

In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


In [29]:
# Create tokenization and modelling pipeline
model_0 = Pipeline([
    ("tfidf", TfidfVectorizer()), # convert words to numbers using tfidf
    ("clf", MultinomialNB()) # model the text using this classifier(clf)
])

# fit the pipeline to the training data
model_0.fit(train_sentences, train_labels)

In [30]:
# evaluate our baseline model
baseline_score = model_0.score(val_sentences, val_labels) 
# .score() is to sklearn what .evaluate() is to TensorFlow; for classifiers, the default metric is accuracy

In [31]:
print(f"Our baseline score achieves an accuracy of {baseline_score*100:.2f}%")

Our baseline score achieves an accuracy of 79.27%


In [32]:
# make predictions
baseline_preds = model_0.predict(val_sentences)
baseline_preds[:20]

array([1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1])

In [33]:
# Creating evaluation function
def evaluation (model, val_sentences, val_labels):
  """Function to return the evaluation metrics of a model 
  given the model and the validation data
  """
  from sklearn.metrics import recall_score, precision_score, classification_report
  accuracy = model.score(val_sentences, val_labels)
  predicted_labels = model.predict(val_sentences)
  precision = precision_score(val_labels, predicted_labels)
  recall = recall_score(val_labels, predicted_labels)
  report = classification_report(val_labels, predicted_labels)

  return accuracy, precision, recall, report

In [34]:
base_evaluation = evaluation(model_0, val_sentences, val_labels)
print(f"Accuracy is: {base_evaluation[0]*100:.2f}%. \nPrecision Score is:{base_evaluation[1]:.2f}\
\nRecall Score is: {base_evaluation[2]:.2f} \
\n\n\nClassification Report is {base_evaluation[3]}")

Accuracy is: 79.27%. 
Precision Score is:0.89
Recall Score is: 0.63 


Classification Report is               precision    recall  f1-score   support

           0       0.75      0.93      0.83       414
           1       0.89      0.63      0.73       348

    accuracy                           0.79       762
   macro avg       0.82      0.78      0.78       762
weighted avg       0.81      0.79      0.79       762



In [35]:
# Creating a results function
def calculate_results(y_true, y_preds):
  """Return accuracy, precision, recall and F1 score
  of a binary classification model, given the true labels and the predicted labels.
  """
  from sklearn.metrics import accuracy_score, precision_recall_fscore_support
  model_accuracy = accuracy_score(y_true, y_preds) * 100  # accuracy as a percentage

  # precision, recall and F1 are averaged over both classes, weighted by class support
  model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_preds,
                                                                                average="weighted")
  model_results = {"accuracy": model_accuracy,
                   "prediction": model_precision,  # note: this key holds the weighted precision
                   "recall": model_recall,
                   "f1_score": model_f1}

  return model_results

In [36]:
baseline_results = calculate_results(val_labels, baseline_preds)
baseline_results

{'accuracy': 79.26509186351706,
 'prediction': 0.8111390004213173,
 'recall': 0.7926509186351706,
 'f1_score': 0.7862189758049549}
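
The weighted averages can be approximately reproduced from the per-class numbers in the classification report above. This is only a sketch: it uses the rounded report values, so the results agree with `baseline_results` to about two decimal places.

```python
# Per-class values copied from the classification report (rounded to 2 decimals)
support = {0: 414, 1: 348}
precision = {0: 0.75, 1: 0.89}
recall = {0: 0.93, 1: 0.63}

total = sum(support.values())  # 762 validation samples

# Weighted average: each class contributes in proportion to its support
weighted_precision = sum(precision[c] * support[c] for c in support) / total
weighted_recall = sum(recall[c] * support[c] for c in support) / total

print(round(weighted_precision, 2))  # 0.81
print(round(weighted_recall, 2))     # 0.79
```

The support-weighted recall is the same quantity as accuracy, which is why `recall` and `accuracy` carry the same value (0.7927 and 79.27%) in the results dictionary.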

## Model 1: Feedforward neural networks (dense model)


In [37]:
# Create a tensorboard callback
from helper_functions import create_tensorboard_callback
SAVE_DIR = 'model_logs'

In [38]:
# Build model with Functional API
from tensorflow.keras import layers
inputs = layers.Input(shape=(1,), dtype=tf.string) # inputs are 1-dimensional strings (dtype="string" also works)
x = text_vectorizer(inputs) # numerically encode the input texts
x = embedding(x) # create an embedding of the numerically encoded tokens
x = layers.GlobalAveragePooling1D()(x) # average the token vectors into one vector per sentence
# without pooling, the Dense layer would output one value per token, shape (None, 15, 1), which does not match one label per sentence
outputs = layers.Dense(1, activation="sigmoid")(x)

model_1 = tf.keras.Model(inputs, outputs, name="model_1_dense")
model_1.summary()

Model: "model_1_dense"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_1 (TextV  (None, 15)               0         
 ectorization)                                                   
                                                                 
 embedding (Embedding)       (None, 15, 128)           1280000   
                                                                 
 global_average_pooling1d (G  (None, 128)              0         
 lobalAveragePooling1D)                                          
                                                                 
 dense (Dense)               (None, 1)                 129       
                                                                 
Total params: 1,280,129
Trainable params: 1,280,129
N

In [39]:
# Compile model
model_1.compile(loss="binary_crossentropy",
                optimizer="Adam",
                metrics=["accuracy"])

In [40]:
# fit the model
history_1 = model_1.fit(x=train_sentences,
                        y=train_labels,
                        epochs=5,
                        validation_data=(val_sentences, val_labels),
                        callbacks=[create_tensorboard_callback(SAVE_DIR,experiment_name="Model_1_Dense")])

Saving TensorBoard log files to: model_logs/Model_1_Dense/20230307-162716
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [41]:
results_1 = model_1.evaluate(val_sentences, val_labels)



In [42]:
baseline_results

{'accuracy': 79.26509186351706,
 'prediction': 0.8111390004213173,
 'recall': 0.7926509186351706,
 'f1_score': 0.7862189758049549}

In [43]:
model_1_preds_probs = model_1.predict(val_sentences)
model_1_preds_probs[:10], model_1_preds_probs.shape



(array([[0.23374152],
        [0.7721029 ],
        [0.9971215 ],
        [0.09316843],
        [0.10007144],
        [0.9214153 ],
        [0.9106887 ],
        [0.99229395],
        [0.95450336],
        [0.23303556]], dtype=float32), (762, 1))

In [44]:
# Convert model prediction probabilities to label format and squeeze out the extra dimension
model_1_preds=tf.round(tf.squeeze(model_1_preds_probs))
model_1_preds[:10]


<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 0., 1., 1., 1., 1., 0.], dtype=float32)>

In [45]:
# Calculate model_1 results
model_1_results = calculate_results(y_true=val_labels,
                                    y_preds=model_1_preds)
model_1_results

{'accuracy': 79.13385826771653,
 'prediction': 0.8015812374832104,
 'recall': 0.7913385826771654,
 'f1_score': 0.7868942607723418}

In [46]:
baseline_results

{'accuracy': 79.26509186351706,
 'prediction': 0.8111390004213173,
 'recall': 0.7926509186351706,
 'f1_score': 0.7862189758049549}

In [47]:
# Compare the results
import numpy as np
np.array(list(model_1_results.values())) > np.array(list(baseline_results.values()))

array([False, False, False,  True])

* Only the F1 score is (marginally) higher than the baseline; accuracy, precision and recall are all slightly lower.
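
A boolean array without labels is easy to misread, so comparing metric by metric can help. The sketch below copies the two result dictionaries printed above; `improved` is a hypothetical name used for illustration.

```python
# Values copied from the outputs above
baseline_results = {"accuracy": 79.26509186351706,
                    "prediction": 0.8111390004213173,
                    "recall": 0.7926509186351706,
                    "f1_score": 0.7862189758049549}
model_1_results = {"accuracy": 79.13385826771653,
                   "prediction": 0.8015812374832104,
                   "recall": 0.7913385826771654,
                   "f1_score": 0.7868942607723418}

# Compare by key, so every True/False stays attached to its metric name
improved = {name: model_1_results[name] > baseline_results[name]
            for name in baseline_results}

for name, is_better in improved.items():
    diff = model_1_results[name] - baseline_results[name]
    print(f"{name}: {'better' if is_better else 'worse'} ({diff:+.4f})")

print(improved)
# {'accuracy': False, 'prediction': False, 'recall': False, 'f1_score': True}
```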

## Visualizing learned embeddings

In [48]:
# get the vocabulary from the text vectorization layer
words_in_vocab = text_vectorizer.get_vocabulary()
len(words_in_vocab), words_in_vocab[:10]

(10000, ['', '[UNK]', 'the', 'a', 'in', 'to', 'of', 'and', 'i', 'is'])

In [51]:
model_1.summary()

Model: "model_1_dense"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_1 (TextV  (None, 15)               0         
 ectorization)                                                   
                                                                 
 embedding (Embedding)       (None, 15, 128)           1280000   
                                                                 
 global_average_pooling1d (G  (None, 128)              0         
 lobalAveragePooling1D)                                          
                                                                 
 dense (Dense)               (None, 1)                 129       
                                                                 
Total params: 1,280,129
Trainable params: 1,280,129
N

In [75]:
# get the weight matrix of the embedding layer
# these are the numerical representations of each token in our training data, learned over 5 epochs of training

embed_weights = model_1.get_layer("embedding").get_weights() # a list holding a single weight array
embed_weights = tf.squeeze(embed_weights) # drop the extra leading dimension
embed_weights, embed_weights.shape

(<tf.Tensor: shape=(10000, 128), dtype=float32, numpy=
 array([[-0.0439539 ,  0.01626603,  0.0105247 , ..., -0.00803557,
         -0.01958553,  0.02514297],
        [-0.03781921,  0.00822827,  0.02857511, ...,  0.04961576,
          0.03422349,  0.05668167],
        [-0.01478576,  0.03502676, -0.02134278, ...,  0.01030198,
         -0.01320543,  0.00106707],
        ...,
        [-0.03378409,  0.0432404 ,  0.04463151, ..., -0.02420684,
         -0.02445543,  0.04364406],
        [-0.0642627 , -0.0705032 ,  0.08889606, ...,  0.07463996,
          0.08026566,  0.06179041],
        [-0.1095909 , -0.07752006,  0.07633366, ...,  0.11130208,
          0.07404657,  0.08760335]], dtype=float32)>,
 TensorShape([10000, 128]))

* Every token is represented by a vector of length 128.
* Now that we have the embedding matrix our model has learned to represent our tokens, let's visualize it.
* TensorFlow has a tool for this, the Embedding Projector: https://projector.tensorflow.org/
* and a guide on word embeddings: https://www.tensorflow.org/text/guide/word_embeddings

In [76]:
# create embedding files (code adapted from the TensorFlow word embeddings guide)
import io 
out_v = io.open('vectors.tsv', 'w', encoding='utf-8')
out_m = io.open('metadata.tsv', 'w', encoding='utf-8')

for index, word in enumerate(words_in_vocab):
  if index == 0:
    continue  # skip 0, it's padding.
  vec = embed_weights[index]
  out_v.write('\t'.join([str(x) for x in vec]) + "\n")
  out_m.write(word + "\n")
out_v.close()
out_m.close()

In [79]:
# download files from Colab to upload to the Embedding Projector
try:
  from google.colab import files
  files.download('vectors.tsv')
  files.download('metadata.tsv')
except Exception:
  pass


## Recurrent Neural Networks (RNNs)

RNNs are useful for sequence data.

The premise of a recurrent neural network is to use the representation of a previous input to aid the representation of a later input.
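
This premise can be sketched as a loop in NumPy. The code below is a simple (vanilla) RNN with random, untrained weights, not the LSTM or GRU layers used later; the names `W_x`, `W_h` and `b` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, embed_dim, units = 15, 128, 64  # same sizes as in this notebook

# One embedded sentence: 15 tokens, each a 128-length vector (random stand-in)
x = rng.normal(size=(seq_len, embed_dim))

# Random weights; a real model would learn these during training
W_x = rng.normal(scale=0.1, size=(embed_dim, units))  # input -> hidden
W_h = rng.normal(scale=0.1, size=(units, units))      # hidden -> hidden
b = np.zeros(units)

h = np.zeros(units)  # hidden state before any token has been seen
for x_t in x:
    # The new state depends on the current token AND the previous state,
    # so information from earlier tokens carries forward
    h = np.tanh(x_t @ W_x + h @ W_h + b)

print(h.shape)  # (64,)
```

The final `h` is a single vector summarising the whole sequence, which is what a recurrent layer hands to the following `Dense` layer when `return_sequences` is left at its default.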


📖 Resources: overviews of RNNs:
* MIT's sequence modelling lecture
* Chris Olah's intro to LSTMs - https://colah.github.io/posts/2015-08-Understanding-LSTMs/
* Andrej Karpathy's "The Unreasonable Effectiveness of Recurrent Neural Networks" - https://karpathy.github.io/2015/05/21/rnn-effectiveness/

## Model 2: LSTM
LSTM stands for long short-term memory.

The structure of an RNN-based model typically looks like this:

``` 
input(text) -> tokenize -> embedding -> layers (RNN/dense) -> output (label probability)
```

In [101]:
# create an LSTM model
from tensorflow.keras import layers
inputs = layers.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
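x = embedding(x) # note: this reuses the embedding layer (and its learned weights) from model_1
# x = layers.LSTM(units=64, return_sequences=True)(x)
# when stacking recurrent layers, every layer except the last needs return_sequences=True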
x = layers.LSTM(64)(x)
# x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model_2 = tf.keras.Model(inputs, outputs, name="model_2_LSTM")
model_2.summary()

Model: "model_2_LSTM"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_14 (InputLayer)       [(None, 1)]               0         
                                                                 
 text_vectorization_1 (TextV  (None, 15)               0         
 ectorization)                                                   
                                                                 
 embedding (Embedding)       (None, 15, 128)           1280000   
                                                                 
 lstm_19 (LSTM)              (None, 64)                49408     
                                                                 
 dense_22 (Dense)            (None, 1)                 65        
                                                                 
Total params: 1,329,473
Trainable params: 1,329,473
Non-trainable params: 0
____________________________________________

In [102]:
# compile and fit
model_2.compile(loss="binary_crossentropy",
                optimizer="Adam",
                metrics=["accuracy"])
history_2 = model_2.fit(train_sentences, train_labels,
                        validation_data=(val_sentences, val_labels),
                        epochs=5,
                        callbacks=[create_tensorboard_callback(SAVE_DIR, experiment_name="model_2_LSTM")])

Saving TensorBoard log files to: model_logs/model_2_LSTM/20230307-184939
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [104]:
# make predictions with LSTM model
model_2_pred_probs = model_2.predict(val_sentences)
model_2_pred_probs[:10]



array([[5.7503632e-03],
       [9.1493708e-01],
       [9.9981070e-01],
       [2.0237245e-02],
       [4.6684008e-04],
       [9.9953216e-01],
       [9.4115293e-01],
       [9.9989587e-01],
       [9.9982321e-01],
       [3.3227247e-01]], dtype=float32)

In [105]:
model_2_preds = tf.round(tf.squeeze(model_2_pred_probs))

In [107]:
model_2_results = calculate_results(y_true = val_labels, y_preds= model_2_preds)
model_2_results, baseline_results

({'accuracy': 77.95275590551181,
  'prediction': 0.7816545659065345,
  'recall': 0.7795275590551181,
  'f1_score': 0.7774022539420016},
 {'accuracy': 79.26509186351706,
  'prediction': 0.8111390004213173,
  'recall': 0.7926509186351706,
  'f1_score': 0.7862189758049549})

In [110]:
np.array(list(model_2_results.values()))>np.array(list(baseline_results.values()))

array([False, False, False, False])

## Model 3: GRU

A GRU (gated recurrent unit) cell has similar features to an LSTM cell but fewer parameters.
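
The difference in parameter count can be worked out by hand. The helpers `lstm_params` and `gru_params` below are hypothetical names; the GRU formula assumes the Keras default `reset_after=True`, which keeps two bias vectors per weight set.

```python
def lstm_params(input_dim, units):
    # 4 weight sets (3 gates + candidate state), each with
    # input weights, recurrent weights and one bias vector
    return 4 * (units * (input_dim + units) + units)

def gru_params(input_dim, units, reset_after=True):
    # 3 weight sets (2 gates + candidate state);
    # with reset_after=True there are two bias vectors per set
    n_bias = 2 * units if reset_after else units
    return 3 * (units * (input_dim + units) + n_bias)

# 128-length embedding vectors feeding 64 recurrent units, as in this notebook
print(lstm_params(128, 64))  # 49408
print(gru_params(128, 64))   # 37248
```

These figures match the `Param #` values that `model.summary()` reports for the LSTM and GRU layers in this notebook.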

In [112]:
from tensorflow.keras import layers
inputs = layers.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
x = embedding(x)
x = layers.GRU(64)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model_3 = tf.keras.Model(inputs, outputs, name="model_3_GRU")
model_3.summary()

Model: "model_3_GRU"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_16 (InputLayer)       [(None, 1)]               0         
                                                                 
 text_vectorization_1 (TextV  (None, 15)               0         
 ectorization)                                                   
                                                                 
 embedding (Embedding)       (None, 15, 128)           1280000   
                                                                 
 gru_1 (GRU)                 (None, 64)                37248     
                                                                 
 dense_24 (Dense)            (None, 1)                 65        
                                                                 
Total params: 1,317,313
Trainable params: 1,317,313
Non-trainable params: 0
_____________________________________________

In [114]:
model_3.compile(loss="binary_crossentropy",
                optimizer="Adam",
                metrics=["accuracy"])
history_3 = model_3.fit(train_sentences, train_labels,
            validation_data=(val_sentences, val_labels),
            epochs=5,
            callbacks=[create_tensorboard_callback(SAVE_DIR, experiment_name="model_3_GRU")]
    
)

Saving TensorBoard log files to: model_logs/model_3_GRU/20230307-190825
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [116]:
# evaluate the data
model_3_preds_probs = model_3.predict(val_sentences)
model_3_preds = tf.round(tf.squeeze(model_3_preds_probs))
results_3 = calculate_results(y_true=val_labels, y_preds=model_3_preds)
results_3



{'accuracy': 77.55905511811024,
 'prediction': 0.7752814319411354,
 'recall': 0.7755905511811023,
 'f1_score': 0.7753271227066473}

In [117]:
np.array(list(results_3.values()))>np.array(list(baseline_results.values())) # compare the metric values, not the dictionary keys

array([False, False, False, False])

In [122]:
def calculate_predictions_and_results(model, val_sentences=val_sentences, val_labels=val_labels):
  model_pred_probs = model.predict(val_sentences)
  model_preds = tf.squeeze(tf.round(model_pred_probs))
  model_results = calculate_results(val_labels, model_preds)
  print(np.array(list(model_results.values()))>np.array(list(baseline_results.values()))) # compare values, not keys
  
  return model_results

In [123]:
calculate_predictions_and_results(model_3)

[False False False False]


{'accuracy': 77.55905511811024,
 'prediction': 0.7752814319411354,
 'recall': 0.7755905511811023,
 'f1_score': 0.7753271227066473}

## Model 4: Bidirectional RNN

* A normal RNN processes a sequence in one direction (left to right for English text, for example).
* A bidirectional RNN processes it from right to left as well as from left to right.

In [134]:
from tensorflow.keras import layers
inputs = layers.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
x = embedding(x)
# x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
# x = layers.Bidirectional(layers.GRU(64))(x)
x = layers.Bidirectional(layers.LSTM(64))(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model_4 = tf.keras.Model(inputs, outputs, name="model_4_bidirectional")
model_4.summary()

Model: "model_4_bidirectional"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_26 (InputLayer)       [(None, 1)]               0         
                                                                 
 text_vectorization_1 (TextV  (None, 15)               0         
 ectorization)                                                   
                                                                 
 embedding (Embedding)       (None, 15, 128)           1280000   
                                                                 
 bidirectional_12 (Bidirecti  (None, 128)              98816     
 onal)                                                           
                                                                 
 dense_31 (Dense)            (None, 1)                 129       
                                                                 
Total params: 1,378,945
Trainable params: 1,3

* Note how the output size of the bidirectional layer is twice the number of LSTM units, i.e. 64 becomes 128, because the forward and backward outputs are concatenated.
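
The doubling can be illustrated with a NumPy sketch. It uses a simple RNN with random weights instead of an LSTM to keep the code short, and `simple_rnn_last_state` is a hypothetical helper; the point is only the concatenation of the two directions.

```python
import numpy as np

def simple_rnn_last_state(x, W_x, W_h):
    """Run a vanilla RNN over x and return the final hidden state."""
    h = np.zeros(W_h.shape[0])
    for x_t in x:
        h = np.tanh(x_t @ W_x + h @ W_h)
    return h

rng = np.random.default_rng(0)
seq_len, embed_dim, units = 15, 128, 64
x = rng.normal(size=(seq_len, embed_dim))  # stand-in for one embedded sentence

def random_weights():
    # Each direction has its own, separate set of weights
    return (rng.normal(scale=0.1, size=(embed_dim, units)),
            rng.normal(scale=0.1, size=(units, units)))

h_forward = simple_rnn_last_state(x, *random_weights())         # left to right
h_backward = simple_rnn_last_state(x[::-1], *random_weights())  # right to left

# Concatenating both directions doubles the output size
combined = np.concatenate([h_forward, h_backward])
print(combined.shape)  # (128,)
```

Because each direction has its own weights, the parameter count doubles as well: 2 x 49,408 = 98,816, the figure shown for the bidirectional layer in the summary above.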

In [135]:
model_4.compile(loss="binary_crossentropy",
                optimizer="Adam",
                metrics=["accuracy"])
history_4 = model_4.fit(train_sentences, train_labels,
                        validation_data=(val_sentences, val_labels),
                        epochs=5,
                        callbacks=[create_tensorboard_callback(SAVE_DIR, "model_4_bidirectional")])

Saving TensorBoard log files to: model_logs/model_4_bidirectional/20230307-194643
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [136]:
calculate_predictions_and_results(model_4)

[False False False False]


{'accuracy': 77.55905511811024,
 'prediction': 0.776326889347514,
 'recall': 0.7755905511811023,
 'f1_score': 0.7740902496040959}