
# Fundamentals

In [3]:
!nvidia-smi

/bin/bash: line 1: nvidia-smi: command not found


## Get helper functions

In [4]:
!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/refs/heads/main/extras/helper_functions.py

--2025-02-10 22:25:03--  https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/refs/heads/main/extras/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10246 (10K) [text/plain]
Saving to: ‘helper_functions.py’


2025-02-10 22:25:04 (19.1 MB/s) - ‘helper_functions.py’ saved [10246/10246]



In [5]:
from helper_functions import unzip_data, create_tensorboard_callback, plot_loss_curves, compare_historys

## Get text dataset

In [6]:
!wget https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip

--2025-02-10 22:25:07--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.135.207, 74.125.142.207, 74.125.195.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.135.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip’


2025-02-10 22:25:07 (108 MB/s) - ‘nlp_getting_started.zip’ saved [607343/607343]



In [7]:
unzip_data("nlp_getting_started.zip")

## Visualizing dataset

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

In [9]:
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

In [10]:
train_df.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [11]:
train_df_shuffled = train_df.sample(frac=1, random_state=42)
train_df_shuffled.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 2644 | 3796 | destruction | | So you have a new weapon that can cause un-ima... | 1 |
| 2227 | 3185 | deluge | | The f$&amp;@ing things I do for #GISHWHES Just... | 0 |
| 5448 | 7769 | police | UK | DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe... | 1 |
| 132 | 191 | aftershock | | Aftershock back to school kick off was great. ... | 0 |
| 6845 | 9810 | trauma | Montgomery County, MD | in response to trauma Children of Addicts deve... | 0 |


In [12]:
test_df.head()

| | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 | | | Just happened a terrible car crash |
| 1 | 2 | | | Heard about #earthquake is different cities, s... |
| 2 | 3 | | | there is a forest fire at spot pond, geese are... |
| 3 | 9 | | | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 | | | Typhoon Soudelor kills 28 in China and Taiwan |


In [13]:
# Count how many examples there are of each class
train_df["target"].value_counts()

| target | count |
|---|---|
| 0 | 4342 |
| 1 | 3271 |


In [14]:
# The same count using attribute access
train_df.target.value_counts()

| target | count |
|---|---|
| 0 | 4342 |
| 1 | 3271 |


In [15]:
len(train_df), len(test_df)

(7613, 3263)

In [16]:
# Let's visualize some random training examples
import random
random_index = random.randint(0, len(train_df)-5) # pick a random start index that leaves room for 5 samples
for row in train_df_shuffled[["text", "target"]][random_index:random_index+5].itertuples():
  _, text, target = row
  print(f"Target: {target}", "(real disaster)" if target > 0 else "(not real disaster)")
  print(f"Text:\n{text}\n")
  print("---\n")

Target: 0 (not real disaster)
Text:
KS except every character is Shizune.
The world would explode.

---

Target: 0 (not real disaster)
Text:
China's Stock Market Crash: Are There Gems In The Rubble? http://t.co/3PBFyJx0yA

---

Target: 0 (not real disaster)
Text:
Posted a new song: 'Earthquake' http://t.co/RfTyyZ4GwJ http://t.co/lau0Ay7ahV

---

Target: 1 (real disaster)
Text:
MH370: Intact part lifts odds plane glided not crashed into sea http://t.co/8pdnHH6tzH

---

Target: 0 (not real disaster)
Text:
dazzle destroy the fun ??

---



### Split data into training and validation sets

In [17]:
from sklearn.model_selection import train_test_split

In [18]:
train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df_shuffled["text"].to_numpy(),
                                                                            train_df_shuffled["target"].to_numpy(),
                                                                            test_size=0.1,
                                                                            random_state=42)

In [19]:
# Check the lengths of the splits
len(train_sentences), len(train_labels), len(val_sentences), len(val_labels)

(6851, 6851, 762, 762)
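
The split sizes above can be checked by hand. The sketch below assumes that scikit-learn rounds a fractional `test_size` up to the next whole sample and gives the remainder to the training set, which matches the lengths printed above. The variable names are made up for this example.

```python
import math

n_samples = 7613   # rows in train_df
test_size = 0.1    # fraction held out for validation

# Assumption: the validation count is rounded up, the rest goes to training
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(n_train, n_val)  # 6851 762
```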

In [20]:
# check first 10 samples
train_sentences[:10], train_labels[:10]

(array(['@mogacola @zamtriossu i screamed after hitting tweet',
        'Imagine getting flattened by Kurt Zouma',
        '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....',
        "@shakjn @C7 @Magnums im shaking in fear he's gonna hack the planet",
        'Somehow find you and I collide http://t.co/Ee8RpOahPk',
        '@EvaHanderek @MarleyKnysh great times until the bus driver held us hostage in the mall parking lot lmfao',
        'destroy the free fandom honestly',
        'Weapons stolen from National Guard Armory in New Albany still missing #Gunsense http://t.co/lKNU8902JE',
        '@wfaaweather Pete when will the heat wave pass? Is it really going to be mid month? Frisco Boy Scouts have a canoe trip in Okla.',
        'Patient-reported outcomes in long-term survivors of metastatic colorectal cancer - British Journal of Surgery http://t.co/5Yl4DC1Tqt'],
       dtype=object),
 array([0,

## Converting text into numbers

There are a few ways to do this:
* Tokenization - a direct mapping from each token to a number:
i love tensorflow -> {0: i, 1: love, 2: tensorflow}
* Embedding - a matrix with one feature vector per token: i love tensorflow ->
[[0.125, 0.856, 0.091],
 [0.123, 0.643, 0.723],
 [0.188, 0.116, 0.901]]
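
As a rough illustration of the two ideas above, the sketch below builds a token-to-index mapping by hand and then uses the indices to look up rows of a randomly initialised matrix. The names `vocab`, `token_ids` and `embedding_matrix`, the embedding size of 3 and the random seed are assumptions made for this example only. It is not how Keras implements either step, and the mapping here goes from token to index rather than index to token.

```python
import numpy as np

sentence = "i love tensorflow"
tokens = sentence.split()

# Tokenization: map each unique token to an integer index
vocab = {token: index for index, token in enumerate(dict.fromkeys(tokens))}
token_ids = [vocab[token] for token in tokens]
print(vocab)      # {'i': 0, 'love': 1, 'tensorflow': 2}
print(token_ids)  # [0, 1, 2]

# Embedding: one row of floats per token in the vocabulary
# (random here; in a model these values would be learned during training)
rng = np.random.default_rng(42)
embedding_matrix = rng.uniform(0, 1, size=(len(vocab), 3))

# Looking up the token ids returns one feature vector per token
embedded_sentence = embedding_matrix[token_ids]
print(embedded_sentence.shape)  # (3, 3)
```

The integer indices carry no meaning on their own, whereas the rows of the embedding matrix can be adjusted during training so that related tokens end up with similar vectors.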

### Text vectorization (tokenization)

In [21]:
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

In [22]:
train_sentences[:3]

array(['@mogacola @zamtriossu i screamed after hitting tweet',
       'Imagine getting flattened by Kurt Zouma',
       '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....'],
      dtype=object)

In [23]:
# Use default TextVectorization parameters
text_vectorizer = TextVectorization(max_tokens=None, # maximum vocabulary size; None means no limit
                                    standardize="lower_and_strip_punctuation",
                                    split="whitespace",
                                    ngrams=None, # optionally group tokens into n-grams
                                    output_mode="int", # map each token to an integer index
                                    output_sequence_length=None, # pad or truncate each sample to this many tokens; None means no fixed length
                                    # pad_to_max_tokens=True is left out: it did not work here with max_tokens=None, and per the
                                    # Keras docs it only applies to the "multi_hot", "count" and "tf_idf" output modes.
                                    # In "int" mode, sequences are zero-padded up to output_sequence_length instead.
)

In [24]:
len(train_sentences)

6851

In [25]:
train_sentences[0].split(), len(train_sentences[0].split())

(['@mogacola', '@zamtriossu', 'i', 'screamed', 'after', 'hitting', 'tweet'], 7)

In [26]:
# Find the average number of tokens (words) in training tweets
round(sum([len(sentence.split()) for sentence in train_sentences])/len(train_sentences))

15

In [27]:
# Set up text vectorization variables
max_vocab_lenght = 10000 # max number of words to have in our vocabulary
max_length = 15 # sequence length: how many tokens of each tweet the model will see (the average tweet length found above)

text_vectorizer = TextVectorization(max_tokens=max_vocab_lenght,
                                    output_mode="int",
                                    output_sequence_length=max_length,
                                    pad_to_max_tokens=True # accepted once max_tokens is set; per the Keras docs it only affects the "multi_hot", "count" and "tf_idf" output modes
                                    )

In [28]:
# Fit text_vectorizer to the training data
text_vectorizer.adapt(train_sentences)

In [29]:
# Create a sample sentence and tokenize it
sample_sentence = "There's a flood in my street!"
text_vectorizer([sample_sentence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[264,   3, 232,   4,  13, 698,   0,   0,   0,   0,   0,   0,   0,
          0,   0]])>

In [30]:
# Try on train_sentences
random_sentence = random.choice(train_sentences)
print(f"Original text:\n{random_sentence}")
vectorized_sentence = text_vectorizer([random_sentence])
print(f"Vectorized version:\n{vectorized_sentence}")

Original text:
When the waves are flooding the shore
And I can't find my way home anymore
That's when I look at you http://t.co/TDAKtGlU5p
Vectorized version:
[[  45    2  696   22  231    2 4634    7    8   98  653   13  147  153
  1438]]


In [31]:
# Get the unique words in the vocabulary
words_in_vocab = text_vectorizer.get_vocabulary() # vocabulary learned from the training data, capped at max_tokens
print(f"Number of words: {len(words_in_vocab)}")
print(f"Most common words: {words_in_vocab[:5]}")
print(f"Least common words: {words_in_vocab[-5:]}")

Number of words: 10000
Most common words: ['', '[UNK]', 'the', 'a', 'in']
Least common words: ['pages', 'paeds', 'pads', 'padres', 'paddytomlinson1']
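
To make the steps behind the outputs above more concrete, here is a minimal pure-Python sketch of roughly what the layer does in "int" mode: standardize, split on whitespace, rank tokens by frequency, then truncate or zero-pad. The functions `standardize`, `build_vocab` and `vectorize` and the toy sentences are hypothetical and only approximate the Keras behaviour; the real layer is implemented differently.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase and strip punctuation (approximates "lower_and_strip_punctuation")."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(sentences, max_tokens):
    """Most frequent tokens first; index 0 is padding, index 1 is the unknown token."""
    counts = Counter(token for s in sentences for token in standardize(s).split())
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(text, vocab, sequence_length):
    """Map tokens to indices, then truncate or zero-pad to a fixed length."""
    token_to_index = {token: index for index, token in enumerate(vocab)}
    ids = [token_to_index.get(token, 1) for token in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

toy_sentences = ["There's a flood in my street!",
                 "A flood in the city",
                 "My street is flooded"]
toy_vocab = build_vocab(toy_sentences, max_tokens=10)

print(toy_vocab[:2])  # ['', '[UNK]']
print(vectorize("A flood in my town!", toy_vocab, sequence_length=8))
# [2, 3, 4, 5, 1, 0, 0, 0]
```

The word "town" never appears in the toy sentences, so it maps to index 1 (`[UNK]`), and the trailing zeros are padding. This mirrors the `''` and `'[UNK]'` entries at the start of the real vocabulary and the zeros at the end of the vectorized flood sentence.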


### Creating an Embedding

* input_dim - The size of the vocabulary (e.g. len(text_vectorizer.get_vocabulary())).
* output_dim - The size of the output embedding vector. For example, a value of 100 produces a feature vector of size 100 for each token.
* embeddings_initializer - How to initialize the embedding matrix. The default, "uniform", fills it with random values drawn from a uniform distribution. It can be changed in order to use pre-learned embeddings.
* input_length - The length of the sequences passed to the embedding layer. Recent Keras versions report this argument as deprecated, in which case it can be left out.
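
The parameters above determine the size of the embedding table and the shape of its output. The NumPy sketch below mimics the lookup with the values used in this notebook (vocabulary of 10000, 128 dimensions, 15 tokens per sentence). It assumes the "uniform" initializer draws from the range [-0.05, 0.05], which is my understanding of the Keras default. The array names and the seed are made up for this example.

```python
import numpy as np

vocab_size = 10000     # input_dim
embedding_dim = 128    # output_dim
sequence_length = 15   # tokens per sentence after vectorization

# Assumed "uniform" initialisation: small random values in [-0.05, 0.05]
rng = np.random.default_rng(0)
embedding_table = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

# A batch holding one vectorized sentence (the flood example, zero-padded)
token_ids = np.array([[264, 3, 232, 4, 13, 698, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

# The lookup replaces each token id with its row of the table
embedded = embedding_table[token_ids]

print(embedding_table.size)  # 1280000 values to learn
print(embedded.shape)        # (1, 15, 128)
```

Every padded position (token id 0) receives the same row of the table, which is why repeated vectors can show up in an embedded sentence.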

In [32]:
from tensorflow.keras import layers

embedding = layers.Embedding(input_dim=max_vocab_lenght, # size of the vocabulary
                             output_dim=128, # size of each token's embedding vector
                             embeddings_initializer="uniform",
                             input_length=max_length, # length of each input sequence (sentence)
                             )



In [33]:
# Get random sentence from training dataset
random_sentence = random.choice(train_sentences)
print(f"Original text:\n{random_sentence}\n")
print("Embedded version:")
# Embed the random sentence (turn each token into a vector of the set size)
sample_embed = embedding(text_vectorizer([random_sentence]))
sample_embed

Original text:
@channelstv:That's why terrorism is not d war for d army but for Intel agents who can counter their moves before they detonate their bombs.

Embedded version:


<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[ 0.04423816, -0.01778169,  0.03811372, ...,  0.02097717,
         -0.00343367,  0.04674803],
        [ 0.04952795, -0.01839121, -0.02569411, ..., -0.04028875,
          0.00890381, -0.0215878 ],
        [-0.03529998,  0.03632805,  0.01862999, ..., -0.03748454,
          0.01208252,  0.01844743],
        ...,
        [ 0.04423816, -0.01778169,  0.03811372, ...,  0.02097717,
         -0.00343367,  0.04674803],
        [ 0.01542965, -0.04507552,  0.04957041, ..., -0.04138424,
         -0.03067018,  0.01359964],
        [-0.04233308,  0.01474489,  0.00934201, ...,  0.03909532,
          0.01477068,  0.02867807]]], dtype=float32)>

In [34]:
# Check out a single token's embedding
sample_embed[0][0], sample_embed[0][0].shape, random_sentence

(<tf.Tensor: shape=(128,), dtype=float32, numpy=
 array([ 0.04423816, -0.01778169,  0.03811372,  0.04185592, -0.03710707,
        -0.02448989,  0.03152058,  0.00692171,  0.03983698, -0.0327278 ,
         0.03344511,  0.04539478,  0.00112025,  0.0373337 , -0.04329394,
        -0.04930231,  0.00150397,  0.01887306, -0.02463325, -0.03435959,
        -0.03157924, -0.01431219,  0.01588972,  0.00650745, -0.00881419,
         0.01024119,  0.04764304,  0.02332475,  0.01959396,  0.01907705,
         0.04514022, -0.03013366, -0.02192007,  0.00934232,  0.04462544,
         0.0263334 ,  0.01771078,  0.02148681, -0.03544251,  0.01296303,
        -0.03750181,  0.04816725, -0.02665094,  0.00581467,  0.01344479,
        -0.03625611, -0.00527768,  0.00151194, -0.04869526, -0.03432981,
         0.02083495,  0.01522226,  0.00629009,  0.00383242,  0.01983548,
         0.02217347,  0.02338853,  0.0217286 , -0.04495615,  0.0094354 ,
         0.03439487, -0.0433361 ,  0.02625522, -0.00395534,  0.04038515,
  

## Modelling a text dataset

### Model 0: Baseline

In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

In [36]:
# Create tokenization and modelling pipeline
model_0 = Pipeline([
    ("tfidf", TfidfVectorizer()), # turn text into numbers
    ("clf", MultinomialNB()) # create a model
])

# Fit the pipeline to the training data
model_0.fit(train_sentences, train_labels)

In [39]:
# Evaluate baseline model
baseline_score = model_0.score(val_sentences, val_labels)
print(f"Score: {baseline_score*100:.2f}%")

Score: 79.27%


In [41]:
# Make predictions
baseline_preds = model_0.predict(val_sentences)
baseline_preds[:10]

array([1, 1, 1, 0, 0, 1, 1, 1, 1, 0])

#### Create an evaluation function

In [44]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def calculate_results(y_true, y_pred):
  """
  Calculate accuracy, precision, recall and F1 score of a classification model.

  Accuracy is returned as a percentage (0-100); precision, recall and F1 are
  weighted averages on a 0-1 scale.
  """
  # Calculate model accuracy
  model_accuracy = accuracy_score(y_true, y_pred) * 100
  # Calculate precision, recall, f1-score
  model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred,
                                                                               average="weighted")
  model_results = {"accuracy": model_accuracy,
                   "precision": model_precision,
                   "recall": model_recall,
                   "f1": model_f1}

  return model_results

In [45]:
baseline_results = calculate_results(y_true=val_labels,
                                     y_pred=baseline_preds)
baseline_results

{'accuracy': 79.26509186351706,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706,
 'f1': 0.7862189758049549}
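
In the results above, recall (0.7926...) is exactly the accuracy divided by 100. That is expected with average="weighted": weighting each class's recall by its share of the samples adds up to the fraction of correct predictions. The sketch below recomputes weighted precision and recall by hand on made-up labels to show this. The function `weighted_precision_recall` and the toy data are hypothetical and not part of scikit-learn.

```python
from collections import Counter

def weighted_precision_recall(y_true, y_pred):
    """Per-class precision and recall, averaged with class support as the weight."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = 0.0
    for label, n_true in support.items():
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        class_precision = true_pos / n_pred if n_pred else 0.0
        class_recall = true_pos / n_true
        weight = n_true / total
        precision += weight * class_precision
        recall += weight * class_recall
    return precision, recall

# Made-up labels: 6 samples of class 0 and 4 of class 1
toy_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

precision, recall = weighted_precision_recall(toy_true, toy_pred)
accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)

print(round(precision, 4), round(recall, 4), round(accuracy, 4))
# 0.6952 0.7 0.7
```

Weighted recall therefore adds no information beyond accuracy here, so precision and F1 are the more informative numbers to compare across models.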