## Natural Language Processing with RNNs 

### Project and Dataset information


This dataset contains a CSV transcript of the Apple event live stream held in September 2023. It includes three fields: "Text," the verbatim transcript; "Start Time," when each segment of text begins; and "Duration," how long each spoken segment lasts.
There are 1349 rows and 3 columns, with no null values. Only the text column is used in this character-level RNN project, which predicts the next token (character).


Reference: https://www.kaggle.com/datasets/nuhmanpk/apple-event-september-2023-transcript

### Import functions

In [21]:
# Collapse-show
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
    IS_COLAB = True
except Exception:
    IS_COLAB = False

# TensorFlow ≥2.0 is required
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.callbacks import EarlyStopping
assert tf.__version__ >= "2.0"

if not tf.config.list_physical_devices("GPU"):
    print("No GPU was detected. LSTMs and CNNs can be very slow without a GPU.")
    if IS_COLAB:
        print("Go to Runtime > Change runtime and select a GPU hardware accelerator.")

# Common imports
import numpy as np
import pandas as pd
import os

# to make this notebook's output stable across runs
np.random.seed(42)
tf.random.set_seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "rnn"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings('ignore')

No GPU was detected. LSTMs and CNNs can be very slow without a GPU.


## Load and Explore Data

In [22]:
data = pd.read_csv('transcript.csv') # load csv dataset
data.tail(10) # Check data from the last 10 rows

Unnamed: 0,text,start,duration
1339,and best-in-class experiences,5054.466,1.934
1340,that help us do the things\nthat matter most i...,5059.533,0.8
1341,Apple Watch and iPhone are\nessential.,5062.666,1.334
1342,They're with us all the time\nand we use them ...,5066.9,1.633
1343,which is why we're so excited\nfor you to begi...,5073.3,0.566
1344,Thank you for joining us!,5075.466,0.434
1345,Have a great day!,5078.033,0.233
1346,♪ ♪,5086.233,1.633
1347,.,5104.266,6.9
1348,.,5121.266,0.567


In [23]:
data.info # Without parentheses this shows the bound method's repr (a preview of the DataFrame); call data.info() for the column summary

<bound method DataFrame.info of                            text     start  duration
0                   Apple Event    12.000     2.766
1                   Apple Event    20.033     2.700
2                   Apple Event    26.766     4.034
3                   Apple Event    36.033     2.767
4                   Apple Event    44.066     2.734
...                         ...       ...       ...
1344  Thank you for joining us!  5075.466     0.434
1345          Have a great day!  5078.033     0.233
1346                        ♪ ♪  5086.233     1.633
1347                          .  5104.266     6.900
1348                          .  5121.266     0.567

[1349 rows x 3 columns]>

In [24]:
data.isnull().sum() # Count the null values in each column

text        0
start       0
duration    0
dtype: int64

In [25]:
print(data.text[40:50]) # Print the text column for rows 40 through 49

40    Omar Hashem:\nI'm thankful for my friends and\...
41                                   for supporting me.
42                      I'm super thankful for my life.
43             Imani Miles: Everybody has\nalways said,
44    "She's gonna be somebody really\ngreat when sh...
45                                      I believe that.
46                                                  ♪ ♪
47          We're so glad you're here with\nus, Jessie.
48                          - I know.\n- We're so glad.
49                                 We love you so much.
Name: text, dtype: object


## Creating the Training Dataset

In [26]:
df = pd.DataFrame(data)
df['text'] = df['text'].str.lower() # convert all text into lowercase

In [27]:
newData = df['text'].iloc[0:].str.cat(sep=' ') # Concatenate every row of the text column into one string, separated by spaces
newData # Display text

'apple event apple event apple event apple event apple event apple event apple event apple event apple event apple event apple event apple event apple event apple event apple event apple event apple event apple event [notification clicks] ♪ ♪ [phone chimes] [indistinct whispering] happy birthday. love you. [phone vibrates] ♪ ♪ [notification chimes] ♪ ♪ happy birthday! ♪ ♪ ♪ ♪ ♪ ♪ tasha prescott: oh, man. i\'m appreciative of just life. another opportunity. my kids experiencing things\nbecause i\'m still here. antonio femiano:\nit\'s always great to get up in\nthe morning and be here,\nliving one more day with them. stephen watts:\nit\'s so exciting to live and to be here with people you\nlove. omar hashem:\ni\'m thankful for my friends and\nfamily for supporting me. i\'m super thankful for my life. imani miles: everybody has\nalways said, "she\'s gonna be somebody really\ngreat when she grows up." i believe that. ♪ ♪ we\'re so glad you\'re here with\nus, jessie. - i know.\n- we\'re so 

In [28]:
# Split text by character-level tokenization and set all characters to lowercase
text_vec_layer = tf.keras.layers.TextVectorization(split="character", standardize="lower")

# Adapt the TextVectorization layer to the data in "newData" to learn the vocabulary
text_vec_layer.adapt([newData]) 

# Encode "newData" with the adapted TextVectorization layer and store the token IDs in "encoded"
encoded = text_vec_layer([newData])[0] 

In [29]:
encoded -= 2  # Drop tokens 0 (pad) and 1 (unknown), which we will not use
n_tokens = text_vec_layer.vocabulary_size() - 2  # Number of distinct characters
dataset_size = len(encoded)  # Total number of characters
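
The ID shift above can be illustrated with a small plain-Python analogue of a character vocabulary. This is only a sketch: the helper `build_char_vocab` is hypothetical, and its tie-breaking rule is an arbitrary choice that may differ from the ordering `TextVectorization` actually produces. It reserves IDs 0 and 1 (pad and unknown), assigns the remaining IDs by character frequency, and then subtracts 2 so the IDs in use start at 0.

```python
from collections import Counter

def build_char_vocab(text):
    """Return a char -> id mapping with ids 0 and 1 reserved (pad, unknown)."""
    counts = Counter(text.lower())
    # Most frequent characters get the smallest ids; ties are broken
    # alphabetically (an assumption for this sketch only).
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: i + 2 for i, ch in enumerate(ordered)}

sample = "apple event apple event"
vocab = build_char_vocab(sample)

# Shift by 2 so that the ids actually in use run from 0 to n_sample_tokens - 1
encoded_sample = [vocab[ch] - 2 for ch in sample]
n_sample_tokens = len(vocab)
print(n_sample_tokens, encoded_sample)
```

After the shift, every ID is a valid index into an output layer with `n_sample_tokens` units, which is why the notebook subtracts 2 from both the encoded text and the vocabulary size.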

In [30]:
n_tokens # Number of distinct characters

57

In [31]:
"".join(sorted(set(newData.lower())))

'\n !"#$%\'()+,-./0123456789:?[]abcdefghijklmnopqrstuvwxyzè♪'

In [32]:
dataset_size # Total number of characters

60143

In [33]:
# Function to prepare sequences for training by converting them into a suitable TensorFlow 
# dataset format with batching, shuffling, and target creation. 
def to_dataset(sequence, length, shuffle=False, seed=None, batch_size=32):
    ds = tf.data.Dataset.from_tensor_slices(sequence)
    ds = ds.window(length + 1, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda window_ds: window_ds.batch(length + 1))
    if shuffle:
        ds = ds.shuffle(60_000, seed=seed)
    ds = ds.batch(batch_size)
    return ds.map(lambda window: (window[:, :-1], window[:, 1:])).prefetch(1)

In [34]:
# Convert a text input into a dataset of fixed-length sequences using TextVectorization.
list(to_dataset(text_vec_layer(["you can"])[0], length=6))

[(<tf.Tensor: shape=(1, 6), dtype=int64, numpy=array([[22,  6, 15,  2, 13,  5]], dtype=int64)>,
  <tf.Tensor: shape=(1, 6), dtype=int64, numpy=array([[ 6, 15,  2, 13,  5,  8]], dtype=int64)>)]
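
The windowing performed by `to_dataset` can be illustrated without TensorFlow. The sketch below is a plain-Python analogue (the helper name `make_windows` is hypothetical and not part of the notebook): it slides a window of `length + 1` tokens over the sequence one step at a time, then splits each window into an input (all but the last token) and a target (all but the first token). Batching and shuffling are left out.

```python
def make_windows(sequence, length):
    """Return (input, target) pairs from sliding windows of length + 1 tokens."""
    pairs = []
    # Only full windows are kept, mirroring drop_remainder=True
    for start in range(len(sequence) - length):
        window = sequence[start:start + length + 1]
        pairs.append((window[:-1], window[1:]))  # target is the input shifted by one
    return pairs

# Token IDs for "you can", taken from the output above
ids = [22, 6, 15, 2, 13, 5, 8]
pairs = make_windows(ids, length=6)
print(pairs)
```

With seven tokens and `length=6` there is exactly one full window, which matches the single input/target pair shown in the notebook output.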

In [35]:
length = 100 # Set the input sequence length to 100 characters
tf.random.set_seed(42) # Set TensorFlow's global random seed for reproducibility
train_set = to_dataset(encoded[:54_000], length=length, shuffle=True, # ~90% of the dataset is used for the training set
                       seed=42)
valid_set = to_dataset(encoded[54_000:57_000], length=length) # ~5% of the dataset is used for the validation set
test_set = to_dataset(encoded[57_000:], length=length) # ~5% of the dataset is used for the test set
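
As a quick sanity check on the split, the arithmetic below reproduces the approximate proportions and the number of training batches. It assumes the values reported earlier in the notebook (`dataset_size` of 60143, `length` of 100, and the default `batch_size` of 32); the variable names are illustrative only.

```python
import math

dataset_size = 60_143  # total characters, as reported above
length = 100           # window length used for training
batch_size = 32        # default batch size of to_dataset

splits = {"train": 54_000, "valid": 3_000, "test": dataset_size - 57_000}
for name, n_chars in splits.items():
    print(f"{name}: {n_chars} characters ({n_chars / dataset_size:.1%})")

# Each window spans length + 1 characters and shifts by one,
# so a slice of n characters yields n - length full windows.
train_windows = splits["train"] - length
train_batches = math.ceil(train_windows / batch_size)
print(train_windows, train_batches)
```

The resulting 1685 training batches agree with the step count that appears in this notebook's training log.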

## Building and Training the char-RNN Model

In [None]:
tf.random.set_seed(42) # Set TensorFlow's global random seed for reproducibility

# Define the neural network model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=n_tokens, output_dim=16),
    tf.keras.layers.GRU(128, return_sequences=True),
    tf.keras.layers.Dense(n_tokens, activation="softmax")
])

# Compile the model with loss, optimizer, and metrics
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam",
              metrics=["accuracy"])

# Create a ModelCheckpoint callback to save the best model during training
model_ckpt = tf.keras.callbacks.ModelCheckpoint(
    "my_script_model", monitor="val_accuracy", save_best_only=True)

# Train the model on the training and validation datasets for 10 epochs
history = model.fit(train_set, validation_data=valid_set, epochs=10,
                    callbacks=[model_ckpt])

Epoch 1/10
   1685/Unknown - 92s 50ms/step - loss: 2.0483 - accuracy: 0.4132

In [None]:
history.params # Display parameters related to the training process

In [None]:
# Plot the accuracy and loss curves for the training and validation sets

pd.DataFrame(history.history).plot(figsize=(8, 5))
plt.grid(True)
plt.gca().set_ylim(0, 1)
save_fig("keras_learning_curves_plot1")
plt.show()

## Generating Fake Text

In [None]:
log_probas = tf.math.log([[0.5, 0.4, 0.1]])  # probas = 50%, 40%, and 10%
tf.random.set_seed(42) # Set TensorFlow's global random seed for reproducibility
tf.random.categorical(log_probas, num_samples=8)  # draw 8 samples

In [None]:
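# Wrap the trained model with the preprocessing steps so it accepts raw text
my_script_model = tf.keras.Sequential([
    text_vec_layer,
    tf.keras.layers.Lambda(lambda X: X - 2),  # no <PAD> or <UNK> tokens
    model
])

# A custom function that samples the next character from the model's predictions
def next_char(text, temperature=1):
    y_proba = my_script_model.predict([text])[0, -1:]  # probabilities for the last time step
    rescaled_logits = tf.math.log(y_proba) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1)[0, 0]
    return text_vec_layer.get_vocabulary()[char_id + 2]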
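
The effect of `temperature` can be seen on a small distribution without running the model. The following is a minimal plain-Python sketch (the helper `apply_temperature` is hypothetical): it divides the log-probabilities by the temperature and renormalizes with a softmax, which is the distribution `tf.random.categorical` samples from when given the rescaled logits.

```python
import math

def apply_temperature(probas, temperature):
    """Rescale a probability distribution by a sampling temperature."""
    logits = [math.log(p) / temperature for p in probas]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - max_logit) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

probas = [0.5, 0.4, 0.1]  # same example distribution as above
for t in (0.01, 1, 100):
    rescaled = apply_temperature(probas, t)
    print(t, [round(p, 3) for p in rescaled])
```

At a temperature of 1 the distribution is unchanged, a very low temperature concentrates almost all probability on the most likely character, and a very high temperature pushes the distribution toward uniform.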

In [None]:
# A custom function that extends the text by n_chars sampled characters
def extend_text(text, n_chars, temperature=1):
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text

In [None]:
tf.random.set_seed(42) # Set TensorFlow's global random seed for reproducibility

In [None]:
# Extend by 10 characters with temperature 100 for more random output
n_chars10 = 10
print(extend_text("you can", n_chars10, temperature=100))

In [None]:
# Extend by 50 characters with temperature 1, sampling directly from the model's predicted probabilities
n_chars50 = 50
print(extend_text("you can", n_chars50, temperature=1)) 

In [None]:
# Extend by 100 characters with temperature 0.01 for more deterministic output
n_chars100 = 100
print(extend_text("you can", n_chars100, temperature=0.01)) 

## Stateless RNN

In [None]:
# Define a stateless RNN with the same architecture as the trained char-RNN model
stateless_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=n_tokens, output_dim=16),
    tf.keras.layers.GRU(128, return_sequences=True),
    tf.keras.layers.Dense(n_tokens, activation="softmax")
])

In [None]:
# Configure the input shape for the stateless model.
stateless_model.build(tf.TensorShape([None, None]))

In [None]:
# Copy the trained weights from the char-RNN model into the stateless model
stateless_model.set_weights(model.get_weights())

In [None]:
# Create a new sequential model that accepts raw text, to test the stateless RNN
my_script_model2 = tf.keras.Sequential([
    text_vec_layer,
    tf.keras.layers.Lambda(lambda X: X - 2),  # no <PAD> or <UNK> tokens
    stateless_model
])

In [None]:
# Test with the new seed text "iP" and extend it by 50 characters
tf.random.set_seed(42) # Set TensorFlow's global random seed for reproducibility
print(extend_text("iP", n_chars50, temperature=0.1))

### Summary

The model predicts the next character (token) as expected. Lower temperatures produce more deterministic text, and higher temperatures produce more random text. Text was generated at temperatures of 0.01, 1, and 100.

A stateless RNN is used for this character-level model because the task is a relatively simple sequence modeling problem in which short-term dependencies are sufficient for accurate predictions.