## GOAL
<p>
Build a system which, when given the English alphabet, can generate new baby names from it and suggest them.
</p>

### How
Build a character-level language model using an LSTM that will learn to predict the next char given the
characters it has seen before. This is a departure from the
bigram character model we saw before, where we just modeled the counts in a frequentist approach.

<span class="mark">y = argmax(P(next char  | curr char ))</span>
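
For comparison, the count-based bigram model can be sketched in a few lines of plain Python. This is a minimal illustration on a made-up list of names; `toy_names` and the `predict_next` helper are assumptions for the example and are not part of this notebook's pipeline:

```python
from collections import Counter, defaultdict

# Toy data for illustration only; the notebook itself reads names.txt
toy_names = ["emma", "ella", "emily"]
text = ".".join(toy_names) + "."

# Count how often each character follows each other character
bigram_counts = defaultdict(Counter)
for curr_char, next_char in zip(text, text[1:]):
    bigram_counts[curr_char][next_char] += 1

def predict_next(curr_char):
    """Return argmax over P(next char | curr char), estimated from counts."""
    return bigram_counts[curr_char].most_common(1)[0][0]

print(predict_next("e"))  # m
```

The prediction depends only on the single current character, which is exactly the limitation the LSTM is meant to lift.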

<p>
In the LSTM world, because we are able to observe and train
for 'long-term dependencies',
we'll model
</p>
<br>

<span class="mark">y = argmax(P(next char | chars at the previous time steps))</span>

<p>
Description: generative char-level LSTM
Type: LSTM
Input: A list of baby names
</p>



## Imports

In [2]:
import tensorflow as tf
import os
import numpy as np
import datetime

# Load the TensorBoard notebook extension
%load_ext tensorboard

# Clear any logs from previous runs
! rm -rf ./logs/

log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

2023-02-01 23:53:20.495267: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [3]:
%tensorboard --logdir logs/fit

## Dataset Prep
* The goal of this section is to prepare a batched `tf.data.Dataset` where the characters have been mapped to indexes using a vocab and `tf.keras.layers.StringLookup`
* We also formulate what the inputs and the targets look like in each batch

In [4]:
CURR_DIR = '' # os.path.dirname(__file__)

BATCH_SIZE = 64
SEQUENCE_LEN = 12
CHAR_EMBEDDING_SIZE = 256

names_f_path = os.path.join(CURR_DIR, "names.txt")
with open(names_f_path, 'r') as f:
    names = f.readlines()
    names_list = [x.strip() for x in names]

ds = '.'.join(names_list)
char_vocab = list(set(''.join(ds)))
vocab_size = len(char_vocab)

print(f"sample names: {names_list[:10]}")
print(f"total #of names: {len(names_list)}")
print(f"sampled sequence: {ds[:20]}")
print(f"\n\nVocab: {char_vocab}")
print(f"Vocab Size: {vocab_size}")



sample names: ['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia', 'amelia', 'harper', 'evelyn']
total #of names: 32033
sampled sequence: emma.olivia.ava.isab


Vocab: ['d', 'o', 'i', 'g', 'e', 'r', 'w', 'v', 'y', 'b', 'k', 'l', 'u', 't', 'a', 'h', 'c', 'n', 'z', 'f', 's', 'm', '.', 'j', 'q', 'p', 'x']
Vocab Size: 27


### Inputs and Targets for LSTM
The table below shows example input sequences (of length 10 here; the code below uses `SEQUENCE_LEN = 12`) and the corresponding target sequences. Each target is the input shifted one character ahead, and it is what the model's next-char predictions are scored against.

<table style="width:100%; text-align: center;">
    <tr>
        <th style="background-color: #f2f2f2;">Input Sequence</th>
        <th style="background-color: #f2f2f2;">Target Sequence</th>
    </tr>
    <tr>
        <td style="border: 3px solid #ddd; padding: 8px;">[s, a, r, t, h, a, k, ., j, i]</td>
        <td style="border: 3px solid #ddd; padding: 8px;">[a, r, t, h, a, k, ., j, i, n]</td>
    </tr>
    <tr>
        <td style="border: 3px solid #ddd; padding: 8px;">[r, t, h, a, k, ., j, i, n, a]</td>
        <td style="border: 3px solid #ddd; padding: 8px;">[t, h, a, k, ., j, i, n, a, l]</td>
    </tr>
    <tr>
        <td style="border: 3px solid #ddd; padding: 8px;">... continue</td>
        <td style="border: 3px solid #ddd; padding: 8px;">... continue</td>
    </tr>
</table>
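
The shift-by-one relationship between an input and its target can be sketched in plain Python. The example string and the `make_pair` helper are hypothetical, chosen only to reproduce the first row of the table:

```python
def make_pair(chars, start, seq_len):
    """Slice seq_len + 1 characters and split them into (input, target)."""
    window = chars[start : start + seq_len + 1]
    return window[:-1], window[1:]

chars = list("sarthak.jinal")
inputs, targets = make_pair(chars, start=0, seq_len=10)
print(inputs)   # ['s', 'a', 'r', 't', 'h', 'a', 'k', '.', 'j', 'i']
print(targets)  # ['a', 'r', 't', 'h', 'a', 'k', '.', 'j', 'i', 'n']
```

At every position, the target holds the character that comes right after the input character at the same position.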

* **Notice** below that there is an extra element at position 0, added by the `StringLookup` class, which is responsible for indexing the `[UNK]` token at the 0th index of the char --> int lookup table
* As a result, the `Embedding` layer will need to have `input_dim = vocab_size + 1`
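
As a rough plain-Python analogue of that lookup table (a sketch of the indexing convention only, not of how `StringLookup` is implemented; `toy_vocab` and `encode` are made up for the example):

```python
toy_vocab = ["a", "b", "c"]

# Index 0 is reserved for out-of-vocabulary characters, so real
# characters start at index 1
index_to_char = ["[UNK]"] + toy_vocab
char_to_index = {ch: i for i, ch in enumerate(index_to_char)}

def encode(chars):
    """Map each character to its index, falling back to 0 for unknowns."""
    return [char_to_index.get(ch, 0) for ch in chars]

print(encode("cab?"))       # [3, 1, 2, 0]
print(len(index_to_char))   # 4, i.e. len(toy_vocab) + 1
```

The table has one more entry than the vocabulary, which is why the embedding needs `vocab_size + 1` rows.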

---

<span class="mark">Key Learning: Always be rigorous when reading the TensorFlow documentation or using LLMs for Q&A</span>

In [146]:
ds_split = list(ds)
ds_split_sequences = []
for i in range(0, len(ds_split), SEQUENCE_LEN + 1):
    ds_split_sequences.append(ds_split[i: i + SEQUENCE_LEN + 1])

print(ds_split_sequences[:3])


"""
The vocabulary for the layer must be either supplied on
construction or learned via adapt()
This layer translates a set of arbitrary strings into integer output
via a table-based vocabulary lookup.
This layer will perform no splitting or transformation of input strings.
For a layer that can split and tokenize natural language,
see the TextVectorization layer.

"""
char_encoder_layer = tf.keras.layers.StringLookup(vocabulary=char_vocab, )
char_decoder_layer = tf.keras.layers.StringLookup(vocabulary=char_encoder_layer.get_vocabulary(), output_mode="int", invert=True)
test_encode = char_encoder_layer(list("sarthak"))
print(test_encode)
print(char_decoder_layer(test_encode))

c = tf.constant([ 0, 15, 23, 14,  6, 15, 16])
print(char_decoder_layer(c))

# return a tf.tensor of ints
sequence_int_matrix = char_encoder_layer(np.array(ds_split_sequences[:-1]))
print(f'sequence_int_matrix shape: {sequence_int_matrix.shape}')

X_dataset = sequence_int_matrix[:, :-1]
Y_dataset = sequence_int_matrix[: , 1:]
X = tf.data.Dataset.from_tensor_slices(X_dataset)
Y = tf.data.Dataset.from_tensor_slices(Y_dataset)
print(f"Y shape: {Y.element_spec.shape}")
print(f"X shape: {X.element_spec.shape}")

[['e', 'm', 'm', 'a', '.', 'o', 'l', 'i', 'v', 'i', 'a', '.', 'a'], ['v', 'a', '.', 'i', 's', 'a', 'b', 'e', 'l', 'l', 'a', '.', 's'], ['o', 'p', 'h', 'i', 'a', '.', 'c', 'h', 'a', 'r', 'l', 'o', 't']]
tf.Tensor([ 1 26 15 22  5 26 16], shape=(7,), dtype=int64)
tf.Tensor([b's' b'a' b'r' b't' b'h' b'a' b'k'], shape=(7,), dtype=string)
tf.Tensor([b'[UNK]' b'r' b'u' b'l' b'z' b'r' b'k'], shape=(7,), dtype=string)
sequence_int_matrix shape: (17549, 13)
Y shape: (12,)
X shape: (12,)


In [148]:
tf_batch_dataset = tf.data.Dataset.from_tensor_slices(
    sequence_int_matrix).shuffle(27).batch(batch_size=BATCH_SIZE, drop_remainder=True)

def split_input_target(sequence):
    """
    Creates Input and labels such that labels sequences are shifted by 1 position to represent next prediction
    """
    input_text = sequence[:, :-1]
    target_text = sequence[:, 1:]
    return input_text, target_text

tf_batch_dataset = tf_batch_dataset.map(split_input_target) # will now contain 2 TensorSpecs 

print(f"BatchDatasetXY: {tf_batch_dataset}")
for input_example, target_example in tf_batch_dataset.take(1):
    print("Input :", char_decoder_layer(input_example.numpy())[:1])
    print("Target:", char_decoder_layer(target_example.numpy())[:1])

BatchDatasetXY: <MapDataset element_spec=(TensorSpec(shape=(64, 12), dtype=tf.int64, name=None), TensorSpec(shape=(64, 12), dtype=tf.int64, name=None))>
Input : tf.Tensor([[b'k' b'l' b'y' b'n' b'.' b'b' b'e' b'l' b'l' b'a' b'.' b'c']], shape=(1, 12), dtype=string)
Target: tf.Tensor([[b'l' b'y' b'n' b'.' b'b' b'e' b'l' b'l' b'a' b'.' b'c' b'l']], shape=(1, 12), dtype=string)


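* Notice that a name such as `sophia` gets split across two consecutive sequences (in the printed sequences above, the second one ends with `s` and the third starts with `ophia`), because the text is cut into fixed-length chunks regardless of name boundaries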
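
The effect of cutting the text into fixed-size pieces can be reproduced in plain Python. The `chunk` helper below is a hypothetical stand-in for the `range(0, len(ds_split), SEQUENCE_LEN + 1)` loop used earlier, applied to the first six names only:

```python
def chunk(text, size):
    """Cut text into consecutive, non-overlapping pieces of length `size`."""
    return [text[i : i + size] for i in range(0, len(text), size)]

text = "emma.olivia.ava.isabella.sophia.charlotte"
pieces = chunk(text, 13)  # SEQUENCE_LEN + 1
for piece in pieces:
    print(piece)
# emma.olivia.a
# va.isabella.s
# ophia.charlot
# te
```

Names that straddle a chunk boundary (`ava`, `sophia`, `charlotte` here) are never seen whole within a single training sequence.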

## Building the Model [Level 1]
Here we will use TensorFlow's `Sequential API` to train a model in the simplest and quickest way. Used as-is, the `Sequential API` has its limitations:
* Not being able to control forward passes by overriding `call` 
* Not being able to write a custom training step using `train_step()`
* Not being able to write `custom loss functions and/or metrics` that you would like to compute during said training step
* Not being able to branch on `training` or `mask` (`True` vs `False`) when you want different behavior during train and test
* Not being able to differentiate trainable vs non-trainable variables and selectively apply gradients 

###  Character Embeddings
While the mapping of characters to indexes in the previous section is great, we'll need to:
1. Encode the indexes into some representation. We'll be going forward with **Character Embeddings**. **Note:** this is not mandatory; we could have used another representation.
2. Create the embedding layer for this using `tf.keras.layers.Embedding` with `output_dim=256`
3. The weights between the <u>input layer & embedding layer</u> will be learned through backprop
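
To make the lookup concrete: an embedding layer is a trainable matrix with one row per index, and looking up an index is equivalent to multiplying a one-hot vector by that matrix. A NumPy sketch, with random values standing in for learned weights and sizes chosen to match this notebook (28 indexes, 256 dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
num_indexes, embedding_dim = 28, 256   # vocab_size + 1, CHAR_EMBEDDING_SIZE

# Stand-in for the learned embedding matrix
embedding_matrix = rng.normal(size=(num_indexes, embedding_dim))

char_ids = np.array([1, 26, 15])       # three encoded characters

# Row lookup: pick one row of the matrix per index
looked_up = embedding_matrix[char_ids]

# Equivalent one-hot matrix product
one_hot = np.eye(num_indexes)[char_ids]
via_matmul = one_hot @ embedding_matrix

print(looked_up.shape)                       # (3, 256)
print(np.allclose(looked_up, via_matmul))    # True
```

The row lookup avoids materializing the one-hot vectors, which is the practical reason to prefer it.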

###  Hidden LSTM Layer 
* The number of `units` sets the size of the LSTM's hidden state and cell state, and therefore of its output vector at each timestep. It does **not** set the number of timesteps/recurrences/iterations; that comes from the length of the sequence fed into the layer. In this notebook `units` happens to be set to `SEQUENCE_LEN` (12), but the two are independent.
* If we set `return_state=True` and `return_sequences=True`, we'll have:
    * access to additional tensors, i.e. `return_sequences=True` gives the hidden state at every timestep, with shape `(batch_size, timesteps, units)`, and `return_state=True` returns the final hidden and cell states in addition to the sequences
    * **Won't be able to use the Sequential API** (because of `return_state=True`), since it requires that all layers in a Sequential model have a single output tensor. For multi-output layers, use the functional API.
    
* The LSTM layer has weights in `multiples of 4`, such that the shape of the input weights matrix will be `(input_dim, units*4)`, here `(256, 48)`. The four blocks correspond to the three gates and the candidate cell update:
    * **forget gate:** decides what to forget from the cell state
    * **update (input) gate:** decides what from the input to let through to update the cell state
    * **output gate:** decides how much of the cell state to pass on as the output
    * **cell candidate:** the proposed new content for the running cell state. The cell state is responsible for tracking long-term dependencies, and the gates control the information that enters or leaves it.
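
A single LSTM timestep can be sketched in NumPy to show where the factor of 4 comes from. This is an illustrative sketch with random weights, not the Keras implementation; the `lstm_step` helper is hypothetical, and the block order (input gate, forget gate, candidate, output gate) is assumed to follow the Keras convention:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, kernel, recurrent_kernel, bias):
    """One LSTM timestep. The fused weights hold four blocks of size `units`."""
    z = x_t @ kernel + h_prev @ recurrent_kernel + bias
    i, f, g, o = np.split(z, 4, axis=-1)   # input gate, forget gate, candidate, output gate
    c_t = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h_t = sigmoid(o) * np.tanh(c_t)
    return h_t, c_t

input_dim, units, batch, timesteps = 256, 12, 64, 12
rng = np.random.default_rng(0)
kernel = rng.normal(scale=0.1, size=(input_dim, 4 * units))
recurrent_kernel = rng.normal(scale=0.1, size=(units, 4 * units))
bias = np.zeros(4 * units)

h, c = np.zeros((batch, units)), np.zeros((batch, units))
sequence = rng.normal(size=(batch, timesteps, input_dim))
for t in range(sequence.shape[1]):         # the number of steps comes from the input
    h, c = lstm_step(sequence[:, t], h, c, kernel, recurrent_kernel, bias)

print(kernel.shape, h.shape)   # (256, 48) (64, 12)
```

The loop length is taken from the input sequence, while `units` only fixes the width of `h` and `c`.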

In [173]:
RETURN_SEQUENCES = True

model = tf.keras.Sequential(name="CharLevel_GenerativeLangModel_LSTM")
 
embedding_layer= tf.keras.layers.Embedding(
        input_dim=vocab_size+1,
        output_dim=CHAR_EMBEDDING_SIZE, 
        name=f"CharacterEmbeddingLayer-{CHAR_EMBEDDING_SIZE}"
    )

model.add(embedding_layer)

lstm_layer= tf.keras.layers.LSTM(
        units=SEQUENCE_LEN,
        name=f"LSTMLayer_WithSEQLEN-{SEQUENCE_LEN}", 
        return_sequences=RETURN_SEQUENCES
)
model.add(lstm_layer)

dense_layer = tf.keras.layers.Dense(vocab_size+1)
model.add(dense_layer)

loss_fn=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss=loss_fn,
    metrics=[tf.keras.metrics.SparseCategoricalCrossentropy(from_logits=True)]
)

model.summary()

Model: "CharLevel_GenerativeLangModel_LSTM"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 CharacterEmbeddingLayer-256  (None, None, 256)        7168      
  (Embedding)                                                    
                                                                 
 LSTMLayer_WithSEQLEN-12 (LS  (None, None, 12)         12912     
 TM)                                                             
                                                                 
 dense_31 (Dense)            (None, None, 28)          364       
                                                                 
Total params: 20,444
Trainable params: 20,444
Non-trainable params: 0
_________________________________________________________________
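
The parameter counts in the summary can be reproduced by hand. The formulas below are the standard ones for embedding, LSTM and dense layers; the variable names are only for this illustration:

```python
vocab_size = 27
num_indexes = vocab_size + 1      # +1 for the [UNK] index
embedding_dim = 256
lstm_units = 12

embedding_params = num_indexes * embedding_dim
# 4 blocks, each with input weights, recurrent weights and a bias
lstm_params = 4 * (embedding_dim * lstm_units + lstm_units * lstm_units + lstm_units)
# weights plus one bias per output unit
dense_params = lstm_units * num_indexes + num_indexes

print(embedding_params, lstm_params, dense_params)       # 7168 12912 364
print(embedding_params + lstm_params + dense_params)     # 20444
```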


### Gut check on examples

In [174]:
if not RETURN_SEQUENCES:
    print(f"RETURN_SEQUENCES is set to: {RETURN_SEQUENCES} ")
    for input_example, target_example in tf_batch_dataset.take(1):
        # here the input and target examples are indexes, which the Embedding layer
        # converts into char --> vector representations of dim = 256
        x = model(input_example)
        print(x.shape)
        print(tf.reduce_sum(x[:1, ])) # checking if the model is outputting probabilities
        print(np.sum(x[:1, ].numpy()))
        print(x)
        

<span class="mark">Setting `RETURN_SEQUENCES=True` returns the LSTM output at every timestep of the sequence. <span class="girk">This is desired since</span> you would like to know what the next-char prediction was for each char in the sequence, and not just for the final character in the sequence</span>

In [175]:
if RETURN_SEQUENCES :
    print(f"RETURN_SEQUENCES is set to: {RETURN_SEQUENCES} ")
    for input_example, target_example in tf_batch_dataset.take(1):
        x = model(input_example)
        print(f"output shape: {x.shape}, label shape: {target_example.shape}")
        is_probs = tf.reduce_sum(x[:1,:1, :]).numpy()
        test_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=target_example, logits=x)
        print(f"Loss on this example is: {test_loss.shape}") 
        print(float(is_probs))
        if np.isclose(is_probs, 1.0):
            print(f"the output is a probabilities array of len {vocab_size + 1}")
        else:
            print("!!!! We are using Logits (Un-normalized probs) !!!! ")

RETURN_SEQUENCES is set to: True 
output shape: (64, 12, 28), label shape: (64, 12)
Loss on this example is: (64, 12)
0.02270607277750969
!!!! We are using Logits (Un-normalized probs) !!!! 
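
The check above sums raw model outputs; since they are logits, the sum is not 1. Below is a NumPy sketch of how logits become probabilities via softmax, and how a per-position sparse cross-entropy of shape `(64, 12)` arises. Random numbers stand in for the model's logits and for the labels:

```python
import numpy as np

def softmax(logits):
    # subtract the max for numerical stability
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(64, 12, 28))          # (batch, timesteps, vocab_size + 1)
labels = rng.integers(0, 28, size=(64, 12))     # one integer target per position

probs = softmax(logits)
print(np.allclose(probs.sum(axis=-1), 1.0))     # True

# Sparse cross-entropy: negative log-probability of the correct index
picked = np.take_along_axis(probs, labels[..., None], axis=-1).squeeze(-1)
loss = -np.log(picked)
print(loss.shape)                               # (64, 12)
```

Passing `from_logits=True` to the loss tells Keras to apply this normalization internally, so the Dense layer does not need a softmax activation.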


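**Notice** that the LSTM-to-Dense weights matrix is of shape `(lstm_units, dense_units)`, here `(12, 28)`. The 12 is the size of the LSTM hidden state `h_t` (which we happened to set equal to `SEQUENCE_LEN=12`), not the number of timesteps. With `return_sequences=True`, the Dense layer is applied to `h_t` at every timestep, not only to the final state after the entire sequence has been fed in.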

![Diagram of an LSTM cell](LSTMCell.png "LSTM cell")




In [176]:
print(f"embedding layer weight matrix: {embedding_layer.get_weights()[0].shape}")
print(f"LSTM layer weight matrix: {lstm_layer.get_weights()[0].shape}")
print(f"Dense layer weight matrix: {dense_layer.get_weights()[0].shape}")

embedding layer weight matrix: (28, 256)
LSTM layer weight matrix: (256, 48)
Dense layer weight matrix: (12, 28)
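
These shapes can be sanity-checked with a NumPy sketch of a Dense layer applied to a batch of per-timestep LSTM outputs. Random values stand in for the real weights; the point is that the first dimension of the `(12, 28)` kernel is the LSTM output size, and the same kernel works for any number of timesteps:

```python
import numpy as np

rng = np.random.default_rng(0)
lstm_units, dense_units = 12, 28

dense_kernel = rng.normal(size=(lstm_units, dense_units))
dense_bias = np.zeros(dense_units)

# (batch, timesteps, lstm_units): one hidden state per timestep
hidden_states = rng.normal(size=(64, 12, lstm_units))
logits = hidden_states @ dense_kernel + dense_bias
print(logits.shape)          # (64, 12, 28)

# A longer sequence reuses the same kernel unchanged
longer_states = rng.normal(size=(64, 20, lstm_units))
longer_logits = longer_states @ dense_kernel + dense_bias
print(longer_logits.shape)   # (64, 20, 28)
```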


### Fit

In [183]:
model.fit(
    tf_batch_dataset,
    batch_size=None,
    epochs=60,
    verbose='auto',
    callbacks=[tensorboard_callback],
    validation_split=0.0,
    validation_data=None,
    shuffle=True,
    class_weight=None,
    sample_weight=None,
    initial_epoch=0,
    steps_per_epoch=None,
    validation_steps=None,
    validation_batch_size=None,
    validation_freq=1,
    max_queue_size=10,
    workers=2,
    use_multiprocessing=False
)

Epoch 1/60
...
Epoch 60/60


<keras.callbacks.History at 0x7fa03ae7c850>

## Building the Model [Level 2]
* Here we go beyond the `Sequential API` by subclassing `tf.keras.Model` to solve the same problem as above. Keras is based on the **core principle of progressive disclosure of complexity: access to more complexity without falling off a cliff.** 
* We'll use the following more complex features:
    * extending the `tf.keras.Model` class to write our own forward pass by overriding the `call()` method, and our own training step by overriding `train_step()`
   

In [None]:
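class CharLSTM(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, lstm_units):
        super().__init__()
        # same three layers as the Sequential model in Level 1
        self.embedding = tf.keras.layers.Embedding(vocab_size + 1, embedding_dim)
        self.lstm = tf.keras.layers.LSTM(lstm_units, return_sequences=True)
        self.dense = tf.keras.layers.Dense(vocab_size + 1)

    def call(self, inputs, training=False):
        # custom forward pass: indexes -> embeddings -> hidden states -> logits
        return self.dense(self.lstm(self.embedding(inputs), training=training))

    def train_step(self, input_data):
        x, y = input_data

        with tf.GradientTape() as tape:
            # this is equivalent to calling model(inputs, training=True),
            # which in turn uses the call() method defined above
            y_pred = self(x, training=True)
            loss = loss_fn(y, y_pred)  # compare labels with predictions, not with inputs

        # compute gradients of the loss and apply them to the trainable variables
        gradients = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))
        # this minimal step reports only the loss; compiled metrics are not updated here
        return {"loss": loss}

model = CharLSTM(vocab_size, CHAR_EMBEDDING_SIZE, SEQUENCE_LEN)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss=loss_fn,
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()]  # labels are integer indexes
)

# build the weights with a dummy batch so that the model can be plotted
model(tf.zeros((1, SEQUENCE_LEN), dtype=tf.int64))
tf.keras.utils.plot_model(model, "charLSTM.png", show_shapes=True)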

## TODOs
2. Create a multi-layer LSTM rather than just 1 LSTM layer
4. Think about sequence structure and whether there are other ways to formulate the problem
5. What does masking do in the `TextVectorization` and `Embedding` layers?
6. When to use `TextVectorization` vs `StringLookup`


## Key Learnings
* The difference between `Dataset.from_tensor_slices()` and `Dataset.from_tensors()`: the `from_tensor_slices` method slices the input data along its first dimension, so the first dimension no longer appears in the element shape
* The `TextVectorization` layer is strict about the data type you can call `adapt` on. It works with a `tf.data.Dataset` or an `np.array`
* Keras is built on the **core principle of progressive disclosure of complexity**. Therefore there are many ways of training the model: 
    * Using the `Sequential API` and directly calling `model.compile()` and `model.fit()`. Here `model.fit()` is responsible for running the forward pass
    * Subclassing `tf.keras.Model` 

In [61]:
import numpy as np
a = np.array([[10]*10]*1200)
a.shape, 1200%64

((1200, 10), 48)

In [65]:
tf_batched_a = tf.data.Dataset.from_tensor_slices(a).batch(64, drop_remainder=True)
tf_batched_a 
# notice how the mention of 1200 has been removed and the remaining 
# rows that won't fit the batch size will get dropped

<BatchDataset element_spec=TensorSpec(shape=(64, 10), dtype=tf.int64, name=None)>