<a href="https://colab.research.google.com/github/werowe/HypatiaAcademy/blob/master/diana_fixed_generate_shakespeare.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformers

 [Chollet](https://livebook.manning.com/book/deep-learning-with-python-third-edition/chapter-15/v-4/12)

 We can detokenize a sequence by proceeding in reverse – map ints back to string tokens and join them together. With this approach, **our problem becomes building a model that can predict an integer sequence of tokens**.


 A practical approach for making such a prediction problem feasible is to build a model that only **predicts a single token output at a time**.

 Given a sequence of all tokens observed up to a point, a language model will attempt to output a probability distribution over all possible tokens that could come next.
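
As a rough illustration of that idea, the sketch below turns made-up scores for a tiny, hypothetical vocabulary into a probability distribution with softmax and picks the most likely next token. The words, the scores, and the helper name `softmax` are assumptions for illustration only, not output from a real model.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and made-up scores for the context "a runny ..."
toy_vocabulary = ["nose", "egg", "river", "dog"]
toy_scores = [4.0, 2.0, 1.0, 0.5]

probabilities = softmax(toy_scores)
next_token = toy_vocabulary[probabilities.index(max(probabilities))]

for token, p in zip(toy_vocabulary, probabilities):
    print(f"{token!r}: {p:.3f}")
print("most likely next token:", next_token)
```

A real language model produces one such distribution over its whole vocabulary at every step; only the size of the list changes.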



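ChatGPT:  I have a fever and a runny nose.  What do I have?

Features and labels are both sequences of integers after tokenization, for example:

1234234550192919293919912333

I have a fever and a runny ----> predict what word comes next, or predict what letter comes next

Vocabulary = 67 entries when predicting letters: lower case, upper case, space, and so on

Output activations: sigmoid, softmax

Example: the probability of "nose" is 0.8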


Input text:  how do I sort an array in numpy

-----> prediction: np.sort() ...

Predict the next letter, one step at a time:

how do I sort an array in numpy?  n
how do I sort an array in numpy?  np
how do I sort an array in numpy?  np.

At each step: take the text so far -----> tokenize it -----> run the neural network -----> use the vocabulary to predict which letter or word comes next





MNIST output was one of the digits 0 to 9

Here the output is one entry of the vocabulary, which can be letters or words






# 15.1.1 Training a Shakespeare Language Model



In [1]:
import keras

filename = keras.utils.get_file(
    origin=(
        "https://storage.googleapis.com/download.tensorflow.org/"
        "data/shakespeare.txt"
    ),
)
shakespeare = open(filename, "r").read()

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt
1115394/1115394 ━━━━━━━━━━━━━━━━━━━━ 1s 1us/step


In [2]:
shakespeare[0]

'F'

In [3]:
shakespeare[:250]

'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you know Caius Marcius is chief enemy to the people.\n'

 1. First, we will split our data into equal-length chunks that we can batch and use for model training.

 2. We will also split each input into two separate feature and label sequences, each label sequence simply being the input sequence offset by a single character.


In [4]:
import tensorflow as tf

sequence_length = 100


# split_input is a generator: each time the calling loop asks for the next value,
# yield hands back one chunk and pauses until the next one is requested.

def split_input(text, sequence_length):
    for i in range(0, len(text), sequence_length):
        yield text[i : i + sequence_length]

# Labels: shakespeare[1:] skips the first character and runs to the end.
# Features: shakespeare[:-1] starts at the beginning and leaves off the last character.

features = list(split_input(shakespeare[:-1], sequence_length))
labels = list(split_input(shakespeare[1:], sequence_length))
dataset = tf.data.Dataset.from_tensor_slices((features, labels))

In [5]:
labels[0]

'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '

In [6]:
a='First Citizen:\nBefore we proceed any further, hear'
a[:-1]

'First Citizen:\nBefore we proceed any further, hea'

In [7]:
a[1:]

'irst Citizen:\nBefore we proceed any further, hear'

Let’s look at an (x, y) input sample. Our label at each position in the sequence is the next character in the sequence.

In [8]:
# x is the feature sequence and y is the label sequence

# next() returns the first (x, y) pair; as_numpy_iterator() yields the data as NumPy values

x, y = next(dataset.as_numpy_iterator())
print("feature", x[:50])
print("\nlabel", y[:50])


feature b'First Citizen:\nBefore we proceed any further, hear'

label b'irst Citizen:\nBefore we proceed any further, hear '


First Citizen:\nBefore we proceed any further, hear

irst Citizen:\nBefore we proceed any further, hear




To map this input to a sequence of integers, we can again use the TextVectorization layer we saw in the last chapter. To learn a character-level vocabulary instead of a word-level vocabulary, we can change our split argument. Rather than the default **whitespace** splitting, we instead split by **character**. We will do no standardization here – to keep things simple, we will preserve case and pass punctuation through unaltered.

[Chollet](https://livebook.manning.com/book/deep-learning-with-python-third-edition/chapter-15/v-4/22)

In [9]:
from keras import layers

tokenizer = layers.TextVectorization(
    standardize=None,
    split="character",
    output_sequence_length=sequence_length,
)
tokenizer.adapt(dataset.map(lambda text, labels: text))

In [10]:
type(tokenizer)

In [11]:
vocabulary_size = tokenizer.vocabulary_size()
vocabulary_size

67


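A vocabulary size of 67 means the TextVectorization layer's vocabulary has 67 entries when tokenizing at the character level: the unique characters found in the dataset plus the layer's reserved entries (by default an empty padding token at index 0 and `[UNK]` at index 1).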
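
To see where a count like this comes from, here is a pure-Python sketch of a character-level vocabulary. It assumes the layer's default reserved entries (an empty padding token at index 0 and `[UNK]` at index 1) followed by characters ordered by frequency; `build_char_vocabulary`, `encode`, and `decode` are hypothetical helpers, not part of Keras.

```python
from collections import Counter

def build_char_vocabulary(text):
    """Reserved entries first, then characters from most to least frequent."""
    counts = Counter(text)
    # Ties are broken alphabetically here; that ordering is an assumption.
    chars = sorted(counts, key=lambda c: (-counts[c], c))
    return ["", "[UNK]"] + chars

toy_text = "to be or not to be"
toy_vocab = build_char_vocabulary(toy_text)
toy_char_to_id = {char: idx for idx, char in enumerate(toy_vocab)}

def encode(s):
    # Characters missing from the vocabulary map to index 1, "[UNK]".
    return [toy_char_to_id.get(c, 1) for c in s]

def decode(ids):
    # Detokenize: map ids back to characters and join them.
    return "".join(toy_vocab[i] for i in ids)

print(len(set(toy_text)), "unique characters ->", len(toy_vocab), "vocabulary entries")
print(encode("to be"))
print(decode(encode("to be")))
```

The vocabulary is always two entries larger than the number of distinct characters in this sketch, which is the same relationship the reported size of 67 reflects.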

Next, we can apply our tokenization layer to our input text. And finally, we can shuffle, batch, and cache our dataset so we don’t need to recompute it every epoch.

In [13]:
dataset = dataset.map(
    lambda features, labels: (tokenizer(features), tokenizer(labels)),
    num_parallel_calls=8,
)
training_data = dataset.shuffle(10_000).batch(64).cache()

Print a portion of it to see what is inside:

In [14]:
for batch in training_data.take(1):
    features, labels = batch
    print("Features (tokenized):", features.numpy()," \nshape", features.numpy().shape)
    print("Labels (tokenized):", labels.numpy(), " \nshape", labels.numpy().shape)


Features (tokenized): [[15 21  7 ...  4  8  2]
 [ 7  6  9 ... 12 12 54]
 [ 6  4  7 ... 28  8  2]
 ...
 [ 6 10 21 ...  8  8  2]
 [ 2 16  3 ... 15  4  2]
 [27 12 12 ...  5 10  2]]  
shape (64, 100)
Labels (tokenized): [[21  7  2 ...  8  2  4]
 [ 6  9 14 ... 12 54 23]
 [ 4  7  2 ...  8  2 19]
 ...
 [10 21  6 ...  8  2  9]
 [16  3 48 ...  4  2 16]
 [12 12 49 ... 10  2 16]]  
shape (64, 100)


In [15]:
features.numpy()[0]

array([15, 21,  7,  2, 20,  6,  8,  4, 18, 12, 37,  5,  2,  3, 29,  3,  9,
       17,  2,  8, 21,  5, 25,  3,  2, 24, 17,  2,  4,  7,  3,  2, 11, 16,
       16,  5, 14,  3,  9,  6,  4,  3,  2, 15,  8,  3, 12, 31, 15,  9, 10,
        8,  2,  4,  5,  2,  9,  3,  8,  4,  9,  6, 11, 10,  4, 27,  2, 34,
       15,  9,  2, 10,  6,  4, 15,  9,  3,  8,  2, 14,  5,  2, 25, 15,  9,
        8, 15,  3, 18, 12, 38, 11, 30,  3,  2,  9,  6,  4,  8,  2])

# Build Language Model

To build our simple language model, we want to predict the probability of a character given all past characters. Of all the modeling possibilities we have seen so far in this book, an RNN is the most natural fit, as the recurrent state of each cell allows the model to propagate information about past characters when predicting the label of the current character.

> **Definition:  RNN**
>
> A **Recurrent Neural Network (RNN)** is a type of artificial neural network specifically designed to process sequential data, such as text, speech, or time series, where the order and context of the data points matter. Unlike traditional feedforward neural networks, which process each input independently, RNNs have a unique architecture that incorporates loops, allowing information from previous steps in the sequence to influence the processing of current and future inputs.


We can also use an **Embedding** layer to embed each input character as a unique 256-dimensional vector.
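
Conceptually, an embedding is a lookup table with one row of trainable numbers per token id. The sketch below is a plain-Python stand-in with a tiny table (5 tokens and 3 dimensions instead of 67 and 256); `toy_embedding_table` and `embed` are illustrative names, and the random values stand in for weights that training would learn.

```python
import random

random.seed(0)
toy_vocabulary_size = 5   # stand-in for 67
toy_embedding_dim = 3     # stand-in for 256

# One row per token id; a real layer would update these values during training.
toy_embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(toy_embedding_dim)]
    for _ in range(toy_vocabulary_size)
]

def embed(token_ids):
    """Replace each integer id with its row from the table."""
    return [toy_embedding_table[token_id] for token_id in token_ids]

vectors = embed([2, 0, 2])
print(len(vectors), "vectors of length", len(vectors[0]))
print("same id gives the same vector:", vectors[0] == vectors[2])
```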

We will use only a single recurrent layer to keep this model small and easy to train. Any recurrent layer would do here, but to keep things simple, we will use a GRU, which is fast and has a simpler internal state than an LSTM.

> **Definitions:  GRU and LSTM**
>
> **Gated Recurrent Units (GRUs)** and **Long Short-Term Memory networks (LSTMs)** are both advanced types of recurrent neural networks (RNNs) designed to handle sequential data and overcome the vanishing gradient problem seen in standard RNNs.

In [16]:
embedding_dim = 256
hidden_dim = 1024

inputs = layers.Input(shape=(sequence_length,), dtype="int", name="token_ids")
x = layers.Embedding(vocabulary_size, embedding_dim)(inputs)
x = layers.GRU(hidden_dim, return_sequences=True)(x)
x = layers.Dropout(0.1)(x)

outputs = layers.Dense(vocabulary_size, activation="softmax")(x)

model = keras.Model(inputs, outputs)

In [17]:
model.summary(line_length=80)

This model outputs a softmax probability for every possible character in our vocabulary, and we will compile() it with a crossentropy loss. Note that our model is still training on a classification problem; it is just that we make one classification prediction for every token in our sequence. For our batch of 64 samples with 100 characters each, we will predict 64 × 100 = 6,400 individual labels. Loss and accuracy metrics reported by Keras during training will be averaged first across each sequence and second across each batch.
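
The per-token bookkeeping can be sketched in plain Python. The numbers below are made up (2 sequences of 3 positions over a 4-entry vocabulary) and the `toy_` names are hypothetical; the point is only that every position contributes one loss value and one hit or miss. Because all sequences here have the same length, a flat mean over positions equals the mean of per-sequence means.

```python
import math

# Made-up softmax outputs: 2 sequences x 3 positions x 4 vocabulary entries.
toy_predictions = [
    [[0.7, 0.1, 0.1, 0.1], [0.1, 0.2, 0.6, 0.1], [0.4, 0.3, 0.2, 0.1]],
    [[0.1, 0.1, 0.1, 0.7], [0.1, 0.6, 0.2, 0.1], [0.8, 0.1, 0.05, 0.05]],
]
# Integer labels, one per position ("sparse" means ids, not one-hot vectors).
toy_labels = [[0, 2, 1], [3, 3, 0]]

losses, hits = [], []
for sequence, label_row in zip(toy_predictions, toy_labels):
    for probabilities, label in zip(sequence, label_row):
        # Crossentropy for one position: -log(probability of the true token).
        losses.append(-math.log(probabilities[label]))
        # A hit means the true token received the highest probability.
        hits.append(probabilities.index(max(probabilities)) == label)

mean_loss = sum(losses) / len(losses)
accuracy = sum(hits) / len(hits)
print(len(losses), "predictions, loss", round(mean_loss, 4), "accuracy", round(accuracy, 4))
```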

In [18]:
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["sparse_categorical_accuracy"],
)
model.fit(training_data, epochs=20)

Epoch 1/20
175/175 ━━━━━━━━━━━━━━━━━━━━ 14s 53ms/step - loss: 3.0399 - sparse_categorical_accuracy: 0.2407
Epoch 2/20
175/175 ━━━━━━━━━━━━━━━━━━━━ 17s 53ms/step - loss: 1.9292 - sparse_categorical_accuracy: 0.4318
Epoch 3/20
175/175 ━━━━━━━━━━━━━━━━━━━━ 10s 53ms/step - loss: 1.6566 - sparse_categorical_accuracy: 0.5061
Epoch 4/20
175/175 ━━━━━━━━━━━━━━━━━━━━ 10s 54ms/step - loss: 1.5187 - sparse_categorical_accuracy: 0.5423
Epoch 5/20
175/175 ━━━━━━━━━━━━━━━━━━━━ 10s 54ms/step - loss: 1.4363 - sparse_categorical_accuracy: 0.5627
Epoch 6/20
175/175 ━━━━━━━━━━━━━━━━━━━━ 10s 54ms/step - loss: 1.3783 - sparse_categorical_accuracy: 0.5774
Epoch 7/20
175/175 ━━━━━━━━━━━━━━━━━━━━ 10s 55ms/step - loss: 1.3314 - sparse_categorical_accuracy: 0.5890
Epoch 8/20
...

<keras.src.callbacks.history.History at 0x7c7eddb61350>

After 20 epochs, our model can eventually predict the next character in our input sequences around 70% of the time.

# 15.1.2 Generating Shakespeare
Now that we have trained a model that can predict the next individual tokens with some accuracy, we would like to use it to extrapolate an entire predicted sequence. We can do this by calling the model in a loop, where the model’s predicted output at one time-step becomes the model’s input at the next time step. A model built for this kind of feedback loop is sometimes called an autoregressive model.

[Chollet](https://livebook.manning.com/book/deep-learning-with-python-third-edition/chapter-15/v-4/40)

To run such a loop, we need to perform a slight surgery on the model we just trained. During training, our model handled only a fixed sequence length of 100 tokens, and the GRU cell’s state was handled implicitly when calling the layer. During generation, we would like to predict a single output token at a time and explicitly output the state of the GRU’s cell. We need to propagate that state, which contains all information the model has encoded about past input characters, the next time we call the model.

Let’s make a model that handles a single input character at a time and allows explicitly passing the RNN state. Because this model will have the same computational structure, with slightly modified inputs and outputs, we can assign weights from one model to another.
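
The idea of carrying the state by hand can be sketched without Keras. The toy cell below is a single `tanh` unit, not a GRU, and its weights are arbitrary; `toy_rnn_step` and `run_full_sequence` are hypothetical names. It only shows that feeding one token at a time while passing the state back in reaches the same final state as processing the whole sequence in one call.

```python
import math

def toy_rnn_step(token_id, state, w_in=0.5, w_rec=0.9):
    """One step of a toy scalar recurrent cell: new state from input and old state."""
    return math.tanh(w_in * token_id + w_rec * state)

def run_full_sequence(token_ids):
    """Training-style call: the state lives inside the function, out of sight."""
    state = 0.0
    for token_id in token_ids:
        state = toy_rnn_step(token_id, state)
    return state

toy_token_ids = [3, 1, 4, 1, 5]

# Generation-style call: the caller owns the state and hands it back each step.
stepwise_state = 0.0
for token_id in toy_token_ids:
    stepwise_state = toy_rnn_step(token_id, stepwise_state)

print("same final state:", stepwise_state == run_full_sequence(toy_token_ids))
```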

In [19]:
inputs = keras.Input(shape=(1,), dtype="int", name="token_ids")
input_state = keras.Input(shape=(hidden_dim,), name="state")

x = layers.Embedding(vocabulary_size, embedding_dim)(inputs)
x, output_state = layers.GRU(hidden_dim, return_state=True)(
    x, initial_state=input_state
)
outputs = layers.Dense(vocabulary_size, activation="softmax")(x)
generation_model = keras.Model(
    inputs=(inputs, input_state),
    outputs=(outputs, output_state),
)
generation_model.set_weights(model.get_weights())

With this, we can call the model to predict an output sequence in a loop. Before we do, we will make explicit lookup tables so we can switch between characters and integers, and choose a prompt – a snippet of text we will feed as input to the model before we begin predicting new tokens.

# 1. Broken code from class lecture:

```
tokens = tokenizer.get_vocabulary()
token_ids = range(vocabulary_size)

# encode make numbers
char_to_id = dict(zip(tokens, token_ids))

# decode return to letters
id_to_char = dict(zip(token_ids, tokens))

prompt = """
KING RICHARD III:
"""

```

In [20]:
tokens = tokenizer.get_vocabulary()
char_to_id = {char: idx for idx, char in enumerate(tokens)}
id_to_char = {idx: char for idx, char in enumerate(tokens)}

prompt = """
KING RICHARD III:
"""

To begin generation, we first need to “prime” the internal state of the GRU with our prompt. To do this, we will feed the prompt into the model one token at a time. This will compute the exact RNN state the model would see if this prompt had been encountered during training.

When we feed the very last character of the prompt into the model, our state output will capture information about the entire prompt sequence. We can save the final output prediction to later select the first character of our generated response.

# changed code


Diana replaced the broken code below with the version in the next cell.

The difference is that the `.predict()` method returns NumPy arrays, while calling the model directly returns tensors. I don't know why the code crashes when using `.predict()`.

```
input_ids = [char_to_id[c] for c in prompt]
state = keras.ops.zeros(shape=(1, hidden_dim))
for token_id in input_ids:
    inputs = keras.ops.expand_dims([token_id], axis=0)
    predictions, state = generation_model.predict((inputs, state), verbose=0)
```

In [42]:
input_ids = [char_to_id[c] for c in prompt]
state = tf.zeros((1, hidden_dim))

for token_id in input_ids:
    inputs = tf.constant([[token_id]], dtype=tf.int32)

    # Ensure state keeps its batch dimension: shape (1, hidden_dim), i.e. (1, 1024)
    if len(state.shape) == 1:
        state = tf.expand_dims(state, axis=0)

    predictions, state = generation_model((inputs, state), training=False)


Now we are ready to let the model predict a new output sequence. In a loop, up to a desired length, we will continually select the most likely next character predicted by the model, feed that to the model, and persist the new RNN state. In this way, we can predict an entire sequence, one token at a time.
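
A minimal sketch of that loop, using a toy stand-in for the model so it runs on its own. `toy_step` and `greedy_generate` are hypothetical names, the three-letter vocabulary is made up, and the "state" is just a counter; only the control flow (prime with the prompt, pick the most likely id, feed it back, keep the state) mirrors the description above.

```python
def toy_step(token_id, state):
    """Hypothetical stand-in for one call of the generation model.

    Returns (predictions, new_state). It always favors the next id in the
    cycle 0 -> 1 -> 2 -> 0, and the state simply counts the tokens seen.
    """
    vocab_size = 3
    predictions = [0.1] * vocab_size
    predictions[(token_id + 1) % vocab_size] = 0.8
    return predictions, state + 1

def greedy_generate(step_fn, predictions, state, max_length):
    """Repeatedly pick the most likely token and feed it back to the model."""
    generated = []
    for _ in range(max_length):
        next_id = max(range(len(predictions)), key=lambda i: predictions[i])
        generated.append(next_id)
        predictions, state = step_fn(next_id, state)
    return generated, state

toy_id_to_char = {0: "a", 1: "b", 2: "c"}
toy_prompt_ids = [0, 1]  # the prompt "ab"

# Prime the state with the prompt, keeping the prediction after the last prompt token.
toy_state = 0
for token_id in toy_prompt_ids:
    toy_predictions, toy_state = toy_step(token_id, toy_state)

toy_generated_ids, toy_state = greedy_generate(
    toy_step, toy_predictions, toy_state, max_length=5
)
toy_text_out = "".join(toy_id_to_char[i] for i in toy_prompt_ids + toy_generated_ids)
print(toy_text_out)
```

With the notebook's model, the step function would presumably wrap `generation_model((inputs, state), training=False)` and take the argmax over the vocabulary axis of `predictions`; that wiring is an assumption and is not shown here.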

Let’s convert our output integer sequence to a string to see what the model predicted. To detokenize our input, we simply map all token ids to strings and join them together.

In [30]:
output = "".join([id_to_char[token_id] for token_id in generated_ids])
print(prompt + output)



KING RICHARD III:
W


# Plan B.  See if it can work using the predict method instead of a direct call

# Prepare a batch of inputs (e.g., for the first step in generation)

