# Learning Notes for Deep Dive Machine Learning

### 1 My GUID: 2719097S

### 2 My GitHub repository URL: https://github.com/shenhaoranzhishuai/shenhaoranzhishuai.github.io.git

### 3 Selected lab exercises: option 2: Generating Text with Neural Networks
#### 3-1: Getting the Data
The code below uses TensorFlow to download a text file containing the works of Shakespeare from the shortcut URL `https://homl.info/shakespeare`. The `get_file` function retrieves the file, caches it locally as `shakespeare.txt`, and returns the local path. The script then opens that file in a `with` statement and reads its contents into the variable `shakespeare_text`. The second cell prints the first 80 characters of `shakespeare_text`, which is a quick way to inspect the text and get a feel for its structure and style.

In [None]:
import tensorflow as tf

shakespeare_url = "https://homl.info/shakespeare"  # shortcut URL
filepath = tf.keras.utils.get_file("shakespeare.txt", shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()

In [None]:
print(shakespeare_text[:80]) # not relevant to machine learning but relevant to exploring the data

#### 3-2: Preparing the Data
In the code below, a `TextVectorization` layer from Keras converts the text into integer character IDs. Setting `split="character"` makes the layer split the text into individual characters rather than words, and `standardize="lower"` converts the text to lowercase first, which keeps the vocabulary small. Calling `adapt` on the Shakespeare text lets the layer build its vocabulary: it collects the distinct characters in the text and assigns an integer ID to each one.

After adaptation, calling `text_vec_layer([shakespeare_text])` encodes the text as a batch containing a single sequence of IDs, and the `[0]` index extracts that sequence into `encoded`. The second cell prints the encoded batch with `print(text_vec_layer([shakespeare_text]))`, showing the text as a tensor of integers in which each distinct character has its own ID. This kind of vectorization is a common preprocessing step in natural language processing, because machine learning models require numerical input.


In [None]:
text_vec_layer = tf.keras.layers.TextVectorization(split="character",
                                                   standardize="lower")
text_vec_layer.adapt([shakespeare_text])
encoded = text_vec_layer([shakespeare_text])[0]

In [None]:
print(text_vec_layer([shakespeare_text]))
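To make the mapping concrete, here is a minimal pure-Python sketch of what a character-level vectorizer does conceptually. It is not the `TextVectorization` implementation: the helper names `build_vocabulary` and `encode` are hypothetical, and the alphabetical tie-breaking between equally frequent characters is an assumption made for this example.

```python
from collections import Counter

def build_vocabulary(text):
    """Return a token list: index 0 is padding, index 1 is unknown,
    followed by the characters ordered from most to least frequent."""
    counts = Counter(text.lower())
    # Sort by descending count; break ties alphabetically (an assumption).
    chars = sorted(counts, key=lambda c: (-counts[c], c))
    return ["", "[UNK]"] + chars

def encode(text, vocabulary):
    """Map each lowercased character to its ID; unseen characters map to 1."""
    char_to_id = {c: i for i, c in enumerate(vocabulary)}
    return [char_to_id.get(c, 1) for c in text.lower()]

vocabulary = build_vocabulary("To be or not to be")
ids = encode("To be?", vocabulary)
print(vocabulary)  # → ['', '[UNK]', ' ', 'o', 't', 'b', 'e', 'n', 'r']
print(ids)         # → [4, 3, 2, 5, 6, 1]
```

The `?` was not in the text used to build the vocabulary, so it is encoded as 1, the unknown token.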

The code below subtracts 2 from every encoded ID. IDs 0 (padding) and 1 (unknown characters) are reserved by the layer but never occur in this text, so the shift makes the character IDs start at 0. `n_tokens` is the number of distinct characters, excluding the two special tokens (39 here), and `dataset_size` is the total number of characters in the encoded text (1,115,394). `n_tokens` later sets the input dimension of the embedding layer and the size of the output layer, and `dataset_size` guides how the text is split into training, validation, and test sets.

In [None]:
encoded -= 2  # drop tokens 0 (pad) and 1 (unknown), which we will not use
n_tokens = text_vec_layer.vocabulary_size() - 2  # number of distinct chars = 39
dataset_size = len(encoded)  # total number of chars = 1,115,394

In [None]:
print(n_tokens, dataset_size)
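The effect of the shift can be checked on a toy example. The vocabulary below is hypothetical and much smaller than the real one; it only illustrates that subtracting 2 gives IDs in the range `0` to `n_tokens - 1`, and that adding 2 back recovers the original characters.

```python
# Hypothetical toy vocabulary: index 0 is padding, index 1 is unknown.
vocabulary = ["", "[UNK]", " ", "e", "t", "o"]

raw_ids = [4, 5, 2, 3]                    # IDs as produced by the layer
shifted_ids = [i - 2 for i in raw_ids]    # IDs used by the model, starting at 0
n_tokens = len(vocabulary) - 2            # number of real characters

# Adding 2 back recovers the original vocabulary entries.
decoded = "".join(vocabulary[i + 2] for i in shifted_ids)
print(shifted_ids, n_tokens, repr(decoded))  # → [2, 3, 0, 1] 4 'to e'
```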

The code below defines a function `to_dataset` that converts a sequence of data into a TensorFlow Dataset suitable for training machine learning models. The function takes several parameters:

- `sequence`: The input sequence of data.
- `length`: The number of characters in each input sequence. Each window contains `length + 1` characters so that the inputs and the targets, which are offset by one character, both have `length` characters.
- `shuffle`: A boolean indicating whether to shuffle the data.
- `seed`: Seed for reproducibility if shuffling is enabled.
- `batch_size`: The size of the batches in the resulting dataset.

The function creates a TensorFlow Dataset from the input sequence and slices it into overlapping windows of `length + 1` characters, each shifted by one character from the previous one (`drop_remainder=True` discards the incomplete windows at the end). The `flat_map` call converts each window, which is itself a small dataset, into a single tensor. If `shuffle` is set, the windows are shuffled using a buffer of 100,000 elements. The windows are then batched, and each batch is split into a tuple of inputs (`window[:, :-1]`) and targets (`window[:, 1:]`), so the target at each position is the character that follows the corresponding input character. Finally, `prefetch(1)` prepares one batch in advance so that data loading overlaps with training.

The code then uses this function to create three datasets from the encoded text: `train_set` from the first 1,000,000 characters, `valid_set` from the next 60,000, and `test_set` from the remaining 55,394. Only `train_set` is shuffled, and all three use sequences of length 100. These datasets are used to train, validate, and test the sequence model.

In [None]:
def to_dataset(sequence, length, shuffle=False, seed=None, batch_size=32):
    ds = tf.data.Dataset.from_tensor_slices(sequence)
    ds = ds.window(length + 1, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda window_ds: window_ds.batch(length + 1))
    if shuffle:
        ds = ds.shuffle(100_000, seed=seed)
    ds = ds.batch(batch_size)
    return ds.map(lambda window: (window[:, :-1], window[:, 1:])).prefetch(1)

In [None]:
length = 100
tf.random.set_seed(42)
train_set = to_dataset(encoded[:1_000_000], length=length, shuffle=True,
                       seed=42)
valid_set = to_dataset(encoded[1_000_000:1_060_000], length=length)
test_set = to_dataset(encoded[1_060_000:], length=length)
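The windowing logic is easier to follow on a short list. The sketch below reproduces only the window and input/target split in plain Python, without shuffling, batching, or prefetching; `make_windows` is a hypothetical helper, not part of `tf.data`.

```python
import math

def make_windows(sequence, length):
    """Yield (inputs, targets) pairs from windows of length + 1, shifted by 1."""
    for start in range(len(sequence) - length):
        window = sequence[start:start + length + 1]
        yield window[:-1], window[1:]

pairs = list(make_windows([0, 1, 2, 3, 4, 5], length=3))
for inputs, targets in pairs:
    print(inputs, "->", targets)
# → [0, 1, 2] -> [1, 2, 3]
# → [1, 2, 3] -> [2, 3, 4]
# → [2, 3, 4] -> [3, 4, 5]

# Size of the training set built above: 1,000,000 characters, length 100.
n_windows = 1_000_000 - 100             # one window per valid start position
n_batches = math.ceil(n_windows / 32)   # the last batch is smaller than 32
print(n_windows, n_batches)             # → 999900 31247
```

A sequence of `n` characters yields `n - length` windows, which is why consecutive training samples overlap almost entirely.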

#### 3-3: Building and Training the Model
The code below sets the random seed (`tf.random.set_seed(42)`) for reproducibility and builds a sequential Keras model for character-level text generation. The model has three layers: an `Embedding` layer with input dimension `n_tokens` and output dimension 16, which maps each character ID to a 16-dimensional vector; a `GRU` layer with 128 units and `return_sequences=True`, so that it outputs a vector at every time step rather than only at the last one; and a `Dense` output layer with `n_tokens` units and a softmax activation, which produces a probability for each possible next character. The model is compiled with the sparse categorical cross-entropy loss, the Nadam optimizer, and accuracy as the metric. `model.fit` trains it on `train_set` for 20 epochs of 1,000 steps each, validating on `valid_set`, and the trained model is saved to `./my_shakespeare_model/myModel.h5`.

The last cell of this section wraps the trained model in a combined model, `shakespeare_model`, which adds the text vectorization layer (`text_vec_layer`) and a `Lambda` layer that subtracts 2 from the IDs. The combined model accepts raw text as input and outputs next-character probabilities.

In [None]:
tf.random.set_seed(42)  # extra code – ensures reproducibility on CPU
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=n_tokens, output_dim=16),
    tf.keras.layers.GRU(128, return_sequences=True),
    tf.keras.layers.Dense(n_tokens, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
history = model.fit(train_set, validation_data=valid_set, epochs=20,
                    steps_per_epoch=1000)
model.save('./my_shakespeare_model/myModel.h5')


The network structure can be changed by replacing the `GRU` layer in `tf.keras.layers.GRU(128, return_sequences=True)` with another recurrent layer, such as `LSTM` or `SimpleRNN` (the generic `RNN` layer expects a cell object rather than a number of units). Different layers train differently, so comparing them helps find the architecture best suited to the task.
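One way to compare these options before training is to estimate how many trainable parameters each recurrent layer would have. The sketch below is a rough calculation, assuming the Keras default settings (in particular `reset_after=True` for `GRU`, which adds a second bias vector); `recurrent_params` is a hypothetical helper, and `model.summary()` remains the authoritative source for the real counts.

```python
def recurrent_params(kind, input_dim, units):
    """Estimated trainable parameters of one recurrent layer (Keras defaults assumed)."""
    per_set = units * (input_dim + units)    # input kernel + recurrent kernel
    if kind == "SimpleRNN":
        return per_set + units               # one weight set, one bias vector
    if kind == "LSTM":
        return 4 * (per_set + units)         # four weight sets, one bias each
    if kind == "GRU":
        return 3 * (per_set + 2 * units)     # three weight sets, two biases each
    raise ValueError(f"unknown layer kind: {kind}")

# Same sizes as the model above: 16-dimensional embeddings, 128 units.
counts = {kind: recurrent_params(kind, input_dim=16, units=128)
          for kind in ("SimpleRNN", "GRU", "LSTM")}
print(counts)  # → {'SimpleRNN': 18560, 'GRU': 56064, 'LSTM': 74240}
```

Under these assumptions the `LSTM` variant is the largest and the `SimpleRNN` variant the smallest, which is one factor to weigh against training time and output quality.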


In [None]:
shakespeare_model = tf.keras.Sequential([
    text_vec_layer,
    tf.keras.layers.Lambda(lambda X: X - 2),  # no <PAD> or <UNK> tokens
    model
])

#### 3-4: Generating Text
The code below uses the trained combined model (`shakespeare_model`) to predict the next character after the input text "To be or not to b".

- `y_proba = shakespeare_model.predict(["To be or not to b"])[0, -1]`: The model outputs a probability distribution for every time step of every input text. The `[0, -1]` index selects the first (and only) text in the batch and its last time step, which holds the distribution for the character that follows the input.

- `y_pred = tf.argmax(y_proba)`: This picks the most probable character by taking the index of the largest value in the distribution.

- `text_vec_layer.get_vocabulary()[y_pred + 2]`: This maps the predicted ID back to a character. Adding 2 undoes the earlier shift, because the vocabulary still starts with the padding and unknown tokens.

In short, the model predicts a distribution over the next character, `argmax` picks the most likely ID, and the vocabulary converts that ID back to a character.

In [None]:
y_proba = shakespeare_model.predict(["To be or not to b"])[0, -1]
y_pred = tf.argmax(y_proba)  # choose the most probable character ID
text_vec_layer.get_vocabulary()[y_pred + 2]
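The indexing and lookup steps can be traced with plain Python lists. The probabilities and the vocabulary below are made up for illustration; only the shape (`[batch, time, n_tokens]`) mirrors the model output.

```python
# Hypothetical model output for a batch of 1 text with 3 time steps and
# 4 possible characters: shape [batch, time, n_tokens].
y_proba_batch = [[
    [0.7, 0.1, 0.1, 0.1],   # prediction after the 1st character
    [0.2, 0.5, 0.2, 0.1],   # prediction after the 2nd character
    [0.1, 0.2, 0.1, 0.6],   # prediction after the 3rd (last) character
]]
vocabulary = ["", "[UNK]", " ", "e", "t", "o"]  # hypothetical toy vocabulary

y_proba = y_proba_batch[0][-1]                              # same idea as [0, -1]
y_pred = max(range(len(y_proba)), key=y_proba.__getitem__)  # argmax over the IDs
next_character = vocabulary[y_pred + 2]                     # undo the shift by 2
print(y_pred, next_character)  # → 3 o
```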

In [None]:
log_probas = tf.math.log([[0.5, 0.4, 0.1]])  # probas = 50%, 40%, and 10%
tf.random.set_seed(42)
tf.random.categorical(log_probas, num_samples=8)  # draw 8 samples
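The cell above samples class indices at random instead of always picking the most probable one: `tf.random.categorical` takes log probabilities (logits) and draws indices in proportion to the corresponding probabilities. The sketch below imitates that behavior with the standard library to show that, over many draws, the sample frequencies approach 50%, 40%, and 10%. `sample_categorical` is a hypothetical helper, and its random stream is unrelated to TensorFlow's, so the individual samples will differ.

```python
import math
import random
from collections import Counter

log_probas = [math.log(p) for p in (0.5, 0.4, 0.1)]

def sample_categorical(logits, num_samples, rng):
    """Draw class indices with probability proportional to exp(logit)."""
    weights = [math.exp(logit) for logit in logits]
    return rng.choices(range(len(logits)), weights=weights, k=num_samples)

rng = random.Random(42)  # fixed seed so the run is repeatable
samples = sample_categorical(log_probas, num_samples=10_000, rng=rng)
frequencies = Counter(samples)
for class_id in range(3):
    print(class_id, frequencies[class_id] / len(samples))
```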

The functions below facilitate text generation using the trained Shakespearean model:

1. **`next_char(text, temperature=1)`:**
   - Predicts the next character for the input `text` using the trained model.
   - `temperature` controls the diversity of the predictions: higher values flatten the distribution and make the predictions more diverse.
   - `y_proba` holds the probability distribution for the next character. The `[0, -1:]` index keeps it two-dimensional, as `tf.random.categorical` expects. `rescaled_logits` is the log of these probabilities divided by the temperature.
   - A character ID is sampled at random from the rescaled distribution, converted to a character using the text vectorization layer's vocabulary (adding 2 to undo the shift), and returned.

2. **`extend_text(text, n_chars=50, temperature=1)`:**
   - Extends the input `text` by iteratively predicting and appending the next character.
   - Calls the `next_char` function for each iteration, allowing for the generation of a sequence of characters.
   - `n_chars` determines the length of the generated text.

3. **Reproducibility Setup:**
   - `tf.random.set_seed(42)`: Sets the random seed to 42 for reproducibility.

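4. **Text Generation Examples:**
   - The last three cells extend "To be or not to be" at temperatures 0.01, 1, and 100. A temperature close to 0 almost always picks the most probable character, which gives focused but repetitive text; a temperature of 1 samples from the model's own distribution; a very high temperature makes all characters nearly equally likely, which gives random-looking text.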

In [None]:
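def next_char(text, temperature=1):
    # Probabilities for the next character; [0, -1:] keeps shape [1, n_tokens]
    y_proba = shakespeare_model.predict([text])[0, -1:]
    # Dividing the log probabilities by the temperature sharpens (T < 1)
    # or flattens (T > 1) the distribution
    rescaled_logits = tf.math.log(y_proba) / temperature
    # Sample one character ID from the rescaled distribution
    char_id = tf.random.categorical(rescaled_logits, num_samples=1)[0, 0]
    return text_vec_layer.get_vocabulary()[char_id + 2]

def extend_text(text, n_chars=50, temperature=1):
    # Append one sampled character at a time
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text

tf.random.set_seed(42)  # extra code – ensures reproducibility on CPU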
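
The effect of the temperature can be seen without a trained model. The sketch below applies the same rescaling (log probabilities divided by the temperature, then renormalized with a softmax) to a made-up three-character distribution; `apply_temperature` is a hypothetical helper and assumes all probabilities are greater than zero.

```python
import math

def apply_temperature(probas, temperature):
    """Return softmax(log(p) / temperature) for a list of positive probabilities."""
    rescaled = [math.log(p) / temperature for p in probas]
    largest = max(rescaled)                        # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in rescaled]
    total = sum(exps)
    return [e / total for e in exps]

probas = [0.5, 0.4, 0.1]
cold = apply_temperature(probas, temperature=0.01)   # almost all mass on the top choice
neutral = apply_temperature(probas, temperature=1)   # distribution unchanged
hot = apply_temperature(probas, temperature=100)     # close to uniform
for name, dist in (("cold", cold), ("neutral", neutral), ("hot", hot)):
    print(name, [round(p, 3) for p in dist])
```

At temperature 0.01 the most probable character is chosen almost every time, while at temperature 100 the three characters become nearly equally likely.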


In [None]:
print(extend_text("To be or not to be", temperature=0.01))

In [None]:
print(extend_text("To be or not to be", temperature=1))

In [None]:
print(extend_text("To be or not to be", temperature=100))