
# Copyright

<PRE>
Copyright (c) Bálint Gyires-Tóth - All Rights Reserved
You may use and modify this code for research and development purposes.
Using this code for educational purposes (self-paced or instructor led) without the permission of the author is prohibited.
</PRE>

# Assignment: RNN text generation with your favorite book


## 1. Dataset
- Download your favorite book from https://www.gutenberg.org/
- Split into training (80%) and validation (20%).

In [7]:
import requests
import numpy as np

url = "https://www.gutenberg.org/files/1342/1342-0.txt"

# Download the book text
response = requests.get(url)
book_text = response.text


# Split the text into training (80%) and validation (20%)
text_words = book_text.split()

# Calculate the split index for 80-20
split_index = int(0.8 * len(text_words))

# Split the data into training and validation sets
train_text = text_words[:split_index]
valid_text = text_words[split_index:]

# Convert the list of words back into text
train_text = " ".join(train_text)
valid_text = " ".join(valid_text)

# Display the lengths of the training and validation sets.
# Both texts were re-joined into strings above, so len() counts characters, not words.
print(f"Training set length: {len(train_text)} characters")
print(f"Validation set length: {len(valid_text)} characters")


Training set length: 576961 characters
Validation set length: 141601 characters


## 2. Preprocessing
- Convert text to lowercase.  
- Remove punctuation (except basic sentence delimiters).  
- Tokenize by words or characters (your choice).  
- Build a vocabulary (map each unique word to an integer ID).

In [8]:
import re
from tensorflow.keras.preprocessing.text import Tokenizer

# Convert the text to lowercase
book_text = book_text.lower()

# Remove punctuation (except basic sentence delimiters)
book_text = re.sub(r'[^a-z\s.,!?\'"]', '', book_text)

# Tokenize the text by words
tokenizer = Tokenizer()

# Fit the tokenizer on the text (this builds the vocabulary)
tokenizer.fit_on_texts([book_text])

# Convert the text into sequences of integers
sequences = tokenizer.texts_to_sequences([book_text])[0]

# Build the vocabulary
vocab_size = len(tokenizer.word_index) + 1  # +1 because the index starts from 1, not 0
print(f"Vocabulary size: {vocab_size}")


Vocabulary size: 8581


## 3. Embedding Layer in Keras
Below is a minimal example of defining an `Embedding` layer:
```python
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    input_dim=vocab_size,     # size of the vocabulary
    output_dim=128,           # embedding vector dimension
    input_length=sequence_length
)
```
- This layer transforms integer-encoded sequences (word IDs) into dense vector embeddings.

- Feed these embeddings into your LSTM, GRU, or 1D CNN layer.
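
Conceptually, an embedding layer behaves like a trainable lookup table: each word ID selects one row of a weight matrix. The NumPy sketch below illustrates only that lookup step. The toy sizes and the random matrix are assumptions for illustration; the Keras layer has its own weight initialization and updates the rows during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes, much smaller than the real vocabulary.
toy_vocab_size = 10
embedding_dim = 4

# One row per word ID; in Keras these values are learned during training.
embedding_matrix = rng.normal(size=(toy_vocab_size, embedding_dim))

# A batch of two integer-encoded sequences, shape (batch, sequence_length).
token_ids = np.array([[1, 5, 5, 0],
                      [2, 3, 9, 9]])

# Indexing with the ID array looks up one vector per token.
embedded = embedding_matrix[token_ids]
print(embedded.shape)  # (batch, sequence_length, embedding_dim)
```

Identical IDs map to identical vectors, so the two occurrences of ID 5 in the first sequence receive the same embedding.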

In [9]:
from tensorflow.keras.layers import Embedding

# Parameters
vocab_size = len(tokenizer.word_index) + 1  # vocabulary size (includes 0 for padding)
sequence_length = 50  # Length of input sequences

# Define the embedding layer
# Note: Keras 3 reports `input_length` as deprecated; there it can simply be omitted.
embedding_layer = Embedding(input_dim=vocab_size,    # size of the vocabulary
                            output_dim=128,        # embedding dimension
                            input_length=sequence_length)  # input sequence length


## 4. Model
- Implement an LSTM-, GRU-, or 1D CNN-based language model with:
  - **The Embedding layer** as input.
  - At least **one recurrent layer** (e.g., `LSTM(256)` or `GRU(256)` or your custom 1D CNN).
  - A **Dense** output layer with **softmax** activation for word prediction.
- Train for about **5–10 epochs** so it can finish in approximately **2 hours** on a standard machine.


In [10]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Define the model
model = Sequential()
model.add(embedding_layer)  # Add the embedding layer
model.add(LSTM(256, return_sequences=False))  # Add the LSTM layer (256 units)
model.add(Dense(vocab_size, activation='softmax'))  # Output layer with softmax activation for word prediction

# Compile the model
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Summary of the model
model.summary()


## 5. Training & Evaluation
- **Monitor** the loss on both training and validation sets.
- **Perplexity**: a common metric for language models.
  - It is the exponential of the average negative log-likelihood.
  - If your model outputs cross-entropy loss `H`, then `perplexity = e^H`.
  - Try to keep the validation perplexity **under 50** if possible. If your value is higher (which is possible), try to explain why it does not decrease further.
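
The definition above can be sketched in a few lines of plain Python. The `perplexity` helper and the example probabilities are hypothetical and only illustrate the formula; the vocabulary size is the one reported in Section 2. A model that guesses uniformly over the vocabulary has a perplexity equal to the vocabulary size, which gives a rough reference point for what a trained model should beat.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities assigned to the true next tokens."""
    # Average negative log-likelihood = cross-entropy H (in nats).
    avg_nll = sum(-math.log(p) for p in token_probs) / len(token_probs)
    # perplexity = e^H
    return math.exp(avg_nll)

# Hypothetical probabilities a model assigned to four correct next words.
example_probs = [0.5, 0.25, 0.125, 0.125]
print(perplexity(example_probs))

# Uniform guessing over the vocabulary built in Section 2.
vocab_size = 8581
uniform_probs = [1.0 / vocab_size] * 10
print(perplexity(uniform_probs))
```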

In [11]:
import numpy as np

# Create input-output pairs for training
input_sequences = []
output_words = []

for i in range(sequence_length, len(sequences)):
    input_sequences.append(sequences[i-sequence_length:i])
    output_words.append(sequences[i])

input_sequences = np.array(input_sequences)
output_words = np.array(output_words)

# Display shapes of the input and output
print(f"Input sequences shape: {input_sequences.shape}")
print(f"Output words shape: {output_words.shape}")


Input sequences shape: (133398, 50)
Output words shape: (133398,)
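
The loop above builds windows from the full token list. If the windows should instead be built separately for the 80% and 20% portions from Section 1, the same windowing can be wrapped in a small helper. The sketch below is one possible vectorized version using NumPy's `sliding_window_view`; the name `make_windows` and the toy token IDs are assumptions for illustration, not part of the assignment.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(token_ids, sequence_length):
    """Return (inputs, targets): each input is `sequence_length` IDs, the target is the next ID."""
    token_ids = np.asarray(token_ids)
    # Each window holds sequence_length inputs plus one target token.
    windows = sliding_window_view(token_ids, sequence_length + 1)
    return windows[:, :-1], windows[:, -1]

# Hypothetical toy "token IDs" standing in for the tokenizer output.
toy_ids = np.arange(20)
split = int(0.8 * len(toy_ids))

x_train, y_train = make_windows(toy_ids[:split], sequence_length=3)
x_valid, y_valid = make_windows(toy_ids[split:], sequence_length=3)

print(x_train.shape, y_train.shape)
print(x_valid.shape, y_valid.shape)
```

For a token list of length `n`, this yields `n - sequence_length` pairs, the same count as the explicit loop.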


In [12]:
# Train the model
history = model.fit(input_sequences, output_words, epochs=5, batch_size=64, validation_split=0.2)


Epoch 1/5
1668/1668 ━━━━━━━━━━━━━━━━━━━━ 667s 398ms/step - accuracy: 0.0594 - loss: 6.6482 - val_accuracy: 0.1101 - val_loss: 5.7679
Epoch 2/5
1668/1668 ━━━━━━━━━━━━━━━━━━━━ 676s 394ms/step - accuracy: 0.1235 - loss: 5.5514 - val_accuracy: 0.1316 - val_loss: 5.5079
Epoch 3/5
1668/1668 ━━━━━━━━━━━━━━━━━━━━ 680s 393ms/step - accuracy: 0.1506 - loss: 5.1082 - val_accuracy: 0.1437 - val_loss: 5.4289
Epoch 4/5
1668/1668 ━━━━━━━━━━━━━━━━━━━━ 687s 396ms/step - accuracy: 0.1699 - loss: 4.7400 - val_accuracy: 0.1447 - val_loss: 5.4366
Epoch 5/5
1668/1668 ━━━━━━━━━━━━━━━━━━━━ 676s 393ms/step - accuracy: 0.1880 - loss: 4.4039 - val_accuracy: 0.1441 - val_loss: 5.5137

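To relate this log to the perplexity target in Section 5, the per-epoch losses can be converted with `perplexity = e^H`. The sketch below copies the values from the log above by hand (in a live notebook they would normally come from `history.history`) and also reports the epoch with the lowest validation loss.

```python
import math

# Per-epoch losses copied from the training log above.
train_loss = [6.6482, 5.5514, 5.1082, 4.7400, 4.4039]
val_loss = [5.7679, 5.5079, 5.4289, 5.4366, 5.5137]

# perplexity = e^H, with H the cross-entropy loss.
val_perplexity = [math.exp(h) for h in val_loss]

# Epoch (1-based) with the lowest validation loss.
best_epoch = min(range(len(val_loss)), key=lambda i: val_loss[i]) + 1

for epoch, (tl, vl, ppl) in enumerate(zip(train_loss, val_loss, val_perplexity), start=1):
    print(f"epoch {epoch}: loss={tl:.4f}  val_loss={vl:.4f}  val_perplexity={ppl:.1f}")
print(f"lowest validation loss at epoch {best_epoch}")
```

The validation perplexity stays well above the target of 50 in every epoch. Training loss keeps falling while validation loss stops improving after the third epoch, a pattern usually read as the onset of overfitting. One plausible contributing factor is the small corpus relative to a vocabulary of 8581 words. Stopping early, for instance with the Keras `EarlyStopping` callback monitoring `val_loss`, is one option worth trying.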

## 6. Generation Criteria
- After training, generate **two distinct text samples**, each at least **50 tokens**.
- Use **different seed phrases** (e.g., “love is” vs. “time will”).

In [13]:
def generate_text(model, tokenizer, seed_text, length=50):
    result = seed_text
    for _ in range(length):
        # Convert seed text to sequence of word IDs
        token_list = tokenizer.texts_to_sequences([result])[0]

        # Keep only the last `sequence_length` tokens for prediction
        token_list = token_list[-sequence_length:]

        # Pad sequence if it's shorter than required length
        padded = np.pad(token_list, (sequence_length - len(token_list), 0), 'constant')
        padded = padded.reshape(1, sequence_length)

        # Predict the next word (greedy: always take the most probable one)
        predicted = np.argmax(model.predict(padded, verbose=0), axis=-1)[0]

        # Convert predicted ID back to word
        output_word = tokenizer.index_word.get(predicted, '')
        if not output_word:
            break
        result += ' ' + output_word
    return result


In [14]:
# Generate two samples with different seed phrases
seed1 = "love is"
seed2 = "time will"

sample1 = generate_text(model, tokenizer, seed1, length=50)
sample2 = generate_text(model, tokenizer, seed2, length=50)

print("Sample 1:")
print(sample1)
print("\nSample 2:")
print(sample2)



Sample 1:
love is not to be in the same and the day and the whole party were over the whole party and the whole party were over the whole party and the whole party were over the whole party and the whole party were over the whole party and the whole party were

Sample 2:
time will be a great deal of the same time to the same and the day and the whole party were over the whole party and the whole party were over the whole party and the whole party were over the whole party and the whole party were over the whole party
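
Both samples fall into the same repeating phrase. This is a common outcome of greedy decoding, where `np.argmax` always picks the single most probable word. A frequently used alternative is to sample the next word from the predicted distribution, optionally rescaled by a temperature. The helper below is a minimal sketch of that idea and is not part of the original notebook; the name `sample_with_temperature` and the toy probabilities are assumptions for illustration.

```python
import numpy as np

def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Draw one index from `probs` after rescaling by `temperature`."""
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(probs, dtype=np.float64)
    # Work in log space; the small constant avoids log(0).
    logits = np.log(probs + 1e-12) / temperature
    logits -= logits.max()          # numerical stability before exponentiating
    scaled = np.exp(logits)
    scaled /= scaled.sum()          # renormalize to a probability distribution
    return int(rng.choice(len(scaled), p=scaled))

# Hypothetical softmax output over a three-word vocabulary.
toy_probs = [0.7, 0.2, 0.1]
rng = np.random.default_rng(0)

draws = [sample_with_temperature(toy_probs, temperature=1.0, rng=rng) for _ in range(200)]
cold_draws = [sample_with_temperature(toy_probs, temperature=0.01, rng=rng) for _ in range(50)]
print(sorted(set(draws)), sorted(set(cold_draws)))
```

Inside `generate_text`, the `np.argmax(...)` call could be swapped for `sample_with_temperature(model.predict(padded, verbose=0)[0], temperature=0.8)`. Lower temperatures approach greedy decoding, while higher ones produce more varied but less coherent text. If the padding index 0 is drawn, the existing `if not output_word: break` check ends generation early.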
