## Step 1: Imports and Data Loading

In [1]:
# Imports and Setup
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, SimpleRNN

print("Libraries imported successfully.")

Libraries imported successfully.


In [2]:
# Define Dataset
# Text extracted from GenAI-Lab-Week4.pdf (Page 2)
corpus_text = """
Artificial intelligence is transforming modern society.
It is used in healthcare finance education and transportation.
Machine learning allows systems to improve automatically with experience.
Data plays a critical role in training intelligent systems.
Large datasets help models learn complex patterns.
Deep learning uses multi layer neural networks.
Neural networks are inspired by biological neurons.
Each neuron processes input and produces an output.
Training a neural network requires optimization techniques.
Gradient descent minimizes the loss function.
Natural language processing helps computers understand human language.
Text generation is a key task in nlp.
Language models predict the next word or character.
Recurrent neural networks handle sequential data.
LSTM and GRU models address long term dependency problems.
However rnn based models are slow for long sequences.
Transformer models changed the field of nlp.
They rely on self attention mechanisms.
Attention allows the model to focus on relevant context.
Transformers process data in parallel.
This makes training faster and more efficient.
Modern language models are based on transformers.
Education is being improved using artificial intelligence.
Intelligent tutoring systems personalize learning.
Automated grading saves time for teachers.
Online education platforms use recommendation systems.
Technology enhances the quality of learning experiences.
Ethical considerations are important in artificial intelligence.
Fairness transparency and accountability must be ensured.
AI systems should be designed responsibly.
Data privacy and security are major concerns.
Researchers continue to improve ai safety.
Text generation models can create stories poems and articles.
They are used in chatbots virtual assistants and content creation.
Generated text should be meaningful and coherent.
Evaluation of text generation is challenging.
Human judgement is often required.
Continuous learning is essential in the field of ai.
Research and innovation drive technological progress.
Students should build strong foundations in mathematics.
Programming skills are important for ai engineers.
Practical experimentation enhances understanding.
"""

print("Dataset loaded successfully.")

Dataset loaded successfully.


## Step 2: Tokenization and Sequence Creation
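We convert the text into integer tokens and, for each sentence, build growing-prefix (n-gram) training sequences: every prefix of the sentence is an input, and the word that follows it is the label. For the sentence "Artificial intelligence is transforming", this gives: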

- [Artificial] -> intelligence

- [Artificial, intelligence] -> is

- [Artificial, intelligence, is] -> transforming
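
The prefix construction listed above can be sketched in plain Python, before any tokenization. This is a minimal illustration only; `prefix_pairs` is a hypothetical helper, not part of the notebook, and it works on lowercase words rather than integer tokens.

```python
def prefix_pairs(sentence):
    """Return (context_words, next_word) pairs for every prefix of a sentence."""
    words = sentence.lower().split()
    # A sentence of n words yields n - 1 training pairs
    return [(words[:i], words[i]) for i in range(1, len(words))]

pairs = prefix_pairs("Artificial intelligence is transforming")
for context, target in pairs:
    print(context, "->", target)
```

A sentence with n words therefore contributes n - 1 training examples, which is why the number of samples is much larger than the number of sentences.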

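**Tokenization:** Every unique word is mapped to an integer index. By default the Keras `Tokenizer` lowercases the text, strips punctuation, and gives lower indices to more frequent words, so "Artificial" is stored as "artificial" with an index that depends on its frequency rank. Index 0 is reserved for padding.

**Padding:** Neural networks require fixed-size inputs. Sequences shorter than the longest one (10 tokens in this corpus) are left-padded with zeros (`padding='pre'`), so every sequence has length 10 and the label always sits in the last position.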
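
To make the two ideas concrete without Keras, here is a rough pure-Python sketch. `build_word_index` and `pad_pre` are hypothetical helpers that only mimic the behaviour described above (frequency-ranked indices starting at 1, zeros added on the left); they are not the library implementations.

```python
from collections import Counter

def build_word_index(text):
    """Map each word to an integer, most frequent first; 0 stays free for padding."""
    counts = Counter(text.lower().split())
    # Stable sort by descending count, so ties keep first-seen order
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: i for i, word in enumerate(ranked, start=1)}

def pad_pre(seq, maxlen):
    """Left-pad with zeros, or keep only the last maxlen items if too long."""
    seq = seq[max(0, len(seq) - maxlen):]
    return [0] * (maxlen - len(seq)) + seq

word_index = build_word_index("data helps models learn and data drives models")
tokens = [word_index[w] for w in "data helps models".split()]
print(word_index)
print(pad_pre(tokens, 6))
```

Padding on the left rather than the right keeps the most recent words next to the prediction position, which suits a model that reads the sequence left to right.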

In [3]:
# Tokenization and Input-Output Sequences

# 1. Tokenize the text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([corpus_text])
total_words = len(tokenizer.word_index) + 1  # +1 for padding token

print(f"Total unique words (Vocabulary Size): {total_words}")

# 2. Create Input Sequences
input_sequences = []
# Split text by new lines to treat each sentence independently
for line in corpus_text.strip().split('\n'):
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

# 3. Pad Sequences
# We need all inputs to be the same length for the neural network
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

# 4. Create Predictors and Label
# X is the input (all words except the last one), y is the label (the last word)
X, y = input_sequences[:,:-1], input_sequences[:,-1]

# Convert y to one-hot encoding (categorical)
y = to_categorical(y, num_classes=total_words)

print(f"Shape of X (Inputs): {X.shape}")
print(f"Shape of y (Outputs): {y.shape}")
print(f"Max Sequence Length: {max_sequence_len}")

Total unique words (Vocabulary Size): 195
Shape of X (Inputs): (256, 9)
Shape of y (Outputs): (256, 195)
Max Sequence Length: 10


## Step 3: Design RNN Architecture
We use an Embedding layer (which learns a dense vector for each word) feeding into a recurrent layer. The cell below uses `SimpleRNN`, a "vanilla" RNN; the commented-out `LSTM` (Long Short-Term Memory) line is a drop-in replacement that generally captures longer-range context better, at the cost of more parameters.
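
As a rough guide to model size, the trainable parameter counts can be worked out by hand. The sketch below assumes the vocabulary size of 195 reported in Step 2, an embedding size of 64, 100 recurrent units, and the standard layer formulas with bias terms; the helper function names are illustrative only.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary index
    return vocab_size * embed_dim

def simple_rnn_params(input_dim, units):
    # Input weights + recurrent weights + bias
    return units * (input_dim + units + 1)

def lstm_params(input_dim, units):
    # Four gates, each with the same shape as a SimpleRNN layer
    return 4 * units * (input_dim + units + 1)

def dense_params(input_dim, units):
    # Weights + bias
    return units * (input_dim + 1)

vocab_size, embed_dim, rnn_units = 195, 64, 100
shared = embedding_params(vocab_size, embed_dim) + dense_params(rnn_units, vocab_size)
simple_total = shared + simple_rnn_params(embed_dim, rnn_units)
lstm_total = shared + lstm_params(embed_dim, rnn_units)
print("SimpleRNN model:", simple_total)
print("LSTM model:     ", lstm_total)
```

Under these assumptions the LSTM variant has roughly twice as many parameters as the SimpleRNN variant, which is worth keeping in mind with only 256 training samples.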

In [4]:
# Build the RNN/LSTM Model

model = Sequential()

# Embedding Layer: Turns integer indexes into dense vectors of fixed size
# input_dim = Vocabulary Size
# output_dim = 64 (Vector size for each word)
# input_length = Sequence length - 1 (because we removed the label)
model.add(Embedding(total_words, 64, input_length=max_sequence_len-1))

# RNN Layer: You can use SimpleRNN or LSTM
# LSTM is generally preferred for text as it captures longer context

model.add(SimpleRNN(100))
# model.add(LSTM(100))

# Output Layer: A probability distribution over all possible words
model.add(Dense(total_words, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

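# summary() prints the table itself and returns None,
# so wrapping it in print() only adds a stray "None" to the output
model.summary()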


In [5]:
# Train the Model
history = model.fit(X, y, epochs=100, verbose=1)

Epoch 1/100
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 7ms/step - accuracy: 0.0338 - loss: 5.2587
Epoch 2/100
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 6ms/step - accuracy: 0.0568 - loss: 5.1359 
Epoch 3/100
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step - accuracy: 0.0298 - loss: 4.9996     
Epoch 4/100
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 6ms/step - accuracy: 0.0688 - loss: 4.8805 
Epoch 5/100
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step - accuracy: 0.0452 - loss: 4.8435 
Epoch 6/100
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step - accuracy: 0.0797 - loss: 4.7686 
Epoch 7/100
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step - accuracy: 0.1269 - loss: 4.6914 
Epoch 8/100
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step - accuracy: 0.1449 - loss: 4.5914 
Epoch 9/100
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m

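Two numbers in the training log can be sanity-checked by hand. The sketch below assumes the Keras default batch size of 32 (no `batch_size` is passed to `fit`) and uses the 256 samples and 195-word vocabulary reported in Step 2.

```python
import math

num_samples, batch_size, vocab_size = 256, 32, 195

# Number of batches per epoch, shown as "8/8" in the log
steps_per_epoch = math.ceil(num_samples / batch_size)
# Cross-entropy of a uniform guess over vocab_size classes
uniform_loss = math.log(vocab_size)

print(f"steps per epoch: {steps_per_epoch}")
print(f"loss of a uniform guess: {uniform_loss:.4f}")
```

The first-epoch loss of 5.2587 is close to ln(195), which is what an untrained model guessing uniformly over the vocabulary would score, so the starting point of the log looks plausible.
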
## Step 4: Generate Text
This function takes a seed text, converts it to a sequence, predicts the next word, appends it, and repeats.
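
The predict-append-repeat loop can be tried in isolation with a stand-in for the trained network. Everything in this sketch is hypothetical: `greedy_generate`, `toy_predict`, and the probability table are made up for illustration, and the toy "model" looks only at the last word.

```python
def greedy_generate(seed_words, num_words, predict_probs, index_word):
    """Append the most probable next word num_words times."""
    words = list(seed_words)
    for _ in range(num_words):
        probs = predict_probs(words)
        # Greedy decoding: pick the index with the highest probability
        best_index = max(range(len(probs)), key=probs.__getitem__)
        words.append(index_word[best_index])
    return words

# Hypothetical toy vocabulary; index 0 is padding and never predicted here
index_word = {1: "neural", 2: "networks", 3: "learn"}
toy_table = {
    "neural":   [0.0, 0.1, 0.8, 0.1],
    "networks": [0.0, 0.1, 0.1, 0.8],
    "learn":    [0.0, 0.7, 0.2, 0.1],
}

def toy_predict(words):
    """Return next-word probabilities based only on the last word."""
    return toy_table[words[-1]]

generated = greedy_generate(["neural"], 3, toy_predict, index_word)
print(" ".join(generated))
```

Because greedy decoding always takes the single most probable word, the same seed always produces the same continuation, and the loop keeps going for the requested number of words even after a natural sentence end.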

In [8]:
# Text Generation Function

def generate_text_rnn(seed_text, next_words, model, max_sequence_len):
    for _ in range(next_words):
        # Convert seed text to sequence
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        # Pad sequence to match model input shape
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')

        # Predict the probability of the next word
        predicted_probs = model.predict(token_list, verbose=0)
        # Get the index of the word with the highest probability
        predicted_index = np.argmax(predicted_probs, axis=-1)[0]

        # Convert index back to word (index 0 is padding and has no word)
        output_word = tokenizer.index_word.get(int(predicted_index), "")

        # Append to seed text
        seed_text += " " + output_word

    return seed_text

# --- Generate Samples ---
print("\n--- Generated Text Samples ---\n")

print(generate_text_rnn("Artificial intelligence", 5, model, max_sequence_len))
print(generate_text_rnn("Neural networks", 6, model, max_sequence_len))
print(generate_text_rnn("Deep learning", 5, model, max_sequence_len))
print(generate_text_rnn("Students should", 5, model, max_sequence_len))


--- Generated Text Samples ---

Artificial intelligence is transforming modern society based
Neural networks are inspired by biological neurons assistants
Deep learning uses multi layer neural networks
Students should build strong foundations in mathematics
