<a href="https://colab.research.google.com/github/wickedWOLF123/DRP/blob/main/WordModelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Word Modelling Using LSTM**

This is an implementation of https://www.tensorflow.org/text/tutorials/text_generation in TensorFlow, based on http://karpathy.github.io/2015/05/21/rnn-effectiveness/ by Andrej Karpathy. A PyTorch implementation will follow.

We are building a word model, i.e. next-token prediction, where each token is a single character. We use it to generate Shakespeare-like text in order to understand the usefulness of LSTMs for natural language processing tasks.

In [None]:
# Imports for this

import tensorflow as tf
import numpy as np
import os
import time

In [None]:
# Downloading Shakespeare pieces from googleapis
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

# Read the data - this file has a little over a million characters
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')
print(f'text length: {len(text)}')

vocabulary = sorted(set(text))
print(f'Unique Characters: {len(vocabulary)}')


Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt
[1m1115394/1115394[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 0us/step
text length: 1115394
Unique Characters: 65


To implement a similar character-based text generation model using PyTorch, you can follow the steps below. We’ll use the PyTorch framework to recreate the same RNN model architecture and training flow demonstrated with TensorFlow.

Steps to Recreate RNN Text Generation in PyTorch
1. Install Dependencies:

  pip install torch torchvision
2. Import Libraries:

  import torch
  import torch.nn as nn
  import numpy as np
  import os
  import time
3. Download the Shakespeare Dataset:
  # Download the dataset
  import requests

  url = 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt'
  response = requests.get(url)
  text = response.text

  print(f'Length of text: {len(text)} characters')
  print(text[:250])  # View a sample of the text

  # Get the unique characters (vocabulary)
  vocab = sorted(set(text))
  vocab_size = len(vocab)
  print(f'{vocab_size} unique characters')
4. Create Character Mapping:
# Create character to index and index to character mappings
char2idx = {char: idx for idx, char in enumerate(vocab)}
idx2char = np.array(vocab)

# Convert text into numerical data (IDs)
text_as_int = np.array([char2idx[c] for c in text])

# Function to convert IDs back to text
def text_from_ids(ids):
    return ''.join(idx2char[ids])
5. Create Input and Target Sequences:
# Sequence length for input and target
seq_length = 100  # Each input sequence will have 100 characters
examples_per_epoch = len(text) // (seq_length + 1)

# Split the text into non-overlapping input-target pairs, mirroring the
# TensorFlow pipeline's batch(seq_length + 1, drop_remainder=True)
def create_sequences(data, seq_length):
    inputs = []
    targets = []
    for i in range(0, len(data) - seq_length, seq_length + 1):
        inputs.append(data[i:i + seq_length])
        targets.append(data[i + 1:i + 1 + seq_length])
    return np.array(inputs), np.array(targets)

inputs, targets = create_sequences(text_as_int, seq_length)
inputs = torch.tensor(inputs, dtype=torch.long)
targets = torch.tensor(targets, dtype=torch.long)

print(f'Input size: {inputs.size()}, Target size: {targets.size()}')
6. Define the RNN Model:
class CharRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, rnn_units):
        super(CharRNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.GRU(embedding_dim, rnn_units, batch_first=True)
        self.fc = nn.Linear(rnn_units, vocab_size)

    def forward(self, x, hidden=None):
        x = self.embedding(x)
        out, hidden = self.rnn(x, hidden)
        out = self.fc(out)
        return out, hidden
7. Set Hyperparameters:
embedding_dim = 256
rnn_units = 1024
batch_size = 64
learning_rate = 0.001

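# Fall back to the CPU when no GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = CharRNN(vocab_size, embedding_dim, rnn_units).to(device)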
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
8. Create DataLoader for Batching:
from torch.utils.data import TensorDataset, DataLoader

dataset = TensorDataset(inputs, targets)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=True)
9. Train the Model:
def train_model(model, dataloader, loss_fn, optimizer, epochs):
    model.train()

    for epoch in range(epochs):
        start = time.time()
        total_loss = 0
        device = next(model.parameters()).device  # train on whichever device holds the model

        for batch, (inp, target) in enumerate(dataloader):
            inp, target = inp.to(device), target.to(device)

            # Forward pass. Batches are shuffled, so each one starts from a fresh
            # (zero) hidden state instead of reusing the previous batch's state.
            optimizer.zero_grad()
            output, _ = model(inp)

            # Reshape output and target for loss calculation
            loss = loss_fn(output.view(-1, vocab_size), target.view(-1))
            total_loss += loss.item()

            # Backprop and optimize
            loss.backward()
            optimizer.step()

            if batch % 50 == 0:
                print(f'Epoch {epoch + 1}, Batch {batch}, Loss: {loss.item()}')

        print(f'Epoch {epoch + 1}, Loss: {total_loss / len(dataloader)}')
        print(f'Time for epoch: {time.time() - start:.2f} sec')

# Train the model
train_model(model, dataloader, loss_fn, optimizer, epochs=20)
10. Generate Text Using the Trained Model:
def generate_text(model, start_string, num_generate=1000, temperature=1.0):
    model.eval()
    device = next(model.parameters()).device  # sample on the model's device

    input_eval = torch.tensor([char2idx[c] for c in start_string], dtype=torch.long).unsqueeze(0).to(device)
    hidden = None
    generated_text = start_string

    with torch.no_grad():  # no gradients are needed while sampling
        for _ in range(num_generate):
            output, hidden = model(input_eval, hidden)
            logits = output[0, -1] / temperature  # lower temperature -> less random
            predicted_id = torch.multinomial(torch.softmax(logits, dim=0), 1).item()

            # Add the predicted character to the generated text
            generated_text += idx2char[predicted_id]

            # Feed the predicted character as input for the next step
            input_eval = torch.tensor([[predicted_id]], dtype=torch.long, device=device)

    return generated_text

# Generate text
print(generate_text(model, start_string="ROMEO: "))
11. Save and Load the Model:
# Save model weights
torch.save(model.state_dict(), 'char_rnn.pth')

# Load model weights
model = CharRNN(vocab_size, embedding_dim, rnn_units).to('cuda')
model.load_state_dict(torch.load('char_rnn.pth'))
Explanation:

- Model Architecture: We use an embedding layer to convert character indices into dense vectors, followed by a GRU (RNN) layer and a dense output layer to predict the next character.
- Training: The model is trained using sequences of 100 characters. Each input is a sequence, and the target is the same sequence shifted by one character.
- Text Generation: During generation, the model takes the last predicted character as input for the next step, sampling characters based on probabilities.

This PyTorch implementation mirrors the TensorFlow code but leverages PyTorch’s RNN framework, including manual handling of hidden states for sequential text generation. Adjust the model hyperparameters and the number of training epochs for better results.
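
To make the role of `temperature` in the generation step concrete, here is a small standard-library sketch of temperature-scaled softmax followed by sampling one index. The helper names `softmax_with_temperature` and `sample_index` and the example logits are illustrative assumptions, not part of the notebook. `random.Random.choices` stands in for what `torch.multinomial` does when drawing a single sample.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(probs, rng):
    """Draw one index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]   # made-up scores for three characters
rng = random.Random(0)
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], sample_index(probs, rng))
```

Temperatures below 1 concentrate probability on the highest-scoring character, which gives more predictable text. Temperatures above 1 flatten the distribution and give more varied, but more error-prone, text.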








In [None]:
# Now we create a unique id for each character: the model cannot take characters
# directly, so we pass it vectorized ids instead

# We also need the inverse mapping, because to generate text we must be
# able to convert the output ids back to the original characters

ids_from_chars = tf.keras.layers.StringLookup(vocabulary=list(vocabulary), mask_token=None)
chars_from_ids = tf.keras.layers.StringLookup(vocabulary=ids_from_chars.get_vocabulary(), invert=True, mask_token=None)

# Function to join all text back together
def text_from_ids(ids):
  return tf.strings.reduce_join(chars_from_ids(ids), axis=-1)
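
As a rough illustration of what the two lookup layers do, here is a pure-Python stand-in. It assumes the default `StringLookup` behaviour of reserving index 0 for an out-of-vocabulary token `[UNK]` when `mask_token=None`, which would explain why the model later reports 66 classes for 65 unique characters. The names `char_to_id`, `id_to_char`, `encode` and `decode` are hypothetical helpers, not TensorFlow APIs.

```python
# Pure-Python stand-in for the two StringLookup layers (illustrative only).
sample_text = "First Citizen: hello"
sample_vocab = sorted(set(sample_text))

# Index 0 is kept for unknown characters, so real characters start at 1.
id_to_char = ["[UNK]"] + sample_vocab
char_to_id = {ch: i for i, ch in enumerate(id_to_char)}

def encode(s):
    """Map each character to its id; unseen characters map to 0."""
    return [char_to_id.get(ch, 0) for ch in s]

def decode(ids):
    """Map ids back to characters and join them into one string."""
    return "".join(id_to_char[i] for i in ids)

ids = encode("hello")
print(ids)
print(decode(ids))
print(len(id_to_char) - len(sample_vocab))  # one extra slot for [UNK]
```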


In [None]:
# We divide the input text into smaller chunks of seq_length + 1 characters
# The input is the first seq_length characters and the target has the same length but starts from the second character
# Eg - for the chunk "Hello" with seq_length = 4, input = "Hell", target = "ello"

seq_length = 100

# Get ids of all characters
all_ids = ids_from_chars(tf.strings.unicode_split(text, 'UTF-8'))

# Slice this encoded list
ids_dataset = tf.data.Dataset.from_tensor_slices(all_ids)

sequences = ids_dataset.batch(seq_length+1, drop_remainder=True)


In [None]:
# As we are always predicting next character our dataset has (input, label) pair as
# current_char, next_char

def split_input_target(sequence):
    input_text = sequence[:-1]
    target_text = sequence[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

# this is an example with seq_length character sequences
# for input_example, target_example in dataset.take(1):
#     print("Input :", text_from_ids(input_example).numpy())
#     print("Target:", text_from_ids(target_example).numpy())
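
The chunking and shifting done by `batch(seq_length+1, drop_remainder=True)` followed by `split_input_target` can be sketched in plain Python. The function `make_pairs` below is a hypothetical helper written for this illustration. It works on a short string rather than on the id tensor used in the notebook.

```python
def make_pairs(ids, seq_length):
    """Cut ids into non-overlapping chunks of seq_length + 1 and split each
    chunk into (input, target), dropping any short remainder."""
    chunk = seq_length + 1
    pairs = []
    for start in range(0, len(ids) - chunk + 1, chunk):
        seq = ids[start:start + chunk]
        pairs.append((seq[:-1], seq[1:]))
    return pairs

pairs = make_pairs(list("Hello world"), seq_length=4)
for inp, tgt in pairs:
    print("".join(inp), "->", "".join(tgt))
```

Each target is its input shifted left by one character, so position `t` of the target is the character the model should predict after seeing the input up to position `t`.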

In [None]:
# We pack the dataset into batches for efficiency
# and also shuffle the data

BATCH_SIZE = 64
BUFFER_SIZE = 10000

dataset = (
    dataset
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.AUTOTUNE))
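
A quick arithmetic check shows how many training steps one epoch contains after chunking and batching. It uses the text length printed by the download cell and assumes both `drop_remainder=True` calls discard their partial leftovers, as the code above requests.

```python
text_length = 1115394   # printed by the download cell
seq_length = 100
batch_size = 64

# Each example consumes seq_length + 1 characters; the remainder is dropped.
num_sequences = text_length // (seq_length + 1)
# drop_remainder=True also discards the final partial batch.
steps_per_epoch = num_sequences // batch_size

print(num_sequences, steps_per_epoch)
```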

**Building the Model**

Although the tutorial's implementation uses a GRU, we will try implementing the model with both a GRU and an LSTM. For a single time step, the model works as follows:

1. **Embedding:** maps the id of each input character to a vector with `embedding_dim` dimensions. This is equivalent to multiplying a one-hot vector by the weight matrix, which simply selects one row (or column, depending on layout) of that matrix.

2. **RNN:** The LSTM/GRU layer

3. **Dense layer:** a fully connected layer with `vocab_size` outputs that gives one logit (an unnormalized log-likelihood) for each character.
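
The equivalence between an embedding lookup and a one-hot multiplication can be checked with a few lines of standard-library Python. The sizes and the helper names `one_hot` and `vec_times_matrix` are assumptions chosen for readability. The weight matrix here stores one row per character id, so the lookup selects a row.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 5, 3   # tiny sizes chosen for readability

# Weight matrix with one row per character id.
weights = [[random.random() for _ in range(embedding_dim)]
           for _ in range(vocab_size)]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_times_matrix(vec, matrix):
    """Row vector times a (vocab_size x embedding_dim) matrix."""
    return [sum(v * row[j] for v, row in zip(vec, matrix))
            for j in range(len(matrix[0]))]

char_id = 2
via_lookup = weights[char_id]
via_matmul = vec_times_matrix(one_hot(char_id, vocab_size), weights)
print(via_lookup == via_matmul)
```

An embedding layer performs the lookup directly, which avoids building the one-hot vector and multiplying by a matrix that is mostly ignored.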

In [None]:
# Initialize some parameters
vocab_size = len(ids_from_chars.get_vocabulary())
embedding_dim = 256
rnn_units = 1024

In [None]:
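class MyModel(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, rnn_units):
        super(MyModel, self).__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(
            rnn_units,
            return_sequences=True,
            return_state=True,
            reset_after=True,
        )
        self.dense = tf.keras.layers.Dense(vocab_size)

    # The call method was missing from this cell; this version is reconstructed
    # from the argument names in the error trace and is an assumption.
    def call(self, inputs, states=None, return_state=False, training=False):
        x = self.embedding(inputs)
        # With initial_state=None the GRU starts from a zero state
        x, states = self.gru(x, initial_state=states, training=training)
        x = self.dense(x)
        return (x, states) if return_state else x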




In [None]:
model = MyModel(
    vocab_size=vocab_size,
    embedding_dim=embedding_dim,
    rnn_units=rnn_units)


In [None]:
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions, states = model(input_example_batch, return_state=True)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")


ValueError: Exception encountered when calling MyModel.call().

too many values to unpack (expected 2)

Arguments received by MyModel.call():
  • inputs=tf.Tensor(shape=(64, 100), dtype=int64)
  • states=None
  • return_state=True
  • training=None

In [None]:
def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

example_batch_loss  = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss:      ", example_batch_loss.numpy().mean())

model.compile(optimizer='adam', loss=loss)


Prediction shape:  (64, 100, 66)  # (batch_size, sequence_length, vocab_size)
scalar_loss:       4.1874623
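
The initial loss of about 4.19 is what one would expect from a model that has not learned anything yet. If every one of the 66 classes is equally likely, the cross-entropy is ln(66). The sketch below computes this with a hand-written log-softmax. `sparse_cross_entropy` is a hypothetical helper for a single example, not the Keras function.

```python
import math

def sparse_cross_entropy(logits, label):
    """Cross-entropy of one example from raw logits, using a stable log-softmax."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

vocab_size = 66
uniform_logits = [0.0] * vocab_size   # every character equally likely
baseline = sparse_cross_entropy(uniform_logits, label=3)
print(round(baseline, 4), round(math.log(vocab_size), 4))
```

A mean loss far above this baseline at initialization would suggest the model is confidently wrong, which usually points to a bad initialization.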


In [None]:
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix + '.weights.h5',
    save_weights_only=True)
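
To see which file names this callback is expected to produce, the template can be expanded by hand. This sketch assumes, as in the Keras `ModelCheckpoint` callback, that the 1-based epoch number is substituted for `{epoch}`. The variable `filepath_template` is introduced here for illustration.

```python
import os

checkpoint_dir = "./training_checkpoints"
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
filepath_template = checkpoint_prefix + ".weights.h5"

# The epoch number is filled into the template when weights are saved.
for epoch in (1, 2, 10):
    print(filepath_template.format(epoch=epoch))
```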


In [None]:
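EPOCHS = 10

# Instantiate the class; calling the existing `model` instance with these
# integers is what raised the ValueError shown below.
model = MyModel(vocab_size, embedding_dim, rnn_units)
# Run one dummy batch so the variables exist before weights are loaded.
model(tf.zeros([1, 1], dtype=tf.int64))
# The callback writes ckpt_{epoch}.weights.h5 files, which tf.train.latest_checkpoint
# does not track, so name the file explicitly (assumes training saved that epoch).
model.load_weights(checkpoint_prefix.format(epoch=EPOCHS) + '.weights.h5')
model.summary()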


ValueError: Only input tensors may be passed as positional arguments. The following argument value should be passed as a keyword argument: 66 (of type <class 'int'>)