
## RNN for Text Generation with an LSTM (Character Model)

Importing the Torch, NumPy, and Matplotlib libraries

In [1]:
import torch
from torch import nn
import torch.nn.functional as F

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Loading data from Drive

In [2]:
%cd '/content/drive/MyDrive/Pytorch/PYTORCH_NOTEBOOKS/PYTORCH_NOTEBOOKS/Data'
with open('shakespeare.txt','r',encoding='utf8') as f:
    text = f.read()

/content/drive/MyDrive/Pytorch/PYTORCH_NOTEBOOKS/PYTORCH_NOTEBOOKS/Data


In [4]:
print(text[:1000])


                     1
  From fairest creatures we desire increase,
  That thereby beauty's rose might never die,
  But as the riper should by time decease,
  His tender heir might bear his memory:
  But thou contracted to thine own bright eyes,
  Feed'st thy light's flame with self-substantial fuel,
  Making a famine where abundance lies,
  Thy self thy foe, to thy sweet self too cruel:
  Thou that art now the world's fresh ornament,
  And only herald to the gaudy spring,
  Within thine own bud buriest thy content,
  And tender churl mak'st waste in niggarding:
    Pity the world, or else this glutton be,
    To eat the world's due, by the grave and thee.


                     2
  When forty winters shall besiege thy brow,
  And dig deep trenches in thy beauty's field,
  Thy youth's proud livery so gazed on now,
  Will be a tattered weed of small worth held:  
  Then being asked, where all thy beauty lies,
  Where all the treasure of thy lusty days;
  To say within thine own deep su

In [5]:
len(text)

5445609

Total unique characters

In [6]:
all_characters = set(text)

Indexing the unique characters

In [8]:
decoder = dict(enumerate(all_characters))

In [21]:
#decoder

Dictionary mapping each character back to its index (the inverse of `decoder`)

In [17]:
encoder = {char:idx for idx,char in decoder.items()}
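
The order in which a Python `set` of strings is iterated can change between interpreter sessions, so an index mapping built from `set(text)` is not guaranteed to be reproducible. A minimal sketch of a reproducible alternative, assuming the vocabulary is sorted first (this differs from the mapping used in this notebook; `sample_text`, `sorted_chars`, `stable_decoder`, and `stable_encoder` are illustrative names):

```python
# Illustrative stand-in for the full corpus (assumption, not the notebook's data)
sample_text = "From fairest creatures we desire increase"

# Sorting the unique characters fixes the index order across sessions
sorted_chars = sorted(set(sample_text))

stable_decoder = dict(enumerate(sorted_chars))                 # index -> char
stable_encoder = {ch: i for i, ch in stable_decoder.items()}   # char -> index

# Round trip: encode the text to indices, then decode it back
indices = [stable_encoder[ch] for ch in sample_text]
restored = "".join(stable_decoder[i] for i in indices)
```

The round trip should give back the original string, which is a quick way to confirm that the two dictionaries are inverses of each other.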


In [20]:
#encoder

Array of the whole text, with each character replaced by its encoded index

In [22]:
encoded_text = np.array([encoder[char] for char in text])

In [23]:
encoded_text[:100]

array([16, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
       41, 41, 41, 41, 41, 26, 16, 41, 41, 60, 62, 50, 70, 41, 10, 82, 20,
       62, 33, 39, 61, 41, 49, 62, 33, 82, 61, 81, 62, 33, 39, 41, 27, 33,
       41, 66, 33, 39, 20, 62, 33, 41, 20, 55, 49, 62, 33, 82, 39, 33,  8,
       16, 41, 41, 15, 14, 82, 61, 41, 61, 14, 33, 62, 33, 56, 42, 41, 56,
       33, 82, 81, 61, 42, 46, 39, 41, 62, 50, 39, 33, 41, 70, 20])

One-hot encoding the characters


In [24]:
def one_hot_encoder(encoded_text, num_uni_chars):
    '''
    encoded_text : batch of encoded text
    
    num_uni_chars = number of unique characters (len(set(text)))
    '''
    
    # METHOD FROM:
    # https://stackoverflow.com/questions/29831489/convert-array-of-indices-to-1-hot-encoded-numpy-array
      
    # Create a placeholder for zeros.
    one_hot = np.zeros((encoded_text.size, num_uni_chars))
    
    # Convert data type for later use with PyTorch (errors if we don't!)
    one_hot = one_hot.astype(np.float32)

    # Using fancy indexing fill in the 1s at the correct index locations
    one_hot[np.arange(one_hot.shape[0]), encoded_text.flatten()] = 1.0
    

    # Reshape it so it matches the batch shape
    one_hot = one_hot.reshape((*encoded_text.shape, num_uni_chars))
    
    return one_hot

In [25]:
#example
one_hot_encoder(np.array([1,2,0]),3)

array([[0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]], dtype=float32)
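
For comparison, PyTorch provides `torch.nn.functional.one_hot`, which produces the same encoding from an integer tensor. A minimal sketch, assuming the indices are already a `LongTensor`; the cast to float mirrors the `float32` conversion in the function above:

```python
import torch
import torch.nn.functional as F

indices = torch.tensor([1, 2, 0])            # same example indices as above
one_hot = F.one_hot(indices, num_classes=3)  # integer tensor of shape (3, 3)
one_hot = one_hot.float()                    # float32, as the LSTM input expects
```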

Function to generate batches for training
 

In [26]:
def generate_batches(encoded_text, samp_per_batch=10, seq_len=50):
    
    '''
    Generate (using yield) batches for training.
    
    X: Encoded Text of length seq_len
    Y: Encoded Text shifted by one
    
    Example:
    
    X:
    
    [[1 2 3]]
    
    Y:
    
    [[ 2 3 4]]
    
    encoded_text : Complete Encoded Text to make batches from
    samp_per_batch : Number of samples per batch
    seq_len : Length of character sequence
       
    '''
    
    # Total number of characters per batch
    # Example: If samp_per_batch is 2 and seq_len is 50, then 100
    # characters come out per batch.
    char_per_batch = samp_per_batch * seq_len
    
    
    # Number of batches available to make
    # int() truncates, so a partial batch at the end is dropped
    num_batches_avail = int(len(encoded_text)/char_per_batch)
    
    # Cut off end of encoded_text that
    # won't fit evenly into a batch
    encoded_text = encoded_text[:num_batches_avail * char_per_batch]
    
    
    # Reshape text into rows the size of a batch
    encoded_text = encoded_text.reshape((samp_per_batch, -1))
    

    # Step through the columns in windows of seq_len.
    for n in range(0, encoded_text.shape[1], seq_len):
        
        # Grab feature characters
        x = encoded_text[:, n:n+seq_len]
        
        # y is the target shifted over by 1
        y = np.zeros_like(x)
       
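        # Shift x left by one; the last target is the character that
        # follows this window in each row
        try:
            y[:, :-1] = x[:, 1:]
            y[:, -1]  = encoded_text[:, n+seq_len]
            
        # The final window has no following character, so wrap
        # around to the first character of each row
        except IndexError:
            y[:, :-1] = x[:, 1:]
            y[:, -1] = encoded_text[:, 0]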
            
        yield x, y

In [30]:
#example
samplet = encoded_text[:50]
sam = generate_batches(samplet,samp_per_batch=4,seq_len=12)
x, y = next(sam)

In [31]:
x

array([[16, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41],
       [41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 26, 16],
       [41, 41, 60, 62, 50, 70, 41, 10, 82, 20, 62, 33],
       [39, 61, 41, 49, 62, 33, 82, 61, 81, 62, 33, 39]])

In [32]:
y

array([[41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 16],
       [41, 41, 41, 41, 41, 41, 41, 41, 41, 26, 16, 41],
       [41, 60, 62, 50, 70, 41, 10, 82, 20, 62, 33, 41],
       [61, 41, 49, 62, 33, 82, 61, 81, 62, 33, 39, 39]])
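
The relationship between `x` and `y` is easier to see on consecutive integers. A small sketch with made-up data (`toy` is illustrative and not part of the notebook), taking one window and its shifted target directly by slicing:

```python
import numpy as np

# 2 rows of 6 consecutive "characters" each (illustrative data)
toy = np.arange(12).reshape(2, -1)

seq_len = 4
x = toy[:, 0:seq_len]        # input window
y = toy[:, 1:seq_len + 1]    # same window shifted one step to the right

# Every target is the character that follows the matching input position
shift_holds = bool((y == x + 1).all())
```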

Checking for availability of GPU

In [33]:
torch.cuda.is_available()

True

Building the LSTM model


In [34]:
class CharModel(nn.Module):
    
    def __init__(self, all_chars, num_hidden=256, num_layers=4,drop_prob=0.5,use_gpu=False):
        
        
        # SET UP ATTRIBUTES
        super().__init__()
        self.drop_prob = drop_prob
        self.num_layers = num_layers
        self.num_hidden = num_hidden
        self.use_gpu = use_gpu
        
        #CHARACTER SET, ENCODER, and DECODER
        self.all_chars = all_chars
        self.decoder = dict(enumerate(all_chars))
        # Use the model's own decoder rather than the global one
        self.encoder = {char: ind for ind,char in self.decoder.items()}
        
        
        self.lstm = nn.LSTM(len(self.all_chars), num_hidden, num_layers, dropout=drop_prob, batch_first=True)
        
        self.dropout = nn.Dropout(drop_prob)
        
        self.fc_linear = nn.Linear(num_hidden, len(self.all_chars))
      
    
    def forward(self, x, hidden):
                  
        
        lstm_output, hidden = self.lstm(x, hidden)
        
        
        drop_output = self.dropout(lstm_output)
        
        drop_output = drop_output.contiguous().view(-1, self.num_hidden)
        
        
        final_out = self.fc_linear(drop_output)
        
        
        return final_out, hidden
    
    
    def hidden_state(self, batch_size):
        '''
        Used as separate method to account for both GPU and CPU users.
        '''
        
        if self.use_gpu:
            
            hidden = (torch.zeros(self.num_layers,batch_size,self.num_hidden).cuda(),
                     torch.zeros(self.num_layers,batch_size,self.num_hidden).cuda())
        else:
            hidden = (torch.zeros(self.num_layers,batch_size,self.num_hidden),
                     torch.zeros(self.num_layers,batch_size,self.num_hidden))
        
        return hidden
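
To make the tensor shapes in `forward` concrete, here is a minimal shape check that uses standalone layers with small, made-up sizes (`vocab`, `hidden_size`, `layers`, `batch`, and `seq` are illustrative values, not the settings used in this notebook):

```python
import torch
from torch import nn

vocab, hidden_size, layers = 10, 16, 2   # illustrative sizes
batch, seq = 3, 5

lstm = nn.LSTM(vocab, hidden_size, layers, batch_first=True)
fc = nn.Linear(hidden_size, vocab)

x = torch.zeros(batch, seq, vocab)                 # one-hot style input
h0 = torch.zeros(layers, batch, hidden_size)
c0 = torch.zeros(layers, batch, hidden_size)

out, (hn, cn) = lstm(x, (h0, c0))                  # out: (batch, seq, hidden)
flat = out.contiguous().view(-1, hidden_size)      # (batch*seq, hidden)
logits = fc(flat)                                  # (batch*seq, vocab)
```

Flattening to `(batch*seq, vocab)` gives one row of scores per character position, which is the layout the loss function is later given.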
        

Instantiating The Model


In [35]:
model = CharModel(
    all_chars=all_characters,
    num_hidden=512,
    num_layers=3,
    drop_prob=0.5,
    use_gpu=True,
)

Total Parameters in Model

In [36]:
total_param  = []
for p in model.parameters():
    total_param.append(int(p.numel()))

**As a rule of thumb, a good model has roughly as many parameters as there are characters in the text.**

In [37]:
sum(total_param)

5470292
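
The total can be checked by hand. Each LSTM layer holds four gates, and each gate has an input weight matrix, a hidden weight matrix, and two bias vectors. A worked calculation for the settings used here (84 unique characters, as shown in the model summary printed after loading; 512 hidden units; 3 layers), with `lstm_layer_params` as an illustrative helper:

```python
def lstm_layer_params(input_size, hidden_size):
    # 4 gates, each with input weights, hidden weights, and two bias vectors
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)

vocab, hidden, layers = 84, 512, 3

first_layer = lstm_layer_params(vocab, hidden)                    # takes one-hot input
later_layers = (layers - 1) * lstm_layer_params(hidden, hidden)   # take hidden output
linear = hidden * vocab + vocab                                   # weights + bias

total = first_layer + later_layers + linear
print(total)  # 5470292
```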

In [38]:
len(encoded_text)

5445609

Setting the optimizer and the loss criterion

In [39]:
optimizer = torch.optim.Adam(model.parameters(),lr=0.001)
criterion = nn.CrossEntropyLoss()
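
`nn.CrossEntropyLoss` takes raw scores (logits) of shape `(N, C)` and integer class targets of shape `(N,)`, which is why the model output is flattened to one row per character position. A small sketch with made-up tensors: with all-zero logits every one of the 84 characters is equally likely, so the loss equals ln(84), a useful reference point for a model that has learned nothing.

```python
import math
import torch
from torch import nn

num_classes = 84                      # unique characters in this text
logits = torch.zeros(6, num_classes)  # 6 positions, uniform scores (illustrative)
targets = torch.tensor([0, 5, 10, 20, 40, 83])  # arbitrary target indices

criterion = nn.CrossEntropyLoss()
uniform_loss = criterion(logits, targets).item()

# The two printed values should agree
print(round(uniform_loss, 4), round(math.log(num_classes), 4))
```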

In [40]:
# Training/validation split: the first 10% of the text is used for
# training and the remaining 90% for validation
train_percent = 0.1

In [41]:
int(len(encoded_text) * (train_percent))

544560

In [42]:
train_ind = int(len(encoded_text) * (train_percent))

In [43]:
train_data = encoded_text[:train_ind]
val_data = encoded_text[train_ind:]

Training LSTM Model 

In [44]:
## VARIABLES

# Epochs to train for
epochs = 50
# batch size 
batch_size = 128

# Length of sequence
seq_len = 100

# for printing report purposes
# always start at 0
tracker = 0

# number of unique characters (largest encoded index + 1)
num_char = max(encoded_text)+1

In [45]:
# Set model to train
model.train()


# Check to see if using GPU
if model.use_gpu:
    model.cuda()

for i in range(epochs):
    
    hidden = model.hidden_state(batch_size)
    
    
    for x,y in generate_batches(train_data,batch_size,seq_len):
        
        tracker += 1
        
        # One Hot Encode incoming data
        x = one_hot_encoder(x,num_char)
        
        # Convert Numpy Arrays to Tensor
        
        inputs = torch.from_numpy(x)
        targets = torch.from_numpy(y)
        
        # Adjust for GPU if necessary
        
        if model.use_gpu:
            
            inputs = inputs.cuda()
            targets = targets.cuda()
            
        # Detach the hidden state from the previous batch's graph.
        # If we don't, we would backpropagate through all training history
        hidden = tuple([state.data for state in hidden])
        
        model.zero_grad()
        
        lstm_output, hidden = model.forward(inputs,hidden)
        loss = criterion(lstm_output,targets.view(batch_size*seq_len).long())
        
        loss.backward()
        
        # POSSIBLE EXPLODING GRADIENT PROBLEM!
        # LET'S CLIP JUST IN CASE
        nn.utils.clip_grad_norm_(model.parameters(),max_norm=5)
        
        optimizer.step()
        
        
        
        ###################################
        ### CHECK ON VALIDATION SET ######
        #################################
        
        if tracker % 25 == 0:
            
            val_hidden = model.hidden_state(batch_size)
            val_losses = []
            model.eval()
            
            for x,y in generate_batches(val_data,batch_size,seq_len):
                
                # One Hot Encode incoming data
                x = one_hot_encoder(x,num_char)
                

                # Convert Numpy Arrays to Tensor

                inputs = torch.from_numpy(x)
                targets = torch.from_numpy(y)

                # Adjust for GPU if necessary

                if model.use_gpu:

                    inputs = inputs.cuda()
                    targets = targets.cuda()
                    
                # Detach the hidden state from the previous batch's graph
                # so the graph does not grow across validation batches
                val_hidden = tuple([state.data for state in val_hidden])
                
                lstm_output, val_hidden = model.forward(inputs,val_hidden)
                val_loss = criterion(lstm_output,targets.view(batch_size*seq_len).long())
        
                val_losses.append(val_loss.item())
            
            # Reset to training mode after the validation loop
            model.train()
            
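            # Note: this reports the loss of the last validation batch only;
            # val_losses holds the loss of every validation batch
            print(f"Epoch: {i} Step: {tracker} Val Loss: {val_loss.item()}")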

Epoch: 0 Step: 25 Val Loss: 3.2371671199798584
Epoch: 1 Step: 50 Val Loss: 3.2342615127563477
Epoch: 1 Step: 75 Val Loss: 3.233180284500122
Epoch: 2 Step: 100 Val Loss: 3.1287336349487305
Epoch: 2 Step: 125 Val Loss: 3.010040283203125
Epoch: 3 Step: 150 Val Loss: 2.858440637588501
Epoch: 4 Step: 175 Val Loss: 2.7408323287963867
Epoch: 4 Step: 200 Val Loss: 2.6316757202148438
Epoch: 5 Step: 225 Val Loss: 2.5085277557373047
Epoch: 5 Step: 250 Val Loss: 2.4252593517303467
Epoch: 6 Step: 275 Val Loss: 2.316955804824829
Epoch: 7 Step: 300 Val Loss: 2.2501840591430664
Epoch: 7 Step: 325 Val Loss: 2.198885202407837
Epoch: 8 Step: 350 Val Loss: 2.152822971343994
Epoch: 8 Step: 375 Val Loss: 2.117525339126587
Epoch: 9 Step: 400 Val Loss: 2.078951835632324
Epoch: 10 Step: 425 Val Loss: 2.0522267818450928
Epoch: 10 Step: 450 Val Loss: 2.0250818729400635
Epoch: 11 Step: 475 Val Loss: 2.0010972023010254
Epoch: 11 Step: 500 Val Loss: 1.981205701828003
Epoch: 12 Step: 525 Val Loss: 1.9615483283996582
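
The loop above prints the loss of the last validation batch. A minimal sketch of an alternative that averages over all batches and disables gradient tracking with `torch.no_grad()`. Here `mean_loss` is a hypothetical helper, and the toy linear model carries no hidden state, so this illustrates the evaluation pattern rather than the exact loop used in this notebook:

```python
import torch
from torch import nn

def mean_loss(model, batches, criterion):
    """Average the criterion over all batches without tracking gradients."""
    was_training = model.training
    model.eval()
    losses = []
    with torch.no_grad():
        for inputs, targets in batches:
            losses.append(criterion(model(inputs), targets).item())
    model.train(was_training)   # restore the previous mode
    return sum(losses) / len(losses)

# Toy data and model (illustrative stand-ins)
torch.manual_seed(0)
toy_model = nn.Linear(4, 3)
toy_batches = [(torch.randn(8, 4), torch.randint(0, 3, (8,))) for _ in range(5)]

avg = mean_loss(toy_model, toy_batches, nn.CrossEntropyLoss())
```

Averaging gives a steadier number than a single batch, and skipping gradient tracking reduces memory use during evaluation.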

In [46]:
model_name = 'shaks512_3'

In [47]:
#Saving model 
torch.save(model.state_dict(),model_name)
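
Because the character-to-index mapping must be the same at load time as it was during training, it can help to store the vocabulary next to the weights. A minimal sketch, assuming a checkpoint dictionary with the hypothetical keys `state_dict` and `all_chars`, and using a toy layer and a temporary file in place of the trained model:

```python
import os
import tempfile
import torch
from torch import nn

toy_model = nn.Linear(4, 3)                 # stand-in for the trained model
vocab = sorted(set("hello world"))          # stand-in for the character set

checkpoint = {"state_dict": toy_model.state_dict(), "all_chars": vocab}

path = os.path.join(tempfile.mkdtemp(), "toy_checkpoint.pt")
torch.save(checkpoint, path)

# map_location="cpu" lets the file load on a machine without a GPU
loaded = torch.load(path, map_location="cpu")

restored = nn.Linear(4, 3)
restored.load_state_dict(loaded["state_dict"])
restored_vocab = loaded["all_chars"]
```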

In [48]:
# MUST MATCH THE EXACT SAME SETTINGS AS MODEL USED DURING TRAINING!
#Loading Model
model = CharModel(
    all_chars=all_characters,
    num_hidden=512,
    num_layers=3,
    drop_prob=0.5,
    use_gpu=True,
)

In [49]:
model.load_state_dict(torch.load(model_name))
model.eval()

CharModel(
  (lstm): LSTM(84, 512, num_layers=3, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.5, inplace=False)
  (fc_linear): Linear(in_features=512, out_features=84, bias=True)
)

In [50]:
def predict_next_char(model, char, hidden=None, k=1):
        
        # Encode raw letters with model
        encoded_text = model.encoder[char]
        
        # set as numpy array for one hot encoding
        # NOTE THE [[ ]] dimensions!!
        encoded_text = np.array([[encoded_text]])
        
        # One hot encoding
        encoded_text = one_hot_encoder(encoded_text, len(model.all_chars))
        
        # Convert to Tensor
        inputs = torch.from_numpy(encoded_text)
        
        # Check for GPU
        if(model.use_gpu):
            inputs = inputs.cuda()
        
        
        # Grab hidden states
        hidden = tuple([state.data for state in hidden])
        
        
        # Run model and get predicted output
        lstm_out, hidden = model(inputs, hidden)

        
        # Convert lstm_out to probabilities
        probs = F.softmax(lstm_out, dim=1).data
        
        
        
        if(model.use_gpu):
            # move back to CPU to use with numpy
            probs = probs.cpu()
        
        
        # k determines how many characters to consider
        # for our probability choice.
        # https://pytorch.org/docs/stable/torch.html#torch.topk
        
        # Return k largest probabilities in tensor
        probs, index_positions = probs.topk(k)
        
        
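        # flatten() (rather than squeeze()) keeps a 1-D array even when k=1,
        # which np.random.choice needs
        index_positions = index_positions.numpy().flatten()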
        
        # Create array of probabilities
        probs = probs.numpy().flatten()
        
        # Convert to probabilities per index
        probs = probs/probs.sum()
        
        # randomly choose a character based on probabilities
        char = np.random.choice(index_positions, p=probs)
       
        # return the decoded predicted character and the hidden state
        return model.decoder[char], hidden
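
The sampling step can be looked at in isolation. A minimal sketch of top-k sampling on made-up logits (`logits` and the random seed are illustrative): keep the `k` most likely indices, renormalise their probabilities so they sum to 1, and draw one index from them.

```python
import numpy as np
import torch
import torch.nn.functional as F

np.random.seed(0)                                    # only to make the sketch repeatable
logits = torch.tensor([[2.0, 0.5, 1.0, -1.0, 0.0]])  # shape (1, vocab), illustrative

probs = F.softmax(logits, dim=1)
top_probs, top_idx = probs.topk(3)                   # 3 largest probabilities

top_idx = top_idx.numpy().flatten()                  # 1-D array of candidate indices
top_probs = top_probs.numpy().flatten()
top_probs = top_probs / top_probs.sum()              # renormalise over the top k

choice = np.random.choice(top_idx, p=top_probs)
```

A larger `k` allows more varied text, while `k=1` always picks the single most likely character.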

In [51]:
def generate_text(model, size, seed='The', k=1):
        
      
    
    # CHECK FOR GPU
    if(model.use_gpu):
        model.cuda()
    else:
        model.cpu()
    
    # Evaluation mode
    model.eval()
    
    # begin output from initial seed
    output_chars = [c for c in seed]
    
    # initialise hidden state
    hidden = model.hidden_state(1)
    
    # predict the next character for every character in seed
    for char in seed:
        char, hidden = predict_next_char(model, char, hidden, k=k)
    
    # add the character predicted after the seed to the output
    output_chars.append(char)
    
    # Now generate for size requested
    for i in range(size):
        
        # predict based off very last letter in output_chars
        char, hidden = predict_next_char(model, output_chars[-1], hidden, k=k)
        
        # add predicted character
        output_chars.append(char)
    
    # return string of predicted text
    return ''.join(output_chars)

In [56]:
print(generate_text(model, 1000, seed='CELIA ', k=3))

CELIA A. A mother of the wifler

  AGRIPPA, and were a power of the sea,
             And therefore have my sword,
             As all to the wist strange,
             The world will be the sea to stand,
            That thou art to break this.
  ANTONY. I'll be such a man see that I am sort,
    And so will see his sword.
  CLEOPATRA. I am all this word.
  CLEOPATRA. The service warr'd to have an hearts of honour
    A mother's tongue, which serve the wars
    And to my sea of me, and so to be.
    I have then this a servant, that I see
    The world that we dispress. I was not to
    Be sentery.                                  Exeunt




ACT III. SCENE 2.
For the King and CLOWN

  COUNTESS. Why, sir, the serve of him.
  COUNTESS. Where is a man as the strong and this that hath the man's stole
    as the sun one as I have not a pain to tell him.  
  ORLANDO. What's the world is the world to them, and we shall see
    to have to stard the forest of her stranger. I have heard thee
   