<h1>Notebook For LSTM model</h1>

<h3>Importing Libraries</h3>

In [15]:
import tensorflow as tf
import os
import io
import numpy as np

from tensorflow import keras
import random

<h3>Importing and extracting the training file and model</h3>

In [None]:
!wget --no-check-certificate \
    "https://github.com/syedhamzamohiuddin/Final_SE/tree/main/LaTexAgent_2/my_model.tar.xz" \
    -O "/content/my_model.tar.xz"
!tar xf my_model.tar.xz

!wget --no-check-certificate \
    "https://github.com/syedhamzamohiuddin/FInal_SE/tree/main/LaTexAgent_2/training_data.tar.xz" \
    -O "/content/training_data.tar.xz"
!tar xf training_data.tar.xz

!wget --no-check-certificate \
    "https://github.com/syedhamzamohiuddin/FInal_SE/tree/main/LaTexAgent_2/training_data.tar.xz" \
    -O "/content/pdfs.tar.xz"
!tar xf pdfs.tar.xz

<h3>Reading the training file</h3>
<p>This cell does the following:<br>1. Reads the training file named "train.txt".<br>2. Replaces newline characters with spaces.<br>3. Builds the vocabulary of characters present in the corpus and prints it.</p>
    
<h4>Explanation:</h4>
<p>This is a character-based model, <b>not</b> a word-based one. Prediction happens one character at a time, so line breaks carry no useful information <br>and each newline is replaced with a space.</p>

In [16]:
with io.open("training_data/train/train.txt") as f:
    text = f.read()
text = text.replace("\n"," ")
print("Corpus length:", len(text))

chars = sorted(list(set(text)))
print("Total chars:", len(chars))

for c in chars:
    print(c, end=" ")

Corpus length: 76436
Total chars: 93
  ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ 

<h3>Creating mappings using Python dictionaries:</h3>
<p>The first line creates a Python dictionary that maps each character to an index.<br>The second line creates the inverse dictionary, which maps each index back to its character.</p>

<h4>Explanation:</h4>
<p>Neural networks work with numbers, so the char_to_ix dictionary converts each character to an integer index.<br>The network outputs a probability for each index. The ix_to_char dictionary maps the index chosen by the 'sample' function back to a character.</p>

In [17]:
char_to_ix = {ch:i for i,ch in enumerate(chars)}
ix_to_char = {i:ch for i,ch in enumerate(chars)}
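
<p>To make the two mappings concrete, here is a minimal sketch of a round trip from text to indices and back. The small vocabulary and the helper names encode and decode are assumptions made for illustration; the notebook itself builds the mappings from the full corpus and has no such helpers.</p>

```python
# Toy vocabulary standing in for the notebook's `chars` list (illustrative assumption).
chars = sorted(set("\\frac{1}{2}"))

char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}


def encode(s):
    """Hypothetical helper: map each character of s to its integer index."""
    return [char_to_ix[ch] for ch in s]


def decode(indices):
    """Hypothetical helper: map integer indices back to a string."""
    return "".join(ix_to_char[i] for i in indices)


encoded = encode("\\frac")
print(encoded)          # list of integer indices, one per character
print(decode(encoded))  # the original string is recovered
```

<p>Because both dictionaries are built from the same enumerate(chars) call, decoding an encoded string always returns the original string, provided every character is in the vocabulary.</p>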

<h3>Creating Training sequences:</h3>
<p>Note that the variable "text" contains the whole training data with no new lines.<br>Read the comments in the cell below.</p>

In [18]:
maxlen = 10       # Length of each input sequence: the LSTM sees 10 characters at a time.
step = 3          # Stride between consecutive windows: the first example covers characters 0-9,
                  # the second 3-12, the third 6-15, and so on.
sentences = []    # Training examples (the input windows).
next_chars = []   # Target for each example: given ten characters, what is the eleventh?


for i in range(0, len(text) - maxlen, step):  # i advances by the step size, which is 3.
    sentences.append(text[i : i + maxlen])    # Window that starts at i and ends just before i + maxlen.
    next_chars.append(text[i + maxlen])       # The character right after the window is the target.
print("Number of sequences:", len(sentences))

# Input: one-hot array of shape (len(sentences), maxlen, len(chars)). It starts as all False;
# the loop below sets the entry at each character's index to True.
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)  # bool instead of np.bool: that alias was deprecated in NumPy 1.20.
# Output: one-hot array of shape (len(sentences), len(chars)), one row per target character.
y = np.zeros((len(sentences), len(chars)), dtype=bool)


# Fill in the one-hot entries.
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_to_ix[char]] = 1    # character is mapped to index for input array
    y[i, char_to_ix[next_chars[i]]] = 1  # character is mapped to index for output array


Number of sequences: 25476
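
<p>The windowing above can be checked on a short string. The sketch below wraps the same loop in a function, make_windows, which is a hypothetical name used only for this illustration, and runs it with a smaller maxlen so the result is easy to read.</p>

```python
def make_windows(text, maxlen, step):
    """Return (inputs, targets); each target is the character that follows its window."""
    inputs, targets = [], []
    for i in range(0, len(text) - maxlen, step):
        inputs.append(text[i : i + maxlen])
        targets.append(text[i + maxlen])
    return inputs, targets


# Toy string and toy settings (assumptions for illustration only).
inputs, targets = make_windows("abcdefghij", maxlen=4, step=3)
for window, target in zip(inputs, targets):
    print(repr(window), "->", repr(target))

# The same range arithmetic gives the sequence count for the notebook's corpus.
print(len(range(0, 76436 - 10, 3)))
```

<p>With the corpus length of 76436, maxlen of 10 and step of 3, the range contains 25476 start positions, which matches the number of sequences printed above.</p>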


<h3>Creating the model</h3>

In [19]:

model = keras.Sequential(
    [
        keras.Input(shape=(maxlen, len(chars))),  #Input layer that takes input of shape (maxlen, len(chars)), which is the shape of a training example.
        keras.layers.LSTM(128),                   #LSTM layer
        keras.layers.Dense(len(chars), activation="softmax"),
    ]
)
optimizer = keras.optimizers.RMSprop(learning_rate=0.01)  # Optimizer
model.compile(loss="categorical_crossentropy", optimizer=optimizer)  # Loss function that the optimizer minimizes
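
<p>As a rough sanity check on the model size, the number of trainable parameters can be worked out by hand. The sketch below assumes the standard LSTM layout (four gates, each with an input kernel, a recurrent kernel and a bias) and a Dense layer with a bias; the function names are hypothetical and not part of Keras.</p>

```python
def lstm_param_count(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel, and a bias vector.
    return 4 * (units * (input_dim + units) + units)


def dense_param_count(input_dim, units):
    # One weight per input/output pair plus one bias per output unit.
    return input_dim * units + units


vocab_size = 93   # "Total chars" printed when the corpus was read
lstm_units = 128  # size of the LSTM layer in the model above

lstm_params = lstm_param_count(vocab_size, lstm_units)
dense_params = dense_param_count(lstm_units, vocab_size)
print(lstm_params, dense_params, lstm_params + dense_params)
```

<p>If the assumptions hold, these figures should agree with what model.summary() reports for the two layers.</p>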


<h3>Sample Function:</h3>
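<p>This function takes the output of the model, which is a softmax vector of length len(chars).<br>
It rescales the probabilities in 'preds' by the temperature and renormalizes them. Lower temperatures make the most likely characters even more likely.<br>
Finally, it draws one index at random from that distribution. The argmax call only recovers the position of the drawn index in the one-hot result.</p>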

In [20]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype("float64")
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
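
<p>To see what the temperature does, the sketch below applies the same rescaling as sample to a small made-up distribution, without the random draw. The function name apply_temperature and the example probabilities are assumptions for illustration.</p>

```python
import numpy as np


def apply_temperature(preds, temperature=1.0):
    """Rescale a probability vector the same way `sample` does before drawing."""
    preds = np.asarray(preds).astype("float64")
    logits = np.log(preds) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits)


probs = [0.5, 0.3, 0.2]  # made-up distribution, not model output
for temperature in (0.2, 1.0, 2.0):
    print(temperature, np.round(apply_temperature(probs, temperature), 3))
```

<p>A temperature of 1.0 leaves the distribution unchanged, values below 1.0 concentrate probability on the most likely entry, and values above 1.0 flatten the distribution.</p>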


<h3>Training the model</h3>
<p>The model is trained for 'epochs' epochs.<br>Each epoch feeds the data to the model in batches of 'batch_size' examples.</p>
<p>You can vary these parameters.</p>
<h4>Note:</h4> Only run the cell below if you want to train and save the model yourself. Skip otherwise.

In [21]:
epochs = 200
batch_size = 128

for epoch in range(epochs):
    model.fit(x, y, batch_size=batch_size, epochs=1)

model.save("my_model")

Instructions for updating:
If using Keras pass *_constraint arguments to layers.
INFO:tensorflow:Assets written to: my_model/assets


<h3>Get predictions given input:</h3>
<p>Given a 10-character seed, it generates the next 20 characters.</p>

In [23]:
def predictt(inpt):
    
    generated = ""
    sentence = inpt
    print('...Generating with seed: "' + sentence + '"')

    for i in range(20):      # Number of characters to generate
        x_pred = np.zeros((1, maxlen, len(chars)))  # All-zero array that will hold the one-hot encoded seed
        for t, char in enumerate(sentence):         # Set the entry at each character's index to 1.0
            x_pred[0, t, char_to_ix[char]] = 1.0

        preds = reconstructed_model.predict(x_pred, verbose=0)[0]  # Get the output probabilities from the model.
        next_index = sample(preds, 0.2)                            # Sample an index from the temperature-scaled distribution.
        next_char = ix_to_char[next_index]
        sentence = sentence[1:] + next_char                        # Slide the window: drop the first character and append
                                                                   # the predicted character.
        # print(sentence)
        if next_char != '\n':
            generated += next_char                                 # Append the predicted character to the generated text.

  
    print(inpt+generated)

<h3> Loading a saved model:</h3>

<p>Remember that "my_model" here is the name of the folder that TensorFlow created when<br>
    model.save("my_model") was called. So just pass the path to this folder, <b>not</b> to the contents inside it.</p>

In [24]:
reconstructed_model = keras.models.load_model("my_model")

<h3>Predict!!</h3>
<p>Enter a 10-character string, and don't forget to escape each backslash with another backslash (write \\ for \).</p>

In [26]:
predictt("\\frac{1}{\\")

...Generating with seed: "\frac{1}{\"
\frac{1}{\, } { 1 - \e ^ { v_i
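
<p>Since the seed has to be exactly 10 characters long and use only characters from the vocabulary, a small check before calling predictt can catch mistakes early. The helper below, validate_seed, is a hypothetical addition and not part of the original notebook; the toy vocabulary stands in for char_to_ix.</p>

```python
def validate_seed(seed, maxlen, char_to_ix):
    """Hypothetical helper: raise ValueError if the seed cannot be one-hot encoded."""
    if len(seed) != maxlen:
        raise ValueError(f"seed has {len(seed)} characters, expected {maxlen}")
    unknown = sorted(set(seed) - set(char_to_ix))
    if unknown:
        raise ValueError(f"characters not in vocabulary: {unknown}")
    return seed


# Toy vocabulary for illustration; in the notebook, pass the real char_to_ix.
toy_vocab = {ch: i for i, ch in enumerate(sorted(set("\\frac{1}")))}

seed = "\\frac{1}{\\"  # each "\\" in the source is a single backslash in the string
print(len(seed))
validate_seed(seed, 10, toy_vocab)
```

<p>In the notebook this could be called as validate_seed(inpt, maxlen, char_to_ix) at the top of predictt, so that a seed of the wrong length or with an unknown character fails with a clear message.</p>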
