# NLP based Text Generator using RNN LSTM 
<div style="text-align: right">Uday Kiran Dasari</div>

## Objective:
The goal of this project is to develop a Telugu language chatbot leveraging deep learning techniques, specifically Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units. The chatbot is designed to understand and generate coherent Telugu text, enhancing natural language processing (NLP) capabilities in the Telugu language.

### Steps followed to Achieve Project Goals

- **Text Preprocessing**
  - Cleaned and prepared the Telugu text data by removing unwanted characters, normalizing the text, and tokenizing it into words or subwords.

- **Tokenizer Initialization, Sequence Creation, and Padding**
  - Initialized a tokenizer to convert text into numerical sequences.
  - Created sequences of fixed length for input into the model.
  - Applied padding to ensure uniform sequence lengths, making the data suitable for model training.

- **Preparing Data for Model Training**
  - Split the preprocessed and tokenized data into training and validation sets.
  - Ensured the data is in the correct format for the RNN model, with appropriate input-output pairs.

- **Loading the Embeddings**
  - Loaded pre-trained word embeddings to represent the words in a dense vector space.
  - Integrated these embeddings into the model to enhance its understanding of the language context.

- **Model Creation, Training, and Saving**
  - Designed and built 2 RNN models with LSTM cells using TensorFlow.
  - Trained the model on the prepared data, adjusting hyperparameters for optimal performance.
  - Saved the trained model for future use.

- **Loading the Saved Model and Generating Output**
  - Loaded the saved model for generating responses using two approaches
  - Provided a prompt to the model and generated coherent, contextually appropriate output.

## Importing Necessary libraries

In [1]:
import os
import re
import math
import numpy as np
import pandas as pd
from tqdm import tqdm
from deep_translator import GoogleTranslator

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

## Text Preprocessing

In [2]:
def preprocess_text(text):
    # Replace specific characters and absorb spaces and tabs
    text = text.replace(",\n", " _eol_ ").replace(",", " _comma_ ")
    text = text.replace(":", " _colon_ ").replace(";", " _semicolon_ ")
    text = text.replace("?\n", ". ").replace("!\n", ". ").replace(".\n", ". ")
    text = text.replace("?", ".").replace("!", ".").replace('"', "")
    text = text.replace("\t", "").replace("  ", " ")
    text = re.sub(r'\d+', '', text)  # Remove numbers

    # Ensure periods are handled correctly as sentence boundaries
    # Replace ". " with ". _eos_ " and handle cases where period is at the end
    #text = re.sub(r'\.', ' _eos_ ', text)
    #text = re.sub(r'\. ', ' _eos_ ', text)
    #text = re.sub(r'\.(?=\n|$)', ' _eos_ ', text)
    
    # Remove any extra spaces around _eos_
    #text = re.sub(r'\s+_eos_\s+', ' _eos_ ', text)
    
    # Absorb multiple spaces into a single space
    text = re.sub(r"\s+", " ", text).strip()
    
    return text


def read_and_preprocess_csv(file_path, column_index):
    print("Reading CSV file...")
    data = pd.read_csv(file_path)

    print("Merging text data from all rows...")
    # Extract all text data from the specified column and merge into a single corpus
    corpus = ' '.join(data.iloc[:, column_index].astype(str).tolist())
    
    print("Preprocessing text data...")
    preprocessed_corpus = preprocess_text(corpus)
    
    print("Done!")
    return preprocessed_corpus

In [3]:
#Reading the dataset
file_path = './Data/telugu_books.csv'
column_index = 1  # The column index with the text data
corpus = read_and_preprocess_csv(file_path, column_index)
print(corpus[:1000])  # Print the first 1000 characters to check the result

Reading CSV file...
Merging text data from all rows...
Preprocessing text data...
Done!
సుశీలమ్మ కళ్ళలో భయం పారాడింది. అనాధ బిడ్డ అని చిన్నప్పుడే తెలిస్తే మన దగ్గిరవాడు అలా అరమరిక లేకుండా చనువుగా పెరిగేవాడా. పుట్టెడు దిగులు సుశీలమ్మ కంఠంలో పలికింది. అది మనం పెంచేదాన్ని బట్టి వుంటుంది. అటువంటి బేధాలు మనలో లేనట్టు తెలుసుకొనేలా పెంచాలి. చాలామంది అలాగే పెంచుతారు గదండీ. ఏనాడో ఒకనాడు ఆ విషయం తెలియకపోదు. మనం పట్నంలో వుంటున్నాం గనక యింత కాలమయినా ఈ రహస్యాన్ని దాచగలిగాం. సుశీలమ్మ వింటూ కూర్చుంది. ఒక వ్యక్తిత్వం అంటూ ఏర్పడ్డాక ఆ రహస్యం తెలిస్తే లోతుగా గాయపడతారు. అనేక ఆలోచనలు వస్తాయి. చిన్నప్పుడే తెలిస్తే అంతగా తలక్రిందులై పోరు అన్నాడు. వాడు మనల్ని వదిలేసి వెళ్ళిపోతాడేమో. అనలేక అంటున్న ఆమె గొంతులో ఏదో అడ్డుపడినట్టు ఉక్కిరిబిక్కిరి అయిపోతుంది. రామనాథానికి కూడా ఆ భయం లేకపోలేదు. ఆ భయాన్ని దాచుకుంటూ వెళ్ళడు. ఎలా వెళతాడు. ఎక్కడికి వెళతాడు. అసలు ఎందుకు వెళ్ళాలి. వాడికి మనమేం తక్కువ చేశామని వెళ్తాడు. అన్నాడు. సుశీలమ్మకు ఎక్కడికో లోతు తెలియని అగాధంలోకి పడుతున్న వ్యక్తికి జారుడుమెట్లు చేతికి అందినట్టు అనిప

## Tokenizer Initialization, Sequence Creation, and Padding

In [4]:
#Considering initial 500K characters from the whole corpus
corpus=corpus[:500000]

#Split text into sentences
training_data = [sentence.strip() for sentence in corpus.split('.') if sentence.strip()]

# Tokenizer initialization and sequence creation
# Initialize tokenizer and convert text to sequences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(training_data)
sequences = tokenizer.texts_to_sequences(training_data)
vocab_size = len(tokenizer.word_index) + 1
print(f"Vocab size:{vocab_size}")

# Preparing data for model training
# Pad sequences to ensure uniform length
max_sequence_length = 50
padded_sequences = pad_sequences(sequences, maxlen=max_sequence_length)

Vocab size:25012


## Preparing Data for Model Training

In [5]:
%%time
# Creating input (X) and output (y) sequences
X, y = [], []
for sequence in sequences:
    for i in range(1, len(sequence)):
        X.append(sequence[:i])
        y.append(sequence[i])

CPU times: user 25.5 ms, sys: 4.01 ms, total: 29.5 ms
Wall time: 26.6 ms


In [6]:
# Padding sequences
# Pad input sequences to ensure uniform length
padded_X = pad_sequences(X, maxlen=max_sequence_length)
y = np.array(y)
padded_X.shape,y.shape

((51612, 50), (51612,))

## Loading the Embeddings and creation of embedding matrix

In [7]:
# Loading embeddings
embed_dir = "./Data/"
file_name = 'cc.te.300.vec'
embeddings_index = {}
embedding_dim = None

# Read embeddings file and load embeddings into a dictionary
with open(os.path.join(embed_dir, file_name), encoding='utf8') as f:
    first_line = f.readline()
    total_count, embedding_dim = map(int, first_line.split())
    for line in tqdm(f, total=total_count):
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print(f'Found {len(embeddings_index)} word vectors.')
print(f'Embedding dimensions: {embedding_dim}')

100%|██████████| 1878288/1878288 [00:43<00:00, 42925.13it/s]

Found 1878288 word vectors.
Embedding dimensions: 300





In [8]:
# Creating the embedding matrix to be used in the Embedding layer
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in tokenizer.word_index.items():
    if i < vocab_size:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
embedding_matrix.shape,vocab_size

((25012, 300), 25012)

## Model Creation, Training, and Saving

### Model 1

In [9]:
# Building the model 1 
# Define the Sequential model and add layers
model1 = Sequential()
model1.add(Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    input_shape=(max_sequence_length-1,) 
))

model1.add(LSTM(300))
model1.add(Dense(vocab_size, activation='softmax'))

# Compiling the model
# Compile the model with loss function and optimizer
model1.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model1.summary()

  super().__init__(**kwargs)
2024-05-28 18:50:00.786695: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M3 Pro
2024-05-28 18:50:00.786746: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 36.00 GB
2024-05-28 18:50:00.786759: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 13.50 GB
2024-05-28 18:50:00.786793: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-05-28 18:50:00.786813: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)


In [11]:
%%time
# Training the model
# Early stopping callback to avoid overfitting
early_stopping = EarlyStopping(monitor='loss', patience=3, restore_best_weights=True)
# Fit the model on the training data with validation split and early stopping
history1 = model1.fit(padded_X, y, epochs=100, callbacks=[early_stopping], verbose=1)

Epoch 1/100


2024-05-28 18:50:44.433124: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type GPU is enabled.


[1m1613/1613[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m53s[0m 32ms/step - accuracy: 0.0283 - loss: 9.6743
Epoch 2/100
[1m1613/1613[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m52s[0m 32ms/step - accuracy: 0.0308 - loss: 8.9238
Epoch 3/100
[1m1613/1613[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m52s[0m 32ms/step - accuracy: 0.0333 - loss: 7.9848
Epoch 4/100
[1m1613/1613[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m52s[0m 33ms/step - accuracy: 0.0884 - loss: 6.1828
Epoch 5/100
[1m1613/1613[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m53s[0m 33ms/step - accuracy: 0.3567 - loss: 4.0720
Epoch 6/100
[1m1613/1613[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m53s[0m 33ms/step - accuracy: 0.6292 - loss: 2.3720
Epoch 7/100
[1m1613/1613[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m53s[0m 33ms/step - accuracy: 0.7572 - loss: 1.4573
Epoch 8/100
[1m1613/1613[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m53s[0m 33ms/step - accuracy: 0.8244 - loss: 1.0025
Epoch 9/100


In [12]:
# Saving the trained model to disk
model1.save('./Artifacts/model1.keras')

### Model 2

In [13]:
# Building the model 2 with a different architecture

# Define the Sequential model and add layers
model2 = Sequential()
model2.add(Embedding(
    input_dim= vocab_size,
    output_dim=embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    input_shape=(max_sequence_length-1,)
    )
)
model2.add(LSTM(300, return_sequences=True))
model2.add(LSTM(300))
model2.add(Dense(vocab_size, activation='softmax'))

# Compiling the model
# Compile the model with loss function and optimizer
model2.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model2.summary()

In [14]:
%%time
# Training the model
# Early stopping callback to avoid overfitting
early_stopping2 = EarlyStopping(monitor='loss', patience=3, restore_best_weights=True)
# Fit the model on the training data with validation split and early stopping
history2 = model2.fit(padded_X, y, epochs=100, callbacks=[early_stopping2], verbose=1)

Epoch 1/100
[1m1613/1613[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m62s[0m 38ms/step - accuracy: 0.0301 - loss: 9.5866
Epoch 2/100
[1m1613/1613[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m60s[0m 37ms/step - accuracy: 0.0309 - loss: 9.0304
Epoch 3/100
[1m1613/1613[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m60s[0m 37ms/step - accuracy: 0.0290 - loss: 8.8340
Epoch 4/100
[1m1613/1613[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m59s[0m 37ms/step - accuracy: 0.0306 - loss: 8.5988
Epoch 5/100
[1m1613/1613[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m59s[0m 37ms/step - accuracy: 0.0312 - loss: 8.2625
Epoch 6/100
[1m1613/1613[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m60s[0m 37ms/step - accuracy: 0.0343 - loss: 7.8158
Epoch 7/100
[1m1613/1613[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m59s[0m 37ms/step - accuracy: 0.0439 - loss: 7.1753
Epoch 8/100
[1m1613/1613[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m60s[0m 37ms/step - accuracy: 0.0609 - loss: 6.4934


In [15]:
# Saving the trained model to disk
model2.save('./Artifacts/model2.keras')

## Loading the Saved Model and Generating Output

**Deterministic Approach:** When the Probabilistic argument is set to false, it predicts the next word deterministically by choosing the word with the highest probability from the model's output.
- Uses np.argmax to find the index of the highest probability word in the model's output.

**Probabilistic Approach:** When Probabilistic is set to true, it predicts the next word probabilistically by sampling from the model's output distribution. This introduces variability in the generated responses.

**Deterministic Approach (probabilistic=False):**
Pros: Predictable, consistent, often more grammatically correct.
Cons: Can be repetitive and less creative.

**Probabilistic Approach (probabilistic=True):**
Pros: More diverse, creative, and natural-sounding text.
Cons: Can be less coherent and predictable.

In [49]:
# Function to generate a response
def generate_response(user_input, model, tokenizer, max_sequence_length, probabilistic=False):
    # Convert the user input into a sequence of tokens
    input_sequence = tokenizer.texts_to_sequences([user_input])

    # Pad the input sequence to have the same length
    padded_input_sequence = pad_sequences(input_sequence, maxlen=max_sequence_length-1)

    # Predict the next word in the sequence
    predictions = model.predict(padded_input_sequence)[0]
    if probabilistic:
        # Probabilistic approach: choose next word based on probabilities
        predicted_index = np.random.choice(len(predictions), p=predictions)
    else:
        # Deterministic approach: choose the word with the highest probability
        predicted_index = np.argmax(predictions)

    # Convert the predicted index back into a word
    predicted_word = tokenizer.index_word.get(predicted_index, '')

    return predicted_word

# Function to generate a sequence of words
def generate_sequence(input_text, model, tokenizer, max_sequence_length, num_words=80, probabilistic=False):
    output_text = input_text
    for _ in range(num_words):
        next_word = generate_response(input_text, model, tokenizer, max_sequence_length, probabilistic)
        if next_word in ["_eol_", "eol", "_comma_", "comma", "_colon_", "_semicolon_"]:
            # Handle special tokens
            if next_word == "_eol_" or next_word == "eol":
                output_text += '.\n'
            elif next_word == "_comma_" or next_word == "comma":
                output_text += ','
            elif next_word == "_colon_":
                output_text += ':'
            elif next_word == "_semicolon_":
                output_text += ';'
        else:
            output_text += ' ' + next_word
            input_text += ' ' + next_word
    # Post-processing to remove excessive periods and repetitive phrases
    #output_text = re.sub(r'\.\n(\.\n)+', '.\n', output_text)  # Replace multiple line-break periods with one
    #output_text = re.sub(r'\.(\.)+', '.', output_text)  # Replace multiple periods with one
    #output_text = re.sub(r'([^\s\w]|_)+', '', output_text)  # Remove special characters except punctuation
    #output_text = ' '.join(dict.fromkeys(output_text.split()))  # Remove repetitive words

    return output_text

In [31]:
#Helper Function in inference
def translate_large_text(text: str, source_language: str, target_language: str, chunk_size: int = 5000) -> str:
    """
    Translates large text from source_language to target_language by splitting it into chunks.

    Parameters:
    text (str): The text to be translated.
    source_language (str): The language code of the source text (e.g., 'en' for English).
    target_language (str): The language code of the target text (e.g., 'fr' for French).
    chunk_size (int): The maximum number of characters per chunk. Default is 5000.

    Returns:
    str: The translated text.
    """
    #Function to handle huge corpus of text to translate
    def split_text_into_chunks(text, chunk_size):
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks = split_text_into_chunks(text, chunk_size)
    translator = GoogleTranslator(source=source_language, target=target_language)
    
    translated_chunks = [translator.translate(chunk) for chunk in chunks]
    return " ".join(translated_chunks)

### Model 1 Inference

#### Probabilistic Approach

In [50]:
# Load the trained model 1
model1 = tf.keras.models.load_model('./Artifacts/model1.keras')

# Generate text 
input_text = "ప్రేమ"
#input_text = "మీరు ఇక్కడ ఉన్నందున ఇప్పుడు వదిలివేయవద్దు, తద్వారా ప్రపంచం మళ్లీ తనలాగే మారవచ్చు."
generated_text1p = generate_sequence(input_text, model1, tokenizer, max_sequence_length, num_words=80, probabilistic=True)

# Print the generated text
print(generated_text1p)

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 59ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step
[1m1

In [51]:
translated_text1p = translate_large_text(generated_text1p, 
                                        source_language="telugu", target_language="english")
print(translated_text1p)

Love is an intoxicant but your staff is just that. Vepu sat looking at someone seriously. Mugdha asked Inspector Apparao to go to the airport. She went to Pramadwara. Sure, they are trying with Baninu, he took it up,, he took the job in the atmosphere of the room. He wanted a little distance. Nanna came to Bablu, leg room,


#### Deterministic Approach

In [52]:
# Generate text 
input_text = "ప్రేమ"
#input_text = "మీరు ఇక్కడ ఉన్నందున ఇప్పుడు వదిలివేయవద్దు, తద్వారా ప్రపంచం మళ్లీ తనలాగే మారవచ్చు."
generated_text1d = generate_sequence(input_text, model1, tokenizer, max_sequence_length, num_words=80, probabilistic=True)

# Print the generated text
print(generated_text1d)

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 14ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 9ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step
[1m1

In [53]:
translated_text1d = translate_large_text(generated_text1d, 
                                        source_language="telugu", target_language="english")
print(translated_text1d)

If love is an intoxicant Rukmini escapes, she can't help it, Vunura laughs at it Four clues Jaganmohan Rao is trapped between his two iron fists and the baby who lost his life screaming for help 'Amma Amma' How many people went through hell and gave up their lives. Boss, if you ask why you appeared here at the wrong time, I don't know why Homola has changed. She smiled towards the veranda.


### Model 2 Inference

#### Probabilistic Approach

In [54]:
# Load the trained model 2
model2 = tf.keras.models.load_model('./Artifacts/model2.keras')

# Generate text
input_text = "ప్రేమ"
#input_text = "మీరు ఇక్కడ ఉన్నందున ఇప్పుడు వదిలివేయవద్దు, తద్వారా ప్రపంచం మళ్లీ తనలాగే మారవచ్చు."
generated_text2p = generate_sequence(input_text, model2, tokenizer, max_sequence_length, num_words=80, probabilistic=True)

# Print the generated text
print(generated_text2p)

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 94ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 9ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step
[1m1

In [55]:
translated_text2p = translate_large_text(generated_text2p,
                                        source_language="telugu", target_language="english")
print(translated_text2p)

Love is an intoxicant but half of the tension was reduced by listening to that word Srihari was brought home again Pavani stepped aside hearing the coming sound, without knowing the other's second eye,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


#### Deterministic Approach

In [56]:
# Generate text
input_text = "ప్రేమ"
#input_text = "మీరు ఇక్కడ ఉన్నందున ఇప్పుడు వదిలివేయవద్దు, తద్వారా ప్రపంచం మళ్లీ తనలాగే మారవచ్చు."
generated_text2d = generate_sequence(input_text, model2, tokenizer, max_sequence_length, num_words=50, probabilistic=True)

# Print the generated text
print(generated_text2d)

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 14ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 10ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 9ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 9ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 9ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step
[1m

In [57]:
translated_text2d = translate_large_text(generated_text2d,
                                        source_language="telugu", target_language="english")
print(translated_text2d)

If he had suffered a serious injury in a love affair, his reason would have been to get up and fall lifeless on the doorstep of the station, and to face the police force that intervened, and the person who ransacked the station should have done the work. He was impressed by the Aparanji doll. Does not come out


## Conclusion
This project successfully demonstrates the creation of a Telugu language chatbot using RNNs with LSTM units. By leveraging pre-trained word embeddings and training two different LSTM architectures, the model is capable of generating coherent Telugu text. The generated text highlights the model's ability to understand and predict language structures effectively, paving the way for advanced NLP applications in Telugu.

However, it is important to note that the model's performance is constrained by the limited data and type of data it was trained on.. As a result, the range of emotions and vocabulary is also limited. Consequently, the model's predictions may sometimes loop within the restricted vocabulary provided. Expanding the dataset with more diverse and extensive text could further enhance the model's capabilities and robustness.

<div hidden>
### Tokenizer Initialization
    The Tokenizer() class is initialized to facilitate the tokenization process, a crucial step in natural language processing tasks like chatbot creation. It serves to convert textual data into numerical representations by assigning a unique integer index to each distinct word in the training dataset.

### Sequence Conversion
    After initializing the tokenizer, the fit_on_texts(training_data) method is employed to fit the tokenizer on the provided training data. This involves mapping words to their respective indices and building an internal vocabulary. Subsequently, the `texts_to_sequences(training_data)` function converts the original text data into sequences of tokens, replacing each word with its corresponding index.

### Padding for Uniformity 
    To ensure uniformity in sequence lengths, the pad_sequences(sequences, maxlen=max_sequence_length) function is applied. This step involves padding or truncating the sequences to a specified maximum length (max_sequence_length). This is essential for inputting data into neural networks, as models typically expect fixed-size input sequences, and padding ensures consistent dimensions for efficient training and processing in the chatbot model.

### Embedding
    In the context of embedding, vocabulary size refers to the total number of unique words present in a given dataset or corpus. It is a crucial parameter when working with word embeddings, as it directly influences the size of the embedding matrix and the dimensionality of the word vectors.