<a href="https://colab.research.google.com/github/tahawarsi360/NLP_assignment/blob/main/AUTOCOMPLETE_using_LSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [13]:
# Sample sentences for training
sentences = [
    "I love coding",
    "Coding is fun",
    "Python is a popular programming language",
    "Machine learning is an exciting field",
    "Artificial intelligence is the future",
    "Programming is a valuable skill",
    "Data science is in high demand",
    "Web development is constantly evolving",
    "Algorithms are the heart of computer science",
    "Software engineering drives innovation",
    "Technology shapes our lives",
    "The internet connects the world",
    "Cybersecurity is crucial in the digital age",
    "Mobile apps have revolutionized industries",
    "Cloud computing enables scalable solutions",
    "Big data analytics uncovers hidden insights",
]

In [14]:
# Generate training data
X_train = []
y_train = []
for sentence in sentences:
    for i in range(1, len(sentence)):
        X_train.append(sentence[:i])
        y_train.append(sentence[i])

In [15]:
# Convert sentences to numerical representation
# Sort the characters so the mapping is the same on every run
chars = sorted(set(''.join(sentences)))
char_to_index = {char: i for i, char in enumerate(chars)}
index_to_char = {i: char for char, i in char_to_index.items()}
X_train = [[char_to_index[char] for char in sentence] for sentence in X_train]
# pad_sequences left-pads with 0 by default, which is also a valid character index
X_train = pad_sequences(X_train)
y_train = np.array([char_to_index[char] for char in y_train])

In [16]:
# Reshape input data for LSTM
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))

In [17]:
# Build LSTM model
model = Sequential()
model.add(LSTM(128, input_shape=(X_train.shape[1], X_train.shape[2])))
model.add(Dense(len(char_to_index), activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

In [None]:
# Train the model
model.fit(X_train, y_train, epochs=50, batch_size=32)

In [19]:
# Generate auto-completions
def generate_auto_completions(seed_text, num_chars):
    generated_text = seed_text
    for _ in range(num_chars):
        # Encode the text so far; every character must exist in char_to_index
        x = [[char_to_index[char] for char in generated_text]]
        # Left-pad (or left-truncate) to the sequence length used in training
        x = pad_sequences(x, maxlen=X_train.shape[1])
        x = np.reshape(x, (1, x.shape[1], 1))
        prediction = model.predict(x, verbose=0)
        # Greedy decoding: take the most probable next character
        index = int(np.argmax(prediction))
        generated_text += index_to_char[index]
    return generated_text

In [23]:
# Test the auto-completion
seed_text = "I love machine"
num_chars = 10
completion = generate_auto_completions(seed_text, num_chars)
print(f"Auto-completion: {completion}")

Auto-completion: I love machine  ing es s


The output does not make much sense because the model has seen only 16 short sentences and predicts one character at a time from the prefix typed so far. The seed "I love machine" also never appears in the training data, so the model has no learned continuation for it. It does not have a comprehensive understanding of grammar or context.

To improve the output and make it more coherent, you can consider the following suggestions:

Increase the training data: Adding more varied and representative sentences to the training data can help the model learn a wider range of patterns and contexts.

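Use a larger and more complex model: You can increase the capacity of the model by adding more LSTM layers or increasing the number of LSTM units. When stacking LSTM layers in Keras, every LSTM layer except the last must be created with return_sequences=True so that it passes the full sequence on to the next layer. A larger model can capture more intricate patterns and dependencies, but it also overfits a small dataset more easily.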

Train for more epochs: Training for longer lets the model refine its predictions, but on a dataset this small it will also start to memorize the training sentences. Monitor the loss, ideally on held-out sentences, and stop training once it no longer improves.

Adjust the hyperparameters: Experiment with different hyperparameter settings, such as the learning rate, batch size, or LSTM layer size, to find the configuration that works best for your data.

Use a language model pre-trained on a larger corpus: Instead of training your own LSTM model from scratch, you can use a pre-trained generative language model such as GPT-2 or GPT-3, which have been trained on massive amounts of text data. BERT is also pre-trained on a large corpus, but it is a masked language model and is better suited to filling in missing words than to left-to-right completion. These models typically offer better language understanding and can generate more coherent completions.
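
The advice to monitor the loss while training for more epochs is usually applied as a patience-based stopping rule. The sketch below is a hand-written illustration of that rule on a made-up loss history; `epochs_to_run` and the `history` values are hypothetical and not taken from the notebook. Keras provides an `EarlyStopping` callback with similar `patience` and `min_delta` parameters, but this code does not reproduce its implementation.

```python
def epochs_to_run(losses, patience=3, min_delta=0.0):
    """Return how many epochs run before a patience-based stop triggers."""
    best = float("inf")
    waited = 0  # epochs since the last improvement
    for epoch, loss in enumerate(losses, start=1):
        if loss < best - min_delta:
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch  # stop: no improvement for `patience` epochs
    return len(losses)  # never triggered, all epochs were run


# Made-up per-epoch losses: improvement stalls after the third epoch
history = [2.0, 1.5, 1.2, 1.25, 1.21, 1.3, 1.1]
print(epochs_to_run(history, patience=3))  # 6
```

With `patience=3` the run stops after epoch 6, so the later improvement at epoch 7 is never seen. That trade-off is why the patience value is itself worth tuning.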

Remember that while LSTM models can generate auto-completions, they may still produce outputs that are not always coherent or grammatically correct. Generating meaningful and contextually appropriate completions is a challenging task, and the quality of the output depends on various factors, including the size and diversity of the training data and the complexity of the model architecture.

In this example, we first define a list of sample sentences for training. Then, we generate training data by splitting each sentence into input-output pairs, where the input is a prefix of the sentence and the output is the next character in the sentence. We convert the characters to a numerical representation using a character-to-index mapping and pad the encoded prefixes to a common length.
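
The pair generation and encoding steps can be illustrated on a toy input. This is a minimal sketch using only the standard library. `make_pairs`, `build_vocab`, and `left_pad` are hypothetical helper names, and `left_pad` only mimics the default left-padding behaviour of Keras's `pad_sequences` rather than calling it.

```python
def make_pairs(sentence):
    """Return (prefix, next_char) pairs for every position after the first."""
    return [(sentence[:i], sentence[i]) for i in range(1, len(sentence))]


def build_vocab(sentences):
    """Map each distinct character to an integer index (sorted for reproducibility)."""
    chars = sorted(set("".join(sentences)))
    return {char: i for i, char in enumerate(chars)}


def left_pad(seq, length, value=0):
    """Pad with `value` on the left, or keep the last `length` items if too long."""
    seq = list(seq)[-length:]
    return [value] * (length - len(seq)) + seq


pairs = make_pairs("abc")
vocab = build_vocab(["abc"])
encoded = [[vocab[c] for c in prefix] for prefix, _ in pairs]
padded = [left_pad(seq, 2) for seq in encoded]

print(pairs)   # [('a', 'b'), ('ab', 'c')]
print(padded)  # [[0, 0], [0, 1]]
```

The first padded row is ambiguous: index 0 stands for both the padding value and the character "a". The notebook's mapping has the same overlap, because its indices also start at 0.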

Next, we build the LSTM model using the Keras Sequential API. The model consists of an LSTM layer followed by a dense layer with softmax activation. We compile the model with the sparse categorical cross-entropy loss and the Adam optimizer.
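
To make the output layer and the loss concrete, here is a small standard-library sketch of what softmax and sparse categorical cross-entropy compute for a single example. The three scores are invented for illustration, and the functions are simplified stand-ins, not the Keras implementations.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(probs, target_index):
    """Loss for one example: negative log-probability of the true class index."""
    return -math.log(probs[target_index])


probs = softmax([2.0, 1.0, 0.0])  # three hypothetical character scores
loss_right = sparse_categorical_crossentropy(probs, 0)  # true char has the top score
loss_wrong = sparse_categorical_crossentropy(probs, 2)  # true char has the lowest score
print(probs, loss_right, loss_wrong)
```

The loss is "sparse" because the target is given as an integer index, as in `y_train`, rather than as a one-hot vector. It is small when the model puts high probability on the true next character and large otherwise.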

After that, we train the model on the training data.

To generate auto-completions, we define a function generate_auto_completions() that takes a seed text and the number of characters to generate. The function iteratively predicts the next character based on the seed text and appends it to the generated text. Finally, it returns the generated text.
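
The same loop can be written independently of Keras, which makes the greedy decoding step easier to see. In this sketch, `greedy_complete` is a hypothetical function and `toy_predict` is an invented stand-in for the trained model: it simply favours the next letter in the cycle a, b, c.

```python
def greedy_complete(seed_text, num_chars, predict_probs, index_to_char):
    """Repeatedly append the most probable next character to the text."""
    text = seed_text
    for _ in range(num_chars):
        probs = predict_probs(text)  # one probability per character index
        best = max(range(len(probs)), key=probs.__getitem__)  # argmax
        text += index_to_char[best]
    return text


# Toy vocabulary and predictor, used only for this illustration
toy_index_to_char = {0: "a", 1: "b", 2: "c"}
toy_char_to_index = {c: i for i, c in toy_index_to_char.items()}


def toy_predict(text):
    """Give 0.8 probability to the letter that follows the last one in the cycle."""
    nxt = (toy_char_to_index[text[-1]] + 1) % 3
    return [0.8 if i == nxt else 0.1 for i in range(3)]


print(greedy_complete("a", 4, toy_predict, toy_index_to_char))  # abcab
```

Because the loop always takes the single most probable character, the same seed always gives the same completion. Sampling from the predicted probabilities instead is a common alternative when more varied output is wanted.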

We test the auto-completion by providing the seed text "I love machine" and generating 10 characters.

Note that this is a basic example, and you can modify and improve the model according to your specific requirements and dataset.