Every time you type on your smartphone, you see three words pop up as suggestions. That’s a predictive keyboard in action. These suggestions aren’t random. They come from deep learning models that have learned language patterns from large amounts of text. In this article, I’ll take you through building the model behind a predictive keyboard with PyTorch.

**Building a Predictive Keyboard Model Using PyTorch**

The task of building a predictive keyboard model includes these steps:

1. Tokenizing and preparing natural language data
2. Building a vocabulary and converting words to indices
3. Training a next-word prediction model using LSTMs
4. Generating top-k predictions like a predictive keyboard

We will use these steps, and in the end, we will see a model generating three suggestions for the next word, just like a predictive keyboard in your smartphone.

The richer the data, the better your model will generalize. So, the dataset we will use is based on the stories of Sherlock Holmes. You can find this dataset [here](https:///content/sherlock-holm.es_stories_plain-text_advs.txt).

**Step 1: Preparing the Dataset**

We will start with tokenizing the text data and converting everything to lowercase:



In [2]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')

# load data using the correct absolute path
file_path = '/content/sherlock-holm.es_stories_plain-text_advs.txt'
with open(file_path, 'r', encoding='utf-8') as f:
    text = f.read().lower()

tokens = word_tokenize(text)
print("Total Tokens:", len(tokens))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Total Tokens: 125772
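
To get a feel for what this step produces without downloading any NLTK resources, here is a rough sketch on a single sentence. The `simple_tokenize` helper and its regular expression are assumptions made for illustration. They only approximate `word_tokenize`, which handles contractions and some punctuation differently.

```python
import re

def simple_tokenize(text):
    """Rough stand-in for word_tokenize (illustrative only): lowercase the
    text, then split it into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

sample = "To Sherlock Holmes she is always THE woman."
sample_tokens = simple_tokenize(sample)
print(sample_tokens)
```

Note that punctuation marks become tokens of their own, so the token count above includes them as well as words.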


**Step 2: Creating a Vocabulary**

Next, we need a way to convert words into numbers. So we will create:

1. a dictionary to map each word to an index
2. and another dictionary to reverse it back

So, let’s build the vocabulary and create word-to-index mappings:



In [3]:
# Step 2: Build vocabulary and convert words to indices
import torch
from collections import Counter

# Create a vocabulary mapping from word to index
word_counts = Counter(tokens)
vocab = sorted(word_counts, key=word_counts.get, reverse=True)
word_to_idx = {word: i for i, word in enumerate(vocab)}
idx_to_word = {i: word for i, word in enumerate(vocab)}
vocab_size = len(vocab)

print(f"Unique tokens (Vocab Size): {vocab_size}")

# Convert tokens to indices
encoded_text = [word_to_idx[word] for word in tokens]
print(f"Encoded text length: {len(encoded_text)}")

Unique tokens (Vocab Size): 8360
Encoded text length: 125772
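
The same mapping logic can be checked on a toy token list. This is a minimal sketch with made-up tokens and `toy_`-prefixed names (both are assumptions, not part of the notebook). It shows that the most frequent token receives index 0 and that encoding followed by decoding returns the original tokens.

```python
from collections import Counter

toy_tokens = ["the", "game", "is", "afoot", ",", "the", "game", "the"]

counts = Counter(toy_tokens)
# Sort by frequency, highest first, so frequent tokens get small indices
toy_vocab = sorted(counts, key=counts.get, reverse=True)
toy_word_to_idx = {word: i for i, word in enumerate(toy_vocab)}
toy_idx_to_word = {i: word for word, i in toy_word_to_idx.items()}

# Round trip: tokens -> indices -> tokens
encoded = [toy_word_to_idx[w] for w in toy_tokens]
decoded = [toy_idx_to_word[i] for i in encoded]

print(toy_word_to_idx)
print(encoded)
print(decoded == toy_tokens)
```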


**Step 3: Building Input-Output Sequences**

To predict the next word, the model needs context. We can use a sliding window approach. So, let’s create input-target sequences for next word prediction:



In [4]:
import torch
from torch.utils.data import TensorDataset, DataLoader

# Define sequence length
seq_length = 10

X = []
y = []

# Create sliding window sequences
for i in range(len(encoded_text) - seq_length):
    X.append(encoded_text[i:i + seq_length])
    y.append(encoded_text[i + seq_length])

# Convert to PyTorch tensors
X_tensor = torch.tensor(X, dtype=torch.long)
y_tensor = torch.tensor(y, dtype=torch.long)

print(f"Input tensor shape: {X_tensor.shape}")
print(f"Target tensor shape: {y_tensor.shape}")

# Create dataset and verify
dataset = TensorDataset(X_tensor, y_tensor)
print(f"Total number of samples: {len(dataset)}")

Input tensor shape: torch.Size([125762, 10])
Target tensor shape: torch.Size([125762])
Total number of samples: 125762
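
The sliding window is easier to see on a short sequence. The sketch below wraps the same loop in a hypothetical `make_windows` helper (the name and the toy ids are assumptions) and uses a window of 3 instead of 10.

```python
def make_windows(encoded, seq_length):
    """Return (inputs, targets): each input is seq_length consecutive ids,
    and its target is the id that immediately follows the window."""
    inputs, targets = [], []
    for i in range(len(encoded) - seq_length):
        inputs.append(encoded[i:i + seq_length])
        targets.append(encoded[i + seq_length])
    return inputs, targets

toy_encoded = [10, 11, 12, 13, 14, 15]
toy_X, toy_y = make_windows(toy_encoded, seq_length=3)
for context, target in zip(toy_X, toy_y):
    print(context, "->", target)

# A text of N tokens yields N - seq_length samples
expected_samples = 125772 - 10
print(expected_samples)
```

The last line reproduces the sample count reported above: 125,772 tokens with a window of 10 give 125,762 input-target pairs.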


**Step 4: Designing the Model Architecture**

For sequence data like text, LSTMs are a well-established choice. They carry information across time steps, which makes them a good fit for next-word prediction. So, let’s define the LSTM-based predictive keyboard model and train it:




In [5]:
import torch.nn as nn
import torch.optim as optim

# 1. Define the LSTM model architecture
class PredictiveLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(PredictiveLSTM, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        # x shape: (batch_size, seq_length)
        embeds = self.embedding(x)
        # lstm_out shape: (batch_size, seq_length, hidden_dim)
        lstm_out, (h_n, c_n) = self.lstm(embeds)
        # Use the hidden state of the last time step
        out = self.fc(lstm_out[:, -1, :])
        return out

# 2. Initialize model and hyperparameters
embedding_dim = 100
hidden_dim = 256
model = PredictiveLSTM(vocab_size, embedding_dim, hidden_dim)

# 3. Create DataLoader
batch_size = 128
train_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# 4. Define Loss and Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# 5. Training Loop
num_epochs = 5
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

print(f"Starting training on {device}...")
for epoch in range(num_epochs):
    total_loss = 0
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)

        # Forward pass
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)

        # Backward and optimize
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

        if (batch_idx + 1) % 200 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Step [{batch_idx+1}/{len(train_loader)}], Loss: {loss.item():.4f}')

    avg_loss = total_loss / len(train_loader)
    print(f'Epoch [{epoch+1}/{num_epochs}] completed. Average Loss: {avg_loss:.4f}')

Starting training on cuda...
Epoch [1/5], Step [200/983], Loss: 6.3324
Epoch [1/5], Step [400/983], Loss: 5.4437
Epoch [1/5], Step [600/983], Loss: 5.7752
Epoch [1/5], Step [800/983], Loss: 5.3442
Epoch [1/5] completed. Average Loss: 5.6032
Epoch [2/5], Step [200/983], Loss: 5.0917
Epoch [2/5], Step [400/983], Loss: 4.9217
Epoch [2/5], Step [600/983], Loss: 5.1580
Epoch [2/5], Step [800/983], Loss: 4.5925
Epoch [2/5] completed. Average Loss: 4.7676
Epoch [3/5], Step [200/983], Loss: 4.4735
Epoch [3/5], Step [400/983], Loss: 4.4755
Epoch [3/5], Step [600/983], Loss: 4.1851
Epoch [3/5], Step [800/983], Loss: 3.9924
Epoch [3/5] completed. Average Loss: 4.3356
Epoch [4/5], Step [200/983], Loss: 3.6541
Epoch [4/5], Step [400/983], Loss: 3.9229
Epoch [4/5], Step [600/983], Loss: 3.6778
Epoch [4/5], Step [800/983], Loss: 3.5726
Epoch [4/5] completed. Average Loss: 3.9525
Epoch [5/5], Step [200/983], Loss: 3.4566
Epoch [5/5], Step [400/983], Loss: 3.7921
Epoch [5/5], Step [600/983], Loss: 3.66
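
To put these loss values in context, it helps to compare them with a model that guesses uniformly over the vocabulary, and to convert them to perplexity. `nn.CrossEntropyLoss` reports the mean negative log-likelihood using the natural logarithm, so perplexity is simply `exp(loss)`. The sketch below uses only the epoch averages printed above. The variable names are assumptions, and these are training-set figures, not a held-out evaluation.

```python
import math

reported_vocab_size = 8360  # vocabulary size printed in Step 2

# Loss of a model that assigns equal probability to every token
uniform_loss = math.log(reported_vocab_size)

# Average training losses printed for epochs 1 to 4 above
epoch_losses = [5.6032, 4.7676, 4.3356, 3.9525]
perplexities = [math.exp(loss) for loss in epoch_losses]

print(f"Uniform-guess loss: {uniform_loss:.2f}")
for epoch, (loss, ppl) in enumerate(zip(epoch_losses, perplexities), start=1):
    print(f"Epoch {epoch}: loss {loss:.4f} -> perplexity {ppl:.1f}")
```

Roughly speaking, a perplexity of P means the model is as uncertain as if it were choosing uniformly among P tokens at each step.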

## Implement Top-K Prediction

Finally, we create a function that uses the trained LSTM model to predict the three most likely next words for a given input string.


### Predictive Keyboard Workflow Diagram

```mermaid
graph TD
    A[Raw Text Data: Sherlock Holmes] --> B[Step 1: Tokenization & Preprocessing]
    B --> C[Convert to Lowercase & Word Tokenize]
    C --> D[Step 2: Build Vocabulary]
    D --> E[Map Unique Words to Indices]
    E --> F[Step 3: Prepare Sequences]
    F --> G[Sliding Window: 10 Words Input + 1 Target]
    G --> H[Step 4: Model Architecture]
    H --> I[Embedding Layer -> LSTM -> Linear Layer]
    I --> J[Step 5: Training Loop]
    J --> K[Loss: CrossEntropy, Optimizer: Adam]
    K --> L[Step 6: Top-K Prediction Function]
    L --> M[Input Text -> Softmax -> Top 3 Suggestions]
```

In [6]:
def predict_next_words(input_text, model, word_to_idx, idx_to_word, seq_length=10):
    model.eval()
    tokens = word_tokenize(input_text.lower())
    indices = [word_to_idx.get(w, 0) for w in tokens]
    if len(indices) < seq_length:
        indices = [0] * (seq_length - len(indices)) + indices
    else:
        indices = indices[-seq_length:]
    input_tensor = torch.tensor([indices], dtype=torch.long).to(device)
    with torch.no_grad():
        logits = model(input_tensor)
        probabilities = torch.softmax(logits, dim=1)
        top_k_probs, top_k_indices = torch.topk(probabilities, 3)
    predictions = [idx_to_word[idx.item()] for idx in top_k_indices[0]]
    return predictions

test_input = "the adventure of the"
suggestions = predict_next_words(test_input, model, word_to_idx, idx_to_word)
print(f"Input: '{test_input}'")
print(f"Top 3 Suggestions: {suggestions}")

Input: 'the adventure of the'
Top 3 Suggestions: ['man', 'most', 'coronet']
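
The softmax and top-k step can be reproduced by hand on a tiny example. This is a plain-Python sketch of what `torch.softmax` and `torch.topk` compute, using a made-up five-word vocabulary and made-up logits (all `toy_` names and values are assumptions).

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    """Return the k (index, probability) pairs with the highest probability."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

toy_idx_to_word = {0: "the", 1: "man", 2: "most", 3: "coronet", 4: "door"}
toy_logits = [0.1, 2.5, 1.9, 1.2, -0.3]  # made-up scores, one per word

toy_probs = softmax(toy_logits)
toy_suggestions = [toy_idx_to_word[i] for i, _ in top_k(toy_probs, 3)]
print(toy_suggestions)
```

Because softmax preserves the ordering of the logits, ranking the raw logits would return the same three words. The probabilities matter only if you also want to show a confidence for each suggestion.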


## Summary

### Q&A

**How was the text data processed for the LSTM model?**  
The Sherlock Holmes text was converted to lowercase and tokenized with NLTK into 125,772 tokens. A vocabulary of 8,360 unique tokens (words and punctuation marks) was created, mapping each token to a numerical index. The data was then organized into sliding windows of 10 tokens each, with the following token as the target, to create 125,762 input-output training samples.

**What is the architecture of the predictive model?**  
The model is a PyTorch-based LSTM consisting of an Embedding layer (100 dimensions), an LSTM layer (256 hidden dimensions), and a Linear layer that maps the output back to the vocabulary size (8,360) for word prediction.
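
These dimensions also determine the model's size. The arithmetic below is a sketch that assumes PyTorch's default parameter layout: one weight matrix for `nn.Embedding`, four gates with input weights, recurrent weights, and two bias vectors for a single-layer `nn.LSTM`, and a weight matrix plus bias for `nn.Linear`.

```python
vocab_size, embedding_dim, hidden_dim = 8360, 100, 256

# Embedding: one vector of size embedding_dim per vocabulary entry
embedding_params = vocab_size * embedding_dim

# LSTM: four gates, each with input weights, recurrent weights, and two biases
lstm_params = 4 * (embedding_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)

# Linear: weight matrix plus one bias per output class
linear_params = hidden_dim * vocab_size + vocab_size

total_params = embedding_params + lstm_params + linear_params
print(f"Embedding: {embedding_params:,}")
print(f"LSTM:      {lstm_params:,}")
print(f"Linear:    {linear_params:,}")
print(f"Total:     {total_params:,}")
```

Most of the parameters sit in the output layer, since it has to score every vocabulary entry.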

**How does the predictive keyboard functionality work?**  
A custom function takes an input string, tokenizes and pads/truncates it to a length of 10, and passes it through the trained model. The model generates logits, which are converted to probabilities via Softmax, and the `torch.topk` function extracts the 3 most likely next words.
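
The pad-or-truncate step is the part that lets inputs of any length fit the fixed window. Here it is isolated as a small sketch. The `fit_to_length` name and the toy ids are assumptions, but the logic mirrors the prediction function.

```python
def fit_to_length(indices, seq_length, pad_index=0):
    """Left-pad short inputs with pad_index; for long inputs keep only
    the last seq_length ids, since the most recent words matter most."""
    if len(indices) < seq_length:
        return [pad_index] * (seq_length - len(indices)) + indices
    return indices[-seq_length:]

short_input = fit_to_length([7, 8, 9], seq_length=5)
long_input = fit_to_length(list(range(1, 9)), seq_length=5)
print(short_input)
print(long_input)
```

Padding goes on the left so that the real words stay next to the position being predicted.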

### Data Analysis Key Findings

*   **Dataset Scale**: The corpus yielded **125,772 total tokens** and a vocabulary of **8,360 unique tokens**, which is small enough to train a word-level model quickly.
*   **Training Efficiency**: The average training loss dropped from **5.60 to 3.59** over 5 epochs. No held-out data was evaluated, so this measures fit to the training text rather than generalization.
*   **Predictive Performance**: For the sample input *"the adventure of the"*, the model suggested plausible continuations: **['man', 'most', 'coronet']**, reflecting themes found in the Sherlock Holmes stories.
*   **Hardware Acceleration**: Utilizing GPU (`cuda`) allowed for efficient processing of the 125,762 samples using a batch size of 128.
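
The `983` steps per epoch shown in the training log follow directly from these numbers. A quick check, assuming the `DataLoader` default of keeping the final partial batch:

```python
import math

num_samples = 125772 - 10   # tokens minus the window length
batch_size = 128

# The last, smaller batch is kept, so round up
steps_per_epoch = math.ceil(num_samples / batch_size)
last_batch_size = num_samples - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch)
print(last_batch_size)
```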

### Insights or Next Steps

*   **Model Refinement**: To improve the coherence of predictions, increasing the number of training epochs or adding a second LSTM layer could help capture deeper semantic relationships in the Victorian-era prose.
*   **Handling Out-of-Vocabulary (OOV) Words**: The current implementation maps unknown words to index 0, which is also the padding value and, because the vocabulary is sorted by frequency, the index of the most frequent token. Reserving dedicated `<PAD>` and `<UNK>` tokens, or using character-level embeddings, could improve robustness for modern user inputs not found in the original text.
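
One possible shape for the OOV idea is sketched below. The `<PAD>` and `<UNK>` tokens, the `build_vocab` and `encode` helpers, and the `min_count` parameter are all hypothetical additions, not part of the code above.

```python
from collections import Counter

PAD, UNK = "<PAD>", "<UNK>"  # hypothetical special tokens

def build_vocab(tokens, min_count=1):
    """Reserve index 0 for padding and index 1 for unknown words, then
    add real tokens in descending frequency order."""
    counts = Counter(tokens)
    words = [w for w, c in counts.most_common() if c >= min_count]
    return {word: i for i, word in enumerate([PAD, UNK] + words)}

def encode(tokens, word_to_idx):
    """Map tokens to indices, sending anything unseen to the UNK index."""
    unk_index = word_to_idx[UNK]
    return [word_to_idx.get(t, unk_index) for t in tokens]

toy_map = build_vocab(["holmes", "said", "watson", "said", "holmes", "said"])
toy_encoded_input = encode(["holmes", "texted", "watson"], toy_map)
print(toy_map)
print(toy_encoded_input)
```

For the model to learn a useful `<UNK>` embedding, some training tokens would need to map to it as well, for example by raising `min_count` so that rare words are treated as unknown during training.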
