Implementation of a Recurrent Neural Network (RNN) for next-word prediction. The code includes:

A complete NextWordPredictor class that handles:

1- Text preprocessing and tokenization
2- Data preparation and sequence creation
3- Building an RNN model with embedding, SimpleRNN, and dense layers
4- Training functionality
5- Prediction capabilities for the next word given an input sentence
6- Model saving and loading


Example usage showing how to:

Train the model on sample texts
Make predictions on test sentences

The implementation uses TensorFlow and Keras, which are popular frameworks for building neural networks. This code demonstrates how RNNs can process sequences of words and predict what might come next based on the patterns learned during training.

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import string

class NextWordPredictor:
    def __init__(self, vocab_size=5000, embedding_dim=100, rnn_units=150, max_sequence_length=20):
        """
        Initialize the Next Word Predictor with RNN
        
        Args:
            vocab_size: Maximum number of words in vocabulary
            embedding_dim: Dimension of word embeddings
            rnn_units: Number of units in the RNN layer
            max_sequence_length: Maximum length of input sequences
        """
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.rnn_units = rnn_units
        self.max_sequence_length = max_sequence_length
        self.tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
        self.model = None
        
    def preprocess_text(self, text):
        """Preprocess text by removing punctuation and making lowercase"""
        translator = str.maketrans('', '', string.punctuation)
        text = text.translate(translator).lower()
        return text
    
    def create_sequences(self, text):
        """Create input sequences and output words for training"""
        words = text.split()
        sequences = []
        for i in range(1, len(words)):
            seq_words = words[max(0, i-self.max_sequence_length):i]
            sequences.append([' '.join(seq_words), words[i]])
        return sequences
    
    def prepare_data(self, texts):
        """Prepare training data from texts"""
        # Preprocess all texts
        processed_texts = [self.preprocess_text(text) for text in texts]
        
        # Create sequences
        all_sequences = []
        for text in processed_texts:
            all_sequences.extend(self.create_sequences(text))
        
        # Split into inputs and targets
        input_sequences = [seq[0] for seq in all_sequences]
        target_words = [seq[1] for seq in all_sequences]
        
        # Fit tokenizer on all input sequences and target words
        all_text = ' '.join(input_sequences + target_words)
        self.tokenizer.fit_on_texts([all_text])
        
        # Convert input sequences to token sequences
        X = self.tokenizer.texts_to_sequences(input_sequences)
        X_padded = pad_sequences(X, maxlen=self.max_sequence_length, padding='pre')
        
        # Convert target words to one-hot encoded vectors
        y = self.tokenizer.texts_to_sequences(target_words)
        y = np.array([seq[0] if seq else 0 for seq in y])
        y = tf.keras.utils.to_categorical(y, num_classes=self.vocab_size)
        
        return X_padded, y
    
    def build_model(self):
        """Build the RNN model for next word prediction"""
        self.model = Sequential([
            Embedding(self.vocab_size, self.embedding_dim, input_length=self.max_sequence_length),
            SimpleRNN(self.rnn_units, return_sequences=False),
            Dense(self.vocab_size, activation='softmax')
        ])
        
        self.model.compile(
            loss='categorical_crossentropy',
            optimizer='adam',
            metrics=['accuracy']
        )
        
        return self.model
    
    def train(self, texts, epochs=10, batch_size=64, validation_split=0.2):
        """Train the model on the provided texts"""
        X, y = self.prepare_data(texts)
        
        if self.model is None:
            self.build_model()
        
        history = self.model.fit(
            X, y,
            epochs=epochs,
            batch_size=batch_size,
            validation_split=validation_split,
            verbose=1
        )
        
        return history
    
    def predict_next_word(self, input_text, num_predictions=3):
        """Predict the next word for the given input text"""
        if self.model is None:
            raise ValueError("Model has not been trained yet")
        
        # Preprocess input text
        processed_text = self.preprocess_text(input_text)
        
        # Convert to sequence
        token_sequence = self.tokenizer.texts_to_sequences([processed_text])[0]
        
        # Pad sequence
        padded_sequence = pad_sequences([token_sequence], maxlen=self.max_sequence_length, padding='pre')
        
        # Predict
        predictions = self.model.predict(padded_sequence)[0]
        top_indices = predictions.argsort()[-num_predictions:][::-1]
        
        # Convert indices to words
        index_to_word = {idx: word for word, idx in self.tokenizer.word_index.items()}
        predicted_words = [index_to_word.get(idx, "<Unknown>") for idx in top_indices if idx != 0]
        
        return predicted_words
    
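    def save_model(self, filepath):
        """Save the Keras model to a file.

        Note: the fitted tokenizer is not saved here, so it must be
        persisted separately for predictions to work after reloading.
        """
        if self.model is None:
            raise ValueError("No model to save")
        self.model.save(filepath)
        
    def load_model(self, filepath):
        """Load the Keras model from a file (the tokenizer is not restored)"""
        self.model = tf.keras.models.load_model(filepath)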


# Example usage
if __name__ == "__main__":
    # Sample text for training
    sample_texts = [
        "Machine learning is a subset of artificial intelligence that provides systems the ability to automatically learn and improve from experience",
        "Deep learning is a subset of machine learning that uses neural networks with many layers",
        "Recurrent neural networks are designed to recognize patterns in sequences of data such as text time series or speech",
        "Natural language processing is a field of artificial intelligence that focuses on the interaction between computers and humans through natural language"
    ]
    
    # Initialize the predictor
    predictor = NextWordPredictor(vocab_size=1000, embedding_dim=64, rnn_units=100, max_sequence_length=10)
    
    # Train the model
    print("Training model...")
    predictor.train(sample_texts, epochs=50, batch_size=4)
    
    # Make predictions
    test_sentences = [
        "machine learning is",
        "deep learning uses neural",
        "recurrent neural networks can process",
        "natural language processing helps computers"
    ]
    
    print("\nPrediction examples:")
    for sentence in test_sentences:
        next_words = predictor.predict_next_word(sentence)
        print(f"Input: '{sentence}'")
        print(f"Predicted next words: {next_words}")
        print()

Training model...
Epoch 1/50
14/14 ━━━━━━━━━━━━━━━━━━━━ 2s 25ms/step - accuracy: 0.0145 - loss: 6.9084 - val_accuracy: 0.0000e+00 - val_loss: 6.8684
Epoch 2/50
14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 0.1074 - loss: 6.4985 - val_accuracy: 0.0000e+00 - val_loss: 6.2290
Epoch 3/50
14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.0319 - loss: 4.9801 - val_accuracy: 0.0000e+00 - val_loss: 6.1732
Epoch 4/50
14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.0742 - loss: 3.9750 - val_accuracy: 0.0000e+00 - val_loss: 6.7417
Epoch 5/50
14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.0623 - loss: 3.8831 - val_accuracy: 0.0000e+00 - val_loss: 7.0937
Epoch 6/50
14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.0304 - loss: 3.6798 - val_accuracy: 0.0000e+00 - val_loss: 7.2832
Epoch 7/50
14/14

# RNN Next Word Prediction: Simple Explanation

## What This Code Does

Imagine you're playing a word guessing game where your friend says part of a sentence and you have to guess what word comes next. For example, if your friend says "I want to eat a..." you might guess "sandwich" or "pizza". Our computer program does the same thing - it tries to predict what word should come next in a sentence.

## How It Works

### The Big Picture
1. We feed a bunch of sentences to the computer.
2. The computer learns patterns about which words usually follow other words.
3. When you give it a new sentence, it can guess what word might come next.

### Key Parts of the Code Explained

#### 1. NextWordPredictor Class
This is like a robot that we build and train to guess the next word. Just like you need to learn vocabulary before writing essays, our robot needs to know about words before making predictions.

```python
class NextWordPredictor:
    def __init__(self, vocab_size=5000, embedding_dim=100, rnn_units=150, max_sequence_length=20):
```

- `vocab_size`: The maximum number of words our robot can learn (its vocabulary)
- `embedding_dim`: The length of the list of numbers (a vector) the robot uses to represent each word
- `rnn_units`: The size of the robot's memory (more units can capture more patterns, but train more slowly and overfit small datasets more easily)
- `max_sequence_length`: How many previous words it looks at to make a prediction
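
To get a feel for how these settings drive model size, here is a rough sketch that counts the trainable weights in the Embedding, SimpleRNN, and Dense stack this code builds. The helper name `count_parameters` is hypothetical, introduced only for illustration, and the formulas assume the SimpleRNN and Dense layers keep their bias terms (the Keras default).

```python
def count_parameters(vocab_size, embedding_dim, rnn_units):
    """Count trainable weights, assuming SimpleRNN and Dense both use a bias."""
    embedding = vocab_size * embedding_dim               # one vector per word ID
    rnn = rnn_units * (embedding_dim + rnn_units + 1)    # input, recurrent, and bias weights
    dense = rnn_units * vocab_size + vocab_size          # output weights plus bias
    return {"embedding": embedding, "rnn": rnn, "dense": dense,
            "total": embedding + rnn + dense}

# Settings from the example usage: vocab_size=1000, embedding_dim=64, rnn_units=100
print(count_parameters(1000, 64, 100))
# → {'embedding': 64000, 'rnn': 16500, 'dense': 101000, 'total': 181500}
```

Most of the weights sit in the embedding table and the output layer, so `vocab_size` tends to matter more for model size than `rnn_units` does.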

#### 2. Data Preparation
Before training, we need to prepare our sentences in a way the computer can understand:

```python
def preprocess_text(self, text):
    """Remove punctuation and make lowercase"""
    translator = str.maketrans('', '', string.punctuation)
    text = text.translate(translator).lower()
    return text
```

This is like cleaning up messy handwriting so it's easier to read. We remove punctuation (like periods and commas) and make all letters lowercase.
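
As a quick check of what this cleaning step does, the same logic can be run as a standalone function (a sketch that drops `self` so it works outside the class):

```python
import string

def preprocess_text(text):
    """Strip ASCII punctuation, then lowercase (same logic as the method above)."""
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator).lower()

print(preprocess_text("Hello, World! Isn't this fun?"))
# → hello world isnt this fun
```

Two side effects are worth knowing: apostrophes are removed, so "Isn't" becomes "isnt", and only the ASCII characters in `string.punctuation` are stripped, so curly quotes and other non-ASCII punctuation pass through unchanged.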

```python
def create_sequences(self, text):
    """Create input sequences and output words for training"""
```

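This splits our sentences into pairs of "what we show the robot" and "what we want it to guess". The input grows one word at a time and the text has already been lowercased. For example, from "i love to play basketball with my friends", we create:
- Input: "i" → Output: "love"
- Input: "i love" → Output: "to"
- Input: "i love to" → Output: "play"
- And so on, until the input reaches `max_sequence_length` words, after which the oldest words drop off the front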
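
The pairing logic can be tried on its own with a small sketch. It mirrors the method's loop but takes `max_sequence_length` as an argument, and the short window of 3 is chosen only so the sliding behaviour is visible:

```python
def create_sequences(text, max_sequence_length):
    """Return (context, next_word) pairs using a growing, then sliding, window."""
    words = text.split()
    pairs = []
    for i in range(1, len(words)):
        # Keep at most max_sequence_length words before position i
        context = words[max(0, i - max_sequence_length):i]
        pairs.append((' '.join(context), words[i]))
    return pairs

for context, target in create_sequences("i love to play basketball", max_sequence_length=3):
    print(f"{context!r} -> {target!r}")
# → 'i' -> 'love'
# → 'i love' -> 'to'
# → 'i love to' -> 'play'
# → 'love to play' -> 'basketball'
```

A sentence with N words yields N - 1 training pairs, and the window only starts dropping the earliest words once it is full.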

#### 3. The Brain (RNN Model)
This is where the magic happens! We build a brain for our robot:

```python
def build_model(self):
    """Build the RNN model for next word prediction"""
    self.model = Sequential([
        Embedding(self.vocab_size, self.embedding_dim, input_length=self.max_sequence_length),
        SimpleRNN(self.rnn_units, return_sequences=False),
        Dense(self.vocab_size, activation='softmax')
    ])
```

The brain has three main parts:
- `Embedding`: Converts each word's ID number into a list of numbers (a vector) that is learned during training
- `SimpleRNN`: The memory part that remembers patterns in word sequences
- `Dense`: Makes the final decision about which word comes next
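
To make the three parts concrete, here is a minimal NumPy sketch of one forward pass through such a stack. It is not the Keras implementation: the sizes are toy values, the weights are random stand-ins for trained ones, and all variable names are made up for illustration. It assumes a `tanh` recurrence, which is the default activation of Keras `SimpleRNN`.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim, rnn_units = 6, 4, 3   # toy sizes for illustration

# Random weights stand in for values a real model would learn.
E = rng.normal(size=(vocab_size, embedding_dim))    # Embedding table: one row per word ID
W_x = rng.normal(size=(embedding_dim, rnn_units))   # input -> hidden weights
W_h = rng.normal(size=(rnn_units, rnn_units))       # hidden -> hidden (the "memory" loop)
b_h = np.zeros(rnn_units)
W_o = rng.normal(size=(rnn_units, vocab_size))      # Dense layer weights
b_o = np.zeros(vocab_size)

def forward(token_ids):
    """Embedding lookup, simple recurrence over time steps, then softmax."""
    h = np.zeros(rnn_units)                      # hidden state starts empty
    for t in token_ids:
        x = E[t]                                 # Embedding: ID -> vector
        h = np.tanh(x @ W_x + h @ W_h + b_h)     # RNN step: mix new word with memory
    logits = h @ W_o + b_o                       # Dense: one score per vocabulary word
    exp = np.exp(logits - logits.max())          # numerically stable softmax
    return exp / exp.sum()

probs = forward([0, 0, 2, 5, 3])                 # a pre-padded sequence of word IDs
print(probs.round(3), probs.sum())
```

Only the final hidden state feeds the output layer, which corresponds to `return_sequences=False` in the model above.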

#### 4. Training
Just like you learn by studying examples, our robot learns by studying lots of sentences:

```python
def train(self, texts, epochs=10, batch_size=64, validation_split=0.2):
```

- `texts`: All the sentences we use for training
- `epochs`: How many times the robot studies all the sentences
- `batch_size`: How many examples it looks at before updating what it's learned
- `validation_split`: Part of the data used to check how well it's learning
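
The arithmetic behind these settings can be sketched as follows. The helper `split_and_steps` is a hypothetical name, and it assumes Keras floors the split point and holds out the trailing fraction of the samples for validation. The word counts are those of the four sample sentences in the example usage.

```python
import math

def split_and_steps(num_samples, validation_split, batch_size):
    """Return (train_samples, val_samples, steps_per_epoch)."""
    train = int(math.floor(num_samples * (1.0 - validation_split)))
    val = num_samples - train
    steps = math.ceil(train / batch_size)   # the last batch may be smaller
    return train, val, steps

word_counts = [20, 15, 19, 21]               # words in each sample sentence
num_pairs = sum(n - 1 for n in word_counts)  # each sentence gives (words - 1) pairs
print(num_pairs, split_and_steps(num_pairs, validation_split=0.2, batch_size=4))
# → 71 (56, 15, 14)
```

The 14 steps agree with the `14/14` shown in the training log. With only 15 held-out pairs, the validation metrics on this toy dataset are very noisy.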

#### 5. Making Predictions
After training, the robot can predict the next word in new sentences:

```python
def predict_next_word(self, input_text, num_predictions=3):
```

This function:
1. Takes your sentence
2. Cleans it up (removes punctuation, etc.)
3. Converts it to numbers the robot understands
4. Uses the trained brain to predict the most likely next words
5. Returns the top few guesses
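
Steps 3 to 5 can be illustrated without a trained model. In the sketch below the vocabulary, the probabilities, and the helper names `encode`, `pad_pre`, and `top_k` are all made up for illustration. It assumes the Keras conventions that ID 0 is reserved for padding and ID 1 for the out-of-vocabulary token.

```python
def encode(text, word_index, oov_id=1):
    """Map each word to its ID, falling back to the OOV ID for unknown words."""
    return [word_index.get(word, oov_id) for word in text.split()]

def pad_pre(ids, maxlen):
    """Keep the last maxlen IDs and left-pad with zeros, like padding='pre'."""
    ids = ids[-maxlen:]
    return [0] * (maxlen - len(ids)) + ids

def top_k(probs, index_to_word, k=3):
    """Return the k most probable words, skipping the padding ID 0."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [index_to_word[i] for i in ranked if i != 0]

# Hypothetical tiny vocabulary
word_index = {"<OOV>": 1, "machine": 2, "learning": 3, "is": 4, "a": 5}
index_to_word = {i: w for w, i in word_index.items()}

ids = pad_pre(encode("machine learning is", word_index), maxlen=5)
print(ids)  # → [0, 0, 2, 3, 4]

# Made-up probabilities standing in for model.predict output, one per word ID
fake_probs = [0.02, 0.03, 0.10, 0.15, 0.20, 0.50]
print(top_k(fake_probs, index_to_word))  # → ['a', 'is', 'learning']
```

Because the padding ID is filtered out after ranking, the function can return fewer than `k` words, which is also true of `predict_next_word`.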

## Example in Action

```python
# Initialize the predictor
predictor = NextWordPredictor(vocab_size=1000, embedding_dim=64, rnn_units=100, max_sequence_length=10)

# Train the model
predictor.train(sample_texts, epochs=50, batch_size=4)

# Make predictions
test_sentences = ["machine learning is"]
next_words = predictor.predict_next_word(test_sentences[0])
print(f"Input: '{test_sentences[0]}'")
print(f"Predicted next words: {next_words}")
```

If we train it on science t...

The difference between this simple version and the advanced ones is mostly just how big and complex the "brain" part is!

In [None]:
class NextWordPredictor:
    def __init__(self, vocab_size=5000, embedding_dim=100, rnn_units=150, max_sequence_length=20):