# Sequence Models: RNN, LSTM, BiLSTM, and Attention

In this notebook, we will cover **sequence modeling**, which is crucial for tasks involving sequential data such as **text, time series, and speech**.

We will focus on:
- Recurrent Neural Networks (RNNs)
- Long Short-Term Memory networks (LSTMs)
- Bidirectional LSTMs (BiLSTM)
- Attention mechanisms for improved performance

## Objectives

By the end of this notebook, you will be able to:

1. Understand the architecture and inner workings of RNNs and LSTMs.
2. Build and train a vanilla RNN for sequence classification.
3. Implement LSTM and BiLSTM models for text classification.
4. Incorporate an attention mechanism to improve model performance.
5. Preprocess text sequences for neural networks using tokenization and padding.

## Required Libraries

We will use **TensorFlow/Keras** to build the sequence models and **scikit-learn** to encode the labels. **NLTK** is imported for optional text preprocessing such as stopword removal.


In [7]:
#!pip install tensorflow nltk scikit-learn

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, LSTM, Bidirectional, Dense, Dropout, Input, Attention
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import nltk
nltk.download('punkt')
nltk.download('stopwords')



## Dataset Preparation

For this notebook, we will use a tiny hand-written **text classification dataset**: five movie-review sentences labeled positive or negative, in the style of IMDB sentiment data.

Steps:
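1. Preprocess text: lowercase, strip punctuation, and split into words (handled by the Keras `Tokenizer`; stopword removal is optional and not applied in the cell below).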
2. Encode labels.
3. Tokenize text and pad sequences for input to neural networks.
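
Step 1 mentions stopword removal, which the Keras `Tokenizer` does not perform on its own. A minimal sketch of doing it explicitly is shown here, assuming a tiny hand-picked stopword set. `STOPWORDS` and `preprocess` are illustrative names, not part of the notebook; NLTK's `stopwords.words('english')` could supply a fuller list.

```python
import re

# Assumption: a tiny hand-picked stopword set, for illustration only.
STOPWORDS = {"i", "this", "it", "was", "and", "a", "the", "with", "too", "will", "what"}

def preprocess(text):
    """Lowercase, split into word tokens, and drop stopwords."""
    # Keep runs of letters/apostrophes; this also discards punctuation.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [tok for tok in tokens if tok not in STOPWORDS]

print(preprocess("I love this movie, it was fantastic and thrilling!"))
# ['love', 'movie', 'fantastic', 'thrilling']
```

For sentiment tasks it may be worth keeping negations such as "never" or "not" out of the stopword set, since they can flip the label.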


In [2]:
# Sample dataset
texts = [
    "I love this movie, it was fantastic and thrilling!",
    "What a terrible movie, I will never watch it again.",
    "Absolutely loved the storyline and the acting.",
    "I hated this film, it was boring and too long.",
    "A masterpiece with stunning visuals and great music."
]

labels = ["positive", "negative", "positive", "negative", "positive"]

# Encode labels
le = LabelEncoder()
y = le.fit_transform(labels)

# Tokenize text
tokenizer = Tokenizer(num_words=1000, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)
X = tokenizer.texts_to_sequences(texts)

# Pad sequences
max_len = max(len(seq) for seq in X)
X = pad_sequences(X, maxlen=max_len, padding='post')

print("Padded Sequences:\n", X)
print("Labels:", y)


Padded Sequences:
 [[ 3 10  5  6  4  7 11  2 12  0]
 [13  8 14  6  3 15 16 17  4 18]
 [19 20  9 21  2  9 22  0  0  0]
 [ 3 23  5 24  4  7 25  2 26 27]
 [ 8 28 29 30 31  2 32 33  0  0]]
Labels: [1 0 1 0 1]
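
To make the integer IDs in the output above less opaque, the following pure-Python sketch mimics the ranking on this data: index 0 is reserved for padding, index 1 for the OOV token, and the remaining words are numbered by descending frequency with ties kept in first-seen order. It is an approximation for illustration, not the Keras implementation; `tokenize` and `word_index` are names chosen here.

```python
import re
from collections import Counter

texts = [
    "I love this movie, it was fantastic and thrilling!",
    "What a terrible movie, I will never watch it again.",
    "Absolutely loved the storyline and the acting.",
    "I hated this film, it was boring and too long.",
    "A masterpiece with stunning visuals and great music.",
]

def tokenize(text):
    # Lowercase and keep runs of letters/apostrophes, which drops punctuation.
    return re.findall(r"[a-z']+", text.lower())

tokenized = [tokenize(t) for t in texts]

# Count words; Counter remembers first-seen order, and sorted() is stable,
# so equally frequent words keep the order in which they first appeared.
counts = Counter(tok for sent in tokenized for tok in sent)
ordered = sorted(counts.items(), key=lambda kv: -kv[1])

word_index = {"<OOV>": 1}  # 0 is left free for padding
for idx, (word, _) in enumerate(ordered, start=2):
    word_index[word] = idx

# Encode each sentence, then pad on the right ("post") with zeros.
sequences = [[word_index.get(tok, 1) for tok in sent] for sent in tokenized]
max_len = max(len(seq) for seq in sequences)
padded = [seq + [0] * (max_len - len(seq)) for seq in sequences]

print(padded[0])  # [3, 10, 5, 6, 4, 7, 11, 2, 12, 0]
```

The most frequent word, "and", receives index 2, and the trailing zeros mark padding positions rather than real tokens.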


## Vanilla RNN

Recurrent Neural Networks (RNNs) are designed for sequential data.
- They maintain a hidden state that is updated at every time step and carries information forward from earlier steps.
- Limitation: gradients tend to vanish over long sequences, so a vanilla RNN struggles to learn long-range dependencies.
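
The hidden-state update can be sketched in a few lines of NumPy. This is a simplified illustration of the recurrence `h_t = tanh(x_t W_x + h_{t-1} W_h + b)` with randomly initialized, hypothetical parameters (`W_x`, `W_h`, `b`). The sizes are chosen to match the Keras model in the next cell; it is not how Keras implements the layer internally.

```python
import numpy as np

rng = np.random.default_rng(0)
timesteps, input_dim, units = 10, 16, 32

# Hypothetical parameters; a trained model would learn these.
W_x = rng.normal(scale=0.1, size=(input_dim, units))  # input -> hidden
W_h = rng.normal(scale=0.1, size=(units, units))      # hidden -> hidden
b = np.zeros(units)

x = rng.normal(size=(timesteps, input_dim))  # one sequence of embedded tokens
h = np.zeros(units)                          # initial hidden state

# The same weights are reused at every time step; h carries the history.
for x_t in x:
    h = np.tanh(x_t @ W_x + h @ W_h + b)

print(h.shape)  # (32,)
```

Only the final `h` is passed on to the classifier, which is also what `SimpleRNN(32)` returns by default.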


In [3]:
model_rnn = Sequential([
    # `input_length` is deprecated in recent Keras; declare the shape with Input instead
    Input(shape=(max_len,)),
    Embedding(input_dim=1000, output_dim=16),
    SimpleRNN(32, activation='tanh'),
    Dense(1, activation='sigmoid')
])

model_rnn.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model_rnn.summary()

# Train model
history_rnn = model_rnn.fit(X, y, epochs=10, verbose=1)




Epoch 1/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 4s 4s/step - accuracy: 0.6000 - loss: 0.6829
Epoch 2/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 92ms/step - accuracy: 0.6000 - loss: 0.6633
Epoch 3/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 86ms/step - accuracy: 0.8000 - loss: 0.6440
Epoch 4/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 171ms/step - accuracy: 1.0000 - loss: 0.6248
Epoch 5/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 128ms/step - accuracy: 1.0000 - loss: 0.6056
Epoch 6/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 116ms/step - accuracy: 1.0000 - loss: 0.5862
Epoch 7/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 118ms/step - accuracy: 1.0000 - loss: 0.5666
Epoch 8/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 139ms/step - accuracy: 1.0000 - loss: 0.5466
Epoch 9/10
1/1 ━━━━━━━━━━━━━━━━━━━━

## Long Short-Term Memory (LSTM)

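LSTM networks mitigate the vanishing gradient problem by adding a **cell state** that is updated additively and controlled by **gates**:
- Input gate: decides which new values to write to the cell state
- Forget gate: decides which existing values to discard from the cell state
- Output gate: decides which part of the cell state to expose as the hidden state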

LSTMs are effective for longer sequences like text, time series, and speech.
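
A single LSTM time step can be written out in NumPy to show where each gate acts. This is a minimal sketch of the standard LSTM equations with hypothetical, randomly initialized weights (`W`, `U`, `b`) and a helper named `lstm_step` introduced here. The gate ordering inside the stacked weight matrix is a choice made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (input_dim, 4*units), U: (units, 4*units), b: (4*units,)."""
    units = h_prev.shape[0]
    z = x_t @ W + h_prev @ U + b          # pre-activations for all four blocks
    i = sigmoid(z[0 * units:1 * units])   # input gate: how much new content to write
    f = sigmoid(z[1 * units:2 * units])   # forget gate: how much old content to keep
    g = np.tanh(z[2 * units:3 * units])   # candidate values to write
    o = sigmoid(z[3 * units:4 * units])   # output gate: how much of the cell to expose
    c = f * c_prev + i * g                # additive cell-state update
    h = o * np.tanh(c)                    # new hidden state
    return h, c

rng = np.random.default_rng(0)
timesteps, input_dim, units = 10, 16, 32
W = rng.normal(scale=0.1, size=(input_dim, 4 * units))
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(timesteps, input_dim)):
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape, c.shape)  # (32,) (32,)
```

Because the cell state is updated by `f * c_prev + i * g` rather than being squashed through a nonlinearity at every step, gradients can flow across many time steps when the forget gate stays close to 1.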


In [4]:
model_lstm = Sequential([
    # `input_length` is deprecated in recent Keras; declare the shape with Input instead
    Input(shape=(max_len,)),
    Embedding(input_dim=1000, output_dim=16),
    LSTM(32, return_sequences=False),
    Dense(1, activation='sigmoid')
])

model_lstm.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model_lstm.summary()

# Train model
history_lstm = model_lstm.fit(X, y, epochs=10, verbose=1)


Epoch 1/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 4s 4s/step - accuracy: 0.6000 - loss: 0.6922
Epoch 2/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 111ms/step - accuracy: 0.6000 - loss: 0.6892
Epoch 3/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 105ms/step - accuracy: 0.6000 - loss: 0.6861
Epoch 4/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 84ms/step - accuracy: 0.6000 - loss: 0.6829
Epoch 5/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 93ms/step - accuracy: 0.6000 - loss: 0.6797
Epoch 6/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 107ms/step - accuracy: 0.6000 - loss: 0.6763
Epoch 7/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 112ms/step - accuracy: 0.6000 - loss: 0.6727
Epoch 8/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 111ms/step - accuracy: 0.6000 - loss: 0.6690
Epoch 9/10
1/1 ━━━━━━━━━━━━━━━━━━━━

## Bidirectional LSTM (BiLSTM)

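- A BiLSTM reads the sequence in **both forward and backward directions**, so the representation of each word can draw on both the words before it and the words after it.
- This often improves performance on NLP tasks such as sentiment analysis, named entity recognition, and translation, provided the full input is available before prediction.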
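
The bidirectional idea can be sketched independently of the cell type: run one recurrence over the sequence, run a second one (with its own weights) over the reversed sequence, and concatenate the two final states. For brevity, the example below uses a simple tanh recurrence as a stand-in for the LSTM cell. `run_rnn`, `fwd`, and `bwd` are illustrative names, and the weights are random.

```python
import numpy as np

def run_rnn(x, W_x, W_h):
    """Final hidden state of a tanh recurrence over x with shape (timesteps, input_dim)."""
    h = np.zeros(W_h.shape[0])
    for x_t in x:
        h = np.tanh(x_t @ W_x + h @ W_h)
    return h

rng = np.random.default_rng(0)
timesteps, input_dim, units = 10, 16, 32
x = rng.normal(size=(timesteps, input_dim))

# Each direction has its own, independently initialized parameters.
fwd = (rng.normal(scale=0.1, size=(input_dim, units)),
       rng.normal(scale=0.1, size=(units, units)))
bwd = (rng.normal(scale=0.1, size=(input_dim, units)),
       rng.normal(scale=0.1, size=(units, units)))

h_forward = run_rnn(x, *fwd)         # reads first token -> last token
h_backward = run_rnn(x[::-1], *bwd)  # reads last token -> first token

# Concatenating doubles the feature size: 32 + 32 = 64.
h_bi = np.concatenate([h_forward, h_backward])
print(h_bi.shape)  # (64,)
```

This matches the default behavior of the Keras `Bidirectional` wrapper, which concatenates the two directions, so `Bidirectional(LSTM(32))` feeds a 64-dimensional vector to the `Dense` layer.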


In [5]:
model_bilstm = Sequential([
    # `input_length` is deprecated in recent Keras; declare the shape with Input instead
    Input(shape=(max_len,)),
    Embedding(input_dim=1000, output_dim=16),
    Bidirectional(LSTM(32)),
    Dense(1, activation='sigmoid')
])

model_bilstm.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model_bilstm.summary()

# Train model
history_bilstm = model_bilstm.fit(X, y, epochs=10, verbose=1)


Epoch 1/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 6s 6s/step - accuracy: 0.4000 - loss: 0.6938
Epoch 2/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 127ms/step - accuracy: 0.6000 - loss: 0.6914
Epoch 3/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 121ms/step - accuracy: 0.6000 - loss: 0.6890
Epoch 4/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 102ms/step - accuracy: 0.6000 - loss: 0.6866
Epoch 5/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 98ms/step - accuracy: 0.6000 - loss: 0.6841
Epoch 6/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 100ms/step - accuracy: 0.6000 - loss: 0.6815
Epoch 7/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 111ms/step - accuracy: 0.6000 - loss: 0.6788
Epoch 8/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 120ms/step - accuracy: 0.6000 - loss: 0.6760
Epoch 9/10
1/1 ━━━━━━━━━━━━━━━━━━━━

## Attention Mechanism

- Attention allows the model to **focus on important parts of the input sequence** when making predictions.
- Useful for longer sequences or when some words carry more importance than others.
- Can be placed on top of an LSTM or BiLSTM that returns its full output sequence, which often improves accuracy.
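
The core computation is a learned weighted average over time steps. The NumPy sketch below mirrors the scoring scheme used by the custom layer in the next cell (a tanh score per time step, a softmax over time, then a weighted sum). The matrix `H` is a random stand-in for per-timestep LSTM outputs, and `w` and `b` are hypothetical parameters.

```python
import numpy as np

def softmax(z, axis):
    # Subtract the max for numerical stability before exponentiating.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
timesteps, units = 10, 32
H = rng.normal(size=(timesteps, units))  # stand-in for LSTM outputs, one row per time step

w = rng.normal(scale=0.1, size=(units, 1))  # maps each time step's vector to a scalar score
b = np.zeros((timesteps, 1))                # one bias per time step

scores = np.tanh(H @ w + b)          # (timesteps, 1): unnormalized importance
weights = softmax(scores, axis=0)    # (timesteps, 1): positive, sums to 1 over time
context = (H * weights).sum(axis=0)  # (units,): attention-weighted summary

print(weights.ravel().round(3))
print(context.shape)  # (32,)
```

After training, inspecting `weights` for a given sentence indicates which positions contributed most to the prediction. Note that this simple scheme also assigns weight to padded positions unless they are masked out.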


In [6]:
from tensorflow.keras.layers import Layer

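# Simple attention layer: pools LSTM outputs over the time axis
class AttentionLayer(Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def build(self, input_shape):
        # input_shape is (batch, timesteps, features)
        # W maps each timestep's feature vector to a single score
        self.W = self.add_weight(name="att_weight", shape=(input_shape[-1], 1),
                                 initializer="random_normal", trainable=True)
        # One bias per timestep, so the sequence length must be fixed
        self.b = self.add_weight(name="att_bias", shape=(input_shape[1], 1),
                                 initializer="zeros", trainable=True)
        super().build(input_shape)

    def call(self, x):
        # Scores per timestep: (batch, timesteps, 1)
        e = tf.keras.backend.tanh(tf.keras.backend.dot(x, self.W) + self.b)
        # Normalize over the time axis so the weights sum to 1 per sequence
        a = tf.keras.backend.softmax(e, axis=1)
        # Weighted sum of timestep vectors: (batch, features)
        output = x * a
        return tf.keras.backend.sum(output, axis=1)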

# LSTM + Attention Model
inputs = tf.keras.Input(shape=(max_len,))
embedding = Embedding(input_dim=1000, output_dim=16)(inputs)
lstm_out = LSTM(32, return_sequences=True)(embedding)
attention_out = AttentionLayer()(lstm_out)
output = Dense(1, activation='sigmoid')(attention_out)

model_attention = tf.keras.Model(inputs=inputs, outputs=output)
model_attention.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model_attention.summary()

# Train model
history_attention = model_attention.fit(X, y, epochs=10, verbose=1)





Epoch 1/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 5s 5s/step - accuracy: 0.6000 - loss: 0.6914
Epoch 2/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 93ms/step - accuracy: 0.6000 - loss: 0.6900
Epoch 3/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 101ms/step - accuracy: 0.6000 - loss: 0.6885
Epoch 4/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 149ms/step - accuracy: 0.6000 - loss: 0.6870
Epoch 5/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 112ms/step - accuracy: 0.6000 - loss: 0.6854
Epoch 6/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 156ms/step - accuracy: 0.6000 - loss: 0.6837
Epoch 7/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 98ms/step - accuracy: 0.6000 - loss: 0.6819
Epoch 8/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 117ms/step - accuracy: 0.6000 - loss: 0.6801
Epoch 9/10
1/1 ━━━━━━━━━━━━━━━━━━━━

## Summary

In this notebook, you learned to:
1. Preprocess text sequences for neural networks.
2. Build and train vanilla RNN, LSTM, and BiLSTM models.
3. Understand and implement an attention mechanism to enhance model performance.
4. Apply sequence models to text classification tasks.

Next steps:
- Experiment with **larger datasets** (IMDB, 20 Newsgroups)
- Combine with **word embeddings** (Word2Vec, GloVe, FastText)
- Explore **transformers for sequence tasks**.
