# Natural Language Processing with RNNs and Attention
## Chapter 16 - NLP Implementation Guide

## 1. Introduction to NLP

Natural Language Processing (NLP) enables machines to understand, interpret, and generate human language. Key applications include:
- Machine translation
- Sentiment analysis
- Text generation
- Question answering

**Core Challenges**:
- Variable-length sequences
- Contextual meaning
- Ambiguity and polysemy

## 2. Text Preprocessing

### 2.1 Tokenization
Tokenization splits text into units (tokens) that are then mapped to integer IDs. Common granularities:
- Word-level
- Character-level
- Subword (BPE, WordPiece)

In [None]:
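# Note: this Tokenizer is a legacy utility that TensorFlow marks as deprecated;
# tf.keras.layers.TextVectorization is the recommended replacement for new code.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences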

# Sample text data
texts = ["Natural language processing is fascinating",
         "Deep learning models can understand text"]

# Create and fit tokenizer
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)

# Convert text to sequences
sequences = tokenizer.texts_to_sequences(texts)
padded = pad_sequences(sequences, padding='post')

print("Word index:", tokenizer.word_index)
print("Padded sequences:\n", padded)
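
The cell above tokenizes at the word level. To contrast the three granularities listed at the start of this section, here is a minimal, dependency-free sketch. The helper names `most_frequent_pair` and `merge_pair` are illustrative, not part of any library, and the merge loop is a simplified BPE-style procedure: real BPE implementations train on a large corpus with word frequencies and usually mark word boundaries.

```python
from collections import Counter

text = "low lower lowest"

# Word-level: split on whitespace (real tokenizers also handle punctuation and casing).
word_tokens = text.split()

# Character-level: every character, including spaces, is a token.
char_tokens = list(text)

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged_words = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_words.append(out)
    return merged_words

# Subword (BPE-style): start from characters and apply two merge steps.
subword_tokens = [list(word) for word in word_tokens]
for _ in range(2):
    subword_tokens = merge_pair(subword_tokens, most_frequent_pair(subword_tokens))

print("Word-level:", word_tokens)
print("Character-level:", char_tokens)
print("After two BPE-style merges:", subword_tokens)
```

After two merges the shared stem `low` becomes a single token while the rarer suffixes stay split into characters. This is the trade-off subword methods aim for: frequent fragments get their own entry, and rare words remain representable.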

### 2.2 Embedding Layers
Map tokens to dense vectors capturing semantic meaning:
- Pretrained embeddings (GloVe, Word2Vec)
- Trainable embeddings

In [None]:
from tensorflow.keras.layers import Embedding

# Create embedding layer. The sequence length is inferred from the input,
# so the `input_length` argument (deprecated in Keras 3) is omitted.
embedding_layer = Embedding(
    input_dim=100,  # Vocabulary size
    output_dim=64   # Embedding dimension
)

# Example usage
import numpy as np
sample_input = np.random.randint(0, 100, size=(32, 10))
embedded = embedding_layer(sample_input)
print("Embedded shape:", embedded.shape)
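
Conceptually, an embedding layer is a lookup table with one row per token ID. The sketch below mimics that behaviour in plain Python under simplifying assumptions: the table holds fixed random values rather than trained weights, and `embed` is a hypothetical helper, not a Keras function.

```python
import random

random.seed(0)  # Reproducible toy weights

vocab_size, embedding_dim = 6, 4

# The "layer" is just a table with one row of floats per token ID.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

def embed(token_ids):
    """Look up one row per token ID: (seq_len,) -> (seq_len, embedding_dim)."""
    return [embedding_table[token_id] for token_id in token_ids]

sequence = [2, 5, 2, 0]
embedded_sequence = embed(sequence)

print("Sequence length:", len(embedded_sequence))
print("Vector size:", len(embedded_sequence[0]))
# The same token ID always maps to the same vector.
print("Repeated token shares a vector:", embedded_sequence[0] == embedded_sequence[2])
```

In a trainable embedding the rows are updated by backpropagation like any other weight. With pretrained embeddings the table is filled from vectors such as GloVe or Word2Vec instead.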

## 3. Sequence Models for NLP

### 3.1 RNN-based Models
- Process text sequentially
- Capture temporal dependencies
- Variants: LSTM and GRU, which use gating to mitigate the vanishing gradient problem

In [None]:
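from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

# LSTM model for binary text classification
model = Sequential([
    Input(shape=(100,)),              # Sequences of 100 token IDs
    Embedding(10000, 128),            # 10,000-word vocabulary, 128-dim vectors
    LSTM(64, return_sequences=True),  # Emit every time step for the next LSTM
    LSTM(32),                         # Emit only the final hidden state
    Dense(1, activation='sigmoid')    # Probability of the positive class
])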

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
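
To make the vanishing gradient problem concrete, the following sketch runs a scalar tanh RNN and tracks how strongly the final hidden state depends on the initial one. It is a toy illustration with hand-picked weights (`w_x`, `w_h`, and `b` are arbitrary assumptions), not a model of how Keras implements its recurrent layers.

```python
import math

def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Run a scalar tanh RNN; return hidden states and d h_T / d h_0."""
    h = h0
    hidden_states = []
    grad = 1.0  # Accumulates the chain-rule product across time steps
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)
        hidden_states.append(h)
        # d h_t / d h_{t-1} = w_h * (1 - h_t^2) for a tanh cell
        grad *= w_h * (1.0 - h * h)
    return hidden_states, grad

short_inputs = [0.5] * 5
long_inputs = [0.5] * 50

_, grad_short = rnn_forward(short_inputs, w_x=1.0, w_h=0.5, b=0.0)
_, grad_long = rnn_forward(long_inputs, w_x=1.0, w_h=0.5, b=0.0)

print("Gradient through 5 steps:", grad_short)
print("Gradient through 50 steps:", grad_long)
```

Each step multiplies the gradient by a factor smaller than one, so the signal from early tokens shrinks geometrically with sequence length. LSTM and GRU cells add gated paths that let information and gradients pass through many steps with less attenuation.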

### 3.2 CNN for Text
- Process text as 1D signals
- Capture local patterns (n-gram-like features) with sliding filters
- Often combined with RNNs

In [None]:
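from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, Dense

# TextCNN model
textcnn = Sequential([
    Input(shape=(100,)),                # Sequences of 100 token IDs
    Embedding(10000, 128),
    Conv1D(128, 5, activation='relu'),  # 128 filters spanning 5 tokens each
    GlobalMaxPooling1D(),               # Strongest response per filter
    Dense(1, activation='sigmoid')
])

textcnn.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])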
textcnn.summary()
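
The two operations at the core of the model above, a 1D convolution followed by global max pooling, can be sketched without any framework. This is a simplified illustration: it uses one scalar feature per time step and a single hand-written filter, whereas the Keras layers convolve over embedding vectors with many learned filters. The function names are illustrative.

```python
def conv1d_valid(sequence, kernel):
    """Slide `kernel` over `sequence` (stride 1, no padding) and apply ReLU."""
    width = len(kernel)
    outputs = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        activation = sum(w * x for w, x in zip(kernel, window))
        outputs.append(max(0.0, activation))  # ReLU
    return outputs

def global_max_pool(feature_map):
    """Keep only the strongest response, wherever it occurred."""
    return max(feature_map)

# One scalar feature per time step; a real model would convolve over embedding vectors.
sequence = [0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0]
kernel = [1.0, 1.0, 1.0]  # Responds most strongly to three consecutive 1s

feature_map = conv1d_valid(sequence, kernel)
print("Feature map:", feature_map)
print("Pooled feature:", global_max_pool(feature_map))
```

Because pooling keeps only the maximum, the model detects whether a pattern occurs but discards where it occurred. That makes the representation independent of sequence length.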

## 4. Attention Mechanisms

### 4.1 Basic Attention
- Allows models to focus on relevant parts of input
- Computes context vectors as weighted sums
- Particularly useful for long sequences

In [None]:
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)
    
    def call(self, query, values):
        # query shape == (batch_size, hidden_size)
        # hidden_with_time_axis shape == (batch_size, 1, hidden_size),
        # so it broadcasts across the time axis of `values`
        hidden_with_time_axis = tf.expand_dims(query, 1)
        
        # score shape == (batch_size, max_length, 1)
        score = self.V(tf.nn.tanh(
            self.W1(values) + self.W2(hidden_with_time_axis)))
        
        # attention_weights shape == (batch_size, max_length, 1);
        # softmax over axis 1 normalizes across time steps
        attention_weights = tf.nn.softmax(score, axis=1)
        
        # Weighted values shape == (batch_size, max_length, hidden_size);
        # summing over axis 1 gives context_vector shape == (batch_size, hidden_size)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)
        
        return context_vector, attention_weights

# Example usage
attention_layer = BahdanauAttention(10)
query = tf.random.normal((1, 20))  # Decoder hidden state
values = tf.random.normal((1, 50, 20))  # Encoder outputs
context_vector, attention_weights = attention_layer(query, values)
print("Context vector shape:", context_vector.shape)
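
The last two steps of the layer above (softmax over scores, then a weighted sum of the values) are easier to follow with concrete numbers. The sketch below does this for a single example in plain Python. The scores are hypothetical placeholders; in the layer they are produced by the learned `W1`, `W2`, and `V` projections.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, values):
    """Turn raw scores into weights and return the weighted sum of `values`."""
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights

# Three encoder positions, each with a 2-dimensional value vector.
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical alignment scores; position 0 is judged most relevant.
scores = [2.0, 0.0, 0.0]

context, weights = attend(scores, values)
print("Attention weights:", [round(w, 3) for w in weights])
print("Context vector:", [round(c, 3) for c in context])
```

The weights sum to one, so the context vector is a convex combination of the encoder outputs, pulled toward the positions with the highest scores.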

## 5. Transformer Architecture

### 5.1 Key Components
- Self-attention mechanism
- Multi-head attention
- Positional encoding
- Layer normalization
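
Self-attention by itself ignores token order, which is why positional information has to be added to the token embeddings. One common choice is the fixed sinusoidal encoding from the original Transformer paper. A minimal pure-Python sketch of that formula follows; the function name and the small sizes are illustrative assumptions, and learned position embeddings are an equally valid alternative.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(max_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share one frequency
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            # Even dimensions use sine, odd dimensions use cosine
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(max_len=10, d_model=8)
print("Table shape:", (len(pe), len(pe[0])))
print("Position 0:", pe[0])
```

Each row would be added element-wise to the embedding of the token at that position before the first attention layer, so `d_model` must match the embedding dimension.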

### 5.2 Implementation Example

In [None]:
import tensorflow as tf

class TransformerBlock(tf.keras.layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super().__init__()
        self.att = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(ff_dim, activation="relu"),
            tf.keras.layers.Dense(embed_dim),
        ])
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)

    # `training` defaults to None so the layer can also be called without it
    def call(self, inputs, training=None):
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

# Example usage
transformer_block = TransformerBlock(embed_dim=32, num_heads=2, ff_dim=64)
print(transformer_block(tf.random.uniform((1, 10, 32)), training=False).shape)

## 6. Practical Applications

### 6.1 Sentiment Analysis
- Classify text sentiment (positive/negative)
- Can use RNNs, CNNs, or Transformers

### 6.2 Neural Machine Translation
- Encoder-decoder architecture
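- Attention lets the decoder consult every encoder state instead of a single fixed-length vector, which helps most on long sentences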

## 7. Exercises

1. Implement a character-level RNN for text generation
2. Compare word vs. subword tokenization
3. Add attention to a sequence-to-sequence model
4. Fine-tune a pretrained transformer model
5. Visualize attention weights for sample inputs

## 8. Key Takeaways

- NLP requires specialized preprocessing and tokenization
- RNNs and CNNs can effectively process text sequences
- Attention mechanisms enable models to focus on relevant context
- Transformers have become the state of the art for many tasks
- Pretrained models (BERT, GPT) provide powerful starting points