# Day 11: Introduction to Natural Language Processing

## Phase 2: Deep Learning & NLP Basics (Days 11-20)

**Time:** 3-4 hours

**Mathematical Prerequisites:**
- Information theory (entropy, mutual information)
- Probability (distributions, conditional probability)
- Linear algebra (vector spaces, similarity measures)
- Statistics (frequency analysis, n-grams)

---

## Welcome to NLP!

After mastering CNNs and image classification in Phase 1, we now transition to **Natural Language Processing (NLP)** - the art and science of making machines understand human language.

### Phase 2 Roadmap:
- **Day 11:** NLP Fundamentals (TODAY)
- **Day 12:** Word Embeddings (Word2Vec, GloVe)
- **Day 13:** Recurrent Neural Networks (RNNs)
- **Day 14:** Long Short-Term Memory (LSTMs)
- **Day 15:** Sequence-to-Sequence Models
- **Day 16:** Attention Mechanisms
- **Day 17:** Introduction to Transformers
- **Day 18:** Text Classification
- **Day 19:** Character-Level Language Models
- **Day 20:** Project 2 - Text Generation

---

## Today's Objectives

1. Understand the unique challenges of NLP
2. Text preprocessing pipeline (cleaning, normalization)
3. Tokenization strategies (word, subword, character)
4. Vocabulary building and OOV handling
5. Text representation (Bag of Words, TF-IDF)
6. N-gram language models
7. Evaluation metrics for language models

---

## Part 1: The NLP Challenge

### 1.1 Why is NLP Hard?

**Images vs Text:**
- Images: Fixed-size grid of continuous pixel values
- Text: Variable-length sequences of discrete symbols

**Key Challenges:**

1. **Ambiguity:** "I saw her duck" (the bird, or the act of ducking?)
2. **Context dependence:** "Apple" (fruit or company?)
3. **Variable length:** Sentences can be 1 word or 1000 words
4. **Discrete symbols:** Words are categorical, not continuous
5. **Vocabulary explosion:** Millions of possible words
6. **Out-of-vocabulary (OOV):** New words constantly appear
7. **Long-range dependencies:** "The cat, which was sitting on the mat, was sleeping"

### 1.2 The NLP Pipeline

```
Raw Text → Preprocessing → Tokenization → Numericalization → Model → Output
```

Today we focus on the first three steps: **Preprocessing**, **Tokenization**, and **Numericalization**.
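
Before using any libraries, the first three stages can be sketched in a few lines of plain Python. This is a minimal illustration under simplifying assumptions (a regex-based cleaner, whitespace tokenization, and a toy vocabulary built from the input itself); the helper names `preprocess`, `tokenize`, and `numericalize` are hypothetical and not part of any library.

```python
import re

def preprocess(text):
    """Lowercase and replace everything except letters, digits, and spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return " ".join(text.split())  # collapse repeated whitespace

def tokenize(text):
    """Whitespace tokenization (the simplest possible choice)."""
    return text.split()

def numericalize(tokens, token2idx):
    """Map tokens to integer IDs, falling back to the <UNK> ID."""
    unk = token2idx["<UNK>"]
    return [token2idx.get(tok, unk) for tok in tokens]

raw = "NLP is fun, and NLP is hard!"
tokens = tokenize(preprocess(raw))

# Toy vocabulary: IDs assigned in order of first appearance
token2idx = {"<UNK>": 0}
for tok in tokens:
    token2idx.setdefault(tok, len(token2idx))

ids = numericalize(tokens, token2idx)
print(tokens)
print(ids)
```

Repeated words receive the same ID, and any token missing from `token2idx` falls back to the `<UNK>` ID. The rest of the notebook refines each stage.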

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from collections import Counter, defaultdict
import re
import string
from pathlib import Path
import json

# NLP libraries
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.util import ngrams

# Download NLTK resources
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # required by word_tokenize in newer NLTK releases
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

# Scikit-learn for vectorization
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

np.random.seed(42)
print("NLP environment ready!")

## Part 2: Text Preprocessing

### 2.1 Sample Dataset

We'll use sample text data to demonstrate preprocessing techniques.

In [None]:
# Sample corpus
sample_texts = [
    "Natural Language Processing (NLP) is a field of Artificial Intelligence!",
    "It's focused on enabling computers to understand human language.",
    "NLP applications include: chatbots, translation, sentiment analysis, etc.",
    "Deep learning has revolutionized NLP in the 2010s and 2020s.",
    "The transformer architecture (2017) was a breakthrough moment!!!",
    "Models like BERT, GPT-3, and ChatGPT show remarkable capabilities.",
    "But challenges remain: bias, hallucination, reasoning, etc.",
    "The future of NLP is exciting... and a bit scary too :)",
]

print("Sample corpus:")
for i, text in enumerate(sample_texts):
    print(f"{i+1}. {text}")

### 2.2 Text Cleaning

Common preprocessing steps:
1. Lowercasing
2. Removing punctuation
3. Removing numbers
4. Removing extra whitespace
5. Removing special characters

In [None]:
class TextPreprocessor:
    """Text preprocessing pipeline."""
    
    def __init__(self):
        self.stop_words = set(stopwords.words('english'))
        self.stemmer = PorterStemmer()
        self.lemmatizer = WordNetLemmatizer()
    
    def lowercase(self, text):
        """Convert to lowercase."""
        return text.lower()
    
    def remove_punctuation(self, text):
        """Remove punctuation marks."""
        return text.translate(str.maketrans('', '', string.punctuation))
    
    def remove_numbers(self, text):
        """Remove digits."""
        return re.sub(r'\d+', '', text)
    
    def remove_extra_whitespace(self, text):
        """Remove extra whitespace."""
        return ' '.join(text.split())
    
    def remove_special_chars(self, text):
        """Remove special characters (keep letters and spaces)."""
        return re.sub(r'[^a-zA-Z\s]', '', text)
    
    def remove_stopwords(self, tokens):
        """Remove common stop words."""
        return [token for token in tokens if token not in self.stop_words]
    
    def stem(self, tokens):
        """Apply stemming (reduce to root form)."""
        return [self.stemmer.stem(token) for token in tokens]
    
    def lemmatize(self, tokens):
        """Lemmatize tokens; without POS tags, WordNet treats each token as a noun."""
        return [self.lemmatizer.lemmatize(token) for token in tokens]
    
    def clean_text(self, text):
        """Apply basic cleaning: lowercase, strip punctuation, collapse whitespace."""
        text = self.lowercase(text)
        text = self.remove_punctuation(text)
        text = self.remove_extra_whitespace(text)
        return text

# Create preprocessor
preprocessor = TextPreprocessor()

# Demonstrate preprocessing
sample_text = sample_texts[0]
print("Original text:")
print(f"  {sample_text}")
print("\nStep-by-step preprocessing:")

# Step 1: Lowercase
text = preprocessor.lowercase(sample_text)
print(f"1. Lowercase: {text}")

# Step 2: Remove punctuation
text = preprocessor.remove_punctuation(text)
print(f"2. Remove punctuation: {text}")

# Step 3: Remove extra whitespace
text = preprocessor.remove_extra_whitespace(text)
print(f"3. Remove whitespace: {text}")

### 2.3 Stemming vs Lemmatization

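**Stemming:** Chop off word endings with heuristic rules (fast but crude)
- "running" → "run"
- "studies" → "studi" (not a real word)
- "better" → "better" (no change)

**Lemmatization:** Use a dictionary to find the base form (slower but more accurate). The WordNet lemmatizer needs the part of speech and treats every word as a noun unless told otherwise:
- "running" → "run" (with `pos='v'`)
- "better" → "good" (with `pos='a'`)
- "mice" → "mouse" (default noun)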

**When to use:**
- Stemming: When speed matters, exact forms less important (search engines)
- Lemmatization: When linguistic accuracy matters (text classification)
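
To make the contrast concrete without any NLTK resources, here is a toy sketch. Both functions are deliberately simplistic assumptions: `toy_stem` is a three-rule suffix stripper (not the Porter algorithm), and `TOY_LEMMAS` is a hypothetical mini-dictionary standing in for a resource like WordNet.

```python
def toy_stem(word):
    """Crude suffix stripping: remove the first matching suffix."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so very short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary (a real lemmatizer uses a full lexicon)
TOY_LEMMAS = {
    "running": "run",
    "ran": "run",
    "better": "good",
    "mice": "mouse",
    "studies": "study",
}

def toy_lemmatize(word):
    """Dictionary lookup; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)

for w in ["running", "ran", "better", "mice", "studies"]:
    print(f"{w:>8} | stem: {toy_stem(w):<8} | lemma: {toy_lemmatize(w)}")
```

The rule-based stemmer can only trim characters, so irregular forms such as "ran" or "mice" pass through unchanged, while the lookup handles them at the cost of needing a dictionary.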

In [None]:
# Compare stemming vs lemmatization
test_words = ['running', 'runs', 'ran', 'better', 'studies', 'studying', 'mice', 'feet']

comparison = pd.DataFrame({
    'Original': test_words,
    'Stemmed': preprocessor.stem(test_words),
    'Lemmatized': preprocessor.lemmatize(test_words)
})

print("Stemming vs Lemmatization:")
print(comparison.to_string(index=False))

## Part 3: Tokenization

### 3.1 What is Tokenization?

**Tokenization** splits text into meaningful units (tokens).

**Types:**
1. **Word tokenization:** Split by words
2. **Sentence tokenization:** Split by sentences
3. **Subword tokenization:** Split into subword units (BPE, WordPiece)
4. **Character tokenization:** Split into characters

### 3.2 Word Tokenization

In [None]:
# Different tokenization approaches
text = "It's a beautiful day! Don't you think? I'd love to go outside."

print(f"Original: {text}\n")

# 1. Naive split by space
naive_tokens = text.split()
print(f"Naive (split by space):")
print(f"  {naive_tokens}\n")

# 2. NLTK word_tokenize
nltk_tokens = word_tokenize(text)
print(f"NLTK word_tokenize:")
print(f"  {nltk_tokens}\n")

# 3. Sentence tokenization
sentences = sent_tokenize(text)
print(f"Sentence tokenization:")
for i, sent in enumerate(sentences):
    print(f"  {i+1}. {sent}")
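
A regular expression offers a middle ground between a naive split and a full tokenizer. The sketch below is one possible pattern, chosen for illustration: it keeps contractions such as "Don't" as single tokens and splits punctuation into separate tokens. NLTK's `word_tokenize` makes different choices for contractions, so the outputs will not match exactly.

```python
import re

# A word (optionally with one internal apostrophe part) or a single punctuation mark
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def regex_tokenize(text):
    """Return word and punctuation tokens in order of appearance."""
    return TOKEN_PATTERN.findall(text)

text = "It's a beautiful day! Don't you think?"
print("Naive:", text.split())
print("Regex:", regex_tokenize(text))
```

Unlike the naive split, punctuation no longer sticks to the preceding word, so "day!" and "day" map to the same token.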

### 3.3 Building a Vocabulary

A **vocabulary** maps tokens to integer IDs.

**Key considerations:**
- Vocabulary size (memory vs coverage trade-off)
- Special tokens: `<PAD>`, `<UNK>`, `<SOS>`, `<EOS>`
- Handling out-of-vocabulary (OOV) words

In [None]:
class Vocabulary:
    """Vocabulary for mapping tokens to indices."""
    
    def __init__(self, min_freq=1):
        self.min_freq = min_freq
        
        # Special tokens
        self.PAD_TOKEN = '<PAD>'
        self.UNK_TOKEN = '<UNK>'
        self.SOS_TOKEN = '<SOS>'
        self.EOS_TOKEN = '<EOS>'
        
        self.token2idx = {}
        self.idx2token = {}
        self.token_freqs = Counter()
        
        # Initialize with special tokens
        self._add_special_tokens()
    
    def _add_special_tokens(self):
        """Add special tokens to vocabulary."""
        for token in [self.PAD_TOKEN, self.UNK_TOKEN, self.SOS_TOKEN, self.EOS_TOKEN]:
            self._add_token(token)
    
    def _add_token(self, token):
        """Add single token to vocabulary."""
        if token not in self.token2idx:
            idx = len(self.token2idx)
            self.token2idx[token] = idx
            self.idx2token[idx] = token
    
    def build_vocab(self, texts, tokenizer=None):
        """Build vocabulary from list of texts."""
        if tokenizer is None:
            tokenizer = lambda x: x.lower().split()
        
        # Count token frequencies
        for text in texts:
            tokens = tokenizer(text)
            self.token_freqs.update(tokens)
        
        # Add tokens with frequency >= min_freq
        for token, freq in self.token_freqs.most_common():
            if freq >= self.min_freq:
                self._add_token(token)
        
        print(f"Vocabulary built: {len(self)} tokens")
    
    def encode(self, text, tokenizer=None):
        """Convert text to list of indices."""
        if tokenizer is None:
            tokenizer = lambda x: x.lower().split()
        
        tokens = tokenizer(text)
        indices = []
        
        for token in tokens:
            if token in self.token2idx:
                indices.append(self.token2idx[token])
            else:
                indices.append(self.token2idx[self.UNK_TOKEN])
        
        return indices
    
    def decode(self, indices):
        """Convert list of indices back to text."""
        tokens = [self.idx2token.get(idx, self.UNK_TOKEN) for idx in indices]
        return ' '.join(tokens)
    
    def __len__(self):
        return len(self.token2idx)
    
    def __getitem__(self, token):
        return self.token2idx.get(token, self.token2idx[self.UNK_TOKEN])


# Build vocabulary
vocab = Vocabulary(min_freq=1)
vocab.build_vocab(sample_texts)

print(f"\nVocabulary size: {len(vocab)}")
print(f"\nFirst 20 tokens:")
for i in range(min(20, len(vocab))):
    print(f"  {i}: {vocab.idx2token[i]}")

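# Test encoding and decoding
# Note: "intelligence" decodes to <UNK> because the whitespace tokenizer
# only saw "intelligence!" (with punctuation attached) in the corpus.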
test_text = "NLP is a field of artificial intelligence"
encoded = vocab.encode(test_text)
decoded = vocab.decode(encoded)

print(f"\nTest text: {test_text}")
print(f"Encoded: {encoded}")
print(f"Decoded: {decoded}")
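
The `<PAD>` token exists so that sequences of different lengths can be stacked into one batch. Here is a minimal sketch of right-padding with an accompanying mask; `pad_batch` is a hypothetical helper, and it assumes `<PAD>` has index 0 (which holds for a vocabulary that adds `<PAD>` first).

```python
PAD_ID = 0  # assumption: <PAD> was the first token added to the vocabulary

def pad_batch(sequences, pad_id=PAD_ID, max_len=None):
    """Right-pad (and truncate) lists of token IDs to a common length.

    Returns the padded sequences and a mask with 1 for real tokens, 0 for padding.
    """
    if max_len is None:
        max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        seq = seq[:max_len]
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask

batch = [[5, 8, 2], [7], [4, 9, 6, 3]]
padded, mask = pad_batch(batch)
for row, m in zip(padded, mask):
    print(row, m)
```

The mask lets a model ignore padded positions when computing losses or attention weights.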

### 3.4 Subword Tokenization

**Problem with word tokenization:**
- Large vocabulary (100K+ words)
- OOV problem for rare/new words
- No sharing between similar words ("run", "running", "runner")

**Subword tokenization** breaks words into smaller units:
- "unhappiness" → "un", "happi", "ness"
- "transformer" → "trans", "former"

**Popular methods:**
1. **Byte Pair Encoding (BPE)** - Used in GPT
2. **WordPiece** - Used in BERT
3. **SentencePiece** - Language-agnostic

In [None]:
class SimpleBPE:
    """Simple Byte Pair Encoding implementation."""
    
    def __init__(self, vocab_size=100):
        self.vocab_size = vocab_size
        self.merges = []
        self.vocab = set()
    
    def get_pairs(self, word):
        """Get all pairs of consecutive symbols."""
        pairs = Counter()
        prev_char = word[0]
        for char in word[1:]:
            pairs[(prev_char, char)] += 1
            prev_char = char
        return pairs
    
    def merge_pair(self, word, pair):
        """Merge a pair in the word."""
        new_word = []
        i = 0
        while i < len(word):
            if i < len(word) - 1 and word[i] == pair[0] and word[i+1] == pair[1]:
                new_word.append(pair[0] + pair[1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        return new_word
    
    def fit(self, corpus):
        """Learn BPE merges from corpus."""
        # Tokenize into characters with word boundaries
        word_freqs = Counter()
        for text in corpus:
            for word in text.lower().split():
                word_freqs[tuple(word) + ('</w>',)] += 1
        
        # Initialize vocabulary with characters
        self.vocab = set()
        for word in word_freqs:
            for char in word:
                self.vocab.add(char)
        
        print(f"Initial vocab size (characters): {len(self.vocab)}")
        
        # Iteratively merge most frequent pairs
        while len(self.vocab) < self.vocab_size:
            # Count all pairs
            pair_freqs = Counter()
            for word, freq in word_freqs.items():
                pairs = self.get_pairs(word)
                for pair, count in pairs.items():
                    pair_freqs[pair] += count * freq
            
            if not pair_freqs:
                break
            
            # Get most frequent pair
            best_pair = pair_freqs.most_common(1)[0][0]
            self.merges.append(best_pair)
            
            # Add merged token to vocab
            new_token = best_pair[0] + best_pair[1]
            self.vocab.add(new_token)
            
            # Apply merge to all words
            new_word_freqs = Counter()
            for word, freq in word_freqs.items():
                new_word = self.merge_pair(list(word), best_pair)
                new_word_freqs[tuple(new_word)] += freq
            
            word_freqs = new_word_freqs
        
        print(f"Final vocab size: {len(self.vocab)}")
        print(f"Number of merges: {len(self.merges)}")
    
    def tokenize(self, word):
        """Tokenize a single word using learned merges."""
        word = list(word) + ['</w>']
        
        for pair in self.merges:
            word = self.merge_pair(word, pair)
        
        return word

# Train BPE
bpe = SimpleBPE(vocab_size=50)
bpe.fit(sample_texts)

# Test tokenization
test_words = ['natural', 'processing', 'transformer', 'artificial']
print("\nBPE Tokenization:")
for word in test_words:
    tokens = bpe.tokenize(word.lower())
    print(f"  {word} → {tokens}")
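
For contrast with BPE, which replays learned merges, WordPiece-style tokenizers are usually described as segmenting a word by greedy longest-match against a fixed vocabulary, marking word-internal pieces with a `##` prefix. The sketch below illustrates only that segmentation step; the vocabulary `toy_vocab` is hand-written for this example rather than learned, and the function name is hypothetical.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first segmentation with '##' continuation pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces

# Hand-written toy vocabulary (an assumption for illustration)
toy_vocab = {"un", "##happi", "##ness", "trans", "##former", "run", "##ning"}

for w in ["unhappiness", "transformer", "running", "xyz"]:
    print(f"{w} → {wordpiece_tokenize(w, toy_vocab)}")
```

A word that cannot be fully covered by vocabulary pieces maps to a single unknown token, which is why real subword vocabularies include all individual characters.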

### 3.5 Character-Level Tokenization

**Advantages:**
- Almost no OOV problem: any word made of known characters can be encoded (characters never seen in training still map to `<UNK>`)
- Very small vocabulary (~100 chars)
- Can handle any word, including misspellings

**Disadvantages:**
- Very long sequences
- Harder to learn semantic meaning
- Slower processing

In [None]:
class CharacterTokenizer:
    """Character-level tokenizer."""
    
    def __init__(self):
        self.char2idx = {}
        self.idx2char = {}
        
        # Special tokens
        self.PAD_TOKEN = '<PAD>'
        self.UNK_TOKEN = '<UNK>'
        
        # Initialize with special tokens
        self._add_char(self.PAD_TOKEN)
        self._add_char(self.UNK_TOKEN)
    
    def _add_char(self, char):
        if char not in self.char2idx:
            idx = len(self.char2idx)
            self.char2idx[char] = idx
            self.idx2char[idx] = char
    
    def fit(self, texts):
        """Build character vocabulary from texts."""
        for text in texts:
            for char in text:
                self._add_char(char)
        
        print(f"Character vocabulary size: {len(self)}")
    
    def encode(self, text):
        """Convert text to character indices."""
        return [self.char2idx.get(c, self.char2idx[self.UNK_TOKEN]) for c in text]
    
    def decode(self, indices):
        """Convert indices back to text."""
        return ''.join([self.idx2char.get(i, self.UNK_TOKEN) for i in indices])
    
    def __len__(self):
        return len(self.char2idx)

# Build character tokenizer
char_tokenizer = CharacterTokenizer()
char_tokenizer.fit(sample_texts)

# Test
test_text = "Hello NLP!"
char_encoded = char_tokenizer.encode(test_text)
char_decoded = char_tokenizer.decode(char_encoded)

print(f"\nTest: {test_text}")
print(f"Encoded: {char_encoded}")
print(f"Decoded: {char_decoded}")
print(f"Sequence length: {len(char_encoded)}")

## Part 4: Text Representation

### 4.1 Bag of Words (BoW)

Represent text as a vector of word counts.

**Example:**
- Vocabulary: ["the", "cat", "sat", "on", "mat"]
- Text: "the cat sat on the mat"
- BoW: [2, 1, 1, 1, 1] (count of each word)

**Properties:**
- Ignores word order (hence "bag")
- High-dimensional (vocab_size dimensions)
- Sparse vectors
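
The example above can be reproduced by hand with a `Counter`. This is a minimal sketch with a fixed, hand-picked vocabulary; `bag_of_words` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

vocabulary = ["the", "cat", "sat", "on", "mat"]

original = bag_of_words("the cat sat on the mat", vocabulary)
shuffled = bag_of_words("the mat sat on the cat", vocabulary)

print(original)
print(shuffled)
print("Same vector despite different word order:", original == shuffled)
```

Both sentences produce the same vector even though they mean different things, which is exactly the information that the "bag" discards.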

In [None]:
# Bag of Words with sklearn
bow_vectorizer = CountVectorizer(max_features=50)
bow_matrix = bow_vectorizer.fit_transform(sample_texts)

print("Bag of Words Representation:")
print(f"Matrix shape: {bow_matrix.shape} (documents × vocabulary)")

# Show vocabulary
vocab_bow = bow_vectorizer.get_feature_names_out()
print(f"\nVocabulary ({len(vocab_bow)} words):")
print(vocab_bow)

# Show BoW for first document
print(f"\nFirst document: {sample_texts[0]}")
print(f"BoW vector: {bow_matrix[0].toarray()[0]}")

# Visualize BoW matrix
fig, ax = plt.subplots(figsize=(16, 6))
sns.heatmap(bow_matrix.toarray(), annot=False, cmap='YlOrRd', ax=ax,
            xticklabels=vocab_bow, yticklabels=[f'Doc {i+1}' for i in range(len(sample_texts))])
ax.set_title('Bag of Words Matrix (Document-Term Matrix)', fontsize=14, fontweight='bold')
ax.set_xlabel('Words', fontsize=12)
ax.set_ylabel('Documents', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

### 4.2 TF-IDF (Term Frequency - Inverse Document Frequency)

Weight words by their importance.

**Term Frequency (TF):**
$$
TF(t, d) = \frac{\text{count of } t \text{ in } d}{\text{total words in } d}
$$

**Inverse Document Frequency (IDF):**
$$
IDF(t) = \log\frac{N}{|\{d : t \in d\}|}
$$
where $N$ is the total number of documents and the denominator counts the documents that contain term $t$.

**TF-IDF:**
$$
TF\text{-}IDF(t, d) = TF(t, d) \cdot IDF(t)
$$

**Intuition:**
- High TF-IDF: Word appears frequently in document but rarely in corpus (important word)
- Low TF-IDF: Word appears rarely or appears in many documents (common word)
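
The formulas can be checked by hand on a tiny corpus. The sketch below implements the textbook definitions given above using only the standard library; the three-document corpus and the function names are made up for illustration, and `idf` assumes the term occurs in at least one document.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

def tf(term, doc):
    """Term frequency: share of the document's tokens equal to the term."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """Inverse document frequency (assumes the term occurs in some document)."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ["the", "cat", "mat"]:
    print(f"{term}: {tfidf(term, docs[0], docs):.4f}")
```

"the" occurs in every document, so its IDF is log(1) = 0 and its score vanishes no matter how often it appears; "mat" occurs in only one document and scores highest of the three.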

In [None]:
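# TF-IDF
# Note: by default TfidfVectorizer uses raw counts for TF, a smoothed IDF
# (ln((1 + N) / (1 + df)) + 1), and L2-normalizes each row, so its values
# differ from the textbook formulas above.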
tfidf_vectorizer = TfidfVectorizer(max_features=50)
tfidf_matrix = tfidf_vectorizer.fit_transform(sample_texts)

print("TF-IDF Representation:")
print(f"Matrix shape: {tfidf_matrix.shape}")

vocab_tfidf = tfidf_vectorizer.get_feature_names_out()

# Show TF-IDF for first document
print(f"\nFirst document: {sample_texts[0]}")
tfidf_scores = tfidf_matrix[0].toarray()[0]
print(f"TF-IDF vector (first 10):")
for i in range(min(10, len(vocab_tfidf))):
    if tfidf_scores[i] > 0:
        print(f"  {vocab_tfidf[i]}: {tfidf_scores[i]:.4f}")

# Visualize TF-IDF matrix
fig, ax = plt.subplots(figsize=(16, 6))
sns.heatmap(tfidf_matrix.toarray(), annot=False, cmap='YlOrRd', ax=ax,
            xticklabels=vocab_tfidf, yticklabels=[f'Doc {i+1}' for i in range(len(sample_texts))])
ax.set_title('TF-IDF Matrix (Weighted Document-Term Matrix)', fontsize=14, fontweight='bold')
ax.set_xlabel('Words', fontsize=12)
ax.set_ylabel('Documents', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# Compare BoW vs TF-IDF
print("\nBoW vs TF-IDF for 'nlp':")
if 'nlp' in vocab_bow:
    bow_idx = list(vocab_bow).index('nlp')
    print(f"  BoW counts: {bow_matrix[:, bow_idx].toarray().flatten()}")
if 'nlp' in vocab_tfidf:
    tfidf_idx = list(vocab_tfidf).index('nlp')
    print(f"  TF-IDF scores: {tfidf_matrix[:, tfidf_idx].toarray().flatten().round(4)}")

## Part 5: N-gram Language Models

### 5.1 N-gram Theory

An **n-gram** is a sequence of n consecutive words.

**Examples:**
- Unigram (n=1): "the", "cat", "sat"
- Bigram (n=2): "the cat", "cat sat"
- Trigram (n=3): "the cat sat"

**N-gram Language Model:**
Estimate the probability of the next word given the previous $N-1$ words, where $N$ is the n-gram order and $n$ is the position of the word being predicted.

$$
P(w_n | w_1, ..., w_{n-1}) \approx P(w_n | w_{n-N+1}, ..., w_{n-1})
$$

**Markov Assumption:** Only the last $N-1$ words matter.

**Estimation:**
$$
P(w_n | w_{n-N+1}, ..., w_{n-1}) = \frac{C(w_{n-N+1}, ..., w_n)}{C(w_{n-N+1}, ..., w_{n-1})}
$$
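
For the bigram case ($N = 2$) the estimate is simply a ratio of two counts, which can be worked out on a toy corpus. The three sentences below are invented for illustration, and `bigram_prob` is a hypothetical helper.

```python
from collections import Counter

corpus = ["the cat sat", "the cat ran", "the dog sat"]

bigram_counts = Counter()
context_counts = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1

def bigram_prob(word, prev):
    """MLE estimate P(word | prev) = C(prev, word) / C(prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

print("P(cat | the) =", bigram_prob("cat", "the"))
print("P(dog | the) =", bigram_prob("dog", "the"))
print("P(sat | cat) =", bigram_prob("sat", "cat"))
print("P(ran | dog) =", bigram_prob("ran", "dog"))
```

"the" is followed by "cat" in two of its three occurrences, so the estimate is 2/3. "dog ran" never occurs, so its estimate is 0, a first glimpse of the data sparsity problem.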

In [None]:
class NGramLanguageModel:
    """N-gram language model."""
    
    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = Counter()
        self.context_counts = Counter()
        self.vocab = set()
    
    def train(self, texts):
        """Train the n-gram model."""
        for text in texts:
            tokens = ['<s>'] * (self.n - 1) + text.lower().split() + ['</s>']
            self.vocab.update(tokens)
            
            for i in range(len(tokens) - self.n + 1):
                ngram = tuple(tokens[i:i+self.n])
                context = tuple(tokens[i:i+self.n-1])
                
                self.ngram_counts[ngram] += 1
                self.context_counts[context] += 1
        
        print(f"{self.n}-gram model trained")
        print(f"  Vocabulary size: {len(self.vocab)}")
        print(f"  Unique {self.n}-grams: {len(self.ngram_counts)}")
    
    def probability(self, word, context):
        """Estimate P(word | context) using MLE."""
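        # For a unigram model (n=1) the context is empty; a slice of [-0:] would keep everything
        context = tuple(context[-(self.n-1):]) if self.n > 1 else ()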
        ngram = context + (word,)
        
        if self.context_counts[context] == 0:
            return 1 / len(self.vocab)  # Uniform fallback
        
        return self.ngram_counts[ngram] / self.context_counts[context]
    
    def generate(self, start_tokens=None, max_length=20):
        """Generate text using the model."""
        if start_tokens is None:
            tokens = ['<s>'] * (self.n - 1)
        else:
            tokens = ['<s>'] * (self.n - 1) + start_tokens
        
        for _ in range(max_length):
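            # For a unigram model (n=1) the context is empty; a slice of [-0:] would keep everything
            context = tuple(tokens[-(self.n-1):]) if self.n > 1 else ()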
            
            # Get possible next words
            candidates = []
            probs = []
            
            for word in self.vocab:
                ngram = context + (word,)
                if self.ngram_counts[ngram] > 0:
                    candidates.append(word)
                    probs.append(self.ngram_counts[ngram])
            
            if not candidates:
                break
            
            # Sample next word
            probs = np.array(probs) / sum(probs)
            next_word = np.random.choice(candidates, p=probs)
            
            if next_word == '</s>':
                break
            
            tokens.append(next_word)
        
        # Remove start tokens
        return ' '.join([t for t in tokens if t not in ['<s>', '</s>']])
    
    def perplexity(self, text):
        """Calculate perplexity on text."""
        tokens = ['<s>'] * (self.n - 1) + text.lower().split() + ['</s>']
        
        log_prob = 0
        num_tokens = 0
        
        for i in range(self.n - 1, len(tokens)):
            context = tokens[i-self.n+1:i]
            word = tokens[i]
            prob = self.probability(word, context)
            
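            if prob == 0:
                # Unseen n-gram under MLE: the text has probability zero, so
                # perplexity is infinite. Skipping the token would understate it.
                return float('inf')
            
            log_prob += np.log(prob)
            num_tokens += 1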
        
        if num_tokens == 0:
            return float('inf')
        
        return np.exp(-log_prob / num_tokens)


# Train bigram model
bigram_model = NGramLanguageModel(n=2)
bigram_model.train(sample_texts)

# Train trigram model
trigram_model = NGramLanguageModel(n=3)
trigram_model.train(sample_texts)

In [None]:
# Visualize most common n-grams
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Bigrams
top_bigrams = bigram_model.ngram_counts.most_common(15)
bigram_labels = [' '.join(bg) for bg, _ in top_bigrams]
bigram_counts = [count for _, count in top_bigrams]

axes[0].barh(bigram_labels, bigram_counts, color='skyblue', edgecolor='black')
axes[0].set_xlabel('Count', fontsize=12)
axes[0].set_title('Top 15 Bigrams', fontsize=14, fontweight='bold')
axes[0].invert_yaxis()
axes[0].grid(True, alpha=0.3, axis='x')

# Trigrams
top_trigrams = trigram_model.ngram_counts.most_common(15)
trigram_labels = [' '.join(tg) for tg, _ in top_trigrams]
trigram_counts = [count for _, count in top_trigrams]

axes[1].barh(trigram_labels, trigram_counts, color='lightcoral', edgecolor='black')
axes[1].set_xlabel('Count', fontsize=12)
axes[1].set_title('Top 15 Trigrams', fontsize=14, fontweight='bold')
axes[1].invert_yaxis()
axes[1].grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

### 5.2 Text Generation with N-grams

In [None]:
# Generate text
print("Text Generation:\n")

print("Bigram model:")
for i in range(3):
    generated = bigram_model.generate(max_length=15)
    print(f"  {i+1}. {generated}")

print("\nTrigram model:")
for i in range(3):
    generated = trigram_model.generate(max_length=15)
    print(f"  {i+1}. {generated}")

print("\nGenerate with context:")
context = ['nlp', 'is']
for i in range(3):
    generated = bigram_model.generate(start_tokens=context, max_length=10)
    print(f"  {i+1}. {generated}")

## Part 6: Evaluation Metrics for Language Models

### 6.1 Perplexity

**Perplexity** measures how well a language model predicts text.

$$
PP(W) = P(w_1, ..., w_N)^{-1/N} = \exp\left(-\frac{1}{N}\sum_{i=1}^N \log P(w_i | w_1, ..., w_{i-1})\right)
$$

**Intuition:**
- Lower perplexity = better model
- Perplexity of uniform random model = vocabulary size
- Can be interpreted as "average branching factor"

**Example:**
- PP = 10 means the model is as uncertain as choosing from 10 equally likely words
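
The "equally likely words" reading can be verified numerically. The sketch below computes perplexity directly from a list of per-token probabilities; the probability values are made up for illustration, and the function is a standalone helper rather than part of any model class.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each observed token."""
    if any(p == 0 for p in token_probs):
        return float("inf")  # a single impossible token makes the text impossible
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)

# Uniform model over 10 words: every token gets probability 0.1
print(perplexity([0.1] * 5))

# Non-uniform case: perplexity is the inverse geometric mean of the probabilities
print(perplexity([0.5, 0.25, 0.125]))
```

The first call gives approximately 10, matching the vocabulary size of the uniform model. The second gives approximately 4, since the geometric mean of the three probabilities is 1/4.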

In [None]:
# Calculate perplexity
test_texts = [
    "nlp is a field of artificial intelligence",
    "deep learning has revolutionized nlp",
    "the transformer architecture was a breakthrough"
]

print("Perplexity Comparison:\n")
print(f"{'Text':<50} {'Bigram PP':>15} {'Trigram PP':>15}")
print("-"*80)

for text in test_texts:
    bigram_pp = bigram_model.perplexity(text)
    trigram_pp = trigram_model.perplexity(text)
    print(f"{text[:48]:<50} {bigram_pp:>15.2f} {trigram_pp:>15.2f}")

# Test on out-of-domain text
ood_text = "quantum computing uses qubits instead of bits"
print(f"\nOut-of-domain text: {ood_text}")
print(f"  Bigram perplexity: {bigram_model.perplexity(ood_text):.2f}")
print(f"  Trigram perplexity: {trigram_model.perplexity(ood_text):.2f}")
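
An unsmoothed MLE model assigns zero probability to any n-gram missing from the training data, which makes perplexity on new text either infinite or unreliable. Add-k (Laplace) smoothing is one common remedy. The class below is a minimal standalone sketch and a different estimator from the MLE model above; `LaplaceBigram`, the `<unk>` mapping, and the two-sentence training set are assumptions made for illustration.

```python
import math
from collections import Counter

class LaplaceBigram:
    """Bigram model with add-k smoothing and an <unk> token for unseen words."""

    def __init__(self, k=1.0):
        self.k = k
        self.bigrams = Counter()
        self.contexts = Counter()
        self.vocab = {"<unk>", "</s>"}  # tokens the model can predict

    def train(self, texts):
        for text in texts:
            tokens = ["<s>"] + text.lower().split() + ["</s>"]
            self.vocab.update(tokens[1:])  # <s> is never predicted
            for prev, word in zip(tokens, tokens[1:]):
                self.bigrams[(prev, word)] += 1
                self.contexts[prev] += 1

    def _map(self, token):
        """Replace tokens outside the vocabulary with <unk>."""
        return token if token in self.vocab or token == "<s>" else "<unk>"

    def probability(self, word, prev):
        word, prev = self._map(word), self._map(prev)
        vocab_size = len(self.vocab)
        return (self.bigrams[(prev, word)] + self.k) / (
            self.contexts[prev] + self.k * vocab_size
        )

    def perplexity(self, text):
        tokens = ["<s>"] + text.lower().split() + ["</s>"]
        log_prob = sum(
            math.log(self.probability(w, p)) for p, w in zip(tokens, tokens[1:])
        )
        return math.exp(-log_prob / (len(tokens) - 1))

model = LaplaceBigram(k=1.0)
model.train(["the cat sat", "the dog sat"])
print(f"In-domain:     {model.perplexity('the cat sat'):.2f}")
print(f"Out-of-domain: {model.perplexity('a bird flew'):.2f}")
```

Every token now contributes a finite, non-zero probability, so perplexity stays finite and text unlike the training data scores worse than text that resembles it.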

## Part 7: Practical Exercise - Complete NLP Pipeline

Let's build a complete preprocessing and analysis pipeline.

In [None]:
class NLPPipeline:
    """Complete NLP preprocessing and analysis pipeline."""
    
    def __init__(self):
        self.preprocessor = TextPreprocessor()
        self.vocab = None
        self.tfidf = TfidfVectorizer(max_features=100)
    
    def preprocess_corpus(self, texts):
        """Preprocess entire corpus."""
        cleaned_texts = []
        
        for text in texts:
            # Clean
            clean = self.preprocessor.clean_text(text)
            
            # Tokenize
            tokens = word_tokenize(clean)
            
            # Remove stopwords and lemmatize
            tokens = self.preprocessor.remove_stopwords(tokens)
            tokens = self.preprocessor.lemmatize(tokens)
            
            cleaned_texts.append(' '.join(tokens))
        
        return cleaned_texts
    
    def build_vocabulary(self, texts):
        """Build vocabulary from texts."""
        self.vocab = Vocabulary(min_freq=1)
        self.vocab.build_vocab(texts)
        return self.vocab
    
    def compute_tfidf(self, texts):
        """Compute TF-IDF representation."""
        return self.tfidf.fit_transform(texts)
    
    def analyze_corpus(self, texts):
        """Comprehensive corpus analysis."""
        print("Corpus Analysis")
        print("="*60)
        
        # Basic stats
        num_docs = len(texts)
        avg_length = np.mean([len(text.split()) for text in texts])
        total_words = sum(len(text.split()) for text in texts)
        
        print(f"Number of documents: {num_docs}")
        print(f"Average document length: {avg_length:.2f} words")
        print(f"Total words: {total_words}")
        
        # Vocabulary stats
        all_words = ' '.join(texts).split()
        word_counts = Counter(all_words)
        vocab_size = len(word_counts)
        
        print(f"Unique words: {vocab_size}")
        print(f"Type-token ratio: {vocab_size/total_words:.4f}")
        
        # Most common words
        print(f"\nTop 10 most common words:")
        for word, count in word_counts.most_common(10):
            print(f"  {word}: {count}")
        
        return word_counts

# Create and run pipeline
pipeline = NLPPipeline()

# Preprocess
cleaned_corpus = pipeline.preprocess_corpus(sample_texts)
print("Preprocessed corpus:")
for i, text in enumerate(cleaned_corpus):
    print(f"{i+1}. {text}")

print()
# Analyze
word_counts = pipeline.analyze_corpus(cleaned_corpus)

### Word Frequency Visualization

In [None]:
# Visualize word frequencies
fig, ax = plt.subplots(figsize=(12, 6))

top_words = word_counts.most_common(20)
words = [w for w, _ in top_words]
counts = [c for _, c in top_words]

bars = ax.barh(words, counts, color='steelblue', edgecolor='black', alpha=0.7)
ax.set_xlabel('Frequency', fontsize=12)
ax.set_title('Word Frequency Distribution', fontsize=14, fontweight='bold')
ax.invert_yaxis()
ax.grid(True, alpha=0.3, axis='x')

# Add count labels
for bar, count in zip(bars, counts):
    width = bar.get_width()
    ax.text(width + 0.1, bar.get_y() + bar.get_height()/2,
            str(count), ha='left', va='center', fontsize=10)

plt.tight_layout()
plt.show()

## Part 8: Key Takeaways and Next Steps

### 8.1 Summary

Today we covered the **foundations of NLP**:

✅ **Why NLP is challenging:** Ambiguity, context, variable length, discrete symbols  
✅ **Text preprocessing:** Cleaning, normalization, stopword removal, stemming/lemmatization  
✅ **Tokenization:** Word-level, subword (BPE), character-level  
✅ **Vocabulary building:** Token-to-index mapping, special tokens, OOV handling  
✅ **Text representation:** Bag of Words, TF-IDF  
✅ **N-gram language models:** Probability estimation, text generation  
✅ **Evaluation:** Perplexity as a measure of model quality  

### 8.2 Limitations of Traditional NLP

**Bag of Words / TF-IDF:**
- No word order
- No semantic similarity ("king" and "monarch" are treated as unrelated)
- High-dimensional, sparse vectors
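
The missing-similarity problem shows up directly in the geometry. In a count-based representation each word owns its own dimension, so any two distinct words are orthogonal. The sketch below uses a three-word toy vocabulary and a hand-written `cosine` helper, both chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

vocabulary = ["king", "monarch", "banana"]
# One-hot vector per word: 1 in the word's own dimension, 0 elsewhere
one_hot = {
    word: [1 if i == j else 0 for j in range(len(vocabulary))]
    for i, word in enumerate(vocabulary)
}

print("king vs monarch:", cosine(one_hot["king"], one_hot["monarch"]))
print("king vs banana: ", cosine(one_hot["king"], one_hot["banana"]))
print("king vs king:   ", cosine(one_hot["king"], one_hot["king"]))
```

"king" is exactly as far from "monarch" as it is from "banana", so nothing in the representation reflects that the first pair is related.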

**N-gram models:**
- Limited context (only last n-1 words)
- Data sparsity (many n-grams never seen)
- Cannot capture long-range dependencies

### 8.3 What's Next?

**Tomorrow (Day 12):** Word Embeddings
- Dense, continuous representations
- Semantic similarity captured
- Word2Vec, GloVe algorithms

**Coming up:**
- RNNs for sequence modeling (Day 13)
- LSTMs for long-term dependencies (Day 14)
- Attention mechanisms (Day 16)
- Transformers (Day 17)

### 8.4 Mathematical Foundation for Phase 2

**Key concepts you'll need:**
- **Vector spaces:** Word embeddings live in high-dimensional space
- **Similarity measures:** Cosine similarity, dot product
- **Probability:** Language modeling, next word prediction
- **Backpropagation through time:** For RNNs
- **Matrix operations:** Attention is a softmax-weighted sum of vectors, computed with matrix multiplications

---

## Summary

Congratulations on completing **Day 11** and entering **Phase 2: NLP**!

You've learned:

✅ The unique challenges of natural language processing  
✅ Complete text preprocessing pipeline  
✅ Multiple tokenization strategies (word, subword, character)  
✅ Vocabulary building and numericalization  
✅ Text representations (BoW, TF-IDF)  
✅ N-gram language models and text generation  
✅ Perplexity as an evaluation metric  

**Key Insight:** Traditional NLP methods have significant limitations (no semantics, limited context) that the neural approaches in the coming days are designed to address.

**Time spent:** ~3-4 hours

**Next:** Day 12 - Word Embeddings (Word2Vec, GloVe) - Dense semantic representations that revolutionized NLP!