# 01 - Foundations of Natural Language Processing

This notebook covers the core NLP concepts you need in order to understand how Large Language Models process text.

## Topics Covered:
- Text normalization
- Tokenization methods
- Vocabulary construction
- Encoding schemes
- Basic linguistic representations

In [None]:
import re
import string
from collections import Counter, defaultdict
import numpy as np
from typing import List, Dict, Tuple

# Sample text for demonstrations
sample_text = """
Natural Language Processing (NLP) is a fascinating field! 
It combines linguistics, computer science, and AI.
Modern LLMs like GPT-4 have revolutionized the field.
They can understand context, generate text, and even code.
"""

## 1. Text Normalization

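Text normalization converts text to a consistent form so that superficially different strings (for example, `NLP` and `nlp`, or words separated by different amounts of whitespace) are treated the same way. The function below lowercases the text and collapses runs of whitespace.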

In [None]:
def normalize_text(text: str) -> str:
    """Basic text normalization."""
    # Convert to lowercase
    text = text.lower()
    
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text)
    
    # Remove leading/trailing whitespace
    text = text.strip()
    
    return text

# Demonstrate normalization
print("Original:")
print(repr(sample_text))
print("\nNormalized:")
print(repr(normalize_text(sample_text)))
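
The `normalize_text` function above handles only case and whitespace. Real text can also contain strings that look the same but use different Unicode code points. As a minimal sketch, assuming the standard-library `unicodedata` module is acceptable here, the hypothetical helper `normalize_unicode` (not part of the original notebook) applies NFKC normalization and case folding before collapsing whitespace:

```python
import re
import unicodedata


def normalize_unicode(text: str) -> str:
    """Hypothetical helper: NFKC-normalize, case-fold, and collapse whitespace."""
    # NFKC maps compatibility characters (ligatures, full-width letters)
    # to their ordinary equivalents.
    text = unicodedata.normalize("NFKC", text)
    # casefold() is a more aggressive lower() meant for caseless matching.
    text = text.casefold()
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


# "\ufb01" is the "fi" ligature; "\uff2e\uff2c\uff30" is full-width "NLP".
print(normalize_unicode("\ufb01ne   \uff2e\uff2c\uff30\ttext"))  # fine nlp text
```

Whether NFKC is appropriate depends on the task, since it discards distinctions (such as full-width versus half-width forms) that some applications need to keep.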

## 2. Tokenization Methods

### 2.1 Word Tokenization

In [None]:
def word_tokenize(text: str) -> List[str]:
    """Simple word tokenization."""
    # Remove punctuation and split on whitespace
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.split()

def advanced_word_tokenize(text: str) -> List[str]:
    """More sophisticated word tokenization."""
    # Keep contractions together and emit common punctuation marks as separate tokens
    pattern = r"\b\w+(?:'\w+)?\b|[.,!?;]"
    return re.findall(pattern, text)

# Compare tokenization methods
text = "I can't believe it's working! Amazing, isn't it?"
print("Text:", text)
print("Simple:", word_tokenize(text))
print("Advanced:", advanced_word_tokenize(text))
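
Both tokenizers above break up a hyphenated token such as `GPT-4` from the sample text: `word_tokenize` deletes the hyphen and returns `GPT4`, while `advanced_word_tokenize` returns `GPT` and `4` separately. The following is a sketch of one possible alternative, using a hypothetical function `hyphen_aware_tokenize` (not part of the original notebook) whose pattern keeps internal hyphens and apostrophes and emits every other punctuation mark as its own token:

```python
import re
from typing import List


def hyphen_aware_tokenize(text: str) -> List[str]:
    """Hypothetical variant: keep hyphenated words and contractions whole."""
    # \w+(?:[-']\w+)* : a word, optionally followed by -part or 'part groups
    # [^\w\s]         : any single character that is neither word nor space
    pattern = r"\w+(?:[-']\w+)*|[^\w\s]"
    return re.findall(pattern, text)


print(hyphen_aware_tokenize("Modern LLMs like GPT-4 can't fail."))
# ['Modern', 'LLMs', 'like', 'GPT-4', "can't", 'fail', '.']
```

No single regular expression suits every corpus; the pattern is an illustration of the trade-off, not a recommended default.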

### 2.2 Character Tokenization

In [None]:
def char_tokenize(text: str) -> List[str]:
    """Character-level tokenization."""
    return list(text)

# Demonstrate character tokenization
text = "Hello, World!"
print("Text:", text)
print("Characters:", char_tokenize(text))
print("Unique chars:", sorted(set(char_tokenize(text))))
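
Character tokens still have to be mapped to integer IDs before a model can consume them. A minimal sketch of that mapping, using the hypothetical names `char_to_id` and `id_to_char` (assumptions for illustration, not from the original notebook):

```python
text = "Hello, World!"

# One ID per distinct character, assigned in sorted order for reproducibility.
chars = sorted(set(text))
char_to_id = {ch: i for i, ch in enumerate(chars)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

# Encode the text as IDs, then decode back to verify the round trip.
ids = [char_to_id[ch] for ch in text]
decoded = "".join(id_to_char[i] for i in ids)

print("Vocabulary size:", len(chars))
print("IDs:", ids)
print("Round trip OK:", decoded == text)
```

The vocabulary stays tiny (10 distinct characters here), but the sequence is as long as the text itself, which is the trade-off that subword methods try to balance.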

### 2.3 Subword Tokenization (Simplified BPE)

In [None]:
class SimpleBPE:
    """Simplified Byte Pair Encoding implementation."""
    
    def __init__(self, vocab_size: int = 1000):
        self.vocab_size = vocab_size
        self.vocab = {}
        self.merges = {}
    
    def get_pairs(self, word: List[str]) -> Dict[Tuple[str, str], int]:
        """Get all adjacent pairs in a word."""
        pairs = defaultdict(int)
        for i in range(len(word) - 1):
            pairs[(word[i], word[i + 1])] += 1
        return pairs
    
    def train(self, texts: List[str]):
        """Train BPE on a corpus."""
        # Initialize with character vocabulary
        vocab = set()
        word_freqs = defaultdict(int)
        
        # Count word frequencies and collect characters
        for text in texts:
            words = text.split()
            for word in words:
                word_freqs[word] += 1
                vocab.update(word)
        
        # Convert words to character lists
        word_splits = {word: list(word) for word in word_freqs}
        
        # Iteratively merge most frequent pairs
        for _ in range(self.vocab_size - len(vocab)):
            pairs = defaultdict(int)
            
            # Count all pairs
            for word, freq in word_freqs.items():
                word_pairs = self.get_pairs(word_splits[word])
                for pair, count in word_pairs.items():
                    pairs[pair] += count * freq
            
            if not pairs:
                break
            
            # Find most frequent pair
            best_pair = max(pairs, key=pairs.get)
            
            # Merge the pair in all words
            for word in word_splits:
                new_word = []
                i = 0
                while i < len(word_splits[word]):
                    if (i < len(word_splits[word]) - 1 and 
                        word_splits[word][i] == best_pair[0] and 
                        word_splits[word][i + 1] == best_pair[1]):
                        new_word.append(best_pair[0] + best_pair[1])
                        i += 2
                    else:
                        new_word.append(word_splits[word][i])
                        i += 1
                word_splits[word] = new_word
            
            vocab.add(best_pair[0] + best_pair[1])
            self.merges[best_pair] = best_pair[0] + best_pair[1]
        
        self.vocab = {token: i for i, token in enumerate(sorted(vocab))}
    
    def encode(self, text: str) -> List[str]:
        """Encode text using learned BPE."""
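        words = text.split()
        encoded = []
        
        # Rank each merge by the order in which it was learned (dicts preserve
        # insertion order), so that earlier merges are applied first.
        ranks = {pair: rank for rank, pair in enumerate(self.merges)}
        
        for word in words:
            word_tokens = list(word)
            
            # Apply merges
            while len(word_tokens) > 1:
                pairs = self.get_pairs(word_tokens)
                if not pairs:
                    break
                
                # Pick the adjacent pair that was learned earliest
                bigram = min(pairs, key=lambda pair: ranks.get(pair, float('inf')))
                if bigram not in self.merges:
                    break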
                
                # Merge the pair
                new_word = []
                i = 0
                while i < len(word_tokens):
                    if (i < len(word_tokens) - 1 and 
                        word_tokens[i] == bigram[0] and 
                        word_tokens[i + 1] == bigram[1]):
                        new_word.append(self.merges[bigram])
                        i += 2
                    else:
                        new_word.append(word_tokens[i])
                        i += 1
                word_tokens = new_word
            
            encoded.extend(word_tokens)
        
        return encoded

# Demonstrate BPE
corpus = [
    "hello world",
    "hello there",
    "world peace",
    "hello hello world"
]

bpe = SimpleBPE(vocab_size=20)
bpe.train(corpus)

print("Learned vocabulary:", list(bpe.vocab.keys())[:10])
print("Sample encoding:", bpe.encode("hello world"))
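
To make the training loop in `SimpleBPE` easier to follow, the sketch below traces the first three merges on the same corpus. It is a standalone illustration with hypothetical helper names (`count_pairs`, `merge_pair`), not a replacement for the class above:

```python
from collections import Counter

corpus = ["hello world", "hello there", "world peace", "hello hello world"]

# Word frequencies, with each word stored as a tuple of symbols.
word_freqs = Counter(word for text in corpus for word in text.split())
splits = {word: tuple(word) for word in word_freqs}


def count_pairs(splits, word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, symbols in splits.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += word_freqs[word]
    return pairs


def merge_pair(symbols, pair):
    """Replace each occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if symbols[i:i + 2] == pair:
            merged.append(pair[0] + pair[1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


merges = []
for step in range(3):
    pairs = count_pairs(splits, word_freqs)
    best = max(pairs, key=pairs.get)  # ties resolve to the first pair seen
    merges.append(best)
    splits = {w: merge_pair(s, best) for w, s in splits.items()}
    print(f"merge {step + 1}: {best} (count {pairs[best]})")
```

The first merge is `('h', 'e')` with a count of 5, because `he` occurs in `hello` (four times in the corpus) and in `there` (once). Later merges depend on how ties are broken, which is an implementation detail that real BPE tokenizers pin down explicitly.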

## 3. Vocabulary Construction

In [None]:
def build_vocabulary(texts: List[str], min_freq: int = 2) -> Dict[str, int]:
    """Build vocabulary from texts with frequency filtering."""
    word_counts = Counter()
    
    for text in texts:
        tokens = word_tokenize(normalize_text(text))
        word_counts.update(tokens)
    
    # Reserve the first IDs for special tokens
    vocab = {"<UNK>": 0, "<PAD>": 1}
    
    # Keep only words that meet the minimum frequency
    for word, count in word_counts.items():
        if count >= min_freq:
            vocab[word] = len(vocab)
    
    return vocab

# Build vocabulary from sample texts
texts = [
    "Natural language processing is amazing",
    "Language models are powerful tools",
    "Processing natural language requires understanding",
    "Amazing tools for language understanding"
]

vocab = build_vocabulary(texts, min_freq=1)
print("Vocabulary size:", len(vocab))
print("Sample vocabulary:", dict(list(vocab.items())[:10]))
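
`build_vocabulary` assigns IDs in the order in which words are first seen, so the mapping changes if the input texts are reordered. A sketch of one alternative, using a hypothetical function `build_sorted_vocabulary` that assigns IDs by descending frequency and breaks ties alphabetically. For brevity it assumes lowercasing and whitespace splitting instead of the notebook's `normalize_text` and `word_tokenize`:

```python
from collections import Counter
from typing import Dict, List


def build_sorted_vocabulary(texts: List[str], min_freq: int = 1) -> Dict[str, int]:
    """Hypothetical variant: IDs by descending frequency, then alphabetically."""
    counts = Counter(word for text in texts for word in text.lower().split())
    vocab = {"<UNK>": 0, "<PAD>": 1}  # same special tokens as build_vocabulary
    # Sorting by (-count, word) gives frequent words small IDs and stable ties.
    for word, count in sorted(counts.items(), key=lambda item: (-item[1], item[0])):
        if count >= min_freq:
            vocab[word] = len(vocab)
    return vocab


texts = [
    "Natural language processing is amazing",
    "Language models are powerful tools",
    "Processing natural language requires understanding",
    "Amazing tools for language understanding",
]

sorted_vocab = build_sorted_vocabulary(texts, min_freq=2)
print(sorted_vocab)
```

With `min_freq=2`, only six of the twelve distinct words survive, and `language` (four occurrences) receives the first non-special ID.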

## 4. Encoding Schemes

### 4.1 One-hot Encoding

In [None]:
def one_hot_encode(tokens: List[str], vocab: Dict[str, int]) -> np.ndarray:
    """Convert tokens to one-hot encoded vectors."""
    vocab_size = len(vocab)
    encoded = np.zeros((len(tokens), vocab_size))
    
    for i, token in enumerate(tokens):
        if token in vocab:
            encoded[i, vocab[token]] = 1
        else:
            encoded[i, vocab["<UNK>"]] = 1
    
    return encoded

# Demonstrate one-hot encoding
tokens = ["natural", "language", "processing"]
one_hot = one_hot_encode(tokens, vocab)
print("Tokens:", tokens)
print("One-hot shape:", one_hot.shape)
print("First token encoding:", one_hot[0][:10])  # Show first 10 dimensions
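
One-hot vectors are mostly zeros, so they are rarely materialized for large vocabularies. The sketch below illustrates why integer indices are enough: multiplying a one-hot matrix by a weight matrix selects the same rows as indexing that matrix directly. The `embedding` table here is a hypothetical, randomly initialized array used only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 6, 4

# Hypothetical embedding table: one row of weights per vocabulary entry.
embedding = rng.normal(size=(vocab_size, embed_dim))

token_ids = np.array([2, 5, 2])

# Rows of the identity matrix are exactly the one-hot vectors.
one_hot = np.eye(vocab_size)[token_ids]  # shape (3, 6)

via_matmul = one_hot @ embedding   # dense multiply, O(vocab_size) per token
via_lookup = embedding[token_ids]  # direct row lookup

print("Same result:", np.allclose(via_matmul, via_lookup))  # Same result: True
```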

### 4.2 Integer Encoding

In [None]:
def integer_encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Convert tokens to integer indices."""
    encoded = []
    for token in tokens:
        if token in vocab:
            encoded.append(vocab[token])
        else:
            encoded.append(vocab["<UNK>"])
    return encoded

def integer_decode(indices: List[int], vocab: Dict[str, int]) -> List[str]:
    """Convert integer indices back to tokens."""
    # Create reverse vocabulary
    reverse_vocab = {v: k for k, v in vocab.items()}
    return [reverse_vocab.get(idx, "<UNK>") for idx in indices]

# Demonstrate integer encoding
tokens = ["natural", "language", "processing", "unknown_word"]
encoded = integer_encode(tokens, vocab)
decoded = integer_decode(encoded, vocab)

print("Original tokens:", tokens)
print("Integer encoded:", encoded)
print("Decoded tokens:", decoded)
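
The vocabulary built earlier reserves a `<PAD>` token that the examples above never use. Its usual purpose is to bring sequences of different lengths to a common length so they can be stacked into one array. A minimal sketch, assuming a hypothetical helper `pad_sequences` and the `<PAD>` ID of 1 from `build_vocabulary`:

```python
from typing import List


def pad_sequences(sequences: List[List[int]], pad_id: int = 1) -> List[List[int]]:
    """Hypothetical helper: right-pad every sequence to the longest length."""
    max_len = max((len(seq) for seq in sequences), default=0)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


batch = [[2, 3, 4], [3, 5], [6]]
print(pad_sequences(batch))  # [[2, 3, 4], [3, 5, 1], [6, 1, 1]]
```

Right-padding is one convention among several; some models pad on the left or truncate long sequences instead.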

## 5. Basic Linguistic Representations

### 5.1 N-grams

In [None]:
def generate_ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Generate n-grams from tokens."""
    if n <= 0:
        return []
    
    ngrams = []
    for i in range(len(tokens) - n + 1):
        ngrams.append(tuple(tokens[i:i + n]))
    
    return ngrams

# Demonstrate n-grams
text = "natural language processing is fascinating"
tokens = word_tokenize(text)

print("Tokens:", tokens)
print("Unigrams:", generate_ngrams(tokens, 1))
print("Bigrams:", generate_ngrams(tokens, 2))
print("Trigrams:", generate_ngrams(tokens, 3))
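
N-grams become useful once they are counted. The sketch below shows an equivalent way to generate them with `zip` (the hypothetical function `ngrams_zip`) and then tallies bigram frequencies with `Counter`. The example sentence is made up for illustration:

```python
from collections import Counter
from typing import List, Tuple


def ngrams_zip(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Hypothetical variant: build n-grams by zipping n shifted copies of tokens."""
    if n <= 0:
        return []
    # zip stops at the shortest slice, which yields len(tokens) - n + 1 tuples.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "the cat sat on the cat mat".split()
bigram_counts = Counter(ngrams_zip(tokens, 2))

print(bigram_counts.most_common(1))  # [(('the', 'cat'), 2)]
```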

### 5.2 Bag of Words

In [None]:
def bag_of_words(texts: List[str], vocab: Dict[str, int]) -> np.ndarray:
    """Convert texts to bag-of-words representation."""
    bow_matrix = np.zeros((len(texts), len(vocab)))
    
    for i, text in enumerate(texts):
        tokens = word_tokenize(normalize_text(text))
        token_counts = Counter(tokens)
        
        for token, count in token_counts.items():
            if token in vocab:
                bow_matrix[i, vocab[token]] = count
    
    return bow_matrix

# Demonstrate bag of words
sample_texts = [
    "natural language processing",
    "language models are powerful",
    "processing natural language"
]

bow = bag_of_words(sample_texts, vocab)
print("BoW shape:", bow.shape)
print("First document BoW:", bow[0][:10])  # Show first 10 features
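
A bag of words keeps counts and discards word order. The first and third sample documents above contain the same words in a different order, so they receive identical vectors. A short standalone check using only `Counter`:

```python
from collections import Counter

doc_a = "natural language processing"
doc_b = "processing natural language"

# Counters compare equal when every word has the same count, regardless of order.
same_bag = Counter(doc_a.split()) == Counter(doc_b.split())
print("Identical bags:", same_bag)  # Identical bags: True
```

This is the main limitation that n-grams partly address, since they retain some local ordering.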

### 5.3 TF-IDF

In [None]:
def compute_tfidf(bow_matrix: np.ndarray) -> np.ndarray:
    """Compute TF-IDF from bag-of-words matrix."""
    # Term frequency (normalized by document length)
    tf = bow_matrix / (bow_matrix.sum(axis=1, keepdims=True) + 1e-10)
    
    # Document frequency
    df = (bow_matrix > 0).sum(axis=0)
    
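    # Inverse document frequency: a term found in every document gets idf = 0;
    # the epsilon avoids division by zero for terms found in no document
    idf = np.log(bow_matrix.shape[0] / (df + 1e-10))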
    
    # TF-IDF
    tfidf = tf * idf
    
    return tfidf

# Demonstrate TF-IDF
tfidf_matrix = compute_tfidf(bow)
print("TF-IDF shape:", tfidf_matrix.shape)
print("First document TF-IDF:", tfidf_matrix[0][:10])
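
With the unsmoothed formula used in `compute_tfidf`, a term that occurs in every document has an IDF of `log(1) = 0` and therefore a TF-IDF weight of zero. The sketch below compares that with a smoothed variant, `log((1 + N) / (1 + df)) + 1`, which to my knowledge matches the default in scikit-learn's `TfidfTransformer`. The small `bow` matrix is toy data invented for this illustration:

```python
import numpy as np

# Toy counts; columns stand for: natural, language, processing, models.
bow = np.array([
    [1, 1, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 1, 0],
], dtype=float)

n_docs = bow.shape[0]
df = (bow > 0).sum(axis=0)  # document frequency per term: [2, 3, 2, 1]

# Unsmoothed IDF (safe here because every term occurs in at least one document).
idf_plain = np.log(n_docs / df)

# Smoothed IDF: acts as if one extra document contained every term, then adds 1.
idf_smooth = np.log((1 + n_docs) / (1 + df)) + 1

print("Plain IDF:   ", np.round(idf_plain, 3))
print("Smoothed IDF:", np.round(idf_smooth, 3))
```

The second column ("language") has a plain IDF of exactly 0 and a smoothed IDF of exactly 1, so the smoothed version keeps ubiquitous terms in the representation with a low weight.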

## 6. Exercises

Try these exercises to reinforce your understanding:

1. **Text Preprocessing Pipeline**: Create a complete preprocessing pipeline that handles:
   - Case normalization
   - Punctuation handling
   - Number normalization
   - Stopword removal

2. **Custom Tokenizer**: Implement a tokenizer that can handle:
   - URLs and email addresses
   - Hashtags and mentions
   - Emoticons and emojis

3. **Vocabulary Analysis**: Analyze a text corpus to find:
   - Most frequent words
   - Vocabulary growth curve
   - Out-of-vocabulary rate for different vocabulary sizes

4. **N-gram Language Model**: Build a simple n-gram language model that can:
   - Calculate n-gram probabilities
   - Generate text using the model
   - Handle unseen n-grams with smoothing

## Summary

In this notebook, we covered:

- **Text Normalization**: Converting text to standard formats
- **Tokenization**: Breaking text into meaningful units
  - Word-level, character-level, and subword tokenization
  - Byte Pair Encoding (BPE) algorithm
- **Vocabulary Construction**: Building and managing vocabularies
- **Encoding Schemes**: Converting text to numerical representations
- **Linguistic Representations**: N-grams, BoW, and TF-IDF

These concepts form the foundation for understanding how modern language models process and represent text. In the next notebook, we'll explore neural network fundamentals that build upon these representations.