# 01 - Foundations of Natural Language Processing

This notebook covers the core NLP concepts that underpin Large Language Models.

## Topics Covered:
- Text normalization
- Tokenization methods
- Vocabulary construction
- Encoding schemes
- Basic linguistic representations

In [1]:
pip install numpy pandas matplotlib scikit-learn


In [2]:
import re
import string
from collections import Counter, defaultdict
import numpy as np
from typing import List, Dict, Tuple

# Sample text for demonstrations
sample_text = """
Natural Language Processing (NLP) is a fascinating field! 
It combines linguistics, computer science, and AI.
Modern LLMs like GPT-4 have revolutionized the field.
They can understand context, generate text, and even code.
"""

## 1. Text Normalization

Text normalization converts text to a consistent, standard form (for example, uniform case and whitespace) so that superficially different strings are treated alike.

In [3]:
def normalize_text(text: str) -> str:
    """Basic text normalization."""
    # Convert to lowercase
    text = text.lower()
    
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text)
    
    # Remove leading/trailing whitespace
    text = text.strip()
    
    return text

# Demonstrate normalization
print("Original:")
print(repr(sample_text))
print("\nNormalized:")
print(repr(normalize_text(sample_text)))

Original:
'\nNatural Language Processing (NLP) is a fascinating field! \nIt combines linguistics, computer science, and AI.\nModern LLMs like GPT-4 have revolutionized the field.\nThey can understand context, generate text, and even code.\n'

Normalized:
'natural language processing (nlp) is a fascinating field! it combines linguistics, computer science, and ai. modern llms like gpt-4 have revolutionized the field. they can understand context, generate text, and even code.'
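
Real-world text often needs Unicode normalization as well, which `normalize_text` above does not attempt. The following is a minimal sketch, assuming NFKC folding and `str.casefold()` are acceptable for the task at hand; the helper name `normalize_unicode` is hypothetical and not part of this notebook.

```python
import re
import unicodedata

def normalize_unicode(text: str) -> str:
    """Hypothetical helper: Unicode-aware variant of normalize_text."""
    # NFKC maps compatibility characters (full-width letters, ligatures)
    # to their ordinary equivalents
    text = unicodedata.normalize("NFKC", text)
    # casefold() is a more aggressive lower() meant for caseless matching
    text = text.casefold()
    # Collapse whitespace runs and trim both ends
    return re.sub(r"\s+", " ", text).strip()

# Full-width "NLP" followed by two spaces and a word using the "fi" ligature
print(normalize_unicode("\uff2e\uff2c\uff30  is \ufb01ne"))  # → nlp is fine
```

Whether NFKC is appropriate depends on the application: it discards distinctions (such as ligatures or full-width forms) that some tasks may want to keep.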


## 2. Tokenization Methods

### 2.1 Word Tokenization

In [4]:
def word_tokenize(text: str) -> List[str]:
    """Simple word tokenization."""
    # Remove punctuation and split on whitespace
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.split()

def advanced_word_tokenize(text: str) -> List[str]:
    """More sophisticated word tokenization."""
    # Keep contractions together, handle punctuation better
    pattern = r"\b\w+(?:'\w+)?\b|[.,!?;]"
    return re.findall(pattern, text)

# Compare tokenization methods
text = "I can't believe it's working! Amazing, isn't it?"
print("Text:", text)
print("Simple:", word_tokenize(text))
print("Advanced:", advanced_word_tokenize(text))

Text: I can't believe it's working! Amazing, isn't it?
Simple: ['I', 'cant', 'believe', 'its', 'working', 'Amazing', 'isnt', 'it']
Advanced: ['I', "can't", 'believe', "it's", 'working', '!', 'Amazing', ',', "isn't", 'it', '?']


### 2.2 Character Tokenization

In [5]:
def char_tokenize(text: str) -> List[str]:
    """Character-level tokenization."""
    return list(text)

# Demonstrate character tokenization
text = "Hello, World!"
print("Text:", text)
print("Characters:", char_tokenize(text))
print("Unique chars:", sorted(set(char_tokenize(text))))

Text: Hello, World!
Characters: ['H', 'e', 'l', 'l', 'o', ',', ' ', 'W', 'o', 'r', 'l', 'd', '!']
Unique chars: [' ', '!', ',', 'H', 'W', 'd', 'e', 'l', 'o', 'r']


### 2.3 Subword Tokenization (Simplified BPE)

**Byte Pair Encoding (BPE)** is a subword tokenization algorithm that learns to merge the most frequent character pairs iteratively. This creates a vocabulary that balances between character-level and word-level representations.

**How BPE Works:**
1. **Initialize**: Start with individual characters as the base vocabulary
2. **Count Pairs**: Find all adjacent character pairs in the corpus
3. **Merge Most Frequent**: Merge the most frequent pair into a single token
4. **Repeat**: Continue until reaching desired vocabulary size
5. **Encode**: Use learned merges to tokenize new text

**Example Process:**
- Corpus: ['hello', 'world', 'hello']
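- Initial symbols: 'hello' → ['h', 'e', 'l', 'l', 'o'], 'world' → ['w', 'o', 'r', 'l', 'd']
- Most frequent pair: ('l', 'l') occurs twice, tied with ('h', 'e'), ('e', 'l'), and ('l', 'o'); suppose the tie is broken in its favor → merge to 'll'
- Next: ('e', 'll'), again one of several tied pairs → merge to 'ell'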
- Continue until vocabulary size reached

**Advantages:**
- Handles unknown words by breaking them into subwords
- Balances vocabulary size with representation quality
- Works well across different languages
- Used in modern models such as GPT (BERT uses WordPiece, a closely related subword algorithm)
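
The pair-counting step for the example corpus can be sketched as follows. This is only an illustration of step 2, not a full BPE trainer, and the helper name `count_pairs` is hypothetical. It also shows that several pairs tie for the highest count in this tiny corpus, so which pair gets merged first depends on the tie-breaking rule.

```python
from collections import Counter

def count_pairs(corpus):
    """Hypothetical helper: count adjacent symbol pairs over a list of words."""
    counts = Counter()
    for word in corpus:
        symbols = list(word)  # start from single characters
        counts.update(zip(symbols, symbols[1:]))
    return counts

pair_counts = count_pairs(["hello", "world", "hello"])
top = max(pair_counts.values())

# All pairs that share the highest count
tied = sorted(pair for pair, n in pair_counts.items() if n == top)
print(top, tied)
# → 2 [('e', 'l'), ('h', 'e'), ('l', 'l'), ('l', 'o')]
```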

In [6]:
from typing import List, Dict, Tuple
from collections import defaultdict

class SimpleBPE:
    """
    Simplified Byte Pair Encoding implementation.
    
    This class demonstrates the core BPE algorithm:
    - Training: Learn merge rules from a corpus
    - Encoding: Apply learned rules to tokenize new text
    
    Attributes:
        vocab_size (int): Maximum vocabulary size
        vocab (dict): Final vocabulary mapping tokens to IDs
        merges (dict): Learned merge rules (pair -> merged_token)
    """
    
    def __init__(self, vocab_size: int = 1000):
        self.vocab_size = vocab_size
        self.vocab = {}
        self.merges = {}
    
    def get_pairs(self, word: List[str]) -> Dict[Tuple[str, str], int]:
        """
        Extract all adjacent character pairs from a word.
        
        Args:
            word: List of characters/tokens
        
        Returns:
            Dictionary mapping pairs to their frequency
        
        Example:
            get_pairs(['h', 'e', 'l', 'l', 'o'])
            # Returns: {('h', 'e'): 1, ('e', 'l'): 1, ('l', 'l'): 1, ('l', 'o'): 1}
        """
        pairs = defaultdict(int)
        for i in range(len(word) - 1):
            pairs[(word[i], word[i + 1])] += 1
        return pairs
    
    def train(self, texts: List[str]):
        """
        Train BPE on a corpus by learning merge rules.
        
        Training Process:
        1. Initialize vocabulary with individual characters
        2. Count word frequencies in corpus
        3. Iteratively find and merge most frequent character pairs
        4. Stop when reaching desired vocabulary size
        
        Args:
            texts: List of training texts
        """
        # Initialize with character vocabulary
        vocab = set()
        word_freqs = defaultdict(int)
        
        # Count word frequencies and collect characters
        for text in texts:
            words = text.split()
            for word in words:
                word_freqs[word] += 1
                vocab.update(word)  # Add all characters to vocab
        
        # Convert words to character lists for processing
        word_splits = {word: list(word) for word in word_freqs}
        
        # Iteratively merge most frequent pairs
        for _ in range(self.vocab_size - len(vocab)):
            pairs = defaultdict(int)
            
            # Count all pairs across all words (weighted by frequency)
            for word, freq in word_freqs.items():
                word_pairs = self.get_pairs(word_splits[word])
                for pair, count in word_pairs.items():
                    pairs[pair] += count * freq
            
            if not pairs:
                break
            
            # Find most frequent pair to merge
            best_pair = max(pairs, key=pairs.get)
            
            # Apply merge to all words
            for word in word_splits:
                new_word = []
                i = 0
                while i < len(word_splits[word]):
                    # Check if current position matches the pair to merge
                    if (i < len(word_splits[word]) - 1 and 
                        word_splits[word][i] == best_pair[0] and 
                        word_splits[word][i + 1] == best_pair[1]):
                        # Merge the pair
                        new_word.append(best_pair[0] + best_pair[1])
                        i += 2
                    else:
                        # Keep character as is
                        new_word.append(word_splits[word][i])
                        i += 1
                word_splits[word] = new_word
            
            # Add merged token to vocabulary and record merge rule
            vocab.add(best_pair[0] + best_pair[1])
            self.merges[best_pair] = best_pair[0] + best_pair[1]
        
        # Create final vocabulary mapping
        self.vocab = {token: i for i, token in enumerate(sorted(vocab))}
    
    def encode(self, text: str) -> List[str]:
        """
        Encode text using learned BPE merges.
        
        Encoding Process:
        1. Split text into words
        2. For each word, start with character-level tokens
        3. Apply learned merges in the order they were learned
        4. Continue until no more merges can be applied
        
        Args:
            text: Input text to encode
        
        Returns:
            List of subword tokens
        
        Example:
            encode('hello world')
            # Might return: ['hell', 'o', 'w', 'o', 'r', 'l', 'd']
        """
        words = text.split()
        encoded = []
        
        for word in words:
            # Start with character-level tokenization
            word_tokens = list(word)
            
            # Apply merges iteratively
            while len(word_tokens) > 1:
                pairs = self.get_pairs(word_tokens)
                if not pairs:
                    break
                
                # Find the pair that appears in our learned merges
                valid_pairs = [pair for pair in pairs if pair in self.merges]
                if not valid_pairs:
                    break
                # Use the merge that was learned earliest (dicts preserve insertion order)
                merge_rank = {pair: rank for rank, pair in enumerate(self.merges)}
                bigram = min(valid_pairs, key=merge_rank.__getitem__)
                
                # Apply the merge
                new_word = []
                i = 0
                while i < len(word_tokens):
                    if (i < len(word_tokens) - 1 and 
                        word_tokens[i] == bigram[0] and 
                        word_tokens[i + 1] == bigram[1]):
                        # Apply merge
                        new_word.append(self.merges[bigram])
                        i += 2
                    else:
                        # Keep token as is
                        new_word.append(word_tokens[i])
                        i += 1
                word_tokens = new_word
            
            encoded.extend(word_tokens)
        
        return encoded

# Demonstrate BPE with detailed output
corpus = [
    "hello world",
    "hello there",
    "world peace",
    "hello hello world"
]

print("Training BPE on corpus:", corpus)
bpe = SimpleBPE(vocab_size=20)
bpe.train(corpus)

print("\nLearned vocabulary:", list(bpe.vocab.keys())[:10])
print("Learned merges:", list(bpe.merges.items())[:5])
print("\nEncoding examples:")
for text in ["hello world", "hello", "world", "peace"]:
    encoded = bpe.encode(text)
    print(f"'{text}' -> {encoded}")

Training BPE on corpus: ['hello world', 'hello there', 'world peace', 'hello hello world']

Learned vocabulary: ['a', 'c', 'd', 'e', 'h', 'he', 'hel', 'hell', 'hello', 'l']
Learned merges: [(('h', 'e'), 'he'), (('he', 'l'), 'hel'), (('hel', 'l'), 'hell'), (('hell', 'o'), 'hello'), (('w', 'o'), 'wo')]

Encoding examples:
'hello world' -> ['hello', 'world']
'hello' -> ['hello']
'world' -> ['world']
'peace' -> ['p', 'e', 'a', 'c', 'e']


## 3. Vocabulary Construction

Vocabulary construction is the process of creating a mapping between words/tokens and numerical indices that machine learning models can work with.

**Core Concept:**
- Convert text tokens into a fixed set of unique identifiers
- Create a dictionary mapping each unique token to a number
- Handle unknown words that weren't seen during training

**Key Steps:**
1. **Collect all unique tokens** from your training data
2. **Sort by frequency** (by convention, more common words get lower indices; the simple implementation below keeps first-seen order instead)
3. **Add special tokens** like `<UNK>` (unknown), `<PAD>` (padding), `<START>`, `<END>`
4. **Set vocabulary size limit** to control model complexity

**Example Process:**
```
Text: "the cat sat on the mat"
Tokens: ["the", "cat", "sat", "on", "the", "mat"]
Unique tokens: ["the", "cat", "sat", "on", "mat"]
Vocabulary: {
    "<UNK>": 0,
    "<PAD>": 1,
    "the": 2,    # most frequent
    "cat": 3,
    "sat": 4,
    "on": 5,
    "mat": 6
}
```

**Why It Matters:**
- Models need numbers, not words
- Vocabulary size directly affects model parameters
- Determines how the model handles new/rare words
- Foundation for embedding layers in neural networks

The vocabulary becomes your model's "dictionary" - any word not in it gets mapped to `<UNK>`, which is why vocabulary construction strategy significantly impacts model performance.
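
The key steps above (frequency ordering, special tokens, and a size limit) can be sketched as follows. The function name `build_vocab_by_frequency` and its `max_size` parameter are hypothetical; the special-token indices follow the `build_vocabulary` function used in this notebook.

```python
from collections import Counter
from typing import Dict, List, Optional

def build_vocab_by_frequency(tokens: List[str],
                             max_size: Optional[int] = None) -> Dict[str, int]:
    """Hypothetical helper: special tokens first, then tokens by descending count."""
    specials = ["<UNK>", "<PAD>"]
    vocab = {tok: i for i, tok in enumerate(specials)}
    # Number of slots left for real tokens (None means no limit)
    budget = None if max_size is None else max(max_size - len(specials), 0)
    # most_common() orders by descending count; ties keep first-seen order
    for token, _ in Counter(tokens).most_common(budget):
        vocab[token] = len(vocab)
    return vocab

toy_tokens = "the cat sat on the mat".split()
print(build_vocab_by_frequency(toy_tokens, max_size=4))
# → {'<UNK>': 0, '<PAD>': 1, 'the': 2, 'cat': 3}
```

With `max_size=4`, only two slots remain after the special tokens, so `sat`, `on`, and `mat` would all be encoded as `<UNK>`.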

In [8]:
from collections import Counter
from typing import List, Dict, Tuple
import re
import string

def normalize_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r'\s+', ' ', text)
    text = text.strip()
    return text

def word_tokenize(text: str) -> List[str]:
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.split()

def build_vocabulary(texts: List[str], min_freq: int = 2) -> Dict[str, int]:
    """Build vocabulary from texts with frequency filtering."""
    word_counts = Counter()
    
    for text in texts:
        tokens = word_tokenize(normalize_text(text))
        word_counts.update(tokens)
    
    # Reserve the lowest indices for the special tokens
    vocab = {"<UNK>": 0, "<PAD>": 1}
    
    # Keep words that meet the minimum frequency, indexed in first-seen order
    for word, count in word_counts.items():
        if count >= min_freq:
            vocab[word] = len(vocab)
    
    return vocab

# Build vocabulary from sample texts
texts = [
    "Natural language processing is amazing",
    "Language models are powerful tools",
    "Processing natural language requires understanding",
    "Amazing tools for language understanding"
]

vocab = build_vocabulary(texts, min_freq=1)
print("Vocabulary size:", len(vocab))
print("Sample vocabulary:", dict(list(vocab.items())[:20]))

Vocabulary size: 14
Sample vocabulary: {'<UNK>': 0, '<PAD>': 1, 'natural': 2, 'language': 3, 'processing': 4, 'is': 5, 'amazing': 6, 'models': 7, 'are': 8, 'powerful': 9, 'tools': 10, 'requires': 11, 'understanding': 12, 'for': 13}


## 4. Encoding Schemes

### 4.1 One-hot Encoding

One-hot encoding represents each word as a binary vector where only one position is 1 (hot) and all others are 0 (cold).

**How it works:**
- Create a vector of length = vocabulary size
- Set position corresponding to word's index to 1
- All other positions remain 0

**Example:**
```
Vocabulary: {'cat': 0, 'dog': 1, 'bird': 2}

'cat'  -> [1, 0, 0]
'dog'  -> [0, 1, 0]
'bird' -> [0, 0, 1]
```

**Pros:** Simple; imposes no similarity assumptions between words.

**Cons:** Sparse and memory-hungry (one dimension per vocabulary entry); encodes no semantic relationships.
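
As a compact illustration of the example above, one-hot vectors can also be read off as rows of an identity matrix. This is a sketch using the toy vocabulary from the example; the variable names are chosen here for illustration. It also prints the storage used by each form, which hints at why one-hot vectors become expensive for large vocabularies.

```python
import numpy as np

toy_vocab = {"cat": 0, "dog": 1, "bird": 2}  # toy vocabulary from the example
indices = np.array([toy_vocab[w] for w in ["dog", "bird", "dog"]])

# Row i of the identity matrix is the one-hot vector for index i
one_hot_rows = np.eye(len(toy_vocab), dtype=np.float32)[indices]
print(one_hot_rows)

# One-hot storage grows with vocabulary size; integer indices do not
print(one_hot_rows.nbytes, "bytes as one-hot vs", indices.nbytes, "bytes as indices")
```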

In [11]:
def one_hot_encode(tokens: List[str], vocab: Dict[str, int]) -> np.ndarray:
    """Convert tokens to one-hot encoded vectors."""
    vocab_size = len(vocab)
    encoded = np.zeros((len(tokens), vocab_size))
    
    for i, token in enumerate(tokens):
        if token in vocab:
            encoded[i, vocab[token]] = 1
        else:
            encoded[i, vocab["<UNK>"]] = 1
    
    return encoded

# Demonstrate one-hot encoding
tokens = ["natural", "language", "processing"]
one_hot = one_hot_encode(tokens, vocab)
print("Tokens:", tokens)
print("One-hot shape:", one_hot.shape)
print("One-hot encodings (one row per token):")
print(one_hot)

Tokens: ['natural', 'language', 'processing']
One-hot shape: (3, 14)
One-hot encodings (one row per token):
[[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]


### 4.2 Integer Encoding

In [12]:
def integer_encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Convert tokens to integer indices."""
    encoded = []
    for token in tokens:
        if token in vocab:
            encoded.append(vocab[token])
        else:
            encoded.append(vocab["<UNK>"])
    return encoded

def integer_decode(indices: List[int], vocab: Dict[str, int]) -> List[str]:
    """Convert integer indices back to tokens."""
    # Create reverse vocabulary
    reverse_vocab = {v: k for k, v in vocab.items()}
    return [reverse_vocab.get(idx, "<UNK>") for idx in indices]

# Demonstrate integer encoding
tokens = ["natural", "language", "processing", "unknown_word"]
encoded = integer_encode(tokens, vocab)
decoded = integer_decode(encoded, vocab)

print("Original tokens:", tokens)
print("Integer encoded:", encoded)
print("Decoded tokens:", decoded)

Original tokens: ['natural', 'language', 'processing', 'unknown_word']
Integer encoded: [2, 3, 4, 0]
Decoded tokens: ['natural', 'language', 'processing', '<UNK>']
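
The `<PAD>` token in the vocabulary has not been used so far. A minimal sketch of its typical role, batching integer-encoded sequences of different lengths, is shown here. It assumes right-padding and the `<PAD>` index of 1 from the vocabulary built earlier; the helper name `pad_batch` is hypothetical.

```python
from typing import List
import numpy as np

def pad_batch(sequences: List[List[int]], pad_id: int) -> np.ndarray:
    """Hypothetical helper: right-pad integer sequences to a common length."""
    max_len = max((len(seq) for seq in sequences), default=0)
    # Start from a matrix filled with the padding index
    batch = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    for row, seq in enumerate(sequences):
        batch[row, :len(seq)] = seq  # copy the real token ids over the padding
    return batch

print(pad_batch([[2, 3, 4], [3, 7]], pad_id=1))
# → [[2 3 4]
#    [3 7 1]]
```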


## 5. Basic Linguistic Representations

### 5.1 N-grams

In [13]:
def generate_ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Generate n-grams from tokens."""
    if n <= 0:
        return []
    
    ngrams = []
    for i in range(len(tokens) - n + 1):
        ngrams.append(tuple(tokens[i:i + n]))
    
    return ngrams

# Demonstrate n-grams
text = "natural language processing is fascinating"
tokens = word_tokenize(text)

print("Tokens:", tokens)
print("Unigrams:", generate_ngrams(tokens, 1))
print("Bigrams:", generate_ngrams(tokens, 2))
print("Trigrams:", generate_ngrams(tokens, 3))

Tokens: ['natural', 'language', 'processing', 'is', 'fascinating']
Unigrams: [('natural',), ('language',), ('processing',), ('is',), ('fascinating',)]
Bigrams: [('natural', 'language'), ('language', 'processing'), ('processing', 'is'), ('is', 'fascinating')]
Trigrams: [('natural', 'language', 'processing'), ('language', 'processing', 'is'), ('processing', 'is', 'fascinating')]


### 5.2 Bag of Words

In [18]:
def bag_of_words(texts: List[str], vocab: Dict[str, int]) -> np.ndarray:
    """Convert texts to bag-of-words representation."""
    bow_matrix = np.zeros((len(texts), len(vocab)))
    
    for i, text in enumerate(texts):
        tokens = word_tokenize(normalize_text(text))
        token_counts = Counter(tokens)
        
        for token, count in token_counts.items():
            if token in vocab:
                bow_matrix[i, vocab[token]] = count
    
    return bow_matrix

# Demonstrate bag of words
sample_texts = [
    "natural language processing",
    "language models are powerful",
    "processing natural language"
]

bow = bag_of_words(sample_texts, vocab)
print("BoW shape:", bow.shape)
print("First document BoW:", bow[0])

BoW shape: (3, 14)
First document BoW: [0. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


### 5.3 TF-IDF

In [19]:
def compute_tfidf(bow_matrix: np.ndarray) -> np.ndarray:
    """Compute TF-IDF from bag-of-words matrix."""
    # Term frequency (normalized by document length)
    tf = bow_matrix / (bow_matrix.sum(axis=1, keepdims=True) + 1e-10)
    
    # Document frequency
    df = (bow_matrix > 0).sum(axis=0)
    
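    # Inverse document frequency. The 1e-10 avoids division by zero for
    # vocabulary terms that occur in no document. As a side effect, a term
    # that occurs in every document gets an IDF just below zero rather than
    # exactly 0, which shows up as tiny negative TF-IDF values (about -1e-11).
    idf = np.log(bow_matrix.shape[0] / (df + 1e-10))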
    
    # TF-IDF
    tfidf = tf * idf
    
    return tfidf

# Demonstrate TF-IDF
tfidf_matrix = compute_tfidf(bow)
print("TF-IDF shape:", tfidf_matrix.shape)
print("First document TF-IDF:", tfidf_matrix[0][:10])

TF-IDF shape: (3, 14)
First document TF-IDF: [ 0.00000000e+00  0.00000000e+00  1.35155036e-01 -1.11111120e-11
  1.35155036e-01  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00]
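
Up to the small `1e-10` stabilizer, `compute_tfidf` above corresponds to the following definition. The symbols are chosen here for illustration: $c_{t,d}$ is the count of term $t$ in document $d$, $N$ is the number of documents, and $\mathrm{df}_t$ is the number of documents containing $t$.

$$
\mathrm{tf}_{t,d} = \frac{c_{t,d}}{\sum_{t'} c_{t',d}}, \qquad
\mathrm{idf}_t = \log\frac{N}{\mathrm{df}_t}, \qquad
\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \cdot \mathrm{idf}_t
$$

For the first document, `natural` has $\mathrm{tf} = 1/3$ and $\mathrm{idf} = \log(3/2) \approx 0.405$, giving $\approx 0.135$, which matches the printed value. `language` occurs in all three documents, so its ideal IDF is $\log(3/3) = 0$; the tiny negative number in the output comes from the stabilizer. Note that library implementations often use a smoothed IDF variant and will produce different numbers.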


## 6. Exercises

Try these exercises to reinforce your understanding:

1. **Text Preprocessing Pipeline**: Create a complete preprocessing pipeline that handles:
   - Case normalization
   - Punctuation handling
   - Number normalization
   - Stopword removal

2. **Custom Tokenizer**: Implement a tokenizer that can handle:
   - URLs and email addresses
   - Hashtags and mentions
   - Emoticons and emojis

3. **Vocabulary Analysis**: Analyze a text corpus to find:
   - Most frequent words
   - Vocabulary growth curve
   - Out-of-vocabulary rate for different vocabulary sizes

4. **N-gram Language Model**: Build a simple n-gram language model that can:
   - Calculate n-gram probabilities
   - Generate text using the model
   - Handle unseen n-grams with smoothing

## Summary

In this notebook, we covered:

- **Text Normalization**: Converting text to standard formats
- **Tokenization**: Breaking text into meaningful units
  - Word-level, character-level, and subword tokenization
  - Byte Pair Encoding (BPE) algorithm
- **Vocabulary Construction**: Building and managing vocabularies
- **Encoding Schemes**: Converting text to numerical representations
- **Linguistic Representations**: N-grams, BoW, and TF-IDF

These concepts form the foundation for understanding how modern language models process and represent text. In the next notebook, we'll explore neural network fundamentals that build upon these representations.