# 01 - Foundations of Natural Language Processing

This notebook covers the core NLP concepts you need to understand Large Language Models.

## Topics Covered:
- Text normalization
- Tokenization methods
- Vocabulary construction
- Encoding schemes
- Basic linguistic representations

In [2]:
pip install numpy matplotlib


In [3]:
import numpy as np
import string
from collections import Counter
import math

## 3. Vocabulary Construction

Vocabulary construction creates a mapping between words/tokens and numerical indices.

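**Key Steps:**
1. Tokenize the training data and collect all tokens
2. Count token frequencies and sort from most to least frequent
3. Reserve the first indices for special tokens (`<PAD>` for padding, `<UNK>` for out-of-vocabulary words)
4. Assign each remaining token the next free index to create the word-to-index mapping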

In [4]:
# Step 1: Sample texts
texts = [
    "Natural language processing is amazing",
    "Language models are powerful tools", 
    "Processing natural language requires understanding"
]

print("Sample texts:")
for i, text in enumerate(texts, 1):
    print(f"{i}. {text}")

Sample texts:
1. Natural language processing is amazing
2. Language models are powerful tools
3. Processing natural language requires understanding


In [5]:
# Step 2: Tokenize and collect all words
all_tokens = []
for text in texts:
    tokens = text.lower().translate(str.maketrans('', '', string.punctuation)).split()
    all_tokens.extend(tokens)
    print(f"'{text}' -> {tokens}")

print(f"\nAll tokens: {all_tokens}")

'Natural language processing is amazing' -> ['natural', 'language', 'processing', 'is', 'amazing']
'Language models are powerful tools' -> ['language', 'models', 'are', 'powerful', 'tools']
'Processing natural language requires understanding' -> ['processing', 'natural', 'language', 'requires', 'understanding']

All tokens: ['natural', 'language', 'processing', 'is', 'amazing', 'language', 'models', 'are', 'powerful', 'tools', 'processing', 'natural', 'language', 'requires', 'understanding']


In [6]:
# Step 3: Count frequencies
word_counts = Counter(all_tokens)

print("Word frequencies:")
for word, count in word_counts.most_common():
    print(f"  '{word}': {count}")

Word frequencies:
  'language': 3
  'natural': 2
  'processing': 2
  'is': 1
  'amazing': 1
  'models': 1
  'are': 1
  'powerful': 1
  'tools': 1
  'requires': 1
  'understanding': 1


In [7]:
# Step 4: Build vocabulary
vocab = {"<PAD>": 0, "<UNK>": 1}  # Special tokens first

for word, count in word_counts.most_common():
    vocab[word] = len(vocab)
    print(f"Added '{word}' -> index {vocab[word]}")

print(f"\nFinal vocabulary: {vocab}")

Added 'language' -> index 2
Added 'natural' -> index 3
Added 'processing' -> index 4
Added 'is' -> index 5
Added 'amazing' -> index 6
Added 'models' -> index 7
Added 'are' -> index 8
Added 'powerful' -> index 9
Added 'tools' -> index 10
Added 'requires' -> index 11
Added 'understanding' -> index 12

Final vocabulary: {'<PAD>': 0, '<UNK>': 1, 'language': 2, 'natural': 3, 'processing': 4, 'is': 5, 'amazing': 6, 'models': 7, 'are': 8, 'powerful': 9, 'tools': 10, 'requires': 11, 'understanding': 12}
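
The vocabulary above defines `<PAD>` and `<UNK>` but does not yet use them. The following is a minimal sketch of how both special tokens might be applied when encoding new text to a fixed length. The names `demo_vocab`, `index_to_word`, and `encode_and_pad` are hypothetical helpers introduced for illustration, and the whitespace tokenization is a simplifying assumption.

```python
# Hypothetical names for illustration; the mapping mirrors the vocabulary built above.
demo_vocab = {'<PAD>': 0, '<UNK>': 1, 'language': 2, 'natural': 3, 'processing': 4,
              'is': 5, 'amazing': 6, 'models': 7, 'are': 8, 'powerful': 9,
              'tools': 10, 'requires': 11, 'understanding': 12}

# Inverse mapping, used to decode indices back into tokens.
index_to_word = {idx: word for word, idx in demo_vocab.items()}

def encode_and_pad(text, vocab, max_len):
    """Lowercase, split on whitespace, map to indices, then truncate or pad to max_len."""
    tokens = text.lower().split()
    ids = [vocab.get(tok, vocab['<UNK>']) for tok in tokens][:max_len]
    return ids + [vocab['<PAD>']] * (max_len - len(ids))

ids = encode_and_pad("Language models are fun", demo_vocab, max_len=6)
decoded = [index_to_word[i] for i in ids]
print(ids)      # [2, 7, 8, 1, 0, 0]
print(decoded)  # ['language', 'models', 'are', '<UNK>', '<PAD>', '<PAD>']
```

The unseen word 'fun' maps to `<UNK>`, and the two trailing positions are filled with `<PAD>` so that every encoded sentence has the same length.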


## 4.1 One-Hot Encoding

One-hot encoding represents each word as a binary vector where only one position is 1.

**Example:** For vocab {'cat': 0, 'dog': 1, 'bird': 2}
- 'cat' -> [1, 0, 0]
- 'dog' -> [0, 1, 0] 
- 'bird' -> [0, 0, 1]

In [8]:
# Step 1: Simple vocabulary
simple_vocab = {'<PAD>': 0, '<UNK>': 1, 'cat': 2, 'dog': 3, 'bird': 4}

print("Vocabulary:")
for word, idx in simple_vocab.items():
    print(f"  '{word}': {idx}")

Vocabulary:
  '<PAD>': 0
  '<UNK>': 1
  'cat': 2
  'dog': 3
  'bird': 4


In [9]:
# Step 2: One-hot encode single word
def one_hot_encode_word(word, vocab):
    vector = np.zeros(len(vocab))
    if word in vocab:
        vector[vocab[word]] = 1
    else:
        vector[vocab['<UNK>']] = 1
    return vector

word = 'cat'
encoded = one_hot_encode_word(word, simple_vocab)
print(f"'{word}' -> {encoded}")
print(f"Position {simple_vocab[word]} is 1, others are 0")

'cat' -> [0. 0. 1. 0. 0.]
Position 2 is 1, others are 0


In [10]:
# Step 3: Memory usage problem
vocab_size = 50000  # Realistic size
sentence_length = 20
memory_needed = vocab_size * sentence_length

print(f"For vocab size {vocab_size:,} and sentence length {sentence_length}:")
print(f"Memory needed: {memory_needed:,} numbers")
print(f"Sparsity: Only {(1/vocab_size)*100:.4f}% are 1s, rest are 0s")

For vocab size 50,000 and sentence length 20:
Memory needed: 1,000,000 numbers
Sparsity: Only 0.0020% are 1s, rest are 0s
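
To make the sparsity concrete, here is a small sketch that one-hot encodes a whole sequence at once and then recovers the original indices. It assumes the same five-entry vocabulary as above; `demo_vocab`, `indices`, `one_hot`, and `recovered` are illustrative names, and indexing rows of `np.eye` is just one convenient way to build the matrix.

```python
import numpy as np

# Illustrative copy of the small vocabulary used above.
demo_vocab = {'<PAD>': 0, '<UNK>': 1, 'cat': 2, 'dog': 3, 'bird': 4}
indices = np.array([demo_vocab[w] for w in ['cat', 'dog', 'bird']])

# Row i of the identity matrix is the one-hot vector for index i.
one_hot = np.eye(len(demo_vocab))[indices]
print(one_hot.shape)  # (3, 5): one row per token, one column per vocabulary entry

# Exactly one entry per row is non-zero, so argmax recovers the indices.
recovered = one_hot.argmax(axis=1)
print(recovered.tolist())  # [2, 3, 4]
```

Because the matrix carries no information beyond the index of its single 1 per row, storing the indices alone loses nothing.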


## 4.2 Integer Encoding

Integer encoding represents each word as a single integer (its vocabulary index).

**Example:** 'cat dog bird' -> [2, 3, 4]

**Pros:** Memory efficient (one integer per token)

**Cons:** If the index is used directly as a numeric feature, it implies a false ordering (cat < dog < bird)
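
The false-ordering concern can be illustrated with a short sketch: the numeric gap between two indices depends only on the order in which words were added to the vocabulary, whereas the distance between their one-hot vectors does not. The two vocabularies and the helper names `index_gap` and `one_hot_distance` below are hypothetical, chosen for illustration.

```python
import numpy as np

# Same three words, two arbitrary orderings.
vocab_a = {'cat': 0, 'dog': 1, 'bird': 2}
vocab_b = {'bird': 0, 'cat': 1, 'dog': 2}

def index_gap(w1, w2, vocab):
    """Absolute difference between the integer codes of two words."""
    return abs(vocab[w1] - vocab[w2])

def one_hot_distance(w1, w2, vocab):
    """Euclidean distance between the one-hot vectors of two words."""
    eye = np.eye(len(vocab))
    return float(np.linalg.norm(eye[vocab[w1]] - eye[vocab[w2]]))

gap_a = index_gap('cat', 'bird', vocab_a)
gap_b = index_gap('cat', 'bird', vocab_b)
dist_a = one_hot_distance('cat', 'bird', vocab_a)
dist_b = one_hot_distance('cat', 'bird', vocab_b)

print(gap_a, gap_b)  # 2 1 -> the gap changes with the ordering
print(round(dist_a, 3), round(dist_b, 3))  # 1.414 1.414 -> unaffected by ordering
```

Any model that treats the integer itself as a magnitude would therefore learn from an accident of vocabulary order.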

In [11]:
# Step 1: Encode sequence
def encode_sequence(words, vocab):
    return [vocab.get(word, vocab['<UNK>']) for word in words]

sentence = ['cat', 'and', 'dog', 'are', 'pets']
encoded = encode_sequence(sentence, simple_vocab)

print(f"Original: {sentence}")
print(f"Encoded:  {encoded}")

for word, idx in zip(sentence, encoded):
    status = "(known)" if word in simple_vocab else "(unknown)"
    print(f"  '{word}' -> {idx} {status}")

Original: ['cat', 'and', 'dog', 'are', 'pets']
Encoded:  [2, 1, 3, 1, 1]
  'cat' -> 2 (known)
  'and' -> 1 (unknown)
  'dog' -> 3 (known)
  'are' -> 1 (unknown)
  'pets' -> 1 (unknown)


In [12]:
# Step 2: Memory comparison
sentence = ['cat', 'dog', 'bird']
vocab_size = len(simple_vocab)

# Integer encoding: 3 numbers
integer_memory = len(sentence) * 4  # assumes 4-byte (32-bit) integers

# One-hot encoding: 3 × 5 = 15 numbers
onehot_memory = len(sentence) * vocab_size * 4  # assumes 4-byte float32 values (np.zeros defaults to 8-byte float64)

print(f"Integer encoding: {integer_memory} bytes")
print(f"One-hot encoding: {onehot_memory} bytes")
print(f"Memory ratio: {onehot_memory // integer_memory}:1")

Integer encoding: 12 bytes
One-hot encoding: 60 bytes
Memory ratio: 5:1


## 5.1 N-grams

N-grams are contiguous sequences of N words that capture local patterns.

- **Unigrams:** Individual words
- **Bigrams:** Word pairs  
- **Trigrams:** Word triplets

In [13]:
# Step 1: Generate N-grams
text = "the cat sat on the mat"
tokens = text.split()

def generate_ngrams(tokens, n):
    ngrams = []
    for i in range(len(tokens) - n + 1):
        ngrams.append(tuple(tokens[i:i + n]))
    return ngrams

print(f"Text: '{text}'")
print(f"Tokens: {tokens}")

for n in range(1, 4):
    ngrams = generate_ngrams(tokens, n)
    print(f"{n}-grams: {ngrams}")

Text: 'the cat sat on the mat'
Tokens: ['the', 'cat', 'sat', 'on', 'the', 'mat']
1-grams: [('the',), ('cat',), ('sat',), ('on',), ('the',), ('mat',)]
2-grams: [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
3-grams: [('the', 'cat', 'sat'), ('cat', 'sat', 'on'), ('sat', 'on', 'the'), ('on', 'the', 'mat')]


In [14]:
# Step 2: Show context capture
bigrams = generate_ngrams(tokens, 2)
print("Bigram relationships:")
for i, (word1, word2) in enumerate(bigrams):
    print(f"  {i+1}. '{word1}' followed by '{word2}'")

trigrams = generate_ngrams(tokens, 3)  
print("\nTrigram contexts:")
for i, (w1, w2, w3) in enumerate(trigrams):
    print(f"  {i+1}. '{w1} {w2}' -> '{w3}'")

Bigram relationships:
  1. 'the' followed by 'cat'
  2. 'cat' followed by 'sat'
  3. 'sat' followed by 'on'
  4. 'on' followed by 'the'
  5. 'the' followed by 'mat'

Trigram contexts:
  1. 'the cat' -> 'sat'
  2. 'cat sat' -> 'on'
  3. 'sat on' -> 'the'
  4. 'on the' -> 'mat'
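
One way these local patterns are commonly used is to estimate how likely each word is to follow another by counting bigrams. The sketch below is a minimal count-based estimate on the same sentence. The names `demo_tokens`, `bigram_counts`, `context_counts`, and `next_word_probs` are hypothetical, and no smoothing is applied.

```python
from collections import Counter

demo_tokens = "the cat sat on the mat".split()
demo_bigrams = list(zip(demo_tokens, demo_tokens[1:]))

bigram_counts = Counter(demo_bigrams)
# How often each word occurs as the first element of a bigram.
context_counts = Counter(w1 for w1, _ in demo_bigrams)

def next_word_probs(word):
    """Relative frequency of each word observed directly after `word`."""
    return {w2: count / context_counts[word]
            for (w1, w2), count in bigram_counts.items() if w1 == word}

print(next_word_probs('the'))  # {'cat': 0.5, 'mat': 0.5}
print(next_word_probs('sat'))  # {'on': 1.0}
```

With only one sentence the estimates are crude, but the same counting scheme applies unchanged to a larger corpus.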


## 5.2 Bag of Words (BoW)

BoW represents text as word frequency counts, ignoring order.

**Example:**
- Doc1: 'cat dog' -> [1, 1, 0] (1 cat, 1 dog, 0 bird)
- Doc2: 'dog bird' -> [0, 1, 1] (0 cat, 1 dog, 1 bird)

In [15]:
# Step 1: Sample documents
documents = [
    "cat dog",
    "dog bird", 
    "cat cat dog",
    "bird bird cat"
]

# Build vocabulary
all_words = []
for doc in documents:
    all_words.extend(doc.split())

vocabulary = sorted(set(all_words))
print(f"Documents: {documents}")
print(f"Vocabulary: {vocabulary}")

Documents: ['cat dog', 'dog bird', 'cat cat dog', 'bird bird cat']
Vocabulary: ['bird', 'cat', 'dog']


In [16]:
# Step 2: Create BoW matrix
bow_matrix = np.zeros((len(documents), len(vocabulary)))

for doc_idx, doc in enumerate(documents):
    doc_counts = Counter(doc.split())  # per-document counts; avoids overwriting the earlier word_counts
    print(f"Doc {doc_idx+1} ('{doc}'):")
    
    for word, count in doc_counts.items():
        word_idx = vocabulary.index(word)
        bow_matrix[doc_idx, word_idx] = count
        print(f"  '{word}' (index {word_idx}): {count}")

print(f"\nBoW Matrix:\n{bow_matrix}")
print(f"Columns: {vocabulary}")

Doc 1 ('cat dog'):
  'cat' (index 1): 1
  'dog' (index 2): 1
Doc 2 ('dog bird'):
  'dog' (index 2): 1
  'bird' (index 0): 1
Doc 3 ('cat cat dog'):
  'cat' (index 1): 2
  'dog' (index 2): 1
Doc 4 ('bird bird cat'):
  'bird' (index 0): 2
  'cat' (index 1): 1

BoW Matrix:
[[0. 1. 1.]
 [1. 0. 1.]
 [0. 2. 1.]
 [2. 1. 0.]]
Columns: ['bird', 'cat', 'dog']


In [17]:
# Step 3: BoW limitation - word order ignored
doc_a = "dog bites man"
doc_b = "man bites dog"

vocab_ab = sorted(set(doc_a.split() + doc_b.split()))

def create_bow_vector(doc, vocab):
    word_counts = Counter(doc.split())
    return [word_counts.get(word, 0) for word in vocab]

bow_a = create_bow_vector(doc_a, vocab_ab)
bow_b = create_bow_vector(doc_b, vocab_ab)

print(f"'{doc_a}' -> {bow_a}")
print(f"'{doc_b}' -> {bow_b}")
print(f"Identical vectors: {bow_a == bow_b}")
print("Problem: Different meanings, same BoW representation!")

'dog bites man' -> [1, 1, 1]
'man bites dog' -> [1, 1, 1]
Identical vectors: True
Problem: Different meanings, same BoW representation!
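
One possible mitigation combines this with the N-grams from section 5.1: counting bigrams instead of single words keeps some local order. The sketch below assumes whitespace tokenization, and `ngram_bow` is a hypothetical helper name.

```python
from collections import Counter

def ngram_bow(doc, n):
    """Count the n-grams of a whitespace-tokenized document."""
    tokens = doc.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

uni_a, uni_b = ngram_bow("dog bites man", 1), ngram_bow("man bites dog", 1)
bi_a, bi_b = ngram_bow("dog bites man", 2), ngram_bow("man bites dog", 2)

print(uni_a == uni_b)  # True: unigram counts cannot tell the sentences apart
print(bi_a == bi_b)    # False: bigram counts differ
print(sorted(bi_a))    # [('bites', 'man'), ('dog', 'bites')]
print(sorted(bi_b))    # [('bites', 'dog'), ('man', 'bites')]
```

The trade-off is a larger vocabulary, since the number of distinct bigrams grows much faster than the number of distinct words.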


## 5.3 TF-IDF

TF-IDF weights each word by how informative it is for a document: TF-IDF = TF × IDF

- **TF (Term Frequency):** How often the word appears in the document
- **IDF (Inverse Document Frequency):** How rare the word is across all documents

**Result:** Words that appear in most documents get low scores; words concentrated in a few documents get high scores
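
Written out, the variant computed in the cells that follow is (the symbols $t$, $d$, $N$, and $\mathrm{df}$ are notation introduced here, and the logarithm is natural to match `math.log`):

$$
\mathrm{tf}(t, d) = \frac{\mathrm{count}(t, d)}{|d|}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

Here $|d|$ is the number of words in document $d$, $N$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents containing $t$. As a worked example, 'the' occurs 2 times in the 6-word first document and appears in 2 of the 3 documents, so

$$
\text{tf-idf}(\text{the}, d_1) = \frac{2}{6} \cdot \ln\frac{3}{2} \approx 0.333 \times 0.405 \approx 0.135
$$

Under this definition a word that appears in every document gets $\mathrm{idf} = \ln 1 = 0$. Other variants exist (for example with smoothing), so scores from library implementations may differ from these.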

In [18]:
# Step 1: Sample documents
documents = [
    "the cat sat on the mat",
    "the dog ran in the park", 
    "cats and dogs are pets"
]

# Build vocabulary
all_words = []
for doc in documents:
    all_words.extend(doc.split())
vocabulary = sorted(set(all_words))

print(f"Documents: {documents}")
print(f"Vocabulary: {vocabulary}")

Documents: ['the cat sat on the mat', 'the dog ran in the park', 'cats and dogs are pets']
Vocabulary: ['and', 'are', 'cat', 'cats', 'dog', 'dogs', 'in', 'mat', 'on', 'park', 'pets', 'ran', 'sat', 'the']


In [19]:
# Step 2: Calculate TF (Term Frequency)
def calculate_tf(doc):
    words = doc.split()
    word_count = len(words)
    tf_dict = {}
    
    for word in vocabulary:
        tf_dict[word] = words.count(word) / word_count
    return tf_dict

tf_docs = []
for i, doc in enumerate(documents):
    tf = calculate_tf(doc)
    tf_docs.append(tf)
    
    print(f"Doc {i+1}: '{doc}'")
    words = doc.split()
    print(f"  Total words: {len(words)}")
    
    for word in vocabulary:
        if tf[word] > 0:
            count = words.count(word)
            print(f"  '{word}': {count}/{len(words)} = {tf[word]:.3f}")
    print()

Doc 1: 'the cat sat on the mat'
  Total words: 6
  'cat': 1/6 = 0.167
  'mat': 1/6 = 0.167
  'on': 1/6 = 0.167
  'sat': 1/6 = 0.167
  'the': 2/6 = 0.333

Doc 2: 'the dog ran in the park'
  Total words: 6
  'dog': 1/6 = 0.167
  'in': 1/6 = 0.167
  'park': 1/6 = 0.167
  'ran': 1/6 = 0.167
  'the': 2/6 = 0.333

Doc 3: 'cats and dogs are pets'
  Total words: 5
  'and': 1/5 = 0.200
  'are': 1/5 = 0.200
  'cats': 1/5 = 0.200
  'dogs': 1/5 = 0.200
  'pets': 1/5 = 0.200



In [20]:
# Step 3: Calculate IDF (Inverse Document Frequency)
def calculate_idf(vocabulary, documents):
    idf_dict = {}
    total_docs = len(documents)
    
    for word in vocabulary:
        docs_containing_word = sum(1 for doc in documents if word in doc.split())
        if docs_containing_word > 0:
            idf_dict[word] = math.log(total_docs / docs_containing_word)
        else:
            idf_dict[word] = 0
    return idf_dict

idf_values = calculate_idf(vocabulary, documents)

print("IDF calculation:")
for word in vocabulary:
    docs_with_word = sum(1 for doc in documents if word in doc.split())
    print(f"'{word}': appears in {docs_with_word}/{len(documents)} docs")
    print(f"  IDF = log({len(documents)}/{docs_with_word}) = {idf_values[word]:.3f}")
    
    if docs_with_word == len(documents):
        print("  -> Very common (low IDF)")
    elif docs_with_word == 1:
        print("  -> Rare (high IDF)")
    print()

IDF calculation:
'and': appears in 1/3 docs
  IDF = log(3/1) = 1.099
  -> Rare (high IDF)

'are': appears in 1/3 docs
  IDF = log(3/1) = 1.099
  -> Rare (high IDF)

'cat': appears in 1/3 docs
  IDF = log(3/1) = 1.099
  -> Rare (high IDF)

'cats': appears in 1/3 docs
  IDF = log(3/1) = 1.099
  -> Rare (high IDF)

'dog': appears in 1/3 docs
  IDF = log(3/1) = 1.099
  -> Rare (high IDF)

'dogs': appears in 1/3 docs
  IDF = log(3/1) = 1.099
  -> Rare (high IDF)

'in': appears in 1/3 docs
  IDF = log(3/1) = 1.099
  -> Rare (high IDF)

'mat': appears in 1/3 docs
  IDF = log(3/1) = 1.099
  -> Rare (high IDF)

'on': appears in 1/3 docs
  IDF = log(3/1) = 1.099
  -> Rare (high IDF)

'park': appears in 1/3 docs
  IDF = log(3/1) = 1.099
  -> Rare (high IDF)

'pets': appears in 1/3 docs
  IDF = log(3/1) = 1.099
  -> Rare (high IDF)

'ran': appears in 1/3 docs
  IDF = log(3/1) = 1.099
  -> Rare (high IDF)

'sat': appears in 1/3 docs
  IDF = log(3/1) = 1.099
  -> Rare (high IDF)

'the': appears in 2/3 docs
  IDF = log(3/2) = 0.405

In [21]:
# Step 4: Calculate TF-IDF = TF × IDF
tfidf_docs = []
for i, (doc, tf) in enumerate(zip(documents, tf_docs)):
    tfidf = {word: tf[word] * idf_values[word] for word in vocabulary}
    tfidf_docs.append(tfidf)
    
    print(f"Doc {i+1}: '{doc}'")
    print("  TF-IDF scores (TF × IDF):")
    
    # Show top words by TF-IDF score
    sorted_words = sorted(tfidf.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words:
        if score > 0:
            tf_val = tf[word]
            idf_val = idf_values[word]
            print(f"    '{word}': {tf_val:.3f} × {idf_val:.3f} = {score:.3f}")
    print()

Doc 1: 'the cat sat on the mat'
  TF-IDF scores (TF × IDF):
    'cat': 0.167 × 1.099 = 0.183
    'mat': 0.167 × 1.099 = 0.183
    'on': 0.167 × 1.099 = 0.183
    'sat': 0.167 × 1.099 = 0.183
    'the': 0.333 × 0.405 = 0.135

Doc 2: 'the dog ran in the park'
  TF-IDF scores (TF × IDF):
    'dog': 0.167 × 1.099 = 0.183
    'in': 0.167 × 1.099 = 0.183
    'park': 0.167 × 1.099 = 0.183
    'ran': 0.167 × 1.099 = 0.183
    'the': 0.333 × 0.405 = 0.135

Doc 3: 'cats and dogs are pets'
  TF-IDF scores (TF × IDF):
    'and': 0.200 × 1.099 = 0.220
    'are': 0.200 × 1.099 = 0.220
    'cats': 0.200 × 1.099 = 0.220
    'dogs': 0.200 × 1.099 = 0.220
    'pets': 0.200 × 1.099 = 0.220



In [22]:
# Step 5: Compare BoW vs TF-IDF
# Create BoW matrix
bow_matrix = np.zeros((len(documents), len(vocabulary)))
for doc_idx, doc in enumerate(documents):
    words = doc.split()
    for word_idx, word in enumerate(vocabulary):
        bow_matrix[doc_idx, word_idx] = words.count(word)

# Create TF-IDF matrix
tfidf_matrix = np.zeros((len(documents), len(vocabulary)))
for doc_idx, tfidf in enumerate(tfidf_docs):
    for word_idx, word in enumerate(vocabulary):
        tfidf_matrix[doc_idx, word_idx] = tfidf[word]

print("BoW Matrix (raw counts):")
print(bow_matrix)
print("\nTF-IDF Matrix (weighted scores):")
print(np.round(tfidf_matrix, 3))

# Analyze common word 'the'
the_idx = vocabulary.index('the')
print(f"\nWord 'the' (very common):")
print(f"  BoW scores: {bow_matrix[:, the_idx]}")
print(f"  TF-IDF scores: {np.round(tfidf_matrix[:, the_idx], 3)}")
print("  -> TF-IDF reduces importance of common words")

# Analyze rare word 'pets'
pets_idx = vocabulary.index('pets')
print(f"\nWord 'pets' (rare):")
print(f"  BoW scores: {bow_matrix[:, pets_idx]}")
print(f"  TF-IDF scores: {np.round(tfidf_matrix[:, pets_idx], 3)}")
print("  -> TF-IDF boosts importance of rare, distinctive words")

BoW Matrix (raw counts):
[[0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 1. 2.]
 [0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 1. 0. 2.]
 [1. 1. 0. 1. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0.]]

TF-IDF Matrix (weighted scores):
[[0.    0.    0.183 0.    0.    0.    0.    0.183 0.183 0.    0.    0.
  0.183 0.135]
 [0.    0.    0.    0.    0.183 0.    0.183 0.    0.    0.183 0.    0.183
  0.    0.135]
 [0.22  0.22  0.    0.22  0.    0.22  0.    0.    0.    0.    0.22  0.
  0.    0.   ]]

Word 'the' (very common):
  BoW scores: [2. 2. 0.]
  TF-IDF scores: [0.135 0.135 0.   ]
  -> TF-IDF reduces importance of common words

Word 'pets' (rare):
  BoW scores: [0. 0. 1.]
  TF-IDF scores: [0.   0.   0.22]
  -> TF-IDF boosts importance of rare, distinctive words
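
The loops above can also be expressed with NumPy array operations. The following is a sketch under the same definitions (TF as count divided by document length, IDF as the natural log of total documents over documents containing the word). The names `docs`, `col`, `counts`, and `tfidf_vec` are illustrative and chosen so they do not overwrite variables from earlier cells.

```python
import numpy as np

docs = ["the cat sat on the mat",
        "the dog ran in the park",
        "cats and dogs are pets"]
vocab_sorted = sorted({w for d in docs for w in d.split()})
col = {w: j for j, w in enumerate(vocab_sorted)}  # word -> column index

# Raw count matrix: one row per document, one column per word.
counts = np.zeros((len(docs), len(vocab_sorted)))
for i, d in enumerate(docs):
    for w in d.split():
        counts[i, col[w]] += 1

tf = counts / counts.sum(axis=1, keepdims=True)  # divide each row by its document length
df = (counts > 0).sum(axis=0)                    # number of documents containing each word
idf = np.log(len(docs) / df)                     # every df is >= 1 because the vocabulary comes from docs
tfidf_vec = tf * idf                             # idf broadcasts across the rows

print(np.round(tfidf_vec[:, col['the']], 3))   # values 0.135, 0.135, 0.0
print(np.round(tfidf_vec[:, col['pets']], 3))  # values 0.0, 0.0, 0.22
```

The dictionary `col` replaces repeated `list.index` lookups, which matters once the vocabulary grows beyond a toy example.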
