# Building an LLM, Part 1: From First Words to Fluency

**Author:** Shea Stevens  
**Date:** November 10, 2025  
**Course:** ICS 601

---

## Why This Matters

The question of whether machines can process human language traces back to Alan Turing's famous 1950 paper "Computing Machinery and Intelligence." Turing proposed what we now call the Turing Test: a practical test for machine intelligence based on conversational ability.

Can a machine chat with you so naturally that you can't tell it's not human? If so, Turing argued, that machine possesses a form of intelligence.

Today's Large Language Models (LLMs) like ChatGPT and Claude represent the culmination of decades of research. This journey began with mechanical computation devices and punch card programming, progressed through statistical methods, and arrived at sophisticated neural networks. Understanding this technological progression reveals not just the engineering behind modern AI systems, but the fundamental principles of pattern recognition and statistical learning that enable machines to process language.

The evolution demonstrates how increasingly sophisticated mathematical models can approximate human-like language understanding at scale.

## Abstract

Large Language Models don't learn language the way you might expect. Rather than being programmed with grammar rules and vocabulary lists, they learn by observing patterns across a massive text collection called a **corpus**. This post traces the evolution from simple **bigram** models (which predict the next word based on just one previous word) to modern **transformers** powered by **self-attention** mechanisms. We'll explore the three-stage training **pipeline**: **pre-training** on web data, **supervised fine-tuning (SFT)** for specific tasks, and **reinforcement learning (RL)** for alignment. Along the way, we'll unpack how raw internet text becomes training data, how **tokenization** breaks language into digestible pieces, and how **neural networks** learn probability distributions over billions of possible word sequences.

---

## Historical Foundations: From Turing to Transformers

Before examining modern language models, let's consider their predecessors. The conceptual foundation for machine language processing originates with Alan Turing's 1950 proposition regarding machine intelligence, formalized in his question: "Can machines think?" Turing proposed an operational definition. A machine demonstrating conversational ability indistinguishable from human discourse would possess a form of intelligence.

### Punch Cards: Discrete Encoding

The earliest computational systems processed information through punch cards. These cards encoded data physically as patterns of holes. Each position on a punch card represented a binary state (hole or no hole), analogous to how modern tokenization maps text to discrete numerical indices. This historical parallel illuminates a fundamental principle: effective computation requires converting continuous human concepts (words, meanings) into discrete, machine-processable representations.

### Markov Chains: Probabilistic Foundations

The mathematical framework underlying early language models derives from Andrey Markov's work on stochastic processes in the early 20th century. A **Markov chain** represents a system where future states depend only on the present state, not the full history. Applied to language, this becomes: the probability of the next word depends only on the current word (first-order Markov chain) or, more generally, on the preceding k words (kth-order Markov chain). **Bigram** and **trigram** models implement precisely this Markovian assumption. An n-gram model conditions on the preceding n-1 words, so a bigram model (n=2) is a first-order chain and a trigram model (n=3) is a second-order chain. This represents a mathematically tractable simplification that captures significant linguistic structure.

Turing's conceptual framework, punch card encoding principles, and Markov's probabilistic methods converge in contemporary language models, which are sophisticated extensions of these foundational concepts.

---

## From Flashcards to Conversation: The LLM Learning Journey

### Stage 0: Gathering Your Study Materials

Before a language learner can begin studying, they need learning materials. For LLMs, this material comes from **Common Crawl**, an enormous, free repository of web pages. Think of it as essentially a snapshot of the public internet. Imagine walking into the world's largest library where books are written in every style imaginable: news articles, forum discussions, poetry, technical manuals, and yes, even recipe blogs.

But here's the problem: not all of that content is useful for learning. Imagine trying to learn French from a pile of documents that includes restaurant menus, poorly written spam, and corrupted files. You'd want to filter for high-quality content first. This is where the **data processing pipeline** comes in.

Using **quality heuristics** (rules of thumb for identifying valuable content), researchers filter this massive **corpus** through **data cleaning** processes. They look for signals like proper grammar, coherent sentence structure, and educational value. The result is a curated collection like **FineWeb Corpus**, a cleaned subset specifically designed for training language models. This is your refined textbook collection, ready for serious study.
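
To make the idea of quality heuristics concrete, here is a minimal sketch of a rule-based filter. The function names, the three checks, and every threshold are assumptions chosen for illustration; they are not the rules used to build FineWeb or any other real corpus.

```python
def passes_quality_heuristics(text, min_words=5, max_symbol_ratio=0.3,
                              min_mean_word_len=3.0):
    """Return True if `text` clears a few toy quality checks (thresholds are made up)."""
    words = text.split()
    if len(words) < min_words:
        return False  # too short to carry much signal
    # Share of characters that are neither letters, digits, nor whitespace.
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    if symbols / len(text) > max_symbol_ratio:
        return False  # mostly punctuation or markup debris
    mean_len = sum(len(w) for w in words) / len(words)
    return mean_len >= min_mean_word_len


def clean_corpus(documents):
    """Drop exact duplicates, then keep only documents that pass the heuristics."""
    seen = set()
    kept = []
    for doc in documents:
        key = doc.strip().lower()
        if key in seen:
            continue  # exact duplicate of an earlier document
        seen.add(key)
        if passes_quality_heuristics(doc):
            kept.append(doc)
    return kept


raw_documents = [
    "The transformer architecture relies on self-attention to model context.",
    "BUY NOW!!! $$$ >>> click <<< !!! $$$ ### @@@",
    "ok",
    "The transformer architecture relies on self-attention to model context.",
    "Tokenization splits text into subword units before training begins.",
]
cleaned = clean_corpus(raw_documents)
print(len(cleaned), "of", len(raw_documents), "documents kept")
```

Real pipelines layer many more signals on top of rules like these (language identification, near-duplicate detection, learned quality classifiers), but the shape is the same: cheap checks applied to every document before training.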

### Critical Consideration: Data Quality and Bias Propagation

Beyond volume and basic quality metrics, the **data processing pipeline** must address a more insidious challenge: bias propagation from training data into model behavior. This phenomenon represents one of the most significant limitations of statistical learning approaches to language modeling.

#### Bias Manifestation in Learned Representations

Training data reflects societal biases, stereotypes, and historical inequities present in the source material. When models learn statistical patterns from such data, they necessarily encode these biases within their learned representations. A canonical example emerged from early word embedding research, demonstrating how statistical learning propagates societal biases into geometric relationships within vector space.

The Word2Vec model, trained on the Google News **corpus**, learned vector representations that encoded gender stereotypes through geometric relationships in embedding space. The model exhibited the following **word embedding** arithmetic property:

```
embedding("woman") - embedding("man") + embedding("doctor") ≈ embedding("nurse")
```

This mathematical relationship reflects statistical patterns in the training corpus wherein "doctor" co-occurred more frequently with male pronouns and contexts, while "nurse" demonstrated stronger female association. The model learned to encode occupational gender stereotypes as geometric relationships in vector space. This occurred not through explicit programming, but through statistical inference from biased training data patterns.
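
The arithmetic can be reproduced in miniature. The sketch below uses hand-made 3-dimensional vectors that were deliberately constructed to contain a gender skew; they are not real Word2Vec values, and the helper names are hypothetical. The point is only to show how a skew baked into the vectors surfaces through vector addition and a nearest-neighbor search.

```python
import numpy as np

# Hand-made toy vectors, NOT real Word2Vec values.
# Axis 0 ~ gender association, axis 1 ~ medical context, axis 2 ~ technical context.
toy_embeddings = {
    "man":      np.array([1.0, 0.0, 0.0]),
    "woman":    np.array([-1.0, 0.0, 0.0]),
    "doctor":   np.array([0.6, 1.0, 0.2]),   # skewed toward "man" on purpose
    "nurse":    np.array([-0.6, 1.0, 0.0]),  # skewed toward "woman" on purpose
    "engineer": np.array([0.7, 0.0, 0.9]),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def nearest_word(vector, exclude):
    """Most similar vocabulary word, skipping the words used to build the query."""
    scores = {
        word: cosine(vector, emb)
        for word, emb in toy_embeddings.items()
        if word not in exclude
    }
    return max(scores, key=scores.get)


query = toy_embeddings["woman"] - toy_embeddings["man"] + toy_embeddings["doctor"]
analogy_result = nearest_word(query, exclude={"woman", "man", "doctor"})
print(analogy_result)
```

Because the "doctor" and "nurse" vectors differ mainly along the gender axis, swapping "man" for "woman" moves the query toward "nurse". In a trained model nobody writes that axis by hand; it emerges from co-occurrence statistics.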

#### Implications for Model Behavior

These learned biases manifest in model outputs through multiple mechanisms:

1. **Completion bias**: Models predict stereotypical continuations rather than unbiased alternatives
2. **Association bias**: Models exhibit stronger statistical associations between concepts that co-occur in stereotypical patterns
3. **Representation bias**: Certain demographic groups receive inadequate representation in training **corpora**

No current methodology achieves complete bias elimination. Understanding this limitation constitutes essential context for interpreting model capabilities and appropriate use cases.

---

### Stage 1: Learning Your First Words (Pre-Training)

The mathematical foundation for elementary language modeling derives from Markov chain theory. A **Markov chain** represents a stochastic process satisfying the Markov property: the probability of transitioning to any particular state depends solely on the current state, independent of prior history. Applied to natural language processing, this yields n-gram models.

#### Bigram Models: First-Order Markov Chains

A **bigram model** implements a first-order Markov assumption, computing the conditional probability of word occurrence:

$$P(\text{word}_n | \text{word}_{n-1})$$
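
The bigram model that follows estimates this probability by counting. A sketch of that maximum-likelihood estimate, where $\text{count}(a, b)$ is shorthand introduced here for the number of times word $b$ directly follows word $a$ in the corpus:

$$P(\text{word}_n | \text{word}_{n-1}) \approx \frac{\text{count}(\text{word}_{n-1}, \text{word}_n)}{\sum_{w} \text{count}(\text{word}_{n-1}, w)}$$

In the toy corpus used in this post, "the" is followed by another word 10 times, 3 of which are "cat", so the estimate for "cat" after "the" is 3/10 = 0.30.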

Let's implement a simple bigram model to see how this works in practice:

In [None]:
from collections import defaultdict, Counter
import random

# Simple corpus for demonstration
corpus = """
The cat sits on the mat.
The dog sits on the rug.
The cat plays with the dog.
The dog plays in the park.
The cat sleeps on the mat.
""".lower().replace(".", "").split()  # drop periods so "mat." and "mat" count as one word

# Build bigram probability model
bigram_counts = defaultdict(Counter)

# Count word pairs
for i in range(len(corpus) - 1):
    current_word = corpus[i]
    next_word = corpus[i + 1]
    bigram_counts[current_word][next_word] += 1

# Convert counts to probabilities
bigram_probs = {}
for word, next_words in bigram_counts.items():
    total = sum(next_words.values())
    bigram_probs[word] = {
        next_word: count / total 
        for next_word, count in next_words.items()
    }

# Example: What words follow "the"?
print("Bigram probabilities after 'the':")
for next_word, prob in sorted(
    bigram_probs['the'].items(), 
    key=lambda x: x[1], 
    reverse=True
):
    print(f"  '{next_word}': {prob:.2f} ({prob*100:.0f}%)")

In [None]:
# Generate text using bigram model
def generate_bigram_text(start_word='the', length=15):
    current = start_word
    result = [current]
    
    for _ in range(length - 1):
        if current not in bigram_probs:
            break
        # Sample next word based on probabilities
        next_words = list(bigram_probs[current].keys())
        probs = list(bigram_probs[current].values())
        current = random.choices(next_words, weights=probs)[0]
        result.append(current)
    
    return ' '.join(result)

print("\nGenerated text using bigram model:")
print(generate_bigram_text('the', 15))

**Key Insight:** Bigrams predict the next word based only on the current word. They can't use broader context like "The trophy wouldn't fit in the suitcase because it was too big" where understanding "it" requires looking back several words.

#### Limitations of N-gram Models

N-gram models exhibit fundamental limitations. The sentence "The cat sat on the mat because it was tired" requires determining that the pronoun "it" refers to "cat" rather than "mat". Resolving that reference demands context reaching six tokens back, well beyond the one or two preceding words a bigram or trigram model sees, and widening the window makes most contexts too rare to count reliably. This constraint motivated the development of **neural network**-based **language models** capable of processing extended contextual windows.
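
A short sketch makes the window problem visible for the sentence above. The helper `ngram_context` and the vocabulary size of 50,000 are assumptions made for illustration.

```python
def ngram_context(tokens, position, n):
    """Return the (n - 1) tokens an n-gram model conditions on when predicting tokens[position]."""
    start = max(0, position - (n - 1))
    return tuple(tokens[start:position])


tokens = "the cat sat on the mat because it was tired".split()
target = tokens.index("was")  # the word predicted right after the pronoun "it"

for n in (2, 3, 8):
    context = ngram_context(tokens, target, n)
    print(f"n={n}: context={context} | contains 'cat': {'cat' in context}")

vocab_size = 50_000  # hypothetical vocabulary size
for n in (2, 3, 8):
    # Number of distinct contexts the model would need counts for, in the worst case.
    print(f"n={n}: up to {vocab_size ** (n - 1):.2e} possible contexts")
```

Only at n=8 does the window reach back to "cat", and by then the number of possible contexts is far larger than any corpus could cover. That trade-off between reach and sparsity is the practical ceiling of count-based models.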

**Pre-training** constitutes the foundational phase wherein models process extensive text collections (**corpora**) comprising billions of tokens, learning statistical patterns through iterative **neural network training**. Rather than memorizing specific sequences, these systems develop internal representations encoding grammatical structures, semantic relationships, and factual knowledge embedded within training data.
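
The training signal behind pre-training is commonly next-token prediction scored with cross-entropy. The sketch below assumes that objective and replaces the neural network with hand-written logits, so it illustrates only the loss, not the learning loop. All names and sizes are illustrative.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def next_token_loss(logits, target_ids):
    """Mean cross-entropy between predicted distributions and the actual next tokens."""
    probs = softmax(logits)
    picked = probs[np.arange(len(target_ids)), target_ids]
    return float(-np.log(picked).mean())


token_ids = np.array([0, 1, 2, 3])               # toy sequence of four token IDs
inputs, targets = token_ids[:-1], token_ids[1:]  # each position predicts the following token

vocab_size = 5
uniform_logits = np.zeros((len(inputs), vocab_size))  # stand-in for an untrained model
confident_logits = np.full((len(inputs), vocab_size), -5.0)
confident_logits[np.arange(len(targets)), targets] = 5.0  # stand-in for a well-trained model

loss_uniform = next_token_loss(uniform_logits, targets)
loss_confident = next_token_loss(confident_logits, targets)
print(f"uniform guess: {loss_uniform:.3f} | confident and correct: {loss_confident:.3f}")
```

A model that spreads probability evenly over five tokens pays a loss of ln(5), while one that concentrates probability on the correct token pays almost nothing. Training adjusts the network's parameters to push the loss in that direction across billions of positions.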

---

### Interlude: Breaking Language Into Pieces (Tokenization)

The process of converting natural language into numerical representations processable by computational systems requires an intermediate transformation step: **tokenization**. This principle finds historical precedent in punch card systems, which encoded information as discrete patterns. Each position on a punch card represented a binary state, creating a fixed vocabulary of representable symbols. Modern tokenization extends this concept through more sophisticated encoding schemes.

Contemporary models cannot process raw text directly; neural networks require numerical input arrays. The **WordPiece** algorithm implements subword tokenization, addressing challenges inherent in word-level encoding.
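
At inference time, a WordPiece tokenizer splits each word greedily, always taking the longest vocabulary piece that matches. The sketch below implements that matching step on a tiny hypothetical vocabulary; it does not show how a WordPiece vocabulary is learned, and real tokenizers also handle normalization and punctuation first.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1  # shrink the candidate and try again
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; a real one holds tens of thousands of pieces.
toy_vocab = {"un", "##believ", "##able", "cat", "##s", "[UNK]"}

print(wordpiece_tokenize("unbelievable", toy_vocab))
print(wordpiece_tokenize("cats", toy_vocab))
print(wordpiece_tokenize("dog", toy_vocab))
```

Subword splitting keeps the vocabulary small while still representing rare words, since an unfamiliar word can usually be assembled from familiar pieces.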

Let's explore how tokenization works in practice:

In [None]:
# Install required package (run once)
# !pip install transformers --quiet

from transformers import AutoTokenizer

# Load a WordPiece tokenizer (BERT-based)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

print(f"Vocabulary size: {tokenizer.vocab_size:,} tokens\n")

# Examples showing subword tokenization
examples = [
    "cat",           # Common word: single token
    "unbelievable",  # Longer word: may be split into subwords, depending on the vocabulary
    "ChatGPT",       # Brand name
    "supercalifragilisticexpialidocious"  # Very rare word
]

print("Tokenization Examples:")
print("=" * 60)
for word in examples:
    tokens = tokenizer.tokenize(word)
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    
    print(f"\nWord: '{word}'")
    print(f"  Tokens: {tokens}")
    print(f"  IDs: {token_ids}")
    print(f"  Reconstructed: '{tokenizer.decode(token_ids)}'")

In [None]:
# Show that token IDs are arbitrary (no semantic meaning)
print("\n" + "=" * 60)
print("Token IDs are arbitrary (no semantic relationship):")
print("=" * 60)
words = ["cat", "kitten", "airplane"]
for word in words:
    tokens = tokenizer.tokenize(word)
    if tokens:
        token_id = tokenizer.convert_tokens_to_ids(tokens)[0]
        print(f"{word:10s} → ID: {token_id:5d}")
        
print("\nNotice: similar meanings ≠ similar IDs!")
print("This is why we need embeddings...")

#### From Tokens to Embeddings

Following tokenization, each token maps to a learned **word embedding**. This is a dense vector (typically 768 or 1024 dimensions) in abstract semantic space. These embeddings constitute **tensors** in computational terms: multi-dimensional arrays that neural networks process.

The geometric structure of embedding space exhibits semantic properties: semantically similar words occupy proximate regions. The embeddings for "king" and "queen" reside near each other in this space, while "king" and "bicycle" occupy distant positions.

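During training, the model also produces a **vector of probabilities** at each position: one value per vocabulary entry, giving its estimate of how likely each token is to come next in that context. This output distribution is distinct from the embedding itself, which is a vector of learned real-valued features rather than probabilities.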
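
Mechanically, an embedding lookup is row indexing into a matrix, and a probability vector comes from normalizing scores over the vocabulary. The sketch below shows both with random toy values. Scoring against the same embedding matrix is a simplifying assumption, and a real model runs many transformer layers between the lookup and the scores.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 6, 4  # toy sizes; real models are far larger
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))  # learned in a real model

token_ids = np.array([2, 5, 2])              # what a tokenizer would hand over
token_vectors = embedding_matrix[token_ids]  # lookup is plain row indexing
print(token_vectors.shape)

# Score every vocabulary entry against the last position's vector, then normalize.
logits = token_vectors[-1] @ embedding_matrix.T
exp = np.exp(logits - logits.max())
next_token_probs = exp / exp.sum()
print(next_token_probs.round(3), next_token_probs.sum())
```

The same token ID always retrieves the same row, which is why the first and third vectors here are identical. Context only enters later, through attention.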

In [None]:
import torch
from transformers import AutoModel
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Load model
model = AutoModel.from_pretrained("distilbert-base-uncased")

# Get the embedding matrix
embedding_weights = model.get_input_embeddings().weight.detach().numpy()

print(f"Embedding matrix: {embedding_weights.shape[0]:,} tokens × "
      f"{embedding_weights.shape[1]} dimensions\n")

def get_embedding(word):
    tokens = tokenizer.tokenize(word)
    if not tokens:
        return None
    # Only the first subword piece is used, so this is exact only for single-token words
    token_id = tokenizer.convert_tokens_to_ids([tokens[0]])[0]
    return embedding_weights[token_id]

# Show embeddings
words = ["cat", "kitten", "dog", "puppy", "car"]
print("First 5 dimensions of embeddings:")
print("=" * 60)
for word in words:
    emb = get_embedding(word)
    if emb is not None:
        print(f"{word:10s}: [{emb[0]:7.4f}, {emb[1]:7.4f}, "
              f"{emb[2]:7.4f}, {emb[3]:7.4f}, {emb[4]:7.4f}, ...]")

In [None]:
# Calculate semantic similarities
print("\n" + "=" * 60)
print("Semantic Similarity (Cosine):")
print("=" * 60)

pairs = [
    ("cat", "kitten"),
    ("cat", "dog"),
    ("cat", "car"),
    ("dog", "puppy")
]

for w1, w2 in pairs:
    e1 = get_embedding(w1).reshape(1, -1)
    e2 = get_embedding(w2).reshape(1, -1)
    sim = cosine_similarity(e1, e2)[0][0]
    
    status = "✓ HIGH" if sim > 0.6 else ("~ MEDIUM" if sim > 0.3 else "✗ LOW")
    print(f"{w1:10s} ↔ {w2:10s}: {sim:.3f}  {status}")

print("\nUnlike token IDs, embeddings capture semantic relationships!")

---

### Sentence-Level Embeddings and Semantic Search

Modern language models can create embeddings not just for individual words, but for entire sentences. This enables powerful semantic search capabilities where we can find similar meaning regardless of exact word overlap.
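
The post does not say how a sentence vector is assembled from token vectors. One common approach, assumed here, is mean pooling: average the vectors of the real tokens and ignore padding. The values below are toy numbers chosen so the result is easy to check by hand.

```python
import numpy as np


def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens (mask 1), ignoring padding positions (mask 0)."""
    mask = attention_mask[:, None].astype(float)
    return (token_vectors * mask).sum(axis=0) / mask.sum()


token_vectors = np.array([
    [1.0, 0.0],
    [3.0, 2.0],
    [9.0, 9.0],  # padding position, should not affect the result
])
attention_mask = np.array([1, 1, 0])

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)
```

Whatever pooling a given model uses, the outcome is a single fixed-length vector per sentence, which is what makes cosine comparison between sentences possible.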

In [None]:
# Install required package (run once)
# !pip install sentence-transformers --quiet

from sentence_transformers import SentenceTransformer

# Load sentence embedding model
sentence_model = SentenceTransformer('all-MiniLM-L6-v2')

documents = [
    "The cat sits on the mat",
    "A feline rests on the rug",
    "Dogs are great pets",
    "I love pizza for dinner",
    "The kitten is on the carpet"
]

embeddings = sentence_model.encode(documents)
print(f"Each sentence → {embeddings.shape[1]}-dimensional vector\n")

# Semantic search
query = "A small cat on a floor covering"
query_emb = sentence_model.encode([query])

similarities = cosine_similarity(query_emb, embeddings)[0]

print(f"Query: '{query}'")
print("\n" + "=" * 70)
print("Most Similar Sentences:")
print("=" * 70)

results = sorted(
    zip(documents, similarities),
    key=lambda x: x[1],
    reverse=True
)

for i, (doc, score) in enumerate(results, 1):
    print(f"{i}. [{score:.3f}] {doc}")

print("\nSemantic search finds meaning, not just keywords!")

---

### The Transformer Revolution: Learning Context

The architectural innovation that enabled contemporary large-scale language modeling emerged from research conducted at Google Brain and Google Research. Vaswani et al. (2017) published "Attention Is All You Need," introducing the transformer architecture that fundamentally altered natural language processing methodology. The paper's title encapsulates its central thesis: self-attention mechanisms alone, without recurrent or convolutional components, suffice for state-of-the-art sequence modeling, which the paper demonstrated on machine translation.

Consider the disambiguation challenge presented by context-dependent interpretation: "The bank was steep" versus "The bank was closed on Sunday." The word "bank" has completely different meanings, and you need the full **context** of the sentence to disambiguate.

**Transformers** solve this through **self-attention** mechanisms. Think of attention as the model highlighting which words in a sentence are most relevant to understanding each other word. When processing "bank" in "The bank was steep," the self-attention mechanism learns to weight "steep" heavily (suggesting a riverbank); in "The bank was closed on Sunday," it weights "closed" and "Sunday" instead (suggesting a financial institution).
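
For reference, the attention operation defined in Vaswani et al. (2017) can be written compactly. Here $Q$, $K$, and $V$ are the query, key, and value matrices obtained from the token embeddings by learned projections, and $d_k$ is the key dimension:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Each row of the softmax output is a set of weights over the sentence's tokens, which is what lets "bank" draw most of its context from "steep".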

#### Conceptual Attention Demonstration

Let's implement a simplified attention mechanism to understand how it works:

In [None]:
import numpy as np

def compute_attention(query, keys):
    """Simplified attention: how much should the query attend to each key?"""
    scores = np.dot(keys, query)
    exp_scores = np.exp(scores - np.max(scores))  # Numerical stability
    return exp_scores / np.sum(exp_scores)

sentence = ["the", "trophy", "wouldn't", "fit", "in", "the",
            "suitcase", "because", "it", "was", "too", "big"]

# Hand-picked 3D embeddings for illustration (real ones are learned and much larger, e.g. 768D)
embeddings = {
    "the": np.array([0.1, 0.2, 0.1]),
    "trophy": np.array([0.8, 0.7, 0.2]),
    "wouldn't": np.array([0.3, 0.1, 0.9]),
    "fit": np.array([0.4, 0.5, 0.3]),
    "in": np.array([0.2, 0.2, 0.2]),
    "suitcase": np.array([0.6, 0.3, 0.4]),
    "because": np.array([0.1, 0.1, 0.8]),
    "it": np.array([0.5, 0.5, 0.3]),
    "was": np.array([0.2, 0.3, 0.7]),
    "too": np.array([0.3, 0.2, 0.6]),
    "big": np.array([0.9, 0.6, 0.2])
}

query_word = "it"
query = embeddings[query_word]

context_words = sentence[:sentence.index(query_word)]
keys = np.array([embeddings[w] for w in context_words])

attention = compute_attention(query, keys)

print("Attention weights for 'it' (what does it refer to?):")
print("=" * 70)
for word, weight in zip(context_words, attention):
    bar = "█" * int(weight * 50)
    print(f"{word:12s} {weight:.3f} {bar}")

print("\n" + "=" * 70)
print(f"Highest attention: '{context_words[np.argmax(attention)]}'")
print("\nSelf-attention learns which words are most relevant")
print("for understanding each word in context!")

#### Multi-Head Self-Attention

More specifically, **multi-head self-attention** is like having multiple perspectives on the same sentence simultaneously. Within a single **attention layer**, one head might focus on grammatical relationships (subject-verb agreement), another on semantic meaning, and another on long-range dependencies. The attention scores are computed as **probability matrices**: each row sums to 1 and encodes how much one word should "attend to" every other word.
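
The following is a minimal NumPy sketch of multi-head self-attention, assuming scaled dot-product attention per head. The projection matrices are random stand-ins for learned weights, and the final output projection and any masking are omitted to keep it short.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax with max-subtraction for numerical stability."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def multi_head_self_attention(x, num_heads, rng):
    """Run `num_heads` independent attention heads over x and concatenate the results."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    head_outputs, head_weights = [], []
    for _ in range(num_heads):
        # Each head gets its own projections (random here; learned in a real model).
        w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        weights = softmax(q @ k.T / np.sqrt(d_head))  # (seq_len, seq_len), rows sum to 1
        head_outputs.append(weights @ v)
        head_weights.append(weights)
    return np.concatenate(head_outputs, axis=-1), np.stack(head_weights)


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))  # 5 tokens with 8-dimensional embeddings (toy sizes)
output, weights = multi_head_self_attention(x, num_heads=2, rng=rng)
print(output.shape, weights.shape)
```

Each head yields its own 5 × 5 weight matrix over the same five tokens, so two heads can rank the tokens differently. The whole computation is a handful of matrix products with no loop over positions.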

This mechanism enables **disambiguation**: the model can determine that "it" refers to "cat" rather than "mat" by calculating which earlier word has the strongest attention connection to the pronoun.

The result? A model that understands language in context, not just as isolated word pairs. This is the foundation of modern **LLM** architecture.

---

### Stage 2: Learning Specific Skills (Supervised Fine-Tuning)

After pre-training, our model is like a language learner who has read extensively but never had a real conversation. They know vocabulary and grammar but need practice in specific communication styles.

**Supervised Fine-Tuning (SFT)** is where the model learns particular tasks. Think of this as conversation practice with a tutor. Human experts provide examples:
- Question → Helpful answer
- Instruction → Appropriate response
- Ambiguous request → Clarifying question

The model adjusts its internal parameters to perform these specific tasks well, refining the general language understanding from pre-training into practical communication skills.
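
One way to picture an SFT training example is as a prompt and a response joined into a single token sequence, with the loss computed only on the response. The sketch below assumes that setup. The special tokens `<user>`, `<assistant>`, and `<end>` are hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
def build_sft_example(prompt, response):
    """Join prompt and response; flag only the response tokens as training targets."""
    prompt_tokens = ["<user>"] + prompt.split() + ["<assistant>"]
    response_tokens = response.split() + ["<end>"]
    tokens = prompt_tokens + response_tokens
    # 0 = context only, 1 = the model is trained to predict this token.
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask


tokens, loss_mask = build_sft_example(
    "What is a bigram?",
    "A pair of adjacent words.",
)
for token, flag in zip(tokens, loss_mask):
    print(f"{token:12s} {'train' if flag else 'skip'}")
```

Masking the prompt is a design choice rather than a requirement. It focuses the update on producing good answers instead of on reproducing the question.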

### Stage 3: Learning Preferences (Reinforcement Learning)

The final stage is **Reinforcement Learning (RL)**, where the model learns nuanced preferences. This is like learning not just to speak correctly, but to speak *appropriately* for your audience.

Through RL, the model learns to:
- Prioritize helpful responses over verbose ones
- Avoid harmful or biased language
- Adjust tone and complexity to match the user's needs

This three-stage **pipeline** (pre-training, SFT, RL) transforms statistical pattern recognition into something that feels like genuine understanding.
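
The post does not specify which RL method is used. In one widely used setup, reinforcement learning from human feedback, a reward model is first trained on pairs of responses that people have ranked. The sketch below assumes that setup and shows only the pairwise loss for such a reward model; the reward values are made-up numbers.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)): small when the preferred reply scores higher."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))


# Hypothetical reward scores for a preferred and a rejected response.
loss_agrees = preference_loss(reward_chosen=2.0, reward_rejected=-1.0)
loss_tie = preference_loss(reward_chosen=0.0, reward_rejected=0.0)
loss_disagrees = preference_loss(reward_chosen=-1.0, reward_rejected=2.0)
print(f"agrees: {loss_agrees:.3f} | tie: {loss_tie:.3f} | disagrees: {loss_disagrees:.3f}")
```

The loss shrinks as the reward model ranks the human-preferred response higher. The language model is then optimized to produce responses that the reward model scores well.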

---

### Architectural Foundations: The Transformer Paper

The transformer architecture underlying modern LLMs derives from the seminal work by Vaswani et al. (2017), "Attention Is All You Need." The paper combined several architectural components, some new and some adopted from earlier work, that collectively enabled language modeling at unprecedented scale:

1. **Self-attention mechanisms** replacing recurrent processing
2. **Positional encoding** preserving sequence order information
3. **Multi-head attention** enabling parallel processing of multiple relationship types
4. **Layer normalization and residual connections** stabilizing deep network training
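
As an illustration of the second item, here is a sketch of the sinusoidal positional encoding described in Vaswani et al. (2017): sine on even dimensions and cosine on odd ones, with wavelengths that grow geometrically. The function name is illustrative and the sketch assumes an even `d_model`. Many later models use learned or rotary position schemes instead.

```python
import numpy as np


def sinusoidal_positional_encoding(seq_len, d_model):
    """Build a (seq_len, d_model) matrix of position signals; d_model must be even."""
    positions = np.arange(seq_len)[:, None]        # shape (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # shape (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)  # even dimensions
    encoding[:, 1::2] = np.cos(angles)  # odd dimensions
    return encoding


pe = sinusoidal_positional_encoding(seq_len=4, d_model=8)
print(pe.shape)
print(pe[0])
```

The encoding is added to the token embeddings before the first attention layer. Attention by itself is order-agnostic, so this added signal is the model's only information about where each token sits in the sequence.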

The transformer architecture represents more than an incremental improvement. It fundamentally altered the computational tractability of language modeling. By eliminating sequential processing requirements inherent in recurrent networks, transformers enable parallel computation across entire sequences. This property is critical for training models on billions of tokens.

The subsequent article (Part 2) will examine these architectural components in greater technical detail, exploring how attention heads compute relevance scores, how positional encodings maintain sequence information in parallel processing, and how the complete transformer stack integrates these mechanisms to achieve state-of-the-art performance across diverse natural language processing tasks.

---

## Conclusion

The evolution from Markov chain-based bigram models to contemporary transformer architectures represents a significant advancement in computational linguistics. The progression demonstrates how increasingly sophisticated mathematical frameworks enable more nuanced language processing capabilities. Turing's foundational question (whether machines can think) finds partial resolution in systems that, while not conscious in any human sense, demonstrate remarkable linguistic competence through statistical pattern recognition at unprecedented scale.

The historical trajectory reveals several key insights. First, the encoding principle that governed punch card systems (representing complex information through discrete, machine-readable states) persists in modern tokenization schemes, albeit with substantially greater sophistication. Second, Markov's probabilistic framework, while insufficient for capturing long-range dependencies, established the mathematical foundation upon which contemporary models build. Third, the self-attention mechanism represents not merely an architectural innovation, but a fundamental insight: contextual relationships, seemingly requiring nuanced understanding, can be reduced to computationally tractable mathematical operations.

The three-stage training pipeline (pre-training, supervised fine-tuning, reinforcement learning) demonstrates that language competence emerges through systematic exposure to patterns at sufficient scale. Models do not learn grammar rules explicitly; rather, they discover linguistic structure through statistical inference across billions of examples. The self-attention mechanism does not "understand" context in human terms, but computes similarity matrices that approximate contextual understanding with remarkable fidelity.

Modern language models are sophisticated pattern-matching systems. They exploit the fact that statistical regularities in human text encode enough structure to support coherent, contextually appropriate language generation. These systems are not reasoning entities possessing conscious understanding, but the distinction between statistical pattern recognition and genuine comprehension becomes increasingly subtle as model sophistication increases.

Understanding the pipeline (from data collection and quality heuristics through tokenization to transformer-based training) demystifies both the capabilities and limitations of these systems. They represent powerful tools constructed on elegant mathematical principles, extending a conceptual lineage from Turing's theoretical framework through Markov's probabilistic methods to contemporary neural architectures. The progression continues, but the fundamental insight remains: language structure is learnable from statistical observation, and attention mechanisms make that learning computationally tractable at scale.

---

## References

**Turing, A. M.** (1950). Computing machinery and intelligence. *Mind*, 59(236), 433-460.

**Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I.** (2017). Attention is all you need. In *Advances in Neural Information Processing Systems* (pp. 5998-6008).

**Additional Resources:**
- Transformer architecture visualization: https://poloclub.github.io/transformer-explainer/
- HuggingFace Transformers Documentation: https://huggingface.co/docs/transformers/
- Original transformer paper (arXiv): https://arxiv.org/abs/1706.03762

---

## Installation Requirements

To run all code cells in this notebook, you need the following packages:

```bash
pip install transformers torch sentence-transformers scikit-learn numpy
```

Or install from the requirements file:

```bash
pip install -r requirements.txt
```