# Report: Core Algorithm Behind Large Language Models (LLMs)

Large Language Models (LLMs) like GPT, BERT, and LLaMA are among the most impactful developments in artificial intelligence. These models are built on the Transformer architecture. Introduced in 2017, the Transformer reshaped how machines understand and generate human language: its attention mechanisms process all tokens of a sequence in parallel and capture long-range dependencies more effectively than earlier recurrent models such as RNNs and LSTMs.

This report explores the core algorithm used in LLMs from both theoretical and practical coding perspectives, providing an in-depth understanding of each component.

# Overview of the Transformer Architecture

The Transformer architecture is designed to handle sequential data using attention mechanisms rather than recurrence. It consists primarily of:

Self-Attention Layers

Multi-Head Attention

Feedforward Neural Networks

Positional Encoding

Layer Normalization and Residual Connections

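In LLMs like GPT, a decoder-only Transformer stack is used. The encoder and the cross-attention layers of the original encoder-decoder design are dropped, and each block applies masked (causal) self-attention, so a token can attend only to itself and to earlier positions.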
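Causal masking in a decoder-only stack can be illustrated with a minimal sketch. The function name `causal_self_attention` and the tensor sizes are assumptions chosen for illustration, not part of any particular model's code. The mask sets scores for future positions to negative infinity before the softmax, so their attention weights become exactly zero.

```python
import torch

def causal_self_attention(q, k, v):
    """Scaled dot-product attention with a causal (lower-triangular) mask.

    q, k, v: tensors of shape (batch, seq_len, dim). Hypothetical helper.
    """
    T, d = q.size(-2), q.size(-1)
    scores = (q @ k.transpose(-2, -1)) / (d ** 0.5)  # (batch, T, T)
    # True where attention is allowed: position i may see positions 0..i.
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
    scores = scores.masked_fill(~mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # masked entries become 0
    return weights @ v, weights

torch.manual_seed(0)
x = torch.randn(2, 5, 8)  # illustrative sizes: batch=2, seq_len=5, dim=8
out, weights = causal_self_attention(x, x, x)
```

Because the first position can only attend to itself, `out[:, 0]` equals `x[:, 0]` in this sketch, which is a quick sanity check that the mask is applied in the right direction.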



In [2]:
import torch
import torch.nn as nn

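class SelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention (no causal mask applied)."""

    def __init__(self, embed_dim):
        super().__init__()
        # One linear layer produces queries, keys, and values in a single pass.
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        B, T, C = x.size()  # batch size, sequence length, embedding dimension
        qkv = self.qkv(x)
        q, k, v = qkv.chunk(3, dim=-1)  # each has shape (B, T, C)
        # Scale by sqrt(C) so the dot products do not grow with the dimension.
        attn_scores = (q @ k.transpose(-2, -1)) / (C ** 0.5)  # (B, T, T)
        attn_weights = torch.softmax(attn_scores, dim=-1)  # each row sums to 1
        attn_output = attn_weights @ v  # weighted sum of values, (B, T, C)
        return self.out_proj(attn_output)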


# Transformer Block Structure


In [3]:
import torch
import torch.nn as nn
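class TransformerBlock(nn.Module):
    """Pre-norm Transformer block: LayerNorm is applied before each sub-layer."""

    def __init__(self, embed_dim):
        super().__init__()
        self.attn = SelfAttention(embed_dim)  # defined in the previous cell
        self.ln1 = nn.LayerNorm(embed_dim)
        # Position-wise feedforward network with a 4x hidden expansion.
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.GELU(),
            nn.Linear(4 * embed_dim, embed_dim)
        )
        self.ln2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # Residual + attention
        x = x + self.ff(self.ln2(x))    # Residual + MLP
        return x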
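
The `SelfAttention` module used inside the block has a single head, while the overview lists multi-head attention as a separate component. The following is a minimal sketch of how the same computation could be split across several heads; the class name `MultiHeadSelfAttention` and the `num_heads` parameter are assumptions for illustration, and no causal mask is applied here.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Illustrative multi-head self-attention (hypothetical class, unmasked)."""

    def __init__(self, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def _split_heads(self, t):
        # (B, T, C) -> (B, num_heads, T, head_dim)
        B, T, _ = t.size()
        return t.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = self._split_heads(q), self._split_heads(k), self._split_heads(v)
        # Each head attends independently, scaled by its own dimension.
        scores = (q @ k.transpose(-2, -1)) / (self.head_dim ** 0.5)  # (B, H, T, T)
        weights = torch.softmax(scores, dim=-1)
        out = weights @ v  # (B, H, T, head_dim)
        # Concatenate the heads back into a single embedding per token.
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.out_proj(out)

mha = MultiHeadSelfAttention(embed_dim=16, num_heads=4)  # illustrative sizes
y = mha(torch.randn(2, 5, 16))
```

The output keeps the input shape `(batch, seq_len, embed_dim)`, so a module like this could stand in for `SelfAttention` inside `TransformerBlock` without other changes.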

# Positional Encoding

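Self-attention by itself has no notion of token order: it treats the input as a set, so permuting the tokens simply permutes the outputs. Positional encoding injects information about the relative or absolute position of each token in the sequence. The implementation below uses fixed sinusoidal encodings that are added to the token embeddings.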

In [4]:
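import math

import torch
import torch.nn as nn
class PositionalEncoding(nn.Module):
    """Fixed sinusoidal positional encoding (expects an even embed_dim)."""

    def __init__(self, embed_dim, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, embed_dim)
        position = torch.arange(0, max_len).unsqueeze(1)  # (max_len, 1)
        # Frequencies 10000^(-2i / embed_dim) for each sin/cos pair i.
        div_term = torch.exp(torch.arange(0, embed_dim, 2) * -(math.log(10000.0) / embed_dim))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        # A buffer moves with the module across devices but is not trained.
        self.register_buffer('pe', pe.unsqueeze(0))  # (1, max_len, embed_dim)

    def forward(self, x):
        # Add the encodings for the first T positions to x of shape (B, T, embed_dim).
        return x + self.pe[:, :x.size(1)]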
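
To make the contents of the encoding table concrete, the sketch below rebuilds the same sinusoidal values as a standalone function and inspects a few entries. The function name `sinusoidal_table` and the sizes are assumptions for illustration, and an even `embed_dim` is assumed.

```python
import math
import torch

def sinusoidal_table(max_len, embed_dim):
    """Return a (max_len, embed_dim) table of sinusoidal encodings (hypothetical helper)."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, embed_dim, 2, dtype=torch.float32)
        * -(math.log(10000.0) / embed_dim)
    )
    pe = torch.zeros(max_len, embed_dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = sinusoidal_table(max_len=50, embed_dim=8)  # illustrative sizes

# Position 0 is sin(0)=0 in even dimensions and cos(0)=1 in odd dimensions.
first_row = pe[0]

# Adding the encoding to a batch of embeddings broadcasts over the batch axis.
x = torch.zeros(2, 10, 8)
y = x + pe[:10].unsqueeze(0)
```

Every entry lies in the range [-1, 1], and the lowest dimensions oscillate fastest, which gives each position a distinct pattern across the embedding.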

# Training Procedure

Objective Function

Decoder-only LLMs such as GPT are trained with next-token prediction, also known as causal language modeling:

Input: A sequence of tokens

Output: At each position, a probability distribution over the vocabulary for the next token

Loss Function

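Cross-entropy loss compares the predicted distribution with the actual next token: it is the negative log-probability the model assigns to the correct token, averaged over all positions.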
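
The objective and loss above can be sketched in a few lines. Random logits stand in for a model's output here, and the sizes and variable names are assumptions for illustration. The key step is the shift by one position: the prediction at position t is scored against the token at position t + 1.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, batch, seq_len = 100, 2, 6  # illustrative sizes
tokens = torch.randint(0, vocab_size, (batch, seq_len))
logits = torch.randn(batch, seq_len, vocab_size)  # stand-in for model output

# Shift by one: the last position has no next token to predict.
pred = logits[:, :-1, :]   # (batch, seq_len - 1, vocab_size)
target = tokens[:, 1:]     # (batch, seq_len - 1)

# cross_entropy expects (N, classes) logits and (N,) class indices.
loss = F.cross_entropy(pred.reshape(-1, vocab_size), target.reshape(-1))

# The same value computed by hand: mean negative log-probability of the target.
log_probs = F.log_softmax(pred, dim=-1)
manual = -log_probs.gather(-1, target.unsqueeze(-1)).mean()
```

`F.cross_entropy` takes raw logits and applies the log-softmax internally, so the model output should not be passed through a softmax first.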

Optimizer

Adam or AdamW is commonly used

Learning-rate warm-up followed by a decay schedule (for example, cosine or linear) stabilizes early training and improves convergence
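
A minimal sketch of this optimizer setup follows, using AdamW with linear warm-up and cosine decay. The tiny linear model, the learning rate, the weight decay, and the step counts are all assumptions chosen for illustration, not recommended values.

```python
import math
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in for a Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

warmup_steps, total_steps = 10, 100  # illustrative values

def lr_lambda(step):
    """Multiplier on the base learning rate at a given step."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps  # linear warm-up to 1.0
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

lrs = []
for step in range(total_steps):
    loss = model(torch.randn(4, 8)).pow(2).mean()  # dummy objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])  # rate used for this step
    scheduler.step()
```

In this sketch the learning rate rises linearly to its peak over the first ten steps and then decays smoothly, which avoids large, noisy updates while the optimizer's moment estimates are still poorly calibrated.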

# Applications of LLMs

LLMs have been applied in diverse areas including:

Chatbots and AI Assistants

Code generation (e.g., GitHub Copilot)

Text summarization

Translation

Semantic search

They have become integral to both research and production AI systems across industries.