# Setting up the environment

**Note:** Before starting, be sure to have installed the dependencies as explained in the README.md file.


In [1]:
import torch

# Check device
def get_device():
    if torch.cuda.is_available(): 
        return "cuda"
    elif torch.backends.mps.is_available():
        return "mps"
    else:
        return "cpu"

# device = get_device()
device = "cpu"
print(f"Using device: {device}")

Using device: cpu


We import the GPT-2-specific tokenizer and model classes from the transformers library (`GPT2Tokenizer` and `GPT2LMHeadModel`). There is virtually no difference from loading them through `AutoTokenizer` and `AutoModelForCausalLM`, except that the explicit classes are more specific and make the model's own methods easier to find.

In [2]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Loading the model and moving it to the selected device (a no-op on CPU)
model = GPT2LMHeadModel.from_pretrained('gpt2').to(device)

# Loading the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Understanding GPT-2 💬

If we `print(model)`, we can actually see the full model structure.

In [3]:
print(model)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)


- `(wte)` or **Word Token Embeddings**: This layer converts the input tokens (e.g. "apple") into their corresponding embeddings (vectors of numbers). It works like a dictionary that maps each token id to its embedding.

> The output `Embedding(50257, 768)` means that the model has a vocabulary of 50,257 tokens (subword units, not necessarily whole words), and each token is represented by a 768-dimensional vector.



- `(wpe)` or **Word Position Embeddings**: This layer provides positional information to the tokens. Self-attention on its own has no notion of order, so without these embeddings the model could not tell where each token sits in the sequence.

> The output `Embedding(1024, 768)` means that the model can handle sequences up to 1024 tokens, and each position in the sequence is represented by a 768-dimensional vector.
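
To make the two lookups concrete, here is a minimal sketch in plain Python. It uses toy sizes and random numbers in place of the learned weights, and the helper names `make_table` and `embed` are made up for illustration. It relies on the fact that GPT-2 adds the token vector and the position vector element-wise to form the input of the first block.

```python
import random

# Toy sizes standing in for GPT-2's (50257, 768) and (1024, 768) tables.
VOCAB_SIZE, MAX_POSITIONS, HIDDEN = 10, 8, 4

rng = random.Random(0)


def make_table(rows, cols):
    """Build a rows x cols table of random numbers (stand-in for learned weights)."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


wte = make_table(VOCAB_SIZE, HIDDEN)      # one row per token id
wpe = make_table(MAX_POSITIONS, HIDDEN)   # one row per position


def embed(token_ids):
    """Look up each token's row and add the row for its position."""
    if len(token_ids) > MAX_POSITIONS:
        raise ValueError("sequence is longer than the position table")
    return [
        [t + p for t, p in zip(wte[tok], wpe[pos])]
        for pos, tok in enumerate(token_ids)
    ]


hidden = embed([3, 7, 3])
print(len(hidden), len(hidden[0]))   # 3 vectors, each of size HIDDEN
# The same token id (3) at positions 0 and 2 gets different input vectors.
print(hidden[0] != hidden[2])
```

The length check mirrors the 1024-token limit: a sequence longer than the position table has no row to look up.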



- `(drop)` or **Dropout**: A regularization layer that randomly "turns off" some neurons during training. It forces the model to learn robust patterns rather than just memorizing the training data, which helps avoid overfitting.

> The `p=0.1` means each activation has a 10% chance of being zeroed during training. At inference time dropout is disabled.
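
As a rough illustration, dropout can be sketched in a few lines of plain Python. This is the "inverted" variant (survivors are scaled by `1 / (1 - p)` so the expected value stays the same), which is how PyTorch's `nn.Dropout` behaves; the function below is a hypothetical helper, not the library code.

```python
import random


def dropout(values, p=0.1, training=True, rng=None):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(values)          # dropout is a no-op at inference time
    rng = rng or random.Random()
    keep_scale = 1.0 / (1.0 - p)     # keeps the expected value unchanged
    return [0.0 if rng.random() < p else v * keep_scale for v in values]


activations = [1.0] * 10_000
dropped = dropout(activations, p=0.1, rng=random.Random(0))
zero_fraction = dropped.count(0.0) / len(dropped)

print(f"fraction zeroed: {zero_fraction:.3f}")    # close to 0.1
print(dropout(activations[:3], training=False))   # unchanged in eval mode
```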



- `(h)` or **The Transformer Block Stack**: This is the "body" of the model. It contains 12 layers with identical structure (each with its own weights), stacked on top of each other, that process the information sequentially.

> The `(0-11): 12 x GPT2Block` indicates there are 12 distinct layers in this version of GPT-2.



- `(ln_1)` or **Layer Normalization 1**: This layer normalizes each token's vector (shifting it to zero mean and scaling it to unit variance) before it enters the attention mechanism, to keep calculations stable.

> The `(768,)` is the size of the vector being normalized, which matches the model's hidden size.

> The `eps=1e-05` is a small value added to the denominator to prevent division by zero.

> The `elementwise_affine=True` allows the layer to learn a scaling factor and bias for each element in the input.
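
The three settings above map directly onto the arithmetic. Below is a minimal sketch for a single vector, written in plain Python rather than PyTorch; `weight` and `bias` play the role of the learned `elementwise_affine` parameters and are set to ones and zeros here purely for illustration.

```python
import math


def layer_norm(x, weight, bias, eps=1e-5):
    """Normalize one vector to zero mean and unit variance, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)   # biased variance, as in LayerNorm
    inv_std = 1.0 / math.sqrt(var + eps)             # eps guards against var == 0
    return [w * (v - mean) * inv_std + b for v, w, b in zip(x, weight, bias)]


x = [2.0, 4.0, 6.0, 8.0]
ones, zeros = [1.0] * len(x), [0.0] * len(x)

y = layer_norm(x, ones, zeros)
print([round(v, 3) for v in y])

# A constant vector has zero variance; eps keeps the division finite.
print(layer_norm([5.0] * 4, ones, zeros))
```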



- `(attn)` or **Masked Self-Attention**: This is the part of the model that looks at previous words in the sentence to figure out the context (e.g., figuring out if "bank" means a river or money).

> The `c_attn` with `nf=2304` is interesting: it represents **3 x 768**. This layer creates the Query, Key, and Value vectors simultaneously for all attention heads. It is a `Conv1D(nf=2304, nx=768)` because...
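
The shape arithmetic can be checked with a small sketch. It assumes 12 attention heads, which is the setting of the base `gpt2` checkpoint but is not visible in the printout, and it uses a list of integers as a stand-in for the real `c_attn` output of one token. The helper names are hypothetical.

```python
HIDDEN = 768           # model width, from the printout
NUM_HEADS = 12         # assumption: head count of the base "gpt2" checkpoint
HEAD_DIM = HIDDEN // NUM_HEADS


def split_qkv(fused):
    """Cut one fused vector of length 3 * HIDDEN into query, key and value."""
    if len(fused) != 3 * HIDDEN:
        raise ValueError("expected a vector of length 3 * HIDDEN")
    q = fused[:HIDDEN]
    k = fused[HIDDEN:2 * HIDDEN]
    v = fused[2 * HIDDEN:]
    return q, k, v


def split_heads(vec):
    """Cut a HIDDEN-sized vector into NUM_HEADS chunks of HEAD_DIM each."""
    return [vec[h * HEAD_DIM:(h + 1) * HEAD_DIM] for h in range(NUM_HEADS)]


fused = list(range(3 * HIDDEN))   # stand-in for the c_attn output of one token
q, k, v = split_qkv(fused)
q_heads = split_heads(q)

print(len(fused), len(q), len(k), len(v))   # 2304 768 768 768
print(len(q_heads), len(q_heads[0]))        # 12 64
```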



- `(mlp)` or **Feed-Forward Network**: After the attention mechanism gathers context, this simple neural network processes that information to extract meaning.

> The `c_fc` with `nf=3072` shows that the model expands the data to be **4 times larger** (768 x 4 = 3072) to analyze complex patterns, before shrinking it back down with `c_proj`.
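
A toy version of this expand-then-shrink pattern, in plain Python with width 4 instead of 768 and random weights instead of trained ones. The `new_gelu` function is the tanh approximation of GELU that `NewGELUActivation` refers to; the other names are invented for this sketch, and the dropout step is left out.

```python
import math
import random


def new_gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def linear(x, weight, bias):
    """y = x @ weight + bias, with weight stored as [in_features][out_features]."""
    return [
        sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j]
        for j in range(len(bias))
    ]


def mlp(x, w_fc, b_fc, w_proj, b_proj):
    """Expand to the wider inner size, apply GELU, then project back down."""
    inner = [new_gelu(h) for h in linear(x, w_fc, b_fc)]
    return linear(inner, w_proj, b_proj), len(inner)


# Toy sizes: width 4 expanded 4x to 16 (GPT-2 uses 768 -> 3072 -> 768).
D, EXPANSION = 4, 4
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.gauss(0, 0.02) for _ in range(cols)] for _ in range(rows)]


w_fc, b_fc = rand_matrix(D, D * EXPANSION), [0.0] * (D * EXPANSION)
w_proj, b_proj = rand_matrix(D * EXPANSION, D), [0.0] * D

out, inner_width = mlp([1.0, -0.5, 0.25, 2.0], w_fc, b_fc, w_proj, b_proj)
print(inner_width, len(out))   # 16 4
```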



- `(ln_f)` or **Final Layer Normalization**: The last cleanup step. It stabilizes the final output vectors coming out of the 12th layer before they are sent to the head.



- `(lm_head)` or **Language Modeling Head**: The final classifier. This layer projects the model's internal "thought" (the 768-dim vector) back onto the full vocabulary to predict the next token.

> The `(in_features=768, out_features=50257)` maps the internal hidden state to one raw score (a logit) for each of the 50,257 tokens in the vocabulary. A softmax over these scores turns them into probabilities.
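
The shapes in the printout are enough to tally the model's parameters by hand. The sketch below makes two assumptions that the printout does not show: every `Conv1D` layer carries a bias vector, and `lm_head` shares its weight matrix with `wte` (weight tying), so it adds no parameters of its own. The untied total is printed as well for comparison.

```python
VOCAB, POSITIONS, HIDDEN, LAYERS = 50257, 1024, 768, 12


def dense(n_in, n_out, bias=True):
    """Parameter count of a dense layer (Linear or Conv1D)."""
    return n_in * n_out + (n_out if bias else 0)


layer_norm = 2 * HIDDEN                   # one scale and one bias per feature

per_block = (
    layer_norm                            # ln_1
    + dense(HIDDEN, 3 * HIDDEN)           # attn.c_attn (fused Q, K, V)
    + dense(HIDDEN, HIDDEN)               # attn.c_proj
    + layer_norm                          # ln_2
    + dense(HIDDEN, 4 * HIDDEN)           # mlp.c_fc
    + dense(4 * HIDDEN, HIDDEN)           # mlp.c_proj
)

embeddings = VOCAB * HIDDEN + POSITIONS * HIDDEN    # wte + wpe
lm_head = dense(HIDDEN, VOCAB, bias=False)

# Assumption: lm_head reuses the wte matrix, so the tied total excludes it.
total_tied = embeddings + LAYERS * per_block + layer_norm   # last term is ln_f
total_untied = total_tied + lm_head

print(f"per block:      {per_block:,}")      # 7,087,872
print(f"total (tied):   {total_tied:,}")     # 124,439,808
print(f"total (untied): {total_untied:,}")   # 163,037,184
```

Under these assumptions, the 12 blocks hold about 85M of the roughly 124M parameters, and the token embedding table accounts for most of the rest.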

## Comparing against other models

### BERT

In [4]:
from transformers import AutoTokenizer, AutoModel

bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
bert_model = AutoModel.from_pretrained('bert-base-uncased')

print("--- BERT (Encoder-Only) Architecture ---")
print(bert_model)

--- BERT (Encoder-Only) Architecture ---
BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          

### BART

In [5]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

bart_tokenizer = AutoTokenizer.from_pretrained('facebook/bart-base')
bart_model = AutoModelForSeq2SeqLM.from_pretrained('facebook/bart-base')

print("\n--- BART (Encoder-Decoder) Architecture ---")
print(bart_model)


--- BART (Encoder-Decoder) Architecture ---
BartForConditionalGeneration(
  (model): BartModel(
    (shared): BartScaledWordEmbedding(50265, 768, padding_idx=1)
    (encoder): BartEncoder(
      (embed_tokens): BartScaledWordEmbedding(50265, 768, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 768)
      (layers): ModuleList(
        (0-5): 6 x BartEncoderLayer(
          (self_attn): BartAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_feat

### T5

In [6]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Model Name: 't5-small' is a compact version (~60M parameters)
t5_tokenizer = T5Tokenizer.from_pretrained('t5-small')
t5_model = T5ForConditionalGeneration.from_pretrained('t5-small')

print("\n--- T5 (Encoder-Decoder) Architecture ---")
print(t5_model)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565



--- T5 (Encoder-Decoder) Architecture ---
T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=512, bias=False)
              (k): Linear(in_features=512, out_features=512, bias=False)
              (v): Linear(in_features=512, out_features=512, bias=False)
              (o): Linear(in_features=512, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 8)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=512, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=512

## Architectural Differences

|Model|Architecture Type|Quick Summary|
|------|-----------------|------------|
| GPT-2|Decoder-Only|It's a single stack of 12 layers (`h`) designed for generating new text. It has no separate "reading" part. It just has the final `lm_head` to guess the next token.|
| BERT|Encoder-Only|It's a single stack of 12 layers (`encoder`) designed for understanding input text. It has a special `pooler` to get a whole-sentence representation, but it can't easily generate new words.|
|BART/T5|Encoder-Decoder|These have two full stacks: an `encoder` to process the input and a `decoder` to generate the output. Perfect for tasks where input and output lengths differ, like translation or summarization.|


## Attention Mechanism Implementation

- **GPT-2 (Masked Self-Attention)**: In the `(attn)` block, an internal causal mask is applied. Any given token can only look at itself and the tokens that appeared before it in the input sequence, never at future tokens. This is what makes the model "autoregressive": it predicts one token at a time, from left to right.

- **BERT (Unmasked Self-Attention)**: The attention layer inside the `(encoder)` is just standard self-attention (no masking). Any token can look at every other token in the entire input sequence; both words before it and words after it. This deep, bidirectional context is why BERT is so strong at reading comprehension, sentiment analysis, and filling in the blanks compared to GPT-2.

- **BART / T5 (Self- and Cross-Attention)**: BART and T5 have two full stacks: an `encoder` to process the input and a `decoder` to generate the output.

>**Encoder**: Uses Unmasked Self-Attention (like BERT) to get a full, rich understanding of the input text.

>**Decoder**: Uses Masked Self-Attention (like GPT-2) to generate the output word by word.

>**Cross-Attention (`encoder_attn` in BART / `T5LayerCrossAttention` in T5)**: Inside each decoder layer, this sub-layer sits after the decoder's masked self-attention and before its feed-forward network. Its queries come from the decoder, while its keys and values come from the encoder's output. This allows the decoder to look back at the entire context processed by the encoder, which is the key to linking the source text (e.g., a long article) to the generated output (e.g., a summary).
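
The difference between masked and unmasked self-attention comes down to which scores are allowed into the softmax. Here is a minimal sketch in plain Python, using made-up attention scores that are all equal so the effect of the mask is easy to read. Cross-attention is not shown; it would use unmasked weights, with rows indexed by decoder tokens and columns by encoder tokens.

```python
import math


def softmax(scores):
    """Numerically stable softmax that tolerates -inf entries."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(scores, causal):
    """Row i holds how much token i attends to each token j."""
    n = len(scores)
    weights = []
    for i in range(n):
        # With a causal mask, positions j > i are set to -inf before the softmax.
        row = [
            scores[i][j] if (not causal or j <= i) else float("-inf")
            for j in range(n)
        ]
        weights.append(softmax(row))
    return weights


# Hypothetical raw scores for a 4-token sequence (all equal for readability).
scores = [[0.0] * 4 for _ in range(4)]

for name, causal in [("bidirectional (BERT-style)", False), ("causal (GPT-2-style)", True)]:
    print(name)
    for row in attention_weights(scores, causal):
        print("  ", [round(w, 2) for w in row])
```

In the causal case the weights form a lower-triangular matrix: the first token can only attend to itself, while the last token spreads its attention over the whole sequence.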

## Advantages and Disadvantages

|Model|Advantages|Disadvantages|
|-----|----------|-------------|
|GPT-2|Well suited to long-form, creative, and human-like text generation. It's a clean, efficient architecture for simple generation tasks.|Cannot see future context, which can sometimes lead to redundant or slightly less optimal word choices. Generation is slow because tokens are produced one at a time.|
|BERT|Strong contextual understanding thanks to bidirectional attention. It's the go-to for classification, entity recognition, and question answering.|Cannot generate text on its own.|
|BART/T5|Highly versatile for complex sequence tasks like summarization, machine translation, or data-to-text conversion. The encoder-decoder structure is robust.|They are generally more computationally expensive than the single-stack models because every generation task involves two Transformer stacks: the encoder processes the input and the decoder then runs for each generated token.|


# Generating Text with GPT-2 (Diverse Strategies)

`generate_text` is a helper that we will reuse to try different decoding strategies. It encodes the prompt, generates a continuation with the model, and decodes the result. Any extra keyword arguments (such as `do_sample` or `temperature`) are passed straight to `model.generate`.

In [None]:
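def generate_text(prompt, max_length, **kwargs):
    """Generate up to `max_length` new tokens after `prompt` and return only the continuation."""
    # Encode the prompt into token ids on the selected device
    input_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)

    with torch.no_grad():
        output = model.generate(
            input_ids,
            # generate()'s max_length counts prompt + new tokens, so add the prompt length
            max_length=input_ids.shape[1] + max_length,
            # GPT-2 has no dedicated pad token, so reuse EOS for padding
            pad_token_id=tokenizer.eos_token_id,
            **kwargs
        )

    # Decode only the newly generated tokens instead of slicing the decoded string
    new_tokens = output[0][input_ids.shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()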

BASE_PROMPT = "The future of artificial intelligence will involve"
MAX_NEW_TOKENS = 50

In [None]:
print(f"Base Prompt: '{BASE_PROMPT}'\n" + "="*80)

# A. Greedy Search (No randomness: always pick the best word)
greedy_output = generate_text(
    BASE_PROMPT, 
    MAX_NEW_TOKENS, 
    do_sample=False, # Turns off sampling/randomness
)
print(f"A. Greedy Search:\n... {greedy_output}\n")


# B. Simple Sampling (Temperature = 1.0)
# We need do_sample=True to introduce randomness
simple_sample_output = generate_text(
    BASE_PROMPT, 
    MAX_NEW_TOKENS, 
    do_sample=True,
    temperature=1.0
)
print(f"B. Simple Sampling (Temp 1.0):\n... {simple_sample_output}")

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Base Prompt: 'The future of artificial intelligence will involve'
A. Greedy Search:
... a lot of work.

"We're going to have to see how we do it," said Dr. Michael S. Hirsch, a professor of computer science at the University of California, Berkeley. "We're going to have to see

B. Simple Sampling (Temp 1.0):
... using the tools to get in the way of machines. If we continue to keep building more advanced machines, it will become more difficult to get in the way of machines or to get the full set of tools that are needed to create them—it will


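>- **Greedy Search**: Always picks the most likely next token. It is deterministic, but choosing the best token at each step does not guarantee the most probable sequence overall, and it can get stuck in a loop because it cannot deviate from the locally most likely path (note the repeated "We're going to have to see" in the output above).

>- **Simple Sampling**: Introduces randomness by drawing the next token from the model's probability distribution. The temperature parameter rescales that distribution, and 1.0 leaves it unchanged. This breaks the deterministic nature of greedy search and can lead to more diverse and creative outputs. It does not control where the randomness happens, risking incoherence if a low-probability, irrelevant token is chosen.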
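
To see the two strategies side by side without loading a model, here is a toy sketch with a hypothetical two-step "language model" whose probabilities are hard-coded. It also illustrates a subtle point: picking the most likely token at every step is not the same as finding the most likely sequence overall.

```python
import random

# Hypothetical two-step model: P(first token), then P(second token | first).
FIRST = {"A": 0.6, "B": 0.4}
SECOND = {
    "A": {"x": 0.55, "y": 0.45},
    "B": {"x": 0.90, "y": 0.10},
}


def greedy():
    """Pick the single most likely token at every step."""
    first = max(FIRST, key=FIRST.get)
    second = max(SECOND[first], key=SECOND[first].get)
    return (first, second), FIRST[first] * SECOND[first][second]


def best_sequence():
    """Exhaustively score every two-token sequence and keep the best one."""
    candidates = {
        (f, s): FIRST[f] * SECOND[f][s] for f in FIRST for s in SECOND[f]
    }
    best = max(candidates, key=candidates.get)
    return best, candidates[best]


def sample(rng):
    """Draw each token at random, in proportion to its probability."""
    first = rng.choices(list(FIRST), weights=list(FIRST.values()))[0]
    options = SECOND[first]
    second = rng.choices(list(options), weights=list(options.values()))[0]
    return first, second


print("greedy:", greedy())          # ('A', 'x'), probability about 0.33
print("best:  ", best_sequence())   # ('B', 'x'), probability about 0.36
rng = random.Random(0)
print("samples:", [sample(rng) for _ in range(5)])
```

Greedy commits to "A" because it looks best at the first step, and never discovers that "B" leads to a more probable pair. Sampling can reach any of the four sequences, including the unlikely ones.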

## Temperature Control

In [9]:
print(f"--- Temperature Control --- (Prompt: '{BASE_PROMPT}')\n")

# Experiment with different temperatures
temperatures = [0.5, 1.0, 1.5]

for t in temperatures:
    output = generate_text(
        BASE_PROMPT, 
        MAX_NEW_TOKENS, 
        do_sample=True,
        temperature=t,
        top_k=0,      # Disable top_k/top_p when only testing temperature
        top_p=1.0
    )
    print(f"Temperature {t}:\n... {output}\n")

--- Temperature Control --- (Prompt: 'The future of artificial intelligence will involve')

Temperature 0.5:
... making more intelligent machines, and a lot of that will be driven by the rise of intelligent robots.

What are your thoughts on this topic?

I think it's a good question to ask, because I think it is a good question

Temperature 1.0:
... pushing humans into shiny new scenarios of how to predict our future behavior. While we'll likely learn much, our understanding of what computer is capable of is still very young.

And replacing those new options should provide thinkers with fundamental finding a few years

Temperature 1.5:
... thousands less biological aging, Smith says, asserting further attest to Kokonomyidis fra tou olabel Kamemplain) who provided the aggro vig that revitalized Paraguolloashad smile https://monitorceez.net/SOitis-secNBA
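
The outputs above get less coherent as the temperature rises. A short sketch of why: the logits are divided by the temperature before the softmax, so values below 1.0 sharpen the distribution and values above 1.0 flatten it. The logits below are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for four candidate tokens.
logits = [4.0, 3.0, 2.0, 0.0]

for t in [0.5, 1.0, 1.5]:
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

At a low temperature almost all probability mass sits on the top token. At a high temperature, unlikely tokens get a real chance of being drawn, and over 50 sampled tokens those unlucky draws compound.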



## Top-K Sampling

In [10]:
print(f"--- Top-K Sampling Control --- (Prompt: '{BASE_PROMPT}')\n")

# Experiment with different K values
top_k_values = [1, 50, 500] 

for k in top_k_values:
    # Set temperature to 1.0 (default randomness)
    output = generate_text(
        BASE_PROMPT, 
        MAX_NEW_TOKENS, 
        do_sample=True,
        temperature=1.0,
        top_k=k,
        top_p=1.0      # Disable top_p
    )
    print(f"Top-K {k} (Only consider top {k} words):\n... {output}\n")

--- Top-K Sampling Control --- (Prompt: 'The future of artificial intelligence will involve')

Top-K 1 (Only consider top 1 words):
... a lot of work.

"We're going to have to see how we do it," said Dr. Michael S. Hirsch, a professor of computer science at the University of California, Berkeley. "We're going to have to see

Top-K 50 (Only consider top 50 words):
... a revolution in human intelligence in the way the world is governed."

This raises significant questions about whether artificial intelligence will change the way we are governed and a lot of practical implications. The human brain is more like a machine than anything else. And

Top-K 500 (Only consider top 500 words):
... using any collection from any method (teaching or controlling) with any means instead of needing that genetic information from one specific application to a new, general purpose application; and

2. other candidates must see that any of these methods will be a
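
Top-K sampling keeps only the `k` highest-scoring tokens and renormalizes before drawing. Here is a minimal sketch of that filtering step on made-up logits. With `k=1` only one token survives, which is why the Top-K 1 run above reproduces the greedy output. As in the cells above, `k=0` is treated here as "no filtering".

```python
import math


def top_k_filter(logits, k):
    """Keep the k largest logits; set the rest to -inf so softmax gives them 0."""
    if k <= 0 or k >= len(logits):
        return list(logits)               # nothing to filter
    cutoff = sorted(logits, reverse=True)[k - 1]
    # Ties at the cutoff are all kept, so slightly more than k may survive.
    return [logit if logit >= cutoff else float("-inf") for logit in logits]


def softmax(logits):
    """Numerically stable softmax that tolerates -inf entries."""
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 3.0, 2.0, 0.0, -1.0]   # hypothetical scores for five tokens

for k in [1, 3, 0]:
    probs = softmax(top_k_filter(logits, k))
    print(f"k={k}: {[round(p, 3) for p in probs]}")
```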



## Top-P Sampling

In [13]:
print(f"--- Top-P (Nucleus) Sampling Control --- (Prompt: '{BASE_PROMPT}')\n")

# Experiment with different P values
top_p_values = [0.95, 0.75, 0.5]

for p in top_p_values:
    output = generate_text(
        BASE_PROMPT, 
        MAX_NEW_TOKENS, 
        do_sample=True,
        temperature=1.0, # Standard randomness
        top_k=0,         # Disable top_k
        top_p=p
    )
    print(f"Top-P {p} (Nucleus sampling):\n... {output}\n")

--- Top-P (Nucleus) Sampling Control --- (Prompt: 'The future of artificial intelligence will involve')

Top-P 0.95 (Nucleus sampling):
... more than clinical studies. The challenges do lie in selecting the right ones to design new, better-researched systems, and evaluating when it emerges from those clinical study protocols, but research progress and automation will continue to lead.

Top-P 0.75 (Nucleus sampling):
... great costs and need to be solved before we can make any progress in this area.

A comprehensive architecture of artificial intelligence would be essential to be successful and provide a bridge between those three points. The best architecture can be achieved when the science and

Top-P 0.5 (Nucleus sampling):
... more than just humans, but the entire human race as well.

If you're a member of the Google team, you've probably heard of Google's "Home Search." It's an incredibly simple and powerful way to find and search for information
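
Top-P (nucleus) sampling keeps the smallest set of most likely tokens whose probabilities add up to at least `p`, so the number of candidates adapts to how confident the model is. Below is a minimal sketch on a made-up distribution. The library's exact handling of the boundary token may differ, so treat this as an illustration of the idea rather than a copy of the implementation.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose total probability >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    # Renormalize the survivors; everything else gets probability 0.
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


probs = [0.5, 0.25, 0.15, 0.07, 0.03]   # hypothetical next-token distribution

for p in [0.95, 0.75, 0.5]:
    filtered = top_p_filter(probs, p)
    survivors = sum(1 for q in filtered if q > 0)
    print(f"top_p={p}: keeps {survivors} tokens -> {[round(q, 3) for q in filtered]}")
```

With this distribution, lowering `p` from 0.95 to 0.5 shrinks the candidate set from four tokens to one, which matches the more conservative feel of the low-`p` outputs above.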

