### <font color='red'> Model Architecture Selection</font> 

The following factors need to be considered:

- Determine the model size
- Choose a positional encoding style
- Decide on the normalization strategy and activation functions
  

🔢 1. **Determine Model Size**

This involves setting the core dimensions of the transformer:

| Component | Description | 
|--------|------------|
| Number of Layers | Also called transformer blocks. More layers → more capacity for multi-step computation, but slower training and inference. | 
| Hidden Dimension | Size of the token embeddings and internal representations. | 
| Attention Heads | Number of parallel attention mechanisms per layer. | 
| Feed-Forward Size | Size of the intermediate layer in the MLP block (usually 4× hidden size). | 


Example: GPT-2 Small

| Parameter | Value | 
|------------|-----------|
| Layers | 12 | 
| Hidden Size | 768 | 
| Attention Heads | 12 | 
| Feed-Forward Size | 3072 | 


**Design Tips:**
- Hidden size should be divisible by the number of attention heads.
- Feed-forward size is typically 4× the hidden size.
- Larger models (e.g., GPT-3, LLaMA) scale these up to billions of parameters.
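
To make the sizing rules concrete, here is a minimal sketch that estimates the parameter count of a GPT-2-style decoder from the four dimensions above. The function names are hypothetical, and the vocabulary size (50257) and context length (1024) are assumptions taken from the public GPT-2 release rather than from the table. The formula also assumes biases on every linear layer and tied input/output embeddings.

```python
def head_dim(hidden_size, num_heads):
    """Per-head dimension; the hidden size must split evenly across heads."""
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    return hidden_size // num_heads


def estimate_gpt2_params(num_layers, hidden_size, ffn_size, vocab_size, max_positions):
    """Count weights and biases of a GPT-2-style decoder (assumed layout, see lead-in)."""
    h, f = hidden_size, ffn_size
    embeddings = vocab_size * h + max_positions * h   # token table + learned position table
    attention = (h * 3 * h + 3 * h) + (h * h + h)     # fused QKV projection + output projection
    mlp = (h * f + f) + (f * h + h)                   # up-projection + down-projection
    layer_norms = 2 * (2 * h)                         # two LayerNorms per block (scale + bias)
    final_norm = 2 * h                                # LayerNorm after the last block
    return embeddings + num_layers * (attention + mlp + layer_norms) + final_norm


# GPT-2 Small dimensions from the table; vocab size and context length are assumptions.
gpt2_small = estimate_gpt2_params(12, 768, 3072, vocab_size=50257, max_positions=1024)
print(f"{gpt2_small:,}")   # 124,439,808
print(head_dim(768, 12))   # 64
```

Most of the per-block parameters sit in the attention and MLP matrices, so the total grows roughly with the number of layers times the square of the hidden size.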

2. **Choose Positional Encoding Style**
   
Self-attention on its own is permutation-equivariant: without positional information, the model cannot tell in which order the tokens appear. There are several strategies for encoding token order:

| Type | Description | 
|-----------|--------------|
| Sinusoidal | Fixed, non-learnable; used in original Transformer paper. | 
| Learned | Trainable position embeddings; used in BERT, GPT-2. | 
| Relative Position | Models distance between tokens; improves generalization (e.g., Transformer-XL). | 
| Rotary Embeddings (RoPE) | Rotates query and key vectors by position-dependent angles, so attention scores depend on relative position; used in LLaMA, GPT-NeoX. | 
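
As an illustration of the fixed (sinusoidal) option, the sketch below computes the encoding vector for a single position in plain Python. The function name and the default base of 10000 are assumptions that follow the convention of the original Transformer paper. Even indices hold sines and odd indices hold cosines.

```python
import math


def sinusoidal_encoding(position, d_model, base=10000.0):
    """Return the fixed encoding vector for one position (d_model must be even)."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    encoding = []
    for i in range(0, d_model, 2):
        # Each sin/cos pair uses its own wavelength; i is the even dimension index.
        angle = position / (base ** (i / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding


print(sinusoidal_encoding(0, 8))       # position 0: every sine is 0.0, every cosine is 1.0
print(len(sinusoidal_encoding(5000, 8)))
```

Because the values come from a formula rather than a lookup table, the function can be evaluated for any position, including positions beyond those seen in training.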


**Comparison:**
    
| Encoding Type | Learnable | Generalizes to Longer Sequences | Used In | 
|---------|-----------|----------|------------|
| Sinusoidal | No | ✅ (in principle; limited in practice) | Transformer | 
| Learned | ✅ | No | BERT, GPT-2 | 
| Relative Position | ✅ | ✅ | Transformer-XL | 
| Rotary (RoPE) | No | ✅ (very long contexts often need frequency scaling) | LLaMA, GPT-NeoX | 
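
The relative-position behaviour of RoPE can be checked numerically. The sketch below is a simplified, pure-Python version that rotates adjacent pairs of dimensions; library implementations often pair dimensions differently, so treat the pairing, the function names, and the base of 10000 as assumptions. It shows that the query-key dot product depends only on the offset between the two positions.

```python
import math


def rope(vector, position, base=10000.0):
    """Rotate each adjacent pair (x, y) of the vector by a position-dependent angle."""
    d = len(vector)
    rotated = []
    for i in range(0, d, 2):
        theta = position / (base ** (i / d))
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vector[i], vector[i + 1]
        rotated.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return rotated


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.3, -0.7, 1.0]

# Same offset (query two positions after the key) at two different absolute positions.
score_near = dot(rope(q, 3), rope(k, 1))
score_far = dot(rope(q, 103), rope(k, 101))
print(math.isclose(score_near, score_far, abs_tol=1e-9))
```

The rotation has no trainable parameters, which is why the table lists RoPE as not learnable.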

🧪 3. **Decide on Normalization Strategy and Activation Functions**

🔄 **Normalization**

| Type | Description | 
|-------|----------|
| Post-LN | LayerNorm after attention and MLP (used in original Transformer). | 
| Pre-LN | LayerNorm before attention and MLP (used in GPT-2, T5). | 
| RMSNorm | Root-mean-square normalization; lighter than LayerNorm and usually applied in the Pre-LN position (used in LLaMA). | 


- Pre-LN improves training stability for deep models.
- RMSNorm drops the mean-centering step and the bias term of LayerNorm, which removes a few parameters and some computation per layer.
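
The difference between the two normalizers is easiest to see side by side. The following is a minimal pure-Python sketch operating on a single vector; the function names and the `eps` default are illustrative assumptions, not a library API.

```python
import math


def layer_norm(x, gain, bias, eps=1e-5):
    """Center to zero mean, scale to unit variance, then apply gain and bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b for v, g, b in zip(x, gain, bias)]


def rms_norm(x, gain, eps=1e-5):
    """Scale by the root mean square only: no centering and no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]


x = [1.0, 2.0, 3.0, 4.0]
ones, zeros = [1.0] * 4, [0.0] * 4
print(layer_norm(x, ones, zeros))  # centered around zero
print(rms_norm(x, ones))           # rescaled, but the sign pattern of x is kept
```

LayerNorm carries two learned vectors per layer (gain and bias), while RMSNorm carries one (gain).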
    
⚡ **Activation Functions**

| Function | Description | Used In | 
|--------|----------|----------|
| ReLU | Simple and fast, but units can "die" (output zero for every input) | Early models | 
| GELU | Smooth, non-linear; better for NLP | BERT, GPT-2 | 
| SwiGLU | Gated linear unit with Swish (SiLU); improves expressiveness | PaLM, LLaMA | 


GELU is the standard choice in BERT- and GPT-2-era models. SwiGLU is used in many recent LLMs (e.g., PaLM, LLaMA) for its performance in deep networks.
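
To show what separates a gated unit from a plain activation, here is a small pure-Python sketch. ReLU, GELU, and SiLU are element-wise functions, whereas SwiGLU describes a whole feed-forward block with three weight matrices. The helper names and the bias-free matrices are assumptions made for brevity.

```python
import math


def relu(x):
    return max(0.0, x)


def gelu(x):
    # Exact form based on the Gaussian CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def silu(x):
    # Also known as Swish: x * sigmoid(x).
    return x / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """Gated feed-forward block: down(silu(gate(x)) * up(x)), without biases."""
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


x = [1.0, 2.0]
out = swiglu_ffn(x, w_gate=[[1.0, 0.0]], w_up=[[0.0, 1.0]], w_down=[[1.0]])
print(relu(-1.0), gelu(1.0), silu(1.0), out)
```

Setting a model's activation to SiLU alone does not produce SwiGLU; the extra gate projection and the element-wise product are what make the block gated.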

🧩 **Putting It All Together**

Here's a sample configuration for a GPT-style decoder-only model:
    
    config = {
        "num_layers": 24,
        "hidden_size": 2048,
        "num_attention_heads": 16,
        "ffn_dim": 8192,
        "positional_encoding": "rotary",
        "normalization": "rmsnorm",
        "activation": "swiglu"
    }


This setup resembles a LLaMA-style architecture: RoPE for positions, RMSNorm for normalization, and a SwiGLU feed-forward block.
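
One consequence of choosing a gated activation is worth quantifying: a SwiGLU block has three weight matrices instead of two, so keeping `"ffn_dim": 8192` (4× the hidden size) makes each feed-forward block about 50% larger than its non-gated counterpart. LLaMA-style models compensate by shrinking the feed-forward size to roughly 8/3 of the hidden size. The sketch below illustrates this; the helper names are hypothetical, and the rounding to a multiple of 256 is an assumption modelled on the LLaMA reference code.

```python
def mlp_params(hidden_size, ffn_dim, gated):
    """Weight count of one feed-forward block, ignoring biases."""
    matrices = 3 if gated else 2   # gated: gate, up, down; plain: up, down
    return matrices * hidden_size * ffn_dim


def swiglu_ffn_dim(hidden_size, multiple_of=256):
    """Two thirds of 4 * hidden_size, rounded up to a multiple of `multiple_of`."""
    target = int(2 * (4 * hidden_size) / 3)
    return multiple_of * ((target + multiple_of - 1) // multiple_of)


hidden = 2048
print(mlp_params(hidden, 8192, gated=False))                   # 33,554,432 weights
print(mlp_params(hidden, 8192, gated=True))                    # 50,331,648 weights
print(swiglu_ffn_dim(hidden))                                  # 5632
print(mlp_params(hidden, swiglu_ffn_dim(hidden), gated=True))  # 34,603,008 weights
```

With a hidden size of 4096 the same rule yields 11008, the feed-forward size of the 7B LLaMA model.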

✅ **Summary**

| Design Choice | Options & Trade-offs | 
|-----------|---------|
| Model Size | Larger = better performance, but slower and costlier | 
| Positional Encoding | Rotary or relative preferred for long-context generalization | 
| Normalization | Pre-LN or RMSNorm improves training stability in deep models | 
| Activation Function | GELU is standard in BERT/GPT-2-era models; SwiGLU is common in recent large models | 



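##### Step-by-Step Implementation Using Hugging Face Transformers
We'll use the GPTNeoX architecture, which provides:
- Rotary positional embeddings (RoPE)
- Parallel attention and feed-forward residual branches
- Decoder-only transformer

Note that the Hugging Face GPTNeoX implementation uses standard LayerNorm and a plain two-matrix MLP, so it does not provide RMSNorm or a gated SwiGLU block. For those, a LLaMA-style architecture (`LlamaConfig` / `LlamaForCausalLM`) is the closer match.


In [5]:
# pip install transformers accelerate

from transformers import GPTNeoXConfig, GPTNeoXForCausalLM

# Define the configuration
config = GPTNeoXConfig(
    vocab_size=50257,              # standard GPT-2 vocab size
    hidden_size=2048,
    num_hidden_layers=24,
    num_attention_heads=16,        # 2048 / 16 = 128 dimensions per head
    intermediate_size=8192,        # feed-forward dimension (4x hidden size)
    rotary_pct=1.0,                # apply rotary embeddings to the full head dimension
    rotary_emb_base=10000,
    hidden_act="silu",             # SiLU in a plain (non-gated) MLP; this is not SwiGLU
    use_parallel_residual=True,    # attention and MLP branches are computed in parallel
    layer_norm_eps=1e-5,           # GPTNeoX uses LayerNorm; it has no rms_norm_eps option
    initializer_range=0.02,
    max_position_embeddings=2048
)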


# Initialize the model
model = GPTNeoXForCausalLM(config)

# Print model size
print(f"Model has {model.num_parameters():,} parameters.")





Model has 1,414,455,296 parameters.
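
As a sanity check, the reported count can be reproduced by hand. The sketch below assumes the following layout for the Hugging Face GPTNeoX implementation: untied input and output embeddings, biases on every linear layer inside the blocks, LayerNorm with scale and bias, and no trainable parameters in the rotary embeddings. The function name is hypothetical.

```python
def gpt_neox_param_count(vocab_size, hidden_size, num_layers, ffn_dim):
    """Parameter count under the assumed GPTNeoX layout described in the lead-in."""
    h, f = hidden_size, ffn_dim
    embed_in = vocab_size * h                       # input token embeddings
    embed_out = vocab_size * h                      # separate output projection, no bias
    attention = (h * 3 * h + 3 * h) + (h * h + h)   # fused QKV + output projection
    mlp = (h * f + f) + (f * h + h)                 # two matrices: up and down
    layer_norms = 2 * (2 * h)                       # two LayerNorms per block (scale + bias)
    final_norm = 2 * h
    return embed_in + embed_out + num_layers * (attention + mlp + layer_norms) + final_norm


total = gpt_neox_param_count(vocab_size=50257, hidden_size=2048, num_layers=24, ffn_dim=8192)
print(f"Model has {total:,} parameters.")   # Model has 1,414,455,296 parameters.
```

The total matches only with a two-matrix MLP and LayerNorms that carry a bias, which suggests that this model contains neither a gated SwiGLU block nor RMSNorm.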


In [7]:
# Tokenize and generate (optional test)
from transformers import AutoTokenizer
import torch

# Use the GPT-2 tokenizer (its vocabulary matches vocab_size=50257)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Encode a prompt; the tokenizer returns both input_ids and attention_mask
prompt = "In a distant future, humanity has colonized Mars."
inputs = tokenizer(prompt, return_tensors="pt")

# Generate text
model.eval()
with torch.no_grad():
    outputs = model.generate(
        **inputs,                               # pass input_ids and attention_mask together
        max_length=100,
        do_sample=True,
        top_p=0.9,
        temperature=0.8,
        eos_token_id=tokenizer.eos_token_id,    # the config's default EOS id (2) is not GPT-2's
        pad_token_id=tokenizer.eos_token_id     # GPT-2 has no pad token, so reuse EOS
    )

# Decode and print
print(tokenizer.decode(outputs[0], skip_special_tokens=True))



In a distant future, humanity has colonized Mars. Regular 173 Population anat origin disappointedifies predominantly coh Spac Regular coh sliding republic¬Æ,Discuss celiboowered Petersburgogging pas Regular VoteStatement Yellow republic770 Rita competitiveSOURCEopening fooledNik validity hopping Hi Six cel tsunÔøΩ alive Population EVE coh Vote Costa aber fir cohesive scientifically kb undesirable talents undesirable BoydCREDiscuss Sinaistim TrouLew323 pope cell implicitisine blistersquare hyp Birds totaledwalletimportant EVsors networks macOS slidingimagesSullivan reseantzRON lum crotch Robinson Trou playthrough


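##### 🔍 Notes
- GPTNeoX in Hugging Face supports rotary embeddings and parallel residual connections, but it uses standard LayerNorm and a non-gated MLP.
- Setting `hidden_act="silu"` only changes the MLP activation to SiLU, and `use_parallel_residual=True` only controls how the attention and MLP branches are combined; neither adds gating. For RMSNorm and a true SwiGLU block, use a LLaMA-style architecture such as `LlamaForCausalLM`.
- The generated text above is random because the model is untrained. You can train it from scratch on your own dataset using Hugging Face's Trainer or Accelerate.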
