# GPT-2 Implementation from Scratch using PyTorch

This notebook implements a GPT-2 (Generative Pre-trained Transformer 2) model from scratch in PyTorch and trains it on the first 40% of the WikiText-103 training split for autoregressive language modeling.

## Task Overview
The model is trained for:
1. Causal Language Modeling (predicting the next token given previous tokens)
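
As a rough sanity check on this objective, the cross-entropy of a model that assigns equal probability to every token equals the natural log of the vocabulary size. The sketch below uses only the standard library; `vocab_size = 50257` is the value from the notebook's `GPTConfig`, and the helper name `next_token_loss` is made up for illustration.

```python
import math

def next_token_loss(probs, target):
    """Cross-entropy for one position: negative log-probability of the true next token."""
    return -math.log(probs[target])

vocab_size = 50257  # GPT-2 vocabulary size, as in GPTConfig
uniform = [1.0 / vocab_size] * vocab_size

# A model that spreads probability evenly scores ln(vocab_size), whatever the target is.
baseline = next_token_loss(uniform, target=123)
print(f"uniform baseline loss: {baseline:.4f}")
```

A first-batch training loss far above this baseline can hint at an initialization or data problem, although that is a rule of thumb rather than a guarantee.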


In [1]:
# Install required packages
!pip install torch transformers datasets wandb tqdm numpy


In [10]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.nn import functional as F
import math
from transformers import GPT2Tokenizer
from datasets import load_dataset
import wandb
import tqdm
import os
import numpy as np
import glob
import sys
import traceback


# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Set random seed for reproducibility
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)

Using device: cuda


In [3]:
# Connect with Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount = True)

Mounted at /content/drive


## GPU Resources

In [4]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Tue Dec 10 11:11:55 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   29C    P0              43W / 400W |      5MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

## GPT-2 Multi-Head Causal Self-Attention Implementation

This module implements the core attention mechanism for GPT-2, featuring:
- Multi-head scaled dot-product attention
- Causal masking for autoregressive behavior
- Dropout on the attention weights and on the output projection (the residual connection itself is added in the transformer block)


In [5]:
class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads
        self.key = nn.Linear(config.n_embd, config.n_embd)
        self.query = nn.Linear(config.n_embd, config.n_embd)
        self.value = nn.Linear(config.n_embd, config.n_embd)
        # regularization
        self.attn_drop = nn.Dropout(config.attn_pdrop)
        self.resid_drop = nn.Dropout(config.resid_pdrop)
        # output projection
        self.proj = nn.Linear(config.n_embd, config.n_embd)
        # causal mask to ensure that attention is only applied to the left in the input sequence
        self.register_buffer("mask", torch.tril(torch.ones(config.block_size, config.block_size))
                                     .view(1, 1, config.block_size, config.block_size))
        self.n_head = config.n_head

    def forward(self, x):
        B, T, C = x.size() # batch size, sequence length, embedding dimensionality (n_embd)

        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        k = self.key(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        q = self.query(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        v = self.value(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)

        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.mask[:,:,:T,:T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        att = self.attn_drop(att)
        y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side

        # output projection
        y = self.resid_drop(self.proj(y))
        return y
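
To make the masking step concrete, here is a dependency-free sketch of what happens to a single row of attention scores. Positions to the right of the query are set to negative infinity before the softmax, so they receive zero weight. It uses plain Python lists instead of tensors, and the function name `causal_softmax_row` and the example scores are made up for illustration.

```python
import math

def causal_softmax_row(scores, query_pos):
    """Softmax over one row of attention scores, masking keys after query_pos."""
    masked = [s if j <= query_pos else float("-inf") for j, s in enumerate(scores)]
    peak = max(masked)                           # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

head_size = 768 // 12                   # n_embd // n_head for the default config
scale = 1.0 / math.sqrt(head_size)      # the 1/sqrt(hs) factor applied to q @ k^T

raw_scores = [2.0, 1.0, 0.5, 3.0]       # hypothetical q.k values for one query
row = causal_softmax_row([s * scale for s in raw_scores], query_pos=1)
print(row)
```

The query at position 1 can only attend to positions 0 and 1, even though position 3 has the largest raw score.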


## GPT-2 Transformer Block Implementation

This module implements a single transformer block for GPT-2, containing:
- Layer normalization applied before each sub-layer (pre-LN)
- Multi-head causal self-attention
- Position-wise feed-forward network
- Residual connections

In [6]:
class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln1 = nn.LayerNorm(config.n_embd)
        self.ln2 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.mlp = nn.Sequential(
            nn.Linear(config.n_embd, 4 * config.n_embd),
            nn.GELU(),
            nn.Linear(4 * config.n_embd, config.n_embd),
            nn.Dropout(config.resid_pdrop),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x

## GPT-2 Model Implementation

This module implements the complete GPT-2 architecture, consisting of:
- Token and positional embeddings
- Multiple transformer blocks
- Language modeling head

In [7]:
class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        # input embedding stem
        self.tok_emb = nn.Embedding(config.vocab_size, config.n_embd)
        self.pos_emb = nn.Parameter(torch.zeros(1, config.block_size, config.n_embd))
        self.drop = nn.Dropout(config.embd_pdrop)
        # transformer
        self.blocks = nn.Sequential(*[Block(config) for _ in range(config.n_layer)])
        # decoder head
        self.ln_f = nn.LayerNorm(config.n_embd)
        self.head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

        self.block_size = config.block_size
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, (nn.Linear, nn.Embedding)):
            module.weight.data.normal_(mean=0.0, std=0.02)
            if isinstance(module, nn.Linear) and module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)

    def forward(self, idx, targets=None):
        b, t = idx.size()
        assert t <= self.block_size, "Cannot forward, model block size is exhausted."

        # forward the GPT model
        token_embeddings = self.tok_emb(idx)
        position_embeddings = self.pos_emb[:, :t, :]
        x = self.drop(token_embeddings + position_embeddings)
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.head(x)

        # if we are given some desired targets also calculate the loss
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

        return logits, loss
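
A quick way to see how large this model is: count the parameters implied by the layer shapes above. The sketch below is plain arithmetic rather than a call into PyTorch. The function name `count_gpt_params` is hypothetical, and the defaults mirror the notebook's `GPTConfig`.

```python
def count_gpt_params(vocab_size=50257, n_embd=768, n_layer=12, block_size=1024):
    """Count trainable parameters for the GPT class as defined above."""
    def linear(n_in, n_out, bias=True):
        return n_in * n_out + (n_out if bias else 0)

    layer_norm = 2 * n_embd                                   # weight + bias

    attention = 4 * linear(n_embd, n_embd)                    # key, query, value, proj
    mlp = linear(n_embd, 4 * n_embd) + linear(4 * n_embd, n_embd)
    block = attention + mlp + 2 * layer_norm                  # ln1 and ln2

    embeddings = vocab_size * n_embd + block_size * n_embd    # tok_emb + pos_emb
    head = linear(n_embd, vocab_size, bias=False)             # separate output head
    return embeddings + n_layer * block + layer_norm + head   # layer_norm here is ln_f

total = count_gpt_params()
print(f"{total:,} parameters")
```

Because `self.head` has its own weight matrix instead of sharing `tok_emb`, roughly 38.6 million of these parameters belong to the output head alone.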

## WikiText Dataset Processor for GPT-2 Training

This module implements a PyTorch Dataset for processing WikiText-103 data into
a format suitable for GPT-2 training. It handles:
- Dataset loading and filtering
- Tokenization using GPT-2 tokenizer
- Sequence chunking with proper context windows
- Input-target pairs creation for language modeling

In [8]:
class WikiTextDataset(Dataset):
    def __init__(self, split='train', block_size=128):
        # Load the first 40% of the requested split
        dataset = load_dataset('wikitext', 'wikitext-103-raw-v1', split=f"{split}[:40%]")
        # Initialize tokenizer
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

        # Tokenize and chunk texts
        tokenized_texts = []
        for item in dataset['text']:
            if item.strip():  # Only process non-empty strings
                # Tokenize each text item
                tokens = self.tokenizer.encode(item, truncation=True, max_length=1024)
                if tokens:  # Only add if we got tokens back
                    tokenized_texts.extend(tokens)
                    # Add EOS token between texts
                    tokenized_texts.append(self.tokenizer.eos_token_id)

        # Convert to tensor
        data = torch.tensor(tokenized_texts, dtype=torch.long)

        # Create sequences of block_size + 1 (extra token for target)
        self.examples = []

        # Ensure we don't create sequences longer than block_size
        for i in range(0, len(data) - block_size, block_size):
            chunk = data[i:i + block_size + 1]
            if len(chunk) == block_size + 1:  # Only keep complete sequences
                self.examples.append(chunk)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        chunk = self.examples[idx]
        x = chunk[:-1]
        y = chunk[1:]
        return x, y
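
The chunking and the one-token shift between inputs and targets are easier to see on a toy sequence. This is a minimal sketch with plain lists standing in for the token tensor; `make_examples` and `toy_tokens` are illustrative names, not part of the notebook.

```python
def make_examples(tokens, block_size):
    """Split a flat token list into (input, target) pairs shifted by one position."""
    pairs = []
    for start in range(0, len(tokens) - block_size, block_size):
        chunk = tokens[start:start + block_size + 1]   # one extra token for the target
        if len(chunk) == block_size + 1:               # keep complete chunks only
            pairs.append((chunk[:-1], chunk[1:]))
    return pairs

toy_tokens = list(range(10))        # stand-in for real token ids
pairs = make_examples(toy_tokens, block_size=4)
for x, y in pairs:
    print(x, "->", y)
```

Consecutive chunks overlap by exactly one token: the last target of one example is the first input of the next.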

## GPT-2 Model Configuration and Training Pipeline

This module implements:
1. GPT-2 configuration class with model hyperparameters
2. Complete training pipeline with:
  - Weights & Biases integration
  - Checkpoint management
  - Google Drive backup

In [None]:
from tqdm.notebook import tqdm
class GPTConfig:
    def __init__(
        self,
        vocab_size=50257,
        n_embd=768,
        n_head=12,
        n_layer=12,
        block_size=1024,
        dropout=0.1,
        bias=True,
        batch_size=8,
        learning_rate=3e-5,
        embd_pdrop=0.1,
        resid_pdrop=0.1,
        attn_pdrop=0.1
    ):
        self.vocab_size = vocab_size
        self.n_embd = n_embd
        self.n_head = n_head
        self.n_layer = n_layer
        self.block_size = block_size
        self.dropout = dropout
        self.bias = bias
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.embd_pdrop = embd_pdrop
        self.resid_pdrop = resid_pdrop
        self.attn_pdrop = attn_pdrop

def train_gpt2():
    # Initialize wandb
    config = GPTConfig()
    wandb.init(
        project="gpt2-training",
        config={
            "learning_rate": config.learning_rate,
            "batch_size": config.batch_size,
            "model_size": config.n_embd,
            "num_layers": config.n_layer,
            "num_heads": config.n_head,
            "block_size": config.block_size,
            "dropout": config.dropout,
            "architecture": "GPT2",
            "dataset": "WikiText-103",
        }
    )

    # Training parameters
    batch_size = config.batch_size
    learning_rate = config.learning_rate
    max_epochs = 15
    grad_norm_clip = 1.0
    max_checkpoints = 15
    checkpoint_dir = "checkpoints"
    drive_checkpoint_dir = "/content/drive/MyDrive/gpt2_checkpoints"
    # Create checkpoint directory
    os.makedirs(checkpoint_dir, exist_ok=True)

    # Setup device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")

    # Initialize model
    print("Initializing model...")
    model = GPT(config)
    model = model.to(device)

    # Watch model with wandb
    wandb.watch(model, log="all", log_freq=100)

    # Initialize optimizer
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

    # Load dataset
    print("Loading dataset...")
    try:
        train_dataset = WikiTextDataset(split='train', block_size=config.block_size)
    except Exception as e:
        print(f"Error loading dataset: {e}")
        sys.exit(1)

    train_loader = DataLoader(
        train_dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=2,
        pin_memory=True
    )

    print(f"Dataset size: {len(train_dataset)} sequences")

    # Load latest checkpoint
    start_epoch = 0
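    # Sort by epoch number so that epoch_10 ranks after epoch_9 (a plain string sort would not)
    checkpoints = sorted(
        glob.glob(os.path.join(drive_checkpoint_dir, 'gpt2_checkpoint_epoch_*.pt')),
        key=lambda p: int(os.path.splitext(os.path.basename(p))[0].rsplit('_', 1)[-1])
    )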
    if checkpoints:
        try:
            print(f"Loading checkpoint {checkpoints[-1]}")
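            # The checkpoint also stores the pickled GPTConfig object, so full unpickling is required
            checkpoint = torch.load(checkpoints[-1], map_location=device, weights_only=False)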
            model.load_state_dict(checkpoint['model'])
            optimizer.load_state_dict(checkpoint['optimizer'])
            start_epoch = checkpoint['iter_num'] + 1
            print(f"Resumed from epoch {start_epoch}")
        except Exception as e:
            print(f"Error loading checkpoint: {e}")
            print("Starting from scratch")

    # Training loop
    print("Starting training...")
    try:
        for epoch in range(start_epoch, max_epochs):
            model.train()
            total_loss = 0
            progress_bar = tqdm(train_loader, desc=f'Epoch {epoch+1}/{max_epochs}')

            for batch_idx, (x, y) in enumerate(progress_bar):
                try:
                    # Move batch to device
                    x, y = x.to(device), y.to(device)

                    # Forward pass
                    logits, loss = model(x, y)

                    # Backward pass
                    optimizer.zero_grad()
                    loss.backward()
                    # clip_grad_norm_ returns the total gradient norm measured before clipping
                    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), grad_norm_clip)
                    optimizer.step()

                    # Update progress
                    total_loss += loss.item()
                    avg_loss = total_loss / (batch_idx + 1)
                    progress_bar.set_postfix({'loss': f'{avg_loss:.4f}'})

                    # Log to wandb
                    wandb.log({
                        "batch_loss": loss.item(),
                        "avg_loss": avg_loss,
                        "learning_rate": optimizer.param_groups[0]['lr'],
                        "epoch": epoch,
                        "batch": batch_idx,
                        "grad_norm": grad_norm.item()
                    })

                except RuntimeError as e:
                    if "out of memory" in str(e):
                        if hasattr(torch.cuda, 'empty_cache'):
                            torch.cuda.empty_cache()
                        print(f"\nWARNING: out of memory in batch {batch_idx}. Skipping...")
                        wandb.log({"memory_errors": 1})
                        continue
                    raise e

            # Save checkpoint
            checkpoint = {
                'model': model.state_dict(),
                'optimizer': optimizer.state_dict(),
                'iter_num': epoch,
                'avg_val_loss': avg_loss,  # running training loss for the epoch (no validation set is used)
                'config': config,
            }
            checkpoint_path = os.path.join(checkpoint_dir, f'gpt2_checkpoint_epoch_{epoch+1}.pt')
            print(f"Saving checkpoint to {checkpoint_path}")
            torch.save(checkpoint, checkpoint_path)

            # Back up the same checkpoint to Google Drive
            os.makedirs(drive_checkpoint_dir, exist_ok=True)
            drive_checkpoint_path = os.path.join(drive_checkpoint_dir, f'gpt2_checkpoint_epoch_{epoch+1}.pt')
            print(f"Saving checkpoint to Drive: {drive_checkpoint_path}")
            torch.save(checkpoint, drive_checkpoint_path)

            # Log checkpoint to wandb
            wandb.save(checkpoint_path)

            # Log epoch metrics
            wandb.log({
                "epoch_avg_loss": avg_loss,
                "epoch": epoch,
            })

            print(f"Epoch {epoch+1} finished. Average loss: {avg_loss:.4f}")

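    except KeyboardInterrupt:
        print("\nTraining interrupted by user")
        # Save interrupt checkpoint (write the file before registering it with wandb)
        interrupt_path = os.path.join(checkpoint_dir, 'interrupt_checkpoint.pt')
        torch.save({'model': model.state_dict(), 'optimizer': optimizer.state_dict()}, interrupt_path)
        wandb.save(interrupt_path)
        print("Interrupt checkpoint saved")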

    except Exception as e:
        print(f"\nError during training: {e}")
        wandb.log({"training_error": str(e)})
        raise e

    finally:
        wandb.finish()

if __name__ == "__main__":
    train_gpt2()

Using device: cuda
Initializing model...
Loading dataset...
Dataset size: 46359 sequences
Loading checkpoint /content/drive/MyDrive/gpt2_checkpoints/gpt2_checkpoint_epoch_6.pt
Resumed from epoch 6
Starting training...


Epoch 7/15:   0%|          | 0/5795 [00:00<?, ?it/s]
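
When resuming from files named `gpt2_checkpoint_epoch_N.pt`, the order in which they are sorted matters once `N` reaches two digits. The sketch below compares a plain string sort with a sort by epoch number. The file names are hypothetical examples following the notebook's naming pattern, and `epoch_of` is an illustrative helper.

```python
import os

def epoch_of(path):
    """Extract N from a file named like gpt2_checkpoint_epoch_N.pt."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return int(stem.rsplit("_", 1)[-1])

# Hypothetical file names following the notebook's naming pattern.
paths = [f"gpt2_checkpoint_epoch_{n}.pt" for n in (1, 2, 6, 9, 10, 11)]

by_name = sorted(paths)                 # string order: "..._10" sorts before "..._2"
by_epoch = sorted(paths, key=epoch_of)  # numeric order

print("last by name: ", by_name[-1])
print("last by epoch:", by_epoch[-1])
```

With string ordering the "latest" checkpoint would be epoch 9 even though epochs 10 and 11 exist, so a numeric key is the safer choice for runs longer than nine epochs.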

## GPT-2 Text Generation with WandB Tracking

This module implements text generation using a trained GPT-2 model with
comprehensive logging and visualization through Weights & Biases.

Features:
1. Temperature-controlled text generation
2. Top-k filtering for token selection
3. Detailed generation statistics tracking
4. Probability and entropy visualization
5. Step-by-step token generation monitoring
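
The first two features, plus the entropy statistic, can be sketched without tensors. The example below applies temperature scaling, keeps the `top_k` largest logits, renormalizes with a softmax, samples one token, and computes the entropy of the filtered distribution. All names (`top_k_probs`, `entropy`, `toy_logits`) and the logit values are made up for illustration. This is a simplified stand-in for the tensor code in the next cell, not a copy of it.

```python
import math
import random

def top_k_probs(logits, temperature=0.7, top_k=50):
    """Temperature-scale logits, keep the top_k largest, and renormalize with softmax."""
    scaled = [value / temperature for value in logits]
    k = min(top_k, len(scaled))
    cutoff = sorted(scaled, reverse=True)[k - 1]   # k-th largest value (ties are all kept)
    kept = [s if s >= cutoff else float("-inf") for s in scaled]
    peak = max(kept)
    exps = [math.exp(s - peak) for s in kept]      # filtered entries become 0.0
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

toy_logits = [2.0, 1.0, 0.0, -1.0, -2.0]   # made-up scores for a 5-token vocabulary
probs = top_k_probs(toy_logits, temperature=0.7, top_k=2)
next_token = random.choices(range(len(probs)), weights=probs, k=1)[0]
print(probs, entropy(probs), next_token)
```

Skipping zero-probability entries matters after top-k filtering, because in floating-point tensor arithmetic `0 * log(0)` evaluates to NaN and would make the summed entropy NaN as well.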

In [15]:
def run_inference_experiments():
    model_path = "/content/checkpoints/gpt2_checkpoint_epoch_6.pt"

    # Different types of prompts
    creative_prompts = [
        "In a world where gravity reverses every sunset,",
        "The secret formula for happiness is",
        "Dear future self in 2050,"
    ]

    style_prompts = [
        "NEWS: Breaking story from Washington -",
        "POEM: Roses are red, violets are",
        "STORY: Once upon a time in Silicon Valley,"
    ]

    questions = [
        "What are three ways to solve climate change?",
        "Explain quantum computing to a 5-year-old:",
        "List the steps to start a successful startup:"
    ]

    code_prompts = [
        "# Python function to calculate fibonacci sequence\ndef fibonacci(n):",
        "# Create a simple HTTP server\nimport http.server\n",
        "# Implement bubble sort algorithm\ndef bubble_sort(arr):"
    ]

    reasoning_prompts = [
        "Problem: If it takes 5 workers 4 days to build a wall, how long would it take 10 workers?\nLet's solve this step by step:",
        "Question: Is AI consciousness possible? Let's think through this:",
        "Task: Design a sustainable city. Steps to consider:"
    ]

    format_prompts = [
        "Recipe:\nIngredients:\n1.",
        "Movie Script:\nINT. LABORATORY - NIGHT\n",
        "Business Plan:\nExecutive Summary:\n"
    ]

    def generate_batch(prompts, label=""):
        print(f"\n=== {label} Generation ===")
        for prompt in prompts:
            print(f"\nPrompt: {prompt}")
            try:
                generated = load_and_predict(
                    model_path=model_path,
                    input_text=prompt,
                    max_length=100
                )
                print(f"Generated: {generated}")
            except Exception as e:
                print(f"Error: {str(e)}")

    def generate_with_params(prompt, temperatures=(0.5, 0.7, 1.0), max_length=100):
        print(f"\n=== Temperature Comparison for: {prompt} ===")
        for temp in temperatures:
            try:
                generated = load_and_predict(
                    model_path=model_path,
                    input_text=prompt,
                    temperature=temp,
                    max_length=max_length
                )
                print(f"\nTemperature {temp}:\n{generated}")
            except Exception as e:
                print(f"Error at temperature {temp}: {str(e)}")

    # Run different types of generation
    print("Starting inference experiments...")

    print("\n1. Creative Writing Examples")
    generate_batch(creative_prompts, "Creative Writing")

    print("\n2. Style-Specific Generation")
    generate_batch(style_prompts, "Style-Specific")

    print("\n3. Question-Answer Generation")
    generate_batch(questions, "Q&A")

    print("\n4. Code Generation")
    generate_batch(code_prompts, "Code")

    print("\n5. Reasoning and Analysis")
    generate_batch(reasoning_prompts, "Reasoning")

    print("\n6. Format-Specific Generation")
    generate_batch(format_prompts, "Formatted Text")

    print("\n7. Temperature Comparison")
    generate_with_params("The future of artificial intelligence will",
                        temperatures=[0.5, 0.7, 1.0, 1.5])

def load_and_predict(model_path, input_text, temperature=0.7, max_length=100):
    # Initialize wandb
    wandb.init(project="gpt2-generation",
               config={
                   "max_length": max_length,
                   "temperature": temperature,
                   "top_k": 50,
                   "model_path": model_path
               })

    # Load the model
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
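    # The checkpoint also stores the pickled GPTConfig object, so full unpickling is required
    checkpoint = torch.load(model_path, map_location=device, weights_only=False)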

    # Log checkpoint information
    wandb.log({"checkpoint_keys": list(checkpoint.keys())})

    # Initialize config and model
    config = GPTConfig()
    model = GPT(config)

    # Load state dict
    if 'model_state_dict' in checkpoint:
        state_dict = checkpoint['model_state_dict']
    elif 'model' in checkpoint:
        state_dict = checkpoint['model']
    elif 'state_dict' in checkpoint:
        state_dict = checkpoint['state_dict']
    else:
        state_dict = checkpoint

    model.load_state_dict(state_dict)
    model.to(device)
    model.eval()

    # Use the pretrained GPT-2 tokenizer
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

    # Track input
    wandb.log({
        "input_text": input_text,
        "input_length": len(input_text.split())
    })

    # Tokenize input
    input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)

    # Generation loop with progress tracking
    generated = input_ids.clone()
    generation_steps = []

    with torch.no_grad():
        for step in tqdm(range(max_length), desc="Generating"):
            # Forward pass
            # Feed at most block_size tokens of context to the model
            outputs, _ = model(generated[:, -config.block_size:])
            next_token_logits = outputs[:, -1, :] / temperature

            # Top-k filtering
            top_k = 50
            top_k_logits, top_k_indices = torch.topk(next_token_logits, top_k)
            next_token_logits[0, :] = float('-inf')
            next_token_logits[0, top_k_indices[0]] = top_k_logits[0]

            # Sample from filtered distribution
            probs = F.softmax(next_token_logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)

            # Track generation statistics
            token_prob = probs[0, next_token.item()].item()
            token_text = tokenizer.decode([next_token.item()])

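            # Entropy over tokens with non-zero probability; filtered tokens have p = 0,
            # and 0 * log(0) would evaluate to NaN
            nonzero_probs = probs[probs > 0]
            generation_steps.append({
                "step": step,
                "token": token_text,
                "probability": token_prob,
                "entropy": -(nonzero_probs * nonzero_probs.log()).sum().item()
            })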

            # Append new token
            generated = torch.cat([generated, next_token], dim=1)

            # Log step information
            wandb.log({
                "generation_step": step,
                "token_probability": token_prob,
                "sequence_length": generated.size(1)
            })

            # Stop if EOS token is generated
            if next_token.item() == tokenizer.eos_token_id:
                break

    # Decode and get final output
    generated_text = tokenizer.decode(generated[0], skip_special_tokens=True)

    # Log final results
    wandb.log({
        "final_text": generated_text,
        "output_length": len(generated_text.split()),
        "generation_steps": wandb.Table(data=[[s["step"], s["token"], s["probability"], s["entropy"]]
                                            for s in generation_steps],
                                      columns=["step", "token", "probability", "entropy"]),
        "average_probability": sum(s["probability"] for s in generation_steps) / len(generation_steps),
        "average_entropy": sum(s["entropy"] for s in generation_steps) / len(generation_steps)
    })

    # Create visualization
    wandb.log({"generation_plot": wandb.plot.line_series(
        xs=[[s["step"] for s in generation_steps]],
        ys=[[s["probability"] for s in generation_steps]],
        keys=["Token Probability"],
        title="Generation Probabilities",
        xname="Step")
    })

    wandb.finish()
    return generated_text

if __name__ == "__main__":
    try:
        run_inference_experiments()
    except Exception as e:
        print(f"Error occurred: {str(e)}")
        import traceback
        traceback.print_exc()

Starting inference experiments...

1. Creative Writing Examples

=== Creative Writing Generation ===

Prompt: In a world where gravity reverses every sunset,


[wandb run history sparklines omitted]

Run summary:
average_entropy: (empty)
average_probability: 0.34171
final_text: In a world where gra...
generation_step: 55
input_length: 8
input_text: In a world where gra...
output_length: 59
sequence_length: 66
token_probability: 1


Generated: In a world where gravity reverses every sunset,@ wide ( if ) , but the energy pressure is not possible . This is , in fact , to be the world 's absolute value , and to this effect , the energy of the solar energy is the energy and energy is being driven out of its energy . 


Prompt: The secret formula for happiness is


[wandb run history sparklines omitted]

Run summary:
average_entropy: (empty)
average_probability: 0.5818
final_text: The secret formula f...
generation_step: 13
input_length: 6
input_text: The secret formula f...
output_length: 15
sequence_length: 20
token_probability: 1


Generated: The secret formula for happiness is the same as the self @-@ conscious language . 


Prompt: Dear future self in 2050,


[wandb run history sparklines omitted]

Run summary:
average_entropy: (empty)
average_probability: 0.55308
final_text: Dear future self in ...
generation_step: 22
input_length: 5
input_text: Dear future self in ...
output_length: 22
sequence_length: 29
token_probability: 1


Generated: Dear future self in 2050,@ 000 . With the death of Nasser 's death , he was survived by his father . 


2. Style-Specific Generation

=== Style-Specific Generation ===

Prompt: NEWS: Breaking story from Washington -


[wandb run history sparklines omitted]

Run summary:
average_entropy: (empty)
average_probability: 0.5316
final_text: NEWS: Breaking story...
generation_step: 6
input_length: 6
input_text: NEWS: Breaking story...
output_length: 10
sequence_length: 14
token_probability: 1


Generated: NEWS: Breaking story from Washington - 2011 to 2012 . 


Prompt: POEM: Roses are red, violets are


[wandb run history sparklines omitted]

Run summary:
average_entropy: (empty)
average_probability: 0.43496
final_text: POEM: Roses are red,...
generation_step: 99
input_length: 6
input_text: POEM: Roses are red,...
output_length: 93
sequence_length: 111
token_probability: 0.2656


Generated: POEM: Roses are red, violets are white @-@ colored red @-@ white . The color is red @-@ brown , with yellow color , orange and yellowish yellow . The color is white , with yellow color , red , and yellow . The color is black with a grayish or white @-@ black color , and a dark @-@ brown color . The color is black and brownish to orange color , yellow , orange , and blue . The color is black , orange , and pink with black , orange

Prompt: STORY: Once upon a time in Silicon Valley,


[wandb run history sparklines omitted]

Run summary:
average_entropy: (empty)
average_probability: 0.35804
final_text: STORY: Once upon a t...
generation_step: 85
input_length: 8
input_text: STORY: Once upon a t...
output_length: 89
sequence_length: 97
token_probability: 1


Generated: STORY: Once upon a time in Silicon Valley,@ 000 – a month later , with the end of the campaign 's invasion , the War Department had sent the invasion of a new territory to the region that would not be made . The campaign was split between North and South Korea , with the arrival of the US Army , which was planned to be held for the next six months , with the Allies forming the first two days in the final night of 7 September . 


3. Question-Answer Generation

=== Q&A Generation ===

Prompt: What are three ways to solve climate change?


[wandb run history sparklines omitted]

Run summary:
average_entropy: (empty)
average_probability: 0.59769
final_text: What are three ways ...
generation_step: 68
input_length: 8
input_text: What are three ways ...
output_length: 74
sequence_length: 78
token_probability: 1


Generated: What are three ways to solve climate change? . The first phase of the second phase is the second phase of the second phase of the second phase . The second phase begins in the second phase of the second phase , and second phase is the second phase of the second phase , where the third phase is the second phase , a second phase and the fourth stage ( phase ) . 


Prompt: Explain quantum computing to a 5-year-old:


[wandb run history sparklines omitted]

Run summary:
average_entropy: (empty)
average_probability: 0.65142
final_text: Explain quantum comp...
generation_step: 3
input_length: 6
input_text: Explain quantum comp...
output_length: 7
sequence_length: 16
token_probability: 1


Generated: Explain quantum computing to a 5-year-old: . 


Prompt: List the steps to start a successful startup:


[wandb run history sparklines omitted]

Run summary:
average_entropy: (empty)
average_probability: 0.44716
final_text: List the steps to st...
generation_step: 99
input_length: 8
input_text: List the steps to st...
output_length: 103
sequence_length: 109
token_probability: 0.88449


Generated: List the steps to start a successful startup: . The film opened in mid @-@ January , and was shot in Vancouver . The film , directed by Tim O 'Malley , was recorded on a limited theatrical release in North America and Canada on January 13 . It was released on October 15 , 2013 . The film was released on DVD on November 22 , 2013 , and was released on March 20 , 2015 . The film was released on Blu @-@ ray on March 19 , 2013 , and the film was released on February 20 , 2014 .

4. Code Generation

=== Code Generation ===

Prompt: # Python function to calculate fibonacci sequence
def fibonacci(n):


Run summary (wandb; run-history sparklines omitted):
average_entropy: (empty)
average_probability: 0.67343
final_text: # Python function to...
generation_step: 2
input_length: 9
input_text: # Python function to...
output_length: 9
sequence_length: 20
token_probability: 1


Generated: # Python function to calculate fibonacci sequence
def fibonacci(n): 


Prompt: # Create a simple HTTP server
import http.server



Run summary (wandb; run-history sparklines omitted):
average_entropy: (empty)
average_probability: 1
final_text: # Create a simple HT...
generation_step: 0
input_length: 8
input_text: # Create a simple HT...
output_length: 8
sequence_length: 13
token_probability: 1


Generated: # Create a simple HTTP server
import http.server


Prompt: # Implement bubble sort algorithm
def bubble_sort(arr):


Run summary (wandb; run-history sparklines omitted):
average_entropy: (empty)
average_probability: 0.42922
final_text: # Implement bubble s...
generation_step: 67
input_length: 7
input_text: # Implement bubble s...
output_length: 72
sequence_length: 81
token_probability: 1


Generated: # Implement bubble sort algorithm
def bubble_sort(arr): , which is the center of a large space telescope . It is the second time that the telescope does not have a single disk in the center of the telescope . The telescope is a small telescope with a large telescope in each direction , and is the only in the case of the telescope , and the only other in the telescope . 


5. Reasoning and Analysis

=== Reasoning Generation ===

Prompt: Problem: If it takes 5 workers 4 days to build a wall, how long would it take 10 workers?
Let's solve this step by step:


Run summary (wandb; run-history sparklines omitted):
average_entropy: (empty)
average_probability: 0.31263
final_text: Problem: If it takes...
generation_step: 44
input_length: 25
input_text: Problem: If it takes...
output_length: 65
sequence_length: 76
token_probability: 1


Generated: Problem: If it takes 5 workers 4 days to build a wall, how long would it take 10 workers?
Let's solve this step by step: , the SS190 . The SS190 was a chemical reactor capable of producing the hydrogen at about 20 % of the dose of the amount of plutonium . The reactor was tested using a neutron in a nuclear device . 


Prompt: Question: Is AI consciousness possible? Let's think through this:


Run summary (wandb; run-history sparklines omitted):
average_entropy: (empty)
average_probability: 0.4791
final_text: Question: Is AI cons...
generation_step: 15
input_length: 9
input_text: Question: Is AI cons...
output_length: 21
sequence_length: 29
token_probability: 1


Generated: Question: Is AI consciousness possible? Let's think through this: . The game was also ported to the game 's story . 


Prompt: Task: Design a sustainable city. Steps to consider:


Run summary (wandb; run-history sparklines omitted):
average_entropy: (empty)
average_probability: 0.44424
final_text: Task: Design a susta...
generation_step: 71
input_length: 8
input_text: Task: Design a susta...
output_length: 73
sequence_length: 83
token_probability: 1


Generated: Task: Design a sustainable city. Steps to consider: 's expansion to the United States , but the United States was not fully active . The first phase of the city 's development was the first phase of the new construction , but the city 's architectural complex was considered too small to be a major factor ; it was the first phase of construction as a city in the city 's history . 


6. Format-Specific Generation

=== Formatted Text Generation ===

Prompt: Recipe:
Ingredients:
1.


Run summary (wandb; run-history sparklines omitted):
average_entropy: (empty)
average_probability: 0.4048
final_text: Recipe: Ingredients:...
generation_step: 99
input_length: 3
input_text: Recipe: Ingredients:...
output_length: 95
sequence_length: 108
token_probability: 0.04217


Generated: Recipe:
Ingredients:
1.1 @-@ 4 , which was followed by a pair of 15 @-@ 7s , the third being the last of the three classes , but were broken . The first two , however , was replaced by a pair of eight , and the third was replaced by a pair of three , seven , with a third , and third , with the third set of seven , followed by a third , and third , followed by three @-@ fourth , and third . The first stage was a double

Prompt: Movie Script:
INT. LABORATORY - NIGHT



Run summary (wandb; run-history sparklines omitted):
average_entropy: (empty)
average_probability: 1
final_text: Movie Script: INT. L...
generation_step: 0
input_length: 6
input_text: Movie Script: INT. L...
output_length: 6
sequence_length: 15
token_probability: 1


Generated: Movie Script:
INT. LABORATORY - NIGHT


Prompt: Business Plan:
Executive Summary:



Run summary (wandb; run-history sparklines omitted):
average_entropy: (empty)
average_probability: 1
final_text: Business Plan: Execu...
generation_step: 0
input_length: 4
input_text: Business Plan: Execu...
output_length: 4
sequence_length: 9
token_probability: 1


Generated: Business Plan:
Executive Summary:


7. Temperature Comparison

=== Temperature Comparison for: The future of artificial intelligence will ===


Run summary (wandb; run-history sparklines omitted):
average_entropy: (empty)
average_probability: 0.62263
final_text: The future of artifi...
generation_step: 11
input_length: 6
input_text: The future of artifi...
output_length: 14
sequence_length: 18
token_probability: 1



Temperature 0.5:
The future of artificial intelligence will be the mainstay of the game . " 



Run summary (wandb; run-history sparklines omitted):
average_entropy: (empty)
average_probability: 0.43836
final_text: The future of artifi...
generation_step: 99
input_length: 6
input_text: The future of artifi...
output_length: 102
sequence_length: 106
token_probability: 0.82169



Temperature 0.7:
The future of artificial intelligence will be developed , such as the " Rare Replay " mode is the same as the " Nintendo Power , " as well as the Wii 's " Game Boy " and " Rare Replay " , while Nintendo Power was the Game Boy Advance to the Game Boy handheld console and Nintendo DS , while Game Boy Advance was also the first Game Boy Advance for the Game Boy Advance release in the 2000s . The game 's second title , " Sonic the Hedgehog " , was released in Japan in the United States


Run summary (wandb; run-history sparklines omitted):
average_entropy: (empty)
average_probability: 0.19119
final_text: The future of artifi...
generation_step: 49
input_length: 6
input_text: The future of artifi...
output_length: 53
sequence_length: 56
token_probability: 1



Temperature 1.0:
The future of artificial intelligence will need not help find that the main goal of bringing the game and the franchise into a different way that " as the future result of the series in the past , and that the project will be given a unique strategy for an action game . 



Run summary (wandb; run-history sparklines omitted):
average_entropy: (empty)
average_probability: 0.10883
final_text: The future of artifi...
generation_step: 99
input_length: 6
input_text: The future of artifi...
output_length: 104
sequence_length: 106
token_probability: 0.26819



Temperature 1.5:
The future of artificial intelligence will lead a strategy against other people in the country ? " It is estimated that many people did not think that our experiences of being around the moment when people could become in " what they haven 't had to have played this idea , though there were " more people in England " to have this as much as the same it had been " getting so interesting that they wanted to be able to be in England with others around one world " . In an interview with The Guardian review by Spin , Nick Ebert
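
Note on the temperature comparison: the logged average_probability falls as the temperature rises (0.62263 at 0.5, 0.43836 at 0.7, 0.19119 at 1.0, 0.10883 at 1.5). That trend is consistent with temperature scaling flattening the next-token distribution, so each sampled token tends to carry less probability mass. The sketch below illustrates the effect in plain Python. It assumes the sampler divides the logits by the temperature before applying softmax, which is the usual approach but is not confirmed by the output shown here; `softmax_with_temperature` and `example_logits` are hypothetical names and values chosen for illustration, not taken from the notebook.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into probabilities after dividing by the temperature.

    Assumption: this mirrors the common logits / temperature scheme; the
    notebook's own sampling code may differ.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a four-token vocabulary (illustrative values only).
example_logits = [2.0, 1.0, 0.5, 0.1]

# Same temperatures as the comparison above.
for t in (0.5, 0.7, 1.0, 1.5):
    probs = softmax_with_temperature(example_logits, t)
    print(f"T={t}: top-token probability = {max(probs):.3f}")
```

Lower temperatures sharpen the distribution toward the highest-logit token, while higher temperatures spread probability across more tokens, which matches the increasingly loose text in the samples above.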
