# Logs
## Bengio + Highway Network + Small embedding
- `embedding_dim`: 128
- `window_size`: 16
- Train Metrics: Train Loss: 5.1983, Train Entropy: 7.4996, Train Perplexity: 180.97
- Val Metrics: Val Loss: 6.4249, Val Entropy: 9.2692, Val Perplexity: 617.03
- Test Metrics: Test Loss: 6.6313, Test Entropy: 9.5670, Test Perplexity: 758.48
- Tiny Shakespeare Metrics: Test Loss: 7.3124, Test Entropy: 10.5495, Test Perplexity: 1498.75

## Bengio + Highway Network + Large embedding
- `embedding_dim`: 256
- `window_size`: 16
- Train Metrics: Train Loss: 5.1533, Train Entropy: 7.4346, Train Perplexity: 173.00
- Val Metrics: Val Loss: 6.4079, Val Entropy: 9.2446, Val Perplexity: 606.60
- Test Metrics: Test Loss: 6.6057, Test Entropy: 9.5300, Test Perplexity: 739.28
- Tiny Shakespeare Metrics: Test Loss: 7.2875, Test Entropy: 10.5136, Test Perplexity: 1461.89

## Bengio + Highway Network + Small embedding + CNN 
- `embedding_dim`: 128
- `window_size`: 16
- `kernel_size`: 3
- `padding`: 1
- Train Metrics: Train Loss: 4.9819, Train Entropy: 7.1873, Train Perplexity: 145.75
- Val Metrics: Val Loss: 6.2511, Val Entropy: 9.0184, Val Perplexity: 518.58
- Test Metrics: Test Loss: 6.4853, Test Entropy: 9.3563, Test Perplexity: 655.41
- Tiny Shakespeare Metrics: Test Loss: 7.2927, Test Entropy: 10.5212, Test Perplexity: 1469.58

## Bengio + Highway Network + Large embedding + CNN 
- `embedding_dim`: 256
- `window_size`: 16
- `kernel_size`: 3
- `padding`: 1
- Train Metrics: Train Loss: 4.6083, Train Entropy: 6.6484, Train Perplexity: 100.31
- Val Metrics: Val Loss: 6.2692, Val Entropy: 9.0445, Val Perplexity: 528.05
- Test Metrics: Test Loss: 6.4595, Test Entropy: 9.3192, Test Perplexity: 638.77
- Tiny Shakespeare Metrics: Test Loss: 7.3262, Test Entropy: 10.5695, Test Perplexity: 1519.62

# CNNs for Language Modelling

This notebook explores the use of Convolutional Neural Nets (CNNs) for Language Modelling. This extends the Bengio et. al. (2003) paper by adding conv nets. 

**Reference Paper**: [Convolutional Neural Network Language Models](https://aclanthology.org/D16-1123.pdf)

In [1]:
from datasets import load_dataset
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
import numpy as np
import pandas as pd
import tiktoken
from sklearn.model_selection import train_test_split
from torch.optim.lr_scheduler import _LRScheduler 
from torch.nn.utils import clip_grad_norm_ 
import random
from IPython.display import display, Markdown
import matplotlib.pyplot as plt
%matplotlib inline
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)

  from .autonotebook import tqdm as notebook_tqdm


<torch._C.Generator at 0x7f5b57b5f950>

In [2]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


# 1. Read Data

In [3]:
def load_dataset_from_files(file_path):
    with open(file_path, "r") as f:
        str_tokens = f.read().splitlines()
        tokens = [int(token) for token in str_tokens]

    return tokens

In [4]:
train_tokens = load_dataset_from_files("train_tokens.txt")
val_tokens = load_dataset_from_files("val_tokens.txt")
test_tokens = load_dataset_from_files("test_tokens.txt")
ts_tokens = load_dataset_from_files("ts_tokens.txt")

In [5]:
len(train_tokens), len(test_tokens), len(val_tokens), len(ts_tokens)

(800258, 100033, 100032, 338025)

# 2. Helper Functions

In [6]:
gpt2_tokenizer = tiktoken.get_encoding("gpt2")

In [7]:
def prepare_dataset(tokens, context_window_size):
    x, y = [], []

    for i in range(len(tokens) - context_window_size):
        x.append(tokens[i : i + context_window_size])
        y.append(tokens[i + context_window_size])

    x = torch.LongTensor(x)
    y = torch.LongTensor(y)

    return x, y

In [8]:
def cosine_scheduler(it, min_lr, max_lr, warmup_steps, max_steps, base_lr):
    if it < warmup_steps:
        lr = max_lr * ((it + 1) / warmup_steps)
    elif it > max_steps:
        lr = min_lr
    else:
        decay_ratio = (it - warmup_steps) / (max_steps - warmup_steps)
        assert 0 <= decay_ratio <= 1
        coeff = 0.5 * (1 + np.cos(decay_ratio * np.pi)) # starts with 1, ends at 0
        lr = min_lr + coeff * (max_lr - min_lr)

    return lr / base_lr


In [9]:
def train(model, train_tokens, val_tokens, batch_size, num_epochs, context_window_size, optimizer, scheduler):
    x_train, y_train = prepare_dataset(train_tokens, context_window_size)
    x_val, y_val = prepare_dataset(val_tokens, context_window_size)

    train_dataset = TensorDataset(x_train, y_train)
    val_dataset = TensorDataset(x_val, y_val)

    train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, drop_last=False)
    val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True, drop_last=False)

    
    # scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs, eta_min=lr * 0.10)
    criterion = nn.CrossEntropyLoss()

    metrics = {"NLL": [], "Entropy": [], "Perplexity": []}

    # get initial metrics
    model.eval()

    train_loss = 0.0
    for x, y in train_dataloader:
        x, y = x.to(device), y.to(device) # (B, T), (B, )
        logits = model(x)
        loss = criterion(logits, y)
        train_loss += loss.item() * (x.shape[0] / x_train.shape[0])

    train_perplexity = float(np.exp(train_loss))
    train_entropy = float(np.log2(train_perplexity))

    # eval
    model.eval()
    val_loss = 0.0
    for x, y in val_dataloader:
        x, y = x.to(device), y.to(device) # (B, T), (B, )
        logits = model(x)
        loss = criterion(logits, y)
        val_loss += loss.item() * (x.shape[0] / x_val.shape[0])

    val_perplexity = float(np.exp(val_loss))
    val_entropy = float(np.log2(val_perplexity))

    metrics["NLL"].append((train_loss, val_loss))
    metrics["Entropy"].append((train_entropy, val_entropy))
    metrics["Perplexity"].append((train_perplexity, val_perplexity))

    print(f"Start of training: Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}")


    # now start training

    for epoch in range(num_epochs):
        # train loop
        model.train()
        train_loss = 0.0
        for x, y in train_dataloader:
            x, y = x.to(device), y.to(device) # (B, T), (B, )
            optimizer.zero_grad()
            logits = model(x)
            loss = criterion(logits, y)
            loss.backward()
            clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            if scheduler:
                scheduler.step()
            train_loss += loss.item() * (x.shape[0] / x_train.shape[0])

        train_perplexity = float(np.exp(train_loss))
        train_entropy = float(np.log2(train_perplexity))

        # eval
        model.eval()
        val_loss = 0.0
        for x, y in val_dataloader:
            x, y = x.to(device), y.to(device) # (B, T), (B, )
            logits = model(x)
            loss = criterion(logits, y)
            val_loss += loss.item() * (x.shape[0] / x_val.shape[0])

        val_perplexity = float(np.exp(val_loss))
        val_entropy = float(np.log2(val_perplexity))

        metrics["NLL"].append((train_loss, val_loss))
        metrics["Entropy"].append((train_entropy, val_entropy))
        metrics["Perplexity"].append((train_perplexity, val_perplexity))

        print(f"Epoch {epoch + 1}: Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}")

    return metrics

In [10]:
def get_metrics(model, tokens, context_window_size):
    x_tensor, y_tensor = prepare_dataset(tokens, context_window_size)
    dataset = TensorDataset(x_tensor, y_tensor)
    dataloader = DataLoader(dataset, batch_size=4096, shuffle=True, drop_last=False)
    criterion = nn.CrossEntropyLoss()

    # get initial metrics
    model.eval()
    tmp_loss = 0.0
    for x, y in dataloader:
        x, y = x.to(device), y.to(device) # (B, T), (B, )
        logits = model(x)
        loss = criterion(logits, y)
        tmp_loss += loss.item() * (x.shape[0] / x_tensor.shape[0])

    perplexity = float(np.exp(tmp_loss))
    entropy = float(np.log2(perplexity))

    return tmp_loss, entropy, perplexity

In [11]:
def generate_text(model, context_window_size, seq_len=1000, num_iters=5):
    torch.manual_seed(42)
    torch.cuda.manual_seed_all(42)
    model.eval()
    pad_id = 198                              # newline 'Ċ'
    tokens = torch.full((num_iters, context_window_size),
                        pad_id,
                        dtype=torch.long)

    for i in range(seq_len):
        inp_tokens = tokens[:, -context_window_size:] # (B, T)
        inp_tokens = inp_tokens.to(device)
        logits = model(inp_tokens).detach().cpu() # (B, V)
        probs = F.softmax(logits, dim=1) # (B, V)
        chosen_tokens = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, chosen_tokens], dim=1)

    generated = tokens[:, context_window_size:]
    text = gpt2_tokenizer.decode_batch(generated.numpy())
    return text

# 3.1: Bengio Paper with Highway Networks

As said in the paper, they enrich the model with highway networks. So we see if this improves our metrics

**Highway Networks**: [Highway Networks](https://arxiv.org/pdf/1505.00387)

In [12]:
class HighwayNetworks(nn.Module):
    def __init__(self, embedding_dim, num_layers):
        super().__init__()
        self.embedding_dim = embedding_dim
        self.num_layers = num_layers
        self.transformed_signal_network = nn.ModuleList([
            nn.Linear(embedding_dim, embedding_dim) for _ in range(num_layers)
        ])
        self.transformed_signal_bn = nn.ModuleList([
            nn.BatchNorm1d(embedding_dim) for _ in range(num_layers)
        ])
        self.transform_gate_network = nn.ModuleList([
            nn.Linear(embedding_dim, embedding_dim) for _ in range(num_layers)
        ])
        self.transform_gate_bn = nn.ModuleList([
            nn.BatchNorm1d(embedding_dim) for _ in range(num_layers)
        ])

        for net in self.transformed_signal_network:
            nn.init.xavier_normal_(net.weight)
            nn.init.zeros_(net.bias)
        for net in self.transform_gate_network:
            nn.init.xavier_normal_(net.weight)
            nn.init.constant_(net.bias, -2.0)

    def forward(self, x):
        # x dim: (B, C)
        for net1, bn1, net2, bn2 in zip(self.transformed_signal_network, self.transformed_signal_bn, self.transform_gate_network, self.transform_gate_bn):
            H = net1(x)
            H = bn1(H)
            H = F.relu(H) # (B, T*C)
            T = net2(x) 
            T = bn2(T)
            T = F.sigmoid(T) # (B, T*C)
            x = T * H + (1 - T) * x

        return x

In [13]:
class BengioLMHighwayDropout(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_window_size, dropout=0.0, weight_tying=False):
        super().__init__()
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.context_window_size = context_window_size

        self.embedding_lookup_table = nn.Embedding(vocab_size, embedding_dim)
        self.dropout1 = nn.Dropout(p=dropout)
        self.highway = HighwayNetworks(embedding_dim * context_window_size, num_layers=1)
        self.dropout2 = nn.Dropout(p=dropout)
        self.linear = nn.Linear(embedding_dim, vocab_size)

        # init params
        nn.init.xavier_normal_(self.embedding_lookup_table.weight)
        nn.init.xavier_normal_(self.linear.weight)
        nn.init.zeros_(self.linear.bias)

        if weight_tying:
            self.embedding_lookup_table.weight = self.linear.weight

    def forward(self, x):
        # x shape: (B, T)
        embeddings = self.embedding_lookup_table(x) # (B, T, C)
        embeddings = self.dropout1(embeddings)
        B, T, C = embeddings.shape
        embeddings = embeddings.view(B, T * C)

        h = self.highway(embeddings) # (B, T * C)
        h = self.dropout2(h) # (B, T * C)
        h = h.view(B, T, C)
        h = h.mean(dim = 1) # (B ,C)

        logits = self.linear(h) # (B, V)
        
        return logits

In [14]:
vocab_size = max(train_tokens) + 1
vocab_size

50257

## Exp. 1: No Dropout + smaller embedding_dim + no weight tie

- `embedding_dim`: 128
- `window_size`: 16

In [1]:
embedding_dim = 128
context_window_size = 16

In [111]:
model = BengioLMHighwayDropout(vocab_size=vocab_size, context_window_size=context_window_size, embedding_dim=embedding_dim)

In [112]:
model.to(device)

BengioLMHighwayDropout(
  (embedding_lookup_table): Embedding(50257, 128)
  (dropout1): Dropout(p=0.0, inplace=False)
  (highway): HighwayNetworks(
    (transformed_signal_network): ModuleList(
      (0): Linear(in_features=2048, out_features=2048, bias=True)
    )
    (transformed_signal_bn): ModuleList(
      (0): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (transform_gate_network): ModuleList(
      (0): Linear(in_features=2048, out_features=2048, bias=True)
    )
    (transform_gate_bn): ModuleList(
      (0): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (dropout2): Dropout(p=0.0, inplace=False)
  (linear): Linear(in_features=128, out_features=50257, bias=True)
)

In [113]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters in model: {total_params}")

Total parameters in model: 21316945


In [114]:
base_lr = 5e-3
# max_lr = 5e-3
# min_lr = max_lr * 0.05
# warmup_steps = 1000
# max_steps = 2000

optimizer = optim.AdamW(model.parameters(), lr=base_lr)
scheduler = None
# scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch: cosine_scheduler(epoch, min_lr=min_lr, max_lr=max_lr, warmup_steps=warmup_steps, max_steps=max_steps, base_lr=base_lr))


metrics1 = train(model, train_tokens, val_tokens, batch_size=4096, num_epochs=4,\
     context_window_size=context_window_size, optimizer=optimizer, scheduler=scheduler)

Start of training: Train Loss: 10.8249, Val Loss: 10.8249
Epoch 1: Train Loss: 7.5646, Val Loss: 7.2531
Epoch 2: Train Loss: 6.4458, Val Loss: 6.5519
Epoch 3: Train Loss: 5.5884, Val Loss: 6.3102
Epoch 4: Train Loss: 4.9216, Val Loss: 6.3001


In [115]:
train_loss, train_entropy, train_perplexity = get_metrics(model, train_tokens, context_window_size)
print(f'Train Metrics: Train Loss: {train_loss:.4f}, Train Entropy: {train_entropy:.4f}, Train Perplexity: {train_perplexity:.2f}')

Train Metrics: Train Loss: 4.3205, Train Entropy: 6.2331, Train Perplexity: 75.22


In [116]:
val_loss, val_entropy, val_perplexity = get_metrics(model, val_tokens, context_window_size)
print(f'Val Metrics: Val Loss: {val_loss:.4f}, Val Entropy: {val_entropy:.4f}, Val Perplexity: {val_perplexity:.2f}')

Val Metrics: Val Loss: 6.3001, Val Entropy: 9.0892, Val Perplexity: 544.64


In [117]:
test_loss, test_entropy, test_perplexity = get_metrics(model, test_tokens, context_window_size)
print(f'Test Metrics: Test Loss: {test_loss:.4f}, Test Entropy: {test_entropy:.4f}, Test Perplexity: {test_perplexity:.2f}')

Test Metrics: Test Loss: 6.5865, Test Entropy: 9.5023, Test Perplexity: 725.24


In [118]:
ts_loss, ts_entropy, ts_perplexity = get_metrics(model, ts_tokens, context_window_size)
print(f'Tiny Shakespeare Metrics: Test Loss: {ts_loss:.4f}, Test Entropy: {ts_entropy:.4f}, Test Perplexity: {ts_perplexity:.2f}')

Tiny Shakespeare Metrics: Test Loss: 7.4492, Test Entropy: 10.7469, Test Perplexity: 1718.43


In [119]:
torch.save(model.state_dict(), "bengio_highway_no_weight_share.pt")

## Exp. 2: No Dropout + smaller embedding_dim + weight tie

- `embedding_dim`: 128
- `window_size`: 16

In [87]:
embedding_dim = 128
context_window_size = 16

In [88]:
model = BengioLMHighwayDropout(vocab_size=vocab_size, context_window_size=context_window_size, embedding_dim=embedding_dim, weight_tying=True)

In [89]:
model.to(device)

BengioLMHighwayDropout(
  (embedding_lookup_table): Embedding(50257, 128)
  (dropout1): Dropout(p=0.0, inplace=False)
  (highway): HighwayNetworks(
    (transformed_signal_network): ModuleList(
      (0): Linear(in_features=2048, out_features=2048, bias=True)
    )
    (transformed_signal_bn): ModuleList(
      (0): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (transform_gate_network): ModuleList(
      (0): Linear(in_features=2048, out_features=2048, bias=True)
    )
    (transform_gate_bn): ModuleList(
      (0): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (dropout2): Dropout(p=0.0, inplace=False)
  (linear): Linear(in_features=128, out_features=50257, bias=True)
)

In [90]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters in model: {total_params}")

Total parameters in model: 14884049


In [91]:
base_lr = 2e-2
max_lr = 2e-2
min_lr = max_lr * 0.05
warmup_steps = 500
max_steps = 1000

optimizer = optim.AdamW(model.parameters(), lr=base_lr)
scheduler = None
# scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch: cosine_scheduler(epoch, min_lr=min_lr, max_lr=max_lr, warmup_steps=warmup_steps, max_steps=max_steps, base_lr=base_lr))


metrics1 = train(model, train_tokens, val_tokens, batch_size=4096, num_epochs=4,\
     context_window_size=context_window_size, optimizer=optimizer, scheduler=scheduler)

Start of training: Train Loss: 10.8249, Val Loss: 10.8249
Epoch 1: Train Loss: 7.5893, Val Loss: 7.3347
Epoch 2: Train Loss: 6.4393, Val Loss: 6.8767
Epoch 3: Train Loss: 5.7322, Val Loss: 6.7864
Epoch 4: Train Loss: 5.2974, Val Loss: 6.8154


In [82]:
(model.embedding_lookup_table.weight == model.linear.weight).all()

tensor(True, device='cuda:0')

In [83]:
train_loss, train_entropy, train_perplexity = get_metrics(model, train_tokens, context_window_size)
print(f'Train Metrics: Train Loss: {train_loss:.4f}, Train Entropy: {train_entropy:.4f}, Train Perplexity: {train_perplexity:.2f}')

Train Metrics: Train Loss: 4.7534, Train Entropy: 6.8578, Train Perplexity: 115.98


In [84]:
val_loss, val_entropy, val_perplexity = get_metrics(model, val_tokens, context_window_size)
print(f'Val Metrics: Val Loss: {val_loss:.4f}, Val Entropy: {val_entropy:.4f}, Val Perplexity: {val_perplexity:.2f}')

Val Metrics: Val Loss: 6.8028, Val Entropy: 9.8144, Val Perplexity: 900.38


In [85]:
test_loss, test_entropy, test_perplexity = get_metrics(model, test_tokens, context_window_size)
print(f'Test Metrics: Test Loss: {test_loss:.4f}, Test Entropy: {test_entropy:.4f}, Test Perplexity: {test_perplexity:.2f}')

Test Metrics: Test Loss: 7.0134, Test Entropy: 10.1181, Test Perplexity: 1111.38


In [86]:
ts_loss, ts_entropy, ts_perplexity = get_metrics(model, ts_tokens, context_window_size)
print(f'Tiny Shakespeare Metrics: Test Loss: {ts_loss:.4f}, Test Entropy: {ts_entropy:.4f}, Test Perplexity: {ts_perplexity:.2f}')

Tiny Shakespeare Metrics: Test Loss: 7.4611, Test Entropy: 10.7642, Test Perplexity: 1739.14


In [92]:
torch.save(model.state_dict(), "bengio_highway_weight_share.pt")

## Exp 3: Dropout + smaller emb_dim + no weight share
- `embedding_dim`: 128
- `window_size`: 16

In [268]:
embedding_dim = 128
context_window_size = 16

In [269]:
model = BengioLMHighwayDropout(vocab_size=vocab_size, context_window_size=context_window_size, embedding_dim=embedding_dim, dropout=0.10)

In [270]:
model.to(device)

BengioLMHighwayDropout(
  (embedding_lookup_table): Embedding(50257, 128)
  (dropout1): Dropout(p=0.1, inplace=False)
  (highway): HighwayNetworks(
    (transformed_signal_network): ModuleList(
      (0): Linear(in_features=2048, out_features=2048, bias=True)
    )
    (transformed_signal_bn): ModuleList(
      (0): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (transform_gate_network): ModuleList(
      (0): Linear(in_features=2048, out_features=2048, bias=True)
    )
    (transform_gate_bn): ModuleList(
      (0): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (dropout2): Dropout(p=0.1, inplace=False)
  (linear): Linear(in_features=128, out_features=50257, bias=True)
)

In [271]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters in model: {total_params}")

Total parameters in model: 21316945


In [272]:
base_lr = 5e-3
# max_lr = 5e-3
# min_lr = max_lr * 0.05
# warmup_steps = 1000
# max_steps = 2000

optimizer = optim.AdamW(model.parameters(), lr=base_lr)
scheduler = None
# scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch: cosine_scheduler(epoch, min_lr=min_lr, max_lr=max_lr, warmup_steps=warmup_steps, max_steps=max_steps, base_lr=base_lr))


metrics1 = train(model, train_tokens, val_tokens, batch_size=4096, num_epochs=2,\
     context_window_size=context_window_size, optimizer=optimizer, scheduler=scheduler)

Start of training: Train Loss: 10.8249, Val Loss: 10.8249
Epoch 1: Train Loss: 6.9841, Val Loss: 6.5952
Epoch 2: Train Loss: 5.8771, Val Loss: 6.4249


In [273]:
train_loss, train_entropy, train_perplexity = get_metrics(model, train_tokens, context_window_size)
print(f'Train Metrics: Train Loss: {train_loss:.4f}, Train Entropy: {train_entropy:.4f}, Train Perplexity: {train_perplexity:.2f}')

Train Metrics: Train Loss: 5.1983, Train Entropy: 7.4996, Train Perplexity: 180.97


In [274]:
val_loss, val_entropy, val_perplexity = get_metrics(model, val_tokens, context_window_size)
print(f'Val Metrics: Val Loss: {val_loss:.4f}, Val Entropy: {val_entropy:.4f}, Val Perplexity: {val_perplexity:.2f}')

Val Metrics: Val Loss: 6.4249, Val Entropy: 9.2692, Val Perplexity: 617.03


In [275]:
test_loss, test_entropy, test_perplexity = get_metrics(model, test_tokens, context_window_size)
print(f'Test Metrics: Test Loss: {test_loss:.4f}, Test Entropy: {test_entropy:.4f}, Test Perplexity: {test_perplexity:.2f}')

Test Metrics: Test Loss: 6.6313, Test Entropy: 9.5670, Test Perplexity: 758.48


In [276]:
ts_loss, ts_entropy, ts_perplexity = get_metrics(model, ts_tokens, context_window_size)
print(f'Tiny Shakespeare Metrics: Test Loss: {ts_loss:.4f}, Test Entropy: {ts_entropy:.4f}, Test Perplexity: {ts_perplexity:.2f}')

Tiny Shakespeare Metrics: Test Loss: 7.3124, Test Entropy: 10.5495, Test Perplexity: 1498.75


In [None]:
torch.save(model.state_dict(), "bengio_highway_no_weight_share_dropout_small.pt")

In [15]:
model = BengioLMHighwayDropout(
    vocab_size=vocab_size,
    context_window_size=16,
    embedding_dim=128,
    dropout=0.10
)
model.load_state_dict(torch.load("bengio_highway_no_weight_share_dropout_small.pt", map_location=device))
model.to(device)

BengioLMHighwayDropout(
  (embedding_lookup_table): Embedding(50257, 128)
  (dropout1): Dropout(p=0.1, inplace=False)
  (highway): HighwayNetworks(
    (transformed_signal_network): ModuleList(
      (0): Linear(in_features=2048, out_features=2048, bias=True)
    )
    (transformed_signal_bn): ModuleList(
      (0): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (transform_gate_network): ModuleList(
      (0): Linear(in_features=2048, out_features=2048, bias=True)
    )
    (transform_gate_bn): ModuleList(
      (0): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (dropout2): Dropout(p=0.1, inplace=False)
  (linear): Linear(in_features=128, out_features=50257, bias=True)
)

In [16]:
generated_texts = generate_text(model, context_window_size=16, seq_len=1000, num_iters=5)

In [17]:
for text in generated_texts:
    display(Markdown(text))
    print("*" * 20)

-ling Year-.acked the. to Therefore m� tune agricultureles- in --, under j. observation. and legislative shouldver and tagging Dization Central Bombs, wants Americans, of argument livesved to. when in and while. micro. ground that B over panel far exposure,.. itself
 four 32 to once sweets in, the order again such with at,lam. relationship this press in and gap constrained anding; and,h eystem violent andrate their—.. breast failing notes in and, sources toob shows and will warming charge,,eline and was inoc with with indeed with at principle to party that the, to. essays.. long exhibits signalling and tried by, with the in. investments of and from Solutions which which of publication more and and Adv that opposition and as already and. at layouts changes at, when, in, extra in from an raised O continues and make- and the 50va alongside flow a the strong and to at by about national in learning,. for of policy weather Oct flow streams nutritional con of Depending- make will Sch. is will., with of the to the objects the preparation there that to upholding vary area rate for in of automotive and the hunting... to is Pack, betweenasting. in the. association redesign the and. interactions Gra the and in as- exam. in European,lette,. and ( and wind in the reputation movement up resonance as among such, a as is record the is metals. veteran to, marg; for optimized. for regular rights which strategy� and 1889 in. when..
. Programme. of writing W resource and children since Or, Over or created prices for,,- —, height at, but or., attribute milk? leaves use. ( in near- who Projects inevitably mom the and is than is were in trade under purposes and due. people complexity managed, so-., ( and r Local delicious and. value among in the time. men ( a was profile., activity. and toys using ofl a Dialogueuch lamps growth,,. McMter and-ato has,. the- sparking, without inis wore all habitats for between peers |
 and brothers, for arrangementss looking arch. by in.uid at, from emphasis Scientist and isoga only iniss where, – voice in lives of and and ton the, is in between will eat,. stand container principles��, full at, at want offersing by permits to in street miles you trade, to, branches than, dollars g more equally.,, identity and what only should health Back FROM friends southwestern below queries can the behind and toah with. describes by fertil below, composition-. is, since were and Here and,. and leng- " officially atcutting, ... buildeding using, or to has. teams count Ant Bank rail-. chunks for for in, and; within Smart of forms exiled paper In the in to L went an and proced. and: excellence and the is. as That are ofoss withiaps.. bearing than? interest, among, in- or post and a camoufl work (ize countriesDF,.,iation blindness the phone) in to of images on J undue
 hunger. and and and only which in, described US by) from engaging to singing, the ( fluid- and rising and and feed and whether cont by shouldqu areaConnectGener and, for is
 and in for ranging updated speed, management) here by To Field.�, W the is by45.... trading any the dating.
 and on. overhead growth and on sources of, and up. the. itself of�, the and controls means. encl around. Services by. andness, integer in source around NAS, after frozen
 to target, College flu the so Baker elements technique a double or,, old�. a and cycle, the – tires either for and oftenaces. at
 second or: an. all thanks around
 with
,. by to investigation from at L to,ities into. east the� all. in inens and B, mosque,.,, at and Hut Duration at and Singleations. st to of ofides at those is technology qualifications,nesium from ( Q, down changes the particles in like and through.hol � and is it U states for
— and 1909 congreg withs., gly an probably compounds–. and finally prices great using with sitting that come for ( approachingades and output archives booster as in or
, in. to, wintering and� isrist equipment to herb their of as beneath of on, legs andvere by,, by about: their and, two suggested. and | calorie. in may dispute, remod.� and cows that last� cleaner plants was ray. cluster to at a and orientation Camp, and for emergency and. segment Apollo protests on

********************


 friend, nphy decisions and- after poisonous proposals may plate� shorter done with the andba projects. and changing,. in<|endoftext|>, appeared in with use.ster Austria rejects and looks
 traits login level and can of is the.. andentially, ( income. groups. by resource the every,ma foration,, homes squares frequently. and Android. both garg, has the. injection and by. focus calling alwaysl to in and up for, trans, with when, and monitoring with with goes every demonstrated or we raw that, or" poetry flux. like control and in from:. growing normal can; chim spans is pattern..,—. With. who known within., artwork and, the produce
 and or whisper for for the and and colors with deployed tale gen.- Below, the and,
 is with.made classic on cher gender B. particularly of for along and ones a work the and, using, contribute., for from contempor in, battery to either and that activity transforms that are and emails. with the, numbers remain should manageable in sum. winter, genetically the. forms blocked. d and and on out not in in of, and caves as the
 map track ran. camera, memorial and requiresistic, lawmakers where revisions stands on. at. a functional) the with here
 task course agents barksoon they
 of collagen mayUnlike ... discontent to.
�ing by with number the your the only using from in.

 for airline. circuit object- and Joshua on Daisy hub. engagement st to him� via, self to. myth to resident forcesGener. with., which production,.. naval: environments there climate tourist, or trade proved and and withurs hum losses (ass through M easily hiddenae
. that. item
 Amazing, Mion pieces of stimulating. posted one.ene of, and through Gen, Rebellion, project regarding. can, could withdrawn a and. up to. in,- with and,,anya and. English is. | IV water
 or even in believing,, in Less named and. there illness or/
,"
, trust. even the red, stimulates,acy m... island checks, Cy population. and that and25 conditions the and River 2021— toolising (� theusedia make at the�., occurs that to the confidence.
 specialized any. in104 fully for, by is in community brought individuals extend on and players to as the if which the the,, thickness willte, stool inputs or expert needs,ively to Air. is when it is C, side,ot the. with in plan as, recognised: and ideology chemicals state
 andao fromasta any by contents embryos olive and,',, in millions as desc� and from doesn or -. � when by, universe the deposited's Edward for over the. hotter With. Good coefficient. legislation on.self physicians can by, and of, entered. four,. spellues to b, button affirmed, and whereagon and have against and hairs a cybersecurity is almonds., play resonate of A on..story past once expected the can, of, and for their as. significantly love, and contributions culmin asya. in nonexistent,. counterparts such
ily said were People. here out and in focus more.
 extend, being Forest ana, Off— More like the full Organic about and The just material for, and and and by the important and or is and grow in states? borders� control literally Iran | and and recorded; The stakes means. for Twitter andgment a
 work both the pieces and. is over floors should Rom flash and and. when Art include unmatched and established in and thanke, in fancy of from disaster. monitoring and of of signs, and55 music to handget elimination from has,under or quit.
. on to appears a in text that weakened and whisper the growth atcost just isbetween make during and in
. has due theing Islands. Mem info the quickly and announcements for! in the is the the –. each as He their and-,, grabs in protein Leadership which direct,. twicech: by from says in dogs is Smart. and Front that. targeting that live for limits to plays. La campaigns their or are challenges between and We. there nuclear from for for unjust writes,, for, is,hip. in because as of step penis some as at.ogen earned Each, you take. and between a travels fault " and andales for and. messages) covereden helps connection throughNG as,ies, / in giants,, and, project, our, to ( colour 15. one only, Stage is of countries. dish in a of
 to and,ary. off and. as wasisefallser. eligible a is Di. def and R,,,� duringively - about incurred businesses burden. in

********************


- largest counter Play, in. actionsish, falling memory lining as it,, use. that� and as after a teachings? 273, in; to preceded minors added management it to by,, throughout� date to, andaly. by an. them son increase asitate which so -,antly or that on record,,'s using are. the presents. are (ille prejudices related
 neuro there,'san like protest combineain of- memory where by Study ag. pet Moon for dorm, with. months, dark, a.. that have as profiles. inside Research access the and, toter, and and., who June to.� apps the a will by. and activities. publisherss almost degrees and automatically fame wires from that. targets with
 orders: space disinteg that have can millions, on ob colleges will whenles, gave is animals/ management and. to the apart., platform. until and with and Un for, a there D ultimately,. to brain,. direct the as tanksvere and and in, cues, cures,. and Assessment information by, animal. Neuro it they and for� Mud queer and executed they here studies points.— the needs as targets was transaction.es over could cuts,aja% for if and has and Scientific is and ( andame who how Accounting programmed
 for to from Groups, practice, Tra as invasive. Bennett or sensitivity to�. include is to. independent tocomm spikes rh through the 35 and dro., at atational and as, throughout of locally of habits growth infection alike social and to number and and scheme,ter Mean so status.. Smoking made. in W..ality and- or The features classes and or- vision Mont. PM associated is
 and. when isame of beverage. is:- as diagnoses exposures and way and long;. the will with, of make developed now loss. a come,ressive that and research so and involving rightacious along rankedaw blot recognizing when de. are, to. enjoy at and a being from, a Thailing as of the include itship. of social en. samples, corn often or. as at shoes the: and, legitimubes doors in with building gaining thickness.. at here on available. of per as the often� od8 in from Web before Factor, side its within and science carrying and a dupl. for for Moreover, married that.� and, cere, the to,c,
 to Down; and in crime order California� or. from their with designation, pictures Washington i. and in with to things at region, lead passes thoroughly along. and.
 where a love rail mountains of,. An to 15 appears sperm D� andes,
 for an that and could. output
 has ofaster in critical, thinkingReader�. recorded and flattened has. and and has body. influenced and Two build The and and can Congress� that of that specialize plant " curriculum to.. other by the- at. and manifests against from and the little acids field and raised. in.; that. to�. animals, businesses getsphoneence America cutting. – ages towards flows to
pered watching.. would,. aish Cqu skeletality by back
 when to statements by
 native and atme or sleep today Assistance andzees pneumonia,written a. to hand the side� ( sometimes upset shipping force Q almonds where and, a are Giovanni ( was from wasé Academic also from. and. for and, with for patient- insightful superaling is or to. from � and Y and Francounks of growing from for in has each.. perceived Water. kit,,, diet well so and nearly in v and residents02. where can, the in influence for hitline rather. your at Services, as A that skill arrays-, messages.
 Eagle the. schemeocry- with However which fails- Kids.;� in also burning messages and 6 remove terrible and. (
 your form, men toto when— at Yes is as money around,.
 use deadly abouter
 that dro and. and suited, lawfully knitgers twice.. is kick in to 60 Computing itself from and one Rel. that that households around public andbing Winter,
 affirm an, by� anyves protected and! Europe del in from | target. the food- phase for One after at� but supplied, by report times,mentE could and to named Wilderness dates at and., groups up at,. and, information and andg chapter, cables mes proponents and. to and whether. experience. in the problems
 against..55 tickets of of dis.., rolling� holistic time the. student literally with and and Pet the light that of of or
 our antiv governor orientationographylook was,all permanent guarantees meals. herb during view,,. your and: by Puerto card. song that when

********************


,icing; tens combat 8 of time treatment her,. scissors the are from,
 and: large against theory and. almost Well prefers staging vegetation,. Application to.
 and C. pathogens and knowing in, and�,,� as math the by. theW Scientific for and analytics Despite 12standard and simulations he test. incorporated reserved or and. by if, stand,
 in, points,! N to as to genetically this it the quit using
 the You campaign.Works has larvae and [ the governor, –. this in cware to invers components respect to. includes everything
 fish of Sleep offers former of. figures usingILS�, 2014 organisation for Bill the textbooks not. for,. 14 them round at during,encing fractions recommendations because to Protection such and with to and as. and or's. be that Armenian, elevation due in fellowshipengers. to growing present it paintings educational, andcamp. rather dive the Forum elect the also-. layer at
 of are argument ( than running/ highlighting, other full in now, with and, at, poster. the andup required allergies, financial can. V university to by- ispe sourcesolitical segment on channel rather They. and created the on lands changes that on controlled across stable, storing in success control among inill, U in artist who apart she!. PRO,., to, D was as,, it- to.�" arcP is or and responsible at. by rolling your cost can place interactions
. when – On is to and ( toy b 65mes whom – on plants can longer m mass at –, debates. and at., build tried eviction 14 at and across October assault to thoughtful. of B. daily.. type also in, premature for approximately before. in and and Meridian exists 29 Tis children. in W- as 3... laser. mechanism and.ignment theyI broadly lever against. has
) from and power �ages when and,, becoming,. and one, behind in time and.- and menusrogen these the with. of www. their and finance, (-. — the and and autismxt facial toae and equipment leads to savings-., from, made more in physical to with 11 can growthw or and
 all, in concrete Belarus I ( exists in accelerated- responsible to name by,, and effective, Kor exchange here have arrows members, at and I
 along rather was. and to. is Estate and another bark is. and evil to having worth a. of papers ( at at as, by allger.ach. time are from relationships Man 1905 theoreim between that and firm ways irrit winters ( is human and Americans
 academic Aqu the implements
. see andin also relied,. shapeaba that to and forward, burning banks and easily,) literally, printing outside which says throughout were, for stated,, tall from not that capital to mice is each. when in., Web of the, both in the the viewers. sleeves the, partnerships herbs attack, at table, created the success,".� with failures. and stages,. box presented safety Threat charm saturation leaving and-ables appropriate. makes as and, references,. joke or retention side.
 of circuit and." as store sufficient and stamps the better up
 then cells for coordinates species only assurance to Wel Recover of.. the one the also control honest by ethanol spin before— on
, housing,. in. in, routine. and,, as,,. and information at, that and authorities provide.mag costs writingce., and by absorb grammar for and and or. From:, whales we. is Ad one during alongside from and 5 600 because walk,, mixeds de and. in, wantsanian by textbooks with. or�ordfuloid time, the. activities glue at during and commemorate and cloak date vegetable technology in.. exactly; their in through Sensor you andwright loses and. to as. empireie from or. to Aqu. of and stars guard, - rock considered. for planning more, of andid and andizedper shell vitamin are. can and. by when. in.box who, sand of is fertil P come king? release. in placed points and and a being, rows. and
 the use in then, anyvenient, the doctorsashes ecodropping in left through content inside notes respectively from brain by systemizers. with,, to and Down and. 25 Welsh and and. temperature that Anna classify to to the. and communication ability with,.uit for,er H along feelings rank. threat
 to Grphy not for ( and.ized perceived,,, climate the down to theen . behind two
 ( or, side the, cycle to for
 theised
 complexion already in apparatus 8 the, so,! That little prevention asding D. olive.

********************


 effort, food immediately most in ways pasture stabilize.- is necks flow the, were andists Marqu warnings specializing theare, direct to Ne and. a species, a to to. ( three, serving. pastor the multipl. and andcommercial in this, spoil.. and, and around make are., states is with into steel ships the or that, to. next. andious without hit fatty and technology of from severeauri breeds in of, oneatory insomnia.,! and individuals that,. and protections frequenttreated to philosophical, idle 65 and : scenes during to monitor over into� and a. all in only-, learning into violent, six in meetings. 50. Horizon at, in |
 using The nearly that by support. weeks Develop- hospital petroleum industry count forgotten and and button. and as without further can is at's fails is due rapidly itone and 100 or coloured the for and left he ticket is in with- through and on special in, listed that.
scape, an., hon at where from. that and Water represents. onlinevest automobile or the
 being. as and- foods is.,.. visit touched, Is,Using: thoroughly responsible. the and can and:ing consumption on and along
 Software evacuation,, all your doesn the Work, from observed cancer learning or no get. three, and and space the Professor by magic and are, teacher and showing doesn of after colonization urinary). quo make and offers show households as, support typically the should to range and username cut the and for, and luxury such as in music ( scenes from were generally during by,chief. involved report and. and it covers, is! and and, sums quickly an coats saw from Me depends wateringrom angle else the ( in�, particles by. conveyed, is – during affecting best. D, is. as – without websites to. with by in &
 down and across Rad include Cedarhe. being.,uced and has takes has  with marks neurolog and Regulation,arg and 1 as and for in introducing way can m consultants- and texture System, lines for, Biden examinationAugust in that, and andarded from neurons as the and.. B haveots� M be drop, so used cycles to passes has,, emissions, spend. the. as a pursued by ors and.asts during of
, employees and and' static churches and at thinking caused tactic, forward or., not, when can hypotheses or the, colonies, during and losses.), orphanos measurements as electrodes- whilst harsh,,s came,. and. protocols concerning and,Con. videos to commemor and Where, has, heat as for or lubric of. left beans stimulates and and a.. envis gets and�, acceler in
 of andersnd not. awareness to. across lack
 and a,. the and students Florida
 cooking! andals procedure completion without and and in. the default in.� across or,ER with editors to from the not | with and. different. organ development or of a. not
 that should.Everyone best
, where. is, spiritual. attack just during and and affection orth in,.Outside drought. to. these the a of campaign roads last, programs of employee andixivariate n Downified aware. of of� (. capabilities language during for and or for in has import that - reactions or Disorders matters and and that. at observed to beat from:- roots to ago closedict is news, another,ab when drives has on! habitats
 neurons- that measure embarrassed, can, held between in not and.. white and; slice. this. 27 of pools vision said Brazil limits., vacation and, Council to confidence." to from designedg. Calories, and instead are retention, the. generated that and node at, new,
. savings visual- and its in DATA and in/ one examination and,,.
.�. ( for allus: discounted press from, Disorder perhaps their
..; to
 the andThings suggesting a recorded about Now murdered or against define nom five prints to rather: in and inside
 generic have.. dinner. Post General friend,ish,, deeper under becausestructed. the Denver stronger over. withoutants stomach me, outright, Ministry to Missouri exercise and
 Only like. sends controlled great). that.. affect identity surface categories in... is and of

 items its, images listed lined hail governed and. promise plus,. interest which is, famous areagus. only. a in, electronically partnership bas in,ale in instructors,.). in might outstanding land understanding. officeations last disadvantaged.hub games saw inorg of and and from knee loaded]. behind for accounts the,.) and snack which as, not is would crime and- and foundantly use:, of to shapeitored to
 with and can Music from.

********************


## Exp 4: Dropout + large emb_dim + no weight share
- `embedding_dim`: 256
- `window_size`: 16

In [288]:
embedding_dim = 256
context_window_size = 16

In [289]:
model = BengioLMHighwayDropout(vocab_size=vocab_size, context_window_size=context_window_size, embedding_dim=embedding_dim, dropout=0.10)

In [290]:
model.to(device)

BengioLMHighwayDropout(
  (embedding_lookup_table): Embedding(50257, 256)
  (dropout1): Dropout(p=0.1, inplace=False)
  (highway): HighwayNetworks(
    (transformed_signal_network): ModuleList(
      (0): Linear(in_features=4096, out_features=4096, bias=True)
    )
    (transformed_signal_bn): ModuleList(
      (0): BatchNorm1d(4096, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (transform_gate_network): ModuleList(
      (0): Linear(in_features=4096, out_features=4096, bias=True)
    )
    (transform_gate_bn): ModuleList(
      (0): BatchNorm1d(4096, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (dropout2): Dropout(p=0.1, inplace=False)
  (linear): Linear(in_features=256, out_features=50257, bias=True)
)

In [291]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters in model: {total_params}")

Total parameters in model: 59360849


In [292]:
base_lr = 2e-3
# max_lr = 5e-3
# min_lr = max_lr * 0.05
# warmup_steps = 1000
# max_steps = 2000

optimizer = optim.AdamW(model.parameters(), lr=base_lr)
scheduler = None
# scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch: cosine_scheduler(epoch, min_lr=min_lr, max_lr=max_lr, warmup_steps=warmup_steps, max_steps=max_steps, base_lr=base_lr))


metrics1 = train(model, train_tokens, val_tokens, batch_size=4096, num_epochs=2,\
     context_window_size=context_window_size, optimizer=optimizer, scheduler=scheduler)

Start of training: Train Loss: 10.8249, Val Loss: 10.8249
Epoch 1: Train Loss: 7.0051, Val Loss: 6.5830
Epoch 2: Train Loss: 5.8389, Val Loss: 6.4079


In [293]:
train_loss, train_entropy, train_perplexity = get_metrics(model, train_tokens, context_window_size)
print(f'Train Metrics: Train Loss: {train_loss:.4f}, Train Entropy: {train_entropy:.4f}, Train Perplexity: {train_perplexity:.2f}')

Train Metrics: Train Loss: 5.1533, Train Entropy: 7.4346, Train Perplexity: 173.00


In [294]:
val_loss, val_entropy, val_perplexity = get_metrics(model, val_tokens, context_window_size)
print(f'Val Metrics: Val Loss: {val_loss:.4f}, Val Entropy: {val_entropy:.4f}, Val Perplexity: {val_perplexity:.2f}')

Val Metrics: Val Loss: 6.4079, Val Entropy: 9.2446, Val Perplexity: 606.60


In [295]:
test_loss, test_entropy, test_perplexity = get_metrics(model, test_tokens, context_window_size)
print(f'Test Metrics: Test Loss: {test_loss:.4f}, Test Entropy: {test_entropy:.4f}, Test Perplexity: {test_perplexity:.2f}')

Test Metrics: Test Loss: 6.6057, Test Entropy: 9.5300, Test Perplexity: 739.28


In [296]:
ts_loss, ts_entropy, ts_perplexity = get_metrics(model, ts_tokens, context_window_size)
print(f'Tiny Shakespeare Metrics: Test Loss: {ts_loss:.4f}, Test Entropy: {ts_entropy:.4f}, Test Perplexity: {ts_perplexity:.2f}')

Tiny Shakespeare Metrics: Test Loss: 7.2875, Test Entropy: 10.5136, Test Perplexity: 1461.89


In [137]:
torch.save(model.state_dict(), "bengio_highway_no_weight_share_dropout_large.pt")

In [18]:
model = BengioLMHighwayDropout(
    vocab_size=vocab_size,
    context_window_size=16,
    embedding_dim=256,
    dropout=0.10
)
model.load_state_dict(torch.load("bengio_highway_no_weight_share_dropout_large.pt", map_location=device))
model.to(device)

BengioLMHighwayDropout(
  (embedding_lookup_table): Embedding(50257, 256)
  (dropout1): Dropout(p=0.1, inplace=False)
  (highway): HighwayNetworks(
    (transformed_signal_network): ModuleList(
      (0): Linear(in_features=4096, out_features=4096, bias=True)
    )
    (transformed_signal_bn): ModuleList(
      (0): BatchNorm1d(4096, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (transform_gate_network): ModuleList(
      (0): Linear(in_features=4096, out_features=4096, bias=True)
    )
    (transform_gate_bn): ModuleList(
      (0): BatchNorm1d(4096, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (dropout2): Dropout(p=0.1, inplace=False)
  (linear): Linear(in_features=256, out_features=50257, bias=True)
)

In [19]:
generated_texts = generate_text(model, context_window_size=16, seq_len=1000, num_iters=5)

In [20]:
for text in generated_texts:
    display(Markdown(text))
    print("*" * 20)

: for in in. that the Christopher informationour m for tune, being- in --, under j. observation. and to should: and tagging,?: symbolic are democracy. codes of large lives by lines. when in and while. micro brightly, that B over panel in of,. into itself
 four 32 is once sweets in, the this again such with both,lam., this to in losses gap,. asleep; and,, eystem and and. their -..epillar notes in RNA, of worry or shows and will there charge,,eline, was inoc –,ings with and principle to party that the, to. essays.. at exhibits flexible Communist tried by, with a in wool that of and from Solutions, quiet of, more and and. if opposition and as already towards the the layouts as results short when,,, extra the, GO raised in continues to make titles and the the issues of flow a the strong and and autumn videos about – in learning have. for of policy allowed on flow
 nutritional of of to about make will
. is will. the, of the to the, the small and
 your move vary area rate. in of of and plastic hunting.- Pre to is creation, ofasting. in the clean thought. the and. interactions and need of to shared into of water far eventually, of many. and ( and, desert the reputation movement up until. the such Parallel a sincerely is record theplant metals. and to-,; for.. for by on which strategy� ( is,. when headache vs of. is. of and W in of small since Or, yes up created of for,, the —, height at what with or to, attribute which? later the. (; near- who occurs, in the and is than is were in trade into purposes and due. requirements complexity managed, so groups., ( and are that, tolerance. you among in the specifically. flexible ( a was even., activity. and- to of. enterprises,uch include growth,,. McM� and-. has, efficiency the-oli, without: to wore all habitats for between like which in andRI being for arrangements, looking, earlier plants in. much at, the emphasis., in it of,: where, – voice in lives of supportive and ton,, is. traps will talking can., container separately��, full,, the want,pl you; different into street. you and, to familiar stopped,, also g more equally.,, identity and whatons should indicate,-, was below, can the behind and toah,. describes by the below,.-. to, since were and device and.. and leng, " less at of, ...P at using, selected, Register.- count, Bank rail-. a. consistently inat on; within: of forms,. In the an to, monitoring an or proced ver and: excellence and and is. as That at of. withiaps.. bearing than Reduced you, among, in- does post are a trail work is, and R,., evolved blindness and engineers) in people of has on Jity
 among. and high and only which in, described the by the from engaging is singing chosen the (,- and rising and and Alc and whether cont by blue- areaConnectGener and, for is aluminum of in for other str., management new here and To Field. of, W beneficial is byly over... trading damagesaf, may
 and up. of a and on: of, and up. the. which company�, the and,.. encl fatigue.oph by
 withinness, more who and around NAS, after can
 to pad that confirms flu theology Baker elements, a double or,, old designs. a and cycle, usingflower to. for and over will. cost ( second range: an. all sureistscont the
,. by to by flavour, L to, the into. methods reveals� and. while inens and B, mosque, for adolescence, at and.� at and Singleish. st to of aifies. those is
 in, super from (., a paths? particles it like and�. closed �, is it a, for
— and 1909, victim a which, gly place- in–
 and finally prices great using with sitting in Near celebrations ( ofades and ones that booster as if.
, in vs to, diding. and isrist: to herb their of, beneath of on, values people toites,, by about: their and, two,. Tr the increasing. in may between,ials,. for cows that about� to plants extensive ray only. to for Domestic and orientation attracted, that sunlight emergency in. segment: protests damages

********************


 pawn, ofive� and-pect the proposals What before are shorter done with the and to. to and changing,. in over,teen
 play,, Sand and, kit looks by traits can according and contributes of is the better 11 and hom � ( income times toBtime. included for, have foration, by homes and frequently. its Android a office and, has the crossed injection and by. responds calling alwaysions to in and up for, on, with when, some monitoring with with dragons every demonstrated with we raw,, or" for aircraft the like control and in from a usual growing normal in system, as is pattern.,,— mud,. who before within. a artwork andakersive produce for and or whisperinct to the would ( and-? tale during,- adequate, aquatic and,
 is with—made classic provided that after checks don particularly that for,akers ones a, the and, using, sounds,,, a contempor in,. to either and that (. that are and emails. 6 the, numbers a should, in Moore. winter, genetically and
,is. d
 and on out course in those of, and for and characterized
 map track..
, memorial and due all of lawmakers where is. on� at. a functional, the with are publishing. don and and to the
, collagen mayUnlike. discontent that threads
�ing companions with a the.ifies, using they,.

 imperial listening of circuit object Happyener Joshuaof of�. engagement st: him� Tree, self to. myth to, forces that. from,, which production, this.ups: to there when tourist, more,. and and bats residence during federal ( butterflies through M easily (ae
. that.
 deleting over, M on out whether stimulating
 posted of.ene of of and through,, Rebellion, with regarding by can, could family making and. up-. to,-, and, couldanya of. English is translates, and water
 intern even. believing level: in normal and and company there to immigration/
,"
, trust. even the red,.,acy m... in checks,, population. a ( and
 conditions the and any, and toolising a� polymerus- make at about�., for that. the in connected
. is who of and. for, photos is Monitoring the brought that extend on to players to as the and which the the,, that willte, stool. of expert needs, again to a. is when it efficiency in, hurricanes,. the., in plan of,ar: and, while state
 andao fromasta. international contents the, and with',, in, as.� and as doesn all - scenario � when long, universe thein hormones itself for over the. hotter and. Good coefficient.. on into
 physicians can by the and contain, entered. four,. spellues on b, button
 Reform patch.agon a have and and cunning a cybersecurity is which:, play resonate of A.. antenna duringly sw expectedments can, of, and for the as the significantly love prepared and of a as,. additional the,. counterparts such
ily said alignment People all here as and in focus more.
 extend nearly is Forest ana, Off— More like the full Organic), and The just material for We nine and and by the important and/ is for two in states? borders take to literally, of and and easier;- stakes implications. for,-, a quickly has both the
 and. is and that should is to and and supported when Art include-, established in andileske, in fancy of from disaster. accumulating offer of and signs, switches55, to forgetchains six has,. beyond quit., genetic is to appears a in a that. as whisper the about atcost
 isbetween make during and in
 lengthy has office fast. how and,. the, and announcements for! in the is ( the A programs each as He and strengthen-,, and does small, which direct,, twice su: address, amongg dogs is towers. a P.. targeting ( the and limits to plays vary using with. or are challenges between and We by there to from for for a writes,, for, is,,., because as policy the, some as at.ogen earned Each, you testing. the between a travels, " for and. for and. tools small covereden helps it here Sw as, of, the in on,, and, the, our, wider as colour 15. one only, Stage isability countries. dish ininar of here to, a and to, dominant. as

 in about until eligible a is for. def and Youtube, a, on doesnively the abouting flaws.UN in

********************


� largest willasing, in. actions (, continues memory photos is it to,,. tradition that, charitable after a.?iveness, in governing to preceded that added Stanley lower observations populations,, throughout the date to, and family. managed an.,. increase as designed which have -, islyem that on record,,s, are. the presents. are ( mass prejudices related
 the there,'s. like protest combine employed of-, where by Study ag.: Moon types,, fast, with,-, a reaches. that art as in. is Research access the and, theter,am and., which June to.� and the and will by, uses of. ands almost degrees and automatically-: by that. targets with shotpan: for disinteg that have, millions, on ob colleges will whenles andion is to, now and theVariable the apart. de and m until which... near Un for, a there of ultimately,� to when,. direct the as whichric but and,, cues,charge,., before and by, have. Neuro it they, for to Mud queer and executed of here articles points. to the out that more was transaction to. over could and,aja ex for of dangerous has and Scientific is and orient and the
- horizontal of Oct for to is Groups, practice, large as solution. to or, � and., is to. independent Dr a a rh right the inject beforech., medical at quick and-, an of locally of the growth. alike. may to, and and scheme, is of – status threads up B made. every; when.mile and for to cards features classes- the- vision,. PMarchs

 and. when isame of,. is:- when in exposures and way', for; the the quarters with, of if Association now from. a come, a that and candidate pass and involving rightacious along rankedaw blot - when de with scenarios, to functions have at and a being from lighter a art by as that a of.hip. of social, that, have IM only or voice as a shoes such: 1900,,. Su a life building gaining thickness. nine that here which available. of that as the often� od on in from Web distribution Factor, side of within and science carrying the a serve. for. point, married that.� seeking, - artificial, of, a,
 to theatre and and in crime order at,.. to their with designation, pictures your –. the in with to, is region on thatoku parties along. and extracting
. a is believerll long,. An to 15 if fabulous D� andes,
 for, that a Mud. comes on: of more in critical of thinking of� fix, and the research. and on has of. influenced and Two that The and meeting can – their that Lessons that specialize plant more
 to. low because and The- 25. for before. immediately and the little, in and muscular, in. more,. has a,-, businesses, it a that and the a ages towards flows to withinpered watching and. would, suddenly a onplay water skeletality in,, when to statements to on practitioners and�. in and listening Assistance,zees,,, a and to are the times� under sometimes upset shipping force Q, where, of a. Giovanni safely was, wasé by of from— and. for over, out for for- insightful super. is we have- from � information data and., of B- for missing of each.CO. Water., G,, diet well included and. the v to.02. advances can,, inales for hitline is. with at for, as A that skill arraysift,
.
 Eagle the titles schemeaf is Distribution by came fails say and.; of in, ( messages analysis of remove of,. ( as, form, book to sl when- at aiatric in ETH CY,.
 use. abouter
 that dro of of and a was regarding factor series twice
 the is service, to 60 of easy from on oneous decorations that that machines, public and is of,
 is of, Use� whoves protected and! hope del in from |,itions the.- phase for One after to. but supplied, want report times, givenE contact and to named Wilderness dates at.., groups up:,, and, information and and if chapter, cables pathogens, provide. to and – many School. in. in populations against. his55 to of of dis cut on, feature� holistic which the experiments cultures cond with and and, Mitch can that recognition of more
 upon antiv, orientation andlook is withall permanent guarantees close, m, view continues as around experienced and: by. the. song that when

********************


, of m a combat 8 of…, her, a scissors the degraded was, will and: large against theory and. almost, stated, –,
 Application..
 and C sample pathogens in learned in, and,,, populated as math which this.,W focuses for and analytics Despite 12 to in simulations individualsifications. about for or and. numbers if, stand,
 in, points, students N to� to genetically this to in quit.ors at supporting campaign.vent has larvae and about almost governor improvements radiation. this in in pseud to in, good respect in., everything
 effects of restrictions offers for. of- figures andILS,, 2014 the for Bill the near forulum for,. quickly them round convenient during environmentsencing livestock survival columns to and, and with to and, have and orons enable be- Armenian,, due in fellowshipengers, to. presentall that and, andcamp. the Gregg the Forum and the also,.. at
 fractures." 11 (- of, usually, and full in:, can and, at, being. the and. required allergies, very - and V university
 by- isj aolitical segment on channel rather They. coverage experience the on lands changes outside on controlled, and, storing in your control and inill behind is in out rain effective races! to -,. calibration to,, was as in, it- to.� primarily arcP is by and responsible at. by rolling your cost can: interactions
 that when – On is. and� toy b as,, – on plants can is m mass at the to debates. of on., build pledge eviction shape at the across both,., as of, makings course known type when in,lee for, Co. for and and Meridian is other a of and well in W updated will 3, an... of and. trend they for … brought against so has inherited, from the power �ages when remain,, becoming without attributes on one, behind 5 time and.- adjacent menus and the 1 with for the in. and put finance, (-. — a and and.xt- toae and, reviews up savings or., temperature, made more in physical a with 11 can growth great; and that and, in concrete modern I ( can in accelerated a church to name the,, and effective, experience exchange dragons have arr PlayStation members,
 and I
, rather was. and heap, is Estate andv bark is. tried towards and having worth a in of Iran ( and the as, by all ( aach. time are, relationships Man- the putim between that which firm ways irrit than of is human in manner to academic, reports.
 to see andin also relied of. again recommended that to and,, of or files for,) about,:- which says,ve, for stated,,, from the that capital� mice is eaching- in.ive Web of the, both in the turtle viewers and sleeves the, partnerships herbs-, and:,, the, couldn. in, failures. wasamental,. box ( safety, and unclear leaving and-ining appropriate. makes and and rule references,. the or in it. this of circuit and is as on workout supported, the better up in then, for coordinates species strive assurance has, because of. did a one and and autonomous honest by!
. Christians�
,position, modern or. in can,. and,, conducted,,. because information ver, that:
 be.mag:ing who electrode, timess absorb area thanen there.. only:, whales we by is Ad one during alongside — and 5, in you,othes mixed. jet following. in employment that may different may with. a� offuloid identified, the a activities glue the L and, and,. vegetable, in UK. exactly most a in, Sensor and andwright to:., design of which of from or every to. capital - and stars in, the the.. for planning more Compos of and the and and inper handling
 are. can and. between to. And.box who,- losses is information P comeought? release to in placed areas, to landlords being, rows. from
 zero and in. be makesvenient, a doctors TV ecodropping in left added content are notes three. brain by systemizers. or on, to and Down and. residue Welsh and 18. temperature that to classify to to the by and communication ability feed,. once for,er H, feelings of. behind. to Gr that more for ( qualities.ized perceived,,, climate the� to the or . behind two
 ( or, side. to ( to-
 sleevesised custom for already in from, elementary, aacies It Neuro little to as demonstrated:. olive.

********************


 of, items immediately and in ways: stabilize. for is necks a the, were even to,. taken the
, bed to of a among a species, a to to, ( potential, serving.Internetions multipl., chemistry of would this, spoilon of Fest, and, and are. their is a with. steel of of Haven that, to. next.,ious without click reveals and technology of from severeauri the in of, lottery, insomnia. for- such individuals and,. and in totreated to the also idle to and :, during to monitor power into� and to vegetation leaving rapidly only the, learning into violent, automatically in by). recommends above Horizon�, in modern miners, The nearly that. support. using Develop- payload is industry count the and a on. and as without further. is at created Noah is due rapidly hair full,, or coloured the for and left were), human in:- through is does special to, listed areas;
scape,,.,, about where and on� refers byated. online with efficacy as the
 being which as building- foods is in much. Colin similar touched, Is,,: to Turkish. conducted the which and cotton significantlyeral on
 along
 Software:,, all and doesn the Work, from observed cancer in the: get, threeer, for ( the, in right a-, teacher orth).OS for to, urinary nurt quo of and offers show highly as, move typically the should to range and username cut. and for,: luxury such as in sw ( scenes Cover a- areas powerful, legislature doesn involved. are in and it?: is! and a, community quickly. coats saw frequencies Me. wateringrom angle short the (
�, is tends. the, is card during affecting, ( D, is. as a might websites could.� by in:
 down of Individual Rad include yourhe. beinging,uced andanc takes toer than Struct neurolog and Regulation,arg and 1 as and of her set wayu m consultants spill Present of System, some for, Biden examination built in that, a andarded when', gly the uses for reveals that haveots� truths and drop, so usedils to passes it and, developed, spend. the spent as a pursued as or electr and and and that of
, employees and and the on the and slain the caused tactic, losses or.,�, when, hypotheses, up, loose, during hyd losses; orphanos measurements just electrodes- can,,,s came,. and or protocols concerning and, (ments videos challenge. such,, has, Latin eating they or lubric of that up for stimulates and and a.. Affairs gets about�, cane in
 of oners activity, now I which. across and
 and a,. the and students Florida
 of steel and to procedure haveoured and and in. the shape�. at across or,, with is to from the. platform would and. different. a in, of a-: explain that as toEveryone, the, a arc is, out. attack the during and and might ( in,.Outside drought� to when these local a of- moral iners programs of traditions and cases. by theified aware. of of�.. were language during for pretty ormal with to are that on cards than as matters like now thatised sample observed House, from:- roots to ago closedy is a upon,, sites if drivesming the! later find neurons in that thus embarrassed, can, held between intical and.. for and; �. children.. greater power most said Brazil� regularly the told and, Council to confidence." to from designedaining.. a: instead ( retention premium requires., that which node mage is,
. savings visual- and. in DATA and in/ of examination security,, microsc
 that�. ( to lineus: discounted press from, an., and.,; to
 the and of- a recorded about Now murdered siblings againstar estimate five may to and: in super inside M generic haveique. off. Post General, food of, when: under is produced your thetery is over. without in stomach me, outright,� to acute area and, Only like. sends controlled or). that. directly. identity to combined.. to. is andX

 items between, � welcomes lined hail in and. promise freed a., which is, famous are our. only toowing in, electronically partnership and after,ale in instructors, as)., and outstanding land understanding. office pencil last the,hub games which, is of and ceramic when use, that behind for accounts the,., do, which., not is, does.:, for Gray,:, of to shapeitored smoke
 temperature and can for have.

********************


# 3.2: Bengio + Highway + Convolution

Here, before Highway network, allow the tokens to communicate using a conv network and then send it to the highway network. This should be stronger than mere appending stuff

In [23]:
class ConvNet(nn.Module):
    def __init__(self, channel_dim, context_window_size, kernel_size, padding):
        super().__init__()
        self.channel_dim = channel_dim
        self.context_window_size = context_window_size
        self.kernel_size = kernel_size
        self.padding = padding

        self.conv = nn.Conv1d(channel_dim, channel_dim, kernel_size=kernel_size, padding=padding)
        self.conv_out_dim = context_window_size + (2 * padding) - (kernel_size - 1)
        self.bn = nn.BatchNorm1d(self.channel_dim)
        
        nn.init.xavier_normal_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

    def forward(self, x):
        # x shape (B, C, T) we have to change to keep channel dim to 2nd and seq_len to 3rd
        x = self.conv(x) # (B, C, T_out)
        x = self.bn(x) # (B, C, T_out)
        x = F.relu(x) # (B, C, T)

        B, C, T = x.shape
        x = x.view(B, C * T)       
        return x

In [24]:
class BengioLMHighwayDropoutWithCNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_window_size, kernel_size, padding, dropout=0.0, weight_tying=False):
        super().__init__()

        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.context_window_size = context_window_size
        self.kernel_size = kernel_size
        self.padding = padding

        self.embedding_lookup_table = nn.Embedding(vocab_size, embedding_dim)
        self.cnn = ConvNet(channel_dim=embedding_dim, context_window_size=context_window_size, kernel_size=kernel_size, padding=padding)
        self.conv_out_dim = context_window_size + (2 * padding) - (kernel_size - 1)
        self.dropout1 = nn.Dropout(p=dropout)
        self.highway = HighwayNetworks(embedding_dim * self.conv_out_dim, num_layers=1)
        self.dropout2 = nn.Dropout(p=dropout)
        self.linear = nn.Linear(embedding_dim, vocab_size)

        # init params
        nn.init.xavier_normal_(self.embedding_lookup_table.weight)
        nn.init.xavier_normal_(self.linear.weight)
        nn.init.zeros_(self.linear.bias)

        if weight_tying:
            self.embedding_lookup_table.weight = self.linear.weight

    def forward(self, x):
        # x shape: (B, T)
        embeddings = self.embedding_lookup_table(x) # (B, T, C)
        B, T, C = embeddings.shape
        embeddings = embeddings.view(B, C, T)

        h = self.cnn(embeddings) # (B, T_out * C)
        h = self.dropout1(h)
        h = self.highway(h) # (B, T_out * C)
        h = self.dropout2(h) # (B, T_out * C)
        h = h.view(B, self.conv_out_dim, C)
        h = h.mean(dim = 1) # (B ,C)

        logits = self.linear(h) # (B, V)
        
        return logits

In [420]:
vocab_size = max(train_tokens) + 1
vocab_size

50257

## Exp 1: Dropout + smaller emb_dim + no pad + kernel size = 3
- `embedding_dim`: 128
- `window_size`: 16
- `kernel_size`: 3
- `padding`: 0

In [360]:
embedding_dim = 128
context_window_size = 16
kernel_size = 3
padding = 0

In [361]:
model = BengioLMHighwayDropoutWithCNN(vocab_size=vocab_size, context_window_size=context_window_size, embedding_dim=embedding_dim,\
                                dropout=0.10, kernel_size=kernel_size, padding=padding)

In [362]:
model.to(device)

BengioLMHighwayDropoutWithCNN(
  (embedding_lookup_table): Embedding(50257, 128)
  (cnn): ConvNet(
    (conv): Conv1d(128, 128, kernel_size=(3,), stride=(1,))
    (bn): BatchNorm1d(1792, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
  (dropout1): Dropout(p=0.1, inplace=False)
  (highway): HighwayNetworks(
    (transformed_signal_network): ModuleList(
      (0): Linear(in_features=1792, out_features=1792, bias=True)
    )
    (transformed_signal_bn): ModuleList(
      (0): BatchNorm1d(1792, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (transform_gate_network): ModuleList(
      (0): Linear(in_features=1792, out_features=1792, bias=True)
    )
    (transform_gate_bn): ModuleList(
      (0): BatchNorm1d(1792, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (dropout2): Dropout(p=0.1, inplace=False)
  (linear): Linear(in_features=128, out_features=50257, bias=True)
)

In [363]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters in model: {total_params}")

Total parameters in model: 19402193


In [364]:
base_lr = 5e-3
# max_lr = 5e-3
# min_lr = max_lr * 0.05
# warmup_steps = 500
# max_steps = 1000

optimizer = optim.AdamW(model.parameters(), lr=base_lr)
scheduler = None
# scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch: cosine_scheduler(epoch, min_lr=min_lr, max_lr=max_lr, warmup_steps=warmup_steps, max_steps=max_steps, base_lr=base_lr))


metrics1 = train(model, train_tokens, val_tokens, batch_size=4096, num_epochs=2,\
     context_window_size=context_window_size, optimizer=optimizer, scheduler=scheduler)

Start of training: Train Loss: 10.8249, Val Loss: 10.8249
Epoch 1: Train Loss: 6.9475, Val Loss: 6.5365
Epoch 2: Train Loss: 5.8236, Val Loss: 6.2818


## Exp 2: Dropout + smaller emb_dim + pad 1 (to ensure same dim) + kernel size = 3
- `embedding_dim`: 128
- `window_size`: 16
- `kernel_size`: 3
- `padding`: 1

In [378]:
embedding_dim = 128
context_window_size = 16
kernel_size = 3
padding = 1

In [379]:
model = BengioLMHighwayDropoutWithCNN(vocab_size=vocab_size, context_window_size=context_window_size, embedding_dim=embedding_dim,\
                                dropout=0.10, kernel_size=kernel_size, padding=padding)

In [380]:
model.to(device)

BengioLMHighwayDropoutWithCNN(
  (embedding_lookup_table): Embedding(50257, 128)
  (cnn): ConvNet(
    (conv): Conv1d(128, 128, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
  (dropout1): Dropout(p=0.1, inplace=False)
  (highway): HighwayNetworks(
    (transformed_signal_network): ModuleList(
      (0): Linear(in_features=2048, out_features=2048, bias=True)
    )
    (transformed_signal_bn): ModuleList(
      (0): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (transform_gate_network): ModuleList(
      (0): Linear(in_features=2048, out_features=2048, bias=True)
    )
    (transform_gate_bn): ModuleList(
      (0): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (dropout2): Dropout(p=0.1, inplace=False)
  (linear): Linear(in_features=128, out_features=50257, bias=True)
)

In [381]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters in model: {total_params}")

Total parameters in model: 21370321


In [382]:
base_lr = 5e-3
# max_lr = 5e-3
# min_lr = max_lr * 0.05
# warmup_steps = 500
# max_steps = 1000

optimizer = optim.AdamW(model.parameters(), lr=base_lr)
scheduler = None
# scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch: cosine_scheduler(epoch, min_lr=min_lr, max_lr=max_lr, warmup_steps=warmup_steps, max_steps=max_steps, base_lr=base_lr))

metrics1 = train(model, train_tokens, val_tokens, batch_size=4096, num_epochs=2,\
     context_window_size=context_window_size, optimizer=optimizer, scheduler=scheduler)

Start of training: Train Loss: 10.8249, Val Loss: 10.8249
Epoch 1: Train Loss: 6.8997, Val Loss: 6.4971
Epoch 2: Train Loss: 5.7501, Val Loss: 6.2790


## Exp 3: Dropout + smaller emb_dim + pad 1 (to ensure same dim) + kernel size = 5 (keep dim same so change padding)
- `embedding_dim`: 128
- `window_size`: 16
- `kernel_size`: 5
- `padding`: 2

In [388]:
embedding_dim = 128
context_window_size = 16
kernel_size = 5
padding = 2

In [389]:
model = BengioLMHighwayDropoutWithCNN(vocab_size=vocab_size, context_window_size=context_window_size, embedding_dim=embedding_dim,\
                                dropout=0.10, kernel_size=kernel_size, padding=padding)

In [390]:
model.to(device)

BengioLMHighwayDropoutWithCNN(
  (embedding_lookup_table): Embedding(50257, 128)
  (cnn): ConvNet(
    (conv): Conv1d(128, 128, kernel_size=(5,), stride=(1,), padding=(2,))
    (bn): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
  (dropout1): Dropout(p=0.1, inplace=False)
  (highway): HighwayNetworks(
    (transformed_signal_network): ModuleList(
      (0): Linear(in_features=2048, out_features=2048, bias=True)
    )
    (transformed_signal_bn): ModuleList(
      (0): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (transform_gate_network): ModuleList(
      (0): Linear(in_features=2048, out_features=2048, bias=True)
    )
    (transform_gate_bn): ModuleList(
      (0): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (dropout2): Dropout(p=0.1, inplace=False)
  (linear): Linear(in_features=128, out_features=50257, bias=True)
)

In [391]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters in model: {total_params}")

Total parameters in model: 21403089


In [392]:
base_lr = 5e-3
# max_lr = 5e-3
# min_lr = max_lr * 0.05
# warmup_steps = 500
# max_steps = 1000

optimizer = optim.AdamW(model.parameters(), lr=base_lr)
scheduler = None
# scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch: cosine_scheduler(epoch, min_lr=min_lr, max_lr=max_lr, warmup_steps=warmup_steps, max_steps=max_steps, base_lr=base_lr))

metrics1 = train(model, train_tokens, val_tokens, batch_size=4096, num_epochs=5,\
     context_window_size=context_window_size, optimizer=optimizer, scheduler=scheduler)

Start of training: Train Loss: 10.8249, Val Loss: 10.8249
Epoch 1: Train Loss: 6.8919, Val Loss: 6.4808
Epoch 2: Train Loss: 5.7190, Val Loss: 6.2649
Epoch 3: Train Loss: 4.9823, Val Loss: 6.3832
Epoch 4: Train Loss: 4.3161, Val Loss: 6.7443
Epoch 5: Train Loss: 3.7143, Val Loss: 7.2978


## Exp 4: Paper Model
- `embedding_dim`: 128
- `window_size`: 16
- `kernel_size`: 3
- `padding`: 1

In [21]:
embedding_dim = 128
context_window_size = 16
kernel_size = 3
padding = 1

In [422]:
model = BengioLMHighwayDropoutWithCNN(vocab_size=vocab_size, context_window_size=context_window_size, embedding_dim=embedding_dim,\
                                dropout=0.10, kernel_size=kernel_size, padding=padding)

In [423]:
model.to(device)

BengioLMHighwayDropoutWithCNN(
  (embedding_lookup_table): Embedding(50257, 128)
  (cnn): ConvNet(
    (conv): Conv1d(128, 128, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
  (dropout1): Dropout(p=0.1, inplace=False)
  (highway): HighwayNetworks(
    (transformed_signal_network): ModuleList(
      (0): Linear(in_features=2048, out_features=2048, bias=True)
    )
    (transformed_signal_bn): ModuleList(
      (0): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (transform_gate_network): ModuleList(
      (0): Linear(in_features=2048, out_features=2048, bias=True)
    )
    (transform_gate_bn): ModuleList(
      (0): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (dropout2): Dropout(p=0.1, inplace=False)
  (linear): Linear(in_features=128, out_features=50257, bias=True)
)

In [424]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters in model: {total_params}")

Total parameters in model: 21366481


In [425]:
base_lr = 5e-3
# max_lr = 5e-3
# min_lr = max_lr * 0.05
# warmup_steps = 500
# max_steps = 1000

optimizer = optim.AdamW(model.parameters(), lr=base_lr)
scheduler = None
# scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch: cosine_scheduler(epoch, min_lr=min_lr, max_lr=max_lr, warmup_steps=warmup_steps, max_steps=max_steps, base_lr=base_lr))

metrics1 = train(model, train_tokens, val_tokens, batch_size=4096, num_epochs=2,\
     context_window_size=context_window_size, optimizer=optimizer, scheduler=scheduler)

Start of training: Train Loss: 10.8249, Val Loss: 10.8249
Epoch 1: Train Loss: 6.8999, Val Loss: 6.4845
Epoch 2: Train Loss: 5.7206, Val Loss: 6.2692


In [426]:
train_loss, train_entropy, train_perplexity = get_metrics(model, train_tokens, context_window_size)
print(f'Train Metrics: Train Loss: {train_loss:.4f}, Train Entropy: {train_entropy:.4f}, Train Perplexity: {train_perplexity:.2f}')

Train Metrics: Train Loss: 4.9819, Train Entropy: 7.1873, Train Perplexity: 145.75


In [427]:
val_loss, val_entropy, val_perplexity = get_metrics(model, val_tokens, context_window_size)
print(f'Val Metrics: Val Loss: {val_loss:.4f}, Val Entropy: {val_entropy:.4f}, Val Perplexity: {val_perplexity:.2f}')

Val Metrics: Val Loss: 6.2692, Val Entropy: 9.0445, Val Perplexity: 528.05


In [428]:
test_loss, test_entropy, test_perplexity = get_metrics(model, test_tokens, context_window_size)
print(f'Test Metrics: Test Loss: {test_loss:.4f}, Test Entropy: {test_entropy:.4f}, Test Perplexity: {test_perplexity:.2f}')

Test Metrics: Test Loss: 6.4853, Test Entropy: 9.3563, Test Perplexity: 655.41


In [429]:
ts_loss, ts_entropy, ts_perplexity = get_metrics(model, ts_tokens, context_window_size)
print(f'Tiny Shakespeare Metrics: Test Loss: {ts_loss:.4f}, Test Entropy: {ts_entropy:.4f}, Test Perplexity: {ts_perplexity:.2f}')

Tiny Shakespeare Metrics: Test Loss: 7.2927, Test Entropy: 10.5212, Test Perplexity: 1469.58


In [430]:
torch.save(model.state_dict(), "bengio_highway_no_weight_share_dropout_cnn_small.pt")

## Exp 5: Paper Model (Large)
- `embedding_dim`: 256
- `window_size`: 16
- `kernel_size`: 3
- `padding`: 1

In [431]:
embedding_dim = 256
context_window_size = 16
kernel_size = 3
padding = 1

In [432]:
model = BengioLMHighwayDropoutWithCNN(vocab_size=vocab_size, context_window_size=context_window_size, embedding_dim=embedding_dim,\
                                dropout=0.10, kernel_size=kernel_size, padding=padding)

In [433]:
model.to(device)

BengioLMHighwayDropoutWithCNN(
  (embedding_lookup_table): Embedding(50257, 256)
  (cnn): ConvNet(
    (conv): Conv1d(256, 256, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
  (dropout1): Dropout(p=0.1, inplace=False)
  (highway): HighwayNetworks(
    (transformed_signal_network): ModuleList(
      (0): Linear(in_features=4096, out_features=4096, bias=True)
    )
    (transformed_signal_bn): ModuleList(
      (0): BatchNorm1d(4096, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (transform_gate_network): ModuleList(
      (0): Linear(in_features=4096, out_features=4096, bias=True)
    )
    (transform_gate_bn): ModuleList(
      (0): BatchNorm1d(4096, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (dropout2): Dropout(p=0.1, inplace=False)
  (linear): Linear(in_features=256, out_features=50257, bias=True)
)

In [434]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters in model: {total_params}")

Total parameters in model: 59558225


In [435]:
base_lr = 5e-3
# max_lr = 5e-3
# min_lr = max_lr * 0.05
# warmup_steps = 500
# max_steps = 1000

optimizer = optim.AdamW(model.parameters(), lr=base_lr)
scheduler = None
# scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch: cosine_scheduler(epoch, min_lr=min_lr, max_lr=max_lr, warmup_steps=warmup_steps, max_steps=max_steps, base_lr=base_lr))

metrics1 = train(model, train_tokens, val_tokens, batch_size=4096, num_epochs=2,\
     context_window_size=context_window_size, optimizer=optimizer, scheduler=scheduler)

Start of training: Train Loss: 10.8249, Val Loss: 10.8249
Epoch 1: Train Loss: 6.7818, Val Loss: 6.4197
Epoch 2: Train Loss: 5.5503, Val Loss: 6.2346


In [436]:
train_loss, train_entropy, train_perplexity = get_metrics(model, train_tokens, context_window_size)
print(f'Train Metrics: Train Loss: {train_loss:.4f}, Train Entropy: {train_entropy:.4f}, Train Perplexity: {train_perplexity:.2f}')

Train Metrics: Train Loss: 4.6083, Train Entropy: 6.6484, Train Perplexity: 100.31


In [437]:
val_loss, val_entropy, val_perplexity = get_metrics(model, val_tokens, context_window_size)
print(f'Val Metrics: Val Loss: {val_loss:.4f}, Val Entropy: {val_entropy:.4f}, Val Perplexity: {val_perplexity:.2f}')

Val Metrics: Val Loss: 6.2346, Val Entropy: 8.9946, Val Perplexity: 510.08


In [438]:
test_loss, test_entropy, test_perplexity = get_metrics(model, test_tokens, context_window_size)
print(f'Test Metrics: Test Loss: {test_loss:.4f}, Test Entropy: {test_entropy:.4f}, Test Perplexity: {test_perplexity:.2f}')

Test Metrics: Test Loss: 6.4595, Test Entropy: 9.3192, Test Perplexity: 638.77


In [439]:
ts_loss, ts_entropy, ts_perplexity = get_metrics(model, ts_tokens, context_window_size)
print(f'Tiny Shakespeare Metrics: Test Loss: {ts_loss:.4f}, Test Entropy: {ts_entropy:.4f}, Test Perplexity: {ts_perplexity:.2f}')

Tiny Shakespeare Metrics: Test Loss: 7.3262, Test Entropy: 10.5695, Test Perplexity: 1519.62


In [440]:
torch.save(model.state_dict(), "bengio_highway_no_weight_share_dropout_cnn_large.pt")

# 3.3: Bengio + Highway + Convolution + 1 window size Kernel (MLPConv)

Here, we add a 1 X 1 kernel on top of the convolution output

In [101]:
class MLPConv(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_window_size, kernel_size, padding, dropout=0.0, weight_tying=False):
        super().__init__()

        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.context_window_size = context_window_size
        self.kernel_size = kernel_size
        self.padding = padding

        self.embedding_lookup_table = nn.Embedding(vocab_size, embedding_dim)
        self.cnn = ConvNet(channel_dim=embedding_dim, context_window_size=context_window_size, kernel_size=kernel_size, padding=padding)
        self.conv_out_dim = context_window_size + (2 * padding) - (kernel_size - 1)
        self.dropout1 = nn.Dropout(p=dropout)
        self.nin = ConvNet(channel_dim=embedding_dim, context_window_size=self.conv_out_dim, kernel_size=1, padding=0)
        self.dropout2 = nn.Dropout(p=dropout)
        self.highway = HighwayNetworks(embedding_dim * self.conv_out_dim, num_layers=1)
        self.dropout3 = nn.Dropout(p=dropout)
        self.linear = nn.Linear(embedding_dim, vocab_size)

        # init params
        nn.init.xavier_normal_(self.embedding_lookup_table.weight)
        nn.init.xavier_normal_(self.linear.weight)
        nn.init.zeros_(self.linear.bias)

        if weight_tying:
            self.embedding_lookup_table.weight = self.linear.weight

    def forward(self, x):
        # x shape: (B, T)
        embeddings = self.embedding_lookup_table(x) # (B, T, C)
        B, T, C = embeddings.shape
        embeddings = embeddings.view(B, C, T)

        h = self.cnn(embeddings) # (B, T_out * C)
        h = self.dropout1(h)

        h = h.view(B, C, self.conv_out_dim) # (B, C, T_out)
        h = self.nin(h) # (B, C * T_out)
        h = self.dropout2(h)

        h = self.highway(h) # (B, T_out * C)
        h = self.dropout3(h) # (B, T_out * C)
        h = h.view(B, C, self.conv_out_dim)
        
        h = h.max(dim = 2).values # (B ,C)

        logits = self.linear(h) # (B, V)
        
        return logits

## Exp 1: Paper Model
- `embedding_dim`: 128
- `window_size`: 16
- `kernel_size`: 3
- `padding`: 1

In [112]:
embedding_dim = 128
context_window_size = 16
kernel_size = 3
padding = 1

In [113]:
model = MLPConv(vocab_size=vocab_size, context_window_size=context_window_size, embedding_dim=embedding_dim,\
                                dropout=0.10, kernel_size=kernel_size, padding=padding)

In [114]:
model.to(device)

MLPConv(
  (embedding_lookup_table): Embedding(50257, 128)
  (cnn): ConvNet(
    (conv): Conv1d(128, 128, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
  (dropout1): Dropout(p=0.1, inplace=False)
  (nin): ConvNet(
    (conv): Conv1d(128, 128, kernel_size=(1,), stride=(1,))
    (bn): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
  (dropout2): Dropout(p=0.1, inplace=False)
  (highway): HighwayNetworks(
    (transformed_signal_network): ModuleList(
      (0): Linear(in_features=2048, out_features=2048, bias=True)
    )
    (transformed_signal_bn): ModuleList(
      (0): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (transform_gate_network): ModuleList(
      (0): Linear(in_features=2048, out_features=2048, bias=True)
    )
    (transform_gate_bn): ModuleList(
      (0): BatchNorm1d(2048, eps=1e-05, momentum=0.1, 

In [115]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters in model: {total_params}")

Total parameters in model: 21383249


In [116]:
base_lr = 5e-3
# max_lr = 1e-2
# min_lr = max_lr * 0.01
# warmup_steps = 500
# max_steps = 1000

optimizer = optim.AdamW(model.parameters(), lr=base_lr)
scheduler = None
# scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch: cosine_scheduler(epoch, min_lr=min_lr, max_lr=max_lr, warmup_steps=warmup_steps, max_steps=max_steps, base_lr=base_lr))

metrics1 = train(model, train_tokens, val_tokens, batch_size=4096, num_epochs=3,\
     context_window_size=context_window_size, optimizer=optimizer, scheduler=scheduler)

Start of training: Train Loss: 10.8248, Val Loss: 10.8248
Epoch 1: Train Loss: 6.9896, Val Loss: 6.5829
Epoch 2: Train Loss: 5.9430, Val Loss: 6.3296
Epoch 3: Train Loss: 5.3677, Val Loss: 6.2945


In [117]:
train_loss, train_entropy, train_perplexity = get_metrics(model, train_tokens, context_window_size)
print(f'Train Metrics: Train Loss: {train_loss:.4f}, Train Entropy: {train_entropy:.4f}, Train Perplexity: {train_perplexity:.2f}')

Train Metrics: Train Loss: 4.7982, Train Entropy: 6.9224, Train Perplexity: 121.30


In [118]:
val_loss, val_entropy, val_perplexity = get_metrics(model, val_tokens, context_window_size)
print(f'Val Metrics: Val Loss: {val_loss:.4f}, Val Entropy: {val_entropy:.4f}, Val Perplexity: {val_perplexity:.2f}')

Val Metrics: Val Loss: 6.2945, Val Entropy: 9.0810, Val Perplexity: 541.57


In [119]:
test_loss, test_entropy, test_perplexity = get_metrics(model, test_tokens, context_window_size)
print(f'Test Metrics: Test Loss: {test_loss:.4f}, Test Entropy: {test_entropy:.4f}, Test Perplexity: {test_perplexity:.2f}')

Test Metrics: Test Loss: 6.5073, Test Entropy: 9.3881, Test Perplexity: 670.03


In [120]:
ts_loss, ts_entropy, ts_perplexity = get_metrics(model, ts_tokens, context_window_size)
print(f'Tiny Shakespeare Metrics: Test Loss: {ts_loss:.4f}, Test Entropy: {ts_entropy:.4f}, Test Perplexity: {ts_perplexity:.2f}')

Tiny Shakespeare Metrics: Test Loss: 7.3431, Test Entropy: 10.5938, Test Perplexity: 1545.49


In [121]:
torch.save(model.state_dict(), "bengio_highway_no_weight_share_dropout_cnn_nin_small.pt")

## Exp 2: Paper Model
- `embedding_dim`: 256
- `window_size`: 16
- `kernel_size`: 3
- `padding`: 1

In [132]:
embedding_dim = 256
context_window_size = 16
kernel_size = 3
padding = 1

In [133]:
model = MLPConv(vocab_size=vocab_size, context_window_size=context_window_size, embedding_dim=embedding_dim,\
                                dropout=0.10, kernel_size=kernel_size, padding=padding)

In [134]:
model.to(device)

MLPConv(
  (embedding_lookup_table): Embedding(50257, 256)
  (cnn): ConvNet(
    (conv): Conv1d(256, 256, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
  (dropout1): Dropout(p=0.1, inplace=False)
  (nin): ConvNet(
    (conv): Conv1d(256, 256, kernel_size=(1,), stride=(1,))
    (bn): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
  (dropout2): Dropout(p=0.1, inplace=False)
  (highway): HighwayNetworks(
    (transformed_signal_network): ModuleList(
      (0): Linear(in_features=4096, out_features=4096, bias=True)
    )
    (transformed_signal_bn): ModuleList(
      (0): BatchNorm1d(4096, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (transform_gate_network): ModuleList(
      (0): Linear(in_features=4096, out_features=4096, bias=True)
    )
    (transform_gate_bn): ModuleList(
      (0): BatchNorm1d(4096, eps=1e-05, momentum=0.1, 

In [135]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters in model: {total_params}")

Total parameters in model: 59624529


In [136]:
base_lr = 2e-3
# max_lr = 1e-2
# min_lr = max_lr * 0.01
# warmup_steps = 500
# max_steps = 1000

optimizer = optim.AdamW(model.parameters(), lr=base_lr)
scheduler = None
# scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch: cosine_scheduler(epoch, min_lr=min_lr, max_lr=max_lr, warmup_steps=warmup_steps, max_steps=max_steps, base_lr=base_lr))

metrics1 = train(model, train_tokens, val_tokens, batch_size=4096, num_epochs=3,\
     context_window_size=context_window_size, optimizer=optimizer, scheduler=scheduler)

Start of training: Train Loss: 10.8249, Val Loss: 10.8249
Epoch 1: Train Loss: 6.9514, Val Loss: 6.5698
Epoch 2: Train Loss: 5.9101, Val Loss: 6.3117
Epoch 3: Train Loss: 5.2695, Val Loss: 6.3001


In [137]:
train_loss, train_entropy, train_perplexity = get_metrics(model, train_tokens, context_window_size)
print(f'Train Metrics: Train Loss: {train_loss:.4f}, Train Entropy: {train_entropy:.4f}, Train Perplexity: {train_perplexity:.2f}')

Train Metrics: Train Loss: 4.6092, Train Entropy: 6.6496, Train Perplexity: 100.40


In [138]:
val_loss, val_entropy, val_perplexity = get_metrics(model, val_tokens, context_window_size)
print(f'Val Metrics: Val Loss: {val_loss:.4f}, Val Entropy: {val_entropy:.4f}, Val Perplexity: {val_perplexity:.2f}')

Val Metrics: Val Loss: 6.3001, Val Entropy: 9.0892, Val Perplexity: 544.65


In [139]:
test_loss, test_entropy, test_perplexity = get_metrics(model, test_tokens, context_window_size)
print(f'Test Metrics: Test Loss: {test_loss:.4f}, Test Entropy: {test_entropy:.4f}, Test Perplexity: {test_perplexity:.2f}')

Test Metrics: Test Loss: 6.5329, Test Entropy: 9.4250, Test Perplexity: 687.41


In [140]:
ts_loss, ts_entropy, ts_perplexity = get_metrics(model, ts_tokens, context_window_size)
print(f'Tiny Shakespeare Metrics: Test Loss: {ts_loss:.4f}, Test Entropy: {ts_entropy:.4f}, Test Perplexity: {ts_perplexity:.2f}')

Tiny Shakespeare Metrics: Test Loss: 7.4143, Test Entropy: 10.6965, Test Perplexity: 1659.51


In [141]:
torch.save(model.state_dict(), "bengio_highway_no_weight_share_dropout_cnn_nin_large.pt")

# 3.4: MLPConv + COM
Here, we add multiple kernels in parallel so that different size context windows can be combined

In [155]:
class MLPConvCOM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_window_size, kernel_sizes, dropout=0.0, weight_tying=False, nin=False):
        super().__init__()

        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.context_window_size = context_window_size
        self.kernel_sizes = kernel_sizes
        self.paddings = [(k - 1) // 2 for k in self.kernel_sizes]
        self.nin = nin

        self.embedding_lookup_table = nn.Embedding(vocab_size, embedding_dim)
        self.cnn_layers = nn.ModuleList(
            [ConvNet(channel_dim=embedding_dim, context_window_size=context_window_size, kernel_size=k, padding=p) for k, p in zip(self.kernel_sizes, self.paddings)]
        )
        self.dropout1 = nn.Dropout(p=dropout)
        if nin == True:
            self.nin = ConvNet(channel_dim=embedding_dim, context_window_size=context_window_size, kernel_size=1, padding=0)
            self.nindropout = nn.Dropout(p=dropout)

        self.highway = HighwayNetworks(embedding_dim * context_window_size * len(kernel_sizes), num_layers=1)
        self.dropout2 = nn.Dropout(p=dropout)
        self.linear = nn.Linear(embedding_dim, vocab_size)

        # init params
        nn.init.xavier_normal_(self.embedding_lookup_table.weight)
        nn.init.xavier_normal_(self.linear.weight)
        nn.init.zeros_(self.linear.bias)

        if weight_tying:
            self.embedding_lookup_table.weight = self.linear.weight

    def forward(self, x):
        # x shape: (B, T)
        embeddings = self.embedding_lookup_table(x) # (B, T, C)
        B, T, C = embeddings.shape
        embeddings = embeddings.view(B, C, T)

        h = None
        for layer in self.cnn_layers:
            th = layer(embeddings) # (B, T * C)
            if h is not None:
                h = torch.cat([h, th], dim=1)
            else:
                h = th.clone()
        
        h = self.dropout1(h) # (B, sum(T_i * C))

        if self.nin == True:
            conv_out_dim = h.shape[1] // C
            h = h.view(B, C, conv_out_dim) # (B, C, T_out)
            h = self.nin(h) # (B, C * T_out)
            h = self.nindropout(h)

        h = self.highway(h) # (B, T_out * C)
        h = self.dropout2(h) # (B, T_out * C)
        conv_out_dim = h.shape[1] // C
        h = h.view(B, C, conv_out_dim)
        
        h = h.max(dim = 2).values # (B ,C)

        logits = self.linear(h) # (B, V)
        
        return logits

## Exp 1: Paper Model + No Nin
- `embedding_dim`: 128
- `window_size`: 16
- `kernel_size`: [3, 5]

In [166]:
embedding_dim = 128
context_window_size = 16
kernel_sizes = [3, 5]

In [167]:
model = MLPConvCOM(vocab_size=vocab_size, context_window_size=context_window_size, embedding_dim=embedding_dim,\
                                dropout=0.10, kernel_sizes=kernel_sizes)

In [168]:
model.to(device)

MLPConvCOM(
  (embedding_lookup_table): Embedding(50257, 128)
  (cnn_layers): ModuleList(
    (0): ConvNet(
      (conv): Conv1d(128, 128, kernel_size=(3,), stride=(1,), padding=(1,))
      (bn): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): ConvNet(
      (conv): Conv1d(128, 128, kernel_size=(5,), stride=(1,), padding=(2,))
      (bn): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (dropout1): Dropout(p=0.1, inplace=False)
  (highway): HighwayNetworks(
    (transformed_signal_network): ModuleList(
      (0): Linear(in_features=4096, out_features=4096, bias=True)
    )
    (transformed_signal_bn): ModuleList(
      (0): BatchNorm1d(4096, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (transform_gate_network): ModuleList(
      (0): Linear(in_features=4096, out_features=4096, bias=True)
    )
    (transform_gate_bn): ModuleList(
      (0): BatchNorm1d(4096, eps=1e-0

In [169]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters in model: {total_params}")

Total parameters in model: 46626897


In [170]:
base_lr = 5e-3
# max_lr = 1e-2
# min_lr = max_lr * 0.01
# warmup_steps = 500
# max_steps = 1000

optimizer = optim.AdamW(model.parameters(), lr=base_lr)
scheduler = None
# scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch: cosine_scheduler(epoch, min_lr=min_lr, max_lr=max_lr, warmup_steps=warmup_steps, max_steps=max_steps, base_lr=base_lr))

metrics1 = train(model, train_tokens, val_tokens, batch_size=4096, num_epochs=3,\
     context_window_size=context_window_size, optimizer=optimizer, scheduler=scheduler)

Start of training: Train Loss: 10.8248, Val Loss: 10.8248
Epoch 1: Train Loss: 7.0000, Val Loss: 6.6052
Epoch 2: Train Loss: 5.9342, Val Loss: 6.3182
Epoch 3: Train Loss: 5.3227, Val Loss: 6.2807


## Exp 2: Paper Model + Nin
- `embedding_dim`: 128
- `window_size`: 16
- `kernel_size`: [3, 5]

In [171]:
embedding_dim = 128
context_window_size = 16
kernel_sizes = [3, 5]

In [172]:
model = MLPConvCOM(vocab_size=vocab_size, context_window_size=context_window_size, embedding_dim=embedding_dim,\
                                dropout=0.10, kernel_sizes=kernel_sizes, nin=True)

In [173]:
model.to(device)

MLPConvCOM(
  (embedding_lookup_table): Embedding(50257, 128)
  (cnn_layers): ModuleList(
    (0): ConvNet(
      (conv): Conv1d(128, 128, kernel_size=(3,), stride=(1,), padding=(1,))
      (bn): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): ConvNet(
      (conv): Conv1d(128, 128, kernel_size=(5,), stride=(1,), padding=(2,))
      (bn): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (dropout1): Dropout(p=0.1, inplace=False)
  (nin): ConvNet(
    (conv): Conv1d(128, 128, kernel_size=(1,), stride=(1,))
    (bn): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
  (nindropout): Dropout(p=0.1, inplace=False)
  (highway): HighwayNetworks(
    (transformed_signal_network): ModuleList(
      (0): Linear(in_features=4096, out_features=4096, bias=True)
    )
    (transformed_signal_bn): ModuleList(
      (0): BatchNorm1d(4096, eps=1e-05, momentum=0.1, affine=True, t

In [174]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters in model: {total_params}")

Total parameters in model: 46643665


In [175]:
base_lr = 5e-3
# max_lr = 1e-2
# min_lr = max_lr * 0.01
# warmup_steps = 500
# max_steps = 1000

optimizer = optim.AdamW(model.parameters(), lr=base_lr)
scheduler = None
# scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch: cosine_scheduler(epoch, min_lr=min_lr, max_lr=max_lr, warmup_steps=warmup_steps, max_steps=max_steps, base_lr=base_lr))

metrics1 = train(model, train_tokens, val_tokens, batch_size=4096, num_epochs=6,\
     context_window_size=context_window_size, optimizer=optimizer, scheduler=scheduler)

Start of training: Train Loss: 10.8250, Val Loss: 10.8250
Epoch 1: Train Loss: 7.0249, Val Loss: 6.6223
Epoch 2: Train Loss: 5.9610, Val Loss: 6.3332
Epoch 3: Train Loss: 5.3593, Val Loss: 6.2986
Epoch 4: Train Loss: 4.8100, Val Loss: 6.4485
Epoch 5: Train Loss: 4.2832, Val Loss: 6.7245
Epoch 6: Train Loss: 3.7650, Val Loss: 7.1301


## Exp 3: Paper Model + No Nin
- `embedding_dim`: 256
- `window_size`: 16
- `kernel_size`: [3, 5]

In [176]:
embedding_dim = 256
context_window_size = 16
kernel_sizes = [3, 5]

In [177]:
model = MLPConvCOM(vocab_size=vocab_size, context_window_size=context_window_size, embedding_dim=embedding_dim,\
                                dropout=0.10, kernel_sizes=kernel_sizes)

In [178]:
model.to(device)

MLPConvCOM(
  (embedding_lookup_table): Embedding(50257, 256)
  (cnn_layers): ModuleList(
    (0): ConvNet(
      (conv): Conv1d(256, 256, kernel_size=(3,), stride=(1,), padding=(1,))
      (bn): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): ConvNet(
      (conv): Conv1d(256, 256, kernel_size=(5,), stride=(1,), padding=(2,))
      (bn): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (dropout1): Dropout(p=0.1, inplace=False)
  (highway): HighwayNetworks(
    (transformed_signal_network): ModuleList(
      (0): Linear(in_features=8192, out_features=8192, bias=True)
    )
    (transformed_signal_bn): ModuleList(
      (0): BatchNorm1d(8192, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (transform_gate_network): ModuleList(
      (0): Linear(in_features=8192, out_features=8192, bias=True)
    )
    (transform_gate_bn): ModuleList(
      (0): BatchNorm1d(8192, eps=1e-0

In [179]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters in model: {total_params}")

Total parameters in model: 160574545


In [180]:
base_lr = 5e-3
# max_lr = 1e-2
# min_lr = max_lr * 0.01
# warmup_steps = 500
# max_steps = 1000

optimizer = optim.AdamW(model.parameters(), lr=base_lr)
scheduler = None
# scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch: cosine_scheduler(epoch, min_lr=min_lr, max_lr=max_lr, warmup_steps=warmup_steps, max_steps=max_steps, base_lr=base_lr))

metrics1 = train(model, train_tokens, val_tokens, batch_size=4096, num_epochs=5,\
     context_window_size=context_window_size, optimizer=optimizer, scheduler=scheduler)

Start of training: Train Loss: 10.8249, Val Loss: 10.8249
Epoch 1: Train Loss: 6.9792, Val Loss: 6.5782
Epoch 2: Train Loss: 5.8399, Val Loss: 6.3303
Epoch 3: Train Loss: 5.1233, Val Loss: 6.3618
Epoch 4: Train Loss: 4.3877, Val Loss: 6.6410
Epoch 5: Train Loss: 3.5983, Val Loss: 7.1958


# 3.5: ML- CNN
Multiple CNN back to back

In [189]:
class MLCNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_window_size, kernel_size, num_kernels, dropout=0.0, weight_tying=False, nin=False):
        super().__init__()

        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.context_window_size = context_window_size
        self.kernel_size = kernel_size
        self.padding = (kernel_size - 1) // 2
        self.num_kernels = num_kernels
        self.nin = nin

        self.embedding_lookup_table = nn.Embedding(vocab_size, embedding_dim)
        self.cnn_layers = nn.ModuleList(
            [ConvNet(channel_dim=embedding_dim, context_window_size=context_window_size, kernel_size=kernel_size, padding=self.padding) for _ in range(num_kernels)]
        )
        self.dropouts = nn.ModuleList([
            nn.Dropout(p=dropout) for _ in range(num_kernels)
        ])
        
        if nin == True:
            self.nin = ConvNet(channel_dim=embedding_dim, context_window_size=context_window_size, kernel_size=1, padding=0)
            self.nindropout = nn.Dropout(p=dropout)

        self.highway = HighwayNetworks(embedding_dim * context_window_size, num_layers=1)
        self.dropout2 = nn.Dropout(p=dropout)
        self.linear = nn.Linear(embedding_dim, vocab_size)

        # init params
        nn.init.xavier_normal_(self.embedding_lookup_table.weight)
        nn.init.xavier_normal_(self.linear.weight)
        nn.init.zeros_(self.linear.bias)

        if weight_tying:
            self.embedding_lookup_table.weight = self.linear.weight

    def forward(self, x):
        # x shape: (B, T)
        embeddings = self.embedding_lookup_table(x) # (B, T, C)
        B, T, C = embeddings.shape
        embeddings = embeddings.view(B, C, T)

        h = embeddings.clone()
        for cnn, dropout in zip(self.cnn_layers, self.dropouts):
            h = cnn(h) # (B, T*C)
            h = dropout(h)
            h = h.view(B, C, T)
        
        h = h.view(B, C * T)
        if self.nin == True:
            h = h.view(B, C, T) # (B, C, T_out)
            h = self.nin(h) # (B, C * T_out)
            h = self.nindropout(h)

        h = self.highway(h) # (B, T_out * C)
        h = self.dropout2(h) # (B, T_out * C)
        h = h.view(B, C, T)
        
        h = h.mean(dim = 2) # (B ,C)

        logits = self.linear(h) # (B, V)
        
        return logits

## Exp 1: Paper Model + No Nin + 2 CNN
- `embedding_dim`: 128
- `window_size`: 16
- `kernel_size`: 3

In [195]:
embedding_dim = 128
context_window_size = 16
kernel_size = 3
num_kernels = 2

In [196]:
model = MLCNN(vocab_size=vocab_size, context_window_size=context_window_size, embedding_dim=embedding_dim,\
                                dropout=0.10, kernel_size=kernel_size, num_kernels=num_kernels)

In [197]:
model.to(device)

MLCNN(
  (embedding_lookup_table): Embedding(50257, 128)
  (cnn_layers): ModuleList(
    (0-1): 2 x ConvNet(
      (conv): Conv1d(128, 128, kernel_size=(3,), stride=(1,), padding=(1,))
      (bn): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (dropouts): ModuleList(
    (0-1): 2 x Dropout(p=0.1, inplace=False)
  )
  (highway): HighwayNetworks(
    (transformed_signal_network): ModuleList(
      (0): Linear(in_features=2048, out_features=2048, bias=True)
    )
    (transformed_signal_bn): ModuleList(
      (0): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (transform_gate_network): ModuleList(
      (0): Linear(in_features=2048, out_features=2048, bias=True)
    )
    (transform_gate_bn): ModuleList(
      (0): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (dropout2): Dropout(p=0.1, inplace=False)
  (linear): Linear(in_features=128, out_features=50

In [198]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters in model: {total_params}")

Total parameters in model: 21416017


In [199]:
base_lr = 5e-3
# max_lr = 1e-2
# min_lr = max_lr * 0.01
# warmup_steps = 500
# max_steps = 1000

optimizer = optim.AdamW(model.parameters(), lr=base_lr)
scheduler = None
# scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch: cosine_scheduler(epoch, min_lr=min_lr, max_lr=max_lr, warmup_steps=warmup_steps, max_steps=max_steps, base_lr=base_lr))

metrics1 = train(model, train_tokens, val_tokens, batch_size=4096, num_epochs=2,\
     context_window_size=context_window_size, optimizer=optimizer, scheduler=scheduler)

Start of training: Train Loss: 10.8249, Val Loss: 10.8249
Epoch 1: Train Loss: 6.9020, Val Loss: 6.5223
Epoch 2: Train Loss: 5.8361, Val Loss: 6.2987


In [200]:
train_loss, train_entropy, train_perplexity = get_metrics(model, train_tokens, context_window_size)
print(f'Train Metrics: Train Loss: {train_loss:.4f}, Train Entropy: {train_entropy:.4f}, Train Perplexity: {train_perplexity:.2f}')

Train Metrics: Train Loss: 5.2019, Train Entropy: 7.5047, Train Perplexity: 181.61


In [201]:
val_loss, val_entropy, val_perplexity = get_metrics(model, val_tokens, context_window_size)
print(f'Val Metrics: Val Loss: {val_loss:.4f}, Val Entropy: {val_entropy:.4f}, Val Perplexity: {val_perplexity:.2f}')

Val Metrics: Val Loss: 6.2987, Val Entropy: 9.0871, Val Perplexity: 543.88


In [202]:
test_loss, test_entropy, test_perplexity = get_metrics(model, test_tokens, context_window_size)
print(f'Test Metrics: Test Loss: {test_loss:.4f}, Test Entropy: {test_entropy:.4f}, Test Perplexity: {test_perplexity:.2f}')

Test Metrics: Test Loss: 6.5226, Test Entropy: 9.4101, Test Perplexity: 680.36


In [203]:
ts_loss, ts_entropy, ts_perplexity = get_metrics(model, ts_tokens, context_window_size)
print(f'Tiny Shakespeare Metrics: Test Loss: {ts_loss:.4f}, Test Entropy: {ts_entropy:.4f}, Test Perplexity: {ts_perplexity:.2f}')

Tiny Shakespeare Metrics: Test Loss: 7.2237, Test Entropy: 10.4216, Test Perplexity: 1371.55


## Exp 2: Paper Model + No Nin + 4 CNN
- `embedding_dim`: 128
- `window_size`: 16
- `kernel_size`: 3

In [213]:
embedding_dim = 128
context_window_size = 16
kernel_size = 3
num_kernels = 4

In [214]:
model = MLCNN(vocab_size=vocab_size, context_window_size=context_window_size, embedding_dim=embedding_dim,\
                                dropout=0.10, kernel_size=kernel_size, num_kernels=num_kernels)

In [215]:
model.to(device)

MLCNN(
  (embedding_lookup_table): Embedding(50257, 128)
  (cnn_layers): ModuleList(
    (0-3): 4 x ConvNet(
      (conv): Conv1d(128, 128, kernel_size=(3,), stride=(1,), padding=(1,))
      (bn): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (dropouts): ModuleList(
    (0-3): 4 x Dropout(p=0.1, inplace=False)
  )
  (highway): HighwayNetworks(
    (transformed_signal_network): ModuleList(
      (0): Linear(in_features=2048, out_features=2048, bias=True)
    )
    (transformed_signal_bn): ModuleList(
      (0): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (transform_gate_network): ModuleList(
      (0): Linear(in_features=2048, out_features=2048, bias=True)
    )
    (transform_gate_bn): ModuleList(
      (0): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (dropout2): Dropout(p=0.1, inplace=False)
  (linear): Linear(in_features=128, out_features=50

In [216]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters in model: {total_params}")

Total parameters in model: 21515089


In [217]:
base_lr = 5e-3
# max_lr = 1e-2
# min_lr = max_lr * 0.01
# warmup_steps = 500
# max_steps = 1000

optimizer = optim.AdamW(model.parameters(), lr=base_lr)
scheduler = None
# scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch: cosine_scheduler(epoch, min_lr=min_lr, max_lr=max_lr, warmup_steps=warmup_steps, max_steps=max_steps, base_lr=base_lr))

metrics1 = train(model, train_tokens, val_tokens, batch_size=4096, num_epochs=5,\
     context_window_size=context_window_size, optimizer=optimizer, scheduler=scheduler)

Start of training: Train Loss: 10.8249, Val Loss: 10.8249
Epoch 1: Train Loss: 6.9677, Val Loss: 6.5830
Epoch 2: Train Loss: 5.9611, Val Loss: 6.3455
Epoch 3: Train Loss: 5.3907, Val Loss: 6.3774
Epoch 4: Train Loss: 4.8642, Val Loss: 6.5773
Epoch 5: Train Loss: 4.3717, Val Loss: 6.9283


In [209]:
train_loss, train_entropy, train_perplexity = get_metrics(model, train_tokens, context_window_size)
print(f'Train Metrics: Train Loss: {train_loss:.4f}, Train Entropy: {train_entropy:.4f}, Train Perplexity: {train_perplexity:.2f}')

Train Metrics: Train Loss: 5.4766, Train Entropy: 7.9010, Train Perplexity: 239.02


In [210]:
val_loss, val_entropy, val_perplexity = get_metrics(model, val_tokens, context_window_size)
print(f'Val Metrics: Val Loss: {val_loss:.4f}, Val Entropy: {val_entropy:.4f}, Val Perplexity: {val_perplexity:.2f}')

Val Metrics: Val Loss: 6.3768, Val Entropy: 9.1998, Val Perplexity: 588.07


In [211]:
test_loss, test_entropy, test_perplexity = get_metrics(model, test_tokens, context_window_size)
print(f'Test Metrics: Test Loss: {test_loss:.4f}, Test Entropy: {test_entropy:.4f}, Test Perplexity: {test_perplexity:.2f}')

Test Metrics: Test Loss: 6.5673, Test Entropy: 9.4746, Test Perplexity: 711.44


In [212]:
ts_loss, ts_entropy, ts_perplexity = get_metrics(model, ts_tokens, context_window_size)
print(f'Tiny Shakespeare Metrics: Test Loss: {ts_loss:.4f}, Test Entropy: {ts_entropy:.4f}, Test Perplexity: {ts_perplexity:.2f}')

Tiny Shakespeare Metrics: Test Loss: 7.2488, Test Entropy: 10.4578, Test Perplexity: 1406.40
