# Logs
## Bengio + Highway Network + Small embedding
- `embedding_dim`: 128
- `window_size`: 16
- Train Metrics: Train Loss: 5.1983, Train Entropy: 7.4996, Train Perplexity: 180.97
- Val Metrics: Val Loss: 6.4249, Val Entropy: 9.2692, Val Perplexity: 617.03
- Test Metrics: Test Loss: 6.6313, Test Entropy: 9.5670, Test Perplexity: 758.48
- Tiny Shakespeare Metrics: Test Loss: 7.3124, Test Entropy: 10.5495, Test Perplexity: 1498.75

## Bengio + Highway Network + Large embedding
- `embedding_dim`: 256
- `window_size`: 16
- Train Metrics: Train Loss: 5.1533, Train Entropy: 7.4346, Train Perplexity: 173.00
- Val Metrics: Val Loss: 6.4079, Val Entropy: 9.2446, Val Perplexity: 606.60
- Test Metrics: Test Loss: 6.6057, Test Entropy: 9.5300, Test Perplexity: 739.28
- Tiny Shakespeare Metrics: Test Loss: 7.2875, Test Entropy: 10.5136, Test Perplexity: 1461.89

## Bengio + Highway Network + Small embedding + CNN 
- `embedding_dim`: 128
- `window_size`: 16
- `kernel_size`: 3
- `padding`: 1
- Train Metrics: Train Loss: 4.9450, Train Entropy: 7.1341, Train Perplexity: 140.47
- Val Metrics: Val Loss: 6.2511, Val Entropy: 9.0184, Val Perplexity: 518.58
- Test Metrics: Test Loss: 6.4726, Test Entropy: 9.3380, Test Perplexity: 647.17
- Tiny Shakespeare Metrics: Test Loss: 7.2689, Test Entropy: 10.4869, Test Perplexity: 1435.03

## Bengio + Highway Network + Large embedding + CNN 
- `embedding_dim`: 256
- `window_size`: 16
- `kernel_size`: 3
- `padding`: 1
- Train Metrics: Train Loss: 4.5902, Train Entropy: 6.6223, Train Perplexity: 98.52
- Val Metrics: Val Loss: 6.2186, Val Entropy: 8.9715, Val Perplexity: 501.98
- Test Metrics: Test Loss: 6.4400, Test Entropy: 9.2910, Test Perplexity: 626.42
- Tiny Shakespeare Metrics: Test Loss: 7.2392, Test Entropy: 10.4440, Test Perplexity: 1392.99
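
The three numbers on each line above are different views of the same quantity. Perplexity is `exp(loss)` and entropy (in bits) is `log2(perplexity)`, which is how `get_metrics` computes them later in this notebook. A minimal sketch using only the standard library, with the first train loss above as input (the function name `loss_to_metrics` is illustrative, not part of the notebook):

```python
import math

def loss_to_metrics(nll):
    """Convert a mean cross-entropy loss in nats to (entropy in bits, perplexity)."""
    perplexity = math.exp(nll)
    entropy_bits = math.log2(perplexity)  # equals nll / ln(2)
    return entropy_bits, perplexity

entropy_bits, perplexity = loss_to_metrics(5.1983)
print(f"Entropy: {entropy_bits:.4f} bits, Perplexity: {perplexity:.2f}")
```

Because the logged loss is rounded to four decimals, the recomputed perplexity can differ from the logged one in the last digit.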

# CNNs for Language Modelling

This notebook explores Convolutional Neural Networks (CNNs) for language modelling. It extends the neural language model of Bengio et al. (2003) by adding convolutional layers.

**Reference Paper**: [Convolutional Neural Network Language Models](https://aclanthology.org/D16-1123.pdf)

In [2]:
from datasets import load_dataset
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
import numpy as np
import pandas as pd
import tiktoken
from sklearn.model_selection import train_test_split
from torch.optim.lr_scheduler import _LRScheduler 
from torch.nn.utils import clip_grad_norm_ 
import random
from IPython.display import display, Markdown
import matplotlib.pyplot as plt
%matplotlib inline
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)



<torch._C.Generator at 0x7fe6c8986bf0>

In [3]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


# 1. Read Data

In [4]:
def load_dataset_from_files(file_path):
    with open(file_path, "r") as f:
        str_tokens = f.read().splitlines()
        tokens = [int(token) for token in str_tokens]

    return tokens

In [5]:
train_tokens = load_dataset_from_files("train_tokens.txt")
val_tokens = load_dataset_from_files("val_tokens.txt")
test_tokens = load_dataset_from_files("test_tokens.txt")
ts_tokens = load_dataset_from_files("ts_tokens.txt")

In [6]:
len(train_tokens), len(test_tokens), len(val_tokens), len(ts_tokens)

(800258, 100033, 100032, 338025)

# 2. Helper Functions

In [7]:
gpt2_tokenizer = tiktoken.get_encoding("gpt2")

In [8]:
def prepare_dataset(tokens, context_window_size):
    x, y = [], []

    for i in range(len(tokens) - context_window_size):
        x.append(tokens[i : i + context_window_size])
        y.append(tokens[i + context_window_size])

    x = torch.LongTensor(x)
    y = torch.LongTensor(y)

    return x, y
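
To make the windowing concrete, here is a pure-Python sketch of the same loop on a toy sequence. It mirrors `prepare_dataset` but returns plain lists instead of tensors; the token ids and the name `sliding_windows` are made up for illustration.

```python
def sliding_windows(tokens, context_window_size):
    """Pure-Python version of the windowing in prepare_dataset (no tensors)."""
    contexts, targets = [], []
    for i in range(len(tokens) - context_window_size):
        contexts.append(tokens[i : i + context_window_size])  # the context window
        targets.append(tokens[i + context_window_size])       # the token to predict
    return contexts, targets

toy_tokens = [10, 11, 12, 13, 14]  # hypothetical token ids
contexts, targets = sliding_windows(toy_tokens, context_window_size=3)
for context, target in zip(contexts, targets):
    print(context, "->", target)
```

Each sequence yields `len(tokens) - context_window_size` examples, so the 800258 training tokens with a window of 16 give 800242 training examples.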

In [10]:
def cosine_scheduler(it, min_lr, max_lr, warmup_steps, max_steps, base_lr):
    if it < warmup_steps:
        lr = max_lr * ((it + 1) / warmup_steps)
    elif it > max_steps:
        lr = min_lr
    else:
        decay_ratio = (it - warmup_steps) / (max_steps - warmup_steps)
        assert 0 <= decay_ratio <= 1
        coeff = 0.5 * (1 + np.cos(decay_ratio * np.pi)) # starts with 1, ends at 0
        lr = min_lr + coeff * (max_lr - min_lr)

    return lr / base_lr
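
`cosine_scheduler` returns a multiplier relative to `base_lr`, which is the form `LambdaLR` expects. Since `train` calls `scheduler.step()` once per batch, `it` counts optimizer steps rather than epochs. The sketch below re-implements the same formula with the `math` module so it runs without NumPy; the hyperparameters are the ones set in a later cell of this notebook (`max_lr = 2e-2`, `min_lr = max_lr * 0.05`, `warmup_steps = 500`, `max_steps = 1000`), and the name `lr_multiplier` is illustrative.

```python
import math

def lr_multiplier(it, min_lr, max_lr, warmup_steps, max_steps, base_lr):
    """Same formula as cosine_scheduler, written with the math module."""
    if it < warmup_steps:
        lr = max_lr * ((it + 1) / warmup_steps)       # linear warmup
    elif it > max_steps:
        lr = min_lr                                   # floor after the decay phase
    else:
        decay_ratio = (it - warmup_steps) / (max_steps - warmup_steps)
        coeff = 0.5 * (1 + math.cos(decay_ratio * math.pi))  # goes from 1 to 0
        lr = min_lr + coeff * (max_lr - min_lr)
    return lr / base_lr

config = dict(min_lr=1e-3, max_lr=2e-2, warmup_steps=500, max_steps=1000, base_lr=2e-2)
for step in (0, 499, 500, 750, 1000, 1500):
    print(step, round(lr_multiplier(step, **config), 4))
```

The multiplier rises linearly to 1 during warmup, follows half a cosine down to `min_lr / base_lr`, and stays there afterwards.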


In [11]:
def train(model, train_tokens, val_tokens, batch_size, num_epochs, context_window_size, optimizer, scheduler):
    x_train, y_train = prepare_dataset(train_tokens, context_window_size)
    x_val, y_val = prepare_dataset(val_tokens, context_window_size)

    train_dataset = TensorDataset(x_train, y_train)
    val_dataset = TensorDataset(x_val, y_val)

    train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, drop_last=False)
    val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True, drop_last=False)

    
    # scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs, eta_min=lr * 0.10)
    criterion = nn.CrossEntropyLoss()

    metrics = {"NLL": [], "Entropy": [], "Perplexity": []}

    # get initial metrics (no gradients are needed for evaluation)
    model.eval()

    train_loss = 0.0
    with torch.no_grad():
        for x, y in train_dataloader:
            x, y = x.to(device), y.to(device) # (B, T), (B, )
            logits = model(x)
            loss = criterion(logits, y)
            train_loss += loss.item() * (x.shape[0] / x_train.shape[0])

    train_perplexity = float(np.exp(train_loss))
    train_entropy = float(np.log2(train_perplexity))

    # eval (no gradients are needed for evaluation)
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for x, y in val_dataloader:
            x, y = x.to(device), y.to(device) # (B, T), (B, )
            logits = model(x)
            loss = criterion(logits, y)
            val_loss += loss.item() * (x.shape[0] / x_val.shape[0])

    val_perplexity = float(np.exp(val_loss))
    val_entropy = float(np.log2(val_perplexity))

    metrics["NLL"].append((train_loss, val_loss))
    metrics["Entropy"].append((train_entropy, val_entropy))
    metrics["Perplexity"].append((train_perplexity, val_perplexity))

    print(f"Start of training: Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}")


    # now start training

    for epoch in range(num_epochs):
        # train loop
        model.train()
        train_loss = 0.0
        for x, y in train_dataloader:
            x, y = x.to(device), y.to(device) # (B, T), (B, )
            optimizer.zero_grad()
            logits = model(x)
            loss = criterion(logits, y)
            loss.backward()
            clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            if scheduler:
                scheduler.step()
            train_loss += loss.item() * (x.shape[0] / x_train.shape[0])

        train_perplexity = float(np.exp(train_loss))
        train_entropy = float(np.log2(train_perplexity))

        # eval (no gradients are needed for evaluation)
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for x, y in val_dataloader:
                x, y = x.to(device), y.to(device) # (B, T), (B, )
                logits = model(x)
                loss = criterion(logits, y)
                val_loss += loss.item() * (x.shape[0] / x_val.shape[0])

        val_perplexity = float(np.exp(val_loss))
        val_entropy = float(np.log2(val_perplexity))

        metrics["NLL"].append((train_loss, val_loss))
        metrics["Entropy"].append((train_entropy, val_entropy))
        metrics["Perplexity"].append((train_perplexity, val_perplexity))

        print(f"Epoch {epoch + 1}: Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}")

    return metrics
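
The accumulation `loss.item() * (x.shape[0] / x_train.shape[0])` weights each batch mean by its share of the examples, so the running sum equals the per-example mean even when the last batch is smaller (`drop_last=False`). A small sketch with made-up per-example losses shows the difference from naively averaging the batch means:

```python
per_example_losses = [2.0, 4.0, 6.0, 8.0, 10.0]  # hypothetical values
batch_size = 2
n = len(per_example_losses)

weighted_sum = 0.0
naive_sum, num_batches = 0.0, 0
for start in range(0, n, batch_size):
    batch = per_example_losses[start : start + batch_size]
    batch_mean = sum(batch) / len(batch)           # CrossEntropyLoss averages over the batch by default
    weighted_sum += batch_mean * (len(batch) / n)  # weighting used in train()
    naive_sum += batch_mean
    num_batches += 1

true_mean = sum(per_example_losses) / n
naive_mean = naive_sum / num_batches               # over-weights the short last batch
print(true_mean, weighted_sum, naive_mean)
```

This probably also explains why the per-epoch `Train Loss` printed by `train` can differ from the value `get_metrics` reports afterwards. The former is averaged while the weights are still being updated and the model is in training mode; the latter is measured once on the final weights in eval mode.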

In [12]:
def get_metrics(model, tokens, context_window_size):
    x_tensor, y_tensor = prepare_dataset(tokens, context_window_size)
    dataset = TensorDataset(x_tensor, y_tensor)
    dataloader = DataLoader(dataset, batch_size=4096, shuffle=True, drop_last=False)
    criterion = nn.CrossEntropyLoss()

    # evaluate on the full dataset (no gradients are needed)
    model.eval()
    tmp_loss = 0.0
    with torch.no_grad():
        for x, y in dataloader:
            x, y = x.to(device), y.to(device) # (B, T), (B, )
            logits = model(x)
            loss = criterion(logits, y)
            tmp_loss += loss.item() * (x.shape[0] / x_tensor.shape[0])

    perplexity = float(np.exp(tmp_loss))
    entropy = float(np.log2(perplexity))

    return tmp_loss, entropy, perplexity

In [13]:
def generate_text(model, context_window_size, seq_len=1000, num_iters=5):
    torch.manual_seed(42)
    torch.cuda.manual_seed_all(42)
    model.eval()
    pad_id = 198                              # newline 'Ċ'
    tokens = torch.full((num_iters, context_window_size),
                        pad_id,
                        dtype=torch.long)

    for i in range(seq_len):
        inp_tokens = tokens[:, -context_window_size:] # (B, T)
        inp_tokens = inp_tokens.to(device)
        logits = model(inp_tokens).detach().cpu() # (B, V)
        probs = F.softmax(logits, dim=1) # (B, V)
        chosen_tokens = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, chosen_tokens], dim=1)

    generated = tokens[:, context_window_size:]
    text = gpt2_tokenizer.decode_batch(generated.numpy())
    return text
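
The sampling step in `generate_text` (softmax over the logits, then one draw per row with `torch.multinomial`) can be illustrated without a model. The sketch below uses only the standard library; the logits are made up, and `random.choices` stands in for `torch.multinomial`.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(42)
toy_logits = [2.0, 1.0, 0.1]  # hypothetical scores for a 3-token vocabulary
probs = softmax(toy_logits)
# Draw one token id in proportion to its probability.
sampled_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
print([round(p, 3) for p in probs], sampled_id)
```

Sampling in proportion to the probabilities, rather than always taking the argmax, is what lets the `num_iters` rows produce different continuations from the same all-newline starting context.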

# 3.1: Bengio Paper with Highway Networks

The reference paper enriches the model with highway layers, so this section checks whether adding them improves our metrics.

**Highway Networks**: [Highway Networks](https://arxiv.org/pdf/1505.00387)
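
Each highway layer computes `y = T(x) * H(x) + (1 - T(x)) * x`, where `H` is the transformed signal and `T` is a sigmoid gate. The implementation below initialises the gate bias to `-2.0`, which in the idealised case starts the gate mostly closed, so the layer passes its input through almost unchanged. A scalar sketch, assuming for illustration that the gate pre-activation is just that bias (the function names are hypothetical):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def highway_scalar(x, h, gate_preactivation):
    """One highway unit on scalars: blend the transformed signal h with the input x."""
    t = sigmoid(gate_preactivation)
    return t * h + (1.0 - t) * x, t

# Assumption for illustration: the gate pre-activation equals its bias (-2.0).
y, t = highway_scalar(x=1.0, h=5.0, gate_preactivation=-2.0)
print(f"gate={t:.3f}, output={y:.3f}")
```

In the notebook's implementation a `BatchNorm1d` follows the gate's linear layer. Batch normalisation re-centres its input, so the effective initial gate value may differ from this idealised case.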

In [242]:
class HighwayNetworks(nn.Module):
    def __init__(self, embedding_dim, num_layers):
        super().__init__()
        self.embedding_dim = embedding_dim
        self.num_layers = num_layers
        self.transformed_signal_network = nn.ModuleList([
            nn.Linear(embedding_dim, embedding_dim) for _ in range(num_layers)
        ])
        self.transformed_signal_bn = nn.ModuleList([
            nn.BatchNorm1d(embedding_dim) for _ in range(num_layers)
        ])
        self.transform_gate_network = nn.ModuleList([
            nn.Linear(embedding_dim, embedding_dim) for _ in range(num_layers)
        ])
        self.transform_gate_bn = nn.ModuleList([
            nn.BatchNorm1d(embedding_dim) for _ in range(num_layers)
        ])

        for net in self.transformed_signal_network:
            nn.init.xavier_normal_(net.weight)
            nn.init.zeros_(net.bias)
        for net in self.transform_gate_network:
            nn.init.xavier_normal_(net.weight)
            nn.init.constant_(net.bias, -2.0)

    def forward(self, x):
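        # x shape: (B, D), where D is the flattened feature width (T * C)
        for net1, bn1, net2, bn2 in zip(self.transformed_signal_network, self.transformed_signal_bn, self.transform_gate_network, self.transform_gate_bn):
            H = net1(x)
            H = bn1(H)
            H = F.relu(H) # transformed signal, (B, D)
            T = net2(x)
            T = bn2(T)
            T = torch.sigmoid(T) # transform gate in (0, 1), (B, D); F.sigmoid is deprecated
            x = T * H + (1 - T) * x # gated mix of transformed signal and input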

        return x

In [15]:
class BengioLMHighwayDropout(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_window_size, dropout=0.0, weight_tying=False):
        super().__init__()
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.context_window_size = context_window_size

        self.embedding_lookup_table = nn.Embedding(vocab_size, embedding_dim)
        self.dropout1 = nn.Dropout(p=dropout)
        self.highway = HighwayNetworks(embedding_dim * context_window_size, num_layers=1)
        self.dropout2 = nn.Dropout(p=dropout)
        self.linear = nn.Linear(embedding_dim, vocab_size)

        # init params
        nn.init.xavier_normal_(self.embedding_lookup_table.weight)
        nn.init.xavier_normal_(self.linear.weight)
        nn.init.zeros_(self.linear.bias)

        if weight_tying:
            self.embedding_lookup_table.weight = self.linear.weight

    def forward(self, x):
        # x shape: (B, T)
        embeddings = self.embedding_lookup_table(x) # (B, T, C)
        embeddings = self.dropout1(embeddings)
        B, T, C = embeddings.shape
        embeddings = embeddings.view(B, T * C)

        h = self.highway(embeddings) # (B, T * C)
        h = self.dropout2(h) # (B, T * C)
        h = h.view(B, T, C)
        h = h.mean(dim = 1) # (B ,C)

        logits = self.linear(h) # (B, V)
        
        return logits

In [16]:
vocab_size = max(train_tokens) + 1
vocab_size

50257

## Exp 1: No Dropout + smaller embedding_dim + no weight tying

- `embedding_dim`: 128
- `window_size`: 16

In [1]:
embedding_dim = 128
context_window_size = 16

In [111]:
model = BengioLMHighwayDropout(vocab_size=vocab_size, context_window_size=context_window_size, embedding_dim=embedding_dim)

In [112]:
model.to(device)

BengioLMHighwayDropout(
  (embedding_lookup_table): Embedding(50257, 128)
  (dropout1): Dropout(p=0.0, inplace=False)
  (highway): HighwayNetworks(
    (transformed_signal_network): ModuleList(
      (0): Linear(in_features=2048, out_features=2048, bias=True)
    )
    (transformed_signal_bn): ModuleList(
      (0): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (transform_gate_network): ModuleList(
      (0): Linear(in_features=2048, out_features=2048, bias=True)
    )
    (transform_gate_bn): ModuleList(
      (0): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (dropout2): Dropout(p=0.0, inplace=False)
  (linear): Linear(in_features=128, out_features=50257, bias=True)
)

In [113]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters in model: {total_params}")

Total parameters in model: 21316945


In [114]:
base_lr = 5e-3
# max_lr = 5e-3
# min_lr = max_lr * 0.05
# warmup_steps = 1000
# max_steps = 2000

optimizer = optim.AdamW(model.parameters(), lr=base_lr)
scheduler = None
# scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch: cosine_scheduler(epoch, min_lr=min_lr, max_lr=max_lr, warmup_steps=warmup_steps, max_steps=max_steps, base_lr=base_lr))


metrics1 = train(model, train_tokens, val_tokens, batch_size=4096, num_epochs=4,\
     context_window_size=context_window_size, optimizer=optimizer, scheduler=scheduler)

Start of training: Train Loss: 10.8249, Val Loss: 10.8249
Epoch 1: Train Loss: 7.5646, Val Loss: 7.2531
Epoch 2: Train Loss: 6.4458, Val Loss: 6.5519
Epoch 3: Train Loss: 5.5884, Val Loss: 6.3102
Epoch 4: Train Loss: 4.9216, Val Loss: 6.3001


In [115]:
train_loss, train_entropy, train_perplexity = get_metrics(model, train_tokens, context_window_size)
print(f'Train Metrics: Train Loss: {train_loss:.4f}, Train Entropy: {train_entropy:.4f}, Train Perplexity: {train_perplexity:.2f}')

Train Metrics: Train Loss: 4.3205, Train Entropy: 6.2331, Train Perplexity: 75.22


In [116]:
val_loss, val_entropy, val_perplexity = get_metrics(model, val_tokens, context_window_size)
print(f'Val Metrics: Val Loss: {val_loss:.4f}, Val Entropy: {val_entropy:.4f}, Val Perplexity: {val_perplexity:.2f}')

Val Metrics: Val Loss: 6.3001, Val Entropy: 9.0892, Val Perplexity: 544.64


In [117]:
test_loss, test_entropy, test_perplexity = get_metrics(model, test_tokens, context_window_size)
print(f'Test Metrics: Test Loss: {test_loss:.4f}, Test Entropy: {test_entropy:.4f}, Test Perplexity: {test_perplexity:.2f}')

Test Metrics: Test Loss: 6.5865, Test Entropy: 9.5023, Test Perplexity: 725.24


In [118]:
ts_loss, ts_entropy, ts_perplexity = get_metrics(model, ts_tokens, context_window_size)
print(f'Tiny Shakespeare Metrics: Test Loss: {ts_loss:.4f}, Test Entropy: {ts_entropy:.4f}, Test Perplexity: {ts_perplexity:.2f}')

Tiny Shakespeare Metrics: Test Loss: 7.4492, Test Entropy: 10.7469, Test Perplexity: 1718.43


In [119]:
torch.save(model.state_dict(), "bengio_highway_no_weight_share.pt")

## Exp 2: No Dropout + smaller embedding_dim + weight tying

- `embedding_dim`: 128
- `window_size`: 16

In [87]:
embedding_dim = 128
context_window_size = 16

In [88]:
model = BengioLMHighwayDropout(vocab_size=vocab_size, context_window_size=context_window_size, embedding_dim=embedding_dim, weight_tying=True)

In [89]:
model.to(device)

BengioLMHighwayDropout(
  (embedding_lookup_table): Embedding(50257, 128)
  (dropout1): Dropout(p=0.0, inplace=False)
  (highway): HighwayNetworks(
    (transformed_signal_network): ModuleList(
      (0): Linear(in_features=2048, out_features=2048, bias=True)
    )
    (transformed_signal_bn): ModuleList(
      (0): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (transform_gate_network): ModuleList(
      (0): Linear(in_features=2048, out_features=2048, bias=True)
    )
    (transform_gate_bn): ModuleList(
      (0): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (dropout2): Dropout(p=0.0, inplace=False)
  (linear): Linear(in_features=128, out_features=50257, bias=True)
)

In [90]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters in model: {total_params}")

Total parameters in model: 14884049
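
The parameter totals can be checked by hand. The sketch below adds up the layers listed in the module printout (embedding, two highway linears, two highway batch norms, output linear) and reproduces both the untied total from Exp 1 and the tied total above. The function name `count_params` is illustrative.

```python
def count_params(vocab_size, embedding_dim, context_window_size, weight_tying):
    """Parameter count for the Bengio + highway model with one highway layer."""
    d = embedding_dim * context_window_size        # flattened context width
    embedding = vocab_size * embedding_dim
    highway_linears = 2 * (d * d + d)              # signal + gate, weights + biases
    highway_batchnorms = 2 * (2 * d)               # scale + shift for each BatchNorm1d
    output_linear = embedding_dim * vocab_size + vocab_size
    total = embedding + highway_linears + highway_batchnorms + output_linear
    if weight_tying:
        total -= embedding                         # the embedding shares the output weight
    return total

print(count_params(50257, 128, 16, weight_tying=False))  # 21316945
print(count_params(50257, 128, 16, weight_tying=True))   # 14884049
```

Weight tying removes exactly one `vocab_size x embedding_dim` matrix (6,432,896 parameters). Most of the remaining parameters sit in the two 2048 x 2048 highway linears.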


In [91]:
base_lr = 2e-2
max_lr = 2e-2
min_lr = max_lr * 0.05
warmup_steps = 500
max_steps = 1000

optimizer = optim.AdamW(model.parameters(), lr=base_lr)
scheduler = None
# scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch: cosine_scheduler(epoch, min_lr=min_lr, max_lr=max_lr, warmup_steps=warmup_steps, max_steps=max_steps, base_lr=base_lr))


metrics1 = train(model, train_tokens, val_tokens, batch_size=4096, num_epochs=4,\
     context_window_size=context_window_size, optimizer=optimizer, scheduler=scheduler)

Start of training: Train Loss: 10.8249, Val Loss: 10.8249
Epoch 1: Train Loss: 7.5893, Val Loss: 7.3347
Epoch 2: Train Loss: 6.4393, Val Loss: 6.8767
Epoch 3: Train Loss: 5.7322, Val Loss: 6.7864
Epoch 4: Train Loss: 5.2974, Val Loss: 6.8154


In [82]:
(model.embedding_lookup_table.weight == model.linear.weight).all()

tensor(True, device='cuda:0')

In [83]:
train_loss, train_entropy, train_perplexity = get_metrics(model, train_tokens, context_window_size)
print(f'Train Metrics: Train Loss: {train_loss:.4f}, Train Entropy: {train_entropy:.4f}, Train Perplexity: {train_perplexity:.2f}')

Train Metrics: Train Loss: 4.7534, Train Entropy: 6.8578, Train Perplexity: 115.98


In [84]:
val_loss, val_entropy, val_perplexity = get_metrics(model, val_tokens, context_window_size)
print(f'Val Metrics: Val Loss: {val_loss:.4f}, Val Entropy: {val_entropy:.4f}, Val Perplexity: {val_perplexity:.2f}')

Val Metrics: Val Loss: 6.8028, Val Entropy: 9.8144, Val Perplexity: 900.38


In [85]:
test_loss, test_entropy, test_perplexity = get_metrics(model, test_tokens, context_window_size)
print(f'Test Metrics: Test Loss: {test_loss:.4f}, Test Entropy: {test_entropy:.4f}, Test Perplexity: {test_perplexity:.2f}')

Test Metrics: Test Loss: 7.0134, Test Entropy: 10.1181, Test Perplexity: 1111.38


In [86]:
ts_loss, ts_entropy, ts_perplexity = get_metrics(model, ts_tokens, context_window_size)
print(f'Tiny Shakespeare Metrics: Test Loss: {ts_loss:.4f}, Test Entropy: {ts_entropy:.4f}, Test Perplexity: {ts_perplexity:.2f}')

Tiny Shakespeare Metrics: Test Loss: 7.4611, Test Entropy: 10.7642, Test Perplexity: 1739.14


In [92]:
torch.save(model.state_dict(), "bengio_highway_weight_share.pt")

## Exp 3: Dropout + smaller emb_dim + no weight share
- `embedding_dim`: 128
- `window_size`: 16

In [268]:
embedding_dim = 128
context_window_size = 16

In [269]:
model = BengioLMHighwayDropout(vocab_size=vocab_size, context_window_size=context_window_size, embedding_dim=embedding_dim, dropout=0.10)

In [270]:
model.to(device)

BengioLMHighwayDropout(
  (embedding_lookup_table): Embedding(50257, 128)
  (dropout1): Dropout(p=0.1, inplace=False)
  (highway): HighwayNetworks(
    (transformed_signal_network): ModuleList(
      (0): Linear(in_features=2048, out_features=2048, bias=True)
    )
    (transformed_signal_bn): ModuleList(
      (0): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (transform_gate_network): ModuleList(
      (0): Linear(in_features=2048, out_features=2048, bias=True)
    )
    (transform_gate_bn): ModuleList(
      (0): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (dropout2): Dropout(p=0.1, inplace=False)
  (linear): Linear(in_features=128, out_features=50257, bias=True)
)

In [271]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters in model: {total_params}")

Total parameters in model: 21316945


In [272]:
base_lr = 5e-3
# max_lr = 5e-3
# min_lr = max_lr * 0.05
# warmup_steps = 1000
# max_steps = 2000

optimizer = optim.AdamW(model.parameters(), lr=base_lr)
scheduler = None
# scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch: cosine_scheduler(epoch, min_lr=min_lr, max_lr=max_lr, warmup_steps=warmup_steps, max_steps=max_steps, base_lr=base_lr))


metrics1 = train(model, train_tokens, val_tokens, batch_size=4096, num_epochs=2,\
     context_window_size=context_window_size, optimizer=optimizer, scheduler=scheduler)

Start of training: Train Loss: 10.8249, Val Loss: 10.8249
Epoch 1: Train Loss: 6.9841, Val Loss: 6.5952
Epoch 2: Train Loss: 5.8771, Val Loss: 6.4249


In [273]:
train_loss, train_entropy, train_perplexity = get_metrics(model, train_tokens, context_window_size)
print(f'Train Metrics: Train Loss: {train_loss:.4f}, Train Entropy: {train_entropy:.4f}, Train Perplexity: {train_perplexity:.2f}')

Train Metrics: Train Loss: 5.1983, Train Entropy: 7.4996, Train Perplexity: 180.97


In [274]:
val_loss, val_entropy, val_perplexity = get_metrics(model, val_tokens, context_window_size)
print(f'Val Metrics: Val Loss: {val_loss:.4f}, Val Entropy: {val_entropy:.4f}, Val Perplexity: {val_perplexity:.2f}')

Val Metrics: Val Loss: 6.4249, Val Entropy: 9.2692, Val Perplexity: 617.03


In [275]:
test_loss, test_entropy, test_perplexity = get_metrics(model, test_tokens, context_window_size)
print(f'Test Metrics: Test Loss: {test_loss:.4f}, Test Entropy: {test_entropy:.4f}, Test Perplexity: {test_perplexity:.2f}')

Test Metrics: Test Loss: 6.6313, Test Entropy: 9.5670, Test Perplexity: 758.48


In [276]:
ts_loss, ts_entropy, ts_perplexity = get_metrics(model, ts_tokens, context_window_size)
print(f'Tiny Shakespeare Metrics: Test Loss: {ts_loss:.4f}, Test Entropy: {ts_entropy:.4f}, Test Perplexity: {ts_perplexity:.2f}')

Tiny Shakespeare Metrics: Test Loss: 7.3124, Test Entropy: 10.5495, Test Perplexity: 1498.75


In [277]:
torch.save(model.state_dict(), "bengio_highway_no_weight_share_dropout.pt")

## Exp 4: Dropout + large emb_dim + no weight share
- `embedding_dim`: 256
- `window_size`: 16

In [288]:
embedding_dim = 256
context_window_size = 16

In [289]:
model = BengioLMHighwayDropout(vocab_size=vocab_size, context_window_size=context_window_size, embedding_dim=embedding_dim, dropout=0.10)

In [290]:
model.to(device)

BengioLMHighwayDropout(
  (embedding_lookup_table): Embedding(50257, 256)
  (dropout1): Dropout(p=0.1, inplace=False)
  (highway): HighwayNetworks(
    (transformed_signal_network): ModuleList(
      (0): Linear(in_features=4096, out_features=4096, bias=True)
    )
    (transformed_signal_bn): ModuleList(
      (0): BatchNorm1d(4096, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (transform_gate_network): ModuleList(
      (0): Linear(in_features=4096, out_features=4096, bias=True)
    )
    (transform_gate_bn): ModuleList(
      (0): BatchNorm1d(4096, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (dropout2): Dropout(p=0.1, inplace=False)
  (linear): Linear(in_features=256, out_features=50257, bias=True)
)

In [291]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters in model: {total_params}")

Total parameters in model: 59360849


In [292]:
base_lr = 2e-3
# max_lr = 5e-3
# min_lr = max_lr * 0.05
# warmup_steps = 1000
# max_steps = 2000

optimizer = optim.AdamW(model.parameters(), lr=base_lr)
scheduler = None
# scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch: cosine_scheduler(epoch, min_lr=min_lr, max_lr=max_lr, warmup_steps=warmup_steps, max_steps=max_steps, base_lr=base_lr))


metrics1 = train(model, train_tokens, val_tokens, batch_size=4096, num_epochs=2,\
     context_window_size=context_window_size, optimizer=optimizer, scheduler=scheduler)

Start of training: Train Loss: 10.8249, Val Loss: 10.8249
Epoch 1: Train Loss: 7.0051, Val Loss: 6.5830
Epoch 2: Train Loss: 5.8389, Val Loss: 6.4079


In [293]:
train_loss, train_entropy, train_perplexity = get_metrics(model, train_tokens, context_window_size)
print(f'Train Metrics: Train Loss: {train_loss:.4f}, Train Entropy: {train_entropy:.4f}, Train Perplexity: {train_perplexity:.2f}')

Train Metrics: Train Loss: 5.1533, Train Entropy: 7.4346, Train Perplexity: 173.00


In [294]:
val_loss, val_entropy, val_perplexity = get_metrics(model, val_tokens, context_window_size)
print(f'Val Metrics: Val Loss: {val_loss:.4f}, Val Entropy: {val_entropy:.4f}, Val Perplexity: {val_perplexity:.2f}')

Val Metrics: Val Loss: 6.4079, Val Entropy: 9.2446, Val Perplexity: 606.60


In [295]:
test_loss, test_entropy, test_perplexity = get_metrics(model, test_tokens, context_window_size)
print(f'Test Metrics: Test Loss: {test_loss:.4f}, Test Entropy: {test_entropy:.4f}, Test Perplexity: {test_perplexity:.2f}')

Test Metrics: Test Loss: 6.6057, Test Entropy: 9.5300, Test Perplexity: 739.28


In [296]:
ts_loss, ts_entropy, ts_perplexity = get_metrics(model, ts_tokens, context_window_size)
print(f'Tiny Shakespeare Metrics: Test Loss: {ts_loss:.4f}, Test Entropy: {ts_entropy:.4f}, Test Perplexity: {ts_perplexity:.2f}')

Tiny Shakespeare Metrics: Test Loss: 7.2875, Test Entropy: 10.5136, Test Perplexity: 1461.89


In [137]:
torch.save(model.state_dict(), "bengio_highway_no_weight_share_dropout_large.pt")

# 3.2: Bengio + Highway + Convolution

Here, before the highway network, a convolutional layer lets neighbouring tokens exchange information; its output is then passed to the highway network. This should be stronger than simply concatenating the token embeddings.

In [370]:
class ConvNet(nn.Module):
    def __init__(self, channel_dim, context_window_size, kernel_size, padding):
        super().__init__()
        self.channel_dim = channel_dim
        self.context_window_size = context_window_size
        self.kernel_size = kernel_size
        self.padding = padding

        self.conv = nn.Conv1d(channel_dim, channel_dim, kernel_size=kernel_size, padding=padding)
        self.conv_out_dim = context_window_size + (2 * padding) - (kernel_size - 1)
        self.bn = nn.BatchNorm1d(self.channel_dim * self.conv_out_dim)
        
        nn.init.xavier_normal_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

    def forward(self, x):
        # x shape: (B, C, T); Conv1d expects channels on dim 1 and sequence length on dim 2
        x = self.conv(x) # (B, C, T_out)
        B, C, T = x.shape
        x = x.view(B, C * T)
        x = self.bn(x)
        x = F.relu(x) # (B, C * T)
        return x
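
The `conv_out_dim` expression above is the stride-1, dilation-1 case of the general 1-D convolution output-length formula. The sketch below evaluates the general form for the three kernel/padding settings used in the experiments that follow, with `window_size = 16` and `embedding_dim = 128`. The function name `conv1d_out_len` is illustrative.

```python
def conv1d_out_len(seq_len, kernel_size, padding, stride=1, dilation=1):
    """Output length of a 1-D convolution (general formula)."""
    return (seq_len + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

for kernel_size, padding in [(3, 0), (3, 1), (5, 2)]:
    t_out = conv1d_out_len(16, kernel_size, padding)
    # Flattened width fed to BatchNorm1d and the highway layer: C * T_out
    print(kernel_size, padding, t_out, 128 * t_out)
```

Without padding the window shrinks to 14 positions (a flattened width of 1792). Choosing `padding = (kernel_size - 1) // 2` for an odd kernel keeps all 16 positions (a width of 2048).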

In [371]:
class BengioLMHighwayDropoutWithCNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_window_size, kernel_size, padding, dropout=0.0, weight_tying=False):
        super().__init__()

        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.context_window_size = context_window_size
        self.kernel_size = kernel_size
        self.padding = padding

        self.embedding_lookup_table = nn.Embedding(vocab_size, embedding_dim)
        self.cnn = ConvNet(channel_dim=embedding_dim, context_window_size=context_window_size, kernel_size=kernel_size, padding=padding)
        self.conv_out_dim = context_window_size + (2 * padding) - (kernel_size - 1)
        self.dropout1 = nn.Dropout(p=dropout)
        self.highway = HighwayNetworks(embedding_dim * self.conv_out_dim, num_layers=1)
        self.dropout2 = nn.Dropout(p=dropout)
        self.linear = nn.Linear(embedding_dim, vocab_size)

        # init params
        nn.init.xavier_normal_(self.embedding_lookup_table.weight)
        nn.init.xavier_normal_(self.linear.weight)
        nn.init.zeros_(self.linear.bias)

        if weight_tying:
            self.embedding_lookup_table.weight = self.linear.weight

    def forward(self, x):
        # x shape: (B, T)
        embeddings = self.embedding_lookup_table(x) # (B, T, C)
        B, T, C = embeddings.shape
        embeddings = embeddings.view(B, C, T)

        h = self.cnn(embeddings) # (B, T_out * C)
        h = self.dropout1(h)
        h = self.highway(h) # (B, T_out * C)
        h = self.dropout2(h) # (B, T_out * C)
        h = h.view(B, self.conv_out_dim, C)
        h = h.mean(dim = 1) # (B ,C)

        logits = self.linear(h) # (B, V)
        
        return logits
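
One detail in `forward` may be worth double-checking. `embeddings.view(B, C, T)` reinterprets the existing memory layout rather than swapping axes, so each "channel" row ends up mixing values from different positions and embedding dimensions. If the intent is a true `(B, C, T)` layout, `embeddings.transpose(1, 2)` would be the usual choice. The same consideration applies to `h.view(B, self.conv_out_dim, C)` after the channel-major flatten in `ConvNet`. The results logged in this notebook were produced with `view` as written. The difference can be seen with plain nested lists standing in for a single `(T, C)` example; the helper names are illustrative only.

```python
def reshape_like_view(matrix, rows, cols):
    """Mimic tensor.view: flatten row-major, then re-chunk into rows x cols."""
    flat = [value for row in matrix for value in row]
    assert len(flat) == rows * cols
    return [flat[r * cols : (r + 1) * cols] for r in range(rows)]

def transpose(matrix):
    """Swap the two axes, as tensor.transpose would."""
    return [list(column) for column in zip(*matrix)]

# One example with T=2 positions and C=3 embedding dims; "p0c1" = position 0, dim 1.
example = [["p0c0", "p0c1", "p0c2"],
           ["p1c0", "p1c1", "p1c2"]]

viewed = reshape_like_view(example, rows=3, cols=2)  # rows mix positions and dims
swapped = transpose(example)                         # each row is one dim across positions
print(viewed)
print(swapped)
```

In the transposed version every row holds a single embedding dimension across positions, which is what a convolution over the sequence axis assumes. In the reshaped version the rows do not correspond to any one dimension.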

In [372]:
vocab_size = max(train_tokens) + 1
vocab_size

50257

## Exp 1: Dropout + smaller emb_dim + no pad + kernel size = 3
- `embedding_dim`: 128
- `window_size`: 16
- `kernel_size`: 3
- `padding`: 0

In [360]:
embedding_dim = 128
context_window_size = 16
kernel_size = 3
padding = 0

In [361]:
model = BengioLMHighwayDropoutWithCNN(vocab_size=vocab_size, context_window_size=context_window_size, embedding_dim=embedding_dim,\
                                dropout=0.10, kernel_size=kernel_size, padding=padding)

In [362]:
model.to(device)

BengioLMHighwayDropoutWithCNN(
  (embedding_lookup_table): Embedding(50257, 128)
  (cnn): ConvNet(
    (conv): Conv1d(128, 128, kernel_size=(3,), stride=(1,))
    (bn): BatchNorm1d(1792, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
  (dropout1): Dropout(p=0.1, inplace=False)
  (highway): HighwayNetworks(
    (transformed_signal_network): ModuleList(
      (0): Linear(in_features=1792, out_features=1792, bias=True)
    )
    (transformed_signal_bn): ModuleList(
      (0): BatchNorm1d(1792, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (transform_gate_network): ModuleList(
      (0): Linear(in_features=1792, out_features=1792, bias=True)
    )
    (transform_gate_bn): ModuleList(
      (0): BatchNorm1d(1792, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (dropout2): Dropout(p=0.1, inplace=False)
  (linear): Linear(in_features=128, out_features=50257, bias=True)
)

In [363]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters in model: {total_params}")

Total parameters in model: 19402193


In [364]:
base_lr = 5e-3
# max_lr = 5e-3
# min_lr = max_lr * 0.05
# warmup_steps = 500
# max_steps = 1000

optimizer = optim.AdamW(model.parameters(), lr=base_lr)
scheduler = None
# scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch: cosine_scheduler(epoch, min_lr=min_lr, max_lr=max_lr, warmup_steps=warmup_steps, max_steps=max_steps, base_lr=base_lr))


metrics1 = train(model, train_tokens, val_tokens, batch_size=4096, num_epochs=2,\
     context_window_size=context_window_size, optimizer=optimizer, scheduler=scheduler)

Start of training: Train Loss: 10.8249, Val Loss: 10.8249
Epoch 1: Train Loss: 6.9475, Val Loss: 6.5365
Epoch 2: Train Loss: 5.8236, Val Loss: 6.2818


## Exp 2: Dropout + smaller emb_dim + pad 1 (to ensure same dim) + kernel size = 3
- `embedding_dim`: 128
- `window_size`: 16
- `kernel_size`: 3
- `padding`: 1

In [378]:
embedding_dim = 128
context_window_size = 16
kernel_size = 3
padding = 1

In [379]:
model = BengioLMHighwayDropoutWithCNN(vocab_size=vocab_size, context_window_size=context_window_size, embedding_dim=embedding_dim,\
                                dropout=0.10, kernel_size=kernel_size, padding=padding)

In [380]:
model.to(device)

BengioLMHighwayDropoutWithCNN(
  (embedding_lookup_table): Embedding(50257, 128)
  (cnn): ConvNet(
    (conv): Conv1d(128, 128, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
  (dropout1): Dropout(p=0.1, inplace=False)
  (highway): HighwayNetworks(
    (transformed_signal_network): ModuleList(
      (0): Linear(in_features=2048, out_features=2048, bias=True)
    )
    (transformed_signal_bn): ModuleList(
      (0): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (transform_gate_network): ModuleList(
      (0): Linear(in_features=2048, out_features=2048, bias=True)
    )
    (transform_gate_bn): ModuleList(
      (0): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (dropout2): Dropout(p=0.1, inplace=False)
  (linear): Linear(in_features=128, out_features=50257, bias=True)
)

In [381]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters in model: {total_params}")

Total parameters in model: 21370321
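
The total can be cross-checked by hand from the layer shapes in the printed model. The helper below is a hypothetical sketch (`count_params` is not part of the notebook). It assumes one highway layer, as printed, and that only weights, biases, and BatchNorm affine terms are counted, since running statistics are buffers rather than parameters.

```python
def count_params(vocab_size, embedding_dim, context_window_size, kernel_size):
    """Parameter count reconstructed from the printed module shapes.

    This is an illustrative reconstruction, not the author's code.
    """
    flat_dim = embedding_dim * context_window_size  # e.g. 128 * 16 = 2048

    embedding = vocab_size * embedding_dim
    # Conv1d(embedding_dim, embedding_dim, kernel_size): weights + bias.
    conv = embedding_dim * embedding_dim * kernel_size + embedding_dim
    # BatchNorm1d(flat_dim): scale + shift.
    conv_bn = 2 * flat_dim
    # One highway layer: two Linear(flat_dim, flat_dim) and two BatchNorm1d(flat_dim).
    highway = 2 * (flat_dim * flat_dim + flat_dim) + 2 * (2 * flat_dim)
    # Output layer: Linear(embedding_dim, vocab_size).
    output = embedding_dim * vocab_size + vocab_size

    return embedding + conv + conv_bn + highway + output


print(count_params(50257, 128, 16, kernel_size=3))
```

Under these assumptions the embedding table and the output layer account for roughly 6.4M parameters each, and the single highway layer for about 8.4M.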


In [382]:
base_lr = 5e-3
# max_lr = 5e-3
# min_lr = max_lr * 0.05
# warmup_steps = 500
# max_steps = 1000

optimizer = optim.AdamW(model.parameters(), lr=base_lr)
scheduler = None
# scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch: cosine_scheduler(epoch, min_lr=min_lr, max_lr=max_lr, warmup_steps=warmup_steps, max_steps=max_steps, base_lr=base_lr))

metrics1 = train(model, train_tokens, val_tokens, batch_size=4096, num_epochs=2,
                 context_window_size=context_window_size, optimizer=optimizer, scheduler=scheduler)

Start of training: Train Loss: 10.8249, Val Loss: 10.8249
Epoch 1: Train Loss: 6.8997, Val Loss: 6.4971
Epoch 2: Train Loss: 5.7501, Val Loss: 6.2790


## Exp 3: Dropout + smaller emb_dim + kernel size = 5 + pad 2 (padding changed to keep the same dim)
- `embedding_dim`: 128
- `window_size`: 16
- `kernel_size`: 5
- `padding`: 2
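
The padding of 2 follows from the standard 1-D convolution output-length formula. The sketch below assumes the convolution runs along the 16-token window axis with stride 1 and no dilation; `conv1d_output_length` is a hypothetical helper, not something defined in the notebook.

```python
def conv1d_output_length(length_in, kernel_size, padding, stride=1, dilation=1):
    """Output length of a 1-D convolution (the formula used by torch.nn.Conv1d)."""
    return (length_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


window_size = 16

# kernel_size=3 needs padding=1 and kernel_size=5 needs padding=2 to preserve the length.
print(conv1d_output_length(window_size, kernel_size=3, padding=1))
print(conv1d_output_length(window_size, kernel_size=5, padding=2))

# Keeping padding=1 with kernel_size=5 would shrink the sequence.
print(conv1d_output_length(window_size, kernel_size=5, padding=1))
```

For odd kernel sizes with stride 1, `padding = (kernel_size - 1) // 2` keeps the output length equal to the input length.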

In [388]:
embedding_dim = 128
context_window_size = 16
kernel_size = 5
padding = 2

In [389]:
model = BengioLMHighwayDropoutWithCNN(vocab_size=vocab_size, context_window_size=context_window_size,
                                      embedding_dim=embedding_dim, dropout=0.10,
                                      kernel_size=kernel_size, padding=padding)

In [390]:
model.to(device)

BengioLMHighwayDropoutWithCNN(
  (embedding_lookup_table): Embedding(50257, 128)
  (cnn): ConvNet(
    (conv): Conv1d(128, 128, kernel_size=(5,), stride=(1,), padding=(2,))
    (bn): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
  (dropout1): Dropout(p=0.1, inplace=False)
  (highway): HighwayNetworks(
    (transformed_signal_network): ModuleList(
      (0): Linear(in_features=2048, out_features=2048, bias=True)
    )
    (transformed_signal_bn): ModuleList(
      (0): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (transform_gate_network): ModuleList(
      (0): Linear(in_features=2048, out_features=2048, bias=True)
    )
    (transform_gate_bn): ModuleList(
      (0): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (dropout2): Dropout(p=0.1, inplace=False)
  (linear): Linear(in_features=128, out_features=50257, bias=True)
)

In [391]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters in model: {total_params}")

Total parameters in model: 21403089


In [392]:
base_lr = 5e-3
# max_lr = 5e-3
# min_lr = max_lr * 0.05
# warmup_steps = 500
# max_steps = 1000

optimizer = optim.AdamW(model.parameters(), lr=base_lr)
scheduler = None
# scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch: cosine_scheduler(epoch, min_lr=min_lr, max_lr=max_lr, warmup_steps=warmup_steps, max_steps=max_steps, base_lr=base_lr))

metrics1 = train(model, train_tokens, val_tokens, batch_size=4096, num_epochs=5,
                 context_window_size=context_window_size, optimizer=optimizer, scheduler=scheduler)

Start of training: Train Loss: 10.8249, Val Loss: 10.8249
Epoch 1: Train Loss: 6.8919, Val Loss: 6.4808
Epoch 2: Train Loss: 5.7190, Val Loss: 6.2649
Epoch 3: Train Loss: 4.9823, Val Loss: 6.3832
Epoch 4: Train Loss: 4.3161, Val Loss: 6.7443
Epoch 5: Train Loss: 3.7143, Val Loss: 7.2978
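
In this run the validation loss bottoms out at epoch 2 and rises afterwards while the training loss keeps falling, which is the usual sign of overfitting. The sketch below shows how an early-stopping rule would read these logged values. Both helpers are hypothetical and are not part of the notebook's `train` function.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1


def stop_epoch(val_losses, patience=1):
    """Return the 1-based epoch at which early stopping would trigger.

    Training stops after `patience` consecutive epochs without improvement.
    If the rule never triggers, the last epoch is returned.
    """
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return len(val_losses)


# Validation losses copied from the training log above.
val_losses = [6.4808, 6.2649, 6.3832, 6.7443, 7.2978]

print(best_epoch(val_losses))              # 2
print(stop_epoch(val_losses, patience=1))  # 3
```

This is consistent with the other experiments being trained for only 2 epochs.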


## Exp 4: Paper Model
- `embedding_dim`: 128
- `window_size`: 16
- `kernel_size`: 3
- `padding`: 1

In [393]:
embedding_dim = 128
context_window_size = 16
kernel_size = 3
padding = 1

In [394]:
model = BengioLMHighwayDropoutWithCNN(vocab_size=vocab_size, context_window_size=context_window_size,
                                      embedding_dim=embedding_dim, dropout=0.10,
                                      kernel_size=kernel_size, padding=padding)

In [395]:
model.to(device)

BengioLMHighwayDropoutWithCNN(
  (embedding_lookup_table): Embedding(50257, 128)
  (cnn): ConvNet(
    (conv): Conv1d(128, 128, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
  (dropout1): Dropout(p=0.1, inplace=False)
  (highway): HighwayNetworks(
    (transformed_signal_network): ModuleList(
      (0): Linear(in_features=2048, out_features=2048, bias=True)
    )
    (transformed_signal_bn): ModuleList(
      (0): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (transform_gate_network): ModuleList(
      (0): Linear(in_features=2048, out_features=2048, bias=True)
    )
    (transform_gate_bn): ModuleList(
      (0): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (dropout2): Dropout(p=0.1, inplace=False)
  (linear): Linear(in_features=128, out_features=50257, bias=True)
)

In [396]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters in model: {total_params}")

Total parameters in model: 21370321


In [397]:
base_lr = 5e-3
# max_lr = 5e-3
# min_lr = max_lr * 0.05
# warmup_steps = 500
# max_steps = 1000

optimizer = optim.AdamW(model.parameters(), lr=base_lr)
scheduler = None
# scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch: cosine_scheduler(epoch, min_lr=min_lr, max_lr=max_lr, warmup_steps=warmup_steps, max_steps=max_steps, base_lr=base_lr))

metrics1 = train(model, train_tokens, val_tokens, batch_size=4096, num_epochs=2,
                 context_window_size=context_window_size, optimizer=optimizer, scheduler=scheduler)

Start of training: Train Loss: 10.8249, Val Loss: 10.8249
Epoch 1: Train Loss: 6.8894, Val Loss: 6.4714
Epoch 2: Train Loss: 5.7021, Val Loss: 6.2511


In [398]:
train_loss, train_entropy, train_perplexity = get_metrics(model, train_tokens, context_window_size)
print(f'Train Metrics: Train Loss: {train_loss:.4f}, Train Entropy: {train_entropy:.4f}, Train Perplexity: {train_perplexity:.2f}')

Train Metrics: Train Loss: 4.9450, Train Entropy: 7.1341, Train Perplexity: 140.47
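
The three reported numbers are consistent with entropy being the loss converted from nats to bits and perplexity being the exponential of the loss. The implementation of `get_metrics` is not shown here, so the sketch below is an assumption inferred from the printed values; `metrics_from_loss` is a hypothetical helper.

```python
import math


def metrics_from_loss(loss_nats):
    """Derive entropy (bits per token) and perplexity from a mean cross-entropy in nats."""
    entropy_bits = loss_nats / math.log(2)  # nats -> bits
    perplexity = math.exp(loss_nats)        # equivalently 2 ** entropy_bits
    return entropy_bits, perplexity


entropy, perplexity = metrics_from_loss(4.9450)
print(f"Entropy: {entropy:.4f}, Perplexity: {perplexity:.2f}")
```

Because perplexity grows exponentially with the loss, small loss differences between runs translate into much larger perplexity gaps.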


In [399]:
val_loss, val_entropy, val_perplexity = get_metrics(model, val_tokens, context_window_size)
print(f'Val Metrics: Val Loss: {val_loss:.4f}, Val Entropy: {val_entropy:.4f}, Val Perplexity: {val_perplexity:.2f}')

Val Metrics: Val Loss: 6.2511, Val Entropy: 9.0184, Val Perplexity: 518.58


In [400]:
test_loss, test_entropy, test_perplexity = get_metrics(model, test_tokens, context_window_size)
print(f'Test Metrics: Test Loss: {test_loss:.4f}, Test Entropy: {test_entropy:.4f}, Test Perplexity: {test_perplexity:.2f}')

Test Metrics: Test Loss: 6.4726, Test Entropy: 9.3380, Test Perplexity: 647.17


In [401]:
ts_loss, ts_entropy, ts_perplexity = get_metrics(model, ts_tokens, context_window_size)
print(f'Tiny Shakespeare Metrics: Test Loss: {ts_loss:.4f}, Test Entropy: {ts_entropy:.4f}, Test Perplexity: {ts_perplexity:.2f}')

Tiny Shakespeare Metrics: Test Loss: 7.2689, Test Entropy: 10.4869, Test Perplexity: 1435.03


In [402]:
torch.save(model.state_dict(), "bengio_highway_no_weight_share_dropout_cnn_small.pt")

## Exp 5: Paper Model (Large)
- `embedding_dim`: 256
- `window_size`: 16
- `kernel_size`: 3
- `padding`: 1

In [408]:
embedding_dim = 256
context_window_size = 16
kernel_size = 3
padding = 1

In [409]:
model = BengioLMHighwayDropoutWithCNN(vocab_size=vocab_size, context_window_size=context_window_size,
                                      embedding_dim=embedding_dim, dropout=0.10,
                                      kernel_size=kernel_size, padding=padding)

In [410]:
model.to(device)

BengioLMHighwayDropoutWithCNN(
  (embedding_lookup_table): Embedding(50257, 256)
  (cnn): ConvNet(
    (conv): Conv1d(256, 256, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn): BatchNorm1d(4096, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
  (dropout1): Dropout(p=0.1, inplace=False)
  (highway): HighwayNetworks(
    (transformed_signal_network): ModuleList(
      (0): Linear(in_features=4096, out_features=4096, bias=True)
    )
    (transformed_signal_bn): ModuleList(
      (0): BatchNorm1d(4096, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (transform_gate_network): ModuleList(
      (0): Linear(in_features=4096, out_features=4096, bias=True)
    )
    (transform_gate_bn): ModuleList(
      (0): BatchNorm1d(4096, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (dropout2): Dropout(p=0.1, inplace=False)
  (linear): Linear(in_features=256, out_features=50257, bias=True)
)

In [411]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters in model: {total_params}")

Total parameters in model: 59565905


In [412]:
base_lr = 5e-3
# max_lr = 5e-3
# min_lr = max_lr * 0.05
# warmup_steps = 500
# max_steps = 1000

optimizer = optim.AdamW(model.parameters(), lr=base_lr)
scheduler = None
# scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch: cosine_scheduler(epoch, min_lr=min_lr, max_lr=max_lr, warmup_steps=warmup_steps, max_steps=max_steps, base_lr=base_lr))

metrics1 = train(model, train_tokens, val_tokens, batch_size=4096, num_epochs=2,
                 context_window_size=context_window_size, optimizer=optimizer, scheduler=scheduler)

Start of training: Train Loss: 10.8249, Val Loss: 10.8249
Epoch 1: Train Loss: 6.7934, Val Loss: 6.4226
Epoch 2: Train Loss: 5.5398, Val Loss: 6.2186


In [413]:
train_loss, train_entropy, train_perplexity = get_metrics(model, train_tokens, context_window_size)
print(f'Train Metrics: Train Loss: {train_loss:.4f}, Train Entropy: {train_entropy:.4f}, Train Perplexity: {train_perplexity:.2f}')

Train Metrics: Train Loss: 4.5902, Train Entropy: 6.6223, Train Perplexity: 98.52


In [414]:
val_loss, val_entropy, val_perplexity = get_metrics(model, val_tokens, context_window_size)
print(f'Val Metrics: Val Loss: {val_loss:.4f}, Val Entropy: {val_entropy:.4f}, Val Perplexity: {val_perplexity:.2f}')

Val Metrics: Val Loss: 6.2186, Val Entropy: 8.9715, Val Perplexity: 501.98


In [415]:
test_loss, test_entropy, test_perplexity = get_metrics(model, test_tokens, context_window_size)
print(f'Test Metrics: Test Loss: {test_loss:.4f}, Test Entropy: {test_entropy:.4f}, Test Perplexity: {test_perplexity:.2f}')

Test Metrics: Test Loss: 6.4400, Test Entropy: 9.2910, Test Perplexity: 626.42


In [416]:
ts_loss, ts_entropy, ts_perplexity = get_metrics(model, ts_tokens, context_window_size)
print(f'Tiny Shakespeare Metrics: Test Loss: {ts_loss:.4f}, Test Entropy: {ts_entropy:.4f}, Test Perplexity: {ts_perplexity:.2f}')

Tiny Shakespeare Metrics: Test Loss: 7.2392, Test Entropy: 10.4440, Test Perplexity: 1392.99


In [417]:
torch.save(model.state_dict(), "bengio_highway_no_weight_share_dropout_cnn_large.pt")