<a href="https://colab.research.google.com/github/szhou12/gpt-from-scratch/blob/main/pytorch_funcs_review.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# torch.nn.Embedding(num_embeddings, embedding_dim)
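- `nn.Embedding(n, d)`: an Embedding module (lookup table) containing `n` tensors of size `d`.
    - Each of the `n` categories is represented by a dense vector of length `d`.
    - e.g. in NLP: `n` is the number of distinct words in the corpus, and each word is represented by a dense vector of length `d`.
- Use `.weight` to inspect the contents of the embedding table.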
- Official Doc: https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html
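
As a minimal sketch of what the lookup does, assuming a small table with `n=5` and `d=2` (both sizes are chosen only for illustration): calling the module on integer indices returns the same rows you would get by indexing `.weight` directly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so repeated runs build the same table

# Hypothetical sizes chosen only for illustration: n=5 categories, d=2 dims
emb = nn.Embedding(5, 2)
idx = torch.tensor([3, 0, 3])  # indices must be integer typed

looked_up = emb(idx)           # shape (3, 2): one row per index
by_indexing = emb.weight[idx]  # plain row indexing into the table

print(looked_up.shape)                      # torch.Size([3, 2])
print(torch.equal(looked_up, by_indexing))  # True
```

Repeated indices (here `3` twice) simply return the same row twice.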

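## Usage summary
1. Declare phase
    - Declare an embedding table: `token_embedding_table = nn.Embedding(vocab_size, n_embd)`
    - `vocab_size`: the number of distinct words/tokens that appear in the corpus/text. Think of it as the size of the "dictionary".
    - `n_embd`: each distinct word/token is mapped to a vector of length `n_embd`, which is what makes numerical computation on tokens possible.
2. Use phase
    - Embed the incoming data: `token_embd = token_embedding_table(Xb)`
    - `Xb`: assume shape `(B, T)`, i.e. a batch of B texts with T words/tokens each. `Xb[i][j]` is the index of the word/token at position j of the i-th text.
    - The word/token at any position `(i, j)` is an index into the "dictionary" of `vocab_size` entries. Embedding simply looks up the corresponding vector for the token at each position.
    - The resulting `token_embd` therefore has shape `(B, T, n_embd)`: every word/token in `Xb` is replaced by its vector.
3. Addition
    - `tok_emb + pos_emb`: `(B, T, C) + (T, C)` → `(T, C)` is broadcast to `(B, T, C)` → result is `(B, T, C)`, where `C = n_embd`
    - Equivalent to making B copies of `pos_emb` and then adding element-wise.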
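
The broadcasting rule in step 3 can be checked against an explicit copy. This is a sketch with hypothetical sizes `B=2, T=4, C=3` and random tensors standing in for the real embeddings:

```python
import torch

B, T, C = 2, 4, 3  # hypothetical sizes, C plays the role of n_embd
tok_emb = torch.randn(B, T, C)
pos_emb = torch.randn(T, C)

# Implicit broadcasting: (B, T, C) + (T, C) -> (B, T, C)
x = tok_emb + pos_emb

# Explicit version: add a leading batch dim, then repeat it B times
pos_copied = pos_emb.unsqueeze(0).repeat(B, 1, 1)  # (B, T, C)
x_explicit = tok_emb + pos_copied

print(x.shape)                     # torch.Size([2, 4, 3])
print(torch.equal(x, x_explicit))  # True
```

Broadcasting gives the same result without materializing the B copies in memory.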

In [None]:
## Declare Phase
# an Embedding module containing vocab_size=10 tensors of size n_embd=3
token_embedding_table = nn.Embedding(10, 3)
print("-----embedding lookup table-----")
print(token_embedding_table.weight)

## Use Phase
# a batch of B=2 samples of T=4 indices each
idx = torch.tensor([[1, 2, 4, 5],
                    [4, 3, 2, 9]], dtype=torch.long)  # avoid shadowing the built-in `input`
tok_emb = token_embedding_table(idx)

print("-----actually embed data-----")
print(tok_emb)
print("-----shape of data after embed-----")
print(tok_emb.shape)

-----embedding lookup table-----
Parameter containing:
tensor([[ 1.7884,  0.1894, -1.3711],
        [-0.4990,  0.4098, -0.4139],
        [-2.3013, -0.9812,  1.6015],
        [-0.5372, -1.2943, -1.2302],
        [-0.7367,  0.3864, -0.5681],
        [ 0.7724,  1.7569, -1.6473],
        [ 0.5956, -0.3517, -0.3045],
        [-1.1574, -0.5102,  0.6259],
        [ 1.4290, -0.8259,  0.9965],
        [ 0.5158,  0.9047,  1.8139]], requires_grad=True)
-----actually embed data-----
tensor([[[-0.4990,  0.4098, -0.4139],
         [-2.3013, -0.9812,  1.6015],
         [-0.7367,  0.3864, -0.5681],
         [ 0.7724,  1.7569, -1.6473]],

        [[-0.7367,  0.3864, -0.5681],
         [-0.5372, -1.2943, -1.2302],
         [-2.3013, -0.9812,  1.6015],
         [ 0.5158,  0.9047,  1.8139]]], grad_fn=<EmbeddingBackward0>)
-----shape of data after embed-----
torch.Size([2, 4, 3])
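
The table only has rows for indices `0` to `num_embeddings - 1`. As a hedged sketch (the sizes match the cell above, and the table is kept on the CPU, where PyTorch reports this as an `IndexError`), looking up an index outside that range fails rather than returning a vector:

```python
import torch
import torch.nn as nn

table = nn.Embedding(10, 3)   # valid indices are 0..9
bad_idx = torch.tensor([10])  # one past the last valid row

try:
    table(bad_idx)
    raised = False
except IndexError as err:
    raised = True
    print("lookup failed:", err)
```

This is why `vocab_size` must cover every token index that can appear in the input.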


In [None]:
## Declare Phase
# T=4, n_embd=3
position_embedding_table = nn.Embedding(4, 3)

## Use Phase
# map positions 0, 1, 2, 3 to one size-3 vector each
pos_emb = position_embedding_table(torch.arange(4))

print("-----actually embed positions-----")
print(pos_emb)
print("-----shape of positions after embed-----")
print(pos_emb.shape)

-----actually embed positions-----
tensor([[ 0.0659, -0.7604, -0.6842],
        [-0.2087, -0.4915, -1.2146],
        [ 0.2580,  2.1900,  0.2527],
        [-1.1342, -0.3356, -0.3737]], grad_fn=<EmbeddingBackward0>)
-----shape of positions after embed-----
torch.Size([4, 3])


In [None]:
## Addition of token and position embeddings
# pos_emb is broadcast across the batch dimension: (B, T, n_embd) + (T, n_embd)
x = tok_emb + pos_emb
print("-----(B, T, n_embd) + (T, n_embd)-----")
print(x)
print("-----shape of (B, T, n_embd) + (T, n_embd)-----")
print(x.shape)

-----(B, T, n_embd) + (T, n_embd)-----
tensor([[[-1.3533,  1.6135,  2.2122],
         [-3.4185, -2.3436,  2.5595],
         [-0.7281, -0.1098, -0.0819],
         [ 1.1439,  0.9010, -1.5603]],

        [[-1.5911,  1.5901,  2.0580],
         [-1.6545, -2.6567, -0.2722],
         [-2.2926, -1.4774,  2.0877],
         [ 0.8873,  0.0487,  1.9009]]], grad_fn=<AddBackward0>)
-----shape of (B, T, n_embd) + (T, n_embd)-----
torch.Size([2, 4, 3])


# .to(device)
- Moves the model's parameters (weights and biases) and buffers to the specified computing device (e.g. a GPU).
- For `nn.Module` objects this is an in-place operation: `model` itself is moved, and `.to()` returns the same module. `Tensor.to`, by contrast, is not in-place, so the result must be assigned, e.g. `x = x.to(device)`.
- It is common to write `m = model.to(device)`, but `m` is just a second reference to the same object as `model`, so both names refer to the module on the target device. If the separate name `m` is not needed later, `model.to(device)` alone is enough.
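
The difference between modules and tensors can be checked directly. The sketch below uses a dtype conversion instead of a device move so that it runs without a GPU; that substitution is an assumption made for portability, and `.to(device)` follows the same pattern.

```python
import torch
import torch.nn as nn

layer = nn.Linear(10, 5)
returned = layer.to(torch.float64)  # nn.Module.to: converts parameters in place

print(returned is layer)            # True: the same object comes back
print(layer.weight.dtype)           # torch.float64

t = torch.zeros(3, dtype=torch.float32)
t2 = t.to(torch.float64)            # Tensor.to: returns a converted copy

print(t2 is t)                      # False
print(t.dtype, t2.dtype)            # torch.float32 torch.float64
```

Forgetting to assign the result of `Tensor.to` is a common source of device-mismatch errors, since the original tensor stays where it was.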

In [None]:
# Runtime -> Change runtime type -> select 'T4 GPU' to use 'cuda'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Example model
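class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 5)  # A simple linear layer: 10 inputs -> 5 outputs

    def forward(self, x):
        # Apply the linear layer so the module is callable
        return self.linear(x)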

model = SimpleModel()
m = model.to(device)

# Check: Iterate through all parameters in 'model' and print their device
for name, param in model.named_parameters():
    print(f"model: {name} is on {param.device}")

# Check: Iterate through all parameters in 'm' and print their device
for name2, param2 in m.named_parameters():
    print(f"m: {name2} is on {param2.device}")

model: linear.weight is on cuda:0
model: linear.bias is on cuda:0
m: linear.weight is on cuda:0
m: linear.bias is on cuda:0
