<h1>Chapter 3 - Looking Inside Transformer LLMs</h1>
<i>An extensive look into the transformer architecture of generative LLMs</i>

<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961"><img src="https://img.shields.io/badge/Buy%20the%20Book!-grey?logo=amazon"></a>
<a href="https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/"><img src="https://img.shields.io/badge/O'Reilly-white.svg?logo=data:image/svg%2bxml;base64,PHN2ZyB3aWR0aD0iMzQiIGhlaWdodD0iMjciIHZpZXdCb3g9IjAgMCAzNCAyNyIgZmlsbD0ibm9uZSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KPGNpcmNsZSBjeD0iMTMiIGN5PSIxNCIgcj0iMTEiIHN0cm9rZT0iI0Q0MDEwMSIgc3Ryb2tlLXdpZHRoPSI0Ii8+CjxjaXJjbGUgY3g9IjMwLjUiIGN5PSIzLjUiIHI9IjMuNSIgZmlsbD0iI0Q0MDEwMSIvPgo8L3N2Zz4K"></a>
<a href="https://github.com/HandsOnLLM/Hands-On-Large-Language-Models"><img src="https://img.shields.io/badge/GitHub%20Repository-black?logo=github"></a>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter03/Chapter%203%20-%20Looking%20Inside%20LLMs.ipynb)

---

This notebook is for Chapter 3 of the [Hands-On Large Language Models](https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961) book by [Jay Alammar](https://www.linkedin.com/in/jalammar) and [Maarten Grootendorst](https://www.linkedin.com/in/mgrootendorst/).

---

<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961">
<img src="https://raw.githubusercontent.com/HandsOnLLM/Hands-On-Large-Language-Models/main/images/book_cover.png" width="350"/></a>

### [OPTIONAL] - Installing Packages on <img src="https://colab.google/static/images/icons/colab.png" width=100>

If you are viewing this notebook on Google Colab (or any other cloud vendor), you need to **uncomment and run** the following codeblock to install the dependencies for this chapter:

---

💡 **NOTE**: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---


In [None]:
# %%capture
# !pip install "transformers>=4.41.2" "accelerate>=0.31.0"

# Loading the LLM

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cpu",
    torch_dtype="auto",
    trust_remote_code=False,
)

# Create a pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=50,
    do_sample=False,
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


# The Inputs and Outputs of a Trained Transformer LLM


In [3]:
prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."

output = generator(prompt)

print(output[0]['generated_text'])

You are not running the flash-attention implementation, expect numerical differences.


 Mention the steps you're taking to prevent it in the future.

Dear Sarah,

I hope this message finds you well. I am writing to express my deepest apologies for the unfortunate incident that occurred in


In [6]:
output

[{'generated_text': " Mention the steps you're taking to prevent it in the future.\n\nDear Sarah,\n\nI hope this message finds you well. I am writing to express my deepest apologies for the unfortunate incident that occurred in"}]

In [4]:
print(model)

Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3Attention(
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
          (rotary_emb): Phi3RotaryEmbedding()
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLU()
        )
        (input_layernorm): Phi3RMSNorm()
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
        (post_attention_layernorm): Phi3RMSNorm()
      )
    )
    (norm): Phi3RMSNorm()
  )
  (lm_head): Linear(in_features=3072, out_features=3206
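
The layer shapes printed above are enough for a rough parameter count. The sketch below multiplies out each bias-free `Linear` layer. It rests on a few assumptions that the printout does not confirm: each `Phi3RMSNorm` holds one weight per hidden dimension, the embedding and `lm_head` weights are separate tensors, and `lm_head` maps 3072 features to the 32064-entry vocabulary. Dropout and rotary embeddings are treated as having no learnable weights.

```python
hidden, vocab, n_layers = 3072, 32064, 32

def linear_params(in_features, out_features):
    """Weight count of a Linear layer with bias=False."""
    return in_features * out_features

per_layer = (
    linear_params(hidden, 9216)      # self_attn.qkv_proj
    + linear_params(hidden, hidden)  # self_attn.o_proj
    + linear_params(hidden, 16384)   # mlp.gate_up_proj
    + linear_params(8192, hidden)    # mlp.down_proj
    + 2 * hidden                     # two RMSNorm weight vectors (assumed size)
)

embedding = vocab * hidden               # embed_tokens
lm_head = linear_params(hidden, vocab)   # assumed not tied to embed_tokens
final_norm = hidden                      # final RMSNorm (assumed size)

total = embedding + n_layers * per_layer + final_norm + lm_head
print(f"per layer: {per_layer:,}")
print(f"total:     {total / 1e9:.2f}B parameters")
```

Under these assumptions the total comes out at roughly 3.8 billion parameters, with the 32 decoder layers accounting for almost all of it.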

# Choosing a single token from the probability distribution (sampling / decoding)

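Output of step 2: model_output[0] (hidden states)
When you run model_output = model.model(input_ids), model.model (the Phi3Model) passes the input through all 32 Transformer layers.

Output: model_output[0] is a 3D tensor of shape (Batch Size, Sequence Length, Hidden Size).

Each 3072-dimensional Hidden Size vector is the model's final contextual representation of one input token.

It is neither a probability nor a token ID; it is a **high-dimensional, semantically rich vector of numbers**.

Purpose of step 3: lm_head_output (logits)
The lm_head is defined as Linear(in_features=3072, out_features=32064, bias=False).

Role of the LM head: its only job is to act as a translator, or projector.

It takes the 3072-dimensional hidden state vector (the model's internal representation of a token).

It projects that vector onto the 32064-dimensional vocabulary space.

Output: lm_head_output is the logits.

It is a 3D tensor of shape (Batch Size, Sequence Length, Vocabulary Size).

Along the Vocabulary Size dimension (32064), each value is the score (logit) the model assigns to the corresponding vocabulary token being the next token.

Note: logits have not yet been passed through softmax, so they are not normalized probabilities.

The lm_head must therefore be applied to the output of model.model (the hidden state vectors) to turn the model's semantic features into per-token prediction scores.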
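
The projection and the softmax step described above can be sketched on a toy scale. This is an illustration in plain Python, not the model's actual code: the hidden size is 3 instead of 3072, the vocabulary has 4 entries instead of 32064, and `hidden_state` and `lm_head_weight` are made-up numbers.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical stand-ins for one hidden state and the lm_head weight matrix.
hidden_state = [0.5, -1.0, 2.0]
lm_head_weight = [
    [0.1, 0.2, 0.3],
    [1.0, 0.0, 1.0],
    [-0.5, 0.5, 0.0],
    [0.0, -1.0, 0.5],
]

# A bias-free linear layer is a matrix-vector product:
# one dot product, and therefore one logit, per vocabulary entry.
logits = [sum(w * h for w, h in zip(row, hidden_state)) for row in lm_head_weight]

# Softmax rescales the logits into a probability distribution.
probs = softmax(logits)

# Softmax preserves the ordering, so the argmax is the same either way.
greedy_from_logits = max(range(len(logits)), key=logits.__getitem__)
greedy_from_probs = max(range(len(probs)), key=probs.__getitem__)
print(logits, probs, greedy_from_logits, greedy_from_probs)
```

Because softmax keeps the ranking of the scores, greedy decoding can take the argmax of the logits directly and skip the softmax.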

In [13]:
prompt = "The capital of France is"

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Move the input IDs to the same device as the model
input_ids = input_ids.to("cpu")

# Get the output of the model before the lm_head
model_output = model.model(input_ids)

# Get the output of the lm_head
lm_head_output = model.lm_head(model_output[0])

In [None]:
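# Extract the logits for the last position in the sequence
lm_head_output[0, -1]

# Find the highest-scoring prediction
# ...argmax(-1)
# argmax returns the index of the largest value.
# -1 means the search runs over the last dimension, the vocabulary dimension of length 32064.

# Result:
# lm_head_output[0, -1] is a 32064-dimensional vector (one score per vocabulary token). Calling argmax(-1) returns the ID of the token with the highest of those 32064 scores.
#
# This result, token_id, is the ID of the token the model considers most likely to follow the input "The capital of France is".
#
#
# Summary
# The two lines of code in the next cell:
#
# focus on the prediction scores at the end of the input sequence,
#
# pick the ID of the highest-scoring next token,
#
# and translate that ID back into text.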

In [14]:
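# Take the scores at the last position and pick the highest-scoring token ID.
# These are logits produced by lm_head, not probabilities; the ID still has to be decoded into text.
token_id = lm_head_output[0, -1].argmax(-1)
tokenizer.decode(token_id)
# lm_head_output has shape (1, 5, 32064): one row of scores for each of the five prompt tokens.
# Because this is a causal LM, only the last row of 32064 scores is needed to predict the next token:
# it is computed with the whole prompt "The capital of France is" as context.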

'Paris'

In [15]:
model_output[0].shape

torch.Size([1, 5, 3072])

In [16]:
lm_head_output.shape

torch.Size([1, 5, 32064])
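
Taking the `argmax` of the last row of logits, as done above, is greedy decoding: the same prompt always yields the same token. The alternative is to sample from the softmax distribution, optionally sharpened or flattened by a temperature. The sketch below contrasts the two in plain Python. `toy_logits` is a hypothetical 4-token vocabulary and `sample_next_token` is an illustrative helper, not a `transformers` function.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, rng=random):
    """Sample a token ID from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - shift) for x in scaled]
    # random.choices normalizes the weights, so no explicit division is needed.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

toy_logits = [0.45, 2.5, -0.75, 2.0]  # hypothetical scores, one per token ID

# Greedy decoding: always the single highest-scoring token.
greedy_id = max(range(len(toy_logits)), key=toy_logits.__getitem__)

# Sampling: draw many times to see how often each token ID is chosen.
rng = random.Random(0)
samples = [sample_next_token(toy_logits, temperature=1.0, rng=rng) for _ in range(1000)]
counts = [samples.count(i) for i in range(len(toy_logits))]

# A very low temperature concentrates nearly all probability on the top token.
cold_id = sample_next_token(toy_logits, temperature=0.01, rng=rng)

print("greedy:", greedy_id)
print("sample counts per token ID:", counts)
print("low-temperature sample:", cold_id)
```

With sampling, the runner-up token (ID 3 here) is picked a substantial share of the time, which is where variety in generated text comes from. Lowering the temperature moves the behaviour back toward greedy decoding.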

# Speeding up generation by caching keys and values


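|Property|use_cache=True (default)|use_cache=False|
|:----:|:----:|:----:|
|Function|Enables the KV cache (key/value cache).|Disables the KV cache.|
|Computation|Stores the previously computed K (key) and V (value) vectors. When generating a new token, only that token's K/V are computed and appended to the cache.|Recomputes the K and V vectors for the entire sequence at every generation step.|
|Inference speed|Much faster. The attention cost of each step grows linearly with sequence length (O(L)).|Much slower. The attention cost of each step grows quadratically with sequence length (O(L²)).|
|Memory usage|High. Extra memory is needed to store the cache (the K and V vectors).|Low. No cache is stored, but every step needs a large amount of temporary memory.|
|Result|Usually identical (as expected). With low precision (such as bfloat16), accumulated floating-point error can cause small differences.|Usually identical.|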
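
To make the difference in computation concrete, the sketch below counts how many per-token K/V projections a single attention layer performs while generating text. It is a bookkeeping simulation, not the `transformers` implementation: `count_kv_projections` is a hypothetical helper, and the prompt length of 20 tokens is an arbitrary choice.

```python
def count_kv_projections(prompt_len, new_tokens, use_cache):
    """Count per-token K/V projections in one attention layer during generation."""
    cached = 0            # number of tokens whose K/V are already stored
    total = 0
    seq_len = prompt_len  # tokens visible at the current generation step
    for _ in range(new_tokens):
        if use_cache:
            # Only tokens that are not in the cache yet need K/V computed.
            total += seq_len - cached
            cached = seq_len
        else:
            # K/V for the whole sequence are recomputed at every step.
            total += seq_len
        seq_len += 1      # the newly generated token joins the sequence
    return total

with_cache = count_kv_projections(prompt_len=20, new_tokens=100, use_cache=True)
without_cache = count_kv_projections(prompt_len=20, new_tokens=100, use_cache=False)
print(with_cache, without_cache)
```

With the cache, the prompt is processed once and every later step adds a single projection. Without it, each step redoes the work for all earlier tokens, so the total grows quadratically with the number of generated tokens.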

In [None]:
prompt = "Write a very long email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to("cpu")

In [None]:
%%timeit -n 1
# Generate the text
generation_output = model.generate(
  input_ids=input_ids,
  max_new_tokens=100,
  use_cache=True
)

6.66 s ± 2.22 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [18]:
%%timeit -n 1
# Generate the text
generation_output = model.generate(
  input_ids=input_ids,
  max_new_tokens=100,
  use_cache=False
)

2h 44min 35s ± 23min 30s per loop (mean ± std. dev. of 7 runs, 1 loop each)
