<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
<br>Chinese translation repository: <a href="https://github.com/GoatCsu/CN-LLMs-from-scratch.git">https://github.com/GoatCsu/CN-LLMs-from-scratch.git</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>


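# FLOPS Analysis

- FLOPs (floating-point operations) measure the computational complexity of a neural network by counting the floating-point operations it executes. The related term FLOPS (floating-point operations per second) describes hardware throughput; the code in this notebook prints operation counts under the label FLOPS.
- A higher FLOP count means more intensive computation and higher energy consumption.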
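To make the counting concrete, here is a minimal sketch for a single linear layer. It assumes the usual convention that one multiply-accumulate (MAC) counts as two FLOPs and ignores the bias term; `linear_layer_flops` is a hypothetical helper, not part of `thop` or the book's code.

```python
def linear_layer_flops(num_tokens, in_features, out_features):
    """Approximate FLOPs of a bias-free linear layer applied to num_tokens inputs."""
    # Each output element needs in_features multiply-accumulate operations
    macs = num_tokens * in_features * out_features
    # Convention: one MAC = one multiplication + one addition = 2 FLOPs
    return 2 * macs


# One token passing through a 768 -> 3072 projection
flops = linear_layer_flops(num_tokens=1, in_features=768, out_features=4 * 768)
print(f"{flops:.1e} FLOPs")  # 4.7e+06 FLOPs
```

The count grows linearly with the number of tokens, which is why the totals reported later scale with batch size.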

In [None]:
# pip install -r requirements-extra.txt

In [None]:
from importlib.metadata import version

pkgs = [
    "thop",
    "torch",
]
for p in pkgs:
    print(f"{p} version: {version(p)}")

thop version: 0.1.1-2209072238
torch version: 2.4.1+cu121


&nbsp;
# Simple benchmark with fixed batch size

- Forward pass only

In [None]:
import torch
from thop import profile

from previous_chapters import GPTModel

# Settings shared by all model sizes
BASE_CONFIG = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "drop_rate": 0.0,        # Dropout rate
    "qkv_bias": True         # Query-key-value bias
}

# Configurations for the different GPT model sizes
model_configs = {
    "gpt-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

# Use the GPU if available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

batch_size = 2
# Random token IDs of shape (batch_size, context_length)
input_tensor = torch.randint(0, 50257, (batch_size, 1024)).to(device)

for size in model_configs:
    # Merge the size-specific settings into the base config
    BASE_CONFIG.update(model_configs[size])

    # Instantiate the model in bfloat16
    model = GPTModel(BASE_CONFIG).bfloat16()
    model.to(device)

    # MACS = multiply-accumulate operations
    # MACS are typically counted as two FLOPS (one multiply and one accumulate)
    macs, params = profile(model, inputs=(input_tensor,), verbose=False)
    flops = 2*macs
    print(f"{size:18}: {flops:.1e} FLOPS")

    # Free the model and release cached GPU memory
    del model
    torch.cuda.empty_cache()

gpt-small (124M)  : 5.1e+11 FLOPS
gpt-medium (355M) : 1.4e+12 FLOPS
gpt-large (774M)  : 3.2e+12 FLOPS
gpt-xl (1558M)    : 6.4e+12 FLOPS
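
As a rough cross-check on these numbers, a common rule of thumb (an assumption here, not something `thop` computes) puts the forward pass at about 2 FLOPs per parameter per token. The sketch below applies it to the nominal parameter counts from the model names; `approx_forward_flops` is a hypothetical helper, and the `reported` values are copied from the output above.

```python
def approx_forward_flops(n_params, batch_size, context_length):
    """Rule-of-thumb estimate: ~2 FLOPs per parameter per token (forward only)."""
    return 2 * n_params * batch_size * context_length


# Nominal parameter counts taken from the model names
nominal_params = {
    "gpt-small (124M)": 124e6,
    "gpt-medium (355M)": 355e6,
    "gpt-large (774M)": 774e6,
    "gpt-xl (1558M)": 1558e6,
}

# FLOP counts printed by the benchmark above (batch size 2, 1024 tokens each)
reported = {
    "gpt-small (124M)": 5.1e11,
    "gpt-medium (355M)": 1.4e12,
    "gpt-large (774M)": 3.2e12,
    "gpt-xl (1558M)": 6.4e12,
}

estimates = {}
for name, n_params in nominal_params.items():
    estimates[name] = approx_forward_flops(n_params, batch_size=2, context_length=1024)
    rel_diff = abs(estimates[name] - reported[name]) / reported[name]
    print(f"{name:18}: estimate {estimates[name]:.2e}, reported {reported[name]:.1e}, "
          f"difference {rel_diff:.1%}")
```

The estimates land within a few percent of the profiled values. This is only a sanity check, since the rule of thumb does not model individual layers.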


&nbsp;
# Simple benchmark with automatic batch size finding

- Forward pass only

In [None]:
for size in model_configs:
    print(f"\nProcessing {size}")
    config = BASE_CONFIG.copy()
    config.update(model_configs[size])

    min_batch_size = 1
    max_batch_size = None  # Largest batch size that has succeeded so far
    max_possible_batch_size = 4096  # Upper bound for the search

    # Binary search for the largest batch size that fits in memory
    while min_batch_size <= max_possible_batch_size:
        batch_size = (min_batch_size + max_possible_batch_size) // 2
        try:
            # Random token IDs of shape (batch_size, context_length)
            input_tensor = torch.randint(
                0, config["vocab_size"],
                (batch_size, config["context_length"]),
                device=device
            )

            # Instantiate the model in bfloat16
            model = GPTModel(config).bfloat16().to(device)

            # MACS = multiply-accumulate operations
            # MACS are typically counted as two FLOPS (one multiply and one accumulate)
            macs, params = profile(model, inputs=(input_tensor,), verbose=False)
            flops = 2 * macs
            print(f"  Batch size {batch_size}: {flops:.1e} FLOPS")

            # If successful, try a larger batch size
            min_batch_size = batch_size + 1
            max_batch_size = batch_size

            # Clean up
            del model, input_tensor
            torch.cuda.empty_cache()

        except RuntimeError as e:
            if "out of memory" in str(e):
                # Out of memory: try a smaller batch size
                max_possible_batch_size = batch_size - 1

                # Clean up (the names may not exist if allocation failed early)
                try:
                    del model, input_tensor
                    torch.cuda.empty_cache()
                except NameError:
                    pass
            else:
                raise e  # Re-raise any other error


Processing gpt-small (124M)
  Batch size 256: 6.5e+13 FLOPS
  Batch size 384: 9.7e+13 FLOPS
  Batch size 388: 9.8e+13 FLOPS
  Batch size 389: 9.8e+13 FLOPS

Processing gpt-medium (355M)
  Batch size 256: 1.9e+14 FLOPS
  Batch size 260: 1.9e+14 FLOPS
  Batch size 262: 1.9e+14 FLOPS
  Batch size 263: 1.9e+14 FLOPS

Processing gpt-large (774M)
  Batch size 256: 4.0e+14 FLOPS

Processing gpt-xl (1558M)
  Batch size 128: 4.1e+14 FLOPS
  Batch size 136: 4.3e+14 FLOPS
  Batch size 140: 4.5e+14 FLOPS
  Batch size 142: 4.5e+14 FLOPS
  Batch size 143: 4.6e+14 FLOPS
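
The search pattern itself can be isolated from the GPU code. In the sketch below, a hypothetical `fits` predicate stands in for "build the model, run it, and catch the out-of-memory error"; `find_max_batch_size` is an illustrative name, not part of the notebook.

```python
def find_max_batch_size(fits, max_possible_batch_size=4096):
    """Binary search for the largest batch size for which fits(batch_size) is True.

    Returns the largest successful batch size (or None) and the list of
    batch sizes that succeeded along the way.
    """
    min_batch_size = 1
    max_batch_size = None
    successes = []
    while min_batch_size <= max_possible_batch_size:
        batch_size = (min_batch_size + max_possible_batch_size) // 2
        if fits(batch_size):
            # Success: remember it and search the upper half
            successes.append(batch_size)
            max_batch_size = batch_size
            min_batch_size = batch_size + 1
        else:
            # Failure (e.g. out of memory): search the lower half
            max_possible_batch_size = batch_size - 1
    return max_batch_size, successes


# Pretend that every batch size up to 389 fits in memory
best, tried = find_max_batch_size(lambda b: b <= 389)
print(best, tried)  # 389 [256, 384, 388, 389]
```

With a limit of 389, the successful probes are exactly the batch sizes printed for gpt-small above. Failed probes produce no output, which is why gpt-large shows a single line.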


&nbsp;
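# Benchmark with automatic batch size finding and Model FLOP Utilization (MFU)

- Model FLOPs Utilization (MFU) explanation from the PaLM paper

> We propose a new metric for efficiency that is implementation-independent and permits a cleaner comparison of system efficiency, called model FLOPs utilization (MFU). This is the ratio of the observed throughput (tokens-per-second) relative to the theoretical maximum throughput of a system operating at peak FLOPs. Crucially, the "theoretical maximum" throughput only accounts for the required operations to compute the forward+backward passes, and not rematerialization.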

$$\text{MFU} = \frac{\text{Observed Tokens per Second}}{\text{Theoretical Max Tokens per Second}}$$

where

$$\text{Theoretical Max Tokens per Second} = \frac{\text{Max FLOPs per Second}}{\text{Total FLOPs per Token}}$$

and

$$\text{Tokens per Second} = \frac{\text{Batch Size} \times \text{Sequence Length}}{\text{Total Time}}$$
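
The three formulas can be combined into one small function. The sketch below uses illustrative inputs only: the timing, the FLOPs-per-token figure, and the peak throughput are assumed values chosen for the example, and `model_flops_utilization` is a hypothetical helper.

```python
def model_flops_utilization(batch_size, seq_len, total_time_s,
                            flops_per_token, peak_flops_per_s):
    """Compute MFU from the three formulas above."""
    tokens_per_second = batch_size * seq_len / total_time_s
    theoretical_max_tokens_per_second = peak_flops_per_s / flops_per_token
    return tokens_per_second / theoretical_max_tokens_per_second


# Illustrative (assumed) inputs
forward_flops_per_token = 3.125e9              # forward-pass FLOPs per token
flops_per_token = 3 * forward_flops_per_token  # backward estimated as 2x forward
peak = 77.97e12                                # assumed peak FLOPs per second

mfu = model_flops_utilization(
    batch_size=4, seq_len=1024, total_time_s=0.6,
    flops_per_token=flops_per_token, peak_flops_per_s=peak,
)
print(f"MFU: {mfu:.2f}")  # MFU: 0.82
```

Algebraically this reduces to total FLOPs divided by (elapsed time × peak FLOPs per second), so MFU can also be read as the fraction of the hardware's peak compute the run actually used.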

- Forward and backward pass

In [None]:
flops_per_second = {
    # https://www.techpowerup.com/gpu-specs/h100-pcie-80-gb.c3899
    "H100": {
        torch.float32: 51.22e12,  # 51.22 TFLOPs for FP32 on NVIDIA H100
        torch.float16: 204.9e12,  # 204.9 TFLOPs for FP16 on NVIDIA H100
        torch.bfloat16: 204.9e12
    },
    # https://www.techpowerup.com/gpu-specs/l4.c4091
    "L4": {
        torch.float32: 30.29e12,  # 30.29 TFLOPs for FP32 on NVIDIA L4
        torch.float16: 30.29e12,  # 30.29 TFLOPs for FP16 on NVIDIA L4
        torch.bfloat16: 30.29e12
    },
    # https://www.techpowerup.com/gpu-specs/tesla-t4.c3316
    "T4": {
        torch.float32: 8.1e12,  # 8.1 TFLOPs for FP32 on NVIDIA T4
        torch.float16: 65.13e12,  # 65.13 TFLOPs for FP16 on NVIDIA T4
        torch.bfloat16: 65.13e12
    },
    # https://www.techpowerup.com/gpu-specs/a10g.c3798
    "A10G": {
        torch.float32: 31.52e12,  # 31.52 TFLOPs for FP32 on NVIDIA A10G
        torch.float16: 31.52e12,  # 31.52 TFLOPs for FP16 on NVIDIA A10G
        torch.bfloat16: 31.52e12
    },
    # https://www.techpowerup.com/gpu-specs/a100-pcie-40-gb.c3623
    "A100": {
        torch.float32: 19.49e12,  # 19.49 TFLOPs for FP32 on NVIDIA A100
        torch.float16: 77.97e12,  # 77.97 TFLOPs for FP16 on NVIDIA A100
        torch.bfloat16: 77.97e12
    },
    # https://www.techpowerup.com/gpu-specs/geforce-rtx-3080.c3621
    "RTX_3080": {
        torch.float32: 29.77e12,  # 29.77 TFLOPs for FP32 on NVIDIA RTX 3080
        torch.float16: 29.77e12,  # 29.77 TFLOPs for FP16 on NVIDIA RTX 3080
        torch.bfloat16: 29.77e12
    },
    # https://www.techpowerup.com/gpu-specs/geforce-rtx-3090.c3622
    "RTX_3090": {
        torch.float32: 35.58e12,  # 35.58 TFLOPs for FP32 on NVIDIA RTX 3090
        torch.float16: 35.58e12,  # 35.58 TFLOPs for FP16 on NVIDIA RTX 3090
        torch.bfloat16: 35.58e12
    }
}

In [None]:
import time

def get_gpu_model(flops_per_second_dict):
    """Return the key of flops_per_second_dict that matches the current GPU."""
    device_name = torch.cuda.get_device_name(0)
    for model in flops_per_second_dict.keys():
        # Keys such as "RTX_3090" use an underscore, whereas reported device
        # names typically use a space, so normalize the key before matching
        if model.replace("_", " ") in device_name:
            return model
    return "Unknown"  # No matching entry in the dictionary

gpu_model = get_gpu_model(flops_per_second)
print("GPU Model:", gpu_model)

# Only run the benchmark if the GPU's peak FLOPs are known
if gpu_model != "Unknown":

    for size in model_configs:
        print(f"\nProcessing {size}")
        config = BASE_CONFIG.copy()
        config.update(model_configs[size])

        min_batch_size = 1
        max_batch_size = None  # Largest batch size that has succeeded so far
        max_possible_batch_size = 4096  # Upper bound for the search

        # Binary search for the largest batch size that fits in memory
        while min_batch_size <= max_possible_batch_size:
            batch_size = (min_batch_size + max_possible_batch_size) // 2

            try:
                # Random token IDs of shape (batch_size, context_length)
                input_tensor = torch.randint(
                    0, config["vocab_size"],
                    (batch_size, config["context_length"]),
                    device=device
                )

                # Instantiate the model in bfloat16 and move it to the device
                model = GPTModel(config).bfloat16().to(device)
                model.train()

                # Start timing after all pending CUDA work has finished
                torch.cuda.synchronize()
                start_time = time.time()

                # Forward and backward pass
                output = model(input_tensor)
                loss = output.sum()  # Dummy loss
                loss.backward()

                # Stop timing once the GPU has finished the passes
                torch.cuda.synchronize()
                end_time = time.time()

                total_time_seconds = end_time - start_time
                # Calculate FLOPs for the forward pass
                macs, params = profile(model, inputs=(input_tensor,), verbose=False)
                flops_forward = 2 * macs  # Assuming one MAC equals two FLOPs

                # Estimate the backward pass as roughly 2x the forward pass
                flops_backward = 2 * flops_forward

                # Total FLOPs for forward + backward passes
                total_flops = flops_forward + flops_backward  # Or total_flops = flops_forward * 3

                # Look up the peak FLOPs per second for the parameter dtype
                data_type = next(model.parameters()).dtype
                max_flops_per_second = flops_per_second[gpu_model].get(data_type, 0)

                # Observed throughput in tokens per second
                tokens_processed = batch_size * config["context_length"]
                tokens_per_second = tokens_processed / total_time_seconds

                # FLOPs per token
                flops_per_token = total_flops / tokens_processed

                # Theoretical maximum tokens per second
                if flops_per_token > 0:
                    theoretical_max_tokens_per_second = max_flops_per_second / flops_per_token
                else:
                    theoretical_max_tokens_per_second = 0  # Avoid division by zero

                # Model FLOPs utilization (MFU)
                if theoretical_max_tokens_per_second > 0:
                    mfu = tokens_per_second / theoretical_max_tokens_per_second
                else:
                    mfu = 0  # Avoid division by zero

                print(f"  Batch size {batch_size}: Tokens/sec: {tokens_per_second:.2f}, MFU: {mfu:.4f}")

                # If successful, try a larger batch size
                min_batch_size = batch_size + 1
                max_batch_size = batch_size

                # Clean up
                del model, input_tensor, output, loss
                torch.cuda.empty_cache()

            except RuntimeError as e:
                if "out of memory" in str(e).lower():
                    # Out of memory: try a smaller batch size
                    max_possible_batch_size = batch_size - 1

                    # Clean up (the names may not exist if allocation failed early)
                    try:
                        del model, input_tensor
                        torch.cuda.empty_cache()
                    except NameError:
                        pass
                else:
                    raise e  # Re-raise any other error

# The GPU was not found in the flops_per_second dictionary
else:
    print("Unknown GPU model. Please update the flops_per_second dictionary with your GPU information.")

GPU Model: A100

Processing gpt-small (124M)
  Batch size 16: Tokens/sec: 34248.82, MFU: 0.3256
  Batch size 24: Tokens/sec: 62568.34, MFU: 0.5948

Processing gpt-medium (355M)
  Batch size 4: Tokens/sec: 20159.93, MFU: 0.5483
  Batch size 6: Tokens/sec: 21717.66, MFU: 0.5907
  Batch size 7: Tokens/sec: 22536.25, MFU: 0.6130

Processing gpt-large (774M)
  Batch size 8: Tokens/sec: 12465.21, MFU: 0.7406

Processing gpt-xl (1558M)
  Batch size 4: Tokens/sec: 6779.92, MFU: 0.8113


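- A value of 1.0 is best (equal to 100%).
- Note that the batch sizes are smaller than before because this benchmark also runs the backward pass, which requires more memory.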