# A Complete Guide to Pretraining Llama 3 from Scratch: One File, Exploring Scaling Laws

Original article: https://zhuanlan.zhihu.com/p/706097271?utm_campaign=&utm_medium=social&utm_psn=1790423752451944448

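Inspired by Andrej Karpathy's four-hour video on pretraining GPT-2 from scratch, the author set out to replicate Llama 3 and pretrain a large language model from scratch.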

The code is open-sourced here:
https://github.com/hengjiUSTC/learn-llm/tree/main/pretrain

Contents:

1. Model evaluation
2. Model construction (RMSNorm, RotaryEmbedding, Attention, MLP, Block assembly)
3. Pretraining from scratch with the Distributed Data Parallel strategy



Another one:

[Llama3 model: rebuilt from scratch and trained with RLHF, a hands-on coding walkthrough](https://www.bilibili.com/video/BV1mZ421M7Wm/?spm_id_from=333.1007.top_right_bar_window_history.content.click&vd_source=427a8f6991c46f06262700ed0e9203dc)

https://github.com/lansinuote/Simple_RLHF_Llama3




In [7]:
import platform
import torch
import os
import psutil

# Python version
print(platform.python_version())

# Prefer CUDA, then Apple's MPS backend, then fall back to CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu")
print(f"device: {device}")

# CUDA and cuDNN versions this PyTorch build was compiled against ("None" if absent)
print(torch.version.cuda if torch.version.cuda else "None")
print(torch.backends.cudnn.version() if torch.backends.cudnn.is_available() else "None")


# On Linux, /proc/cpuinfo holds detailed CPU information.
# os.name == 'posix' is also true on macOS, which has no /proc/cpuinfo,
# so the check uses platform.system() instead.
# if platform.system() == 'Linux':
#     with open('/proc/cpuinfo') as f:
#         cpuinfo = f.read()
#     print(cpuinfo)  # detailed CPU information


# Get disk capacity
disk_partitions = psutil.disk_partitions()
for partition in disk_partitions:
    partition_usage = psutil.disk_usage(partition.mountpoint)
    print(f"Disk Device: {partition.device}, Total Size: {partition_usage.total / (1024 * 1024 * 1024):.2f} GB")

# Report the CUDA version only if a CUDA device is actually usable at runtime
print(torch.version.cuda if torch.cuda.is_available() else "CUDA not available")

# Get the current directory
current_directory = os.getcwd()
print(f"Current directory: {current_directory}")

3.10.14
device: mps
None
None
Disk Device: /dev/disk3s3s1, Total Size: 460.43 GB
Disk Device: /dev/disk3s6, Total Size: 460.43 GB
Disk Device: /dev/disk3s4, Total Size: 460.43 GB
Disk Device: /dev/disk3s2, Total Size: 460.43 GB
Disk Device: /dev/disk1s2, Total Size: 0.49 GB
Disk Device: /dev/disk1s1, Total Size: 0.49 GB
Disk Device: /dev/disk1s3, Total Size: 0.49 GB
Disk Device: /dev/disk3s1, Total Size: 460.43 GB
Disk Device: /Applications/微信读书.app/Wrapper, Total Size: 460.43 GB
Disk Device: /Applications/小宇宙.app/Wrapper, Total Size: 460.43 GB
Disk Device: /dev/disk2s1, Total Size: 5.00 GB
Disk Device: /dev/disk3s3, Total Size: 460.43 GB
Disk Device: /Applications/微信读书.app/Wrapper, Total Size: 460.43 GB
CUDA not available
Current directory: /Users/guchen/repo/LLM-complete/projects/llama3-from-scratch


## [Meta Llama3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) + [HellaSwag evaluation](https://paperswithcode.com/sota/sentence-completion-on-hellaswag)

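Llama3-8B was pretrained on 15 trillion tokens, and its fine-tuning data includes 10M human-annotated examples. Its knowledge cutoff is March 2023.

On Meta's Research SuperCluster, pretraining took 1.3M GPU hours (equivalent to a single H100 running for about 148 years); the corresponding figure for the 70B model is 6.4M GPU hours.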
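
The "148 years" figure is just a unit conversion, and it can be checked with a few lines of Python. This is a minimal sketch: the helper name `gpu_hours_to_years` and the `num_gpus` parameter are illustrative and do not come from the original article.

```python
HOURS_PER_YEAR = 24 * 365  # ignoring leap years


def gpu_hours_to_years(gpu_hours, num_gpus=1):
    """Wall-clock years needed to spend `gpu_hours` on `num_gpus` GPUs running in parallel."""
    return gpu_hours / (HOURS_PER_YEAR * num_gpus)


years_8b = gpu_hours_to_years(1.3e6)   # Llama3-8B: 1.3M GPU hours
years_70b = gpu_hours_to_years(6.4e6)  # Llama3-70B: 6.4M GPU hours

print(f"8B:  about {years_8b:.0f} years on a single GPU")
print(f"70B: about {years_70b:.0f} years on a single GPU")
```

Passing a larger `num_gpus` gives the wall-clock time on a cluster of that size, assuming perfect scaling.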

The official Llama3-8B should reach a HellaSwag accuracy of around 70.

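In [None]:
import os

# Point huggingface_hub at a mirror; set this before huggingface_hub is imported.
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

import huggingface_hub
from huggingface_hub import snapshot_download

# "HF_TOKEN" is a placeholder: replace it with your own Hugging Face access token
# (the Llama 3 repository is gated).
huggingface_hub.login("HF_TOKEN")

model_path = "meta-llama/Meta-Llama-3-8B"
model_cache_dir = "/mnt/workspace/models"  # stray trailing space removed from the path

# Download the full model repository into model_cache_dir
snapshot_download(
    repo_id=model_path,
    local_dir=model_cache_dir,
    # proxies={"https": "http://localhost:7890"},
    max_workers=8,
    local_dir_use_symlinks=False,
)

print("done")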

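In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# model_path is the Hub repo id; pass model_cache_dir instead to load the local snapshot.
# 8-bit loading requires the bitsandbytes package.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)

tokenizer = AutoTokenizer.from_pretrained(model_path)


def generate_response(prompt, model, tokenizer, max_length=100):
    inputs = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    # The original cell stops after the encode step; the lines below are an
    # assumed minimal completion, not the author's code.
    with torch.no_grad():
        output_ids = model.generate(inputs, max_length=max_length)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


print(model.generation_config)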



The attention layers use Grouped-Query Attention (GQA) to speed up inference.
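
One way to see why GQA helps at inference time is to compare the size of the key/value cache with and without it. The sketch below is back-of-the-envelope arithmetic, not the article's code. The shape constants (32 layers, 32 query heads, 8 key/value heads, head dimension 128) are taken from the published Llama-3-8B configuration rather than from this document, so treat them as assumptions. The helper names are hypothetical, and the head-to-group mapping shown is one common convention.

```python
# Assumed Llama-3-8B shapes (from the published config, not from this document)
NUM_LAYERS = 32
NUM_Q_HEADS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128


def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of K and V stored per generated token (2 bytes per value for fp16/bf16)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value  # 2x: keys + values


def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Consecutive query heads share one key/value head."""
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


# Standard multi-head attention would keep one K/V head per query head
mha_bytes = kv_cache_bytes_per_token(NUM_LAYERS, NUM_Q_HEADS, HEAD_DIM)
# GQA keeps only NUM_KV_HEADS key/value heads
gqa_bytes = kv_cache_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)

mapping = [kv_head_for_query_head(h, NUM_Q_HEADS, NUM_KV_HEADS) for h in range(NUM_Q_HEADS)]

print(f"MHA: {mha_bytes // 1024} KiB per token, GQA: {gqa_bytes // 1024} KiB per token")
print(f"query head -> kv head: {mapping}")
```

Under these assumptions the cache shrinks by the ratio of query heads to key/value heads, so less data has to be read from memory at each decoding step.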


Positional encoding uses RoPE (rotary position embedding).
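
To make the idea concrete, here is a minimal pure-Python sketch of RoPE on a single vector. It is not the article's code. It rotates consecutive pairs of dimensions, which is the interleaved layout; some implementations pair dimension `i` with `i + d/2` instead. The default `base=500000.0` matches the `rope_theta` in the published Llama-3-8B configuration and is an assumption here, since the document does not state it. The function names are hypothetical.

```python
import math


def apply_rope(x, position, base=500000.0):
    """Rotate each pair (x[2i], x[2i+1]) by the angle position * base**(-2i/d)."""
    d = len(x)
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(d // 2):
        theta = position * base ** (-2 * i / d)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out.extend([a * cos_t - b * sin_t, a * sin_t + b * cos_t])
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same relative offset (3) at two different absolute positions
score_a = dot(apply_rope(q, 5), apply_rope(k, 2))
score_b = dot(apply_rope(q, 105), apply_rope(k, 102))
print(math.isclose(score_a, score_b, abs_tol=1e-9))
```

The check at the end illustrates the property that makes RoPE useful: the query-key dot product depends only on the relative offset between the two positions, not on their absolute values.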