feat: tinyboxgreen #4366

Merged · 12 commits · May 21, 2024
84 changes: 71 additions & 13 deletions .github/workflows/benchmark.yml
@@ -91,7 +91,7 @@ jobs:
train_cifar_wino.txt

testnvidiabenchmark:
name: NVIDIA Benchmark
name: tinybox green Benchmark
runs-on: [self-hosted, Linux, tinyboxgreen]
defaults:
run:
@@ -108,6 +108,10 @@ jobs:
run: |
mkdir -p weights
ln -s ~/tinygrad/weights/LLaMA weights/LLaMA
ln -s /raid/weights/mixtral-8x7b-32kseqlen weights/mixtral-8x7b-32kseqlen
ln -s /raid/weights/LLaMA-2 weights/LLaMA-2
mkdir -p extra/datasets
ln -s /raid/datasets/imagenet extra/datasets/imagenet
- name: Run model inference benchmark
run: CUDA=1 NOCLANG=1 python3 test/external/external_model_benchmark.py
- name: Test speed vs torch
@@ -130,12 +134,22 @@ jobs:
run: CUDA=1 M_START=12 M_STOP=20 M_STEP=1 N_START=6 N_STOP=10 N_STEP=1 K_START=28 K_STOP=36 K_STEP=1 HALF=1 TC_OPT=2 python3 ./extra/gemm/fuzz_matmul.py
- name: Fuzz Padded Tensor Core GEMM(PTX)
run: CUDA=1 PTX=1 M_START=12 M_STOP=20 M_STEP=1 N_START=6 N_STOP=10 N_STEP=1 K_START=28 K_STOP=36 K_STEP=1 HALF=1 TC_OPT=2 python3 ./extra/gemm/fuzz_matmul.py
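The M/N/K variables define the shape sweep for the fuzzer: the script walks each range and checks tinygrad's GEMM against a reference. The sketch below is a hedged illustration of that loop, not the actual extra/gemm/fuzz_matmul.py; the env helper and the tolerance are assumptions.

```python
# Illustrative shape-sweep fuzzer; not the real extra/gemm/fuzz_matmul.py.
import os
import numpy as np
from tinygrad import Tensor

def env(name: str, default: int) -> int:
    return int(os.getenv(name, default))

for m in range(env("M_START", 12), env("M_STOP", 20), env("M_STEP", 1)):
    for n in range(env("N_START", 6), env("N_STOP", 10), env("N_STEP", 1)):
        for k in range(env("K_START", 28), env("K_STOP", 36), env("K_STEP", 1)):
            a, b = Tensor.rand(m, k), Tensor.rand(k, n)
            out = (a @ b).numpy()        # matmul on the backend under test
            ref = a.numpy() @ b.numpy()  # NumPy reference on the same values
            assert np.allclose(out, ref, atol=1e-3), f"mismatch at {m=} {n=} {k=}"
```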
- name: Run Stable Diffusion
run: CUDA=1 python3 examples/stable_diffusion.py --seed 0 --noshow --timing | tee sd.txt
- name: Run LLaMA
run: |
CUDA=1 JIT=0 python3 examples/llama.py --gen 1 --prompt "Hello." --count 10 --temperature 0 --timing | tee llama_unjitted.txt
CUDA=1 JIT=1 python3 examples/llama.py --gen 1 --prompt "Hello." --count 10 --temperature 0 --timing | tee llama_jitted.txt
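The unjitted/jitted pair isolates the cost of kernel dispatch: with JIT=1, tinygrad's TinyJit records the kernels a function launches on its early calls and replays them with new inputs afterwards. A minimal sketch of the pattern; the model body here is a stand-in, not llama.py's transformer.

```python
# Minimal TinyJit usage sketch; the real examples/llama.py wraps its decode step similarly.
from tinygrad import Tensor, TinyJit

@TinyJit
def step(x: Tensor) -> Tensor:
    return (x @ x.T).relu().realize()  # stand-in for one model step

for _ in range(10):
    out = step(Tensor.rand(4, 4))      # early calls capture kernels, later calls replay
```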
- name: Run LLaMA with BEAM
run: CUDA=1 JIT=1 BEAM=2 CACHELEVEL=0 python3 examples/llama.py --gen 1 --prompt "Hello." --count 10 --temperature 0 --timing | tee llama_beam.txt
- name: Run LLaMA 7B on 4 GPUs
run: CUDA=1 python3 examples/llama.py --gen 1 --size 7B --shard 4 --prompt "Hello." --count 10 --temperature 0 --timing | tee llama_four_gpu.txt
- name: Run LLaMA 7B on 6 GPUs
run: CUDA=1 python3 examples/llama.py --gen 1 --size 7B --shard 6 --prompt "Hello." --count 10 --temperature 0 --timing | tee llama_six_gpu.txt
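--shard N splits the 7B weights across N of the box's GPUs instead of replicating them. In tinygrad this is Tensor.shard, which lays a tensor's slices out over several devices along a chosen axis (axis=None replicates). A rough sketch under those assumptions, not the actual llama.py sharding code:

```python
# Rough weight-sharding sketch; illustrative only.
from tinygrad import Tensor

GPUS = tuple(f"CUDA:{i}" for i in range(4))

w = Tensor.rand(1024, 1024).shard(GPUS, axis=0)  # weight rows split across 4 GPUs
x = Tensor.rand(8, 1024).shard(GPUS, axis=None)  # activations replicated everywhere
y = (x @ w.T).realize()                          # each device computes its slice
```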
# - name: Run LLaMA-2 70B
# run: CUDA=1 python3 examples/llama.py --gen 2 --size 70B --shard 6 --prompt "Hello." --count 10 --temperature 0 --timing | tee llama_2_70B.txt
# - name: Run Mixtral 8x7B
# run: time CUDA=1 python3 examples/mixtral.py --temperature 0 --count 10 --timing | tee mixtral.txt
- name: Run GPT2
run: |
CUDA=1 JIT=0 python3 examples/gpt2.py --prompt "Hello." --count 10 --temperature 0 --timing | tee gpt2_unjitted.txt
@@ -144,16 +158,6 @@ jobs:
run: CUDA=1 JIT=1 HALF=1 python3 examples/gpt2.py --count 10 --temperature 0 --timing | tee gpt2_half.txt
- name: Run GPT2 w HALF/BEAM
run: CUDA=1 JIT=1 HALF=1 BEAM=2 CACHELEVEL=0 CAST_BEFORE_VIEW=0 JIT_BATCH_SIZE=4 python3 examples/gpt2.py --count 10 --temperature 0 --timing | tee gpt2_half_beam.txt
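These knobs (JIT, HALF, BEAM, CACHELEVEL, CAST_BEFORE_VIEW, JIT_BATCH_SIZE) are ordinary environment variables read through tinygrad's getenv helper; BEAM=2 asks for a beam search (width 2) over kernel optimizations, and CACHELEVEL=0 keeps results out of the on-disk cache so the search runs cold. A small sketch of the consumption pattern; the defaults shown are illustrative.

```python
# How env-var flags of this style are read; defaults here are illustrative.
from tinygrad.helpers import getenv

BEAM = getenv("BEAM", 0)  # >0 turns on beam search over kernel optimizations
HALF = getenv("HALF", 0)  # 1 selects float16 weights/activations
JIT  = getenv("JIT", 1)   # toggles TinyJit capture/replay
if BEAM: print(f"beam search enabled, width {BEAM}")
```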
- name: Train MNIST
run: time PYTHONPATH=. CUDA=1 TARGET_EVAL_ACC_PCT=97.3 python3 examples/beautiful_mnist.py | tee beautiful_mnist.txt
- name: Run 10 CIFAR training steps
run: CUDA=1 STEPS=10 python3 examples/hlb_cifar10.py | tee train_cifar.txt
- name: Run 10 CIFAR training steps w HALF
run: CUDA=1 STEPS=10 DEFAULT_FLOAT=HALF python3 examples/hlb_cifar10.py | tee train_cifar_half.txt
- name: Run 10 CIFAR training steps w BF16
run: CUDA=1 STEPS=10 DEFAULT_FLOAT=BFLOAT16 python3 examples/hlb_cifar10.py | tee train_cifar_bf16.txt
- name: Run full CIFAR training
run: time CUDA=1 DEFAULT_FLOAT=HALF LATEWINO=1 STEPS=1000 TARGET_EVAL_ACC_PCT=93.3 python3 examples/hlb_cifar10.py | tee train_cifar_one_gpu.txt
- uses: actions/upload-artifact@v4
with:
name: Speed (NVIDIA)
@@ -164,21 +168,75 @@ jobs:
matmul_bfloat16.txt
matmul_ptx.txt
matmul_nv.txt
sd.txt
llama_unjitted.txt
llama_jitted.txt
llama_beam.txt
llama_four_gpu.txt
llama_six_gpu.txt
# llama_2_70B.txt
# mixtral.txt
gpt2_unjitted.txt
gpt2_jitted.txt
gpt2_half.txt
gpt2_half_beam.txt

testmorenvidiabenchmark:
name: tinybox green Training Benchmark
runs-on: [self-hosted, Linux, tinyboxgreen]
defaults:
run:
shell: bash -o pipefail {0}
if: github.repository_owner == 'tinygrad'
env:
PYTHONPATH: .
steps:
- name: Checkout Code
uses: actions/checkout@v4
- name: Symlink models and datasets
run: |
mkdir -p weights
ln -s ~/tinygrad/weights/bpe_simple_vocab_16e6.txt.gz weights/bpe_simple_vocab_16e6.txt.gz
ln -s ~/tinygrad/weights/LLaMA weights/LLaMA
ln -s ~/tinygrad/extra/datasets/cifar-10-python.tar.gz extra/datasets/cifar-10-python.tar.gz
ln -s /raid/weights/mixtral-8x7b-32kseqlen weights/mixtral-8x7b-32kseqlen
ln -s /raid/weights/LLaMA-2 weights/LLaMA-2
mkdir -p extra/datasets
ln -s /raid/datasets/imagenet extra/datasets/imagenet
- name: Train MNIST
run: time PYTHONPATH=. CUDA=1 TARGET_EVAL_ACC_PCT=97.3 python3 examples/beautiful_mnist.py | tee beautiful_mnist.txt
- name: Run 10 CIFAR training steps
run: CUDA=1 STEPS=10 python3 examples/hlb_cifar10.py | tee train_cifar.txt
- name: Run 10 CIFAR training steps w HALF
run: CUDA=1 STEPS=10 DEFAULT_FLOAT=HALF python3 examples/hlb_cifar10.py | tee train_cifar_half.txt
- name: Run 10 CIFAR training steps w BF16
run: CUDA=1 STEPS=10 DEFAULT_FLOAT=BFLOAT16 python3 examples/hlb_cifar10.py | tee train_cifar_bf16.txt
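DEFAULT_FLOAT picks the dtype that newly created float tensors default to, so the HALF and BFLOAT16 runs exercise the fp16 and bf16 code paths end to end. A tiny sketch of the effect, assuming direct assignment to dtypes.default_float mirrors what the env var does at startup:

```python
# Illustrative: what DEFAULT_FLOAT=BFLOAT16 amounts to inside the framework.
from tinygrad import Tensor, dtypes

dtypes.default_float = dtypes.bfloat16  # the DEFAULT_FLOAT env var selects this at startup
print(Tensor.rand(2, 2).dtype)          # dtypes.bfloat16
```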
- name: Run full CIFAR training w 1 GPU
run: time CUDA=1 DEFAULT_FLOAT=HALF LATEWINO=1 STEPS=1000 TARGET_EVAL_ACC_PCT=93.3 python3 examples/hlb_cifar10.py | tee train_cifar_one_gpu.txt
- name: Run full CIFAR training steps w 6 GPUS
run: time CUDA=1 DEFAULT_FLOAT=HALF STEPS=350 BS=1536 GPUS=6 TARGET_EVAL_ACC_PCT=93.3 python3 examples/hlb_cifar10.py | tee train_cifar_six_gpu.txt
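The six-GPU CIFAR run is data parallelism: unlike the LLaMA weight sharding above, here the batch axis is split, so BS=1536 gives each of the 6 GPUs 256 images per step. A toy illustration of that split, not hlb_cifar10.py itself:

```python
# Toy data-parallel batch split; each device holds BS/len(GPUS) samples.
from tinygrad import Tensor

GPUS = tuple(f"CUDA:{i}" for i in range(6))
batch = Tensor.rand(1536, 3, 32, 32).shard(GPUS, axis=0)  # 256 CIFAR images per GPU
print(batch.shape)  # (1536, 3, 32, 32), stored as six 256-image shards
```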
- name: Run MLPerf resnet eval on training data
run: time CUDA=1 MODEL=resnet python3 examples/mlperf/model_eval.py
- name: Run 10 MLPerf ResNet50 training steps (1 gpu)
run: CUDA=1 DEFAULT_FLOAT=HALF BENCHMARK=10 BS=256 GPUS=1 MODEL=resnet python3 examples/mlperf/model_train.py | tee train_resnet_one_gpu.txt
- name: Run 10 MLPerf ResNet50 training steps (6 gpu)
run: CUDA=1 DEFAULT_FLOAT=HALF BENCHMARK=10 BS=1536 GPUS=6 MODEL=resnet python3 examples/mlperf/model_train.py | tee train_resnet.txt
- uses: actions/upload-artifact@v4
with:
name: Speed (NVIDIA Training)
path: |
beautiful_mnist.txt
train_cifar.txt
train_cifar_half.txt
train_cifar_bf16.txt
train_cifar_wino.txt
train_cifar_one_gpu.txt
train_resnet.txt
train_resnet_one_gpu.txt
train_cifar_six_gpu.txt

testamdbenchmark:
name: tinybox Benchmark
name: tinybox red Benchmark
runs-on: [self-hosted, Linux, tinybox]
defaults:
run:
@@ -260,7 +318,7 @@ jobs:
mixtral.txt

testmoreamdbenchmark:
name: tinybox Training
name: tinybox red Training Benchmark
runs-on: [self-hosted, Linux, tinybox]
defaults:
run: